
Lecture Notes in Electrical Engineering 292

Jacob Scharcanski
Hugo Proença
Eliza Du
Editors

Signal and
Image
Processing for
Biometrics
Lecture Notes in Electrical Engineering

Volume 292

Board of Series Editors


Leopoldo Angrisani, Napoli, Italy
Marco Arteaga, Coyoacán, México
Samarjit Chakraborty, München, Germany
Jiming Chen, Hangzhou, P.R. China
Tan Kay Chen, Singapore, Singapore
Rüdiger Dillmann, Karlsruhe, Germany
Gianluigi Ferrari, Parma, Italy
Manuel Ferre, Madrid, Spain
Sandra Hirche, München, Germany
Faryar Jabbari, Irvine, USA
Janusz Kacprzyk, Warsaw, Poland
Alaa Khamis, New Cairo City, Egypt
Torsten Kroeger, Stanford, USA
Tan Cher Ming, Singapore, Singapore
Wolfgang Minker, Ulm, Germany
Pradeep Misra, Dayton, USA
Sebastian Möller, Berlin, Germany
Subhas Mukhopadyay, Palmerston, New Zealand
Cun-Zheng Ning, Tempe, USA
Toyoaki Nishida, Sakyo-ku, Japan
Federica Pascucci, Roma, Italy
Tariq Samad, Minneapolis, USA
Gan Woon Seng, Nanyang Avenue, Singapore
Germano Veiga, Porto, Portugal
Junjie James Zhang, Charlotte, USA

For further volumes:


http://www.springer.com/series/7818
About this Series

"Lecture Notes in Electrical Engineering (LNEE)" is a book series which reports


the latest research and developments in Electrical Engineering, namely:
• Communication, Networks, and Information Theory
• Computer Engineering
• Signal, Image, Speech and Information Processing
• Circuits and Systems
• Bioengineering
LNEE publishes authored monographs and contributed volumes which present
cutting edge research information as well as new perspectives on classical fields,
while maintaining Springer’s high standards of academic excellence. Also con-
sidered for publication are lecture materials, proceedings, and other related
materials of exceptionally high quality and interest. The subject matter should be
original and timely, reporting the latest research and developments in all areas of
electrical engineering.
The audience for the books in LNEE consists of advanced level students,
researchers, and industry professionals working at the forefront of their fields.
Much like Springer’s other Lecture Notes series, LNEE will be distributed through
Springer’s print and electronic publishing channels.
Jacob Scharcanski · Hugo Proença · Eliza Du
Editors

Signal and Image Processing for Biometrics

Editors

Jacob Scharcanski
Federal University of Rio Grande do Sul
Porto Alegre, Brazil

Hugo Proença
IT-Instituto de Telecomunicações
University of Beira Interior
Covilhã, Portugal

Eliza Du
Indiana University-Purdue University Indianapolis
Indianapolis, IN, USA

ISSN 1876-1100 ISSN 1876-1119 (electronic)


ISBN 978-3-642-54079-0 ISBN 978-3-642-54080-6 (eBook)
DOI 10.1007/978-3-642-54080-6
Springer Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014932533

© Springer-Verlag Berlin Heidelberg 2014


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of
the Copyright Law of the Publisher’s location, in its current version, and permission for use must
always be obtained from Springer. Permissions for use may be obtained through RightsLink at the
Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

The development of machines able to autonomously and covertly perform reliable


recognition of human beings is a long-term goal in the biometrics field. Some of
the biological traits used to perform biometric recognition support contactless data
acquisition and can be acquired covertly. Thus, at least theoretically, the subsequent
biometric recognition procedure can be performed without the subjects being aware
of it, and can take place in uncontrolled scenarios (e.g., banks, airports, public
areas). This real-world scenario brings many
challenges to the pattern recognition process, essentially because the quality of the
acquired data may vary depending on the acquisition conditions. The feasibility of
this type of recognition has received increasing attention and is of particular
interest in visual surveillance, computer forensics, threat assessment, and other
security areas. There is a growing interest in the development of biometric
recognition systems that operate in unconstrained conditions, and important advances
have recently been achieved in this field, in the various aspects covered in
this book: the recognition of ears, eyes, faces, gait, fingerprints, handwritten
signatures, and altered appearances, as well as pattern detection and recognition
issues arising in biometrics applications.
The goal of this volume is to summarize the state-of-the-art in signal and image
processing techniques for newcomers to the field of biometrics as well as for more
experienced practitioners, and to provide future directions for this exciting area.
It offers a guide to the state-of-the-art, and the mainstream research work in this
field is presented in an organized manner, so the reader can easily follow the trends
that best suit her/his interests in this growing field.
The volume opens with Chap. 1 by Verdoolaege et al. The authors present
concepts of manifold learning and information geometry, and discuss how manifold
geometry can be exploited to obtain biometric data representations. The authors
also compare some of the representative manifold-based methods applied to face
recognition, and conclude that manifold learning is a promising research area in
noncooperative facial recognition, especially because it can accommodate different
head poses and facial expressions in a unified scheme.
In Chap. 2, Patel et al. define the remote face recognition problem and discuss
some of the key issues in this area. They investigate features that are robust to face
variations, which are useful for developing statistical models that account for these


variations in remote face recognition, and conclude by pointing out some open
problems in remote face recognition.
In Chap. 3, Proença et al. introduce facial sketch recognition as a link between a
drawn representation of a human face and an identity, based on information given
by an eyewitness, and compare computer-aided facial sketch recognition with
humans performing the same task. Their analysis makes it possible to conclude which
sets of facial features are more appropriate given the reliability of the facial sketch.
In Chap. 4, Singh describes a mutual information-based age transformation
algorithm that performs a multiscale registration of gallery and probe face images by
minimizing the variations in facial features caused by aging. The performance of
the proposed algorithm is compared with other known algorithms for recognizing
altered facial appearances, and the experiments show that the proposed algorithm
outperforms the existing algorithms.
In Chap. 5, Zheng reviews three face recognition algorithms (matchers), and
presents and compares the fusion performance of seven fusion methods: linear
discriminant analysis (LDA), k-nearest neighbor (KNN), artificial neural network
(ANN), support vector machine (SVM), binomial logistic regression (BLR),
Gaussian mixture model (GMM), and hidden Markov model (HMM). The author
obtains promising results on a set of 105 subjects, and concludes that the per-
formance of an integrated system equipped with multispectral imaging devices is
much higher than that of any single matcher or any single modality, enabling night
vision capability (i.e., recognizing a face in the dark or at nighttime).
The volume continues with six chapters on the recognition of ears, eyes, gait,
fingerprint, handwritten signatures, and pattern detection and recognition
issues arising in biometrics applications. In Chap. 6, Barra et al. describe the
state-of-the-art in unconstrained ear processing; owing to the ear's strong
three-dimensional characterization, its correct detection and recognition can be
significantly affected by illumination, noise introduced by hair, pose, and occlusion.
They discuss and compare the representative 2D and 3D approaches for ear
processing, and point out open problems for future research on ear processing in
uncontrolled settings.
In Chap. 7, Zhou et al. argue that the blood vessel structure of the sclera is
unique to each person and can be acquired nonintrusively in the visible
wavelengths. Also, sclera recognition has been shown to achieve reasonable
recognition accuracy under visible wavelengths, whereas iris recognition has only
proven accurate in near-infrared images. Hence, by combining iris and sclera
recognition, they achieve better recognition accuracy than when using each of the
traits in isolation. The authors introduce a feature quality-based unconstrained
eye recognition system that combines the respective strengths of iris recognition
and sclera recognition for human identification, and can work with frontal and
off-angle eye images.
In Chap. 8, Makihara et al. handle the problem of varying gait speed in terms of
recognition effectiveness. They describe a method of gait silhouette transformation
that handles walking speed changes in gait identification. When a person
changes his walking speed, the dynamic features (e.g., stride and joint angle) also

vary, while static features (e.g., thigh and shin lengths) remain relatively constant.
Based on this observation, the authors propose to separate static and dynamic features
from gait silhouettes by fitting a human model. Then, a supervised factorization-based
speed transformation model for the dynamic features is described.
Finally, silhouettes are restored by combining the unchanged static features and
the transformed dynamic features. The evaluation carried out shows the effectiveness
of the algorithm described, making it particularly suitable for use in
unconstrained acquisition scenarios.
In Chap. 9, Vatsa proposes a quality assessment algorithm based on the
Redundant Discrete Wavelet Transform that extracts edge, noise, and smoothness
information in local regions and encodes it into a quality vector. The feature
extraction algorithm first registers the gallery and probe fingerprint images using a
two-stage registration process. Then, a fast Mumford–Shah curve evolution
algorithm extracts level-3 features: pores, ridge contours, dots, and incipient
ridges. Gallery and probe features are matched using a Mahalanobis distance
measure and a quality-based likelihood ratio approach. Further, a quality-induced
sum-rule fusion algorithm is used to combine the match scores obtained from level-2
and level-3 features. In Chap. 10, Houmani and Garcia-Salicetti address the problem of
quality of online signature samples. The authors note that, while signature complexity
and signature stability have been identified in the literature as the main quality criteria
for this biometric modality, the main drawback is that such criteria are measured
separately. In this chapter, they analyze such criteria with a unifying view, in terms
of entropy-based measures. Based on experiments carried out on several
well-known datasets, the authors conclude that the degradation of signatures due to
mobile acquisition conditions can be accounted for using the entropy-based
measures.
In Chap. 11, Grabowski and Sankowski discuss the main problems of human
tracking in noncooperative scenarios. The authors pay special attention to the tracking
capabilities of the acquisition hardware and to image segmentation techniques.
As a general conclusion, the authors argue that off-the-shelf Pan–Tilt–Zoom
cameras can be a viable solution for biometric systems requiring human tracking
in noncooperative scenarios, even for demanding traits such as small structures
like the human iris.
As editors, we hope that this volume, focused on signal and image processing as
well as on pattern recognition techniques applied to biometrics, will demonstrate
the significant progress that has occurred in this field in recent years. We also hope
that the developments reported in this volume will motivate further research in this
exciting field.

Jacob Scharcanski
Hugo Proença
Eliza Du
Contents

1 Data and Information Dimensionality in Non-cooperative


Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Geert Verdoolaege, John Soldera, Thiarlei Macedo
and Jacob Scharcanski

2 Remote Identification of Faces . . . . . . . . . . . . . . . . . . . . . . . . . . 37


Vishal M. Patel, Jie Ni and Rama Chellappa

3 Biometric Identification from Facial Sketches of Poor Reliability:


Comparison of Human and Machine Performance . . . . . . . . . . . 57
H. Proença, J. C. Neves, J. Sequeiros, N. Carapito and N. C. Garcia

4 Recognizing Altered Facial Appearances


Due to Aging and Disguise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Richa Singh

5 Using Score Fusion for Improving the Performance


of Multispectral Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . 107
Yufeng Zheng

6 Unconstrained Ear Processing: What is Possible


and What Must Be Done . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Silvio Barra, Maria De Marsico, Michele Nappi and Daniel Riccio

7 Feature Quality-Based Unconstrained Eye Recognition . . . . . . . . 191


Zhi Zhou, Eliza Yingzi Du and N. Luke Thomas

8 Speed-Invariant Gait Recognition . . . . . . . . . . . . . . . . . . . . . . . . 209


Yasushi Makihara, Akira Tsuji and Yasushi Yagi


9 Quality Induced Multiclassifier Fingerprint Verification


Using Extended Feature Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Mayank Vatsa

10 Quality Measures for Online Handwritten Signatures . . . . . . . . . 255


Nesma Houmani and Sonia Garcia-Salicetti

11 Human Tracking in Non-cooperative Scenarios . . . . . . . . . . . . . . 285


Kamil Grabowski and Wojciech Sankowski
Chapter 1
Data and Information Dimensionality
in Non-cooperative Face Recognition

Geert Verdoolaege, John Soldera, Thiarlei Macedo and Jacob Scharcanski

Abstract Data, information dimensionality and manifold learning techniques are


related issues that are gaining prominence in biometrics. Problems dealing with
large amounts of data often have dimensionality issues, leading to uncertainty and
inefficiency. This chapter presents concepts of manifold learning and information
geometry, and discusses how the manifold geometry can be exploited to obtain bio-
metric data representations in lower dimensions. It is also explained how biometric
data that are modeled with suitable probability distributions, can be classified accu-
rately using geodesic distances on probabilistic manifolds, or approximations when
the analytic geodesic distance solutions are not known. Also, we discuss some of
the representative manifold based methods applied to face recognition, and point out
future research directions.

G. Verdoolaege
Ghent University, Ghent, Belgium
e-mail: geert.verdoolaege@ugent.be
G. Verdoolaege
Laboratory for Plasma Physics, Royal Military Academy, Brussels, Belgium
J. Soldera
Instituto de Informática, UFRGS, Porto Alegre, Brazil
e-mail: jsoldera@inf.ufrgs.br
T. Macedo
Guardian Tecnologia da Informação, Caxias do Sul, Brazil
e-mail: thiarlei@gmail.com
J. Scharcanski (B)
Instituto de Informática and Programa de Pós-Graduação em Engenharia Elétrica,
UFRGS, Porto Alegre, Brazil
e-mail: jacobs@inf.ufrgs.br


1.1 Introduction

While some biometric modalities, such as fingerprint and iris recognition, have
already reached significant accuracy levels, still their applicability is limited in
non-cooperative scenarios. On the other hand, there is a need for more research in
non-cooperative scenarios, like face recognition using face information, where higher
levels of recognition accuracy are desirable, possibly through better feature extrac-
tion and classification methods. Also, to improve the accuracy of face recognition in
non-cooperative scenarios, some practical issues must be handled, like changes in
illumination, pose and expression variations, occlusions, aging and changes in body
mass index.
This chapter addresses some data and information dimensionality issues in face
recognition. In fact, we may approach face recognition as a 2D problem, but some
challenges still are encountered in 2D face recognition, such as the required invari-
ance to the lighting conditions and to the head pose. Hence, this chapter is organized in
such a way that we discuss the 2D face recognition challenges, and ways of handling
the dimensionality issues in face recognition using statistical and shape manifolds.
We discuss some representative techniques proposed for 2D face recognition in
the literature. Some of them use skin color models or geometric models in order
to determine specific landmarks on the face, or to extract collections of pixels
from the skin, nose, eyes and other parts. Those features are used by a classifier in
order to recognize people. On the other hand, there are techniques that operate on
all the pixels of the face image, implying a high data dimensionality. Therefore, ways
of handling such data representations can be useful, such as principal component
analysis (PCA) or manifolds which represent a face space that can contain several
poses and expressions. Some other 2D techniques use facial features, such as lip
movement during a conversation, and recent studies have reached high recognition
levels because each person has a particular way of speaking. Moreover, other
information provided by such features can be fused to improve the recognition rates,
such as acoustic information and eye motion.
Manifold methods are an important trend in pattern recognition methods, espe-
cially in face recognition, and we devote a substantial portion of this chapter to
describing such manifold approaches and their principles.

1.2 Manifold Methods

Manifold methods are an important trend in pattern recognition, especially in face


recognition. The concept of a manifold can occur in different ways in face recognition
studies. In one sense, face data can be represented in a multidimensional Euclidean
feature space, with multiple face data points forming a linear or nonlinear manifold
embedded in that space. The presence of a manifold is a sign of data redundancy and
one objective can be to project the manifold (and the face data) in a lower-dimensional

Euclidean space in order to reduce the dimensionality. Several techniques have been
proposed for nonlinear dimensionality reduction, and we will discuss some in the
present chapter.
In another sense, given a parametric distribution modeling certain characteristics
in facial images (e.g. texture), one may consider the manifold associated to this distri-
bution. Probabilistic or statistical manifolds are the central topic of interest in infor-
mation geometry, where the parameters of the distribution correspond to the local
coordinates on the manifold and the Fisher information acts as the unique Riemannian
metric tensor. Given a metric, the powerful methods of differential geometry can be
invoked for calculating distances between distributions. This is useful for discovering
patterns of distributions on the manifold and for dimensionality reduction.
Finally, a manifold may represent a shape space, with every point on the manifold
corresponding to a specific shape. Many different shape spaces have been considered,
based on landmark points, surface normals, medial axes, plane curves, etc. Again,
a Riemannian metric can be defined on the manifold, enabling the calculation of
distances between shapes. This, in turn, is the basis for inference on shape mani-
folds, e.g. for studying the variability or spread of a set of shapes on the manifold.
Dimensionality reduction of data on shape manifolds is also of considerable interest,
requiring specialized methods.
In this section we give a short overview of some dimensionality reduction tech-
niques operating in Euclidean feature spaces. We then go into some more detail on
the geometry of probabilistic manifolds and illustrate some differential geometry
concepts with an application to image texture discrimination. Finally, we briefly
touch upon shape manifolds. We start however by introducing a few concepts from
differential geometry that will be useful for studying manifolds.

1.2.1 Differentiable Manifolds

We consider the concept of a differentiable manifold, which basically is a topological


space that locally resembles a Euclidean space. An n-dimensional differentiable
manifold M can be charted by a set of patches x, which are differentiable mappings
from open subsets U of Rn onto (subsets of) M, with a differentiable inverse x−1 .
Once a patch has been chosen, one can define the coordinates x i (p) (i = 1, . . . , n)
of a point p ∈ M as the components in Rn of the vector x−1 (p). The notation x i for
the coordinates with the index i in superscript is common in differential geometry to
indicate the behavior of the coordinates (“contravariance”) under a transformation
to a different coordinate system. We shall adopt this notation here as well, in order
to maintain consistency with the differential geometry literature.
Manifolds are “curved” spaces and as a geometric object they are much more
complicated to study compared to vector spaces. To aid in this study, the concept of
a tangent space of a manifold M at a point p ∈ M is introduced as the best linear
approximation of M at p. It is important to note that manifolds are abstract geometric
objects that can be studied irrespective of their embedding in a (higher-dimensional)

Euclidean space. Thus, one can study the intrinsic geometry of a manifold. Provided
the manifold can be embedded in a Euclidean space, which is, indeed, not always
possible, the manifold can also be characterized through its extrinsic geometry.
In order to give the definition for a tangent space, we need the concept of a tangent
vector. Suppose f is a mapping from an open subset W of M to R, then f is called
differentiable at a point p ∈ W, provided that for a certain patch x covering W the
composition f ◦ x is differentiable at x−1 (p). A tangent vector v p at the point p ∈ M
is a real-valued function from the space F(M) of differentiable mappings on M,
with a linearity and a Leibnizian property:

\[
v_p(af + bg) = a\,v_p(f) + b\,v_p(g), \tag{1.1}
\]


\[
v_p(fg) = f(p)\,v_p(g) + g(p)\,v_p(f), \tag{1.2}
\]

for all a, b ∈ R and f , g ∈ F(M). The tangent space Mp to M at p is the set of all
tangent vectors to M at p and so it is a vector space. We can calculate the derivatives
of f at p w.r.t. the coordinates x i of p in Rn and it turns out that the set of functions

\[
\left.\frac{\partial}{\partial x^i}\right|_p : F(M) \longrightarrow \mathbb{R} \tag{1.3}
\]

defined by

\[
\left.\frac{\partial}{\partial x^i}\right|_p (f) = \left.\frac{\partial f}{\partial x^i}\right|_p \tag{1.4}
\]

provides a basis for Mp . They represent the change that f (p) undergoes as one
moves on the manifold in one of the coordinate directions. It can also be shown that,
for manifolds embedded in R3 , this definition corresponds to the usual notion of a
tangent vector to a two-dimensional surface, as the initial velocity of a curve on that
surface. For background and more details, we refer the reader to texts on fundamental
differential geometry [1–3].
In order to measure lengths on a manifold, the concept of a Riemannian metric
is introduced on a differentiable manifold. A Riemannian metric g is a symmetric
bilinear form operating on the tangent space at a point p ∈ M and resulting in a real
number. As such, g is often denoted as an inner product between two tangent vectors
v p and wp at p:
\[
g(v_p, w_p) = \langle v_p, w_p \rangle. \tag{1.5}
\]

Technically, the metric is a covariant tensor of degree two, where the positive-
definiteness implies that g(v p , v p ) > 0 for all nonzero v p ∈ Mp . Given a coordinate
system x i on the manifold, one defines the components gij of the metric w.r.t. that
basis as follows:
\[
g_{ij} = \left\langle \left.\frac{\partial}{\partial x^i}\right|_p , \left.\frac{\partial}{\partial x^j}\right|_p \right\rangle, \qquad i, j = 1, \ldots, n. \tag{1.6}
\]

The gij form an n × n symmetric matrix at every point p, and the dependence on
p can be denoted explicitly as gij (p).
A curve on a manifold can be defined as a mapping from an open interval I on the
real line to the manifold: α(t) : I −→ M. The metric is then related to calculating
lengths of curves on a manifold, or distances. Formally, the arc length L of a curve
α(t) for a ≤ t ≤ b is defined as
\[
s_{ab} = \int_a^b \sqrt{\,\sum_{i,j=1}^{n} g_{ij}[\alpha(t)]\, \frac{d}{dt}\bigl(x^i \circ \alpha\bigr)(t)\, \frac{d}{dt}\bigl(x^j \circ \alpha\bigr)(t)\,}\; dt. \tag{1.7}
\]

In three-dimensional Euclidean space the metric components, w.r.t. a Cartesian


coordinate system, reduce to the Kronecker delta function δij . In that case, sab simply
becomes
\[
s_{ab} = \int_a^b \sqrt{\,\sum_{i=1}^{n} \left(\frac{d\alpha^i}{dt}(t)\right)^{2}}\, dt = \int_a^b \|\alpha'(t)\|\, dt, \tag{1.8}
\]

where the αi(t) are the coordinate functions of the curve in R3 and α′(t) is called
its speed. One can then also consider the arc length function sc (t), returning the arc
length of the curve from a specific position given by c until t > c:
\[
s_c(t) = \int_c^t \|\alpha'(u)\|\, du. \tag{1.9}
\]

Furthermore, provided the speed of the curve vanishes nowhere (i.e., the curve
is regular), there exists a unit-speed parameterization β of α. It can be found by
considering the derivative of the arc length function:

\[
s_c'(t) = \|\alpha'(t)\| \tag{1.10}
\]

and its inverse function t(s). Defining β(s) = α(t(s)), it is easy to prove that β has
unit speed. β is said to be parameterized by arc length.
The metric components can also be written in relation to the quadratic line element
(differential arc length) ds²:
\[
ds^2 = \sum_{i,j=1}^{n} g_{ij}\, dx^i\, dx^j. \tag{1.11}
\]
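As a numerical illustration of the arc length definition (1.7) and the line element (1.11), the following sketch (Python with NumPy; the metric field and the example curve are arbitrary choices for illustration, not taken from the chapter) approximates the length of a parameterized curve by discretizing the integral.

import numpy as np

def curve_length(alpha, metric, a, b, n_steps=2000):
    """Approximate the arc length (1.7) of a curve t -> alpha(t) in local
    coordinates, for a metric field p -> g(p) (an n x n matrix)."""
    t = np.linspace(a, b, n_steps)
    pts = np.array([alpha(ti) for ti in t])          # (n_steps, n) coordinates
    vel = np.gradient(pts, t, axis=0)                # finite-difference velocities
    speeds = np.array([np.sqrt(v @ metric(p) @ v) for p, v in zip(pts, vel)])
    return np.trapz(speeds, t)

# Example: the hyperbolic (Poincare half-plane) metric g = diag(1, 1) / y^2
# and a horizontal segment at height y = 2; its length under ds = dx / y
# should be close to 0.5.
metric = lambda p: np.eye(2) / p[1] ** 2
alpha = lambda t: np.array([t, 2.0])
length = curve_length(alpha, metric, 0.0, 1.0)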

A geodesic on a manifold is a curve between two points on the manifold that,


provided the points are “sufficiently close”, has the shortest length, among all curves
between the points. This definition is, admittedly, rather vague and a more accurate
definition of a geodesic on an abstract surface can be found e.g. in [1, 2]. The
coordinate functions γ i (t) of the geodesic γ(t) parameterized by t can be found by
solving the geodesic equations:

\[
\frac{d^2 \gamma^r}{dt^2}(t) + \sum_{i,j=1}^{n} \Gamma^r_{ij}[\gamma(t)]\, \frac{d\gamma^i}{dt}(t)\, \frac{d\gamma^j}{dt}(t) = 0. \tag{1.12}
\]

Here, the Γ^r_ij represent the so-called Christoffel symbols (of the second kind), which
can be written in terms of the metric components g_ij and the elements g^ij of the
inverse matrix:
\[
\Gamma^k_{ij} = \frac{1}{2} \sum_{m=1}^{n} g^{km} \left( \frac{\partial g_{jm}}{\partial \gamma^i} + \frac{\partial g_{im}}{\partial \gamma^j} - \frac{\partial g_{ij}}{\partial \gamma^m} \right). \tag{1.13}
\]

The expressions (1.12) form a set of nonlinear second-order ordinary differential


equations. This boundary value problem needs to be solved assuming the known
values of the coordinates at the boundary points of the geodesic.
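For a concrete, hypothetical instance of this boundary value problem, the sketch below uses SciPy's solve_bvp on the standard hyperbolic metric ds² = (dx² + dy²)/y² of the Poincaré half-plane, whose nonzero Christoffel symbols are Γ^x_xy = Γ^x_yx = Γ^y_yy = −1/y and Γ^y_xx = 1/y; the analytic geodesics are half-circles centered on the horizontal axis, which provides a simple check. The metric, end points and grid are assumptions made purely for illustration.

import numpy as np
from scipy.integrate import solve_bvp

# State u = (x, y, x', y'); the geodesic equations (1.12) for this metric are
#   x'' - (2/y) x' y' = 0   and   y'' + (1/y) (x'^2 - y'^2) = 0.
def rhs(t, u):
    x, y, xd, yd = u
    return np.vstack([xd, yd, 2.0 * xd * yd / y, (yd ** 2 - xd ** 2) / y])

# Boundary conditions: start at p1 = (-2, 1), end at p2 = (2, 1).
p1, p2 = np.array([-2.0, 1.0]), np.array([2.0, 1.0])
def bc(ua, ub):
    return np.array([ua[0] - p1[0], ua[1] - p1[1], ub[0] - p2[0], ub[1] - p2[1]])

t = np.linspace(0.0, 1.0, 50)
u0 = np.zeros((4, t.size))
u0[0] = np.linspace(p1[0], p2[0], t.size)      # straight-line initial guess
u0[1] = np.linspace(p1[1], p2[1], t.size) + 1.0  # bow the guess upward
u0[2] = p2[0] - p1[0]                            # rough velocity guess
sol = solve_bvp(rhs, bc, t, u0)
geodesic_xy = sol.y[:2]                          # should trace a half-circle arc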

1.2.2 Nonlinear Dimensionality Reduction in Euclidean Spaces

Our first application of differentiable manifolds in image analysis for face recognition
occurs in the case where the feature space is a (high-dimensional) Euclidean space.
Indeed, the data may be found to lie scattered around a lower-dimensional manifold,
in general curved, embedded in the feature space. This is due to interdependencies
between the features, limiting the configuration of the data in the feature space. This
phenomenon can be exploited for reducing the dimensionality of the data.
To see why feature interdependencies induce manifold structure in the feature
space, consider the simple case depicted in Fig. 1.1. The data points appear to lie
scattered around a straight line in a two-dimensional Euclidean feature space. The
reason is, of course, the dependence between the variables x and y. One may take
the point of view that this dependence is deterministic and in that case estimating
the parameters of the line can be the aim of a regression analysis. It is also possible
to view the data points as a sample from a bivariate Gaussian distribution. By calcu-
lating the first principal component of the data, using principal component analysis
(PCA), one can find the variability of the data along the best-fit line. It is important
to note that this is quite different from regression analysis, where the conditional
distribution of the dependent variable y is considered, given the independent vari-
able x, not the bivariate distribution. After PCA, neglecting the variability along the
orthogonal direction entails a projection of the data on the direction of the eigenvec-
tor, corresponding to the largest eigenvalue, of the data covariance matrix. Hence,
we have accomplished a natural reduction of the data dimensionality. PCA is one
method for linear dimensionality reduction, which, in the context of face recognition,
has been applied in the method of eigenfaces [4].
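A minimal sketch of this linear reduction step (Python with NumPy; the data matrix, dimensions and variable names are hypothetical) projects centered samples onto the leading eigenvectors of the sample covariance matrix, which is essentially what eigenface-style methods do with vectorized face images.

import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components.

    X: (N, d) array, one sample (e.g., a vectorized face image) per row.
    Returns the (N, n_components) scores and the (d, n_components) components.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # (d, d) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort descending
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, components

# Hypothetical usage with the 2-D example of Fig. 1.1: one component captures
# most of the variability along the best-fit line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)
scores, comps = pca_project(np.column_stack([x, y]), n_components=1)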
The straight line in the above example is, in fact, a linear one-dimensional man-
ifold embedded in the feature space. The generalization to more complex data sets
is twofold: first, the feature space may be of a high dimensionality with, accord-
ingly, the possibility of a data manifold of dimensionality larger than one. Second,

Fig. 1.1 A set of measurements of two linearly related variables x and y, with the modeled relation
indicated

the relation between the features may be nonlinear, leading to a curved manifold.
The purpose of nonlinear dimensionality reduction is then to learn the underlying
manifold dimensionality and its structure from the data. For instance, it has been
noted that facial data may also lie on nonlinear rather than linear submanifolds of the
Euclidean data space [5, 6]. Many different methods have been proposed to accom-
plish dimensionality reduction based on manifold structure and we now give a few
examples. Further information and more techniques can be found in e.g. [7, 8].

Principal Curves and Surfaces

One of the first attempts to generalize PCA to a nonlinear technique involved the
concepts of principal curves [9] and principal surfaces [10]. This is a straightforward
generalization of the PCA example that we just discussed: principal curves are one-
dimensional nonlinear manifolds and principal surfaces are higher-dimensional, and
both lie embedded in a higher-dimensional Euclidean feature space. Both methods
assume that the data have been centered by subtracting the sample mean. For instance,
considering the synthetic data set in Fig. 1.2, it is intuitively clear that the curve passes
through the “middle” of the data pattern and provides a good model for the data.
In the case of principal curves, it is convenient to parameterize the model curve
β by its arc length s. Then, to find the principal curve one considers the projection
of a data point x in the feature space onto a point β(s) on the curve that is closest (in
Euclidean distance) to x. Thus, one searches for a value λ of the parameter s, such
that ||x − β(λ)|| is minimal. If the data consist of N vectors xi (i = 1, . . . , N), then
the reconstruction error can be defined as the minimum, over all smooth curves β,
of the sum of squared Euclidean distances between each data point and the point on


Fig. 1.2 Example of a nonlinear (quadratic) relation between variables x and y, together with the
best-fit curve passing through the “center” of the data

the curve that it projects to. By minimizing the reconstruction error, using a (local)
parameterization of the curve β, one can find a principal curve such that each point
on the curve is the average of all data points that project to it. The goodness of fit
can then be calculated from the reconstruction error. Similar arguments can be used
to define principal surfaces [10].
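To make the projection and reconstruction-error computation concrete, the sketch below (Python with NumPy; the quadratic curve and the synthetic data are placeholders, not the chapter's data) approximates the projection of each point onto a densely sampled candidate curve and accumulates the squared distances.

import numpy as np

def reconstruction_error(data, curve_points):
    """Sum of squared distances from each data point to its nearest point on a
    densely sampled curve (a discrete stand-in for projection onto beta(s))."""
    d2 = ((data[:, None, :] - curve_points[None, :, :]) ** 2).sum(axis=2)  # (N, M)
    nearest = d2.min(axis=1)             # squared distance to the projection
    return nearest.sum(), d2.argmin(axis=1)

# Hypothetical data scattered around a quadratic curve, as in Fig. 1.2.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
data = np.column_stack([x, 0.15 * (x - 5.0) ** 2 + 10.0
                        + rng.normal(scale=0.8, size=200)])

s = np.linspace(0, 10, 2000)             # dense parameterization of the curve
curve = np.column_stack([s, 0.15 * (s - 5.0) ** 2 + 10.0])
err, proj_idx = reconstruction_error(data, curve)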

Manifold Learning

Another class of nonlinear dimensionality reduction techniques is referred to as


“manifold learning”. Well-known examples are the Isomap algorithm, local linear
embedding, Laplacian eigenmaps and Hessian eigenmaps. Here, we briefly mention
the idea behind the Isomap algorithm. For related methods we refer the reader to [8].
The isometric feature mapping or Isomap algorithm again assumes that the data
are spread around a nonlinear manifold embedded as a convex subset of the Euclidean
data space [11]. It then tries to approximate the geodesic distance on this manifold by
calculating Euclidean distances between neighboring data points (remember that the
manifold appears locally Euclidean). Finally, it applies a multidimensional scaling
algorithm for projecting the data into a lower-dimensional Euclidean space, pre-
serving the global distance geometry on the original manifold as much as possible.
Specifically, the following steps are taken by Isomap:
• Calculate the (usually) Euclidean distance between all pairs of data points xi and
xj . Connect every point xi to its neighbors in a ball with radius ε. This yields a
neighborhood graph with vertices the data points and edges characterized by the
distance between points.

• Estimate the geodesic distance between all pairs of points on the manifold by
finding the shortest path between them (e.g. using Floyd’s algorithm) according
to the neighborhood graph.
• Given the matrix of approximate geodesic distances between all pairs of data
points, apply classic multidimensional scaling to project the data into a Euclidean
space that is, ideally, of the same dimensionality as the manifold that is assumed
to characterize the data. Multidimensional scaling yields a configuration of data
points with a global distance geometry that is maximally similar to the original
geodesic distance geometry on the manifold.
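A compact way to experiment with these three steps is scikit-learn's Isomap implementation, sketched below; note that it builds a k-nearest-neighbor graph rather than the ε-ball graph described above, and the synthetic data are placeholders for high-dimensional face features.

import numpy as np
from sklearn.manifold import Isomap

# Synthetic data on a curved one-dimensional manifold embedded in R^3
# (a placeholder for high-dimensional face feature vectors).
rng = np.random.default_rng(2)
t = rng.uniform(0, 3 * np.pi, size=500)
X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t]) \
    + rng.normal(scale=0.02, size=(500, 3))

# Neighborhood graph, shortest-path geodesic estimates and classical
# multidimensional scaling are all carried out inside fit_transform.
embedding = Isomap(n_neighbors=10, n_components=1).fit_transform(X)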

1.2.3 Probabilistic Manifolds and the Rao Geodesic Distance

We next consider probabilistic or statistical manifolds. Many applications in image


analysis and biometry are based on a probabilistic description of the data and require a
similarity measure between probability distributions. One prominent example, which
we will study later on in this section, is the discrimination of texture information in
an image. For the purpose of similarity measurement, several distance or divergence
measures between probability distributions have been defined and put into practice
in the past. For instance, the Bhattacharyya distance and the Kullback-Leibler diver-
gence (KLD) are very popular probabilistic (dis)similarity measures. A compilation
including many other distances and divergences is given in [12]. In this section we
consider the Rao geodesic distance (GD) as a natural, intrinsic distance measure on
a manifold of probability distributions. Indeed, a family of probability density func-
tions (likelihoods, probabilistic models) parameterized by a vector θ, can be seen
as a differentiable manifold with each point P representing a probability density
function (PDF) labeled by its coordinates θ P . Cramér [13] and Rao [14] observed
that the Fisher information can be regarded as a Riemannian metric on a manifold
of probability distributions. This mainly initiated research in the field that is now
called information geometry (see e.g. [15–17]). Čencov showed that this Fisher-
Rao metric is the unique intrinsic metric on such a manifold, invariant under some
basic probabilistic transformations [18]. Thus, probability theory can be described in
terms of geometric structures invariant under coordinate transformations, to which
the methods of differential geometry can be applied. The corresponding geodesics
between probability distributions have a property of length minimization; they are
the “straight lines” of the geometry.
An important asset of the GD as a similarity measure between probability dis-
tributions, lies in the possibility of visualizing the geodesic paths. Indeed, on this
basis suitable approximations to the geodesics may be introduced, hence facilitating
or speeding up the estimation of GDs. Approximations to the geodesics can be con-
structed in a controlled way, with the impact of the accuracy of the approximation
clearly visible. This allows for an easy evaluation of the trade-off between accuracy
of the derived GD and speed of the application.

Another advantage of the GD is that it is a distance measure (a metric) in the strict


sense of the word. In particular, it is symmetric and it obeys the triangle inequality.
For instance, below we present the results of a set of image classification experiments
based on the measurement of texture similarity in the wavelet domain. This in turn
can be applied in an image retrieval experiment, where the availability of a distance
measure obeying the triangle inequality is useful for constructing a lower bound on the
distance between the query image and a database image, by comparing the images to
a set of predefined key images [19]. This scheme may accelerate the retrieval process
considerably. In contrast, the KLD, although a popular distance measure in model-
based image retrieval, is not symmetric (although it can be symmetrized, forming
the J-divergence) and does not satisfy the triangle inequality.
In addition, closed-form expressions for the KLD, particularly for multivariate
distributions, are often hard to find or simply do not exist. Moreover, numerical
calculation (e.g. Monte Carlo integration) of the multidimensional integral involved
in evaluating the KLD between two multivariate distributions, is often not an easy
task and at least is computationally expensive.
Finally, experimental evidence suggests that the GD in general is a more accurate
distance measure between probability distributions than the KLD [20–23].
We start our discussion of probabilistic manifolds and geodesics with some basic
concepts from information geometry. We then consider the simple case of the univari-
ate normal distribution and we proceed with a relatively general elliptically-contoured
multivariate distribution, which we call the multivariate generalized Gaussian dis-
tribution (MGGD). It is characterized not only by a mean vector and a dispersion
matrix, but also by a shape parameter determining the peakedness of the distribution
and the heaviness of its tails. This flexible distribution is suitable for modeling the
wavelet statistics of images (in which case the mean is zero by construction, see
e.g. [24]). We present expressions for the metric and the geodesic equations on the
manifold of zero-mean MGGDs, which were derived in [25].

Information Geometry and the Fisher-Rao Metric

Information geometry is the mathematical field wherein concepts and methods from
differential geometry are applied for the analysis of probabilistic (or statistical) man-
ifolds. Although the geometric approach conceptually has several advantages, it has
been argued that this does not add significantly to the understanding of the theory
of parametric statistical models [17, 26]. However, as regards the application to the
calculation of distances between distributions, the information geometrical approach
in terms of the Fisher-Rao metric certainly offers considerable advantages. This is
the aspect of information geometry that we focus on in this chapter and it is also the
viewpoint taken in [26], where many applications are discussed.
Given a probability density function (PDF), or likelihood, p(x|θ) for a vector-
valued variable x over a domain D, labeled by an n-dimensional parameter vector θ,
the Fisher information matrix is the covariance matrix of the score. The score V, in
turn, is the gradient of the log-likelihood with respect to the parameters:

V(θ) = ∇ ln p(x|θ). (1.14)

Since the PDF is normalized to one, the expectation of the score vanishes under
certain regularity conditions:
 
\[
\int_D p(x|\theta)\, dx = 1 \;\Longrightarrow\; E(V) = \int_D p(x|\theta)\, \nabla \ln p(x|\theta)\, dx = 0. \tag{1.15}
\]

As a result, the entries g_μν of the Fisher information matrix can be written as
\[
g_{\mu\nu}(\theta) = E\!\left[\frac{\partial}{\partial\theta^{\mu}} \ln p(x|\theta)\, \frac{\partial}{\partial\theta^{\nu}} \ln p(x|\theta)\right]
= -E\!\left[\frac{\partial^{2}}{\partial\theta^{\mu}\,\partial\theta^{\nu}} \ln p(x|\theta)\right], \qquad \mu, \nu = 1, \ldots, n. \tag{1.16}
\]

The second equality follows again easily from the normalization of the PDF. The
Fisher information is a measure for the amount of information that measurements
of x yield about the parameters θ, through the PDF. For instance, if the distribution
of x is Gaussian and is relatively peaked (small standard deviation), then rather few
measurements of x are required to characterize the mean of the distribution. Put
differently, x contains a lot of information on the parameters. On the other hand, a
peaked distribution has a high average curvature, which, according to (1.16) yields a
high absolute value of the component of the Fisher information corresponding to the
mean. In a similar fashion, the inverse of the Fisher information corresponding to a
parameter is a lower bound on the variance (uncertainty) of an unbiased estimator of
this parameter, which is the famous Cramér-Rao lower bound.
Once the metric tensor on the probabilistic manifold is known, one can measure
distances based on the definition of the quadratic line element as in (1.7):

\[
ds^2 = g_{\mu\nu}\, d\theta^{\mu}\, d\theta^{\nu}. \tag{1.17}
\]

Furthermore, one can formulate the geodesic equations and, provided these can
be solved without too much effort, it becomes feasible and advantageous to calculate
the geodesic distance as a natural and theoretically well motivated similarity measure
between probability distributions.

Univariate Gaussian Distribution

Consider first the example of the univariate Gaussian distribution N (μ, σ), para-
meterized by its mean μ and standard deviation σ, and defined by the following
PDF:
\[
p(x\,|\,\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]. \tag{1.18}
\]

Fig. 1.3 a Illustration of the Poincaré half-plane with several half-circle geodesics, one of them
between the points p1 and p2 . b Probability densities corresponding to the points p1 and p2
indicated in (a). The densities associated to some intermediate points on the geodesic between p1
and p2 are also drawn

The Fisher information matrix can be easily calculated, yielding [27]:
\[
\begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[6pt] 0 & \dfrac{2}{\sigma^2} \end{pmatrix}. \tag{1.19}
\]

This again indicates that a small standard deviation leads to a high Fisher infor-
mation for both the mean and standard deviation. The associated Fisher-Rao metric
describes a hyperbolic geometry, i.e., a manifold of constant negative scalar curva-
ture. A convenient model is provided by the Poincaré half-plane, which is represented
in Fig. 1.3a.1 The horizontal axis corresponds to the mean μ of the Gaussian dis-
tribution, while on the positive part of the vertical axis the standard deviation σ is
represented. Every point in this half-plane corresponds to a unique Gaussian and
the geodesics between two points are half-circles as well as half-lines ending on the
horizontal axis, the latter connecting distributions that differ only in their standard
deviation (not drawn). The distance between points along one of these curves in the
Poincaré half-plane is the same as the actual geodesic distance between the points.
The evolution of the distribution along an example geodesic is shown in Fig. 1.3b.
A closed-form expression exists for the GD, permitting a fast evaluation. Indeed,
for two univariate Gaussian distributions p1 (x|μ1 , σ1 ) and p2 (x|μ2 , σ2 ), parameter-
ized by their mean μi and standard deviation σi (i = 1, 2), the GD is given by [27]

1 It should be noted that this is different from an actual embedding of the manifold in Euclidean
space, see also [28].

\[
GD(p_1 \| p_2) = \sqrt{2}\, \ln \frac{1+\delta}{1-\delta} = 2\sqrt{2}\, \tanh^{-1} \delta,
\]
with
\[
\delta \equiv \left[ \frac{(\mu_1 - \mu_2)^2 + 2(\sigma_1 - \sigma_2)^2}{(\mu_1 - \mu_2)^2 + 2(\sigma_1 + \sigma_2)^2} \right]^{1/2}. \tag{1.20}
\]

Finally, in the case of multiple independent Gaussian variables it is easy to prove


that the square GD between two sets of products of distributions is given by the sum
of the squared GDs between corresponding individual distributions [27].
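A direct transcription of (1.20), together with the product rule for independent Gaussians just mentioned, might look as follows (Python with NumPy; a minimal sketch, with hypothetical variable names).

import numpy as np

def rao_gd_gauss(mu1, sigma1, mu2, sigma2):
    """Rao geodesic distance (1.20) between two univariate Gaussians."""
    num = (mu1 - mu2) ** 2 + 2.0 * (sigma1 - sigma2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (sigma1 + sigma2) ** 2
    delta = np.sqrt(num / den)
    return 2.0 * np.sqrt(2.0) * np.arctanh(delta)

def rao_gd_product(params1, params2):
    """GD between products of independent Gaussians: the squared distances of
    the individual factors add up."""
    sq = [rao_gd_gauss(m1, s1, m2, s2) ** 2
          for (m1, s1), (m2, s2) in zip(params1, params2)]
    return np.sqrt(sum(sq))

# Example: the distance is zero for identical distributions and grows with the
# separation of the means relative to the spread.
d = rao_gd_gauss(0.0, 1.0, 1.0, 1.0)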

Multivariate Generalized Gaussian Distribution

We now proceed with a multivariate distribution that has been employed recently
to model the wavelet detail coefficients of multispectral (e.g. color) images. We
will apply it here to modeling of color textures, e.g. as encountered in patches of
facial images. The distribution is known by the name of the multivariate generalized
Gaussian distribution (MGGD) or the multivariate exponential power distribution
and characterized by the following density:
\[
p(\mathbf{x}\,|\,\boldsymbol{\mu}, \omega, \beta) = \frac{\Gamma\!\left(\frac{n}{2}\right)\beta}{\pi^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2\beta}\right) 2^{\frac{n}{2\beta}}\, |\omega|^{\frac{1}{2}}}\,
\exp\left\{ -\frac{1}{2} \left[ (\mathbf{x}-\boldsymbol{\mu})'\, \omega^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right]^{\beta} \right\}. \tag{1.21}
\]

Here, μ is the mean vector, ω is the dispersion matrix and β is the shape parameter.
Clearly, taking β = 1 results in the multivariate Gaussian distribution, while we
call the distribution with β = 1/2 the multivariate Laplace distribution. Several
properties of this distribution were discussed in [29]. For instance, if X is a random
vector distributed according to (1.21), then

\[
E(X) = \boldsymbol{\mu}, \tag{1.22}
\]
\[
\mathrm{Var}(X) = \frac{2^{1/\beta}\, \Gamma\!\left(\frac{n+2}{2\beta}\right)}{n\, \Gamma\!\left(\frac{n}{2\beta}\right)}\, \omega, \tag{1.23}
\]
\[
\gamma_1(X) = 0, \tag{1.24}
\]
\[
\gamma_2(X) = \frac{n^2\, \Gamma\!\left(\frac{n}{2\beta}\right) \Gamma\!\left(\frac{n+4}{2\beta}\right)}{\left[\Gamma\!\left(\frac{n+2}{2\beta}\right)\right]^{2}} - n(n+2). \tag{1.25}
\]

Here, γ1 and γ2 respectively denote the multivariate skewness and kurtosis,


according to Mardia [30]; the kurtosis being a decreasing function of the

shape parameter. For the application that we have in mind μ ≡ 0, since the wavelet
detail coefficients to be modeled have zero expectation [24]. Therefore, in the fol-
lowing we will only discuss zero-mean MGGDs. In addition, note that the covariance
matrix is given by ω only in the case of the Gaussian distribution (β = 1), when also
γ2 = 0. We will refer to an MGGD with mean zero, dispersion matrix ω and shape
parameter β by means of the shorthand notation MGGD(β, ω).
Given a set of N n-dimensional vectors xi (i = 1, . . . , N), assumed to be dis-
tributed according to a zero-mean MGGD, the parameters to be estimated for the
best-fit MGGD model are the dispersion matrix ω and the shape parameter β. In the
experiments below the estimation was performed routinely via recursive solution of
the maximum likelihood equations. However, other possibilities for estimating the
parameters are the method of moments and the Fisher scoring algorithm [20].
In [25] the metric and geodesic equations for the zero-mean MGGD were derived
and geodesic distances were calculated or approximated, based on earlier work
related to the multivariate Gaussian distribution [31] and more general elliptically-
contoured distributions [32, 33]. A family of MGGDs characterized by a fixed shape
parameter β forms a submanifold of the general MGGD manifold. In this particular
case, the geodesics take the form of straight lines in Rn [32]. As a result, the geodesic
distance between two MGGDs characterized by (β, ω 1 ), respectively (β, ω 2 ), exists
in a closed form. Denoting this specific distance by GD(β, ω1 ||β, ω 2 ), we have:

\[
GD(\beta, \omega_1 \,\|\, \beta, \omega_2) = \left[ \left(3 b_h - \frac{1}{4}\right) \sum_{i} (r^i)^2 + 2\left(b_h - \frac{1}{4}\right) \sum_{i<j} r^i r^j \right]^{1/2}, \tag{1.26}
\]

with r^i ≡ ln λi and λi (i = 1, . . . , n) the n eigenvalues of ω1^−1 ω2 . In addition, bh
is defined by
\[
b_h \equiv \frac{1}{4}\, \frac{n + 2\beta}{n + 2}. \tag{1.27}
\]
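A minimal implementation of the closed form (1.26)–(1.27) could read as follows (Python with NumPy/SciPy; it assumes both dispersion matrices are symmetric positive definite and obtains the eigenvalues of ω1^−1 ω2 from the corresponding generalized eigenvalue problem).

import numpy as np
from scipy.linalg import eigh

def mggd_gd_fixed_beta(omega1, omega2, beta):
    """Geodesic distance (1.26) between two zero-mean MGGDs sharing the same
    shape parameter beta, with dispersion matrices omega1 and omega2."""
    n = omega1.shape[0]
    # Eigenvalues of omega1^{-1} omega2 via the generalized symmetric problem.
    lam = eigh(omega2, omega1, eigvals_only=True)
    r = np.log(lam)
    bh = 0.25 * (n + 2.0 * beta) / (n + 2.0)
    sum_sq = np.sum(r ** 2)
    sum_cross = 0.5 * (np.sum(r) ** 2 - sum_sq)   # sum over i < j of r_i r_j
    return np.sqrt((3.0 * bh - 0.25) * sum_sq + 2.0 * (bh - 0.25) * sum_cross)

For β = 1 the expression reduces to the familiar Gaussian result, since then b_h = 1/4 and the cross terms vanish.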

If on the other hand β is allowed to vary, the geodesic equations are more dif-
ficult to solve as no closed-form solution exists. Instead, through numerical opti-
mization, polynomial solutions can be obtained for the coordinates as a function of
the geodesic’s parameter t (i.e., the coordinate functions) [25]. In practice only a
few iterations of a BFGS Quasi-Newton method usually suffice to obtain substan-
tial information on the geodesics. However, this scheme still is computationally too
intensive to be practical for many applications. Therefore, a computationally more
convenient scheme can be introduced using a linear approximation to the geodesic
coordinate functions. Note that in such a case the geodesic itself still lives on the
MGGD manifold with nonzero curvature and therefore, if embedded in a Euclidean
space of sufficient dimension, the image of the geodesic under this map would not
be a straight line.


Fig. 1.4 Coordinates of the geodesic between the two trivariate MGGDs given in the main text

As an example, consider the geodesic between two trivariate distributions


MGGD(0.5, ω 1 ) (Laplacian) and MGGD(1, ω 2 ) (Gaussian), with ω 1 and ω 2 as
follows:
\[
\omega_1 = \begin{pmatrix} 0.5 & 0.4 & 0.5 \\ 0.4 & 0.9 & 0.7 \\ 0.5 & 0.7 & 1.8 \end{pmatrix}
\quad\text{and}\quad
\omega_2 = \begin{pmatrix} 0.8 & 0.9 & 1.1 \\ 0.9 & 1.1 & 1.3 \\ 1.1 & 1.3 & 1.6 \end{pmatrix}.
\]

The log-eigenvalues r k of ω 1 −1 ω 2 are 0.58, −1.97 and −3.32. This yields the
boundary conditions for solving the geodesic equations (1.12). The coordinate func-
tions of the geodesic, obtained through polynomial approximation, are shown in
Fig. 1.4. The linear approximation involves simply replacing the true geodesic coor-
dinate functions for β and the r k by a straight line between the start and end points.
The calculation of the geodesic distances via integration along the geodesic can then
be carried out using a simple (and fast) trapezium rule.
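Continuing this example numerically (a hypothetical check; the values below are computed, not quoted from the chapter), the log-eigenvalues r^k follow directly from the generalized eigenvalue problem, and the closed form (1.26)–(1.27) would apply if both endpoints shared a common shape parameter.

import numpy as np
from scipy.linalg import eigh

omega1 = np.array([[0.5, 0.4, 0.5],
                   [0.4, 0.9, 0.7],
                   [0.5, 0.7, 1.8]])
omega2 = np.array([[0.8, 0.9, 1.1],
                   [0.9, 1.1, 1.3],
                   [1.1, 1.3, 1.6]])

# Log-eigenvalues r^k of omega1^{-1} omega2: the boundary data for the
# geodesic equations (compare with the values quoted above).
r = np.log(eigh(omega2, omega1, eigvals_only=True))

# If both endpoints shared a common shape parameter (say beta = 0.5), the
# closed form (1.26)-(1.27) would apply directly:
n, beta = 3, 0.5
bh = 0.25 * (n + 2 * beta) / (n + 2)
sum_sq = np.sum(r ** 2)
sum_cross = 0.5 * (np.sum(r) ** 2 - sum_sq)
gd_fixed_beta = np.sqrt((3 * bh - 0.25) * sum_sq + 2 * (bh - 0.25) * sum_cross)
# For the Laplacian-to-Gaussian pair of the text, with varying beta, the
# polynomial or linear approximation plus trapezium-rule integration is used.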

Color Texture Discrimination Using Manifold Methods

Consider now the application of the MGGD model and similarity measurement with
the geodesic distance, to color texture modeling and discrimination [20].

Experimental Setup. A series of classification experiments was conducted on


a well-known texture database, with the aim of providing a benchmark for the com-
parison of different statistical models as well as comparison of the GD with the
KLD. The database comprised a set of 40 images from the VisTex database [34].
These are real-world 512 × 512 color images from different natural scenes (tex-
tures), displayed in Fig. 1.5, selected because of their sufficient homogeneity. The
images were expressed in the RGB color space. Every image was divided in sixteen
nonoverlapping 128 × 128 subimages, constituting a database of 640 subimages.
Gray-level images were obtained from the original color images by calculating their
luminance. In order to render the retrieval task sufficiently challenging, every color
(or gray-level) component of each subimage was individually normalized to zero
mean and unit variance. As a result, the gray scales of subimages from the same
original image were generally not in the same range. Then, on every component
individually a discrete wavelet transform was applied with three levels using the
Daubechies filters of length eight. The wavelet detail coefficients of every subband
over the three color components (or the gray-level) were modeled by an (M)GGD
estimated using maximum likelihood. The parameters of the (M)GGD models for
all subbands comprise the feature set for a single subimage. For gray-scale images
the mean β over the whole database turned out to be about 0.58, while for correlated
color images it was slightly lower: 0.48 (about Laplacian). This is an indication that
the Laplacian is much more suitable as a model than the Gaussian. The reason is
the heavy-tailedness of wavelet statistics, which corresponds better to the Laplacian
model [24].
The classification experiments were carried out by means of a k-nearest neighbor
classifier validated by the leave-one-out method. In practice, we considered one of
the 640 subimages (the test image), to be assigned to one of the 40 original texture
classes. The class labels of the other subimages were assumed to be known, which
constitutes the training phase of the classifier. Then, the similarity of the test image
to each of the remaining images was determined and the test image was assigned to
the class most common among the k = 15 nearest neighbors of the test image. We
chose k = 15 since ideally the 15 nearest neighbors of the test image should be the
15 subimages originating from the same original texture class to which the test image
belonged in reality. We next compared the assigned class label with the true class
label of the test image. We repeated the procedure, successively using every subimage
once as a test image (hence every time leaving out one of the subimages in the training
phase of the classifier). We then determined the rate of correct classification and used
this as a performance measure for the classifier. The results are shown in Table 1.1.
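The classification protocol just described can be sketched as follows (Python with NumPy; in the actual experiments the distance matrix would hold GD or KLD values between the fitted subband models, whereas the features, distances and labels below are placeholders).

import numpy as np

def loo_knn_accuracy(D, labels, k=15):
    """Leave-one-out k-nearest-neighbor classification from a precomputed
    (N, N) distance matrix D and a length-N label vector."""
    n = len(labels)
    correct = 0
    for i in range(n):
        d = D[i].copy()
        d[i] = np.inf                         # leave the test image out
        neighbors = np.argsort(d)[:k]         # k most similar training images
        values, counts = np.unique(labels[neighbors], return_counts=True)
        predicted = values[np.argmax(counts)]
        correct += (predicted == labels[i])
    return correct / n

# Placeholder usage: 640 subimages, 40 classes of 16 subimages each.
rng = np.random.default_rng(3)
features = rng.normal(size=(640, 8))                       # stand-in feature vectors
D = np.linalg.norm(features[:, None] - features[None], axis=2)
labels = np.repeat(np.arange(40), 16)
rate = loo_knn_accuracy(D, labels, k=15)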
We conducted the experiments for various choices of the statistical model and
using the GD and KLD as a similarity measure. We started with the gray-level equiv-
alent of the 640 color images and we next treated the corresponding RGB color
images, considering the full (trivariate) correlation structure between the spectral
bands. The wavelet subbands, on the other hand, were assumed to be mutually inde-
pendent. For each of these instances, we employed several MGGD models, with fixed
and with variable shape parameter. In the case of fixed shape parameter, we chose
two models, namely β = 1, i.e., the (multivariate) Gaussian and the multivariate

Fig. 1.5 The 512 × 512 texture images from the VisTex database used in our experiments. From
left to right and top to bottom the images are: Bark0, Bark6, Bark8, Bark9, Brick1, Brick4, Brick5,
Buildings9, Fabric0, Fabric4, Fabric7, Fabric9, Fabric11, Fabric14, Fabric15, Fabric17, Fabric18,
Flowers5, Food0, Food5, Food8, Grass1, Leaves8, Leaves10, Leaves11, Leaves12, Leaves16,
Metal0, Metal2, Misc2, Sand0, Stone1, Stone4, Terrain10, Tile1, Tile4, Tile7, Water5, Wood1
and Wood2

Laplacian, characterized by β = 1/2. For a variable shape parameter, the GD was


evaluated using a linear approximation for the geodesic coordinate functions, as men-
tioned above. In the cases where a closed expression is available for the KLD, we
compared its performance as a similarity measure with the results using the GD [20].
For multivariate distributions, this is only possible for the Gaussian case.
In evaluating the results of these benchmarking experiments, it should be noted
that even small differences in correct classification rates usually reflect significant
performance characteristics. The reason is the particularly challenging nature of the
experiments, due to the normalization of the images. Hence, the differences observed
in the present experiments can easily lead to large corresponding differences in a
realistic application of texture discrimination.
The main conclusions that can be drawn from these experiments are the following.
First, texture discrimination is clearly much aided by the information in the corre-
lation structure between the color bands. Therefore the multivariate modeling of
spectral bands is often worth the extra computational burden. Second, the Laplace
distribution is a better model than the Gaussian for wavelet statistics, and in the
univariate case there lies an additional advantage in considering a variable shape
parameter. Finally, the GD invariably performs better than the KLD. This is a demon-
stration of the fact that the GD within the context of information geometry is better
suited as a similarity measure between PDFs than the KLD.
Table 1.1 Correct classification rates (%) using different models (Gray: gray-level, MV: multivariate color with full correlation structure) for three wavelet scales, using the KLD and GD as similarity measures in a database of 640 color texture images

Measure   Model     Gray    MV
KLD       Gauss     77.8    95.9
          Laplace   81.4    –
          GGD       87.5    –
GD        Gauss     79.5    97.0
          Laplace   83.6    97.5
          GGD       88.9    97.3

1.2.4 Shape Manifolds


Another important field wherein manifolds have been used in image analysis is the
statistical theory of shape. This field was pioneered by Kendall [35], who studied
shapes characterized by a labeled set of points, called landmarks. Various shape
spaces have been considered, but we will focus here on landmark-based shape spaces
[36–38].
Consider a set of k landmarks, called a k-ad, with coordinates (column vectors)
xi (i = 1, . . . , k) on an object in Euclidean space Rn , where usually n = 2 or 3. The
shape of the object is determined by the particular configuration of landmark points
and it is natural to identify all shapes that can be obtained from the initial shape by
translation, rotation and scaling. The similarity shape is the shape that remains if
one “removes” the effect of translation, rotation and scaling. To remove the effect of
translation, one can subtract the Euclidean centroid, i.e., the mean, of the shape from
the coordinates of each landmark. Arranging the coordinates in the n × k matrix X,
one finds a matrix U given by

U = [x_1 - \bar{x}, \ldots, x_k - \bar{x}], \qquad \bar{x} = \frac{1}{k} \sum_{i=1}^{k} x_i.   (1.28)

We then remove the effect of scaling, by dividing U by its Frobenius norm ||U||,
yielding a matrix Z:

Z = \left[ \frac{x_1 - \bar{x}}{\|U\|}, \ldots, \frac{x_k - \bar{x}}{\|U\|} \right], \qquad \|U\| = \sqrt{\mathrm{tr}(UU^T)}.   (1.29)

Now the sum of squared distances from the origin to the landmark points is 1. The
matrix Z is called the preshape of the k-ad X and it lies on the unit sphere within the
hyperplane Z1k = 0, where 1k is the column vector with k ones. Thus the space of
preshapes can be identified with the unit sphere S^{kn−n−1}. Finally, rotation is filtered
away by considering all rotations under left multiplication by an n × n rotation
matrix in SO(n). The shape of the k-ad X is then identified with the orbit of Z under
rotations, since all shapes that lie on the same orbit are, in fact, equivalent. So the
final shape space is technically a quotient space:

\Sigma_n^k = S^{kn-n-1}/SO(n)   (1.30)

In a similar fashion, shape spaces have been defined by considering the orbit of
k-ads under reflections, affine transformations (for planar scenes imaged from a large
distance) or projective transformations (for scenes imaged from a single observation
point).
Finally, once a shape is described by an orbit which is a point in a shape space,
one can consider the corresponding shape manifold and perform geometry on this
manifold. This way, it is possible to calculate geodesic distances between shapes and
even do statistics on shape manifolds [36–38].
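As a small illustration of Eqs. (1.28)–(1.29), the preshape of a k-ad can be computed as in the sketch below (the landmark configuration used here is hypothetical):

```python
import numpy as np

def preshape(X):
    """Preshape Z of a k-ad (Eqs. 1.28-1.29).

    X : (n, k) array whose columns are the landmark coordinates x_i in R^n.
    The centroid is removed and the result is scaled to unit Frobenius norm,
    so that tr(Z Z^T) = 1 and Z 1_k = 0.
    """
    U = X - X.mean(axis=1, keepdims=True)   # remove translation (Eq. 1.28)
    return U / np.linalg.norm(U)            # remove scale (Eq. 1.29)

# Hypothetical k-ad with k = 4 landmarks in the plane (n = 2)
X = np.array([[0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
Z = preshape(X)
print(np.trace(Z @ Z.T))   # prints 1.0 (up to rounding)
```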

1.3 Non-cooperative Face Recognition

Several approaches have been proposed in the literature to perform face recognition
with and/or without user assistance. An issue common to many methods is the so
called curse of dimensionality, since a single image often provides high-dimensional
information, and a very large number of samples may be required to obtain sufficient
statistics (i.e., the number of samples grows exponentially with the number of dimen-
sions of the face recognition problem). However, often the number of samples (faces)
is limited, and this limitation may lead to losses of accuracy and performance if not
approached properly. An alternative for building robust face recognition schemes is
to reduce the problem dimensionality.
Recently, manifold methods have been proposed for biometrics, and especially for
face recognition, achieving high recognition accuracy rates. Some of these mani-
fold methods convert a high-dimensional face image representation into a lower-
dimensional representation, making face representation and comparisons easier.
Also, there are several approaches to learn manifolds from limited sample infor-
mation that can describe both local and global data geometry information, while
preserving class separability.
A typical face identification framework involves three steps, namely (a) face
detection, (b) facial feature extraction, and (c) face recognition. Each one of these
steps has its own challenges. Face detection basically involves the location of each
face in the input image [39], since images may contain more than one face, and
the faces must be segmented from the background clutter. Facial feature extraction
usually requires dealing with issues such as illumination, head pose, and face align-
ment and scaling (e.g. sometimes, face models are trained based on collections of
aligned faces [40–42]). Finally, face classification is also a challenging problem,
which usually requires dealing with intra-class variability issues, such as changes
in expression and facial features (e.g. aging, hair, glasses, etc.). Even a single face
image may contain too much information to be processed directly, and it must be represented as high-dimensional data. Therefore, dimensionality reduction methods help provide the compact face representations used in face recognition. Statistical manifold methods provide an adequate framework for face representation and classification. Also, manifold learning methods make it possible to compensate for the data set scarcity commonly found in face recognition problems (e.g. the complete set of changes that a face may undergo), providing a trade-off between data scarcity and robust recognition performance.
There are several methods proposed in the literature to learn the manifold shape from sparse and high-dimensional data. For example, the manifold on which faces lie may have an unknown geometry and the number of available face samples may be limited; consequently, it may be necessary to learn an approximation to the manifold in order to represent and classify faces accurately. Next, we discuss some typical methods used for learning face manifolds while preserving the global data geometry and some other desirable properties (such as face class separation). The methods discussed below present different alternatives for analyzing the data geometry
and transforming the face data from high-dimensional to lower-dimensional spaces,
which can be used to represent and compare faces.

1.3.1 Manifold Learning in Face Recognition

A learnt manifold embedded in a high dimensional space preserves some desirable features of the data from the original manifold, and this capability can favor correct data classification. Once the embedded manifold is learnt, new data (faces) can be mapped and classified in this lower-dimensionality space.
There are linear and non-linear manifold learning methods. Linear manifold learning methods are simple and popular, although non-linear techniques can potentially better preserve the high-dimensional structure of the data. Moreover, there are non-linear manifold learning methods that preserve the global data structure, and those that preserve the local data structure. A desirable property of a manifold embedding (learning) method is to preserve both the local and the global data structure.
Principal Component Analysis (PCA) is one of the most common linear techniques used for dimensionality reduction, especially in human face recognition and biometrics [4, 43]. In this approach each a×b image is interpreted as an
ab-dimensional vector x, often implying that the original image space has high
dimensionality. The main goal of PCA-based methods is to find a new orthogo-
nal coordinate system that preserves most of the data variation, while performing
dimensionality reduction. The original data space is projected into this new space
with less dimensions spanned by the principal eigenvectors of the data covariance
matrix.
First, assuming n distinct training face images represented as vectors xi , the
ab-dimensional mean face vector x̂ is calculated by taking the average of the training
face vectors xi as:
x̂ = \frac{1}{n} \sum_{i=1}^{n} x_i   (1.31)

Fig. 1.6 Illustration of the eigenfaces representation components shown as images for a face database [44] with all faces in frontal position and neutral expression [4]: a mean face x̂; b first eigenface vector shown as an image; c second eigenface vector shown as an image; d third eigenface vector shown as an image

The matrix C containing the zero mean face vectors is calculated by subtracting
the mean face vector x̂ from each one of the training face vectors xi :
C = [x_1 - x̂, x_2 - x̂, \ldots, x_n - x̂]   (1.32)

The goal is to determine the principal eigenvectors of the n × n covariance matrix Σ_c:

Σ_c = C^T C   (1.33)

by solving the following eigenproblem:

Σ_c v = λv   (1.34)

where (λ_1, λ_2, ..., λ_n) and (v_1, v_2, \ldots, v_n) are the eigenvalues and eigenvectors of Σ_c, respectively. Only the r eigenvectors with the largest eigenvalues are kept, leading to the transformation (projection) matrix U = [v_1^T, v_2^T, ..., v_r^T]. Next, each face vector x_i is projected into the reduced dimensionality space using the transformation matrix U:

y_i = U(x_i − x̂)   (1.35)

In this way, each face vector x_i is mapped to y_i by an orthogonal transformation, and y_i lies in the reduced dimensionality “face space”. The eigenvectors
(v1 , v2 , . . . , vr ) are often called eigenfaces, and new faces are described by linear
combinations of these eigenfaces. In Fig. 1.6, there is an example of the mean face
image and the eigenfaces. A new face is classified by determining its nearest face
class prototype in the “face space”.
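A compact sketch of the eigenfaces computation of Eqs. (1.31)–(1.35) is given below; it assumes the training faces are stacked as columns of a matrix and uses the usual device of diagonalizing the small n × n matrix C^T C and mapping its eigenvectors back to image space:

```python
import numpy as np

def eigenfaces(X, r):
    """X : (ab, n) matrix whose columns are the vectorized training faces.
       r : number of eigenfaces to keep."""
    x_mean = X.mean(axis=1, keepdims=True)      # mean face (Eq. 1.31)
    C = X - x_mean                              # zero-mean faces (Eq. 1.32)
    # Eigen-decomposition of the small n x n matrix C^T C (Eqs. 1.33-1.34)
    eigvals, eigvecs = np.linalg.eigh(C.T @ C)
    order = np.argsort(eigvals)[::-1][:r]       # r largest eigenvalues
    U = C @ eigvecs[:, order]                   # eigenfaces in image space
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    return x_mean, U

def project(x, x_mean, U):
    """Project a face vector into the reduced 'face space' (Eq. 1.35)."""
    return U.T @ (x - x_mean)
```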
Another manifold learning approach which is commonly used in face recognition
is a nonlinear method called Locally Linear Embedding (LLE) [45], which aims at
reconstructing a manifold while preserving the local data geometry. Local properties
of face vectors are preserved by writing each high-dimensional vector as a linear
combination of their nearest neighbors.
The first step in LLE is to determine all k-nearest neighbors of each face vector xi
by measuring Euclidean distances between pairs of face vectors (xi , xj ). Afterwards,
each face vector xi is reconstructed based on its neighbors by finding the weight
matrix W that minimizes the residual sum of squares:
RSS(W) = \sum_{i=1}^{n} \left\| x_i - \sum_{j \neq i} w_{ij} x_j \right\|^2,   (1.36)

with the following restrictions:

wij = 0 unless xj is one of the k-nearest neighbors of xi , (1.37)

and,

\sum_{j} w_{ij} = 1 \quad \text{for each face vector } x_i,   (1.38)

where n is the total number of face vectors. The problem of finding the best wij can be
solved as an eigenvalue problem, minimizing the squared error RSS(W). In the next
step, using the obtained weights W, the reconstruction error Θ(Y) in the projected
space is minimized by obtaining the matrix Y, where each line yj is a face vector in
this projected space:
\Theta(Y) = \sum_{i=1}^{n} \left\| y_i - \sum_{j \neq i} w_{ij} y_j \right\|^2,   (1.39)

with the following restrictions:



\sum_{j} y_j = 0 \quad \text{(the projected face vectors are centered at the origin)},   (1.40)

and,
YT Y = I, with I being the identity matrix. (1.41)

It shall be observed that by minimizing Θ(Y), an orthogonal coordinate system centered at the origin is obtained, and this minimization problem can be solved as an eigenproblem. The dimensionality of the projected space will be determined by
selecting the lowest r eigenvalues. The local data structure is preserved since each
face vector is reconstructed by a linear combination of its neighboring face vectors,
leading to a coordinate system with good locality preservation.
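In practice, LLE is available in libraries such as scikit-learn; the sketch below (with hypothetical data sizes) shows how face vectors could be embedded and then classified with a nearest-neighbor rule in the low-dimensional space:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: 200 vectorized 64x64 faces with 10 class labels
X_train = np.random.rand(200, 64 * 64)
y_train = np.random.randint(0, 10, size=200)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=20)
Y_train = lle.fit_transform(X_train)            # low-dimensional face vectors

clf = KNeighborsClassifier(n_neighbors=1).fit(Y_train, y_train)

# New faces must be mapped with the same embedding before classification
X_test = np.random.rand(5, 64 * 64)
print(clf.predict(lle.transform(X_test)))
```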
As mentioned before, Isomap [11] is a non-linear manifold learning method designed to preserve the global data geometry. This method has three main steps: (a) construction of a neighborhood graph G, to identify the neighboring face vectors in the manifold M in a pairwise manner, as indicated below:

d_X(i, j) = \begin{cases} \|x_i - x_j\| & \text{if } \|x_i - x_j\| < \varepsilon, \\ 0 & \text{otherwise;} \end{cases}   (1.42)

The next step in the Isomap manifold learning, (b) is to estimate the geodesic
distances for each pair of face vectors, which means finding the shortest distance
dG (i, j) in the manifold that connects both face vectors (e.g., using the Dijkstra
algorithm to find the geodesic distances, or shortest paths in graphs); and finally, (c)
obtain a multidimensional scaling (MDS) on the matrix of graph distances d_G(i, j), by constructing an embedding of the data in an r-dimensional Euclidean space that best preserves the manifold's estimated intrinsic geometry (usually, the largest eigenvalues are selected to form a new coordinate system for the classification of face vectors).
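The three Isomap steps are implemented, for example, in scikit-learn; a short usage sketch with hypothetical data follows, where n_neighbors controls the neighborhood graph of step (a) and n_components is the target dimensionality r of the MDS embedding of step (c):

```python
import numpy as np
from sklearn.manifold import Isomap

# Hypothetical data: 300 vectorized 32x32 face images
X = np.random.rand(300, 32 * 32)

isomap = Isomap(n_neighbors=8, n_components=15)
Y = isomap.fit_transform(X)     # geodesic-distance-preserving embedding
print(Y.shape)                  # (300, 15)
```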
Another interesting dimensionality reduction method is Kernel PCA, which con-
sists of the reformulation of PCA using a kernel function [46]. The Kernel PCA
method computes the principal eigenvectors of the kernel matrix, rather than those
of the covariance matrix. The PCA is then reformulated for the kernel space, since
a kernel matrix may be seen as the inner product of the face vectors in the high-
dimensional space that is reconstructed using the kernel function and nonlinear
mappings. As another example, we could mention the Maximum Variance Unfold-
ing (MVU) approach [47], which can be used to unroll complex datasets into lower
dimensional spaces.
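Kernel PCA is also available off the shelf; the sketch below (with hypothetical data and a Gaussian kernel) computes the principal eigenvectors of the kernel matrix instead of the covariance matrix, as described above:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.rand(150, 48 * 48)        # hypothetical vectorized faces

kpca = KernelPCA(n_components=25, kernel="rbf", gamma=1e-4)
Y = kpca.fit_transform(X)               # components in the kernel-induced space
print(Y.shape)                          # (150, 25)
```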
Next, we discuss some well known approaches that provide dimensionality reduction while trying to preserve the geometry of the original, higher-dimensional data space.

1.3.2 Dimensionality Reduction Using Laplacianfaces

PCA is helpful to discover the data dimensionality and to compact the data; a good example is Eigenfaces [4]. However, this approach does not always lead to a good data class (cluster) separation. Another method proposed in the literature, Linear Discriminant Analysis (LDA) [48], searches for the projection in which each face vector falls near the points of the same cluster and far from the other clusters, as proposed in Fisherfaces [48]. Next, several methods that try to learn the face manifold by approximating it with linear or nonlinear methods are discussed. A manifold embedding method should be capable of preserving both the local and the global data structure in the low dimensional space while preserving the manifold metric structure.

As discussed in [49], the PCA and LDA approaches provide linear manifold distances. Some authors [45, 50, 51] suggest that the face space lies on a non-linear submanifold. However, both PCA and LDA only work on the Euclidean structure.
They fail to discover the underlying structure if the face vectors lie on a nonlinear submanifold hidden in the image space.

Some non-linear methods were proposed to discover the nonlinear structure of the manifold, e.g., Isomap [11], Locally Linear Embedding (LLE) [45, 52] and Laplacian Eigenmaps [53]. However, while these methods provide good results on benchmarks using artificial data, it is unclear how to evaluate the resulting maps on novel test face vectors. Moreover, some nonlinear methods might not be suitable for computer vision tasks such as face recognition.
Recently, a method called Laplacianfaces [49] was proposed with the goal of
both representing and recognizing faces efficiently. In this approach, the manifold
structure inherent to the data is modeled by a nearest neighbor graph and the face
subspace is obtained by Locality Preserving Projections (LPP) [54]. In this way, each
face vector is mapped to a lower-dimensionality subspace characterized by a set of
feature vectors (images), called Laplacianfaces. As a result, the local data structure
and data clusters are well preserved.
One alternative for determining a low dimensional subspace while promoting clus-
ter/class separability is the Linear Discriminant Analysis (LDA), where the objec-
tive is to find the linear projection that provides the best class separability in the
lower-dimensionality subspace. Consider the following transformation that maps
face vectors xi to yi through the transformation vector u:

uT xi = yi . (1.43)

The use of PCA allows to determine some projection directions u that preserve
data scattering in a lower-dimensionality subspace. However, LDA also allows to
find a lower-dimensionality subspace but where the data classes or clusters can be
discriminated more easily. The objective of LDA is to find the vector u that maximizes
the following ratio:
\frac{u^T S_B u}{u^T S_W u},   (1.44)

where,

S_B = \sum_{c=1}^{l} n_c (\mu_c - \mu_t)(\mu_c - \mu_t)^T,   (1.45)

S_W = \sum_{c=1}^{l} \sum_{i=1}^{n_c} (x_i^{(c)} - \mu_c)(x_i^{(c)} - \mu_c)^T   (1.46)

and S_B and S_W are the between-class and the within-class scatter matrices, respectively; μ_t is the global mean of all face vectors, μ_c is the mean of the face vectors of class c, and x_i^{(c)} is the ith face vector in the cth class.
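For concreteness, the scatter matrices of Eqs. (1.45)–(1.46) and the LDA projection of Eq. (1.44) can be computed as in the sketch below; it assumes the face vectors have already been reduced by PCA so that S_W is well conditioned:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, r=1):
    """X : (n, d) matrix of (PCA-reduced) face vectors, one per row.
       labels : (n,) class labels. Returns the r leading LDA directions."""
    d = X.shape[1]
    mu_t = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        diff = (Xc.mean(axis=0) - mu_t)[:, None]
        S_B += len(Xc) * diff @ diff.T      # between-class scatter (Eq. 1.45)
        Dc = Xc - Xc.mean(axis=0)
        S_W += Dc.T @ Dc                    # within-class scatter (Eq. 1.46)
    # u maximizing u^T S_B u / u^T S_W u solves S_B u = lambda S_W u (Eq. 1.44)
    w, V = eigh(S_B, S_W)
    return V[:, ::-1][:, :r]                # eigenvectors, largest first
```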
It is known that PCA and LDA aim to preserve the global data organization
and structure, however in many applications like face recognition it is desirable to
preserve the local data structure among neighboring data points, leading to better
class separability. In the Laplacianfaces approach, the local data structure on the
face space is represented by a nearest neighbor graph using the Locality Preserving
Projections LPP algorithm [54]. The LPP objective function can be written as follows:
\min \sum_{ij} (y_i - y_j)^2 S_{ij},   (1.47)

where the matrix S is the data similarity matrix defined as:

S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t} & \text{if } \|x_i - x_j\|^2 < \varepsilon, \\ 0 & \text{otherwise,} \end{cases}   (1.48)

where ε defines the radius of the local neighborhood in which we look for the closest neighbors, and t is a parameter used for defining the local data structure (higher t values increase the similarity among neighboring data).
Alternatively, the matrix S can be defined in another way, where the distance condition \|x_i − x_j\|^2 < ε is replaced by x_i being among the k nearest neighbors of x_j, or x_j being among the k nearest neighbors of x_i.
Face vectors that are not close in the high-dimensional space are projected far away from each other in the low-dimensional space, which promotes inter-class separability and intra-class consistency. Using some algebraic manipulation, the objective function can be rewritten as follows:

\arg\min_{u} \; u^T X L X^T u,   (1.49)

given that:

u^T X D X^T u = 1,   (1.50)

where X = [x_1, x_2, \ldots, x_n], D is a diagonal matrix with D_{ii} = \sum_j S_{ij}, and L = D − S is the Laplacian matrix as in Spectral Graph Theory [55]. The matrix D provides a natural measure on the face vectors; therefore, the larger D_{ii} is, the more important y_i is. The transformation vector u that minimizes the objective function is
given by the minimum eigenvalue solution to the following generalized eigenvalue
problem:
XLXT u = λXDXT u. (1.51)

The solution of Eq. 1.51 is the set of eigenvectors W_{LPP} = (u_1, u_2, \ldots, u_r) with the r smallest eigenvalues, called Laplacianfaces. Thus, the manifold embedding is given as follows:

x → y = W_L^T x,   (1.52)

where,

W_L = W_{PCA} W_{LPP},   (1.53)

and,

W_{LPP} = (u_1, u_2, \ldots, u_r),   (1.54)

Fig. 1.7 Illustration of the Laplacianfaces subspace embedding in two-dimensions with axes (z1, z2). Similar facial expressions in the face database [44] are mapped to neighboring locations in this subspace embedding
where y is an r-dimensional vector, and W_L is the desired transformation matrix.

PCA is used to reduce the noise effect and to project the face vectors x_i into a subspace retaining almost the full data variability, using only the leading principal components.
In our experiments, 98 % of the information was kept in terms of the reconstruction
error. Figure 1.7 shows a two-dimensional embedding of the Laplacianface subspace
as an illustration where different facial expressions and head poses are mapped to this
subspace. It can be observed that there is a smooth transition path between similar
facial expressions and head poses.
In order to recognize a new face image using this scheme, three processing steps
are needed. First, the Laplacianfaces subspace embedding is calculated from the
face image training set. Second, the new face is projected into the subspace spanned
by the Laplacianfaces WLPP . Finally, the input face is assigned to the class of its
nearest-neighbor in the Laplacianfaces subspace. Figure 1.8 illustrates a comparison
between Eigenfaces and Laplacianfaces methods.
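A minimal sketch of the LPP step used by Laplacianfaces — building a heat-kernel similarity matrix on a k-nearest-neighbor graph (a variant of Eq. 1.48) and solving the generalized eigenproblem of Eq. (1.51) — is given below; the PCA preprocessing of the face vectors is assumed to have been applied already:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, k=5, t=1.0, r=20):
    """X : (d, n) matrix of PCA-projected face vectors as columns.
       Returns W_LPP, the r directions with the smallest eigenvalues."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]              # k nearest neighbors
        S[i, nbrs] = np.exp(-D2[i, nbrs] / t)          # heat-kernel weights
    S = np.maximum(S, S.T)                             # symmetrize the graph
    D = np.diag(S.sum(axis=1))
    L = D - S                                          # graph Laplacian
    # generalized eigenproblem X L X^T u = lambda X D X^T u (Eq. 1.51)
    w, V = eigh(X @ L @ X.T, X @ D @ X.T)
    return V[:, :r]                                    # r smallest eigenvalues
```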
Fig. 1.8 Comparison between PCA and LDA for all frontal face images of the first four face classes
of the Radboud face database [44]. The plots represent a three-dimensional embedding of the face
space: a PCA manifold embedding, showing that the class separability is not well preserved with
face samples from different classes mixed; b LDA manifold embedding, showing that faces samples
of different classes are well separated in the LDA space

The LPP algorithm can be understood as a way of finding an optimal linear approximation to the eigenfunctions of the Laplace–Beltrami operator on the face manifold, where LPP preserves the local data structure [53]. Since LPP is non-orthogonal, in some situations this may cause difficulties in adequately projecting the data on the manifold embedding.
Therefore, the Orthogonal Locality Preserving Projections (OLPP) method [5] was proposed to improve the local data structure preservation and to increase the face discrimination capability in the manifold embedding. The OLPP method is based on a mapping applied to the projection matrix that modifies the Laplacianfaces to make them orthogonal, so that the metric structure is better preserved, as discussed next.

1.3.3 Orthogonal Laplacianfaces for Face Recognition


Eigenfaces and Fisherfaces are two of the most popular methods used to perform face recognition; nevertheless, these methods only account for flat (Euclidean) spaces. Laplacianfaces was proposed in order to preserve the manifold structure, modeling local relationships among face vectors. However, the data projection performed by Laplacianfaces is non-orthogonal, which may cause local distortions. The OLPP method was proposed to guarantee that the data projections are orthogonal, preserving the metric structure of the face space as well as the class structure of the face vectors.
In fact, OLPP is an extension of the LPP method, and its first step is to apply a PCA projection and then discard the components with zero eigenvalues (i.e., the zero eigenvalues of the PCA transformation matrix W_{PCA}). In this way, the extracted face features are statistically uncorrelated and the number of dimensions is equal to the rank of
the new data matrix. Next, the weights are computed that measure the similarities
between face vectors. If face vector xi and face vector xj are connected then:

S_{ij} = e^{-\|x_i - x_j\|^2 / t},   (1.55)

otherwise, Sij is set to zero. The matrix S captures the local structure of the face
manifold.
The orthogonal basis functions are determined next. Let D be the diagonal matrix whose elements are D_{ii} = \sum_j S_{ij}, let L = D − S be a Laplacian matrix, and let {a_1, a_2, \ldots, a_l} be the orthogonal basis vectors; we define:

B(l−1) = [A(l−1) ]T (XDXT )−1 A(l−1) , (1.56)

where A(l−1) = [a1 , . . . , al−1 ].


So, the orthogonal basis vectors {a_1, a_2, \ldots, a_l} are calculated by first computing a_1 as the eigenvector of (XDX^T)^{-1}XLX^T associated with the smallest eigenvalue. Each subsequent a_l is then computed as the eigenvector of:

M(l) = {I − (XDXT )−1 A(l−1) [B(l−1) ]−1 [A(l−1) ]T }(XDXT )−1 XLXT , (1.57)

associated with the smallest eigenvalue of M(l) .


Once the basis vectors {a1 , a2 , . . . , al } are obtained, the OLPP embedding is
computed. Considering WOLPP = [a1 , . . . , al ], the embedding is computed as fol-
lows:
x → y = W_o^T x,   (1.58)

where W_o = W_{PCA} W_{OLPP}; y is an l-dimensional representation of the face vector x; and W_o is the final transformation matrix used to map faces from the higher dimensional space to the lower dimensional space, preserving the metric structure of the face space. Face classification occurs as in the LPP algorithm [49].
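The iterative construction of the orthogonal basis in Eqs. (1.56)–(1.57) can be sketched as follows; X is assumed to be PCA-projected so that X D X^T is invertible, and L and D are built from the similarity matrix S exactly as above:

```python
import numpy as np

def olpp_basis(X, L, D, l):
    """Orthogonal basis vectors a_1, ..., a_l of OLPP (Eqs. 1.56-1.57).
       X : (d, n) PCA-projected face vectors; L, D : (n, n) Laplacian and
       degree matrices of the similarity graph."""
    XDXt_inv = np.linalg.inv(X @ D @ X.T)
    M = XDXt_inv @ (X @ L @ X.T)
    # a_1: eigenvector of (X D X^T)^{-1} X L X^T with the smallest eigenvalue
    w, V = np.linalg.eig(M)
    A = [V[:, np.argmin(w.real)].real]
    for _ in range(1, l):
        Ap = np.column_stack(A)
        B = Ap.T @ XDXt_inv @ Ap                                  # Eq. (1.56)
        P = np.eye(X.shape[0]) - XDXt_inv @ Ap @ np.linalg.inv(B) @ Ap.T
        w, V = np.linalg.eig(P @ M)                               # Eq. (1.57)
        A.append(V[:, np.argmin(w.real)].real)
    return np.column_stack(A)
```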
There are other manifold learning methods proposed in the literature to recognize faces, such as the Orthogonal Neighborhood Preserving Projections (ONPP) [6]. The ONPP method was designed to preserve both the global and the local geometry of the data, as captured by the aggregated pairwise proximity information, and it uses local neighborhood graphs to capture such structural information. Besides, an objective function is used to minimize the discrepancies between the geometry of neighbors and the global geometry, resulting in a linear transformation that preserves both global and local geometry. Another manifold learning approach is the Discriminative Orthogonal Neighborhood-Preserving Projection (DONPP) method, which models each face vector by its neighbors, constructing for each face vector a patch that includes vectors from the same class and from nearby classes. In this way, face vectors from different classes are projected far away from each other. These patches are adjusted to produce a global alignment matrix, which is used to obtain the eigenvectors of the final face data projection matrix.
1.3.4 A Manifold Based Method for Non-cooperative Face Recognition

The methods discussed in the previous Sections tend to focus on learning manifolds
using face images that have been manually cropped and aligned. In non-cooperative
face recognition it is desirable to automatically detect and analyze a face in the input
image without user cooperation or interaction. One issue with the presented methods
is that changes in the individual appearance may cause the recognition process to
fail. For example, if the individual presents him/herself wearing glasses, a hat or a
beard, this new face sample may be mapped far away from the other faces in the same
class. Another limitation of the methods previously discussed is that several images
of each individual are needed to compose a representative manifold. This may be a
limitation since in practice only a limited number of face samples of each individual
are available.
In order to overcome these issues, an alternative method is proposed here, in which faces in different poses and expressions are automatically detected and segmented, and have their structure represented compactly using a manifold learning technique, facilitating the face recognition process.
In the proposed approach, each face segment is mapped to the manifold separately.
It was determined experimentally that face segments such as hair, iris, skin and mouth
coloration provide good biometric information. Given a color input image, initially
skin regions are detected in the RGB and HSV color spaces [56], and these regions are
used to locate a face. Hair is detected using the Normalized Graph Cut algorithm [57]
applied on a window containing the face, segmenting the skin, hair and background
regions. Once the skin, hair and background segments are detected, the eyes and
the mouth are detected using the Eye Map and Mouth Map methods in the YCbCr color space [58]. Once the eyes and the mouth are detected, an Eyes-Mouth triangle is obtained
using biometric proportions among the eyes and the mouth locations. Also, the iris
is segmented using a circular Hough transform [59], and the mouth segmentation is
refined using Active Shape Models [40].
Once the face segments are detected, a fixed number of pixels is randomly selected
to create biometric feature vectors. However, the representation of a face image may
contain too much data, and we project these biometric feature vectors to a lower
dimensional space using the FINE method [60]. This method uses the Hellinger
distance as an approximation to the Fisher Information Distance in order to learn
dissimilarities between segments of different face images, as follows:
H(p, q) = \sqrt{ \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx }.   (1.59)

where p(x) and q(x) are probability densities describing the statistics of RGB colors
in distinct segmented face regions, smoothed by the KDE method (Kernel Density
Estimator), as in FINE [60].
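A sketch of the Hellinger distance of Eq. (1.59) between two KDE-smoothed one-dimensional pixel-value densities is given below (the sampled pixel values are hypothetical); in FINE the same idea is applied to the RGB color distributions of corresponding face segments:

```python
import numpy as np
from scipy.stats import gaussian_kde

def hellinger(samples_p, samples_q, grid=None):
    """Hellinger distance (Eq. 1.59) between the KDE-smoothed densities
       of two sets of sampled pixel values."""
    if grid is None:
        grid = np.linspace(0.0, 255.0, 512)
    p = gaussian_kde(samples_p)(grid)
    q = gaussian_kde(samples_q)(grid)
    dx = grid[1] - grid[0]
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

# Hypothetical red-channel samples from two segmented skin regions
region_a = np.random.normal(120, 15, size=500)
region_b = np.random.normal(135, 20, size=500)
print(hellinger(region_a, region_b))
```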
Fig. 1.9 Illustrations of a three-dimensional embedding of the face graphs obtained with FINE.
Each plot contains two face images, each represented by a four point graph. The dimensions
(z1 , z2 , z3 ) are the three principal components obtained by the MDS embedding method: a shows
two face graphs, each face graph representing a different face image taken from the same indi-
vidual; and b shows two face graphs representing two different face images (i.e. of two different
individuals). It is possible to see that similar face images have a good graph matching, while dis-
similar face images usually have dissimilar face graphs. The distances between vertexes associated
to the same face image segment are represented by red arrows

FINE also performs dimensionality reduction using Multi-Dimensional Scaling (MDS) [61], allowing a single face to be mapped to four points in a manifold with
reduced dimensions. With a sampled manifold, face recognition is performed com-
paring geometrical similarities between faces on the manifold, where a single face
can be considered as a graph on the manifold surface. An example of the FINE
manifold containing face graphs is presented in Fig. 1.9, where the axes (z1 , z2 , z3 )
are the three main components obtained by the MDS embedding method for a three-
dimensional visualization. One advantage of the proposed method is to deal better
with pose and expression variations than the other methods previously discussed,
since pixel distributions sampled from the face segments are similar for different
expressions and poses, even using fewer training samples.

1.3.5 Experimental Results


Although recent methods for face recognition achieve high recognition rates, most of them require the input face image to be cropped and pose-normalized. Some methods try to automate all processing steps, performing face detection, alignment and face recognition. In particular, most of the proposed methods tend to fail when there are expression or pose changes. Nevertheless, the method proposed in Sect. 1.3.4 (by Soldera and Scharcanski) was designed to perform face recognition and to overcome some of these issues using a manifold learning strategy.
To evaluate and compare the performances of the above mentioned face recog-
nition methods, some experiments were conducted with the Radboud face database
(RAFD) [44], which contains faces with different expressions. Table 1.2 shows the
results obtained with these methods, varying the dimensionality parameter d and
Table 1.2 Face recognition accuracies (in average) obtained for all 66 individuals in the Radboud
face database (RAFD), by varying the dimensionality parameter d (the proposed method has a fixed
dimensionality of 4, and obtained an accuracy of 93.9 % in these tests)
Dimensions 5 (%) 10 (%) 15 (%) 20 (%) 25 (%) 30 (%) 35 (%)
Isomap [11] 61.1 81.8 86.3 89.3 89.3 87.8 87.8
LLE [45] 57.5 84.8 83.3 89.3 84.8 83.3 86.3
PCA [4] 65.1 81.8 86.3 89.3 89.3 87.8 87.8
OLPP [5] 21.2 33.3 50.0 59.0 68.1 77.2 77.2
LPP [5] 25.7 45.4 51.5 62.1 60.6 68.1 68.1

Table 1.3 Face recognition accuracies obtained as the number of tested individuals is increased,
from the first 10 individuals to all 66 individuals in the Radboud face database (RAFD)
Individuals Best d 10 (%) 20 (%) 40 (%) 66 (%)
Proposed 4 100 100 95.0 93.9
Isomap [11] 20 90.0 85.0 92.5 89.3
LLE [45] 20 90.0 85.0 90.0 89.3
PCA [4] 20 80.0 85.0 92.5 89.3
OLPP [5] 30 70.0 75.0 82.5 77.2
LPP [5] 30 90.0 75.0 82.5 68.1
The dimensionality parameter d indicated maximizes the classification accuracy obtained for each
method. The proposed method has a fixed dimensionality of 4

using all individuals available in the face database. We used faces with angry and
happy expressions for training, and faces with fearful expressions for testing. In these
experiments, the faces were automatically segmented, pose and intensity normalized
to consider a non-cooperative scenario. For the pose normalization, the centers of
the irises are detected, and the face image is rotated so the line connecting the irises
centers is horizontal. In terms of intensity normalization, the color image is trans-
formed into a luminance image, and the gray levels are normalized to [0, 1]. Based
on these experiments, we chose the parameters (including the dimensionality para-
meter d) that maximize the performance of each one of the tested methods shown in
Table 1.2. It shall be observed that the proposed method has a fixed dimensionality of 4.
We also conducted another set of experiments simulating the increase in the
number of individuals (i.e., of faces to be recognized). Table 1.3 shows the results
obtained in 4 situations (i.e., by taking the first 10, 20, 40 and 66 individuals of the
full database). For each individual, different facial expressions were considered as
in a non-cooperative scenario. These results were obtained, for each method, with the dimensionality parameter d that maximizes its average classification accuracy in the tests reported in Table 1.2. The proposed method has a fixed dimensionality of 4, since FINE approximates the original manifold by a 4-dimensional space before applying MDS for visualization.
1.4 Concluding Remarks


Differentiable manifolds and manifold methods are rapidly gaining importance in
image analysis and face recognition. Manifolds can be introduced in various contexts
within image analysis, with the potential to simplify complicated problems related
to image representation and operations on image data. On the one hand, this is a
consequence of the nonlinearity of general manifolds, hence providing more realistic
models of complex data configurations. On the other hand, the powerful formalism
of differential geometry provides well-established methods for describing manifold-
valued data such as probability distributions and image shapes.
There is a lot of ongoing research in the area of “translation” of known methods and
algorithms from pattern recognition and statistics, which were designed to operate
on data in a Euclidean space, to the more general setting of curved manifolds. This is
rarely a straightforward operation, due to the more complicated nature of a manifold
compared to a vector space. However, it usually results in much more powerful
techniques, because they operate on a more natural representation of the data, which
better reflects the data configuration or the intrinsic nature of the face vectors.
Also, there has been a lot of discussion on how to use the advantages of mani-
fold learning and data dimensionality reduction in face recognition. A reconstructed
manifold embedded in a high dimensional space preserves some desired features of
the original manifold when necessary, facilitating large data manipulation as in face
recognition. Once the manifold is learnt, new subjects can be mapped to the obtained
lower dimensional space, where face classification may take place in a very flexible
way (so changes in head pose or facial expression can be accommodated easily).
This approach also may be useful in applications such as person re-identification,
and in other biometric modalities.
Manifold learning is a promising research area in computer vision and non-
cooperative facial recognition, especially because it can accommodate different head
poses and facial expressions in a unified scheme. As an illustration, we have presented
preliminary results of a promising new method that can perform face recognition in a
scenario where facial expressions and head pose may change in an unpredictable way.
Future work in this area shall look into non-cooperative face recognition methods
based on rich face representations, including some biometric features often neglected
by the current methods (e.g. iris, hair color, skin tones, etc.), at the same time allow-
ing to overcome some of the issues with the methods currently proposed in the
literature.

References

1. O’Neill B (1982) Elementary differential geometry, second revised edition. Academic Press,
New York
2. Gray A, Abbena E, Salamon S (2006) Modern differential geometry of curves and surfaces
with mathematica, 3rd edn. Chapman & Hall/CRC, Boca Raton
3. do Carmo MP, Flaherty F (1992) Riemannian geometry. Birkhäuser, Boston
4. Turk M, Pentland A (1991) Eigenfaces for recognition. J Cogn Neurosci 3(1):71–86


5. Cai D, He X, Han J, Zhang HJ (2006) Orthogonal laplacianfaces for face recognition. IEEE
Trans Image Process 15(11):3608–3614
6. Kokiopoulou E, Saad Y (2007) Orthogonal neighborhood preserving projections: a projection-
based dimensionality reduction technique. IEEE Trans Pattern Anal Mach Intell 29(12):
2143–2156
7. Lee JA, Verleysen M (2010) Nonlinear dimensionality reduction. Springer, New York
8. Izenman AJ (2008) Modern multivariate statistical techniques. Regression, classification, and
manifold learning. Springer, New York
9. Hastie T, Stuetzle W (1989) Principal curves. J Am Stat Assoc 84:502–516
10. Leblanc M, Tibshirani R (1994) Adaptive principal surfaces. J Am Stat Assoc 89:53–64
11. Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear
dimensionality reduction. Science 290:2319–2323
12. Devijver PA, Kittler J (1982) Pattern recognition: a statistical approach. Prentice-Hall, London
13. Cramér H (1946) A contribution to the theory of statistical estimation. Skandinavisk Aktuari-
etidskrift 29:85–94
14. Rao CR (1945) Information and accuracy attainable in the estimation of statistical parameters.
Bull Calcutta Math Soc 37:81–89
15. Amari S, Nagaoka H (2000) Methods of information geometry. Transactions of Mathematical
Monographs, vol 191. American Mathematical Society, New York
16. Murray M, Rice J (1993) Differential geometry and statistics. Monographs on Statistics and
Applied Probability, vol 48. Chapman and Hall, New York
17. Kass RE, Vos PW (1997) Geometrical foundations of asymptotic inference. Wiley Series in
Probability and Statistics. Wiley-Interscience, New York
18. Čenkov NN (1982) Statistical decision rules and optimal inference. Translations of Mathemat-
ical Monographs, vol 53. American Mathematical Society, Providence
19. Berman AP, Shapiro LG (1999) A flexible image database system for content-based retrieval.
17th Int Conf Pattern Recogn 75(1–2):175–195
20. Verdoolaege G, Scheunders P (2011) Geodesics on the manifold of multivariate generalized
gaussian distributions with an application to multicomponent texture discrimination. Int J Com-
put Vis 95(3):265–286
21. Lenglet C, Rousson M, Deriche R (2006) DTI segmentation by statistical surface evolution.
IEEE Trans Med Imaging 25(6):685–700
22. Lenglet C, Rousson M, Deriche R, Faugeras O (2006) Statistics on the manifold of multivariate
normal distributions: theory and application to diffusion tensor MRI processing. J Math Imaging
Vis 25(3):423–444
23. Castano-Moraga CA, Lenglet C, Deriche R, Ruiz-Alzola J (2007) A riemannian approach to
anisotropic filtering of tensor fields. Signal Process 87(2):263–276
24. Mallat S (1989) A theory for multiresolution signal decomposition: the wavelet representation.
IEEE Trans Pattern Anal Mach Intell 11(7):674–692
25. Verdoolaege G, Scheunders P (2011) On the geometry of multivariate generalized gaussian
models. J Math Imaging Vis 43(3):180–193
26. Arwini K, Dodson CTJ (2008) Information geometry. Near randomness and near independence.
Springer, New York
27. Burbea J, Rao CR (1982) Entropy differential metric, distance and divergence measures in
probability spaces: a unified approach. J Multivar Anal 12(4):575–596
28. Verdoolaege G, Karagounis G, Murari A, Vega J, Van Oost G, JET-EFDA Contributors (2012)
Modeling fusion data in probabilistic metric spaces: applications to the identification of con-
finement regimes and plasma disruptions. Fusion Sci Technol 62(2):356–365
29. Gómez E, Gómez-Villegas MA, Marín JM (1998) A multivariate generalization of the power
exponential family of distributions. Commun Stat Theory Methods 27(3):589–600
30. Mardia KV, Kent JT, Bibby JM (1982) Multivariate analysis. Academic Press, London
31. Skovgaard LT (1984) A Riemannian geometry of the multivariate normal model. Scand J Stat
11(4):211–223
32. Berkane M, Oden K, Bentler P (1997) Geodesic estimation in elliptical distributions. J Multivar
Anal 63(1):35–46
33. Calvo M, Oller JM (2002) A distance between elliptical distributions based in an embedding
into the Siegel group. J Comput Appl Math 145(2):319–334
34. MIT Vision and Modeling Group (2010) Vision texture. Online at http://vismod.media.mit.edu/vismod/imagery/VisionTexture/
35. Kendall D (1984) Shape manifolds, procrustean metrics, and complex projective spaces. Bull
Lond Math Soc 16:81–121
36. Dryden IL, Mardia KV (1998) Statistical shape analysis. Wiley, New York
37. Small CG (1996) The statistical theory of shape. Springer, New York
38. Bhattacharya A, Bhattacharya R (2012) Nonparametric inference on manifolds. With applica-
tions to shape spaces. Cambridge University Press, Cambridge
39. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features.
Proc IEEE Conf Comput Vis Pattern Recogn 1:511–518
40. Cootes TF, Taylor CJ, Cooper DH, Graham J (1995) Active shape models—their training and
application. Comput Vis Image Underst 61:38–59
41. Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal
Mach Intell 23:681–685
42. Vukadinovic D, Pantic M (2005) Fully automatic facial feature point detection using gabor
feature based boosted classifiers. IEEE Int Conf Syst Man Cybern 2:1692–1698
43. Hotelling H (1933) Analysis of a complex of statistical variables into principal components.
J Educ Psychol 24(6):417–441
44. Langner O, Dotsch R, Bijlstra G, Wigboldus DHJ, Hawk ST, van Knippenberg A (2010)
Presentation and validation of the radboud faces database. Cogn Emot 24(8):1377–1388
45. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding.
Science 290:2323–2326
46. Thomaz CE, Giraldi GA (2010) A new ranking method for principal components analysis and
its application to face image analysis. Image Vis Comput 28:902–913
47. Weinberger KQ, Saul LK (2006) An introduction to nonlinear dimensionality reduction by
maximum variance unfolding. Proc Natl Conf Artif Intell 2:1683–1686
48. Belhumeur PN, Hespanha JP, Kriegman DJ (1997) Eigenfaces vs. fisherfaces: recognition using
class specific linear projection. IEEE Trans Pattern Anal Mach Intell 19(7):711–720
49. He X, Yan S, Hu Y, Niyogi P, Zhang H (2005) Face recognition using laplacianfaces. IEEE
Trans Pattern Anal Mach Intell 27(3):328–340
50. Chang Y, Hu C, Turk M (2003) Manifold of facial expression. IEEE international workshop
on analysis and modeling of faces and gestures, pp 28–35
51. Shashua A, Levin A, Avidan S (2002) Manifold pursuit: a new approach to appearance based
recognition. Int Conf Pattern Recognit 3:590–594
52. Saul LK, Roweis ST (2003) Think globally, fit locally: unsupervised learning of low dimen-
sional manifolds. J Mach Learn Res 4:119–155
53. Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and
clustering. Adv Neural Inf Process Syst 14:585–591
54. He X, Niyogi P (2003) Locality preserving projections. Adv Neural Inf Process Syst 16:
153–160
55. Chung FRK (1997) Spectral graph theory. American Mathematical Society, Providence
56. Kovac J, Peer P, Solina F (2003) Human skin color clustering for face detection. Computer as
a Tool. IEEE Reg 8 2:144–148
57. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach
Intell 22(8):888–905
58. Hsu R-L, Abdel-Mottaleb M, Jain AK (2002) Face detection in color images. IEEE Trans
Pattern Anal Mach Intell 24(5):696–706
59. Duda RO, Hart PE (1972) Use of the hough transformation to detect lines and curves in pictures.
Commun ACM 15(1):11–15
60. Carter KM, Raich R, Finn WG, Hero AO (2009) Fine: Fisher information nonparametric
embedding. IEEE Trans Pattern Anal Mach Intell 31(11):2093–2098
61. Cox MAA, Cox TF (2008) Multidimensional scaling. Springer Handbooks of Computational
Statistics. Springer, Berlin
Chapter 2
Remote Identification of Faces

Vishal M. Patel, Jie Ni and Rama Chellappa

Abstract Face recognition in unconstrained acquisition conditions is one of the most challenging problems that has been actively researched in recent years. It is well
known that many state-of-the-art still image-based face recognition algorithms per-
form well, when constrained (frontal, well illuminated, high-resolution, sharp, and
full) face images are acquired. However, their performance degrades significantly
when the test images contain variations that are not present in the training images.
In this chapter, we highlight some of the key issues in remote face recognition. We
define the remote face recognition as one where faces are several tens of meters
(10–250 m) from the cameras. We then describe a remote face database which has
been acquired in an unconstrained outdoor maritime environment. The face images in
this database suffer from variations due to blur, poor illumination, pose, etc. Recogni-
tion performance of a subset of existing still image-based face recognition algorithms
is discussed on the remote face data set. It is demonstrated that in addition to applying
a good classification algorithm, finding features that are robust to variations men-
tioned above and developing statistical models which can account for these variations
are very important for remote face recognition.

This work was partially supported by a MURI grant N00014-08-1-0638 from the Office of
Naval Research.

V. M. Patel (B) · J. Ni · R. Chellappa


Department of Electrical and Computer Engineering, Center for Automation Research,
University of Maryland, College Park, MD 20742, USA
e-mail: pvishalm@umiacs.umd.edu
J. Ni
e-mail: jni@umiacs.umd.edu
R. Chellappa
e-mail: rama@umiacs.umd.edu

2.1 Introduction

During the past two decades, face recognition (FR) has received great attention
and tremendous progress has been made [66]. Numerous image-based algorithms
[6, 8, 20, 39, 56, 58, 59, 66] and video-based algorithms [19, 33, 68] have been
developed in the FR community. Currently, most of the existing FR algorithms have
been evaluated using databases which are collected at close range (less than a few
meters) and under different levels of controlled acquisition conditions. Some of the
most extensively used face datasets such as CMU PIE [53], FRGC/FERET [46] and
YaleB [23] were captured in constrained settings, with studio lights to control the
illumination variations while pose variations are controlled by cooperative subjects.
While FR techniques on these datasets have reached a high level of recognition
performance over the years, research in remote unconstrained FR field is still at a
nascent stage. Recently, a new database called “Labeled Faces in the Wild” (LFW)
[29] whose images are collected from the web, has been frequently used to address
some of the issues in unconstrained FR problem. Yet concerns have been raised
that these images are typically posed and framed by photographers and there is no
guarantee that such a set accurately captures the range of variations found in the real
world setting [47]. The authors of [61] describe a face video database, UTK-LRHM, acquired from long distances with high magnifications, in which magnification blur is described as a major source of degradation in their data. Figure 2.1 shows some sample images
from the FERET, FRGC and LFW datasets, respectively.
In this chapter, we address some of the issues related to the problem of FR when
face images are captured in an unconstrained and remote setting. As one has little control over the acquisition of the face images, the images one gets often suffer from low
resolution, poor illumination, blur, pose variation and occlusion. These variations
present serious challenges to existing FR algorithms. We provide a brief review of
developments and progress in remote face recognition. Furthermore, we introduce
a new dataset which was collected in a remote maritime environment [17]. We pro-
vide some preliminary experimental studies on this dataset and offer insights and
suggestions for the remote FR problem.
The organization of this chapter is as follows: In Sect. 2.2, we describe some
of the issues that arise in remote face recognition. Section 2.3 briefly discusses
image quality measures for face images. In Sect. 2.4, we describe the remote face
database collected by the authors’ group. Section 2.5 describes the algorithms that are
evaluated and corresponding recognition results are discussed in Sect. 2.6. Finally,
conclusions are given in Sect. 2.7.

2.2 Face Recognition at a Distance

Reliable extraction and matching of biometric signatures from faces acquired at a distance is a challenging problem [55]. First, as the subjects may not be cooperative,
the pose of the face and body relative to the sensor is likely to vary greatly. Second,
Fig. 2.1 a Sample images from the FRGC dataset. b Sample images from the FERET dataset.
c Sample images from the LFW dataset

the lighting is uncontrolled and could be extreme in its variation. Third, when the
subjects are at great distances, the effects of scattering media (static: fog and mist,
dynamic: rain, sleet, or sea spray) are greatly amplified. Fourth, the relative motion
between the subjects and the sensors produce jitter and motion blur in the images.
In this section, we investigate various artifacts introduced as a result of long range
acquisition of face images. Some of the factors that can affect long range FR system
performance can be summarized into four types [55]: (1) technology (dealing with
face image quality, heterogeneous face images, etc.), (2) environment (lighting, etc.),
(3) user (expression, facial hair, facial wear, etc.), and (4) user-system (pose, height,
etc.). In what follows, we discuss some of these factors in detail.

2.2.1 Illumination

Variation in illumination conditions is one of the major challenges in remote face recognition. In particular, when images are captured at long ranges, one does not have
any control over lighting conditions. As a result, the captured images often suffer
from extreme (due to sun) or low light conditions (due to shadow, bad weather,
evening, etc.).
The performance of most existing FR algorithms is highly sensitive to illumina-
tion variation. Various methods have been introduced to deal with this illumination
problem in FR. They are based on illumination cone [9, 22], spherical harmonics
[7, 49, 62], quotient images [51, 57], gradient faces [63], logarithmic total variation
[18], albedo estimation [10], photometric stereo [67], and dictionaries [32, 45].
Changes induced by illumination can usually make face images of the same subject
far apart than images of different subjects. One can use estimates of albedo to mitigate
the illumination effect. Albedo is the fraction of light that a surface point reflects when
it is illuminated. It is an intrinsic property that depends on the material properties of
the surface and it is invariant to changes in illumination. Assuming the Lambertian
reflectance model for the facial surface, which is a restrictive assumption, one can
relate the surface normals, albedo and the intensity image by an image formation
model. The diffused component of the surface reflection is given by

x_{i,j} = ρ_{i,j} \max(n_{i,j}^T s, 0),   (2.1)

where x is the pixel intensity, s is the light source direction, ρ_{i,j} is the surface albedo at position (i, j), n_{i,j} is the surface normal of the corresponding surface point and 1 ≤ i, j ≤ N. The max function in (2.1) accounts for the formation of attached shadows. Neglecting the attached shadows, (2.1) can be approximated as

x_{i,j} = ρ_{i,j} \max(n_{i,j}^T s, 0) ≈ ρ_{i,j} n_{i,j}^T s.   (2.2)

Let n_{i,j}^{(0)} and s^{(0)} be the initial values of the surface normal and illumination direction. These initial values can be domain dependent average values. The Lambertian assumption imposes the following constraints on the initial albedo

ρ_{i,j}^{(0)} = \frac{x_{i,j}}{n_{i,j}^{(0)} \cdot s^{(0)}},   (2.3)

where · is the standard dot product operator. Using (2.2), Eq. (2.3) can be re-written as

ρ_{i,j}^{(0)} = ρ_{i,j} \frac{n_{i,j} \cdot s}{n_{i,j}^{(0)} \cdot s^{(0)}} = ρ_{i,j} + ρ_{i,j} \frac{n_{i,j} \cdot s − n_{i,j}^{(0)} \cdot s^{(0)}}{n_{i,j}^{(0)} \cdot s^{(0)}}   (2.4)

= ρ_{i,j} + ω_{i,j},   (2.5)

where ω_{i,j} = \frac{n_{i,j} \cdot s − n_{i,j}^{(0)} \cdot s^{(0)}}{n_{i,j}^{(0)} \cdot s^{(0)}} ρ_{i,j}. This can be viewed as a signal estimation problem where ρ_{i,j} is the original signal, ρ^{(0)} is the degraded signal and ω is the signal
dependent noise. Using this model, the albedo map can be estimated using the method
of minimum mean squared error criterion [10]. The illumination-free albedo image
can then be used for recognition. Figure 2.2 shows the results of albedo estimation
for two face images acquired at a distance using the method presented in [10].
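The initial albedo of Eq. (2.3) under the Lambertian model can be sketched as follows; the average normal field and light direction used here are simple placeholders, whereas in [10] they are domain-dependent averages and the initial estimate is subsequently refined with the MMSE criterion:

```python
import numpy as np

def initial_albedo(image, normals0, s0, eps=1e-6):
    """Eq. (2.3): rho^(0)_{i,j} = x_{i,j} / (n^(0)_{i,j} . s^(0)).

    image    : (N, N) intensity image x.
    normals0 : (N, N, 3) initial (average) surface normals n^(0).
    s0       : (3,) initial light source direction s^(0).
    """
    shading = normals0 @ s0                     # n^(0)_{i,j} . s^(0)
    return image / np.maximum(shading, eps)     # avoid division by zero

# Placeholder inputs: a flat frontal normal field and frontal lighting
img = np.random.rand(64, 64)
n0 = np.tile(np.array([0.0, 0.0, 1.0]), (64, 64, 1))
rho0 = initial_albedo(img, n0, np.array([0.0, 0.0, 1.0]))
```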
Fig. 2.2 Results of albedo estimation for remotely acquired images. Left Original images; Right Estimated albedo images

2.2.2 Pose Variation

Pose variation can be considered as one of the most important and challenging prob-
lems in face recognition. Magnitudes of variations of innate characteristics, which
distinguish one face from another, are often smaller than magnitudes of image vari-
ations caused by pose variations [64]. Popular frontal face recognition algorithms,
such as Eigenfaces [56] or Fisherfaces [8, 20], usually have low recognition rates
under pose changes as they do not take into account the 3D alignment issue when
creating the feature vectors for matching.
Existing methods for face recognition across pose can be roughly divided into two
broad categories: techniques that rely on 3D models and 2D techniques. Some of the
methods that use 3D information include [12] and [13]. One of the advantages of
using 2D methods is that they do not require the 3D prior information for performing
pose-invariant face recognition [15, 24, 48]. Image patch-based approaches have
also received significant attention in recent years [4, 5, 16, 31, 35]. Note that some
of these methods are highly sensitive to variations in illumination, resolution, blur
and occlusion.
Let n̄_{i,j}, s̄ and Θ̄ be initial estimates of the surface normals, the illumination direction and the pose, respectively. Then, the initial albedo at pixel (i, j) can be obtained by

ρ̄_{i,j} = \frac{x_{i,j}}{n̄_{i,j}^{Θ̄} \cdot s̄},   (2.6)

where n̄_{i,j}^{Θ̄} denotes the initial estimate of the surface normals in pose Θ̄. Using this
model, we can re-formulate the problem of recovering albedo as a signal estimation
problem. Using arguments similar to Eq. (2.3), we get the following formulation for
the albedo estimation problem in the presence of pose
Fig. 2.3 Pose normalization. Left column Original input images. Middle column Recovered albedos corresponding to frontal face images. Right column Pose normalized relighted images

ρ̄_{i,j} = ρ_{i,j} h_{i,j} + ω_{i,j},   (2.7)

where ω_{i,j} = \frac{n̄_{i,j}^{Θ} \cdot s − n̄_{i,j}^{Θ} \cdot s̄}{n̄_{i,j}^{Θ̄} \cdot s̄} ρ_{i,j}, \quad h_{i,j} = \frac{n̄_{i,j}^{Θ} \cdot s̄}{n̄_{i,j}^{Θ̄} \cdot s̄}, \quad ρ_{i,j} is the true albedo and ρ̄_{i,j} is the
degraded albedo. Using this model, a pose-robust albedo estimation method was
proposed in [12]. Figure 2.3 shows some examples of pose normalized images using
this method. Once pose and illumination have been normalized, one can use these
images for illumination and pose-robust recognition [44, 54].
Face recognition methods require gallery and probe faces to be aligned. Methods
for pose alignment need the extraction of feature points. It is our experience that
existing methods for point feature extraction are not very effective on remote faces,
mainly due to illumination and blur.

2.2.3 Occlusion

Another challenge in remote FR is that, since long-range face acquisition usually involves non-cooperative subjects, the acquired images are often contaminated by occlusion. The occlusion may be the result of the subject wearing sunglasses, a scarf, a hat or a mask. Recognizing subjects in the presence of occlusion requires robust techniques for classification. One such technique was developed for principal component analysis in [14]. The recently developed algorithm for FR using sparse representation is also
shown to be robust to occlusion [59]. Figure 2.4 shows some images with occlusion
from the remote face dataset.

Fig. 2.4 Some occluded face images in the remote face dataset

2.2.4 Blur

In remote FR, the distance between the subject and the sensor spans a wide spatial range. This in turn results in out-of-focus blur in the captured image. Often, aberrations of the imaging optics cause non-ideal focusing. Motion blur is another
phenomenon that occurs when the subject is moving rapidly or the camera is shaking.
[2, 43] are some of the methods that attempt to address this issue in face recognition.
In [2] local phase quantization is used to recognize blurred face images. Local phase
quantization is based on quantizing the Fourier transform phase in local neighbor-
hoods. It is shown that the quantized phase is blur invariant when certain conditions
are met. In [43], a method of inferring a point spread function (PSF) representing
the process of blur on faces is presented. The method uses learned prior informa-
tion derived from a training set of blurred faces to make the ill-posed problem more
tractable.
In the remote acquisition setting, blur is often coupled with illumination
variations. It might be desirable to develop an algorithm that can compensate for
both blur and illumination simultaneously. Given the N × N arrays y and x, repre-
senting the observed image and the image to be estimated, respectively, the matrix
deconvolution problem can be described as

y = Hx + γ , (2.8)

where y, x, and γ are N² × 1 column vectors representing the arrays y, x, and γ lexicographically ordered, H is the N² × N² matrix that models the blur operator, and γ denotes an N × N array of noise samples. Using the Lambertian model (2.2),
Eq. (2.8) can be re-written as

y = Hx + γ = HΦρ + γ
= Gρ + γ , (2.9)

Fig. 2.5 Image deconvolution experiment with a face image from the remote face dataset. a Original
image. b Noisy blurred image. c Parametric manifold-based estimate

Fig. 2.6 A typical low-resolution face image in remote face dataset

where Φ = diag(n^T_{i,j} s) is of size N² × N², ρ is the N² × 1 vector representing the albedo array ρ, and G = HΦ. Having observed y, the general inverse problem is to estimate ρ with
incomplete information of G. It is well-known that regularization is often used to find
a unique and stable solution to the ill-posed inverse problems. One such regularization
method using patch manifold prior was developed in [42]. Figure 2.5 shows an
example of a face image that is deblurred using the method in [42].
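The role of regularization in solving (2.8)–(2.9) can be illustrated with a minimal sketch. The following Python/NumPy snippet performs a simple frequency-domain Tikhonov-regularized deconvolution; it is only an illustrative stand-in for the patch-manifold prior of [42], and the blur kernel, regularization weight and images are hypothetical.

```python
import numpy as np

def tikhonov_deblur(y, h, lam=1e-2):
    """Frequency-domain Tikhonov-regularized deconvolution:
    x_hat = argmin ||H x - y||^2 + lam ||x||^2, assuming circular convolution.
    Illustrative stand-in for the manifold-prior regularization of [42]."""
    H = np.fft.fft2(h, s=y.shape)                 # transfer function of the blur
    Y = np.fft.fft2(y)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + lam)   # regularized inverse filter
    return np.real(np.fft.ifft2(X))

# Hypothetical usage: a 9x9 uniform (defocus-like) blur plus small additive noise
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((64, 64))                      # stand-in for a face image
    h = np.ones((9, 9)) / 81.0
    y = np.real(np.fft.ifft2(np.fft.fft2(h, s=x.shape) * np.fft.fft2(x)))
    y = y + 0.01 * rng.standard_normal(y.shape)   # the noise term gamma in (2.8)
    x_hat = tikhonov_deblur(y, h, lam=1e-2)
    print(x_hat.shape)
```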

2.2.5 Low Resolution

Image resolution is an important parameter in remote face acquisition, where there is no control over the distance of the subject from the camera. Figure 2.6 illustrates a
practical scenario where one is faced with a challenging problem of recognizing
humans when the captured face images are of very low resolution (LR). Many meth-
ods have been proposed in the vision literature that can deal with this resolution
problem in face recognition. Most of these methods are based on some application
of super-resolution (SR) technique to increase the resolution of images so that the
recovered higher-resolution (HR) images can be used for recognition. One of the
major drawbacks of applying SR techniques is that there is a possibility that recov-
ered HR images may contain some serious artifacts. This is often the case when the

resolution of the image is very low. As a result, these recovered images may not look
like the images of the same person and the recognition performance may degrade
significantly.
An Eigen-face domain SR method for FR was proposed in [25]. This method
proposes to solve the FR at LR using super-resolution (SR) of multiple LR images
using their PCA domain representation. Given a LR face image, [30] proposes to
directly compute a maximum likelihood identity parameter vector in the HR tensor
space that can be used for SR and recognition. A Tikhonov regularization method
that can combine the different steps of SR and recognition in one step was proposed
in [27].
Though LR images are not directly suitable for face recognition purposes, it is also not necessary to super-resolve them before recognition, as the problem of recognition is not the same as super-resolution. Based on this motivation, some different approaches to this problem have been suggested. Coupled Metric Learning [36] attempts to solve this problem by mapping the LR image to a new subspace, where higher recognition rates can be achieved. An extension of this method was recently
proposed in [37]. A similar approach for improving the matching performance of
the LR images using multidimensional scaling was recently proposed in [11]. A
log-polar domain-based method was proposed in [28]. Additional methods for LR
FR include correlation filter-based approach [1], a support vector data description
method [34] and a dictionary-based method [52]. 3D face modeling has also been
used to address the LR face recognition problem [38, 50]. There has also been work on solving the problem of unconstrained FR using videos [3].
In practical scenarios, the resolution change is also coupled with other parameters
such as pose change, illumination variations and expression. Algorithms specifically
designed to deal with LR images quite often fail in dealing with these variations.
Hence, it is essential to include these parameters while designing a robust method
for LR face recognition.

2.2.6 Atmospheric and Weather Artifacts

Most current vision algorithms and applications are applied to images captured under clear weather conditions. However, outdoor applications often face adverse weather conditions such as extreme illumination, fog, haze, rain and snow [40, 41, 55]. These extreme conditions can also present additional difficulties in developing robust algorithms for face recognition. Figure 2.7 shows an image collected in a remote setting where the illumination caused by the sun is extreme.

Fig. 2.7 Extreme illumination conditions caused by the sun

2.3 Long Range Facial Image Quality

In FR systems, the ultimate recognition performance depends on how cooperative the subject is, as well as on the image resolution and the illumination variations that are invariably present outdoors. For non-cooperative situations, one can increase performance by combining tracking and recognition technologies. For instance, the system would first track the subject's face and then obtain a series of images of the same person. Using multiple images to recognize an individ-
ual can provide better recognition accuracy than just using a single image. However,
in order to detect and track without false alarms, the system must acquire images of
the subject with sufficient quality and resolution [55].
As discussed in the previous section, various factors could affect the quality of
remotely acquired images. Hence it is essential to derive an image quality measure
to study the relation between image quality and recognition performance. To this
end, a blind signal-to-noise ratio estimator has been defined for facial image quality
[55]. This measure is defined based on the concept that the statistics of image edge
intensity are correlated with noise and SNR estimation [65].
Suppose the probability density function f_{‖∇I‖}(r) of the edge intensity image ‖∇I‖ can be calculated as a mixture of Rayleigh probability density functions. Consider the quantity

Q = \int_{2\mu}^{\infty} f_{\|\nabla I\|}(r)\, dr,

where μ is the mean of ‖∇I‖. It has been shown that the value of Q for the noisy
image is always smaller than the value of Q for the image without noise [65]. Then,
the face image quality is defined as
Q' = \frac{\#\{\text{edge pixels above } 2\mu\}}{\#\{\text{edge pixels}\}} \approx \int_{2\mu}^{\infty} f_{\|\nabla I\|}(r)\, dr.

It has been experimentally verified that the estimator Q' is well correlated with the recognition performance in FR [55]. Hence, setting up a comprehensive metric to evaluate the quality of face images is essential in remote face recognition. Also, these measures can be used to reject images that are of low quality. One can also derive this type of metric based on sparse representation [44, 59].
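To make the quality measure concrete, the short sketch below computes the fraction of edge-intensity values above twice their mean, mirroring the form of Q' above; the gradient operator, the synthetic images and the simplification of using all pixels rather than detected edge pixels are assumptions, not the exact estimator of [55, 65].

```python
import numpy as np

def edge_quality(image):
    """Blind quality estimate in the spirit of Q': fraction of edge-intensity
    (gradient-magnitude) values exceeding twice their mean."""
    gy, gx = np.gradient(image.astype(float))   # simple finite-difference gradient
    edge = np.hypot(gx, gy)                     # edge intensity ||grad I||
    mu = edge.mean()
    return np.count_nonzero(edge > 2.0 * mu) / edge.size

# Hypothetical usage: a smooth synthetic blob and a noisy copy of it
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    yy, xx = np.mgrid[0:128, 0:128]
    clean = np.exp(-((xx - 64.0) ** 2 + (yy - 64.0) ** 2) / (2.0 * 20.0 ** 2))
    noisy = clean + 0.3 * rng.standard_normal(clean.shape)
    print(edge_quality(clean), edge_quality(noisy))
```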

2.4 Remote Face Database

In this section, we introduce a remote face database in which a significant number of images are taken from long distances and under unconstrained outdoor maritime
environment [17]. As discussed in Sect. 2.2, the quality of the images differs in
the following aspects: the illumination is not controlled and is often severe; there
are pose variations and occluded faces due to non-cooperative subjects; finally, the
effects of scattering and high magnification resulting from long distance contribute
to the blurriness of face images.
The distance from which the face images were taken varies from 5 to 250 m
under different scenarios. Since the extraction of faces in the data set using the
existing state-of-the-art face detection algorithms was not reliable and the faces only
occupied small regions in large background scenes, the faces were manually cropped
and rescaled to a fixed size. The resulting database for still color face images contains
17 different individuals and 2,106 face images in total.
The faces were manually labeled according to their type (i.e. different illumination
conditions, occlusion, blur etc.). In total, the database contains 688 clear images, 85
partially occluded images, 37 severely occluded images, 540 images with medium
blur, 245 with severe blur, and 244 in poor illumination conditions. The remaining
images have two or more coupled conditions, such as poor lighting and blur, occlusion
and blur etc. Figure 2.8 shows two sample images acquired in a remote maritime
setting. Some of the extracted images from the database are shown in Fig. 2.9. Note
that some of these face images contain extreme variations which makes recognition
very difficult even for humans.

2.5 Algorithms

We present the results of two state-of-the-art FR algorithms on the remote face database and compare their recognition performance [17]. More experimental results
using a dictionary-based method can be found in [44].

Fig. 2.8 Typical images illustrating the different scenarios in the maritime domain

Fig. 2.9 Cropped face images with different variations from the remote face database

2.5.1 Baseline Algorithm

The baseline recognition algorithm used performs Principal Component Analysis (PCA) [60] followed by Linear Discriminant Analysis (LDA) [8, 20] and a Support Vector Machine (SVM) [26].
LDA is a well-known method for feature extraction and dimensionality reduction
for pattern recognition and classification tasks. It uses class specific linear methods
for dimensionality reduction. It selects projection matrix A in such a way that the
ratio of the between-class scatter and the within-class scatter is maximized [20]. The
criterion function is defined as

A_{opt} = \arg\max_{A} \frac{|A^T \Sigma_B A|}{|A^T \Sigma_W A|}

where |.| denotes determinant of a matrix, Σ B and ΣW are between-class and within-
class scatter matrices, respectively.
The within-class scatter matrix becomes singular when the dimension of the data is
larger than the number of training samples. In order to deal with this, we first use PCA
as a dimensionality reduction method to project the raw data onto an intermediate
feature space with much lower dimension. Then, LDA is applied on features from
this intermediate PCA space. It is well known that LDA is not feasible when there
is only one image per subject. To further mitigate this small sample size problem of
LDA, we impose a regularizer in the LDA objective function

a_{opt} = \arg\max_{a} \frac{a^T \Sigma_B a}{a^T \Sigma_W a + \alpha J(a)}

where the optimal a's form the columns of the optimal projection matrix A_{opt}. In our application, we use the Tikhonov regularizer J(a) = ‖a‖²₂. The resulting method is often known as Regularized Discriminant Analysis (RDA) [21]. Then, the
low-dimensional discriminant features from RDA are fed into a linear SVM for
classification.
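A minimal sketch of this baseline pipeline, assuming scikit-learn is available, is given below; shrinkage (eigen-solver) LDA is used as a stand-in for the Tikhonov-regularized RDA described above, and the component count, SVM penalty and data arrays are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_baseline(n_components=100):
    """PCA for dimensionality reduction, regularized LDA for discriminant
    features, and a linear SVM for classification.  Shrinkage LDA is used
    here as a stand-in for the Tikhonov-regularized RDA of [21]."""
    return make_pipeline(
        PCA(n_components=n_components),
        LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto"),
        LinearSVC(C=1.0),
    )

# Hypothetical usage with vectorized face images X (one row per image) and labels y
if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.random((170, 32 * 32))      # 170 stand-in gallery images of 32x32 pixels
    y = np.repeat(np.arange(17), 10)    # 17 subjects, 10 images each
    model = build_baseline(n_components=50).fit(X, y)
    print(model.predict(X[:5]))
```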

2.5.2 Sparse Representation-Based Algorithm

A sparse representation-based classification (SRC) algorithm for FR was proposed in [59] which was shown to be robust to noise and occlusion. The comparison with
other existing FR methods in [59] suggests that the SRC algorithm is among the best.
Hence, we treat it as state-of-the-art and use it for comparisons in this chapter.
The idea proposed is to create a dictionary matrix of the training samples as
column vectors. The test sample is also represented as a column vector. Different
dimensionality reduction methods are used to reduce the dimension of both the test

vector and the vectors in the dictionary. It is then simply a matter of solving an ℓ₁
minimization problem in order to obtain the sparse solution. Once the sparse solution
is obtained, it can provide information as to which training sample the test vector
most closely relates to.
Let each image be represented as a vector in Rn , D be the dictionary (i.e. training
set) and y be the test image. The SRC algorithm is as follows:
1. Create a matrix of training samples D = [D1 , . . . , Dk ] for k classes, where Di ,
i = 1, . . . , k are the set of images of each class.
2. Reduce the dimension of the training images and a test image by any dimension-
ality reduction method. Denote the resulting dictionary and the test vector as D̃
and ỹ, respectively.
3. Normalize the columns of D̃ and ỹ.
4. Solve the following ℓ₁ minimization problem:

\hat{\alpha} = \arg\min_{\alpha'} \|\alpha'\|_1 \text{ subject to } \tilde{y} = \tilde{D}\alpha',    (2.10)

5. Calculate the residuals

r_i(\tilde{y}) = \|\tilde{y} - \tilde{D}\,\delta_i(\hat{\alpha})\|_2,

for i = 1, . . . , k, where δ_i is a characteristic function that selects the coefficients associated with the ith class.
6. Identity(y) = \arg\min_i r_i(\tilde{y}).
The assumption made in this method is that given sufficient training samples of the
kth class, D̃k , any new test image y that belongs to the same class will approximately
lie in the linear span of the training samples from the class k. This implies that most
of the coefficients not associated with class k in α̂ will be close to zero. Hence, the solution α̂ is a sparse vector. This algorithm can also be extended to deal with occlusions and
random noise. Furthermore, a method of rejecting invalid test samples can also be
introduced within this framework. In particular, to decide whether a given test sample
is a valid sample or not, the notion of Sparsity Concentration Index (SCI) has been
proposed in [59]. See [59] for more details.
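The decision rule above can be sketched as follows in Python; an ℓ1-penalized least-squares (Lasso) solver is used as an approximate stand-in for the equality-constrained problem (2.10), and the dictionary, labels, test vector and penalty weight are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(D, labels, y, alpha=0.01):
    """Sparse representation-based classification.  D: (d, n) matrix of
    column-stacked training samples, labels: (n,) class of each column,
    y: (d,) test vector.  An l1-penalized least-squares solver approximates
    the equality-constrained problem (2.10)."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)   # step 3: normalize columns
    y = y / np.linalg.norm(y)
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(D, y)
    coef = lasso.coef_                                 # approximate sparse code (step 4)
    residuals = {}
    for c in np.unique(labels):
        delta_c = np.where(labels == c, coef, 0.0)     # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - D @ delta_c) # step 5: class residual
    return min(residuals, key=residuals.get)           # step 6: smallest residual wins

# Hypothetical usage: 3 classes, 20 samples each, 100-dimensional features
if __name__ == "__main__":
    rng = np.random.default_rng(3)
    labels = np.repeat(np.arange(3), 20)
    D = rng.standard_normal((100, 60)) + labels[None, :]   # crude class structure
    probe = D[:, 5] + 0.05 * rng.standard_normal(100)      # noisy copy of a class-0 sample
    print(src_classify(D, labels, probe))
```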

2.6 Experimental Results

In this section, we report experimental results using the algorithms described in Sect. 2.5 on the remote face dataset [17].
The first set of experiments was designed to test the effectiveness of albedo maps
[10]. We select the gallery set from clear images, and gradually increase the number of gallery faces from one to fifteen images per subject. Each time the gallery images
are chosen randomly. We repeat the experiments five times and take the average
to arrive at the final recognition result. All the remaining clear images are selected

[Plot: rank-1 recognition rate versus the number of gallery images per subject, for albedo maps and intensity (pixel) images.]

Fig. 2.10 Comparison of intensity images and albedo maps using baseline

for testing. We compare the albedo maps with the intensity images as input to the
baseline. All the parameters of PCA, LDA and SVM are fine tuned. The results are
shown in Fig. 2.10. We found that intensity images outperform albedo maps although
the albedoes are intended to compensate for illumination variations. One possible
reason is that the face images in the database are sometimes a bit away from frontal.
As albedo estimation needs a good alignment between the observed images and the
ensemble mean, the estimated albedo map is erroneous. These artifacts are also seen
in Fig. 2.2. On the other hand, intensity images contain texture information which
can partly counteract variations induced by pose.
In the second set of tests, we repeat the gallery selection as in the first set of exper-
iments, and select the test images to be clear, poorly illuminated, medium blurred,
severely blurred, partially occluded and severely occluded respectively. The intensity
images are used as input. The rank-1 recognition results using baseline are given in
Fig. 2.11. We observe that the degradations in the test images decrease the perfor-
mance, especially when the faces are occluded and severely blurred.
In the third set of experiments, we compare the SRC method and baseline algo-
rithm. We selected 14 subjects with 10 clear images per subject to form the gallery
set. The test images are selected to contain clear, blurred, poorly illuminated and
occluded images respectively. For the SRC method, we compute the SCI value of
each image which can be used as a measure to reject images of low quality.
From the comparison results reported in Fig. 2.12, we observe that when no
rejection of test images is allowed, the recognition accuracy of baseline is superior
to the SRC method. This may be the case because when gallery images do not contain
variations that occur in the test images, the SRC method cannot approximate the test images correctly through the linear span of the gallery images. However, when rejection of images is allowed, we remove images whose SCI values are below a certain threshold,
and the performance of SRC method increases accordingly. The rejection rates in

[Plot: rank-1 recognition rate versus the number of gallery images per subject, for clear, poorly illuminated, medium blurred, severely blurred, partially occluded and severely occluded probes.]

Fig. 2.11 Performance of the baseline algorithm as the condition of probe varies

Fig. 2.12 Comparison between SRC and baseline algorithms

Fig. 2.12 are 6, 25.11, 38.46 and 17.33 % when the test images are clear, poorly
lighted, occluded and blurred, respectively. We also observe the advantage of the SRC method in handling occluded images.

2.7 Discussion and Conclusion

In this chapter, we briefly discussed some of the key issues in remote face recognition.
We then described a remote face database collected by the authors’ group and reported
the performance of state-of-the-art FR algorithms on it. The results demonstrate that
recognition rate decreases as the remotely acquired face images are affected by
illumination variation, blur, occlusion, and pose variation.
The evaluations reported here can provide guidance for further research in remote
face recognition. In particular, the following are some areas where one can address
the challenges in remote face recognition:
• Develop face recognition algorithms which can account for multiple sources of
variation. For instance, low resolution is usually coupled with illumination and blur.
The coupling among different variation factors makes the remote face recognition
problem extremely difficult. Robust methods are needed to deal with these coupling
effects in remote face recognition.
• Incorporate features which are robust to these variations. For instance, features
which are robust to blur might not be able to deal with illumination variations.
Hence, the design of features that can normalize the coupling effect due to various
variations is needed.
• In order to develop robust recognition algorithms under these conditions, a more comprehensive quality metric to reject low-quality images is needed. Such a metric can make the recognition system more effective in practical acquisition
conditions.

References

1. Abiantun R, Savvides M, Vijaya Kumar BVK (2006) How low can you go? Low resolution face
recognition study using kernel correlation feature analysis on the frgcv2 dataset. In: Biometric
consortium conference, 2006 biometrics symposium: special session on research at the, pp 1–6
2. Ahonen T, Rahtu E, Ojansivu V, Heikkila J (2008) Recognition of blurred faces using local
phase quantization. In: International conference on pattern recognition, pp 1–4
3. Arandjelovic OD, Cipolla R (2006) Face recognition from video using the generic shape-
illumination manifold. In: European conference on computer vision, vol IV, pp 27–40
4. Arashloo SR, Kittler J (2011) Energy normalization for pose-invariant face recognition based
on mrf model image matching. IEEE Trans Pattern Anal Mach Intell 33(6):1274–1280
5. Ashraf AB, Lucey S, Chen T (2008) Learning patch correspondences for improved viewpoint
invariant face recognition. In: IEEE conference on computer vision and pattern recognition,
pp 1–8
6. Bartlett MS, Movellan JR, Sejnowski TJ (2002) Face recognition by independent component
analysis. IEEE Trans Neural Networks 13(6):1450–1464
7. Basri R, Jacobs DW (2003) Lambertian reflectance and linear subspaces. IEEE Trans Pattern
Anal Mach Intell 25(2):218–233
8. Belhumeur P, Hespanda J, Kriegman D (1997) Eigenfaces versus fisherfaces: recognition using
class specific linear projection. IEEE Trans Pattern Anal Mach Intell 19(7):711–720
9. Belhumeur PN, Kriegman DJ (1996) What is the set of images of an object under all possible
lighting conditions? In: IEEE conference on computer vision and pattern recognition

10. Biswas S, Aggarwal G, Chellappa R (2009) Robust estimation of albedo for illumination-
invariant matching and shape recovery. IEEE Trans Pattern Anal Mach Intell 29(2):884–899
11. Biswas S, Bowyer KW, Flynn PJ (2010) Multidimensional scaling for matching low-resolution
facial images. In: IEEE international conference on biometrics: theory applications and systems,
pp 1–6
12. Biswas S, Chellappa R (2010) Pose-robust albedo estimation from a single image. In: IEEE
conference on computer vision and pattern recognition, pp 2683–2690
13. Blanz Volker, Vetter Thomas (2003) Face recognition based on fitting a 3d morphable model.
IEEE Trans Pattern Anal Mach Intell 25:1063–1074
14. Candès EJ, Li X, Ma Y, Wright J (2011) Robust principal component analysis? J ACM 58(1):
1–37
15. Castillo CD, Jacobs DW (2009) Using stereo matching with general epipolar geometry for 2d
face recognition across pose. IEEE Trans Pattern Anal Mach Intell 31(12):2298–2304
16. Chai Xiujuan, Shan Shiguang, Chen Xilin, Gao Wen (2007) Locally linear regression for pose-
invariant face recognition. IEEE Trans Image Process 16(7):1716–1725
17. Chellappa R, Ni J, Patel VM (2012) Remote identification of faces: problems, prospects, and
progress. Pattern Recogn Lett 33(15):1849–1859
18. Chen T, Yin W, Zhou XS, Comaniciu D, Huang TS (2006) Total variation models for variable
lighting face recognition. IEEE Trans Pattern Anal Mach Intell 28(9):1519–1524
19. Chen Y-C, Patel VM, Phillips PJ, Chellappa R (2012) Dictionary-based face recognition from
video. In: European conference on computer vision
20. Etemad K, Chellappa R (1997) Discriminant analysis for recognition of human face images.
J Opt Soc Am 14:1724–1733
21. Friedman J (1989) Regularized discriminant analysis. J Am Stat Assoc 84:165–175
22. Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: illumination cone
models for face recognition under variable lighting and pose. IEEE Trans Pattern Anal Mach
Intell 23(6):643–660
23. Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: illumination cone
models for face recognition under variable lighting and pose. IEEE Trans Pattern Anal Mach
Intell 23(6):643–660
24. Gross R, Matthews I, Baker S (2004) Appearance-based face recognition and light-fields. IEEE
Trans Pattern Anal Mach Intell 26(4):449–465
25. Gunturk BK, Batur AU, Altunbasak Y, Hayes III MH, Mersereau RM (2003) Eigenface-domain
super-resolution for face recognition. IEEE Trans Image Process 12(5):597–606
26. Guo G, Li SZ, Chan K (2000) Face recognition by support vector machines. In: IEEE interna-
tional conference on automatic face and gesture recognition, Grenoble, France, pp 196–201
27. Hennings-Yeomans PH, Baker S, Kumar BVKV (2008) Simultaneous super-resolution and
feature extraction for recognition of low-resolution faces. In: IEEE international conference
on computer vision and pattern recognition, pp 1–8
28. Hotta K, Kurita T, Mishima T (1998) Scale invariant face detection method using higher-order
local autocorrelation features extracted from log-polar image. In: IEEE international conference
on automatic face and gesture recognition, pp 70–75
29. Huang G, Ramesh M, Berg T, Learned-Miller E (2007) Labeled faces in the wild: a database
for studying face recognition in unconstrained environments. University of Massachusetts,
Amherst, Technical Report, pp 07–49
30. Jia K, Gong S (2005) Multi-modal tensor face for simultaneous super-resolution and recogni-
tion. In: IEEE international conference on computer vision, vol 2, pp 1683–1690
31. Kanade T, Yamada A (2003) Multi-subregion based probabilistic approach toward pose-
invariant face recognition. In: IEEE international symposium on computational intelligence
in robotics and automation, vol 2, pp 954–959
32. Lee K, Ho J, Kriegman DJ (2005) Acquiring linear subspaces for face recognition under variable
lighting. IEEE Trans Pattern Anal Mach Intell 27(5):684–698
33. Lee K-C, Ho J, Yang M-H, Kriegman D (2005) Visual tracking and recognition using proba-
bilistic appearance manifolds. Comput Vis Image Underst 99:303–331

34. Lee Sang-Woong, Park Jooyoung, Lee Seong-Whan (2006) Low resolution face recognition
based on support vector data description. Pattern Recognit 39(9):1809–1812
35. Li A, Shan S, Chen X, Gao W (2009) Maximizing intra-individual correlations for face recog-
nition across pose differences. In: IEEE conference on computer vision and pattern recognition,
pp 605–611
36. Li B, Chang H, Shan S, Chen X (2009) Coupled metric learning for face recognition with
degraded images. In: Asian conference on machine learning: advances in machine learning,
Berlin, Heidelberg, Springer, pp 220–233
37. Li B, Chang H, Shan S, Chen X (2010) Low-resolution face recognition via coupled locality
preserving mappings. IEEE Signal Process Lett 17(1):20–23
38. Medioni G, Choi J, Kuo C-H, Choudhury A, Zhang L, Fidaleo D (2007) Non-cooperative
persons identification at a distance with 3d face modeling. In: IEEE international conference
on biometrics: theory, applications, and systems, pp 1–6
39. Moghaddam B (2000) Bayesian face recognition. Pattern Recognit 33:1771–1782
40. Narasimhan SG, Nayar SK (2003) Shedding light on the weather. In: IEEE conference on
computer vision and pattern recognition, vol 1, pp I-665–I-672
41. Nayar SK, Narasimhan SG (1999) Vision in bad weather. In: IEEE international conference
on computer vision, vol 2, pp 820–827
42. Ni J, Turaga P, Patel VM, Chellappa R (2011) Example-driven manifold priors for image
deconvolution. IEEE Trans Image Process 20(11):3086–3096
43. Nishiyama M, Hadid A, Takeshima H, Shotton J, Kozakaya T, Yamaguchi O (2011) Facial
deblur inference using subspace analysis for recognition of blurred faces. IEEE Trans Pattern
Anal Mach Intell 33(4):838–845
44. Patel VM, Wu T, Biswas S, Philips PJ, Chellappa R (2012) Dictionary-based face recognition
under variable lighting and pose. IEEE Trans Inf Forensics Secur 7(3):954–965
45. Patel VM, Wu T, Biswas S, Phillips PJ, Chellappa R (2011) Illumination robust dictionary-
based face recognition. In: International conference on image processing
46. Phillips PJ, Wechsler H, Huang J, Rauss PJ (1998) The feret database and evaluation procedure
for face-recognition algorithms. Image Vis Comput 16:295–306
47. Pinto N, DiCarlo J, Cox D (2009) How far can you get with a modern face recognition test set
using only simple features? In: IEEE conference on computer vision and pattern recognition,
Miami, FL, pp 2591–2568
48. Prince SJD, Warrell J, Elder JH, Felisberti FM (2008) Tied factor analysis for face recognition
across large pose differences. IEEE Trans Pattern Anal Mach Intell 30(6):970–984
49. Ramamoorthi R, Hanrahan P (2001) On the relationship between radiance and irradiance:
determining the illumination from images of a convex lambertian object. J Opt Soc Am
18(10):2448–2459
50. Rara HM, Elhabian SY, Ali AM, Miller M, Starr TL, Farag AA (2009) Distant face recognition
based on sparse-stereo reconstruction. In: IEEE international conference on image processing,
pp 4141–4144
51. Shashua A, Riklin-Raviv T (2001) The quotient image: class-based re-rendering and recogni-
tion with varying illuminations. IEEE Trans Pattern Anal Mach Intell 23(2):129–139
52. Shekhar S, Patel VM, Chellappa R (2011) Synthesis-based recognition of low resolution faces,
International Joint Conference on Biometrics (IJCB) 1–6. Washington, DC
53. Sim T, Baker S, Bsat M (2003) The cmu pose, illumination, and expression database. IEEE
Trans Pattern Anal Mach Intell 25:1615–1618
54. Taheri S, Sankaranarayanan AC, Chellappa R (2012) Joint albedo estimation and pose tracking
from video. IEEE Trans Pattern Anal Mach Intell 35(7):1674–1689
55. Tistarelli M, Li SZ, Chellappa R (2009) Handbook of remote biometrics: for surveillance and
security, 1st edn. Springer Publishing Company Incorporated, New York
56. Turk M, Pentland A (1991) Eigenfaces for recognition. J Cogn Neurosci 3:71–86
57. Wang H, Li SZ, Wang Y (2004) Generalized quotient image. In: International conference on
computer vision and pattern recognition

58. Wiskott Laurenz, Fellous Jean-Marc, Krüger Norbert, Von Der Malsburg Christoph (1997) Face
recognition by elastic bunch graph matching. IEEE Trans Pattern Anal Mach Intell 19:775–779
59. Wright J, Ganesh A, Yang A, Ma Y (2009) Robust face recognition via sparse representation.
IEEE Trans Pattern Anal Mach Intell 31:210–227
60. Yang M-H (2002) Kernel eigenfaces vs. kernel fisherfaces: face recognition using kernel meth-
ods. In: IEEE international conference on automatic face and gesture recognition, Washington,
DC, pp 215–220
61. Yao Y, Abidi B, Kalka N, Schmid N, Abidi M (2008) Improving long range and high magnifi-
cation face recognition: database acquisition, evaluation, and enhancement. Comput Vis Image
Underst 111:111–125
62. Zhang L, Samaras D (2003) Face recognition under variable lighting using harmonic image
exemplars. In: International conference on computer vision and pattern recognition
63. Zhang T, Tang YY, Fang B, Shang Z, Liu X (2009) Face recognition under varying illumination
using gradientfaces. IEEE Trans Image Process 18(11):2599–2606
64. Zhang Xiaozheng, Gao Yongsheng (2009) Face recognition across pose: a review. Pattern
Recogn 42:2876–2896
65. Zhang Z, Blum RS (1998) On estimating the quality of noisy images. In: IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 5:2897–2900. Seattle,
Washington
66. Zhao W, Chellappa R, Phillips J, Rosenfeld A (2003) Face recognition: a literature survey.
ACM Comput Surv 35(4):399–458
67. Zhou SK, Aggarwal G, Chellappa R, Jacobs DW (2007) Appearance characterization of linear
lambertian objects, generalized photometric stereo, and illumination-invariant face recognition.
IEEE Trans Pattern Anal Mach Intell 29(2):230–245
68. Zhou S, Krueger V, Chellappa R (2003) Probabilistic recognition of human faces from video.
Comput Vis Image Underst 91:214–245
Chapter 3
Biometric Identification from Facial Sketches
of Poor Reliability: Comparison of Human
and Machine Performance

H. Proença, J. C. Neves, J. Sequeiros, N. Carapito and N. C. Garcia

Abstract Facial sketch recognition refers to the establishment of a link between a drawn representation of a human face and an identity, based on information given by an eyewitness of some illegal act. It is a topic of growing interest, and various software
frameworks to synthesize sketches are available nowadays. When compared to the
traditional hand-made sketches, such sketches resemble more closely the appearance
of real mugshots, and led to the possibility of using automated face recognition
methods in the identification task. However, there are often deficiencies of witnesses
in describing the subjects’ appearance, which might bias the main features of sketches
with respect to the corresponding identity. This chapter compares the human and
machine performance in the task of sketch identification (rank-1 identification). One
hundred subjects were considered as gallery data, and five images from each stored
in a database. Also, one hundred sketches were drawn by non-professionals and used
as probe data, each of these resembling an identity in the gallery set. Next, a set
of volunteers was asked to identify each sketch, and their answers compared to
the rank-1 identification responses given by automated face recognition techniques.
Three appearance-based face recognition algorithms were used: (1) Gabor-based
description, with ℓ2-norm distance; (2) sparse representation for classification; and

H. Proença (B) · J. C. Neves · J. Sequeiros · N. Carapito · N. C. Garcia


Department of Computer Science, IT- Instituto de Telecomunicações, University of Beira Interior,
6201-001 Covilhã, Portugal
e-mail: hugomcp@ubi.pt
J. C. Neves
e-mail: jcneves@ubi.pt
J. Sequeiros
e-mail: a29192@ubi.pt
N. Carapito
e-mail: a29195@ubi.pt
N. C. Garcia
e-mail: a26931@ubi.pt


(3) eigenfaces. The sparse representation for classification algorithm yielded the best
results, whereas the responses given by the Gabor-based description algorithm were
the most correlated to human responses.

3.1 Introduction

The pursuit of criminals based on sketches drawn by eyewitnesses of some illegal act
has been used by law enforcement agencies for many years. This process has two
major phases: (1) the eyewitness is asked to draw or coordinate the drawing of a
representation of a face; (2) once the eyewitness considers the sketch close enough
to the desired appearance, a range search over the enrolled identities is carried out,
in order to find the sub-set of identities that maximally resemble the appearance of
the sketch.
Sketches were traditionally hand-made, either by the witness or law enforce-
ment agents, which conditioned their appearance and pushed them away from real
mugshots. Due to this, the earliest attempts to perform facial sketch recognition
translated the gallery data into a simplified sketch domain, performing recognition
subsequently. However, unlike hand-drawn sketches, composite sketches are synthe-
sized using facial composite software systems and more closely resemble real faces,
which raises the possibility of using automated face recognition algorithms in the
identification task.
Most of the judicial processes of modern societies rely on the honesty of eyewit-
nesses [7]. Their testimony is known to deeply impress the members of a jury and, as
such, their reliability strongly matters. In this scope, perjury is usually considered
a crime, but it is not simple to distinguish it from mere misremembering, which is
always a strong possibility due to the emotional condition of witnesses both at the
moment of the crime and of the trial. Several studies about the reliability of eyewit-
nesses corroborate the vulnerability of human memory to bias [24], which increases
the skepticism about the reliability of the sketches, i.e., how closely does the appearance of a sketch actually resemble the appearance of the human identity that it is meant to depict?
This topic motivated several studies over the last years and concerns a significant
amount of researchers (e.g., [8, 11, 18, 21]).
In this chapter we address the performance of automated face recognition methods
on composite sketches and compare it to that attained by humans, considering that humans are extremely efficient in identifying a previously seen face [2]. In work closely related to that reported in this chapter, Hancock, Bruce and Burton [10] compared the
performance of two sketch recognition algorithms to the ability of humans, having
concluded that automated methods provided reasonable correlations with human
ratings and memory data. Also, they studied the ability of each algorithm to perform
recognition with and without the hair visible, and with and without alteration in facial
expressions.
According to the above, the main singularities of this work are two-fold: (1)
we provide the performance of algorithms devised for recognizing real mugshots

on composite-sketch data; (2) we analyze the results with respect to the level of
reliability of the sketches, measured by the agreement of human responses on each
sketch, i.e., if most people agree in the identity established for a given sketch, then
we consider it a reliable sketch.
The main motivations for the study reported in this chapter can be summarized
as follows:

1. the increasingly high realism of sketches drawn by software systems. These frameworks produce sketches so realistic that they are often hard to distinguish from real faces;
2. the fact that eyewitnesses often do not remember in detail some of the subjects' features, which results in sketches that might not reproduce closely the identity of interest. In the scope of this work, this concept is referred to as the reliability of the
sketch.
3. the relative effectiveness attained by humans and by automated recognition algorithms remains to be determined. Assessing it enables measuring the degree of certainty of humans in the identification task and correlating it to the performance of automated recognition methods.

The remainder of this chapter is organized as follows: Sect. 3.2 gives an overview
of the most relevant methods to perform facial sketch recognition. The description
of our experiments and the corresponding discussion is given in Sect. 3.3. Finally,
Sect. 3.4 concludes the chapter.

3.2 Facial Sketch Recognition

Figure 3.1 gives a cohesive perspective of facial sketch recognition algorithms. Hav-
ing a set of gallery images of known identity (usually designated as mugshots), the
problem is to assign the probable identity to a sketch drawn by a witness of some crime.
There are two general families of algorithms for this task: (1) one that converts the
available gallery data into the sketches domain, which is believed to most proba-
bly resemble the appearance of the query sketch; and (2) algorithms that match the
query sketch directly to the available gallery data, provided that the feature descrip-
tion and matching algorithms are robust enough to handle such obvious differences
in appearance of real data and sketches.
Most representative works on this subject tackled the problem in the sketch
domain, i.e., by transforming gallery mugshots into sketches (e.g., using eigenspace
techniques) and performing recognition subsequently (using methods such as Bayes-
ian classifiers, nearest neighbor techniques and analysis of principal components).
Among these works, the proposal of Wang and Tang [26] consists of a photo-sketch
synthesis and a recognition method based on Markov Random Fields. Their pro-
posal has three major phases: (1) given a mugshot, synthesize a sketch; (2) given
a sketch, synthesizing a photo; and (3) searching for face photos in the database


Fig. 3.1 Block diagram of a typical facial sketch recognition algorithm. Given a set of mugshots
of known identities enrolled in a database, the problem is to establish the identity of a sketch. While
some of the existing methods translate the gallery data into the sketch domain, the problem has two
major phases: feature encoding and matching

based on a query sketch drew by an artist. Authors used a training set containing
photo-sketch pairs of frontal poses, with normal lighting, neutral expression and no
significant occlusions. In the recognition phase, two variants were tested: (1) firstly
transforming the photos into sketches and matching in that domain; and (2) transforming sketches into photos and matching in the image domain. Li et al. [15] proposed a system
for face sketch synthesis and recognition based on two main phases: pseudo-sketch
generation and sketch recognition. The pseudo-sketch generation method is based on
local linear preserving of geometry between photo and sketch images, and the recog-
nition phase relies on nonlinear discriminant analysis, in order to match the identity
of interest from the previously synthesized pseudo-sketches. Klare et al. [14] were
particularly concerned with the forensic domain, based on the fact that most times
sketches are drawn while not looking at the subject, which decreases the quality of the result. Hence, they presented a recognition method based on local feature-based
discriminant analysis (LFDA). In LFDA, both sketches and photos are represented
by Scale Invariant Feature Transform (SIFT) descriptors and Local Binary Patterns (LBP),
from where multiple discriminant projections are taken for feature representation.
Nearest neighbor is used in classification, resulting in a system that offers substantial
improvements in matching forensic sketches to the corresponding face images, when
compared to state-of-the-art face recognition methods. An overview of this approach
is also given in [12]. The approach of Liu et al. [16] has a main distinguishing feature:
instead of transforming mugshots into the sketch domain, they inverted the problem
and started by generating a realistic face image from the composite sketch using a
Hybrid subspace method and then built an illumination tolerant correlation filter able
to recognize the person under different illumination variations from a surveillance
video footage. Najati and Sim [17] approached the problem from the perspective of
the quality of the sketch. Considering that these are typically unreliable, they asked
the eyewitness to draw a sketch of the target face, providing some ancillary informa-

tion. Then, a drawing profile from the witness was obtained, by asking him/her to
draw a set of face photos. This profile is used to correct the sketch for the witness bias,
and feeds the recognition module, with the authors observing consistent improvements
in performance, when compared to starting from the raw sketches. The approach
of Pramanik and Bhattacharjee [20] relies on statistics drawn from facial landmarks,
such as the eye corners, nose, eyebrows and lips. They extracted a set of lengths,
areas and ratios, from both gallery mugshots and sketches. Next, the feature vectors
were normalized by their energy removed. Nearest neighbor techniques were used
to find the most probable identity for a sketch. The authors concluded that even this sim-
ple approach is effective when faces are in frontal pose, under homogeneous lighting
conditions and without significant occlusions.
To the best of our knowledge, only Han et al. [9] considered the specific features
of sketches synthesized using facial composite software and proposed a component-
based representation model to measure the similarity between the sketch and a
mugshot. Facial landmarks in composite sketches and mugshots were detected by
means of an active shape model. Features were extracted for each facial component
using multi-scale local binary patterns, and similarity per component was calculated.
Finally, the similarity scores obtained from individual components were fused, yield-
ing the final similarity score.

3.3 Experiments and Discussion

In our experiments the responses of human observers and of automated facial sketch
recognition algorithms were compared. Regarding humans, we asked a set of volunteers to look at a sketch and match it to one of the enrolled identities (available as mugshots). The protocol used is summarized below:

1. Ask each human volunteer to observe each sketch on a computer screen.
2. After 15 s, remove the sketch from the screen. In case of any doubt, volunteers
were allowed to observe the sketch again.
3. Based on a single mugshot per enrolled identity, ask the human observer to
match the sketch to one identity. The human observer may go over the database
any number of times and gradually eliminate the less likely foils until they are
left with just one.
4. Run three automated face recognition algorithms, using the set of sketches as
probes and five gallery mugshots per subject for learning purposes.
5. Evaluate the levels of correlation between the responses given by humans and
rank-1 identification results of the automated recognition algorithms.

3.3.1 Recognition Using Sparse Representations

Sparse representations have been largely reported in the literature (e.g. [19]). The
main idea is to obtain a sparse linear representation of a probe with respect to a set of
samples that constitute an overcomplete dictionary of base elements. In this chapter,
the results yielded by the approach of Wright et al. [27] are reported. The singularity of
this proposal lies in the fact that, instead of using sparsity to identify a relevant model,
it uses the sparse representation of each probe directly for classification, obtaining
a linear combination of the base elements that give a compact representation of the
probe, assuming that a sufficient number of training samples per class i exists, i.e.,
A_i = [v_{i,1}, . . . , v_{i,n_i}] ∈ R^{m×n_i}. Formally, each probe is regarded as a column vector y that can be represented by a linear combination of samples of the corresponding
ith class:
y = ω_{i,1} v_{i,1} + · · · + ω_{i,n_i} v_{i,n_i},    (3.1)

being v_{i,j} the column representations of the gallery data. For classification, a matrix A results from the concatenation of all A_i elements:

A \doteq [A_1, A_2, . . . , A_k] = [v_{1,1}, v_{1,2}, . . . , v_{k,n_k}].    (3.2)

The linear representation of a probe y belonging to the ith class can be written in
terms of A:
y = A x_i ∈ R^m,    (3.3)

where x_i = [0, . . . , 0, ω_{i,1}, . . . , ω_{i,n_i}, 0, . . . , 0]^T ∈ R^n is considered a sparse vector,


as all its entries are zero, except those associated to the corresponding ith class.
Considering that the system y = Ax is typically underdetermined, authors seek the sparsest solution to it, by choosing the minimum ℓ₀-norm solution:

(ℓ⁰): \hat{x}_0 = \arg\min \|x\|_0, subject to Ax = y.    (3.4)

Based on recent developments in the theory of sparse representation, authors noted that if the solution x_0 is sparse enough, the solution to the ℓ₀ minimization problem is equal to that of the corresponding ℓ₁ minimization problem:

(ℓ¹): \hat{x}_1 = \arg\min \|x\|_1, subject to Ax = y.    (3.5)

This problem can be solved in polynomial time by linear programming methods [1]. Considering the noisy environment of biometric recognition, the model in (3.5) might not hold exactly, and a modification to account for small dense noise was proposed, y = A x_i + z, being z ∈ R^m the noise term with bounded energy ‖z‖₂ < Θ. For this model, the sparsest solution can be found by the following ℓ₁ minimization problem [5], which can be solved via second-order cone programming techniques:

(ℓ¹ₛ): \hat{x}_1 = \arg\min \|x\|_1, subject to \|Ax − y\|_2 ≤ Θ.    (3.6)

Having a probe y of unknown class, the label is found based on how well the
coefficients associated with all training samples of each class reproduce that probe.
A characteristic function γi for each class is defined such that γi (x) is a vector
whose only non-zero entries are associated with class i. Hence, ŷi = Aγi ( x̂ 1 ) is an
approximation of y, based on a linear combination of gallery data of the ith class.
The representation ŷ_i that most closely fits y is deemed to indicate the true class of y:

\min_i \|y − A\,γ_i(\hat{x}_1)\|_2.    (3.7)

3.3.2 Recognition Using Gabor-Based Description and Nearest Neighbor Matching

Daugman observed that simple cells in the visual cortex of mammalian brains can
be modeled by Gabor functions [3, 4], and that image analysis based on Gabor
functions is deemed to be similar to the perception in the human visual system. Since
then, Gabor-based decompositions have been largely reported in the computer vision
literature (e.g., [6, 13]), and particularly in the biometrics domain (e.g., face [25]
and fingerprint [22] recognition algorithms).
The impulse response of a Gabor kernel is defined by a harmonic function multiplied by a Gaussian function, and these filters have a real and an imaginary component representing orthogonal directions:

g(x, y, \Phi, \Sigma, \alpha_x, \alpha_y) = \frac{1}{2\pi \alpha_x \alpha_y}\, e^{-\frac{1}{2}\left(\frac{\delta_1^2}{\alpha_x^2} + \frac{\delta_2^2}{\alpha_y^2}\right)}\, e^{i\,\frac{2\pi \delta_1}{\Phi}},    (3.8)

being δ₁ = x cos(Σ) − y sin(Σ), δ₂ = x sin(Σ) + y cos(Σ), Φ the wavelength and Σ the orientation. Figure 3.2 illustrates examples of Gabor filters, where the scale
(upper row) and rotation (bottom row) parameters are varying.
A bank of Gabor filters with various scales and rotations was created. The filters were convolved with the image I, resulting in a so-called Gabor space. In our experiments αₓ = α_y = Φ/2 was used, to alleviate the computational burden of the optimization process associated with finding the optimal parameters of the Gabor kernels:

g_i(I) = g ∗ I,    (3.9)

being "∗" the convolution operator. Let G(I) = [|g′_1(I)|, . . . , |g′_n(I)|] be the concatenation of the magnitudes of the g_i(I) elements, such that

g′_i(I) = g_i(I) − μ_{g_i(I)},    (3.10)



Fig. 3.2 Illustration of the Gabor filters (real parts) used to decompose the gallery images. The
upper row illustrates variations in the scale parameter (Φ, wavelength), whereas the bottom row
illustrates variations in rotation (Σ)

Fig. 3.3 Example of the decomposition of a gallery image I (leftmost image) by three different
parameterizations of Gabor kernels (images at right)

being μ_{g_i(I)} the mean of the coefficients of g_i(I). Hence, G(I) has zero mean and feeds a matching module that yields the distance between the probe I and its Gabor representations:

\min_k \sum_{i,c} \|g′_i(I) − g′_i(I_c^{(k)})\|_1,    (3.11)

being I_c^{(k)} the cth gallery image of the kth class.


Figure 3.3 illustrates a gallery image I (leftmost image) and three of its Gabor
representations gi (images b, c and d) obtained when using the parameters (Φ, Σ ) =
{(0.33, π/4), (0.28, 3π/4), (0.51, π/2)} for Gabor kernels (3.8).
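A minimal sketch of this Gabor decomposition, following Eqs. (3.8)–(3.10) under the convention αx = αy = Φ/2, is given below in Python/NumPy; the kernel size, the (wavelength, orientation) grid and the input image are hypothetical, and pixel-domain wavelengths are used instead of the normalized values quoted above.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(wavelength, orientation, size=31):
    """Complex Gabor kernel following Eq. (3.8), with alpha_x = alpha_y = wavelength / 2."""
    ax = ay = wavelength / 2.0
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    d1 = x * np.cos(orientation) - y * np.sin(orientation)
    d2 = x * np.sin(orientation) + y * np.cos(orientation)
    envelope = np.exp(-0.5 * (d1 ** 2 / ax ** 2 + d2 ** 2 / ay ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * d1 / wavelength)
    return envelope * carrier / (2.0 * np.pi * ax * ay)

def gabor_features(image, params):
    """Concatenated zero-mean magnitude responses, in the spirit of G(I) in (3.9)-(3.10)."""
    feats = []
    for wavelength, orientation in params:
        kernel = gabor_kernel(wavelength, orientation)
        response = np.abs(convolve2d(image, kernel, mode="same"))
        feats.append((response - response.mean()).ravel())
    return np.concatenate(feats)

# Hypothetical usage: a random stand-in image and three (wavelength, orientation) pairs
if __name__ == "__main__":
    rng = np.random.default_rng(4)
    I = rng.random((64, 64))
    params = [(4.0, np.pi / 4), (6.0, 3 * np.pi / 4), (8.0, np.pi / 2)]
    print(gabor_features(I, params).shape)
```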

3.3.3 Recognition Using Eigenfaces

Firstly proposed by Turk and Pentland [23], the concept of eigenface became
extremely popular in the computer vision literature and it is reported for many
different purposes. The main idea is to simplify the face recognition problem by

Fig. 3.4 Examples of the eigenfaces associated to the four largest eigenvalues of the gallery data
used in our experiments

first mapping the data into a space of lower dimensionality, found by the principal
components analysis. Let the jth image from the ith subject be represented by a
column vector vi, j . The first step is to obtain a representation of this column vector
subtracted by the mean face of all images in the data base:

v′_{i,j} = v_{i,j} − \frac{1}{n_s n_i} \sum_k \sum_r v_{k,r},    (3.12)

being n_s and n_i the number of subjects and of images per subject available. A dictionary is built from the concatenation of all v′_{i,j}, i.e., A = [v′_{1,1}, . . . , v′_{n_i,n_s}]. Next, the covariance matrix C = A A^T is found. The eigenvectors μ_i of C are found and the best d eigenvectors are kept, corresponding to the d largest eigenvalues:

μ_i = A v_i.    (3.13)

Each face in the training set can be represented as a linear combination of the best d eigenvectors:

\hat{v}′_{i,j} = \sum_d ω_d μ_d, \qquad (ω_d = μ_d^T v′_{i,j})    (3.14)

yielding a representation of each face in the principal components space given by ω_{i,j} = [Φ_{i,j,1}, . . . , Φ_{i,j,d}]. Given an unknown probe image y, it is normalized and projected on the eigenspace, ω_y = [Φ_{y,1}, . . . , Φ_{y,d}]. The identity assumed for that probe is given by:

\hat{i} = \arg\min_i \left( \min_j \|ω_y − ω_{i,j}\|_2 \right)    (3.15)

Figure 3.4 gives the eigenfaces associated to the four largest eigenvalues of C,
when analyzing the principal components of the gallery data used in our experi-
ments (five images per subject, for one hundred subjects). Here, we can perceive that the hair is the most discriminating region (corresponding to μ_1), followed by the low-frequency components of the facial region (μ_2). Details that regard the ocular

Fig. 3.5 Examples of the images used as gallery data in our experiments. The two leftmost columns
illustrate the used gallery samples, whereas the rightmost columns give the corresponding data
translated into the sketch domain

region and eyebrows are highlighted by μ_3, and high-frequency information about the chin and nose is highlighted by μ_4.
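A compact sketch of the eigenface computation of Eqs. (3.12)–(3.15) is given below; it obtains the leading eigenvectors through an SVD of the mean-centred data matrix, a standard equivalent of the covariance route described above, and the gallery arrays, labels and dimensions are hypothetical.

```python
import numpy as np

def train_eigenfaces(A, d=4):
    """A: (m, N) matrix whose columns are vectorized gallery faces.  Returns the
    mean face, the d leading eigenfaces and the projections of the gallery faces."""
    mean_face = A.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(A - mean_face, full_matrices=False)
    eigenfaces = U[:, :d]                          # columns span the face space
    projections = eigenfaces.T @ (A - mean_face)   # one coefficient vector per gallery face
    return mean_face, eigenfaces, projections

def identify(probe, mean_face, eigenfaces, projections, labels):
    """Project the probe and return the label of the nearest gallery projection (Eq. 3.15)."""
    omega = eigenfaces.T @ (probe[:, None] - mean_face)
    distances = np.linalg.norm(projections - omega, axis=0)
    return labels[np.argmin(distances)]

# Hypothetical usage: 100 subjects x 5 images of 32x32 pixels, stored column-wise
if __name__ == "__main__":
    rng = np.random.default_rng(5)
    labels = np.repeat(np.arange(100), 5)
    A = rng.random((32 * 32, 500))
    mean_face, eigenfaces, projections = train_eigenfaces(A, d=20)
    probe = A[:, 7] + 0.01 * rng.standard_normal(32 * 32)
    print(identify(probe, mean_face, eigenfaces, projections, labels))
```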

3.3.4 Gallery Data

A set of one hundred subjects was used as enrolled identities. For each subject, five
frontal images of their faces (mugshots) were acquired under natural and artificial
light sources of varying intensity. Images have dimensions 4,368 × 2,912, and the
upper-right and bottom left corners of the faces were manually annotated, defining
regions-of-interest to avoid the effect of data misalignments. Next, each image was
converted into grayscale and stored in 8-bit depth uncompressed format. Images at
the left columns of Fig. 3.5 exemplify six gallery images from three different subjects.

Fig. 3.6 Screenshot of the web-based application (PortraitPad) used to create the set of sketches used in our experiments

3.3.5 Sketches

Sketches were created by four humans (non-professionals, i.e., they were not familiar with the task), using the PortraitPad 1 software toolkit. This is a free-access web-based application that enables customizing the appearance of a sketch with high detail,
according to the composite method. Here, users are requested to iteratively select the
type of facial component (e.g., eyes, eyelids, lips,…) and a group of possibilities that
exhibit varying textures, shapes and sizes. Next, by selecting one of these possibilities
and placing it in the face that is being constructed, it is possible to resemble the
appearance of a real human face with high reliability. Figure 3.6 gives a screenshot
of the application used to construct the set of sketches of our experiments.
A total of one hundred images was created, having as main concern that each
sketch should resemble one of the identities enrolled in the database. Figure 3.7
illustrates pairs of sketches and of the corresponding identities in the gallery data
that they aim to resemble.

3.3.6 Correlation Between the Responses Given by Humans and Automated Methods

Initially, all the sketches were considered as probe data and presented to the human
volunteers. They were asked to match each sketch against one of the identities

1 http://portraitpad.com/

Fig. 3.7 Examples of the sketches and of the identities in gallery data that they aim to resemble

enrolled in the gallery. No more than one possibility should be considered and, in case of doubt, volunteers had to choose the identity that they perceived to resemble
the sketch more closely. Let ϕi, j (x) be a characteristic function that gives the prob-
ability that a human observer matches the ith probe to the jth identity in the gallery
set. Figure 3.8 gives extreme cases of the statistic of ϕi, j , where the horizontal axis
denotes the class. Next to each plot, the corresponding sketch is given, enabling one to perceive the main features of the sketches where users had reduced uncertainty (upper
row) and of the cases that raised more doubts (bottom rows).
Based on the empirical observation of the typical appearances of the subjects that
produced the extreme ϕi,c (x) statistics, we highlight the following factors:

• The hair style is particularly important to humans when they are asked to identify
an individual. Obviously, this raises concerns about counterfeiting measures taken
by criminals in order to prevent identification.
• Particular localized marks (e.g., spots) are also especially helpful to humans when
attempting to identify others.
• At a secondary level, the shapes and sizes of the eyes, nose and lips also play an
important role in the reliability of the resulting sketches. Every time a subject had

Fig. 3.8 Examples of extreme cases of the ϕi,c (x) statistic, corresponding to cases where the human
observers had minimal (upper row) and maximal uncertainty (bottom row)

a particularly big or small nose, ears, etc., volunteers had no difficulty in matching the
sketch against the correct identity.
• The texture of the skin is particularly hard to reproduce in a sketch. In several
circumstances, the volunteers found it extremely hard to define the skin texture of
a sketch that they thought would maximally resemble the identity of the person of
interest.

In order to assess the levels of correlation between the responses given by the
human volunteers and by the automated methods, the agreement of their answers was
measured. It was assumed that any eventual dependence between responses would
be at most linear, which justifies the use of the phi coefficient φ. This coefficient, also
known as the mean square contingency coefficient, is a measure of association between
two binary variables. Let Xi and Yi be two random variables that denote the rank-1
identity of the ith probe, given by experts X and Y. The φ coefficient between X and
Y is given by:

\phi = \frac{\sum_i 1\{X_i = Y_i\}}{\sum_i 1\{X_i = Y_i\} + \sum_i 1\{X_i \neq Y_i\}}    (3.16)

where 1{·} is a characteristic (indicator) function, Xi and Yi denote the system outputs, X̄, Ȳ are
the sample means and σX, σY the standard deviations.
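
As a concrete illustration of Eq. (3.16), the following minimal Python sketch computes the agreement between the rank-1 outputs of two experts; the function name and the array-based representation of the outputs are illustrative assumptions, not part of the original method description.

import numpy as np

def phi_agreement(x, y):
    # Eq. (3.16): fraction of probes for which experts X and Y return the same identity
    x, y = np.asarray(x), np.asarray(y)
    matches = np.sum(x == y)        # sum_i 1{X_i = Y_i}
    mismatches = np.sum(x != y)     # sum_i 1{X_i != Y_i}
    return matches / (matches + mismatches)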
During the experiments we noted that the agreement between experts varies
strongly with respect to the degree of reliability of the sketch. In case of sketches evi-
dently similar to a single identity, both humans and algorithms tend to agree on their
answer, whereas in cases where the identity of the sketch is not so evident, automated
methods and humans often disagree. Hence, resembling the sparsity coefficient index,
the following function measures the certainty of the classification of the ith sketch
by humans:
Q(i) = \frac{k \,\| \max_c \varphi_{i,c}(x) \|_1 / \| \varphi_{i,c}(x) \|_1 - 1}{k - 1},    (3.17)

Fig. 3.9 Relationship between the agreement of the responses given by the automated sketch
recognition strategies (G denotes the Gabor-based recognition, S the sparse representation
technique and E the Eigenfaces method) and the human responses (denoted by H). XY denotes the agreement
between the responses given by experts X and Y. Results are given with respect to the levels of
uncertainty Q(·) of human observers in matching sketches to identities in gallery data (Eq. 3.17)

where k denotes the number of classes (100 in our experiments). Q(i) ≈ 0 corre-
sponds to complete uncertainty and Q(i) = 1 occurs when all the human volunteers
matched the ith sketch against the same identity. Hence, the value of Q(·) makes it possible to
perceive how the uncertainty of humans is related to the agreement of the responses
of automated recognition algorithms. Figure 3.9 expresses the relationship between
the value of Q(i) and the φ coefficient, showing broad evidence of a direct rela-
tionship between the two values. The Gabor-based description technique was observed
to be the algorithm that most closely resembles the responses given by humans (HG),
especially for highly reliable sketches. In general, moderate to low levels of agreement
between humans and automated methods were observed: for Q values lower than 0.3
the agreement between both types of experts was residual. Conversely, agreement rises
significantly for high Q(·) values, as is evident in the rightmost histogram of the
figure.
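
The following Python sketch shows how Q(i) of Eq. (3.17) can be computed from the empirical distribution ϕi,c; the helper name and the example vote histogram are hypothetical and used for illustration only.

import numpy as np

def certainty_q(votes, k=100):
    # Eq. (3.17): 1 when all volunteers picked the same identity, about 0 for a uniform spread
    votes = np.asarray(votes, dtype=float)
    ratio = votes.max() / votes.sum()          # ||max_c phi_{i,c}||_1 / ||phi_{i,c}||_1
    return (k * ratio - 1.0) / (k - 1.0)

# Hypothetical example: four volunteers all matching sketch i to gallery identity 7
hist = np.zeros(100)
hist[7] = 4
print(certainty_q(hist))        # -> 1.0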
In summary, it can be concluded that humans and the Gabor-based algorithm
have a roughly linear relationship in their levels of correlation with respect to the
level of reliability of the sketches, i.e., as the reliability of the sketches increases,
there is a substantially higher probability that the response attained by the automated
algorithm and that of a human are the same. At the other extreme, the correlation of the
responses given by the eigenfaces algorithm was always low, for sketches of
low, medium and high reliability.

3.3.7 Performance of Automated Methods

The performance of the automated face recognition methods was tested with respect
to the level of agreement of humans in correctly identifying sketches, using different

Fig. 3.10 Comparison of the performance attained by the sparse representation algorithm,
eigenfaces and Gabor description algorithms, with respect to the minimal level of agreement of
humans in matching a sketch to an identity (minimum value of Q(i) demanded)

Q(i) minimal values. Figure 3.10 gives the receiver operating characteristic curves
observed for each algorithm when demanding Q(i) ≥ 0.0 (upper left plot), Q(i) ≥ 0.5
(upper right) and Q(i) ≥ 0.75 (bottom plot). The sparse representation method corre-
sponds to the continuous lines, the Eigenfaces technique is represented by the dashed
lines and the Gabor-based technique by the dotted lines. A relatively poor perfor-
mance was observed when all sketches and identities were considered (i.e., Q ≥ 0). In
this case, each algorithm performed best in different regions of the performance space.
However, as the degree of evidence of the sketches’ identity increases (Q ≥ 0.5 and
Q ≥ 0.75), the performance attained by the sparse representation technique becomes
consistently the best. In the case of sketches with maximum reliability (Q ≥ 0.75),
the difference to the remaining methods is evident, and full sensitivity is attained
at a false positive rate of about 0.25. For the eigenfaces and Gabor-based
recognition methods, improvements in performance also occurred, but not in
such an evident way.
In a subsequent evaluation, we assessed whether the translation of the gallery
data into the sketch domain carries significant advantages over the comparison

Fig. 3.11 Comparison of the performance attained when using the original gallery data and when
translating it into the sketch domain before the recognition phase (the Img series represent the compar-
ison image-against-sketch, whereas the Ske series denote the translation of all gallery data into the sketch
domain before the recognition process)

image-against-sketch. All the gallery images were converted to sketches using the
AKVIS software package,2 which was considered to most faithfully resemble the appear-
ance of the sketches among the freely available options. Here, as illustrated in
Fig. 3.11, the idea is that, by first converting the photos into the sketch domain,
the amount of information is substantially reduced and only the most representative
features of each face are retained.
Figure 3.10 gives the results obtained for the three techniques tested. In the case
of the sparse representation technique, the translation of the gallery data into the
sketch domain did not carry any advantage and even contributed to decreases
in performance, independently of the reliability of the sketches considered. In oppo-
sition, for the Gabor-based description and matching method, improvements in per-
formance were consistently observed at all levels of sketch reliability. Finally,
differences in performance in either direction were not considered evident in the case
of the eigenfaces recognition method, as performance in the sketch domain was

2 http://akvis.com/en/sketch/index.php

observed to be better than in the original data domain only for Q ≥ 0.75, corresponding
to the sketches that most evidently match a single identity.
As a summary of the experiments carried out, it can be concluded that the intuitive
notion of reliability of a sketch plays a relevant role in the performance of automated
face identification algorithms. Also, the performance of such algorithms significantly
improves for reliable sketches, i.e., those that most humans perceive as having high
quality (Q ≥ 0.75). In opposition, there are no significant differences in the perfor-
mance of automated recognition algorithms among sketches of poor and medium
fidelity. When it comes to the algorithms used, it can be concluded that sparse
representations are the best technique for high-reliability sketches. This observation
is in agreement with the functioning of that algorithm, which requires that gallery elements
constitute a vector basis (or close to that) of the query data, which should not happen
for very bad sketches. At the other extreme, the eigenfaces algorithm outperforms
all the others for sketches of moderate reliability. This should be due to the fact that,
in opposition to the remaining algorithms, the most important eigenvectors of real
faces and of sketches of moderate quality still resemble each other. In this case, for better data,
the sparse representation is clearly better, but for worse data, the eigenvectors found
are no longer trustworthy.

3.4 Conclusions and Future Trends

For decades, police and forensic organizations have been searching for potential
criminals based on descriptions given by eyewitnesses of some illegal act. However, it
is usual that witnesses only observe the criminals for a few seconds, are too nervous, or
that the criminals’ faces were partially occluded, which decreases the reliability of the
resulting sketches, i.e., how much the sketch and the appearance of the corresponding
subject actually resemble each other.
The availability of software frameworks to generate composite facial sketches
makes it possible to obtain close representations of real mugshots, which motivated the work
described in this chapter. We assessed the recognition performance attained by three
face recognition methods and compared it to that attained by humans in matching
composite sketches to real mugshots. Also, the analysis considers the levels of relia-
bility of each sketch, measured by the agreement of humans in matching each sketch
against a single identity. Overall, we concluded that the sparse representation for
classification algorithm obtained the best results, which was particularly evident for
highly reliable sketches.
Also, we were particularly concerned about the effects on performance when
sketches do not appropriately represent the main features of the identity of interest,
i.e., in the case of sketches of poor reliability. A set of one hundred sketches was syn-
thesized by non-experts, using a freely available facial composite software system.
Then, human volunteers were asked to match each sketch to one of the enrolled
identities. When comparing the responses given by humans and by the automated
recognition algorithms, we concluded that they agree mostly in cases where the

sketch has particular features that make the corresponding identity evident. Also, the
rank-1 identification performance of automated methods is strongly conditioned by
the degree of evidence of the sketch. Finally, we assessed the improvements in per-
formance due to the translation of the gallery data into the sketch domain before the
recognition process. Improvements were not statistically significant in the case of
the sparse representation and eigenfaces techniques, in opposition to the Gabor-based
description and matching strategy, where a consistent improvement was observed.
As a summary of the research conducted in the scope of this chapter, we highlight:
• the reliability of the sketches is a major concern, and both automated recognition
algorithms and humans have very low effectiveness on bad-quality sketches;
• the Gabor-based recognition algorithm produces the responses that most closely
resemble those from humans. This is especially evident for sketches of high relia-
bility;
• in general, the sparse representation for classification algorithm produced the best
performance, in terms of rank-1 identification accuracy;
• the conversion of the gallery data into the sketch domain does not always carry
advantages in terms of recognition performance. In several cases, we even observed
decreases in recognition effectiveness.
As future trends for this research topic, it will be important to investigate
automated methods that are able to estimate the reliability of a sketch of unknown
identity, which can be helpful to assist eyewitnesses in describing a criminal.
Such quality marks of sketches should be regarded in terms of discriminating points,
i.e., information that is actually helpful for automated identification experts in dis-
tinguishing among a set of identities.

Acknowledgments The financial support given by “FCT-Fundação para a Ciência e Tecnologia”
and “FEDER” in the scope of the PTDC/EIA-EIA/103945/2008 (“NECOVID: Negative Covert
Biometric Identification”) research project is acknowledged.
Also, we acknowledge all the volunteers that participated in the data acquisition sessions and in the
synthesis of the facial sketches used in our experiments.

References

1. Chen C, Donoho D, Saunders M (2001) Atomic decomposition by basis pursuit. SIAM Rev
43(1):129–159
2. Cutler LB, Penrod SD, Fisher RP (1994) Conceptual, practical and empirical issues associated
with eyewitness identification test media. In: Ross D, Read J, Toglia M (eds) Adult eyewit-
ness testimony: current trends and developments. Cambridge University Press, New York, pp
163–181
3. Daugman J (1988) Complete discrete 2-D gabor transforms by neural networks for image
analysis and compression. IEEE Trans Acoust Speech Signal Process 36(7):1169–1179
4. Daugman J (1980) Two-dimensional spectral analysis of cortical receptive field profiles. Vision
Res 20(10):847–856
5. Donoho D (2006) For most large underdetermined systems of linear equations the minimal
ℓ1-norm solution is also the sparsest solution. Comm Pure Appl Math 59(6):797–829

6. Dunn D, Higgins WE (1995) Optimal gabor filters for texture segmentation. IEEE Trans Image
Process 4(7):947–964
7. Fisher G (1997) The jury's rise as a lie detector. Yale Law J 107:575
8. Gurney D, Pine K, Wiseman R (2013) The gestural misinformation effect: skewing eyewitness
testimony through gesture. Am J Psychol 126(3):301–314
9. Han H, Klare BF, Bonnen K, Jain AK (2013) Matching composite sketches to face photos.
IEEE Trans Inf Forensics Secur 8(1):191–204
10. Hancock P, Bruce V, Burton A (1998) Comparing two computer systems and human perceptions
of faces. Vision Res 38:2277–2288
11. Horry R, Halford P, Brewer N, Milne R, Bull R (2013) Archival analyses of eyewitness iden-
tification test outcomes: what can they tell us about eyewitness memory? Law and Human
Behavior, www.ncbi.nlm.nih.gov/pubmed/24127889. Accessed 14 Oct 2013
12. Jain AK, Klare BF (2011) Matching forensic sketches and mug shot to apprehend criminals.
Computer 44(5):94–96
13. Kamarainen J-K, Kyrki V, Kalviainen H (2006) Invariance properties of gabor filter-based
features overview and applications. IEEE Trans Image Process 15(5):1088–1099
14. Klare BF, Li Z, Jain AK (2011) Matching forensic sketches to mug shot photos. IEEE Trans
Pattern Anal Mach Intell 33(3):639–646
15. Li Y-H, Savvides M, Bhagavatula V (2006) Illumination tolerant face recognition using a novel
face from sketch synthesis approach and advanced correlation filters. In: Proceedings of the
IEEE international conference on acoustics, speech and signal processing, vol 2, p II
16. Liu Q, Tang X, Jin H, Lu H, Ma S (2005) A nonlinear approach for face sketch synthesis and
recognition. In: Proceedings of the IEEE computer society conference on computer vision and
pattern recognition, vol 1, pp 1005–1010
17. Nejati H, Sim T (2011) Do you see what i see? a more realistic eyewitness sketch recognition.
In: Proceedings of the international joint conference on biometrics, pp 1–8
18. Odinot G, Memon A, la Rood A, Millen A (2013) Are two interviews better than one? eyewit-
ness memory across repeated cognitive interviews. PLoS One 8(10):e76305
19. Pillai J, Patel V, Chellapa R, Ratha N (2011) Secure and robust Iris recognition using random
projections and sparse representations. IEEE Trans Pattern Anal Mach Intell 33(9):1877–1893
20. Pramanik S (2012) Geometric feature based face-sketch recognition. In: Proceedings of the
international conference on pattern recognition, informatics and medical engineering, pp
409–415
21. Rush E, Quas J, Yim I, Nikolayev M, Clark S, Larson R (2013) Stress, interviewer support,
and children’s eyewitness identification accuracy. Child Dev, doi:10.1111/cdev.12177
22. Shen L, Bai L (2006) A review on gabor wavelets for face recognition. Pattern Anal Appl
9:273–292
23. Turk M, Pentland A (1991) Eigenfaces for recognition. J Cogn Neurosci 3(1):71–86
24. Tversky B, Marsh E (2000) Biased retellings of events yield biased memories. Cogn psychol
40(1):1–38
25. Vinay K, Shreyas B (2006) Face recognition using gabor wavelets. In: Proceedings of the
fortieth asilomar conference on signals, systems and computers, pp 1955–1967
26. Wang X, Tang X (2009) Face photo-sketch synthesis and recognition. IEEE Trans Pattern Anal
Mach Intell 31(11):1955–1967
27. Wright J, Yang A, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse repre-
sentation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
Chapter 4
Recognizing Altered Facial Appearances
Due to Aging and Disguise

Richa Singh

Abstract This chapter focuses on recognizing faces with variations in aging and
disguise. In the proposed approach, mutual information based age transformation
algorithm registers the gallery and probe face images and minimizes the variations
in facial features caused due to aging. Further, gallery and probe face images are
decomposed at different levels of granularity to extract non-disjoint spatial features.
At the first level, face granules are generated by applying Gaussian and Laplacian
operators to extract features at different resolutions and image properties. The second
level of granularity divides the face image into vertical and horizontal regions of
different sizes to specifically handle variations in pose and disguise. At the third level
of granularity, the face image is partitioned into small grid structures to extract local
features. A neural network architecture based 2D log polar Gabor transform is used
to extract binary phase information from each of the face granules. Finally, likelihood
ratio test statistics based support vector machine classification approach is used to
classify the granular information. The proposed algorithm is evaluated on multiple
databases comprising disguised faces of real people, disguised synthetic face
images, faces with aging variations, and disguised faces of actors and actresses from
movie clips that also have aging variations. These databases cover a comprehensive
set of aging and disguise scenarios. The performance of the proposed algorithm is
compared with existing algorithms and the results show that the performance of the
proposed algorithm is significantly better than existing algorithms.

4.1 Introduction
Automatic face recognition is a long-standing problem in computer vision that
requires the ability to identify an individual despite several variations in the appear-
ance of the face [18]. Existing commercial face recognition systems capture faces of

R. Singh (B)
IIIT Delhi, Delhi, India
e-mail: rsingh@iiitd.ac.in


Fig. 4.1 Face images of an individual captured under varying conditions

cooperative individuals in a controlled environment as part of the recognition process.
The results of the face recognition test reports FRGC 2004 [27] and FRVT 2006 [28]
show that under normal changes in a controlled environment, the performance of
existing face recognition systems is greatly enhanced. However, in real world appli-
cations, we have to deal with non-ideal data. As shown in Fig. 4.1, many applications
require face images to be captured outdoors where the lighting conditions are unpre-
dictable, the subjects may not be cooperative, the poses may vary, or the angles
and distances from the camera may not be normal. The performance of current face
recognition systems significantly deteriorates for such imperfect and challenging
cases.
In the literature, researchers have identified illumination, expression and pose as the
three main challenges of face recognition. However, in real world applications, dif-
ferent law enforcement agencies have identified that facial aging/growth and disguise
are also important issues. Therefore, the problem is expanded to include these addi-
tional challenges of face recognition:

• Illumination: Images with proper illumination captured under a controlled envi-
ronment are ideal for face recognition. However, face images with illumination
variations (Fig. 4.2a) reduce the performance of recognition algorithms because
illumination variations may cause some features to be hidden or altered.
• Image Quality: The quality of a face image depends on various features such
as motion blur, sensor noise, environmental noise, image resolution, and gray
scale/color depth. Any degradation in the quality of face images, as shown in
Fig. 4.2b, can lead to reduced recognition performance.
• Expression: Variations in expression can cause deformation in local facial struc-
ture and also change the facial appearance and local geometry of the face.
Figure 4.2c shows examples of variations in facial features which can reduce the
recognition accuracy.
• Pose: Frontal face images contain ample information to be used for face recogni-
tion. However, in a profile or semi-profile face image, as shown in Fig. 4.2d, some
features are not visible and matching a frontal face image with a profile face image
may produce incorrect results.
• Aging: Temporal variations in the human face are a regular process, and with age
progression these variations may even lead to major structural changes. As shown

Fig. 4.2 Challenges in face recognition due to variations in a illumination, b image quality,
c expression, d pose, e aging, and f disguise

in Fig. 4.2e, aging variations among the three face images of an individual are
difficult to handle by an automated face recognition system.
• Disguise: The most complicated among all the challenges is the variations due to
disguise in which an individual uses makeup accessories to alter the facial features
and impersonate another person or hide one’s identity. As shown in Fig. 4.2f, simple
accessories such as a beard and mustache can change the appearance of an individual.

Other than these challenges, there are some issues that are pertinent to law
enforcement applications such as limited training dataset, matching a scanned face
image with digitally captured face images, and matching sketch images generated by
forensic artists with digital face images. Some of these issues are directly related
to the above-mentioned challenges, e.g. matching a scanned face image with digital
face images where image quality plays an important role. However, challenges such
as matching sketches with a database of digital images require specially designed
algorithms.
Over the last 40 years, researchers have proposed several algorithms to address
some of these issues, such as the design of illumination compensation algorithms
and 3D/mosaic models to address pose variations. Among the above mentioned
challenges, aging and disguise have received little attention. In forensic and law
enforcement applications, aging and disguise are the issues which severely affect
the performance of existing face recognition systems. As shown in Fig. 4.3, active
disguise (changes induced deliberately) and passive disguise (changes due to age pro-
gression) are difficult challenges that lack meticulous research. This chapter focuses
on the challenging problems of face verification with age progression and variations
in disguise.

4.1.1 Literature Review of Face Recognition Under Variations in Aging and Disguise

The human face undergoes significant changes as a person grows older. The facial fea-
tures vary for every person and are affected by several factors such as exposure to
sunlight, inherent genetics, and nutrition. The performance of face recognition sys-
tems cannot contend with the dynamics of this temporal metamorphosis over a period
of time. Law enforcement agencies regularly require matching a probe image with
the individuals in a missing person database. In such applications, there may be a
significant difference between the facial features of probe and gallery images due to
age variation. For example, if the age of a probe image is 15 years and the gallery
image of the same person is of age 5 years, existing face recognition algorithms are inef-
fective and may not yield the desired results. One approach to handle this challenge
is to regularly update the database with recent images or templates. However, this
method is not feasible for applications such as border control, homeland security, and
missing person verification. To address this issue, researchers have proposed several
computational models for the estimation of age progression and face recognition

Fig. 4.3 a Active disguise due to artifacts or accessories (from the National Geographic
Database [33]) and b passive disguise due to age variations (from the FG-Net database,
http://www.fgnet.rsunit.com)

algorithms across age progression. A computational model for estimating facial
appearances across different ages should incorporate the factors that influence the
process of aging, for example gender, ethnicity, diet, and age group. Burt and Perrett
[5] identified seven age groups spanning ages 20–54 years. They characterized
the face shapes of each input face image by locating a set of fiducial features and
generated the composite image for each age group by averaging the shape and skin
color from face images of the same age group. Tiddeman et al. [40] extended the idea
proposed by Burt and Perrett and used wavelet based methods for defining facial tex-
tures and creating ‘composite faces’. They used edge strength and Markov Random
Fields to retain edges (wrinkles) and improve texture characterization respectively.
Ramanathan and Chellappa worked on facial aging effects especially for individuals of
age 0–18 years [30]. They used the fact that the primary reason for facial development
in formative years is craniofacial growth. They combined cardioidal strain transfor-
mation along with age-based anthropometric data to develop a facial aging model.
In another paper, they introduced shape and texture variation model for estimating
facial aging in adults [31]. The shape variation model was developed by physical
models that characterize the functionalities of different facial muscles such as linear,
sheet, and sphincter muscles. The texture variation model was used to characterize
facial wrinkles in predesignated facial regions such as the forehead and nasolabial
regions. Suo et al. [39] proposed to use the multi-resolution face models built upon
‘And’-‘Or’ graphs. The ‘And’ nodes were used in finding the fine representation of
faces whereas the ‘Or’ nodes were used in finding different possible configurations.

Since it used multiple resolutions, the global and local appearance variations such
as color, skin pigments, and wrinkles were modeled at different resolutions.
The methods of face recognition across age progression can be categorized as
generative and non-generative [19, 32]. Generative methods involve inducing the
changes in the input facial images to incorporate aging variations. Non-generative
methods do not involve any change in the input data but age-invariant signatures
are computed from the input faces and employed to perform identity verification.
Ling et al. [21] proposed a non-generative method in which they showed that the
image gradient orientations are age-invariant and remain similar across ages. They
designed a face operator to extract the image gradient orientations and used SVM
classifier to do verification of faces across ages. Singh et al. [37] proposed a mutual
information-based transformation algorithm to minimize the difference between two
age separated images. Park et al. [26] developed a 3D facial aging model to solve
the problem of age-invariant face recognition. Their approach is based on the fact
that exact craniofacial aging can be developed only in 3D. The advantage of using a
3D model is that it reduces the variations due to lighting and pose. Mahalingam and
Kambhamettu [23] used graph based face matching. They extracted the feature points
using local feature analysis (LFA) and the face descriptor was developed by applying
LBP to each feature point. They developed an age model for each subject and face
matching was done using a 2-step approach involving MAP and graph matching.
Guo et al. [13] studied the relationship between face recognition accuracies and the
age intervals. They performed their experiments on MORPH-II, a very large face
recognition database. They found that when the age gap between the gallery and
probe image is more than 15 years, the performance degrades much more than when
the gap is within 15 years. They also found that the use of soft biometric features can
help in improving face recognition across age progression. Li et al. [19] proposed a
discriminative model for achieving age-invariant recognition. They developed an
approach involving the use of scale invariant feature transform (SIFT), multi-scale
local binary pattern as local descriptors, and multi-feature discriminant analysis. The
proposed scheme outperformed the commercial system ‘FaceVacs’ by a significant
margin.
Another important challenge is disguise variation, in which the inter-personal and
intra-personal characteristics can be modified to alter the appearance of an indi-
vidual, to impersonate someone else’s identity, or to hide one’s own identity. The
ability to recognize individuals who deliberately alter their appearance is very
important in security applications for identifying terrorists, controlling access to
sensitive areas, or monitoring border crossings. The recognition of faces with dis-
guise has only recently been addressed by a few researchers. In 1995, a point dis-
tribution model-based method to model face shape, and to obtain a shape-free gray
image, was proposed [17]. On a database of 60 disguised face images, less than
50 % correct classification accuracy with various combinations of shape and gray
image models was reported. Later, another method, in which a face image is divided
into local regions which are matched using a probabilistic matching approach, was
proposed in [24]. Ramanathan et al. [33] reported that disguise or occlusion affects
face similarity measures more than expression variations. A variant of Independent

Component Analysis (ICA), called Locally Salient ICA (LS-ICA), was proposed in
[16]. This approach involves weighting local salient regions over which the kurto-
sis is maximized. Experiments performed on a synthetically occluded dataset show
that LS-ICA outperforms other subspace based methods. The performance of LS-
ICA is, however, not evaluated on any real-world occlusion/disguise dataset. The Sparse
Representation based Classifier (SRC) was proposed by Wright et al. [43] to address
the covariate of disguise/occlusion. In SRC, occlusion is handled by the mechanism
of source-to-error separation. Consistent with the observations in [24], they observe
that as the amount of occlusion increases, the performance of subspace based algo-
rithms as well as SRC decreases. Singh et al. [36] proposed a technique based on
texture information of face images extracted using 2D log polar Gabor transform
for addressing the disguise covariate. The algorithm also utilizes dynamic neural
network to learn discriminative features from Gabor coefficients. Yang and Zhang
[44] proposed a Gabor occlusion dictionary in SRC to address the problem of partial
occlusion. With experiments on synthetic occlusion and real occlusion datasets, they
have shown performance improvement over SRC.

4.1.2 Motivation and Contributions

Generally, face recognition algorithms either use facial information in a holistic way
or extract features and process them in parts. On the other hand, cognitive neurosci-
entists have observed that humans solve problems using perception and knowledge
represented at different levels of information granularity [38]. Humans recognize
faces using a combination of holistic approach together with discrete levels of infor-
mation or features. They can identify specific facial features and associate a contex-
tual relationship among them to recognize a face even with altered, occluded, and
aged appearances. This chapter proposes preprocessing and face verification algo-
rithms to recognize1 faces with variations in aging and disguise. Figure 4.4 illustrates
the concept of the proposed algorithm. First, an age transformation algorithm is pro-
posed that registers two face images and minimizes the aging variations. Unlike the
conventional method, the gallery face image is transformed with respect to the probe
face image and facial features are extracted from the registered gallery and probe
face images. We further propose a granular approach for face verification by extract-
ing dynamic feed-forward neural architecture based 2D log polar Gabor (termed
as 2DLPG-NN) phase features [36] at different granular levels. The granular lev-
els exhibit non-disjoint spatial features which are classified using a likelihood ratio
based Support Vector Machine (LR-SVM) classification algorithm. The proposed
face verification algorithm is validated using five face databases. The Notre Dame
face database [12] (http://www.nd.edu/~cvrl/UNDBiometricsDatabase.html) is used
for performance evaluation since it contains comprehensive variations in expression

1 Recognition can be either 1:1 matching (verification) or 1:N matching (identification). Note that,

in this paper, recognition and verification are used interchangeably.



(Block diagram: the gallery image and the age-transformed probe image each pass through image registration, granulation, and feature extraction; the granular outputs are combined by likelihood ratio-support vector machine classification to produce an accept/reject decision.)

Fig. 4.4 Illustrating the concept of the proposed face recognition algorithm

and illumination, and has been widely used for evaluating face recognition algo-
rithms. Along with it, another face database is used which contains images with
age variations and three face databases specifically aimed at validating the perfor-
mance for disguised face images. The performance of the proposed face verification
algorithm is also compared with existing face verification algorithms.

4.2 Proposed Registration Based Age Transformation Algorithm

Face verification with age variations is a challenging problem and most of the exist-
ing algorithms do not yield good recognition performance for face images with large
age differences. The growth of facial features is dynamic in nature and depends on
several factors such as genetics, environment, and living habits. Researchers have
used anthropometric measures or learning methods to simulate the facial growth and
applied face verification algorithms. However, existing age simulation algorithms do
not capture the dynamics of facial growth, thus causing lower verification accu-
racy. Another way to address the issue is to minimize the aging difference between
the gallery and probe face images using a registration technique. In this chapter,
a mutual information based age transformation algorithm is proposed which can
be used to register face images with an age difference between them. Age difference
minimization using the registration of gallery and probe face images is described as
follows:
Let F be a face image and subscript g and p represent the gallery and probe.
Therefore, let Fg and Fp be the detected gallery and probe face images to be matched,
respectively. Fg (x, y) and Fp (x, y) are transformed into polar form to obtain FgT (r, θ )
and FpT (r, θ ) respectively. Here, r and θ are defined with respect to the center coor-
dinate (xc , yc ). 
r = \sqrt{(x - x_c)^2 + (y - y_c)^2}, \quad 0 \le r \le r_{\max}    (4.1)

Fig. 4.5 Example of cartesian to polar coordinate conversion

 
\theta = \tan^{-1}\left(\frac{y - y_c}{x - x_c}\right)    (4.2)

The coordinates of eyes and mouth are used to form a triangle and the center of this
triangle is chosen as the center point, (xc , yc ), for cartesian to polar conversion.
Figure 4.5 shows an example of cartesian to polar conversion around the center
point. This cartesian to polar conversion eliminates minor variations due to pose
and provides robust feature mapping used in the next steps of the algorithm.
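
As an illustration of Eqs. (4.1) and (4.2), the following Python sketch resamples a face image into polar coordinates around the centroid of the eyes-mouth triangle; the sampling resolution, the nearest-neighbour interpolation and the example landmark coordinates are assumptions made for illustration, not part of the original algorithm description.

import numpy as np

def to_polar(face, center, r_max, n_r=128, n_theta=128):
    # Sample F(x, y) on a regular (r, theta) grid, following Eqs. (4.1)-(4.2)
    xc, yc = center
    rs = np.linspace(0.0, r_max, n_r)
    thetas = np.linspace(-np.pi, np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(rs, thetas, indexing="ij")
    x = np.clip(np.round(xc + rr * np.cos(tt)).astype(int), 0, face.shape[1] - 1)
    y = np.clip(np.round(yc + rr * np.sin(tt)).astype(int), 0, face.shape[0] - 1)
    return face[y, x]                     # polar image F^T(r, theta)

# Hypothetical landmarks (x, y): left eye, right eye, mouth; centre of the triangle
landmarks = np.array([(40.0, 52.0), (88.0, 52.0), (64.0, 96.0)])
xc, yc = landmarks.mean(axis=0)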
A mutual information based registration algorithm is used to minimize the geo-
metric differences between the gallery and probe face images. Mutual information is a concept from
information theory in which statistical dependence is measured between two random
variables. The mutual information between two polar face images can be represented as
[22, 29]:
M(F_g^T, F_p^T) = H(F_g^T) + H(F_p^T) - H(F_g^T, F_p^T)    (4.3)

where H(·) is the entropy of a face image and H(F_g^T, F_p^T) is the joint entropy of
the gallery and probe face images. Registering a gallery face image F_g^T with respect
to a probe face image F_p^T requires maximizing the entropies H(F_g^T) and H(F_p^T) and
minimizing the joint entropy H(F_g^T, F_p^T). Further, Hill et al. [15] proposed the normalized
mutual information, which can be represented as:

\hat{M}(F_g^T, F_p^T) = \frac{H(F_g^T) + H(F_p^T)}{H(F_g^T, F_p^T)}    (4.4)

In this research, the normalized mutual information is used for minimizing aging
differences in a transformation space S, which includes parameters for shear, scale,
rotation, and translation. The optimal transformation parameters S* are obtained
by exploring the search space S using normalized mutual information. The search
strategy is defined in Eq. 4.5.

S^* = \arg\max_{S} \{\hat{M}(F_p^T, S(F_g^T))\}    (4.5)
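
A minimal Python sketch of the objective used in Eqs. (4.4)-(4.5) is given below; the histogram-based entropy estimate and the number of bins are implementation assumptions. A generic optimizer would then search the transformation space S (shear, scale, rotation, translation) for the parameters that maximize this value.

import numpy as np

def normalized_mi(a, b, bins=64):
    # Normalized mutual information of Eq. (4.4): (H(A) + H(B)) / H(A, B)
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return (entropy(px) + entropy(py)) / entropy(pxy.ravel())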

The gallery and probe face images FgT and FpT are registered using the optimal
transformation parameters S ∗ . This registration algorithm is linear and does not
accommodate the non-linear dynamic variations present in face images. To address

the non-linearity present in face images, a multiresolution image pyramid scheme is
applied with the registration algorithm. A Gaussian pyramid of both the gallery
and probe face images is constructed. Registration parameters are estimated at the
coarsest level and then used to warp the gallery face image in the next level of the
pyramid. The process is iteratively repeated through each level of the pyramid and
a final transformed gallery face image is obtained at the finest pyramid level. In
this manner, the global variations are handled at the coarsest resolution level and
the local non-linear variations at the finest resolution level. Besides minimizing the
age differences, this approach also reduces the differences due to minor pose and
expression variations. Thus, as shown in Fig. 4.6, this algorithm can also be viewed
as a preprocessing scheme to reduce the spatial variations between the gallery and
probe images. The registered face images are finally transformed back to Cartesian
coordinates from polar coordinates. Figures 4.6 and 4.7 show the registered gallery
face images transformed with respect to the probe face images. Since the algorithm
registers two images, it can minimize the difference between faces of two individu-
als. However, as shown in Fig. 4.7, the inter-personal differences cannot be exactly
registered and any efficient face recognition algorithm can reject these transformed
impostor images. Once the age difference between the gallery and probe face image
is minimized, face verification algorithms can be applied to verify the identity of the
probe image.
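
The coarse-to-fine strategy described above can be sketched as follows (Python); for brevity only translation is searched at each pyramid level, whereas the chapter's search space S also includes shear, scale and rotation. The sketch reuses the normalized_mi helper given after Eq. (4.5), and all parameter values are illustrative assumptions.

import numpy as np
from scipy.ndimage import zoom, shift

def register_coarse_to_fine(gallery, probe, levels=3, search=3):
    # Estimate parameters at the coarsest level, then refine level by level (finest last)
    offset = np.zeros(2)
    for level in reversed(range(levels)):
        scale = 1.0 / (2 ** level)
        g = zoom(gallery, scale, order=1)
        p = zoom(probe, scale, order=1)
        best, best_nmi = (0, 0), -np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                cand = shift(g, (offset[0] * scale + dy, offset[1] * scale + dx), order=1)
                nmi = normalized_mi(cand, p)       # objective of Eq. (4.4)
                if nmi > best_nmi:
                    best_nmi, best = nmi, (dy, dx)
        offset += np.array(best, dtype=float) / scale   # carry the estimate to the next level
    return shift(gallery, tuple(offset), order=1), offset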

4.3 Proposed Granular Approach for Recognition of Disguised Faces

Sinha et al. [38] established 19 results on the face recognition capabilities of the
human mind. They suggested that humans can efficiently recognize familiar face
images even with low resolution and noise. Moreover, high and low frequency facial
information are processed both holistically and locally. Shape, motion, and imaging
conditions are also important factors in the processing of facial information by the
human mind. On the other hand, researchers from psychology and neuro-cognition
have simulated and analyzed human visual cortex by conducting certain experiments.
They reported that the Gabor transform represents the properties of the human visual cortex
[9, 10, 42] and exhibits a log polar distribution response [11]. It is our hypothesis
that if these capabilities can be encoded in an automatic face recognition algorithm,
then the recognition performance of the algorithm can be made comparable to the
performance of the human mind for matching familiar faces.
To incorporate the above mentioned research findings, a granular approach is
proposed for facial feature extraction and matching. The approach is based on the
concept that the human mind solves problems at different levels of granularity [4, 20].
In this concept, known as granular computing, a unified framework is used to extract
non-disjoint features at different granularity levels, which are synergistically combined to
obtain more comprehensive information. The granular computing framework is more

Fig. 4.6 Results of the proposed age transformation algorithm on images from the FG-Net face
database when the gallery and probe images belong to the same individual (http://www.fgnet.rsunit.
com)

useful in solving problems when the global information is fuzzy. For example, it is
difficult to produce a correct verification result with a disguised face image. However,
with granulated information, more flexibility is achieved in analyzing the underlying

Fig. 4.7 Results of the proposed age transformation algorithm when the gallery and probe images
belong to different individuals (http://www.fgnet.rsunit.com). Any efficient face recognition algo-
rithm can identify these cases as impostors because the registration algorithm does not minimize
the inter-personal variations

information such as the nose, ears, forehead, hair, cheeks, or a combination of two or more
features.
There are four basic components of granular computing [4, 20]: granules, granular
structure, granulation, and information retrieval from granules.
• Granules are the basic component of granular computing. These are small particles
which collectively provide a representation with respect to a particular level. Gran-
ules have internal, external and contextual properties which reflect the interaction of
elements inside a granule, the interaction with other granules, and the relationships among
the granules.
• Granular structure provides a structured description of the problem. This structure
can be interpreted as a level or granulated view in the overall hierarchy.
• Granulation involves construction of the granules and granular structure, and
defines the representation and characterization of granules and its structure.
• Information retrieval from granules involves granular information processing such
as feature extraction and matching. This component may also involve comput-
ing the relation among granules, processing based on priors, and preservation of
invariant properties.

The proposed face recognition algorithm generates face granules by granulating
the face image at three different levels. The granules are generated such that non-disjoint
facial features can be extracted to provide resilience to variations in disguise. Further,
features are extracted from the face granules and a classification algorithm is proposed
to effectively perform matching. The face recognition algorithm thus comprises
three steps: generating granules from the face images, feature extraction from the
face granules along with matching, and classification of granular information.

4.3.1 Generating Granules from Face Image

Let F be the detected and preprocessed frontal face image of size n×n. Face granules
are generated pertaining to three different levels of granularity. In the first level,
face granules are generated by applying the Gaussian and Laplacian operators [6].
Let the granules generated by Gaussian and Laplacian operators be represented by
FGri , where i represents the granule number. For a face image of size 128 × 128,
Fig. 4.8 shows the face granules generated in the first level by applying Gaussian and
Laplacian operators. FGr1 to FGr3 are the granules generated by Gaussian operator
and FGr4 to FGr6 are the granules generated by Laplacian operator. The size of the
smallest granule in the first level is 32 × 32. These operators are applied only to three
levels of resolution because facial features are not useful for recognition when the
image size is less than 32 × 32. In these six granules, facial features are segregated at
different resolutions to provide edge information, noise, smoothness, and blurriness
present in a face image. This level of granular information thus provides resilience
to disguise variations in facial features such as eyes, mouth, and nose.
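
A possible implementation of this first level of granularity is sketched below in Python; the smoothing parameters and the use of a difference-of-Gaussians as the Laplacian-like operator are assumptions made for illustration, not the chapter's precise settings.

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def first_level_granules(face, levels=3, sigma=1.0):
    # Three Gaussian granules (FGr1-FGr3) and three Laplacian-like granules (FGr4-FGr6),
    # computed at 128x128, 64x64 and 32x32 for a 128x128 input face
    gaussian_granules, laplacian_granules = [], []
    current = face.astype(float)
    for _ in range(levels):
        smoothed = gaussian_filter(current, sigma)
        gaussian_granules.append(smoothed)
        laplacian_granules.append(current - smoothed)   # band-pass / edge information
        current = zoom(smoothed, 0.5, order=1)          # next, coarser resolution
    return gaussian_granules + laplacian_granules

granules = first_level_granules(np.random.rand(128, 128))
print([g.shape for g in granules])   # 128x128, 64x64, 32x32 Gaussian granules, then Laplacian granules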
Campbell et al. [7] reported that different inner and outer facial regions represent
distinct information which is helpful for face verification. Therefore, in the second
level of granularity, horizontal and vertical granules are generated by dividing the
face image F into different regions as shown in Figs. 4.9 and 4.10. Here, FGr7 to
FGr15 denote the horizontal granules and FGr16 to FGr24 denote the vertical granules.
Among the nine horizontal granules, the first three granules, i.e. FGr7, FGr8, and FGr9,
have the same size n × n/3. The next three granules, i.e., FGr10, FGr11, and FGr12, are
generated such that the size of FGr10 and FGr12 is n × (m − ε), where m = n/3, and the size of FGr11


Fig. 4.8 Face granules in the first level of granularity. FGr1 , FGr2 , and FGr3 are generated by the
Gaussian operator, and FGr4 , FGr5 , and FGr6 are generated by the Laplacian operator


Fig. 4.9 Horizontal face granules generated from a face image in the second level of granularity
(FGr7 to FGr15 )


Fig. 4.10 Vertical face granules generated from a face image in the second level of granularity
(FGr16 to FGr24 )

Fig. 4.11 Face granules in the third level of granularity represent local information (FGr25 to FGr40)

is n × (m + 2ε). Further, FGr13 , FGr14 , and FGr15 are generated such that the size of
FGr13 and FGr15 is n × (m + ε) and the size of FGr14 is n × (m − 2ε). Similarly, nine
vertical granules FGr16 to FGr24 are generated. Figures 4.9 and 4.10 show horizontal
and vertical granules when the size of the normalized face image is 128 × 128 and
ε = 10.2 This level of granularity provides the relation between horizontal and
vertical granules to handle minor variations in disguise such as glasses, hair style,
beard, and mustache.
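
The second-level horizontal granules can be sketched as follows (Python), under the assumption that m = n/3 (rounded down) and that the vertical granules are obtained analogously by splitting columns instead of rows; the handling of rounding is an implementation choice, not specified in the chapter.

import numpy as np

def horizontal_granules(face, eps=10):
    # Nine horizontal strips FGr7-FGr15: equal thirds, then (m-eps, m+2eps, m-eps),
    # then (m+eps, m-2eps, m+eps), with m = n/3
    n = face.shape[0]
    m = n // 3
    heights = [
        (m, m, m),
        (m - eps, m + 2 * eps, m - eps),
        (m + eps, m - 2 * eps, m + eps),
    ]
    granules = []
    for split in heights:
        top = 0
        for h in split:
            granules.append(face[top:top + h, :])
            top += h
    return granules

strips = horizontal_granules(np.random.rand(128, 128), eps=10)
print([s.shape for s in strips])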
Finally, for the third level of granularity, local fragments are extracted. Researchers
from cognition suggest that local facial fragments can provide robustness against
partial occlusion and changes in viewpoint [14, 34, 38]. Moreover, the human mind can
distinguish and classify individuals from their local facial fragments such as the nose,
eyes, and mouth. To incorporate this property, local facial fragments are extracted
and used as face granules in the third level of granularity. Given the eye coordinates,
16 local facial fragments are extracted using the golden ratio face template [3].
Each of these fragments is a granule representing local information that provides
invariant and unique features for handling variations due to expression and disguise.
Figure 4.11 shows an example of local facial fragments used as face granules in the
third level of granularity.
The proposed granulation technique is used to generate 40 non-disjoint face gran-
ules from a face image of size 128 × 128. Here, it is to be noted that the technique is
based on a fixed structure and no local feature based approach is used for generating
granules from frontal face images. For images captured from cooperative users, gran-
ulation can be performed according to the features. However, for altered images (both
aging and disguise), identifying features can be challenging and hence feature-based
partitioning may not yield accurate results compared to fixed structure partitioning.

2 In extensive experiments, it was observed that ε = 10 yields the best verification results for
face images of size 128 × 128.

4.3.2 Feature Extraction from Face Granules and Feature Matching

The next step in the proposed face recognition algorithm is feature extraction from
face granules. Here, the previously proposed dynamic feed-forward neural network
architecture based 2D log polar Gabor transform is used for extracting binary phase
features [36] from each face granule. As mentioned before, the 2D log polar Gabor
transform exhibits the properties of the visual cortex and therefore the feature extraction
algorithm encodes texture patterns that are unique and invariant. Binary phase
features pertaining to the gallery and probe images are matched using the Hamming
distance measure to generate the match scores. More details about the texture feature
extraction and matching algorithm can be found in [36].
For a given face image, the feature extraction process is performed separately for
all 40 face granules, i.e. computation of 40 phase features, one for each granule.
Thus, the algorithm is used to generate gallery phase templates corresponding to the
gallery face images, which are stored in the database. During a query, the phase template
for the probe image is generated and matched using the phase template matching
algorithm. The algorithm finally generates the match score vector s, where each element
si, (i = 1, . . . , 40), is associated with one of the 40 face granules. Each of these match
scores is in the range [0, 1], where 0 represents a perfect accept and 1 a perfect
reject.
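
For completeness, the granule-wise matching step can be sketched as follows (Python), assuming the binary phase templates are available as boolean arrays; the function names are illustrative and the actual 2D log polar Gabor feature extraction of [36] is not reproduced here.

import numpy as np

def hamming_score(template_a, template_b):
    # Normalized Hamming distance: 0 = perfect accept, 1 = perfect reject
    a = np.asarray(template_a, dtype=bool).ravel()
    b = np.asarray(template_b, dtype=bool).ravel()
    return np.count_nonzero(a ^ b) / a.size

def match_score_vector(gallery_templates, probe_templates):
    # One score per face granule, i.e. the 40-element vector s
    return np.array([hamming_score(g, p)
                     for g, p in zip(gallery_templates, probe_templates)])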

4.3.3 Granular Information Classification

The final step in the proposed granular approach for face verification is the classification
of the match score vector s to yield an output decision of genuine or impostor. For
disguise and aging cases, different face granules may provide conflicting decisions.
For example, due to disguise variations, Gaussian granules may reject a genuine
subject but some of the local granules may provide a decision to accept. In such
cases, a carefully designed classification algorithm is required that can efficiently
address these confounding match scores. This section presents a classification scheme
that integrates likelihood ratio test-statistic [25] in a SVM classification framework.
The likelihood ratio aspect of the algorithm makes it robust to uncertainties in the
face granules whereas the use of SVM ensures that the algorithm is less prone to
over-fitting thereby permitting it to handle nonlinear/conflicting information. The
classification algorithm consists of two steps: (1) transforming match scores into
likelihood ratio and (2) applying support vector machine (SVM) classification.
Support vector machine [41] is a pattern classifier that constructs non-linear hyper-
planes in a multidimensional space. In this research, the dual ν-SVM (2ν-SVM) [8] is
used, which is an attractive alternative to SVM and offers a much more natural setting
for parameter selection with reduced computational complexity. Traditionally, in bio-
metrics, SVM uses the match scores for classification. However, match scores may

provide inaccurate or fuzzy information due to environmental dynamics or matcher
limitations. By transforming the match scores into likelihood ratios, the uncertainties
associated with them can be addressed better [25]. Therefore, the input to the clas-
sification algorithm comprises likelihood ratios induced from the match scores. The
procedure for transforming match scores into likelihood ratios is described as follows:
For a two-class classification problem, i.e. Θ = {genuine, impostor}, the match
scores corresponding to all N = 40 face granules are first computed, i.e. s =
(s1 , s2 , . . . , sN ), and then the densities of the genuine and impostor scores (Jgen (s)
and Jimp (s), respectively) are estimated. In the proposed SVM classification, it is
assumed that the distribution of match scores is a Gaussian distribution, i.e.,
  
J_j(s_i, \mu_i^j, \sigma_i^j) = \frac{1}{\sigma_i^j \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{s_i - \mu_i^j}{\sigma_i^j} \right)^2 \right]    (4.6)

where μ_i^j and σ_i^j are the mean and standard deviation of the ith face granule corre-
sponding to the jth element of Θ. While this is a very strong assumption, it does not
impact the performance in the context of this application.
The likelihood ratio V_i = (J_gen(s_i) / J_imp(s_i)) a_i pertaining to each face granule is computed,
where a_i is the verification prior (i.e., the precision of the face granule obtained from
the training dataset). The resultant value V = (V_1, V_2, . . . , V_N) is used as input to
the SVM classifier. Further, utilizing the SVM classifier addresses the limitations of
the likelihood ratio test-statistic if the input data does not conform to the Gaussian
assumption (which is usually the case).
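
A minimal Python sketch of this transformation is given below; the per-granule parameters (μ, σ) and the priors a_i are assumed to have been estimated on the training data, and the variable names are illustrative.

import numpy as np

def likelihood_ratio_vector(s, gen_params, imp_params, priors):
    # Eq. (4.6) evaluated under the genuine and impostor hypotheses, combined as
    # V_i = (J_gen(s_i) / J_imp(s_i)) * a_i for each of the N face granules
    def gaussian(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    V = [a_i * gaussian(s_i, mu_g, sg_g) / gaussian(s_i, mu_i, sg_i)
         for s_i, (mu_g, sg_g), (mu_i, sg_i), a_i in zip(s, gen_params, imp_params, priors)]
    return np.array(V)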
In the training phase, likelihood values induced from the match scores and their
labels are used to train the 2ν-SVM for classification. Let the kth labeled training
data be represented as Zk = (Vk, Y), where Y ∈ Θ (or Y ∈ {+1, −1}, where +1
represents the genuine class and −1 represents the impostor class). The training data
is mapped to a higher dimensional feature space such that Zk → ϕ(Zk ) and ϕ(·) is
the mapping function. The optimal hyperplane which separates the complete training
data into two different classes in the higher dimensional feature space can be obtained
using 2ν-SVM learning [8].
In the testing phase, the same procedure for computing match score induced
likelihood ratio is followed and Vprobe is classified using the trained 2ν-SVM. To
verify the identity, a decision to accept or reject is made on the test pattern using a
threshold t,

\mathrm{Decision}(f(V_{\mathrm{fused}})) = \begin{cases} \text{Accept}, & \text{if SVM output} > t \\ \text{Reject}, & \text{otherwise.} \end{cases}    (4.7)
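
The classification stage can be approximated with a standard ν-SVM, as in the following Python sketch; scikit-learn's NuSVC is used here only as a stand-in for the 2ν-SVM of [8], and the kernel, ν value, threshold and synthetic data are illustrative assumptions.

import numpy as np
from sklearn.svm import NuSVC

# Illustrative stand-in data: 40-dimensional likelihood-ratio vectors V and labels
rng = np.random.default_rng(0)
V_train = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 40))
y_train = np.where(rng.random(200) > 0.5, 1, -1)     # +1 genuine, -1 impostor
V_probe = rng.lognormal(mean=0.0, sigma=1.0, size=(5, 40))

svm = NuSVC(nu=0.5, kernel="rbf", gamma="scale").fit(V_train, y_train)

# Decision rule of Eq. (4.7): accept when the SVM output exceeds the threshold t
t = 0.0
decisions = np.where(svm.decision_function(V_probe) > t, "Accept", "Reject")
print(decisions)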

4.4 Databases and Algorithms Used for Validation

To evaluate the performance of the proposed granular approach for face recognition,
five face databases are used and the performance is compared with five existing
algorithms. Details of these databases and algorithms are presented in Sects. 4.4.1
and 4.4.2 respectively.

4.4.1 Databases Used for Validation

Currently, there is no comprehensive face disguise database available in the public
domain. The authors have prepared three databases which contain face images with
different disguise variations. These three databases are compiled using disguised
faces of real people, disguised synthetic faces, and disguised faces of actors and
actresses extracted from movie clips. The Notre Dame face database [12] (http://
www.nd.edu/~cvrl/UNDBiometricsDatabase.html) is also used to evaluate the per-
formance of the proposed face recognition algorithm. The characteristics of these
databases are explained below.

1. Notre Dame Face Database [12] (http://www.nd.edu/~cvrl/UNDBiometrics


Database.html): This database is a part of the NIST Face Recognition Grand
Challenge (FRGC). We use collection B of the Notre Dame face database which
contains around 35,000 high resolution frontal face images with different lighting
conditions and expressions. This database also provides images of a few individuals with a time difference of 6 months to 1 year. It is one of the most comprehensive
face databases widely used for evaluating the performance of face recognition
algorithms.
2. Aging Database: To evaluate the performance of the proposed algorithm on aging
variations, the face aging database is used which comprises 1578 images of 130
individuals or subjects. The images are obtained partly from the FG-Net database
(http://www.fgnet.rsunit.com) and partly collected by the author.
3. Real Face Disguise Database: This database is prepared by the author. It contains
face images of 25 individuals with 15–25 different variations in images of each
individual. To prepare this database, the possible disguise variations are first analyzed and classified into eight categories depending on their effect on appearance and facial features. The database is collected to comprehensively include the following eight categories.

• Minimal variations
• Variations in hair style
• Variations in beard and mustache
• Variations in glasses
• Variations in cap and hat
• Variations in lips, eyebrows and nose
• Variations in aging and wrinkles
• Multiple disguise variations

Table 4.1 Details of the face disguise databases used for evaluation
                                        Disguise database—images per subject
Disguise categories                     Real faces                     Synthetic faces
                                        No. of subjects = 25           No. of subjects = 100
                                        Images per subject = 15–25     Images per subject = 40
                                        Total no. of images = 475      Total no. of images = 4000
Minimal variations                      4–5                            4
Variations in hair style                1–3                            4
Variations in beard and mustache        1–2                            4
Variations in glasses                   1–2                            4
Variations in cap and hat               1–3                            2
Variations in lips, eyebrows and nose   1–2                            4
Variations in aging and wrinkles        1–3                            4
Multiple disguise variations            4–6                            14
Since the goal of this research is to evaluate the performance of the proposed
face recognition algorithm on disguise, the database contains frontal face images
with less emphasis on variations due to illumination, expression, and pose. Details of the number of images for each category of disguise variations are provided in Table 4.1. Figure 4.12 shows an example of this database.
4. Synthetic Face Disguise Database: Creating a large real face disguise database
is a challenging task. In previous work [36], the FACES software (http://www.iqbiometrix.com/products_faces_40.html) was used to generate 4000 frontal face images of 100 subjects with a comprehensive set of disguise variations. Details of the number of images in each of the eight categories are listed in Table 4.1.
Figure 4.13 shows an example of 40 variations of the same face image generated
using disguise accessories and makeup tools. This database is used to comprehen-
sively evaluate the performance of the proposed algorithm for disguised images.
5. Movie Face Database: Movies are a good source for finding face images of a large
number of individuals with variations in aging and disguise. Over a period of time,
an actor or actress uses makeup tools and accessories providing face images with
aging and disguise variations. A face database is prepared by collecting frontal
face images of actors and actresses from movie clips. This database contains 3500
face images pertaining to 100 individuals, i.e., 35 images per individual. These
images contain variations in expression, illumination, age, and disguise along
with minor variations in pose. This is the most challenging disguise database
with a number of non-ideal real world images.

Fig. 4.12 Images from the real face disguise database showing 18 variations of an individual

4.4.2 Face Recognition Algorithms Used for Comparison

To compare the performance of the proposed algorithm, five existing face recognition
algorithms are selected namely Principal Component Analysis (PCA) [2], Half Face
[33], Eigen-eyes based algorithm [35], Local Binary Pattern (LBP) [1], and 2D Log
Polar Gabor Transform with Neural Network Architecture (2DLPG-NN) [36].

4.5 Experimental Results

To evaluate the performance of the proposed granular approach for face verification,
face images in all five databases are first detected manually by locating the eye and mouth coordinates. The size of the detected face images is normalized to 128 × 128 pixels. This section
is divided into four parts: (a) experimental protocol, (b) performance evaluation of
face recognition algorithms, (c) effect of aging on verification performance, and (d)
effect of different types of disguise on verification performance.

Fig. 4.13 Images from the synthetic face disguise database showing 40 variations of the same
face [36]

4.5.1 Experimental Protocol

In most of the real world applications, a face verification system is first trained on
a training database and then the trained system is used to perform recognition on
the gallery-probe face database. It is highly likely that there is no overlap between
the subjects used in the training database and the subjects in the gallery-probe data-
base. To evaluate the performance in such an application scenario, the databases
are partitioned into two groups: training database and testing database. From each
of the face databases, some subjects are selected for training and the remaining for
testing. Face images pertaining to 282 subjects (40 % from each database) are used
to train the proposed algorithm (estimating densities, learning parameters of SVM
classifier, computing granulation parameters, learning feature extraction algorithm,
and computing verification priors). The remaining images pertaining to 423 subjects
(60 % from each database) are used as the test (gallery-probe) database for perfor-
mance evaluation. This partition ensures that the verification is performed on unseen
images. The train-test partitioning is repeated five times and the Receiver Operat-
ing Characteristics (ROC) curves are generated by computing the false rejection

rates (FRR) over these trials at different false accept rates (FAR). The verification
accuracy is computed at 0.01 % FAR. The above experimental protocol is further
used to compare the performance with other face recognition algorithms.
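For concreteness, a small Python sketch of how the verification accuracy at a fixed FAR can be computed from genuine and impostor score sets is shown below. It is not the authors' evaluation code and assumes that larger scores indicate better matches.

import numpy as np

def verification_accuracy(genuine, impostor, far=1e-4):
    # Verification accuracy = 1 - FRR at the operating threshold that yields
    # the requested FAR (0.01 % FAR corresponds to far = 1e-4).
    thr = np.quantile(np.asarray(impostor, dtype=float), 1.0 - far)
    frr = np.mean(np.asarray(genuine, dtype=float) < thr)
    return 100.0 * (1.0 - frr)   # in percent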

4.5.2 Performance Evaluation of Face Recognition Algorithms

The verification performance of the proposed non-disjoint spatiotemporal face recognition algorithm is evaluated and a comparison is performed with existing face recog-
nition algorithms using five face databases. The results are summarized in Table 4.2
and the ROC plots for each of the five databases are shown in Figs. 4.14, 4.15, 4.16,
4.17 and 4.18.
On the Notre Dame face database [12], existing face recognition algorithms yield
verification accuracy ranging from 49.1 to 93.4 %. The lower performance of PCA, Half Face, and Eigen-eyes is due to the variations in expression and illumination
present in the Notre Dame face database. Since the experiments are performed with
single gallery per subject, these algorithms are not able to obtain an individualistic
representation for every subject which is required to match face images with different
variations. LBP and 2DLPG-NN can verify an individual if the variations are minor
or moderate. However, the performance of these two algorithms is affected by
multiple variations such as matching indoor face images with outdoor face images
that have variations in facial expression.
The proposed face recognition algorithm outperforms existing algorithms by at least 5.7 % and gives an accuracy of 99.1 %. The proposed algorithm provides such a high level of performance because the non-disjoint texture features extracted at different levels of granularity provide local and global information at multiple resolutions. Also, the face granulation step incorporates cognitive and psychological research findings, and feature extraction uses the 2D log polar Gabor transform, which emulates the visual cortex [9]. Finally, granular information classification is performed using the proposed algorithm that integrates likelihood test statistics in an SVM fusion framework, which makes it resilient to sensor noise and environmental dynamics.
The other four databases contain comprehensive disguise and aging variations.
Using these databases, the verification accuracy of the proposed face recognition
algorithm is in the range of 81.5–95.7 % which is at least 7.8 % better than existing
algorithms. Existing algorithms perform better on the synthetic face disguise database compared to the other three databases because the three databases with real face images
contain minor variations in pose, expression, and illumination along with disguise
and aging. The absence of all other variations in the synthetic face disguise database
leads to improved verification performance. Using one gallery image per subject, the
proposed face recognition algorithm shows a significantly high level of performance
on the aging and disguise databases that contain images with minor variations in pose,
expression, and illumination. Finally, t-test based statistical analysis is performed and
it is observed that the proposed algorithm is significantly different from existing face verification algorithms.

Table 4.2 Comparing performance of the proposed algorithm with existing face recognition algorithms
Database                  Verification accuracy (%) at 0.01 % FAR
                          PCA [2]   Half face [33]   Eigen-eyes [35]   LBP [1]   2DLPG-NN [36]   Proposed
Notre Dame [12]           49.1      61.3             62.8              81.2      93.4            99.1
Aging                     19.7      29.1             27.3              55.9      64.0            81.5
Synthetic face disguise   67.2      70.0             73.7              86.1      87.9            95.7
Real face disguise        19.1      21.6             17.8              57.5      74.8            90.0
Movie face                30.9      33.2             43.9              55.7      72.9            86.4

Fig. 4.14 ROC plot on the Notre Dame face database [12] (Genuine Accept Rate (%) versus False Accept Rate (%) for PCA, Half Face, Eigen-eyes, LBP, 2D Log Polar Gabor, and the proposed granular approach)

Fig. 4.15 ROC plot on the aging database (Genuine Accept Rate (%) versus False Accept Rate (%) for the same six algorithms)



Fig. 4.16 ROC plot on the real face disguise database (Genuine Accept Rate (%) versus False Accept Rate (%) for the same six algorithms)

Fig. 4.17 ROC plot on the synthetic face disguise database (Genuine Accept Rate (%) versus False Accept Rate (%) for the same six algorithms)

4.5.3 Effect of Aging on Verification Performance

The performance of the proposed algorithm is evaluated for aging variations using
the face aging database. The face database is divided into three age groups:
(1) 1–18 years, (2) 19–40 years, and (3) beyond 40 years. The database is divided
into these three age groups because it is observed that face development depends on
the age of a person. For example, development in muscles and bone structure causes
significant changes in a face during the age of 1–18 years. From 19 to 40 years, the
growth rate is comparatively lower, whereas after 40 years, wrinkles and skin loos-
ening cause major change in facial features and appearance. It is thus very difficult

Fig. 4.18 ROC plot on the movie database (Genuine Accept Rate (%) versus False Accept Rate (%) for the same six algorithms)

to accurately model an individual’s face for large age variations such as from age 10
to age 60. Considering these factors, the performance is evaluated for individual age
groups. Details of the number of images in these three age groups are provided in
Table 4.3.
The experiments show that for the age group of 1–18 years, the verification accu-
racy is 65.2 % whereas with other age groups it is greater than 76 % (Table 4.4).
The results also suggest that the proposed algorithm yields better performance for
19–40 years, and beyond 40 years age groups compared to 1–18 years age group.
The difference is because there is a greater change in facial features during the 1–18 years age group than in any other age group. For example, if the age difference between the
gallery and probe images is 2 years, the rate of growth during 1–18 years is faster
than 19–40 years or beyond 40 years. Therefore, it can be asserted that due to feature
stabilization, other age groups provide better verification performance. The results
show that the age group of beyond 40 years yields the best verification accuracy of
87.5 %. The next experiment is performed to evaluate the performance of the pro-
posed face recognition algorithm without applying the age transformation algorithm.
Table 4.4 summarizes the verification performance of the proposed face recognition
algorithm with and without the proposed age transformation algorithm for the three
age groups. For the age group of 1–18 years, an improvement of 41.8 % is observed
with the use of the proposed age transformation algorithm. Since the transformation
algorithm minimizes the variations by registering the gallery and probe images both
locally and globally, the performance of face recognition is greatly enhanced. Simi-
larly, for other age groups, the performance of face recognition improves by at least
14.4 %.

Table 4.3 Details of the face aging database
Age group         Number of subjects   Total number of images   Images per subject   Average age difference (years)
1–18 years        59                   605                      5–16                 8
19–40 years       68                   735                      8–15                 10
Beyond 40 years   34                   238                      4–10                 5

Table 4.4 Verification results of the proposed age transformation algorithm
Age group         Verification accuracy (%) at 0.01 % FAR                                  Improvement in verification accuracy
                  Without registration based        With registration based
                  age transformation                age transformation
1–18 years        23.4                              65.2                                    41.8
19–40 years       51.7                              76.8                                    25.1
Beyond 40 years   73.1                              87.5                                    14.4

4.5.4 Effect of Disguise on Verification Performance

Security applications generally use human observers to recognize the face of an individual. In some applications, face recognition systems are used in conjunction
with limited human intervention. For autonomous operation, it is important that the
face recognition systems be able to provide high reliability and accuracy under diverse
conditions, including disguise. In this section, the performance of the proposed and
existing face recognition algorithms is analyzed for each disguise category on the
real, synthetic, and movie face databases.
Table 4.5 summarizes the performance of face recognition algorithms for differ-
ent categories of disguise variation. For most of the variations, appearance based
algorithms [2, 33, 35] yield lower verification accuracy because these algorithms use facial appearance to determine the identity, and makeup tools and accessories significantly alter the facial appearance. On the other hand, texture based algorithms [1, 36] provide significantly better verification accuracy compared to appearance based algorithms. However, even these algorithms do not yield good verification accuracy with moderate to large disguise variations.
The proposed face recognition algorithm emulates the problem solving approach
of human mind and the results show that by fusing the granular information obtained
at various levels of granularity, the proposed algorithm yields high verification accu-
racy. In the experiments, it is observed that the first level of granularity provides
resilience to disguise variations in eye and nose regions whereas variations due to
glasses, hair style, beard, and mustache are addressed by the second level of granu-
larity. Local facial regions, in the third level of granularity, can handle variations in
nose, eyes, and mouth regions. However, the performance of the proposed algorithm

Table 4.5 Comparing the performance of the proposed algorithm with existing algorithms for each disguise category
                                         Verification accuracy at 0.01 % FAR (%)
Database    Disguise categories          PCA [2]   Half face [33]   Eigen-eyes [35]   LBP [1]   2DLPG-NN [36]   Proposed
Synthetic   Minimal                      89.5      72.7             93.1              98.5      98.9            100
            Hair style                   85.8      89.0             88.4              95.1      95.4            99.7
            Beard + mustache             43.2      45.3             92.5              83.8      84.7            96.6
            Glasses                      60.9      62.7             13.7              84.3      86.0            96.1
            Cap/hat                      82.7      85.1             79.2              93.7      95.1            99.9
            Lips/eyebrows/nose           86.8      89.3             79.7              96.9      97.2            99.4
            Aging/wrinkles               75.6      77.9             73.1              94.5      95.6            99.3
            Multiple disguise            18.7      21.1             50.3              63.4      72.9            89.5
Real        Minimal                      31.3      35.5             26.7              79.1      96.2            100
            Hair style                   30.6      33.4             25.2              75.3      87.7            99.9
            Beard + mustache             21.0      23.7             26.9              37.4      72.3            92.8
            Glasses                      22.4      24.1             0.0               50.7      77.9            96.5
            Cap/hat                      27.1      28.3             10.6              72.9      78.2            99.8
            Lips/eyebrows/nose           28.3      29.2             19.8              75.6      81.7            98.3
            Aging/wrinkles               27.2      28.9             19.5              66.3      81.1            95.2
            Multiple disguise            10.3      12.4             11.9              29.1      58.3            81.4
Movie       Minimal                      40.7      43.3             55.2              60.4      90.2            98.6
            Hair style                   39.1      41.6             53.8              76.4      88.7            98.3
            Beard + mustache             31.3      34.8             54.9              37.1      72.5            87.9
            Glasses                      32.0      35.2             3.7               48.9      79.4            93.0
            Cap/hat                      35.4      37.1             32.3              69.6      83.3            97.6
            Lips/eyebrows/nose           34.9      38.0             31.5              77.2      82.7            97.9
            Aging/wrinkles               28.5      29.9             33.0              65.1      83.2            97.4
            Multiple disguise            13.8      16.1             21.5              30.7      58.9            73.7

reduces for multiple disguise variations, which are probably the most challenging scenarios to handle.

4.6 Conclusion and Future Trends

Face recognition becomes a challenging problem when the faces are altered due
to aging and disguise. Recognizing faces with altered appearances is a challenging
task and is only now beginning to be addressed by researchers. In this chapter, a
novel face verification algorithm is proposed that first applies mutual information
based age transformation to minimize age difference between the gallery and probe
images. The algorithm then extracts facial features at different levels of granularity.
These non-disjoint features are extracted using neural network architecture based
2D log polar Gabor transform. Variations in aging and disguise cause alteration in

appearance and features which may lead to imprecise matching. Therefore, to clas-
sify imprecise information obtained from these granular features, a likelihood ratio
based SVM match score classification algorithm is used. The proposed approach
is based on the observation that human mind recognizes face images by analyzing
the relation among non-disjoint spatial features extracted at different granular levels.
The proposed granular face verification algorithm is validated using the Notre Dame
face database, FG-Net face database and three other databases specially prepared
with disguised face images. The performance is also compared with existing state
of the art face recognition algorithms and algorithms specifically designed to recog-
nize disguised face images. Experimental results show that the proposed algorithm
outperforms existing algorithms by at least 5.7 %.
This chapter focuses on face recognition with variations in aging and disguise.
However, the problem becomes more challenging if there are variations in pose,
expression, and illumination as well. Plastic surgery can also be viewed as an exten-
sion of the problem. Future research can be targeted to address the challenges of face
recognition with variations in pose, expression and illumination along with the vari-
ations in aging and disguise. Undertaking such research would first require creating
a comprehensive database comprising face images with pose, expression, illu-
mination, aging, and disguise variations. With such a database, a methodical study
can be performed to analyze the effect of individual covariates on face recognition
followed by the effect of their combination.

Acknowledgments The author would like to thank Prof. A. Noore, Dr. A. Ross and Dr. M. Vatsa
for discussions and feedback. The author also acknowledges the CVRL group at the University of Notre Dame for providing the Notre Dame face database.

References

1. Ahonen T, Hadid A, Pietikäinen M (2006) Face description with local binary patterns: appli-
cation to face recognition. IEEE Trans Pattern Anal Mach Intell 28(12):2037–2041
2. Alexander J, Smith J (2003) Engineering privacy in public: confounding face recognition, pri-
vacy enhancing technologies. In: Proceedings of international workshop on privacy enhancing
technologies, pp 88–106
3. Anderson K, McOwan P (2004) Robust real-time face tracker for cluttered environments.
Comput Vis Image Underst 95:184–200
4. Bargiela A, Pedrycz W (2002) Granular computing: an introduction. Kluwer, Boston
5. Burt M, Perrett DI (1995) Perception of age in adult caucasian male faces: computer graphic
manipulation of shape and colour information. J Roy Soc 259:137–143
6. Burt P, Adelson E (1983) A multiresolution spline with application to image mosaics. ACM
Trans Graph 2(4):217–236
7. Campbell R, Coleman M, Michael W, Jane B, Philip J, Wallace S, Michelotti J, Baron-Cohen
S (1999) When does the inner-face advantage in familiar face recognition arise and why? Vis
Cogn 6(2):197–216
8. Chew HG, Lim CC, Bogner RE (2004) An implementation of training dual-ν support vector
machines. In: Qi L, Teo KL, Yang X (eds) Optimization and control with applications, Kluwer
9. Daugman J (1980) Two-dimensional spectral analysis of cortical receptive field profiles. Vis
Res 20(10):847–856

10. Daugman J (1985) Representational issues and local filter models of two-dimensional spatial
visual encoding. In: Rose D, Dobson V (eds) Models of the visual cortex. Wiley, New York,
pp 96–107
11. Daugman J (1988) An information-theoretic view of analog representation in striate cortex. In:
Schwartz E (ed) Computational neuroscience. MIT Press, Cambridge, pp 403–423
12. Flynn P, Bowyer K, Phillips P (2003) Assessment of time dependency in face recognition:
an initial study. In: Proceedings of audio and video-based biometric person authentication, pp
44–51
13. Guo G, Mu G, Ricanek K (2010) Cross-age face recognition on a very large database: the
performance versus age intervals and improvement using soft biometric traits. In: Proceedings
of international conference on pattern recognition, pp 3392–3395
14. Hayward W, Rhodes G, Schwaninger A (2008) An own-race advantage for components as well
as configurations in face recognition. Cognition 106(2):1017–1027
15. Hill D, Studholme C, Hawkes D (1994) Voxel similarity measures for automated image reg-
istration. In: Proceedings of SPIE conference on visualization in biomedical computing, pp
205–216
16. Kim J, Choi J, Yi J, Turk M (2005) Effective representation using ica for face recognition
robust to local distortion and partial occlusion. IEEE Trans Pattern Anal Mach Intell 27(12):
1977–1981
17. Lanitis A, Taylor C, Cootes T (1995) Automatic face identification system using flexible appear-
ance models. Image Vis Comput 13(5):393–401
18. Li S, Jain A (2005) Handbook of face recognition. Springer, New York
19. Li Z, Park U, Jain A (2011) A discriminative model for age invariant face recognition. IEEE
Trans Inf Forensics Secur 6(3):1028–1037
20. Lin T, Yao Y, Zadeh L (2002) Data mining, rough sets and granular computing. Physica-Verlag,
New York
21. Ling H, Soatto S, Ramanathan N, Jacobs D (2007) A study of face recognition as people age.
In: Proceedings of IEEE international conference on computer vision, pp 1–8
22. Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P (1997) Multimodality image
registration by maximization of mutual information. IEEE Trans Med Imaging 16(2):187–198
23. Mahalingam G, Kambhamettu C (2010) Age invariant face recognition using graph matching.
In: Proceedings of IEEE conference on biometrics: theory, applications and systems, pp 1–7
24. Martinez A (2002) Recognizing imprecisely localized, partially occluded, and expression vari-
ant faces from a single sample per class. IEEE Trans Pattern Anal Mach Intell 24(6):748–763
25. Nandakumar K, Chen Y, Dass SC, Jain AK (2008) Likelihood ratio based biometric score
fusion. IEEE Trans Pattern Anal Mach Intell 30(2):342–347
26. Park U, Tong Y, Jain AK (2010) Age-invariant face recognition. IEEE Trans Pattern Anal Mach
Intell 32(5):947–954
27. Phillips P, Flynn P, Scruggs T, Bowyer K, Worek W (2006) Preliminary face recognition grand
challenge results. In: Proceedings of international conference on automatic face and gesture
recognition, pp 15–24
28. Phillips P, Scruggs W, O'Toole A, Flynn P, Bowyer K, Schott C, Sharpe M (2007) FRVT 2006 and ICE 2006 large-scale results. NIST Technical Report NISTIR 7408
29. Pluim J, Maintz J, Viergever M (2003) Mutual information-based registration of medical
images: a survey. IEEE Trans Med Imaging 22(8):986–1004
30. Ramanathan N, Chellappa R (2006) Modeling age progression in young faces. Proc IEEE Conf
Comput Vis Pattern Recogn 1:387–394
31. Ramanathan N, Chellappa R (2008) Modeling shape and textural variations in aging adult faces.
In: Proceedings of IEEE international conference on automatic face and gesture recognition
32. Ramanathan N, Chellappa R, Biswas S (2009) Age progression in human faces: a survey. J Vis
Lang Comput 20:131–144
33. Ramanathan N, Chellappa R, Roy Chowdhury A (2004) Facial similarity across age, disguise,
illumination and pose. In: Proceedings of IEEE international conference on image processing,
vol 3, pp 1999–2002

34. Schwaninger A, Lobmaier J, Collishaw S (2002) Role of featural and configural information
in familiar and unfamiliar face recognition. In: Proceedings of second international workshop
on biologically motivated computer vision, pp 245–258
35. Silva P, Rosa AS (2003) Face recognition based on eigeneyes. Pattern Recogn Image Anal
13(2):335–338
36. Singh R, Vatsa M, Noore A (2009) Face recognition with disguise and single gallery images.
Image Vis Comput 27(3):245–257
37. Singh R, Vatsa M, Noore A, Singh SK (2007) Age transformation for improving face recog-
nition performance. In: Proceedings of international conference on pattern recognition and
machine intelligence, pp 576–583
38. Sinha P, Balas BJ, Ostrovsky Y, Russell R (2006) Face recognition by humans: 19 results all
computer vision researchers should know about. Proc IEEE 94(11):1948–1962
39. Suo J, Min F, Zhu S, Shan S, Chen X (2007) A multi-resolution dynamic model for face aging
simulation. In: Proceedings of IEEE conference on computer vision and pattern recognition
40. Tiddeman B, Burt DM, Perrett D (2001) Prototyping and transforming facial texture for per-
ception research. IEEE Comput Graph Appl 21(5):42–50
41. Vapnik V, Golowich S, Smola A (1997) Support vector method for function approximation,
regression estimation and signal processing. Proc Adv Neural Inf Process Syst 9:281–287
42. Webster M, de Valois R (1985) Relationship between spatial-frequency and orientation tuning
of striate-cortex cells. J Opt Soc Am 2(7):1124–1132
43. Wright J, Yang A, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse
representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
44. Yang M, Zhang L (2010) Gabor feature based sparse representation for face recognition with
gabor occlusion dictionary. In: Proceedings of European conference on computer vision, pp
448–461
Chapter 5
Using Score Fusion for Improving
the Performance of Multispectral
Face Recognition

Yufeng Zheng

Abstract Score fusion combines several scores from multiple modalities and/or multiple matchers, which can increase the accuracy of face recognition while decreasing the false accept rate (FAR). Specifically, the face scores are generated from two spectral bands (visible and thermal) and from three matchers (circular Gaussian filter, face pattern byte, elastic bunch graph matching). In this chapter, we first review the three face recognition algorithms (matchers), then present and compare the fusion performance of seven fusion methods: linear discriminant analysis (LDA), k-nearest neighbors (KNN), artificial neural network (ANN), support vector machine (SVM), binomial logistic regression (BLR), Gaussian mixture model (GMM), and hidden Markov model (HMM). Our experiments are conducted with the Alcorn State University Multispectral face dataset, which currently consists of two-spectral images from 105 subjects. The experimental results show that all score fusions can improve the accuracy while reducing the FAR, and that the KNN score fusion gives the best performance.

5.1 Introduction

Human identification with biometrics has many applications, for example, authentication for entering secured areas, authorization for using restricted devices, and surveillance of border crossings. In fact, biometrics can be implemented by companies, governments, and border control either to verify a person's identity for limiting or allowing access to a secured facility (like computer files, building areas, border crossings), or to identify individuals (like suspects, criminals, terrorists) in order to record and track their information. Basically, biometrics uses the digital signatures of physiological characteristics such as faces, fingerprints, irises, voices, and palms to automatically

Y. Zheng (B)
Alcorn State University, Lorman, MS, USA
e-mail: yzheng@alcorn.edu


identify a person. Compared to face recognition, the recognition techniques using fingerprints and irises are well developed and widely applied. However, the data collection procedures for both fingerprints and irises are quite time consuming and require close contact with the imaging devices. These techniques cannot satisfy the needs of highly populated sites such as airports or customs. Applications with tight security requirements (at customs or borders) also require passive and contactless data acquisition and a fast identification process. Face imaging is such a passive data collection that does not require much subject collaboration.
The performance of face recognition can be improved utilizing multispectral
imagery. Chang et al. [1] demonstrated image quality enhancement by fusing multi-
spectral face images (e.g., visible and thermal) but did not report any recognition per-
formance. Bendada and Akhloufi [2] compared several face recognition algorithms
(Principal Component Analysis [PCA], Linear Discriminant Analysis [LDA], etc.)
with the Local Binary Pattern (LBP) and the local ternary pattern (LTP) extracted
from four-band face images (i.e., Visible, Short-, Medium-, Long-Wave Infrared),
and found that LDA algorithm performed the best on the visible dataset (92 %).
No fusion results were presented in their work. In contrast with image fusion (with
the prerequisite of image registration) and feature fusion (with the process of high
dimensional data) [3, 4], score-level fusion is more efficient due to utilization of
low dimensional data. Nandakumar et al. [5] showed that a density fusion estimated
by Gaussian Mixture Model (GMM) outperformed any single matcher and other
fusion methods (like sum rule with min-max). Poh et al. [6] reported that the more
biometric scores were used in score fusion, the higher the fusion performance achieved.
Zheng and Elmaghraby [7] recently had a brief survey on the score fusion methods
working with multiple matchers and multispectral faces, and found that fusing mul-
timodal data was more effective and significant than combining different matchers and
using different fusion methods. Their experiments also showed that Hidden Markov
Models (HMM) was the best fusion method.
A real face recognition system expects a low false accept rate (FAR) as well as high recognition accuracy. As reported in the literature [5–7], multimodal score fusion can improve the accuracy while lowering the FAR. The face scores can
be generated from multispectral images and multiple matchers. This chapter will
present and compare the performance of seven score fusion methods. In addition, a
new face pattern using a set of Circular Gaussian filters (CGF) is introduced.
The remainder of this chapter is organized as follows. The multispectral face recognition system is overviewed in Sect. 5.2 and face image preprocessing is described in Sect. 5.3. Three face recognition algorithms (CGF, Face Pattern Byte [FPB], Elastic Bunch Graph Matching [EBGM]) are described in Sect. 5.4. Score fusion methods are summarized in Sect. 5.5. Performance evaluation is described in Sect. 5.6. Experiments are conducted on the Alcorn State University Multispectral (ASUMS) face dataset [8], and the experimental results and discussion are presented in Sect. 5.7. Finally, conclusions are drawn in Sect. 5.8.

Fig. 5.1 The diagram of a multispectral face recognition system. The system inputs are multispectral face images, visible (RGB) and/or thermal (LWIR), whereas the final output provides the information of the identified subject. A complete process of face recognition consists of face preprocessing (normalization, face detection, alignment), facial feature extraction (CGF, FPB, EBGM), distance metric (Hamming distance), and face identification (shortest distance). Score fusion (mean, KNN, HMM; on Path 1) can significantly increase the recognition accuracy, and is preferred whenever applicable

5.2 Overview of a Multispectral Face Recognition System

As shown in Fig. 5.1, a given face image is first preprocessed with normalization,
face detection and face alignment. Face recognition algorithms are performed with
the preprocessed images. Three algorithms (matchers) are selected for face recog-
nition and score fusion, which are CGF, FPB, and EBGM. Face identification (on
Path 2) is done by selecting a gallery face of the best match with the probe face
according to either the shortest distance (for CGF and FPB) or the highest similarity
(for EBGM). Face identification (on Path 1) can achieve higher accuracy via score fusion,
which combines the face scores from multiple matchers and/or multiple modalities.

5.3 Face Image Preprocessing


Face images are normalized using standard z-score normalization (similar to Eq. (5.6)). Faces are then detected using the Viola-Jones algorithm for visible images and a projection profile analysis algorithm for thermal images, respectively. The detected faces are further

Fig. 5.2 Four rectangles initially used by Viola and Jones [8, 14], which are used to represent images for the learning algorithm. The light and shaded regions are each summed separately; the sum of the light region is then subtracted from the sum of the shaded region

aligned using an image registration technique such as normalized mutual information


(NMI) [9].

5.3.1 Visible Face Detection

The Viola-Jones (VJ) algorithm has become a very common method of object detec-
tion, including face detection. Viola and Jones proposed this algorithm as a machine
learning approach for object detection with an emphasis on obtaining results rapidly
and with high detection rates.
The VJ method of Object Detection uses three important aspects. The first is an
image representation structure called integral images by Viola and Jones [10, 11].
In this method, features are calculated by taking the sum of pixels within multiple
rectangular areas. Figure 5.2 shows the four rectangles that were initially used by Viola and Jones. Lienhart and Maydt [12] later extended the VJ algorithm to allow for rectangle rotations of up to 45°. Using these methods, the sum of
both the white and the shaded regions of each rectangle are calculated independently.
The sum of the shaded region is then subtracted from the white region.
Viola and Jones admit that the use of rectangles is rather primitive, but they
allow for high computational efficiency in calculations [10, 11]. This also lends
well to the AdaBoost algorithm that is used to learn features in the image. This extension of the AdaBoost algorithm allows the system to select a set of features
and train the classifiers, a method first discussed by Freund and Schapire [13]. This
learning algorithm allows the system to learn the differences between the integral
representations of faces to those of the background.
The last contribution of the VJ method is the use of cascading classifiers. In order
to significantly speed the system up, Viola and Jones avoid areas that are highly
unlikely to contain the object. The image is tested with the first cascade. The areas
that do not conform to the first cascade are rejected and no longer processed. The
areas that may contain the object are further processed until all of the classifiers are
tested. The areas in the last classifier are likely to contain the object.

The VJ algorithm performs well in visible face detection but not well in thermal
face detection.

5.3.2 Thermal Face Detection

A projection profile analysis algorithm is proposed for thermal face detection. The
procedure of the face detection algorithm is described as follows.
(1) Smooth the image and segment the person in the image using a region growing
algorithm so that the contour of the person is obtained, see Fig. 5.3d.
(2) Perform morphological operations (close and open) to remove the noise segmen-
tation. Use a large morphological kernel (5 × 5) and repeat the process twice if
necessary.
(3) Calculate two projections horizontally and vertically; smooth the projection pro-
files (curves); and compute derivatives (first and second derivatives). Do multiple
smoothing operations prior to the derivative calculations. Refer to Fig. 5.3b for
a vertical projection and Fig. 5.3c for a horizontal projection.
(4) Use the minimum and maximum derivatives (corresponding to the locations
of the most abrupt changes) to determine the face region. Apply a smoothing
operation to the vertical projection after calculating the first derivative but before
looking for the minimum and maximum. Combine the local minima with the search for extreme values in the horizontal projection. See the small green
circles (for the extreme values) in Fig. 5.3b, c.
(5) Draw a red rectangle on the original or binarized image (see Fig. 5.3d); and
then extract and save the detected face image.
(6) Some constraints (prior knowledge) are considered to make the region more accurate and robust, e.g., the length of the head should be around half the length of the shoulders. Also, a higher order smoothing algorithm is used to smooth the projection curve. (A rough sketch of this projection-profile procedure is given below.)
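The following Python sketch illustrates steps (2)-(5) under simplifying assumptions (4-connected morphology, box smoothing); it is only an illustration, not the exact procedure used in the experiments, and all names are hypothetical.

import numpy as np
from scipy.ndimage import binary_opening, binary_closing, uniform_filter1d

def face_region_from_projections(person_mask, smooth=15):
    # person_mask: boolean segmentation of the person (output of region growing).
    mask = binary_closing(binary_opening(person_mask, np.ones((5, 5))), np.ones((5, 5)))
    v_proj = uniform_filter1d(mask.sum(axis=0).astype(float), smooth)  # per-column counts
    h_proj = uniform_filter1d(mask.sum(axis=1).astype(float), smooth)  # per-row counts
    dv, dh = np.gradient(v_proj), np.gradient(h_proj)
    # The most abrupt rise/fall of the vertical projection gives the left/right bounds.
    left, right = sorted((int(np.argmax(dv)), int(np.argmin(dv))))
    top = int(np.argmax(dh))                          # most abrupt rise: top of the head
    # Prior: head length is roughly comparable to the face width found above.
    bottom = min(mask.shape[0] - 1, top + (right - left))
    return top, bottom, left, right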
A region growing algorithm is used for the segmentation in this face detection
algorithm. Region growing algorithms have proven to be an effective approach for
image segmentation. The basic approach of a region growing algorithm is to start
from a seed region (typically one or more pixels) that is considered to be inside the
object to be segmented. The neighboring pixels are evaluated to determine if they
should also be considered as part of the object/region. If so, they are added to the
region, and the process continues until no new pixels can be added to the region. Region
growing algorithms vary with the criteria used to include a pixel in the region or not,
with the type of connectivity used to determine neighbors, and with the strategy used
to visit neighboring pixels.
A simple criterion for including pixels in a growing region is to evaluate whether the intensity value lies inside a specific interval, which is usually provided by the user based on analysis of the image intensity. Values of the lower and upper thresholds should be provided.
The region growing algorithm includes those pixels whose intensities are inside the

Fig. 5.3 Illustration of thermal face detection: a original thermal image (640 × 480 pixels, taken indoors at nighttime without lighting, with the subject wearing glasses); b plot of the vertical projection of d; c
plot of the horizontal projection of d; and d binarized image, where the rectangle of ROI (173 × 232
pixels) is illustrated. The inside mask is referred to as head mask

interval. Slightly different from this, another strategy, called neighborhood connected region growing, will only accept a pixel if all its neighbors have intensities that fit
in the interval. The size of the neighborhood to be considered around each pixel is
defined by a predefined radius.
We propose a criterion named mean-value connected region growing. It is based
on simple statistics of the current region. First, the algorithm computes the mean of
intensity values for all the pixels currently included in the region. A user-provided
threshold is used to define a range around the mean. Neighbor pixels whose intensity
values fall inside the range are accepted and included in the region.
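A minimal sketch of this mean-value connected criterion, assuming a single seed pixel and 4-connectivity, could look as follows (illustrative only; names are hypothetical).

import numpy as np
from collections import deque

def mean_connected_region_growing(img, seed, threshold):
    # Accept a 4-connected neighbour if its intensity is within +/- threshold
    # of the running mean of the pixels already included in the region.
    h, w = img.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    total, count = float(img[seed]), 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not region[nr, nc]:
                if abs(float(img[nr, nc]) - total / count) <= threshold:
                    region[nr, nc] = True
                    total += float(img[nr, nc])
                    count += 1
                    queue.append((nr, nc))
    return region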

5.4 Face Recognition Algorithms

5.4.1 Circular Gaussian Filter

We propose to extract facial features using a set of band-pass filters formed by rotating
a 1-D Gaussian filter (off center) in frequency space, termed as “Circular Gaussian
Filter” (refer to Eq. (5.1) and Fig. 5.4). A CGF can be uniquely characterized by
specifying a central frequency (f ) and a frequency band (σ ).
In Fourier frequency domain, a Circular Gaussian Filter (CGF) is defined as
follows: 2 2
1 − u1 +v2 1
CGF (u, v) = e 2σ , (5.1)
2π σ 2
where
u1 = (u − f cos θ ) cos θ − (v − f sin θ ) sin θ, (5.2a)

v1 = (u − f cos θ ) sin θ + (v − f sin θ ) cos θ, (5.2b)

where f specifies a central frequency, σ defines a frequency band, and θ ◦ [0, 2π ].
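One possible reading of Eqs. (5.1)-(5.2) is sketched below: sweeping the off-centre Gaussian over θ yields a ring-shaped (annular) band-pass filter of radius f and width σ in the frequency plane. The snippet is illustrative only; the frequency units and FFT conventions are assumptions, not taken from the chapter.

import numpy as np

def circular_gaussian_filter(size, f, sigma):
    # Annular frequency-domain filter obtained by sweeping the off-centre
    # Gaussian of Eqs. (5.1)-(5.2) over theta in [0, 2*pi).
    freqs = np.fft.fftfreq(size) * size          # frequency bins (cycles per image)
    U, V = np.meshgrid(freqs, freqs)
    radius = np.sqrt(U ** 2 + V ** 2)
    return np.exp(-(radius - f) ** 2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def cgf_response(face, f, sigma):
    # Complex response of one CGF band (the face is assumed 320 x 320, as in the text).
    return np.fft.ifft2(np.fft.fft2(face) * circular_gaussian_filter(face.shape[0], f, sigma))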


Face patterns are formed pixel by pixel with the CGF-filtered images, termed FP-CGF. Typically, the CGF filters at four bands (i.e., f ∈ {8, 16, 32, 64} and σ ∈ {3, 6, 12, 24}) produce promising results with face images of 320 × 320-pixel resolution. Four phase images and four magnitude images are derived from the four CGF filters, respectively. The FP-CGF has the same resolution as the face image and a depth of 8 bits per pixel. In each FP-CGF pixel, the most significant 4 bits are from the four binarized phase images, while the least significant 4 bits are from the four binarized magnitude images. In our experiments, the threshold for the phase images is a range (i.e., 1 if phase > −0.15π and phase < 0.15π), whereas the threshold for the normalized magnitude images is 0.1 (i.e., 1 if magnitude > 0.1).
A Hamming Distance (HD) [8] is calculated with the FP-CGF to measure the
similarities among faces. The matched face has the shortest HD from the probe
(query) face. No training is needed for such an unsupervised classification procedure.
The HD between two face patterns (P_1 and P_2) is defined as follows,

d_{HD}(P_1, P_2) = \frac{\| (P_1 \oplus P_2) \wedge Mask_{ROI} \|}{\| Mask_{ROI} \|},    (5.3)

where Mask_ROI is the face mask [15] (i.e., the region of interest), the symbol '⊕' is the Boolean exclusive-OR (XOR) operator, the symbol '∧' is the Boolean AND operator, and the norm operator '‖·‖' counts the number of bits that are ones. Thus, d_HD is a fractional number (from 0 to 1) that accounts for the similarity between two faces; d_HD = 0 means a perfect match (e.g., for two copies of the same face image).
In this chapter, FP-CGF is also shortened to CGF to simplify the notation.
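A small sketch of the masked fractional Hamming distance of Eq. (5.3) is shown below. It assumes 8-bit face patterns (one byte per pixel) and normalizes by the number of pattern bits inside the mask, which is one plausible reading of the norm in Eq. (5.3).

import numpy as np

def hamming_distance(p1, p2, mask_roi):
    # p1, p2: uint8 face patterns (FP-CGF or FPB); mask_roi: boolean face mask.
    xor_bits = np.unpackbits(np.bitwise_xor(p1, p2)[mask_roi])
    return xor_bits.sum() / float(8 * np.count_nonzero(mask_roi))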

5.4.2 Face Pattern Byte

Face pattern bytes are comprised of the binary bit code of maximal orientational
responses from a set of Gabor Wavelet Transforms (GWT). Multiple-band orien-
tational codes are then put into a face pattern byte (FPB) pixel-by-pixel. A HD is

Fig. 5.4 Illustration of a CGF with f = 16 and σ = 6 (only the central part of the CGF is presented). Top row: the CGF in the frequency domain and in the spatial domain, respectively. Bottom row: central slices of the top-row figures

calculated with the FPB to measure the distance between two faces, and identification
is made with the shortest HD.
An 8 × 16 GWT (with 8 bands by 16 orientations) produces 256 coefficients
(128 magnitudes plus 128 phases) per pixel. We only use the magnitudes of GWT
coefficients to create an 8-bit FPB for each pixel on a face image. First, at each pixel,
an M-dimension vector, Vm , is created to store the orientation code representing
the orientational magnitude strength (Omn ) at each frequency (refer to Eq. (5.4a)). At
each frequency band (from 1 to 8), the index (0–15) of the maximal magnitude among
16 orientations (see Fig. 5.5a) is coded with 4 bits in order to maximize the orientation
difference (among subjects in the database). The 4-bit orientational code is first put
into an 8-dimension vector (Vm ) by the frequency order, i.e., the lowest (highest)
frequency corresponds to the lowest (highest) index of Vm . A FPB is then formed
with the most frequent orientations (called the mode) among some bands (refer to
Eqs. (5.4b) and (5.4c)). Specifically, the high half-byte (the most significant 4 bits)
in a FPB, FPBHHB , is the mode of high 4 bands (m = 5–8); whereas FPBLHB is the
mode of low 4 bands (m = 1–4). A factor, BFM , will avoid choosing any orientation
mode of a pattern in the FPB if its frequency is only 1. This could make the FPB immune


Fig. 5.5 Illustration of the orientation bit code and face pattern byte creation: a 16 orientations (0–15) and their color code (shown as different gray levels in black and white print) corresponding to their orientational bit code (in red font), wherein d_nbr^BC = 32 and d_oth^BC = 64; b an 8-bit Face Pattern Byte (FPB) along with 8 frequency bands

to noise: BFM = 1 only when the frequency of the mode, fM, is greater than or
equal to 2, otherwise BFM = 0 (see Eq. (5.4d)). It is clear that a FPB can code up to
16 orientations (N ≤ 16) but there is no limit to bands (M). 8 or 16 orientations are
good for face recognition.

V_m = O^{BC}[\,\mathrm{Index}(\mathrm{Max}(O_{mn}))\,],    (5.4a)

FPB_{LHB} = \mathrm{Mode}(V_m \,|\, m = 1 \sim M/2) \cdot B_{FM},    (5.4b)

FPB_{HHB} = \mathrm{Mode}(V_m \,|\, m = M/2+1 \sim M) \cdot B_{FM},    (5.4c)

B_{FM} = \begin{cases} 1, & \text{if } f_M \ge 2 \\ 0, & \text{otherwise} \end{cases}    (5.4d)

In Eqs. (5.4a)–(5.4d), O_mn is the GWT magnitude at Frequency m and Orientation n; m = 1 ∼ M and n = 1 ∼ N. The FPBs are stored as the feature vectors of the gallery faces in the database, which will be compared with that of the probe face during the identification process. A FPB (8 bits per pixel for an 8 × 16 GWT) is usually stored in a byte variable (see Figs. 5.5b and 5.6).
Because the Hamming distance (defined in Eq. (5.3)) will be computed by check-
ing the bitwise difference between two face patterns (e.g., FPBs), the orienta-
tional bit code should favor the HD calculation. The FPBs are designed to reflect

Fig. 5.6 Visualization of FPB (8 × 16, n = 8): a, d one sample of face image from ASUDC,
ASUIR dataset (320 × 320 pixels), respectively; b, e the low half-byte (4-bit) of FPB features; c,
f the high half-byte of FPB features. Refer to Fig. 5.5a for the orientation color code and Table 5.1 for
their performance

the orientational significance (or strength) along with frequency scales (locations).
Therefore, the closer (neighboring) orientations should have less bitwise difference,
whereas the further (orthogonal) orientations should have more bitwise difference.
Notice that 180°-apart orientations (such as 0° and 180°, 45° and 225°) are actually the same. For example, as shown in Fig. 5.5a (N = 16), “Orientation 0” is neighboring with both “Orientation 1” and “Orientation 15”. Two inferences follow from this nature: (i) orientations are circular-neighboring, and (ii) orthogonal (90°-apart) orientations are the most different (furthest) orientations.
Clearly, the “natural coding” does not favor the HD calculation because the bit-code
difference (HD) between two neighboring orientations, 0 (“0000B”, in binary format)
and 15 (“1111B”), is 4 (i.e., the biggest HD, refer to Fig. 5.5a).
Two cost functions are defined in the following equations. A set of optimal bit codes should minimize the HD of neighboring orientations (d_{nbr}^{BC}) while maximizing the HD of orthogonal orientations (d_{oth}^{BC}):

d_{nbr}^{BC} = \sum_{n=0}^{N-1} \left[ d_{HD}(O_n^{BC}, O_{n-1}^{BC}) + d_{HD}(O_n^{BC}, O_{n+1}^{BC}) \right],    (5.5a)

d_{oth}^{BC} = \sum_{n=0}^{N-1} d_{HD}(O_n^{BC}, O_{n+N/2}^{BC}),    (5.5b)

where O_n^BC is the bit code (BC) at Orientation n and N is the total number of orientations. Keep in mind the circular neighboring of orientations: let n = N − 1 when n < 0, and n = 0 when n > N − 1. One “optimized coding” solution for N = 16 orientations is as follows: {1110, 1010, 1000, 1100, 0100, 0110, 0010, 0000, 0001, 0101, 0111, 0011, 1011, 1001, 1101, 1111} (used in our experiments).
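The cost functions of Eqs. (5.5a)-(5.5b) can be checked numerically; the short Python sketch below evaluates them for the optimized 16-orientation code quoted above, for which every circular neighbour differs by exactly one bit (d_nbr = 32) and orthogonal orientations differ by all four bits (d_oth = 64), matching the values quoted in Fig. 5.5.

# The "optimized coding" for N = 16 orientations quoted above.
OPT_CODE = ["1110", "1010", "1000", "1100", "0100", "0110", "0010", "0000",
            "0001", "0101", "0111", "0011", "1011", "1001", "1101", "1111"]

def bit_hd(a, b):
    # Bitwise Hamming distance between two equal-length bit strings.
    return sum(x != y for x, y in zip(a, b))

def code_costs(code):
    # Eqs. (5.5a)-(5.5b): HD summed over circular neighbours (to be minimized)
    # and over orthogonal, i.e. N/2-apart, orientations (to be maximized).
    n = len(code)
    d_nbr = sum(bit_hd(code[i], code[(i - 1) % n]) + bit_hd(code[i], code[(i + 1) % n])
                for i in range(n))
    d_oth = sum(bit_hd(code[i], code[(i + n // 2) % n]) for i in range(n))
    return d_nbr, d_oth

# code_costs(OPT_CODE) returns (32, 64).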

5.4.3 Elastic Bunch Graph Matching

There are many algorithms developed in the face recognition domain over the past two decades.
Three classical and successful algorithms (according to National Institute of Stan-
dards and Technology) are Principal Component Analysis (PCA), Linear Discrimi-
nant Analysis (LDA) [14, 16], and Elastic Bunch Graph Matching (EBGM) [17–19].
However, PCA is not selected for score fusion due to its poor performance [7, 8].
LDA has a better performance than PCA but worse than EBGM [7, 8]. Thus, EBGM
is chosen for score fusion due to its best performance in contrast with PCA and LDA.
In the EBGM algorithm [17, 18], faces are represented as graphs, with nodes posi-
tioned at fiducial points (such as eyes, nose, mouth), and edges labeled with 2-D
distance vectors. Each node contains a set of 40 complex Gabor wavelet coeffi-
cients, including both phase and magnitude, known as a jet. Wavelet coefficients
are extracted using a family of Gabor kernels with 5 different spatial frequencies
and 8 orientations. A graph is labeled using a set of nodes connected by edges,
where nodes are labeled with jets, and edges are labeled with distances. Thus, the
geometry of a face is encoded by the edges, whereas the gray value distribution is
patch-wise encoded by nodes (jets). To identify a probe face (also called query face),
the face graph is positioned on the face image using elastic bunch graph matching,
which actually maximizes the graph similarity function (equivalent to face alignment)
[17, 18]. After the graph has been positioned on the probe face, it can be identified by
comparing the similarity between the probe face and each face stored in the database
(referred to as gallery face thereafter). From the literature report [18, 19] regarding
the comparisons of three common algorithms, it appears that the EBGM algorithm
presents the best performance in terms of verification rate and performance reliability
[18].

5.5 Score Fusion Methods

There are several types of score fusion methods: arithmetic combination of fusion
scores, classifier-based fusion, and density-based fusion. In arithmetic fusion, the
final score is calculated by a simple arithmetic operation [20] such as taking the

summation, average (mean), product, minimum, maximum, median, or majority vote. In classifier-based fusion (referred to as classifier fusion), a classifier is first
trained with the labeled score data, and then tested with unlabeled scores [21, 22].
The choices of classifiers include linear discriminant analysis (LDA) [23], binomial
logistic regression (BLR) [6, 24], k-nearest neighbors (KNN), artificial neural net-
work (ANN) [25], support vector machine (SVM) [26], etc. In density-based fusion,
a multi-dimensional density function (e.g., GMM) is estimated with the score dataset,
and then it can predict the probability of any given score vector [27, 28]. HMM [7]
is a hybrid of both classifier-based fusion and density-based fusion.
Based on the literature report in [7], seven score-fusion methods: LDA, KNN,
ANN, SVM, BLR, GMM, and HMM, are selected for comparison in our study.
In addition, mean fusion, a representative of arithmetic fusions, is also presented
for reference. More descriptions for BLR, GMM, and HMM fusions are given in
this section; whereas the details of other fusion methods can be found in the listed
references. The implementation details are given in Sect. 5.6.

5.5.1 Score Normalization

Score normalization is expected before fusing multiple scores since the multimodal
scores from various modalities are heterogeneous and thus have different dynamic ranges. To fuse these scores, it is required that all original scores are either
similarity scores or distance scores (but not the mix of similarity and distance).
The source scores may originate from a different type of device, called modality
(e.g., digital camera or thermal camera), and/or from variant analysis software, called
matcher (e.g., CGF, FPB, or EBGM algorithm). The large variances of multimodal
scores are caused either by different matching algorithms or by the different natures of the biometric data. In our experiments, a standard z-score normalization procedure
is applied to all biometric scores,

SN = (S0 − μ0 )/σ0 , (5.6)

where SN is the normalized score vector, S0 is the original score vector; μ0 and σ0
denote the mean and standard deviation of original scores, respectively.
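A one-line illustration of Eq. (5.6) in Python follows; μ0 and σ0 would normally be estimated from the training scores of each matcher and modality.

import numpy as np

def z_score_normalize(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()     # S_N = (S_0 - mu_0) / sigma_0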

5.5.2 k-Nearest Neighbors

The k-nearest neighbors (KNN) method is usually deployed with a clustering tech-
nique. Fuzzy C-means (FCM) is a data clustering technique wherein each data point
belongs to a cluster to some degree that is specified by a membership grade. FCM
starts with an initial guess of data membership and iteratively moves the cluster
centers to the correct location within a data set. Once a certain number of clusters

are formed by the FCM algorithm, the k-nearest neighbors can be found from those
clusters using Euclidean distance (between a testing feature vector and the clustered
feature vectors). The probability of a given feature vector (multimodality scores) can
be calculated with the labeled clusters.
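As a simplified illustration, the sketch below performs KNN score fusion directly on the normalized score vectors with scikit-learn; the FCM clustering step described above is omitted, so this is only an approximation of the fusion method used in the chapter.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_knn_fusion(score_vectors, labels, k=5):
    # score_vectors: normalized multi-matcher/multi-modality score vectors;
    # labels: 1 for genuine pairs, 0 for impostor pairs.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(np.asarray(score_vectors), np.asarray(labels))
    return knn

# Fused probability of a probe score vector being genuine:
# p_genuine = knn.predict_proba(probe_scores.reshape(1, -1))[0, 1]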

5.5.3 Binomial Logistic Regression

Logistic regression (LR) [24] is a regression analysis to predict the outcome of a categorical dependent variable based on a set of predictor variables. The probability
describing the possible outcome of a single trial, P(x), is modeled as a function of
explanatory variables (predictors), using a logistic function as defined below:

P(x) = \frac{1}{1 + \exp[-g(x)]},    (5.7)

where

g(x) = \sum_{j=1}^{M} \beta_j x_j + \beta_0,    (5.8)

where xj are the elements in x. The weight parameters βj are optimized to maximize
the likelihood of the training data given the LR model [24]. P(x) always takes on
values between zero and one.
Logistic regression can be multinomial or binomial. Multinomial logistic regres-
sion (MLR) refers to cases where the outcome can have three or more possible types.
Binomial logistic regression (BLR) refers to the instance in which the observed out-
come can have only two possible types. For example, a face score fusion only outputs
two types of results, impostor or genuine, which makes BLR [6] suitable for face
scores. Generally, the outcome is coded as 0 (impostor) and 1 (genuine) in BLR as
it leads to the most straightforward interpretation.
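A minimal BLR fusion sketch using scikit-learn is given below (illustrative only); the learned coefficients correspond to the β_j of Eq. (5.8), and labels are coded 1 for genuine and 0 for impostor.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_blr_fusion(score_vectors, labels):
    blr = LogisticRegression()
    blr.fit(np.asarray(score_vectors), np.asarray(labels))
    return blr

def fuse_blr(blr, probe_scores):
    # P(x) of Eq. (5.7): posterior probability that the score vector is genuine.
    return blr.predict_proba(np.asarray(probe_scores).reshape(1, -1))[0, 1]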

5.5.4 Gaussian Mixture Model for Score Fusion

The face score distributions can be modeled with a number (g) of Gaussian functions,
each of which can be parameterized with a mean and a standard deviation. The
Gaussian parameters and the weights of combining g Gaussian functions can be
estimated with an Expectation Maximization (EM) algorithm [29]. Similar to the
HMM fusion, two GMM models will be estimated using genuine scores and impostor
scores, respectively. The testing score vector is classified as genuine if the probability calculated by the genuine GMM model is higher, and vice versa. Nandakumar et al.
[5] concluded that GMM fusion outperformed any single matcher and other fusion
methods.
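The density-based GMM fusion can be sketched as follows: one mixture is fitted to the genuine score vectors and one to the impostor score vectors (EM runs inside fit), and a probe score vector is assigned to the model with the higher likelihood. The snippet is illustrative only.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_fusion(genuine_scores, impostor_scores, g=3):
    gmm_gen = GaussianMixture(n_components=g).fit(np.asarray(genuine_scores))
    gmm_imp = GaussianMixture(n_components=g).fit(np.asarray(impostor_scores))
    return gmm_gen, gmm_imp

def classify_gmm(gmm_gen, gmm_imp, probe_scores):
    x = np.asarray(probe_scores).reshape(1, -1)
    # Higher log-likelihood under the genuine model -> accept as genuine.
    return "genuine" if gmm_gen.score(x) > gmm_imp.score(x) else "impostor"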

5.5.5 Hidden Markov Model for Score Fusion

The HMM fusion is a hybrid classifier-based and density-based fusion, but it signif-
icantly differs in data preparation and classification process. In the context of this
chapter, we need to distinguish two terms: multimodal scores and multi-matcher
scores. Multimodal biometric scores (also referred to as inter-modality scores) result
from different modalities (such as different imaging devices, visible and thermal cam-
eras); while multi-matcher scores (also referred to as intra-modality scores) result
from different software algorithms using the same modality (e.g., three face scores
generated from three face recognition algorithms).
For HMM training, a large database with known users (labeled with subject iden-
tifications) is expected, and thus a k-fold cross validation is utilized to satisfy this
need. All scores are normalized and organized as the inputs of HMM models using
k-fold cross validation. The HMM model is adapted to multimodal score fusion
and initialized with parameters HMM(m, n, g), also denoted as an m×n×g HMM,
where m is the number of intra-modality scores (from m matchers applied to one modality)
representing an observation vector in the HMM, and n is the number of modalities
corresponding to n hidden states. By placing n m-dimensional observation vectors
together, an observation sequence (over time, t) is formed. g is the number
of Gaussian components per state in a Gaussian Mixture Model (GMM). The GMM
is applied to estimate the state probability density functions of each hidden state
in a continuous HMM model. The details of the HMM model and its adaption to
biometric score fusion are described elsewhere [7].
In HMM score fusion, the observation vector, O_t, can be the m-dimensional intra-
modality scores from m matchers. The observation sequence, O(t, s), can be formed
by combining n pieces of O_t from n modalities: O(t, s) = {S_mn}. For example,
there are 2 biometric modalities (n = 2; visible, thermal) and 3 matching algorithms
(matchers) for each modality (m = 3). Thus, the length of O(t, s) is 6 (refer to the
cells at the rightmost column in Tables 5.2, 5.3). The observation symbol probabilities
matrix can be initialized with the GMM, where the number of Gaussian components
(g) in each state are usually fixed (e.g., g = 3) or automatically decided [30]. Notice
that two HMM models, λ_Gen and λ_Imp, are actually trained using genuine scores and
impostor scores, respectively; their parameters can be estimated using the Baum-
Welch algorithm [31]. An unlabeled biometric score sequence, O, will be classified
as a “genuine user” if P_Gen(O|λ_Gen) > P_Imp(O|λ_Imp) + η (a simple decision rule);
otherwise, O will be an “impostor user”, where η is a small positive number empiri-
cally decided by experiments. In general, m ≥ 1, n ≥ 1, and m × n ≥ 2 are expected.
In other words, at least two scores are required for HMM fusion. If the number of
biometric modalities is one (n = 1), then the number of matching scores from that
modality must be two or more (produced from different matching algorithms,
e.g., CGF, FPB, and EBGM). If there are two or more modalities (n ≥ 2), in order to
properly initialize and train the HMM models, the number of intra-modality scores
(m ≥ 1) derived from each modality must be the same.
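
Under stated assumptions, the training and decision rule can be sketched with the hmmlearn package: one continuous HMM with Gaussian-mixture emissions is trained on genuine observation sequences and one on impostor sequences, and an unlabeled sequence is classified by comparing log-likelihoods. The configuration m = 3, n = 2, g = 3 mirrors the example above, but hmmlearn, the toy data, and the value of η are illustrative choices rather than the original implementation.

import numpy as np
from hmmlearn.hmm import GMMHMM

m, n, g = 3, 2, 3     # matchers per modality, modalities (hidden states), mixtures per state
eta = 0.05            # small empirical margin of the decision rule (illustrative)

# Toy training sequences: each sequence holds n observation vectors of dimension m.
genuine_seqs = [np.random.normal(1.0, 0.3, size=(n, m)) for _ in range(100)]
impostor_seqs = [np.random.normal(-1.0, 0.4, size=(n, m)) for _ in range(100)]

def train_hmm(sequences):
    model = GMMHMM(n_components=n, n_mix=g, covariance_type="diag", n_iter=50)
    model.fit(np.vstack(sequences), lengths=[len(s) for s in sequences])
    return model

hmm_genuine = train_hmm(genuine_seqs)
hmm_impostor = train_hmm(impostor_seqs)

def classify(observation_sequence):
    # Compare log-likelihoods (monotone in the probabilities), with a small margin eta.
    log_gen = hmm_genuine.score(observation_sequence)
    log_imp = hmm_impostor.score(observation_sequence)
    return "genuine" if log_gen > log_imp + eta else "impostor"

decision = classify(np.random.normal(1.0, 0.3, size=(n, m)))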

Actual \ Predicted | Predicted Negative  | Predicted Positive
Actual Negative    | True Negative (TN)  | False Positive (FP)
Actual Positive    | False Negative (FN) | True Positive (TP)

Fig. 5.7 Definition of confusion matrix: the 2-by-2 elements (TN, FP, FN, TP)

5.6 Performance Evaluation

The genuine score is the matching score resulting from two samples of a single
user; while the impostor score is the matching score of two samples originating
from different users. On a closed dataset (i.e., all query users are included in the
database), the recognition performance can be measured by a verification rate (VR),
the percentage of the number of correctly identified users (i.e., genuine users are
recognized as genuine) over the total number of users. A higher VR means a better
recognition algorithm.
On an open dataset or for a commercial biometric system, a query user (i.e., probe
face) may not be contained in the database. To evaluate the system performance, we
define the following terms based on a confusion matrix [32] (see Fig. 5.7):

AC = (TN + TP)/(TN + FP + FN + TP), (5.9a)

FAR = FPR = FP/(TN + FP), (5.9b)

FRR = FNR = FN/(FN + TP), (5.9c)

GAR = TPR = TP/(FN + TP), (5.9d)

IRR = TNR = TN/(TN + FP), (5.9e)

where AC denotes accuracy that means genuine (or impostor) users recognized as
genuine (or impostor) users; FAR stands for false accept rate (also called false
positive rate), i.e., impostor users are recognized as genuine users (falsely accepted
by the system); FRR is for false rejection rate (also called false negative rate), i.e.,
genuine users are recognized as impostor users (falsely rejected by the system).
Similarly, we can also define genuine accept rate (GAR, or true positive rate), and
impostor rejection rate (IRR, true negative rate) in Eqs. (5.9d) and (5.9e).
It is usually acceptable to evaluate a biometric system by reporting AC, FAR and
FRR. Of course, an ideal biometric system expects a high AC, a low FAR and a low
FRR. If there are no negative cases (i.e., no intruders), then TN + FP = 0 in Eq. (5.9a),
which leads to AC = VR.
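
A short helper implementing Eqs. (5.9a)–(5.9e) directly from the confusion-matrix counts of Fig. 5.7 (the function name and the example counts are illustrative):

def confusion_rates(tn, fp, fn, tp):
    # AC, FAR, FRR, GAR and IRR from confusion-matrix counts, Eqs. (5.9a)-(5.9e).
    ac  = (tn + tp) / (tn + fp + fn + tp)   # accuracy
    far = fp / (tn + fp)                    # false accept rate (false positive rate)
    frr = fn / (fn + tp)                    # false rejection rate (false negative rate)
    gar = tp / (fn + tp)                    # genuine accept rate (true positive rate)
    irr = tn / (tn + fp)                    # impostor rejection rate (true negative rate)
    return ac, far, frr, gar, irr

# Example with arbitrary counts.
ac, far, frr, gar, irr = confusion_rates(tn=950, fp=50, fn=8, tp=92)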

Fig. 5.8 Sample faces from the ASUMS database: notice that the two images (DC/visible,
IR/thermal) shown at two neighboring columns were acquired from the same subject. The images
are the aligned faces (320 × 320 pixels)

To sufficiently use the sample data in classification evaluation, k-fold cross valida-
tion is applied to split the original data into training and testing subsets. In a 10-fold
cross validation (i.e., k = 10), for example, all scores are divided into 10 subsets.
In each of 10 runs, one subset is held out for testing, and the rest 9 subsets are used
for training. In the next run, a different subset is used for testing until all 10 subsets
are used for tests. The classification performance is the averaged result of 10 runs.
For fair comparisons, the same group of k folds will be used for all selected fusion
methods.
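
A sketch of this protocol with scikit-learn's KFold is given below; fixing the random seed and reusing the same splitter for every fusion method realizes the requirement that all methods see the same group of k folds. The data, the classifier, and the seed are illustrative.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

scores = np.random.randn(1000, 6)            # fused score vectors (toy data)
labels = np.random.randint(0, 2, size=1000)  # 1 = genuine, 0 = impostor

# A fixed seed guarantees that every fusion method is tested on the same 10 folds.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

accuracies = []
for train_idx, test_idx in kfold.split(scores):
    clf = KNeighborsClassifier(n_neighbors=5)       # any fusion classifier fits here
    clf.fit(scores[train_idx], labels[train_idx])
    accuracies.append(clf.score(scores[test_idx], labels[test_idx]))

mean_accuracy = np.mean(accuracies)  # the reported performance is the average of the 10 runs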

5.7 Experimental Results and Discussion

The experiments were conducted on the Alcorn State University MultiSpectral


(ASUMS) dataset (refer to Fig. 5.8), which consists of two spectral face images [8].
The descriptions of face dataset and experimental results are given in the following
three subsections.

5.7.1 Face Dataset and Experimental Design

The ASUMS dataset currently consists of two spectral bands (visible and thermal)
from 105 subjects. The face images were acquired with a two-in-one imaging device,

FLIR SC620, wherein the infrared (thermal) camera is of 640 × 480 pixel original
resolution and 7.5–13 μm spectral range, and the digital (visible) camera is of 2048×
1536 pixel original resolution. The visible dataset is denoted as ASUDC; while the
thermal dataset is denoted as ASUIR. ASUMS denotes the entire dataset (refer to
Tables 5.1, 5.2, 5.3).
Four frontal face images (not shown in this chapter; refer to [7]) were randomly
selected from each subject (per band), one of which was used as a probe image,
and three of which were used as gallery images. Thus, a total of 8 images
per subject (over the two bands) were analyzed in our experiments. The roles of probe
and gallery were then rotated, and the reported performance of single algorithms in Table 5.1
is the average over the four rotations. All face images were normalized. The face portion
was then detected and extracted using face detection algorithms [15, 33]. Notice that
the size of an extracted face is usually smaller than the resolution of the original image
in the database. Next, all faces were automatically aligned using image registration
algorithms. The visible images (originally in colors) were converted to grayscale
images prior to facial feature extraction.
The performance of each single face recognition algorithm was tested on the two datasets,
ASUDC and ASUIR; the results of mean fusion are also presented in Table 5.1.
The performance of score fusion with different combinations of face scores
is reported in Tables 5.2 and 5.3.

5.7.2 Performance of Single Face Recognition Algorithms

The three selected algorithms (CGF, FPB, EBGM) were tested with the same two
datasets as shown in Table 5.1. The FPBs were formed using 8 × 16 GWT [8],
while the FP-CGF were extracted with 4-band CGF. For face matching, CGF and
FPB utilized the Hamming distance, while EBGM used a similarity score [17, 18]. The
experimental results are presented in Table 5.1, where only the top-1 match (the
shortest distance or the highest similarity) results were reported.
From the results shown in Table 5.1, the EBGM is the best recognition algorithm
across the two datasets, referred to as the single best matcher (SBM). The FPB is the second
best, whose performance is slightly lower than that of the SBM.
The results of mean fusion are listed in Table 5.1. In the bottom row, the three
results at the left are the fusions of two single-matcher scores from the two modalities, while
the two results in the rightmost column are the fusions of three single-modality scores
from three matchers. All mean fusions are better than the SBM (91.67 %, shown in
italic in Table 5.1). The mean fusion of FPB scores can achieve AC = 96.19 %,
which is better than the other two fusions with CGF and EBGM scores. The result at
the lower-right corner (AC = 96.67 %, shown in boldface in Table 5.1) is the mean fusion
of all 6 scores, which is clearly the best.
FAR values are significantly decreased in all fusion results, especially for the
mean fusion of 6 scores (FAR = 3.33 %, compared with FAR = 8.33 % for SBM).

Table 5.1 The accuracy (AC, %) and FAR (%) of three face recognition algorithms and their mean
fusions tested on the ASUMS face dataset of 2 spectral images (of 105 subjects, 4 images per
subject)
Dataset\algorithm CGF FPB EBGM Mean fusion
ASUDC 85.71, 14.29 91.43, 8.57 91.67, 8.33 91.90, 8.10
ASUIR 75.24, 24.76 88.57, 11.43 90.24, 9.76 92.14, 7.86
Mean fusion 90.48, 9.52 96.19, 3.81 94.05, 5.95 96.67, 3.33

FRR values are not presented in Table 5.1 since they are not applicable to face matching
by the shortest distance (or the highest similarity) on a closed dataset.

5.7.3 Performance Comparisons of Seven Score Fusion Methods

All seven score fusion methods require a training process. In addition to score normal-
ization, two more processes are expected. First, the balance of genuine and impostor
scores shall be considered in the training process since the majority of scores are
impostors. To avoid possible bias in model training, six impostor scores per matcher
per band per subject were randomly selected in each dataset for training. All genuine
scores (i.e., three genuine scores per matcher per band per subject) as well as the
reduced impostor scores are called the “trimmed dataset”. The seven fusions shown
in Tables 5.2 and 5.3 were tested with the trimmed dataset. Secondly, a 10-fold cross
validation is applied to all seven fusions. Moreover, the same randomly split 10 folds
were used in all tests shown in Tables 5.2 and 5.3. In contrast, the mean fusion used
all normalized scores without any cross validation.
The implementation details of seven fusion methods in our experiments are given
as follows. In LDA fusion, a quadratic discriminant function was actually fit with
the training data. Five neighbors (k = 5) were used in the KNN fusion. A feed-
forward pattern recognition network (sometimes referred to as a perceptron) with 10
hidden neurons (in the hidden layer) was implemented for ANN fusion. In SVM
fusion, a 3rd-order polynomial kernel function was selected to map the training data
into the kernel space, and least squares was used to find the separating hyperplane.
There was no specific setting in BLR fusion. The number of Gaussian models (g)
in GMM fusion was selected according to the best performance (varying g from
1 to the number of face scores). In HMM fusion, g = 3 was empirically set; and
parameters m × n were assigned with 3 × 1 for ASUDC and ASUIR, 1 × 2 for CGF,
FPB and EBGM, and 3 × 2 for ASUMS. Both GMM fusion and HMM fusion were
trained with two models (functions) using the genuine scores and impostor scores,
respectively.
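
Assuming scikit-learn equivalents are acceptable stand-ins for the original implementations, the classifier-based settings listed above could be configured roughly as follows (the density-based GMM and HMM fusions are described in Sects. 5.5.4 and 5.5.5):

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

fusion_classifiers = {
    "LDA": QuadraticDiscriminantAnalysis(),          # a quadratic discriminant is actually fit
    "KNN": KNeighborsClassifier(n_neighbors=5),      # five nearest neighbors
    "ANN": MLPClassifier(hidden_layer_sizes=(10,)),  # 10 neurons in the hidden layer
    "SVM": SVC(kernel="poly", degree=3),             # 3rd-order polynomial kernel
    "BLR": LogisticRegression(),                     # binomial logistic regression
}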
In Tables 5.2 and 5.3, the score combinations per spectral band (ASUDC, ASUIR),
per matching algorithm (CGF, FPB, EBGM) and all 6 scores are listed in the top
row, wherein the SBM performance (AC, FAR) and the number of fusion scores are

Table 5.2 The values (%) of AC, FAR, and FRR of 7 fusion methods tested on variant combinations
of face scores: per modality (DC, IR) and all modalities (the [AC, FAR] of SBM and the number
of scores shown in the top row)
Dataset ASUDC ASUIR ASUMS
(Perf. of SBM) (91.67, 8.33) (90.24, 9.76) (91.67, 8.33)
(# Scores) (3) (3) (6)
LDA fusion 97.12, 1.71, 6.35 96.94, 1.50, 7.70 98.36, 1.10, 3.25
KNN fusion 97.40, 1.50, 5.87 97.26, 1.55, 6.27 98.98, 0.35, 3.02
ANN fusion 97.28, 1.28, 6.98 96.94, 0.88, 9.52 98.24, 0.80, 4.60
SVM fusion 97.34, 0.94, 7.78 97.00, 0.70, 9.84 98.98, 0.40, 2.86
BLR fusion 97.20, 1.02, 8.10 96.60, 0.56, 11.83 98.30, 0.62, 4.92
GMM fusion 96.74, 2.81, 4.60 96.58, 2.51, 6.11 98.44, 1.12, 2.86
HMM fusion 97.12, 1.71, 6.35 96.84, 1.63, 7.70 98.34, 1.42, 2.38

Table 5.3 The values (%) of AC, FAR, and FRR of 7 fusion methods tested on variant combinations
of face scores: per algorithm (CGF, FPB, EBGM) (the [AC, FAR] of SBM and the number of scores
shown in the top row)
Dataset CGF FPB EBGM
(Perf. of SBM) (85.71, 14.29) (91.43, 8.57) (91.67, 8.33)
(# Scores) (2) (2) (2)
LDA fusion 90.02, 9.12, 12.54 95.98, 3.40, 5.87 96.28, 1.69, 9.76
KNN fusion 95.34, 5.22, 3.02 97.70, 2.70, 1.11 96.24, 1.58, 10.24
ANN fusion 92.78, 3.08, 19.52 96.72, 1.55, 8.41 96.46, 1.15, 10.63
SVM fusion 92.88, 2.70, 20.24 96.44, 1.07, 10.95 94.72, 0.21, 20.32
BLR fusion 92.74, 3.75, 17.70 96.20, 1.66, 10.16 94.12, 0.16, 22.86
GMM fusion 90.28, 9.04, 11.75 96.20, 3.29, 5.32 96.18, 1.42, 10.95
HMM fusion 91.08, 8.56, 10.00 96.30, 2.84, 6.27 96.22, 1.98, 9.13

also given. In each cell three performance values are reported: AC, FAR, and FRR.
To test and calculate FAR, the gallery images corresponding to that probe image
must be excluded from the dataset, which simulates an open dataset that does not
include the user (intruder).
The performance values of seven fusion methods are listed in the seven rows
of Tables 5.2 and 5.3. It is clear that all fusion results are much better than SBM.
In each column, the best performance values (the highest AC, the lowest FAR, the
lowest FRR) are shown in boldface; while the second best values are shown in italic
font. According to the AC, KNN fusion is the best method on average (with one
exception in the EBGM column). The lowest FAR and the lowest FRR spread across
different fusion methods. In the fusion of all 6 scores (in the rightmost column in
Table 5.2), KNN fusion (AC = 98.98, FAR = 0.35) is the best method, and SVM
fusion (AC = 98.98, FAR = 0.40) is the second best method.
Comparing the fusion results from two image modalities, Table 5.2 shows that
ASUDC (visible) scores result in a slightly higher AC than ASUIR (thermal) scores for

all seven fusions, whereas ASUIR scores yield a lower FAR. Regarding the fusions of
3 different algorithms, FPB scores produce the best AC for 6 fusion methods (except
for LDA); while EBGM scores generate the lowest FAR across all seven fusions.
Let us define AC Improvement = AC(Fusion) − AC(SBM). The values (%) of
AC improvements for KNN fusion (the best method) across six score combinations
(corresponding to Row 2 by excluding the top header row) are: 5.73, 7.02, 9.63, 6.27,
4.57, 7.31. The high AC Improvement of the KNN fusion using CGF scores (9.63 %)
was due to its low AC(SBM) = 85.71 %. The AC Improvements with FPB scores
(6.27 %) and ASUIR scores (7.02 %) are very good.
The fusion results of all 6 scores (refer to the rightmost column) show small dif-
ferences among different fusion methods. The biggest difference among AC values,
0.74 %, is between KNN (98.98 %) and ANN (98.24 %), which is relatively small
compared with the AC Improvement of 7.31 % (= 98.98 − 91.67 %). This indicates that
more diverse scores make the performance difference among fusion methods less
significant.
In the future we will expand the ASUMS dataset and sufficiently test and compare
the proposed score fusion methods on a larger dataset. We will also investigate and
validate the current approaches and findings by using other multimodal biometrical
datasets (like fingerprint, iris, and/or more spectral images).

5.8 Conclusions

We present and compare the performance improvement of a face recognition system


using seven score fusion methods. Experimental results show that all score fusions
can improve the accuracy while reducing the FAR. Among the seven methods,
KNN is the best, and SVM is the second best according to the accuracy and the FAR
tested with the ASUMS face dataset.
The performance of the seven fusion methods is quite close when using
many diverse scores (e.g., 6 scores from two modalities and three matchers). The
comparisons of the two spectral bands show that the visible scores (ASUDC) result
in a higher AC, while the thermal scores (ASUIR) produce a lower FAR. The comparisons
of three matchers show that the FPB scores generate the highest AC; whereas the
EBGM scores yield the lowest FAR.
It is viable to create an integrated face recognition system equipped with multi-
spectral imaging devices. The integrated system is implemented with a score fusion
approach (e.g., KNN) that combines the face scores from multiple matchers (e.g.,
CGF, FPB, EBGM). The performance (accuracy and FAR) of the integrated system
is much higher than that of any single matcher or any single modality. Furthermore,
the multispectral imaging devices enable night vision capability (i.e., recognizing a
face in the dark or at nighttime).
Notice that the current experiments and conclusions were based upon a single
multispectral face dataset (ASUMS). It is expected that the experimental results can
be validated on other biometrical datasets.

Acknowledgments The current research is funded by the Department of Defense Research and
Education. The thermal face recognition research was previously supported by the Department of
Homeland Security.

References

1. Chang H, Koschan A, Abidi M, Kong SG, Won C-H (2008) Multispectral visible and infrared
imaging for face recognition. In: IEEE computer society conference on computer vision and
pattern recognition workshops
2. Bendada A, Akhloufi MA (2010) Multispectral face recognition in texture space. In: Canadian
conference on computer and robot vision, pp 101–106
3. Bebis G, Gyaourova A, Singh S, Pavlidis I (2006) Face recognition by fusing thermal infrared
and visible imagery. Image Vis Comput 24:727–742
4. Arandjelovic O, Hammoud RI, Cipolla R (2006) On person authentication by fusing visual
and thermal face biometrics. In: Proceedings of the IEEE international conference on video
and signal based surveillance
5. Nandakumar K, Chen Y, Dass SC, Jain AK (2008) Likelihood ratio-based biometric score
fusion. IEEE Trans Pattern Anal Mach Intell 30(2):342–347
6. Poh N, Bourlai T et al (2009) Benchmarking quality-dependent and cost-sensitive multimodal
biometric fusion algorithms. IEEE Trans Inf Forensics Secur 4(4):849–866
7. Zheng Y, Elmaghraby A (2011) A brief survey on multispectral face recognition and multimodal
score fusion. In: IEEE international symposium on signal processing and information and
technology (ISSPIT), pp 543–550
8. Zheng Y (2011) Orientation-based face recognition using multispectral imagery and score
fusion. Opt Eng 50:117202
9. Hill DLG, Batchelor P (2001) Registration methodology: concepts and algorithms. In: Hajnal
JV, Hill DLG, Hawkes DJ (eds) Medical image registration. CRC, Boca Raton
10. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features.
Proc CVPR 1:511–518
11. Viola P, Jones M (2001) Robust real-time object detection. Int J Comput Vis 57(2):137–154
12. Lienhart R, Maydt J (2002) An extended set of Haar-like features for rapid object detection.
Proc Int Conf Image Process 1:900–903
13. Freund Y, Schapire R (1995) A decision theoretic generalization of on-line learning and an
application to boosting. Computational Learning Theory: Eurocolt ’95, pp 23–37
14. Zhao W, Chellappa R, Krishnaswamy A (1998) Discriminant analysis of principal components
for face recognition. In: Proceedings of IEEE international conference on face and gesture
recognition, FG’98, 14–16, pp 336–341
15. Zheng Y (2012) Face detection and eyeglasses detection for thermal face recognition. In:
Proceedings of SPIE 8300
16. Lu J, Plataniotis KN, Venetsanopoulos AN (2003) Face recognition using LDA-based algo-
rithms. IEEE Trans Neural Netw 14(1):195–200
17. Wiskott L, Fellous JM, Krüger N, von der Malsburg C (1997) Face recognition by elastic bunch
graph matching. IEEE Trans Pattern Anal Mach Intell 19(7):775–779
18. Wiskott L, Fellous J-M, Krüger N, von der Malsburg C (1999) Face recognition by elastic
bunch graph matching. In: Jain LC et al (eds) Intelligent biometric techniques in fingerprint
and face recognition. CRC Press, Boca Raton, pp 355–396
19. Zhao W, Chellappa R, Rosenfeld A, Phillips PJ (2003) Face recognition: a literature survey.
ACM Comput Surv 35(4):399–458
20. Kuncheva LI (2002) A theoretical study on six classifier fusion strategies. IEEE Trans Pattern
Anal Mach Intell 24(2):281–286

21. Brunelli R, Falavigna D (1995) Person identification using multiple cues. IEEE Trans Pattern
Anal Mach Intell 17(10):955–966
22. Fierrez-Aguilar J, Ortega-Garcia J, Gonzalez-Rodriguez J, Bigun J (2005) Discriminative mul-
timodal biometric authentication based on quality measures. Pattern Recogn 38(5):777–779
23. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
24. McCullagh P, Nelder JA (1990) Generalized linear models. Chapman & Hall, New York
25. Kil RM, Koo I (2001) Optimization of a network with Gaussian kernel functions based on the
estimation of error confidence intervals. Proc IJCNN 3:1762–1766
26. Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Mining
Knowl Discov 2:121–167
27. Prabhakar S, Jain AK (2002) Decision-level fusion in fingerprint verification. Pattern Recogn
35(4):861–874
28. Ulery B, Hicklin AR, Watson C, Fellner W, Hallinan P (2006) Studies of biometric fusion.
NIST Interagency Report
29. McLachlan G, Peel D (2000) Finite mixture models. Wiley, Hoboken
30. Figueiredo M, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans
Pattern Anal Mach Intell 24(3):381–396
31. Baum LE, Petrie T (1966) Statistical inference for probability functions of finite state Markov
chains. Ann Math Stat 37:1554–1563
32. Kahler B, Blasch E (2011) Decision-level fusion performance improvement from enhanced
HRR radar clutter suppression. J Adv Inf Fusion 6(2):101–118
33. Reese K, Zheng Y, Elmaghraby A (2012) A comparison of face detection algorithms in visible
and thermal spectrums. In: International conference on advances in computer science and
application
Chapter 6
Unconstrained Ear Processing: What is Possible
and What Must Be Done

Silvio Barra, Maria De Marsico, Michele Nappi and Daniel Riccio

Abstract Ear biometrics, compared with other physical traits, presents both
advantages and limits. First of all, the small surface and the quite simple struc-
ture play a controversial role. On the positive side, they allow faster processing than,
say, face recognition, as well as less complex recognition strategies than, say, fin-
gerprints. On the negative side, the small ear area itself makes recognition systems
especially sensitive to occlusions. Moreover, the prominent 3D structure of distinc-
tive elements like the pinna and the lobe makes the same systems sensitive to changes
in illumination and viewpoint. Overall, the best accuracy results are still achieved in
conditions that are significantly more favorable than those found in typical (really)
uncontrolled settings. This makes the use of this biometric trait in real-world applications
still difficult to propose, since commercial use requires much higher robustness.
Notwithstanding the mentioned limits, ear is still an attractive topic for biometrics
research, due to other positive aspects. In particular, it is quite easy to acquire ear
images remotely, and these anatomic features are also relatively stable in size and
structure along time. Of course, as any other biometric trait, they also call for some
template updating. This is mainly due to age, but not in the commonly assumed way.
The apparently bigger size of elders’ ears with respect to those of younger subjects is
due to the fact that aging causes a relaxation of the skin and of some muscle-fibrous

S. Barra
University of Cagliari, Via Ospedale, 72 - 09124 Cagliari, Italy
e-mail: silvio.barra@unica.it
M. De Marsico (B)
Sapienza University of Rome, Via Salaria, 113 - 00198 Rome, Italy
e-mail: demarsico@di.uniroma1.it
M. Nappi
University of Salerno, Via Ponte Don Melillo, 84084 Fisciano, SA
e-mail: mnappi@unisa.it
D. Riccio
University of Napoli Federico II, Via Cintia 21, 80126 Napoli, Italy
e-mail: daniel.riccio@unina.it


structures that hold the so called pinna, i.e. the most evident anatomical element of
the ear. This creates the belief that ears continue growing all life long. On the other
hand, a similar process holds for the nose, for which the relaxation of the cartilage
tissue tends to cause a curvature downwards. In this chapter we will present a survey
of present techniques for ear recognition, from geometrical to 2D-3D multimodal,
and will attempt a reasonable hypothesis about the future ability of ear biometrics to
fulfill the requirements of less controlled/covert data acquisition frameworks.

6.1 Introduction

For thousands of years, humans have used physical traits (face, voice, gait, etc.) as a pri-
mary resource to recognize each other. As a matter of fact, the earliest documented
example of related strategies goes back to ancient Egyptians, who used height to
identify a person during salary payment. Recognizing an individual through unique
biometric traits can be used in many different applications. Authentication is a clas-
sical example. Relying on something that the subject must possess (cards, keys, etc.)
is subject to loss or theft, while relying on something that the subject must remember
(passwords, pins, etc.) is subject to forgetfulness or sniffing. Therefore, it is some-
times impossible to surely distinguish the real subject from an impostor, especially in
a remote session. Biometric traits are more difficult to steal or reproduce, so that their
integration in the authentication processes can reinforce it. Moreover, biometrics can
be exploited in further settings such as forensics, video-surveillance or ambient intel-
ligence. Ear biometrics is not among the most widely explored at present, though it
gained some credit for its reliability. The aim of this essay is to explore the available
approaches to extract a robust ear template, providing an accurate recognition, and to
discuss how these techniques can be suited to uncontrolled settings, which are typical
of real world applications. Even if we will try to describe the overall scenario of the
main techniques, an extensive and complete presentation of all methods presented in
literature is out of our scope. For this, we suggest the comprehensive works by Pflug
and Busch [74] and by Abaza et al. [5].
As an introductory consideration, it is worth noticing that quite few people are
aware of the uniqueness of the individual ear features. Actually, the external ear
“flap”, technically defined as pinna, presents several morphological components.
These make up a structure that, though relatively simple, varies significantly across
different individuals, as shown in Fig. 6.1.
The chapter will proceed as follows. The present section will introduce ear
anatomy and some preliminary considerations on its use as a biometric trait in auto-
matic recognition systems. Section 6.2 will present the most popular preprocessing
techniques for ear recognition, from acquisition to detection and feature extrac-
tion, and beyond. Section 6.3 will summarize techniques for ear recognition from
2D images, while Sect. 6.4 will do the same for ear recognition from 3D models.
Section 6.5 will present some relevant multimodal biometric approaches involving
ear, and finally Sect. 6.6 will draw some conclusions.

Fig. 6.1 Differences in the morphology of the pinna in different individuals

6.1.1 Ear Anatomy

Recognition systems based on ear analysis deal with the variations that characterize
its structure. Among the first researchers to understand and investigate the potential of
ear recognition, we can mention the Czech doctor R. Imhofer, who in 1906 noted that
he was able to distinguish among a set of 500 ears using only four features. Further
studies followed quite later. First of all, it is worth mentioning the pioneering work
by Iannarelli in 1949 [48]. The developed anthropometric recognition technique
is based on a series of 12 measurements and on the information concerning sex
and race. Obviously, among the anatomical structures that identify the so-defined
external, middle and inner ear, only the visible part of the first ones, namely those
composing the pinna, are relevant for recognition (Fig. 6.2). Actually, humans are
not able to recognize a person from the ears. However, this might be merely due
to the lack of training, since Iannarelli’s studies and a number of following ones
have demonstrated that they have a good discriminating power. A further outcome of
these studies is that ear variations are very limited along time. Growth is proportional
from birth to the first 4 months, then the lobe undergoes a vertical elongation from
4 months to the age of 8 years due to gravity force. The further growth in size, until
about the age of 20, does not alter the shape, which remains constant until about
the age of 60/70, and finally undergoes a further elongation. According to a widely
diffused belief, ears continue growing all life long, therefore elders’ ears are usually
bigger than those of younger subjects. However, this is rather due to the fact that aging
causes a relaxation of the skin and of some muscle-fibrous structures, in particular
those that hold the so called pinna, i.e. the most evident anatomical element of the
external part of the ear.
An interesting remark regards soft biometrics used for recognition, which are
somehow related to ear. Akkermans et al. [7] exploit a particular acoustic property of
the ear: due to the characteristic shape of the pinna and ear canal, when a sound signal

Fig. 6.2 Anatomy of the pinna of external ear and Iannarelli’s measures

is played into it, it is reflected back in a modified form. This can be considered to
produce a personal signature, which can be generated, for example, by a microphone
incorporated into the head-phones which catches the reflected sound signal for a test
sound produced by the speaker of a mobile phone.
Grabham et al. [40] present a thorough assessment of otoacoustic emissions as a
biometric trait. Otoacoustic emissions are very low level sounds that the human ear
normally emits. They can be classified into transient evoked otoacoustic emissions
(TEOAE) and distortion product otoacoustic emissions (DPOAE), which are triggered
by different kinds of stimuli. Although they only allow recognizing people wearing
a headset, they can be used for continuous re-identification during
interaction with a protected system. The authors demonstrate that stability is generally
limited to a period of six months. For this reason, a suitable setting may require an
initial identification of the user through a stronger biometrics at the beginning of
each session, and the continuous re-identification through this light-weight trait.
Since parameters such as OAE type, equipment type and use, and background noise,
may significantly affect recognition, this biometric trait requires further exploration.

6.1.2 Positive and Negative Aspects

It is important to summarize pros and cons of ear recognition according to the basic
requirements for a biometric feature [50]:
• universality: every subject must present that feature;
• uniqueness: no pair of subjects share the same value for that feature;
• permanence: the value of that feature must not change over time;
• collectability: the value of that feature must be quantitatively measurable;

More parameters regard performance in terms of accuracy and computational
requirements, acceptability in terms of subjects’ level of acceptance, and circumven-
tion in terms of robustness against cheating.
Derived from the above elements, we can first list positive aspects of ear
recognition:
• universality: as a part of the body, every subject normally has them;
• uniqueness: they are sufficient to discriminate among individuals;
• efficacy/efficiency: good accuracy/low computational effort due to the small area;
• stability:
– low variations except for lobe length;
– uniform distribution of colors;
– absence of expression variations;
• contactless acquisition: the image can be acquired at a distance;
• cheap acquisition: low cost of capture devices (CCD camera);
• easy and quick measurability.
The main negative aspects are:
• the small area is more sensitive to occlusions by hair, earrings, and caps;
• ear image can be confused by background clutter;
• the prevalent 3D nature of ear features makes captured images very sensitive to
pose and acquisition angle;
• acquisition at a distance introduces heavy limits due to the combination of small
area and possibly low resolution.
As a consequence, the best results are presently obtained thanks to a strict profile
pose, absence of hair or other occlusion and relatively short distance; in other words,
user cooperation is required.

6.1.3 A Summary of Different Techniques Applied to Ear Biometrics

A first rough classification of human ear identification techniques using outer


anatomical elements distinguishes two approaches: the first one exploits detection
and analysis of either feature points or relevant morphological elements, and derives
geometric measures from them; the second one relies on global processing [68].
Taking into further consideration the dimensionality of captured data and the kind
of feature extraction, we can also identify the following categories:
1. Recognition from 2D images
• Geometrical approaches (interest points)
• Global approaches
• Multiscale/multiview approaches
• Thermograms

2. Recognition from 3D models


3. Multimodal systems
We will briefly sketch some examples here and return to them in more detail in the
following sections.

Recognition from 2D images


In geometrical approaches, a set of measures are computed over a set of interest
points and/or interest contours, identified on a 2D normalized photo of the ear. The
pioneering work related to ear recognition, namely “Iannarelli system” of ear identi-
fication, is a noticeable example along this line. In Iannarelli’s system, the ear image
is normalized with respect to dimensions. The point named Crux of Helix is identified
(see Fig. 6.2) and becomes the center of a relative space. All measures are relative to
it, so that a wrong identification of such point compromises the whole measurement
and matching processes.
Burge and Burger [18, 19] are among the first to try more advanced techniques.
They use Voronoi diagrams to describe the curve segments that surround the ear
image, and represent the ear by an adjacency graph whose distances are used to
measure the corresponding features (Fig. 6.3). The method has not been extensively
tested, however we can assume that Voronoi representation suffers from the extreme
sensitiveness of ear segmentation to pose and illumination variations.
Global approaches consider the whole ear, instead of possible relevant points.
Hurley et al. [45, 46] address the problem of ear recognition through the simulation
of natural electromagnetic force fields. Each pixel in the image is treated as a Gaussian
attractor, and is the source of a spherically symmetric force field, acting upon all the
other pixels in a way which is directly proportional to pixel intensity and inversely
proportional to the square of distance. The ear image is so transformed into a force
field. This is equivalent to submitting it to a low-pass filter, transforming it into a smooth
surface where all information is still maintained. The directional properties of the
force field can support the location of local energy peaks and ridges to be used to
compose the final features. The technique does not require any explicit preliminary
description of ear topology, and the ear template is created just following the force
field lines. It is invariant to the translation of the initializing position and to scaling, as
well as to some noise. However, the assessment of the recognition accuracy has been
attempted only with images presenting variations on the vertical plane and without
hair occlusion [47].
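
A rough sketch of such a force field transform is given below, under the assumption that each pixel attracts every other location with a force proportional to its intensity and inversely proportional to the squared distance; the quadratic-cost formulation and the names are illustrative, not the authors' implementation.

import numpy as np

def force_field(image):
    # Simple force-field transform of a grey-level image (O(N^2), small images only).
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    positions = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    intensities = image.ravel().astype(float)

    field = np.zeros((h * w, 2))
    for j, p in enumerate(positions):
        diff = positions - p                 # vectors pointing towards every attractor
        dist2 = (diff ** 2).sum(axis=1)
        dist2[j] = np.inf                    # a pixel exerts no force on itself
        # |F| ~ intensity / distance^2, direction along diff / |diff|.
        field[j] = (intensities[:, None] * diff / dist2[:, None] ** 1.5).sum(axis=0)
    return field.reshape(h, w, 2)

ear_field = force_field(np.random.rand(32, 24))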
Since face and ear present some common features (e.g., sensitivity to pose and
illumination, and to occlusions), it is natural to inherit face-related tech-
niques. In particular, Principal Component Analysis (PCA) has been widely used
for face template coding and recognition. In [26], PCA is applied to both biometrics
for comparison purposes, demonstrating similar performance as well as limitations.
Despite the apparently easier task, even for ear recognition, images must be accu-
rately registered, and extraneous information must be discarded by a close cropping.
Last but not least, this method suffers from very poor invariance to those factors
which also affect face recognition, in particular pose and illumination.
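
A minimal "eigen-ear" sketch in this spirit is shown below: closely cropped and registered ear images are projected onto the principal components and matched by nearest neighbor in the reduced space. The gallery data and all names are invented for illustration.

import numpy as np
from sklearn.decomposition import PCA

# Toy gallery: each row is a flattened, registered and closely cropped ear image.
gallery = np.random.rand(50, 64 * 96)
gallery_ids = np.arange(50)

pca = PCA(n_components=20)                  # keep the leading principal components
gallery_proj = pca.fit_transform(gallery)

def identify(probe_image):
    # Return the gallery identity whose projection is closest to the probe's.
    probe_proj = pca.transform(probe_image.reshape(1, -1))
    distances = np.linalg.norm(gallery_proj - probe_proj, axis=1)
    return gallery_ids[np.argmin(distances)]

best_match = identify(np.random.rand(64 * 96))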

Fig. 6.3 Relevant edges on ear external surface and derived Voronoi diagram and adjacency graph

Further global approaches, still borrowing from face recognition, are presented
in [68, 84]. The former applies neural network strategy to ear recognition, test-
ing Compression Networks, Borda Combination, Bayesian and Weighted Bayesian
Combinations techniques. The latter exploits Haar wavelets transformation.
In multiscale/multiview approaches, the set of features used for recognition is
enriched by considering more scales for the same image, or more acquisitions, pos-
sibly from slightly different points of view, for the same trait of the same subject.
While multiview approach is often used to obtain a 3D model of the anatomical
element, it can be also used in 2D techniques.
Though acquired by different equipment and containing information differ-
ent from pixel intensities, thermograms are also 2D images. Through a thermo-
graphic camera, a thermogram image captures the surface heat (i.e., infrared light)

emitted by the subject. These images are not sufficiently detailed to allow recognition,
but can rather be used for ear detection and segmentation, especially in those cases
where the ear is partially covered and passive identification is involved (the user does
not cooperate and might be unaware of the acquisition). In such cases, texture and
color segmentation should allow to discard hair region. As an alternative, Burge and
Burger [19] propose to use thermogram images. The pinna usually presents a higher
temperature than hair, so that the latter can be segmented out. Moreover, if the ear is
visible, the Meatus (i.e., the passage leading into the inner ear) is the hottest part of
the image, which is clearly visible and allows one to easily detect and localize the rest
of the ear region. Disadvantages include sensitiveness to movement, low resolution
and high costs.

Recognition from 3D models


Like other techniques, 3D processing for ear has followed the success of three-
dimensional techniques applied to face. As a matter of fact, they solve similar prob-
lems of sensitiveness of 2D intensity images to pose and illumination variations,
including shadows which may sometime play a role similar to occlusions. On the
other hand, the outer part of the ear presents even richer and deeper 3D structure than
face, with a very similar discriminating power, which can be profitably modeled and
used for recognition purposes. Among the first and most significant works along this
direction, we mention the one by Chen and Bhanu [16, 29], and the one by Yan and
Bowyer [108, 111]. Actually, both research lines have an articulated development in
time, and we will only mention the main achievements. Both use acquisitions by a
range scanner. In the first approach, ear detection exploits template matching of edge
clusters against an ear model; the model is based on the helix and antihelix, which are
quite extended and well identifiable anatomical elements; a number of feature points
are extracted based on local surface shape, and a signature called Local Surface Patch
(LSP) is computed for each of them. This signature is based on local curvature, and is
used together with helix/antihelix to compute the initial translation/rotation between
the probe and the gallery model. Refined transformation and recognition exploit Iter-
ated Closest Point (ICP). ICP is quite simple and accurate, therefore it is widely used
for 3D shape matching. The downside is its high computational cost. The
authors also test on 3D models of the ear a more general approach presented in [30],
which integrates rank learning by SVM for efficient recognition of highly similar 3D
objects.
In the second mentioned approach, an efficient ICP registration method exploits
enrollment data, assuming that biometric applications include a registration phase
of subjects before they can be recognized. Moreover, ear extraction exploits both
2D appearance and 3D depth data. A detailed comparison of the two methods can
be found in [29]. It is worth mentioning that a number of attempts are made to
reduce the computational cost of ICP. As an example, in [112] Yan and Bowyer use a
k-d tree structure for points in 3D space, decompose the ear model into voxels, and
extract surface features from each of these voxels. In order to speed up the alignment
process, each voxel is assigned an appropriate index so that ICP only needs to align
voxel pairs with the same index.
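
For reference, a bare-bones ICP loop in the spirit of the methods above is sketched here: nearest-neighbor correspondences are obtained with a k-d tree and the rigid transform is re-estimated at each iteration via SVD. This is a generic illustration of ICP, not the authors' optimized voxel-indexed variant.

import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    # Least-squares rotation R and translation t mapping src points onto dst points.
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    return R, c_dst - R @ c_src

def icp(probe, gallery, iterations=30):
    # Align a probe point cloud to a gallery ear model and return the final RMS error.
    tree = cKDTree(gallery)
    current = probe.copy()
    for _ in range(iterations):
        dist, idx = tree.query(current)      # nearest-neighbor correspondences
        R, t = best_rigid_transform(current, gallery[idx])
        current = current @ R.T + t
    return np.sqrt(np.mean(dist ** 2))       # a small error suggests the same ear

error = icp(np.random.rand(500, 3), np.random.rand(800, 3))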

Multimodal systems
Multimodal systems including ear biometrics range from those that jointly exploit
2D images and 3D models, like those tested in [110], to those that exploit ear with
a variety of different biometrics. Due to physical proximity allowing an easy com-
bined capture, ear is very often used together with face. Results presented in all
works on multibiometric fusion demonstrate that the performance of multimodal sys-
tems is affected both by the performance of the single recognition modules involved and by
the adoption of a suitable fusion rule for combining the obtained results.

6.1.4 Some Available Datasets

In this section, we will only summarize the main features of the public ear datasets
which are used in the works considered herein. The aim of this presentation is to
provide a way to better compare such works, and in no way to give an exhaustive
list. On the other hand, it should be considered that the kinds of approaches discussed here
address variable settings. Early datasets contain images acquired in well-controlled
pose and illumination conditions, and therefore pose problems that are too trivial to provide a
significant evaluation of recognition performance. Moreover, this section will only
describe either databases which are most frequently used and well described in lit-
erature, or databases which are available in different well-identifiable compositions,
which are alternatively used in different works. For home-collected databases or for
those lacking a consistent description, we will include details within the presentation
of approaches using them. For a more detailed list and description, we again refer to
the extensive review works by Pflug and Busch [74] and by Abaza et al. [5].

Carreira-Perpiñan database
Carreira-Perpiñan database is among those containing images with the least distortion; it
includes 102 grey-scale images (6 images for each of 17 subjects) in PGM. The ear
images are cropped and rotated for uniformity (to a height/width ratio of 1.6), and
slightly brightened.

IIT Kanpur ear database


IIT Kanpur ear database is one of the first made available to the research community.
It seems to be no longer available, and no complete description is available at present.
Different works report a different composition, or at least different subsets used for
experiments, and it is not possible to identify the exact content. One version contains
150 side images of human faces with resolution 640×480 pixels. Images are captured
from a distance of 0.5–1 m. A second version contains images of size 512×512 pixels
of 700 individuals. In general, it seems that even in its “hardest” composition, this
database mainly contains images affected by rotation and illumination. Moreover,
images are mostly filled by the ear region, so that detection is not that challenging.
When this database is exploited, we will report the composition used from time to
time.

University of Notre Dame (UND)


UND offers five databases containing 2D images and depth images for the ear,
which are available under license (http://www.cse.nd.edu/~cvrl/CVRL/Data_Sets.
html). The databases are referred as Collections with different features, acquired in
different periods.
• Collection E (2002): about 464 right profile images from 114 subjects. From three
to nine images were taken for each user, on different days and under varying pose
and illumination conditions.
• Collection F (2003–2004): about 942 3D (depth images) and corresponding 2D
profile images from 302 subjects.
• Collection G (2003–2005): about 738 3D (depth images) and corresponding 2D
profile images from 235 subjects.
• Collection J2 (2003–2005): about 1800 3D (depth images) and corresponding 2D
profile images from 415 subjects.
• Collection NDOff-2007: about 7398 3D and corresponding 2D images of 396
subjects. The database contains different yaw and pitch poses, which are encoded
in the file names.

USTB ear database


USTB ear database is distributed in different versions (http://www1.ustb.edu.cn/
resb/en/subject/subject.htm).
• USTB I (2002)—180 images of 60 subjects; each subject appears in three images:
(a) normal ear image; (b) image with small angle rotation; (c) image under a
different lighting condition.
• USTB II (2003–2004)—308 images of 77 subjects; each subject appears in four
images; the profile view (0◦ ) is identified by the position of the CCD camera
perpendicular to the ear; for each subject the following four images were acquired:
(a) a profile image (0◦ , i.e. CCD camera is perpendicular to the ear plane); (b) an
image with −30◦ angle variation; (c) an image with 30◦ angle variation; (d) an
image with illumination variation.
• USTB III (2004)—79 volunteers; the database includes many images for each
subject, divided in regular and occluded images:

– Regular ear images: rotations of the head of the subject towards the right side
span the range from 0◦ to 60◦ , and images correspond to the angles 0◦ , 5◦ , 10◦ ,
15◦ , 20◦ , 25◦ , 30◦ , 40◦ , 45◦ , 50◦ , and 60◦ ; the database includes two images at
each angle resulting in a total of 22 images per subject; in a similar way, but
with less acquisitions, rotations of the head of the subject towards the left side
span the range from 0◦ to 45◦ and images correspond to the angles 0◦ , 5◦ , 10◦ ,
15◦ , 20◦ , 25◦ , 30◦ , 40◦ , 45◦ ; even in this case, the database includes two images
at each angle resulting in a total of 18 images per subject;
the left rotation directory also contains face images: subjects labeled from 1
to 31 have two frontal images, labeled as 19 and 20; subjects labeled from
32 to 79 have two frontal face images and also two images in each of four

Fig. 6.4 Capture angles for images in USTB IV

other views, namely 15◦ rotation to right, 30◦ rotation to right, 15◦ rotation
to left and 30◦ rotation to left, therefore each one has 10 face images labeled
as 19–28.
– Ear images with partial occlusion: 144 ear images with partial occlusion belong
to 24 subjects with 6 images per subject; three kinds of occlusion are included,
namely partial occlusion (disturbance from some hair), trivial occlusion (little
hair), and regular occlusion (natural occlusion from hair).

• USTB IV (2007–2008)—17 CCD cameras at 15◦ interval and bilaterally symmet-


rical are distributed in a circle with radius 1 m (see Fig. 6.4); each of 500 subjects
is placed in the center, and images correspond to the subject looking at eye level,
upwards, downwards, right and left; each pose is photographed with the 17 cameras
simultaneously to capture the integral image of face and ear.
XM2VTS database
XM2VTSDB [64] collects multimodal (face and voice) data. The database includes
295 subjects, each recorded during four sessions. Each session corresponds to two
head rotation shots and six speech shots (three sentences twice). Datasets from this
database include high-quality color images, sound files, video sequences, and a 3D
model. The XM2VTS database is publicly available but not for free.

FERET database
The large FERET database of facial images was gathered from 1993 to 1997 (http://
www.itl.nist.gov/iad/humanid/feret/feret_master.html); the images were collected in
a semi-controlled environment. The database includes 1564 sets of images for a total
of 14126 images, corresponding to 1199 subjects and 365 duplicate sets of images.

Each individual has images at right and left profile (labeled pr and pl). The FERET
database is available for public use.

IIT Delhi ear database


It is among the most recently released ones. It is publicly available on request. It has been
collected by the Biometrics Research Laboratory at IIT Delhi since October 2006
using a simple imaging setup. Images come from the students and staff at IIT Delhi,
New Delhi, India, and are acquired in indoor setting. The database currently contains
images from 121 different subjects and each subject has at least three ear images.
The resolution of these jpeg images is 272 × 204 pixels. Recently, a larger version
of ear database (automatically cropped and normalized) from 212 users with 754
ear images was also made available on request (http://www4.comp.polyu.edu.hk/
~csajaykr/IITD/Database_Ear.htm).

UBEAR database
This database is available after registration at http://ubear.di.ubi.pt/ and can be con-
sidered among the first sets of images actually captured in a realistic setting [79].
Data was collected from subjects on-the-move, under changing lighting. The sub-
jects were not required any special care regarding ears occlusions or poses. For this
reason the UBEAR dataset is a valuable tool allowing to assess the robustness of
advanced ear recognition methods. Both ears of 126 subjects were captured start-
ing from a distance of 3 m and facing the capture device. Then the subjects were
asked to move their head upwards, downwards, outwards, towards. Afterwards, sub-
jects stepped ahead and backwards. For each subject two different capture sessions
were performed, giving a total of four videos per subject. Each video sequence was
manually analyzed and 17 frames were selected:
• 5 frames while the subject is stepping ahead and backwards
• 3 frames while the subject is moving the head upwards
• 3 frames while the subject is moving the head downwards
• 3 frames while the subject is moving the head outwards
• 3 frames while the subject is moving the head towards.

6.1.5 Brief Discussion

When addressing real-world applications, uncontrolled settings can affect accurate


ear recognition through many factors: (1) the ear might be cluttered/occluded by
other objects, and due to the small size this is a more serious problem than with face;
(2) the amount, direction and color of light on an ear can affect its appearance, due
to its rich 3D structure; (3) out-of-plane head rotations can hinder recognition, again
due to the 3D structure of the pinna; (4) the resolution and field of view of the camera
can significantly affect recognition, especially if acquisition is at a distance.
Literature shows that the best results are achieved by 3D model matching alone
or in combination with, e.g., 2D images. This allows recognizing ears under varying

illumination and poses. However, specialized equipment is required, which fur-


ther needs controlled illumination. Moreover, in many non-contact/non-collaborative
applications, it is required to work with surveillance photographs or with frames cap-
tured by surveillance videos, and this means that ears must be most often recognized
from 2D data sources. The next sections will attempt a comprehensive overview of
present techniques that potentially lend themselves to the above scenario.

6.2 Ear Image Preprocessing

Biometric recognition deals with the identification of a subject by characteristic


biometric traits (e.g., ears), and requires extracting discriminative information from
such traits and organizing it into a template. Settings and conditions for biometric
sample acquisition may vary according to the trait at hand and to the dimension-
ality of the captured sample (2D or 3D). Afterwards, the acquired sample must be
possibly enhanced and segmented: it is necessary to detect and locate the relevant
biometric elements, and to extract them from the non-relevant context. The extracted
region possibly undergoes a normalization/correction phase, in order to obtain a
“canonical” representation. These preprocessing tasks are often independent from
the chosen recognition approach. On the other hand, the following feature extraction
and biometric template (key) construction, are always performed consistently with
the methodology chosen for the recognition phase. For sake of consistent compar-
ison, the same set of procedures for sample acquisition and template extraction is
applied during both enrolling of interesting subjects, and recognition of probes. Dur-
ing enrollment the obtained template is stored in a database (gallery), while during
recognition it is compared with the stored ones. In any real world application, robust
enrollment is as important as robust recognition: quality of the former may signif-
icantly affect performance of the latter. The overall process implemented by any
biometric system can be facilitated and enhanced, through the execution of appro-
priate preprocessing steps. We will survey a number of preprocessing techniques for
ear biometrics which are found in literature. It is worth noticing that a correct ear
location/segmentation is a necessary condition for a reliable recognition, though this
may be hindered anyway by further factors like pose and illumination. In all cases,
if the ear is poorly acquired or segmented, the possibilities of a correct recognition
decrease dramatically. This especially holds for real world applications.

6.2.1 Acquisition Process

In most biometric systems, the acquisition process allows capturing an image of the
biometric trait (e.g., fingerprints, face, iris, and ear). Relevant factors that influence
the following process are the quality of the acquisition device and the kind of capture
environment. For instance, the quality of a 2D ear image can be affected by a dirty

lens, which may produce image artifacts, or by a low sensor resolution, which may
cause a loss of sufficient image details. Illumination is an important environmental
factor, which may produce uneven image intensity and shadows. Finally, further
elements to consider are occlusions, like hair, earrings and the like, which may
distort relevant ear regions, as well as pose changes, affecting the 2D projection of the
inherently 3D ear anatomical structures. Pose and illumination variations can be dealt
with using 3D models. At present, biometric modalities that rely on 3D information
are constantly spreading, thanks to the increasing availability and decreasing costs of
3D capture devices, e.g., 3D scanners. In particular, biometric recognition algorithms
based on 3D face and, more recently, 3D ear data are achieving growing popularity. In
both cases, one can control the pose of the acquired subject and the distance from the
capturing device, as well as the illumination, using an appropriate set and disposition
of light sources. However, this is not possible in real world applications, and neither
2D nor 3D models are able to provide full invariance to any kind of distortion. Of
course, acquisition in 2D and 3D differ not only for the kind of device but also for
the structure of the obtained samples.
Acquisition in 2D requires digital cameras or video cameras, to capture still images
or video sequences from which single frames can be extracted. In this case, elements
that can change are the quality of lens and resolution of sensor. The acquired sample
is usually a 2D intensity image, where the value of each pixel represents the intensity
of the light reflected in that point by the surface. The color depends on how the
light is reflected. The features and quality of an intensity image also depend on
the camera parameters, which can be divided into extrinsic and intrinsic ones. The
extrinsic parameters denote the coordinate system transformations from 3D world
coordinates to 3D camera coordinates, so that they are used to map the camera
reference frame onto the world reference frame. The intrinsic parameters describe
the optical, geometric and digital characteristics of the camera, for example focal
length, image format, and the geometric distortion introduced by the optics. As
discussed above, intensity information can be substituted by heat information, if a
thermographic camera is used.
Acquisition in 3D can be performed in more ways, and therefore can produce
different kinds of samples. The most common ones are range images, also referred
to as 2.5D images, which are 2D images where the value associated to each pixel
represents the distance between that point and the capture device, and shaded models,
which are structures including points and polygons connected to each other within a
3D space. Shaded models can also be enriched by mapping a texture on their surface,
which can be derived by a 2D image of the same subject. Figure 6.5 shows examples
of the most used kinds of ear sample.
One of the earliest techniques to obtain a 3D model relies on stereo vision, i.e. on
stereo image pairs. It is a passive scanning technique (see [34]), since the scanning
device does not emit an “exploratory” beam towards the object. Stereoscopic sys-
tems usually employ two video cameras, looking at the same scene from two slightly
translated positions. These cameras are calibrated, a process that yields their extrinsic and intrinsic parameters. In practice, this methodology makes it possible to reconstruct
a 3D model from a pair of 2D images. The critical point of the model construction

Fig. 6.5 2D and 3D ear samples: a intensity image; b thermogram; c range image; d shaded model

procedure is the search for relevant corresponding features in the images captured by
each camera. The differences between such features make it possible to determine
the depth of each point inside the images, according to principles similar to those
underlying the human stereoscopic vision. The final result might be influenced by
illumination, but the overall quality of the sample is medium. A number of recent
proposals aim at overcoming the current limitations. In [121] a 3D ear reconstruction
method is presented which is based on binocular stereo vision. A matching phase
based on Scale Invariant Feature Transform (SIFT) is used to obtain a number of
seed matches. This gives a set of sparse correspondences. Then the authors apply
an adapted match propagation algorithm exploiting epipolar geometry constraints,
to obtain a quasi-dense set of correspondences. Finally, the 3D ear model is recon-
structed by triangulation.
In [88], the same authors directly exploit epipolar geometry, after cameras cal-
ibration and after ear location using AdaBoost in the pair of captured images (we
will address location techniques later). In most reconstruction techniques, feature
points are extracted independently in the two images, and then correspondences are
searched. Along an alternative line, in the mentioned work Harris corner detector
is used to extract feature points from the first image. For every such point, epipolar
geometry constraints guide the identification of an epipolar line in the second image.
When two cameras view a 3D scene from two distinct but well known positions,
a number of geometric relations hold between the 3D points and their projections
onto the two 2D images. These relations constrain correspondences between pairs of
image points. Following the consideration that the corresponding point in the second
image should be on, or very near to, the epipolar line, the search can be narrowed.
The 3D ear model can be reconstructed by the triangulation principle.
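As a rough illustration of the triangulation step shared by these stereo approaches, the following Python sketch (using OpenCV) recovers 3D ear points from already matched 2D correspondences; it assumes that calibration has produced the two projection matrices and that matches have been obtained, e.g., under epipolar constraints. It is a minimal sketch, not the implementation of any of the cited works.

```python
import numpy as np
import cv2

def triangulate_ear_points(P1, P2, pts1, pts2):
    """Linear triangulation of matched 2D ear points from a calibrated stereo pair.

    P1, P2     : 3x4 projection matrices obtained from camera calibration.
    pts1, pts2 : Nx2 arrays of corresponding points (e.g., epipolar-guided matches).
    """
    pts1_t = np.asarray(pts1, dtype=np.float64).T        # OpenCV expects 2xN arrays
    pts2_t = np.asarray(pts2, dtype=np.float64).T
    X_h = cv2.triangulatePoints(P1, P2, pts1_t, pts2_t)  # 4xN homogeneous points
    return (X_h[:3] / X_h[3]).T                          # Nx3 Euclidean coordinates
```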
Photometric systems are a further passive alternative for 3D scanning. They usu-
ally exploit a single camera, but take multiple images under varying illumination.
Using such block of overlapped images, they invert the image formation model in the
attempt to recover the surface orientation at each pixel. In particular, in Close-Range
Photogrammetry (CRP) the camera is close to the subject and can be hand-held
or mounted on a tripod. This type of photogrammetry is often called Image-Based
Modeling.
All passive scanning techniques require some illumination control. An alternative
is to use active scanners instead. They emit some kind of light (or radiation), including
visible light, ultrasound or x-ray, and detect the light reflection (or the radiation
passing through the object). Laser scanners project a single laser beam on the object;
a 3D extension of triangulation (knowing the position of the laser source and that of
the capture device) is the most used technique for surface reconstruction when the reconstruction of close objects is involved [33]. Laser illumination has two main advantages
over incandescent or fluorescent light: (a) a laser beam can be focused tightly, even
over long distances, therefore allowing to improve the scan resolution, and (b) laser
light has a very narrow radiation spectrum, so that it is robust to ambient illumination.
As a consequence, the results are of high level. However, this technique is quite
invasive, since the laser beam may damage the retina. On the contrary, structured
light scanners (see [66]) use ordinary light, so that there is no danger for the retina.
In order to improve robustness, they project a light pattern (e.g., a grid, or an arc)
on the object to acquire: distortions of the pattern caused by the 3D structure of the
object give information on its 3D surface. The final result is still less accurate than
that obtained with laser. In both cases the reconstruction of the full 3D model of an
object requires scanning it from different points of view.
The recent work in [61] presents low-cost equipment for the acquisition of 3D ear
point clouds, including a 3D reconstruction device using line-structure light, which is
made of a single line laser, a stepping motor, a motion controller and a color camera.
The laser is driven by the step motor and continually projects beams onto the ear.
The camera captures and stores the ear images projected by the laser. These serial
strips are used as a basis for triangulation to recover the 3D coordinates of the scene
points and finally to acquire 3D ear data.

6.2.2 Detection

While the reduced size of the ear can be considered as an advantage for biometric
processing, at the same time it may hinder its detection, and makes its recognition
less robust to occlusions. The detection of the ear, i.e. the location of the relevant
head region containing it, should separate it from the non-relevant parts
of the image, e.g., hair and the remaining part of the face. Less recent approaches
localize the ear after (manually) defining the small portion of the side of the face
around it, to which the search can be reasonably limited. However, non-intrusive,
real-world applications require the ability to detect the ear from a whole side face
image, where, as noticed, hair and earrings cause hindering occlusions, and where
ear position might not be frontal with respect to the acquisition device. Moreover
a good resolution is often required. We will briefly summarize some more recent
approaches, where some of these problems are better addressed. They can exploit
some anatomical information, like skin color or the helix curve, or variations of
AdaBoost methodology.

6.2.2.1 Skin Color and Template Matching

Skin color has been used in template based techniques, or together with contour
information. In fact, due to the similar color of the ear and the surrounding face region,
a first rough identification of the candidate area is used to reduce the search space,
and is almost always followed by a refinement through different techniques.
In [6] ear detection is preceded by a step of skin location to identify the face region.
Skin-tone detection follows Flek’s method [38], which uses a skin filter based on
color and texture. A morphological dilation operation is applied to the output of the
skin filter, to re-include zones possibly filtered out due to the color variation caused
by 3D structure (shadows, over illumination). Canny edge detection is applied only
to the skin region and an edge size filter removes short and/or isolated edges. Finally,
the ear region is detected using template matching. Actually, the used ear template
consists only of an edge representing the helix.
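The following sketch illustrates the general skin-then-edges flow described above; a simple HSV threshold stands in for the trained color/texture skin filter, and the threshold values and minimum edge size are illustrative placeholders, not values from [6].

```python
import cv2
import numpy as np

def helix_edge_candidates(bgr_image):
    """Rough skin-region edge extraction: skin mask, dilation to recover
    shadowed skin, Canny restricted to the skin region, short-edge removal."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Illustrative skin range; real systems train or learn these bounds
    skin = cv2.inRange(hsv, (0, 30, 60), (25, 180, 255))
    # Dilation re-includes zones lost to shadows or over-illumination
    skin = cv2.dilate(skin, np.ones((7, 7), np.uint8), iterations=2)
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    edges[skin == 0] = 0                    # keep edges on skin only
    # Edge size filter: drop short / isolated edge fragments
    n, labels, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < 30:
            edges[labels == i] = 0
    return edges
```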
The approach presented in [116] uses skin color together with contour information,
which is particularly rich in the ear region. In particular, the paper presents a tracking
method for video data which combines a skin-color model and intensity contour
information to detect and track the human ear in a sequence of frames. Due to the
presented setting, this method appears to be suited to real-world applications. In a
first step, Continuously Adaptive Mean Shift (CAMSHIFT) algorithm is used for
roughly tracking the face profile in the frame sequence, and therefore identify the
Region of Interest (ROI) for the next step. Afterwards, a contour-based fitting method
is used for accurate ear location within the identified ROI. Since the mean shift
algorithm operates on probability distributions, the color image must be represented
by a suitable color distribution. The original CAMSHIFT algorithm requires pre-selecting part of the face as the calculation region of the probability distribution to build
the skin-color histogram; of course, this is not permitted by the addressed settings.
Therefore the original procedure is modified by computing a suitable skin-color
histogram off-line. Once the skin pixels have been extracted from the side face, edge
detection and contour fitting are used to refine ear location. Since the shape of the ear
is, almost always, more or less similar to an ellipse, ellipse fitting is used to locate
it. The technique gives good results even with slight head rotation.
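A minimal sketch of one tracking iteration along these lines is given below, assuming the skin-color histogram has been computed off-line as discussed; OpenCV's CamShift and ellipse fitting stand in for the rough-tracking and contour-fitting stages, and all parameter values are illustrative.

```python
import cv2

def track_and_fit_ear(frame, track_window, skin_hist):
    """One iteration: CamShift follows the skin-colored profile region using
    an off-line hue histogram, then an ellipse is fitted to the strongest
    contour inside the tracked ROI."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0], skin_hist, [0, 180], 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, track_window = cv2.CamShift(prob, track_window, criteria)
    x, y, w, h = track_window
    roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(roi, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    contours = [c for c in contours if len(c) >= 5]   # fitEllipse needs >= 5 points
    if not contours:
        return track_window, None
    ellipse = cv2.fitEllipse(max(contours, key=cv2.contourArea))
    return track_window, ellipse
```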
In [75] ear location in a side face image is carried out in three steps. First, skin
segmentation eliminates all non-skin pixels from the image; afterwards, ear local-
ization performs ear detection on the segmented region, using a template matching
approach; finally, an ear verification step exploits the Zernike moments-based shape
descriptor to validate the detection result. The image is mapped onto chromatic color
space, where luminance is removed. Skin segmentation exploits a Gaussian model
to represent the color histogram of skin-color distribution, since it can be approxi-
mated in this way in the chromatic color space. Skin pixels are selected according
to a corresponding likelihood measure. It is interesting to notice that ethnicity may
influence this process [25]. Skin chromas of different races may show some overlaps,
but certain chromas only belong to certain races. Therefore, a compromise must be
decided according to the kind of application. In [25], a chroma chart is prepared
in a training phase to determine pixel weights (likelihood to be skin pixels) in test
images. If the goal is to locate ears of a particular race in an image, the chroma
chart should be created using samples from only that race. This will increase the
skin detection accuracy. On the other hand, if the goal is to locate skin in images
from different races, as is more likely, the chroma chart must contain information
about the skin colors of various races, but this will select more image pixels as skin,
including a higher number of false positives. While these considerations might not
be relevant in an academic scenario, where enrolled subjects usually belong to the
main ethnic community of the country where the laboratory is located, they become
of paramount importance in real world applications, where they suggest appropriate
design choices. At present, the influence of demographics on biometric recognition
is a still emerging topic (see for example [56, 81] for face recognition).
Returning to [75], refinement of ear location within the skin region is performed
through an appropriate ear template. The template is created during a training phase
by averaging the intensities of a set of ear images, and considering various typical
ear shapes, from roughly triangular to round ones. The template is appropriately
resized during detection, according to the size of the bounding box of the skin region
identified in the side face, and to the experimentally measured typical ratio between
this region and the ear region. The normalized template is moved over the test image,
and normalized cross-correlation (in the range [−1, 1]) is computed at each pixel, with high values (close to 1) indicating the presence of an ear. Such presence is verified in
the last step. The reported accuracy of localization, defined in (6.1), is 94 %.

accuracy = (genuine localization/total sample) × 100 (6.1)

As can be expected, failures are reported especially for poor quality images and hair
occlusion.
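A minimal sketch of the template-matching stage might look as follows, assuming a precomputed average-ear template and a skin-region bounding box; the resizing ratio is only a placeholder for the experimentally measured skin-to-ear size ratio mentioned above.

```python
import cv2

def locate_ear_by_template(side_face_gray, ear_template, skin_bbox, ratio=0.15):
    """Resize the average-ear template relative to the skin bounding box,
    compute normalized cross-correlation, and return the best-scoring position."""
    x, y, w, h = skin_bbox
    tw = max(8, int(w * ratio))
    th = max(8, int(tw * ear_template.shape[0] / ear_template.shape[1]))
    template = cv2.resize(ear_template, (tw, th))
    region = side_face_gray[y:y+h, x:x+w]
    # TM_CCOEFF_NORMED scores lie in [-1, 1]; values close to 1 suggest an ear
    scores = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    top_left = (x + max_loc[0], y + max_loc[1])
    return top_left, (tw, th), max_val
```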
The work [82] uses mathematical morphology for automated ear segmentation.
Opening top hat transformation, which is defined as the difference between the input
image and its morphological opening, detects bright pixels on a surrounding dark area and suppresses dark and smooth areas. As a matter of fact, the ear is
usually surrounded by darker pixels (hair) and smooth areas (cheek), and it contains
dark regions whose effect is to further emphasize the edges after the transformation.
The filtered image is thresholded, and 8-connected neighborhood is used to group the
pixels in connected regions and to label them. Smaller and bigger components (with
area less or greater than a threshold) are discarded, and the geometric properties of
the remaining ones are analyzed. The system is tested on three datasets: the profile
visible part of the database collected at the University of Notre Dame (UND) for
the purpose of face and ear recognition, the CITeR face database, collected at West
Virginia University to study the performance of face recognition, and the database of
the Automated 3-D Ear Identification project, collected at West Virginia University
and containing 318 subjects. The first set contains 100 profile images with different
background and variable illumination conditions, the second set contains 150 images
from 50 video frames for 50 subjects with different views and variable heights, and
the third set contains 3500 images from 226 video frames for 226 subjects with
different views. This is one of the most extensive testbeds reported in literature. The
method achieves 89 % correct segmentation on the first set, 92 % on the second one,
and 92.51 % on the third one. The authors further devise an automated technique to
assess the segmentation result and avoid passing a wrong region to the next recognition
steps. They define five subclasses for the ear segmentation outcome, based on the
viewing angle and on the quality of segments. A proper segment is a rectangular area
containing a correctly segmented ear. An improper segment is a rectangular area
that either contains only part of an ear or does not contain an ear at all. Two sub-
classes of proper segments include areas under viewing angles between [−10◦ , 40◦ ]
and [41◦ , 75◦ ], two sub-classes of improper segments contain part of the ear in
areas under viewing angles between [−10◦ , 40◦ ] and [41◦ , 75◦ ], and one subclass of
improper segments does not contain an ear. An Automated Segmentation Evaluator
(ASE) is implemented using low computational-cost, appearance-based features, and
a Bayesian classifier to identify the subclass for a given segmentation.
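The opening top-hat step at the core of this method can be sketched as below; the structuring-element size, the Otsu thresholding, and the area bounds are illustrative assumptions rather than the values used in [82].

```python
import cv2

def tophat_ear_candidates(gray, min_area=500, max_area=20000):
    """Opening top-hat makes bright, textured regions (the ear) pop out against
    darker hair and smooth cheek; thresholding and 8-connected labeling then
    yield candidate regions filtered by area."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (31, 31))
    # Opening top-hat: input minus its morphological opening
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)
    _, binary = cv2.threshold(tophat, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    candidates = []
    for i in range(1, n):                      # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if min_area <= area <= max_area:
            x, y, w, h = stats[i, :4]
            candidates.append((int(x), int(y), int(w), int(h)))
    return candidates
```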

6.2.2.2 Skin Color and Contour Information

As already noticed, ear contour information is extremely rich, even if possibly influ-
enced by illumination and pose. A number of proposed detection techniques are
based on this.
The work reported in [9] considers the helix, with its characteristic pattern of
two curves moving parallel to each other. Ear edges are extracted using Canny edge
detector. Afterwards, they are thinned and junctions are removed, so that each edge
has only two end points. Edges are divided into concave and convex ones. In this
process, curves may become segmented into multiple edges that must be reconnected. The
outer helix curve is identified by searching, for each candidate, another curve running
parallel to it in the direction of concave side, at a distance from it characterized by a
standard deviation below a given threshold. Even after this step, there may be several candidate curves representing the helix. In order to find the right one, perpendicular lines
along each curve are drawn. If these lines intersect any other curve from convex side,
the inner curve cannot be the outer helix and hence is rejected. Among the remaining
curves, the longest one is selected. Once the outer helix has been identified, its end
points are joined by a straight line to complete ear detection.
The work [76] presents a number of differences with respect to the above men-
tioned one. While Canny is used again for edge detection, when a junction is detected
the edge is simply split in different branches. Spurious (short) edges are discarded,
and the remaining ones are approximated by line segments. All the ear edges present
some curvature, therefore their representation would need more line segments, and
so more than two points. On the other hand, linear or almost linear edges can be
represented by only two points, and since they cannot be a part of the ear they can be
removed. An edge connectivity graph is created, where edges are vertices (points)
and two vertices are connected if the convex hulls of the corresponding edges inter-
sect. Ear localization takes into account the connected components in the graph. A
core observation is that most ear edges are convex, and in general outer edges contain
inner ones. As a consequence, convex hulls of outer edges must intersect those of
inner edges. This implies that their corresponding vertices are connected in the graph.
After computing the connected components for the graph created after the edge map,
the bounding box of the edges corresponding to the largest connected component is
claimed to be the ear boundary. The accuracy reaches 94.01 %, and falls to 90.52 %
when moving away from frontal view.
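The edge connectivity graph idea can be sketched as follows, assuming the Canny edges have already been extracted as lists of (x, y) points; shapely is used here only as a convenient way to test convex-hull intersection, and the union-find bookkeeping is one possible way to obtain the connected components.

```python
import numpy as np
from shapely.geometry import MultiPoint

def ear_bbox_from_edges(edge_point_lists):
    """Each edge (a list of (x, y) points) becomes a graph vertex; two vertices
    are linked when the convex hulls of their edges intersect; the bounding box
    of the largest connected component is taken as the ear region."""
    hulls = [MultiPoint(pts).convex_hull for pts in edge_point_lists]
    parent = list(range(len(hulls)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(hulls)):
        for j in range(i + 1, len(hulls)):
            if hulls[i].intersects(hulls[j]):
                parent[find(i)] = find(j)          # merge the two components

    components = {}
    for i in range(len(hulls)):
        components.setdefault(find(i), []).append(i)
    largest = max(components.values(), key=len)
    pts = np.vstack([np.asarray(edge_point_lists[i]) for i in largest])
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return int(x0), int(y0), int(x1), int(y1)
```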
In [32] the authors use the image ray transform, based upon an analogy to light
rays. This particular transform can highlight tubular structures. In particular, one is
interested in detecting the helix of the ear, and in possibly distinguishing it from
spectacle frames. After Gaussian smoothing to reduce noise, and thresholding to
produce an image with a strong helix, a simple matching with an elliptical template is
used, across different rotations and scales; finally, the matched section is normalized
and extracted.
The work presented in [96] exploits Jet space similarity, i.e. similarity of Gabor
Jets. The original definition of Jet dates to [59], where they are used in the context
of an object recognition application based on the Dynamic Link Architecture. The
latter is a kind of hierarchical clustering into higher order entities of the neurons of
a Neural Network. In this approach, the image domain I contains a two-dimensional
array of nodes A_x^I = {(x, α) | α = 1, . . . , F}. Each node at position x consists of F different feature detector neurons (x, α), where the label α denotes different feature types, from simple local light intensities, to more complex types derived for example by some filter operation. The image domain I is coupled to a light sensor array, given that the same model may apply to the eye or to a camera. When such array is put on an input, this leads to a specific activation s_{xα}^I of the feature neurons (x, α) in the image domain I. After this, each node A_x^I contains a set of activity signals J_x^I = {s_{xα}^I | α = 1, . . . , F}; J_x^I is the feature vector named “jet”. Given this structure,
images are represented as attributed graphs. Their vertices are the nodes. Attributes
attached to the vertices are the activity vectors of local feature detectors, i.e. the
“jets.” The links are the connections between feature detector neurons. In [96] a
number of kernel functions determining Gabor filters with orientation selectivity are
convoluted with the image. This produces features vectors defined as Gabor Jets.
Gabor Jet function is used as a visual feature of the gray scale image I(v) at each
image point v. The presented approach introduces an “ear graph” whose vertices are
labeled by the Gabor Jets at the body of the antihelix, superior antihelix crus, and
inferior antihelix crus. These Jets are stored as ear graphs in the gallery, and PCA
is used to obtain the eigenear graph; an ear is detected using the similarity between
sampled Jets and those reconstructed by the probe.
In [57], exact detection relies on two steps: identification of ear ROI, and contour
extraction from ROI. The identification of ear ROI in turn consists of three steps:
first, skin detection is performed using Gaussian classifiers, then edge detection
exploits LoG (Laplacian of Gaussian), and finally labeling of the edges uses intensity
clusters. Afterwards, contours extraction is performed by a localized region-based

Fig. 6.6 Haar features used in [49]

active contours model, which is robust against initial curve placement and image
noise. Features are extracted by SIFT and by log-Gabor for comparison. On a database
of 100 users, SIFT features are superior as they achieve Genuine Authentication Rate
(GAR) = 95 %, at False Authentication Rate (FAR) = 0.1, while log-Gabor features
achieve GAR = 85 % at the same FAR. The significant limit of this work is that, to
capture the ear images, volunteers are requested to keep their ear in the central hole
of a wooden stand. Therefore this system is not suited for most real world settings.
The work in [58] exploits morphological operators for ear detection. Ear images
undergo smoothing with a Gaussian filter, which helps suppress noise; this filtering
is followed by histogram equalization. Ear shape is extracted by a series of gray scale
morphological operations. The obtained gray scale image is binarized to extract the
ear shape boundaries. This image is combined with the silhouette obtained by a
thresholding operation, in order to obtain a mask able to eliminate all the skin region
around the ear. Fourier descriptors are used to smooth the obtained boundary.
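Fourier-descriptor smoothing of a closed boundary, as used in the last step above, can be sketched as follows; the number of retained harmonics is an illustrative choice, not the value adopted in [58].

```python
import numpy as np

def smooth_boundary_fourier(contour_xy, keep=20):
    """Treat the closed boundary as a complex signal x + iy, zero the
    high-frequency Fourier coefficients, and invert the FFT to obtain a
    smoothed ear outline."""
    pts = np.asarray(contour_xy, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]          # complex representation of the boundary
    Z = np.fft.fft(z)
    mask = np.zeros_like(Z)
    mask[:keep + 1] = 1                     # low positive frequencies (and DC)
    mask[-keep:] = 1                        # low negative frequencies
    z_smooth = np.fft.ifft(Z * mask)
    return np.stack([z_smooth.real, z_smooth.imag], axis=1)
```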

6.2.2.3 AdaBoost

The AdaBoost algorithm for machine learning, as well as its variations and related methodologies, has proven very useful and powerful in addressing many practical problems. In particular, it gives good results in real time, and is able to address
poor quality settings. After its adoption for face recognition in [91], its use has been
extended to many other kinds of objects.
The work presented in [49] is among the first systematic attempts to devise a
specific instantiation of the general AdaBoost approach, specialized for ear detection.
As in the original algorithm, Haar-like rectangular features representing the grey-
level differences are used as the weak classifiers, and are computed as in [91] by using
the Integral Image representation of the input image. AdaBoost is used to select the
best weak classifiers and to combine them into a strong one. The final detector is
made up by a cascade of classifiers. The differences between the structure of face
and ear, and in particular the curves of the helix and of the anti-helix and the ear pit,
suggest using the eight types of features shown in Fig. 6.6.
Features with two rectangles detect horizontal and vertical edges, as usual;
features with three or four rectangles detect different types of lines and curves;

Fig. 6.7 Haar features used in [119]

the centre-surround feature (the second from left in the second row in Fig. 6.6) detects
the ear pit. In order to address the same problem, a different set of features is used
in [119], as shown in Fig. 6.7, and an 18-layer cascaded classifier is trained to detect
ears. The number of features in each layer is variable.
One of the drawbacks of AdaBoost is the time required for training. In [3] a
modification of the Viola-Jones method proposed in [99] is exploited for the ear,
reducing the complexity of the training phase.
The ear detection algorithm proposed in [87] attempts to exploit the strengths
of both contour-based techniques and AdaBoost. Ear candidate extraction adopts
the edge-based technique that the authors call the arc-masking method, together
with multilayer mosaic processing, and an AdaBoost polling method for candidate
verification only within the pre-selected regions.
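At inference time, all these boosted-cascade detectors follow essentially the same multi-scale scanning scheme, which OpenCV exposes directly. The sketch below assumes a cascade already trained on ear samples: the XML file name is hypothetical, since no ear cascade ships with OpenCV, and the detection parameters are illustrative.

```python
import cv2

# Hypothetical cascade file: it would be produced by training a boosted cascade
# of Haar-like features (e.g., with opencv_traincascade) on ear / non-ear samples.
ear_cascade = cv2.CascadeClassifier("cascade_ear.xml")

def detect_ears(gray_profile_image):
    """Multi-scale detection with a boosted cascade of Haar-like features,
    in the spirit of the Viola-Jones-style detectors adapted to ears."""
    return ear_cascade.detectMultiScale(
        gray_profile_image,
        scaleFactor=1.1,      # image pyramid step
        minNeighbors=5,       # overlapping detections required to accept
        minSize=(24, 40),     # ears are taller than wide
    )
```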

6.2.2.4 Detection in 3D

To the best of our knowledge, all the attempts found in the literature to capture a 3D
model of the ear are related to its detection from a range image of the side face. In
[27] ears are extracted exploiting a template matching-based detection method. The
model template is built by manually extracting ear regions from training images and
is represented by an average histogram (obtained from the histograms computed for
the single training images) of the shape index of this ear region. A shape index is a
quantitative measure of the shape of a surface at a point p, and is defined by (6.2):

S(p) = 1/2 − (1/π) tan⁻¹[(k1(p) + k2(p)) / (k1(p) − k2(p))] (6.2)

where k1 and k2 are the maximum and minimum principal curvatures. For the sake of compactness, the approach uses the distribution (histogram) of shape indexes as a robust and more compact descriptor than the 2D shape index image. During detection, the
obtained template must be appropriately matched with the test range image of a
side face. To this aim, matching is preceded by preliminary processing. Step edge
detection identifies regions with higher variation in depth, which usually correspond to the boundary of the pinna. Thresholding is applied to get a binary image, and dilation fills gaps. The analysis of connected components determines the best candidate ear region, which is then matched to the template.
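A possible sketch of the shape-index histogram descriptor of (6.2) is given below, assuming the range image is available as a depth array and estimating the principal curvatures from its first and second derivatives; this is a simplified reading of the approach in [27], not its actual implementation, and the bin count is an illustrative choice.

```python
import numpy as np

def shape_index_histogram(depth, bins=32):
    """Estimate principal curvatures from a range image, compute the per-pixel
    shape index of Eq. (6.2), and return its histogram as a compact descriptor."""
    zy, zx = np.gradient(depth.astype(float))
    zyy, _ = np.gradient(zy)
    zxy, zxx = np.gradient(zx)
    denom = 1.0 + zx**2 + zy**2
    K = (zxx * zyy - zxy**2) / denom**2                       # Gaussian curvature
    H = ((1 + zy**2) * zxx - 2 * zx * zy * zxy
         + (1 + zx**2) * zyy) / (2 * denom**1.5)              # mean curvature
    root = np.sqrt(np.maximum(H**2 - K, 0.0))
    k1, k2 = H + root, H - root                               # principal curvatures
    # Eq. (6.2); arctan2 avoids division by zero on planar regions
    S = 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
    hist, _ = np.histogram(S, bins=bins, range=(0.0, 1.0), density=True)
    return hist
```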

According to the authors themselves, this method cannot always identify the
ear region accurately, therefore in [28] they propose a variation. The exploited ear
template is the main difference with the preceding work. Ear helix and anti-helix parts
are extracted by running a step edge detector with different thresholds, choosing the
best result and thinning the edges, and finally performing a connected component
labeling. After this, step edge detection and binarization are performed on the test
image as above. The rest of the methodology is very similar, except for edge clustering
in place of the analysis of connected components, and for the procedure for template
matching.
The work in [127] exploits 3D local shape cues and defines new concise as well
as discriminative features for 3D object recognition. Even in this case, such fea-
tures are applied to the task of ear detection in range images. Again, the aim is to
find a set of features, which are robust to noise and pose variations and at the same
time can adequately characterize 3D human ear shape and capture its discriminative
information. The used method for feature encoding is inspired by the Histograms of
Oriented Gradients (HOG) descriptor. To locate the ear in a range image, the image
is scanned with a fixed-size window. The presented feature is named Histograms of
Categorized Shapes (HCS), and is extracted in each window position. To build HCS
descriptors, each pixel in the image is assigned a shape index based on its shape
category (e.g., spherical cup, saddle, etc.). The shape index of a rigid object is inde-
pendent of its position and orientation in space, as well as of its scale. Moreover,
each pixel is also assigned a magnitude, based on its curvedness values. HCS fea-
ture vector is extracted and used to train a binary ear/non-ear SVM classifier. Such
classifier determines whether the current window contains an ear (positive sample)
or non-ear (negative sample). To allow ear detection at different scales, the image is
iteratively resized using a scale factor, and the detection window is applied on each
of the resized images. Multiple detections in overlapping windows are fused.
Ear detection in 3D is addressed in [78] with an approach following the line in [75]
(already presented above). It starts from the consideration that in a 3D profile face
range image, the ear is the only region containing maximum depth discontinuities, and is
therefore rich in edges. Moreover edges belonging to an ear are curved in nature. A 3D
range image of a profile face is first converted into a depth map image. The depth map
image from a range image contains the depth value of a 3D point as its pixel intensity.
The obtained depth map image undergoes the edge computation step using Canny
edge operator. A list of all the detected edges is obtained by connecting the edge
pixels together into a list, and when an edge junction is found, the list is closed and a
separate list is created for each of the new branches. An edge length-based criterion
is applied to discard edges originated by noise. Edges are then approximated by line
segments. Even in the case of the depth map image, an edge connectivity graph is
created, with edges being vertices which are connected if the convex hulls of the
corresponding edges intersect. Tests show that the method is scale and rotation invariant without requiring any extra training. The technique achieves 99.06 %
correct 3D ear detection rate for 1604 images of UND-J2 database, and is also tested
with rotated, occluded and noisy images.

6.2.3 More Preprocessing

After a correct detection, it is often the case that the ear must be transformed into a
canonical form, which should allow extracting templates that can be matched despite
different sizes and rotations. Related techniques might be exploited during enrolling,
in order to store each template in its new form, and in this case the same procedure
must be applied to the probe image. As an alternative, registration and alignment
might be performed separately for each match. This topic is addressed in a few works in the literature.
In [8] the contour of an ear in an image is fitted by applying a combination of a
snake technique and an ovoid model. The main limitation of this approach is that a sketch
of the ear must be manually drawn to start the procedure, so that it cannot be used
in real-time, online, massive applications. After manual sketching, a snake model
is adapted to ear contour characteristics to improve the manual sketch. Finally, the
parameters of a novel ovoid model are estimated, which better approximates the ear. The estimation of ovoid parameters achieves a twofold goal: it allows two ears to be roughly compared using the ovoid parameters, and it allows two ear contours to be aligned in order to compare them irrespective of different ear location and size.
Active Shape Model (ASM) is proposed in [117] for ear normalization. An offline
training step exploits ASM to create the Point Distributed Model. To this aim, the
landmark points on the ear images of the training set are used. The training samples are
appropriately aligned by scaling, rotation and translation, so that the corresponding
points on different samples can be compared. The model is then applied on the test
ear images to identify the outer ear contour. Finally, the ear image is normalized to
standard size and direction according to the long axis (the line crossing through the
two points which have the longest distance on the ear contour) of outer ear contour.
After normalization, the long axes of different ear images will have the same length
and same direction.
The ear registration technique proposed in [20] is based on an algorithm that
attempts to create a homography transform between a gallery image and a probe
image using matching of feature points found by applying SIFT. Figure 6.8 shows
an example of correspondences that can be found by SIFT.
As a result of the original algorithm in [17], the probe includes an image of the
object to detect, if an homography can be created. Furthermore, the homography
defines the registration parameters between the gallery and the probe. The work in
[20] extends the original technique with an image distance algorithm, which requires gallery ears to be segmented using a mask. These masks are semiautomatically created as
a preprocessing step when populating the gallery. The resulting ear recognition tech-
nique is robust to location, scale, pose, background clutter and occlusion, and appears
as a relevant step towards the accuracy of 3D ear recognition using unconstrained
2D data.
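The SIFT-plus-homography registration idea can be sketched as follows, using OpenCV's SIFT implementation and RANSAC homography estimation; the ratio-test and reprojection thresholds are illustrative, and the mask-based image distance of [20] is omitted.

```python
import cv2
import numpy as np

def register_probe_to_gallery(gallery_gray, probe_gray, min_matches=4):
    """Match SIFT keypoints with Lowe's ratio test, then estimate with RANSAC
    the homography that aligns the probe ear with the gallery ear."""
    sift = cv2.SIFT_create()
    kg, dg = sift.detectAndCompute(gallery_gray, None)
    kp, dp = sift.detectAndCompute(probe_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(dg, dp, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < min_matches:
        return None                        # no plausible registration found
    src = np.float32([kg[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```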
3D point clouds are exploited in [42] to implement a method based on projection
density, used to obtain the normalization of 3D ear coordinate direction. The method
stems from the observation that there are specific projective directions in a 3D model,

Fig. 6.8 Examples of point correspondence computed by SIFT

along which points projected onto the low dimensional spaces are distributed most
sparsely. Optimal projection tracking is used to retrieve such directions and to revise
the posture of 3D surface until all the projective points are distributed as sparsely
as possible on the projection plane. Then PCA is used to obtain the principal axes
of the projective points and adjust the axes to vertical direction. Among the claimed
advantages of this method, compared with traditional ones, the authors mention that
it is not necessary to organize the 3D model into a mesh grid, since a point cloud can be used; moreover it is robust to noise and holes, and it generates a relatively uniform
posture, which speeds up registration by, say, ICP.

6.3 Recognition of 2D Images

We refer to the rough classification of 2D methods sketched in Sect. 6.1.3, with a more detailed characterization of techniques. Geometrical approaches imply often
demanding processing to identify, with an acceptable reliability, some characteristic
points or curves in the ear, able to drive the following feature extraction and recogni-
tion. Global methods aim at avoiding this preliminary step, by extracting the overall
characterization of the ear. However, due to the rich spatial structure, this may lead to
high dimensional feature vectors, so that dimensionality reduction techniques may
be required, as well as different alternatives able to support a lighter computation. It
should be noted that most methods are hybrid, in the sense that they exploit the cooperation among several different techniques. When this is the case, we will try to consider the main approach. We will return to already presented works only when further detail
is worth mentioning.

6.3.1 Geometrical Approaches

As reported in Sect. 6.1.3, one of the first works following Iannarelli’s pioneering
research was presented by Burge and Burger in [18]. In that first system implemented
to assess the viability of ear biometrics, the authors first experiment two different
techniques for ear location. In the first one, generalized Hough transform is used
to find the characteristic shapes of helix rim and lobule, resulting in two curves
loosely bounding the ear. The second method uses deformable templates of the
outer and inner ear according to an active contour approach. The methodology is
refined in [19], where the ear is located by using deformable contours on a Gaussian
pyramid representation of the gradient image. Within the located region, edges are
computed by Canny operator, and then edge relaxation is used to form larger curve
segments. Since differences in illumination and positioning would undermine this
method, the authors improve it by describing the relations between the curves in
a way invariant to affine transformations and to small shape changes caused by
different illumination. The chosen description is based on the neighborhood relation,
represented by a Voronoi neighborhood graph of the curves. The matching process
searches for subgraph isomorphisms also considering possibly broken curves. No
experimental results have ever been reported regarding this technique.
In [86], after edge detection and appropriate pruning, the extracted features are all
angles. The angles are divided into two vectors. The first vector contains angles cor-
responding to edges in the outer shape of the ear, and the second vector is composed
examining all other edges. The approach is based on the definition of max-line and
normal lines. Max-line is defined as the longest line with both endpoints on the edges
of the ear. If edges are correctly identified and connected, the max-line has both its
end points on the outer edge of the image. Features corresponding to each possibly
detected max-line are to be extracted. Normal lines are those which are perpendicular
to the max-line and whose intersections divide the max-line into (n + 1) equal parts,
with n a positive integer.
Given the points where the outer edge intersects the normal lines, each angle
stored in the first vector is formed by the segment connecting one such point with
the center of the max-line, and the segment connecting the upper end point of the
max-line and its center. The second feature vector is calculated similarly, and all the
points where all the edges of the ear intersect the normal lines are considered, except
for those already used for the first feature vector, i.e. those belonging to the outer
curve. Figure 6.9 gives an example of line pattern and of the two classes of angles.
The distance between two samples is computed in a hierarchical way, in two steps:
the first vector is used first, and the second one later. In the first step, two distance
measures are computed. The first one is given by the sum of absolute difference
between corresponding vector elements, while the second one counts the number
of similar elements, i.e. those whose difference is under a threshold. In order to
pass the first check, both measures must comply with the respective thresholds. In
the second step, two points are considered to match if their difference is under a
threshold and they also belong to corresponding normal lines. However, in this case

Fig. 6.9 a Example of max-line and normal lines; b example angle in the first vector; c example
angle in the second vector

the vectors may have a different length, so that the final number of matching points
must be appropriately normalized as a percentage of the lower dimension between
the two vectors. Two images are said to match if they match with respect to the
first feature vector and the last percentage of matching points is greater than some
threshold. During an identification operation (1:N matching) a given query image is
first tested against all the images in the gallery using the first feature vector. Only
the images that are matched in this first stage are considered for the second stage
of comparison. Such division of the identification task into two stages significantly
reduces the time required. Therefore, this approach is especially appropriate for
massive applications, and is also scale and rotation invariant. However, it may suffer
for pose and illumination. The reported Equal Error Rate (EER) is 5 %. After the first
stage of classification the search space is reduced by 80–98 %. Due to the lack of details
on the test database, it is not possible to fully appreciate such performance.
Active Shape Model (ASM) technique is applied to ears in [63]. The aim is to
model the shape and local appearances of the ear in a statistical form. Steerable
features are also extracted from the ear image before applying the procedure. The
magnitude and phase of such features are able to encode rich discriminant infor-
mation regarding the local structural texture, and to better support shape location,
though entailing a computational cost lower than the popular Gabor filters applied
at different scales. Eigenearshapes are used for final classification. The dataset used
for experiments contains 10 images of size 640 by 480 from each of 56 individuals.
In more detail, 5 images are for the left ear and 5 for the right ear, corresponding
to 5 different poses, with a difference of 5◦ between adjacent poses. Both left and
right ear images are used and the work shows that, as it might be expected, their
fusion outperforms single ears (95.1 % recognition rate for double ears). This can therefore also be considered as an example of a multibiometric system, where multiple instances of the same trait are used.

Fig. 6.10 Identification of point p0 for each contour

Choraś and Choraś [31] introduce a number of different methods based on the
processing of ear geometric features after normalization:
• Concentric circles based method—CCM: the centroid of (binary) ear contour is
assumed as the (0, 0) origin of a system of coordinates and becomes the center of
the concentric circles ci of increasing radius ri . Radial partitions are also identified
using the circle circumscribed to the contour image. Features rely on suitably
computed points of intersection of ear contours with the created concentric circles
• Contour tracing method—CTM: for each contour line in the binary ear image a
number of characteristic points are considered, including contour ending points,
contour bifurcations, points of contour intersection with the concentric circles
identified by the previous method
• Angle based contour representation method—ABM: each extracted contour is
considered independently and is represented by two sets of angles, corresponding
to the angles between the vectors centered in a point p0 , that is characteristic for
each contour; this point is the intersection between the perpendicular to the segment
identified by the contour ending points in its central point, and the contour itself
(see Fig. 6.10); the point p0 becomes the center of a system of polar coordinates
as well as the center of a set of concentric circles. The intersections of the contour
with such circles are used to build the contour representation
• Geometric Parameters Method—GPM:
– Triangle ratio method—TRM: finds the maximal chord of the longest contour
and the intersection points of such contour with the longest line perpendicular
to the maximal chord
– Shape ratio method—SRM (linear vs. circular contours): the shape ratio for
each contour is computed between the contour length, and the length of the line
connecting the ending points of the contour
– Contour Complexity—CC: the number of intersections between each maximal
chord and corresponding contours.
The best results are achieved by the GPM method. However, the work lacks details about the composition of the exploited dataset. Moreover, it would be
interesting to try a combination of the proposed features.

The use of active contours is found again in [52], where the ear is used for identity
tracking through images from a live camera. All works along the same line rely
on the basic concepts which characterize this methodology. It implies defining a
model whose internal forces prevent arbitrary shape deformations, while external
forces attract and deform the shape segments to reach the best fit with the edges on
an underlying image. Internal forces are defined as constraints on the characteristic
parameters of the curve, while direction and intensity of the external forces depend
on the pixels of the edges that are detected on the tracked image. The curve seg-
ments tend to an equilibrium position where internal and external forces balance each
other. The shape in motion is tracked by the deforming model. In the cited work,
after the edges have been detected and made more evident through morphological
operators, the active contours technique is applied to superimpose the lines of a stan-
dard ear model to them. This process is performed in two steps: the first one tries
to match the outer contour of the ear model to the bordering edge of the ear; the
second one executes a more precise matching of the inner contour lines afterwards.
The authors consider the first step as a refinement of the localization of the ear, used
to normalize the rest of the image. In this process, pixels on a segment representing
an edge exert a greater attraction force on a line segment of the model if they are
parallel. Once an active contour (model) has fitted the probe ear image, the last step
is to analyze this modified model and collect features from it. Two sets of features
are defined. The first one is derived from the distortion of the modified model with
respect to the original one, the second is derived using three appropriate axes of
measurement and 21 points selected appropriately on the model, each of which is
associated with one of the axes. A feature is given by the distance measured between
the projection of one such point on its associated axis, and the intersection of the
three axes. The proposed method is tested on images chosen from video frames pre-
senting twenty-eight different people. It is interesting to notice that the view angle
of the ears varied up to about 50◦ , a variation which is quite realistic, for instance, in
videosurveillance applications. Therefore the method is promising for real settings,
given that the reported EER is below 10 % (7.6 %), although it is not usable as a
single authentication measure in a high security setting.
The authors of [10] propose one of the first model-based approaches to recogni-
tion. The claim is that, being an abstract form of the corresponding object, a model
capitalizes on the specific structures and thus naturally discards unnecessary detail. If
well conceived, it can be robust to noise and occlusion and also potentially viewpoint-
invariant. Moreover, the authors choose to work with 2D images, since they seem
much more realistically appropriate to real world applications. The model includes
a number of ear parts, each having its specific typical appearance. Each ear image
is represented by a set of features. SIFT is used to automatically extract poten-
tial interest points (keypoints) in images, which describe neighborhoods of pixels.
During a training phase, these features are clustered across a training dataset to denote
the common ones. Model learning requires detecting the clusters and expressing them
by suitable statistical properties. SIFT produces features which are naturally suffi-
ciently invariant to illumination and viewpoint, and here they are also normalized with
respect to scale and rotation. When a probe image is submitted, the ear is located by
approximating the relevant region by an ellipse. An initial feature vector is extracted as the set of keypoints that are detected using SIFT. At this point, the model acts as
a mask, since only keypoints common with the model are used for recognition, for
which the mean of Euclidean distances between corresponding keypoints is used.
The approach is tested on 63 subjects from XM2VTS database [64], choosing images
without hair occlusion, and then also occluding them with a rectangle from the top.
However, it is to say that a sharp occlusion like a synthetic shape is easier to detect
than a natural one, say unevenly diffused hair or a multicolor scarf. The worst result
in the tests is obtained with automatic ear registration (in contrast to manual one) and
with such occlusion, and achieves a recognition rate of 80.4 %. Therefore, it seems
appropriate for settings with medium acquisition control.
SIFT methodology and its modifications are often found in the literature regarding ear recognition. For instance, they are adopted in [21, 122]. The first approach
has been already partially described in Sect. 6.2.3 in relation to [20]. As part of the
SIFT detection process, interest points are searched across locations and scales. When
an interest point is detected, its canonical orientation is calculated. By comparing
these values between the probe and the gallery, each point can be used to calculate an
approximate affine transform between the two images. However, a serial procedure
is used to avoid false positives. To this aim, the potential space of affine transforms is
subdivided into two dimensions for position, one for logarithm of the scale, and one
for rotation. Each dimension is further divided into bins: eight for scale and rotation
and one for every 128 pixels in width and height. Each point match is placed in the
appropriate bin and also in its closest neighbors (16 bin entries per point). The choice
of bins and this multiple classification of a point match should ensure robustness to
pose variations. Only points contained in bins attaining four or more point matches
pass to the next stage. A RANdom SAmple Consensus (RANSAC) algorithm is used.
Random sets of four points are selected from the list of point correspondences and
a homography is calculated. The homography that matches the most points within
1 % of the ear mask size, is selected as the best match. Gallery images that have
four affine matching feature points are passed to the distance measure. This process
greatly reduces false positives. The distance between gallery images and probe is the
sum of the squared pixel errors after a suitable normalization. The distance measure
is made robust to occlusion by thresholding the error. In practice, pixels that differ
by more than half the maximum brightness variation are considered to be occluded
and therefore are not included in the distance computation. The approach is tested on
subsets of XM2VTS, some of which were processed to test the effect of occlusion,
background clutter, resolution, noise, contrast, and brightness. The results highlight
a good robustness to background clutter, occlusion up to 18 %, and over ±13◦ of
pose variation. Due to the interesting yet highly detailed discussion, we address the
interested readers to the original work.
The ear recognition approach in [122] uses the SIFT descriptor with global context
(SIFT+GC) and some projective invariants to obtain ear features. The addition of
context to compute matching points is deemed useful due to the presence of multiple
local ear regions with similar texture. Recognition features are constructed using
the number of the matching points, and by five projective invariants, which are
obtained by computing the cross ratios of five collinear points on the longest ear
axis. Experiments are carried out on the USTB II ear database. We remind that it
contains 308 images from 77 subjects (4 each). For each subject, the first image
is captured under uniform illumination condition and side pose, the other ones are
captured under various poses or illumination conditions. The result using the first
image as gallery and all the other three as probes is 91.34 % accuracy.
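For reference, the cross ratio underlying these projective invariants can be computed as in the sketch below; the coordinates and the choice of 4-point subsets of the five collinear points are purely illustrative and do not necessarily reproduce the selections made in [122].

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross ratio of four collinear points, a projective invariant."""
    def d(a, b):
        return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    return (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))

# With five collinear points on the longest ear axis, different 4-point subsets
# yield a handful of invariant features usable for matching (illustrative points).
points = [(0, 0), (0, 12), (0, 30), (0, 41), (0, 55)]
invariants = [cross_ratio(*[points[i] for i in subset])
              for subset in [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4),
                             (0, 2, 3, 4), (1, 2, 3, 4)]]
```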
SIFT approach is also used in [53]. In order to make it more robust, SIFT feature
extraction is only performed in regions having color probabilities in certain ranges.
The color model for ear skin is formed by Gaussian Mixture Model (GMM) and
clustering of the ear color pattern using vector quantization. K-L divergence is applied
to the GMM framework for recording the color similarity in the specified ranges.
After segmentation of ear images in color slice regions, SIFT keypoints are extracted
from single regions (slices) and an augmented vector of extracted SIFT features is
created for matching. Since the number of points is not fixed, the SIFT feature points
are varying in each slice region. After their extraction, they are gathered together
by concatenation to form an augmented group for each reference model and probe
model. The presented results come from tests on the IITK Ear database.
The version of the database used here consists of 800 ear images from 400 indi-
viduals taken in a controlled environment; the frontal view images are considered, so
that the ear viewpoints are consistently kept neutral; the ear images are downscaled
to 237 × 125 pixels with 500 dpi resolution.
The experiments entail two sessions. In the first session, the ear verification is
performed with SIFT features without color segmentation; in the second session,
verification is performed with the SIFT keypoint features detected from segmented
slice regions. When nearest neighbor is used for verification with these slice regions,
the system achieves the best performance of 96.93 % recognition accuracy. An evo-
lution of the method presented in [55] uses the Dempster-Shafer decision theory to
merge the detected keypoint features from individual slices, by combining the evi-
dences obtained from different sources in order to compute the final distance. The
performance increases to 98.25 %.
An improvement of the technique in [96] is proposed in [97]. It is based on Gabor
Jet similarity and has been already presented in Sect. 6.2.2.2. A higher robustness
to rotations in depth allows a good recognition accuracy. Data training not only
exploits the Gabor Jets from the registration image, but also the estimated Gabor Jets
of different poses obtained through linear jet transformation. Similarity is determined
through the correlation between the Gabor Jet of the registered images and the input
image.
The authors of [41] argue that the use of monolithic neural networks presents
slow learning and other limitations that can be addressed by devising a modular
neural network (MNN). They divide the ear images from USTB II into three parts
containing relevant ear regions: the helix, the concha and the lobule. Each subimage
is decomposed by wavelet transform and then fed into a modular neural network.
Such network is made of three higher-level modules, and each of them includes
three lower-level modules, one for each ear region. The difference among higher
level modules is that they are trained with different subjects. The authors test Sugeno
measures and Winner-Takes-All for integrating the lower-level modules, while for
higher level ones integration of decisions exploits Gating Network. The learning
function is chosen between scaled conjugate gradient (SCG), or gradient descent
with momentum and adaptive learning rate (GDX). Depending on the combination
between integrator and learning function, the results vary between 88.4 and 97.47 %
rank-1 performance on the USTB II database. The highest rank-1 performance is
achieved with Sugeno measure and conjugate gradient.

6.3.2 Global Approaches

The approaches described in this section are defined as “global” in the sense that they
do not consider specific points of the ear or specific parts of its contours, but rather
assign a potentially identical role to every pixel in the image. The first and most cited
one is based on force fields, and has already been presented in Sect. 6.1.3. We will
present further such approaches, attempting a classification based on the main
underlying concept.

6.3.2.1 Algorithms Based on Space Dimensionality Reduction

PCA is a very popular technique for feature space dimensionality reduction, even
though, when applied to biometric recognition, it presents a number of limitations due
to sensitivity to pose, illumination and expression (PIE) modifications. Nonetheless,
many variations to the basic technique and further strategies have been presented in
literature. As we already reported in Sect. 6.1.3, in [26] PCA is applied to both face
and ear for comparison. The obtained results show that, since the limitations are the
same, performance is also comparable.
In the already cited work in [117], after ear detection and normalization, recogni-
tion relies on a full-space linear discriminant analysis (FSLDA) algorithm. Tests are
performed on the USTB ear database, which is chosen because it also contains ear
images with rotation variations. In principle, FSLDA makes full use of the discrimi-
nant information in both the null space and in the non-null space of the within-class
scatter matrix. Results differ slightly depending on whether the head turns left or right. When the
head turns left, recognition rates at 5, 10, 15 and 20◦ are above 90 %. After 20◦ , the
recognition rate drops dramatically. When the head turns right, the recognition rates
at 5 and 10◦ are above 90 % as well. The recognition rate at 15 and 20◦ is worse, but
still above 80 %, and even in this case drops dramatically after 20◦ .
A kind of hierarchical approach, defined as Compound Structure Classifier System
for Ear Recognition (CSCSER), is adopted in [125]. After an appropriate segmen-
tation, the ear probe is first roughly classified in one of five classes, based on the
height/width ratio. Afterwards Independent Component Analysis (ICA) is used, and
finally a Radial Basis Function Network (RBFN) returns the result. Experiments on
a home-collected image library demonstrate higher performance of ICA with respect
to PCA, especially when the proposed classification structure is adopted. However,
the method heavily depends on the first classification step, which in turn heavily
depends on the assumption of a frontal pose. Therefore, methods of this kind, though
indicating a possible research line, cannot be adopted in under-controlled settings.
In order to address the problem of pose variation, a different space reduction
technique is exploited in [101]. The authors instead start from Locally Linear Embed-
ding (LLE), a nonlinear dimensionality reduction technique deemed better able to find
the intrinsic structure of data points. This is achieved by constructing an embedded
space which preserves their topological relationships, thus transforming the non-
linear problem into a linear one. The LLE algorithm selects the k nearest neighbors of
each input vector, according to Euclidean distance, and then expresses the original
vector as a weighted combination of such neighbors. Finally, new vectors in a space of reduced
dimension are computed using such weights, subject to some constraints. In the men-
tioned work, the authors propose an optimization procedure to compute k. Moreover,
they further improve LLE by applying LDA to the obtained vectors. The approach
is tested on USTB III database. The used dataset contains two groups of 19 images
each, corresponding to 19 different poses, for each of 79 individuals. Head
rotations to the right cover angles of 0◦, 5◦, 10◦, 15◦, 20◦, 25◦, 30◦, 40◦, 45◦
and 60◦, while rotations to the left stop at 45◦; illumination is constant.
While LLE achieves a 40.03 % recognition rate, the proposed improvements
achieve a significantly higher 60.75 %. Notice that the exploited database is one of
the most challenging for pose variations.
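
A rough sketch of this LLE-plus-LDA pipeline, using scikit-learn as a stand-in for the authors' implementation (k, the embedding dimension and the final 1-NN classifier are illustrative choices, not values from [101]):

import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def lle_lda_recognizer(X_train, y_train, X_test, k=12, dim=40):
    """X_*: vectorised ear images (n_samples x n_pixels); y_train: subject labels."""
    lle = LocallyLinearEmbedding(n_neighbors=k, n_components=dim)
    Z_train = lle.fit_transform(X_train)          # nonlinear embedding of the training set
    Z_test = lle.transform(X_test)                # out-of-sample projection of the probes
    lda = LinearDiscriminantAnalysis()            # further improves class separation
    F_train = lda.fit_transform(Z_train, y_train)
    F_test = lda.transform(Z_test)
    return KNeighborsClassifier(n_neighbors=1).fit(F_train, y_train).predict(F_test)
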
The same authors in [102] further observe that, when the distribution of data
points is highly sparse, the choice of neighbors might be unstable. For this reason,
they propose to use a different distance function to select the “nearest” neighbors,
defined in (6.3), obtaining the Improved Locally Linear Embedding (IDLLE):
Hsim(X, Y) = (1/d) · Σ_{i=1}^{d} 1 / (1 + |x_i − y_i|)    (6.3)
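
In code, the similarity of (6.3) is straightforward (a minimal sketch; x and y are feature vectors of dimension d, and the "nearest" neighbors are the samples with the largest Hsim value):

import numpy as np

def hsim(x, y):
    """Similarity of (6.3): mean of 1/(1 + |x_i - y_i|); larger means closer."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.mean(1.0 / (1.0 + np.abs(x - y)))
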
Experiments are performed again on the same dataset from the USTB III database.
They show a better performance of IDLLE with respect to LLE, but a direct comparison
with the preceding results is not possible, since they are disaggregated according
to the different in-depth rotation angles. As expected, the best results are achieved in the
range [−15◦, +25◦], with negative rotations towards the left and positive towards the
right, starting from 0◦ which is the upright frontal position.
A different approach to improve LLE aiming at multi-pose recognition is pro-
posed in [106]. In traditional LLE, the neighbors for a point are chosen according to
k-nearest neighbor algorithm or, as an alternative, according to ε-neighbor algorithm.
In the latter, ε is a fixed radius around a point where neighboring points are selected,
causing different points to possibly have a different number of neighbors. Instead
of computing LLE many times with different values of k to find the optimal one,
the proposed approach uses both algorithms together. In this way each point has a
neighborhood including a set with a fixed number of neighbors, plus a further set
with a variable number of neighbors depending on the computation of a minimal
reconstruction error. Compared with the k-nearest neighbor algorithm, this method
allows different numbers of neighbors for different points, thus avoiding unconnected
areas and misrepresentations caused by clustering in other areas. Compared with the
ε-nearest neighbor algorithm, this method uses an ε-ball with changing radius for
each different point, thus avoiding an excessive difference in the number of neighbors
for different points. Results of tests performed on USTB III improve from an accuracy
of 66.17 % with LLE to 78.14 % with this approach.
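
One possible reading of this hybrid neighbourhood rule is sketched below (illustrative only; the radius choice and the way the two neighbour sets are merged are assumptions, not the original implementation of [106]):

import numpy as np

def hybrid_neighbors(X, i, k=8, eps=None):
    """Neighbours of sample i: its k nearest points plus all points
    falling inside an epsilon-ball centred on it."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                               # exclude the point itself
    knn = set(np.argsort(d)[:k])                # fixed-size neighbour set
    if eps is None:
        eps = np.median(d[np.isfinite(d)])      # illustrative radius
    ball = set(np.flatnonzero(d <= eps))        # variable-size neighbour set
    return sorted(knn | ball)
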
ICA is the starting point for the work in [100]. To address the intrinsic limita-
tions of ICA linearity, the authors adopt for ear recognition the Kernel Independent
Component Analysis (KICA) proposed by [12], and use an SVM with a Gaussian
Radial Basis Function (GRBF) kernel for recognition. Unfortunately, the authors only report
results on noise (Gaussian, salt, Poisson) obtained by adding noise to images from
Carreira-Perpiñan database. Images in such database are all frontal and with uniform
illumination, therefore assessment with respect to variations in pose and illumination
is lacking.

6.3.2.2 Algorithms Based on Wavelets, Transforms and Hybrid Systems

Generic Fourier Descriptor (GFD) [123] has been defined for image indexing, and is
adapted in [1] to ear recognition. Fourier Transform (FT) is quite robust to noise and
to some kinds of image distortion which only affect high frequencies. For this reason,
it has been widely used in image processing. As an example, 1D FT can be applied
to contour shapes. However, applying it to shape indexing requires knowledge
of the object boundary, and unfortunately this requirement can seldom be satisfied.
2D FT overcomes this by considering the whole image information. However, 2D
FT has its limitations too when used for shape indexing. As a matter of fact, when
applied to the Cartesian representation of an image, it produces descriptors which
are not rotation invariant. On the other hand, rotation invariance is among the main
requirements for a shape descriptor. In [123] the problem is solved by transforming
the image into polar coordinates, and then treating the polar image as if it
were a normal two-dimensional rectangular image in Cartesian space. In practice, the
2D FT on this rectangular image is similar to the normal 2D discrete FT in Cartesian
space. In [1] this consideration is applied to the segmented ear image, once the center
of the ear has been located, as shown in Fig. 6.11.
The GFD descriptor is made robust to rotation by ignoring phase information in
the coefficients, and only retaining their magnitudes; the first magnitude value is
normalized by the area of the circle containing the polar image, and all the other
magnitude values are normalized by the magnitude of the first coefficient. However,
in transforming the input image from Cartesian to Polar space, the correct location of
the center of the axes C(0, 0) plays a crucial role. GFD is not robust to translations,
therefore even small shifts of this point can cause significant modifications in the
coefficient values. In [1] translation invariance is added by choosing the center of
mass MC(xc , yc ) of the edge map of the input shape as C(0, 0). Results are presented
for a home-collected database, and demonstrate high robustness to rotations with
respect to the horizontal plane.

Fig. 6.11 Transformation of the ear region from Cartesian to polar space
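
A compact sketch of the GFD computation just described (polar resampling about the chosen centre, 2D FFT, magnitude normalisation); the use of OpenCV's warpPolar and the simplified selection of low-order coefficients are implementation assumptions, not details of [1]:

import cv2
import numpy as np

def generic_fourier_descriptor(img, center, radius, n_coeffs=36):
    """GFD of a segmented ear image: polar image -> 2D FFT -> normalised magnitudes."""
    polar = cv2.warpPolar(img, (radius, 360), center, radius, cv2.WARP_POLAR_LINEAR)
    spectrum = np.fft.fft2(polar.astype(float))
    mag = np.abs(spectrum).ravel()[:n_coeffs]      # phase dropped for rotation invariance
    feats = np.empty_like(mag)
    feats[0] = mag[0] / (np.pi * radius ** 2)      # first value normalised by the circle area
    feats[1:] = mag[1:] / mag[0]                   # others normalised by the first magnitude
    return feats

The centre passed to the function would be the centre of mass of the edge map, as discussed above, which adds translation invariance.
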
The representation of an image obtained by the Gabor filter exploits its orientation-
selective properties and its robustness to PIE variations, and can provide high
recognition performance. However, such performance decreases as the classification
space grows, and computational complexity also plays against this technique.
On the other hand, the feature space dimension can be reduced by any of the clas-
sical methods. The approach presented in [94] uses General Discriminant Analysis
(GDA), which is designed for nonlinear classification based on a kernel function.
Results obtained on USTB II database demonstrate that the combination of the two
techniques achieves better results than the two separately, and also in a shorter time.
The recognition rate of the Gabor algorithm alone is the lowest of the three
(96.73 %); the recognition rate of GDA reaches 98.35 %, but the total time of GDA
is much longer (2.1203 s vs. 0.5731 s); finally, the recognition rate of Gabor+GDA is
the highest one (99.1 %), achieved in the shortest total time (0.5689 s).
An improvement of the technique presented in [10] and discussed in Sect. 6.3.1
exploits log-Gabor filtering after SIFT feature clustering [11]. The wavelet approach
aims at extracting the frequency content of the fluctuating surface of the Helix and the
Anti-helix of an ear. Since the wavelet process is localized, it can provide uncorrupted
information in the presence of occlusions. Given an estimate of the Helix location, log-
Gabor filtering is applied to this region to describe this curve. The model parts
identified by the approach in [10] are used to vote for the position of the Helix. A
template is built by the sampled image intensities in a semi-circular region which
includes the identified Helix. This semi-circle is centered in the point where the Crus
of helix curves inwards. In general, this point is almost the midpoint of the overall
ear height and is situated on the outermost part of the ear, opposite to the Helix. The
template is formed by sampling the image intensities along the lines starting from
the identified center, which are mostly normal to the Helix curve. In this template
the columns exhibit the variation between the ridge of the Anti-helix and the Helix
at each specific angle. A one-dimensional log-Gabor filter is then applied to these
columns. The presented tests use 252 images from 63 individuals, which are those in
XM2VTS database whose ear is not obscured by hair. The 4 images per individual
are taken in 4 different sessions over five months. Recognition using the original
nearest-neighbour algorithm on the model parts obtains 91.5 % correct recognition.
The new log-Gabor coefficients alone achieve 85.7 % recognition rate. Substituting
the Euclidean distance with a more robust distance metric, the latter performance
reaches 88.4 %. Combination of log-Gabor and model, using the simple sum of
the normalized scores, reaches 97.4 % recognition rate. This suggests that, although
the log-Gabor performs worse than the model, it contains novel and independent
information which enhances the recognition.
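
The one-dimensional log-Gabor filtering of the template columns can be sketched as follows (frequency-domain construction; f0 and sigma_ratio are illustrative parameter names and values, not those of [11]):

import numpy as np

def log_gabor_1d(column, f0=0.1, sigma_ratio=0.55):
    """Filter one template column with a 1D log-Gabor filter in the frequency domain."""
    column = np.asarray(column, dtype=float)
    n = len(column)
    freqs = np.fft.rfftfreq(n)                  # normalised frequencies in [0, 0.5]
    lg = np.zeros_like(freqs)
    nz = freqs > 0                              # a log-Gabor filter has no DC component
    lg[nz] = np.exp(-(np.log(freqs[nz] / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
    return np.fft.irfft(np.fft.rfft(column) * lg, n)

Applying log_gabor_1d to every column of the semi-circular intensity template yields the coefficients used for matching.
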
The approach in [72] aims at extracting edge information along three independent
directions. To this aim, after histogram equalization and normalization, 2D wavelets
are computed on the input image in horizontal, vertical and diagonal directions sep-
arately, to obtain three feature matrices. The filter bank exploits the Daubechies wavelet
class, which is a family of orthogonal wavelets. The points with high or low intensity
correspond to significant changes and represent the extremes for wavelet coefficients.
PCA applied to the single matrices of wavelet coefficients reveals that the Horizontal
matrix yields a higher recognition rate than the Vertical and Diagonal matrices (90.19, 86.27,
and 82.35 %, respectively). Therefore, in combining them to obtain a single feature
matrix, different weights are assigned to the original ones. Finally, the integrated
matrix undergoes PCA to reduce the feature space dimension. Recognition tests are
performed on the USTB II database and the Carreira-Perpiñan database. Accuracy on the
first one, which is the “hardest” one, reaches 90.5 % and outperforms PCA alone
and ICA.
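
The directional decomposition and weighted combination can be approximated with PyWavelets; the weights below merely mirror the relative recognition rates reported and are purely illustrative:

import pywt

def directional_wavelet_features(img, weights=(0.5, 0.3, 0.2)):
    """Single-level Daubechies decomposition; weight and merge the horizontal,
    vertical and diagonal detail matrices into one feature matrix."""
    _, (cH, cV, cD) = pywt.dwt2(img.astype(float), 'db4')
    w_h, w_v, w_d = weights                     # horizontal details get the largest weight
    fused = w_h * cH + w_v * cV + w_d * cD
    return fused.ravel()                        # further reduced with PCA in the cited work
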
The authors of [113] propose to use the HMAX model, introducing a new robust
set of features. Each element of this set is actually a complex feature, representing the
combination of position- and scale-tolerant edge detectors over neighboring positions
and multiple orientations. The system relies on a quantitative model of visual cortex.
In its simplest version, such model includes four layers of computational units. Simple
S units combine their inputs with Gaussian-like tuning to increase object selectivity;
they alternate with complex C units, which pool their inputs through a maximum
operation, providing gradual invariance to scale and translation.
The system first simulates S1 responses by applying to the input image a battery
of Gabor filters, which can be described by (6.4):

G(x, y) = exp(−(X² + γ²Y²) / (2σ²)) × cos((2π/λ) X)    (6.4)
where X = x cos θ + y sin θ, Y = −x sin θ + y cos θ, and θ (orientation), σ (width)
and λ (wavelength) are the filter parameters that allow reproducing the behavior of
cortical cells. After appropriate selection, 16 scales and 4 orientations are chosen and
organized in 8 scale bands. The next C1 stage is built by following the proposal by
Riesenhuber and Poggio to adopt a max-like pooling operation for building position-
and scale-tolerant C1 units. The subsequent S2 stage is where learning occurs and
is exploited during training, where each S2 unit behaves like an RBF classifier on a
number of image patches at random positions and in all orientations. Input images
processed by C1 are convolved at different scales with units in S2, and a C2 feature
vector is produced whose final size depends only on the number of patches extracted
during learning and not on the input image size. This C2 feature vector is passed to a
classifier for final analysis. The authors test k-nearest neighbor (KNN) and support
vector machine (SVM) approaches for this last step. Experiments exploit USTB I
database (60 subjects with 3 images each with slight rotation and some illumination
variation), i.e. controlled conditions. Therefore this very interesting method, though
being scale and rotation invariant, should be further tested for position invariance.
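
The S1 Gabor filters of (6.4) can be generated directly; a minimal sketch (filter size and the σ, λ, γ values are illustrative, not the tuned parameters of [113]):

import numpy as np

def s1_gabor(size=11, theta=0.0, sigma=4.5, lam=5.6, gamma=0.3):
    """Gabor filter of (6.4) for one orientation, as used for the S1 units."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(X ** 2 + gamma ** 2 * Y ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * X / lam)
    return g - g.mean()                          # zero-mean filter, as customary for S1 units

A bank of 16 scales and 4 orientations is then obtained by varying size, sigma, lam and theta.
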
An improvement for the classical force field approach in [45] is proposed in [36],
again to address the multi-pose recognition problem. In the original method, the
feature extraction is driven by some test pixels following the force field lines due to
the variation of potential energy, and getting the channels and wells. The latter are
stable and unique for each person, but this holds only for pose-fixed images, otherwise
they become unstable and may even disappear. To avoid this problem and address
multi-pose settings, the presented technique computes the force field transformation
without extracting the potential well feature. Feature description is rather performed
using Kernel Fisher Discriminant Analysis (KFDA), which is a nonlinear subspace
analysis method combining the kernel technique with LDA. Since force is a vector,
its magnitude replaces the intensity in the ear image, which becomes a
force image. Then the column vectors of the two-dimensional image are concatenated to
obtain a one-dimensional augmented vector which is used as a sample in NKFDA.
USTB III database is used for experiments. While the best results achieved with
wells reach only 75.3 %, the proposed method approaches a 90 % recognition rate
with a dimension less than 20.
A further work inspired by physics of electricity is presented in [14]. Here, the
adopted pattern recognition approach relies on Edge Potential Function (EPF), which
models the attraction force generated by edge structures contained in an image over
similar curves. The model exploits the joint effect of single edge points in complex
structures, determined by edge position, strength, and continuity, and builds a cor-
responding edge map. Good results in the correct matching are obtained even in
presence of noise and partial occlusions. To obtain robustness to rotation, horizontal
and vertical translation, and scale modification, a probe template is compared to the
templates in the database exploiting a genetic algorithm, which is recursively run to
translate, scale, and rotate the template until a fitness function returns a value greater
than a pre-defined threshold. The proposed method achieves a rank-1 recognition rate
of 98 %, but the datasets used for testing are USTB II and UMIST (face images with
rotations of 5◦ , 10◦ , and 15◦ ) which are not suited to significantly assess robustness
to pose variations.
Different Gabor-related techniques are tested in [58] for ear identification: the
extraction of phase information using either 1D log-Gabor, 2D Gabor, or
complex Gabor filters, and the extraction of orientation information using a bank
of even Gabor filters. The best results in terms of rank-1 recognition accuracy are
obtained using a pair of log-Gabor filters, and achieve 96.27 and 95.93 %, respec-
tively, on a database of 125 subjects and on an extended version with 221 subjects.
Images contained in the used dataset are acquired in a strictly controlled setting, and
in frontal position.

6.3.2.3 More Hybrid Techniques

The system presented in [124] combines Independent Component Analysis (ICA)
and an RBF network. The ear image is decomposed into a linear combination of
several basis images. Then the corresponding coefficients are fed into an RBF
network. Though still linear, ICA presents significant advantages over PCA: a better
probabilistic model of the data, which gives a better identification of clusters in the
n-dimensional space, a better unmixing ability, and the ability to handle higher order
data. In some experiments images are pre-filtered to enhance contours, using either
Wiener filtering or Laplacian-Gaussian filtering. The approach is tested on Carreira-
Perpiñan database and on a home-collected one. Both sets of images present ears in
a frontal pose.
The work in [35] adopts an approach that can be considered as global in the
definition used here, but is local with respect to the way the ear image is processed.
The feature extraction process is based on the fractal technique of PIFS (Partitioned
Iterated Function Systems). Such process is made local by dividing the normalized
image of the ear region in four quadrants. They have no special relation with specific
anatomical elements inside, but rather represent usual zones possibly affected by
occlusion (upper ones by hair, lower ones by earrings). PIFS is based on building
a map of self-similarities within the image, and dividing this process into the four
quadrants allows occlusions to be better addressed, since corrupted information in one
quadrant can be compensated by that provided by the other ones. Some computa-
tional optimizations are adopted to avoid the usual computational weight of fractal
techniques. Experiments are performed on subsets of Notre Dame ear database and of
side images from FERET. Images in the latter set are mostly affected by illumination
variations, while those in the former one also undergo pose changes. Obtained results
show that both the system in [35] and the compared ones (PCA, LDA, KDA, OLPP)
are less robust to pose than to illumination, but the former is much more robust to
occlusions. Moreover, since an illumination variation affecting only a region may be
compared to an occlusion, even this problem is better addressed.
A multimatcher approach is adopted in [69–71]. In the first work, each matcher
is trained using features extracted from a single-subwindow (SW) of the entire 2D
image. The ear is segmented using two landmarks, namely Triangular Fossa and
Incisura Intertragica, according to one of the approaches in [109]. The whole image
(150 × 100) is divided into SWs of dimension 50 × 50 with a step of translation of
11 pixels to obtain 50 SWs for each image. The features are extracted by the convo-
lution of each SW with a bank of Gabor filters, and afterwards their dimensionality
is reduced using Laplacian eigen-maps. The best matchers correspond to the most
discriminative SWs, and are selected by running Sequential Forward Floating Selec-
tion (SFFS). Experiments use subjects from the UND database (collection E) and
the sum rule is employed for fusing the results from the selected matchers at the
score level. The rank-1 recognition rate is about 84 %. A quadratic discriminant clas-
sifier (QDA) is trained in the later work in [71] to improve discrimination between
genuine users and impostors in verification tasks, based on the same feature extraction
technique. Finally, in [70], the multimatcher approach is used along very similar
lines, yet combining different color spaces instead of different SWs, exploiting the
ability of different color spaces to convey different information. The input images
are in the RGB color space, and are then mapped onto the 12 spaces: YPbPr, YCbCr,
YDbDr, JPEG-YCbCr, YIQ, YUV, HSV, HSL, XYZ, LAB, LUV, LCH. The three
image components of each of the 13 spaces are preprocessed. Feature extraction
is performed by using a bank of Gabor filters of different scales and orientations
applied on a regular grid superimposed to the image. Even in this case selection is
performed by SFFS. Identification results reach about 84 %. The authors’ conclusion
is that pose variations are still better approached through 3D approaches. However,
it might be interesting to test the combination of their two sets of matchers.
The work in [120] starts from a similar multimatcher approach to investigate the
problem of partial occlusion. In this work, the implementation of the different steps is
different from [69]. During training, sub-windows of the same position are combined
to form a sub-space, and then Neighborhood Preserving embedding (NPE) is applied
to get the projection matrix of each sub-space, so that the feature vectors of training
images are extracted according to the same technique. For each sub-space, a sub-
classifier uses the nearest neighbor rule. Sub-spaces are ranked according to their
recognition rates. In the test stage, after ear detection and sub-window division, the
feature vectors of each sub-window are extracted, and the top ranking sub-classifiers
are selected for recognition. The final decision is produced by score level fusion
exploiting weighted sum rule. Experiments on the same USTB set show that this
method is more robust to occlusion than [69], but pose variation may still be a harder
problem to address.
The work presented in [77] approaches the problem of robustness to pose vari-
ations by considering different aspects. The first step entails image enhancement.
Three image enhancement techniques are applied in parallel to neutralize the effect
of poor contrast, noise and illumination (histogram equalization, non-local means fil-
ter and steerable filter), and the three obtained images are then processed separately.
Each of them separately undergoes a local feature extraction technique, namely
Speeded-Up Robust Features (SURF) [15], which is deemed able to minimize the
effect of pose variations and poor image registration. It makes use of the Hessian matrix
for key-point detection. Haar wavelet responses dx and dy in horizontal and verti-
cal directions are computed in a circular region around the detected points. These
responses are used to obtain the dominant orientation in the circular region, and then
feature vectors are measured relative to the dominant orientation. In this way the
resulting vectors are invariant to image rotation. Afterwards the method considers a
square region around each key-point, which is aligned along the dominant orientation
and divided into 4 × 4 sub-regions. Haar wavelet responses are computed for each
sub-region, and the sum of the wavelet responses in horizontal and vertical directions
for each subregion are used as feature values. The SURF feature vector of a key-point
is obtained by concatenating the feature vectors from all sixteen sub-regions around
it, resulting in a vector of 64 elements. Three sets of local features are obtained, one
for each enhanced image. Three separate nearest neighbor ratio classifiers are trained
on these sets of features. After computing nearest neighbor, its validity is assessed
by computing the ratio of the distance from the closest neighbor to the distance of
the second closest neighbor. Matching score between two ear images depends on the
number of matched feature points. These matching scores are normalized using min-
max normalization technique and are then fused using weighted sum rule. Tests are
performed using IITK database and UND database (Collection E). The composition
used here of IIT Kanpur database includes two data sets. Data Set 1 contains 801 side
face images of 190 subjects. For each subject, 2 to 10 images are acquired. Data Set
2 includes 801 side face images from 89 individuals. For each of them, 9 images are
captured containing three rotations (person looking straight, looking approximately
20◦ down, and looking approximately 20◦ up), and three scales (camera at a distance
of approximately 1 m and digital zoom of the camera at 1.7x, 2.6x and 3.3x) for
each rotation. For a detailed discussion of results, the reader can refer to the original
work. However, it is worth noticing that results on UND-E database, which presents
pose variations, are worse than those on IITK, which mainly presents rotation and
illumination variations.
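
The nearest-neighbour ratio test used for matching can be sketched as follows (SURF is non-free in some OpenCV builds, so any local descriptor can stand in; the ratio threshold is illustrative):

import cv2

def match_score(desc_probe, desc_gallery, ratio=0.75):
    """Count ratio-test-validated matches between two descriptor sets; the count
    acts as the raw matching score, later min-max normalised and fused."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(desc_probe, desc_gallery, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good)
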

6.3.3 Algorithms Based on Multiscale/Multiview

Robustness to illumination and pose can be significantly improved by adopting
techniques which address the problem by separately considering regions which are
usually affected to a different extent by illumination and pose variations. Multi-
scale/multiview approaches can be considered among these. In particular, multi-view
imaging aims at acquiring richer geometric and structural features for the ear.
In the already mentioned work in [63], multi-resolution is applied in an early stage
to select the best set of landmarks for each image. As a matter of fact, ASM can suffer
from changes in illumination and local minima in optimization, and the result depends
on initialization. The authors generate a pyramid of images with different resolutions
and obtain the profile statistics for each landmark and each pyramidal level.
The technique proposed in [93] focuses on the textural features of ear image. It
stems from the Local Binary Patterns (LBP) approach, but exploits a multi-resolution
variation allowing both smaller and larger structural information to be captured. Circular
LBP is more flexible and uses a circular region with a discretionary radius and number
of considered neighbors. In practice this is equivalent to acquiring different resolu-
tion LBP operators. This solution can be affected by aliasing effects in the obtained
representation of the image, when large radii are exploited. The authors propose to
solve the problem by wavelet low-pass filters. Haar wavelet decomposition is first
performed, in order to obtain different filtered versions of the original image, in
a number depending on the chosen scale factor. Besides this, they adopt Uniform
(Circular) LBP to reduce the size of the derived histogram. A local binary pattern
is uniform if it contains at most two bitwise transitions from 0 to 1 or vice-versa.
In practice, uniform binary patterns occur more commonly in texture images than
others. In the computation of the LBP labels, a separate label is assigned to each
uniform pattern, while all the non-uniform patterns share a same label. For example,
when using an 8-neighborhood, there are a total of 256 patterns: since 58 of them
are uniform, we can reduce to 59 different labels. A further contribution to pose and
illumination invariance is given by applying a block division method joined with
multi-resolution. First the original image is divided into several overlapping blocks
of arbitrary size. Then the method calculates the ULBP distributions for each of
the blocks and chains the histograms into a single vector representing the image.
Multi-resolution is achieved by applying ULBP with two different radii and number
of neighbors, and combining the obtained histograms. In summary, ULBPs are com-
bined simultaneously with block-based and multi-resolution methods to describe
together the texture features of the filtered ear images, which are obtained by Haar
wavelets. Matching is performed by using Canberra distance measure [89]. Experi-
ments are performed on a subset of images from USTB III, namely on right rotations
of 0◦, 5◦, 20◦, 35◦ and 45◦. Results demonstrate that the combination of the proposed
techniques allows the recognition rate at a rotation of 35◦ to rise from the 8.86 % of PCA
to 62.66 %.
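
The block-wise uniform LBP description and the Canberra matching can be sketched with scikit-image and SciPy (radius, number of neighbors and block size are illustrative, and only a single resolution is shown):

import numpy as np
from scipy.spatial.distance import canberra
from skimage.feature import local_binary_pattern

def block_ulbp_histogram(img, P=8, R=2, block=16):
    """Uniform LBP computed per overlapping block; block histograms are chained."""
    lbp = local_binary_pattern(img, P, R, method='nri_uniform')    # 59 labels for P=8
    n_bins = P * (P - 1) + 3
    hists, step = [], block // 2                                   # 50 % block overlap
    for r in range(0, img.shape[0] - block + 1, step):
        for c in range(0, img.shape[1] - block + 1, step):
            h, _ = np.histogram(lbp[r:r + block, c:c + block], bins=n_bins, range=(0, n_bins))
            hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

Matching then reduces to computing canberra(block_ulbp_histogram(probe), block_ulbp_histogram(gallery)), where a smaller distance means a better match.
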
A similar but much simpler approach is presented in [95]. In this work, LBP
is applied to images at different scales, which are obtained by wavelet transform.
The eigenvectors of the obtained LBP images undergo Linear Discriminant Analysis
(LDA) to extract features. In the test phase, the Euclidean distance is used. Tests
show a 100 % recognition rate with a feature space dimension of 60, but are performed
on USTB II which is not the most challenging one.
Different views of the ear contain shape features of different orientations. In
[126] a subspace of feature space termed ear manifold is exploited. Ears can be
projected onto feature space, and if this is done for multiple views, it is possible
to obtain a smooth trajectory. Ear pose manifold is deemed to play an important
role in recognition. The authors first construct a multi-view dataset of manually
segmented images, where for each out of 60 subjects there are eight different views
(−60◦ , −50◦ , −40◦ , −20◦ , 0◦ , +20◦ , +40◦ , +60◦ ). However, the capturing device
is placed in a dark room and is illuminated with fixed lighting, and moreover, as it can
be observed in Fig. 6.12, some frame seems to be further used to make segmentation
easier. It is therefore arguable whether the results fully hold in real-world settings.
Null-space kernel discriminant analysis (NKDA) [60] is implemented on a subset
of training images to form ear multi-view discriminative feature space. Training
samples are projected onto the feature space to generate points at different positions.

Fig. 6.12 Multiple views of an ear projected onto the ear manifold

Then B-Spline is used to interpolate views for all projected points, and to acquire the
pose curve representing the ear pose manifold used for recognition. It is assumed that
ears of different subjects give rise to different pose B-Spline curves, which in turn
represent different pose manifolds. During test phase, a test sample is projected onto
the discriminative feature space, and the distance to all existing points of B-Spline
pose manifolds is calculated. The subject corresponding to the closest pose manifold
from the test point is recognized as its identity. During the experiments, orientations
(−60◦ , −50◦ , −20◦ , +20◦ , +60◦ ) are used for training, while orientations ( −40◦ ,
0◦ , +20◦ ) are taken for test. The best achieved performance is 97.7 % rank-one
recognition rate.
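
The pose-manifold construction and the nearest-manifold decision can be sketched with SciPy's spline routines (an illustration under stated assumptions: the NKDA projection has already been applied and the discriminative space is low-dimensional, since splprep supports at most 10 dimensions):

import numpy as np
from scipy.interpolate import splev, splprep

def pose_manifold(views, n_samples=200):
    """Fit a B-Spline through the projected training views of one subject and
    sample it densely; 'views' is (n_views, dim), ordered by pose angle."""
    tck, _ = splprep(views.T, s=0, k=min(3, len(views) - 1))
    u = np.linspace(0, 1, n_samples)
    return np.array(splev(u, tck)).T             # (n_samples, dim) samples of the pose curve

def identify(test_point, manifolds):
    """Assign the projected test sample to the subject whose pose curve is closest."""
    dists = [np.min(np.linalg.norm(curve - test_point, axis=1)) for curve in manifolds]
    return int(np.argmin(dists))
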
A different multi-view approach is adopted in [61]. The work investigates multi-
pose ear recognition using Partial Least Square Discrimination (PLSD). Partial Least
Square Regression (PLSR) is a statistical regression technique that relates
several dependent variables (called responses) to a large number of independent
variables (called predictors). PLSR models the relationship between the dependent
and the independent variables by creating latent vectors, that account for as much of
the covariance between the dependent and independent variables as possible. In this
way PLS can be used as a dimensionality reduction technique. Classification prob-
lems under a linear representation framework use the partial least squares regression
model (6.5):
Y = XB + E (6.5)

where Y are the responses, X are the predictors, B is the regression coefficient matrix
and E stands for the residual matrix. In this case the matrix X of independent variables
is treated as the samples training matrix and the matrix Y of dependent variables
encodes the class membership of the independent variables. Since this is a multi-view
approach, many training samples belong to the same subject; Y therefore has a block
structure with “1” entries marking class membership and “0” elsewhere. In this way, the matrix B of
regression coefficients becomes the weighting matrix for class membership, which
can be used to combine the columns of X to represent the different class membership
in Y. Matrix B can be computed by the classical nonlinear iterative partial least
squares (NIPALS) algorithm, and after this the regression equation can be explained
as an approximation equation from the training data. In PLSD approach, PLS is first
used for data dimension reduction and components extraction, then classification is
performed through methods such as logistic discrimination, SVM, ANN, etc. The
aim of the cited work is to study the actual classification performance of partial least
square representation, after the extraction of partial least square latent components.
When a new test sample is submitted for classification, the system computes its
regression response estimation for all classes, and then the residual between the
estimation and the expected class membership response “1”. The test sample is
assigned to the class producing the smallest residual.
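
A minimal version of this PLS-based classification with scikit-learn (class membership encoded as one-hot responses; the number of latent components is illustrative, and a multi-class problem is assumed):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer

def pls_classify(X_train, y_train, X_test, n_components=20):
    """Fit Y = XB + E with one-hot Y; assign each test sample to the class whose
    expected membership response "1" leaves the smallest residual."""
    lb = LabelBinarizer()
    Y = lb.fit_transform(y_train)                  # one column per class
    pls = PLSRegression(n_components=n_components).fit(X_train, Y)
    Y_hat = pls.predict(X_test)                    # regression response estimates
    residuals = np.abs(1.0 - Y_hat)                # residual w.r.t. the expected response "1"
    return lb.classes_[np.argmin(residuals, axis=1)]
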
It is interesting to consider the results in [4] about the experimental assessment
of ear symmetry, and therefore related to the multi-view approach. It is well known
that the human body, and in particular the face, are not perfectly symmetrical. Asym-
metry can reach different levels, and can even be exaggerated for artistic purposes, as in
the case of the famous artist Antonio De Curtis (Totò). Along the same direction, a further
result from research is that the two irises of the same individual are completely
different. Therefore it is natural to wonder whether this holds for ears too. The
results in the mentioned study aim at a twofold effect: to understand the possibility
of matching the left and right ears of an individual, and to assess the feasibility of
reconstructing portions of the ear that may be occluded in a surveillance video. Both
symmetry operators and Iannarelli’s measurements are exploited. All experiments
use the WVU Ear Database. The WVU ear database consists of 460 video sequences
of 402 different subjects, and has multisequences for 54 subjects. Ear images are
extracted from the video sequences. Each video starts from the left
profile (0◦) of the subject and takes about 2 min to end at the right profile (180◦). The
database includes 55 subjects with eyeglasses, 42 subjects with earrings, 38 subjects
with partially occluded ears, and 2 subjects with fully occluded ears. It is not listed
in Sect. 6.1.4 since it is still not available. As for the first kind of experiments, the
authors concatenate the left and right ears and then apply the symmetry transform
suggested by Reisfeld et al. in [80]. Such transform assigns a symmetry magnitude
to every pixel in the image to construct a symmetry map. In [4] symmetry maps are
built for concatenated left and right ears, and also perfect symmetry maps, by con-
catenating each right ear and its mirror image. Both visual comparison and distance
measurement of the two maps for the same subject show that they are very similar,
and this suggests symmetry for most individuals. On the other hand, experiments
with Iannarelli’s measures highlight that asymmetries are mostly located in specific
areas, in particular they are related to the width of the helix rim of the concha. The
authors also present several experiments to assess the ear symmetry from a biomet-
ric recognition system perspective. A small preliminary experiment with a similar
setting had also been presented in the earlier work in [109]. It exploits images from
119 subjects under two poses. The right ear of the subject is used as the gallery and
the left ear is used as the probe. The results highlight that for most people the left
and right ears are quite symmetric, but for a number of subjects the left and right ears
have different shapes. In the experiments in [4] the authors compare the performance
obtained by matching right ear images in the gallery with right ear images used as
probes, and the performance obtained by matching left ear images in the gallery with
reflected right ear images. From the results of these two experiments, they estimate
that the effect of ear asymmetry affects the recognition rate by a magnitude between
5.94 and 7.84 % in verification mode. Experiments in identification mode show that
the asymmetry of the two ears does affect the identification performance, where the
rank-1 accuracy drops by an absolute value of ~35 %.

Fig. 6.13 Examples of aligned color and range images of the same subject

6.4 Recognition of 3D Models

We already mentioned in Sect. 6.1.3 the seminal works by Chen and Bhanu [16, 29],
and by Yan and Bowyer [108, 111] on 3D ear recognition. Figure 6.13 shows two
examples of aligned color and range images used for feature extraction and fusion.
We will describe here some different techniques. However, before going further,
it is interesting to mention the results reported in [98] and regarding a series of
tests aiming at a comparison of the recognition performance of 3D face, 3D ear,
and 3D finger surface. Identification experiments exploit the Iterative Closest Point
algorithm on a multi-modal biometric dataset of multiple range images collected
from 85 individuals. The images are part of the multi-modal biometric database of the
University of Notre Dame. As one may expect, 3D face achieves the best performance
with a rank-1 recognition rate of approximately 93 % followed by both 3D ear and
3D finger with 80 %. However, looking at the CMC curve for ear one can notice that
it stays constantly lower than the CMC curve for finger: at rank 20 finger reaches
about 97 %, while ear hardly exceeds 90 %. On the other hand, finger recognition is
supported in these tests by more information. Finger experiments employ score-level
fusion of the three separate matching scores obtained by comparing probe images of
the index, ring, and middle fingers to corresponding gallery images.
An important issue to consider is that real-life biometric applications, especially
online ones, require algorithms that are both robust, like 3D-based ones, but also effi-
cient, in order to scale with the increasing size of involved databases. The biometric
keys extracted from 3D ear models must be easily comparable, and this essentially
requires that possibly computationally expensive steps should be only performed
during preprocessing and not during normal operation.

6.4.1 Methods Based on 3D Acquisition

The recognition scheme presented in [62] aims at fast and low-cost prac-
tical applications. It exploits a fast slice curve matching for identification based on
3D ear point clouds. Acquisition is based on sequential laser strips, as described in
Sect. 6.2.1. Unfortunately, the acquisition is performed by putting the ear into an oval
hole, so that any segmentation problem is avoided. For this reason, it is impossible
to exploit the approach as is within an under-controlled setting. Isolated points are
pruned, based on cloud density and on point-to-point distances. After further noise
removal and Gaussian smoothing, PCA is applied to the point cloud to extract the
principal directions of the three largest spreads of distribution of these points. The
three principal directions are matched to a set of fixed specific directions, so obtain-
ing a pose normalization. A plane is used to cut the 3D ear models. Such plane is
perpendicular to the principal axis of ear point clouds, and cutting will provide a
slice of the contour of the ear. Thanks to pose normalization, it is easy to determine
the principal axis in a consistent way. A series of slices at different locations will
reflect different characteristic parts of the 3D ear shape. If two 3D models are similar,
then the slice curves at the same locations will be similar too. After some experi-
mentation, the chosen number of slices is 17. The curvature information extracted
from each slice is stored in a feature vector together with an index indicating the
slice position within the 3D model. A similarity measure is computed by searching
the longest common sequence between two slice curves with similar indexes. On a
home-collected dataset of 50 subjects, with 4 samples each, the proposed slice curve
matching approach can reach 94.5 % rank-one recognition rate.
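
The PCA-based pose normalisation of the point cloud can be sketched as follows (a rough illustration; the reference directions and the slicing step are not the exact choices of [62]):

import numpy as np

def normalize_pose(points):
    """Align the three principal directions of the ear point cloud (N x 3)
    with the coordinate axes, so that slicing planes are taken consistently."""
    centered = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(eigvals)[::-1]              # largest spread first
    return centered @ eigvecs[:, order]            # rotate onto the principal axes

Slices can then be taken as thin bands perpendicular to the first principal axis, e.g. 17 equally spaced cutting planes along that axis.
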
In order to achieve fast 3D matching during biometric recognition, the authors of
[73] propose to exploit an annotated ear model (AEM) that is representative of human
ears. This purely geometrical model is used to register each ear in a dataset and to
acquire its shape through a fitting process. The 3D information extracted from the
fitted model is stored as metadata. This representation is compact and directly com-
parable, thus making the method robust and efficient. The AEM needs to be created
only once and is based on statistical data. Moreover, the used polygonal 3D mesh
does not represent the whole ear, but only the inner region around concha, since this
is the one that is most often free from hair. During enrollment, raw captured data are
preprocessed and segmented, and then registered to the AEM. This expensive as well
as critical (for future recognition) step is performed only once, when a new template
is added to the database, and not for all matching operations. A better accuracy is
achieved by applying two different registration algorithms in sequence: the Iterative
Closest Point (ICP) first and a fine tuning one afterwards. The annotated model is fit-
ted to the data, and then a biometric key is extracted using geometry information and
stored in the database as metadata. During authentication metadata retrieved from
the database are directly compared using a light distance metric. However, it must be
said that this point is not completely clear, because in real applications one needs to
process probes in order to obtain matchable features, and this step seems excluded by
the way the approach is presented. Tests for both time and accuracy performance are
conducted on Notre Dame database (UND) augmented with a further home-collected
dataset. The reported processing time is about 30 s for each enrollment operation,
but falls to 1 ms per comparison during authentication, which was one of the goals
for the approach. For the UND database, the rank-one rate is 93.9 %.

6.4.2 Video-Based Methods

The first approaches to 3D ear recognition all use 3D range data. The work in [22] is
an attempt to use video to derive 3D structure. In real-world applications, video is of
course much more feasible than range data due to the lack of special equipment, and
can be exploited also in normal video-surveillance settings. For this reason we will
devote some more space to describe this approach. The authors present two different
systems for 3D ear recognition from video. The first one is based on structure from
motion (SFM), and relies on four steps: preprocessing, three-dimensional recon-
struction, post-processing, and recognition. The second system is based on shape
from shading (SFS), and relies on three steps only: preprocessing, three-dimensional
reconstruction, and recognition (post-processing is absent). The preprocessing step
is performed in the same way in both systems. Its implementation starts from the con-
sideration that video frames are generally of lower quality than still images, since they
are affected by motion blur, compression and interlacing. Therefore image enhance-
ment is of even greater importance. In particular, artifacts must be eliminated or
sufficiently smoothed to avoid tracking false features, which would lead to incor-
rect 3D reconstruction. In order to smooth the image artifacts, this step exploits the
force field transformation method presented in [44]. The principal curvatures of the
image intensities are used to detect both ridges and ravines located along the ear.
Among the possible features, the location of the helix is particularly useful, since
it is generally the outermost detected ridge, and is an important reference element
to create a boundary for the valid ear region. After this preprocessing, the two tech-
niques diverge. Structure from motion (SFM) differs from stereopsis because the
latter attempts to reconstruct a 3D model from two slightly translated images of an
object captured simultaneously (same time, different space). SFM does the same by
using a sequence of video frames of the same object (different time, same space).
On the other hand, both techniques try to locate corresponding features in the two
or more available images, to be able to determine their position in space. Kanade-
Luca-Tomasi feature tracker is used to track features across the sequence of frames.
Distance between a feature in the initial frame and the corresponding candidate in
a following one considers both linear translations and affine transformations, and is
obtained using the root mean square (RMS) residue; when the distance rises above
a given threshold the feature is abandoned. Post-processing relies on outlier filter-
ing, to eliminate artifacts (3D points whose nearest neighbor is too far in space).
Matching is divided into two steps: correspondence matching and depth matching.
Correspondence matching is in turn divided into two procedures: computing a sim-
ilarity between the shapes and then geometrically transforming the shapes using
2D-ICP in order to maximize the similarity. If a sufficient number of corresponding
matches are found between two shapes, depth matching is invoked. It compares the
similarities in terms of depth of the retained features from both point sets. It is worth
noticing that the authors find that correspondence matching is very useful to filter
out insufficiently matching database models in the (x, y) domain before performing
depth matching, since they observe that the locations of features are quite unique to
each individual. For this reason, a significant number of database models would not
reach the second step. The approach using Shape From Shading (SFS) might be less
interesting for fully automatic systems, since it requires performing the 3D recon-
struction process on a single image, manually chosen among those with an optimal
view of the entire ear region. As a matter of fact, SFS is defined as a process which
recovers the 3D (visible) surface of a 3D object from the shading in its 2D inten-
sity image, by using information about the reflectance and illumination properties of
the scene. However, an automatic though computationally demanding procedure to
select the most stable 3D model is proposed later in [23], which implies reconstructing
and comparing 3D models from a number of subsequent frames. Returning to [22],
in the second system, the recognition phase first involves finding a closest match
between a database and test model via 3D-ICP and then computing the similarity
measure of the two models using an RMS difference. Experiments exploit the above
mentioned collection of video clips by West Virginia University (WVU). With SFM,
using from 10 to 50 frames for reconstruction results in an optimal
recognition rate of about 95 % with 16 frames. As a possible explanation, using fewer
frames may cause missing information, while using more may cause more noise.
These experiments use the same video clip, dividing the set of frames between train-
ing and testing. Unfortunately, when incorporating a testing set from a completely
different set of videos, the obtained recognition performance is inadequate. Although
two different video clips contain the same subject, the detected features in each of
the videos have very poor correlation to one another. The conclusion is that the SFM
approach, using a sparse point map, provides more accurate relative depth informa-
tion about the feature points than the SFS, but is quite limited since it heavily
relies on correlation amongst the feature points. On the other hand, the SFS approach,
which exploits a dense set of points and therefore provides a more accurate shape, is
able to successfully discriminate between individuals using different video clips for
training and testing (84 % recognition rate). The authors finally suggest fusing the
two approaches to obtain the best from the two.

6.5 Multimodal Systems

Since this chapter is focused on the strategies to perform biometric ear recognition
under uncontrolled data acquisition conditions (ideally fully covert ones), we will
leave out all those multimodal proposals which would imply an aware user partici-
pation, e.g. recognition involving some combination of ear and palmprint [37], of
ear and fingerprint [2, 39], or of ear and signature [67]. While these systems improve
performance of identification for aware and somehow collaborative users, they are
of course not feasible for the kind of settings that we are addressing here. A more
complete review of multimodal systems involving ear recognition can be found in
[5, 74]. The systems that we will shortly describe rather join ear recognition with
gait or face, especially side face, which can be acquired even at a reasonable distance
without user participation.
On the other hand, it is also worth mentioning the possibility, as for example with face,
to use ear in a wider interpretation of multimodality, i.e. to combine ear+ear in some
way. We will first mention these approaches.
A final aspect to anticipate is that, in many cases, the true novelty of presented
methods is in the kind of fusion they adopt, rather than in the techniques exploited
for the single modalities. Fusion in a multibiometric system may be performed at
feature level, which implies combining the different templates, which must have a
compatible structure, and to train a specific system on the obtained combination. This
approach is the most expensive one, but fully preserves the original information. A
second choice is to perform fusion at matching score or ranking level. In general,
this is the preferred solution, because already existing unimodal systems can be
exploited and enough information is still present in the result (e.g., in identification
this is usually a list of scores or ranks). Finally, fusion can be performed at decision
level, which is the cheapest option; however, any information useful to identify
problems in one of the systems is lost.
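
A score-level fusion of two unimodal matchers, along the lines just described, can be illustrated as follows (min-max normalisation followed by a weighted sum rule; the weights are illustrative):

import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def fuse_scores(scores_a, scores_b, w_a=0.6, w_b=0.4):
    """Weighted-sum fusion at matching-score level over a common gallery;
    in identification, the gallery identity with the highest fused score is returned."""
    return w_a * minmax(scores_a) + w_b * minmax(scores_b)
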

6.5.1 Matching and Fusion of Different Ear Aspects

Yan and Bowyer analyze in [109] the performance of 2D and 3D ear recognition under
different conditions. The images belong to the initial core acquired at the University
of Notre Dame. The approaches considered include PCA (widely known as “eigen-
ear”) with 2D intensity images, achieving 63.8 % rank-one recognition; PCA with
range images (often referred as 2.5D images), achieving 55.3 %; Hausdorff matching
of edge images extracted from range images, achieving 67.5 %; and ICP matching
of the 3D data, that with some improvements passes from 84.1 to 98.7 %. ICP-based
matching achieves the best performance, and also shows good scalability with size
of datasets.
After this preliminary study, the same authors investigate the effect of fusing the
results from the above techniques [110]. They test multi-modal (2D+3D), multi-
algorithm (PCA+ICP) and multi-instance (two images for both enrollment and test-
ing) modalities. All provide an improvement over a single biometric trait. Multi-modal
combinations include 2D PCA with 3D ICP, 2D PCA with 3D PCA, and 2D PCA
with 3D edge-based approach. The best results are achieved by multi-modal 2D PCA
fused with 3D ICP. To combine 2D PCA-based and 3D ICP-based ear recognition,
the authors propose a fusion rule at matching score level based on the interval dis-
tribution between rank-1 and rank-2 recognition rates. The rank-1 recognition rate
achieves 91.7 % with 302 subjects in the gallery. For the multi-algorithm tests, three
different algorithms are used to process 3D data: ICP-based algorithm, PCA-based
algorithm and edge-based algorithm. After score normalization, the weighted sum
rule is used for combinations. The best performance is achieved when combining
ICP and edge-based algorithm on the 3D data (90.2 %). In general, all the approaches
perform much better in multi-instance experiments with multiple images used to rep-
resent one subject. The ICP approach, used to match a two-image-per-person probe against
a two-image-per-person gallery, reaches the highest rank-1 recognition rate of 97 %.
Along a similar ear-ear combination line, in [13] a kind of feature-level fusion is
proposed for verification. SIFT is used to extract relevant features from ear images
at different poses. Features are then merged according to an appropriate fusion rule
to produce a single feature template called the Fused Template. The similarity of
SIFT features extracted from the live image and Fused Template of the enrolled
user is measured by their Euclidean distance. IITK database in a version with 1060
images is used to validate the performance of the method. The images are collected
from a distance of 2.5 m. The camera is moved circularly and the subject is at the
center. Orientation of the camera facing the ear in frontal pose is considered as
0◦ . The ear images are captured at −40◦ , −20◦ , +0◦ , +20◦ and +40◦ placing the
camera tripod at fixed landmark positions. Only images of right ear are collected.
Two images per pose (angle) are obtained in a short interval. There are 10 images per
subject. The database contains 1060 images from 106 subjects. The images obtained
are normalized to 648 × 486. Features are extracted from the ear images at −40◦ ,
+0◦ and +40◦. The pose variation is chosen so as to reduce the correlation and get a
more complete representation of the subject. Illumination is normalized by histogram
fitting, taking the image at 0◦ as reference. The dimension of the final feature vector is
reduced by eliminating redundant points between the fusing images, i.e. those found
in overlapping regions of the different images. Surviving keypoints are concatenated
to form a composite feature template. During matching, the SIFT features of each
keypoint pi of the probe image are matched to those of every keypoint tj of the
template in the database. The pair of matching keypoints (pk , tl ) with minimum
distance is removed. The matching process is continued for the remaining points
until no pair is found. The decision of whether the probe template is matched or not
depends on the number of matching points. Comparing performance with non-fused
and fused templates, FAR decreases from 11.75 % down to 6.51 %, FRR from 11.59
to 2.86 %, and accuracy increases from 88.32 to 95.32 %.
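As a rough illustration of the greedy minimum-distance pairing described above, the sketch below matches two sets of SIFT descriptors one pair at a time until no sufficiently close pair remains. The array layout, the 128-dimensional descriptors, and the distance threshold max_dist are assumptions of this illustration, not the authors' code.

```python
import numpy as np

def greedy_keypoint_matching(probe_desc, template_desc, max_dist=250.0):
    """Repeatedly pick the closest remaining (probe, template) descriptor pair,
    remove both keypoints, and continue until no close enough pair is left."""
    # Pairwise Euclidean distances between all probe and template descriptors
    d = np.linalg.norm(probe_desc[:, None, :] - template_desc[None, :, :], axis=2)
    matches = []
    while d.size and np.isfinite(d).any():
        k, l = np.unravel_index(np.argmin(d), d.shape)
        if d[k, l] > max_dist:
            break
        matches.append((k, l, float(d[k, l])))
        d[k, :] = np.inf   # keypoint p_k can no longer be matched
        d[:, l] = np.inf   # keypoint t_l can no longer be matched
    return matches         # decision: accept if len(matches) exceeds a threshold
```

The final accept/reject decision then simply compares the number of matched keypoints against a threshold, as described above.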

6.5.2 Matching and Fusion of Ear and Face

Victor, Bowyer and Sarkar [90] were perhaps the first to compare the performance
of ear and face biometrics. They used PCA for both traits, and the final result of their
experiments was that face is better for recognition, despite possible expression and
other kinds of variations like makeup and beard, which are not found in the ear. However,
they did not study the effect of combining the two biometrics.
It is worth noticing that, in principle, approaches fusing ear and side face are
realistically more suitable for uncontrolled recognition, since the two traits can be
acquired in one shot (actually, there is only one image including the two traits) and
any kind of fusion is often avoided. Fusing frontal face and ear requires two shots
instead, and either a controlled setting or a capture setting assuring images from
two points of view (frontal and side) on the subject to be recognized (with only one
device, one would have to hope that sooner or later the subject turns to different poses).
Moreover, fusion is required at some level.

6.5.2.1 Ear and Frontal Face

Among the first results of combining face and ear biometrics we find those in the
already mentioned work in [26]. Combination experiments exploit a very simple
feature-level combination technique. The normalized and masked images of the ear
and face are concatenated to form a combined face-plus-ear image. These new images
undergo PCA processing. CMC curves for the experiments suggest that the multi-
modal biometrics offers a significant improvement of overall performance.
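A minimal sketch of this feature-level combination is given below: each normalized face and ear image is flattened, the two vectors are concatenated, and PCA is learned over the combined data. The array shapes, the scikit-learn dependency, and the number of components are illustrative assumptions rather than the setup of [26].

```python
import numpy as np
from sklearn.decomposition import PCA

def face_plus_ear_features(face_imgs, ear_imgs, n_components=50):
    """Concatenate each masked/normalized face and ear image into one vector,
    then learn a PCA subspace over the combined face-plus-ear images."""
    X = np.hstack([face_imgs.reshape(len(face_imgs), -1),
                   ear_imgs.reshape(len(ear_imgs), -1)])
    pca = PCA(n_components=n_components).fit(X)
    return pca.transform(X), pca   # projected gallery features and the model
```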
A similar approach is used in [118], with the difference that users are chimeric
ones, i.e., ears are from USTB II database, while faces with different poses are from
ORL database. Feature vectors are extracted from chained images using Full-Space
Linear Discriminant Analysis (FSLDA). Rank-one recognition rate of multimodal
biometrics is 98.7 % (apparently notwithstanding the number of dimensions) and is
higher than performance of single traits. It is worth noticing that ORL faces (visually)
present more significant variations than USTB II ears, and this seems to be the reason
for ear performance being better than face performance.
Face and ear are matched separately in [54], and face is captured in frontal pose.
For localization of the ear region, the Triangular Fossa and Antitragus are detected manu-
ally on the ear image, and then used to apply a complete ear localization technique. Both
face and ear are cropped from the respective images. After geometric normaliza-
tion, histogram equalization is performed. Both face and ear images are convolved

with Gabor wavelet filters to extract spatially enhanced Gabor features for both face
and ear. The feature vectors extracted from Gabor face and ear responses can be
further characterized and described by normal (Gaussian) distributions. Gaussian
Mixture Model (GMM) is applied to the high-dimensional Gabor face and Gabor
ear responses separately for quantitative measurements, and Expectation Maximiza-
tion (EM) algorithm is used to estimate density parameters in GMM. This produces
two sets of feature vectors which are then fused using Dempster-Shafer theory. The
approach is tested on the IIT Kanpur multimodal database. Database of face and ear
consists of 2 face and 2 ear images per person for 400 individuals. For the present
evaluation, only frontal view faces are used with uniform lighting, and minor changes
in facial expression. Face recognition achieves 91.96 % and ear recognition achieves
93.35 %, while fusion reaches 95.53 %. Despite the results, this method cannot be
deemed feasible for uncontrolled settings due to the quality required for images.
Moreover, the manual intervention by a human operator hinders its use for massive
applications.
3D ear and face are fused in [24]. As for 3D ear, the approach presented in [23] is
used. For the 2D face recognition, Active Shape Model technique is used to extract
a set of facial landmarks. A series of Gabor filters are applied at the locations of
facial landmarks, and the responses are calculated. The Gabor features are stored in
the database as the face model, and are compared with those extracted from a probe
image during testing. The match scores of the ear recognition and face recognition
are fused at score level using a weighted sum to fuse the results after normalizing
them. Weights are experimentally determined. Experiments are conducted using a
gallery set of 402 video clips and a probe of 60 video clips (images). The result
achieved for rank-1 identification for the 2D face recognition is 81.67 %, for 3D ear
recognition 95 %, and for fusion 100 %. This methodology presents the limitations
discussed in Sect. 6.4.2 regarding ear processing.
A multibiometric extension of the approach in [113] described in Sect. 6.3.2.2
is presented in [107]. The same HMAX approach is adopted for both face and ear,
using different filters: face images are processed with Gabor filter and ear images
with Gaussian filter. The resulting features are classified by K-NN and SVM classifiers
in order to match results. Fusion occurs on the match scores obtained from the last
stages. The system is tested on a dataset which has 10 different images of each of 40
distinct chimeric users. The authors randomly pair faces from ORL and ears from
USTB to obtain a multimodal dataset for each of these 40 persons. The best results
with 40 classes are always obtained with SVM which achieves 88.3 % recognition
rate on faces, 81.2 % on ear, and finally 89.5 % on multimodal fusion.
The work presented in [51] especially addresses identification during video con-
ferences using face and ear. Face features include color features, computed in Hue,
Saturation and Value (HSV) color space, and 2D wavelet based features, approx-
imated by Generalized Gaussian Density (GGD). For ear, only GGD is used. The
Kullback-Leibler Distance (KLD) measure is used to match GGD features from
both traits on probe and gallery templates. Fusion exploits min-max normalization
and sum rule. The system is tested on a dataset of 45 subsets built for this aim. As
expected, performance is improved by the fusion.
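The min-max normalization and sum rule used in [51] are standard score-level operations; a minimal sketch (the score arrays and the small epsilon guard are assumptions of this illustration) could look as follows.

```python
import numpy as np

def min_max_normalize(scores):
    """Map a set of matching scores to the [0, 1] range."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def sum_rule_fusion(face_scores, ear_scores):
    """Score-level fusion: normalize each modality, then add the scores."""
    return min_max_normalize(face_scores) + min_max_normalize(ear_scores)
```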

The work presented in [43] starts from the consideration that multimodal sys-
tems may perform even worse than unimodal ones if one or more modalities fall
on degenerated data, and the adopted fusion rules do not avoid derived problems
to propagate to the multimodal result. The proposed solution is an adaptive feature
weighting scheme which is based on the Sparse Coding Error Ratio (SCER) index.
Such index is able to measure the different reliability between face and ear images due
to illumination, pose, expression (for face), occlusion, and corruption. The authors
present two multimodal methods based on Sparse Representation based Classifica-
tion (SRC), which integrate the new SCER index: Multimodal SRC (MSRC) and
Multimodal RSC (MRSC), where RSC is Robust Sparse Coding. Appearance-based
features of face and ear are extracted by separately applying PCA, and are then
directly chained. The approach is tested against pixel corruptions, as well as face
images with sunglasses or scarf, but it is not clear what results would be achieved with
real pose and illumination variations.

6.5.2.2 Ear and Profile Face

Of course, a profile face image may contain less discriminant information than a frontal
view. Nevertheless, it contains the ear, which can be used as well if the system includes
this possibility, and also some complementary features with respect to frontal ones
(e.g., the profile shape of the nose). Among the works addressing ear and profile
face fusion, one of the earliest is presented in [115]. There is no need for a fusion
step, like in [26], since the face profile silhouette is already combined with the ear
in a single image. Actually, considering also the ear region is a way to overcome
the limitation of many recognition techniques based on face profile. Such techniques
usually rely on a number of fiducial points and on their relations. However, their
reliability decreases with the peculiar appearance of some anatomical elements, such as a
concave nose, protruding lips, or a flat chin. Furthermore, the number and position
of fiducial points vary with expression changes even for the same person. Full-Space
Linear Discriminant Analysis (FSLDA) is therefore applied here to the profile image
containing the ear region too. Experiments are performed on USTB III database.
Using only 10 features, the recognition rate reaches 86.10 %. As the feature number
increases to 100, the recognition rate reaches 96.2 %.
A group of similar approaches are presented in [103–105]. They all use USTB
III or subsets, so that ear is contained in face profile images. However, in a series of
common initial steps, the ear region alone is cropped from the original image into a new
one, and both ear and profile face images undergo Wiener filtering to emphasize the
features. Then the ear images are resized in a proportional way. Finally, histogram
equalization is used to produce an image with equally distributed intensity values.
The differences among the proposals are related to the associated nonlinear
feature extraction algorithm, where the kernel matrices are computed separately from
the corresponding image data. Further differences are the fusion and classification
strategies adopted. In [103] polynomial kernel is used for feature extraction from
ear data, and sigmoid kernel for profile face data. Then kernel canonical correlation

analysis (KCCA) is used for feature fusion. The minimum-distance classifier is used
for classification, and the system achieves a recognition rate of 98.68 %. Gaussian
kernel is used in [104] to compute kernel matrices for both ear and profile face fea-
ture extraction. Afterwards, the authors fuse the two obtained matrices using prod-
uct, average and weighted sum, and then apply kernel Fisher discriminant analysis
(KFDA) to the fused matrices. In experiments KFDA is applied to single biometrics
to compare their behavior with the one achieved by fusion. Ear and profile face alone
achieve respectively 91.77 and 93.46 % recognition rate, while fusion results are
96.41 % for Average rule, 96.20 % for Product rule and 96.84 % for Weighted-sum
rule. A further system in [105] uses FSLDA to extract features for both
traits. Decision level fusion of ear and profile face is carried out using the combination
methods of Product, Sum and Median rules according to the Bayesian theory, and to
a modified Vote rule for two classifiers. The latter is adopted to suitably support the
setting with two classifiers: if each classifier assigns the given sample to a different
class, the votes would appear equal so that the system would not be able to make a
decision. With no combination, the ear achieves 94.05 % recognition rate compared
with 88.10 % of profile face. As for the combination, recognition rate is 96.43 % for
Product rule, 97.62 % for Sum rule, 97.62 % for Median rule, and 96.43 % for the
Modified Vote rule.
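For reference, the Product, Sum and Median combination rules reduce to simple operations on the per-class supports of the two classifiers. The sketch below is a generic illustration of those three rules (the score arrays are hypothetical); it does not reproduce the exact Bayesian formulation or the modified Vote rule of [105].

```python
import numpy as np

def combine_two_classifiers(scores_ear, scores_face, rule="sum"):
    """scores_*: 1-D arrays of per-class support from each classifier.
    Returns the index of the winning class under the chosen combination rule."""
    P = np.vstack([scores_ear, scores_face])
    if rule == "product":
        fused = P.prod(axis=0)
    elif rule == "sum":
        fused = P.sum(axis=0)
    elif rule == "median":
        fused = np.median(P, axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(fused))
```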
Differently from the above approaches, the work in [92] adopts a kind of pose nor-
malization strategy before further processing. PCA and KPCA are used to represent
subspaces and to extract features of ear and face images. The basic idea underlying
pose transformation is that the feature spaces corresponding to ear and face images
in some pose are transformed respectively into those of a perfectly frontal ear and a
perfectly profile face using pose transformation matrices. Then the generated feature
sets of ear and face are fused. The paper adopts three fusion modes, namely serial
strategy, parallel strategy (the two feature vectors to fuse play the role of the real
and imaginary parts of a complex vector), and kernel canonical correlation
analysis (KCCA). The nearest neighbor classifier is used, and tests are performed
on USTB III database. Experimental results for ear and face unimodal traits high-
light better performance with pose transformation and KPCA for both traits, with a
minimum achieved at 45◦ rotation of respectively 60 and 30 % recognition rate. The
advantages of pose transformation are more evident with greater rotations. Results
on fusion show that the serial strategy is the best. That said, a clear improvement
after fusion is not always evident.

6.5.3 Ear and More

A mixed solution between fully controlled settings and the relaxation of some related
constraint is the availability of an equipped “biometric tunnel”. An example is the
University of Southampton Multi-Biometric Tunnel presented in [85]. It is a con-
strained environment designed to address the requirements of airports and other
high throughput settings. It can acquire a number of contactless biometrics in a

non-intrusive manner with little participation by the user, who is nonetheless aware of
the acquisition process anyway. The system uses synchronized cameras to capture
gait and additional cameras to capture images from the face and one ear, as an indi-
vidual walks through the tunnel. The path allowed to the subject is quite narrow and
inside a volume where illumination and other conditions are controlled. As for the
ear, a digital photograph is taken when a subject passes through a light beam at the
end of the tunnel. The camera uses a wide field of view to accommodate a large range of
subject heights and walking speeds. A high shutter speed minimizes motion blur. In
addition, two flash cameras provide sufficient light for the shutter speed and reduce
the presence of possible shadows, since the flash guns are positioned to point from
above and below the ear. The system has been tested with the fusion of gait and ear
in [83]. Ear recognition follows the same approach of the already cited work in [21].
The rank-1 recognition performance for visible ears is 77 %. This is lower than the
performance in previous publications due to the less constrained ear images, which
include wider occlusion. Score fusion is used to combine gait and ear recognition
results. The distance measures returned by each algorithm are normalized using an
estimate of the offset and scale of the measures between different subjects. Rank-1
recognition increases to 86 % fusing ear and gait with sum rule.
Ear, face and gait are exploited in [114]. For each modality, the feature extraction step
creates a Gabor wavelet representation, which then undergoes principal component
analysis (PCA). During matching step, after normalization, fusion is performed at
score level, testing both weighted sum and weighted product. Chimeric users are
created from three different databases for the three modalities. For face images,
ORL face database includes 120 images corresponding to 40 subjects with three
images per person, taken at different times, under different illumination conditions
and different facial expressions. Ear images come from the USTB ear database, from
where 120 images corresponding to 40 subjects with three images per person are
extracted. Gait silhouette images come from CASIA gait database, from which 120
average silhouette images over one gait cycle were extracted from video sequences
corresponding to 40 subjects with three average silhouette images per person. The
best unimodal performance is achieved for all three modalities by setting the Gabor
parameters orientation and scale respectively at μ = 7 and ν = 4. At a False Accep-
tance Rate (FAR) set at 0.1 %, face achieves 65 % Genuine Acceptance Rate (GAR),
ear achieves 82.5 %, and gait achieves 72.5 %. With the same Gabor parameters,
multimodal fusion achieves the best performance with z-mean normalization and
weighted product reaching 97.5 % GAR at 0.1 % FAR.
In [65], face, ear and iris scores are fused using a fuzzy approach. In the identi-
fication phase, the input face and ear images are compared with the gallery images
by Fisherfaces and Fisherears, measuring the Euclidean distance. For the iris, they
calculate the Hamming distance between the iris codes. In each of the three cases,
five identifiers are obtained as output that will be ranked according to their distances.
The ranked identities of the first five subjects returned by each of the three lists
are then passed to the fuzzy fusion module along with their normalized scores. A
further contribution comes from the normalized scores from soft biometrics on the
face, namely gender, ethnicity and color of the eyes. During identification, the soft

biometric information of input identifiers is fed into the fusion module. Datasets used
are CASIA 1.0 for the iris, USTB for the ear and FERET for the face. Fuzzy rules
are used during fusion. For a FAR of 1 %, the GAR achieved by fuzzy fusion reaches
98.13 %, while face, ear and iris respectively achieve GAR of 94, 95.2 and 89 %. Not
only is the performance of the multimodal system higher, but the fusion method also
outperforms other methods tested with the same settings.

6.6 Conclusions

Unconstrained ear processing still poses a number of research problems. Differently
from the face, the ear cannot undergo expression variations. However, due to its strong
three-dimensional characterization, its correct detection and recognition can be sig-
nificantly affected by illumination, pose and occlusion. Moreover, the similar texture
and color distribution of the region around the ear, as well as the noise introduced by
hair, can further hinder detection. A still open problem is the correct ear detection
in uncontrolled settings. Some works in the literature, like [57], propose special acquisi-
tion stands with a hole that allow capturing the ear in a very precise way. However,
such approaches seem rather to aim at detaching the recognition problem from the
problem of detection, which unavoidably also affects the following processing steps.
In real world applications, e.g., video surveillance of a controlled environment, this
kind of detection is completely unfeasible.
While illumination seems easier to address, through histogram equalization or
other techniques, it may still modify the 3D appearance of the ear, thus affecting both
detection and recognition. In the same way, occlusion can hinder both a correct
localization and a reliable recognition. A possible solution for a correct detection is
for example the use of localized region based active contours model [57] and similar
approaches where techniques usually applied to the whole image are made local to
be more robust to this kind of problems. Even for recognition, the best way to address
occlusions seems to be a localized approach, such as HERO [35], or modular neural
networks [41].
Pose variations are the source of the most difficult detection and recognition
problems. Illumination and occlusion usually affect only limited regions, so that a
localized approach can effectively exploit the remaining ones. The ear is divided into
regions, with or without anatomical significance, and results from the different regions
are fused for better robustness. However, pose variations globally modify the overall
appearance of the ear. 3D techniques seem to be the most effective way to solve these
problems, since they rely on a 3D model expressing the full structure of the ear in
a way independent from rotations along any axis. However, 3D techniques are still
very expensive, and above all cannot yet be exploited without user participation.
A promising alternative for detection is Gabor Jets, which is exploited in [96], while
for recognition a solution seems to be provided by methods exploiting non linear
embeddings, like LLE and Improved LLE [101, 102], and NPE [120], or by multi-
view approaches. In particular, video sequences are exploited in the Structure From

Motion technique proposed in [22], which gives good results also on “hard” datasets.
The physical proximity of face and ear, if not the actual inclusion of both in side images,
further suggests the use of biometric fusion of the two modalities. Images can be fused
at the feature level [103], or at the score level as in most approaches.
A final note regards the acquisition environment of existing datasets. Images as
well as video sequences are (almost) always captured indoors, with a uniform back-
ground, and usually at a reasonable distance. In a similar way, (almost) all systems in
literature have been tested in these “easier” conditions, where environmental noise
is often at a minimum. Ear biometrics applied outdoors and at a distance may pose
further problems, mainly for a correct detection which also affects a correct recog-
nition.

References

1. Abate AF, Nappi M, Riccio D, Ricciardi S (2006) Ear recognition by means of a rotation invari-
ant descriptor. In: Proceedings of the 18th international conference on pattern recognition–
ICPR 2006, vol 4, pp 437–440
2. Abate AF, Nappi M, Riccio D, De Marsico M (2007) Face, ear and fingerprint: design-
ing multibiometric architectures. In: Proceedings of 14th international conference on image
analysis and processing–ICIAP 2007, pp 437–442
3. Abaza A, Hebert C, Harrison MAF (2010) Fast learning ear detection for real-time sur-
veillance. In: Proceedings of the 4th IEEE international conference on biometrics: theory
applications and systems–BTAS 2010, pp 1–6
4. Abaza A, Ross A (2010) Towards understanding the symmetry of human ears: a biometric
perspective. In: Proceedings of the 4th IEEE international conference on biometrics: theory
applications and systems–BTAS 2010, pp 1–7
5. Abaza A, Ross A, Harrison MAF, Nixon MS (2013) A survey on ear biometrics. ACM Comput
Surveys 45(2), Article 22
6. Abdel-Mottaleb M, Zhou J (2006) Human ear recognition from face profile images. In: Zhang
D, Jain AK (eds) Proceedings of the international conference on biometrics–ICB 2006, LNCS
3832, pp 786–792
7. Akkermans AHM, Kevenaar TAM, Schobben DWE (2005) Acoustic ear recognition for per-
son identification. In: Proceedings of the AutoID’05, pp 219–223
8. Alvarez L, Gonzalez E, Mazorra L (2005) Fitting ear contour using an ovoid model. In:
Proceedings of the 39th annual international carnahan conference on security technology–
CCST ’05, pp 145–148
9. Ansari S, Gupta P (2007) Localization of ear using outer helix curve of the ear. In: Proceedings
of the international conference on computing: theory and applications–ICCTA 2007, pp 688–
692
10. Arbab-Zavar B, Nixon MS, Hurley DJ (2007) On model-based analysis of ear biometrics. In:
Proceedings of the 1st IEEE international conference on biometrics: theory, applications, and
systems–BTAS 2007, pp 1–5
11. Arbab-Zavar B, Nixon MS (2008) Robust log-gabor filter for ear biometrics. In: Proceedings
of the 19th international conference on pattern recognition–ICPR 2008, pp 1–4
12. Bach FR, Jordan MI (2003) Kernel independent component analysis. J Mach Learn Res 3:1–48
13. Badrinath GS, Gupta P (2009) Feature level fused ear biometric system. In: Proceedings of
the 7th international conference on advances in pattern recognition–ICAPR ’09, pp 197–200
14. Battisti F, Carli M, De Natale FGB, Neri AA (2012) Ear recognition based on edge potential
function. In: Proceedings of the SPIE 8295, image processing: algorithms and systems X; and
parallel processing for imaging applications II, Feb 9, p 829508. doi:10.1117/12.909082

15. Bay H, Tuytelaars T, Van Gool L (2006) SURF: speeded up robust features. In: Proceedings
of the 9th european conference on computer vision–ECCV 2006, pp 404–417
16. Bhanu B, Chen H (2003) Human ear recognition in 3D. In: Proceedings of multimodal user
authentication workshop (MMUA), Santa Barbara, CA, pp 91–98
17. Brown M, Lowe DG (2002) Invariant features from interest point groups. In: Proceedings of
the 13th British machine vision conference, pp 253–262
18. Burge M, Burger W (1997) Ear biometrics for machine vision. In: Proceedings of 21 workshop
of the Austrian association for pattern recognition
19. Burge M, Burger W (1998) Ear biometrics. In: Jain AK, Bolle R, Pankanti S (eds) Biometrics:
personal identification in networked society. Kluwer Academic Publishers, Boston, pp 273–
286
20. Bustard JD, Nixon MS (2008) Robust 2D ear registration and recognition based on SIFT
point matching. In: 2nd IEEE international conference on biometrics: theory, applications
and systems–BTAS 2008, pp 1–6
21. Bustard JD, Nixon MS (2010) Toward unconstrained ear recognition from two-dimensional
images. IEEE Trans Syst Man Cyber Part A Syst Human 40(3):486–494
22. Cadavid S, Abdel-Mottaleb M (2007) Human identification based on 3D ear models. In:
Proceedings of the 1st IEEE international conference on biometrics: theory, applications, and
systems–BTAS 2007, pp 1–6
23. Cadavid S, Abdel-Mottaleb M (2008) 3-D ear modeling and recognition from video sequences
using shape from shading. IEEE Trans Info Forensics Security 3(4):709–718
24. Cadavid S, Mahoor MH, Abdel-Mottaleb M (2009) Multi-modal biometric modeling and
recognition of the human face and ear. In: Proceedings of the IEEE international workshop
on safety, security and rescue robotics–SSRR 2009, pp 1–6
25. Cai J, Goshtasby A (1999) Detecting human faces in color images. Image Vision Comput
18(1):63–75
26. Chang K, Victor B, Bowyer KW, Sarkar S (2003) Comparison and combination of ear and face
images in appearance-based biometrics. IEEE Trans Pattern Anal Mach Intell 25(8):1160–
1165
27. Chen H, Bhanu B (2004) Human ear detection from side face range images. In: Proceedings
of the international conference on pattern recognition (ICPR 2004), vol 3, pp 574–577
28. Chen H, Bhanu B (2005) Shape model-based 3D ear detection from side face range images.
In: Proceedings of the IEEE computer society conference on computer vision and pattern
recognition–workshops (CVPR 2005), p 122
29. Chen H, Bhanu B (2007) Human ear recognition in 3D. IEEE Trans Pattern Anal Mach Intell
29(4):718–737
30. Chen H, Bhanu B (2009) Efficient recognition of highly similar 3D objects in range images.
IEEE Trans Pattern Anal Mach Intell 31(1):172–179
31. Choraś M, Choraś RS (2006) Geometrical algorithms of ear contour shape representation and
feature extraction. In: Proceedings of the intelligent systems design and application–ISDA
2006, IEEE CS Press, vol II, pp 451–456, Jinan, China
32. Cummings AH, Nixon MS, Carter JN (2010) A novel ray analogy for enrollment of ear
biometrics. In: Proceedings of the 4th IEEE international conference on biometrics: theory
applications and systems–BTAS 2010, pp 1–6
33. Curless B (1999) From range scans to 3D models. ACM SIGGRAPH, Comput Graph 33(4):
38–41
34. Debevec P (1999) Image-based modeling, rendering and lighting. Comput Graph 33(4):46–50
35. De Marsico M, Michele N, Riccio D (2010) HERO: human ear recognition against occlusions.
In: Proceedings of the IEEE computer society conference on computer vision and pattern
recognition workshops–CVPRW 2010, pp 178–183
36. Dong J, Mu Z (2008) Multi-pose ear recognition based on force field transformation.
In: Proceedings of the 2nd international symposium on intelligent information technology
application–IITA 2008, vol 3, pp 771–775

37. Faez K, Motamed S, Yaqubi M (2008) Personal verification using ear and palm-print biomet-
rics. In: Proceedings IEEE international conference on systems, man and cybernetics–SMC
2008, pp 3727–3731
38. Fleck M, Forsyth D, Bregler C (1996) Finding naked people. In: Proceedings of the European
conference on computer vision–ECCV 1996, vol 2, pp 592–602
39. Gnanasivam P, Muttan S (2011) Ear and fingerprint biometrics for personal identification. In:
Proceedings of the international conference on signal processing, communication, computing
and networking technologies–ICSCCN 2011, pp 347–352
40. Grabham NJ, Swabey MA, Chambers P, Lutman ME, White NM, Chad JE, Beeby SP (2013)
An evaluation of otoacoustic emissions as a biometric. IEEE Trans Info Forensics Security
8(81):174–183
41. Gutierrez L, Patricia M, Lopez M (2010) Modular neural network integrator for human recog-
nition from ear images. In: Proceedings of the 2010 international joint conference on neural
networks–IJCNN 2010, pp 1–5
42. Huang C, Lu G, Liu Y (2009) Coordinate direction normalization using point cloud projection
density for 3D ear. In: Proceedings of the 4th international conference on computer sciences
and convergence information technology–ICCIT 2009, pp 511–515
43. Huang Z, Liu Y, Li C, Yang M, Chen L (2013) A robust face and ear based multimodal
biometric system using sparse representation. Pattern Recogn 46(8):2156–2168
44. Hurley DJ, Nixon MS, Carter JN (1999) Force field energy functionals for image feature
extraction. In: Proceedings of the British machine vision conference 1999–BMVC99 BMVA,
pp 604–613
45. Hurley DJ, Nixon MS, Carter JN (2000) Automatic ear recognition by force field transfor-
mations. IEE colloquium on visual biometrics (Ref.No. 2000/018), pp 7/1–7/5
46. Hurley DJ, Nixon MS, Carter JN (2002) Force field energy functionals for image feature
extraction. Image Vision Comput 20(5–6):311–317
47. Hurley DJ, Nixon MS, Carter JN (2005) Force field feature extraction for ear biometrics.
Comput Vision Image Underst 98:491–512
48. Iannarelli A (1989) Ear identification. In forensic identification series. Paramont Publishing
Company, Fremont
49. Islam SMS, Bennamoun M, Davies R (2008) Fast and fully automatic ear detection using
cascaded adaBoost. In: Proceedings of the IEEE workshop on applications of computer vision–
WACV 2008, pp 1–6
50. Jain A, Bolle R, Pankanti S (eds) (1998) Biometrics: personal identification in networked
society. Kluwer Academic Publishers, Boston
51. Javadtalab A, Abbadi L, Omidyeganeh M, Shirmohammadi S, Adams CM, El-Saddik A
(2011) Transparent non-intrusive multimodal biometric system for video conference using the
fusion of face and ear recognition. In: Proceedings of the 9th annual international conference
on privacy security and trust–PST 2011, pp 87–92
52. Jeges E, Mate L (2006) Model-based human ear identification. In: Proceedings of the world
automation congress–WAC, pp 1–6
53. Kisku DR, Mehrotra H, Gupta P, Sing JK (2009) SIFT-based ear recognition by fusion of
detected keypoints from color similarity slice regions. In: Proceedings of the international
conference on advances in computational tools for engineering applications–ACTEA 2009,
pp 380–385
54. Kisku DR, Sing JK, Gupta P (2009) Multibiometrics belief fusion. In: Proceedings of the 2nd
international conference on machine vision–ICMV 2009, pp 37–40
55. Kisku DR, Gupta S, Gupta P, Sing JK (2010) An efficient ear identification system. In:
Proceedings of the 5th international conference on future information technology–futuretech
2010, pp 1–6
56. Klare BF, Burge MJ, Klontz JC, Vorder Bruegge RW, Jain AK (2012) Face recognition perfor-
mance: role of demographic information. IEEE Trans Info Forensics Security 7(6):1789–1801
57. Kumar A, Hanmandlu M, Kuldeep M, Gupta HM (2011) Automatic ear detection for online
biometric applications. In: Proceedings of the 3rd national conference on computer vision,
pattern recognition, image processing and graphics–NCVPRIPG 2011, pp 146–149

58. Kumar A, Wu C (2012) Automated human identification using ear imaging. Pattern Recogn
45:956–968
59. Lades M, Vorbruggen JC, Buhmann J, Lange J, Von Der Malsburg C, Wurtz RP, Konen W
(1993) Distortion invariant object recognition in the dynamic link architecture. IEEE Trans
Comput 42(3):300–311
60. Liu W, Wang Y, Li SZ, Tan T (2004) Null space-based kernel fisher discriminate analysis for
face recognition. In: Proceedings of the 6th IEEE international conference on automatic face
and gesture recognition–FG 2004, pp 369–374
61. Liu H (2011) Multi-view ear recognition by patrial least square discrimination. In: Proceedings
of the 3rd international conference on computer research and development–ICCRD 2011, vol
4, pp 200–204
62. Liu H, Zhang D (2011) Fast 3D point cloud ear identification by slice curve matching. In:
Proceedings of the 3rd international conference on computer research and development–
ICCRD 2011, vol 4, pp 224–228
63. Lu L, Zhang X, Zhao Y, Jia Y (2006) Ear recognition based on statistical shape model. In:
Proceedings of the IEEE international conference on innovative computing, information and
control, vol 3, pp 353–356
64. Messer K, Matas J, Kittler J, Luettin J, Maitre G (1999) XM2VTSDB: the extended M2VTS
database. In: Proceedings of the international conference on audio- and video-based person
authentication–AVBPA ’99, pp 72–77
65. Monwar MM, Gavrilova M, Wang Y (2011) A novel fuzzy multimodal information fusion
technology for human biometric traits identification. In: Proceedings of the 10th IEEE inter-
national conference on cognitive informatics and cognitive computing–ICCI*CC 2011, pp
112–119
66. Morano RA, Ozturk C, Conn R, Dubin S, Zietz S, Nissanov J (1998) Structured light using
pseudorandom codes. IEEE Trans Pattern Anal Mach Intell 20(3), pp 322–327
67. Monwar M, Gavrilova M (2008) FES: a system for combining face, ear and signature biomet-
rics using rank level fusion. In: Proceedings of the 5th international conference on information
technology: new generations–ITNG 2008, pp 922–927
68. Moreno B, Sanchez A, Velez JF (1999) On the use of outer ear images for personal identifica-
tion in security applications. In: Proceedings of the IEEE 33rd annual international carnahan
conference on security technology, pp 469–476
69. Nanni L, Lumini A (2007) A multi-matcher for ear authentication. Pattern Recogn Lett
28(16):2219–2226
70. Nanni L, Lumini A (2009) Fusion of color spaces for ear authentication. Pattern Recogn
42(9):1906–1913
71. Nanni L, Lumini A (2009) A supervised method to discriminate between impostors and
genuine in biometry. Expert Syst Appl 36(7):10401–10407
72. Nosrati MS, Faez K, Faradji F (2007) Using 2D wavelet and principal component analysis
for personal identification based On 2D ear structure. In: Proceedings of the international
conference on intelligent and advanced systems–ICIAS 2007, pp 616–620
73. Passalis G, Kakadiaris IA, Theoharis T, Toderici G, Papaioannou T (2007) Towards fast 3D
ear recognition for real-life biometric applications. In: Proceedings of the IEEE conference
on advanced video and signal based surveillance–AVSS 2007, pp 39–44
74. Pflug A, Busch C (2012) Ear biometrics: a survey of detection, feature extraction and recog-
nition methods. IET Biometrics 1(2):114–129
75. Prakash S, Jayaraman U, Gupta P (2009) A skin-color and template based technique for
automatic ear detection. In: Proceedings of the 7th international conference on advances in
pattern recognition–ICAPR 2009, pp 213–216
76. Prakash S, Jayaraman U, Gupta P (2009) Connected component based technique for automatic
ear detection. In: Proceedings of the 16th IEEE international conference on image processing–
ICIP 2009, pp 2741–2744
77. Prakash S, Gupta P (2011) An efficient ear recognition technique invariant to illumination
and pose. Telecommun Syst 52(3):1–14. http://dx.doi.org/10.1007/s11235-011-9621-2

78. Prakash S, Gupta P (2012) A rotation and scale invariant technique for ear detection in 3D.
Pattern Recogn Lett 33, pp 1924–1931
79. Raposo R, Hoyle E, Peixinho A, Proenca H (2011) UBEAR: A dataset of ear images captured
on-the-move in uncontrolled conditions. In: Proceedings of the IEEE workshop on computa-
tional intelligence in biometrics and identity management–CIBIM 2011, pp 84–90
80. Reisfeld D, Wolfson H, Yeshurun Y (1995) Context-free attentional operators: the generalized
symmetry transform. Int J Comput Vision 14(2):119–130
81. Riccio D, Tortora G, De Marsico M, Wechsler H (2012) EGA-ethnicity, gender and age,
a pre-annotated face database. In: Proceedings of the 2012 IEEE workshop on biometric
measurements and systems for security and medical applications–BioMS 2012, pp 38–45
82. Said EH, Abaza A, Ammar H (2008) Ear segmentation in color facial images using mathe-
matical morphology. In: Proceedings of the biometrics symposium–BSYM 2008, pp 29–34
83. Samangooei S, Bustard JD, Seeley RD, Nixon MS, Carter JN (2011) Acquisition and analysis
of a dataset comprising gait, ear and semantic data. In: Bhanu B, Govindaraju (eds) Multi-
biometrics for human identification, pp 277–301. Cambridge University Press, Cambridge
84. Sana A, Gupta P, Purkai R (2007) Ear biometrics: a new approach. In: P Pal (ed) Advances
in pattern recognition. World Scientific Publishing, New York, pp 46–50
85. Seely RD, Samangooei S, Lee M, Carter JN, Nixon MS (2008) The University of Southampton
multi-biometric tunnel and introducing a novel 3D gait dataset. In: Proceedings of the 2nd
IEEE international conference on biometrics: theory, applications and systems–BTAS 2008,
pp 1–6
86. Shailaja D, Gupta P (2006) A simple geometric approach for ear recognition. In: Proceedings
of the 9th international conference on information technology–ICIT 2006, pp 164–167
87. Shih H-C, Ho CC, Chang H-T, Wu C-S (2009) Ear detection based on arc-masking extraction
and AdaBoost polling verification. In: Proceedings of the 5th international conference on
intelligent information hiding and multimedia signal processing–IIH-MSP 2009, pp 669–672
88. Sun C, Mu Z-C, Zeng H (2009) Automatic 3D ear reconstruction based on epipolar geom-
etry. In: Proceedings of the 5th international conference on image and graphics–ICIG 2009,
pp 496–500
89. Takala V, Ahonen T, Pietikäinen M (2005) Block-based methods for image retrieval using
local binary patterns. In: Proceedings of the 14th scandinavian conference–SCIA 2005, LNCS
3540, pp 882–891
90. Victor B, Bowyer K, Sarkar S (2002) An evaluation of face and ear biometrics. In: Proceedings
of 16th international conference on pattern recognition, vol 1, pp 429–432
91. Viola P, Jones MM (2004) Robust real-time face detection. Int J Comput Vision 57(2):137–154
92. Wang Y, Mu Z-C, Liu K, Feng J (2007) Multimodal recognition based on pose transformation
of ear and face images. In: Proceedings of the international conference on wavelet analysis
and pattern recognition–ICWAPR 2007, vol 3, pp 1350–1355
93. Wang Y, Mu Z-C, Zeng H (2008) Block-based and multi-resolution methods for ear recogni-
tion using wavelet transform and uniform local binary patterns. In: Proceedings of the 19th
international conference on pattern recognition–ICPR 2008, pp 1–4
94. Wang X, Yuan W (2010) Gabor wavelets and general discriminant analysis for ear recognition.
In: Proceedings of the 8th world congress on intelligent control and automation–WCICA 2010,
pp 6305–6308
95. Wang Z-Q, Yan X-D (2011) Multi-scale feature extraction algorithm of ear image. In: Proceed-
ings of the international conference on electric information and control engineering–ICEICE
2011, pp 528–531
96. Watabe D, Sai H, Sakai K, Nakamura O (2008) Ear biometrics using jet space similarity.
In: Proceedings of the Canadian conference on electrical and computer engineering–CCECE
2008, pp 1259–1264
97. Watabe D, Sai H, Sakai K, Nakamura O (2011) Improving the robustness of single-view
ear-based recognition under a rotated in depth perspective. In: Proceedings of the 2011 inter-
national conference on biometrics and kansei engineering–ICBAKE, pp 179–184

98. Woodard DL, Faltemier TC, Ping Y, Flynn PJ, Bowyer KW (2006) A comparison of 3D bio-
metric modalities. In: Proceedings of the conference on computer vision and pattern recog-
nition workshop–CVPRW 2006, p 57
99. Wu J, Brubaker SC, Mullin MD, Rehg JM (2008) Fast asymmetric learning for cascade face
detection. IEEE Trans Pattern Anal Mach Intell 30(3):369–382
100. Wu H-L, Wang Q, Shen H-J, Hu L-Y (2009) Ear identification based on KICA and SVM.
In: Proceedings of the WRI global congress on intelligent systems–GCIS 2009, vol 4, pp
414–417
101. Xie Z-X, Mu Z-C (2007) Improved locally linear embedding and its application on multi-
pose ear recognition. In: Proceedings of the international conference on wavelet analysis and
pattern recognition–ICWAPR 2007, vol 3, pp 1367–1371
102. Xie Z-X, Mu Z-C (2008) Ear recognition using LLE and IDLLE algorithm. In: Proceedings
of the 19th international conference on patten recognition–ICPR 2008, pp 1–4
103. Xu X-N, Mu Z-C (2007) Feature fusion method based on KCCA for ear and profile
face based multimodal recognition. In: Proceedings of the IEEE international conference on
automation and logistics, pp 620–623
104. Xu X-N, Mu Z-C, Yuan L (2007) Feature-level fusion method based on KFDA for multimodal
recognition fusing ear and profile face. In: Proceedings of the international conference on
wavelet analysis and pattern recognition–ICWAPR 2007, vol 3, pp 1306–1310
105. Xu X-N, Mu Z-C (2007) Multimodal recognition based on fusion of ear and profile face.
In: Proceedings of the 4th international conference on image and graphics–ICIG 2007, pp
598–603
106. Xu H, Mu Z-C (2008) Multi-pose ear recognition based on improved locally linear embedding.
In: Proceedings of the congress on image and signal processing–CISP 2008, vol 2, pp 39–43
107. Yaghoubi Z, Faez K, Eliasi M, Eliasi A (2010) Multimodal biometric recognition inspired
by visual cortex and support vector machine classifier. In: Proceedings of the international
conference on multimedia computing and information technology–MCIT 2010, pp 93–96
108. Yan P, Bowyer KW (2005) ICP-based approaches for 3d ear recognition. In: Proceedings of
SPIE 5779:282–291
109. Yan P, Bowyer KW (2005) Empirical evaluation of advanced ear biometrics. In: Proceed-
ings of the IEEE computer vision and pattern recognition workshops–CVPRW 2005, vol 3,
pp 41–48
110. Yan P, Bowyer KW (2005) Multi-biometrics 2D and 3D ear recognition. In: Kanade T, Jain
A, Ratha NK (eds) Proceedings of the international conference on audio- and video-based
person authentication–AVBPA 2005, LNCS 3546, pp 503–512
111. Yan P, Bowyer KW (2007) Biometric recognition using 3D ear shape. IEEE Trans Pattern
Anal Mach Intell 29(8):1297–1308
112. Yan P, Bowyer KW (2007) A fast algorithm for ICP-based 3D shape biometrics. Comput
Vision Image Underst 107(3):195–202
113. Yaqubi M, Faez K, Motamed S (2008) Ear recognition using features inspired by visual
cortex and support vector machine technique. In: Proceedings of the international conference
on computer and communication engineering–ICCCE 2008, pp 533–537
114. Yazdanpanah AP, Faez K, Amirfattahi R (2010) Multimodal biometric system using face,
ear and gait biometrics. In: Proceedings of the 10th international conference on information
science, signal processing and their applications–ISSPA 2010, pp 251–254
115. Yuan L, Mu Z-C, Liu Y (2006) Multimodal recognition using face profile and
ear. In: Proceedings of the 1st international symposium on systems and control in aerospace
and astronautics–ISSCAA 2006, pp 887–891
116. Yuan L, Mu Z-C (2007) Ear detection based on skin-color and contour information. In:
Proceedings of the international conference on machine learning and cybernetics, vol 4,
pp 2213–2217
117. Yuan L, Mu ZC (2007) Ear recognition based on 2D images. In: Proceedings of the 1st IEEE
international conference on biometrics: theory, applications and systems–BTAS 2007, pp 1–5

118. Yuan L, Mu Z-C, Xu X-N (2007) Multimodal recognition based on face and ear. In: Proceed-
ings of the international conference on wavelet analysis and pattern recognition–ICWAPR
2007, vol 3, pp 1203–1207
119. Yuan L, Zhang F (2009) Ear detection based on improved AdaBoost algorithm. In: Proceedings
of the international conference on machine learning and cybernetics, vol 4, pp 2414–2417
120. Yuan L, Mu Z-C (2012) Ear recognition based on local information fusion. Pattern Recogn
Lett 33:182–190
121. Zeng H, Mu Z-C, Wang K, Sun C (2009) Automatic 3D ear reconstruction based on binocular
stereo vision. In: Proceedings of the IEEE international conference on systems, man and
cybernetics–SMC 2009, pp 5205–5208
122. Zeng H, Mu Z-C, Yuan L, Wang S (2009) Ear recognition based on the SIFT descriptor with
global context and the projective invariants. In: Proceedings of the 5th international conference
on image and graphics–ICIG 2009, pp 973–977
123. Zhang D, Lu G (2002) Shape-based image retrieval using generic Fourier descriptor. Sig
Process Image Commun 17(10):825–848
124. Zhang H-J, Mu Z-C, Qu W, Liu L-M, Zhang C-Y (2005) A novel approach for ear recognition
based on ICA and RBF network. In: Proceedings of the 2005 international conference on
machine learning and cybernetics, vol 7, pp 4511–4515
125. Zhang H, Mu Z (2008) Compound structure classifier system for ear recognition. In: Pro-
ceedings of the IEEE international conference on automation and logistics - ICAL 2008,
pp 2306–2309
126. Zhang Z, Liu H (2008) Multi-view ear recognition based on B-Spline pose manifold con-
struction. In: Proceedings of the 7th world congress on intelligent control and automation -
WCICA 2008, pp 2416–2421
127. Zhou J, Cadavid S, Abdel-Mottaleb M (2010) Histograms of categorized shapes for 3D ear
detection. In: Proceedings of the 4th IEEE international conference on biometrics: theory
applications and systems - BTAS 2010, pp 1–6
Chapter 7
Feature Quality-Based Unconstrained Eye Recognition

Zhi Zhou, Eliza Yingzi Du and N. Luke Thomas

The blood vessel structure of the sclera is unique to each person, and it can be
obtained non-intrusively in the visible wavelengths remotely. Sclera recognition has
been shown to achieve reasonable recognition accuracy under visible wavelengths.
Iris recognition has been shown to be one of the most accurate biometrics with near infrared
images. However, it does not work well under visible wavelength illumination. Com-
bining iris and sclera recognition together can achieve better recognition accuracy.
However, image quality can significantly affect the recognition accuracy. In addition,
in unconstrained situations, the acquired eye images may not be frontally facing. In
this chapter, we introduce a feature quality-based unconstrained eye recognition
system that combines the respective strengths of iris recognition and sclera recog-
nition for human identification and can work with frontal and off-angle eye images.
The results show that the proposed method is very promising.

7.1 Introduction

Sclera recognition offers several benefits. It can perform human identification in
the visible wavelengths, which allows for less constrained imaging requirements.
However, if the sclera pattern is saturated, the recognition accuracy could drop dra-
matically (Fig. 7.1).

Z. Zhou
Allen Institute for Brain Science, 551 N. 34th Street, Ste. 200, Seattle, WA 98103, USA
E. Y. Du (B)
Qualcomm, 3105 Kifer Road, SCL.C-140A, Santa Clara, CA 95051, USA
e-mail: elizadu@gmail.com
Z. Zhou · E. Y. Du · N. L. Thomas
Department of Electrical and Computer Engineering, Indiana University-Purdue University
Indianapolis, 723 W. Michigan St. SL160, Indianapolis, Indiana 46202, USA


Fig. 7.1 An eye image with saturated sclera vein patterns [1]

Among biometric systems, iris recognition is proven to be one of the most reliable
approaches for automatic personal recognition. The iris is the circular, colored portion
of the eye [2]. Its opening forms the pupil. The iris helps regulate the amount of light
that enters the eye. It is formed before birth and, except in the event of an injury to the
eyeball, its patterns are stable throughout an individual’s lifetime [3]. Every iris has a
distinct pattern that differs not only from person to person, but also from the left
eye to the right eye. For identifying twins, the iris can be more reliable than other biometrics,
such as face, since even twins have similar faces but different irises [4]. This texture
distinction of the iris makes it suitable for an automated biometric system.
Iris recognition systems are being used in access control applications. Several
airports are using biometric systems including iris recognition steps for granting
access to frequent fliers and decreasing the waiting time at security checkpoints. Iris
recognition systems are also being used at high security governmental or commercial
buildings.
Currently, iris recognition systems work well with frontal iris images from cooper-
ative users. However, iris recognition has its limitations. It requires NIR illumination
when acquiring the NIR iris image. In the visible wavelength, iris patterns from darkly
colored irises are hard to identify under visible wavelength illumination [1, 5, 6]. This
is due to the high absorption of visible light by darkly colored irises, so that little of
the identifying structure of the iris can be imaged and extracted.
Moreover, many factors affect the quality of an eye image, including: defocus,
motion blur, occlusion, glare, resolution, image contrast, and deformation. Figure 7.2
shows some examples of poor quality eye images. How to perform eye image quality
measurement including iris and sclera is a challenging task.
In this chapter, we propose a feature quality-based unconstrained eye recognition
system (Fig. 7.3). Different from periocular recognition, which extracts features from the
facial region in the immediate vicinity of the eye [7, 8], our eye recognition uses
the information from iris and sclera areas. First, the iris area and sclera areas will be
segmented by our segmentation approach. Iris and sclera features will be extracted

Fig. 7.2 Examples of poor quality eye images [1]. a Occlusion, b Defocus, c Image Contrast,
d Blink

Fig. 7.3 Feature quality-based unconstrained eye recognition system

in the segmented areas, and we will apply our image quality measure to evaluate
the quality of the eye image. Our image quality scores and matching scores can be
combined to predict the performance of our unconstrained eye recognition system.

Fig. 7.4 Block diagram of the proposed eye image segmentation algorithm

7.2 Eye Image Segmentation

Segmentation is the first step in the proposed system. Figure 7.4 shows the block
diagram of the proposed eye image segmentation, including specular reflection estima-
tion, iris and pupil boundaries detection for frontal images and only iris detection for
off-angle images, estimation of the sclera region, eyelid and iris boundary detection
and refinement.
The specular reflection of the eye is usually inside the pupil or near the pupil
area with a high intensity value. Based on the image’s resolution, a 3-by-3 Sobel
filter is applied to highlight the wanted specular reflection area, since the specular
reflection can be modeled as a bright object on a much darker background (pupil or
iris areas) with sharp edges.
To improve the segmentation speed, the pupil and iris regions are simply modeled
as circular boundaries. Specifically, the pupil and iris regions are segmented using a
greedy angular search to find the pupil and iris boundaries, and using a least squares
circle fitting algorithm to fit a circle to the detected boundaries.
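A least-squares circle fit of the kind mentioned above can be written in closed form by linearizing the circle equation; the sketch below is one standard algebraic variant and is only an illustration, not necessarily the exact procedure used in the proposed system.

```python
import numpy as np

def fit_circle_lsq(xs, ys):
    """Fit a circle to boundary points (xs, ys) by solving the linear system
    obtained from x^2 + y^2 = 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    A = np.column_stack([2 * xs, 2 * ys, np.ones_like(xs)])
    b = xs**2 + ys**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    r = np.sqrt(c + cx**2 + cy**2)
    return cx, cy, r   # circle centre and radius
```

The same routine can be applied to the boundary points returned by the greedy angular search for both the pupil and the iris.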
After the iris and pupil boundaries detection, we applied our sclera detection
approach. For grayscale images, Otsu’s method is used to detect the sclera region.
The left and right Region of Interests (ROIs) are selected based on the iris center and
iris boundaries. Otsu’s method is applied to the ROIs to obtain potential sclera areas.
The correct left sclera area should be located in the right and lower center side, and
the correct right sclera area should be located in the left and lower center. In this way,
we eliminate non-sclera areas.
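A rough sketch of this ROI-based Otsu thresholding is shown below. The way the left and right ROIs are cut from the image, and the scikit-image dependency, are assumptions made only for illustration.

```python
import numpy as np
from skimage.filters import threshold_otsu

def sclera_candidate_masks(gray_eye, iris_cx, iris_r):
    """Threshold the regions to the left and right of the iris with Otsu's
    method; bright pixels are kept as candidate sclera areas."""
    h, w = gray_eye.shape
    left = gray_eye[:, :max(int(iris_cx - iris_r), 1)]
    right = gray_eye[:, min(int(iris_cx + iris_r), w - 1):]
    left_mask = left > threshold_otsu(left)
    right_mask = right > threshold_otsu(right)
    return left_mask, right_mask
```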
The top and bottom boundaries of the sclera region are used as initial estimates
of the sclera boundaries, and a polynomial is fitted to each boundary. Using the top
and bottom portions of the estimated sclera region as guidelines, the upper eyelid
boundary, lower eyelid boundary, and iris boundary are then refined using the Fourier
active contour method [9].

Fig. 7.5 A sketch of parameters of a segment descriptor

7.3 Feature Extraction and Matching

7.3.1 Sclera Feature Extraction and Matching

The segmented sclera area is highly reflective. As a result, the sclera vascular pat-
terns are often blurry and/or have very low contrast. To mitigate the illumination effect
and achieve an illumination-invariant process, it is important to enhance the vascular
patterns.
In [6], a bank of multiple directional Gabor filters is used for vascular pattern
enhancement:

$$G(x, y, \rho, s) = e^{-\pi\left[\frac{(x-x_0)^2 + (y-y_0)^2}{s^2}\right]}\; e^{-2\pi i\left[\cos\rho\,(x-x_0) + \sin\rho\,(y-y_0)\right]}, \qquad (7.1)$$

where (x0 , y0 ) is the center frequency of the filter, s is the variance of the Gaussian,
and ρ is the angle of the sinusoidal modulation. After enhancement, an adaptive
threshold is applied to binarize the Gabor filtered image. Then, binary morphological
operations are used to thin the detected vessel structure down to a single-pixel wide
skeleton, and to remove the branch points.
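A compact sketch of this enhancement and thinning stage is given below. It follows my reading of Eq. (7.1), with an explicit wavelength added so the kernel is non-degenerate on a pixel grid; the orientation bank, threshold and kernel size are assumptions of the illustration, not the parameters used in [6].

```python
import numpy as np
from scipy.signal import convolve2d
from skimage.morphology import skeletonize

def gabor_kernel(size, s, rho, wavelength=8.0):
    """Complex Gabor kernel in the spirit of Eq. (7.1): a Gaussian envelope of
    scale s modulated along orientation rho. The explicit wavelength is an
    added assumption so the carrier oscillates on a pixel grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-np.pi * (x**2 + y**2) / s**2)
    carrier = np.exp(-2j * np.pi * (np.cos(rho) * x + np.sin(rho) * y) / wavelength)
    return envelope * carrier

def vessel_skeleton(sclera, orientations=(0, np.pi/4, np.pi/2, 3*np.pi/4), s=4.0):
    """Keep the maximum Gabor magnitude over a small orientation bank, binarize
    with a simple global threshold, and thin to a one-pixel-wide skeleton."""
    responses = [np.abs(convolve2d(sclera, gabor_kernel(15, s, r), mode="same"))
                 for r in orientations]
    enhanced = np.max(responses, axis=0)
    binary = enhanced > enhanced.mean() + enhanced.std()
    return skeletonize(binary)
```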
Due to the physiological status of a person (for example, fatigue or non-fatigue),
the vascular patterns could have different thicknesses at different times, due to the
dilation and constriction of the vessels. Therefore, vessel thickness is not a stable
pattern for recognition. In addition, some very thin vascular patterns may not be
visible at all times. In this research, binary morphological operations are used to thin
the detected vessel structure down to a single-pixel wide skeleton, and to remove the
branch points. This leaves a set of single-pixel wide lines that represents the vessel
structure. These lines are then recursively parsed into smaller segments. The process
is repeated until the line segments are nearly linear, within the line's maximum size.
A least-squares line is then fit to each segment.

These line segments are then used to create a template for the vessel structure. The
segments are described by three quantities: the segment's angle to some reference
angle at the iris center, the segment's distance to the iris center, and the dominant
angular orientation of the line segment. The template for the sclera vessel structure
is the set of all individual segments’ descriptors. Note that this implies that, while
each segment's descriptor is of a fixed length, the overall template size for a sclera
vessel structure varies with the number of individual segments. Figure 7.5 shows a
visual description of the line descriptor.
A descriptor is S = (Θ, r, φ)^T. The individual components of the line descriptor
are calculated as:

$$\Theta = \tan^{-1}\!\left(\frac{y_l - y_i}{x_l - x_i}\right), \qquad r = \sqrt{(y_l - y_i)^2 + (x_l - x_i)^2}, \qquad \text{and} \quad \phi = \tan^{-1}\!\left(\frac{d}{dx} f_{line}(x)\right). \qquad (7.2)$$

Here f_line is the polynomial approximation of the line segment, (x_l, y_l) is the
center point of the line segment, (x_i, y_i) is the center of the detected iris, and S is the
line descriptor. Additionally, the iris center, (x_i, y_i), is stored with all of the individual
line descriptors. Using the iris center as a reference, the line descriptor is translation
invariant. The line descriptor can extract patterns in different orientations, which makes
it possible to achieve orientation-invariant matching.
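Computing one descriptor from Eq. (7.2) is straightforward; the sketch below assumes that the fitted line of a segment is available as a slope at its centre point (the argument names are illustrative).

```python
import numpy as np

def line_descriptor(xl, yl, slope, xi, yi):
    """Segment descriptor S = (Theta, r, phi)^T of Eq. (7.2), expressed relative
    to the detected iris centre (xi, yi)."""
    theta = np.arctan2(yl - yi, xl - xi)   # angle of the segment centre w.r.t. the iris centre
    r = np.hypot(yl - yi, xl - xi)         # distance of the segment centre to the iris centre
    phi = np.arctan(slope)                 # dominant orientation of the fitted line
    return np.array([theta, r, phi])
```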
A RANSAC-type based algorithm is developed to estimate the best-fit parameters
for registration between the two sclera vascular patterns [10]. For the registration
algorithm, it randomly chooses two points—one from the test template, and one
from the target template. It also randomly chooses a scaling factor and a rotation
value, based on a priori knowledge of the database. Using these values, it calculates
a fitness value for the registration using these parameters.
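A sketch of such a RANSAC-style search is given below. Templates are reduced to arrays of segment centre coordinates, and the scale and rotation ranges stand in for the a priori knowledge of the database; all of these are assumptions of the illustration, not the authors' parameters.

```python
import numpy as np

def ransac_register(test_pts, target_pts, n_iter=2000, tol=3.0, seed=0):
    """Randomly hypothesize a (scale, rotation, translation) from one test point
    and one target point, and keep the hypothesis with the best fitness."""
    rng = np.random.default_rng(seed)
    best_fit, best_params = -1, None
    for _ in range(n_iter):
        p = test_pts[rng.integers(len(test_pts))]
        q = target_pts[rng.integers(len(target_pts))]
        scale = rng.uniform(0.9, 1.1)            # plausible scale range (assumed)
        angle = rng.uniform(-0.2, 0.2)           # plausible rotation range (assumed)
        c, s = np.cos(angle), np.sin(angle)
        R = np.array([[c, -s], [s, c]])
        shift = q - scale * R @ p                # translation that maps p onto q
        moved = scale * test_pts @ R.T + shift
        # fitness: how many transformed test points land near some target point
        d = np.linalg.norm(moved[:, None, :] - target_pts[None, :, :], axis=2)
        fit = int(np.sum(d.min(axis=1) < tol))
        if fit > best_fit:
            best_fit, best_params = fit, (scale, angle, shift)
    return best_params, best_fit
```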
In general, the edge areas of the sclera may not be segmented accurately; therefore
the weighting image is created from the sclera mask by setting interior pixels in the
sclera mask to 1, pixels within some distance of the boundary of the mask to 0.5,
and pixels outside the mask to 0. The total matching score, M, is the sum of the
individual matching scores divided by the maximum matching score for the minimal
set between the test and target template.
 
M = Σ_(i,j)∈Matches m(si, sj) / min( Σ_i∈Test w(si), Σ_j∈Target w(sj) ),   (7.3)

where Matches is the set of all matching pairs, Test is the set of descriptors in the test template, and Target is the set of descriptors in the target template. m(Si, Sj) can be calculated by:
m(Si, Sj) = w(si) w(sj),  if d(si, sj) ≤ Dmatch and |∅i − ∅j| ≤ ∅match
          = 0,            otherwise,                                        (7.4)
 
where Si and Sj are two segment descriptors, m(Si, Sj) is the matching score between segments Si and Sj, d(Si, Sj) is the Euclidean distance between the segment descriptors' centers, Dmatch is the matching distance threshold, and ∅match is the matching angle threshold.
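A minimal sketch of the scoring in Eqs. (7.3)-(7.4) is given below. It assumes each descriptor carries its segment center coordinates and orientation after registration, together with a weight taken from the weighting image; the function signature and data layout are our own and only illustrate the computation.

import numpy as np

def match_score(test_centers, test_phi, w_test,
                target_centers, target_phi, w_target,
                d_match, phi_match):
    """Matching score M of Eqs. (7.3)-(7.4). test_centers/target_centers are
    (n, 2) arrays of segment center points after registration, *_phi are the
    segment orientations, and w_* are weights from the weighting image."""
    total = 0.0
    for i in range(len(test_centers)):
        for j in range(len(target_centers)):
            d = np.linalg.norm(test_centers[i] - target_centers[j])  # center distance
            if d <= d_match and abs(test_phi[i] - target_phi[j]) <= phi_match:
                total += w_test[i] * w_target[j]                     # m(s_i, s_j), Eq. (7.4)
    return total / min(np.sum(w_test), np.sum(w_target))             # Eq. (7.3)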

7.3.2 Iris Feature Extraction and Matching

The 1-D Log-Gabor filter is widely used in the literature. It acts as a band-pass filter to remove the high and low frequency components inside the iris area:
G(Φ) = e^(−(log(Φ/Φ0))² / (2 (log Σ)²)),   (7.5)

where Σ is the bandwidth and Φ0 is the filter's center frequency, which is derived from the filter's wavelength, α. A one-dimensional FFT is used to find the frequency characteristics from −π to π radians. The highest and lowest frequencies are removed by the Log-Gabor wavelet. In this way, the frequencies which represent the iris pattern in an arc are isolated. The phase of each pixel of the polar image filtered with the Log-Gabor wavelet is used to encode the iris patterns. The Hamming distance is used to calculate the similarity between two encoded iris images.
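The following sketch shows one way to apply a 1-D Log-Gabor filter of the form of Eq. (7.5) row by row to a normalized polar iris image, quantize the resulting phase into a binary code, and compare two codes with the Hamming distance. The wavelength and bandwidth values, and the two-bit phase quantization, are illustrative assumptions rather than the exact settings used in this chapter.

import numpy as np

def log_gabor_encode(polar_iris, wavelength=18.0, sigma_over_f=0.5):
    """Filter a (rows, cols) polar iris image row by row with a 1-D Log-Gabor
    filter and quantise the phase of each pixel into two bits."""
    rows, cols = polar_iris.shape
    f = np.fft.fftfreq(cols)                      # normalised frequencies
    f0 = 1.0 / wavelength                         # filter center frequency
    G = np.zeros(cols)
    pos = f > 0                                   # Log-Gabor has no DC/negative response
    G[pos] = np.exp(-(np.log(f[pos] / f0) ** 2) / (2 * np.log(sigma_over_f) ** 2))
    filtered = np.fft.ifft(np.fft.fft(polar_iris, axis=1) * G, axis=1)
    # two-bit phase quantisation per pixel (signs of real and imaginary parts)
    return np.stack([filtered.real >= 0, filtered.imag >= 0], axis=-1)

def hamming_distance(code_a, code_b, mask_a=None, mask_b=None):
    """Fractional Hamming distance between two iris codes, ignoring masked bits."""
    valid = np.ones(code_a.shape, bool)
    if mask_a is not None:
        valid &= mask_a[..., None]
    if mask_b is not None:
        valid &= mask_b[..., None]
    return np.sum((code_a != code_b) & valid) / np.sum(valid)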

7.4 Eye Image Quality Measure

7.4.1 Quality Measure for the Entire Eye Image

The total amount of available iris and sclera patterns can affect the eye recognition accuracy in both iris and sclera recognition. In this research, we used an occlusion measure (O) to measure the percentage of the iris area that is invalid due to eyelids, eyelashes, and other noise. In general, the occlusion in the iris area is proportional to the occlusion in the sclera area.
Based on our analysis, the relationship between occlusion and eye image quality
is not linear, and the calibration function for occlusion is calculated by:

g(O) = (1 − e^(−α(1−O)))/π,   (7.6)

where π = 0.9179 and α = 2.5.



Fig. 7.6 The process of information distance calculation

7.4.2 Quality Measure for the Iris Area

In [11], we proposed the feature correlation evaluation approach for iris feature infor-
mation measure. The correlation between the neighboring features is used to measure
the quality of the iris images. The feature correlation measure (FCM) is calculated by:

FCM = (1/N) Σi Ji,i+1,   (7.7)
where i and i + 1 index consecutive Log-Gabor filtered rows, N is equal to the (total number of rows − 1) used for the feature correlation calculation, and J is the information distance between consecutive rows of the iris patterns. Figure 7.6 shows the process of calculating Ji,i+1 between the ith and (i + 1)th rows. To eliminate the effects of noise, the masked pixels of each row are assigned the mean value of the valid pixels in the same row. Since the iris patterns of dark brown eyes do not reveal rich and complex patterns in visible light, their FCM is lower than that of other eye colors.
Based on our experimental results, the calibrated function f for the calculation
of FCM is:

f(FCM) = δ · FCM,  if 0 ≤ FCM ≤ ϕ
       = 1,         if FCM ≥ ϕ,                                        (7.8)

where ϕ = 0.005 and δ = 1/ϕ.



Dilation is another important factor in iris image quality. If the pupil dilation is
too large, it could also affect the iris recognition accuracy. Here, the dilation measure
D is calculated by:
D = Pupil radius / Iris radius.   (7.9)
Similar to the occlusion, the dilation non-linearly affects recognition accuracy.
The h function is calculated as:

h(D) = 1,               if D ≤ 0.6
     = e^(−φ(D−ξ)),     if 0.6 < D ≤ 1.                                 (7.10)

In our experiment, ξ = 0.6 and φ = 40.

7.4.3 Quality Measure for the Sclera Area

In our sclera matching, the matching score is based on the percentage of matched
descriptors. The images with very small extracted vessel patterns could result in
arbitrarily high matching scores, which may increase the false acceptance rate.
In this research, we calculated the number of descriptors N in the downsampled
image (to make sure the descriptors are normally distributed in the sclera area).
The number of descriptors N is nonlinearly related to the recognition accuracy. The
calibrated function p is calculated by:

p(N) = 1,                          if N ≥ 130
     = 1 − π e^(−φ(N/ξ − 1)),      if 0 < N < 130.                      (7.11)

In our experiment, π = 0.025, ξ = 130, and φ = 3.7.

7.4.4 Overall Quality Score for Iris and Sclera Areas

After evaluating the iris area and sclera areas, we fused these quality scores to obtain
overall quality score for iris and sclera areas. The overall iris quality score I , can be
calculated as:

I = g(O) · f (FCM) · h(D). (7.12)

And the overall sclera quality score S, can be calculated as:

S = g(O) · p(N ). (7.13)



The overall quality score can be used as a confidence score for the matching result, or to select the best matching result when multiple images are used for identification.
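A compact Python sketch of the calibration functions and the overall quality scores of Eqs. (7.6)-(7.13), using the constants reported in the text, may look as follows; the function and constant names are ours, and only the formulas come from the chapter.

import numpy as np

# Constants reported in the text (Eqs. 7.6, 7.8, 7.10, 7.11)
ALPHA_O, PI_O = 2.5, 0.9179            # occlusion calibration g(O)
PHI_F = 0.005                          # FCM calibration threshold; delta = 1/phi
XI_D, PHI_D = 0.6, 40.0                # dilation calibration h(D)
PI_N, XI_N, PHI_N = 0.025, 130.0, 3.7  # descriptor-count calibration p(N)

def g(O):                              # Eq. (7.6): occlusion calibration
    return (1.0 - np.exp(-ALPHA_O * (1.0 - O))) / PI_O

def f(FCM):                            # Eq. (7.8): feature correlation calibration
    return min(FCM / PHI_F, 1.0)

def h(D):                              # Eq. (7.10): pupil dilation calibration
    return 1.0 if D <= XI_D else float(np.exp(-PHI_D * (D - XI_D)))

def p(N):                              # Eq. (7.11): sclera descriptor-count calibration
    return 1.0 if N >= XI_N else 1.0 - PI_N * float(np.exp(-PHI_N * (N / XI_N - 1.0)))

def quality_scores(O, FCM, D, N):
    """Overall iris quality I (Eq. 7.12) and sclera quality S (Eq. 7.13)."""
    I = g(O) * f(FCM) * h(D)
    S = g(O) * p(N)
    return I, S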

7.5 Quality Based Fusion

In our proposed system, we have three matching scores for a pair of persons: an iris matching score for frontal-looking image pairs, a sclera matching score for frontal image pairs, and a sclera matching score for off-angle image pairs.
For a pair of persons A and B in frontal images, the sclera quality scores are FSA and FSB, respectively, and the iris quality scores are FIA and FIB, respectively. In off-angle images, the sclera quality scores are LSA and LSB, respectively. For the matching scores, the sclera matching score is MFSAB for frontal images and MLSAB for off-angle images, and the iris matching score for frontal images is MFIAB. The overall feature quality-based unconstrained eye matching result is predicted by:


Matching result = Frontal sclera matching result,    if FSA > T1 and FSB > T1
                = Off-angle sclera matching result,  if LSA > T2 and LSB > T2
                = Frontal iris matching result,      if FIA > T3 and FIB > T3
                = 0,                                 otherwise,          (7.14)
where T1, T2, and T3 are the thresholds used to assess the quality of the image in the sclera and iris areas, respectively. The frontal sclera matching result is calculated by:

Frontal sclera matching result = 1 if MFSAB > T4, and 0 otherwise.   (7.15)

The off-angle sclera matching result is calculated by:

Off-angle sclera matching result = 1 if MLSAB > T5, and 0 otherwise.   (7.16)

And the frontal iris matching result is calculated by:

Frontal iris matching result = 1 if MFIAB > T6, and 0 otherwise.   (7.17)

within, T4 , T5 , and T6 are the thresholds to determine if the two images are matched
or not using sclera recognition and iris recognition respectively. In this way, we
fused the matching scores based on the quality score. If the sclera quality scores
for both frontal images are good, we will use the matching result from frontal
sclera recognition (Fig. 7.7). If the sclera quality scores for both off-angle images are
good, we will use the matching result from off-angle sclera recognition (Fig. 7.8).

Fig. 7.7 An example of the proposed approach (two images with good frontal sclera quality scores)

Fig. 7.8 An example of the proposed approach (two images with poor frontal sclera quality scores,
but with good off-angle sclera quality scores)

If the iris quality scores for both frontal images are good, we will use the matching
result from frontal iris recognition (Fig. 7.9). For other situations, the two persons
will be identified as non-matched (Fig. 7.10).
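The decision logic of Eqs. (7.14)-(7.17) can be summarized in a few lines, as in the sketch below; the threshold values are application dependent and are not specified here, and the function name is ours.

def fuse_decision(FSA, FSB, LSA, LSB, FIA, FIB,
                  MFSAB, MLSAB, MFIAB,
                  T1, T2, T3, T4, T5, T6):
    """Quality-based fusion rule of Eqs. (7.14)-(7.17): returns 1 for a match
    and 0 otherwise."""
    if FSA > T1 and FSB > T1:          # both frontal sclera qualities are good
        return int(MFSAB > T4)         # frontal sclera matching result, Eq. (7.15)
    if LSA > T2 and LSB > T2:          # both off-angle sclera qualities are good
        return int(MLSAB > T5)         # off-angle sclera matching result, Eq. (7.16)
    if FIA > T3 and FIB > T3:          # both frontal iris qualities are good
        return int(MFIAB > T6)         # frontal iris matching result, Eq. (7.17)
    return 0                           # otherwise declared non-matched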

Fig. 7.9 An example of the proposed approach (two images with poor frontal and off-angle sclera
quality scores, but with good frontal iris quality scores)

Fig. 7.10 An example of the proposed approach (two images with poor sclera and iris quality
scores)

7.6 Experimental Results

7.6.1 Database

The IUPUI multi-wavelength database was acquired under eight different illumination wavelengths, namely Purple (420 nm), Blue (470 nm), Green (525 nm), Yellow (590 nm), Orange (610 nm), Red (630 nm), Deep Red (660 nm), and Infra-Red (820 nm), at a distance of one foot.

Fig. 7.11 Some example images of front-looking images in the IUPUI green-wavelength database

In each wavelength, there are a total of 352 images of
44 subjects in five different colored eyes including blue, dark brown, light brown,
green, and hazel. For each human subject, we obtained the image videos from both
eyes on two separate occasions. The time period between each data acquisition is at
least one week. From our observation, the sclera pattern at the green wavelength can be better extracted than at any other wavelength. Therefore, the green-wavelength database was used for this work.
Frontal-looking images and left-looking images are used for multi-angle sclera recognition. Figures 7.11 and 7.12 show examples of front-looking and left-looking images in our IUPUI green-wavelength database. The top row shows good quality images, the middle row shows images of mid to poor quality, and the bottom row shows poor quality images.

7.6.2 Recognition Accuracy Versus Quality Score

The overall matching results were computed for an all-to-all similarity matrix.
For the IUPUI green-wavelength database, this amounts to over 120 thousand cross-comparisons.

Fig. 7.12 Some example images of left-looking images in the IUPUI green-wavelength database

Based on the sclera quality scores, we divided all of the images in this database equally into five groups: the top 20 % as the best quality group, then the top 20–40 %, top 40–60 %, and top 60–80 % groups, with the last group being the worst quality group. Using the similarity matrices generated by our proposed matching algorithm, we plotted the ROC curves of these five groups to check whether the sclera quality scores can improve the recognition accuracy. Figure 7.13 and Table 7.1 show the recognition accuracy versus quality scores of frontal images. Figure 7.14 and Table 7.2 show the recognition accuracy versus combination scores of off-angle images.
All figures and tables show that the highest scoring group of images has better recognition accuracy than the second highest scoring group, and the lowest scoring group has the lowest recognition accuracy. Therefore, a good combination score can be used to improve the system performance; that is, the quality score is highly correlated with the system performance.

7.6.3 Feature Quality-Based Unconstrained Eye Recognition

Table 7.3 shows the comparison results using frontal iris recognition, frontal sclera
recognition, off-angle sclera recognition, and our proposed system. At FAR = 0.1 %,


Fig. 7.13 Recognition accuracy versus combination scores (frontal images)

Table 7.1 EERs versus combination scores (frontal images)


Groups Top 20 % Top 20–40 % Top 40–60 % Top 60–80 % Worst quality
EER (%) 0.63 4.86 5.71 6.91 9.09


Fig. 7.14 Recognition accuracy versus quality scores (off-angle images)

GARs are 76.20 % for frontal iris recognition, 86.50 % for frontal sclera recognition, and 82.70 % for off-angle sclera recognition. With our feature quality-based unconstrained eye recognition, the GAR is increased to 91.95 %. Also, at FAR = 0.01 %, the GAR for our proposed system (88.39 %) is better than frontal iris recognition (64.00 %), frontal sclera recognition (76.40 %), and off-angle sclera recognition (79.90 %). These results also show that our proposed system can improve the recognition accuracy.

Table 7.2 EERs versus quality scores (off-angle images)


Groups Top 20 % Top 20–40 % Top 40–60 % Top 60–80 % Worst quality
EER (%) 3.57 4.08 4.54 5.07 13.55

Table 7.3 Recognition results in feature quality-based unconstrained eye recognition

Method               GAR (%) at FAR = 0.1 %   GAR (%) at FAR = 0.01 %
Iris (Frontal)       76.20                    64.00
Sclera (Frontal)     86.50                    76.40
Sclera (Off-angle)   82.70                    79.90
Proposed system      91.95                    88.39

7.7 Summaries

Sclera vessel patterns are unique, with both genetic and developmental components determining their structure, and sclera recognition can achieve better accuracy than other traditional methods in visible light. However, non-ideal sclera images are still challenging for recognition and can significantly affect the accuracy of recognition systems, because they cannot be properly preprocessed by the system and/or they have poor image quality.
In this chapter, we introduced a feature quality-based unconstrained eye recog-
nition system. The image segmentation step can segment iris and sclera areas in
color images. Our quality measure evaluated the entire eye image quality, iris area
quality, and sclera area quality. Our quality based fusion can be used to predict the
eye recognition performance using overall iris and sclera quality scores.
Two recognition algorithms, the 1-D Log-Gabor approach for iris recognition and the line descriptor-based approach for sclera recognition, are applied. The research results show that our overall iris and sclera quality scores are highly correlated with the recognition accuracy, and that our quality fusion-based eye recognition can improve and predict the performance of unconstrained eye recognition systems.

References

1. Proenca H, Filipe S, Santos R, Oliveira J, Alexandre LA (2010) The UBIRIS.v2: a database of visible wavelength iris images captured on-the-move and at-a-distance. IEEE Trans Pattern Anal Mach Intell 32(8):1529–1535
2. Du EY (2012) Biometrics: from fiction to practice. Pan Stanford Publishing, Singapore
3. Flom L, Safir A (1987) Iris recognition system, United States Patent No. 4,641,349
4. Sit AJ, Liu JHK, Weinreb RN (2006) Asymmetry of right versus left intraocular pressures over
24 H in Glaucoma patients. Ophthalmology 113(3):425–430
5. Proenca H (2010) Iris recognition: on the segmentation of degraded images acquired in the
visible wavelength. IEEE Trans Pattern Anal Mach Intell 32(8):1502–1516

6. Zhou Z, Du EY, Thomas NL, Delp EJ (2012) A new human identification method: sclera
recognition. IEEE Trans Syst Man Cybern Part A Syst Hum 42(3):571–583. © [2012] IEEE.
Reprinted, with permission
7. Unsang P, Jillela R, Ross A, Jain AK (2011) Periocular biometrics in the visible spectrum.
IEEE Trans Inf Forensics Secur 6(1):96–106
8. Padole CN, Proenca H (2012) Periocular recognition: analysis of performance degradation
factors. In: 2012 5th IAPR international conference on biometrics (ICB), pp 439–445
9. Daugman J (2007) New methods in iris recognition. IEEE Trans Syst Man Cybern Part B
37(5):1167–1175
10. Thomas N, Du Y, Zhou Z (2010) A new approach for sclera vein recognition. In: Proceedings
of SPIE, vol. 7708, Orlando, p 770805
11. Du Y, Belcher C, Zhou Z, Ives R (2010) Feature correlation evaluation approach for iris feature
quality measure. Sig Process 90(4):1176–1187
Chapter 8
Speed-Invariant Gait Recognition

Yasushi Makihara, Akira Tsuji and Yasushi Yagi

Abstract We propose a method of speed-invariant gait recognition in a unified framework of model- and appearance-based approaches. When a person changes his/her walking speed, kinematic features (e.g., stride and joint angle) change, while static features (e.g., thigh and shin lengths) are preserved. Based on this fact, firstly, static and kinematic features are decoupled from a gait silhouette sequence by fitting a human model. Secondly, a factorization-based stride transformation model for the kinematic features is created by using a training set of multiple non recognition-target persons at multiple speeds. This model can transform the kine-
matic features from a gallery speed to another arbitrary probe speed. Because only
the kinematic features are insufficient to achieve a high recognition performance, we
therefore finally synthesize a gait silhouette sequence by combining the preserved
static features and the transformed kinematic features for matching. Experiments
with the OU-ISIR Gait Database show the effectiveness of the proposed method.

8.1 Introduction

In modern society, biometric person authentication has attracted increasing attention


in many different applications, including surveillance, forensics, and access control. While most biometrics, such as DNA, fingerprints, finger/hand vein, iris, and face, require subject cooperation and contact with or proximity to a sensor, gait biometrics has promising properties: person authentication at a distance from the camera and without

Y. Makihara (B) · A. Tsuji · Y. Yagi


The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka,
Ibaraki, Osaka, Japan
e-mail: makihara@am.sanken.osaka-u.ac.jp
A. Tsuji
e-mail: a-tuji@am.sanken.osaka-u.ac.jp
Y. Yagi
e-mail: yagi@am.sanken.osaka-u.ac.jp

J. Scharcanski et al. (eds.), Signal and Image Processing for Biometrics, 209
Lecture Notes in Electrical Engineering 292, DOI: 10.1007/978-3-642-54080-6_8,
© Springer-Verlag Berlin Heidelberg 2014

subject cooperation [4]. In fact, gait biometrics has been applied in the forensic science field [7] and has also been admitted as evidence against a burglar in UK courts [5]. In addition, the first packaged software of a gait verification system for criminal investigation has recently been developed [16].
Approaches to the gait recognition roughly fall into two families: model-based
approaches [2, 6, 11, 41, 42, 44, 46] and appearance-based approaches [3, 12, 15,
20, 24, 27, 32, 35, 43, 45]. Because the model-based approaches often suffer from
the human model fitting error and relatively high computational cost, the appearance-
based approaches are currently dominant and achieve better performances than the
model-based approaches in general.
The appearance-based approaches are, however, susceptible to various covariates
(e.g., views, shoes, surfaces, clothing, carriages, and walking speed). Among these
covariates, walking speed is one of the most important factors because people often change their walking speeds depending on the situation (e.g., a person going for a walk at a leisurely pace, a commuter going to the station in a hurry, and a perpetrator
running away from a criminal scene). In such a case, the gait recognition performance
drastically drops due to the change of gait features (e.g., stride, arm swing, and gait
period) [8].
The effects of the speed change on the gait features are considered in two
aspects: temporal aspect (e.g., gait period and non-linear time warping) and spa-
tial aspects (e.g., stride and arm swing). In other words, we can change our walking
speed by changing pitch and stride, which correspond to the temporal and spatial
aspects, respectively. The temporal aspect is relatively easily normalized by well-
established techniques such as dynamic time warping (DTW) and hidden Markov
model (HMM) [1, 10, 22].
On the other hand, solutions to the spatial aspects are limited up to now. A few
studies tackle this problem by stride normalization in a double support phase [38]
and Procrustes shape analysis [19]. Although both approaches basically rely on the assumption that shape variation caused by the speed change is well represented by a linear stretch, this is not necessarily true because human gait kinematics are not guaranteed to change linearly as the speed changes.
Beyond such a linear shape stretch, we require a more flexible framework which
considers the essence (or a prior knowledge) of the human gait transition by the
speed change. Such a useful prior knowledge could be learnt in an example-based
way, in more detail, from a training set of gait data of multiple non recognition-target
subjects with multiple speeds.
Revisiting the other covariates on gait recognition, we notice that several
approaches to view-invariant gait recognition employed example-based transfor-
mation models using factorization [24] and regression [18], and that they have a
potential to be applied to the speed-invariant gait recognition. Its direct application
to the appearance-based features may, however, induce undesirable transformation.
The appearance-based features contain both static features (e.g., body type and pos-
ture) and kinematic features (e.g., stride and arm swing) and they may be trans-
formed together by the appearance-based transformation. On the other hand, in case
of speed change, the static features should be preserved and only kinematics features

should be transformed, and hence the direct application of the appearance-based


transformation to the speed-invariant gait recognition is unsuitable.
Therefore, we propose a method of static feature-preserving speed transformation
model for the speed-invariant gait recognition. Unlike the appearance-based trans-
formation, the static and kinematic features are decoupled by fitting a human model
to a gait silhouette sequence at first, and then factorization is applied only to the kine-
matic features to transform the kinematic features with a gallery speed to those with a probe speed, while the static features are preserved. Finally, the preserved static
features and the transformed kinematic features are combined to synthesize a gait
silhouette sequence with the probe speed. The advantages of the proposed method
over the existing methods are summarized as follows. (1) Not only temporal change
(e.g., gait period) but also spatial change (e.g., stride) can be treated in a more flexible
way beyond the linear shape stretch [19, 38]. (2) A whole sequence of gait silhouettes is available for recognition, unlike the existing methods that utilize a limited number of discrete gait stances [1, 22, 38]. (3) Not only kinematic features but also any types
of appearance-based features (e.g., gait energy image (GEI) [15], frequency-domain
feature (FDF) [24], chrono-gait image (CGI) [43], gait flow image (GFI) [20]), which
contain rich information, are available for gait recognition.
The rest of this chapter is organized as follows. Section 8.2 summarizes relevant
databases and literatures on speed-invariant gait recognition. Section 8.3 describes
a method of static feature-preserving speed transformation including preprocessing,
feature extraction, and model fitting, while Sect. 8.4 introduces recognition process.
Section 8.5 provides experimental results, and Sect. 8.6 gives conclusions as well as
future directions of this work.

8.2 Related Work

In this section, related work on speed-invariant gait recognition is briefly summarized.

8.2.1 Gait Database with Speed Variation

For evaluating the performance on speed-invariant gait recognition, a gait database


with speed variation is essential (see Table 8.1).
The CMU MoBo Database [14] contains 600 image sequences of 25 persons
walking on a treadmill captured with two different speeds; slow and fast. Each image
sequence is captured by six synchronous cameras evenly distributed around the tread-
mill and is 11 s long at 30 fps. The Soton database [31] contains image sequences
of a person walking around an indoor track, with each subject filmed walking at different speeds. The CASIA Gait Database, Dataset C, otherwise known as CASIA
Infrared Night Gait Dataset [39], contains 153 subjects with three different walking
speeds, namely, slow, normal, and fast. A thermal infrared camera was used to image
night human gait with the resolution of 320 × 240 pixels at 25 fps from the side

Table 8.1 Existing major gait databases with speed variations

Database                     #Subjects  #Sequences  Data covariates
CMU MoBo database [14]       25         600         6 views, 2 speeds, 2 slopes, baggage (ball)
Georgia tech database [38]   24         288         4 speeds (0.7, 1.0, 1.3, and 1.6 m/s)
Soton database [31]          12         –           4 views, 5 shoes, 3 clothes, 5 bags (including w/o), 3 speeds
CASIA gait database [39]     153        612         3 speeds, baggage (w/ and w/o)
TokyoTech database [1]       30         1,602       3 speeds
OU-ISIR gait database [23]   34         612         9 speeds (2, 3, 4, 5, 6, 7, 8, 9, and 10 km/h)

view outdoors. The Georgia Tech database [38] contains 24 subjects (20 males and 4 females) walking in their normal clothes at four speeds with a 0.3 m/s (approx. 1.0 km/h) interval. Each image sequence is stored with a resolution of 320 × 240 pixels. For
this purpose, the speed control setup [37] was used, in which people were allowed to
walk naturally on the ground level floor and at the same time achieve and remain at
certain ranges of speeds. TokyoTech database [1] contains 30 subjects with three dif-
ferent walking speeds; slow (2 km/h), normal (3 km/h), and fast (4.5 km/h). They also
generate synthetic image sequences containing mixture of normal and fast speeds.
While the number of speed variations is at most four in the above gait databases,
the OU-ISIR Gait Database, the treadmill dataset A [23] contains up to nine different
speeds from 2 to 10 km/h as shown in Fig. 8.1. Each image sequence is stored with
the size of 320 × 240 pixels at 60 fps and two size-normalized silhouette sequences
per subject per speed are provided on the public domain for the research purpose.
We therefore exploit this dataset for evaluating the proposed method.

8.2.2 Time Normalization


The temporal aspect of the speed variation is often viewed as a problem of time nor-
malization. The most prevailing gait features are period-based ones (e.g., GEI [15],
FDF [24], CGI [43], GFI [20]). While the GEI [15], otherwise known as averaged
silhouette [21], is simply computed by averaging size-normalized gait silhouettes
over one gait period, the FDF [24] contains amplitude spectra of one-dimensional
Fourier transformation for 1-time and 2-times frequency based on the gait period
which is computed along the time axis for each pixel independently, and the direct
currency element, namely, averaged silhouette. The CGI [43] averages gait silhou-
ette contours encoded by phase-based colors over a quarter gait period to leverage
the temporal information more, while GFI [20] averages gait silhouette contours
whose motion magnitude is more than a threshold, over one gait period to leverage
the motion information more. Since all these gait features aggregate frame-by-frame
features and normalize them by one gait period or a quarter gait period, they are
regarded as time-normalized features.

Fig. 8.1 Examples of input images at double support/float phases with speed variations from the
OU-ISIR Gait Database, the treadmill dataset A [23]. Silhouette boundaries are depicted with red
contours

On the other hand, frame-based gait features such as a raw silhouette [10, 22],
projection of the raw silhouette into principal component analysis (PCA) space [26,
30] in conjunction with the parametric eigen space method [29], average pixel dis-
tances from the center of the silhouette [9], joint angle such as knee and thigh [36],
or width vector [12] which is equivalent to a histogram counting foreground pixels
in each row, requires a frame synchronization process. Such a frame synchronization
process is done by linear time normalization [9, 26, 30] or non-linear time warping
such as DTW [10, 12] and HMM [1, 22].
While all the above studies are robust to the temporal variation to a greater or
lesser extent, they are naturally sensitive to shape variation such as stride and arm
swing changes.

8.2.3 Shape Normalization

Compared with the above large body of literatures on the temporal aspect, literatures
on the spatial aspect are limited.

Tanawongsuwan et al. [38] proposed a stride normalization of double-support gait


silhouettes based on a statistical relation between the walking speed and the stride.
They used only five silhouettes (two single-support images and three double-support images) for recognition and discarded the other, still informative, images.
Kusakunniran et al. [19] employed higher-order shape configuration in the scheme
of Procrustes shape analysis (PSA) as a gait shape description, which is able to tolerate
the varying walking speed, in conjunction with a differential composition model for
matching considering the body parts-dependent weights.
Both studies [19, 38] basically rely on a linear stretching process of the shape,
and hence they cannot handle the non-linear shape change due to speed change. On
the other hand, because the proposed method makes use of exemplars of multiple
speed gaits as a prior knowledge, it can cope with such non-linear shape change
successfully.

8.3 Static Feature-Preserving Speed Transformation

8.3.1 Overview of the Proposed Approach

In order to match gait image sequences with two different speeds appropriately, we
employ a static feature-preserving speed transformation framework as outlined in
Fig. 8.2. After extracting a gait silhouette sequence in the preprocessing, the following
three steps are executed.
1. Decouple static and kinematic features by fitting a human model to the gait sil-
houette sequence.
2. Transform the kinematic features of the gallery stride to those of the probe stride
by a pre-trained stride transformation model (STM).
3. Synthesize a gait silhouette sequence by combining the preserved static features
and the transformed kinematic features.
Because speed change makes a large impact on leg motion, in particular, on
stride, we focus on transformation of the lower body part in this study. Details of the
procedures are described in the following sections.

8.3.2 Preprocessing

Given a pair of an input image sequence and a background image, a gait silhouette
sequence is extracted by background subtraction-based graph-cut segmentation [25].
The silhouettes are then normalized by the height and registered by the region center so as to keep the aspect ratio. We refer to the size-normalized gait silhouette sequence simply as the gait silhouette sequence hereafter for convenience. Examples of the gait silhouette sequences
with speed variations are illustrated in Fig. 8.3.


Fig. 8.2 Overview of the proposed approach. Once a gait silhouette sequence is decoupled by
fitting a human model into static and kinematic features, a static feature-preserving STM is trained
by the kinematic features of multiple training subjects at multiple speeds. A kinematic feature
of a test subject at one speed is then transformed into that at another speed by the STM. The
speed-transformed silhouette sequence is finally synthesized by coupling the static feature and the
speed-transformed kinematic feature

Fig. 8.3 Examples of gait silhouette sequences with speed variations. Frame intervals are adjusted
so as to accommodate approximately one gait period

Fig. 8.4 Human model

8.3.3 Decouple of Static and Kinematic Features


Decoupling of the static and kinematic features is done by fitting a human model.
Human model fitting has been an attractive topic in computer vision and hence a
variety of human models (e.g., stick-, rectangle-, ellipsoid-, and generic cylinder-
link models) have been proposed for diverse applications [6, 13, 41]. Among these,
we adopt a 2D trapezoidal-link model, which can be seen as an approximation of a
projected 3D truncated cone-link model as shown in Fig. 8.4. Each left and right leg
in the model consists of two trapezoids (thigh and shin), a circle (knee joint), and a
rectangle (foot). The model is represented by two types of variables: time-dependent
variables (kinematic features) and time-independent variables (static feature).
The time-dependent variables of the i-th leg (i ∈ {left, right}) at the t-th frame consist of the 2D position of the waist [xi^t, yi^t], a joint angle sequence of the thigh ρthigh,i^t, and that of the shin ρshin,i^t (t = 1, . . . , N), where N is the number of frames in the gait silhouette sequence. The joint angle between shin and foot is fixed to be orthogonal for simplicity. The time-dependent kinematic features at the t-th frame are denoted as

ki^t = [xi^t, yi^t, ρthigh,i^t, ρshin,i^t].   (8.1)

The whole kinematic feature is then summarized as

K = {ki^t} (i ∈ {left, right}, t = 1, . . . , N)   (8.2)

and the total dimension is then 8N.


On the other hand, the time-independent variables consist of the upper and lower
bases, and the height of trapezoids. We assume that the lower base of the shin changes
in proportion to the foot size, in other words, the lower base of the shin is a dependent
variable to the foot size. In addition, we set the lower base of the thigh, the upper base
of the shin, and the diameter of knee joint circle to be the same. The time-independent
static feature are summarized as

S = {wroot,i , lthigh,i , wknee,i , lshin,i , wfoot,i }(i ∈ {left, right}), (8.3)

where wroot,i, lthigh,i, wknee,i, wfoot,i, and lshin,i are the upper base of the thigh, the
length of the thigh, the radius of the knee joint circle, the foot size, and the length
of the shin for the i-th leg, respectively. As a result, the total dimension of the static
feature is ten.
Consequently, both the kinematic and the static features are concatenated as
ω = {K , S}, whose dimension is (8N + 10).
Next, the human model is fit to the gait silhouette sequence in an energy minimiza-
tion framework under physical constraints on joint angle range. The energy function
is defined by a data term Sdata (ω) related to consistency of the model region and
the gait silhouette sequence, and a smoothness term Ssmooth (ω) to suppress abrupt
changes of the waist positions and joint angular velocities of the thighs and shins.
The energy function S(ω) is therefore formulated as

S(ω) = Sdata(ω) + Ssmooth(ω)   (8.4)

Sdata(ω) = Σ_{t=1}^{N} Aexor(I^t; ω)   (8.5)

Ssmooth(ω) = Σ_{i∈{left,right}} Σ_{t=1}^{N−1} Θwaist {(xi^{t+1} − xi^t)² + (yi^{t+1} − yi^t)²}
           + Σ_{i∈{left,right}} Σ_{j∈{thigh,shin}} Σ_{t=2}^{N−1} Θangle (ρi,j^{t+1} − 2ρi,j^t + ρi,j^{t−1})²,   (8.6)

where Aexor (I t ; ω) is the area of an exclusive-or region between the gait silhouette
I t at the t-th frame and the inside of the human model defined by the parameters
ω, which stands for the degree of inconsistency between the observation and the
human model, Θwaist and Θangle are weights for the temporal changes of the waist
positions and those of joint angular velocities. The energy function is minimized by
the conjugate gradient method. The results of model fitting for multiple speeds are
shown in Fig. 8.5.
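To make the energy of Eqs. (8.4)-(8.6) concrete, the following Python sketch evaluates it for a given parameter set; render_model is a hypothetical helper that rasterizes the trapezoidal-link model for one frame, and the weights and data layout are illustrative assumptions rather than the authors' implementation. In practice such an energy would be handed to a conjugate gradient optimizer, e.g. scipy.optimize.minimize with method='CG'.

import numpy as np

def fitting_energy(params, silhouettes, render_model, w_waist=1.0, w_angle=1.0):
    """Energy S(omega) of Eqs. (8.4)-(8.6) for the trapezoidal-link leg model.
    render_model(params, t) is a hypothetical routine that rasterises the model
    for frame t as a binary image."""
    N = len(silhouettes)
    # Data term, Eq. (8.5): exclusive-or area between observation and model
    e_data = sum(np.logical_xor(silhouettes[t], render_model(params, t)).sum()
                 for t in range(N))
    # Smoothness term, Eq. (8.6)
    e_smooth = 0.0
    for leg in ("left", "right"):
        x, y = params[leg]["x"], params[leg]["y"]        # waist position sequences
        e_smooth += w_waist * np.sum(np.diff(x) ** 2 + np.diff(y) ** 2)
        for joint in ("thigh", "shin"):
            a = params[leg][joint]                        # joint angle sequence
            # squared angular acceleration (second-order temporal difference)
            e_smooth += w_angle * np.sum((a[2:] - 2 * a[1:-1] + a[:-2]) ** 2)
    return e_data + e_smooth                              # Eq. (8.4)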

Fig. 8.5 Model fitting results at 2, 4, and 6 km/h

8.3.4 Kinematic Feature Transformation

In this section, a factorization-based transformation of the kinematic features K is


described.
Assuming a person walks with a constant speed during a sequence, the walking
speed s is defined as a function of a gait period Tperiod (a pair of left and right steps)
and a mean stride L, and it is then expressed as

s = 2L / Tperiod.   (8.7)

The change of the speed is therefore derived from a combined change of the gait
period and the stride. As discussed in Sect. 8.1, we can consider these two fac-
tors as two aspects: temporal and spatial aspects. We therefore apply a two-step
transformation scheme: (1) time normalization and synchronization, and (2) stride
transformation. In addition, the change of the speed due to the change of the stride
is mainly dependent on the change of the joint angles, and hence we focus on the
transformation of the joint angles of the legs while we preserve the position of the
waist before and after the transformation.
Time Normalization and Synchronization
The goal of the time normalization and synchronization is to construct kinematic
feature vectors so that their dimensions are the same and that the gait stances for individual components in the kinematic feature vectors are the same (synchronized),
given original kinematic features with different gait periods and different initial gait
stances. Note that the time normalization and synchronization is essential not only
for the following stride transformation process but also for a subsequent matching
process.
First, the gait period Ngait is detected as the optimal time shift which maximizes
normalized autocorrelation of the gait silhouette sequence for the temporal axis [24].


Fig. 8.6 Overview of STM

A double support-phase frame is then detected as a key frame where both legs are the most open, that is, where the second moment of the gait silhouette around the central vertical axis is maximized. Next, the double support frame is set as the first component of the kinematic feature vector, and a pre-defined number of frames N̄ of the normalized kinematic features is re-sampled by linear interpolation from the N frames
of the original kinematic features K . As a result, we obtain the normalized kinematic
feature vector a for four joint angles of the legs, whose components are synchronous
in terms of gait stance and whose dimension is N A = 4 N̄ .
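A minimal sketch of these two steps, under the assumption that the silhouettes are binary arrays at 60 fps and that the joint angles are stored as an (N, 4) array, is shown below; the shift bounds, the value of N̄, and the function names are illustrative.

import numpy as np

def detect_gait_period(silhouettes, min_shift=30, max_shift=120):
    """Gait period as the time shift maximising the normalised autocorrelation
    of the silhouette sequence [24]; shift bounds assume 60 fps data."""
    X = np.stack([s.ravel().astype(float) for s in silhouettes])   # (N, pixels)
    best_shift, best_corr = min_shift, -1.0
    for shift in range(min_shift, min(max_shift, len(X) - 1)):
        a, b = X[:-shift], X[shift:]
        corr = np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift

def normalize_kinematics(angles, start_frame, n_bar=32):
    """Linearly re-sample the four leg joint-angle sequences to N̄ frames,
    starting from the detected double-support frame (assumes the sequence
    covers whole gait periods, so wrapping around is acceptable)."""
    angles = np.roll(angles, -start_frame, axis=0)        # (N, 4) joint angles
    N = len(angles)
    t_new = np.linspace(0, N - 1, n_bar)
    resampled = np.stack([np.interp(t_new, np.arange(N), angles[:, k])
                          for k in range(angles.shape[1])], axis=1)
    return resampled.ravel()                              # vector of length 4*N̄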
Stride Transformation
The goal of the stride transformation is to transform the normalized kinematic features
so that the strides of a gallery and a probe become the same. To this end, we introduce
a STM which is constructed by exemplars of non recognition-target subjects (e.g.,
students in a laboratory, colleagues in a research institute, and volunteers to join data
collection experiments) with multiple strides.
First, an overview of the stride transformation is addressed as follows (see
Fig. 8.6). In a training phase, the normalized kinematic features of multiple non
recognition-target subjects of multiple strides are collected. A transformation matrix
of the normalized kinematic features between different strides is then trained as the
STM. In a test phase, given a normalized kinematic feature in a gallery with stride L g
and that in a probe with a stride L p , the normalized kinematic feature in the gallery
is transformed to that with the stride L p . Note that we can estimate the stride either
from model fitting result for the double support phase directly or from Eq. (8.7) with
the estimated speed s and detected period Tperiod .
Next, the stride transformation is formulated in a similar way to factorization
for structure from motion [40] as follows. Let the numbers of training subjects and
training strides be M and J , respectively, and the N A dimensional normalized kine-
matic feature vector of the m-th subject for the j-th stride Lj be a^m_Lj. We then construct a training matrix whose rows and columns index stride and subject, respectively, and
decompose it by singular value decomposition as
   
⎡ a^1_L1 · · · a^M_L1 ⎤              ⎡ P_L1 ⎤
⎢   ...   . . .   ... ⎥ = U Σ V^T = ⎢  ...  ⎥ [ v^1 · · · v^M ],   (8.8)
⎣ a^1_LJ · · · a^M_LJ ⎦              ⎣ P_LJ ⎦

where U is the JN A × M orthogonal matrix, V is the M × M orthogonal matrix,


Σ is the M × M diagonal matrix composed of singular values, P_Lj is the NA × M sub-matrix of UΣ for the j-th stride, and v^m is the M dimensional column vector.
The vector vm is an intrinsic feature vector of the m-th subject and is common to
all the strides. The sub-matrix P L j is a projection matrix from the intrinsic vector v
to the feature vector for the stride L j , and is common to all the subjects. Thus, given
an intrinsic feature vector of a test subject as v, his/her normalized kinematic feature
vector a L for the stride L is expressed as

a L = P L v. (8.9)

Given a normalized kinematic feature vector a_Lg for a gallery with a stride Lg, the estimate v̂ of the intrinsic feature vector is given by the least square method as

v̂ = P⁺_Lg a_Lg,   (8.10)

where P⁺_Lg is a pseudo inverse matrix of P_Lg. Thus, transformation of normalized kinematic feature vectors from the gallery stride Lg to a probe stride Lp is obtained as

â_Lp = P_Lp P⁺_Lg a_Lg.   (8.11)
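The training and application of the STM in Eqs. (8.8)-(8.11) reduce to a singular value decomposition and a pseudo inverse, as in the sketch below. For simplicity it assumes that the gallery and probe strides coincide with two of the J training strides (indexed j_gallery and j_probe); in practice the nearest training stride, or an interpolated projection matrix, would be needed, and the function names are ours.

import numpy as np

def train_stm(A):
    """Stride transformation model of Eq. (8.8). A is the (J*NA, M) training
    matrix whose j-th row block stacks the normalised kinematic feature vectors
    of the M training subjects for the j-th training stride."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U * s                                # stacked projection matrices P_{L_j}

def transform_stride(P, j_gallery, j_probe, a_gallery, NA):
    """Transform a feature vector from the gallery stride (block j_gallery) to
    the probe stride (block j_probe), following Eqs. (8.10)-(8.11)."""
    P_g = P[j_gallery * NA:(j_gallery + 1) * NA, :]
    P_p = P[j_probe * NA:(j_probe + 1) * NA, :]
    v_hat = np.linalg.pinv(P_g) @ a_gallery     # intrinsic feature estimate, Eq. (8.10)
    return P_p @ v_hat                          # transformed feature, Eq. (8.11)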

8.3.5 Synthesis of Gait Silhouette Sequence

Once the kinematic feature is transformed to that with a probe stride, a gait silhouette
sequence with the probe stride is synthesized in conjunction with the preserved static
features. As mentioned before, because we focus on the transformation of the lower
body for simplicity in this study, the upper body silhouette is at first copied as it
is. As for the lower body, translation and rotation of individual links of the human
model due to the change of the strides is computed based on the result of the stride
transformation, and a silhouette inside each link of the human model is copied with
the translation and rotation.

8.4 Recognition

Given a pair of a gallery and probe gait silhouette sequences, we match them after
time normalization/synchronization and the stride transformation. We employ the
FDF [24] as an appearance-based gait feature for matching because its efficiency

was validated in the large-scale performance evaluation [17]. An overview of feature


extraction and matching processes is briefly described as follows.
First, a one-dimensional Fourier analysis is applied to the time series of the silhouette signal at each pixel in the gait silhouette sequence, and amplitude spectra are subsequently calculated for the 0-, 1-, and 2-times frequencies. We can reconstruct an image
for each frequency component for visibility as shown in Fig. 8.8, and this is treated
as a unit of FDF. Then, Euclidean distance between the FDFs of a gallery and a probe
is computed as a dissimilarity score. If a sequence contains multiple gait periods and
it provides multiple features, we integrate the distances for all the combinations of
the multiple features by the minimum rule to provide a final dissimilarity score.
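A small sketch of the FDF computation and the minimum-rule matching described here, assuming the silhouettes are already size normalized and one gait period has been detected, is given below; the function names are ours.

import numpy as np

def frequency_domain_feature(silhouettes, gait_period):
    """FDF [24]: per-pixel amplitude spectra for the 0-, 1-, and 2-times gait
    frequencies, computed over one gait period of size-normalised silhouettes."""
    X = np.stack(silhouettes[:gait_period]).astype(float)   # (T, H, W)
    F = np.fft.fft(X, axis=0) / gait_period                 # 1-D DFT along the time axis
    return np.stack([np.abs(F[k]) for k in (0, 1, 2)])      # (3, H, W) feature

def fdf_dissimilarity(gallery_features, probe_features):
    """Final dissimilarity score: minimum Euclidean distance over all
    combinations of gallery and probe FDFs (the minimum rule)."""
    return min(np.linalg.norm(g - p)
               for g in gallery_features for p in probe_features)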

8.5 Experiments

8.5.1 Setup

We conducted experiments of cross-speed gait recognition by using the OU-ISIR


Gait Database, the treadmill dataset A [23].1 The two types of experimental datasets
(call them dataset 1 and dataset 2, respectively) were arranged to see the effect of the
number of subjects for training the transformation model. The number of training
subjects in datasets 1 and 2 are 14 and 9, respectively, and both datasets have 20 testing subjects in common. The speed variation used in this experiment ranges from 2 to 7 km/h at a 1 km/h interval.
We evaluated the performance of the cross-speed gait recognition in identifica-
tion scenarios (one-to-many matching) with the cumulative matching characteristics
(CMC) curve which shows rates of the genuine subjects included in each rank of the
dissimilarity scores. Besides, we evaluated the performance in verification scenarios
(one-to-one matching) with an equal error rate (EER) derived from a receiver operat-
ing characteristic (ROC) curve [33]. The ROC curve shows a tradeoff between false
rejection rate (FRR) and false acceptance rate (FAR) when an acceptance threshold
changes. The EER is defined as the FAR or FRR when the FAR and the FRR are the
same.
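For reference, the EER can be computed from genuine and impostor dissimilarity scores with a simple threshold sweep, as in the generic sketch below (this is a standard computation, not code taken from the evaluation in this chapter).

import numpy as np

def equal_error_rate(genuine, impostor):
    """EER from dissimilarity scores: sweep the acceptance threshold and return
    the operating point where FAR and FRR are (approximately) equal."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine > t)       # genuine pairs rejected at threshold t
        far = np.mean(impostor <= t)     # impostor pairs accepted at threshold t
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer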

8.5.2 Qualitative Evaluation

Examples of synthesized gait silhouette sequences for widening and narrowing stride
are shown in Fig. 8.7. We can see that the synthesized gait silhouette sequences are
similar to the original probe gait silhouette sequences. Moreover, the matching results
of a pair of the frequency-domain features before and after the stride transformation

1 Publicly available at http://www.am.sanken.osaka-u.ac.jp/BiometricDB/GaitTM.html.



Fig. 8.7 Gait silhouette sequences of the gallery (top), synthesis (middle), and the probe (bottom): a 3 km/h gallery transformed to 7 km/h; b 6 km/h gallery transformed to 2 km/h

Fig. 8.8 Feature differences before (left) and after (right) the stride transformation, for a a transformation to 4 km/h (6 km/h gallery) and b a transformation to 6 km/h (2 km/h gallery), with frequency components 0, 1, and 2. Red: exist only in gallery/transformed features, green: exist only in a probe feature, white: exist in both features. Less red and green regions mean better matching

are shown in Fig. 8.8. We can see that the differences of the frequency-domain features
are significantly reduced by the stride transformation in both cases.

Fig. 8.9 CMC curves for cross-speed gait recognition

8.5.3 Quantitative Evaluation

The performance of the cross-speed gait recognition in the identification scenario


is quantitatively evaluated by CMC curves in Fig. 8.9, compared with a method of
direct matching with no transformation (denoted as No trans). As a result, it turns out
that the stride transformation both with dataset 1 and 2 improves the identification
rates compared with no transformation.
The performance in the verification scenario is evaluated by EERs for all the
combinations of the gallery and probe speeds as shown in Fig. 8.10. As a result,
the averaged EERs on different speed combinations were reduced from 15.0 % with
no transformation to 10.9 % (4.1 % improvement) and 11.0 % (4.0 % improvement)
with transformation in dataset 1 and 2, respectively. As a whole, the improvement
becomes larger as the speed difference between the gallery and the probe becomes
larger, and in particular the largest EER improvements for a specific pair of gallery
and probe speed in dataset 1 and 2 are 10.0 and 15.0 %, respectively.

8.5.4 Comparison with Previous Studies

In this section, the proposed method is compared with the previous studies by
Tanawongsuwan et al. [38] and Liu et al. [22]. Because the rank-1 identification
rate in the CMC curve was used in the previous studies, we also used the same
performance measure in the comparison experiments.
When the CMC curve is used for performance evaluation, the gallery size has a large impact on performance; in more detail, a larger gallery size means a more difficult identification problem setting. Thus, almost the same gallery size should be used among the methods for a fair comparison. The gallery sizes of
the previous studies [22, 38] were 24 and 25, respectively, and therefore 25 subjects


Fig. 8.10 Experimental results. The horizontal and vertical axes mean gallery speed and EER,
respectively

were assigned for testing while the remaining 9 subjects were used for training the
transformation model in the proposed method. Comparison results are summarized
in Table 8.2.
Speed variations in [38] are 2.5, 3.6, 4.7, and 5.8 km/h. Focused on the largest
speed difference combination (2.5 and 5.8 km/h), the rank-1 identification rates are
approximately 40 % for 2.5 km/h gallery and 30 % for 5.8 km/h gallery, respectively.
On the other hand, in case of the proposed method, for larger speed difference of
2 and 6 km/h, the rank-1 identification rate is 64 % for 2 km/h gallery and 52 % for
6 km/h gallery, respectively. This is because we use all of the frames for feature
extraction for matching while only five key frames (two single support frames and
three double support frames) are used for matching in [38].
Speed variations in [22] are 3.3 and 4.5 km/h. Since the speed difference is almost
1 km/h, we compare with a combination of 3 and 4 km/h in the proposed method.
While the rank-1 identification rate in [22] is approximately 84 %, that in the proposed
method is 84 % for 3 km/h gallery and 96 % for 4 km/h gallery. The reason of this

Table 8.2 Comparison of rank-1 identification rates across benchmarks of speed-invariant gait recognition

Dataset                     Method     Gallery size  Gallery speed (km/h)  Probe speed (km/h)  P1 (%)
The Treadmill dataset A     Proposed   25            2.0                   6.0                 64
Georgia tech database       [38]       24            2.5                   5.8                 40
The Treadmill dataset A     Proposed   25            6.0                   2.0                 52
Georgia tech database       [38]       24            5.8                   2.5                 30
The Treadmill dataset A     Proposed   25            4.0                   3.0                 96
CMU MoBo dataset            [22]       25            4.5                   3.3                 84

P1 stands for rank-1 identification rate

Fig. 8.11 An example of failure mode. Gait silhouette sequences of the gallery (top, 4 km/h),
synthesis (middle, 6 km/h), and the probe (bottom, 6 km/h) are depicted. The stride for the synthesis
is much wider than that for the probe (see the first frame)

performance improvement is that we consider both time normalization and stride


transformation, while only time normalization is considered in [22].

8.5.5 Limitations

Although the proposed STM works well in general, it still fails in some cases. One typical failure mode is stride inconsistency between the synthetic and the probe
gait silhouette sequences as shown in Fig. 8.11. This is because the least square-
based transformation in Eq. (8.11) does not explicitly guarantee that the stride of the
transformed kinematic feature â L p coincides with the target probe stride L p .
Another problem with the proposed STM is relevant to representation capability
of the training subjects. When a kinematic feature of a test subject is totally different
from those of the training subject, more specifically, not well represented by a linear
combination of kinematic features of the training subjects, it may be difficult to

accurately transform such a kinematic feature with one speed to that with another speed. Since the least square residual ‖P_L v̂ − a_L‖ derived from Eqs. (8.9) and (8.10)
somewhat reflects the fitness of the kinematic feature of the test subject to a linear
combination of kinematic features of the training subjects, as reported in the view
transformation model case [28], it could be used as a sort of quality measure for
score normalization in order to improve the recognition accuracy in total.

8.6 Conclusion and Future Trends

We proposed a method of speed-invariant gait recognition in a unified framework of


model- and appearance-based approaches. We firstly decouple the static and kine-
matic features from a gait silhouette sequence by fitting a human model. Secondly, the
kinematic features are transformed by the factorization-based STM. Finally, we syn-
thesize a gait silhouette sequence by combining the preserved static features and the
transformed kinematic features for matching. As a result of the experiments using
publicly available gait database containing a wide range of speed variations from
2 to 7 km/h with 1 km/h interval, averaged EER for different speed combinations
improved from 15.0 to 10.9 % compared with a method with no transformation.
A future direction is extension of the proposed method to various gait styles such
as running by introducing full body model fitting, gait style classification, and gait
style-specific STM. Faster gait dataset ranging from 8 to 10 km/h at 1 km/h interval
in the OU-ISIR Gait Database can be used for this purpose.
In addition, since it is possible to predict to some extent how accurately we can
transform a gait silhouette sequence for a test subject with the least square residual
of the STM as discussed in Sect. 8.5.5, it is another future work to improve the
recognition accuracy by a technique of quality-dependent score normalization [34],
more specifically, by adjusting a score based on the least square residual of the STM.
Moreover, because a transformation between the gait features with larger speed
differences generally yields a worse performance than that with smaller speed dif-
ferences, it could be a promising option to transform both probe and gallery gait
silhouette sequences into those with an intermediate speed between the probe and
gallery speeds to reduce the transformation error in total (e.g., transform both 7 km/h
gallery and 3 km/h probe into gait silhouette sequences at 5 km/h).
Furthermore, although we assume that speed is known and constant for each gait
silhouette sequence, the speed may in fact be unknown and may also change within each gait silhouette sequence, for example when accelerating to start walking and decelerating to stop walking. It would also be a challenge to cope with such speed-transition sequences in the future.

Acknowledgments This work was supported by JSPS Grant-in-Aid for Scientific Research (S)
21220003.

References

1. Aqmar MR, Shinoda K, Furui S (2012) Robust gait-based person identification against walking
speed variations. IEICE Trans Info Syst E95.D(2):668–676
2. Ariyanto G, Nixon M (2012) Marionette mass-spring model for 3d gait biometrics. In: Pro-
ceedings of the 5th IAPR international conference on biometrics, pp 354–359
3. Bashir K, Xiang T, Gong S (2009) Gait representation using flow fields. In: Proceedings of the
20th British machine vision conference. London
4. Bashir K, Xiang T, Gong S (2010) Gait recognition without subject cooperation. Pattern Recogn
Lett 31(13):2052–2060
5. BBC News (2008) How biometrics could change security. http://news.bbc.co.uk/2/hi/
programmes/click_online/7702065.stm
6. Bobick A, Johnson A (2001) Gait recognition using static activity-specific parameters. In:
Proceedings of the 14th IEEE conference on computer vision and pattern recognition, vol 1,
pp 423–430
7. Bouchrika I, Goffredo M, Carter J, Nixon M (2011) On using gait in forensic biometrics.
J Forensic Sci 56(4):882–889
8. Bouchrika I, Nixon M (2008) Exploratory factor analysis of gait recognition. In: Proceedings
of the 8th IEEE international conference on automatic face and gesture recognition, pp 1–6.
Amsterdam, The Netherlands
9. Boulgouris N, Plataniotis K, Hatzinakos D (2006) Gait recognition using linear time normal-
ization. Pattern Recogn 39(5):969–979
10. Boulgouris N, Plataniotis K, Hatzinakos D (2004) Gait recognition using dynamic time warp-
ing. In: Proceedings of the IEEE 6th workshop on multimedia signal processing, pp 263–266.
doi:10.1109/MMSP.2004.1436543
11. Cunado D, Nixon M, Carter J (2003) Automatic extraction and description of human gait
models for recognition purposes. Comput Vis Image Underst 90(1):1–41
12. Cuntoor N, Kale A, Chellappa R (2003) Combining multiple evidences for gait recognition.
In: Proceedings of the 28th IEEE international conference on acoustics, speech, and signal
processing, vol 3, pp 33–36
13. Gavrila DM (1999) The visual analysis of human movement: a survey. Comput Vis Image
Underst 73(1):82–98
14. Gross R, Shi J (2001) The cmu motion of body (mobo) database. Technical report, CMT
15. Han J, Bhanu B (2006) Individual recognition using gait energy image. IEEE Trans Pattern
Anal Mach Intell 28(2):316–322
16. Iwama H, Muramatsu D, Makihara Y, Yagi Y (2013) Gait verification system for criminal
investigation. IPSJ Trans Comput Vis Appl 5:163–175
17. Iwama H, Okumura M, Makihara Y, Yagi Y (2012) The ou-isir gait database comprising
the large population dataset and performance evaluation of gait recognition. IEEE Trans Info
Forensic Secur 7(5):1511–1521
18. Kusakunniran W, Wu Q, Zhang J, Li H (2012a) Gait recognition under various viewing angles
based on correlated motion regression. IEEE Trans Circ Syst Video Technol 22(6):966–980
19. Kusakunniran W, Wu Q, Zhang J, Li H (2012b) Gait recognition across various walking speeds
using higher order shape configuration based on a differential composition model. IEEE Trans
Syst Man Cybern B Cybern 42(6):1654–1668. doi:10.1109/TSMCB.2012.2197823
20. Lam THW, Cheung KH, Liu JNK (2011) Gait flow image: a silhouette-based gait representation
for human identification. Pattern Recogn 44:973–987. http://dx.doi.org/10.1016/j.patcog.2010.
10.011
21. Liu Z, Sarkar S (2004) Simplest representation yet for gait recognition: averaged silhouette.
In: Proceedings of the 17th international conference on pattern recognition, vol 1, pp 211–214
22. Liu Z, Sarkar S (2006) Improved gait recognition by gait dynamics normalization. IEEE Trans
Pattern Anal Mach Intell 28(6):863–876. http://doi.ieeecomputersociety.org/10.1109/TPAMI.
2006.122
228 Y. Makihara et al.

23. Makihara Y, Mannami H, Tsuji A, Hossain M, Sugiura K, Mori A, Yagi Y (2012) The ou-isir
gait database comprising the treadmill dataset. IPSJ Trans Comput Vis Appl 4:53–62
24. Makihara Y, Sagawa R, Mukaigawa Y, Echigo T, Yagi Y (2006) Gait recognition using a view
transformation model in the frequency domain. In: Proceedings of the 9th European conference
on computer vision, pp 151–163. Graz, Austria
25. Makihara Y, Yagi Y (2008) Silhouette extraction based on iterative spatio-temporal local color
transformation and graph-cut segmentation. In: Proceedings of the 19th international confer-
ence on pattern recognition. Tampa, Florida
26. Mori A, Makihara Y, Yagi Y (2010) Gait recognition using period-based phase synchronization
for low frame-rate videos. In: Proceedings of the 20th international conference on pattern
recognition, pp 2194–2197. Istanbul, Turkey
27. Mowbray S, Nixon M (2003) Automatic gait recognition via fourier descriptors of deformable
objects. In: Proceedings of the 1st IEEE international conference on advanced video and signal
based surveillance, pp 566–573
28. Muramatsu D, Makihara Y, Yagi Y (2013) Quality-dependent view transformation model for
cross-view gait recognition. In: Proceedings of the IEEE 6th international conference on bio-
metrics: theory, applications and systems. Paper ID: 12, pp 1–8, Washington DC
29. Murase H, Nayar SK (1995) Visual learning and recognition of 3-d objects from appearance. Int
J Comput Vis 14(1):5–24. doi:10.1007/BF01421486, http://dx.doi.org/10.1007/BF01421486
30. Murase H, Sakai R (1996) Moving object recognition in eigenspace representation: gait analysis
and lip reading. Pattern Recogn Lett 17:155–162
31. Nixon M, Carter J, Shutler J, Grant M (2001) Experimental plan for automatic gait recognition.
Technical report, Southampton
32. Niyogi S, Adelson E (1994) Analyzing and recognizing walking figures in xyt. In: Proceedings
of the 7th IEEE conference on computer vision and pattern recognition, pp 469–474
33. Phillips P, Moon H, Rizvi S, Rauss P (2000) The feret evaluation methodology for face-
recognition algorithms. IEEE Trans Pattern Anal Mach Intell 22(10):1090–1104
34. Poh N, Kittler J, Bourlai T (2007) Improving biometric device interoperability by likelihood
ratio-based quality dependent score normalization. In: Proceedings of the IEEE 3rd interna-
tional conference on biometrics: theory, applications and systems, pp 1–5
35. Sarkar S, Phillips J, Liu Z, Vega I, ther PG, Bowyer K (2005) The humanid gait challenge prob-
lem: data sets performance and analysis. IEEE Trans of Pattern Anal Mach Intell 27(2):162–177
36. Tanawongsuwan R, Bobick A (2001) Gait recognition from time-normalized joint-angle trajec-
tories in the walking plane. In: Proceedings of the 14th IEEE computer society conference on
computer vision and pattern recognition, vol 2, pp 726–731. doi:10.1109/CVPR.2001.991036
37. Tanawongsuwan R, Bobick A (2003) Performance analysis of time-distance gait parameters
under different speeds. In: Proceedings of the 4th international conference on audio- and video-
based biometric person authentication, pp 715–724
38. Tanawongsuwan R, Bobick A (2004) Modelling the effects of walking speed on appearance-
based gait recognition. In: Proceedings of the 17th IEEE computer society conference on
computer vision and pattern recognition, vol 2, pp 783–790. http://doi.ieeecomputersociety.
org/10.1109/CVPR.2004.158
39. Tan D, Huang K, Yu S, Tan T (2006) Efficient night gait recognition based on template matching.
In: Proceedings of the 18th international conference on pattern recognition, vol 3. Hong Kong,
China, pp 1000–1003
40. Tomasi C, Takeo K (1992) Shape and motion from image streams under orthography: a fac-
torization method. Int J Comput Vis 9:137–154
41. Urtasun R, Fua P (2004) 3d tracking for gait characterization and recognition. In: Proceedings
of the 6th IEEE international conference on automatic face and gesture recognition, pp 17–22
42. Wang L, Ning H, Tan T, Hu W (2003) Fusion of static and dynamic body biometrics for gait
recognition. In: Proceedings of the 9th international conference on computer vision, vol 2,
pp 1449–1454
43. Wang C, Zhang J, Wang L, Pu J, Yuan X (2012) Human identification using temporal informa-
tion preserving gait template. IEEE Trans Pattern Anal Mach Intell 34(11):2164–2176. doi:10.
1109/TPAMI.2011.260
8 Speed-Invariant Gait Recognition 229

44. Yam C, Nixon M, Carter J (2001) Extended model based automatic gait recognition of walking
and running. In: Proceedings of the 3rd international conference on audio and video-based
person authentication, pp 278–283. Halmstad, Sweden
45. Yu S, Wang L, Hu W, Tan T (2004) Gait analysis for human identification in frequency domain.
In: Proceedings of the 3rd international conference on image and graphics, pp 282–285
46. Zhao G, Liu G, Li H, Pietikainen M (2006) 3d gait recognition using multiple cameras. In:
Proceedings of the 7th international conference on automatic face and gesture recognition,
pp 529–534
Chapter 9
Quality Induced Multiclassifier Fingerprint
Verification Using Extended Feature Set

Mayank Vatsa

Abstract Automatic fingerprint verification systems use ridge flow patterns and
general morphological information for broad classification and minutia features for
verification. With the availability of high resolution fingerprint sensors, it is now
feasible to capture more intricate features such as ridges, pores, permanent scars,
and incipient ridges. These fine details are characterized as level-3 or extended fea-
tures and play an important role in matching and improving the verification accuracy.
The main objective of this research is to develop a quality induced multi-classifier
fingerprint verification algorithm that incorporates both level-2 and level-3 features.
A quality assessment algorithm is developed that uses the Redundant Discrete Wavelet Transform to extract edge, noise, and smoothness information in local regions and encodes it into a quality vector. The feature extraction algorithm first registers the
gallery and probe fingerprint images using a two-stage registration process. Then, a
fast Mumford-Shah curve evolution algorithm is used to extract four level-3 features
namely, pores, ridge contours, dots, and incipient ridges. Gallery and probe features
are matched using Mahalanobis distance measure and quality based likelihood ratio
approach. Further, the quality induced sum rule fusion algorithm is used to combine
the match scores obtained from level-2 and level-3 features. The experiments per-
formed on 1000 ppi (pixels per inch) fingerprint databases show the effectiveness of
the proposed algorithms.

9.1 Introduction

Fingerprints are considered a reliable means of identifying individuals and are used in both biometric and forensic applications. In biometric applications, they are used for physical access control, border security, watch lists, background checks, and national ID systems,

M. Vatsa (B)
IIIT, New Delhi, India
e-mail: mayank@iiitd.ac.in


Fig. 9.1 Example of a rolled, b slap, and c partial fingerprint images

whereas in forensic applications, they are used for latent fingerprint matching in crime scene investigation and to apprehend criminals and terrorists. The use of fingerprints for establishing identity started in the sixteenth century, and since then several studies have addressed their anatomical formation, classification, recognition, feature categorization, individuality, and many other aspects [1, 5, 7, 10, 11, 18, 19, 21, 23, 24, 28, 29, 31, 32]. Over the last two decades, research in fingerprint recognition has seen tremendous growth. Several automated systems have been developed and used for civil, military, and forensic applications. FBI-AFIS, US border security, the EU passport/ID system, and India's UIDAI (Aadhaar) project are a few examples of large-scale applications of fingerprint biometrics.
In fingerprint recognition, there are two fundamental steps: (i) acquisition and
(ii) matching. Fingerprint images are acquired using either ink-paper based offline
acquisition approach or automated sensors such as optical and solid state sensors
based live-scan acquisition approach. The acquired fingerprint images provide the
impression of ridges and furrows that are used in matching. Further, the acquired
image may be rolled, slap or partial. Figure 9.1 shows examples of rolled, slap,
and partial images. Fingerprint matching is the process in which a probe image is
matched with the gallery image(s) for classification, verification, or identification
depending on the application. In classification, a fingerprint is classified into a global
category such as arch, loop, and whorl using the global patterns and morphological
information. In verification (1 : 1 matching), the identity of a probe image is verified
using details such as minutiae and pores. Finally, fingerprint identification establishes
the identity of a probe image against a gallery of fingerprint images, i.e. 1 : N
matching.
Fingerprint features are characterized into three levels:
• Level-1 Features: Macro details such as pattern type, ridge flow and morpholog-
ical features are termed as level-1 features. Figure 9.2 shows examples of level-1
features such as arch, tented arch, right loop, left loop, double loop, and whorl.
• Level-2 Features: Galton features are referred to as level-2 features. These features
are ridge ending and ridge bifurcation as shown in Fig. 9.3a.

Fig. 9.2 Fingerprint images with level-1 features: a arch, b tented arch, c right loop, d left loop,
e double loop, and f whorl


Fig. 9.3 Examples of a level-2 minutiae features and b level-3 features namely pores, ridge con-
tours, dots, and incipient ridge

• Level-3 Features: The ANSI/NIST Committee to Define an Extended Fingerprint Feature Set (CDEFFS) [3] has defined micro features such as pores, ridge contours, dots, and incipient ridges as level-3 features. Figure 9.3b shows an example of level-3 features in a fingerprint image.
In general, automatic fingerprint verification algorithms use level-1 features
for broad classification and level-2 minutia features for verification [19]. Foren-
sic experts, on the other hand, use manually marked level-3 features along with

Fig. 9.4 Fingerprint images of an individual captured at different instances

Fig. 9.5 Difference in image quality can cause reduced performance

Fig. 9.6 Gallery and probe fingerprint images can have orientation and deformation variations

level-1 and level-2 features for matching latent and partial fingerprints. Unlike level-
2 features, level-3 features have not been extensively investigated by the research
community. Very few researchers have proposed algorithms for level-3 based fin-
gerprint verification [6, 14, 16, 17, 20, 26, 34–37]. These algorithms perform well
with good quality images but have high computational complexity.
Sensor noise, poor quality, and partial capture may affect the performance of
fingerprint verification algorithms. For example, Fig. 9.4 shows an example in which
fingerprint images of an individual are captured at five different instances. This example illustrates how the person's interaction with the sensor can affect the image quality and the biometric features. Specifically, the gallery and probe images may
have differences due to image quality (Fig. 9.5), orientation, deformation (Fig. 9.6),
and reduced amount of overlap (Fig. 9.7). Further, there are cases when it is very
difficult to capture good quality images. Figure 9.8 shows images of an individual that

Fig. 9.7 A reduced amount of overlap between two images of the same individual can affect the
verification performance

Fig. 9.8 An example illustrating a case when it is difficult to capture good quality fingerprint
images

are captured in different sessions; the image quality does not improve even when more pressure is applied. Also, Fig. 9.9 shows a fully developed fingerprint pattern
where accurate feature extraction is not viable. Finally, scars and warts also affect
the feature extraction process as they are difficult to analyze (Fig. 9.10).
This chapter focuses on developing a quality induced fast fingerprint verifica-
tion algorithm that uses level-2 and level-3 features. The algorithm first computes
an image quality vector using Redundant Discrete Wavelet Transform that encodes
the degree of irregularity present in the local regions. A fast level-3 feature extrac-
tion algorithm using the curve evolution technique is proposed for extracting four
level-3 features namely pores, ridge contours, dots, and incipient ridges. A composite

Fig. 9.9 An example illustrating a case when fingerprint is fully developed but features are not
distinct

Fig. 9.10 There are some features such as scars and warts that are difficult to analyze and use for
fingerprint verification

feature vector is generated for matching by incorporating the local image quality
score. Finally, quality induced sum rule is utilized to further enhance the fingerprint
verification accuracies. The experiments on a 1000 ppi fingerprint database contain-
ing both rolled and slap fingerprint images show the effectiveness of the proposed
algorithms.

9.2 Proposed Level-3 Features Based Verification Algorithm

Figure 9.11 illustrates the steps involved in the proposed feature extraction and match-
ing algorithm. The proposed algorithm starts with computing the fingerprint image
quality score followed by a two-stage registration process using level-2 features [34].
Level-2 minutiae are extracted from a fingerprint image using the ridge tracing minu-
tiae extraction algorithm [15]. The extracted minutiae are matched using a dynamic
bounding box based matching algorithm [13]. The details of the minutiae extraction
and matching algorithms can be found in [13, 15]. Next, the level-3 features are
extracted using a curve evolution approach and matched using likelihood ratio based
measure. This section presents the algorithm in detail.

Fig. 9.11 Illustrating the steps involved in the proposed feature extraction and matching algorithm

Fig. 9.12 Fingerprint image is partitioned into small windows for estimating the quality score

9.2.1 Local Fingerprint Image Quality Assessment

In general, the performance of a fingerprint recognition algorithm depends on the quality of the probe image. Image quality can vary locally in a fingerprint image, and factors such as sensor noise, pressure, and wetness can affect the quality. Therefore, it is important to compute the image quality locally and encode the edge information, smoothness, and noise present in the fingerprint image.
Let F be a high resolution fingerprint image. As shown in Fig. 9.12, F is divided into windows of size n × m. For the kth block, F^k, the RDWT decomposition is computed for j = 1, . . . , l levels:¹

$$\left[F^k_{A_j}, F^k_{H_j}, F^k_{V_j}, F^k_{D_j}\right] = \mathrm{RDWT}(F^k) \qquad (9.1)$$

where i = A, H, V, D represents the approximation, horizontal, vertical, and diagonal subbands. For each block, the quality score is computed using Eq. 9.2.

¹ The size of blocks depends on the image characteristics. In our experiments, a block size of 16 × 16 yields the best results. The RDWT decomposition is applied up to three levels.

a kA bkA + a kH bkH + aVk bkV + a kD bkD


qk = (9.2)
bkA + bkH + bkV + bkD

where 
 k   

l  μ − lj=1 n,m F k (x, y) 2
aik = ln  i j x,y=1 i j
/nm (9.3)
j=1
σikj

and  2


l  1
bik = ln  n,m /nm (9.4)
j=1
1+ x,y=1 ∈ Fi j (x,
k y)

Here, μikj and σikj are the mean and standard deviation of the RDWT coefficients of
the ith subband and the jth level respectively, and ∈ denotes the gradient operator.
Finally, the quality score, q k , is normalized in the range of [0, 1] using min-max
normalization.
Multilevel RDWT decomposition provides the per-subband noise relationship and
the spatial and frequency information which are useful in computing the edge and
noise information [9]. The algorithm also performs error normalization by incor-
porating the weight factor, bi , and encoding the degree of irregularity in the local
regions. The proposed local quality assessment algorithm yields a quality score for
every local region. Thus, a quality score vector, q, is computed for the complete
fingerprint image and is used during recognition.
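For concreteness, the following Python sketch mirrors this block-wise quality computation; it is an illustrative approximation, not the chapter's implementation. PyWavelets' stationary wavelet transform (pywt.swt2) stands in for the RDWT, the 16 × 16 block size and three-level decomposition follow the text, and the wavelet choice ('db1'), the small numerical guards, and the function names are assumptions.

```python
import numpy as np
import pywt

def block_quality(block, wavelet="db1", levels=3):
    """Quality score q^k for one n x m block (sketch of Eqs. 9.2-9.4)."""
    n, m = block.shape
    # stationary (redundant) wavelet decomposition of the block, as in Eq. 9.1
    coeffs = pywt.swt2(block.astype(float), wavelet, level=levels)
    a = {s: 0.0 for s in "AHVD"}   # edge/contrast terms (Eq. 9.3)
    b = {s: 0.0 for s in "AHVD"}   # smoothness/noise terms (Eq. 9.4)
    for cA, (cH, cV, cD) in coeffs:            # one tuple per level j
        for name, sub in zip("AHVD", (cA, cH, cV, cD)):
            mu, sigma = sub.mean(), sub.std() + 1e-12
            a[name] += ((mu - sub.sum()) / sigma) ** 2
            gy, gx = np.gradient(sub)
            b[name] += (1.0 / (1.0 + np.hypot(gx, gy).sum())) ** 2
    a = {s: np.log(v / (n * m) + 1e-12) for s, v in a.items()}
    b = {s: np.log(v / (n * m) + 1e-12) for s, v in b.items()}
    return sum(a[s] * b[s] for s in "AHVD") / (sum(b[s] for s in "AHVD") + 1e-12)

def quality_vector(image, block=16):
    """Quality vector q over non-overlapping blocks, min-max normalised to [0, 1]."""
    h, w = image.shape
    q = np.array([block_quality(image[r:r + block, c:c + block])
                  for r in range(0, h - block + 1, block)
                  for c in range(0, w - block + 1, block)])
    return (q - q.min()) / (q.max() - q.min() + 1e-12)
```

The resulting per-block scores form the quality vector q that is reused later during level-3 matching and score fusion.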

9.2.2 Level-3 Pore and Ridge Feature Extraction and Matching

The next step in the proposed algorithm is registering gallery and probe images
using 2-stage registration approach [34]. The algorithm first performs Taylor series
optimization based coarse registration followed by Thin Plate Spline based fine regis-
tration. After registering the gallery and probe images, level-3 features are extracted.
The level-3 feature extraction algorithm employs level-set based curve evolution [30]
which begins with the energy functional [4, 33].

$$E(C) = \int_0^1 \left(E_{in}(C(q)) + E_{out}(C(q))\right)\,dq \qquad (9.5)$$

where C(q) : [0, 1] → R² is a planar curve that can be applied on the image F : [0, x] × [0, y] → R⁺ to detect feature boundaries. E_in and E_out are the energy functionals inside and outside the curve respectively. The energy functional can be further decomposed into Eq. 9.6 using the input image F.

$$E(C, C_1, C_2) = \alpha \int_{\Omega} \phi\, \|\bar{C}'\|\, dx\,dy + \beta \int_{in(C)} |F - C_1|^2\, dx\,dy + \lambda \int_{out(C)} |F - C_2|^2\, dx\,dy \qquad (9.6)$$

where α, β and λ are positive constants such that α + β + λ = 1 and α < β ≤ λ. Ω represents the image domain, φ is the stopping term, C̄ is the evolution curve, i.e., C̄ = {(x, y) : ψ̄(x, y) = 0}, and C_1 and C_2 are the average pixel values inside and outside the curve respectively. Further, parameterizing the energy equation by an artificial time t ≥ 0 and deducing the associated Euler-Lagrange equation leads to the following active contour model,

$$\bar{\psi}_t = \alpha\phi(\bar{\nu} + \varepsilon_k)\,|\nabla\bar{\psi}| + \nabla\phi \cdot \nabla\bar{\psi} + \beta\delta(F - C_1)^2 + \lambda\delta\bar{\psi}(F - C_2)^2 \qquad (9.7)$$

where ν̄ is the advection term and ε_k is the curvature based smoothing term. ∇ is the gradient and δ = 0.5/(π(x² + 0.25)). The stopping term φ is set to

$$\phi = \frac{1}{1 + |\nabla F|^2}. \qquad (9.8)$$

The basic concept of feature boundary extraction using level-set curve evolution
is to initialize the contour ψ̄ over the image F and the contour ψ̄ converges to the
boundaries based on the stopping term φ. The curve evolution efficiently segments the
contours present in the image irrespective of the quality of the image. However, this
procedure is time consuming and depends on the initial ψ̄. On a 1000 ppi fingerprint image, if the initial contour is initialized as a grid, level-set curve evolution algorithms
require around 4–30 s to find the feature boundaries. For real world biometric systems,
such high computational complexity is not pragmatic. To address this issue, we
propose a scheme that utilizes the scale multiplication based Canny edge detection
[2] for finding the initial ψ̄ and then applies a two-cycle fast curve evolution algorithm
with smoothness regularization [30]. The contour extraction algorithm is as follows:
Step 1: The scale multiplication based edge detection algorithm (which multiplies filter responses at adjacent scales to enhance the edge structure and detects edges as the local maxima) [2] is applied on the input fingerprint image. The output of edge
detection is a binary image in which the edges are black and non-edge regions are
white. The detected regions are very close to the exact feature boundaries but may
also have spurious edges due to noise. To remove these spurious edges, every non-
connecting edge of 1–2 pixels is removed by using morphological operations.
Step 2: In the next step, the detected edges are used as the initial contour ψ̄ and the
two-cycle curve evolution algorithm [30] is applied in which the first cycle is data
dependent and the second cycle is Gaussian filter band smoothing. Since the initial
contour is very close to the exact feature boundaries, using the stopping term φ, the

Fig. 9.13 Illustrating the intermediate images in the proposed feature extraction algorithm: a input
fingerprint image, b edge detected image, c gradient image (stopping term), and d fingerprint contour

Fig. 9.14 Additional examples of a input fingerprint image and b extracted fingerprint contour

curve evolution algorithm converges to the feature boundaries after only a few itera-
tions. Figure 9.13a–d shows an example of the edge detection and contour extraction
procedure. Additional examples of contour extraction are shown in Fig. 9.14. These
examples show that due to the stopping term, the noise present in the fingerprint
image has very little effect on contour extraction.
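As a rough illustration of Step 1 and of how Step 2 is initialized (under assumed parameters, not the chapter's code), the sketch below multiplies gradient responses at two adjacent Gaussian scales to approximate the scale multiplication edge detector [2], discards 1–2 pixel spurious fragments, and computes the stopping term of Eq. 9.8; the two-cycle fast level-set evolution [30] itself is only stubbed.

```python
import numpy as np
from scipy import ndimage

def stopping_term(image):
    """phi = 1 / (1 + |grad F|^2), as in Eq. 9.8."""
    grad = ndimage.gaussian_gradient_magnitude(image.astype(float), sigma=1.0)
    return 1.0 / (1.0 + grad ** 2)

def initial_contour(image, sigmas=(1.0, 2.0), keep_fraction=0.05):
    """Approximate scale-multiplication edge map used as the initial contour."""
    img = image.astype(float)
    product = np.ones_like(img)
    for s in sigmas:                      # multiply responses at adjacent scales
        product *= ndimage.gaussian_gradient_magnitude(img, sigma=s)
    edges = product >= np.quantile(product, 1.0 - keep_fraction)
    labels, n = ndimage.label(edges)      # drop spurious 1-2 pixel fragments
    sizes = ndimage.sum(edges, labels, index=np.arange(1, n + 1))
    keep = np.isin(labels, np.flatnonzero(sizes > 2) + 1)
    return keep.astype(np.uint8)

# psi0 = initial_contour(fingerprint)       # close to the true feature boundaries
# phi  = stopping_term(fingerprint)         # slows the evolution at strong edges
# contour = two_cycle_evolution(psi0, phi)  # fast level-set refinement [30], not shown
```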
Once the contour extraction algorithm provides the final fingerprint contour ψ̄, it is scanned using the standard contour tracing technique [25] from top to bottom and left to right consecutively to classify the fingerprint features as pores, ridges, and dots. The tracing technique uses the following model-based approach (a schematic code rendering of these rules is given after the list):
1. A blob of size greater than 2 pixels and less than 40 pixels is classified as a pore. Therefore, noisy contours, which are sometimes wrongly extracted, are not included in the feature set. A pore is approximated with a circle and its center is used as the pore feature.
2. The ridge contour (edge of a ridge) features are the x, y coordinates of the pixel and the direction of the contour at that pixel.
3. Any blob of size less than 0.02″ that does not lie on a ridge is marked as a dot and the corresponding x, y coordinates are stored.
4. If a blob or ridge structure is substantially thinner than the local ridges (width less than 40 % of the average local ridge width) and its size is greater than 0.02″, it is marked as an incipient ridge. Further, if the incipient ridge is a series of clearly separated dots, these are marked as separate incipient ridges. The x, y coordinates of the endpoints and the distance between them are stored as the incipient ridge features.
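The rules above can be summarized schematically as follows; the helper inputs (blob area in pixels, extent in inches at 1000 ppi, an on-ridge flag, and the ratio of blob width to the average local ridge width) are hypothetical quantities assumed to be available from the traced contour.

```python
def classify_blob(area_px, length_in, on_ridge, width_ratio):
    """Schematic rendering of the model-based tracing rules (not the chapter's code)."""
    if 2 < area_px < 40:
        return "pore"                # rule 1: small closed blob
    if length_in < 0.02 and not on_ridge:
        return "dot"                 # rule 3: short blob off the ridge
    if width_ratio < 0.40 and length_in >= 0.02:
        return "incipient_ridge"     # rule 4: thin, elongated structure
    return "ridge_contour"           # rule 2: remaining contour pixels belong to ridge edges
```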

These features and tracing procedure follow the standards defined by the ANSI/
NIST CDEFFS [3]. As shown in Fig. 9.15, fine details and structures present in the
fingerprint contour are traced and categorized into level-3 features depending on
the shape and attributes. Additional examples of tracing algorithms are shown in
Figs. 9.13d, 9.16, 9.17, 9.18, and 9.19. Each of the level-3 features is stored in a separate matrix. The size of each feature matrix depends on the size of the fingerprint image and the features present in it. For matching, the Mahalanobis distance between corresponding level-3 feature matrices of the gallery and probe images is computed. The Mahalanobis distance between two vectors f̄_G and f̄_P is defined as

$$MD(\bar{f}_G, \bar{f}_P) = (\bar{f}_G - \bar{f}_P)^t\, S^{-1}\, (\bar{f}_G - \bar{f}_P) \qquad (9.9)$$

where f̄_G and f̄_P are the two feature vectors to be matched, and S is the positive definite covariance matrix of f̄_G and f̄_P. The Mahalanobis distance MD(f̄_G, f̄_P) is used as the match score of level-3 features. It ensures that features with high variance do not dominate the distance, which reduces the false reject rate at a fixed false accept rate. Using this measure, four distance scores, MD(i) (i = 1, . . . , 4), associated with the individual level-3 features are computed.
To compute a level-3 feature based match score, we next propose the use of a quality-based likelihood ratio approach [22]. The quality score vector, q, of the fingerprint image and the distance score, MD(i), are given as input to a Gaussian Mixture Model based density estimation algorithm [38]. For a gallery-probe pair, let g_gen(MD(i), q) and g_imp(MD(i), q) be the genuine and impostor joint marginal densities respectively. The quality-based likelihood ratio for a gallery-probe pair is computed using Eq. 9.10,

$$S = \prod_{i=1}^{4} \frac{g_{gen}(MD(i), q)}{g_{imp}(MD(i), q)}. \qquad (9.10)$$


Fig. 9.15 Tracing of fingerprint contour and categorization of level-3 pore and ridge features:
a original fingerprint segment, b contour, c ridge features, d pores, e incipient ridges, and f dots

Finally, the decision of accept or reject is made using the threshold t.


$$\text{Decision} = \begin{cases} \text{accept} & \text{if } S \geq t \\ \text{reject} & \text{otherwise} \end{cases} \qquad (9.11)$$
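A minimal sketch of this matching stage is given below, under stated assumptions: Mahalanobis distances (Eq. 9.9) are computed per level-3 feature, GMM joint densities of (distance, quality) are learned from training scores [38], and the quality-based likelihood ratio (Eq. 9.10) is thresholded as in Eq. 9.11. The feature vectors, the training samples, the number of mixture components, the use of a scalar quality summary (e.g., the median of q), and the threshold t are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mahalanobis(f_g, f_p, cov):
    """Squared-form Mahalanobis distance between gallery and probe features (Eq. 9.9)."""
    d = f_g - f_p
    return float(d @ np.linalg.inv(cov) @ d)

def fit_density(samples, components=2):
    """samples: (N, 2) array with columns [MD(i), quality]; returns a fitted GMM."""
    return GaussianMixture(n_components=components).fit(samples)

def level3_score(md_scores, q, gen_gmms, imp_gmms):
    """Quality-based likelihood ratio over the four level-3 features (Eq. 9.10)."""
    s = 1.0
    for md, g_gen, g_imp in zip(md_scores, gen_gmms, imp_gmms):
        x = np.array([[md, q]])
        # score_samples returns log-densities, so exponentiate the difference
        s *= np.exp(g_gen.score_samples(x)[0] - g_imp.score_samples(x)[0])
    return s

def decide(score, t):
    return "accept" if score >= t else "reject"      # Eq. 9.11
```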

9.3 Integrating Image Quality in Level-2 and Level-3 Fingerprint Match Score Fusion
Multibiometric fusion can be accomplished at several different levels in a biometric
system [12]; however, fusion at the match score level has been extensively studied in
the literature primarily due to the ease of accessing match scores using commercial

Fig. 9.16 Pores extracted using the proposed algorithm

Fig. 9.17 Ridge contours extracted using the proposed algorithm

Fig. 9.18 x, y coordinates of the dots (red) are stored as dot features

software and the trade-off that it offers in terms of information complexity compared to other levels of fusion. Fusion at the match score level involves combining the match scores generated by multiple classifiers (or matchers) to render a decision about the identity of the subject. In the context of level-2 and level-3 feature based recognition, it is observed that the match scores obtained from these two matching algorithms have limited correlation (Fig. 9.20). This analysis suggests that recognition accuracy can be further improved if these match scores are fused.

Fig. 9.19 End points of incipient ridges (red) are marked after tracing the contour

Fig. 9.20 Scatter plot of match scores corresponding to the level-2 and level-3 fingerprint features (level-2 score on the horizontal axis, level-3 score on the vertical axis; genuine and impostor scores shown)

Therefore, in this chapter, a match score fusion algorithm is used to combine level-2
and level-3 match scores. Here, an evidence theoretic sum rule algorithm is presented
for combining image quality and match scores obtained from level-2 and level-3 fea-
tures. First, the formulation of the evidence-theoretic sum rule using belief functions
is described followed by the fusion algorithm. Since biometric verification is a two-
class problem with the classes being genuine and impostor, the fusion algorithm
is formulated as a two-class problem. Belief functions are defined on the frame of discernment that consists of a finite set of exhaustive and mutually exclusive hypotheses. Let Θ = {θ_gen, θ_imp} be the frame of discernment, and let θ_gen and θ_imp be the hypotheses belonging to the genuine and impostor classes respectively.

9.3.1 Formulation of Evidence Theoretic Sum Rule

In the evidence theoretic sum rule, the belief function, which is also known as the basic probability assignment (bpa), is defined as m(·) : Θ → [0, 1], such that m(θ_gen) + m(θ_imp) = 1. Here, m(θ_gen) represents the belief of the data being genuine and m(θ_imp)


Fig. 9.21 Illustrating the steps involved in the proposed evidence theoretic sum rule

represents the belief of data being impostor. For a given match score, if m(θgen ) =
0.7, then m(θimp ) = 0.3.
Figure 9.21 shows the steps involved in the proposed evidence theoretic sum rule.
For fusion, the match scores obtained by matching level-2 and level-3 features are
first transformed into basic probability assignments, m 1 (·) and m 2 (·), and then fused
using Eqs. 9.12 and 9.13.
$$m_{fused}(\theta_{gen}) = \frac{m_1(\theta_{gen}) + m_2(\theta_{gen})}{2} \qquad (9.12)$$

$$m_{fused}(\theta_{imp}) = \frac{m_1(\theta_{imp}) + m_2(\theta_{imp})}{2} \qquad (9.13)$$
A decision to accept or reject is made using the pignistic probability (BetP) and the likelihood ratio test:

$$BetP(\theta_i) = \sum_{\theta_i \in A \subseteq \Theta} \frac{m_{fused}(A)}{|A|} \qquad (9.14)$$

where i = {gen, imp}. The pignistic probability transforms the belief assignments into probability assignments, and the likelihood ratio test is used with the threshold t for decision making, as shown in Eq. 9.15.

$$\text{Decision} = \begin{cases} \text{genuine} & \text{if } \dfrac{BetP(\theta_{gen})}{BetP(\theta_{imp})} \geq t \\ \text{impostor} & \text{otherwise} \end{cases} \qquad (9.15)$$

9.3.2 Match Score Fusion Using Evidence-Theoretic Sum Rule

Let s1 and s2 be the two match scores to be fused. It is assumed that the distribution of
match scores for every element of Θ = {θgen , θimp } follows a Gaussian distribution
as shown in Eq. 9.16.

$$p(s_i, \mu_{ij}, \sigma_{ij}) = \frac{1}{\sigma_{ij}\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{s_i - \mu_{ij}}{\sigma_{ij}}\right)^{2}\right) \qquad (9.16)$$

where μ_ij and σ_ij are the mean and standard deviation of the ith classifier corresponding to the jth element of Θ. The basic probability assignment m_i(j) is computed as

$$m_i(j) = \frac{p(s_i, \mu_{ij}, \sigma_{ij})\,\beta_{ij}}{\sum_{j=1}^{|\Theta|} p(s_i, \mu_{ij}, \sigma_{ij})\,\beta_{ij}} \qquad (9.17)$$

where β_ij is the weight factor of classifier i corresponding to the jth element of Θ. β_ij is defined as

$$\beta_{ij} = q_{med}\, V_{ij}. \qquad (9.18)$$

Here, q_med is the median of the quality score vector of the input probe image and V_ij is the verification accuracy prior computed on the training database. The verification accuracy prior is used to attune the classifier's reliability for a probe match according to the errors made on a representative training dataset. The quality score vector is computed using the redundant discrete wavelet transform based quality assessment algorithm described in Sect. 9.2.1, and the median of the quality score vector is used in the fusion algorithm. Values of both q_med and V_ij lie in the range [0, 1]. The basic probability assignments, m_i(j), are fused using Eqs. 9.12 and 9.13, and a decision of accept or reject is made by transforming the fused bpa into a probability measure (BetP) and then performing the likelihood ratio test.
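A hedged sketch of this fusion step is shown below. The Gaussian parameters (μ_ij, σ_ij), the verification accuracy priors V_ij, and the threshold t are assumed to be learned on training data; with only the two singleton hypotheses, the pignistic probability reduces to the fused bpa itself.

```python
import numpy as np

def bpa(score, params, q_med, V):
    """Basic probability assignment m_i(.) for one classifier (Eqs. 9.16-9.18).
    params and V are dicts keyed by 'gen' and 'imp'; params[j] = (mu_ij, sigma_ij)."""
    weighted = {}
    for j in ("gen", "imp"):
        mu, sigma = params[j]
        density = np.exp(-0.5 * ((score - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        weighted[j] = density * (q_med * V[j])        # beta_ij = q_med * V_ij (Eq. 9.18)
    total = sum(weighted.values())
    return {j: w / total for j, w in weighted.items()}   # Eq. 9.17

def fuse_and_decide(s2, s3, params2, params3, V2, V3, q_med, t):
    m1, m2 = bpa(s2, params2, q_med, V2), bpa(s3, params3, q_med, V3)
    m_fused = {j: (m1[j] + m2[j]) / 2 for j in ("gen", "imp")}   # Eqs. 9.12-9.13
    betp = m_fused            # BetP equals the bpa for singleton hypotheses (Eq. 9.14)
    return "genuine" if betp["gen"] / betp["imp"] >= t else "impostor"   # Eq. 9.15
```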

9.4 Databases and Algorithms used for Evaluation

The performance of the proposed level-3 feature extraction and matching algorithm
is evaluated and compared with existing level-3 feature based fingerprint recogni-
tion algorithms on 1000 ppi fingerprint databases. In this section, the databases and
algorithms used for validation are briefly discussed.

9.4.1 Fingerprint Database

To evaluate the proposed quality induced level-3 feature extraction and matching
algorithm, two sets of fingerprint databases are used. The first fingerprint database,
obtained from law enforcement agencies, contains images from 550 different classes.
For each class, there are five rolled and five slap fingerprints. The resolution of

fingerprint images is 1000 ppi to facilitate the extraction of both level-2 and level-3
features. The second fingerprint database is a non-ideal database prepared by the
author and contains 1000 ppi images pertaining to 150 classes and for each class
there are 10 slap fingerprint images. Further, from each class of the first database,
one good and one bad quality rolled fingerprints and one good and one bad quality
slap fingerprints are selected for training and the remaining images are used for
testing. Therefore, there are 1100 rolled and 1100 slap images for training and 1650
rolled and 1650 slap images for testing. Similarly, from the second database, 300
good quality and 300 bad quality slap images (150 × 4) are selected for training
and the remaining 900 images (150 × 6) are used as the test data. Using the testing
database, three sets of experiments are performed,

1. matching rolled fingerprints with rolled fingerprints from 550 classes with three
images for each class.
2. matching rolled fingerprints with slap fingerprints from 550 classes with three
images for each class.
3. matching slap fingerprints with slap fingerprints from 150 classes with six images
for each class.

9.4.2 Existing Algorithms Used for Comparison

The performance of the proposed level-3 algorithm is compared with two existing
algorithms. The algorithm by Kryszczuk, Drygajlo, and Morier [16, 17] extracts pore
information from high resolution fingerprint images by applying different techniques
such as correlation based alignment, Gabor filtering, binarization, morphological
filtering, and tracing. The match score obtained from this algorithm is a normalized
similarity score in the range of [0, 1]. The algorithm proposed by Jain, Chen, and
Demirkus [14] uses Gabor filtering and wavelet transform for pore and ridge feature
extraction and iterative closest point for matching.

9.5 Experimental Evaluation

The performance of the proposed algorithms is evaluated using the Receiver Oper-
ating Characteristics (ROC) curves that are generated by computing the genuine
accept rates at different false accept rates. The training datasets are used to learn dif-
ferent parameters including the parameters for Gaussian Mixture Model for density
estimation and evidence theoretic sum rule.
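For reference, ROC points of this kind can be produced with a small helper such as the one below (an illustrative snippet, not part of the chapter): it estimates the genuine accept rate at a target false accept rate from arrays of genuine and impostor match scores.

```python
import numpy as np

def gar_at_far(genuine_scores, impostor_scores, target_far_percent):
    """Genuine accept rate (%) at the threshold that yields the target FAR (%)."""
    # threshold above which only target_far_percent of impostor scores are accepted
    thr = np.quantile(np.asarray(impostor_scores), 1.0 - target_far_percent / 100.0)
    return float(np.mean(np.asarray(genuine_scores) >= thr) * 100.0)

# e.g. the operating point used for Table 9.1: gar_at_far(gen, imp, 0.01)
```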
As mentioned in Sect. 9.4.1, three sets of experiments are performed using the
testing datasets. The verification performance of the proposed level-3 feature extrac-
tion algorithm is computed and compared with existing level-2 [13, 15] and level-3
feature based verification algorithms [14, 16, 17]. The ROC plots in Figs. 9.22,


Fig. 9.22 Matching rolled fingerprints: ROC plot for the proposed level-3 feature extraction and
comparison with existing level-3 feature based algorithms [14, 16, 17] and minutiae based algorithm


Fig. 9.23 Matching rolled to slap fingerprints: ROC plot for the proposed level-3 feature extraction
and comparison with existing level-3 feature based algorithms [14, 16, 17] and minutiae based
algorithm

9.23, 9.24 and verification accuracies in Table 9.1 summarize the results of the three
experiments: matching rolled fingerprint, matching rolled images with slap images,
and matching slap fingerprints. The key results and analysis of our experiments are
summarized below.
1. For matching rolled fingerprints (Experiment 1), the proposed level-3 feature extraction algorithm yields a verification accuracy of 93.9 %, which is around 2–7 % better than existing algorithms. The existing level-2 minutiae based verification algorithm is optimized for 500 ppi fingerprint images, and its performance does not improve when 1000 ppi images are used. On the other hand, the level-3


Fig. 9.24 Matching slap fingerprints: ROC plot for the proposed level-3 feature extraction and
comparison with existing level-3 feature based algorithms [14, 16, 17] and minutiae based algorithm

Table 9.1 Comparison of verification performance of the proposed quality induced level-3 feature extraction and matching algorithm with existing feature extraction algorithms using fingerprints with varying number of features

Algorithm                              Rolled to rolled   Rolled to slap    Slap to slap      Average time (s)
                                       fingerprint (%)    fingerprint (%)   fingerprint (%)
Level-2 minutiae [13, 15]              91.8               90.7              80.2              03
Level-3 pores [16, 17]                 87.2               80.9              63.7              12
Level-3 pore and ridge [14]            91.0               89.1              78.1              34
Proposed Level-3 pore                  88.3               86.8              68.3              03
Proposed Level-3 ridge                 91.2               87.4              73.0              03
Proposed Level-3 dots                  88.9               88.7              70.5              01
Proposed Level-3 incipient ridge       90.0               87.6              71.1              01
Proposed quality induced
Level-3 features                       93.9               92.4              85.2              05

Verification accuracy is computed with FAR = 0.01 %

pore matching algorithm [16, 17] requires very high quality fingerprint images (≥ 2000 ppi) for feature extraction, and its performance suffers when the fingerprint quality is poor or when the amount of pressure applied during scanning affects the pore information. The algorithm proposed by Jain, Chen, and Demirkus [14] is efficient with good quality rolled fingerprint images. However, its verification accuracy decreases with poor quality fingerprints in which pore features are not clearly visible. The proposed level set based feature extraction algorithm is robust to irregularities due to noise and efficiently compensates for variations in pressure during fingerprint capture by incorporating ridge, pore, and dot features. The Mahalanobis distance measure also reduces false rejections, thus improving the verification accuracy.

2. The experiments with rolled to slap fingerprint matching (Experiment 2) show a decrease of around 1.1–6.3 % in the verification accuracy of the proposed and existing algorithms. The proposed approach exhibits the best verification performance of 92.4 %. The experimental results show that the proposed level-3 feature extraction algorithm provides good recognition performance even with a limited number of minutiae.
3. The experiments with slap fingerprints (Experiment 3) show the main advantage of the proposed algorithm. Since the database contains several non-ideal poor quality fingerprint images, the verification accuracy of existing level-3 feature based algorithms reduces significantly. The decrease in verification performance is mainly because of the poor quality images, which lead either to missing pore features or to spurious pores. The proposed algorithm, conversely, manages to yield 85.2 % accuracy, which is at least 5 % better than existing algorithms. The improved performance is because the feature extraction algorithm, most of the time, provides correct feature boundaries, and the quality factor and likelihood ratio for decision making further improve the verification accuracy. Moreover, the quality-based likelihood ratio achieves optimal performance by satisfying the Neyman-Pearson theorem [8].
4. The experiments also show that, in general, level-2 and level-3 algorithms provide correct results when both the gallery and probe are rolled fingerprint images. Similarly, when the images are good quality slap fingerprints, both level-2 and level-3 features yield correct results (Fig. 9.25a). However, there are several slap fingerprints in the database that are very difficult to recognize using level-2 features because of the inherent fingerprint structure. One such example is shown in Fig. 9.25c, where the level-2 feature extraction algorithm does not yield the correct result but the proposed level-3 feature extraction algorithm does. Further, there are cases in which both level-2 and level-3 algorithms fail (Fig. 9.25d) because of sensor noise and insufficient pressure. Such cases require image quality enhancement algorithms so that the feature extraction algorithm can extract useful features for verification.
5. Analysis of individual level-3 features shows that dots and incipient ridges are the most stable level-3 features, while pores (specifically pore shape and size) are the least stable.
6. It is observed that the match scores obtained from level-2 and level-3 features
have limited correlation (Fig. 9.20). As shown in Fig. 9.26, the proposed evidence
theoretic sum rule yields an accuracy of 94.2 % which is 1.7 % better than classical
sum rule [27]. This is because the evidence theoretic sum rule incorporates image
quality score and verification prior, and the decision making process incorporates
likelihood ratio test statistics.

Fig. 9.25 Sample cases: a when both level-2 and level-3 feature based algorithms provide the correct result, b when the level-3 feature based algorithm fails to provide the correct result whereas the level-2 feature based algorithm provides an accurate decision, c when level-3 provides correct results but level-2 provides incorrect results, and d when both algorithms fail to provide correct results


Fig. 9.26 ROC plot to evaluate the performance of the proposed evidence theoretic sum rule and
comparison with the classical sum rule

9.6 Conclusion and Future Work

With the availability of high resolution fingerprint sensors, salient features such
as pores, ridge contours, and dots are prominently visible. In this research, these
conspicuous level-3 features are utilized for fingerprint verification. An RDWT based quality assessment algorithm is used to compute the image quality score, followed by a two-stage registration algorithm to register the gallery and probe fingerprint images.
A fast feature extraction algorithm is proposed to extract detailed level-3 pore, ridge,
and dot features using level set curve evolution approach. The experiments on a
high resolution fingerprint database demonstrate the effectiveness of the proposed
algorithm with respect to both accuracy and time. The experiments also show that the
proposed algorithm outperforms existing level-2 and level-3 feature based fingerprint
recognition algorithms.
Since automatic level-3 feature extraction and matching is a relatively new tech-
nique, a scientific study is required to analyze the permanence, universality, and
sensor interoperability. Undertaking such research requires a comprehensive large
scale fingerprint database that not only contains temporal variations but also varia-
tions due to geography, gender and race. Further, the curve evolution based level-3
feature extraction algorithm can be extended such that during tracing, level-2 minu-
tia features can also be computed. This extension may reduce the computational
complexity of level-2 and level-3 feature extraction.

Acknowledgments This work was supported in part through a grant from the Department of
Science and Technology (India) under FAST scheme. The author would like to thank Prof. A.
Noore and Dr. R. Singh for their feedback and comments.

References

1. Ashbaugh DR (1999) Quantitative-qualitative friction ridge analysis: an introduction to basic and advanced ridgeology. CRC Press, Boca Raton
2. Bao P, Zhang L, Wu X (2005) Canny edge detection enhancement by scale multiplication.
IEEE Trans Pattern Anal Mach Intell 27(9):1485–1490
3. CDEFFS: the ANSI/ NIST committee to define an extended fingerprint feature set (2008)
Available at http://fingerprint.nist.gov/standard/cdeffs/index.html
4. Chan T, Vese L (2001) Active contours without edges. IEEE Trans Image Process 10(2):
266–277
5. Chapel CE (1971) Fingerprinting: a manual of identification. Coward-McCann, New York
6. Chen Y, Jain AK (2007) Dots and incipients: extended features for partial fingerprint matching.
In: Biometric symposium, biometric consortium conference
7. Cowger JF (1983) Friction ridge skin: comparison and identification of fingerprints. Elsevier,
New York
8. Duda RO, Hart PE (2000) Pattern classification. 2nd edn. Wiley, New York
9. Fowler E (2004) The redundant discrete wavelet transform and additive noise. Technical Report
MSSU-COE-ERC-04-04, Mississippi State ERC, Mississippi State University
10. Galton F (1965) Finger prints (reprint). Da Capo Press, New York
11. Henry ER (1900) Classification and uses of fingerprints. Routledge & Sons, London
12. Jain AK, Flynn P, Ross A (2007) Handbook of biometrics. Springer, New York
13. Jain AK, Hong L, Bolle R (1997) On-line fingerprint verification. IEEE Trans Pattern Anal
Mach Intell 19(4):302–314
14. Jain AK, Chen Y, Demirkus M (2007) Pores and ridges: high resolution fingerprint matching
using level 3 features. IEEE Trans Pattern Anal Mach Intell 29(1):15–27
15. Jiang XD, Yau WY, Ser W (2001) Detecting the fingerprint minutiae by adaptive tracing the
gray level ridge. Pattern Recogn 34(5):999–1013
16. Kryszczuk K, Drygajlo A, Morier P (2004) Extraction of level 2 and level 3 features for
fragmentary fingerprints. In: Proceedings of 2nd COST275 workshop, pp 83–88
17. Kryszczuk K, Morier P, Drygajlo A (2004) Study of the distinctiveness of level 2 and level 3 fea-
tures in fragmentary fingerprint comparison. In: Proceedings of ECCV international workshop
on biometric authentication, pp 124–133
18. Lee HC, Gaensslen RE (1991) Advances in fingerprint technology. Elsevier, New York
19. Maltoni D, Maio D, Jain AK, Prabhakar S (2003) Handbook of fingerprint recognition. Springer,
New York
20. Meenen P, Ashrafi A, Adhami R (2006) The utilization of a taylor series-based transformation
in fingerprint verification. Pattern Recogn Lett 27(14):1606–1618
21. Moenssens A (1971) Fingerprint techniques. Chilton Book Company, London
22. Nandakumar K, Chen Y, Dass SC, Jain AK (2008) Likelihood ratio based biometric score
fusion. IEEE Trans Pattern Anal Mach Intell 30(2):342–347
23. National Institute of Standards and Technology (1994) Guidelines for the use of advanced
authentication technology alternatives. Federal Information Processing Standards Publication,
p 190
24. Osterberg J, Parthasarathy T, Raghavan T, Sclove S (1977) Development of a mathemati-
cal formula for the calculation of fingerprint probabilities based on individual characteristic.
J Am Stat Assoc 72:772–778
25. Pavlidis T (1982) Algorithms for graphics and image processing. Computer Science Press,
Berlin
26. Roddy AR, Stosz JD (1997) Fingerprint features-statistical analysis and system performance
estimates. Proc IEEE 85(9):1390–1421
27. Ross A, Jain AK (2003) Information fusion in biometrics. Pattern Recogn Lett 24(13):
2115–2125
28. Roxburgh T (1933) On the evidential value of fingerprints. Sankhya Indian J Stat 1:189–214

29. Saviers K (1987) Friction skin characteristics: a study and comparison of proposed standards.
California Police Department, Garden Grove
30. Shi Y, Karl WC (2008) A real-time algorithm for the approximation of level-set-based curve
evolution. IEEE Trans Image Process 17(5):645–656
31. Stanley C (1994) Are fingerprints a genetic marker for handedness? Behav Genet 24(2):141
32. Stoney DA (1985) A quantitative assessment of fingerprint individuality. Ph.D. Dissertation,
University of California, Berkeley
33. Tsai A, Yezzi A Jr, Willsky AS (2001) Curve evolution implementation of the Mumford-Shah functional for image segmentation, denoising, interpolation, and magnification. IEEE Trans Image Process 10(8):1169–1186
34. Vatsa M, Singh R, Noore A, Singh SK (2009) Combining pores and ridges with minutiae for
improved fingerprint verification. Signal Process 89(12):2676–2685
35. Zhang D, Liu F, Zhao Q, Lu G, Luo N (2011) Selecting a reference high resolution for fingerprint
recognition using minutiae and pores. IEEE Trans Instrum Meas 60(3):863–871
36. Zhao Q, Jain A (2010) On the utility of extended fingerprint features: A study on pores. In:
Proceedings of IEEE computer society conference on computer vision and pattern recognition
workshops, pp 9–16
37. Zhao Q, Zhang L, Zhang D, Luo N (2009) Direct pore matching for fingerprint recognition.
In: Proceedings of third international conference on advances in biometrics, pp 597–606
38. Zivkovic Z, Heijden FVD (2004) Recursive unsupervised learning of finite mixture models.
IEEE Trans Pattern Anal Mach Intell 26(5):651–656
Chapter 10
Quality Measures for Online Handwritten
Signatures

Nesma Houmani and Sonia Garcia-Salicetti

Abstract This chapter tackles the problem of quality of online signature samples.
Several works in the literature point out signature complexity and signature stability
as main quality criteria for this behavioral biometric modality. The drawback of
these works is the measurement of such criteria separately. In this study, we propose
to analyze such criteria with a different unifying view, in terms of entropy-based
measures. We consider signature complexity as the intrinsic disorder of a signature
instance and variability as the intra-class disorder of a set of genuine signatures. We
introduce a novel statistical measure of complexity for signature samples and analyze
it relative to the Personal Entropy measure that we proposed in former works. We study the power of both measures for automatic writer categorization on several databases. We show that such categories separate, on one hand, degraded data and, on the other hand, good quality signatures. Finally, the degradation of signatures due to
mobile acquisition conditions is quantified by our entropy-based measures.

Keywords Signature complexity · Signature variability · Entropy · Hierarchical


clustering · Hidden Markov Models · Writer categories · Online signature verification

10.1 Introduction

Handwritten signature is a behavioural biometric modality that relies on a rapid personal gesture. Each resulting hand-drawn signature sample has a given level of complexity according to the writer. Indeed, some persons sign with a simple flourish

N. Houmani (B)
Laboratoire SIGMA, ESPCI ParisTech, 10 rue Vauquelin, 75005 Paris, France
e-mail: nesma.houmani@espci.fr
S. Garcia-Salicetti (B)
CNRS UMR 5157 SAMOVAR, CEA Saclay Nano-Innov, Institut Mines-Telecom/Télécom
SudParis, PC176 Bât. 861, 91191 Gif sur Yvette Cedex, France
e-mail: sonia.garcia@telecom-sudparis.eu


while others have signatures close to cursive handwriting. For this reason, complexity can be considered an important indicator for characterizing the writers of a given population through their signatures.
On the other hand, the behavioural aspect of the signature modality leads to a rather high variability from one instance to another. Nevertheless, the amount of this variability is strongly writer-dependent, in the sense that some writers have a signature that is far more variable from one instance to the next than other writers. Thus, signature variability (or, inversely, signature stability) can also be considered another important indicator for characterizing writers.
An automatic signature verification system involves two steps: the enrolment step
and the verification step. During writer enrolment, signatures can be acquired either
online (on a digitizer) as a temporal signal [1–3] or offline as a static image [1, 4, 5].
In the signature verification literature, it has been shown that in order to provide a
given level of security to an individual signer, writer enrolment must guarantee that
enrolment signatures are complex and stable enough [6–12]. Actually, stability is
required in genuine signatures in order to be able to characterize a given writer, since
the less stable a signature is, the more likely it is that a forgery gets dangerously close
to genuine signatures in terms of the metric of any classifier. Also, complex enough
signatures are required at the enrolment step to guarantee a certain level of security;
indeed, the more complex a signature is, the more it will be difficult to forge it.
In the online framework, Brault and Plamondon pointed out that when enrolling
a writer, his/her signature is acceptable only if it is complex enough [6]. The authors
proposed a “difficulty coefficient” for estimating the difficulty to reproduce a given
genuine signature as a function of the rate of geometric modifications (length and
direction of strokes) per unit of time, namely as a function of the complexity of the
hand-draw. They conclude that “problematic” signers in terms of signature verifi-
cation performance are those having signatures with a low “difficulty coefficient”,
namely not complex enough signatures.
In the offline framework, Boulétreau et al. [7, 8] proposed a novel complexity
measure based on fractal dimension and related it to signature legibility. Such a
measure allowed writer categories to be generated automatically by a Hierarchical
Clustering. The obtained categories were illustrated and characterized by different
“writing styles” in signatures: “highly cursive” writings, “very legible” writings,
“separated” writings, “badly formed” writings, and “small” writings. Unfortunately,
such resulting categories were not confronted with classifier performance.
More recently, Alonso et al. [9, 10] characterized writers by means of signature complexity and signature legibility. They manually categorized persons into four complexity categories and three legibility categories, and showed that the best performance of two classifiers was obtained on the most complex and legible signatures.
Other works focused on the variability indicator for writer characterization
through their online signatures. In [6], Brault and Plamondon stated that when
enrolling a writer, his/her signature is suitable as part of a reference set if it is
not too variable. The authors quantified the intra-class variability between differ-
ent signatures of a same writer by computing a “dissimilarity index” based on the
spatio-temporal difference between two signatures (elastic matching). Then, they

exploited this index for selecting the enrolment signatures (signatures presenting a not too high dissimilarity index).
Also for the same purpose (selecting reference signatures), Di Lecce et al. [11]
proposed a correlation criterion measuring how much local stability values computed
on different signatures, when being matched by elastic matching techniques, are
correlated. Then, the subset of signatures with highest correlation is selected as
reference set.
Finally, in [10, 12], Alonso et al. proposed both complexity and variability as
indicators for off-line signature verification. A human operator labelled signatures
according to both criteria and their impact on performance was studied. The authors
stated that the best performance is obtained on signatures showing high complexity
and high stability.
All the above-mentioned works of the literature point out that degraded signatures
are those which are not complex enough and not stable enough. It is worth noticing
that complexity can be measured on a single signature sample while stability can
only be measured on a set of signature samples of a same writer. Moreover, in the
special case of the signature Biometrics, high complexity of a signature means for
the writer an increase of spatial constraints in the pen trajectory. Such constraints
tend to increase stability between different instances of such a signature.
The aim of this chapter is to study these two indicators (complexity, variability) for
characterizing degradations in the online signature modality. We will carry out our
study in a progressive manner: we will start with the complexity indicator, since it
requires a single signature sample per writer; then, in a second step, we will take
the variability indicator into account by considering several signature instances per
writer. Following this progressive methodology, we will relate these two indicators to
the notion of disorder in a signature. Indeed, complexity and variability correspond
to disorder at two different levels: complexity can be seen as the intrinsic disorder
of the hand-drawing in a signature sample, while variability can be seen as the intra-class
disorder of a given set of a writer’s signatures.
The concept of “entropy” appears as a good candidate for quantifying disorder
[13]. In former works [14–17], we computed entropy on a set of signatures and
proposed a Personal Entropy measure to characterize a writer. In this chapter, we
analyze the two levels at which degradation occurs in signatures by exploiting the
entropy concept.
This chapter is organized as follows. In Sect. 10.2, we describe how the complexity
of a signature sample can be measured statistically by means of entropy with a Hidden Markov
Model, and propose to derive a complexity value per writer. Then,
we present the automatic writer categories obtained on the MCYT-100 database
after performing a Hierarchical Clustering on writers’ complexity values. Finally,
in order to show the efficiency of the proposed statistical complexity measure, we
compare it to a deterministic one. In Sect. 10.3, we recall how the “Personal Entropy”
measure is computed on a set of a writer’s genuine signatures, and present the automatically
retrieved writer categories. Then, we show the strong relation existing between
Personal Entropy and both signature complexity and signature variability. Finally, in
Sect. 10.4, we study the impact of mobile acquisition conditions on signatures using the
DS2-104 and DS3-104 databases, which contain the same 104 writers but differ in the
type of acquisition platform (digitizing tablet vs. mobile platform). We conclude on
the scope of this study in Sect. 10.5.

10.2 Measuring Complexity in Signatures

Our aim in this section is to exploit the entropy concept to quantify the complexity
of a genuine signature sample in a statistical manner. In fact, in the online context, a
high complexity of a signature sample corresponds to a high disorder of its trajectory
over its associated signing time.
In the following, we describe our proposal for computing the complexity of a signature
sample statistically by means of entropy.

10.2.1 Entropy for Measuring the Complexity of a Signature Instance

We consider in this chapter a signature as a sequence of two time functions, namely its
raw coordinates (x, y). Indeed, raw coordinates are the only time functions available
on all sorts of databases, whether acquired on fixed platforms (such as digitizing tablets
[2]) or on mobile platforms (such as a Smartphone [2]).
The entropy of a random variable only depends on its probability density values
[13]; therefore, a good estimation of this probability density must be performed to
compute the entropy value reliably. As the online signature is piecewise stationary,
it is natural to estimate the probability density locally, namely on portions of the
signature. In this framework, Hidden Markov Models (HMMs) [18] appear as a natural
tool, as they allow both segmenting the signature into portions and estimating the
probability density locally on each portion.
We thus consider the genuine signature of a given writer as a succession of portions,
generated by its segmentation via the writer’s HMM. Therefore, we obtain as
many portions in the signature as there are states in the HMM. Then, we consider each
point (x, y) in a given portion S_i as the outcome of a random variable Z_i (see the
top of Fig. 10.1) that follows a probability mass function p_i(z) = Pr(Z_i = z),
where z belongs to the alphabet A of ordered pairs (x, y). The random variable associated
with a given portion of the signature is discrete since its alphabet A has a finite
number of values; thus its entropy in bits is defined as

$$H(Z_i) = -\sum_{z \in S_i} p(z)\,\log_2\big(p(z)\big) \qquad (10.1)$$

Fig. 10.1 Sample complexity computation

Nevertheless, the hand-drawing as a time function is a continuous process from which
we retrieve a sequence of discrete values via a digitizer. For this reason, although
z = (x, y) is discrete, we take advantage of the continuous emission probability law
estimated on each portion by the HMM. Such a density function is modelled as a
mixture of Gaussian components.
To compute the Sample Complexity of a genuine signature instance, we first train
an HMM on this genuine signature, after computing a personalized number of states
as follows:

$$N = \frac{T}{M \times 30} \qquad (10.2)$$
where T is the number of sampled points available in the genuine signature instance,
and M = 4 is the number of Gaussian components per state. In this way, we ensure
that the number of sample points per state is at least 120, in order to obtain a good
estimation of the four-component Gaussian mixture in each state.
Then, we exploit the HMM to generate, by the Viterbi algorithm [18], the portions
on which the entropy is computed for this genuine signature. On each portion,
we use the probability density estimated by the HMM to compute this entropy locally.
We then average the entropy over all the portions of the signature and
normalize the result by the signing time of the signature sample (see Fig. 10.1). This
measure is expressed in bits per second.
Time normalization is necessary for comparing signature samples with one another
in terms of their entropy values. Indeed, signature instances generally differ in length,
and such differences can be particularly important for samples belonging to different persons.
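As an illustration of this procedure, the following sketch shows one possible implementation in Python. It assumes the hmmlearn library, uses a single Gaussian per state (a simplification of the four-component mixtures described above), and estimates each portion's entropy by a Monte Carlo average of −log2 of the density at the observed points; the function name `sample_complexity` and the parameter choices are illustrative, not the authors' reference implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from scipy.stats import multivariate_normal

def sample_complexity(xy, duration_s, points_per_state=120):
    """Entropy-based Sample Complexity (bits/second) of one signature.

    xy         : (T, 2) array of raw pen coordinates through time
    duration_s : signing time of the sample, in seconds
    """
    # Personalized number of states, in the spirit of Eq. (10.2)
    n_states = max(1, len(xy) // points_per_state)

    # Train a per-sample HMM (single Gaussian per state: a simplification
    # of the 4-component mixtures used in the chapter)
    hmm = GaussianHMM(n_components=n_states, covariance_type="full", n_iter=50)
    hmm.fit(xy)

    # Viterbi segmentation of the signature into portions (one per state)
    states = hmm.predict(xy)

    entropies = []
    for s in np.unique(states):
        portion = xy[states == s]
        density = multivariate_normal(mean=hmm.means_[s], cov=hmm.covars_[s])
        # Monte Carlo estimate of the portion entropy, in bits
        entropies.append(-np.mean(np.log2(density.pdf(portion) + 1e-12)))

    # Average over portions and normalize by signing time (bits per second)
    return np.mean(entropies) / duration_s
```

Averaging such per-sample values over several genuine signatures of the same writer yields the per-writer complexity measure used in the next subsection.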

10.2.2 Retrieving Writer Categories Based on Signature Complexity

For comparing writers in terms of their signature complexity, we define a statistical
complexity measure per person by averaging the sample complexity values of 10 genuine
signatures of each writer. This number of signatures is first fixed in accordance
with our previous works on entropy measures for signatures [14–17]; nevertheless,
a detailed study on this matter is carried out in Sect. 10.3.2.
Then, by performing a Hierarchical Clustering [19, 20] on the obtained complexity
values, we automatically generate three writer categories on the MCYT-100 database
(described below in Sect. 10.2.3). This number of categories was fixed after a detailed
study on the optimal number of categories, exploiting different validity indices
[21], given in Sect. 10.2.4.
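A minimal sketch of this categorization step is given below, assuming SciPy is available; since the chapter does not specify the linkage criterion of the Hierarchical Clustering, Ward linkage is used here as an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def complexity_categories(writer_complexities, n_categories=3):
    """Partition per-writer complexity values into categories by hierarchical clustering.

    writer_complexities : (n_writers,) array, each value being the average of the
                          sample complexities of ~10 genuine signatures of a writer.
    Returns an array of category labels in {1, ..., n_categories}.
    """
    x = np.asarray(writer_complexities, dtype=float).reshape(-1, 1)
    Z = linkage(x, method="ward")                 # agglomerative clustering tree
    return fcluster(Z, t=n_categories, criterion="maxclust")
```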

10.2.3 MCYT Data Description


Our study is carried out on the freely available and widely used MCYT-100 subset
of 100 persons [22]. Signatures are acquired on a WACOM pen tablet, model INTUOS
A6 USB. The pen tablet resolution is 2540 lines per inch, and the precision is 0.25 mm.
The maximum detection height is 10 mm (pen-up movements are also considered),
and the capture area is 127 mm (width) × 97 mm (height). The sampling frequency
was set to 100 Hz. The capture area was further divided into 37.5 mm (width) ×
17.5 mm (height) blocks, which are used as frames for acquisition. At each sampled
point of the signature, the digitizer captures pen coordinates, pen pressure (1024
pressure levels), and the azimuth and altitude pen inclination angles.
The signature corpus contains genuine signatures and shape-based highly skilled forgeries with
natural dynamics. In order to obtain the forgeries, each donor is requested to imitate
other signers by writing naturally, without artefacts such as breaks or slowdowns.
The acquisition procedure is as follows. Signer S_i writes a set of 5 genuine signatures,
and then 5 skilled forgeries of signer S_{i−1}. This procedure is repeated four more
times, imitating previous users S_{i−2}, S_{i−3}, S_{i−4} and S_{i−5}. As a result, each signer
contributes 25 genuine signatures in 5 groups of 5 signatures each, and is forged
25 times by 5 different forgers. The total number of donors in MCYT is 330. Thus,
the total number of signatures in the full database is 330 × 50 = 16500, half
of them genuine signatures and the rest forgeries. However, only a subset of 100
persons is freely available and is used for our experiments.

10.2.4 Study on the Optimal Number of Writer Categories

Our study on the optimal number of writer categories is carried out on all writers’
complexity values by means of a commonly used unsupervised learning algorithm:
Hierarchical Clustering [19, 20]. This algorithm is combined with different validity
indices [21], which are regarded as tools for quantitatively evaluating the results of
clustering algorithms.

10.2.4.1 Finding the Optimal Number of Clusters

At each step of the Hierarchical Clustering (i.e., for each value of k, the number of
clusters), we compute different validity indices from the literature, namely the C-index [23],
the Krzanowski-Laï index [24], and 4 validity indices of the so-called RMSSTD Group
[21]: the Root Mean Square Standard Deviation, R-Squared, Semi-partial R-Squared,
and Distance-between-two-clusters indices.
Based on these validity indices [21], the procedure of identifying the optimal
number of writer categories is as follows:
1. We first run the Hierarchical Algorithm on the complexity values associated with each
writer. Then, we plot the so-called “dendrogram” displaying the evolution of the
objective function (a dissimilarity measure) according to the clustering carried
out at each step.
2. By observing the important changes of the objective function in the dendrogram,
we can infer the interval of variation of k, namely k_min and k_max.
3. For each value of k between k_min and k_max, we compute the validity indices.
4. For each validity index, we plot the obtained value of the index as a function
of k.
5. Based on this plot, the optimal number of clusters can be identified following two
approaches [21]: if the validity index as a function of k is a monotonic function, the
optimal number of clusters corresponds to the value of k at which a significant local
change in the value of the index occurs; otherwise, the optimal number of clusters is
retrieved according to the intrinsic criterion of the index, for instance:
• the C-index should be minimized [23] (a sketch of its computation is given after this list);
• the Krzanowski-Laï index should be maximized [24];
• the four validity indices of the RMSSTD Group should be used simultaneously,
and the optimal number of clusters corresponds to the values of k at which a
significant variation of these four indices is observed [21].
6. Finally, the optimal number of categories that is most represented among
all the considered indices is selected (majority voting procedure).
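As an example of such a validity index, the sketch below implements the standard definition of the C-index (the sum of within-cluster pairwise distances, normalized by its smallest and largest possible values for the same number of pairs); it is a generic illustration, not code from the chapter.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def c_index(values, labels):
    """C-index of a clustering: (S - S_min) / (S_max - S_min), to be minimized.

    values : (n,) array of per-writer complexity values
    labels : (n,) array of cluster labels (e.g. output of fcluster)
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    d = squareform(pdist(values.reshape(-1, 1)))      # pairwise distance matrix
    iu = np.triu_indices_from(d, k=1)                 # each pair counted once
    same = (labels[:, None] == labels[None, :])[iu]   # within-cluster pairs

    s = d[iu][same].sum()                 # S: sum of within-cluster distances
    n_w = int(same.sum())                 # number of within-cluster pairs
    sorted_d = np.sort(d[iu])
    s_min = sorted_d[:n_w].sum()          # sum of the n_w smallest distances
    s_max = sorted_d[-n_w:].sum()         # sum of the n_w largest distances
    return (s - s_min) / (s_max - s_min)
```

Evaluating this index for every k between k_min and k_max and keeping the value of k that minimizes it implements the C-index branch of step 5.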
For a better insight into how the optimal number of clusters is found using these
validity indices, in the sequel we apply our cluster analysis on writers of the MCYT-
100 database following the above-mentioned methodology.

10.2.4.2 Experimental Result

Fig. 10.2 The resulting dendrogram of the hierarchical clustering procedure

Following steps 1 and 2 in the above-mentioned methodology (Sect. 10.2.4.1), we
plot in Fig. 10.2 the resulting dendrogram on writers of the MCYT-100 database, in
order to find the interval of variation of k (defined by k_min and k_max). The dendrogram
allows visualizing the evolution of the objective function (a dissimilarity measure)
according to the clusters that are generated at each step (or “clustering level”).
For a better readability of the changes in the objective function, we project on the right
side of Fig. 10.2 (diagram in black) the values of this function at each clustering
step.
As illustrated by this diagram, the important changes of the objective function
are represented by four dotted lines (four “clustering levels”): at a value of 100 of
the objective function, corresponding to a partition of the data into 2 clusters; at a
value of 60 for 3 clusters; at a value of 20 for 5 clusters; and at a value of 10 for 9
clusters. Therefore, we can compute the validity indices for k varying between 2 and
9 clusters.
Figure 10.3 shows the graphs of the validity indices computed after clustering
writers of the MCYT-100 database according to their complexity measure with the
Hierarchical Algorithm. For each index, the optimal number of clusters is indicated
by a squared symbol on the validity index curve.
All validity indices indicate that the optimal number of clusters is k = 3.
Indeed, following the last step in the above-mentioned procedure, the variation of the
Krzanowski-Laï index (Fig. 10.3a) is not monotonic. Thus, the optimal number of
clusters is defined by the maximum value of this index, according to its associated
optimality criterion. Following the same principle, for the RMSSTD group
(see Fig. 10.3c), the optimal number of clusters is indicated by the first simultaneous
variation of the 4 indices. Concerning the C-index, we observe in Fig. 10.3b that its
variation as a function of k is monotonic, and thus the first “knee” corresponds to the
optimal number of clusters.

Fig. 10.3 Validity indices for each value of k (number of writer categories) on the MCYT-100
database: a Krzanowski-Laï index, b C-index, and c RMSSTD Group

10.2.5 Writer Categories with the Statistical Complexity Measure

Figure 10.4 shows some signatures of MCYT-100 in each obtained category
(note that such signatures are already displayed in other publications [9, 10, 12,
22, 25, 26]).
We notice in Fig. 10.4 that the first category of signatures (Fig. 10.4a), those
having the lowest complexity values, contains short and simply drawn signatures,
often with the shape of a simple flourish. At the opposite extreme, signatures in the third
category (Fig. 10.4c), those having the highest complexity values, are the longest
and the most complex, and their appearance is rather that of cursive handwriting. In
between, we notice that signatures with medium complexity values (second category,
Fig. 10.4b) are longer than those of the first category, often showing the aspect of a
complex flourish.
We report in Table 10.1 the obtained distribution of writers of the MCYT-100
database into the three complexity-based categories. We also report the mean and
the standard deviation (Std) values of complexity in each category.

Fig. 10.4 Examples of signatures from MCYT-100 of a low, b medium and c high statistical
complexity. Such signatures were already published in [9, 10, 12, 22, 25, 26]

Table 10.1 Distribution of writers of the MCYT-100 database in each statistical complexity-based
category; the mean and the standard deviation (Std) values of complexity in each category are also
reported
Complexity-based categories Percentage of persons (%) Mean value Std value
Low complexity 14 10.92 3.05
Medium complexity 39 27.82 4.81
High complexity 47 41.84 3.83

We notice in Table 10.1 that most writers of the MCYT-100 database have signatures
of medium or high complexity, while only 14 % have degraded signatures in
terms of this complexity criterion. Finally, we observe that there is roughly a factor of 2
between the mean complexity values of the low and medium complexity categories,
and a factor of 4 between those of the low and high complexity categories. This reflects
the heterogeneity existing between the three categories.

10.2.6 Statistical Complexity Versus the Deterministic Complexity Measure

In our previous work [14], we defined a deterministic complexity measure based
on a vector of 7 components: the number of local extrema in both the x and y directions,
changes of pen direction in both the x and y directions, cusp points, crossing points,
and star points [27]. We considered the Euclidean norm of this vector as the indicator
of complexity for each signature sample. As done for the statistical complexity measure
proposed in Sect. 10.2.2, we averaged this measure over the same 10 genuine
signatures in order to generate a deterministic complexity measure per person.
Fig. 10.5 The statistical complexity versus the deterministic complexity on all writers of the MCYT-100 database

Table 10.2 Distribution of writers of the MCYT-100 database in each deterministic complexity-based category; the mean and the standard deviation (Std) values of complexity in each category are also reported
Complexity-based categories Percentage of persons (%) Mean value Std value
Low complexity 55 376.62 0.15
Medium complexity 29 768.06 0.11
High complexity 16 1197.3 0.21

Figure 10.5 shows the statistical complexity versus the deterministic complexity per
writer of the MCYT-100 database. We notice that the two complexity measures are
positively correlated, with a coefficient of 0.83. Moreover, we observe that the
distribution of values is more spread along the axis of the statistical, entropy-based
complexity measure. This means that the statistical complexity measure is
more refined than the deterministic one and clearly allows a better distinction between
persons in terms of the complexity criterion.
This fact has an evident consequence on the content of the categories, depending on
which measure is used at the clustering step. A first rough analysis of the writers’
distribution into categories is eloquent: we notice that the distribution of writers across
complexity categories is very different when considering the statistical complexity
measure (Table 10.1) and the deterministic one (Table 10.2). Indeed, according to the
deterministic measure, roughly half of the writers of MCYT-100 (55 %) show signatures
of low complexity and rather few (16 %) have high-complexity signatures, whereas
according to the statistical complexity measure, almost half of the database (47 %)
shows highly complex signatures.
For a better insight, we display in Fig. 10.6 the content of the retrieved categories
based on the deterministic complexity measure. Contrary to the categories
of Fig. 10.4 (generated by the statistical complexity measure), in the case of the
deterministic complexity we visually notice that the low complexity category (Fig. 10.6a)
is not homogeneous. In fact, it contains on one hand simple flourish signatures and
on the other hand complex flourish signatures (sometimes even close to cursive
handwriting).

Fig. 10.6 Examples of signatures from MCYT-100 of a low, b medium and c high deterministic
complexity. Such signatures were already published in [9, 10, 12, 22, 25, 26]

Fig. 10.7 Signature examples categorized as being of high or medium complexity with the statistical
measure, and as being the least complex with the deterministic measure
This is a clear difference between the deterministic complexity measure and the
statistical one, which separates these two types of signature examples into the two
extreme categories. Indeed, in the case of the statistical complexity measure, the writer
categories are visually homogeneous (see Fig. 10.4).
Furthermore, Fig. 10.7 displays examples of signatures on which the statistical and
the deterministic complexity measures disagree. Such signatures are categorized as
being of high or medium complexity with the statistical measure, and as the least
complex with the deterministic measure.
These results, together with what is observed in Fig. 10.5, show that the statistical
complexity measure is refined while the deterministic one is rather coarse. Moreover,
the statistical measure only requires pen coordinates through time to extract the
complexity of a signature, whereas the deterministic measure would actually require
more complexity descriptors to be reliable for writer categorization (namely
leading to homogeneous categories) and characterization (namely allowing a better
distinction of writers in terms of complexity). Besides, this raises the problem of
retrieving enough pertinent complexity descriptors for the deterministic measure.
For all these reasons, when studying verification performance on writer categories
in the following section, we only consider the statistical complexity measure.

10.2.7 Verification Performance Across Statistical Complexity-Based Writer Categories

In this section, we analyse the behaviour of the writer categories obtained with the
statistical complexity measure in terms of classifier performance. To this end, we
consider a Dynamic Time Warping classifier [18], as it has been among the best classifiers for
signature verification in different online signature verification competitions [28–30].
Dynamic Time Warping (DTW) measures the dissimilarity between two time
sequences of different lengths [18]. The matching score of the DTW-based verification
system is:

$$\mathrm{Score} = \min_{i=1,\ldots,5}\big[\mathrm{DTW}(\mathrm{test}, \mathrm{reference}_i)\big] \qquad (10.3)$$

where DTW denotes the elastic distance computed between the test signature (denoted
test) and each reference signature (denoted reference_i).
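For concreteness, a minimal DTW scorer is sketched below. The quadratic-time dynamic-programming recursion is standard; the use of raw coordinates as features and the function names are illustrative assumptions rather than the exact configuration behind the results reported in this chapter.

```python
import numpy as np

def dtw_distance(a, b):
    """Elastic (DTW) distance between two sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # local point-to-point cost
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[n, m]

def matching_score(test, references):
    """Score of Eq. (10.3): minimum DTW distance to the enrolled references."""
    return min(dtw_distance(test, ref) for ref in references)
```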
For classifier performance assessment, we consider only genuine signatures of the
MCYT-100 database that were not used for computing the complexity measure and
retrieving writer categories. Therefore, performance assessment per writer category
will not be biased.
The experimental protocol on the MCYT-100 database is the following: 5 random
samplings are carried out on the genuine signatures. Each sampling contains 5 genuine
signatures used as the enrolment set. For test purposes, we use the remaining 10 genuine
signatures and 25 skilled forgeries. The False Acceptance and False Rejection Rates
are computed from the total number of False Rejections and False Acceptances
obtained over all 5 random samplings.
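The sketch below shows how an Equal Error Rate can be estimated from the pooled genuine and forgery scores. It assumes a distance-like score (lower means a better match, as with the DTW score above) and sweeps the decision threshold over all observed score values; it is a generic illustration, not the exact evaluation code used for the reported figures.

```python
import numpy as np

def equal_error_rate(genuine_scores, forgery_scores):
    """Approximate EER for a distance-like score (lower = more likely genuine)."""
    genuine = np.asarray(genuine_scores, dtype=float)
    forgery = np.asarray(forgery_scores, dtype=float)
    best_gap, eer = np.inf, None
    for thr in np.unique(np.concatenate([genuine, forgery])):
        frr = np.mean(genuine > thr)    # genuine rejected: distance above threshold
        far = np.mean(forgery <= thr)   # forgeries accepted: distance below threshold
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```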
We display in Fig. 10.8 classifier performance on each complexity-based category
obtained on the MCYT-100 database. We first notice different behaviours in terms of
performance according to the complexity category considered.
Actually, there is a significant difference in classifier performance between the
two extreme categories: the DTW classifier gives the best performance on writers
belonging to the highest complexity category and the worst performance on writers
belonging to the lowest complexity category. Table 10.3 shows that, at the Equal Error
Rate functioning point, performance is degraded by a factor of 2 when switching from
the highest complexity category to the lowest one. Confidence Intervals at 95 % are
also given to show the significance of the results.
Note that, as discussed in the Introduction (Sect. 10.1), these first results are fully
consistent with those of the literature [6, 9, 10]: indeed, such works state
that complexity is a major quality criterion for signatures and that insufficiently complex
signatures lead to degraded performance. But compared with such works, the
advantage of our approach is twofold: first, we propose a quantitative complexity
measure, while most works in the literature estimated complexity visually [9, 10];
second, we exploit this measure to categorize writers automatically by means of a
Hierarchical Clustering, while most works categorized signatures manually [9, 10].

Fig. 10.8 Detection Error Trade-off (DET)-Curves on each Complexity-based category of the
MCYT-100 database

Table 10.3 Equal Error Rate (EER) and Confidence Interval (CI) in each writer category on the
MCYT-100 database with DTW classifier
Complexity-based categories EER (%) CI (at 95 %)
Low complexity 15.07 0.32
Medium complexity 9.79 0.65
High complexity 6.47 0.41

At other functioning points, Fig. 10.8 shows that the gap in performance between
the two extreme categories is less well maintained and tends to decrease, especially at low
values of FAR. We also notice that the performance values for the category of writers
with medium complexity are in between those of the two extreme writer categories,
except at low values of FAR.
These last two remarks point out the limits of considering only the complexity
criterion for characterizing writers. Indeed, as stated in the literature [10–12] (refer
to Sect. 10.1), stability is the other main quality criterion in handwritten signatures.

10.3 Extending Quality Criteria: From Signature Complexity to Signature Stability

Variability is another main source of degradation in handwritten signatures
[6, 10, 11] and is a particular trait of this behavioural biometric modality.
Unlike signature complexity, signature variability (or, inversely, signature stability)
can only be measured on a set of samples of the same writer.
We have thus far related signature complexity to the notion of disorder in a given
signature sample. We obtained this way an entropy measure for quantifying signature
complexity. In this section, we aim at extending this entropy measure to the context
of several signature instances. Indeed, signature variability corresponds to the intra-
class disorder of a given set of genuine signatures.
In the sequel, we will first show that this intra-class disorder is complementary
to the statistical complexity proposed in Sect. 10.2 and that both can be unified
in a single measure that we call “Personal Entropy” [1, 14, 16, 17]. Then, we will
demonstrate that such a measure allows a better characterization of writers compared
to the statistical complexity measure.

10.3.1 Measuring Personal Entropy of a Set of Genuine Signatures

To compute Personal Entropy on a set of K signature samples, we first train the
Writer-HMM on these K signatures, after computing a personalized number of states
as follows:

$$N = \frac{T_{\mathrm{Total}}}{M \times 30} \qquad (10.4)$$
where T_Total is the total number of sampled points available in the genuine signatures,
and M = 4 is the number of Gaussian components per state.
Then, we exploit the Writer-HMM to generate, by the Viterbi algorithm [18], the
portions on which entropy is computed for each of the K instances. On each portion,
we use the probability density estimated by the Writer-HMM to compute this entropy
locally. We then average the entropy over all the portions of a signature instance
and normalize the result by the signing time of the signature sample. Finally, by
averaging this measure across the K genuine signatures, we retrieve a “Personal
Entropy” measure per writer.
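The computation parallels the per-sample complexity sketched in Sect. 10.2.1. The fragment below is again a hedged illustration: it assumes hmmlearn, keeps the single-Gaussian-per-state simplification and the Monte Carlo entropy estimate of the earlier sketch, trains one Writer-HMM on the concatenation of the K signatures, and averages the resulting time-normalized entropies.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from scipy.stats import multivariate_normal

def personal_entropy(signatures, points_per_state=120):
    """signatures: list of K (xy_array, duration_s) tuples of one writer."""
    all_xy = np.vstack([xy for xy, _ in signatures])
    lengths = [len(xy) for xy, _ in signatures]
    n_states = max(1, sum(lengths) // points_per_state)   # in the spirit of Eq. (10.4)

    # Single Writer-HMM trained on the concatenated K genuine signatures
    hmm = GaussianHMM(n_components=n_states, covariance_type="full", n_iter=50)
    hmm.fit(all_xy, lengths)

    values = []
    for xy, duration_s in signatures:
        states = hmm.predict(xy)                  # Viterbi segmentation per sample
        ent = []
        for s in np.unique(states):
            pdf = multivariate_normal(mean=hmm.means_[s], cov=hmm.covars_[s])
            ent.append(-np.mean(np.log2(pdf.pdf(xy[states == s]) + 1e-12)))
        values.append(np.mean(ent) / duration_s)  # bits per second for this sample
    return np.mean(values)                        # averaged over the K signatures
```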
At this stage, the number of signature samples required for computing reliable
Personal Entropy values is an open question; studying this point is the objective of
the next section.

10.3.2 Study on the Number of Signatures for Personal Entropy Computation

In this section, we study on the MCYT-100 database the optimal number of genuine
signatures, namely the smallest number of signatures that ensures stable Personal
Entropy values.

Fig. 10.9 Personal Entropy values versus each number of signature samples, for the 5 subgroups
of 20 persons of the MCYT-100 database

Fig. 10.10 Average curve of Personal Entropy values on all writers of the MCYT-100 database,
versus each number of signature samples

Figure 10.9 displays, for all writers split into 5 subgroups of 20 persons (for
better readability), the evolution of Personal Entropy values according to the number
of genuine signatures considered for their computation, namely from 2 to 15 samples.
We also display in Fig. 10.10 the average curve of the 100 Personal Entropy values
associated with the 100 writers of the database, for each number of signature
samples.

Fig. 10.11 Evolution of the slope of the average curve in Fig. 10.10 versus each number of signature samples

In Figs. 10.9 and 10.10, a tendency towards stability appears starting from 6 signature
samples. Moreover, Fig. 10.11 displays the evolution of the slope of the average
curve in Fig. 10.10; we confirm that, starting from 6 samples, the variation of the
slope becomes steady and smooth.
In our experiments, we considered 10 genuine signatures per writer for computing
Personal Entropy, since we exploit databases containing enough genuine data and,
at 10 samples, the slope almost reaches a value of 0 (see Fig. 10.11).

10.3.3 Writer Categories with Personal Entropy Measure

Figure 10.12 shows the three categories obtained on the MCYT-100 database after
performing a Hierarchical Clustering on Personal Entropy values.
We observe that the first category of signatures, those having the highest Personal
Entropy (Fig. 10.12a), contains short, simply drawn and illegible signatures, often
with the shape of a simple flourish. At the opposite extreme, signatures in the third category,
those of lowest Personal Entropy (Fig. 10.12c), are the longest and their appearance
is rather that of handwriting, some being even legible. In between, we notice that
signatures with medium Personal Entropy (second category, Fig. 10.12b) are longer
than those of the first category, often showing the aspect of a complex flourish.
Fig. 10.12 Examples of signatures from MCYT-100 of a high, b medium and c low Personal
Entropy. Such signatures were already published in [9, 10, 12, 22, 25, 26]

Fig. 10.13 Personal Entropy versus the statistical complexity measure, on Personal
Entropy-based writer categories of the MCYT-100 database

These categories of signatures remain visually related to the complexity indicator,
but in quantitative terms we obtain an inverse relationship between Personal Entropy
and the statistical complexity measure proposed in Sect. 10.2 (see Fig. 10.13). Actually,
when computing the correlation coefficient between both measures, we obtained
a negative value of −0.78. This is explained by the fact that variability is now taken
into account by our extended measure (namely Personal Entropy), as we will demonstrate
in Sect. 10.3.4. Indeed, in Fig. 10.13, highly complex signatures now show a
low Personal Entropy value (accounting for low variability), and the least complex
signatures now show high Personal Entropy values (accounting for high variability).
It is worth noticing that this opposite behaviour of the two indicators (complexity,
variability) is due to a particularity of signature biometrics, namely that complexity
leads to stability: actually, for a writer, complexity in a signature involves
increased spatial constraints at the pen-trajectory level, and these constraints tend
to increase stability (or decrease variability) between different instances of the
signature.

Fig. 10.14 Personal Entropy versus signature variability on Personal Entropy-based writer cate-
gories of the MCYT-100 database

10.3.4 Personal Entropy and Signature Variability

For further analysis of the relation between Personal Entropy and signature variability,
we propose in this section a quantitative measure of signature variability, and we
confront it with Personal Entropy-based categories.
In order to measure the variability of a writer’s genuine signatures, we propose
to use Dynamic Time Warping (DTW) [18], which relies on a local paradigm to
quantify distortions. We compute the DTW distances between all possible pairs
of genuine signatures (45 pairs, as we consider 10 genuine signatures) and average the
obtained distances to get a variability indicator per writer.
Four features, extracted locally per point, are considered for measuring signature
variability: the absolute speed, the angle between the absolute speed vector and the
horizontal axis, the curvature radius of the signature, and the length-to-width ratio
on a sliding window of size 5.
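Under these definitions, the variability indicator can be sketched as follows; it reuses the hypothetical `dtw_distance` function from Sect. 10.2.7 and leaves the extraction of the four local features to the caller, so the feature computation itself is not shown.

```python
from itertools import combinations
import numpy as np

def variability_indicator(feature_sequences):
    """Average DTW distance over all pairs of a writer's genuine signatures.

    feature_sequences : list of (T_k, d) arrays, one per genuine signature,
    where d is the number of local features (4 in the chapter).
    Assumes dtw_distance() from the earlier sketch is in scope.
    """
    distances = [dtw_distance(a, b)
                 for a, b in combinations(feature_sequences, 2)]
    return np.mean(distances)   # 45 pairs when 10 genuine signatures are used
```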
Figure 10.14 shows Personal Entropy against the variability indicator, per writer
category, on the MCYT-100 database. We see that signatures of the highest
Personal Entropy category are highly variable. At the opposite extreme, signatures of the
lowest Personal Entropy category are by far more stable (they show low variability).
When computing the correlation coefficient between both measures, we obtain a
positive value of 0.76. We therefore conclude that our Personal Entropy measure
quantifies the intra-class disorder of a writer’s signatures.
We can now state that Personal Entropy quantifies simultaneously both signature
complexity and signature variability, namely the two levels of disorder in a signature:
the intrinsic disorder at the sample level (complexity) and the intra-class disorder
of a set of genuine signatures (variability).

Fig. 10.15 DET-Curves on each Personal Entropy-based category of the MCYT-100 database

We will also show in the next section that this measure allows a better distinction
between writer categories in terms of classifier performance compared to the former
statistical complexity measure of Sect. 10.2.

10.3.5 Verification Performance Across Personal Entropy-Based Writer Categories

In this section, we study the relationship between Personal Entropy-based categories
and performance of the DTW classifier, on the MCYT-100 database.
We notice in Fig. 10.15 that there is a significant difference in classifier performance
between the two extreme categories: the DTW classifier gives the best performance
on writers belonging to the lowest Personal Entropy category, which contains
the longest, most complex and most stable signatures, such as those shown in Fig. 10.12c.
At the opposite extreme, the DTW classifier gives the worst performance on writers belonging to
the highest Personal Entropy category, those having the shortest, simplest and most
unstable signatures (see Fig. 10.12a). We also notice that performance values for the
category of writers with medium Personal Entropy are in between those of the two
extreme writer categories.
As shown in Table 10.4, at the Equal Error Rate functioning point, performance is
roughly improved by a factor of 2.6 when switching from the highest entropy category
to the lowest one. Confidence Intervals at 95 % are given to show the significance of the results.

Table 10.4 Equal Error Rate (EER) and Confidence Interval (CI) in each Personal Entropy-based
writer category on the MCYT-100 database with the DTW classifier
Entropy-based categories EER (%) CI (at 95 %)
Low Personal Entropy 5.19 0.49
Medium Personal Entropy 7.91 0.61
High Personal Entropy 13.73 1.14

Fig. 10.16 DET-Curves on each writer category of the MCYT-100 database. Writer categories
result from two-dimensional hierarchical clustering on both the statistical complexity measure and
the variability measure

At other functioning points, this gap in performance between the two
extreme categories is maintained, as shown in Fig. 10.15.
Now, compared to Fig. 10.8, namely when performance is measured on writer
categories based only on the statistical complexity measure, we notice that Personal
Entropy allows a better separation between writer categories in terms of classifier
performance, not only at the EER but at all functioning points.
This result demonstrates the efficiency of our Personal Entropy measure for characterizing
writers, since it simultaneously takes into account both signature complexity
and signature variability.
To go further in our analysis, we performed a two-dimensional Hierarchical Clustering
on both the statistical complexity and the deterministic variability values associated
with the writers of the MCYT-100 database. Figure 10.16 displays DTW classifier
performance on each obtained category. Equal Error Rate values and Confidence
Intervals are also reported in Table 10.5.

Table 10.5 Equal Error Rate (EER) and Confidence Interval (CI) in each writer category extracted
on both the statistical complexity and the variability measures, on the MCYT-100 database with
DTW classifier
Statistical complexity and variability-based categories EER (%) CI (at 95 %)
Low complexity and high variability 12.33 0.66
Medium complexity and variability 5.25 0.42
High complexity and low variability 4.59 0.47

Note that the intermediate category and the high complexity/low variability category are not
well separated in Fig. 10.16 compared to Fig. 10.15. This is particularly clear at
the EER functioning point (see Table 10.5) when considering the reported 95 %
confidence intervals. These results show that combining the two measures (statistical
complexity and variability) by a two-dimensional clustering procedure cannot
reach the refinement of Personal Entropy in terms of automatic writer categorization
(see Fig. 10.15). Indeed, Personal Entropy combines both complexity and variability
criteria statistically and directly on the raw signature data; it is thus a powerful tool
for automatically characterizing existing categories of persons.
Based on this result, in the next section we study the capacity of Personal Entropy for
measuring the degradation that occurs in signatures when they are acquired in mobile
conditions.

10.4 Acquisition Conditions and Data Degradation

The aim of this section is to analyse the impact of mobile acquisition conditions on
signatures. For this study we use the two BioSecure subsets DS2-104 and DS3-104
[31, 32] that contain the same 104 persons and were acquired respectively on a fixed
platform and on a mobile one (Smartphone). Such data subsets were acquired at
Telecom SudParis.

10.4.1 BioSecure DS2-104 and DS3-104 Data Description

The DS2-104 subset contains data from 104 persons, acquired in a PC-based offline
supervised scenario with the digitizing tablet WACOM INTUOS 3 A6 [31, 32]. The
pen tablet resolution is 5080 lines per inch and the precision is 0.25 mm. The maximum
detection height is 13 mm and the capture area is 270 mm (width) × 216 mm
(height). Signatures are captured on paper using an inking pen. At each sampled point
of the signature, the digitizer captures at 100 Hz sampling rate the pen coordinates,
pen pressure (1024 pressure levels) and pen inclination angles (azimuth and altitude
angles of the pen with respect to the tablet). This database contains two sessions,
acquired two weeks apart, each containing 15 genuine signatures. The donor was
asked to perform, alternately, three times 5 genuine signatures and twice 5 forgeries.
Indeed, for skilled forgeries, at each session, a donor is asked to imitate five
times the signature of two other persons, after several minutes of practice and with
knowledge of the signature dynamics.

Fig. 10.17 Personal Entropy values for the same 104 persons of DS2-104 (represented by “*”) and
of DS3-104 (represented by “o”) databases
The DS3-104 subset contains the signatures of the same 104 persons, acquired on
the PDA HP iPAQ hx2790, at a frequency of 100 Hz and with a touch-screen resolution
of 1280 × 960 pixels [31, 32]. Three time functions are captured from the PDA: the x and
y pen coordinates and the time elapsed between the acquisition of two successive
points. The user signs while standing and has to hold the PDA in his or her hand.
Two sessions were acquired, spaced around 5 months apart, each containing 15 genuine
signatures. The donor was asked to perform, alternately, three times 5 genuine
signatures and twice 5 forgeries. For skilled forgeries, at each session, a donor is
asked to imitate 5 times the signature of two other persons. In order to imitate
the dynamics of the signature, the forger visualized on the PDA screen the writing
sequence of the signature he/she had to forge and could sign over the image of this
signature, in order to obtain a better-quality forgery both from the point of view of
the dynamics and of the shape of the signature.

10.4.2 Personal Entropy for Quantifying Data Degradation

Figure 10.17 displays Personal Entropy values for the 104 persons of DS2-104 and
DS3-104 simultaneously. It clearly appears that writers from DS3-104 (represented
by “o”) show high Personal Entropy values compared to those from DS2-104 (represented
by “*”). This result confirms our intuition, namely that the acquisition of
signatures on a mobile platform degrades them, by increasing the disorder measured
by Personal Entropy.

Fig. 10.18 The raw difference of Personal Entropy values between the DS3-104 and the DS2-104
databases for all writers

Table 10.6 Distribution of writers from DS2-104 and from DS3-104 in each category
Entropy-based categories Percentage of writers from DS2-104 (%) Percentage of writers from DS3-104 (%)
High Personal Entropy 6.45 93.55
Medium Personal Entropy 27.5 72.5
Low Personal Entropy 82.47 17.53
For a better insight into this phenomenon, we display in Fig. 10.18 the degradation
of signatures by computing, for each writer w, the raw difference D_w of Personal
Entropy values (H) between both databases:

$$D_w = H_w(\mathrm{DS3}) - H_w(\mathrm{DS2}) \qquad (10.5)$$

We observe that, for all writers, this difference D_w is positive and that the amount
of degradation is writer-dependent.
To analyse the observed signature degradation in terms of writer categories,
we performed a Hierarchical Clustering on the entire set of 208 Personal Entropy
values (half from DS2-104 and half from DS3-104). We report in Table 10.6 the
percentages of writers of DS2-104 and of DS3-104 found in each category.

Fig. 10.19 Examples of signatures of two writers that change category between DS2 and DS3 (with
the authorization of the writers): a from medium (left) to high Personal Entropy (right) and b from low
(left) to medium Personal Entropy (right)

Note that most writers of the high Personal Entropy category come from the DS3-104
database (93.55 %), as do most writers of the medium Personal Entropy category (72.5 %).
Inversely, most writers of the low Personal Entropy category come from the DS2-104
database (82.47 %). This result confirms what we observed in Fig. 10.17, namely that
signatures in DS3-104 are more degraded than in DS2-104.
These results reflect the significant change in the writers’ distribution over categories
when switching from the fixed platform (digitizer) to the mobile one
(PDA, Smartphone). Indeed, 81.73 % of writers change category. More precisely,
among the 80 writers that had low Personal Entropy when signing on
the fixed platform, 58 writers switch to the medium Personal Entropy category
and 5 even reach the high Personal Entropy category when signing on the mobile
platform. Finally, all of the 22 writers that have medium Personal Entropy on
the fixed platform jump to the high Personal Entropy category on the mobile
platform.
To go further in our analysis, we display in Fig. 10.19 the signatures of two writers that
change category in mobile conditions. In both Fig. 10.19a and b, we note that signing
in mobile conditions favours a less complex hand-drawing; this makes signatures more
variable (as discussed at the end of Sect. 10.3.3), and both factors together increase
Personal Entropy.
By means of the complexity criterion only, we can easily verify the validity of
this statement, namely that signing in mobile conditions degrades signatures. By
computing, for the 104 persons of DS2-104 and DS3-104 simultaneously, the statistical
complexity measure proposed in Sect. 10.2, it clearly appears in Fig. 10.20 that
writers from DS2-104 (represented by “o”) show high complexity values compared
to those from DS3-104 (represented by “*”).
On the other hand, concerning the variability criterion only, we have demonstrated
in Sect. 10.3 that it is already taken into account by our Personal Entropy measure
(refer to Sects. 10.3.3 and 10.3.4).

Fig. 10.20 The statistical complexity measure on the same 104 writers of the DS2-104 and the
DS3-104 databases

10.5 Conclusions

We have studied in this chapter the two main quality criteria for online handwritten
signatures, namely signature complexity and signature stability. To this end, we first
proposed a novel statistical complexity measure based on the concept of entropy. Such
a measure was computed on each signature sample. We evaluated this complexity
measure comparatively to a deterministic measure of signature complexity and noted
that both are strongly correlated. Nevertheless, our statistical measure is more refined
for distinguishing writers.
Then, we studied the relationship between this new complexity measure and our
former Personal Entropy, which requires a writer’s set of genuine signatures to be
computed. We demonstrated that both entropy measures are anti-correlated. In order
to explain this result, we considered a deterministic measure of a writer’s intra-
class variability, based on Dynamic Time Warping. We found that there is a strong
correlation between Personal Entropy and signature variability.
In this framework, we have deepened our knowledge of what entropy represents in the
case of handwritten signatures. First, if we consider only one signature sample, our
entropy measure in Sect. 10.2 quantifies its complexity. In this case, high entropy
means high complexity. If we consider instead a set of genuine signatures of a given
writer, our entropy measure in Sect. 10.3 (Personal Entropy) measures the intra-class
variability of this set (positive correlation observed), while also taking into account
the complexity of such signatures (negative correlation observed). Therefore, in this
case, high Personal Entropy means high variability and low complexity.

We concluded that Personal Entropy actually quantifies the disorder of a signature
at two levels simultaneously. At the sample level, complexity corresponds to a given
degree of disorder in the signature trajectory through its associated signing time.
At the level of a set of a writer’s signatures, variability corresponds to the intra-
class disorder of this ensemble. Thus, Personal Entropy is a global indicator of both
complexity and variability criteria.
Finally, in Sect. 10.4 we confronted those two entropy measures with different acquisition
conditions. To this end, we exploited two databases containing data from the
same persons, acquired on the one hand on a digitizer and on the other hand on a
Smartphone. We found that signatures are degraded when acquired in mobile conditions:
indeed, Personal Entropy increases and the statistical signature complexity
measure decreases. This degradation could also be observed through automatic writer
categorization, since most writers jump to a higher entropy category when signing
on the mobile platform.
This study also points out the necessity of taking into account simultaneously
both signature complexity and signature variability criteria for characterizing writers
efficiently, particularly in terms of verification performance. This result is important
for securing signature verification systems: as Personal Entropy is a global indicator
of both signature complexity and signature variability, it could be exploited to detect
degraded signatures at the writer enrolment step and, alternatively, to associate with
each writer’s signature a given level of security.

References

1. Impedovo D, Pirlo G (2008) Automatic signature verification: the state of the art. IEEE Trans
Syst Man Cybern Part C Appl Rev 38(5):609–635
2. Garcia-Salicetti S, Houmani N, Ly-Van B, Dorizzi B, Alonso Fernandez F, Fierrez J, Ortega-
Garcia J, Vielhauer C, Scheidat T (2009) On-line Handwritten signature verification. In:
Petrovska-Delacrétaz D, Chollet G, Dorizzi B (eds) Guide to biometric reference systems
and performance evaluation, Springer, London, pp 125–164
3. Garcia-Salicetti S, Houmani N (2009) Digitizing tablet. In: Li Stan Z (ed) Encyclopedia of
biometrics, Springer, XXXII, pp 224–228, ISBN: 978-0-387-73004-2
4. Sabourin R, Plamondon R, Lorette G (1992) Off-line identification with handwritten signa-
ture images: survey and perspectives. In: Baird HS, Bunke H, Yamamoto K (eds) Structured
document image analysis, Springer, Berlin, pp 219–234
5. Sabourin R (1997) Off-Line signature verification: recent advances and perspectives. In:
Proceedings of Brazilian symposium on document image analysis, Curitiba, Brazil. In:
Murshed NA, Bortolozzi F (eds) Advances in document image analysis, vol 1339. Springer,
Berlin, LNCS, pp 84–98
6. Brault J, Plamondon R (1989) How to detect problematic signers for automatic signature
verification. In: Proceedings of the International Canadian Conference on Security Technology
(ICCST), Zurich, Switzerland, pp 127–132
7. Boulétreau V, Vincent N, Sabourin R, Emptoz H (1998) Handwriting and signature: one or
two personality identifiers?. In Proceedings of the 14th international conference on pattern
recognition, vol 2. Los Alamitos, pp 1758–1760
8. Boulétreau V (1997) Towards a handwriting classification by fractal methods, PhD thesis.
Institut National des Sciences Appliquées de Lyon, France

9. Alonso-Fernandez F, Fairhurst MC, Fierrez J, Ortega-Garcia J (2007) Impact of signature
legibility and signature type in off-line signature verification. In: IEEE Biometrics symposium
BSYM, Baltimore, USA, pp 1–6
10. Alonso-Fernandez, F (2008) Biometric sample quality and its application to multimodal authen-
tication systems, PhD thesis, Universidad Politecnica de Madrid
11. Di Lecce V, Di Mauro G, Guerriero A, Impedovo S, Pirlo G, Salzo A, Sarcinella L (1999)
Selection of reference signatures for automatic signature verification. In: Proceedings of the
international conference on document analysis and recognition (ICDAR’99), Bangalore, India,
pp 597–600
12. Alonso-Fernandez F, Fairhurst MC, Fierrez J, Ortega-Garcia J (2007) Automatic measures for
predicting performance in off-line signature. In: IEEE Proceedings of the international conference
on image processing, ICIP, vol 1. San Antonio, USA, pp 369–372
13. Cover TM, Thomas JA (2006) Elements of information theory. 2nd edn, Wiley, New York
14. Garcia-Salicetti S, Houmani N, Dorizzi B (2009) A novel criterion for writer enrolment
based on a time- normalized signature sample entropy measure. EURASIP J Adv Sig Process
2009:964746. doi:10.1155/2009/964746
15. Garcia-Salicetti S, Houmani N, Dorizzi B (2008) A client-entropy measure for on-line signa-
tures. In: IEEE Biometrics Symposium (BSYM), Tampa, USA, pp 83–88
16. Houmani N, Garcia-Salicetti S, Dorizzi B (2008) A novel personal entropy measure confronted
with online signature verification systems’ performance. In: Proceedings of IEEE 2nd interna-
tional conference on biometrics: theory, applications and systems (BTAS’2008), Washington,
USA
17. Houmani N, Garcia-Salicetti S, Dorizzi B (2009) On assessing the robustness of pen coor-
dinates, pen pressure and pen inclination to short-term and long-term time variability with
personal entropy. In: Proceedings of IEEE 3rd international conference on biometrics: theory,
applications and systems (BTAS ’09), Washington, USA, pp 1–6
18. Rabiner L, Juang BH (1993) Fundamentals of speech recognition, Signal processing series.
Prentice Hall, Englewood Cliffs
19. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv
31(3):264–323
20. Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Pearson Addison Wesley,
Boston
21. Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. Intell
Inform Syst J 17(2–3):107–145
22. Ortega-Garcia J, Fierrez-Aguilar J, Simon D, Gonzalez J, Faundez-Zanuy M, Espinosa V, Satue
A, Hernaez I, Igarza J-J, Vivaracho C, Escudero D, Moro Q-I (2003) MCYT baseline corpus:
a bimodal biometric database. IEE Proc Vis Image Sig Process Spec Issue Biometrics Internet
150(6):395–401
23. Hubert L, Schultz J (1976) Quadratic assignment as a general data-analysis strategy. British J
Math Stat Psychol 29:190–241
24. Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number
of clusters in a dataset. Genome Biol 3(7):Research0036.1–0036.21
25. Galbally J, Fierrez J, Ortega-Garcia J (2007) Classification of handwritten signatures based
on name legibility, in defense and security symposium. In: proceedings of SPIE, biometric
technologies for human identification, 6539, Orlando, USA
26. Galbally J, Fierrez J, Freire MR, Ortega-Garcia J (2007) Feature selection based on genetic
algorithms for on-line signature verification. In: Proceedings of IEEE workshop on automatic
identification advanced technologies, Alghero, Italy, pp 198–203
27. Chakravarthy VS, Kompella B (2003) The shape of handwritten characters. Pattern Recogn
Lett 24(12):1901–1913
28. Yeung D, Chang H, Xiong Y, George S, Kashi R, Matsumoto T, Rigoll G (2004) SVC2004: First
international signature verification competition. In: Proceedings of the international conference
on biometric authentication, Hong Kong, China, LNCS 3072, Springer, pp 16–22

29. Houmani N, Mayoue A, Garcia-Salicetti S, Dorizzi B, Khalil MI, Moustafa MN, Abbas H,
Muramatsu D, Yanikoglu B, Kholmatov A, Martinez-Diaz M, Fierrez J, Ortega-Garcia J, Roure
Alcobé J, Fabregas J, Faundez-Zanuy M, Pascual-Gaspar JM, Cardeñoso-Payo V, Vivaracho-
Pascual C (2011) BioSecure signature evaluation campaign (BSEC’2009): evaluating online
signature algorithms depending on the quality of signatures. Pattern Recogn 45(3):993–1003
30. Houmani N, Garcia-Salicetti S, Dorizzi B, Montalvão J, Coutinho Canuto J, Vasconcelos
Andrade M, Qiao Y, Wang X, Scheidat T, Makrushin A, Muramatsu D, Putz-Leszczynska J,
Kudelski M, Faundez-Zanuy M, Pascual-Gaspar J, Cardeñoso-Payo V, Vivaracho-Pascual C,
Argones Rúa E, Luis Alba Castro J, Kholmatov A, Yanikoglu B (2011) BioSecure signature
evaluation campaign (ESRA’2011): evaluating systems on quality-based categories of skilled
forgeries. In: Proceedings of the international joint conference on biometrics, Washington
31. http://www.biosecure.info
32. Ortega-Garcia J, Fierrez J, Alonso-Fernandez F, Galbally J, Freire MR, Gonzalez-Rodriguez J,
Garcia-Mateo C, Alba-Castro JL, Gonzalez-Agulla E, Otero-Muras E, Garcia-Salicetti S,
Allano L, Ly-Van B, Dorizzi B, Kittler J, Bourlai T, Poh N, Deravi F, Ng MNR, Fairhurst M,
Hennebert J, Humm A, Tistarelli M, Brodo L, Richiardi J, Drygajlo A, Ganster H,
Sukno FM, Pavani SK, Frangi A, Akarun L, Savran A (2010) The multiscenario multienviron-
ment biosecure multimodal database (BMDB). IEEE Trans Pattern Anal Mach Intell 32(6):
1097–1111
Chapter 11
Human Tracking in Non-cooperative Scenarios

Kamil Grabowski and Wojciech Sankowski

Abstract In this chapter we discuss human tracking problems for biometric applications
in non-cooperative scenarios. The chapter starts with an overview of modern
biometric authentication systems. Special attention is paid to vision system construction
and its design fundamentals. Next, the existing image segmentation methods are
presented, with emphasis on procedures suitable for image pre-segmentation during
the acquisition process. The chapter ends with the presentation of some sample
processing architectures.

11.1 Introduction

In all possible scenarios, biometric authentication starts with the acquisition of a
biometric sample. When a non-cooperative scenario is taken under consideration,
such a biometric system must deal with highly variable imaging conditions, including
viewing angle, lighting, occlusions and noise. But what is more important, it must
provide good quality samples in these conditions, i.e. samples whose content of distinctive
features is sufficient for the authentication process. This requires using special techniques
for image capture, where subject tracking and special pre-segmentation procedures
become a substantial part of the sample acquisition process—see Fig. 11.1. To provide
the best results and to minimize the failure-to-acquire rate, they should be applied in
real time at the frame rate of one or more video signals, sometimes with synchronization
among those video streams.

K. Grabowski · W. Sankowski (B)


Department of Microelectronics and Computer Science, Lodz University of Technology,
Wolczanska 221/223, 90-924 Lodz, Poland
e-mail: wsan@dmcs.pl
K. Grabowski
e-mail: kgrabowski@dmcs.pl


Fig. 11.1 Part of a non-cooperative biometric authentication system covered by this chapter

In this chapter, an overview of techniques used for biometric sample acquisition
in non-cooperative scenarios is presented. We start with a presentation of current
biometric authentication systems, with special attention paid to vision system design
and construction fundamentals. Next, an overview of the existing image segmentation
methods is presented, with emphasis on procedures suitable for image pre-segmentation
during the acquisition process. After the presentation of these methods, typical
error rates introduced by the segmentation stage are discussed. Also, some sample
processing architectures are presented.

11.2 Acquisition

11.2.1 Biometric Signals

The first and probably the most important part of biometric authentication is the acquisition of a biometric sample, because at this stage it is decided how much distinctive information will be available to the processing algorithms in the next stages. Depending on the biometric trait used for authentication, the signal's dimension may vary. Distinctive features may be contained in a sound (1D), a digitized image (2D), a temporal sequence of 2D digitized images (video stream), a 3D digitized image or even, in some experimental systems, a temporal sequence of 3D digitized images. A summary of biometric traits commonly used in commercial systems is presented in Table 11.1.
In a non-cooperative scenario for biometric authentication, only an unconcealed trait, easily visible or audible to an acquisition system, can be used. It cannot be, for instance, a fingerprint, because it requires an additional action from the authenticated subject to perform the acquisition. In the current state of the art, the traits most commonly used for this purpose are the face, the iris and periocular region, and gait. Thus, according to Table 11.1, for all the mentioned biometrics a vision system capable of capturing either a still image or a video stream is a common solution for the acquisition of biometric data. Of course, in more complex cases, multidimensional analyses can be used.

Table 11.1 The most typical biometric traits

Trait                   Temporal signals        Spatial signals
                        1d    2d    3d          2d    3d
Face    
Fingerprint 
Hand  
Iris and Periocular 

Signature  
Voice 
Gait   

However, due to the additional requirements for the image acquisition stand, the complicated architecture and the final cost, such a scenario is mostly used in experimental cases. In the remainder of this chapter we refer to image analysis systems.

11.2.2 Non-cooperative Scenario

A non-cooperative authentication refers to a process of automatic recognition of individuals using biometric data obtained at a distance and, most often, on the move. An important assumption for this scenario is that the subject is not cooperating with the system in any way. This results in two main issues of a non-cooperative authentication scenario.
The first is that a non-cooperating subject is usually positioned off-angle with respect to the acquisition system and is usually moving. Therefore, such situations need to be detected, and tracking and correction solutions need to be applied to reduce the influence of those factors on the final recognition result. This can be done at the hardware and/or software level: by using a specially designed acquisition system with a tracking feature in the former case, and algorithms for the correction of perspective fluctuations in the latter case. Moreover, an important part of this process are methods for selecting the samples of the best distinctive quality. However, the biometric sample acquisition process in a non-cooperative scenario is related mostly to the tracking functionality based on trait pre-segmentation using data obtained in real time. Algorithms for advanced corrections are also a part of the processing path after the acquisition process, according to Fig. 11.1, and will be discussed in the next part of this chapter.
The second issue is that an image acquisition system for the purpose of non-cooperative authentication must deal with unconstrained environmental conditions. If we assume the most common method of capturing biometric data—image acquisition—several factors can be distinguished: illumination, reflections and occlusions, the last two usually referred to as noise. The key aspect is that the images of individuals are taken from a medium or large distance, which entails several problems. First of all, illumination parameters, such as the light wavelength spectrum, its intensity and its distribution, are usually subject to change.

Fig. 11.2 Images of the same biometric trait captured using different illumination sources: visible (left) and near infrared (right). Source: the authors' databases

A long-range biometric authentication is usually performed in ambient light, which can be a mixture of different, either natural or artificial, sources. The use of additional lighting devices is limited, especially when outdoor systems are taken into consideration, in which ambient light dominates, e.g., during a sunny day. Also, in such a situation, the movement of the Sun will change the registered scene due to the changing position of shadows. Occlusions or the complete invisibility of biometric features constitute an additional problem and are handled using various processing techniques which are a part of either the segmentation or the feature extraction methods.
A separate aspect of imaging for biometric authentication is the color information often used for segmentation purposes, e.g., for skin detection [36] or background segmentation [50, 51]. Thus, special attention has to be paid to white balance algorithms in outdoor imaging scenarios, where the wavelength spectrum of the ambient light is not stable and several factors must be taken into account, such as the geographical location, the season, the weather and the time of day. An important issue is that digital representations of some biometric traits, e.g. the iris, are sensitive to the wavelength of light used for illumination. Images of the same iris look different under visible and near infrared illumination due to the melanin absorption spectrum [19, 20]. Since in the current state of the art the feature extraction is still based on texture analysis, digital templates will differ in such a case because they are generated using slightly different information—see Fig. 11.2.
Holding the overall image luminance constant during the acquisition can be extremely important for amplitude-based authentication methods. It can be achieved at the hardware level using a specially designed vision system and at the software level by applying adjustment algorithms. These issues are explained in detail in the next part of this chapter.
In the case of an indoor installation some of the mentioned limitations are naturally eliminated. The use of additional light sources with a special configuration of the image acquisition system's optical path allows placing the subject in a known and relatively stable observed scene. Providing homogeneous illumination conditions in indoor installations is not a trivial task but, more importantly, the designer can encounter

many other problems because of the varying distance between the acquisition stand
and the object as well as a wide area that is being observed in such scenarios.
Keeping known and stable imaging conditions crucial for the final authentication
result and stays in opposition to the non-cooperative scenario where biometric sample
is captured at a distance. Several processing techniques can be used to minimize the
influence of fluctuations of imaging conditions, but most of the responsibility to
provide repeatability and a good quality of biometric samples relies on properly
designed image acquisition system. Thus, many factors that are not considered in
cooperative scenarios, where acquisition conditions are nearly stable, have to be
taken into account. In the next part of the chapter several techniques used to deal
with the mentioned issues are presented.

11.2.3 Acquisition Techniques

11.2.3.1 Vision System Design Fundamentals

In this section we discuss one of the possible approaches to designing the vision part of a non-cooperative biometric system with tracking capability. A vision system for such purposes consists of dedicated optics, a camera with a sensor, and a movable platform that in some systems can be used for tracking. The key parameter for such a system is the capture volume, which is the starting point for the design process. The capture volume directly influences the strategy of image acquisition—determining whether the subject can move, how fast it can move, what its height is, etc.
In the typical and simplest case, this volume is determined by the optical system design: the lens angle of view (the width and height of the observed scene) and the lens aperture (the depth of field). Because of basic optical laws, the capture volume has the shape of a truncated pyramid, where the base is related to the shape and the aspect ratio of the sensor used—see Fig. 11.3. In recently presented advanced systems for at-a-distance and on-the-move subject tracking the capture volume does not depend only on the optical path design, but is significantly increased using pan-tilt mechanisms and/or variable focal length lenses. The process of tracking is then supported either by an additional Wide-Field-Of-View (WFOV) camera or by a Pan-Tilt-Zoom (PTZ) camera itself in a closed-loop automation scheme. A discussion of recent methods for these tracking techniques is presented in Sect. 11.2.3.2.
The problem. Let us specify the optical path requirements for periocular recognition, i.e., authentication based on the periocular region captured together with an iris image. The iris image will be the bottleneck in this study, as it has to be acquired with sufficient quality for iris-based authentication. The human iris measures approximately 1 cm in diameter and has a relatively low albedo, equal to about 0.15 in the near infrared (NIR). According to the ISO/IEC 19794-6 [35] standard, which supports existing iris recognition algorithms, a resolution of 200 pixels or more across the iris is considered to be of good quality (GQ), of 150–200 pix across the iris to be of acceptable quality (AQ), and of 100–150 pix to be of low quality (LQ).

Fig. 11.3 An explanation of the capture volume term. The camera observes a certain part of the scene depending on the optical path settings

The object assumptions. The average diameter of a human iris is d_iris = 11.8 mm, while the average size of a pupil is d_pupil = 6 mm. These values were estimated on the basis of the authors' measurements, performed using dedicated software [28, 29] on publicly available databases for iris authentication research as well as on private databases obtained using the authors' own acquisition device [11]. Similar values can also be found in the literature [8].
The design. We can estimate the resolution needed by our system depending on
the ISO/IEC 19794-6 requirements (11.1).

    r_optical = r_ISO_Quality / d_iris   [pix/mm]                        (11.1)

If we assume 150 pix across 11.8 mm of iris, then r_optical = 12.65 pix/mm. Next, we can assume the real observation area, i.e., the width and height of the periocular region. In our example we assumed that our goal is to capture one iris and its periocular region of size h_object = 45 mm and w_object = 68 mm (11.2).

    h_image = ceil(r_optical · h_object)   [pix]
    w_image = ceil(r_optical · w_object)   [pix]                          (11.2)

Our calculations show that the periocular region with the desired optical resolution should occupy a sensor region of size at least h_image = 570 pix and w_image = 861 pix. Based on these estimations, the sensor can be chosen for this exemplary application. We assume that the system will be based on a Pulnix TMC-4100GE 4 MPix color camera (we would like to preserve the skin color information), whose basic parameters are summarized in Table 11.2.
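As a minimal illustration of formulas (11.1) and (11.2), the hedged Python sketch below reproduces the above calculation; the variable names mirror the text and the 150 pix threshold is the AQ value quoted from ISO/IEC 19794-6.

```python
import math

# Required optical resolution on the object plane, formula (11.1).
r_iso_quality = 150.0        # pix across the iris (acceptable quality, AQ)
d_iris = 11.8                # average iris diameter [mm]
r_optical = r_iso_quality / d_iris          # [pix/mm]

# Sensor region occupied by the periocular region, formula (11.2).
h_object, w_object = 45.0, 68.0             # assumed periocular region size [mm]
h_image = math.ceil(r_optical * h_object)   # [pix]
w_image = math.ceil(r_optical * w_object)   # [pix]

# Compare with the values quoted in the text (about 12.65 pix/mm and
# 570 x 861 pix); small differences stem from rounding of r_optical.
print(r_optical, h_image, w_image)
```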
In order to estimate the required optical parameters, the average distance to the object should first be known. In this example, let us assume d_object = 2 m. In the first step, we estimate the optical magnification using the width, as it is the dominating dimension of our object (11.3).
    M = (w_sensor / w_object) · (w_image / r_sx)   [−]                    (11.3)

Table 11.2 Pulnix TMC-4100GE 4 MPix camera basic parameters

h_sensor   15.15 mm
w_sensor   15.15 mm
r_sx       2048
r_sy       2048
Speed      15 fps

The first component is the typical magnification for the case when the object fills the whole sensor area (in the width dimension), while the latter is the component which scales the object size according to the previous quality assumptions. For our example, the magnification ratio necessary to project the whole periocular region width onto the whole sensor width would be M = 0.22. But according to the quality requirements the image projected on the sensor should occupy only about 42 % of its width. Thus, the final magnification is M = 0.09. Of course, this is only one of several possible approaches to estimating the magnification. Another is to use the image size in mm instead of pixels.
The required focal length of the lens (angle of view) necessary to project an object image onto the desired sensor area can be calculated using formula (11.4).

    f = (M · d_object [m]) / (M + 1) · 1000   [mm]                        (11.4)

The focal length required in our example is f = 171 mm. In the next step the depth of field has to be calculated. It can be done in several estimation steps (11.5).

    H = f² / (A · CoC)   [mm]
    D_near = d_object · (H − f) / (H + (d_object − 2·f))   [mm]
    D_far = d_object · (H − f) / (H − d_object)   [mm]
    DoF = D_far − D_near   [mm]                                           (11.5)

where H is the hyperfocal distance in mm, f is the lens focal length in mm, D_near is the near distance for acceptable sharpness, D_far is the far distance for acceptable sharpness, A is the f-number, and CoC is the circle of confusion in mm. Acceptable sharpness comes from the circle of confusion assumptions. Usually, it is related to photo printing and represents the size of the area on a sensor with unnoticeable blur on a standard 8 × 10 inch print observed from a standard viewing distance of about 1 ft—see Fig. 11.4.
In the case of a vision system for image data processing, CoC is the size of a single pixel of the sensor used—in our case 7.4 µm. The reason is that we want the object to be projected onto a certain number of pixels of the camera sensor and want to know the stand-off range for which this assumption is met. In practice, the depth of field is estimated by analyzing the authentication performance as a function of the stand-off distance, rather than by strict optical laws. However, using optical equations the capture volume can be estimated from the angle of view, calculated on the basis of the obtained focal length (11.6).

Fig. 11.4 An explanation of the circle of confusion term. An observed point will be sharp in the digitized image when it is projected into just one pixel area

    α_direction = 2 · arctan( r_direction / (2·f) )                       (11.6)

where r_direction denotes the size of the sensor and α_direction is the angle of view, both in the measured direction. The formula holds for non-spatially-distorted images of distant objects, which is certainly the case for the exemplary application. Our sensor has a square shape, which means that the angle of observation is about α = 5° in both the width and the height direction. In order to estimate the width and height of the capture volume we can use the previous proportion (11.3).
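To tie formulas (11.3)–(11.6) together, the following hedged Python sketch recomputes the example using the sensor parameters from Table 11.2; the f/22 aperture and the one-pixel circle of confusion are the assumptions stated in the text, while the script itself is only an illustration.

```python
import math

# Example parameters from the text and Table 11.2.
w_sensor = 15.15       # sensor width [mm]
r_sx = 2048            # sensor horizontal resolution [pix]
w_object = 68.0        # periocular region width [mm]
w_image = 861          # required object width on the sensor [pix]
d_object = 2000.0      # stand-off distance [mm]
A = 22.0               # f-number
coc = 0.0074           # circle of confusion = one pixel pitch [mm]

# (11.3) magnification: full-sensor magnification scaled by the required coverage.
M = (w_sensor / w_object) * (w_image / r_sx)                     # ~0.09

# (11.4) focal length for the given magnification and stand-off distance.
f = M * d_object / (M + 1.0)                                     # ~171 mm

# (11.5) hyperfocal distance and depth of field.
H = f ** 2 / (A * coc)
d_near = d_object * (H - f) / (H + (d_object - 2.0 * f))
d_far = d_object * (H - f) / (H - d_object)
dof = d_far - d_near                                             # ~40 mm

# (11.6) angle of view in the width direction.
alpha_w = 2.0 * math.degrees(math.atan(w_sensor / (2.0 * f)))    # ~5 degrees

# Scene width at the stand-off distance, from the proportion in (11.3).
scene_w = w_object * r_sx / w_image                              # ~160 mm

print(M, f, dof, alpha_w, scene_w)
```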
From the above calculations we obtained a lens focal length of f = 171 mm for the desired object size projected onto the Pulnix TMC-4100GE 4 MPix color camera sensor. The capture volume size for a stand-off distance d = 2 m is approximately 0.16 × 0.16 × 0.04 m³ (cubic approximation) if we assume an aperture of f/22, which is one of the highest possible values for a typical lens. It should be noted that this aperture value results in the occurrence of diffraction effects that will degrade the optical resolution. The obtained capture volume is too small for a tracking system for the purposes of at-a-distance and on-the-move authentication. Thus, additional techniques should be implemented to increase the volume size. It can be done using an additional Wide-Field-of-View camera capable of observing a larger part of the scene. The calculated object space coordinates can then be used to correct the viewing direction of the Narrow-Field-of-View camera (which we have just designed) to track the subject. This is possible when a Pan-Tilt camera is used. Also, the depth of field can be increased separately. It can be achieved using zoom-featured lenses with focus control or using the Wavefront Coding method proposed by Narayanswamy et al. [25]. The proposed method uses a special optical filter capable of expanding the imaging volume. Next, an additional deconvolution algorithm is applied in order to restore the original image. This technique allows an accurate and

robust iris identification for less constrained scenarios. Some advanced techniques
for efficient and reliable object tracking are described in detail in Sect. 11.2.3.2.

11.2.3.2 Vision Systems for Subject Tracking

Taking into account the unconstrained conditions of image acquisition, physical traits require a sequence of images to be captured, from which the best image, i.e. the one containing the highest amount of distinctive features, is selected in a process called quality assessment. In some cases the information available in one frame can be complemented with information contained in the others. In turn, for behavioral traits, the distinctive features are carried by the movement dynamics, thus a video camera is an indispensable part of the vision system. Consequently, for this kind of biometric systems, video cameras are mostly used instead of digital still cameras (DSC).
In most situations, a simple surveillance camera is enough to capture biometric data such as the silhouette, face, gait and other traits that do not require high-precision devices. However, iris biometrics has attracted a great deal of attention in recent years due to its high accuracy, which has been demonstrated for less constrained scenarios in commercial applications, e.g., [24]. Iris image acquisition is difficult in several respects because of the narrow field of view and the eye's glossy surface, and requires one of the most complex imaging techniques currently used for biometrics. In this kind of system, significant cooperation from the authenticated subject is required, such as a predefined location and positioning. Because of the mentioned relatively small iris diameter of about 1 cm and the requirement of 150 pix across the iris according to the ISO/IEC 19794-6 standard [35], these are the most challenging tracking systems currently being developed. Thus, in the next part of the chapter we treat these kinds of systems as a reference for the study of vision and processing systems with tracking capability that can be used in non-cooperative biometric authentication scenarios.
We start from the first commercialized iris-based biometric system that does not require significant cooperation from subjects, called Iris-On-The-Move. This solution is available to the user in the form of three types of devices: IOM Portal, IOM Drive-Through and IOM Compact. The principles of operation of the portal prototype were described in Matey's work [24]. In this fixed optical system, persons undergoing the authorization process have to pass through a special portal (see Fig. 11.5) while their biometric patterns are captured. This ensures that subjects are properly placed in the capture volume during image acquisition. The capture volume cannot be changed by the system itself and, furthermore, must be initially calibrated. The system does not have tracking capability.
The system has a reported throughput of an average of 3 persons per second at a walking speed (about 1 m/s). The subject stand-off required by the system is 3 m and, as stated, cannot be varied due to the fixed optical path design. In order to extend the capture volume, two cameras are placed one above the other. Thanks to this solution, the system is able to acquire photos of persons of different heights. However, because of the low depth of field and the narrow field of view, the authentication fails if the subject's eye does not properly fit into the small capture volume—see Fig. 11.6.

Fig. 11.5 Iris-On-The-Move system overview

Based on estimations of the biometric system efficiency, the depth of field, understood as the capability of the system to maintain certain error levels, was reported to range from as low as 5 cm up to 12 cm. For moving subjects this required a relatively high shutter speed of both cameras (supported by near infrared flashing) and a frame rate of about 15 fps.
However, a decade earlier, in 1996, Hanna et al. from the D. Sarnoff Research Center together with Sensar Inc. developed a prototype system that used a pan-tilt mechanism based on mirrors for iris image acquisition [14]—see Fig. 11.7. This system did have a basic tracking capability used for small adjustments according to the subject's height and lateral placement. Thus, this system did not require the subject to be specially positioned while being authenticated. The system consisted of a stereo pair of wide field-of-view (WFOV) cameras, a narrow field-of-view (NFOV) camera, a pan-tilt mirror allowing the NFOV camera to be moved with respect to the WFOV camera, and two computers for data processing. This was probably the first iris biometric system where WFOV cameras were used for the purpose of scanning individuals at a long distance in a wide area, while one or more additional high-resolution video cameras with NFOV optics were used for target image capture. The system contained a wide-angle scene camera enabling the detection of the face/eye and another camera with a high magnification lens, which allowed a better resolution for fine patterns in the trait—200 pix across the iris diameter.

Fig. 11.6 Capture volume of the Iris-On-The-Move Portal is extended by using two cameras
placed one above the other

A stereo-camera setup was implemented to perform depth estimation. Thus, the user's position in 3D space could be calculated, providing the focus and pan-tilt settings.
Although Hanna's system had a quite small capture volume (1.4–1.9 m in height, 0.3–1 m from the rig, and 0.5 m of lateral displacement from the center of the rig), this type of adjustable optical system allowed basic tracking functionality for subjects at a distance. This approach was later improved by others in order to obtain a flexible capture volume for on-the-move identification capability.
As we discussed in the previous section, a pan-and-tilt mechanism for NFOV cameras is necessary in order to enable the tracking functionality. Due to technological advances, pan-tilt cameras with variable focal length (zoom) are now widely available to the user. Although they are relatively expensive, they have proven to be a successful source of biometric images in these kinds of scenarios. The tracking capability is possible because all three axes along which the subject can be placed are covered by camera adjustments—panning and tilting, and zoom and focus for the stand-off distance. Panning and tilting can be realized using dedicated motors without any special design, but the variable focal length optics makes these systems really expensive. Nowadays, the subject proximity estimation, previously done using just a stereo pair of WFOV cameras, can additionally be supported by devices such as range finders, either ultrasonic or optical, that help the vision system properly track the subject in real time. The only limitation for the stand-off distance is the requirement for image quality—the number of pixels containing the selected trait. This depends on the lens focal length (narrower angle, longer distance, higher price), the sensor size and its density (Fig. 11.8).

Fig. 11.7 Basic Pan-Tilt Mirror setup. A mirror is responsible for image projection into the fixed
camera

The first implementations of the described type of devices for tracking purposes were reported by Sensar [30] and OKI,1 Mitsubishi Corporation [13] and Wheeler et al. [42]. These systems also contained multiple cameras with different fields of view. The strategy used was to roughly estimate the subject's position in 3D space using a WFOV camera and then use an NFOV camera to acquire images containing a biometric sample—see Fig. 11.9. The capture volume was not fixed and could be changed through electronic controllers according to the previously estimated subject's position in the observed scene. Such a controller was a part of an automated control feedback loop in which the error signal comes from specialized image analysis. This information was obtained using the WFOV mode (short focal length of the optics), and the camera zoomed in when the subject was being tracked.

1 http://www.oki.com/en/iris/

Fig. 11.8 Pan-Tilt-Zoom camera scheme. Camera with zoom capable lenses is placed on a movable
platform

Guo et al. [13] have described an iris capture system that uses a pan-tilt iris camera directed by face detection in a WFOV camera mounted on the same pan-tilt platform. In turn, Wheeler's system used a wide-baseline stereo pair of fixed, typical WFOV surveillance cameras to locate the subject in 3D space [42]. The binocular camera setup allowed for a more precise face detection and the estimation of the subject's position in 3D space through triangulation. Moreover, Wheeler et al. described how the WFOV binocular camera setup was calibrated in order to provide the necessary accuracy for the 3D estimations (Fig. 11.9).
To geometrically calibrate the two WFOV cameras, the authors followed the procedure proposed by Zhang [47]. First, a video of a special checkerboard calibration pattern was recorded by each WFOV camera at the same time. Second, the pattern was slowly rotated in multiple directions in view of both WFOV cameras. Then, the corner locations on the calibration pattern were extracted, and the internal parameters of the cameras and their relative positions were determined. A metric world-based coordinate system was established for the calibration of the two WFOV cameras [16]. As a result, in all subsequent steps the coordinates of all 3D points are specified in the world-based coordinate system and are calculated using stereo triangulation. For each WFOV camera, a camera projection matrix is determined, which maps 3D points to their image locations.


Fig. 11.9 Wheeler’s image acquisition system setup. The system was composed of two
wide-field-of-view cameras for scene observation and one pan-tilt-zoom camera for biometric
sample acquisition. Capture volumes are additionally presented

The pseudo-inverse of this matrix can be used to find the 3D line containing all points that map to a given image location. When an object is located in both WFOV images, the triangulation process finds the 3D point that is closest to the two associated 3D lines [16]. An additional NFOV camera, mounted together with strobed NIR illumination on a pan-tilt platform, was moved towards the subject. The distance to the subject was also compensated for using the variable focal length lenses of the iris camera.
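The authors do not publish their implementation; as a rough, hedged illustration of the projection-matrix-based triangulation step described above, the OpenCV sketch below assumes that calibration has already produced the two 3 × 4 world-to-image projection matrices (the matrices and image points shown are placeholders).

```python
import cv2
import numpy as np

# Placeholder 3x4 projection matrices of the two calibrated WFOV cameras
# (in practice obtained via a Zhang-style calibration, e.g. cv2.calibrateCamera
# followed by cv2.stereoCalibrate).
P_left = np.eye(3, 4, dtype=np.float64)
P_right = np.hstack([np.eye(3), [[-0.5], [0.0], [0.0]]]).astype(np.float64)

# Matched image locations of the detected subject (e.g. a head centroid),
# one 2x1 point per camera.
pt_left = np.array([[512.0], [384.0]])
pt_right = np.array([[498.0], [384.0]])

# Linear triangulation: the 3D point that best agrees with the two
# back-projected rays defined by the projection matrices.
X_h = cv2.triangulatePoints(P_left, P_right, pt_left, pt_right)
X = (X_h[:3] / X_h[3]).ravel()   # homogeneous -> world coordinates
print(X)
```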
For this setup, presented in Fig. 11.9, an iris camera lens with programmable motorized focus and zoom (focal length 10–160 mm) capabilities was used. The authors proposed an aperture setting of f/9.5, where the system is (just) diffraction limited, and the depth of field is about 30 mm. The images were captured at a resolution which allowed approximately 200 pixels across the iris, at a subject distance of 1.5 m, with a field of view of 76 mm—so comparable to Hanna's system. 64 LEDs were used in the illumination system, each with an 810 nm center wavelength and a lens providing a 6° viewing half-angle. In order to minimize the influence of ambient light, a 715 nm long-pass filter was used. In order to eliminate motion blur and the ambient light effect, the illumination was triggered at the frame rate using the camera controller signals. The pulse duration in this system was 20 ms, which allowed for a proper exposure with the given camera sensor and optical path settings, as well as freezing the subject's movement. The authors also emphasized that their illumination system operates well within eye safety standards and recommendations [17, 26].

The disadvantage of Wheeler et al.'s system was that, while the IOM technology was able to acquire images from individuals on the move, Wheeler's system required subjects to stop for a moment to capture the image. Also, in this system the WFOV cameras needed to be re-calibrated whenever their position or the lens zoom was changed. Nevertheless, it imposed fewer constraints than other commercially available systems presented at that time and, thanks to the system's detailed description, constituted a considerable step towards a full subject tracking implementation by other researchers.
In the next generation of tracking systems, researchers tried to support the WFOV measurement of the 3D subject position with additional means. The system proposed by Yoon et al. in Ref. [45] used a light stripe projection to detect the users and to estimate their distance from the camera. A system performing recognition at a stand-off of 2 m has also been introduced by AOptix. The system extracted information from a captured 2D image, thus enabling setting the focus at the user's position. The authors reported a capture volume depth of 1 m, which allowed verifying users with heights between 0.9 and 1.9 m. As in the previous cases, finding the subject within the capture volume zone was possible thanks to adaptive optics. Their system employed a multistage, real-time closed-loop control system.
The next step, towards multi-biometric authentication of multiple subjects at the same time, was taken by the Retica team. Their multiple-camera system was called the Retica Eagle Eye and was presented in Ref. [1]. This quite large setup used a scene camera, a face camera and two iris cameras. It is worth noticing that the system was capable of acquiring face and iris images from multiple subjects present anywhere in the capture volume. Moreover, the capture volume of this setup was larger compared to the previously described systems, namely a 3 × 2 × 3 m capture volume with a stand-off of about 5 m. The fields of view of the multiple cameras were hierarchically ordered. The acquisition system also comprised a precise pan-tilt unit (PTU) and a long focal length zoom lens. The system was driven by innovative algorithms that performed wide-area video surveillance, object detection and tracking, and precision pointing. Experimental results were reported in an indoor environment for multiple-subject iris recognition at a distance.
In turn, outdoor, long-range biometric authentication issues were addressed in Li et al.'s work [22]. Their proposed human identification and verification system for automatic long-range biometric recognition of non-cooperative individuals had three cameras. One was a WFOV CCD video camera with an infrared (IR) filter and powerful IR illuminators for scanning humans in a wide area at a long distance (up to 50 m). The other two cameras were high resolution NFOV video cameras mounted on a pan-tilt unit (PTU) to capture the frontal view of the human face and iris, respectively. Once a human subject had been detected in the WFOV images, the PTU-controlled NFOV video camera for human face identification was pointed at the person. Because of the significantly long distance, additional image processing techniques were applied. In order to obtain better quality video, the authors used stabilization and deblurring methods consisting of the following four steps: global motion estimation, local motion estimation, undesired motion removal and image deblurring. By coupling the spatial–spectral characteristics of the images for image quality improvement, the authors estimated the motion pattern of the camera and platform and employed a Wiener filter to remove the blur from the corrupted images.

Fig. 11.10 Photo—PTZ Camera used in Venugopalan system. Pictures taken from [38]

However, as expected, a long-range vision system causes additional problems for the acquisition process, such as a small depth of field and camera shaking—without stabilization mechanisms the acquisition of a quality image may be impossible. Another challenge for such an acquisition system is the resolution variability depending on the proximity of the object with respect to the system. The authors claim that iris recognition can be performed from a 3 m distance, while face recognition was performed at a 50 m distance. However, the validity of these claims is difficult to verify because the authors do not report any recognition rates of the described system that would allow comparing this solution to existing ones.
One of the most cutting-edge technologies for subject tracking at a distance and on the move was presented by Venugopalan et al. in 2010. They proposed a single-camera system where the WFOV and NFOV observations were implemented with just a single PTZ camera, using its built-in zoom lens and autofocusing mechanism [38]. The system, presented in Fig. 11.10, was built using Commercial Off-The-Shelf (COTS) equipment and therefore it may be easily reproduced by other researchers. It constituted a significant step towards cost-efficient non-cooperative multi-biometric recognition. The acquisition device used in the proposed system was the Axis 233D Network PTZ camera manufactured by Axis Communications. This camera, currently not in production, was capable of encoding the video in motion JPEG stream format at a frame rate of 30 fps; it had a 35× optical zoom, a minimum focal length of 3.4 mm (wide) and a maximum focal length of 119 mm (telephoto), a pan capability of 360° and a tilt capability of 180°. The speed of both pan and tilt motions could be varied from 0.05°/s to 450°/s. The light sensitivity could also be adjusted depending on the ambient light thanks to a built-in switchable IR cut-off filter.

The tracking methodology presented by the authors was the following: first, the eye was detected on the user's face; next, the frame was zoomed in while keeping the eye at the center; and finally, the image was captured at the maximum magnification. Due to the sensor parameters and the requirement for at least 150 pix across the iris (AQ), the maximum allowable subject stand-off distance was approximated at 1.6 m. However, in order to provide a focus range from 1.5 m at the telephoto lens setting, an additional macro converter was added to the optical path. Despite having a 715 nm NIR-pass filter, this system did not use any additional NIR light source, because it relied on the IR content of the ambient light, additionally supported by a typical desktop lamp. Also, it is worth mentioning that the system did not require calibration, as was necessary for the binocular setup used for 3D subject position estimation. The authors' system was further developed in terms of its tracking algorithms [18].
One year later, another interesting solution was proposed by the same team for the purposes of long-range biometric image acquisition [37]. The system hardware was composed of a high magnification telephoto lens, an imaging sensor and an auto-focus module. In this project the authors also used Commercial Off-The-Shelf (COTS) equipment: a Canon 800 mm lens and a Canon 5D Mark II camera. The very high pixel count of 21.1 MPix ensured that a single captured frame contained both the face and the iris traits with the required resolution. The camera has a full-frame sensor with 36 mm × 24 mm dimensions. A library of auto-focus routines was developed by the authors in order to ensure good focus across all images acquired from a moving subject. They also designed a custom RS232 lens control module based on the module manufactured by Birger™ Engineering,2 which allowed controlling the lens independently of the camera body. The system was capable of capturing images up to distances of 8 m with a resolution of 200 pixels across the iris, or even up to 12 m with a slightly lower resolution of 150 pix across the iris. Images were also acquired from subjects on the move thanks to specially designed velocity estimation and focus tracking modules.

11.2.3.3 Summary

As was shown in this chapter, the construction of vision tracking systems requires a large amount of multi-domain design, combining knowledge from optical physics, electronics and computer science. Despite the fact that some recent systems are not custom designs, the price that has to be paid for the necessary equipment is still significant. The most expensive part of the system is the optics, especially when zoom and focus capability is required together with a long focal length. It should be stated that degradation of optical quality is typical for lenses with a wide range of focal length regulation. This should be taken into consideration, because the trait image quality assumptions may be violated for certain zoom ratios due to purely optical reasons. None of the authors utilizing zoom-featured lenses mentioned this issue in their papers, nor its influence on the authentication algorithms they were using.

2 http://www.birger.com/

Recent research results in the area of advanced vision tracking systems indicate several approaches to image acquisition at a distance and on the move. Vision systems with tracking capability for biometric purposes date back to 1996, when a pan-tilt mechanism based on mirrors was presented by Hanna et al. [14]. In 2007 Yoon [45] used an additional camera for wide scene observation and a laser stripe projection technique to support the estimation of the object's position. In 2008 Wheeler et al. [42] used a calibrated pair of stereoscopic cameras, instead of a single camera and a light stripe, to provide the estimation of the object's location in space. At the same time Retica Systems Inc. [1] showed their long-range system. In 2009 Dong et al. [6] presented their at-a-distance system. Recent systems use modern pan-tilt-zoom (PTZ) cameras [18, 38], without supporting WFOV cameras for the subject's position estimation in 3D space. There are also some long-range systems [37] designed using expensive optics and sensors from leading manufacturers, specially adapted for the purposes of a control loop based on image analysis. The variety of systems shows different approaches and solutions to the common problem, and obviously all of them have their advantages and drawbacks. The most important features are summarized in Table 11.3.

11.3 Image Analysis

11.3.1 Object Tracking Fundamentals

Object tracking is a fundamental problem in computer vision systems. According to [44], tracking can be defined as "estimating the trajectory of an object in the image plane as it moves around a scene. In other words, a tracker assigns consistent labels to the tracked objects in different frames of a video." Object tracking is a problem specific to the particular application. However, certain categories of tracking methods can be introduced. An excellent survey of tracking methods, which we summarize here, can be found in [44]. In this work the taxonomy of tracking methods depicted in Fig. 11.11 is presented, and tracking methods are divided into three main categories:
• Point tracking: detection of objects is performed independently in every frame by an external algorithm, which represents the detected objects by points. For each object these points are associated based on the previous object state, which can include the object's position and speed.
• Kernel tracking: tracking of objects is performed by computing the motion of the kernel in consecutive frames. Kernels represent the object's shape and/or appearance. A kernel can be, for example, an elliptical shape with an associated histogram. The computed motion may be in the form of a parametric transformation such as translation, rotation or an affine transform.
• Silhouette tracking: these tracking methods can essentially be seen as image segmentation techniques constrained by the results of previous frames' analysis. Tracking of objects is performed by estimating the object region in each frame based on an object model, which is updated every frame. The model can have the form of a shape edge map (in which case tracking is performed by contour evolution) or an appearance density (in which case tracking is performed by shape matching).
Table 11.3 Comparison of basic features of the described systems for object tracking at a distance and on the move

Features (Unit)                       | Hanna et al. [14]                 | Yoon et al. [45]             | Wheeler et al. [42]    | Retica [1]                       | Venugopalan et al. [37] | Iris-On-The-Move [24]
Modality                              | Iris                              | Iris                         | Iris                   | Face&Iris                        | Face&Iris               | Iris
Capture volume width×height [m × m]   | 1 × 0.5                           | 1 × 1                        | –                      | 3 × 3                            | not constrained         | 0.2 × 0.4
Depth of field [m]                    | 0.7                               | 1                            | 1.5                    | 2                                | 1.1                     | 0.1
Depth of focus [cm]                   | –                                 | 9.5                          | –                      | –                                | –                       | 10
Minimum stand-off distance [m]        | 0.3                               | 1.5                          | –                      | 3                                | 0.5                     | 3
Acquisition time [s]                  | 5.5                               | 4                            | 3.2                    | 6                                | 5                       | 2
Camera resolution [MPix]              | –                                 | 4                            | 1.4                    | 0.3                              | 0.4                     | 4
Illumination                          | Incandescent                      | NIR LED                      | NIR LED                | NIR Laser                        | Incandescent            | NIR LED
Tracking capability                   | PT Mirror-based NFOV, Stereo WFOV | WFOV, PTZ NFOV, light stripe | Stereo WFOV, PTZ NFOV  | WFOV, 2x PT NFOV (Iris and Face) | 360° PTZ                | No
Subjects on the move                  | No                                | No                           | No                     | Yes                              | No                      | Yes
Calibration                           | Yes                               | No                           | Yes                    | No                               | No                      | Yes

Fig. 11.11 Taxonomy of tracking methods provided by [44] (only topmost level categories are
presented for clarity)


Point tracking methods can be further categorized as deterministic or probabilistic. In the case of deterministic methods, we define a function representing the cost of associating each object in frame t − 1 with a single object in frame t based on motion constraints. By minimizing the cost function, the correspondence between the object positions in the two analyzed frames is obtained. Once the cost function is defined, finding its minimum can be performed simply by a brute-force search or by more sophisticated methods like the Hungarian algorithm [21]. The second category of point tracking methods—the probabilistic (statistical) methods—introduces a model of the uncertainty of object position measurements due to the noise of image sensors, possible random perturbations of object motion, etc. These methods use a state space to model object state properties like position, velocity and acceleration. The objective of tracking is to estimate the current state of the object given all the measurements up to that moment.
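A minimal sketch of this deterministic data-association step, using SciPy's implementation of the Hungarian algorithm [21]; the Euclidean-distance cost and the coordinates are illustrative assumptions, not a prescription from the surveyed methods.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Object positions predicted from frame t-1 (e.g. a position + velocity model)
# and detections found independently in frame t.
predicted = np.array([[10.0, 20.0], [55.0, 80.0], [120.0, 40.0]])
detected = np.array([[12.0, 22.0], [118.0, 43.0], [54.0, 79.0]])

# Cost of associating prediction i with detection j: Euclidean distance.
cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)

# Hungarian algorithm: globally minimal-cost one-to-one assignment.
rows, cols = linear_sum_assignment(cost)
for i, j in zip(rows, cols):
    print(f"track {i} -> detection {j} (cost {cost[i, j]:.1f})")
```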
The kernel tracking methods can be divided into template-based and multi-view-based categories. Among the template-based methods, the common approach is to match the object template to the image content by a brute-force search. For each candidate location of the object template in the image, the measure of similarity between the template and the particular image region is computed. As a similarity measure, a cross-correlation function is widely used. To reduce the computation time, the object location search space is often reduced based on the location of the object obtained in the previous frame. In the case of multi-view-based methods, different views of the object are learned off-line and are used for tracking to overcome the problem of the object's appearance changing dramatically during tracking.
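The template-based variant (cross-correlation restricted to a search window around the previous object location) can be sketched with OpenCV as follows; the helper name, the window size and the choice of normalized cross-correlation are our assumptions rather than a specific published tracker.

```python
import cv2

def track_template(frame, template, prev_xy, search_radius=40):
    """Locate `template` in `frame` near the previous object position.

    Normalized cross-correlation is the similarity measure, and the search
    space is restricted to a window around `prev_xy` to save computation.
    Assumes the window is larger than the template.
    """
    h, w = template.shape[:2]
    x0 = max(prev_xy[0] - search_radius, 0)
    y0 = max(prev_xy[1] - search_radius, 0)
    x1 = min(prev_xy[0] + w + search_radius, frame.shape[1])
    y1 = min(prev_xy[1] + h + search_radius, frame.shape[0])
    window = frame[y0:y1, x0:x1]

    scores = cv2.matchTemplate(window, template, cv2.TM_CCORR_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)
    return (x0 + best_loc[0], y0 + best_loc[1]), best_score
```

A caller would pass the previous top-left corner of the object and reuse the returned location in the next frame.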
The subcategories of silhouette tracking methods include contour evolution and shape matching. The contour evolution methods iteratively evolve an object contour from the previous frame to its new position and shape in the current frame. It is assumed that the object regions in the two consecutive frames overlap to some extent to allow

proper contour evolution. There are two common approaches to contour evolution. One approach defines a contour energy function and lets the contour freely and iteratively evolve towards the shape for which the minimal value of the energy function is obtained. The second approach introduces a shape space that constrains the possible contour shapes [2]. Contour evolution is then allowed only within the imposed shape space. The shape matching methods are similar to the template-based kernel tracking methods in the sense that for each consecutive frame the model is matched against the image. The difference is that the shape matching methods, after matching the template in the current frame, reinitialize the object model used for searching the next frame.

11.3.2 Human Tracking

Real-time human tracking has potential applications in many vision systems. A good example of its application are visual surveillance systems, particularly the solutions dedicated to metro public transportation systems, which perform real-time human behavior recognition [4]. Another application of human tracking algorithms can be found in biometric recognition systems. Human tracking technologies are starting to play an important role in the automotive industry as well, where pedestrian detection performed on board in real time by a pedestrian protection system (PPS) assists the driver in making decisions by signaling possibly dangerous driving situations. Some PPS systems not only warn the driver, but also perform braking actions and/or deploy external airbags which protect pedestrians if a collision is unavoidable [10].
To precisely define the problem, we formulate the following definition of human tracking: given an image sequence, continuously determine in real time if there are any humans in the observed scene, and return the position and extent of each image region where a human is present in the consecutive frames. Due to its many important potential applications, human tracking is an area of active research. It is also a very challenging task due to the following factors:

• In-class variability: the appearance of persons exhibits very high variability, as they can change pose, wear different clothes, carry additional objects like bags, etc.
• Size: depending on the application, the size of persons as seen by the vision system can vary significantly.
• Scene complexity: human tracking is often performed in unconstrained conditions where the scene content (the background and the objects present in the foreground) can be very rich and complex.
• Occlusion: persons may be partially occluded by other objects. For example, in crowded environments some persons may partially cover others.
• Lighting conditions: illumination and weather conditions influence the appearance of the observed scene.

• Dynamic background: in the case of systems where the camera is moving—such as a PPS—the model of a static background facilitating moving object detection cannot be built.

In this chapter we focus on human tracking algorithms dedicated to systems with a static scene (like biometric recognition systems) rather than algorithms dedicated to moving vision systems.

11.3.2.1 Methods Overview

A straightforward and commonly used approach to human tracking is to use motion blobs3 detected by comparing consecutive frames with a learned model of the stationary background [3, 7, 15, 27, 31, 34, 48]. In uncrowded environments such blobs properly indicate the regions corresponding to each human in the scene, and even when multiple individuals are represented by a single blob, the separation of each person can be achieved by relatively simple processing. For example, in Refs. [31, 48] head candidates are detected by analyzing the motion blob boundaries, as it is assumed that the overlapping region of any two persons is small enough for both head contours to appear separately in the image.
Various approaches to building a stationary background model are used in practical human tracking systems. In Ref. [31] the background appearance is estimated by simply median-filtering the video image over time. Works [27, 48] involve a statistical background model where the color of each pixel in the image is modeled by a single Gaussian distribution. In Ref. [49] this single Gaussian distribution model is combined with a uniform distribution, which captures the outliers not modeled by the Gaussian distribution, to make the model more robust. An advanced statistical model of the background is the adaptive background Gaussian mixture model [33] (see Appendix 6.1), which is currently one of the most commonly used approaches [3, 4, 34].
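The adaptive Gaussian mixture background model is available off the shelf in OpenCV; a hedged sketch of its typical use for extracting motion blobs from a video stream (the file name and parameter values are placeholders) could look as follows.

```python
import cv2

# Adaptive Gaussian mixture background model (cf. [33]); OpenCV's MOG2
# implementation updates the per-pixel mixture with every frame.
backsub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                             detectShadows=True)

cap = cv2.VideoCapture("corridor.avi")        # placeholder input sequence
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = backsub.apply(frame)                 # 255: foreground, 127: shadow, 0: background
    fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)[1]   # drop shadow pixels
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN,
                          cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    # Each connected component of the foreground mask is a candidate motion blob.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg)
cap.release()
```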
Some approaches do not build a model of the stationary background to find motion blobs in the video stream. In Ref. [23] a temporal differencing technique is used and motion blobs are found as follows (a short sketch of these steps is given after the list):

• For each two consecutive frames the absolute difference image Δt is determined
according to (11.7):
Δt = |It − It−1 | (11.7)

where It and It−1 denote current and previous frame, respectively.


• Motion image Mt is obtained by thresholding the absolute difference image as
in (11.8): 
    M_t(x, y) = I_t(x, y)   if Δ_t(x, y) > T
    M_t(x, y) = 0           otherwise                                     (11.8)

3 Motion blob—region in the image corresponding to moving object.



where x, y are the pixel coordinates and T denotes the threshold value.


• Detected regions in the motion image are clustered into motion blobs using a
connected component criterion.
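A minimal sketch of the three steps above, assuming grayscale frames; the threshold T and the minimum blob area are arbitrary illustration values.

```python
import cv2
import numpy as np

def motion_blobs(frame_prev, frame_curr, T=25, min_area=200):
    """Temporal differencing as in (11.7)-(11.8) followed by blob clustering.

    Both inputs are assumed to be single-channel (grayscale) uint8 frames.
    """
    # (11.7) absolute difference image.
    delta = cv2.absdiff(frame_curr, frame_prev)
    # (11.8) keep current-frame pixels where the difference exceeds T.
    motion = np.where(delta > T, frame_curr, 0).astype(np.uint8)
    # Cluster the detected regions into motion blobs (connected components).
    binary = (motion > 0).astype(np.uint8) * 255
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    # Return bounding boxes of sufficiently large blobs (label 0 is background).
    return [stats[i, :4] for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]
```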

Once the motion blobs are detected, the remaining problem is how to further analyze them. In Ref. [23], where a system for tracking objects in a video stream is described, motion blobs are classified as either a human, a vehicle or another object. The classification is based on two simple geometric properties of the blob—its dispersedness and its area—and is performed across multiple frames to increase reliability. Dispersedness represents the complexity of the blob's shape and is defined as in (11.9). Once the motion blob is classified as either a human or a vehicle, it is used as a template for the tracking process. The tracking process matches the template to the current image using the cross-correlation function as a similarity measure. To increase the likelihood that the object will be correctly and robustly tracked, the cross-correlation is computed only for a subset of the possible template locations, which is determined by the result of the motion blob detection. As the templates of the tracked objects are updated every frame, the method [23] can be classified into the "shape matching" category of object tracking methods (see Fig. 11.11).

    dispersedness = perimeter² / area                                     (11.9)
An original approach to human detection was proposed in Ref. [39]. It simultaneously uses patterns of motion and appearance and is another example of an approach where the background model is not estimated. In this work the object detection framework originally designed for face detection, which is based on the AdaBoost learning algorithm and is explained in Sect. 11.3.3.2, was adapted to human tracking. Apart from the analysis of the human silhouette appearance performed independently in each frame, which is analogous to the case of face detection, motion filters were introduced as an additional source of features for the classification algorithm. The motion filters compare two consecutive frames. They operate on the five differential images listed in (11.10), where It−1 denotes the previous frame, It the current frame, and the arrows denote translation operations (e.g. It ↑ denotes the current frame after shifting it up by 1 pix).
Δ = abs (It−1 − It )
U = abs (It−1 − It ↑)
L = abs (It−1 − It ←) (11.10)
R = abs (It−1 − It →)
D = abs (It−1 − It ↓)
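A small sketch of how the five differential images of (11.10) can be computed for two grayscale frames; np.roll is used here as a simple stand-in for the shift operators, so border handling is ignored.

```python
import numpy as np

def motion_filter_images(prev, curr):
    """Five differential images of (11.10) for two grayscale frames."""
    prev = prev.astype(np.int16)   # avoid uint8 wrap-around in the differences
    curr = curr.astype(np.int16)
    return {
        "delta": np.abs(prev - curr),
        "U": np.abs(prev - np.roll(curr, -1, axis=0)),   # current frame shifted up
        "L": np.abs(prev - np.roll(curr, -1, axis=1)),   # shifted left
        "R": np.abs(prev - np.roll(curr, 1, axis=1)),    # shifted right
        "D": np.abs(prev - np.roll(curr, 1, axis=0)),    # shifted down
    }
```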

This approach requires a training process where a pair of image patterns from two consecutive frames comprises a single training sample. For positive training samples the two image patterns contain a human (see Fig. 11.12), while for negative samples a human is not present in the two image patterns.

Fig. 11.12 A subset of positive training samples used in Ref. [39]. Each pair of image patterns
comprises a single training sample. Image taken from Ref. [39]

In consequence, the whole approach essentially detects humans based on the "static" appearance of the silhouette in a single frame and on "short term patterns of motion" derived solely from two consecutive frames. Although this proves to be a very time-efficient approach, it lacks the long-term motion analysis that could significantly reduce the false positive rate—it does not correlate detection results across multiple frames. However, according to the authors, this solution should be seen as complementary to other long-term analysis approaches and should not be treated as a stand-alone human tracking system. As the training process is performed off-line and various views of the human silhouette are learned, this approach can be classified into the "multi-view based" category of object tracking methods (see Fig. 11.11).

Due to advances in face detection algorithms and the availability of highly accurate and fast face detectors (see Sect. 11.3.3), human tracking can be steered by a face detection algorithm [9, 18, 38, 42]. One limitation of this approach is that it can only be used in systems where humans are imaged with a resolution high enough to represent facial features. Apart from the face appearance, skin color is a useful feature indicating the human position [32, 36]. However, its applicability is usually limited to indoor scenes with controlled illumination. Another important limitation of both the face appearance and the skin color features is the requirement that persons be directed towards the camera.
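As a hedged illustration, face detection that could steer such a tracker is readily available through the stock Haar cascade detector shipped with the opencv-python distribution; the file names and parameter values below are placeholders.

```python
import cv2

# Stock Viola-Jones-style detector shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("subject.png")            # placeholder input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)

# Each detection is an (x, y, w, h) face rectangle, usable e.g. to steer a PTZ camera.
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                       minSize=(60, 60))
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```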
One of the recent directions of research on human tracking is the development of algorithms dedicated to the analysis of crowded scenes, where some individuals can partially occlude others. Sample works in this domain are Refs. [41, 49]. Both approaches introduce a 3D human shape model. Without introducing even the simplest human shape model, tracking methods are limited to blob tracking. The main advantage of model-based tracking is that it can solve the blob merge and split problems by enforcing global constraints [49].
A block diagram of the approach presented in Ref. [49] is depicted in Fig. 11.13. The method builds a background model (based on a single Gaussian distribution combined with a uniform distribution) used for motion blob detection. Using the camera model and assuming that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the motion blobs. In each frame the motion blobs are segmented into multiple humans, and each human is associated with an existing trajectory. As the hypotheses are in 3D, modeling possible occlusions is naturally incorporated into the method. The updated trajectories are used to propose human hypotheses in the next frame. In this way the segmentation and tracking modules are integrated in a unified framework and interact with each other over time. Internally, to solve the problem of segmentation and tracking, the framework adapts a data-driven Markov chain Monte Carlo (MCMC) method to choose which hypothesis best matches the current observations. A sample result of the method is depicted in Fig. 11.14.
The human body and its motion are highly complex, but for the purpose of tracking humans a simplified low-dimensional model of the human body is sufficient, as was proven in Ref. [49]. This work introduces a 3D human shape model composed of three ellipsoids with a fixed spatial relationship, which is presented in Fig. 11.15. One ellipsoid corresponds to the head, one to the torso and arms, and one to both legs. The ellipsoids were chosen as they fit human body parts well and have the property that their projection is an ellipse. The model is controlled by the following parameters (collected in a short sketch after the list):

• size: it is the 3D height of the model. The parameter controls the scaling of the model in all three dimensions.
• thickness: it allows additional scaling of the model in the two horizontal dimensions. Changing this parameter does not influence the height of the model.
• position: it is the human's position (x, y) on the ground plane, projected to the image coordinate system.

Fig. 11.13 Block diagram of the approach presented in Ref. [49]. The segmentation and tracking
modules interact with each other along time

Fig. 11.14 Sample result of tracking humans using the method presented in Ref. [49]: a input
frame showing a crowded scene, b detected motion blobs, c result of segmentation and tracking.
Note correct separation of large motion blobs representing multiple humans. Images taken from
Ref. [49]

Fig. 11.15 3D human model introduced in Ref. [49]. The model is composed of three ellipsoids with fixed spatial relationship

• orientation: it is the 3D orientation of the model defined by rotation angle. Rotation


angle equal to 0◦ corresponds to human facing the camera.
• inclination: it allows slight inclinations of the 3D model from perfect upright
position. In Ref. [49] a simplified approach is used where inclination is modeled by
a single parameter to capture inclination in 2D instead of two parameters capturing
inclination in 3D.
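To make this parameterization concrete, the following minimal sketch shows one possible container for the five model parameters described above. The class name, field names and example values are illustrative assumptions, not the data structures used in Ref. [49].

from dataclasses import dataclass

@dataclass
class HumanModelHypothesis:
    """Hypothetical container for the parameters of a three-ellipsoid human model."""
    size: float         # 3D height of the model (scales all three dimensions)
    thickness: float    # extra scaling in the two horizontal dimensions only
    position: tuple     # (x, y) position on the ground plane, in image coordinates
    orientation: float  # rotation angle in degrees; 0 means facing the camera
    inclination: float  # single 2D inclination parameter (simplified from 3D)

# Example hypothesis for one tracked person
hypothesis = HumanModelHypothesis(
    size=1.75, thickness=1.0, position=(320.0, 240.0),
    orientation=0.0, inclination=0.0,
)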

11.3.2.2 Error Rates and Speed of Human Tracking Algorithms

The performance of human tracking methods heavily depends on the applied algorithm
and on the scene content observed in the particular application. In this section the
results of the approach described in Ref. [49] are presented. This method is an advanced
approach based on a 3D human shape model and can indicate the capabilities of modern
human tracking algorithms. The method was evaluated on publicly available test
sets prepared within the Context Aware Vision using Image-based Active
Recognition (CAVIAR) project.4 The CAVIAR project provides test data sets for various
scenarios. In Ref. [49] all 26 video sequences for the “shopping center corridor”
test case were used for evaluation. Together these sequences form a test dataset
containing 36,292 frames of resolution 384 × 288 pix acquired at a 25 fps frame rate,
in which 277 human trajectories are recorded. The dataset contains intensive inter-object
occlusions. Altogether there are 96 occlusion events in the dataset, of which
68 are heavy occlusions (over 50 % of the object is occluded) and 19 are almost full
occlusions (over 90 % of the object is occluded).

4 http://homepages.inf.ed.ac.uk/rbf/CAVIAR/

Fig. 11.16 Illustration of performance evaluation criteria

Table 11.4 Results of performance evaluation of the method presented in Ref. [49]
Criterion        Mostly tracked  Mostly lost  Trajectory fragment  False alarm  ID switch
Number           141             12           89                   27           22
Percentage (%)   62.1            5.3          –                    –            –

Additionally, interactions between humans, such as talking or handshaking, are present
in the dataset, which makes the human tracking task more difficult.
To evaluate the performance of the method, the following five criteria were introduced
in Ref. [49]:
• Mostly tracked trajectories
• Mostly lost trajectories
• Trajectory fragment
• False alarm (a result trajectory not corresponding to any human)
• ID switch (exchange of identity between a pair of result trajectories)
These criteria are illustrated in Fig. 11.16, while the results of the performance evaluation
are presented in Table 11.4. Sample frames with marked tracking results are depicted
in Fig. 11.17. The measured image processing speed was 2 fps on a PC platform
(Intel® Pentium® 4 processor, 2.8 GHz). The results clearly indicate how challenging
the human tracking task is. Although an impressive number of humans was properly
tracked, errors on this difficult test dataset could not be avoided even though a modern
and advanced human tracking method was used.

Fig. 11.17 Sample results of the method presented in Ref. [49] for frames from the CAVIAR test dataset
(sequence “TwoEnterShop2cor”): a full occlusion resulting in a fragmented trajectory and an ID switch,
b no detection of a human wearing clothes with color similar to the background. Frames with
marked results taken from Ref. [49]

11.3.3 Face Detection

Face detection and tracking algorithms have found applications in many different domains,
such as biometric person recognition, human-computer interaction (HCI), the digital
camera industry (where built-in face detectors support auto-focus and auto-exposure
algorithms) and digital photo management software (which often provides face
detection to facilitate organizing people’s photo collections). Although advanced
solutions to the face detection and tracking problem have been developed and published,
the problem remains an active research area due to the enormous variations of observed
scenes in vision systems detecting and tracking faces.
Before going into the methods overview, we give precise definitions of the discussed
problems:
• Face localization: given an arbitrary image and assuming it contains a single face,
return the position and extent of the image region where this face is present.
• Face detection: given an arbitrary image, determine whether or not any faces are
present in the image. If so, return the position and extent of each image
region where a face is present.
• Face tracking: given an image sequence, continuously determine in real time whether
there are any faces in the observed scene and return, for the consecutive frames, the
position and extent of each image region where a face is present.

According to these definitions, face localization is a significantly simplified face
detection problem, as additional knowledge is available a priori (the image contains a single
face). In this chapter we focus on face detection rather than face localization or face
tracking. A sample result of face detection is depicted in Fig. 11.18.

Fig. 11.18 Sample result of face detection

11.3.3.1 Methods Overview

Due to the possibly enormous variability of scene content across applications, face
detection is undoubtedly a challenging task. Reliable face detection algorithms must
deal with the following factors [43]:

• Pose: the observed person may face the camera, but may also be rotated about
the camera axis, which causes facial features to become partially
occluded. In extreme cases only profiles of faces can be captured. Moreover, the
position of the head may not be vertical, which makes face detection even
more difficult.
• Presence or absence of structural components: additional facial features such
as beards, mustaches or glasses may be present in the image.
• Facial expression: facial expressions of the same person significantly change the
face appearance and introduce great in-class variability.
• Occlusion: faces may be partially occluded by other objects. For example, in
crowded environments some persons may partially cover others.
• Lighting conditions: lighting conditions (spectrum, intensity, spot/distributed
source) influence the appearance of a face.

A comprehensive survey of face detection algorithms may be found in Refs. [43, 46].
The various methods can be classified into the following four major categories [43]:

• Knowledge-based methods: these methods are based on rules that represent
human knowledge of what constitutes a typical face. Usually, the rules capture
the relationships between facial features. These methods are dedicated mainly to
face localization.
• Feature invariant methods: these methods are based on finding facial features
that persist across different subject poses, facial expressions and lighting
conditions. These methods are dedicated mainly to face localization.
• Template matching methods: these methods store several facial templates which
describe the face as a whole or its features separately. They are based on computing
correlations between an input image and the stored templates. These methods are
suited to both face localization and detection.
• Appearance-based methods: these methods use models of the face, as in the case of
template matching methods. In contrast to the previous category, however, the models
are learned from a training set of images representative of the particular application.
The general recipe is to collect a large training set of face and non-face images and
apply a machine learning algorithm to build a face model for classification.
These methods are dedicated mainly to face detection.
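As a schematic illustration of this recipe (not any specific published detector), the sketch below trains a boosted classifier of decision stumps on labelled face/non-face feature vectors. The data here are random placeholders; in a real system the vectors would be features extracted from image patches.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder feature vectors: in practice these would be descriptors computed
# from face patches (label 1) and non-face patches (label 0).
X_face = rng.normal(loc=1.0, size=(500, 64))
X_nonface = rng.normal(loc=0.0, size=(500, 64))
X = np.vstack([X_face, X_nonface])
y = np.concatenate([np.ones(500), np.zeros(500)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# AdaBoost over decision stumps: each "weak" learner looks at a single feature,
# and boosting combines them into one "strong" face/non-face classifier.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))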

According to the current state of the art, modern face detectors are mostly appearance-based
methods, due to their high performance compared with other methods
[46]. Although many effective approaches to face detection have been developed
to date, undoubtedly the most significant milestone in face detection progress
was the publication of the method proposed by Viola and Jones [40]. This approach,
discussed in detail in Sect. 11.3.3.2, became the de facto standard of face detection
in practical applications [46]. In particular, it was successfully applied in a number
of biometric recognition systems, such as the unconstrained-conditions iris recognition
system described in Ref. [38], the non-intrusive iris image capturing system published in
Ref. [45], the unconstrained-conditions periocular recognition system presented in
Ref. [18] and the long-range multi-biometric system described in Ref. [1]. After some
modifications, it was also adopted in the iris recognition system at-a-distance presented
in Ref. [6]. Due to the high accuracy and speed of the Viola–Jones method, it became
a basis for further development, and appearance-based methods have dominated recent
advances in face detection.

11.3.3.2 Viola–Jones Method

The face detection method proposed by Viola and Jones consists of three main
contributions [40]:
• integral image: it allows very fast evaluation of the “Haar-like” features.
• classifier built using the AdaBoost learning algorithm: it selects a small number
of critical visual features from a very large set of potential features.
• cascade of classifiers: it radically reduces computation time while improving
detection accuracy.

Fig. 11.19 “Haar-like” features shown relative to the detection window (light gray). Sums of
pixel values in white rectangles are subtracted from sums of pixel values in dark gray rectangles:
a, b two-rectangle feature, c three-rectangle feature, d four-rectangle feature

Fig. 11.20 Integral image value at each (x, y) location is the sum of pixel values in the original
image in the rectangular region marked with gray

The method searches the image for faces by extracting the “Haar-like” features
inside the detection window, which is placed at consecutive locations inside the
image. Three kinds of the “Haar-like” features are used (see Fig. 11.19):
• two-rectangle feature: values of pixels are summed in two rectangular regions. The
difference between the sums is a feature value. The two regions are horizontally
or vertically adjacent and have the same size and shape.
• three-rectangle feature: values of pixels are summed in three horizontally adjacent
rectangular regions. A feature value is computed by adding sums from the two
outside regions and subtracting sum from the center region.
• four-rectangle feature: values of pixels are summed in four adjacent rectangular
regions. A feature value is computed by adding sums from the two diagonal regions
and subtracting sums from the remaining two diagonal regions.
The “Haar-like” features can be computed very efficiently using an intermediate
image representation called an integral image. At each (x, y) location it contains
the sum of pixel values within the rectangle starting at point (0, 0) and ending at
location (x, y), as depicted in Fig. 11.20. Once the integral image is obtained, any
sum within a rectangular region can be computed in four array references, as depicted
in Fig. 11.21.
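As an illustration of these two ideas, the following sketch (using NumPy, with illustrative function names) computes an integral image and evaluates a rectangle sum and a simple two-rectangle feature from it. It is a minimal demonstration under these assumptions, not the original implementation.

import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns: ii[y, x] = sum of img[0:y+1, 0:x+1]."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of pixel values in a rectangle using four array references (cf. Fig. 11.21)."""
    bottom, right = top + height - 1, left + width - 1
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_feature(ii, top, left, height, width):
    """A horizontal two-rectangle feature: right half minus left half."""
    half = width // 2
    return (rect_sum(ii, top, left + half, height, half)
            - rect_sum(ii, top, left, height, half))

# Example on a random 8-bit grayscale image
img = np.random.randint(0, 256, size=(288, 384), dtype=np.uint8)
ii = integral_image(img)
print(rect_sum(ii, 10, 20, 24, 24), two_rect_feature(ii, 10, 20, 24, 24))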
Fig. 11.21 Integral image usage. The sum of pixel values within the rectangle D marked with gray can
be computed from the integral image as 4 + 1 − 2 − 3, since 1 = A, 2 = A + B, 3 = A + C and
4 = A + B + C + D

Fig. 11.22 Cascade of classifiers. As soon as an image region is classified as not containing a face,
further processing is ceased

Computing the “Haar-like” features at every single position of the detection window
produces a large feature vector. The detection window must be placed at every
possible location to scan the entire image. Moreover, the input image is scanned at 12
different scales; at each scale the detection window is 1.25 times larger than at the
previous one. All of these factors require an efficient approach to select a small subset of
critical features and to classify them. The Viola–Jones method introduces a simple
and efficient classifier built using the AdaBoost learning algorithm. In this classifier
many “weak” classifiers are combined to create one “strong” classifier. Each “weak”
classifier depends on a single feature only. It is not expected to produce a reliable
answer; the “weak” classifier is only expected to classify the data better
than a random guess. The “strong” classifier is obtained in the AdaBoost learning
process by combining multiple “weak” classifiers.
All “weak” classifiers are grouped into a cascade of classifiers, presented in
Fig. 11.22. Each stage of the cascade contains a small number of “weak” classifiers
and classifies the analysed image region inside the detection window as either a
face or not a face. In the case of a positive answer, the analysis is continued by the next
stage. As soon as the region is classified as not containing a face, further processing
is ceased. This allows very fast rejection of non-face regions by computing only a
small subset of key features.
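In practice, trained cascades of this kind are readily available. The sketch below, assuming the opencv-python package and one of the Haar cascade files shipped with it, shows how such a cascade detector is typically applied to an image; it is a usage illustration, not part of any of the systems discussed in this chapter.

import cv2

# Load a pre-trained Viola-Jones style cascade shipped with OpenCV (frontal faces)
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("frame.png")                     # any input frame (assumed to exist)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # the detector works on grayscale

# scaleFactor=1.25 mirrors the scale step mentioned above; minNeighbors filters
# overlapping detections that support the same face hypothesis
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.25, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.png", img)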

Fig. 11.23 DET curve of the Viola–Jones face detection algorithm implemented in OpenCV
software package. Results taken from Ref. [5]

11.3.3.3 Error Rates and Speed of Face Detection Algorithms

Independent tests of modern face detection algorithms confirm that these algorithms
achieve high accuracy and speed. Results of a comprehensive test of face
detection algorithms may be found in Ref. [5]. In this test seven software packages
for face detection were evaluated and compared. Five of them are publicly available
and two are commercial products (without a description of the general principle
applied in the algorithm). To obtain a large test set, 9 face databases were combined,
which resulted in a dataset containing 59,888 images: 11,677 images with faces
and 48,211 images without faces. To evaluate face detection accuracy, the detection
error tradeoff (DET) curve was plotted for each algorithm. The DET curve plots the
dependence between the following two error rates:
• False Rejection Rate (FRR): it indicates the probability of misclassification of the
images containing a face.
• False Acceptance Rate (FAR): it indicates the probability of misclassification of
the images not containing a face.
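Concretely, for a labelled test set these two rates can be estimated by simple counting, as in the short sketch below (the function and variable names are illustrative).

def far_frr(detections, labels):
    """Estimate FAR and FRR from per-image detector outputs.

    detections: list of bools, True if the detector reported a face in the image
    labels:     list of bools, True if the image actually contains a face
    """
    face_imgs = [d for d, l in zip(detections, labels) if l]
    nonface_imgs = [d for d, l in zip(detections, labels) if not l]
    frr = sum(1 for d in face_imgs if not d) / len(face_imgs)       # missed faces
    far = sum(1 for d in nonface_imgs if d) / len(nonface_imgs)     # false detections
    return far, frr

far, frr = far_frr([True, False, True, True], [True, True, False, True])
print(f"FAR = {far:.0%}, FRR = {frr:.0%}")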
Although the exact error rates strongly depend on the particular application, the
accuracy reported in Ref. [5] can, due to the large dataset used, be seen as a good indicator
of the capabilities of modern face detection algorithms. The best accuracy among the
5 publicly available software packages was obtained for the package implementing
the Viola–Jones face detection algorithm. The DET curve for this software package
is presented in Fig. 11.23. At a FAR equal to 1 %, the FRR is as low as 8 %. Undoubtedly
such accuracy can be seen as a very impressive result. On the other hand, the obtained
error rates clearly indicate that there is still large room for improvement to reduce
the number of misclassified images.
While achieving high accuracy, face detection can be performed in real time.
In their original work [40], Viola and Jones report that their face detection algorithm
processes 15 fps, each frame having a resolution of 384 × 288 pix (measured on a PC
with a 700 MHz Intel Pentium III processor).

11.4 Processing Platforms

11.4.1 Introduction

As previously described, the vast majority of the processing presented in
Sect. 11.3 is related to sophisticated algorithms for image conditioning and
preprocessing, namely segmentation and quality assessment. For subject tracking at a
distance and on the move, sometimes only one frame in the video stream may contain
a biometric sample of sufficient quality for successful authentication. Thus, in this
kind of demanding video application, tasks have to be performed in real time, which
means that, e.g., each frame of the input video stream has to be processed before the
next frame becomes available. For a typical video stream the rate equals 30 fps,
hence the time available for analysis is about 33 ms per frame. Therefore, special dedicated
architectures are used for image preprocessing and quality assessment. Processing robustness
and predictable analysis time in such dedicated architectures may be of particular
importance.
Processing tasks can be divided into two groups: non-contextual and contextual
filtering. To the first group belong image conditioning using levels and transition
curves (e.g., for contrast improvement), thresholding, as well as texture analysis such as
wavelet filters used, e.g., for segmentation based on feature extraction.
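For example, the non-contextual (point-wise) operations mentioned above can be expressed as simple per-pixel mappings, as in this small NumPy sketch; the curve shape and threshold value are arbitrary illustrations.

import numpy as np

img = np.random.randint(0, 256, size=(288, 384), dtype=np.uint8)  # stand-in frame

# A levels/transition curve applied through a 256-entry lookup table (here a simple
# gamma-style contrast curve); each output pixel depends only on the corresponding
# input pixel, so the operation is non-contextual.
lut = (255.0 * (np.arange(256) / 255.0) ** 0.5).astype(np.uint8)
stretched = lut[img]

# Global thresholding is likewise a pure per-pixel decision
binary = (stretched > 128).astype(np.uint8)
print(stretched.dtype, binary.mean())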
For the purpose of non-contextual filtering, current FPGA-based systems are
able to achieve data latencies in the range from 10 to 100 ns. Thus, they provide
enough processing power for typical filtering for image conditioning and preprocessing.
In turn, background estimation and image segmentation are much more
complex, because they always require contextual analysis, which involves solving
multi-dimensional equations. Therefore, generic processors, digital signal processors
(DSP) or, most recently, graphics processing units (GPU) are essential for
performing such calculations.
In the next part of the chapter we discuss the variety of ways in which data
processing can be realized in a biometric system for less constrained environments
using existing solutions as a reference.

11.4.2 Generic Approach

Several approaches to realizing the processing can be used. The first
and simplest one is to use a typical PC with additional add-on cards, such as a
frame-grabber or illumination control, as support for basic operations on images
(see Fig. 11.24). The key processing is performed on the PC platform. The software
containing the authentication algorithms is developed like typical software for a given
operating system, e.g. Linux, Windows, etc.

Fig. 11.24 Sample block diagram of a PC-based video processing system for the purpose of subject
tracking

In Fig. 11.24 a sample architecture for advanced image processing is presented.
The fast NFOV camera is connected to the PC through the frame-grabber. The WFOV
cameras are connected to the processing station through a simple interface (e.g., USB,
IEEE1394, Ethernet), because the processing requirements are not very
demanding (the resolution does not need to be high). The frame-grabber controls
the trigger signal to provide flash synchronization. As previously
shown, flashing may be particularly important for image acquisition from
a moving subject. The control over the pan-tilt-zoom functionality may be provided
using typical PC I/O standards or embedded in the frame-grabber.
This approach does not require low-level programming effort such as camera
driver development, because drivers for USB/HDMI/IEEE1394 cameras or frame-grabber
cards are usually available for most operating systems. Therefore, the designer
can focus on increasing the performance of the algorithm being developed.
Another advantage is that the developed code can be used on other platforms that
use the same operating system or, with moderate effort, moved to other platforms.
Since such an implementation of authentication algorithms is certainly the simplest
one, the development of algorithms on PC workstations is a common approach in
research as well as in commercial installations [24].
Matey’s et al. Iris-On-The-Move solution presented in Ref. [24] proposes a mixed
approach: in addition to the standard PC, FPGA resources are embedded in the
11 Human Tracking in Non-cooperative Scenarios 321

Fig. 11.25 Dong’s iris recognition system. Picture taken from Ref. [6]

frame-grabber and are responsible for accelerating the match filter and subsequent
thresholding. The CPU executes simple operations on binary images, which reduces
processing requirements. Moreover, an additional custom microcontroller-based cir-
cuit provides synchronization of illumination and image capture process. However,
the majority of processing tasks, e.g. capturing the images and running the demon-
stration application, is still performed by the PC.
The Wheeler et al. [42] presented a prototype system implemented on a single
general-purpose PC with a 3.20 GHz single-core CPU. Their program consisted in
three independent and asynchronous threads: one thread responsible for continuously
detecting faces and determining the 3D face location, second for pointing the pan-tilt
head towards the face and setting the focal distance and the third for controlling iris
recognition. The third thread was also responsible for continuously reading frames
from the iris camera, detecting pupils, cropping eye regions, and performing iris
segmentation and recognition. However, it should be emphasized that not all iris
camera frames were processed; the maximum frame rate achieved by their prototype
was about 5 fps for NFOV and 15 fps for WFOV.
Typical PC computers are also used in Dong et al. system [6]. A PC is responsi-
ble for both system control and image processing. It controls an image acquisition
card which acquires images from the iris camera and a motion control card with
an embedded circuit which provides the control for infrared illumination and the
pan-tilt-zoom (Fig. 11.25).

Fig. 11.26 An experiment on the implementation of iris-based authentication on different architectures.
Three PC-based platforms were chosen for the comparison: PC1, based on an Intel Core2
6600 2.4 GHz; PC2, based on an Intel Pentium4 3.4 GHz; and PC3, a mobile architecture based on
an Intel Core2 Duo T7100 1.79 GHz. BioCu is the dedicated platform using FPGA and DSP devices
for signal processing described in Ref. [12]

Note that the presented architectures, both the optical systems and the processing
platforms, were built for research. They cannot be easily transferred to an embedded
system, e.g., for mass-scale implementation.

11.4.3 Dedicated Architectures

We can distinguish two areas that computers are proficient in: data manipulation,
such as database management and word processing, and mathematical calculation,
used for data processing. Because of technical tradeoffs in digital system design,
it is not profitable to make a processor, such as a CPU, optimized for both. The reasons
lie in memory organization and access, instruction length, interrupt handling, and whether
the CPU microarchitecture is optimized for MAC (Multiply and Accumulate) operations,
etc.
Nowadays a typical PC platform, despite being very fast and efficient due to its multi-core
architecture, is not optimized for signal processing. This can be clearly observed
in Fig. 11.26, where the results of different implementation approaches to this extensive
processing are presented. The experiment concerned the implementation of the
same iris recognition algorithm stages on two architectures: a generic one and a DSP.
The details of the dedicated architecture and the algorithm implementation strategy
were described in Ref. [12].
Fig. 11.27 Block diagram of a sample dedicated video processing architecture for the purpose of
subject tracking. A sample implementation of this kind of processing system is presented in Ref. [12]

As far as PC-based implementations are concerned, iris segmentation appears
to be the most time-consuming stage, which is caused by the large amount of typical
signal processing for which standard architectures are not optimized. Therefore, such
processing can be performed much more efficiently in DSP-like architectures. On the
other hand, in the case of a DSP, the most time-consuming step in the authentication
process is the pattern search within the database. In this case, the time needed to
search the repository increases linearly with the complexity of the problem (the number
of records in the database).
As can be seen, there is no perfect architecture to solve the variety of problems
which appear in the field of subject tracking for biometric purposes, but the
presented results give an idea of which tools allow some of them to be solved efficiently.
Nevertheless, as we will show, improved processing robustness for this application
may be obtained using platforms dedicated to signal processing.
One possible technique is to use mixed architectures that utilize advanced
modern ICs such as FPGAs (Field Programmable Gate Arrays) and DSPs (Digital
Signal Processors) in one system. The reconfigurable device, which nowadays is
usually a reconfigurable SoC with CPU cores, can be used to integrate such
modules as the frame-grabber, the image conditioning and preprocessing unit, the control unit
and the data flow manager. While the FPGA serves as a concentrator of video data, the
multicore DSP can be responsible for the majority of the processing tasks (Fig. 11.27).

Fig. 11.28 Block diagram of a sample system-on-chip architecture, the DM3730 from Texas Instruments,
which provides sufficient video stream support

These hardware-software co-design techniques have proven to be very efficient in
high-throughput data processing applications. However, this kind of implementation
requires a lot of development work and is relatively expensive. Its advantage, on the other hand,
is that it can be integrated into an embedded system that can be used on a larger scale.
The tradeoff between generic PC systems and expensive dedicated architectures
may be found in System-on-Chip devices that integrate in one chip most of the resources
needed for data management and processing. Such devices are equipped with a
generic processor, such as an ARM core, and with a DSP processor which serves as a
coprocessing unit and can be used extensively for mathematical calculations. These so-called
'media processors' have special hardware acceleration units necessary to cover
the video acquisition and presentation paths. Because of the significant number of hardware
resources, these devices usually work under the control of an operating system
(currently the most common solutions are based on Linux). Since the OS controls the
user applications, this approach does not appear as efficient as the "pure" dedicated
architecture described previously. However, there are techniques
to improve the performance and robustness of such systems using real-time operating
systems: RTOS, SYSBIOS, RT-Linux, etc. In Fig. 11.28 a sample SoC architecture
is presented.
Currently these architectures are still not efficient enough to process the large amount
of data coming from several vision systems, but their processing capabilities are
constantly growing. Thus, they currently serve as platforms for the implementation
of simple biometric applications. There are known implementations of smartphone-based
biometric systems based on the iris. They require an additional camera connected to the
smartphone, as the platform has enough processing power for image processing but
insufficient optics to acquire good quality data.

11.5 Conclusion

A fundamental problem in a non-cooperative biometric authentication system is how
to perform the data acquisition process so as to acquire biometric samples of good quality.
This fundamental problem can be decomposed into the following three issues:

• Vision system: design of a vision system with tracking capability (selection of
lenses, camera, and mechanism for controlling the camera position).
• Software: development of image analysis algorithms supporting the data acquisition
process.
• Hardware: design of a processing platform (or platforms) architecture capable of running
the required image processing algorithms in real time.

All the above issues are closely related to each other, as depicted in Fig. 11.1 at
the beginning of the chapter. The selection of the vision system components was discussed
in Sect. 11.2. The vision system design is the basic and most crucial part of a
biometric system for non-constrained environments. The acquisition system must
deal with a non-cooperating subject that is often positioned off-angle and is usually
on the move. Thus, in this chapter special attention was paid to the tracking capabilities
of the acquisition hardware and, afterwards, to image segmentation techniques. It
was shown that in some scenarios off-the-shelf PTZ cameras can be an excellent
solution for these kinds of biometric systems, even for extremely demanding traits,
e.g., the iris.
Some issues related to image capture itself, namely illumination (artificial or ambient),
reflections and occlusions, which are especially problematic for long-range and outdoor
installations, were also given special consideration in this chapter. The system designer
has little or no influence over most of the mentioned issues.
For this reason a lot of research effort devoted to long-range and outdoor biometric
systems can be expected in the near future, because no system of this kind
with proven quality has been published in the literature to date.
The software issues were discussed in Sect. 11.3, which focused on presenting the
state of the art in human tracking algorithms. Depending on the application, the
complexity and effectiveness of the existing methods vary significantly. A widely
used approach to human tracking, which has proved sufficient in many applications, is to
detect humans in three steps: build a model of the stationary background, detect motion
blobs, and classify each motion blob as either a human or another object. Another method is
to steer the human tracking process by a face detection algorithm, due to the availability
of highly accurate face detectors capable of running in real time. Such an approach requires
imaging humans with a resolution high enough to represent facial features. Its
applicability is usually limited to indoor scenes with controlled illumination where
persons face the camera. The most commonly used face detector for
controlling the human tracking process is the Viola–Jones algorithm, which became
the de facto standard of face detection in practical applications. A separate group
of human tracking methods is formed by modern algorithms dedicated to the analysis of
crowded scenes, where some individuals can partially occlude others. These highly
complex and advanced solutions often introduce a 3D human shape model, whose
advantage is that it can solve the blob merge and split problems by enforcing global
constraints. Although such advanced algorithms exist, the error rates measured on
difficult test datasets are significant. This illustrates how challenging the task of
tracking humans in unconstrained environments is and why human tracking remains an
open and actively developed research area.
The hardware issues were discussed in Sect. 11.4. The keynote of that section
was to give an overview of implementation techniques for systems capable of human tracking,
in two aspects: the implementation of image processing and the implementation of
the control loop for the pan-tilt-zoom functionality. It was pointed out that in some systems
real-time data processing may be necessary in order to merge information from
cooperating vision systems, i.e., Wide Field of View and Narrow Field of View. In
the presented installations, computer and custom architectures were successfully used
for the purposes of biometric systems; however, none of them is ideal across the whole
range of processing algorithms for biometric authentication. As a consequence, in
the near future we can expect biometric solutions realized in System-on-Chip
architectures equipped with several processing cores, each efficient in a different area of
biometric processing: filtering, feature extraction, template matching and database
scanning. The above will certainly be true for those biometric systems that deal
with extremely noisy data (i.e., for unconstrained scenarios), are designed for
large-scale databases (a requirement for high and efficiently distributed processing power)
and must be highly reliable (where the contribution of processing dedicated to image
segmentation and feature extraction is much more significant). Thus, in order to achieve
a rewarding recognition quality for a non-cooperative biometric system, each of the
three aspects covered by this chapter must be further explored.

11.6 Appendix

11.6.1 Background Estimation for Object Tracking


One of the common approaches to building a model of the stationary background
in real-time video processing applications is the adaptive Gaussian mixture model method [33].
This probabilistic technique allows the background to be estimated in a way that correctly adapts to
factors such as changes in lighting conditions, repetitive motions of scene elements,
slowly moving objects, objects appearing and disappearing in the scene, and long-term
scene changes. In this approach, the values of a particular pixel are modeled as a
mixture of Gaussians. The persistence and the variance of each of the Gaussians of
the mixture determine which Gaussians may correspond to background colors. Pixel
values not fitting the background distributions are considered foreground until there
is a Gaussian that includes them with sufficient, consistent evidence supporting it.
The need for using a mixture of Gaussians in a background model arises from real-world
video applications. If each pixel corresponds to a single surface under constant
lighting, a single, constant Gaussian distribution is sufficient to model the pixel value
influenced by random noise. In the case of a single surface under lighting changing over
time, a single but adaptive Gaussian distribution is sufficient. In practice, however, more
advanced modeling is required, especially in outdoor scenes where swaying trees or
a waving water surface may be present.
Let X_1, X_2, \ldots, X_t denote the time series, at time t, of the values of the pixel at coordinates
(x, y), as in (11.11):

\{X_1, X_2, \ldots, X_t\} = \{I(x, y, j) : 1 \leq j \leq t\}    (11.11)

where I is the image from the time sequence. In the case of grayscale images X_t is
a scalar representing the pixel's intensity; in the case of color images X_t is a vector
representing the pixel's color (for example the Red, Green and Blue components if the RGB
color space is used). The recent history of each pixel is modeled by a mixture of
K Gaussian distributions. The probability of observing the current pixel value X_t is
given by (11.12):

P(X_t) = \sum_{i=1}^{K} w_{i,t} \, \eta(X_t, \mu_{i,t}, \Sigma_{i,t})    (11.12)

where K is the number of Gaussian distributions in the mixture, w_{i,t} is the weight
of the i-th Gaussian distribution at time t, computed as the portion of the data which is
accounted for by this distribution, \mu_{i,t} and \Sigma_{i,t} are the mean value and the covariance
matrix of the i-th Gaussian distribution at time t respectively, and \eta is a Gaussian
probability density function given by (11.13):

\eta(X, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)    (11.13)
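For reference, Eq. (11.13) translates directly into a few lines of NumPy; the helper below is a small illustrative sketch, not part of the original text.

import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density eta(X, mu, Sigma) from Eq. (11.13)."""
    x, mu = np.atleast_1d(x).astype(float), np.atleast_1d(mu).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    n = x.size
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

print(gaussian_pdf([100, 110, 95], [102, 108, 97], 25 * np.eye(3)))  # RGB pixel example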

It was shown in Refs. [33, 34] that using from three to five Gaussian distributions
in the mixture is sufficient. In the case of the RGB color space, for computational
reasons, a simplified form of the covariance matrix, given in (11.14), is used. It assumes
that the red, green and blue pixel color components are independent and have the
same variances:

\Sigma_{k,t} = \sigma_k^2 I    (11.14)

Every new pixel value X_t is checked against the K Gaussian distributions in turn
until a matching distribution is found. A match is defined as a pixel value within 2.5
standard deviations of a distribution. If no matching distribution is found, the least
probable distribution is removed and a new distribution is introduced which has a
mean value equal to the current value, an initially high variance and a low prior weight.
After each new observation X_t the weight of each Gaussian distribution is updated as
in (11.15), where \alpha is the learning rate. After the update, the weights are renormalized:

w_{k,t} = \begin{cases} (1 - \alpha)\,w_{k,t-1} + \alpha & \text{if the current observation matches the distribution} \\ (1 - \alpha)\,w_{k,t-1} & \text{otherwise} \end{cases}    (11.15)
The mean and the variance of the unmatched distributions remain the same. For the
matching distribution, these parameters are updated as in Eqs. (11.16) and (11.17) to
incorporate the new observation, where \rho is another learning rate, given by (11.18):

\mu_t = (1 - \rho)\,\mu_{t-1} + \rho X_t    (11.16)

\sigma_t^2 = (1 - \rho)\,\sigma_{t-1}^2 + \rho\,(X_t - \mu_t)^T (X_t - \mu_t)    (11.17)

\rho = \alpha\,\eta(X_t, \mu_k, \sigma_k)    (11.18)

Note that the above equations are given for scalar X_t values; in high-dimensional spaces with
full covariance matrices, it is sometimes advantageous to use a constant learning rate \rho to
simplify computations and provide faster model adaptation.

The final task is to classify the new observation as coming either from the background
or from moving objects. This is achieved in the following way: first, the
Gaussian distributions are ordered by the value of w/\sigma, as we are interested in the
distributions which have the highest weight and the lowest variance. Then, the first B
distributions are chosen as the background model, where B is given by (11.19). In this
equation the threshold T represents the minimum portion of the observations that should be
accounted for by the background:

B = \arg\min_b \left( \sum_{k=1}^{b} w_k > T \right)    (11.19)
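The per-pixel update rules (11.12)–(11.19) can be translated fairly directly into code. The sketch below is a simplified, grayscale, single-pixel illustration written against these equations; the class and parameter names are my own, and a constant learning rate \rho is used as suggested in the note above. Production systems would typically rely on an optimized implementation such as OpenCV's createBackgroundSubtractorMOG2, which follows the related method of Refs. [50, 51].

import numpy as np

class PixelMoG:
    """Simplified Stauffer-Grimson style mixture model for one grayscale pixel."""

    def __init__(self, k=3, alpha=0.01, rho=0.05, t_bg=0.7, init_var=900.0):
        self.w = np.full(k, 1.0 / k)        # mixture weights w_k
        self.mu = np.linspace(0, 255, k)    # means mu_k
        self.var = np.full(k, init_var)     # variances sigma_k^2
        self.alpha, self.rho, self.t_bg = alpha, rho, t_bg

    def update(self, x):
        """Update the mixture with observation x; return True if x is background."""
        d = np.abs(x - self.mu)
        matches = d < 2.5 * np.sqrt(self.var)            # match = within 2.5 std dev
        if matches.any():
            k = np.argmin(np.where(matches, d, np.inf))  # closest matching component
            self.w = (1 - self.alpha) * self.w
            self.w[k] += self.alpha                                              # Eq. (11.15)
            self.mu[k] = (1 - self.rho) * self.mu[k] + self.rho * x              # Eq. (11.16)
            self.var[k] = (1 - self.rho) * self.var[k] + self.rho * (x - self.mu[k]) ** 2  # Eq. (11.17)
        else:
            k = np.argmin(self.w / np.sqrt(self.var))    # replace least probable component
            self.mu[k], self.var[k], self.w[k] = x, 900.0, 0.05
        self.w /= self.w.sum()                           # renormalize weights

        order = np.argsort(-self.w / np.sqrt(self.var))  # sort by w / sigma
        cum = np.cumsum(self.w[order])
        background = order[: np.searchsorted(cum, self.t_bg) + 1]  # Eq. (11.19)
        return k in background

# Feed a noisy but mostly static pixel, then a bright foreground value
pix = PixelMoG()
for value in [20, 22, 19, 21, 20, 23, 200]:
    print(value, pix.update(value))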

References

1. Bashir F, Usher D, Casaverde P, Friedman M (2008) Video surveillance for biometrics: long-
range multi-biometric system. In: IEEE 5th international conference on advanced video and
signal based surveillance 2008 (AVSS ’08), pp 175–182. doi:10.1109/AVSS.2008.28
2. Blake A, Isard M (1998) Active contours: the application of techniques from graphics, vision,
control theory and statistics to visual tracking of shapes in motion, 1st edn. Springer-Verlag,
Secaucus


3. Bodor R, Jackson B, Papanikolopoulos N (2003) Vision-based human tracking and activity recognition. In: Proceedings of the 11th Mediterranean conference on control and automation, vol 1. Citeseer
4. Candamo J, Shreve M, Goldgof D, Sapper D, Kasturi R (2010) Understanding transit scenes:
a survey on human behavior-recognition algorithms. IEEE Trans Intell Transp Syst 11(1):
206–224. doi:10.1109/TITS.2009.2030963
5. Degtyarev N, Seredin O (2010) Comparative testing of face detection algorithms. In: Proceed-
ings of the 4th international conference on image and signal processing (ICISP’10). Springer-
Verlag, Berlin, pp 200–209
6. Dong W, Sun Z, Tan T (2009) A design of iris recognition system at a distance. In:
Chinese conference on pattern recognition 2009 (CCPR 2009), pp 1–5. doi:10.1109/CCPR.
2009.5344030
7. Elgammal A, Davis L (2001) Probabilistic framework for segmenting people under occlusion.
In: Proceedings of 8th IEEE international conference on computer vision 2001 (ICCV 2001),
vol 2, pp 145–152. doi:10.1109/ICCV.2001.937617
8. Forrester JV, Dick AD, McMenamin PG, Roberts F (2009) The eye, basic sciences in practice.
Clinical and experimental optometry 92(1):72–73
9. Gatica-Perez D, Odobez JM, Ba SO, Smith KC, Lathoud G (2004) Tracking people in meetings
with particles. Tech. Rep. Idiap-RR-71-2004, IDIAP, Martigny, Switzerland. In: Proceedings
of international workshop on image analysis for multimedia interactive services, invited paper,
Montreux, April 2005
10. Geronimo D, Lopez A, Sappa A, Graf T (2010) Survey of pedestrian detection for advanced
driver assistance systems. IEEE Trans Pattern Anal Mach Intell 32(7):1239–1258. doi:10.1109/
TPAMI.2009.122
11. Grabowski K (2007) Control and illumination system for eye image acquisition stand. In:
Proceedings of the 9th workshop for candidates for a doctor degree (OWD), pp 223–228
12. Grabowski K, Napieralski A (2011) Hardware architecture optimized for iris recognition. IEEE
Trans Circuits Syst Video Technol 21(9):1293–1303. doi:10.1109/TCSVT.2011.2147150
13. Guo G, Jones M, Beardlsey P (2005) A system for automatic iris capturing. Mitsubishi Electric
Research Laboratories (MERL)
14. Hanna KJ, Mandelbaum R, Mishra D, Paragano V, Wixson LE (1996) A system for non-
intrusive human iris acquisition and identification. In: Proceedings of IAPR workshop on
machine vision applications, pp 200–203
15. Haritaoglu I, Harwood D, Davis L (2000) W4: real-time surveillance of people and their
activities. IEEE Trans Pattern Anal Mach Intell 22(8):809–830. doi:10.1109/34.868683
16. Hartley R, Zisserman A (2003) Multiple view geometry in computer vision, 2nd edn. Cambridge
University Press, New York
17. International Commission on Illumination (2002) Photobiological safety of lamps and lamp
systems. CIE S 009/E.-2002
18. Juefei-Xu F, Savvides M (2012) Unconstrained periocular biometric acquisition and recognition
using cots ptz camera for uncooperative and non-cooperative subjects. In: IEEE workshop
on applications of computer vision (WACV) 2012, pp 201–208. doi:10.1109/WACV.2012.
6163051
19. Kollias N (1995) The spectroscopy of human melanin pigmentation. Valdenmar Publishing
Co., Overland Park
20. Kollias N, Baqer A (1987) Absorption mechanisms of human melanin in the visible, 400–720
nm. J Invest Dermatol 89(4):384–388
21. Kuhn HW (1955) The hungarian method for the assignment problem. Naval Res Logist Q
2(1–2):83–97. doi:10.1002/nav.3800020109
22. Li X, Chen G, Ji Q, Blasch E (2008) A non-cooperative long-range biometric system for
maritime surveillance. In: 19th international conference on pattern recognition 2008 (ICPR
2008), pp 1–4. doi:10.1109/ICPR.2008.4761887
23. Lipton A, Fujiyoshi H, Patil R (1998) Moving target classification and tracking from real-time
video. In: Proceedings of 4th IEEE workshop on applications of computer vision 1998 (WACV
’98), pp 8–14. doi:10.1109/ACV.1998.732851

24. Matey J, Naroditsky O, Hanna K, Kolczynski R, LoIacono D, Mangru S, Tinker M, Zappia T, Zhao W (2006) Iris on the move: acquisition of images for iris recognition in less constrained environments. Proc IEEE 94(11):1936–1947. doi:10.1109/JPROC.2006.884091
25. Narayanswamy R, Johnson G, Silveira PX, Wach H (2005) Extending the imaging volume for
biometric iris recognition. Appl Opt 44(5):701–712
26. Illuminating Engineering Society of North America (2005) Recommended practice for photobi-
ological safety for lamps and lamp systems—general requirements. ANSI/IESNA RP-27.1-05
27. Rosales R, Sclaroff S (1999) 3D trajectory recovery for tracking multiple objects and trajectory
guided recognition of actions. In: IEEE computer society conference on computer vision and
pattern recognition 1999, vol 2, pp 117–123. doi:10.1109/CVPR.1999.784618
28. Sankowski W, Grabowski K, Napieralska M, Zubert M, Napieralski A (2010) Reliable algo-
rithm for iris segmentation in eye image. Image Vis Comput 28(2):231–237. doi:10.1016/j.
imavis.2009.05.014
29. Sankowski W, Grabowski K, Pietek J, Napieralska M, Zubert M (2009) Optimization of
iris image segmentation algorithm for real time applications. In: Proceedings of the 16th
international conference—mixed design of integrated circuits and systems (MIXDES 2009),
pp 671–674
30. Cahn von Seelen U, Camus T, Venetianer P, Zhang G, Salganicoff M, Negin M (1999) Active
vision as an enabling technology for user-friendly iris identification. In: 2nd IEEE workshop
on automatic identification advanced technologies, pp 169–172
31. Siebel NT, Maybank SJ (2002) Fusion of multiple tracking algorithms for robust people track-
ing. In: Proceedings of the 7th European conference on computer vision—part IV (ECCV ’02).
Springer-Verlag, London, pp 373–387
32. Song X, Nevatia R (2004) Combined face–body tracking in indoor environment. In: Proceed-
ings of the 17th international conference on pattern recognition 2004 (ICPR 2004), vol 4,
pp 159–162. doi:10.1109/ICPR.2004.1333728
33. Stauffer C, Grimson W (1999) Adaptive background mixture models for real-time track-
ing. In: IEEE computer society conference on computer vision and pattern recognition 1999,
vol 2, pp 246–252. doi:10.1109/CVPR.1999.784637
34. Stauffer C, Grimson W (2000) Learning patterns of activity using real-time tracking. IEEE
Trans Pattern Anal Mach Intell 22(8):747–757. doi:10.1109/34.868677
35. Information Technology (2005) Iris image data, vol ISO/IEC 19794–6:2005
36. Vadakkepat P, Lim P, De Silva L, Jing L, Ling LL (2008) Multimodal approach to human-face
detection and tracking. IEEE Trans Ind Electron 55(3):1385–1393. doi:10.1109/TIE.2007.
903993
37. Venugopalan S, Prasad U, Harun K, Neblett K, Toomey D, Heyman J, Savvides M (2011) Long
range iris acquisition system for stationary and mobile subjects. In: 2011 international joint
conference on biometrics (IJCB), pp 1–8. doi:10.1109/IJCB.2011.6117484
38. Venugopalan S, Savvides M (2010) Unconstrained iris acquisition and recognition using
cots ptz camera. EURASIP J Adv Signal Process 2010(1):938737. doi:10.1155/2010/938737
http://asp.eurasipjournals.com/content/2010/1/938737
39. Viola P, Jones M, Snow D (2003) Detecting pedestrians using patterns of motion and appear-
ance. In: Proceedings of 9th IEEE international conference on computer vision 2003, vol 2,
pp 734–741. doi:10.1109/ICCV.2003.1238422
40. Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154.
doi:10.1023/B:VISI.0000013087.49260.fb
41. Wang L, Yung N (2012) Three-dimensional model-based human detection in crowded scenes.
IEEE Trans Intell Transp Syst 13(2):691–703. doi:10.1109/TITS.2011.2179536
42. Wheeler F, Perera A, Abramovich G, Yu B, Tu P (2008) Stand-off iris recognition system.
In: 2nd IEEE international conference on biometrics: theory, applications and systems 2008
(BTAS 2008), pp 1–7. doi:10.1109/BTAS.2008.4699381
43. Yang MH, Kriegman D, Ahuja N (2002) Detecting faces in images: a survey. IEEE Trans
Pattern Anal Mach Intell 24(1):34–58. doi:10.1109/34.982883

44. Yilmaz A, Javed O, Shah M (2006) Object tracking: a survey. ACM Comput Surv 38(4):13+.
doi:10.1145/1177352.1177355
45. Yoon S, Jung HG, Suhr JK, Kim J (2007) Non-intrusive iris image capturing system using
light stripe projection and pan-tilt-zoom camera. In: IEEE conference on computer vision and
pattern recognition 2007 (CVPR’07), pp 1–7. doi:10.1109/CVPR.2007.383379
46. Zhang C, Zhang Z (2010) A survey of recent advances in face detection. In: Microsoft Research
Technical, Report MSR-TR-2010-66
47. Zhang Z (1999) Flexible camera calibration by viewing a plane from unknown orienta-
tions. In: Proceedings of the 7th IEEE international conference on computer vision 1999,
vol 1, pp 666–673. doi:10.1109/ICCV.1999.791289
48. Zhao T, Nevatia R (2004) Tracking multiple humans in complex situations. IEEE Trans Pattern
Anal Mach Intell 26(9):1208–1221. doi:10.1109/TPAMI.2004.73
49. Zhao T, Nevatia R, Wu B (2008) Segmentation and tracking of multiple humans in crowded
environments. IEEE Trans Pattern Anal Mach Intell 30(7):1198–1211. doi:10.1109/TPAMI.
2007.70770
50. Zivkovic Z (2004) Improved adaptive Gaussian mixture model for background subtraction. In:
Proceedings of the 17th international conference on pattern recognition 2004 (ICPR 2004),
vol 2, pp 28–31. doi:10.1109/ICPR.2004.1333992
51. Zivkovic Z, Van Der Heijden F (2006) Efficient adaptive density estimation per image pixel
for the task of background subtraction. Pattern Recogn Lett 27(7):773–780
