W. ZHAO
Sarnoff Corporation
R. CHELLAPPA
University of Maryland
P. J. PHILLIPS
National Institute of Standards and Technology
AND
A. ROSENFELD
University of Maryland
As one of the most successful applications of image analysis and understanding, face
recognition has recently received significant attention, especially during the past
several years. At least two reasons account for this trend: the first is the wide range of
commercial and law enforcement applications, and the second is the availability of
feasible technologies after 30 years of research. Even though current machine
recognition systems have reached a certain level of maturity, their success is limited by
the conditions imposed by many real applications. For example, recognition of face
images acquired in an outdoor environment with changes in illumination and/or pose
remains a largely unsolved problem. In other words, current systems are still far away
from the capability of the human perception system.
This paper provides an up-to-date critical survey of still- and video-based face
recognition research. There are two underlying motivations for us to write this survey
paper: the first is to provide an up-to-date review of the existing literature, and the
second is to offer some insights into the studies of machine recognition of faces. To
provide a comprehensive survey, we not only categorize existing recognition techniques
but also present detailed descriptions of representative methods within each category.
In addition, relevant topics such as psychophysical studies, system evaluation, and
issues of illumination and pose variation are covered.
An earlier version of this paper appeared as Face Recognition: A Literature Survey, Technical Report CAR-TR-948, Center for Automation Research, University of Maryland, College Park, MD, 2000.
Authors' addresses: W. Zhao, Vision Technologies Lab, Sarnoff Corporation, Princeton, NJ 08543-5300; email: wzhao@sarnoff.com; R. Chellappa and A. Rosenfeld, Center for Automation Research, University of Maryland, College Park, MD 20742-3275; email: {rama,ar}@cfar.umd.edu; P. J. Phillips, National Institute of Standards and Technology, Gaithersburg, MD 20899; email: jonathon@nist.gov.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2003 ACM 0360-0300/03/1200-0399 $5.00
ACM Computing Surveys, Vol. 35, No. 4, December 2003, pp. 399-458.
Table II. Available Commercial Face Recognition Systems. (Some of these Web sites may have changed or been removed.) [The identification of any company, commercial product, or trade name does not imply endorsement or recommendation by the National Institute of Standards and Technology or any of the authors or their institutions.]

Commercial product            Website
FaceIt from Visionics         http://www.FaceIt.com
Viisage Technology            http://www.viisage.com
FaceVACS from Plettac         http://www.plettac-electronics.com
FaceKey Corp.                 http://www.facekey.com
Cognitec Systems              http://www.cognitec-systems.de
Keyware Technologies          http://www.keywareusa.com/
Passfaces from ID-arts        http://www.id-arts.com/
ImageWare Software            http://www.iwsinc.com/
Eyematic Interfaces Inc.      http://www.eyematic.com/
BioID sensor fusion           http://www.bioid.com
Visionsphere Technologies     http://www.visionspheretech.com/menu.htm
Biometric Systems, Inc.       http://www.biometrica.com/
FaceSnap Recorder             http://www.facesnap.de/htdocs/english/index2.html
SpotIt for face composite     http://spotit.itc.it/SpotIt.html
findings have important consequences for engineers who design algorithms and systems for machine recognition of human faces. Section 2 will present a concise review of these findings.

Barring a few exceptions that use range data [Gordon 1991], the face recognition problem has been formulated as recognizing three-dimensional (3D) objects from two-dimensional (2D) images.[1] Earlier approaches treated it as a 2D pattern recognition problem. As a result, during the early and mid-1970s, typical pattern classification techniques, which use measured attributes of features (e.g., the distances between important points) in faces or face profiles, were used [Bledsoe 1964; Kanade 1973; Kelly 1970]. During the 1980s, work on face recognition remained largely dormant. Since the early 1990s, research interest in FRT has grown significantly. One can attribute this to several reasons: an increase in interest in commercial opportunities; the availability of real-time hardware; and the increasing importance of surveillance-related applications.

[1] There have been recent advances on 3D face recognition in situations where range data acquired through structured light can be matched reliably [Bronstein et al. 2003].

Over the past 15 years, research has focused on how to make face recognition systems fully automatic by tackling problems such as localization of a face in a given image or video clip and extraction of features such as eyes, mouth, etc. Meanwhile, significant advances have been made in the design of classifiers for successful face recognition. Among appearance-based holistic approaches, eigenfaces [Kirby and Sirovich 1990; Turk and Pentland 1991] and Fisherfaces [Belhumeur et al. 1997; Etemad and Chellappa 1997; Zhao et al. 1998] have proved to be effective in experiments with large databases. Feature-based graph matching approaches [Wiskott et al. 1997] have also been quite successful. Compared to holistic approaches, feature-based methods are less sensitive to variations in illumination and viewpoint and to inaccuracy in face localization. However, the feature extraction techniques needed for this type of approach are still not reliable or accurate enough [Cox et al. 1996]. For example, most eye localization techniques assume some geometric and textural models and do not work if the eye is closed. Section 3 will present a review of still-image-based face recognition.

During the past 5 to 8 years, much research has been concentrated on video-based face recognition. The still image problem has several inherent advantages and disadvantages. For applications such as driver's licenses, due to the controlled nature of the image acquisition process, the segmentation problem is rather easy. However, if only a static picture of an airport scene is available, automatic location and segmentation of a face could pose serious challenges to any segmentation algorithm. On the other hand, if a video sequence is available, segmentation of a moving person can be more easily accomplished using motion as a cue. But the small size and low image quality of faces captured from video can significantly increase the difficulty in recognition. Video-based face recognition is reviewed in Section 4.

As we propose new algorithms and build more systems, measuring the performance of new systems and of existing systems becomes very important. Systematic data collection and evaluation of face recognition systems is reviewed in Section 5.

Recognizing a 3D object from its 2D images poses many challenges. The illumination and pose problems are two prominent issues for appearance- or image-based approaches. Many approaches have been proposed to handle these issues, with the majority of them exploring domain knowledge. Details of these approaches are discussed in Section 6.

In 1995, a review paper [Chellappa et al. 1995] gave a thorough survey of FRT at that time. (An earlier survey [Samal and Iyengar 1992] appeared in 1992.) At that time, video-based face recognition was still in a nascent stage. During the past 8 years, face recognition has received increased attention and has advanced
technically. Many commercial systems for still face recognition are now available. Recently, significant research efforts have been focused on video-based face modeling/tracking, recognition, and system integration. New datasets have been created and evaluations of recognition techniques using these databases have been carried out. It is not an overstatement to say that face recognition has become one of the most active applications of pattern recognition, image analysis and understanding.

In this paper we provide a critical review of current developments in face recognition. This paper is organized as follows: in Section 2 we briefly review issues that are relevant from a psychophysical point of view. Section 3 provides a detailed review of recent developments in face recognition techniques using still images. In Section 4 face recognition techniques based on video are reviewed. Data collection and performance evaluation of face recognition algorithms are addressed in Section 5 with descriptions of representative protocols. In Section 6 we discuss two important problems in face recognition that can be mathematically studied, lack of robustness to illumination and pose variations, and we review proposed methods of overcoming these limitations. Finally, a summary and conclusions are presented in Section 7.

2. PSYCHOPHYSICS/NEUROSCIENCE ISSUES RELEVANT TO FACE RECOGNITION

Human recognition processes utilize a broad spectrum of stimuli, obtained from many, if not all, of the senses (visual, auditory, olfactory, tactile, etc.). In many situations, contextual knowledge is also applied; for example, surroundings play an important role in recognizing faces in relation to where they are supposed to be located. It is futile to even attempt to develop a system, using existing technology, that will mimic the remarkable face recognition ability of humans. However, the human brain has its limitations in the total number of persons that it can accurately remember. A key advantage of a computer system is its capacity to handle large numbers of face images. In most applications the images are available only in the form of single or multiple views of 2D intensity data, so that the inputs to computer face recognition algorithms are visual only. For this reason, the literature reviewed in this section is restricted to studies of human visual perception of faces.

Many studies in psychology and neuroscience have direct relevance to engineers interested in designing algorithms or systems for machine recognition of faces. For example, findings in psychology [Bruce 1988; Shepherd et al. 1981] about the relative importance of different facial features have been noted in the engineering literature [Etemad and Chellappa 1997]. On the other hand, machine systems provide tools for conducting studies in psychology and neuroscience [Hancock et al. 1998; Kalocsai et al. 1998]. For example, a possible engineering explanation of the bottom lighting effects studied in Johnston et al. [1992] is as follows: when the actual lighting direction is opposite to the usually assumed direction, a shape-from-shading algorithm recovers incorrect structural information and hence makes recognition of faces harder.

A detailed review of relevant studies in psychophysics and neuroscience is beyond the scope of this paper. We only summarize findings that are potentially relevant to the design of face recognition systems. For details the reader is referred to the papers cited below. Issues that are of potential interest to designers are:[2]

[2] Readers should be aware of the existence of diverse opinions on some of these issues. The opinions given here do not necessarily represent our views.

Is face recognition a dedicated process? [Biederman and Kalocsai 1998; Ellis 1986; Gauthier et al. 1999; Gauthier and Logothetis 2000]: It is traditionally believed that face recognition is a dedicated process different from other object recognition tasks. Evidence for the existence of a dedicated face processing system comes from several sources [Ellis 1986]. (a) Faces are more easily remembered by humans than other
objects when presented in an upright orientation. (b) Prosopagnosia patients are unable to recognize previously familiar faces, but usually have no other profound agnosia. They recognize people by their voices, hair color, dress, etc. It should be noted that prosopagnosia patients recognize whether a given object is a face or not, but then have difficulty in identifying the face. Seven differences between face recognition and object recognition can be summarized [Biederman and Kalocsai 1998] based on empirical evidence: (1) configural effects (related to the choice of different types of machine recognition systems), (2) expertise, (3) differences verbalizable, (4) sensitivity to contrast polarity and illumination direction (related to the illumination problem in machine recognition systems), (5) metric variation, (6) rotation in depth (related to the pose variation problem in machine recognition systems), and (7) rotation in plane/inverted face. Contrary to the traditionally held belief, some recent findings in human neuropsychology and neuroimaging suggest that face recognition may not be unique. According to [Gauthier and Logothetis 2000], recent neuroimaging studies in humans indicate that level of categorization and expertise interact to produce the specification for faces in the middle fusiform gyrus.[3] Hence it is possible that the encoding scheme used for faces may also be employed for other classes with similar properties. (On recognition of familiar vs. unfamiliar faces see Section 7.)

[3] The fusiform gyrus or occipitotemporal gyrus, located on the ventromedial surface of the temporal and occipital lobes, is thought to be critical for face recognition.

Is face perception the result of holistic or feature analysis? [Bruce 1988; Bruce et al. 1998]: Both holistic and feature information are crucial for the perception and recognition of faces. Studies suggest the possibility of global descriptions serving as a front end for finer, feature-based perception. If dominant features are present, holistic descriptions may not be used. For example, in face recall studies, humans quickly focus on odd features such as big ears, a crooked nose, a staring eye, etc. One of the strongest pieces of evidence to support the view that face recognition involves more configural/holistic processing than other object recognition has been the face inversion effect, in which an inverted face is much harder to recognize than a normal face (first demonstrated in [Yin 1969]). An excellent example is given in [Bartlett and Searcy 1993] using the Thatcher illusion [Thompson 1980]. In this illusion, the eyes and mouth of an expressing face are excised and inverted, and the result looks grotesque in an upright face; however, when shown inverted, the face looks fairly normal in appearance, and the inversion of the internal features is not readily noticed.

Ranking of significance of facial features [Bruce 1988; Shepherd et al. 1981]: Hair, face outline, eyes, and mouth (not necessarily in this order) have been determined to be important for perceiving and remembering faces [Shepherd et al. 1981]. Several studies have shown that the nose plays an insignificant role; this may be due to the fact that almost all of these studies have been done using frontal images. In face recognition using profiles (which may be important in mugshot matching applications, where profiles can be extracted from side views), a distinctive nose shape could be more important than the eyes or mouth [Bruce 1988]. Another outcome of some studies is that both external and internal features are important in the recognition of previously presented but otherwise unfamiliar faces, but internal features are more dominant in the recognition of familiar faces. It has also been found that the upper part of the face is more useful for face recognition than the lower part [Shepherd et al. 1981]. The role of aesthetic attributes such as beauty, attractiveness, and/or pleasantness has also been studied, with the conclusion that
the more attractive the faces are, the better is their recognition rate; the least attractive faces come next, followed by the midrange faces, in terms of ease of being recognized.

Caricatures [Brennan 1985; Bruce 1988; Perkins 1975]: A caricature can be formally defined [Perkins 1975] as a symbol that exaggerates measurements relative to any measure which varies from one person to another. Thus the length of a nose is a measure that varies from person to person, and could be useful as a symbol in caricaturing someone, but not the number of ears. A standard caricature algorithm [Brennan 1985] can be applied to different qualities of image data (line drawings and photographs). Caricatures of line drawings do not contain as much information as photographs, but they manage to capture the important characteristics of a face; experiments based on nonordinary faces comparing the usefulness of line-drawing caricatures and unexaggerated line drawings decidedly favor the former [Bruce 1988].

Distinctiveness [Bruce et al. 1994]: Studies show that distinctive faces are better retained in memory and are recognized better and faster than typical faces. However, if a decision has to be made as to whether an object is a face or not, it takes longer to recognize an atypical face than a typical face. This may be explained by different mechanisms being used for detection and for identification.

The role of spatial frequency analysis [Ginsburg 1978; Harmon 1973; Sergent 1986]: Earlier studies [Ginsburg 1978; Harmon 1973] concluded that information in low spatial frequency bands plays a dominant role in face recognition. Recent studies [Sergent 1986] have shown that, depending on the specific recognition task, the low, band-pass, and high-frequency components may play different roles. For example, gender classification can be successfully accomplished using low-frequency components only, while identification requires the use of high-frequency components [Sergent 1986]. Low-frequency components contribute to global description, while high-frequency components contribute to the finer details needed in identification.

Viewpoint-invariant recognition? [Biederman 1987; Hill et al. 1997; Tarr and Bülthoff 1995]: Much work in visual object recognition (e.g., [Biederman 1987]) has been cast within a theoretical framework introduced in [Marr 1982] in which different views of objects are analyzed in a way which allows access to (largely) viewpoint-invariant descriptions. Recently, there has been some debate about whether object recognition is viewpoint-invariant or not [Tarr and Bülthoff 1995]. Some experiments suggest that memory for faces is highly viewpoint-dependent. Generalization even from one profile viewpoint to another is poor, though generalization from one three-quarter view to the other is very good [Hill et al. 1997].

Effect of lighting change [Bruce et al. 1998; Hill and Bruce 1996; Johnston et al. 1992]: It has long been informally observed that photographic negatives of faces are difficult to recognize. However, relatively little work has explored why it is so difficult to recognize negative images of faces. In [Johnston et al. 1992], experiments were conducted to explore whether difficulties with negative images and inverted images of faces arise because each of these manipulations reverses the apparent direction of lighting, rendering a top-lit image of a face apparently lit from below. It was demonstrated in [Johnston et al. 1992] that bottom lighting does indeed make it harder to identify familiar faces. In [Hill and Bruce 1996], the importance of top lighting for face recognition was demonstrated using a different task: matching surface images of faces to determine whether they were identical.

Movement and face recognition [O'Toole et al. 2002; Bruce et al. 1998; Knight and Johnston 1997]: A recent study [Knight
and Johnston 1997] showed that famous faces are easier to recognize when shown in moving sequences than in still photographs. This observation has been extended to show that movement helps in the recognition of familiar faces shown under a range of different types of degradations (negated, inverted, or thresholded) [Bruce et al. 1998]. Even more interesting is the observation that there seems to be a benefit due to movement even if the information content is equated in the moving and static comparison conditions. However, experiments with unfamiliar faces suggest no additional benefit from viewing animated rather than static sequences.

Facial expressions [Bruce 1988]: Based on neurophysiological studies, it seems that analysis of facial expressions is accomplished in parallel to face recognition. Some prosopagnosic patients, who have difficulties in identifying familiar faces, nevertheless seem to recognize expressions due to emotions. Patients who suffer from organic brain syndrome suffer from poor expression analysis but perform face recognition quite well.[4] Similarly, separation of face recognition and focused visual processing tasks (e.g., looking for someone with a thick mustache) has been claimed.

[4] From a machine recognition point of view, dramatic facial expressions may affect face recognition performance if only one photograph is available.

3. FACE RECOGNITION FROM STILL IMAGES

As illustrated in Figure 1, the problem of automatic face recognition involves three key steps/subtasks: (1) detection and rough normalization of faces, (2) feature extraction and accurate normalization of faces, and (3) identification and/or verification. Sometimes, different subtasks are not totally separated. For example, the facial features (eyes, nose, mouth) used for face recognition are often used in face detection. Face detection and feature extraction can be achieved simultaneously, as indicated in Figure 1. Depending on the nature of the application, for example, the sizes of the training and testing databases, clutter and variability of the background, noise, occlusion, and speed requirements, some of the subtasks can be very challenging. Though fully automatic face recognition systems must perform all three subtasks, research on each subtask is critical. This is not only because the techniques used for the individual subtasks need to be improved, but also because they are critical in many different applications (Figure 1). For example, face detection is needed to initialize face tracking, and extraction of facial features is needed for recognizing human emotion, which is in turn essential in human-computer interaction (HCI) systems. Isolating the subtasks makes it easier to assess and advance the state of the art of the component techniques. Earlier face detection techniques could only handle single or a few well-separated frontal faces in images with simple backgrounds, while state-of-the-art algorithms can detect faces and their poses in cluttered backgrounds [Gu et al. 2001; Heisele et al. 2001; Schneiderman and Kanade 2000; Viola and Jones 2001]. Extensive research on the subtasks has been carried out, and relevant surveys have appeared on, for example, the subtask of face detection [Hjelmas and Low 2001; Yang et al. 2002].

In this section we survey the state of the art of face recognition in the engineering literature. For the sake of completeness, in Section 3.1 we provide a highlighted summary of research on face segmentation/detection and feature extraction. Section 3.2 contains detailed reviews of recent work on intensity image-based face recognition and categorizes methods of recognition from intensity images. Section 3.3 summarizes the status of face recognition and discusses open research issues.

3.1. Key Steps Prior to Recognition: Face Detection and Feature Extraction

The first step in any automatic face recognition system is the detection of faces in images. Here we only provide a summary on this topic and highlight a few
very recent methods. After a face has been detected, the task of feature extraction is to obtain features that are fed into a face classification system. Depending on the type of classification system, features can be local features such as lines or fiducial points, or facial features such as eyes, nose, and mouth. Face detection may also employ features, in which case features are extracted simultaneously with face detection. Feature extraction is also a key to animation and recognition of facial expressions.

Without considering feature locations, face detection is declared successful if the presence and rough location of a face have been correctly identified. However, without accurate face and feature location, noticeable degradation in recognition performance is observed [Martinez 2002; Zhao 1999]. The close relationship between feature extraction and face recognition motivates us to review a few feature extraction methods that are used in the recognition approaches to be reviewed in Section 3.2. Hence, this section also serves as an introduction to the next section.

3.1.1. Segmentation/Detection: Summary. Up to the mid-1990s, most work on segmentation was focused on single-face segmentation from a simple or complex background. These approaches included using a whole-face template, a deformable feature-based template, skin color, and a neural network.

Significant advances have been made in recent years in achieving automatic face detection under various conditions. Compared to feature-based methods and template-matching methods, appearance- or image-based methods [Rowley et al. 1998; Sung and Poggio 1997] that train machine systems on large numbers of samples have achieved the best results. This may not be surprising, since face objects are complicated, very similar to each other, and different from nonface objects. Through extensive training, computers can be quite good at detecting faces.

More recently, detection of faces under rotation in depth has been studied. One approach is based on training on multiple-view samples [Gu et al. 2001; Schneiderman and Kanade 2000]. Compared to invariant-feature-based methods [Wiskott et al. 1997], multiview-based methods of face detection and recognition seem to be able to achieve better results when the angle of out-of-plane rotation is large (35°). In the psychology community, a similar debate exists on whether face recognition is viewpoint-invariant or not. Studies in both disciplines seem to support the idea that for small angles, face perception is view-independent, while for large angles, it is view-dependent.

In a detection problem, two statistics are important: true positives (also referred to as detection rate) and false positives (reported detections in nonface regions). An ideal system would have very high true positive and very low false positive rates. In practice, these two requirements are conflicting. Treating face detection as a two-class classification problem helps to reduce false positives dramatically [Rowley et al. 1998; Sung and Poggio 1997] while maintaining true positives. This is achieved by retraining systems with false-positive samples that are generated by previously trained systems.

3.1.2. Feature Extraction: Summary and Methods

3.1.2.1. Summary. The importance of facial features for face recognition cannot be overstated. Many face recognition systems need facial features in addition to the holistic face, as suggested by studies in psychology. It is well known that even holistic matching methods, for example, eigenfaces [Turk and Pentland 1991] and Fisherfaces [Belhumeur et al. 1997], need accurate locations of key facial features such as eyes, nose, and mouth to normalize the detected face [Martinez 2002; Yang et al. 2002].

Three types of feature extraction methods can be distinguished: (1) generic methods based on edges, lines, and curves; (2) feature-template-based methods that are used to detect facial features such as eyes; and (3) structural matching methods
that take into consideration geometrical constraints on the features. Early approaches focused on individual features; for example, a template-based approach was described in [Hallinan 1991] to detect and recognize the human eye in a frontal face. These methods have difficulty when the appearances of the features change significantly, for example, closed eyes, eyes with glasses, or an open mouth. To detect the features more reliably, recent approaches have used structural matching methods, for example, the Active Shape Model [Cootes et al. 1995]. Compared to earlier methods, these recent statistical methods are much more robust in terms of handling variations in image intensity and feature shape.

An even more challenging situation for feature extraction is feature restoration, which tries to recover features that are invisible due to large variations in head pose. The best solution here might be to hallucinate the missing features, either by using the bilateral symmetry of the face or by using learned information. For example, a view-based statistical method claims to be able to handle even profile views in which many local features are invisible [Cootes et al. 2000].

3.1.2.2. Methods. A template-based approach to detecting the eyes and mouth in real images was presented in [Yuille et al. 1992]. This method is based on matching a predefined parameterized template to an image that contains a face region. Two templates are used for matching the eyes and mouth respectively. An energy function is defined that links edges, peaks, and valleys in the image intensity to the corresponding properties in the template, and this energy function is minimized by iteratively changing the parameters of the template to fit the image. Compared to this model, which is manually designed, the statistical shape model (Active Shape Model, ASM) proposed in [Cootes et al. 1995] offers more flexibility and robustness. The advantages of using the so-called analysis-through-synthesis approach come from the fact that the solution is constrained by a flexible statistical model. To account for texture variation, the ASM has been expanded to statistical appearance models, including the Flexible Appearance Model (FAM) [Lanitis et al. 1995] and the Active Appearance Model (AAM) [Cootes et al. 2001]. In [Cootes et al. 2001], the proposed AAM combined a model of shape variation (i.e., ASM) with a model of the appearance variation of shape-normalized (shape-free) textures. A training set of 400 images of faces, each manually labeled with 68 landmark points, and approximately 10,000 intensity values sampled from facial regions were used. The shape model (mean shape, orthogonal mapping matrix Ps, and projection vector bs) is generated by representing each set of landmarks as a vector and applying principal-component analysis (PCA) to the data. Then, after each sample image is warped so that its landmarks match the mean shape, texture information can be sampled from this shape-free face patch. Applying PCA to this data leads to a shape-free texture model (mean texture, Pg, and bg). To explore the correlation between the shape and texture variations, a third PCA is applied to the concatenated vectors (bs and bg) to obtain the combined model, in which one vector c of appearance parameters controls both the shape and texture of the model. To match a given image and the model, an optimal vector of parameters (displacement parameters between the face region and the model, parameters for linear intensity adjustment, and the appearance parameters c) is searched for by minimizing the difference between the synthetic image and the given one. After matching, a best-fitting model is constructed that gives the locations of all the facial features and can be used to reconstruct the original images. Figure 2 illustrates the optimization/search procedure for fitting the model to the image. To speed up the search procedure, an efficient method is proposed that exploits the similarities among optimizations. This allows the direct method to find and apply directions of rapid convergence which are learned off-line.
Fig. 2. Multiresolution search from a displaced position using a face model. (Courtesy of T. Cootes,
K. Walker, and C. Taylor.)
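The analysis-through-synthesis fitting loop described above, with update directions learned off-line, can be sketched for a purely linear toy model (all names, dimensions, and values here are illustrative, not taken from Cootes et al.; for a linear model the off-line-learned update matrix reduces to a pseudo-inverse):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear appearance model: a face patch is synthesized as
# mean + basis @ c, where c is the vector of appearance parameters.
n_pixels, n_params = 100, 5
mean = rng.normal(size=n_pixels)
basis, _ = np.linalg.qr(rng.normal(size=(n_pixels, n_params)))  # orthonormal modes

def synthesize(c):
    return mean + basis @ c

# "Image" to be matched: generated from known parameters plus noise.
c_true = rng.normal(size=n_params)
image = synthesize(c_true) + 0.01 * rng.normal(size=n_pixels)

# Off-line-learned update matrix: for this linear model, the optimal mapping
# from residual to parameter correction is just the pseudo-inverse of the basis.
R = np.linalg.pinv(basis)

# Analysis-through-synthesis loop: synthesize, compare, correct.
c = np.zeros(n_params)
for _ in range(10):
    residual = image - synthesize(c)
    c = c + R @ residual

print(np.allclose(c, c_true, atol=0.05))  # True: parameters recovered
```

In the real model the relationship between image residuals and parameter updates is only locally linear, so the update directions are learned off-line by perturbing training examples and regressing the parameter displacements, roughly as the survey describes.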
3.2. Recognition from Intensity Images

Many methods of face recognition have been proposed during the past 30 years. Face recognition is such a challenging yet interesting problem that it has attracted researchers from different backgrounds: psychology, pattern recognition, neural networks, computer vision, and computer graphics. It is due to this fact that the literature on face recognition is vast and diverse. Often, a single system involves techniques motivated by different principles. This mixture of techniques makes it difficult to classify these systems based purely on what types of techniques they use for feature representation or classification. To obtain a clear, high-level categorization, we instead follow a guideline suggested by the psychological study of how humans use holistic and local features. Specifically, we have the following categorization:

(1) Holistic matching methods. These methods use the whole face region as the raw input to a recognition system. One of the most widely used representations of the face region is eigenpictures [Kirby and Sirovich 1990; Sirovich and Kirby 1987], which are based on principal component analysis.

(2) Feature-based (structural) matching methods. Typically, in these methods, local features such as the eyes, nose, and mouth are first extracted, and their locations and local statistics (geometric and/or appearance) are fed into a structural classifier.

(3) Hybrid methods. Just as the human perception system uses both local features and the whole face region to recognize a face, a machine recognition system should use both. One can argue that these methods could potentially offer the best of the two types of methods.

Within each of these categories, further classification is possible (Table III). Using principal-component analysis (PCA), many face recognition techniques have been developed: eigenfaces [Turk and Pentland 1991], which use a nearest-neighbor classifier; feature-line-based methods, which replace the point-to-point distance with the distance between a point and the feature line linking two stored sample points [Li and Lu 1999]; Fisherfaces [Belhumeur et al. 1997; Liu and Wechsler 2001; Swets and Weng 1996b; Zhao et al. 1998], which use linear/Fisher discriminant analysis (FLD/LDA) [Fisher 1938]; Bayesian methods, which use a probabilistic distance metric [Moghaddam and Pentland 1997]; and SVM methods, which use a support vector machine as the classifier [Phillips 1998]. Utilizing higher-order statistics, independent-component
analysis (ICA) is argued to have more representative power than PCA, and hence may provide better recognition performance than PCA [Bartlett et al. 1998]. Being able to offer potentially greater generalization through learning, neural networks/learning methods have also been applied to face recognition. One example is the Probabilistic Decision-Based Neural Network (PDBNN) method [Lin et al. 1997], and another is the evolutionary pursuit (EP) method [Liu and Wechsler 2000a].

Most earlier methods belong to the category of structural matching methods, using the width of the head, the distances between the eyes and from the eyes to the mouth, etc. [Kelly 1970], or the distances and angles between eye corners, mouth extrema, nostrils, and chin top [Kanade 1973]. More recently, a mixture-distance based approach using manually extracted distances was reported [Cox et al. 1996]. Without finding the exact locations of facial features, Hidden Markov Model (HMM)-based methods use strips of pixels that cover the forehead, eye, nose, mouth, and chin [Nefian and Hayes 1998; Samaria 1994; Samaria and Young 1994]. Nefian and Hayes [1998] reported better performance than Samaria [1994] by using the KL projection coefficients instead of the strips of raw pixels. One of the most successful systems in this category is the graph matching system [Okada et al. 1998; Wiskott et al. 1997], which is based on the Dynamic Link Architecture (DLA) [Buhmann et al. 1990; Lades et al. 1993]. Using an unsupervised learning method based on a self-organizing map (SOM), a system based on a convolutional neural network (CNN) has been developed [Lawrence et al. 1997].

In the hybrid method category, we will briefly review the modular eigenface method [Pentland et al. 1994], a hybrid representation based on PCA and local feature analysis (LFA) [Penev and Atick 1996], a flexible appearance model-based method [Lanitis et al. 1995], and a recent development [Huang et al. 2003] along this direction. In [Pentland et al. 1994],
the use of hybrid features obtained by combining eigenfaces and other eigenmodules (eigeneyes, eigenmouth, and eigennose) is explored. Though experiments show only slight improvements over holistic eigenfaces or eigenmodules based on structural matching, we believe that these types of methods are important and deserve further investigation. Perhaps many relevant problems need to be solved before fruitful results can be expected, for example, how to optimally arbitrate the use of holistic and local features.

Many types of systems have been successfully applied to the task of face recognition, but they all have some advantages and disadvantages. Appropriate schemes should be chosen based on the specific requirements of a given task. Most of the systems reviewed here focus on the subtask of recognition, but others also include automatic face detection and feature extraction, making them fully automatic systems [Lin et al. 1997; Moghaddam and Pentland 1997; Wiskott et al. 1997].

3.2.1. Holistic Approaches

3.2.1.1. Principal-Component Analysis. Starting from the successful low-dimensional reconstruction of faces using KL or PCA projections [Kirby and Sirovich 1990; Sirovich and Kirby 1987], eigenpictures have been one of the major driving forces behind face representation, detection, and recognition. It is well known that there exist significant statistical redundancies in natural images [Ruderman 1994]. For a limited class of objects such as face images that are normalized with respect to scale, translation, and rotation, the redundancy is even greater [Penev and Atick 1996; Zhao 1999]. One of the best global compact representations is KL/PCA, which decorrelates the outputs. More specifically, sample vectors x can be expressed as linear combinations of the orthogonal basis vectors Φ_i:

x = Σ_{i=1}^{n} a_i Φ_i ≈ Σ_{i=1}^{m} a_i Φ_i

(typically m ≪ n), where the Φ_i are obtained by solving the eigenproblem

C Φ = Φ Λ,   (1)

where C is the covariance matrix of the input x, Φ is the matrix of eigenvectors, and Λ is the diagonal matrix of eigenvalues.

An advantage of using such representations is their reduced sensitivity to noise. Some of this noise may be due to small occlusions, as long as the topological structure does not change. For example, good performance under blurring, partial occlusion, and changes in background has been demonstrated in many eigenpicture-based systems, as illustrated in Figure 3. This should not come as a surprise, since the PCA reconstructed images are much better than the original distorted images in terms of their global appearance (Figure 4).

For a better approximation of face images outside the training set, using an extended training set that adds mirror-imaged faces was shown to achieve lower approximation error [Kirby and Sirovich 1990]. Using such an extended training set, the eigenpictures are either symmetric or antisymmetric, with the most leading eigenpictures typically being symmetric.
Fig. 4. Reconstructed images using 300 PCA projection coefficients for electronically modi-
fied images (Figure 3). (From Zhao [1999].)
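As a concrete illustration, the eigenproblem of Eq. (1) and the resulting eigenpicture representation can be sketched in NumPy (a toy example with random vectors standing in for normalized face images; the dimensions, subject count, and the choice m = 10 are illustrative). The same weights support the nearest-neighbor matching used by eigenfaces:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy gallery: 20 "faces" (flattened, normalized images), one per subject.
n_pixels, n_subjects = 64, 20
X = rng.normal(size=(n_subjects, n_pixels))

# Solve the eigenproblem C Phi = Phi Lambda for the data covariance (Eq. (1)).
mean = X.mean(axis=0)
C = np.cov(X - mean, rowvar=False)
eigvals, Phi = np.linalg.eigh(C)     # eigenvalues in ascending order
Phi = Phi[:, ::-1][:, :10]           # keep the m = 10 leading eigenpictures

# Each face is represented by its vector of projection weights a_i.
weights = (X - mean) @ Phi

def identify(probe):
    """Nearest-neighbor identification in the eigenpicture weight space."""
    w = (probe - mean) @ Phi
    return int(np.argmin(np.linalg.norm(weights - w, axis=1)))

# A noisy new image of subject 7 should be matched back to subject 7.
probe = X[7] + 0.1 * rng.normal(size=n_pixels)
print(identify(probe))
```

Real systems compute the eigenvectors from the much smaller Gram matrix when the number of training images is far below the pixel count, but the representation is the same.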
The first really successful demonstration of machine recognition of faces was made in [Turk and Pentland 1991] using eigenpictures (also known as eigenfaces) for face detection and identification. Given the eigenfaces, every face in the database can be represented as a vector of weights; the weights are obtained by projecting the image onto the eigenface components through a simple inner product operation. When a new test image whose identification is required is given, it too is represented by its vector of weights. The test image is identified by locating the image in the database whose weights are closest to the weights of the test image. Using the observation that the projections of a face image and a nonface image onto the face space are usually different, a method of detecting the presence of a face in a given image is obtained. The method was demonstrated using a database of 2,500 face images of 16 subjects, in all combinations of three head orientations, three head sizes, and three lighting conditions.

Using a probabilistic measure of similarity instead of the simple Euclidean distance used with eigenfaces [Turk and Pentland 1991], the standard eigenface approach was extended [Moghaddam and Pentland 1997] to a Bayesian approach. Practically, the major drawback of a Bayesian method is the need to estimate probability distributions in a high-dimensional space from very limited numbers of training samples per class. To avoid this problem, a much simpler two-class problem was created from the multiclass problem by using a similarity measure based on a Bayesian analysis of image differences. Two mutually exclusive classes were defined: Ω_I, representing intrapersonal variations between multiple images of the same individual, and Ω_E, representing extrapersonal variations due to differences in identity. Assuming that both classes are Gaussian-distributed, likelihood functions P(Δ|Ω_I) and P(Δ|Ω_E) were estimated for a given intensity difference Δ = I_1 − I_2. Given these likelihood functions and using the MAP rule, two face images are determined to belong to the same individual if P(Δ|Ω_I) > P(Δ|Ω_E). A large performance improvement of this probabilistic matching technique over standard nearest-neighbor eigenspace matching was reported using large face datasets including the FERET database [Phillips et al. 2000]. In Moghaddam and Pentland [1997], an efficient technique of probability density estimation was proposed by decomposing the input space into two mutually exclusive subspaces: the principal subspace F and its orthogonal complement F̄ (a similar idea was explored in Sung and Poggio [1997]). Covariances are estimated only in the principal subspace, for use in the Mahalanobis distance [Fukunaga 1989]. Experimental results have been reported using different subspace dimensionalities M_I and M_E for Ω_I and Ω_E. For example, M_I = 10 and M_E = 30 were used for internal tests, while M_I = M_E = 125 were used for the FERET test. In Figure 5, the so-called dual eigenfaces, separately trained on samples from Ω_I and Ω_E, are plotted along with the standard eigenfaces. While the extrapersonal
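The intrapersonal/extrapersonal MAP rule described above can be sketched minimally as follows, assuming zero-mean Gaussians with diagonal covariance fitted to synthetic difference vectors (a simplification; as discussed above, the actual method estimates the densities in principal subspaces):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30  # dimensionality of the (projected) difference image Delta

# Synthetic training differences: intrapersonal differences (same person)
# are small, extrapersonal differences (different people) are large.
intra = 0.3 * rng.normal(size=(500, d))   # samples of Delta from Omega_I
extra = 1.0 * rng.normal(size=(500, d))   # samples of Delta from Omega_E

# Zero-mean Gaussian likelihoods with diagonal covariance.
var_i = intra.var(axis=0)
var_e = extra.var(axis=0)

def log_likelihood(delta, var):
    return -0.5 * np.sum(delta**2 / var + np.log(2.0 * np.pi * var))

def same_person(i1, i2):
    """MAP rule (equal priors): same iff P(Delta|Omega_I) > P(Delta|Omega_E)."""
    delta = i1 - i2
    return log_likelihood(delta, var_i) > log_likelihood(delta, var_e)

face = rng.normal(size=d)
print(same_person(face, face + 0.3 * rng.normal(size=d)))  # small difference: True
print(same_person(face, rng.normal(size=d)))               # new identity: False
```

Note that matching reduces to a two-class decision on the difference vector, which is what makes the Bayesian reformulation tractable with few samples per identity.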
Fig. 7. Two architectures for performing ICA on images. Left: architecture for
finding statistically independent basis images. Performing source separation on
the face images produces independent images in the rows of U . Right: architecture
for finding a factorial code. Performing source separation on the pixels produces a
factorial code in the columns of the output matrix U [Bartlett et al. 1998]. (Courtesy
of M. Bartlett, H. Lades, and T. Sejnowski.)
Fig. 8. Comparison of basis images using two architectures for performing ICA: (a) 25 indepen-
dent components of Architecture I, (b) 25 independent components of Architecture II [Bartlett
et al. 1998]. (Courtesy of M. Bartlett, H. Lades, and T. Sejnowski.)
Fig. 10. The bunch graph representation of faces used in elastic graph matching [Wiskott et al.
1997]. (Courtesy of L. Wiskott, J.-M. Fellous, and C. von der Malsburg.)
Kelly 1970] as well as 1D [Samaria and Young 1994] and pseudo-2D [Samaria 1994] HMM methods. One of the most successful of these systems is the Elastic Bunch Graph Matching (EBGM) system [Okada et al. 1998; Wiskott et al. 1997], which is based on DLA [Buhmann et al. 1990; Lades et al. 1993]. Wavelets, especially Gabor wavelets, play a building-block role for facial representation in these graph matching methods. A typical local feature representation consists of wavelet coefficients for different scales and rotations based on fixed wavelet bases (called jets in Okada et al. [1998]). These locally estimated wavelet coefficients are robust to illumination change, translation, distortion, rotation, and scaling. The basic 2D Gabor function and its Fourier transform are

g(x, y : u_0, v_0) = exp(−(x²/2σ_x² + y²/2σ_y²) + 2πi[u_0 x + v_0 y]),
G(u, v) = exp(−2π²(σ_x²(u − u_0)² + σ_y²(v − v_0)²)),   (5)

where σ_x and σ_y represent the spatial widths of the Gaussian and (u_0, v_0) is the frequency of the complex sinusoid.

DLAs attempt to solve some of the conceptual problems of conventional artificial neural networks, the most prominent of these being the representation of syntactical relationships in neural networks. DLAs use synaptic plasticity and are able to form sets of neurons grouped into structured graphs while maintaining the advantages of neural systems. Both Buhmann et al. [1990] and Lades et al. [1993] used Gabor-based wavelets (Figure 10(a)) as the features. As described in Lades et al. [1993], the DLA's basic mechanism, in addition to the connection parameter T_ij between two neurons (i, j), is a dynamic variable J_ij. Only the J-variables play the role of synaptic weights for signal transmission. The T-parameters merely act to constrain the J-variables, for example, 0 ≤ J_ij ≤ T_ij. The T-parameters can be changed slowly by long-term synaptic plasticity. The weights J_ij are subject to rapid modification and are controlled by the signal correlations between neurons i and j: negative signal correlations lead to a decrease, and positive signal correlations to an increase, in J_ij. In the absence of any correlation, J_ij slowly returns to a resting state, a fixed fraction of T_ij. Each stored image is formed by picking a rectangular grid of points as graph nodes. The grid is appropriately positioned over the image and is stored with each grid point's locally determined jet (Figure 10(a)), and serves to represent the pattern classes. Recognition of a new image takes place by transforming the image into the grid of jets and matching all stored model graphs to the image. Conformation of the DLA is done by establishing and dynamically modifying links between vertices in the model domain.

The DLA architecture was recently extended to Elastic Bunch Graph Matching [Wiskott et al. 1997] (Figure 10). This is similar to the graph described above, but instead of attaching only a single jet to each node, the authors attached a set
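A direct implementation of the Gabor kernel of Eq. (5), together with a toy "jet" of magnitude responses over several scales and orientations, might look as follows (the frequency, orientation, and width choices are illustrative; EBGM systems use specific wavelet families and also exploit phase information):

```python
import numpy as np

def gabor(size, sigma_x, sigma_y, u0, v0):
    """2D Gabor kernel g(x, y : u0, v0) of Eq. (5): a complex sinusoid of
    frequency (u0, v0) under a Gaussian envelope of widths sigma_x, sigma_y."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-(x**2 / (2 * sigma_x**2) + y**2 / (2 * sigma_y**2)))
    carrier = np.exp(2j * np.pi * (u0 * x + v0 * y))
    return envelope * carrier

def jet(image, cx, cy, freqs=(0.05, 0.1, 0.2), n_orient=4, size=21):
    """A toy 'jet': Gabor magnitude responses at one point over several
    scales and orientations (parameter choices here are illustrative)."""
    half = size // 2
    patch = image[cy - half:cy + half + 1, cx - half:cx + half + 1]
    coeffs = []
    for f in freqs:
        for k in range(n_orient):
            theta = np.pi * k / n_orient
            g = gabor(size, 4.0, 4.0, f * np.cos(theta), f * np.sin(theta))
            coeffs.append(abs(np.sum(patch * np.conj(g))))
    return np.array(coeffs)

img = np.random.default_rng(3).normal(size=(64, 64))
print(jet(img, 32, 32).shape)  # (12,)
```

Taking magnitudes, as here, is what gives the coefficients their robustness to small translations noted above.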
Fig. 12. LFA kernels K(x_i, y) at different grids x_i [Penev and Atick 1996].
small fraction of them are active, corresponding to natural objects/signals that are statistically redundant [Ruderman 1994]. From the activity of these sparsely distributed receptors, the brain has to discover where and what objects are in the field of view and recover their attributes. Consequently, one expects to represent the natural objects/signals in a subspace of lower dimensionality by finding a suitable parameterization. For a limited class of objects such as faces which are correctly aligned and scaled, this suggests that even lower dimensionality can be expected [Penev and Atick 1996]. One good example is the successful use of the truncated PCA expansion to approximate frontal face images in a linear subspace [Kirby and Sirovich 1990; Sirovich and Kirby 1987].

Going a step further, the whole face region stimulates a full 2D array of receptors, each of which corresponds to a location in the face, but some of these receptors may be inactive. To exploit this redundancy, LFA is used to extract topographic local features from the global PCA modes. Unlike the PCA kernels Φ_i, which contain no topographic information (their supports extend over the entire grid of images), the LFA kernels K(x_i, y) (Figure 12) at selected grids x_i have local support. (These kernels, indexed by the grids x_i, are similar to the ICA kernels in the first ICA system architecture [Bartlett et al. 1998; Bell and Sejnowski 1995].) The search for the best topographic set of sparsely distributed grids {x_o} based on reconstruction error is called sparsification and is described in Penev and Atick [1996]. Two interesting points are demonstrated in this paper: (1) using the same number of kernels, the perceptual reconstruction quality of LFA based on the optimal set of grids is better than that of PCA; for a particular input, the mean square error is 227 for PCA and 184 for LFA; (2) keeping the second PCA eigenmodel in LFA reconstruction reduces the mean square error to 152, suggesting the hybrid use of PCA and LFA. No results on recognition performance based on LFA were reported. LFA is claimed to be used in Visionics's commercial system FaceIt (Table II).

A flexible appearance model based method for automatic face recognition was presented in [Lanitis et al. 1995]. To identify a face, both shape and gray-level information are modeled and used. The shape model is an ASM; these are statistical models of the shapes of objects which iteratively deform to fit an example of the shape in a new image. The statistical shape model is trained on example images using PCA, where the variables are the coordinates of the shape model points. For the purpose of classification, the shape variations due to interclass variation are separated from those due to within-class variations (such as small variations in 3D orientation and facial expression) using discriminant analysis. Based on the average shape of the
that were not strongly overlapped and contained gray-scale structures were used for classification. In addition, the face region was added to the nine components to form a single feature vector (a hybrid method), which was then used to train an SVM classifier [Vapnik 1995]. Training on three images and testing on 200 images per subject led to the following recognition rates on a set of six subjects: 90% for the hybrid method and roughly 10% for the global method that used the face region only; the false positive rate was 10%.

3.3. Summary and Discussion

Face recognition based on still images or frames captured from a video stream can be viewed as 2D image matching and recognition; range images are not available in most commercial/law enforcement applications. Face recognition based on other sensing modalities such as sketches and infrared images is also possible. Even though this is an oversimplification of the actual recognition problem of 3D objects based on 2D images, we have focused on this 2D problem, and we will address two important issues about 2D recognition of 3D face objects in Section 6. Significant progress has been achieved on various aspects of face recognition: segmentation, feature extraction, and recognition of faces in intensity images. Recently, progress has also been made on constructing fully automatic systems that integrate all these techniques.

3.3.1. Status of Face Recognition. After more than 30 years of research and development, basic 2D face recognition has reached a mature level, and many commercial systems are available (Table II) for various applications (Table I).

Early research on face recognition was primarily focused on the feasibility question, that is: is machine recognition of faces possible? Experiments were usually carried out using datasets consisting of as few as 10 images. Significant advances were made during the mid-1990s, with many methods proposed and tested on datasets consisting of as many as 100 images. More recently, practical methods have emerged that aim at more realistic applications. In the recent comprehensive FERET evaluations [Phillips et al. 2000; Phillips et al. 1998b; Rizvi et al. 1998], aimed at evaluating different systems using the same large database containing thousands of images, the systems described in Moghaddam and Pentland [1997], Swets and Weng [1996b], Turk and Pentland [1991], Wiskott et al. [1997], and Zhao et al. [1998], as well as others, were evaluated. The EBGM system [Wiskott et al. 1997], the subspace LDA system [Zhao et al. 1998], and the probabilistic eigenface system [Moghaddam and Pentland 1997] were judged to be among the top three, with each method showing different levels of performance on different subsets of sequestered images. A brief summary of the FERET evaluations will be presented in Section 5. Recently, more extensive evaluations using commercial systems and thousands of images have been performed in the FRVT 2000 [Blackburn et al. 2001] and FRVT 2002 [Phillips et al. 2003] tests.

3.3.2. Lessons, Facts, and Highlights. During the development of face recognition systems, many lessons have been learned which may provide some guidance in the development of new methods and systems.

Advances in face recognition have come from considering various aspects of this specialized perception problem. Earlier methods treated face recognition as a standard pattern recognition problem; later methods focused more on the representation aspect, after realizing its uniqueness (using domain knowledge); more recent methods have been concerned with both representation and recognition, so that a robust system with good generalization capability can be built. Face recognition continues to adopt state-of-the-art techniques from learning, computer vision, and pattern recognition. For example, distribution modeling using mixtures of Gaussians, and SVM learning methods, have been used in face detection/recognition.
shape. In Craw and Cameron [1996], manually selected points are used to warp the input images to the mean shape, yielding shape-free images. Because of this difference, PCA was a good classifier in Moghaddam and Pentland [1997] for the shape-free representations, but it may not be as good for the simply normalized representations. Recently, systematic comparisons and independent reevaluations of existing methods have been published [Beveridge et al. 2001]. This is beneficial to the research community. However, since the methods need to be reimplemented, and not all the details of the original implementations can be taken into account, it is difficult to carry out absolutely fair comparisons.

Over 30 years of research has provided us with a vast number of methods and systems. Recognizing the fact that each method has its advantages and disadvantages, we should select methods and systems appropriate to the application. For example, local feature based methods cannot be applied when the input image contains a small face region, say 15 × 15 pixels. Another issue is when to use PCA and when to use LDA in building a system. Apparently, when the number of training samples per class is large, LDA is the best choice. On the other hand, if only one or two samples are available per class (a degenerate case for LDA), PCA is a better choice. For a more detailed comparison of PCA versus LDA, see Beveridge et al. [2001] and Martinez and Kak [2001]. One way to unify PCA and LDA is to use regularized subspace LDA [Zhao et al. 1999].

3.3.3. Open Research Issues. Though machine recognition of faces from still images has achieved a certain level of success, its performance is still far from that of human perception. Specifically, we can list the following open issues:

- Hybrid face recognition systems that use both holistic and local features resemble the human perceptual system. While the holistic approach provides a quick recognition method, the discriminant information that it provides may not be rich enough to handle very large databases. This insufficiency can be compensated for by local feature methods. However, many questions need to be answered before we can build such a combined system. One important question is how to arbitrate the use of holistic and local features. As a first step, a simple, naive engineering approach would be to weight the features. But how to determine whether and how to use the features remains an open problem.

- The challenge of developing face detection techniques that report not only the presence of a face but also the accurate locations of facial features under large pose and illumination variations still remains. Without accurate localization of important features, accurate and robust face recognition cannot be achieved.

- How to model face variation under realistic settings (for example, outdoor environments and natural aging) is still challenging.

4. FACE RECOGNITION FROM IMAGE SEQUENCES

A typical video-based face recognition system automatically detects face regions, extracts features from the video, and recognizes facial identity if a face is present. In surveillance, information security, and access control applications, face recognition and identification from a video sequence is an important problem. Face recognition based on video is preferable to recognition from still images since, as demonstrated in Bruce et al. [1998] and Knight and Johnston [1997], motion helps in the recognition of (familiar) faces when the images are negated, inverted, or thresholded. It was also demonstrated that humans can recognize animated faces better than randomly rearranged images from the same set. Though recognition of faces from a video sequence is a direct extension of still-image-based recognition, in our opinion, true video-based face recognition techniques that coherently use both spatial and temporal information started only a few years ago
and still need further investigation. Significant challenges for video-based recognition still exist; we list several of them here.

(1) The quality of video is low. Usually, video acquisition occurs outdoors (or indoors, but under bad conditions for video capture) and the subjects are not cooperative; hence there may be large illumination and pose variations in the face images. In addition, partial occlusion and disguise are possible.

(2) Face images are small. Again, due to the acquisition conditions, the face image sizes are smaller (sometimes much smaller) than the sizes assumed in most still-image-based face recognition systems. For example, the valid face region can be as small as 15 × 15 pixels (notice that this is totally different from the situation where we have images with large face regions but the final face region fed into a classifier is 15 × 15), whereas the face image sizes used in feature-based still-image-based systems can be as large as 128 × 128. Small-size images not only make the recognition task more difficult, but also affect the accuracy of face segmentation, as well as the accurate detection of the fiducial points/landmarks that are often needed in recognition methods.

(3) The characteristics of faces/human body parts. During the past 8 years, research on human action/behavior recognition from video has been very active and fruitful. Generic description of human behavior not particular to an individual is an interesting and useful concept. One of the main reasons for the feasibility of generic descriptions of human behavior is that the intraclass variation of human bodies, and in particular faces, is much smaller than the difference between objects inside and outside the class. For the same reason, recognition of individuals within the class is difficult. For example, detecting and localizing faces is typically much easier than recognizing a specific face.

Before we examine existing video-based face recognition algorithms, we briefly review three closely related techniques: face segmentation and pose estimation, face tracking, and face modeling. These techniques are critical for the realization of the full potential of video-based face recognition.

4.1. Basic Techniques of Video-Based Face Recognition

In Chellappa et al. [1995], four computer vision areas were mentioned as being important for video-based face recognition: segmentation of moving objects (humans) from a video sequence; structure estimation; 3D models for faces; and nonrigid motion analysis. For example, in Jebara et al. [1998] a face modeling system which estimates facial features and texture from a video stream was described. This system utilizes all four techniques: segmentation of the face based on skin color to initiate tracking; use of a 3D face model based on laser-scanned range data to normalize the image (by facial feature alignment and texture mapping to generate a frontal view) and construction of an eigensubspace for 3D heads; use of structure from motion (SfM) at each feature point to provide depth information; and nonrigid motion analysis of the facial features based on simple 2D SSD (sum of squared differences) tracking constrained by a global 3D model. Based on the current development of video-based face recognition, we think it is better to review three specific face-related techniques instead of the above four general areas. The three video-based face-related techniques are: face segmentation and pose estimation, face tracking, and face modeling.

4.1.1. Face Segmentation and Pose Estimation. Early attempts [Turk and Pentland 1991] at segmenting moving faces from an image sequence used simple pixel-based change detection procedures based on difference images. These techniques may run into difficulties when multiple moving objects and occlusion are present. More sophisticated methods use estimated flow
fields for segmenting humans in motion [Shio and Sklansky 1991]. More recent methods [Choudhury et al. 1999; McKenna and Gong 1998] have used motion and/or color information to speed up the process of searching for possible face regions. After candidate face regions are located, still-image-based face detection techniques can be applied to locate the faces [Yang et al. 2002]. Given a face region, important facial features can be located. The locations of the feature points can be used for pose estimation, which is important for synthesizing a virtual frontal view [Choudhury et al. 1999]. Newly developed segmentation methods locate the face and estimate its pose simultaneously, without extracting features [Gu et al. 2001; Li et al. 2001b]. This is achieved by learning from multiview face examples which are labeled with manually determined pose angles.

4.1.2. Face and Feature Tracking. After faces are located, the faces and their features can be tracked. Face tracking and feature tracking are critical for reconstructing a face model (depth) through SfM, and feature tracking is essential for facial expression recognition and gaze recognition. Tracking also plays a key role in spatiotemporal-based recognition methods [Li and Chellappa 2001; Li et al. 2001a] which directly use the tracking information.

In its most general form, tracking is essentially motion estimation. However, general motion estimation has fundamental limitations, such as the aperture problem. For images like faces, some regions are too smooth to estimate flow accurately, and sometimes the change in local appearance is too large to give reliable flow. Fortunately, these problems are alleviated by face modeling, which exploits domain knowledge. In general, tracking and modeling are dual processes: tracking is constrained by a generic 3D model

(1) head tracking, which involves tracking the motion of a rigid object that is performing rotations and translations; (2) facial feature tracking, which involves tracking nonrigid deformations that are limited by the anatomy of the head, that is, articulated motion due to speech or facial expressions and deformable motion due to muscle contractions and relaxations; and (3) complete tracking, which involves tracking both the head and the facial features.

Early efforts focused on the first two problems: head tracking [Azarbayejani et al. 1993] and facial feature tracking [Terzopoulos and Waters 1993; Yuille and Hallinan 1992]. In Azarbayejani et al. [1993], an approach to head tracking using points with high Hessian values was proposed. Several such points on the head are tracked, and the 3D motion parameters of the head are recovered by solving an overconstrained set of motion equations. Facial feature tracking methods may make use of the feature boundary or the feature region. Feature boundary tracking attempts to track and accurately delineate the shape of the facial feature, for example, to track the contours of the lips and mouth [Terzopoulos and Waters 1993]. Feature region tracking addresses the simpler problem of tracking a region, such as a bounding box, that surrounds the facial feature [Black et al. 1995].

In Black et al. [1995], a tracking system based on local parameterized models is used to recognize facial expressions. The models include a planar model for the head, local affine models for the eyes, and local affine models and curvature for the mouth and eyebrows. A face tracking system was used in Maurer and Malsburg [1996b] to estimate the pose of the face. This system used a graph representation with about 20-40 nodes/landmarks to model the face. Knowledge about faces is used to find the landmarks in the first frame. Two tracking systems described in Jebara et al. [1998] and Strom et al. [1999] model faces completely with
or a learned statistical model under de- texture and geometry. Both systems use
formation, and individual models are re- generic 3D models and SfM to recover
fined through tracking. Face tracking can the face structure. Jebara et al. [1998] re-
be roughly divided into three categories: lied fixed feature points (eyes, nose tip),
while Strom et al. [1999] tracked only points with high Hessian values. Also, Jebara et al. [1998] tracked 2D features in 3D by deforming them, while Strom et al. [1999] relied on direct comparison of a 3D model to the image. Methods have been proposed in Black et al. [1998] and Hager and Belhumeur [1998] to solve the varying appearance (both geometry and photometry) problem in tracking. Some of the newest model-based tracking methods calculate the 3D motions and deformations directly from image intensities [Brand and Bhotika 2001], thus eliminating information-lossy intermediate representations.

4.1.3. Face Modeling. Modeling of faces includes 3D shape modeling and texture modeling. Large texture variations due to changes in illumination are addressed in Section 6; here we focus on 3D shape modeling. 3D models of faces have been employed in the graphics, animation, and model-based image compression literature. More complicated models are used in applications such as forensic face reconstruction from partial information.

In computer vision, one of the most widely used methods of estimating 3D shape from a video sequence is SfM, which estimates the 3D depths of interesting points. The unconstrained SfM problem has been approached in two ways. In the differential approach, one computes some type of flow field (optical, image, or normal) and uses it to estimate the depths of visible points. The difficulty in this approach is reliable computation of the flow field. In the discrete approach, a set of features such as points, edges, corners, lines, or contours is tracked over a sequence of frames, and the depths of these features are computed. To overcome the difficulty of feature tracking, bundle adjustment [Triggs et al. 2000] can be used to obtain better and more robust results.

Recently, multiview-based 2D methods have gained popularity. In Li et al. [2001b], the model consists of a sparse 3D shape model learned from 2D images labeled with pose and landmarks, a shape-and-pose-free texture model, and an affine geometrical model. An alternative approach is to use 3D models such as the deformable model of DeCarlo and Metaxas [2000] or the linear 3D object class model of Blanz and Vetter [1999]. (In Blanz and Vetter [1999] a morphable 3D face model consisting of shape and texture was directly matched to single/multiple input images; as a consequence, head orientation, illumination conditions, and other parameters could be free variables subject to optimization.) In Blanz and Vetter [1999], real-time 3D modeling and tracking of faces was described; a generic 3D head model was aligned to match frontal views of the face in a video sequence.

4.2. Video-Based Face Recognition

Historically, video face recognition originated from still-image-based techniques (Table IV). That is, the system automatically detects and segments the face from the video, and then applies still-image face recognition techniques. Many methods reviewed in Section 3 belong to this category: eigenfaces [Turk and Pentland 1991], probabilistic eigenfaces [Moghaddam and Pentland 1997], the EBGM method [Okada et al. 1998; Wiskott et al. 1997], and the PDBNN method [Lin et al. 1997]. An improvement over these methods is to apply tracking; this can help
Fig. 14. Varying the most significant identity parameters (top) and manipulating residual variation without affecting identity (bottom) [Edwards et al. 1998].
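The discrete SfM approach discussed above (track features over frames, then recover their depths) reduces, in the simplest two-frame case with a known sideways camera translation, to triangulation from disparity. A minimal sketch of that special case; all numbers are hypothetical and not taken from any system reviewed here:

```python
def depth_from_disparity(focal_px, baseline_m, x_first, x_second):
    """Depth of a tracked feature from its pixel shift between two frames.

    Assumes a purely horizontal camera translation of baseline_m meters,
    so depth = focal_length * baseline / disparity (two-view triangulation).
    """
    disparity = x_first - x_second  # feature shift in pixels
    if disparity <= 0:
        raise ValueError("feature must shift against the camera motion")
    return focal_px * baseline_m / disparity

# A feature tracked from x=320 to x=310 px under a 0.1 m sideways move,
# with a 500 px focal length, lies at 500 * 0.1 / 10 = 5 m.
print(depth_from_disparity(500, 0.1, 320, 310))
```

Real SfM solves an overconstrained system over many frames and features; bundle adjustment [Triggs et al. 2000] refines all depths and camera parameters jointly instead of triangulating each feature independently.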
the thresholds used to initiate, track, and delete an object.

To find the face region in an image, the preselector uses a generic sparse graph consisting of 16 nodes learned from eight example face images. The landmark finder uses a dense graph consisting of 48 nodes learned from 25 example images to find landmarks such as the eyes and the nose tip. Finally, an elastic graph matching scheme is employed to identify the face. A recognition rate of about 90% was achieved; the size of the database is not known.

A multimodal person recognition system was described in Choudhury et al. [1999]. This system consists of a face recognition module, a speaker identification module, and a classifier fusion module. It has the following characteristics: (1) the face recognition module can detect and compensate for pose variations, and the speaker identification module can detect and compensate for changes in the auditory background; (2) the most reliable video frames and audio clips are selected for recognition; (3) 3D information about the head obtained through SfM is used to detect the presence of an actual person as opposed to an image of that person.

Two key parts of the face recognition module are face detection/tracking and eigenface recognition. The face is detected using skin color information and a learned mixture-of-Gaussians model. The facial features are then located using symmetry transforms and image intensity gradients. Correlation-based methods are used to track the feature points. The locations of these feature points are used to estimate the pose of the face. This pose estimate and a 3D head model are used to warp the detected face image into a frontal view. For recognition, the feature locations are refined and the face is normalized with eyes and mouth in fixed locations. Images from the face tracker are used to train a frontal eigenspace, and the leading 35 eigenvectors are retained. Face recognition is then performed using a probabilistic eigenface approach where the projection coefficients of all images of each person are modeled as a Gaussian distribution.

Finally, the face and speaker recognition modules are combined using a Bayes net. The system was tested in an ATM scenario, a controlled environment. An ATM session begins when the subject enters the camera's field of view and the system detects his/her face. The system then greets the user and begins the banking transaction, which involves a series of questions by the system and answers by the user. Data for 26 people were collected; the normalized face images were 40×80 pixels and the audio was sampled at 16 kHz. These experiments on small databases and well-controlled environments showed that the combination of audio and video improved performance, and that 100% recognition and verification were achieved when the image/audio clips with the highest confidence scores were used.

In Li and Chellappa [2001], a face verification system based on tracking facial features was presented. The basic idea of this approach is to exploit the temporal information available in a video sequence to improve face recognition. First, the feature points defined by Gabor attributes on a regular 2D grid are tracked. Then, the trajectories of these tracked feature points are exploited to identify the person presented in a short video sequence. The proposed tracking-for-verification scheme differs from a pure tracking scheme in that one template face from a database of known persons is selected for tracking. For each template with a specific personal ID, tracking can be performed and trajectories can be obtained. Based on the characteristics of these trajectories, identification can be carried out. According to the authors, the trajectories of the same person are more coherent than those of different persons, as illustrated in Figure 15. Such characteristics can also be observed in the posterior probabilities over time by assuming different classes. In other words, the posterior probabilities for the true hypothesis tend to be higher than those for false hypotheses. This in turn can be used for identification. Testing results on a small database of 19
Fig. 15. Corresponding feature points obtained from 20 frames: (a) result of matching the same person to a video, (b) result of matching a different person to the video, (c) trajectories of (a), (d) trajectories of (b) [Li and Chellappa 2001].
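The general principle behind such temporal schemes, that evidence accumulated over many frames favors the true hypothesis, can be illustrated with a toy sketch. This is only a schematic of the idea, not the authors' actual trajectory-based algorithm, and the per-frame scores are hypothetical:

```python
import math

def identify(frame_scores_by_id):
    """Pick the identity whose per-frame match scores, accumulated over
    the whole clip, are highest.

    frame_scores_by_id maps a candidate ID to a list of per-frame
    likelihoods in (0, 1]. Summing log-likelihoods over time rewards the
    hypothesis that stays consistently plausible, mirroring the observation
    that posteriors for the true hypothesis tend to dominate over time.
    """
    totals = {pid: sum(math.log(s) for s in scores)
              for pid, scores in frame_scores_by_id.items()}
    return max(totals, key=totals.get)

# Hypothetical per-frame scores for two candidate templates: person_B
# matches well in one frame but is inconsistent across the clip.
scores = {"person_A": [0.9, 0.8, 0.85, 0.9],
          "person_B": [0.9, 0.3, 0.4, 0.2]}
print(identify(scores))  # person_A
```

A single-frame (still-image) decision on the first frame alone could not separate the two candidates here; the temporal accumulation does.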
Fig. 17. Images from the FERET dataset; these images are of size 384 × 256.
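The evaluations reviewed below report two basic measures: the rank-1 identification rate and the verification equal error rate. A minimal sketch of both, on hypothetical similarity scores (a real evaluation would report full CMC and ROC curves):

```python
def rank1_rate(similarity, true_id):
    """Fraction of probes whose highest-scoring gallery entry is correct.

    similarity: list of dicts, one per probe, mapping gallery ID -> score.
    true_id:    list of the correct gallery ID for each probe.
    """
    hits = sum(max(row, key=row.get) == t
               for row, t in zip(similarity, true_id))
    return hits / len(true_id)

def equal_error_rate(genuine, impostor):
    """Sweep a decision threshold and return the operating point where the
    false-reject rate (genuine scores below threshold) and false-accept
    rate (impostor scores at or above it) are closest: an EER estimate."""
    best = None
    for t in sorted(genuine + impostor):
        frr = sum(g < t for g in genuine) / len(genuine)
        far = sum(i >= t for i in impostor) / len(impostor)
        if best is None or abs(frr - far) < best[0]:
            best = (abs(frr - far), (frr + far) / 2)
    return best[1]

genuine = [0.9, 0.8, 0.7]   # hypothetical matching-pair scores
impostor = [0.3, 0.2, 0.1]  # hypothetical non-matching-pair scores
print(equal_error_rate(genuine, impostor))  # 0.0 -- perfectly separable
```

On dense score sets the threshold sweep traces the ROC mentioned in the text; the EER is simply the point where the two error curves cross.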
evaluation (Sep96) will be briefly reviewed here. Because the entire FERET data set has been released, the Sep96 protocol provides a good benchmark for the performance of new algorithms. For the Sep96 evaluation, there was a primary gallery consisting of one frontal image (fa) per person for 1196 individuals. This was the core gallery used to measure performance for the following four different probe sets:

- fb probes: gallery and probe images of an individual taken on the same day with the same lighting (1195 probes);
- fc probes: gallery and probe images of an individual taken on the same day with different lighting (194 probes);
- Dup I probes: gallery and probe images of an individual taken on different days (duplicate images; 722 probes); and
- Dup II probes: gallery and probe images of an individual taken over a year apart (the gallery consisted of 894 images; 234 probes).

Performance was measured using two basic methods. The first measured identification performance, where the primary performance statistic is the percentage of probes that are correctly identified by the algorithm. The second measured verification performance, where the primary performance measure is the equal error rate between the probability of false alarm and the probability of correct verification. (A more complete method of reporting identification performance is a cumulative match characteristic; for verification performance it is a receiver operating characteristic (ROC).)

The Sep96 evaluation tested the following 10 algorithms:

- an algorithm from Excalibur Corporation (Carlsbad, CA) (Sept. 1996);
- two algorithms from the MIT Media Laboratory (Sept. 1996) [Moghaddam et al. 1996; Turk and Pentland 1991];
- three linear discriminant analysis-based algorithms from Michigan State University [Swets and Weng 1996b] (Sept. 1996) and the University of Maryland [Etemad and Chellappa 1997; Zhao et al. 1998] (Sept. 1996 and March 1997);
- a gray-scale projection algorithm from Rutgers University [Wilder 1994] (Sept. 1996);
- an Elastic Graph Matching algorithm from the University of Southern California [Okada et al. 1998; Wiskott et al. 1997] (March 1997);
- a baseline PCA algorithm [Moon and Phillips 2001; Turk and Pentland 1991]; and
- a baseline normalized correlation matching algorithm.

Three of the algorithms performed very well: probabilistic eigenface from MIT [Moghaddam et al. 1996], subspace LDA from UMD [Zhao et al. 1998, 1999], and Elastic Graph Matching from USC [Wiskott et al. 1997].

A number of lessons were learned from the FERET evaluations. The first is that performance depends on the probe category and that there is a difference between best and average algorithm performance.

Another lesson is that the scenario has an impact on performance. For
identification, on the fb and duplicate probes, the USC scores were 94% and 59%, and the UMD scores were 96% and 47%. However, for verification, the equal error rates were 2% and 14% for USC, and 1% and 12% for UMD.

5.1.3. Summary. The availability of the FERET database and evaluation technology has had a significant impact on progress in the development of face recognition algorithms. The series of tests has allowed advances in algorithm development to be quantified; for example, the performance improvements in the MIT algorithms between March 1995 and September 1996, and in the UMD algorithms between September 1996 and March 1997.

Another important contribution of the FERET evaluations is the identification of areas for future research. In general, the test results revealed three major problem areas: recognizing duplicates, recognizing people under illumination variations, and recognizing them under pose variations.

5.1.4. FRVT 2000. The Sep96 FERET evaluation measured performance on prototype laboratory systems. After March 1997 there was rapid advancement in the development of commercial face recognition systems. This advancement represented both a maturing of face recognition technology and the development of the supporting systems and infrastructure necessary to create commercial off-the-shelf (COTS) systems. By the beginning of 2000, COTS face recognition systems were readily available.

To assess the state of the art in COTS face recognition systems, the Face Recognition Vendor Test (FRVT) 2000^18 was organized [Blackburn et al. 2001]. FRVT 2000 was a technology evaluation that used the Sep96 evaluation protocol, but was significantly more demanding than the Sep96 FERET evaluation.

Participation in FRVT 2000 was restricted to COTS systems, with companies from Australia, Germany, and the United States participating. The five companies evaluated were Banque-Tec International Pty. Ltd., C-VIS Computer Vision und Automation GmbH, Miros, Inc., Lau Technologies, and Visionics Corporation.

A greater variety of imagery was used in FRVT 2000 than in the FERET evaluations. FRVT 2000 reported results in eight general categories: compression, distance, expression, illumination, media, pose, resolution, and temporal. There was no common gallery across all eight categories; the sizes of the galleries and probe sets varied from category to category.

We briefly summarize the results of FRVT 2000. Full details, including identification and verification performance statistics, can be found in Blackburn et al. [2001]. The media experiments showed that changes in media do not adversely affect performance. Images of a person were taken simultaneously on conventional film and on digital media. The compression experiments showed that compression does not adversely affect performance. Probe images compressed up to 40:1 did not reduce recognition rates. The compression algorithm was JPEG.

FRVT 2000 also examined the effect of pose angle on performance. The results show that pose does not significantly affect performance up to 25°, but that performance is significantly affected when the pose angle reaches 40°.

In the illumination category, two key effects were investigated. The first was lighting change indoors. This was equivalent to the fc probes in FERET. For the best system in this category, the indoor change of lighting did not significantly affect performance. A second experiment tested recognition with an indoor gallery and an outdoor probe set. Moving from indoor to outdoor lighting significantly affected performance, with the best system achieving an identification rate of only 0.55.

The temporal category is equivalent to the duplicate probes in FERET. To compare progress since FERET, dup I and dup II scores were reported. For FRVT 2000 the dup I identification rate was 0.63

^18 http://www.frvt.org.
compared with 0.58 for FERET. The corresponding rates for dup II were 0.64 for FRVT 2000 and 0.52 for FERET. These results showed that there was algorithmic progress between the FERET and FRVT 2000 evaluations. FRVT 2000 showed that two common concerns, the effects of compression and recording media, do not affect performance. It also showed that future areas of interest continue to be duplicates, pose variations, and illumination variations generated when comparing indoor images with outdoor images.

5.1.5. FRVT 2002. The Face Recognition Vendor Test (FRVT) 2002 [Phillips et al. 2003]^18 was a large-scale evaluation of automatic face recognition technology. The primary objective of FRVT 2002 was to provide performance measures for assessing the ability of automatic face recognition systems to meet real-world requirements. Ten participants were evaluated under the direct supervision of the FRVT 2002 organizers in July and August 2002.

The heart of the FRVT 2002 was the high computational intensity test (HCInt). The HCInt consisted of 121,589 operational images of 37,437 people. The images were provided from the U.S. Department of State's Mexican nonimmigrant Visa archive. From this data, real-world performance figures on a very large data set were computed. Performance statistics were computed for verification, identification, and watch list tasks.

FRVT 2002 results showed that normal changes in indoor lighting do not significantly affect performance of the top systems. Approximately the same performance results were obtained using two indoor data sets, with different lighting, in FRVT 2002. In both experiments, the best performer had a 90% verification rate at a false accept rate of 1%. Compared with similar experiments conducted 2 years earlier in FRVT 2000, the results of FRVT 2002 indicated that there had been a 50% reduction in error rates. For the best face recognition systems, the recognition rate for faces captured outdoors, at a false accept rate of 1%, was only 50%. Thus, face recognition from outdoor imagery remains a research challenge area.

A very important question for real-world applications is the rate of decrease in performance as time increases between the acquisition of the database images and the new images presented to a system. FRVT 2002 found that for the top systems, performance degraded at approximately 5% per year.

One open question in face recognition is: how does database and watch list size affect performance? Because of the large number of people and images in the FRVT 2002 data set, FRVT 2002 reported the first large-scale results on this question. For the best system, the top-rank identification rate was 85% on a database of 800 people, 83% on a database of 1,600, and 73% on a database of 37,437. For every doubling of database size, performance decreases by two to three overall percentage points. More generally, identification performance decreases linearly in the logarithm of the database size.

Previous evaluations have reported face recognition performance as a function of imaging properties. For example, previous reports compared the differences in performance when using indoor versus outdoor images, or frontal versus nonfrontal images. FRVT 2002, for the first time, examined the effects of demographics on performance. Two major effects were found. First, recognition rates for males were higher than for females. For the top systems, identification rates for males were 6 to 9 percentage points higher than those for females. For the best system, identification performance on males was 78% and for females it was 79%. Second, recognition rates for older people were higher than for younger people. For 18- to 22-year-olds the average identification rate for the top systems was 62%, and for 38- to 42-year-olds it was 74%. For every 10-year increase in age, performance increased on the average by approximately 5% through age 63.

FRVT 2002 looked at two of these new techniques. The first was the three-dimensional morphable models technique
primarily in the number of subjects (295 rather than 37). The M2VTS database contains five shots of each subject taken at sessions over a period of 3 months; the XM2VTS database contains eight shots of each subject taken at four sessions over a period of 4 months (so that each session contains two repetitions of the sequence). The XM2VTS database was acquired using a Sony VX1000E digital camcorder and a DHR1000UX digital VCR.

In the XM2VTS database, the first shot is a speaking head shot. Each subject, who wore a clip-on microphone, was asked to read three sentences that were written on a board positioned just below the camera. The subjects were asked to read the three simple sentences twice at their normal pace and to pause briefly at the end of each sentence.

The second shot is a rotating head sequence. Each subject was asked to rotate his/her head to the left, to the right, up, and down, and finally to return to the center. The subjects were told that a full profile was required and were asked to repeat the entire sequence twice. The same sequence was used in all four sessions.

An additional dataset containing a 3D model of each subject's head was acquired during each session using a high-precision stereo-based 3D camera developed by the Turing Institute.^20

5.2.2. Evaluation. The M2VTS Lausanne protocol was designed to evaluate the performance of vision- and speech-based person authentication systems on the XM2VTS database. This protocol was defined for the task of verification. The features of the observed person are compared with stored features corresponding to the claimed identity, and the system decides whether the identity claim is true or false on the basis of a similarity score. The subjects whose features are stored in the system's database are called clients, whereas persons claiming a false identity are called imposters.

^20 Turing Institute Web address: http://www.turing.gla.ac.uk/.

The database is divided into three parts: a training set, an evaluation set, and a test set. The training set is used to build client models. The evaluation set is used to compute client and imposter scores. On the basis of these scores, a threshold is chosen that determines whether a person is accepted or rejected. In multimodal classification, the evaluation set can also be used to optimally combine the outputs of several classifiers. The test set is selected to simulate a real authentication scenario. The 295 subjects were randomly divided into 200 clients, 25 evaluation imposters, and 70 test imposters. Two different evaluation configurations were used with different distributions of client training and client evaluation data. For more details, see Messer et al. [1999].

In order to collect face verification results on this database using the Lausanne protocol, a contest was organized in conjunction with ICPR 2000 (the International Conference on Pattern Recognition). There were twelve algorithms from four participants in this contest [Matas et al. 2000]: an EBGM algorithm from IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence), a slightly modified EBGM algorithm from Aristotle University of Thessaloniki, an FND-based (Fractal Neighbor Distance) algorithm from the University of Sydney, and eight variants of LDA algorithms and one SVM algorithm from the University of Surrey. The performance measures of a verification system are the false acceptance rate (FA) and the false rejection rate (FR). Both FA and FR are influenced by an acceptance threshold. According to the Lausanne protocol, the threshold is set to satisfy certain performance levels on the evaluation set. The same threshold is applied to the test data, and FA and FR on the test data are computed. The best results of FA and FR on the test data (FA/FR: 2.3%/2.5% and 1.2%/1.0% for evaluation configurations I and II, respectively) were obtained using an LDA algorithm with a non-Euclidean metric (University of Surrey) when the threshold was set so that FA was equal to FR on the evaluation result. This result seems to concur with the
equal error rates reported in the FERET protocol. In addition, FA and FR on the test data were reported when the threshold was set so that FA or FR was zero on the evaluation result. For more details on the results, see Matas et al. [2000].

5.2.3. Summary. The results of the M2VTS/XM2VTS projects can be used for a broad range of applications. In the telecommunication field, the results should have a direct impact on network services, where security of information and access will become increasingly important. (Telephone fraud in the U.S. has been estimated to cost several billion dollars a year.)

6. TWO ISSUES IN FACE RECOGNITION: ILLUMINATION AND POSE VARIATION

In this section, we discuss two important issues that are related to face recognition. The best face recognition techniques reviewed in Section 3 were successful in terms of their recognition performance on large databases in well-controlled environments. However, face recognition in an uncontrolled environment is still very challenging. For example, the FERET evaluations and FRVTs revealed that there are at least two major challenges: the illumination variation problem and the pose variation problem. Though many existing systems build in some sort of performance invariance by applying preprocessing methods such as histogram equalization or pose learning, significant illumination or pose change can cause serious performance degradation. In addition, face images can be partially occluded, or the system may need to recognize a person from an image in the database that was acquired some time ago (referred to as the duplicate problem in the FERET tests).

These problems are unavoidable when face images are acquired in an uncontrolled, uncooperative environment, as in surveillance video clips. It is beyond the scope of this paper to discuss all these issues and possible solutions. In this section we discuss only two well-defined problems and review approaches to solving them. Pros and cons of these approaches are pointed out so that an appropriate approach can be applied to a specific task. The majority of the methods reviewed here are generative approaches that can synthesize virtual views under desired illumination and viewing conditions. Many of the reviewed methods have not yet been applied to the task of face recognition, at least not on large databases.^21 This may be for several reasons: some methods may need many sample images per person, pixel-wise accurate alignment of images, or high-quality images for reconstruction; or they may be computationally too expensive to apply to recognition tasks that process thousands of images in near-real-time.

To facilitate discussion and analysis, we adopt a varying-albedo Lambertian reflectance model that relates the image I of an object to the object's surface gradients (p, q) and albedo ρ [Horn and Brooks 1989]:

I = \frac{\rho\,(1 + p P_s + q Q_s)}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + P_s^2 + Q_s^2}},    (6)

where (p, q) and ρ are the partial derivatives and the varying albedo of the object, respectively, and (P_s, Q_s, 1) represents a single distant light source. The light source can also be represented by the illuminant slant and tilt angles: the slant σ is the angle between the opposite lighting direction and the positive z-axis, and the tilt τ is the angle between the opposite lighting direction and the x-z plane. These angles are related to P_s and Q_s by P_s = tan σ cos τ and Q_s = tan σ sin τ. To simplify the notation, we replace the constant \sqrt{1 + P_s^2 + Q_s^2} by K. For easier analysis, we assume that frontal face objects are bilaterally symmetric about the vertical midlines of the faces.

^21 One exception is a recent report [Blanz and Vetter 2003] where faces were represented using 4448 images from the CMU-PIE databases and 1940 images from the FERET database.
Fig. 18. In each row, the same face appears differently under different illuminations (from the Yale face database).
6.1. The Illumination Problem in Face Recognition

The illumination problem is illustrated in Figure 18, where the same face appears different due to a change in lighting. The changes induced by illumination are often larger than the differences between individuals, causing systems based on comparing images to misclassify input images. This was experimentally observed in Adini et al. [1997] using a dataset of 25 individuals.

In Zhao [1999], an analysis was carried out of how illumination variation changes the eigen-subspace projection coefficients of images, under the assumption of a Lambertian surface. Consider the basic expression for the subspace decomposition of a face image I: I \approx I_A + \sum_{i=1}^{m} a_i \Phi_i, where I_A is the average image, the \Phi_i are the eigenimages, and the a_i are the projection coefficients. Assume that for a particular individual we have a prototype image I_p that is a normally lighted frontal view (P_s = 0, Q_s = 0 in Equation (6)) in the database, and we want to match it against a new image \tilde{I} of the same class under lighting (P_s, Q_s, 1). The corresponding subspace projection coefficient vectors a = [a_1, a_2, \ldots, a_m]^T (for I_p) and \tilde{a} = [\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_m]^T (for \tilde{I}) are computed as follows:

a_i = I_p \otimes \Phi_i - I_A \otimes \Phi_i,
\tilde{a}_i = \tilde{I} \otimes \Phi_i - I_A \otimes \Phi_i,    (7)

where \otimes denotes the sum of all elementwise products of two matrices (vectors). If we divide the images and the eigenimages into two halves, for example, left and right, we have

a_i = I_p^L \otimes \Phi_i^L + I_p^R \otimes \Phi_i^R - I_A \otimes \Phi_i,
\tilde{a}_i = \tilde{I}^L \otimes \Phi_i^L + \tilde{I}^R \otimes \Phi_i^R - I_A \otimes \Phi_i.    (8)

Based on Equation (6) and the symmetry properties of the eigenimages and face objects, we have

a_i = 2\, I_p^L[x, y] \otimes \Phi_i^L[x, y] - I_A \otimes \Phi_i,
\tilde{a}_i = \frac{2}{K}\bigl(I_p^L[x, y] + I_p^L[x, y]\, q^L[x, y]\, Q_s\bigr) \otimes \Phi_i^L[x, y] - I_A \otimes \Phi_i,    (9)

leading to the following relation:

\tilde{a} = \frac{1}{K}\, a + \frac{Q_s}{K}\, \bigl[f_1^a, f_2^a, \ldots, f_m^a\bigr]^T - \frac{K - 1}{K}\, a_A,    (10)

where f_i^a = 2\,(I_p^L[x, y]\, q^L[x, y]) \otimes \Phi_i^L[x, y] and a_A is the projection coefficient vector of the average image I_A: [I_A \otimes \Phi_1, \ldots, I_A \otimes \Phi_m]. Now let us assume that the training set is extended to include mirror images as in Kirby and Sirovich [1990]. A similar analysis can be carried out, since in such a case the eigenimages are either symmetric (for most leading eigenimages) or antisymmetric.

In general, Equation (11) suggests that a significant illumination change can seriously degrade the performance of
Fig. 19. Changes of projection vectors due to class variation (a) and illumination change (b) are of the same order [Zhao 1999].
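The projection of Equation (7) is just a correlation of the mean-subtracted image with each eigenimage. A toy sketch on hypothetical 2×2 "images" flattened to vectors:

```python
def project(image, mean_image, eigenimages):
    """Subspace projection coefficients a_i = (I - I_A) (x) Phi_i, where
    (x) is the sum of all elementwise products as in Equation (7)."""
    diff = [x - m for x, m in zip(image, mean_image)]
    return [sum(d * e for d, e in zip(diff, phi)) for phi in eigenimages]

# Hypothetical flattened 2x2 images and two orthonormal eigenimages.
I   = [0.9, 0.1, 0.4, 0.6]
I_A = [0.5, 0.5, 0.5, 0.5]
Phi = [[0.5, -0.5, 0.5, -0.5],
       [0.5, 0.5, -0.5, -0.5]]
print(project(I, I_A, Phi))  # approximately [0.3, 0.0]
```

Scaling I (as an illumination change in Equation (10) roughly does through the factor 1/K) scales the coefficients in the same way, which is why a significant illumination change can shift a probe's coefficient vector by as much as a change of identity does (Figure 19).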
point-wise distance. For more details about these methods and about the evaluation database, see Adini et al. [1997]. It was concluded that none of these representations alone can overcome the image variations due to illumination.

A recently proposed image comparison method [Jacobs et al. 1998] used a new measure robust to illumination change. The rationale for developing such a method of directly comparing images is the potential difficulty of building a complete representation of an object's possible images, as suggested in [Belhumeur and Kriegman 1997]. The authors argued that it is not clear whether it is possible to construct the complete representation using a small number of training images taken under uncontrolled viewing conditions and containing multiple light sources. It was shown that given two images of an object with unknown structure and albedo, there is always a large family of solutions. Even in the case of given light sources, only two out of three independent components of the Hessian of the surface can be determined. Instead, the authors argued that the ratio of two images of the same object is simpler than if the images are from different objects. Based on this observation, the complexity of the ratio of two aligned images was proposed as the similarity measure. More specifically, we have

    $\frac{I_1}{I_2} = \frac{K_2}{K_1}\,\frac{1 + p_I P_{s,1} + q_I Q_{s,1}}{1 + p_I P_{s,2} + q_I Q_{s,2}}$    (11)

for images of the same object, and

    $\frac{I_1}{J_2} = \frac{K_2}{K_1}\,\frac{\rho_I}{\rho_J}\,\frac{1 + p_I P_{s,1} + q_I Q_{s,1}}{1 + p_J P_{s,2} + q_J Q_{s,2}}\,\sqrt{\frac{1 + p_J^2 + q_J^2}{1 + p_I^2 + q_I^2}}$    (12)

for images of different objects. They chose the integral of the magnitude of the gradient of the function (ratio image) as the measure of complexity and proposed the following symmetric similarity measure:

    $d_G(I, J) = \int\!\!\int \min(I, J)\left(\left\|\nabla\!\left(\frac{I}{J}\right)\right\| + \left\|\nabla\!\left(\frac{J}{I}\right)\right\|\right) dx\, dy.$    (13)

They noticed the similarity between this measure and a measure that simply compares the edges. It is also clear that the measure is not strictly illumination-invariant, because it changes for a pair of images of the same object when the illumination changes. Experiments on face recognition showed improved performance over eigenfaces, though the results were somewhat worse than those of the illumination cone-based method [Georghiades et al. 1998] on the same set of data.

6.1.3. Class-Based Approaches. Under the assumptions of Lambertian surfaces and no shadowing, a 3D linear illumination subspace for a person was constructed in Belhumeur and Kriegman [1997], Hallinan [1994], Murase and Nayar [1995], Riklin-Raviv and Shashua [1999], and Shashua [1994] for a fixed viewpoint, using three aligned faces/images acquired under different lighting conditions. Under ideal assumptions, recognition based on this subspace is illumination-invariant. More recently, an illumination cone has been proposed as an effective method of handling illumination variations, including shadowing and multiple light sources [Belhumeur and Kriegman 1997; Georghiades et al. 1998]. This method is an extension of the 3D linear subspace method [Hallinan 1994; Shashua 1994] and has the same drawback: it requires at least three aligned training images acquired under different lighting conditions per person. A more detailed review of this approach and its extension to handle the combined illumination and pose problem will be presented in Section 6.2.
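A discrete version of the gradient-of-ratio measure in Equation (13) is straightforward to write down. The sketch below is an illustrative approximation (the function names and the `eps` guard against division by zero are our own, not from Jacobs et al. [1998]):

```python
import numpy as np

def grad_mag(F):
    # Magnitude of the image gradient via central differences.
    gy, gx = np.gradient(F)
    return np.sqrt(gx**2 + gy**2)

def d_G(I, J, eps=1e-6):
    """Discrete version of the symmetric measure in Eq. (13):
    sum of min(I, J) * (|grad(I/J)| + |grad(J/I)|) over all pixels.
    Inputs are assumed positive (e.g., intensities in (0, 1])."""
    R1 = I / (J + eps)
    R2 = J / (I + eps)
    return np.sum(np.minimum(I, J) * (grad_mag(R1) + grad_mag(R2)))

rng = np.random.default_rng(1)
I = 0.2 + 0.8 * rng.random((32, 32))
# A globally rescaled version of I: the ratio image is nearly constant,
# so its gradient, and hence the score, is small.
J_same = 0.7 * I
J_other = 0.2 + 0.8 * rng.random((32, 32))
print(d_G(I, J_same) < d_G(I, J_other))
```

As the text notes, the measure is not strictly illumination-invariant: a spatially varying lighting change produces a non-constant ratio image and a nonzero score even for the same object.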
Fig. 20. Testing the invariance of the quotient image (Q-image) to varying illu-
mination. (a) Original images of a novel face taken under five different illumina-
tions. (b) The Q-images corresponding to the novel images, computed with respect
to the bootstrap set of ten objects [Riklin-Raviv and Shashua 1999]. (Courtesy of
T. Riklin-Raviv and A. Shashua.)
More recently, a method based on quotient images was introduced [Riklin-Raviv and Shashua 1999]. Like other class-based methods, this method assumes that the faces of different individuals have the same shape and different textures. Given two objects $a$ and $b$, the quotient image $Q$ is defined to be the ratio of their albedo functions $\rho_a / \rho_b$, and hence is illumination-invariant. Once $Q$ is computed, the entire illumination space of object $a$ can be generated by $Q$ and a linear illumination subspace constructed from three images of object $b$. To make this basic idea work in practice, a training set (called the bootstrap set in the paper) is needed that consists of images of $N$ objects under various lighting conditions, and the quotient image of a novel object $y$ is defined relative to the average object of the bootstrap set. More specifically, the bootstrap set consists of $3N$ images taken from three fixed, linearly independent light sources $s_1$, $s_2$, and $s_3$ that are not known. Under this assumption, any light source $s$ can be expressed as a linear combination of the $s_i$: $s = x_1 s_1 + x_2 s_2 + x_3 s_3$. The authors further defined the normalized albedo function of the bootstrap set as the squared sum of the $\rho_i$, where $\rho_i$ is the albedo function of object $i$. An interesting energy/cost function is defined that is quite different from the traditional bilinear form. Let $A_1, A_2, \ldots, A_N$ be $m \times 3$ matrices whose columns are images of object $i$ (from the bootstrap set) that contain the same $m$ pixels; then the bilinear energy/cost function [Freeman and Tenenbaum 2000] for an image $y_s$ of object $y$ under illumination $s$ is

    $\left\| y_s - \sum_{i=1}^{N} \alpha_i A_i x \right\|^2,$    (14)

which is a bilinear problem in the $N$ unknowns $\alpha_i$ and the 3 unknowns $x$. For comparison, the proposed energy function is

    $\sum_{i=1}^{N} \| \alpha_i y_s - A_i x \|^2.$    (15)

This formulation of the energy function is a major reason why the quotient image method works better than reconstruction methods based on Equation (14), in terms of a smaller bootstrap set and a weaker requirement for pixel-wise image alignment. As pointed out by the authors, another factor contributing to the success of using only a small bootstrap set is that the albedo functions occupy only a small subspace. Figure 20 demonstrates the invariance of the quotient image against change in illumination conditions; the image synthesis results are shown in Figure 21.

Fig. 21. Image synthesis example. Original image (a) and its quotient image (b) from the N = 10 bootstrap set. The quotient image is generated relative to the average object of the bootstrap set, shown in (c), (d), and (e). Images (f) through (k) are synthetic images created from (b) and (c), (d), (e) [Riklin-Raviv and Shashua 1999]. (Courtesy of T. Riklin-Raviv and A. Shashua.)

6.1.4. Model-Based Approaches. In model-based approaches, a 3D face model is used to synthesize the virtual image from a given image under desired illumination conditions. When the 3D model is unknown, recovering the shape from the images accurately is difficult without using any priors. Shape-from-shading (SFS) can be used if only one image is available; stereo or structure from motion can be used when multiple images of the same object are available.

Fortunately, for face recognition the differences in the 3D shapes of different face objects are not dramatic. This is especially true after the images are aligned and normalized. Recall that this assumption was used in the class-based methods reviewed above. Using a statistical representation of 3D heads, PCA was suggested as a tool for solving the parametric SFS problem [Atick et al. 1996]. An eigenhead approximation of a 3D head was obtained after training on about 300 laser-scanned range images of real human heads. The ill-posed SFS problem is thereby transformed into a parametric problem. The authors also demonstrated that such a representation helps to determine the light source. For a new face image, its 3D head can be approximated as a linear combination of eigenheads and then used to determine the light source. Using this complete 3D model, any virtual view of the face image can be generated. A major drawback of this approach is the assumption of constant albedo. This assumption does not hold for most real face images, even though it is the most common assumption used in SFS algorithms.

To address the issue of varying albedo, a direct 2D-to-2D approach was proposed, based on the assumption that front-view faces are symmetric and making use of a generic 3D model [Zhao et al. 1999]. Recall that a prototype image $I_p$ is a frontal view with $P_s = 0$, $Q_s = 0$. Substituting this into Equation (6), we have

    $I_p[x, y] = \frac{\rho[x, y]}{\sqrt{1 + p^2 + q^2}}.$    (16)

Comparing Equations (6) and (16), we obtain

    $I_p[x, y] = \frac{K}{2\,(1 + q\, Q_s)}\,(I[x, y] + \tilde{I}[x, y]).$    (17)

This simple equation relates the prototype image $I_p$ to $I[x, y] + \tilde{I}[x, y]$, which is already available. The two advantages of this approach are: (1) there is no need to recover the varying albedo $\rho[x, y]$; (2) there is no need to recover the full shape gradients $(p, q)$; $q$ can be approximated by a value derived from a generic 3D face shape. As part of the proposed automatic method, a model-based light source identification method was also proposed to improve existing source-from-shading algorithms. Figure 22 shows some comparisons between rendered images obtained using this method and using a
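Equations (16) and (17) can be checked on synthetic data. The sketch below assumes a Lambertian image model of the form of Equation (6), $I = \rho\,(1 + pP_s + qQ_s)/(\sqrt{1+p^2+q^2}\,K)$ with $K = \sqrt{1+P_s^2+Q_s^2}$, and takes $\tilde{I}$ to be the horizontally mirrored input image, which is what the frontal-symmetry assumption makes available; all names are illustrative:

```python
import numpy as np

def render(rho, p, q, P_s, Q_s):
    # Lambertian image under distant source (P_s, Q_s, 1); assumed form of Eq. (6).
    K = np.sqrt(1 + P_s**2 + Q_s**2)
    return rho * (1 + p * P_s + q * Q_s) / (np.sqrt(1 + p**2 + q**2) * K)

def prototype_from_symmetry(I, q, P_s, Q_s):
    """Eq. (17): recover the normally lit prototype from one side-lit image
    plus its mirror image I~, without recovering the albedo rho."""
    K = np.sqrt(1 + P_s**2 + Q_s**2)
    return K / (2.0 * (1.0 + q * Q_s)) * (I + np.fliplr(I))

# Toy symmetric "face": albedo rho and slope q mirror-symmetric in x,
# slope p antisymmetric, as the method assumes for frontal views.
rng = np.random.default_rng(2)
half = rng.random((16, 8))
rho = 0.3 + np.hstack([half, np.fliplr(half)])
qh = 0.3 * rng.random((16, 8))
q = np.hstack([qh, np.fliplr(qh)])
ph = 0.3 * rng.random((16, 8))
p = np.hstack([ph, -np.fliplr(ph)])

I = render(rho, p, q, P_s=0.4, Q_s=0.2)          # side-lit input
I_p = render(rho, p, q, P_s=0.0, Q_s=0.0)        # ground-truth prototype, Eq. (16)
I_hat = prototype_from_symmetry(I, q, P_s=0.4, Q_s=0.2)
print(np.allclose(I_hat, I_p))
```

In the actual method, $q$ is not known per pixel but is approximated from a generic 3D face shape, so the recovery is approximate rather than exact as in this synthetic check.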
Fig. 22. Image rendering comparison. The original images are shown
in the first column. The second column shows prototype images rendered
using the local SFS algorithm [Tsai and Shah 1994]. Prototype images
rendered using symmetric SFS are shown in the third column. Finally,
the fourth column shows real images that are close to the prototype
images [Zhao and Chellappa 2000].
local SFS algorithm [Tsai and Shah 1994]. Using the Yale and Weizmann databases (Table V), significant performance improvements were reported when the prototype images were used in a subspace LDA system in place of the original input images [Zhao et al. 1999]. In these experiments, the gallery set contained about 500 images from various databases; the probe set contained 60 images from the Yale database and 96 images from the Weizmann database.

Recently, a general method of approximating Lambertian reflectance using second-order spherical harmonics has been reported [Basri and Jacobs 2001]. Assuming Lambertian objects under distant, isotropic lighting, the authors were able to show that the set of all reflectance functions can be approximated using the surface spherical harmonic expansion. Specifically, they proved that using a second-order approximation (nine harmonics, i.e., a nine-dimensional (9D) space), the accuracy for any light function exceeds 97.97%. They then extended this analysis to image formation, which is a much more difficult problem due to possible occlusion, shape, and albedo variations. As indicated by the authors, the worst-case image approximation can be arbitrarily bad, but most cases are good. Using their method, an image can be decomposed into so-called harmonic images, which are produced when the object is illuminated by harmonic functions. The nine harmonic images of a face are plotted in Figure 23. An interesting comparison was made between the proposed method and the 3D linear illumination subspace methods [Hallinan 1994; Shashua 1994]: the 3D linear methods are just first-order harmonic approximations without the DC component.

Assuming precomputed object pose and known color albedo/texture, the authors reported an 86% correct recognition rate when applying this technique to the task of face recognition using a probe set of 10 people and a gallery set of 42 people.

6.2. The Pose Problem in Face Recognition

It is not surprising that the performance of face recognition systems drops significantly when large pose variations are present in the input images. This difficulty was documented in the FERET and FRVT test reports [Blackburn et al. 2001; Phillips et al. 2002b, 2003], and was suggested as a major research issue. When illumination variation is also present, the task of face recognition becomes even more difficult. Here we focus on the out-of-plane rotation problem, since in-plane rotation is a pure 2D problem and can be solved much more easily.
Fig. 24. The process of constructing the illumination cone. (a) The seven training im-
ages from Subset 1 (near frontal illumination) in frontal pose. (b) Images correspond-
ing to the columns of B. (c) Reconstruction up to a GBR transformation. On the left,
the surface was rendered with flat shading, that is, the albedo was assumed to be con-
stant across the surface, while on the right the surface was texture-mapped using the
first basis image of B shown in Figure 24(b). (d) Synthesized images from the illumina-
tion cone of the face under novel lighting conditions but fixed pose. Note the large vari-
ations in shading and shadowing as compared to the seven training images. (Courtesy of
A. Georghiades, P. Belhumeur, and D. Kriegman.)
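The 3D linear illumination subspace underlying the basis images in Figure 24 can be sketched on synthetic Lambertian data (no shadowing): fit a rank-3 basis to a few differently lit images of one person and match new images by their distance to that subspace. All names and data below are illustrative, not from the cited implementation:

```python
import numpy as np

def illumination_subspace(images, k=3):
    """Least-squares rank-k basis for one person's illumination subspace.
    images: (n_images, n_pixels), each row taken under a different (unknown)
    distant light source at a fixed pose. Rows of the returned B are an
    orthonormal basis (k=3 for the classic shadow-free Lambertian model)."""
    _, _, Vt = np.linalg.svd(images, full_matrices=False)
    return Vt[:k]

def subspace_distance(probe, B):
    # Distance from the probe image to span(B); used as the matching score.
    proj = B.T @ (B @ probe)
    return np.linalg.norm(probe - proj)

# Toy Lambertian person: X holds albedo-scaled normals, image = X @ s.
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))                     # n_pixels x 3
train = np.array([X @ rng.standard_normal(3) for _ in range(7)])
B = illumination_subspace(train, k=3)
probe_same = X @ np.array([0.3, -1.0, 0.5])           # novel lighting, same face
probe_other = rng.standard_normal(100)                # unrelated image
print(subspace_distance(probe_same, B) < subspace_distance(probe_other, B))
```

Under the ideal shadow-free model, every image of the person lies exactly in the 3D subspace, so the same-face distance is numerically zero; the illumination cone extends this construction to handle attached shadows and multiple sources.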
et al. 2000]. Some of the reviewed methods are very closely related; for example, methods 3, 4, and 5. Despite their popularity, these methods have two common drawbacks: (1) they need many example images to cover the range of possible views; (2) the illumination problem is not explicitly addressed, though in principle it can be handled if images captured under the same pose but different illumination conditions are available.

The popular eigenface approach [Turk and Pentland 1991] to face recognition has been extended to a view-based eigenface method in order to achieve pose-invariant recognition [Pentland et al. 1994]. This method explicitly codes the pose information by constructing an individual eigenface for each pose. More recently, a unified framework called the bilinear model was proposed in Freeman and Tenenbaum [2000] that can handle either pure pose variation or pure class variation. (A bilinear example is given in Equation (14) for the illumination problem.)

In Wiskott et al. [1997], a robust face recognition scheme based on EBGM was proposed. The authors assumed a planar surface patch at each feature point (landmark), and learned the transformations
a framework for benchmarking new algo- surveillance where the image size of the
rithms. The scripts can be modified to run face area is very small. On the other
different sets of images against the base- hand, the subspace LDA method [Zhao
line. For on-line resources related to face et al. 1999] works well for both large and
recognition, such as research papers and small images, for example, 96 84 or
databases, see Table V. 12 11.
We give below a concise summary of our Recognition of faces from a video se-
discussion, followed by our conclusions, quence (especially a surveillance video)
in the same order as the topics have ap- is still one of the most challenging prob-
peared in this paper: lems in face recognition because video is
Machine recognition of faces has of low quality and the images are small.
emerged as an active research area Often, the subjects of interest are not co-
spanning disciplines such as image operative, for example, not looking into
processing, pattern recognition, com- the camera. One particular difficulty
puter vision, and neural networks. in these applications is how to obtain
There are numerous applications of good-quality gallery images. Neverthe-
FRT to commercial systems such as less, video-based face recognition sys-
face verification-based ATM and access tems using multiple cues have demon-
control, as well as law enforcement strated good results in relatively con-
applications to video surveillance, etc. trolled environments.
Due to its user-friendly nature, face A crucial step in face recognition is the
recognition will remain a powerful tool evaluation and benchmarking of algo-
in spite of the existence of very reliable rithms. Two of the most important face
methods of biometric personal identifi- databases and their associated evalua-
cation such as fingerprint analysis and tion methods have been reviewed: the
iris scans. FERET, FRVT, and XM2VTS protocols.
Extensive research in psychophysics The availability of these evaluations has
and the neurosciences on human recog- had a significant impact on progress in
nition of faces is documented in the lit- the development of face recognition al-
erature. We do not feel that machine gorithms.
recognition of faces should strictly fol- Although many face recognition tech-
low what is known about human recog- niques have been proposed and have
nition of faces, but it is beneficial for en- shown significant promise, robust face
gineers who design face recognition sys- recognition is still difficult. There are
tems to be aware of the relevant find- at least three major challenges: illumi-
ings. On the other hand, machine sys- nation, pose, and recognition in outdoor
tems provide tools for conducting stud- imagery. A detailed review of methods
ies in psychology and neuroscience. proposed to solve these problems has
Numerous methods have been proposed been presented. Some basic problems re-
for face recognition based on image in- main to be solved; for example, pose dis-
tensities [Chellappa et al. 1995]. Many crimination is not difficult but accurate
of these methods have been success- pose estimation is hard. In addition to
fully applied to the task of face recog- these two problems, there are other even
nition, but they have advantages and more difficult ones, such as recognition
disadvantages. The choice of a method of a person from images acquired years
should be based on the specific require- apart.
ments of a given task. For example, The impressive face recognition capabil-
the EBGM-based method [Okada et al. ity of the human perception system has
1998] has very good performance, but one limitation: the number and types of
it requires an image size, for exam- faces that can be easily distinguished.
ple, 128 128, which severely restricts Machines, on the other hand, can
its possible application to video-based store and potentially recognize as many
people as necessary. Is it really possible BASRI, R. AND JACOBS, D. W. 2001. Lambertian ref-
that a machine can be built that mimics electances and linear subspaces. In Proceedings,
International Conference on Computer Vision.
the human perceptual system without Vol. II. 383390.
its limitations on number and types? BELHUMEUR, P. N., HESPANHA, J. P., AND KRIEGMAN, D.
J. 1997. Eigenfaces vs. Fisherfaces: Recogni-
To conclude our paper, we present a con- tion using class specific linear projection. IEEE
jecture about face recognition based on Trans. Patt. Anal. Mach. Intell. 19, 711720.
psychological studies and lessons learned BELHUMEUR, P. N. AND KRIEGMAN, D. J. 1997. What
from designing algorithms. We conjecture is the set of images of an object under all possible
that different mechanisms are involved in lighting conditions? In Proceedings, IEEE Con-
ference on Computer Vision and Pattern Recog-
human recognition of familiar and unfa- nition. 5258.
miliar faces. For example, it is possible BELL, A. J. AND SEJNOWSKI, T. J. 1995. An informa-
that 3D head models are constructed, by tion maximisation approach to blind separation
extensive training for familiar faces, but and blind deconvolution. Neural Computation 7,
for unfamiliar faces, multiview 2D images 11291159.
are stored. This implies that we have full BELL, A. J. AND SEJNOWSKI, T. J. 1997. The indepen-
probability density functions for familiar dent components of natural scenes are edge fil-
ters. Vis. Res. 37, 33273338.
faces, while for unfamiliar faces we only
BEVERIDGE, J. R., SHE, K., DRAPER, B. A., AND
have discriminant functions. GIVENS, G. H. 2001. A nonparametric statisi-
cal comparison of principal component and
linear discriminant subspaces for face recog-
REFERENCES
nition. In Proceedings, IEEE Conference on
ADINI, Y., MOSES, Y., AND ULLMAN, S. 1997. Face Computer Vision and Pattern Recognition.
recognition: The problem of compensating for (An updated version can be found online
changes in illumination direction. IEEE Trans. at http://www.cs.colostate.edu/evalfacerec/news.
Patt. Anal. Mach. Intell. 19, 721732. html.)
AKAMATSU, S., SASAKI, T., FUKAMACHI, H., MASUI, N., BEYMER, D. 1995. Vectorizing face images by in-
AND SUENAGA, Y. 1992. An accurate and robust terleaving shape and texture computations. MIT
face identification scheme. In Proceedings, In- AI Lab memo 1537. Massachusetts Institute of
ternational Conference on Pattern Recognition. Technology, Cambridge, MA.
217220. BEYMER, D. J. 1993. Face recognition under vary-
ATICK, J., GRIFFIN, P., AND REDLICH, N. 1996. Sta- ing pose. Tech. Rep. 1461. MIT AI Lab, Mas-
tistical approach to shape from shading: Re- sachusetts Institute of Technology, Cambridge,
construction of three-dimensional face surfaces MA.
from single two-dimensional images. Neural BEYMER, D. J. AND POGGIO, T. 1995. Face recognition
Computat. 8, 13211340. from one example view. In Proceedings, Interna-
AZARBAYEJANI, A., STARNER, T., HOROWITZ, B., AND tional Conference on Computer Vision. 500507.
PENTLAND, A. 1993. Visually controlled graph- BIEDERMAN, I. 1987. Recognition by components: A
ics. IEEE Trans. Patt. Anal. Mach. Intell. 15, theory of human image understanding. Psych.
602604. Rev. 94, 115147.
BACHMANN, T. 1991. Identification of spatially BIEDERMAN, I. AND KALOCSAI, P. 1998. Neural and
quantized tachistoscopic images of faces: How psychophysical analysis of object and face recog-
many pixels does it take to carry identity? Euro- nition. In Face Recognition: From Theory to Ap-
pean J. Cog. Psych. 3, 87103. plications, H. Wechsler, P. J. Phillips, V. Bruce, F.
BAILLY-BAILLIERE, E., BENGIO, S., BIMBOT, F., HAMOUZ, F. Soulie, and T. S. Huang, Eds. Springer-Verlag,
M., KITTLER, J., MARIETHOZ, J., MATAS, J., MESSER, Berlin, Germany, 325.
K., POPOVICI, V., POREE, F., RUIZ, B., AND THIRAN, BIGUN, J., DUC, B., SMERALDI, F., FISCHER, S., AND
J. P. 2003. The BANCA database and evalua- MAKAROV, A. 1998. Multi-modal person au-
tion protocol. In Proceedings of the International thentication. In Face Recognition: From The-
Conference on Audio- and Video-Based Biometric ory to Applications, H. Wechsler, P. J. Phillips,
Person Authentication. 625638. V. Bruce, F. F. Soulie, and T. S. Huang, Eds.
BARTLETT, J. C. AND SEARCY, J. 1993. Inversion and Springer-Verlag, Berlin, Germany, 2650.
configuration of faces. Cog. Psych. 25, 281316. BLACK, M., FLEET, D., AND YACOOB, Y. 1998. A
BARTLETT, M. S., LADES, H. M., AND SEJNOWSKI, T. Framework for modelling appearance change in
1998. Independent component representation image sequences. In Proceedings, International
for face recognition. In Proceedings, SPIE Sym- Conference on Computer Vision, 660667.
posium on Electronic Imaging: Science and Tech- BLACK, M. AND YACOOB, Y. 1995. Tracking and rec-
nology. 528539. ognizing facial expressions in image sequences
using local parametrized models of image mo- based active appearance models. In Proceedings,
tion. Tech. rep. CS-TR-3401. Center for Automa- International Conference on Automatic Face and
tion Research, Unversity of Maryland, College Gesture Recognition.
Park, MD. COOTES, T. F., EDWARDS, G. J., AND TAYLOR, C. J. 2001.
BLACKBURN, D., BONE, M., AND PHILLIPS, P. J. 2001. Active appearance models. IEEE Trans. Patt.
Face recognition vendor test 2000. Tech. rep. Anal. Mach. Intell. 23, 681685.
http://www.frvt.org. COX, I. J., GHOSN, J., AND YIANILOS, P. N. 1996.
BLANZ, V. AND VETTER, T. 1999. A Morphable model Feature-based face recognition using mixture-
for the synthesis of 3D faces. In Proceedings, distance. In Proceedings, IEEE Conference on
SIGGRAPH99, 187194. Computer Vision and Pattern Recognition. 209
BLANZ, V. AND VETTER, T. 2003. Face recognition 216.
based on fitting a 3D morphable model. IEEE CRAW, I. AND CAMERON, P. 1996. Face recognition by
Trans. Patt. Anal. Mach. Intell. 25, 10631074. computer. In Proceedings, British Machine Vi-
BLEDSOE, W. W. 1964. The model method in facial sion Conference. 489507.
recognition. Tech. rep. PRI:15, Panoramic re- DARWIN, C. 1972. The Expression of the Emotions
search Inc., Palo Alto, CA. in Man and Animals. John Murray, London, U.K.
BRAND, M. AND BHOTIKA, R. 2001. Flexible flow for DECARLO, D. AND METAXAS, D. 2000. Optical flow
3D nonrigid tracking and shape recovery. In Pro- constraints on deformable models with applica-
ceedings, IEEE Conference on Computer Vision tions to face tracking. Int. J. Comput. Vis. 38,
and Pattern Recognition. 99127.
BRENNAN, S. E. 1985. The caricature generator. DONATO, G., BARTLETT, M. S., HAGER, J. C., EKMAN, P.,
Leonardo, 18, 170178. AND SEJNOWSKI, T. J. 1999. Classifying facial
BRONSTEIN, A., BRONSTEIN, M., GORDON, E., AND KIMMEL, actions. IEEE Trans. Patt. Anal. Mach. Intell. 21,
R. 2003. 3D face recognition using geometric 974989.
invariants. In Proceedings, International Confer- EDWARDS, G. J., TAYLOR, C. J., AND COOTES, T. F.
ence on Audio- and Video-Based Person Authen- 1998. Learning to identify and track faces
tication. in image sequences. In Proceedings, Interna-
BRUCE, V. 1988. Recognizing faces, Lawrence Erl- tional Conference on Automatic Face and Gesture
baum Associates, London, U.K. Recognition.
BRUCE, V., BURTON, M., AND DENCH, N. 1994. Whats EKMAN, P. Ed., 1998. Charles Darwins The Ex-
distinctive about a distinctive face? Quart. J. pression of the Emotions in Man and Animals,
Exp. Psych. 47A, 119141. Third Edition, with Introduction, Afterwords
BRUCE, V., HANCOCK, P. J. B., AND BURTON, A. M. 1998. and Commentaries by Paul Ekman. Harper-
Human face perception and identification. In Collins/Oxford University Press, New York,
Face Recognition: From Theory to Applications, NY/London, U.K.
H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, ELLIS, H. D. 1986. Introduction to aspects of face
and T. S. Huang, Eds. Springer-Verlag, Berlin, processing: Ten questions in need of answers. In
Germany, 5172. Aspects of Face Processing, H. Ellis, M. Jeeves,
BRUNER, I. S. AND TAGIURI, R. 1954. The percep- F. Newcombe, and A. Young, Eds. Nijhoff, Dor-
drecht, The Netherlands, 313.
tion of people. In Handbook of Social Psychology,
Vol. 2, G. Lindzey, Ed., Addison-Wesley, Reading, ETEMAD, K. AND CHELLAPPA, R. 1997. Discriminant
MA, 634654. analysis for recognition of human face images.
J. Opt. Soc. Am. A 14, 17241733.
BUHMANN, J., LADES, M., AND MALSBURG, C. V. D. 1990.
Size and distortion invariant object recognition FISHER, R. A. 1938. The statistical utilization of
by hierarchical graph matching. In Proceedings, multiple measuremeents. Ann. Eugen. 8, 376
International Joint Conference on Neural Net- 386.
works. 411416. FREEMAN, W. T. AND TENENBAUM, J. B. 2000. Sepa-
CHELLAPPA, R., WILSON, C. L., AND SIROHEY, S. 1995. rating style and contents with bilinear models.
Human and machine recognition of faces: A sur- Neural Computat. 12, 12471283.
vey. Proc. IEEE, 83, 705740. FUKUNAGA, K. 1989. Statistical Pattern Recogni-
CHOUDHURY, T., CLARKSON, B., JEBARA, T., AND PENTLAND, tion, Academic Press, New York, NY.
A. 1999. Multimodal person recognition us- GALTON, F. 1888. Personal identification and de-
ing unconstrained audio and video. In Proceed- scription. Nature, (June 21), 173188.
ings, International Conference on Audio- and GAUTHIER, I., BEHRMANN, M., AND TARR, M. J. 1999.
Video-Based Person Authentication. 176181. Can face recognition really be dissociated from
COOTES, T., TAYLOR, C., COOPER, D., AND GRAHAM, J. object recognition? J. Cogn. Neurosci. 11, 349
1995. Active shape modelstheir training and 370.
application. Comput. Vis. Image Understand. 61, GAUTHIER, I. AND LOGOTHETIS, N. K. 2000. Is face
1823. recognition so unique after All? J. Cogn. Neu-
COOTES, T., WALKER, K., AND TAYLOR, C. 2000. View- ropsych. 17, 125142.
GEORGHIADES, A. S., BELHUMEUR, P. N., AND KRIEGMAN, HORN, B. K. P. AND BROOKS, M. J. 1989. Shape from
D. J. 1999. Illumination-based image synthe- Shading. MIT Press, Cambridge, MA.
sis: Creating novel images of human faces un- HUANG, J., HEISELE, B., AND BLANZ, V. 2003.
der differing pose and lighting. In Proceedings, Component-based face recognition with 3D mor-
Workshop on Multi-View Modeling and Analysis phable models. In Proceedings, International
of Visual Scenes, 4754. Conference on Audio- and Video-Based Person
GEORGHIADES, A. S., BELHUMEUR, P. N., AND KRIEGMAN, Authentication.
D. J. 2001. From few to many: Illumination ISARD, M. AND BLAKE, A. 1996. Contour tracking by
cone models for face recognition under variable stochastic propagation of conditional density. In
lighting and pose. IEEE Trans. Patt. Anal. Mach. Proceedings, European Conference on Computer
Intell. 23, 643660. Vision.
GEORGHIADES, A. S., KRIEGMAN, D. J., AND BELHUMEUR, JACOBS, D. W., BELHUMEUR, P. N., AND BASRI, R.
P. N. 1998. Illumination cones for recognition 1998. Comparing images under variable illu-
under variable lighting: Faces. In Proceedings, mination. In Proceedings, IEEE Conference on
IEEE Conference on Computer Vision and Pat- Computer Vision and Pattern Recognition. 610
tern Recognition, 5258. 617.
GINSBURG, A. G. 1978. Visual information process- JEBARA, T., RUSSEL, K., AND PENTLAND, A. 1998.
ing based on spatial filters constrained by bio- Mixture of eigenfeatures for real-time struc-
logical data. AMRL Tech. rep. 78129. ture from texture. Tech. rep. TR-440, MIT Me-
GONG, S., MCKENNA, S., AND PSARROU, A. 2000. Dy- dia Lab, Massachusetts Institute of Technology,
namic Vision: From Images to Face Recognition. Cambridge, MA.
World Scientific, Singapore. JOHNSTON, A., HILL, H., AND CARMAN, N. 1992. Rec-
GORDON, G. 1991. Face recognition based on depth ognizing faces: Effects of lighting direction, in-
maps and surface curvature. In SPIE Proceed- version and brightness reversal. Cognition 40,
ings, Vol. 1570: Geometric Methods in Computer 119.
Vision. SPIE Press, Bellingham, WA 234247. KALOCSAI, P. K., ZHAO, W., AND ELAGIN, E. 1998.
GU, L., LI, S. Z., AND ZHANG, H. J. 2001. Learning Face similarity space as perceived by humans
probabilistic distribution model for multiview and artificial systems. In Proceedings, Interna-
face dectection. In Proceedings, IEEE Conference tional Conference on Automatic Face and Gesture
on Computer Vision and Pattern Recognition. Recognition. 177180.
HAGER, G. D., AND BELHUMEUR, P. N. 1998. Efficient KANADE, T. 1973. Computer recognition of hu-
region tracking with parametri models of geom- man faces. Birkhauser, Basel, Switzerland, and
etry and illumination. IEEE Trans. Patt. Anal. Mach. Intell. 20, 115.
HALLINAN, P. W. 1991. Recognizing human eyes. In SPIE Proceedings, Vol. 1570: Geometric Methods in Computer Vision. 214–226.
HALLINAN, P. W. 1994. A low-dimensional representation of human faces for arbitrary lighting conditions. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 995–999.
HANCOCK, P., BRUCE, V., AND BURTON, M. 1998. A comparison of two computer-based face recognition systems with human perceptions of faces. Vis. Res. 38, 2277–2288.
HARMON, L. D. 1973. The recognition of faces. Sci. Am. 229, 71–82.
HEISELE, B., SERRE, T., PONTIL, M., AND POGGIO, T. 2001. Component-based face detection. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
HILL, H. AND BRUCE, V. 1996. Effects of lighting on matching facial surfaces. J. Exp. Psych.: Human Percept. Perform. 22, 986–1004.
HILL, H., SCHYNS, P. G., AND AKAMATSU, S. 1997. Information and viewpoint dependence in face recognition. Cognition 62, 201–222.
HJELMAS, E. AND LOW, B. K. 2001. Face detection: A survey. Comput. Vis. Image Understand. 83, 236–274.
Stuttgart, Germany.
KELLY, M. D. 1970. Visual identification of people by computer. Tech. rep. AI-130, Stanford AI Project, Stanford, CA.
KIRBY, M. AND SIROVICH, L. 1990. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Trans. Patt. Anal. Mach. Intell. 12.
KLASEN, L. AND LI, H. 1998. Faceless identification. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 513–527.
KNIGHT, B. AND JOHNSTON, A. 1997. The role of movement in face recognition. Vis. Cog. 4, 265–274.
KRUGER, N., POTZSCH, M., AND MALSBURG, C. V. D. 1997. Determination of face position and pose with a learned representation based on labelled graphs. Image Vis. Comput. 15, 665–673.
KUNG, S. Y. AND TAUR, J. S. 1995. Decision-based neural networks with signal/image classification applications. IEEE Trans. Neural Netw. 6, 170–181.
LADES, M., VORBRUGGEN, J., BUHMANN, J., LANGE, J., MALSBURG, C. V. D., WURTZ, R., AND KONEN, W. 1993. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Comput. 42, 300–311.
LANITIS, A., TAYLOR, C. J., AND COOTES, T. F. 1995. Automatic face identification system using flexible appearance models. Image Vis. Comput. 13, 393–401.
LAWRENCE, S., GILES, C. L., TSOI, A. C., AND BACK, A. D. 1997. Face recognition: A convolutional neural-network approach. IEEE Trans. Neural Netw. 8, 98–113.
LI, B. AND CHELLAPPA, R. 2001. Face verification through tracking facial features. J. Opt. Soc. Am. 18.
LI, S. Z. AND LU, J. 1999. Face recognition using the nearest feature line method. IEEE Trans. Neural Netw. 10, 439–443.
LI, Y., GONG, S., AND LIDDELL, H. 2001a. Constructing facial identity surfaces in a nonlinear discriminating space. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
LI, Y., GONG, S., AND LIDDELL, H. 2001b. Modelling face dynamics across view and over time. In Proceedings, International Conference on Computer Vision.
LIN, S. H., KUNG, S. Y., AND LIN, L. J. 1997. Face recognition/detection by probabilistic decision-based neural network. IEEE Trans. Neural Netw. 8, 114–132.
LIU, C. AND WECHSLER, H. 2000a. Evolutionary pursuit and its application to face recognition. IEEE Trans. Patt. Anal. Mach. Intell. 22, 570–582.
LIU, C. AND WECHSLER, H. 2000b. Robust coding scheme for indexing and retrieval from large face databases. IEEE Trans. Image Process. 9, 132–137.
LIU, C. AND WECHSLER, H. 2001. A shape- and texture-based enhanced Fisher classifier for face recognition. IEEE Trans. Image Process. 10, 598–608.
LIU, J. AND CHEN, R. 1998. Sequential Monte Carlo methods for dynamic systems. J. Am. Stat. Assoc. 93, 1031–1041.
MANJUNATH, B. S., CHELLAPPA, R., AND MALSBURG, C. V. D. 1992. A feature based approach to face recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 373–378.
MARR, D. 1982. Vision. W. H. Freeman, San Francisco, CA.
MARTINEZ, A. 2002. Recognizing imprecisely localized, partially occluded and expression variant faces from a single sample per class. IEEE Trans. Patt. Anal. Mach. Intell. 24, 748–763.
MARTINEZ, A. AND KAK, A. C. 2001. PCA versus LDA. IEEE Trans. Patt. Anal. Mach. Intell. 23, 228–233.
MAURER, T. AND MALSBURG, C. V. D. 1996a. Single-view based recognition of faces rotated in depth. In Proceedings, International Workshop on Automatic Face and Gesture Recognition. 176–181.
MAURER, T. AND MALSBURG, C. V. D. 1996b. Tracking and learning graphs and pose on image sequences of faces. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 176–181.
MCKENNA, S. J. AND GONG, S. 1997. Non-intrusive person authentication for access control by visual tracking and face recognition. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 177–183.
MCKENNA, S. AND GONG, S. 1998. Recognising moving faces. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 578–588.
MATAS, J. ET AL. 2000. Comparison of face verification results on the XM2VTS database. In Proceedings, International Conference on Pattern Recognition, Vol. 4, 858–863.
MESSER, K., MATAS, J., KITTLER, J., LUETTIN, J., AND MAITRE, G. 1999. XM2VTSDB: The extended M2VTS database. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 72–77.
MIKA, S., RATSCH, G., WESTON, J., SCHOLKOPF, B., AND MULLER, K.-R. 1999. Fisher discriminant analysis with kernels. In Proceedings, IEEE Workshop on Neural Networks for Signal Processing.
MOGHADDAM, B., NASTAR, C., AND PENTLAND, A. 1996. A Bayesian similarity measure for direct image matching. In Proceedings, International Conference on Pattern Recognition.
MOGHADDAM, B. AND PENTLAND, A. 1997. Probabilistic visual learning for object representation. IEEE Trans. Patt. Anal. Mach. Intell. 19, 696–710.
MOON, H. AND PHILLIPS, P. J. 2001. Computational and performance aspects of PCA-based face recognition algorithms. Perception 30, 301–321.
MURASE, H. AND NAYAR, S. 1995. Visual learning and recognition of 3D objects from appearances. Int. J. Comput. Vis. 14, 5–25.
NEFIAN, A. V. AND HAYES III, M. H. 1998. Hidden Markov models for face recognition. In Proceedings, International Conference on Acoustics, Speech and Signal Processing. 2721–2724.
OKADA, K., STEFFANS, J., MAURER, T., HONG, H., ELAGIN, E., NEVEN, H., AND MALSBURG, C. V. D. 1998. The Bochum/USC face recognition system and how it fared in the FERET Phase III test. In Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. F. Soulie, and T. S. Huang, Eds. Springer-Verlag, Berlin, Germany, 186–205.
O'TOOLE, A. J., ROARK, D., AND ABDI, H. 2002. Recognizing moving faces: A psychological and neural synthesis. Trends Cogn. Sci. 6, 261–266.
PANTIC, M. AND ROTHKRANTZ, L. J. M. 2000. Automatic analysis of facial expressions: The state of the art. IEEE Trans. Patt. Anal. Mach. Intell. 22, 1424–1446.
PENEV, P. AND SIROVICH, L. 2000. The global dimensionality of face space. In Proceedings, International Conference on Automatic Face and Gesture Recognition.
PENEV, P. AND ATICK, J. 1996. Local feature analysis: A general statistical theory for object representation. Netw.: Computat. Neural Syst. 7, 477–500.
PENTLAND, A., MOGHADDAM, B., AND STARNER, T. 1994. View-based and modular eigenspaces for face recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
PERKINS, D. 1975. A definition of caricature and recognition. Stud. Anthro. Vis. Commun. 2, 1–24.
PHILLIPS, P. J., GROTHER, P. J., MICHEALS, R. J., BLACKBURN, D. M., TABASSI, E., AND BONE, J. M. 2003. Face recognition vendor test 2002: Evaluation report. NISTIR 6965. Available online at http://www.frvt.org.
PHILLIPS, P. J. 1998. Support vector machines applied to face recognition. Adv. Neural Inform. Process. Syst. 11, 803–809.
PHILLIPS, P. J., MCCABE, R. M., AND CHELLAPPA, R. 1998. Biometric image processing and recognition. In Proceedings, European Signal Processing Conference.
PHILLIPS, P. J., MOON, H., RIZVI, S., AND RAUSS, P. 2000. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Patt. Anal. Mach. Intell. 22.
PHILLIPS, P. J., WECHSLER, H., HUANG, J., AND RAUSS, P. 1998b. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis. Comput. 16, 295–306.
PIGEON, S. AND VANDENDORPE, L. 1999. The M2VTS multimodal face database (Release 1.00). In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 403–409.
RIKLIN-RAVIV, T. AND SHASHUA, A. 1999. The quotient image: Class based re-rendering and recognition with varying illuminations. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 566–571.
RIZVI, S. A., PHILLIPS, P. J., AND MOON, H. 1998. A verification protocol and statistical performance analysis for face recognition algorithms. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 833–838.
ROWLEY, H. A., BALUJA, S., AND KANADE, T. 1998. Neural network based face detection. IEEE Trans. Patt. Anal. Mach. Intell. 20.
CHOUDHURY, A. K. R. AND CHELLAPPA, R. 2003. Face reconstruction from monocular video using uncertainty analysis and a generic model. Comput. Vis. Image Understand. 91, 188–213.
RUDERMAN, D. L. 1994. The statistics of natural images. Netw.: Comput. Neural Syst. 5, 598–605.
SALI, E. AND ULLMAN, S. 1998. Recognizing novel 3-D objects under new illumination and viewing position using a small number of example views or even a single view. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 153–161.
SAMAL, A. AND IYENGAR, P. 1992. Automatic recognition and analysis of human faces and facial expressions: A survey. Patt. Recog. 25, 65–77.
SAMARIA, F. 1994. Face recognition using hidden Markov models. Ph.D. dissertation. University of Cambridge, Cambridge, U.K.
SAMARIA, F. AND YOUNG, S. 1994. HMM based architecture for face identification. Image Vis. Comput. 12, 537–583.
SCHNEIDERMAN, H. AND KANADE, T. 2000. Probabilistic modelling of local appearance and spatial relationships for object recognition. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition. 746–751.
SERGENT, J. 1986. Microgenesis of face perception. In Aspects of Face Processing, H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds. Nijhoff, Dordrecht, The Netherlands.
SHASHUA, A. 1994. Geometry and photometry in 3D visual recognition. Ph.D. dissertation. Massachusetts Institute of Technology, Cambridge, MA.
SHEPHERD, J. W., DAVIES, G. M., AND ELLIS, H. D. 1981. Studies of cue saliency. In Perceiving and Remembering Faces, G. M. Davies, H. D. Ellis, and J. W. Shepherd, Eds. Academic Press, London, U.K.
SHIO, A. AND SKLANSKY, J. 1991. Segmentation of people in motion. In Proceedings, IEEE Workshop on Visual Motion. 325–332.
SIROVICH, L. AND KIRBY, M. 1987. Low-dimensional procedure for the characterization of human face. J. Opt. Soc. Am. 4, 519–524.
STEFFENS, J., ELAGIN, E., AND NEVEN, H. 1998. PersonSpotter: Fast and robust system for human detection, tracking and recognition. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 516–521.
STROM, J., JEBARA, T., BASU, S., AND PENTLAND, A. 1999. Real time tracking and modeling of faces: An EKF-based analysis by synthesis approach. Tech. rep. TR-506, MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA.
SUNG, K. AND POGGIO, T. 1997. Example-based learning for view-based human face detection. IEEE Trans. Patt. Anal. Mach. Intell. 20, 39–51.
SWETS, D. L. AND WENG, J. 1996a. Discriminant analysis and eigenspace partition tree for face and object recognition from views. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 192–197.
SWETS, D. L. AND WENG, J. 1996b. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Patt. Anal. Mach. Intell. 18, 831–836.
TARR, M. J. AND BULTHOFF, H. H. 1995. Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993). J. Exp. Psych.: Hum. Percep. Perf. 21, 71–86.
TERZOPOULOS, D. AND WATERS, K. 1993. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Trans. Patt. Anal. Mach. Intell. 15, 569–579.
THOMPSON, P. 1980. Margaret Thatcher: A new illusion. Perception 9, 483–484.
TRIGGS, B., MCLAUCHLAN, P., HARTLEY, R., AND FITZGIBBON, A. 2000. Bundle adjustment: A modern synthesis. In Vision Algorithms: Theory and Practice, Springer-Verlag, Berlin, Germany.
TSAI, P. S. AND SHAH, M. 1994. Shape from shading using linear approximation. Image Vis. Comput. 12, 487–498.
TURK, M. AND PENTLAND, A. 1991. Eigenfaces for recognition. J. Cogn. Neurosci. 3, 72–86.
ULLMAN, S. AND BASRI, R. 1991. Recognition by linear combinations of models. IEEE Trans. Patt. Anal. Mach. Intell. 13, 992–1006.
VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY.
VETTER, T. AND POGGIO, T. 1997. Linear object classes and image synthesis from a single example image. IEEE Trans. Patt. Anal. Mach. Intell. 19, 733–742.
VIOLA, P. AND JONES, M. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition.
WECHSLER, H., KAKKAD, V., HUANG, J., GUTTA, S., AND CHEN, V. 1997. Automatic video-based person authentication using the RBF network. In Proceedings, International Conference on Audio- and Video-Based Person Authentication. 85–92.
WILDER, J. 1994. Face recognition using transform coding of gray scale projection and the neural tree network. In Artificial Neural Networks with Applications in Speech and Vision, R. J. Mammone, Ed. Chapman Hall, New York, NY, 520–536.
WISKOTT, L., FELLOUS, J.-M., AND VON DER MALSBURG, C. 1997. Face recognition by elastic bunch graph matching. IEEE Trans. Patt. Anal. Mach. Intell. 19, 775–779.
YANG, M. H., KRIEGMAN, D., AND AHUJA, N. 2002. Detecting faces in images: A survey. IEEE Trans. Patt. Anal. Mach. Intell. 24, 34–58.
YIN, R. K. 1969. Looking at upside-down faces. J. Exp. Psych. 81, 141–151.
YUILLE, A. L., COHEN, D. S., AND HALLINAN, P. W. 1992. Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8, 99–112.
YUILLE, A. AND HALLINAN, P. 1992. Deformable templates. In Active Vision, A. Blake and A. Yuille, Eds., Cambridge, MA, 21–38.
ZHAO, W. 1999. Robust image based 3D face recognition. Ph.D. dissertation. University of Maryland, College Park, MD.
ZHAO, W. AND CHELLAPPA, R. 2000a. Illumination-insensitive face recognition using symmetric shape-from-shading. In Proceedings, Conference on Computer Vision and Pattern Recognition. 286–293.
ZHAO, W. AND CHELLAPPA, R. 2000b. SFS based view synthesis for robust face recognition. In Proceedings, International Conference on Automatic Face and Gesture Recognition.
ZHAO, W., CHELLAPPA, R., AND KRISHNASWAMY, A. 1998. Discriminant analysis of principal components for face recognition. In Proceedings, International Conference on Automatic Face and Gesture Recognition. 336–341.
ZHAO, W., CHELLAPPA, R., AND PHILLIPS, P. J. 1999. Subspace linear discriminant analysis for face recognition. Tech. rep. CAR-TR-914, Center for Automation Research, University of Maryland, College Park, MD.
ZHOU, S., KRUEGER, V., AND CHELLAPPA, R. 2003. Probabilistic recognition of human faces from video. Comput. Vis. Image Understand. 91, 214–245.