Eye Detection Using Optimal Wavelet Packets
and Radial Basis Functions (RBFs)
Jeffrey Huang and Harry Wechsler
Department of Computer Science
George Mason University
Fairfax, VA 22030
{rhuang, wechsler}@cs.gmu.edu
http://cs.gmu.edu/~wechsler
Abstract: The eyes are important facial landmarks, both for image normalization due to their relatively
constant interocular distance, and for post processing due to their anchoring of model-based schemes. This
paper introduces a novel approach for the eye detection task using optimal wavelet packets for eye
representation and Radial Basis Functions (RBFs) for subsequent classification (’labeling’) of facial areas
as eye vs non-eye regions. Entropy minimization is the driving force behind the derivation of optimal
wavelet packets. It decreases the degree of data dispersion and it thus facilitates clustering (’prototyping’)
and capturing the most significant characteristics of the underlying (eye regions) data. Entropy
minimization is thus functionally compatible with the 1st operational stage of the RBF classifier, that of
clustering, and this explains the improved RBF performance on eye detection. Our experiments on the
eye detection task prove the merit of this approach as they show that eye images compressed using
optimal wavelet packets lead to improved and robust performance of the RBF classifier compared to the
case where original raw images are used by the RBF classifier.
Key Words: eye detection, face recognition, neural network performance, pattern classification, RBFs
(Radial Basis Functions), optimal sampling, wavelet packets.
International Journal of Pattern Recognition and Artificial Intelligence, Vol 13 No 7, 1999
1. Introduction
There are several related (face recognition) subproblems: (i) detection of a pattern as a face (in the
crowd) and its pose, (ii) detection of facial landmarks, (iii) face recognition - identification and/or
verification, and (iv) analysis of facial expressions (Samal et al., 1992). Face recognition starts with the
detection of face patterns in sometimes cluttered scenes, proceeds by normalizing the face images to
account for geometrical and illumination changes, possibly using information about the location and
appearance of facial landmarks, then identifies the faces using appropriate classification algorithms, and
post processes the results using model-based schemes, facial landmarks and logistic feedback (Chellappa
et al, 1995). The eyes are important facial landmarks, both for normalization due to their relatively
constant interocular distance, and for post processing due to their anchoring of model-based schemes.
Eye detection is approached within the framework of predictive learning, where the goal is to develop a
computational relationship for estimating the values for the output variables given only the values of the
input variables. Different taxonomies are available for predictive learning, among them one that
distinguishes regression and density estimation from classification, corresponding to the output variables
being continuous or discrete / categorical, respectively, even though any classification problem can be
reduced to a regression problem. Predictive learning can be thought of as a relationship y = f(x) + error, where the
error is due both to (measurement) noise and possibly to ’unobserved’ input variables. The main issues
one has to address are related to prediction (generalization) ability, data and dimensionality reduction
(complexity), explanation / interpretation capability, and possibly to biological plausibility. In the
framework of predictive learning, estimating (’learning’) a model (’classifier’) from finite data requires
specification of three concepts: a set of approximating functions (i.e., a class of models: dictionary), an
inductive principle and an optimization (parameter estimation) procedure. The notion of inductive
principle is fundamental to all learning methods. Essentially, an inductive principle provides a general
prescription for what to do with the training data in order to obtain (learn) the model. In contrast, a
learning method is a constructive implementation of an inductive principle (i.e., an optimization or
parameter estimation procedure) for a given set of approximating functions in which the model is sought,
such as feed forward nets with sigmoid units, radial basis function networks, etc. (Cherkassky and Mulier,
1998). There is just a handful of known inductive principles (Empirical Risk Minimization (ERM),
Regularization, Structural Risk Minimization (SRM), Bayesian Inference, Minimum Description Length),
but there are infinitely many learning methods based on these principles. It is the prediction risk, the
expected performance of an estimator (’classifier’) for new (future) samples, which determines to what
degree adaptation has been successful so far. Based on the above discussion our approach for eye
detection involves wavelets as the set of approximating functions (’dictionary’), ERM as the inductive
principle, and RBFs as the learning and optimization method implementing the inductive principle. Our
approach also displays adaptiveness with respect to the dictionary aspect as it seeks an optimal subset of
wavelet kernels to approximate eye regions. This paper introduces an approach for the eye detection task
using optimal wavelet packets for compressed eye representations and Radial Basis Functions (RBFs) for
classification (’labeling’) of facial areas as eye vs. non-eye regions.
2. Signal Representation and Optimal Sampling
One of the goals for signal processing (SP) is to restore - recover - reconstruct signals using appropriate
sampling procedures. Signals, corresponding to sensory inputs, provide the information link connecting
intelligent systems to their environment. Signals have to be properly sampled and processed to facilitate
optimal behavior and sensory-motor control. Sampling criteria are determined by examining the spectral
contents of the signal and by setting the sampling rate to match the rate of change observed in the given
signal. Intuitively, fast changing signals contain more information and thus should be sampled more
often. If sampling fails to keep track of the fast changing signals, information (’detail’) is lost, and signal
reconstruction is beset by interference (’aliasing’) effects. To avoid aliasing effects the Shannon sampling
theorem sets the minimum (Nyquist) rate for sampling a bandwidth-limited continuous signal so it can be
uniquely recovered (’restored’) from its samples.
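The aliasing effect described above can be sketched numerically; the frequencies and duration below are illustrative:

```python
import numpy as np

# Illustrative aliasing sketch: a 30 Hz sine sampled below its Nyquist
# rate of 60 Hz. At a 40 Hz sampling rate the samples coincide exactly
# with those of the aliased frequency 30 - 40 = -10 Hz.
f_signal = 30.0                     # signal frequency (Hz)
f_samp = 40.0                       # sampling rate (Hz), below Nyquist

t = np.arange(0.0, 1.0, 1.0 / f_samp)
x = np.sin(2 * np.pi * f_signal * t)
alias = np.sin(2 * np.pi * (f_signal - f_samp) * t)

# The detail of the 30 Hz oscillation is lost; reconstruction from these
# samples would recover the 10 Hz alias instead of the true signal.
print(np.allclose(x, alias, atol=1e-9))   # True
```

Sampling at or above 60 Hz would make the two sample sequences distinguishable again.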
Approximation and estimation bounds for artificial networks, when used as universal approximators,
have been derived by Barron (1991) using complexity regularization and statistical risk. Barron has
specifically shown that the mean squared error between the estimated network and a target function f is
bound by
O( C_f^2 / n ) + O( (nd / N) log N )   (1)
where n is the number of nodes, d is the input dimension of the function, N is the number of training
observations, and C_f is the first absolute moment of the Fourier magnitude distribution of f. The two
contributions to the total risk are the approximation error and the estimation error. The approximation
error (’bias’) refers to the distance between the target function and the closest neural network function of a
given architecture, while the estimation error (’variance’) refers to the distance between this ideal network
function and an estimated network function. For the case n ~ C_f (N / (d log N))^(1/2) the optimized order of the bound
on the mean integrated squared error is O( C_f ((d/N) log N)^(1/2) ). The approximation bounds thus estimate the
performance of the network in terms of the spectral characteristics of the data being analyzed. The
preceding result reconfirms known concepts from signal analysis and information theory referred to
earlier, and establishes strong links between optimal sampling and neural network performance. As an
example, Malinowski and Zurada (1994) developed a new approach to the problem of n-dimensional
function approximation using two-layer neural networks, with the Nyquist rate used to determine the optimum
number of learning patterns. Choosing the smallest but still sufficient set of training vectors resulted in a
reduced number of hidden neurons and learning time, and it confirmed the dependence of network
performance on proper pattern sampling.
The next fundamental concept bearing on optimal signal recovery comes from information theory and it is
due to Gabor (1946). The now famous concept, known as the uncertainty principle, revolves around the
simultaneous resolution of information in (2D) spatial and spectral terms, and it is closely related to the
choice of optimal lattices of receptive fields (kernel bases) for representing (’decomposing’) some
underlying signal. As recounted by Daugman (1990), Gabor pointed out that there exists a "quantum
principle" for information, which he illustrated through the construction of a spectrogram-like information
diagram. The information diagram, a plane whose axes correspond to time (or space) and frequency,
must necessarily be grainy, or quantized, in the sense that it is not possible for any signal or filter (and
hence any carrier or detector of information) to occupy less than a certain minimal area in this plane. This
minimal or quantal area, reflects the inevitable trade-off between time (space) resolution ∆t (∆s) and
frequency resolution ∆ω, and equals the lower bound on their product. Gabor further noted that
Gaussian-modulated complex exponentials offer the optimal way to encode signals (of arbitrary spectral
content) or to represent them, if one wished the code (’basis functions’) primitives to have both a well-
defined epoch of occurrence in time (space) and a well-defined spectral identity. The code primitives are
also referred to as kernels or RFs (receptive fields) in analogy to biological vision. The conclusion to be
drawn from the above discussion is better understood and visualized if one were to associate time (space)
occurrence with localization, while frequency change is associated with the speed at which a signal would
change. One thus cannot achieve at the same time both optimal signal localization and optimal tracking
of details related to the changes taking place in spatiotemporal signal patterns. Note that not all signal
changes are intrinsic to the signal being analyzed. As an example, both rapid signal changes (’high
frequency’) corresponding to noise and slow signal changes (close to dc) corresponding to illumination
changes, should be discounted. The Human Visual System (HVS) has adapted so that it implements a strategy
as outlined above by properly weighting the frequency contents through the Modulation Transfer
Function (MTF).
3. Wavelet Representation
The wavelet basis functions (Mallat, 1989), a self-similar and spatially localized code, are spatial
frequency/orientation tuned kernels, and provide one possible tessellation of the conjoint spatial/spectral
signal domain considered above. The corresponding hierarchical wavelet representation (pyramid) is obtained as the
result of orientation-tuned decompositions at a dyadic (powers of two) sequence of scales. The 2D
wavelet representation W of the function f(x, y) is
W(a_x, a_y, s_x, s_y) = ∫∫ f(x, y) (a_x a_y)^(-1/2) Ψ*[ (x − s_x)(a_x)^(-1), (y − s_y)(a_y)^(-1) ] dx dy   (2)
where Ψ(x, y) is the "mother" wavelet, a_x and a_y are scale parameters, and s_x and s_y are shift parameters.
The wavelet representation provides for multi-resolution analysis (MRA) through the orthogonal
decomposition of a function along basis functions consisting of appropriate translations and dilations of
the mother wavelet function. Continuous wavelets, defined using a pair of functions φ (scaling function)
and ψ (’mother’ wavelet function), satisfy
φ_{a,b}(x) = |a|^(1/2) φ(ax − b)   (3)
ψ_{a,b}(x) = |a|^(1/2) ψ(ax − b)   (4)
where a, b are real numbers, and φ_{a,b} and ψ_{a,b} correspond to the scaling and mother wavelet functions
dilated by a and translated by b. The discrete wavelet transform (DWT) is obtained for the choice a =
a_0^(-m), b = n b_0, where m, n are integers, with the scaling and mother wavelet functions becoming
φ_{m,n}(x) = |a_0|^(-m/2) φ(a_0^(-m) x − n b_0)   (5)
ψ_{m,n}(x) = |a_0|^(-m/2) ψ(a_0^(-m) x − n b_0)   (6)
For the choice a_0 = 2, b_0 = 1 one obtains
φ_{m,n}(x) = 2^(-m/2) φ(2^(-m) x − n)   (7)
ψ_{m,n}(x) = 2^(-m/2) ψ(2^(-m) x − n)   (8)
The dilation equation, relating the mother wavelet to the scaling function, is:
ψ(x) = √2 Σ_k h_1(k) φ(2x − k), where h_1(k) = (−1)^k h_0(1 − k)   (9)
The wavelet coefficients and the corresponding wavelet decomposition are given as
c_{m,n} = ∫_(−∞)^(+∞) f(x) ψ_{m,n}(x) dx   (10)
f(x) = Σ_{m,n} c_{m,n} ψ_{m,n}(x)   (11)
Daubechies (1998) has shown how one can derive the corresponding low (’h0’) and high (’h1’) pass filters
for designing appropriate families of scaling and mother wavelet functions. She obtained a complete
parameterization of all finite-length filters h0(k) that satisfy the following linear and quadratic conditions:
Σ_n h0(n) = √2   (12)
and
Σ_n h0(n) h0(n + 2l) = δ(l)   (13)
Defining h1(k) = (−1)^k h0(N − 1 − k), with the filter length N even, i.e. N = 2K, Daubechies writes
H_0(z) = (1 + z^(−1))^K P(z), and parameterizes the autocorrelation sequence P(z) for various values of K.
That P(z) is bounded on the unit circle by 2^(K−1) is sufficient for the wavelet basis to be orthonormal.
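As a sanity check, conditions (12) and (13) and the h1 construction can be verified numerically for the Daubechies filter of order two; the closed-form coefficients below are the standard D4 values, and the variable names are illustrative:

```python
import numpy as np

# Daubechies order-2 (D4) low-pass filter in closed form; the rounded
# values tabulated later in the paper follow from these expressions.
s3 = np.sqrt(3.0)
h0 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))

# Linear condition (12): sum_n h0(n) = sqrt(2)
print(np.isclose(h0.sum(), np.sqrt(2.0)))            # True

# Quadratic condition (13): sum_n h0(n) h0(n + 2l) = delta(l)
for l in (0, 1):
    corr = sum(h0[n] * h0[n + 2 * l] for n in range(len(h0) - 2 * l))
    print(np.isclose(corr, 1.0 if l == 0 else 0.0))  # True, True

# High-pass companion via h1(k) = (-1)^k h0(N - 1 - k), with N = 4
h1 = np.array([(-1) ** k * h0[3 - k] for k in range(4)])
print(np.round(h1, 4))   # approx. [-0.1294, -0.2241, 0.8365, -0.4830]
```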
Using the sequences h0 and h1, one then computes the Discrete Wavelet Transform (DWT) using the
structure shown in Fig. 1. Mallat (1989) has shown that for any orthonormal wavelet basis, the sequences
of a two-channel filter bank, h and g, can be utilized to compute the DWT with perfect reconstruction.
Fig. 1 also shows the implementation of Eqns. (14) and (15) derived from Eqn. (8):
<x, φ_{i,j,k}> = Σ_n h_i(n − 2k) <x, φ_{0,j+1,n}>   (14)
and
<x, φ_{0,j+1,k}> = Σ_i Σ_n h_i(k − 2n) <x, φ_{i,j,n}>   (15)
where i ∈ {0, 1}.
Figure 1: Computation of DWT Using a Filter Bank
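The analysis/synthesis structure of Fig. 1 can be sketched for one decomposition level; the Haar QMF pair is used here purely for brevity, and the function names are illustrative:

```python
import numpy as np

# One level of the two-channel filter bank of Fig. 1, using the Haar
# QMF pair. For this length-2 filter, the analysis step reduces to an
# inner product of each filter with successive pairs of samples.
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)    # analysis low pass
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)   # analysis high pass

def analysis(x):
    # c_i[k] = sum_n h_i(n - 2k) x[n]  (filter, then downsample by 2)
    low = np.array([h0 @ x[2 * k:2 * k + 2] for k in range(len(x) // 2)])
    high = np.array([h1 @ x[2 * k:2 * k + 2] for k in range(len(x) // 2)])
    return low, high

def synthesis(low, high):
    # Upsample by 2 and filter with the synthesis pair.
    x = np.zeros(2 * len(low))
    for k, (a, d) in enumerate(zip(low, high)):
        x[2 * k:2 * k + 2] += a * h0 + d * h1
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
low, high = analysis(x)
print(np.allclose(synthesis(low, high), x))   # True: perfect reconstruction
```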
Using polyphase-domain analysis, perfect reconstruction filter banks with finite impulse response (FIR)
must satisfy a necessary and sufficient condition for aliasing cancellation (anti-aliasing):
G_0(z)H_0(−z) + G_1(z)H_1(−z) = 0   (16)
where H_0, H_1 are the analysis filters and G_0, G_1 are the synthesis filters used in the forward and inverse
polyphase transformations. To cancel aliasing in the above two-channel filter bank, one can design
Quadrature Mirror Filters (QMF) that satisfy the constraints:
H_1(z) = H_0(−z),   (17)
G_0(z) = H_0(z),   (18)
G_1(z) = −H_1(z) = −H_0(−z).   (19)
Substituting the above into Eqn. (16) leads to H_0(z)H_0(−z) − H_0(−z)H_0(z) = 0, i.e. aliasing is cancelled. For
perfect reconstruction, the filter bank must also satisfy:
G_0(z)H_0(z) + G_1(z)H_1(z) = 2z^(−l)   (20)
Using the QMF constraints, Eqn. (20) becomes
H_0^2(z) − H_0^2(−z) = 2z^(−l),   (21)
It means that on the unit circle H_0(−z) = H_0(e^(j(ω+π))) is the mirror image of H_0(z), and this relation
explains the name QMF. The QMF design is a useful way to implement PR filter banks for multirate
signal analysis and links directly to the multi-resolution analysis (MRA) supported by wavelet theory.
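The perfect reconstruction condition (21) can be checked numerically by multiplying filter polynomials; the sketch below uses the Haar low-pass filter H_0(z) = (1 + z^(−1))/√2, chosen purely for illustration:

```python
import numpy as np

# Check the PR condition H0^2(z) - H0^2(-z) = 2z^-l for the Haar filter;
# coefficient arrays are ordered by powers z^0, z^-1, z^-2, ...
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)       # H0(z) = (1 + z^-1)/sqrt(2)

sq = np.convolve(h0, h0)                       # H0^2(z)  -> [0.5, 1.0, 0.5]
h0_neg = h0 * np.array([1.0, -1.0])            # H0(-z): negate odd powers
sq_neg = np.convolve(h0_neg, h0_neg)           # H0^2(-z) -> [0.5, -1.0, 0.5]

diff = sq - sq_neg                             # -> [0, 2, 0], i.e. 2z^-1
print(diff)                                    # [0. 2. 0.]
```

Here the condition holds with l = 1, the expected unit delay for a length-2 filter.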
The self-similar Gabor basis functions, a special case of non-orthogonal wavelets, correspond to
sinusoids modulated by a Gaussian and can be easily tuned to any bandwidth (scale) and orientation. Note
that the Gabor wavelets are redundant due to their non-orthogonality; they lead to an overcomplete
dictionary of basis functions and are thus expected to provide better performance than a dictionary
consisting of orthogonal bases (Daugman, 1990).
4. Optimal Wavelet Packets
The concept of optimal sampling can be expanded using different fitness criteria. As an example, Wilson
(1995) mentions the requirement for more complex approaches to signal representation, whose common
feature is adaptivity. It should be possible to automatically adjust the resolution of the representation, i.e.
its reconstruction ability, in order to provide the best ’fit’ to a given data set, rather than using a fixed
representation, whose resolution is only bound to be a compromise between space / time and frequency.
Examples of such an approach include wavelet packets (Coifman and Wickerhauser, 1992), where the
wavelet dictionary is drawn using maximal energy concentration and/or least Shannon entropy.
Let H0 and H1 be a pair of QMFs for one-dimensional signal analysis. The wavelet packets produced by
iterating these filters are products of one-dimensional wavelet packets. The wavelet coefficients in a
homogeneous tree structure are generated by applying the QMF convolutions recursively. Suppose the
periodic signal is of size N = 2^L. This decomposition can be developed down to level L, at which point
each subspace will have a single element. Each level requires O(N) operations, with a constant
proportional to the length of the QMFs. The total complexity for calculating the wavelet packet coefficients, i.e.
correlating with the entire collection of wavelet coefficients, is O(NlogN). Once the coefficients of the
discrete wavelet transform (DWT) are derived one becomes interested in choosing an optimal subset, with
respect to some reconstruction criteria, for data compression purposes. Towards that end, Coifman and
Wickerhauser (1992) define the Shannon entropy µ as
µ(v) = − Σ_i v_i^2 ln v_i^2   (22)
where v = {v_i} is the corresponding set of wavelet coefficients. The Shannon entropy measure is then
used as a cost function for finding the best subset of wavelet coefficients. Note that minimum entropy
corresponds to less randomness (’dispersion’) and it thus leads to clustering. If one generates the complete
wavelet representations (wavelet packets) as a binary tree, the selection of the best coefficients is done by
comparing the entropy of wavelet packets corresponding to adjacent tree levels (father - son
relationships). One compares the entropy of each adjacent pair of nodes to the entropy of their union and
the subtree is expanded further only if it results in lower entropy. For a signal of size n the DWT
yields n coefficients, and the search for optimal coefficients yields that set (still of size n) for which the
Shannon entropy is minimized. Data compression, subject to the same entropy criterion, ranks the optimal
coefficients according to their magnitude and picks subsets consisting of m coefficients, where m
is less than n.
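A minimal sketch of the entropy cost (22) and the magnitude-based selection just described; the coefficient vectors and helper names are illustrative (real inputs would be normalized wavelet packet coefficients):

```python
import numpy as np

# Shannon entropy cost of Eqn. (22) and magnitude-based selection of
# the m largest coefficients.
def shannon_entropy(v):
    p = v[v != 0.0] ** 2              # convention: 0 * ln 0 = 0
    return -np.sum(p * np.log(p))

def top_m(v, m):
    # Keep the m largest-magnitude coefficients, zero the rest.
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-m:]
    out[keep] = v[keep]
    return out

concentrated = np.array([1.0, 0.0, 0.0, 0.0])    # unit norm, low dispersion
dispersed = np.array([0.5, 0.5, 0.5, 0.5])       # unit norm, high dispersion

print(shannon_entropy(concentrated))             # 0.0
print(shannon_entropy(dispersed))                # ln 4, about 1.386
print(top_m(dispersed, 2))                       # only 2 nonzero entries kept
```

The concentrated vector has minimal entropy, matching the observation that minimum entropy corresponds to less dispersion and facilitates compression.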
5. Functional Approximation and Pattern Classification
Signal representation is a particular case of functional approximation, and one could use the wavelet
kernels introduced earlier as universal approximators. Wavelet networks (’wavenets’) using stochastic
gradient descent akin to back propagation (BP) (Zhang and Benveniste, 1992) are such an example.
Signal denoising is closely related to function estimation from noisy samples. Cherkassky and Shao
(1999), have recently proposed a novel signal estimation/denoising approach using the Vapnik-
Chervonenkis (VC) theory and Structural Risk Minimization (SRM) as the inductive principle. The
proposed approach emphasizes the role of an appropriate ordering of wavelet functions for defining the
structuring SRM elements, for modeling complexity and for reducing the predicted estimation risk.
Cherkassky and Shao further showed that under finite sample settings, the choice of appropriate ordering
(of the basis functions) and sound complexity control are more important than choosing the right basis,
whereas under large sample settings the type of basis functions becomes more important. Functional
approximation in terms of dynamic receptive fields (RFs) using Radial Basis Functions (RBFs)
(Lippmann and Ng, 1991) and/or Gaussian bar architectures consisting of semilocal and overlapping units
(Hartman and Keeler, 1991) have been considered as well. Pattern classification is a special case of
functional approximation (’regression’) when the only possible outputs are categorical and requires special
consideration.
Pattern recognition, a difficult but fundamental task for intelligent systems, depends heavily on the
particular choice of the features used by the (pattern) classifier. Feature selection in pattern recognition
involves the derivation of salient features from the raw input data in order to reduce the amount of data
used for classification and simultaneously provide enhanced discriminatory power. The selection of an
appropriate set of features is one of the most difficult tasks in the design of a pattern classification system.
At the lowest level, the raw features are derived from noisy sensory data, the characteristics of which are
complex and difficult to characterize. In addition, there is considerable interaction among the low-level
features which must be identified and exploited. The typical number of possible features, however, is so
large as to prohibit any systematic exploration of all but a few possible interaction types (e.g., pairwise
interactions). In addition, any sort of performance oriented evaluation of feature subsets involves
building and testing the associated classifier, resulting in additional overhead costs. As a consequence, a
fairly standard approach is to attempt to pre-select a subset of features using some abstract measures
related to important properties of good feature sets such as orthogonality, infomax, large variance,
multimodality of marginal distributions, and/or low entropy. This approach has been recently dubbed the
"filter" approach (Kohavi and John, 1995) and it is generally much less resource intensive than building
and testing the associated classifiers, but it may result in suboptimal performance if the abstract measures
do not correlate well with actual performance.
It is still difficult, however, to develop abstract feature space measures of optimality which would
guarantee optimality for actual classification tasks. In practice this is best achieved by including some
form of performance evaluation of the feature subsets while searching for good subsets. This approach,
recently dubbed the "wrapper" approach (Kohavi and John, 1995), typically involves building a classifier
based on the feature subset being evaluated, and using the performance of the classifier as a component of
the overall evaluation. The wrapper approach should produce better classification performance than the
filter approach, but it adds considerable overhead to an already expensive search process. This in turn
usually restricts the number of alternative feature subsets one can afford to evaluate, and thus it may also
produce suboptimal results. Our hybrid approach for eye detection integrates (in an independent fashion)
and takes advantage of both the filter and the wrapper approach. The filter approach seeks optimal
wavelet packets for best features, while the wrapper approach seeks the best classifier in terms of Radial
Basis Functions (RBFs).
The RBFs, related to kernel (’potential’) functions, can perform functional approximation characteristic of
sparse surface reconstruction while they can also support a general hybrid scheme connecting signal
approximation and pattern classification. Signal approximation, characteristic of self-organization using
adaptive cluster prototypes, provides the needed bases for defining the classifier. As an example, for
some function f(x), its approximation (decomposition in terms of basis functions), given as < c_i, T_i, R >, is
found as
f(x) = Σ_i c_i N( ||x − T_i||_R^2 )   (23)
where T_i are the kernels (’dictionary’) of the function to be learned, given in terms of parameterized
clusters across the space of geometrical (and/or topological) variability and weighted by coefficients c_i
learned through straightforward linear algebra (matrix inversion via SVD), N is the Gaussian distribution,
R is the normalization matrix, and the corresponding norm is defined as
||x − T_i||_R^2 = (x − T_i)^T R^T R (x − T_i)   (24)
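Eqns. (23)-(24) can be sketched with isotropic Gaussians (R proportional to the identity, so the norm reduces to a scaled Euclidean one) and the weights c_i found by an SVD-based least-squares solve; the target function, centers, and width below are illustrative:

```python
import numpy as np

# Sketch of Eqn. (23): approximate a 1-D function as a weighted sum of
# Gaussian kernels, with the coefficients c_i obtained via SVD-based
# least squares (np.linalg.pinv computes the pseudoinverse via SVD).
x = np.linspace(0.0, 1.0, 50)
f = np.sin(2 * np.pi * x)                     # illustrative target

T = np.linspace(0.0, 1.0, 8)                  # kernel centers T_i
width = 0.15                                  # isotropic kernel width
G = np.exp(-((x[:, None] - T[None, :]) ** 2) / (2 * width ** 2))

c = np.linalg.pinv(G) @ f                     # c_i via matrix (pseudo)inversion
approx = G @ c
print(np.max(np.abs(approx - f)) < 0.1)       # True: close fit on the samples
```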
The wavenets, referred to earlier, implement the same general methodology as that used by RBFs.
(Universal) approximation takes place by establishing formal links between the network coefficients and
some signal transform, such as the wavelet transform. Training for the set {x, f(x)} is accomplished using
stochastic gradient descent such as Back Propagation (BP), and the approximation found, similar to the
one corresponding to RBFs, is g(x) = Σ_i w_i Ψ[ D_i R_i (x − t_i) ], where Ψ stands for the wavelet bases.
6. Eye Detection
We now address the task of eye detection in the context of the face recognition problem. The ability to
detect salient facial features is an important component of any face recognition system. Among the many
facial features available it appears that the eyes play the most important role in both face recognition and
social interaction. Detecting the eyes serves an important role in face normalization and thus facilitates
further localization of other facial landmarks. It is eye detection that allows one to focus attention on
salient facial configurations, to filter out structural noise, and to achieve eventual face recognition. The
hybrid approach for eye detection involves deriving first optimal (’reconstructed’) candidate eye regions
(’windows’) in terms of (best basis) wavelet packets followed by their classification (as eye vs non-eye)
using Radial Basis Functions (RBFs). It is shown that an optimal choice of a subset of wavelet bases
provides for improved RBF performance on the eye detection task when compared against using raw eye
images.
6.1 Experimental Setup
The experimental data for the eye detection task come from the FERET database (Phillips et al, 1998).
The FERET database consists of 1,564 sets comprising 14,126 face images, corresponding to 1,199
individuals. Since many of the face images were acquired during different photo sessions, the
lighting conditions, the size of the facial images, and their orientation can vary. In particular, some
frontal poses are slightly rotated to the right or left; the results reported later show that the eye
detection technique is robust to such slight rotations. The diversity of the FERET database is across gender,
race, and age, and it includes 365 duplicate sets taken at different times.
The facial data set drawn from FERET consists of 45 frontal facial images taken at the resolution of 128 x
128. Windows of size 8 x 32 were manually cut from different regions of the face corresponding to eye
and non-eye images as shown in Fig.2. The training data consists of 50 eye (+) images and 20 non-eye (-)
images, while the testing data consists of 40 eye (+) and 10 non-eye (-) images. The eye images are raster
scanned to form 1-D vectors consisting of 256 elements.
Figure 2. Eye (+) and Non-Eye (-) Examples
6.2 Eye Representation Using Optimal Wavelet Packets
Wavelet packets, corresponding to the eye and non-eye images, are derived using the Daubechies family
of order two, as shown in Fig. 3. The QMF low-pass h(i) and high-pass g(i) coefficients for filter size 4 are:

i     h(i)      g(i)
1     0.4830   -0.1294
2     0.8365   -0.2241
3     0.2241    0.8365
4    -0.1294   -0.4830
Figure 3. Daubechies (Scale and Wavelet) Family of Order Two and its FFT
Optimal (wavelet packet) decompositions are then found minimizing the Shannon entropy as described in
Sect. 4. The quality of the reconstructed images, using the whole set of 256 coefficients or a reduced set
consisting of only the 16 largest coefficients from the optimal (minimal entropy) induced tree, is shown in
Fig. 4. The root mean square error (e_rms) between the reconstructed and the original image is computed
as:
e_rms = [ (1/256) Σ_{i=1}^{256} ( f_R(x_i) − f_O(x_i) )^2 ]^(1/2)   (25)
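Eqn. (25) is a standard root-mean-square error; a small sketch, with synthetic stand-ins for the reconstructed and original windows:

```python
import numpy as np

# e_rms of Eqn. (25) over a 256-element window. f_R and f_O below are
# synthetic stand-ins for the reconstructed and original eye images.
def e_rms(f_R, f_O):
    return np.sqrt(np.mean((f_R - f_O) ** 2))

f_O = np.linspace(0.0, 255.0, 256)       # "original" 1-D signal
f_R = f_O + 2.0                          # constant error of 2 gray levels
print(e_rms(f_R, f_O))                   # 2.0
```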
Note that perfect reconstruction is achieved when one uses the whole set of 256 wavelet coefficients. The
reconstructed images capture relevant characteristics (’features’) and discard structural ’noise’. The
reconstructed images (here using 16 or 64 largest coefficients) and raw original images are then passed to
the Radial Basis Function (RBF) classifier to assess the relevance, if any, of optimal signal
decompositions (’functional approximation’) for classification tasks.
Number of Wavelet Best Bases    e_rms
16                              29.36
256                             0

(The corresponding reconstructed images and their 1-D signals appear in Fig. 4.)
Figure 4. Reconstructed Eye images Using Best Wavelet Packet Bases
6.3 Classification Using Radial Basis Functions (RBFs)
The construction of the RBF network involves three different layers (Fig. 5). The input layer is made up
of source nodes (sensory units), here fed with optimal wavelet packets. The second layer is a hidden layer
whose goal is to cluster the data and further reduce its dimensionality. The output layer supplies the
response of the network to the activation patterns applied to the input layer, corresponding here to the eye
and non-eye classes. The transformation from the input space to the hidden-unit space is non-linear,
whereas the transformation from the hidden-unit space to the output space is linear. Connections between
the input and middle layers have unit weights and, as a result, do not have to be trained. Nodes in the
middle layer, called basis function (BF) nodes (’kernels’), are similar to the clusters derived using k-means
training. The BF kernels produce a localized response to the input using Gaussian kernels, and
each hidden unit can further be viewed as a localized receptive field (RF).
Figure 5. Radial Basis Function (RBF) Architecture
The basis functions (BF) used are Gaussians, and the activation level y_i of the hidden unit i is given by:
y_i = Φ_i( ||X − µ_i|| ) = exp[ − Σ_{k=1}^{D} (x_k − µ_{ik})^2 / (2 σ_{ik}^2 h o) ]   (26)
where h is a proportionality constant for the variance, x_k is the kth component of the input vector X = [x_1,
x_2, ..., x_D], µ_{ik} and σ_{ik}^2 are the kth components of the mean and variance vectors defining the BF, and
o is the overlap factor between BFs. The output of each hidden unit lies between 0 and 1 and can be
interpreted as a degree of fuzzy membership; the closer the input is to the center of the Gaussian, the larger
the response of the BF node. The activation level Z_j of an output unit is given by:
Z_j = Σ_i w_ij y_i + w_0j    (27)
where Z_j is the output of the jth output node, y_i is the activation of the ith BF node, w_ij is the weight connecting the ith BF node to the jth output node, and w_0j is the bias or threshold of the jth output node. The bias comes from the weights associated with a BF node that has a constant unit output regardless of the input. An unknown vector X is classified as belonging to the class associated with the output node j which yields the largest output Z_j.
The RBF input consists of normalized (optimal wavelet packet) values fed to the network as 1-D vectors. The width of the Gaussian for each cluster is set to the larger of the within-class diameter (the distance between the center of the cluster and its farthest member) and the distance between the center of the cluster and the closest pattern from all the other clusters, multiplied by the overlap factor o. The width is dynamically refined using different proportionality constants h. The hidden layer yields the equivalent of a functional eye basis, where each cluster node encodes some common characteristics across the eye space.
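The width-setting rule just described (the larger of the within-class diameter and the distance to the nearest pattern of any other cluster, scaled by the overlap factor o) can be sketched as follows; the function and variable names are ours:

```python
import numpy as np

def bf_widths(centers, clusters, o=1.0):
    """Per-cluster Gaussian widths (sketch of the rule described above).

    centers  : (n_bf, D) cluster centers
    clusters : list of (n_i, D) arrays, the members of each cluster
    o        : overlap factor
    """
    widths = np.empty(len(centers))
    for i, c in enumerate(centers):
        # within-class diameter: distance to the farthest member
        diameter = np.linalg.norm(clusters[i] - c, axis=1).max()
        # distance to the closest pattern from all the other clusters
        others = np.vstack([m for j, m in enumerate(clusters) if j != i])
        nearest_other = np.linalg.norm(others - c, axis=1).min()
        widths[i] = o * max(diameter, nearest_other)
    return widths
```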
The output (supervised) layer maps training examples to their corresponding classes (eye and non-eye)
and finds the corresponding (hidden layer) expansion (’weight’) coefficients using SVD techniques. Note that the network is frozen at the configuration (number of clusters and specific proportionality constant h) which yields 100% accuracy when validated on the same training images. The experiments
carried out involved 50 eye and 20 non-eye images for training, and a different set consisting of 40 eye
and 20 non-eye images for testing. The same experiment was performed for the case when original
images were used and for the cases when the images used were reconstructed using the best 16 and 64
wavelet coefficients. Table 1 shows the performance of correct eye and non-eye classification
corresponding to the original and reconstructed images.
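The supervised output-layer solve can be sketched with an SVD-based least-squares fit, as hinted above. This is a hedged sketch: the names are ours, and handling the bias via an appended constant column is our assumption, mirroring the constant-unit-output BF node mentioned earlier.

```python
import numpy as np

def solve_output_weights(Y, T):
    """SVD-based least-squares solve for the output-layer weights (sketch).

    Y : (n_samples, n_bf) hidden-layer activations on the training set
    T : (n_samples, n_classes) one-hot target outputs
    Returns (W, w0): output weights and biases.
    """
    # append a constant column so the bias w_0j is learned jointly
    Ya = np.hstack([Y, np.ones((Y.shape[0], 1))])
    # np.linalg.lstsq solves the least-squares problem via SVD
    Wa, *_ = np.linalg.lstsq(Ya, T, rcond=None)
    return Wa[:-1], Wa[-1]
```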
Another experiment was carried out to assess how robust the eye detection approach is on miscentered
eye images. Twenty-eight test images were produced by shifting the center of the eye from its original position using two-pixel upward/downward shifts and/or two- or four-pixel left/right shifts. The classification rates obtained for original vs. reconstructed images using the best 64
wavelet coefficients were 50% and 80%, respectively. From the above experiments, one concludes that
the performance of the Radial Basis Function (RBF) classifier improves when the original (raw) images
are replaced by optimally reconstructed images using the ’best’ wavelet coefficients. As entropy
minimization drives the derivation of optimal wavelet packets, clustering (’prototyping’) captures the most
significant characteristics of the underlying data and is thus more compatible with the 1st operational
stage of the RBF classifier. Note that the best RBF classifier based on optimal wavelet packets can be
applied exhaustively across the whole facial landscape in order to detect the eyes or alternatively it can be
used only on salient locations likely to cover the eye regions. In the latter case, one would possibly
evolve a homing routine to first guide an animat toward the eye(s) and then use the RBF classifier derived
earlier to confirm that the eye region has been indeed reached (Huang, Liu and Wechsler, 1998; Huang
and Wechsler, 1999).
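Applying the classifier exhaustively across the facial landscape amounts to a sliding-window scan. A hedged sketch follows, with `classify` standing in for the trained RBF classifier and the names and step size being our own choices:

```python
import numpy as np

def scan_for_eyes(image, classify, win=(8, 32), step=2):
    """Slide an 8 x 32 window over the image and collect the top-left
    corners of all windows the classifier labels as 'eye' (sketch).

    classify : any window -> bool predicate (stand-in for the RBF classifier)
    """
    h, w = win
    hits = []
    for r in range(0, image.shape[0] - h + 1, step):
        for c in range(0, image.shape[1] - w + 1, step):
            if classify(image[r:r + h, c:c + w]):
                hits.append((r, c))
    return hits
```

In the alternative (salient-location) mode described above, the same predicate would be evaluated only at candidate positions proposed by the homing routine.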
Data                                    Eye      Non-eye
Original (8 x 32) images                70%      70%
Compressed, 16 wavelet coefficients     82.5%    80.0%
Compressed, 64 wavelet coefficients     85.0%    100.0%

Table 1. Eye vs. Non-Eye Classification Results
7. Conclusions
We have shown in this paper that eye images compressed using optimal wavelet packets (’bases’) lead to
enhanced performance on the eye detection task when compared to the case where original raw images
are used. As entropy minimization drives the derivation of optimal wavelet packets, clustering
(’prototyping’) captures the most significant characteristics of the underlying data and is more compatible
with the 1st operational stage of the RBF classifier. Our hybrid approach is related to the probabilistic
visual learning (PVL) method for object recognition (Moghaddam and Pentland, 1997). PVL measures similarity for recognition purposes in terms of distance-from-feature-space (DFFS) (’representation adequacy’) and distance-in-feature-space (DIFS) (’classification accuracy’). While DFFS measures representation adequacy using the reconstruction error as its criterion, the filter component of our hybrid approach seeks representation adequacy in terms of entropy minimization. Both PVL and our hybrid approach address representation and classification accuracy independently of each other. An alternative
would be to develop a hybrid approach where fitness and success are defined in terms of a combined
fitness measure evolved using Genetic Algorithms (GAs). The suggested measure would take into
account the tradeoffs between representational adequacy and classification accuracy, possibly including
the compactness of the representation as well (Liu and Wechsler, 1998). The updated scheme would then substitute regularization for ERM as the inductive principle, as representation adequacy is traded off against classification accuracy in terms of compact, well-separated clusters. Our approach is also similar to
the evolution of lattices of receptive fields (Linsker, 1988) suitable for both adaptive and optimal
functional decomposition and signal classification. Such kernel bases amount to optimal ’dictionaries’
consisting of those (kernel) primitives able to express best and in a compact form a wide range of
meaningful inputs characteristic of some sensory environment leading eventually to both pattern
classification and robust sensory-motor schemes.
One additional direction for future research would integrate the concepts of optimal signal sampling,
characteristic of signal restoration, and optimal pattern sampling, characteristic of learning machines, with
the joint goals of compact signal approximation and improved classification accuracy. Examples of
optimal pattern sampling include active learning (Krogh and Vedelsby, 1995) and support vector
machines (SVM) (Vapnik, 1995). During the early stages of training most of the examples are
misclassified, but as training proceeds the set of misclassified examples diminishes. Active learning, as
a form of corrective training, seeks to determine what the training (’bases’) set should consist of in order to
improve the guaranteed prediction risk. Specifically, active learning seeks to determine the relevant
sample patterns or what SVM would label as support vectors (SV) for the purpose of defining the actual
classifier. As both signals and patterns have to be approximated using optimal (kernel) bases (’receptive
fields’ and ’support vectors’) the computational quest should be now after such conjoint ’bases’ with the
expectation for enhanced generalization abilities and corresponding lower predictive risk.
Acknowledgments: This work was partially supported by the U.S. Army Research Laboratory under
contract DAAL01-97-K-0118.
References
Barron, A. R. (1991), Approximation and Estimation Bounds for Artificial Neural Networks. Proc. of the
4th Workshop on Computational Learning Theory, Eds. Warmuth, M. K. and L.G. Valiant, 243-249,
Morgan Kaufmann.
Chellappa, R., C. L. Wilson and S. Sirohey (1995), Human and Machine Recognition of Faces: A Survey, Proc. IEEE, 83, 705-740.
Cherkassky, V. and F. Mulier (1998), Learning from Data: Concepts, Theory and Methods, Wiley.
Cherkassky, V. and X. Shao (1999), Signal Estimation and Denoising Using VC-Theory, Neural Computation (under review).
Coifman, R. and V. Wickerhauser (1992) Entropy-Based Algorithms for Best Basis Selection, IEEE
Trans. on Information Theory, 38(2), 713-718.
Daubechies, I. (1988), Orthonormal Bases of Compactly Supported Wavelets, Comm. on Pure and Appl. Math., 41, 909-996.
Daugman, J. G. (1990) An Information-Theoretic View of Analog Representation in Striate Cortex., in
Computational Neuroscience, Ed. E. L. Schwartz, 403-424, MIT Press.
Gabor, D. (1946), Theory of Communication, Journal of the IEE, 93, 429-459.
Hartman, E. and J. Keeler (1991) Predicting the future: Advantages of Semilocal Units, Neural
Computation, 3, 566-578.
Huang, J., C. Liu and H. Wechsler (1998), Evolutionary computation and face recognition, in Face
Recognition: From Theory to Applications, H. Wechsler, J. P. Phillips, V. Bruce, F. Fogelman -
Soulie and T. Huang (Eds.), Springer-Verlag.
Huang, J. and H. Wechsler (1999), Visual Routines for Eye Location Using Learning and Evolution,
IEEE Trans. on Evolutionary Computation (in press).
Kohavi, R. and G. John (1995), Wrappers for Feature Subset Selection, Technical Report, Computer Science Department, Stanford University.
Krogh, A. and J. Vedelsby (1995), Neural Network Ensembles, Cross Validation and Active Learning, in Advances in Neural Information Processing Systems 7, 231-238, MIT Press.
Linsker, R. (1988) Self-Organization in a Perceptual Neural Network, Computer, 21, 105-117.
Lippmann, R. P. and Ng, K. (1991), A Comparative Study of the Practical Characteristic of Neural
Networks and Pattern Classifiers, Technical Report 894, Lincoln Labs., MIT.
Liu, C. and H. Wechsler (1998), Face Recognition Using Evolutionary Pursuit, 5th European Conf. on
Computer Vision (ECCV), Freiburg, Germany.
Malinowski, A. and J. M. Zurada (1994), Minimal training set size estimation for sampled-data function
encoding, World Congress on Neural Networks (WCNN), Vol. II, 366 - 371, San Diego, CA.
Mallat, S. (1989) A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 11(7), 674-693.
Moghaddam, B. and A. Pentland (1997), Probabilistic Visual Learning for Object Representation, IEEE
Trans. on Pattern Analysis and Machine Intelligence, 19(7), 696-710.
Phillips, P. J., H. Wechsler, J. Huang and P. Rauss (1998), The FERET Database and Evaluation
Procedure for Face Recognition, Image and Visual Computing, 16(5), 295-306.
Samal A. and P. Iyengar (1992), Automatic Recognition and Analysis of Human Faces and Facial
Expressions: A Survey, Pattern Recognition 25, 65-77.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer Verlag.
Wilson, R. (1995), Wavelets: Why so Many Varieties?, UK Symposium on Applications of Time-
Frequency and Time-Scale Methods, University of Warwick, Coventry, UK.
Zhang, Q. and A. Benveniste (1992), Wavelet Networks, IEEE Trans. on Neural Networks , 3(6), 889-
898.
JEFFREY R. J. HUANG received the Ph.D. in Information Technology at George Mason
University, Fairfax, Virginia, and he is now a researcher with the Biometrics and Computer Vision
Lab in the Department of Computer Science at George Mason University. Dr. Huang’s research
interests include image processing, pattern recognition, computer vision, and evolutionary
computation and genetic algorithms. The range of applications covers biometrics and face
recognition, human-computer interaction, video processing and multimedia, and automated
target/object detection and recognition. Since 1995, he has published more than 25 scientific
papers, including 2 book chapters. Jeffrey Huang is a member of the IEEE Computer Society, the Engineering in Medicine and Biology Society, AAAI, ISAB, and SPIE.
HARRY WECHSLER received the Ph.D. in Computer Science from the University of California, Irvine, in
1975, and he is presently Professor of Computer Science at George Mason University. His research, in the
field of intelligent systems, has been in the areas of PERCEPTION: Computer Vision (CV), Automatic
Target Recognition (ATR), Signal and Image Processing (SIP), MACHINE INTELLIGENCE and
LEARNING: Pattern Recognition (PR), Neural Networks (NN), Information Retrieval, Data Mining and
Knowledge Discovery, EVOLUTIONARY COMPUTATION: Genetic Algorithms (GAs) and Animats,
MULTIMEDIA and VIDEO PROCESSING: Large Image Data Bases, Document Processing, and HUMAN-
COMPUTER INTELLIGENT INTERACTION (HCII) : Face and Hand Gesture Recognition, Biometrics
and Forensics. He was Director for the NATO Advanced Study Institutes (ASI) on "Active Perception and
Robot Vision" (Maratea, Italy, 1989), "From Statistics to Neural Networks" (Les Arcs, France, 1993) and "Face Recognition:
From Theory to Applications" (Stirling, UK, 1997), and he served as co-Chair for the International Conference on Pattern
Recognition held in Vienna, Austria, in 1996. He authored over 200 scientific papers, his book "Computational Vision" was
published by Academic Press in 1990, and he was the editor for "Neural Networks for Perception" (Vol 1 & 2), published by
Academic Press in 1991. He was elected as an IEEE Fellow in 1992 and as an Int. Association of Pattern Recognition (IAPR)
Fellow in 1998.

among them that including regression and density estimation. (ii) detection of facial landmarks. an optimization or parameter estimation procedure) for a given set of approximating functions in which the model is sought. data and dimensionality reduction (complexity). (Cherkassky and Mulier. and (iv) analysis of facial expressions (Samal et al. identifies then the faces using appropriate classification algorithms. 1995).e. ERM as the inductive -2- . Based on the above discussion our approach for eye detection involves wavelets as the set of approximating functions (’dictionary’).e.1. Face recognition starts with the detection of face patterns in sometimes cluttered scenes. proceeds by normalizing the face images to account for geometrical and illumination changes. respectively. an inductive principle provides a general prescription for what to do with the training data in order to obtain (learn) the model. explanation / interpretation capability. which determines to what degree adaptation has been successful so far. (iii) face recognition . where the goal is to develop a computational relationship for estimating the values for the output variables given only the values of the input variables.identification and/or verification. There is just a handful of known inductive principles. and classification. It is the prediction risk.. a learning method is a constructive implementation of an inductive principle (i. Predictive learning can be thought of as a relationship y = f(x) + error. Essentially. and possibly to biological plausibility. Bayesian Inference.. the expected performance of an estimator (’classifier’) for new (future) samples. Eye detection is approached within the framework of predictive learning. Introduction There are several related (face recognition) sub problems: (i) detection of a pattern as a face (in the crowd) and its pose. Different taxonomies are available for predictive learning. 
The notion of inductive principle is fundamental to all learning methods. Structural Risk Minimization (SRM). The eyes are important facial landmarks. The main issues one has to address are related to prediction (generalization) ability. Regularization. possibly using information about the location and appearance of facial landmarks. a class of models: dictionary). such as feed forward nets with sigmoid units. even that any classification problem can be reduced to a regression problem. 1992). etc. facial landmarks and logistic feedback (Chellappa et al. radial basis function networks.. but there are infinitely many learning methods based on these principles. and for post processing due to them anchoring model-based schemes. and post processes the results using model-based schemes. an inductive principle and an optimization (parameter estimation) procedure. In contrast. estimating (’learning’) a model (’classifier’) from finite data requires specification of three concepts: a set of approximating functions (i. corresponding to the input variables being continuous and discrete / categorical. both for normalization due to their relatively constant interocular distance. Minimum Description Length). where the error is due both to (measurement) noise and possibly to ’unobserved’ input variables. Empirical Risk Minimization (ERM). In the framework of predictive learning. 1998).

This paper introduces an approach for the eye detection task using optimal wavelet packets for compressed eye representations and Radial Basis Functions (RBFs) for classification (’labeling’) of facial areas as eye vs. Our approach also displays adaptiveness with respect to the dictionary aspect as it seeks an optimal subset of wavelet kernels to approximate eye regions. developed a new approach to the problem of n-dimensional -3- . As an example. Approximation and estimation bounds for artificial networks. Signals have to be properly sampled and processed to facilitate optimal behavior and sensory-motor control. information (’detail’) is lost. To avoid aliasing effects the Shannon sampling theorem sets the minimum (Nyquist) rate for sampling a bandwidth-limited continuous signal so it can be uniquely recovered (’restored’) from its samples.principle. The approximation bounds thus estimate the 2 1 performance of the network in terms of the spectral characteristics of the data being analyzed. and establishes strong links between optimal sampling and neural network performance. corresponding to sensory inputs. when used as universal approximators. 2. reconfirms known concepts from signal analysis and information theory referred to earlier. provide the information link connecting intelligent systems to their environment. For the case n ~ C f ( d logN ) the optimized order of the bound 1 2 d on the mean integrated squared error is O(C f ( N log N) ) . The approximation error (’bias’) refers to the distance between the target function and the closest neural network function of a given architecture. and Cf is the first absolute moment of the Fourier magnitude distribution of f. non-eye regions. and RBFs as the learning and optimization method implementing the inductive principle. while the estimation error (’variance’) refers to the distance between this ideal network N function and an estimated network function. 
The two contributions to the total risk are the approximation error and the estimation error. d is the input dimension of the function. fast changing signals contain more information and thus should be sampled more often. The preceding result.recover . Intuitively. Sampling criteria are determined by examining the spectral contents of the signal and by setting the sampling rate to match the rate of change observed in the given signal. N is the number of training observations. Barron has specifically shown that the mean squared error between the estimated network and a target function f is bound by O ( )+ O( C2 f n nd log N N) (1) where n is the number of nodes. and signal reconstruction is beset by interference (’aliasing’) effects. If sampling fails to keep track of the fast changing signals. Signals.reconstruct signals using appropriate sampling procedures. have been derived by Barron (1991) using complexity regularization and statistical risk. Malinowski and Zurada (1994). Signal Representation and Optimal Sampling One of the goals for signal processing (SP) is to restore .

both rapid signal changes (’high frequency’) corresponding to noise and slow signal changes (close to dc) corresponding to illumination changes. known as the uncertainty principle. One thus can not achieve at the same time both optimal signal localization and optimal tracking of details related to the changes taking place in spatiotemporal signal patterns. As recounted by Daugman (1990). Note that not all signal changes are intrinsic to the signal being analyzed. a plane whose axes correspond to time (or space) and frequency. and it confirmed the dependence of network performance on proper pattern sampling. Choosing the smallest but still sufficient set of training vectors resulted in a reduced number of hidden neurons and learning time. revolves around the simultaneous resolution of information in (2D) spatial and spectral terms. and provide one possible tessellation of the conjoint spatial/spectral signal domain considered above. The now famous concept. which he illustrated through the construction of a spectrogram-like information diagram. The information diagram. The conclusion to be drawn from the above discussion is better understood and visualized if one were to associate time (space) occurrence with localization.function approximation using two-layer neural networks using the Nyquist rate to determine the optimum number of learning patterns. and it is closely related to the choice of optimal lattices of receptive fields (kernel bases) for representing (’decomposing’) some underlying signal. reflects the inevitable trade-off between time (space) resolution ∆t (∆s) and frequency resolution ∆ω. are spatial frequency/orientation tuned kernels. should be discounted. if one wished the code (’basis functions’) primitives to have both a welldefined epoch of occurrence in time (space) and a well-defined spectral identity. The code primitives are also referred to as kernels or RFs (receptive fields) in analogy to biological vision. must necessarily be grainy. 
The Human Visual system (HVS) has adapted so it implements a strategy as outlined above by properly weighting the frequency contents through the Modulation Transfer Function (MTF). a self-similar and spatially localized code. and equals the lower bound on their product. The next fundamental concept bearing on optimal signal recovery comes from information theory and it is due to Gabor (1946). As an example. Gabor further noted that Gaussian-modulated complex exponentials offer the optimal way to encode signals (of arbitrary spectral content) or to represent them. Wavelet Representation The wavelet basis functions (Mallat. Gabor pointed out that there exists a "quantum principle" for information. in the sense that it is not possible for any signal or filter (and hence any carrier or detector of information) to occupy less than a certain minimal area in this plane. 1989). The corresponding wavelet hierarchical (pyramid) is obtained as the -4- . or quantized. 3. This minimal or quantal area. while frequency change is associated with the speed at which a signal would change.

n = ∫ f ( x ) m . defined using a pair of functions φ (scaling function) and ϕ (’mother’ . She obtained a complete parameterization of all finite length of h0(k) that satisfies the following quadratic and linear conditions: ∑ h0 ( n ) = 2 n (12) (13) -5- and ∑ h0 ( n )h0 ( n + 2l ) = δ (l ) n . with the scaling and mother wavelet functions becoming − φ m . The 2D wavelet representation W of the function f(x. satisfy φ a . a y .b correspond to the scaling and mother wavelet functions dilated by a. n are integer numbers. and sx and sy are shift parameters.b (x ) =| a |1 / 2 ψ (ax − b ) (3) (4) where a. where m. Continuous wavelets. b0 = 1 one obtains φ m .b (x ) =| a |1 / 2 φ (ax − b ) ψ a . n (x ) =| a0 |− m / 2 φ a0 m x − nb0 ( ) (5) − ψ m . b are real numbers. y)(ax a y ) − Ψ* [(x − s x )(a x ) −1 . s x .b.(( y − sy )(a y )−1 ] 1 2 (2) where Ψ(x. The wavelet representation provides for multi-resolution analysis (MRA) through the orthogonal decomposition of a function along basis functions consisting of appropriate translations and dilations of the mother wavelet function. and φa. The discrete wavelet transform (DWT) is obtained for the choice of a = a0-m. n ( x )dx ψ −∞ +∞ (10) (11) f ( x ) = ∑ cm . relating the mother wavelet to the scaling function is: ψ ( x ) = 2 ∑ h1 ( k )φ ( 2 x − k ) k (9) The wavelet coefficients and the corresponding wavelet decomposition are given as cm . and translated by b. n ( x ) m. sy ) = ∫∫ f (x. b = nb0. nψ m . n (x ) =| a0 |− m / 2 ψ a0 m x − nb0 ( ) (6) For the choice a0 = 2.wavelet function). and ϕa. y) is the "mother" wavelet. n (x ) = 2 − m / 2ψ 2 − m x − n ( The dilation equation. y) is W(a x .n Daubechies (1998) has shown how one can derive the corresponding low (’h0’) and high (’h1’) pass filters for designing appropriate families of scaling and mother wavelet functions.result of orientation-tuned decompositions at a dyadic (powers of two) sequence of scales. ax and ay are scale parameters. 
n (x ) = 2 − m / 2 φ 2 − m x − n ( ) ) k where h1 ( k ) = ( −1) h0 (1 − k ) (7) (8) ψ m .

the perfect reconstruction. (21) (20) (17) (18) (19) Substituting the above into Eqn. For the aliasing cancellation in above two-channel filter bank.e. and parameterizes the autocorrelation sequence P(z) for various values of K. (16) leads to H0(z)H0(-z)+H0(-z)H0(z) = 0. G1 are the synthesis filters used in the forward and inverse polyphase transformation. must satisfy a necessary and sufficient condition for aliasing cancellation (anti-aliasing): G0(z)H0(-z) + G1(z)H1(-z) = 0 (16) where the H0. j . the sequences of two-channel filter banks . ϕ 0 .Defining h1(k)=(-1)kh0(N-1-k) and the filter length N is even. j +1. (14) and (15) derived from the Eqn.h and g can be utilized to compute the DWT with perfect reconstruction. Eqn. j . That P(z) is bounded on the unit circle by 2 K−1 2 is sufficient for wavelet basis to be orthonormal. k > h0 ( − k ) h1 ( − k ) ↓2 ↓2 ↑2 ↑2 h0 (k ) h1 (k ) < x. it also satisfies: G0(z)H0(z) + G1(z)H1(z) = 2z-l Using the QMF constrains. aliasing is cancelled. j +1. i. G0(z) = H0(z). For -6- . the perfect reconstruction filter banks.k > Figure 1: Computation of DWT Using a Filter Bank By using the polyphase-domain analysis.n > = ∑ hi ( n − 2k )ϕ i . H1 are the analysis filters and G0. one computes then the Discrete Wavelet Transform (DWT) using the structure shown in Fig. having finite impulse response (FIR). k > < x. j . Using the sequences h0 and h1. i. ϕ 0 .1} < x.k i . N=2K. Mallat (1989) has showed that for any orthonormal wavelet basis.n > n (14) (15) and < x. 1 also shows the implementation of Eqns. ϕ 0 . j +1. Daubechies writes H 0 ( z ) = (1 + z −1 ) K P( z ) . j .e. j +1. Fig.k > = ∑ hi ( n − 2k ) < x. (8): < x.H02(-z) = 2z-l. ϕ i .k > < x.1.k where i ∈ {0. ϕ 0 . G1(z) = -H1(z) = -H0(-z). (20) becomes H02(z) . one can design the Quadrature Mirror Filters (QMF) that satisfy the constrains: H1(z) = H0(-z). ϕ 1. ϕ 0 .

Note that the self-Gabor wavelets are redundant due to them being non-orthogonal. The wavelet packets produced by iterating these filters are products of one-dimensional wavelet packets. This decomposition can be developed down to level L. The self-similar Gabor basis functions g are a special case of non-orthogonal wavelets. Note that minimum entropy -7- . at which point each subspace will have a single element. Examples of such an approach include wavelet packets (Coifman and Wickerhauser. i. whose common feature is adaptivity. correlating with the entire collection of wavelet coefficients.e. whose resolution is only bound to be a compromise between space / time and frequency.e. 4. its reconstruction ability. with respect to some reconstruction criteria. Suppose the periodic signal is of size N = 2L . correspond to sinusoids modulated by a Gaussian. rather than using a fixed representation. The Shannon entropy measure is then used as a cost function for finding the best subset of wavelet coefficients. Once the coefficients of the discrete wavelet transform (DWT) are derived one becomes interested in choosing an optimal subset. The total complexity for calculating the wavelet packet coefficients. lead to an over complete dictionary of basis functions and are thus expected to provide better performance than a dictionary consisting of orthogonal bases (Daugman. The QMF design is a useful way to implement PR filter banks for multirate signal analysis and directly links to multi-resolution analysis (MRA) supported by wavelet theory. Optimal Wavelet Packets The concept of optimal sampling can be expanded using different fitness criteria. Each level requires O(N) operations and it is proportional to the length of the QMFs.It means that on the unit circle H0(-z)=H(ej (ω+π) ) is the mirror images of H0(z) and the above relation explains the name of QMF. 1990). in order to provide the best ’fit’ to a given data set. 
where the wavelet dictionary is drawn using maximal energy concentration and/or least Shannon entropy. for data compression purposes. Towards that end. is O(NlogN). It should be possible to automatically adjust the resolution of the representation. i. can be easily tuned to any bandwidth (scale) and orientation. The wavelet coefficients in homogeneous tree structure are generated by applying the QMF convolutions recursively. As an example. 1992). Coifman and Wickerhauser (1992) define the Shannon entropy (µ ) as µ(v) = − ∑ vi ln vi 2 2 (22) where v = {vi} is the corresponding set of wavelet coefficients. Wilson (1995) mentions the requirement for more complex approaches to signal representation. Let H0 and H1 be a pair of QMF’s for one-dimensional signal analysis.

Cherkassky and Shao (1999). The proposed approach emphasizes the role of an appropriate ordering of wavelet functions for defining the structuring SRM elements. Pattern classification is a special case of functional approximation (’regression’) when the only possible outputs are categorical and requires special consideration. In addition. Pattern recognition. whereas under large sample settings the type of basis functions becomes more important. The selection of an appropriate set of features is one of the most difficult tasks in the design of pattern classification system. and one could use the wavelet kernels introduced earlier as universal approximators. ranks the optimal coefficients according to their magnitude. a difficult but fundamental task for intelligent systems. For a signal whose size is n the DWT yields n coefficients and the search for optimal coefficients yields that set (still of size n) for whom the Shannon entropy is minimized. Functional approximation in terms of dynamic receptive fields (RFs) using Radial Basis Functions (RBFs) (Lippmann and Ng.son relationships). there is considerable interaction among the low-level -8- . the selection of the best coefficients is done by comparing the entropy of wavelet packets corresponding to adjacent tree levels (father . and would pick up subsets consisting of m coefficients where m is less than n. the characteristics of which are complex and difficult to characterize. depends heavily on the particular choice of the features used by the (pattern) classifier. Signal denoising is closely related to function estimation from noisy samples. 1991) have been considered as well. Functional Approximation and Pattern Classification Signal representation is a particular case of functional approximation. the choice of appropriate ordering (of the basis functions) and sound complexity control are more important than choosing the right basis. Data compression. subject to the same entropy criteria. 
1991) and/or Gaussian bar architectures consisting of semilocal and overlapping units (Hartman and Keeler. the raw features are derived from noisy sensory data. for modeling complexity and for reducing the predicted estimation risk. One compares the entropy of each adjacent pair of nodes to the entropy of their union and the subtree is expanded further only if it results in lesser entropy. Cherkassky and Shao further showed that under finite sample settings. Wavelet networks (’wavenets’) using stochastic gradient descent akin to back propagation (BP) (Zhang and Benveniste. At the lowest level. 5.corresponds to less randomness (’dispersion’) and it thus leads to clustering. have recently proposed a novel signal estimation/denoising approach using the VapnikChervonenkis (VC) theory and Structural Risk Minimization (SRM) as the inductive principle. 1992) are such an example. If one generates the complete wavelet representations (wavelet packets) as a binary tree. Feature selection in pattern recognition involves the derivation of salient features from the raw input data in order to reduce the amount of data used for classification and simultaneously provide enhanced discriminatory power.

The typical number of possible features, however, is so large as to prohibit any systematic exploration of all but a few possible interaction types (e.g., pairwise interactions). As a consequence, a fairly standard approach is to attempt to pre-select a subset of features using some abstract measures related to important properties of good feature sets, such as orthogonality, infomax, multimodality of marginal distributions, large variance, and/or low entropy. This approach has been recently dubbed the "filter" approach (Kohavi and John, 1995), and it is generally much less resource intensive than building and testing the associated classifiers, but it may result in suboptimal performance if the abstract measures do not correlate well with actual performance.

It is still difficult to develop abstract feature space measures of optimality which would guarantee optimality for actual classification tasks. In practice this is best achieved by including some form of performance evaluation of the feature subsets while searching for good subsets. This approach, recently dubbed the "wrapper" approach (Kohavi and John, 1995), typically involves building a classifier based on the feature subset being evaluated, and using the performance of the classifier as a component of the overall evaluation. The wrapper approach should produce better classification performance than the filter approach, but it adds considerable overhead to an already expensive search process. Any sort of performance oriented evaluation of feature subsets involves building and testing the associated classifier, resulting in additional overhead costs. This in turn usually restricts the number of alternative feature subsets one can afford to evaluate, and thus it may also produce suboptimal results. Our hybrid approach for eye detection integrates (in an independent fashion) and takes advantage of both the filter and the wrapper approach. The filter approach seeks optimal wavelet packets for best features, while the wrapper approach seeks the best classifier in terms of Radial Basis Functions (RBFs).

The RBFs, related to kernel (’potential’) functions, can perform functional approximation characteristic of sparse surface reconstruction, while they can also support a general hybrid scheme connecting signal approximation and pattern classification, and thus provide the needed bases for defining the classifier. For some function f(x), its approximation (decomposition in terms of basis functions) given as < ci, Ti, R > is found as

    f(x) = Σi ci N(||x − Ti||²R)    (23)

where Ti are the kernels (’dictionary’) of the function to be learned, given in terms of parameterized clusters across the space of geometrical (and/or topological) variability, and weighted by coefficients ci learned through straightforward linear algebra (matrix inversion via SVD); N is the Gaussian distribution; R is the normalization matrix, and the corresponding norm is defined as

    ||x − Ti||²R = (x − Ti)' R'R (x − Ti)    (24)

The wavenets, referred to earlier, implement the same general methodology as that used by RBFs. (Universal) approximation takes place by establishing formal links between the network coefficients and some signal transform, such as the wavelet transform. Training for the set {x, f(x)} is accomplished using stochastic gradient descent such as Back Propagation (BP), and the approximation found, similar to the one corresponding to RBFs, is

    g(x) = Σi ωi Ψ[Di Ri (x − ti)]

where Ψ stands for the wavelet bases.
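A small numeric sketch of Eqs. (23) and (24) follows. This is our own illustrative code, not the paper's implementation: the Gaussian kernel N is taken here as an unnormalized exp(−r/2), and the coefficients ci, which would in practice be fitted via SVD as described above, are simply given.

```python
import math

def r_norm_sq(x, t, R):
    # Eq. (24): ||x - t||_R^2 = (x - t)' R'R (x - t), with R the
    # normalization matrix.
    d = [xi - ti for xi, ti in zip(x, t)]
    Rd = [sum(row[j] * d[j] for j in range(len(d))) for row in R]
    return sum(v * v for v in Rd)

def f_approx(x, coeffs, kernels, R):
    # Eq. (23): f(x) = sum_i c_i N(||x - T_i||_R^2); N is taken here as
    # an unnormalized Gaussian exp(-r/2), an assumption of this sketch.
    return sum(c * math.exp(-0.5 * r_norm_sq(x, t, R))
               for c, t in zip(coeffs, kernels))
```

With R the identity the norm reduces to the ordinary squared Euclidean distance, and a kernel centered exactly on the input contributes its full coefficient ci.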

6. Eye Detection

We now address the task of eye detection in the context of the face recognition problem. The ability to detect salient facial features is an important component of any face recognition system. Among the many facial features available, it appears that the eyes play the most important role in both face recognition and social interaction. Detecting the eyes serves an important role in face normalization and thus facilitates further localization of other facial landmarks. It is eye detection that allows one to focus attention on salient facial configurations, to filter out structural noise, and to achieve eventual face recognition. The hybrid approach for eye detection involves deriving first optimal (’reconstructed’) candidate eye regions (’windows’) in terms of (best basis) wavelet packets, followed by their classification (as eye vs non-eye) using the Radial Basis Function (RBF) method. It is shown that an optimal choice of a subset of wavelet bases provides for improved RBF performance on the eye detection task when compared against using raw eye images.

6.1 Experimental Setup

The experimental data for the eye detection task come from the FERET database (Phillips et al., 1998). The FERET database consists of 1,564 sets comprising 14,126 face images corresponding to 1,199 individuals, and it includes 365 duplicate sets taken at different times. The diversity of the FERET database is across gender, race, and age. Since large amounts of the face images were acquired during different photo sessions, the lighting conditions, the size of the facial images, and their orientation can vary. In particular, as the frontal poses are slightly rotated to the right or left, the results reported later on show that the eye detection technique is robust to slight rotations.

The facial data set drawn from FERET consists of 45 frontal facial images taken at the resolution of 128 x 128. Windows of size 8 x 32 were manually cut from different regions of the face corresponding to eye and non-eye images, as shown in Fig. 2. The eye images are raster scanned to form 1-D vectors consisting of 256 elements. The training data consists of 50 eye (+) images and 20 non-eye (-) images, while the testing data consists of 40 eye (+) and 10 non-eye (-) images.

Figure 2. Eye (+) and Non-Eye (-) Examples

6.2 Eye Representation Using Optimal Wavelet Packets

Wavelet packets, corresponding to the eye and non-eye images, are derived using the Daubechies family of order two, as shown in Fig. 3. The QMF of low pass h(i) and high pass g(i) for filter size 4 are:

    i      h(i)        g(i)
    1      0.4830     -0.1294
    2      0.8365     -0.2241
    3      0.2241      0.8365
    4     -0.1294     -0.4830

Figure 3. Daubechies (Scale and Wavelet) Family of Order Two and its FFT

Optimal (wavelet packet) decompositions are then found by minimizing the Shannon entropy as described in Sect. 4. The reconstructed images capture relevant characteristics (’features’) and discard structural ’noise’. The root mean square error (erms) between the reconstructed and the original image is computed as

    erms = [ (1/256) Σi=1..256 (fR(xi) − fO(xi))² ]^(1/2)    (25)

Note that perfect reconstruction is achieved when one uses the whole set of 256 wavelet coefficients. The quality of the reconstructed images, using the whole set of 256 coefficients or a reduced set consisting of only the 16 largest coefficients from the optimal (minimal entropy) induced tree, is shown in Fig. 4. The reconstructed images (here using the 16 or 64 largest coefficients) and the raw original images are then passed to the Radial Basis Function (RBF) classifier to assess the relevance, if any, of optimal signal decompositions (’functional approximation’) for classification tasks.
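The compression-and-reconstruction step and the error measure of Eq. (25) can be sketched as follows. This illustrative sketch substitutes a full orthonormal Haar DWT for the best-basis Daubechies packets used in the experiments; keeping the m largest-magnitude coefficients and measuring erms follows the text.

```python
import math

def haar_forward(x):
    # Full orthonormal Haar DWT of a signal whose length is a power of two.
    out, n, root2 = list(x), len(x), math.sqrt(2.0)
    while n > 1:
        low = [(out[i] + out[i + 1]) / root2 for i in range(0, n, 2)]
        high = [(out[i] - out[i + 1]) / root2 for i in range(0, n, 2)]
        out[:n] = low + high
        n //= 2
    return out

def haar_inverse(c):
    # Invert haar_forward level by level.
    out, n, root2 = list(c), 1, math.sqrt(2.0)
    while n < len(out):
        merged = []
        for a, b in zip(out[:n], out[n:2 * n]):
            merged += [(a + b) / root2, (a - b) / root2]
        out[:2 * n] = merged
        n *= 2
    return out

def compress(x, m):
    # Keep only the m largest-magnitude coefficients, zero the rest,
    # and reconstruct the signal from the truncated expansion.
    c = haar_forward(x)
    keep = set(sorted(range(len(c)), key=lambda i: abs(c[i]), reverse=True)[:m])
    return haar_inverse([ci if i in keep else 0.0 for i, ci in enumerate(c)])

def e_rms(rec, orig):
    # Eq. (25): root mean square error between reconstructed and original.
    return math.sqrt(sum((r - o) ** 2 for r, o in zip(rec, orig)) / len(orig))
```

Because the transform is orthonormal, the squared error equals the energy of the discarded coefficients, so keeping all coefficients gives perfect reconstruction and the error can only shrink as m grows, mirroring the behavior noted for the 256-element eye vectors.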

Figure 4. Reconstructed Eye Images Using Best Wavelet Packet Bases

6.3 Classification Using Radial Basis Functions (RBFs)

The construction of the RBF network involves three different layers (Fig. 5). The input layer is made up of source nodes (sensory units), here fed with optimal wavelet packets. The second layer is a hidden layer whose goal is to cluster the data and further reduce its dimensionality. Nodes in the middle layer, called basis function (BF) nodes (’kernels’), are similar to the clusters derived using k-means training. The BF kernels produce a localized response to the input using Gaussian kernels, and each hidden unit can further be viewed as a localized receptive field (RF). The output layer supplies the response of the network to the activation patterns applied to the input layer, corresponding here to the eye and non-eye classes. The transformation from the input space to the hidden-unit space is non-linear, whereas the transformation from the hidden-unit space to the output space is linear. Connections between the input and middle layers have unit weights and, as a result, do not have to be trained.

Figure 5. Radial Basis Function (RBF) Architecture

The basis functions (BF) used are Gaussians, and the activation level yi of the hidden unit i is given by

    yi = Φi(||X − µi||²) = exp[ − Σk=1..D (xk − µik)² / (2 h σik² o) ]    (26)

where h is a proportionality constant for the variance, xk is the kth component of the input vector X = [x1, x2, ..., xD], µik and σik² are the kth components of the mean and variance vectors defining the BF, and o is the overlap factor between BFs. The outputs of the hidden units lie between 0 and 1; the closer the input is to the center of the Gaussian, the larger the response of the BF node, and the output can be interpreted as a degree of fuzzy membership. The width of the Gaussian for each cluster, i.e., the within-class diameter, is set to the maximum of {the distance between the center of the cluster and its farthest away member, the distance between the center of the cluster and the closest pattern from all the other clusters} multiplied by the overlap factor o. The width is dynamically refined using different proportionality constants h. The hidden layer yields the equivalent of a functional eye basis, where each cluster node encodes some common characteristics across the eye space.

The activation level Zj of an output unit is given by

    Zj = Σi wij yi + w0j    (27)

where Zj is the output of the jth output node, yi is the activation of the ith BF node, wij is the weight connecting the ith BF node to the jth output node, and w0j is the bias or the threshold of the jth output node. The bias comes from the weights associated with a BF node that has a constant unit output regardless of the input. The output (supervised) layer maps training examples to their corresponding classes (eye and non-eye) and finds the corresponding (hidden layer) expansion (’weight’) coefficients using SVD techniques. An unknown vector X is classified as belonging to the class associated with the output node j which yields the largest output Zj. The RBF input consists of normalized (optimal wavelet packet) values fed to the network as 1-D vectors.

The experiments carried out involved 50 eye and 20 non-eye images for training, and a different set consisting of 40 eye and 20 non-eye images for testing. Note that the number of clusters is frozen for that configuration (number of clusters and specific proportionality constant h) which yields 100% accuracy when validated on the same training images. The same experiment was performed for the case when original images were used and for the cases when the images used were reconstructed using the best 16 and 64 wavelet coefficients, respectively.
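The two classifier stages of Eqs. (26) and (27) can be sketched as follows. This is our own illustrative code with hypothetical parameter values; in the experiments the centers, variances, and output weights come from k-means clustering and SVD training as described above.

```python
import math

def bf_activation(x, mu, sigma2, h=1.0, o=1.0):
    # Eq. (26): Gaussian response of one basis-function (hidden) node,
    # with proportionality constant h and overlap factor o.
    s = sum((xk - mk) ** 2 / (2.0 * h * s2 * o)
            for xk, mk, s2 in zip(x, mu, sigma2))
    return math.exp(-s)

def rbf_outputs(x, centers, variances, weights, biases, h=1.0, o=1.0):
    # Eq. (27): each output Z_j is a weighted sum of the hidden
    # activations y_i plus the bias w_0j.
    y = [bf_activation(x, mu, s2, h, o) for mu, s2 in zip(centers, variances)]
    return [sum(wij * yi for wij, yi in zip(w_j, y)) + b_j
            for w_j, b_j in zip(weights, biases)]

def classify(x, centers, variances, weights, biases):
    # The class of the output node with the largest activation wins.
    z = rbf_outputs(x, centers, variances, weights, biases)
    return max(range(len(z)), key=lambda j: z[j])
```

For instance, with one Gaussian centered near the eye prototypes and one near the non-eye prototypes, an input close to a center drives that node's activation toward 1 and the other toward 0, and the argmax over the two linear outputs yields the label.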

Table 1 shows the performance of correct eye and non-eye classification corresponding to the original and reconstructed images. From the above experiments one concludes that the performance of the Radial Basis Function (RBF) classifier improves when the original (raw) images are replaced by optimally reconstructed images using the ’best’ wavelet coefficients.

               Original (8 x 32) Data    Compressed Data (16 coeffs)    Compressed Data (64 coeffs)
    Eye        70%                       82.5%                          85.0%
    Non-eye    70%                       100.0%                         80.0%

Table 1. Eye vs. Non-Eye Classification Results

Another experiment was carried out to assess how robust the eye detection approach is on miscentered eye images. Twenty-eight test images were produced by shifting the center of the eye from its original position to some new positions using upward/downward two-pixel shifts and/or left and right two- or four-pixel shifts. The classification rates obtained for original vs reconstructed images using the best 64 wavelet coefficients were 50% and 80%, respectively.

Note that the best RBF classifier based on optimal wavelet packets can be applied exhaustively across the whole facial landscape in order to detect the eyes, or alternatively it can be used only on salient locations likely to cover the eye regions. In the latter case, one would possibly evolve a homing routine to first guide an animat toward the eye(s) and then use the RBF classifier derived earlier to confirm that the eye region has indeed been reached (Huang, 1998; Huang and Wechsler, 1999).

7. Conclusions

We have shown in this paper that eye images compressed using optimal wavelet packets (’bases’) lead to enhanced performance on the eye detection task when compared to the case where original raw images are used. As entropy minimization drives the derivation of optimal wavelet packets, clustering (’prototyping’) captures the most significant characteristics of the underlying data and it is thus more compatible with the 1st operational stage of the RBF classifier. Our hybrid approach is related to the probabilistic

visual learning (PVL) method for object recognition (Moghaddam and Pentland, 1997). PVL measures similarity for recognition purposes in terms of distance-from-feature-space (DFFS) (’representation adequacy’) and distance-in-feature-space (DIFS) (’classification accuracy’). While DFFS measures representation adequacy using the reconstruction error as criterion, the filter component of our hybrid approach seeks representation adequacy in terms of entropy minimization. Both PVL and our hybrid approach address representation and classification accuracy independent of each other. An alternative would be to develop a hybrid approach where fitness and success are defined in terms of a combined fitness measure evolved using Genetic Algorithms (GAs) (Liu and Wechsler, 1998). The suggested measure would take into account the tradeoffs between representational adequacy and classification accuracy, possibly including the compactness of the representation as well. The updated scheme would then substitute regularization for the ERM as the inductive principle, as representation adequacy is traded off for classification accuracy in terms of compact and well separated clusters.

One additional direction for future research would integrate the concepts of optimal signal sampling, characteristic of signal restoration, and optimal pattern sampling, characteristic of learning machines, with the joint goals of compact signal approximation and improved classification accuracy. Examples of optimal pattern sampling include active learning (Krogh and Vedelsby, 1995) and support vector machines (SVM) (Vapnik, 1995). Active learning, as a form of corrective training, seeks to determine what the training (’bases’) set should consist of in order to improve the guaranteed prediction risk. During the early stages of training most of the examples are misclassified, while as time goes on the set of misclassified examples diminishes. Specifically, active learning seeks to determine the relevant sample patterns, or what SVM would label as support vectors (SV), for the purpose of defining the actual classifier. As both signals and patterns have to be approximated using optimal (kernel) bases (’receptive fields’ and ’support vectors’), the computational quest should now be after such conjoint ’bases’, with the expectation of enhanced generalization abilities and corresponding lower predictive risk. Such kernel bases amount to optimal ’dictionaries’ consisting of those (kernel) primitives able to express best, and in a compact form, a wide range of meaningful inputs characteristic of some sensory environment, leading eventually to both pattern classification and robust sensory-motor schemes. Our approach is also similar to the evolution of lattices of receptive fields (Linsker, 1988), suitable for both adaptive and optimal functional decomposition and signal classification.

Acknowledgments: This work was partially supported by the U.S. Army Research Laboratory under contract DAAL01-97-K-0118.

References

Barron, A. (1991). Approximation and Estimation Bounds for Artificial Neural Networks, in Proc. 4th Workshop on Computational Learning Theory, L. Valiant and M. Warmuth (Eds.), Morgan Kaufmann, 243-249.

Chellappa, R., C. Wilson and S. Sirohey (1995). Human and Machine Recognition of Faces: A Survey. Proc. IEEE 83(5), 705-740.

Cherkassky, V. and F. Mulier (1998). Learning from Data: Concepts, Theory and Methods. Wiley.

Cherkassky, V. and X. Shao (1999). Signal Estimation and Denoising Using VC-Theory (under review).

Coifman, R. R. and M. V. Wickerhauser (1992). Entropy-Based Algorithms for Best Basis Selection. IEEE Trans. on Information Theory 38(2), 713-718.

Daubechies, I. (1988). Orthonormal Bases of Compactly Supported Wavelets. Comm. on Pure and Appl. Math. 41, 909-996.

Daugman, J. (1990). An Information-Theoretic View of Analog Representation in Striate Cortex, in Computational Neuroscience, E. Schwartz (Ed.), MIT Press, 403-424.

Daugman, J. (1995). Wavelets: Why so Many Varieties? UK Symposium on Applications of Time-Frequency and Time-Scale Methods, University of Warwick, Coventry, UK.

Gabor, D. (1946). Theory of Communication. Journal of the IEE 93, 429-459.

Hartman, E. and J. Keeler (1991). Predicting the Future: Advantages of Semilocal Units. Neural Computation 3, 566-578.

Huang, J. (1998). Evolutionary Computation and Face Recognition, in Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman Soulie and T. S. Huang (Eds.), Springer-Verlag.

Huang, J. and H. Wechsler (1999). Visual Routines for Eye Location Using Learning and Evolution. IEEE Trans. on Evolutionary Computation (in press).

Kohavi, R. and G. John (1995). Wrappers for Feature Subset Selection. Technical Report, Computer Science Department, Stanford University.

Krogh, A. and J. Vedelsby (1995). Neural Network Ensembles, Cross Validation and Active Learning, in Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky and T. Leen (Eds.), MIT Press, 231-238.

Linsker, R. (1988). Self-Organization in a Perceptual Neural Network. Computer 21, 105-117.

Lippmann, R. and K. Ng (1991). A Comparative Study of the Practical Characteristics of Neural Networks and Pattern Classifiers. Technical Report 894, Lincoln Labs, MIT.

Liu, C. and H. Wechsler (1998). Face Recognition Using Evolutionary Pursuit. Proc. 5th European Conf. on Computer Vision (ECCV), Freiburg, Germany.

Malinowski, A. and J. Zurada (1994). Minimal Training Set Size Estimation for Sampled-Data Function Encoding. Proc. World Congress on Neural Networks (WCNN), San Diego, CA, Vol. II, 366-371.

Mallat, S. (1989). A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7), 674-693.

Moghaddam, B. and A. Pentland (1997). Probabilistic Visual Learning for Object Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 696-710.

Phillips, P. J., H. Wechsler, J. Huang and P. Rauss (1998). The FERET Database and Evaluation Procedure for Face Recognition Algorithms. Image and Vision Computing 16(5), 295-306.

Samal, A. and P. Iyengar (1992). Automatic Recognition and Analysis of Human Faces and Facial Expressions: A Survey. Pattern Recognition 25, 65-77.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag.

Zhang, Q. and A. Benveniste (1992). Wavelet Networks. IEEE Trans. on Neural Networks 3(6), 889-898.

JEFFREY R. Coventry. Benveniste (1992). He was Director for the NATO Advanced Study Institutes (ASI) on "Active Perception and Robot Vision" (Maratea. V. He authored over 200 scientific papers. HARRY WECHSLER received the Ph. UK Symposium on Applications of TimeFrequency and Time-Scale Methods. Virginia. Biometrics and Forensics. Moghaddam. 1993) and "Face Recognition: From Theory to Applications" (Stirling. Samal A. Since 1995. in the field of intelligent systems. 11(7). "From Statistics to Neural Networks" (Les Arcs. Statistical Learning Theory. Probabilistic Visual Learning for Object Representation. Vapnik.D. Engineering in Medicine and Biology Society and the member of AAAI. in 1996. his book "Computational Vision" was published by Academic Press in 1990. ISAB. and SPIE. Automatic Recognition and Analysis of Human Faces and Facial Expressions: A Survey. MULTIMEDIA and VIDEO PROCESSING: Large Image Data Bases. Image and Visual Computing. Pattern Recognition 25. Association of Pattern Recognition (IAPR) Fellow in 1998. UK. Data Mining and Knowledge Discovery. .D. B. video processing and multimedia. and A. he has published more than 25 scientific papers including 2 book chapters. 16(5). He was elected as an IEEE Fellow in 1992 and as an Int. (1995). Document Processing. (1995). and he is presently Professor of Computer Science at George Mason University. Springer Verlag. Zhang. has been in the areas of PERCEPTION: Computer Vision (CV). His research. Signal and Image Processing (SIP). on Pattern Analysis and Machine Intelligence. and he was the editor for "Neural Networks for Perception" (Vol 1 & 2). (1989) A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. and HUMANCOMPUTER INTELLIGENT INTERACTION (HCII) : Face and Hand Gesture Recognition.18 - . Fairfax. Huang and P. computer vision. IEEE Trans. IEEE Trans. Jeffrey Huang is a member of the IEEE Computer Society. in Information Technology at George Mason University. in 1975. 
MACHINE INTELLIGENCE and LEARNING: Pattern Recognition (PR). S. Phillips. The range of applications covers biometrics and face recognition. Iyengar (1992). Wavelets: Why so Many Varieties?. 19(7). UK. The FERET Database and Evaluation Procedure for Face Recognition. EVOLUTIONARY COMPUTATION: Genetic Algorithms (GAs) and Animats. 1997). HUANG received the Ph. in Computer Science from the University of California. 295-306. pattern recognition. Pentland (1997). J. J. University of Warwick. Wechsler. on Pattern Analysis and Machine Intelligence. IEEE Trans. and he is now a researcher with the Biometrics and Computer Vision Lab in the Department of Computer Science at George Mason University. Dr. Huang’s research interests include image processing. Information Retrieval. Italy. Rauss (1998). and A. H. human-computer interaction. and he served as co-Chair for the International Conference on Pattern Recognition held in Vienna. P. 65-77. Irvine. 674-693. Austria. Q. J. Wavelet Networks. 3(6). published by Academic Press in 1991. Wilson. 1989). and evolutionary computation and genetic algorithms. and P. 889898. 696-710. and automated target/object detection and recognition.. on Neural Networks .Mallat. Neural Networks (NN). France. R. Automatic Target Recognition (ATR).