You are on page 1of 15

178

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

Face Recognition Using Total Margin-Based


Adaptive Fuzzy Support Vector Machines
Yi-Hung Liu, Member, IEEE, and Yen-Ting Chen

AbstractThis paper presents a new classifier called total


margin-based adaptive fuzzy support vector machines (TAF-SVM)
that deals with several problems that may occur in support vector
machines (SVMs) when applied to the face recognition. The proposed TAF-SVM not only solves the overfitting problem resulted
from the outlier with the approach of fuzzification of the penalty,
but also corrects the skew of the optimal separating hyperplane
due to the very imbalanced data sets by using different cost
algorithm. In addition, by introducing the total margin algorithm
to replace the conventional soft margin algorithm, a lower generalization error bound can be obtained. Those three functions
are embodied into the traditional SVM so that the TAF-SVM is
proposed and reformulated in both linear and nonlinear cases.
By using two databases, the Chung Yuan Christian University
(CYCU) multiview and the facial recognition technology (FERET)
face databases, and using the kernel Fishers discriminant analysis (KFDA) algorithm to extract discriminating face features,
experimental results show that the proposed TAF-SVM is superior
to SVM in terms of the face-recognition accuracy. The results
also indicate that the proposed TAF-SVM can achieve smaller
error variances than SVM over a number of tests such that better
recognition stability can be obtained.
Index TermsFace recognition, kernel Fishers discriminant
analysis (KFDA), support vector machines (SVMs).

I. INTRODUCTION

ANY computer vision-based systems have become more


and more important and attractive in recent years, such as
the surveillance, automatic access control, and the humanrobot
interaction. Face recognition plays a critical role in those applications. Due to the complicated pattern distribution from large
variations in facial expressions, facial details, illumination conditions, and viewpoints, the face-recognition task has been considered as one of the most difficult pattern-recognition research
fields. Recently, various approaches have been proposed, e.g.,
[3], [5], [12], [15], [16], [22], [23], and [25][32]. From these
systems, we can conclude that how to extract discriminating features from raw face images and how to accurately classify different people based on these input features are the two keys to
the development of reliable and high-accuracy face-recognition
systems. This paper aims to propose a new classifier called total
margin-based adaptive fuzzy support vector machines (TAFSVM), which can enhance the performance of support vector
Manuscript received July 1, 2005; revised March 1, 2006. This work was
supported by the National Science Council of Taiwan, R.O.C., under Grant
93-2212-E-033-011.
The authors are with the Department of Mechanical Engineering, Chung Yuan
Christian University, Chung-Li 32023, Taiwan, R.O.C. (e-mail: lyh@cycu.edu.
tw).
Digital Object Identifier 10.1109/TNN.2006.883013

machines (SVM) for face recognition. In addition to classifier


design, selecting a good feature extractor is also necessary.
A. Feature Selection
Principal component analysis (PCA) [12] and Fishers linear
discriminant analysis (FLDA) are widely used linear subspace
analysis methods in facial feature extraction. Compared with
PCA, FLDA owns more abilities to extract discriminating features since its objective is to maximize the between-class and
minimize the within-class scatters. FLDA has been successfully
applied to face recognition in [32] and shown to be superior to
PCA. Due to the linear nature, the capabilities of linear subspace
analysis methods are still limited. Motivated by the success of
the use of kernel trick in the SVMs [8], [13], Schlkopf et al. [24]
proposed the kernel PCA (KPCA) by combining the PCA with
the kernel trick. Since the kernel trick is capable of representing
nonlinear relations of input data, KPCA is better than PCA in
terms of representation and reconstruction. This has been also
evidenced by Kims work [25] in which KPCA combined with
linear SVM classifier was applied to face recognition.
Another nonlinear subspace analysis method called generalized discriminant analysis (GDA) or kernel Fishers discriminant analysis (KFDA) was proposed by Baudat et al. [9]. KFDA
first nonlinearly maps input data into higher dimensional feature
space in which FLDA is performed. Recently, several works have
shown that KFDA was much effective than KPCA in face recognition [3], [22], [23]. This is due to the fact that KFDA keeps the
nature of FLDA, which is based on the separability maximization criterion while the unsupervised learning-based KPCA is
still only designed for the pattern representation/reconstruction.
Therefore, this paper adopts the KFDA as the feature extractor
such that the goals of the extraction of discriminating features
and the reduction of input dimensionality can be reached.
B. Classifier Design
Although KFDA has been proven its superiority to discriminating features extraction, it suffers from the problems that its
performance would drop while meeting new inputs that have
never been considered in the training process, for example, a
test face, whose viewpoint does not face the camera, while the
training faces are of frontal face. Features extracted with KFDA
are not invariant to these large changes because KFDA is essentially a kind of appearance-based method. In [3], authors
suggested that a more sophisticated classifier, compared with
nearest neighbor (NN) classifier, was still needed even though
the KFDA algorithm has been employed for the multiview face
recognition because the face-pattern distribution would still be
nonseparable in the KFDA-based subspace. In other words, a

1045-9227/$20.00 2006 IEEE

LIU AND CHEN: FACE RECOGNITION USING TAF-SVMS

classifier with good generalization ability and minimal empirical risk is necessary for making up the drawback of the appearance-based feature extractor. Based on this, an SVM can serve
as a good classifier candidate.
SVM was proposed by Vapnik et al. [13] and has been successfully applied to various applications such as the unsupervised segmentation of switching dynamics [46], face membership authentication [47], and image fusion [48]. Recently, several works relative to face recognition have used SVMs as classifiers and yielded satisfactory results [25][31]. In those systems,
the SVMs used are of regular SVM. However, some researches
which are not directly relative to the search of face recognition
have indicated that SVM would suffer from several critical problems when applied to classify some particular data types. The
first problem is that SVM is very sensitive to outliers since the
penalty weight for every data point is the same [5][7]. Second,
the class-boundary-skew problem will be met when SVM is applied to the problem of learning from imbalanced data sets in
which the negative data heavily outnumber the positive data
[1], [11], [17], [33]. The class boundary, i.e., the optimal separating hyperplane (OSH), learned by SVM, can be skewed towards the positive class. In consequence, the false-negative rate
can be very high and can make SVM ineffective in identifying
the targets that belong to the positive class, which results in the
class-boundary-skew problem. The two problems limit the performance of SVM. Unfortunately, they also occur in the application of SVM-based face recognition.
In face recognition, for example, a face image with an exaggerated expression may result in the existence of an outlier. If
the outlier possesses nonzero value of slack variable, the soft
margin algorithm used in regular SVM would start to find a hyperplane to let the error be correct. The overfitting problem may
follow. The other problem is that SVM was originally designed
for the binary classification, while face recognition is practically
a problem of multiclass classification. To extend the binary SVM
to multiclass face recognition, most existing systems [25][31]
used the one-against-all (OAA) method. As far as the computational effort is concerned, OAA may be more efficient than
one-against-one (OAO) strategy. The advantage of OAA over
OAO is that we only have to construct one hyperplane for each
of the classes instead of
pairwise decision functions.
;
This decreases the computational effort by a factor of
[35]. This
in some examples, it can be brought down to
may be the reason that authors of [25][31] used OAA in their
systems, though it has been reported that OAO is more efficient
than OAA in terms of classification accuracy [2], [36], [45]. As
OAA method is used, one of the classes will be the target class
and the rest
classes will be the negative class forthe learning
of each OSH. The class-boundary-skew problem occurs. Also,
the larger the number of classes becomes, the more imbalanced
the training set is when OAA method is applied.
To remedy these problems when SVM is applied to face recognition, this paper proposes a new classifier called TAF-SVM.
TAF-SVM is able to solve the overfitting problem by fuzzifying
the training set which is equivalent to fuzzifying the penalty term
[7], [44]. With this manner, training data are no longer treated
equally but treated differently according to their relative importance. Besides, TAF-SVM also embodies the different cost al-

179

gorithm [11], [33], by which TAF-SVM can adapt itself to the


imbalanced training set such that the false-negative rate is reduced and the recognition accuracy is enhanced.
Another contribution of this paper is that we replace the soft
margin algorithm by introducing the total margin algorithm [4]
to TAF-SVM. The total margin algorithm not only considers
the errors but also involves the information of correctly classified data points in the construction of OSH. Compared with the
conventional soft margin algorithm used in the regular SVM, a
lower generalization error bound can be reached. This can facilitate the face recognition since the generalization ability plays
a very important role for the predictions of unseen face images.
We combine these approaches in TAF-SVM and show that we
can significantly improve the face-recognition accuracy compared to applying any one approach including SVM also.
This paper is organized as follows. Section II presents the
KFDA-based feature extraction method. A brief review of SVM
is given in Section III. Then, the problems of applying SVM to
face recognition are pointed out in detail together with the solutions embodied in the TAF-SVM. In Section IV, we reformulate
the TAF-SVM in both linear and nonlinear cases. Experimental
results are presented and discussed in Section V. Conclusions
are drawn in Section VI.
II. FEATURE EXTRACTION VIA KFDA
A face image is first scanned row by row to form a vector
. The training set contains
images out of
of
subjects, namely
and
,
,
is the set of class and is the cardinality of . For
where
KFDA, the within-class scatter
and between-class scatter
in space are given by
(1)

(2)
where is a nonlinear mapping function that maps the data
from input space to a higher dimensional feature space:
.
denotes the th face image in the
th class. The mapped data are centered in space [9], [24].
KFDA seeks to find a set of discriminating orthonormal eigenfor the projection of input face image by
vectors
performing FLDA in space in which the between-class scatter
is maximized and the within-class scatter is minimized. This is
equivalent to solving the following maximization problem:
(3)
Solutions associated with the largest nonzero eigenvalues
must lie in the span of all mapped data; so, for , there exists
a normalized expansion coefficient vector
such that
(4)

180

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

Thus, for a testing face image , its projection on the th eigenvector is computed by

minimization (SRM) principle, by which good generalization


ability can be achieved [8].
By introducing the Lagrangian, the primal constrained optimization problem can be solved with its dual form. The predicted class of an unseen data is the output of the decision
function

(5)
We do not need to know the nonlinear mapping exactly. By
can be easily obtained
using the kernel trick, the projection
by

(6)
where the kernel function is defined as the dot product of vectors

sign

(11)

are the nonnegative Lagrange multipliers for the first


where
are
inequality constraints in the primal problem (10a),
, and
is the number
support vectors for which
of support vectors. The optimal value of
is calculated with
KuhnTucker (KT) complementary conditions.
B. Basic Ideas of TAF-SVM

(7)
The radial basis function (RBF) kernel is used in this paper and
is expressed as
(8)
is specified a priori by the user.
where the width
eigenvecTo project a face image into new coordinates,
tors associated with the first largest nonzero
eigenvalues
are selected to construct the transformation matrix or
such that the dimensionality of a face image
to
.
is reduced from
To simplify the notation used in the following, we let the number
be equal to .
of projection vectors
III. BASIC IDEAS OF TAF-SVM
A. Basic Review of SVM
, where
In SVM, the training set is given as
is the training data and is its class label being either
1 or 1. Let and be the weight vector and the bias of
the separating hyperplane, the objective of SVM is to find the
OSH by maximizing the margin of separation and minimizing
the training errors
Minimize

(9)

Subject to

(10a)
(10b)

,
is the nonlinear mapping
where
function which maps the data into a higher dimensional feature
space from the input space. are slack variables representing
the error measures of data points. The penalty weight is a free
parameter; it measures the size of the penalties assigned to the
errors. Minimizing the first term in (9) is equivalent to maximizing the margin of separation, which is related to minimizing
the VapnikChervonenkis (VC) dimension. Formulation of the
objective function in (9) is perfect accord with the structural risk

1) Dealing With the Overfitting Problem via Fuzzification of


Training Set: One issue on using SVM for face recognition is
how to tackle the overfitting problem since large variation resulted from facial expressions and viewpoints may produce the
outliers appearing in the pattern distribution. As shown in previous researches [5], [6], SVM is very sensitive to outliers or
noises since the penalty term of SVM treats every data point
equally in the training process. This may result in the occurrence
of overfitting problem if one or few data points have relatively
very large values of . Wang et al. and Huang et al. proposed
the fuzzy SVM (FSVM) to deal with the overfitting problem
[7], [44], based on the idea that a membership value is assigned
to each data according to its relative importance in its class so
that a less important data is punished less. To achieve this, the
is redefined in FSVM where
is
fuzzy penalty term
the membership value denoting the relative importance of point
to its own class.
We incorporate the concept of FSVM into the proposed TAFSVM. The training set is first divided into two sets: the fuzzy
and the fuzzy negative training set
,
positive training set
denoted by
(12a)
(12b)
and
stand
where the membership values
and
to
for the relative importance of the points
the positive class and negative class, respectively. The variable
is a small positive real number.
and
are the cardinalities
of fuzzy positive training set and fuzzy negative training set,
.
respectively, and
2) Adaptation to Imbalanced Face Training Sets via Different
Cost Algorithm: Face recognition is practically a task of multiclass classification while SVM was designed for the binary
classification. OAO and OAA methods are two popular ways to
realize the SVM-based multiclass classification task [2]. Based
on the pairwise learning framework, OAO method needs to conOSHs and use the voting strategy to make final
struct
decisions if there are subjects to be recognized. Compared with
OAO method, OAA method, by which only OSHs need to

LIU AND CHEN: FACE RECOGNITION USING TAF-SVMS

181

be learned, is more effective in terms of computational effort.


Therefore, it is found that most existing SVM-based face-recognition systems chose the OAA method to accomplish the task of
multiclass classification [25][31]. However, a critical problem,
the class-boundary-skew phenomenon, which had never been
pointed out in these SVM and OAA method-based face-recognition systems, is followed.
By using OAA method to learn each OSH for multiclass face
recognition, one of the subjects forms the positive class and
the rest form the negative class. With this manner, the training
faces of the negative class significantly outnumber the training
faces in positive class. The ratio of the size of negative class to
. A very imbalanced face
the size of positive class is
training set is produced. The larger the number of subjects is,
the heavier the imbalance of the face training set is.
It has been recently reported that the success of SVM is
limited when applied to the imbalanced data sets [1], [11], [17],
[33] because the OSH would be skewed towards the positive
class and results in the class-boundary-skew phenomenon. To
solve this critical problem, some remedies have been proposed
including the oversampling and undersampling techniques [18],
combining oversampling with undersampling [19], synthetic
minority oversampling technique (SMOTE) [20], different error
cost algorithms [1], [33], class-boundary-alignment algorithm
[17], and SMOTE with different cost algorithm (SDC) [11].
Those methods can be divided into three categories. The
methods proposed in [18][20] process the data before feeding
them into the classifier. The oversampling technique duplicates the positive data by interpolation while undersampling
technique removes the redundant negative data to reduce the
imbalanced ratio. They are classifier-independent approaches.
The second category belongs to the algorithm-based approach
[1], [17], [33]. For example, Veropoulos et al. [1] and Lin et
al. [33] proposed the different cost algorithms suggesting that
by assigning heavier penalty to the smaller class, the skew
of the OSH can be corrected. The third category is the SDC
method which combines the SMOTE and the different error
cost algorithm [11].
For face recognition, since each training data stands for the
particular face information, we attempt not to use any presampling techniques. Instead, the proposed TAF-SVM adopts the
different cost algorithm to achieve the goal of adaptation to the
imbalanced face training sets when faced with the OAA-based
multiclass classification. Another reason for using this algorithm is that the different cost algorithm was originally designed
for solving the skew problem of SVM. By combining the fuzzy
penalty and the different cost algorithm, the proposed fuzzified
biased penalties are expressed as

(13)
and
are the penalty weights for the errors of the
where
positive class and negative class, respectively. The slack variand
are the measurement of errors of the data beables
longing to the positive class and the negative class, respectively.
to meet the central concept of different
By setting

cost algorithm, the OSH will be much more distant from the
smaller class.
3) Improvement of the Generalization Error Bound via Total
Margin Algorithm: Due to the fact that it is impossible to take
all face information into consideration, i.e., the available face
training samples are always finite and not numerous, the generalization ability for a classifier dominates the prediction accuracy for unseen faces. Soft margin algorithm used in SVM
relaxes the measure of margin by introducing slack variables
to errors. An OSH is found with the maximal margin of sepaby maximizing the minimum distance between
ration
few extreme values (support vectors) and the separating hyperplane. However, only few extreme training data that are used
would cause the loss of information because the most of information is contained in the nonextreme data, which occupy
the majority in the training set. Feng et al. proposed the scaled
SVM [21], which not only employed the support vectors but
also involved the means of the classes to reduce the generalization error of SVM. However, the face-pattern distribution is generally non-Gaussian and highly nonconvex [3], [22]. Namely,
the mean of a class may not be very representative. Another approach for improving the generalization error bound called total
margin algorithm has been also proposed by Yoon et al. [4].
The total margin algorithm extends the soft margin algorithm
by introducing extra surplus variables to the correctly classified data points . The surplus variable measures the distance
between the correctly classified data point and the hyperplane
, if this data point belongs to the positive/negative class. In addition to minimizing the sum of slack variables
(the misclassified data points) while maximizing the margin
of separation proposed by soft margin algorithm, total margin
algorithm suggests that the sum of surplus variables (the correctly classified data points) should also be maximized simultaneously. Maximizing the sum of surplus variables is equivalent
, which in turn is equivalent to minimizing
to maximizing
. Therefore, total margin-based SVM is formulated as
the constrained optimization problem
Minimize
Subject to
(14)
where
is the weight for the misclassified data points and
is the weight for the correctly classified data points, i.e., the
surplus variables .
From (14), we can see that the construction of OSH is no
longer controlled only by few extreme data points in which
most of them may be misclassified data points, but also by the
correctly classified data points, which are the majority of the
training set. The advantages are clear. First, the essence of the
soft margin-based SVM is to rely only on the set of data points
which take extreme values, the so-called support vectors. From
the statistics of extreme values, we know that the disadvantage
of such an approach is that the information contained in most
samples (not extreme values) is lost, so that such an approach
is bound to be less efficient than one that takes into account the
lost information [21], i.e., the correctly classified data points.
Therefore, total margin algorithm can be more efficient and gain

182

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

Fig. 1. Geometric interpretation of slack variables and surplus variables used in TAF-SVM.

better generalization ability than soft margin algorithm since the


information of all samples are considered in the construction
of OSH. Second, from the objective expressed in (14), we can
implies that the obtained OSH is
see that minimizing
able to gain more correctly classified data points because miniis equivalent to maximizing the sum of surplus
mizing
variables. Therefore, in this paper, we adopt the total margin algorithm as one of the bases in the development of TAF-SVM
for the face recognition.
In order to facilitate the reformulation of TAF-SVM in Section IV, the usage of combining surplus variables and the imbalanced penalties is illustrated here. Since the TAF-SVM considers both of the different cost algorithm and the total margin
algorithm, the geometric relationship between the positive/negative slack variables and the positive/negative surplus variables
is illustrated in Fig. 1.
In Fig. 1, the white circles and the black circles denote the
data points belonging to the positive class and the negative class,
measures the distance berespectively. The slack variable
and the misclassified data point
tween the hyperplane
which is supposed to be classified as the positive class. Conis the measurement from the misclassified data point
trarily,
to the hyperplane
. The surplus variable
measures the distance between the correctly classified data point
and the hyperplane
. The surplus variable
measures
the distance between the correctly classified data point and the
. All these variables are nonnegative varihyperplane
and
will be zero for a data point .
ables. At least one of
Furthermore, we assume that are the positive training data
points, in which any of can have two different classification
results: misclassified and correctly classified. Table I summaries
and the surplus
the relationship between the slack variable
variable
according to different classification results of .
Notice that the used in Table I can be any data point among
all the positive training data points, while the shown in Fig. 1
is just one misclassified positive training data point.

IV. REFORMULATION OF TAF-SVM


In this section, we reformulate the proposed TAF-SVM for
linearly nonseparable and nonlinearly nonseparable cases based
on the aforementioned ideas.

TABLE I
INTERPRETATION OF THE RELATIONSHIPS BETWEEN SLACK VARIABLES,
SURPLUS VARIABLES, AND THE POINT LOCATIONS BY TAKING
POSITIVE TRAINING DATA POINTS AS EXAMPLE

A. Linearly Nonseparable Case


The primal problem for the linearly nonseparable case is reformulated as follows:
Minimize
(15)
Subject to
(16a)
(16b)
(16c)
(16d)
(16e)
(16f)
where
and
are the weights for positive and negative
slack variables, respectively.
and
are the weights for
the positive and negative surplus variables, respectively. It is
difficult to solve this constrained optimization problem. Similar to SVM, the primal optimization problem of TAF-SVM is

LIU AND CHEN: FACE RECOGNITION USING TAF-SVMS

183

transformed to the dual form by introducing a set of nonnega,


,
,
,
, and
for the
tive Lagrange multipliers
constraints from (16a) to (16f) to yield its Lagrangian

The constraints for this maximization problem are the same as


those in the dual form of the linear case (20a)(20c). The KT
complementary condition plays a key role in the optimality. The
KT complementary conditions for the nonlinear TAF-SVM are
given by

(22a)

(22b)
(22c)
(22d)
(22e)
(22f)
(17)

Differentiation with respect to

, ,

, and

yields

can be calculated with any data in


The optimal value of
the training set satisfying the KT complementary conditions.
However, from the numerical perspective, it is better to take the
resulting from all such data [14]. Therefore,
mean value of
the optimal value of is computed by

(18a)
(18b)
(18c)

where

and

are the subsets of

and

(23)
, respectively
(24a)

(18d)

(24b)
(18e)
(18f)
By the resubstitution of these equations into primal problem, the
dual problem is obtained
Maximize

(19)

Subject to

(20a)
(20b)
(20c)

B. Nonlinearly Nonseparable Case


The dual form for the nonlinearly nonseparable case can be
obtained by using kernel function
where
,
is a nonlinear map. The
objective is as follows:
Maximize

(21)

For an unseen data , its predicted class is the output of the


decision function

sign

(25)

According to the formulation of the TAF-SVM, three main


properties are discussed and summarized as follows.
1) Through an inspection from the constraints in the dual
and
form, we can see that the Lagrange multipliers (
) are bounded with the upper bound (
and
)
and
). Therefore, acand the lower bound (
cording to (20b) and (20c), all training data are support
vectors for TAF-SVM since all data are with nonzero ,
which meets the role of the total margin algorithm. On the
contrary, in the soft margin-based SVM, the OSH is only
.
constructed by few data points whose satisfy
.
2) In SVM, the are bounded by the range of
For all data points, their feasible regions are fixed once
is chosen. Speaking of TAF-SVM, the feasible region
is dynamic since the upper and lower bounds for every
data point are different, because the bounds of the feasible
region are functions of the assigned membership values.

184

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

This means that a less important data point has the narrower
width in the feasible region.
Another question is how to fuzzify the training set efficiently. Basically, the rule to assign proper membership
values to data points can depend on the relative importance
of data points to their own classes. Therefore, for a positive
data, its assigned membership value can be calculated with
the membership function
Fig. 2. Thirty subjects of the CYCU multiview face database.

if
otherwise

(26)
denotes the Euclidean distance, the lower bound
where
is a nonnegative small real number and is user-defined,
and is the mean of all data points in
. The memberfor all the fuzzified positive training data
ship values
are bounded in
. The same procedure is also used for
the fuzzification of the negative data in which the mean is
calculated with all the negative data.
3) In SVM, only one free parameter has to be adjusted. A
more complex procedure may occur for TAF-SVM since
,
,
,
there are four free parameters to be adjusted:
. However, the adjustment process can be further
and
simplified according to some relationships. First, the inequality constraints in (20b) and (20c) say that the two inand
must be held. Second,
equalities
based on the concept of adaptation to imbalanced data sets,
and
are required
the relationships
if the size of positive class is smaller than that of negative
class. Two ratios are defined to simplify the procedure of
the adjustment of these parameters
(27)
(28)
,
By fixing the value among any of the four parameters
,
, and
, and setting the values of and , the
other three parameters can be obtained directly. In the case
1, no adaptation effort will be made to the imbalof
anced case. Furthermore, TAF-SVM will be the standard
,
SVM if equals 1 and goes to infinity (
0, and
is finite), and the membership
values are set as 1. It is noticed that the very small number
.
10 is added to avoid the situation when
Also, when
1,
infinity, and 0
1, the proposed TAF-SVM will become the FSVM. Therefore, SVM
and FSVM can be viewed as two particular cases of the
proposed TAF-SVM.
V. EXPERIMENTAL RESULTS
A. Experiment on CYCU Face Database
Here, we present a set of experiments that were carried out by
using the Chung Yuan Christian University (CYCU) multiview
face database [10]. The CYCU multiview face database contains
3150 face images out of 30 subjects and involves variations of
facial expression, viewpoint, and sex. Each image is a 112 92

Fig. 3. Collected 21 face images of one of the 30 subjects in CYCU multiview


face database.

24-b pixel matrix. The viewpoint is governed by two parameters: the rotation angle and the tilt angle where the rotation
angle and tilt angle have seven and three kinds of degrees of
,
), reangles (
spectively. For each viewpoint of each subject we prepared five
face images with different facial expressions. Therefore, each
subject has 105 face images. Fig. 2 shows the total 30 subjects
in this database. All images contain face contours and hair. The
color of the background is nearly white and the lighting condition is controlled to be uniform. Fig. 3 shows the collected 21
images containing 21 different viewpoints of one subject.
1) Analysis of Face-Pattern Distributions in KFDA-Based
Subspace: Two cases are analyzed in this subsection. Before
the experiments, all colored face images are transferred to graylevel images and the contrast of each gray image is enhanced
by the histogram equalization. All gray-level images of 112
92 pixels are resized to 28 23 pixel matrices before the feature extraction. In addition, all extracted features by KFDA from
Case 1 to Case 2 are the first two most discrimination features.
Case 1: Fig. 4 depicts the distribution of the face patterns of five subjects randomly chosen from the database
in the KFDA-based subspaces. Each subject contains 21
and
patterns covering the whole range:
, i.e., each of the 21 viewpoints provides one
image for one person. Two observations are as follows. First,
, we observe that there exists an outlier
when
for the class denoted by , as shown in Fig. 4(a). This outlier
is very far from the main body of its class and falls into another
class denoted by . The SVM will suffer from the overfitting
problem when it is applied to solve the binary classification
problem between the two classes. Second, according to the
distribution shown in Fig. 4(b), it is observed that there exists
an overlap between the three classes denoted by , , and

LIU AND CHEN: FACE RECOGNITION USING TAF-SVMS

Fig. 4. Distribution of 105 face patterns out of the five subjects in KFDA-based
subspaces with the RBF kernel where (a) 2 = 5:3 10 and (b) 2 = 10 .

, respectively. To identify the class by using the OAA


method, the imbalanced ratio of positive class to negative class
is 1 : 4. The OSH learned by the traditional SVM will be skew
toward the positive class . Consequently, the number of
false negatives will increase and the recognition accuracy will
decrease.
Case 2: Most face-recognition systems evaluate their
systems performance by changing some face conditions such
as expression, viewpoint, and illumination conditions, etc. Accordingly, several well-known databases are widely used such
as the Olivetti Research Laboratory (ORL) face database, the
University of Manchester Institute of Science and Technology
(UMIST) multiview face database, and Yale face database. The
three face databases have different conditions considered. For
example, ORL database contains 400 face images in which
all frontal face images have different facial expressions and
details (glasses or no glasses). UMIST database consists of 575
face images covering a wide range of poses from one-sided
profile to frontal views as well as the expressions. Yale database
contains 165 frontal face images having different expressions,
illumination conditions, and small occlusions (with glasses).

185

The three databases have involved most of the considerable


conditions crucial to the evaluations of face-recognition systems. However, all the faces in these databases are bounded
well. That is, they do not take the variations of face contour and
hair into consideration.
Face-recognition task follows the face-detection task. For example, the SVM-based face-detection system [34] searches for
faces from an image by using size-varying widows to scan this
image and perform face/nonface-classification task. Once the
faces are detected, these faces will be framed by rectangular
bounding boxes with different sizes and then sent into the facerecognition system. The framed face images detected by different face-detection systems (even the same) may contain the
whole hair and face contours, or just partial hair and contours,
or none of them. Most existing face-recognition systems do not
evaluate their systems on this factor since all the images in the
three databases are full faces containing both hair and contours.
Er et al. have conducted an interesting experiment in their
work [16]. They evaluated their system [discrete cosine transform (DCT) FLDA RBF neural networks] on two groups of
data: One was full faces of Yale database, and the other was the
closely cropped faces of the same database, and achieved error
rates of 1.8% and 4.8%, respectively, which were lower than
other approaches such as eigenfaces and Fisherfaces, etc. However, each of the two groups does not consider both full faces
and cropped faces at the same time. Nevertheless, by comparing
the two results, we see that the information of face contour and
hair style is important for face recognition. This study will evaluate the proposed TAF-SVM by letting this information be a
variable.
In this paper, we assume that an input face can be a full face
or a partially cropped face in order to fulfill the requirement that
in addition to variations resulted from different expressions and
viewpoints, a robust and reliable face-recognition system should
also be able to fight against the variation due to the size-varying
face-bounding boxes. This case aims to investigate the influence
of the changes of the sizes of face-bounding boxes upon the
face-pattern distribution in KFDA-based subspace. To achieve
this goal, each face image is cropped to a new face image with
and
. This procedure is called
two integer cropping sizes:
face cutting, which is illustrated in Fig. 5, where the operator
to become an integer. The
round is to force the value of
dotted white rectangle is the face-bounding box. After the cutting, the cropped image is resized to a 112 92 new image. With
this manner, an input face may contain the whole face contour
or just part of it.
The face-cutting procedure is performed to all the 105 face
images that have been used in Case 1 with randomly chosen
within the range of [0, 7]. Fig. 6(a) and (b) shows the distribution
of these 105 randomly cropped face patterns in the KFDA-based
subspaces. Compared with the distribution depicted in Fig. 4,
it can be seen that face images with different cropping sizes
significantly result in the increase of the interclass ambiguities
and the decrease of the intraclass compactness. Although the
KFDA-based feature extraction method has tried to maximize
the between-class separability and the within-class compactness, it cannot absorb the large variations caused from viewpoint and size-varying face-bounding box. Therefore, a robust

186

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

Fig. 5. Face-cutting procedure and the cropped face images with different cropping sizes of i. The dotted white rectangle is the face-bounding box.

2) Sensitivity Test of TAF-SVM: The goal of this experiment


is to test the sensitivities of TAF-SVM to its intrinsic parameters,
and
, weights for surplus
including the penalty weights
variables
and
, and the lower bound used in the fuzzy
membership function. To make the following experiments more
constructive, three conditions containing different criteria for
the collection of the training set and the test set are defined as
follows.
Condition 1: For the training set, each subject offers 21
face images picked from all 21 angle combinations of
. Each angle combination randomly offers one for
each subject. Therefore, the training set contains 630 face
images out of the 30 subjects. The collection procedure
for the test set is of the same procedure. The two sets have
no overlap.
Condition 2: Each set is provided with 21 face images
from every subject, so each set has 630 face images in
total. The face images, different from Condition 1, are
picked randomly from confined angle combinations. For
the training set, only 30 and 0 are considered in ,
and 15 and 0 in . As to the test set, only combinations of 45 and 15 in , and 15 and 0 in are
picked. Those chosen face images will not be picked again.
Condition 3: Face images are randomly chosen from all
3150 face images in CYCU face database for the training
set and test set. Each of the 30 subjects provides 21 face
images for each set and there is no overlapping between
the two sets. Those chosen face images will not be picked
again.
As far as the viewpoint of face is concerned, it can be seen hat
the degrees of uncertainties of the three data collection criteria
are apparently different. Condition 2 has the highest uncertain
degree among the three, while Condition 1 has the lowest. Also,
the face-cutting procedure is performed to all face images before
in the range
the feature extraction with random cropping size
[0, 7].
Before extracting features via KFDA method, the optimal
RBF kernel parameter , which results in the minimum error
rate, is found by searching the variation range of from 1 to 10 .
The error rate is the average error rate over ten runs. Whenever
we are performing the next run, the training set and test set are
reprepared based on Condition 3. Following the method used
is computed by
in [3] and [15], the average error rate

(29)
Fig. 6. Distribution of randomly cropped and 105 face patterns of five subjects
in the KFDA-based subspaces subject to (a) 
:
, and (b) 
.

10

2 = 5 3 2 10

2 =

classifier is still needed even though the robust feature extraction method KFDA has been used. The robust classifier here
means a classifier better than NN classifier. Besides, because
outliers would appear in the distribution, and the situation of imbalanced training data sets would happen when OAA method is
employed, a classifier which is more robust than SVM is also
needed. This also motivates this paper.

where is the number of runs, is the number of errors for


the th run, and is the number of total testing face images
of each run. It is noticed that the total testing face images
means the training set in the parameter selection process and
classifier training, while in the experiment of comparison of different classification systems (online testing), the total test patterns means the test set. After taking trail-and-error, the op5.6 10 , which
timal KFDA parameter was found to be 2
resulted in the lowest average error rate measured from the ten
training sets, also resulted in a low average error rate 11.8%
measured from the ten test sets, by using the NN classifier. With

LIU AND CHEN: FACE RECOGNITION USING TAF-SVMS

the optimal kernel value, total of 29 discriminating feature components are extracted from each face image by using the KFDA
method.
a) Sensitivity test on
,
,
, and
: The first experiment is to test the sensitivities of TAF-SVM to the four pa,
,
, and
, which have been condensed
rameters:
to the ratios and defined by (27) and (28). The values of
and for this experiment are {1, 10, 20, 30, 40} and {2, 4, 6,
is set as 10 and is
8, infinity}, respectively. The values of
fixed. The lower bound of the fuzzy membership function is
fixed to 0.4. RBF kernel is also used for TAF-SVM where its
kernel parameter is set as 0.05. The experimental results for
the three conditions are shown in Fig. 7.
The lowest error rates for the three conditions are 2.10%,
equal (20, 4), (30, 4),
7.10%, and 4.80% when the pairs
and (30, 6), respectively. The corresponding values of ( ,
,
,
) are (200, 10, 50, 2.5), (300, 10, 75, 2.5), and (300,
10, 50, 1.66), respectively. In addition, the results also indicate
that the error rates can be reduced by changing the ratios of
and . In the following, we take the results of Condition 3
shown in Fig. 7(c) as examples to show how the performance of
TAF-SVM will be affected under different and . Three steps
are as follows.
Step 1) In Fig. 7(c), the largest average error rate 7.76% oc(1, infinity).
curs at the position
(1, infinity) means that
(10,10,0,0). At this position, the different cost
algorithm is disabled because the penalties for
the positive and the negative classes are the
10). The total margin alsame (
gorithm is also disabled at this position because of
0. Therefore, only the fuzzy penalty
is used in TAF-SVM.
goes to (30, infinity) from
Step 2) As the position
(1, infinity) , the average error rate decreases to
(30, in5.94% from 7.76%. At the position
finity), the different cost algorithm is enabled (used)
in TAF-SVM because the penalties for the positive
and the negative classes become different:
300 and
10. We can see that when
300 and
10, the penalty for the positive class
is much larger than that for the negative class. This
meets the role of different cost algorithm, which
says: Assign heavier penalty to the smaller class. In
the experiments of Fig. 7(c), the number of negative training data is 29 times the number of positive
training data in the learning of each OSH by OAA
TAF-SVM, because there are 30 subjects in CYCU
database needed to be classified. On the other hand,
the total margin algorithm is still not enabled at this
0. By comparing
position because of
the analysis in Step 1) with the one in Step 2), we
can see that the error rate is reduced by the application of different cost algorithm.
goes to (30, 6) from (30, inStep 3) As the position
finity), the average error rate decreases to 4.8% from
(30, 6), not only the
5.94%. At the position
300 and
different cost algorithm is enabled (

187

Fig. 7. Comparisons of average error rates among different pairs of (I ; T ) used


in TAF-SVM under different data collection conditions.

10), but also the total margin algorithm is enand


). By
abled (
comparing the analyses in Steps 1) and 2) with the
analysis in Step 3), we can see that the error rate can
be further reduced with the involving of total margin
algorithm, after the use of different cost algorithm.

188

The previous comparison also demonstrates that the


use of total margin algorithm is able to gain lower
generalization error bound compared with the use
of soft margin algorithm at the position
(30, infinity). Based on the three steps, the validity
of embedding the different cost and total margin algorithms into FAT-SVM for the reduction of error
rate is demonstrated.
As also can be observed, the error rates of Condition 1 are
lower than those of Condition 3, and much lower than those of
Condition 2. It can be concluded that whether or not the information of face viewpoints of test set is involved in the training
set would influence the results significantly. For example, the
largest average error rates for the three conditions are occurred
(1, infinity), and they are 5.1%,
at the same points
10.89%, and 7.76%, respectively. On the other hand, the lowest
4 and
average error rates for the three conditions are 2.1% (
20), 7.1% (
4 and
30), and 4.8% (
6 and
30), respectively. As can be seen, by changing the ratios of and
to proper values, the average error rates can be significantly
reduced.
Another result is that all largest error rates occur at the same
(1, infinity). At this position, the TAF-SVM
position of
is switched to FSVM. When is set as 1, the function of adaptation to imbalanced data sets is disabled. Also, the total margin
algorithm is switched to the soft margin algorithm when is
0 and
0). Therefore, the results
set as infinity (
reveal that our TAF-SVM outperforms FSVM in terms of face
recognition. In the next experiment, we will show that setting
the value of lower bound as zero is not optimal, even worse than
the nonfuzzy case.
b) Sensitivity test on : The second experiment is to test
the sensitivity of TAF-SVM on the lower bound of the fuzzy
membership function . By changing the values of from 0 to 1
with an interval of 0.1, the results of average error rates versus
for the three conditions are shown in Fig. 8. As shown in Fig. 8,
the average error rates decrease monotonically as the value of
decreases from 1 (nonfuzzy case) to 0.4, 0.4, and 0.5, and
the lowest average error rates of 2.10%, 7.10%, and 4.60% are
obtained for the three conditions, respectively. To facilitate the
further comparison, we extract some results from Figs. 7 and 8
and list them in Table II.
By taking Condition 3 as example and fixing the pair
(30, 6), the error rate is decreased from 5.62% to
4.80% as the value of is decreased from 1 to 0.4. This
trend also appears in another two conditions. Moreover, when
(1, infinity) and
0.4, the error rate is 7.76%.
Accordingly, the proposed TAF-SVM performs better than
nonfuzzy TAF-SVM, and nonfuzzy TAF-SVM outperforms
0.4. Based on these comparFSVM with the value of
isons, two concise remarks are obtained. First, after TAF-SVM
embodies the total margin and different cost algorithms, by
further introducing the concept of fuzzy penalty and setting
to a proper value can reduce the error rate. Second, the optimal
value of the lower bound of the fuzzy membership function is
not equal to zero but a value within the range of [0.4, 0.5].
Regarding the second remark, one reasonable cause is analyzed as follow. As the value of approaches zero, those few

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

Fig. 8. Average error rates versus the lower bound of the fuzzy membership
function ".

TABLE II
COMPARISONS OF AVERAGE ERROR RATES (%) AMONG THE THREE
CONDITIONS UNDER DIFFERENT COMBINATIONS OF " AND (I; T )

data points lying near the border of their class possess near-zero
membership values. Accordingly, their feasible regions will be
substantially compressed to very small regions. This can be seen
in (20b) and (20c). An extreme situation is that, when the value
of is set as zero, some data points whose membership values
are zero will be excluded from the training process since both
the upper bound and the lower bound in the inequality constraints (20b) and (20c) become zero. Therefore, the situation of
information lost is revealed in the training process and in consequence a poorer performance is obtained.
3) Comparison of Performance Between TAF-SVM and
SVM: In order to verify the efficiency of the proposed
TAF-SVM, we resample the training set and test set ten times
by following the data collection criterion of Condition 3.
The face-cutting procedure is performed to each selected face
image. The features for a face image are extracted by using
KFDA method. For SVM, the tasks of the multiclass classifications are accomplished by using the OAA and OAO strategies.
For TAF-SVM, only OAA strategy is used since the imbalanced
face training sets are obtained only under this strategy. The
parameters used in KFDA, SVM, and TAF-SVM are listed in
Table III. They were determined by taking trail-and-error so
that every classifier is able to obtain the lowest error rate.
Table IV lists the comparison of the results among different
systems. The average error rates are obtained through ten runs.
We can see that by using KFDA-based feature extractor and a
simple NN classifier, the average error rate can be substantially
reduced to 11.41%, which is almost the half of that of 22.26%
obtained from a pure NN classifier. By cascading the OAA-

LIU AND CHEN: FACE RECOGNITION USING TAF-SVMS

TABLE III
PARAMETER SETTING IN KFDA, SVM, AND TAF-SVM

TABLE IV
COMPARISON OF THE AVERAGE ERROR RATE AND STANDARD DEVIATION (SD)
OVER TEN RUNS BETWEEN TAF-SVM AND OTHER SYSTEMS

TABLE V
COMPARISON OF COMPUTATION TIME AMONG DIFFERENT SYSTEMS

based SVM, the improvement is very limited (from 11.41%


to 9.02%) while that is significant by using OAO-based SVM
(from 11.41% to 7.55%). It is not surprising that the difference
of recognition accuracy between OAO- and OAA-based SVM
is that apparent since the OAO method does not result in the occurrence of imbalanced data sets while OAA method does. As a
matter of fact, Lin et al. [2] have indicated that the OAO method
is more suitable for practical use than the OAA method in terms
of classification accuracy according to their experiments carried
out on various popular data. In this paper, we also suggest that
the OAO method is better than the OAA method for SVM-based
face recognition. However, this suggestion is only in terms of
face-recognition accuracy because the OAO method takes more
recognition time than the OAA method (see Table V).
We conduct this experiment mainly based on the reason that
a robust face classifier should be able to maintain good stability while expecting that it can achieve the best recognition
accuracy under the training with different training sets. The results of Table IV indicate that TAF-SVM outperforms OAOand OAA-based SVM. This is due to the fact that TAF-SVM
not only can adapt to the imbalanced face data sets but also
can avoid the overfitting problem and improve the generalization error bound. In addition, the system KFDA TAF-SVM
achieves the lowest standard deviation (0.57) compared with the
system KFDA SVM. It indicates that the TAF-SVM is more
stable than SVM.
4) Computational Complexity: Our experiments were implemented on an Intel Xeon 3.0 GHz-Workstation (1 MB L2

189

catch, DDR2 2.0 GB SDRAM, 800 MHz Front-Site-Bus, and


10 000 rpm SCSI-hard disk). The training program was implemented by using Matlab since it is able to solve the eigenvalue
problem for KFDA and constrained optimization problem for
both SVM and TAF-SVM easily. After the training, we saved
the expansion coefficients for KFDA and the indispensable information to the further recognition including the support vectors, Lagrange multipliers, and the optimal bias for each OSH.
The test program was implemented using C++ language since
the recognition process only executes simple calculations such
as the dot product of vectors, their linear combinations, and decision making. We recorded the computation time of the first
run of the last experiment and listed them in Table V.
Most training time was spent on the solving of the constrained
optimization problem. The larger the number of the training data
was, the more time the training process needed. Also, the proportion of the increase of training time to that of training data
was more than one. The total training time of OAA-based SVM
(2429.7 s for 30 OSHs) is much larger than that of OAO-based
SVM (234.8 s for 435 OSHs), as shown in Table V. Moreover,
we found that the training time of the proposed TAF-SVM was
smaller than that of OAA-based SVM. This may be due to the
reason that for TAF-SVM the feasible regions are functions of
membership values less than one. That is, most data have comparative smaller feasible regions in searching the Lagrange multipliers, compared with SVM. Although the training is time-consuming, what is the biggest concern for face recognition is the
online recognition speed.
In the training of an OSH for the OAA-based SVM, we
noticed that the percentage of the obtained support vectors
is around 20%25%. The proposed TAF-SVM, by which the
percentage of the support vectors is 100%, is 4.75 (76.4/16.1)
times the recognition time of OAA-based SVM, as listed in
Table V. Besides, the recognition time for TAF-SVM is around
0.1231 s per subject. This recognition speed is acceptable for
the tasks of security and visual surveillance.
B. Experiment on FERET Database
The facial recognition technology (FERET) face database,
obtained from the FERET program [37], [38], [43], is a commonly used database for the test of state-of-art face-recognition
algorithms. In the following, the proposed TAF-SVM is tested
on a subset of this database.
This subset contains 1400 images of 200 subjects. The subset
contains the images whose names are marked with two-character strings: ba, bj, bk, be, bf, bd, and bg. Each
subject has seven images involving variations in illumination,
pose, and facial expression. In our experiment, each original
image is cropped so that each cropped image only contains
the portions of the face and hair. Then, each cropped image is
resized to 80 80 pixels and preprocessed by the histogram
equalization. Some images of one of the 200 subjects are shown
in Fig. 9. Six out of seven images of each subject are randomly
chosen for training, and the remaining one is used for testing.
The training set size is 1200 and the test set size is 200. We run
the previous process 20 times and obtain 20 different training
and test sets, and in each run there is no overlap between
training set and test set.

190

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007

Fig. 9. First row: images of one of the 200 subjects in the FERET database.
Second row: cropped images of those in the first row after histogram equalization.
TABLE VI
COMPARISONS OF AVERAGE ERROR RATE AND SD AMONG DIFFERENT
SYSTEMS WITH KFDA FEATURE EXTRACTION ON FERET DATABASE

1) Performance Test After the KFDA Feature Extraction: In


this experiment, the face images will go through the KFDA feature extraction first before the classification. Therefore, we first
find the optimal parameters of KFDA.
a) Optimal parameter selection: The first step is to find the
optimal parameters of KFDA for the experiment on the subset
of FERET database. Only two parameters for KFDA need to
be determined, namely the RBF kernel parameter , and the
number of chosen eigenvectors . The optimal parameter pair
will be the pair, over the wide ranges of
and
, resulting in the lowest average error rate. One average error
rate is computed from the 20 error rates under a specific parameter pair and the errors are measured by an NN classifier. In the
6.1 10 and
sequence, the optimal parameters
199, are found for KFDA. Then, the training sets are projected
onto the 199 eigenvectors and thus 20 projected training sets are
obtained.
Then, the projected training sets are used to find the optimal
parameters for SVM and TAF-SVM, respectively. The RBF
kernel is still used in the classifiers. Similar to the searching
process for the KFDAs optimal parameter selection, here the
optimal parameters of each classifier will also be the parameters
resulting in the lowest average error rate over wide searching
ranges of the classifiers parameters.
b) Comparisons among different systems: After selecting the optimal parameters for each classifier, we test and compare the classification accuracies by feeding the 20 test sets
into different systems. The experimental results are listed in
Table VI.
First, comparing the results in Tables VI and IV, we observe that the error rates obtained from the FERET database are much larger than those obtained from the CYCU database. This may be due to the following facts: 1) the number of available training data from the FERET database (six per subject) is much smaller than that from the CYCU database (21 per subject); 2) there exist larger variations in the FERET database; and 3) the number of subjects in the subset of the FERET database (200 subjects) is much larger than that of the CYCU database (30 subjects). Nevertheless, the experimental results on the FERET database show that TAF-SVM still performs better
than SVM. Based on the results in Table VI, TAF-SVM outperforms both SVM (OAO) and SVM (OAA) in average error
rate by 3.21% and 5.10%, respectively. Additionally, TAF-SVM achieves the lowest variance among these systems, indicating that TAF-SVM maintains better stability than SVM when facing different unseen patterns.
It is worth noting that although KFDA extracts discriminating features from the raw image data by maximizing the between-class separability, this does not mean that the class distribution in the KFDA-based subspace will be separable. This is evidenced by the result in Table VI: the average error rate of KFDA + NN is 22.18%. This result implies that, even with the optimal parameters of KFDA, there still exist numerous errors between classes; that is, the class distribution in the KFDA-based subspace is still nonseparable. This may result from the two following reasons.
First, the face patterns involve variations so large that KFDA is not able to separate the classes well. Second, in this paper, the KFDA used for the subset of the FERET database and for the CYCU database actually suffers from the so-called small sample size (SSS) problem [3], because in our experiments the number of training patterns is smaller than the dimensionality of the input training patterns. For example, in our training sets, each pattern (an 80 × 80 pixel image) is a 6400-dimensional vector, while the number of available training patterns is only 1200 (200 subjects, six per subject). The SSS problem also occurs in the KFDA used for the experiment on the CYCU database, where each training pattern (a 28 × 23 pixel image) is a 644-dimensional vector, while the number of training patterns in each training set is 630 (21 per subject). Since the KFDA used for the two databases suffers from the SSS problem, the within-class scatter matrix of (1) is degenerate because it contains a null space.
To solve the SSS problem in numerical computation, this paper uses the method of adding a matrix μI, where I is the identity matrix and μ is a small number, to the inner-product kernel matrix when finding the expansion coefficient vector for the data projection. This method is very simple and was suggested by Mika et al. [39], [40]. However, it discards the discriminant information contained in the null space of the within-class scatter matrix, yet the discarded information may contain the most significant discriminant information [3], [41], [42]. This means that even though the most discriminant eigenvectors available have been used for the data projection in our experiments, these eigenvectors are actually not the most discriminant possible.
Although KFDA is employed in our work, the face-pattern distribution is still nonseparable; see, for example, the distributions shown in Figs. 4(b) and 6.
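As a sketch of this numerical fix in the two-class kernel Fisher discriminant of Mika et al. [39], [40] (the helper signature and names are assumptions; only the μI regularization is taken from the text):

```python
import numpy as np

def kfd_coefficients(K, labels, mu=1e-3):
    # Two-class kernel Fisher discriminant (after Mika et al.).
    # K: (n, n) kernel matrix over the training set; labels in {0, 1}.
    # Adding mu*I keeps the within-class matrix N invertible when it is
    # degenerate, i.e., under the SSS problem described above.
    labels = np.asarray(labels)
    n = K.shape[0]
    idx = [np.flatnonzero(labels == c) for c in (0, 1)]
    means = [K[:, i].mean(axis=1) for i in idx]     # class means in feature space
    N = sum(K[:, i] @ (np.eye(len(i)) - 1.0 / len(i)) @ K[:, i].T for i in idx)
    N += mu * np.eye(n)                             # the mu*I regularization
    return np.linalg.solve(N, means[1] - means[0])  # expansion coefficients
```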
Several more efficient solutions to this SSS problem have recently been proposed, such as kernel direct discriminant analysis (KDDA) [3], which is a generalization of direct LDA [41], and complete kernel Fisher discriminant analysis (CKFD), which combines kernel PCA and LDA [42]. We expect that the classification accuracy of each system in Tables VI and IV would be improved if, instead of KFDA, KDDA or CKFD were used for the face-feature extraction in this paper.
TABLE VII
COMPARISONS OF AVERAGE ERROR RATE AND SD AMONG DIFFERENT SYSTEMS WITHOUT KFDA FEATURE EXTRACTION ON FERET DATABASE
Moreover, though KFDA tries to minimize the within-class scatter to obtain larger intraclass compactness, this cannot guarantee that no outliers will appear in the KFDA-based subspace; in Fig. 4(a), for example, an outlier still exists in the KFDA-based subspace. Under such a situation, SVM (OAO) and SVM (OAA) may suffer from the overfitting problem and the classification performance will drop. On the other hand, for SVM (OAA), although KFDA has been used for face-feature extraction, imbalanced training data sets are still unavoidable in the KFDA-based subspace. In the training of an OSH via SVM (OAA), the imbalance ratio of negative training data to positive data is 199 : 1. Such a large imbalance ratio results in the class-boundary-skew problem for SVM (OAA). This may be the reason why SVM (OAO) always performs better than SVM (OAA): the ratio of negative training data to positive data is always 1 : 1 for SVM (OAO) if the class sizes are the same. To sum up, Table VI shows that the proposed TAF-SVM improves the classification performance of SVM (OAO) and SVM (OAA), and such a significant performance improvement should be attributed not only to the use of the fuzzy penalty and the different cost algorithm, but also to the total margin algorithm embedded in TAF-SVM.
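To illustrate the skew problem and the different-cost remedy, per-class penalty weighting can be emulated with scikit-learn's class_weight option; this is a stand-in for the paper's different cost algorithm, and the inverse-frequency weighting rule is an assumption:

```python
from sklearn.svm import SVC

# One-against-all training for one subject: 6 positive vs. 1194 negative
# images (a 199:1 imbalance). Penalizing slack on the rare positive class
# more heavily counters the hyperplane's skew toward the minority side.
n_pos, n_neg = 6, 1194
clf = SVC(kernel='rbf', C=1.0,
          class_weight={1: n_neg / n_pos, -1: 1.0})  # effectively C+ >> C-
# clf.fit(X_train, y_train)   # y_train labels in {+1, -1}
```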
2) Performance Test Without KFDA's Feature Extraction: In this experiment, the raw image vectors are sent directly into each classifier without using KFDA as the feature extractor. Since the KFDA feature extractor is no longer used, the optimal parameters of each classifier need to be reselected. It is noted here that the inputs of each classifier are normalized to zero mean and unit variance. After feeding the 20 different test sets directly into these systems, the average error rates are obtained and listed in Table VII.
Comparing the results reported in Tables VII and VI, we find that the average error rate of each system in Table VI is lower than that listed in Table VII; for example, the average error rate of KFDA + TAF-SVM is 14.15%, while that of TAF-SVM alone is 20.40%. Therefore, we can conclude that using KFDA as the feature extractor significantly enhances the classification accuracy of each classifier. From Table VII, it can also be seen that, in terms of the average classification rate, TAF-SVM outperforms SVM (OAA) and SVM (OAO) by 7.23% and 3.73%, respectively. In addition, TAF-SVM still achieves the lowest variance.
VI. CONCLUSION AND FUTURE WORK
A new classifier called TAF-SVM is proposed in this paper. TAF-SVM is mainly designed to remedy two drawbacks of the traditional SVM when applied to face recognition, namely the class-boundary-skew problem and the overfitting problem, by introducing the different cost algorithm and the fuzzification of the training set. Another contribution is the enhancement of the generalization ability of SVM through the total margin algorithm. Experimental results show that the proposed TAF-SVM is superior to OAO- and OAA-based SVM in terms of both face-classification rate and stability, which validates the improvement that TAF-SVM brings to the classification accuracy of SVM for face recognition.
Based on the work presented, several topics remain worth studying. First, the circle-like membership model used for the training-set fuzzification in this paper is not a very effective model, since the face-pattern distribution is in general non-Gaussian and nonconvex; the study of a better membership model is needed (a sketch of this baseline model is given after this paragraph). Second, experimental results have shown that using KFDA as the feature extractor enhances the classification accuracy. However, for face recognition, the KFDA used in our work suffers from the SSS problem. It is believed that if this problem is solved, e.g., by using variants of KFDA such as KDDA [3] or CKFD [42], the face-recognition accuracy can be further enhanced on top of the TAF-SVM classifier.
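For reference, the circle-like membership model criticized above assigns each training pattern a membership that decays with its distance from the class center, so distant points (potential outliers) contribute less to the penalty; a linear decay is one common choice and is assumed here rather than taken from the paper:

```python
import numpy as np

def circle_memberships(X, delta=1e-6):
    # X: (n, d) patterns of a single class. Membership shrinks linearly
    # with distance from the class mean, so outliers get small memberships.
    # A spherical model like this fits poorly when the class distribution
    # is non-Gaussian and nonconvex, motivating the search for better models.
    center = X.mean(axis=0)
    dist = np.linalg.norm(X - center, axis=1)
    return 1.0 - dist / (dist.max() + delta)
```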
ACKNOWLEDGMENT
The authors would like to thank the reviewers for their useful
comments and suggestions, and Prof. H.-P. Huang, Prof. S.-G.
Miaou, Prof. P. C. P. Chao, and H.-Y. Lin for their help in
preparing this paper.
REFERENCES
[1] K. Veropoulos, C. Campbell, and N. Cristianini, "Controlling the sensitivity of support vector machines," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI'99), Stockholm, Sweden, 1999, pp. 55–60.
[2] C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[3] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, "Face recognition using kernel direct discriminant analysis algorithms," IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 117–126, Jan. 2003.
[4] M. Yoon, Y. Yun, and H. Nakayama, "A role of total margin in support vector machines," in Proc. Int. Joint Conf. Neural Netw., 2003, vol. 3, pp. 2049–2053.
[5] I. Guyon, N. Matic, and V. Vapnik, "Discovering informative patterns and data cleaning," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA: AAAI Press, 1996, pp. 181–203.
[6] X. Zhang, "Using class-center vectors to build support vector machines," in Proc. IEEE Workshop Neural Netw. Signal Process. (NNSP'99), Madison, WI, 1999, pp. 3–11.
[7] C. F. Lin and S. D. Wang, "Fuzzy support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 464–471, Mar. 2002.
[8] V. Vapnik, Statistical Learning Theory. New York: Springer-Verlag, 1998.
[9] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Comput., vol. 12, pp. 2385–2404, 2000.
[10] Chung Yuan Christian Univ. (CYCU), Multiview Face Database, Chungli, Taiwan [Online]. Available: http://vsclab.me.cycu.edu.tw/~face/face_index.html
[11] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proc. 15th Eur. Conf. Mach. Learn. (ECML), Pisa, Italy, 2004, pp. 39–50.
[12] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cogn. Neurosci., vol. 3, no. 1, pp. 71–86, 1991.
[13] C. Cortes and V. Vapnik, "Support vector networks," Mach. Learn., vol. 20, pp. 273–297, 1995.
[14] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining Knowl. Disc., vol. 2, pp. 121–167, 1998.
[15] M. J. Er, S. Wu, J. Liu, and H. L. Toh, "Face recognition with radial basis function (RBF) neural networks," IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 697–710, May 2002.
[16] M. J. Er, W. L. Chen, and S. Q. Wu, "High-speed face recognition based on discrete cosine transform and RBF neural networks," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 679–691, May 2005.
[17] G. Wu and E. Cheng, "Class-boundary alignment for imbalanced dataset learning," in Proc. ICML 2003 Workshop Learn. Imbalanced Data Sets II, Washington, DC, 2003, pp. 49–56.
[18] N. Japkowicz, "The class imbalance problem: Significance and strategies," in Proc. 2000 Int. Conf. Artif. Intell.: Special Track on Inductive Learning, Las Vegas, NV, 2000, pp. 111–117.
[19] C. Ling and C. Li, "Data mining for direct marketing problems and solutions," in Proc. 4th Int. Conf. Knowl. Disc. Data Mining, New York, 1998, pp. 73–79.
[20] N. Chawla, K. Bowyer, and W. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[21] J. F. Feng and P. Williams, "The generalization error of the symmetric and scaled support vector machines," IEEE Trans. Neural Netw., vol. 12, no. 5, pp. 1255–1260, Sep. 2001.
[22] M. H. Yang, "Kernel Eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods," in Proc. 5th IEEE Int. Conf. Autom. Face Gesture Recognit., Washington, DC, 2002, pp. 215–220.
[23] Q. S. Liu, H. Q. Lu, and S. D. Ma, "Improving kernel Fisher discriminant analysis for face recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 42–49, Jan. 2004.
[24] B. Schölkopf, A. Smola, and K. R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[25] K. I. Kim, K. Jung, and H. J. Kim, "Face recognition using kernel principal component analysis," IEEE Signal Process. Lett., vol. 9, no. 2, pp. 40–42, Feb. 2002.
[26] G. Cui and W. Gao, "SVMs for few examples-based face recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, 2003, vol. 2, pp. 381–384.
[27] W. Chi, G. Dai, and L. Zhang, "Face recognition based on independent Gabor features and support vector machine," in Proc. 5th World Congr. Intell. Control Autom., Hangzhou, China, 2004, vol. 5, pp. 4030–4033.
[28] C. Y. Li, F. Liu, and Y. X. Xie, "Face recognition using self-organizing feature maps and support vector machines," in Proc. 5th Int. Conf. Comput. Intell. Multimedia Appl., Xi'an, China, 2003, pp. 37–42.
[29] G. Dai and C. Zhou, "Face recognition using support vector machines with the robust feature," in Proc. 12th IEEE Int. Workshop Robot Human Interactive Commun., 2003, pp. 49–53.
[30] S. Y. Zhang and H. Qiao, "Face recognition with support vector machine," in Proc. IEEE Int. Conf. Robot., Intell. Syst. Signal Process., Changsha, China, 2003, vol. 2, pp. 726–730.
[31] K. I. Kim, J. Kim, and K. Jung, "Recognition of facial images using support vector machines," in Proc. 11th Workshop Stat. Signal Process., Singapore, 2001, pp. 468–471.
[32] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, Jul. 1997.
[33] Y. Lin, Y. Lee, and G. Wahba, "Support vector machines for classification in nonstandard situations," Mach. Learn., vol. 46, pp. 191–202, 2002.
[34] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in Proc. Comp. Vis. Pattern Recognit. (CVPR), Puerto Rico, 1997, pp. 130–136.
[35] U. H.-G. Kressel, "Pairwise classification and support vector machines," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999.
[36] T. Van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle, "Benchmarking least squares support vector machine classifiers," Mach. Learn., vol. 54, no. 1, pp. 5–32, 2004.
[37] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, "The FERET evaluation methodology for face-recognition algorithms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000.
[38] P. J. Phillips, The Facial Recognition Technology (FERET) Database (2004) [Online]. Available: http://www.itl.nist.gov/iad/humanid/feret/feret_master.html
[39] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Proc. IEEE Int. Workshop Neural Netw. Signal Process. IX, Aug. 1999, pp. 41–48.
[40] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Constructing descriptive and discriminant nonlinear features: Rayleigh coefficients in kernel feature spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 623–628, May 2003.
[41] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data with application to face recognition," Pattern Recognit., vol. 34, pp. 2067–2070, 2001.
[42] J. Yang, A. F. Frangi, J. Y. Yang, D. Zhang, and Z. Jin, "KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 2, pp. 230–244, Feb. 2005.
[43] J. Yang, J. Y. Yang, and A. F. Frangi, "Combined Fisherfaces framework," Image Vis. Comput., vol. 21, no. 12, pp. 1037–1044, 2003.
[44] H. P. Huang and Y. H. Liu, "Fuzzy support vector machines for pattern recognition and data mining," Int. J. Fuzzy Syst., vol. 4, no. 3, pp. 826–835, 2002.
[45] J. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, pp. 293–300, 1999.
[46] M. W. Chang, C. J. Lin, and R. C. H. Weng, "Analysis of switching dynamics with competing support vector machines," IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 720–727, May 2004.
[47] S. N. Pang, D. Kim, and S. Y. Bang, "Face membership authentication using SVM classification tree generated by membership-based LLE data partition," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 436–446, Mar. 2005.
[48] S. Li, J. T. Y. Kwok, I. W. H. Tsang, and Y. Wang, "Fusing images with different focuses using support vector machines," IEEE Trans. Neural Netw., vol. 15, no. 6, pp. 1555–1561, Nov. 2004.
Yi-Hung Liu (M'04) received the B.S. degree in naval architecture and marine engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1994, and the M.S. degree in engineering science and ocean engineering in 1996 and the Ph.D. degree in mechanical engineering in 2003, both from National Taiwan University, Taipei, Taiwan, R.O.C.
He is currently an Assistant Professor with the Department of Mechanical Engineering, Chung Yuan Christian University, Chungli, Taiwan, R.O.C. His research interests include computer vision, machine learning, pattern recognition, data mining, automatic control, and their associated applications.

Yen-Ting Chen was born in Kaohsiung, Taiwan, R.O.C. He received the B.S. and M.S. degrees in mechanical engineering from Chung Yuan Christian University, Chungli, Taiwan, R.O.C., in 2004 and 2006, respectively.
He is currently with the Industrial Technology Research Institute (ITRI), Hsinchu, Taiwan, R.O.C., where he works on intelligent robots. His research interests include machine vision and neural networks.
