I. INTRODUCTION
A classifier with good generalization ability and minimal empirical risk is necessary to compensate for the drawbacks of the appearance-based feature extractor. Based on this, an SVM can serve as a good classifier candidate.
SVM was proposed by Vapnik et al. [13] and has been successfully applied to various applications such as the unsupervised segmentation of switching dynamics [46], face membership authentication [47], and image fusion [48]. Recently, several works related to face recognition have used SVMs as classifiers and yielded satisfactory results [25]-[31]. In those systems, the SVMs used are regular SVMs. However, some studies not directly related to face recognition have indicated that SVM suffers from several critical problems when applied to certain data types. The first problem is that SVM is very sensitive to outliers, since the penalty weight for every data point is the same [5], [7]. Second, the class-boundary-skew problem arises when SVM is applied to learning from imbalanced data sets, in which the negative data heavily outnumber the positive data [1], [11], [17], [33]. The class boundary, i.e., the optimal separating hyperplane (OSH) learned by SVM, can be skewed towards the positive class. In consequence, the false-negative rate can be very high, making SVM ineffective in identifying targets that belong to the positive class. These two problems limit the performance of SVM, and unfortunately they also occur in SVM-based face recognition.
In face recognition, for example, a face image with an exaggerated expression may appear as an outlier. If the outlier has a nonzero slack variable, the soft-margin algorithm used in the regular SVM will try to find a hyperplane that corrects this error, and the overfitting problem may follow. The other problem is that SVM was originally designed for binary classification, while face recognition is in practice a multiclass classification problem. To extend the binary SVM to multiclass face recognition, most existing systems [25]-[31] used the one-against-all (OAA) method. As far as the computational effort is concerned, OAA may be more efficient than the one-against-one (OAO) strategy. The advantage of OAA over OAO is that we only have to construct one hyperplane for each of the $c$ classes, instead of the $c(c-1)/2$ pairwise decision functions required by OAO. This decreases the computational effort by a factor of $(c-1)/2$; in some examples, it can be brought down even further [35].
This may be the reason that the authors of [25]-[31] used OAA in their systems, though it has been reported that OAO outperforms OAA in terms of classification accuracy [2], [36], [45]. When the OAA method is used, one of the classes is taken as the target class and the remaining $c-1$ classes form the negative class for the learning of each OSH. The class-boundary-skew problem therefore occurs. Moreover, the larger the number of classes becomes, the more imbalanced the training set is when the OAA method is applied.
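To make these counts concrete, the following minimal sketch (our illustration, not part of the original paper) compares the number of classifiers required by OAA and OAO, and the training-set imbalance that OAA induces for a $c$-class problem:

```python
# Illustration (ours, not the paper's code): classifier counts for the OAA and
# OAO multiclass strategies, and the imbalance ratio that OAA induces.

def oaa_count(c: int) -> int:
    """One-against-all: one hyperplane per class."""
    return c

def oao_count(c: int) -> int:
    """One-against-one: one pairwise decision function per class pair."""
    return c * (c - 1) // 2

c = 30  # e.g., the 30 subjects of the CYCU database used later in the paper
print(f"OAA: {oaa_count(c)} hyperplanes, OAO: {oao_count(c)} decision functions")
print(f"OAA effort reduction factor: {oao_count(c) / oaa_count(c):.1f}")  # (c-1)/2
print(f"OAA imbalance: {c - 1} negative samples per positive sample")
```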
To remedy these problems when SVM is applied to face recognition, this paper proposes a new classifier called TAF-SVM. TAF-SVM is able to solve the overfitting problem by fuzzifying the training set, which is equivalent to fuzzifying the penalty term [7], [44]. In this manner, training data are no longer treated equally but are treated differently according to their relative importance. Besides, TAF-SVM also embodies the different cost algorithm, so that the class-boundary-skew problem caused by imbalanced training sets can be alleviated.
$$S_W^{\Phi}=\sum_{i=1}^{c}\sum_{j=1}^{N_i}\bigl(\Phi(x_j^i)-\bar{\Phi}_i\bigr)\bigl(\Phi(x_j^i)-\bar{\Phi}_i\bigr)^{T} \qquad (2)$$

where $\Phi$ is a nonlinear mapping function that maps the data from the input space to a higher dimensional feature space $F$, $\Phi:\mathbb{R}^{n}\to F$; $x_j^i$ denotes the $j$th face image in the $i$th class; and $\bar{\Phi}_i$ is the mean of the mapped samples of the $i$th class. The mapped data are centered in $F$ space [9], [24].
KFDA seeks to find a set of discriminating orthonormal eigenvectors $w_1,\ldots,w_m$ for the projection of an input face image by performing FLDA in $F$ space, in which the between-class scatter is maximized and the within-class scatter is minimized. This is equivalent to solving the following maximization problem:

$$\max_{w}\;J(w)=\frac{w^{T}S_B^{\Phi}w}{w^{T}S_W^{\Phi}w}. \qquad (3)$$

Solutions associated with the largest nonzero eigenvalues must lie in the span of all mapped data; so, for each $w_k$, there exists a normalized expansion coefficient vector $\alpha^{k}=[\alpha_1^{k},\ldots,\alpha_N^{k}]^{T}$ such that

$$w_k=\sum_{j=1}^{N}\alpha_j^{k}\,\Phi(x_j). \qquad (4)$$
Thus, for a testing face image $x$, its projection on the $k$th eigenvector is computed by

$$y_k=w_k^{T}\Phi(x). \qquad (5)$$

We do not need to know the nonlinear mapping $\Phi$ exactly. By using the kernel trick, the projection $y_k$ can be easily obtained by

$$y_k=\sum_{j=1}^{N}\alpha_j^{k}\,k(x_j,x) \qquad (6)$$

where the kernel function is defined as the dot product of vectors

$$k(x,y)=\Phi(x)\cdot\Phi(y). \qquad (7)$$

The radial basis function (RBF) kernel is used in this paper and is expressed as

$$k(x,y)=\exp\!\left(-\frac{\|x-y\|^{2}}{2\sigma^{2}}\right) \qquad (8)$$
where the width $\sigma$ is specified a priori by the user.

To project a face image into new coordinates, the $m$ eigenvectors associated with the first $m$ largest nonzero eigenvalues are selected to construct the transformation matrix $W=[w_1,\ldots,w_m]$, such that the dimensionality of a face image is reduced from $n$ to $m$. To simplify the notation used in the following, we let the number of projection vectors be equal to $m$.
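As a brief sketch of how the projection in (5)-(8) can be computed in practice, assuming (as in (4)) that the expansion coefficients have already been obtained from the KFDA eigenproblem, the kernel trick avoids evaluating $\Phi$ explicitly:

```python
# Sketch of the projection (5)-(8) via the kernel trick. Assumption: the
# expansion coefficients alpha^k of (4) were already obtained by solving the
# KFDA eigenproblem on a (centered) kernel matrix of the N training images.
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, sigma: float) -> float:
    """RBF kernel of (8): k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def kfda_project(x: np.ndarray, train: np.ndarray, alphas: np.ndarray,
                 sigma: float) -> np.ndarray:
    """Compute (6): y_k = sum_j alpha_j^k k(x_j, x) for k = 1..m, without
    ever evaluating the nonlinear map Phi explicitly."""
    k_vec = np.array([rbf_kernel(xj, x, sigma) for xj in train])  # shape (N,)
    return alphas @ k_vec                                          # alphas: (m, N)
```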
III. BASIC IDEAS OF TAF-SVM
A. Basic Review of SVM
In SVM, the training set is given as $\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i$ is a training data point and $y_i$ is its class label, being either $-1$ or $+1$. Let $w$ and $b$ be the weight vector and the bias of the separating hyperplane; the objective of SVM is to find the OSH by maximizing the margin of separation and minimizing the training errors:

Minimize
$$\frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_i \qquad (9)$$
Subject to
$$y_i\bigl(w\cdot\Phi(x_i)+b\bigr)\ge 1-\xi_i \qquad (10a)$$
$$\xi_i\ge 0,\quad i=1,\ldots,N \qquad (10b)$$

where $\Phi$ is the nonlinear mapping function which maps the data from the input space into a higher dimensional feature space, and the $\xi_i$ are slack variables representing the error measures of the data points. The penalty weight $C$ is a free parameter; it measures the size of the penalties assigned to the errors. Minimizing the first term in (9) is equivalent to maximizing the margin of separation, which is related to minimizing the Vapnik-Chervonenkis (VC) dimension. The formulation of the objective function in (9) is in perfect accord with the structural risk minimization principle.
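For reference, a minimal soft-margin baseline corresponding to (9)-(10b) can be written in a few lines; the library and the parameter values here are our own illustrative choices, not the paper's:

```python
# A minimal soft-margin SVM baseline corresponding to (9)-(10b), written with
# scikit-learn for brevity (an assumption of this sketch; the paper does not
# use this library). C is the common penalty weight on all slack variables.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 29))                 # e.g., 29 KFDA feature components
y = np.where(rng.normal(size=200) > 0, 1, -1)  # labels in {-1, +1}

clf = SVC(C=10.0, kernel="rbf", gamma=0.05)    # gamma acts as 1/(2 sigma^2)
clf.fit(X, y)
print("number of support vectors per class:", clf.n_support_)
```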
Following the different cost algorithm, the primal problem is modified so that the errors of the two classes are penalized separately:

$$\text{Minimize}\quad \frac{1}{2}\|w\|^{2}+C^{+}\sum_{i:\,y_i=+1}\xi_i+C^{-}\sum_{j:\,y_j=-1}\xi_j \qquad (13)$$

where $C^{+}$ and $C^{-}$ are the penalty weights for the errors of the positive class and the negative class, respectively, and the slack variables $\xi_i$ and $\xi_j$ are the measures of error of the data belonging to the positive class and the negative class, respectively. By setting $C^{+}>C^{-}$ to meet the central concept of the different cost algorithm, the OSH will be much more distant from the smaller class.
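In a library setting, the effect of (13) can be emulated with per-class penalty weights; the sketch below (our illustration, with hypothetical weights) assigns the heavier penalty to the minority positive class:

```python
# Sketch of the different cost algorithm of (13) using per-class penalties:
# class_weight scales C per class, so effectively C+ = 29 * C and C- = C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=+1.0, size=(10, 29))    # small positive (target) class
X_neg = rng.normal(loc=-1.0, size=(290, 29))   # large negative class
X, y = np.vstack([X_pos, X_neg]), np.array([1] * 10 + [-1] * 290)

clf = SVC(C=10.0, kernel="rbf", gamma=0.05, class_weight={1: 29.0, -1: 1.0})
clf.fit(X, y)  # heavier penalty on the smaller class pushes the OSH away from it
```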
3) Improvement of the Generalization Error Bound via Total Margin Algorithm: Because it is impossible to take all face information into consideration, i.e., the available face training samples are always finite and not numerous, the generalization ability of a classifier dominates its prediction accuracy for unseen faces. The soft-margin algorithm used in SVM relaxes the measure of the margin by introducing slack variables for the errors. An OSH is found with the maximal margin of separation $2/\|w\|$ by maximizing the minimum distance between a few extreme values (the support vectors) and the separating hyperplane. However, using only a few extreme training data points causes a loss of information, because most of the information is contained in the nonextreme data, which constitute the majority of the training set. Feng et al. proposed the scaled SVM [21], which not only employs the support vectors but also involves the means of the classes to reduce the generalization error of SVM. However, the face-pattern distribution is generally non-Gaussian and highly nonconvex [3], [22]; namely, the mean of a class may not be very representative. Another approach for improving the generalization error bound, called the total margin algorithm, was proposed by Yoon et al. [4].
The total margin algorithm extends the soft-margin algorithm by introducing extra surplus variables $\rho_i$ for the correctly classified data points. The surplus variable measures the distance between a correctly classified data point and the hyperplane $w\cdot\Phi(x)+b=+1$ or $w\cdot\Phi(x)+b=-1$, if the data point belongs to the positive or the negative class, respectively. In addition to minimizing the sum of the slack variables (over the misclassified data points) while maximizing the margin of separation, as proposed by the soft-margin algorithm, the total margin algorithm suggests that the sum of the surplus variables (over the correctly classified data points) should also be maximized simultaneously. Maximizing the sum of the surplus variables is equivalent to maximizing $\sum_i\rho_i$, which in turn is equivalent to minimizing $-\sum_i\rho_i$. Therefore, the total margin-based SVM is formulated as the constrained optimization problem

Minimize
$$\frac{1}{2}\|w\|^{2}+C_1\sum_{i=1}^{N}\xi_i-C_2\sum_{i=1}^{N}\rho_i$$
Subject to
$$y_i\bigl(w\cdot\Phi(x_i)+b\bigr)\ge 1-\xi_i+\rho_i,\quad \xi_i\ge 0,\;\rho_i\ge 0 \qquad (14)$$

where $C_1$ is the weight for the misclassified data points (the slack variables $\xi_i$) and $C_2$ is the weight for the correctly classified data points, i.e., the surplus variables $\rho_i$.
From (14), we can see that the construction of the OSH is no longer controlled only by a few extreme data points, most of which may be misclassified, but also by the correctly classified data points, which form the majority of the training set. The advantages are clear. First, the essence of the soft margin-based SVM is to rely only on the set of data points that take extreme values, the so-called support vectors. From the statistics of extreme values, we know that the disadvantage of such an approach is that the information contained in most samples (the nonextreme values) is lost, so such an approach is bound to be less efficient than one that takes the lost information into account [21], i.e., the correctly classified data points. Therefore, the total margin algorithm can be more efficient and can attain a better generalization error bound.
Fig. 1. Geometric interpretation of slack variables and surplus variables used in TAF-SVM.
TABLE I
INTERPRETATION OF THE RELATIONSHIPS BETWEEN SLACK VARIABLES,
SURPLUS VARIABLES, AND THE POINT LOCATIONS BY TAKING
POSITIVE TRAINING DATA POINTS AS EXAMPLE
Introducing nonnegative Lagrange multipliers for the constraints gives the Lagrangian of the primal problem (17). Differentiating the Lagrangian with respect to $w$, $b$, $\xi_i$, and $\rho_i$ and setting the derivatives to zero yields the stationarity conditions (18a)-(18f); in particular,

$$w=\sum_{i=1}^{N}\alpha_i y_i\,\Phi(x_i) \quad\text{and}\quad \sum_{i=1}^{N}\alpha_i y_i=0$$

where the $\alpha_i$ are the Lagrange multipliers of the separation constraints. By the resubstitution of these equations into the primal problem, the dual problem is obtained

Maximize
$$\sum_{i=1}^{N}\alpha_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,k(x_i,x_j) \qquad (19)$$
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i=0 \qquad (20a)$$
$$s_i D^{+}\le\alpha_i\le s_i C^{+}\quad\text{for } y_i=+1 \qquad (20b)$$
$$s_j D^{-}\le\alpha_j\le s_j C^{-}\quad\text{for } y_j=-1 \qquad (20c)$$

where $s_i$ denotes the fuzzy membership value assigned to data point $x_i$. The optimal multipliers, together with the Karush-Kuhn-Tucker complementarity conditions (22a)-(22f), determine the OSH (21); the bias $b$ is computed from the support vectors via (23) and (24a)-(24b). The resulting decision function is

$$f(x)=\operatorname{sign}\Bigl(\sum_{i=1}^{N}\alpha_i y_i\,k(x_i,x)+b\Bigr). \qquad (25)$$
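As an illustration of how the dual (19)-(20c) can be solved numerically, the following sketch (our reconstruction under the stated assumptions, not the paper's implementation) poses it as a box-constrained QP; note how the feasible interval of each multiplier scales with the membership $s_i$:

```python
# Sketch: solving the TAF-SVM dual (19)-(20c) as a small QP. The box
# constraints shrink with the fuzzy membership s_i, so less important points
# get a narrower feasible region for their multipliers.
import numpy as np
from scipy.optimize import minimize

def taf_svm_dual(K, y, s, C_pos, C_neg, D_pos, D_neg):
    """K: (N, N) kernel matrix; y: labels in {-1, +1}; s: memberships in (0, 1]."""
    lo = s * np.where(y > 0, D_pos, D_neg)   # lower bounds, (20b)-(20c)
    hi = s * np.where(y > 0, C_pos, C_neg)   # upper bounds, (20b)-(20c)
    Q = (y[:, None] * y[None, :]) * K

    res = minimize(
        lambda a: 0.5 * a @ Q @ a - a.sum(),                  # negative of (19)
        x0=(lo + hi) / 2.0,
        jac=lambda a: Q @ a - 1.0,
        bounds=list(zip(lo, hi)),
        constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # (20a)
        method="SLSQP",
    )
    return res.x  # Lagrange multipliers alpha_i for the decision function (25)
```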
This means that a less important data point has a narrower feasible region for its Lagrange multiplier.
Fig. 2. Thirty subjects of the CYCU multiview face database.

Another question is how to fuzzify the training set efficiently. Basically, the rule for assigning proper membership values to data points can depend on the relative importance of the data points to their own classes. Therefore, for a positive data point $x_i$, its assigned membership value can be calculated with the membership function

$$s_i=\begin{cases}1-\dfrac{\|x_i-\bar{x}^{+}\|}{\max_{j}\|x_j-\bar{x}^{+}\|+\delta}, & \text{if this value is at least }\varepsilon\\[1mm] \varepsilon, & \text{otherwise}\end{cases} \qquad (26)$$

where $\|\cdot\|$ denotes the Euclidean distance, the lower bound $\varepsilon$ is a nonnegative small real number and is user-defined, $\delta$ is a very small positive constant, and $\bar{x}^{+}$ is the mean of all data points in the positive class. The membership values $s_i$ for all the fuzzified positive training data are bounded in $[\varepsilon,1]$. The same procedure is also used for the fuzzification of the negative data, in which the mean is calculated with all the negative data.
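A distance-to-class-mean membership in the spirit of (26) is straightforward to compute; the sketch below is our illustration, with $\delta$ and the clamp at $\varepsilon$ as stated above:

```python
# Sketch of a distance-to-class-mean membership function in the spirit of (26).
# delta is a small constant that keeps the largest distance from producing a
# membership of exactly zero; eps is the user-defined lower bound.
import numpy as np

def fuzzy_memberships(X_class: np.ndarray, eps: float = 0.4,
                      delta: float = 1e-6) -> np.ndarray:
    """Memberships in [eps, 1]: points far from the class mean (likely
    outliers, e.g., exaggerated expressions) receive low membership."""
    mean = X_class.mean(axis=0)
    dist = np.linalg.norm(X_class - mean, axis=1)   # Euclidean distances
    s = 1.0 - dist / (dist.max() + delta)
    return np.maximum(s, eps)                        # clamp at lower bound eps
```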
3) In SVM, only one free parameter $C$ has to be adjusted. A more complex procedure may be expected for TAF-SVM, since there are four free parameters to be adjusted: $C^{+}$, $C^{-}$, $D^{+}$, and $D^{-}$. However, the adjustment process can be simplified according to some relationships. First, the inequality constraints in (20b) and (20c) imply that the two inequalities $C^{+}\ge D^{+}$ and $C^{-}\ge D^{-}$ must hold. Second, based on the concept of adaptation to imbalanced data sets, the relationships $C^{+}>C^{-}$ and $D^{+}>D^{-}$ are required if the size of the positive class is smaller than that of the negative class. Two ratios are defined to simplify the procedure of adjusting these parameters:

$$I=\frac{C^{+}}{C^{-}}=\frac{D^{+}}{D^{-}} \qquad (27)$$

$$T=\frac{C^{+}}{D^{+}}=\frac{C^{-}}{D^{-}} \qquad (28)$$
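As a check on (27) and (28), fixing $C^{-}$ and choosing $(I,T)$ determines the remaining parameters directly; this small sketch reproduces one of the parameter tuples reported in Section V:

```python
# From (27)-(28): fixing C- and choosing (I, T) determines the other three
# parameters. With C- = 10 and (I, T) = (20, 4) this reproduces the tuple
# (C+, C-, D+, D-) = (200, 10, 50, 2.5) reported in the experiments.
def params_from_ratios(C_neg: float, I: float, T: float):
    C_pos = I * C_neg          # (27): I = C+/C-
    D_pos = C_pos / T          # (28): T = C+/D+
    D_neg = C_neg / T          # (28): T = C-/D-
    return C_pos, C_neg, D_pos, D_neg

print(params_from_ratios(10.0, 20, 4))   # -> (200.0, 10.0, 50.0, 2.5)
```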
By fixing the value of any one of the four parameters $C^{+}$, $C^{-}$, $D^{+}$, and $D^{-}$, and setting the values of $I$ and $T$, the other three parameters can be obtained directly. In the case of $I=1$, no adaptation effort is made for the imbalanced case. Furthermore, TAF-SVM becomes the standard SVM if $I$ equals 1 and $T$ goes to infinity ($C^{+}=C^{-}$ is finite and $D^{+}=D^{-}=0$) and the membership values are all set to 1. It is noticed that a very small positive number is added in practice to avoid divisions by zero in this limiting case. Also, when $I=1$, $T$ goes to infinity, and $0<\varepsilon\le 1$, the proposed TAF-SVM becomes the FSVM. Therefore, SVM and FSVM can be viewed as two particular cases of the proposed TAF-SVM.
V. EXPERIMENTAL RESULTS
A. Experiment on CYCU Face Database
Here, we present a set of experiments that were carried out using the Chung Yuan Christian University (CYCU) multiview face database [10]. The CYCU multiview face database contains 3150 face images of 30 subjects and involves variations of facial expression, viewpoint, and sex. Each image is a 112 x 92 pixel matrix with 24-bit color. The viewpoint is governed by two parameters, the rotation angle and the tilt angle, where the rotation angle and the tilt angle take seven and three different values, respectively. For each viewpoint of each subject we prepared five face images with different facial expressions; therefore, each subject has 105 face images. Fig. 2 shows the 30 subjects in this database. All images contain face contours and hair. The color of the background is nearly white, and the lighting condition is controlled to be uniform. Fig. 3 shows the 21 collected images containing 21 different viewpoints of one subject.
1) Analysis of Face-Pattern Distributions in KFDA-Based Subspace: Two cases are analyzed in this subsection. Before the experiments, all color face images are converted to gray-level images, and the contrast of each gray image is enhanced by histogram equalization. All gray-level images of 112 x 92 pixels are resized to 28 x 23 pixel matrices before the feature extraction. In addition, the features extracted by KFDA in Case 1 and Case 2 are the two most discriminating features.
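The preprocessing chain just described can be sketched in a few lines; OpenCV is our choice here purely for illustration, as the paper does not name an implementation:

```python
# Sketch of the described preprocessing: grayscale conversion, histogram
# equalization, and resizing from 112x92 down to 28x23 pixels.
import cv2

def preprocess_face(path: str):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # color -> gray level
    img = cv2.equalizeHist(img)                   # contrast enhancement
    return cv2.resize(img, (23, 28))              # 112x92 -> 28x23 (rows x cols)
```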
Case 1: Fig. 4 depicts the distribution of the face patterns of five subjects randomly chosen from the database in the KFDA-based subspaces. Each subject contributes 21 patterns covering the whole range of rotation and tilt angles, i.e., each of the 21 viewpoints provides one image per person. Two observations are as follows. First, for the kernel width of Fig. 4(a), we observe that there exists an outlier for one of the classes. This outlier is very far from the main body of its class and falls into another class. The SVM will suffer from the overfitting problem when it is applied to solve the binary classification problem between the two classes. Second, according to the distribution shown in Fig. 4(b), it is observed that there exists an overlap among three of the classes.
Fig. 4. Distribution of the 105 face patterns of the five subjects in KFDA-based subspaces with the RBF kernel, for the two kernel width settings of (a) and (b).
Fig. 5. Face-cutting procedure and the cropped face images with different cropping sizes of i. The dotted white rectangle is the face-bounding box.
Fig. 6. Distribution of the 105 randomly cropped face patterns of the five subjects in the KFDA-based subspaces, for the two kernel width settings of (a) and (b).
A robust classifier is still needed even though the robust feature-extraction method KFDA has been used; the robust classifier here means a classifier better than the NN classifier. Besides, because outliers appear in the distribution, and imbalanced training sets arise when the OAA method is employed, a classifier that is more robust than SVM is also needed. This also motivates this paper.
With the optimal kernel value, a total of 29 discriminating feature components are extracted from each face image by using the KFDA method.
a) Sensitivity test on $C^{+}$, $C^{-}$, $D^{+}$, and $D^{-}$: The first experiment tests the sensitivity of TAF-SVM to the four parameters $C^{+}$, $C^{-}$, $D^{+}$, and $D^{-}$, which have been condensed into the ratios $I$ and $T$ defined by (27) and (28). The values of $I$ and $T$ for this experiment are {1, 10, 20, 30, 40} and {2, 4, 6, 8, $\infty$}, respectively. The value of $C^{-}$ is set to 10 and is fixed. The lower bound $\varepsilon$ of the fuzzy membership function is fixed to 0.4. The RBF kernel is also used for TAF-SVM, and its kernel parameter is set to 0.05. The experimental results for the three conditions are shown in Fig. 7.
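Procedurally, this scan amounts to a small grid search over $(I,T)$; the sketch below shows the loop structure, with the training-and-evaluation routine left as a hypothetical helper:

```python
# Sketch of the (I, T) sensitivity scan described above. The helper
# `average_error_rate` is hypothetical; it would train OAA TAF-SVM with the
# given parameters and return the average error rate on the test set.
import math

C_neg, eps, gamma = 10.0, 0.4, 0.05
for I in [1, 10, 20, 30, 40]:
    for T in [2, 4, 6, 8, math.inf]:
        C_pos = I * C_neg
        D_pos, D_neg = C_pos / T, C_neg / T   # both vanish as T -> infinity
        # err = average_error_rate(C_pos, C_neg, D_pos, D_neg, eps, gamma)
```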
The lowest error rates for the three conditions are 2.10%, 7.10%, and 4.80%, attained when the pairs $(I,T)$ equal (20, 4), (30, 4), and (30, 6), respectively. The corresponding values of $(C^{+}, C^{-}, D^{+}, D^{-})$ are (200, 10, 50, 2.5), (300, 10, 75, 2.5), and (300, 10, 50, 1.66), respectively. In addition, the results also indicate that the error rates can be reduced by changing the ratios $I$ and $T$. In the following, we take the results of Condition 3 shown in Fig. 7(c) as examples to show how the performance of TAF-SVM is affected under different $I$ and $T$. Three steps are as follows.
Step 1) In Fig. 7(c), the largest average error rate, 7.76%, occurs at the position $(I,T)=(1,\infty)$. $(I,T)=(1,\infty)$ means that $(C^{+},C^{-},D^{+},D^{-})=(10,10,0,0)$. At this position, the different cost algorithm is disabled because the penalties for the positive and the negative classes are the same ($C^{+}=C^{-}=10$). The total margin algorithm is also disabled at this position because $D^{+}=D^{-}=0$. Therefore, only the fuzzy penalty is used in TAF-SVM.
Step 2) As the position $(I,T)$ goes from $(1,\infty)$ to $(30,\infty)$, the average error rate decreases from 7.76% to 5.94%. At the position $(I,T)=(30,\infty)$, the different cost algorithm is enabled in TAF-SVM because the penalties for the positive and the negative classes become different: $C^{+}=300$ and $C^{-}=10$. We can see that when $C^{+}=300$ and $C^{-}=10$, the penalty for the positive class is much larger than that for the negative class. This meets the role of the different cost algorithm, which says: assign the heavier penalty to the smaller class. In the experiments of Fig. 7(c), the number of negative training data is 29 times the number of positive training data in the learning of each OSH by OAA TAF-SVM, because there are 30 subjects in the CYCU database to be classified. On the other hand, the total margin algorithm is still disabled at this position because $D^{+}=D^{-}=0$. By comparing the analysis in Step 1) with that in Step 2), we can see that the error rate is reduced by the application of the different cost algorithm.
Step 3) As the position $(I,T)$ goes from $(30,\infty)$ to $(30,6)$, the average error rate decreases from 5.94% to 4.8%. At the position $(I,T)=(30,6)$, not only is the different cost algorithm enabled ($C^{+}=300$ and $C^{-}=10$), but the total margin algorithm is enabled as well, since $D^{+}$ and $D^{-}$ become nonzero.
Fig. 8. Average error rates versus the lower bound $\varepsilon$ of the fuzzy membership function.
TABLE II
COMPARISONS OF AVERAGE ERROR RATES (%) AMONG THE THREE CONDITIONS UNDER DIFFERENT COMBINATIONS OF $\varepsilon$ AND $(I, T)$
Data points lying near the border of their class possess near-zero membership values. Accordingly, their feasible regions are substantially compressed to very small regions, as can be seen in (20b) and (20c). An extreme situation arises when the value of $\varepsilon$ is set to zero: data points whose membership values are zero are excluded from the training process, since both the upper bound and the lower bound in the inequality constraints (20b) and (20c) become zero. Therefore, information loss occurs in the training process and, in consequence, a poorer performance is obtained.
3) Comparison of Performance Between TAF-SVM and SVM: In order to verify the efficiency of the proposed TAF-SVM, we resample the training set and test set ten times, following the data-collection criterion of Condition 3. The face-cutting procedure is applied to each selected face image, and the features of a face image are extracted by the KFDA method. For SVM, the multiclass classification tasks are accomplished by using the OAA and OAO strategies. For TAF-SVM, only the OAA strategy is used, since imbalanced face training sets are obtained only under this strategy. The parameters used in KFDA, SVM, and TAF-SVM are listed in Table III. They were determined by trial and error so that every classifier obtains its lowest error rate.
Table IV lists the comparison of the results among the different systems. The average error rates are obtained over ten runs. We can see that by using the KFDA-based feature extractor and a simple NN classifier, the average error rate is substantially reduced to 11.41%, which is almost half of the 22.26% obtained from a pure NN classifier. Cascading the OAA-based SVM classifiers with the KFDA feature extractor reduces the error rate further.
TABLE III
PARAMETER SETTING IN KFDA, SVM, AND TAF-SVM
TABLE IV
COMPARISON OF THE AVERAGE ERROR RATE AND STANDARD DEVIATION (SD)
OVER TEN RUNS BETWEEN TAF-SVM AND OTHER SYSTEMS
TABLE V
COMPARISON OF COMPUTATION TIME AMONG DIFFERENT SYSTEMS
Fig. 9. First row: images of one of the 200 subjects in the FERET database.
Second row: cropped images of those in the first row after histogram equalization.
TABLE VI
COMPARISONS OF AVERAGE ERROR RATE AND SD AMONG DIFFERENT
SYSTEMS WITH KFDA FEATURE EXTRACTION ON FERET DATABASE
TABLE VII
COMPARISONS OF AVERAGE ERROR RATE AND SD AMONG DIFFERENT
SYSTEMS WITHOUT KFDA FEATURE EXTRACTION ON FERET DATABASE
[15] M. J. Er, S. Wu, J. Liu, and H. L. Toh, "Face recognition with radial basis function (RBF) neural networks," IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 697-710, May 2002.
[16] M. J. Er, W. L. Chen, and S. Q. Wu, "High-speed face recognition based on discrete cosine transform and RBF neural networks," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 679-691, May 2005.
[17] G. Wu and E. Chang, "Class-boundary alignment for imbalanced dataset learning," in Proc. ICML 2003 Workshop Learn. Imbalanced Data Sets II, Washington, DC, 2003, pp. 49-56.
[18] N. Japkowicz, "The class imbalance problem: Significance and strategies," in Proc. 2000 Int. Conf. Artif. Intell.: Special Track on Inductive Learning, Las Vegas, NV, 2000, pp. 111-117.
[19] C. Ling and C. Li, "Data mining for direct marketing: Problems and solutions," in Proc. 4th Int. Conf. Knowl. Disc. Data Mining, New York, 1998, pp. 73-79.
[20] N. Chawla, K. Bowyer, and W. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321-357, 2002.
[21] J. F. Feng and P. Williams, "The generalization error of the symmetric and scaled support vector machines," IEEE Trans. Neural Netw., vol. 12, no. 5, pp. 1255-1260, Sep. 2001.
[22] M. H. Yang, "Kernel Eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods," in Proc. 5th IEEE Int. Conf. Autom. Face Gesture Recognit., Washington, DC, 2002, pp. 215-220.
[23] Q. S. Liu, H. Q. Lu, and S. D. Ma, "Improving kernel Fisher discriminant analysis for face recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 42-49, Jan. 2004.
[24] B. Schölkopf, A. Smola, and K. R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299-1319, 1998.
[25] K. I. Kim, K. Jung, and H. J. Kim, "Face recognition using kernel principal component analysis," IEEE Signal Process. Lett., vol. 9, no. 2, pp. 40-42, Feb. 2002.
[26] G. Cui and W. Gao, "SVMs for few examples-based face recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, 2003, vol. 2, pp. 381-384.
[27] W. Chi, G. Dai, and L. Zhang, "Face recognition based on independent Gabor features and support vector machine," in Proc. 5th World Congr. Intell. Control Autom., Hangzhou, China, 2004, vol. 5, pp. 4030-4033.
[28] C. Y. Li, F. Liu, and Y. X. Xie, "Face recognition using self-organizing feature maps and support vector machines," in Proc. 5th Int. Conf. Comput. Intell. Multimedia Appl., Xi'an, China, 2003, pp. 37-42.
[29] G. Dai and C. Zhou, "Face recognition using support vector machines with the robust feature," in Proc. 12th IEEE Int. Workshop Robot Human Interactive Commun., 2003, pp. 49-53.
[30] S. Y. Zhang and H. Qiao, "Face recognition with support vector machine," in Proc. IEEE Int. Conf. Robot., Intell. Syst. Signal Process., Changsha, China, 2003, vol. 2, pp. 726-730.
[31] K. I. Kim, J. Kim, and K. Jung, "Recognition of facial images using support vector machines," in Proc. 11th IEEE Workshop Stat. Signal Process., Singapore, 2001, pp. 468-471.
[32] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711-720, Jul. 1997.
[33] Y. Lin, Y. Lee, and G. Wahba, "Support vector machines for classification in nonstandard situations," Mach. Learn., vol. 46, pp. 191-202, 2002.
[34] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Puerto Rico, 1997, pp. 130-136.
[35] U. H.-G. Kressel, "Pairwise classification and support vector machines," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999.
[36] T. Van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle, "Benchmarking least squares support vector machine classifiers," Mach. Learn., vol. 54, no. 1, pp. 5-32, 2004.