
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007

DETERMINE THE PARAMETER OF KERNEL DISCRIMINANT ANALYSIS
IN ACCORDANCE WITH FISHER CRITERION
YONG XU, WEI-JIE LI

Department of Computer Science & Technology, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen
518005, China
E-MAIL: laterfall2@yahoo.com.cn, weijiekaoyan@163.com

Abstract:
Feature extraction performance of kernel discriminant analysis (KDA) is influenced by the value of the parameter of the kernel function. It is usually hard to fully exert the performance of KDA because it is not easy to determine the optimal value of the kernel parameter. Though some approaches have been proposed to determine the parameter of KDA automatically, it seems that none of them takes the nature of Fisher discriminant analysis into account when selecting the value of the kernel parameter. In this paper, we develop a novel parameter selection approach that is subject to the essence of Fisher discriminant analysis. This approach is theoretically able to find the kernel parameter associated with a feature space with satisfactory linear separability, and it can be carried out using an iterative computation procedure. Experimental results show that the developed approach results in much higher classification accuracy than naive KDA.

Keywords:
Kernel discriminant analysis (KDA); Kernel function; Parameter selection; Fisher criterion

1. Introduction

Kernel methods are a class of nonlinear methods that are designed by applying linear techniques in implicit feature spaces induced by kernel functions. As a kernel method, kernel discriminant analysis (KDA) [1-7], which is rooted in the well-known Fisher discriminant analysis (FDA) [8-11], has been widely used in computer vision and pattern recognition. The target of FDA is to find the optimal discriminant direction, i.e. the direction associated with the best linear separability: the highest linear separability is produced only if data are projected onto the optimal discriminant direction. The implementation of KDA is theoretically equivalent to the consecutive implementation of two procedures: mapping the original sample space, i.e. the input space, into a new space, i.e. the feature space, and then carrying out FDA in the feature space. Indeed, while FDA tries its best to obtain the optimal discriminant direction of the original sample space, KDA can be considered to aim at achieving the optimal direction of the feature space. It can also be said that, for the data in the feature space, the best linear separability is available for their projections onto the optimal discriminant direction determined by KDA. Because the feature space induced by KDA is usually equivalent to a space obtained through a nonlinear transform, KDA is able to produce linearly separable features even for data from the input space that have poor linear separability.

KDA is associated with a kernel function that has at least one parameter, the so-called kernel parameter. KDA can be implemented only if a value is assigned to the kernel parameter; on the other hand, different parameter values usually produce different feature extraction results. In practice, a number of approaches have been proposed to select the parameter value for KDA. These parameter selection approaches can be divided into two classes: the first class usually selects the parameter by maximizing a likelihood, and the second class is based on a criterion describing the relation between samples. The following are some examples of the first class. Shawkat Ali and Kate A. Smith proposed an automatic parameter learning approach using Bayesian inference [12]. An expectation maximization algorithm developed by Tonatiuh Peña Centeno and Neil D. Lawrence determined the kernel parameters and the regularization coefficient through maximization of the marginal likelihood of the data [13]. Note that the optimization procedure in [13] is not guaranteed to find the global optimum. Volker Roth used a cross-validation criterion to select all free parameters in KDA [14]. Schittkowski employed a nonlinear programming algorithm to determine kernel and weighting parameters of kernel methods such as support vector machines [15]. It should be pointed out that the parameter selection performance of the algorithm in [15]

1-4244-0973-X/07/$25.00 ©2007 IEEE



depends on the choice of the initial values of the parameters. On the other hand, the following works belong to the second class of parameter selection approaches. Lorenzo Bruzzone and Diego Fernández Prieto determined the centers and widths of the kernel functions of an RBF network through an approach that assesses the between-class relation of training samples [16]. Note that when the classes present a high degree of overlap in the input space, the complexity of the classification problem may limit the ability of this technique to reduce the overlap of the kernel functions associated with the different classes. Daoqiang Zhang selected the parameter for KDA using an objective function on the linear separability of data [17].

In this paper we propose a novel approach to determine the kernel parameter of KDA that essentially complies with the nature of discriminant analysis. Indeed, for KDA with a certain kernel function, the feature extraction performance is related only to the kernel parameter. Thus we can regard as the optimal parameter the kernel parameter associated with the feature space having the best linear separability. It is known that the Fisher criterion indicates the extent of the linear separability between the sample features of different classes. This reminds us that the Fisher criterion can be used as the indicator for selecting the optimal kernel parameter for KDA. Hence we can conduct parameter selection for KDA by maximizing the Fisher criterion.

Our parameter selection approach takes the maximization of the Fisher criterion as the target of parameter selection. This approach is completely subject to the nature of Fisher discriminant analysis. The developed approach obtains good experimental results and yields a performance improvement for KDA. The rest of this paper is organized as follows: KDA is introduced briefly in Section 2. The idea and the algorithm of parameter selection are presented in Section 3. Experimental results are shown in Section 4. In Section 5 we offer our conclusion.

2. KDA

KDA is based on a conceptual transform from the input space into a feature space. KDA can be derived from FDA as follows. Let $\{x_i\}$ denote the samples in the input space and let $\phi$ be a nonlinear function that transforms the input space into the feature space. Consequently, the Fisher criterion in the feature space is

$$J(w) = \frac{w^T S_b^\phi w}{w^T S_w^\phi w} \qquad (1)$$

where $w$ is a discriminant vector, and $S_b^\phi$ and $S_w^\phi$ are respectively the between-class and within-class scatter matrices in the feature space. Suppose that there are $l$ samples $x_1, x_2, \ldots, x_l$ from $c$ classes and that the $i$-th class has $l_i$ training samples, where $l = \sum_{i=1}^{c} l_i$. Let $x_j^i$, $j = 1, 2, \ldots, l_i$, $i = 1, 2, \ldots, c$, denote the $j$-th sample in the $i$-th class. If the prior probabilities of all the classes are equal, then we have

$$S_b^\phi = \sum_{i=1}^{c} \left( m_i^\phi - m_0^\phi \right) \left( m_i^\phi - m_0^\phi \right)^T \qquad (2)$$

$$S_w^\phi = \sum_{i=1}^{c} \sum_{j=1}^{l_i} \left( \phi(x_j^i) - m_i^\phi \right) \left( \phi(x_j^i) - m_i^\phi \right)^T \qquad (3)$$

where $m_i^\phi = \frac{1}{l_i} \sum_{j=1}^{l_i} \phi(x_j^i)$ and $m_0^\phi = \frac{1}{c} \sum_{i=1}^{c} m_i^\phi$.

According to the theory of reproducing kernels, $w$ can be expressed in terms of all the training samples, i.e.

$$w = \sum_{i=1}^{l} \alpha_i \phi(x_i) \qquad (4)$$

Furthermore, if a kernel function $k(x_i, x_j)$ is introduced to denote the dot product $\phi(x_i) \cdot \phi(x_j)$, we can obtain the following formulas:

$$w^T m_i^\phi = \frac{1}{l_i} \sum_{j=1}^{l} \sum_{k=1}^{l_i} \alpha_j \, k(x_j, x_k^i) = \alpha^T M_i \qquad (5)$$

$$w^T m_0^\phi = \frac{1}{c} \sum_{i=1}^{c} w^T m_i^\phi = \alpha^T M_0 \qquad (6)$$

where $M_i$ $(i = 1, 2, \ldots, c)$ is an $l \times 1$ vector with elements $(M_i)_j = \frac{1}{l_i} \sum_{q=1}^{l_i} k(x_j, x_q^i)$, and $M_0 = \frac{1}{c} \sum_{i=1}^{c} M_i$. Based on (5) and (6), we arrive at

$$w^T S_b^\phi w = w^T \sum_{i=1}^{c} \left( m_i^\phi - m_0^\phi \right) \left( m_i^\phi - m_0^\phi \right)^T w = \alpha^T \sum_{i=1}^{c} \left( M_i - M_0 \right) \left( M_i - M_0 \right)^T \alpha.$$

Then we have

$$w^T S_b^\phi w = \alpha^T M \alpha \qquad (7)$$

where $M = \sum_{i=1}^{c} (M_i - M_0)(M_i - M_0)^T$. In addition, the following formulation can be derived:

$$w^T S_w^\phi w = \sum_{i=1}^{c} \sum_{h=1}^{l_i} \left[ \sum_{q=1}^{l} \alpha_q \left( k(x_q, x_h^i) - \frac{1}{l_i} \sum_{n=1}^{l_i} k(x_q, x_n^i) \right) \right]^2.$$
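The quantities defined in (5)-(7), together with the class-centered double sum above, translate directly into code. The following is a minimal NumPy sketch; the Gaussian kernel is only an illustrative choice here (the paper fixes this kernel later, in Section 4), and all function names are ours.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """k(x, z) = exp(-||x - z||^2 / sigma); an illustrative kernel choice."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def between_class_M(K, y):
    """M of eq. (7), built from M_i (eq. 5) and M_0 (eq. 6); K is the l x l kernel matrix."""
    Ms = [K[:, np.flatnonzero(y == cls)].mean(axis=1) for cls in np.unique(y)]  # each M_i
    M0 = np.mean(Ms, axis=0)                                                    # eq. (6)
    return sum(np.outer(Mi - M0, Mi - M0) for Mi in Ms)                         # eq. (7)

def within_class_value(K, y, alpha):
    """w^T S_w^phi w evaluated by the class-centered double sum derived above."""
    total = 0.0
    for cls in np.unique(y):
        Ki = K[:, np.flatnonzero(y == cls)]              # columns k(., x_h^i)
        centered = Ki - Ki.mean(axis=1, keepdims=True)   # k(x_q, x_h^i) - (1/l_i) sum_n k(x_q, x_n^i)
        total += np.sum((alpha @ centered) ** 2)         # inner sum over q, squared, summed over h
    return total
```

Given an $l$-dimensional coefficient vector $\alpha$, `within_class_value` returns the scalar value of $w^T S_w^\phi w$, which the parameter selection scheme of Section 3 exploits when evaluating the Fisher criterion.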


The expression for $w^T S_w^\phi w$ above can be further transformed into the form shown in (8):

$$w^T S_w^\phi w = \sum_{i=1}^{c} \alpha^T K_i \left( I - I_{l_i} \right) \left( I - I_{l_i} \right)^T K_i^T \alpha \qquad (8)$$

where $K_i$ is an $l \times l_i$ matrix defined as $(K_i)_{n,m} = k(x_n, x_m^i)$, $n = 1, 2, \ldots, l$, $m = 1, 2, \ldots, l_i$, $I$ is the $l_i \times l_i$ identity matrix, and $I_{l_i}$ is an $l_i \times l_i$ matrix whose elements are all $1/l_i$. Note that if we ignore the second-order terms of $(I - I_{l_i})(I - I_{l_i})^T$ in (8), we arrive at

$$w^T S_w^\phi w = \sum_{i=1}^{c} \alpha^T K_i \left( I - I_{l_i} \right) K_i^T \alpha = \alpha^T N \alpha \qquad (9)$$

where $N = \sum_{i=1}^{c} K_i (I - I_{l_i}) K_i^T$. The combination of (1), (7) and (9) allows the following formula to be produced:

$$J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha} \qquad (10)$$

As a result, the problem of obtaining the optimal discriminant vector $w$ of the feature space can be converted into the problem of solving for the optimal $\alpha$, which is associated with the maximum of $J(\alpha)$. The optimal $\alpha$ is obtained by solving the following eigenequation:

$$M \alpha = \lambda N \alpha \qquad (11)$$

After $\alpha$ is obtained, we can use it to extract features of samples; for details please see ref. [6]. Because the method presented above is defined on the basis of a kernel function and Fisher discriminant analysis, it is called kernel discriminant analysis (KDA).

The theory of KDA can be analyzed as follows. The implementation of KDA is equivalent to the consecutive implementation of two procedures: mapping the input space into the feature space and then carrying out FDA in the feature space. However, the use of the kernel function allows the nonlinear mapping to be carried out implicitly. As a result, KDA has a much lower computational complexity than an ordinary nonlinear feature extraction method. Moreover, KDA is able to obtain linearly separable features for data that are not linearly separable in the input space, whereas FDA is not able to do so.

3. Design of the parameter selection scheme

As indicated above, the Fisher criterion $J(\alpha)$ defined in (10) can be used to assess the linear separability of samples in the feature space. Since a large Fisher criterion value means good linear separability, a large Fisher criterion will result in good classification performance. Thus we can apply the following proposition to conduct parameter selection.

Proposition 1. Maximizing the Fisher criterion (10) can be regarded as the objective of parameter selection for KDA.

We can exploit the widely used gradient descent algorithm to design the parameter selection scheme. A lemma available for the gradient descent method is as follows:

Lemma 1. The gradient descent algorithm for finding the minimum of $J(\theta)$ with respect to the parameter $\theta$ is formulated as

$$\theta(n+1) = \theta(n) - \mu \left. \frac{\partial J(\theta)}{\partial \theta} \right|_{\theta = \theta(n)} \qquad (12)$$

where $\mu$ is a positive constant called the learning ratio.

Because we aim at obtaining the maximum of the Fisher criterion value, (12) cannot be used directly. However, the following theorem can be used for finding the maximum value of the Fisher criterion.

Theorem 1. For the kernel parameter $\sigma$ of KDA, formula (13) is able to determine the value of $\sigma$ associated with the largest Fisher criterion:

$$\sigma(n+1) = \sigma(n) + \mu \left. \frac{\partial J(\alpha(\sigma))}{\partial \sigma} \right|_{\sigma = \sigma(n)} \qquad (13)$$

Furthermore, we can implement (13) according to Proposition 2.

Proposition 2. We can determine the optimal value of the parameter $\sigma$ using the following procedure:
(1) For $n = 0$, assign an initial value to $\sigma$; meanwhile, set the learning ratio $\mu$ to a proper value.
(2) Update $\sigma$ repeatedly according to (13).
(3) Terminate the iteration if one of the following conditions is satisfied: $|\sigma(n+1) - \sigma(n)| < \varepsilon$, $|J(\alpha)_{n+1} - J(\alpha)_{n}| < \varepsilon$, or the iteration count exceeds the given maximum, where $\varepsilon$ is a small positive constant. The final $\sigma$ is regarded as the optimal parameter of KDA.

Furthermore, for two-class problems, the following lemma and theorem allow us to obtain the KDA solution $\alpha$ and the optimal parameter more efficiently.

Lemma 2. For KDA applied to two-class classification problems, the optimal KDA solution $\alpha$ associated with the optimal parameter can be calculated efficiently using

$$\alpha = N^{-1} (M_1 - M_2) \qquad (14)$$

Proof: Supposing that $N$ is not singular, (11) can be rewritten as

$$N^{-1} M \alpha = \lambda \alpha \qquad (15)$$
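The iterative scheme of Proposition 2 can be sketched in a few lines. The following is a minimal NumPy sketch: the derivative in (13) is approximated by central finite differences, and a small ridge term keeps $N$ invertible; both devices are our assumptions, not prescribed by the paper.

```python
import numpy as np

def fisher_criterion(X, y, sigma, reg=1e-8):
    """J(alpha) of eq. (10): M and N as in (7) and (9), alpha from the eigenproblem (11).
    A Gaussian kernel k(x, z) = exp(-||x - z||^2 / sigma) is assumed, as in eq. (20)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sigma)
    l = len(X)
    Ms, N = [], np.zeros((l, l))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        li = len(idx)
        Ki = K[:, idx]                                               # K_i of eq. (8)
        Ms.append(Ki.mean(axis=1))                                   # M_i of eq. (5)
        N += Ki @ (np.eye(li) - np.full((li, li), 1 / li)) @ Ki.T    # eq. (9)
    M0 = np.mean(Ms, axis=0)                                         # M_0 of eq. (6)
    M = sum(np.outer(Mi - M0, Mi - M0) for Mi in Ms)                 # eq. (7)
    N += reg * np.eye(l)                  # ridge so N is invertible (our assumption)
    vals, vecs = np.linalg.eig(np.linalg.solve(N, M))                # eigenproblem (11)
    alpha = np.real(vecs[:, np.argmax(np.real(vals))])
    return (alpha @ M @ alpha) / (alpha @ N @ alpha)                 # eq. (10)

def select_sigma(X, y, sigma0, mu=0.05, eps=1e-4, max_iter=50, h=1e-3):
    """Proposition 2: sigma(n+1) = sigma(n) + mu * dJ/dsigma (eq. 13), with the
    derivative approximated by central differences (our choice; the paper's
    analytic gradient is not reproduced here)."""
    sigma, J_old = sigma0, fisher_criterion(X, y, sigma0)
    for _ in range(max_iter):
        grad = (fisher_criterion(X, y, sigma + h)
                - fisher_criterion(X, y, sigma - h)) / (2 * h)
        sigma_new = max(sigma + mu * grad, 1e-2)   # keep sigma positive; floor is arbitrary
        J_new = fisher_criterion(X, y, sigma_new)
        if abs(sigma_new - sigma) < eps or abs(J_new - J_old) < eps:
            return sigma_new
        sigma, J_old = sigma_new, J_new
    return sigma
```

The three stopping conditions of Proposition 2 appear directly in `select_sigma`: the change in $\sigma$, the change in $J(\alpha)$, and the iteration cap.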


Because $M = (M_1 - M_2)(M_1 - M_2)^T$, we have

$$M \alpha = (M_1 - M_2)(M_1 - M_2)^T \alpha = g (M_1 - M_2) \qquad (16)$$

where $g$ is a scalar defined as $g = (M_1 - M_2)^T \alpha$. The combination of (15) and (16) gives

$$\alpha = \frac{g}{\lambda} N^{-1} (M_1 - M_2) \qquad (17)$$

Up to a scalar factor, the vector defined in (17) is identical to the vector defined in (14). Thus we can conclude that the optimal $\alpha$ of KDA for a two-class problem can be obtained using $\alpha = N^{-1}(M_1 - M_2)$.

It is clear that (14) has a much lower computational complexity than (11). Thus, using (14), we can calculate $\alpha$ much more efficiently than by the computation using (11).

Theorem 2. For two-class problems, the Fisher criterion associated with $\alpha$ can be expressed as

$$J(\alpha) = (M_1 - M_2)^T N^{-1} (M_1 - M_2) = \alpha^T (M_1 - M_2) \qquad (18)$$

Proof: Because $M = (M_1 - M_2)(M_1 - M_2)^T$, the criterion $J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}$ defined in (10) can be transformed into the following form:

$$J(\alpha) = \frac{\alpha^T (M_1 - M_2)(M_1 - M_2)^T \alpha}{\alpha^T N \alpha} \qquad (19)$$

Furthermore, it follows from Lemma 2 that $J(\alpha) = q$, where $q = (M_1 - M_2)^T (N^{-1})^T (M_1 - M_2)$. Since $(N^{-1})^T = N^{-1}$ and $\alpha = N^{-1}(M_1 - M_2)$, we arrive at the conclusion that $J(\alpha) = (M_1 - M_2)^T N^{-1}(M_1 - M_2) = \alpha^T (M_1 - M_2)$.

Note that, compared with (10), (18) allows us to evaluate $J(\alpha)$ more efficiently. Consequently, it is very useful for reducing the computational burden of the iteration procedure of parameter selection.

4. Experiments

Experiments on several UCI machine learning datasets were performed to compare naive KDA and KDA with the parameter selection scheme. To obtain representative results, we used the leave-one-out scheme. The kernel function employed in KDA is the Gaussian kernel function

$$k(x_i, x_j) = \exp\left( -\| x_i - x_j \|^2 / \sigma \right) \qquad (20)$$

The minimum distance classifier was used for classification. Table 1 and Table 2 respectively show the classification error rates of naive KDA and of KDA with the parameter selection scheme. We can see that our approach, which converts the optimal parameter selection issue of KDA into a Fisher-criterion maximization issue, results in better classification performance than naive KDA. Moreover, as shown above, one can easily understand the underlying reasonableness and rationality of our approach.

Table 1. Classification error rate of naive KDA

    σ            0.02     0.1      1
    Ionosphere   34.2%    19.4%    7.4%
    sonar        38.9%    16.4%    14.4%
    zoo          28.7%    28.7%    5.0%
    glass        50.9%    43.5%    36.9%

Table 2. Classification error rate of KDA with the parameter selection scheme

    Initial value of σ   0.02     0.1      1
    Ionosphere           16.0%    12.8%    6.6%
    sonar                13.5%    13.5%    14.2%
    zoo                  23.7%    16.8%    4.9%
    glass                43.5%    41.6%    36.4%

5. Conclusion

The essence of Fisher discriminant analysis is to obtain the optimal discriminant vector, which is able to produce the optimal sample features with the largest linear separability. For a dataset, a large Fisher criterion value means that the corresponding features can be linearly separated well. That is, the larger the Fisher criterion, the greater the linear separability; as a result, the larger the Fisher criterion is, the better the discriminant vector is. KDA is theoretically equivalent to Fisher discriminant analysis implemented in the feature space, so the linear separability of the feature space can also be assessed using the Fisher criterion. On the other hand, for the feature space induced implicitly by KDA, different parameters produce different linear separability. Thus, it is theoretically reasonable and technically feasible to select the most appropriate value of the parameter so as to achieve the feature space with the maximum linear separability. Indeed, the parameter selection approach developed in this paper is the first parameter selection approach for KDA that is completely subject to the essence of Fisher discriminant analysis.

Experiments show that the developed parameter selection approach does improve naive KDA effectively. The parameter selection approach is very helpful for KDA to produce higher classification accuracy. Moreover, it has clear physical significance and acceptable complexity.
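As a closing illustration, the efficient two-class computation of Lemma 2 and Theorem 2 with the Gaussian kernel (20) can be sketched as follows; this is a minimal NumPy sketch in which the ridge term guarding against a singular $N$ and the function name are our additions.

```python
import numpy as np

def two_class_kda(X, y, sigma, reg=1e-8):
    """Two-class shortcut: alpha = N^{-1}(M_1 - M_2) (eq. 14) and
    J(alpha) = alpha^T (M_1 - M_2) (eq. 18), with the Gaussian kernel (20)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sigma)                                          # eq. (20)
    l = len(X)
    Ms, N = [], np.zeros((l, l))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        li = len(idx)
        Ki = K[:, idx]                                               # K_i of eq. (8)
        Ms.append(Ki.mean(axis=1))                                   # M_i of eq. (5)
        N += Ki @ (np.eye(li) - np.full((li, li), 1 / li)) @ Ki.T    # eq. (9)
    d = Ms[0] - Ms[1]                                                # M_1 - M_2
    alpha = np.linalg.solve(N + reg * np.eye(l), d)                  # eq. (14), with ridge
    return alpha, float(alpha @ d)                                   # J(alpha) per eq. (18)
```

Because $\alpha = N^{-1}(M_1 - M_2)$, the returned value $\alpha^T(M_1 - M_2)$ equals $(M_1 - M_2)^T N^{-1}(M_1 - M_2)$, i.e. both sides of (18), so no eigen-decomposition of (11) is needed inside the selection loop.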


Acknowledgements

This work was supported by the Natural Science Foundation of China (No. 60602038) and the Natural Science Foundation of Guangdong Province, China (No. 06300862). In addition, we would like to thank the anonymous reviewers for their constructive advice.

References

[1] Mika, S., Rätsch, G., Weston, J., et al. Fisher discriminant analysis with kernels. In: Y. H. Hu, J. Larsen, E. Wilson, S. Douglas, eds. Neural Networks for Signal Processing IX, IEEE, (1999) 41-48.
[2] Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B. An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks, 12(1) (2001) 181-201.
[3] Billings, S. A., Lee, K. L. Nonlinear Fisher discriminant analysis using a minimum square error cost function and the orthogonal least squares algorithm. Neural Networks, 15(1) (2002) 263-270.
[4] Jian Yang, Zhong Jin, Jing-Yu Yang, David Zhang, Alejandro F. Frangi. Essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognition, 37(10) (2004) 2097-2100.
[5] Xu, Y., Yang, J.-Y., Lu, J., Yu, D.-J. An efficient renovation on kernel Fisher discriminant analysis and face recognition experiments. Pattern Recognition, 37 (2004) 2091-2094.
[6] Xu, Y., Yang, J.-Y., Yang, J. A reformative kernel Fisher discriminant analysis. Pattern Recognition, 37 (2004) 1299-1302.
[7] Xu, Y., Zhang, D., Jin, Z., Li, M., Yang, J.-Y. A fast kernel-based nonlinear discriminant analysis for multi-class problems. Pattern Recognition, 39(6) (2006) 1026-1033.
[8] Duda, R., Hart, P. Pattern Classification and Scene Analysis. New York: Wiley (1973).
[9] P. Belhumeur, J. Hespanha, D. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. and Mach. Intelligence, 19(10) (1997) 711-720.
[10] Xu, Y., Yang, J.-Y., Jin, Z. Theory analysis on FSLDA and ULDA. Pattern Recognition, 36(12) (2003) 3031-3033.
[11] Xu, Y., Yang, J.-Y., Jin, Z. A novel method for Fisher discriminant analysis. Pattern Recognition, 37(2) (2004) 381-384.
[12] Shawkat Ali, Kate A. Smith. Automatic parameter selection for polynomial kernel. Proceedings of the IEEE International Conference on Information Reuse and Integration, USA, (2003) 243-249.
[13] Tonatiuh Peña Centeno, Neil D. Lawrence. Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research, 7 (2006) 455-491.
[14] Volker Roth. Outlier detection with one-class kernel Fisher discriminants. In: L. K. Saul, Y. Weiss, L. Bottou, eds. Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, (2005) 1169-1176.
[15] K. Schittkowski. Optimal parameter selection in support vector machines. Journal of Industrial and Management Optimization, 1(4) (2005) 465-476.
[16] Lorenzo Bruzzone, Diego Fernández Prieto. A technique for the selection of kernel-function parameters in RBF neural networks for classification of remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 37(2) (1999).
[17] Daoqiang Zhang, Songcan Chen, Zhi-Hua Zhou. Learning the kernel parameters in kernel minimum distance classifier. Pattern Recognition, 39 (2006) 133-135.
