Heterogeneous Kernels
First, we briefly describe our notation. All vectors will be column vectors unless transposed to a row vector by a prime superscript ′. The scalar (inner) product of two vectors x and y in the n-dimensional real space R^n will be denoted by x′y, and the 2-norm of x will be denoted by ‖x‖. The 1-norm and ∞-norm will be denoted by ‖·‖₁ and ‖·‖∞ respectively. For a matrix A ∈ R^{m×n}, A_i is the ith row of A, which is a row vector in R^n. A column vector of ones of arbitrary dimension will be denoted by e, and the identity matrix of arbitrary order will be denoted by I. For A ∈ R^{m×n} and B ∈ R^{n×l}, the kernel K(A, B) (Vapnik, 2000; Cherkassky & Mulier, 1998; Mangasarian, 2000) is an arbitrary function which maps R^{m×n} × R^{n×l} into R^{m×l}. In particular, if x and y are column vectors in R^n, then K(x′, y) is a real number, K(x′, A′) is a row vector in R^m, and K(A, A′) is an m × m matrix.
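As a concrete illustration of this convention, the short sketch below builds K(A, A′) as an m × m numpy array using a Gaussian kernel; the Gaussian form and the width parameter mu are assumptions chosen only for the example, since the definition above allows an arbitrary kernel function.

```python
import numpy as np

def kernel(A, B, mu=0.5):
    """Paper-style kernel K(A, B): A is m x n, B is n x l, the result is m x l.
    Entry (i, j) is exp(-mu * ||A_i' - B_.j||^2), a Gaussian kernel (assumed choice)."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=0)[None, :]
               - 2.0 * (A @ B))
    return np.exp(-mu * sq_dist)

A = np.random.default_rng(0).normal(size=(10, 3))   # m = 10 points in R^3
K = kernel(A, A.T)                                   # K(A, A') is a 10 x 10 matrix
x = A[0]
row = kernel(x[None, :], A.T)                        # K(x', A') is a 1 x 10 row vector
```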
2. Linear Fisher's Discriminant (LFD)

We know that the probability of error due to the Bayes classifier is the best we can achieve. A major disadvantage of the Bayes error as a criterion is that a closed-form analytical expression is not available for the general case. However, by assuming that classes are normally distributed, standard classifiers using quadratic and linear discriminant functions can be designed. The well-known Fisher's Linear Discriminant (LFD) [...] are the between and within class scatter matrices respectively, and

\[
M_c=\tfrac{1}{l_c}\,A_c'\,e_{l_c}\tag{4}
\]

is the mean of class c, where e_{l_c} is an l_c-dimensional vector of ones. Traditionally, the LFD problem has been addressed by solving the generalized eigenvalue problem associated with Equation (1).

When classes are normally distributed with equal covariance, α is in the same direction as the discriminant in the corresponding Bayes classifier. Hence, for this special case LFD is equivalent to the Bayes optimal classifier. Although LFD relies heavily on assumptions that are not true in most real world problems, it has proven to be very powerful. Generally speaking, when the distributions are unimodal and separated by the scatter of means, LFD becomes very appealing. One reason why LFD may be preferred over more complex classifiers is that, as a linear classifier, it is less prone to overfitting.

For most real world data, a linear discriminant is clearly not complex enough. Classical techniques tackle these problems by using more sophisticated distributions in modeling the optimal Bayes classifier; however, these often sacrifice the closed-form solution and are computationally more expensive.
A relatively new approach in this domain is the kernel version of Fisher's Discriminant (Mika et al., 1999). The main ingredient of this approach is the kernel concept, which was originally applied in Support Vector Machines and allows the efficient computation of Fisher's Discriminant in the kernel space. The linear discriminant in the kernel space corresponds to a powerful nonlinear decision function in the input space. Furthermore, different kernels can be used to accommodate the wide range of nonlinearities possible in the data set. In what follows, we derive a slightly different formulation of the KFD problem based on duality theory which does not require the kernel to be positive semidefinite or, what is equivalent, does not require the kernel to comply with Mercer's condition (Cristianini & Shawe-Taylor, 2000).

3. Automatic heterogeneous kernel selection for the KFD problem

As shown in (Xu & Zhang, 2001) and similar to (Mika et al., 2000), with the exception of an unimportant scale factor, the LFD problem can be reformulated as the following constrained convex optimization problem:

\[
\begin{aligned}
\min_{(u,\gamma)\in\mathbb{R}^{m+1}}\quad & \nu\tfrac{1}{2}\|y\|^{2}+\tfrac{1}{2}\,u'u\\
\text{s.t.}\quad & y=d-(Au-e\gamma),
\end{aligned}\tag{5}
\]

where m = l₊ + l₋ and d is an m-dimensional vector such that

\[
d_i=\begin{cases}+\,m/l_{+} & \text{if } x_i\in A_{+}\\ -\,m/l_{-} & \text{if } x_i\in A_{-}\end{cases}\tag{6}
\]

and the variable ν is a positive constant introduced in (Mika et al., 2000) to address the problem of ill-conditioning of the estimated covariance matrices. This constant can also be interpreted as a capacity control parameter. In order to have strong convexity on all variables of problem (5), we can introduce the extra term γ² in the corresponding objective function. In this case, the regularization term is minimized with respect to both the orientation u and the relative location to the origin γ. Extensive computational experience, as in (Fung & Mangasarian, 2003; Lee & Mangasarian, 2001) and other publications, indicates that in similar problems (Fung & Mangasarian, 2001) this formulation is just as good as the classical formulation, with some added advantages such as strong convexity of the objective function. After adding the new term to the objective function of problem (5), the problem becomes

\[
\begin{aligned}
\min_{(u,\gamma,y)\in\mathbb{R}^{m+1+m}}\quad & \nu\tfrac{1}{2}\|y\|^{2}+\tfrac{1}{2}\,(u'u+\gamma^{2})\\
\text{s.t.}\quad & y=d-(Au-e\gamma).
\end{aligned}\tag{7}
\]

The Lagrangian of (7) is given by

\[
L(u,\gamma,y,v)=\nu\tfrac{1}{2}\|y\|^{2}+\tfrac{1}{2}\left\|\begin{bmatrix}u\\ \gamma\end{bmatrix}\right\|^{2}-v'\big((Au-e\gamma)+y-d\big).\tag{8}
\]

Here v ∈ R^m is the Lagrange multiplier associated with the equality constrained problem (7). Setting the gradient of (8) equal to zero, we obtain the Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions (Mangasarian, 1994, p. 112) for our LFD problem with equality constraints, as given by

\[
\begin{aligned}
u-A'v&=0\\
\gamma+e'v&=0\\
\nu y-v&=0\\
Au-e\gamma+y-d&=0.
\end{aligned}\tag{9}
\]

The first three equations of (9) give the following expressions for the original problem variables (u, γ, y) in terms of the Lagrange multiplier v:

\[
u=A'v,\qquad \gamma=-e'v,\qquad y=\frac{v}{\nu}.\tag{10}
\]

Replacing these equalities in the last equality of (9) allows us to obtain an explicit expression involving v in terms of the problem data A and d, as follows:

\[
AA'v+ee'v+\frac{v}{\nu}-d=\left(HH'+\frac{I}{\nu}\right)v-d=0,\tag{11}
\]

where H is defined as

\[
H=\begin{bmatrix}A & -e\end{bmatrix}.\tag{12}
\]

From the first two equalities of (10) we have that

\[
\begin{bmatrix}u\\ \gamma\end{bmatrix}=H'v.\tag{13}
\]

Using this equality and pre-multiplying (11) by H′, we have

\[
\left(H'H+\frac{I}{\nu}\right)\begin{bmatrix}u\\ \gamma\end{bmatrix}=H'd.\tag{14}
\]

Solving the linear system of equations (14) gives the explicit solution (u, γ) to the LFD problem (7). To obtain our "kernelized" version of the LFD classifier, we modify our equality constrained optimization problem (7) by replacing the primal variable u by its dual equivalent u = A′v from (10) to obtain:

\[
\begin{aligned}
\min_{(v,\gamma,y)\in\mathbb{R}^{m+1+m}}\quad & \nu\tfrac{1}{2}\|y\|^{2}+\tfrac{1}{2}\,(v'v+\gamma^{2})\\
\text{s.t.}\quad & y=d-(AA'v-e\gamma),
\end{aligned}\tag{15}
\]

where the objective function has also been modified to minimize weighted 2-norm sums of the problem variables.
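To make the closed-form solve in (14) concrete, here is a minimal numpy sketch; the synthetic data, the choice ν = 1, and the sign-based decision rule at the end are illustrative assumptions rather than part of the paper's formulation. The kernelized problems that follow admit the same kind of solve, with kernel columns playing the role of the features.

```python
import numpy as np

def lfd_fit(A_plus, A_minus, nu=1.0):
    """Solve (H'H + I/nu) [u; gamma] = H'd, i.e. equation (14), for the linear FD."""
    A = np.vstack([A_plus, A_minus])                 # m x n data matrix
    m, n = A.shape
    l_plus, l_minus = len(A_plus), len(A_minus)
    # Target vector d of equation (6): +m/l+ for points in A+, -m/l- for A-.
    d = np.concatenate([np.full(l_plus, m / l_plus),
                        np.full(l_minus, -m / l_minus)])
    H = np.hstack([A, -np.ones((m, 1))])             # H = [A  -e], equation (12)
    sol = np.linalg.solve(H.T @ H + np.eye(n + 1) / nu, H.T @ d)
    return sol[:-1], sol[-1]                          # u, gamma

# Illustrative usage on synthetic two-class data.
rng = np.random.default_rng(0)
u, gamma = lfd_fit(rng.normal(1.0, 1.0, (40, 5)), rng.normal(-1.0, 1.0, (60, 5)))
x_new = rng.normal(size=(3, 5))
labels = np.sign(x_new @ u - gamma)   # assumed decision rule: sign(x'u - gamma)
```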
If we now replace the linear kernel AA′ by a nonlinear kernel K(A, A′) as defined in the Introduction, we obtain a formulation that is equivalent to the kernel Fisher discriminant described in (Mika et al., 1999):

\[
\begin{aligned}
\min_{(v,\gamma,y)\in\mathbb{R}^{m+1+m}}\quad & \nu\tfrac{1}{2}\|y\|^{2}+\tfrac{1}{2}\,(v'v+\gamma^{2})\\
\text{s.t.}\quad & y=d-(K(A,A')v-e\gamma).
\end{aligned}\tag{16}
\]

Recent SVM formulations with least squares loss (Suykens & Vandewalle, 1999) are much the same in spirit as the problem of minimizing ν½‖y‖² + ½w′w with constraints y = d − (Aw − eγ). Using a similar duality analysis to the one presented before, and then "kernelizing", they obtain the objective function

\[
\nu\tfrac{1}{2}\|y\|^{2}+\tfrac{1}{2}\,v'K(A,A')v.\tag{17}
\]

The regularization term v′K(A, A′)v means that the model complexity is regularized in a reproducing kernel Hilbert space (RKHS) associated with the specific kernel K, where the kernel function K has to satisfy Mercer's conditions and K(A, A′) has to be positive semidefinite.

By comparing the objective function (17) to problem (16), we can see that problem (16) does not regularize in terms of an RKHS. Instead, the columns of a kernel matrix are simply regarded as new features K(A, A′) of the classification task, in addition to the original features A. We can then construct classifiers based on the features introduced by a kernel in the same way as we build models using the original features in A. More precisely, in a more general framework (regularized networks (Evgeniou et al., 2000a)), our method could produce linear classifiers (with respect to the new kernel features K(A, A′)) which minimize the cost function regularized in the span space formed by these [...]

\[
K(A,A')=\sum_{j=1}^{k}a_j\,K_j(A,A'),\tag{18}
\]

where a_j ≥ 0. As is pointed out in (Lanckriet et al., 2003), the set {K₁(A, A′), …, K_k(A, A′)} can be seen as a predefined set of initial "guesses" of the kernel matrix. Note that the set {K₁(A, A′), …, K_k(A, A′)} could contain very different kernel matrix models, e.g., linear, Gaussian, polynomial, all with different parameter values. Instead of fine tuning the kernel parameters for a predetermined kernel via cross-validation, we can optimize the set of values a_i ≥ 0 in order to obtain a PSD linear combination K(A, A′) = Σ_{j=1}^{k} a_j K_j(A, A′) suitable for the specific classification problem.
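As one possible illustration, the sketch below assembles such a small heterogeneous family on the training data; the specific members (a linear kernel, two Gaussian widths, and a quadratic polynomial kernel) and the uniform initial weights are assumptions made for the example only.

```python
import numpy as np

def kernel_family(A, gaussian_widths=(0.1, 1.0), poly_degree=2):
    """Return a list [K_1, ..., K_k] of m x m candidate kernel matrices on A (m x n)."""
    lin = A @ A.T
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2.0 * lin
    kernels = [lin]                                                   # linear kernel AA'
    kernels += [np.exp(-mu * sq_dist) for mu in gaussian_widths]      # Gaussian kernels
    kernels.append((lin + 1.0) ** poly_degree)                        # polynomial kernel
    return kernels

A = np.random.default_rng(1).normal(size=(50, 4))
Ks = kernel_family(A)                                  # here k = 4
a = np.full(len(Ks), 1.0 / len(Ks))                    # uniform initial weights
K_combined = sum(w * K for w, K in zip(a, Ks))         # K(A, A') = sum_j a_j K_j
```

Because every member here is positive semidefinite and the weights are nonnegative, the combination remains PSD, although, as noted above, the formulation itself does not require this.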
Replacing equation (18) in equation (16), solving for y, and substituting it back into the objective function of (16), we can reformulate the KFD optimization problem for heterogeneous linear combinations of kernels as follows:

\[
\min_{(v,\gamma,a\geq 0)\in\mathbb{R}^{m+1+k}}\;\; \nu\tfrac{1}{2}\Big\|d-\Big(\big(\textstyle\sum_{j=1}^{k}a_jK_j\big)v-e\gamma\Big)\Big\|^{2}+\tfrac{1}{2}\,v'v,\tag{19}
\]

where K_j = K_j(A, A′). When considering linear combinations of kernels, the hypothesis space may become larger, making the issue of capacity control an important one. It is known that if two classifiers have similar training error, a smaller capacity may lead to better generalization on future unseen data (Vapnik, 2000; Cherkassky & Mulier, 1998). In order to reduce the size of the hypothesis and model space and to gain strong convexity in all variables, an additional regularization term a′a = ‖a‖² is added to the objective function of problem (19). The problem then becomes

\[
\min_{(v,\gamma,a\geq 0)\in\mathbb{R}^{m+1+k}}\;\; \nu\tfrac{1}{2}\Big\|d-\Big(\big(\textstyle\sum_{i=1}^{k}a_iK_i\big)v-e\gamma\Big)\Big\|^{2}+\tfrac{1}{2}\,(v'v+a'a).\tag{20}
\]
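Problem (20) naturally suggests an alternating scheme: for fixed a it is a regularized least-squares problem in (v, γ), and for fixed (v, γ) it is a small nonnegative least-squares problem in a. The sketch below illustrates that idea; it is not the authors' exact algorithm, and the stopping rule, the use of scipy's NNLS solver for the a-step, and the helper names are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def heterogeneous_kfd(Ks, d, nu=1.0, n_iter=20):
    """Alternating minimization sketch for problem (20).

    Ks : list of k candidate kernel matrices K_j(A, A'), each m x m.
    d  : m-vector defined in equation (6).
    """
    m, k = Ks[0].shape[0], len(Ks)
    e = np.ones(m)
    a = np.full(k, 1.0 / k)                    # start from a uniform mixture
    v, gamma = np.zeros(m), 0.0
    P = np.eye(m + 1)
    P[-1, -1] = 0.0                            # penalize v'v but not gamma, as in (20)
    for _ in range(n_iter):
        # (v, gamma)-step: with a fixed, minimize nu/2 ||d - (K v - e gamma)||^2 + 1/2 v'v.
        # Its normal equations are (nu G'G + P) [v; gamma] = nu G'd with G = [K  -e].
        K = sum(w * Kj for w, Kj in zip(a, Ks))
        G = np.hstack([K, -e[:, None]])
        sol = np.linalg.solve(nu * G.T @ G + P, nu * G.T @ d)
        v, gamma = sol[:-1], sol[-1]
        # a-step: with (v, gamma) fixed, minimize nu/2 ||M a - r||^2 + 1/2 ||a||^2, a >= 0,
        # where M = [K_1 v, ..., K_k v] and r = d + e gamma: a k-variable NNLS problem.
        M = np.column_stack([Kj @ v for Kj in Ks])
        r = d + gamma * e
        a, _ = nnls(np.vstack([np.sqrt(nu) * M, np.eye(k)]),
                    np.concatenate([np.sqrt(nu) * r, np.zeros(k)]))
    return v, gamma, a
```

Each pass costs one (m+1)-dimensional linear solve plus a k-variable constrained least-squares problem, which matches the cost profile described in the conclusions (k = 5 in the paper's experiments).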
[...]

3. On the training set TR we used a ten-fold cross-validation tuning procedure to select "optimal" values for the parameter ν in A-KFD and for the parameters ν and µ in the standard KFD. By "optimal" values, we mean the parameter values that maximize the ten-fold cross-validation testing correctness. A linear kernel was also considered as a kernel choice in the standard KFD. (A minimal sketch of this tuning loop is given after this list.)

4. Using the "optimal" values found in step 3, we build a final classification surface (21), and then evaluate the performance on the testing set TE.

Steps 1 to 4 are repeated 10 times and the average testing set correctness is reported in Table 1. The average [...]
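The following is a compact sketch of the kind of tuning loop described in step 3, selecting ν by ten-fold cross-validated correctness on a fixed kernel matrix; the synthetic data, the single Gaussian kernel, the ν grid, and the sign-based scoring rule are all illustrative assumptions.

```python
import numpy as np

def cv_correctness(K, d, labels, nu, n_folds=10, seed=0):
    """Ten-fold cross-validated correctness of the kernelized least-squares
    discriminant (16), fit on each training fold for a given value of nu."""
    m = len(d)
    folds = np.array_split(np.random.default_rng(seed).permutation(m), n_folds)
    correct = 0
    for test in folds:
        train = np.setdiff1d(np.arange(m), test)
        G = np.hstack([K[np.ix_(train, train)], -np.ones((len(train), 1))])
        sol = np.linalg.solve(G.T @ G + np.eye(len(train) + 1) / nu, G.T @ d[train])
        v, gamma = sol[:-1], sol[-1]
        scores = K[np.ix_(test, train)] @ v - gamma      # assumed decision value
        correct += np.sum(np.sign(scores) == labels[test])
    return correct / m

rng = np.random.default_rng(0)
A = np.vstack([rng.normal(1, 1, (30, 4)), rng.normal(-1, 1, (30, 4))])
labels = np.concatenate([np.ones(30), -np.ones(30)])
d = np.where(labels > 0, 2.0, -2.0)        # eq. (6) on the full set, kept fixed across folds
sq = (A**2).sum(1)[:, None] + (A**2).sum(1)[None, :] - 2 * A @ A.T
K = np.exp(-0.5 * sq)                      # one Gaussian kernel, width 0.5
best_nu = max((10.0 ** p for p in range(-3, 4)),
              key=lambda nu: cv_correctness(K, d, labels, nu))
```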
Table 2. Average times in seconds for both methods: A-KFD, and standard KFD where the kernel width was obtained by tuning. Times are the averages over ten runs. Kernel calculation time and ν tuning time are included in both algorithms (best in bold).

Data set (m × n)         A-KFD (secs.)   KFD + kernel tuning (secs.)
Ionosphere (351 × 34)         55.3             350.0
Housing (506 × 13)           134.4             336.9
Heart (297 × 13)              39.7             109.2
Pima (768 × 8)               341.5             598.4
Bupa (345 × 6)                48.2              81.7

5.2. Numerical experience on the Colon CAD dataset

In this section of the paper we performed experiments on the colon CAD dataset. The classification task associated with this dataset is related to colorectal cancer diagnosis. Colorectal cancer is the third most common cancer in both men and women. Recent studies (Yee et al., 2003) have estimated that in 2003, nearly 150,000 cases of colon and rectal cancer would be diagnosed in the US, and more than 57,000 people would die from the disease, accounting for about 10% of all cancer deaths. A polyp is a small tumor that projects from the inner walls of the intestine or rectum. Early detection of polyps in the colon is critical, because polyps can turn into cancerous tumors if they are not detected in the polyp stage.

The database of high-resolution CT images used in this study was obtained at NYU Medical Center. One hundred and five (105) patients were selected so as to include positive cases (n=61) as well as negative cases (n=44). The images are preprocessed in order to calculate features based on moments of tissue intensity, volumetric and surface shape, and texture characteristics. The final dataset used in this paper is a balanced subset of the original dataset consisting of 300 candidates; 145 candidates are labeled as polyps and 155 as non-polyps. Each candidate is represented by a vector of 14 features that have the most discriminating power according to a feature selection pre-processing stage. The non-polyp points were chosen from candidates that were consistently misclassified by an existing classifier that was trained to have a very low number of false positives on the entire dataset. This means that in the given 14-dimensional feature space, the colon CAD dataset is extremely difficult to separate.
For the tests, we used the same methodology described in subsection 5.1, obtaining very similar results. The standard KFD ran in an average time of 122.0 seconds over ten runs, with an average test set correctness of 73.4%. The A-KFD ran in an average time of 41.21 seconds, with an average test set correctness of 72.4%. As in Section 5.1, we performed a paired t-test (Mitchell, 1997) at the 95% confidence level; the p-value of 0.32 > 0.05 indicates that there is no significant difference between the two methods on this dataset at the 95% confidence level.
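The significance check described above can be reproduced with a standard paired t-test over the per-run test correctness values; the sketch below uses scipy with made-up accuracy vectors purely to illustrate the mechanics.

```python
import numpy as np
from scipy import stats

# Per-run test-set correctness for each method over the ten runs (illustrative numbers only).
acc_kfd  = np.array([0.74, 0.72, 0.75, 0.73, 0.71, 0.74, 0.76, 0.72, 0.73, 0.74])
acc_akfd = np.array([0.73, 0.71, 0.74, 0.72, 0.70, 0.73, 0.74, 0.71, 0.72, 0.74])

t_stat, p_value = stats.ttest_rel(acc_kfd, acc_akfd)   # paired t-test across the runs
print(f"p = {p_value:.3f}")   # p > 0.05 -> no significant difference at the 95% level
```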
In summary, A-KFD had the same generalization capabilities and ran almost 3 times faster than the standard KFD.

6. Conclusions and Outlook

We have proposed a simple procedure for generating heterogeneous Kernel Fisher Discriminant classifiers, where the kernel model is defined to be a linear combination of members of a potentially large pre-defined family of heterogeneous kernels. Using this approach, the task of finding an "appropriate" kernel that satisfactorily suits the classification task can be incorporated into the optimization problem to be solved. In contrast with previous works that also consider linear combinations of kernels, our proposed algorithm requires nothing more sophisticated than solving a simple nonsingular system of linear equations of the size of the number of training points m, and solving a quadratic programming problem that is usually very small since its size depends on the predefined number of kernels in the kernel family (5 in our experiments). The practical complexity of the A-KFD algorithm does not explicitly depend on the number of kernels in the predefined kernel family.

Empirical results show that the proposed method, compared to the standard KFD where the kernel is selected by a cross-validation tuning procedure, is several times faster with no significant impact on generalization performance.

The convergence of the A-KFD algorithm is justified as a special case of the alternating optimization algorithm described in (Bezdek & Hathaway, 2003). Future work includes a more general version of the proposed algorithm in the context of regularized networks, where the convergence results are presented in a more detailed manner. Future work also includes the use of sparse kernel techniques and random projections to further improve computational efficiency. We also plan to explore the use of weak kernels (kernels that depend on a subset of the original input features) for feature selection in this framework.
References

(2004). CPLEX optimizer. ILOG CPLEX Division, 889 Alder Avenue, Incline Village, Nevada. http://www.cplex.com/.

Bach, F., Lanckriet, G., & Jordan, M. (2004). Fast kernel learning using sequential minimal optimization (Technical Report CSD-04-1307). Division of Computer Science, University of California, Berkeley.

Bennet, K., Momma, M., & Embrechts, M. (2002). Mark: A boosting algorithm for heterogeneous kernel models. Proceedings KDD-2002: Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, CA (pp. 24–31). Association for Computing Machinery.

Bezdek, J., & Hathaway, R. (2002). Some notes on alternating optimization. Proceedings of the 2002 AFSS International Conference on Fuzzy Systems (pp. 288–300). Springer-Verlag.

Bezdek, J., & Hathaway, R. (2003). Convergence of alternating optimization. Neural, Parallel Sci. Comput., 11, 351–368.

Cherkassky, V., & Mulier, F. (1998). Learning from data – concepts, theory and methods. New York: John Wiley & Sons.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.

Evgeniou, T., Pontil, M., & Poggio, T. (2000a). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1–50.

Evgeniou, T., Pontil, M., & Poggio, T. (2000b). Regularization networks and support vector machines. Advances in Large Margin Classifiers (pp. 171–203). Cambridge, MA: MIT Press.

Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego, CA: Academic Press.

Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA (pp. 77–86). New York: Association for Computing Machinery. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

Fung, G., & Mangasarian, O. L. (2003). Finite Newton method for Lagrangian support vector machine classification. Special Issue on Support Vector Machines. Neurocomputing, 55, 39–55.

Hamers, B., Suykens, J., Leemans, V., & Moor, B. D. (2003). Ensemble learning of coupled parameterised kernel models. International Conference on Neural Information Processing (pp. 130–133).

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2003). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Lee, Y.-J., & Mangasarian, O. L. (2001). SSVM: A smooth support vector machine. Computational Optimization and Applications, 20, 5–22. Data Mining Institute, University of Wisconsin, Technical Report 99-03. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.

Mangasarian, O. L. (1994). Nonlinear programming. Philadelphia, PA: SIAM.

Mangasarian, O. L. (2000). Generalized support vector machines. Advances in Large Margin Classifiers (pp. 135–146). Cambridge, MA: MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.

Mika, S., Rätsch, G., & Müller, K.-R. (2000). A mathematical programming approach to the kernel Fisher algorithm. NIPS (pp. 591–597).

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX (pp. 41–48). IEEE.

Mitchell, T. M. (1997). Machine learning. Boston: McGraw-Hill.

Murphy, P. M., & Aha, D. W. (1992). UCI machine learning repository. www.ics.uci.edu/~mlearn/MLRepository.html.

Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.

Vapnik, V. N. (2000). The nature of statistical learning theory. New York: Springer. Second edition.

Xu, J., & Zhang, X. (2001). Kernel MSE algorithm: A unified framework for KFD, LS-SVM and KRR. Proceedings of the International Joint Conference on Neural Networks 2001. IEEE.

Yee, J., Geetanjali, A., Hung, R., Steinauer-Gebauer, A., Wall, A., & McQuaid, K. (2003). Computer tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults. Proceedings of NEJM-2003 (pp. 2191–2200).