
A Fast Iterative Algorithm for Fisher Discriminant using Heterogeneous Kernels

Glenn Fung        glenn.fung@siemens.com
Murat Dundar      murat.dundar@siemens.com
Jinbo Bi          jinbo.bi@siemens.com
Bharat Rao        bharat.rao@siemens.com
Computer Aided Diagnosis & Therapy Solutions, Siemens Medical Solutions, 51 Valley Stream Parkway, Malvern, PA 19355

Keywords: Linear Fisher Discriminant, Heterogeneous Kernels, Mathematical Programming, Binary Classification.

Abstract

We propose a fast iterative classification algorithm for Kernel Fisher Discriminant (KFD) using heterogeneous kernel models. In contrast with the standard KFD that requires the user to predefine a kernel function, we incorporate the task of choosing an appropriate kernel into the optimization problem to be solved. The choice of kernel is defined as a linear combination of kernels belonging to a potentially large family of different positive semidefinite kernels. The complexity of our algorithm does not increase significantly with respect to the number of kernels in the kernel family. Experiments on several benchmark datasets demonstrate that the generalization performance of the proposed algorithm is not significantly different from that achieved by the standard KFD in which the kernel parameters have been tuned using cross validation. We also present results on a real-life colon cancer dataset that demonstrate the efficiency of the proposed method.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the first author.

1. Introduction

In recent years, kernel based methods have proved to be an excellent choice for solving classification problems. It is well known that the use of an appropriate nonlinear kernel mapping is a critical issue when nonlinear hyperplane-based methods such as the Kernel Fisher Discriminant (KFD) are used for classification.

Typically, kernels are chosen by predefining a kernel model (Gaussian, polynomial, etc.) and adjusting the kernel parameters by means of a tuning procedure. The selection is based on the classification performance on a subset of the training data that is commonly referred to as the validation set. This kernel selection procedure can be computationally very expensive and is particularly prohibitive when the dataset is large; furthermore, there is no guarantee that the predefined kernel model is an optimal choice for the classification problem. In recent years, several authors (Hamers et al., 2003; Lanckriet et al., 2003; Bennet et al., 2002; Bach et al., 2004) have proposed the use of a linear combination of kernels formed by a family of different kernel functions and parameters; this transforms the problem of choosing a kernel model into one of finding an "optimal" linear combination of the members of the kernel family. Using this approach there is no need to predefine a kernel; instead, a final kernel is constructed according to the specific classification problem to be solved, without sacrificing capacity control. By combining kernels we make the hypothesis space larger (potentially, but not always), but with appropriate regularization we can improve prediction accuracy, which is the ultimate goal of classification.

The drawback of using a linear combination of kernels is that it leads to considerably more complex optimization problems. We propose a fast iterative algorithm that transforms the resulting optimization problem into a sequence of relatively inexpensive strongly convex optimization problems.

At each iteration, our algorithm only requires the solution of a simple system of linear equations and of a relatively small quadratic programming problem with nonnegativity constraints, which makes the proposed algorithm easy to implement. In contrast with some of the previous work, the complexity of our algorithm does not depend directly on the number of kernels in the kernel family.
We now outline the contents of the paper. In Section 2, we formulate the linear classification problem as a Linear Fisher Discriminant (LFD) problem. In Section 3, using the result of (Mika et al., 2000), we show how the classical Fisher discriminant problem can be reformulated as a convex quadratic optimization problem. Using this equivalent mathematical programming LFD formulation and mathematical programming duality theory, we derive a kernel Fisher discriminant formulation similar to the one proposed in (Mika et al., 1999). We then introduce a new formulation that incorporates both the KFD problem and the problem of finding an appropriate linear combination of kernels into a single quadratic optimization problem with nonnegativity constraints on one set of the variables. In Section 4, we propose an algorithm for solving this optimization problem and discuss its complexity and convergence. In Section 5, we give computational results, including those for a real-life colorectal cancer dataset as well as five other publicly available datasets.

First, we briefly describe our notation. All vectors are column vectors unless transposed to a row vector by a prime superscript '. The scalar (inner) product of two vectors x and y in the n-dimensional real space R^n is denoted by x'y, and the 2-norm of x is denoted by ||x||. The 1-norm and infinity-norm are denoted by ||.||_1 and ||.||_inf respectively. For a matrix A in R^{m x n}, A_i is the ith row of A, which is a row vector in R^n. A column vector of ones of arbitrary dimension is denoted by e, and the identity matrix of arbitrary order is denoted by I. For A in R^{m x n} and B in R^{n x l}, the kernel K(A, B) (Vapnik, 2000; Cherkassky & Mulier, 1998; Mangasarian, 2000) is an arbitrary function that maps R^{m x n} x R^{n x l} into R^{m x l}. In particular, if x and y are column vectors in R^n, then K(x', y) is a real number, K(x', A') is a row vector in R^m and K(A, A') is an m x m matrix.

2. Linear Fisher's Discriminant (LFD)

We know that the probability of error due to the Bayes classifier is the best we can achieve. A major disadvantage of the Bayes error as a criterion is that a closed-form analytical expression is not available for the general case. However, by assuming that the classes are normally distributed, standard classifiers using quadratic and linear discriminant functions can be designed. The well-known Fisher's Linear Discriminant (LFD) (Fukunaga, 1990) arises in the special case when the classes have a common covariance matrix. LFD is a classification method that projects the high-dimensional data onto a line (for a binary classification problem) and performs classification in this one-dimensional space. The projection is chosen such that either the ratio of the scatter matrices (between and within classes) or the so-called Rayleigh quotient is maximized.

More specifically, let A \in R^{m \times n} be a matrix containing all the samples x_i \in R^n, and let A_c \subseteq A, A_c \in R^{l_c \times n}, be the matrix containing the l_c labeled samples of class c, c \in \{\pm\}. Then, the LFD is the projection u that maximizes

    J(u) = \frac{u^T S_B u}{u^T S_W u}                                                   (1)

where

    S_B = (M_+ - M_-)(M_+ - M_-)^T                                                       (2)

    S_W = \sum_{c \in \{\pm\}} \frac{1}{l_c} (A_c - e_{l_c} M_c^T)^T (A_c - e_{l_c} M_c^T)    (3)

are the between-class and within-class scatter matrices respectively, and

    M_c = \frac{1}{l_c} A_c^T e_{l_c}                                                    (4)

is the mean of class c, with e_{l_c} an l_c-dimensional vector of ones. Traditionally, the LFD problem has been addressed by solving the generalized eigenvalue problem associated with Equation (1).
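As a concrete illustration of Equations (1)-(4), the following sketch (ours, not code from the paper) computes the scatter matrices and the LFD direction with NumPy. For the two-class case, maximizing the Rayleigh quotient (1) is equivalent to taking u proportional to S_W^{-1}(M_+ - M_-), which avoids an explicit generalized eigenvalue solver; the function name and the small ridge term guarding against an ill-conditioned S_W are illustrative choices, not notation from the paper.

```python
import numpy as np

def linear_fisher_discriminant(A_plus, A_minus, ridge=1e-8):
    """Compute the LFD direction u of Equations (1)-(4).

    A_plus, A_minus : (l_c x n) arrays holding the samples of each class.
    For two classes, the maximizer of the Rayleigh quotient (1) is
    proportional to S_W^{-1} (M_+ - M_-).
    """
    n = A_plus.shape[1]
    M_plus = A_plus.mean(axis=0)           # class means, Equation (4)
    M_minus = A_minus.mean(axis=0)

    # Within-class scatter, Equation (3)
    S_W = np.zeros((n, n))
    for A_c, M_c in ((A_plus, M_plus), (A_minus, M_minus)):
        D = A_c - M_c                      # rows are A_c - e_{l_c} M_c^T
        S_W += D.T @ D / A_c.shape[0]

    # Between-class scatter, Equation (2) (not needed for the solution,
    # shown only for completeness)
    diff = M_plus - M_minus
    S_B = np.outer(diff, diff)

    u = np.linalg.solve(S_W + ridge * np.eye(n), diff)
    return u, S_B, S_W

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
A_plus = rng.normal(loc=+1.0, size=(40, 3))
A_minus = rng.normal(loc=-1.0, size=(60, 3))
u, S_B, S_W = linear_fisher_discriminant(A_plus, A_minus)
print("LFD direction:", u)
```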
When the classes are normally distributed with equal covariance, u is in the same direction as the discriminant in the corresponding Bayes classifier. Hence, for this special case LFD is equivalent to the Bayes optimal classifier. Although LFD relies heavily on assumptions that do not hold in most real-world problems, it has proven to be very powerful. Generally speaking, when the distributions are unimodal and separated by the scatter of the means, LFD becomes very appealing. One reason why LFD may be preferred over more complex classifiers is that, as a linear classifier, it is less prone to overfitting.

For most real-world data, however, a linear discriminant is clearly not complex enough. Classical techniques tackle this problem by using more sophisticated distributions to model the optimal Bayes classifier; these, however, often sacrifice the closed-form solution and are computationally more expensive. A relatively new approach in this domain is the kernel version of Fisher's Discriminant (Mika et al., 1999).
The main ingredient of this approach is the kernel concept, which was originally applied in Support Vector Machines and allows the efficient computation of Fisher's Discriminant in the kernel space. The linear discriminant in the kernel space corresponds to a powerful nonlinear decision function in the input space. Furthermore, different kernels can be used to accommodate the wide range of nonlinearities possible in the data set. In what follows, we derive a slightly different formulation of the KFD problem based on duality theory which does not require the kernel to be positive semidefinite or, equivalently, does not require the kernel to comply with Mercer's condition (Cristianini & Shawe-Taylor, 2000).

3. Automatic heterogeneous kernel selection for the KFD problem

As shown in (Xu & Zhang, 2001) and similar to (Mika et al., 2000), up to an unimportant scale factor, the LFD problem can be reformulated as the following constrained convex optimization problem:

    \min_{(u,\gamma,y) \in R^{n+1+m}} \; \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2} u'u
    \quad \text{s.t.} \quad y = d - (Au - e\gamma)                                        (5)

where m = l_+ + l_- and d is an m-dimensional vector such that

    d_i = \begin{cases} +m/l_+ & \text{if } x_i \in A_+ \\ -m/l_- & \text{if } x_i \in A_- \end{cases}    (6)

and the variable \nu is a positive constant introduced in (Mika et al., 2000) to address the ill-conditioning of the estimated covariance matrices. This constant can also be interpreted as a capacity control parameter. In order to have strong convexity in all the variables of problem (5), we can introduce an extra term \gamma^2 in the corresponding objective function. In this case, the regularization term is minimized with respect to both the orientation u and the relative location to the origin \gamma. Extensive computational experience, as in (Fung & Mangasarian, 2003; Lee & Mangasarian, 2001) and other publications, indicates that in similar problems (Fung & Mangasarian, 2001) this formulation is just as good as the classical formulation, with some added advantages such as strong convexity of the objective function. After adding the new term to the objective function of problem (5), the problem becomes

    \min_{(u,\gamma,y) \in R^{n+1+m}} \; \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2}(u'u + \gamma^2)
    \quad \text{s.t.} \quad y = d - (Au - e\gamma)                                        (7)

The Lagrangian of (7) is given by

    L(u,\gamma,y,v) = \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2}\left\|\begin{bmatrix} u \\ \gamma \end{bmatrix}\right\|^2 - v'\big((Au - e\gamma) + y - d\big)    (8)

Here v \in R^m is the Lagrange multiplier associated with the equality constraint of problem (7). Setting the gradient of (8) equal to zero, we obtain the Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions (Mangasarian, 1994, p. 112) for our LFD problem with equality constraints:

    u - A'v = 0
    \gamma + e'v = 0
    \nu y - v = 0
    Au - e\gamma + y - d = 0                                                              (9)

The first three equations of (9) give the following expressions for the original problem variables (u, \gamma, y) in terms of the Lagrange multiplier v:

    u = A'v, \quad \gamma = -e'v, \quad y = \frac{v}{\nu}.                                (10)

Replacing these equalities in the last equation of (9) allows us to obtain an explicit expression involving v in terms of the problem data A and d, as follows:

    AA'v + ee'v + \frac{v}{\nu} - d = \left(HH' + \frac{I}{\nu}\right)v - d = 0           (11)

where H is defined as:

    H = [A \;\; -e].                                                                      (12)

From the first two equalities of (10) we have that

    \begin{bmatrix} u \\ \gamma \end{bmatrix} = H'v                                        (13)

Using this equality and pre-multiplying (11) by H', we have

    \left(H'H + \frac{I}{\nu}\right)\begin{bmatrix} u \\ \gamma \end{bmatrix} = H'd        (14)

Solving the linear system of equations (14) gives the explicit solution (u, \gamma) to the LFD problem (7).
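Equation (14) is an (n+1) x (n+1) linear system in the variables (u, \gamma), so the regularized LFD classifier of problem (7) can be obtained with a single linear solve. The sketch below is our illustration (not code from the paper): it forms H = [A -e] and d as in (6) and then solves (14) with NumPy; function and variable names are assumed for the example.

```python
import numpy as np

def lfd_fit(A, labels, nu=1.0):
    """Solve the regularized LFD problem (7) via the linear system (14).

    A      : (m x n) data matrix.
    labels : length-m array of +1/-1 class labels.
    nu     : the capacity-control parameter of problem (7).
    Returns (u, gamma) defining the linear classifier sign(x'u - gamma).
    """
    m, n = A.shape
    l_plus = np.sum(labels == 1)
    l_minus = np.sum(labels == -1)

    # Target vector d, Equation (6)
    d = np.where(labels == 1, m / l_plus, -m / l_minus).astype(float)

    # H = [A  -e], Equation (12)
    H = np.hstack([A, -np.ones((m, 1))])

    # (H'H + I/nu) [u; gamma] = H'd, Equation (14)
    lhs = H.T @ H + np.eye(n + 1) / nu
    sol = np.linalg.solve(lhs, H.T @ d)
    return sol[:n], sol[n]

# Usage with synthetic data
rng = np.random.default_rng(1)
A = np.vstack([rng.normal(+1, 1, (40, 3)), rng.normal(-1, 1, (60, 3))])
labels = np.array([1] * 40 + [-1] * 60)
u, gamma = lfd_fit(A, labels, nu=10.0)
pred = np.sign(A @ u - gamma)
print("training accuracy:", np.mean(pred == labels))
```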
To obtain our "kernelized" version of the LFD classifier, we modify our equality constrained optimization problem (7) by replacing the primal variable u by its dual equivalent u = A'v from (10) to obtain:

    \min_{(v,\gamma,y) \in R^{m+1+m}} \; \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2}(v'v + \gamma^2)
    \quad \text{s.t.} \quad y = d - (AA'v - e\gamma)                                      (15)

where the objective function has also been modified to minimize weighted 2-norm sums of the problem variables. If we now replace the linear kernel AA' by a nonlinear kernel K(A, A') as defined in the Introduction, we obtain a formulation that is equivalent to the kernel Fisher discriminant described in (Mika et al., 1999):

    \min_{(v,\gamma,y) \in R^{m+1+m}} \; \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2}(v'v + \gamma^2)
    \quad \text{s.t.} \quad y = d - (K(A, A')v - e\gamma)                                 (16)
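Since problem (16) has exactly the same structure as (7) with A replaced by the m x m kernel matrix K(A, A'), it can be solved by the same linear-system approach, now of size (m+1) x (m+1). The sketch below is our adaptation of the previous routine, using an assumed Gaussian kernel of width mu (as in the experiments of Section 5); the helper names and the row-sample convention for the kernel arguments are our choices, not the paper's.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """(G_mu)_{ij} = exp(-mu * ||A_i - B_j||^2) for row-sample matrices A, B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))

def kfd_fit(K, d, nu=1.0):
    """Solve problem (16) for a fixed m x m kernel matrix K.

    Returns (v, gamma) defining the classifier sign(K(x', A') v - gamma).
    """
    m = K.shape[0]
    G = np.hstack([K, -np.ones((m, 1))])      # analogue of H = [A -e]
    lhs = G.T @ G + np.eye(m + 1) / nu        # (m+1) x (m+1) system
    sol = np.linalg.solve(lhs, G.T @ d)
    return sol[:m], sol[m]

# Usage: Gaussian-kernel Fisher discriminant on synthetic data
rng = np.random.default_rng(2)
A = np.vstack([rng.normal(+1, 1, (40, 2)), rng.normal(-1, 1, (60, 2))])
labels = np.array([1] * 40 + [-1] * 60)
m = A.shape[0]
d = np.where(labels == 1, m / 40, -m / 60).astype(float)
K = gaussian_kernel(A, A, mu=0.1)
v, gamma = kfd_fit(K, d, nu=10.0)
pred = np.sign(K @ v - gamma)                 # out-of-sample: use K(X_test, A)
print("training accuracy:", np.mean(pred == labels))
```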
Recent SVM formulations with a least squares loss (Suykens & Vandewalle, 1999) are much the same in spirit as the problem of minimizing \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2}w'w with constraints y = d - (Aw - e\gamma). Using a duality analysis similar to the one presented above, and then "kernelizing", they obtain the objective function

    \nu\tfrac{1}{2}\|y\|^2 + \tfrac{1}{2} v'K(A, A')v.                                    (17)

The regularization term v'K(A, A')v means that the model complexity is regularized in a reproducing kernel Hilbert space (RKHS) associated with the specific kernel K, where the kernel function K has to satisfy Mercer's conditions and K(A, A') has to be positive semidefinite.

By comparing the objective function (17) to problem (16), we can see that problem (16) does not regularize in terms of an RKHS. Instead, the columns of the kernel matrix K(A, A') are simply regarded as new features of the classification task, in addition to the original features A. We can then construct classifiers based on the features introduced by a kernel in the same way as we build models using the original features in A. More precisely, in the more general framework of regularization networks (Evgeniou et al., 2000a), our method produces linear classifiers (with respect to the new kernel features K(A, A')) which minimize a cost function regularized in the span space formed by these kernel features. Thus the requirement for the kernel to be positive semidefinite could be relaxed, at the cost, in some cases, of an intuitive geometrical interpretation. In this paper, however, since we are considering a Kernel Fisher Discriminant formulation, we will require the kernel matrix to be positive semidefinite. This requirement allows us to preserve the geometrical interpretation of the KFD formulation, since the kernel matrix can be seen as a "covariance" matrix in the higher dimensional space induced implicitly by the kernel mapping.

Next, let us suppose that instead of the kernel K being defined by a single kernel mapping (e.g., Gaussian, polynomial, etc.), the kernel K is composed of a linear combination of kernel functions K_j, j = 1, ..., k, as below:

    K(A, A') = \sum_{j=1}^{k} a_j K_j(A, A'),                                             (18)

where a_j \geq 0. As pointed out in (Lanckriet et al., 2003), the set \{K_1(A, A'), ..., K_k(A, A')\} can be seen as a predefined set of initial "guesses" of the kernel matrix. Note that this set could contain very different kernel matrix models, e.g., linear, Gaussian and polynomial kernels, all with different parameter values. Instead of fine-tuning the kernel parameters for a predetermined kernel via cross-validation, we can optimize the set of values a_j \geq 0 in order to obtain a positive semidefinite linear combination K(A, A') = \sum_{j=1}^{k} a_j K_j(A, A') suitable for the specific classification problem. Replacing equation (18) in equation (16), solving for y in the constraint and substituting it into the objective function of (16), we can reformulate the KFD optimization problem for heterogeneous linear combinations of kernels as follows:

    \min_{(v,\gamma) \in R^{m+1}, \, a \geq 0} \; \nu\tfrac{1}{2}\Big\| d - \Big(\big(\sum_{j=1}^{k} a_j K_j\big)v - e\gamma\Big) \Big\|^2 + \tfrac{1}{2} v'v    (19)

where K_j = K_j(A, A'). When considering linear combinations of kernels the hypothesis space may become larger, making the issue of capacity control an important one. It is known that if two classifiers have similar training error, a smaller capacity may lead to better generalization on future unseen data (Vapnik, 2000; Cherkassky & Mulier, 1998). In order to reduce the size of the hypothesis and model space and to gain strong convexity in all variables, an additional regularization term a'a = \|a\|^2 is added to the objective function of problem (19). The problem then becomes

    \min_{(v,\gamma) \in R^{m+1}, \, a \geq 0} \; \nu\tfrac{1}{2}\Big\| d - \Big(\big(\sum_{j=1}^{k} a_j K_j\big)v - e\gamma\Big) \Big\|^2 + \tfrac{1}{2}(v'v + \gamma^2 + a'a)    (20)

The nonlinear classifier corresponding to this nonlinear separating surface is then:

    \Big(\sum_{j=1}^{k} a_j K_j(x', A')\Big)v - \gamma \;\;
    \begin{cases} > 0, & \text{then } x \in A_+, \\ < 0, & \text{then } x \in A_-, \\ = 0, & \text{then } x \in A_+ \cup A_-. \end{cases}    (21)

Furthermore, problem (20) can be seen as a biconvex program of the form

    \min_{(S,T) \in (R^{m+1}, R^k)} F(S, T)                                               (22)

where S = (v; \gamma) and T = a.
When T = \hat{a} is fixed, problem (22) becomes:

    \min_{S \in R^{m+1}} F(S, \hat{a}) =
    \min_{(v,\gamma) \in R^{m+1}} \; \nu\tfrac{1}{2}\| d - (\hat{K}v - e\gamma) \|^2 + \tfrac{1}{2}(v'v + \gamma^2)    (23)

where \hat{K} = \sum_{j=1}^{k} \hat{a}_j K_j. This is equivalent to solving (16) with K = \hat{K}. On the other hand, when \hat{S} = (\hat{v}; \hat{\gamma}) is fixed, problem (22) becomes:

    \min_{T \geq 0 \in R^k} F(\hat{S}, T) = \min_{a \geq 0 \in R^k} \; \nu\tfrac{1}{2}\Big\| d - \Big(\sum_{j=1}^{k} \Lambda_j a_j - e\hat{\gamma}\Big) \Big\|^2 + \tfrac{1}{2} a'a    (24)

where \Lambda_j = K_j \hat{v}. Subproblem (23) is an unconstrained strongly convex problem for which a unique solution in closed form can be obtained by solving an (m+1) x (m+1) system of linear equations. On the other hand, subproblem (24) is also a strongly convex problem, with the simple nonnegativity constraint a \geq 0 on k variables (k is usually very small), for which a unique solution can be obtained by solving a very simple quadratic programming problem. We are now ready to describe our proposed algorithm.

4. Automatic kernel selection KFD Algorithm

Algorithm 4.1  Automatic kernel selection KFD Algorithm (A-KFD)
Given m data points in R^n represented by the m x n matrix A and a vector L of +/-1 labels denoting the class of each row of A, the parameter \nu and an initial a^{(0)} \in R^k, we generate the nonlinear classifier (21) as follows:

(0) Calculate K_1, ..., K_k, the k kernels of the kernel family, where K_i = K_i(A, A') for each i. Define the vector d as in (6).

For each iteration i do:

(i) Given a^{(i-1)}, calculate the linear combination K = \sum_{j=1}^{k} a_j^{(i-1)} K_j.

(ii) Solve subproblem (23) to obtain (v^{(i)}, \gamma^{(i)}).

(iii) Calculate \Lambda_l = K_l v^{(i)} for l = 1, ..., k.

(iv) Solve subproblem (24) to obtain a^{(i)}.

Stop when a predefined maximum number of iterations is reached or when there is sufficiently little change in the objective function of problem (20) evaluated in successive iterations.
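A compact rendering of Algorithm 4.1 might look as follows. This is our interpretation of the algorithm, not the authors' code: subproblem (23) is solved as the (m+1) x (m+1) linear system used for (16), and subproblem (24) is rewritten as a nonnegative least squares problem on an augmented system, min_{a>=0} ||[sqrt(nu) Lambda; I] a - [sqrt(nu)(d + e gamma); 0]||^2, rather than calling a general QP solver such as CPLEX as in the paper. The stopping tolerance and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def akfd(kernels, d, nu, a0, max_iter=20, tol=1e-4):
    """Sketch of Algorithm 4.1 (A-KFD).

    kernels : list of k precomputed m x m kernel matrices K_j(A, A').
    d       : target vector of Equation (6).
    nu      : capacity-control parameter.
    a0      : initial nonnegative kernel weights, length k.
    Returns the kernel weights a, dual variables v and threshold gamma.
    """
    m = d.shape[0]
    k = len(kernels)
    a = np.asarray(a0, dtype=float)
    prev_obj = np.inf

    for _ in range(max_iter):
        # (i) linear combination of the kernel family with the current weights
        K = sum(a[j] * kernels[j] for j in range(k))

        # (ii) subproblem (23): (m+1) x (m+1) linear system, as for (14)/(16)
        G = np.hstack([K, -np.ones((m, 1))])
        sol = np.linalg.solve(G.T @ G + np.eye(m + 1) / nu, G.T @ d)
        v, gamma = sol[:m], sol[m]

        # (iii) Lambda_j = K_j v
        Lam = np.column_stack([kernels[j] @ v for j in range(k)])

        # (iv) subproblem (24) as nonnegative least squares on an augmented system
        A_aug = np.vstack([np.sqrt(nu) * Lam, np.eye(k)])
        b_aug = np.concatenate([np.sqrt(nu) * (d + gamma), np.zeros(k)])
        a, _ = nnls(A_aug, b_aug)

        # stop when the objective of problem (20) changes very little
        resid = d - (Lam @ a - gamma)
        obj = 0.5 * nu * resid @ resid + 0.5 * (v @ v + gamma**2 + a @ a)
        if abs(prev_obj - obj) < tol * max(1.0, abs(prev_obj)):
            break
        prev_obj = obj

    return a, v, gamma

# Usage on synthetic data with a {linear, Gaussian} kernel family
rng = np.random.default_rng(4)
A = np.vstack([rng.normal(+1, 1, (40, 2)), rng.normal(-1, 1, (60, 2))])
labels = np.array([1] * 40 + [-1] * 60)
m = A.shape[0]
d = np.where(labels == 1, m / 40.0, -m / 60.0)
sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
kernels = [A @ A.T, np.exp(-0.1 * sq)]
a, v, gamma = akfd(kernels, d, nu=10.0, a0=[1.0, 1.0])
pred = np.sign(sum(a[j] * kernels[j] for j in range(2)) @ v - gamma)   # classifier (21)
print("weights:", a, "training accuracy:", np.mean(pred == labels))
```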
Let N_i be the number of iterations of Algorithm 4.1. When k << m, which is usually the case (that is, when the number of kernel functions considered in the kernel family is much smaller than the number of data points), the complexity of Algorithm 4.1 is approximately N_i O(m^3) = O(m^3), since N_i is bounded by the maximum number of iterations and the cost of solving the quadratic programming problem (24) is "dominated" by the cost of solving problem (23). In practice, we found that Algorithm 4.1 typically converges in 3 or 4 iterations to a local solution of problem (20).

Since each of the two optimization problems ((23) and (24)) that the A-KFD algorithm requires to be solved is strongly convex, and thus each has a unique minimizer, the A-KFD algorithm can also be interpreted as an Alternating Optimization (AO) problem (Bezdek & Hathaway, 2003). Classical instances of AO problems include fuzzy regression c-models and fuzzy c-means clustering.

The A-KFD algorithm thus inherits the convergence properties and characteristics of AO problems. As stated in (Bezdek & Hathaway, 2002), the set of points to which Algorithm 4.1 can converge may include certain types of saddle points (i.e., points that behave like a local minimizer only when projected along a subset of the variables). However, it is also stated there that it is extremely difficult to find examples where convergence occurs to a saddle point rather than to a local minimizer. If the initial estimate is chosen sufficiently near a solution, a local q-linear convergence result is also presented by Bezdek et al. in (Bezdek & Hathaway, 2002). A more detailed convergence study in the more general context of regularization networks (Evgeniou et al., 2000a; Evgeniou et al., 2000b), including SVM-type loss functions, is in preparation.

5. Numerical Experiments

We tested our algorithm on five publicly available datasets commonly used for benchmarking in the literature, taken from the UCI Machine Learning Repository (Murphy & Aha, 1992): Ionosphere, Cleveland Heart, Pima Indians, BUPA Liver and Boston Housing. Additionally, a sixth dataset relates to colorectal cancer diagnosis using virtual colonoscopy derived from computer tomographic images; we refer to it as the colon CAD dataset. The dimensionality and size of each dataset are shown in Table 1.
5.1. Numerical experience on five publicly available datasets

We compared our proposed A-KFD against the standard KFD as described in equation (16), where the kernel model is chosen using a cross-validation tuning procedure. For our family of kernels we chose a family of 5 kernels: a linear kernel (K = AA') and 4 Gaussian kernels with \mu \in \{0.001, 0.01, 0.1, 1\}:

    (G_\mu)_{ij} = (K(A, B))_{ij} = \exp\big(-\mu \|A_i' - B_{\cdot j}\|^2\big), \quad i = 1, \ldots, m, \; j = 1, \ldots, m,    (25)

where A \in R^{m \times n} and B = A' \in R^{n \times m}. For all the experiments, in Algorithm 4.1, we used an initial a^{(0)} such that

    K = \sum_{j=1}^{k} a_j^{(0)} K_j = AA' + G_1.                                         (26)

That is, our initial kernel is an equally weighted combination of the linear kernel AA' (the kernel with the least fitting power) and G_1 (the kernel with the most fitting power).
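The kernel family of Equations (25)-(26) can be assembled as in the following sketch (ours, not the authors' code). The helper name, the row-sample convention for the Gaussian kernel, and the choice a^{(0)} = (1, 0, 0, 0, 1), which realizes Equation (26) for the family ordered as {linear, G_0.001, G_0.01, G_0.1, G_1}, are assumptions made for illustration.

```python
import numpy as np

def build_kernel_family(A, mus=(0.001, 0.01, 0.1, 1.0)):
    """Return the 5-kernel family of Section 5.1: a linear kernel AA'
    and Gaussian kernels G_mu as in Equation (25), together with the
    initial weights a0 realizing Equation (26): K = AA' + G_1."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    kernels = [A @ A.T] + [np.exp(-mu * sq) for mu in mus]
    a0 = np.zeros(len(kernels))
    a0[0] = 1.0        # linear kernel
    a0[-1] = 1.0       # G_1, the kernel with the most fitting power
    return kernels, a0

# Example: Ionosphere-sized random stand-in data, normalized to [-1, 1]
rng = np.random.default_rng(5)
A = rng.uniform(-1.0, 1.0, size=(351, 34))
kernels, a0 = build_kernel_family(A)
print(len(kernels), "kernels, initial weights", a0)
```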
The parameter \nu required for both methods was chosen from the set \{10^{-3}, 10^{-2}, \ldots, 10^{0}, \ldots, 10^{11}, 10^{12}\}. To solve the quadratic programming (QP) problem (24) we used CPLEX 9.0 (CPL, 2004), although, since the problem to be solved has nice properties and is "small" in size (k = 5 in our experiments), any publicly available QP solver can be used for this task. Next, we describe the methodology used in our experiments:

1. Each dataset was normalized between -1 and 1.

2. We randomly split the dataset into two groups consisting of 70% for training and 30% for testing. We call the training subset TR and the testing subset TE.

3. On the training set TR we used a ten-fold cross-validation tuning procedure to select "optimal" values for the parameter \nu in A-KFD and for the parameters \nu and \mu in the standard KFD. By "optimal" values, we mean the parameter values that maximize the ten-fold cross-validation testing correctness. A linear kernel was also considered as a kernel choice for the standard KFD.

4. Using the "optimal" values found in step 3, we build a final classification surface (21), and then evaluate its performance on the testing set TE.

Steps 1 to 4 are repeated 10 times and the average testing set correctness is reported in Table 1. The average times over the ten runs are reported in Table 2. A paired t-test (Mitchell, 1997) at the 95% confidence level was performed over the results of the ten runs to compare the performance of the two algorithms. In most of the experiments, the p-values obtained show that there is no significant difference between A-KFD and the standard KFD where the kernel model is chosen using a cross-validation tuning procedure. Only on two of the datasets, Ionosphere and Heart, is there a small statistically significant difference between the two methods, with A-KFD performing better on the Ionosphere dataset and the standard tuning performing better on the Heart dataset. These results suggest that the two methods are not significantly different regarding generalization accuracy. In all experiments, the A-KFD algorithm converged on average in 3 or 4 iterations, thus obtaining the final classifier in considerably less time than the standard KFD with kernel tuning. Table 2 shows that A-KFD was up to 6.3 times faster in one of the cases.

Table 1. Ten-fold and testing set classification accuracies and p-values for five publicly available datasets (best and statistically significant values in bold).

Data set (m x n)          A-KFD     KFD + kernel tuning     p-value
Ionosphere (351 x 34)     94.7%     92.73%                  0.03
Housing (506 x 13)        89.9%     89.4%                   0.40
Heart (297 x 13)          79.7%     82.2%                   0.04
Pima (768 x 8)            74.1%     74.4%                   0.7
Bupa (345 x 6)            70.9%     70.5%                   0.75

Table 2. Average times in seconds for both methods: A-KFD and standard KFD where the kernel width was obtained by tuning. Times are averages over ten runs. Kernel calculation time and \nu tuning time are included for both algorithms (best in bold).

Data set (m x n)          A-KFD (secs.)     KFD + kernel tuning (secs.)
Ionosphere (351 x 34)     55.3              350.0
Housing (506 x 13)        134.4             336.9
Heart (297 x 13)          39.7              109.2
Pima (768 x 8)            341.5             598.4
Bupa (345 x 6)            48.2              81.7
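For completeness, the evaluation protocol of steps 1-4 above can be sketched as follows. This is purely our illustration: the classifier used here is a trivial nearest-mean stand-in so that the snippet is self-contained and runnable; in the actual experiments it would be replaced by A-KFD and by the tuned standard KFD (with the ten-fold cross-validation of step 3, omitted here), and the paired t-test mirrors the comparison reported above.

```python
import numpy as np
from scipy.stats import ttest_rel

def normalize(A):
    """Step 1: scale every feature of A to the range [-1, 1]."""
    lo, hi = A.min(axis=0), A.max(axis=0)
    return 2.0 * (A - lo) / np.maximum(hi - lo, 1e-12) - 1.0

def split_70_30(A, labels, rng):
    """Step 2: random 70% / 30% split into TR and TE."""
    perm = rng.permutation(A.shape[0])
    cut = int(0.7 * A.shape[0])
    tr, te = perm[:cut], perm[cut:]
    return A[tr], labels[tr], A[te], labels[te]

def nearest_mean_fit_predict(A_tr, y_tr, A_te):
    """Stand-in classifier used only to make the sketch self-contained."""
    mu_p, mu_m = A_tr[y_tr == 1].mean(0), A_tr[y_tr == -1].mean(0)
    d_p = ((A_te - mu_p) ** 2).sum(1)
    d_m = ((A_te - mu_m) ** 2).sum(1)
    return np.where(d_p < d_m, 1, -1)

rng = np.random.default_rng(6)
A = normalize(rng.normal(size=(300, 8)))
labels = np.where(A[:, 0] + 0.3 * rng.normal(size=300) > 0, 1, -1)

acc_method1, acc_method2 = [], []
for _ in range(10):                              # steps 1-4 repeated 10 times
    A_tr, y_tr, A_te, y_te = split_70_30(A, labels, rng)
    pred1 = nearest_mean_fit_predict(A_tr, y_tr, A_te)          # "method 1" stand-in
    pred2 = nearest_mean_fit_predict(A_tr[::2], y_tr[::2], A_te) # "method 2" stand-in
    acc_method1.append(np.mean(pred1 == y_te))
    acc_method2.append(np.mean(pred2 == y_te))

# Paired t-test over the ten runs at the 95% confidence level
t_stat, p_value = ttest_rel(acc_method1, acc_method2)
print("mean accuracies:", np.mean(acc_method1), np.mean(acc_method2), "p-value:", p_value)
```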
5.2. Numerical experience on the Colon CAD dataset

In this section we report experiments on the colon CAD dataset. The classification task associated with this dataset relates to colorectal cancer diagnosis. Colorectal cancer is the third most common cancer in both men and women. Recent studies (Yee et al., 2003) have estimated that in 2003, nearly 150,000 cases of colon and rectal cancer would be diagnosed in the US, and more than 57,000 people would die from the disease, accounting for about 10% of all cancer deaths. A polyp is a small tumor that projects from the inner walls of the intestine or rectum. Early detection of polyps in the colon is critical, because polyps can turn into cancerous tumors if they are not detected at the polyp stage.

The database of high-resolution CT images used in this study was obtained at NYU Medical Center. One hundred and five (105) patients were selected so as to include positive cases (n = 61) as well as negative cases (n = 44). The images are preprocessed in order to calculate features based on moments of tissue intensity, volumetric and surface shape, and texture characteristics. The final dataset used in this paper is a balanced subset of the original dataset consisting of 300 candidates: 145 candidates labeled as polyps and 155 as non-polyps. Each candidate is represented by a vector of the 14 features that have the most discriminating power according to a feature-selection preprocessing stage. The non-polyp points were chosen from candidates that were consistently misclassified by an existing classifier that was trained to have a very low number of false positives on the entire dataset. This means that in the given 14-dimensional feature space, the colon CAD dataset is extremely difficult to separate.

For these tests, we used the same methodology described in subsection 5.1, obtaining very similar results. The standard KFD ran in an average time of 122.0 seconds over ten runs with an average test set correctness of 73.4%. A-KFD ran in an average time of 41.21 seconds with an average test set correctness of 72.4%. As in Section 5.1, we performed a paired t-test (Mitchell, 1997) at the 95% confidence level; the resulting p-value of 0.32 > 0.05 indicates that there is no significant difference between the two methods on this dataset at the 95% confidence level. In summary, A-KFD had the same generalization capabilities and ran almost 3 times faster than the standard KFD.

6. Conclusions and Outlook

We have proposed a simple procedure for generating a heterogeneous Kernel Fisher Discriminant classifier where the kernel model is defined to be a linear combination of members of a potentially large pre-defined family of heterogeneous kernels. Using this approach, the task of finding an "appropriate" kernel that satisfactorily suits the classification task can be incorporated into the optimization problem to be solved. In contrast with previous works that also consider linear combinations of kernels, our proposed algorithm requires nothing more sophisticated than solving a simple nonsingular system of linear equations of the size of the number of training points m, and solving a quadratic programming problem that is usually very small, since its size depends on the predefined number of kernels in the kernel family (5 in our experiments). The practical complexity of the A-KFD algorithm does not depend explicitly on the number of kernels in the predefined kernel family.

Empirical results show that, compared to the standard KFD where the kernel is selected by a cross-validation tuning procedure, the proposed method is several times faster with no significant impact on generalization performance.

The convergence of the A-KFD algorithm is justified as a special case of the alternating optimization algorithm described in (Bezdek & Hathaway, 2003). Future work includes a more general version of the proposed algorithm in the context of regularization networks, where the convergence results can be presented in a more detailed manner. Future work also includes the use of sparse kernel techniques and random projections to further improve computational efficiency. We also plan to explore the use of weak kernels (kernels that depend on a subset of the original input features) for feature selection in this framework.

References

(2004). CPLEX optimizer. ILOG CPLEX Division, 889 Alder Avenue, Incline Village, Nevada. http://www.cplex.com/.

Bach, F., Lanckriet, J., & Jordan, M. (2004). Fast kernel learning using sequential minimal optimization (Technical Report CSD-04-1307). Division of Computer Science, University of California, Berkeley.
Bennet, K., Momma, M., & Embrechts, M. (2002). MARK: A boosting algorithm for heterogeneous kernel models. Proceedings KDD-2002: Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, CA (pp. 24-31). Association for Computing Machinery.

Bezdek, J., & Hathaway, R. (2002). Some notes on alternating optimization. Proceedings of the 2002 AFSS International Conference on Fuzzy Systems (pp. 288-300). Springer-Verlag.

Bezdek, J., & Hathaway, R. (2003). Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11, 351-368.

Cherkassky, V., & Mulier, F. (1998). Learning from data - concepts, theory and methods. New York: John Wiley & Sons.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.

Evgeniou, T., Pontil, M., & Poggio, T. (2000a). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1-50.

Evgeniou, T., Pontil, M., & Poggio, T. (2000b). Regularization networks and support vector machines. Advances in Large Margin Classifiers (pp. 171-203). Cambridge, MA: MIT Press.

Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego, CA: Academic Press.

Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA (pp. 77-86). New York: Association for Computing Machinery. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps.

Fung, G., & Mangasarian, O. L. (2003). Finite Newton method for Lagrangian support vector machine classification. Special Issue on Support Vector Machines. Neurocomputing, 55, 39-55.

Hamers, B., Suykens, J., Leemans, V., & Moor, B. D. (2003). Ensemble learning of coupled parameterised kernel models. International Conference on Neural Information Processing (pp. 130-133).

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2003). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.

Lee, Y.-J., & Mangasarian, O. L. (2001). SSVM: A smooth support vector machine. Computational Optimization and Applications, 20, 5-22. Data Mining Institute, University of Wisconsin, Technical Report 99-03. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.ps.

Mangasarian, O. L. (1994). Nonlinear programming. Philadelphia, PA: SIAM.

Mangasarian, O. L. (2000). Generalized support vector machines. Advances in Large Margin Classifiers (pp. 135-146). Cambridge, MA: MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.

Mika, S., Rätsch, G., & Müller, K.-R. (2000). A mathematical programming approach to the kernel Fisher algorithm. NIPS (pp. 591-597).

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX (pp. 41-48). IEEE.

Mitchell, T. M. (1997). Machine learning. Boston: McGraw-Hill.

Murphy, P. M., & Aha, D. W. (1992). UCI machine learning repository. www.ics.uci.edu/~mlearn/MLRepository.html.

Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293-300.

Vapnik, V. N. (2000). The nature of statistical learning theory. New York: Springer. Second edition.

Xu, J., & Zhang, X. (2001). Kernel MSE algorithm: A unified framework for KFD, LS-SVM and KRR. Proceedings of the International Joint Conference on Neural Networks 2001. IEEE.

Yee, J., Geetanjali, A., Hung, R., Steinauer-Gebauer, A., Wall, A., & McQuaid, K. (2003). Computer tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults. Proceedings of NEJM-2003 (pp. 2191-2200).
