
Pattern Recognition 42 (2009) 2671–2683
doi:10.1016/j.patcog.2009.04.002

Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Combining feature spaces for classification


Theodoros Damoulas*, Mark A. Girolami
Inference Research Group, Department of Computing Science, Faculty of Information and Mathematical Sciences, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8QQ, Scotland, UK

* Corresponding author. Tel.: +44 141 330 2421; fax: +44 141 330 8627. E-mail addresses: theo@dcs.gla.ac.uk (T. Damoulas), girolami@dcs.gla.ac.uk (M.A. Girolami).

ARTICLE INFO

Article history:
Received 2 November 2007
Received in revised form 24 January 2009
Accepted 5 April 2009

Keywords:
Variational Bayes approximation
Multiclass classification
Kernel combination
Hierarchical Bayes
Bayesian inference
Ensemble learning
Multi-modal modelling
Information integration

ABSTRACT

In this paper we offer a variational Bayes approximation to the multinomial probit model for basis expansion and kernel combination. Our model is well-founded within a hierarchical Bayesian framework and is able to instructively combine available sources of information for multinomial classification. The proposed framework enables informative integration of possibly heterogeneous sources in a multitude of ways, from the simple summation of feature expansions to weighted product of kernels, and it is shown to match and in certain cases outperform the well-known ensemble learning approaches of combining individual classifiers. At the same time the approximation reduces considerably the CPU time and resources required with respect to both the ensemble learning methods and the full Markov chain Monte Carlo, Metropolis–Hastings within Gibbs solution of our model. We present our proposed framework together with extensive experimental studies on synthetic and benchmark datasets and also for the first time report a comparison between summation and product of individual kernels as possible different methods for constructing the composite kernel matrix.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Classification, or supervised discrimination, has been an active research field within the pattern recognition and machine learning communities for a number of decades. The interest comes from the nature of the problems encountered within the communities, a large number of which can be expressed as discrimination tasks, i.e. being able to learn and/or predict which category an object belongs to. In the supervised and semi-supervised learning setting [24,6], where labeled (or partially labeled) training sets are available in order to predict labels for unseen objects, we are in the classification domain, and in the unsupervised learning case, where the machine is asked to learn categories from the available unlabeled data, we fall in the domain of clustering.

In both of these domains a recurring theme is how to utilize constructively all the information that is available for the specific problem in hand. This is especially true when a large number of features from different sources are available for the same object and there is limited or no a priori knowledge of their significance and contribution to the classification or clustering task. In such cases, concatenating all the features into a single feature space does not guarantee optimum performance and it exacerbates the "curse of dimensionality" problem.

The problem of classification in the case of multiple feature spaces or sources has mostly been dealt with in the past by ensemble learning methods [12], namely combinations of individual classifiers. The idea behind that approach is to train one classifier on every feature space and then combine their class predictive distributions. Different ways of combining the outputs of the classifiers have been studied [18,34,19], and meta-learning an overall classifier on these distributions [36] has also been proposed.

The drawbacks of combining classifiers lie in their theoretical justification, in the processing loads incurred as multiple training has to be performed, and in the fact that the individual classifiers operate independently on the data. Their performance has been shown to improve significantly over the best individual classifier in many cases, but the extra load of training multiple classifiers may restrain their application when resources are limited. The typical combination rules for classifiers are ad hoc methods [5] that are based on the notions of the linear or independent opinion pools. However, as noted by Berger [5], ". . . it is usually better to approximate a 'correct answer' than to adopt a completely ad hoc approach", and this has a bearing, as we shall see, on the nature of classifier combination and the motivation for kernel combination.

In this paper we offer, for the classification problem, a multiclass kernel machine [30,31] that is able to combine kernel spaces in an informative manner while at the same time learning their significance. The proposed methodology is general and can also be
employed outside the "kernel-domain", i.e. without the need to embed the features into Hilbert spaces, by allowing for combination of basis function expansions. That is useful in the case of multi-scale problems and wavelets [3]. The combination of kernels or dictionaries of basis functions as an alternative to classifier combination offers the advantages of reduced computational requirements, the ability to learn the significance of the individual sources and an improved solution based on the inferred significance.

Previous related work includes the area of kernel learning, where various methodologies have been proposed in order to learn the possibly composite kernel matrix: by convex optimization methods [20,33,22,23], by the introduction of hyper-kernels [28], or by direct ad hoc construction of the composite kernel [21]. These methods operate within the context of support vector machine (SVM) learning and hence inherit the arguable drawbacks of non-probabilistic outputs and ad hoc extensions to the multiclass case. Furthermore, another related approach within the non-parametric Gaussian process (GP) methodology [27,29] has been proposed by Girolami and Zhong [16], where instead of kernel combination the integration of information is achieved via combination of the GP covariance functions.

Our work further develops the work of Girolami and Rogers [14], which we generalize to the multiclass case by employing a multinomial probit likelihood and introducing latent variables that give rise to efficient Gibbs sampling from the parameter posterior distribution. We bound the marginal likelihood or model evidence [25] and derive the variational Bayes (VB) approximation for the multinomial probit composite kernel model, providing a fast and efficient solution. We are able to combine kernels in a general way, e.g. by summation, product or binary rules, and learn the associated weights to infer the significance of the sources.

For the first time we offer a VB approximation of an explicit multiclass model for kernel combination and a comparison between kernel combination methods, from summation to weighted product of kernels. Furthermore, we compare our methods against classifier combination strategies on a number of datasets and we show that the accuracy of our VB approximation is comparable to that of the full Markov chain Monte Carlo (MCMC) solution.

The paper is organized as follows. First we introduce the concepts of composite kernels, kernel combination rules and classifier combination strategies and give an insight into their theoretical underpinnings. Next we introduce the multinomial probit composite kernel model, the MCMC solution and the VB approximation. Finally, we present results on synthetic and benchmark datasets and offer discussion and concluding remarks.

2. Classifier versus kernel combination

The theoretical framework behind classifier combination strategies has been explored relatively recently¹ [12,18,19], despite the fact that ensembles of classifiers have been widely used experimentally before. We briefly review that framework in order to identify the common ground and the deviations from classifier combination to the proposed kernel (or basis function) combination methodology.

¹ At least in the classification domain for machine learning and pattern recognition. The related statistical decision theory literature has studied opinion pools much earlier; see Berger [5] for a full treatment.

Consider $S$ feature spaces or sources of information.² An object $n$ belonging to a dataset $\{\mathbf{X}, \mathbf{t}\}$, with $\mathbf{X}$ the collection of all objects and $\mathbf{t}$ the corresponding labels, is represented by $S$ feature vectors $\mathbf{x}_n^s \in \mathbb{R}^{D_s}$ for $s = 1, \ldots, S$. Let the number of classes be $C$, with the target variable $t_n = c = 1, \ldots, C$, and the number of objects $N$, with $n = 1, \ldots, N$. Then the class posterior probability for object $n$ is $P(t_n|\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S)$ and, according to Bayesian decision theory, we would assign object $n$ to the class that has the maximum a posteriori probability.

² Throughout this paper $m$ denotes a scalar, $\mathbf{m}$ a vector and $\mathbf{M}$ a matrix. If $\mathbf{M}$ is a $C \times N$ matrix then $\mathbf{m}_c$ is the $c$th row vector, $\mathbf{m}_n$ the $n$th column vector of that matrix, and all other indices imply row vectors.

The classifier combination methodologies are based on the following assumption, which as we shall see is not necessary for the proposed kernel combination approach. The assumption is that although "it is essential to compute the probabilities of various hypotheses by considering all the measurements simultaneously . . . it may not be a practical proposition" [18], and it leads to a further approximation of the (noisy) class posterior probabilities, which are now broken down into individual contributions from classifiers trained on each feature space $s$.

The product combination rule, assuming equal a priori class probabilities, is

$$P_{\mathrm{prod}}(t_n|\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S) = \frac{\prod_{s=1}^{S}(P(t_n|\mathbf{x}_n^s) + \epsilon^s)}{\sum_{t_n=1}^{C}\prod_{s=1}^{S}(P(t_n|\mathbf{x}_n^s) + \epsilon^s)} \qquad (1)$$

and the mean combination rule is

$$P_{\mathrm{mean}}(t_n|\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S) = \frac{1}{S}\sum_{s=1}^{S}(P(t_n|\mathbf{x}_n^s) + \epsilon^s) \qquad (2)$$

with $\epsilon^s$ the prediction error made by the individual classifier trained on feature space $s$.

The theoretical justification for the product rule comes from the independence assumption on the feature spaces, where $\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S$ are assumed to be uncorrelated; the mean combination rule is derived under the opposite assumption of extreme correlation. As can be seen from Eqs. (1) and (2), the individual errors $\epsilon^s$ of the classifiers are either added or multiplied together, and the hope is that there will be a synergetic effect from the combination that will cancel them out and hence reduce the overall classification error.

Instead of combining classifiers we propose to combine the feature spaces and hence obtain the class posterior probabilities from the model

$$P_{\mathrm{spaces}}(t_n|\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S) = P(t_n|\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S) + \epsilon,$$

where now we only have one error term $\epsilon$, from the single classifier operating on the composite feature space. The different methods of constructing the composite feature space reflect different approximations to the $P(t_n|\mathbf{x}_n^1, \ldots, \mathbf{x}_n^S)$ term without incorporating individual classifier errors into the approximation. Still, underlying assumptions of independence for the construction of the composite feature space are implied through the kernel combination rules and, as we shall see, this will lead to the need for diversity, as is commonly known for other ensemble learning approaches.
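To make the contrast concrete, the short sketch below is our own illustration (not part of the original paper) of the product and mean rules of Eqs. (1) and (2), applied to the per-classifier class posteriors $P(t_n|\mathbf{x}_n^s)$ for a single object; the numerical values are arbitrary.

import numpy as np

def product_rule(P):
    # Eq. (1): elementwise product over the S classifiers, renormalized (equal class priors assumed).
    combined = np.prod(P, axis=0)
    return combined / combined.sum()

def mean_rule(P):
    # Eq. (2): average of the S per-classifier class posteriors.
    return P.mean(axis=0)

# Illustrative usage: S = 3 classifiers, C = 4 classes, posteriors for one object.
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.55, 0.25, 0.10, 0.10],
              [0.40, 0.30, 0.20, 0.10]])
print(product_rule(P))   # sharper, relies on independence of the feature spaces
print(mean_rule(P))      # smoother, robust under strong correlation

The feature-space combination proposed in this paper replaces these post hoc rules with a single classifier on a composite representation, so only one error term enters the model.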
Although our approach is general and can be applied to combine basis expansions or kernels, we present here the composite kernel construction as our main example. Embedding the features into Hilbert spaces via the kernel trick [30,31] we can define³ the $N \times N$ mean composite kernel as

$$K^{\beta\Theta}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{s=1}^{S}\beta_s K^{s\theta_s}(\mathbf{x}_i^s, \mathbf{x}_j^s) \quad \text{with } \sum_{s=1}^{S}\beta_s = 1 \text{ and } \beta_s \geq 0\ \forall s,$$

the $N \times N$ product composite kernel⁴ as

$$K^{\beta\Theta}(\mathbf{x}_i, \mathbf{x}_j) = \prod_{s=1}^{S}K^{s\theta_s}(\mathbf{x}_i^s, \mathbf{x}_j^s)^{\beta_s} \quad \text{with } \beta_s \geq 0\ \forall s,$$

and the $N \times N$ binary composite kernel as

$$K^{\beta\Theta}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{s=1}^{S}\beta_s K^{s\theta_s}(\mathbf{x}_i^s, \mathbf{x}_j^s) \quad \text{with } \beta_s \in \{0, 1\}\ \forall s,$$

where $\Theta$ are the kernel parameters, $\boldsymbol{\beta}$ the combinatorial weights and $K$ the kernel function employed, typically in this study a Gaussian or polynomial function. We can now weigh the feature spaces accordingly by the parameter $\boldsymbol{\beta}$ which, as we shall see further on, can be inferred from the observed evidence. Learning these combinatorial parameters is the main objective of multiple kernel learning methods such as the ones proposed in this article.

³ Superscripts denote "function of", i.e. $K^{\beta\Theta}$ denotes that the kernel $K$ is a function of $\boldsymbol{\beta}$ and $\Theta$, unless otherwise specified.
⁴ In this case only the superscript $\beta_s$ in $K^{s\theta_s}(\mathbf{x}_i^s, \mathbf{x}_j^s)^{\beta_s}$ denotes a power. The superscript $s$ in $K^s(\cdot,\cdot)$, or generally in $\mathbf{K}^s$, indicates that this is the $s$th base kernel.
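The sketch below is our own minimal illustration of the three composite kernel constructions above; the Gaussian base kernel, the fixed lengthscale $1/D_s$ and the uniform initial weights are illustrative assumptions rather than the paper's implementation.

import numpy as np

def gaussian_kernel(Xs, lengthscale):
    # Base kernel K^s on one feature space: N x N Gaussian (RBF) Gram matrix.
    sq = np.sum(Xs**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Xs @ Xs.T
    return np.exp(-lengthscale * d2)

def composite_kernel(base_kernels, beta, rule="mean"):
    # Combine S base Gram matrices with weights beta under the mean, product or binary rule.
    K = np.ones_like(base_kernels[0]) if rule == "product" else np.zeros_like(base_kernels[0])
    for Ks, bs in zip(base_kernels, beta):
        if rule == "product":
            K *= Ks ** bs            # weighted product of kernels
        else:                        # "mean" (beta on the simplex) or "binary" (beta in {0, 1})
            K += bs * Ks
    return K

# Illustrative usage: S = 3 feature spaces describing the same N = 20 objects.
rng = np.random.default_rng(0)
X_spaces = [rng.normal(size=(20, d)) for d in (4, 6, 2)]
bases = [gaussian_kernel(Xs, 1.0 / Xs.shape[1]) for Xs in X_spaces]   # theta_s fixed to 1/D_s
K_mean = composite_kernel(bases, np.full(3, 1.0 / 3), rule="mean")
K_prod = composite_kernel(bases, np.ones(3), rule="product")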

3. The multinomial probit model

In accordance with Albert and Chib [1], we introduce auxiliary variables $\mathbf{Y} \in \mathbb{R}^{C \times N}$ that we regress onto with our composite kernel and the parameters (regressors) $\mathbf{W} \in \mathbb{R}^{C \times N}$. The intuition is that the regressors express the weight with which a data point "votes" for a specific class $c$, and the auxiliary variables are continuous "target" values that are related to the class membership ranking for a specific point. Following the standardized noise model $\epsilon \sim \mathcal{N}(0, 1)$, which results in $y_{cn} = \mathbf{w}_c\mathbf{k}_n^{\beta\Theta} + \epsilon$, with $\mathbf{w}_c$ the $1 \times N$ row vector of class-$c$ regressors and $\mathbf{k}_n^{\beta\Theta}$ the $N \times 1$ column vector of inner products for the $n$th element, we obtain the Gaussian probability distribution

$$p(y_{cn}|\mathbf{w}_c, \mathbf{k}_n^{\beta\Theta}) = \mathcal{N}_{y_{cn}}(\mathbf{w}_c\mathbf{k}_n^{\beta\Theta}, 1). \qquad (3)$$

The link from the auxiliary variable $y_{cn}$ to the discrete target variable of interest $t_n \in \{1, \ldots, C\}$ is given by $t_n = i \iff y_{in} > y_{jn}\ \forall j \neq i$, and the marginalization $P(t_n = i|\mathbf{W}, \mathbf{k}_n^{\beta\Theta}) = \int P(t_n = i|\mathbf{y}_n)P(\mathbf{y}_n|\mathbf{W}, \mathbf{k}_n^{\beta\Theta})\,d\mathbf{y}_n$, where $P(t_n = i|\mathbf{y}_n)$ is a delta function, results in the multinomial probit likelihood

$$P(t_n = i|\mathbf{W}, \mathbf{k}_n^{\beta\Theta}) = \mathbb{E}_{p(u)}\left\{\prod_{j \neq i}\Phi\!\left(u + (\mathbf{w}_i - \mathbf{w}_j)\mathbf{k}_n^{\beta\Theta}\right)\right\}, \qquad (4)$$

where the expectation is taken with respect to the standardized normal distribution $p(u) = \mathcal{N}(0, 1)$ and $\Phi$ is the cumulative density function.
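As a concrete illustration of Eq. (4), the one-dimensional expectation over $u$ can be estimated by simple Monte Carlo. The sketch below is ours (not part of the paper) and assumes the regressors $\mathbf{W}$ and the composite kernel column $\mathbf{k}_n$ are already available.

import numpy as np
from scipy.stats import norm

def probit_class_probabilities(W, k_n, n_samples=1000, rng=None):
    # Monte Carlo estimate of the multinomial probit likelihood of Eq. (4).
    # W: (C, N) regressors, k_n: (N,) composite kernel column for object n.
    rng = np.random.default_rng(rng)
    m = W @ k_n                               # class scores w_c k_n
    u = rng.standard_normal(n_samples)        # samples of the standardized latent noise
    C = len(m)
    probs = np.empty(C)
    for i in range(C):
        diffs = m[i] - np.delete(m, i)        # (w_i - w_j) k_n for all j != i
        probs[i] = np.mean(np.prod(norm.cdf(u[:, None] + diffs[None, :]), axis=1))
    return probs / probs.sum()                # renormalize against Monte Carlo error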
We can now consider the prior and hyper-prior distributions to be placed on our model.

3.1. Prior probabilities

Having derived the multinomial probit likelihood, we complete our Bayesian model by introducing prior distributions on the model parameters. The choice of prior distributions is justified as follows. For the regressors $\mathbf{W}$ we place a product of zero-mean Gaussian distributions with scales $\zeta_{cn}$, which reflects independence (the product) and lack of prior knowledge (the zero-mean normal). The only free parameter of this distribution is the scale, on which we place a gamma prior distribution that ensures a positive value, is a conjugate pair with the normal, and is controlled via the hyper-parameters $\tau, \upsilon$. This prior setting propagates the uncertainty to a higher level and is commonly employed in the statistics literature [11] for the above-mentioned reasons of conjugacy and (lack of) prior knowledge.

The hyper-parameter setting of $\tau, \upsilon$ dictates the prior form of the gamma distribution and can be chosen to ensure an uninformative distribution (as will be done in this work), or we can induce sparsity via a flat prior ($\tau, \upsilon \to 0$) as in Damoulas et al. [9]. The latter leads to the empirical Bayes method of type-II maximum likelihood as employed in the construction of relevance vector machines [35].

Furthermore, we place a gamma distribution on each kernel parameter since $\theta_{sd} \in \mathbb{R}^{+}$. In the case of the mean composite kernel, a Dirichlet distribution with parameters $\boldsymbol{\rho}$ is placed on the combinatorial weights in order to satisfy the constraints imposed on the possible values, which are defined on a simplex, and to assist statistical identifiability. A further gamma distribution, again to ensure positive values, is placed on each $\rho_s$ with associated hyper-parameters $\mu, \lambda$.

In the product composite kernel case we employ the right dashed plate in Fig. 2, which places a gamma distribution on the combinatorial weights $\boldsymbol{\beta}$, which no longer need to be defined on a simplex, with an exponential hyper-prior distribution on each of the parameters $\pi_s, \chi_s$.

Finally, in the binary composite kernel case we employ the left dashed plate in Fig. 2, which places a binomial distribution on each $\beta_s$ with equal probability of being one or zero (unless prior knowledge says otherwise). The small number of possible $2^S$ states of the $\boldsymbol{\beta}$ vector allows for their explicit consideration in the inference procedure and hence there is no need to place any hyper-prior distributions.

The model, for the case of the mean composite kernel, is depicted graphically in Fig. 1, where the conditional relations of the model parameters and associated hyper-parameters can be seen. The accompanying variations for the binary and product composite kernels are given in Fig. 2.

[Fig. 1. Plates diagram of the model for the mean composite kernel: hyper-parameters $\tau, \upsilon, \omega, \phi, \mu, \lambda$; parameters $\mathbf{Z}, \Theta, \boldsymbol{\rho}$; regressors $\mathbf{W}$, auxiliary variables $\mathbf{Y}$, weights $\boldsymbol{\beta}$; targets $\mathbf{t}$.]

[Fig. 2. Modification for the binary composite (left) and product composite kernel (right).]

The intuition behind the hierarchical construction of the proposed model is that uncertainty is propagated into a higher level of prior distribution setting. In other words, we place further hyper-prior distributions on parameters of the prior distributions and we let the evidence guide us to the required posteriors. In that way the extra levels allow less sensitivity to initial settings (in the case where fixed values are used for the hyper-parameters instead of empirical Bayes), since a specific hyper-parameter value will still lead to a (prior) distribution of possible parameter values, which can still be "corrected" by the evidence of the data towards the appropriate final posterior distribution.
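To make the hierarchy of Section 3.1 concrete, the short sketch below (our illustration, with arbitrary hyper-parameter values and sizes) draws one sample from the generative prior for the mean composite kernel case: gamma-distributed scales, zero-mean Gaussian regressors, gamma kernel parameters, and Dirichlet combinatorial weights.

import numpy as np

rng = np.random.default_rng(1)
C, N, S, D = 3, 20, 4, 5          # classes, objects, feature spaces, dims per space (illustrative)
tau, upsilon = 1.0, 1.0            # hyper-parameters of the gamma prior on the regressor scales
omega, phi = 1.0, 1.0              # hyper-parameters of the gamma prior on kernel parameters
mu, lam = 1.0, 1.0                 # hyper-parameters of the gamma prior on Dirichlet parameters

zeta = rng.gamma(tau, 1.0 / upsilon, size=(C, N))    # positive scales zeta_cn
W = rng.normal(0.0, np.sqrt(zeta))                   # zero-mean Gaussian regressors w_cn
theta = rng.gamma(omega, 1.0 / phi, size=(S, D))     # positive kernel parameters theta_sd
rho = rng.gamma(mu, 1.0 / lam, size=S)               # Dirichlet parameters rho_s
beta = rng.dirichlet(rho)                            # simplex-constrained combinatorial weights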
3.2. Exact Bayesian inference: Gibbs sampling

The standard statistical solution for performing Bayesian inference is via Markov chain Monte Carlo sampling methods, which are powerful but computationally expensive. These approaches still approximate the desired posterior distribution, but only because of the limited number of drawn samples, hence the name exact inference. Among these methods, Gibbs sampling is the preferred alternative as it avoids the need for tuning acceptance ratios, as is typical in the original Metropolis–Hastings (MH) algorithms [17]. Such a sampling scheme is only possible when the hard-to-sample-from joint posterior distribution can be broken down into conditional distributions from which we can easily draw samples.

In our model such a sampler naturally arises by exploiting the conditional distributions of $\mathbf{Y}|\mathbf{W},\ldots$ and $\mathbf{W}|\mathbf{Y},\ldots$, so that we can draw samples from the parameter posterior distribution $P(\mathbf{\Omega}|\mathbf{t}, \mathbf{X}, \mathbf{\Xi})$, where $\mathbf{\Omega} = \{\mathbf{Y}, \mathbf{W}, \boldsymbol{\beta}, \Theta, \mathbf{Z}, \boldsymbol{\rho}\}$ and $\mathbf{\Xi}$ are the aforementioned hyper-parameters of the model. In the case of a non-conjugate pair of distributions, Metropolis–Hastings sub-samplers can be employed for posterior inference. More details of the Gibbs sampler are provided analytically in [8] and summarized in Appendix A.

3.3. Approximate inference: variational approximation

Exact inference via the Gibbs sampling scheme is computationally expensive, and in this article we derive a deterministic variational approximation, which is an efficient method with accuracy comparable to the full sampling scheme, as will be demonstrated. The variational methodology, see Beal [4] for a recent treatment, offers a lower bound on the model evidence using an ensemble of factored posteriors to approximate the joint parameter posterior distribution. Although the factored ensemble implies independence of the approximate posteriors, which is a typical strong assumption of variational methods, there is weak coupling through the current estimates of the parameters, as can be seen below.

Considering the joint likelihood of the model⁵, defined as

$$p(\mathbf{t}, \mathbf{\Omega}|\mathbf{X}, \mathbf{\Xi}) = p(\mathbf{t}|\mathbf{Y})\,p(\mathbf{Y}|\mathbf{W}, \boldsymbol{\beta}, \Theta)\,p(\mathbf{W}|\mathbf{Z})\,p(\mathbf{Z}|\tau, \upsilon)\,p(\boldsymbol{\beta}|\boldsymbol{\rho})\,p(\Theta|\omega, \phi)\,p(\boldsymbol{\rho}|\mu, \lambda),$$

and the factorable ensemble approximation of the required posterior, $p(\mathbf{\Omega}|\mathbf{\Xi}, \mathbf{X}, \mathbf{t}) \approx Q(\mathbf{\Omega}) = Q(\mathbf{Y})Q(\mathbf{W})Q(\boldsymbol{\beta})Q(\Theta)Q(\mathbf{Z})Q(\boldsymbol{\rho})$, we can bound the model evidence using Jensen's inequality:

$$\log p(\mathbf{t}) \geq \mathbb{E}_{Q(\mathbf{\Omega})}\{\log p(\mathbf{t}, \mathbf{\Omega}|\mathbf{\Xi})\} - \mathbb{E}_{Q(\mathbf{\Omega})}\{\log Q(\mathbf{\Omega})\} \qquad (5)$$

and maximize the bound with distributions $Q(\mathbf{\Omega}_i) \propto \exp(\mathbb{E}_{Q(\mathbf{\Omega}_{-i})}\{\log p(\mathbf{t}, \mathbf{\Omega}|\mathbf{\Xi})\})$, where $Q(\mathbf{\Omega}_{-i})$ is the factorable ensemble with the $i$th component removed. These standard steps describe the adoption of the mean-field or variational approximation theory for our specific model.

⁵ The mean composite kernel is considered as an example. Modifications for the binary and product composite kernels are trivial and the respective details are given in the appendices.

The resulting approximate posterior distributions are given below, with full details of the derivations in Appendix B. First, the approximate posterior over the auxiliary variables is given by

$$Q(\mathbf{Y}) \propto \prod_{n=1}^{N}\delta(y_{in} > y_{kn}\ \forall k \neq i)\,\delta(t_n = i)\,\mathcal{N}_{\mathbf{y}_n}(\widetilde{\mathbf{W}}\widetilde{\mathbf{k}}_n^{\beta\Theta}, \mathbf{I}), \qquad (6)$$

which is a product of $N$ $C$-dimensional conically truncated Gaussians, demonstrating independence across samples as expected from our initial i.i.d. assumption. The shorthand tilde notation denotes posterior expectations in the usual manner, i.e. $\widetilde{f(\boldsymbol{\beta})} = \mathbb{E}_{Q(\boldsymbol{\beta})}\{f(\boldsymbol{\beta})\}$, and the posterior expectations (details in Appendix C) for the auxiliary variables follow as

$$\widetilde{y}_{cn} = \widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - \frac{\mathbb{E}_{p(u)}\{\mathcal{N}_u(\widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta}, 1)\,\Phi_u^{n,i,c}\}}{\mathbb{E}_{p(u)}\{\Phi(u + \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta})\,\Phi_u^{n,i,c}\}}, \qquad (7)$$

$$\widetilde{y}_{in} = \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \sum_{c \neq i}\left(\widetilde{y}_{cn} - \widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta}\right), \qquad (8)$$

where $\Phi$ is the standardized cumulative distribution function (CDF) and $\Phi_u^{n,i,c} = \prod_{j \neq i,c}\Phi(u + \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_j\widetilde{\mathbf{k}}_n^{\beta\Theta})$. Next, the approximate posterior for the regressors can be expressed as

$$Q(\mathbf{W}) \propto \prod_{c=1}^{C}\mathcal{N}_{\mathbf{w}_c}(\widetilde{\mathbf{y}}_c\widetilde{\mathbf{K}}^{\beta\Theta}\mathbf{V}_c, \mathbf{V}_c), \qquad (9)$$

where the covariance is defined as

$$\mathbf{V}_c = \left(\sum_{i=1}^{S}\sum_{j=1}^{S}\widetilde{\beta_i\beta_j}\,\mathbf{K}^{i\widetilde{\theta}_i}\mathbf{K}^{j\widetilde{\theta}_j} + (\widetilde{\mathbf{Z}}_c)^{-1}\right)^{-1} \qquad (10)$$

and $\widetilde{\mathbf{Z}}_c$ is a diagonal matrix of the expected variances $\widetilde{\zeta}_{c1}, \ldots, \widetilde{\zeta}_{cN}$ for each class. The associated posterior mean for the regressors is therefore $\widetilde{\mathbf{w}}_c = \widetilde{\mathbf{y}}_c\widetilde{\mathbf{K}}^{\beta\Theta}\mathbf{V}_c$, and we can see the coupling between the auxiliary variable and regressor posterior expectations.

The approximate posterior for the variances $\mathbf{Z}$ is an updated product of inverse-gamma distributions (gamma on the scales), with posterior expectation

$$\widetilde{\zeta}_{cn}^{-1} = \frac{\tau + \frac{1}{2}}{\upsilon + \frac{1}{2}\widetilde{w}_{cn}^2}; \qquad (11)$$

for details see Appendix B.3 or Denison et al. [11]. Finally, the approximate posteriors for the kernel parameters $Q(\Theta)$, the combinatorial weights $Q(\boldsymbol{\beta})$ and the associated hyper-prior parameters $Q(\boldsymbol{\rho})$, or $Q(\boldsymbol{\pi}), Q(\boldsymbol{\chi})$ in the product composite kernel case, can be obtained by importance sampling [2] in a similar manner to Girolami and Rogers [15], since no tractable analytical solution can be offered. Details are provided in Appendix B.

Having described the approximate posterior distributions of the parameters and hence obtained the posterior expectations, we turn back to our original task of making class predictions $\mathbf{t}^*$ for $N_{\mathrm{test}}$ new objects $\mathbf{X}^*$ that are represented by $S$ different information sources $\mathbf{X}^{s*}$, embedded into Hilbert spaces as base kernels $\mathbf{K}^{*s\theta_s,\beta_s}$ and combined into a composite test kernel $\mathbf{K}^{*\Theta,\beta}$. The predictive distribution for a single new object $\mathbf{x}^*$ is given by $p(t^* = c|\mathbf{x}^*, \mathbf{X}, \mathbf{t}) = \int p(t^* = c|\mathbf{y}^*)p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{X}, \mathbf{t})\,d\mathbf{y}^* = \int\delta_c^*\,p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{X}, \mathbf{t})\,d\mathbf{y}^*$, which results (see Appendix D for the complete derivation) in

$$p(t^* = c|\mathbf{x}^*, \mathbf{X}, \mathbf{t}) = \mathbb{E}_{p(u)}\left\{\prod_{j \neq c}\Phi\!\left(\frac{1}{\widetilde{\nu}_j^*}\left(u\,\widetilde{\nu}_c^* + \widetilde{m}_c^* - \widetilde{m}_j^*\right)\right)\right\}, \qquad (12)$$

where, for the general case of $N_{\mathrm{test}}$ objects, $\widetilde{\mathbf{m}}_c^* = \widetilde{\mathbf{y}}_c\mathbf{K}(\mathbf{K}^*\mathbf{K}^{*\mathsf{T}} + \mathbf{V}_c^{-1})^{-1}\mathbf{K}^*\widetilde{\mathbf{V}}_c^*$ and $\widetilde{\mathbf{V}}_c^* = (\mathbf{I} + \mathbf{K}^{*\mathsf{T}}\mathbf{V}_c\mathbf{K}^*)$, while we have dropped the notation for the dependence of the train $\mathbf{K}$ ($N \times N$) and test $\mathbf{K}^*$ ($N \times N_{\mathrm{test}}$) kernels on $\Theta, \boldsymbol{\beta}$ for clarity. In Algorithm 1 we summarize the VB approximation in pseudo-algorithmic fashion.
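As a small worked illustration of the coupled updates of Eqs. (9)–(11), the sketch below (ours, not the paper's code) performs one pass of the regressor and scale updates while treating the composite kernel as fixed, i.e. approximating the double sum in Eq. (10) by the already-combined Gram matrix.

import numpy as np

def vb_regressor_update(K, Y_tilde, inv_zeta_tilde):
    # One pass of Eqs. (9)-(10): V_c and the posterior mean of w_c given the expected Y.
    # K: (N, N) composite kernel, Y_tilde: (C, N) expected auxiliary variables,
    # inv_zeta_tilde: (C, N) expected inverse prior variances from Eq. (11).
    C, N = Y_tilde.shape
    KK = K @ K.T
    W_tilde, covariances = np.zeros((C, N)), []
    for c in range(C):
        V_c = np.linalg.inv(KK + np.diag(inv_zeta_tilde[c]))   # Eq. (10), composite-kernel form
        W_tilde[c] = Y_tilde[c] @ K @ V_c                       # posterior mean of Eq. (9)
        covariances.append(V_c)
    return W_tilde, covariances

def vb_scale_update(W_tilde, tau=1e-3, upsilon=1e-3):
    # Eq. (11): expected inverse scales given the current regressor means.
    return (tau + 0.5) / (upsilon + 0.5 * W_tilde**2)

Iterating these two updates together with the expectation step of Eqs. (7)–(8) reproduces the inner loop of Algorithm 1 below.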
Algorithm 1. VB multinomial probit composite kernel regression

1: Initialize $\mathbf{\Xi}$, sample $\mathbf{W}$, create $\mathbf{K}^s|\beta_s, \theta_s$ and hence $\mathbf{K}|\boldsymbol{\beta}, \Theta$
2: while Lower Bound changing do
3:   $\widetilde{\mathbf{w}}_c \leftarrow \widetilde{\mathbf{y}}_c\mathbf{K}\mathbf{V}_c$
4:   $\widetilde{y}_{cn} \leftarrow \widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - \dfrac{\mathbb{E}_{p(u)}\{\mathcal{N}_u(\widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta}, 1)\Phi_u^{n,i,c}\}}{\mathbb{E}_{p(u)}\{\Phi(u + \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta})\Phi_u^{n,i,c}\}}$
5:   $\widetilde{y}_{in} \leftarrow \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \sum_{j \neq i}(\widetilde{y}_{jn} - \widetilde{\mathbf{w}}_j\widetilde{\mathbf{k}}_n^{\beta\Theta})$
6:   $\widetilde{\zeta}_{cn}^{-1} \leftarrow \dfrac{\tau + \frac{1}{2}}{\upsilon + \frac{1}{2}\widetilde{w}_{cn}^2}$
7:   $\widetilde{\boldsymbol{\rho}}, \widetilde{\boldsymbol{\beta}}, \widetilde{\Theta} \leftarrow \boldsymbol{\rho}, \boldsymbol{\beta}, \Theta\,|\,\widetilde{\mathbf{w}}_c, \widetilde{\mathbf{y}}_n$ by importance sampling
8:   Update $\mathbf{K}|\widetilde{\boldsymbol{\beta}}, \widetilde{\Theta}$ and $\mathbf{V}_c$
9: end while
10: Create composite test kernel $\mathbf{K}^*|\widetilde{\boldsymbol{\beta}}, \widetilde{\Theta}$
11: $\widetilde{\mathbf{V}}_c^* \leftarrow (\mathbf{I} + \mathbf{K}^{*\mathsf{T}}\mathbf{V}_c\mathbf{K}^*)$
12: $\widetilde{\mathbf{m}}_c^* \leftarrow \widetilde{\mathbf{y}}_c\mathbf{K}(\mathbf{K}^*\mathbf{K}^{*\mathsf{T}} + \mathbf{V}_c^{-1})^{-1}\mathbf{K}^*\widetilde{\mathbf{V}}_c^*$
13: for $n = 1$ to $N_{\mathrm{test}}$ do
14:   for $c = 1$ to $C$ do
15:     for $i = 1$ to $K$ samples do
16:       $u_i \leftarrow \mathcal{N}(0, 1)$, $\quad p_{cn}^i \leftarrow \prod_{j \neq c}\Phi\!\left(\frac{1}{\widetilde{\nu}_j^*}(u_i\widetilde{\nu}_c^* + \widetilde{m}_c^* - \widetilde{m}_j^*)\right)$
17:     end for
18:   end for
19:   $P(t_n^* = c|\mathbf{x}_n^*, \mathbf{X}, \mathbf{t}) = \frac{1}{K}\sum_{i=1}^{K}p_{cn}^i$
20: end for
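The Monte Carlo prediction step in lines 13–20 of Algorithm 1 can be written compactly as below; this is our own sketch and assumes the predictive means $\widetilde{m}_c^*$ and scales $\widetilde{\nu}_c^*$ of Eq. (12) have already been computed for one test object.

import numpy as np
from scipy.stats import norm

def predictive_probabilities(m_star, v_star, n_samples=1000, rng=None):
    # Monte Carlo estimate of Eq. (12): m_star (C,) predictive means, v_star (C,) predictive scales.
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(n_samples)
    C = len(m_star)
    probs = np.empty(C)
    for c in range(C):
        others = [j for j in range(C) if j != c]
        # Phi((u * nu*_c + m*_c - m*_j) / nu*_j) for every j != c, averaged over u
        z = (u[:, None] * v_star[c] + m_star[c] - m_star[others]) / v_star[others]
        probs[c] = np.mean(np.prod(norm.cdf(z), axis=1))
    return probs / probs.sum()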
4. Experimental studies

This section presents the experimental findings of our work, with emphasis first on the performance of the VB approximation with respect to the full MCMC Gibbs sampling solution, and next on the efficiency of the proposed feature space combination method versus the well-known classifier combination strategies. We employ two artificial low-dimensional datasets, a linearly separable one and a non-linearly separable one introduced by Neal [27], for the comparison between the approximation and the full solution; standard UCI⁶ multinomial datasets for an assessment of the VB performance; and finally, two well-known large benchmark datasets to demonstrate the utility of the proposed feature space combination against ensemble learning and to assess the performance of the different kernel combination methods.

⁶ A. Asuncion, D.J. Newman, 2007. UCI Machine Learning Repository, Department of Information and Computer Science, University of California, Irvine, CA [http://www.ics.uci.edu/~mlearn/MLRepository.html].

Unless otherwise specified, all the hyper-parameters were set to uninformative values and the Gaussian kernel parameters were fixed to 1/D, where D is the dimensionality of the vectors. The convergence of the VB approximation was determined by monitoring the lower bound; convergence was declared when there was less than a 0.1% increase in the bound or when the maximum number of VB iterations was reached. The burn-in period for the Gibbs sampler was set to 10% of the total 100,000 samples. Finally, all the CPU times reported in this study are for a 1.6 GHz Intel based PC with 2 GB RAM running unoptimized Matlab code, and all the p-values reported are for a two-sample t-test with the null hypothesis that the distributions have a common mean.

4.1. Synthetic datasets

In order to illustrate the performance of the VB approximation against the full Gibbs sampling solution, we employ two low-dimensional datasets which enable us to visualize the decision boundaries and posterior distributions produced by either method.

First we consider a linearly separable case in which we construct the dataset by fixing our regressors $\mathbf{W} \in \mathbb{R}^{C \times D}$, with C = 3 and D = 3, to known values and sampling two-dimensional covariates $\mathbf{X}$ plus a constant term. In that way, by knowing the true values of our regressors, we can examine the accuracy of both the Gibbs posterior distribution and the approximate posterior estimate of the VB. In Fig. 3 the dataset, together with the optimal decision boundaries constructed from the known regressor values, can be seen.

[Fig. 3. Linearly separable dataset with known regressors defining the decision boundaries. Cn denotes the members of class n and Dec_ij is the decision boundary between classes i and j.]

[Fig. 4. Gibbs posterior distribution of a decision boundary's (Dec_12) slope and intercept for a chain of 100,000 samples. The cross shows the original decision boundary employed to sample the dataset.]

In Figs. 4 and 5 we present the posterior distributions of one decision boundary's (Dec_12) slope and intercept, based on both the obtained Gibbs samples and the approximate posterior of the regressors $\mathbf{W}$. As we can see, the variational approximation is in agreement with the mass of the Gibbs posterior and it successfully captures the pre-determined regressor values.

However, as can be observed, the approximation is over-confident in the prediction and produces a smaller covariance for the posterior distribution, as expected [10]. Furthermore, the
probability mass is concentrated in a very small area due to the very nature of the VB approximation and of similar mean field methods that make extreme "judgments", as they do not explore the posterior space by Markov chains:

$$\mathbf{C}_{\mathrm{Gibbs}} = \begin{pmatrix} 0.16 & 0.18 \\ 0.18 & 0.22 \end{pmatrix}, \qquad \mathbf{C}_{\mathrm{VB}} = \begin{pmatrix} 0.015 & 0.015 \\ 0.015 & 0.018 \end{pmatrix}. \qquad (13)$$

[Fig. 5. The variational approximate posterior distribution for the same case as above, employing 100,000 samples from the approximate posterior of the regressors W in order to estimate the approximate posterior of the slope and intercept.]

The second synthetic dataset we employ is a four-dimensional three-class dataset $\{\mathbf{X}, \mathbf{t}\}$ with N = 400, first described by Neal [27], which defines the first class as points in an ellipse ($2 > x_1^2 + x_2^2 > \cdot$), the second class as points below a line ($2x_1 + \cdot\,x_2 < 3$) and the third class as points surrounding these areas, see Fig. 6.

[Fig. 6. Decision boundaries from the Gibbs sampling solution.]

We approach the problem by (1) introducing a second order polynomial expansion on the original dataset, $\Phi(\mathbf{x}_n) = [1\ x_{n1}\ x_{n2}\ x_{n1}^2\ x_{n1}x_{n2}\ x_{n2}^2]$, while disregarding the uninformative dimensions $x_3, x_4$; (2) modifying the dimensionality of our regressors $\mathbf{W}$ to $C \times D$ and analogously the corresponding covariance; and (3) substituting $\Phi(\mathbf{X})$ for $\mathbf{K}$ in our derived methodology. Due to our expansion we now have a 2-D decision plane that we can plot and a six-dimensional regressor $\mathbf{w}$ per class. In Fig. 6 we plot the decision boundaries produced from the full Gibbs solution by averaging over the posterior parameters after 100,000 samples, and in Fig. 7 the corresponding decision boundaries from the VB approximation after 100 iterations.

[Fig. 7. Decision boundaries from the VB approximation.]

As can be seen, both the VB approximation and the MCMC solution produce similar decision boundaries leading to good classification performance—2% error for both Gibbs and VB for the above decision boundaries—depending on training size. However, the Gibbs sampler produces tighter boundaries due to the Markov chain exploring the parameter posterior space more efficiently than the VB approximation.

The corresponding CPU times are given in Table 1.

Table 1
CPU time (s) comparison for 100,000 Gibbs samples versus 100 VB iterations.

Gibbs      VB
41,720 (s) 120.3 (s)

Note that the number of VB iterations needed for the lower bound to converge is typically less than 100.

4.2. Multinomial UCI datasets

To further assess the VB approximation of the proposed multinomial probit classifier, we explore a selection of UCI multinomial datasets. The performances are compared against reported results [26] from well-known methods in the pattern recognition and machine learning literature. We employ an RBF (VB RBF), a second order polynomial (VB P) and a linear kernel (VB L) with the VB approximation and report 10-fold cross-validated (CV) error percentages, in Table 2, and CPU times, in Table 3, for a maximum of 50 VB iterations unless convergence has already occurred. The comparison with the K-nn and PK-nn is for standard implementations of these methods; see Manocha and Girolami [26] for details.

As we can see, the VB approximation to the multinomial probit outperforms both the K-nn and PK-nn in most cases, although without statistically significant improvements, as the variance of the 10-fold CV errors is quite large in most cases.

Table 2
Ten-fold cross-validated error percentages (mean ± std.) on standard UCI multinomial datasets.

Dataset   VB RBF        VB L          VB P         K-nn          PK-nn
Balance   8.8 ± 3.6     12.2 ± 4.2    7.0 ± 3.3    11.5 ± 3.0    10.2 ± 3.0
Crabs     23.5 ± 11.3   13.5 ± 8.2    21.5 ± 9.1   15.0 ± 8.8    19.5 ± 6.8
Glass     27.9 ± 10.1   35.8 ± 11.8   28.4 ± 8.9   29.9 ± 9.2    26.7 ± 8.8
Iris      2.7 ± 5.6     11.3 ± 9.9    4.7 ± 6.3    5.3 ± 5.2     4.0 ± 5.6
Soybean   6.5 ± 10.5    6.0 ± 9.7     4.0 ± 8.4    14.5 ± 16.7   4.5 ± 9.6
Vehicle   25.6 ± 4.0    29.6 ± 3.3    26.0 ± 6.1   36.3 ± 5.2    37.2 ± 4.5
Wine      4.5 ± 5.1     2.8 ± 4.7     1.1 ± 2.3    3.9 ± 3.8     3.4 ± 2.9

The bold letters indicate top mean accuracy.

Table 3
Running times (seconds) for computing 10-fold cross-validation results.

Dataset          Balance   Crabs   Glass   Iris   Soybean   Vehicle   Wine
VB CPU time (s)  2285      270     380     89     19        3420      105

4.3. Benchmark datasets

4.3.1. Handwritten numerals classification
In this section we report results demonstrating the efficiency of our kernel combination approach when multiple sources of
information are available. Comparisons are made against combination of classifiers and also between the different ways of combining the kernels described previously. In order to assess the above, we make use of two datasets that have multiple feature spaces describing the objects to be classified. The first one is a large N = 2000, multinomial C = 10 and "multi-featured" S = 4 UCI dataset, named "Multiple Features". The objects are handwritten numerals, from 0 to 9, and the available features are the Fourier descriptors (FR), the Karhunen–Loève features (KL), the pixel averages (PX) and the Zernike moments (ZM). Tax et al. [34] have previously reported results on this problem by combining classifiers, but employed a different test set which is not publicly available. Furthermore, we allow the rotation invariance property of the ZM features to cause problems in the distinction between digits 6 and 9. The hope is that the remaining feature spaces can compensate for the discrimination.

In Tables 4–7 we report experimental results over 50 repeated trials where we have randomly selected 20 training and 20 testing objects from each class. For each trial we employ (1) a single classifier on each feature space, and (2) the proposed classifier on the composite feature space. This allows us to examine the performance of combinations of classifiers versus combination of feature spaces. It is worth noting that a concatenation of feature spaces for this specific problem has been found to perform as well as ensemble learning methods [8]. We report results from both the full MCMC solution and the VB approximation in order to analyse the possible degradation of the performance of the approximation. In all cases we employ the multinomial probit kernel machine using a Gaussian kernel with fixed parameters.

Table 4
Classification percentage error (mean ± std.) of individual classifiers trained on each feature space.

                                     FR           KL           PX         ZM
Full MCMC Gibbs sampling—single FS   27.3 ± 3.3   11.0 ± 2.3   7.3 ± 2    25.2 ± 3

Table 5
Combinations of the individual classifiers based on four widely used rules: Product (Prod), Mean (Sum), Maximum (Max) and Majority voting (Maj).

                                           Prod        Sum       Max         Maj
Full MCMC Gibbs sampling—Comb. classifiers 5.1 ± 1.7   5.3 ± 2   8.4 ± 2.3   8.45 ± 2.2

Table 6
Combination of feature spaces. Gibbs sampling results for five feature space combination methods: Binary (Bin), mean with fixed weights (FixSum), mean with inferred weights (WSum), product with fixed weights (FixProd) and product with inferred weights (WProd).

                                   Bin       FixSum    WSum        FixProd     WProd
Full MCMC Gibbs sampling—Comb. FS  5.7 ± 2   5.5 ± 2   5.8 ± 2.1   5.2 ± 1.8   5.9 ± 1.2

Table 7
Combination of feature spaces, VB approximation.

                        Bin          FixSum       WSum        FixProd      WProd
VB approx.—Comb. FS     5.53 ± 1.7   4.85 ± 1.5   6.1 ± 1.6   5.35 ± 1.4   6.43 ± 1.8

As we can see from Tables 4 and 5, the two best performing ensemble learning methods outperform all of the individual classifiers trained on separate feature spaces. With a p-value of 1.5e−07 between the Pixel classifier and the Product rule, this is a statistically significant difference. At the same time, all the kernel combination methods in Table 6 match the best performing classifier combination approaches, with a p-value of 0.91 between the Product classifier combination rule and the product kernel combination method. Finally, from Table 7 it is clear that the VB approximation performs very well compared with the full Gibbs solution; even when combinatorial weights are inferred, which expands the parameter space, there is no statistical difference between the MCMC and the VB solution (p-value of 0.47).

Furthermore, the variants of our method that employ combinatorial weights offer the advantage of inferring the significance of the contributing sources of information. In Fig. 8 we can see that the pixel (PX) and Zernike moments (ZM) feature spaces receive large weights and hence contribute significantly to the composite feature space. This is in accordance with our expectations, as the pixel feature space seems to be the best performing individual classifier and complementary predictive power, mainly from the ZM channel, is improving on that.

[Fig. 8. Typical combinatorial weights from Mean (left) and Product (right) composite kernel.]

The results indicate that there is no benefit in classification error performance when weighted combinations of feature spaces are employed. This is in agreement with past work by Lewis et al. [23] and Girolami and Zhong [16]. The clear benefit, however, remains the ability to infer the relative significance of the sources and hence gain a better understanding of the problem.

4.3.2. Protein fold classification
The second multi-feature dataset we examine was first described by Ding and Dubchak [13] and is a protein fold classification problem with C = 27 classes and S = 6 available feature spaces (approx.
imately 20-D each). The dataset is available online⁷ and is divided into a training set of size N = 313 and an independent test set with N_test = 385. We use the original dataset and do not consider the feature modifications (extra features, modification of existing ones and omission of four objects) suggested recently in the work by Shen and Chou [32] in order to increase prediction accuracy. In Table 8 we report the performances of the individual classifiers trained on each feature space, in Table 9 the classifier combination performances and in Table 10 the kernel combination approaches. Furthermore, in Table 11 we give the corresponding CPU time requirements of the classifier and kernel combination methods for comparison. It is worth noting that a possible concatenation of all features into a single representation has been examined by Ding and Dubchak [13] and was found not to perform competitively.

⁷ http://crd.lbl.gov/~cding/protein/

The experiments are repeated over five randomly initialized trials for a maximum of 100 VB iterations, monitoring convergence with a 0.1% increase threshold on the lower bound progression. We employ second order polynomial kernels as they were found to give the best performances across all methods.

Table 8
Classification error percentage on the individual feature spaces: (1) amino acid composition (Comp); (2) hydrophobicity profile (Hydro); (3) polarity (Pol); (4) polarizability (Polz); (5) secondary structure (Str); and (6) van der Waals volume profile (Vol).

                             Comp         Hydro        Pol          Polz         Str           Vol
VB approximation—Single FS   49.8 ± 0.6   63.4 ± 0.6   65.2 ± 0.4   66.3 ± 0.4   60.0 ± 0.25   65.6 ± 0.3

The bold letters indicate statistically significant improvements.

Table 9
Classification error percentages by combining the predictive distributions of the individual classifiers.

                                     Prod         Sum          Max          Maj
VB approximation—Comb. classifiers   49.8 ± 0.6   47.6 ± 0.4   53.7 ± 1.4   54.1 ± 0.5

The bold letters indicate statistically significant improvements.

Table 10
Classification error percentages by feature space combination.

                           Bin          FixSum       WSum         FixProd       WProd
VB approximation—Comb. FS  40.7 ± 1.2   40.1 ± 0.3   44.4 ± 0.6   43.8 ± 0.14   43.2 ± 1.1

The bold letters indicate statistically significant improvements.

Table 11
Typical running times (seconds); a ten-fold reduction in CPU time is obtained by combining feature spaces.

Method        Classifier combination   FS combination
CPU time (s)  11,519                   1256

As can be observed, the best of the classifier combination methods outperform any individual classifier trained on a single feature space, with a p-value between the Sum combination rule and the composition classifier of 2.3e−02. However, all the kernel combination methods perform better than the best classifier combination, with a p-value of 1.5e−08 between the former (FixSum) and the best performing classifier combination method (Sum). The corresponding CPU times show a 10-fold reduction by the proposed feature space combination method relative to the classifier combination approaches.

Furthermore, with a top performance of 38.3% error for a single binary combination run (Bin), we match the state of the art reported on the original dataset by Girolami and Zhong [16] employing Gaussian process priors. The best reported performance by Ding and Dubchak [13] was 56% accuracy (44% error), employing an all-vs-all method with S × C × (C − 1)/2 = 8240 binary SVM classifiers. We offer instead a single multiclass kernel machine able to combine the feature spaces and achieve 60–62% accuracy in reduced computational time. It is worth noting that recent work by the authors [7] has achieved 70% accuracy by considering additional state-of-the-art string kernels.

Finally, in Fig. 9 the combinatorial weights can be seen for the case of the mean and the product⁸ composite kernel. The results are in agreement with past work identifying the significance of the amino acid composition, hydrophobicity profile and secondary structure feature spaces for this classification task.

⁸ Notice that for the product composite kernel the combinatorial weights are not restricted to the simplex.

[Fig. 9. Typical combinatorial weights from Mean (left) and Product (right) composite kernel for the protein fold dataset.]

5. Discussion

A multinomial probit composite kernel classifier, based on a well-founded hierarchical Bayesian framework and able to instructively combine feature spaces, has been presented in this work. The full MCMC MH within Gibbs sampling approach allows for exact inference to be performed, and the proposed variational Bayes approximation reduces computational requirements significantly while retaining the same levels of performance. The proposed methodology enables inference to be performed at three significant levels by
learning the combinatorial weights associated with each feature space, the kernel parameters associated with each dimension of the feature spaces and, finally, the parameters/regressors associated with the elements of the composite feature space.

An explicit comparison between the Gibbs sampling and the VB approximation has been presented. The approximate solution was found not to worsen the classification performance significantly and the resulting decision boundaries were similar to the MCMC approach, though the latter produces tighter descriptions of the classes. The VB posteriors over the parameters were observed to be narrow, as expected, and all the probability mass was concentrated in a small area, which nonetheless lay within the posterior distributions produced by the Gibbs solution.

The proposed feature space combination approach was found to be as good as, or to outperform, the best classifier combination rule examined, achieving the state of the art while at the same time offering a 10-fold reduction in computational time and better scaling when multiple spaces are available. At the same time, contrary to the classifier combination approach, inference can be performed on the combinatorial weights of the composite feature space, enabling a better understanding of the significance of the sources. In comparison with previously employed SVM combinations for the multiclass problem of protein fold prediction, our method offers a significant improvement in performance while employing a single classifier instead of thousands, reducing the computational time from approximately 12 to 0.3 CPU hours.

Finally, with respect to the two main problems considered, we can conclude that there is greater improvement on zero–one loss when diverse sources of information are used. This is a well-known observation in ensemble learning methodology and kernel combination methods follow the same phenomenon. Theoretical justification for this can easily be provided via a simple bias-variance–covariance decomposition of the loss, which follows the same analysis as the one used for an ensemble of regressors. We are currently investigating a theoretical justification for another well-known phenomenon in multiple kernel learning, that of zero–one loss equivalence between the average and the weighted kernel combination rules on specific problems. Recently we have offered in Damoulas et al. [9] a further deterministic approximation based on an EM update scheme which induces sparsity and in fact generalizes the relevance vector machine to the multi-class and multi-kernel setting.

Acknowledgments

This work is sponsored by NCR Financial Solutions Group Ltd. The authors would particularly like to acknowledge the help and support from Dr. Gary Ross and Dr. Chao He of NCR Labs. The second author is supported by an EPSRC Advanced Research Fellowship (EP/E052029/1).

Appendix A. Gibbs sampler

The conditional distribution for the auxiliary variables $\mathbf{Y}$ is a product of $N$ $C$-dimensional conically truncated Gaussians given by

$$\prod_{n=1}^{N}\delta(y_{in} > y_{jn}\ \forall j \neq i)\,\delta(t_n = i)\,\mathcal{N}_{\mathbf{y}_n}(\mathbf{W}\mathbf{k}_n^{\beta\Theta}, \mathbf{I}),$$

and for the regressors $\mathbf{W}$ it is a product of Gaussian distributions $\prod_{c=1}^{C}\mathcal{N}_{\mathbf{w}_c}(\mathbf{m}_c, \mathbf{V}_c)$, where

$$\mathbf{m}_c = \mathbf{y}_c\mathbf{K}^{\beta\Theta}\mathbf{V}_c \ \text{(row vector)} \quad\text{and}\quad \mathbf{V}_c = (\mathbf{K}^{\beta\Theta}\mathbf{K}^{\beta\Theta} + \mathbf{Z}_c^{-1})^{-1};$$

hence iteratively sampling from these distributions will lead the Markov chain to sample from the desired posterior distribution. The typical MH subsamplers employed in the case of the mean composite kernel model have acceptance ratios

$$A(\boldsymbol{\beta}^i, \boldsymbol{\beta}^*) = \min\left(1, \frac{\prod_{c=1}^{C}\prod_{n=1}^{N}\mathcal{N}_{y_{cn}}(\mathbf{w}_c\mathbf{k}_n^{\beta^*\Theta}, 1)\prod_{s=1}^{S}(\beta_s^*)^{\rho_s - 1}}{\prod_{c=1}^{C}\prod_{n=1}^{N}\mathcal{N}_{y_{cn}}(\mathbf{w}_c\mathbf{k}_n^{\beta^i\Theta}, 1)\prod_{s=1}^{S}(\beta_s^i)^{\rho_s - 1}}\right),$$

$$A(\boldsymbol{\rho}^i, \boldsymbol{\rho}^*) = \min\left(1, \frac{\Psi(\boldsymbol{\rho}^*)\prod_{s=1}^{S}\beta_s^{\rho_s^* - 1}\prod_{s=1}^{S}(\rho_s^*)^{\mu - 1}e^{-\lambda\rho_s^*}}{\Psi(\boldsymbol{\rho}^i)\prod_{s=1}^{S}\beta_s^{\rho_s^i - 1}\prod_{s=1}^{S}(\rho_s^i)^{\mu - 1}e^{-\lambda\rho_s^i}}\right),$$

where $\Psi(\boldsymbol{\rho}) = \Gamma\!\left(\sum_{s=1}^{S}\rho_s\right)\Big/\prod_{s=1}^{S}\Gamma(\rho_s)$, the proposed move is denoted by $*$ and the current state by $i$.

For the binary composite kernel, an extra Gibbs step is introduced in our model as $p(\beta_i = 0|\boldsymbol{\beta}_{-i}, \mathbf{Y}, \mathbf{K}^{s\theta_s}\ \forall s \in \{1, \ldots, S\})$, which depends on the marginal likelihood given by $p(\mathbf{Y}|\boldsymbol{\beta}, \mathbf{K}^{s\theta_s}\ \forall s \in \{1, \ldots, S\}) = \prod_{c=1}^{C}(2\pi)^{-N/2}|\boldsymbol{\Sigma}_c|^{-1/2}\exp\{-\frac{1}{2}\mathbf{y}_c\boldsymbol{\Sigma}_c^{-1}\mathbf{y}_c^{\mathsf{T}}\}$ with $\boldsymbol{\Sigma}_c = \mathbf{I} + \mathbf{K}^{\beta\Theta}\mathbf{Z}_c^{-1}\mathbf{K}^{\beta\Theta}$.

The product composite kernel employs two MH subsamplers to sample $\boldsymbol{\beta}$ (from a gamma distribution this time) and the hyper-parameters $\boldsymbol{\pi}, \boldsymbol{\chi}$. The kernel parameters $\Theta$ are inferred in all cases via an extra MH subsampler with acceptance ratio

$$A(\Theta^i, \Theta^*) = \min\left(1, \frac{\prod_{s=1}^{S}\prod_{d=1}^{D}(\theta_{sd}^*)^{\omega - 1}e^{-\phi\theta_{sd}^*}\prod_{c=1}^{C}\prod_{n=1}^{N}\mathcal{N}_{y_{cn}}(\mathbf{w}_c\mathbf{k}_n^{\beta\Theta^*}, 1)}{\prod_{s=1}^{S}\prod_{d=1}^{D}(\theta_{sd}^i)^{\omega - 1}e^{-\phi\theta_{sd}^i}\prod_{c=1}^{C}\prod_{n=1}^{N}\mathcal{N}_{y_{cn}}(\mathbf{w}_c\mathbf{k}_n^{\beta\Theta^i}, 1)}\right).$$

Finally, the Monte Carlo estimate of the predictive distribution is used to assign the class probabilities according to a number of samples $L$ drawn from the predictive distribution which, considering Eq. (4) and substituting the test composite kernel $\mathbf{k}_n^{*\beta\Theta}$, defines the estimate of the predictive distribution as

$$P(t^* = c|\mathbf{x}^*) = \frac{1}{L}\sum_{l=1}^{L}\mathbb{E}_{p(u)}\left\{\prod_{j \neq c}\Phi\!\left(u + (\mathbf{w}_c^l - \mathbf{w}_j^l)\mathbf{k}^{*\beta^l\Theta^l}\right)\right\}.$$
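For concreteness, a minimal sketch of the two conjugate Gibbs steps above (sampling the truncated auxiliary variables and then the regressors) is given below. It is our own illustration, uses a naive rejection step for the conical truncation, and omits the MH subsamplers for $\boldsymbol{\beta}$, $\Theta$ and $\boldsymbol{\rho}$.

import numpy as np

def gibbs_step(K, t, W, Z_inv, rng):
    # One Gibbs sweep: Y | W (conically truncated Gaussian), then W | Y (Gaussian).
    # K: (N, N) composite kernel, t: (N,) labels in 0..C-1,
    # W: (C, N) current regressors, Z_inv: (C, N) inverse prior variances.
    C, N = W.shape
    M = W @ K                                   # means w_c k_n for every class and object
    Y = np.empty((C, N))
    for n in range(N):
        while True:                             # rejection: resample until the labelled class is largest
            y = M[:, n] + rng.standard_normal(C)
            if np.argmax(y) == t[n]:
                Y[:, n] = y
                break
    KK = K @ K.T
    for c in range(C):                          # Gaussian conditional for each row of W
        V_c = np.linalg.inv(KK + np.diag(Z_inv[c]))
        W[c] = rng.multivariate_normal(Y[c] @ K @ V_c, V_c)
    return Y, W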
Appendix B. Approximate posterior distributions

We give here the full derivations of the approximate posteriors for the model under the VB approximation. We employ a first-order approximation for the kernel parameters $\Theta$, i.e. $\mathbb{E}_{Q(\theta)}\{\mathbf{K}^{i\theta_i}\mathbf{K}^{j\theta_j}\} \approx \mathbf{K}^{i\widetilde{\theta}_i}\mathbf{K}^{j\widetilde{\theta}_j}$, to avoid nonlinear contributions to the expectation. The same approximation is applied in the product composite kernel case for the combinatorial parameters $\boldsymbol{\beta}$, where $\mathbb{E}_{Q(\beta)}\{\mathbf{K}^{i\beta_i}\mathbf{K}^{j\beta_j}\} \approx \mathbf{K}^{i\widetilde{\beta}_i}\mathbf{K}^{j\widetilde{\beta}_j}$.

B.1. Q(Y)

$$Q(\mathbf{Y}) \propto \exp\{\mathbb{E}_{Q(\mathbf{W})Q(\boldsymbol{\beta})Q(\Theta)}\{\log p(\mathbf{t}|\mathbf{Y}) + \log p(\mathbf{Y}|\mathbf{W}, \boldsymbol{\beta}, \Theta)\}\} \propto \prod_{n=1}^{N}\delta(y_{in} > y_{kn}\ \forall k \neq i)\,\delta(t_n = i)\,\exp\{\mathbb{E}_{Q(\mathbf{W})Q(\boldsymbol{\beta})Q(\Theta)}\log p(\mathbf{Y}|\mathbf{W}, \boldsymbol{\beta}, \Theta)\},$$

where the exponential term can be analysed as

$$\exp\left\{\mathbb{E}\left\{\log\prod_{n=1}^{N}\mathcal{N}_{\mathbf{y}_n}(\mathbf{W}\mathbf{k}_n^{\beta\Theta}, \mathbf{I})\right\}\right\} = \exp\left\{-\frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{y}_n^{\mathsf{T}}\mathbf{y}_n - 2\mathbf{y}_n^{\mathsf{T}}\widetilde{\mathbf{W}}\widetilde{\mathbf{k}}_n^{\beta\Theta} + \sum_{i=1}^{S}\sum_{j=1}^{S}\widetilde{\beta_i\beta_j}\,(\mathbf{k}_n^{i\widetilde{\theta}_i})^{\mathsf{T}}\widetilde{\mathbf{W}^{\mathsf{T}}\mathbf{W}}\,\mathbf{k}_n^{j\widetilde{\theta}_j}\right) + \text{const}\right\},$$

where $\mathbf{k}_n^{i\widetilde{\theta}_i}$ is the $n$th $N$-dimensional column vector of the $i$th base kernel with kernel parameters $\widetilde{\theta}_i$. From this exponential term we can form the posterior distribution as a Gaussian and reach the final expression

$$Q(\mathbf{Y}) \propto \prod_{n=1}^{N}\delta(y_{in} > y_{kn}\ \forall k \neq i)\,\delta(t_n = i)\,\mathcal{N}_{\mathbf{y}_n}(\widetilde{\mathbf{W}}\widetilde{\mathbf{k}}_n^{\beta\Theta}, \mathbf{I}), \qquad \text{(B.1)}$$

which is a $C$-dimensional conically truncated Gaussian.

B.2. Q(W)

$$Q(\mathbf{W}) \propto \exp\{\mathbb{E}_{Q(\mathbf{Y})Q(\boldsymbol{\beta})Q(\mathbf{Z})Q(\Theta)}\{\log p(\mathbf{Y}|\mathbf{W}, \boldsymbol{\beta}, \Theta) + \log p(\mathbf{W}|\mathbf{Z})\}\}$$
$$= \exp\left\{-\frac{1}{2}\sum_{c=1}^{C}\left(\widetilde{\mathbf{y}_c\mathbf{y}_c^{\mathsf{T}}} - 2\widetilde{\mathbf{y}}_c\widetilde{\mathbf{K}}^{\beta\Theta}\mathbf{w}_c^{\mathsf{T}} + \mathbf{w}_c\Big(\sum_{i=1}^{S}\sum_{j=1}^{S}\widetilde{\beta_i\beta_j}\,\mathbf{K}^{i\widetilde{\theta}_i}\mathbf{K}^{j\widetilde{\theta}_j}\Big)\mathbf{w}_c^{\mathsf{T}} + \sum_{n=1}^{N}\log\widetilde{\zeta}_{cn} + \mathbf{w}_c(\widetilde{\mathbf{Z}}_c)^{-1}\mathbf{w}_c^{\mathsf{T}}\right) + \text{const}\right\}.$$

Completing the square in $\mathbf{w}_c$ we can form the posterior as a new Gaussian,

$$Q(\mathbf{W}) \propto \prod_{c=1}^{C}\mathcal{N}_{\mathbf{w}_c}(\widetilde{\mathbf{y}}_c\widetilde{\mathbf{K}}^{\beta\Theta}\mathbf{V}_c, \mathbf{V}_c), \qquad \text{(B.2)}$$

where $\mathbf{V}_c$ is the covariance matrix defined as

$$\mathbf{V}_c = \left(\sum_{i=1}^{S}\sum_{j=1}^{S}\widetilde{\beta_i\beta_j}\,\mathbf{K}^{i\widetilde{\theta}_i}\mathbf{K}^{j\widetilde{\theta}_j} + (\widetilde{\mathbf{Z}}_c)^{-1}\right)^{-1} \qquad \text{(B.3)}$$

and $\widetilde{\mathbf{Z}}_c$ is a diagonal matrix of the expected variances $\widetilde{\zeta}_{c1}, \ldots, \widetilde{\zeta}_{cN}$ for each class.

B.3. Q(Z)

$$Q(\mathbf{Z}) \propto \exp\{\mathbb{E}_{Q(\mathbf{W})}(\log p(\mathbf{W}|\mathbf{Z}) + \log p(\mathbf{Z}|\tau, \upsilon))\} = \exp\{\mathbb{E}_{Q(\mathbf{W})}(\log p(\mathbf{W}|\mathbf{Z}))\}\,p(\mathbf{Z}|\tau, \upsilon).$$

Analysing the exponential term only,

$$\exp\left\{\mathbb{E}_{Q(\mathbf{W})}\log\prod_{c=1}^{C}\prod_{n=1}^{N}\mathcal{N}_{w_{cn}}(0, \zeta_{cn})\right\} = \prod_{c=1}^{C}\prod_{n=1}^{N}\zeta_{cn}^{-1/2}\exp\left\{-\frac{1}{2}\frac{\widetilde{w_{cn}^2}}{\zeta_{cn}}\right\},$$
which, combined with the $p(\mathbf{Z}|\tau, \upsilon)$ prior gamma distribution, leads to our expected posterior distribution:

$$Q(\mathbf{Z}) \propto \prod_{c=1}^{C}\prod_{n=1}^{N}\zeta_{cn}^{-(\tau + 1/2) + 1}\,\frac{\upsilon^{\tau}e^{-\zeta_{cn}^{-1}(\upsilon + \frac{1}{2}\widetilde{w_{cn}^2})}}{\Gamma(\tau)} = \prod_{c=1}^{C}\prod_{n=1}^{N}\mathrm{Gamma}\!\left(\zeta_{cn}^{-1}\,\Big|\,\tau + \tfrac{1}{2},\ \upsilon + \tfrac{1}{2}\widetilde{w_{cn}^2}\right).$$

B.4. Q(β), Q(ρ), Q(Θ), Q(π), Q(χ)

Importance sampling techniques [2] are used to approximate these posterior distributions as they are intractable. We present the case of the mean composite kernel analytically and leave the cases of the product and binary composite kernels as straightforward modifications.

For $Q(\boldsymbol{\rho})$ we have $p(\boldsymbol{\rho}|\boldsymbol{\beta}) \propto p(\boldsymbol{\beta}|\boldsymbol{\rho})p(\boldsymbol{\rho}|\mu, \lambda)$. The unnormalized posterior is $Q^*(\boldsymbol{\rho}) = p(\boldsymbol{\beta}|\boldsymbol{\rho})p(\boldsymbol{\rho}|\mu, \lambda)$ and hence the importance weights are

$$W(\boldsymbol{\rho}^{i'}) = \frac{Q^*(\boldsymbol{\rho}^{i'})\big/\prod_{s=1}^{S}\mathrm{Gamma}(\rho_s^{i'}|\mu, \lambda)}{\sum_{i=1}^{I}Q^*(\boldsymbol{\rho}^{i})\big/\prod_{s=1}^{S}\mathrm{Gamma}(\rho_s^{i}|\mu, \lambda)} = \frac{\mathrm{Dir}(\widetilde{\boldsymbol{\beta}}|\boldsymbol{\rho}^{i'})}{\sum_{i=1}^{I}\mathrm{Dir}(\widetilde{\boldsymbol{\beta}}|\boldsymbol{\rho}^{i})},$$

where Gamma and Dir are the gamma and Dirichlet distributions, $I$ is the total number of samples of $\boldsymbol{\rho}$ taken so far from the product of gamma distributions and $i'$ denotes the current (last) sample. We can then estimate any function $f$ of $\boldsymbol{\rho}$ as

$$\widetilde{f(\boldsymbol{\rho})} = \sum_{i=1}^{I}f(\boldsymbol{\rho}^{i})W(\boldsymbol{\rho}^{i}).$$

In the same manner, for $Q(\boldsymbol{\beta})$ and $Q(\Theta)$ we can use the unnormalized posteriors $Q^*(\boldsymbol{\beta})$ and $Q^*(\Theta)$, where $p(\boldsymbol{\beta}|\boldsymbol{\rho}, \mathbf{Y}, \mathbf{W}) \propto p(\mathbf{Y}|\mathbf{W}, \boldsymbol{\beta})p(\boldsymbol{\beta}|\boldsymbol{\rho})$ and $p(\Theta|\omega, \phi, \mathbf{Y}, \mathbf{W}) \propto p(\mathbf{Y}|\mathbf{W}, \Theta)p(\Theta|\omega, \phi)$, with importance weights defined as

$$W(\boldsymbol{\beta}^{i'}) = \frac{Q^*(\boldsymbol{\beta}^{i'})\big/\mathrm{Dir}(\boldsymbol{\beta}^{i'})}{\sum_{i=1}^{I}Q^*(\boldsymbol{\beta}^{i})\big/\mathrm{Dir}(\boldsymbol{\beta}^{i})} = \frac{\prod_{n=1}^{N}\mathcal{N}_{\widetilde{\mathbf{y}}_n}(\widetilde{\mathbf{W}}\mathbf{k}_n^{\beta^{i'}\widetilde{\Theta}}, \mathbf{I})}{\sum_{i=1}^{I}\prod_{n=1}^{N}\mathcal{N}_{\widetilde{\mathbf{y}}_n}(\widetilde{\mathbf{W}}\mathbf{k}_n^{\beta^{i}\widetilde{\Theta}}, \mathbf{I})}$$

and

$$W(\Theta^{i'}) = \frac{Q^*(\Theta^{i'})\big/\mathrm{Gam}(\Theta^{i'})}{\sum_{i=1}^{I}Q^*(\Theta^{i})\big/\mathrm{Gam}(\Theta^{i})} = \frac{\prod_{n=1}^{N}\mathcal{N}_{\widetilde{\mathbf{y}}_n}(\widetilde{\mathbf{W}}\mathbf{k}_n^{\widetilde{\beta}\Theta^{i'}}, \mathbf{I})}{\sum_{i=1}^{I}\prod_{n=1}^{N}\mathcal{N}_{\widetilde{\mathbf{y}}_n}(\widetilde{\mathbf{W}}\mathbf{k}_n^{\widetilde{\beta}\Theta^{i}}, \mathbf{I})},$$

and again we can estimate any functions $g$ of $\boldsymbol{\beta}$ and $h$ of $\Theta$ as

$$\widetilde{g(\boldsymbol{\beta})} = \sum_{i=1}^{I}g(\boldsymbol{\beta}^{i})W(\boldsymbol{\beta}^{i}) \quad\text{and}\quad \widetilde{h(\Theta)} = \sum_{i=1}^{I}h(\Theta^{i})W(\Theta^{i}).$$

For the product composite kernel case we proceed in the same manner by importance sampling for $Q(\boldsymbol{\pi})$ and $Q(\boldsymbol{\chi})$.

Appendix C. Posterior expectations of each y_n

As we have shown, $Q(\mathbf{Y}) \propto \prod_{n=1}^{N}\delta(y_{in} > y_{kn}\ \forall k \neq i)\,\delta(t_n = i)\,\mathcal{N}_{\mathbf{y}_n}(\widetilde{\mathbf{W}}\widetilde{\mathbf{k}}_n^{\beta\Theta}, \mathbf{I})$. Hence $Q(\mathbf{y}_n)$ is a truncated multivariate Gaussian distribution and we need to calculate the correction to the normalizing term $Z_n$ caused by the truncation. Thus the posterior can be expressed as

$$Q(\mathbf{y}_n) = Z_n^{-1}\prod_{c=1}^{C}\mathcal{N}_{y_{cn}}^{t_n}(\widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta}, 1),$$

where the superscript $t_n$ indicates the truncation needed so that the appropriate dimension $i$ (since $t_n = i \iff y_{in} > y_{jn}\ \forall j \neq i$) is the largest.

Now $Z_n = P(\mathbf{y}_n \in \mathcal{C})$, where $\mathcal{C} = \{\mathbf{y}_n : y_{in} > y_{jn}\}$, hence

$$Z_n = \int_{-\infty}^{+\infty}\mathcal{N}_{y_{in}}(\widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta}, 1)\prod_{j \neq i}\int_{-\infty}^{y_{in}}\mathcal{N}_{y_{jn}}(\widetilde{\mathbf{w}}_j\widetilde{\mathbf{k}}_n^{\beta\Theta}, 1)\,dy_{jn}\,dy_{in} = \mathbb{E}_{p(u)}\left\{\prod_{j \neq i}\Phi(u + \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_j\widetilde{\mathbf{k}}_n^{\beta\Theta})\right\},$$

with $p(u) = \mathcal{N}_u(0, 1)$. Performing the Gaussian integrals, the posterior expectation of $y_{cn}$ for all $c \neq i$ (the auxiliary variables associated with all classes except the one that object $n$ belongs to) is given by

$$\widetilde{y}_{cn} = \widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - Z_n^{-1}\,\mathbb{E}_{p(u)}\left\{\mathcal{N}_u(\widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta}, 1)\prod_{j \neq i,c}\Phi(u + \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_j\widetilde{\mathbf{k}}_n^{\beta\Theta})\right\}.$$

For the $i$th class the posterior expectation $\widetilde{y}_{in}$ (the auxiliary variable associated with the known class of the $n$th object) is given by

$$\widetilde{y}_{in} = \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} + Z_n^{-1}\,\mathbb{E}_{p(u)}\left\{u\prod_{j \neq i}\Phi(u + \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{\mathbf{w}}_j\widetilde{\mathbf{k}}_n^{\beta\Theta})\right\} = \widetilde{\mathbf{w}}_i\widetilde{\mathbf{k}}_n^{\beta\Theta} + \sum_{c \neq i}(\widetilde{\mathbf{w}}_c\widetilde{\mathbf{k}}_n^{\beta\Theta} - \widetilde{y}_{cn}),$$

where we have made use of the fact that for a variable $u \sim \mathcal{N}(0, 1)$ and any differentiable function $g(u)$, $\mathbb{E}\{ug(u)\} = \mathbb{E}\{g'(u)\}$.

Appendix D. Predictive distribution

In order to make a prediction $t^*$ for a new point $\mathbf{x}^*$ we need to know

$$p(t^* = c|\mathbf{x}^*, \mathbf{X}, \mathbf{t}) = \int p(t^* = c|\mathbf{y}^*)\,p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{X}, \mathbf{t})\,d\mathbf{y}^* = \int\delta_c^*\,p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{X}, \mathbf{t})\,d\mathbf{y}^*.$$

Hence we need to evaluate

$$p(\mathbf{y}^*|\mathbf{x}^*, \mathbf{X}, \mathbf{t}) = \int p(\mathbf{y}^*|\mathbf{W}, \mathbf{x}^*)\,p(\mathbf{W}|\mathbf{X}, \mathbf{t})\,d\mathbf{W} = \prod_{c=1}^{C}\int\mathcal{N}_{\mathbf{y}_c^*}(\mathbf{w}_c\mathbf{K}^*, \mathbf{I})\,\mathcal{N}_{\mathbf{w}_c}(\widetilde{\mathbf{y}}_c\mathbf{K}\mathbf{V}_c, \mathbf{V}_c)\,d\mathbf{w}_c.$$
2682 T. Damoulas, M.A. Girolami / Pattern Recognition 42 (2009) 2671 -- 2683

We proceed by analysing the integral, gathering all the terms de- following expression for the lower bound:
pending on wc , completing the square twice and reforming to
C N
NC 1 ## 3
C = − log 2+ − {y2cn + knT w!
T ycn w
c wc kn −2, , c kn }
$
3 ,V3 ) ∗ ∗ 2 2
p(y∗ |x∗ , X, t) = yc KKc K∗ V
Nyc∗ (, c c
c=1 n=1
C C
c=1 NC 1# 1#
− log 2+ − log |Zc | − , c Z−1
w c w, cT
× Nwc ((yc∗ K∗T + ,
yc K)Kc , Kc ) (D.1) 2 2 2
c=1 c=1
C N
with 1# # NC
− Tr[Z−1
c Vc ] + log Zn + log 2+
∗ 2 2
3 = (I − K∗T Hc K∗ )−1 (Ntest × Ntest) c=1 n=1
V c
N C
1 # # ,2
+ (y cn − 2, , c kn + knT w
ycn w , cT w
, c kn )
and 2
n=1 c=1
C
Kc = (K∗ K∗T + Vc−1 )−1 (N × N) (D.2) NC 1# NC
+ log 2+ + log |Vc | +
2 2 2
c=1
3 ∗ by applying the Woodbury identity and
Finally we can simplify V c
reduce its form to which simplifies to our final expression

3 = (I + K∗T Vc K∗ ) (Ntest × Ntest) C N
V c NC 1# #
Hence we can go back to the predictive distribution and consider the case of a single test point (i.e. Ntest = 1), with associated scalars \tilde{m}^{*}_{c} and \tilde{v}^{*}_{c}:

p(t_{*}=c\mid x_{*}, X, t) = \int \delta_{c}^{*}\prod_{c=1}^{C} N_{y_{c*}}(\tilde{m}^{*}_{c}, \tilde{v}^{*}_{c})\, dy_{*}
 = \int_{-\infty}^{+\infty} N_{y_{c*}}(\tilde{m}^{*}_{c}, \tilde{v}^{*}_{c})\prod_{j\neq c}\int_{-\infty}^{y_{c*}} N_{y_{j*}}(\tilde{m}^{*}_{j}, \tilde{v}^{*}_{j})\, dy_{j*}\, dy_{c*}
 = \int_{-\infty}^{+\infty} N_{y_{c*}-\tilde{m}^{*}_{c}}(0, \tilde{v}^{*}_{c})\prod_{j\neq c}\int_{-\infty}^{y_{c*}-\tilde{m}^{*}_{j}} N_{y_{j*}-\tilde{m}^{*}_{j}}(0, \tilde{v}^{*}_{j})\, dy_{j*}\, dy_{c*}

Setting u = (y_{c*} - \tilde{m}^{*}_{c})\,\tilde{v}^{*-1}_{c} and x = (y_{c*} - \tilde{m}^{*}_{j})\,\tilde{v}^{*-1}_{j} we have

p(t_{*}=c\mid x_{*}, X, t) = \int_{-\infty}^{+\infty} N_{u}(0,1)\prod_{j\neq c}\int_{-\infty}^{(u\tilde{v}^{*}_{c}+\tilde{m}^{*}_{c}-\tilde{m}^{*}_{j})\tilde{v}^{*-1}_{j}} N_{x}(0,1)\, dx\, du = E_{p(u)}\Big\{\prod_{j\neq c}\Phi\big(\tilde{v}^{*-1}_{j}(u\,\tilde{v}^{*}_{c} + \tilde{m}^{*}_{c} - \tilde{m}^{*}_{j})\big)\Big\}
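The remaining expectation over u \sim N(0,1) has no closed form but is one-dimensional, so a simple Monte Carlo average suffices. The sketch below is an assumed illustration of the expression above, taking the per-class scalars m~_c^* and v~_c^* as given; it is not the authors' implementation.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def predictive_class_probs(m_star, v_star, n_mc=50000):
    """Monte Carlo approximation of p(t_* = c | x_*, X, t) for one test point.

    m_star : (C,) predictive means  m~_c^*
    v_star : (C,) predictive scales v~_c^*
    """
    m_star = np.asarray(m_star, dtype=float)
    v_star = np.asarray(v_star, dtype=float)
    C = len(m_star)
    u = rng.standard_normal(n_mc)
    probs = np.empty(C)
    for c in range(C):
        others = [j for j in range(C) if j != c]
        # Phi( (u v~_c^* + m~_c^* - m~_j^*) / v~_j^* ) for every j != c
        args = (u[:, None] * v_star[c] + m_star[c] - m_star[others]) / v_star[others]
        probs[c] = np.mean(np.prod(norm.cdf(args), axis=1))
    return probs / probs.sum()   # guard against Monte Carlo error; the exact values already sum to one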
Appendix E. Lower bound

From Jensen's inequality in Eq. (5), and by conditioning on the current values of \beta, \Theta, Z and \varrho, we can derive the variational lower bound using the relevant components

E_{Q(Y)Q(W)}\{\log p(Y\mid W, \beta, X)\}   (E.1)
+ E_{Q(Y)Q(W)}\{\log p(W\mid Z, X)\}   (E.2)
- E_{Q(Y)}\{\log Q(Y)\}   (E.3)
- E_{Q(W)}\{\log Q(W)\}   (E.4)

which, by noting that the expectation of a quadratic form under a Gaussian is another quadratic form plus a constant, leads to the following expression for the lower bound:

= -\frac{NC}{2}\log 2\pi - \frac{1}{2}\sum_{c=1}^{C}\sum_{n=1}^{N}\big\{\tilde{y}^{2}_{cn} + k_{n}^{T}(\tilde{w}_{c}^{T}\tilde{w}_{c} + V_{c})k_{n} - 2\tilde{y}_{cn}\tilde{w}_{c}k_{n}\big\}
  - \frac{NC}{2}\log 2\pi - \frac{1}{2}\sum_{c=1}^{C}\log|Z_{c}| - \frac{1}{2}\sum_{c=1}^{C}\tilde{w}_{c}Z_{c}^{-1}\tilde{w}_{c}^{T} - \frac{1}{2}\sum_{c=1}^{C}\mathrm{Tr}[Z_{c}^{-1}V_{c}]
  + \sum_{n=1}^{N}\log Z_{n} + \frac{NC}{2}\log 2\pi + \frac{1}{2}\sum_{n=1}^{N}\sum_{c=1}^{C}\big(\tilde{y}^{2}_{cn} - 2\tilde{y}_{cn}\tilde{w}_{c}k_{n} + k_{n}^{T}\tilde{w}_{c}^{T}\tilde{w}_{c}k_{n}\big)
  + \frac{NC}{2}\log 2\pi + \frac{1}{2}\sum_{c=1}^{C}\log|V_{c}| + \frac{NC}{2}

which simplifies to our final expression

Lower Bound = \frac{NC}{2} + \frac{1}{2}\sum_{c=1}^{C}\log|V_{c}| + \sum_{n=1}^{N}\log Z_{n}   (E.5)
  - \frac{1}{2}\sum_{c=1}^{C}\mathrm{Tr}[Z_{c}^{-1}V_{c}] - \frac{1}{2}\sum_{c=1}^{C}\tilde{w}_{c}Z_{c}^{-1}\tilde{w}_{c}^{T}   (E.6)
  - \frac{1}{2}\sum_{c=1}^{C}\log|Z_{c}| - \frac{1}{2}\sum_{c=1}^{C}\sum_{n=1}^{N}k_{n}^{T}V_{c}k_{n}   (E.7)
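For reference, evaluating (E.5)-(E.7) at each variational iteration involves only the current posterior moments. A minimal sketch under assumed shapes is given below, where V_c, Z_c, w~_c, the composite kernel columns k_n and the normalising constants Z_n are taken as given; all names are illustrative.

import numpy as np

def lower_bound(V, Z, w_tilde, K, log_Z_n):
    """Evaluate (E.5)-(E.7) for the current variational posteriors.

    V        : (C, N, N) posterior covariances V_c
    Z        : (C, N, N) prior covariances Z_c
    w_tilde  : (C, N)    posterior means of the regressors
    K        : (N, N)    composite kernel, columns k_n
    log_Z_n  : (N,)      log normalising constants of the truncated Gaussians
    """
    C, N = w_tilde.shape
    bound = 0.5 * N * C + np.sum(log_Z_n)                           # (E.5): NC/2 and sum_n log Z_n
    for c in range(C):
        Z_inv = np.linalg.inv(Z[c])
        bound += 0.5 * np.linalg.slogdet(V[c])[1]                   # (E.5): 0.5 log|V_c|
        bound -= 0.5 * np.trace(Z_inv @ V[c])                       # (E.6): trace term
        bound -= 0.5 * w_tilde[c] @ Z_inv @ w_tilde[c]              # (E.6): quadratic term
        bound -= 0.5 * np.linalg.slogdet(Z[c])[1]                   # (E.7): 0.5 log|Z_c|
        bound -= 0.5 * np.sum(np.einsum('in,ij,jn->n', K, V[c], K)) # (E.7): sum_n k_n^T V_c k_n
    return bound

Monitoring this quantity gives the usual convergence check for the variational iterations, since it should increase monotonically.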
References

[1] J. Albert, S. Chib, Bayesian analysis of binary and polychotomous response data, Journal of the American Statistical Association 88 (1993) 669–679.
[2] C. Andrieu, An introduction to MCMC for machine learning, Machine Learning 50 (2003) 5–43.
[3] L. Bai, L. Shen, Combining wavelets with HMM for face recognition, in: Proceedings of the 23rd Artificial Intelligence Conference, 2003.
[4] M.J. Beal, Variational algorithms for approximate Bayesian inference, Ph.D. Thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.
[5] J.O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer Series in Statistics, Springer, Berlin, 1985.
[6] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, USA, 2006.
[7] T. Damoulas, M.A. Girolami, Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection, Bioinformatics 24 (10) (2008) 1264–1270.
[8] T. Damoulas, M.A. Girolami, Pattern recognition with a Bayesian kernel combination machine, Pattern Recognition Letters 30 (1) (2009) 46–54.
[9] T. Damoulas, Y. Ying, M.A. Girolami, C. Campbell, Inferring sparse kernel combinations and relevant vectors: an application to subcellular localization of proteins, in: IEEE International Conference on Machine Learning and Applications (ICMLA '08), 2008.
[10] N. de Freitas, P. Højen-Sørensen, M. Jordan, S. Russell, Variational MCMC, in: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001.
[11] D.G.T. Denison, C.C. Holmes, B.K. Mallick, A.F.M. Smith, Bayesian Methods for Nonlinear Classification and Regression, Wiley Series in Probability and Statistics, West Sussex, UK, 2002.
[12] T.G. Dietterich, Ensemble methods in machine learning, in: Proceedings of the 1st International Workshop on Multiple Classifier Systems, 2000, pp. 1–15.
[13] C. Ding, I. Dubchak, Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics 17 (4) (2001) 349–358.
[14] M. Girolami, S. Rogers, Hierarchic Bayesian models for kernel learning, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 241–248.
[15] M. Girolami, S. Rogers, Variational Bayesian multinomial probit regression with Gaussian process priors, Neural Computation 18 (8) (2006) 1790–1817.
[16] M. Girolami, M. Zhong, Data integration for classification problems employing Gaussian process priors, in: B. Schölkopf, J. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems, vol. 19, MIT Press, Cambridge, MA, 2007, pp. 465–472.
[17] W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1970) 97–109.
[18] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3) (1998) 226–239.
[19] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, New York, 2004.
[20] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research 5 (2004) 27–72.
[21] W.-J. Lee, S. Verzakov, R.P. Duin, Kernel combination versus classifier combination, in: 7th International Workshop on Multiple Classifier Systems, 2007.
[22] D.P. Lewis, T. Jebara, W.S. Noble, Nonstationary kernel combination, in: 23rd International Conference on Machine Learning, 2006.
[23] D.P. Lewis, T. Jebara, W.S. Noble, Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure, Bioinformatics 22 (22) (2006) 2753–2760.
[24] D. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, 2003.
[25] D.J.C. MacKay, The evidence framework applied to classification networks, Neural Computation 4 (5) (1992) 698–714.
[26] S. Manocha, M.A. Girolami, An empirical analysis of the probabilistic k-nearest neighbour classifier, Pattern Recognition Letters 28 (13) (2007) 1818–1824.
[27] R.M. Neal, Regression and classification using Gaussian process priors, Bayesian Statistics 6 (1998) 475–501.
[28] C.S. Ong, A.J. Smola, R.C. Williamson, Learning the kernel with hyperkernels, Journal of Machine Learning Research 6 (2005) 1043–1071.
[29] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, USA, 2006.
[30] B. Schölkopf, A. Smola, Learning with Kernels, The MIT Press, Cambridge, MA, USA, 2002.
[31] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, England, UK, 2004.
[32] H.-B. Shen, K.-C. Chou, Ensemble classifier for protein fold pattern recognition, Bioinformatics 22 (14) (2006) 1717–1722.
[33] S. Sonnenburg, G. Rätsch, C. Schäfer, A general and efficient multiple kernel learning algorithm, in: Y. Weiss, B. Schölkopf, J. Platt (Eds.), Advances in Neural Information Processing Systems, vol. 18, MIT Press, Cambridge, MA, 2006, pp. 1273–1280.
[34] D.M.J. Tax, M. van Breukelen, R.P.W. Duin, J. Kittler, Combining multiple classifiers by averaging or by multiplying?, Pattern Recognition 33 (2000) 1475–1485.
[35] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research 1 (2001) 211–244.
[36] D.H. Wolpert, Stacked generalization, Neural Networks 5 (2) (1992) 241–259.

About the Author—THEODOROS DAMOULAS was awarded the NCR Ltd. Ph.D. Scholarship in 2006 in the Department of Computing Science at the University of Glasgow. He
is a member of the Inference Research Group (http://www.dcs.gla.ac.uk/inference) and his research interests are in statistical machine learning and pattern recognition. He
holds a 1st class M.Eng. degree in Mechanical Engineering from the University of Manchester and an M.Sc. in Informatics (Distinction) from the University of Edinburgh.

About the Author—MARK A. GIROLAMI is Professor of Computing and Inferential Science and holds a joint appointment in the Department of Computing Science and
the Department of Statistics at the University of Glasgow. His various research interests lie at the boundary between computing, statistical, and biological science. He
currently holds a prestigious five-year (2007–2012) Advanced Research Fellowship from the Engineering & Physical Sciences Research Council (EPSRC). In 2009 the
International Society of Photo-Optical Instrumentation Engineers (SPIE) presented him with their Pioneer Award for his contributions to Biomedical Signal Processing and the
development of analytical tools for EEG and fMRI which revolutionized the automated processing of such signals. In 2005 he was awarded an MRC Discipline Hopping Award
in the Department of Biochemistry and during 2000 he was the TEKES visiting professor at the Laboratory of Computing and Information Science in Helsinki University
of Technology. In 1998 and 1999 Professor Girolami was a research fellow at the Laboratory for Advanced Brain Signal Processing in the Brain Science Institute, RIKEN,
Wako-Shi, Japan. He has been a visiting researcher at the Computational Neurobiology Laboratory (CNL) of the Salk Institute. The main focus of the seven post-doctoral
researchers and eight Ph.D. students in his group is in the development of statistical inferential methodology to support reasoning about biological mechanisms within a
systems biology context.
