Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
ARTICLE INFO

Article history:
Received 2 November 2007
Received in revised form 24 January 2009
Accepted 5 April 2009

Keywords:
Variational Bayes approximation
Multiclass classification
Kernel combination
Hierarchical Bayes
Bayesian inference
Ensemble learning
Multi-modal modelling
Information integration

ABSTRACT

In this paper we offer a variational Bayes approximation to the multinomial probit model for basis expansion and kernel combination. Our model is well-founded within a hierarchical Bayesian framework and is able to instructively combine available sources of information for multinomial classification. The proposed framework enables informative integration of possibly heterogeneous sources in a multitude of ways, from the simple summation of feature expansions to weighted product of kernels, and it is shown to match and in certain cases outperform the well-known ensemble learning approaches of combining individual classifiers. At the same time the approximation reduces considerably the CPU time and resources required with respect to both the ensemble learning methods and the full Markov chain Monte Carlo, Metropolis–Hastings within Gibbs solution of our model. We present our proposed framework together with extensive experimental studies on synthetic and benchmark datasets and also for the first time report a comparison between summation and product of individual kernels as possible different methods for constructing the composite kernel matrix.

© 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2009.04.002
2672 T. Damoulas, M.A. Girolami / Pattern Recognition 42 (2009) 2671 -- 2683
employed outside the "kernel-domain", i.e. without the need to embed the features into Hilbert spaces, by allowing for combination of basis function expansions. That is useful in the case of multi-scale problems and wavelets [3]. The combination of kernels or dictionaries of basis functions, as an alternative to classifier combination, offers the advantages of reduced computational requirements, the ability to learn the significance of the individual sources and an improved solution based on the inferred significance.

Previous related work includes the area of kernel learning, where various methodologies have been proposed in order to learn the possibly composite kernel matrix: by convex optimization methods [20,33,22,23], by the introduction of hyper-kernels [28], or by direct ad hoc construction of the composite kernel [21]. These methods operate within the context of support vector machine (SVM) learning and hence inherit the arguable drawbacks of non-probabilistic outputs and ad hoc extensions to the multiclass case. Furthermore, another related approach within the non-parametric Gaussian process (GP) methodology [27,29] has been proposed by Girolami and Zhong [16], where instead of kernel combination the integration of information is achieved via combination of the GP covariance functions.

Our work further develops that of Girolami and Rogers [14], which we generalize to the multiclass case by employing a multinomial probit likelihood and introducing latent variables that give rise to efficient Gibbs sampling from the parameter posterior distribution. We bound the marginal likelihood or model evidence [25] and derive the variational Bayes (VB) approximation for the multinomial probit composite kernel model, providing a fast and efficient solution. We are able to combine kernels in a general way, e.g. by summation, product or binary rules, and to learn the associated weights in order to infer the significance of the sources.

For the first time we offer a VB approximation on an explicit multiclass model for kernel combination and a comparison between kernel combination methods, from summation to weighted product of kernels. Furthermore, we compare our methods against classifier combination strategies on a number of datasets and we show that the accuracy of our VB approximation is comparable to that of the full Markov chain Monte Carlo (MCMC) solution.

The paper is organized as follows. First we introduce the concepts of composite kernels, kernel combination rules and classifier combination strategies and give an insight into their theoretical underpinnings. Next we introduce the multinomial probit composite kernel model, the MCMC solution and the VB approximation. Finally, we present results on synthetic and benchmark datasets and offer discussion and concluding remarks.

2. Classifier versus kernel combination

The theoretical framework behind classifier combination strategies has been explored relatively recently1 [12,18,19], despite the fact that ensembles of classifiers have been widely used experimentally before. We briefly review that framework in order to identify the common ground and the deviations from classifier combination to the proposed kernel (or basis function) combination methodology.

Consider S feature spaces or sources of information.2 An object n belonging to a dataset {X, t}, with X the collection of all objects and t the corresponding labels, is represented by S D_s-dimensional feature vectors x_n^s for s = 1, ..., S, with x_n^s ∈ R^{D_s}. Let the number of classes be C, with target variable t_n = c ∈ {1, ..., C}, and the number of objects be N, with n = 1, ..., N. Then the class posterior probability for object n is P(t_n | x_n^1, ..., x_n^S) and, according to Bayesian decision theory, we would assign object n to the class with the maximum a posteriori probability.

The classifier combination methodologies are based on the following assumption, which as we shall see is not necessary for the proposed kernel combination approach: although "it is essential to compute the probabilities of various hypotheses by considering all the measurements simultaneously ... it may not be a practical proposition" [18]. This leads to a further approximation on the (noisy) class posterior probabilities, which are now broken down into individual contributions from classifiers trained on each feature space s.

The product combination rule, assuming equal a priori class probabilities, is

P_{prod}(t_n | x_n^1, \ldots, x_n^S) = \frac{\prod_{s=1}^{S} (P(t_n | x_n^s) + \epsilon_s)}{\sum_{t_n=1}^{C} \prod_{s=1}^{S} (P(t_n | x_n^s) + \epsilon_s)}   (1)

The mean combination rule is

P_{mean}(t_n | x_n^1, \ldots, x_n^S) = \frac{1}{S} \sum_{s=1}^{S} (P(t_n | x_n^s) + \epsilon_s)   (2)

with ε_s the prediction error made by the individual classifier trained on feature space s.

The theoretical justification for the product rule comes from the independence assumption on the feature spaces, where x_n^1, ..., x_n^S are assumed to be uncorrelated; the mean combination rule is derived under the opposite assumption of extreme correlation. As can be seen from Eqs. (1) and (2), the individual errors ε_s of the classifiers are either added or multiplied together, and the hope is that a synergetic effect from the combination will cancel them out and hence reduce the overall classification error.

Instead of combining classifiers we propose to combine the feature spaces and hence obtain the class posterior probabilities from the model

P_{spaces}(t_n | x_n^1, \ldots, x_n^S) = P(t_n | x_n^1, \ldots, x_n^S) + \epsilon

where now we only have one error term ε, from the single classifier operating on the composite feature space. The different methods to construct the composite feature space reflect different approximations to the P(t_n | x_n^1, ..., x_n^S) term without incorporating individual classifier errors into the approximation. Still, underlying assumptions of independence for the construction of the composite feature space are implied through the kernel combination rules, and as we shall see this will lead to the need for diversity, as is commonly known for other ensemble learning approaches.

Although our approach is general and can be applied to combine basis expansions or kernels, we present here the composite kernel construction as our main example. Embedding the features into Hilbert spaces via the kernel trick [30,31] we can define3 the N × N mean composite kernel as

K^{\beta\Theta}(x_i, x_j) = \sum_{s=1}^{S} \beta_s K^{s\theta_s}(x_i^s, x_j^s) \quad \text{with} \quad \sum_{s=1}^{S} \beta_s = 1 \;\text{ and }\; \beta_s \geq 0 \;\forall s

1 At least in the classification domain for machine learning and pattern recognition. The related statistical decision theory literature has studied opinion pools much earlier; see Berger [5] for a full treatment.
2 Throughout this paper m denotes a scalar, m a vector and M a matrix. If M is a C × N matrix then m_c is the cth row vector, m_n the nth column vector of that matrix, and all other indices imply row vectors.
3 Superscripts denote "function of", i.e. K^{βΘ} denotes that the kernel K is a function of β and Θ, unless otherwise specified.
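As a concrete numerical sketch of Eqs. (1) and (2) and of the mean composite kernel above, the following minimal example (our own illustration; the Gaussian base kernel, the weights and the per-source posteriors are hypothetical) combines two sources at both the classifier and the kernel level:

```python
import numpy as np

def product_rule(posteriors):
    """Product combination rule, Eq. (1): normalized product of the
    per-source class posteriors for one object (errors folded in)."""
    p = np.prod(posteriors, axis=0)          # shape (C,)
    return p / p.sum()

def mean_rule(posteriors):
    """Mean combination rule, Eq. (2): average of the per-source posteriors."""
    return np.mean(posteriors, axis=0)

def gaussian_kernel(X, theta):
    """Gaussian base kernel K(x_i, x_j) = exp(-theta * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * sq)

def mean_composite_kernel(sources, betas, thetas):
    """Mean composite kernel: weighted sum of base kernels with the
    weights beta_s restricted to the simplex."""
    assert abs(sum(betas) - 1.0) < 1e-9 and all(b >= 0 for b in betas)
    return sum(b * gaussian_kernel(X, th)
               for b, X, th in zip(betas, sources, thetas))

# S = 2 hypothetical sources describing N = 4 objects, C = 3 classes.
rng = np.random.default_rng(0)
sources = [rng.normal(size=(4, 2)), rng.normal(size=(4, 5))]
K = mean_composite_kernel(sources, betas=[0.7, 0.3], thetas=[1.0, 0.1])
posteriors = np.array([[0.6, 0.3, 0.1],    # classifier on source 1
                       [0.5, 0.2, 0.3]])   # classifier on source 2
```

Both combination rules act on the outputs of already-trained classifiers, whereas the composite kernel defers the combination to the input (kernel) level, where a single classifier is subsequently trained.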
and the N × N binary composite kernel as

K^{\beta\Theta}(x_i, x_j) = \sum_{s=1}^{S} \beta_s K^{s\theta_s}(x_i^s, x_j^s) \quad \text{with} \quad \beta_s \in \{0, 1\} \;\forall s
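The binary and weighted-product constructions can be sketched in the same way; a minimal illustration with hypothetical precomputed base Gram matrices (the element-wise product of kernels raised to the weights is our reading of the weighted product of kernels, whose weights are not restricted to the simplex):

```python
import numpy as np

def binary_composite(kernels, mask):
    """Binary composite kernel: beta_s in {0, 1} selects a subset of the
    base kernels, which are then summed."""
    return sum(K for K, b in zip(kernels, mask) if b)

def product_composite(kernels, betas):
    """Weighted-product composite kernel, taken here as the element-wise
    product of base kernels raised to the weights beta_s (an assumption;
    these weights need not lie on the simplex)."""
    out = np.ones_like(kernels[0])
    for K, b in zip(kernels, betas):
        out *= K ** b
    return out

# Hypothetical precomputed base Gram matrices for S = 3 sources.
rng = np.random.default_rng(0)
Ks = []
for _ in range(3):
    X = rng.normal(size=(5, 3))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Ks.append(np.exp(-sq))
```

Note that the product construction acts multiplicatively, so a single near-zero base kernel entry can dominate the composite similarity, which is one reason the sum and product rules behave differently in practice.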
fixed values are used for the hyper-parameters (instead of empirical Bayes), since a specific hyper-parameter value will still lead to a (prior) distribution of possible parameter values, which can be "corrected" by the evidence of the data towards the appropriate final posterior distribution.

3.2. Exact Bayesian inference: Gibbs sampling

The standard statistical solution for performing Bayesian inference is via Markov chain Monte Carlo sampling methods, which are powerful but computationally expensive. These approaches still approximate the desired posterior distribution, but only because of the limited number of drawn samples, hence the name exact inference. Among these methods, Gibbs sampling is the preferred alternative as it avoids the need for tuning acceptance ratios, as is typical in the original Metropolis–Hastings (MH) algorithms [17]. Such a sampling scheme is only possible when the hard-to-sample-from joint posterior distribution can be broken down into conditional distributions from which we can easily draw samples.

In our model such a sampler naturally arises by exploiting the conditional distributions of Y|W, ... and W|Y, ..., so that we can draw samples from the parameter posterior distribution P(Ω|t, X, Ξ), where Ω = {Y, W, β, Θ, Z, ρ} and Ξ denotes the aforementioned hyper-parameters of the model. In the case of a non-conjugate pair of distributions, Metropolis–Hastings sub-samplers can be employed for posterior inference. More details of the Gibbs sampler are provided analytically in [8] and summarized in Appendix A.

3.3. Approximate inference: variational approximation

Exact inference via the Gibbs sampling scheme is computationally expensive, and in this article we derive a deterministic variational approximation, an efficient method with accuracy comparable to the full sampling scheme, as will be further demonstrated. The variational methodology, see Beal [4] for a recent treatment, offers a lower bound on the model evidence using an ensemble of factored posteriors to approximate the joint parameter posterior distribution. Although the factored ensemble implies independence ...

The resulting approximate posterior distributions are given below, with full details of the derivations in Appendix B. First, the approximate posterior over the auxiliary variables is given by

Q(Y) \propto \prod_{n=1}^{N} \delta(y_{in} > y_{kn} \;\forall k \neq i)\, \delta(t_n = i)\, \mathcal{N}_{y_n}(\tilde{W}\tilde{k}_n^{\beta\Theta}, I)   (6)

which is a product of N C-dimensional conically truncated Gaussians, demonstrating independence across samples as expected from our initial i.i.d. assumption. The shorthand tilde notation denotes posterior expectations in the usual manner, i.e. \tilde{f}(\beta) = E_{Q(\beta)}\{f(\beta)\}, and the posterior expectations (details in Appendix C) for the auxiliary variable follow as

\tilde{y}_{cn} = \tilde{w}_c \tilde{k}_n^{\beta\Theta} - \frac{E_{p(u)}\{\mathcal{N}_u(\tilde{w}_c \tilde{k}_n^{\beta\Theta} - \tilde{w}_i \tilde{k}_n^{\beta\Theta}, 1)\,\Phi_u^{n,i,c}\}}{E_{p(u)}\{\Phi(u + \tilde{w}_i \tilde{k}_n^{\beta\Theta} - \tilde{w}_c \tilde{k}_n^{\beta\Theta})\,\Phi_u^{n,i,c}\}}   (7)

\tilde{y}_{in} = \tilde{w}_i \tilde{k}_n^{\beta\Theta} - \sum_{c \neq i} \left( \tilde{y}_{cn} - \tilde{w}_c \tilde{k}_n^{\beta\Theta} \right)   (8)

where Φ is the standardized cumulative distribution function (CDF) and \Phi_u^{n,i,c} = \prod_{j \neq i,c} \Phi(u + \tilde{w}_i \tilde{k}_n^{\beta\Theta} - \tilde{w}_j \tilde{k}_n^{\beta\Theta}). Next, the approximate posterior for the regressors can be expressed as

Q(W) \propto \prod_{c=1}^{C} \mathcal{N}_{w_c}(\tilde{y}_c \tilde{K}^{\beta\Theta} V_c, V_c)   (9)

where the covariance is defined as

V_c = \left( \sum_{i=1}^{S} \sum_{j=1}^{S} \tilde{\beta}_i \tilde{\beta}_j K^{i\tilde{\theta}_i} K^{j\tilde{\theta}_j} + \tilde{Z}_c^{-1} \right)^{-1}   (10)

and \tilde{Z}_c is a diagonal matrix of the expected variances \tilde{\zeta}_1, \ldots, \tilde{\zeta}_N for each class. The associated posterior mean for the regressors is therefore \tilde{w}_c = \tilde{y}_c \tilde{K}^{\beta\Theta} V_c, and we can see the coupling between the auxiliary variable and regressor posterior expectations.

The approximate posterior for the variances Z is an updated product of inverse-gamma distributions (gamma on the scales), and the posterior mean for the scale is given by

\frac{\tau + \frac{1}{2}}{\upsilon + \frac{1}{2}\tilde{w}_{cn}^2}   (11)

The predictive distribution follows (see Appendix D for the complete derivation) as

p(t^* = c \,|\, x^*, X, t) = E_{p(u)}\left\{ \prod_{j \neq c} \Phi\left( \frac{1}{\tilde{\nu}_j^*} \left( u \tilde{\nu}_c^* + \tilde{m}_c^* - \tilde{m}_j^* \right) \right) \right\}   (12)

where, for the general case of N_test objects, \tilde{m}_c^* = \tilde{y}_c K (K^* K^{*T} + V_c^{-1})^{-1} K^* \tilde{V}_c^* and \tilde{V}_c^* = (I + K^{*T} V_c K^*), while we have dropped the notation for the dependence of the train K (N × N) and test K* (N × N_test) kernels on Θ, β for clarity. In Algorithm 1 we summarize the VB approximation in a pseudo-algorithmic fashion.

5 The mean composite kernel is considered as an example. Modifications for the binary and product composite kernels are trivial and the respective details are given in the appendices.
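The coupling between the updates (7)-(10) can be sketched as a single VB sweep. This is our own simplified rendition, written for one base kernel and a fixed, shared prior precision, with the truncated-Gaussian expectations of Eq. (7) estimated by Monte Carlo over u; it illustrates the coupling structure only, not the paper's full Algorithm 1:

```python
import numpy as np
from scipy.special import ndtr  # standard normal CDF, Phi

def vb_iteration(K, t, C, zeta_inv, Y, n_u=4000, rng=None):
    """One sweep of the coupled VB updates of Eqs. (7)-(10) for a single
    base kernel K (the double sum over sources in Eq. (10) collapses)
    and a fixed, shared prior precision zeta_inv."""
    rng = rng or np.random.default_rng(0)
    N = K.shape[0]
    # Eq. (10): V_c = (K K^T + Z_c^{-1})^{-1}, identical across classes here.
    V = np.linalg.inv(K @ K.T + zeta_inv * np.eye(N))
    # Eq. (9): posterior mean of the regressors, w_c = y_c K V_c.
    W = Y @ K @ V
    M = W @ K                       # class activations w_c k_n, shape (C, N)
    u = rng.normal(size=n_u)
    phi = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
    Y_new = np.zeros_like(Y)
    for n in range(N):
        i = t[n]                    # known class of object n
        others = [c for c in range(C) if c != i]
        for c in others:
            rest = [j for j in range(C) if j != i and j != c]
            prod = (np.prod(ndtr(u[:, None] + M[i, n] - M[rest, n]), axis=1)
                    if rest else np.ones(n_u))
            num = np.mean(phi(u - (M[c, n] - M[i, n])) * prod)
            den = np.mean(ndtr(u + M[i, n] - M[c, n]) * prod)
            Y_new[c, n] = M[c, n] - num / den                 # Eq. (7)
        # Eq. (8): the known-class expectation absorbs the corrections.
        Y_new[i, n] = M[i, n] - (Y_new[others, n].sum() - M[others, n].sum())
    return Y_new, W

# A toy run: Gaussian kernel on hypothetical data, C = 3 classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
t = np.array([0, 1, 2, 0, 1, 2])
Y, W = vb_iteration(K, t, C=3, zeta_inv=1.0, Y=np.zeros((3, 6)))
```

Even from a zero initialization, one sweep pushes the expectation for the labelled class of each object above those of the competing classes, and by Eq. (8) the corrections cancel so that the column sums of Y match those of the activations.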
12: \tilde{m}_c^* \leftarrow \tilde{y}_c K (K^* K^{*T} + V_c^{-1})^{-1} K^* \tilde{V}_c^*
13: for n = 1 to N_test do
14:   for c = 1 to C do
15:     for i = 1 to K samples do
16:       u_i \leftarrow \mathcal{N}(0, 1), \quad p_{cn}^i \leftarrow \prod_{j \neq c} \Phi\left( \frac{1}{\tilde{\nu}_j^*} (u_i \tilde{\nu}_c^* + \tilde{m}_c^* - \tilde{m}_j^*) \right)
17:     end for
18:   end for
19:   P(t_n^* = c \,|\, x_n^*, X, t) = \frac{1}{K} \sum_{i=1}^{K} p_{cn}^i
20: end for

4. Experimental studies

Fig. 3. Linearly separable dataset with known regressors defining the decision boundaries. C_n denotes the members of class n and Dec_ij is the decision boundary between classes i and j.

Fig. 5. The variational approximate posterior distribution for the same case as above, employing 100,000 samples from the approximate posterior of the regressors W in order to estimate the approximate posterior of the slope and intercept.
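Steps 13-20 of Algorithm 1 above (the Monte Carlo estimate of the predictive distribution, Eq. (12)) can be sketched as follows; the predictive means and scales for the single test point are hypothetical values:

```python
import numpy as np
from scipy.special import ndtr  # standard normal CDF, Phi

def predictive_probs(m_star, v_star, n_samples=50_000, seed=0):
    """Monte Carlo estimate of Eq. (12): m_star[c] and v_star[c] are the
    predictive mean and scale for class c of a single test point."""
    rng = np.random.default_rng(seed)
    C = len(m_star)
    u = rng.normal(size=n_samples)              # samples of u ~ N(0, 1)
    p = np.empty(C)
    for c in range(C):
        others = [j for j in range(C) if j != c]
        # product over j != c of Phi((u v*_c + m*_c - m*_j) / v*_j)
        z = (u[:, None] * v_star[c] + m_star[c] - m_star[others]) / v_star[others]
        p[c] = np.prod(ndtr(z), axis=1).mean()
    return p / p.sum()   # renormalize away residual Monte Carlo error

probs = predictive_probs(np.array([1.2, -0.3, 0.1]), np.array([1.0, 1.0, 1.0]))
```

The probabilities sum to one by construction, and the class with the largest predictive mean receives the largest posterior probability.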
Table 1
CPU time (s) comparison for 100,000 Gibbs samples versus 100 VB iterations.

        Gibbs    VB

Notice that the number of VB iterations needed for the lower bound to converge is typically less than 100.

Table 2
Ten-fold cross-validated error percentages (mean ± std.) on standard UCI multinomial datasets.

Balance   8.8 ± 3.6    12.2 ± 4.2   7.0 ± 3.3    11.5 ± 3.0   10.2 ± 3.0
Crabs     23.5 ± 11.3  13.5 ± 8.2   21.5 ± 9.1   15.0 ± 8.8   19.5 ± 6.8
Glass     27.9 ± 10.1  35.8 ± 11.8  28.4 ± 8.9   29.9 ± 9.2   26.7 ± 8.8
Iris      2.7 ± 5.6    11.3 ± 9.9   4.7 ± 6.3    5.3 ± 5.2    4.0 ± 5.6
Soybean   6.5 ± 10.5   6.0 ± 9.7    4.0 ± 8.4    14.5 ± 16.7  4.5 ± 9.6
Vehicle   25.6 ± 4.0   29.6 ± 3.3   26.0 ± 6.1   36.3 ± 5.2   37.2 ± 4.5
Wine      4.5 ± 5.1    2.8 ± 4.7    1.1 ± 2.3    3.9 ± 3.8    3.4 ± 2.9

Table 3
Running times (seconds) for computing 10-fold cross-validation results.

Dataset: Balance, Crabs, Glass, Iris, Soybean, Vehicle, Wine.

information are available. Comparisons are made against combination of classifiers and also between different ways to combine the kernels, as we have described previously. In order to assess the above, we make use of two datasets that have multiple feature spaces describing the objects to be classified. The first one is a large (N = 2000), multinomial (C = 10) and "multi-featured" (S = 4) UCI dataset named "Multiple Features". The objects are handwritten numerals, from 0 to 9, and the available features are the Fourier descriptors (FR), the Karhunen–Loève features (KL), the pixel averages (PX) and the Zernike moments (ZM). Tax et al. [34] have previously reported results on this problem by combining classifiers, but employed a different test set which is not publicly available. Furthermore, we allow the rotation invariance property of the ZM features to cause problems in the distinction between digits 6 and 9. The hope is that the remaining feature spaces can compensate on the discrimination.

In Tables 4–7, we report experimental results over 50 repeated trials where we have randomly selected 20 training and 20 testing objects from each class. For each trial we employ (1) a single classifier on each feature space, and (2) the proposed classifier on the composite feature space. This allows us to examine the performance of combinations of classifiers versus combination of feature spaces. It is worth noting that a concatenation of feature spaces for this specific problem has been found to perform as well as ensemble learning methods [8]. We report results from both the full MCMC solution and the VB approximation in order to analyse the possible degradation of performance under the approximation. In all cases we employ the multinomial probit kernel machine using a Gaussian kernel with fixed parameters.

Table 4
Classification percentage error (mean ± std.) of individual classifiers trained on each feature space.

                                     FR          KL          PX        ZM
Full MCMC Gibbs sampling—single FS   27.3 ± 3.3  11.0 ± 2.3  7.3 ± 2   25.2 ± 3

Table 5
Combinations of the individual classifiers based on four widely used rules: Product (Prod), Mean (Sum), Maximum (Max) and Majority voting (Maj).

                                           Prod       Sum      Max        Maj
Full MCMC Gibbs sampling—Comb. classifiers 5.1 ± 1.7  5.3 ± 2  8.4 ± 2.3  8.45 ± 2.2

Table 6
Combination of feature spaces. Gibbs sampling results for five feature space combination methods: Binary (Bin), Mean with fixed weights (FixSum), Mean with inferred weights (WSum), Product with fixed weights (FixProd) and Product with inferred weights (WProd).

                                    Bin      FixSum   WSum       FixProd    WProd
Full MCMC Gibbs sampling—Comb. FS   5.7 ± 2  5.5 ± 2  5.8 ± 2.1  5.2 ± 1.8  5.9 ± 1.2

Table 7
Combination of feature spaces: VB approximation (Bin, FixSum, WSum, FixProd, WProd).

As we can see from Tables 4 and 5, the two best performing ensemble learning methods outperform all of the individual classifiers trained on separate feature spaces. With a p-value of 1.5e−07 between the Pixel classifier and the Product rule, the difference is statistically significant. At the same time, all the kernel combination methods in Table 6 match the best performing classifier combination approaches, with a p-value of 0.91 between the Product classifier combination rule and the product kernel combination method. Finally, from Table 7 it is obvious that the VB approximation performs very well compared with the full Gibbs solution; even when combinatorial weights are inferred, which expands the parameter space, there is no statistical difference between the MCMC and the VB solution (p-value 0.47).

Furthermore, the variants of our method that employ combinatorial weights offer the advantage of inferring the significance of the contributing sources of information. In Fig. 8, we can see that the pixel (PX) and Zernike moments (ZM) feature spaces receive large weights and hence contribute significantly to the composite feature space. This is in accordance with our expectations, as the pixel feature space gives the best performing individual classifier, and complementary predictive power, mainly from the ZM channel, improves on that.

The results indicate that there is no benefit in classification error performance when weighted combinations of feature spaces are employed. This is in agreement with past work by Lewis et al. [23] and Girolami and Zhong [16]. The clear benefit, however, remains the ability to infer the relative significance of the sources and hence gain a better understanding of the problem.

4.3.2. Protein fold classification

The second multi-feature dataset we examine was first described by Ding and Dubchak [13] and is a protein fold classification problem with C = 27 classes and S = 6 available feature spaces (approx.
20-D each). The dataset is available online7 and is divided into a training set of size N = 313 and an independent test set with N_test = 385. We use the original dataset and do not consider the feature modifications (extra features, modification of existing ones and omission of four objects) suggested recently in the work of Shen and Chou [32] in order to increase prediction accuracy. In Table 8 we report the performances of the individual classifiers trained on each feature space, in Table 9 the classifier combination performances and in Table 10 the kernel combination approaches. Furthermore, in Table 11 we give the corresponding CPU time requirements between

Fig. 8. Typical combinatorial weights from the Mean (left) and Product (right) composite kernels.

7 http://crd.lbl.gov/~cding/protein/
8 Notice that for the product composite kernel the combinatorial weights are not restricted to the simplex.

5. Discussion

A multinomial probit composite kernel classifier, based on a well-founded hierarchical Bayesian framework and able to instructively combine feature spaces, has been presented in this work. The full MCMC MH-within-Gibbs sampling approach allows for exact inference to be performed, and the proposed variational Bayes approximation significantly reduces computational requirements while retaining the same levels of performance. The proposed methodology enables inference to be performed at three significant levels by
Fig. 9. Typical combinatorial weights from the Mean (left) and Product (right) composite kernels for the protein fold dataset (feature spaces: Comp, Hydro, Pol, Polz, Str, Vol).
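The p-values quoted in the experimental comparisons come from paired tests over the repeated trials; a minimal sketch of such a comparison (the per-trial error arrays below are synthetic stand-ins, not the paper's recorded results):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Synthetic per-trial error percentages over 50 repeated trials; stand-ins
# for two methods being compared (e.g. an individual classifier versus a
# combination rule).
err_a = rng.normal(loc=7.3, scale=2.0, size=50)
err_b = rng.normal(loc=5.1, scale=1.7, size=50)
t_stat, p_value = ttest_rel(err_a, err_b)  # paired t-test across the trials
significant = p_value < 0.05
```

Pairing across trials matters here because the same random train/test splits are shared by the methods being compared, which removes the split-to-split variance from the comparison.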
learning the combinatorial weights associated with each feature space, the kernel parameters associated with each dimension of the feature spaces and, finally, the parameters/regressors associated with the elements of the composite feature space.

An explicit comparison between Gibbs sampling and the VB approximation has been presented. The approximate solution was found not to worsen significantly the classification performance, and the resulting decision boundaries were similar to those of the MCMC approach.

Acknowledgments

This work is sponsored by NCR Financial Solutions Group Ltd. The authors would particularly like to acknowledge the help and support of Dr. Gary Ross and Dr. Chao He of NCR Labs. The second author is supported by an EPSRC Advanced Research Fellowship (EP/E052029/1).

Appendix A. Gibbs sampler

The conditional distribution for the auxiliary variables Y is a product of N C-dimensional conically truncated Gaussians given by

\prod_{n=1}^{N} \delta(y_{in} > y_{jn} \;\forall j \neq i)\, \delta(t_n = i)\, \mathcal{N}_{y_n}(W k_n^{\beta\Theta}, I)

The Metropolis–Hastings acceptance ratio for a proposed set of kernel parameters Θ* given the current Θ^i is

\min\left(1, \; \frac{\prod_{s=1}^{S}\prod_{d=1}^{D_s} (\theta_{sd}^*)^{\varsigma-1} e^{-\omega \theta_{sd}^*} \prod_{c=1}^{C}\prod_{n=1}^{N} \mathcal{N}_{y_{cn}}(w_c k_n^{\beta\Theta^*}, 1)}{\prod_{s=1}^{S}\prod_{d=1}^{D_s} (\theta_{sd}^i)^{\varsigma-1} e^{-\omega \theta_{sd}^i} \prod_{c=1}^{C}\prod_{n=1}^{N} \mathcal{N}_{y_{cn}}(w_c k_n^{\beta\Theta^i}, 1)}\right)

Finally, the Monte Carlo estimate of the predictive distribution is used to assign the class probabilities according to a number of samples L drawn from the predictive distribution which, considering
Eq. (4) and substituting for the test composite kernel k_n^{*βΘ}, defines the estimate of the predictive distribution as

P(t^* = c \,|\, x^*) = \frac{1}{L} \sum_{l=1}^{L} E_{p(u)}\left\{ \prod_{j \neq c} \Phi\left( u + (w_c^l - w_j^l)\, k^{*\beta^l \Theta^l} \right) \right\}

Appendix B. Approximate posterior distributions

We give here the full derivations of the approximate posteriors for the model under the VB approximation. We employ a first-order approximation for the kernel parameters Θ, i.e. E_{Q(\Theta)}\{K^{i\theta_i} K^{j\theta_j}\} \approx K^{i\tilde{\theta}_i} K^{j\tilde{\theta}_j}, to avoid nonlinear contributions to the expectation. The same approximation is applied, in the product composite kernel case, to the combinatorial parameters β, where E_{Q(\beta)}\{K^{i\beta_i} K^{j\beta_j}\} \approx K^{i\tilde{\beta}_i} K^{j\tilde{\beta}_j}.

B.1. Q(Y)

Q(Y) \propto \exp\{E_{Q(W)Q(\beta)Q(\Theta)}\{\log p(t|Y) + \log p(Y|W, \beta, \Theta)\}\}
\propto \prod_{n=1}^{N} \delta(y_{in} > y_{kn} \;\forall k \neq i)\,\delta(t_n = i)\, \exp\{E_{Q(W)Q(\beta)Q(\Theta)} \log p(Y|W, \beta, \Theta)\}

where the exponential term can be analysed as follows:

\exp\{E_{Q(W)Q(\beta)Q(\Theta)} \log p(Y|W, \beta, \Theta)\} = \exp\left\{ E \sum_{n=1}^{N} \log \mathcal{N}_{y_n}(W k_n^{\beta\Theta}, I) \right\}
= \exp\left\{ E \sum_{n=1}^{N} \log \frac{1}{(2\pi)^{C/2} |I|^{1/2}} \exp\left( -\frac{1}{2} (y_n - W k_n^{\beta\Theta})^T (y_n - W k_n^{\beta\Theta}) \right) \right\}
= \exp\left\{ E \sum_{n=1}^{N} -\frac{1}{2} \left( y_n^T y_n - 2 y_n^T W k_n^{\beta\Theta} + (k_n^{\beta\Theta})^T W^T W k_n^{\beta\Theta} \right) \right\}
= \exp\left\{ \sum_{n=1}^{N} -\frac{1}{2} \left( y_n^T y_n - 2 y_n^T \tilde{W} \tilde{k}_n^{\beta\Theta} + \sum_{i=1}^{S} \sum_{j=1}^{S} \tilde{\beta}_i \tilde{\beta}_j \, (k_n^{i\tilde{\theta}_i})^T \widetilde{W^T W}\, k_n^{j\tilde{\theta}_j} \right) \right\}

where k_n^{i\tilde{\theta}_i} is the nth N-dimensional column vector of the ith base kernel with kernel parameters \tilde{\theta}_i. From this exponential term we can form the posterior distribution as a Gaussian and reach the final expression

Q(Y) \propto \prod_{n=1}^{N} \delta(y_{in} > y_{kn} \;\forall k \neq i)\,\delta(t_n = i)\, \mathcal{N}_{y_n}(\tilde{W} \tilde{k}_n^{\beta\Theta}, I)   (B.1)

which is a C-dimensional conically truncated Gaussian.

B.2. Q(W)

Q(W) \propto \exp\{E_{Q(Y)Q(\beta)Q(Z)Q(\Theta)}\{\log p(Y|W, \beta) + \log p(W|Z)\}\}
= \exp\left\{ E \sum_{c=1}^{C} \log \mathcal{N}_{y_c}(w_c K^{\beta\Theta}, I) + E \sum_{c=1}^{C} \log \mathcal{N}_{w_c}(0, Z_c) \right\}
= \exp\left\{ \sum_{c=1}^{C} -\frac{1}{2} \left( \tilde{y}_c \tilde{y}_c^T - 2 \tilde{y}_c \tilde{K}^{\beta\Theta} w_c^T + w_c \left( \sum_{i=1}^{S} \sum_{j=1}^{S} \tilde{\beta}_i \tilde{\beta}_j K^{i\tilde{\theta}_i} K^{j\tilde{\theta}_j} \right) w_c^T + \sum_{n=1}^{N} \log \tilde{\zeta}_{cn} + w_c \tilde{Z}_c^{-1} w_c^T \right) \right\}

Again we can form the posterior expectation as a new Gaussian:

Q(W) \propto \prod_{c=1}^{C} \mathcal{N}_{w_c}(\tilde{y}_c \tilde{K}^{\beta\Theta} V_c, V_c)   (B.2)

where V_c is the covariance matrix defined as

V_c = \left( \sum_{i=1}^{S} \sum_{j=1}^{S} \tilde{\beta}_i \tilde{\beta}_j K^{i\tilde{\theta}_i} K^{j\tilde{\theta}_j} + \tilde{Z}_c^{-1} \right)^{-1}   (B.3)

and \tilde{Z}_c is a diagonal matrix of the expected variances \tilde{\zeta}_{c1}, \ldots, \tilde{\zeta}_{cN} for each class.

B.3. Q(Z)

Q(Z) \propto \exp\{E_{Q(W)}(\log p(W|Z) + \log p(Z|\tau, \upsilon))\} = \exp\{E_{Q(W)}(\log p(W|Z))\}\, p(Z|\tau, \upsilon)

Analysing the exponential term only:

\exp\left\{ E_{Q(W)} \sum_{c=1}^{C} \sum_{n=1}^{N} \log \mathcal{N}_{w_{cn}}(0, \zeta_{cn}) \right\}
= \exp\left\{ E_{Q(W)} \sum_{c=1}^{C} \sum_{n=1}^{N} \left( -\frac{1}{2} \log \zeta_{cn} - \frac{1}{2} w_{cn}^2 \zeta_{cn}^{-1} \right) \right\}
= \prod_{c=1}^{C} \prod_{n=1}^{N} \zeta_{cn}^{-1/2} \exp\left\{ -\frac{1}{2} \frac{\tilde{w}_{cn}^2}{\zeta_{cn}} \right\}
which, combined with the p(Z|τ, υ) prior Gamma distribution, leads to our expected posterior distribution:

Q(Z) \propto \prod_{c=1}^{C} \prod_{n=1}^{N} \zeta_{cn}^{-(\tau + 1/2) + 1} \, \frac{\upsilon^{\tau}}{\Gamma(\tau)} \, e^{-\zeta_{cn}^{-1}(\upsilon + (1/2)\tilde{w}_{cn}^2)}
= \prod_{c=1}^{C} \prod_{n=1}^{N} \mathrm{Gamma}\left( \zeta_{cn}^{-1} \,\middle|\, \tau + \frac{1}{2}, \; \upsilon + \frac{1}{2} \tilde{w}_{cn}^2 \right)

B.4. Q(β), Q(ρ), Q(Θ), Q(π), Q(ν)

The approximate posteriors for the combinatorial weights, their hyper-parameters and the kernel parameters are obtained by importance sampling. For ρ the importance weights are

W(\rho^{i'}) = \frac{\mathrm{Dir}(\tilde{\beta}\,|\,\rho^{i'}) \prod_{s=1}^{S} \mathrm{Gamma}(\rho_s^{i'} \,|\, \mu, \lambda)}{\sum_{i=1}^{I} \mathrm{Dir}(\tilde{\beta}\,|\,\rho^{i}) \prod_{s=1}^{S} \mathrm{Gamma}(\rho_s^{i} \,|\, \mu, \lambda)} = \frac{Q^*(\rho^{i'})}{\sum_{i=1}^{I} Q^*(\rho^{i})}

where Gamma and Dir are the Gamma and Dirichlet distributions, I is the total number of samples of ρ taken until now from the product of Gamma distributions, and i' denotes the current (last) sample. We can then estimate any function f of ρ via

\tilde{f}(\rho) = \sum_{i=1}^{I} f(\rho^i)\, W(\rho^i)

In the same manner, for Q(β) and Q(Θ) we can use the unnormalized posteriors Q*(β) and Q*(Θ), where p(β|ρ, Y, W) ∝ p(Y|W, β) p(β|ρ) and p(Θ|ς, ω, Y, W) ∝ p(Y|W, Θ) p(Θ|ς, ω), with importance weights defined as

W(\beta^{i'}) = \frac{\mathrm{Dir}(\beta^{i'} \,|\, \tilde{\rho}) \prod_{n=1}^{N} \mathcal{N}_{\tilde{y}_n}(\tilde{W} k_n^{\beta^{i'}}, I)}{\sum_{i=1}^{I} \mathrm{Dir}(\beta^{i} \,|\, \tilde{\rho}) \prod_{n=1}^{N} \mathcal{N}_{\tilde{y}_n}(\tilde{W} k_n^{\beta^{i}}, I)}

and

W(\Theta^{i'}) = \frac{p(\Theta^{i'} \,|\, \varsigma, \omega) \prod_{n=1}^{N} \mathcal{N}_{\tilde{y}_n}(\tilde{W} k_n^{\Theta^{i'}}, I)}{\sum_{i=1}^{I} p(\Theta^{i} \,|\, \varsigma, \omega) \prod_{n=1}^{N} \mathcal{N}_{\tilde{y}_n}(\tilde{W} k_n^{\Theta^{i}}, I)}

and again we can estimate any functions g of β and h of Θ as

\tilde{g}(\beta) = \sum_{i=1}^{I} g(\beta^i)\, W(\beta^i) \quad \text{and} \quad \tilde{h}(\Theta) = \sum_{i=1}^{I} h(\Theta^i)\, W(\Theta^i)

For the product composite kernel case we proceed in the same manner, by importance sampling for Q(π) and Q(ν).

Appendix C. Posterior expectations of each y_n

As we have shown, Q(Y) \propto \prod_{n=1}^{N} \delta(y_{in} > y_{kn} \;\forall k \neq i)\,\delta(t_n = i)\,\mathcal{N}_{y_n}(\tilde{W}\tilde{k}_n^{\beta\Theta}, I). Hence Q(y_n) is a truncated multivariate Gaussian distribution and we need to calculate the correction to the normalizing term Z_n caused by the truncation. Thus, the posterior expectation can be expressed as

Q(y_n) = Z_n^{-1} \prod_{c=1}^{C} \mathcal{N}_{y_{cn}}^{t_n}(\tilde{w}_c \tilde{k}_n^{\beta\Theta}, 1)

where the superscript t_n indicates the truncation needed so that the appropriate dimension i (since t_n = i \Leftrightarrow y_{in} > y_{jn} \;\forall j \neq i) is the largest.

Now, Z_n = P(y_n \in \mathcal{C}) where \mathcal{C} = \{y_n : y_{in} > y_{jn}\}, hence

\tilde{y}_{cn} = Z_n^{-1} \int_{-\infty}^{+\infty} \int_{-\infty}^{y_{in}} y_{cn}\, \mathcal{N}_{y_{cn}}(\tilde{w}_c \tilde{k}_n^{\beta\Theta}, 1)\, \mathcal{N}_{y_{in}}(\tilde{w}_i \tilde{k}_n^{\beta\Theta}, 1) \prod_{j \neq i,c} \Phi(y_{in} - \tilde{w}_j \tilde{k}_n^{\beta\Theta})\, dy_{cn}\, dy_{in}
= \tilde{w}_c \tilde{k}_n^{\beta\Theta} - Z_n^{-1}\, E_{p(u)}\left\{ \mathcal{N}_u(\tilde{w}_c \tilde{k}_n^{\beta\Theta} - \tilde{w}_i \tilde{k}_n^{\beta\Theta}, 1) \prod_{j \neq i,c} \Phi(u + \tilde{w}_i \tilde{k}_n^{\beta\Theta} - \tilde{w}_j \tilde{k}_n^{\beta\Theta}) \right\}

For the ith class the posterior expectation \tilde{y}_{in} (the auxiliary variable associated with the known class of the nth object) is given by

\tilde{y}_{in} = Z_n^{-1} \int_{-\infty}^{+\infty} y_{in}\, \mathcal{N}_{y_{in}}(\tilde{w}_i \tilde{k}_n^{\beta\Theta}, 1) \prod_{j \neq i} \Phi(y_{in} - \tilde{w}_j \tilde{k}_n^{\beta\Theta})\, dy_{in}
= \tilde{w}_i \tilde{k}_n^{\beta\Theta} + Z_n^{-1}\, E_{p(u)}\left\{ u \prod_{j \neq i} \Phi(u + \tilde{w}_i \tilde{k}_n^{\beta\Theta} - \tilde{w}_j \tilde{k}_n^{\beta\Theta}) \right\}
= \tilde{w}_i \tilde{k}_n^{\beta\Theta} + \sum_{c \neq i} \left( \tilde{w}_c \tilde{k}_n^{\beta\Theta} - \tilde{y}_{cn} \right)

where we have made use of the fact that, for a variable u ~ N(0, 1) and any differentiable function g(u), E{u g(u)} = E{g'(u)}.

Appendix D. Predictive distribution

In order to make a prediction t* for a new point x* we need to know

p(t^* = c \,|\, x^*, X, t) = \int p(t^* = c \,|\, y^*)\, p(y^* \,|\, x^*, X, t)\, dy^* = \int \delta_c^*\, p(y^* \,|\, x^*, X, t)\, dy^*

Hence we need to evaluate

p(y^* \,|\, x^*, X, t) = \int p(y^* \,|\, W, x^*)\, p(W \,|\, X, t)\, dW = \prod_{c=1}^{C} \int \mathcal{N}_{y_c^*}(w_c K^*, I)\, \mathcal{N}_{w_c}(\tilde{y}_c \tilde{K} V_c, V_c)\, dw_c
We proceed by analysing the integral, gathering all the terms de- following expression for the lower bound:
pending on wc , completing the square twice and reforming to
C N
NC 1 ## 3
C = − log 2+ − {y2cn + knT w!
T ycn w
c wc kn −2, , c kn }
$
3 ,V3 ) ∗ ∗ 2 2
p(y∗ |x∗ , X, t) = yc KKc K∗ V
Nyc∗ (, c c
c=1 n=1
C C
c=1 NC 1# 1#
− log 2+ − log |Zc | − , c Z−1
w c w, cT
× Nwc ((yc∗ K∗T + ,
yc K)Kc , Kc ) (D.1) 2 2 2
c=1 c=1
C N
with 1# # NC
− Tr[Z−1
\tilde{V}_c^{*} = (I - K^{*T} H_c K^{*})^{-1} \quad (N_{test} \times N_{test})

and

H_c = (K^{*} K^{*T} + V_c^{-1})^{-1} \quad (N \times N) \qquad (D.2)

Finally we can simplify \tilde{V}_c^{*} by applying the Woodbury identity and reduce its form to

\tilde{V}_c^{*} = I + K^{*T} V_c K^{*} \quad (N_{test} \times N_{test})

Now the Gaussian distribution with respect to w_c integrates to one and we are left with

p(y^{*} | x^{*}, X, t) = \prod_{c=1}^{C} \mathcal{N}_{y_c^{*}}(\tilde{m}_c^{*}, \tilde{V}_c^{*}) \qquad (D.3)

where \tilde{m}_c^{*} = \tilde{y}_c K V_c K^{*} \quad (1 \times N_{test}).

Hence we can go back to the predictive distribution and consider the case of a single test point (i.e. N_{test} = 1) with associated scalars \tilde{m}_c^{*} and \tilde{v}_c^{*}:

p(t^{*} = c | x^{*}, X, t) = \int \delta_c \prod_{c'=1}^{C} \mathcal{N}_{y_{c'}^{*}}(\tilde{m}_{c'}^{*}, \tilde{v}_{c'}^{*}) \, dy^{*}

where \delta_c denotes the indicator of the event y_c^{*} > y_j^{*} for all j \neq c, so that

p(t^{*} = c | x^{*}, X, t)
= \int_{-\infty}^{+\infty} \mathcal{N}_{y_c^{*}}(\tilde{m}_c^{*}, \tilde{v}_c^{*}) \prod_{j \neq c} \int_{-\infty}^{y_c^{*}} \mathcal{N}_{y_j^{*}}(\tilde{m}_j^{*}, \tilde{v}_j^{*}) \, dy_j^{*} \, dy_c^{*}
= \int_{-\infty}^{+\infty} \mathcal{N}_{y_c^{*} - \tilde{m}_c^{*}}(0, \tilde{v}_c^{*}) \prod_{j \neq c} \int_{-\infty}^{y_c^{*} - \tilde{m}_j^{*}} \mathcal{N}_{y_j^{*} - \tilde{m}_j^{*}}(0, \tilde{v}_j^{*}) \, dy_j^{*} \, dy_c^{*}

Setting u = (y_c^{*} - \tilde{m}_c^{*}) \, \tilde{v}_c^{*-1/2} and x = (y_j^{*} - \tilde{m}_j^{*}) \, \tilde{v}_j^{*-1/2} we have

p(t^{*} = c | x^{*}, X, t)
= \int_{-\infty}^{+\infty} \mathcal{N}_u(0, 1) \prod_{j \neq c} \int_{-\infty}^{(u \tilde{v}_c^{*1/2} + \tilde{m}_c^{*} - \tilde{m}_j^{*}) \tilde{v}_j^{*-1/2}} \mathcal{N}_x(0, 1) \, dx \, du
= \mathcal{E}_{p(u)} \left\{ \prod_{j \neq c} \Phi\!\left( \frac{u \tilde{v}_c^{*1/2} + \tilde{m}_c^{*} - \tilde{m}_j^{*}}{\tilde{v}_j^{*1/2}} \right) \right\}
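The two computational steps above admit a quick numerical check. The sketch below uses synthetic stand-ins for V_c, K* and the predictive means (illustrative values only, not the fitted quantities of the paper) to confirm the Woodbury simplification of the predictive covariance and to estimate the final probit expectation by sampling u ~ N(0, 1):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
N, C = 20, 3                       # training points, classes; a single test point

# Synthetic stand-ins (illustration only):
K_star = rng.standard_normal((N, 1))                    # K* (N x Ntest, Ntest = 1)
V = []
for _ in range(C):                                      # posterior covariances V_c
    A = rng.standard_normal((N, N))
    V.append(A @ A.T / N + np.eye(N))                   # symmetric positive definite

# Woodbury check: (I - K*^T H_c K*)^{-1} with H_c = (K* K*^T + V_c^{-1})^{-1}
# must equal I + K*^T V_c K*.
for Vc in V:
    Hc = np.linalg.inv(K_star @ K_star.T + np.linalg.inv(Vc))
    lhs = np.linalg.inv(np.eye(1) - K_star.T @ Hc @ K_star)
    rhs = np.eye(1) + K_star.T @ Vc @ K_star
    assert np.allclose(lhs, rhs)

# Predictive scalars for the single test point:
m = rng.standard_normal(C)                              # means m~*_c (synthetic)
v = np.array([(np.eye(1) + K_star.T @ Vc @ K_star).item() for Vc in V])

def Phi(z):                                             # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# p(t* = c) = E_{p(u)} prod_{j != c} Phi((u v_c^{1/2} + m_c - m_j) / v_j^{1/2})
S = 20000
u = rng.standard_normal(S)
probs = np.empty(C)
for c in range(C):
    prod = np.ones(S)
    for j in range(C):
        if j != c:
            z = (u * np.sqrt(v[c]) + m[c] - m[j]) / np.sqrt(v[j])
            prod *= np.array([Phi(t) for t in z])
    probs[c] = prod.mean()

print(probs, probs.sum())   # Monte Carlo class probabilities
```

Since the C events "y*_c is largest" partition the sample space, the C estimates should sum to approximately one, which serves as a sanity check on the derivation.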
Appendix E. Lower bound

From Jensen's inequality in Eq. (5) and by conditioning on current values of \beta, \Theta, Z, \rho we can derive the variational lower bound using the relevant components

\mathcal{E}_{Q(Y)Q(W)}\{\log p(Y | W, \beta, X)\} \qquad (E.1)
+ \mathcal{E}_{Q(Y)Q(W)}\{\log p(W | Z, X)\} \qquad (E.2)
- \mathcal{E}_{Q(Y)}\{\log Q(Y)\} \qquad (E.3)
- \mathcal{E}_{Q(W)}\{\log Q(W)\} \qquad (E.4)

which, by noting that the expectation of a quadratic form under a Gaussian is another quadratic form plus a constant, leads to

- \frac{NC}{2} \log 2\pi - \frac{1}{2} \sum_{n=1}^{N} \sum_{c=1}^{C} \left( \langle y_{cn}^2 \rangle - 2 \tilde{y}_{cn} \tilde{w}_c^T k_n + k_n^T \tilde{w}_c \tilde{w}_c^T k_n + k_n^T V_c k_n \right)
- \frac{NC}{2} \log 2\pi - \frac{1}{2} \sum_{c=1}^{C} \log |Z_c| - \frac{1}{2} \sum_{c=1}^{C} \mathrm{Tr}[Z_c^{-1} V_c] - \frac{1}{2} \sum_{c=1}^{C} \tilde{w}_c^T Z_c^{-1} \tilde{w}_c
+ \sum_{n=1}^{N} \log Z_n + \frac{NC}{2} \log 2\pi + \frac{1}{2} \sum_{n=1}^{N} \sum_{c=1}^{C} \left( \langle y_{cn}^2 \rangle - 2 \tilde{y}_{cn} \tilde{w}_c^T k_n + k_n^T \tilde{w}_c \tilde{w}_c^T k_n \right)
+ \frac{NC}{2} \log 2\pi + \frac{NC}{2} + \frac{1}{2} \sum_{c=1}^{C} \log |V_c|

which simplifies to our final expression

\text{Lower Bound} = \frac{NC}{2} + \frac{1}{2} \sum_{c=1}^{C} \log |V_c| + \sum_{n=1}^{N} \log Z_n \qquad (E.5)
- \frac{1}{2} \sum_{c=1}^{C} \mathrm{Tr}[Z_c^{-1} V_c] - \frac{1}{2} \sum_{c=1}^{C} \tilde{w}_c^T Z_c^{-1} \tilde{w}_c \qquad (E.6)
- \frac{1}{2} \sum_{c=1}^{C} \log |Z_c| - \frac{1}{2} \sum_{c=1}^{C} \sum_{n=1}^{N} k_n^T V_c k_n \qquad (E.7)

since the \log 2\pi terms and the quadratic terms in \langle y_{cn}^2 \rangle and \tilde{y}_{cn} cancel.
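The cancellations behind the final expression can be verified numerically. The sketch below fills every symbol (the posterior means and covariances, the priors Z_c, the kernel rows k_n, the normalisers Z_n, and the moments that cancel) with synthetic values; all quantities are illustrative stand-ins, not fitted model output, and the check holds for arbitrary values of the cancelling terms:

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 15, 4

def spd(n):                                   # random symmetric positive definite
    A = rng.standard_normal((n, n))
    return A @ A.T / n + np.eye(n)

K   = rng.standard_normal((N, N))             # row n plays the role of k_n^T
V   = [spd(N) for _ in range(C)]              # posterior covariances V_c
Z   = [spd(N) for _ in range(C)]              # prior covariances Z_c
w   = [rng.standard_normal(N) for _ in range(C)]   # posterior means w~_c
Zn  = rng.uniform(0.1, 1.0, N)                # truncated-Gaussian normalisers Z_n
Ey2 = rng.standard_normal((C, N)) ** 2        # <y_cn^2>  (cancels in the bound)
yt  = rng.standard_normal((C, N))             # y~_cn     (cancels in the bound)

logdet = lambda M: np.linalg.slogdet(M)[1]
l2pi = np.log(2 * np.pi)

# (E.1) term: expected log likelihood
E1 = (-N * C / 2 * l2pi
      - 0.5 * sum(Ey2[c, n] - 2 * yt[c, n] * (w[c] @ K[n])
                  + (w[c] @ K[n]) ** 2 + K[n] @ V[c] @ K[n]
                  for c in range(C) for n in range(N)))
# (E.2) term: expected log prior
E2 = (-N * C / 2 * l2pi - 0.5 * sum(logdet(Zc) for Zc in Z)
      - 0.5 * sum(np.trace(np.linalg.solve(Zc, Vc)) for Zc, Vc in zip(Z, V))
      - 0.5 * sum(wc @ np.linalg.solve(Zc, wc) for Zc, wc in zip(Z, w)))
# (E.3) term, including its sign: -E{log Q(Y)}
E3 = (np.log(Zn).sum() + N * C / 2 * l2pi
      + 0.5 * sum(Ey2[c, n] - 2 * yt[c, n] * (w[c] @ K[n]) + (w[c] @ K[n]) ** 2
                  for c in range(C) for n in range(N)))
# (E.4) term, including its sign: -E{log Q(W)}
E4 = N * C / 2 * l2pi + N * C / 2 + 0.5 * sum(logdet(Vc) for Vc in V)

# Final expression (E.5)-(E.7)
bound = (N * C / 2 + 0.5 * sum(logdet(Vc) for Vc in V) + np.log(Zn).sum()
         - 0.5 * sum(np.trace(np.linalg.solve(Zc, Vc)) for Zc, Vc in zip(Z, V))
         - 0.5 * sum(wc @ np.linalg.solve(Zc, wc) for Zc, wc in zip(Z, w))
         - 0.5 * sum(logdet(Zc) for Zc in Z)
         - 0.5 * sum(K[n] @ Vc @ K[n] for Vc in V for n in range(N)))

assert np.isclose(E1 + E2 + E3 + E4, bound)   # the cancellations hold
print(float(bound))
```

Because the moment terms appear with opposite signs in the (E.1) and (E.3) contributions, their synthetic values drop out and the four-term sum agrees with (E.5)-(E.7) exactly.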
About the Author—THEODOROS DAMOULAS was awarded the NCR Ltd. Ph.D. Scholarship in 2006 in the Department of Computing Science at the University of Glasgow. He
is a member of the Inference Research Group (http://www.dcs.gla.ac.uk/inference) and his research interests are in statistical machine learning and pattern recognition. He
holds a 1st class M.Eng. degree in Mechanical Engineering from the University of Manchester and an M.Sc. in Informatics (Distinction) from the University of Edinburgh.
About the Author—MARK A. GIROLAMI is Professor of Computing and Inferential Science and holds a joint appointment in the Department of Computing Science and
the Department of Statistics at the University of Glasgow. His various research interests lie at the boundary between computing, statistical, and biological science. He
currently holds a prestigious five-year (2007-2012) Advanced Research Fellowship from the Engineering & Physical Sciences Research Council (EPSRC). In 2009 the
International Society of Photo-Optical Instrumentation Engineers (SPIE) presented him with their Pioneer Award for his contributions to Biomedical Signal Processing and the
development of analytical tools for EEG and fMRI which revolutionized the automated processing of such signals. In 2005 he was awarded an MRC Discipline Hopping Award
in the Department of Biochemistry and during 2000 he was the TEKES visiting professor at the Laboratory of Computing and Information Science in Helsinki University
of Technology. In 1998 and 1999 Professor Girolami was a research fellow at the Laboratory for Advanced Brain Signal Processing in the Brain Science Institute, RIKEN,
Wako-Shi, Japan. He has been a visiting researcher at the Computational Neurobiology Laboratory (CNL) of the Salk Institute. The main focus of the seven post-doctoral
researchers and eight Ph.D. students in his group is in the development of statistical inferential methodology to support reasoning about biological mechanisms within a
systems biology context.