
Pattern Recognition 46 (2013) 914–924


Multitask multiclass support vector machines: Model and experiments


You Ji, Shiliang Sun *
Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, PR China

ARTICLE INFO

Article history:
Received 5 January 2012
Received in revised form 29 June 2012
Accepted 12 August 2012
Available online 23 August 2012

Keywords:
Multiclass classification
Multitask learning
Support vector machine
Kernel
Regularization

ABSTRACT

Multitask learning, or learning multiple related tasks simultaneously, has shown better performance than learning these tasks independently. Most approaches to multitask multiclass problems decompose them into multiple multitask binary problems, and thus cannot effectively capture inherent correlations between classes. Although very elegant, traditional multitask support vector machines are restricted by the fact that different learning tasks have to share the same set of classes. In this paper, we present an approach to multitask multiclass support vector machines based on the minimization of regularization functionals. We cast multitask multiclass problems into a constrained optimization problem with a quadratic objective function. Therefore, our approach can learn multitask multiclass problems directly and effectively. This approach can learn in two different scenarios: label-compatible and label-incompatible multitask learning. We can easily generalize the linear multitask learning method to the non-linear case using kernels. A number of experiments, including comparisons with other multitask learning methods, indicate that our approach for multitask multiclass problems is very encouraging.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

The principal goal of multitask learning (MTL) is to improve the generalization performance of learners by leveraging the domain-specific information contained in the related tasks [1]. One way to reach this goal is to learn multiple related tasks simultaneously while using a shared representation. In fact, the training signals for the extra tasks serve as an inductive bias which is helpful for learning multiple complex tasks together [1]. Past empirical work has shown that, when there are relations between the tasks to learn, it is beneficial to learn them simultaneously instead of learning each task independently [1–4]. There are also many attempts to study MTL theoretically [5–7]. MTL methods can be natural extensions of existing kernel-based learning methods. In [6], it is demonstrated that methods based on multitask kernels can lead to significant performance gains.

A successful instance of MTL is its adaptation to support vector machines (SVMs) by multitask kernel methods. The usual way to solve multitask multiclass problems using regularized MTL was to reduce a multitask multiclass problem into multitask binary classification problems. Multiclass classification strategies such as one-against-one (one2one) and one-against-the-rest (one2all) were then used to solve the multiclass problems. However, while this provides a simple and powerful framework, it cannot capture correlations between different classes [8]. As a consequence, the multiclass extensions of multitask binary SVMs require all different tasks to share the same set of labels. This means MTL binary SVMs cannot be used in label-incompatible MTL scenarios, which is a serious limitation for multitask SVMs when the tasks do not share the same classes [3].

Crammer and Singer proposed multiclass kernel-based vector machines (MKVMs), which not only capture these correlations but also solve multiclass classification in one dual optimization problem instead of reducing multiclass problems to multiple binary problems [8]. This method is a generalization of separating hyperplanes and the notion of margins to multiclass problems. We follow the intuition of MKVMs and extend them to the multitask setting. We develop and discuss in detail a direct approach for learning multitask multiclass SVMs (M2SVMs) on the basis of MKVMs. Until recently, multitask support vector learning algorithms had been designed for multitask binary problems [5]. This paper mainly attempts to generalize multitask binary SVMs to M2SVMs. We cast M2SVMs into a constrained optimization problem with a quadratic objective function which can solve multitask multiclass problems at one time.

The M2SVMs proposed in this paper can not only capture correlations between different classes and relationships between related tasks, but also extend multitask SVMs to the label-incompatible MTL scenario. Experimental results demonstrate the effectiveness of the proposed method. A preliminary result was reported previously [9]; this paper substantially extends that work.

* Corresponding author. Tel.: +86 21 54345186; fax: +86 21 54345119. E-mail address: slsun@cs.ecnu.edu.cn (S. Sun).


The rest of this paper is organized as follows. Section 2 briefly reviews work on multiclass problems and multitask learning. Section 3 describes the M2SVMs model and the process of its derivation in detail. Some extensions, including non-linear multitask learning and label-incompatible multitask learning, are also introduced there. Experimental results are reported in Section 4. Conclusions and future work are presented in Section 5.

2. Related work

M2SVMs build on previous research on regularized multitask learning and multiclass learning. In this section, related work on regularized multitask learning and multiclass learning is briefly introduced as preliminary knowledge, from which we reach our M2SVMs approach.

2.1. Regularized multitask learning

In [1], Caruana proposed the MTL approach which learns tasks in parallel while using a shared representation. A statistical learning theory based approach to MTL was introduced in [10,11]. In [10], Baxter proposed the notion of the extended VC dimension to derive generalization bounds for MTL. In [5,11], the extended VC dimension was used to derive tighter bounds that hold for each task. Butko and Movellan used a probability model to capture the relations of MTL [12]. A mixture of Gaussians leads to clustering these tasks [13]. Regularized MTL presented by Evgeniou et al. [5] extends single task SVMs to MTL. After that, MTL kernel learning methods were introduced in [6,14]. Universal multitask kernels were introduced in [15]. In [16], a boosted algorithm is used to improve the performance of MTL. Our work builds on previous research and advances in MTL SVMs. Here we introduce regularized MTL SVMs (more details can be found in [5]).

Suppose that there are T different but related tasks. All data for the T tasks come from the same space X × Y where X ⊂ R^d and Y = {−1, 1}. For simplicity, we assume that each example belongs to one task. The input–output pair (x_i, y_i) (i ∈ {1, 2, 3, ..., m}) stands for the ith example, whose task index is given by j(i). m_t stands for the size of the tth task's training set. Regularized MTL learns T classifiers w_1, ..., w_T, where each classifier w_t is specifically dedicated to task t (t ∈ {1, ..., T}) [5]. With w_0 capturing the common information among all tasks, each w_t can be rewritten as w_t = w_0 + v_t. Then v_t and w_0 are obtained by solving the optimization problem

\min_{w_0, v_t, \xi_{t,i}} \; \frac{\lambda_1}{T} \sum_{t=1}^{T} \|v_t\|_2^2 + \lambda_2 \|w_0\|_2^2 + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \xi_{t,i},
\quad \text{s.t.} \;\; \forall t, i: \; y_{t,i} (w_0 + v_t) \cdot x_{t,i} \ge 1 - \xi_{t,i}, \quad \xi_{t,i} \ge 0.   (1)

The constants λ_1 ≥ 0 and λ_2 ≥ 0 control the trade-off between tasks. There are two extreme cases. If λ_2 → +∞ then w_0 → 0, which indicates that all tasks are decoupled and there is no related information between tasks. If λ_2 → 0 and λ_1 → +∞, then v_t → 0, which means all tasks share the same decision function with w_0 [5].
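To make the decomposition w_t = w_0 + v_t concrete, the following minimal Python/NumPy sketch (the vectors here are random placeholders, not solutions of problem (1)) assembles task-specific classifiers from a shared part and task-specific offsets, and applies one of them to an example:

    import numpy as np

    # Hypothetical parameters standing in for a solution of problem (1):
    # a shared vector w0 and one task-specific offset v_t per task.
    rng = np.random.default_rng(0)
    d, T = 5, 3
    w0 = rng.normal(size=d)
    v = rng.normal(size=(T, d))

    w = w0 + v                      # w_t = w_0 + v_t, one row per task
    x = rng.normal(size=d)          # an example belonging to task t
    t = 1
    prediction = np.sign(w[t] @ x)  # binary decision of the t-th task's classifier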
By the strategy of feature maps, the above multitask learning can be generalized to non-linear multitask learning. Kernel methods can be used to model relations between tasks and lead to linear and non-linear multitask learning algorithms [6,14]. A wide variety of kernels which are useful for applications are proposed in [6,14].

Although regularized multitask learning has been shown to be superior to its single task counterpart, it requires all tasks to share the same set of class labels. When it comes to multiclass problems, a common method is to build a set of binary classifiers where each classifier distinguishes one of the labels from the rest (or from one of the other labels). These approaches cannot effectively capture correlations between the different classes since they break a multiclass problem into multiple independent binary problems [8]. Moreover, the traditional MTL binary SVMs cannot solve problems in the label-incompatible MTL scenario. This is a serious limitation for multitask SVMs when all the tasks may not share the same classes [3]. One main purpose of this paper is to extend traditional MTL SVMs to the label-incompatible MTL scenario using a strategy that learns multiclass problems in one dual optimization problem.

The MTL studied in this paper is different from domain adaptation or transfer learning. Domain adaptation and transfer learning have target tasks which need some helpful assistant tasks to gather more useful information to improve their performance. But in this paper, all the tasks have the same status in MTL. There is no source domain or target domain; our target is to improve all tasks' performance. This means all tasks have an equal role in our MTL settings.

2.2. Multiclass learning

Support vector machines are originally designed for binary classification. How to extend binary SVMs to multiclass classification is still an active research topic [17]. There are two ways to solve multiclass classification problems. One way is to construct a multiclass classifier by combining several binary classifiers; there are three methods based on binary problems: one2all, one2one, and DAGSVM [18]. The other way is to consider all classes together [8,19]. It casts a multiclass categorization problem as a constrained optimization problem with a quadratic objective function that yields a direct method for training multiclass predictors [8]. Usually, with this strategy, a much larger dual optimization problem has to be solved. Algorithmic implementations such as MKVMs, which are able to incorporate kernels with a compact set of constraints and decompose the dual problem into multiple optimization problems of reduced size, are proposed in [8,17]. Because this strategy is related to our work, we briefly review it below. A comparison of methods for multiclass support vector machines is given in [17].

MKVMs are generalizations of separating hyperplanes and margins to the scenario of multiclass problems. Suppose (x_i, y_i) ∈ X × Y, where X = R^d, Y = {1, 2, 3, ..., K} and i ∈ {1, ..., m}. This framework uses classifiers of the form

H_M(x) = \arg\max_{k=1}^{K} \{ M_k x \},   (2)

where K is the number of labels, and M is a matrix of size K × d. M_k is the kth row of M, and M_k x_i is the confidence and similarity score for the kth class. Therefore, according to the equation above, the predicted label is the index of the row which attains the highest similarity score [8]. The quadratic optimization problem is defined as

\min_{M} \; \frac{1}{2} \beta \|M\|_2^2 + \sum_{i=1}^{m} \xi_i,
\quad \text{s.t.} \;\; \forall k, i: \; M_{y_i} x_i + \delta_{k, y_i} - M_k x_i \ge 1 - \xi_i,   (3)

where δ_{p,q} is equal to 1 if p = q and 0 otherwise, and β > 0 is a regularization constant. For k = y_i, the inequality constraints become ξ_i ≥ 0. The goal is to find a matrix M that attains a small empirical error on the training examples and also generalizes well. The advantage of this strategy is that it solves multiclass classification problems directly and therefore can effectively capture the information between different classes. Results in [17] also show that for large-scale problems, methods considering all classes at once need fewer support vectors than methods constructed from binary classifiers.
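To illustrate the decision rule in Eq. (2), the following minimal Python/NumPy sketch (with a random placeholder matrix M and example x rather than a trained model) scores every class with one row of M and predicts the row attaining the highest score:

    import numpy as np

    # Hypothetical sizes and values: K classes, d features.
    K, d = 4, 10
    rng = np.random.default_rng(0)
    M = rng.normal(size=(K, d))   # one prototype row per class, as in Eq. (2)
    x = rng.normal(size=d)        # a single example

    scores = M @ x                            # M_k x: similarity score of class k
    predicted_label = int(np.argmax(scores))  # H_M(x) = argmax_k M_k x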
3. Multitask multiclass support vector machines (M2SVMs)

As before, suppose all the data for the T tasks come from the space X × Y where X = R^d, Y = {1, 2, 3, ..., K}. For simplicity, we assume that each example belongs to exactly one task. For each task we have m_t (t ∈ {1, 2, 3, ..., T}) instances sampled from some unknown distribution P_t on X × Y. We assume that the P_t are different from each other but related to one another [5,11]. How to express this kind of relationship between tasks is still an open question; exploiting task relatedness for MTL is discussed in [11]. When T = 1, the problem reduces to single task learning. Our framework uses classifiers of the form

H_{M_t}(x) = \arg\max_{k=1}^{K} \{ M_{t,k} x \}, \quad M_t = M_0 + V_t,   (4)

where M_t is a matrix of size K × d, M_{t,k} stands for the kth row of M_t, and M_t corresponds to the unknown distribution P_t. This implies that all M_t are close to the mean matrix M_0. M_0 stands for the public classification information shared between related tasks, while V_t stands for the tth task's own classification information. When the tasks are very similar to each other, the values of ||V_t||_2^2 are very small (||·||_2^2 stands for the squared 2-norm of matrices). To this end, we solve the following optimization problem, which is similar to Eq. (3):

\min \; \beta \sum_{k=1}^{K} \Big[ \sum_{t=1}^{T} \rho_t \|V_{t,k}\|_2^2 + \|M_{0,k}\|_2^2 \Big] + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \xi_{t,i},
\quad \text{s.t.} \;\; \forall t, k, i: \; (V_{t,y_i} + M_{0,y_i}) x_{t,i} + \delta_{k,y_i} - (V_{t,k} + M_{0,k}) x_{t,i} \ge 1 - \xi_{t,i},   (5)

where δ_{p,q} is equal to 1 if p = q and 0 otherwise, ρ_t is a weighting parameter between V_t and M_0, and β > 0 and ρ_t > 0 are regularization constants. The term Σ_{t=1}^{T} ρ_t||V_{t,k}||_2^2 + ||M_{0,k}||_2^2 provides the regularization for the T tasks. M_0 may play different roles in different problems. For example, some tasks may be highly related with each other, in which case M_0 plays an important role for these tasks. In the other scenario, some tasks may be highly different from each other, and then M_0 is not so important. We therefore add the parameters ρ_t to trade off the public classification information M_0 against the private classification information V_t. For k = y_i, the inequality constraints become ξ_{t,i} ≥ 0. Solving the problem directly is not a good choice; we find it better to solve the dual of Eq. (5). We add a dual set of variables α_{t,k,i}, one for each constraint, and obtain the Lagrangian of the optimization problem as

L(V_{t,k}, M_{0,k}, \xi_{t,i}) = \beta \sum_{k=1}^{K} \Big[ \sum_{t=1}^{T} \rho_t \|V_{t,k}\|_2^2 + \|M_{0,k}\|_2^2 \Big] + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \xi_{t,i}
  - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} \big[ (V_{t,y_i} + M_{0,y_i}) x_{t,i} + \delta_{k,y_i} - (V_{t,k} + M_{0,k}) x_{t,i} - 1 + \xi_{t,i} \big],
\quad \text{s.t.} \;\; \forall t, k, i: \; \alpha_{t,k,i} \ge 0.   (6)

The variable α_{t,k,i} is referred to as the Lagrange multiplier associated with the constraints; the vector α is called a dual variable or Lagrange multiplier. We now seek a saddle point of the Lagrangian, which is the minimum for the primal variables and the maximum for the dual variables α_{t,k,i} [8]. A review of optimization methodologies in support vector machines is given in [21]; it also provides details about convex optimization theory and optimality conditions for generic optimization problems, and analyzes the saddle point of the Lagrangian formulation theoretically. In order to find the minimum over the primal variables V_t, M_0, ξ, we require

\frac{\partial L}{\partial \xi_{t,i}} = 1 - \sum_{k=1}^{K} \alpha_{t,k,i} = 0 \;\Rightarrow\; \sum_{k=1}^{K} \alpha_{t,k,i} = 1.   (7)

Similarly, for M_{0,k}, we require

\frac{\partial L}{\partial M_{0,k}} = 0.   (8)

Since the derivation is rather technical, we defer the complete derivation to Appendix A. The result has the following form:

M_{0,k} = \frac{1}{2\beta} \sum_{t=1}^{T} \sum_{i=1}^{m_t} (\delta_{k,y_i} - \alpha_{t,k,i}) x_{t,i}.   (9)

Next, for V_{t,k}, we need

\frac{\partial L}{\partial V_{t,k}} = 0,   (10)

which results in the form

V_{t,k} = \frac{1}{2\beta\rho_t} \sum_{i=1}^{m_t} (\delta_{k,y_i} - \alpha_{t,k,i}) x_{t,i}.   (11)

Eqs. (9) and (11) imply that the solutions of the optimization problem given by Eq. (5) are matrices whose rows are linear combinations of training examples. Note that, from Eq. (9), the contribution of an instance x_i to M_{0,k} is δ_{k,y_i} − α_{t,k,i}, and the instances of all T tasks contribute to M_{0,k}. For V_{t,k}, the contribution of an instance is also δ_{k,y_i} − α_{t,k,i}, but compared with M_{0,k}, V_{t,k} involves only the tth task's instances. We say that an instance x_i is a support vector if its coefficient is not zero. For the T different but related tasks, M_{0,k} can be rewritten as

M_{0,k} = \sum_{t=1}^{T} \rho_t V_{t,k}.   (12)

This implies that M_0 takes a proportion ρ_t of information from task t to stand for the common information between different tasks. To solve the optimization problem, we have to rewrite L(V_{t,k}, M_{0,k}, ξ_{t,i}) in Eq. (6) as a dual optimization problem. Since the derivation is rather technical, we defer the complete derivation to Appendix B. We obtain the following objective function of the dual program:

\max \; Q(\alpha) = - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} \delta_{k,y_i}
 - \frac{1}{4\beta} \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s} (\delta_{k,y_i} - \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) \Big( 1 + \frac{\delta_{t,s}}{\rho_t} \Big) \langle x_{t,i}, x_{s,j} \rangle.   (13)

If we set m = \sum_{t=1}^{T} m_t and i, j ∈ {1, 2, 3, ..., m}, where m is the training data size and j(i) means that the ith example belongs to the j(i)th task, then t = j(i) and the variables α_{t,k,i} can be replaced with α_{k,i}. We then obtain the following form of the dual program:

\min \; Q(\alpha) = \sum_{k=1}^{K} \sum_{i=1}^{m} \alpha_{k,i} \delta_{k,y_i}
 + \frac{1}{4\beta} \sum_{k=1}^{K} \sum_{i=1}^{m} \sum_{j=1}^{m} (\alpha_{k,i} - \delta_{k,y_i})(\alpha_{k,j} - \delta_{k,y_j}) \Big( 1 + \frac{\delta_{j(i), j(j)}}{\rho_{j(i)}} \Big) \langle x_i, x_j \rangle,
\quad \text{s.t.} \;\; \forall k, i: \; \alpha_{k,i} \ge 0, \;\; \sum_{k=1}^{K} \alpha_{k,i} = 1.   (14)
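To make the coupling term in Eq. (14) concrete, the following Python/NumPy sketch (the toy data, task indices and ρ_t values are assumed purely for illustration) assembles the m × m matrix of coefficients (1 + δ_{j(i),j(j)}/ρ_{j(i)})⟨x_i, x_j⟩ that weights pairs of examples from the same task more strongly than pairs from different tasks:

    import numpy as np

    # Hypothetical toy data: m examples in d dimensions, each belonging to one of T tasks.
    rng = np.random.default_rng(0)
    m, d, T = 6, 5, 2
    X = rng.normal(size=(m, d))
    task = np.array([0, 0, 0, 1, 1, 1])   # j(i): task index of each example
    rho = np.array([2.0, 0.5])            # per-task weights rho_t (assumed values)

    G = X @ X.T                            # plain inner products <x_i, x_j>
    same_task = (task[:, None] == task[None, :]).astype(float)
    Q = (1.0 + same_task / rho[task][:, None]) * G   # (1 + delta_{j(i),j(j)}/rho_{j(i)}) <x_i, x_j>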
Y. Ji, S. Sun / Pattern Recognition 46 (2013) 914–924 917

3.1. Non-linear multitask learning

The dual optimization problem in Eq. (14) has one shortcoming: a linear decision function may be unsuitable for classifying examples, especially when the training sets are not linearly separable. But an important characteristic of SVMs is that they can be used to estimate highly non-linear functions through the use of kernels [5]. This method is named the kernel trick, which can be used to avoid computing feature maps explicitly. In fact, every algorithm based entirely on inner products can be kernelized [20]. By the same strategy as in [5], we extend our method to non-linear multitask learning as

\min \; Q(\alpha) = \sum_{k=1}^{K} \sum_{i=1}^{m} \alpha_{k,i} \delta_{k,y_i}
 + \frac{1}{4\beta} \sum_{k=1}^{K} \sum_{i=1}^{m} \sum_{j=1}^{m} \big[ (\alpha_{k,i} - \delta_{k,y_i})(\alpha_{k,j} - \delta_{k,y_j}) \, \mathcal{K}_{j(i), j(j)}(x_i, x_j) \big],
\quad \text{s.t.} \;\; \forall k, i: \; \alpha_{k,i} \ge 0, \;\; \sum_{k=1}^{K} \alpha_{k,i} = 1,   (15)

where \mathcal{K} stands for the kernel function. It is given as

\mathcal{K}_{j(i) j(j)}(x_i, x_j) = \Big( 1 + \frac{\delta_{j(i), j(j)}}{\rho_{j(i)}} \Big) \langle x_i, x_j \rangle.   (16)

We may replace the dot product with a non-linear kernel, as is done for standard SVMs [5]. It is noteworthy that this non-linear multitask kernel is different from the one in [5]: there, all tasks share the same parameter μ. Because tasks' weights should be allowed to differ in multitask learning, we use a separate parameter ρ_t for each task. We rewrite the classifier as

H_t(x) = \arg\max_{k=1}^{K} \{ M_{t,k} x \}
       = \arg\max_{k=1}^{K} \Big\{ \frac{1}{2\beta} \sum_{i=1}^{m} (\delta_{k,y_i} - \alpha_{k,i}) \, \mathcal{K}_{j(i), t}(x_i, x) \Big\}.   (17)

It is easy to implement our strategy by replacing the kernels used in MKVMs. MKVMs have two implementations: one is described in [8] and the other is given in [17]. In the experiments, we choose the implementation in [17] as our solver.
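The prediction rule in Eq. (17) needs only the dual variables and kernel evaluations. The sketch below (Python/NumPy; the dual variables, data, task indices and ρ_t values are placeholders rather than the output of an actual solver) scores a test point of task t against every class with the multitask kernel of Eq. (16) and returns the arg max:

    import numpy as np

    def multitask_kernel(xi, xj, ti, tj, rho):
        # Eq. (16): (1 + delta_{ti,tj} / rho_{ti}) <xi, xj>
        return (1.0 + (1.0 / rho[ti] if ti == tj else 0.0)) * float(xi @ xj)

    def predict(x, t, X, y, task, alpha, rho, beta, K):
        # Eq. (17): H_t(x) = argmax_k (1/(2*beta)) sum_i (delta_{k,y_i} - alpha_{k,i}) K_{j(i),t}(x_i, x)
        scores = np.zeros(K)
        for k in range(K):
            for i in range(X.shape[0]):
                coef = (1.0 if y[i] == k else 0.0) - alpha[k, i]
                scores[k] += coef * multitask_kernel(X[i], x, task[i], t, rho)
        return int(np.argmax(scores / (2.0 * beta)))

    # Hypothetical toy usage.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 5)); y = np.array([0, 1, 2, 0, 1, 2])
    task = np.array([0, 0, 0, 1, 1, 1]); rho = np.array([1.0, 1.0])
    alpha = np.full((3, 6), 1.0 / 3.0)   # placeholder dual variables; each column sums to 1
    print(predict(rng.normal(size=5), 0, X, y, task, alpha, rho, beta=1.0, K=3))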
3.2. Differences with traditional multitask kernels

One may think that this kind of multitask kernel is identical to the traditional multitask kernel described in [5]. We now point out the differences with [5]. The traditional form of the multitask kernel can be written as

\mathcal{K}_{j(i) j(j)}(x_i, x_j) = \Big( \delta_{j(i), j(j)} + \frac{1}{\mu} \Big) \langle x_i, x_j \rangle.   (18)

Besides the difference in form between (18) and (16), they also have different meanings. The characteristic of the multitask kernel described in [5] is that all the related but different tasks share the same parameter μ, as in (18), while in (16) each task has its own parameter which reflects its relationship with the other related but different tasks.

The traditional multitask kernel has difficulty with the following scenario. Suppose there are T related but different tasks, and task m is related to but very different from the other T−1 tasks, which are themselves coupled very well. When using a parameter μ_m appropriate for task m in (18), task m learns very well while the other T−1 tasks may learn badly because task m is different from them. When using another parameter μ_n appropriate for the other T−1 tasks, those T−1 tasks learn very well because they are well coupled, while task m may not learn very well. If we use (16), this situation is resolved.

Another additional benefit of (16) is that it can be simply extended to multitask clustering [6,14]. For example, task cluster A may have three tasks which take multitask parameters 7, 8, 9 while task cluster B has two tasks which take multitask parameters 100, 90. We may use the multitask parameters to control the tasks' clustering result.

3.3. Label-incompatible multitask learning

Conventionally, the MTL of SVMs requires all tasks to share an identical set of labels [4]. This is a serious limitation in MTL domains, especially when different tasks may not have the same labels. In label-incompatible multitask learning settings, we assume that the same label refers to the same class while different labels refer to different classes.

M2SVMs described in this paper can easily overcome this limitation. Assume there are two different but related tasks A and B. Task A has three labels 1–3 while task B has three labels 2–4. We may make these two related but different tasks share four labels from 1 to 4: task A has no instances of class 4 while task B has no instances of class 1. M2SVMs can then solve this kind of label-incompatible MTL problem (a minimal sketch of this label-union construction is given at the end of this subsection).

Another extreme case is that task A has three labels 1–3 while task B has two labels 4 and 5. In this scenario, M2SVMs perform MTL with five classes, where task A has no instances of classes 4 and 5 and task B has no instances of classes 1–3.

As is known, domain adaptation utilizes labeled data from a source domain to help the learning in a target domain with only a few or even no labeled data. The label-incompatible learning in this paper is different from domain adaptation: there is no division into source domain and target domain in our multitask learning settings. Each task in our label-incompatible experiments has an equal role.
task m is a very different but related task with other T 1 tasks. rjðiÞ
The T 1 tasks are coupled very well. When using a parameter mm
k ðxi, xj Þ ¼ expðgJxi xj J22 Þ, g 4 0,
n
ð19Þ
appropriate for task m in (18), task m learns very well while the
other T 1 tasks may learn badly because task m is different with were k stands for regular kernel function, and g is the kernel
n

other tasks. When using another parameter mn appropriate for the parameter. It should be noted that there are other two parameters
other T  1 tasks, the T 1 tasks learns very well because they are rt , b for our model.
well coupled while task m may not learn very well. If we use (16), It is usually unknown beforehand which parameters are best
this situation will get solved. for problems at hand [22]. We used a grid search strategy to select
918 Y. Ji, S. Sun / Pattern Recognition 46 (2013) 914–924

best parameters which is also recommended in [22]. Various pairs female, we may think that these two tasks are also different
of ðrt , b, gÞ values are tried. Then we adopt the one with the best between each other. In the original dataset, the feature of one
cross-validation results on the training set. In our experiments, spoken digit is a matrix of size row  13 (4 rrow r93). For
firstly we take a coarse grid search for parameters rt , b, g in the simplicity, we compute the average value of each column of this
region ½210 ,214  with exponent growth 0.5. Then we take the matrix. Then we get a vector of size 1  13 to represent one digit’s
finer grid search on the neighborhood of the best results of the feature. Details about these two datasets are summarized in Table 1.
coarse grid search such as ½21 ,23 . The interval exponent growth Results and analysis. We compared the performance of the
is 0.01. Although the process of grid search usually takes a lot of M2SVMs algorithm with different baselines in Tables 2 and 3. ALL
time during training, it is very effective and straightforward means just taking all the T tasks’ data together and using single
method which avoids doing more consuming exhaustive para- task learning to deal with these data with no tasks’ differences.
meter search. MTL is denoted as MTL in Tables 2 and 3. Because MKVMs and
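A minimal sketch of this coarse-to-fine search is given below (Python; the cv_error function is only a stand-in for the cross-validation error of M2SVMs with parameters (ρ_t, β, γ), and the ±0.25 window of the fine pass is an assumption — the text above only fixes the 0.01 step):

    import itertools
    import numpy as np

    def cv_error(rho, beta, gamma):
        # Stand-in for the cross-validation error of M2SVMs with these parameters;
        # a real run would train and evaluate the model here. This toy surrogate
        # only gives the search something smooth to minimize.
        return (np.log2(rho) - 2) ** 2 + (np.log2(beta) - 1) ** 2 + (np.log2(gamma) + 3) ** 2

    def search(centers, radius, step):
        # Try every combination of exponents around the given centers and keep the best one.
        axes = [np.arange(c - radius, c + radius + 1e-9, step) for c in centers]
        return min(itertools.product(*axes),
                   key=lambda e: cv_error(2.0 ** e[0], 2.0 ** e[1], 2.0 ** e[2]))

    # Coarse pass: exponents of (rho_t, beta, gamma) in [-10, 14] with growth 0.5.
    coarse = search(centers=(2.0, 2.0, 2.0), radius=12.0, step=0.5)
    # Fine pass around the coarse optimum with growth 0.01 (the +/- 0.25 window is assumed).
    rho, beta, gamma = (2.0 ** e for e in search(centers=coarse, radius=0.25, step=0.01))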
4.2. Label-compatible MTL experiments

We now evaluate M2SVMs on four datasets taken from the UCI Machine Learning Repository [23]. The first one is Isolet spoken alphabet recognition. The second one is spoken Arabic digit (SAD). The last two are signature verification datasets, one from GPDS [24] and the other from MCYT [25]. Signature verification usually uses different evaluation measures for experimental results [26], and the MTL settings for signature verification are different from those of the previous datasets. Therefore, we split the experiments on these four datasets into two experimental groups: the experiments on the first two datasets form one group while the experiments on signature verification form the other.

4.2.1. Experimental group (I)

In this experimental group, we evaluate M2SVMs on the Isolet spoken alphabet recognition dataset and the SAD dataset. We first provide a brief overview of the two datasets and then present results in various experimental settings. In these experiments, we use 3-fold cross-validation to get the average error rate.

Details about datasets. The Isolet dataset was collected from 150 subjects uttering all English alphabets twice. Each speaker contributed 52 training examples. Because three examples are historically missing, there are 7797 examples in total. The speakers are grouped into five sets of 30 speakers each; these groups are referred to as Isolet1 to Isolet5. The original task is to classify which letter has been uttered based on features such as spectral coefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features. The representation of Isolet lends itself to multitask multiclass learning [4]. Therefore, we treat each of the subsets as its own classification task (T = 5) with K = 26 labels. These five tasks are highly related with each other because all the data are collected from the same utterances [4]. Because groups of speakers vary in the way they speak the English alphabet, the five tasks are different from each other. These five tasks are labeled from 1 to 5. In order to remove low variance noise and to speed up computation, we preprocess the Isolet data with principal component analysis (PCA). We capture 98% of the data variance while reducing the dimensionality from 617 to 290.

The SAD dataset was collected from 88 speakers, 44 male and 44 female Arabic native speakers between ages 18 and 40, to represent the 10 spoken Arabic digits from 0 to 9. Each speaker spoke each Arabic digit 10 times. Therefore, there are 8800 (10 digits × 10 repetitions × 88 speakers) examples in the dataset. The original task is to classify which Arabic digit has been uttered based on 13 mel-frequency cepstral coefficients. The representation of SAD also lends itself to multitask multiclass learning. We split the original task into two tasks: one task (Task 1) is to classify Arabic digits spoken by males while the other (Task 2) is to classify those spoken by females. These two tasks are highly related with each other because the data are sampled from the same region. Because the voice, intonation and volume usually differ between males and females, we may consider these two tasks to also be different from each other. In the original dataset, the feature of one spoken digit is a matrix of size row × 13 (4 ≤ row ≤ 93). For simplicity, we compute the average value of each column of this matrix, obtaining a vector of size 1 × 13 to represent one digit's feature. Details about these two datasets are summarized in Table 1.

Table 1
Details of datasets.

Name     Attributes   Instances   Classes   Tasks
Isolet   290          7797        26        5
SAD      13           8800        10        2

Results and analysis. We compare the performance of the M2SVMs algorithm with different baselines in Tables 2 and 3. ALL means taking all the T tasks' data together and using single task learning on these data, ignoring the differences between tasks. The multitask setting is denoted as MTL in Tables 2 and 3. Because MKVMs and M2SVMs give the same result in the single task learning scenario, the M2SVMs entry is denoted by '–', meaning the result is the same as that of MKVMs. In the MTL setting, MKVMs would give the same results as M2SVMs if MKVMs used the multitask kernel, so that entry is also denoted by '–'.

Table 2
Error rates (%) on the Isolet dataset.

Isolet   one2one        one2all        MKVMs          M2SVMs
Task 1   7.12 ± 1.82    7.18 ± 1.70    5.51 ± 1.92    –
Task 2   6.47 ± 2.06    6.73 ± 1.07    5.00 ± 1.02    –
Task 3   9.70 ± 1.85    9.68 ± 1.56    9.70 ± 1.60    –
Task 4   10.40 ± 1.78   10.98 ± 0.85   11.23 ± 1.80   –
Task 5   10.33 ± 2.08   9.81 ± 1.01    10.29 ± 1.85   –
ALL      4.84 ± 0.82    4.23 ± 0.62    4.10 ± 0.67    –
MTL      3.23 ± 0.91    3.66 ± 0.82    –              2.70 ± 0.97

Table 3
Error rates (%) on the SAD dataset.

SAD      one2one        one2all        MKVMs          M2SVMs
Task 1   3.46 ± 1.15    3.07 ± 1.08    2.73 ± 0.64    –
Task 2   5.32 ± 0.91    5.10 ± 0.90    5.28 ± 0.92    –
ALL      4.97 ± 0.97    4.32 ± 1.23    4.10 ± 0.93    –
MTL      3.10 ± 0.88    3.25 ± 0.94    –              2.97 ± 0.76

From Table 2, with the increase of training examples, we find that accuracy is considerably improved when we simply take the instances of all tasks together and ignore the relationships between tasks. In Table 3, collecting the instances of all tasks together only gives error rates which appear to be the average of the two tasks' results. But if the relationships between tasks are captured by multitask learning, multitask learning gets better results than the methods which ignore this useful information between tasks, as shown in Tables 2 and 3. This indicates the importance of discovering and exploiting the relationship between tasks.

If we compare results by columns, we can see that MKVMs get comparable results with the other two popular multiclass classification methods. In fact, because MKVMs can capture the relationship between different classes, they can often get better results than the other two methods.

In Table 3, the result of M2SVMs is 2.97% while that of MKVMs is 2.73% for Task 1. It may seem that M2SVMs get a worse result than learning the tasks independently. But if we consider Task 1 and Task 2 together, there are 1467 × 2.73% + 1467 × 5.28% ≈ 117 instances misclassified, while with M2SVMs there are only 2934 × 2.97% ≈ 87 instances misclassified. (There are 1467 examples for testing in each of Task 1 and Task 2, because we use 3-fold cross validation and the data size of each task is 4400; therefore, there are 2934 examples for testing in MTL and ALL.) If we just take the instances of Task 1 and Task 2 together, MKVMs get 2934 × 4.10% ≈ 120 instances misclassified, which is even slightly worse than learning the tasks independently. This comparison also indicates the importance of exploiting correlations between tasks.
Since M2SVMs can not only capture correlations between different classes but also use common information between tasks for prediction, they get the best results in both Tables 2 and 3.

Because of the complexity of the dual optimization problem, one may think that M2SVMs take a lot of computing time. Therefore, we also compare these methods on computing time. These experiments were carried out on an Intel Core 2 machine with 1024 MB RAM using the gcc compiler. Note that the results shown in Fig. 1(a) and (b) are the average times of 3-fold cross validation with instances randomly selected 10 times. From Fig. 1(a) and (b), we can easily see that M2SVMs take a similar time to multitask SVMs based on the one2one strategy, while multitask SVMs based on the one2all strategy take more time than the other two methods.

Because using fewer support vectors can improve the speed of prediction and avoid over-fitting, we also make comparisons on the number of support vectors. After the training process, the three algorithms have different numbers of support vectors. From Fig. 2(a) and (b) we can see that M2SVMs and multitask SVMs based on one2one use fewer support vectors than multitask SVMs based on one2all.

Fig. 1. Comparison of computing time. (a) Results on Isolet. (b) Results on SAD.

Fig. 2. Comparison of support vectors. (a) Results on Isolet. (b) Results on SAD.

4.2.2. Experimental group (II)

Experiments are carried out on the GPDS [24] and MCYT [25] datasets, respectively. For signature verification, the evaluation measures of experimental results are different from classical experiments. We first briefly introduce these evaluation measures, then provide a brief overview of the two datasets and present results in various experimental settings.

Experimental settings. In an off-line signature verification system, three kinds of forgery are considered: random, simple and skilled forgeries [27]. A random forgery is usually represented by a signature example that belongs to a different writer from the signature model. A simple forgery is represented by a signature example with the same shape as the genuine writer's name. A skilled forgery is represented by a suitable imitation of the genuine signature model. Because skilled forgeries are very similar to genuine signatures, it is difficult to distinguish skilled forgeries from genuine signatures [27]. The MCYT and GPDS datasets provide only random and skilled forgeries. Therefore, there are three classes in the classification problem.

If we take each person's signature verification as a single task, some common information may be shared between tasks. If we take one person's signature verification as the target task, we may add some other people's signature verification as assistant tasks to help improve the accuracy of classification. Because of the sharing of common information between tasks, the generalization of the classifiers is considerably improved.
Error rate (ER), false acceptance rate (FAR) and false rejection rate (FRR) are three measures used for evaluating the performance of any signature verification method. ER, FAR and FRR are calculated by

ER = \frac{\text{number of wrongly predicted images}}{\text{number of images}} \times 100\%,
\quad FAR = \frac{\text{number of forgeries accepted}}{\text{number of forgeries tested}} \times 100\%,
\quad FRR = \frac{\text{number of genuine signatures rejected}}{\text{number of genuine signatures}} \times 100\%.   (20)

We compute the average ER (AER), average FAR (AFAR) and average FRR (AFRR) over T persons by

AER = \frac{1}{T} \sum_{t=1}^{T} ER_t, \quad AFAR = \frac{1}{T} \sum_{t=1}^{T} FAR_t, \quad AFRR = \frac{1}{T} \sum_{t=1}^{T} FRR_t.   (21)

For each signer's evaluation measures, we use 10-fold cross-validation to get the values.
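The quantities in Eqs. (20) and (21) are simple ratios and means; the small Python sketch below (function and argument names are ours, introduced only for illustration) computes them from counts:

    def verification_rates(n_images, n_wrong, n_forgeries_tested, n_forgeries_accepted,
                           n_genuine, n_genuine_rejected):
        # Eq. (20): error rate, false acceptance rate and false rejection rate, in percent.
        er = 100.0 * n_wrong / n_images
        far = 100.0 * n_forgeries_accepted / n_forgeries_tested
        frr = 100.0 * n_genuine_rejected / n_genuine
        return er, far, frr

    def average_rates(per_signer_rates):
        # Eq. (21): AER, AFAR and AFRR are plain means over the T signers.
        T = len(per_signer_rates)
        return tuple(sum(r[j] for r in per_signer_rates) / T for j in range(3))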
MTL strategy for signature verification. If we put all signers' signature verification problems into one MTL problem, it will take a long time to select parameters for each task. For simplicity, we instead find some helpful assistant tasks for each signer's classification problem and then use MTL to improve the accuracy.

How to find helpful assistant tasks is described by the pseudo code in Algorithm 1. The main idea of this algorithm is to keep adding another person's signature verification task as an assistant task until the maximum accuracy on the training datasets is reached.

Algorithm 1. The algorithm to find helpful assistant tasks for the ith task.

Input:
    An integer i stands for the ith task.
    An integer T stands for the number of tasks.
    The function MTSVMs represents multitask SVMs. If there is just one task, MTSVMs acts as single-task SVMs.
Output:
    A set A containing the ith task and the helpful assistant tasks for the ith task.

A = {i}
while 1 do
    MaxAccuracy = MTSVMs(A), MostHelpfulTaskIndex = 0
    for j = 1; j <= T and j not in A; ++j do
        TempSet = A ∪ {j}
        TempAccuracy = MTSVMs(TempSet)
        if TempAccuracy > MaxAccuracy then
            MaxAccuracy = TempAccuracy
            MostHelpfulTaskIndex = j
        end if
    end for
    if MaxAccuracy > MTSVMs(A) then
        A = A ∪ {MostHelpfulTaskIndex}
    else
        break
    end if
end while
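For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1. The function mtsvm_accuracy is a hypothetical stand-in that is assumed to train multitask SVMs on a set of task indices and return their training accuracy (behaving like single-task SVMs when given one task); only the greedy selection logic is taken from the algorithm itself:

    def find_assistant_tasks(i, T, mtsvm_accuracy):
        # Greedy selection of helpful assistant tasks for the i-th task (Algorithm 1).
        A = {i}
        while True:
            max_accuracy = mtsvm_accuracy(A)
            most_helpful = None
            for j in range(1, T + 1):
                if j in A:
                    continue
                accuracy = mtsvm_accuracy(A | {j})
                if accuracy > max_accuracy:
                    max_accuracy = accuracy
                    most_helpful = j
            if most_helpful is not None:
                A = A | {most_helpful}
            else:
                break
        return A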

At the end, each task gets some helpful assistant tasks. We then learn these tasks (one signer's task and its assistant tasks) together by MTL SVMs.

Details about datasets. The GPDS database contains images from 300 signers. Each signer has 24 genuine signatures and 30 skilled forgeries. The 24 genuine specimens were produced in a single-day writing session. The skilled forgeries were produced from the static image of a genuine signature chosen randomly from the 24 genuine signatures. Forgers were allowed to practice the signature for as long as they wished. Each forger imitated three signatures of five signers in a single-day writing session. Therefore, for each genuine signature there are 30 skilled forgeries made by 10 forgers from 10 different genuine specimens. After the modified direction feature (MDF) [28] is used to extract features, we employ PCA to reduce the dimension from 2400 to 113. For each person, there are 24 genuine signatures and 30 skilled forgeries. In order to gather random forgeries for one signer, we select images from other signers' genuine signatures randomly; for each signer, we select 60 such images.

MCYT contains images from 75 signers. For evaluating the effectiveness of our method, a sub-corpus of the larger MCYT bimodal database which contains 2250 images is used in our experiments. Each person has 15 genuine signatures and 15 forgeries which were contributed by three different user-specific forgers. The 15 genuine signatures were acquired at different times (between three and five) of the same acquisition session. At each time, between one and five signatures were acquired consecutively [27]. After we employ PCA to reduce the features originally extracted by MDF, the dimension is reduced from 2400 to 102. For each signer, we select 30 random-forgery images from other signers randomly.

Results and analysis. We compare the performance of the M2SVMs algorithm under the different evaluation measures in Tables 4 and 5, where STL stands for single task learning.

From Tables 4 and 5, if we compare M2SVMs with the other methods, we can see that M2SVMs get better results. Besides the performance of M2SVMs themselves, another important reason comes from the characteristics of signature verification problems: what we care about more is distinguishing genuine signatures from forgeries, and we may not care whether a forgery is a random or a skilled one.

One method is to first mix random forgeries with skilled forgeries and then classify them against genuine signatures. The one2all method uses this strategy. From Tables 4 and 5, we can see that the one2all method gets the worst results because it takes no account of the differences between random forgeries and skilled forgeries. Usually, there are huge differences between them: random forgeries are taken from other signers' signatures while skilled forgeries target one signer's signature.

Table 4
Error rates (%) on the GPDS dataset.

GPDS    one2one                         one2all                         MKVMs           M2SVMs
        STL             MTL             STL             MTL
AER     13.01 ± 8.59    8.61 ± 7.14     14.75 ± 9.15    10.73 ± 8.03    11.54 ± 8.67    6.10 ± 7.53
AFAR    9.27 ± 0.07     3.74 ± 0.05     8.93 ± 0.08     5.86 ± 0.07     7.02 ± 0.06     4.29 ± 0.06
AFRR    30.83 ± 0.14    28.81 ± 0.09    34.25 ± 0.25    29.28 ± 0.16    28.97 ± 0.15    16.52 ± 0.18

Table 5
Error rates (%) on the MCYT dataset.

MCYT    one2one                         one2all                         MKVMs           M2SVMs
        STL             MTL             STL             MTL
AER     9.84 ± 9.02     7.38 ± 8.63     11.06 ± 9.60    7.89 ± 9.53     8.51 ± 10.81    6.58 ± 8.71
AFAR    5.52 ± 0.13     5.58 ± 0.10     8.93 ± 0.11     4.52 ± 0.13     6.27 ± 0.21     5.42 ± 0.14
AFRR    20.95 ± 0.09    17.70 ± 0.11    29.28 ± 0.15    19.18 ± 0.09    15.54 ± 0.17    9.21 ± 0.13

The one2one method trains two classifiers, where one is used to separate genuine signatures from random forgeries and the other is used to separate genuine signatures from skilled forgeries. With this strategy, it cannot capture information between different classes. If we use similarity scores to measure the similarity to genuine signatures, the three classes will get different scores: for example, genuine signatures might score 100, skilled forgeries might score 70–80, and random forgeries might score 3–20. The one2one method cannot capture this structure between classes, but it gets better results than the one2all method because it considers the differences between random forgeries and skilled forgeries. MKVMs can capture this structure between different classes, and thus give better results on the AER, AFAR and AFRR evaluation measures.

In Table 4, the AFAR of the one2one MTL method is lower than any other method's, but we can see that the one2one method tends to reject genuine signatures in order to guarantee low acceptance rates of random and skilled forgeries. Therefore, it gives a higher AFRR than M2SVMs. The same situation can be found in Table 5. From Tables 4 and 5, we can also see that better results on signature verification problems can be obtained by the MTL strategy.

4.3. Label-incompatible MTL experiments

To demonstrate M2SVMs' ability to learn multiple tasks that have different sets of class labels, we run experiments on the Isolet and SAD datasets. Because traditional multitask SVMs cannot be used in this case [4], we just compare M2SVMs with the single task version of SVMs.

4.3.1. Experimental settings

In order to get label-incompatible MTL datasets, we build tasks from the original datasets by the following steps (a small sketch of this construction is given below):

- Select n different labels from the label set randomly.
- Select all the examples with these n labels.

For the Isolet dataset, we get five tasks which each have 10 different labels. Therefore, considering any two tasks, they have different label sets, although some labels may be shared. Table 6 shows the result of M2SVMs. For the SAD dataset, we get three tasks which each have four different labels. By this strategy, we can get label-incompatible MTL datasets. Table 7 shows the result of M2SVMs. Note that, for each dataset, we repeat this procedure five times to get average results.
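As sketched below (plain Python; the example/label containers and the random seed are ours, for illustration only), each label-incompatible task is built by drawing n labels at random and keeping only the examples that carry them:

    import random

    def make_label_incompatible_tasks(examples, labels, n_tasks, n_labels, seed=0):
        # Follow the two steps listed above: pick n_labels labels at random for each task
        # and keep only the examples carrying those labels.
        rng = random.Random(seed)
        label_set = sorted(set(labels))
        tasks = []
        for _ in range(n_tasks):
            chosen = set(rng.sample(label_set, n_labels))
            tasks.append([(x, y) for x, y in zip(examples, labels) if y in chosen])
        return tasks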
Table 6
Error rates (%) on Isolet label-incompatible tasks.

Isolet    STL            MTL
one2one   7.92 ± 1.03    –
one2all   8.35 ± 1.74    –
MKVMs     7.29 ± 1.21    –
M2SVMs    –              4.27 ± 1.66

Table 7
Error rates (%) on SAD label-incompatible tasks.

SAD       STL            MTL
one2one   4.25 ± 1.31    –
one2all   4.88 ± 0.97    –
MKVMs     4.56 ± 1.18    –
M2SVMs    –              3.04 ± 0.83

4.3.2. Results and analysis

From Tables 6 and 7, we can see that M2SVMs obtain better results in label-incompatible MTL. Because the tasks here have different meanings from those in the label-compatible MTL experiments, the characteristics of the results are essentially not the same as before.

4.4. Comparative experiments

We compare our method with three methods: Large Margin Multitask Metric Learning (LM3L)^1 [4], Convex Multitask Feature Learning (MTL-FEAT)^2 [7] and Boosted Multitask Learning (BMTL) [16]. These experiments are carried out on the Isolet and SAD datasets in label-incompatible settings. As before, we use 3-fold cross-validation for the comparative experiments. Details about the datasets are given in Table 1. Gaussian kernel functions are used in MTL-FEAT and M2SVMs.

^1 Code available at http://www.cse.wustl.edu/~kilian/code/files/mtLMNN.zip.
^2 Code available at http://ttic.uchicago.edu/~argyriou/code/mtl_feat/mtl_feat.tar.

From Table 8, we find that our method gets comparable results with the three state-of-the-art algorithms. BMTL gets slightly better results on the SAD dataset, but its standard deviation is much larger than that of M2SVMs.

Table 8
Error rates (%) on the Isolet and SAD datasets.

MTL               Isolet         SAD
LM3L              4.63 ± 1.39    3.22 ± 1.32
MTL-FEAT          4.78 ± 1.03    3.16 ± 1.19
BMTL-weighted     3.42 ± 1.65    2.93 ± 1.41
BMTL-unweighted   3.17 ± 1.63    2.86 ± 1.40
M2SVMs            2.70 ± 0.97    2.97 ± 0.76

5. Conclusions and future work

In this paper, we presented a new method named M2SVMs based on the regularization principle. Our method is mainly inspired by MKVMs [8] and regularized multitask learning [5]. We derived a dual optimization problem which is optimized to give solutions, and then extended the method to non-linear multitask learning by the kernel trick. This extension is very easy to implement using existing tools.

We compared our method with two other popular multitask learning methods which are based on the one2one and one2all strategies. These algorithms were compared in three aspects: accuracy, computing time and the number of support vectors. The results show that our method gets better performance than the other two methods on almost all comparisons. Because the information between tasks can be effectively captured by M2SVMs, our method also got better results than MKVMs. We also compared our method with MTL methods such as MTL-FEAT, BMTL and LM3L, and obtained comparable results.

How to extend our model to domain adaptation and transfer learning is a valuable direction for future research.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Project 61075005 and the Fundamental Research Funds for the Central Universities.
Appendix A. Derivation of the Lagrangian's derivatives

We seek a saddle point of the Lagrangian, which is the minimum for the primal variables V_{t,k}, M_{0,k}, ξ_{t,i} and the maximum for the dual variables α_{t,k,i}. For ξ_{t,i}, we get

\frac{\partial L}{\partial \xi_{t,i}} = 1 - \sum_{k=1}^{K} \alpha_{t,k,i} = 0 \;\Rightarrow\; \sum_{k=1}^{K} \alpha_{t,k,i} = 1.   (22)

Similarly, for M_{0,k}, we require

\frac{\partial L}{\partial M_{0,k}}
 = 2\beta M_{0,k} + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i}
   - \sum_{t=1}^{T} \sum_{i:\, y_i = k} \underbrace{\Big( \sum_{e=1}^{K} \alpha_{t,e,i} \Big)}_{=1} x_{t,i}
 = 2\beta M_{0,k} + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i} - \sum_{t=1}^{T} \sum_{i:\, y_i = k} x_{t,i}
 = 2\beta M_{0,k} + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i} - \sum_{t=1}^{T} \sum_{i=1}^{m_t} \delta_{k,y_i} x_{t,i}
 = 0,   (23)

which results in the following form:

M_{0,k} = \frac{1}{2\beta} \Big( \sum_{t=1}^{T} \sum_{i=1}^{m_t} \delta_{k,y_i} x_{t,i} - \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i} \Big)
        = \frac{1}{2\beta} \sum_{t=1}^{T} \sum_{i=1}^{m_t} (\delta_{k,y_i} - \alpha_{t,k,i}) x_{t,i}.   (24)

Next, for V_{t,k}, we need

\frac{\partial L}{\partial V_{t,k}}
 = 2\beta \rho_t V_{t,k} + \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i}
   - \sum_{i:\, y_i = k} \underbrace{\Big( \sum_{e=1}^{K} \alpha_{t,e,i} \Big)}_{=1} x_{t,i}
 = 2\beta \rho_t V_{t,k} + \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i} - \sum_{i=1}^{m_t} \delta_{k,y_i} x_{t,i}
 = 0,   (25)

which results in the form

V_{t,k} = \frac{1}{2\beta\rho_t} \Big( \sum_{i=1}^{m_t} \delta_{k,y_i} x_{t,i} - \sum_{i=1}^{m_t} \alpha_{t,k,i} x_{t,i} \Big)
        = \frac{1}{2\beta\rho_t} \sum_{i=1}^{m_t} (\delta_{k,y_i} - \alpha_{t,k,i}) x_{t,i}.   (26)

Appendix B. Derivation of the dual optimization problem

We develop the Lagrangian using only the dual variables. Substituting Eqs. (7), (9) and (11) into (6), we obtain

Q(\alpha) = \beta \sum_{k=1}^{K} \sum_{t=1}^{T} \Big[ \rho_t \|V_{t,k}\|_2^2 + \frac{1}{\beta} \sum_{i=1}^{m_t} \big( \alpha_{t,k,i} V_{t,k} x_{t,i} - \alpha_{t,k,i} V_{t,y_i} x_{t,i} \big) \Big]
 + \beta \sum_{k=1}^{K} \Big[ \|M_{0,k}\|_2^2 + \frac{1}{\beta} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \big( \alpha_{t,k,i} M_{0,k} x_{t,i} - \alpha_{t,k,i} M_{0,y_i} x_{t,i} \big) \Big]
 + \sum_{t=1}^{T} \sum_{i=1}^{m_t} \xi_{t,i} - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} \big[ \delta_{k,y_i} - 1 + \xi_{t,i} \big]
 = \beta \sum_{k=1}^{K} \Bigg[ \sum_{t=1}^{T} \underbrace{\Big( \rho_t \|V_{t,k}\|_2^2 + \frac{1}{\beta} \sum_{i=1}^{m_t} \alpha_{t,k,i} V_{t,k} x_{t,i} \Big)}_{S_1}
   + \underbrace{\Big( \|M_{0,k}\|_2^2 + \frac{1}{\beta} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} M_{0,k} x_{t,i} \Big)}_{S_2} \Bigg]
 - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \underbrace{\big( \alpha_{t,k,i} V_{t,y_i} x_{t,i} + \alpha_{t,k,i} M_{0,y_i} x_{t,i} \big)}_{S_3}
 - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} \delta_{k,y_i} + C,   (27)

where the terms involving ξ_{t,i} cancel because of Eq. (7), and C is a constant equal to the whole training data size of all tasks:

C = \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} = \sum_{t=1}^{T} \sum_{i=1}^{m_t} \sum_{k=1}^{K} \alpha_{t,k,i} = \sum_{t=1}^{T} m_t.   (28)

In order to get a simple representation of Eq. (27), we work with the terms S_1, S_2 and S_3 marked in Eq. (27). Substituting Eq. (11) into S_1, we get

S_1 = \rho_t \|V_{t,k}\|_2^2 + \frac{1}{\beta} \sum_{i=1}^{m_t} \alpha_{t,k,i} V_{t,k} x_{t,i}
    = \frac{1}{4\beta^2 \rho_t} \sum_{i=1}^{m_t} \sum_{j=1}^{m_t} (\delta_{k,y_i} - \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{t,k,j}) \langle x_{t,i}, x_{t,j} \rangle
      + \frac{1}{2\beta^2 \rho_t} \sum_{i=1}^{m_t} \sum_{j=1}^{m_t} \alpha_{t,k,i} (\delta_{k,y_j} - \alpha_{t,k,j}) \langle x_{t,i}, x_{t,j} \rangle
    = \frac{1}{4\beta^2 \rho_t} \sum_{i=1}^{m_t} \sum_{j=1}^{m_t} (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{t,k,j}) \langle x_{t,i}, x_{t,j} \rangle,

and substituting Eq. (9) into S_2 gives

S_2 = \|M_{0,k}\|_2^2 + \frac{1}{\beta} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} M_{0,k} x_{t,i}
    = \frac{1}{4\beta^2} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s} (\delta_{k,y_i} - \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) \langle x_{t,i}, x_{s,j} \rangle
      + \frac{1}{2\beta^2} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s} \alpha_{t,k,i} (\delta_{k,y_j} - \alpha_{s,k,j}) \langle x_{t,i}, x_{s,j} \rangle
    = \frac{1}{4\beta^2} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s} (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) \langle x_{t,i}, x_{s,j} \rangle,   (29)

while substituting Eqs. (9) and (11) into S_3 gives

S_3 = \alpha_{t,k,i} V_{t,y_i} x_{t,i} + \alpha_{t,k,i} M_{0,y_i} x_{t,i}
    = \frac{1}{2\beta\rho_t} \sum_{j=1}^{m_t} \alpha_{t,k,i} (\delta_{y_i,y_j} - \alpha_{t,y_i,j}) \langle x_{t,i}, x_{t,j} \rangle
      + \frac{1}{2\beta} \sum_{s=1}^{T} \sum_{j=1}^{m_s} \alpha_{t,k,i} (\delta_{y_i,y_j} - \alpha_{s,y_i,j}) \langle x_{t,i}, x_{s,j} \rangle
    = \frac{1}{2\beta} \sum_{s=1}^{T} \sum_{j=1}^{m_s} \alpha_{t,k,i} (\delta_{y_i,y_j} - \alpha_{s,y_i,j}) \Big( 1 + \frac{\delta_{t,s}}{\rho_t} \Big) \langle x_{t,i}, x_{s,j} \rangle.

If we substitute S_1, S_2, S_3 into (27) directly, we get a very complex expression. We find that \sum_{t=1}^{T} S_1 + S_2 leads to a simple expression, and therefore we set S_4 = \sum_{t=1}^{T} S_1 + S_2. After substituting the expressions of S_1 and S_2 from Eq. (29) into S_4, we obtain the simple representation

S_4 = \sum_{t=1}^{T} S_1 + S_2
    = \frac{1}{4\beta^2} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s} (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) \Big( 1 + \frac{\delta_{t,s}}{\rho_t} \Big) \langle x_{t,i}, x_{s,j} \rangle.   (30)

Then, substituting S_3 and S_4 into Eq. (27) results in

Q(\alpha) = \frac{1}{4\beta} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s}
   \Big( 1 + \frac{\delta_{t,s}}{\rho_t} \Big) \langle x_{t,i}, x_{s,j} \rangle \,
   \underbrace{\sum_{k=1}^{K} \Big[ (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) - 2\alpha_{t,k,i}(\delta_{y_i,y_j} - \alpha_{s,y_i,j}) \Big]}_{S_5}
 - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} \delta_{k,y_i},   (31)

where the constant C has been dropped. The most complex part of Eq. (31) is S_5. Fortunately, S_5 has a simpler representation:

S_5 = \sum_{k=1}^{K} \Big[ (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) - 2\alpha_{t,k,i}(\delta_{y_i,y_j} - \alpha_{s,y_i,j}) \Big]
    = \sum_{k=1}^{K} (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j})
      - 2(\delta_{y_i,y_j} - \alpha_{s,y_i,j}) \underbrace{\sum_{k=1}^{K} \alpha_{t,k,i}}_{=1}
    = \sum_{k=1}^{K} \Big[ (\delta_{k,y_i} + \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) - 2\,\delta_{k,y_i}(\delta_{k,y_j} - \alpha_{s,k,j}) \Big]
    = -\sum_{k=1}^{K} (\delta_{k,y_i} - \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}).   (32)

Then we substitute S_5 in Eq. (32) into Eq. (31) and get

\max \; Q(\alpha) = - \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{i=1}^{m_t} \alpha_{t,k,i} \delta_{k,y_i}
 - \frac{1}{4\beta} \sum_{k=1}^{K} \sum_{t=1}^{T} \sum_{s=1}^{T} \sum_{i=1}^{m_t} \sum_{j=1}^{m_s} (\delta_{k,y_i} - \alpha_{t,k,i})(\delta_{k,y_j} - \alpha_{s,k,j}) \Big( 1 + \frac{\delta_{t,s}}{\rho_t} \Big) \langle x_{t,i}, x_{s,j} \rangle.   (33)

References

[1] R. Caruana, Multitask learning, Machine Learning 28 (1) (1997) 41–75.
[2] S. Sun, Multitask learning for EEG-based biometrics, in: Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4.
[3] X.T. Yuan, S. Yan, Visual classification with multi-task joint sparse representation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3493–3500.
[4] S. Parameswaran, K.Q. Weinberger, Large margin multi-task metric learning, in: Advances in Neural Information Processing Systems, 2010, pp. 1867–1875.
[5] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
[6] T. Evgeniou, C. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, Journal of Machine Learning Research 6 (1) (2005) 615–637.
[7] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Machine Learning 73 (3) (2008) 243–272.
[8] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, Journal of Machine Learning Research 2 (2001) 265–292.
[9] Y. Ji, S. Sun, Multitask multiclass support vector machines, in: Proceedings of the International Conference on Data Mining Workshops, 2011, pp. 512–518.
[10] J. Baxter, A model for inductive bias learning, Journal of Artificial Intelligence Research 12 (2000) 149–198.
[11] S. Ben-David, R. Schuller, Exploiting task relatedness for multiple task learning, in: Learning Theory and Kernel Machines, 2003, pp. 567–580.
[12] N.J. Butko, J.R. Movellan, Learning to learn, in: Proceedings of the IEEE 6th International Conference on Development and Learning, 2007, pp. 151–156.
[13] B. Bakker, T. Heskes, Task clustering and gating for Bayesian multitask learning, Journal of Machine Learning Research 4 (1) (2003) 83–99.
[14] C.A. Micchelli, M. Pontil, Kernels for multi-task learning, in: Advances in Neural Information Processing Systems, 2005, pp. 921–928.
[15] A. Caponnetto, C.A. Micchelli, M. Pontil, Y. Ying, Universal multi-task kernels, Journal of Machine Learning Research 9 (7) (2008) 1615–1646.
[16] O. Chapelle, P.K. Shivaswamy, S. Vadrevu, K.Q. Weinberger, Y. Zhang, B.L. Tseng, Boosted multi-task learning, Machine Learning 85 (1–2) (2011) 149–173.
[17] C.W. Hsu, C.J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 415–425.
[18] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: Advances in Neural Information Processing Systems, 2000, pp. 547–553.
[19] V. Franc, V. Hlavác, Multi-class support vector machine, in: Proceedings of the 16th International Conference on Pattern Recognition, 2002, pp. 236–239.
[20] I. Steinwart, Support vector machines are universally consistent, Journal of Complexity 18 (3) (2002) 768–791.
[21] J. Shawe-Taylor, S. Sun, A review of optimization methodologies in support vector machines, Neurocomputing 74 (17) (2011) 3609–3618.
[22] C.W. Hsu, C.C. Chang, C.J. Lin, A Practical Guide to Support Vector Classification, Technical Report, 2003.
[23] A. Asuncion, D.J. Newman, UCI machine learning repository, datasets available at http://archive.ics.uci.edu/ml/datasets.html, 2007.
[24] J.F. Vargas, M.A. Ferrer, C.M. Travieso, J.B. Alonso, Off-line handwritten signature GPDS-960 corpus, in: Proceedings of the 9th International Conference on Document Analysis and Recognition, 2007, pp. 764–768.
[25] J. Ortega-Garcia, J. Fierrez-Aguilar, D. Simon, J. Gonzalez, et al., MCYT baseline corpus: a bimodal biometric database, IEE Proceedings: Vision, Image and Signal Processing 150 (6) (2003) 395–401.
[26] Y. Ji, S. Sun, Off-line signature verification based on multitask learning, in: Proceedings of the 8th International Symposium on Neural Networks, 2011, pp. 323–330.
[27] J. Wen, B. Fang, Y.Y. Tang, T.P. Zhang, Model-based signature verification with rotation invariant features, Pattern Recognition 42 (7) (2009) 1458–1466.
[28] M. Blumenstein, X.Y. Liu, B. Verma, An investigation of the modified direction feature for cursive character recognition, Pattern Recognition 40 (2) (2007) 376–388.

You Ji received the B.E. degree in computer science and technology from East China Normal University in 2009. Now he is a master student in the Pattern Recognition and
Machine Learning Research Group, Department of Computer Science and Technology, East China Normal University. His research interests include machine learning,
pattern recognition, etc.

Shiliang Sun received the B.E. degree in automatic control from Beijing University of Aeronautics and Astronautics and Ph.D. degree in pattern recognition and intelligent
systems from Tsinghua University, respectively, in 2002 and 2007. Now he is a professor and the director of the Pattern Recognition and Machine Learning Research Group,
Department of Computer Science and Technology, East China Normal University. He is on the editorial boards of several international journals and a referee for many top
journals. His research interests include machine learning, pattern recognition, brain–computer interfaces and intelligent transportation systems, etc.
