Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
Article history: Received 5 January 2012; received in revised form 29 June 2012; accepted 12 August 2012; available online 23 August 2012.

Keywords: Multiclass classification; Multitask learning; Support vector machine; Kernel; Regularization

Abstract

Multitask learning, or learning multiple related tasks simultaneously, has shown better performance than learning these tasks independently. Most approaches to multitask multiclass problems decompose them into multiple multitask binary problems, and thus cannot effectively capture the inherent correlations between classes. Although very elegant, traditional multitask support vector machines are restricted by the fact that different learning tasks have to share the same set of classes. In this paper, we present an approach to multitask multiclass support vector machines based on the minimization of regularization functionals. We cast multitask multiclass problems into a constrained optimization problem with a quadratic objective function; therefore, our approach can learn multitask multiclass problems directly and effectively. This approach can learn in two different scenarios: label-compatible and label-incompatible multitask learning. We can easily generalize the linear multitask learning method to the non-linear case using kernels. A number of experiments, including comparisons with other multitask learning methods, indicate that our approach to multitask multiclass problems is very encouraging.

© 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.patcog.2012.08.010
Y. Ji, S. Sun / Pattern Recognition 46 (2013) 914–924 915
The rest of this paper is organized as follows. Section 2 briefly reviews work on multiclass problems and multitask learning. Section 3 describes the M2SVMs model and its derivation in detail; extensions to non-linear multitask learning and label-incompatible multitask learning are also introduced there. Experimental results are reported in Section 4. Conclusions and future work are presented in Section 5.

2. Related work

M2SVMs build on previous research on regularized multitask learning and multiclass learning. In this section, related work on regularized multitask learning and multiclass learning is briefly introduced as preliminary knowledge, from which we reach our M2SVMs approach.

2.1. Regularized multitask learning

In [1], Caruana proposed the MTL approach, which learns tasks in parallel while using a shared representation. A statistical learning theory based approach to MTL was introduced in [10,11]. In [10], Baxter proposed the notion of the extended VC dimension to derive generalization bounds for MTL. In [5,11], the extended VC dimension was used to derive tighter bounds that hold for each task. Butko and Movellan use a probability model to capture the relations of MTL [12]. A mixture of Gaussians leads to clustering of these tasks [13]. Regularized MTL presented by Evgeniou et al. [5] extends single-task SVMs to MTL. After that, MTL kernel learning methods were introduced in [6,14]. Universal multitask kernels were introduced in [15]. In [16], a boosted algorithm is used to improve the performance of MTL. Our work builds on previous research and advances in MTL SVMs. Here we introduce regularized MTL SVMs (more details can be found in [5]).

Suppose that there are T different but related tasks. All data for the T tasks come from the same space X × Y, where X ⊆ R^d and Y = {−1, 1}. For simplicity, we assume that each example belongs to one task. The input–output pair (x_i, y_i) (i ∈ {1, 2, 3, ..., m}) stands for the ith example, whose task index is given by j(i). m_t stands for the size of the tth task's training set. Regularized MTL learns T classifiers w_1, ..., w_T, where each classifier w_t is specifically dedicated to task t (t ∈ {1, ..., T}) [5]. With w_0, which captures the common information among all tasks, each w_t can be rewritten as w_t = w_0 + v_t. Then v_t and w_0 are obtained by solving the optimization problem

$$\min_{w_0, v_t, \xi_{t,i}} \ \frac{\lambda_1}{T}\sum_{t=1}^{T}\|v_t\|_2^2 + \lambda_2\|w_0\|_2^2 + \sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{t,i}, \quad \text{s.t.}\ \forall t, i:\ y_{t,i}\,(w_0 + v_t)\cdot x_{t,i} \ge 1 - \xi_{t,i},\ \ \xi_{t,i} \ge 0. \qquad (1)$$

The constants λ_1 ≥ 0 and λ_2 ≥ 0 control the trade-off between tasks. There are two extreme cases. If λ_2 → +∞ then w_0 → 0, which indicates that all tasks are decoupled; there is then no related information between tasks. If λ_2 → 0 and λ_1 → +∞, then v_t → 0, which means all tasks share the same decision function with w_0 [5].

By the strategy of feature maps, the above multitask learning can be generalized to non-linear multitask learning. Kernel methods can be used to model relations between tasks and lead to linear and non-linear multitask learning algorithms [6,14]. A wide variety of kernels which are useful for applications are proposed in [6,14].

Although regularized multitask learning is shown to be superior to its single-task counterpart, it requires all tasks to share the same set of class labels. When it comes to multiclass problems, a common method is to build a set of binary classifiers where each classifier distinguishes one of the labels from the rest (or from one of the other labels). These approaches cannot effectively capture correlations between the different classes, since they break a multiclass problem into multiple independent binary problems [8]. Moreover, traditional MTL binary SVMs cannot solve problems in the label-incompatible MTL scenario. This is a serious limitation for multitask SVMs when all the tasks may not share the same classes [3]. One main purpose of this paper is to extend traditional MTL SVMs to the label-incompatible MTL scenario using a strategy that learns multiclass problems in one dual optimization problem.

In this paper, MTL is different from domain adaptation or transfer learning. Domain adaptation and transfer learning have target tasks, which rely on helpful auxiliary tasks to gather more useful information to improve their performance. In this paper, by contrast, all tasks have equal status in MTL: there is no source domain or target domain, and our goal is to improve the performance of all tasks.

2.2. Multiclass learning

Support vector machines were originally designed for binary classification. How to extend binary SVMs to multiclass classification is still an active research topic [17]. There are two ways to solve multiclass classification problems. One way is to construct a multiclass classifier by combining several binary classifiers; there are three methods based on binary problems: one2all, one2one, and DAGSVM [18]. The other way is to consider all classes together [8,19]. It casts multiclass categorization problems as a constrained optimization problem with a quadratic objective function, which yields a direct method for training multiclass predictors [8]. Usually, with this strategy, a much larger dual optimization problem has to be solved. Algorithmic implementations such as MKVMs, which are able to incorporate kernels with a compact set of constraints and decompose the dual problem into multiple optimization problems of reduced size, are proposed in [8,17]. Because this strategy is related to our work, we briefly review it below. A comparison of methods for multiclass support vector machines is given in [17].

MKVMs are generalizations of separating hyperplanes and margins to the scenario of multiclass problems. Suppose (x_i, y_i) ∈ X × Y, where X = R^d, Y = {1, 2, 3, ..., K} and i ∈ {1, ..., m}. This framework uses classifiers of the form

$$H_M(x) = \arg\max_{k=1}^{K}\{M_k\, x\}, \qquad (2)$$

where K means the number of labels, and M is a matrix of size K × d. M_k is the kth row of M, and M_k x_i gives the confidence and similarity score for the kth class. Therefore, according to the equation above, the predicted label is the index of the row which attains the highest similarity score [8]. The quadratic optimization problem is defined as

$$\min_{M} \ \frac{1}{2}\beta\|M\|_2^2 + \sum_{i=1}^{m}\xi_i, \quad \text{s.t.}\ \forall k, i:\ M_{y_i}x_i + \delta_{k,y_i} - M_k x_i \ge 1 - \xi_i, \qquad (3)$$

where δ_{p,q} is equal to 1 if p = q and 0 otherwise, and β > 0 is a regularization constant. For k = y_i, the inequality constraints become ξ_i ≥ 0. The goal is to find a matrix M that attains a small empirical error on the training examples and also generalizes well. The advantage of this strategy is that it solves multiclass classification problems directly and can therefore effectively capture the information between different classes. Results in [17] also show that, for large-scale problems, methods considering all classes at once need fewer support vectors than methods constructed from binary classifiers.
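The prediction rule (2) and the slacks implied by the constraints of (3) can be sketched in a few lines. The following is a minimal NumPy illustration of the reviewed formulation, not the implementation used in the paper; labels are 0-based here and all names are ours.

```python
import numpy as np

def predict_multiclass(M, x):
    """H_M(x) = argmax_k M_k . x, as in Eq. (2); labels are 0-based."""
    return int(np.argmax(M @ x))

def multiclass_hinge_slacks(M, X, y):
    """Smallest slacks xi_i satisfying the constraints of Eq. (3):
    xi_i = max_k [ 1 - delta(k, y_i) - (M_{y_i} - M_k) . x_i ]_+ ."""
    scores = X @ M.T                                      # shape (m, K)
    m = X.shape[0]
    margins = scores - scores[np.arange(m), y][:, None]   # (M_k - M_{y_i}) . x_i
    margins += 1.0
    margins[np.arange(m), y] -= 1.0                       # delta(k, y_i) cancels the +1 at k = y_i
    return np.maximum(margins.max(axis=1), 0.0)

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))                           # K = 3 classes, d = 4 features
x = rng.standard_normal(4)
print(predict_multiclass(M, x))
```

Well-separated examples (functional margin at least 1 to every other class) incur zero slack, which is exactly the condition under which the constraints of (3) are satisfied with ξ_i = 0.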
3. Multitask multiclass support vector machines (M2SVMs)

In this section, the model of multitask multiclass support vector machines is introduced. The section also provides details about convex optimization theory and optimality conditions for the resulting optimization problem, and analyzes the saddle point of the Lagrangian formulation theoretically.

As before, suppose all the data for the T tasks come from the space X × Y, where X = R^d and Y = {1, 2, 3, ..., K}. For simplicity, we assume that each example belongs to just one task. For each task we have m_t (t ∈ {1, 2, 3, ..., T}) instances sampled from some unknown distribution P_t on X × Y. We assume that the distributions P_t are different from each other but related [5,11]. How to express this kind of relationship between tasks is still an open question; in [11], exploiting task relatedness for MTL is discussed. When T = 1, the problem reduces to single-task learning. Our framework uses classifiers of the form

$$H_{M_t}(x) = \arg\max_{k=1}^{K}\{M_{t,k}\, x_t\}, \quad M_t = M_0 + V_t, \qquad (4)$$

where M_t is a matrix of size K × d, M_{t,k} stands for the kth row of M_t, and M_t corresponds to the unknown distribution P_t. This implies that all M_t are close to the mean matrix M_0. M_0 stands for the public classification information shared between related tasks, while V_t stands for the tth task's own classification information. When the tasks are very similar to each other, the values of ‖V_t‖₂² are very small (‖·‖₂² stands for the squared 2-norm of matrices). To this end we solve the following optimization problem, which is similar to Eq. (3):

$$\min \ \beta\sum_{k=1}^{K}\Big[\sum_{t=1}^{T}\rho_t\|V_{t,k}\|_2^2 + \|M_{0,k}\|_2^2\Big] + \sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{t,i},$$
$$\text{s.t.}\ \forall t, k, i:\ (V_{t,y_i}+M_{0,y_i})\,x_{t,i} + \delta_{k,y_i} - (V_{t,k}+M_{0,k})\,x_{t,i} \ge 1 - \xi_{t,i}, \qquad (5)$$

where δ_{p,q} is equal to 1 if p = q and 0 otherwise, ρ_t is a weighting parameter between V_t and M_0, and β > 0 and ρ_t > 0 are regularization constants. The term Σ_{t=1}^{T} ρ_t‖V_{t,k}‖₂² + ‖M_{0,k}‖₂² stands for the regularization over the T tasks. M_0 may play different roles in different problems. For example, some tasks may be highly related to each other, in which case M_0 plays an important role for these tasks. In the other scenario, some tasks may be highly different from each other, and then M_0's role is less important. We therefore add the parameters ρ_t to trade off the public classification information M_0 against the private classification information V_t. For k = y_i, the inequality constraints become ξ_{t,i} ≥ 0.

Solving this problem directly is not a good choice; we find it better to solve the dual of Eq. (5). We add a dual set of nonnegative variables α_{t,k,i}, one for each constraint, and obtain the Lagrangian of the optimization problem as

$$L(V_{t,k}, M_{0,k}, \xi_{t,i}) = \beta\sum_{k=1}^{K}\Big[\sum_{t=1}^{T}\rho_t\|V_{t,k}\|_2^2 + \|M_{0,k}\|_2^2\Big] + \sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{t,i} - \sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\Big[(V_{t,y_i}+M_{0,y_i})\,x_{t,i} + \delta_{k,y_i} - (V_{t,k}+M_{0,k})\,x_{t,i} - 1 + \xi_{t,i}\Big]. \qquad (6)$$

In order to find the minimum over the primal variables V_t, M_0, ξ, we require

$$\frac{\partial L}{\partial \xi_{t,i}} = 1 - \sum_{k=1}^{K}\alpha_{t,k,i} = 0 \ \Rightarrow\ \sum_{k=1}^{K}\alpha_{t,k,i} = 1. \qquad (7)$$

Similarly, for M_{0,k}, we require

$$\frac{\partial L}{\partial M_{0,k}} = 0. \qquad (8)$$

Since the derivation is rather technical, we defer the complete derivation to Appendix A. We get results in the following form:

$$M_{0,k} = \frac{1}{2\beta}\sum_{t=1}^{T}\sum_{i=1}^{m_t}(\delta_{k,y_i} - \alpha_{t,k,i})\,x_{t,i}. \qquad (9)$$

Next, for V_{t,k}, we need

$$\frac{\partial L}{\partial V_{t,k}} = 0, \qquad (10)$$

which results in the form

$$V_{t,k} = \frac{1}{2\beta\rho_t}\sum_{i=1}^{m_t}(\delta_{k,y_i} - \alpha_{t,k,i})\,x_{t,i}. \qquad (11)$$

Eqs. (9) and (11) imply that the solutions of the optimization problem given by Eq. (5) are matrices whose rows are linear combinations of training examples. Note that, from Eq. (9), the contribution of an instance x_i to M_{0,k} is δ_{k,y_i} − α_{t,k,i}, and the instances of all T tasks contribute to M_{0,k}. For V_{t,k}, the contribution of an instance is also δ_{k,y_i} − α_{t,k,i}; in contrast to M_{0,k}, V_{t,k} involves only the tth task's instances. We say that an instance x_i is a support vector if its coefficient is not zero. For the T different but related tasks, M_{0,k} can be rewritten as

$$M_{0,k} = \sum_{t=1}^{T}\rho_t V_{t,k}. \qquad (12)$$

This implies that M_0 takes ρ_t information from task t to represent the common information between different tasks. To solve the optimization problem, we have to rewrite L(V_{t,k}, M_{0,k}, ξ_{t,i}) in Eq. (6) as a dual optimization problem. Since the derivation is rather technical, we defer the complete derivation to Appendix B. We then obtain the following objective function of the dual program:

$$\max \ Q(\alpha) = -\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\,\delta_{k,y_i} - \frac{1}{4\beta}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}-\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i}, x_{s,j}\rangle. \qquad (13)$$

If we set m = Σ_{t=1}^{T} m_t and i, j ∈ {1, 2, 3, ..., m}, where m means the training-data size and j(i) means that the ith example belongs to the j(i)th task, then t = j(i), and the variables α_{t,k,i} can be replaced with α_{k,i}. We then obtain the following objective function of the dual program:
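The stationarity results above can be checked numerically: for any dual variables on the simplex of Eq. (7), the closed forms (9) and (11) satisfy the identity (12) exactly. A small sketch with random data follows; all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, d, beta = 4, 3, 5, 0.7
m_t = [6, 8, 5]                                   # task training-set sizes
rho = rng.uniform(0.5, 2.0, size=T)               # per-task weights rho_t
X = [rng.standard_normal((m, d)) for m in m_t]    # task-t inputs x_{t,i}
y = [rng.integers(0, K, size=m) for m in m_t]     # task-t labels y_i

# dual variables satisfying the simplex constraint of Eq. (7): sum_k alpha_{t,k,i} = 1
alpha = [rng.dirichlet(np.ones(K), size=m).T for m in m_t]   # each of shape (K, m_t)

def delta(k, labels):
    return (labels == k).astype(float)

# Eq. (9): M_{0,k} = (1 / 2 beta) sum_t sum_i (delta_{k,y_i} - alpha_{t,k,i}) x_{t,i}
M0 = np.array([
    sum((delta(k, y[t]) - alpha[t][k]) @ X[t] for t in range(T)) / (2 * beta)
    for k in range(K)])

# Eq. (11): V_{t,k} = (1 / (2 beta rho_t)) sum_i (delta_{k,y_i} - alpha_{t,k,i}) x_{t,i}
V = [np.array([(delta(k, y[t]) - alpha[t][k]) @ X[t] / (2 * beta * rho[t])
               for k in range(K)]) for t in range(T)]

# Eq. (12): M_{0,k} = sum_t rho_t V_{t,k} holds exactly
assert np.allclose(M0, sum(rho[t] * V[t] for t in range(T)))
print("Eq. (12) holds for the closed forms (9) and (11)")
```

The identity holds regardless of the particular α, because the ρ_t factors in (11) and (12) cancel term by term against the 1/ρ_t in the closed form.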
3.1. Non-linear multitask learning

The dual optimization problem in Eq. (14) has one shortcoming: a linear decision function may be unsuitable for classifying different examples, especially when the training sets are not linearly separable. But an important characteristic of SVMs is that they can be used to estimate highly non-linear functions through the use of kernels [5]. This method is called the kernel trick, which can be used to avoid computing feature maps explicitly. In fact, every algorithm based entirely on inner products can be kernelized [20]. By the same strategy as in [5], we extend our method to non-linear multitask learning as

$$\min \ Q(\alpha) = \sum_{k=1}^{K}\sum_{i=1}^{m}\alpha_{k,i}\,\delta_{k,y_i} + \frac{1}{4\beta}\sum_{k=1}^{K}\sum_{i=1}^{m}\sum_{j=1}^{m}(\alpha_{k,i}-\delta_{k,y_i})(\alpha_{k,j}-\delta_{k,y_j})\,K_{j(i),j(j)}(x_i, x_j),$$
$$\text{s.t.}\ \forall k, i:\ \alpha_{k,i} \ge 0,\ \ \sum_{k=1}^{K}\alpha_{k,i} = 1, \qquad (15)$$

where K stands for the kernel function, given as

$$K_{j(i),j(j)}(x_i, x_j) = \Big(1 + \frac{\delta_{j(i),j(j)}}{\rho_{j(i)}}\Big)\langle x_i, x_j\rangle. \qquad (16)$$

We may replace the dot product with a non-linear kernel, as is done for standard SVMs [5]. It is noteworthy that this non-linear multitask kernel is different from the one in [5]: all tasks share the same parameter μ in [5]. Because tasks' weights should be different in multitask learning, we use a separate parameter ρ_t for each task. We rewrite the classifier as

$$H_t(x) = \arg\max_{k=1}^{K}\{M_{t,k}\,x\} = \arg\max_{k=1}^{K}\Big\{\frac{1}{2\beta}\sum_{i=1}^{m}(\delta_{k,y_i}-\alpha_{k,i})\,K_{j(i),t}(x_i, x)\Big\}. \qquad (17)$$

Our strategy is easy to implement by replacing the kernels used in MKVMs. MKVMs have two implementations: one is described in [8] and the other is given in [17]. In the experiments, we choose the implementation in [17] as our solver.

3.2. Differences from traditional multitask kernels

One may think that this kind of multitask kernel is identical to the traditional multitask kernel described in [5]. We now point out the differences from [5]. The traditional form of the multitask kernel can be written as

$$K_{j(i),j(j)}(x_i, x_j) = \Big(\delta_{j(i),j(j)} + \frac{1}{\mu}\Big)\langle x_i, x_j\rangle. \qquad (18)$$

Besides the formal difference between (18) and (16), they also have different meanings. The characteristic of the multitask kernel described in [5] is that all related but different tasks share the same parameter μ, as in (18), while in (16) each task has its own parameter, which expresses its relationship with the other related but different tasks.

Using the traditional multitask kernel, it is difficult to deal with the following scenario. Suppose there are T related but different tasks, where one task m is very different from, but still related to, the other T − 1 tasks, and those T − 1 tasks are coupled very well. When using a parameter μ_m appropriate for task m in (18), task m learns very well while the other T − 1 tasks may learn badly, because task m is different from the other tasks. When using another parameter μ_n appropriate for the other T − 1 tasks, the T − 1 tasks learn very well because they are well coupled, while task m may not learn very well. If we use (16), this situation is resolved.

An additional benefit of (16) is that it can be simply extended to multitask clustering [6,14]. For example, task cluster A may have three tasks which take multitask parameters 7, 8, 9, while task cluster B has two tasks which take multitask parameters 100, 90. We may use the multitask parameters to control the tasks' clustering result.

3.3. Label-incompatible multitask learning

Commonly, MTL with SVMs requires all tasks to share an identical set of labels [4]. This is a serious limitation in MTL domains, especially when different tasks may not have the same labels. In label-incompatible multitask learning settings, we assume that the same label refers to the same class, while different labels refer to different classes.

M2SVMs, as described in this paper, can easily overcome this limitation. Assume there are two different but related tasks A and B. Task A has three labels 1–3 while task B has three labels 2–4. We may let these two related but different tasks share four labels, from 1 to 4: task A has no instances of class 4, while task B has no instances of class 1. M2SVMs can then solve this kind of label-incompatible MTL problem.

Another extreme case is that task A has three labels 1–3 while task B has two labels 4 and 5. In this scenario, M2SVMs perform MTL with five classes, where task A has no instances of classes 4 and 5 and task B has no instances of classes 1–3.

As is known, domain adaptation utilizes labeled data from a source domain to help learning in a target domain with few or even no labeled data. In this paper, label-incompatible learning is different from domain adaptation: there is no division into source domain and target domain in our multitask learning settings, and each task in our label-incompatible experiments has an equal role.

4. Experiments

In our experiments, we categorize MTL into two different scenarios: label-compatible MTL and label-incompatible MTL. In the label-compatible MTL scenario, all tasks share the same label set, while in label-incompatible MTL they do not [4]. We demonstrate the applicability and effectiveness of M2SVMs in these scenarios.

4.1. Experimental setting

Before introducing the details of our experiments, we introduce some technical tricks for SVMs and common settings in our experiments. Scaling before applying SVMs is very important [22]. The main advantage of scaling is to avoid attributes in larger numeric ranges dominating those in smaller numeric ranges; another advantage is to avoid numerical difficulties. We scale each attribute to the range [−1, +1] as in [22].

In our experiments, we use n-fold cross-validation to get the average error rate. The multitask kernel used in our experiments is based on the widely used radial basis function (RBF) kernel, and can be written as

$$K_{j(i),j(j)}(x_i, x_j) = \Big(1 + \frac{\delta_{j(i),j(j)}}{\rho_{j(i)}}\Big)\kappa(x_i, x_j), \quad \kappa(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|_2^2),\ \ \gamma > 0, \qquad (19)$$

where κ stands for the regular kernel function and γ is the kernel parameter. It should be noted that there are two other parameters, ρ_t and β, in our model.
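The multitask RBF kernel of Eq. (19) is straightforward to compute over a batch of examples. The following is our own minimal NumPy sketch, not the solver used in the experiments.

```python
import numpy as np

def mt_rbf_kernel(Xi, Xj, task_i, task_j, rho, gamma):
    """Multitask RBF kernel of Eq. (19):
    K(x_i, x_j) = (1 + delta(j(i), j(j)) / rho_{j(i)}) * exp(-gamma * ||x_i - x_j||^2).
    task_i, task_j hold the task index j(.) of each example; rho maps task -> rho_t."""
    sq = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    base = np.exp(-gamma * sq)                              # plain RBF kernel kappa
    same = (task_i[:, None] == task_j[None, :]).astype(float)
    factor = 1.0 + same / rho[task_i][:, None]              # 1 + delta_{t,s} / rho_t
    return factor * base

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))
tasks = np.array([0, 0, 1, 1, 2, 2])
rho = np.array([1.0, 2.0, 0.5])
G = mt_rbf_kernel(X, X, tasks, tasks, rho, gamma=0.1)
```

Note that the task factor only differs from 1 when both examples come from the same task, so ρ_{j(i)} = ρ_{j(j)} there and the resulting Gram matrix is symmetric, as a kernel matrix must be.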
It is usually unknown beforehand which parameters are best for the problem at hand [22]. We used a grid-search strategy to select the best parameters, which is also recommended in [22]. Various triples of (ρ_t, β, γ) values are tried, and we adopt the one with the best cross-validation results on the training set. In our experiments, we first perform a coarse grid search for the parameters ρ_t, β, γ in the region [2^{−10}, 2^{14}] with the exponent growing in steps of 0.5. Then we perform a finer grid search on the neighborhood of the best result of the coarse grid search, such as [2^{1}, 2^{3}], with the exponent growing in steps of 0.01. Although the process of grid search usually takes a lot of time during training, it is a very effective and straightforward method which avoids an even more time-consuming exhaustive parameter search.

4.2. Label-compatible MTL experiments

We now evaluate M2SVMs on four datasets taken from the UCI Machine Learning Repository [23]. The first one is Isolet spoken alphabet recognition. The second one is spoken Arabic digit (SAD). The last two are signature verification datasets, one of which is from GPDS [24] while the other is from MCYT [25]. Usually, signature verification uses different evaluation parameters for experimental results [26], and the settings of MTL in signature verification are different from the previous datasets. Therefore, we split the experiments on these four datasets into two experimental groups: the experiments on the first two datasets are in one group, while the experiments on signature verification are in the other group.

4.2.1. Experimental group (I)

In this experimental group, we evaluate M2SVMs on the Isolet spoken alphabet recognition dataset and the SAD dataset. We first provide a brief overview of the two datasets and then present results in various experimental settings. In these experiments, we use 3-fold cross-validation to get the average error rate.

Details about datasets. The Isolet dataset was collected from 150 subjects uttering all English alphabet letters twice. Each speaker contributed 52 training examples. Because three examples are historically missing, there are 7797 examples in total. The speakers are grouped into five sets of 30 speakers each; these groups are referred to as Isolet1 to Isolet5. The original task is to classify which letter has been uttered based on features such as spectral coefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features. The representation of Isolet lends itself to multitask multiclass learning [4]. Therefore, we treat each of the subsets as its own classification task (T = 5) with K = 26 labels. These five tasks are highly related to each other because all the data are collected from the same utterances [4]. Because groups of speakers vary in the way they speak the English alphabet, the five tasks are different from each other. These five tasks are labeled from 1 to 5. In order to remove low-variance noise and to speed up computation, we preprocess the Isolet data with principal component analysis (PCA), capturing 98% of the data variance while reducing the dimensionality from 617 to 290.

The SAD dataset was collected from 88 speakers (44 male and 44 female native Arabic speakers between ages 18 and 40) representing the 10 spoken Arabic digits from 0 to 9. Each speaker spoke each Arabic digit 10 times; therefore, there are 8800 (10 digits × 10 repetitions × 88 speakers) examples in the dataset. The original task is to classify which Arabic digit has been uttered based on 13 mel-frequency cepstral coefficients. The representation of SAD also lends itself to multitask multiclass learning. We split the original task into two tasks: one task (Task 1) is to classify Arabic digits spoken by males, while the other (Task 2) is to classify those spoken by females. These two tasks are highly related to each other because the data are sampled from the same region. Because the voice, intonation and volume usually differ between male and female speakers, we may consider these two tasks to also be different from each other. In the original dataset, the feature of one spoken digit is a matrix of size row × 13 (4 ≤ row ≤ 93). For simplicity, we compute the average value of each column of this matrix; we then get a vector of size 1 × 13 to represent one digit's features. Details about these two datasets are summarized in Table 1.

Table 1
Details of datasets.

Name    Attributes  Instances  Classes  Tasks
Isolet  290         7797       26       5
SAD     13          8800       10       2

Results and analysis. We compare the performance of the M2SVMs algorithm with different baselines in Tables 2 and 3. ALL means taking all the T tasks' data together and using single-task learning to deal with these data, ignoring the tasks' differences. Multitask learning is denoted MTL in Tables 2 and 3. Because MKVMs and M2SVMs give the same result in the single-task learning scenario, we write '–' to mean that the result is the same as MKVMs'. In the MTL setting, MKVMs give the same results as M2SVMs if MKVMs use the multitask kernel.

Table 2
Error rates (%) on the Isolet dataset.

Isolet  one2one       one2all       MKVMs         M2SVMs
Task 1  7.12 ± 1.82   7.18 ± 1.70   5.51 ± 1.92   –
Task 2  6.47 ± 2.06   6.73 ± 1.07   5.00 ± 1.02   –
Task 3  9.70 ± 1.85   9.68 ± 1.56   9.70 ± 1.60   –
Task 4  10.40 ± 1.78  10.98 ± 0.85  11.23 ± 1.80  –
Task 5  10.33 ± 2.08  9.81 ± 1.01   10.29 ± 1.85  –
ALL     4.84 ± 0.82   4.23 ± 0.62   4.10 ± 0.67   –
MTL     3.23 ± 0.91   3.66 ± 0.82   –             2.70 ± 0.97

Table 3
Error rates (%) on the SAD dataset.

SAD     one2one       one2all       MKVMs         M2SVMs
Task 1  3.46 ± 1.15   3.07 ± 1.08   2.73 ± 0.64   –
Task 2  5.32 ± 0.91   5.10 ± 0.90   5.28 ± 0.92   –
ALL     4.97 ± 0.97   4.32 ± 1.23   4.10 ± 0.93   –
MTL     3.10 ± 0.88   3.25 ± 0.94   –             2.97 ± 0.76

From Table 2, we find that, thanks to the increased number of training examples, accuracy is greatly improved even when we simply take the instances of all tasks together and ignore the relationships between tasks. In Table 3, collecting the instances of all tasks together just gives error rates which appear to be the average of the two tasks' results. But when the relationships between tasks are captured by multitask multiclass learning, multitask learning gets better results in Tables 2 and 3 than methods which ignore this useful information between tasks. This shows the importance of discovering and exploiting relationships between tasks.

If we compare the results by columns, we find that MKVMs give results comparable with the other two popular multiclass classification methods. In fact, because MKVMs can capture relationships between different classes, they often get better results than the other two methods.

In Table 3, the result of M2SVMs is 2.97% while MKVMs' is 2.73% for Task 1. It may seem that M2SVMs get a worse result than learning the tasks independently. But if we consider Task 1 and Task 2 together, there are 1467 × 2.73% + 1467 × 5.28% ≈ 117 instances misclassified, while with M2SVMs there are only 2934 × 2.97% ≈ 87 instances misclassified. (There are 1467 examples for testing in Task 1 and in Task 2, because we use 3-fold cross-validation and the data size of each task is 4400; therefore, there are 2934 examples for testing in MTL and ALL.)
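The bookkeeping in the comparison above can be reproduced directly from the error rates in Table 3; a quick sketch (1467 test examples per task, 2934 in total):

```python
# Misclassification counts implied by the Table 3 error rates (SAD dataset):
per_task = int(1467 * 0.0273) + int(1467 * 0.0528)   # MKVMs trained per task
mtl      = int(2934 * 0.0297)                        # M2SVMs, both tasks jointly
pooled   = int(2934 * 0.0410)                        # MKVMs on pooled data (ALL)
print(per_task, mtl, pooled)                         # → 117 87 120
```

These are exactly the counts quoted in the text: 117 errors for independent learning, 87 for M2SVMs, and 120 when the two tasks are simply pooled.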
If we just take the instances of Task 1 and Task 2 together, MKVMs get 2934 × 4.10% ≈ 120 instances misclassified. This is even slightly worse than learning the tasks independently. This comparison also indicates the importance of exploiting correlations between tasks.

Since M2SVMs can not only capture correlations between different classes but also use common information between tasks for prediction, they get the best results in both Tables 2 and 3.

Because of the complexity of the dual optimization problem, one may think that M2SVMs take a lot of computing time. Therefore, we also compare these methods on computing time. These experiments were carried out on an Intel Core 2 with 1024 MB RAM using the gcc compiler. Note that the results shown in Fig. 1(a) and (b) are the average time of 3-fold cross-validation with instances randomly selected 10 times. From Fig. 1(a) and (b), we can easily see that M2SVMs take a similar time to multitask SVMs based on the one2one strategy, while multitask SVMs based on the one2all strategy take more time than the other two methods.

Fig. 1. Comparison of computing time. (a) Results on Isolet. (b) Results on SAD.

Because using fewer support vectors can improve the speed of prediction and avoid over-fitting, we also make comparisons on the number of support vectors. After the training process, the three algorithms have different numbers of support vectors. From Fig. 2(a) and (b) we find that M2SVMs and multitask SVMs based on one2one use fewer support vectors than multitask SVMs based on one2all.

Fig. 2. Comparison of support vectors. (a) Results on Isolet. (b) Results on SAD.

4.2.2. Experimental group (II)

Experiments are carried out on the GPDS [24] and MCYT [25] datasets, respectively. For signature verification, the evaluation parameters of experimental results differ from classical experiments. First, we briefly introduce these evaluations; then we provide a brief overview of the two datasets and present results in various experimental settings.

Experimental settings. In an off-line signature verification system, three kinds of forgery are considered: random, simple and skilled forgeries [27]. A random forgery is usually represented by a signature example that belongs to a writer different from the signature model. A simple forgery is represented by a signature example with the same shape as the genuine writer's name. A skilled forgery is represented by a suitable imitation of the genuine signature model. Because skilled forgeries are very similar to genuine signatures, it is difficult to distinguish skilled forgeries from genuine signatures [27]. The MCYT and GPDS datasets provide only random and skilled forgeries; therefore, there are three classes in the classification problem.

If we take each person's signature verification as a single task, some common information may be shared between tasks. If we take one person's signature verification as the target task, we may add
Table 4
Error rates (%) on the GPDS dataset.

AER   13.01 ± 8.59  8.61 ± 7.14   14.75 ± 9.15  10.73 ± 8.03  11.54 ± 8.67  6.10 ± 7.53
AFAR  9.27 ± 0.07   3.74 ± 0.05   8.93 ± 0.08   5.86 ± 0.07   7.02 ± 0.06   4.29 ± 0.06
AFRR  30.83 ± 0.14  28.81 ± 0.09  34.25 ± 0.25  29.28 ± 0.16  28.97 ± 0.15  16.52 ± 0.18

Table 5
Error rates (%) on the MCYT dataset.

AER   9.84 ± 9.02   7.38 ± 8.63   11.06 ± 9.60  7.89 ± 9.53   8.51 ± 10.81  6.58 ± 8.71
AFAR  5.52 ± 0.13   5.58 ± 0.10   8.93 ± 0.11   4.52 ± 0.13   6.27 ± 0.21   5.42 ± 0.14
AFRR  20.95 ± 0.09  17.70 ± 0.11  29.28 ± 0.15  19.18 ± 0.09  15.54 ± 0.17  9.21 ± 0.13
Table 6
Error rates (%) on Isolet label-incompatible tasks.

Table 7
Error rates (%) on SAD label-incompatible tasks.
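The label-incompatible tasks summarized in Tables 6 and 7 rely on the label-union construction of Section 3.3 (task A with labels 1–3 and task B with labels 2–4 are trained jointly over the union 1–4). A toy sketch of that construction, with names of our own choosing:

```python
# Label-union construction from Section 3.3: map label-incompatible tasks
# onto one shared label set so M2SVMs can train on them jointly.
def union_label_space(task_labels):
    """task_labels: dict task -> set of labels. Returns the sorted union and,
    per task, the classes of which that task has no instances."""
    union = sorted(set().union(*task_labels.values()))
    missing = {t: sorted(set(union) - labels) for t, labels in task_labels.items()}
    return union, missing

union, missing = union_label_space({"A": {1, 2, 3}, "B": {2, 3, 4}})
print(union)            # → [1, 2, 3, 4]
print(missing["A"])     # → [4]   task A has no instances of class 4
print(missing["B"])     # → [1]   task B has no instances of class 1
```

The same helper covers the extreme case from Section 3.3 as well: with task A holding labels 1–3 and task B holding labels 4–5, the union has five classes and each task is simply missing the other's classes.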
The one2one method trains two classifiers: one is used to separate genuine signatures from random forgeries, and the other is used to separate genuine signatures from skilled forgeries. With this strategy, it cannot capture information shared among the different classes. Suppose we use similarity scores to measure the similarity to genuine signatures; the three classes will then get different scores. For example, genuine signatures may score 100, skilled forgeries may get 70–80, while random forgeries may get 3–20. The one2one method cannot capture this structure between classes, but it gets better results than the one2all method because it considers the differences between random forgeries and skilled forgeries. MKVMs can capture these structures between different classes, and thus give better results on the AER, AFAR and AFRR evaluation measures.

In Table 4, the AFAR of the one2one MTL method is lower than that of any other method. However, we can find that the one2one method tends to reject genuine signatures in order to guarantee low acceptance rates for random and skilled forgeries. Therefore, it gives a higher AFRR than M2SVMs. The same situation may be found in Table 5. From Tables 4 and 5, we can also find that better results in signature verification problems can be obtained by the MTL strategy.

4.3. Label-incompatible MTL experiments

To demonstrate M2SVMs' ability to learn multiple tasks that have different sets of class labels, we run experiments on the Isolet and SAD datasets. Because the traditional SVMs cannot be used in this case [4], we just compare M2SVMs with the single-task version of SVMs.

4.3.1. Experimental settings
In order to get label-incompatible MTL datasets, we construct tasks from the original datasets by the following steps:

- Select n different labels from the label set randomly.
- Select all the examples with these n labels.

For the Isolet dataset, we get five tasks, each with 10 different labels. Therefore, any two tasks have different label sets, although some individual labels may be the same. Table 6 shows the results of M2SVMs. For the SAD dataset, we get three tasks, each with four different labels. By this strategy, we can get label-incompatible MTL datasets. Table 7 shows the results of M2SVMs. Note that, for each dataset, we repeat the experiments five times to get average results.

4.3.2. Results and analysis
From Tables 6 and 7, we may find that M2SVMs can obtain better results in label-incompatible MTL. Because the tasks here have different meanings from those in the label-compatible MTL experiments, the characteristics of the results are essentially not the same as before.

4.4. Comparative experiments

We compare our method with three methods: Large Margin Multitask Metric Learning (LM3L) [4], Convex Multitask Feature Learning (MTL-FEAT) [7] and Boosted Multitask Learning (BMTL) [16]. These experiments are carried out on the Isolet and SAD datasets in label-incompatible settings. As before, we also use 3-fold cross-validation for the comparative experiments. Details about the datasets are given in Table 1. Gaussian kernel functions are used in MTL-FEAT and M2SVMs.

From Table 8, we find that our method gets comparable results with the three state-of-the-art algorithms. BMTL gets slightly better results on the SAD dataset, but its standard deviation is much larger than that of M2SVMs.

1 LM3L code available at http://www.cse.wustl.edu/~kilian/code/files/mtLMNN.zip.
2 MTL-FEAT code available at http://ttic.uchicago.edu/~argyriou/code/mtl_feat/mtl_feat.tar.

5. Conclusions and future work

In this paper, we presented a new method named M2SVMs based on the regularization principle. Our method is mainly inspired by MKVMs [8] and regularized multitask learning [5]. We derived a dual optimization problem which is solved to give the classifiers. Then we extended the method to non-linear multitask learning by the kernel trick. This extension is very easy to implement using existing tools.

We compared our method with two other popular multitask learning methods, which are based on the one2one and one2all strategies. These algorithms are compared in three aspects: accuracy, computing time and the number of support vectors. The results show that our method gets better performance than the other two methods on almost all comparisons. Because the information between tasks can be effectively captured by
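The task-construction steps described in Section 4.3.1 can be sketched as follows. This is an illustrative sketch only; `make_tasks` and its parameters are our own names, not code from the paper:

```python
import random

def make_tasks(X, y, n_labels, n_tasks, seed=0):
    """Build label-incompatible tasks from one multiclass dataset:
    each task keeps only the examples of n_labels randomly chosen
    labels, so different tasks generally have different label sets."""
    rng = random.Random(seed)
    label_set = sorted(set(y))
    tasks = []
    for _ in range(n_tasks):
        # Step 1: select n different labels from the label set randomly.
        chosen = set(rng.sample(label_set, n_labels))
        # Step 2: select all the examples with these n labels.
        idx = [i for i, lab in enumerate(y) if lab in chosen]
        tasks.append(([X[i] for i in idx], [y[i] for i in idx]))
    return tasks
```

With, e.g., ten classes and `n_labels=4`, repeated draws give tasks whose label sets overlap only partially, which is exactly the label-incompatible setting used for the Isolet and SAD experiments.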
Appendix. Derivation of the dual problem

For $M_{0,k}$, we need

$$\frac{\partial L}{\partial M_{0,k}} = 2bM_{0,k} + \sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\,x_{t,i} - \sum_{t=1}^{T}\sum_{i=1}^{m_t}\delta_{k,y_i}\underbrace{\Big(\sum_{e=1}^{K}\alpha_{t,e,i}\Big)}_{=1}\,x_{t,i} = 2bM_{0,k} + \sum_{t=1}^{T}\sum_{i=1}^{m_t}(\alpha_{t,k,i}-\delta_{k,y_i})\,x_{t,i} = 0, \tag{23}$$

which results in the following form:

$$M_{0,k} = \frac{1}{2b}\sum_{t=1}^{T}\sum_{i=1}^{m_t}(\delta_{k,y_i}-\alpha_{t,k,i})\,x_{t,i}. \tag{24}$$

Next, for $V_{t,k}$, we need

$$\frac{\partial L}{\partial V_{t,k}} = 2b\rho_t V_{t,k} + \sum_{i=1}^{m_t}\alpha_{t,k,i}\,x_{t,i} - \sum_{i=1}^{m_t}\delta_{k,y_i}\underbrace{\Big(\sum_{e=1}^{K}\alpha_{t,e,i}\Big)}_{=1}\,x_{t,i} = 2b\rho_t V_{t,k} + \sum_{i=1}^{m_t}(\alpha_{t,k,i}-\delta_{k,y_i})\,x_{t,i} = 0, \tag{25}$$

which gives

$$V_{t,k} = \frac{1}{2b\rho_t}\sum_{i=1}^{m_t}(\delta_{k,y_i}-\alpha_{t,k,i})\,x_{t,i}. \tag{26}$$

Substituting these solutions back into the Lagrangian, the dual objective becomes

$$Q(\alpha) = \sum_{k=1}^{K}\sum_{t=1}^{T}\Big(b\rho_t\|V_{t,k}\|_2^2 + \sum_{i=1}^{m_t}\alpha_{t,k,i}V_{t,k}^{\top}x_{t,i}\Big) + \sum_{k=1}^{K}\Big(b\|M_{0,k}\|_2^2 + \sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}M_{0,k}^{\top}x_{t,i}\Big) - \sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\big(\alpha_{t,k,i}V_{t,y_i}^{\top}x_{t,i} + \alpha_{t,k,i}M_{0,y_i}^{\top}x_{t,i}\big) - \sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\,\delta_{k,y_i} + C, \tag{27}$$

where $C$ is a constant equal to the whole training data size of all tasks:

$$C = \sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i} = \sum_{t=1}^{T}\sum_{i=1}^{m_t}\underbrace{\Big(\sum_{k=1}^{K}\alpha_{t,k,i}\Big)}_{=1} = \sum_{t=1}^{T}m_t. \tag{28}$$

In order to get a simple representation of Eq. (27), we define $S_1$, $S_2$ and $S_3$ from the corresponding groups of terms in Eq. (27). Substituting the solution for $V_{t,k}$ into $S_1$, we get

$$S_1 = \rho_t\|V_{t,k}\|_2^2 + \frac{1}{b}\sum_{i=1}^{m_t}\alpha_{t,k,i}V_{t,k}^{\top}x_{t,i}$$
$$= \frac{1}{4b^2\rho_t}\sum_{i=1}^{m_t}\sum_{j=1}^{m_t}(\delta_{k,y_i}-\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{t,k,j})\langle x_{t,i},x_{t,j}\rangle + \frac{1}{2b^2\rho_t}\sum_{i=1}^{m_t}\sum_{j=1}^{m_t}\alpha_{t,k,i}(\delta_{k,y_j}-\alpha_{t,k,j})\langle x_{t,i},x_{t,j}\rangle$$
$$= \frac{1}{4b^2\rho_t}\sum_{i=1}^{m_t}\sum_{j=1}^{m_t}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{t,k,j})\langle x_{t,i},x_{t,j}\rangle,$$

and similarly, substituting Eq. (24) into $S_2$,

$$S_2 = \|M_{0,k}\|_2^2 + \frac{1}{b}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}M_{0,k}^{\top}x_{t,i}$$
$$= \frac{1}{4b^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}-\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\langle x_{t,i},x_{s,j}\rangle + \frac{1}{2b^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}\alpha_{t,k,i}(\delta_{k,y_j}-\alpha_{s,k,j})\langle x_{t,i},x_{s,j}\rangle$$
$$= \frac{1}{4b^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\langle x_{t,i},x_{s,j}\rangle. \tag{29}$$

For $S_3$, using Eqs. (24) and (26),

$$S_3 = \alpha_{t,k,i}V_{t,y_i}^{\top}x_{t,i} + \alpha_{t,k,i}M_{0,y_i}^{\top}x_{t,i}$$
$$= \frac{1}{2b\rho_t}\sum_{j=1}^{m_t}\alpha_{t,k,i}(\delta_{y_i,y_j}-\alpha_{t,y_i,j})\langle x_{t,i},x_{t,j}\rangle + \frac{1}{2b}\sum_{s=1}^{T}\sum_{j=1}^{m_s}\alpha_{t,k,i}(\delta_{y_i,y_j}-\alpha_{s,y_i,j})\langle x_{t,i},x_{s,j}\rangle$$
$$= \frac{1}{2b}\sum_{s=1}^{T}\sum_{j=1}^{m_s}\alpha_{t,k,i}(\delta_{y_i,y_j}-\alpha_{s,y_i,j})\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i},x_{s,j}\rangle.$$

If we substitute $S_1$, $S_2$ and $S_3$ into Eq. (27) directly, we get a very complex equation. We find, however, that $\sum_{t=1}^{T}S_1 + S_2$ leads to a simple expression, and we therefore set $S_4 = \sum_{t=1}^{T}S_1 + S_2$. After substituting $S_1$ and $S_2$ from Eq. (29) into $S_4$, we obtain the simple representation

$$S_4 = \sum_{t=1}^{T}S_1 + S_2 = \frac{1}{4b^2}\sum_{t=1}^{T}\frac{1}{\rho_t}\sum_{i=1}^{m_t}\sum_{j=1}^{m_t}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{t,k,j})\langle x_{t,i},x_{t,j}\rangle + \frac{1}{4b^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\langle x_{t,i},x_{s,j}\rangle$$
$$= \frac{1}{4b^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i},x_{s,j}\rangle. \tag{30}$$

Then, substituting $S_3$ and $S_4$ into Eq. (27) and dropping the constant $C$ results in

$$Q(\alpha) = \frac{1}{4b}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i},x_{s,j}\rangle - \frac{1}{2b}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\sum_{s=1}^{T}\sum_{j=1}^{m_s}\alpha_{t,k,i}(\delta_{y_i,y_j}-\alpha_{s,y_i,j})\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i},x_{s,j}\rangle - \sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\,\delta_{k,y_i}$$
$$= \frac{1}{4b}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i},x_{s,j}\rangle\,\underbrace{\sum_{k=1}^{K}\big[(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j}) - 2\alpha_{t,k,i}(\delta_{y_i,y_j}-\alpha_{s,y_i,j})\big]}_{\triangleq\,S_5} - \sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\,\delta_{k,y_i}. \tag{31}$$

The most complex part of Eq. (31) is $S_5$. Fortunately, $S_5$ has a simpler representation. Since $\delta_{y_i,y_j} = \sum_{k=1}^{K}\delta_{k,y_i}\delta_{k,y_j}$ and $\alpha_{s,y_i,j} = \sum_{k=1}^{K}\delta_{k,y_i}\alpha_{s,k,j}$,

$$S_5 = \sum_{k=1}^{K}\big[(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j}) - 2\alpha_{t,k,i}(\delta_{y_i,y_j}-\alpha_{s,y_i,j})\big]$$
$$= \sum_{k=1}^{K}(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j}) - 2\underbrace{\Big(\sum_{k=1}^{K}\alpha_{t,k,i}\Big)}_{=1}(\delta_{y_i,y_j}-\alpha_{s,y_i,j})$$
$$= \sum_{k=1}^{K}\big[(\delta_{k,y_i}+\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j}) - 2\delta_{k,y_i}(\delta_{k,y_j}-\alpha_{s,k,j})\big]$$
$$= -\sum_{k=1}^{K}(\delta_{k,y_i}-\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j}). \tag{32}$$

Then we substitute $S_5$ in Eq. (32) into Eq. (31) and get

$$\max_{\alpha}\; Q(\alpha) = -\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{i=1}^{m_t}\alpha_{t,k,i}\,\delta_{k,y_i} - \frac{1}{4b}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{i=1}^{m_t}\sum_{j=1}^{m_s}(\delta_{k,y_i}-\alpha_{t,k,i})(\delta_{k,y_j}-\alpha_{s,k,j})\Big(1+\frac{\delta_{t,s}}{\rho_t}\Big)\langle x_{t,i},x_{s,j}\rangle. \tag{33}$$

References

[1] R. Caruana, Multitask learning, Machine Learning 28 (1) (1997) 41–75.
[2] S. Sun, Multitask learning for EEG-based biometrics, in: Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4.
[3] X.T. Yuan, S. Yan, Visual classification with multi-task joint sparse representation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3493–3500.
[4] S. Parameswaran, K.Q. Weinberger, Large margin multi-task metric learning, Advances in Neural Information Processing Systems (2010) 1867–1875.
[5] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
[6] T. Evgeniou, C. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, Journal of Machine Learning Research 6 (1) (2005) 615–637.
[7] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Machine Learning 73 (3) (2008) 243–272.
[8] K. Crammer, Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, Journal of Machine Learning Research 2 (2001) 265–292.
[9] Y. Ji, S. Sun, Multitask multiclass support vector machines, in: Proceedings of the International Conference on Data Mining Workshops, 2011, pp. 512–518.
[10] J. Baxter, A model for inductive bias learning, Journal of Artificial Intelligence Research 12 (2000) 149–198.
[11] S. Ben-David, R. Schuller, Exploiting task relatedness for multiple task learning, in: Learning Theory and Kernel Machines, 2003, pp. 567–580.
[12] N.J. Butko, J.R. Movellan, Learning to learn, in: Proceedings of the IEEE 6th International Conference on Development and Learning, 2007, pp. 151–156.
[13] B. Bakker, T. Heskes, Task clustering and gating for Bayesian multitask learning, Journal of Machine Learning Research 4 (1) (2003) 83–99.
[14] C.A. Micchelli, M. Pontil, Kernels for multi-task learning, Advances in Neural Information Processing Systems (2005) 921–928.
[15] A. Caponnetto, C.A. Micchelli, M. Pontil, Y. Ying, Universal multi-task kernels, Journal of Machine Learning Research 9 (7) (2008) 1615–1646.
[16] O. Chapelle, P.K. Shivaswamy, S. Vadrevu, K.Q. Weinberger, Y. Zhang, B.L. Tseng, Boosted multi-task learning, Machine Learning 85 (1–2) (2011) 149–173.
[17] C.W. Hsu, C.J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 415–425.
[18] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, Advances in Neural Information Processing Systems (2000) 547–553.
[19] V. Franc, V. Hlavác, Multi-class support vector machine, in: Proceedings of the 16th International Conference on Pattern Recognition, 2002, pp. 236–239.
[20] I. Steinwart, Support vector machines are universally consistent, Journal of Complexity 18 (3) (2002) 768–791.
[21] J. Shawe-Taylor, S. Sun, A review of optimization methodologies in support vector machines, Neurocomputing 74 (17) (2011) 3609–3618.
[22] C.W. Hsu, C.C. Chang, C.J. Lin, et al., A Practical Guide to Support Vector Classification, 2003.
[23] A. Asuncion, D.J. Newman, UCI machine learning repository, datasets available at http://archive.ics.uci.edu/ml/datasets.html, 2007.
[24] J.F. Vargas, M.A. Ferrer, C.M. Travieso, J.B. Alonso, Off-line handwritten signature GPDS-960 corpus, in: Proceedings of the 9th International Conference on Document Analysis and Recognition, 2007, pp. 764–768.
[25] J. Ortega-Garcia, J. Fierrez-Aguilar, D. Simon, J. Gonzalez, et al., MCYT baseline corpus: a bimodal biometric database, Vision, Image and Signal Processing 150 (6) (2003) 395–401.
[26] Y. Ji, S. Sun, Off-line signature verification based on multitask learning, in: Proceedings of the 8th International Symposium on Neural Networks, 2011, pp. 323–330.
[27] J. Wen, B. Fang, Y.Y. Tang, T.P. Zhang, Model-based signature verification with rotation invariant features, Pattern Recognition 42 (7) (2009) 1458–1466.
[28] M. Blumenstein, X.Y. Liu, B. Verma, An investigation of the modified direction feature for cursive character recognition, Pattern Recognition 40 (2) (2007) 376–388.
You Ji received the B.E. degree in computer science and technology from East China Normal University in 2009. He is now a master's student in the Pattern Recognition and Machine Learning Research Group, Department of Computer Science and Technology, East China Normal University. His research interests include machine learning, pattern recognition, etc.
Shiliang Sun received the B.E. degree in automatic control from Beijing University of Aeronautics and Astronautics and the Ph.D. degree in pattern recognition and intelligent systems from Tsinghua University, in 2002 and 2007, respectively. He is now a professor and the director of the Pattern Recognition and Machine Learning Research Group, Department of Computer Science and Technology, East China Normal University. He is on the editorial boards of several international journals and a referee for many top journals. His research interests include machine learning, pattern recognition, brain–computer interfaces and intelligent transportation systems, etc.