
Communicated by Shiliang Sun

Accepted Manuscript

Effective active learning strategy for multi-label learning

Oscar Reyes, Carlos Morell, Sebastián Ventura

PII: S0925-2312(17)31337-1
DOI: 10.1016/j.neucom.2017.08.001
Reference: NEUCOM 18727

To appear in: Neurocomputing

Received date: 14 September 2015


Revised date: 18 June 2017
Accepted date: 6 August 2017

Please cite this article as: Oscar Reyes, Carlos Morell, Sebastián Ventura, Effective active learning
strategy for multi-label learning, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.08.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Effective active learning strategy for multi-label learning

Oscar Reyes a, Carlos Morell b, Sebastián Ventura a,c,∗


a Department of Computer Science and Numerical Analysis, University of Córdoba, Spain
b Department of Computer Science, Universidad Central de Las Villas, Cuba
c Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract
Data labelling is commonly an expensive process that requires expert handling. In multi-label data, labelling is further complicated because experts must label each example several times, as each example belongs to several categories. Active learning is concerned with learning accurate classifiers by choosing which examples will be labelled, reducing the labelling effort and the cost of training an accurate model. The main challenge in performing multi-label active learning is designing effective strategies that measure the informative potential of unlabelled examples across all labels. This paper presents a new active learning strategy for multi-label data. Two uncertainty measures, based on the base classifier predictions and on the inconsistency of a predicted label set, respectively, were defined to select the most informative examples. The proposed strategy was compared to several state-of-the-art strategies on a large number of datasets. The experimental results show the effectiveness of the proposal for better multi-label active learning.
Keywords: Multi-label active learning, multi-label classification, label ranking

1. Introduction

In recent years, the study of problems that involve data associated with more than one label at the same time has attracted a great deal of attention [1–6]. Particular multi-label problems include text categorization [7–9], classification of emotions evoked by music [10], semantic annotation of images [11–14], classification of music and videos [15–17], classification of protein and gene function [18–23], acoustic classification [24], chemical data analysis [25], and more.

Multi-label learning is concerned with learning a model able to predict a set of labels for an unseen example. In multi-label learning, two tasks have been studied [26–28]: multi-label classification and label ranking. The multi-label classification task aims to find a model where, for a given test example, the label space is divided into relevant and irrelevant label sets. The label ranking task, on the other hand, aims to provide, for a given test example, a ranking of labels according to their relevance values. Nowadays, it is increasingly common to find multi-label learning algorithms able to produce, at the same time, both a bi-partition of the label space and a consistent ranking of labels.

Most multi-label algorithms have been proposed for supervised learning environments, i.e. scenarios where all training examples are labelled. However, data labelling is commonly a very expensive process that requires expert handling. In multi-label data, experts must label each example several times, as each example belongs to several categories. The situation is further complicated when an expert labels a dataset with a large number of examples and label classes. Consequently, many real-world scenarios nowadays contain a small number of labelled examples and a large number of unlabelled examples simultaneously.

To date, two main areas are concerned with learning models from labelled and unlabelled data: Semi-Supervised Learning [29] and Active Learning [30]. Active Learning (AL) is concerned with learning better classifiers by choosing which examples are labelled for training. Consequently, the labelling effort and the cost of training an accurate model are reduced. AL methods are involved in the acquisition of their own training data. A selection strategy iteratively selects examples from the unlabelled set that seem to be

∗ Corresponding author. Tel: +34957212218; fax: +34957218630.
Email addresses: ogreyes@uco.es (Oscar Reyes), cmorellp@uclv.edu.cu (Carlos Morell), sventura@uco.es (Sebastián Ventura)
Preprint submitted to Neurocomputing August 18, 2017

the most informative for the model that is being trained. In this work, we focused on AL scenarios in which a large collection of unlabelled data and a small set of labelled data are available, known as pool-based AL [31]. The usefulness and effectiveness of AL methods have been proved in several domains [32–37]. For more than a decade, a considerable number of AL methods for single-label data have been proposed; for an interesting survey see [30]. However, AL methods for multi-label data have been far less studied.

The main challenge in performing AL on multi-label data is designing effective strategies that measure the unified informative potential of unlabelled examples across all labels. Most state-of-the-art multi-label AL strategies employ the Binary Relevance [26] approach to break down a multi-label problem into several binary classification problems. Multi-label AL strategies have generally been assessed on the multi-label classification task; however, their performance with regard to the label ranking task has rarely been considered. On the other hand, most AL strategies use informativeness-based¹ criteria to select the most useful unlabelled examples. However, strategies that only select informative examples usually do not exploit either the structure of unlabelled data or the label space information, leading to sub-optimal performance [38].

In this work, an effective multi-label AL strategy is proposed, named Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores (CVIRS). Two measures, based on the base classifier predictions and on the inconsistency of predicted label sets, respectively, were defined. A rank aggregation problem was formulated to compute the unified uncertainty of an unlabelled example across all labels. The rank aggregation problem was based on the probabilities with which the base classifier predicts whether or not an example belongs to a certain label. On the other hand, the inconsistency of a predicted label set, for a given unlabelled example, was computed by means of the distance between the predicted label set and the label sets of the labelled examples.

To the best of our knowledge, this paper presents the first attempt to propose a multi-label AL strategy that computes the uncertainty of unlabelled examples by means of a rank aggregation method, combining this uncertainty measure with another measure that takes into account the label space information. Moreover, in contrast to the majority of works related to multi-label AL, in this paper several AL strategies were compared over a large number of multi-label datasets, and two multi-label learning tasks were analysed: multi-label classification and label ranking.

The experiments were carried out on 18 multi-label datasets. To compare the performance of the AL strategies, in addition to a visual comparison of the learning curves of the strategies, the experimental study included a statistical analysis based on non-parametric tests, resulting in a more robust analysis. The experimental stage showed the effectiveness of the proposal, which obtained significantly better results than previous multi-label AL strategies.

This paper is arranged as follows: Section 2 briefly describes the multi-label learning and active learning paradigms, and the state of the art in the development of AL strategies for multi-label data. Section 3 presents the basis of our proposal. Section 4 describes the experimental set-up and shows the experimental results. Finally, Section 5 provides some concluding remarks.

2. Preliminaries

In this section, a brief description of the multi-label learning and active learning paradigms is given, together with a review of the state-of-the-art multi-label AL strategies.

2.1. Multi-label learning and active learning

A multi-label problem comprises a feature space F and a label space L with cardinality equal to q (number of labels). A multi-label example i is represented as a tuple ⟨Xi, Yi⟩, where Xi is the feature vector and Yi the category vector of the example i. Yi is a binary vector with q components, where component Yiℓ represents whether or not the example i belongs to the ℓ-th label.

Let Φ be a multi-label classifier able to resolve the multi-label classification and label ranking tasks at the same time. Therefore, for a given test example, (i) Φ partitions the label space L into a relevant label set (positive labels) and an irrelevant label set (negative labels), and (ii) Φ returns a ranking of labels according to their relevance.

Multi-label learning algorithms can be organised into two main categories [28]: problem transformation methods and algorithm adaptation methods. Problem transformation methods transform a multi-label dataset into one or more single-label datasets. Afterwards, a single-label classifier is executed for each transformed dataset, and an aggregation strategy is finally performed in order to obtain the results.

¹ Informativeness measures the effectiveness of an example by how much it reduces the uncertainty of a model.
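As a concrete illustration of the problem transformation scheme, the sketch below applies a Binary Relevance-style decomposition: one binary problem is built per label, a fresh binary learner is trained on each, and the per-label predictions are aggregated back into category vectors. The toy centroid-based base learner and all names are our own illustration, not an implementation from the paper.

```python
import numpy as np

class Centroid:
    """Toy binary learner: predicts the class whose mean vector (centroid) is closer."""
    def fit(self, X, y):
        self.c1 = X[y == 1].mean(axis=0) if (y == 1).any() else np.full(X.shape[1], np.inf)
        self.c0 = X[y == 0].mean(axis=0) if (y == 0).any() else np.full(X.shape[1], np.inf)
        return self
    def predict(self, X):
        d1 = np.linalg.norm(X - self.c1, axis=1)
        d0 = np.linalg.norm(X - self.c0, axis=1)
        return (d1 < d0).astype(int)

class BinaryRelevance:
    """Problem transformation sketch: one binary classifier per label column."""
    def __init__(self, base_factory):
        self.base_factory = base_factory  # callable returning a fresh binary learner
        self.models = []
    def fit(self, X, Y):
        # Y is an (n, q) binary category matrix; train one model per label
        self.models = [self.base_factory().fit(X, Y[:, l]) for l in range(Y.shape[1])]
        return self
    def predict(self, X):
        # Aggregation step: stack the q binary predictions into category vectors
        return np.column_stack([m.predict(X) for m in self.models])
```

On a tiny dataset where each label is a direct function of one feature, the decomposition recovers the category matrix exactly; with a stronger base learner the same wrapper structure applies unchanged.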

On the other hand, the algorithm adaptation category comprises the algorithms designed to directly handle multi-label data.

As for active learning, it is an iterative process that aims to construct a better classifier by selecting unlabelled examples. Let Φ be the base classifier used in the AL process. In pool-based AL scenarios, we have a small set of labelled data Ls and a large set of unlabelled data Us. In addition, we have an AL strategy γ that selects a set of unlabelled examples from Us using some selection criterion, e.g. an uncertainty measure. The following steps are commonly performed in an AL process:

1. γ selects unlabelled examples from Us
2. The selected unlabelled examples are labelled by an annotator (e.g. a human expert)
3. The selected examples are added to Ls and removed from Us
4. Φ is trained with the labelled set Ls
5. The performance of Φ is assessed
6. IF not stop-condition THEN go to step 1

In the AL literature, several stopping conditions have been used. Commonly, the AL process is repeated β times (number of iterations). However, whether the performance of the base classifier has attained a certain level can also be used as a stopping criterion. How the base classifier's performance is evaluated depends on the problem studied; commonly, it is assessed by means of a test set and an evaluation measure.

2.2. Related works

Multi-label AL strategies can be classified according to the manner in which the labels of unlabelled examples are queried. Most multi-label AL strategies are designed to query all the label assignments of the selected unlabelled examples [39–48]. On the other hand, there are multi-label AL strategies that query the relevance of an example-label pair [38, 49–56], i.e. the strategy analyses whether a specific label is relevant to a selected example. Strategies that query all labels of an example may lead to information redundancy and more annotation effort in real-world problems with a large number of labels. On the other hand, AL strategies that select example-label pairs avoid information redundancy, but they may ignore the interaction between labels and can obtain only limited supervision from each query [55].

Multi-label AL strategies can also be classified by the manner in which a set of unlabelled examples is selected in each iteration. Most state-of-the-art multi-label AL strategies have been designed to select only one unlabelled example in each iteration (dubbed myopic strategies) [38–43, 45–51, 53–56]. Myopic strategies can easily select a batch of unlabelled examples, e.g. by selecting the most informative examples from the unlabelled set. However, the main drawback of selecting a set of unlabelled examples in a greedy manner is that the selected examples may be similar, resulting in information redundancy. On the other hand, there are AL strategies that select a batch of unlabelled examples in each iteration by taking into account the diversity of the selected examples (dubbed batch-mode strategies). As for batch-mode multi-label AL, to date very few works have been proposed [44, 52]. The few existing works on batch-mode multi-label AL formulate the selection of the best batch of unlabelled examples as an NP-hard problem, and the methods used to resolve it have a high computational cost. Consequently, the application of these methods is difficult, practically speaking, for large-scale multi-label datasets.

Table 1 shows a summary of state-of-the-art multi-label AL strategies.

Source  Year  Type        Label assignment
[39]    2004  Myopic      All
[40]    2006  Myopic      All
[49]    2009  Myopic      EL
[50]    2009  Myopic      EL
[41]    2009  Myopic      All
[42]    2009  Myopic      All
[43]    2009  Myopic      All
[44]    2011  Batch-mode  All
[45]    2011  Myopic      All
[46]    2012  Myopic      All
[47]    2012  Myopic      All
[48]    2013  Myopic      All
[51]    2013  Myopic      EL
[52]    2014  Batch-mode  EL
[53]    2014  Myopic      EL
[38]    2014  Myopic      EL
[54]    2014  Myopic      EL
[55]    2015  Myopic      EL
[56]    2015  Myopic      EL

Table 1: Summary of state-of-the-art multi-label AL strategies. The AL strategies are ordered by year of publication. All: all the label assignments; EL: example-label pair assignments.

In this work, we focused on myopic AL strategies that query all the label assignments of the selected examples, since our proposal belongs to this category of AL strategies. Next, we summarise the most relevant myopic AL strategies that query all the label assignments.

In [39], two multi-label AL strategies, named Max Loss (ML) and Mean Max Loss (MML), were proposed. The two strategies select the unlabelled examples which have the maximum or mean loss value over the predicted labels. MML considers the multi-label

information, taking into account the loss produced in each label. ML calculates the loss value only on the label predicted with the most certainty. The effectiveness of the approaches was proved on two multi-label datasets for image classification.

In [40], the Binary Minimum (BinMin) strategy was proposed. BinMin selects the unlabelled example that, considering a target label, minimises the distance between the restricting hyperplane and the centre of the maximum-radius hyper-ball of each binary SVM classifier. The effectiveness of the approach was assessed on one multi-label dataset for text classification.

In [41], the Maximum Loss Reduction with Maximal Confidence (MMC) strategy was presented. MMC is based on the principles of Expected Error Reduction [30]; it selects those unlabelled examples that maximise the reduction rate of the expected model loss. MMC predicts the number of labels of an unlabelled example following a process named the "LR-based prediction method". In each iteration, the labelled set is transformed into a single-label dataset and a Logistic Regression classifier is trained. The prediction of the Logistic Regression classifier is used to calculate the uncertainty of the unlabelled examples. The effectiveness of the approach was proved on seven multi-label datasets for text classification.

In [42], a general framework for multi-label AL was proposed. The authors defined three dimensions: evidence, class and weight. The evidence dimension represents the type of evidence used for computing the usefulness of an unlabelled example. The class dimension represents how to combine the values of a vector of evidence. The weight dimension shows whether or not all labels are treated equally. The authors showed that the CMN strategy obtains the best results. CMN takes into account the confidence of predictions as the type of evidence (C), the minimum value of a confidence vector (M), and it treats all labels alike (N). The effectiveness of the approach was assessed on two multi-label datasets for text classification.

In [48], the Max-Margin Prediction Uncertainty (MMU) and Label Cardinality Inconsistency (LCI) strategies were proposed. MMU models the uncertainty of an example by computing the separation margin between the predicted groups of positive and negative labels. The LCI strategy, on the other hand, measures the uncertainty of an example as the distance between the number of predicted positive labels and the label cardinality (average number of labels per example) of the current labelled set. The effectiveness of these AL strategies was tested on three multi-label datasets for image classification and one dataset for text classification.

The bibliographic revision revealed that previous works have often been assessed on the multi-label classification task; however, the performance of these AL strategies on the label ranking task has rarely been considered. Most multi-label AL strategies have been tested with BR-SVM as the base classifier, i.e. the Binary Relevance approach using binary SVM classifiers [39–41, 48]. Most AL strategies use informativeness-based criteria to select the most useful unlabelled examples. However, strategies that only select informative examples usually do not exploit the label space information, leading to sub-optimal performance. Some strategies simply extend the binary uncertainty concept to multi-label data by aggregating the value associated with each label, e.g. taking the minimum [40] or the average value over all labels [39, 41, 43, 53].

3. A new multi-label active learning strategy

In this section, the basis of a new multi-label AL strategy is presented. The strategy combines two measures for selecting the most informative unlabelled example. The two measures are based on the classifier predictions and on the inconsistency of predicted label sets, respectively.

Measure based on a rank aggregation problem

Let Φ be a multi-label classifier which, for a given unseen example, returns probabilities for each possible label ℓ ∈ L. We have a probability that an example i belongs to the ℓ-th label (PΦ(ℓ=1|i)) and a probability that i does not belong to the ℓ-th label (PΦ(ℓ=0|i)).

The margin in the classifier predictions with respect to whether or not the given example i belongs to the ℓ-th label can be computed as

    m_Φ^{i,ℓ} = |PΦ(ℓ=1|i) − PΦ(ℓ=0|i)|.    (1)

A large margin value on the ℓ-th label means that the classifier Φ has a small error in predicting whether or not the example belongs to this label. On the other hand, a small margin value on the ℓ-th label means that it is more ambiguous for the current classifier to predict whether or not the example belongs to the label ℓ. So, given an unlabelled example i and a classifier Φ, we can obtain a vector of margin values M_Φ^i = [m_Φ^{i,1}, m_Φ^{i,2}, ..., m_Φ^{i,q}], one margin value for each label ℓ ∈ L. The problem is how to properly aggregate the multi-label information for computing the unified informative value of an unlabelled example.
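As a minimal sketch of Eq. (1): since PΦ(ℓ=0|i) = 1 − PΦ(ℓ=1|i), the margin reduces to |2·PΦ(ℓ=1|i) − 1|, so the whole margin matrix can be obtained from the per-label positive-class probabilities in one vectorised step. The array names are our own illustration:

```python
import numpy as np

def margin_vectors(P_pos):
    """Eq. (1): per-label prediction margins for each unlabelled example.

    P_pos is an (n, q) array with P_pos[i, l] = P(label l = 1 | example i).
    With P(l=0|i) = 1 - P(l=1|i), the margin |P(l=1|i) - P(l=0|i)|
    simplifies to |2*P(l=1|i) - 1|.
    """
    return np.abs(2.0 * np.asarray(P_pos, dtype=float) - 1.0)
```

A margin of 0 marks a maximally ambiguous label prediction, while a margin of 1 marks a fully confident one; row i of the result is the vector M_Φ^i.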

We consider that, for computing the utility of an unlabelled example, it is important to consider the information regarding all unlabelled examples. Note that we focus on pool-based AL scenarios, therefore a vector of margin values can be computed for each unlabelled example i ∈ Us.

Given the vectors M_Φ^{i_1}, M_Φ^{i_2}, ..., M_Φ^{i_|Us|} of the unlabelled examples i_1, i_2, ..., i_|Us|, respectively, q rankings of examples τ_1, τ_2, ..., τ_q can be computed, one ranking for each label ℓ ∈ L. The ranking of unlabelled examples τ_ℓ is computed as

    τ_ℓ = (i_{π_1}, i_{π_2}, ..., i_{π_|Us|}) | m_Φ^{i_{π_1},ℓ} < m_Φ^{i_{π_2},ℓ} < ... < m_Φ^{i_{π_|Us|},ℓ}.    (2)

The ranking τ_ℓ is an ordering (permutation or full list) of the unlabelled examples according to their margin values on the ℓ-th label. We want to find a ranking of examples τ_0 by aggregating the information of the rankings τ_1, τ_2, ..., τ_q, in such a manner that the examples placed in the first positions of the final ranking τ_0 correspond to the most uncertain examples.

This formulation for aggregating the margin values is equivalent to the well-known Rank Aggregation problem [57, 58]. Several rank aggregation methods have been proposed in the literature [57–59]. However, the use of sophisticated rank aggregation methods would not be practical in our situation, considering that (i) nowadays it is common to find multi-label datasets with a large number of labels, and (ii) a large number of unlabelled examples are available in pool-based AL. Consequently, in this work we used the simplest rank aggregation method, known as Borda's method [57].

Borda's method is a positional method that assigns a score to an element according to the positions in which this element appears in each ranking [57]. The advantage of positional methods is that they are computationally efficient. However, positional methods neither optimise any distance criterion nor satisfy Condorcet's criterion. Condorcet's criterion states that if an element defeats every other element in pairwise majority voting, this element should be ranked first [57].

Based on Borda's method, the score of an example i is computed as

    s(i) = Σ_{ℓ∈L} (|Us| − τ_ℓ(i)) / (q · (|Us| − 1)),    (3)

where τ_ℓ(i) is the position of the example i in the ranking τ_ℓ. The greater the value of s(i), the greater the uncertainty of the example i taking into account the information across all labels.

Measure based on category vector inconsistency

In addition to the uncertainty measure based on the rank aggregation problem, we consider it important to take into account the information from the label space when computing the uncertainty of an unlabelled example. The idea is straightforward: the labelled and unlabelled sets of data are drawn from the same underlying distribution, therefore it is expected that the label sets predicted by the base classifier and the label sets of labelled examples share common properties.

In this work, we propose to use a measure based on category vector inconsistency, computing the difference between the predicted label set of an unlabelled example and the label sets of the current labelled examples. Table 2 shows the contingency table given the category vectors Yi and Yj of the examples i and j, respectively. Let a be the number of components such that Yiℓ = Yjℓ = 1, b the number of components such that Yiℓ = 1 and Yjℓ = 0, c the number of components such that Yiℓ = 0 and Yjℓ = 1, and d the number of components such that Yiℓ = Yjℓ = 0.

              Yj = 1   Yj = 0
    Yi = 1      a        b
    Yi = 0      c        d

Table 2: Contingency table between two category vectors.

Given the category vectors Yi and Yj, the normalised Hamming distance is computed as

    dH(Yi, Yj) = (b + c) / q,    (4)

where q is the number of labels.

The distance dH returns the proportion of labels for which two examples differ in their category vectors. However, we also consider it important to measure the difference between the structures of category vectors. Structures are combinations of zeros and ones that are commonly found in a set of binary vectors. In the multi-label context, the label sets that appear most frequently in a dataset create structures that, with high probability, can be found in the category vectors of labelled examples.

To compute the difference between the structures of two binary vectors, the entropy distance defined in [60] was used. The normalised entropy distance between two category vectors Yi and Yj is computed as

    dE(Yi, Yj) = (2·H(Yi, Yj) − H(Yi) − H(Yj)) / H(Yi, Yj),    (5)

where the joint entropy of Yi and Yj is computed as

    H(Yi, Yj) = H4(a/q, b/q, c/q, d/q).

According to the properties of the discrete entropy, H4 is equal to

    H4(a/q, b/q, c/q, d/q) = H2((b+c)/q, (a+d)/q)
                             + ((b+c)/q) · H2(b/(b+c), c/(b+c))
                             + ((a+d)/q) · H2(a/(a+d), d/(a+d)).

The entropy of a category vector Y is computed as

    H(Y) = H2(w/q, s/q) = −(w/q)·log2(w/q) − (s/q)·log2(s/q),

where w and s are the numbers of ones (positive labels) and zeros (negative labels), respectively, of the category vector Y.

Based on the dH and dE distance functions, the inconsistency of the category vector predicted for an unlabelled example i is computed as

    v(i) = (1/|Ls|) · Σ_{j∈Ls} fu(Yi, Yj),    (6)

    fu(Yi, Yj) = { dE(Yi, Yj)  if dH(Yi, Yj) < 1
                 { 1           if dH(Yi, Yj) = 1,

where Yi is the category vector predicted by the base classifier, and Yj is the category vector of the example j that belongs to the labelled set Ls. The greater the value of v(i), the greater the inconsistency of the category vector of the example i.

The distance function dE is more flexible than the distance dH; the former can recognise existing structures (patterns) in two binary vectors. For example, given two category vectors Yi = [010101] and Yj = [101010], dH(Yi, Yj) = 1, since Yi and Yj differ in all their components. However, dE(Yi, Yj) = 0 in this case, since Yi and Yj have the same structure, the alternation of two symbols. Note that, for this case, we assigned the maximum value to fu, i.e. a value equal to 1, to represent that the base classifier is predicting a category vector completely inverse to the ones existing in the labelled set. Consequently, a greater uncertainty value is given to the corresponding example, making it more likely to be selected for querying its actual labels.

It is worth noting that this second uncertainty measure has some relationship with the Kullback-Leibler Divergence (KLD) [61]. KLD is a measure of how one probability distribution diverges from a second, expected probability distribution. In the multi-label AL context, KLD has been widely used to measure the degree to which new models preserve the existing knowledge contained in old ones; commonly this has been done by using the probabilities of the instances belonging to each possible label [44, 49, 52]. In this work, however, we did not use the probabilities computed by the base classifiers; we used the information regarding label memberships instead, which is discrete by nature. In this manner, we intend to avoid the bias that can be introduced by the different manners in which multi-label classifiers compute the probabilities of an instance belonging to each label.

Active learning strategy

Uncertainty sampling is one of the simplest and most commonly used AL strategies [31]. This type of strategy selects those unlabelled examples which are least certain for the base classifier. Based on the two measures previously defined, the most informative example from Us is selected as

    argmax_{i∈Us} s(i) · v(i).    (7)

We named this new strategy Uncertainty Sampling based on Category Vector Inconsistency and Ranking of Scores (CVIRS). This strategy selects those unlabelled examples having the greatest unified uncertainty, computed by means of the rank aggregation problem formulated above, and, at the same time, the most inconsistent predicted category vectors.

This approach can be used with any multi-label classifier that can obtain proper probability estimates from its outputs. Note that our proposal is somewhat related to Density-Weighted methods, which consider that the most informative example should not only be uncertain, but should also be "representative" of the underlying distribution [30]. According to the categories of AL strategies portrayed in Section 2, the proposed strategy can be categorised as myopic, and it queries all the label assignments.

Regarding the computational complexity of computing the score of an unlabelled example by means of the rank aggregation problem formulated, let fts(Φ) be the cost function of the multi-label classifier Φ to classify an unlabelled example. To compute the margin vector of each unlabelled example i ∈ Us, O(|Us| · fts(Φ)) steps are needed. To compute the q rankings of unlabelled examples, O(q · |Us|²) steps are needed.
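Putting Eqs. (2)–(7) together, the following is a compact sketch of the CVIRS scoring; it is our own illustrative implementation, with ties in the per-label rankings broken arbitrarily and the degenerate entropy cases handled via the usual 0·log2(0) = 0 convention:

```python
import numpy as np

def H2(p, r):
    """Binary entropy H2(p, r), with the convention 0*log2(0) = 0."""
    return sum(-x * np.log2(x) for x in (p, r) if x > 0)

def borda_scores(margins):
    """Eqs. (2)-(3): Borda aggregation of the q per-label margin rankings.

    margins is an (n, q) array of Eq. (1) values; per label, examples are
    ranked by increasing margin, so position 1 is the most uncertain.
    """
    n, q = margins.shape
    # positions[i, l] = 1-based position of example i in ranking tau_l
    positions = np.argsort(np.argsort(margins, axis=0), axis=0) + 1
    return (n - positions).sum(axis=1) / (q * (n - 1))

def entropy_distance(Yi, Yj):
    """Eq. (5): normalised entropy distance between two category vectors."""
    q = len(Yi)
    a = np.sum((Yi == 1) & (Yj == 1)) / q
    b = np.sum((Yi == 1) & (Yj == 0)) / q
    c = np.sum((Yi == 0) & (Yj == 1)) / q
    d = np.sum((Yi == 0) & (Yj == 0)) / q
    Hij = sum(-x * np.log2(x) for x in (a, b, c, d) if x > 0)  # joint entropy H4
    Hi, Hj = H2(a + b, c + d), H2(a + c, b + d)                # marginal entropies
    return 0.0 if Hij == 0 else (2 * Hij - Hi - Hj) / Hij

def inconsistency(Y_pred, Y_labelled):
    """Eq. (6): v(i) of one predicted category vector against the labelled set."""
    vals = []
    for Yj in Y_labelled:
        dH = np.mean(Y_pred != Yj)  # Eq. (4), normalised Hamming distance
        vals.append(1.0 if dH == 1.0 else entropy_distance(Y_pred, Yj))
    return float(np.mean(vals))

def cvirs_select(margins, Y_pred, Y_labelled):
    """Eq. (7): index of the unlabelled example maximising s(i) * v(i)."""
    s = borda_scores(margins)
    v = np.array([inconsistency(Yp, Y_labelled) for Yp in Y_pred])
    return int(np.argmax(s * v))
```

The double `argsort` yields each example's position in every τ_ℓ in O(q·n·log n), matching the efficient-sorting remark in the text, and the alternating-vector example from this section (dE of [010101] vs. [101010] is 0, with fu forced to 1 because dH = 1) falls out directly.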

If an efficient sorting algorithm is used, the complexity can be reduced to O(q · |Us| · log(|Us|)) steps. In addition, in order to reduce the computational cost of computing the q rankings, only a subset of Us could be considered.

Regarding the computational complexity of computing the inconsistency of a category vector predicted for an unlabelled example, O(q · |Ls|) steps are needed. Generally speaking, the CVIRS strategy requires O(max(|Us| · fts(Φ), q · |Us|²)) steps to determine the utility of an unlabelled example, since q · |Us|² >> q · |Ls|.

4. Experimental study

In this section, a description of the multi-label datasets used in the experiments, the evaluation of multi-label models and AL strategies, and the other settings used in the empirical study are given. Finally, the experimental results and the statistical analysis are presented.

4.1. Multi-label datasets

In the experiments, 18 real multi-label datasets were used². Multi-label datasets of different scales and from different application domains were included in order to analyse the performance of the multi-label AL strategies on datasets with different properties.

Dataset        n      d      q    ds    lc      ld
Flags          194    10     7    54    3.932   0.485
Emotions       593    72     6    27    1.869   0.311
Birds          645    260    19   133   1.014   0.053
Yeast          2417   103    14   198   4.237   0.303
Scene          2407   294    6    15    1.074   0.179
Cal500         502    68     174  502   26.044  0.150
Genbase        662    1186   27   32    1.252   0.046
Medical        978    1449   45   94    1.245   0.028
Enron          1702   1001   53   753   3.378   0.064
TMC2007-500    28596  500    22   1341  2.160   0.098
Corel5k        5000   499    374  3175  3.522   0.009
Corel16k       13811  500    161  4937  2.867   0.018
Bibtex         7395   1836   159  2856  2.402   0.015
Arts           7484   23146  26   599   1.654   0.064
Business       11214  21924  30   233   1.599   0.053
Entertainment  12730  32001  21   337   1.414   0.067
Recreation     12828  30324  22   530   1.429   0.065
Health         9205   30605  32   335   1.644   0.051

Table 3: Statistics of the benchmark datasets: number of examples (n), number of features (d), number of labels (q), different subsets of labels (ds), label cardinality (lc) and label density (ld). The datasets are ordered by their complexity, calculated as n × d × q.

Table 3 shows some statistics of the datasets. The label cardinality is the average number of labels per example. The datasets range from 194 up to 28,596 examples, from 10 up to 32,001 features, from 6 up to 374 labels, from 15 up to 4,937 different subsets of labels, from 1.014 up to 26.044 label cardinality, and from 0.009 up to 0.485 label density.

In the multi-label AL context, the dataset Corel5k was previously used in [39, 48, 51]. The datasets Emotions, Enron, Medical and Genbase were used in [51]. Scene and Yeast were previously used in [49, 51, 62]. The datasets Arts, Business, Entertainment and Health were used in [41].

4.2. Evaluation of multi-label models and active learning strategies

In this work, several evaluation measures were used to assess the multi-label classifiers induced by the AL process. The multi-label evaluation measures are divided into two categories [26]: label-based measures and example-based measures. The example-based measures are further categorised into ranking-based and bipartition-based measures.

The label-based measures used in this work were the Micro-Average F1-Measure (MiF1) and the Macro-Average F1-Measure (MaF1). The micro approach aggregates the true positive, true negative, false positive and false negative counts of all labels, and then calculates the F1-measure. The macro approach computes the F1-measure for each label and then averages the values over all labels. The MiF1 and MaF1 measures are defined as

    MiF1 = F1(Σ_{i=1}^{q} tp_i, Σ_{i=1}^{q} fp_i, Σ_{i=1}^{q} tn_i, Σ_{i=1}^{q} fn_i),    (8)

    MaF1 = (1/q) · Σ_{i=1}^{q} F1(tp_i, fp_i, tn_i, fn_i),    (9)

where the F1 function computes the F1-Measure given the true positive (tp), false positive (fp), true negative (tn) and false negative (fn) counts, and q represents the number of labels.

As for the multi-label classification task, a classifier Φ predicts, for a given test example i, the set of labels pi. Let ti be the actual label set of the example i. The bipartition-based measures used in this work were the Hamming Loss (HL) and the Example-based F1-Measure (F1Ex). HL averages the symmetric differences between the predicted and actual label sets, while F1Ex calculates the F1-Measure on all examples in the test set.
m
1 X | ti 4pi |
ample. The label density is the label cardinality divided HL =
m i=1 q
, (10)
by the total number of labels. The datasets vary in size: 1 Xm
2 | t i ∩ pi |
F1Ex = , (11)
m i=1 | ti | + | pi |

2 The datasets are available to download at where 4 denotes the symmetric difference between two
http://www.uco.es/grupos/kdis/kdiswiki/index.php/Resources sets, and m is the number of test examples.
7
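As a concrete illustration, the label-based and bipartition-based measures of Eqs. (8)–(11) can be sketched in plain Python. Representing label sets as Python sets of label indices is an illustrative choice here, not the paper's implementation:

```python
def f1(tp, fp, tn, fn):
    """F1 from confusion counts; taken as 0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def per_label_confusion(true_sets, pred_sets, q):
    """(tp, fp, tn, fn) for each of the q labels over all test examples."""
    counts = [[0, 0, 0, 0] for _ in range(q)]
    for t, p in zip(true_sets, pred_sets):
        for label in range(q):
            in_t, in_p = label in t, label in p
            if in_t and in_p:
                counts[label][0] += 1   # true positive
            elif in_p:
                counts[label][1] += 1   # false positive
            elif not in_t:
                counts[label][2] += 1   # true negative
            else:
                counts[label][3] += 1   # false negative
    return counts

def micro_f1(true_sets, pred_sets, q):      # Eq. (8): pool counts, then F1
    c = per_label_confusion(true_sets, pred_sets, q)
    tp, fp, tn, fn = (sum(col) for col in zip(*c))
    return f1(tp, fp, tn, fn)

def macro_f1(true_sets, pred_sets, q):      # Eq. (9): F1 per label, then average
    c = per_label_confusion(true_sets, pred_sets, q)
    return sum(f1(*row) for row in c) / q

def hamming_loss(true_sets, pred_sets, q):  # Eq. (10): symmetric difference
    m = len(true_sets)
    return sum(len(t ^ p) for t, p in zip(true_sets, pred_sets)) / (m * q)

def example_f1(true_sets, pred_sets):       # Eq. (11): F1 per example, averaged
    m = len(true_sets)
    return sum(2 * len(t & p) / (len(t) + len(p)) if (t or p) else 1.0
               for t, p in zip(true_sets, pred_sets)) / m
```

MiF1 and MaF1 coincide when every label has identical confusion counts; on imbalanced label spaces they can differ sharply, which is why both are reported.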
Regarding the label-ranking task, a classifier Φ provides, for a given test example i, a ranking of labels R_i, where R_i(ℓ) is the rank predicted for the label ℓ. The ranking-based measures used in this work were the Ranking Loss (RL), Average Precision (AP) and One Error (OE). RL averages the proportion of label pairs that are incorrectly ordered. AP averages how many times a particular label is ranked above another label which is in the actual label set. OE averages how many times the top-ranked label is not in the set of true labels of the example:

    R_L = \frac{1}{m} \sum_{i=1}^{m} \frac{|\{(\ell_a, \ell_b) : R_i(\ell_a) > R_i(\ell_b),\; (\ell_a, \ell_b) \in t_i \times \bar{t}_i\}|}{|t_i|\,|\bar{t}_i|},    (12)

    A_P = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|t_i|} \sum_{\ell \in t_i} \frac{|\{\ell' \in t_i : R_i(\ell') \le R_i(\ell)\}|}{R_i(\ell)},    (13)

    O_E = \frac{1}{m} \sum_{i=1}^{m} \delta\!\left( \operatorname{argmin}_{\ell \in L} R_i(\ell) \right), \qquad \delta(\ell) = \begin{cases} 1 & \ell \notin t_i \\ 0 & \text{otherwise}, \end{cases}    (14)

where \bar{t}_i denotes the complement of t_i in the label space L.

As for evaluating the effectiveness of AL strategies, AL methods are commonly assessed by visually comparing learning curves. A learning curve is constructed by plotting an evaluation measure as a function of the number of labelled examples in the labelled set. In a visual comparison, a strategy is superior to the alternatives if it dominates them for most of the points along their learning curves [30]. However, visually comparing several learning curves can be very confusing, as several intersections among the learning curves may occur.

In this work, in addition to a visual comparison, the Area Under the Learning Curve (AUC) was used to compare the AL strategies in a quantitative manner. To analyse and validate the results, several non-parametric statistical tests were used. Friedman's test [63] was performed to evaluate whether there were significant differences in the results. If Friedman's test indicated that the results were significantly different, the Shaffer post-hoc test [64] was used to perform all pairwise comparisons, as proposed in [65].

4.3. Experimental setting

In the experimental study, our proposal, CVIRS, was compared with the most relevant myopic AL strategies that query all the label assignments (see Section 2): BinMin [40], ML [39], MML [39], MMC [41], CMN [42], MMU [48] and LCI [48]. In addition, a Random strategy, which randomly chooses examples from the unlabelled set, was included in the comparison.

For the sake of fairness, the AL strategies were tested with the BR-SVM classifier, since most state-of-the-art multi-label AL strategies have been tested with BR-SVM as a base classifier. A linear kernel and a penalty parameter equal to 1.0 were used, as proposed in [41]. Logistic regression models were fitted to the outputs of the SVMs to obtain proper probability estimates.

All strategies were evaluated by 10-fold cross-validation on each dataset. For each fold execution, the iterative experimental protocol described in Algorithm 1 was adopted. 5% of the training set T_r was randomly selected to construct the labelled set L_s; therefore, the initial classifier was trained with few labelled examples. The non-selected examples of T_r were used to create the unlabelled set U_s. The maximum number of iterations β was set to 750. In each iteration, the effectiveness of the multi-label classifier Φ was tested by classifying the test set T_s. This experimental protocol is similar to previous experimental protocols used in [41, 42, 48, 51].

Algorithm 1: Experimental protocol.
  Input: T_r → training set of multi-label examples
         T_s → test set of multi-label examples
         Φ  → multi-label classifier
         γ  → multi-label AL strategy
         θ  → oracle for labelling unlabelled examples
         s  → number of sampling examples
         β  → maximum number of iterations
   1  begin
         // Construct the labelled and unlabelled sets
   2     L_s ← Resample(s, T_r);
   3     U_s ← T_r \ L_s;
         // Train Φ with L_s
   4     Φ ← Train(L_s, Φ);
   5     for iter ← 1 to β do
            // Select the most informative example from U_s
   6        i ← SelectInformativeExample(γ, Φ, U_s);
            // Label the selected example
   7        Label(θ, i);
            // Update the labelled and unlabelled sets
   8        L_s ← L_s ∪ {i};
   9        U_s ← U_s \ {i};
            // Train Φ with L_s
  10        Φ ← Train(L_s, Φ);
            // Evaluate Φ on T_s
  11        Test(T_s, Φ);
  12     end
  13  end

The AL strategies selected only one unlabelled example in each iteration, since none of the strategies considered in the experimental study is optimal for working in batch-mode AL scenarios. The labelling process was done in a simulated environment, since the label sets of the examples of U_s are actually known. All strategies were implemented with the JCLAL framework [66], a class library that allows an easy implementation of any AL method. The algorithms, as standalone runnable files, are available in order to facilitate the replicability of the
experiments³.

³ http://www.uco.es/grupos/kdis/kdiswiki/MLAL

4.4. Results and discussion

The study was divided into two parts. In the first one, a comparative study between the AL strategies for the multi-label classification task was conducted, whereas the second one focussed on the label ranking task.

4.4.1. Multi-label classification task

Figures 1-4 represent the learning curves of the AL strategies on the Emotions, Medical, Yeast and TMC2007-500 datasets. Each graph plots a function where the x-axis is the number of labelled examples and the y-axis is the value obtained by the multi-label classifier for a certain evaluation measure.

Figure 1 shows that the CVIRS strategy obtained the best outcomes for the MiF1 and MaF1 measures on the Emotions dataset. The CMN, ML, MML and MMC strategies performed better than the Random strategy. The LCI and MMU strategies showed a poor performance on this dataset.

Figure 2 shows that the MMC, ML and MML strategies obtained worse results than the Random strategy on the Medical dataset. The CVIRS, CMN, MMU and LCI strategies showed the best performance.

As for the HL measure, Figure 3 shows that the CVIRS, CMN, LCI and BinMin strategies outperformed the rest of the strategies on the Yeast dataset; their learning curves were under the learning curve of the Random strategy. The ML, MML and MMC strategies showed a poor performance. For the F1Ex measure, the CVIRS, LCI, BinMin and MMU strategies obtained the best performance.

Figure 4 shows that the CVIRS strategy obtained the best effectiveness on the TMC2007-500 dataset. For the F1Ex measure, the CMN, BinMin, ML, MML and MMC strategies showed a poor performance; their learning curves were dominated by the learning curve of the Random strategy. The MMU, ML, MML and MMC strategies had the lowest effectiveness at the MaF1 measure.

To compare the AL strategies in a quantitative manner, the AUC values were estimated and a statistical analysis was conducted. Tables 4-7 show the AUC results obtained by the nine strategies compared in the experimental study. In all cases, the best results are highlighted in bold typeface in the tables, "↓" indicates "the smaller the better", and "↑" indicates "the larger the better". In the tables, the last two rows show the average rank (Avg. Rank) and the ranking position (Pos.) for each strategy according to Friedman's test.

As for the label-based measures (MiF1 and MaF1), the CVIRS strategy generally showed a good performance on the 18 multi-label datasets. The MMU and LCI strategies were effective on the Medical, Enron and Yeast datasets. The CMN strategy obtained good results on the Birds, Genbase, Medical and Enron datasets. The BinMin and ML strategies performed well on the Corel5k, Corel16k, Arts, Entertainment, Recreation and Health datasets.

As for the bipartition-based measures (HL and F1Ex), the CVIRS strategy generally showed a good effectiveness on the 18 multi-label datasets. The MMU and LCI strategies were effective on the Medical and Yeast datasets. The CMN strategy performed well on the Emotions, Birds, Genbase, Medical and Enron datasets. The BinMin strategy obtained good results on the Flags, Yeast, Corel5k, Corel16k and Entertainment datasets.

According to the average rankings computed by Friedman's test, the three AL strategies that obtained the best performance for the label-based and bipartition-based measures were CVIRS, CMN and BinMin, in this order. The MMC and Random strategies had the worst outcomes.

Friedman's test rejected all null hypotheses at a significance level α=0.05. Therefore, we can conclude that there were significant differences between the observed AUC values in the bipartition-based and label-based measures considered. Afterwards, a Shaffer post-hoc test for all pairwise comparisons was carried out. In the statistical analysis, the adjusted p-values [67] were considered. The adjusted p-values take into account the fact that multiple tests are conducted, and they can be directly compared with any significance level [65].

The multiple comparisons are illustrated as a directed graph. An edge γ1 → γ2 shows that strategy γ1 outperforms strategy γ2. Each edge is labelled with the evaluation measures for which γ1 outperformed γ2. The adjusted p-values of the Shaffer's test are shown in parentheses. Figure 5 shows the results of the Shaffer's test for the bipartition-based and label-based measures.

From a statistical point of view, our proposal, CVIRS, significantly outperformed the Random, ML, MML, MMC, LCI and MMU strategies on all label-based and bipartition-based measures.

As for the MiF1 measure, the BinMin, CMN and LCI strategies significantly outperformed the Random strategy. The CMN and BinMin strategies performed better than the MMC and MML strategies. Shaffer's test did not detect significant differences between Random,
MMC, ML, MML and MMU strategies at the significance level considered.

Figure 1: The performance of the AL strategies on the Emotions dataset. (a) The performance at the MiF1 measure; (b) the performance at the MaF1 measure.

Figure 2: The performance of the AL strategies on the Medical dataset. (a) The performance at the F1Ex measure; (b) the performance at the MiF1 measure.

Regarding the MaF1 measure, the BinMin and CMN strategies significantly outperformed the Random strategy. Furthermore, the CMN strategy performed better than the ML and MMC strategies. Shaffer's test did not detect significant differences between the ML, MML, MMC, MMU, LCI and Random strategies at the significance level considered.

Figure 3: The performance of the AL strategies on the Yeast dataset. (a) The performance at the HL measure; (b) the performance at the F1Ex measure.

Figure 4: The performance of the AL strategies on the TMC2007-500 dataset. (a) The performance at the F1Ex measure; (b) the performance at the MaF1 measure.

As for the HL measure, the CMN and BinMin strategies significantly outperformed the Random strategy. Shaffer's test did not detect significant differences between the ML, MML, MMC, MMU, LCI and Random strategies at the significance level considered.

With regard to the F1Ex measure, CMN and BinMin performed better than the MML, MMC and Random strategies.
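The AUC values in Tables 4-7 summarise each learning curve as a single number. The paper does not detail its estimator; a common choice, assumed here, is the trapezoidal rule over the curve's points, normalised by the range of the x-axis so that curves measured over the same labelling budget remain comparable:

```python
def learning_curve_auc(num_labelled, scores):
    """Trapezoidal area under a learning curve, normalised by the x-range.
    num_labelled: labelled-set sizes (increasing); scores: measure values."""
    points = list(zip(num_labelled, scores))
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area / (num_labelled[-1] - num_labelled[0])
```

A flat curve at value v yields a normalised AUC of exactly v, so the AUC stays on the same scale as the evaluation measure itself.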
         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS
Flags 0.541 0.691 0.668 0.671 0.671 0.683 0.688 0.681 0.692
Emotions 0.616 0.621 0.640 0.643 0.644 0.658 0.601 0.607 0.659
Birds 0.265 0.333 0.384 0.385 0.387 0.412 0.326 0.396 0.415
Genbase 0.945 0.949 0.952 0.946 0.923 0.956 0.921 0.940 0.963
Cal500 0.330 0.336 0.331 0.330 0.332 0.332 0.329 0.328 0.346
Medical 0.648 0.648 0.570 0.556 0.609 0.665 0.665 0.665 0.667
Yeast 0.575 0.630 0.618 0.608 0.616 0.640 0.780 0.784 0.658
Scene 0.630 0.634 0.618 0.608 0.616 0.640 0.642 0.630 0.643
Enron 0.420 0.436 0.372 0.378 0.384 0.457 0.447 0.450 0.464
Corel5k 0.101 0.168 0.126 0.128 0.120 0.158 0.154 0.157 0.160
Corel16k 0.099 0.161 0.145 0.146 0.149 0.152 0.155 0.154 0.158
TMC2007-500 0.598 0.608 0.589 0.584 0.584 0.608 0.597 0.600 0.620
Bibtex 0.203 0.299 0.274 0.286 0.289 0.312 0.298 0.314 0.321
Arts 0.200 0.266 0.260 0.262 0.259 0.265 0.249 0.260 0.264
Business 0.305 0.366 0.476 0.391 0.375 0.387 0.411 0.422 0.436
Entertainment 0.259 0.343 0.323 0.304 0.298 0.332 0.334 0.333 0.350
Recreation 0.199 0.268 0.265 0.264 0.258 0.268 0.261 0.255 0.273

Health 0.301 0.359 0.347 0.332 0.315 0.347 0.357 0.341 0.371
Avg. Rank 7.806 3.583 6.000 6.528 6.639 3.278 5.056 4.722 1.389
Pos. 9 3 6 7 8 2 5 4 1

Table 4: The AUC results at the MiF1 (↑) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 6.121E-11.

         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS

Flags 0.569 0.583 0.572 0.576 0.562 0.592 0.575 0.567 0.588
Emotions 0.517 0.520 0.608 0.636 0.636 0.642 0.495 0.498 0.654
Birds 0.304 0.255 0.309 0.310 0.311 0.332 0.239 0.311 0.330
Genbase 0.751 0.785 0.806 0.753 0.699 0.794 0.735 0.788 0.785
Cal500 0.161 0.156 0.154 0.154 0.151 0.162 0.156 0.146 0.170
Medical 0.352 0.348 0.310 0.312 0.317 0.376 0.370 0.369 0.383
Yeast 0.385 0.413 0.416 0.408 0.396 0.393 0.398 0.396 0.400
Scene 0.645 0.640 0.624 0.612 0.628 0.650 0.647 0.634 0.651

Enron 0.152 0.173 0.147 0.152 0.154 0.171 0.170 0.166 0.185
Corel5k 0.274 0.315 0.303 0.310 0.300 0.321 0.300 0.309 0.314
Corel16k 0.033 0.059 0.048 0.054 0.051 0.062 0.060 0.061 0.065
TMC2007-500 0.485 0.497 0.479 0.473 0.467 0.500 0.476 0.487 0.521
Bibtex 0.111 0.145 0.149 0.154 0.152 0.152 0.150 0.151 0.156
Arts 0.132 0.171 0.147 0.148 0.147 0.167 0.155 0.159 0.170
Business 0.135 0.158 0.159 0.161 0.158 0.158 0.148 0.149 0.170
Entertainment 0.154 0.200 0.191 0.195 0.187 0.197 0.190 0.194 0.201
Recreation 0.142 0.209 0.207 0.205 0.204 0.198 0.197 0.190 0.218
Health 0.123 0.188 0.171 0.169 0.155 0.174 0.188 0.185 0.194
Avg. Rank 7.528 3.972 5.778 5.306 6.667 2.972 5.750 5.389 1.639
Pos. 9 3 7 4 8 2 6 5 1

Table 5: The AUC results at the MaF1 (↑) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 1.029E-10.

         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS
Flags 0.365 0.301 0.313 0.311 0.313 0.304 0.304 0.310 0.294
Emotions 0.234 0.235 0.228 0.226 0.224 0.222 0.244 0.241 0.221
Birds 0.200 0.117 0.091 0.091 0.089 0.078 0.117 0.085 0.083
Genbase 0.006 0.005 0.004 0.005 0.007 0.004 0.008 0.006 0.003
Cal500 0.196 0.197 0.200 0.201 0.196 0.199 0.199 0.190 0.185
Medical 0.019 0.018 0.021 0.023 0.020 0.018 0.018 0.019 0.018

Yeast 0.251 0.247 0.266 0.272 0.281 0.248 0.250 0.248 0.243
Scene 0.145 0.136 0.142 0.141 0.144 0.140 0.142 0.139 0.137
Enron 0.086 0.082 0.095 0.095 0.091 0.076 0.085 0.080 0.078
Corel5k 0.045 0.017 0.023 0.022 0.020 0.017 0.019 0.019 0.017
Corel16k 0.052 0.036 0.039 0.040 0.042 0.036 0.037 0.042 0.035
TMC2007-500 0.084 0.078 0.084 0.085 0.085 0.079 0.087 0.082 0.078
Bibtex 0.029 0.017 0.021 0.020 0.023 0.017 0.021 0.019 0.014
Arts 0.295 0.180 0.140 0.138 0.139 0.198 0.213 0.158 0.205

Business 0.188 0.115 0.084 0.098 0.110 0.105 0.099 0.102 0.094
Entertainment 0.245 0.170 0.168 0.172 0.194 0.182 0.177 0.178 0.165
Recreation 0.289 0.239 0.210 0.225 0.234 0.231 0.221 0.233 0.213
Health 0.187 0.128 0.112 0.124 0.129 0.133 0.154 0.161 0.119
Avg. Rank 7.639 4.000 5.083 5.556 6.389 3.667 5.806 5.028 1.833
Pos. 9 3 5 6 8 2 7 4 1

Table 6: The AUC results at the HL (↓) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 5.829E-9.

         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS
Flags 0.601 0.677 0.644 0.647 0.648 0.663 0.662 0.657 0.674
Emotions 0.555 0.563 0.587 0.592 0.587 0.615 0.539 0.547 0.616
Birds 0.488 0.522 0.518 0.520 0.520 0.605 0.513 0.576 0.590

Genbase 0.955 0.958 0.958 0.953 0.930 0.958 0.941 0.950 0.960
Cal500 0.335 0.336 0.331 0.330 0.332 0.330 0.328 0.327 0.345
Medical 0.625 0.616 0.521 0.510 0.575 0.639 0.640 0.645 0.650
Yeast 0.553 0.565 0.558 0.547 0.533 0.554 0.565 0.566 0.568
Scene 0.599 0.594 0.573 0.554 0.578 0.610 0.614 0.597 0.634
Enron 0.424 0.432 0.377 0.382 0.388 0.455 0.440 0.450 0.465
Corel5k 0.099 0.158 0.124 0.123 0.113 0.146 0.130 0.147 0.158
Corel16k 0.086 0.147 0.135 0.139 0.137 0.139 0.140 0.141 0.144
TMC2007-500 0.600 0.589 0.574 0.570 0.567 0.592 0.605 0.607 0.622
Bibtex 0.203 0.271 0.268 0.271 0.269 0.290 0.274 0.283 0.295
Arts 0.198 0.272 0.255 0.256 0.253 0.273 0.254 0.272 0.275
Business 0.374 0.393 0.540 0.421 0.458 0.438 0.477 0.485 0.498
Entertainment 0.296 0.366 0.340 0.322 0.301 0.357 0.354 0.358 0.362
Recreation 0.221 0.281 0.274 0.270 0.268 0.281 0.284 0.286 0.290
Health 0.332 0.373 0.358 0.331 0.320 0.361 0.374 0.371 0.386
Avg. Rank 7.444 3.778 6.139 6.722 7.056 3.639 4.833 4.083 1.306
Pos. 9 3 6 7 8 2 5 4 1

Table 7: The AUC results at the F1Ex (↑) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 4.523E-11.

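The Avg. Rank rows of Tables 4-7 and the Friedman statistic behind the reported p-values can be reproduced with a short sketch. Over n datasets and k strategies, each dataset assigns fractional ranks (1 = best, ties averaged) and the chi-square statistic is computed from the rank sums; the helper names below are ours, not the paper's:

```python
def average_ranks(results, larger_is_better=True):
    """Per-dataset fractional ranks (1 = best, ties averaged), then averaged
    per strategy. `results` is one score tuple per dataset, one entry per
    strategy. Set larger_is_better=False for loss-type measures (HL, RL, OE)."""
    k = len(results[0])
    totals = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: row[j], reverse=larger_is_better)
        r = 0
        while r < k:
            s = r
            # extend the current tie group
            while r + 1 < k and row[order[r + 1]] == row[order[s]]:
                r += 1
            avg = (s + r) / 2 + 1          # average rank of the tied group
            for j in order[s:r + 1]:
                totals[j] += avg
            r += 1
    return [t / len(results) for t in totals]

def friedman_statistic(results, larger_is_better=True):
    """Friedman chi-square over n datasets and k strategies:
    chi2_F = 12/(n*k*(k+1)) * sum(R_j^2) - 3*n*(k+1), R_j = rank sum."""
    n, k = len(results), len(results[0])
    rank_sums = [r * n for r in average_ranks(results, larger_is_better)]
    return 12.0 / (n * k * (k + 1)) * sum(x * x for x in rank_sums) - 3.0 * n * (k + 1)
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom (or Iman–Davenport's F correction) to obtain p-values like those reported in the table captions.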

Figure 5: Significant differences between the AL strategies according to the Shaffer's test at the significance level α=0.05.

Furthermore, the LCI strategy outperformed the Random and MMC strategies. Shaffer's test did not detect significant differences between the ML, MML, MMC, MMU and Random strategies at the significance level considered.

4.4.2. Label ranking task

Figures 6-9 represent the learning curves of the AL strategies on the Emotions, Cal500, Medical and Yeast datasets. As for the AP and RL measures, Figure 6 shows that the CVIRS and CMN strategies obtained the best results on the Emotions dataset. The MMU strategy had worse results than the Random strategy.

Figure 7 shows that the CVIRS strategy had the best performance at the AP and RL measures on the Cal500 dataset. The CMN, ML, MML and MMU strategies showed a poor performance on this dataset; their learning curves were dominated by the learning curve of the Random strategy.

Figure 8 shows that the CVIRS and LCI strategies obtained the best effectiveness at the AP and RL measures on the Medical dataset. The ML, MML, BinMin and MMC strategies showed a poor performance.

Figure 9 shows that the CVIRS, BinMin, LCI and CMN strategies obtained the best results at the AP and RL measures on the Yeast dataset. The MML and MMC strategies had a lower effectiveness.

Figure 6: The performance of the multi-label AL strategies on the Emotions dataset. (a) The performance at the AP measure; (b) the performance at the RL measure.

Figure 7: The performance of the multi-label AL strategies on the Cal500 dataset. (a) The performance at the AP measure; (b) the performance at the RL measure.

Tables 8-10 show the AUC results obtained by the AL strategies at the ranking-based measures (OE, RL and AP). In all cases, the best results are highlighted in bold typeface in the tables. The last two rows of each table show the average rank (Avg. Rank) and the ranking position (Pos.) for each strategy according to Friedman's test.

As for the ranking-based measures, the CVIRS strategy generally performed well on the 18 multi-label datasets. The BinMin strategy obtained good results on the Flags, Corel16k and TMC2007-500 datasets. The LCI strategy had a good performance on the Arts dataset. The CMN strategy showed a good effectiveness on the Birds and Emotions datasets.

According to the average rankings computed by Friedman's test, for the OE measure the strategies that obtained the best results were CVIRS, LCI and CMN, in this order. With regard to the RL measure, the strategy that had the best performance was CVIRS. As for the AP measure, the strategies that obtained the best results were CVIRS, BinMin and CMN, in this order.

Friedman's test rejected all null hypotheses at a significance level α=0.05. Thus, we can conclude that there were significant differences between the observed AUC values at the ranking-based measures considered. Afterwards, a Shaffer post-hoc test for all pairwise comparisons was conducted. Figure 10 shows the results of the Shaffer's test.

Regarding the OE measure, the CVIRS strategy significantly outperformed the Random, ML, MML, MMC, MMU and BinMin strategies. Furthermore, the LCI strategy performed better than the Random strategy. The statistical test did not detect significant differences between the Random, MMC, ML, MML, MMU, BinMin and CMN strategies at the significance level considered.

As for the RL measure, the CVIRS strategy significantly outperformed all strategies. The BinMin strategy outperformed the Random strategy. The Shaffer's test did not detect significant differences between the Random, ML, MML, MMC, MMU, CMN and LCI strategies at the significance level considered.

With regard to the AP measure, the CVIRS strategy significantly outperformed the Random, MMU, LCI, ML, MML and MMC strategies. The BinMin and CMN strategies performed better than the Random strategy. Significant differences between the CVIRS, BinMin and CMN strategies were not detected. Shaffer's test did not detect significant differences between the Random, MMU, LCI, ML, MML and MMC strategies at the significance level considered.

4.4.3. Discussion

The experimental study aimed to compare our proposal with several state-of-the-art AL strategies for two multi-label learning tasks over many multi-label
datasets. The evidence suggested that our proposal, CVIRS, was effective on the two tasks analysed: the multi-label classification and label ranking tasks. Analysing the average rankings returned by Friedman's test, the CVIRS, CMN and BinMin strategies had the best results for the multi-label classification task. As for the label ranking task, the strategies that obtained the best effectiveness were CVIRS, BinMin, CMN and LCI.

Figure 8: The performance of the multi-label AL strategies on the Medical dataset. (a) The performance at the AP measure; (b) the performance at the RL measure.

Figure 9: The performance of the multi-label AL strategies on the Yeast dataset. (a) The performance at the AP measure; (b) the performance at the RL measure.
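For reference, the ranking-based measures of Eqs. (12)-(14), which underlie Tables 8-10, can be sketched as follows. Each predicted ranking is represented here as a dict mapping a label to its rank, with rank 1 being the top position (an illustrative representation, not the paper's implementation):

```python
def ranking_loss(true_sets, rankings, labels):   # Eq. (12)
    """Average fraction of (relevant, irrelevant) label pairs mis-ordered."""
    total = 0.0
    for t, R in zip(true_sets, rankings):
        t_bar = set(labels) - t
        bad = sum(1 for a in t for b in t_bar if R[a] > R[b])
        total += bad / (len(t) * len(t_bar))
    return total / len(true_sets)

def average_precision(true_sets, rankings):      # Eq. (13)
    """How well the relevant labels concentrate at the top of the ranking."""
    total = 0.0
    for t, R in zip(true_sets, rankings):
        total += sum(sum(1 for l2 in t if R[l2] <= R[l]) / R[l]
                     for l in t) / len(t)
    return total / len(true_sets)

def one_error(true_sets, rankings):              # Eq. (14)
    """Fraction of examples whose top-ranked label is not a true label."""
    return sum(1 for t, R in zip(true_sets, rankings)
               if min(R, key=R.get) not in t) / len(true_sets)
```

Note that RL and OE are "the smaller the better" while AP is "the larger the better", matching the ↓/↑ marks used in the tables.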
         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS
Flags 0.332 0.252 0.347 0.344 0.272 0.274 0.239 0.276 0.238
Emotions 0.324 0.322 0.305 0.305 0.299 0.290 0.321 0.322 0.276
Birds 0.901 0.853 0.785 0.784 0.782 0.776 0.845 0.801 0.775
Genbase 0.037 0.022 0.033 0.050 0.065 0.039 0.040 0.045 0.029
Cal500 0.800 0.847 0.845 0.848 0.824 0.840 0.808 0.780 0.769
Medical 0.303 0.320 0.416 0.432 0.356 0.299 0.299 0.294 0.283
Yeast 0.382 0.342 0.392 0.406 0.456 0.351 0.372 0.358 0.326
Scene 0.328 0.332 0.349 0.357 0.345 0.336 0.328 0.335 0.321
Enron 0.699 0.618 0.791 0.781 0.768 0.640 0.650 0.687 0.618
Corel5k 0.930 0.851 0.928 0.913 0.915 0.849 0.866 0.874 0.852
Corel16k 0.902 0.840 0.848 0.846 0.851 0.848 0.844 0.842 0.838
TMC2007-500 0.339 0.313 0.350 0.360 0.361 0.321 0.366 0.349 0.308
Bibtex 0.621 0.586 0.615 0.609 0.612 0.566 0.588 0.571 0.549
Arts 0.841 0.760 0.739 0.736 0.739 0.756 0.780 0.736 0.771
Business 0.799 0.726 0.500 0.654 0.700 0.679 0.659 0.675 0.641
Entertainment 0.842 0.738 0.754 0.775 0.788 0.750 0.723 0.735 0.703
Recreation 0.823 0.773 0.757 0.787 0.777 0.781 0.788 0.772 0.747

Health 0.789 0.764 0.697 0.674 0.642 0.773 0.642 0.640 0.634
Avg. Rank 7.083 4.444 5.972 6.389 6.167 4.333 4.861 4.167 1.583
Pos. 9 4 6 8 7 3 5 2 1

Table 8: The AUC results at the OE (↓) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 1.601E-8.

         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS

Flags 0.302 0.261 0.293 0.282 0.279 0.288 0.266 0.289 0.266
Emotions 0.201 0.206 0.194 0.192 0.190 0.184 0.211 0.207 0.184
Birds 0.198 0.158 0.132 0.132 0.134 0.128 0.156 0.133 0.126
Genbase 0.009 0.006 0.008 0.008 0.029 0.014 0.008 0.007 0.005
Cal500 0.249 0.250 0.254 0.255 0.251 0.256 0.253 0.248 0.234
Medical 0.089 0.118 0.171 0.183 0.113 0.089 0.087 0.085 0.083
Yeast 0.229 0.220 0.231 0.237 0.254 0.220 0.226 0.221 0.216
Scene 0.131 0.137 0.147 0.154 0.145 0.133 0.127 0.136 0.123

Enron 0.189 0.225 0.188 0.185 0.183 0.214 0.192 0.218 0.174
Corel5k 0.501 0.425 0.377 0.372 0.374 0.457 0.431 0.426 0.406
Corel16k 0.399 0.346 0.351 0.348 0.350 0.345 0.349 0.356 0.341
TMC2007-500 0.088 0.078 0.084 0.085 0.085 0.078 0.079 0.077 0.077
Bibtex 0.301 0.268 0.273 0.274 0.270 0.261 0.264 0.267 0.248
Arts 0.365 0.262 0.285 0.283 0.300 0.259 0.262 0.254 0.250
Business 0.299 0.280 0.166 0.190 0.296 0.284 0.254 0.231 0.226
Entertainment 0.350 0.240 0.224 0.221 0.249 0.261 0.257 0.264 0.225
Recreation 0.311 0.248 0.258 0.267 0.274 0.258 0.254 0.255 0.240
Health 0.287 0.221 0.241 0.239 0.230 0.238 0.233 0.236 0.210
Avg. Rank 7.417 4.528 5.667 5.444 5.806 4.917 4.889 4.806 1.528
Pos. 9 2 7 6 8 5 4 3 1

Table 9: The AUC results at the RL (↓) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 1.732E-7.

         Multi-label AL strategy
Dataset  Random  BinMin  ML  MML  MMC  CMN  MMU  LCI  CVIRS


Flags 0.685 0.792 0.757 0.761 0.775 0.774 0.778 0.771 0.787
Emotions 0.759 0.768 0.778 0.778 0.782 0.789 0.758 0.765 0.790
Birds 0.399 0.418 0.507 0.507 0.507 0.514 0.424 0.492 0.519
Genbase 0.960 0.980 0.975 0.966 0.943 0.968 0.968 0.960 0.978
Cal500 0.345 0.340 0.332 0.331 0.340 0.332 0.338 0.349 0.367
Medical 0.757 0.725 0.642 0.628 0.704 0.754 0.755 0.764 0.775

Yeast 0.681 0.695 0.677 0.664 0.644 0.692 0.680 0.690 0.699
Scene 0.790 0.792 0.780 0.775 0.782 0.792 0.807 0.789 0.804
Enron 0.448 0.453 0.387 0.392 0.401 0.474 0.447 0.450 0.479
Corel5k 0.113 0.163 0.124 0.122 0.129 0.152 0.134 0.138 0.150
Corel16k 0.166 0.183 0.168 0.170 0.175 0.178 0.175 0.174 0.184
TMC2007-500 0.726 0.740 0.720 0.714 0.714 0.738 0.717 0.722 0.744
Bibtex 0.301 0.356 0.337 0.341 0.354 0.375 0.377 0.374 0.387
Arts 0.296 0.392 0.390 0.394 0.385 0.394 0.376 0.408 0.386

Business 0.444 0.436 0.618 0.592 0.302 0.462 0.498 0.488 0.525
Entertainment 0.364 0.439 0.431 0.395 0.374 0.424 0.428 0.439 0.460
Recreation 0.302 0.398 0.405 0.390 0.378 0.388 0.411 0.401 0.414
Health 0.356 0.412 0.443 0.421 0.400 0.400 0.402 0.410 0.425
Avg. Rank 7.250 3.806 5.667 6.417 6.611 4.139 4.833 4.556 1.722
Pos. 9 2 6 7 8 3 5 4 1

Table 10: The AUC results at the AP (↑) measure. Friedman’s test rejected the null hypothesis with a p-value equal to 3.137E-9.

It is worth noting that, although Shaffer's test did not detect significant differences between the CVIRS, CMN and BinMin strategies in some of the evaluation measures considered, the CVIRS strategy was ranked first in all the average rankings computed by Friedman's test.

Generally speaking, the CVIRS strategy performed well on multi-label datasets with diverse characteristics. The evidence suggested that CVIRS obtained better results on multi-label datasets that have a small number of labels (e.g. the Emotions, Birds, Yeast, Arts, Business, Entertainment, Health and Recreation datasets) than on datasets that have a large number of labels (e.g. the Cal500, Corel5k and Bibtex datasets). In datasets with a large number of labels, the performance of CVIRS can be affected by the positional method used to resolve the rank aggregation problem formulated in this work.

The evidence indicated that the CVIRS strategy performed well using BR-SVM as its base classifier under the experimental settings of this work. However, for future research, it would be important to test our proposal with other experimental settings. It is also important to study the effectiveness of our strategy with base classifiers that do not follow the BR approach, e.g.
multi-label classifiers belonging to the algorithm adaptation category.

Figure 10: Significant differences between the AL strategies according to the Shaffer's test at the significance level α=0.05.

5. Conclusions

In this work, an effective AL strategy for working on multi-label data, named CVIRS, was proposed. CVIRS selects the most informative examples by combining two measures. The first measure computes the unified uncertainty of an unlabelled example based on the base classifier's predictions. In order to aggregate the information across all labels, a rank aggregation problem was defined, and a simple rank aggregation method was used to resolve it. The second measure computes the inconsistency of a predicted label set by taking into account the information about the label space of the labelled set. The CVIRS strategy can be used with any base classifier which can obtain proper probability estimates from its outputs. CVIRS is not restricted to base classifiers that use problem transformation methods; it can also be used with multi-label learning algorithms belonging to the algorithm adaptation category.

An extensive comparison of several AL strategies was conducted over 18 multi-label datasets, showing that the CVIRS strategy is competitive with respect to the state-of-the-art multi-label AL strategies. The CVIRS strategy was effective on multi-label datasets with diverse characteristics. It also performed well on the two tasks analysed: the multi-label classification and label ranking tasks.

The evidence suggested that the uncertainty measure based on the rank aggregation problem is a good approximation for computing the unified uncertainty of an unlabelled example. On the other hand, the results demonstrated the benefits of combining an uncertainty measure with another measure that takes into account the label space information.

Future research will study more effective approaches to resolve the rank aggregation problem formulated in this work. In addition, we will study multi-label AL strategies for working in batch-mode scenarios, an area where few studies have been carried out. It would also be useful to analyse the effectiveness of our approach using incremental learning algorithms to speed up the updating of the base classifiers in each iteration.

Acknowledgements

This research was supported by the Spanish Ministry of Economy and Competitiveness, project TIN-2014-55252-P, and by FEDER funds.

References

[1] O. Reyes, C. Morell, S. Ventura, Scalable extensions of the ReliefF algorithm for weighting and selecting features on the multi-label learning context, Neurocomputing 161 (2015) 168–182.
[2] M. A. Tahir, J. Kittler, A. Bouridane, Multi-label classification using stacked spectral kernel discriminant analysis, Neurocomputing 171 (2015) 127–137.
[3] H. Liu, X. Wu, S. Zhang, Neighbor selection for multilabel classification, Neurocomputing 182 (2015) 187–196.
[4] Z. Chen, Z. Hao, A unified multi-label classification framework with supervised low-dimensional embedding, Neurocomputing 171 (2016) 1563–1575.
[5] J. Li, Y. Rao, F. Jin, H. Chen, X. Xiang, Multi-label maximum entropy model for social emotion classification over short text, Neurocomputing 210 (2016) 247–256.
[6] X. Jia, F. Sun, H. Li, Y. Cao, X. Zhang, Image multi-label annotation based on supervised nonnegative matrix factorization with new matching measurement, Neurocomputing 219 (2017) 518–525.
[7] A. McCallum, Multi-label text classification with a mixture model trained by EM, in: Working Notes of the AAAI'99 Workshop on Text Learning, 1999, pp. 1–7.
[26] … Knowledge Discovery Handbook, 2nd Edition, Springer-Verlag, New York, USA, 2010, Ch. Mining Multi-label Data, pp. 667–686.
[27] G. Madjarov, D. Kocev, D. Gjorgjevikj, An extensive experi-
[8] A. Srivastava, B. Zane-Ulman, Discovering recurring anomalies mental comparison of methods for multi-label learning, Pattern
in text reports regarding complex space systems, in: Proceed- Recognition 45 (2012) 3084–3104.
ings of the Aerospace Conference, IEEE, 2005, pp. 55–63. [28] E. Gibaja, S. Ventura, Multi-label learning: a review of the
[9] I. Katakis, G. Tsoumakas, I. Vlahavas, Multilabel text classi- state of the art and ongoing research, WIREs Data Mining and
fication for automated tag suggestion, in: Proceedings of the Knowledge Discovery 4 (2014) 411–444.
ECML/PKDD Discovery Challenge, Vol. 75, 2008, pp. 75–83. [29] X. Zhu, A. B. Goldberg, Introduction to Semi-Supervised
[10] T. Li, M. Ogihara, Detecting emotion in music, in: Proceedings Learning, Morgan & Claypool Publishers, 2009.
of the International Symposium on Music Information Retrieval, [30] B. Settles, Active Learning, 1st Edition, Synthesis Lectures on
Vol. 3, Washington DC., USA, 2003, pp. 239–240. Artificial Intelligence and Machine Learning, Morgan & Clay-

T
[11] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, pool Publishers, 2012.
M. I. Jordan, Matching words and pictures, Journal of Machine [31] D. Lewis, W. Gale, A sequential algorithm for training text clas-

IP
Learning Research 3 (2003) 1107–1135. sifiers, in: Proceedings of the 17th annual international ACM
[12] S. Yang, S. Kim, Y. Ro, Semantic home photo categorization, SIGIR conference on Research and development in informa-
IEEE Transactions on Circuits and Systems for Video Technol- tion retrieval, Springer-Verlag New York, Inc., New York, USA,
ogy 17 (2007) 324–335. 1994, pp. 3–12.

CR
[13] E. Correa, A. Plastino, A. Freitas, A Genetic Algorithm for Op- [32] J. Fu, S. Lee, Certainty-based active learning for sampling im-
timizing the Label Ordering in Multi-Label Classifier Chains, balanced datasets, Neurocomputing 119 (2013) 350–358.
in: Proceedings of the 25th International Conference on Tools [33] W. Wu, Y. Liu, M. Guo, C. Wang, X. Liu, A probabilistic model
with Artificial Intelligence, IEEE, 2013, pp. 469–476. of active learning with multiple noisy oracles, Neurocomputing
[14] Z. Y. He, C. Chen, J. J. Bu, P. Li, D. Cai, Multi-view based 118 (2013) 253–262.
multi-label propagation for image annotation, Neurocomputing
168 (2015) 853–860.
[15] M. Boutell, J. Luo, X. Shen, C. Brown, Learning multi-label
scene classification, Pattern Recognition 37 (9) (2004) 1757–
1771.
US [34]

[35]
S. Jones, L. Shao, K. Du, Active learning for human action re-
trieval using query pool selection, Neurocomputing 124 (2014)
89–96.
X. Zhang, S. Wang, X. Zhu, X. Yun, G. Wu, Y. Wang, Update vs.
upgrade: Modeling with indeterminate multi-class active learn-
AN
[16] D. Turnbull, L. Barrington, D. Torres, G. Lanckriet, Seman- ing, Neurocomputing 162 (2015) 163–170.
tic annotation and retrieval of music and sound effects, IEEE [36] J. Zhou, S. Sun, Gaussian process versus margin sampling active
Transactions on Audio, Speech, and Language Processing 16 (2) learning, Neurocomputing 167 (2015) 122–131.
(2008) 467–476. [37] H. Yu, C. Sun, W. Yang, X. Yang, X. Zuo, AL-ELM:
[17] J.Wang, Y. Zhao, X. Wu, X. Hua, A transductive multi-label One uncertainty-based active learning algorithm using extreme
M

learning approach for video concept detection, Pattern Recogni- learning machine, Neurocomputing 166 (2015) 140–150.
tion 44 (2010) 2274–2286. [38] S. Huang, R. Jin, Z. Zhou, Active learning by querying informa-
[18] A. Elisseeff, J. Weston, A kernel method for multi-labelled clas- tive and representative examples, IEEE Transactions on Pattern
sification, in: Advances in Neural Information Processing Sys- Analysis and Machine Intelligence 36 (10) (2014) 1936–1949.
tems, Vol. 14, MIT Press, 2001, pp. 681–687. [39] X. Li, L. Wang, E. Sung, Multi-label SVM active learning for
ED

[19] S. Diplarisa, G. Tsoumakas, P. Mitkas, I. Vlahavas, Protein clas- image classification, in: Proceedings of the International Con-
sification with multiple algorithms, in: Proceedings of the 10th ference on Image processing, Vol. 4, IEEE, 2004, pp. 2207–
Panhellenic Conference on Informatics, Springer Berlin Heidel- 2210.
berg, 2005, pp. 448–456. [40] K. Brinker, From Data and Information Analysis to Knowledge
[20] M. L. Zhang, Z. H. Zhou, Multi-label neural networks with ap- Engineering, Springer Berlin Heidelberg, 2006, Ch. On Active
PT

plications to functional genomics and text categorization, IEEE Learning in Multi-label Classification, pp. 206–213.
Transactions on knowledge and Data Engineering 18 (2006) [41] B. Yang, J. Sun, T. Wang, Z. Chen, Effective multi-label ac-
1338–1351. tive learning for text classification, in: Proceedings of the 15th
[21] N. Cesa-Bianchi, G. Valentini, Hierarchical cost-sensitive algo- ACM SIGKDD International Conference on Knowledge Dis-
CE

rithms for genome-wide gene function prediction, Journal of covery and Data Mining, ACM, Paris, France, 2009, pp. 917–
Machine Learning Research 8 (2010) 14–29. 926.
[22] F. Otero, A. Freitas, C. Johnson, A hierarchical multi-label clas- [42] A. Esuli, F. Sebastiani, Active Learning Strategies for Multi-
sification ant colony algorithm for protein function prediction, Label Text Classification, in: Advances in Information Re-
Memetic Computing 2 (3) (2010) 165–181. trieval, Springer Berlin Heidelberg, Toulouse, France, 2009, pp.
AC

[23] M. G. Larese, P. Granitto, J. Gómez, Spot defects detection in 102–113.


cDNA microarray images, Pattern Analysis and Applications 16 [43] M. Singh, E. Curran, P. Cunningham, Active learning for multi-
(2013) 307–319. label image annotation, in: Proceedings of the 19th Irish Con-
[24] F. Briggs, et. al., The 9th annual MLSP competition: New ference on Artificial Intelligence and Cognitive Science, 2009,
methods for acoustic classification of multiple simultaneous bird pp. 173–182.
species in a noisy environment, in: Proceedings of the Interna- [44] S. Chakraborty, V. Balasubramanian, S. Panchanathan, Optimal
tional Workshop on Machine Learning for Signal Processing, Batch Selection for Active Learning in Multi-label Classifica-
IEEE, 2013, pp. 1–8. tion, in: Proceedings of the 19th ACM international confer-
[25] E. Ukwatta, J. Samarabandu, Vision based metal spectral analy- ence on Multimedia, ACM, Scottsdale, Arizona, USA, 2011,
sis using multi-label classification, in: Proceedings of the Cana- pp. 1413–1416.
dian Conference on Computer and Robot Vision, IEEE, 2009, [45] C. W. Hung, H. T. Lin, Multi-label active learning with auxiliary
pp. 132–139. learner, in: Proceedings of the Asian Conference on Machine
[26] G. Tsoumakas, I. Katakis, I. Vlahavas, Data Mining and Knowl- Learning, Vol. 20, JMLR: Workshop and Conference Proceed-

18
ACCEPTED MANUSCRIPT

ings, 2011, pp. 315–330. Statistics 11 (1940) 86–92.


[46] P. Wang, P. Zhang, L. Guo, Mining multi-label data streams [64] J. Shaffer, Modified sequentially rejective multiple test proce-
using ensemble-based active learning, in: Proceedings of the dures, Journal of the American Statistical Association 81 (395)
12th SIAM International Conference on Data Mining, 2012, pp. (1986) 826–831.
1131–1140. [65] S. Garcı́a, F. Herrera, An extension on “Statistical Comparisons
[47] J. Tang, Z.-J. Zha, D. Tao, T.-S. Chua, Semantic-gap-oriented of Classifiers over Multiple Data Sets’’ for all pairwise compar-
active learning for multilabel image annotation, IEEE Transac- isons, Journal of Machine Learning Research 9 (2008) 2677–
tions on Image Processing 21 (4) (2012) 2354–2360. 2694.
[48] X. Li, Y. Guo, Active Learning with Multi-Label SVM Classi- [66] O. Reyes, E. Pérez, M. C. Rodrı́guez-Hernández, H. M. Far-
fication, in: Proceedings of the 23-th International Joint Con- doun, S. Ventura, Jclal: a java framework for active learning,
ference on Artificial Intelligence, AAAI Press, 2013, pp. 1479– Journal of Machine Learning Research 17 (1) (2016) 3271–
1485. 3275.

T
[49] G. Qi, X. Hua, Y. Rui, J. Tang, H. Zhang, Two-dimensional [67] S. P. Wright, Adjusted p-values for simultaneous inference, Bio-
multi-label active learning with an efficient online adaptation metrics 48 (1992) 1005–1013.

IP
model for image classification, IEEE Transactions on Pattern
Analysis and Machine Intelligence 31 (10) (2009) 1880–1897.
[50] X. Zhang, J. Cheng, C. Xu, H. Lu, S. Ma, Multi-view multi-label
active learning for image classification, in: Proceedings of the

CR
IEEE International Conference on Multimedia and Expo, IEEE,
2009, pp. 258–261.
[51] S. Huang, Z. Zhou, Active query driven by uncertainty and di-
versity for incremental multi-label learning, in: Proceedings of Oscar Reyes was born in Holgun,
the 13th International Conference on Data Mining, IEEE, 2013, Cuba, in 1984. He received the B.S. and M.Sc. degrees
pp. 1079–1084.
[52] B. Zhang, Y. Wang, F. Chen, Multilabel image classification via
high-order label correlation driven active learning, IEEE Trans-
actions on Image Processing 23 (3) (2014) 1430–144.
[53] J. Wu, V. Sheng, J. Zhang, P. Zhao, Z. Cui, Multi-label active
US in Computer Science from the University of Holgun,
Cuba, in 2008 and 2011, respectively. He is currently an
Assistant Professor with the Department of Computer
Science of University of Holgun, Cuba and member of
AN
learning for image classification, in: Proceedings of the IEEE Knowledge Discovery and Intelligent Systems Research
International Conference on Image Processing, IEEE, 2014, pp.
5227–5231. Laboratory of University of Crdoba, Spain. He is cur-
[54] D. Vasisht, A. Damianou, Active learning for sparse bayesian rently working toward the Ph. D. degree. His current
multilabel classification, in: Proceedings of the 20th ACM research interests are in the fields of data mining, ma-
M

SIGKDD International Conference on Knowledge Discovery


chine learning, metaheuristics, and their applications.
and Data Mining, New York, USA, 2014, pp. 472–481.
[55] S. Huang, S. Chen, Z. Zhou, Multi-label active learning: Query
type matters, in: Proceedings of the 24th International Confer-
ence on Artificial Intelligence, AAAI Press, 2015, pp. 946–952.
ED

[56] C.Ye, J. Wu, V. Sheng, P. Zhao, Z. Cui, Multi-label active learn-


ing with label correlation for image classification, in: IEEE In-
ternational Conference on Image Processing, IEEE, 2015, pp.
3437–3441.
[57] C. Dwork, R. Kumar, M. Naor, D. Sivakumar, Rank Aggrega-
Carlos Morell received his B.S
PT

tion Methods for the Web, in: Proceedings of the 10th World degree in Computers Science and his Ph.D. in Artifi-
Wide Web Conference, ACM, Hong Kong, Hong Kong, 2001, cial Intelligence from the Universidad Central de Las
pp. 613–622. Villas, in 1995 and 2005, respectively. Currently, he
[58] N. Ailon, M. Charikar, A. Newman., Aggregating inconsistent
is a Professor in the Department of Computer Science
CE

information: ranking and clustering, in: Proceedings of the 37th


Annual ACM Symposium on Theory of Computing, ACM, Bal- at the same university. In addition he leads the Artifi-
timore, Maryland, USA, 2005, pp. 684–693. cial Intelligence Research Laboratory. His teaching and
[59] R. Fagin, R. Kumar, M. Mahdian, D. Sivakumar, E. Vee, Com- research interests include Machine Learning, Soft Com-
paring and aggregating rankings with ties, in: Proceedings of
puting and Programming Languages.
AC

the ACM Symposium on Principles of Database Systems, ACM,


Paris, France, 2004, pp. 47–58.
[60] S. Guiasu, C. Reischer, Some remarks on entropic distance, en-
tropic measure of connexion and hamming distance, RAIRO-
Theoretical Informatics 13 (4) (1979) 395–407.
[61] S. Kullback, R. A. Leibler, On information and sufficiency, The
annals of mathematical statistics 22 (1) (1951) 79–86.
[62] G. Qi, X. Hua, Y. Rui, J. Tang, H. Zhang, Two-dimensional
active learning for image classification, in: Proceedings of the
Conference on Computer Vision and Pattern Recognition, IEEE, Sebastian Ventura is currently an
2008, pp. 1–8. Associate Professor in the Department of Computer Sci-
[63] M. Friedman, A comparison of alternative tests of significance ence and Numerical Analysis at the University of Cor-
for the problem of m rankings, The Annals of Mathematical doba, where he heads the Knowledge Discovery and In-
19
ACCEPTED MANUSCRIPT

telligent Systems Research Laboratory. He received his


BSc and Ph.D. degrees in sciences from the University
of Cordoba, Spain, in 1989 and 1996, respectively. He
has published more than 150 papers in journals and sci-
entific conferences, and he has edited three books and
several special issues in international journals. He has
also been engaged in twelve research projects (being the
coordinator of four of them) supported by the Spanish
and Andalusian governments and the European Union.
His main research interests are in the fields of computa-

T
tional intelligence, machine learning, data mining, and
their applications. Dr. Ventura is a senior member of

IP
the IEEE Computer, the IEEE Computational Intelli-
gence and the IEEE Systems, Man and Cybernetics So-

CR
cieties, as well as the Association of Computing Ma-
chinery (ACM).

US
AN
M
ED
PT
CE
AC

20