Singapore Management University
Institutional Knowledge at Singapore Management University
Research Collection School Of Information Systems

1-2017

Exploring representativeness and informativeness for active learning

Bo DU, Zengmao WANG, Lefei ZHANG, Liangpei ZHANG, Wei LIU, Jialie SHEN, and Dacheng TAO

Citation
DU, Bo; WANG, Zengmao; ZHANG, Lefei; ZHANG, Liangpei; LIU, Wei; SHEN, Jialie; and TAO, Dacheng. Exploring representativeness and informativeness for active learning. (2017). IEEE Transactions on Cybernetics, 47(1), 14-26. Research Collection School Of Information Systems.
Available at: https://ink.library.smu.edu.sg/sis_research/3531

Published in IEEE Transactions on Cybernetics, 2017 Jan, vol. 47, no. 1, pp. 14-26.
https://doi.org/10.1109/TCYB.2015.2496974

Exploring Representativeness and Informativeness for Active Learning

Bo Du, Senior Member, IEEE, Zengmao Wang, Lefei Zhang, Member, IEEE, Liangpei Zhang, Senior Member, IEEE, Wei Liu, Jialie Shen, Member, IEEE, and Dacheng Tao, Fellow, IEEE

Abstract—How can we find a general way to choose the most suitable samples for training a classifier, even with very limited prior information? Active learning, which can be regarded as an iterative optimization procedure, plays a key role in constructing a refined training set to improve the classification performance in a variety of applications, such as text analysis, image recognition, and social network modeling. Although combining the representativeness and informativeness of samples has been proven promising for active sampling, state-of-the-art methods perform well only under certain data structures. Can we then find a way to fuse the two active sampling criteria without any assumption on the data? This paper proposes a general active learning framework that effectively fuses the two criteria. Inspired by a two-sample discrepancy problem, triple measures are elaborately designed to guarantee that the query samples not only possess the representativeness of the unlabeled data but also reveal the diversity of the labeled data. Any appropriate similarity measure can be employed to construct the triple measures. Meanwhile, an uncertainty measure is leveraged to generate the informativeness criterion, which can be carried out in different ways. Rooted in this framework, a practical active learning algorithm is proposed, which exploits a radial basis function together with the estimated probabilities to construct the triple measures and a modified best-versus-second-best strategy to construct the uncertainty measure, respectively. Experimental results on benchmark datasets demonstrate that our algorithm consistently achieves superior performance over the state-of-the-art active learning algorithms.

Index Terms—Active learning, classification, informative and representative, informativeness, representativeness.

Manuscript received June 3, 2015; revised September 11, 2015; accepted October 27, 2015. Date of publication November 17, 2015; date of current version December 14, 2016. This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB719905, in part by the National Natural Science Foundation of China under Grant 61471274, Grant 41431175, and Grant 61401317, in part by the Natural Science Foundation of Hubei Province under Grant 2014CFB193, in part by the Fundamental Research Funds for the Central Universities under Grant 2042014kf0239, and in part by the Australian Research Council Projects under Grant DP-140102164 and Grant FT-130101457. This paper was recommended by Associate Editor J. Basak. (Corresponding author: Lefei Zhang.)
B. Du, Z. Wang, and L. Zhang are with the State Key Laboratory of Software Engineering, School of Computer, Wuhan University, Wuhan 430079, China (e-mail: gunspace@163.com; wzm902009@gmail.com; zhanglefei@ieee.org).
L. Zhang is with the Collaborative Innovation Center of Geospatial Technology, State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: zlp62@whu.edu.cn).
W. Liu is with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA (e-mail: wliu@ee.columbia.edu).
J. Shen is with the School of Information Systems, Singapore Management University, Singapore (e-mail: jlshen@smu.edu.sg).
D. Tao is with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW 2007, Australia (e-mail: dacheng.tao@uts.edu.au).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCYB.2015.2496974

I. INTRODUCTION

THE past decades have witnessed rapid development in cheaply collecting huge amounts of data, providing opportunities for intelligently classifying data using machine learning techniques [1]–[6]. In classification tasks, a sufficient amount of labeled data must be provided to a classification model in order to achieve satisfactory classification accuracy [7]–[10]. However, annotating such an amount of data manually is time consuming and sometimes expensive. Hence it is wise to select fewer yet informative samples for labeling from a pool of unlabeled samples, so that a classification model trained with these optimally chosen samples can perform well on unseen data. If we select the unlabeled samples randomly, there will be redundancy, and some samples may bias the classification model, which will eventually result in a poor generalization ability of the model. Active learning methodologies address this challenge by querying the most informative samples for class assignment [11]–[13], and the informativeness criterion for active sampling has been successfully applied to many data mining and machine learning tasks [14]–[20]. Although active learning has been developed based on many approaches [19], [22]–[24], the goal of querying the most informative samples has never changed [22], [25], [26].

Essentially, active learning is an iterative sampling-plus-labeling procedure. At each iteration, it selects one sample for manual labeling that is expected to improve the performance of the current classifier [26], [27]. Generally speaking, there are two main sampling criteria in designing an effective active learning algorithm, namely informativeness and representativeness [28]. Informativeness represents the ability of a sample to reduce the generalization error of the adopted classification model, and ensures less uncertainty of the classification model in the next iteration. Representativeness decides whether a sample can exploit the structure underlying the unlabeled data [11], and many applications have paid much attention to such information [29]–[33]. Most popular active learning algorithms deploy only one criterion to query the most desired
samples. The approaches drawing on informativeness attracted more attention in the early research on active learning. Typical approaches include: 1) query-by-committee, in which several distinct classifiers are used and the samples with the largest disagreements in the labels predicted by these classifiers are selected [34]–[36]; 2) max-margin sampling, where the samples are selected according to the maximum uncertainty via the distances to the classification boundaries [37]–[39]; and 3) max-entropy sampling, which uses entropy as the uncertainty measure via probabilistic modeling [40]–[43]. The common issue of these active learning methods is that they may not be able to take full advantage of the information in the abundant unlabeled data, and they query samples relying merely on the scarce labeled data. Therefore, they may be prone to a sampling bias.

Hence, a number of active learning algorithms have recently been proposed based on the representativeness criterion to exploit the structure of unlabeled data, in order to overcome the deficiency of the informativeness criterion. Among these methods, there are two typical means of exploring the representativeness in unlabeled data. One is clustering methods [44], [45], which exploit the clustering structure of unlabeled data and choose the query samples closest to the cluster centers. The performance of clustering-based methods depends on how well the clustering structure can represent the entire data structure. The other is optimal experimental design methods [46]–[48], which try to query the representative examples in a transductive manner. The major problem of experimental-design-based methods is that a large number of samples need to be accessed before the optimal decision boundary is found, while the informativeness of the query samples is almost ignored.

Since neither single criterion can perform perfectly, it is natural to combine the two criteria to query the desired samples. Wang and Ye [49] introduced an empirical risk minimization principle to active learning and derived an empirical upper bound for forecasting the active learning risk. By doing so, discriminative and representative samples are effectively queried, and the uncertainty measure stems from a hypothetical classification model (i.e., a regression model in a kernel space). However, a bias would be caused if the hypothetical model and the true model differed in some aspects. Huang et al. [28] also proposed an active learning framework combining informativeness and representativeness, in which classification uncertainty is used as the informativeness measurement and a semi-supervised learning algorithm is introduced to discover the representativeness of unlabeled data [50], [51]. To run the semi-supervised learning algorithm, the input data structure should satisfy the semi-supervised assumption to guarantee the performance [52].

As reviewed above, we argue that the current attempts to fuse the informativeness and representativeness in active learning may be susceptible to certain assumptions and constraints on the input data. Hence, can we find a general way to combine the two criteria in designing active learning methods, regardless of any assumption or constraint? In this paper, motivated by the two-sample discrepancy problem, which uses an estimator based on one sample to measure a distribution, a fresh idea is provided: the unlabeled and labeled sets are directly investigated by several similarity measures with the function in the two-sample discrepancy problem theory, leading to a general active learning framework. This framework integrates the informativeness and representativeness into one formula, whose optimization falls into a standard quadratic programming (QP) problem.

To the best of our knowledge, this paper is the first attempt to develop a well-defined, complete, and general framework for active learning in the sense that both representativeness and informativeness are taken into consideration based on the two-sample discrepancy problem. According to the two-sample discrepancy problem, this framework provides a straightforward and meaningful way to measure the representativeness by fully investigating the triple similarities, which include the similarities between a query sample and the unlabeled set, between a query sample and the labeled set, and between any two candidate query samples. If a proper function is found that satisfies the conditions of the two-sample discrepancy problem, the representative sample can be mined with the triple similarities in the active learning process. Since uncertainty is also important information in active learning, an uncertainty part is combined with the triple similarities via a tradeoff parameter that weights the importance of representativeness against informativeness. Therefore, the most significant contribution of this paper is that it provides a general idea for designing active learning methods, under which various active learning algorithms can be tailored by suitably choosing a similarity measure and/or an uncertainty measure.

Rooted in the proposed framework, a practical active learning algorithm is designed. In this algorithm, the radial basis function (RBF) is adopted to measure the similarity between two samples and then applied to derive the triple similarities. Different from traditional ways, the kernel is calculated from the posterior probabilities of the two samples, which is more adaptive to potentially large variations of the data during the whole active learning process. Meanwhile, we modify the best-versus-second-best (BvSB) strategy [53], which is also based on the posterior probabilities, to measure the uncertainty. We verify our algorithm on fifteen UCI benchmark [54] datasets. The extensive experimental results demonstrate that the proposed active learning algorithm outperforms the state-of-the-art active learning algorithms.

The rest of this paper is organized as follows. Section II presents our proposed active learning framework in detail. The experimental results as well as analysis are given in Section III. Section IV provides further discussion about the proposed method. Finally, we draw the conclusion in Section V.

II. PROPOSED FRAMEWORK

Suppose there is a dataset with n samples; initially we randomly divide it into a labeled set and an unlabeled set. We denote by Lt a training set with nt samples and by Ut an unlabeled set with ut samples at time t. Note that n is always equal to nt + ut. Let ft be the classifier trained on Lt. The objective of active learning is to select a sample xs from Ut such that a new classification model trained on Lt ∪ {xs} has the maximum generalization capability.
Let Y be the set of class labels. In the following discussion, the symbols will be used as defined above. We review the basics of the two-sample discrepancy problem below.

A. Two-Sample Discrepancy Problem

Let X = {x1, ..., xm} and Z = {z1, ..., zn} be observations drawn from a domain D, and let x and z be the distributions defined on X and Z, respectively. The two-sample discrepancy problem is to draw independent identically distributed (i.i.d.) samples from x and z, respectively, and to test whether x and z are the same distribution [55]. Two-sample test statistics for measuring the discrepancies were proposed by Hall [56] and Anderson et al. [57]. In [57], the two-sample test statistics are used for measuring discrepancies between two multivariate probability density functions (pdfs). Hall [56] studies the integrated squared error between a kernel-based density estimate of a multivariate pdf and the true pdf. With a brief notation, it can be represented as

$$H = \int \left(\hat{f}_\sigma - f\right)^2 \qquad (1)$$

where $\hat{f}_\sigma$ denotes the density estimate, $\sigma$ denotes the associated bandwidth, and $f$ is the true pdf. In this way, a central limit theorem for H, which relies on the bandwidth within $\hat{f}_\sigma$, is derived and presented in Theorem 1.

Theorem 1: Define

$$G_n(x, y) = E\{Q_n(X_1, x)\, Q_n(X_1, y)\}$$

and assume $Q_n$ is symmetric,

$$E\{Q_n(X_1, X_2) \mid X_1\} = 0, \qquad E\{Q_n^2(X_1, X_2)\} < \infty$$

for each n. If

$$\lim_{n\to\infty} \frac{E\{G_n^2(X_1, X_2)\} + n^{-1} E\{Q_n^4(X_1, X_2)\}}{\big[E\{Q_n^2(X_1, X_2)\}\big]^2} = 0$$

then

$$U_n \equiv \sum_{1\le i<j\le n} Q_n(X_i, X_j)$$

is asymptotically normally distributed with zero mean and variance given by $(1/2)\, n^2 E\{Q_n^2(X_1, X_2)\}$, where $Q_n$ is a symmetric function depending on n and $X_1, \ldots, X_n$ are i.i.d. random variables (or vectors). $U_n$ is a random variable of a simple one-sample U-statistic.

Following [56], a brief proof is provided in Appendix A. It is intuitively obvious that H is a natural test statistic for a significance test against the hypothesis that f is indeed the correct pdf. Based on the above theorem, the two-sample version of H is investigated [57]. Naturally, the objective of the statistic is

$$H_{\sigma_1\sigma_2} = \int \left(\hat{f}_{\sigma_1} - \hat{f}_{\sigma_2}\right)^2 \qquad (2)$$

in which, for j = 1, 2, $\hat{f}_{\sigma_j}$ is a density estimated from the jth sample with smoothing parameter $\sigma_j$. Based on this two-sample version, which measures two distributions with two samples drawn from them, we develop it for active learning to find a sample that measures the representativeness in the labeled dataset and the unlabeled dataset, and we call such a problem the "two-sample discrepancy problem." It is presented in Theorem 2, and the proof is provided in Appendix B.

Theorem 2: Suppose $\{X_{j1}, \ldots, X_{jn_j}\}$, $j = 1, 2$, are two independent random samples from p-variate distributions with densities $f_j$, $j = 1, 2$; let $\sigma_j$ be a bandwidth and K a spherically symmetric p-variate density function. Assume

$$\lim_{n_1\to\infty}\sigma_1 = 0, \quad \lim_{n_2\to\infty}\sigma_2 = 0, \quad \lim_{n_1\to\infty} n_1\sigma_1 = \infty, \quad \lim_{n_2\to\infty} n_2\sigma_2 = \infty$$

and define the estimator of the p-variate distribution with density $f_j$ as

$$\hat{f}_j = \left(n_j \sigma_j^p\right)^{-1}\sum_{i=1}^{n_j} K\big((x - X_{ji})/\sigma_j\big).$$

Then the discrepancy of the two distributions can be measured by the two-sample discrepancy problem with a minimum distance $n^{-1/2}\sigma^{-p/2}$, where $n = n_1 + n_2$.

Other detailed proofs and theorems about the two-sample discrepancy problem can be found in [55] and [57]. Suppose we find a density function F as in Theorem 2; following [46], the empirical estimate of the distribution discrepancy between X and Z, with samples $x_i \in X$ and $z_i \in Z$, can be defined as follows:

$$F(x_i) - F(z_i) = \sup\left[\left(m_j\sigma_j^p\right)^{-1}\sum_{i=1}^{m_j} F\big((x - x_i)/\sigma_j\big) - \left(n_j\sigma_j^p\right)^{-1}\sum_{i=1}^{n_j} F\big((z - z_i)/\sigma_j\big)\right]. \qquad (3)$$

Equation (3) shows that, given a sample $x \in X$ and a sample $z \in Z$, the distribution discrepancy of the two datasets can be measured with the two samples under the density function F. According to Theorem 2, (3) has a minimum distance between the two datasets. If the upper bound can be minimized, the distributions of X and Z will be similar under the density function F, where F is rich enough [46]. According to the review above, the consistency of the distributions of two sets can thus be measured with a proper density function F. If we treat X as the unlabeled set and Z as the labeled set, we can see that the two-sample discrepancy problem can be adapted to the active learning problem: if a sample in the unlabeled set can measure the distribution of the unlabeled set, then adding it to the labeled set and removing it from the unlabeled set does not change the distribution of the unlabeled set and makes the two sets share the same distribution. In active learning, only a small proportion of the samples are labeled, so the finite-sample properties of the distribution discrepancy are important for the two-sample distribution discrepancy to apply to active learning. We briefly introduce one test for the two-sample discrepancy problem that has exact performance guarantees at finite sample sizes, based on uniform convergence bounds, following [55], where the McDiarmid [58] bound on the biased distribution discrepancy statistic is used. In the finite-sample setting, we can establish two properties of the distribution discrepancy, from which a hypothesis test is derived. Intuitively, the expression of the distribution discrepancy and that of the maximum mean discrepancy (MMD) are very similar; if we set σ = 1, they have the same expression. First, if we let $m_j\sigma^p = m$ and $n_j\sigma_j^p = n$, then according to [55], regardless of whether or not the distributions of the two datasets are the same, the empirical distribution discrepancy converges in probability at rate $O\big((m_j\sigma^p + n_j\sigma_j^p)^{-1/2}\big)$, which is a function of σ; this shows the consistency of the statistical tests. Second, probabilistic bounds for large deviations of the empirical distribution discrepancy can be given when the two datasets have the same distribution; according to Theorem 2 in this paper, these bounds lead directly to a threshold, and according to the theorems in Section 4.1 of [55], the convergence rate can be characterized and the biased bound is linked with σ. More details about the finite-sample properties can be found in [55]. Hence, Theorem 2 can be used to measure the representativeness of an unlabeled sample in active learning.
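To make the use of Theorem 2 concrete, the following is a minimal sketch (not the authors' implementation) of the empirical discrepancy in (3): for a candidate unlabeled sample, it compares the average kernel similarity to the labeled set against that to the unlabeled set, using a Gaussian kernel as the spherically symmetric density K. The bandwidth, the kernel choice, and the toy selection rule at the end are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Spherically symmetric kernel K((u - v)/sigma), here a Gaussian."""
    d = np.asarray(u) - np.asarray(v)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def discrepancy_score(x, labeled, unlabeled, sigma=1.0):
    """Difference between the kernel density estimates that a single candidate x
    induces w.r.t. the labeled set and the unlabeled set, in the spirit of Eq. (3):
    a small value means querying x keeps the two empirical distributions close."""
    f_labeled = np.mean([gaussian_kernel(x, z, sigma) for z in labeled])
    f_unlabeled = np.mean([gaussian_kernel(x, u, sigma) for u in unlabeled])
    return f_labeled - f_unlabeled

# Toy usage: pick the unlabeled point whose query changes the discrepancy least.
rng = np.random.default_rng(0)
L = rng.normal(size=(5, 2))    # labeled samples (hypothetical)
U = rng.normal(size=(50, 2))   # unlabeled pool (hypothetical)
best = int(np.argmin([discrepancy_score(x, L, U) for x in U]))
```

This single-sample scoring is exactly what the M2 and M3 terms of the next subsection vectorize over the whole unlabeled pool.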
B. General Active Learning Framework

In our proposed framework, the goal is to select an optimal sample that not only furnishes useful information for the classifier ft but also shows representativeness in the unlabeled set Ut and as little redundancy as possible with the labeled set Lt. In this way, the sample xs should be informative and representative with respect to both the unlabeled set and the labeled set; if we query samples without considering representativeness, the two samples queried in two separate iterations may each furnish useful information yet provide the same information, so that one of them becomes redundant. Hence, our proposed framework aims at selecting samples that carry different pieces of information. To achieve this goal, an optimal active learning framework is proposed, which combines the informativeness and representativeness together, with a tradeoff parameter used to balance the two criteria.

For the representative part, the two-sample discrepancy problem is used. As reviewed above, the two-sample discrepancy problem examines $H_{\sigma_1\sigma_2}$ in (2) under the hypothesis $\hat{f}_{\sigma_1} = \hat{f}_{\sigma_2}$, and the objective is to minimize $H_{\sigma_1\sigma_2}$. So the essential problem of the distribution discrepancy is to estimate $\hat{f}_{\sigma_1}$ and $\hat{f}_{\sigma_2}$. Theorem 2 shows a way to find an estimator of a p-variate distribution with density $f_j$, estimated from the jth sample, so a sample $x_j$ in a dataset can always yield an estimator of a p-variate distribution with $f_j$. If we use $x_j$ to find two estimators of a p-variate distribution with respect to two datasets A and B, the distribution discrepancy of A and B can be measured by the difference of the two estimators. Let A and B represent the labeled data and the unlabeled data in active learning, respectively; a sample in the unlabeled data then yields two estimators, one with the labeled data and one with the unlabeled data. For each unlabeled sample, a distribution discrepancy can therefore be obtained between the labeled data and the unlabeled data. The distributions of the labeled data and of the unlabeled data correspond, respectively, to the terms M2 and M3 in (4). If the difference between M2 and M3 is small, it indicates that adding the sample to the labeled data will decrease the distribution discrepancy between the unlabeled data and the labeled data. For representativeness, our goal is thus to find the sample that makes the distribution discrepancy between the unlabeled data and the labeled data small. However, exhaustive search is not feasible due to the exponential nature of the search space; hence, we solve this problem using numerical optimization. We define a binary vector α with ut entries, where each entry αi corresponds to the unlabeled point xi: if the point is queried, αi is equal to 1, and otherwise 0. The discrepancy of the samples in the unlabeled set is also a vector with ut entries. For each sample xi in the unlabeled set, we measure the discrepancy with two parts, M2(i) and M3(i), defined with respect to the distribution of the labeled set and of the unlabeled set, respectively. The distribution discrepancy of sample xi can then be measured as

$$M_2(i) - M_3(i).$$

Meanwhile, we want to make sure the selected sample is optimal among the latent representative samples, so a similarity matrix M1 is defined with ut × ut entries whose (i, j)th entry represents the similarity between xi and xj in the unlabeled set. Hence, the optimization problem of the representative part can be formulated as follows:

$$\min_{\alpha^T 1_{u_t} = 1,\; \alpha_i \in \{0,1\}} \; \alpha^T M_1 \alpha + \alpha^T (M_2 - M_3). \qquad (4)$$

For the informative part, we compute an uncertainty vector C with ut entries, where C(xi) denotes the uncertainty value of point xi in the unlabeled set. Therefore, combining the representative part and the uncertainty part with a tradeoff parameter, we can directly describe the objective of the active learning framework as follows:

$$\min_{\alpha} \; \alpha^T M_1 \alpha + \alpha^T (M_2 - M_3) + \beta C \qquad \text{s.t.} \;\; \alpha^T 1_{u_t} = 1, \;\; \alpha_i \in \{0, 1\}. \qquad (5)$$

If S is a spherically symmetric density function measuring the similarity between two samples, the entries in M1–M3 can be defined as below. In the proposed framework, M1 is a matrix in $\mathbb{R}^{u_t \times u_t}$, and its entries are

$$M_1(i, j) = \frac{1}{2} S(i, j) \qquad (6)$$

where M1(i, j) is the similarity between the ith and jth samples in the unlabeled set. Differing from the entries of M1, the entries of M2 measure the distribution between one sample and the labeled set:

$$M_2(i) = \frac{n_t + 1}{n}\sum_{j=1}^{n_t} S(i, j) \qquad (7)$$

where $(n_t + 1)/n$ is a weight corresponding to the percentage of the labeled set in the whole dataset, which is used to balance the importance of the unlabeled data and the labeled data. M2(i) represents the similarity between sample xi in the unlabeled set and the labeled set. If M2(i) is smaller, it implies that the sample xi chosen from the unlabeled set is
more different from the labeled samples, and the redundancy is also reduced. Therefore, M2 ensures that the selected samples contain more diversity. Similar to the definition of M2, we define $M_3 \in \mathbb{R}^{u_t \times 1}$ to measure the distribution between the sample and the unlabeled set as follows:

$$M_3(i) = \frac{u_t - 1}{n}\sum_{j=1}^{u_t} S(i, j) \qquad (8)$$

where $(u_t - 1)/n$ is a weight corresponding to the percentage of the unlabeled set in the whole dataset, with the same meaning as the weight of M2. It enforces the query sample to present a certain similarity to the remaining samples in Ut.

As described above, M1–M3 together help to select a sample that is representative in the unlabeled set and presents low redundancy with the labeled set, and combining them is an effective way to express the representativeness of a sample. In order to ensure that the selected sample is also highly informative, C is computed as an uncertainty vector $C \in \mathbb{R}^{u_t \times 1}$ of length ut. Each entry C(i) denotes the uncertainty of xi in the unlabeled set for the current classifier ft. It can be formulated as

$$C(i) = \Phi\big(f^t, x_i\big), \qquad x_i \in U_t \qquad (9)$$

where $\Phi(\cdot)$ is a function that measures the uncertainty of xi based on ft. Based on the fundamental idea of the proposed framework, if we can find a reasonable function to measure the similarity between two samples and a method to compute the uncertainty of a sample, an active learning algorithm can be designed accordingly. Meanwhile, the formulation of our proposed framework is an integer programming problem, which is NP-hard due to the constraint $\alpha_i \in \{0, 1\}$. A common strategy is to relax αi to the continuous range [0, 1], which yields the convex optimization problem

$$\min_{\alpha} \; \alpha^T M_1 \alpha + \alpha^T (M_2 - M_3) + \beta C \qquad \text{s.t.} \;\; \alpha^T 1_{u_t} = 1, \;\; \alpha_i \in [0, 1]. \qquad (10)$$

This is a QP problem, and it can be solved efficiently by a QP solver. Once we solve for α in (10), we set the largest element αi to 1 and the remaining ones to 0. Therefore, the continuous solution can be greedily recovered to an integer solution.
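A minimal sketch of how (10) can be solved in practice is given below, using the cvxpy modeling package as a generic QP solver in place of the MOSEK toolbox used in the experiments. The way the uncertainty vector C enters the linear term (as β times the inner product with α) is our reading of (10), and the small ridge added to M1 is only a numerical safeguard; both are assumptions rather than details stated in the paper.

```python
import numpy as np
import cvxpy as cp

def select_query(M1, M2, M3, C, beta=1.0):
    """Solve the relaxed problem (10) and greedily recover the integer solution:
    the entry of alpha with the largest value marks the sample to query.
    M1 is a (u, u) PSD similarity matrix; M2, M3, C are length-u vectors."""
    u = M1.shape[0]
    alpha = cp.Variable(u)
    M1 = M1 + 1e-8 * np.eye(u)          # numerical safeguard so quad_form sees a PSD matrix
    linear = M2 - M3 + beta * C         # assumption: the beta*C term enters linearly in alpha
    objective = cp.Minimize(cp.quad_form(alpha, M1) + linear @ alpha)
    constraints = [cp.sum(alpha) == 1, alpha >= 0, alpha <= 1]
    cp.Problem(objective, constraints).solve()
    return int(np.argmax(alpha.value))  # index of the unlabeled sample to query
```

Any QP-capable solver can replace cvxpy here; the greedy rounding in the last line is exactly the recovery step described above.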
C. Proposed Active Learning Method

Based on the proposed optimal framework, we propose an active learning method relying on the probability estimates of class membership for all the samples.

1) Computing the Representative Part: The proposed framework is a convex optimization problem, which requires the similarity matrix to be positive semi-definite. A kernel matrix is widely used as the similarity matrix, since it maintains the convexity of the objective function by constraining the kernel to be positive semi-definite [28], [46], [59]. Without loss of generality, we adopt the RBF to measure the similarity between two samples. Generally speaking, similarity matrices are usually computed directly from the Euclidean distance of two samples with an RBF in feature space [28], [46], [60]. However, in practice, the distribution of the samples in the unlabeled set and that of the samples in the labeled set may differ as the sizes of the two sets change. Therefore, it may not be reasonable to use such a kernel matrix to measure the similarity during the active learning process. Hence, in our proposed method, we measure the similarity between two samples using the posterior probability combined with the RBF. The posterior probability represents the importance of a sample with respect to a classifier. If the similarity is measured with the probability kernel, it reflects whether the samples have the same impact on the respective classifiers, which helps directly to construct proper classifiers. Thus, the representative samples selected with the probability kernel are effective for enhancing the classifiers. Meanwhile, the probability leads to a distribution of samples that is denser than that of the feature similarity; therefore, the redundancy can be reduced and more informative samples can be queried. To compute the posterior probability, the classifier ft is applied to the sample xi in the unlabeled set to yield the posterior probability $p_c^i$ with respect to the corresponding class c, where c belongs to the label set Y [15]. Let $P^i$ and $P^j$ be the posterior probability vectors of two samples in the unlabeled set, with $P^i = \{p_1^i, p_2^i, \ldots, p_{|Y|}^i\}$, where |Y| is the number of classes. Then, the (i, j)th entry in M1 can be represented as

$$M_1(i, j) = \frac{1}{2} S(i, j) = \frac{1}{2}\exp\left(-\gamma\,\big\|P^i - P^j\big\|_2^2\right) \qquad \text{s.t.}\;\; \sum_{k=1}^{|Y|} p_k^i = 1, \;\; \sum_{k=1}^{|Y|} p_k^j = 1 \qquad (11)$$

where γ is the kernel parameter. Thus, the elements of M2 can be computed as

$$M_2(i) = \frac{n_t + 1}{n}\sum_{j=1}^{n_t}\exp\left(-\gamma\,\big\|P^i - P^j\big\|_2^2\right). \qquad (12)$$

Meanwhile, the ith entry in M3 is

$$M_3(i) = \frac{u_t - 1}{n}\sum_{j=1}^{u_t}\exp\left(-\gamma\,\big\|P^i - P^j\big\|_2^2\right). \qquad (13)$$

2) Computing the Uncertainty Part: The uncertainty part is used to measure the informativeness of the sample for the current classifier ft. If it is hard for ft to decide a sample's class membership, the sample carries high uncertainty, so it may well be the one that we want to select and label. In this part, we design a new strategy to measure the uncertainty of a sample, which is a modification of the BvSB strategy. BvSB is a method based on the posterior probability that considers the difference between the probability values of the two classes with the highest estimated probabilities; it has been described in detail in [42]. Such a measure is a direct way to estimate the confusion about class membership from the classification perspective. Its mathematical form is as follows: for a sample xi in the unlabeled dataset, let $p_h^i$ be the maximum value in $P^i$ for class h, and let $p_g^i$ be the second maximum value in $P^i$ for class g. The BvSB measure is then

$$d_{BvSB}^i = p_h^i - p_g^i. \qquad (14)$$

The smaller the BvSB measure, the higher the uncertainty of the sample. However, such a method is inclined to select samples close to the separating hyperplane. In our proposed method, the SVM classifier is used; hence, we also hope the distance between two classes is large. Therefore, the query sample should not only be close to the hyperplane but also close to the support vectors. We define such information as the position measure. Based on this idea, we modify the BvSB method to reveal the highly uncertain information behind the unlabeled samples. We denote the support vector set as $SV = \{SV_1, SV_2, \ldots, SV_m\}$, where m is the number of support vectors, and suppose $P = \{P_1, P_2, \ldots, P_m\}$ is the set of probability vectors of the support vectors. For each sample xi in the unlabeled dataset, we construct a similarity function to calculate the distance between the sample and the support vectors with the estimated probabilities as the position measure

$$f\big(x_i, SV_j\big) = \exp\left(\big\|P^i - P_j\big\|_2^2\right). \qquad (15)$$

If the sample is close to a support vector, $f(x_i, SV_j)$ will be small. We choose the smallest value of (15), attained at the closest support vector, as the position measure of the sample in the classification interval of the SVM. The closest support vector is found as

$$SV_c = \operatorname*{argmin}_{SV_j \in SV} f\big(x_i, SV_j\big). \qquad (16)$$

Since our goal is to enhance the classification hyperplane as well as to improve the classification interval of the SVM classifier, we combine the BvSB measure and the position measure together as the uncertainty

$$C(x_i) = d_{BvSB}^i \cdot f\big(x_i, SV_c\big). \qquad (17)$$

By minimizing C(xi), the uncertainty information is enhanced compared to BvSB. Fig. 1 shows the data points selected by BvSB and by the proposed uncertainty method.

Fig. 1. The triangle and the rectangle denote two classes, and the green points are the unlabeled set. Left: original model. Middle: green points with black rectangles are the samples queried with BvSB. Right: green points with black rectangles are the samples queried with the proposed uncertainty method.

It is worth noting that in the proposed method the main task is to calculate the estimated probabilities, with which the active learning procedure can be easily implemented. We use the LIBSVM toolbox [61] for classification and probability estimation. Hence, the implementation of our proposed method is simple and efficient.
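The triple measures (11)-(13) and the modified BvSB uncertainty (14)-(17) reduce to a few vectorized operations on the posterior-probability matrices. The following NumPy sketch illustrates them under the assumption that the posteriors of the unlabeled samples, the labeled samples, and the support vectors are already available from the current classifier; names such as triple_measures and P_unlab are ours, not the authors'.

```python
import numpy as np

def triple_measures(P_unlab, P_lab, gamma=1.0):
    """M1, M2, M3 of (11)-(13) built from an RBF on class-posterior vectors.
    P_unlab: (u, |Y|) posteriors of unlabeled samples, P_lab: (l, |Y|)."""
    n_u, n_l = len(P_unlab), len(P_lab)
    n = n_u + n_l
    d_uu = ((P_unlab[:, None, :] - P_unlab[None, :, :]) ** 2).sum(-1)
    d_ul = ((P_unlab[:, None, :] - P_lab[None, :, :]) ** 2).sum(-1)
    M1 = 0.5 * np.exp(-gamma * d_uu)                         # Eq. (11)
    M2 = (n_l + 1) / n * np.exp(-gamma * d_ul).sum(axis=1)   # Eq. (12), weight (nt+1)/n
    M3 = (n_u - 1) / n * np.exp(-gamma * d_uu).sum(axis=1)   # Eq. (13), weight (ut-1)/n
    return M1, M2, M3

def uncertainty(P_unlab, P_sv):
    """Modified BvSB of (14)-(17): the BvSB gap times the distance to the
    closest support vector, both measured on posterior vectors."""
    sorted_p = np.sort(P_unlab, axis=1)
    bvsb = sorted_p[:, -1] - sorted_p[:, -2]                          # Eq. (14)
    d_sv = ((P_unlab[:, None, :] - P_sv[None, :, :]) ** 2).sum(-1)
    position = np.exp(d_sv).min(axis=1)                               # Eqs. (15)-(16)
    return bvsb * position                                            # Eq. (17)
```

Both helpers only need the probability estimates, which matches the remark above that the main computational task of the method is probability estimation.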
Algorithm 1 Proposed Algorithm Based on the General Active Learning Framework
Input: Lt = {(xi, yi)} with nt labeled samples, Ut = {xi} with ut unlabeled samples, t = 0, the tradeoff parameter β, the terminating condition δ
1: repeat
2:   calculate the estimated probabilities for the samples in Lt ∪ Ut using LIBSVM.
3:   acquire M1, M2, and M3 from the estimated probabilities according to (11), (12), and (13).
4:   calculate the BvSB value for each sample in Ut with (14).
5:   find the closest support vector of each sample in Ut with (15) and (16); then calculate the position measure with the closest support vector.
6:   combine the uncertainty value of each sample in Ut with (17).
7:   optimize the objective function (10) w.r.t. α using QP; set the largest element in α to 1 and the others to 0; query the selected sample xs and obtain its oracle label.
8:   update the labeled set Lt and the unlabeled set Ut.
9: until the terminating condition δ is satisfied

The key steps of the proposed algorithm are summarized in Algorithm 1. From the descriptions in Sections II-A and II-B, it can be seen that our proposed framework can be generalized to different AL algorithms.
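Putting the pieces together, Algorithm 1 can be sketched as the loop below. It uses scikit-learn's SVC (a libsvm wrapper) as a stand-in for the LIBSVM toolbox, and relies on the triple_measures, uncertainty, and select_query sketches given earlier; the oracle callback and the fixed query budget are illustrative assumptions rather than details of the authors' code.

```python
import numpy as np
from sklearn.svm import SVC

def active_learning_loop(X_lab, y_lab, X_unlab, beta=1.0, n_queries=50, oracle=None):
    """Sketch of Algorithm 1: refit the probabilistic SVM, build the triple
    measures and the modified BvSB uncertainty, solve the QP (10), query the
    selected sample, and move it from the unlabeled pool to the labeled set."""
    X_lab, y_lab, X_unlab = map(np.array, (X_lab, y_lab, X_unlab))
    for _ in range(n_queries):
        clf = SVC(kernel="rbf", probability=True).fit(X_lab, y_lab)
        P_unlab = clf.predict_proba(X_unlab)
        P_lab = clf.predict_proba(X_lab)
        P_sv = clf.predict_proba(clf.support_vectors_)
        M1, M2, M3 = triple_measures(P_unlab, P_lab)
        C = uncertainty(P_unlab, P_sv)
        s = select_query(M1, M2, M3, C, beta)           # solve the relaxed QP (10)
        X_lab = np.vstack([X_lab, X_unlab[s]])
        y_lab = np.append(y_lab, oracle(X_unlab[s]))    # ask the oracle for a label
        X_unlab = np.delete(X_unlab, s, axis=0)
    return X_lab, y_lab
```

A fixed budget is used here for simplicity; the terminating condition δ of Algorithm 1 can replace it directly.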
III. EXPERIMENTS

In our experiments, we compare our method with random selection and with state-of-the-art active learning methods. The compared methods are listed as follows.
1) RANDOM: randomly select the samples to be labeled throughout the whole process.
2) BMDR: batch-mode active learning by querying discriminative and representative samples, which adopts maximum mean discrepancy to measure the distribution difference and derives an empirical upper bound for the active learning risk [49].
3) QUIRE: min–max based active learning, a method that queries both informative and representative samples [28].
4) MP: marginal probability distribution matching-based active learning, a method that prefers representative samples [46].
5) MARGIN: simple margin, active learning that selects uncertain samples based on the distance of the point to the hyperplane [39].

Note that BMDR and MP are batch-mode active learning methods [46], [49], so we set the batch size to 1 to select a single sample to label at each iteration, as in [28]. Following previous active learning publications [28], [39], [46], [49], we verify our proposed method on 15 UCI benchmark datasets: 1) Australian; 2) sonar; 3) diabetis; 4) German; 5) heart; 6) splice; 7) image; 8) iris; 9) monk1; 10) vote; 11) wine; 12) ionosphere; 13) twonorm; 14) waveform; and 15) ringnorm. Their characteristics are summarized in Table I.

TABLE I. Characteristics of the datasets, including the numbers of the corresponding features and samples.

In our experiments, we randomly divide each dataset into two parts with a 60%/40% partition. We treat the 60% portion as the training set and the 40% portion as the testing set [49]. The training set is used for active learning, and the testing data are used to compare the prediction accuracy of the different methods. Our proposed method can start without labeled samples, but for the MARGIN method initial labeled data are obligatory. Hence, to ensure a fair comparison, we start all the active learning methods with an initially labeled small dataset that is just enough to train a classifier. For each dataset, we select just one sample from each class as the initially labeled data; as in [49], they are selected from the training set. The rest of the training set is used as the unlabeled dataset for active learning. For each dataset, the procedure is stopped when the prediction accuracy no longer increases for any method, or when the proposed method keeps outperforming the compared methods after several iterations. This stopping criterion ensures that the proposed method and the compared methods can be contrasted clearly and also decreases the labeling cost. For the compared methods' parameter settings, we use the values in the original papers. In the proposed method, there is a tradeoff parameter β; following [28] and [49], we choose the best value of β from a candidate set by cross validation at each iteration. For all methods, the SVM is used as the classifier, and the LIBSVM toolbox [61] is used. We choose the RBF kernel for the classifier, and the same kernel width is used for the proposed algorithm and the compared methods. The parameters of the SVM are set to the empirical values [61]. Since our proposed method is a QP problem, we can solve it with a QP toolbox; in our experiments, we use the MOSEK toolbox (https://mosek.com/) to solve the optimization problem.

We conduct our experiments in ten runs for each dataset and each active learning method, and show the average performance of each method in Fig. 2. In the active learning field, the performance over the entire query process of the competing methods is usually presented for comparison. Besides, we compare the competing methods with the proposed method in each run based on the paired t-test at the 95% significance level [28], [49], and show the Win/Tie/Loss counts for all datasets in Table II.

Fig. 2. Comparison of different active learning methods on 15 benchmark datasets. The curves show the learning accuracy over queries, and each curve represents the average result of ten runs.

TABLE II. Win/Tie/Loss counts of our method versus the competitors based on the paired t-test at the 95% significance level.

From the results, we can observe that our proposed method yields the best performance among all the methods. The other active learning methods are not always superior to the RANDOM method in certain cases. Among the competitors, BMDR and QUIRE are two methods that query both informative and representative samples, and BMDR presents a better performance than the other competitors. It performs well at the beginning of the learning stage. As the number of queries increases, we observe that BMDR yields merely decent performance compared with our proposed method. This phenomenon may be attributed to the fact that with a hypothetical classification model, the learned decision boundary tends to be inaccurate, and as a result, the unlabeled instances closest to the decision boundary may not be the most informative ones. As for QUIRE, although it also queries informative and representative samples, it requires the unlabeled data to meet the semi-supervised assumption, which may limit its applicability. As to the single-criterion methods, our method performs consistently better than them during the whole active learning process.

The experimental results indicate that our proposed approach to directly measure the representativeness is simple but effective and comprehensive. Simultaneously, the proposed approach to measure the informativeness also contributes to the performance. By combining them, we can select the suitable samples for classification tasks.
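The comparison protocol described above (ten runs, random 60%/40% splits, and paired t-tests at the 95% level) can be sketched as follows; run_method_a and run_method_b are placeholders for complete active learning runs such as the loop sketched after Algorithm 1, and the split and seeding details are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split

def compare_methods(X, y, run_method_a, run_method_b, n_runs=10, seed=0):
    """Ten random 60%/40% splits; each method returns a final test accuracy per
    run, and a paired t-test at the 95% level decides win/tie/loss for method a."""
    acc_a, acc_b = [], []
    for r in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.6, random_state=seed + r)
        acc_a.append(run_method_a(X_tr, y_tr, X_te, y_te))
        acc_b.append(run_method_b(X_tr, y_tr, X_te, y_te))
    t_stat, p_value = ttest_rel(acc_a, acc_b)
    if p_value >= 0.05:
        return "tie"
    return "win" if np.mean(acc_a) > np.mean(acc_b) else "loss"
```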
IV. DISCUSSION AND ANALYSIS

In our proposed method, there is a tradeoff parameter β between the informative part and the representative part in the optimization objective. In our experiments, we choose its value from the candidate set {1, 2, 10, 100, 1000}. We conduct this parameter analysis on three UCI benchmark datasets: 1) breast cancer; 2) balance; and 3) semeion handwritten digit. The other parameter settings are the same as before.

The performances with different β values are shown in Fig. 3. From these results, we can observe that the sensitivity to β differs across the three benchmark datasets. The performance on the breast cancer dataset is more sensitive to β than on the semeion handwritten digit and balance datasets. We can observe that a smaller β works better on the breast cancer dataset, while a larger β works better on the semeion handwritten digit and balance datasets. In other words, the representativeness is more important for the breast cancer set, while the informativeness can better mine the data information for balance and semeion handwritten digit. The reason may be that the breast cancer data are distributed more densely, so the representativeness can help boost the active learning. Meanwhile, the breast cancer set has just two classes, so for each sample there are only two probabilities with which to measure the informativeness, and this information may not be enough. Several existing studies show that the representativeness is more useful when there are no or very few labeled data [11], [46], [49], [62]. However, for semeion handwritten digit and balance, the informativeness may be dominant. This is because these data are loosely distributed. Moreover, in our method the posterior probability is adopted, and based on it we designed a new uncertainty measurement that is more suitable for measuring the uncertainty of a sample. The position measure is combined into the uncertainty part, which effectively prevents query-sample bias. Besides, these two datasets are multiclass, so the probability information is enough to measure the amount of informativeness of a sample. This may be the reason why semeion handwritten digit and balance perform better when β is larger. From the analysis above, we can infer that in our proposed method a small β may be preferred when the dataset has just two classes, and a large β may be recommended when the dataset has multiple classes.

Fig. 3. Performance of our method with different tradeoff values on three UCI benchmark datasets. Each curve represents the average result of ten runs.

Both informativeness and representativeness are significant for active learning [46], [49]. Since active learning iteratively selects the most informative samples, and it is hard to decide which criterion is more important at each iteration, our framework provides an easy way to naturally obtain the important information in the active learning process. Meanwhile, the framework provides a principled method for designing an active learning method. Through the sensitivity analysis, we can see that the distribution of the dataset impacts which information is important in the active learning process. Hence, we can design a more practical active learning algorithm according to the data distribution of our specific classification tasks to achieve good performance.
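The sensitivity study in Fig. 3 amounts to rerunning the proposed method once per candidate tradeoff value. A trivial sketch is given below, where run_active_learning is a hypothetical wrapper around a full run of Algorithm 1 followed by evaluation on the test split; only the candidate set itself is taken from the text above.

```python
BETA_CANDIDATES = (1, 2, 10, 100, 1000)   # candidate values examined in Section IV

def beta_sweep(X_lab, y_lab, X_unlab, X_test, y_test, run_active_learning):
    """Rerun the proposed method with each tradeoff value and record the
    resulting test accuracy, as in the Fig. 3 sensitivity analysis."""
    results = {}
    for beta in BETA_CANDIDATES:
        results[beta] = run_active_learning(
            X_lab.copy(), y_lab.copy(), X_unlab.copy(), X_test, y_test, beta=beta)
    return results   # e.g. {1: 0.87, 2: 0.88, ...}
```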
V. CONCLUSION

In this paper, a general active learning framework is proposed that queries informative and representative samples; it provides a systematic and direct way to measure and combine the informativeness and representativeness. Based on this framework, a novel efficient active learning algorithm is devised, which uses a modified BvSB strategy to generate the informativeness measure and an RBF with the estimated probabilities to construct the representativeness measure, respectively. The extensive experimental results on 15 benchmark datasets corroborate that our algorithm outperforms the state-of-the-art active learning algorithms. In the future, we plan to develop more principles for measuring the representativeness and informativeness that comply with specific data structures or distributions, such that more practical and specialized active learning algorithms can be produced.

ACKNOWLEDGMENT

The authors would like to thank the handling editor and the anonymous reviewers for their careful reading and helpful remarks, which have contributed to improving the quality of this paper.

APPENDIX A

Proof: According to [56], the proof of Theorem 1 requires us to check two conditions. The first one is

$$\forall \theta > 0, \qquad \lim_{n\to\infty} k_n^{-2}\sum_{i=2}^{n} E\left\{Y_{ni}^2\, I\big(|Y_{ni}| > \theta k_n\big)\right\} = 0$$

where $Y_{ni} = \sum_{j=1}^{i-1} H_n(X_i, X_j)$ and $k_n^2 = E(U_n^2)$. The second condition is

$$\lim_{n\to\infty} k_n^{-2} V_n^2 = 1$$

in probability, where $V_n^2 = \sum_{i=2}^{n} E\{Y_{ni}^2 \mid X_1, \ldots, X_{i-1}\}$. From the two conditions, it follows that $k_n^{-1} U_n$ is asymptotically normal N(0, 1). Since

$$E\left(Y_{ni}^2\right) = \sum_{j=1}^{i-1}\sum_{k=1}^{i-1} E\left\{H_n(X_i, X_j)\, H_n(X_i, X_k)\right\}$$

where $2 \le i \le n$, we have $k_n^2 = \sum_{i=2}^{n} E(Y_{ni}^2)$. Furthermore,

$$E\{H_n(X_1, X_2) H_n(X_1, X_3) H_n(X_1, X_4) H_n(X_1, X_5)\} = E\{H_n(X_1, X_2) H_n^3(X_1, X_3)\} = 0$$

and so

$$E\left(Y_{ni}^4\right) = \sum_{j=1}^{i-1} E\left\{H_n^4(X_i, X_j)\right\} + 3\sum_{1\le j, k\le i-1,\, j\ne k} E\left\{H_n^2(X_i, X_j)\, H_n^2(X_i, X_k)\right\}.$$

Hence

$$\sum_{i=2}^{n} E\left(Y_{ni}^4\right) \le \mathrm{const}\left\{n^2 E\left[H_n^4(X_1, X_2)\right] + n^3 E\left[H_n^2(X_1, X_2)\, H_n^2(X_1, X_3)\right]\right\} \le \mathrm{const}\; n^3 E\left\{H_n^4(X_1, X_2)\right\}.$$

It now follows from Theorem 1 that

$$\lim_{n\to\infty}\sum_{i=2}^{n} E\left(Y_{ni}^4\right) = 0$$

which implies the first condition. We also observe that

$$v_{ni} \equiv E\left\{Y_{ni}^2 \mid X_1, \ldots, X_{i-1}\right\} = \sum_{j=1}^{i-1}\sum_{k=1}^{i-1} G_n(X_j, X_k) = 2\sum_{1\le j<k\le i-1} G_n(X_j, X_k) + \sum_{j=1}^{i-1} G_n(X_j, X_j).$$

With the results in [56] under the situations $j_1 \le k_1$ and $j_2 \le k_2$, we can obtain

$$E\left(V_n^4\right) = 2\sum_{2\le i<j\le n} E\left\{v_{ni}\, v_{nj}\right\} + \sum_{i=2}^{n} E\left\{v_{ni}^2\right\}.$$

Therefore

$$E\left(V_n^2 - k_n^2\right)^2 \le \mathrm{const}\left\{n^4 E\left[G_n^2(X_1, X_2)\right] + n^3 E\left[G_n^2(X_1, X_1)\right]\right\} \le \mathrm{const}\left\{n^4 E\left[G_n^2(X_1, X_2)\right] + n^3 E\left[G_n^2(X_1, X_2)\right]\right\}.$$

It now follows from Theorem 1 that

$$k_n^{-4} E\left(V_n^2 - k_n^2\right)^2 \to 0$$

which proves the second condition.

APPENDIX B

Proof: Following [57], we know that the two-sample discrepancy problem is used to examine $H_{\sigma_1\sigma_2}$ in (2) under the hypothesis $f_1 = f_2$, and the objective is to minimize $H_{\sigma_1\sigma_2}$. The estimators of $f_1$ and $f_2$ are defined as in Theorem 2. For convenience, we assume $\sigma_1 = \sigma_2 = \sigma$; hence, our test is directly on $H_{\sigma_1\sigma_2} = H_\sigma$. In order to assess the power of a test based on $H_\sigma$, its performance against a local alternative hypothesis should be ascertained. To this end, let $f_1 = f$ be the fixed density function, and let $g$ be a function such that $f_2 = f + \varepsilon g$ is a density for all sufficiently small $|\varepsilon|$. Let $h_\sigma$ be the $\alpha$-level critical point of the distribution of $H_\sigma$ under the null hypothesis $H_0$ that $\varepsilon = 0$:

$$P_{H_0}(H_\sigma > h_\sigma) = \alpha.$$

Obviously, if $H_\sigma > h_\sigma$, $H_0$ is rejected. Therefore, we claim that

$$\lim_{n_1, n_2 \to \infty} \alpha = 0$$

which is necessary if $\hat{f}_j$ is to consistently estimate $f_j$. Then, the minimum distance that can be discriminated between $f_1$ and $f_2$ is $\varepsilon = n^{-1/2}\sigma^{-p/2}$. This claim can be formalized as follows. Let $H_1 = H_1(a)$ $(a \ne 0)$ be the alternative hypothesis that $\varepsilon = n^{-1/2}\sigma^{-p/2}\, a$, and define

$$\Pi(a) = \lim_{n\to\infty} P_{H_1}(H_\sigma > h_\sigma).$$

Such a limit is well defined, $\alpha < \Pi(a) < 1$ for $0 < |a| < \infty$, and $\Pi(a) \to 1$ as $|a| \to \infty$. This can be verified as follows. First, we can observe that

$$H_\sigma = \int\left[\hat{f}_1 - \hat{f}_2 - E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big)\right]^2 + 2\int\left[\hat{f}_1 - \hat{f}_2 - E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big)\right] E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big) + \int\left[E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big)\right]^2$$

and $\int\{E_{H_1}(\hat{f}_1 - \hat{f}_2)\}^2 \sim \varepsilon^2 \int g^2$. Arguing as in [56], if $0 < \lim_{n_1, n_2\to\infty} n_1/n_2 < \infty$, $\sigma \to 0$, and $n\sigma^p \to \infty$, then under $H_1$

$$n\sigma^{p/2}\left\{\int\left[\hat{f}_1 - \hat{f}_2 - E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big)\right]^2 - \kappa_1\left(n_1^{-1} + n_2^{-1}\right)\sigma^{-1}\right\}$$

and

$$n^{1/2}\varepsilon^{-1}\int\left[\hat{f}_1 - \hat{f}_2 - E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big)\right] E_{H_1}\big(\hat{f}_1 - \hat{f}_2\big)$$

are asymptotically independent and normally distributed with zero means and finite, nonzero variances, the latter not depending on $a$, where $\kappa_1 = \int K^2$. Hence, if $\varepsilon = n^{-1/2}\sigma^{-p/2}\, a$, then under $H_1$

$$n\sigma^{p/2}\left[H_\sigma - \kappa_1\left(n_1^{-1} + n_2^{-1}\right)\sigma^{-1} - \{1 + o(1)\}\,\varepsilon^2\int g^2\right]$$

is asymptotically normally distributed with zero mean and finite, nonzero variance, the latter being an increasing function of $a$. Thus, the claims made about $\Pi(a)$ follow directly from this result.
REFERENCES

[1] M. Yu, L. Liu, and L. Shao, "Structure-preserving binary representations for RGB-D action recognition," IEEE Trans. Pattern Anal. Mach. Intell., doi: 10.1109/TPAMI.2015.2491925.
[2] Q. Zhu, J. Mai, and L. Shao, "A fast single image haze removal algorithm using color attenuation prior," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3522–3533, Nov. 2015.
[3] T. Liu and D. Tao, "Classification with noisy labels by importance reweighting," IEEE Trans. Pattern Anal. Mach. Intell., doi: 10.1109/TPAMI.2015.2456899.
[4] C. Xu, D. Tao, and C. Xu, "Multi-view intact space learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 12, pp. 2531–2544, Dec. 2015.
[5] D. Tao, J. Cheng, M. Song, and X. Lin, "Manifold ranking-based matrix factorization for saliency detection," IEEE Trans. Neural Netw. Learn. Syst., doi: 10.1109/TNNLS.2015.2461554.
[6] D. Tao, X. Lin, L. Jin, and X. Li, "Principal component 2-D long short-term memory for font recognition on single Chinese characters," IEEE Trans. Cybern., doi: 10.1109/TCYB.2015.2414920.
[7] L. Liu, M. Yu, and L. Shao, "Unsupervised local feature hashing for image similarity search," IEEE Trans. Cybern., doi: 10.1109/TCYB.2015.2480966.
[8] F. Zhu and L. Shao, "Weakly-supervised cross-domain dictionary learning for visual recognition," Int. J. Comput. Vision, vol. 109, nos. 1–2, pp. 42–59, 2014.
[9] L. Liu, M. Yu, and L. Shao, "Multiview alignment hashing for efficient image search," IEEE Trans. Image Process., vol. 24, no. 3, pp. 956–966, 2015.
[10] D. Tao, L. Jin, W. Liu, and X. Li, "Hessian regularized support vector machines for mobile image annotation on the cloud," IEEE Trans. Multimedia, vol. 15, no. 4, pp. 833–844, 2013.
[11] B. Settles, "Active learning literature survey," Dept. Comput. Sci., Univ. Wisconsin–Madison, Madison, WI, USA, Tech. Rep. 1648, 2009.
[12] X. He, "Laplacian regularized D-optimal design for active learning and its application to image retrieval," IEEE Trans. Image Process., vol. 19, no. 1, pp. 254–263, Jan. 2010.
[13] L. Zhang et al., "Active learning based on locally linear reconstruction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 2026–2038, Oct. 2011.
[14] R. Chattopadhyay, W. Fan, I. Davidson, S. Panchanathan, and J. Ye, "Joint transfer and batch-mode active learning," in Proc. ICML, Atlanta, GA, USA, 2013, pp. 253–261.
[15] J. Attenberg and F. Provost, "Online active inference and learning," in Proc. ACM SIGKDD, San Diego, CA, USA, 2011, pp. 186–194.
[16] M. Fang, J. Yin, and X. Zhu, "Knowledge transfer for multi-labeler active learning," in Proc. ECML-PKDD, Prague, Czech Republic, 2013, pp. 273–288.
[17] J. Wang, N. Srebro, and J. Evans, "Active collaborative permutation learning," in Proc. ACM SIGKDD, New York, NY, USA, 2014, pp. 502–511.
[18] N. García-Pedrajas, J. Pérez-Rodríguez, and A. de Haro-García, "OligoIS: Scalable instance selection for class-imbalanced data sets," IEEE Trans. Cybern., vol. 43, no. 1, pp. 332–346, Feb. 2013.
[19] W. Hu, W. Hu, N. Xie, and S. Maybank, "Unsupervised active learning based on hierarchical graph-theoretic clustering," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 5, pp. 1147–1161, Oct. 2009.
[20] W. Nam and R. Alur, "Active learning of plans for safety and reachability goals with partial observability," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 2, pp. 412–420, Apr. 2010.
[21] J. Zhang, X. Wu, and V. S. Sheng, "Active learning with imbalanced multiple noisy labeling," IEEE Trans. Cybern., vol. 45, no. 5, pp. 1081–1093, Aug. 2015.
[22] X. Zhu, P. Zhang, Y. Shi, and X. Lin, "Active learning from stream data using optimal weight classifier ensemble," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 6, pp. 1607–1621, Dec. 2010.
[23] O. M. Aodha, N. D. F. Campbell, J. Kautz, and G. J. Brostow, "Hierarchical subquery evaluation for active learning on a graph," in Proc. CVPR, Columbus, OH, USA, 2014, pp. 564–571.
[24] E. Elhamifar, G. Sapiro, A. Yang, and S. S. Sastry, "A convex optimization framework for active learning," in Proc. ICCV, Sydney, NSW, Australia, 2013, pp. 209–216.
[25] Y. Fu, X. Zhu, and A. K. Elmagarmid, "Active learning with optimal instance subset selection," IEEE Trans. Cybern., vol. 43, no. 2, pp. 464–475, Apr. 2013.
[26] M. Zuluaga, G. Sergent, A. Krause, and M. Püschel, "Active learning for multi-objective optimization," in Proc. ICML, Atlanta, GA, USA, 2013, pp. 462–470.
[27] A. Gonen, S. Sabato, and S. Shalev-Shwartz, "Efficient active learning of halfspaces: An aggressive approach," in Proc. ICML, Atlanta, GA, USA, 2013, pp. 480–488.
[28] S. J. Huang, R. Jin, and Z. H. Zhou, "Active learning by querying informative and representative examples," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 10, pp. 1936–1949, Oct. 2014.
[29] X. Lu, Y. Wang, and Y. Yuan, "Sparse coding from a Bayesian perspective," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 929–939, Jun. 2013.
[30] X. Lu, H. Wu, and Y. Yuan, "Double constrained NMF for hyperspectral unmixing," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2746–2758, May 2014.
[31] X. Lu, Y. Wang, and Y. Yuan, "Graph-regularized low-rank representation for destriping of hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 7, pp. 4009–4018, Jul. 2013.
[32] X. Lu, Y. Yuan, and P. Yan, "Image super-resolution via double sparsity regularized manifold learning," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 12, pp. 2022–2033, Dec. 2013.
[33] X. Lu, H. Wu, Y. Yuan, P. Yan, and X. Li, "Manifold regularized sparse NMF for hyperspectral unmixing," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 5, pp. 2815–2826, May 2013.
[34] D. Vasisht, A. Damianou, M. Varma, and A. Kapoor, "Active learning for sparse Bayesian multilabel classification," in Proc. ACM SIGKDD, New York, NY, USA, 2014, pp. 472–481.
[35] P. Melville and R. J. Mooney, "Diverse ensembles for active learning," in Proc. 21st Int. Conf. Mach. Learn., Banff, AB, Canada, 2004, p. 74.
[36] I. Guyon, G. Cawley, V. Lemaire, and G. Dror, "Results of the active learning challenge," J. Mach. Learn. Res., vol. 16, pp. 19–45, 2011.
[37] M. Wang and X.-S. Hua, "Active learning in multimedia annotation and retrieval: A survey," ACM Trans. Intell. Syst. Technol., vol. 2, no. 2, pp. 1–21, 2011.
[38] Y. Chen and A. Krause, "Near-optimal batch mode active learning and adaptive submodular optimization," J. Mach. Learn. Res., vol. 28, no. 1, pp. 160–168, 2013.
[39] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," J. Mach. Learn. Res., vol. 2, pp. 45–66, 2001.
[40] J. Zhu and M. Ma, "Uncertainty-based active learning with instability estimation for text classification," ACM Trans. Speech Lang. Process., vol. 8, no. 4, pp. 1–21, 2012.
[41] M. Park and J. W. Pillow, "Bayesian active learning with localized priors for fast receptive field characterization," in Proc. NIPS, Nevada City, CA, USA, 2012, pp. 2357–2365.
[42] W. Luo, A. Schwing, and R. Urtasun, "Latent structured active learning," in Proc. NIPS, Lake Tahoe, CA, USA, 2013, pp. 728–736.
[43] N. V. Cuong, W. S. Lee, N. Ye, K. M. A. Chai, and H. L. Chieu, "Active learning for probabilistic hypotheses using the maximum Gibbs error criterion," in Proc. NIPS, Lake Tahoe, CA, USA, 2013, pp. 1457–1465.
[44] X. Li, D. Kuang, and C. X. Ling, "Active learning for hierarchical text classification," in Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2012, pp. 14–25.
[45] I. Dino, B. Albert, Z. Indre, and P. Bernhard, "Clustering based active learning for evolving data streams," in Discovery Science. Berlin, Germany: Springer, 2013, pp. 79–93.
[46] R. Chattopadhyay et al., "Batch mode active sampling based on marginal probability distribution matching," in Proc. ACM SIGKDD, Beijing, China, 2012, pp. 741–749.
[47] Y. Fu, B. Li, X. Zhu, and C. Zhang, "Active learning without knowing individual instance labels: A pairwise label homogeneity query approach," IEEE Trans. Knowl. Data Eng., vol. 26, no. 4, pp. 808–822, Apr. 2014.
[48] T. Reitmaier, A. Calma, and B. Sick, "Transductive active learning—A new semi-supervised learning approach based on iteratively refined generative models to capture structure in data," Inf. Sci., vol. 293, pp. 275–298, Feb. 2015.
[49] Z. Wang and J. Ye, "Querying discriminative and representative samples for batch mode active learning," in Proc. ACM SIGKDD, San Jose, CA, USA, 2013, pp. 158–166.
[50] M. Li, R. Wang, and K. Tang, "Combining semi-supervised and active learning for hyperspectral image classification," in Proc. CIDM, Singapore, 2013, pp. 89–94.
[51] M. F. A. Hady and F. Schwenker, "Combining committee-based semi-supervised learning and active learning," J. Comput. Sci. Technol., vol. 25, no. 4, pp. 681–698, 2010.
Bo Du (M'10–SM'15) received the B.S. and Ph.D. degrees in photogrammetry and remote sensing from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China, in 2005 and 2010, respectively.
He is currently an Associate Professor with the School of Computer, Wuhan University. He has over 40 research papers published in the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (TGRS), the IEEE TRANSACTIONS ON IMAGE PROCESSING (TIP), the IEEE JOURNAL OF SELECTED TOPICS IN EARTH OBSERVATIONS AND APPLIED REMOTE SENSING (JSTARS), and the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS (GRSL). His current research interests include pattern recognition, hyperspectral image processing, and signal processing.
Dr. Du was a recipient of the Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society (GRSS) for his service to the JSTARS in 2011 and the ACM Rising Star Award for his academic progress in 2015. He was the Session Chair of the 4th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing. He also serves as a Reviewer for 20 Science Citation Index journals, including the IEEE TGRS, TIP, JSTARS, and GRSL.

Zengmao Wang received the B.S. degree in project of surveying and mapping from Central South University, Changsha, China, in 2013, and the M.S. degree from the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China, in 2015, where he is currently pursuing the Ph.D. degree with the School of Computer.
His current research interests include data mining and machine learning.

Lefei Zhang (S'11–M'14) received the B.S. and Ph.D. degrees from Wuhan University, Wuhan, China, in 2008 and 2013, respectively.
From 2013 to 2015, he was a Post-Doctoral Researcher with the School of Computer, Wuhan University, and a Visiting Scholar with the CAD and CG Laboratory, Zhejiang University, Hangzhou, China, in 2015. He is currently a Lecturer with the School of Computer, Wuhan University, and also a Hong Kong Scholar with the Department of Computing, Hong Kong Polytechnic University, Hong Kong. His current research interests include pattern recognition, image processing, and remote sensing.
Dr. Zhang is a Reviewer for over 20 international journals, including the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON MULTIMEDIA, and the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.

Liangpei Zhang (M'06–SM'08) received the B.S. degree in physics from Hunan Normal University, Changsha, China, in 1982, the M.S. degree in optics from the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, in 1988, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1998.
He is currently the Head of the Remote Sensing Division, State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University. He is also a Chang-Jiang Scholar Chair Professor appointed by the Ministry of Education of China, and a Principal Scientist for the China State Key Basic Research Project from 2011 to 2016, appointed by the Ministry of National Science and Technology of China to lead the Remote Sensing Program in China. He has over 450 research papers and five books, and he holds 15 patents. His current research interests include hyperspectral remote sensing, high-resolution remote sensing, image processing, and artificial intelligence.
Dr. Zhang is the founding chair of the IEEE Geoscience and Remote Sensing Society (GRSS) Wuhan Chapter. He received the best reviewer awards from the IEEE GRSS for his service to the IEEE Journal of Selected Topics in Earth Observations and Applied Remote Sensing (JSTARS) in 2012 and the IEEE Geoscience and Remote Sensing Letters (GRSL) in 2014. He was the General Chair for the 4th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) and a guest editor of JSTARS. His research teams won the top three prizes of the IEEE GRSS 2014 Data Fusion Contest, and his students have been selected as winners or finalists of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) student paper contest in recent years. He is a Fellow of the Institution of Engineering and Technology (IET), an executive member (board of governors) of the China National Committee of the International Geosphere–Biosphere Programme, and an executive member of the China Society of Image and Graphics. He was a recipient of the 2010 Best Paper Boeing Award and the 2013 Best Paper ERDAS Award from the American Society for Photogrammetry and Remote Sensing (ASPRS). He regularly serves as a co-chair of the series of SPIE Conferences on Multispectral Image Processing and Pattern Recognition, the Conference on Asia Remote Sensing, and many other conferences, and he edits several conference proceedings, issues, and geoinformatics symposiums. He also serves as an associate editor of the International Journal of Ambient Computing and Intelligence, the International Journal of Image and Graphics, the International Journal of Digital Multimedia Broadcasting, the Journal of Geo-spatial Information Science, and the Journal of Remote Sensing, and as a guest editor of the Journal of Applied Remote Sensing and the Journal of Sensors. He is currently serving as an associate editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.