
Active Learning with Sampling by Uncertainty and Density for Word Sense Disambiguation and Text Classification


Jingbo Zhu, Huizhen Wang, Tianshun Yao
Natural Language Processing Laboratory
Northeastern University
Shenyang, Liaoning, P.R. China 110004
zhujingbo@mail.neu.edu.cn
wanghuizhen@mail.neu.edu.cn

Benjamin K. Tsou
Language Information Sciences Research Centre
City University of Hong Kong
HK, P.R. China
rlbtsou@cityu.edu.hk

Abstract

This paper addresses two issues in active learning. Firstly, to solve the problem that uncertainty sampling often fails by selecting outliers, it presents a new selective sampling technique, sampling by uncertainty and density (SUD), in which a K-Nearest-Neighbor-based density measure is adopted to determine whether an unlabeled example is an outlier. Secondly, a technique of sampling by clustering (SBC) is applied to build a representative initial training data set for active learning. Finally, we implement a new active learning algorithm that combines the SUD and SBC techniques. Experimental results on three real-world data sets show that our method outperforms competing methods, particularly at the early stages of active learning.

1 Introduction

Creating a large labeled training corpus is expensive and time-consuming in some real-world applications (e.g. word sense annotation), and is often a bottleneck to building a supervised classifier for a new application or domain. Our study aims to minimize the amount of human labeling effort required for a supervised classifier (e.g. for automated word sense disambiguation) to achieve satisfactory performance, by using active learning.

Among the techniques proposed to solve this knowledge bottleneck problem, active learning is a widely used framework in which the learner has the ability to automatically select the most informative unlabeled examples for human annotation; this ability is referred to as selective sampling. Uncertainty sampling (Lewis and Gale, 1994) is a popular selective sampling technique, and has been widely studied in natural language processing (NLP) applications such as word sense disambiguation (WSD) (Chen et al., 2006; Chan and Ng, 2007), text classification (TC) (Lewis and Gale, 1994; Zhu et al., 2008), statistical syntactic parsing (Tang et al., 2002), and named entity recognition (Shen et al., 2004).

The motivation behind uncertainty sampling is to find unlabeled examples near decision boundaries and use them to clarify the position of those boundaries. However, uncertainty sampling often fails by selecting outliers (Roy and McCallum, 2001; Tang et al., 2002): such selected unlabeled examples have high uncertainty but cannot provide much help to the learner. To solve this outlier problem, we propose in this paper a new method, sampling by uncertainty and density (SUD), in which a K-Nearest-Neighbor-based density (KNN-density) measure is used to determine whether an unlabeled example is an outlier, and a combination strategy based on the KNN-density and uncertainty measures is designed to select the most informative unlabeled examples for human annotation at each learning iteration.

The second effort we make is to study how to build a representative initial training data set for active learning; we believe that a more representative initial training data set is very helpful to the active learner. In previous studies on active learning, the initial training data set is generally generated at random, on the assumption that random sampling is likely to build an initial training set with the same prior data distribution as that of the whole corpus. However, this seldom holds in real-world applications, because the initial training set used is small.

In this paper, we utilize an approach, sampling by clustering (SBC), to select the most representative examples to form the initial training data set for active learning. To do so, the whole unlabeled corpus is first clustered into a predefined number of clusters (i.e. the predefined size of the initial training data set), and the example closest to the centroid of each cluster, considered the most representative case, is selected to augment the initial training data set.

Finally, we describe an implementation of active learning with the SUD and SBC techniques. Experimental results of active learning for WSD and TC tasks show that our proposed method outperforms competing methods, particularly at the early stages of the active learning process. It is noteworthy that the proposed techniques are easy to implement, and can easily be applied to several learners, such as Maximum Entropy (ME), naïve Bayes (NB) and Support Vector Machines (SVMs).

2 Active Learning Process

In this work, we are interested in uncertainty sampling (Lewis and Gale, 1994) for pool-based active learning, in which the unlabeled example x with maximum uncertainty is selected for human annotation at each learning cycle. Maximum uncertainty implies that the current classifier (i.e. the learner) has the least confidence in its classification of that unlabeled example.

Active learning is actually a two-stage process: a small number of labeled samples and a large number of unlabeled examples are first collected in the initialization stage, and then a closed-loop stage of querying and retraining is adopted.
Procedure: Active Learning Process
Input: small initial training set L, and pool of unlabeled data U
Use L to train the initial classifier C
Repeat
  1. Use the current classifier C to label all unlabeled examples in U
  2. Use the uncertainty sampling technique to select the m [2] most informative unlabeled examples, and ask the oracle H for their labels
  3. Augment L with these m new examples, and remove them from U
  4. Use L to retrain the current classifier C
Until the predefined stopping criterion SC is met.
Figure 1. Active learning with the uncertainty sampling technique
[2] A batch-based sample selection labels the top-m most informative unlabeled examples at each learning cycle, to decrease the number of times the learner is retrained.
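To make the loop above concrete, the following is a minimal Python sketch of the Figure 1 procedure. It is our illustration rather than the implementation used in the experiments: it assumes a scikit-learn-style classifier with fit/predict_proba, a hypothetical oracle_label function standing in for the human annotator H, an uncertainty function score_fn such as the entropy measure of Section 3.1, and a fixed query budget standing in for the stopping criterion SC.

import numpy as np

def active_learn(clf, L_X, L_y, U_X, score_fn, oracle_label, m=10, max_rounds=30):
    # Pool-based active learning loop of Figure 1 (illustrative sketch).
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    clf.fit(np.array(L_X), np.array(L_y))              # train initial classifier C
    for _ in range(max_rounds):                        # stand-in for stopping criterion SC
        if not U_X:
            break
        probs = clf.predict_proba(np.array(U_X))       # step 1: label all examples in U
        scores = np.array([score_fn(p) for p in probs])
        top = np.argsort(-scores)[:m]                  # step 2: m most informative examples
        for i in sorted(top.tolist(), reverse=True):   # steps 2-3: query oracle, move U -> L
            L_X.append(U_X.pop(i))
            L_y.append(oracle_label(L_X[-1]))
        clf.fit(np.array(L_X), np.array(L_y))          # step 4: retrain C
    return clf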
3 Uncertainty Measures

In real-world applications, only a training sample set of limited size can be provided for training a supervised classifier. Because of the manual effort involved, this raises a considerable issue: what is the best subset of examples to annotate? In the uncertainty sampling scheme, the unlabeled example with maximum uncertainty is viewed as the most informative case, so the key point of uncertainty sampling is how to measure the uncertainty of an unlabeled example x.

3.1 Entropy Measure

The well-known entropy is a popular uncertainty measure widely used in previous studies on active learning (Tang et al., 2002; Chen et al., 2006; Zhu and Hovy, 2007):

  H(x) = − Σ_{y∈Y} P(y|x) log P(y|x)    (1)

where P(y|x) is the a posteriori probability and y ranges over the output classes Y = {y1, y2, …, yk}. H is the uncertainty measurement function based on the entropy estimate of the classifier's posterior distribution.

In the following comparison experiments, uncertainty sampling based on this entropy criterion is considered the baseline method, also called traditional uncertainty sampling.
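Equation (1) is straightforward to compute from a classifier's posterior distribution; the short sketch below is our illustration, with the usual convention that 0·log 0 = 0.

import numpy as np

def entropy_uncertainty(posterior):
    # H(x) = -sum_{y in Y} P(y|x) log P(y|x), Equation (1)
    p = np.asarray(posterior, dtype=float)
    p = p[p > 0.0]                      # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())

# A near-uniform posterior is maximally uncertain, e.g.
# entropy_uncertainty([0.5, 0.5]) > entropy_uncertainty([0.9, 0.1])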

3.2 Density*Entropy Measure

To analyze the outlier problem of traditional uncertainty sampling, we first give an example to explain our motivation.

Figure 2. An example of two points A and B with maximum uncertainty at the i-th learning iteration

As mentioned in Section 1, the motivation behind uncertainty sampling is to find unlabeled examples near decision boundaries, on the assumption that these examples have maximum uncertainty. Fig. 2 shows two unlabeled examples A and B with maximum uncertainty at the i-th learning cycle. Roughly speaking, there are three unlabeled examples near or similar to B, but none near A. We therefore consider example B more representative than example A, whereas A is likely to be an outlier, and we expect that adding B to the training set will help the learner more than adding A.

The motivation of our study is that we prefer not only the most informative example in terms of an uncertainty measure, but also the most representative example in terms of a density measure. The density of an example can be evaluated by how many examples are similar or near to it; an example with a high density degree is less likely to be an outlier.

Because the unlabeled corpus can be very large in most real-world applications, Tang et al. (2002) and Shen et al. (2004) evaluated the density of an example within a cluster. Unlike their work [3], we adopt a new approach, the K-Nearest-Neighbor-based density (KNN-density) measure, to evaluate the density of an unlabeled example x. Given the set S(x) = {s1, s2, …, sK} of the K most similar examples to x (K = 20 in our experiments), the KNN-density DS(.) of example x is defined as:

  DS(x) = (Σ_{si∈S(x)} cos(x, si)) / K    (2)

As discussed above, we prefer to select examples with maximum uncertainty and highest density for human annotation, since obtaining their labels can help the learner greatly. To do so, we propose a new method, sampling by uncertainty and density (SUD), in which the entropy-based uncertainty measure and the KNN-density measure are considered simultaneously.

In the SUD scheme, a new uncertainty measure, called the density*entropy measure [4], is defined as:

  DSH(x) = DS(x) × H(x)    (3)

[3] We also tried their cluster-based density measure, but performance essentially degraded.
[4] We also tried other forms such as a λ·DS(x) + (1−λ)·H(x) measure used in previous studies, but its behavior seems erratic; it is actually very difficult to determine an appropriate λ value for a specific task.
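Equations (2) and (3) can be computed directly over the pool from pairwise cosine similarities. The sketch below is our illustration, assuming dense feature vectors, a pool larger than K (K = 20 as in our experiments), and the entropy_uncertainty helper sketched in Section 3.1.

import numpy as np

def knn_density(X, K=20):
    # DS(x), Eq. (2): mean cosine similarity to the K most similar pool examples.
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    sim = Xn @ Xn.T                         # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    return np.sort(sim, axis=1)[:, -K:].mean(axis=1)

def sud_scores(X, posteriors, K=20):
    # DSH(x) = DS(x) * H(x), Eq. (3): prefer uncertain AND dense examples.
    H = np.array([entropy_uncertainty(p) for p in posteriors])
    return knn_density(X, K) * H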
4 Initial Training Set Generation

As shown in Fig. 1, only a small number of training samples are provided at the beginning of the active learning process. In previous studies on active learning, the initial training set is generally generated by random sampling from the whole unlabeled corpus. However, random sampling cannot guarantee a most representative subset, because the initial training set is generally very small (e.g. 10 examples). We believe that selecting representative examples to form the initial training set can help the active learner.

In this section we utilize an approach, sampling by clustering (SBC), to select the most representative examples to form the initial training data set. In the SBC scheme, the whole unlabeled corpus is first clustered into a predefined number of clusters (i.e. the predefined size of the initial training set), and the example closest to the centroid of each cluster, viewed as the most representative case, is selected to augment the initial training set.

We use the K-means clustering algorithm (Duda and Hart, 1973) to cluster the examples in the whole unlabeled corpus. In the K-means algorithm below, the traditional cosine measure is adopted to estimate the similarity between two examples:

  cos(wi, wj) = (wi · wj) / (|wi| |wj|)    (4)

where wi and wj are the feature vectors of examples i and j.

To summarize the SBC-based initial training set generation algorithm, let U = {U1, U2, …, UN} be the set of unlabeled examples to be clustered, and k be the predefined size of the initial training data set. In other words, the SBC technique selects the k most representative unlabeled examples from U to generate the initial training data set. The procedure is summarized as follows:

SBC-based Initial Training Set Generation
Input: U, k
Phase 1: Cluster the corpus U into k clusters Ψj (j = 1, …, k) using the K-means clustering algorithm:
  1. Initialization. Randomly choose k examples as the centroids φj (j = 1, …, k) of the initial clusters Ψj (j = 1, …, k).
  2. Re-partition {U1, U2, …, UN} into k clusters Ψj (j = 1, …, k), where Ψj = {Ui : cos(Ui, φj) ≥ cos(Ui, φt), t ≠ j}.
  3. Re-estimate the centroid φj of each cluster Ψj: φj = (1/m) Σ_{Ui∈Ψj} Ui, where m is the size of Ψj.
  4. Repeat Step 2 and Step 3 until the algorithm converges.

Phase 2: Select the example uj closest to the centroid φj of each cluster Ψj to augment the initial training data set Ω, where
  Ω = {uj : cos(uj, φj) ≥ cos(Ui, φj), uj ≠ Ui, j ∈ [1, k]}
Return Ω.
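The two phases amount to a cosine (spherical) K-means followed by nearest-to-centroid selection. The following compact sketch is our reading of the procedure, not the original implementation; it renormalizes centroids to unit length, which leaves the cosine rankings used in Step 2 and Phase 2 unchanged.

import numpy as np

def sbc_initial_set(U, k, n_iter=100, seed=0):
    # SBC sketch: cosine K-means over the pool U (an n x d array), then pick
    # the example closest to each centroid (Phase 2) as the initial training set.
    rng = np.random.default_rng(seed)
    Un = U / np.clip(np.linalg.norm(U, axis=1, keepdims=True), 1e-12, None)
    centroids = Un[rng.choice(len(Un), size=k, replace=False)]      # Step 1
    for _ in range(n_iter):
        assign = (Un @ centroids.T).argmax(axis=1)                  # Step 2: re-partition
        new = np.vstack([Un[assign == j].mean(axis=0) if np.any(assign == j)
                         else centroids[j] for j in range(k)])      # Step 3: re-estimate
        new /= np.clip(np.linalg.norm(new, axis=1, keepdims=True), 1e-12, None)
        if np.allclose(new, centroids):                             # Step 4: convergence
            break
        centroids = new
    return [int(np.argmax(Un @ centroids[j])) for j in range(k)]    # Phase 2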
The computational complexity of the K-means clustering algorithm is O(NdkT), where d is the number of features and T is the number of iterations. In practice, we can define the stopping criterion of K-means clustering (i.e. Step 4 above) as the point at which the relative change of the total distortion falls below a threshold.
5 Active Learning with SUD and SBC

Procedure: Active Learning with SUD and SBC
Input: pool of unlabeled data U; k, the predefined size of the initial training data set
Initialization:
  - Evaluate the density of each unlabeled example in terms of the KNN-density measure;
  - Use the SBC technique to generate the small initial training data set L of size k.
Use L to train the initial classifier C
Repeat
  1. Use the current classifier C to label all unlabeled examples in U
  2. Use uncertainty sampling in terms of the density*entropy measure to select the m most informative unlabeled examples, and ask the oracle H for their labels (the SUD scheme)
  3. Augment L with these m new examples, and remove them from U
  4. Use L to retrain the current classifier C
Until the predefined stopping criterion SC is met.
Figure 3. Active learning with SUD and SBC
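Compared with the Figure 1 loop, the Figure 3 procedure changes only two things: the initial set comes from clustering rather than random draws, and examples are scored by DSH rather than by entropy alone. The sketch below is our illustration under the same assumptions as before, reusing the hypothetical helpers from the earlier sections (entropy_uncertainty, knn_density, sbc_initial_set — all our own names).

import numpy as np

def active_learn_sud_sbc(clf, U_X, oracle_label, k=10, m=10, K=20, max_rounds=30):
    # Sketch of the Figure 3 procedure.
    U_X = np.asarray(U_X, dtype=float)
    density = knn_density(U_X, K)              # initialization: KNN-density of every example
    chosen = sbc_initial_set(U_X, k)           # initialization: SBC initial training set
    L_X = [U_X[i] for i in chosen]
    L_y = [oracle_label(x) for x in L_X]
    pool = [i for i in range(len(U_X)) if i not in set(chosen)]
    clf.fit(np.array(L_X), np.array(L_y))
    for _ in range(max_rounds):                # stand-in for stopping criterion SC
        if not pool:
            break
        probs = clf.predict_proba(U_X[pool])
        H = np.array([entropy_uncertainty(p) for p in probs])
        dsh = density[np.array(pool)] * H      # Eq. (3): density*entropy selection
        picked = [pool[j] for j in np.argsort(-dsh)[:m]]
        for i in picked:
            L_X.append(U_X[i])
            L_y.append(oracle_label(U_X[i]))
        pool = [i for i in pool if i not in set(picked)]
        clf.fit(np.array(L_X), np.array(L_y))
    return clf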
Fig. 3 shows the algorithm of active learning with the SUD and SBC techniques. There are some natural variations. For example, if the initial training data set is generated by SBC and the entropy-based uncertainty measure is used, we obtain active learning with SBC; similarly, if the initial training data set is generated at random and the density*entropy uncertainty measure is used, we obtain active learning with SUD. If neither the SBC nor the SUD technique is used, we call it (traditional) uncertainty sampling, our baseline method.

6 Evaluation

In the following comparison experiments, we evaluate the effectiveness of various active learning methods for WSD and TC tasks on three publicly available real-world data sets.

6.1 Deficiency Measure

Deficiency is a statistic developed to compare the performance of active learning methods globally across the learning curve, and it has been used in previous studies (Schein and Ungar, 2007). The deficiency measure is defined as:

  Def(AL, REF) = Σ_{t=1}^{n} (acc_n(REF) − acc_t(AL)) / Σ_{t=1}^{n} (acc_n(REF) − acc_t(REF))    (5)

where acc_t is the average accuracy at the t-th learning iteration, REF is the baseline active learning method, AL is the active learning variant of REF (e.g. active learning with SUD and SBC), and n is the evaluation stopping point (i.e. the number of learned examples). A deficiency value smaller than 1.0 indicates that the AL method is better than the REF method; conversely, a value larger than 1.0 indicates a negative result.

In the following comparison experiments, we evaluate the effectiveness of six active learning methods: random sampling (random), uncertainty sampling (uncertainty), SUD, random sampling with SBC (random+SBC), uncertainty sampling with SBC (uncertainty+SBC), and SUD with SBC (SUD+SBC). "+SBC" indicates that the initial training data set is generated by the SBC technique; otherwise, the initial training set is generated at random. To evaluate the deficiency of each method, the REF method (i.e. the baseline) in Equation (5) is (traditional) uncertainty sampling.
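Given two accuracy curves sampled at the same n evaluation points, Equation (5) is a ratio of accumulated gaps below the reference's final accuracy. A small sketch of how we read it:

import numpy as np

def deficiency(acc_al, acc_ref):
    # Def(AL, REF), Eq. (5): values below 1.0 mean AL beats REF across the curve.
    # acc_al, acc_ref: accuracies of AL and REF at the same n evaluation points.
    acc_al, acc_ref = np.asarray(acc_al), np.asarray(acc_ref)
    final = acc_ref[-1]                      # acc_n(REF)
    return float(np.sum(final - acc_al) / np.sum(final - acc_ref))

# e.g. an AL curve uniformly closer to the final accuracy gives Def < 1:
# deficiency([0.7, 0.8, 0.9], [0.6, 0.8, 0.9])  ->  0.75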
6.2 Experimental Settings

We utilize a maximum entropy (ME) model (Berger et al., 1996) to build the basic classifier for the WSD and TC tasks. The advantage of the ME model is its ability to freely incorporate features from diverse sources into a single, well-grounded statistical model. A publicly available ME toolkit [5] was used in our experiments. To build the ME-based classifier for WSD, three knowledge sources are used to capture contextual information: unordered single words in the topical context, POS tags of neighboring words with position information, and local collocations; these are the same knowledge sources as used in (Lee and Ng, 2002). The text classifier also uses the maximum entropy model, with no feature selection.

[5] See http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html

In the following comparison experiments, the algorithm starts with an initial training set of 10 labeled examples and makes 10 queries after each learning iteration. A 10 by 10-fold cross-validation was performed, and all reported results are the average of the 10 trials of each active learning process.
6.3 Data Sets

Three publicly available real-world data sets are used in the following active learning comparison experiments: the Interest data set for the WSD task, and the Comp2 and WebKB data sets for the TC tasks.

The Interest data set, developed by Bruce and Wiebe (1994), has been used previously for WSD (Ng and Lee, 1996). It consists of 2369 sentences containing the noun "interest", each manually labeled with its correct sense; the noun "interest" has six different senses in this data set.

The Comp2 data set consists of the comp.graphics and comp.windows.x categories from NewsGroups, and has been used previously in active learning for TC (Roy and McCallum, 2001; Schein and Ungar, 2007).

The WebKB data set has been widely used in TC research. Following previous studies (McCallum and Nigam, 1998), we use the four most populous categories: student, faculty, course and project, altogether containing 4199 web pages. In the preprocessing step, we remove words that occur only once, and do not use stemming. The resulting vocabulary has 23803 words.
Data set    Interest   Comp2   WebKB
Accuracy    0.908      0.90    0.91
Table 1. Average accuracy of supervised learning on each data set when all examples have been learned.

6.4 Active Learning for WSD Task

[Figure 4: accuracy vs. number of learned examples (0-300) on Interest, with curves for random, random+SBC, uncertainty, uncertainty+SBC, SUD and SUD+SBC]
Figure 4. Active learning curves for WSD on the Interest data set

Random   Random+SBC   Uncertainty   Uncertainty+SBC   SUD     SUD+SBC
1.926    1.886        NA            0.947             0.811   0.758
Table 2. Average deficiency achieved by the various active learning methods on the Interest data set. The stopping point is 300.

Fig. 4 depicts the performance curves of the various active learning methods for the WSD task on the Interest data set. Among the six methods, random sampling shows the worst performance. SUD consistently outperforms uncertainty sampling. As discussed above, SUD prefers not only the most uncertain examples but also the most representative ones; in the SUD scheme, the KNN-density factor effectively avoids selecting the outliers that often cause uncertainty sampling to fail.

It is noteworthy that using SBC to generate the initial training data set improves the random (-0.04 deficiency), uncertainty (-0.053 deficiency) and SUD (-0.053 deficiency) methods, respectively. If the initial training data set is generated at random, the initial accuracy is only 55.6%; interestingly, SBC achieves 62.2% initial accuracy, a 6.6% improvement. However, SBC improves each method only at the early stages of active learning: after 50 unlabeled examples have been learned, SBC seems to contribute very little to the random, uncertainty and SUD methods. Table 2 shows that the best method is SUD with SBC (0.758 deficiency), followed by SUD.

6.5 Active Learning for TC Tasks

[Figure 5: accuracy vs. number of learned examples (0-150) on Comp2, with curves for uncertainty, uncertainty+SBC, SUD and SUD+SBC]
Figure 5. Active learning curves for text classification on the Comp2 data set

Uncertainty   Uncertainty+SBC   SUD     SUD+SBC
NA            0.409             0.588   0.257
Table 3. Average deficiency achieved by the various active learning methods on the Comp2 data set. The stopping point is 150.

[Figure 6: accuracy vs. number of learned examples (0-150) on WebKB, with curves for uncertainty, uncertainty+SBC, SUD and SUD+SBC]
Figure 6. Active learning curves for text classification on the WebKB data set

Uncertainty   Uncertainty+SBC   SUD     SUD+SBC
NA            0.669             0.748   0.595
Table 4. Average deficiency achieved by the various active learning methods on the WebKB data set. The stopping point is 150.
Figs. 5 and 6 show the effectiveness of the various active learning methods for the text classification tasks. Since random sampling performs poorly, as shown in Fig. 4, it is not included in Figs. 5 and 6; we compare only uncertainty sampling and our proposed methods on both text classification tasks.

As before, SUD consistently outperforms uncertainty sampling on the two data sets. SBC greatly improves uncertainty sampling (deficiency reduced by 0.591 on Comp2 and 0.331 on WebKB) and the SUD method (reduced by 0.331 and 0.153, respectively). Interestingly, unlike the WSD task shown in Fig. 4, Tables 3 and 4 show that uncertainty sampling with SBC outperforms our SUD method for text classification on both data sets. The reason is that SBC yields about a 15% initial accuracy improvement on the Comp2 data set, and about a 23% initial accuracy improvement on WebKB. Such improvements indicate that selecting a highly representative initial training set is very necessary and helpful for active learning. Tables 3 and 4 show that the best active learning method for the TC tasks is SUD with SBC, followed by uncertainty sampling with SBC. It is noteworthy that on WebKB, uncertainty sampling with SBC (0.669 deficiency) achieves only slightly better performance than SUD (0.748 deficiency), as shown in Table 4, simply because SBC introduces a large performance improvement only at the early stages; in fact, on WebKB the SUD method performs slightly better than uncertainty sampling with SBC after about 50 unlabeled examples have been learned.

7 Related Work

In recent years active learning has been widely studied in various natural language processing (NLP) tasks, such as word sense disambiguation (Chen et al., 2006; Zhu and Hovy, 2007), text classification (TC) (Lewis and Gale, 1994; McCallum and Nigam, 1998), named entity recognition (NER) (Shen et al., 2004), chunking (Ngai and Yarowsky, 2000), information extraction (IE) (Thompson et al., 1999), and statistical parsing (Tang et al., 2002).

In addition to uncertainty sampling, another popular selective sampling scheme is query-by-committee (Engelson and Dagan, 1999), which generates a committee of classifiers (always more than two) and selects the next unlabeled example by the principle of maximal disagreement among them. A method similar to committee-based sampling is co-testing, proposed by Muslea et al. (2000), which trains two learners individually on two compatible and uncorrelated views that should be able to reach the same classification accuracy. In practice, however, these conditions on view selection are difficult to meet in real-world applications. Cohn et al. (1996) and Roy and McCallum (2001) proposed methods that directly optimize the expected error on future test examples; however, the computational complexity of these methods is very high.

Some previous studies (Tang et al., 2002; Shen et al., 2004) have similarly considered a representativeness criterion in active learning. Unlike our sampling by uncertainty and density technique, Tang et al. (2002) adopted a most-uncertain-per-cluster sampling scheme for NLP parsing, in which the learner selects the sentence with the highest uncertainty score from each cluster and uses density to weight the selected examples, whereas we use density information to select the most informative examples. The most-uncertain-per-cluster scheme still cannot solve the outlier problem faced by uncertainty sampling. Shen et al. (2004) proposed an approach to selecting examples based on informativeness, representativeness and diversity criteria; in their work, the density of an example is evaluated within a cluster, and multiple criteria are linearly combined with coefficients. However, it is difficult to determine suitable coefficients automatically in real-world applications, and different applications may well require different coefficients.

8 Discussion

For batch mode active learning, we found that there is sometimes a redundancy problem: some selected examples are identical or similar, which reduces the representativeness of the selected batch. To address this problem, we tried the "most uncertain per cluster" sampling scheme (Tang et al., 2002) to select the most informative examples, reasoning that selecting examples from different clusters should alleviate the redundancy problem. However, this sampling scheme works poorly for WSD and TC on the three data sets, compared to traditional uncertainty sampling. From the clustering results, we found that the resulting clusters are very imbalanced, and it makes sense that more informative examples are contained in a bigger cluster. In this work, we therefore use only the SUD technique to select the most informative examples for active learning; we plan to study how combining the SBC and SUD techniques can enhance the selection of the most informative examples in future work.
Furthermore, we believe that a misclassified unlabeled example may convey more information than a correctly classified unlabeled example that is closer to the decision boundary. The difficulty is that the true label of each unlabeled example is unknown. To use misclassification information to select the most informative examples, we should study how to determine automatically whether an unlabeled example has been misclassified. For example, we can assume that an unlabeled example may be misclassified if it was previously "outside" and is now "inside". We will study this issue in future work.

These proposed techniques can also easily be applied to committee-based sampling for active learning. To do so, however, we should adopt a new uncertainty measure, such as vote entropy, to measure the uncertainty of each unlabeled example in the committee-based sampling scheme.
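For completeness, vote entropy over a committee (the measure alluded to above) can be sketched as follows; this is our illustration, not an implementation from this paper:

import numpy as np
from collections import Counter

def vote_entropy(votes):
    # Committee disagreement: entropy of the distribution of members' label votes.
    counts = np.array(list(Counter(votes).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# e.g. vote_entropy(["a", "a", "b"]) > vote_entropy(["a", "a", "a"])  # the latter is 0.0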
9 Conclusion and Future Work

In this paper, we have addressed two issues in active learning: the outlier problem of traditional uncertainty sampling, and initial training data set generation. To solve the outlier problem, we proposed a new method, sampling by uncertainty and density (SUD), in which a KNN-density measure and an uncertainty measure are combined to select the most informative unlabeled examples for human annotation at each learning cycle. We also employ a method of sampling by clustering (SBC) to generate a representative initial training data set. Experimental results on three evaluation data sets show that our combined SUD with SBC method achieves the best performance among the competing methods, particularly at the early stages of the active learning process. In future work, we will focus on the redundancy problem faced by batch mode active learning, and on how to make use of misclassification information to select the most useful examples for human annotation.

Acknowledgments

This work was supported in part by the National 863 High-tech Project (2006AA01Z154) and the Program for New Century Excellent Talents in University (NCET-05-0287).

References

Berger, Adam L., Vincent J. Della Pietra and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39-71.

Bruce, Rebecca and Janyce Wiebe. 1994. Word sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 139-146.

Chan, Yee Seng and Hwee Tou Ng. 2007. Domain adaptation with active learning for word sense disambiguation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 49-56.

Chen, Jinying, Andrew Schein, Lyle Ungar and Martha Palmer. 2006. An empirical study of the behavior of active learning for word sense disambiguation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 120-127.

Cohn, David A., Zoubin Ghahramani and Michael I. Jordan. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145.

Duda, Richard O. and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. New York: Wiley.

Engelson, S. Argamon and I. Dagan. 1999. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335-360.

Lee, Yoong Keok and Hwee Tou Ng. 2002. An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 41-48.

Lewis, David D. and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3-12.

McCallum, Andrew and Kamal Nigam. 1998. A comparison of event models for naïve Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization.

Muslea, Ion, Steven Minton and Craig A. Knoblock. 2000. Selective sampling with redundant views. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 621-626.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 40-47.

Ngai, Grace and David Yarowsky. 2000. Rule writing or annotation: cost-efficient resource usage for base noun phrase chunking. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 117-125.

Roy, Nicholas and Andrew McCallum. 2001. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 441-448.

Schein, Andrew I. and Lyle H. Ungar. 2007. Active learning for logistic regression: an evaluation. Machine Learning 68(3):235-265.

Schohn, Greg and David Cohn. 2000. Less is more: active learning with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 839-846.

Shen, Dan, Jie Zhang, Jian Su, Guodong Zhou and Chew-Lim Tan. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Tang, Min, Xiaoqiang Luo and Salim Roukos. 2002. Active learning for statistical natural language parsing. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 120-127.

Thompson, Cynthia A., Mary Elaine Califf and Raymond J. Mooney. 1999. Active learning for natural language parsing and information extraction. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 406-414.

Zhu, Jingbo and Eduard Hovy. 2007. Active learning for word sense disambiguation with methods for addressing the class imbalance problem. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 783-790.

Zhu, Jingbo, Huizhen Wang and Eduard Hovy. 2008. Learning a stopping criterion for active learning for word sense disambiguation and text classification. In Proceedings of the Third International Joint Conference on Natural Language Processing, pp. 366-372.
