Text Chunking by System Combination: Proceedings of CoNLL-2000 and LLL-2000
Apart from these voting methods, we have also applied two memory-based learners to the output of the five chunkers: IB1-IG and IGTREE, a decision tree variant of IB1-IG (Daelemans et al., 1999). This approach is called classifier stacking. As with the voting algorithms, we have tested these meta-classifiers with the output of the first classification stage. Unlike the voting algorithms, the classifiers do not require a uniform input. Therefore we have tested whether their performance can be improved by supplying them with information about the input of the first classification stage. For this purpose we have used the part-of-speech tag of the current word as a compressed representation of the first-stage input (Van Halteren et al., 1998).

The combination methods will generate a list of open brackets and a list of close brackets. We have converted these to phrases by only using brackets which could be matched with the closest matching candidate and ignoring the others. For example, in the structure [NP a [NP b ]NP [VP c ]PP d ]VP, we would accept [NP b ]NP as a noun phrase and ignore all other brackets, since they cannot be matched with their closest candidate for a pair, either because of type inconsistencies or because there was some other bracket in between them.

We will examine three processing strategies in order to test our hypothesis that chunking performance can be increased by making a distinction between finding chunk boundaries and identifying chunk types. The first is the single-pass method. Here each individual classifier attempts to find the correct chunk tag for each word in one step. A variant of this is the double-pass method. It processes the data twice: first it searches for chunk boundaries and then it attempts to identify the types of the chunks found. The third processing method is the n-pass method. It contains as many passes as there are different chunk types. In each pass, it attempts to find chunks of a single type. In case a word is classified as belonging to more than one chunk type, preference will be given to the chunk type that occurs most often in the training data. We expect the n-pass method to outperform the other two methods. However, we are not sure if the performance difference will be large enough to compensate for the extra computation that is required for this processing method.

3 Results

In order to find out which of the three processing methods and which of the nine combination methods performs best, we have applied them to the training data of the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000) in a 10-fold cross-validation experiment (Weiss and Kulikowski, 1991). For the single-pass method, we trained IB1-IG classifiers to produce the most likely output tags for the five data representations. In the input of the classifiers, a word was represented as itself, its part-of-speech tag and a context of four left and four right word/part-of-speech tag pairs. For the four IO representations we used a second phase with a limited input context (3) but with, additionally, the two previous and the two next chunk tags predicted by the first phase. The classifier output was converted to the O representation (open brackets) and the C representation (close brackets) and the results were combined with the nine combination methods. In the double-pass method, finding the most likely tag for each word was split into finding chunk boundaries and assigning types to the chunks. The n-pass method divided this process into eleven passes, each of which recognized one chunk type.

For each processing strategy, all combination results were better than those obtained with the five individual classifiers. The differences between combination results within each processing strategy were small, and between the three strategies the best results were not far apart: the best Fβ=1 rates were 92.40 (single-pass), 92.35 (double-pass) and 92.75 (n-pass).

Since the three processing methods reach similar performances, we can choose any of them for our remaining experiments. The n-pass method performed best but it has the disadvantage of needing as many passes as there are chunk types. This will require a lot of computation. The single-pass method was second-best, but in order to obtain good results with this method we would need to use a stacked classifier, because those performed better (Fβ=1=92.40) than the voting methods (Fβ=1=91.98). This stacked classifier requires preprocessed combinator training data, which can be obtained by processing the original training data with 10-fold cross-validation.
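The paper gives no pseudocode for its voting combiner. As a minimal sketch (in Python, not from the paper), per-word majority voting over the outputs of several chunkers can be implemented as follows; the tag sequences are invented for illustration, and the tie-break in favour of the earliest classifier is an assumption, not the paper's rule:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine equal-length per-word tag sequences, one per classifier,
    by majority voting. Ties are broken (an assumed policy) in favour
    of the tag proposed by the earliest classifier in the list."""
    combined = []
    for word_tags in zip(*predictions):
        counts = Counter(word_tags)
        # Prefer the most frequent tag; on a tie, the one proposed first.
        best = max(counts, key=lambda t: (counts[t], -word_tags.index(t)))
        combined.append(best)
    return combined

# Five hypothetical chunkers tagging the same four words:
votes = [
    ["B-NP", "I-NP", "B-VP", "I-VP"],
    ["B-NP", "I-NP", "B-VP", "B-VP"],
    ["B-NP", "B-NP", "B-VP", "I-VP"],
    ["B-NP", "I-NP", "I-NP", "I-VP"],
    ["B-NP", "I-NP", "B-VP", "I-VP"],
]
print(majority_vote(votes))  # → ['B-NP', 'I-NP', 'B-VP', 'I-VP']
```

Unlike the stacked classifiers discussed above, a combiner of this kind needs no extra training data of its own, which is what makes it attractive in the double-pass setting chosen below.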
Again, this will require a lot of work for new data sets.

We have chosen the double-pass method because in this processing strategy it is possible to obtain good results with majority voting. The advantage of using majority voting is that it does not require extra preprocessed combinator training data, so by using it we avoid the extra computation required for generating this data. We have applied the double-pass method with majority voting to the CoNLL-2000 test data while using the complete training data. The results can be found in table 1. The recognition method performs well for the most frequently occurring chunk types (NP, VP and PP) and worse for the other seven (the test data did not contain UCP chunks). The recognition rate for NP chunks (Fβ=1=93.23) is close to the result for a related standard baseNP data set obtained by Tjong Kim Sang (2000) (93.26). Our method outperforms the results mentioned in Buchholz et al. (1999) in four of the five cases (ADJP, NP, PP and VP); only for ADVP chunks does it perform slightly worse. This is surprising given that Buchholz et al. (1999) used 956696 tokens of training data and we have used only 211727 (78% less).

test data   precision   recall    Fβ=1
ADJP         85.25%     59.36%   69.99
ADVP         85.03%     71.48%   77.67
CONJP        42.86%     33.33%   37.50
INTJ        100.00%     50.00%   66.67
LST           0.00%      0.00%    0.00
NP           94.14%     92.34%   93.23
PP           96.45%     96.59%   96.52
PRT          79.49%     58.49%   67.39
SBAR         89.81%     72.52%   80.25
VP           93.97%     91.35%   92.64
all          94.04%     91.00%   92.50

Table 1: The results per chunk type of processing the test data with the double-pass method and majority voting. Our method outperforms most chunk type results mentioned in Buchholz et al. (1999) (F_ADJP=66.7, F_ADVP=77.9, F_NP=92.3, F_PP=96.8, F_VP=91.8) despite the fact that we have used about 80% less training data.

4 Concluding Remarks

We have evaluated three methods for recognizing non-recursive non-overlapping text chunks of arbitrary syntactical categories. In each method a memory-based learner was trained to recognize chunks represented in five different ways. We have examined nine different methods for combining the five results. A 10-fold cross-validation experiment on the training data of the CoNLL-2000 shared task revealed that (1) the combined results were better than the individual results, (2) the combination methods perform equally well and (3) the best performances of the three processing methods were similar. We have selected the double-pass method with majority voting for processing the CoNLL-2000 shared task data. This method outperformed an earlier text chunking study for most chunk types, despite the fact that it used about 80% less training data.

References

Sabine Buchholz, Jorn Veenstra, and Walter Daelemans. 1999. Cascaded grammatical relation assignment. In Proceedings of EMNLP/VLC-99. Association for Computational Linguistics.

Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. 1999. TiMBL: Tilburg Memory Based Learner, version 2.0, Reference Guide. ILK Technical Report 99-01. http://ilk.kub.nl/.

Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000. Association for Computational Linguistics.

Erik F. Tjong Kim Sang. 2000. Noun phrase recognition by system combination. In Proceedings of ANLP-NAACL 2000. Seattle, Washington, USA. Morgan Kaufmann Publishers.

Hans van Halteren, Jakub Zavrel, and Walter Daelemans. 1998. Improving data driven wordclass tagging by system combination. In Proceedings of COLING-ACL '98. Association for Computational Linguistics.

Sholom M. Weiss and Casimir A. Kulikowski. 1991. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems. Morgan Kaufmann.