
Definitional and human constraints on parsing performance

Geoffrey Sampson, Sussex University
Anna Babarczy, Budapest University of Technology and Economics

A number of authors (Voutilainen 1999; Brants 2000) have explored the ceiling on consistency of human grammatical annotation of natural-language samples. It is not always appreciated that this issue covers two rather separate sub-issues: (i) how refined can a well-defined scheme of annotation be? (ii) how accurately can human analysts learn to apply a well-defined but highly-refined scheme? The first issue relates to the inherent nature of a language, or of whichever aspect of its structure an annotation scheme represents. The second relates to the human ability to be explicit about the properties of a language. To give an analogy: if we aimed to measure the size (volume) of individual clouds in the sky, one limitation we would face is that the fuzziness of a cloud makes its size ill-defined beyond some threshold of precision; another limit is that our technology may not enable us to surpass some other, perhaps far lower threshold of measurement precision. The analogy is not perfect. Clouds exist independently of human beings, whereas the properties of a language sample are aspects of the behaviour of people, including linguistic analysts. Nevertheless, the two issues are logically distinct, though the distinction between them has not always been drawn in past discussions. (The distinction we are drawing is not the same, for instance, as Dickinson and Meurers' (2003) distinction between ambiguity and error: by ambiguity Dickinson and Meurers are referring to cases where a linguistic form (their example is English "can"), taken out of context, is compatible with alternative annotations but the correct choice is determined once the context is given. We are interested in cases where full information about linguistic context and annotation scheme may not uniquely determine the annotation of a given form. On the other hand, Blaheta's (2002) distinction between Type A and Type B errors, on one side, and Type C errors, on the other side, does seem to match our distinction.)
In earlier work (Babarczy, Carroll, and Sampson 2006) we began to explore the quantitative and qualitative differences between these two limits on annotation consistency experimentally, by examining the specific domain of wordtagging. We found that, even for analysts who are very well-versed in a part-of-speech tagging scheme, human ability to conform to the scheme is a more serious constraint on the degree of annotation consistency achievable than precision of scheme definition. The present paper will report results of an experiment which extends the enquiry to the domain of higher-level (phrase and clause) annotation. Note that neither in our earlier investigation nor in that to be reported here are we concerned with the separate question of what levels of accuracy are achievable by automatic annotation systems (wordtaggers or parsers), an issue which has frequently been examined by others. But our work is highly relevant to that issue, because it implies the existence of upper bounds, lower than 100%, on the degree of accuracy theoretically achievable by automatic systems. In the physical world it makes straightforwardly good sense to say that some instrument can measure the size, or the mass, of objects more accurately than a human being can estimate these properties unaided. In the domain of language, since this is an aspect of human behaviour, it sounds contradictory or meaningless to suggest that a machine might be able to annotate grammatical structure more accurately than the best-trained human expert: human performance appears to define the standard. Nevertheless, the findings already referred to imply that it is logically possible for an automatic wordtagger to outperform a human expert, even though no standard of perfect accuracy exists. Clouds are inherently somewhat fuzzy, but not as fuzzy as people's ability to measure them. The present paper aims to examine whether the same holds true for structure above the word level.
These are considerations which developers of automatic language-analysis systems need to be aware of.

Our experimental data consist of independent annotations by two suitable human analysts of ten extracts from diverse files of the written-language section of the British National Corpus, each extract containing 2000+ words beginning and ending at a natural break (or about 2300 parsable items, including e.g. punctuation marks, parts of hyphenated words, etc.). (Although, ideally, it would certainly be better to use more analysts for the investigation, the realities of academic research and the need for the analysts to be extremely well-trained on the annotation scheme mean that in practice it is reasonable to settle for two.) The annotation scheme applied was the SUSANNE scheme (Sampson 1995), the development of which was guided by the aim of producing a maximally refined and rigorously-defined set of categories and guidelines for their use (rather than the aim of generating large quantities of analysed language samples). To quote an independent observer, "Compared with other possible alternatives such as the Penn Treebank ... [t]he SUSANNE corpus puts more emphasis on precision and consistency" (Lin 2003: 321). These data are currently being analysed in two ways: (i) the leaf-ancestor metric (Sampson 2000) is being applied to measure the degree of discrepancy between the independent parses of the same passages, and to ascertain what proportions of the overall discrepancy level are attributable to particular aspects of language structure (e.g. how far discrepancy arises from formtagging as opposed to functiontagging, from phrase classification as opposed to clause classification, etc.); and (ii) for a sample of specific discrepancies in each category, the alternative analysts' annotations are compared with the published annotation guidelines to discover what proportion arise from previously-unnoticed vagueness in the guidelines as opposed to human error on the analysts' part.
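The per-leaf decomposition on which step (i) rests can be illustrated with a minimal sketch. Each word of a parsed sentence is paired with its lineage: the sequence of node labels on the path from the root down to that word. (The tree representation and the labels below are our own illustrative inventions, not the SUSANNE encoding.)

```python
# Illustrative sketch only: a tree is (label, [children]); a leaf is a word string.

def lineages(tree, ancestors=()):
    """Return (word, lineage) pairs, where a lineage is the list of
    node labels from the root down to that word."""
    if isinstance(tree, str):                     # a leaf: the word itself
        return [(tree, list(ancestors))]
    label, children = tree
    out = []
    for child in children:
        out.extend(lineages(child, ancestors + (label,)))
    return out

# A toy parse of "the dog barked" (labels illustrative, not SUSANNE tags)
tree = ("S", [("NP", ["the", "dog"]), ("VP", ["barked"])])
print(lineages(tree))
# [('the', ['S', 'NP']), ('dog', ['S', 'NP']), ('barked', ['S', 'VP'])]
```

Because every word receives its own lineage, a discrepancy between two analysts can be localized to particular words, and hence tallied separately by the structural category (phrase vs. clause, form vs. function label) at which the lineages diverge.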
The leaf-ancestor metric is used for this purpose both because it is the best operationalization known to us of linguists' intuitive concept of relative parse accuracy (Sampson and Babarczy 2003 give experimental evidence that it is considerably superior in this respect to the best-known alternative metric, the GEIG system used in the PARSEVAL programme), and because its overall assessment of a pair of labelled trees over a string is derived from individual scores for the successive elements of the string, making it easy to locate specific discrepancies and identify structures repeatedly associated with discrepancy. At the time of writing this abstract, although the annotations have been produced and the software to apply the leaf-ancestor metric to them has been written, the process of using the software to extract quantitative results has only just begun. We find that the overall correspondence between the two analysts' annotations of the 20,000-word sample is about 0.94, but this figure in isolation is not very meaningful. More interesting will be data on how the approx. 0.06 incidence of discrepancy divides between different aspects of language structure, and between scheme vagueness and human error. Detailed findings on those issues will be presented at the Osnabrück conference.
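The general shape of such a score can be sketched as follows. This is a simplified reading of Sampson (2000), not the published metric: the boundary-marking refinements and exact normalization of the original are omitted, and the labels are illustrative. Each word's lineage (root-to-word label path) in one parse is compared with its lineage in the other parse by edit distance, and the per-word similarities are averaged over the string.

```python
# Simplified leaf-ancestor-style score (after Sampson 2000); illustrative only.

def levenshtein(a, b):
    """Standard edit distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def lineage_similarity(l1, l2):
    """1.0 for identical lineages, scaled down by edit distance."""
    return 1.0 - levenshtein(l1, l2) / (len(l1) + len(l2))

def parse_score(lins1, lins2):
    """Mean per-word similarity over the aligned lineages of two parses."""
    sims = [lineage_similarity(a, b) for a, b in zip(lins1, lins2)]
    return sum(sims) / len(sims)

# Two analysts' lineages for the same three words (labels illustrative):
# the second analyst wraps the second word in an extra node.
analyst1 = [["S", "NP"], ["S", "NP"], ["S", "VP"]]
analyst2 = [["S", "NP"], ["S", "NP", "N"], ["S", "VP"]]
print(round(parse_score(analyst1, analyst2), 3))  # 0.933
```

Because the overall figure is a mean of per-word scores, low-scoring words point directly at the constituents where the two annotations diverge, which is what makes the discrepancy breakdown described above practicable.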

References

Babarczy, Anna, J. Carroll, and G.R. Sampson (2006) Definitional, personal, and mechanical constraints on part of speech annotation performance. J. of Natural Language Engineering 11.1-14.
Blaheta, D. (2002) Handling noisy training and testing data. Proc. 7th EMNLP, Philadelphia.
Brants, T. (2000) Inter-annotator agreement for a German newspaper corpus. Proc. LREC-2000, Athens.
Dickinson, M. and W.D. Meurers (2003) Detecting errors in part-of-speech annotation. Proc. 11th EACL, Budapest.
Lin, D. (2003) Dependency-based evaluation of Minipar. In A. Abeillé, ed., Treebanks, Kluwer, pp. 317-29.
Sampson, G.R. (1995) English for the Computer. Clarendon Press (Oxford University Press).
Sampson, G.R. (2000) A proposal for improving the measurement of parse accuracy. International J. of Corpus Linguistics 5.53-68.
Sampson, G.R. and Anna Babarczy (2003) A test of the leaf-ancestor metric for parse accuracy. J. of Natural Language Engineering 9.365-80.
Voutilainen, A. (1999) An experiment on the upper bound of interjudge agreement: the case of tagging. Proc. 9th Conference of EACL, Bergen, pp. 204-8.