Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger


Dan Tufis
The Romanian Academy Centre for Artificial Intelligence
Bucharest, Romania
[tufis@racai.ro]

Oliver Mason
The University of Birmingham
United Kingdom
[O.Mason@bham.ac.uk]

Abstract

This paper describes an experiment on tagging Romanian using QTAG, a parts-of-speech tagger that has been developed originally for English, but with a clear separation between the (probabilistic) processing engine and the (language specific) resource data. Therefore, the tagger is usable across various languages, as shown by successful experiments on three quite different languages: English, Swedish and Romanian. After a brief presentation of the QTAG tagger, the paper dwells on the language resources for Romanian and the evaluation of the results. A complexity metric for tagging experiments is proposed which considers the performance of the tagger with respect to the "difficulty" of a text.

Introduction

Lexical ambiguity resolution is a key task in natural language processing (Baayen & Sproat, 1996). It can be regarded as a classification problem: an ambiguous lexical item is one that in different contexts can be classified differently, and given a specified context the disambiguator/classifier decides on the appropriate class. The features that are relevant to the classification task are encoded into the tags.

It is part of the corpus linguistics folklore that in order to get a high accuracy level in statistical POS disambiguation one needs small tagsets and reasonably large training corpora; the larger the tagset, the larger the necessary training corpora (Berger, Della Pietra & Della Pietra, 1996). The effect of tagset size on tagger performance has been discussed in (Elworthy, 1995). Provided that enough training data is available, a language model can be built automatically; what a reasonable training corpus means typically varies from 100,000 words to more than one million. It is quite obvious that language models based on large tagsets would need large memory resources, and given current hardware limitations, the response time is seriously affected. For instance, for a language model using QTAG with about 700 tags, a transition matrix of 350 MB or more would not be surprising (this was the actual size of the transition matrix while training QTAG for Romanian), and it would be out of the question to keep it in RAM on usual computers. Most of the real time taggers (if not all of them) are able to ensure a fast response due to their ability to keep the language model in the core memory. In case there is not enough RAM to load it, a typical tagger (at least those in the public domain) would give up, either graciously informing about the lack of memory or just frustratingly crashing. Apparently there are two solutions for overcoming such a deadlock: either to reduce the tagset to a manageable size and lose information, or to modify the tagger with some extra code to take care of data swapping and accept a probably serious degradation of response time, as the higher overhead (required for memory management) would decrease the performance to an unacceptable level. Although some taggers are advertised to be able to learn a language model from raw (unannotated) texts, they require a post validation of the output and a bootstrapping procedure that would eventually get the tagger (in possibly several iterations) to an acceptable error rate.

Tiered Tagging

In (Tufis, 1998) it is argued that there is another solution, providing a nice compromise. In general terms, tiered tagging uses a hidden tagset (we call it C-tagset) of a smaller size (in our case 89 tags), based on which a language model is built. This language model serves for a first level of tagging. Then, a post-processor deterministically replaces the tags from the small tagset with one or more (in our experiments never more than 2) tags from the large tagset (we call it MSD-tagset). The reduced and the extended tagsets have to be in a specific relation (the small tagset should subsume the large one). The words that after this replacement become ambiguous (in terms of the MSD-tagset annotation) are more often than not the difficult cases in statistical disambiguation. The interpretations of the few words that remain ambiguous (in our experiment, less than 10%) are differentiated by very simple contextual rules. These rules investigate the left, right or both contexts, within a limited distance (in our experiment never exceeding 4 words in one direction), for a disambiguating tag or wordform, if full disambiguation is desired (instead of k-best tagging). The success rate of this second phase was higher than 98%. Depending on how accurate the contextual rules are, the error rate for the final tagged text could be practically the same as for the hidden tagging phase. In (Tufis, 1998) it is also shown how a reduced tagset (C-tagset) can be interactively designed from a large tagset, depending on the application. We call this way of tagging tiered tagging: with a small price in tagging accuracy (as compared to a reduced tagset approach), and practically no price in computational resources, it is possible to tag a text with a large tagset by using language models built for reduced tagsets and consequently small training corpora. Given the space limitation we will not go into details concerning tiered tagging (a full presentation of the tiered tagging approach and an in-depth discussion of the methodology and results are given in (Tufis, 1998)).
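To make the two-tier mechanics concrete, here is a minimal sketch of the post-processing step. All class names, the mapping entries and the rule format are our own illustrative assumptions, not the authors' code or rule formalism: each C-tag is deterministically expanded into its MSD mapping class, and a word that becomes ambiguous again is resolved by a simple contextual rule scanning a limited window.

```java
import java.util.*;

// Sketch of the tiered-tagging post-processor: every C-tag is expanded to
// its MSD mapping class; words that become ambiguous again are resolved by
// a simple contextual rule that scans at most 4 words to the left for a
// disambiguating tag. All names and the rule are illustrative only.
public class TieredTagger {
    // one-to-many mapping: C-tag -> MSDs (never more than 2 in the experiments)
    static Map<String, List<String>> cTagToMsds = Map.of(
            "NSRN", List.of("Ncfsrn"),            // unambiguous expansion
            "V3",   List.of("Vmii3p", "Vmis3s")); // expansion leaves 2 candidates (hypothetical)

    public static String[] expandAndDisambiguate(String[] words, String[] cTags) {
        String[] msds = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            List<String> candidates = cTagToMsds.getOrDefault(cTags[i], List.of(cTags[i]));
            msds[i] = candidates.size() == 1 ? candidates.get(0)
                                             : applyContextRule(cTags, i, candidates);
        }
        return msds;
    }

    // Illustrative contextual rule: look left up to 4 positions for a plural
    // noun tag and, if one is found, pick the plural MSD reading.
    static String applyContextRule(String[] cTags, int pos, List<String> candidates) {
        for (int j = Math.max(0, pos - 4); j < pos; j++)
            if (cTags[j].startsWith("NP"))                 // plural noun seen to the left
                for (String msd : candidates)
                    if (msd.contains("p")) return msd;     // choose the plural reading
        return candidates.get(0);                          // fallback: first (k-best) reading
    }

    public static void main(String[] args) {
        String[] words = {"ceasurile", "bateau"};
        String[] cTags = {"NPRY", "V3"};
        System.out.println(Arrays.toString(expandAndDisambiguate(words, cTags)));
    }
}
```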

The C-tagset induction process grants a (user specified) maximum value for the ratio Namb/Nwords. In our experiment this was set to 10%, that is, a C-tagset tagged text can be mapped error-free onto an MSD-tagset tagged text with at most 10% of the words remaining ambiguous. The global error rate of tiered tagging is given by the relation:

Error-rate = (Nerrors_tagger + Nerrors_mapping) / Nwords

where:
Nwords is the total number of words in the tagged text;
Nerrors_tagger is the number of errors made during the first phase of tagging (C-tagset tagging);
Nerrors_mapping is the number of errors made by the second phase of tagging (the mapping from C-tagset to MSD-tagset).

As any error made at the first step would show up in the final result, the Error-rate above is cumulatively specified. Given that the final accuracy of tiered tagging depends on the performance of the first (hidden) tagging step, we address here only this phase. The additional errors, Nerrors_mapping, could be roughly estimated as follows (for a rigorous upper-limit evaluation of this estimation, see (Tufis, 1998)). Let us consider that all the remaining ambiguous words were correctly tagged in the first step (this is certainly an overestimation if one considers a normal distribution of errors). Also considering e as the average error rate for rule-based disambiguation, it follows that approximately Namb*e words will be improperly disambiguated in the second phase. With an e less than 5% and at most 10% of the words remaining ambiguous, one gets less than 0.5% error contribution of the C-tagset onto MSD-tagset mapping in the final accuracy (10% * 5% = 0.5%). For the second phase of tiered tagging (including C-tagset design, contextual rules acquisition and application, and commented results) the interested reader is pointed to (Tufis, 1998).

The tagger

There are two basic approaches to part-of-speech tagging: rule-based and probabilistic. The first step in any tagging process is to look up each token of the text in a dictionary. The difficult task is to deal with ambiguities: only in trivial cases will there be exactly one tag per word. This is where the two approaches differ: while the rule-based approach tries to apply some linguistic knowledge (usually encoded in rules) in order to rule out illegal tag combinations, a probabilistic tagger determines which of the possible sequences is more probable. It is also possible to combine both into a hybrid approach. A rule-based language model can be created by a human using linguistic knowledge, but it is not reasonable to hand-code a probabilistic language model: such models are generally created from training data, i.e. they are learnt from examples. The tagger presented in this paper is purely probabilistic, with no rule-based mechanism. As long as some pre-tagged training corpus is available, its resources can easily be created and adapted for new languages.

QTAG works by combining two sources of information: a dictionary of words with their possible tags and the corresponding frequencies, and a matrix of tag sequences with associated frequencies, both generated from a pre-tagged corpus. It uses only probabilities for disambiguating tags within a text, with a language model based on the frequencies of transitions between different tags. If a token cannot be found in the dictionary, the tagger has to have some fallback mechanism, such as a morphological component or some heuristic guessing.

The tagging works on a window of three tokens, filled up with two dummy words at the beginning and the end of the text. Tokens are read and added to the window, which is shifted by one position to the left each time; the token that 'falls' out of the window is assigned a final tag. The basic algorithm is fairly straightforward: at first, the tagger looks up the dictionary for all possible tags that the current token can have, together with their respective lexical probabilities (i.e. the probability distribution of the N possible tags for the word form). This is then combined with the contextual probability for each tag to occur in a sequence preceded by the two previous tags. Two further processing steps also take into account the scores of the tag as the second and as the first element of the triplet, as the following two tokens are evaluated. The tag with the highest combined score is selected.

The tagging procedure is as follows:
1. read the next token;
2. look it up in the dictionary;
3. if not found, guess possible tags;
4. for each possible tag:
   a. calculate Pw = P(tag|token), the probability of the token to have the specified tag;
   b. calculate Pc = P(tag|t1,t2), the probability of the tag to follow the tags t1 and t2;
   c. calculate Pw,c = Pw * Pc, the joint probability of the individual tag assignment together with the contextual probability;
5. repeat the computation for the other two tags in the window, but using different values for the contextual probability: the probabilities of the tag being surrounded by and followed by the two other tags, respectively.

For each recalculation (three for each token) the resulting probabilities are combined to give the overall probability of the tag being assigned to the token. For output, the tags are sorted according to their probability. As these values become very small very quickly, they are represented as logarithms to the base 10, and the difference in probabilities between the tags gives some measure of the confidence with which the tag ought to be correct (see below for an example of this).

The tagger is implemented in a client-server model. The knowledge base resides on the server, and the tagging engine retrieves the relevant data for tags and tag sequence probabilities from it via a network connection. The server is implemented in C, while the client is written in Java. As a result, the tagger can run on different platforms, provided there is a server available. The total size of the executable code files is less than ten kilobytes. Furthermore, since the tagger is implemented as a Java class, it can easily be used as a module in other linguistic applications. The tagger has been developed at Corpus Research in Birmingham and is freely available for research purposes (for more information on this see http://www-clg.bham.ac.uk/tagger.html).
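A compact sketch of the scoring step just described follows. The probability tables hold toy values and the class layout is our own assumption, but the combination Pw,c = Pw * Pc in base-10 logarithms mirrors the algorithm above.

```java
import java.util.*;

// Sketch of QTAG's scoring for one token: combine the lexical probability
// Pw = P(tag|token) with the contextual probability Pc = P(tag|t1,t2),
// keeping scores as base-10 logarithms, so Pw*Pc becomes log10(Pw)+log10(Pc).
// The probability tables below are toy values, not trained figures.
public class QtagScore {
    static Map<String, Map<String, Double>> lexical = Map.of(
            "zi", Map.of("NSRN", 0.92, "V2", 0.05, "I", 0.03));
    static Map<String, Double> contextual = Map.of(        // key: "t1 t2 tag"
            "S TSR NSRN", 0.31, "S TSR V2", 0.002, "S TSR I", 0.001);

    // Returns the candidate tags for `token` after tags t1,t2, best score first.
    static List<Map.Entry<String, Double>> score(String token, String t1, String t2) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Double> e : lexical.getOrDefault(token, Map.of()).entrySet()) {
            double pc = contextual.getOrDefault(t1 + " " + t2 + " " + e.getKey(), 1e-9);
            scored.add(Map.entry(e.getKey(), Math.log10(e.getValue()) + Math.log10(pc)));
        }
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return scored;
    }

    public static void main(String[] args) {
        // score the token "zi" after the context tags S ("de") and TSR ("o")
        System.out.println(score("zi", "S", "TSR"));
    }
}
```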

The Language Resources

The Romanian wordforms lexicon was created by means of our EGLU natural language processing platform (Tufis, 1997), starting from a 35,000-lemma unification-based lexicon built at RACAI and later on expanded to the full paradigms of every new lemma. Since several words in the corpus were not in the EGLU lexicon, some of them were manually lemmatised and introduced in the dictionary. The MSDs (Morpho-Syntactic Descriptions) represent a set of codes as developed in the MULTEXT-EAST project (for the final reports see http://nl.ijs.si/ME). The morpho-syntactic descriptions are provided as strings, using a linear encoding similar to the genotypes of (Tzoukermann & Radev, 1997): the position in a string of characters corresponds to an attribute, and specific characters in each position indicate the value for the corresponding attribute. For a given wordform several MSDs might be applicable (accounting this way for homographs). The set of all the MSDs applicable to a given word defines the MSD-ambiguity class for that word; the Romanian lexicon contains 869 MSD-ambiguity classes. The non-ASCII characters dealt with by QTAG are encoded as standard SGML character entities, as the server is not Unicode capable (but the client is). The table in Figure 1 provides information on the data content of the main dictionary that is used for the corpus analysis; AMB-MSD represents the number of ambiguity classes.

Entries   Word-forms   Lemmas   MSDs   AMB-MSD
419869    347126       33515    611    869

Figure 1: Romanian dictionary overview

The Guesser

The morphology necessary to deal with unknown words (Weischedel et al., 1993; Abney, 1997) is encoded as a language specific resource. There are two different guessers. The first one is based on a list of <ending : ambiguity-class> pairs, built from the final three letters of all words in the lexicon, with their respective tag probabilities. This list is built automatically during the training of the tagger. This very simple mechanism seems sufficiently general to yield good results at little cost, even in languages with a rather complex morphology.

The second guesser (Romanian-specific) is more linguistically motivated and considers the real inflectional endings for open class words (nouns, adjectives, verbs). Each ending (including the 0-ending) is associated with an ambiguity class consisting of appropriate tags for open class words. By a retrograde analysis (right to left) of the unknown word, the guesser identifies all possible endings, with the restriction that the remaining part of the word, after removing the longest ending, be longer than 2 letters. The ambiguity classes corresponding to all the possible endings are merged. Depending on the way the guesser is invoked, the unknown word is assigned either this merged ambiguity class (the default) or the ambiguity class corresponding to the longest identified ending only.

To evaluate the guesser, we extracted from our main dictionary (D0) all the words which contained in their ambiguity class an interpretation belonging to closed classes, the abbreviations, the residuals, the words with roots 2 characters long and also some irregular open class words, and created a guesser lexicon of about 4000 entries (D1). The rationale for including the words with 2-character roots in D1 was that we imposed on the guesser the restriction mentioned above; the irregular words were infrequent wordforms.

The first experiment considered testing the guesser in the "all interpretations" mode on the words in D0-D1 (more than 400,000 wordforms). The results exceeded our expectations: only 8892 "errors" (2.17%). By error analysis we found that almost half of these errors were not guesser errors: they were either entries in the dictionary having assigned a wrong MSD, or entries corresponding to fake words that appeared in Orwell's 1984 (words belonging to newspeak or proles speak). The real errors were given by wrong segmentations root+ending: α+βγ instead of αβ+γ (both γ and βγ being recorded as endings). The wrongly classified words were analysed; they clustered quite regularly and pointed out a small number of words with the final letter(s) in their roots combining with the real endings. Given the few instances of such combinations, one way to overcome this type of error was to import those MSDs associated to γ which are valid for the root αβ into the interpretation list associated with the ending βγ. This step was repeated until a precision of 100% was obtained; the rare exceptions were put in the D1 lexicon.

The average length of an MSD-ambiguity class returned for an unknown word was 11. Given that the ambiguity classes very frequently included tags that subsumed some other tags in the same list (there were cases when two or more tags in the ambiguity class would represent the full expansion of a more general tag), we decided to eliminate all these redundant tags, and we turned the MSDs into tags as used by the tagger (see section "The Tagset"); the average length of the ambiguity class returned for an unknown word became reasonably shorter (3.3 tags).

A step further was to evaluate the effectiveness of the "longest match" heuristics, that is, we set the guesser to return the ambiguity class for the longest identified ending only. We ran the guesser again over the D0-D1 lexicon and we got (in the "longest match" mode) a much smaller number of errors: 1302. With less than 0.3% guessing errors we could have been satisfied. However, we made another step that was very easy to implement, could ensure almost error-free guessing and would not alter the computational and accuracy performance of the tagger. What we noticed was that for 968 words that were not correctly guessed using the "longest match" mode, the correct interpretation could be found in the second longest matched ending. So the guesser was set to return the union of the interpretations for the 2 longest matched endings, with a higher probability assigned to the interpretations provided by longer endings (in the vast majority of cases, recognising longer endings made the previous ones quite unlikely). With this modification the average length of the tag-ambiguity class doubled (6 tags per unknown word); after removing the redundant tags it decreased to 3 tags per unknown word. The number of errors returned by the guesser run in this mode for the words in D0-D1 was 181 (0.04%), and all the wrongly labelled words were infrequent wordforms, unlikely to occur other than as hapaxes in normal texts.
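The retrograde ending analysis can be illustrated with the sketch below. The ending table is a hypothetical four-entry fragment (the real resource holds the lexicon-derived ambiguity classes with their probabilities), and the two modes shown are the default merge and the union of the two longest matches.

```java
import java.util.*;

// Sketch of the ending-based guesser: analyse an unknown word from right to
// left, collect the ambiguity classes of all endings it matches (the root
// must stay longer than 2 letters) and either merge all of them (default)
// or keep only the classes of the two longest endings, longer endings
// ranking first. The ending table is a toy fragment, not the real list.
public class EndingGuesser {
    static Map<String, List<String>> endings = Map.of(
            "a",    List.of("NSRY", "V3"),
            "ul",   List.of("NSRY"),
            "ilor", List.of("NPOY"),
            "lor",  List.of("NPOY", "TPO"));

    static List<String> guess(String word, boolean unionOfTwoLongest) {
        List<String> matched = new ArrayList<>();          // matching endings, longest first
        for (int len = word.length() - 3; len >= 1; len--) // root must keep > 2 letters
            if (endings.containsKey(word.substring(word.length() - len)))
                matched.add(word.substring(word.length() - len));
        if (matched.isEmpty()) return List.of();           // would fall back to a default class
        int keep = unionOfTwoLongest ? Math.min(2, matched.size()) : matched.size();
        LinkedHashSet<String> merged = new LinkedHashSet<>(); // longer endings inserted first
        for (int i = 0; i < keep; i++) merged.addAll(endings.get(matched.get(i)));
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        System.out.println(guess("copiilor", true));       // class merged from "ilor" and "lor"
    }
}
```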

The corpus used in the experiments and evaluation reported here was made of the integral texts of two books: Orwell's 1984 and Plato's The Republic. The existent SGML markup (TEI conformant) was stripped off and the text was tokenised. Please note that a token is not necessarily a word: one orthographic word may be split into several tokens (the Romanian "da-mi-l" (give it to me) is split into 3 tokens), and several orthographic words may be combined into one token (the Romanian words "de la" (from) are combined into one token "de_la"). Each lexical unit in the tokenised text was automatically annotated with all its applicable MSDs and then hand disambiguated; this was done by two independent experts. The result, which we called the MSD-tagged corpus, was the main resource for training and evaluating the tagger (for more details on the Romanian dictionary and corpora encoding and several relevant statistics see (Tufis et al., 1997)). A brief overview of these texts is given below:

Text       Occurrences   Words   Lemmas   MSDs   AMB-MSD
1984       101460        14040   7008     396    524
Republic   114720        10350   4697     369    490
Common     56804         4016    2524     319    394

Figure 2: Romanian corpus overview

&Icirc;ntr-        Spsay
o                  Tifsr
zi                 Ncfsrn
senin&abreve;      Afpfsrn
&scedil;i          Ccssp
friguroas&abreve;  Afpfsrn
de                 Spsa
aprilie            Ncms-n
,                  COMMA
pe                 Spsa
c&acirc;nd         Rw
ceasurile          Ncfpry
b&abreve;teau      Vmii3p

Figure 3: MSD-tagged corpus overview

The Tagset

The tagset for Romanian contains 79 tags for different morpho-syntactic categories, plus 10 tags for punctuation. The tagset has been derived by a trial-and-error procedure from the 611 morpho-syntactic description codes (MSDs) defined for encoding the Romanian lexicon.

The first step in the tagset design was to keep in the tagset only the POS information, train the tagger on this minimal tagset and observe the errors made by the tagger on a few paragraphs randomly extracted from our corpus. The errors, signaled in the form of Cata instead of Catb, allowed us to build up the confusion sets for the minimal tagset: (Cat1 Cat2 ... Catm) instead of Catb. For all the wrongly marked up words we extracted the context and identified the attributes that were likely to help the tagger to make the correct choice. We added these attributes to the tags, repeated the training and tagging, and produced each time another list of confusion sets. As a result of this iterated step, we started to eliminate several attributes in the MSD codes in order to reduce the cardinality of the MSD ambiguity classes. If one thinks of an MSD in terms of a (flat) feature structure, this elimination of attributes in a given MSD represents a generalisation of the corresponding MSD that would subsume the initial one. This way, each MSD-ambiguity class (MSD1 MSD2 ... MSDk) was equivalated to a C-TAG ambiguity class (tag1 tag2 ... tagi) with i <= k, and a many-to-one mapping between MSDs and C-TAGs was defined: MSDn1 MSDn2 ... MSDnp -> tagm (or, vice versa, a one-to-many mapping tagm -> MSDn1 MSDn2 ... MSDnp).

The generalisation process observed the following general guidelines:
- eliminate the attributes that are unambiguously recovered by looking up the word in the lexicon;
- eliminate the attributes which do not influence the graphic form of the words in the given POS class;
- eliminate the attributes with little influence on the distribution of the words in the same POS class;
- preserve the attributes which correlate in co-occurring words.

The adjustment of the dictionary information consisted in conflating some interpretations of a few functional words which were hard to distinguish, and in removing a few distinctions that were impossible to make on a purely statistical and syntactical basis. The most obvious case was the distinction between the numeral and article readings of the words un (a/an or one - masculine) and o (a/an or one - feminine). To identify the numeral readings of these words some semantic criteria would have been necessary (for instance, defining a subclass of nouns "measure unit"), so we decided to classify them only as articles.

We define the recoverability degree of a given C-tag annotated text as the number of fully MSD-recoverable tokens per total number of tokens. The recoverability test is very simple: it computes, for a token toki disambiguated as tagk, the intersection between the ambiguity class of toki and the mapping class of tagk. If this intersection contains only one element, then the token toki is fully MSD-recoverable when it is tagged with tagk; otherwise, the token toki is considered partially recoverable. When the repeated design process stabilised, we reached, after several adjustments of the tagset and of the dictionary information, a 90% recoverability degree of the test corpus.
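The recoverability test lends itself to a few lines of code. The sketch below uses hypothetical one-entry tables standing in for the lexicon and the C-tag-to-MSD mapping.

```java
import java.util.*;

// Sketch of the recoverability test: a token disambiguated as cTag is fully
// MSD-recoverable iff the intersection of its lexicon MSD-ambiguity class
// with the mapping class of cTag contains exactly one MSD.
// Both tables are illustrative fragments.
public class Recoverability {
    // lexicon: wordform -> MSD-ambiguity class
    static Map<String, Set<String>> lexicon = Map.of(
            "zi", Set.of("Ncfsrn", "Vmm-2s", "I"));
    // one-to-many mapping: C-tag -> MSDs it stands for
    static Map<String, Set<String>> tagToMsds = Map.of(
            "NSRN", Set.of("Ncfsrn", "Ncmsrn"));

    static boolean fullyRecoverable(String token, String cTag) {
        Set<String> intersection = new HashSet<>(lexicon.getOrDefault(token, Set.of()));
        intersection.retainAll(tagToMsds.getOrDefault(cTag, Set.of()));
        return intersection.size() == 1;   // exactly one MSD survives
    }

    public static void main(String[] args) {
        System.out.println(fullyRecoverable("zi", "NSRN"));   // true
    }
}
```

The recoverability degree of a text is then simply the fraction of its tokens for which this check succeeds.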

Let us consider a few examples to clarify this methodological procedure. For the categories Noun and Adjective, the attributes Case and Number were preserved since, when the latter modifies the former, they must agree in Case and Number. Although nouns and adjectives are in many cases ambiguous with respect to these attributes when considered in isolation, they rarely are when considered in collocation, so preserving these attributes proved to be extremely helpful for the performance of the tagger. Gender, which is also subject to the agreement requirement, is very rarely ambiguous and is fully recoverable from the wordform; we have so far not encountered (in our corpus) a single instance where this was not the case. The definiteness attribute is also fully recoverable from the wordform, but keeping it was very useful in helping the tagger to discriminate among nouns, adjectives and participles (ex: frumosul baiat versus frumosul baiatului, or iubitul baiat versus iubitul baiatului or baiatul iubit).

With the verbs, the distinction was preserved between finite (main and auxiliary) verbs and nonfinite verbs (infinitive, participle and gerund). For the finite verbs the attribute Person was preserved, and only for the auxiliaries was the Number attribute kept. Eliminating the tense attribute dramatically decreased the number and the cardinality of the ambiguity classes; eliminating the Mood attribute decreased them further. The Gender of a participle is almost always recoverable from the wordform, and in those very rare cases when it is not, the immediate context (plus the agreement rule) does the job, so preserving the Gender was not relevant. The finer distinction between main and copulative uses (valid for a few verbs) was a constant source of tagging mistakes, so we removed it from our lexical encoding.

The most troublesome categories were the pronoun and the determiner (the latter class contains what grammar books traditionally call adjectival pronouns; due to several commonalties it was merged with the pronominal class). With only two attributes, the tagset design methodology classified most of the pronominal MSDs into functionally and distributionally unrelated classes, so the next step was to consider the types of pronominals and determiners in the classification. The only exceptions were the reflexive pronouns and determiners in accusative or dative, which clustered together (they are explicitly marked for these cases and are insensitive to number). Similar considerations applied for all the other word classes, and the resulting tagset (initially 73 proper tags and 10 punctuation tags) was used in a more comprehensive evaluation of the tagging process. As the final step, the tagset was slightly modified (79 proper tags and 10 punctuation tags) and some dictionary entries were changed; these final tagset and dictionary modifications are discussed in the section on Evaluation.

The Training Corpus

The two texts contain altogether more than 250,000 words. Out of the 611 MSDs defined in the lexicon, these texts contain 444 MSDs (72.66%); from the 869 MSD-ambiguity classes, they contain 620 (71.3%). This is to say that most of the words raising ambiguity problems appeared at least in one of the two texts; for distributional analysis and morpho-syntactical disambiguation purposes the selected texts offer enough evidence to extract reliable data and draw realistic conclusions.

The training corpus (CTAG-corpus) was obtained from the MSD-tagged corpus by substituting the MSDs with their corresponding corpus tags (using the MSD to C-tag mappings described in the previous section). The figure below shows a sample of the CTAG corpus, based on the same fragment as in Figure 3:

&Icirc;ntr-        S
o                  TSR
zi                 NSRN
senin&abreve;      ASRN
&scedil;i          CVR
friguroas&abreve;  ASRN
de                 S
aprilie            NSN
,                  COMMA
pe                 S
c&acirc;nd         R
ceasurile          NPRY
b&abreve;teau      V3

Figure 4: C-TAG corpus overview

To cope with unknown words, the three-letter word endings are extracted from the lexicon and a 'guess list' is created from them; this guess list is used as a guesser resource unless another guesser is installed (see above). With this feature, additional sentences were introduced in the corpus so that all the ambiguity classes corresponding to the endings used by the guesser be taken into account when constructing the language model.

The Training Process

Out of the training corpus, 90% was retained for the proper training, and the rest of 10% (the first parts of both 1984 and The Republic) was used for validation purposes. The training process is merely a formatting exercise. The data contained in the training corpus is sorted and filtered twice: once to extract the lexicon (if it does not exist already) and once for the tag trigrams. The extraction of the relevant information is done mostly with standard Unix text processing tools and two special purpose programs. The tags are assigned a probability value and they are sorted, so that the most likely tag comes first. The output format of the tagger is a vertical text with, for each word, all possible tags and their scores (the scores are expressed as logarithms), for example:

o   [TSR:-4.669] ...
zi  [NSRN:-1.101][QF:-9.066][V2:-10.839][VA3S:-11.481][I:-11.886]

The tagger is ready for use with the new resources as soon as the data files are accessible to the server program.
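The "formatting exercise" can be pictured as a single counting pass over the tagged corpus. The sketch below assumes a word<TAB>tag vertical input like the samples above and hypothetical boundary tags; turning the counts into the probability files is then straightforward.

```java
import java.util.*;

// Sketch of the training pass: one sweep over the tagged training corpus
// collects the word/tag lexicon counts and the tag-trigram counts that are
// later turned into the probabilities used by the tagger.
public class TrainCounts {
    public static void main(String[] args) {
        String[] corpus = { "o\tTSR", "zi\tNSRN", "senina\tASRN", "de\tS", "aprilie\tNSN" };
        Map<String, Integer> lexicon = new HashMap<>();   // "word tag" -> count
        Map<String, Integer> trigrams = new HashMap<>();  // "t1 t2 t3"  -> count
        String t1 = "<s>", t2 = "<s>";                    // dummy boundary tags
        for (String line : corpus) {
            String[] wt = line.split("\t");
            lexicon.merge(wt[0] + " " + wt[1], 1, Integer::sum);
            trigrams.merge(t1 + " " + t2 + " " + wt[1], 1, Integer::sum);
            t1 = t2; t2 = wt[1];
        }
        System.out.println(lexicon);
        System.out.println(trigrams);
    }
}
```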

Evaluation and the final tagset

The tagger was trained and evaluated several times on different segments of the hand-disambiguated corpus, with various results, building three language models: the first training was done on 90% of "1984", the second on 90% of "The Republic", and the third on the concatenation of the texts used in the first two (90% of each of the two books). The resulting language models were used to test the corresponding unseen 10% of the texts. The basic test on the unseen part of the training corpus provided the results shown in Figure 5.

data / LM                                  no. of words   no. of errors   accuracy
10% 1984 / 90% 1984 LM                     11791          257             97.82%
10% Republic / 90% Rep LM                  13696          533             96.10%
10% 1984 and Republic / 90%(1984+Rep) LM   25487          1112            95.63%

Figure 5: Initial tagging results

The tags attributed by the tagger were in all these cases compared with the manually assigned tags. The number of errors was against our expectations, so we made a pretest; the error analysis confirmed one of our suspicions: out of the reported errors there were almost 500 (non-systematic) errors (189 in "1984" and 301 in "The Republic") made by the human disambiguators in the process of building up the MSD-tagged corpus. In the cases where the tagger was definitely right and the annotation wrong (that is, the probability of the chosen tag was much larger than the probability of the "correct" one), we manually modified the MSD and the corresponding tag. With the data corrected, the tagger was retrained and the test redone. Again, a few more human errors were reported (8 in "1984" and 11 in "The Republic"). A third trial of this test didn't reveal new human-made errors (but there is of course no certainty that they were all discovered).

The error analysis also suggested some modifications of the tagset and of a few entries in the dictionary. Making a distinction (which the initial tagset did not have) between proper nouns and common nouns helped in reducing the tagging errors for the word "lui", which is always a genitival article when it precedes a proper noun, as opposed to being a pronoun otherwise (which is by far the most probable tag). Another problem was caused by the initial tag ASRN, which was meant for adjectives (singular, direct, indefinite). By analysing all the MSDs that were mapped onto this tag, we noticed that they identified only feminine adjectives; although we didn't mean to preserve the gender distinction in our tagset, this tag was in most cases mistakenly preferred to the more general tag ASN (adjective, singular, indefinite, Case undetermined). A natural decision was therefore to conflate the ASRN and ASN tags. This conflation eliminated tagging errors for those few but frequent adjectives which have identical forms for both masculine and feminine gender when singular and indefinite (for instance "mare" - big).

Another follow-up modification of the tagset was to make a distinction among the particles (QN, QS, QF), initially conflated into one tag (Q). Another example is given by the word "fi", initially encoded in the lexicon as an infinitive (to be) and as an aspectual particle. Since the particle reading has rarely been correctly identified by the tagger, and since this is a very frequent word in Romanian, its "contribution" to the errors list was significant (about 4% of all the errors); we removed from the lexicon the particle interpretation, leaving only the one for the infinitive (what most grammar books would do). Conflating interpretations that could not be reliably distinguished happened for instance with the word-forms "si" and "si-", which in the lexicon are listed as reflexive pronoun (himself/herself), conjunction (and) and adverb (still, yet). While the reflexive pronoun reading was almost always correctly assigned, the other two interpretations (C and R) were constantly confused, so we merged them into one tag, CVR (to be read as conjunction or adverb), thereby removing this source of errors.

The tagger evaluation was repeated for this final tagset, and the results (which improved significantly) are shown in Figure 6.

data / LM                                  no. of words   no. of errors   accuracy
10% 1984 / 90% 1984 LM                     11791          189             98.39%
10% Republic / 90% Rep LM                  13696          393             97.13%
10% 1984 and Republic / 90%(1984+Rep) LM   25487          963             96.22%

Figure 6: Final tagging results

The above mentioned accuracy was measured considering only the tag with the highest probability. However, the tagger output is a tabular format with each word on a line, followed by an ordered list of pairs [tag:probability]. This allows expressing a judgment on the confidence with which the tag has been assigned: if the probability difference between the first tag and the next tag(s) is comparatively large (as it is for "zi" in the example above), the decision is more certain than in the case of near-equal probabilities, as words with similar probabilities are more likely to contain errors. The evaluation with Romanian data has shown that with most errors the difference between the wrongly assigned tag and the correct tag was rather small, whereas it was much bigger with correctly assigned tags. Apart from assessing the confidence, this can also be used as a starting point for manual correction of the tagged texts.
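A sketch of how this score gap can be exploited follows; the gap threshold is an arbitrary assumption, not a value from the experiments.

```java
// Sketch of the confidence heuristic discussed above: since the tagger emits
// an ordered [tag:log10-score] list per word, the gap between the first two
// scores can flag low-confidence decisions as candidates for manual checking.
public class Confidence {
    static boolean needsReview(double[] sortedLogScores, double minGap) {
        if (sortedLogScores.length < 2) return false;     // unambiguous: nothing to review
        return sortedLogScores[0] - sortedLogScores[1] < minGap;
    }

    public static void main(String[] args) {
        double[] zi   = { -1.1, -9.1 };   // large gap: confident decision
        double[] hard = { -2.3, -2.5 };   // near-equal scores: likely error site
        System.out.println(needsReview(zi, 1.0));    // false
        System.out.println(needsReview(hard, 1.0));  // true
    }
}
```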

The actual tagset is briefly described below.

A Adjective
AN Adjective, indefinite
APN Adjective, plural, indefinite
APON Adjective, plural, oblique, indefinite
APOY Adjective, plural, oblique, definite
APRY Adjective, plural, direct, definite
ASN Adjective, singular, indefinite
ASON Adjective, singular, oblique, indefinite
ASOY Adjective, singular, oblique, definite
ASRY Adjective, singular, direct, definite
ASVN Adjective, singular, vocative, indefinite
ASVY Adjective, singular, vocative, definite
C Conjunction
CVR Conjunction or Adverb
DMPO Pronoun or Determiner, dem., plural, oblique
DMPR Pronoun or Determiner, dem., plural, direct
DMSO Pronoun or Determiner, dem., singular, oblique
DMSR Pronoun or Determiner, dem., singular, direct
I Interjection
M Numeral
NN Common Noun, indefinite
NP Proper Noun
NPN Common Noun, plural, indefinite
NPOY Common Noun, plural, oblique, definite
NPRN Common Noun, plural, direct, indefinite
NPRY Common Noun, plural, direct, definite
NPVY Common Noun, plural, vocative, definite
NSN Common Noun, singular, indefinite
NSON Common Noun, singular, oblique, indefinite
NSOY Common Noun, singular, oblique, definite
NSRN Common Noun, singular, direct, indefinite
NSRY Common Noun, singular, direct, definite
NSVN Common Noun, singular, vocative, indefinite
NSVY Common Noun, singular, vocative, definite
NSY Common Noun, singular, definite
PI Quantifier Pronoun or Determiner
PPPA Personal Pronoun, plural, accusative
PPPD Personal Pronoun, plural, dative
PPPO Personal Pronoun, plural, oblique
PPPR Personal Pronoun, plural, direct
PPSA Personal Pronoun, singular, accusative
PPSD Personal Pronoun, singular, dative
PPSN Personal Pronoun, singular, nominative, non-3rd person
PPSO Personal Pronoun, singular, oblique
PPSR Personal Pronoun, singular, direct
PSP Pronoun or Determiner, poss. or emph., plural
PSS Pronoun or Determiner, poss. or emph., singular
PXA Reflexive Pronoun, accusative
PXD Reflexive Pronoun, dative
QF Future Particle
QN Infinitival Particle
QS Subjunctive Particle
QZ Negative Particle
R Adverb
RELO Pronoun or Determiner, relative, oblique
RELR Pronoun or Determiner, relative, direct
S Preposition
TP Article, plural, non-possessive
TPO Article, plural, non-possessive, oblique
TPR Article, plural, non-possessive, direct
TS Article, singular, non-possessive
TSO Article, singular, non-possessive, oblique
TSR Article, singular, non-possessive, direct
V1 Verb, main, 1st person
V2 Verb, main, 2nd person
V3 Verb, main, 3rd person
VA1 Verb, auxiliary, 1st person
VA1P Verb, auxiliary, 1st person, plural
VA1S Verb, auxiliary, 1st person, singular
VA2P Verb, auxiliary, 2nd person, plural
VA2S Verb, auxiliary, 2nd person, singular
VA3 Verb, auxiliary, 3rd person
VA3P Verb, auxiliary, 3rd person, plural
VA3S Verb, auxiliary, 3rd person, singular
VG Verb, gerund
VN Verb, infinitive
VP Verb, participle
X Residual
Y Abbreviation

A Complexity Metric for Tagging Experiments

The performance of a tagger is usually measured as the percentage of correct tag assignments, i.e. a single percentage figure. While this initially sounds quite plausible, it does not say very much about the quality of the tagger: there are parameters that influence the performance which are not taken into account by such a figure. We propose the complexity of a text as an additional qualifying parameter, to put the percentage score into the right perspective.

One simple measure is calculated as the average number of tags per word, i.e. the sum of all possible tags for all words divided by the number of words. A better measure would be to disregard punctuation, as this is almost always assigned a unique tag. An even better measure would consider only the ambiguous words:

SM (Simple Measure) = number of tags / number of tokens
NPM (Non-Punctuation Measure) = number of non-punctuation tags / number of non-punctuation tokens
AM (Ambiguity Measure) = number of tags assigned to ambiguous tokens / number of ambiguous tokens

Depending on the measure adopted, the denominator is the number of all tokens, the number of non-punctuation tokens or the number of ambiguous tokens. A text with a resulting score of 1.0 is therefore trivial to tag, as each word only has one possible tag. The metric also has the advantage that it ponders the average ambiguity: this is relevant in those texts or languages which have a few highly ambiguous words (say, 10 different tags) but a lot of unambiguous words, which may be contrasted with texts or languages where most words are ambiguous (3-4 different tags each). The first case should be easier for the tagger, and the proposed metric takes it into account.

The complexity of a text to be tagged is also strongly dependent on the tagset used by the tagger: Q = Q1 * Q2, where Q1 is one of SM, NPM or AM above, and

Q2 = Σ(1 - pML_i) / (ε + Ntoken)

where Ntoken is the number of tagged tokens in the text, pML_i is the lexical probability of the most probable tag of word i, and ε is a small, non-zero constant. This measure would assign a zero value to a text with no ambiguous items: for unambiguous words the contribution to the summation is 0, while very probable tags have a very small contribution.

The three scores for the texts used in the experiment reported here are shown in the table below:

Score      SM     NPM    AM
1984       1.55   1.60   2.49
Republic   1.63   1.72   2.37

Figure 7: Different measures of text ambiguity

In the table above one can see that the AM measure is higher for "1984" than for "The Republic" (whereas SM and NPM are lower). This was because more ambiguous words were used in Plato's text, although the average number of tags for each ambiguous word is smaller than in the case of "1984"; the difficulty of "The Republic" excerpt was corrected by the Q2 term in the evaluation of Q.
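The three measures and the Q2 term translate directly into code. In the sketch below a token carries its possible-tag count and the lexical probability of its most likely tag; the token data are invented for illustration, and punctuation is assumed to carry exactly one tag.

```java
import java.util.*;

// Sketch of the SM, NPM and AM ambiguity measures and of the Q2 term
// defined above; Q = Q1 * Q2 with Q1 one of the three measures.
public class Complexity {
    record Token(int nTags, double pML, boolean punctuation) {}

    static double sm(List<Token> t)  { return t.stream().mapToInt(Token::nTags).sum() / (double) t.size(); }
    static double npm(List<Token> t) {
        var w = t.stream().filter(x -> !x.punctuation()).toList();
        return w.stream().mapToInt(Token::nTags).sum() / (double) w.size();
    }
    static double am(List<Token> t)  {
        var a = t.stream().filter(x -> x.nTags() > 1).toList();
        return a.isEmpty() ? 1.0 : a.stream().mapToInt(Token::nTags).sum() / (double) a.size();
    }
    static double q2(List<Token> t, double eps) {
        return t.stream().mapToDouble(x -> 1.0 - x.pML()).sum() / (eps + t.size());
    }

    public static void main(String[] args) {
        List<Token> text = List.of(new Token(1, 1.0, false), new Token(3, 0.7, false),
                                   new Token(2, 0.9, false), new Token(1, 1.0, true));
        System.out.printf("SM=%.2f NPM=%.2f AM=%.2f Q=%.3f%n",
                sm(text), npm(text), am(text), am(text) * q2(text, 1e-6));
    }
}
```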

Conclusions

It has been shown that it is quite easy to adapt a language independent probabilistic tagger to work with data from other languages. Due to the way the resource files are created, the training process is extremely fast, and with a comparatively small amount of training data it is possible to reach quite impressive results. A client/server architecture provides an ideal framework for programs that require large linguistic resources, and the client implementation in Java means that it is possible to run the tagger on a wide variety of platforms.

An old misconception, namely that highly inflected languages are doomed to poor performances when processed with probabilistic methods, has been shown to be completely wrong. The motivation is based on the fact that inflected words are less ambiguous than the base forms, and in many cases they are simply unambiguous. The unambiguous words act as tagging islands/clues for the rest of the words in the text; this way, the number of possibilities to be considered in finding the most probable tag assignment is significantly reduced. Therefore, we do believe that a highly inflected language has better chances of being tagged accurately than other languages. The empirical methodology we described for deriving a convenient tagset (i.e. manageable for the tagger and recoverable for a tiered tagging approach) from a large set of morpho-syntactic description codes proved to be successful.

We proposed a measure for the complexity of a text to be tagged, and an IQ figure for a tagger. We define the IQ of a tagger based on the number of cases when it made a non-trivial choice:

IQ = 100 * Σ(pREAL_i - pML_i) / Ntoken

where pML_i and Ntoken are the same as before, and pREAL_i is the lexical probability of the tag actually selected for token i. With this measure, the tagger is not given credit for the unambiguous words; the same goes for those cases where the selected tag is the most likely one. The normalisation is necessary to cope with texts of different lengths used in different experiments. The complexity of texts varies not only across languages but, more often than one would expect, within the same language; with any of the three measures used as Q1, the complexity of the test texts extracted from "The Republic" was higher than that of the texts extracted from "1984", and this difference in complexity explains why the tagging results are better for the latter. The text complexity is a useful parameter for estimating the accuracy with which a given text is expected to be tagged. This complexity value might also influence the choice of the tagging engine to be used for disambiguation: for an "easy" text one can use simple taggers, but for "difficult" texts a "more advanced" tagger should be necessary.

Acknowledgments

The work reported here was initiated as a joint experiment between the Romanian Academy and Birmingham University within the TELRI Copernicus Concerted Action (Mason & Tufis, 1998); the continuation of the work was further granted by the Romanian Ministry of Science and Technology. The hand-disambiguated texts are based on language resources developed under the Copernicus Joint Project "MULTEXT-East" (Orwell's "1984") and the TELRI Copernicus Concerted Action (Plato's "The Republic").

References

Abney, S. (1997): Part-of-Speech Tagging and Partial Parsing. In Young, S., Bloothooft, G. (eds.), Corpus-Based Methods in Language and Speech Processing (pp. 118-136). Kluwer Academic Publishers.

Baayen, H., Sproat, R. (1996): Estimating Lexical Priors for Low-Frequency Morphologically Ambiguous Forms. Computational Linguistics, vol. 22, no. 2 (pp. 155-166), June 1996.

Berger, A., Della Pietra, S., Della Pietra, V. (1996): A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, vol. 22, no. 1 (pp. 39-72), March 1996.

Elworthy, D. (1995): Tagset Design and Inflected Languages. In Proceedings of the ACL SIGDAT Workshop, Dublin (also available as cmp-lg archive 9504002).

Mason, O., Tufis, D. (1998): Tagging in a Multi-lingual Environment: Making an English Tagger Understand Romanian. In Proceedings of the Third International TELRI Seminar, Montecatini, 1998.

Tufis, D. (1997): A Generic Platform for Developing Language Resources and Applications. In Proceedings of the Second International TELRI Seminar, Kaunas, April 1997.

Tufis, D. (1998, to appear): Tiered Tagging. In International Journal on Information Science and Technology, vol. 1, Editura Academiei, Bucharest.

Tufis, D., Barbu, A., Patrascu, V., Rotariu, G., Popescu, C. (1997): Corpora and Corpus-Based Morpho-Lexical Processing. In Tufis, D., Andersen, P. (eds.), Recent Advances in Romanian Language Technology. Editura Academiei, Bucharest.

Tzoukermann, E., Radev, D. (1997): Tagging French Without Lexical Probabilities - Combining Linguistic Knowledge and Statistical Learning (cmp-lg archive 9710002).

Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., Palmucci, J. (1993): Coping with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics, vol. 19, no. 2, June 1993.