SentiMatrix - Named Entity Recognition For Romanian Language

KNOWLEDGE ENGINEERING: PRINCIPLES AND TECHNIQUES Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques, KEPT2011
Cluj-Napoca (Romania), July 4-6, 2011, pp. 25-36
NAMED ENTITY RECOGNITION FOR ROMANIAN

ADRIAN IFTENE(1), DIANA TRANDABĂȚ(1), MIHAI TOADER(2), AND MARIUS CORÎCI(2)

Abstract. This paper presents a Named Entity Recognition system for Romanian, created using linguistic grammar-based techniques and a set of resources. Our system's architecture is based on two modules: the named entity identification module and the named entity classification module. After the named entity candidates are marked in each input text, each candidate is classified into one of the considered categories, such as Person, Organization, Place, Country, etc. The system's upper bound and its performance in a real context are evaluated for each of the two modules (identification and classification) and for each named entity type. The evaluation shows promising results: our system is comparable with the existing systems for Romanian, and even better for Person recognition.
Received by the editors: 07.04.2011. 2010 Mathematics Subject Classification. 68T50, 68P20, 91F20. Key words and phrases. Named Entity, Information Extraction.
© 2011 Babeș-Bolyai University, Cluj-Napoca

Named Entity Recognition (NER) is a common natural language processing task dedicated to the discovery of textual expressions such as the names of persons, organizations, locations, places, etc. Named entity recognition, although a seemingly simple task, faces a number of challenges. Entities may first be difficult to find and, once found, difficult to classify. In this paper, we present the development of a NER system for Romanian. Even though the categories of named entities are predefined, there are varying opinions on what categories should be regarded as named entities and how broad those categories should be. The categories chosen for a particular NER project may depend on the requirements of the project. If numerical classification is important to a particular field, then the categories dealing with numerical data may need to be more refined. Similarly, if geographical classification is important, it may be necessary to classify each location entity as a particular type of location. The NER system for Romanian presented in this paper is intended to be part of a sentiment assessment system which monitors user feedback in relation to an organization's brand or product. Therefore, we
tried to refine the named entity types with regard to companies and products, so the categories we considered are: Person, Organization, Company, Region, Place, City, Country, Product, Brand, Model, and Publication. Such fine-grained named entity types were also justified by the need for a NER module in our question answering and textual entailment systems. In question answering, there are questions whose expected answer type is a person (male or female), a city, a region or an organization. The systems presented in [4], [5] and [3] show that the identification of these types in the retrieved paragraphs is a necessary step toward better results.

The next section presents some observations on previously published work, while the second section elaborates on our method. Section 3 presents the evaluation of our system, before finally discussing our conclusions.

1. Existing Work

NER systems use linguistic grammar-based techniques or statistical models (an overview is presented in [10]). Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Besides, they are hard to adapt to new domains. Statistical NER systems typically require a large amount of manually annotated training data. Machine learning techniques, such as the ones discussed in [8] or [9], allow systems to be adapted to new domains and perform very well for coarse-grained classification, but require large amounts of training data. Named entity recognition for Romanian has been addressed in [1], [6] and [7]. There is also a NER gazetteer for Romanian included in GATE [2]. The system presented in [7] is based on the GATE system, with several components added to the GATE platform: three gazetteers developed for Romanian and a JAPE transducer, also developed for the Romanian language.
The gazetteers provide the system with a list of entities (person names, locations, organizations), and the JAPE (Java Annotation Patterns Engine) transducer provides a set of rules for matching specific patterns in the text, aimed at identifying the named entities that appear in it. The performance of the system was measured as an F-measure of 32% for the identification of Person named entities, 66% for the identification of Locations and 58% for the identification of Organizations. Another named entity recognizer for Romanian is presented in [6], developed through extensive use of Perl regular expressions that define sequences of tokens that constitute named entities. The considered named entities are: integers, real numbers, dates, times, names of persons (female and male), different quantities, lengths, volumes and weights.
In [1], an algorithm is presented for the minimally supervised learning of named entity recognizers, given short name lists as seed data. The algorithm uses hierarchically smoothed trie structures for modeling morphological and contextual probabilities effectively in a language-independent framework, overcoming the need for fixed token boundaries or history lengths. The system achieves 70.5%-75.4% F-measure (measuring both named entity identification and classification) when applied to Romanian text. The system presented in this paper obtains comparable results for most of our categories, and outperforms the existing approaches for Person recognition.

2. Our Solution

Nowadays, big companies and organizations spend time and money to find out users' opinions about their products, the impact of their marketing decisions, or the overall feeling about their support and maintenance services. These analyses help in establishing new trends and policies, and in determining the areas in which investments must be made. One of the focuses of our work is helping companies build such analyses in the context of user sentiment identification. Therefore, the corpus we work on consists of newspaper articles, personal blogs, forum entries, and posts from social networks. But before extracting sentiments, named entities must be recognized. In the process of extracting named entities (NEs) we consider two steps: the first is the identification of NEs and the second is the classification of the identified NEs.

2.1. Named Entities Identification. We took a rule-based approach for the Named Entities Identification (NEI) task. In a preprocessing step, the NEI module uses a text segmenter and a tokenizer. Given a text, we divide it into paragraphs, every paragraph is split into sentences, and every sentence is tokenized.
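The preprocessing step just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the segmenter used in the system: the function names (split_paragraphs, split_sentences, tokenize, segment) and the regular expressions are hypothetical. Paragraphs are assumed to be separated by blank lines, and a sentence is assumed to end at ".", "!" or "?" followed by whitespace and an uppercase letter, a rule that would wrongly split after abbreviations such as S.C.

```python
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Assumption: a sentence ends at . ! or ? followed by whitespace and an
    # uppercase letter (Romanian uppercase diacritics included).
    # This naive rule would mis-split after abbreviations such as "S.C.".
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZĂÂÎȘȚ])", paragraph)
    return [s.strip() for s in parts if s.strip()]


def tokenize(sentence):
    # A token is a word (optionally with internal hyphens) or a single
    # punctuation character; \w is Unicode-aware for Python 3 strings.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def segment(text):
    # Nested result: paragraphs -> sentences -> tokens.
    return [[tokenize(s) for s in split_sentences(p)]
            for p in split_paragraphs(text)]


if __name__ == "__main__":
    sample = ("Ana a plecat la Iași. Universitatea este mare.\n\n"
              "Doctorul Popescu a sosit.")
    for paragraph in segment(sample):
        print(paragraph)
```

Keeping the three levels as separate functions makes it possible to replace the naive sentence rule (for instance with an abbreviation list) without touching the tokenizer.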
Each token is annotated with two pieces of information: its lemma (obtained from our resource with 76,760 word lemmas corresponding to 633,444 derived forms) and its normalized form (converted to the proper diacritics; in Romanian online texts, two diacritics are commonly used, but only one is accepted by the official grammar). Every token written with a capital letter is then considered a named entity candidate. Here we also consider tokens that are next to or between punctuation marks such as commas, periods, question marks, double quotes, single quotes, brackets, etc. A special module was built for capitalized tokens that are the first tokens of sentences. For this category, we consider two situations:
(1) when the first token of a sentence is in our stop word list - we eliminate it from the named entity candidate list;
(2) when the first token of a sentence is in our common word list - in this case there are two possible situations:
    (a) when this common word is followed by lowercase words - we check whether the common word (the first word of the sentence) can be found in the list of trigger words. These trigger words are cue words which introduce NEs, for example: university, company for the Organization type (this is the case for [Universitatea] din Iași (En: [University] of Iasi)); doctor, professor for the Person type ([Profesor] de Fizică (En: [Professor] of Physics)); words such as city, country for Location NEs ([Țara] de Jos (En: Low [Country])), etc. If the first word of the sentence is in the list of trigger words, it is kept in the NE candidate list. Otherwise, it is eliminated from the NE candidates, as being just a common word capitalized due to its position.
    (b) when this common word is followed by uppercase words - in this case the first word of the sentence is kept in the NE candidate list, and a further step decides whether it will be combined with the following word to create a compound named entity (for example, [Doctor] Stomatolog (En: [Doctor] Dentist)).

After we build the list of named entity candidates, we apply rules that unify adjacent candidates in order to obtain compound named entities. The most important rules are:
(1) Rules related to a person's title - in these cases, we unify words like Doctor, Profesor, Ministru, Președinte, etc. (En: Doctor, Professor, Minister, President) with adjacent candidates (an example of a possible result is Doctorul Popescu (En: Doctor Popescu));
(2) Rules related to the Organization type - we unify words like Universitate, Companie, Partid, etc.
(En: University, Company, Party) with adjacent candidates (an example of a possible result is Universitatea Cuza (En: Cuza University));
(3) Rules related to abbreviations - we unify abbreviations such as S.R.L., S.C., S.A. with adjacent candidates (for example, S.C. Travis);
(4) Rules related to special punctuation signs - in these cases we unify candidates separated by & or - (for example, Ana - Maria);
(5) Rules related to named entity candidates separated by stop words - in these cases we unify candidates separated by one or two specific stop
words. Examples of specific stop words are: din, de la, a, al, pentru, în (En: of, from, the, of, for, in). A few examples: Universitatea din Iași (En: Iasi University), BCR Banca pentru Locuințe (En: BCR Housing Bank), Direcția pentru Sănătate și Securitate în Muncă (En: Department of Health and Safety at Work);
(6) Rules for a specific model/product - candidates are combined with numbers, or with one or two letters followed by digits (for example: Portege R835, Qosmio X500-Q930).

Some of these rules are also used in the classification process, namely the rules related to the Person, Organization and Model types. Besides uppercase words, which are automatically NE candidates, we also consider as possible NE triggers lowercase words expressing titles (e.g., profesor, avocat, doctor, etc. (En: professor, lawyer, doctor)).

2.2. Named Entities Classification. The named entities (NEs) resource for Romanian was built starting from the categories used in GATE [2]. Thus, we consider the following major categories: the standard categories of City, Organization, Company, Country and Person, and additional categories such as Brand, Product and Publication (for magazines, newspapers, etc.).

NE type                       Number of entries
Brand                                         9
Non-Romanian City                       114,288
Romanian City                            14,277
Non-Romanian Company                     73,125
Romanian Company                          2,884
Country                                     405
Non-Romanian Organization                 2,016
Romanian Organization                     1,987
Non-Romanian Person Name                102,512
Romanian Person Name                     52,868
Place                                   189,403
Product                                  14,842
Region                                       64
Title                                     4,050
Total                                   572,730
Table 1: Number of NEs of each type

For almost all major categories we consider subcategories (the most represented ones are included in Table 1): for Cities we have as subcategories Romanian, European, American and Other Cities; for Companies we consider Romanian and non-Romanian Companies, but also Airlines, Banks, etc.; for
Organizations we consider Parties, Faculties, Universities, Ministries; for Persons we have the subcategories of Sportsmen, Politicians, Males, Females, etc.; for Cities we split our resources into Romanian and non-Romanian Cities and we separate Romanian Cities by county. In the end, we have built a total of 14 main categories with 98 subcategories. The distribution of the categories that have the highest number of NEs in our databases is presented in Table 1.

In the classification process we use some of the rules used in the unification of NE candidates, along with the NE resource and several rules specifically tailored for classification. Thus, after all NEs in the input text are identified and, where possible, compound NEs have been created, we apply the following classification rules:
(1) contextual rules - using contextual information, we are able to classify candidate NEs into one of the categories Organization, Company, Person, City and Country by combining regular expressions and trigger words. Our regular expressions search for specific triggers one or two words before the NE candidate. For example, oraș, capitală, târg, localitate (En: city, capital, town, locality) are the triggers searched in order to classify a candidate NE into the City type; companie or corporație (En: company, corporation) are the triggers for the Company type; partid, bancă, universitate (En: party, bank, university) are triggers for the Organization type; and all titles are triggers for the Person type;
(2) resource-based rules - if no triggers were found to indicate the type of the entity, we search our databases for the candidate entity. If the candidate NE is a compound one, we first try to find it as is (i.e., the complex NE) in our resources. If it cannot be found as a complex entity, we split it back and try to find the first entity and assign its type to the whole complex.
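The two classification rules above can be illustrated with a short sketch. It is not the system's implementation: the names TRIGGERS, RESOURCES and classify are hypothetical, the trigger lists contain only the examples quoted in the text, the gazetteer is a toy dictionary, and plain set lookups on lemmas stand in for the regular expressions mentioned in rule (1). The context words are assumed to be already lemmatized and lowercased.

```python
from typing import Dict, Optional, Sequence, Set

# Hypothetical trigger lemmas, limited to the examples given in the text.
TRIGGERS: Dict[str, Set[str]] = {
    "City": {"oraș", "capitală", "târg", "localitate"},
    "Company": {"companie", "corporație"},
    "Organization": {"partid", "bancă", "universitate"},
    "Person": {"doctor", "profesor", "ministru", "președinte"},
}

# Toy gazetteer (assumption): surface form -> NE type.
RESOURCES: Dict[str, str] = {
    "Universitatea": "Organization",
    "Iași": "City",
    "Nokia": "Company",
}


def classify(candidate: str,
             left_lemmas: Sequence[str],
             triggers: Dict[str, Set[str]] = TRIGGERS,
             resources: Dict[str, str] = RESOURCES) -> Optional[str]:
    """Return an NE type for the candidate, or None if undecided."""
    # Rule (1): look for a trigger among the one or two lemmas that
    # precede the candidate, nearest word first.
    for lemma in reversed(list(left_lemmas)[-2:]):
        for ne_type, words in triggers.items():
            if lemma.lower() in words:
                return ne_type
    # Rule (2): look up the whole (possibly compound) candidate ...
    if candidate in resources:
        return resources[candidate]
    # ... and fall back to the type of its first component.
    parts = candidate.split()
    if parts:
        return resources.get(parts[0])
    return None


if __name__ == "__main__":
    print(classify("PNL", ["partid"]))
    print(classify("Universitatea din Iași", []))
    print(classify("PNL", ["a", "declara"]))
```

Returning None for the undecided case keeps the two rules strictly ordered: the resource lookup is only consulted when no contextual trigger fires.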
The fallback to the first entity in rule (2) was inspired by the fact that in cases such as Universitatea din Iași (En: Iasi University), the first word (Universitatea) is the one that determines the Organization type of the complex NE, and not the last word, which is a City.

3. Evaluation

This section presents the performance of our NER system. Sections 3.1 and 3.2 discuss a first development evaluation step, in which we wanted to evaluate the system's performance when all needed resources were available (i.e., all NEs can be found in our resources). Therefore, if a NE was not found in our resources, we added it to the database, and if additional rules were needed, we built them. This case represents the system's upper bound.
The next sections, 3.3 and 3.4, present the evaluation of our system on a new corpus, for each module. Some NEs were not found in our resources, but this time we no longer added them, nor did we build specific rules.

3.1. Upper Bound Named Entities Identification Evaluation. In the evaluation process, we manually annotated 48 files with a total of 24,244 words and 1,638 NEs. Based on these development files, we incrementally built our rules, both for NE identification and for NE classification. We also added all missing NEs to our resources and built special rules for the untreated cases. The evaluation on this corpus is presented in Table 2. Partial matching represents the intersection between the gold NE and the NE identified by our system.

Metric                            Value   Description
Entity Identification Precision   95.12%  The percentage of gold entities found by the system relative to all returned entities
Entity Identification Recall      96.40%  The percentage of gold entities found by the system relative to all gold entities
Entity Identification F-Measure   95.76%  The equally weighted F-measure
Partial Precision                  1.81%  The percentage of gold entities partially returned by the system relative to all returned entities
Partial Recall                     1.83%  The percentage of gold entities partially returned by the system relative to all gold entities
Table 2: Upper bound evaluation for NE identification

The first main problem in NE identification is related to the agreement between annotators when different types of NEs are adjacent (for example: PDL Dolj). In these situations, some believe it should be a single entity, while others believe that two different entities should be considered (PDL and Dolj, in this case, or even an embedded PDL Dolj and Dolj), with different types. The second main problem in NE identification is related to the cases when the first word of a sentence is a common word and is not followed by capitalized words.
In these cases, the system is designed to leave the first word of the sentence out of the candidate NE list. This affects names such as Maria, a common Romanian female name which also means "to marry", or Ana, also a common Romanian female name, which can be found in the list of common words with the meaning of "rope used on boats". A total of 4,346 common words appear 5,622 times in one or more resources as NEs. In
other words, about 1% of the named entities in our databases are ambiguous with common words. An interesting case is Aurora, a Romanian female name which also means "dawn": it appears simultaneously in four different resources: cities, places, products and person names. For these cases we decided to build a special list of NEs that also exist in the list of common words, but which are statistically more frequent as NEs than as common words. Another frequent problem in the identification of NEs is related to special characters found at the beginning of a line. In these cases, the wrong identification is due to the fact that the sentences are not properly segmented; although this is not a problem of the system itself, it influences the system's overall performance.

3.2. Upper Bound Named Entities Classification Evaluation. For correctly identified named entities, the percentage of the matched and partially matched NEs that have been properly categorized is 95.71%. The main problems in NE classification are related to the fact that some NEs appear in more than one list of NEs. A total of 5,243 NEs appear in two or more resources, summing up to 10,588 occurrences. For example, Adevărul exists both in the list of companies and in the list of publications, Markel exists in the list of person names and in the list of companies, Capital is both in the list of places and in the list of publications, Băneasa is both a place and a city, Acer and Nokia are both a brand and a company, Dacia is a company, a brand, a country, a city and a product, David Jones is a person and a brand, Darian is a company or a person, G7 is a model and an organization, Siret is a region and a city, Tudor Vladimirescu is a city and a person name, Alexandria is a place and a city, Franklin Templeton is a person name and a company, and so on.
In Table 3, we present the most represented pairs of NE types in our databases for which the same NE appears in both categories:

NE types                    Number of common entities
Company and Product                             2,067
City and Place                                    890
Person and Place                                  609
Company and Person                                247
Company and Organization                           96
Table 3: Number of NEs shared by the two categories

As we can see from Table 3, the most important problem is due to the fact that we have products that have the same name as the company that produces them. Another problem is due to the fact that we have the same names for
cities and places (this is due to the fact that, in Romania, most districts have the same name as the administrative city of the county). Another problem in classification is related to cases where we have a partial match on extracted NEs. This happens when the initial text contains something like ...a declarat în exclusivitate pentru Capital Bjorn Hauge (En: said exclusively for Capital Bjorn Hauge), with two gold entities, each with its own type: Capital is a company and Bjorn Hauge is a person. In this case, due to our NE composition rules, our application extracts only one named entity, Capital Bjorn Hauge, which is not found in any class, and thus the system assigns to this NE group the class of the first NE. This is also the case for fostul șef al Vămii Halmeu Nicoleta Dobrescu (En: former head of Halmeu Customs Nicoleta Dobrescu), for întâlnirea Băsescu-Nazarbaev (En: Băsescu-Nazarbaev meeting), for a precizat pentru SFin Sorin Blejnar (En: said for SFin Sorin Blejnar) and for autostrada Pitești-București (En: Pitești-București highway).

3.3. Named Entities Identification Evaluation in Real Context. To test the system in a real context we created a new test corpus, unseen by our system, containing 38 manually annotated files with a total of 19,509 words and 1,215 NEs. The evaluation of the system on this test corpus is presented in Table 4.
Metric                            Value   Description
Entity Identification Precision   90.50%  The percentage of gold entities found by the system relative to all returned entities
Entity Identification Recall      90.95%  The percentage of gold entities found by the system relative to all gold entities
Entity Identification F-Measure   90.72%  The equally weighted F-measure
Partial Precision                  2.29%  The percentage of gold entities partially returned by the system relative to all the extracted entities
Partial Recall                     2.30%  The percentage of gold entities partially returned by the system relative to all gold entities
Table 4: Evaluation of the NE identification module

Besides the problems discussed in the upper bound evaluation, we found additional problems related to the extraction of entities of the type Title, which are usually written in lowercase letters and are very dependent on our resource list. The problems related to Title affect 3.70% of the total number of NEs in this corpus (i.e., 45 of the 144 titles were not extracted) and come from the fact that we do not have enough entries in our resources.
This was the case for words such as călugăriță, soră, colonel, viceprimar, copreședinți, ambasador, guvernator, premier, patron, țar, tovarăși, maiestate, europarlamentar (En: nun, nurse, colonel, vice mayor, co-presidents, ambassador, governor, premier, employer, tsar, comrades, majesty, EP member).

3.4. Named Entities Classification Evaluation in Real Context. Our system correctly classified (total or partial match) 66.73% of the NEs in our test corpus. Table 5 presents the error distribution for the named entity types where most errors occurred.

NE Type        NEs wrongly identified   Total no. of NEs   Percent
Company                            28                 89    31.46%
Publication                        15                 25    60.00%
Person                             86                364    23.63%
Product                            34                 65    52.31%
Organization                       72                220    32.73%
Region                             55                 76    72.37%
Undecided                          23                 36    63.88%
Table 5: Error distribution on NE types

An interesting case is that of Undecided entities, which the human annotator did not assign to any of our types in the test corpus (for example: Știrile ProTV (En: ProTV News), different laws or rules, TBC, ITP, PIN, etc.). In 13 of these cases, our system was not able to classify the extracted entity, in agreement with the gold annotation. However, in 23 cases, although these NEs do not exist in our resources, contextual rules were applied and the system wrongly identified a type for them. For the Company, Organization and Person types, the errors appear because the NEs were not found in our resources and no contextual rules could be applied. For the Publication and Product types, the errors occurred because they are frequently marked interchangeably in the test corpus, since it is difficult to distinguish between them. For the Region type, the major cause of errors is that the respective NE also exists in the resources for other types, such as City, Place, and Country. An interesting example is the case of PNL, which does not exist in any of our resources.
In some cases, when it is preceded by the word partid (En: party), it is correctly classified as Organization, but in all other cases, the system does not identify any type for it. This is a clear example where anaphora resolution would greatly increase the system's performance, since after finding the type for one entity, this type could be transferred to all its references in the anaphoric chain.
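The type transfer suggested for the PNL example can be sketched as follows. This is only an illustration of the idea, not a component of the evaluated system: the function name propagate_types is hypothetical, and exact string identity of the surface form within one document stands in for real anaphora resolution, which is a much harder problem. When typed mentions of the same surface form disagree, the sketch assumes a simple majority vote.

```python
from collections import Counter, defaultdict


def propagate_types(mentions):
    """Fill undecided mentions with the type seen for the same surface form.

    mentions: list of (surface_form, ne_type or None) pairs, in document order.
    Simplifying assumption: two mentions corefer if their surface forms
    are identical strings.
    """
    # Collect the types assigned to each surface form by the classifier.
    votes = defaultdict(Counter)
    for surface, ne_type in mentions:
        if ne_type is not None:
            votes[surface][ne_type] += 1

    resolved = []
    for surface, ne_type in mentions:
        if ne_type is None and votes[surface]:
            # Majority vote among the typed mentions of this surface form.
            ne_type = votes[surface].most_common(1)[0][0]
        resolved.append((surface, ne_type))
    return resolved


if __name__ == "__main__":
    doc = [("PNL", "Organization"), ("PNL", None), ("TBC", None)]
    print(propagate_types(doc))
```

Only undecided mentions are changed, so a type obtained from a contextual rule is never overwritten; entities with no typed mention at all, such as TBC in the example, remain undecided.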
4. Conclusions

This paper presents a Named Entity Recognition system for Romanian, created using linguistic grammar-based techniques and a set of resources. The architecture of our system involves two modules, named entity identification and named entity classification, applied successively. The goal of the described system is to recognize named entities for Romanian, distinguishing between 14 NE types. Even though we consider so many categories, we still obtain results comparable to (and, for specific categories, even better than) those of existing systems for Romanian, which identify fewer NE types. Future work will address the problems related to common words at the beginning of sentences. To fix these problems, we intend to use statistical information about common words obtained from a large corpus, such as the Romanian Wikipedia. Another immediate future direction is related to anaphora resolution, which could be of great benefit for transferring the type of one classified entity to all its references. Thus, the number of undecided cases will be reduced, and we could also consider a voting system if the same entity has different types assigned in different contexts. In order to help us extend our resources, we also consider designing an interface where people using our NER system could add the entities we did not identify, or classify them correctly if the system did not. However, we also need to find an annotation evaluation technique to minimize the possibility of errors being introduced into the database.

Acknowledgment

The research presented in this paper was funded by the Sectoral Operational Program for Human Resources Development through the project "Development of the innovation capacity and increasing of the research impact through post-doctoral programs" POSDRU/89/1.5/S/49944. The authors of this paper thank their colleagues Alexandru Gînscă, Emanuela Boroș, Augusto Perez and Dan Cristea from the Faculty of Computer Science Iasi for the help offered in this project.
References

1. S. Cucerzan and D. Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 1999, pp. 90-99.
2. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, GATE: A framework and graphical development environment for robust NLP tools and applications, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
3. A. Iftene, D. Trandabăț, M. Moruz, I. Pistol, M. Husarciuc, and D. Cristea, Question answering on English and Romanian languages, Multilingual Information Access Evaluation, Text Retrieval Experiments (Carol Peters, Giorgio Di Nunzio, Mikko Kurimo, Thomas Mandl, Djamel Mostefa, Anselmo Peñas, and Giovanna Roda, eds.), Lecture Notes in Computer Science, vol. 1, Springer Berlin / Heidelberg, 2010, pp. 229-236.
4. A. Iftene, D. Trandabăț, I. Pistol, M. Moruz, A. Balahur-Dobrescu, D. Cotelea, I. Dornescu, I. Draghici, and D. Cristea, UAIC Romanian question answering system for QA@CLEF, In Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers, Lecture Notes in Computer Science, vol. 5152, 2008, pp. 336-343.
5. A. Iftene, D. Trandabăț, I. Pistol, M. Moruz, M. Husarciuc, and D. Cristea, UAIC participation at QA@CLEF2008, Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers, Lecture Notes in Computer Science, vol. 5706, 2009, pp. 448-451.
6. R. Ion, Word sense disambiguation methods applied to English and Romanian, PhD Thesis, 2007.
7. L. M. Machison, Named entity recognition for Romanian (RONER), Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques, KEPT2009, 2009, pp. 53-56.
8. Y. Mehdad, V. Scurtu, and E. Stepanov, Italian named entity recognizer participation in NER task @ EVALITA 09, Proceedings of the 11th Conference of the Italian Association for Artificial Intelligence, 2009.
9. D. Nadeau, Semi-supervised named entity recognition: Learning to recognize 100 entity types with little supervision, PhD Thesis, 2007.
10. D. Nadeau and S. Sekine, A survey of named entity recognition and classification, Linguisticae Investigationes 30 (2007), no. 1, 3-26, Publisher: John Benjamins Publishing Company.
(1) Al. I. Cuza University of Iasi, Faculty of Computer Science, Romania
E-mail address: adiftene@info.uaic.ro, dtrandabat@info.uaic.ro
(2) Intelligentics, Cluj-Napoca, Romania
E-mail address: marius@intelligentics.ro, mtoader@gmail.com
