Professional Documents
Culture Documents
7 June 2010
Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09
Introduction
Overview: the problem of Named Entity Recognition, recognition of PERSON and COMPANY annotations, two corpora: Stock Exchange Reports from the economic domain and Police Reports from the security domain, a combination of a machine learning approach with manually created rules, application of Hidden Markov Models. The corpus was published at http://nlp.pwr.wroc.pl/gpw/download
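The slides apply Hidden Markov Models to NE tagging. Below is a minimal sketch of the idea: a word sequence is decoded into PERSON/other states with the Viterbi algorithm. All states, words, and probabilities here are toy values chosen for illustration; they are not the parameters estimated from the corpora.

```python
# Minimal sketch of HMM-based NE tagging with Viterbi decoding.
# All probabilities below are illustrative toy values, not corpus estimates.

STATES = ["O", "PERSON"]

START = {"O": 0.8, "PERSON": 0.2}
TRANS = {
    "O": {"O": 0.9, "PERSON": 0.1},
    "PERSON": {"O": 0.6, "PERSON": 0.4},
}
EMIT = {
    "O": {"prezes": 0.5, "firmy": 0.4, "Jan": 0.05, "Kowalski": 0.05},
    "PERSON": {"prezes": 0.05, "firmy": 0.05, "Jan": 0.5, "Kowalski": 0.4},
}

def viterbi(words):
    """Return the most probable state sequence for a word sequence."""
    # table[i][s] = (best probability of reaching state s at word i, backpointer)
    table = [{s: (START[s] * EMIT[s].get(words[0], 1e-6), None) for s in STATES}]
    for i in range(1, len(words)):
        row = {}
        for s in STATES:
            prev, p = max(
                ((q, table[i - 1][q][0] * TRANS[q][s]) for q in STATES),
                key=lambda x: x[1],
            )
            row[s] = (p * EMIT[s].get(words[i], 1e-6), prev)
        table.append(row)
    # Follow backpointers from the best final state.
    best = max(STATES, key=lambda s: table[-1][s][0])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        best = table[i][best][1]
        path.append(best)
    return list(reversed(path))

print(viterbi(["prezes", "Jan", "Kowalski"]))  # → ['O', 'PERSON', 'PERSON']
```

A real tagger would estimate the transition and emission tables from the annotated corpus and smooth unseen words, but the decoding step is the same.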
Task Definition
We defined NEs as language expressions referring to extra-linguistic real or abstract objects of the preselected kinds. We limited the Named Entity Recognition task to identifying expressions consisting of proper names referring to PERSON and COMPANY entities.
Characteristics: Corpus of Stock Exchange Reports (CSER)
a set of economic reports published by companies, a very formal style of writing, many expressions beginning with an upper-case letter that are not proper names, many names of institutions, organizations, companies, people and locations.
Michał Marcińczuk and Maciej Piasecki (PWr.)
Characteristics: Corpus of Police Reports (CPR)
a set of statements produced by witnesses and suspects, a rather informal style of writing, many pseudonyms and common words used as proper names, many one-word person names.
Corpus Development
Baselines
1. Heuristic: matches a sequence of words starting with an upper-case letter. For COMPANY the name must end with an activity form, e.g. SA, LLC, Spółka, AG, S.A., Sp., B.V.
2. Gazetteers: matches a sequence of words present in the dictionary of first names and last names (63 555 entries) or company names (6200 entries) [Piskorski, 2004].
                        CSER                      CPR
                  PERSON     COMPANY       PERSON     COMPANY
Heuristic
  Precision       0.89 %      0.76 %      19.35 %      0.19 %
  Recall         42.45 %      4.42 %      93.87 %      4.13 %
  F1-measure      1.75 %      1.29 %      32.09 %      0.36 %
Gazetteers
  Precision       9.61 %     37.01 %      46.79 %     21.05 %
  Recall         41.19 %     40.54 %       9.02 %      3.31 %
  F1-measure     15.59 %     38.69 %      15.12 %      5.71 %
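The COMPANY heuristic described above can be sketched as a few lines of code: collect runs of upper-case-initial tokens and accept a run only when it ends with an activity form. The token list here is a small excerpt of the forms named on the slide, and the function name is ours.

```python
# Sketch of the baseline COMPANY heuristic: a sequence of words starting
# with an upper-case letter that ends with a legal/activity form.
# ACTIVITY_FORMS is a small excerpt of the forms listed in the slides.
ACTIVITY_FORMS = {"SA", "S.A.", "Sp.", "Spółka", "LLC", "AG", "B.V."}

def company_heuristic(tokens):
    """Yield (start, end) token spans of candidate COMPANY names."""
    spans, start = [], None
    for i, tok in enumerate(tokens):
        if tok[0].isupper():
            if start is None:
                start = i                      # open an upper-case run
            if tok in ACTIVITY_FORMS:
                spans.append((start, i + 1))   # run must end with a form
                start = None
        else:
            start = None                       # lower-case token breaks the run
    return spans

print(company_heuristic(["Kowalski", "Holding", "S.A.", "ogłasza"]))  # → [(0, 3)]
```

The weakness visible in the results table follows directly from this sketch: any capitalized word preceding the activity form is swallowed into the span, and company mentions without a suffix are missed entirely, hence the low precision and recall.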
We performed a 10-fold Cross Validation on the Stock Exchange Corpus for PERSON annotations.
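The evaluation setup can be sketched as follows: documents are split into 10 folds, the model is trained on 9 and scored on the held-out fold, and the scores are averaged. `train` and `evaluate` are placeholders for the actual HMM pipeline, not functions from the slides.

```python
# Sketch of the 10-fold cross-validation setup used in the evaluation.
# `train` and `evaluate` stand in for the real training and scoring code.

def ten_fold_cv(documents, train, evaluate, k=10):
    folds = [documents[i::k] for i in range(k)]   # round-robin split into k folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [d for j, f in enumerate(folds) if j != i for d in f]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / k                        # mean score over the folds
```

Splitting on whole documents (rather than sentences) keeps annotations from the same report out of both the training and test sides of a fold.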
Error Analysis
We have identified 10 types of errors.

No.  Error type                                         Matches
                                                   full  partial  total
 1   Name of institution, company, etc.              38       91    129
 2   Name of location (street, place, etc.)          30       16     46
 3   Other proper names                               2       10     12
 4   Phrases in English                               -       21     21
 5   Incorrect annotation boundary                    -       35     35
 6   Common word starting from upper case character   -        6      6
 7   Common word starting from lower case character   -       26     26
 8   Single character                                 -        6      6
 9   Common word with a spelling error                -        3      3
10   Other                                            -       46     46
A) errors 1, 2, 3 (incorrect annotation types): recognition of COMPANY and LOCATION; B) errors 4, 7, 8, 9 (lower-case and non-alphabetic expressions): rule filtering; C) error 5 (incorrect annotation boundary): annotation merging and trimming.
Referring to the A group of errors, we re-annotated the CSER with COMPANY annotations and repeated the 10-fold CV.
Post-filtering
Referring to the B and C groups of errors, we have applied two types of post-processing: annotation filtering and merging.
                    PERSON (REV)                      PERSON (COMB)
              Precision  Recall   F1-measure    Precision  Recall   F1-measure
HMM           64.74 %*   93.73 %*  76.59 %*     78.63 %*   94.62 %*  85.89 %*
+filtering    76.27 %    91.64 %   83.25 %      87.16 %    91.33 %   89.20 %
+trimming     64.85 %    93.88 %   76.71 %      78.76 %    94.77 %   86.02 %
+both         75.82 %    91.77 %   83.03 %      86.69 %    91.48 %   89.02 %
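The two post-processing steps can be sketched as simple rules over recognized annotations: filtering drops candidates that are lower-case, single-character or non-alphabetic (error group B), and trimming cuts boundary tokens that should not belong to the name (error group C). The trigger words below are hypothetical examples, not the actual rule set from the slides.

```python
# Sketch of the post-processing rules: filtering (error group B) and
# boundary trimming (error group C). STOP_EDGES is a hypothetical list
# of trigger words; the real rules were hand-crafted for the corpora.

STOP_EDGES = {"Pan", "Pani", "prezes"}   # illustrative boundary triggers

def filter_annotations(annotations):
    """Keep only annotations that look like proper names."""
    kept = []
    for tokens in annotations:
        text = " ".join(tokens)
        if len(text) > 1 and text[0].isupper() and any(c.isalpha() for c in text):
            kept.append(tokens)
    return kept

def trim_annotation(tokens):
    """Drop leading/trailing tokens that should not belong to the name."""
    while tokens and tokens[0] in STOP_EDGES:
        tokens = tokens[1:]
    while tokens and tokens[-1] in STOP_EDGES:
        tokens = tokens[:-1]
    return tokens

print(filter_annotations([["Jan", "Kowalski"], ["kowal"], ["%"]]))
print(trim_annotation(["Pan", "Jan", "Kowalski"]))
```

As the results table shows, filtering mainly raises precision (by removing spurious annotations) at a small cost in recall, while trimming fixes boundary errors and leaves the other scores nearly unchanged.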
Plans
- to extend the annotation schema with LOCATION & ORGANIZATION,
- to collect a new corpus for cross-domain evaluation,
- to develop new rules for post-processing (for example, to fix a problem with sentence segmentation),
- to incorporate other sources of knowledge (rules and gazetteers for majority voting, plWordNet for generalization),
- new learning models: an HMM including morphology, other ML methods with context features (preceding verbs, prepositions, etc.).
Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)

Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization. In: Kłopotek, M.A., Przepiórkowski, A., Wierzchoń, S.T., Trojanowski, K. (eds.) Recent Advances in Intelligent Information Systems, pp. 247–260. Academic Publishing House Exit (2009)

Piskorski, J.: Extraction of Polish Named Entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313–316. ELRA, Lisbon, Portugal (2004)