
Named Entity Recognition in the Domain of Polish Stock Exchange Reports

Michał Marcińczuk and Maciej Piasecki


Wrocław University of Technology

June 7, 2010

Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09


Introduction
Overview:
- the problem of Named Entity Recognition,
- recognition of PERSON and COMPANY annotations,
- two corpora: Stock Exchange Reports from the economic domain and Police Reports from the security domain,
- combination of a machine learning approach with manually created rules,
- application of Hidden Markov Models.

The corpus was published at http://nlp.pwr.wroc.pl/gpw/download



Task Definition

We defined NEs as language expressions referring to extra-linguistic real or abstract objects of preselected kinds. We limited the Named Entity Recognition task to identifying expressions consisting of proper names referring to PERSON and COMPANY entities.

Examples of correct and incorrect expressions of the PERSON type:
- correct: R. Dolea, Marek Wiak, Luis Manuel Conceicao do Amaral,
- incorrect, person names being part of a company name: Moore Stephens Trzemalski, Krynicki i Partnerzy Kancelaria Biegłych Rewidentów Sp. z o.o.,
- incorrect, location: pl. Jana Pawła II.



Corpora of Economic Domain


Corpus of Stock Exchange Reports (CSER)
- 1 215 documents, 282 376 tokens,
- 670 PERSON and 3 238 COMPANY annotations,
- source: http://gpwinfostrefa.pl

Characteristics:
- a set of economic reports published by companies,
- very formal style of writing,
- many expressions starting with an upper case letter that are not proper names,
- many names of institutions, organizations, companies, people and locations.


Corpora of Security Domain


Corpus of Police Reports (CPR)
- 12 documents, 29 569 tokens,
- 555 PERSON and 121 COMPANY annotations,
- source: [Graliński et al., 2009].

Characteristics:
- a set of statements produced by witnesses and suspects,
- rather informal style of writing,
- many pseudonyms and common words used as proper names,
- many one-word person names.



Corpus Development

To annotate the corpora we developed and used the Inforex system.


System features:
- web-based: does not require installation (only a Firefox browser with JavaScript),
- remote: corpora are stored on a server,
- shared: corpora can be simultaneously annotated by many users.



Baselines
1. Heuristic: matches a sequence of words starting with an upper case letter. For COMPANY the name must additionally end with a legal-form marker, e.g.: SA, LLC, Spółka, AG, S.A., Sp., B.V.
2. Gazetteers: matches a sequence of words present in the dictionary of first names and last names (63 555 entries) or company names (6 200 entries) [Piskorski, 2004].
                        CSER                      CPR
               PERSON       COMPANY      PERSON       COMPANY
Heuristic
  Precision     0.89 %       0.76 %      19.35 %       0.19 %
  Recall       42.45 %       4.42 %      93.87 %       4.13 %
  F1-measure    1.75 %       1.29 %      32.09 %       0.36 %
Gazetteers
  Precision     9.61 %      37.01 %      46.79 %      21.05 %
  Recall       41.19 %      40.54 %       9.02 %       3.31 %
  F1-measure   15.59 %      38.69 %      15.12 %       5.71 %
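As an illustration (the actual implementation and the full dictionaries are not given in the slides), the two baselines could be sketched roughly as follows; the suffix set and gazetteer entries below are small assumed samples:

```python
# Illustrative sketch of the two baselines, NOT the original implementation.
COMPANY_SUFFIXES = {"SA", "S.A.", "Sp.", "LLC", "AG", "B.V.", "Spółka"}

def heuristic_company(tokens):
    """Match maximal runs of upper-case-initial words ending in a legal-form marker."""
    matches, run = [], []
    for tok in tokens + [""]:              # empty sentinel flushes the last run
        if tok[:1].isupper():
            run.append(tok)
        else:
            if run and run[-1] in COMPANY_SUFFIXES:
                matches.append(" ".join(run))
            run = []
    return matches

# Tiny hypothetical gazetteer standing in for the 63 555-entry dictionary.
FIRST_NAMES, LAST_NAMES = {"Jan", "Marek"}, {"Nowak", "Kowalski"}

def gazetteer_person(tokens):
    """Match first-name + last-name bigrams found in the dictionaries."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
            if a in FIRST_NAMES and b in LAST_NAMES]

text = "Prezes Jan Nowak reprezentuje Zakłady Chemiczne SA na giełdzie".split()
print(heuristic_company(text))   # ['Zakłady Chemiczne SA']
print(gazetteer_person(text))    # ['Jan Nowak']
```

Such exact matching explains the baseline numbers: the heuristic trades precision for recall on capitalized runs, while the gazetteer misses inflected Polish name forms.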


Recognition Based on HMM

LingPipe [Alias-i, 2008] implementation of HMM


- 7 hidden states for every annotation type,
- 3 additional states (BOS, EOS, middle token),
- Witten-Bell smoothing,
- first-best decoder based on Viterbi's algorithm.

Example:
Pan Jan Nowak został nominowany na stanowisko prezesa.
(Mr. Jan Nowak was nominated for the chairman position.)

(BOS) Pan (E-O-PER) Jan (B-PER) Nowak (E-PER) został (B-O-PER) nominowany (W-O) na (W-O) stanowisko (W-O) prezesa (W-O) . (W-O) (EOS)
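LingPipe's HMM chunker is a Java implementation; as a language-neutral sketch of the first-best decoding step, a minimal Viterbi decoder over a reduced toy state set is shown below. The states, probabilities and example sentence are made up for illustration and are not the model trained in the paper:

```python
import math

def viterbi(tokens, states, log_start, log_trans, log_emit):
    """First-best decoding: most probable hidden state sequence for tokens."""
    neg_inf = -math.inf
    # first column: start probability times emission probability
    col = {s: log_start.get(s, neg_inf) + log_emit[s].get(tokens[0], neg_inf)
           for s in states}
    backptrs = []
    for tok in tokens[1:]:
        prev, col, ptr = col, {}, {}
        for s in states:
            best = max(states,
                       key=lambda p: prev[p] + log_trans[p].get(s, neg_inf))
            col[s] = (prev[best] + log_trans[best].get(s, neg_inf)
                      + log_emit[s].get(tok, neg_inf))
            ptr[s] = best
        backptrs.append(ptr)
    # follow back-pointers from the best final state
    path = [max(states, key=lambda s: col[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model with made-up probabilities, reduced to three states.
LOG = math.log
STATES = ["B-PER", "E-PER", "W-O"]
LOG_START = {"B-PER": LOG(0.5), "W-O": LOG(0.5)}
LOG_TRANS = {"B-PER": {"E-PER": LOG(1.0)},
             "E-PER": {"W-O": LOG(1.0)},
             "W-O":   {"W-O": LOG(0.9), "B-PER": LOG(0.1)}}
LOG_EMIT = {"B-PER": {"Jan": LOG(0.9), "Nowak": LOG(0.05), "pracuje": LOG(0.05)},
            "E-PER": {"Nowak": LOG(0.9), "Jan": LOG(0.05), "pracuje": LOG(0.05)},
            "W-O":   {"pracuje": LOG(0.9), "Jan": LOG(0.05), "Nowak": LOG(0.05)}}

print(viterbi(["Jan", "Nowak", "pracuje"], STATES, LOG_START, LOG_TRANS, LOG_EMIT))
# ['B-PER', 'E-PER', 'W-O']
```

The full model additionally conditions the out-states on the neighbouring entity type (E-O-PER, B-O-PER), which is what lets the decoder use context words such as "Pan" as evidence for an upcoming person name.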



Single Domain Evaluation (PERSON)

We performed a 10-fold Cross Validation on the Stock Exchange Corpus for PERSON annotations.

              Heuristic   Gazetteers     HMM
Precision       0.89 %      9.61 %     64.74 %
Recall         42.45 %     41.19 %     93.73 %
F1-measure      1.75 %     15.59 %     76.59 %



Error Analysis
We have identified 10 types of errors.

No.  Error type                                              Matches
                                                      full  partial  total
 1   Name of institution, company, etc.                  38       91    129
 2   Name of location (street, place, etc.)              30       16     46
 3   Other proper names                                   2       10     12
 4   Phrases in English                                   -       21     21
 5   Incorrect annotation boundary                        -       35     35
 6   Common word starting from upper case character       -        6      6
 7   Common word starting from lower case character       -       26     26
 8   Single character                                     -        6      6
 9   Common word with a spelling error                    -        3      3
10   Other                                                -       46     46

A) 1, 2, 3 (incorrect type of annotation) -> recognition of COMPANY and LOCATION,
B) 4, 7, 8, 9 (lower case and non-alphabetic expressions) -> rule filtering,
C) 5 (incorrect annotation boundary) -> annotation merging and trimming.


Single Domain Evaluation (PERSON & COMPANY)

Referring to the A group of errors, we re-annotated the CSER with COMPANY annotations and repeated the 10-fold CV.

              PERSON (REV)   PERSON (COMB)   COMPANY
Precision       64.74 %*       78.63 %       76.56 %
Recall          93.73 %*       94.62 %       83.14 %
F1-measure      76.59 %*       85.89 %       79.71 %

* results from the previous 10-fold CV



Post-filtering

Referring to the B and C groups of errors, we applied two types of post-processing: annotation filtering and trimming.

PERSON (REV)
              HMM         +filtering   +trimming   +both
Precision    64.74 %*      76.27 %      64.85 %    75.82 %
Recall       93.73 %*      91.64 %      93.88 %    91.77 %
F1-measure   76.59 %*      83.25 %      76.71 %    83.03 %

PERSON (COMB)
              HMM         +filtering   +trimming   +both
Precision    78.63 %*      87.16 %      78.76 %    86.69 %
Recall       94.62 %*      91.33 %      94.77 %    91.48 %
F1-measure   85.89 %*      89.20 %      86.02 %    89.02 %

* results from the previous 10-fold CV
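The concrete filtering and trimming rules are not listed in the slides; the sketch below shows what such rule-based post-processing might look like, with the specific checks and the title list being assumptions for illustration:

```python
# Hypothetical post-processing sketch; the actual rules used in the paper
# are not given in the slides.

TRIM_TOKENS = {"Pan", "Pani"}  # assumed examples of trimmable title words

def keep_annotation(text: str) -> bool:
    """Filtering: drop lower-case, single-character and non-alphabetic spans."""
    return (len(text) > 1
            and text[0].isupper()
            and any(ch.isalpha() for ch in text))

def trim_annotation(tokens):
    """Trimming: strip assumed non-name tokens from the annotation boundaries."""
    while tokens and tokens[0] in TRIM_TOKENS:
        tokens = tokens[1:]
    while tokens and tokens[-1] in TRIM_TOKENS:
        tokens = tokens[:-1]
    return tokens

candidates = ["nowak", "X", "Jan Kowalski"]
print([c for c in candidates if keep_annotation(c)])   # ['Jan Kowalski']
print(trim_annotation(["Pan", "Jan", "Nowak"]))        # ['Jan', 'Nowak']
```

This matches the pattern in the results: filtering removes spurious matches and raises precision at a small cost in recall, while trimming mainly adjusts boundaries.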



Cross Domain Evaluation


The system was trained on the Corpus of Stock Exchange Reports and tested on the Corpus of Police Reports.

PERSON (REV)
              HMM         +filtering   +trimming   +both
Precision    27.73 %       62.91 %      32.16 %    53.71 %
Recall       48.47 %       48.29 %      56.22 %    56.04 %
F1-measure   35.28 %       54.64 %      40.92 %    54.85 %

PERSON (COMB)
              HMM         +filtering   +trimming   +both
Precision    29.81 %       69.75 %      37.13 %    58.33 %
Recall       39.64 %       39.46 %      49.37 %    49.19 %
F1-measure   34.03 %       50.40 %      42.38 %    53.37 %

COMPANY
              HMM         +filtering   +trimming   +both
Precision    12.30 %          -            -          -
Recall       56.20 %          -            -          -
F1-measure   20.18 %          -            -          -



Conclusion & Plans


Conclusion
- results of the single-domain evaluation are promising,
- simple rule-based post-processing of the HMM output can improve the final results,
- low domain portability: better utilization of gazetteers and rules is needed.

Plans
- extend the annotation schema with LOCATION & ORGANIZATION,
- collect a new corpus for cross-domain evaluation,
- develop new rules for post-processing (for example, to fix a problem with sentence segmentation),
- incorporate other sources of knowledge (rules and gazetteers for majority voting, plWordNet for generalization),
- new learning models: HMM including morphology, other ML methods with context features (preceding verbs, prepositions, etc.).

References

Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)

Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization. In: Kłopotek, M. A., Przepiórkowski, A., Wierzchoń, S. T., Trojanowski, K. (eds.) Recent Advances in Intelligent Information Systems, pp. 247-260. Academic Publishing House Exit (2009)

Piskorski, J.: Extraction of Polish Named Entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313-316. ELRA, Lisbon, Portugal (2004)

