
Named Entity Recognition in the Domain of Polish Stock Exchange Reports

Michał Marcińczuk and Maciej Piasecki


Wrocław University of Technology

June 7, 2010

Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09


Introduction
Overview:
- the problem of Named Entity Recognition,
- recognition of PERSON and COMPANY annotations,
- two corpora: Stock Exchange Reports from the economic domain and Police Reports from the security domain,
- combination of a machine learning approach with manually created rules,
- application of Hidden Markov Models.

The corpus was published at http://nlp.pwr.wroc.pl/gpw/download



Task Definition

We defined NEs as language expressions referring to extra-linguistic real or abstract objects of preselected kinds. We limited the Named Entity Recognition task to identifying expressions consisting of proper names referring to PERSON and COMPANY entities.

Examples of correct and incorrect expressions of the PERSON type:
- correct: R. Dolea, Marek Wiak, Luis Manuel Conceicao do Amaral,
- incorrect, person names being part of a company name: Moore Stephens Trzemalski, Krynicki i Partnerzy Kancelaria Biegłych Rewidentów Sp. z o.o.,
- incorrect, location: pl. Jana Pawła II.



Corpora of Economic Domain


Corpus of Stock Exchange Reports (CSER)
- 1 215 documents, 282 376 tokens,
- 670 PERSON and 3 238 COMPANY annotations,
- source: http://gpwinfostrefa.pl

Characteristics:
- a set of economic reports published by companies,
- very formal style of writing,
- many expressions starting with an upper case letter that are not proper names,
- many names of institutions, organizations, companies, people and locations.


Corpora of Security Domain


Corpus of Police Reports (CPR)
- 12 documents, 29 569 tokens,
- 555 PERSON and 121 COMPANY annotations,
- source: [Graliński et al., 2009].

Characteristics:
- a set of statements produced by witnesses and suspects,
- rather informal style of writing,
- many pseudonyms and common words used as proper names,
- many one-word person names.



Corpus Development

To annotate the corpora we developed and used the Inforex system.


System features:
- web-based: does not require installation (only a Firefox browser with JavaScript),
- remote: corpora are stored on a server,
- shared: corpora can be simultaneously annotated by many users.



Baselines
1. Heuristic: matches a sequence of words starting with an upper case letter. For COMPANY the name must additionally end with a legal-form marker, e.g.: SA, LLC, Spółka, AG, S.A., Sp., B.V.
2. Gazetteers: matches a sequence of words present in the dictionary of first names and last names (63 555 entries) or company names (6 200 entries) [Piskorski, 2004].
                        CSER                      CPR
               PERSON       COMPANY      PERSON       COMPANY
Heuristic
  Precision     0.89 %       0.76 %      19.35 %       0.19 %
  Recall       42.45 %       4.42 %      93.87 %       4.13 %
  F1-measure    1.75 %       1.29 %      32.09 %       0.36 %
Gazetteers
  Precision     9.61 %      37.01 %      46.79 %      21.05 %
  Recall       41.19 %      40.54 %       9.02 %       3.31 %
  F1-measure   15.59 %      38.69 %      15.12 %       5.71 %
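As an illustration (the actual implementation and the full dictionaries are not given in the slides), the two baselines could be sketched roughly as follows; the suffix set and gazetteer entries below are small assumed samples:

```python
# Illustrative sketch of the two baselines, NOT the original implementation.
COMPANY_SUFFIXES = {"SA", "S.A.", "Sp.", "LLC", "AG", "B.V.", "Spółka"}

def heuristic_company(tokens):
    """Match maximal runs of upper-case-initial words ending in a legal-form marker."""
    matches, run = [], []
    for tok in tokens + [""]:              # empty sentinel flushes the last run
        if tok[:1].isupper():
            run.append(tok)
        else:
            if run and run[-1] in COMPANY_SUFFIXES:
                matches.append(" ".join(run))
            run = []
    return matches

# Tiny hypothetical gazetteer standing in for the 63 555-entry dictionary.
FIRST_NAMES, LAST_NAMES = {"Jan", "Marek"}, {"Nowak", "Kowalski"}

def gazetteer_person(tokens):
    """Match first-name + last-name bigrams found in the dictionaries."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
            if a in FIRST_NAMES and b in LAST_NAMES]

text = "Prezes Jan Nowak reprezentuje Zakłady Chemiczne SA na giełdzie".split()
print(heuristic_company(text))   # ['Zakłady Chemiczne SA']
print(gazetteer_person(text))    # ['Jan Nowak']
```

Such exact matching explains the baseline numbers: the heuristic trades precision for recall on capitalized runs, while the gazetteer misses inflected Polish name forms.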


Recognition Based on HMM

LingPipe [Alias-i, 2008] implementation of HMM


- 7 hidden states for every annotation type,
- 3 additional states (BOS, EOS, middle token),
- Witten-Bell smoothing,
- first-best decoder based on Viterbi's algorithm.

Example:
Pan Jan Nowak został nominowany na stanowisko prezesa.
(Mr. Jan Nowak was nominated for the chairman position.)

(BOS) Pan (E-O-PER) Jan (B-PER) Nowak (E-PER) został (B-O-PER) nominowany (W-O) na (W-O) stanowisko (W-O) prezesa (W-O) . (W-O) (EOS)
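LingPipe's HMM chunker is a Java implementation; as a language-neutral sketch of the first-best decoding step, a minimal Viterbi decoder over a reduced toy state set is shown below. The states, probabilities and example sentence are made up for illustration and are not the model trained in the paper:

```python
import math

def viterbi(tokens, states, log_start, log_trans, log_emit):
    """First-best decoding: most probable hidden state sequence for tokens."""
    neg_inf = -math.inf
    # first column: start probability times emission probability
    col = {s: log_start.get(s, neg_inf) + log_emit[s].get(tokens[0], neg_inf)
           for s in states}
    backptrs = []
    for tok in tokens[1:]:
        prev, col, ptr = col, {}, {}
        for s in states:
            best = max(states,
                       key=lambda p: prev[p] + log_trans[p].get(s, neg_inf))
            col[s] = (prev[best] + log_trans[best].get(s, neg_inf)
                      + log_emit[s].get(tok, neg_inf))
            ptr[s] = best
        backptrs.append(ptr)
    # follow back-pointers from the best final state
    path = [max(states, key=lambda s: col[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model with made-up probabilities, reduced to three states.
LOG = math.log
STATES = ["B-PER", "E-PER", "W-O"]
LOG_START = {"B-PER": LOG(0.5), "W-O": LOG(0.5)}
LOG_TRANS = {"B-PER": {"E-PER": LOG(1.0)},
             "E-PER": {"W-O": LOG(1.0)},
             "W-O":   {"W-O": LOG(0.9), "B-PER": LOG(0.1)}}
LOG_EMIT = {"B-PER": {"Jan": LOG(0.9), "Nowak": LOG(0.05), "pracuje": LOG(0.05)},
            "E-PER": {"Nowak": LOG(0.9), "Jan": LOG(0.05), "pracuje": LOG(0.05)},
            "W-O":   {"pracuje": LOG(0.9), "Jan": LOG(0.05), "Nowak": LOG(0.05)}}

print(viterbi(["Jan", "Nowak", "pracuje"], STATES, LOG_START, LOG_TRANS, LOG_EMIT))
# ['B-PER', 'E-PER', 'W-O']
```

The full model additionally conditions the out-states on the neighbouring entity type (E-O-PER, B-O-PER), which is what lets the decoder use context words such as "Pan" as evidence for an upcoming person name.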



Single Domain Evaluation (PERSON)

We performed a 10-fold Cross Validation on the Stock Exchange Corpus for PERSON annotations.

              Heuristic   Gazetteers     HMM
Precision       0.89 %      9.61 %     64.74 %
Recall         42.45 %     41.19 %     93.73 %
F1-measure      1.75 %     15.59 %     76.59 %



Error Analysis
We have identified 10 types of errors.

No.  Error type                                              Matches
                                                      full  partial  total
 1   Name of institution, company, etc.                  38       91    129
 2   Name of location (street, place, etc.)              30       16     46
 3   Other proper names                                   2       10     12
 4   Phrases in English                                   -       21     21
 5   Incorrect annotation boundary                        -       35     35
 6   Common word starting from upper case character       -        6      6
 7   Common word starting from lower case character       -       26     26
 8   Single character                                     -        6      6
 9   Common word with a spelling error                    -        3      3
10   Other                                                -       46     46

A) 1, 2, 3 (incorrect type of annotation) -> recognition of COMPANY and LOCATION,
B) 4, 7, 8, 9 (lower case and non-alphabetic expressions) -> rule filtering,
C) 5 (incorrect annotation boundary) -> annotation merging and trimming.


Single Domain Evaluation (PERSON & COMPANY)

Referring to the A group of errors, we re-annotated the CSER with COMPANY annotations and repeated the 10-fold CV.

              PERSON (REV)   PERSON (COMB)   COMPANY
Precision       64.74 %*       78.63 %       76.56 %
Recall          93.73 %*       94.62 %       83.14 %
F1-measure      76.59 %*       85.89 %       79.71 %

* results from the previous 10-fold CV



Post-filtering

Referring to the B and C groups of errors, we applied two types of post-processing: annotation filtering and trimming.

PERSON (REV)
              HMM         +filtering   +trimming   +both
Precision    64.74 %*      76.27 %      64.85 %    75.82 %
Recall       93.73 %*      91.64 %      93.88 %    91.77 %
F1-measure   76.59 %*      83.25 %      76.71 %    83.03 %

PERSON (COMB)
              HMM         +filtering   +trimming   +both
Precision    78.63 %*      87.16 %      78.76 %    86.69 %
Recall       94.62 %*      91.33 %      94.77 %    91.48 %
F1-measure   85.89 %*      89.20 %      86.02 %    89.02 %

* results from the previous 10-fold CV
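The concrete filtering and trimming rules are not listed in the slides; the sketch below shows what such rule-based post-processing might look like, with the specific checks and the title list being assumptions for illustration:

```python
# Hypothetical post-processing sketch; the actual rules used in the paper
# are not given in the slides.

TRIM_TOKENS = {"Pan", "Pani"}  # assumed examples of trimmable title words

def keep_annotation(text: str) -> bool:
    """Filtering: drop lower-case, single-character and non-alphabetic spans."""
    return (len(text) > 1
            and text[0].isupper()
            and any(ch.isalpha() for ch in text))

def trim_annotation(tokens):
    """Trimming: strip assumed non-name tokens from the annotation boundaries."""
    while tokens and tokens[0] in TRIM_TOKENS:
        tokens = tokens[1:]
    while tokens and tokens[-1] in TRIM_TOKENS:
        tokens = tokens[:-1]
    return tokens

candidates = ["nowak", "X", "Jan Kowalski"]
print([c for c in candidates if keep_annotation(c)])   # ['Jan Kowalski']
print(trim_annotation(["Pan", "Jan", "Nowak"]))        # ['Jan', 'Nowak']
```

This matches the pattern in the results: filtering removes spurious matches and raises precision at a small cost in recall, while trimming mainly adjusts boundaries.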



Cross Domain Evaluation


The system was trained on the Corpus of Stock Exchange Reports and tested on the Corpus of Police Reports.

PERSON (REV)
              HMM         +filtering   +trimming   +both
Precision    27.73 %       62.91 %      32.16 %    53.71 %
Recall       48.47 %       48.29 %      56.22 %    56.04 %
F1-measure   35.28 %       54.64 %      40.92 %    54.85 %

PERSON (COMB)
              HMM         +filtering   +trimming   +both
Precision    29.81 %       69.75 %      37.13 %    58.33 %
Recall       39.64 %       39.46 %      49.37 %    49.19 %
F1-measure   34.03 %       50.40 %      42.38 %    53.37 %

COMPANY
              HMM         +filtering   +trimming   +both
Precision    12.30 %          -            -          -
Recall       56.20 %          -            -          -
F1-measure   20.18 %          -            -          -



Conclusion & Plans


Conclusion
- results of the single-domain evaluation are promising,
- simple rule-based post-processing of the HMM output can improve the final results,
- low domain portability: better utilization of gazetteers and rules is needed.

Plans
- extend the annotation schema with LOCATION & ORGANIZATION,
- collect a new corpus for cross-domain evaluation,
- develop new rules for post-processing (for example, to fix a problem with sentence segmentation),
- incorporate other sources of knowledge (rules and gazetteers for majority voting, plWordNet for generalization),
- new learning models: HMM including morphology, other ML methods with context features (preceding verbs, prepositions, etc.).

References

Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)

Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization. In: Kłopotek, M. A., Przepiórkowski, A., Wierzchoń, S. T., Trojanowski, K. (eds.) Recent Advances in Intelligent Information Systems, pp. 247-260. Academic Publishing House Exit (2009)

Piskorski, J.: Extraction of Polish Named Entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313-316. ELRA, Lisbon, Portugal (2004)

