POLYGLOT-NER: Massive Multilingual Named Entity Recognition
Rami Al-Rfou
Vivek Kulkarni Bryan Perozzi
Steven Skiena
Department of Computer Science
Stony Brook University
{ralrfou, vvkulkarni, bperozzi, skiena}@cs.stonybrook.edu
Next, we learn a model Ψ_y to score tag y given Φ_i^n, i.e., Ψ_y : Φ_i^n ↦ ℝ, using a neural network with one hidden layer of size h:

(3.1)    \Psi_y(\Phi_i^n) = s^T \tanh(W \Phi_i^n + b),

where W ∈ ℝ^{h×(2n+1)d} and s ∈ ℝ^h are the first and second layer weights of the neural network, and b ∈ ℝ^h are the bias units of the hidden layer. Finally, we construct a one-vs-all classifier F and penalize it by the following hinge loss:

J = \frac{1}{m} \sum_{i=1}^{m} \max\Big(0,\ 1 - \Psi_{t_i}(\Phi_i^n) + \max_{y \in Y,\, y \neq t_i} \Psi_y(\Phi_i^n)\Big),

where t_i is the correct tag of the word w_i, and m is the size of the training set.
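To make the scorer and the ranking hinge loss concrete, here is a minimal NumPy sketch, not the paper's implementation. The dimensions, the random initialization, and all variable names are illustrative assumptions; in particular, the sketch shares one hidden layer across tags and gives each tag y its own output vector s_y, which is one plausible reading of the one-vs-all setup.

import numpy as np

# Illustrative dimensions (assumptions, not the paper's settings):
# context half-window n, embedding size d, hidden size h, |Y| tags.
n, d, h, num_tags = 2, 64, 300, 4

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(h, (2 * n + 1) * d))  # first-layer weights
b = np.zeros(h)                                        # hidden-layer bias
S = rng.normal(scale=0.01, size=(num_tags, h))         # one s vector per tag y

def scores(phi):
    # Psi_y(phi) = s_y^T tanh(W phi + b) for every tag y (Eq. 3.1).
    return S @ np.tanh(W @ phi + b)

def hinge_loss(phis, tags):
    # J = (1/m) sum_i max(0, 1 - Psi_{t_i} + max_{y != t_i} Psi_y).
    total = 0.0
    for phi, t in zip(phis, tags):
        psi = scores(phi)
        runner_up = np.max(np.delete(psi, t))  # best-scoring wrong tag
        total += max(0.0, 1.0 - psi[t] + runner_up)
    return total / len(phis)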
3.3 Optimization We learn the parameters θ = (s, W, b) via backpropagation with stochastic gradient descent. As the performance of stochastic optimization depends on a good choice of the learning rate, we automate the learning rate selection through an adaptive update procedure [8]. This results in a separate learning rate η_i for each individual parameter.
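The per-parameter rates η_i suggest an AdaGrad-style scheme; the following sketch assumes that reading of [8] and is illustrative only.

import numpy as np

def adagrad_update(param, grad, accum, eta=0.1, eps=1e-8):
    # Accumulate squared gradients; each parameter i then receives its
    # own effective learning rate eta_i = eta / sqrt(accum_i + eps).
    accum += grad ** 2
    param -= eta * grad / np.sqrt(accum + eps)
    return param, accum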
² Available: http://bit.ly/embeddings
³ Notice that we obtain higher results here than annotators trained on Wikipedia datasets (Table 6), because the training and testing datasets belong to the same domain.
4 Extracting Entity Mentions
Our procedure for creating a named entity training corpus from Wikipedia consists of two steps. First, we find which Wikipedia articles correspond to entities, using Freebase [4]. Second, we cope with the missing annotations problem through oversampling from the entity classes and extending the annotation coverage using an exact surface form matching rule.

4.1 Article Categorization Freebase maintains several attributes for each Wikipedia article, covering around 40 different Wikipedia languages. We categorize the article topics into one of the following categories, Y = {PERSON, LOCATION, ORGANIZATION, NONENTITY}. For each entity category we specify the corresponding Freebase attributes, as the following:
• LOCATION: /location/{citytown, country, region, continent, neighborhood, administrative_division}
• ORGANIZATION: /sports/sports_team, /book/newspaper, /organization/organization
• PERSON: /people/person

The result is a mapping of Wikipedia page titles and their redirects to Y. If an internal link points to any of these titles, we consider it an entity mention. Table 3 shows the percentage of pages that are covered by Freebase for some of the languages we consider. The entity coverage varies with each language, and this greatly biases the label distribution of the generated training data. We will overcome this bias in the distribution of entity examples using oversampling (see Section 4.2.1).

Language Coverage    Language Coverage
Malay    37.8%       Arabic   15.6%
English  35.2%       Dutch    14.7%
Spanish  24.8%       Swedish  10.2%
Greek    24.3%       Hindi     8.8%

Table 3: Percentage of Wikipedia articles identified by Freebase as entity-based articles. There is a wide disparity across languages.

[Figure 1: Performance of oversampling entities (POLYGLOT-NER S7) on DEV datasets: Exact F1 versus ρ for English, Spanish, and Dutch. ρ ≈ 2.5% corresponds to the original distribution of positive labels in Wikipedia text.]
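To make the Section 4.1 mapping concrete, a minimal sketch follows. The attribute lists are transcribed from the bullets above; the function name and the input format (a list of Freebase type strings per article) are assumptions.

# Freebase attributes per category, as listed in Section 4.1.
ATTRIBUTE_TO_CATEGORY = {
    "/location/citytown": "LOCATION",
    "/location/country": "LOCATION",
    "/location/region": "LOCATION",
    "/location/continent": "LOCATION",
    "/location/neighborhood": "LOCATION",
    "/location/administrative_division": "LOCATION",
    "/sports/sports_team": "ORGANIZATION",
    "/book/newspaper": "ORGANIZATION",
    "/organization/organization": "ORGANIZATION",
    "/people/person": "PERSON",
}

def categorize_article(freebase_types):
    # Map an article's Freebase type strings to a category in Y;
    # pages matching no listed attribute fall back to NONENTITY.
    for t in freebase_types:
        if t in ATTRIBUTE_TO_CATEGORY:
            return ATTRIBUTE_TO_CATEGORY[t]
    return "NONENTITY"

The resulting title-to-category map is then extended to redirects, and internal links that point to a mapped title are taken as entity mentions.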
4.2 Missing Links Unfortunately, generating a training dataset directly from the link structure results in very poor performance, with DEV F1 < 10% in English, Spanish, and Dutch, due to missing annotations. This is a consequence of Wikipedia style guidelines⁴. Editors are instructed to link the first mention in the page, but not later ones. This results in leaving most entity mentions unmarked. Table 4 examines this effect by contrasting the percentage of words that are covered by entity phrases in both CoNLL and Wikipedia.

          CoNLL    WIKI
English   16.76%   2.34%
Spanish   12.60%   2.12%
Dutch      9.29%   2.28%

Table 4: Percentage of words that are covered by entity phrases in each corpus.

We can view the generated examples as two sets: one that is truly positive, and the other a mix of negative and positive examples. Our task is to learn an annotator only from the positive examples. In such a setting, [10, 17, 18] show that treating the unlabeled set as negative examples, while modifying the objective loss to penalize misclassification of each set differently, outperforms other heuristics and iterative EM-based methods. Changing the label distribution by oversampling the positive labels achieves a similar effect.

4.2.1 Oversampling To overcome the effect of missing annotations, we correct the label distribution by oversampling from the entity classes. The intuition is that untagged words are not necessarily non-entities. Conversely, we have high confidence in the links which have been explicitly tagged by users. To reflect the difference in confidence levels, we categorize our labels into two subcategories: Y⁺ = {PERSON, LOCATION, ORGANIZATION} and Y⁻ = {NONENTITY}. A training example Φ_i^n is considered positive if F(Φ_i^n) ∈ Y⁺ and negative otherwise. We define the oversampling ratio ρ to be

\rho = \frac{\sum_{i=1}^{m} [F(\Phi_i^n) \in Y^+]}{m},

where [x] is the indicator function and m is the total number of training examples.

Our goal here is to construct a subset of our training corpus where ρ is higher than in the original training dataset. We sample the positive class uniformly without replacement. This ensures we do not change the conditional distribution of a specific entity class given that it is a positive example.

Figure 1 shows the effect of oversampling. The first point corresponds to the original distribution of positive labels in Wikipedia text, where ρ ≈ 2.5%. We observe that regardless of the chosen ρ, oversampling improves the results. This improvement is quite stable when 0.25 < ρ < 0.75, and the Exact F1 score is increased by at least 40% for all languages we consider. We choose ρ = 0.5 as the value for our testing phase in Section 5, as it produces the best results across the three languages under investigation.

⁴ http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking
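A sketch of one way to realize the subset construction in Section 4.2.1: positives are drawn uniformly without replacement until a target ratio ρ is met within a fixed subset size. The helper names, the fixed-size assumption, and the uniform treatment of negatives are illustrative assumptions, not the paper's procedure verbatim.

import random

def build_subset(examples, labels, positive_set, target_rho, subset_size, seed=0):
    # Partition indices into positives (label in Y+) and negatives (Y-).
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y in positive_set]
    neg = [i for i, y in enumerate(labels) if y not in positive_set]

    # Uniform sampling without replacement keeps the conditional
    # distribution over entity classes within the positives unchanged.
    n_pos = min(len(pos), round(target_rho * subset_size))
    n_neg = min(len(neg), subset_size - n_pos)
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return [(examples[i], labels[i]) for i in chosen]

With target_rho = 0.5, half of the resulting subset carries entity labels, matching the value chosen for the testing phase.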
4.2.2 Exact Surface Form Matching While oversampling mitigates the effect of the skewed label distribution, it does not address the stylistic bias with which Wikipedia editors create links. The first bias is to link only the first mention of an entity in an article. This canonical mention is usually the full name of an entity, and not the abbreviated form used throughout the remainder of the article. We found that in 200K examples tagged with PERSON, 45K examples belong to three-term mentions, 140K to two-term mentions, and only 15K to single-term mentions. This bias against single-term mentions in links does not reflect the true distribution of named entities in the text, resulting in annotators which tag Noam Chomsky but not Chomsky. The second bias is that there are no self-referential links inside an entity's article (e.g., on Barack Obama's page, none of the sentences mentioning him are linked).

In order to extend our annotation coverage on each article, we apply a simple rule of surface form matching. If a word appeared in an entity mention, or in the title of the current article (to address the second bias), we consider all appearances of this word in the current article to be annotated with the same tag. In the case of multiple tags for the same word, we use the most frequent tag. This process can be viewed as a first-order coreference resolution where we link mentions using exact string matching.

For example, after this procedure, every mention of 'Barack Obama', Barack, and Obama in the article on Barack_Obama will be considered a link referring to a PERSON entity. In order to avoid mislabeling functional words which appear in links (e.g., of, the, de), we exclude the most frequent 1000 words in our vocabulary.

[Figure 2: Performance improvement (Exact F1) on CoNLL DEV datasets after applying the coreference stage on Wikipedia before oversampling, plotted against ρ for English, Spanish, and Dutch.]

Figure 2 shows the improvement this stage adds when it is applied to Wikipedia before oversampling. The F1 improvements that we observe in English, Spanish, and Dutch are significant, especially when ρ ≤ 0.5. Most of this improvement is due to higher recall on the tag PERSON.
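A sketch of the first-order coreference rule described above. The per-article data layout (a token list plus a parallel tag list, with None for untagged tokens) and all names are assumptions.

from collections import Counter

def extend_annotations(tokens, tags, title_tokens, title_tag, frequent_words):
    # Collect candidate tags for every word that occurs inside a link
    # or in the article title; the 1000 most frequent vocabulary words
    # (frequent_words) are excluded to avoid mislabeling functional words.
    votes = {}
    for tok, tag in zip(tokens, tags):
        if tag is not None and tok not in frequent_words:
            votes.setdefault(tok, Counter())[tag] += 1
    for tok in title_tokens:
        if tok not in frequent_words:
            votes.setdefault(tok, Counter())[title_tag] += 1

    # Propagate: tag every untagged occurrence of a seen word with that
    # word's most frequent tag (ties broken arbitrarily).
    new_tags = list(tags)
    for i, tok in enumerate(tokens):
        if new_tags[i] is None and tok in votes:
            new_tags[i] = votes[tok].most_common(1)[0][0]
    return new_tags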
5 Results
In this section, we analyze the errors produced by the system for qualitative assessment. In addition, we evaluate the performance of POLYGLOT-NER on CoNLL datasets to demonstrate the efficiency of the proposed solutions for dealing with missing links in the Wikipedia markup.

Table 5 shows annotated examples for 11 different languages. Table 5a shows correctly annotated examples and Table 5b a sample of the mistakes our annotators make. Analyzing our good examples shows that our system performs best on the PERSON category, even for names that are transliterated from other languages (see the Russian, Arabic, and Korean examples). Moreover, our system is still able to identify entities in mixed-language scenarios, for example, the appearance of 'Spain' in the Arabic example and 'Pešek' in the Spanish example. This robustness stems from two factors: First, the embeddings vocabulary of a specific language includes frequent foreign words. Second, our annotators are able to capture sufficient local contextual clues.

Our errors can be grouped into three categories (see Table 5b). First, common words {River, House} that appear in organizations are hard to identify. Second, our system does not consistently tag demonyms (nationalities), as the Russian, French, Arabic, and Greek examples show. Third, misclassification errors occur; common cases include confusion between the LOCATION and ORGANIZATION tags in the case of nested entities (see the Chinese example), or between the PERSON and ORGANIZATION tags when company names are referred to in the same context as that of persons (see the Spanish example).

In addition to the qualitative analysis, we evaluate our models quantitatively on the CoNLL 2002 Spanish and Dutch datasets, and the CoNLL 2003 English dataset. We show results of our models trained on Wikipedia and evaluated on CoNLL in Table 6. Observe that oversampling (POLYGLOT-NER S7) alone is able to achieve competitive results. With exact surface form matching applied to the data first (POLYGLOT-NER S6+S7), we outperform previous work on English and Spanish without applying any language-specific rules. Most of POLYGLOT-NER's Dutch errors appear in the ORGANIZATION category.

DEV                      English  Spanish  Dutch
POLYGLOT-NER S7          62.9     56.7     53.2
POLYGLOT-NER S6+S7       73.3     59.3     59.7
Nothman et al. [22]      67.9     60.7     62.2

TEST                     English  Spanish  Dutch
POLYGLOT-NER S7          58.5     58.5     51.5
POLYGLOT-NER S6+S7       71.3     63.0     59.6
Nothman et al. [22]      61.3     61.0     64.0

Table 6: Cross-domain performance measured by Exact F1 on the TEST and DEV sections of the CoNLL corpora.

Training on Wikipedia results in lower scores on CoNLL testing datasets compared to models trained on CoNLL directly (see Table 2), due to two main factors. First, this is an out-of-domain evaluation. Second, there are a variety of orthographic and contextual differences between Wikipedia and CoNLL. Common differences between the datasets include trailing periods, leading delimiters, modifiers, and annotators' disagreements. Specifically, the English CoNLL dataset has an over-representation of upper-case words and sports teams. In the Dutch dataset, country names were abbreviated after the journalist's name; for example, Spain is mapped to Spa and Italy to Ita. This leads to more out-of-vocabulary (OOV) terms for which Polyglot does not have embeddings. Since we do not rely on any CoNLL-tailored preprocessing steps, such notational and stylistic differences affect our performance more than other systems.
Language | Sentence | Translation
English | Simien was traded from the Heat along with Antoine Walker and Michael Doleac to the Minnesota Timberwolves on October 24, 2007, for Ricky Davis and Mark Blount. | -
Hungarian | Dimitri beszélt egy utat Rómába. | Dimitri talked about a trip to Rome.
Spanish | Pešek nació en Praga y estudió dirección de orquesta piano en la Academia de Artes allí, con Václav Smetacek. | Pešek born in Prague and studied orchestra direction piano at the Academy of Arts there, Vaclav Smetacek.
Russian | Уроженец Рио-де-Жанейро, Хосе Родригес Триндади использовал сокращенную форму как своим сценическим псевдонимом. | A native of Rio de Janeiro, Jose Rodriguez Trindade used as a shortened form of his stage name.
Korean | 도널드 스미스는 1976년에 자이드 압둘 아지즈로 그의 이름을 바꿨다. | Donald Smith in 1976 changed his name to Zaid Abdul Aziz.
French | La France veut satisfaire à ses engagements envers l'Union européenne. | France wants to meet its commitments to the European Union.
Turkish | Erdoğan, Türkiye'de twitter yasaklandı. | Erdogan banned twitter in Turkey.
Arabic | [Arabic source sentence unrecoverable from the extraction] | Said Gareth Bell, the star of Real Madrid, said that winning the Champions League will remain with him "forever", and that after his goal helped the club to defeat Atletico Madrid in Spain.
Indonesian | Rendjambe meninggal dalam keadaan tidak jelas, yang mengakibatkan kerusuhan oleh pendukung oposisi marah di Port Gentil dan Libreville. | Rendjambe died in unclear circumstances, which resulted in riots by angry opposition supporters in Port Gentil and Libreville.
Chinese | 新华社 25 日报道援引新疆公安厅。 | Xinhua News Agency report quoted the 25th Xinjiang Public Security Department.
Greek | Ενλίλ έφερε τα Γκούτιους κάτω από το λόφους ανατολικά του Τίγρη, να φέρει το θάνατο σε ολόκληρη τη Μεσοποταµία. | Enlil brought the Gutians down from the hills east of the Tigris, to bring death throughout Mesopotamia.

Table 5: POLYGLOT-NER results on several languages. Color code: {Person, Location, Organization}. We denote errors as the following: false positive, false negative, and label misclassification. Translations are acquired through Google Translate, and labels on translated phrases correspond to their annotations in the source language.
For an entity class e, let C_e(S) denote the number of entities of class e annotated in sentence S, where (S_1, S_2) ∈ (L_1, L_2) are aligned sentence pairs from a corpus L_1 and its machine translation L_2. We define the normalization constant Z_e and the missing (E_M) and added (E_A) entity errors as the following:

Z_e = \sum_{S \in L_1} C_e(S),

E_M(e) = \frac{1}{Z_e} \sum_{(S_1, S_2) \in (L_1, L_2)} |C_e(S_1) - C_e(S_2)|_+,

E_A(e) = \frac{1}{Z_e} \sum_{(S_1, S_2) \in (L_1, L_2)} |C_e(S_2) - C_e(S_1)|_+,

where |x|_+ = max(x, 0).
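A small sketch of computing these quantities from aligned per-sentence counts. The data layout (two parallel lists of counts for a fixed class e) is an assumption.

def translation_errors(counts_l1, counts_l2):
    # counts_l1[i], counts_l2[i] hold C_e(S1) and C_e(S2) for the i-th
    # aligned sentence pair; returns (E_M(e), E_A(e)).
    # Assumes at least one entity of class e appears in L1, so Z_e > 0.
    z = sum(counts_l1)  # Z_e
    missing = sum(max(c1 - c2, 0) for c1, c2 in zip(counts_l1, counts_l2))
    added = sum(max(c2 - c1, 0) for c1, c2 in zip(counts_l1, counts_l2))
    return missing / z, added / z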
[Figure 3: Scatter plots of E_A versus E_M across languages, one point per language code, shown in three panels.]
[Figure 4: Aggregated errors over all categories versus Wikipedia size for each language. Panel (b): E_A of all entity classes versus Wikipedia article count.]

7 Conclusion & Future Work
We successfully built a multilingual NER system for 40 languages with no language-specific knowledge or expertise. We use automatically learned features and apply language-agnostic data processing techniques. The system outperforms previous work in several languages, and is competitive in the rest, on human-annotated datasets. We demonstrate its performance on the remaining languages through a comparative analysis using machine translation. Our approach yields highly consistent performance across all languages. As future work, Wikipedia cross-lingual links will be used in combination with Freebase to extend our approach to all languages.