1 Introduction
For example, in MUC-style annotation a text fragment is marked up as:
[<TIMEX TYPE="DATE">22 September 2010</TIMEX> <ENAMEX TYPE="ORGANIZATION">Brisbane Convention Center</ENAMEX> <ENAMEX TYPE="LOCATION">Brisbane, Australia</ENAMEX>].
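A minimal sketch of how such MUC-style markup can be read back out of annotated text; the regular expression and sample string below are illustrative, not part of any particular system:

```python
import re

# Match MUC-style <ENAMEX>/<TIMEX> elements; the backreference \1 ensures
# the closing tag name matches the opening one.
TAG_RE = re.compile(r'<(ENAMEX|TIMEX) TYPE="([A-Z]+)">(.*?)</\1>')

def extract_entities(text):
    """Return (tag, type, surface string) triples from annotated text."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in TAG_RE.finditer(text)]

sample = ('<TIMEX TYPE="DATE">22 September 2010</TIMEX> '
          '<ENAMEX TYPE="ORGANIZATION">Brisbane Convention Center</ENAMEX> '
          '<ENAMEX TYPE="LOCATION">Brisbane, Australia</ENAMEX>')
```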
One of the major challenges in NER is that there are frequent overlaps among classes of
named entities. The common situations are:
1. Common noun vs. proper noun: a common noun sometimes occurs as a person name,
such as "Suraj", which means sun, creating ambiguity between common noun and
proper noun.
2. Organization vs. person name: "Tata" occurs as a person name as well as an
organization name.
3. Organization vs. place name: in "Mumbai meets Chennai at Bangalore", Mumbai
and Chennai are names of teams rather than names of cities.
4. Person name vs. place name: the word "Kashi" is used as a person name as well as
the name of a place.
5. Moreover, there exist words or word sequences, such as Thinking Machines (a
company) or Gates (a person), that can occur in contexts where they do not refer
to named entities.
Rule-based NER focuses on the extraction of names using handcrafted rules. Several
rule-based NER systems for English achieve an F-measure of 88%-92% [31][39]. This
approach, however, lacks portability and robustness: a significant number of rules is
needed to maintain optimal performance, resulting in high maintenance costs.
The main advantage of machine learning (ML) approaches over the rule-based approach is
that the former are trainable and can be adapted to different domains, and their
maintenance cost is much smaller. ML approaches to NER employ a statistical model.
Representative ML methods used in NER are the Hidden Markov Model (HMM) [23][24] and
KRDL's system [41] for Chinese NER, Conditional Random Fields (CRF) [20], Support
Vector Machines (SVM) [6], Maximum Entropy (ME) [3] and Decision Trees (DT) [15][36].
Besides these, a variant of Eric Brill's transformation-based rules [4] has been applied
to the problem [2]. These models are primarily of two types: supervised and unsupervised.
Hidden Markov Model (HMM). In the context of NER, an HMM is defined as follows.
Given a word sequence

W = (W_0, f_0) (W_1, f_1) .... (W_n, f_n),

where W_j denotes a word and f_j denotes a single-token feature set associated with
word W_j, the goal of NER is to find the optimal NE tag sequence T = t_0 t_1 t_2 .... t_n
that maximizes the conditional probability Pr(T | W) [23]. By Bayes' rule, this is
equivalent to maximizing the joint probability Pr(W, T). This joint probability can be
computed by a bi-gram HMM as follows:

Pr(W, T) = \prod_i Pr({W_i, f_i}, t_i | {W_{i-1}, f_{i-1}}, t_{i-1}).

In the tag sequence T, each tag consists of three parts: the NE class, the boundary
class (indicating the position of the current word in the NE) and the feature set.
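The bi-gram decoding this describes can be sketched with a toy Viterbi decoder. The tag set, transition and emission tables below are illustrative values, not trained estimates, and the token feature sets are dropped for brevity:

```python
import math

TAGS = ["O", "PER"]
# Toy bi-gram transition probabilities Pr(t_i | t_{i-1}); "<s>" is a start tag.
trans = {("<s>", "O"): 0.8, ("<s>", "PER"): 0.2,
         ("O", "O"): 0.7, ("O", "PER"): 0.3,
         ("PER", "O"): 0.6, ("PER", "PER"): 0.4}
# Toy emission probabilities Pr(W_i | t_i).
emit = {("O", "meets"): 0.5, ("PER", "meets"): 0.01,
        ("O", "Tata"): 0.05, ("PER", "Tata"): 0.6}

def viterbi(words):
    """Return the tag sequence maximizing Pr(W, T) under the toy model."""
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(trans[("<s>", t)] * emit.get((t, words[0]), 1e-6)), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            # Pick the best previous tag for current tag t.
            prev, (lp, path) = max(
                ((p, best[p]) for p in TAGS),
                key=lambda x: x[1][0] + math.log(trans[(x[0], t)]))
            new[t] = (lp + math.log(trans[(prev, t)])
                      + math.log(emit.get((t, w), 1e-6)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]
```

Under these toy tables, "Tata" is pulled toward the PER tag by its emission probability, while "meets" stays O.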
Conditional Random Fields (CRF). CRFs [20] are a type of discriminative probabilistic
model used for labelling and segmenting sequential data such as natural language text or
biological sequences. They can incorporate a large number of arbitrary, non-independent
features. The conditional probability of a state sequence S = (s_1, s_2, ..., s_T) given
an observation sequence O = (o_1, o_2, ..., o_T) is calculated as

P(s|o) = (1 / Z_o) \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right),

where the normalization factor Z_o sums over all state sequences:

Z_o = \sum_s \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right).
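For short sequences, this distribution can be computed by brute force, which makes the role of the normalizer concrete. The single feature function and its weight below are toy choices for illustration:

```python
import math
from itertools import product

STATES = ["O", "LOC"]

def feat(prev_s, s, obs, t):
    # One indicator feature: the current word is capitalized and tagged LOC.
    return 1.0 if (s == "LOC" and obs[t][0].isupper()) else 0.0

LAMBDA = 2.0  # weight for the single feature

def score(seq, obs):
    """Unnormalized exp(sum over positions of lambda_k * f_k)."""
    return math.exp(sum(LAMBDA * feat(seq[t - 1] if t else None, seq[t], obs, t)
                        for t in range(len(obs))))

def crf_prob(seq, obs):
    """P(s|o) = score(s, o) / Z_o, with Z_o summed over all state sequences."""
    z = sum(score(list(s), obs) for s in product(STATES, repeat=len(obs)))
    return score(seq, obs) / z
```

Because Z_o enumerates every sequence, the probabilities over all labelings of a given observation sum to one; real CRF implementations compute Z_o with forward-backward dynamic programming instead.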
Support Vector Machines (SVM). SVMs represent a relatively recent machine learning
approach that analyzes data and recognizes patterns for classification and regression
analysis [6]. In the field of NLP, SVMs have been applied to text categorization [38]
and many other problems, and are reported to produce excellent results.
An SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional
space. The classification rule of an SVM is

sgn(f(x, w, b)),                  (1)
f(x, w, b) = <w, x> + b,          (2)

where x is the example to be classified, w is a weight vector and b is a bias, which
stands for the distance of the hyperplane from the origin.
The classifier may produce a sequence of classes containing inadmissible transitions;
in the NER problem, the transition probability between word classes is therefore set to
1 if the sequence is admissible and to 0 otherwise.
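The classification rule in equations (1)-(2) can be written directly; the weight vector, bias and feature encoding below are illustrative values standing in for a trained model:

```python
def svm_classify(x, w, b):
    """Return +1 or -1 according to sgn(<w, x> + b)."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1

# Toy token features, e.g. (is_capitalized, is_function_word), with
# hand-picked (not learned) weights and bias.
w = [1.5, -0.5]
b = -1.0
```

In practice one such binary classifier is trained per NE class (one-vs-rest), and the admissibility constraint above filters the combined tag sequence.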
Maximum Entropy (ME). In the ME framework [3][21], the conditional probability of an
outcome o given a history h is modelled as

P(o|h) = (1 / Z(h)) \prod_j \alpha_j^{f_j(h, o)},

where o refers to the outcome, h the history (or context) and Z(h) is a normalization
function. Each feature function f_j(h, o) is a binary function, and the parameters
\alpha_j are estimated by a procedure called Generalized Iterative Scaling (GIS) [8],
an iterative method that improves the estimation of the parameters at each iteration.
In NER, the history can be viewed as all information derivable from the training corpus
relative to the current token W_i. The computation of P(o|h) relies on a set of features
that help to predict the output.
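The product form of P(o|h) with binary features is easy to evaluate once the alpha parameters are fixed; the two features and alpha values below are toy examples, as if produced by GIS:

```python
OUTCOMES = ["PER", "O"]

# Binary feature functions f_j(h, o); h is a toy history with a
# capitalization flag for the current token.
FEATURES = [
    lambda h, o: 1 if (h["cap"] and o == "PER") else 0,
    lambda h, o: 1 if (not h["cap"] and o == "O") else 0,
]
ALPHA = [3.0, 2.0]  # one multiplicative parameter per feature

def me_prob(o, h):
    """P(o|h) = (1/Z(h)) * prod_j alpha_j ** f_j(h, o)."""
    def unnorm(out):
        p = 1.0
        for a, f in zip(ALPHA, FEATURES):
            p *= a ** f(h, out)
        return p
    return unnorm(o) / sum(unnorm(out) for out in OUTCOMES)
```

For a capitalized token only the first feature fires, so P(PER|h) = 3 / (3 + 1) = 0.75 under these toy parameters.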
Decision Tree (DT). A DT is a decision-support tool that uses a tree-like graph or
model of decisions and their possible consequences. It is a powerful and popular tool
for classification and prediction. The attractiveness of DTs is due to the fact that,
in contrast to models such as neural networks, they represent rules. Rules obtained
from a DT can easily be expressed in a form that humans can understand or even use
directly.
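The rule-readability point can be made concrete with a tiny hand-built tree: each internal node tests one feature, and every root-to-leaf path reads as an if-then rule. The features, honorific list and labels below are illustrative, not learned:

```python
def dt_classify(token):
    """Classify a token dict with keys 'cap' and 'prev' (previous word)."""
    if token["cap"]:
        if token["prev"] in {"Mr.", "Dr.", "Shri"}:
            return "PERSON"       # rule: capitalized AND honorific before it
        return "NE-CANDIDATE"     # rule: capitalized, no honorific clue
    return "O"                    # rule: not capitalized
```

A learned tree has the same shape, which is why its rules can be inspected or edited directly, unlike the weights of a neural network.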
Early systems made use of handcrafted rule-based algorithms, while modern systems often
resort to machine learning techniques. Besides the above two approaches, NER also makes
use of hybrid models that combine the strongest points of both the rule-based and the
statistical methods. This method is used particularly when data is scanty and complex
Named Entity (NE) classes are involved. Srihari et al. [33] introduce a hybrid system
that combines HMM, ME and handcrafted grammatical rules to build an NER system.
Named Entity Recognition has made remarkable progress in the European languages [25]
[16] [40] [36], but comparatively little work can be found on the Indian languages (IL).
Below is some of the work done in IL using different approaches.
4.1 Bengali
HMM approach: Ekbal and Bandyopadhyay [9] report on the development of an HMM-based
NER system for Bengali and Hindi. Initially, the system was developed for Bengali.
A 10-fold cross validation yields average Recall, Precision and F-score values of
90.2%, 79.48% and 84.5%, respectively. The HMM-based NER system is also trained and
tested with Hindi data to show the effectiveness of the language-independent features
used by the approach. The results show average Recall, Precision and F-score values of
82.5%, 74.6% and 78.35%, respectively.
SVM approach: Ekbal and Bandyopadhyay [10] discuss the use of an SVM with contextual
information along with a variety of features such as context words, prefix and suffix
words, POS tags and word length. They use the BIE format, where B stands for Beginning,
I for Internal and E for End of a word. Experimental results of a 10-fold cross
validation test show the effectiveness of the system, with overall average recall,
precision and F-score values of 94.3%, 89.4% and 91.8%, respectively; the experiment was
then repeated in combination with Hindi [12]. A number of experiments were carried out
to find the best-suited features for NER in Bengali and Hindi. An unsupervised algorithm
was used to generate lexical context patterns from an unlabeled corpus of 10 million
word forms, and the NER system was tested with gold standard test sets of 35K and
60K tokens for Bengali and Hindi, respectively. Recall, precision and F-score were
88.61%, 80.12% and 84.15%, respectively, for Bengali and 80.23%, 74.34% and 77.17%,
respectively, for Hindi.
CRF approach: Ekbal et al. [14] use a statistical CRF to identify and classify NEs
into four classes. Experimental results of a 10-fold cross validation test show overall
average recall, precision and F-score values of 93.8%, 87.8% and 90.7%, respectively.
ME, CRF and SVM: Ekbal and Bandyopadhyay [11] combine the outputs of several
classifiers based on ME, CRF and SVM by a voting scheme. The overall average recall,
precision and F-score values were 90.78%, 87.35% and 89.03%, respectively. In [13], the
same authors describe a voted NER system that uses appropriate unlabeled data to obtain
overall recall, precision and F-score values of 93.81%, 92.18% and 92.98%, respectively.
Chaudhuri and Bhattacharya [5] used a three-stage approach, namely a dictionary-based
method, rules, and left-right co-occurrence statistics for the identification of NEs. A
word-level morphological parser is constructed for the dictionary-based approach, while
the rule-based approach relies on rules to be satisfied by the given test word. The
average recall, precision and F-measure were 85.50%, 94.24% and 89.51%, respectively.
They observed that their automatic evaluation system gives almost the same result as
manual evaluation.
4.2 Hindi
The NER task for Hindi was explored by Cucerzan and Yarowsky in their language-
independent NER work, in which morphological and contextual evidence was used [7].
ME approach: Saha et al. [34] describe the development of a Hindi NER system using the
ME approach. The system was evaluated on a blind test corpus of 25K words and achieved
an F-measure of 81.52%.
CRF approach: Li and McCallum [22] describe the application of CRFs with feature
induction to a Hindi NER task. The experimental results on the validation and test sets
were found to be 82.55% and 71.50%, respectively.
Goyal [17] focused on building a Hindi NER system using CRFs. He used the NLPAI Machine
Learning Contest 2007 data for the experiments. The method was evaluated on two
different test sets and attained a maximum F1-measure of around 49.2% and a nested
F1-measure of around 50.1% for test set 1, a maximum F1-measure of around 44.97% and a
nested F1-measure of around 43.70% for test set 2, and an F-measure of 58.85% on the
development set. The author also compared the results on Hindi data with the English
data of the CoNLL-2003 shared task, training the system on the English data and
considering only contextual features, since these give the maximum accuracy. He
obtained overall F-measures of 84.09% and 75.81% on the two test sets.
Gupta and Arora [18] also experimented with the CRF model to develop Hindi NER.
After adding the NE and POS tags, the results, considering combinations of the
surrounding and current words, came out to be 66.7% and 66.3% for Person, 69.5% and
68% for Location, and 58% for Organization.
4.3 Telugu
ME approach: Raju et al. [30] developed a Telugu NER system using the ME approach.
The evaluation yielded F-measures of 72.07% for person, and 60.76%, 68.40% and 45.28%
for organization, location and others, respectively.
CRF approach: Srikanth and Murthy [28] used a CRF-based approach on part of the
Language Engineering Research Centre at University of Hyderabad (LERC-UoH) Telugu
corpus, consisting of a variety of books and articles and two popular newspapers, and
obtained an F-measure of 91.95%. They then developed a rule-based NER system using a
corpus of 72,152 words including 6,268 named entities, and finally a CRF-based NER
system, achieving overall F-measures between 80% and 97% in various experiments.
Shishtla et al. [37] conducted an experiment on the development data released as part
of the NER for South and South East Asian Languages (NERSSEAL) competition using CRFs.
The best-performing model gave an F1-measure of 44.91%.
4.4 Tamil
CRF approach: VijayKrishna and Sobha [29] developed a domain-specific Tamil NER
system for tourism using CRFs. The system obtained an F-measure of 80.44%.
Hybrid approach: Pandian et al. [27] presented a hybrid three-stage approach for Tamil
NER. The first phase classifies the named entities using shallow parsing, a dictionary
of word clues and case markers. In the second phase, shallow semantic parsing and
syntactic and semantic information are used to identify the named entity type. The
final phase incorporates statistical information from a training corpus. The E-M (HMM)
algorithm is used to identify the best sequence for the first two phases and is then
modified to resolve the free word order problem. Both NER tags and POS tags are used as
the hidden variables in the algorithm. The system obtains an F-measure of 72.72% over
the various entity types.
4.5 Oriya
Biswas et al. [35] presented a hybrid system for Oriya NER that applies both ME and
HMM models as well as handcrafted rules to recognize NEs. Linguistic rules were also
used to identify named entities. The system obtains F-measures between 75% and 90%.
4.6 Urdu
Mukund and Srihari [26] proposed a bootstrapped model that involves four levels of text
processing for Urdu. Two types of model were used: a two-stage model that uses POS
information to perform NE tagging, and a four-stage model for NE tagging. The two-stage
model achieved an F-measure of 55.3%, while the F-measure for the four-stage model was
68.9%.
Riaz [32] identifies the complex relationship between Hindi and Urdu and finds that
NER computation models for Hindi cannot be used for Urdu NER. The work also describes
a rule-based NER system that outperforms models using statistical learning, with an
F-measure of 91.11%.
References
1. Modeling and Simulation. CSI Communications, 34:15, August 2010.
2. John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and Marc
Vilain. MITRE: Description of the Alembic System Used for MUC-6. In Proceedings of the
6th Conference on Message Understanding, pages 141–155, Columbia, Maryland, 1995.
3. Andrew Borthwick. A Maximum Entropy Approach to Named Entity Recognition. PhD thesis,
New York University, 1999.
4. Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing:
A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543–565,
December 1995.
5. Bidyut Baran Chaudhuri and Suvankar Bhattacharya. An Experiment on Automatic
Detection of Named Entities in Bangla. In Proceedings of the IJCNLP-08 Workshop on
NER for South and South East Asian Languages, pages 75–82, Hyderabad, India, January
2008.
6. Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning,
20(3):273–297, 1995.
7. Silviu Cucerzan and David Yarowsky. Language Independent Named Entity Recognition
Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT
Conference on EMNLP and Very Large Corpora, pages 90–99, 1999.
8. J. N. Darroch and D. Ratcliff. Generalized Iterative Scaling for Log-Linear Models.
The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
9. Asif Ekbal and Sivaji Bandyopadhyay. A Hidden Markov Model Based Named Entity
Recognition System: Bengali and Hindi as Case Studies. In Proceedings of the 2nd
International Conference on Pattern Recognition and Machine Intelligence, pages 545–552,
Kolkata, India, 2007.
10. Asif Ekbal and Sivaji Bandyopadhyay. Bengali Named Entity Recognition using Support
Vector Machine. In Proceedings of the IJCNLP-08 Workshop on NER for South and South
East Asian Languages, pages 51–58, Hyderabad, India, January 2008.
11. Asif Ekbal and Sivaji Bandyopadhyay. Improving the Performance of a NER System by
Post-processing and Voting. In Proceedings of the 2008 Joint IAPR International Workshop
on Structural, Syntactic, and Statistical Pattern Recognition, pages 831–841, Orlando,
Florida, 2008.
12. Asif Ekbal and Sivaji Bandyopadhyay. Named Entity Recognition using Support Vector
Machine: A Language Independent Approach. International Journal of Computer, Systems
Sciences and Engineering (IJCSSE), 4(2):155–170, 2008.
13. Asif Ekbal and Sivaji Bandyopadhyay. Voted NER System using Appropriate Unlabelled
Data. In Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 202–
210, Suntec, Singapore, August 2009.
14. Asif Ekbal, Rejwanul Haque, and Sivaji Bandyopadhyay. Named Entity Recognition in
Bengali: A Conditional Random Field Approach. In Proceedings of the 3rd International
Joint Conference on Natural Language Processing (IJCNLP-08), pages 589–594, India, 2008.
15. F. Béchet, A. Nasr, and F. Genet. Tagging Unknown Proper Names using Decision Trees.
In Proceedings of the 38th Annual Meeting of the Association for Computational
Linguistics, pages 77–84, 2000.
16. Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named Entity Recognition
through Classifier Combination. In Proceedings of CoNLL-2003 (Conference on Natural
Language Learning), pages 168–171, 2003.
17. Amit Goyal. Named Entity Recognition for South Asian Languages. In Proceedings of the
IJCNLP-08 Workshop on NER for South and South-East Asian Languages, pages 89–96,
Hyderabad, India, January 2008.
18. Pramod Kumar Gupta and Sunita Arora. An Approach for Named Entity Recognition System
for Hindi: An Experimental Study. In Proceedings of ASCNT-2009, pages 103–108, CDAC,
Noida, India.
19. Mohammad Hasanuzzaman, Asif Ekbal, and Sivaji Bandyopadhyay. Maximum Entropy
Approach for Named Entity Recognition in Bengali and Hindi. International Journal of
Recent Trends in Engineering, 1(1):408–412, May 2009.
20. John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the
Eighteenth International Conference on Machine Learning (ICML-2001), pages 282–289, 2001.
21. Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A Maximum
Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1):39–71,
1996.
22. Wei Li and Andrew McCallum. Rapid Development of Hindi Named Entity Recognition
using Conditional Random Fields and Feature Induction (Short Paper). ACM Transactions
on Asian Language Information Processing, 2(3):290–294, September 2003.
23. Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: A
High-Performance Learning Name-finder. In Proceedings of the Fifth Conference on Applied
Natural Language Processing, pages 194–201, 1997.
24. Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz, Rebecca
Stone, Ralph Weischedel, and the Annotation Group. BBN: Description of the SIFT System
as Used for MUC-7. In Proceedings of MUC-7, Virginia, 1998.
25. Diego Molla, Menno van Zaanen, and Daniel Smith. Named Entity Recognition for
Question Answering. In Proceedings of the 2006 Australasian Language Technology
Workshop (ALTW 2006), pages 51–58.
26. Smruthi Mukund and Rohini K. Srihari. NE Tagging for Urdu based on Bootstrap POS
Learning. In Proceedings of the Third International Workshop on Cross Lingual Information
Access: Addressing the Information Need of Multilingual Societies, pages 61–69, 2009.
27. S. Lakshmana Pandian, Krishnan Aravind Pavithra, and T. V. Geetha. Hybrid Three-stage
Named Entity Recognizer for Tamil. INFOS2008, pages 59–66, March 2008.
28. P. Srikanth and Kavi Narayana Murthy. Named Entity Recognition for Telugu. In
Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages,
pages 41–50, Hyderabad, India, January 2008.
29. Vijaykrishna R and Sobha L. Domain Focussed Named Entity Recognizer for Tamil using
Conditional Random Fields. In Proceedings of the IJCNLP-08 Workshop on NER for South
and South East Asian Languages, pages 59–66, Hyderabad, India, 2008.
30. G. V. S. Raju, B. Srinivasu, S. Viswanadha Raju, and K. S. M. V. Kumar. Named Entity
Recognition for Telugu using Maximum Entropy Model. Journal of Theoretical and Applied
Information Technology, 3(2):125–130, 2010.
31. Ralph Grishman. The New York University System MUC-6, or Where's the Syntax? In
Proceedings of the Sixth Message Understanding Conference, pages 167–175, 1995.
32. Kashif Riaz. Rule-based Named Entity Recognition in Urdu. In Proceedings of the 2010
Named Entities Workshop, ACL 2010, pages 126–135, Uppsala, Sweden, July 2010.
33. R. Srihari, C. Niu, and W. Li. A Hybrid Approach for Named Entity and Sub-Type
Tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing,
pages 247–254, 2000.
34. Sujan Kumar Saha, Sudeshna Sarkar, and Pabitra Mitra. A Hybrid Feature Set based
Maximum Entropy Hindi Named Entity Recognition. In Proceedings of the 3rd International
Joint Conference on NLP, pages 343–349, Hyderabad, India, January 2008.
35. S. Biswas, S. P. Mohanty, S. Acharya, and S. Mohanty. A Hybrid Oriya Named Entity
Recognition System. In Proceedings of the CoNLL, volume 1(1), pages 1–6, Edmonton,
Canada, 2003.
36. Satoshi Sekine. Description of the Japanese NE System Used for MET-2. In Proceedings
of MUC-7, Virginia, 1998.
37. Praneeth M. Shishtla, Karthik Gali, Prasad Pingali, and Vasudeva Varma. Experiments
in Telugu NER: A Conditional Random Field Approach. In Proceedings of the IJCNLP-08
Workshop on NER for South and South East Asian Languages, pages 105–110, Hyderabad,
India, January 2008.
38. Hirotoshi Taira and Masahiko Haruno. Feature Selection in SVM Text Categorization.
1999.
39. Takahiro Wakao, Robert Gaizauskas, and Yorick Wilks. Evaluation of an Algorithm for
the Recognition and Classification of Proper Names. In Proceedings of COLING-96, pages
418–423, 1996.
40. Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku Oonishi, Masanobu Nakamura, and Sadaoki
Furui. Combining a Two-step Conditional Random Field Model and a Joint Source Channel
Model for Machine Transliteration. In Proceedings of the 2009 Named Entities Workshop,
ACL-IJCNLP 2009, pages 72–75, Suntec, Singapore, August 2009.
41. Shihong Yu, Shuanhu Bai, and Paul Wu. Description of the Kent Ridge Digital Labs
System Used for MUC-7. In Proceedings of MUC-7, Virginia, 1998.