1 Introduction
For example, in MUC-style annotation a text fragment is marked up as:
[<TIMEX TYPE="DATE">22 September 2010</TIMEX> <ENAMEX TYPE="ORGANIZATION">Brisbane Convention Center</ENAMEX> <ENAMEX TYPE="LOCATION">Brisbane, Australia</ENAMEX>].
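A minimal sketch of how such MUC-style markup can be read back out of annotated text; the regular expression and sample string below are illustrative, not part of any particular system:

```python
import re

# Match MUC-style <ENAMEX>/<TIMEX> elements; the backreference \1 ensures
# the closing tag name matches the opening one.
TAG_RE = re.compile(r'<(ENAMEX|TIMEX) TYPE="([A-Z]+)">(.*?)</\1>')

def extract_entities(text):
    """Return (tag, type, surface string) triples from annotated text."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in TAG_RE.finditer(text)]

sample = ('<TIMEX TYPE="DATE">22 September 2010</TIMEX> '
          '<ENAMEX TYPE="ORGANIZATION">Brisbane Convention Center</ENAMEX> '
          '<ENAMEX TYPE="LOCATION">Brisbane, Australia</ENAMEX>')
```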
One of the major challenges in NER is that there are frequent overlaps among classes of
named entities. The common situations are:
1. Common noun vs. proper noun: a common noun sometimes occurs as a person name,
such as "Suraj", which means sun, creating ambiguity between common noun and
proper noun.
2. Organization vs. person name: "Tata" occurs as a person name as well as an
organization name.
3. Organization vs. place name: in "Mumbai meets Chennai at Bangalore", Mumbai
and Chennai are names of teams rather than names of cities.
4. Person name vs. place name: the word "Kashi" is used as a person name as well as
the name of a place.
5. Moreover, there exist words or word sequences, such as Thinking Machines (a
company) or Gates (a person), that can occur in contexts where they do not refer
to named entities.
Rule-based NER focuses on the extraction of names using handcrafted rules. Several
rule-based NER systems for English achieve an F-measure of 88%-92% [31][39]. This
approach, however, lacks portability and robustness: a significant number of rules is
needed to maintain optimal performance, resulting in high maintenance costs.
The main advantage of machine learning (ML) approaches over the rule-based approach is
that the former are trainable and can be adapted to different domains, and their
maintenance cost is much smaller. ML approaches to NER employ a statistical model.
Representative ML methods used in NER are the Hidden Markov Model (HMM) [23][24] and
KRDL's system [41] for Chinese NER, Conditional Random Fields (CRF) [20], Support
Vector Machines (SVM) [6], Maximum Entropy (ME) [3] and Decision Trees (DT) [15][36].
Besides these, a variant of Eric Brill's transformation-based rules [4] has been applied
to the problem [2]. These models are primarily of two types: supervised and unsupervised.
Hidden Markov Model (HMM). In the context of NER, an HMM is defined as follows.
Given a word sequence

W = (W_0, f_0) (W_1, f_1) .... (W_n, f_n),

where W_j denotes a word and f_j denotes a single-token feature set associated with
word W_j, the goal of NER is to find the optimal NE tag sequence T = t_0 t_1 t_2 .... t_n
that maximizes the conditional probability Pr(T | W) [23]. By Bayes' rule, this is
equivalent to maximizing the joint probability Pr(W, T). This joint probability can be
computed by a bi-gram HMM as follows:

Pr(W, T) = \prod_i Pr({W_i, f_i}, t_i | {W_{i-1}, f_{i-1}}, t_{i-1}).

In the tag sequence T, each tag consists of three parts: the NE class, the boundary
class (indicating the position of the current word in the NE) and the feature set.
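The bi-gram decoding this describes can be sketched with a toy Viterbi decoder. The tag set, transition and emission tables below are illustrative values, not trained estimates, and the token feature sets are dropped for brevity:

```python
import math

TAGS = ["O", "PER"]
# Toy bi-gram transition probabilities Pr(t_i | t_{i-1}); "<s>" is a start tag.
trans = {("<s>", "O"): 0.8, ("<s>", "PER"): 0.2,
         ("O", "O"): 0.7, ("O", "PER"): 0.3,
         ("PER", "O"): 0.6, ("PER", "PER"): 0.4}
# Toy emission probabilities Pr(W_i | t_i).
emit = {("O", "meets"): 0.5, ("PER", "meets"): 0.01,
        ("O", "Tata"): 0.05, ("PER", "Tata"): 0.6}

def viterbi(words):
    """Return the tag sequence maximizing Pr(W, T) under the toy model."""
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(trans[("<s>", t)] * emit.get((t, words[0]), 1e-6)), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            # Pick the best previous tag for current tag t.
            prev, (lp, path) = max(
                ((p, best[p]) for p in TAGS),
                key=lambda x: x[1][0] + math.log(trans[(x[0], t)]))
            new[t] = (lp + math.log(trans[(prev, t)])
                      + math.log(emit.get((t, w), 1e-6)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]
```

Under these toy tables, "Tata" is pulled toward the PER tag by its emission probability, while "meets" stays O.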
Conditional Random Fields (CRF). CRFs [20] are a type of discriminative probabilistic
model used for labelling and segmenting sequential data such as natural language text or
biological sequences. They can incorporate a large number of arbitrary, non-independent
features. The conditional probability of a state sequence S = (s_1, s_2, ..., s_T) given
an observation sequence O = (o_1, o_2, ..., o_T) is calculated as

P(s|o) = (1 / Z_o) \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right),

where the normalization factor Z_o sums over all state sequences:

Z_o = \sum_s \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right).
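For short sequences, this distribution can be computed by brute force, which makes the role of the normalizer concrete. The single feature function and its weight below are toy choices for illustration:

```python
import math
from itertools import product

STATES = ["O", "LOC"]

def feat(prev_s, s, obs, t):
    # One indicator feature: the current word is capitalized and tagged LOC.
    return 1.0 if (s == "LOC" and obs[t][0].isupper()) else 0.0

LAMBDA = 2.0  # weight for the single feature

def score(seq, obs):
    """Unnormalized exp(sum over positions of lambda_k * f_k)."""
    return math.exp(sum(LAMBDA * feat(seq[t - 1] if t else None, seq[t], obs, t)
                        for t in range(len(obs))))

def crf_prob(seq, obs):
    """P(s|o) = score(s, o) / Z_o, with Z_o summed over all state sequences."""
    z = sum(score(list(s), obs) for s in product(STATES, repeat=len(obs)))
    return score(seq, obs) / z
```

Because Z_o enumerates every sequence, the probabilities over all labelings of a given observation sum to one; real CRF implementations compute Z_o with forward-backward dynamic programming instead.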
Support Vector Machines (SVM). SVMs represent a relatively recent machine learning
approach that analyzes data and recognizes patterns for classification and regression
analysis [6]. In the field of NLP, SVMs have been applied to text categorization [38]
and many other problems, and are reported to produce excellent results.
An SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional
space. The classification rule of an SVM is

sgn(f(x, w, b)),                  (1)
f(x, w, b) = <w, x> + b,          (2)

where x is the example to be classified, w is a weight vector and b is a bias, which
stands for the distance of the hyperplane from the origin.
The classifier may produce a sequence of classes containing inadmissible transitions;
in the NER problem, the transition probability between word classes is therefore set to
1 if the sequence is admissible and to 0 otherwise.
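The classification rule in equations (1)-(2) can be written directly; the weight vector, bias and feature encoding below are illustrative values standing in for a trained model:

```python
def svm_classify(x, w, b):
    """Return +1 or -1 according to sgn(<w, x> + b)."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1

# Toy token features, e.g. (is_capitalized, is_function_word), with
# hand-picked (not learned) weights and bias.
w = [1.5, -0.5]
b = -1.0
```

In practice one such binary classifier is trained per NE class (one-vs-rest), and the admissibility constraint above filters the combined tag sequence.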
Maximum Entropy (ME). In the ME framework [3][21], the conditional probability of an
outcome o given a history h is modelled as

P(o|h) = (1 / Z(h)) \prod_j \alpha_j^{f_j(h, o)},

where o refers to the outcome, h the history (or context) and Z(h) is a normalization
function. Each feature function f_j(h, o) is a binary function, and the parameters
\alpha_j are estimated by a procedure called Generalized Iterative Scaling (GIS) [8],
an iterative method that improves the estimation of the parameters at each iteration.
In NER, the history can be viewed as all information derivable from the training corpus
relative to the current token W_i. The computation of P(o|h) relies on a set of features
that help to predict the output.
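The product form of P(o|h) with binary features is easy to evaluate once the alpha parameters are fixed; the two features and alpha values below are toy examples, as if produced by GIS:

```python
OUTCOMES = ["PER", "O"]

# Binary feature functions f_j(h, o); h is a toy history with a
# capitalization flag for the current token.
FEATURES = [
    lambda h, o: 1 if (h["cap"] and o == "PER") else 0,
    lambda h, o: 1 if (not h["cap"] and o == "O") else 0,
]
ALPHA = [3.0, 2.0]  # one multiplicative parameter per feature

def me_prob(o, h):
    """P(o|h) = (1/Z(h)) * prod_j alpha_j ** f_j(h, o)."""
    def unnorm(out):
        p = 1.0
        for a, f in zip(ALPHA, FEATURES):
            p *= a ** f(h, out)
        return p
    return unnorm(o) / sum(unnorm(out) for out in OUTCOMES)
```

For a capitalized token only the first feature fires, so P(PER|h) = 3 / (3 + 1) = 0.75 under these toy parameters.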
Decision Tree (DT). A DT is a decision-support tool that uses a tree-like graph or
model of decisions and their possible consequences. It is a powerful and popular tool
for classification and prediction. The attractiveness of DTs is due to the fact that,
in contrast to models such as neural networks, they represent rules. Rules obtained
from a DT can easily be expressed in a form that humans can understand or even use
directly.
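The rule-readability point can be made concrete with a tiny hand-built tree: each internal node tests one feature, and every root-to-leaf path reads as an if-then rule. The features, honorific list and labels below are illustrative, not learned:

```python
def dt_classify(token):
    """Classify a token dict with keys 'cap' and 'prev' (previous word)."""
    if token["cap"]:
        if token["prev"] in {"Mr.", "Dr.", "Shri"}:
            return "PERSON"       # rule: capitalized AND honorific before it
        return "NE-CANDIDATE"     # rule: capitalized, no honorific clue
    return "O"                    # rule: not capitalized
```

A learned tree has the same shape, which is why its rules can be inspected or edited directly, unlike the weights of a neural network.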
Early systems made use of handcrafted rule-based algorithms, while modern systems often
resort to machine learning techniques. Besides the above two approaches, NER also makes
use of hybrid models that combine the strongest points of both the rule-based and the
statistical methods. This method is used particularly when data is scanty and complex
Named Entity (NE) classes are involved. Srihari et al. [33] introduce a hybrid system
that combines HMM, ME and handcrafted grammatical rules to build an NER system.
Named Entity Recognition has made remarkable progress in the European languages [25]
[16] [40] [36], but comparatively little work can be found on the Indian languages (IL).
Below is some of the work done in IL using different approaches.
4.1 Bengali
HMM approach: Ekbal and Bandyopadhyay [9] report on the development of an HMM-based
NER system for Bengali and Hindi. Initially, the system was developed for Bengali.
A 10-fold cross validation yields average Recall, Precision and F-score values of
90.2%, 79.48% and 84.5%, respectively. The HMM-based NER system is also trained and
tested with Hindi data to show the effectiveness of the language-independent features
used by the approach. The results show average Recall, Precision and F-score values of
82.5%, 74.6% and 78.35%, respectively.
SVM approach: Ekbal and Bandyopadhyay [10] discuss the use of an SVM with contextual
information along with a variety of features such as context words, prefix and suffix
words, POS tags and word length. They use the BIE format, where B stands for Beginning,
I for Internal and E for End of a word. Experimental results of a 10-fold cross
validation test show the effectiveness of the system, with overall average recall,
precision and F-score values of 94.3%, 89.4% and 91.8%, respectively; the experiment was
then repeated in combination with Hindi [12]. A number of experiments were carried out
to find the best-suited features for NER in Bengali and Hindi. An unsupervised algorithm
was used to generate lexical context patterns from an unlabeled corpus of 10 million
word forms, and the NER system was tested with gold standard test sets of 35K and
60K tokens for Bengali and Hindi, respectively. Recall, precision and F-score were
88.61%, 80.12% and 84.15%, respectively, for Bengali and 80.23%, 74.34% and 77.17%,
respectively, for Hindi.
CRF approach: Ekbal et al. [14] use a statistical CRF to identify and classify NEs
into four classes. Experimental results of a 10-fold cross validation test show overall
average recall, precision and F-score values of 93.8%, 87.8% and 90.7%, respectively.
ME, CRF and SVM: Ekbal and Bandyopadhyay [11] combine the outputs of several
classifiers based on ME, CRF and SVM by a voting scheme. The overall average recall,
precision and F-score values were 90.78%, 87.35% and 89.03%, respectively. In [13], the
same authors describe a voted NER system that uses appropriate unlabeled data to obtain
overall recall, precision and F-score values of 93.81%, 92.18% and 92.98%, respectively.
Chaudhuri and Bhattacharya [5] used a three-stage approach, namely a dictionary-based
method, rules, and left-right co-occurrence statistics for the identification of NEs. A
word-level morphological parser is constructed for the dictionary-based approach, while
the rule-based approach relies on rules to be satisfied by the given test word. The
average recall, precision and F-measure were 85.50%, 94.24% and 89.51%, respectively.
They observed that their automatic evaluation system gives almost the same result as
manual evaluation.
4.2 Hindi
The NER task for Hindi was explored by Cucerzan and Yarowsky in their language-
independent NER work, in which morphological and contextual evidence was used [7].
ME approach: Saha et al. [34] describe the development of a Hindi NER system using the
ME approach. The system was evaluated on a blind test corpus of 25K words and achieved
an F-measure of 81.52%.
CRF approach: Li and McCallum [22] describe the application of CRFs with feature
induction to a Hindi NER task. The experimental results on the validation and test sets
were found to be 82.55% and 71.50%, respectively.
Goyal [17] focused on building a Hindi NER system using CRFs. He used the NLPAI Machine
Learning Contest 2007 data for the experiments. The method was evaluated on two
different test sets and attained a maximum F1-measure of around 49.2% and a nested
F1-measure of around 50.1% for test set 1, a maximum F1-measure of around 44.97% and a
nested F1-measure of around 43.70% for test set 2, and an F-measure of 58.85% on the
development set. The author also compared the results on Hindi data with the English
data of the CoNLL-2003 shared task, training the system on the English data and
considering only contextual features, since these give the maximum accuracy. He
obtained overall F-measures of 84.09% and 75.81% on the two test sets.
Gupta and Arora [18] also experimented with the CRF model to develop Hindi NER.
After adding the NE and POS tags, the results, considering combinations of the
surrounding and current words, came out to be 66.7% and 66.3% for Person, 69.5% and
68% for Location, and 58% for Organization.
4.3 Telugu
ME approach: Raju et al. [30] developed a Telugu NER system using the ME approach.
The evaluation yielded F-measures of 72.07% for person, and 60.76%, 68.40% and 45.28%
for organization, location and others, respectively.
CRF approach: Srikanth and Murthy [28] used a CRF-based approach on part of the
Language Engineering Research Centre at University of Hyderabad (LERC-UoH) Telugu
corpus, consisting of a variety of books and articles and two popular newspapers, and
obtained an F-measure of 91.95%. They then developed a rule-based NER system using a
corpus of 72,152 words including 6,268 named entities, and finally a CRF-based NER
system, achieving overall F-measures between 80% and 97% in various experiments.
Shishtla et al. [37] conducted an experiment on the development data released as part
of the NER for South and South East Asian Languages (NERSSEAL) competition using CRFs.
The best-performing model gave an F1-measure of 44.91%.
4.4 Tamil
CRF approach: VijayKrishna and Sobha [29] developed a domain-specific Tamil NER
system for tourism using CRFs. The system obtained an F-measure of 80.44%.
Hybrid approach: Pandian et al. [27] presented a hybrid three-stage approach for Tamil
NER. The first phase classifies the named entities using shallow parsing, a dictionary
of word clues and case markers. In the second phase, shallow semantic parsing and
syntactic and semantic information are used to identify the named entity type. The
final phase incorporates statistical information from a training corpus. The E-M (HMM)
algorithm is used to identify the best sequence for the first two phases and is then
modified to resolve the free word order problem. Both NER tags and POS tags are used as
the hidden variables in the algorithm. The system obtains an F-measure of 72.72% over
the various entity types.
4.5 Oriya
Biswas et al. [35] presented a hybrid system for Oriya NER that applies both ME and
HMM models as well as handcrafted rules to recognize NEs. Linguistic rules were also
used to identify named entities. The system obtains F-measures between 75% and 90%.
4.6 Urdu
Mukund and Srihari [26] proposed a bootstrapped model that involves four levels of text
processing for Urdu. Two types of model were used: a two-stage model that uses POS
information to perform NE tagging, and a four-stage model for NE tagging. The two-stage
model achieved an F-measure of 55.3%, while the F-measure for the four-stage model was
68.9%.
Riaz [32] identifies the complex relationship between Hindi and Urdu and finds that
NER computation models for Hindi cannot be used for Urdu NER. The work also describes
a rule-based NER system that outperforms models using statistical learning, with an
F-measure of 91.11%.
References
1. Modeling and Simulation. CSI Communications, 34:15, August 2010.
2. John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and Marc
Vilain. MITRE: Description of the Alembic System Used for MUC-6. In Proceedings of the
6th Conference on Message Understanding, pages 141–155, Columbia, Maryland, 1995.
3. Andrew Borthwick. A Maximum Entropy Approach to Named Entity Recognition. PhD thesis,
New York University, 1999.
4. Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing:
A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543–565,
December 1995.
5. Bidyut Baran Chaudhuri and Suvankar Bhattacharya. An Experiment on Automatic
Detection of Named Entities in Bangla. In Proceedings of the IJCNLP-08 Workshop on
NER for South and South East Asian Languages, pages 75–82, Hyderabad, India, January
2008.
6. Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning,
20(3):273–297, 1995.
7. Silviu Cucerzan and David Yarowsky. Language Independent Named Entity Recognition
Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT
Conference on EMNLP and Very Large Corpora, pages 90–99, 1999.
8. J. N. Darroch and D. Ratcliff. Generalized Iterative Scaling for Log-Linear Models.
The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
9. Asif Ekbal and Sivaji Bandyopadhyay. A Hidden Markov Model Based Named Entity
Recognition System: Bengali and Hindi as Case Studies. In Proceedings of the 2nd
International Conference on Pattern Recognition and Machine Intelligence, pages 545–552,
Kolkata, India, 2007.
10. Asif Ekbal and Sivaji Bandyopadhyay. Bengali Named Entity Recognition using Support
Vector Machine. In Proceedings of the IJCNLP-08 Workshop on NER for South and South
East Asian Languages, pages 51–58, Hyderabad, India, January 2008.
11. Asif Ekbal and Sivaji Bandyopadhyay. Improving the Performance of a NER System by
Post-processing and Voting. In Proceedings of the 2008 Joint IAPR International Workshop
on Structural, Syntactic, and Statistical Pattern Recognition, pages 831–841, Orlando,
Florida, 2008.
12. Asif Ekbal and Sivaji Bandyopadhyay. Named Entity Recognition using Support Vector
Machine: A Language Independent Approach. International Journal of Computer, Systems
Sciences and Engineering (IJCSSE), 4(2):155–170, 2008.
13. Asif Ekbal and Sivaji Bandyopadhyay. Voted NER System using Appropriate Unlabelled
Data. In Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 202–
210, Suntec, Singapore, August 2009.
14. Asif Ekbal, Rejwanul Haque, and Sivaji Bandyopadhyay. Named Entity Recognition in
Bengali: A Conditional Random Field Approach. In Proceedings of the 3rd International
Joint Conference on Natural Language Processing (IJCNLP-08), pages 589–594, India, 2008.
15. F. Béchet, A. Nasr, and F. Genet. Tagging Unknown Proper Names using Decision Trees.
In Proceedings of the 38th Annual Meeting of the Association for Computational
Linguistics, pages 77–84, 2000.
16. Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named Entity Recognition
through Classifier Combination. In Proceedings of CoNLL-2003 (Conference on Natural
Language Learning), pages 168–171, 2003.
17. Amit Goyal. Named Entity Recognition for South Asian Languages. In Proceedings of the
IJCNLP-08 Workshop on NER for South and South-East Asian Languages, pages 89–96,
Hyderabad, India, January 2008.
18. Pramod Kumar Gupta and Sunita Arora. An Approach for Named Entity Recognition System
for Hindi: An Experimental Study. In Proceedings of ASCNT-2009, pages 103–108, CDAC,
Noida, India.
19. Mohammad Hasanuzzaman, Asif Ekbal, and Sivaji Bandyopadhyay. Maximum Entropy
Approach for Named Entity Recognition in Bengali and Hindi. International Journal of
Recent Trends in Engineering, 1(1):408–412, May 2009.
20. John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the
Eighteenth International Conference on Machine Learning (ICML-2001), pages 282–289, 2001.
21. Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A Maximum
Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1):39–71,
1996.
22. Wei Li and Andrew McCallum. Rapid Development of Hindi Named Entity Recognition
using Conditional Random Fields and Feature Induction (Short Paper). ACM Transactions
on Asian Language Information Processing, 2(3):290–294, September 2003.
23. Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: A
High-Performance Learning Name-finder. In Proceedings of the Fifth Conference on Applied
Natural Language Processing, pages 194–201, 1997.
24. Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz, Rebecca
Stone, Ralph Weischedel, and the Annotation Group. BBN: Description of the SIFT System
as Used for MUC-7. In Proceedings of MUC-7, Virginia, 1998.
25. Diego Molla, Menno van Zaanen, and Daniel Smith. Named Entity Recognition for
Question Answering. In Proceedings of the 2006 Australasian Language Technology
Workshop (ALTW 2006), pages 51–58.
26. Smruthi Mukund and Rohini K. Srihari. NE Tagging for Urdu based on Bootstrap POS
Learning. In Proceedings of the Third International Workshop on Cross Lingual Information
Access: Addressing the Information Need of Multilingual Societies, pages 61–69, 2009.
27. S. Lakshmana Pandian, Krishnan Aravind Pavithra, and T. V. Geetha. Hybrid Three-stage
Named Entity Recognizer for Tamil. INFOS2008, pages 59–66, March 2008.
28. P. Srikanth and Kavi Narayana Murthy. Named Entity Recognition for Telugu. In
Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages,
pages 41–50, Hyderabad, India, January 2008.
29. Vijaykrishna R and Sobha L. Domain Focussed Named Entity Recognizer for Tamil using
Conditional Random Fields. In Proceedings of the IJCNLP-08 Workshop on NER for South
and South East Asian Languages, pages 59–66, Hyderabad, India, 2008.
30. G. V. S. Raju, B. Srinivasu, S. Viswanadha Raju, and K. S. M. V. Kumar. Named Entity
Recognition for Telugu using Maximum Entropy Model. Journal of Theoretical and Applied
Information Technology, 3(2):125–130, 2010.
31. Ralph Grishman. The New York University System MUC-6, or Where's the Syntax? In
Proceedings of the Sixth Message Understanding Conference, pages 167–175, 1995.
32. Kashif Riaz. Rule-based Named Entity Recognition in Urdu. In Proceedings of the 2010
Named Entities Workshop, ACL 2010, pages 126–135, Uppsala, Sweden, July 2010.
33. R. Srihari, C. Niu, and W. Li. A Hybrid Approach for Named Entity and Sub-Type
Tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing,
pages 247–254, 2000.
34. Sujan Kumar Saha, Sudeshna Sarkar, and Pabitra Mitra. A Hybrid Feature Set based
Maximum Entropy Hindi Named Entity Recognition. In Proceedings of the 3rd International
Joint Conference on NLP, pages 343–349, Hyderabad, India, January 2008.
35. S. Biswas, S. P. Mohanty, S. Acharya, and S. Mohanty. A Hybrid Oriya Named Entity
Recognition System. In Proceedings of the CoNLL, volume 1(1), pages 1–6, Edmonton,
Canada, 2003.
36. Satoshi Sekine. Description of the Japanese NE System Used for MET-2. In Proceedings
of MUC-7, Virginia, 1998.
37. Praneeth M. Shishtla, Karthik Gali, Prasad Pingali, and Vasudeva Varma. Experiments
in Telugu NER: A Conditional Random Field Approach. In Proceedings of the IJCNLP-08
Workshop on NER for South and South East Asian Languages, pages 105–110, Hyderabad,
India, January 2008.
38. Hirotoshi Taira and Masahiko Haruno. Feature Selection in SVM Text Categorization.
1999.
39. Takahiro Wakao, Robert Gaizauskas, and Yorick Wilks. Evaluation of an Algorithm for
the Recognition and Classification of Proper Names. In Proceedings of COLING-96, pages
418–423, 1996.
40. Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku Oonishi, Masanobu Nakamura, and Sadaoki
Furui. Combining a Two-step Conditional Random Field Model and a Joint Source Channel
Model for Machine Transliteration. In Proceedings of the 2009 Named Entities Workshop,
ACL-IJCNLP 2009, pages 72–75, Suntec, Singapore, August 2009.
41. Shihong Yu, Shuanhu Bai, and Paul Wu. Description of the Kent Ridge Digital Labs
System Used for MUC-7. In Proceedings of MUC-7, Virginia, 1998.