Professional Documents
Culture Documents
Abstract-This paper presents a rule based model of parts of speech tagging assigns grammatical category of the language.
speech (POS) tagset for Classical Tamil Texts (CTT). The noun There are noun, verbs, adjective, adverbs, postposition, etc. A
forms are type pattern, verb forms are token pattern. This is POS tagset is developed on the basis of the information from a
based on form agreement method. This is a very efficient and particular language. The tagset consist finite tags and the
novel approach because Tamil Language has a build-in system of information extraction should be infinite. Each word assigns a
agreement/concord of the sentence. Classical Tamil Tagset is unique category. Tamil is an agglutinative language. In Tamil,
divided into two basic classifications, noun morphology and verb root or stem can be added with some suffixes that derivate new
morphology. words. Those words are divided into two parts namely lexical
word and grammatical word. A word can function
Keywords- parts of speech; tagset; token; type; insert independently; but grammatical word cannot function
independently. Based on the above criteria, all the languages
I. INTRODUCTION make a distinction between open classes and close classes. Thus,
basically word classes may be grouped into two viz., open
Natural Language Processing (NLP) is a computerized approach classes and closed classes. In relation with this, Robin (a
to analyze the text based on a set of theories and set of linguist) quotes ”we can describe open classes as those, whose
technologies. And, being a very active area of research and membership is in principle unlimited, varying from time to time
development, the basic objective of Natural Language and between one speaker and another’ and closed classes as
Processing is to facilitate human-machine interaction through those that ‘contain a fixed and usually small number of member
the means of natural human language. Research on NLP has words, which are (essentially) the same for all the speakers of the
focused on various intermediate tasks that make partial sense of language or dialect’ ” (Robins, 1964 quoted in Paul and
language structure without requiring complete understanding Shopen:1985). Thus nouns, verbs, adjectives and adverbs
which, in turn, contributes to develop a successful system. Part- belongs to open classes and pronouns, conjunction, etc., belong
Of-Speech tagging is one of the processes in which grammatical to closed classes.
categories are assigned to each token in the context from a given
set of tags called POS Tagset. It serves wide number of III. METHODOLOGY FOR THE WORD TAGGING
applications like parser, morphology, speech synthesis and The noun form carries plural marker, case markers, clitics,
recognition, information retrieval, information extraction, postposition, etc., The noun form may be a root or a stem or an
partial parsing, machine translation, lexicography, Word oblique base form. The root form means that there is no future
Sense Disambiguation (WSD), Corpus based learning and segmentation ( kāl ‘leg’, kaṇ ’eye’ ) and the base form means
teaching etc. There are a few POS Tagsets for Modern Tamil that the root word can be added with any suffix for example
Language such as IL-ILMT, AU-KBC, IL-POST Microsoft, (kaṇṇaṉ (kaṇ+aṉ) ‘male person’, kuṉṟam(kuṉṟu+am) ‘hill’ ), the
LDC-IL Mysore, etc. There are various methods in POS tagsets oblique means the form that carries suffix or equivalent another
like linguistic rules, stochastic models and hybrid approaches, form (kiṇaṟṟu (kiṇaṟu ) ‘well’ , āṟṟu (āṟu) ‘river’ , eṉ (nāṉ) ‘I’ ,
and each approach has its own merits and demerits taṉ (tāṉ)) ’my’. In such a way, the verb forms consist of verb
(Rajendran. 2007). In this context, Classical Tamil further root and conjugated. There are seven major types of verb
presents a challenge in developing an automatic POS tagger conjugated is as follows Verbal base, Infinitive, Verbal noun,
as the language is highly inflectional and morphologically rich. Verbal participle, Relative participle, Participial noun and Finite
Hence, it is essential to consider text processing prior to POS verb.
tagging in order to achieve high performance and more
reliability. 3.1 Plural marker
Plural means a grammatical form that designates more than one
II. POS TAGGER FOR CLASSICAL TAMIL: AN of the things specified.
OVERVIEW மலர்+கள் = மலர்கள் (NC)
flower+pl = flowers
In the last few decades of NLP initiatives on Indian languages in
India, different research groups working on various languages
have developed POS tagger. In this paper, a brief summary of
POS tagger for Classical Tamil is being analyzed.Parts of
15
Integrated Intelligent Research(IIR) International Journal of Business Intelligent
Volume: 01 Issue: 01 June 2012,Pages No.15-17
ISSN: 2278-2400
languages, verbs are considered to be the most important part of
3.2 speech. Verbs play a very important role in any languages. As
Tamil verbs are inflected to various grammatical categories the
bulk of Tamil parts of speech dealt with verb is necessary.
16
Integrated Intelligent Research(IIR) International Journal of Business Intelligent
Volume: 01 Issue: 01 June 2012,Pages No.15-17
ISSN: 2278-2400
it as an unknown category. The tagging process follows the
following procedural steps The Data layer is developed at the time of Training; it
encapsulates all the information related to data from the tagging
module like electronic text, hibernated text and Extensible
Markup Language Database. The Process layer contains logic
for retrieving persistent data from the DL and placing it into
process objects. The PL gives graphical user interface (GUI)
application for end user application
V. CONCLUSION
REFERENCES
[1] Andronov, M. 1969. The Standard Grammar of Modern and Classical
Tamil. Madras: New Century Book House, Pvt. Ltd.
[2] Arulmozhi. P, Sobha. L, Kumara Shanmugam. B. Parts of Speech Tagger
for Tamil. In the Proceedings of the Symposium on Indian Morphology,
Phonology & Language Engineering, Indian Institute of Technology,
Kharagpur. pp 55- 57, 2004
[3] Arulmozhi. P, Sobha L, A Hybrid POS Tagger for a Relatively Free Word
Order Language. In proceedings of MSPIL-2006, Indian Institute of
4.2 Architecture of POS Technology , Bombay. pp 79-85, 2006
The architecture of POS Tagger consists of three layers such as [4] Anandan, P. K. Saravanan, Ranjani Parthasarathi and T. V. Geetha.
Morphological Analyzer for Tamil. In: International Conference on
Data Layer (DL),Process Layer (BL) and Presentation Layer Natural language Processing, 2002.
(PL). The system of architecture is shown below [5] Jesse Liberty and Donald Xie, ”Programming C# 3.0”, O’Reilly 5th
edition, 2007.
[6] Lakshmana Pandian S. and T.V. Geetha, “Morpheme Components based
Part of Speech tagging”, International Conference on Natural language
processing, 2008
[7] Rajendran S, (2007) Complexity of Tamil in POS tagging, Language in
India, Jan 2007.
[8] Rajan,K. Corpus analysis and tagging for Tamil. In: Proceeding of
symposium on Translation support system STRANS-2002
[9] Thomas Lehman. A grammar of modern Tamil, Pondicherry Institute of
Linguistic and culture, Pondicherry.1989.
17