
TO THE LOTUS FEET OF OUR AMMA


Syntactic Parser using Machine Learning Approach


SRM National Workshop on "Tamil Computational Linguistics", by Dr. V. Dhanalakshmi, SRM University.

What is NLP?
Machine learning
Corpus and machine learning
Tamil syntactic parser
Tools needed for developing a syntactic parser
Syntactic parser using machine learning approach


Natural Language Processing (NLP)

NL Input -> Understanding -> Generation -> NL Output

The ultimate goal of NLP is to build computational models that equal human performance in the linguistic tasks of reading, writing, learning, speaking and understanding [Allen, 1995].


Approaches: knowledge driven, data driven

NL Input -> Understanding -> Generation -> NL Output

Machine Learning Approach


Machine learning deals with techniques that allow computers to automatically learn to make accurate predictions based on past observations from the annotated corpus. There are two main tasks involved in machine learning:
1) Learning/training
2) Prediction
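The two tasks can be illustrated with a deliberately simple model. The sketch below is not the method used in this work; it is a most-frequent-tag lookup, chosen only because it makes the split between training and prediction visible. The function names and the default tag for unseen words are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def train(annotated_corpus):
    """Learning/training: count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for sentence in annotated_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    # The learned model maps every seen word to its most frequent tag.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def predict(model, words, default_tag="NN"):
    """Prediction: look each word up; unseen words get the default tag."""
    return [(word, model.get(word, default_tag)) for word in words]

# A one-sentence annotated corpus (transliterated example sentence).
corpus = [[("kovillil", "NN"), ("aaru", "CRD"), ("adi", "NN"),
           ("uyaramana", "ADJ"), ("mani", "NN"), ("ullathu", "VF")]]

model = train(corpus)
tagged = predict(model, ["aaru", "adi", "mani", "puththakam"])
print(tagged)
```

A lookup like this ignores context, which is exactly the limitation that context-sensitive learners such as SVMs are meant to address.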

Key terms: training data, model, learned model.

Corpus and Machine Learning

Annotated corpus

Syntactic Parser

Tamil Syntax

<NN> <NOM> (<NP>)

<VF> <ROOT+PAST+3SM> (<VP>)

General Framework

Un-annotated sentence -> POS Tagger -> Chunker -> Morphological Analyzer -> Dependency parsing (RF) -> Syntactic parsed sentence


TOOLS NEEDED

POS Tagger
Phrase Chunker
Morphological Analyzer
Dependency Parser


POS Tagger for Tamil using SML

POS Tagging
It is the process of assigning a grammatical category (noun, verb, adjective, adverb, etc.) to each and every word in a given sentence depending on the context. It is needed for languages which have more than one POS category for a word. It is the first step in an NLP task at the sentence level.

Tamil POS
Tamil is an agglutinative language, where nouns get inflected for number and case. Verbs get inflected for various categories, which include tense, auxiliaries and person-number-gender suffixes. A Tamil word in a text carries a grammatical category and grammatical features such as tense, gender, number, person, etc.

MarangkaL <NN>
Maram + kaL <NN>+PL

Initially, the Tamil POS tagger is developed based on the 'category' of the word, and at the next level the grammatical features of the word are acquired from the morphological tagger.

Example of POS Tagging


Kovillil   aaru   adi   uyaramana   mani   ullathu
NN         CRD    NN    ADJ         NN     VF

Ambiguous tags: NN NN VF ADJ NNP VF

Translation: A six-feet-tall bell is in the temple.

We need to depend on syntactic function or context to decide whether a word is a noun, a verb or a proper noun.

Assign POS tags to the words in a sentence, considering its lexical ambiguity:

NN CRD NN ADJ NN VF

<Six feet tall bell is in the temple>

Customized POS Tagset for word category level


Word level Tagset
1. <NN> Common Noun
2. <NNC> Compound Noun
3. <VADV> Adverbial Participle
4. <VADJ> Adjectival Participle
5. <VINF> Infinitive Verb
6. <VCON> Conditional Verb
7. <VG-PN> Participle Noun
8. <VG-AN> Adjectival Noun
9. <VF> Finite Verb
10. <VMOD> Modal Verb
11. <PRP> Personal Pronoun
12. <PRIT> Interrogative Pronoun
13. <NPC> Compound Proper Noun
14. <CRD> Cardinal
15. <ORD> Ordinal
16. <ADJ> Adjective
17. <ADV> Adverb
18. <PPO> Postposition
19. <CNJ> Conjunction
20. <DET> Determiner
21. <COMP> Complementizer
22. <EMP> Emphasis
23. <QTF> Quantifier
24. <COMM> Comma
25. <DOT> Dot
26. <QM> Question Mark

Corpus development
POS corpus tagging is done in three stages:
1. Pre-editing
2. Manual tagging
3. Boot-strapping

Pre-editing
Collect an untagged corpus from newspapers, magazines, websites, etc., and remove the noisy elements.

Untagged corpus before pre-editing

Untagged corpus after pre-editing
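As a rough illustration of pre-editing, the sketch below strips a few kinds of noise from raw text lines. What counts as a "noisy element" is not specified on the slide, so the three patterns used here (leftover HTML tags, web addresses, runs of whitespace) are assumptions, as is the function name `pre_edit`.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")        # leftover HTML markup
URL_RE = re.compile(r"https?://\S+")   # web addresses
SPACE_RE = re.compile(r"\s+")          # runs of whitespace

def pre_edit(raw_lines):
    """Return the lines with noise removed; lines left empty are dropped."""
    cleaned = []
    for line in raw_lines:
        line = TAG_RE.sub(" ", line)
        line = URL_RE.sub(" ", line)
        line = SPACE_RE.sub(" ", line).strip()
        if line:
            cleaned.append(line)
    return cleaned

raw = [
    "<p>Kovillil aaru adi   uyaramana mani ullathu .</p>",
    "   ",
    "Read more: http://example.com/news",
]
clean = pre_edit(raw)
print(clean)
```

In practice the list of patterns would grow as the collected sources are inspected.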

3) Boot-strapping

Corpus boot-strapping is done using the machine learning tool. The output of the machine learning tool is again manually corrected and added to the tagged corpus, to increase the corpus size.
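The boot-strapping cycle can be sketched as a loop: train on the corpus tagged so far, tag a new batch automatically, correct the output, and add it to the corpus. In the sketch below the "model" is a plain word-to-tag lookup and the `correct` function stands in for the human annotator; both are placeholders invented for the illustration, not the tool used in this work.

```python
def train(tagged_corpus):
    """Toy learner: remember the tag seen for each word."""
    return {word: tag for sentence in tagged_corpus for word, tag in sentence}

def auto_tag(model, words, default_tag="NN"):
    """Toy tagger: look up each word, falling back to a default tag."""
    return [(word, model.get(word, default_tag)) for word in words]

def bootstrap(tagged_corpus, untagged_batches, correct):
    """Grow the tagged corpus batch by batch."""
    corpus = list(tagged_corpus)
    for batch in untagged_batches:
        model = train(corpus)               # retrain on everything so far
        for words in batch:
            draft = auto_tag(model, words)  # machine output
            corpus.append(correct(draft))   # manual correction, then add
    return corpus

def correct(draft):
    """Stand-in for manual correction: fixes one known tagging error."""
    return [(w, "VF") if w == "koduththEn" else (w, t) for w, t in draft]

seed = [[("wAn", "PRP"), ("puththakam", "NN")]]
batches = [[["wAn", "koduththEn"]], [["koduththEn"]]]
grown = bootstrap(seed, batches, correct)
print(grown)
```

In the second batch the retrained model already tags `koduththEn` correctly, which is the point of the cycle: each corrected batch reduces the manual effort needed for the next one.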

Available POS Taggers


Stanford POS tagger
HunPos
MBT: Memory-based Tagger
TreeTagger
MXPOST (Java POS tagger)
fnTBL
mu-TBL
QTAG part-of-speech tagger
The TOSCA/LOB tagger
Brill's transformation-based learning tagger
Original Xerox tagger
Lingua-EN-Tagger
SVMTool POS tagger

SVMTool
SVMTool is implemented based on the principle of Support Vector Machines (SVM). It was developed by Jesús Giménez and Lluís Màrquez. It trains efficiently and solves real NLP problems. The SVMTool software package consists of three main components:
The learner (SVMTlearn)
The tagger (SVMTagger)
The evaluator (SVMTeval)

SVMTlearn
SVM models are learned from a training corpus using the SVMTlearn component. Training data should be in a particular format.
<DET> <NN> 3500 <CRD> <NN> <VF> . <DOT> <PRP> <NNP>
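To make the training format concrete, the sketch below writes a tagged sentence in a column layout: one token per line, with the token in the first column and its tag in the second, separated by a space. This is how the SVMTool documentation describes its training input as far as we are aware; the exact requirements should be checked against the SVMTool manual. The helper names and the transliterated sentence with a final full stop are illustrative.

```python
def to_column_format(sentences):
    """Return one 'token tag' line per token, sentence after sentence."""
    lines = [f"{word} {tag}" for sentence in sentences for word, tag in sentence]
    return "\n".join(lines) + "\n"

def from_column_format(text):
    """Read 'token tag' lines back into (token, tag) pairs."""
    return [tuple(line.split(" ", 1)) for line in text.splitlines() if line.strip()]

sentences = [[
    ("Kovillil", "<NN>"), ("aaru", "<CRD>"), ("adi", "<NN>"),
    ("uyaramana", "<ADJ>"), ("mani", "<NN>"), ("ullathu", "<VF>"),
    (".", "<DOT>"),
]]

training_text = to_column_format(sentences)
print(training_text)
```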


The SVMTagger
The SVMTagger tags the raw corpus using the learned Model.

Block diagram of SVMTagger

Annotated POS corpus extracted using SVMTool


POS tagged Output 1

Input sentence -> Tokenization -> Word-level tagger (SVMTagger) -> Word-level POS-tagged sentence

<PN> <NP> <VADV> <VF> . <S>

Chunker for Tamil

Chunking
The next process after POS tagging is chunking, which divides sentences into non-recursive, inseparable phrases.

[<NN>]NP

[<NN><NN>]NP

[<ADJ><NN>]NP [<VF>]VP

Tamil is a relatively free word order language, but in phrasal and clausal construction it behaves like a fixed word order language, so the chunking task is less complex.

Customized Chunk tagset for Tamil

S.No  Chunk Tag  Tag Name               Possible POS Tags
1     NP         Noun Phrase            NN, NNP, NNPC, NNC, NNQ, PRP, QTF, DET, CRD, ORD, ADJ, INT
2     AJP        Adjectival Phrase      CRD, ADJ
3     AVP        Adverbial Phrase       ADV, INT, CRD
4     VFP        Verb Finite Phrase     VF, VAX
5     VNP        Verb Nonfinite Phrase  VNAJ, VNAV, VINT, CVB
6     VGP        Verb Gerund Phrase     VBG
7     CJP        Conjunctional          CNJ
8     COMP       Complementizer         COM
9     . ?        Symbols                O

Tamil Chunker
The IOB tags are used to indicate the boundaries of each chunk.
I indicates that the current word is inside a chunk.
O indicates the boundary of the sentence.
B indicates that the current word is the beginning of a chunk, which may be followed by another chunk.
Training: Word sequences with corresponding POS and chunk tags.
Input: Word sequence and POS tags.
Output: A single best chunk tag for each word.
CEN, Amrita Vishwa Vidyapeetham, Coimbatore.

Training data - sample

PRP B-NP
DET B-NP
ADJ I-NP
NN I-NP
VF B-VFP
DOT O
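The way IOB tags mark chunk boundaries can be shown by grouping a tagged sequence back into chunks. The sketch below is a generic IOB decoder, not part of the chunker itself; because the Tamil words of the sample are not available here, the POS tags are used in place of the words. The function name `iob_to_chunks` is illustrative.

```python
def iob_to_chunks(tagged):
    """Group (token, IOB tag) pairs into (chunk type, [tokens]) chunks."""
    chunks = []
    current = None
    for token, tag in tagged:
        kind = tag[2:]
        starts_new = tag.startswith("B-") or (
            # A stray I- tag with no matching open chunk also opens a chunk.
            tag.startswith("I-") and (current is None or current[0] != kind)
        )
        if starts_new:
            current = (kind, [token])
            chunks.append(current)
        elif tag.startswith("I-"):
            current[1].append(token)   # continue the open chunk
        else:
            current = None             # "O" closes any open chunk
    return chunks

sample = [("PRP", "B-NP"), ("DET", "B-NP"), ("ADJ", "I-NP"),
          ("NN", "I-NP"), ("VF", "B-VFP"), ("DOT", "O")]

chunks = iob_to_chunks(sample)
print(chunks)
```

The second B-NP starts a new noun phrase directly after the first one, which is why the B tag is needed in addition to I.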

YamCha
YamCha is a generic, customizable, open source text chunker developed by Kudo T and Matsumoto Y. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs).

Chunker Implementation using Yamcha Tool


Training: POS-tagged corpus -> Manual chunk tagging -> YamCha training -> Trained model
Tagging: POS-tagged input -> Trained model -> Chunked output

Morphological Analyzer

Morphological level Tagging


Morphological analysis is the process of segmenting words into morphemes and analyzing the structure of the word.
Individual analyzers for:
Verb (Machine Learning)
Noun (Machine Learning)
Pronoun (Machine Learning)
Proper noun (Machine Learning)
Others (takes the grammatical category)

General framework of the Morphological level tagger

POS-tagged sentence -> Minimized POS Tagger -> Noun/Verb Analyzer, Pronoun Analyzer, Proper Noun Analyzer, Other Word Class Analyzers -> Morphologically annotated sentence

Morphological Tagger using ML

A novel approach using a machine learning tool (SVMTool). The morphological analyzer is redefined as a classification task. There are three steps in the morphological analyzer using ML:
Preprocessing
Segmentation of morphemes
Identifying morphemes

Example
<NN> + <PL> + <Ben_case>
<NN> + <OBL> + <Acc_Case>
<Verb> + <PAST> + <3SM>
<Verb> + <PAST> + <3SF>

       No. of Paradigms   No. of Inflections
                          WO-AUX   WO-PP   Total
Verb   32                 95       --      1884
Noun   25                 --       30      325

Data Modeling
Training Data Format

Implementation of Morphological level tagger


Tamil Syntactic Parsing Using Dependency Parser

Tamil Syntax

Syntactic Parsing-Dependency Parser


Syntactic parsing is defined as the process of assigning structural descriptions to sequences of words in a natural language. It is the automatic analysis of text according to a grammar. Dependency parsing is a form of syntactic parsing of natural language based on the theoretical tradition of dependency grammar. It is the process of analyzing the dependency structure of a sentence.

Challenges in DP
Tamil is a head-final language. The verb comes at the end of the clause, with a typical word order of Subject Object Verb (SOV). However, Tamil allows the word order to be changed, making it a relatively free word order language. This relatively free word order gives rise to structural complexity.

The NP arguments before a final verb can appear in any permutation and still convey the same sense. Consider the following sentences, which all convey the same meaning, though the NPs are in different positions in different sentences.

[Figure: permutations of the sentence "wAn avarukku oru puththakam koduththEn", with the NPs wAn, avarukku and oru puththakam in different positions before the final verb koduththEn]
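The size of the problem can be seen by enumerating the orders. The sketch below generates every ordering of the three NP arguments of the example sentence (wAn, avarukku, oru puththakam) in front of the final verb koduththEn; it only illustrates the combinatorics and says nothing about which orders are most common in real text.

```python
from itertools import permutations

# NP arguments and final verb of the example sentence (transliterated).
noun_phrases = ["wAn", "avarukku", "oru puththakam"]
final_verb = "koduththEn"

# Every order of the NPs, always followed by the verb (head-final).
orders = [" ".join(p) + " " + final_verb for p in permutations(noun_phrases)]

for sentence in orders:
    print(sentence)
```

Three NPs already give six surface orders for one meaning, and the count grows factorially with the number of arguments, so a parser cannot rely on position alone to identify subject and objects.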

Dependency Tagset

S.No.  Tag         Description
1      <ROOT>      Head Word
2      <N.SUB>     Nominal Subject
3      <D.OBJ>     Direct Object
4      <I.OBJ>     Indirect Object
5      <NST.MOD>   Spatial Time Modifier
6      <CL.SUB>    Clausal Subject
7      <CL.DOBJ>   Clausal Direct Object
8      <CL.IOBJ>   Clausal Indirect Object
9      <SYM>       Symbols
10     <X.CL>      Clause Boundary
11     <X>         Others

EXAMPLE

Index  POS     Chunk     Morphology  Head  Relation
1      <NNP>   <B-NP>    R+DAT       4     <CL.IOBJ>
2      <DET>   <B-NP>    DET         3     <X>
3      <NN>    <I-NP>    R+ACC       4     <CL.DOBJ>
4      <VNAV>  <B-VNP>   VNAV        7     <X.CL>
5      <PRP>   <B-NP>    R+ACC       7     <D.OBJ>
6      <NN>    <B-NP>    R+NOM       7     <X>
7      <VF>    <B-VFP>   R+PT+3SM    0     <ROOT>
8      .                             7     <SYM>
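The head indices in the example (4 3 4 7 7 7 0 7) encode a tree: each word points to its head, and the word whose head is 0 is the root. The sketch below checks that such a head list is a well-formed tree and lists the dependents of each head. It is a generic check written for this illustration, not part of the parser, and the function names are assumptions.

```python
from collections import defaultdict

def is_valid_tree(heads):
    """True if heads (1-based, 0 = root) form a single-rooted tree without cycles."""
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False
    if sum(1 for h in heads if h == 0) != 1:
        return False
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:            # walk up towards the root
            if node in seen:
                return False        # came back to a visited word: a cycle
            seen.add(node)
            node = heads[node - 1]
    return True

def dependents(heads):
    """Map each head index to the list of word indices that depend on it."""
    result = defaultdict(list)
    for index, head in enumerate(heads, start=1):
        result[head].append(index)
    return dict(result)

heads = [4, 3, 4, 7, 7, 7, 0, 7]
relations = ["CL.IOBJ", "X", "CL.DOBJ", "X.CL", "D.OBJ", "X", "ROOT", "SYM"]

print(is_valid_tree(heads))
print(dependents(heads))
```

Here word 7, the finite verb, is the root, and the verbal participle at position 4 heads its own clause (words 1 and 3) before attaching to the root.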

Dependency Parsing using MALT


The MALT parser is a language-independent tool for dependency parsing which has been tested for several languages. Using the MALT parser tool, the dependency relations of Tamil are obtained. MALT is freely downloadable at
http://www.msi.vxu.se/users/nivre/MaltParser.html

Features considered
In our model, we have considered the following features:
Word index
Word
POS tag
Chunk tag
Morphological tag
Dependency head
Dependency relation
The rest of the features are marked _.

Example data
1 <NNP> <B-NP> _ 4 <CL.IOBJ> _ _
2 <DET> <B-NP> _ 3 <X> _ _
3 <NN> <I-NP> _ 4 <CL.DOBJ> _ _
4 <VNAV> <B-VNP> _ 7 <X.CL> _ _
5 <PRP> <B-NP> _ 7 <D.OBJ> _ _
6 <NN> <B-NP> _ 7 <X> _ _
7 <VF> <B-VFP> _ 0 <ROOT> _ _
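Rows of this kind can be produced with a small formatter. The sketch below mirrors the field order visible in the first two rows of the example data (index, word, POS tag, chunk tag, a `_` placeholder, head, relation, then two more `_` placeholders) and joins the fields with tabs. The placeholder words `w1` and `w2` stand in for the Tamil words, which are not available here, and the exact column layout expected by MaltParser is an assumption to be checked against its documentation.

```python
def format_row(index, word, pos, chunk, head, relation, morph="_"):
    """Build one tab-separated row; unused features are written as '_'."""
    fields = [str(index), word, pos, chunk, morph, str(head), relation, "_", "_"]
    return "\t".join(fields)

def parse_row(row):
    """Split a row back into a dict of the features used in the model."""
    index, word, pos, chunk, morph, head, relation = row.split("\t")[:7]
    return {"index": int(index), "word": word, "pos": pos, "chunk": chunk,
            "morph": morph, "head": int(head), "relation": relation}

rows = [
    format_row(1, "w1", "<NNP>", "<B-NP>", 4, "<CL.IOBJ>"),
    format_row(2, "w2", "<DET>", "<B-NP>", 3, "<X>"),
]

for row in rows:
    print(row)
```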


Applications
Data bank
Grammar checker
Information retrieval
Machine translation system
Question answering system
Dialogue system
Language teaching

Conclusion
The field of NLP is developing steadily, primarily due to the emerging area of Artificial Intelligence applications. Innovative NLP applications using the machine learning approach have led to commendable results in creating language models.

References
Akshar Bharati, Rajeev Sangal, Dipti Misra Sharma and Lakshmi Bai. 2006. AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages, Technical Report, Language Technologies Research Centre, IIIT, Hyderabad.
Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S. October 2010. Natural Language Processing Tools for Tamil Grammar Learning and Teaching, International Journal of Computer Applications (IJCA), Foundation of Computer Science.
Dhanalakshmi V, Anand Kumar M, Shivapratap G, Soman K P and Rajendran S. May 2009. Tamil POS Tagging using Linear Programming, International Journal of Recent Trends in Engineering, Vol. 1(2): 166-169.
Giménez J and Màrquez L. 2003. Fast and Accurate Part of Speech Tagging: The SVM Approach Revisited, in Proceedings of the Fourth RANLP.

Joakim Nivre and Johan Hall. 2005. MaltParser: A language-independent system for data-driven dependency parsing. In Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT).
Joakim Nivre. 2006. Inductive Dependency Parsing, Springer Verlag.
S. Rajendran. August 2006. Parsing in Tamil, in LANGUAGE IN INDIA, www.languageinindia.com, Vol. 6 : 8.
Saravanan K, Ranjai Parthasarathi and T.V. Geetha. 2003. Syntactic Parser for Tamil, Tamil Internet Conference 2003, Chennai, Tamilnadu, India.
Selvam M, Natarajan A M and Thangarajan R. 2008. Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model, International Journal of Computer, Information, and Systems Science, and Engineering 2:4.

Dhanalakshmi V, Anand Kumar M, Rekha R U, Arun Kumar C, Soman K P and Rajendran S (2009), Morphological Analyzer for Agglutinative Languages Using Machine Learning Approaches, ARTCOM-2009 International Conference on Advances in Recent Technologies in Communication and Computing, IEEE Press, doi: 10.1109/ARTCom.2009.184.
Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2011), Shallow Parser for Tamil, Proceedings of INFITT Conference held at the University of Pennsylvania, Philadelphia, USA during June 17-19, 2011.
Jurafsky Daniel and Martin James H (2005), An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, ISBN: 0130950696, contributing writers: Andrew Kehler, Keith Vander Linden, and Nigel Ward.

Kudo T and Matsumoto Y (2000), Use of Support Vector Learning for Chunk Identification, Proceeding of CoNLL-2000 and LLL-2000, Lisbon, Portugal, pp.142-144.


THANK YOU
