
R. SHANMUGAM, M.A., M.Phil., Ph.D., PROJECT ENGINEER, CDAC – GIST, R & D, PUNE

Natural Language Processing

Natural Language Processing (NLP) is an emerging field that attempts to make computers understand human natural languages as humans understand them.

Computational Linguistics (CL)

To achieve the aim of NLP, the ways and means that must be provided to the computer are being studied. This branch is called Computational Linguistics (CL). It proposes many methods, formalisms, and algorithms for this purpose. It is an interdisciplinary field involving Linguistics, Mathematics, Computer Science, Electronics, and Statistics.

Parsing

Morphological Parsing (Word Parsing)

Words and their grammatical meaning

Syntactic Parsing

Sentences and their grammatical meaning

To develop a Parser ….

Tamil Morphology (about suffixes)

Tamil Morphophonemic Rules (Sandhi)

Tamil Morphotactics

Computational formalism (RE, FSA)

Database

Programming language

Tamil Morphology (about suffixes)

Suffixes for Noun

Plural, Case, Postposition, Clitics

Suffixes for Verb

Tense, PNG (person, number, gender), aspectual, modal, clitics

Tamil Morphophonemic Rules (Sandhi)

Addition

கணினி + தமிழ் = கணினித்தமிழ்

Deletion

மரம் + வேர் = மரவேர்

Substitution

மரம் + கொத்தி = மரங்கொத்தி
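The three sandhi operations above can be sketched in a few lines of code. This is a minimal illustration in Python (used here for brevity, although the substitution rules later in these slides are written in Perl). It assumes the caller already knows which operation applies and that the first word ends in a consonant plus pulli; the function names are hypothetical, and the conditions that trigger each rule are not modelled.

```python
PULLI = "\u0BCD"  # Tamil pulli (virama), which marks a bare consonant


def join_with_addition(first: str, second: str) -> str:
    """Addition: double the initial consonant of the second word."""
    return first + second[0] + PULLI + second


def join_with_deletion(first: str, second: str) -> str:
    """Deletion: drop the final consonant (letter + pulli) of the first word."""
    return first[:-2] + second


def join_with_substitution(first: str, second: str, replacement: str) -> str:
    """Substitution: replace the final consonant of the first word."""
    return first[:-2] + replacement + second


# The three examples from the slide.
print(join_with_addition("கணினி", "தமிழ்"))            # கணினித்தமிழ்
print(join_with_deletion("மரம்", "வேர்"))               # மரவேர்
print(join_with_substitution("மரம்", "கொத்தி", "ங்"))   # மரங்கொத்தி
```

A full implementation would choose the operation from the final sound of the first word and the initial sound of the second, which is what the morphophonemic rules are meant to capture.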

Tamil Morphotactics

Noun Root + PL + Case + Postposition + Clitics

களைப்பற்றிட் மா

Verb Root + Asp Aux + Voice Aux + Mod Aux + Tns + PNG + Cl3 + Cl4

ிக் காட்ட வைக்கப் பார்த்தானா

Computational formalism (RE, FSA)

Regular expression (RE)

A regular expression is the standard notation for characterizing strings (combinations of characters). It is a formula in a special language for specifying simple classes of strings; formally, it is an algebraic notation for characterizing sets of strings. Regular expressions were introduced by Kleene (1956). A string is any sequence of characters such as letters, numbers, spaces, tabs, and punctuation; a space is also a character because it has an encoding value. A regular expression search needs a pattern that specifies what to look for. For example, in the sentence avaN puththakam patiththaaN, the pattern /puththakam/ matches the word puththakam (book).
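The search for /puththakam/ can be tried directly with Python's standard `re` module. The sketch below first runs the slide's pattern as a literal search, then shows a pattern that specifies a class of strings rather than a single one. The second pattern (words ending in aaN) is an illustrative assumption and is not taken from the slides.

```python
import re

sentence = "avaN puththakam patiththaaN"

# Literal pattern from the slide: /puththakam/ (book)
pattern = re.compile(r"puththakam")
match = pattern.search(sentence)
if match:
    # Prints the matched text and its character span in the sentence.
    print(match.group(), match.start(), match.end())  # puththakam 5 15

# Hypothetical class-of-strings pattern: any word ending in "aaN".
ending_in_aaN = re.findall(r"\w+aaN\b", sentence)
print(ending_in_aaN)  # ['patiththaaN']
```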

Finite state automaton (FSA)

A finite state automaton is a mathematical device that can be used to implement regular expressions. Finite state automata are the theoretical foundation of a good deal of computational work. Any finite state automaton can be described by a regular expression, and vice versa.
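As a concrete illustration, the noun template from the morphotactics slide (Noun Root + PL + Case + Postposition + Clitics) can be written both as a small deterministic automaton and as an equivalent regular expression over category tags. This is only a sketch under stated assumptions: the tag names, the state names, the optionality of every slot after the root, the requirement that a postposition follows a case marker, and the repetition of clitics are all assumptions made for the example and are not claims from the slides.

```python
import re

# Transition table: (current state, input tag) -> next state.
TRANSITIONS = {
    ("q0", "ROOT"): "q1",
    ("q1", "PL"): "q2", ("q1", "CASE"): "q3", ("q1", "CLITIC"): "q5",
    ("q2", "CASE"): "q3", ("q2", "CLITIC"): "q5",
    ("q3", "POSTP"): "q4", ("q3", "CLITIC"): "q5",
    ("q4", "CLITIC"): "q5",
    ("q5", "CLITIC"): "q5",
}
ACCEPTING = {"q1", "q2", "q3", "q4", "q5"}

# The same language written as a regular expression over space-joined tags.
TEMPLATE_RE = re.compile(r"ROOT( PL)?( CASE( POSTP)?)?( CLITIC)*")


def fsa_accepts(tags):
    """Run the automaton over a list of tags; reject on a missing transition."""
    state = "q0"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:
            return False
    return state in ACCEPTING


def re_accepts(tags):
    """Check the same tag sequence against the regular expression."""
    return TEMPLATE_RE.fullmatch(" ".join(tags)) is not None


for tags in (["ROOT", "PL", "CASE", "POSTP", "CLITIC"], ["ROOT", "CASE", "PL"]):
    print(tags, fsa_accepts(tags), re_accepts(tags))
```

The first sequence follows the template and is accepted by both formulations; the second puts the plural after the case marker and is rejected by both.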

ISSUES IN MORPHOLOGICAL PARSING

1. Similarities between the suffixes and part of the suffixes.

2. Similarities between suffixes and part of the roots.

3. Similar suffixes in different categories.

4. Oblique forms of pronouns

5. Existence of glides, sandhi, and fillers.

6. Lack of vocabulary

7. Stems

8. Ambiguity

9. Exceptions

Similarities between the suffixes and part of the suffixes

Postposition versus postposition:

aaka

maaRaaka, neeraaka, muulamaaka, vaayilaaka, vaziyaaka

ஆக

(மாறாக, நேராக, மூலமாக, வாயிலாக, வழியாக)

Ranking of the suffixes

1. மாறாக

2. ஆக
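One way to realise this ranking is to try longer suffixes before the shorter suffixes they contain. The sketch below uses only the two forms ranked on the slide, in transliteration; the function name and the longest-first sorting rule are assumptions made for illustration, and a real parser would presumably load its ranked suffix list from the database.

```python
SUFFIXES = ["aaka", "maaRaaka"]

# Rank longer suffixes first so that "maaRaaka" is tried before "aaka".
RANKED = sorted(SUFFIXES, key=len, reverse=True)


def strip_suffix(word, suffixes):
    """Return (remainder, suffix) for the first suffix in the list that matches."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)], suffix
    return word, None


print(strip_suffix("maaRaaka", RANKED))              # ('', 'maaRaaka')
print(strip_suffix("maaRaaka", ["aaka", "maaRaaka"]))  # ('maaR', 'aaka')
print(strip_suffix("vaziyaaka", RANKED))             # ('vaziy', 'aaka')
```

With the ranked list, maaRaaka is recognised as a whole; if aaka is tried first, the same input is wrongly split into maaR + aaka.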

Similarities between suffixes and part of the roots.

1. maraththai மரத்தை (tree Acc) Stem: marathth; Stem: mara

2. vaaththai வாத்தை (duck Acc) Stem: vaathth; Stem: vaa

Compare the remaining syllables

Similar suffixes in different categories

vai:

vaankkivai வாங்கிவை (receive it): ceythu form, Aspectual

pookavai போகவை (make him go): ceya form, Voice

Oblique forms of pronouns

$word =~ s/eN/naaN(pirathipeyar)/;
$word =~ s/nam/naam(pirathipeyar)/;
$word =~ s/thaN/thaaN(pirathipeyar)/;
$word =~ s/tham/thaam(pirathipeyar)/;
$word =~ s/num/niir(pirathipeyar)/;
$word =~ s/um/niir(pirathipeyar)/;
$word =~ s/em/naam(pirathipeyar)/;
$word =~ s/thankkaL/thaankkaL(pirathipeyar)/;

(‘s’ is the substitution operator; pirathipeyar means pronoun.)

Existence of glides, sandhi, and fillers

theruvaiyottiyee (Glide) தெருவையொட்டியே (near the street)

avaNaippaRRiththaaN (Sandhi) அவனைப்பற்றித்தான்

maraththiNai (Filler) மரத்தினை (tree Acc)

Lack of vocabulary

Input: “suNaami” (சுனாமி)

Stems

Input; stem; deletion

1. maNnNnai மண்ணை (soil Acc); stem: maNnNn; deletion: Nn (last two characters)

2. eNNai என்னை (I Acc); stem: eNN; deletion: N (last one character)

3. pallai பல்லை (tooth Acc); stem: pall; deletion: l (last one character)

4. ceyyaamal செய்யாமல் (without doing); stem: ceyy; deletion: y (last one character)
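In all four stems listed on this slide, the deleted part is the second copy of a doubled final consonant, which takes two characters in transliteration for Nn and one character otherwise. A minimal sketch of that deletion step follows; the function name and the rule "remove a repeated final unit of one or two characters" are an interpretation of the slide, not a rule it states.

```python
def undouble_final(stem):
    """Remove the second copy of a doubled final consonant, if there is one.

    A two-character unit (such as "Nn") is checked before a one-character
    unit, so that "maNnNn" loses "Nn" rather than nothing.
    """
    for unit_len in (2, 1):
        if len(stem) >= 2 * unit_len:
            last = stem[-unit_len:]
            previous = stem[-2 * unit_len:-unit_len]
            if last == previous:
                return stem[:-unit_len]
    return stem


for stem in ("maNnNn", "eNN", "pall", "ceyy"):
    print(stem, "->", undouble_final(stem))
# maNnNn -> maNn
# eNN -> eN
# pall -> pal
# ceyy -> cey
```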

Ambiguity

Input: neythaaN (நெய்தான்)

ney + th + aaN (he wove cloth)

ney + thaaN (it is ghee)

Input: kaalai

kaalai (காலை) (morning)

kaal + ai (கால் + ஐ) (leg Acc)
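The two inputs on this slide each have more than one valid segmentation, and a parser has to produce all of them before context can choose between them. A small sketch of such an enumeration is given below. The morpheme set contains only the pieces that appear on the slide, and both the set and the function name are stand-ins for whatever lexicon and suffix database a real parser would use.

```python
# Morphemes taken from the two examples on the slide (transliterated).
MORPHEMES = {"ney", "th", "aaN", "thaaN", "kaalai", "kaal", "ai"}


def segmentations(word):
    """Return every way of splitting the word into known morphemes."""
    if not word:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in MORPHEMES:
            for rest in segmentations(word[end:]):
                results.append([prefix] + rest)
    return results


print(segmentations("neythaaN"))  # [['ney', 'th', 'aaN'], ['ney', 'thaaN']]
print(segmentations("kaalai"))    # [['kaal', 'ai'], ['kaalai']]
```

Both analyses are returned for each input; deciding between them is left to a later, context-aware step.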

Ambiguity

“avarkaLootu iNnaiya maaNaatu ceNRaaN”

அவர்களோடு இணைய மாநாடு சென்றான்

Contextual knowledge plays a vital role in resolving such ambiguity.

THANK YOU