
Lexical Resources சொல் வளங்கள்

Selvaraj Arulmozi arulmozi@gmail.com Dravidian University


SRM University

24-Jan-2012

Lexical Resources

In recent years, monolingual and multilingual lexicons, WordNets and other lexical resources have become more readily available.

In such databases, the information provided on words and their relations, both within and across languages, has become richer and more easily exploitable in various applications.


Parallel corpora aligned at word level have created possibilities for analyzing translational correspondences and deriving lexical relations within and across languages by means of new computational methods such as Semantic Mirrors.

Furthermore, unstructured texts, such as ordinary web materials, can be mined in different ways by tools such as Sketch Engine in order to derive, fully automatically, overviews of how lexical items behave in context.


TDIL Initiatives


MCIT started TDIL in 1991 for building technology solutions for Indian languages:

To develop information processing tools and techniques

To facilitate human-machine interaction without language barriers

To create and access multilingual knowledge resources and integrate them to develop innovative user products and services


Basic tools for Indian languages (National Rollout Plan)

Software tools and fonts for all 22 Indian languages (then scheduled languages) have been released in the public domain

The CD-ROM typically contains the basic software tools for enabling the linguistic community in the digital age

www.ildc.in


Ongoing projects in Consortium mode

English-IL MT system

IL-IL MT system

On-line handwriting recognition system

Cross-lingual Information Access

Speech Corpora/Technologies

WordNets

Language Corpora


Lexical Resources

WordNet

IndoWordNet

Corpora

Indian Languages Corpora Initiative


WordNet


WordNets are being used in word sense disambiguation, machine translation, information extraction and information retrieval.

Over 60 WordNets have been developed around the world.

Typologically different languages have faced challenges in adapting the original model and linking WordNets across languages.


What is WordNet?

A large lexical database, or “electronic dictionary”

Covers most English nouns, verbs, adjectives, adverbs

Electronic format makes it amenable to automatic manipulation

Used in many applications (e.g., document retrieval and sorting, machine translation)


What’s so special about WordNet?

Traditional paper dictionaries are organized alphabetically, so words that are grouped together (on the same page) are usually unrelated in meaning

WordNet is organized by meaning, so words in close proximity are related

Users can browse WordNet and find words related to their queries (as in a thesaurus)


Basic Design of WN

WordNet entries are word-concept mappings

Natural languages map words to concepts many-to-many:

One concept can be expressed by many words (synonymy):

{car, auto, automobile}

{close, shut}


One word can express many concepts (polysemy):

{club, stick}

{club, nightclub}

{club, playing card}

Added problem in Natural Language:

The words we use most frequently are the most polysemous (have the most meanings)!


WordNet handles synonymy and polysemy

Represents words and concepts unambiguously

Meaningfully relates words and concepts


WordNet’s building blocks: sets of synonyms (synsets)

{hit, beat}

{big, large}

{queue, line}

Each synset expresses a distinct concept.

Currently, WordNet contains approximately 117,000 synsets


WordNet stores, and allows one to retrieve:

all concepts that a given word can express

all words that express a given concept

Words and synsets are connected via meaning-based relations

Result: a large semantic network (as opposed to a flat list in a paper dictionary)
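The two retrieval directions described here (word to concepts, concept to words) can be sketched with a pair of lookup tables. This is only an illustration built from the example synsets on the earlier slides; the synset identifiers such as `car.n.1` and the helper names are assumptions made for this sketch and are not WordNet's actual storage format or API.

```python
from collections import defaultdict

# Toy synsets taken from the slide examples; the IDs are invented labels.
SYNSETS = {
    "car.n.1": ["car", "auto", "automobile"],
    "close.v.1": ["close", "shut"],
    "club.n.1": ["club", "stick"],
    "club.n.2": ["club", "nightclub"],
    "club.n.3": ["club", "playing card"],
}


def build_word_index(synsets):
    """Invert the synset table: map each word to the synsets it belongs to."""
    index = defaultdict(list)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].append(synset_id)
    return dict(index)


WORD_INDEX = build_word_index(SYNSETS)


def concepts_for(word):
    """All concepts (synset IDs) that a given word can express (polysemy)."""
    return WORD_INDEX.get(word, [])


def words_for(synset_id):
    """All words that express a given concept (synonymy)."""
    return SYNSETS.get(synset_id, [])


print(concepts_for("club"))
print(words_for("car.n.1"))
```

A polysemous word such as "club" maps to several synset IDs, while a synset such as `car.n.1` maps back to all of its synonyms.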


Relations among noun synsets

Hyperonymy/hyponymy relates super/subordinate synsets (denoting more/less general concepts):

                    {vehicle}
                   /         \
      {car, automobile}     {bicycle, bike}
         /        \                \
{convertible}    {SUV}        {mountain bike}

Transitivity:

A car is a kind of vehicle

An SUV is a kind of car

=> An SUV is a kind of vehicle
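The transitivity argument can be made concrete with a short sketch that follows "is a kind of" links upwards. The table below copies the toy hierarchy from the slide and assumes, for simplicity, that each concept has a single direct superordinate; the function names are illustrative only.

```python
# Direct "is a kind of" links (child -> parent) from the slide's toy hierarchy.
# Assumption for this sketch: every concept has at most one direct superordinate.
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "convertible": "car",
    "SUV": "car",
    "mountain bike": "bicycle",
}


def hypernym_chain(concept):
    """Return all superordinates of a concept, nearest first."""
    chain = []
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def is_a(specific, general):
    """True if `general` is reachable from `specific` via hypernym links."""
    return general in hypernym_chain(specific)


print(hypernym_chain("SUV"))
print(is_a("SUV", "vehicle"))
```

Only the direct links are stored; "An SUV is a kind of vehicle" is derived by walking the chain rather than being listed explicitly.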


Relations among noun synsets

Meronymy/holonymy (part/whole):

    {car, automobile}
            |
        {engine}
        /       \
{spark plug}   {cylinder}

Inheritance:

A car has an engine

An engine has spark plugs

=> A car has spark plugs


Relations among verb synsets

Verbs denote events

Related by a manner relation

   {communicate}
         |
      {talk}
      /     \
{stammer}  {whisper}


Semantics of events (verbs) are very different from semantics of entities (nouns)

WordNet captures this fact with different relations

Relations refer to temporal properties of events:

partial and complete overlap of two events

prior or posterior events


Relations among synsets create an interconnected network

Different senses of polysemous words are members of distinct synsets that are related to different synsets (i.e., occupy different locations in the network)

e.g., {stock, broth} has superordinate synset {dish}

{stock, breed} has superordinate {variety}

These different synsets are also linked to different part/whole synsets


A word’s meaning can be defined in terms of its position in the network

club 1 is a kind of association/has members

club 2 is a kind of stick

Relatedness between words or synsets can be quantified in terms of path length (number of connections among synsets)


How closely related are {zebra} and {horse}?

Very: Both share the direct superordinate equine

What about {horse, sawhorse} and {horse, gymnastic horse}?

Related, but less so: joint superordinate {artifact} is 4-5 levels up

What about {zebra} and {horse, gymnastic horse}?

Unrelated: the trees containing them never intersect!
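A rough sketch of path-length relatedness: walk up from each synset to its superordinates and count the links through the first superordinate they share. The hierarchy below is a hand-made toy (only {equine} and {artifact} come from the slide; the labels `horse/equine`, `horse/gymnastic` and the intermediate nodes are assumptions for illustration), so the numbers do not reflect real WordNet depths.

```python
# Toy child -> parent links; invented for illustration, not real WordNet data.
PARENT = {
    "zebra": "equine",
    "horse/equine": "equine",
    "equine": "animal",
    "horse/gymnastic": "gymnastic apparatus",
    "gymnastic apparatus": "artifact",
}


def ancestors(node):
    """Return the node followed by all of its superordinates, nearest first."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def path_length(a, b):
    """Number of links between a and b via their nearest shared superordinate.

    Returns None when the two chains never meet.
    """
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return steps_a + chain_b.index(node)
    return None


print(path_length("zebra", "horse/equine"))
print(path_length("zebra", "horse/gymnastic"))
```

In this toy data the zebra and the animal sense of horse meet at {equine} after one step each, whereas the zebra and the gymnastic horse have no shared superordinate, so no path is reported.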


Word sense disambiguation (WSD) is a major problem in Natural Language Processing

Assumption: words in a context (phrase, sentence, discourse) are semantically related

So, horse in the neighborhood of zebra is likely to mean “equine”; in the neighborhood of gym it likely means “gymnastic horse.”

If you want to disambiguate “horse” in the context of zebra, look for all WordNet paths from “zebra” to “horse.”

The shortest one is likely to give you the correct sense of “horse.”
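The shortest-path heuristic can be sketched as a breadth-first search from the context word to each candidate sense, keeping the nearest one. The tiny network and the sense labels below are invented for illustration (in particular, the link between "gym" and "gymnastic apparatus" is an assumed relation), and a real system would search the actual WordNet graph instead.

```python
from collections import deque

# Toy undirected links between nodes; invented for this sketch.
EDGES = [
    ("zebra", "equine"),
    ("horse/equine", "equine"),
    ("gym", "gymnastic apparatus"),
    ("horse/gymnastic", "gymnastic apparatus"),
]

# Candidate senses per ambiguous word (hypothetical labels).
SENSES = {"horse": ["horse/equine", "horse/gymnastic"]}


def build_graph(edges):
    """Build an adjacency map that can be traversed in both directions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of links or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def disambiguate(word, context_word):
    """Pick the sense of `word` with the shortest path to `context_word`."""
    graph = build_graph(EDGES)
    best_sense, best_dist = None, None
    for sense in SENSES.get(word, []):
        dist = shortest_path_length(graph, context_word, sense)
        if dist is not None and (best_dist is None or dist < best_dist):
            best_sense, best_dist = sense, dist
    return best_sense


print(disambiguate("horse", "zebra"))
print(disambiguate("horse", "gym"))
```

When no sense is reachable from the context word, the function returns `None` rather than guessing.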


Freely downloadable:


http://wordnet.princeton.edu/


WordNets around the world

Currently, WordNets exist for some 60 languages, including Arabic, Basque, Bulgarian, Estonian, Hebrew, Icelandic, Italian, Kannada, Latvian, Persian, Romanian, Sanskrit, Tamil, Telugu, Thai, Turkish and Urdu, among others.

Global WordNet Association

http://www.globalwordnet.org


WordNets in Indian Languages

Pioneer: Hindi WordNet

Other Indian Languages under Construction

North-East WordNet

Assamese, Bodo, Manipuri

Indradhanush

Bengali, Gujarati, Kashmiri, Konkani, Oriya, Punjabi, Urdu


Dravidian WordNet

Tamil (Tamil University), Telugu (Dravidian University), Malayalam (Amrita Viswavidyalayam), Kannada (University of Mysore)

Funding Agency: DIT

Budget: 152 lakhs

Time frame: 24 months

Starting Date: 26-12-2011


Work already done

Tamil WordNet

AU-KBC Research Centre with funding from Tamil Virtual University

Available for download from www.nrcfoss.in

Dravidian WordNet

Collaborative project with funding from MHRD

11,000 synsets developed

Available online from http://www.cfilt.iitb.ac.in/indowordnet/


IndoWordNet


Collaborative effort to develop/link all Indian language WordNets

Foundation of WordNet construction:

Relational Semantics

Source: Hindi WordNet

Expansion Approach


Three Principles

Minimality

The minimality principle insists on capturing the minimal set of words in the synset that uniquely identifies the concept.

For example, {family, house} uniquely identifies a concept (e.g., “he is from the house of the King of Jaipur”).


Coverage

The coverage principle then stresses completing the synset, i.e., capturing ALL the words that stand for the concept expressed by the synset (e.g., {family, house, household, ménage} completes the synset).

Within the synset, the words should be ordered according to their frequency in the corpus.
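The frequency ordering of synset members can be sketched as a simple sort over corpus counts. The token list here is a made-up stand-in for a real corpus, and breaking ties alphabetically is an assumption of this sketch, not something the slides specify.

```python
from collections import Counter


def order_synset(words, corpus_tokens):
    """Order synset members by descending corpus frequency.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(corpus_tokens)
    return sorted(words, key=lambda w: (-counts[w], w))


# A tiny invented "corpus", already tokenized.
tokens = "family house family household house family".split()
synset = ["household", "ménage", "house", "family"]

print(order_synset(synset, tokens))
```

Words that never occur in the corpus get a count of zero and therefore end up at the end of the synset.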


Replaceability


The replaceability principle demands that the most common words in the synset, i.e., the words towards the beginning of the synset, should be able to replace one another in the example sentence associated with the synset


Some Statistics on IndoWordNet

WordNet language    #Synsets    #Unique words
Assamese            3530        19609
Bengali             8679        18563
Bodo                3837        13357
Gujarati            970         2125
Hindi               33900       82000
Kannada             5920        7344
Kashmiri            6569        8674
Malayalam           6154        8622
Manipuri            2744        5231
Marathi             9739        21223
Nepali              5802        10278
Sanskrit            3340        17820
Tamil               4750        9821
Telugu              10639       18250
Urdu                6123        9641


Corpora


Indian Languages Corpora Initiative

The Indian Languages Corpora Initiative (ILCI) is a research project for technology development for Indian languages.

The Special Centre for Sanskrit Studies of Jawaharlal Nehru University is coordinating this national project and is the consortium leader of the ILCI project.


Consortium Members

Punjabi University for Punjabi

JNU (Center for Indian languages) for Urdu

ISI Kolkata for Bangla

Utkal University for Oriya

IIT Mumbai for Marathi

Gujarat University for Gujarati

Dravidian University for Telugu

Tamil University for Tamil

IITM-K Trivandrum for Malayalam

Goa University for Konkani

Each consortium member will develop corpora and standards in their respective languages.


The main objective


To build annotated parallel corpora (Hindi to 11 Indian languages along with English), together with standards, for 12 major Indian languages including English in the domains of tourism and health.

Major aims of the project are

to evolve draft standards

to build parallel corpora in the domains of tourism and health (Hindi-English and Hindi-Indian languages)

to annotate (label) the parallel corpora.


Aims


Evolving draft standards includes evaluating existing corpora and tools that have been developed as part of various projects under Technology Development in Indian Languages (TDIL), and evaluating existing standards for their usability.

Standards for corpora collection, for corpora encoding and for corpora validation

The task of Corpora development includes corpora collection in Hindi, parallel corpora in 11 Indian languages and parallel corpora in English.


The basic starting point for this project is a list of 50,000 Hindi sentences used in the tourism and health domain.

A list of data source institutions including Tourism and Health departments was made to collect data for Hindi.

Parallel aligned corpora, with Hindi as the source language, in the given 11 Indian languages and English have been created as per the standards evolved.

Annotated corpora in these 11 Indian languages and English are almost complete, as per the BIS standards.


50,000 sentences were translated from Hindi into Telugu

25,000 each in the tourism and health domains

Annotation work, based on the BIS-POS tagset, is nearing completion

Will be ready by 31st Jan and will be made available online from www.tdil.gov.in


Tools developed


Corpora Annotation Tool

KWIC identifier

Stemmer

Affix list builder

Frequency list builder

Named Entity lists builder
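To give a feel for what two of these tool types compute, here is a minimal sketch of a frequency list and a keyword-in-context (KWIC) listing over a tokenized text. It is not the ILCI tools' code; the function names, the whitespace tokenization and the `window` parameter are assumptions made for illustration.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Invented sample sentence, split on whitespace.
tokens = "the horse ran past the zebra and the horse".split()

for left, word, right in kwic(tokens, "horse"):
    print(f"{left:>10} [{word}] {right}")
print(frequency_list(tokens)[:2])
```

Each KWIC line shows up to `window` tokens on either side of the keyword, which is the usual way of inspecting how a word behaves in context.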


ILCI-Phase II


Major aims of the project are:

Draft Standards for newer languages

Corpora collection for source language

Parallel Corpora creation by translation in 22 target languages

Corpora annotation of parallel corpora in 23 languages

Agriculture and Culture domains (in addition to tourism and health domains)

A corpus of more than 10 million words to be developed


Budget


1049.26 lakhs (10 crores, 49 lakhs and 26 thousand)

Old partners – 45.85 lakhs

New partners – 60.38 lakhs

Funding Agency: Dept of IT, Ministry of Communications and IT, GoI.


New languages in ILCI-Phase II

Maithili

Kashmiri

Kannada

Sanskrit

Dogri

Sindhi

Santhali

Assamese

Manipuri

Nepali

Bodo


Advertisement


M.Sc in Computational Linguistics

@ Dravidian University

Under UGC’s Innovative Programme

Open to all graduates


Thank you for your kind attention!

நன்றி!