
Multilingual Dependency Parsing of Indian Languages

Project-I (EE47009) report submitted to

Indian Institute of Technology Kharagpur

in partial fulfilment for the award of the degree of

Bachelor of Technology

in

Electrical Engineering

by

Arundhuti Naskar

(19EE10009)

Under the supervision of

Dr. Pawan Goyal

Department of Electrical Engineering

Indian Institute of Technology Kharagpur

Autumn Semester, 2022-23

November 12, 2022


DECLARATION

I certify that

(a) The work contained in this report has been done by me under the guidance of
my supervisor.

(b) The work has not been submitted to any other Institute for any degree or
diploma.

(c) I have conformed to the norms and guidelines given in the Ethical Code of
Conduct of the Institute.

(d) Whenever I have used materials (data, theoretical analysis, figures, and text)
from other sources, I have given due credit to them by citing them in the text
of the thesis and giving their details in the references. Further, I have taken
permission from the copyright owners of the sources, whenever necessary.

Date: November 12, 2022 (Arundhuti Naskar)


Place: Kharagpur (19EE10009)

DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
KHARAGPUR - 721302, INDIA

CERTIFICATE

This is to certify that the project report entitled “Multilingual Dependency
Parsing of Indian Languages” submitted by Arundhuti Naskar (Roll No.
19EE10009) to Indian Institute of Technology Kharagpur towards partial fulfilment
of requirements for the award of degree of Bachelor of Technology in Electrical
Engineering is a record of bona fide work carried out by him under my supervision and
guidance during Autumn Semester, 2022-23.

Dr. Pawan Goyal


Date: November 12, 2022 Department of Electrical Engineering
Place: Kharagpur Indian Institute of Technology Kharagpur
Kharagpur - 721302, India

Abstract

Name of the student: Arundhuti Naskar Roll No: 19EE10009


Degree for which submitted: Bachelor of Technology
Department: Department of Electrical Engineering
Thesis title: Multilingual Dependency Parsing of Indian Languages
Thesis supervisor: Dr. Pawan Goyal
Month and year of thesis submission: November 12, 2022

One of the essential goals of Natural Language Processing (NLP) is to understand how
human languages interact with technology, using the tools of computational linguistics.
Dependency parsing has brought substantial improvements to NLP systems in several
languages and is widely accepted as a reliable tool for NLP applications. It is
particularly useful for parsing languages with free word order (i.e. languages in
which changing the position of words does not alter the meaning of the sentence),
such as many Indian Languages. However, NLP tasks for Indian Languages are often
resource-constrained, which is a hindrance because large amounts of training data
are required to obtain a model with good performance. Different parsers and
annotation schemes influence the overall NLP pipeline in various ways. This work
builds a customized pipeline on top of the pretrained multilingual toolkit Trankit,
extending it to annotate dependency relations in Indian Languages such as Hindi.

Keywords: Natural Language Processing, Multilingual Dependency Parsing, Trankit,
Indian Languages, Transfer Learning

Acknowledgements
I would like to express my sincerest gratitude to my supervisor, Dr. Pawan Goyal,
for his constant support and guidance throughout this project. I would not have
been able to complete my work without his valuable feedback and suggestions. I also
want to thank Mr. Aniruddha Roy, Research Scholar, for his constant support and
mentorship and for the invaluable inputs that aided the successful completion of
this project.

I would also like to thank the Department of Computer Science and Engineering,
IIT Kharagpur, for providing various facilities that were pivotal to the completion
of the project. Finally, I would like to extend my thanks to the Department of
Electrical Engineering for the opportunity to work on this project.

Contents

Declaration

Certificate

Abstract

Acknowledgements

Contents

1 Introduction
    Indian Languages
    Advantages of Dependency Parsing
    Motivation
    Objective
    Literature Survey
    Research gaps
    Objective
    Scope
    An algorithm
    1.1 Adding another section

2 Methodology
    2.1 Design and Architecture
    2.2 Multilingual Processing
    2.3 Customized Pipeline

3 Experimentation and Results
    3.1 Experimentation
        3.1.1 Dataset
        3.1.2 Method
    3.2 Results

A Appendix A

Bibliography
Chapter 1

Introduction

Dependency relations are an important source of information for determining the
grammatical structure of sentences. For advanced text searching, it is important
to understand the structure of a sentence and the relationships between its words.
Dependency Parsing can be used to determine the subjects and objects of a verb,
and also the words describing a subject (9). It focuses on the dependencies between
individual words rather than on the phrase structure of the sentence. The sentence
syntax is described with directed binary grammatical relations between words, which
are labeled as heads and dependents. The words are the nodes, each relation is drawn
as a directed arc from head to dependent, and the type of relation is given by a
dependency tag. The aim is to construct a dependency tree for every sentence. The
tree has a unique root node, and a unique directed path exists from the root node to
every other vertex, each of which corresponds to a single word (eng).
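
The tree properties described above can be checked mechanically. The following is a minimal sketch, assuming each token is stored as a (form, head, relation) triple with 1-based head indices and 0 standing for the root, as in the CoNLL-U convention; the function name and the example sentences are illustrative and not taken from the dataset.

```python
from typing import List, Tuple

# Each token is (form, head, deprel); heads are 1-based token indices and
# head 0 denotes the artificial root (an assumption borrowed from CoNLL-U).
Token = Tuple[str, int, str]


def is_valid_dependency_tree(tokens: List[Token]) -> bool:
    """Check the single-root and unique-path properties of a dependency tree."""
    n = len(tokens)
    heads = [head for _, head, _ in tokens]

    # Every head must point at the root (0) or at an existing token.
    if any(h < 0 or h > n for h in heads):
        return False

    # Exactly one token may attach directly to the root.
    if sum(1 for h in heads if h == 0) != 1:
        return False

    # Each token has one head, so following heads upward must reach the root
    # without revisiting a token; a revisit means the arcs contain a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# Illustrative sentence "She reads books": "reads" is the root.
well_formed = [("She", 2, "nsubj"), ("reads", 0, "root"), ("books", 2, "obj")]

# One root, but tokens 2 and 3 point at each other, so no path reaches them.
cyclic = [("a", 0, "root"), ("b", 3, "dep"), ("c", 2, "dep")]

print(is_valid_dependency_tree(well_formed))
print(is_valid_dependency_tree(cyclic))
```

Because every token stores exactly one head, a path from the root to a token is unique whenever it exists, so the check only has to rule out multiple roots and cycles.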

Indian Languages Indian Languages differ significantly from English and therefore
require different processing techniques. They have broad coverage, rich morphology,
and a non-linear structure with extensive use of complex predicates and verb
complexes (6). Unlike English, Indian Languages have a variable word order that is
less strictly bound. Some orderings are nevertheless fixed: for example, auxiliary
verb sequences follow the main verb, and postpositions follow nouns. Free word
order, on the other hand, means that some elements, such as certain adverbs, can be
moved freely within a sentence without affecting its correctness or meaning (5).

Advantages of Dependency Parsing Dependency Parsing has a natural advantage when
dealing with free word order, where the grammatical functions of a language do not
depend strictly on the ordering of words. A phrase-structured parse tree gives
hierarchical relations between the words in a constituent: in constituency parsing,
the sentence is recursively broken down into phrases such as noun phrases and verb
phrases. Dependency parsing instead describes the relation between individual
words. Due to this, dependency parsing is widely applicable in machine translation
and relation extraction (8). English follows the SVO (Subject-Verb-Object) order,
whereas the unmarked order of Hindi, like that of many Indian Languages, is SOV.
The order of constituents in a sentence follows no strict rule and often deviates
from SOV, for subjects and objects alike (gra). As a result, the same sentence can
follow different phrase structures or word orders and yet have the same meaning (8).
The head-dependent relations of dependency parsing do not depend on word order;
rather, they provide a useful approximation of the semantic relationship between
individual words in a sentence.
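
To make the word-order argument concrete, the sketch below compares two orderings of the same romanized Hindi sentence. The sentence, the head indices, and the Universal Dependencies style labels (nsubj, obj, case, root) are illustrative assumptions; the treebank used in this work follows a different, Paninian label set. The helper name `relation_triples` is hypothetical.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, int, str]  # (form, head index, relation); head 0 = root


def relation_triples(tokens: List[Token]) -> FrozenSet[Tuple[str, str, str]]:
    """Return (head word, relation, dependent word) triples, ignoring positions."""
    triples = set()
    for form, head, deprel in tokens:
        head_form = "ROOT" if head == 0 else tokens[head - 1][0]
        triples.add((head_form, deprel, form))
    return frozenset(triples)


# SOV order: "raam ne seb khaayaa" (Ram ate an apple).
sov = [
    ("raam", 4, "nsubj"),
    ("ne", 1, "case"),
    ("seb", 4, "obj"),
    ("khaayaa", 0, "root"),
]

# Scrambled OSV order of the same sentence: "seb raam ne khaayaa".
osv = [
    ("seb", 4, "obj"),
    ("raam", 4, "nsubj"),
    ("ne", 2, "case"),
    ("khaayaa", 0, "root"),
]

# The positions differ, but the head-dependent relations are identical.
same_relations = relation_triples(sov) == relation_triples(osv)
print(same_relations)
```

A phrase-structure analysis would assign the two orderings different trees, whereas the set of head-dependent relations stays the same.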

Motivation Dependency parsing helps in several essential NLP tasks such as
information extraction, question answering, and summarization (7). Most of the
success in NLP has been achieved for popular languages like English, which has
abundant training data available but is only the third largest language by number
of native speakers worldwide. Only a handful of other languages, approximately 20
out of 7000, are considered high-resource because of their abundant text corpora
(low). Many of the remaining languages, especially Asian Languages, lack even the
basic statistical tools for NLP. Therefore, we focus here on low-resource Indian
Languages such as Hindi and train a Dependency Parser using transfer learning.

Objective The prime objective of this work is to design a Multilingual Dependency
Parser for a low-resource language like Hindi. This is achieved by training a
customized pipeline with a lightweight Transformer-based Python toolkit that
supports multilingual NLP operations.

Literature Survey

Research gaps Typically include research gaps for your study.

Objective Similarly objectives of study.

Scope Define scope of study.

An algorithm How you could refer to figures: This is an example. (Refer 1.1).
You can add equations like this Eq. (1.1)

\[
SDR = sd(T) - \sum_{i} \frac{|T_i|}{|T|} \times sd(T_i) \tag{1.1}
\]

Figure 1.1: Splitting of the input space (X1 x X2) by M5’ model tree algorithm

1.1 Adding another section

You can show a lot of figures together like these Figures 1.2(a), 1.2(b), 1.2(c) below.
You can add lists into the text like this.

(a) Caption1 (b) Caption2 (c) Caption3

Figure 1.2: Figures sample

□ Some sample text item 1.

□ You may refer to tables 1.1

□ Or figures 1.2(a)

Tables can be added like this



Table 1.1: Sample table

Column 1 Column 2 Column 3


1 Data1 13.41179 0.9492839
2 Data2 13.39824 0.9492952
Chapter 2

Methodology

In this chapter, we discuss the approach followed to create a multilingual dependency
parser for Indian Languages. Our work is mainly an application of the lightweight
Transformer-based toolkit Trankit. It provides a trainable pipeline for fundamental
NLP tasks, namely POS Tagging, Morphological Feature Tagging, Dependency Parsing,
Lemmatization, and Named Entity Recognition, for over 100 languages, together with
90 pretrained pipelines for 56 languages built over 90 Universal Dependencies
Treebanks. This tool is chosen because of its efficient memory usage, speed,
contextualized embeddings, and a wordpiece-based token and sentence splitter that
provides improved performance on more than 50 languages (10).

2.1 Design and Architecture

Trankit utilizes the state-of-the-art transformer XLM-Roberta (Conneau et al., 2020),
a multilingual version of Roberta that is pretrained on 2.5 TB of filtered
CommonCrawl data comprising over 100 languages (10). The core component is this
multilingual encoder: with Adapters inserted inside each transformer layer, a single
encoder is shared across the transformer-based components of the pipeline for
different languages. Other components include the Joint Token and Sentence Splitter,
Multi-word Token Expander, Lemmatizer, a Named Entity Recognizer, and a Joint Model
for POS Tagging, Morphological Tagging, and Dependency Parsing (10). The
architecture is shown in Figure 2.1.

Figure 2.1: Overall architecture of Trankit. A single multilingual pretrained
transformer is shared across three components (pointed by red arrows) of the
pipeline for different languages.

2.2 Multilingual Processing

Some earlier work designs multilingual models by training them on the combined data
of different languages, which leads to models that perform poorly or cannot work on
raw text. NLP toolkits like UDPipe (12) and Stanza (11) do not share any component,
such as the embedding layers, between their models for different languages, which
increases model size and memory usage. The problem of loading multiple pipelines for
different languages into memory without adding to model size is solved using
Adapters. Here, a single large multilingual pretrained transformer is shared across
all components and languages, while a specific set of adapters and task-specific
weights is assigned to every transformer-based component for each language. During
inference, the adapter set and weights are activated based on the input language and
task to process the input. This approach largely solves the memory problem while
retaining the ability to process multiple languages efficiently.
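
The memory argument can be illustrated with a toy bookkeeping model. This is only a sketch of the sharing idea, not Trankit's implementation: the class `SharedEncoderPipeline`, the component names, and the parameter counts are all hypothetical and chosen to make the arithmetic visible.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SharedEncoderPipeline:
    """Toy model of one shared encoder plus per-language, per-component adapters."""

    encoder_params: int  # size of the single shared multilingual encoder
    adapter_params: int  # size of one adapter set plus its task-specific weights
    adapters: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def register(self, language: str, component: str) -> None:
        """Attach an adapter set for one (language, component) pair."""
        self.adapters[(language, component)] = self.adapter_params

    def activate(self, language: str, component: str) -> Tuple[str, str]:
        """Select the adapter set used at inference time for this input."""
        key = (language, component)
        if key not in self.adapters:
            raise KeyError(f"no adapters registered for {key}")
        return key

    def total_params(self) -> int:
        """The encoder is counted once; only the adapters grow with languages."""
        return self.encoder_params + sum(self.adapters.values())


# Hypothetical sizes, not measurements of XLM-Roberta or Trankit.
ENCODER, ADAPTER = 1_000_000, 10_000
COMPONENTS = ["tokenizer", "posdep", "ner"]
LANGUAGES = ["hindi", "bengali", "english"]

pipeline = SharedEncoderPipeline(ENCODER, ADAPTER)
for lang in LANGUAGES:
    for comp in COMPONENTS:
        pipeline.register(lang, comp)

shared_cost = pipeline.total_params()
# Baseline without sharing: every (language, component) pair has its own encoder.
separate_cost = len(LANGUAGES) * len(COMPONENTS) * (ENCODER + ADAPTER)

print(shared_cost, separate_cost)
```

With these assumed sizes, adding a language to the shared design costs only the adapter sets for its components, while the unshared baseline pays for a full encoder each time.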

2.3 Customized Pipeline

Trankit already provides 90 pretrained pipelines for 56 languages, each with its own
default treebank. However, Trankit supports 100 languages in total, and customized
pipelines can be trained for any of them using the TPipeline class, because the base
XLM-Roberta encoder is pretrained on the supported languages. In this project, we
aim to build a customized pipeline for dependency parsing of the Hindi Language.
Chapter 3

Experimentation and Results

3.1 Experimentation

3.1.1 Dataset

The HINDI INTERCHUNK Dataset has been used for training. The inter-chunk dependency
annotation follows the dependency guidelines in (4), which use a dependency
framework inspired by Panini’s grammar of Sanskrit. An example of a dependency
relation in Inter-chunk is shown in Figure 3.1.

Training a customized pipeline requires the data to be available in CoNLL-U format,
as used for the Universal Dependencies task. Here we are mainly concerned with the
HEAD, DEPREL, and DEPS fields, which are associated with Dependency Parsing. The
dataset has been divided into 932 data files for training, 112 data files
for testing, and 131 data items for testing the model.
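
As a rough illustration of the fields mentioned above, the following sketch reads the HEAD, DEPREL, and DEPS columns from CoNLL-U text. It assumes the standard ten tab-separated columns of the format; the function name `read_dependencies` and the sample sentence with its labels are hypothetical and not drawn from the Hindi dataset.

```python
from typing import Dict, List

CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_dependencies(conllu_text: str) -> List[List[Dict[str, str]]]:
    """Extract ID, FORM, HEAD, DEPREL and DEPS for every word of every sentence."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated fields, got {len(columns)}")
        row = dict(zip(CONLLU_FIELDS, columns))
        if "-" in row["ID"] or "." in row["ID"]:
            continue  # skip multiword-token ranges and empty nodes
        current.append({k: row[k] for k in ("ID", "FORM", "HEAD", "DEPREL", "DEPS")})
    if current:
        sentences.append(current)
    return sentences


# Illustrative sentence; "_" marks an unspecified field.
SAMPLE = (
    "# text = raam ne seb khaayaa\n"
    "1\traam\t_\tPROPN\t_\t_\t4\tnsubj\t_\t_\n"
    "2\tne\t_\tADP\t_\t_\t1\tcase\t_\t_\n"
    "3\tseb\t_\tNOUN\t_\t_\t4\tobj\t_\t_\n"
    "4\tkhaayaa\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

parsed = read_dependencies(SAMPLE)
heads = [(w["FORM"], w["HEAD"], w["DEPREL"]) for w in parsed[0]]
print(heads)
```

HEAD holds the ID of the governing word (0 for the root) and DEPREL the relation label, so these two columns are enough to rebuild the dependency tree of a sentence.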


Figure 3.1: Figure 1 shows SSF representation. Figure 2 shows schematic
dependency relation for sentence 1.

3.1.2 Method

We train and evaluate a custom Dependency Parser model on the Hindi Dataset and
then load the custom pipeline so that it can be used afterward. Trankit provides the
TPipeline class to build customized pipelines, which can subsequently be loaded
with the usual Pipeline class. We use a pipeline of the ‘customized’ category, which
contains the following models:

(a) Joint token and sentence splitter

(b) Joint model for part-of-speech tagging, morphological feature tagging, and
dependency parsing

(c) Lemmatizer

The task is designated as ‘posdep’ and the embedding as ‘xlm-roberta-base’, the
default embedding. Since the customized pipeline is built on the Hindi Dataset, we
have kept the language field as ‘Hindi’. The model has been trained for 100 epochs,
and the scores obtained in the last iteration are shown in Figure 3.2. The model is
saved in the ‘save_dir’ location. In the next step, we downloaded all the missing
files for the Hindi Language like tagg

Figure 3.2: Metrics after completion of model training for 100 epochs. Finally, the
model is loaded into Pipeline and can be initialized using the customized
language.
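
The training setup described in this section might look roughly like the sketch below. It is not the exact script used in this work. The helper names `build_training_config` and `train_and_load` and the file paths are hypothetical. The Trankit calls and configuration keys (`TPipeline`, `download_missing_files`, `verify_customized_pipeline`, `Pipeline`, `max_epoch`, `embedding`) are written as recalled from the Trankit documentation and should be checked against the installed version.

```python
from typing import Dict, Union

Config = Dict[str, Union[str, int]]


def build_training_config(save_dir: str, train_conllu: str, dev_conllu: str,
                          max_epoch: int = 100) -> Config:
    """Collect the settings reported in the text into a single dictionary."""
    return {
        "category": "customized",        # pipeline category
        "task": "posdep",                # joint POS, morphology and parsing model
        "save_dir": save_dir,            # where trained weights are written
        "train_conllu_fpath": train_conllu,
        "dev_conllu_fpath": dev_conllu,
        "max_epoch": max_epoch,          # assumed key name for the epoch limit
        "embedding": "xlm-roberta-base",  # assumed key name for the encoder
    }


def train_and_load(config: Config):
    """Train the posdep model, fetch missing components, and load the pipeline."""
    import trankit  # imported lazily so the config helper works without Trankit

    trainer = trankit.TPipeline(training_config=config)
    trainer.train()

    # Reuse pretrained Hindi components for the parts that were not trained here
    # (assumption: this mirrors the "missing files" step mentioned in the text).
    trankit.download_missing_files(
        category=config["category"],
        save_dir=config["save_dir"],
        embedding_name=config["embedding"],
        language="hindi",
    )
    trankit.verify_customized_pipeline(
        category=config["category"],
        save_dir=config["save_dir"],
        embedding_name=config["embedding"],
    )
    return trankit.Pipeline(lang=config["category"], cache_dir=config["save_dir"])


# Hypothetical paths; replace them with the actual CoNLL-U files.
config = build_training_config("./save_dir", "./train.conllu", "./dev.conllu")
print(config["task"], config["max_epoch"])
```

Keeping the configuration in a separate helper makes it possible to inspect the settings before starting a long training run; `train_and_load(config)` would only be called once the data files are in place.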

Figure 3.3: Output for the sentence in CoNLL-U format

3.2 Results

Figure 3.3 shows a sample query and the corresponding output obtained from the
customized pipeline.
Appendix A

Appendix A

Write your Appendix content here.

Bibliography

[eng] Dependency parsing. Engati.

[gra] Hindustani grammar - wikipedia.

[low] Nlp for low-resource settings - medium.

[4] A. Bharati, D. M. Sharma S. Husain, L. B. R. B. and Sangal, R. (2009).
AnnCorra: Treebanks for Indian languages, guidelines for annotating Hindi treebank
(version 2.0).

[5] Akshar Bharati, Vineet Chaitanya, R. S. (2016). Local word grouping and its
relevance to indian languages.

[6] Akshar Bharati, Mridul Gupta, V. Y. K. G. D. M. S. (2009). Simple parser for
Indian languages in a dependency framework.

[7] Daniel Jurafsky, J. H. M. (2008). Speech and language processing.

[8] Falavarjani, S.A.M., G.-S. G. . . I. I. G. M.-S. T. P. J. Q. J. W. R. e. (2015).
Advantages of dependency parsing for free word order natural languages.

[9] Green, N. (2011). Dependency parsing.

[10] Nguyen, M. V., Lai, V. D., Pouran Ben Veyseh, A., and Nguyen, T. H. (2021).
Trankit: A light-weight transformer-based toolkit for multilingual natural
language processing. In Proceedings of the 16th Conference of the European Chapter
of the Association for Computational Linguistics: System Demonstrations, pages
80–90, Online. Association for Computational Linguistics.

[11] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. (2020). Stanza: A
python natural language processing toolkit for many human languages. In Proceed-
ings of the 58th Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, pages 101–108, Online. Association for Computational
Linguistics.

[12] Straka, M. (2018). UDPipe 2.0 prototype at CoNLL 2018 UD shared task.
In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw
Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association
for Computational Linguistics.
