Bachelor of Technology
in
Electrical Engineering
by
Arundhuti Naskar
(19EE10009)
Declaration
I certify that
(a) The work contained in this report has been done by me under the guidance of
my supervisor.
(b) The work has not been submitted to any other Institute for any degree or
diploma.
(c) I have conformed to the norms and guidelines given in the Ethical Code of
Conduct of the Institute.
(d) Whenever I have used materials (data, theoretical analysis, figures, and text)
from other sources, I have given due credit to them by citing them in the text
of the thesis and giving their details in the references. Further, I have taken
permission from the copyright owners of the sources, whenever necessary.
DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
KHARAGPUR - 721302, INDIA
CERTIFICATE
Abstract
One of the essential goals of Natural Language Processing (NLP) is to understand how human languages can be processed by machines, using the tools of computational linguistics. Dependency parsing has brought extensive improvements to NLP systems in several languages and is widely accepted as a reliable tool for NLP applications. It is particularly useful for parsing languages with free word order (i.e. languages in which the meaning of a sentence does not change when word positions are altered), such as many Indian languages. However, these tasks are often resource-constrained for Indian languages, which is a hindrance because large amounts of training data are required to build a model with good performance. Different parsers and annotation schemes also influence the overall NLP pipeline in various ways. This work builds a customized pipeline on top of the pretrained multilingual model Trankit, extending it to annotate dependency relations in Indian languages such as Hindi.
Acknowledgements
I would like to express my sincerest gratitude to my supervisor, Dr. Pawan Goyal, for his constant support and guidance throughout this project. Needless to say, I would not have been able to complete my work without his valuable feedback and suggestions. I also want to thank Mr. Aniruddha Roy, Research Scholar, for his constant support and mentorship and for continuously providing invaluable inputs that have aided in the successful completion of this project.
I would also like to thank the Department of Computer Science and Engineering, IIT Kharagpur, for providing various facilities that were pivotal in completing the project. Finally, I would like to extend my thanks to the Department of Electrical Engineering for the opportunity to work on this project.
Contents
Declaration i
Certificate ii
Abstract iii
Acknowledgements iv
Contents v
1 Introduction 1
Indian Languages . . . . . . . . . . . . . . . . . . . . . 1
Advantages of Dependency Parsing . . . . . . . . . . . 2
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 2
Objective . . . . . . . . . . . . . . . . . . . . . . . . . 3
Literature Survey . . . . . . . . . . . . . . . . . . . . . 3
Research gaps . . . . . . . . . . . . . . . . . . . . . . . 3
Objective . . . . . . . . . . . . . . . . . . . . . . . . . 3
Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Methodology 6
2.1 Design and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Multilingual Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Customized Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
A Appendix A 12
Bibliography 13
Chapter 1
Introduction
Indian Languages Indian languages are significantly different from English and require different processing techniques. They have broad coverage, rich morphology, and a non-linear structure with extensive use of complex predicates and verb complexes (6). Unlike their English counterparts, Indian languages
have a variable word order that is less strictly bound. An example of fixed order is main verbs succeeding auxiliary verb sequences and nouns being followed by postpositions. By contrast, the property of free word order means that some adverbs can be moved freely within a sentence without any difference to correctness or meaning (5).
spoken by native speakers worldwide. Only a handful of languages, approximately 20 out of about 7000, are considered high-resource because abundant text corpora are available for them. The rest, especially Asian languages, do not have even basic statistical NLP tools available to them. Therefore, we focus here on a low-resource Indian language, Hindi, and train a dependency parser using transfer learning.
Objective Hence, the prime objective of this work is to design such a multilingual dependency parser for a low-resource language like Hindi. This is achieved by training a model with a lightweight Transformer-based Python toolkit that supports multilingual NLP operations.
Literature Survey
Chapter 2
Methodology
In this chapter, we discuss the approach followed to create a dependency parser for Indian languages. Our work is mainly an application of the lightweight Transformer-based toolkit called Trankit. It provides a trainable pipeline for fundamental NLP tasks such as POS tagging, morphological feature tagging, lemmatization, named entity recognition, and dependency parsing over 90 Universal Dependencies treebanks, covering over 100 languages and including 90 pretrained pipelines for 56 languages. This tool was chosen for its efficient memory usage, significant speed, contextualized embeddings, and a wordpiece-based token and sentence splitter that provides improved performance over 50+ languages (10).
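As a brief illustration of how Trankit is typically used, the following minimal sketch (based on Trankit's documented Python API; the input sentence is only an example) loads a pretrained pipeline and runs it on raw text to obtain dependency heads and relations:

    from trankit import Pipeline

    # Load the pretrained Hindi pipeline (the model is downloaded on first use).
    p = Pipeline('hindi')

    # Run the full pipeline on raw text; the output is a dictionary of sentences,
    # whose tokens carry 'upos', 'head', and 'deprel' fields among others.
    doc = p('यह एक उदाहरण वाक्य है।')

    # Print the dependency head index and relation label for each token.
    for token in doc['sentences'][0]['tokens']:
        print(token['text'], token['head'], token['deprel'])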
shared across various transformer components, each suited for different languages. Other components include a joint token and sentence splitter, a multi-word token expander, a lemmatizer, a named entity recognizer, and a joint model for POS tagging, morphological tagging, and dependency parsing (10). The architecture is described in Figure 2.1.
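These components are also exposed to the user as separate functions on the pipeline object. The sketch below follows Trankit's documented API; the English example sentence is only an illustration:

    from trankit import Pipeline

    p = Pipeline('english')
    text = 'Hello world. This is Trankit.'

    sentences = p.ssplit(text)   # joint token and sentence splitter
    tokens = p.tokenize(text)    # tokenization with multi-word token expansion
    lemmas = p.lemmatize(text)   # lemmatizer
    parsed = p.posdep(text)      # joint POS, morphological and dependency model
    entities = p.ner(text)       # named entity recognizer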
Some research has been done on designing multilingual models by training them on the combined data of different languages, but this leads to poorly performing models that cannot work on raw text. NLP toolkits like UDPipe (12) and Stanza (11) do not share any component between their embedding layers, which adds to model size and memory usage. The problem of loading multiple pipelines for different languages into memory without adding to the model size is solved using adapters. Here, a single large multilingual pretrained transformer is shared across all components and languages, while for each language a specific set of adapters and task-based weights is assigned to every transformer-based component. During inference, the adapter set and weights are activated based on the input language and task to process the query input. This approach solves the memory problem to a large extent while retaining the ability to process multiple languages simultaneously and efficiently.
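From the user's perspective, this adapter-based design is what allows several languages to coexist in one pipeline. The following sketch uses Trankit's documented multilingual API; the choice of languages is only an example:

    from trankit import Pipeline

    # Initialize a pipeline; the shared multilingual encoder is loaded once.
    p = Pipeline('english')

    # Adding a language only adds its adapters and task-specific weights,
    # not another copy of the large pretrained transformer.
    p.add('hindi')

    # Activate the Hindi adapters before processing Hindi text.
    p.set_active('hindi')
    doc = p('यह एक उदाहरण वाक्य है।')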
Trankit already provides 90 pretrained pipelines for 56 languages, each with its own default treebank. However, 100 languages are supported in Trankit, and customized pipelines for them can be trained using the TPipeline class, because the base-model XLM-RoBERTa encoder is pretrained on the supported languages. In this project, we aim to build a customized pipeline for dependency parsing for the Hindi language.
Chapter 3
Experimentation and Results
3.1 Experimentation
3.1.1 Dataset
The Hindi Inter-chunk dataset has been used for training. The inter-chunk dependency annotation follows the dependency guidelines in (4), which use a dependency framework inspired by Panini's grammar of Sanskrit. An example of an inter-chunk dependency relation is shown in Figure 3.1.
The requirement for training customized pipelines is that the data should be available in CoNLL-U format, as for the Universal Dependencies task. Here we are mainly concerned with the HEAD, DEPREL, and DEPS fields, which carry the dependency-parsing annotation. The dataset has been divided into 932 data files for training, 112 data files for validation, and 131 data files for testing the model.
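As an illustration of the format (a minimal sketch; the file name used below is hypothetical), each non-comment line of a CoNLL-U file contains ten tab-separated columns, of which HEAD, DEPREL, and DEPS hold the dependency annotation:

    # The ten CoNLL-U columns are:
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    with open('hi_train.conllu', encoding='utf-8') as f:  # hypothetical file name
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):  # skip blanks and sentence-level comments
                continue
            cols = line.split('\t')
            form, head, deprel, deps = cols[1], cols[6], cols[7], cols[8]
            print(form, head, deprel, deps)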
Figure 3.1: Panel 1 shows the SSF representation; panel 2 shows the schematic dependency relation for sentence 1.
3.1.2 Method
We evaluate a custom dependency parser model on the Hindi dataset and then load the resulting custom pipeline so that it can be used afterwards. Trankit provides the TPipeline class to build customized pipelines, which can subsequently be loaded with the usual Pipeline class. A pipeline of the customized category is used, which contains the following models: a joint token and sentence splitter; a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing; and a lemmatizer. The task is designated as 'posdep' and the embedding as 'xlm-roberta-base', the default embedding. Since the customized pipeline is built on the Hindi dataset, we have kept the language field as 'Hindi'. The model has been trained for 100 epochs; the scores obtained in the last iteration are reported below. The model is saved in the 'save_dir' location. In the next step, we downloaded all the missing files for the Hindi language, like the tagger.
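A minimal sketch of this training step, following Trankit's documented TPipeline interface, is shown below. The file paths are hypothetical, and 'max_epoch' is assumed to be the configuration key used to request the 100 training epochs mentioned above; the default 'xlm-roberta-base' encoder is used when no embedding is specified:

    from trankit import TPipeline

    # Configure a customized 'posdep' pipeline (joint POS tagging, morphological
    # feature tagging, and dependency parsing) on the Hindi data in CoNLL-U format.
    trainer = TPipeline(training_config={
        'category': 'customized',                   # customized pipeline category
        'task': 'posdep',                           # joint POS/morph/dependency task
        'save_dir': './save_dir',                   # where the trained model is saved
        'train_conllu_fpath': './hi_train.conllu',  # hypothetical training file path
        'dev_conllu_fpath': './hi_dev.conllu',      # hypothetical development file path
        'max_epoch': 100,                           # assumed key for the number of epochs
    })

    # Start training; scores are reported after each epoch.
    trainer.train()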
Figure 3.2: Metrics after completion of model training for 100 epochs.
Finally, the model is loaded into the Pipeline class and can be initialized using the customized language.
3.2 Results
A sample query and its corresponding output obtained from the customized pipeline are shown in Figure 3.3.
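A sketch of how the trained pipeline is loaded and queried, following Trankit's documented interface for customized pipelines (the query sentence is only an illustration):

    import trankit
    from trankit import Pipeline

    # Make the trained model visible to Trankit as a loadable customized pipeline.
    trankit.verify_customized_pipeline(
        category='customized',
        save_dir='./save_dir',
        embedding_name='xlm-roberta-base',
    )

    # Load the customized pipeline and parse a sample Hindi query.
    p = Pipeline(lang='customized', cache_dir='./save_dir')
    doc = p.posdep('राम ने किताब पढ़ी।')  # illustrative query sentence

    for token in doc['sentences'][0]['tokens']:
        print(token['text'], token['head'], token['deprel'])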
Appendix A
Bibliography
[5] Bharati, A., Chaitanya, V., and Sangal, R. (2016). Local word grouping and its relevance to Indian languages.
[10] Nguyen, M. V., Lai, V. D., Pouran Ben Veyseh, A., and Nguyen, T. H. (2021). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Online. Association for Computational Linguistics.
[11] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. (2020). Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.
[12] Straka, M. (2018). UDPipe 2.0 prototype at CoNLL 2018 UD shared task.
In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw
Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association
for Computational Linguistics.