Bachelor of Technology
in
Electrical Engineering
by
Arundhuti Naskar
(19EE10009)
Declaration
I certify that
(a) The work contained in this report has been done by me under the guidance of
my supervisor.
(b) The work has not been submitted to any other Institute for any degree or
diploma.
(c) I have conformed to the norms and guidelines given in the Ethical Code of
Conduct of the Institute.
(d) Whenever I have used materials (data, theoretical analysis, figures, and text)
from other sources, I have given due credit to them by citing them in the text
of the thesis and giving their details in the references. Further, I have taken
permission from the copyright owners of the sources, whenever necessary.
DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
KHARAGPUR - 721302, INDIA
CERTIFICATE
Abstract
One of the essential goals of Natural Language Processing (NLP) is to understand how human languages can be processed by machines, using the tools of computational linguistics. Dependency parsing has brought extensive improvements to NLP systems in several languages and is widely accepted as a reliable tool for NLP applications. It is particularly useful for parsing languages with free word order (i.e. languages in which the meaning of a sentence does not change when word positions are altered), such as many Indian languages. However, these tasks are often resource-constrained for Indian languages, which is a hindrance because large amounts of training data are required to build a model with good performance. Different parsers and annotation schemes also influence the overall NLP pipeline in various ways. This work builds a customized pipeline on top of the pretrained multilingual model Trankit, extending it to annotate dependency relations in Indian languages such as Hindi.
Acknowledgements
I would like to express my sincerest gratitude to my supervisor, Dr. Pawan Goyal, for his constant support and guidance throughout this project. Needless to say, I would not have been able to complete my work without his valuable feedback and suggestions. I also want to thank Mr. Aniruddha Roy, Research Scholar, for his constant support and mentorship and for continuously providing invaluable inputs that have aided in the successful completion of this project.
I would also like to thank the Department of Computer Science and Engineering, IIT Kharagpur, for providing various facilities that were pivotal in completing the project. Finally, I would like to extend my thanks to the Department of Electrical Engineering for the opportunity to work on this project.
Contents
Declaration i
Certificate ii
Abstract iii
Acknowledgements iv
Contents v
1 Introduction 1
Indian Languages . . . . . . . . . . . . . . . . . . . . . 1
Advantages of Dependency Parsing . . . . . . . . . . . 2
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 2
Objective . . . . . . . . . . . . . . . . . . . . . . . . . 3
Literature Survey . . . . . . . . . . . . . . . . . . . . . 3
Research gaps . . . . . . . . . . . . . . . . . . . . . . . 3
Objective . . . . . . . . . . . . . . . . . . . . . . . . . 3
Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Methodology 6
2.1 Design and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Multilingual Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Customized Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
A Appendix A 12
Bibliography 13
Chapter 1
Introduction
Indian Languages Indian languages are significantly different from English and require different processing techniques. They have broad coverage, rich morphology, and a non-linear structure with extensive use of complex predicates and verb complexes (6). Unlike their English counterparts, Indian languages
have a variable word order that is less strictly bound. An example of fixed order is main verbs succeeding auxiliary verb sequences and nouns being followed by postpositions. By contrast, the property of free word order means that some adverbs can be moved freely within a sentence without any difference to correctness or meaning (5).
spoken by native speakers worldwide. Only a handful of languages, approximately 20 out of about 7000, are considered high-resource because abundant text corpora are available for them. The rest, especially Asian languages, do not have even basic statistical NLP tools available to them. Therefore, we focus here on a low-resource Indian language, Hindi, and train a dependency parser using transfer learning.
Objective Hence, the prime objective of this work is to design such a multilingual dependency parser for a low-resource language like Hindi. This is achieved by training a model with a lightweight Transformer-based Python toolkit that supports multilingual NLP operations.
Literature Survey
Chapter 2
Methodology
In this chapter, we discuss the approach followed to create a dependency parser for Indian languages. Our work is mainly an application of the lightweight Transformer-based toolkit called Trankit. It provides a trainable pipeline for fundamental NLP tasks such as POS tagging, morphological feature tagging, lemmatization, named entity recognition, and dependency parsing over 90 Universal Dependencies treebanks, covering over 100 languages and including 90 pretrained pipelines for 56 languages. This tool was chosen for its efficient memory usage, significant speed, contextualized embeddings, and a wordpiece-based token and sentence splitter that provides improved performance over 50+ languages (10).
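As a brief illustration of how Trankit is typically used, the following minimal sketch (based on Trankit's documented Python API; the input sentence is only an example) loads a pretrained pipeline and runs it on raw text to obtain dependency heads and relations:

    from trankit import Pipeline

    # Load the pretrained Hindi pipeline (the model is downloaded on first use).
    p = Pipeline('hindi')

    # Run the full pipeline on raw text; the output is a dictionary of sentences,
    # whose tokens carry 'upos', 'head', and 'deprel' fields among others.
    doc = p('यह एक उदाहरण वाक्य है।')

    # Print the dependency head index and relation label for each token.
    for token in doc['sentences'][0]['tokens']:
        print(token['text'], token['head'], token['deprel'])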
shared across various transformer components, each suited for different languages. Other components include a joint token and sentence splitter, a multi-word token expander, a lemmatizer, a named entity recognizer, and a joint model for POS tagging, morphological tagging, and dependency parsing (10). The architecture is described in Figure 2.1.
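These components are also exposed to the user as separate functions on the pipeline object. The sketch below follows Trankit's documented API; the English example sentence is only an illustration:

    from trankit import Pipeline

    p = Pipeline('english')
    text = 'Hello world. This is Trankit.'

    sentences = p.ssplit(text)   # joint token and sentence splitter
    tokens = p.tokenize(text)    # tokenization with multi-word token expansion
    lemmas = p.lemmatize(text)   # lemmatizer
    parsed = p.posdep(text)      # joint POS, morphological and dependency model
    entities = p.ner(text)       # named entity recognizer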
Some research has been done on designing multilingual models by training them on the combined data of different languages, but this leads to poorly performing models that cannot work on raw text. NLP toolkits like UDPipe (12) and Stanza (11) do not share any component between their embedding layers, which adds to model size and memory usage. The problem of loading multiple pipelines for different languages into memory without adding to the model size is solved using adapters. Here, a single large multilingual pretrained transformer is shared across all components and languages, while for each language a specific set of adapters and task-based weights is assigned to every transformer-based component. During inference, the adapter set and weights are activated based on the input language and task to process the query input. This approach solves the memory problem to a large extent while retaining the ability to process multiple languages simultaneously and efficiently.
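From the user's perspective, this adapter-based design is what allows several languages to coexist in one pipeline. The following sketch uses Trankit's documented multilingual API; the choice of languages is only an example:

    from trankit import Pipeline

    # Initialize a pipeline; the shared multilingual encoder is loaded once.
    p = Pipeline('english')

    # Adding a language only adds its adapters and task-specific weights,
    # not another copy of the large pretrained transformer.
    p.add('hindi')

    # Activate the Hindi adapters before processing Hindi text.
    p.set_active('hindi')
    doc = p('यह एक उदाहरण वाक्य है।')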
Trankit already provides 90 pretrained pipelines for 56 languages, each with its own default treebank. However, 100 languages are supported in Trankit, and customized pipelines for them can be trained using the TPipeline class, because the base-model XLM-RoBERTa encoder is pretrained on the supported languages. In this project, we aim to build a customized pipeline for dependency parsing for the Hindi language.
Chapter 3
Experimentation and Results
3.1 Experimentation
3.1.1 Dataset
The Hindi Inter-chunk dataset has been used for training. The inter-chunk dependency annotation follows the dependency guidelines in (4), which use a dependency framework inspired by Panini's grammar of Sanskrit. An example of an inter-chunk dependency relation is shown in Figure 3.1.
The requirement for training customized pipelines is that the data should be available in CoNLL-U format, as for the Universal Dependencies task. Here we are mainly concerned with the HEAD, DEPREL, and DEPS fields, which carry the dependency-parsing annotation. The dataset has been divided into 932 data files for training, 112 data files for validation, and 131 data files for testing the model.
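As an illustration of the format (a minimal sketch; the file name used below is hypothetical), each non-comment line of a CoNLL-U file contains ten tab-separated columns, of which HEAD, DEPREL, and DEPS hold the dependency annotation:

    # The ten CoNLL-U columns are:
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    with open('hi_train.conllu', encoding='utf-8') as f:  # hypothetical file name
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):  # skip blanks and sentence-level comments
                continue
            cols = line.split('\t')
            form, head, deprel, deps = cols[1], cols[6], cols[7], cols[8]
            print(form, head, deprel, deps)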
Figure 3.1: Panel 1 shows the SSF representation; panel 2 shows the schematic dependency relation for sentence 1.
3.1.2 Method
We evaluate a custom dependency parser model on the Hindi dataset and then load the resulting custom pipeline so that it can be used afterwards. Trankit provides the TPipeline class to build customized pipelines, which can subsequently be loaded with the usual Pipeline class. A pipeline of the customized category is used, which contains the following models: a joint token and sentence splitter; a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing; and a lemmatizer. The task is designated as 'posdep' and the embedding as 'xlm-roberta-base', the default embedding. Since the customized pipeline is built on the Hindi dataset, we have kept the language field as 'Hindi'. The model has been trained for 100 epochs; the scores obtained in the last iteration are reported below. The model is saved in the 'save_dir' location. In the next step, we downloaded all the missing files for the Hindi language, like the tagger.
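A minimal sketch of this training step, following Trankit's documented TPipeline interface, is shown below. The file paths are hypothetical, and 'max_epoch' is assumed to be the configuration key used to request the 100 training epochs mentioned above; the default 'xlm-roberta-base' encoder is used when no embedding is specified:

    from trankit import TPipeline

    # Configure a customized 'posdep' pipeline (joint POS tagging, morphological
    # feature tagging, and dependency parsing) on the Hindi data in CoNLL-U format.
    trainer = TPipeline(training_config={
        'category': 'customized',                   # customized pipeline category
        'task': 'posdep',                           # joint POS/morph/dependency task
        'save_dir': './save_dir',                   # where the trained model is saved
        'train_conllu_fpath': './hi_train.conllu',  # hypothetical training file path
        'dev_conllu_fpath': './hi_dev.conllu',      # hypothetical development file path
        'max_epoch': 100,                           # assumed key for the number of epochs
    })

    # Start training; scores are reported after each epoch.
    trainer.train()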
Figure 3.2: Metrics after completion of model training for 100 epochs.
Finally, the model is loaded into the Pipeline class and can be initialized using the customized language.
3.2 Results
A sample query and its corresponding output obtained from the customized pipeline are shown in Figure 3.3.
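A sketch of how the trained pipeline is loaded and queried, following Trankit's documented interface for customized pipelines (the query sentence is only an illustration):

    import trankit
    from trankit import Pipeline

    # Make the trained model visible to Trankit as a loadable customized pipeline.
    trankit.verify_customized_pipeline(
        category='customized',
        save_dir='./save_dir',
        embedding_name='xlm-roberta-base',
    )

    # Load the customized pipeline and parse a sample Hindi query.
    p = Pipeline(lang='customized', cache_dir='./save_dir')
    doc = p.posdep('राम ने किताब पढ़ी।')  # illustrative query sentence

    for token in doc['sentences'][0]['tokens']:
        print(token['text'], token['head'], token['deprel'])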
Appendix A
Bibliography
[5] Bharati, A., Chaitanya, V., and Sangal, R. (2016). Local word grouping and its relevance to Indian languages.
[10] Nguyen, M. V., Lai, V. D., Pouran Ben Veyseh, A., and Nguyen, T. H. (2021). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Online. Association for Computational Linguistics.
[11] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. (2020). Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.
[12] Straka, M. (2018). UDPipe 2.0 prototype at CoNLL 2018 UD shared task.
In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw
Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association
for Computational Linguistics.