
2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT)

Bangalore, India. Oct 7-9, 2022

Neural Machine Translation for English-Assamese Language Pair using Transformer
Rudra Dutt #1, Tarun Aditya Kusupati #2, Akshat Srivastava #3, Basab Nath *4
# Computer Engineering, Ajeenkya DY Patil University
DY Patil Knowledge City Rd, Charholi Budruk, Pune, Maharashtra, India
1 rudra.dutt@adypu.edu.in  2 tarun.kusupati@adypu.edu.in  3 akshat.srivastava@adypu.edu.in
4 basab.nath@adypu.edu.in

Abstract— Machine translation is a computer-based translation process that receives a set of words from a particular human-readable language as input and outputs a second set of words in the intended human-readable language. Machine translation has assisted linguists and sociologists all across the world. A machine translation model may be statistical, rule-based, or based on neural networks. The Neural Machine Translation (NMT) model was introduced in response to the numerous drawbacks of rule-based and statistical machine translation models. This study contributes a parallel corpus of Assamese and English and develops an NMT system, a transformer model with a self-attention mechanism. For translation from English to Assamese, the system achieved a bilingual evaluation understudy (BLEU) score of 7.6, and for translation from Assamese to English, a BLEU score of 23.2.

Keywords— Transformer, BLEU Score, Encoder-Decoder, Comparative Analysis, Attention Mechanism.

I. INTRODUCTION

There are several Indian languages recognized by the Indian constitution, as well as several others, but not enough language-processing resources are available for them; the majority of the resources are in English. To build a high-quality translation system, NMT requires a large, high-quality corpus. As a result, NMT is typically effective only for resource-rich languages, i.e., languages with hundreds of thousands or millions of parallel sentences. Hindi, the most widely spoken language in India, has a large number of parallel resources but is still considered a mid-resource language when compared to European languages, while Assamese, a widely spoken language in India's north-eastern region, has few parallel resources and is classified as a low-resource language. In this paper, we use transformers to build machine translation from English to Assamese and vice versa. The translation outputs are first evaluated using automatic evaluation metrics and are then evaluated manually to determine the effectiveness of the automatic evaluation. With ASMLATOR, we propose to create a system that effectively translates low-resource languages by utilizing deep learning's technological developments.

II. MOTIVATION

As engineers, we wanted to create something that would solve a real-world problem that the world is facing. We identified a number of areas in which we could work, but we ultimately dedicated our focus to AI- and ML-related topics. One of the areas that piqued our interest was the field of machine translation in natural language processing. Much work has been done in this field so far, and much is still being done; we all use it in our daily lives now, be it the translators of Google, Microsoft, and others, where we place our text and it is converted into the desired language, or the features where we speak into the mic and our speech is translated to text. We chose to focus on English-Assamese because there has not been much work done for this pair, and Assamese is a language spoken widely in the north-eastern part of India, so a translation system for this pair would help people travelling from other parts of the country and the world communicate easily.

III. LITERATURE REVIEW

When we begin working on a topic, one of the most important things we do is look at previous work on that topic to see what aspects have been covered, what useful insight has been provided, and how our approach differs. We also read a number of research papers that helped us fine-tune our topic and our goals for achieving it. The first paper we read was Development of Neural Machine Translator for English-Assamese language pair [1], written by Basab Nath, Sunita Sarkar, and Surajeet Das. They used five different machine translation models for converting English to Assamese and also compared LSTM and GRU models. The next paper we referred to was Rup Jyoti Baruah, Rajesh Kumar Mundotiya, and Anil Kumar Singh's Low Resource Neural Machine Translation: Assamese to/from other Indo-Aryan languages [2]. This paper compared the effectiveness of SMT and NMT models for bidirectional language pairs involving Assamese and other Indo-Aryan languages. Another paper was Self-attention based end-to-end Hindi-English neural machine translation [3] by Siddhant Srivastava and Ritu Tiwari, which focused on recent advances in the field of neural machine translation models, the different domains where NMT has replaced traditional SMT models, and future directions in the field. They proposed a neural machine translation approach with an end-to-end self-attention transformer network trained on a Hindi-English parallel corpus and compared its efficiency to other state-of-the-art models, such as an encoder-decoder NMT architecture for the English-Hindi language pair, an LSTM, a Bidirectional LSTM with a Conditional Random Field, and gated recurrent units with an attention mechanism.

We used the paper Neural Machine Translation for Low Resource Assamese-English [4] as a comparison for our work. It was written by Sahinur Rahman Laskar, Partha Pakray, and Sivaji Bandyopadhyay. This paper contributed an Assamese-English parallel corpus and constructed two NMT systems: a sequence-to-sequence recurrent neural network (RNN) with an attention mechanism (NMT-1) and a transformer model with a self-attention mechanism (NMT-2).

We primarily compared our model to the paper's NMT-2 system.

IV. BACKGROUND

There are three basic building blocks that we focus on while developing a machine translation model: the encoder-decoder, the attention mechanism, and the BLEU score calculation. What they are and why they are used can be understood from the subsections below.

A. Encoder-Decoder [5]

The Encoder-Decoder, a neural network architecture introduced in 2014, is now broadly applied in practice. It is a crucial component of translation services and can be found in the neural networks behind various translation services across the web. Encoder-Decoder is a machine learning paradigm involving two neural networks. Typically, the two networks possess a similar structure; the first one operates in the usual direction, while the second operates in reverse.

The process is as follows (a toy sketch of this pipeline is given after Fig. 1):

1. The input of the first neural network is a text, and its output is a series of numbers.
2. This string of numbers is fed into the second network, which this time produces a phrase.
3. The two networks really perform the same action; the only difference is that one network operates in the usual orientation while the other proceeds in the opposite direction.
4. So we start with a sentence, a group of words, which is converted into a group of numbers, which is then converted back into a group of words forming the output sentence.

As you can see, the first neural network is an encoder and the second neural network is a decoder.

Fig. 1. Diagram showing the Encoder-Decoder
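To make the encoder-decoder idea concrete, the following is a minimal, illustrative sketch (not the model used in this work) of a sequence-to-sequence encoder-decoder written in PyTorch; the GRU choice, the layer sizes, and the dummy token ids are assumptions made only for illustration.

# Minimal illustrative encoder-decoder (GRU-based); sizes are arbitrary assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) -> per-token states and a final summary vector
        embedded = self.embed(src_ids)
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_ids, hidden):
        # tgt_ids: (batch, tgt_len); hidden: summary vector passed from the encoder
        embedded = self.embed(tgt_ids)
        outputs, hidden = self.rnn(embedded, hidden)
        return self.out(outputs), hidden

# Usage: encode a source sentence (as token ids) and decode target-vocabulary scores.
enc, dec = Encoder(vocab_size=8000), Decoder(vocab_size=8000)
src = torch.randint(0, 8000, (1, 7))   # dummy source token ids
tgt = torch.randint(0, 8000, (1, 5))   # dummy target token ids (teacher forcing)
enc_outputs, summary = enc(src)
logits, _ = dec(tgt, summary)          # (1, 5, 8000) scores over the target vocabulary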
B. Attention Mechanism

The fundamental building block of a transformer is self-attention. The definition of "attention" in the English language is "directing your focus toward something and taking more notice." Based on this idea of directing your attention, Deep Learning's attention mechanism gives particular aspects of the data it processes more weight than others.

In general, attention is a part of a network's architecture and is in charge of quantifying dependency:

1. Between the input and output elements (general attention)
2. Within the input components themselves (self-attention)

While attention is used in other deep learning disciplines, such as computer vision, it achieved its greatest breakthrough and success when applied to Natural Language Processing (NLP). This is because attention was developed to solve the issue of long sequences in machine translation, an issue that also arises in the majority of other NLP workloads.

Types of attention:

1. Bahdanau attention
2. Luong attention

C. Bahdanau attention [6]

This attention mechanism was introduced by Bahdanau to address the bottleneck problem that arises with the use of a fixed-length encoding vector, where the decoder would have limited access to the information provided by the input. This is thought to become especially problematic for long and/or complex sequences, whose representation would be forced to have the same dimensionality as that of shorter or simpler sequences.

The Bahdanau [7] attention mechanism is divided into step-by-step computations of the alignment scores, the weights, and the context vector. The steps involved in applying attention as described in Bahdanau's article are as follows (a small sketch of the score computation is given after Fig. 2):

1. Producing the encoder hidden states: the encoder produces a hidden state for every element in the input sequence.
2. Alignment scores are computed between each encoder hidden state and the previous decoder hidden state.
3. Softmaxing the alignment scores: the alignment scores for all encoder hidden states are combined into a single vector and then softmaxed.
4. Context vector calculation: the context vector is created by multiplying each encoder hidden state by its corresponding softmaxed alignment score and summing the results.
5. The context vector is concatenated with the previous decoder output and fed into the decoder recurrent neural network for that time step, together with the previous decoder hidden state, to produce a new output.
6. The procedure (steps 2-5) is repeated for each decoder time step until an end token is generated or the output exceeds the set maximum length.

Fig. 2. Diagram showing the Bahdanau mechanism
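As a concrete illustration of steps 2-4, the sketch below computes Bahdanau-style (additive) alignment scores, the softmaxed weights, and the context vector with plain NumPy; the dimensions and the randomly initialized weight matrices are arbitrary assumptions for illustration only, not values from our model.

# Additive (Bahdanau-style) attention for one decoder time step, illustrative only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(enc_states, dec_prev_hidden, W_enc, W_dec, v):
    # enc_states: (src_len, hid); dec_prev_hidden: (hid,)
    # score_i = v . tanh(W_enc h_i + W_dec s_{t-1})   (alignment score, step 2)
    scores = np.array([v @ np.tanh(W_enc @ h + W_dec @ dec_prev_hidden)
                       for h in enc_states])
    weights = softmax(scores)                                # step 3: softmaxed scores
    context = (weights[:, None] * enc_states).sum(axis=0)    # step 4: weighted sum
    return context, weights

hid, attn = 4, 8
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, hid))        # hidden states for 6 source positions
dec_prev_hidden = rng.normal(size=hid)
W_enc = rng.normal(size=(attn, hid))
W_dec = rng.normal(size=(attn, hid))
v = rng.normal(size=attn)
context, weights = bahdanau_attention(enc_states, dec_prev_hidden, W_enc, W_dec, v)
print(weights.round(3), context.shape)        # weights sum to 1.0; context has shape (4,)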

D. Luong attention

The second kind of attention was put forward by Thang Luong. It was built on top of the attention mechanism proposed by Bahdanau and is often referred to as multiplicative attention. Luong attention and Bahdanau attention differ primarily in two ways:

1. The method through which the alignment scores are computed.
2. The position in the decoder at which the attention mechanism is introduced.

In contrast to Bahdanau's single alignment scoring function, Luong's study proposes three different alignment scoring functions. Additionally, Luong attention has a different overall structure for the attention decoder, since the context vector can only be used after the RNN has produced its output for that time step. The Luong attention procedure is as follows (a small sketch of the multiplicative scoring is given after Fig. 3):

1. Producing the encoder hidden states: the encoder produces a hidden state for every element in the input sequence.
2. Decoder RNN: the previous decoder hidden state and decoder output are passed through the decoder RNN to create a new hidden state for the current time step.
3. Alignment scores are computed using the newly produced decoder hidden state and the encoder hidden states.
4. Softmaxing the alignment scores: the alignment scores for all encoder hidden states are combined into a single vector and then softmaxed.
5. Context vector calculation: the context vector is created by multiplying each encoder hidden state by its corresponding softmaxed alignment score and summing the results.
6. The context vector and the decoder hidden state created in step 2 are concatenated and passed through a fully connected layer to produce the final output.
7. The procedure (steps 2-6) is repeated for each decoder time step until an end token is generated or the output exceeds the set maximum length.

As mentioned above, Luong attention follows a distinct set of stages from Bahdanau attention, and the procedure also uses different computations and a different code implementation.

Fig. 3. Diagram showing the Luong mechanism
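Below is a minimal illustration of the multiplicative ("general") scoring used in Luong attention for one decoder step, again in NumPy with arbitrary assumed dimensions; it is a sketch of the idea, not the implementation used in our system.

# Multiplicative (Luong-style "general") attention for one decoder step, illustrative only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_general_attention(enc_states, dec_hidden, W_a):
    # enc_states: (src_len, hid); dec_hidden: (hid,) produced by the decoder RNN (step 2)
    # score_i = s_t . (W_a h_i)   (multiplicative alignment score, step 3)
    scores = enc_states @ W_a @ dec_hidden
    weights = softmax(scores)                                # step 4
    context = (weights[:, None] * enc_states).sum(axis=0)    # step 5
    return context, weights

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(6, 4))
dec_hidden = rng.normal(size=4)
W_a = rng.normal(size=(4, 4))
context, weights = luong_general_attention(enc_states, dec_hidden, W_a)
# Step 6 would concatenate `context` with `dec_hidden` and pass it through a linear layer.
print(weights.round(3), context.shape)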
E. BLEU score [8]

The BLEU score offers a comprehensive evaluation of translation quality. By exporting the test set along with the model predictions, you may additionally assess the model output on specific pieces of data; both the reference text (from the original dataset) and the candidate text produced by the model are included in the exported data. A score of 1.0 signifies a perfect match, whereas a value of 0.0 corresponds to a complete mismatch. We have multiplied the scores by 100 for easier reading and comparison with other work. Both the reference and candidate versions are normalized and tokenized before the BLEU score is calculated, and the final BLEU score is substantially influenced by the normalization and tokenization methods chosen. We compute the following quantities to get the BLEU score for every translation (a short worked sketch follows the list):

1) N-gram precisions: the n-gram precisions are computed for each candidate.

2) Brevity penalty: for example, if two candidates are both made up of 11 tokens, the brevity penalty is identical for each of them.

3) BLEU score: keep in mind that a BLEU score greater than zero requires at least one matched 4-gram.
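The following is an illustrative from-scratch computation of these quantities (n-gram precisions, brevity penalty, and their geometric mean) for a single candidate sentence; it is a pedagogical sketch with dummy sentences, while the scores reported later were computed with standard tooling as described in Section VI.

# Illustrative from-scratch BLEU for one candidate; dummy sentences, not our evaluation code.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())   # clipped n-gram matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                                 # no matched 4-gram -> BLEU is 0
        return 0.0
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return 100 * brevity_penalty * geo_mean                  # scaled by 100, as in the text

print(round(bleu("the cat sat on the mat in the house",
                 "the cat is sitting on the mat in the house"), 2))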
V. METHODOLOGY

A. Data for training

To train our model, we combined Samanantar parallel data with our own proprietary dataset of English and Assamese. The dataset consists of 140357 English sentences and 140357 Assamese sentences; they are further divided into training, testing, and validation sets for processing. Some more corpus statistics can be seen in the table below.

TABLE I. CORPUS STATISTICS

Sr. No. | Content            | English | Assamese
1       | Sentences          | 140357  | 140357
2       | Mean               | 12.52   | 10.08
3       | Median             | 9.0     | 8.0
4       | Max Length         | 169     | 1302
5       | Memory Consumption | 2.24 MB | 2.24 MB
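As an illustration of this preprocessing step, the sketch below shows one way a parallel corpus of this kind can be divided into training, validation, and test sets; the file names and the 80/10/10 ratio are assumptions for illustration and are not the exact split used in this work.

# Illustrative train/valid/test split of a parallel corpus (file names and ratios assumed).
import random

with open("parallel.en", encoding="utf-8") as f_en, open("parallel.as", encoding="utf-8") as f_as:
    pairs = list(zip(f_en.read().splitlines(), f_as.read().splitlines()))

random.seed(42)
random.shuffle(pairs)

n = len(pairs)
splits = {"train": pairs[: int(0.8 * n)],
          "valid": pairs[int(0.8 * n): int(0.9 * n)],
          "test":  pairs[int(0.9 * n):]}

for name, subset in splits.items():
    with open(f"{name}.en", "w", encoding="utf-8") as out_en, \
         open(f"{name}.as", "w", encoding="utf-8") as out_as:
        for en_line, as_line in subset:
            out_en.write(en_line + "\n")   # English side
            out_as.write(as_line + "\n")   # Assamese side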

B. Model proposal

A transformer [9] is a deep learning model that uses the self-attention technique to weigh the importance of each component of the incoming data differently. The Transformer [10] employs an encoder-decoder architecture with a multi-head attention layer and a feed-forward neural network layer in the encoder block; the decoder block has the same layers but additionally features a masked multi-head attention layer. Several identical encoders and decoders are stacked on top of one another to form the encoder and decoder blocks, and the number of units in the encoder stack and in the decoder stack is the same. Each encoder consists of self-attention and feed-forward layers, while each decoder is given an additional encoder-decoder attention layer; this helps the decoder concentrate on the relevant portions of the input sequence.

The whole architecture of this transformer model can be understood through the diagram attached below (a short sketch of the scaled dot-product self-attention at the core of each attention head is given after Fig. 4).

Fig. 4. Transformer architecture
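To make the self-attention operation concrete, the following is a minimal, illustrative single-head implementation of scaled dot-product attention; the matrices are dummy random values, and the single-head form is a simplification of the multi-head layers described above rather than our actual implementation.

# Scaled dot-product self-attention for a single head, illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (seq_len, d_k); attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                        # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape, attn.shape)                       # (5, 8) outputs, (5, 5) attention weights

# The decoder's masked attention uses a lower-triangular mask so that position i
# cannot attend to positions after i:
causal_mask = np.tril(np.ones((5, 5), dtype=bool))
out_masked, _ = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv, mask=causal_mask)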

C. Training

Fairseq [12] is used to train the transformer-based models. We use 4096 feed-forward dimensions, 1512 input embeddings with 16 attention heads, and 6 encoder and decoder layers. We optimize the label-smoothed cross-entropy loss using the Adam optimizer, label smoothing of 0.2, and gradient clipping of 1.0. With a starting learning rate of 3e-4 and 1000 warmup steps, we follow the learning rate annealing schedule suggested in the original transformer paper (an illustrative training invocation is sketched below).
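As a hedged sketch of what a corresponding fairseq training invocation might look like with the hyper-parameters above: the data-bin path, save directory, max-tokens budget, architecture name, embedding-size flags (omitted here), and the inverse-square-root scheduler choice are all assumptions for illustration, not a verbatim reproduction of our command.

# Hedged sketch of launching fairseq training (paths and some flag values are assumptions).
import subprocess

cmd = [
    "fairseq-train", "data-bin/en-as",            # binarized data from fairseq-preprocess (assumed path)
    "--arch", "transformer",
    "--encoder-layers", "6", "--decoder-layers", "6",
    "--encoder-attention-heads", "16", "--decoder-attention-heads", "16",
    "--encoder-ffn-embed-dim", "4096", "--decoder-ffn-embed-dim", "4096",
    "--optimizer", "adam",
    "--lr", "3e-4", "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "1000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.2",
    "--clip-norm", "1.0",
    "--max-tokens", "4096",                       # batch size in tokens (assumed value)
    "--save-dir", "checkpoints/en-as",            # assumed checkpoint directory
]
subprocess.run(cmd, check=True)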
D. Tokenization

Using the Indic NLP Library along with a few additional heuristics, we divided the text's primary material into sentences using Assamese punctuation characters, sentence delimiters, and non-breaking prefixes.

Using English-centric training data, we learn independent vocabularies for the English and Assamese languages with 8K BPE merge operations each, using subword-nmt.
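An illustrative sketch of this preprocessing with the Indic NLP Library and the subword-nmt command-line tool follows; the file names are assumptions, the Assamese text is a dummy example, and the additional heuristics mentioned above are omitted.

# Illustrative Assamese sentence splitting and tokenization with the Indic NLP Library,
# followed by learning an 8K BPE vocabulary with subword-nmt (file names assumed).
import subprocess
from indicnlp.tokenize import sentence_tokenize, indic_tokenize

raw = "অসমীয়া বাক্য এটা। আন এটা বাক্য।"          # dummy Assamese text
sentences = sentence_tokenize.sentence_split(raw, lang="as")
tokenized = [" ".join(indic_tokenize.trivial_tokenize(s, lang="as")) for s in sentences]

with open("train.tok.as", "w", encoding="utf-8") as f:
    f.write("\n".join(tokenized) + "\n")

# Learn and apply 8K BPE merges on the Assamese side (the English side is handled the same way).
with open("train.tok.as", encoding="utf-8") as fin, open("codes.as", "w", encoding="utf-8") as fout:
    subprocess.run(["subword-nmt", "learn-bpe", "-s", "8000"], stdin=fin, stdout=fout, check=True)
with open("train.tok.as", encoding="utf-8") as fin, open("train.bpe.as", "w", encoding="utf-8") as fout:
    subprocess.run(["subword-nmt", "apply-bpe", "-c", "codes.as"], stdin=fin, stdout=fout, check=True)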
E. Fairseq

The sequence modelling toolkit fairseq(-py) enables researchers and developers to train custom models for tasks including translation, summarization, language modelling, and other text generation. We make use of fairseq in our work to attain the desired results.

VI. EVALUATION AND RESULT

We use BLEU scores to evaluate the models, and we report BLEU to ensure model uniformity and reproducibility. For Assamese to English translation, we use the built-in default mteval-v13a tokenizer. Because the BLEU tokenizer does not support Indic languages, for English to Assamese we tokenize with the indicNLP [11] tokenizer before running BLEU.
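The sketch below illustrates these two evaluation paths; sacrebleu is an assumed tooling choice for the illustration, and the file names are placeholders.

# Hedged sketch of the two scoring paths described above (file names are assumptions).
import sacrebleu
from indicnlp.tokenize import indic_tokenize

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()

# Assamese -> English: score with the built-in mteval-v13a-compatible tokenizer.
hyp_en, ref_en = read_lines("hyp.en"), read_lines("ref.en")
print(sacrebleu.corpus_bleu(hyp_en, [ref_en], tokenize="13a").score)

# English -> Assamese: pre-tokenize with indicNLP, then disable sacrebleu's own tokenizer.
def indic_tok(lines):
    return [" ".join(indic_tokenize.trivial_tokenize(line, lang="as")) for line in lines]

hyp_as, ref_as = indic_tok(read_lines("hyp.as")), indic_tok(read_lines("ref.as"))
print(sacrebleu.corpus_bleu(hyp_as, [ref_as], tokenize="none").score)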

TABLE II. COMPARISON OF BLEU SCORES

MODEL     | English to Assamese | Assamese to English
NMT-2     | 10.03               | 13.10
Our Model | 7.6                 | 23.2

Asmlator outperforms NMT-2 in Assamese to English translation and remains competitive in English to Assamese translation, as can also be seen in the comparison chart below.

Fig. 5. Bar chart comparing the BLEU scores

Below we attach a table that shows how our English to Assamese model performed when we gave it certain inputs, together with the target output and the output predicted by the model.

TABLE III. ENGLISH TO ASSAMESE (OUR MODEL). THE TABLE SHOWS HOW OUR ENGLISH TO ASSAMESE MODEL FARED WHEN WE GAVE IT CERTAIN INPUTS, WHAT THE EXPECTED OUTPUT WAS, AND WHAT IT PREDICTED.

INPUT: However, they are not thought to have caused any damage to the shuttle.
TARGET OUTPUT: িয িক নহওক, িসহঁ েত াটলখনৰ এেকা িত কৰা নাই বুিল ভবা হেছ।
PREDICTED OUTPUT: িক ইয়াৰ ফলত মহাকাশযানেটাৰ কােনা িত হাৱা বুিল ভািবব নাৱািৰ।

INPUT: All I say to people is you treat us the way we treat you.
TARGET OUTPUT: সকেলা মানুহেক কাৱা হেছ আেপানােলাকক সৱা আগবঢ়াবৈল আমাক সৱা আগবঢ়াওক।
PREDICTED OUTPUT: মই লাকসকলক কৱল এইেটা কওঁ য, আিম আেপানােলাকৰ সেত যেনৈক আচৰণ কেৰাঁ, তেনৈক আেপানােলােক আমাক আচৰণ কৰা উিচত।

INPUT: However, the driver sustained serious injuries to the head.
TARGET OUTPUT: িয নহওঁক, াইভাৰজেন মূৰত ৰু পূণ আঘাত া কিৰিছল৷
PREDICTED OUTPUT: িক চালকজন ৰুতৰভােৱ আঘাত া হয়।

INPUT: The surface of the Moon is made of rocks and dust. The outer layer of the Moon is called the crust.
TARGET OUTPUT: চ ৰ পৃ ভূ িম িশল আৰু ধূিলেৰ গ ত। চ ৰ ওপৰ অংশক কাৱা হয় খালা।
PREDICTED OUTPUT: চাঁদৰ পৃ ভাগ িশলা আৰু ধূিলৰ াৰা িনিমত হয় আৰু চাঁদৰ বািহ ক ৰক ভূ ক বুিল কাৱা হয়।

INPUT: Cancellation policies vary, but as of late March most coronavirus-based cancellation policies don't extend to July 2020, when the Olympics had been scheduled.
TARGET OUTPUT: বািতলকৰণ নীিতসমূহৰ তাৰতম ঘেট িক মাচৰ শষৰফালৈল অিধকাং শ কৰ'ণা-ভাইৰাছ মূলৰ বািতলকৰণ নীিতসমূহ 2020 চনৰ জুলাইৈলেক স সাৰণ নঘ ব, যিতয়া অিলি কক অনুসূিচত কৰা হিছল৷
PREDICTED OUTPUT: বািতলকৰণৰ নীিত িভ হয়, িক মাচৰ শষত বিছভাগ ক’ৰ’না ভাইৰাছ-িভি ক বািতলকৰণ নীিত ২০২০ চনৰ জুলাইৈল বৃি নহয়ৈগ, যিতয়া অিলি কৰ সময়সূচী ঘাষণা কৰা হিছল।

TABLE IV. ASSAMESE TO ENGLISH (OUR MODEL)

INPUT: ধাৰণােটা চীনৰ পৰা আিহিছল য'ত নােদাকা ফু লসমূহ বাছিনৰ ফু ল আিছল৷
TARGET OUTPUT: The concept came from China where plum blossoms were the flower of choice.
PREDICTED OUTPUT: The idea came from China where the noodoca flowers were selected.

INPUT: ী মাছ হেছ ী ানসকলৰ ব িদনসমূহৰ এটা আৰু ইয়াক যী ৰ জ িদন িহচােপ পালন কৰা হয়।
TARGET OUTPUT: Christmas is one of the most important holidays of Christianity, and is celebrated as the birthday of Jesus.
PREDICTED OUTPUT: Christmas is one of the closed days of Christians and is celebrated as the birthday of Jesus Christ.

INPUT: কম ানত একতা অিত জৰুৰী য'ত ব ি গত ভােৱ সংশা কৰাৰ পিৰবেত দলীয় য়াসক বিছ মহ িদয়া হয়।
TARGET OUTPUT: Workplace harmony is crucial, emphasizing group effort rather than praising individual accomplishments.
PREDICTED OUTPUT: Unity in the workplace is very important where party effort is given more importance than individual praise.

INPUT: ই াৰেনেট সামূিহক আৰু পাৰ িৰক দুেয়ািবধ স কৰ ত যাগ কেৰ।
TARGET OUTPUT: The Internet combines elements of both mass and interpersonal communication.
PREDICTED OUTPUT: The internet adds the philosophy of collective and inter-connectedness.

INPUT: 1920 দশকত সৰহভাগ দশ আৰু নাগিৰকৰ ভাৱ-ভংগী শাি বাদ আৰু একা আিছল।
TARGET OUTPUT: During the 1920s, the prevailing attitudes of most citizens and nations was that of pacifism and isolation.
PREDICTED OUTPUT: In the 1920s, the mood of the majority of the nation and its citizens was pacifism and privacy.

VII. CONCLUSION

Asmlator aims to provide a medium in a world where a low-resource language is seen as a barrier between individuals expressing their thoughts. We propose a solution which provides acceptable work in the space of low-resource neural machine translation, in the hope of enabling better expression of thoughts between individuals.

Our work may not be at the pinnacle of what is available out there, but we hope to build upon it with time, an increase in computational efficiency, and a better approach to building datasets for low-resource languages.

ACKNOWLEDGMENT

I would like to thank the other authors, Tarun Aditya Kusupati and Akshat Srivastava, for their contribution and efforts in making this happen, as well as our Professor Mr. Basab Nath for constantly guiding us throughout the work.

REFERENCES
[1] B. Nath, S. Sarkar, S. Das, Development of Neural Machine Translator for English-Assamese language pair, Springer Link, India, 2021.
[2] R. Baruah, R. K. Mundotiya, A. K. Singh, Low Resource Neural Machine Translation: Assamese to/from other Indo-Aryan (Indic) Languages, IIT BHU, India, November 2021.
[3] S. Srivastava, R. Tiwari, Self-Attention based end-to-end Hindi-English Neural Machine Translation, September 2019.
[4] S. R. Laskar, P. Pakray, S. Bandyopadhyay, Neural Machine Translation for low resource Assamese-English, book chapter, April 2021.
[5] [Online]. Available: https://inside-machinelearning.com/en/encoder-decoder-what-and-why-simple-explanation/
[6] [Online]. Available: https://blog.floydhub.com/attention-mechanism/#bahdanau-att-step3/
[7] [Online]. Available: https://www.baeldung.com/cs/nlp-encoder-decoder-models
[8] K. Papineni, S. Roukos, T. Ward, W. J. Zhu, BLEU: A Method for Automatic Evaluation of Machine Translation, July 2002.
[9] [Online]. Available: https://theaisummer.com/transformer/
[10] [Online]. Available: https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
[11] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar, IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages, November 2020.
[12] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, M. Auli, FAIRSEQ: A Fast, Extensible Toolkit for Sequence Modeling, April 2019.

