
Expert Systems With Applications 238 (2024) 121813


Neural machine translation systems for English to Khasi: A case study of an Austroasiatic language

Aiusha Vellintihun Hujon a,c,∗, Thoudam Doren Singh b, Khwairakpam Amitab a

a Department of Information Technology, North Eastern Hill University, Shillong, India
b Department of Computer Science and Engineering, National Institute of Technology Silchar, India
c Department of Computer Science, St. Anthony's College, Shillong, India

ARTICLE INFO

Keywords:
Neural machine translation
Khasi
Low-resource language
Austroasiatic language
Long short-term memory
Gated recurrent unit
Transformer
Transfer learning

ABSTRACT

Neural machine translation has predominantly outperformed previous machine translation models, primarily for resourceful languages. However, very little work has been reported for resource-constrained languages such as Khasi. The Khasi language belongs to the Mon-Khmer branch of the Austroasiatic language family and is spoken primarily in the state of Meghalaya in India. Although performing neural machine translation for the under-resourced Khasi language is difficult, we build a substantial parallel corpus of English–Khasi. We apply three segmentation methods in the datasets for our experiments: untokenized, tokenized and subword BPE (Byte Pair Encoding). Experiments are carried out on this dataset with different aspects of neural machine translation systems using cutting-edge architectures such as LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) and a transformer-based model for the English–Khasi language pair. We also carry out experiments by adapting the transfer learning approach using English-Vietnamese as the parent language pair and English-Khasi as the child language pair. This work reports a quantitative and qualitative analysis of several models based on architectural and data segmentation methodologies. The experimental findings show that the model adapted using the transfer learning approach achieved a reasonable improvement in BLEU scores, with the highest being 58.1 BLEU on the similar domain and 17.7 BLEU for the general domain, outscoring the other models for the same language pair. Qualitative analysis is carried out focusing on the morphological inflections of gender identification in the translated output.

1. Introduction

Language barriers have been one of the most crucial issues for communication across different sections of speakers. Early machine translation systems employed a variety of methods, including direct translation (Bharati, Chaitanya, Kulkarni, & Sangal, 2003), transfer-based translation (Krishnamurthy, 2015), and interlingua (Dave, Parikh, & Bhattacharyya, 2001) to translate source languages into target languages. Rule-based machine translation (Centelles & Costa-jussà, 2014; Forcada et al., 2011), example-based machine translation (Singh & Bandyopadhyay, 2010a), statistical machine translation (Bojar et al., 2014; Singh & Bandyopadhyay, 2010b) and, more recently, neural machine translation (NMT) (Sennrich, Haddow, & Birch, 2016; Sutskever, Vinyals, & Le, 2014) have all been developed as a result of the development of artificial intelligence. Despite tremendous progress in performance, many research problems still need to be solved, since neural machine translation requires a large parallel corpus.

Machine translation for many Indian languages has made significant progress over the years. Some Austroasiatic languages, like Vietnamese and Khmer, have also made some progress in machine translation. However, few works have been reported regarding low-resource languages such as Khasi. Khasi belongs to the Mon-Khmer branch of the Austroasiatic language family. The morphology of the Khasi language is quite simple, where most roots can occur as words without any modification, and morphological functions are carried out by syntactic categories (Nagaraja, 1985). The word order is Subject Verb Object (SVO), similar to English. The Khasi language is also a gender-based language (Rabel-Heymann, 1977). It is primarily spoken by the indigenous people of Meghalaya in northeastern India. The Khasi alphabet uses the Latin script. A substantial corpus is the basic requirement for machine translation to be feasible. At present, there is no corpus publicly available for the Khasi language. Few translated books exist, which need to be manually digitized and aligned to create a parallel corpus. Thus, performing state-of-the-art machine translation for a resource-constrained language such as Khasi is challenging. Hence, we decided to take up this task and build a machine translation system for English to Khasi.

∗ Corresponding author at: Department of Information Technology, North Eastern Hill University, Shillong, India.
E-mail addresses: avhujon@gmail.com (A.V. Hujon), thoudam.doren@gmail.com (T.D. Singh), khamitab@gmail.com (K. Amitab).

https://doi.org/10.1016/j.eswa.2023.121813
Received 25 May 2023; Received in revised form 11 September 2023; Accepted 22 September 2023
Available online 28 September 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.

This paper aims to present the improvement made in the performance of NMT on the language pair English-Khasi. Initially, we build our parallel corpora on the English-Khasi language pair and perform experiments using different models based on the encoder–decoder framework (Sutskever et al., 2014). We present three types of NMT: Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and transformer architectures. We also built two models using transfer learning approaches. The models are trained and validated on different datasets. We process the datasets using three segmentation approaches: untokenized, tokenized, and subword byte pair encoding. Compared to the existing works on the same language pair (Hujon, Singh, & Amitab, 2023; Singh & Hujon, 2020; Thabah & Purkayastha, 2021), the results of our experiments are very encouraging. Quantitative and qualitative results are analyzed based on the architecture and data segmentation approaches. We incorporate automatic, statistical and human evaluations for quantitative evaluation. For automatic scoring, we use BLEU (Bilingual Evaluation Understudy) (Papineni, Roukos, Ward, & Zhu, 2002), ChrF2 (character n-gram F-score for automatic MT evaluation), TER (Translation Error Rate) (Post, 2018), and RIBES (Rank-based Intuitive Bilingual Evaluation Score) (Isozaki, Hirao, Duh, Sudoh, & Tsukada, 2010). Precision, recall, and F1-measure are used for statistical evaluations (Koehn, 2009). Adequacy and fluency are used as metrics for human judgment. Qualitative evaluation and analysis of the results of the models are studied, thoroughly analyzed, and discussed. The morphological inflection on gender in the predicted text of our models is also analyzed and discussed.

The significant contributions of this paper are:

1. A sizeable parallel corpus is built for English-Khasi, which is manually digitized and aligned.
2. An empirical study is conducted using these datasets, where various experiments are performed on three state-of-the-art architectures: LSTM, GRU, and Transformer.
3. A transfer learning approach is adapted to build models using English-Vietnamese as the parent language pair, with improved results.
4. Quantitative and qualitative analyses of the output of the models are also carried out.
5. Gender-based morphological inflection in the output of the models is also analyzed.

The paper is organized as follows. The related work is discussed in Section 2 and the research methods are described in Section 3. Further, the experimental setup is explained in Section 4 and the experimental results of the various models are discussed in Section 5. Gender-based morphological inflection in the output of the models is also analyzed in Section 5.2.5, followed by the conclusion in Section 6.

2. Related works

Machine translation accuracy for various languages has improved since the introduction of the concept of NMT (Bahdanau, Cho, & Bengio, 2014). NMT achieved another milestone with the transformer architecture. The transformer model has proven to be the cutting edge of machine translation since its introduction in 2017 (Vaswani et al., 2017). The transformer models are trained using the WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs and the WMT 2014 English-French dataset consisting of 36M sentences. The models achieved a 28.4 BLEU score on the WMT 2014 English-German translation task and a 41.0 BLEU score on the English-French translation task. An effective approach for rare words and subword units (Sennrich et al., 2016) has paved the way for making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. The method was experimented on the WMT 2015 English-German training dataset, which consists of 4.2 million sentence pairs, and on the English-Russian training dataset, which consists of 2.6 million sentence pairs. The models achieved a BLEU score of 25.3 for English-German and 24.1 for English-Russian. Segmentation of words using subword units such as BPE, a compression algorithm, has proven a very effective way of taking care of OOV words and reducing the size of a vocabulary, thus gaining the attention of many researchers. One such report (Li, Jiang, Yangji, & Ma, 2021), which performed experiments on Tibetan–Chinese language pairs, has shown a BLEU score of 35.12 for subword-syllable and 35.56 for subword-character, which is higher compared to its transformer model's score of 34.55.

Many Indian languages have advanced in the field of machine translation. One such report is the NMT systems for Indian languages using the transformer architecture in the WMT20 Shared Task for Hindi-Marathi and Marathi-Hindi (Kumar, Baruah, Mundotiya, & Singh, 2020). The evaluation is done using BLEU, RIBES (Rank-based Intuitive Bilingual Evaluation Score), and TER (Translation Error Rate). The scores reported in this paper are 20.72 for BLEU, 64.46 for RIBES and 71.04 for TER for Marathi-Hindi, and 12.5 for BLEU, 58.66 for RIBES and 76.86 for TER for Hindi-Marathi. A recent study was reported on Dravidian languages (Hegde, Gashaw, & H.l., 2021) where the NMT model is applied in translating the English-Tamil, English-Telugu, English-Malayalam, and Tamil-Telugu language pairs using a stacked LSTM architecture. The BLEU scores reported are 1.66 for English-Tamil, 0.29 for English-Telugu, 0.48 for English-Malayalam and 0.43 for Tamil-Telugu. An encoder–decoder model, which is pre-trained using monolingual corpora, was reported (Imamura & Sumita, 2021) to improve the accuracy of translating low-resource language pairs. Substantial studies of neural machine translation of the Manipuri language, a low-resource language, using unsupervised (Singh & Singh, 2020), semi-supervised (Singh & Singh, 2022b), and multilingual approaches (Singh & Singh, 2022a) have recently been reported.

A few works have been reported for Austroasiatic languages. Among these are Vietnamese (Phan-Vu, Nguyen, Tran, & Do, 2017; Phan-Vu, Tran, Dang, Do, et al., 2019) and Khmer (Marie et al., 2019). The work reported on Vietnamese (Nguyen, Vo, Shin, Tran, & Ock, 2019) pre-processes the datasets by applying a word segmentation method to Vietnamese texts and morphological analysis and word-sense disambiguation to Korean texts. A bi-directional Korean-Vietnamese NMT system is built using the Korean-Vietnamese language pair on a parallel corpus of 454,751 sentences. The models achieved BLEU scores of 27.79 for Korean-Vietnamese and 25.44 for Vietnamese-Korean. The subword approach using BPE reported for the English-Vietnamese (Phan-Vu et al., 2017) language pair on its word-piece model achieved a state-of-the-art result of a 26.58 BLEU score.

Transfer learning has shown promising results for low-resource language pairs. A report (Zoph, Yuret, May, & Knight, 2016) implemented the transfer learning approach by using a single parent model to improve the translation accuracy of various languages. All the child models are initialized with knowledge learned from the same parent model, which is trained on the French-English dataset. The word embeddings of the child source languages (Hausa, Turkish, Uzbek, and Urdu) are initialized by mapping to random weights of the source language (French) of the parent model, while the word embeddings of the target language are kept the same as the parent model. The experiment showed a significant translation accuracy. Another report (Nguyen & Chiang, 2017) uses a byte pair encoding technique and a transliteration technique, while the method is similar to Zoph et al. (2016). The results have shown an improvement in BLEU scores. The study shows that transfer learning is more effective when the languages of the parent and child models are related to one another. The transfer learning approach can also be simplified, as shown in a report (Kocmi & Bojar, 2018). It is noticed in the report that transfer learning is helpful in improving the translation accuracy of languages that are not even closely related. The transformer is used as the backbone architecture. The vocabulary is created using an equal number of sentences from the parent and child language pairs. After the parent has converged, the child model is trained from the parent model by switching the parent corpus to the child corpus. The result shows a significant improvement in BLEU scores.


In the case of the Khasi language, a recent work using transfer learning (Hujon, Singh, & Amitab, 2023), with English-French as the language pair of the parent model and English-Khasi for the child model, was reported with a BLEU score of 51.11. One of the previous works, the statistical and neural machine translations performed on the same language pair, was also reported (Singh & Hujon, 2020) with significant outcomes. Another work on a similar language pair using a transformer (Thabah & Purkayastha, 2021) with BPE was reported with a BLEU score of 39.63 on the similar domain and 5.33 on the general domain. A convolutional sequence-to-sequence NMT has recently been reported on the same language pair with a BLEU score of 37.7 (Hujon, Amitab, & Singh, 2023).

3. Research methods

Fig. 1. Block diagram of the LSTM recurrent network cell (Goodfellow et al., 2016).

Translation between languages has never been straightforward, given the linguistic intricacies of languages. Linguistic features of one language may differ from those of another, so that a phrase in one language may not have the same number of words in another. This problem can be solved using the sequence-to-sequence model (Sutskever et al., 2014) based on the concept of encoder and decoder. The sequence-to-sequence model consists of the encoder, the encoded vector, and the decoder. An encoder consists of a stack of many recurrent units, which can be any RNN. Each unit accepts a single input element from the input sequence. It collects information for that element and propagates it forward. The last hidden layer contains the encoded vector passed on to the decoder. The decoder uses the encoded vector to predict the most probable output. We perform our experiments on three types of architecture: the LSTM, GRU, and transformer architectures. Three of our models used LSTM, two models used GRU, and another three used the transformer. The RNN applied for our baseline model is the Long Short-Term Memory (LSTM). Fig. 2 shows the encoder–decoder framework used in our experiments. We also experiment with subword byte pair encoding (Sennrich et al., 2016), which is applied to the dataset as an initial step for machine translation.

3.1. LSTM

Recurrent Neural Networks (RNN) cannot translate long sentences accurately because they suffer from short-term memory. RNN also suffers from a vanishing gradient problem. The performance degrades with very long sentences, as it needs to retain memory for words it learned much earlier in lengthy sentences. LSTM and the Gated Recurrent Unit (GRU) (Cho, van Merriënboer, Bahdanau, & Bengio, 2014) are designed to solve this problem. The LSTM architecture (Hochreiter & Schmidhuber, 1997) can decide which words to retain in memory and which to forget. The LSTM has an internal mechanism consisting of the input gate, the forget gate, and the output gate, which control the flow of information, and a cell state. These three gates determine what relevant data in the information sequence to pass through and which ones to discard or forget. The cell state transfers relative information down the sequence chain from the earlier time steps to later time steps, thereby reducing the effects of short-term memory throughout the processing of the sequence. The communication between the different gates (Goodfellow, Bengio, & Courville, 2016) is shown in Fig. 1. In order to update, discard, or forget data, the input gate, the forget gate, and the output gate contain sigmoid activations, which use values between 0 and 1. If the information is closer to 0, it implies that it is to be forgotten, and if it is closer to 1, it is to be kept. For time step t and cell i, the forget gate f_i^{(t)} given in Eq. (1) sets the weight to a value between 0 and 1 through the sigmoid unit,

f_i^{(t)} = \sigma\big( b_i^{(f)} + \sum_j U_{i,j}^{(f)} x_j^{(t)} + \sum_j W_{i,j}^{(f)} h_j^{(t-1)} \big)    (1)

where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, which contains the outputs of all the LSTM cells, and b^f, U^f, W^f are the biases, input weights, and recurrent weights, respectively, for the forget gates. The internal state of the LSTM is updated as in Eq. (2) with a conditional self-loop weight f_i^{(t)}:

s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\big( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \big)    (2)

where b, U, and W denote the biases, input weights, and recurrent weights, respectively, of the LSTM cell. The external input gate unit g_i^{(t)} in Eq. (3), using its own parameters, is computed similarly to the forget gate:

g_i^{(t)} = \sigma\big( b_i^{g} + \sum_j U_{i,j}^{g} x_j^{(t)} + \sum_j W_{i,j}^{g} h_j^{(t-1)} \big)    (3)

Using a sigmoid unit for gating, the output h_i^{(t)} in Eq. (4) of the LSTM cell can be shut off using the output gate q_i^{(t)} as in Eq. (5):

h_i^{(t)} = \tanh\big( s_i^{(t)} \big)\, q_i^{(t)}    (4)

q_i^{(t)} = \sigma\big( b_i^{o} + \sum_j U_{i,j}^{o} x_j^{(t)} + \sum_j W_{i,j}^{o} h_j^{(t-1)} \big)    (5)

The parameters b^o, U^o, and W^o are, respectively, its biases, input weights, and recurrent weights. Thus an LSTM layer consists of a vector of LSTM cells. As a recurrent neural network, the input is received from the previous layer, and the output value is received from the previous time step t-1. The memory state m is updated from the input state i and the earlier value of the memory state m_{t-1}. Various gates transmit the information flow in the cell towards the output value o.
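To make the gate computations concrete, the following is a minimal numpy sketch of one LSTM cell update following Eqs. (1)-(5). The parameter names are illustrative only; OpenNMT-py's actual LSTM implementation is organized differently.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM cell update following Eqs. (1)-(5).
    p holds the biases (b*), input weights (U*) and recurrent weights (W*)."""
    f = sigmoid(p["bf"] + p["Uf"] @ x_t + p["Wf"] @ h_prev)   # forget gate, Eq. (1)
    g = sigmoid(p["bg"] + p["Ug"] @ x_t + p["Wg"] @ h_prev)   # external input gate, Eq. (3)
    q = sigmoid(p["bo"] + p["Uo"] @ x_t + p["Wo"] @ h_prev)   # output gate, Eq. (5)
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)  # internal state, Eq. (2)
    h = np.tanh(s) * q                                        # cell output, Eq. (4)
    return h, s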

Fig. 2. Encoder–decoder for neural machine translation from English to Khasi.

3.2. GRU

GRU differs slightly from LSTM in that it uses only two gates, the update gate and the reset gate, to solve the vanishing gradient problem of the RNN. The update gate decides what information it needs to pass to the next time step, and the reset gate decides how much information to forget. A single gating unit can decide whether to control the forgetting factor or to update the state unit. The update equations are given as in Eq. (6), where u stands for the "update" gate and r for the "reset" gate, and their values are computed as in Eq. (7) and Eq. (8), respectively.

h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \big(1 - u_i^{(t-1)}\big)\, \sigma\big( b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \big)    (6)

u_i^{(t)} = \sigma\big( b_i^{u} + \sum_j U_{i,j}^{u} x_j^{(t)} + \sum_j W_{i,j}^{u} h_j^{(t)} \big)    (7)

r_i^{(t)} = \sigma\big( b_i^{r} + \sum_j U_{i,j}^{r} x_j^{(t)} + \sum_j W_{i,j}^{r} h_j^{(t)} \big)    (8)

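A corresponding numpy sketch of a single GRU update following Eqs. (6)-(8); as above, the parameter names are illustrative only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_prev, h_prev, p):
    """One GRU update following Eqs. (6)-(8): compute the update gate u and the
    reset gate r, then mix the previous state with the gated candidate state."""
    u = sigmoid(p["bu"] + p["Uu"] @ x_prev + p["Wu"] @ h_prev)            # Eq. (7)
    r = sigmoid(p["br"] + p["Ur"] @ x_prev + p["Wr"] @ h_prev)            # Eq. (8)
    candidate = sigmoid(p["b"] + p["U"] @ x_prev + p["W"] @ (r * h_prev))
    return u * h_prev + (1.0 - u) * candidate                             # Eq. (6)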
3.3. Transformer

The transformer architecture illustrated in Fig. 3, introduced by Vaswani et al. (2017), is capable of producing state-of-the-art machine translation using the self-attention mechanism. Self-attention in the transformer handles the sense of sentences in a language using parallel processing. The six layers of the original transformer model are stacked together so that the output of layer l is the input of layer l + 1. The encoder and the decoder each have six layers. At the high level, the encoder maps an incoming input sequence into a continuous abstract representation that holds the current input sequence's learned information. The decoder takes the continuous representation and generates a single output step-by-step.

Fig. 3. Architecture of transformer model (Vaswani et al., 2017).

The first layer of the encoder feeds input into a word embedding layer. Since the transformer does not have recurrence as an RNN does, positional information is added to the input embedding. This is accomplished by two vectors, using a cos function as in Eq. (9)

PE_{(pos,\,2i+1)} = \cos\big( pos / 10000^{2i/d_{model}} \big)    (9)

on every odd index of the input vector, and a sin function as in Eq. (10)

PE_{(pos,\,2i)} = \sin\big( pos / 10000^{2i/d_{model}} \big)    (10)
for every even index of the input vector. This vector is fed to the multi-attention sub-layer of the first layer of the encoder. Before self-attention, the multi-headed attention computation is performed by splitting the query, the key, and the value into N vectors. These vectors individually go through the self-attention process, which is called a head, and each produces an output vector that is concatenated into a single vector before it goes through the final linear layer. The multi-head attention sub-layer contains eight heads, followed by a normalization layer that adds residual connections to the sub-layer's output and normalizes it. A matrix z_i with a shape of x * d_k is the output of each head. Z, as defined in Eq. (11), is the output of a multi-attention head.

Z = (z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7)    (11)

The elements of Z are concatenated before exiting the multi-head attention sub-layer as in Eq. (12) and Eq. (13)

Multi-Head(output) = Concat(z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7) = (x, d_{model})    (12)

where d_{model} is the dimension (typically 512) and x is a sequence of words.

Attention(Q, K, V) = softmax\big( QK^T / \sqrt{d_K} \big) V    (13)

A residual connection is achieved by adding the output vector of the multi-headed attention and the original positional input embedding, which goes through a normalization layer and gets projected through a point-wise feed-forward network for further processing. The point-wise feed-forward network uses rectified linear unit (ReLU) activation between the two linear layers of the network. The encoder, to encode information, can be stacked N number of times so that each layer can learn different attention representations and boost the predictive power of the transformer network.

The decoder has two multi-headed attention layers, a point-wise feed-forward layer, residual connections, and layer normalization after each sub-layer. Like the encoder, the input to the decoder also goes through an embedding layer followed by a positional encoding layer to get the positional embedding, and is then fed into the first multi-head attention layer. The multi-head attention layer computes the attention scores for the decoder's input. The decoder is auto-regressive; thus, generating the word-by-word sequence requires a look-ahead mask to prevent it from conditioning on future tokens. A softmax is applied to the masked scores, which zeroes out the negative infinities, thus leaving zero attention scores for future tokens and preventing the model from focusing on those words.

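The sinusoidal encoding of Eqs. (9)-(10) and the scaled dot-product attention of Eq. (13) can be sketched in a few lines of numpy (single head, no batching; d_model is assumed even):

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sine on even indices (Eq. (10)),
    cosine on odd indices (Eq. (9))."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def attention(Q, K, V):
    """Scaled dot-product attention of Eq. (13) for a single head."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V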

Table 1
Applying subword BPE to a sample English vocabulary.

Sample English words          Frequencies of chars      Merging the most frequent byte pairs
Words          Frequency      Token   Frequency         Token   Frequency
be⟨/w⟩         4807           ⟨/w⟩    11                ⟨/w⟩    11
him⟨/w⟩        4488           b       4807              b       4807-4807 = 0
himself⟨/w⟩    385            e       6008              e       6008-4807 = 1201
great⟨/w⟩      410            h       4873              h       4873
greatness⟨/w⟩  44             i       4886              i       4886
greater⟨/w⟩    89             m       4873              m       4873
greatest⟨/w⟩   37             s       510               s       510
all⟨/w⟩        4142           l       8801              l       8801-8416 = 385
allowed⟨/w⟩    48             f       385               f       385
alliance⟨/w⟩   13             g       580               g       580
allotted⟨/w⟩   5              r       580               r       580
                              a       4788              a       4788-4208 = 580
                              t       590               t       590
                              n       57                n       57
                                                        be      4807
                                                        all     4208

The masked output vector contains the information on how the model attends to the input of the decoder, and it becomes the output of the first multi-headed attention. The input to the second multi-headed attention layer is the output of the encoder. For further processing, the output of the second multi-headed attention goes to the final linear layer, which behaves as a classifier, through a point-wise feed-forward layer. The classifier generates an output fed into a softmax layer. The softmax layer produces probability scores between 0 and 1, and the predicted word is the one at the index of the highest probability score. The decoder takes the output and adds it to the list of decoder inputs. This process of decoding continues till an end token is predicted. The decoder is also stacked N layers deep. This enables the model to learn to extract and focus on different combinations of attention from its attention heads, thus boosting the transformer's predictive power.

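The look-ahead masking used by the decoder can be illustrated with a small numpy sketch: future positions receive -inf before the softmax, so their attention weights come out as zero.

import numpy as np

def masked_softmax(scores):
    """Apply the decoder's look-ahead mask to a (length x length) score matrix:
    positions to the right of the current token are set to -inf, so the softmax
    assigns them zero attention weight."""
    length = scores.shape[-1]
    future = np.triu(np.ones((length, length), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)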
3.4. Subword BPE


Tokenizing the datasets has shown its effectiveness in translations using various models. In order to reduce memory and effort, a more effective method is required to deal with a potentially huge vocabulary by using a finite list of known words. This is the main idea behind subword tokenization. Many compound words in Khasi start with a subword like 'jing', such as 'jingshisha', 'jingbha', 'jingkyrmen', or the subword 'ka', which can occur as a determiner or as a subword such as in 'kaba', 'kata', 'kane', 'katai', for which subword tokenization can play a major role in improving the accuracy of the translations. One popular algorithm for the subword tokenization approach is BPE. It was originally applied as a method to compress data by finding the common byte pair combinations to represent frequently occurring texts. The same algorithm can be applied as a subword tokenization technique. The main objective of the subword BPE algorithm is to represent the entire text dataset with the least amount of tokens.

The algorithm (Sennrich et al., 2016) scans through each character in the text with the goal of finding the character pairs with the highest frequency of occurrence. Once these pairs are found, the next step is to merge them. Thus, the most common words will be represented in the new vocabulary as a single token, whereas the less common words are broken down into two or more subword tokens. In order to achieve this goal, the algorithm checks through every potential option at each iteration step and picks the tokens to merge based on the highest frequency. This is repeated until no more pairs can be found.

Table 1 shows a sample of the words and character frequencies, and BPE is applied by merging the most frequent byte pair in the sample English vocabulary. Here, after a few iterations, we have be as the most frequent byte pair, and the tokens 'b' and 'e' are merged together; then, it is added into the vocabulary as a single token, and the frequencies of the tokens 'b' and 'e' are reduced accordingly. Another frequent byte pair is all, and the tokens 'a' and 'l' are merged to form a single token 'all'; the frequencies of 'a' as well as 'l' are reduced accordingly, as shown. Similarly, for the Khasi vocabulary, Table 2 shows a sample of the words and character frequencies, and BPE is applied by merging the most frequent byte pair in the sample Khasi vocabulary. The byte pair ka is found to be the most frequent pair, so it is merged as a single token; at the same time, the frequencies of 'k' and 'a' are updated accordingly, as shown in Table 2. After updating, the frequency of the token k becomes zero, which implies that it can now be omitted from the vocabulary. In this way, many iterations are applied until no more tokens can be merged.

The original English vocabulary consists of 129791 tokens, and the Khasi vocabulary consists of 82750 tokens. After the BPE is applied, a joint vocabulary is created having 126506 tokens, which is 86035 fewer than simply adding the original vocabularies together (212541 tokens).

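The merge loop described above can be sketched as follows; the toy vocabulary reuses a few of the counts from Table 1, and the actual experiments use subword-nmt with 32000 merge operations instead.

import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs over a {'b e </w>': frequency} vocabulary."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    """Rewrite every word so that the chosen pair becomes a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy vocabulary taken from a few rows of Table 1
vocab = {"b e </w>": 4807, "h i m </w>": 4488, "a l l </w>": 4142, "a l l o w e d </w>": 48}
for _ in range(3):                             # a handful of merge operations
    counts = pair_counts(vocab)
    if not counts:
        break
    best = max(counts, key=counts.get)         # most frequent byte pair, e.g. ('b', 'e')
    vocab = merge_pair(best, vocab)
print(vocab)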

Table 2
Applying subword BPE to a sample Khasi vocabulary.

Sample Khasi words            Frequencies of chars      Merging the most frequent byte pairs
Words          Frequency      Token   Frequency         Token   Frequency
⟨/w⟩           11             ⟨/w⟩    11                ⟨/w⟩    11
soh⟨/w⟩        193            s       1032              s       1032
sohshkor⟨/w⟩   11             o       873               o       873
kaba⟨/w⟩       11044          h       1264              h       1264
kata⟨/w⟩       2583           k       15952             k       15952-15952 = 0
kane⟨/w⟩       2325           r       11                r       11
sohshnong⟨/w⟩  70             a       29959             a       29959-15952 = 14007
shnong⟨/w⟩     529            b       11044             b       11044
jingshai⟨/w⟩   232            t       3166              t       3166
jingieit⟨/w⟩   246            n       4001              n       4001
ieit⟨/w⟩       337            e       2908              e       2908
shai⟨/w⟩       148            g       1077              g       1077
                              j       478               j       478
                              i       2024              i       2024
                              t       583               t       583
                                                        ka      15952

3.5. Transfer learning

The transfer learning approach is widely used in deep learning and has been particularly successful in various fields, such as computer vision, natural language processing, and speech recognition. In machine translation, it is used to improve the performance of neural machine translation (NMT) systems. The transfer learning approach is useful in machine translation tasks, particularly for low-resource language pairs, which can benefit from the knowledge learned from the parent model. Translation accuracy can be significantly improved using parent models trained with parallel corpora of high-resource languages. As part of our case study on NMT of Austroasiatic languages, we decided to use Vietnamese as one of the parent languages, since it has similar characteristics to the Khasi language. Vietnamese, a language in the same branch as Khasi in the classification of the Austroasiatic languages, can be an advantage if chosen as the language for the parent model in transfer learning. The transfer learning approach applied in our experiment is shown in Fig. 4. Vietnamese also uses a similar script, the Latin script, and it follows the same word order as the Khasi language, SVO. The parent model is trained with the English-Vietnamese parallel dataset (Doan, Nguyen, Tran, Hoang, & Nguyen, 2021), and the child model is trained with the English-Khasi parallel datasets. A vocabulary of three languages, English, Vietnamese, and Khasi, is built and shared by both the parent model and the child model. The initialization method implemented is similar to Hujon, Singh, and Amitab (2023); however, our approach differs, as we consider the linguistic relationship between the languages of the parent model and the child model. Therefore, the language pair for the parent model is English-Vietnamese, and the backbone architecture applied is the transformer instead of the LSTM architecture.

Fig. 4. A schematic diagram of the transfer learning model.

4. Experiment setup

An empirical study is conducted on three types of architectures, the LSTM, the GRU, and the transformer, using different word segmentation approaches to increase the accuracy and performance of NMT in comparison to previous works.

The steps for preparing the datasets and building the NMT models are shown in Algorithm 1. The meaning of each term mentioned in the algorithm is as follows: DATA_untk, DATA_tk, and DATA_sbpe refer to the En-Kh dataset with different word segmentations (untokenized, tokenized using Mosesdecoder, and subword BPE, respectively); EVData_tk and EVData_sbpe are the tokenized En-Vi dataset and the subword BPE En-Vi dataset, respectively; TDATA1_untk and TDATA2_untk are the untokenized test data for the similar domain and the general domain, respectively; TDATA1_tk and TDATA2_tk are the tokenized test data for the similar domain and the general domain, respectively; TDATA1_sbpe and TDATA2_sbpe are the subword BPE test data for the similar domain and the general domain, respectively. The different models are referred to as LSTM_un for the LSTM model with the untokenized dataset, LSTM_tk for the LSTM model with the tokenized dataset, LSTM_sbpe for the LSTM model with the subword BPE dataset, GRU_tk for the GRU model with the tokenized dataset, GRU_sbpe for the GRU model with the subword BPE dataset, TRF_tk for the transformer model with the tokenized dataset, TRF_sbpe for the transformer model with the subword BPE dataset, TRF_sbpe_tk for the transformer model which is trained using the subword BPE dataset and is used to translate the tokenized dataset, TL_tk for the transfer learning model trained with the tokenized datasets, and TL_sbpe for the transfer learning model with the subword BPE dataset.

Algorithm 1 Building NMT models
1:  Input: EnglishKhasi.Kh, EnglishKhasi.En, EnglishVietnamese.En, EnglishVietnamese.Vi    ▷ EnglishKhasi.X and EnglishVietnamese.X denote the parallel datasets
2:  Output: DATA_untk, DATA_tk, DATA_sbpe, EVData_tk, EVData_sbpe, TDATA1_untk, TDATA2_untk, TDATA1_tk, TDATA2_tk, TDATA1_sbpe, TDATA2_sbpe, LSTM_untk, LSTM_tk, LSTM_sbpe, GRU_tk, GRU_sbpe, TRF_tk, TRF_sbpe, TRF_sbpe_tk, TL_tk, TL_sbpe
3:  procedure DataPreProcess(EnglishKhasi.Kh, EnglishKhasi.En, EnglishVietnamese.Vi, EnglishVietnamese.En)
4:      Preprocess the datasets using different word segmentations
5:      DATA_untk ← untokenized datasets
6:      Read DATA_untk
7:      Process word segmentation on the dataset using the Mosesdecoder toolkit
8:      DATA_tk ← tokenized datasets
9:      Read DATA_tk
10:     Process word segmentation using subword Byte Pair Encoding (BPE)
11:     DATA_sbpe ← subword BPE datasets
12:     EVData_sbpe ← subword BPE datasets
13:     L = {DATA_untk, DATA_tk, DATA_sbpe}
14:     for T ∈ L do    ▷ Test data on two different domains
15:         TDATA1_T ← same-domain test data
16:         TDATA2_T ← general-domain test data
17:     end for
18: end procedure
19: procedure BuildNMTModels    ▷ Building different models using LSTM, GRU, Transformer, and Transfer Learning
20:     Read EnglishKhasi.Kh, EnglishKhasi.En
        DataPreProcess(EnglishKhasi.Kh, EnglishKhasi.En, EnglishVietnamese.En, EnglishVietnamese.Vi)    ▷ calling the procedure for preprocessing the dataset
21:     Train the baseline model using LSTM on DATA_untk → LSTM_untk
22:     MODEL_OUTPUT_untk_TDATA1 ← MODEL_untk translate TDATA1_untk
23:     MODEL_OUTPUT_untk_TDATA2 ← MODEL_untk translate TDATA2_untk
24:     L ← {LSTM, GRU, TRF, TL}
25:     for MODEL ∈ L do    ▷ Training different models
26:         MODEL_tk ← train MODEL on the tokenized dataset
27:         MODEL_OUTPUT_tk_TDATA1 ← MODEL_tk translate TDATA1_tk
28:         MODEL_OUTPUT_tk_TDATA2 ← MODEL_tk translate TDATA2_tk
29:         MODEL_sbpe ← train MODEL on the subword BPE dataset
30:         MODEL_OUTPUT_sbpe_TDATA1 ← MODEL_sbpe translate TDATA1_sbpe
31:         MODEL_OUTPUT_sbpe_TDATA2 ← MODEL_sbpe translate TDATA2_sbpe
32:     end for
33:     TRF_OUTPUT_sbpe_TDATA1 ← TRF_sbpe translate TDATA1_tk
34:     TRF_OUTPUT_sbpe_TDATA2 ← TRF_sbpe translate TDATA2_tk
35: end procedure

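The experiment matrix of Algorithm 1 can be summarized in a short Python driver. The helpers train_model and translate are hypothetical stand-ins for the OpenNMT-py training and translation runs; only the loop structure mirrors the algorithm.

def run_experiments(train_model, translate, datasets, test_sets):
    """datasets[seg] holds the training data for a segmentation ('untk', 'tk', 'sbpe');
    test_sets[(seg, domain)] holds the test data for TDATA1 (same domain) or TDATA2."""
    models, outputs = {}, {}
    # baseline: LSTM on the untokenized data (steps 21-23)
    models[("LSTM", "untk")] = train_model("LSTM", datasets["untk"])
    # every architecture on the tokenized and subword-BPE data (steps 24-32)
    for arch in ("LSTM", "GRU", "TRF", "TL"):
        for seg in ("tk", "sbpe"):
            models[(arch, seg)] = train_model(arch, datasets[seg])
    # translate both test domains with every trained model
    for (arch, seg), model in models.items():
        for domain in ("TDATA1", "TDATA2"):
            outputs[(arch, seg, domain)] = translate(model, test_sets[(seg, domain)])
    # TRF_sbpe is additionally scored on the tokenized test sets (steps 33-34)
    for domain in ("TDATA1", "TDATA2"):
        outputs[("TRF", "sbpe_tk", domain)] = translate(models[("TRF", "sbpe")],
                                                        test_sets[("tk", domain)])
    return outputs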
Fig. 5. LSTM tokenized model graphs.

4.1. Datasets

The dataset is a parallel corpus consisting of 41529 parallel sentences. We collected data for our dataset from the Bible (Life.Church/YouVersion, 2021a, 2021b), which is cleaned and aligned manually. We also collected from existing books, which initially required digitization, then cleaning and alignment. The tasks of digitizing, cleaning, and checking for spelling errors in both the Khasi text and the English text, along with the task of alignment, are done manually. The source of the train and validation data is taken from the Bible. We divided the test dataset into two categories, Tdata1 and Tdata2; Tdata1 contains text from the same domain as the train data and Tdata2 contains text from a general domain. The dataset is further processed to get the tokenized dataset using Mosesdecoder (Koehn et al., 2007). As a different approach, we segment the data using subword-nmt (Sennrich et al., 2016) with 32000 merge operations and get the subword BPE processed dataset. Thus, we have three English-Khasi datasets: untokenized, tokenized, and subword BPE datasets. These are experimented with by the LSTM, GRU, transformer, and transfer learning models. The English-Vietnamese parallel corpus (Doan et al., 2021) consists of 3015872 parallel sentences, which is divided into a 2978000-sentence train set, an 18720-sentence validation set, and a 19152-sentence test set, as shown in Table 3. The En-Vi dataset is already tokenized. We further processed the dataset using subword BPE as we have done for the En-Kh dataset. The En-Vi dataset is used in training the transfer learning parent models.

Table 3
Data sets.

Dataset   Train     Val     Test1   Test2
En-Kh     36601     1772    2656    500
En-Vi     2978000   18720   19152   –

4.2. LSTM models

We conducted the experiments with three different datasets using LSTM: LSTM_un, LSTM_tk, and LSTM_sbpe. Our baseline model is LSTM_un, an LSTM trained with the untokenized dataset. Eqs. (1)–(5) described in Section 3.1 are considered in the construction of the LSTM models. The LSTM models consist of a 2-layer LSTM with 500 hidden units on both the encoder and decoder. They use a learning rate of 1.0, a learning rate decay of half of the current learning rate, and a dropout rate of 0.3. The LSTM models also use a global attention mechanism. The second model, trained on the tokenized dataset, shows a better performance than the untokenized model. We use OpenNMT-py (Klein, Kim, Deng, Senellart, & Rush, 2017) for the experiments. The third LSTM model uses the subword BPE dataset, which is processed using subword-nmt (Sennrich et al., 2016). We also utilize a joint-BPE shared vocabulary for this model. It is then trained using OpenNMT-py 2.1.2 (Klein et al., 2017).

The training accuracy of the LSTM_tk model, as shown in Fig. 5, increases with the number of training steps. However, the learning rate is at its highest while training at 40K steps and starts to decline with the increase of training steps. Low perplexity indicates more robust predictions and is also desired for the training model; however, perplexity is very high at the beginning, steeply declines at 15K steps, then remains stable at around 45K–50K steps and slightly decreases with the number of steps. It is at its lowest at around 70K–80K steps, implying that training the model for just 50K steps would be enough. The validation perplexity appears to increase over time from 20K steps, which indicates the inability of the model to clearly predict unseen text based on learning from the small-sized training dataset, as the language is resource constrained. Similarly, the training accuracy, learning rate, and perplexity of the LSTM_sbpe model, as shown in Fig. 6, follow a similar pattern as the LSTM_tk model.

4.3. GRU models

We also perform experiments on two GRU models: GRU_tk, which uses the tokenized dataset, and GRU_sbpe, which uses the subword BPE dataset. Our GRU models are trained using OpenNMT-py 2.1.2 (Klein et al., 2017), which considers and implements Eqs. (6)–(8) described in Section 3.2. We initially train the GRU models with a maximum gradient normalization value of 5, a learning rate of 1.0, and a learning rate decay initially set to 0.5, but the performance is unsatisfactory. We then tune the parameters by reducing the maximum gradient normalization value to 2 and the learning rate to 0.5, with a learning rate decay of half of the current learning rate, initially set to 0.25. A global attention mechanism is also applied, as in the LSTM models. Eventually, the models perform significantly well.

The graphs in Fig. 7 show the training accuracy, learning rate, and perplexity of the GRU_tk model. The progress accuracy during training reaches 78–80 at around 70K–80K steps. The learning rate declines at 40K, while perplexity reaches its lowest only at 80K steps. The graphs in Fig. 8 show the training accuracy, learning rate, and perplexity of the GRU_sbpe model. We observed similar characteristics as the former model; the accuracy reaches its highest at 75K–80K, the learning rate starts declining at 40K, while perplexity appears to fluctuate at around 20K–80K and finally reaches its lowest at 80K.

Fig. 6. LSTM subword byte pair encoding model graphs.

Fig. 7. GRU tokenized model graphs.

Fig. 8. GRU subword byte pair encoding model graphs.


Fig. 9. Transformer tokenized model graphs.

Fig. 10. Transformer subword byte pair encoding model graphs.

4.4. Transformer models

The transformer models are built in consideration of Eqs. (9)–(13) described in Section 3.3. The first transformer model is trained using the tokenized dataset, and the system used is OpenNMT-py 2.1.2. The transformer model (Klein et al., 2017) consists of 6 layers and 8 heads. The parameters used in this model are: a learning rate of 2.0, a dropout rate of 0.1, an attention dropout of 0.1, and the Adam optimizer; it is run on an Nvidia Quadro P2000, a single GPU, with a batch size of 4096 and gradients computed over eight batches. It trains for five days. The second transformer model is trained using the subword BPE dataset, with similar parameters as the former. It trains for eight days. We evaluate the transformer model trained with the subword BPE train dataset with two different test datasets. The first dataset is the subword BPE test dataset Tdata1, and the second is the tokenized dataset Tdata2.

The training accuracy of the transformer models TRF_tk (shown in Fig. 9) and TRF_sbpe (shown in Fig. 10) reaches its peak of 98–99 at 70K–80K steps; the learning rate reaches its peak at 10K steps and starts declining at 30K. The decline in the perplexity stabilizes at 40K steps. The progress accuracy is validated accordingly, as shown in the graphs. The validation perplexity is at its lowest around 10K–40K steps for TRF_tk and TRF_sbpe. We pick the best model, which achieved the lowest validation perplexity, after training for 80K steps.

4.5. Transfer learning models

We built two models using the transfer learning approach as described in Section 3.5. We follow the same approach as we have done for the other baseline models (LSTM, GRU, Transformer), where we trained and tested our models on two different datasets segmented using the tokenized method and the subword byte pair encoded method. The transfer learning model that is trained and tested using the tokenized datasets is referred to as TL_tk, and the transfer learning model trained and tested using the subword BPE dataset is referred to as TL_sbpe. Both these models share a joint vocabulary built using the English-Vietnamese train dataset and the English-Khasi train datasets as shown in Table 3, and they are trained with similar parameters.

The parent model is trained on the English-Vietnamese dataset till it converges. The child model is then trained from the parent model with the English-Khasi dataset. The weights are initialized with the weights of the parent model, and parameters such as the learning rate (2.0) and dropout rate (0.1) are kept the same as in the parent model, and we trained for 80K steps.

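A minimal sketch of the warm-start idea behind the transfer learning models: a joint vocabulary is built over the parent and child training data, and the child model starts from a copy of the parent parameters before continuing training on English-Khasi. In practice this is done through OpenNMT-py checkpoints; the functions below are only illustrative, assuming numpy-array parameters.

def build_joint_vocab(*corpora):
    """Shared (English, Vietnamese, Khasi) vocabulary so that child embeddings
    can reuse the parent's embedding rows for tokens seen in both corpora."""
    vocab = {}
    for corpus in corpora:
        for sentence in corpus:
            for token in sentence.split():
                vocab.setdefault(token, len(vocab))
    return vocab

def init_child_from_parent(parent_weights):
    """Warm start: copy every parent parameter; training then continues on the
    English-Khasi data with the same learning rate (2.0) and dropout (0.1)."""
    return {name: value.copy() for name, value in parent_weights.items()}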
Fig. 11. Transfer learning model graphs.

Fig. 11 shows the training progress and validation of the transfer learning model trained on the tokenized dataset. Just like the other models, we pick the best model, which achieved the lowest validation perplexity, after training the child model for 80K steps.

5. Results and analysis

We perform the translation on two different test sets, the same-domain Tdata1 and the out-of-domain Tdata2.

5.1. Evaluation methods

The translation accuracy of the experimental results of the models is calculated using three evaluation methods: automatic, human judgment, and statistical evaluations. The automatic evaluation uses four metrics: BLEU (Bilingual Evaluation Understudy); ChrF2, an n-gram F-score for automatic machine translation evaluation; TER (Translation Error Rate), used to determine the amount of post-editing required for machine translation outputs, which are computed using sacreBLEU (Post, 2018) with scores ranging from 0–100, where a low TER score indicates a better translation; and RIBES (Rank-based Intuitive Bilingual Evaluation Score), an automatic score computed using RIBES-1.03.1 (Isozaki et al., 2010) ranging from 0–1. The human evaluation is performed using two metrics: adequacy, the amount of correct words translated in the target output, and fluency, the correct ordering of words as per the grammatical rules. Both metrics are ranked on a 5-point scale, with 5 as the highest score. The human evaluation is performed on four variable sentence lengths grouped by the number of words in a sentence: sentences of less than 15 words, sentences of greater than 15 and less than 25 words, sentences of greater than 25 and less than 50 words, and sentences of greater than 50 words. The statistical evaluation (Koehn, 2009) is applied using Precision with Eq. (14), Recall with Eq. (15), and F1-measure with Eq. (16). Quantitative and qualitative analyses are also performed on the results of the experiments.

Table 4
Automatic evaluation of different models.

Model         Test data set   BLEU Score   ChrF2   TER    RIBES
LSTM_un       Tdata1          45.7         62.1    41.9   0.852688
LSTM_un       Tdata2          12.9         33.6    85.0   0.635693
LSTM_tk       Tdata1          53.5         69.3    31.4   0.902735
LSTM_tk       Tdata2          12.7         31.3    87.0   0.601477
LSTM_sbpe     Tdata1          51.7         66.6    53.5   0.90306
LSTM_sbpe     Tdata2          9.2          28.8    85.2   0.618851
GRU_tk        Tdata1          47.9         64.1    36.8   0.894059
GRU_tk        Tdata2          9.6          28.9    84.7   0.6334
GRU_sbpe      Tdata1          47.7         64.5    36.2   0.896446
GRU_sbpe      Tdata2          6.6          26.5    85.7   0.58497
TRF_tk        Tdata1          56.0         71.0    29.4   0.909528
TRF_tk        Tdata2          12.1         29.6    83.1   0.572565
TRF_sbpe      Tdata1          55.3         70.6    32.7   0.91181
TRF_sbpe      Tdata2          10.2         32.2    85.0   0.620027
TRF_sbpe_tk   Tdata1          56.2         71.2    29.2   0.912516
TRF_sbpe_tk   Tdata2          13.3         33.3    83.3   0.612406
TL_tk         Tdata1          58.1         70.2    26.0   0.920302
TL_tk         Tdata2          17.7         36.4    72.8   0.710131
TL_sbpe       Tdata1          53.1         68.6    29.4   0.911138
TL_sbpe       Tdata2          13.7         33.7    78.7   0.627104
TRF model (Thabah & Purkayastha, 2021)     39.63
Conv model (Hujon, Amitab, & Singh, 2023)  37.7
TL model (Hujon, Singh, & Amitab, 2023)    51.11

5.2. Quantitative results and analysis

Three different methods are used for quantitative evaluation: automatic, human evaluation, and statistical. We also analyze the impact of using the subword BPE data segmentation approach.

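The BLEU, ChrF2 and TER columns of Table 4 are corpus-level scores of the kind sacreBLEU computes; a sketch using its Python API is given below (the file names are hypothetical, and RIBES is computed separately with the RIBES-1.03.1 script).

import sacrebleu  # pip install sacrebleu

# one translated sentence and one reference sentence per line (hypothetical files)
with open("tdata1.hyp.kh", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("tdata1.ref.kh", encoding="utf-8") as f:
    references = [[line.strip() for line in f]]   # a single reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # 0-100, as reported in Table 4
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # chrF2 with default settings
ter = sacrebleu.corpus_ter(hypotheses, references)     # lower is better
print(bleu.score, chrf.score, ter.score)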
Fig. 12. Human evaluations of the results of models on variable sentence length.

5.2.1. Automatic evaluation and analysis

On analyzing the performance of all models based on the automatic scores shown in Table 4, we initially focus on the results using the test set Tdata1. We find that the transfer learning TL_tk model achieved the highest score with 58.1 BLEU and 0.920302 RIBES and the lowest post-editing of 26.0 TER. However, TRF_sbpe_tk achieved a higher score in ChrF2 by +1.0. Our TL_tk model has also improved, with a gain of +6.99 compared to the existing model (Hujon, Singh, & Amitab, 2023), and the TRF_sbpe_tk model has also outperformed the existing score (Thabah & Purkayastha, 2021) on similar settings, with an improvement of +16.57 BLEU. LSTM_un shows the lowest scores in the three automatic metrics, while LSTM_sbpe requires the highest amount of post-editing with a score of 53.5 TER. The LSTM_un model, even with a lower performance than the other models in the experiment, achieved a significant BLEU score of 45.7. Compared with the baseline transformer model TRF_tk, the TL_tk model achieved an improvement of +2.1 BLEU, while the TL_sbpe model achieved a lower score of -2.2 BLEU compared to the TRF_sbpe model. Next, we analyze the results on the test set Tdata2. The automatic scores for Tdata2 show that our TL_tk model achieved the highest scores of 17.7 BLEU, 36.4 ChrF2, and 0.710131 RIBES. It also achieved the lowest amount of post-editing for the general test data Tdata2 with a score of 72.8 TER. Even though the score is very low as compared to Tdata1, it is generally acceptable, since most models do not fare equally well with different domains. GRU_sbpe scored the lowest with 6.6 BLEU and 26.5 ChrF2, but LSTM_un, with 0.45, scored the least in the RIBES metric and required the highest amount of post-editing. Overall, GRU_sbpe and LSTM_un score the lowest automatic scores compared to all the models; nevertheless, the GRU models show improvements of +2.2 and +2.0 BLEU, as observed in Table 4, by GRU_tk and GRU_sbpe over the baseline LSTM_un model.

5.2.2. Human evaluation and analysis

Human subjective evaluation in Fig. 12 shows that the models using the transformer architecture perform better than the LSTM and GRU models in both adequacy and fluency for all four categories of sentences. We observed in Fig. 12(a) that TRF_sbpe_tk scores the highest adequacy of 4.6 for short sentences of length less than 15 words, while TRF_sbpe scores 3.77 for the second category with sentences greater than 15 and less than 25 words, and TRF_tk with a score of 3.74 achieved the highest for the third category of sentences greater than 25 and less than 50 words. TL_tk performed more or less equally well for all four categories; moreover, it achieved the highest adequacy score among the models for long sentences of more than 50 words.

As expected, the transformer models also score the highest in fluency for all four categories of sentences, as observed in Fig. 12(b). TRF_sbpe_tk achieved the highest score of 5.0 and LSTM_un scored the lowest with 4.63 for the first category, and TRF_sbpe_tk with a fluency score of 4.90 ranked the highest in the second category while GRU_sbpe, with a score of 4.13, ranked the lowest in this category. TRF_tk with a fluency score of 4.65 ranked highest in the third category while LSTM_un gained the lowest score of 3.80. The two models TRF_sbpe and TRF_sbpe_tk achieved an equal score of 4.60 and ranked the highest in the fourth category. LSTM_un and GRU_tk rank the lowest with an equal fluency score of 3.50. The results show that the transformer models predict more accurate and comprehensible sentences. Comparing the two RNN architectures, the LSTM and the GRU, the GRU models perform better than the baseline LSTM_un. Nevertheless, the other two LSTM models, LSTM_tk and LSTM_sbpe, perform significantly better than the GRU models. In fluency, both the transfer learning models TL_tk and TL_sbpe achieved overall high scores of 4+ for all four categories.

5.2.3. Statistical evaluation and analysis

The statistical evaluation (Koehn, 2009) is applied using Precision with Eq. (14), Recall with Eq. (15), and F1-measure with Eq. (16). The terms correct, Output_length, and Reference_length refer to the number of words that are translated correctly in the translated text, the number of words in the translated text, and the number of words in the reference text, respectively.

Precision = correct / Output_length    (14)

Recall = correct / Reference_length    (15)

F1-measure = (Precision × Recall) / ((Precision + Recall) / 2)    (16)

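A direct transcription of Eqs. (14)-(16) into Python; the word counts in the example call are made up for illustration.

def precision_recall_f1(correct, output_length, reference_length):
    """Word-level scores of Eqs. (14)-(16): `correct` is the number of correctly
    translated words; the lengths are word counts of the output and the reference."""
    precision = correct / output_length                        # Eq. (14)
    recall = correct / reference_length                        # Eq. (15)
    f1 = (precision * recall) / ((precision + recall) / 2)     # Eq. (16)
    return precision, recall, f1

# e.g. 58 correctly translated words in a 65-word output against a 62-word reference
print(precision_recall_f1(58, 65, 62))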
Table 5 language pair and subword BPE technique. Although Both Vietnamese
Statistical evaluation of different models.
and Khasi share a similar script, there are differences in some letters of
Models Sentence <= 15 words Sentence > 15 <= 25 words
the alphabet, and applying word segmentation using the subword BPE
Precision Recall F1-measure Precision Recall F1-measure
in the Vietnamese language does not seem to have such a good impact
𝐿𝑆𝑇 𝑀𝑢𝑛 64.57% 68.68% 66.56% 71.89% 66.98% 69.35%
on the model’s performance.
𝐿𝑆𝑇 𝑀𝑡𝑘 82.61% 74.45% 78.32% 66.67% 70.99% 68.76%
𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 84.96% 86.13% 85.54% 66.44% 68.06% 67.24%
𝐺𝑅𝑈𝑡𝑘 87.07% 76.42% 81.53% 70.07% 68.83% 69.44%
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 82.96% 80.27% 81.59% 65.08% 64.81% 64.95% 5.2.5. A detection of gender disagreement
𝑇 𝑅𝐹𝑡𝑘 93.02% 86.13% 89.44% 73.62% 70.52% 72.04%
During the process of our experiments and after a thorough analysis
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 88.08% 83.75% 85.86% 76.33% 71.76% 73.98%
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 88.08% 83.75% 85.86% 79.89% 72.99% 76.29% of the predicted text of these different models, we observed that the
𝑇 𝐿𝑡𝑘 92.86% 87.50% 90.10% 73.00% 74.07% 73.38% translations suffer from a problem of data sparsity because the language
𝑇 𝐿𝑠𝑏𝑝𝑒 84.62% 78.13% 81.24% 73.50% 76.85% 75.14%
pair are not equal in terms of morphological richness. We find that
Models Sentence > 25 <= 50 words Sentence > 50 words
there is a gender disagreement on a certain part of translation like
Precision Recall F1-measure Precision Recall F1-measure
personal pronouns. For example, in Table 6 the word you in the input
𝐿𝑆𝑇 𝑀𝑢𝑛 59.62% 66.73% 62.98% 57.58% 59.86% 58.43%
sentence and pha in the reference text, which is a female gender form,
𝐿𝑆𝑇 𝑀𝑡𝑘 65.56% 65.65% 65.60% 59.85% 57.39% 58.38%
𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 65.34% 71.31% 68.20% 59.17% 65.72% 62.20% is translated as phi which can be used for neutral gender in the Khasi
𝐺𝑅𝑈𝑡𝑘 62.18% 69.44% 65.61% 62.82% 56.90% 59.72% language in 𝐿𝑆𝑇 𝑀𝑢𝑛 , 𝐿𝑆𝑇 𝑀𝑡𝑘 , 𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 , 𝐺𝑅𝑈𝑠𝑏𝑝𝑒 , 𝑇 𝑅𝐹𝑡𝑘 , and both
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 65.48% 67.11% 66.29% 62.90% 56.09% 59.30%
𝑇 𝑅𝐹𝑡𝑘 76.64% 74.72% 75.67% 67.00% 76.19% 70.67%
the transfer learning models.
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 75.72% 72.62% 74.14% 70.85% 64.86% 67.03% However, in 𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 and 𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 it was translated to me a male
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 75.72% 72.62% 74.14% 68.53% 67.25% 67.42%
gender form. We also find a similar inflection of the pronoun in the
𝑇 𝐿𝑡𝑘 58.54% 55.81% 57.14% 67.53% 76.89% 71.90%
𝑇 𝐿𝑠𝑏𝑝𝑒 50.00% 53.49% 51.69% 59.86% 70.19% 64.61% 𝐺𝑅𝑈𝑡𝑘 model. Moreover, we also noticed that the pronoun changes
from male form mem to neutral form phi in one phrase of the output
text of the 𝐺𝑅𝑈𝑡𝑘 model. The words you cannot in the Input sentence
and pham a female gender form in the Reference is translated to
phim a neutral gender form in 𝐿𝑆𝑇 𝑀𝑢𝑛 , 𝐿𝑆𝑇 𝑀𝑡𝑘 , 𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 , 𝐺𝑅𝑈𝑠𝑏𝑝𝑒 ,
𝑇 𝑅𝐹𝑡𝑘 , 𝑇 𝐿𝑡𝑘 and 𝑇 𝐿𝑠𝑏𝑝𝑒 , but in 𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 and 𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 it is translated
to mem a male gender form. We find similar morphological inflections
on gender in the translated test dataset Tdata2. Thus, the female gender
gets converted to the male gender and vice versa; however, within the
whole sentence, except 𝐺𝑅𝑈𝑡𝑘 , there is a consistency in maintaining the
gender, and we find that there is a subject-verb concord.
The changes in gender forms are due to the morphological inflec-
tion of nouns and pronouns from the source language to the target
language. The parallel corpora lack gender information, on which one
solution could be some form of morphological injection. The mor-
phological inflection on nouns, verbs, and adjectives can be present
in languages. Similar problems are encountered in statistical machine
Fig. 13. Statistical evaluation of different models on variable sentence length.
translations (SMT). A method to tackle morphological inflections is by
using Factored Models (Sreelekha & Bhattacharyya, 2017) to improve
the quality of translation in terms of adequacy and fluency where they
baseline model 𝐿𝑆𝑇 𝑀𝑢𝑛 , which achieved the lowest score of 66.56% incorporated the idea of morphological injection on nouns and verbs
and 62.98% in two categories, while 𝐺𝑅𝑈𝑠𝑏𝑝𝑒 , rank lowest in the second in SMT. Factored translation model uses additional annotation at the
category with a score of 64.95%. However, the 𝐿𝑆𝑇 𝑀𝑡𝑘 achieved the word level; the technique is by using a factored parallel corpus for the
lowest score in the fourth category with 58.38%. The LSTM models, morphology inflection method.
other than the baseline model, seem to achieve a higher performance
As reported in the experimental results (Sreelekha & Bhattacharyya,
than the GRU models based on the statistical scores.
2017), morphology injection has improved the translation quality by
reducing the number of OOVs and improving in BLEU score, adequacy,
5.2.4. Analysis on the impact of subword BPE data segmentation approach
Considering the models for which subword BPE is implemented on the datasets, and based on the results of the three evaluation methods shown in Table 4, Fig. 12, and Fig. 13, the GRU model does not fare well with subword BPE compared to the other models and performs the lowest among the subword BPE models. The transformer models, however, show a superior performance against the LSTM and GRU models in all three evaluation methods.

The transformer subword BPE model 𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 scores the highest automatic scores in both domains among the subword BPE models. Thus, the subword BPE approach contributes significantly to the model's score. The human subjective evaluation and the statistical evaluation also tally with the automatic evaluation. We notice that the Vietnamese language does not perform very well with the subword BPE method, and the 𝑇𝐿𝑠𝑏𝑝𝑒 model shows unexpectedly lower scores compared to the 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 model and to the currently existing scores obtained with similar settings (Hujon, Singh, & Amitab, 2023), which used French as one of the parent languages.
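A minimal sketch of the subword BPE segmentation step (Sennrich et al., 2016) discussed above is given below; it uses the sentencepiece library, and the file name, vocabulary size, and choice of toolkit are illustrative assumptions rather than the exact settings of these experiments.

    import sentencepiece as spm

    # Learn a BPE model on the source-side training text (file name is hypothetical).
    spm.SentencePieceTrainer.train(
        input='train.en', model_prefix='bpe_en',
        vocab_size=8000, model_type='bpe')

    # Segment a sentence into subword units with the learned merges.
    sp = spm.SentencePieceProcessor(model_file='bpe_en.model')
    print(sp.encode('flowing through the desert like a river', out_type=str))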
In the example of Table 6, the word you in the input sentence and pha in the reference text, which is a female gender form, is translated as phi, which can be used for the neutral gender in the Khasi language, in 𝐿𝑆𝑇𝑀𝑢𝑛, 𝐿𝑆𝑇𝑀𝑡𝑘, 𝐿𝑆𝑇𝑀𝑠𝑏𝑝𝑒, 𝐺𝑅𝑈𝑠𝑏𝑝𝑒, 𝑇𝑅𝐹𝑡𝑘, and both transfer learning models. However, in 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 and 𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 it is translated to me, a male gender form. We also find a similar inflection of the pronoun in the 𝐺𝑅𝑈𝑡𝑘 model. Moreover, we notice that the pronoun changes from the male form mem to the neutral form phi in one phrase of the output text of the 𝐺𝑅𝑈𝑡𝑘 model. The words you cannot in the input sentence and pham, a female gender form, in the reference are translated to phim, a neutral gender form, in 𝐿𝑆𝑇𝑀𝑢𝑛, 𝐿𝑆𝑇𝑀𝑡𝑘, 𝐿𝑆𝑇𝑀𝑠𝑏𝑝𝑒, 𝐺𝑅𝑈𝑠𝑏𝑝𝑒, 𝑇𝑅𝐹𝑡𝑘, 𝑇𝐿𝑡𝑘 and 𝑇𝐿𝑠𝑏𝑝𝑒, but in 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 and 𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 they are translated to mem, a male gender form. We find similar morphological inflections on gender in the translated test dataset Tdata2. Thus, the female gender gets converted to the male gender and vice versa; however, within the whole sentence, except for 𝐺𝑅𝑈𝑡𝑘, the gender is maintained consistently, and we find that there is subject–verb concord.

The changes in gender forms are due to the morphological inflection of nouns and pronouns from the source language to the target language. The parallel corpora lack gender information, for which one solution could be some form of morphological injection. Morphological inflection on nouns, verbs, and adjectives can be present in languages, and similar problems are encountered in statistical machine translation (SMT). One method to tackle morphological inflections is to use factored models (Sreelekha & Bhattacharyya, 2017) to improve the quality of translation in terms of adequacy and fluency; there, the idea of morphological injection on nouns and verbs was incorporated into SMT. A factored translation model uses additional annotation at the word level; the technique relies on a factored parallel corpus for the morphology injection method.
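A factored parallel corpus attaches such word-level annotations to every surface token. A minimal sketch of constructing one factored source line is shown below; the Moses-style '|' separator and the lemma and POS factors are illustrative assumptions.

    # Build one factored token line of the form word|lemma|POS (illustrative factors).
    tokens = [('they', 'they', 'PRON'), ('are', 'be', 'VERB'), ('liars', 'liar', 'NOUN')]
    factored_line = ' '.join('|'.join(t) for t in tokens)
    print(factored_line)  # they|they|PRON are|be|VERB liars|liar|NOUN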
As reported in the experimental results (Sreelekha & Bhattacharyya, 2017), morphology injection improved the translation quality by reducing the number of OOVs and improving the BLEU score, adequacy, and fluency of the translation outputs. Gender bias and gender disagreement are issues related not only to SMT but also to NMT. Google Translate was found to suffer from gender bias (Prates, Avelar, & Lamb, 2020), and Google has made an effort to promote fairness and reduce bias in machine learning by providing feminine and masculine translations for some gender-neutral words. Researchers like Vanmassenhove, Hardmeier, and Way (2018) raised this issue in 2018. Current machine translation systems depend on sentence-level statistical dependencies learned from large amounts of parallel data, and sentences are generally translated in isolation, which results in the loss of important information necessary to determine the gender of the speakers. A gender-enhanced NMT is reported by Vanmassenhove et al. (2018), and their experiments show a significant improvement in the results. The technique tags the data in the corpus: gender information is injected into the corpus to improve the morphological agreement.
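A minimal sketch of this kind of corpus tagging is shown below: a speaker-gender token is prepended to every source sentence before training, in the spirit of Vanmassenhove et al. (2018). The tag format and the source of the gender metadata are assumptions made for illustration only.

    # Prepend a speaker-gender tag to each English source sentence (illustrative format).
    def tag_with_gender(sentences, genders):
        # genders[i] is 'F' or 'M' for the speaker/addressee of sentences[i]
        return ['<2{}> {}'.format(g, s) for s, g in zip(sentences, genders)]

    tagged = tag_with_gender(['I know what you have done.'], ['F'])
    print(tagged[0])  # -> <2F> I know what you have done.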


Table 6
Sample Input-Output Showing Gender disagreement of different models.
Input I know what you have done; I know how hard you have worked and how patient
you have been. I know that you cannot tolerate evil people and that you have
tested those who say they are apostles but are not, and have found out that
they are liars.
Reference Nga tip ïa kaei kaba pha la leh; nga tip katno pha la trei shitom bad katno pha
la long kaba ïaishah. Nga tip ba pham lah ban shah ïa ki briew kiba sniew bad
ba phala tynjuh ïa kito kiba ong ba ki dei ki apostol hynrei kim dei, bad pha la
shem ba ki long kiba lamler.

𝐿𝑆𝑇 𝑀𝑢𝑛 Nga tip kaei kaba phi la leh; nga tip haduh katno phi la trei shitom bad katno
phi la long. Nga tip ba phi lah ban ïaishah ïa ki briew kiba sniew bad ba phi la
tynjuh ïa kito kiba ong ki long ki apostol hynrei ki bym dei ban tip ba ki long ki
nongshukor.
𝐿𝑆𝑇 𝑀𝑡𝑘 Nga tip kaei kaba phi la leh ; nga tip haduh katno phi la trei shitom bad kumno
phi la long kiba ïaishah. Nga tip ba phim lah shah ïa u riewbymman bad ba phi
la tynjuh ïa u briew uba ong ba u long u apostol hynrei uba la tynjuh , bad u la
shem ba ki long ki nongshukor.
𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 Nga tip aïu phi la leh ; nga tip haduh katno phi la trei shitom bad katno phi la
long kiba ïaishah. Nga tip ba phim lah shah ïa ki briew kiba sniew bad ba phi la
tynjuh ïa kito kiba ong ba ki long ki apostol hynrei ki bym dei ki briew , bad ki
la shem ba ki long kiba lamler .
𝐺𝑅𝑈𝑡𝑘 Nga tip kaei kaba me la leh ; nga tip haduh katno me la trei bad katno uba ïaishah,
la tip ba mem lah ban shah ïa ki briew kiba kynja mynsiem bad ba phi la tynjuh
ïa kito kiba ong ba ki long ki apostol hynrei kim dei , bad ki la
shem ïa kata ba ki long kiba lamler.
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 Nga tip kaei ba phi la leh ; nga tip haduh katno phi la trei bad kumno slem phi
la long . Nga tip ba phim lah shah ban shah ïa ki briew kiba sniew
bad ba phi la shem ba ki long ki nongshukor hynrei kim long , bad ki
la shem ïa kata ki nongshukor.
𝑇 𝑅𝐹𝑡𝑘 Nga tip ïa kaei kaba phi la leh ; nga tip haduh katno phi la trei shitom bad katno
phi long kiba ïaishah. Nga tip ba phim lah shah ïa ki briew basniew bad ba phi
la tynjuh ïa kito kiba ong ba ki long ki apostol hynrei kim long , bad nga la
shem ba ki long ki nongshukor.
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 Nga tip kaei kaba me la leh ; nga tip haduh katno me la trei shitom bad kumno
ba me la long uba ïaishah. Nga tip ba mem lah shah ïa ki briew kiba sniew
bad ba me la tynjuh ïa kito kiba ong ba ki long ki apostol hynrei kim long , bad
me la shem ba ki long ki nongshukor.
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 Nga tip kaei kaba me la leh ; nga tip haduh katno me la trei shitom bad kumno
ba me la long uba ïaishah. Nga tip ba mem lah shah ïa ki briew kiba sniew
bad ba me la tynjuh ïa kito kiba ong ba ki long ki apostol hynrei kim long , bad me
la shem ba ki long ki nongshukor.
𝑇 𝐿𝑡𝑘 Nga tip ïa kaei kaba phi la leh ; nga tip haduh katno phi la trei shitom bad haduh
katno phi la long kiba ïaishah . Nga tip ba phim lah shah ïa ki briew kiba sniew bad
ba phi la tynjuh ïa kito kiba ong ba ki long ki apostol hynrei phim long kumta , bad
phi la shem ba ki long ki .
𝑇 𝐿𝑠𝑏𝑝𝑒 Nga tip ïa kaei phi la leh ; nga tip haduh katno phi la trei shitom bad haduh katno phi
la long kiba ïaishah slem Nga tip ba phim lah shah ïa ki briew kiba sniew bad
ba phi la tynjuh ïa kito kiba ong ba ki long ki apostol hynrei ki bym dei , bad phi
la shem ba ki long ki

5.3. Qualitative analysis on Input–Output of different models

Table 7
Sample Input-Output for Sentences of length less than 15 words of different models.
Input flowing through the desert like a river
Reference ka tuid lyngba ka ri shyiap kum ka wah
𝐿𝑆𝑇𝑀𝑢𝑛 ka ri shyiap kum ka wah
𝐿𝑆𝑇𝑀𝑡𝑘 ka tuid lyngba ka ri shyiap kum ka wah
𝐿𝑆𝑇𝑀𝑠𝑏𝑝𝑒 kaba tuid lyngba ka ri shyiap kum ka wah
𝐺𝑅𝑈𝑡𝑘 kiba tuid lyngba ka ri shyiap kum ka wah
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 kiba tuid lyngba ka ri shyiap kum ka wah
𝑇𝑅𝐹𝑡𝑘 ki tuid lyngba ïa ka ri shyiap kum ka wah
𝑇𝑅𝐹𝑠𝑏𝑝𝑒 ki tuid lyngba ïa ka ri shyiap kum ka wah
𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 ki tuid lyngba ïa ka ri shyiap kum ka wah
𝑇𝐿𝑡𝑘 kaba tuid lyngba ka ri shyiap kum ka wah
𝑇𝐿𝑠𝑏𝑝𝑒 kaba tuid lyngba ka ri shyiap kum ka wah

The comprehensibility of a language can be determined if sentences strictly follow a specific word order of that language. Generating fully comprehensible output is one of machine translation's goals. Both English and Khasi follow SVO as the word order. We categorize the experimental outputs of the models into four sets for analysis. Table 7 shows sentences of length less than 15 words.

We find that the input verb flowing is not correctly translated to the reference verb tuid in 𝐿𝑆𝑇𝑀𝑢𝑛; SVO is translated correctly in 𝐿𝑆𝑇𝑀𝑡𝑘. However, the subject is not fully translated in 𝐿𝑆𝑇𝑀𝑠𝑏𝑝𝑒 and by the GRU models. The word order of the transformer models is translated correctly, except that the subject ka is translated as the plural ki instead of the singular. Similarly, the transfer learning models correctly translate the whole sentence and maintain the word order, except that the word ka is translated as kaba. We find that all the models are comprehensible in this category. The adequacy of 𝐿𝑆𝑇𝑀𝑠𝑏𝑝𝑒 is higher than that of the other LSTM models and equally good as the transformer and transfer learning models. The fluency of all the models is equally high. Table 8 shows sentences greater than 15 and less than 25 words. Even though a few objects are not correctly translated, we find that the transformer models 𝑇𝑅𝐹𝑡𝑘 and 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 and the transfer learning models 𝑇𝐿𝑡𝑘 and 𝑇𝐿𝑠𝑏𝑝𝑒 are more comprehensible than the LSTM and GRU models. The adequacy of 𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 is higher than that of the other models, and in terms of fluency, the three transformer models equally rank the highest in this category.
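The four length categories used in this analysis can be made explicit with a small sketch; the treatment of sentences of exactly 15, 25, or 50 words is an assumption, since only the open ranges are stated above.

    # Assign a translated sentence to one of the four length categories used above.
    def length_category(sentence):
        n = len(sentence.split())  # simple whitespace word count
        if n < 15:
            return 'less than 15 words'
        if n < 25:
            return '15 to 25 words'
        if n < 50:
            return '25 to 50 words'
        return 'more than 50 words'

    print(length_category('flowing through the desert like a river'))  # less than 15 words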


Table 8
Sample Input-Output for Sentences of length greater than 15 and less than 25 words of different models.
Input Also in front of the throne there was what looked like a sea of glass, clear as crystal
Reference Ha khmat ka khet ruh la don kaei kaba i kum ka ïit bakhraw, kaba shai sngur bha
𝐿𝑆𝑇 𝑀𝑢𝑛 Ha khmat jong ka khet la don ruh kaei kaba i kum ka duriaw u Stephanas, kaba
paw kum ka ngap
𝐿𝑆𝑇 𝑀𝑡𝑘 Kumjuh ruh ha khmat jong ka khet don kaei kaba i kum ka duriaw ka jingpynhiar
duriaw , kaba shai kum ki nongtrei
𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 Kumjuh ruh ha khmat ka khet la don ka dieng kaba syriem kum ka duriaw ka khia
, da kaba mut kumjuh
𝐺𝑅𝑈𝑡𝑘 Ha khmat ka khet jong ka khet don kaei kaba peit kum ka duriaw ka jingieit , kaba
shai kum ki dongmusa
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 Ha khmat ka khet jong ka khet don kaei kaba peit kum ka duriaw ka jingieit , kaba
shai kum ki dongmusa
𝑇 𝑅𝐹𝑡𝑘 Ha khmat jong ka khet , don kaei kaba i kum ka duriaw ïit shai , hynrei don
ki maw-it
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 Ha khmat jong ka khet , la don kaei kaba i kum ka duriaw ïa kaba la ïit , ha ki
maw-it
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 Ha khmat jong ka khet , ka don kaei kaba i kum ka duriaw , kaba shai kum u
maw-it
𝑇 𝐿𝑡𝑘 Ha khmat jong ka khet ruh la don kaei kaba i kum ka duriaw, kaba shai kum kaba
𝑇 𝐿𝑠𝑏𝑝𝑒 Ha khmat ka khet la don kaei kaba i kum ka duriaw , kaba shai kum kiba

Table 9
Sample Input-Output for Sentences of length greater than 25 and less than 50 words of different models.
Input Each of them was given a white robe, and they were told to rest a little while longer,
until the complete number of their fellow-servants and fellow-Christians had been
killed, as they had been"
Reference Ha uwei pa uwei na ki la ai ïa ka jaiñkup balieh, bad la ong ïa ki ba kin shongthait
sa shipor, tad ynda la dep kheiñ lut ïa ki para shakri bad parabangeit kiba la shah pynïap,
kumba la leh ïa ki
𝐿𝑆𝑇 𝑀𝑢𝑛 Ïa uwei pa uwei na ki la ai ka jaiñkup kaba lieh bad la bthah ïa ki ban sah sa shipor,
haduh ka jingbun briew ha ka jingbun briew jong ki bad ha ki matsohlah jong ki la
pynïap, kumba ki long kiba khuid ha kaba pynkyntang ïa u
𝐿𝑆𝑇 𝑀𝑡𝑘 La ai ha uwei pa uwei na ki , ïa ka jaiñkup kaba lieh , bad la bthah ïa ki ban neh
khyndiat por , haduh ba ka jingdon met ba-pura jong ki para shakri jong ki bad ki
parabangeit ki la shah pynïap , kumba ki la long
𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 Ïa uwei pa uwei na ki la ai ha ka jaiñkup balieh , bad la bthah ïa ki ban pynjahthait
khyndiat por , haduh ba ka jingbun pura jong ki shakri bad ki parabangeit jong ki
ka la shah pynïap , kumba ki la long
𝐺𝑅𝑈𝑡𝑘 Uwei pa uwei na ki u la ai ruh ïa ka jaiñkup jaiñkup , bad la bthah ïa ki ruh ban
shong tang shipor , haduh ba un da dep da kaba pynïap ïa ki para kynthei jong ki
bad ki briew kiba la shah pynïap , kumba ki la long kiba la lah
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 Uwei pa uwei na ki u la ai ïa ka jaiñkup kaba lieh , bad ki la ong ha kaba sah katto
katne por , tad haduh ba ka jinglong pura jong ki para tymmen bad ki para kynthei
jong ki , ki la shah pynïap , kumba ki la long
𝑇 𝑅𝐹𝑡𝑘 Ïa uwei pa uwei na ki la ai ka jaiñkup balieh , bad la bthah ïa ki ban pynjahthait
katto katne por , haduh ba ka jingdon ki para shakri bad ki parabangeit la shah
pynïap , kumba ki la long
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 Ïa uwei pa uwei na ki la ai ka jaiñkup balieh , bad la bthah ïa ki ban pynjahthait
katto katne , tad haduh ba ka jingbun ki para shakri jong ki bad ki parabangeit la
shah pynïap , kumba la long
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 Ïa uwei pa uwei na ki la ai ka jaiñkup balieh , bad la bthah ïa ki ban pynjahthait
katto katne , tad haduh ba ka jingbun ki para shakri jong ki bad ki parabangeit la
shah pynïap , kumba la long
𝑇 𝐿𝑡𝑘 Ïa uwei pa uwei na ki la ai ka jaiñkup , bad la bthah ïa ki ban sa katto katne por ,
haduh ba yn da pynïap lut ïa ka jong ki bad ïa ka , kumba ki la long
𝑇 𝐿𝑠𝑏𝑝𝑒 Ïa uwei pa uwei na ki la ai da ka jaiñkup kaba lieh , bad la bthah ïa ki ban shongthait
sa katto katne por , haduh ba baroh ka jong ki shakri bad ki parabangeit jong ki
ka la shah pynïap , kumba ki la long

There are a few words, such as 'throne', which are translated correctly by all the models, whereas some words, such as 'clear', are only partially translated by the models.

Table 9 shows sentences greater than 25 and less than 50 words. In this category, objects are translated correctly to some extent, except for a few in 𝐿𝑆𝑇𝑀𝑢𝑛, but 𝐿𝑆𝑇𝑀𝑡𝑘 is more comprehensible than the other LSTM models. Object translation in 𝑇𝑅𝐹𝑡𝑘 is better than in 𝐿𝑆𝑇𝑀𝑡𝑘 and more comprehensible than in the LSTM models, although a few verbs were not fully translated. The phrase 'white robe' is translated correctly by all models, except 𝐺𝑅𝑈𝑡𝑘. The adequacy of all the models seems to have decreased with the increasing length of sentences. However, the adequacy and fluency of 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 and 𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 rank the highest compared to the other models, while 𝐿𝑆𝑇𝑀𝑢𝑛 ranks the lowest in this category.

Table 10 shows sentences greater than 50 words; here, SVO translations are more correct in 𝑇𝐿𝑡𝑘 and 𝑇𝐿𝑠𝑏𝑝𝑒, 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 and 𝑇𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘. They also show better comprehensibility than 𝑇𝑅𝐹𝑡𝑘 and the LSTM and GRU models. However, the GRU models performed more or less similarly to the LSTM models. The adequacy of the transformer models ranks equally among them and is better than that of the LSTM models. In terms of fluency, the two transformer models using subword BPE rank the highest, while 𝑇𝐿𝑡𝑘 and 𝑇𝐿𝑠𝑏𝑝𝑒 perform consistently well for all sentence lengths. Fluency is slightly lower than in shorter sentences, especially in the LSTM models. 𝑇𝐿𝑡𝑘, 𝑇𝐿𝑠𝑏𝑝𝑒, and 𝑇𝑅𝐹𝑠𝑏𝑝𝑒 show the best performance, and 𝐿𝑆𝑇𝑀𝑢𝑛 shows the weakest performance among the models regarding word order agreement, adequacy, and fluency for various sentence lengths.


Table 10
Sample Input-Output for Sentences of length greater than 50 words of different models.
Input But it is not just creation alone which groans; we who have the Spirit as the first of
God’s gifts also groan within ourselves, as we wait for God to make us his children
and Some manuscripts do not have make us his children and set our whole being free.
Reference Hynrei ym dei tang ki jingthaw kiba ud ; ma ngi ruh kiba don ïa U Mynsiem kum
ki jingai banyngkong jong U Blei , ngi ïa-ud hapoh lade hi , katba ngi dang ap ïa
U Blei ba un pynlong ïa ngi ki khun jong u bad ban pyllaitluid phar ïa ngi.
𝐿𝑆𝑇 𝑀𝑢𝑛 Hynrei ym dei tang ba ka long kaba ym long tang kaba kyang, ïa kaba ngi don U
Mynsiem kumba nyngkong eh na ki jingai jong U Blei ruh kiba lamler ruh ha lade
hi, kumba ngi ap ïa U Blei ban pynlong ïa ngi ki khun jong u bad ngim shym la
pynlong ïa ngi ki khun jong u.
𝐿𝑆𝑇 𝑀𝑡𝑘 Hynrei ym dei tang ïa ka uba ki ud ; ngi kiba don U Mynsiem kum kiba nyngkong
na ki jingai jong U Blei ruh ki la ud hapoh jong ngi , kumba ngi ap khmih ïa U
Blei ban pynlong ïa ngi ki khun jong u bad ngin ym mad ïa ka met jong u .
𝐿𝑆𝑇 𝑀𝑠𝑏𝑝𝑒 Wat la kam dei tang ban shu thaw ïa ka khyndew tang ïa kaba ki jingud ; ma ngi
kiba don U Mynsiem kumba long kaba nyngkong eh na ki jingai jong U Blei ruh ,
ngi ud ha lade , kumba ngi ap khmih ïa U Blei ban pynlong ïa ngi ki khun jong u
bad ba ngin pyllait im ïa ngi
𝐺𝑅𝑈𝑡𝑘 Hynrei dei ban ym pyrkhat ïa kaei kaba ki myrsiang ; ngi kiba don U Mynsiem
kum ki jingai sngewbha jong U Blei baroh , kumba ngi ap ha lade , kumba ngi ap
ïa U Blei ban pynlong ïa ngi ki khun jong u bad u mudui
𝐺𝑅𝑈𝑠𝑏𝑝𝑒 Hynrei kam dei ban tang ïa kaei kaba ki apostol ; ngi kiba don U Mynsiem kum ki
jingai sngewbha jong phi ruh ki ud ha lade , kumba ngi ap ïa U Blei ban pynlong
ïa ngi bad uno u khun jong ki , u la buh ïa ka mon jong u
𝑇 𝑅𝐹𝑡𝑘 Hynrei kam long tang kaba pynlong ïa ngi kiba ud ; ngi don U Mynsiem kum ki
jingai kiba nyngkong jong ki jingai U Blei ruh ngi ud hapoh ïa lade, katba ngi
dang ap ïa U Blei ban pynlong ïa ngi ki khun jong u bad ym don ki khun jong u ,
ba ki ai ha ngi ïa ki khun jong u , kiba pyllait ïa ka jingleh jong ngi baroh
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒 Hynrei kam dei tang ka jingpynlong tang ka jingpynlong ; ngi kiba don U Mynsiem
kum ki jingai jong U Blei ruh ki ud hapoh jong ngi , katba ngi dang ap ïa U Blei
ban pynlong ïa ngi ki khun jong u bad ban ym pynlong ïa ngi ki khun jong u ba
la pyllaitluid ïa ngi baroh kawei
𝑇 𝑅𝐹𝑠𝑏𝑝𝑒_𝑡𝑘 Hynrei kam dei tang ka jingpynlong tang ka jingpynlong ; ngi kiba don U Mynsiem
kum ki jingai jong U Blei ruh ki ud hapoh jong ngi , katba ngi dang ap ïa U Blei
ban pynlong ïa ngi ki khun jong u bad ban ym pynlong ïa ngi ki khun jong u ba la
pyllaitluid ïa ngi baroh kawei
𝑇 𝐿𝑡𝑘 Hynrei kam dei tang ka jingthaw kaba ; ma ngi kiba don U Mynsiem kum ki jingai
jong U Blei ruh , ngi la hapoh jong ngi hi , katba ngi dang ap ïa U Blei ban
pynlong ïa ngi ki khun jong u bad ki bym shym la pynlong ïa ngi ki khun jong u
da ka ban pyllaitluid ïa ngi baroh ki la .
𝑇 𝐿𝑠𝑏𝑝𝑒 Hynrei kam dei tang ka hi kaba la ai ; ma ngi kiba don U Mynsiem kum ka jingai
kaba nyngkong na ki jingai jong U Blei ruh ngi roi hapoh jong ngi , katba ngi dang
ap ïa U Blei ban pynlong ïa ngi ki khun jong u bad katto katne tylli ki khun jong
u lem bad ngi la pyllait ïa ngi kumba ki long kiba laitluid

6. Conclusion

The experiments and analysis of three cutting-edge models based on the LSTM, the GRU and the transformer with different data segmentation approaches are presented in this work. We also adapted a transfer learning approach using the transformer as the backbone architecture. With these models, we explored another Austroasiatic language and used English–Vietnamese as the language pair of the parent model. We used three data segmentation approaches: untokenized, tokenized and subword BPE. The results of these models show impressive scores compared to existing NMT models for the English–Khasi language pair. The automatic, human and statistical evaluations correlate with one another on the scores of the translated texts. Our approach using the transformer based architectures outperformed the LSTM and GRU models by a significant margin in terms of quantitative and qualitative evaluation and analysis. Among all models, 𝑇𝐿𝑡𝑘 achieved the best performance in automatic evaluations for both domains, with scores on 𝑇𝑑𝑎𝑡𝑎1 and 𝑇𝑑𝑎𝑡𝑎2 of 58.1 BLEU and 17.7 BLEU respectively. The quantitative and qualitative evaluations of the transfer learning models correlate with each other and show an equally high performance for both models. During the analysis of the translated text, we find that many sentences show an occurrence of morphological gender disagreement. Gender-related issues in machine translation are broadly discussed in the paper, with prospective solutions that we can apply to future works. Morphological gender disagreement affects the accuracy of the results of our experiments both in automatic scores and in adequacy and fluency, which, if corrected, could increase the performance of all the models in the experiments.

We also analyzed word order agreement based on sentence length in four output categories. The LSTM models show more satisfactory results than the GRU models. Overall, the transformer models show more comprehensibility compared to the LSTM and GRU models. Moreover, the transfer learning models are based on the transformer as a backbone architecture, and 𝑇𝐿𝑡𝑘 achieved the highest score compared to all other models. Considering the different methods of evaluation applied to the output, the models using subword BPE have comparatively shown performance that is a cut above the rest of the models in terms of output quality. However, the parent language, Vietnamese, shows a slightly lower performance compared to the other subword BPE models. Nevertheless, choosing Vietnamese as a parent language has shown promising results with the tokenized dataset. Vietnamese, being a language from the same Austroasiatic family as Khasi, has proved to help more than the French language used in the existing work (Hujon, Singh, & Amitab, 2023) for transfer learning NMT related tasks.

CRediT authorship contribution statement

Aiusha Vellintihun Hujon: Conceived idea, Collected data, Conducted the experiment, Analyzed the experimental results, Writing – original draft. Thoudam Doren Singh: Analyzed the experimental results, Reviewed and revised the manuscript. Khwairakpam Amitab: Reviewed the manuscript.


Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bharati, A., Chaitanya, V., Kulkarni, A. P., & Sangal, R. (2003). Anusaaraka: Machine translation in stages, CoRR, cs.cl/0306130. URL http://arxiv.org/abs/cs/0306130.
Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., et al. (2014). Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation (pp. 12–58).
Centelles, J., & Costa-jussà, M. R. (2014). Chinese-to-spanish rule-based machine translation system. In Proceedings of the 3rd workshop on hybrid approaches to machine translation (pp. 82–86). Gothenburg, Sweden: Association for Computational Linguistics, http://dx.doi.org/10.3115/v1/W14-1015, URL https://aclanthology.org/W14-1015.
Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation (pp. 103–111). Doha, Qatar: Association for Computational Linguistics, http://dx.doi.org/10.3115/v1/W14-4012, URL https://aclanthology.org/W14-4012.
Dave, S., Parikh, J., & Bhattacharyya, P. (2001). Interlingua-based english-hindi machine translation and language divergence. Machine Translation, 16, 251–304. http://dx.doi.org/10.1023/A:1021902704523.
Doan, L., Nguyen, L. T., Tran, N. L., Hoang, T., & Nguyen, D. Q. (2021). PhoMT: A high-quality and large-scale benchmark dataset for Vietnamese-english machine translation. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 4495–4503). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2021.emnlp-main.369, URL https://aclanthology.org/2021.emnlp-main.369.
Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., et al. (2011). Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2), 127–144.
Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA, USA: MIT Press, http://www.deeplearningbook.org.
Hegde, A., Gashaw, I., & H.l., S. (2021). MUCS@ - machine translation for dravidian languages using stacked long short term memory. In Proceedings of the first workshop on speech and language technologies for dravidian languages (pp. 340–345). Kyiv: Association for Computational Linguistics, URL https://aclanthology.org/2021.dravidianlangtech-1.50.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, [ISSN: 0899-7667] 9(8), 1735–1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735.
Hujon, A. V., Amitab, K., & Singh, T. D. (2023). Convolutional sequence to sequence learning for english-khasi neural machine translation. In 2023 4th international conference on computing and communication systems (pp. 1–4). http://dx.doi.org/10.1109/I3CS58314.2023.10127426.
Hujon, A. V., Singh, T. D., & Amitab, K. (2023). Transfer learning based neural machine translation of english-khasi on low-resource settings. Procedia Computer Science, [ISSN: 1877-0509] 218, 1–8. http://dx.doi.org/10.1016/j.procs.2022.12.396, URL https://www.sciencedirect.com/science/article/pii/S1877050922024899, International Conference on Machine Learning and Data Engineering.
Imamura, K., & Sumita, E. (2021). NICT-2 translation system at WAT-2021: Applying a pretrained multilingual encoder-decoder model to low-resource language pairs. In Proceedings of the 8th workshop on asian translation (pp. 90–95). Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2021.wat-1.8, URL https://aclanthology.org/2021.wat-1.8.
Isozaki, H., Hirao, T., Duh, K., Sudoh, K., & Tsukada, H. (2010). Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 conference on empirical methods in natural language processing (pp. 944–952). Cambridge, MA: Association for Computational Linguistics, URL https://aclanthology.org/D10-1092.
Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. (2017). OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, system demonstrations (pp. 67–72). Vancouver, Canada: Association for Computational Linguistics, URL https://www.aclweb.org/anthology/P17-4012.
Kocmi, T., & Bojar, O. (2018). Trivial transfer learning for low-resource neural machine translation, CoRR. arXiv:1809.00357.
Koehn, P. (2009). Evaluation. In Statistical machine translation (pp. 217–246). Cambridge University Press, http://dx.doi.org/10.1017/CBO9780511815829.009.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions (pp. 177–180). Prague, Czech Republic: Association for Computational Linguistics, URL https://aclanthology.org/P07-2045.
Krishnamurthy, P. (2015). Development of telugu-tamil transfer-based machine translation system: With special reference to divergence index. In Proceedings of the 1st deep machine translation workshop (pp. 48–54). Praha, Czechia: ÚFAL MFF UK, URL https://aclanthology.org/W15-5706.
Kumar, A., Baruah, R., Mundotiya, R. K., & Singh, A. K. (2020). Transformer-based neural machine translation system for hindi – marathi: WMT20 shared task. In Proceedings of the fifth conference on machine translation (pp. 393–395). Association for Computational Linguistics, URL https://aclanthology.org/2020.wmt-1.44.
Li, Y., Jiang, J., Yangji, J., & Ma, N. (2021). Finding better subwords for tibetan neural machine translation. Transactions on Asian and Low-Resource Language Information Processing, 20, 1–11.
Life.Church/YouVersion (2021a). GNB bible YouVersion. A digital ministry of Life.Church, URL https://www.bible.com/en-GB/bible/296/GEN.1.GNB. (Accessed: March 2021).
Life.Church/YouVersion (2021b). KHASICLBSI Bible YouVersion. A digital ministry of Life.Church, URL https://www.bible.com/en-GB/bible/1865/EXO.1.KHASICLBSI. (Accessed: March 2021).
Marie, B., Kaing, H., Mon, A. M., Ding, C., Fujita, A., Utiyama, M., et al. (2019). Supervised and unsupervised machine translation for myanmar-english and khmer-english. In Proceedings of the 6th workshop on asian translation (pp. 68–75). Hong Kong, China: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D19-5206, URL https://aclanthology.org/D19-5206.
Nagaraja, K. S. (1985). Khasi: A descriptive analysis. Pune: Deccan College.
Nguyen, T. Q., & Chiang, D. (2017). Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the eighth international joint conference on natural language processing (volume 2: short papers) (pp. 296–301). Taipei, Taiwan: Asian Federation of Natural Language Processing, URL https://aclanthology.org/I17-2050.
Nguyen, Q.-P., Vo, A.-D., Shin, J.-C., Tran, P., & Ock, C.-Y. (2019). Korean-Vietnamese neural machine translation system with Korean morphological analysis and word sense disambiguation. IEEE Access, 7, 32602–32616. http://dx.doi.org/10.1109/ACCESS.2019.2902270.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 311–318). Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, http://dx.doi.org/10.3115/1073083.1073135, URL https://aclanthology.org/P02-1040.
Phan-Vu, H.-H., Nguyen, V.-N., Tran, V.-T., & Do, P.-T. (2017). Towards state-of-the-art english-Vietnamese neural machine translation. In Proceedings of the eighth international symposium on information and communication technology (pp. 120–126). New York, NY, USA: Association for Computing Machinery, ISBN: 9781450353281, http://dx.doi.org/10.1145/3155133.3155205.
Phan-Vu, H.-H., Tran, V. T., Dang, H. V., Do, P. T., et al. (2019). Neural machine translation between Vietnamese and english: an empirical study. Journal of Computer Science and Cybernetics, 35(2), 147–166.
Post, M. (2018). A call for clarity in reporting BLEU scores. In Proceedings of the third conference on machine translation: Research papers (pp. 186–191). Brussels, Belgium: Association for Computational Linguistics, URL https://www.aclweb.org/anthology/W18-6319.
Prates, M., Avelar, P., & Lamb, L. (2020). Assessing gender bias in machine translation: a case study with google translate. Neural Computing and Applications, 32, http://dx.doi.org/10.1007/s00521-019-04144-6.
Rabel-Heymann, L. (1977). Gender in khasi nouns. Mon-Khmer Studies, 6, 247–272.
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers) (pp. 1715–1725). Association for Computational Linguistics, URL https://aclanthology.org/P16-1162.
Singh, T. D., & Bandyopadhyay, S. (2010a). Manipuri-english bidirectional statistical machine translation systems using morphology and dependency relations. In Proceedings of the 4th workshop on syntax and structure in statistical translation (pp. 83–91).
Singh, T. D., & Bandyopadhyay, S. (2010b). Manipuri-english bidirectional statistical machine translation systems using morphology and dependency relations. In Proceedings of the 4th workshop on syntax and structure in statistical translation (pp. 83–91). Beijing, China: Coling 2010 Organizing Committee, URL https://aclanthology.org/W10-3811.
Singh, T. D., & Hujon, A. V. (2020). Low resource and domain specific english to khasi SMT and NMT systems. In 2020 international conference on computational performance evaluation (pp. 733–737). IEEE.
Singh, S. M., & Singh, T. D. (2020). Unsupervised neural machine translation for english and manipuri. In Proceedings of the 3rd workshop on technologies for MT of low resource languages (pp. 69–78). Suzhou, China: Association for Computational Linguistics, URL https://aclanthology.org/2020.loresmt-1.10.
Singh, S. M., & Singh, T. D. (2022a). An empirical study of low-resource neural machine translation of manipuri in multilingual settings. Neural Computing and Applications, 34, http://dx.doi.org/10.1007/s00521-022-07337-8.
Singh, S. M., & Singh, T. D. (2022b). Low resource machine translation of english-manipuri: A semi-supervised approach. Expert Systems with Applications, 209, Article 118187. http://dx.doi.org/10.1016/j.eswa.2022.118187.
Sreelekha, S., & Bhattacharyya, P. (2017). Role of morphology injection in SMT: A case study from Indian language perspective. ACM Transactions on Asian and Low-Resource Language Information Processing, [ISSN: 2375-4699] 17(1), http://dx.doi.org/10.1145/3129208.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. http://dx.doi.org/10.48550/ARXIV.1409.3215, URL https://arxiv.org/abs/1409.3215.
Thabah, N. D. J., & Purkayastha, B. S. (2021). Low resource neural machine translation from english to khasi: A transformer-based approach. In Proceedings of the international conference on computing and communication systems: I3CS 2020, NEHU, Shillong, India: vol. 170, (p. 3). Springer.
Vanmassenhove, E., Hardmeier, C., & Way, A. (2018). Getting gender right in neural machine translation. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 3003–3008). Brussels, Belgium: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D18-1334, URL https://aclanthology.org/D18-1334.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need, CoRR, abs/1706.03762. arXiv:1706.03762.
Zoph, B., Yuret, D., May, J., & Knight, K. (2016). Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 1568–1575). Austin, Texas: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D16-1163, URL https://aclanthology.org/D16-1163.
