
Neural Machine Translation for English-Tamil

Himanshu Choudhary Aditya Kumar Pathak


DTU-Delhi IIIT-Bhubaneswar
himanshu.dce12@gmail.com adityapathak.cse@gmail.com

Rajiv Ratn Shah Ponnurangam Kumaraguru


IIIT-Delhi IIIT-Delhi
rajivratn@iiitd.ac.in pk@iiitd.ac.in

Abstract
Language is the expression of ideas through speech sounds combined into words, and words combined into sentences, this combination answering to that of ideas into thoughts. English is an international language, and for that reason a large share of the world's resources, whether hardcopy or softcopy such as books, audio, images, videos, and documents, is available in English. Most people who are not English speakers need translation help to access these resources in their local languages. While this can be done manually (human translation), it is a tedious, time-consuming, and expensive process. Machine translation (MT) is automatic translation from one language to another: the process of using artificial intelligence (AI) to automatically translate content from one language (the source) to another (the target) without any human input. One of the most widely used recent approaches is neural machine translation (NMT). In this paper, we apply NMT to the English-Tamil language pair, using word embedding along with Byte Pair Encoding (BPE) to develop an efficient translation system that overcomes the OOV (Out Of Vocabulary) problem for languages that do not have many translations available online. We evaluate system performance with the BLEU score. Experimental results confirm that our proposed MIDAS translator (8.33 BLEU score) outperforms Google translator (3.75 BLEU score).
Introduction
Large countries like India and China have multiple languages that vary by location. India, for example, has 23 officially recognized languages (such as Hindi, Tamil, and Punjabi) as well as hundreds of unregistered regional languages. Although India has a population of almost 1.3 billion people, only about 10% of them speak English. According to some surveys, just 2% of these 10% can speak, write, and read English fluently, while the remaining 8% can only grasp basic English and speak broken English with a wide range of accents (sta). Given that a substantial quantity of useful information is available on the internet in English, and since the majority of Indians do not speak the language fluently, it is critical to translate such content into local languages to assist people.
People must share information not only for business goals, but also to express their sentiments, opinions,
and actions. To this end, translation plays a critical role in bridging the gap in communication between
people. It is not possible to translate the content manually due to the sheer volume of data. As a result, it is critical to automatically translate content from one language (say, English) to another (say, Tamil); this procedure is known as machine translation.
Machine translation for Indian languages is fraught with difficulties. Two key obstacles are:
I. the limited size of parallel corpora, and
II. differences among languages, particularly morphological richness and word-order variances related to syntactical divergence.
Both of these issues plague Indian languages (IL), particularly when translating from English. For English and Indian languages, only a few parallel corpora exist. Furthermore, Indian languages like Tamil differ from English in terms of both word order and morphological complexity. For example, English uses the Subject-Verb-Object (SVO) structure, while Tamil uses the Subject-Object-Verb (SOV) structure. Furthermore, English is a fusional language, whereas Tamil is an agglutinative one. While morphological variations lead to data sparsity, syntactic differences add to the challenges of the translation model. In this study, we aim to address both difficulties.
While a lot of work is being done on machine translation for both foreign and Indian languages, most of the work on Indian languages is limited to traditional machine translation techniques. Techniques like word embedding and byte pair encoding (BPE) are not used on many Indian languages, despite the dramatic improvements in natural language processing. We therefore use a neural machine translation technique (a torch implementation) in conjunction with word embedding and BPE in this article. In particular, we concentrate on the English-Tamil language pair because it is one of the most difficult to translate (Ramasamy et al., 2012) due to Tamil's morphological richness. We use data from EnTamV2.0 and Opus to evaluate our results, and we employ the BLEU evaluation metric. On the Tamil language, the experimental results show that we achieved substantially better outcomes than traditional machine translation techniques. Our study, we believe, can be extended to other Indian language pairs as well.
The following are the main contributions of our work:
• This is the first study to use BPE with word embedding on an Indian language pair (English-Tamil) using the NMT technique.
• We achieve comparable accuracy with a simpler model in less time than with a deep and complicated neural network, which takes a long time to train.
• We demonstrate why and how data preprocessing matters in neural machine translation.
• Our model outperforms Google Translator by 4.58 BLEU points.
The remainder of the paper is laid out as follows. Sections 2 and 3 discuss related work and our MIDAS
translator's methodology, respectively. Section 4 contains the results of the evaluation. Section 5 brings
the paper to a close.

Methodology
We describe a neural machine translation technique that uses word embedding and Byte-Pair-
Encoding (BPE) to construct an efficient translation system called MIDAS translator that solves
the OOV (Out Of Vocabulary) problem for languages with limited online translations. To begin,
we'll go through neural machine translation, attention models, word embedding, and Byte Pair
Encoding. The framework of our MIDAS translator is then presented.
Neural Machine Translation Overview
The technique of neural machine translation is based on neural networks and the conditional probability of the translated sentence given the source sentence (Revanuru et al., 2017). In the following subsections, we give an overview of the sequence-to-sequence architecture and the attention model employed in our proposed MIDAS translator.
Sequence-to-Sequence Architecture
In machine translation models, sequence-to-sequence architecture is used to determine the relationship between two different language pairs, whereas in response generation it is used to model the relationship between an input utterance and its response. It is made up of two components: an encoder and a decoder. The encoder receives the source sentence, while the decoder creates the output using the encoding vector and previously generated words. Assume sentence A is the source sentence and sentence B is the target sentence. The encoder turns the source sentence a1, a2, a3, ..., am into fixed-dimensional vectors, which the decoder decodes word by word using conditional probability. In the equations below, the fixed-size encoded vectors are A1, A2, ..., AM. Eq. 1 is expanded into Eq. 2 using the chain rule.

P(B \mid A) = P(B \mid A_1, A_2, A_3, \ldots, A_M)    (1)

P(B \mid A) = \prod_{i} P(b_i \mid b_0, b_1, b_2, \ldots, b_{i-1}; A_1, A_2, A_3, \ldots, A_M)    (2)

In Eq. 2, the next word is predicted during decoding using the previously predicted word vectors and the source sentence vectors. Each term in the distribution is represented by a softmax over all the words in the vocabulary.
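To make this factorization concrete, below is a minimal PyTorch sketch of such an encoder-decoder (layer sizes and names are illustrative assumptions, not the exact MIDAS implementation):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch of Eqs. 1-2: the encoder maps the
    source a_1..a_m to fixed-size vectors, and the decoder factorizes
    P(B|A) into per-word softmax distributions."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=500, hid_dim=500):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)  # logits over target vocab

    def forward(self, src, tgt):
        # Encode the source sentence; keep only the final (h, c) state.
        _, state = self.encoder(self.src_emb(src))
        # Decode conditioned on the encoder state and the previously
        # generated words (teacher forcing at training time).
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)  # softmax over these logits gives P(b_i | ...)
```

At inference time, the decoder would instead consume its own previous prediction at each step rather than the gold target words.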
Attention Model
In a basic encoder-decoder architecture, the encoder reads the entire sentence, memorizes it, and stores it in the final activation layer, after which the decoder network produces the target translation. This structure performs admirably.

Figure 1: Seq2Seq architecture for English-Tamil

We may achieve a very good BLEU score for short sentences, but performance worsens for really long sentences, perhaps longer than 30 or 40 words. A solution is to use an attention mechanism in conjunction with the basic encoder-decoder architecture. It translates sentences much as humans do, by glancing at parts of the sentence at a time. The mechanism determines how much attention should be devoted to a specific word during the translation process. Figure 2 depicts the mechanism. From the inputs A1, A2, A3, ..., At, the encoder generates the attention vectors h1, h2, h3, ..., ht. The context vector Ci is then calculated for each output time step by combining these vectors. The decoder then generates the softmax output Bi using the context vector Ci, the hidden state Si, and the previously predicted words.
Figure 2: Attention model
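To illustrate how the context vector Ci is computed from the encoder vectors h1, ..., ht, here is a minimal sketch of additive (Bahdanau-style) attention in PyTorch; the module and parameter names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each encoder vector h_t against the decoder state s_i and
    returns the weighted sum as the context vector C_i."""
    def __init__(self, hid_dim):
        super().__init__()
        self.W_h = nn.Linear(hid_dim, hid_dim, bias=False)  # projects h_t
        self.W_s = nn.Linear(hid_dim, hid_dim, bias=False)  # projects s_i
        self.v = nn.Linear(hid_dim, 1, bias=False)          # scoring vector

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: (batch, src_len, hid); dec_state: (batch, hid)
        scores = self.v(torch.tanh(
            self.W_h(enc_outputs) + self.W_s(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)        # attention per source word
        context = (weights * enc_outputs).sum(dim=1)  # context vector C_i
        return context, weights
```

The attention weights make explicit how much each source word contributes to the word being generated at the current time step.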

Word Embedding
Word embedding is a method of encoding words in a vector space in which the vector representations of words with related meanings are similar. Each word in the vocabulary is represented by a vector with hundreds of dimensions. Normally, pre-trained word embeddings are employed, and words from the vocabulary are mapped to vectors via transfer learning (Cho et al., 2014). In our model, fastText word vectors were employed to map English and Tamil words into 300-dimensional vectors. We obtained a BLEU score of 6.74 after training the model with the same layers, optimization approach, attention, and regularization.
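As a sketch of how such pre-trained vectors can be loaded, the snippet below fills an embedding matrix from a fastText .vec file; the file path, vocabulary mapping, and random fallback for missing words are assumptions for illustration:

```python
import numpy as np
import torch
import torch.nn as nn

def load_fasttext_embeddings(vec_path, word2idx, dim=300):
    """Build an nn.Embedding from a fastText .vec file; words absent from
    the pre-trained file keep a small random vector."""
    matrix = np.random.uniform(-0.1, 0.1, (len(word2idx), dim)).astype("float32")
    with open(vec_path, encoding="utf-8") as f:
        next(f)  # skip the "<count> <dim>" header line of .vec files
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in word2idx and len(values) == dim:
                matrix[word2idx[word]] = np.asarray(values, dtype="float32")
    return nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)
```

Setting freeze=False lets the pre-trained vectors be fine-tuned along with the rest of the translation model.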
Byte Pair Encoding
BPE (Gage, 1994) is a straightforward data compression method. It substitutes a single unused byte for the most frequently occurring pair of bytes in a sequence. This algorithm is used to segment words: we merge characters or character sequences by merging frequent pairs of bytes (Sennrich et al., 2015). NMT symbols can then be interpreted as sub-words, and the network can use these sub-word units to translate and create new words. We learned independent encodings with vocabulary sizes of 10,000 and 25,000 on our source and target training data, respectively, and then applied them to the training, test, and validation data for both source and target. BPE assisted in compound splitting and in suffix and prefix separation, which are utilized to create new Tamil words. We employed BPE in conjunction with word embedding and experimented with several models.
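The merge procedure itself is simple enough to sketch. The toy learner below follows the algorithm of Sennrich et al. (2015): it repeatedly finds the most frequent adjacent symbol pair and merges it into a new sub-word unit (a simplified illustration, not the exact tooling used for our experiments):

```python
import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a dict mapping each word to its corpus
    frequency; returns the ordered list of merge operations."""
    # Represent each word as space-separated symbols plus an end marker.
    vocab = {" ".join(w) + " </w>": c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): c for w, c in vocab.items()}
    return merges
```

For example, learn_bpe({"low": 5, "lower": 2, "newest": 6}, 10) merges ('w', 'e') first, and successive merges gradually build longer sub-word units from which rare and unseen words can be composed.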
MIDAS Translator
We experimented with numerous models to better understand how parameter adjustment and various strategies affect the Indian language pair. Our first model architecture consists of a 500-dimensional bi-directional LSTM encoder and decoder with a vocabulary size of 50,004 words for both the source and target. To begin, we used the SGD optimization method, Luong attention, and a dropout (regularization) of 0.3 with a learning rate of 1.0. Second, with a learning rate of 0.001, we changed the optimization method to Adam and the attention to Bahdanau. We achieved the best results with a BPE vocabulary size of 25,000, a 2-layer bi-directional encoder-decoder, Adam optimization with a learning rate of 0.001, Bahdanau attention, and word embedding with a dimension of 500. For training the various models, we employed a GPU (Nvidia GeForce GTX 1080), which increased the computing speed; training on this GPU took about 5 hours.
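For reference, the best-performing configuration described above could be set up roughly as follows in PyTorch; this is a hedged sketch of the stated hyperparameters (2-layer bi-directional LSTM, 500-dimensional embeddings, dropout 0.3, Adam at a learning rate of 0.001), not the authors' actual training script:

```python
import torch
import torch.nn as nn

EMB_DIM, HID_DIM = 500, 500  # embedding and hidden sizes from the paper

# Bi-directional encoder: each direction gets half the hidden size so the
# concatenated output matches the decoder's hidden size.
encoder = nn.LSTM(EMB_DIM, HID_DIM // 2, num_layers=2,
                  bidirectional=True, dropout=0.3, batch_first=True)
decoder = nn.LSTM(EMB_DIM, HID_DIM, num_layers=2,
                  dropout=0.3, batch_first=True)

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)  # replaces SGD at lr=1.0
```

The earlier, weaker configurations differ only in the optimizer (SGD, learning rate 1.0) and the attention variant (Luong rather than Bahdanau).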

Conclusion & Future Work


We used NMT to tackle one of the most complex language pairs in this research (English-Tamil). On Indian languages, we found that NMT with pre-trained word embedding and Byte Pair Encoding outperforms complicated translation approaches. Our model outperformed Google translator by a margin of 4.58 BLEU points. Because we achieved fairly good accuracy, our technique can be used to create English-Tamil translation applications that will be valuable in fields such as tourism and education. Furthermore, we can investigate the feasibility of employing the aforementioned methodologies for other English-Indian language translations. In the future, we would also like to use machine translation to detect offensive language in code-switched text (Mathur et al., 2018).

References
What percentage of people in India speak English? https://tinyurl.com/indianlanguageStats. Accessed: 2018-08-27.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to
align and translate. arXiv preprint arXiv:1409.0473.
Andrew Donald Booth. 1955. Machine translation of languages, fourteen essays.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23–38.
Siddhartha Ghosh, Sujata Thamke, et al. 2014. Translation of telugu-marathi and vice-versa using rule based
machine translation. arXiv preprint arXiv:1406.3969.
Krupakar Hans and RS Milton. 2016. Improving the performance of neural machine translation involving
morphologically rich languages. arXiv preprint arXiv:1612.02482.
Nadeem Jadoon Khan, Waqas Anwar, and Nadir Durrani. 2017. Machine translation approaches and survey for
indian languages. arXiv preprint arXiv:1701.04290.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Puneet Mathur, Rajiv Shah, Ramit Sawhney, and Debanjan Mahata. 2018. Detecting offensive tweets in hindi-english code-switched language. In Proceedings of the International Workshop on Natural Language Processing for Social Media, pages 18–26.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Raj Nath Patel, Prakash B Pimpale, et al. 2017. Mtil17: English to indian language statistical machine translation. arXiv preprint arXiv:1708.07950.
Amarnath Pathak and Partha Pakray. Neural machine translation for indian languages. Journal of Intelligent
Systems.
Sree Harsha Ramesh and Krishna Prasad Sankaranarayanan. 2018. Neural machine translation for low resource
languages using bilingual lexicon induced from comparable corpora. In Proceedings of the 2018 Conference of
the North American Chapter of the Association for Computational Linguistics: Student Research Workshop,
pages 112–119.
Karthik Revanuru, Kaushik Turlapaty, and Shrisha Rao. 2017. Neural machine translation of indian languages. In Proceedings of the 10th Annual ACM India Compute Conference, pages 11–20. ACM.
Pramod Salunkhe, Aniket D Kadam, Shashank Joshi, Shuhas Patil, Devendrasingh Thakore, and Shrikant Jadhav.
2016. Hybrid machine translation for english to marathi: A research evaluation in machine translation:(hybrid
translator). In Electrical, Electronics, and Optimization Techniques (ICEEOT), International Conference on,
pages 924–931. IEEE.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword
units. arXiv preprint arXiv:1508.07909.
Reshef Shilon. 2011. Transfer-based Machine Translation between morphologically-rich and resource-poor languages: The case of Hebrew and Arabic. Ph.D. thesis, Citeseer.
Harold Somers. 2003. An overview of ebmt. In Recent advances in example-based machine translation, pages 3–57. Springer.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages
5998–6008.
Fai Wong, Mingchui Dong, and Dongcheng Hu. 2006. Machine translation using constraint-based synchronous grammar. Tsinghua Science and Technology, 11(3):295–306.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Loganathan Ramasamy, Ondřej Bojar, and Zdeněk Žabokrtský. 2012. Morphological processing for english-tamil statistical machine translation. In 24th International Conference on Computational Linguistics, page 113.
