FIRE Track on Machine Translation for Indian Languages
Saran Pandian
February 2023

1 Track description
India is a diverse country where more than 1000 languages are spoken. Indian
languages are broadly classified into three families, namely Indo-Aryan,
Dravidian, and Sino-Tibetan. Translation between these languages is therefore
needed to help unify the country. However, the amount of data available for
building Machine Translation models between Indian languages remains small,
and creating a human-translated parallel corpus is expensive and
time-consuming. The challenge, then, is to create models that can make use of
low-resource data to translate between Indian languages. A further challenge
lies in the linguistic features of Indian languages. Although Indian languages
share syntactic similarity, some of them have a complex morphological
structure, which makes them difficult to model.
In this track, we focus on translating between 12 Indian languages. We divide
the languages into categories based on the availability of resources:
Category-1 (Hindi (IIITH), Punjabi (cdac-noida), Telugu (IIITH)), Category-2
(Urdu (IIITH), Gujarati (DAIICT)), Category-3 (Kannada (Manipal), Odia
(IIIT-BH)), Category-4 (Sindhi, Dogri, Kashmiri), and Category-5 (Tamil,
Malayalam), along with English. Tamil, Malayalam, Telugu, and Kannada belong
to the Dravidian family; the other Indian languages belong to the Indo-Aryan
family.
We evaluate the systems using automated string-based metrics such as BLEU,
TER, and CHRF2. This proposal encourages participants to do research in
low-resource Machine Translation for Indian languages.
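As an illustration of the string-based metrics mentioned above, the following
is a minimal sketch of a sentence-level character n-gram F-score in the spirit
of CHRF2 (beta = 2). It is a simplified approximation written for this
document, not the official implementation, so its scores may differ from those
of standard tools. The function names char_ngrams and chrf are illustrative
assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: average n-gram precision and recall
    over orders 1..max_n, combined into an F-beta score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total = sum(hyp.values())
        ref_total = sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 or r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(round(chrf("the cat sat", "the cat sat down"), 4))
```

With beta = 2, recall is weighted more heavily than precision. Note that
Python strings are sequences of Unicode code points, so in Indic scripts a
consonant and its dependent vowel sign are counted as separate characters in
this sketch.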
The participants will be provided with training, development, and test
datasets.
Task A: This is a Machine Translation task for the general domain. The
participants should download the training data and build a Machine Translation
system that can translate from language A to language B. They should submit
their results to the leaderboard.

Task B: In this task, the participants can use the small amount of parallel
data or monolingual data available in a specific domain, or any other feasible
technique, to build MT systems that can translate from one language to another
for that domain. The domains in this task are healthcare and governance.

2 Broad categories of the track


This track belongs to the domain of Natural Language Processing (NLP).

3 Use case/s
In a multilingual country like India, there is a need to bridge the diversity
of languages with automated translation models. For example, the proceedings
of parliament and the courts, national news, etc. can be translated from Hindi
and English into regional languages automatically rather than by a human
translator. Moreover, this can ensure that sources of knowledge are available
in all regional languages rather than only in a high-resource language such as
Hindi or English.
The Workshop on Machine Translation (WMT) has been conducting shared tasks
for many languages around the world, including European and Asian languages.
However, the inclusion of Indian languages in these shared tasks is limited to
a few languages such as Hindi, Gujarati, and Tamil. Many Indian languages,
such as Kashmiri, Dogri, and Sindhi, are not represented in WMT. This makes it
necessary to focus on research for translating between Indian languages and to
other Asian and European languages. We would like to encourage researchers to
focus on building models that can translate between Indian languages and other
Asian languages.
Moreover, Indian languages, especially Dravidian languages, are
morphologically rich, which makes it difficult to translate into them from
other languages. So it becomes necessary to apply linguistic

4 Data
For this track, we use the parallel data available in the Samanantar corpus
for different language pairs. The languages available include Hindi (Hi),
Punjabi (Pa), Odia (Or), Telugu (Te), Kannada (Kn), Gujarati (Gu), Sindhi (Si),
Dogri (Do), Kashmiri, Urdu, Tamil (Ta), Malayalam (Ml), and English (En).

4.1 Data in general domain


• Gu-Hi - 24,65,286
• Gu-Kn - 20,52,877
• Gu-Or - 5,29,167
• Gu-Pa - 11,34,881
• Gu-Te - 23,02,371
• Hi-Kn - 21,48,233
• Hi-Or - 6,58,779
• Hi-Pa - 16,36,898
• Hi-Te - 24,34,229
• Kn-Or - 5,32,914
• Kn-Pa - 11,23,547
• Kn-Te - 27,96,983
• Or-Pa - 5,07,150
• Or-Te - 11,13,939
• Pa-Te - 17,56,072
• Kn-Ta - 24,98,652
• Kn-Ml - 28,69,136
• Hi-Ta - 25,01,092
• Hi-Ml - 27,48,006
• Gu-Ta - 20,54,266
• Ml-Ta - 25,84,436
• Or-Ta - 10,75,921
• Pa-Ta - 17,47,278
• Ta-Te - 25,99,404
• Gu-Ml - 23,49,091
• Ml-Or - 5,58,134
• Ml-Pa - 11,22,613
• Ml-Te - 26,71,707
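
The counts in the list above use Indian digit grouping (for example,
24,65,286 is 2,465,286). The following minimal sketch shows one way to read
such entries into plain integers so that language pairs can be compared by
size. The helper names parse_indian_count and parse_pair_line are
illustrative assumptions, and only three entries from the list are used as
sample input.

```python
def parse_indian_count(text: str) -> int:
    """Convert a count written with Indian digit grouping to an int."""
    return int(text.replace(",", ""))


def parse_pair_line(line: str) -> tuple[str, str, int]:
    """Parse a list entry such as '• Gu-Hi - 24,65,286'.

    Returns (first language code, second language code, count), with the
    language codes lower-cased so that 'Gu-Hi' and 'gu-hi' compare equal.
    """
    body = line.lstrip("• ").strip()
    pair, count = body.split(" - ")
    first, second = pair.lower().split("-")
    return first, second, parse_indian_count(count)


if __name__ == "__main__":
    # Sample entries copied from the list above.
    sample = [
        "• Gu-Hi - 24,65,286",
        "• Gu-Or - 5,29,167",
        "• kn-ml - 28,69,136",
    ]
    pairs = [parse_pair_line(line) for line in sample]
    # Print the sample pairs from largest to smallest.
    for first, second, count in sorted(pairs, key=lambda p: p[2], reverse=True):
        print(f"{first}-{second}: {count}")
```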

5 Evaluation plan
The performance of an MT system will be measured using the BLEU score, which
is considered the standard baseline metric for evaluating machine translation
systems. Other metrics include TER and METEOR. Participants can evaluate their
systems using the Moses script
(https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl).
Initially, temporary test data will be given to participants to evaluate their
models. The participants are allowed to upload their results on this test data
and view their performance on a leaderboard.
Later, the participants should submit the predicted sentences for a separate
golden test dataset in a file named test-pred.(lang-code).
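
To make the BLEU metric described above concrete, the following is a minimal
sketch of corpus-level BLEU with a single reference per sentence, clipped
n-gram precision up to 4-grams, and a brevity penalty. It assumes plain
whitespace tokenization and is meant only as an illustration; it is not a
substitute for the Moses script, and its scores may differ from those on the
leaderboard. The function names ngrams and corpus_bleu are illustrative
assumptions.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: list[str], references: list[str], max_n: int = 4) -> float:
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize outputs shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat", "there is a book on the table"]
    refs = ["the cat sat on the mat", "a book is on the table"]
    print(round(corpus_bleu(hyps, refs), 2))
```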

6 Timeline
• 8th May – track proposals due
• 12th May – track acceptance notification due
• 25th May – open track websites and training data release
• 10th July – test data release
• 1st August – run submission deadline
• 15th August – results declared
• 15th September – working notes due
• 15th October – camera-ready copies of working notes and overview paper due
• FIRE 2023 – tentatively in December 2023
