
English To Luganda Paper

KISEJJERE RASHID
MAKERERE UNIVERSITY
2100711543

21/U/11543/EVE
rashidkisejjere0784@gmail.com

Abstract—This paper presents English to Luganda translation using different machine learning classification models and deep learning. There are many machine translation systems available today, but there is no public English to Luganda translation paper showing how the translation process occurs. Luganda is a very common language and the mother tongue of Uganda. It has a very large vocabulary, which means that working on it requires a very large dataset. In this paper, I go through different machine-learning models that can be implemented to translate a given English text into Luganda. Because Luganda is a new field in translation, I experiment with multiple machine learning classification models such as SVMs and logistic regression, and then with deep learning models such as RNNs and LSTMs, incorporating the advanced mechanisms of attention and Transformers, plus the additional technique of transfer learning.

Keywords—Artificial Intelligence, machine translation, classification task, RNNs, transfer learning, LSTMs, attention

I. INTRODUCTION

Machine translation is a field that was, and still is, under research, and so far there are multiple machine translation approaches that researchers have come up with. These techniques mainly include Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT). A detailed explanation of these approaches is given in the next chapter. Machine translation is one of the major subcategories of NLP, as it involves a proper understanding of two different languages. This is always challenging, as languages tend to have very large vocabularies, so a lot of computing resources are needed for a machine translation system to be as accurate as possible. Also, the data used in the process has to be very accurate, and data quality in turn affects the accuracy of these models, so coming up with a very accurate model is tricky. Throughout this paper, I explain how I built a number of translation models using different machine learning strategies.

II. BACKGROUND AND MOTIVATION

The background of machine translation comes mainly from three translation processes, i.e. Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT).

A. Rule-Based Machine Translation (RBMT)
Rule-based machine translation (RBMT), as the name states, is mainly about researchers coming up with different rules that a text in a given language can follow to produce its respective translation. It is the oldest machine translation technique and was already in use in the 1970s.

B. Statistical Machine Translation (SMT)
Statistical machine translation (SMT) is an older translation technique that uses a statistical model to create a representation of the relationships between the sentences, words, and phrases in a given text. This model is then applied to a second language to convert these elements into the new language. One of the main advantages of this technique is that it can improve on rule-based MT, although it shares some of the same problems.

C. Neural Machine Translation (NMT)
Neural machine translation was developed using deep learning techniques. This method is faster and more accurate than the other methods, and neural MT is rapidly becoming the standard in MT engine development.

The translation of one language into a different language is a machine classification problem. This type of problem can predict only values within a known domain. The domain in this case is either the set of characters that make up the vocabulary of a given language or the set of words in that vocabulary, so the classification model can be either word-based or character-based. I elaborate more on this issue in the next topics.

III. LITERATURE REVIEW

Translation is a crucial aspect of communication for individuals who speak different languages. With the advent of Artificial Intelligence (AI), translation has become more efficient and accurate, making it possible to communicate with individuals in other languages in real time. There are basically two major learning techniques that can be used.

Supervised learning is a type of machine learning where the model is trained on a labeled dataset and makes predictions based on the input data and the labeled output. Supervised learning algorithms have been used to train AI-powered English to Luganda translation systems. The model is trained on a large corpus of bilingual text data, which helps it learn the relationships between English and Luganda words and phrases. This allows the model to make predictions about the Luganda translation of an English text based on the input
data. This is the most common type of machine learning, and it involves the well-known deep neural networks.

Unsupervised learning is a type of machine learning where the model is not trained on labeled data but instead learns from the input data alone. Unsupervised learning algorithms can also be used to develop AI-powered English to Luganda translation systems. The model uses techniques such as clustering and dimensionality reduction to learn the relationships between English and Luganda words and phrases. This allows the model to make predictions about the Luganda translation of an English text based on the input data.

In conclusion, AI-powered English to Luganda translation has the potential to greatly improve the speed and accuracy of translations.

A. Research Gaps
Below are some of the major research gaps in the field of machine translation.
 Limited Training Data: The quality of AI-powered translations is heavily dependent on the amount and quality of training data used to train the model. Further research is needed to explore methods for obtaining high-quality training data.
 Lack of Cultural Sensitivity: AI-powered translation systems can produce translations that are grammatically correct but lack the cultural sensitivity of human translations. This can result in translations that are culturally inappropriate or that do not accurately convey the original message.
 Vulnerability to Errors: a machine learning system can only understand what it has been trained on, so in cases where the input is not similar to the data it was trained on, the AI can easily produce undesired results.

B. Contributions of this paper
a) One of the major aims of this paper is to lay a foundation for further and much more detailed research into the translation of large-vocabulary languages like Luganda, by showing the different machine learning techniques that can be used to achieve this.

IV. METHODOLOGY
The problem investigated in this project is to develop an AI-powered English to Luganda translation system. The significance of this problem lies in the growing demand for high-quality and culturally sensitive translations, particularly in commerce and communication between English- and Luganda-speaking communities.

The scope of the project is to develop an AI system that is capable of accurately translating English text into Luganda text, while also preserving the meaning and cultural context of the original text.

To address this problem, the proposed AI approach is to develop a neural machine translation (NMT) model. The NMT model will be trained on an English-Luganda parallel corpus and will use this data to learn the relationship between the two languages. The AI process can be summarized as follows:

Data Collection: Collect a large corpus of parallel text data in English and Luganda.

Pre-processing: Pre-process the data to remove irrelevant information and standardize the text.

Model Selection: Choose the neural machine translation model that is best suited for the problem.

Model Training: Train the NMT model on the pre-processed data.

Model Evaluation: Evaluate the trained model on a held-out set of data to determine its performance.

Deployment: Deploy the trained model for use in a real-world setting.

Continuous Improvement: Continuously evaluate the performance of the model and make improvements as needed.

The AI evaluation framework used in this project consists mainly of accuracy metrics. Accuracy is a measure of how well the model translates a given text correctly.

In conclusion, the proposed AI approach for this project is to develop a neural machine translation model that can accurately translate English text into Luganda text while preserving the meaning and cultural context of the original text.

A. Dataset Description
The dataset I used was created by Makerere University, and it contains approximately 15k English sentences with their respective Luganda translations. Below are the factors considered in choosing this dataset.
1) Scarcity of Luganda datasets. Luganda is not a widely used language worldwide; it is mainly used in Uganda, so this was the only major dataset I could find.
2) Cost. The dataset is available for free for anyone to use and edit.
3) Accuracy. The accuracy of the dataset is reasonably good, which makes it the best available option.
4) Size. The dataset is relatively large and diverse enough to create a very good model from.

B. Data Preparation and Exploratory Data Analysis
Data preparation refers to the steps taken to turn raw data into improved data that can be used to train a machine learning model. The data preparation process for my model was as follows:
a. Removal of any punctuation plus any unnecessary spaces. This is necessary to prevent the model from training on a large amount of unnecessary data.
b. Converting the case of words in the dataset to lowercase. Since Python is case-sensitive, a word like “Hello” is different from “hello”; to avoid this problem I had to normalize the case.
c. Vectorization of the dataset. Vectorization is the process of converting a given text into numerical indices. This is necessary because the machine learning pipeline can only be trained on numerical data.
d. Removal of null values. All rows that contained null data had to be dropped, because for textual data it is very difficult to estimate the value of a missing entry.
Those were the data preparation processes I used in the model creation process.
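Steps a-d above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code; the (English, Luganda) rows and the word-level vectorization scheme are hypothetical stand-ins for the real dataset.

```python
import re

# Hypothetical (English, Luganda) rows; None marks a missing translation.
pairs = [
    ("Hello, how are you?", "Oli otya?"),
    ("Good   morning!", None),
    ("I am fine.", "Gyendi."),
]

def clean(text):
    """Steps a and b: lowercase, strip punctuation, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Step d: drop rows with null values, then clean both sides.
pairs = [(clean(en), clean(lg))
         for en, lg in pairs
         if en is not None and lg is not None]

# Step c: word-level vectorization -- map every word to a numerical index.
vocab = {}
for en, _ in pairs:
    for word in en.split():
        vocab.setdefault(word, len(vocab) + 1)  # index 0 is reserved for padding

def vectorize(sentence):
    """Convert a cleaned sentence into its list of vocabulary indices."""
    return [vocab[w] for w in sentence.split()]

print(vectorize(clean("Hello, how are you?")))
```

A character-based model would build `vocab` over individual characters instead of words; the rest of the pipeline stays the same.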
Exploratory data analysis is the process of performing initial investigations on data to discover anomalies and patterns. It is mainly carried out through visualization of the data. Below are the visualizations and their meanings.

e. Word Cloud
A word cloud is a graphical representation of the words that are used most frequently in the dataset. This is important as it shows that the model will highly depend on those particular words.
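The frequency counts behind such a word cloud can be computed in a few lines; the sketch below uses made-up sentences in place of the real corpus (a plotting library such as `wordcloud` would then size each word by its count).

```python
from collections import Counter

# A few made-up English sentences standing in for the dataset.
sentences = [
    "the boy went to the market",
    "the girl went home",
    "the market was closed",
]

# Count how often each word appears across the whole corpus.
counts = Counter(word for s in sentences for word in s.split())

# The most common words dominate the word cloud (and the model's training signal).
print(counts.most_common(3))
```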

f. Sentence Length Plots
Through these plots, we are able to determine what length all the sentences of the dataset should be padded to, because during the training process they are all supposed to be of the same length. These figures show the maximum sentence lengths for the English and the Luganda sentences respectively.

For the Luganda sentences
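Padding every vectorized sentence to a common length can be sketched as follows. This is plain Python with hypothetical index sequences; in practice a library utility such as Keras' `pad_sequences` does the same job.

```python
# Vectorized sentences of different lengths (hypothetical index sequences).
sequences = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]

def pad(seqs, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

padded = pad(sequences)
print(padded)  # every row now has length 4
```

Using the maximum observed length (read off the plots above) wastes the least information; truncating to a shorter length trades information for memory and speed.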

g. Correlation Matrix
This is a matrix showing the correlation of the different values to each other. Plotting a 2D correlation matrix for the entire dataset is practically impossible, but what is possible is the plot for a particular sentence. The matrix below shows the correlation for a given sentence. Here the model will have to pay a lot of attention to the words that are highly correlated to each other.

A correlation matrix for one of the sentences

h. Box Plot
A box plot is a visual representation that can be used to show the major outliers in the dataset. Plotting a box plot for the entire dataset is also practically impossible, but what is possible is plotting the box plot for a particular sentence. This shows the possible outliers in the sentence, so during the training process the model ends up not paying a lot of attention to those particular words.

Box plot for one of the sentences in the dataset

In conclusion, data preparation and exploratory data analysis are key steps in the creation of a very accurate model.

C. AI Model Selection and Optimization
Throughout the project, I created three models: one with recurrent neural networks, another with the attention mechanism, and finally one using transfer learning on a pre-trained Hugging Face transformer model.

The recurrent neural network model was a simple model that uses RNNs to perform the translation. Its accuracy was very poor because the vocabularies of the two languages are very large; plain RNNs of this type are best suited for small vocabularies.

The attention mechanism model performed much better than the RNN model. Attention is a mechanism used in deep neural networks whereby the model can focus on only the important parts of a given text by assigning them higher weights.

The third model used transformers. Transformers are deep learning models built on top of attention layers, which makes them much more efficient for NLP tasks.

D. AI Model Accountability
In the context of AI, “accountability” refers to the expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate, or deploy, following their roles and the applicable regulatory frameworks, and that they will demonstrate this through their actions and decision-making processes (for example, by providing documentation on key decisions throughout the AI system lifecycle, or by conducting or allowing auditing where justified).

AI accountability is very important because it is a means of safeguarding against unintended uses. Most AI systems are designed for a specific use case; using them for a different use case would produce incorrect results. I apply accountability to my model by recognizing that it depends mainly on its dataset. Hence, it is best to make sure that the quality of the dataset is constantly improved and filtered, because even slight modifications in the spelling of words will decrease the model’s accuracy.

RESULTS AND DISCUSSION

I split the data into a training set and a validation set; below are the results. The training accuracy is 92% and the validation accuracy is 52%.

Validation and Accuracy plot
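The train/validation split and the accuracy metric behind these numbers can be sketched in plain Python. This is not the project's training code: the "model" below deliberately memorizes its training labels, a toy construction whose only purpose is to make a train/validation accuracy gap (overfitting) visible.

```python
import random

random.seed(0)

# Hypothetical dataset of 100 (input, label) pairs.
data = [(i, i % 7) for i in range(100)]

# Hold out 20% of the data as a validation set.
random.shuffle(data)
split = int(0.8 * len(data))
train, val = data[:split], data[split:]

def accuracy(examples, predict):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

# A toy "model" that memorizes the training labels (overfits by construction).
memory = dict(train)
predict = lambda x: memory.get(x, 0)

print(accuracy(train, predict), accuracy(val, predict))
```

The memorizing model scores perfectly on its training set but poorly on held-out data, which is the same qualitative pattern as the 92% vs. 52% gap reported above.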
LINK to the YouTube Video -
https://youtu.be/RLXfM0iLQag
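The attention weights that such plots visualize come from a scaled dot-product between query and key vectors. A small NumPy sketch, with illustrative (hypothetical) dimensions rather than the trained model's:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q k^T / sqrt(d)) v: each output row is a
    weighted average of the value rows."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))  # 3 decoder positions ("Luganda tokens")
k = rng.normal(size=(5, 4))  # 5 encoder positions ("English tokens")
v = rng.normal(size=(5, 4))

out, weights = scaled_dot_product_attention(q, k, v)
# `weights` is what an attention plot shows: how strongly each output
# token attends to each input token; every row sums to 1.
```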

It is clear that the model is overfitting the dataset, but its accuracy is still fairly good.

Attention plot

An attention plot is a figure showing how the model was able to predict the given output. Words that were predicted with a very high probability are coloured more strongly.

CONCLUSION AND FUTURE WORKS

I hope this paper gives a basic understanding of the different machine learning methods that can be used to create a deep learning model capable of translating a given English text into Luganda. The same idea can be used to translate other languages.

The model currently overfits the dataset. One way to overcome this is to increase the size of the data, because the dataset contains only about 15k sentences. For the model to become much more accurate, increasing the dataset to about a million sentences would tremendously improve its accuracy.

Another direction is the usage of other machine learning techniques like transformers. The model illustrated above was based on the attention mechanism of neural networks; using transformers would improve the quality of the model even more. However, since transformers are usually complicated to use, fine-tuning an already trained model is what I would recommend instead; this is called transfer learning.

DATASET AND PYTHON SOURCE CODE

LINK to the Final Python Source Code -
https://colab.research.google.com/drive/1N_sRAxdftGIzqzeIMw3NFY9xClLk49f_

LINK to the used dataset -
https://zenodo.org/record/5855017

ACKNOWLEDGMENT
Special thanks to Mr. Ggaliwango Marvin for his never-ending support towards my research on this project. I also want to appreciate Mrs. ------ for providing the foundational knowledge needed for this project.