
ENGLISH TO LUGANDA AI PROJECT

* MACHINE LEARNING ARTIFICIAL INTELLIGENCE MODEL

NAME: KISEJJERE RASHID


2ND YEAR, BACHELOR OF SCIENCE IN SOFTWARE ENGINEERING
MAKERERE UNIVERSITY
KAMPALA, UGANDA
email: rashidkisejjere@gmail.com

Abstract—English to Luganda translation using different machine classification models and deep learning. There are many machine translation systems available right now, but there is no public English-to-Luganda translation paper showing how the translation process occurs. Luganda is a very common language and the mother language of Uganda. It has a very large vocabulary, which means that working on it requires a very large dataset. In this paper, I go through different machine-learning models that can be implemented to translate a given English text into Luganda. Because translation is a new field for Luganda, I experiment with multiple machine learning classification models such as SVMs and logistic regression, and finally with deep learning models such as RNNs and LSTMs, also incorporating the advanced mechanisms of attention and Transformers plus the additional technique of transfer learning. For this project, I use the English-to-Luganda dataset crafted by Makerere students; the dataset is made up of around 15,000 English sentences with their respective translations.

Index Terms—machine translation, classification task, RNNs, transfer learning, LSTMs, attention

I. INTRODUCTION

Machine translation is a field that was and still is under research, and so far researchers have come up with multiple machine translation approaches. These techniques mainly include the following.

A. Rule-Based Machine Translation (RBMT)

Rule-based machine translation (RBMT), as the name states, is mainly about researchers coming up with different rules that a text in a given language can follow to arrive at its respective translation. It is the oldest machine translation technique and was already in use in the 1970s.

B. Statistical Machine Translation (SMT)

Next is statistical machine translation (SMT), an older translation technique that uses a statistical model to represent the relationships between sentences, words, and phrases in a given text. This model is then applied to a second language to convert these elements into the new language. One of the main advantages of this technique is that it can improve on rule-based MT, although it shares some of the same problems.

C. Neural Machine Translation (NMT)

The last and best approach is the neural network translation system, which was developed using deep learning techniques. This method is faster and more accurate than the other methods, and neural MT is rapidly becoming the standard in MT engine development.

The translation of one language into another is a machine classification problem. This type of problem can predict only values within a known domain. The domain in this case could be either the characters that make up the vocabulary of a given language or the words in that vocabulary, which shows that the classification model could be either word-based or character-based. I will elaborate more on this issue in the following sections.

II. DATASET DESCRIPTION

The dataset was created by fellow Makerereans. It is made up of two columns: the first column contains sentences in Luganda and the other contains their respective translations into English. This is a textual dataset and it can be accessed from the link:
The creators of this dataset are:
The major factors for using this dataset include the following. There are very few English-to-Luganda datasets out there, and this was the only free dataset I could find. Even though it is the only dataset, its data is very well structured and has minimal errors. The dataset has over 10,000 Luganda sentences with their respective English translations, and a dataset of this kind is capable of producing accurate models.

III. AI PROJECT SCOPE

The goal is to create an AI model with the ability to receive English sentences from the user and translate these sentences into Luganda. I will create multiple models with different types of AI architectures until I get the best one. Possible constraints for this problem include insufficient memory resources, because NLP problems like this one require a lot of processing power. Given the wide vocabulary of the dataset, the model is going to need very large computational power, which might be a problem; fortunately, Google Colab can be used to train a model on a free GPU.
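The two-column dataset described above can be loaded with a short sketch like the following. The file contents, column names, and sample sentence pairs here are illustrative assumptions, since the source does not give the actual file layout; the real file holds around 15,000 rows.

```python
import csv
import io

# Stand-in for the Makerere English-Luganda CSV described above;
# the column names "luganda" and "english" are assumptions.
sample_csv = io.StringIO(
    "luganda,english\n"
    "Oli otya?,How are you?\n"
    "Webale nnyo,Thank you very much\n"
    "Nkwagala,I love you\n"
)

rows = list(csv.DictReader(sample_csv))
# Each row pairs an English sentence with its Luganda translation.
pairs = [(row["english"], row["luganda"]) for row in rows]
print(len(pairs))
```

With the real file, the `io.StringIO` object would simply be replaced by an open file handle.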
A. ACCOUNTABILITY

In this context of AI, "accountability" refers to the expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate, or deploy, in accordance with their roles and applicable regulatory frameworks, and will demonstrate this through their actions and decision-making processes (for example, by providing documentation on key decisions throughout the AI system lifecycle, or by conducting or allowing auditing where justified).

AI accountability is very important because it is a means of safeguarding against unintended uses. Most AI systems are designed for a specific use case; using them for a different use case would produce incorrect results. I will also apply accountability to my model: since my AI model depends mainly on the dataset, it is best to make sure that the quality of the dataset is constantly improved and filtered, because even slight modifications in the spelling of words will decrease the model's accuracy.

IV. STATEMENT OF THE AI PROJECT

Machine learning has improved a lot of areas, including machine translation, but so far this work has focused on popular languages such as German and French; in Africa, the most translated language is Kiswahili, leaving languages such as Luganda neglected. Luganda has a very large vocabulary, so coming up with a solution like this tends to ease the translation process even for other languages with very large vocabularies. This project is therefore a basis for more research on Luganda.

V. AI RESEARCH QUESTIONS

Below is a list of common research questions in the field of machine translation.
• Does the model translate the English questions?
• How accurate is its translation?
• Which machine translation model did you use?
• How long does it take to come up with its translation?
• How much space does the model need?
• What is machine translation?
• Who is to use the model?
• What is the environment of the AI model going to be?
• What is the attention mechanism used in NLP?
• What is an RNN as used in NLP?

VI. HYPOTHESIS

Research hypothesis: it is hypothesized that the model will be able to translate any given English input into Luganda. The model is a classification type of model that classifies the given input into its respective translation. The model will predict its output based on a dataset of 15,000 English-to-Luganda sentences. In reality, the vocabulary is very large, and since a bigger dataset is better but also requires more modifications, for the sake of this paper I am simply going to assume that the vocabulary size is:

VII. RESEARCH OBJECTIVES

The major aim of this paper isn't creating a very accurate model; it is mainly about laying a foundation for further research in this field by showing the preprocessing needed when implementing a machine translation model on a dataset that has never been publicly used before, by showing all of the basic models that can be implemented for the translation of English to Luganda, and, above all, by creating a basis for any other implementations of Luganda-related models.

VIII. PROPOSED AI METHODOLOGY

The methodology is to come up with multiple models implemented with different types of architectures. These architectures include SVMs, RNNs, attention mechanisms, and transformers.

A. Data Collection

This refers to the process of collecting information from different sources. I will do this mainly by gathering many different Luganda sentences with their respective English translations. The major source of this kind of dataset is society itself.

B. Data Preprocessing

This is the process of cleaning, filtering, and preparing the collected data, and modifying it so as to make it more easily understandable to the computer. The major data preprocessing tasks I will apply are as follows:
• Tokenization. Tokenization is a technique that breaks up the words of a given sentence into numerical values. It is usually the first step when preprocessing text data in natural language processing, and it can be done with different tokenization algorithms.
• Spelling error removal. Here the goal is to get rid of any spelling errors that may be in the text, because for the model to be accurate, all of the input must be accurate: "Garbage In, Garbage Out."
• Padding. The model is supposed to receive data of the same size, and this is achieved through padding, which is the addition of extra zeros at the end of each text being input to the model.

C. Model Training

This is the process of fitting the best combination of weights and biases to a machine learning algorithm so as to minimize the loss function over the prediction range. For NLP-related tasks, especially when using deep learning models, this process usually takes a lot of time. Since multiple different models are to be trained here, the training time will depend on the complexity of each model.
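The tokenization and padding steps described under Data Preprocessing can be sketched in plain Python as below. A real project would likely use a library tokenizer; the sentences and the reserved-zero convention here are toy illustrations.

```python
from collections import Counter

def build_vocab(sentences):
    """Assign each word an integer id; id 0 is reserved for padding."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def tokenize(sentence, vocab):
    """Break a sentence's words into numerical values."""
    return [vocab[w] for w in sentence.lower().split() if w in vocab]

def pad(seq, length):
    """Append zeros so every sequence reaches the same length."""
    return seq + [0] * (length - len(seq))

sentences = ["I love you", "How are you today"]
vocab = build_vocab(sentences)
encoded = [tokenize(s, vocab) for s in sentences]
max_len = max(len(seq) for seq in encoded)
padded = [pad(seq, max_len) for seq in encoded]
```

After padding, every sequence has the same length, which is what downstream batch training expects.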
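The quarter-sized held-out test set described under Model Testing can be produced with a sketch like this. The fixed seed and the toy stand-in pairs are illustrative assumptions.

```python
import random

def split_dataset(pairs, test_fraction=0.25, seed=42):
    """Shuffle the sentence pairs and hold out roughly a quarter for
    testing; the remaining three quarters are used for training."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-ins for the ~15,000 real sentence pairs.
pairs = [(f"english {i}", f"luganda {i}") for i in range(100)]
train, test = split_dataset(pairs)
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing the multiple model architectures proposed above.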
D. Model Testing

This is the method of measuring the accuracy of the model. Part of the data is split into train, test, and evaluation datasets. The test dataset should be around a quarter of the entire dataset, because most of the data should be used for the training process and only the remainder for testing.

E. Evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses. This is a very necessary part because, through evaluation, you are able to identify all of the flaws of the system; it is usually done as the last part of model creation.

F. Overview

This AI model is a classification type of model. The classification problem can be resolved in two ways: the first is through the use of a word-based classification model, and the other is through the use of a character-based model. For the word-based classification model, the output space is the entire vocabulary of Luganda, and since that vocabulary is very large, the output is also very large, meaning that the accuracy of this model would depend heavily on the preprocessing techniques; training such a model would also require very large computational power. For the character-based model, the output is produced character by character. The major advantage of such a model is that it does not require very high computational power, because its output space is small; its downside is that it is limited to models that can store state, which are mainly deep learning models such as RNNs, LSTMs, and their variants, because the model has to remember the previous characters in order to predict the next character.

ACKNOWLEDGEMENTS

Special thanks to Mr. Galiwango Marvin and Dr. Rose Nakibuule for their determined and never-ending guidance throughout my research and the implementation of this project.
