2ND YEAR BACHELORS OF SCIENCE IN SOFTWARE ENGINEERING MAKERERE UNIVERSITY KAMPALA, UGANDA email: rashidkisejjere@gmail.com
Abstract: English to Luganda translation using different machine classification models and deep learning. There are many machine translation sources right now, but there isn't any public English to Luganda translation paper showing how the translation process occurs. Luganda is a widely spoken language and a mother tongue for many people in Uganda. It has a very large vocabulary, which means that working on it requires a very large dataset. In this paper, I go through different machine learning models that can be implemented to translate a given English text to Luganda. Because Luganda is a new field in translation, I am going to experiment with multiple machine learning classification models such as SVMs and logistic regression, and finally with deep learning models such as RNNs and LSTMs, also incorporating advanced mechanisms such as attention and Transformers, plus additional techniques such as transfer learning. For this project, I am to use the English to Luganda dataset that was crafted by Makerere students. The dataset is made up of around 15,000 English sentences with their respective translations.

Index Terms: machine translation, classification task, RNNs, transfer learning, LSTMs, attention

I. INTRODUCTION

Machine translation is a field that was and still is under research, and so far researchers have come up with multiple machine translation approaches. These machine translation techniques mainly include:

A. Rule-Based Machine Translation (RBMT)

Rule-based machine translation (RBMT), as the name states, is mainly about researchers coming up with different rules that a text in a given language can follow to arrive at its respective translation. It is the oldest machine translation technique and it was used in the 1970s.

B. Statistical Machine Translation (SMT)

Statistical machine translation (SMT) is an old translation technique that uses a statistical model to create a representation of the relationships between sentences, words, and phrases in a given text. This representation is then applied to a second language to convert these elements into the new language. One of the main advantages of this technique is that it can improve on rule-based MT, although it shares some of the same problems.

C. Neural Machine Translation (NMT)

The last and best is the neural network translation system, which was developed using deep learning techniques. This method is generally faster and more accurate than the other methods. Neural MT is rapidly becoming the standard in MT engine development.

The translation of a given language to another language is a machine classification problem. This type of problem can predict only values within a known domain. The domain in this case could be either the characters that make up the vocabulary of a given language or the words in that vocabulary. This shows that the classification model could be either word-based or character-based. I will elaborate more on this issue in the next topics.

II. DATASET DESCRIPTION

The dataset was created by fellow Makerere students. It is made up of two columns: the first column includes sentences in Luganda and the other includes their respective translations to English. This is a textual dataset and it can be accessed from the link:

The creators of this dataset are:

The major factors for using this dataset include:

• There are very few English to Luganda datasets out there, and this was the only free dataset I could find.
• Even though it's the only dataset, its data is structured very well and it has minimal errors.
• The dataset has over 10,000 Luganda sentences with their respective translations to English, and a dataset of this type is capable of producing accurate models.

III. AI PROJECT SCOPE

The goal is to create an AI model with the ability to receive English sentences from the user and translate these sentences into Luganda. I am to create multiple models with different types of AI architectures until I get the best one. Possible constraints for this problem include insufficient memory resources, because this is an NLP problem and it requires a lot of processing power. Given the wide vocabulary of the dataset, the model is going to need a lot of computational power, which might be a problem, but fortunately Google Colab can be used to train a model on a free GPU.

A. ACCOUNTABILITY

In the context of AI, "accountability" refers to the expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate or deploy, in accordance with their roles and applicable regulatory frameworks, and will demonstrate this through their actions and decision-making process (for example, by providing documentation on key decisions throughout the AI system lifecycle, or by conducting or allowing auditing where justified).

AI accountability is very important because it's a means of safeguarding against unintended uses. Most AI systems are designed for a specific use case; using them for a different use case would produce incorrect results. I am also to apply accountability to my model. Since my AI model mainly depends on the dataset, it's best to make sure that the quality of the dataset is constantly improved and filtered, because even slight modifications in the spelling of the words will decrease the model's accuracy.

IV. STATEMENT OF THE AI PROJECT

Machine learning has improved a lot of areas, including machine translation, but so far this has been focused on popular languages such as German and French. Within Africa, the most translated language is Kiswahili, leaving languages such as Luganda neglected. Luganda has a very large vocabulary, so coming up with a solution like this would ease the translation process even for other languages that have a very large vocabulary. This project is therefore a basis for more research that can be made in Luganda.

V. AI RESEARCH QUESTIONS

Below is a list of common research questions in the field of machine translation.

• Does the model translate the English questions?
• How accurate is its translation?
• Which machine translation model did you use?
• How long does it take to come up with its translation?
• How much space does the model need?
• What is machine translation?
• Who is to use the model?
• What is the environment of the AI model going to be?
• What is the attention mechanism used in NLP?
• What is an RNN as used in NLP?

VI. HYPOTHESIS

Research hypothesis: it is hypothesized that the model will be able to translate any given English input into Luganda. The model is a classification type of model that classifies the given input into its respective translation. The model will predict its output based on a dataset of around 15,000 English sentences with their Luganda translations. In reality the vocabulary is very large, and the bigger the dataset the better, but a bigger dataset also needs more modifications. So for the sake of this paper I am just going to assume that the vocabulary size is:

VII. RESEARCH OBJECTIVES

The major aim of this paper isn't to create a very accurate model. It is mainly about laying a foundation for further research in this field by showing the preprocessing needed when implementing a machine translation model on a dataset that has never been publicly used before, and by showing all of the basic models that I can implement for the translation of English to Luganda. Above all, it is about creating a basis for any other implementations of Luganda-related models.

VIII. PROPOSED AI METHODOLOGY

The methodology is to come up with multiple models implemented with different types of architectures. These architectures include SVMs, RNNs, attention mechanisms, and also Transformers.

A. Data Collection

This refers to the process of collecting information from different sources. I am to do this mainly by coming up with many different Luganda sentences with their respective English translations. The major source of this kind of dataset is society.

B. Data Preprocessing

This is the process of cleaning, filtering, and preparing the collected data, and modifying it so as to make it more easily understandable to the computer. The major data preprocessing tasks that I am to apply are listed below:

• Tokenization. Tokenization is a technique that breaks a given sentence up into words and maps them to numerical values. It's usually the first step when preprocessing text data during Natural Language Processing. It can be done through the use of different algorithms that handle tokenization.
• Spelling Error Removal. Here the goal is to get rid of any spelling errors that may be in the text, because for the model to be accurate, all of the data inputted must be accurate. "Garbage In, Garbage Out."
• Padding. The model is supposed to receive data of the same size. This is achieved through padding, which is the adding of extra zeros at the end of each text being inputted into the model.

C. Model Training

This is the process of fitting the best combination of weights and biases to a machine learning algorithm so as to minimize a loss function over the prediction range. This process usually takes a lot of time on NLP-related tasks, especially when using deep learning models. Since the training is to be done on multiple different models, the training time will depend on the complexity of each model.

D. Model Testing

This is the method of measuring the accuracy of the model. The data is split into the train, test, and evaluation datasets. The test dataset is supposed to be around a quarter of the entire dataset, because most of the data is supposed to be used for the training process and only a small part for the testing part.

E. Evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses. This is a very necessary part because through evaluation you are able to know all of the flaws of the system. It is usually done as the last part of model creation.

F. Overview

This AI model is a classification type of model. There are two ways in which this classification problem can be resolved: the first is through the use of a word-based classification model and the other is through the use of a character-based model. For the word-based classification model, the output covers the entire vocabulary of Luganda. This vocabulary is very large, so the output is also very large, meaning that the accuracy of this model would highly depend on the preprocessing techniques. Training such a model would also require very large computational power.

For the character-based model, the output would be produced character by character. The major advantage of such a model is that it won't require very high computational power, because its output won't be complicated. Its downfall is that it would be limited to models that can store state, which are mainly deep learning models such as RNNs, LSTMs, and their variants, because the model has to keep track of the previous characters to be able to predict the next character.

ACKNOWLEDGEMENTS

Special thanks to Mr. Galiwango Marvin and Dr. Rose Nakibuule for their determined and never-ending guidance towards my research and the implementation of this project.
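The tokenization, padding, and train/test split steps described in the methodology (Data Preprocessing and Model Testing) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the pipeline used in this project: the helper names (build_vocab, tokenize, pad, train_test_split), the reserved <pad> and <unk> ids, and the toy English sentences are all hypothetical, and tokenization is done by simple whitespace splitting.

```python
import random

PAD_ID = 0  # id reserved for padding, matching the "extra zeros" in the text
UNK_ID = 1  # id for words not seen when the vocabulary was built (assumption)


def build_vocab(sentences):
    """Map each distinct lowercase word to an integer id, starting at 2."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def tokenize(sentence, vocab):
    """Turn a sentence into a list of integer ids (word-level tokenization)."""
    return [vocab.get(word, UNK_ID) for word in sentence.lower().split()]


def pad(sequences, length=None):
    """Right-pad (or truncate) every sequence to a common length with PAD_ID."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    return [seq[:length] + [PAD_ID] * (length - len(seq)) for seq in sequences]


def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the items and hold out test_fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sentences standing in for the English column of the dataset.
sentences = [
    "good morning",
    "how are you",
    "thank you very much",
    "where is the market",
]

vocab = build_vocab(sentences)
padded = pad([tokenize(s, vocab) for s in sentences])
train, test = train_test_split(padded)  # about a quarter held out for testing
```

After padding, every sequence has the length of the longest sentence, so the model receives inputs of the same size. In a real experiment these helpers would likely be replaced by library equivalents; the sketch only makes the three steps concrete.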