
DEDAN KIMATHI UNIVERSITY OF TECHNOLOGY

PROJECT REPORT FOR FINAL YEAR STUDY IN BSC COMPUTER SCIENCE

BY

NJOROGE MBOTE STANLEY (C026-01-1008/2015)

PROJECT TITLE

NEURAL NETWORK TRANSLATION FROM GIKUYU TO KISWAHILI

Submitted in partial fulfillment of the requirements of the Bachelor of Science Degree in Computer Science

DECLARATION

STUDENT
I hereby declare that the project entitled “Neural Network Translation of Gikuyu to
Kiswahili” submitted for the B.Sc. Computer Science degree is my original work and has not
formed the basis for the award of any other degree, diploma or similar title.

Signature …………………………….

Date: __________________________

SUPERVISOR

This is to certify that the above statement made by the student is true to the best of my
knowledge.

Signature ……………………

Date: ___________________

ABSTRACT
Machine translation is one of the important tasks in natural language processing, alongside
part-of-speech tagging, grammar checking and speech processing. It involves decoding phrases
from one language and rendering them in another. Although developers around the world have
built translation systems for many of the world's languages, some languages have not yet been
digitized, African languages in particular, and they therefore remain under-resourced
languages.

This research focuses on the Gikuyu language, spoken mostly in the central part of Kenya but
also in other parts of the country, and the Kiswahili language, which is widely spoken across
Kenya. Gikuyu is spoken by about 22% of the Kenyan population. However, despite being one of
the most widely spoken languages in Kenya, it remains an under-resourced language: very few
language technologies and tools exist for it. This has prompted me to machine translate from
Gikuyu to Kiswahili, since Kiswahili is an African language that already exists in digital
format. Machine translation for this language pair is very valuable, as it has many
applications.

This project is aimed at machine translating Gikuyu to Kiswahili, both languages being widely
spoken in Kenya. For this project I intend to use a neural network tool known as the OpenNMT
Tool. OpenNMT is an implementation of neural network model training and translation. I intend
to use ten thousand sentences for my model training.

The corpus is collected and first preprocessed before being passed into the OpenNMT tool for
model training. During machine translation, the tool takes in two parallel data sources, the
source language and the target language, and uses encoders and decoders to train itself on
translation. The approach is data-driven and involves a great deal of hand-crafting of the
data, making it time-consuming, particularly for an under-resourced language such as Gikuyu.
The main tasks are therefore data collection and data preprocessing.

Table of Contents

DECLARATION
ABSTRACT
CHAPTER 1: INTRODUCTION
  1.1 Background
  1.2 Problem statement
    1.2.1 Language is under-resourced
    1.2.2 Represent language in digital format
    1.2.3 UNESCO goals on endangered languages
  1.3 Objectives
    1.3.1 General Objective
    1.3.2 Specific Objectives
  1.4 Justification
  1.5 Scope
    1.5.1 General Scope
    1.5.2 Specific Scope
CHAPTER 2: LITERATURE REVIEW
  2.1 Introduction
  2.2 Case Studies
    2.2.1 S.A.W.A. Corpus
    2.2.2 Bengali to Assamese Statistical Machine Translation
    2.2.3 English to Creole and Creole to English Rule Based Machine Translation System
CHAPTER THREE: METHODOLOGY
  3.1 Introduction
  3.2 Document Collection
  3.3 Data Collection
  3.4 Data Analysis Tools
  3.5 Manual Tuning
  3.6 Phrases
CHAPTER 4: IMPLEMENTATION
  4.1 Corpus Development
  4.2 Data Preparation
    4.2.1 Tokenization
    4.2.2 Tokenization output
  4.3 Model Training
  4.4 Model Translation

CHAPTER 1: INTRODUCTION

1.1 Background
Machine Translation (MT) is the use of computers to automate the production of translations
from one natural language (NL) into another, with or without human assistance, while trying to
preserve the meaning of the text. A machine translation system translates source text into
target text, and may use various approaches to complete the translation. These systems
incorporate a great deal of knowledge about words and about the language (linguistic
knowledge). Such knowledge is stored in one or more lexicons, and possibly in other sources of
linguistic knowledge, such as grammars.

With the growth of technology, language barriers should no longer be a problem. It becomes
important to provide information to people when they need it, in their official language as
well as their native language. The localization industry is constantly searching for
technologies that increase translator throughput; the current focus is on high-quality
Statistical Machine Translation (SMT) and, with the emergence of neural network tools, on
supplements to the established Translation Memory (TM) technology.

In neural machine translation, the source language text is transformed into the target
language text based on neural network models learned from a corpus of both the source and
target languages. A supervised or unsupervised machine learning algorithm builds prediction
networks from the corpora; these networks encode information such as the characteristics of
the sentences and the structural relation between the two languages. The model uses these
characteristics when translating from one natural language to another, and the translation is
produced according to a probability distribution.
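In the standard NMT formulation (a general statement, not specific to this project), the
network models the probability of a target sentence y = (y_1, ..., y_T) given a source
sentence x, and translation searches for the most probable output:

    P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x), \qquad
    \hat{y} = \arg\max_{y} P(y \mid x)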
This grammatical information is carried by sentences. A sentence is divided into phrases, and
one phrase can occur in more than one context. For example, consider the following sentences:
Mutumia ucio nia ciara
Translation: That woman has given birth.
Mwanake ucio niarega ciara.
Translation: That young man has cut his fingers.

Example 2: mtoto amesimama kwa meza
Translation: The kid has stood on the table.
Mtoto amemeza chakula
Translation: The kid has swallowed the food.

In the first example, the word 'ciara' is used in the first sentence to mean giving birth,
while in the second sentence it is used in the sense of cutting fingers. Similarly, in the
second example, the word 'meza' is used in the first sentence to mean table, while in the
second sentence the same word is used to mean swallow.

Both Gikuyu and Kiswahili are Kenyan languages spoken by a majority of Kenyans. Gikuyu
originates from the central part of Kenya, while Kiswahili originates from the coastal part of
Kenya. By comparison, however, Kiswahili is spoken by far more people than Gikuyu, as it is
used throughout East and Central Africa and in some other parts of the world.

Kiswahili therefore has an upper hand: it is easy to find resources in it since it is so
commonly used. Gikuyu, on the other hand, is mostly used within the Kenyan context, where it
is nevertheless the most widely spoken vernacular language.

Both languages are categorized as Bantu languages, meaning that they share some words and,
tentatively, a similar phrasal structure.

Despite the two languages being dominant in Kenya, literature on them is still sparse, and it
must be stressed that although written forms exist, such as the Bible, prayer books and
stories in particular, there is no standard form; hence Gikuyu still lacks language tools and
technology.

1.2 Problem statement


1.2.1 Language is under-resourced

Most African languages are resource-scarce, meaning that they have little or no digital text
resources. Gikuyu (the main language in this project), being one of the African languages
found in Kenya, suffers the same fate as most of the others.

The Kiswahili language has received support from the developer community and has been
partially digitized. This is attributed to the fact that Kiswahili, despite being an African
language, has been embraced by people beyond the African continent, and to the increasing
number of publications in the Kiswahili language over the years.

However, developers have focused on translation for the major international languages such as
English, French, Chinese and Spanish, while neglecting to digitize their own local languages,
thus contributing to most African languages remaining under-resourced.

1.2.2 Represent language in digital format

There is a need to represent our native languages in a digital format. This will enable more
knowledge to be extracted from these languages, and communication barriers due to language
differences can be overcome by a well-trained translator. However, the lack of our native
languages in digital format makes developing translation tools difficult, as there are no
sources for training the translation models.

So, by developing this Gikuyu-Kiswahili translator, I will be helping to bring Gikuyu onto a
digital platform alongside Kiswahili. This will make it easier to search content from online
sources and other digital platforms without fear of a language barrier.

1.2.3 UNESCO goals on endangered languages

There are currently about 7,000 languages spoken worldwide, and it is projected that at least
half of these languages will disappear within this century (citation). To prevent this loss,
UNESCO aims to use technology to preserve these languages, as well as some of the associated
cultural practices, since anything stored in a digital format is very difficult to wipe out of
existence. One method that can help preserve these endangered languages is translation, which
makes it easier for people to learn the language and hence enhances its continuity.

By developing the translation tool for Gikuyu to Kiswahili, I will be contributing to the
UNESCO goal of preserving our native languages and enhancing their continuity.

1.3 Objectives
1.3.1 General Objective

To provide a platform that can enable translation of Gikuyu to Kiswahili

1.3.2 Specific Objectives

The objectives of this project are as stated below:

i. To fine-tune the language
ii. To develop a translator for Gikuyu to Kiswahili
iii. To evaluate the translator

1.4 Justification
The proposed system will provide translation from Gikuyu to Kiswahili with minimal error,
since neural network tools are the newest technology for implementing machine translation and
have proved successful. The system will also enable people to learn either Gikuyu or Kiswahili
effectively and with minimal hassle, as it will do the translation for them quickly. In doing
so, the system will enhance the continuity of both languages, as interested people will be
able to learn them.

Last but not least, the system will also allow children in the lower primary classes to learn
their vernacular languages (Gikuyu/Kiswahili), now that these have been incorporated in the
new curriculum. The system will act as a catalyst for quick learning of the languages.

1.5 Scope
1.5.1 General Scope

The project offers a valuable solution for the Gikuyu language, which lacks adequate
technology and tools. Researchers of this language, and even of other languages, can utilize
this system in research areas such as data mining, named entity recognition, part-of-speech
tagging, speech synthesis and recognition, and pronunciation. It can also be used by future
generations to learn Gikuyu and save the language from extinction.

1.5.2 Specific Scope

This proposal has been developed specifically to build a neural machine translation system
from Gikuyu to Kiswahili. The system shall be developed using a corpus in both the Gikuyu and
Kiswahili languages, which will be used to build an intelligent system that helps translate
any Gikuyu statement into Kiswahili.

CHAPTER 2: LITERATURE REVIEW

2.1 Introduction
Other research work has been done in natural language processing concerning the development of
natural language translators. Several methods have been used to achieve this, such as
statistical tools and neural network models. This previous research has both weaknesses and
strengths, which will help guide the development of the neural network Gikuyu-to-Kiswahili
translation system.

In this write-up, I will examine three previous studies that relate to natural language
processing, specifically in the aspect of machine translation.

2.2 Case Studies

2.2.1 S.A.W.A. Corpus

2.2.1.1 Introduction and Scope

The SAWA corpus is a parallel corpus for machine translation between English and Kiswahili, in
both directions. It was developed by Guy De Pauw, Peter Waiganjo Wagacha from the University
of Nairobi (U.O.N.) and Gilles-Maurice de Schryver from Ghent University. Each text in the
SAWA corpus is automatically part-of-speech tagged and lemmatized, using the TreeTagger for
the English part (Schmid, 1994) and the systems described in De Pauw et al. (2006) and De Pauw
and de Schryver (2008) for Swahili.

For a language like Swahili, spoken by more than fifty million people in East and
Central Africa, digital resources have become increasingly important in everyday
life, both in urban and rural areas, thanks to the growing number of web-enabled
mobile phone users in the language area and increased bandwidth, courtesy of
broadband and the terrestrial and undersea optical fiber cables. The prominence of
regional economic blocks such as the East African Market and the growing
popularity of the expanded media in the region further underline the need for
African language technology tools.

English was first spoken in early medieval England and is now a global language. English has
been fully digitized, unlike Kiswahili, which has been only partially digitized despite being
spoken by over 50 million people in East and Central Africa. English is currently an official
language of many countries in the world.

2.2.1.2 Description
Most African languages are, however, resource-scarce, meaning that digital text resources are
few. An increasing number of publications nevertheless show that carefully selected procedures
can indeed bootstrap language technology for Kiswahili (De Pauw et al., 2006; De Pauw and de
Schryver, 2008), Northern Sotho (de Schryver and De Pauw, 2007) and smaller African languages
(Wagacha et al., 2006a; Wagacha et al., 2006b; De Pauw and Wagacha, 2007; De Pauw et al.,
2007a; De Pauw et al., 2007b).

During the development of this corpus, half of the data was tokenized manually while the rest
was tokenized automatically. The greatest challenge in this project resulted from the strongly
agglutinating nature of the Kiswahili language.

Most of the data used in training came from the Kiswahili versions of the New Testament and
the Quran, for which the English counterparts were sourced. The data was randomly divided: 90%
was used to train the model and 10% was used for testing.

2.2.1.3 Evaluation

During the evaluation process, the team trained the SMT system on the training set and
evaluated it on the test set, using the standard machine translation evaluation measures BLEU
and NIST. They then compared their results to those of the Google Translate system for
Kiswahili. This comparison was problematic: the Google Translate system is partially built on
the basis of data described in a previous publication of the SAWA corpus (De Pauw et al.,
2009b), and it is likely that significant portions of the test set in their experiments
actually constitute training data for the Google Translate system.

For English-Kiswahili translation, the SAWA system underperforms compared to the Google
Translate system. This could be partly attributed to the aforementioned evaluation problem,
but it is also likely due to Google's more refined morphological generation model on the
target language side.

Error analysis shows that the SAWA system has significant difficulties generating
morphologically correct Kiswahili words. For Kiswahili-English translation, however, the SAWA
system fares better, as it is not hampered by morphological generation issues in the target
language. In this case, the SAWA system is able to outperform the Google system by a
significant margin.

2.2.2 Bengali to Assamese Statistical Machine Translation

2.2.2.1 Introduction

Bengali, the national language of Bangladesh, is the second most spoken language in India. It
is native to the region of eastern South Asia known as Bengal, which comprises present-day
Bangladesh and the Indian state of West Bengal. With almost 230 million native speakers,
Bengali is one of the most widely spoken languages in the world.

2.2.2.2 Description

Bengali follows a Subject-Object-Verb word order, although variations on this pattern are
common. Bengali makes use of postpositions, as opposed to the prepositions used in English and
other European languages. Determiners follow the noun, while numerals, adjectives and
possessives precede it. Bengali has two literary styles: one called Sadhubhasa (elegant
language) and the other Chaltibhasa (current language) or Cholit Bangla.

2.2.2.3 Methodology

A Statistical Machine Translation system uses a parallel corpus of source and target language
pairs. For this work, a Bengali-to-Assamese parallel corpus of approximately 20,000 sentences
was developed. The corpus consisted of short sentences related to novels, stories, travel and
tourism in India. The table below shows the number of sentences used for training, testing and
tuning.

Corpus     No. of sentences   Source   Target
Training   17000              17000    17000
Testing    1500               1500     1500
Tuning     1500               1500     1500

2.2.2.4 Evaluation
For the experiments, three sets of randomly selected sentences, with 200, 250 and 300
sentences respectively, were chosen. The table below shows the results for each set of
sentences. Going through the results, the errors can be attributed to the following reasons:

1. The number of words in the corpus was extremely limited.

2. The PoS tagging of entries was not complete.

3. Sometimes there were multiple word entries in the target language lexicon for a single word
in the source language lexicon. For instance, for both the Assamese words নগৰ and চহৰ, the
Bengali word was শহর.

Sets    Total   Successful   Unsuccessful   % of error
Set 1   200     165          35             17.5
Set 2   250     211          39             15.6
Set 3   300     259          41             13.7

The output of the experiment was evaluated using BLEU (Bilingual Evaluation Understudy), a
metric that estimates the quality of machine-translated text. A BLEU score of 16.3 was
obtained on the parallel Bengali-Assamese corpus after translation. This score is very low,
probably because a very small data set was used.
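For illustration, a BLEU score of this kind can be computed with off-the-shelf tooling. Below
is a minimal sketch using NLTK; this is an assumption for illustration only, since the paper
does not state which toolkit was used, and the example sentences are made up:

    from nltk.translate.bleu_score import corpus_bleu

    # Hypothetical tokenized data: each candidate is paired with a list of references.
    references = [[["the", "kid", "has", "stood", "on", "the", "table"]]]
    candidates = [["the", "kid", "stood", "on", "the", "table"]]

    score = corpus_bleu(references, candidates)
    print("BLEU: %.1f" % (100 * score))  # BLEU is often reported on a 0-100 scale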

2.2.2.5 Conclusion

In this paper, Bengali was translated to Assamese using SMT with a corpus of around 20,000
sentences. The method of extracting translation patterns was a relatively simple one, using
the phrase-based decoder in Moses. The BLEU score obtained was 16.3, reflecting the relatively
small corpus used. Incorporating shallow syntactic information (POS tags) into the
discriminative model would help boost the performance of the translation.

2.2.3 English to Creole and Creole to English Rule Based Machine Translation System

2.2.3.1 Introduction

The Mauritian Creole language is spoken by the majority of the 1.2 million inhabitants of
Mauritius, while English is still the official language. The language is also spoken in
Rodrigues and Seychelles, although with some variations. Although Mauritian Creole is the
mother tongue of the inhabitants, until recently only a few people wrote it with correct
orthography. The language has been taught in primary schools since January 2012, and
institutions such as the University of Mauritius, Ledikasyon pu Travayer and the Mauritius
Institute of Education have played a pivotal role in helping Mauritian Creole flourish as a
national language.

2.2.3.2 Description

Mauritian Creole Grammar


The Mauritian Creole (MC) determiner system is much simpler than that of French, as there are
no French definite or partitive articles. Grammatical gender and number are also absent in
Mauritian Creole.
The core of the MC determiner system has the following functional elements:
1. An indefinite singular article enn.
2. A demonstrative sa, which is generally used in conjunction with la.
3. A post-nominal specificity marker la.
4. A plural marker bann.

5. The morpheme li, which represents the pronoun he/she/it/him/her, depending on the context
in which it is used.

Mauritian Creole verbs use TMA (Tense, Modality and Aspect) markers to indicate tense. The
tense marker 'ti' indicates an action that has already taken place (i.e. past tense). The
modality marker 'pu' indicates something that will happen (i.e. definite future), whereas the
modality marker 'ava' expresses something that may possibly happen (i.e. indefinite future).
The aspect marker 'pe/ape' marks an action that is still going on (i.e. progressive), in
contrast to the aspect marker 'finn/inn', which indicates an action that is already over (i.e.
perfect).

2.2.3.3 Methodology

The system was implemented as follows (a sketch of steps 1 to 3 is given after the next
paragraph):

1) Split the text into an array of sentences using ".", "!" and "?" as delimiters.
2) Split each sentence into an array of words, using the "\W" meta-sequence and the underscore
character as delimiters.
3) Use a greedy algorithm to find the longest match for a given fragment in the database.
4) Perform morphological analysis to extract the root of the word and check for a
corresponding translation, in case the word was not translated in step 3.
5) Reorder the words according to the target language sentence format.

The rule-based machine translation system relies on a bilingual dictionary to perform
translation. The Diksioner Morisien dictionary was used to build the bilingual dictionary in
the database. The greedy algorithm was used to retrieve the target word(s) from the database,
and it worked in the following way: it starts at the first character of a sentence and,
traversing from left to right, attempts to find the longest match based on the words in the
database. When a fragment is found, a boundary is marked at the end of the longest match, and
the same searching process continues from the next character after the match. If a word is not
found, the greedy algorithm removes that character and continues the searching process from
the next character.
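To make the longest-match lookup concrete, here is a minimal sketch in Python. It is
simplified to operate on whole words rather than characters, and the lexicon entries, function
names and example sentence are hypothetical stand-ins for the Diksioner Morisien database:

    import re

    # Toy stand-in for the bilingual dictionary database (hypothetical entries).
    LEXICON = {
        "mo kontan": "i love",
        "manze": "food",
        "epise": "spicy",
    }

    def split_sentences(text):
        # Step 1: split the text into sentences on ".", "!" and "?".
        return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

    def greedy_translate(sentence):
        # Step 2: split into words on non-word characters and underscores.
        words = [w for w in re.split(r"[\W_]+", sentence.lower()) if w]
        out, i = [], 0
        # Step 3: scan left to right, always taking the longest lexicon match.
        while i < len(words):
            translated = None
            for j in range(len(words), i, -1):
                fragment = " ".join(words[i:j])
                if fragment in LEXICON:
                    translated, i = LEXICON[fragment], j
                    break
            if translated is None:
                # No entry found: keep the word unchanged (step 4 would try
                # morphological analysis here) and move one word forward.
                translated, i = words[i], i + 1
            out.append(translated)
        return " ".join(out)

    for s in split_sentences("Mo kontan manze epise."):
        print(greedy_translate(s))  # -> "i love food spicy" (before reordering, step 5)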
2.2.3.4 Evaluation

The goal was to provide the most accurate translation; therefore, whenever new rules were
added, a series of tests was carried out to make sure that they did not affect the quality of
translation.

Table 1 presents some sample translations obtained when translating sentences from English to
Mauritian Creole.

TABLE I. TRANSLATION OF ENGLISH SENTENCES TO MAURITIAN CREOLE

Source text                                    Expected Result                                Target Text
She is a brilliant student                     Li enn zelev intelizan                         Li ene zelev intelizan
I love spicy food                              Mo kontan manze epise                          Mo kontan manze epise
I can't tell if he is listening to me or not   Mo pa capav dir si lip ekoute mwa oubien non   Mo pa capav dir si li pe ekoute mwa oubien pa
Either take it or leave it                     Swa pran li oubien les li                      Swa pran li oubien dekale li

2.2.3.5 Conclusion

In this paper, the authors implemented the first automated translation system that performs
translation of English sentences to Mauritian Creole and vice versa. The translation system
used the rule-based machine translation approach. The results obtained showed that the
implemented system could provide translation of acceptable quality, and the speed of
translation was also satisfactory. It would be necessary to investigate how the problem of
word-sense disambiguation can be solved and how translation can be improved for longer
sentences. The translation system would benefit both foreigners and the Mauritian population,
as it would enable them to swap between their mother tongue and Mauritian Creole with ease and
convenience.

CHAPTER THREE: METHODOLOGY

In this chapter, the methodology used in developing the system is described, along with the
data collection methods and tools. A flow chart is the best way to describe the procedure
used.

3.1 Introduction
The development of NNKK (the neural network Gikuyu-to-Kiswahili translator) follows the
procedure described by the chart below. [Flow chart not reproduced.]

3.2 Document Collection


Data collection is the process by which data are retrieved from the data sources and then used
for analysis. In developing the neural network machine translation system from Gikuyu to
Kiswahili, there will be a need to collect the actual data; this will support the development
of the tool.
In this case, the Bible (in both Gikuyu and Kiswahili), a web crawler, the Dogpile metasearch
engine and Google Scholar will be used to collect the data, which will then be analyzed.

3.3 Data Collection


The development of NNKK will require text written in both the Gikuyu language and the
Kiswahili language. These data will in turn support the learning/training process of the
neural-network-based tool. Learning the phrases of one language and making inferences from the
other will in turn help the tool understand and decode into the other language.
In this case, the Gikuyu Bible, the Kiswahili Bible, other Gikuyu and Kiswahili documents, and
the web, among other sources, will be used as the data sources.

3.4 Data analysis Tools


Data analysis is the process by which the collected data are evaluated using analytical and
logical reasoning, examining each component of the data provided. This process involves
inspecting, cleaning, transforming and modelling the data with the goal of discovering useful
information that supports a specific conclusion. The accuracy with which the data is analyzed
is of great importance.
In this case, a corpus tool built in Microsoft Excel will be used to analyze the data: tables
will be created in which each word is linked to its named-entity (NER) meaning.

3.5 Manual tuning


During the process of manual annotation, the data collected from the sources are written into
two separate Excel files. One file contains the phrases of one of the languages from the
collected data, and the other file contains the corresponding phrases in the target language.
The phrases are defined by the developer of the tool, but there must be consistency: each
phrase in the local language must match the corresponding phrase in the target language and no
other. Thus, the success of the tool depends on how well the two data sets, the target
language and the local language, are aligned.

3.6 Phrases

CHAPTER 4: IMPLEMENTATION

4.1 Corpus Development


For this project I used a text editor to develop the training data set for my model. I created
two files: one containing Gikuyu sentences and the other containing the corresponding
Kiswahili sentences.

[Image: a portion of the Gikuyu data set (not reproduced)]

[Image: a portion of the Kiswahili data set (not reproduced)]
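The two files are aligned line by line, so that line n of the Gikuyu file corresponds to line
n of the Kiswahili file. An illustrative sketch of the layout (the file names are
placeholders, and the Kiswahili line is my own rough gloss of the Chapter 1 example, not a
line from the actual corpus):

    src-train.txt (Gikuyu):     Mutumia ucio nia ciara
    tgt-train.txt (Kiswahili):  Mwanamke huyo amezaa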

4.2 Data Preparation


4.2.1 Tokenization

This step falls under data preprocessing, before the data reaches the tool for model training.
The main aim of this process is to convert the raw sentences into sequences of tokens. Two
main activities normally occur in this process (see the example below):

a) Normalization – applies uniform transformations to the source sequence to identify and
protect specific sequences, e.g. URLs, dates or Unicode sequences, in the source language.
b) Tokenization itself – transforms the normalized sentences into sequences of space-separated
tokens.
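As a small illustration (my own example, reusing a sentence from Chapter 1), tokenization
separates punctuation from words so that every token is delimited by a space:

    Raw sentence:  Mutumia ucio nia ciara.
    Tokenized:     Mutumia ucio nia ciara .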

4.2.2 Tokenization output

[Image: the preprocessing command used on the data set, and its output (not reproduced)]
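For reference, a typical preprocessing invocation of that era, assuming the PyTorch port of
OpenNMT (OpenNMT-py) and placeholder file names rather than the exact ones used in this
project, looked like this:

    python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                         -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt \
                         -save_data data/gikuyu-kiswahili

Here the src-* files hold the Gikuyu sentences and the tgt-* files hold the Kiswahili
sentences, one sentence per line.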

After preprocessing completes, at least three files are produced. The three key output files
are:

a) Training file – contains the source and target language data in tokenized form.
b) Validation file – contains held-out sentence pairs used to evaluate the model as it trains.
c) Vocabulary file – contains the vocabularies extracted from the training data set, which the
model uses to make inferences.

4.3 Model Training

The main training command is quite simple. It loads the preprocessed data file and then saves
a model file that will be used to make inferences during translation. By default, the OpenNMT
tool trains a 2-layer LSTM with 500 hidden units on both the encoder and the decoder.

[Image: the training command in action (not reproduced)]
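A minimal sketch of the training invocation, again assuming OpenNMT-py and the placeholder
names from the preprocessing step:

    python train.py -data data/gikuyu-kiswahili -save_model gikuyu-kiswahili-model

The command periodically writes model checkpoints whose names begin with the -save_model
prefix.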

[Table: options that can be added to the training command (not reproduced)]

4.4 Model Translation

After the corpus has been preprocessed and a model trained, the last step is to make
translations using the generated model.

Translation via the terminal can be done using a command of the form shown below:
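A sketch of the translation invocation (OpenNMT-py assumed; the checkpoint and file names are
placeholders):

    python translate.py -model gikuyu-kiswahili-model.pt -src data/src-test.txt \
                        -output pred.txt -replace_unk -verbose

This reads the Gikuyu test sentences from the -src file, writes the Kiswahili hypotheses to
pred.txt, and -replace_unk substitutes any unknown target tokens with the source words they
attend to most strongly.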

