
Predictive Text Entry Method For Somali Language On Mobile Phone

Submitted by
Mahamed Daud Abdulahi
M.Tech (CSE)
Roll no: 130454049
Under the supervision of
Dr. Vishal Goyal (Assistant Professor)

Department of Computer Science


Punjabi University, Patiala 147002
June, 2015
 
INTRODUCTION
What is Somali language?
• The Somali language originated from several different languages.
• Somali is spoken in several countries.
• Somali belongs to the Cushitic branch of the Afro-Asiatic family of languages.
• The Cushitic branch contains around 40 different languages, which are mainly spoken in Ethiopia, Somalia, Djibouti, Eritrea and Kenya.
• Somali is one of the most popular and widely used of the Cushitic languages.
• Before the colonial powers came to Africa, Somali was written in Arabic script.
• Today, Somali uses the English-language alphabet except P, V and Z.
• Somali has short and long voices, and strong and weak voices.
[Figure: map; the red part is where the Somali language is mainly spoken]
What is Natural Language Processing?
• NLP is a subfield of computer science with strong
connections to artificial intelligence.

• The term is normally used to describe the functions of a computer system that analyzes or synthesizes spoken or written language.
• NLP is a field of study dealing with computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.
How to communicate with a machine?
• Most communication between humans and machines is based on data entry techniques.
• Word prediction can be used to make data entry easier.
• Several data entry techniques can be used, such as speech and handwriting recognition, scanners, microphones and digital cameras.
• Interaction between human and machine mainly relies on the keyboard and pointing devices.
What is text prediction?

• Text prediction is one of the most important areas in which researchers are working to introduce new methods for mobile devices.
• Text prediction predicts what the user wants to write, based on frequencies.
• Text prediction can be divided into two categories:
– Character prediction
– Word prediction
Why do we need text prediction?
• It reduces the redundancy of the input and increases the speed and efficiency of the system.
• It makes the interface acceptable for elderly and disabled persons, for whom typing at length is difficult.
• Text prediction tries to predict the correct or most appropriate candidate word in a given context.
• Conventional prediction systems use word frequency lists to complete the words that the user intends to write.
Motivation
• Technology is developing rapidly and has reached the point where every country, and even every individual, takes advantage of it.
• The development of a country also depends on how much it uses technology.
• To improve the utilization of technology, it should be localized.
• Most computers support only specific languages, such as English and a few others.
• This creates a usability problem for the technology.
• Having observed these problems, which jeopardize the development of technology, it becomes necessary to do research to alleviate them.
• One of the research areas that most demands researchers’ attention is text prediction in a particular language.
• All of the above reasons motivated this research on text prediction techniques in a mobile application for the Somali language, so that Somali people can make use of the technology and play their role in its advancement.
Problem Definition
• Many Somali people do not know enough English to use their mobile phones. So the questions are:
• Why are these people not helped?
• Why do these people not participate in the development of the technology and make their language part of it?
• Why does the Somali language not serve as a text entry method?
• This issue needs to be addressed in order to explore the means for effective Somali text writing, particularly on mobile devices, while reducing the input burden.
Objectives

• The main objective of this research is to design a model and develop an application that predicts Somali language text input for mobile devices.
• To achieve this main objective, the following specific tasks should be performed:
– To review relevant works in other languages
– To explore the existing word-based predictive
models
– To develop a monolingual corpus upon which the
models can be built
– To extract vocabulary of Somali language from the
monolingual corpus
– To identify and adopt one of the appropriate prediction systems
– To design a Somali word prediction model
– To develop the necessary algorithms for the word prediction engine for the Somali language
– To develop a prototype to demonstrate the
effectiveness of the designed model and the
developed algorithms
– To evaluate the performance of the word
prediction system
Detailed Study Of The Problem
And Literature Survey
• As I. Scott MacKenzie et al. (2002) mentioned, the general input paradigms can be divided into two types:
– Pen-based input
– Keyboard-based input
• These two paradigms emerged from ancient technologies.
• Based on their discussion, user experience with typing and handwriting greatly influences expectations for text entry in mobile computing.
• As Amal Sirisena (2002) said, the input methods can be subdivided into:
– Key based
• 12 key (standard)
• Two key
• Small QWERTY keyboard
• Three key (date stamp)
• Five key
– Stylus based
• Handwriting with automatic recognition
• Typing on soft or virtual keyboards
• Difference between virtual and physical keyboards
Key-based Text Entry
• As mentioned in the previous slide, there are different input techniques for key-based text entry.
• Key-based text entry methods range from the standard keyboard, which is more ambiguous, to keyboards that are less ambiguous.
• Ambiguity occurs if there are fewer keys than symbols in the language.
[Figures: a fictitious alphabetic keyboard; a QWERTY keyboard; a standard 12-key keyboard]


Standard 12-keys
• Lee Butts and Dr. Andy Cockburn (2001) discussed the standard 12-key keypad and said that if a keyboard has fewer than 26 keys, more than one character must be mapped to the same key.
• The standard keypad has only 12 keys, so the 26 characters must be mapped onto these keys.
• Since the number of keys and the number of characters are unequal, multi-press is needed to input data.
• Multi-press can be categorized into two different techniques: multi-press with timeout and multi-press with a next button.
Multi-press with timeout
• The user cycles through the letters on a specific button by pressing it repeatedly within a predefined time, called the timeout.
• If the user presses within this interval, the interface cycles through the letters on this button.
• If the timeout passes, the interface selects the character currently on screen.
• If the user presses another key, the timeout restarts.
• The main disadvantage is that the timeout makes the system slow.
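The timeout rule described above can be sketched in a few lines of Python. This is only an illustration, not the implementation of any particular phone: the function name `decode_multipress`, the one-second default timeout and the representation of input as (key, time) pairs are assumptions made for the example. The letter layout is the usual 12-key one.

```python
# Letter layout of the standard 12-key keypad (keys 2 to 9).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}


def decode_multipress(presses, timeout=1.0):
    """Decode a list of (key, time_in_seconds) presses into text.

    Pressing the same key again within `timeout` cycles to the next letter
    on that key. Pressing a different key, or pausing longer than `timeout`,
    commits the letter currently selected and starts a new one.
    """
    text = []
    current_key, count, last_time = None, 0, None

    def commit():
        letters = KEYPAD[current_key]
        text.append(letters[(count - 1) % len(letters)])

    for key, t in presses:
        in_time = last_time is not None and (t - last_time) <= timeout
        if key == current_key and in_time:
            count += 1              # keep cycling on the same key
        else:
            if current_key is not None:
                commit()            # timeout passed or another key pressed
            current_key, count = key, 1
        last_time = t
    if current_key is not None:
        commit()                    # commit the last pending letter
    return "".join(text)
```

Typing "ab" shows the cost of the timeout: both letters sit on key 2, so the user must pause after the first press before pressing 2 twice more, for example `decode_multipress([("2", 0.0), ("2", 2.0), ("2", 2.3)])`.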
Multi-Press with next button

• It solves the problem of multi-press with timeout.
• It replaces the timeout with a “next” button.
• Instead of waiting for the timeout, the user now just presses the “next” button, which signals that cycling through the characters on that key is over.
Two-key
• Nesredien Suleiman (2008) mentioned that this type of text entry method needs two key presses to write a single character.
• The first key press selects the group of the character and the second one selects the position of the character.
• The user enters 1 to 4 to select the position of the character.
• This method is not suited for entering punctuation or special characters.
Small QWERTY Keyboard
• It is commonly used around the world.
• As Ahmed Sabbir Arif (2014) discussed, QWERTY keyboards come in different types:
– Standard QWERTY
– Mini QWERTY
– Projection QWERTY
– Virtual QWERTY
• The name QWERTY comes from the first six keys at the left of the top letter row.
• Keyboard size has a significant effect on data entry speed.
[Figures: standard QWERTY keyboard; mini QWERTY keyboard; virtual QWERTY keyboard; projection QWERTY keyboard]


Three key
• This method is called the date stamp or three-key method.
• It is called date stamp because it uses a method similar to that of a date stamp to select the required character.
• It has a wheel and an enter key.
• To select the required character, the user rotates the wheel and presses the enter key.
• This method is too slow and not preferable for common use because of multi-press.
Five key
• Dunlop, M.D. and Crossan, A. (2000) mentioned in their paper that this method shows all the alphanumeric characters and symbols on the display in rows and columns.
• They may be presented in alphabetical order or in the common QWERTY layout.
• It has five keys: four move the cursor in the four directions, and the fifth selects the required character.
• Multiple key presses are needed, which slows the system.


Pointing Text Entry
• As Koji Yatani and Khai N. Truong (2009) said, mobile devices commonly use pointing text entry, in which characters are entered with a finger or by touch, while some others use pointing devices such as a pen.
• Pen-based text entry techniques on mobile devices require the use of both hands.
• The thumb of the non-dominant hand is still available for secondary input.
Soft Keyboards
• I. Scott MacKenzie et al. (1999) discussed soft keyboards and said that any keyboard that appears on the display in digital form is called a soft keyboard.
• To enter input with this method, a stylus or finger is used.

• The main advantage of soft keyboards is space efficiency.

• It has good performance, and it is simple to use.


Predictive Text Entry
• Predictive text is another type of text entry method; it is dictionary based, as Gudisa Tesema (2013) discussed.
• To type a character, the user simply flexes the relevant finger to select the correct column.
• After a sequence of finger flexes, the user is presented with the predicted word.
• If the initial prediction is incorrect, the user can cycle through the alternative matching words to reach the desired word.
• During this process of text entry, several words often clash for a given sequence of key presses.
• The user gets a list of words ranked according to their probability, and the most suitable word is selected for the input.
• It requires only one key press to enter each character.
• When a key is pressed, the system compares the key sequence with the possible words in a linguistic database to guess the intended word.
• Sometimes, ambiguity arises when two or more words match the given key sequence.
• An up/down arrow or a special “next” key is used to choose an alternative word if the intended word is different from the displayed one.
• For example, four words match the key sequence 2-2-5-3.
• From most to least probable, the words and the required key sequences (N is the NEXT key) are:
– able 2-2-5-3-0
– cake 2-2-5-3-N-0
– bald 2-2-5-3-N-N-0
– calf 2-2-5-3-N-N-N-0
• If the user intends calf, then three presses of the special NEXT key are required to reach the correct response.
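The 2-2-5-3 example can be reproduced with a small dictionary lookup. The sketch below is illustrative only: the function names and the word frequencies are made up for the example (they are chosen so that the ranking matches the order given above), and a real system would take its frequencies from a corpus.

```python
# Letter layout of the standard 12-key keypad (keys 2 to 9).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

# Hypothetical frequencies, chosen only to give the ranking in the example.
LEXICON = {"able": 400, "cake": 300, "bald": 200, "calf": 100, "the": 900}


def key_sequence(word):
    """Return the keys pressed (one per letter) to enter `word`."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def candidates(keys, lexicon):
    """Return all words matching `keys`, most frequent first."""
    matches = [w for w in lexicon if key_sequence(w) == keys]
    return sorted(matches, key=lambda w: (-lexicon[w], w))


def next_presses(word, lexicon):
    """Number of NEXT presses needed before `word` is displayed."""
    return candidates(key_sequence(word), lexicon).index(word)


print(candidates("2253", LEXICON))     # ['able', 'cake', 'bald', 'calf']
print(next_presses("calf", LEXICON))   # 3
```

The ambiguity comes entirely from the key mapping: all four words collapse to the same key sequence, so only the frequency ranking decides which one is shown first.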
• To overcome this multi-press, a prefix-based method should be used.
• This method uses letter-wise prediction.
• It is based on a stored database of prefix probabilities.
• A prefix is the sequence of letters preceding the current keystroke.
• For example, if the user presses 3 with the prefix “th”, the most likely next letter is e, because “the” is far more probable in English than either “thd” or “thf”.
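A minimal sketch of this letter-wise prediction, assuming a stored table of how often each letter follows a given prefix. The table `PREFIX_COUNTS` and its numbers are invented for illustration; only the relative order (e far ahead of d and f after "th") reflects the example above.

```python
# Letter layout of the standard 12-key keypad (keys 2 to 9).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Hypothetical counts: how often each letter follows the prefix.
PREFIX_COUNTS = {
    "th": {"e": 950, "a": 300, "i": 250, "f": 2, "d": 1},
}


def most_likely_letter(prefix, key, prefix_counts):
    """Pick the letter on `key` that most often follows `prefix`.

    If the prefix is unknown, every letter scores 0 and the first
    letter printed on the key is returned.
    """
    counts = prefix_counts.get(prefix, {})
    return max(KEYPAD[key], key=lambda ch: counts.get(ch, 0))


print(most_likely_letter("th", "3", PREFIX_COUNTS))   # e
```

Only the letters on the pressed key compete, so one key press per character is enough whenever the most probable letter is the intended one.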
Related work
Predictive Text Entry Speed on Mobile Phones
• Silfverberg et al. (2000) designed a method for predicting the potential text entry speed of expert users for input methods.
• They discussed the three most common text entry approaches: multi-press, two-key press and T9.
• The designed model has two components:
– movement model (Fitts’ law)
– linguistic model (digraph probabilities).
• Fitts’ law is a quantitative model for rapid, aimed
movement.

• It is used to calculate the potential text entry speed of an expert user.
• This model has been applied with success to pointing devices and on-screen keyboards.
• The linguistic model is based on an analysis of a representative sample of common English; a probability Pij is given for each letter pair.
• In general, the “timeout kill” strategy is faster than the “timeout” strategy.
• Multi-press is slower than the two-key press method when it uses the timeout strategy, and faster when it uses the timeout kill strategy.
• The linguistic model embedded in the T9 method is used to determine how often the NEXT function is required.
• The predictive text entry method performs better than the other text entry methods.
• The linguistic model is the core element of a prediction system.
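For reference, the two components of such a model are often written as follows. This is a sketch in commonly used notation rather than a quotation from the paper: MT is the movement time, A the distance to the target key, W the key width, a and b empirically fitted constants, P_ij the probability of the letter pair i, j, and CT_ij the time to enter letter j after letter i. The factor 60/5 assumes the usual convention of five characters per word.

```latex
% Movement model (Fitts' law): time to move to a key of width W at distance A
MT = a + b \log_2\!\left(\frac{A}{W} + 1\right)

% Linguistic model: average time per character, weighted by digraph probability
CT_{avg} = \sum_{i} \sum_{j} P_{ij} \, CT_{ij}

% Predicted expert speed in words per minute (CT_avg in seconds)
WPM = \frac{1}{CT_{avg}} \cdot \frac{60}{5}
```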
Effects of N-gram order and Training Text size on Word
Prediction
• Lesher et al. (1999) stated that, irrespective of the n-gram order, the performance of the system increases as the training text size increases.
• The increase is much more pronounced for trigrams than for unigrams.
• With higher n-gram orders, the keystroke savings for the given training texts increase consistently.
• The performance gain in moving from bigram to trigram prediction is considerably less dramatic.
• Whenever the order increases, the performance difference decreases.
METHODOLOGY

• This research work will be accomplished as follows:
– Literature review
– Data collection
– Data analysis
– Select implementation tools
– Design
– Experiment
– Conclusion and recommendations
WORD PREDICTION MODEL FOR SOMALI LANGUAGE

• The word prediction model


• Word prediction techniques are popular methods within the field of Augmentative and Alternative Communication (AAC). They are commonly used as communication aids for people with disabilities: they accelerate writing, reduce the effort needed to type and suggest the correct word.
Statistical prediction
• Most prediction systems use statistical analysis to predict what the user wants to write.
• During the prediction process, the probabilities of the words are used.
• These probabilities may be fixed or dynamic.
• In statistical prediction, there are two different methods:
– Fixed lexicon
– Adaptive lexicon
Fixed lexicon
• This is the easiest method in statistical prediction.
• The words in this method have fixed frequencies.
• This method uses two techniques to make a prediction:
– All words are arranged in order of their frequencies, so that the few most frequent words are at the top of the list.
– The second technique considers how the words follow each other.
• The main drawback of the fixed lexicon is that it is fixed at the design stage of the system and does not change with use.
Adaptive lexicon
• This method changes the frequencies of the words in the dictionary or lexicon as the user builds sentences.
• The adaptive lexicon considers recently used words: when a word is used, its priority increases, so it is more likely to be suggested again.
• The adaptive lexicon is dynamic and depends on the usage of the words.
• If a new word is used that was not in the lexicon or dictionary, it is added to the dictionary with a frequency of one.
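The difference from a fixed lexicon is easiest to see in code. The class below is a minimal sketch, not the thesis implementation: the name `AdaptiveLexicon`, its methods and the starting counts are assumptions for illustration. Each use of a word raises its count, and an unseen word enters with a count of one.

```python
class AdaptiveLexicon:
    """Word list whose frequencies change as the user types."""

    def __init__(self, initial=None):
        # word -> frequency; copied so the caller's dict is not modified
        self.freq = dict(initial or {})

    def use(self, word):
        """Record one use of `word`; an unseen word starts at frequency 1."""
        self.freq[word] = self.freq.get(word, 0) + 1

    def suggest(self, prefix, limit=5):
        """Words starting with `prefix`, most frequently used first."""
        matches = [w for w in self.freq if w.startswith(prefix)]
        matches.sort(key=lambda w: (-self.freq[w], w))
        return matches[:limit]


lexicon = AdaptiveLexicon({"waxaa": 3, "waxay": 2})   # made-up counts
print(lexicon.suggest("wax"))    # ['waxaa', 'waxay']
lexicon.use("waxay")
lexicon.use("waxay")             # 'waxay' now has count 4
print(lexicon.suggest("wax"))    # ['waxay', 'waxaa']
```

With a fixed lexicon the second call would return the same order as the first, however often the user typed "waxay".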
Corpus Preparation
• To complete the task of word prediction, we need statistical information such as the words and their frequencies.
• Since no Somali corpus is available at this time, a corpus had to be prepared to achieve our goal.
• Bearing in mind that a good collection of words helps to design a better model, we collected data from different sources to prepare our corpus.
Corpus Preparation (continued)
• The corpus is used to get statistical information such as the frequency of the words in the corpus and the average word length of the Somali language.
• We also use the corpus to decide the N in the N-gram model, where N is the number of characters after which the system starts to predict the intended word.
• The statistical information and the value of N are used to design the text prediction system and the algorithm implemented.
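The statistics listed above can be gathered with a short script. This is a sketch under stated assumptions: the tokenizer (lower-casing and keeping runs of Latin letters and apostrophes) and the choice to average word length over unique words are guesses made for the example, and the source does not say how its own figures were computed.

```python
import re
from collections import Counter


def corpus_statistics(text):
    """Compute word frequencies and word-length statistics for a corpus."""
    # Assumed tokenizer: lower-case, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    frequencies = Counter(words)
    unique = len(frequencies)
    # Average length and length histogram are taken over unique words.
    average = sum(len(w) for w in frequencies) / unique if unique else 0.0
    return {
        "total_words": len(words),
        "unique_words": unique,
        "frequencies": frequencies,
        "average_word_length": average,
        "length_counts": Counter(len(w) for w in frequencies),
    }


stats = corpus_statistics("Iyo oo ka iyo ku iyo oo")
print(stats["frequencies"].most_common(2))   # [('iyo', 3), ('oo', 2)]
```

Dividing a word's frequency by `total_words` gives the percentage column of a frequency table, and `length_counts` gives the number of unique words of each length.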
The twenty most frequently used words in the corpus
S.No   Word     Frequency   % of words
1      iyo      16076       2.33
2      oo       13653       1.98
3      ka       12670       1.84
4      ku       11840       1.72
5      ay       10387       1.51
6      ee       9806        1.42
7      waa      9419        1.37
8      in       8165        1.19
9      soo      6955        1
10     la       5403        0.78
11     uu       5137        0.75
12     waxaa    4572        0.66
13     waxay    4325        0.63
14     kala     3763        0.55
15     mid      3728        0.54
16     loo      3627        0.53
17     ama      3502        0.5
18     aad      3297        0.48
19     si       3251        0.47
20     lagu     3101        0.45
Table 3.2: Description of the corpus

Data item              Description of the corpus
Total words            688830
Unique words           119923
Average word length    7

Percentage of words based on length
Word length      Words    Percentage
1-Character      123      0.1%
2-Characters     315      0.26%
3-Characters     1060     0.88%
4-Characters     2830     2.36%
5-Characters     7602     6.34%
6-Characters     12529    10.45%
7-Characters     15361    12.81%
8-Characters     18701    15.59%
9-Characters     17215    14.36%
10-Characters    14726    12.28%
11-Characters    10998    9.17%
12-Characters    7320     6.1%
13-Characters    4744     3.96%
14-Characters    2828     2.36%
15-Characters    1686     1.41%
16-Characters    975      0.81%
17-Characters    558      0.47%
18-Characters    206      0.17%
19-Characters    92       0.08%
20-Characters    45       0.04%
Table 3.3: Description of Number of Letters per Word from Different Dictionaries

Dictionaries (title; author):
(1) Qaamuus, Ereykoobe; Saalax Xaashi Carab
(2) Qaamuuska Af-Somaliga; Annarita Puglielli iyo Cabdalla Cumar Mansuur
(3) Qaamuuska Afsomaliga; Yasiin Cismaan Keenadiid
(4) Qaamuus Caafimaad Qeexan; Dr. Liban Ali Diriye
(5) Qaamuuska Soomaali-Talyaani; Agostini F., A. Puglielli e Ciise M. Siyaad

                       (1)       (2)       (3)       (4)      (5)
Unique words           40,000    72,065    15,000    5,000    30,000
Average word length    7         8         7         8        7
1-Character            167       25        52        6        56
2-Characters           514       49        122       39       318
3-Characters           1155      569       415       114      883
4-Characters           2323      1640      1052      236      2031
5-Characters           4354      4868      2033      454      3880
6-Characters           5532      7602      2451      629      4664
7-Characters           5891      8346      2242      705      4616
8-Characters           5621      11354     2069      651      4211
9-Characters           4573      10314     1632      642      3373
10-Characters          3429      9056      1123      522      2476
11-Characters          2267      6926      459       400      1579
12-Characters          1335      4789      472       231      833
13-Characters          859       3076      296       157      450
14-Characters          560       1976      131       135      336
15-Characters          314       1021      110       58       162
16-Characters          179       454       41        20       87
Architecture of the System
• The proposed system needs at least two characters, and at most n characters, from the user.
• The words of the lexicon have frequencies assigned from the corpus, and prediction starts from these words and their frequencies.
• If words have the same frequency, the system orders them alphabetically.
• If a word is not in the database, the system adds it with a frequency of one.
Components of the system
• The Word Prediction Engine has three main components that participate in the prediction process:
– Start Engine
– Word Selector
– Word Ranker
• The Start Engine component gets two characters, one by one, from the user.
• The Start Engine initiates the Word Selector component once it has the two characters.
• Once initiated, the Word Selector searches the dictionary (lexicon) for words whose first two or more characters match the entered characters.
• If there are no matching words, the Word Selector initiates another component, which adds to the dictionary.
• This component adds any new word to the dictionary with a frequency of one.
• If there are matching words, the list of words found and their corresponding frequencies is delivered to the next component.
• Word Ranker: ranks each word in the list of found words according to its frequency.
• Words with the highest frequency get the highest rank, and those with the lowest frequency get the lowest rank.
• When two or more words have the same frequency, the Word Ranker decides their rank by alphabetical order.
• These ranking policies determine the word(s) to be predicted, and a list of predicted words is displayed.

• The user selects his/her desired word from the list.
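The three components can be sketched as three small Python functions. This is an illustration of the described flow, not the prototype's code (which targets Android): the function names, the `limit` parameter and the in-memory dictionary are assumptions. The frequencies in the usage lines are taken from the corpus table of the twenty most frequent words.

```python
MIN_PREFIX = 2  # the system needs at least two characters from the user


def select_words(prefix, lexicon):
    """Word Selector: words whose first characters match the prefix."""
    return {w: f for w, f in lexicon.items() if w.startswith(prefix)}


def rank_words(matches):
    """Word Ranker: highest frequency first, ties in alphabetical order."""
    return sorted(matches, key=lambda w: (-matches[w], w))


def predict(prefix, lexicon, limit=5):
    """Start Engine: start prediction once enough characters are typed."""
    if len(prefix) < MIN_PREFIX:
        return []
    return rank_words(select_words(prefix, lexicon))[:limit]


def add_if_new(word, lexicon):
    """Add a word that was not found to the lexicon with frequency one."""
    if word not in lexicon:
        lexicon[word] = 1


lexicon = {"waa": 9419, "waxaa": 4572, "waxay": 4325, "ka": 12670, "kala": 3763}
print(predict("w", lexicon))     # []
print(predict("wa", lexicon))    # ['waa', 'waxaa', 'waxay']
print(predict("wax", lexicon))   # ['waxaa', 'waxay']
```

Sorting on the pair (negative frequency, word) implements both ranking policies in one step: frequency decides first, and alphabetical order only breaks ties.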


System Architecture

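[Figure: system architecture flowchart. The Start Engine passes the typed characters to the Word Selector, which searches the Somali Lexicon. Matched? No: add to database. Yes: the Word Ranker orders the matches by frequency. Same frequency? Yes: order by alphabetical order. No: output the predicted list.]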
Implementation and Experiment
• The environment for developing mobile applications is not comparable to the desktop development environment because it is more restrictive.
• The Android platform promised openness, affordability, open source code and a high-end development framework.
• The Android platform embraces the idea of general-purpose computing for handheld devices.
• It is a comprehensive platform featuring a Linux-based operating system stack for managing devices, memory and processes.
Android emulator
• The Android SDK ships with an Eclipse plug-in called Android Development Tools (ADT).
• This Integrated Development Environment (IDE) tool is used for developing, debugging and testing Java applications.
• The Android SDK can also be used without ADT, through command-line tools instead.
• Both approaches support an emulator that you can use to run, debug and test your applications.
• The emulator has some disadvantages compared with a real device: it is somewhat slow, takes time to start the first time, and does not support some applications.
SQLite databases
• SQLite is a very popular embedded database.
• It combines a clean SQL interface with a very small memory
footprint and decent speed.
• It is public domain, so everyone can use it.

• Many firms (Adobe, Apple, Google, Sun and Symbian) and open source projects (Mozilla, PHP, Python) ship products with SQLite.
• Since SQLite uses an SQL interface, it is straightforward to use for people with experience in other SQL-based databases.
• Its native API is not JDBC, and JDBC might be too much overhead for a memory-limited device like a phone.
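To show how a word list with frequencies might sit in SQLite, here is a minimal sketch using Python's standard `sqlite3` module rather than the Android API used by the prototype. The table name `lexicon`, its columns and the helper functions are assumptions for illustration; the source does not give its schema.

```python
import sqlite3


def open_lexicon(path=":memory:"):
    """Open (or create) a lexicon database with one word/frequency table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon ("
        "word TEXT PRIMARY KEY, frequency INTEGER NOT NULL)"
    )
    return conn


def add_word(conn, word):
    """Insert a new word with frequency one; existing words are untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO lexicon (word, frequency) VALUES (?, 1)", (word,)
    )
    conn.commit()


def lookup(conn, prefix, limit=5):
    """Words starting with `prefix`: by frequency, then alphabetically.

    Assumes the prefix contains no LIKE wildcards ('%' or '_').
    """
    rows = conn.execute(
        "SELECT word FROM lexicon WHERE word LIKE ? "
        "ORDER BY frequency DESC, word ASC LIMIT ?",
        (prefix + "%", limit),
    )
    return [row[0] for row in rows]


conn = open_lexicon()
conn.executemany(
    "INSERT INTO lexicon (word, frequency) VALUES (?, ?)",
    [("waa", 9419), ("waxaa", 4572), ("waxay", 4325)],
)
print(lookup(conn, "wax"))   # ['waxaa', 'waxay']
```

Letting the database do the filtering and ordering keeps only the few suggested words in memory, which suits a memory-limited device.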
User interfaces
• The prototype of the predictive system was developed for Android, which uses XML files to create the user interfaces.
• The user interface elements used in this prototype are TextEditor, TextView, Autotextcompletion and buttons.
• The Database activity is used only by the developer, to insert data into the dictionary, to delete data from the dictionary and to see all the data available in the database.
• The user accesses the database only to retrieve data and to add new words to the dictionary.
[Figure: Database activity to insert, delete and show data]
[Figure: the predicted words after a prefix of length two (bu) is entered in the text editor]
[Figure: the predicted words after a prefix of length three (bul) is entered in the text editor]
[Figure: screenshot of predicted words after 3 words and 4 prefixes]
Experiment
• To check the accuracy of the Somali text prediction system, an experiment was performed.
• The experiment measures the accuracy of the system using sample data collected from different sources.
• The experiment was conducted on the Android emulator.
Result of the experiment
Word length      Collected words for test   Predicted words   Non-predicted words
3-Characters     62                         61                1
4-Characters     150                        149               1
5-Characters     263                        258               5
6-Characters     240                        231               9
7-Characters     209                        202               7
8-Characters     215                        202               13
9-Characters     127                        115               12
10-Characters    100                        89                11
11-Characters    68                         58                10
12-Characters    40                         35                5
13-Characters    13                         11                2
14-Characters    10                         8                 2
15-Characters    2                          2                 0
16-Characters    1                          1                 0
Total            1500                       1422              78
Precision, Recall and F-measure
• Precision, recall and the F-measure are set-based measures.
• They are computed using unordered sets of documents.
• Given these ingredients, how is system effectiveness measured? The two most frequent and basic measures of information retrieval effectiveness are precision and recall.
• Precision (P) is the ratio of the number of relevant records retrieved to the total number of relevant and irrelevant records retrieved.
• In our case, it is the number of predicted words divided by the number of collected words. It is usually expressed as a percentage.
• Recall (R) is the ratio of the number of relevant records retrieved to the total number of relevant records in the database.
• In our case, for each word length it is the number of predicted words of that length divided by the total number of predicted words. It is usually expressed as a percentage.
• F-measure: a combined measure that assesses the tradeoff between precision and recall. It is their weighted harmonic mean, and the harmonic mean is a very conservative average.
Precision, Recall and F-measure
Word length      Precision   Recall    F-measure
3-Characters     98.39%      4.29%     8.22%
4-Characters     99.33%      10.48%    18.96%
5-Characters     98.1%       18.14%    31.06%
6-Characters     96.25%      16.24%    27.79%
7-Characters     96.65%      14.21%    24.78%
8-Characters     93.95%      14.21%    24.69%
9-Characters     90.55%      8.09%     14.85%
10-Characters    89%         6.26%     11.7%
11-Characters    85.29%      4.08%     7.79%
12-Characters    87.5%       2.46%     4.79%
13-Characters    84.62%      0.77%     1.53%
14-Characters    80%         0.56%     1.11%
15-Characters    100%        0.14%     0.28%
16-Characters    100%        0.07%     0.14%
Average          92.83%      7.14%     12.69%
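The rows of this table can be recomputed from the experiment counts. The sketch below uses the collected and predicted word counts from the result table, takes recall as predicted words of one length over all 1422 predicted words, and uses the balanced F-measure 2PR / (P + R). The function and variable names are illustrative, and small differences in the last digit can appear depending on whether intermediate values are rounded.

```python
# word length -> (collected words, predicted words), from the result table
RESULTS = {
    3: (62, 61), 4: (150, 149), 5: (263, 258), 6: (240, 231),
    7: (209, 202), 8: (215, 202), 9: (127, 115), 10: (100, 89),
    11: (68, 58), 12: (40, 35), 13: (13, 11), 14: (10, 8),
    15: (2, 2), 16: (1, 1),
}
TOTAL_PREDICTED = sum(predicted for _, predicted in RESULTS.values())  # 1422


def metrics(collected, predicted, total_predicted):
    """Return (precision, recall, F-measure) as fractions between 0 and 1."""
    precision = predicted / collected
    recall = predicted / total_predicted
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


for length, (collected, predicted) in RESULTS.items():
    p, r, f = metrics(collected, predicted, TOTAL_PREDICTED)
    print(f"{length}-Characters  P={p:.2%}  R={r:.2%}  F={f:.2%}")
```

For 3-character words this gives P = 61/62, R = 61/1422 and F of about 8.22%, matching the first row. Because recall is defined relative to all predicted words, the recall column sums to 100% across word lengths.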
Future Works
• We identify the following future work:
– Preparing an error-free corpus of adequate and better size must be one of the tasks done in the future. In addition, having a standard dictionary with the maximum possible number of words is very important and will increase the accuracy of the word prediction system.
– Since the searching efficiency of the system has not been taken into consideration in this work, one can extend this work by adding this feature.
– This work can be done for other local languages like Afar, which are spoken in Ethiopia and use Latin character sets.
– Integrate this word prediction system with different Somali text editing systems.
– Extend this work into a context-level prediction system.
– Extend this work to a phrase- and sentence-level prediction system.
Thanks for all!!!
