Sware Gex

The document presents SwaRegex, a rule-based lexical transducer for morphological segmentation of Swahili verbs, achieving high accuracy compared to a similar model. The model was tested on two datasets, demonstrating significant improvements in performance, particularly for verbs of Bantu origin. SwaRegex serves as a valuable tool for learners and researchers of the Swahili language, enhancing understanding of its verb syntax and aiding in natural language processing applications.

Uploaded by

hamisiabbas2005

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views8 pages

Sware Gex

Uploaded by

hamisiabbas2005

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Journal website: https://journals.must.ac.

SwaRegex: a lexical transducer for the morphological segmentation of

Swahili verbs
George Mutwiri,*1 Mutua Makau,1 Amos Omamo1
a School of Computing and Informatics. Meru University of Science and Technology.

The morphological syntax of the Swahili verb comprises 10 slots. In this work,
KEY WORDS we present SwaRegex, a novel rule-based model for the morphological segmen-
Natural language processing tation of Swahili verbs. This model is designed as a lexical transducer, which
Morphological segmentation accepts a verb as an input string and outputs the morphological slot occupied by
the morphemes in the input string. SwaRegex is based on regular expressions
Low-resourced language
developed using the C# programming language. To test the model, we designed
Lemmatization
a web scraper that obtained verbs from an online Swahili dictionary. The scrap-
Part-of-speech tagging per separated the corpus into two datasets: dataset A, comprising 163 verbs
Word segmentation Bantu origin; and dataset B, containing the entire set of 715 non-Arabic verb
Segmentation entries obtained by the web scrapper. The performance of the model was tested
against a similar model designed using the Xerox Finite State Tools (XFST). The
regular expressions used in both models were the same. SwaRegex outper-
formed the XFST model on both datasets, achieving a 98.77% accuracy on da-
taset A, better than the XFST model by 41.1%, and a 68.67% accuracy on dataset
B, better than the XFST model by 38.46%. This work is beneficial to prospective
learners of Swahili, by helping them understand the syntax of Swahili verbs, and
is an integral teaching aid for Swahili. Search engines will benefit from the lexi-
cal transducer by leveraging its finite state network when lemmatizing search
terms. This work will also create more opportunities for more research to be
done on Swahili.

∗
Introduction In rule-based methods, regular expressions can be
compiled into lexical transducers. A lexical transducer
Morphological segmentation is a subfield of mor- is a finite state transducer that accepts a surface form
phological analysis that involves identifying mor- of a word as input and produces its lexical form, and
pheme boundaries within text (Liu et al., 2021). vice versa. The lexical form of a word is made up of its
Words in a language comprise a set of morphemes lemma and tags that depict its morphological roles.
arranged so as to conform to the syntax rules of the The lemma of a word is its dictionary representation
language. Natural language processing (NLP) acts as a (Mohrs & Cajo, 2020). Creating rules for lexical trans-
bridge between computers and natural human lan- ducers necessitates an understanding of finite state
guage. For this to be possible, it is important that automata as well as the morphological rules and syn-
computers be able to identify the morpheme con- tax of the target language.
structs that make up words in the language in con- Morphological analysis focused on morphological
junction with the morphological rules that apply to alternations in Kinyarwanda was presented by
the language. This in turn makes it possible for the (Muhirwe, 2007). The analyzer was rule-based, em-
computer to analyze words based on their constituent ploying finite state machines. The model was imple-
morphemes and also to generate words based on mented in Twolc (Two-level compiler) and based on
morphological rules. the Xerox finite state tool. The author did not, howev-
While researchers have recently shown more in- er, publish the results of their model.
terest in poorly resourced languages, there remains (Katushemererwe & Issue, 2010) presented a fi-
scant work on certain aspects of these languages. In nite state method for the automatic analysis of Runya-
this paper, we present SwaRegex, a lexical transducer kitara nouns. All noun lexemes in the language were
for Swahili verbs. The paper is structured as follows. built into fsm2. Nouns were extracted from a Runya-
We first review related literature. Following that, fi- kitara dictionary and manually coded into noun sub-
nite transducers, the morphology of the Swahili lan- classes. The system was evaluated using a dataset
guage and the Swahili verb, the design of SwaRegex extracted from a weekly newspaper and an orthogra-
and its XFST counterpart, and experimental results phy reference book, both in a different language.
are discussed. Their model recorded a precision of 80% on 4472
words and a recall of 80% on the 5599-word corpus.
Related Works Runyagram, a formal system for the morphological
segmentation of Runyakitara verbs based on the fsm2
There are two methods by which morphological interpreter, was presented by (Fridah & Thomas,
segmentation can be performed: rule-based and ma- 2010). Just like their similar model for nouns
chine learning methods. Machine learning methods (Katushemererwe & Issue, 2010), Runyagram finite
improve their performance automatically by relying state transducer was comprised of a special symbol
on experience and datasets (Mahesh, 2018). These file, a grammar file, and a replacement rule file con-
methods can be supervised, semi-supervised, or unsu- taining about 34 rules. The grammar contained about
pervised. Supervised methods rely on labeled data to 330 rules and was converted into a finite-state accep-
learn, while unsupervised methods rely on unlabeled tor containing about 1200 transitions and about 800
data. Semi-supervised methods, however, rely on states. The system was tested against 3971 verbs
both labeled and unlabeled data (Li et al., 2021). Ma- from an orthography reference book and a dictionary
chine learning takes time in order for training to oc- of another Bantu language. The system scored a recall
cur. It is also incapable of distinguishing between der- of 86% and a precision of 82%.
ivational and inflectional morphology in natural lan- While the above research focuses on a variety of
guage processing (Mott et al., 2020). poorly resourced languages, our model is keen on the
Rule-based methods rely on a set of rules that de- morphology of Swahili verbs. No work has been pre-
fine how the correct input is mapped to the relevant sented before on the morphological segmentation of
output. Rule-based methods are suitable when com- Swahili verbs. Our model implements regular expres-
putational language resources are unavailable or in- sions, a rule-based technique for morphological seg-
adequate (Ammari & Zenkouar, 2021a). These rules mentation. We chose rule-based over machine learn-
are usually in the form of regular expressions, which ing-based techniques because rule-based techniques
are translated into finite state networks. This means are best suited for languages with limited resources
that rule-based methods achieve more accurate re- (Ammari & Zenkouar, 2021a).
sults since they perform only according to the availa-
ble rules. However, these methods post poor results
on input whose rules have not been defined.

78
2
Methods and Materials East African language with over 80 million speakers.
The language is written in the Roman alphabet and
Method has only 5 vowels (a, e, i, o, u), 23 consonants (b, c, d,
Finite State Transducers f, g, h, i, j, k, l, m, n, p, r, s, t, v, w, y, th, dh, gh, sh) and 6
A finite state transducer is a finite state autom- nasal consonants (ny, ng’, ng, nd, mb, mw). Every
aton that allows not only analysis of input against lex- Swahili word concludes with a vowel (Ngugi et al.,
ical forms, but also generation of the base form of a 2010). (Schonhof & Wilkans, 2012) suggest that Swa-
word given its lexical form. A finite state automaton is hili morphology involves agglutination, the noun class
made up of five elements Q, ∑, q0, F, and T. These ele- system, concordance, conjugation (or inflection), and
ments have the following roles: use of locatives.
Q – a finite set of states The basic word order in Swahili is SVO (subject
∑ - the alphabet (finite set of input symbols) verb object), with the subject and object optional, for
q0 – a state in Q that is the start state of the au- example, Kijana anacheza mpira (The young boy/girl
tomaton is playing the ball). ‘Kijana’ is the subject and ‘mpira’
F – a finite set of final states. F is a subset of Q the object. Removing the subject and object in this
T – a transition function that determines when example leaves only the verb “anacheza”, which is a
the automaton moves from one state to another morphologically correct simple sentence (Kituku,
A finite state transducer has an additional element 2019).
to its finite state automaton: the output function,
which is responsible for yielding output from the au- Morphological Syntax of the Swahili Verb
tomaton. (Abdulla et al., 2019; Gerdjikov, 2018) de- The following is the syntactical format for the Swa-
scribe a finite state transducer as a finite automaton hili verb (Deo, 2016):
tool defined by: Pre-Initial (optional): Negative marker ha-
Q a finite set of states; (NEG1); mutually exclusive with NEG2. When the
Σ input alphabet; Formative 1 slot is marked by the present continuous
Γ the output alphabet; marker -na- and the Pre-Initial slot marked by -ha-,
I ⊆ Q the set of initial states the final vowel should be marked by the negative pre-
F ⊆ Q the set of final states sent indicative -i.
δ ⊆ Q × (Σ ∪ {λ}) × (Γ ∪ {λ}) × Q, the transition Initial (optional): Subject Concord (SC); Pro-
function nouns
Given a finite state transducer y:x, y denotes the Post-Initial (optional): Negative marker si-
lexical string while x, the surface string. The upper- (NEG2); mutually exclusive with NEG1; conjunctive/
side symbol is y while x is the lower-side symbol. subjunctive. When used, the final vowel/mood is
When the input string is x, the transducer traverses marked by the subjunctive marker -e.
its finite state automaton, searching for an arc labeled Formative 1 (optional): TAM; -na-, -li-, -ta-, -ka-
x leaving the start state. If such an arc exists, the net- , -me-, -sha-, -ki-, -nge-, -ngali-, -ngeli-
work returns the lexical symbol, which in this case is Formative 2 (optional): Subordinator (SUBO);
y. This operation is called analysis/lookup, and it re- object/ relative pronoun; -ye-. This slot also takes the
turns the lexical string for a given surface input. By same morphemes as the Post-Final.
traversing the finite state network backwards, this Pre-radical (optional): Object concord (OC) -ni-
transducer can also be used to determine whether y is or reflexive marker -ji-.
a valid string of the language. This process is called Verbal base: verb stem lexical base or root
generation/ look down, and it takes as input the lexi- Pre-Final (optional): verbal derivational suffix-
cal strings and returns the corresponding surface es (SUFF)
strings (Karttunen, 2001). Final: TAM; -e (subjunctive); -a(indicative); -i
(neg. pres. Indicative)
Materials Post-Final (optional): Plural addressee marker
This section is organized into two sub-sections. (PLA)/ plural imperative: -ni; reference marker (REF)/
The first explores the Swahili language and the mor- relative pronoun – particle -o. If this slot is occupied,
phological structure of its verbs. The second section the tense slot (Formative 1) is not marked at all.
discusses the structure of SwaRegex.
Swahili verbal inflections
Swahili The verb is made up of a root, that remains un-
changed no matter the inflection used (Mdee, 2016).
Swahili is an agglutinative (Shikali et al., 2019) Swahili verbal roots take on affixes when inflected.

79
3
Inflection on verbs is applied after the verbal root. An from the website. In total, dataset B contained 715
exception to inflection in Swahili is proper nouns: entries.
they are never inflected (Steinberger et al., 2011).
Once inflected, the meaning of the verb changes, The Lexical Transducer
which also requires changes in the subject and/or
object concordance. (Deo, 2016) suggests the follow- SwaRegex was developed in Visual Studio using C#
ing verbal inflections and the possible suffixes they version 8.0. The model contained two sets of regular
take: expressions. The first set defined morphemes for each
Active suffix (ACT): -a, -i, -u, -e morphological slot, while the second set described
Applicative/ applied/ dative/ prepositional possible combinations of morphemes from the first
(APPL): ia, -ea, -ilia, -elea set of expressions that form a morphologically correct
Associative/ reciprocal suffix (REC): -an verb.
Causative (CAUS): -isha, -esha, -iza, -eza, -sha, - To define the morphemes allowed in each morpho-
za, -ya, -sa logical slot, ten variables were defined, one for each
Contactive suffix:(CONT) -et, -at slot. Each variable was a regular expression specify-
Conversive/ reversive / separative suffix: ing all the sets of morphemes allowed to occupy the
(CONV) – -u-, --ul-, -ol-, -o- morphological slot. These regular expressions were
Durative suffix (DUR): -a enclosed in regex groups so the transducer could
Intensive suffix (INTE): -ili--, -ele- identify the particular morphological slot when a
Passive (PASS): -wa, -iwa, -ewa, -liwa, -lewa match was successfully made for an input verb. How-
Potential suffix (POTE): -ka ever, some morpheme combinations in various slots
Static suffix – (STC): -ma were defined as separate variables due to the follow-
Stative/ neuter suffix (STV): -ka, -ika, -leka, -ika, ing reasons:
-eka, -uka The morphemes occur in extremely separate con-
Reciprocal Stative suffix (RECSTV): -na texts. For instance, the morpheme “ji” (reflexive),
Reflexive suffix (REFL): -ji- which only succeeds “si” (Post-Final).
The occurrence of a morpheme in more than one
SwaRegex slot distorts the output of the transducer. Take for
The structure of SwaRegex was divided into two instance ‘a’ appears at both the Initial and the Final. If
sections: the web scraper and the lexical transducer. the morpheme appears in any one slot within the in-
put verb, the transducer marks both slots as occupied.
The Web Scraper The regular expressions representing the morpho-
A web scrapper was designed to automatically logical syntax of the Swahili verb were also construct-
scrape off verbs from the “Translation Dictionary” ed in the same way. These comprised the second set
website (https://www.translationdirectory.com/ of expressions. String interpolation was used to deref-
dictionaries/dictionary031.htm). Since the website erence variables representing the morphological slots
lists a dictionary of Swahili words, verbs are not when constructing regular expressions. The matching
placed in any particular order. However, verbs are function for the regular expression was set to ignore
denoted by a succeeding “v.” tag on entries in the dic- case during the matching operation.
tionary. Verbs also appear in their inflected forms. The same set of regular expressions were copied
The first character of each verb is a hyphen. In order onto the XFST tool. This tool has been used before in
to obtain only the roots of these verbs, the scrapper developing and testing finite-state transducers for
was tuned to remove the hyphen, the Pre-Final, and various languages (Sahala et al., 2020). Next, we de-
the Final morphemes. The scrapper was also tuned to scribe the implementation of the model in both
ignore entries containing spaces and to separate SwaRegex and its XFST counterpart.
verbs of Arabic origin from verbs of Bantu origin. Ara-
bic-derived verbs are those that end in either “e”, “i”,
and “u”. This separation is because verbs of Arabic
origin are inflected differently from those of Bantu Implementation of SwaRegex
orignin. Malformed verb entries were also excluded
from the scraped data. We implemented SwaRegex in Visual Studio as a
The scrapper saved two separate files: dataset A, console project in C# 8.0. The variables containing the
which contained163 verb phrases that did not have first set of regular expressions that represented mor-
any inflectional morphemes, and dataset B, which phemes allowed in each morphological slot were de-
contained all the non-Arabic verb entries obtained fined as follows:

80
4
string P1 = "(?<P1>ha)"; string Rule4 => $"^({P2}|{PE})({P4}|
{PF})({P5}|{PH})?({P2}|{P6}|{P10}|{PB})?({P7})
string P2 = "(? ({P8}|{PE}|{PH})?({P9}|{PD}|{PE})({P10})?$";
<P2>wa|u|i|ya|vi|zi|u|pa|ku|m|tu|li|ki)";
string Rule5 => $"^({P2}|{PE})({P4}|{PF}
string P3 = "(?<P3>si)"; |{PG})?({P2}|{P6}|{P10}|{PB})?({P7})({P8}|{PE}
|{PH})?({PE})({P5}|{PH})?({P10})?$";
string P4 = "(?
<P4>na|ka|me|sha|taka|li)"; string Rule6 => $"^({P2}|{PE})({P4}|
{PF})?({P2}|{P6}|{PB})?({P7})({P8})?({PE})
string P5 = "(? ({P5}|{PH})?({P10})?$";
<P5>ye|yo|lo|cho|vyo|zo|po|ko|mo|ye)";
string Rule7 => $"^({P7})({P8})?({PE})
string P6 = "(?<P6>mw)"; ({P10})?$";
string P8 = "(?
<P8>an|az|i|ian|ik|iki|ikiw|ikiz|ikw|ili|iliw| string Rule8 => $"^({PJ})({P7})({P8})?
ish|ishiw|iz|e|ek|ele|ew|esh|et|ez|k|lek|lew|l ({P9}|{PE})$";
i|liw|m|n|ol|sh|shan|s|u|uk|ul|uli|uliw|ush|w|
y|z|zik|zw)"; The above ruleset was combined into one general
regex, against which input was matched as follows:
string P9 = "(?<P9>i)";

string P10 = "(?<P10>ni)"; var match = Regex.Match(verb, $"{Rule1}|

{Rule2}|{Rule3}|{Rule4}|{Rule5}|{Rule6}|
{Rule7}|{Rule8}", RegexOptions.IgnoreCase);
P7, the verb roots variable, was defined in the
same way.
Implementation of the XFST model
8 more slots were defined to avoid duplicates in
The first set of regular expressions represent-
the above morphological slots. This was to improve
ing possible combinations of morphemes for each slot
the performance of the models, where duplicates re-
are as follows on the XFST terminal:
sulted in more than one output from the transducers. define P1 ha;
These slots are defined below: define P2
string PA = "(?<PA>ja)"; wa|u|i|ya|vi|zi|u|pa|ku|m|tu|li|ki;
define P3 si;
string PB = "(?<PB>ji)"; define P4 na|ka|me|sha|taka|li;
define P5
string PD = "(?<PD>e)"; ye|yo|lo|cho|vyo|zo|po|ko|mo|ye;
define P6 mw;
string PE = "(?<PE>a)"; define P8
an|az|e|ek|ele|ew|esh|et|ez|i|ian|ik|iki|ikiw|
string PF = "(?<PF>nge|ngeli|ngali)"; ikiz|ikw|ili|iliw|ish|ishiw|iz|k|lek|lew|li|li
w|m|n|ol|s|sh|shan|u|uk|ul|uli|uliw|ush|w|y|z|
string PG = "(?<PG>ta)"; zik|zw;
define P9 i;
string PH = "(?<PH>o)"; define P10 ni;
P7, the verb roots, were also defined in the
string PJ = "(?<PJ>hu)";
same way.
Next, we show the second set of variables, which
Next, the following command was passed to XFST
defines possible morpheme combinations that form a
to develop a finite state network against the above
morphologically correct verb. These rules are defined
morphological slots:
as follows:
read regex
string Rule1 => $"^({P1})({P2})?({PA}|
[[P1] (P2) (PA | PF | PG) (P6 | P10 | PB
{PF}|{PG})?({P6}|{P10}|{PB}|{P2})?({P7})({P8}|
| P2) [P7] (P8 | PH | PE) [P9 | PE] (P10)] |
{PE}|{PH})?({P9}|{PE})({P10})?$";
[[P3] (PA | PF | PG) (PB | P2 | P6) [P7]
(P8 | PH | PE) [P9 | PE | PD] (P10)] |
string Rule2 => $"^({P3})({PA}|{PF}|
[[P2 | PE] (P3 | PF | P4 | P3 PF) (PB |
{PG})?({P2}|{P6}|{PB})?({P7})({P8}|{PE}|{PH})?
P2 | P6) [P7] (P8 | PH | PE) [PD | PE | P9]
({P9}|{PD}|{PE})({P10})?$";
(P10)] |
[[P2 | PE] [P4 | PF] (P5 | PH) (PB | P10
string Rule3 => $"^({P2}|{PE})({P3}|{P4}
| P2 | P6) [P7] (P8 | PH | PE) [PD | PE | P9]
|{PF})?({P2}|{P6}|{PB})?({P7})({P8}|{PE}|
(P10)] |
{PH})?({P9}|{PD}|{PE})({P10})?$";
[[P2 | PE] (P4 | PF| PG) (PB | P10 | P2
| P6) [P7] (P8 | PH | PE) [PE] (P5 | PH)

81
5
(P10)] | Since XFST does not provide a way of evaluating how
[[P2 | PE] (P4 | PF) (PB | P2 | P6) [P7] it parses strings, it was not possible to evaluate its
(P8) [PE] (P5 | PH) (P10)] |
performance on the datasets. However, printing the
[[P7] (P8) [PE] (P10)]|
[[PJ] [P7] (P8) [P9 | PH]] finite state network at the top of its stack revealed
.o. [PA -> A, PB -> B, PD -> D, PE -> E, that some paths were malformed. This was, therefore,
PF -> F, PG -> G, PH -> H, PJ -> J, P1 -> 1, attributed to the design limitations of XFST.
P2 -> 2, P3 -> 3, P4 -> 4, P5 -> 5, P6 -> 6,
P7 -> 7, P8 -> 8, P9 -> 9, P10 -> 10]; Dataset B Test Results

The notation “.o.” in XFST denotes composition. In SwaRegex correctly segmented 491 entries from
this case, it implies that morphemes in an input verb this dataset. This translated to an accuracy of 68.67%.
that correspond to a slot, say PA, will cause the output The XFST model correctly segmented 216 entries,
of the transducer to be A. We used this notation to giving it 30.21% accuracy. This drop in performance
convert the finite state network into a transducer. was a result of noise in the data. To investigate this,
the roots obtained by the web scraper were closely
Analysis and Discussion examined. The examination revealed that some root
entries in this dataset were erroneous. These errors
A total of eight morphological rules for the Swahili occurred while the scrapper stripped the inflectional
verb were defined. The web scraper obtained a total morphemes from the entries in the online dictionary.
of 776 distinct Swahili verbs from the online diction-
ary. From these, 507 were roots, 163 were verbs of Comparison with other models
Bantu origin, and 61 were verbs of Arabic origin.
Both models were tested against the two datasets. SwaRegex achieves better results than those pro-
The same set of rules was defined for each model. posed by (Rahi et al., 2020). Their model averages an
Each model was pointed to the same dataset. The accuracy of 95.2% on 2,500 Maithili verbs in XFST.
models had no problems reading the file. However, Their model involves manually sorting words into
for the XFST model, the encoding of the file had to be inflection types and classes.
changed from UTF8 to ANSI. The performance of SwaRegex also outperforms (Ammari & Zenkouar,
SwaRegex on both datasets is shown in Table 1, while 2021b), whose model presented an accuracy of 87%
the performance of the XFST model is shown in Table when performing morphological analysis on 28,000
2. Amazigh words in XFST.
A similar experiment by (Chahuneau et al., 2013)
achieved an accuracy of 78.2% on predicting Swahili
words, which is still outperformed by our model.
Runyagram (Katushemererwe & Issue, 2010) was
a morphological segmentation experiment on Runya-
kitara, one of the Bantu languages in Uganda. Their
Table 1: Performance of SwaRegex on both datasets
experiments were conducted on fsm2, a similar finite-
state transducer environment to xfst. Still, SwaRegex
achieved better results than Runyagram, which
scored an accuracy of 82%.

Discussion

Regular expressions remain a robust means of

matching input text against a set of rules. They are a
widely used technique in programming today. They
Table 2: Performance of the XFST model on both datasets are important when evaluating whether user input
conforms to a variety of rules. One such application is
Dataset A Test Results user input validation in web forms, where regular
SwaRegex correctly segmented 161 out of the 163 expressions are used to ensure that email addresses
verbs in dataset A. This translated to 98.77% accura- and phone numbers are in the correct format (Davis
cy. This performance was possible because dataset A et al., 2018).
was devoid of noise. The XFST model correctly seg- When applying morphological analysis to lan-
mented 95 verbs. This translated to 57.67% accuracy. guages with limited resources, regular expressions

82
6
are essential, as demonstrated by SwaRegex, a lexical guages with synthetic phrases. EMNLP 2013 - 2013
transducer for Swahili verbs. SwaRegex has demon- Conference on Empirical Methods in Natural Lan-
strated remarkable results when determining wheth- guage Processing, Proceedings of the Conference.
er verbs are formatted correctly. This function is help- Davis, J. C., Coghlan, C. A., Servant, F., & Lee, D. (2018).
ful in many contexts where Swahili is spoken. The impact of Regular Expression Denial of Service
First, Swahili-speaking computer users will benefit (ReDoS) in practice: An empirical study at the eco-
from SwaRegex, where search keywords can be ana- system scale. ESEC/FSE 2018 - Proceedings of the
lyzed morphologically to improve their search experi- 2018 26th ACM Joint Meeting on European Software
ence. One such area where SwaRegex is applicable is Engineering Conference and Symposium on the
in search engines, where the search keywords need to Foundations of Software Engineering. https://
be broken down into roots in order for the engine to doi.org/10.1145/3236024.3236027
present the most relevant results to the user. Used Deo, S. N. (2016). Pairwise Combinations of Swahil
this way, SwaRegex plays the role of a lemmatizer, Applicative with Verb Extensions. Nordic Journal of
where Swahili verbs are broken down into their lem- African Studies, 25(1), 52–71.
ma or roots. Gerdjikov, S. (2018). Note on the Lower Bounds of
SwaRegex can also be useful to new Swahili learn- Bimachines. In arXiv.
ers. SwaRegex provides learners with a platform that Karttunen, L. (2001). Applications of finite-state
helps them learn the morphological syntax of the transducers in natural language processing. Lec-
Swahili verb. Learners will learn how to position mor- ture Notes in Computer Science (Including Subseries
phemes to occupy the morphological slots of the verb Lecture Notes in Artificial Intelligence and Lecture
in order to create morphologically correct verbs. Notes in Bioinformatics). https://doi.org/54.5447 /7-
540-44674-5_2
Conclusion and Future work Katushemererwe, F., & Issue, S. (2010). Fsm2 and the
Morphological Analysis of Bantu Nouns – First Expe-
SwaRegex, a lexical transducer for the morphologi- riences from Runyakitara. 8(1), 58–69.
cal segmentation of Swahili verbs, is presented in this Kituku, B. (2019). Grammar engineering for Swahili.
study. The model is rule-based, and the morphological 08(06). http://repository.dkut.ac.ke:8080/xmlui/
syntax of the verb is represented by regular expres- handle/123456789/1280
sions. A web scraper was used to populate the dataset Li, B., Cheng, F., Zhang, X., Cui, C., & Cai, W. (2021). A
with verbs from an online dictionary dynamically. The novel semi-supervised data-driven method for
same regular expressions were combined into a lexi- chiller fault diagnosis with unlabeled data. Applied
cal transducer in the XFST terminal to evaluate Energy. https://doi.org/54.5456 /
SwaRegex's performance. SwaRegex outperformed its j.apenergy.2021.116459
XFST cousin in terms of performance. We intend to Liu, Z., Jimerson, R., & Prud’hommeaux, E. (2021).
develop the morphological rule set in our model in Morphological Segmentation for Seneca. https://
the future to accommodate different inflections. doi.org/10.18653/v1/2021.americasnlp-1.10
Mahesh, B. (2018). Machine Learning Algorithms-A
References Review. International Journal of Science and Re-
Abdulla, P. A., Faouzi Atig, M., Chen, Y. F., Diep, B. P., search.
Holik, L., Rezine, A., & Rummer, P. (2019). Trau: Mdee, J. (2016). Patterns of Swahili Verbal Deriva-
SMT solver for string constraints. Proceedings of tives: An Analysis of their Formation. Huria: Journal
the 18th Conference on Formal Methods in Computer of the Open University of Tanzania, 21(1), 43–51.
-Aided Design, FMCAD 2018. https:// Mohrs, C., & Cajo, S. T. (2020). The microstructure of a
doi.org/10.23919/FMCAD.2018.8602997 lexicographical resource of spoken German: Mean-
Ammari, R., & Zenkouar, L. (2021a). Amazigh-sys: In- ings and functions of the Lemma Eben. Rasprave
telligent system for recognition of amazigh words. Instituta Za Hrvatski Jezik i Jezikoslovlje. https://
IAES International Journal of Artificial Intelligence. doi.org/10.31724/RIHJJ.46.2.25
https://doi.org/10.11591/IJAI.V10.I2.PP482-489 Mott, J., Bies, A., Strassel, S., Kodner, J., Richter, C., Xu,
Ammari, R., & Zenkouar, L. (2021b). APMorph: Finite- H., & Marcus, M. (2020). Morphological Segmenta-
state transducer for Amazigh pronominal morphol- tion for Low Resource Languages. May, 3996–4002.
ogy. International Journal of Electrical and Comput- Muhirwe, J. (1983). Computational Analysis of Kinyar-
er Engineering. https://doi.org/54.5599 5/ wanda Morphology : The Morphological. 1(1), 78–
ijece.v11i1.pp699-706 87.
Chahuneau, V., Schlinger, E., Smith, N. A., & Dyer, C. Ngugi, K., Okelo-Odongo, W., & Wagacha, P. . (2010).
(2013). Translating into morphologically rich lan- Swahili text-to-speech system. African Journal of

83
7
Science and Technology. https://doi.org/54.8758/
ajst.v6i1.55170
Rahi, R., Pushp, S., Khan, A., & Sinha, S. K. (2020). A
Finite State Transducer Based Morphological Ana-
lyzer of Maithili Language. ArXiv Preprint
ArXiv:2003.00238.
Sahala, A., Silfverberg, M., Arppe, A., & Linden, K.
(2020). BabyFST - Towards a finite-state based
computational model of ancient babylonian. LREC
2020 - 12th International Conference on Language
Resources and Evaluation, Conference Proceedings.
Schonhof, & Wilkans, A. (2012). On the question of
transitive and intransitive verbs in Swahili. Lingua
Posnaniensis. https://doi.org/54.687 8 /v54566-012-
0008-y
Shikali, C. S., Sijie, Z., Qihe, L., & Mokhosi, R. (2019).
Better word representation vectors using syllabic
alphabet: A case study of Swahili. Applied Sciences
(Switzerland). https://doi.org/54.779 4/app9 58 76 88
Steinberger, R., Ombuya, S., Kabadjov, M., Pouliquen,
B., Della Rocca, L., Belyaeva, J., de Paola, M., Ignat,
C., & van der Goot, E. (2011). Expanding a multilin-
gual media monitoring and information extraction
tool to a new language: Swahili. Language Re-
sources and Evaluation. https://doi.org/54.5447 /
s10579-011-9155-y

84
8

A Review of Techniques For Morphological Analysis in Natural Language Processing
No ratings yet
A Review of Techniques For Morphological Analysis in Natural Language Processing
11 pages
Corpus Study of Pashto Morphology
No ratings yet
Corpus Study of Pashto Morphology
13 pages
English Morphology
No ratings yet
English Morphology
32 pages
The Theory of Automata in Natural Language Processing
100% (1)
The Theory of Automata in Natural Language Processing
6 pages
NLP Morphology for Linguists
No ratings yet
NLP Morphology for Linguists
8 pages
Terminology Acquisition in Serbian
No ratings yet
Terminology Acquisition in Serbian
11 pages
Chapter 1
No ratings yet
Chapter 1
41 pages
Graph-Based Morphological Analysis
No ratings yet
Graph-Based Morphological Analysis
4 pages
Finnish, Turkish and Hungarian
100% (2)
Finnish, Turkish and Hungarian
12 pages
Lexical and Morphological Analysis in NLP
No ratings yet
Lexical and Morphological Analysis in NLP
9 pages
Finite Automata and Morphological Parsing
No ratings yet
Finite Automata and Morphological Parsing
18 pages
A Structured Analysis On Morpheme Segmentation For Agglutinative Languages
No ratings yet
A Structured Analysis On Morpheme Segmentation For Agglutinative Languages
6 pages
7
No ratings yet
7
4 pages
NLP with Finite Automata
No ratings yet
NLP with Finite Automata
3 pages
Bidirectional Lstmbased Morphological Analyzer For Gujarati
No ratings yet
Bidirectional Lstmbased Morphological Analyzer For Gujarati
17 pages
Word Root Finder: A Morphological Segmentor Based On CRF: J Oseph Z. Chang J Ason S. Chang
No ratings yet
Word Root Finder: A Morphological Segmentor Based On CRF: J Oseph Z. Chang J Ason S. Chang
8 pages
Morphological Analysis
No ratings yet
Morphological Analysis
35 pages
NLP 39-48
No ratings yet
NLP 39-48
11 pages
NLP Unit-1
No ratings yet
NLP Unit-1
12 pages
Understanding English Morphology Basics
No ratings yet
Understanding English Morphology Basics
5 pages
Morphological Parsing in NLP Techniques
No ratings yet
Morphological Parsing in NLP Techniques
7 pages
Pashto Language Stemming Algorithm: Jurnal Teknologi Maklumat Dan Multimedia Asia-Pasifik
No ratings yet
Pashto Language Stemming Algorithm: Jurnal Teknologi Maklumat Dan Multimedia Asia-Pasifik
13 pages
NLP - Sem
No ratings yet
NLP - Sem
31 pages
8-Morphology Part3
No ratings yet
8-Morphology Part3
27 pages
Word Level Analysis
No ratings yet
Word Level Analysis
49 pages
Lecture 3
No ratings yet
Lecture 3
55 pages
2 NLP
No ratings yet
2 NLP
36 pages
Finnish Morphology Lexicon Construction
No ratings yet
Finnish Morphology Lexicon Construction
64 pages
NLP Morphology & Transducers
No ratings yet
NLP Morphology & Transducers
26 pages
Linguistics: Morphology Basics
No ratings yet
Linguistics: Morphology Basics
41 pages
NLP3 - Lecture 3
No ratings yet
NLP3 - Lecture 3
52 pages
Scope, Phonology and Morphology in An Agglutinating Language: Choguita Rara Muri (Tarahumara) Variable Suffix Ordering
No ratings yet
Scope, Phonology and Morphology in An Agglutinating Language: Choguita Rara Muri (Tarahumara) Variable Suffix Ordering
40 pages
Morphology and Finite-State Transducers
No ratings yet
Morphology and Finite-State Transducers
30 pages
Elixir Thesis
No ratings yet
Elixir Thesis
107 pages
Pashto Morphological Analysis
No ratings yet
Pashto Morphological Analysis
6 pages
Natural Language Processing
No ratings yet
Natural Language Processing
13 pages
MAGEAD: Arabic Nominal Morphology Analysis
No ratings yet
MAGEAD: Arabic Nominal Morphology Analysis
8 pages
Finite State Automata in Language Processing
No ratings yet
Finite State Automata in Language Processing
40 pages
Kornai2020 Chapter Lexemes
No ratings yet
Kornai2020 Chapter Lexemes
28 pages
Introduction to Natural Language Processing
No ratings yet
Introduction to Natural Language Processing
19 pages
Speech Recognition With Weighted Finite-State Transducers: Mehryar Mohri Fernando Pereira Michael Riley
No ratings yet
Speech Recognition With Weighted Finite-State Transducers: Mehryar Mohri Fernando Pereira Michael Riley
31 pages
Morphological Analysis Tool for Languages
No ratings yet
Morphological Analysis Tool for Languages
56 pages
Linguistic Issues and Methods in Computa A4919075
No ratings yet
Linguistic Issues and Methods in Computa A4919075
7 pages
Japanese Text Segmentation Model
No ratings yet
Japanese Text Segmentation Model
6 pages
Word Level Analysis and Morphology
No ratings yet
Word Level Analysis and Morphology
54 pages
Tokenization and Morphology in NLP
No ratings yet
Tokenization and Morphology in NLP
63 pages
Inf2a L15 Slides
No ratings yet
Inf2a L15 Slides
31 pages
7-Morphology Part2
No ratings yet
7-Morphology Part2
28 pages
Part-Of-Speech Tagging With Rich Language Description: Anastasyev D. G. Andrianov A. I. Indenbom E. M
No ratings yet
Part-Of-Speech Tagging With Rich Language Description: Anastasyev D. G. Andrianov A. I. Indenbom E. M
12 pages
Purepos 2.0: A Hybrid Tool For Morphological Disambiguation
No ratings yet
Purepos 2.0: A Hybrid Tool For Morphological Disambiguation
7 pages
Computational Linguistics Notes
No ratings yet
Computational Linguistics Notes
17 pages
Verb Morphological Generator For Telugu
No ratings yet
Verb Morphological Generator For Telugu
11 pages
675469663
No ratings yet
675469663
33 pages
Lightweight Morphology for Text Search
No ratings yet
Lightweight Morphology for Text Search
10 pages
Development of A Rule-Based Lemmatization Algorithm Through Finite State Machine For Uzbek Language
No ratings yet
Development of A Rule-Based Lemmatization Algorithm Through Finite State Machine For Uzbek Language
6 pages
Amazigh Language Recognition System
No ratings yet
Amazigh Language Recognition System
8 pages
SILABUS Bahasa Inggris Kelas X
No ratings yet
SILABUS Bahasa Inggris Kelas X
17 pages
Session 1 - Word Order - Statements - Questions - Main vs. Subordinate Clauses
No ratings yet
Session 1 - Word Order - Statements - Questions - Main vs. Subordinate Clauses
6 pages
English IA 2 Portions 2025
No ratings yet
English IA 2 Portions 2025
2 pages
Informal Letter Writing Guide
No ratings yet
Informal Letter Writing Guide
4 pages
APA Formatting & Style
100% (1)
APA Formatting & Style
39 pages
First Grade Skills Pre-Screening Assessment
No ratings yet
First Grade Skills Pre-Screening Assessment
10 pages
English 6 PT
No ratings yet
English 6 PT
3 pages
Year 8 English Exam: Punctuation & Writing
No ratings yet
Year 8 English Exam: Punctuation & Writing
3 pages
Tution Revision Worksheet
No ratings yet
Tution Revision Worksheet
4 pages
Understanding Relative Clauses Explained
No ratings yet
Understanding Relative Clauses Explained
28 pages
English Prefixes and Suffixes Guide
No ratings yet
English Prefixes and Suffixes Guide
17 pages
Legal 1
No ratings yet
Legal 1
30 pages
GENIE - Language Acquisition Case
100% (1)
GENIE - Language Acquisition Case
2 pages
Limba Engleza - Clasa-Vi - Fisa de Lucru
No ratings yet
Limba Engleza - Clasa-Vi - Fisa de Lucru
3 pages
Etymology of Hamlet: Amlethus and Admlithi
No ratings yet
Etymology of Hamlet: Amlethus and Admlithi
20 pages
Resume Jian Carlo Yuvienco
No ratings yet
Resume Jian Carlo Yuvienco
1 page
Understanding Morphophonemic Changes
No ratings yet
Understanding Morphophonemic Changes
2 pages
Accuracy in Writing
No ratings yet
Accuracy in Writing
25 pages
Ptlal - Lesson 5
No ratings yet
Ptlal - Lesson 5
11 pages
30-Day English Proficiency Plan
No ratings yet
30-Day English Proficiency Plan
3 pages
Regular Verbs List: Base Form Simple Past Past Participle Meaning (Traducción)
No ratings yet
Regular Verbs List: Base Form Simple Past Past Participle Meaning (Traducción)
2 pages
Reported Questions Grammar
No ratings yet
Reported Questions Grammar
3 pages
Writing Assessment Criteria Updated 2018
No ratings yet
Writing Assessment Criteria Updated 2018
2 pages
Grade 6 Week 1 Evaluation Report
No ratings yet
Grade 6 Week 1 Evaluation Report
11 pages
The BLC English Programme
No ratings yet
The BLC English Programme
3 pages
Types and Uses of Adjectives
No ratings yet
Types and Uses of Adjectives
17 pages
7th Grade Wh Questions Lesson
No ratings yet
7th Grade Wh Questions Lesson
7 pages
Class 3 English 200 Real MCQs
No ratings yet
Class 3 English 200 Real MCQs
50 pages
SMOG Readability Formula G. Harry McLaughlin (1969)
No ratings yet
SMOG Readability Formula G. Harry McLaughlin (1969)
8 pages

Sware Gex

Uploaded by

Sware Gex

Uploaded by

Journal website: https://journals.must.ac.

SwaRegex: a lexical transducer for the morphological segmentation of

string P10 = "(?<P10>ni)"; var match = Regex.Match(verb, $"{Rule1}|

Regular expressions remain a robust means of

You might also like