Corresponding Author:
Laras Gupitasari
School of Computing
Telkom University
Bandung, Indonesia
Email: laras.gupitasari@gmail.com
Figure 1. The relatedness of inflected forms in Arabic. The green-colored word shows the difference from the previous word; these green additions must follow the rules of Arabic morphology. On the far right, the similarity of tenses or forms is shown.
To handle the problem of finding a single word form, techniques such as lemmatization [4][5] or stemming work for some languages. These techniques are less suitable for Arabic. Arabic needs a system that can accurately learn and capture the mappings of its rules and principles, in order to maximize the capability of most human language technologies.
There is a discipline that studies the rules for forming and changing words in Arabic, called sharaf (or shorof). Sharaf discusses the basics of word formation and provides the rules, known as morphology, that apply before words are combined or assembled with other words [6]. Word formation in Arabic is very diverse and flexible; in sharaf, the way of forming words is called tashrif. In tashrif, we can generate words by tashrif lughawi (inflection) and tashrif istilahy (derivation). Arabic has three important elements for word formation: word class (fi'il, isim, hurf), wazan (verb form), and dhamir (pronoun) [7][8][9].
The word classes in Arabic consist of three types: fi'il (verb), isim (noun), and hurf (particle/conjunction). For example, in the sentence يَذْهَبُ المُعَلِّمُ إِلَى الفَصْلِ (The teacher goes to class), يَذْهَبُ (yaḏhabu: 'goes') is a fi'il, المُعَلِّمُ (al-muʿallim: 'teacher') and الفَصْلِ (al-faṣli: 'class') are isim, and إِلَى (ʾilā: 'to') is a hurf. Each word class has one dhamir, which is why dhamir is one of the important elements in Arabic word mapping [9]. Wazan is the verb form or pattern for word formation in Arabic [7], and the pattern can affect the meaning or semantics of a word. Therefore, a system is needed that can correctly perform Arabic word formation given the rules and the important elements of Arabic morphology.
Research on the mapping of inflected forms has been proposed by Kann and Schütze [10]. They propose a morphological reinflection technique, called MED, to handle the single-word-form problem. The experiments were conducted on 10 languages (Arabic, Finnish, Georgian, German, Hungarian, Maltese, Navajo, Russian, Spanish, and Turkish), so the system is multilingual, or general. Reinflection is a process in which one input word is given in the form of an inflected word, and the system needs to generate another form of the word based on certain rules.
However, the MED model did not include wazan (verb form), which is one of the important aspects of Arabic word mapping. Also, rather than a general system, this research aims to create a system that is specific to one language, Arabic. The method uses a Recurrent Neural Network (RNN): in SIGMORPHON 2016 [1], 2017 [11], and 2018 [12], RNNs gave the best results compared to other methods for the reinflection task. In other words, this study proposes a model that can accurately generate morphologically reinflected words in Arabic, with an emphasis on the morphological aspects of Arabic.
2. RELATED WORK
The system's output is a predicted word, given an inflected Arabic word and certain rules as the input. Table 1 shows an example of the reinflection process; what this research does is discussed in more detail later.
Int J Inf & Commun Technol, ISSN: 2252-8776
Table 1. Example of the reinflection process

                             Task 1              Task 2
  Input   Source form        writing             writing
          MSD-source         pos=V, tense=PRS    -
          MSD-target         pos=N, num=PL       pos=V, tense=PST
  Output  Target form        writer              wrote
The MSDs are the specific information about the words. Table 2 shows the MSDs for the word classes, wazan (pattern), and dhamir (pronoun) in Arabic.

Table 2. MSDs of the Arabic morphological aspects

  Aspects        MSDs
  Word class     pos={V/N}, mood={IND/CON/IMP/SBJV}, aspect={IPFV/PFV}, voice={ACT/PASS}
  Dhamir         GEN={M/F}, PER={1/2/3}, NUM={SG/DU/PL}
  Wazan          form={I-13} / form={Iq/2q/3q/4q}
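The key=value notation above is easy to turn into a feature dictionary before feeding it to a model. A minimal sketch (the comma separator and whitespace handling are assumptions based on the notation shown in Tables 1 and 2):

```python
def parse_msd(msd: str) -> dict:
    """Parse an MSD string such as 'pos=V, tense=PRS' into a feature dict."""
    features = {}
    for pair in msd.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        features[key.strip()] = value.strip()
    return features

# Example: the Task 1 MSD-target from Table 1.
print(parse_msd("pos=N, num=PL"))  # {'pos': 'N', 'num': 'PL'}
```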
3. RESEARCH METHOD
4. SYSTEM MODEL
4.1 Model Development
Reinflection process is process where one input word is given in the form of an inflected word
(source form), system need to generate another form (target form) of the word based on certain rules (MSD-
source and/or MSD-target).
This research proposes two tasks for the reinflection model. Task 1 is given the source form, MSD-source, and MSD-target as input, and outputs the target form. Task 2 is given only the source form and MSD-target, and outputs the target form. Task 2 is inherently more difficult than Task 1 because it does not have the MSD-source as input. The process of the system can be seen in Figure 3 and Figure 4.
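Character-level seq2seq systems such as MED typically flatten the MSD tags and the characters of the source form into one input sequence, so the two task settings differ only in whether the MSD-source tokens are present. A sketch of one possible encoding (the marker tokens `<src>`, `<trg>`, and `<form>` are illustrative choices, not the exact format used by MED or by this paper):

```python
def encode_input(source_form, msd_target, msd_source=None):
    """Serialize a reinflection example into one token sequence for a
    character-level seq2seq encoder: MSD features become single tokens,
    the source form is split into characters."""
    tokens = []
    if msd_source:  # Task 1 includes the source MSD; Task 2 omits it.
        tokens += ["<src>"] + msd_source.split(",")
    tokens += ["<trg>"] + msd_target.split(",")
    tokens += ["<form>"] + list(source_form)
    return tokens

# Task 1 example from Table 1: writing -> writer.
print(encode_input("writing", "pos=N,num=PL", "pos=V,tense=PRS"))
# Task 2 example: no MSD-source available.
print(encode_input("writing", "pos=V,tense=PST"))
```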
The reason for these two settings is that the MSD is not always available. We can obtain the MSD by doing morphological analysis manually, but that takes a lot of cost and effort. The other option is to use a morphological analyzer, but its output then has to be adjusted to the MSD format expected by the model. That is why both settings are evaluated: we want to build a model that can predict the Arabic word (the reinflection process) even without the MSD-source as input, and it would be good if that model still achieves strong performance.
4.2 Dataset
The dataset for this research comes from the Association for Computational Linguistics (ACL) SIGMORPHON 2016 Shared Task [1]. The training and testing sets are already divided, so we do not need to split the dataset again. The original training set totals 12,800 rows. Because we build the model like a machine translator, we do not tune any hyperparameters, and we combine the training set and development set into one big training set with a total of 14,540 rows. The testing set totals 4,500 rows, consisting of 2,833 fi'il (verbs) and 1,667 isim (nouns).
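The shared-task files are tab-separated text. A minimal loader sketch, assuming a three-column layout (source form, MSD, target form); the actual column order in the SIGMORPHON 2016 release should be checked before use:

```python
import csv
import io

def load_reinflection_tsv(lines):
    """Read tab-separated reinflection rows (assumed layout:
    source form, MSD, target form) into a list of dicts."""
    reader = csv.reader(lines, delimiter="\t")
    data = []
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed rows
        source, msd, target = row
        data.append({"source": source, "msd": msd, "target": target})
    return data

# Usage with an in-memory sample (the English example from Table 1):
sample = io.StringIO("writing\tpos=N,num=PL\twriter\n")
rows = load_reinflection_tsv(sample)
print(rows[0]["target"])  # writer
```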
Table 3. Examples of the dataset for Task 1 (input: source form, MSD-source, MSD-target; output: target form)

1. Source form: كَمِّيَّةُ
   MSD-source: pos=N poss=PSSD case=NOM num=SG
   MSD-target: pos=N poss=PSSD case=GEN num=PL
   Target form: كَمِّيَّاتِ
2. Source form: الْجَيْبَانِ
   MSD-source: pos=N def=DEF case=NOM num=DU
   MSD-target: pos=N poss=PSSD case=ACC num=SG
   Target form: جَيْبَ
3. Source form: أَبْرِقِي
   MSD-source: pos=V mood=IMP gen=FEM per=2 voice=ACT num=SG form=4
   MSD-target: pos=V mood=IND tense=PST gen=FEM per=2 voice=ACT num=DU aspect={IPFV/PFV/PRF} form=4
   Target form: أَبْرَقْتُمَا
Table 3 shows that there are three inputs (source form, MSD-source, and MSD-target) and that the output is the target form. Table 4 shows that there are two inputs (source form and MSD-target) and that the output is the target form.
The first step of adding the wazan is to look the Arabic word up in Wiktionary, which gives the declension of the root of the word we want to search. For example, to find the wazan of قَسَمُوا (qasamū: 'they divided'), we get the table of the word قَسَمَ (qasama: 'to divide'). The word قَسَمَ (qasama) is the lemma and comes from the root ق س م (q-s-m). From the declension table we obtain the wazan of the target Arabic word and add it to the training and testing datasets. This was done for 8,755 fi'il in the training set and 2,833 fi'il in the test set.

To confirm that the added wazan is correct, an expert validated the wazan on a given sample. After all the wazan were confirmed correct and consistent with the rules of word formation in Arabic, the dataset could be used to build the model.
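Once the wazan of each verb has been looked up, appending it to the existing MSD string is mechanical. A sketch, where `wazan_lookup` is a hypothetical dictionary built by hand from the Wiktionary declension tables:

```python
def add_wazan(msd: str, lemma: str, wazan_lookup: dict) -> str:
    """Append a form=<wazan> feature to a fi'il's MSD string.
    `wazan_lookup` maps a verb lemma to its wazan; isim rows,
    unknown lemmas, and rows that already carry form= are unchanged."""
    if not msd.startswith("pos=V"):
        return msd  # wazan is only added to verbs (fi'il)
    wazan = wazan_lookup.get(lemma)
    if wazan is None or "form=" in msd:
        return msd
    return msd + ",form=" + wazan

# Hypothetical example: qasama (root q-s-m) treated as a Form I verb.
lookup = {"qasama": "I"}
print(add_wazan("pos=V,tense=PST,num=PL", "qasama", lookup))
# pos=V,tense=PST,num=PL,form=I
```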
    accuracy(tuple) = 1, if the prediction is correct
                    = 0, otherwise
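The exact-match criterion above can be written directly as a small helper:

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions matching the gold target form exactly:
    each (prediction, target) tuple scores 1 if identical, else 0."""
    assert len(predictions) == len(targets)
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets) if targets else 0.0

print(exact_match_accuracy(["wrote", "writer"], ["wrote", "writers"]))  # 0.5
```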
Figure 7. True prediction: every character in the prediction matches the corresponding character in the actual word. Figure 8. False prediction: one character in the prediction does not match the character in the actual word.
The previous research, MED [10], has good accuracy for both Task 1 and Task 2, and it serves as the baseline of this research. Both performed the same tasks: Task 1 is the reinflection process using the MSD-source, and Task 2 is the reinflection process without the MSD-source. The difference is that MED did not use wazan as a feature, while this research does. Table 6 shows that the model from this research, using wazan, has better accuracy than MED without wazan. The reinflection process by MED for Arabic gives an accuracy of 91.09% for Task 1 and 82.80% for Task 2, while this research gives better accuracy: 92.87% for Task 1 and 90.71% for Task 2. The accuracy difference is not very large, but it is enough to show that using wazan effectively increases the accuracies.
To ensure that the wazan gives an improvement, we also tested both word classes using the model without wazan. The result is in Table 8: without the wazan, fi'il have lower accuracy than isim.
Table 8. Testing result for both word classes using the model without wazan

  Type of words     Accuracy (Task 1)   Accuracy (Task 2)
  Fi'il (verb)      92.84%              91.23%
  Isim (noun)       92.97%              91.42%
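Per-word-class figures like those above amount to grouping the test rows by their pos feature before scoring. A sketch, assuming each test row carries its MSD-target alongside the predicted and gold forms:

```python
from collections import defaultdict

def accuracy_by_pos(rows):
    """Group (msd_target, prediction, gold) triples by their pos feature
    and return exact-match accuracy per word class (V = fi'il, N = isim)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for msd, pred, gold in rows:
        feats = dict(f.split("=") for f in msd.split(",") if "=" in f)
        pos = feats.get("pos", "?")
        total[pos] += 1
        correct[pos] += int(pred == gold)
    return {pos: correct[pos] / total[pos] for pos in total}

rows = [("pos=V,tense=PST", "wrote", "wrote"),
        ("pos=N,num=PL", "writer", "writers")]
print(accuracy_by_pos(rows))  # {'V': 1.0, 'N': 0.0}
```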
Table 7 shows that isim (nouns) contribute more errors to the accuracy. The assumption is that this is because the additional wazan belongs only to fi'il; no more specific feature is added to the isim, so their accuracy does not improve. This is supported by Table 8: using the model without wazan, fi'il contribute more errors than isim, so the wazan improves the predictions for fi'il.

The Task 1 testing result for isim has a total of 117 wrongly predicted Arabic words, and 98 of those predictions share a similarity in the features they have.
Table 9. Declension of the dual definite forms

  Case          Arabic            Transliteration
  Informal      الْحَلُّوفَيْن    al-ḥallūfayn
  Nominative    الْحَلُّوفَانِ    al-ḥallūfāni
  Accusative    الْحَلُّوفَيْنِ   al-ḥallūfayni
  Genitive      الْحَلُّوفَيْنِ   al-ḥallūfayni
One example is ambiguity: in the dataset, the word الْحَلُّوفَيْن (al-ḥallūfayn: 'the two alliances') has the MSD pos=N def=DEF case={NOM/ACC/GEN} num=DU. In Wiktionary (where the dataset comes from), the Arabic word الْحَلُّوفَيْن has the case Informal, not Nominative, Accusative, or Genitive. The declension of the root in Table 9 shows that every case has a different Arabic word form. The same thing happens to fi'il with the features mood={IND/COND/IMP}, aspect={IPFV/PFV/PRF}, or tense={PRS/FUT}.
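Such ambiguous targets can be found mechanically by grouping the dataset by surface form and collecting the distinct MSDs that map to each one. A sketch (the transliterated example forms are illustrative):

```python
from collections import defaultdict

def find_ambiguous_forms(pairs):
    """Given (target_form, msd_target) pairs, return the forms reachable
    from more than one distinct MSD -- the ambiguity described above,
    where one surface form covers several cases."""
    msds_per_form = defaultdict(set)
    for form, msd in pairs:
        msds_per_form[form].add(msd)
    return {form: sorted(msds) for form, msds in msds_per_form.items()
            if len(msds) > 1}

pairs = [("al-hallufayn", "pos=N def=DEF case=ACC num=DU"),
         ("al-hallufayn", "pos=N def=DEF case=GEN num=DU"),
         ("al-hallufani", "pos=N def=DEF case=NOM num=DU")]
print(find_ambiguous_forms(pairs))
```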
To distinguish one Arabic word from the others, we could change some features, or make the MSD more specific, for both isim and fi'il. More specific features may have an impact on the prediction result, because the model needs more varied patterns to learn.
6. CONCLUSION
A model that can accurately learn and capture the word mappings of Arabic has been built using an RNN seq2seq. By adding wazan to the features, the model reached accuracies of 92.87% (Task 1) and 90.71% (Task 2), higher than the previous research that did not use wazan. This shows that wazan is effective for word mapping in Arabic. The best dataset size for this model, using the dataset from ACL, is a 14,540-row training set.

Adding wazan to the dataset, especially for fi'il (verbs), makes each Arabic word more specific. That is why isim (nouns) contribute more errors than fi'il (verbs): the wazan (verb form) is added only to the fi'il. The results show some errors because the training set needs more varied features, and more specific features should be given to both isim and fi'il. Adding more data with the same features or the same patterns would make the model overfit, because there would be excessive duplication in the data to learn from.
For further research, it would be good to add more data to the training set. Several ways can be used to expand the data, but it is better to do so without duplicating the dataset, to avoid overfitting and get better accuracy. Using the same dataset, the features for isim and fi'il can be made more specific, so that words that were previously identical can be distinguished. The RNN seq2seq idea from machine translation gives good accuracy; for further research, comparing the results against other methods would be worthwhile.
REFERENCES
[1] R. Cotterell, J. Sylak-Glassman, and D. Yarowsky, “The SIGMORPHON 2016 Shared Task — Morphological Reinflection,” pp. 10–22, 2016.
[2] K. Kann, “Neural Sequence-to-Sequence Models for Low-Resource Morphology,” 2018.
[3] Y. LeCun, G. E. Hinton, and Y. Bengio, AI & Statistics: Proceedings of the 10th International Workshop.
2005.
[4] S. Goldwater and D. McClosky, “Improving statistical MT through morphological analysis,” HLT/EMNLP
2005 - Human Language Technology Conference and Conference on Empirical Methods in Natural
Language Processing, Proceedings of the Conference, no. October, pp. 676–683, 2005, doi:
10.3115/1220575.1220660.
[5] H. Saif, Y. He, and H. Alani, “Alleviating data sparsity for twitter sentiment analysis,” 2012.
[6] M. Najah, “Penerapan Pembelajaran Shorof Bagi Pembelajar Tingkat Pemula Menggunakan Metode
Pemerolehan Bahasa,” al Mahāra: Jurnal Pendidikan Bahasa Arab, vol. 5, no. 1, pp. 117–140, 2019, doi:
10.14421/almahara.2019.051-07.
[7] A. Razin and U. Razin, Ilmu Sharaf Untuk Pemula. Sepohon Kayu, 2017.
[8] A. G. Martínez, S. L. Hervás, D. Samy, C. G. Arques, and A. M. Sandoval, “Jabalín: A Comprehensive
Computational Model of Modern Standard Arabic Verbal Morphology Based on Traditional Arabic
Prosody,” Communications in Computer and Information Science, vol. 380 CCIS, pp. 35–52, 2013, doi:
10.1007/978-3-642-40486-3_3.
[9] T. I. Ramadhan, M. A. Bijaksana, and A. F. Huda, “Rule based pattern type of verb identification algorithm
for the Holy Qur’an,” Procedia Computer Science, vol. 157, pp. 337–344, 2019, doi:
10.1016/j.procs.2019.08.175.
[10] K. Kann and H. Schütze, “MED: The LMU System for the SIGMORPHON 2016 Shared Task on
Morphological Reinflection,” Proceedings of the 14th Annual SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology, pp. 62–70, 2016.
[11] R. Cotterell et al., “CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages,” CoNLL 2017 - Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pp. 1–30, 2017, doi: 10.18653/v1/k17-2001.
[12] R. Cotterell et al., “The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection,”
arXiv, pp. 1–27, 2018.
[13] D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and
translate,” 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track
Proceedings, pp. 1–15, 2015.