Then, we introduce a pre-trained language model for the Tigrinya language, which is a RoBERTa-based [18] model. The language model was trained exclusively on a Tigrinya corpus using the Masked Language Modeling (MLM) task and has the same size as the RoBERTa-base model. We named the language model TigRoBERTa (Tig refers to Tigrinya, and RoBERTa refers to the Transformer model used). We then apply TigRoBERTa to two different downstream sequence labeling tasks: NER and POS tagging. For this purpose, we fine-tune TigRoBERTa on the respective labeled datasets. Furthermore, we propose a semi-supervised self-training approach for Tigrinya that augments the training data and achieves better performance. We further explore CNN-BiLSTM-CRF as a baseline model for the NER and POS tagging tasks. A pre-trained word2vec [22] embedding trained purely on a Tigrinya corpus was used to initialize the CNN-BiLSTM-CRF model.

The experimental results show that TigRoBERTa achieves an F1-score of 84% for NER and 92% accuracy for POS tagging. Similarly, the CNN-BiLSTM-CRF baseline model initialized with the Tigrinya word2vec embedding achieves an F1-score of 68.86% on the NER task and 94% accuracy on POS tagging.

This paper is an extended version of a paper published in the Knowledge and Natural Language Processing track of the ACM SAC 2022 conference [40]. Our contributions can be summarized as follows:

• We develop the first dataset tagged with named entities for Tigrinya and release it publicly.

• We develop and release a language model pre-trained exclusively on a Tigrinya language corpus.

• We introduce supervised, semi-supervised, and transfer learning techniques for the Tigrinya language on the NER and POS tagging tasks.

• We propose a semi-supervised self-training approach that yields performance comparable to supervised learning for Tigrinya (a minimal sketch of such a loop follows this list).
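As a rough illustration of the self-training idea behind this last contribution, the following is a minimal sketch of a generic self-training loop. It uses a scikit-learn classifier on synthetic toy data purely for illustration; it is not the sequence labeling setup or the exact procedure used in this work.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for a small labeled set and a large unlabeled pool.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(200, 5))

threshold = 0.9  # keep only confidently pseudo-labeled examples
for _ in range(3):
    # 1) Train on the currently labeled data.
    model = LogisticRegression().fit(X_labeled, y_labeled)
    # 2) Predict labels and confidences for the unlabeled pool.
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    # 3) Move high-confidence pseudo-labeled examples into the training set.
    y_pseudo = model.classes_[proba[confident].argmax(axis=1)]
    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, y_pseudo])
    X_unlabeled = X_unlabeled[~confident]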
The rest of this paper is organized as follows: Section 2 provides a brief literature review of previous NLP work on Tigrinya and other low-resource languages. Section 3 presents the development and annotation process of the Named Entity Recognition dataset. Section 4 describes the proposed language model and the semi-supervised self-training method. Section 5 discusses the experimental results on the NER and POS tagging datasets using the TigRoBERTa and CNN-BiLSTM-CRF models. Section 6 presents the error analysis of the supervised and semi-supervised self-training study. Finally, Section 7 presents the conclusions and directions for future work.

2. RELATED WORK

Unlike rich-resource languages such as English, there is almost no available Tigrinya corpus, which makes it difficult for researchers to develop tools. To the best of our knowledge, the only publicly available labeled corpus for the Tigrinya language is the Nagaoka POS tagging corpus [33], which contains gold POS labels for 72,080 tokens and 4,656 sentences.

The work introducing the Nagaoka corpus proposed a POS tagging method based on traditional supervised machine learning approaches [33]. The authors evaluated two traditional machine learning methods: Conditional Random Fields (CRF) and Support Vector Machines (SVM). The original Nagaoka Tigrinya POS tagging corpus contained 73 labels, which were reduced to 20, achieving accuracies of 89.92% and 90.89% for SVM and CRF, respectively.

Another study on Tigrinya POS tagging using the Nagaoka corpus was conducted by [34]. The authors evaluated Deep Neural Network (DNN) classifiers: a Feed-Forward Neural Network (FFNN), Long Short-Term Memory (LSTM), Bidirectional LSTM, and a Convolutional Neural Network (CNN), using word2vec neural word embeddings. They reported that the BiLSTM approach was well suited for POS tagging and achieved 91.1% accuracy.

Moreover, a work by [32] investigated the effects of morphological segmentation on the performance of statistical machine translation from English to Tigrinya. They performed segmentation to achieve better word alignment and to reduce vocabulary dropouts, thereby improving the language model in both languages. Furthermore, they explored two segmentation schemes, i.e., one based on longest-affix segmentation and another based on fine-grained morphological segmentation.

Another study by [10] investigated CNN-BiLSTM-based text classification for Tigrinya. They created a manually annotated dataset of 30,000 documents from Tigrinya news covering the six categories "sports", "agriculture", "politics", "religion", "education", and "health", as well as an unannotated corpus of more than six million words. However, they did not make their corpus publicly available. They evaluated word2vec and fastText word embeddings in classification models by applying CNNs to Tigrinya news articles.

A work in [1] investigated the NER task for ten African languages and created and published a NER dataset for each language. They also investigated cross-domain transfer with experiments on five languages using the WikiAnn dataset, as well as cross-lingual transfer for low-resource named entity recognition.

2.2 Semi-supervised Learning
The authors in [28] investigated semi-supervised NER using a graph-based label propagation algorithm for the Amharic named entity recognition problem. Their experiment uses
3. TIGRINYA NER CORPUS

3.1 Tigrinya Script
Tigrinya uses the Ge'ez script. Ge'ez is a script used as an abugida (alphasyllabary) for several Afro-Asiatic and Nilo-Saharan languages in Ethiopia and Eritrea in the Horn of Africa. In Amharic and Tigrinya, the script is often called fidäl.

3 https://blog.amara.org/2021/08/04/new-to-amara-tigrinya/amp/

The corpus contains sentences from 2015-2021 on various topics. Figure 1 shows the distribution of the different text topics in the corpus. We annotated five entity types: person name (PER), location (LOC), organization (ORG), date and time (DATE), and miscellaneous (MISC), using the BIO standard. The annotation tags were inspired by the English CoNLL-2003 corpus [27]. In addition, we follow the MUC6 [30] annotation guidelines.

In the following, we summarize the annotation guidelines for the five classes.

PER personal names, including first, middle, and last names. Personal names that refer to an organization, location, event, law, or prize were not tagged with the PER tag.

LOC includes all country, region, state, and city names, as well as non-GPE locations such as mountains, rivers, and bodies of water.

ORG proper names that include all kinds of organizations, sports teams, multinational organizations, political parties, and unions, as well as proper names referring to facilities.

DATE absolute date expressions that denote a particular segment of time, i.e., a particular day, season, final quarter, year, decade, or specific century.

MISC includes other types of entities, e.g., events, specific disease names, etc.

O is used for non-entity tokens.

The annotation process was carried out by three paid and four volunteer human annotators who have a linguistic background and are native speakers of Tigrinya. Table 1 shows the frequency of each entity tag. The corpus was annotated according to the established Beginning, Inside, and Outside (BIO) scheme, where "B" indicates the first word of an entity, "I" indicates the remaining words of the same entity, and "O" indicates that the tagged word is not a named entity. Our corpus will be publicly available on GitHub 4 for research purposes.

4 https://github.com/mehari-eng/Tigrinya-NER
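To make the BIO scheme concrete, the short sketch below prints a toy sentence with its BIO tags. The example uses English words purely for readability; the corpus itself is in Tigrinya, and the tokens and tags shown are not taken from the dataset.

# Toy illustration of BIO tagging (English words used only for readability).
tokens = ["Haile", "Selassie", "visited", "Addis", "Ababa", "in", "1958", "."]
tags   = ["B-PER", "I-PER",    "O",       "B-LOC", "I-LOC", "O",  "B-DATE", "O"]

# "B-" marks the first token of an entity, "I-" continues the same entity,
# and "O" marks tokens outside any entity.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")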
To validate the annotation quality, we report inter-annotator agreement scores in Table 2 using Cohen's kappa [21] for all entity tags. We calculated the inter-annotator agreement between two annotation sets. Table 2 shows the results of the inter-annotator agreement analysis. The agreement for the PER and LOC annotations is relatively high, whereas the kappa agreement for MISC is low compared to the other entities; thus, MISC was the most difficult tag for our annotators. The goal of our annotation procedure is to produce a high-quality corpus by ensuring high inter-annotator agreement.
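As a minimal sketch of how such agreement scores can be computed, the snippet below applies scikit-learn's Cohen's kappa to two annotators' token-level tag sequences. The tag sequences shown are illustrative placeholders, not the actual annotations behind Table 2.

from sklearn.metrics import cohen_kappa_score

# Token-level BIO tags assigned by two annotators to the same tokens (illustrative).
annotator_1 = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-MISC", "O"]
annotator_2 = ["B-PER", "I-PER", "O", "B-LOC", "O", "O",      "O"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")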
4. PROPOSED METHOD

4.1 Overview
In this work, we propose a new pre-trained RoBERTa-based language model for the Tigrinya language.

Language models are trained on an extensive unlabeled text corpus to predict words in a sentence, for example the next word or a masked word. The pre-trained language models can then be further trained on a small supervised dataset by slightly altering the behavior of the model. Training a neural network from scratch on a small amount of data may result in over-fitting or under-fitting. Pre-trained language models, however, can transfer their knowledge by further training on more specific downstream tasks with a small dataset, such as named entity recognition, POS tagging, question answering, and text classification.

Figure 3 shows an overview of our proposed language model, the source data used, and the architecture used to generate our model. Finally, it shows further training of the model via the fine-tuning technique for different downstream tasks, such as NER and POS tagging.
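The following is a minimal sketch, using the Hugging Face transformers library, of how a pre-trained masked language model can be fine-tuned for a token classification task such as NER or POS tagging. The checkpoint path, label set, and training arguments are placeholders for illustration, not the released TigRoBERTa artifacts or the exact hyper-parameters used in this work.

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # illustrative subset of the tag set
model_name = "path/to/pretrained-tigrinya-roberta"    # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                        num_labels=len(labels))

args = TrainingArguments(output_dir="tigrinya-ner-finetuned",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

# train_dataset and eval_dataset would hold tokenized sentences with
# word-aligned label ids; they are omitted in this sketch.
trainer = Trainer(model=model, args=args,
                  train_dataset=None, eval_dataset=None)
# trainer.train()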
4.2 Transformer-based Architectures
Our model builds on the well-studied Transformer [35]. The Transformer uses an encoder-decoder architecture for converting one sequence into another. The encoder takes a sequence as input and converts it into an embedding, a vector representation of the input; the decoder takes an embedding as input and converts it back into a sequence. The encoder and decoder each consist of several multi-headed attention layers stacked on top of each other. Recent approaches such as BERT, ALBERT, RoBERTa, GPT-3, and XLNet [7, 15, 18, 39, 26] use Transformers to create embeddings that can be used for other tasks.

In this study, we use RoBERTa [18], a Transformer-based model that replicates BERT (Bidirectional Encoder Representations from Transformers) and, like BERT, is built from stacked Transformer encoders.

RoBERTa was developed by Facebook [18] with the aim of optimizing the training of BERT. It was introduced to improve BERT's training method and shares a similar architecture with BERT. RoBERTa modifies key hyper-parameters of BERT, including removing BERT's next sentence prediction objective, training on longer sequences, and introducing dynamic masking. RoBERTa also changes BERT's training by using much larger mini-batches and learning rates. This allows RoBERTa to improve on the MLM objective compared to BERT and leads to better performance on downstream tasks. Moreover, RoBERTa comes in two sizes: the base model and the large model. The RoBERTa base model consists of 12 layers with a hidden size of 768 and about 125M parameters, while the RoBERTa large model has 24 layers with a hidden size of 1024 and about 355M parameters.
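As a minimal sketch, again using the Hugging Face transformers library, the snippet below configures a RoBERTa-base-sized masked language model and a data collator that performs dynamic masking. The vocabulary size and tokenizer are placeholders (in practice they would come from a tokenizer trained on the Tigrinya corpus); this is not the exact pre-training script used for TigRoBERTa.

from transformers import (RobertaConfig, RobertaForMaskedLM,
                          RobertaTokenizerFast, DataCollatorForLanguageModeling)

# RoBERTa-base-sized configuration: 12 layers, hidden size 768, 12 attention heads.
config = RobertaConfig(vocab_size=50265,        # placeholder; set by the trained tokenizer
                       num_hidden_layers=12,
                       hidden_size=768,
                       num_attention_heads=12,
                       intermediate_size=3072)
model = RobertaForMaskedLM(config)

# Dynamic masking: masked positions are re-sampled every time a batch is built,
# rather than being fixed once during preprocessing.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")  # placeholder tokenizer
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)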