
A Proposed Approach for Arabic Semantic Annotation

Ghada Khairy1 (corresponding author), A. A. Ewees1, and Mohamed Eisa2

1 Computer Department, Damietta University, Damietta, Egypt
ghadakhairy89@yahoo.com
2 Computer Science Department, Port Said University, Port Said, Egypt

Abstract. Semantic annotation is the process of annotating documents with
ontology concepts so that their data becomes meaningful. Most techniques and
methods for semantic annotation and retrieval have been developed for the
English language. This paper aims to enhance information retrieval for the
Arabic language by relying on an ontology during document annotation. To
achieve this aim, the proposed approach identifies and addresses the problems
specific to the Arabic language. The approach is based on ontology-driven
semantic annotation and the Resource Description Framework (RDF). In the
experiments, the proposed approach achieved high precision and high recall for
semantic annotation.

Keywords: Arabic semantic annotation · Ontology · Resource Description Framework · Stemming

1 Introduction

The semantic web recasts the Web as a Web of data rather than a Web of
documents, in a form that computers can process. This vision can be realized on
the current Web by adopting semantic annotation. Because of the exponential
growth and enormous size of Web resources, rapid and automatic semantic
annotation of Web documents is required. The Arabic language has attracted
significant research attention from academia because of the complexity and
challenges it poses for semantic web research compared with Latin-script
languages, especially in the field of semantic annotation [1]. Semantic
Annotation (SA) is the process of attaching metadata, expressed as ontology
concepts (i.e. classes, instances, properties, and relations), in order to
specify semantics and to state precisely what the annotated concepts mean in
context [2]. Arabic is a complicated language, which may limit the development
of semantic web mechanisms for it. Compared with English, it has several
particularities, such as short vowels, the lack of uppercase letters, and a
complicated morphology. Arabic words are classified as nouns, verbs, and
particles. Arabic is also highly inflectional and derivational, which makes
morphological analysis a very difficult task [3]. An Arabic word can carry more
than one affix and can be represented as a sequence of prefixes, a lemma, and
suffixes, which makes stemming more complicated. Furthermore, Arabic words
exhibit various types of ambiguity related to typographic patterns and
spelling [1]. These challenges increase the difficulty in

© Springer Nature Switzerland AG 2020


A. E. Hassanien et al. (Eds.): AMLTA 2019, AISC 921, pp. 556–565, 2020.
https://doi.org/10.1007/978-3-030-14118-9_56

dealing automatically with Arabic documents. Computer science techniques are
widely used to solve many complicated problems, such as image processing
[4–7], forecasting [8], prediction [9–12], and feature reduction [13].
Moreover, many studies have addressed the challenges of Arabic semantic
annotation. The authors of [14] proposed an approach for improving information
retrieval for the Arabic language based on an ontology used during document
annotation. Their results showed a significant improvement in document
retrieval according to two standard evaluation measures, accuracy and recall.
In addition, an automatic annotation tool that supports the semantic
annotation of Arabic web documents was presented in [1]. Its results showed
encouraging progress in leveraging semantic web technologies to support Arabic
and to provide semantically annotated web documents for several fields
automatically. Further, the design, implementation, and evaluation of a lexical
ontology for Arabic semantic relationships were reported in [15]. The principal
objective of that ontology is to aid the semantic annotation of Arabic textual
content, and the evaluation indicated that the ontology is suitable for
annotating Arabic text with lexical relations. Besides, a framework for a
semantic annotation tool supporting Arabic content was proposed in [16] and
compared with other tools in the same field.
In addition, [17] presented the automatic annotation of Arabic web resources
related to the food, nutrition, and health fields in order to improve the
Arabic OWL ontologies associated with those domains. The results showed
encouraging accuracy and recall. Furthermore, [18] offered a way to cluster by
semantic similarity, combining k-means document clustering with semantic
feature extraction and document vectorization to group Arabic web pages
according to their semantic similarity and then annotate them semantically.
Besides, [19] proposed a technique for extracting taxonomic relationships in
order to build an ontology automatically from raw Arabic text in the political
news domain. The results reached 92% and a recall of 91% [19].
The main motivation for this paper is to propose an approach that enhances
information retrieval for the Arabic language by relying on an ontology during
document annotation. The rest of this paper is organized as follows:
Sect. 2 provides the materials and methods on which semantic annotation tools
are based. Section 3 presents the details of the proposed Arabic semantic
annotation approach. Section 4 describes the experimental setup and discusses
the results. Finally, Sect. 5 concludes the paper.

2 Materials and Methods

The semantic web, as described by the W3C, is a web of data: the combination of
semantic web technologies, e.g. RDF and OWL, provides an environment in which a
software application can query the data and draw inferences using vocabularies
(ontologies). The success of the semantic web requires the broad availability
of semantic annotations for existing and new documents on the web [20]. This
section explains the RDF and ontology techniques used for semantic annotation
in detail.

2.1 Resource Description Framework (RDF)


Resource Description Framework (RDF) is a language for describing information
about resources on the World Wide Web [21]; it is a basic ontology language.
RDF is commonly serialized in XML. By using XML, RDF information can easily be
exchanged between different types of computers running different operating
systems and application languages. RDF was intended to provide a general way
to represent information so that it is machine-readable. RDF descriptions are
not designed to be displayed on the web [22]. RDF is a data model for
describing objects and the relationships between them [23]. RDF identifies
resources with Uniform Resource Identifiers (URIs) [24].
For example, in the triple “Machine Learning part of Artificial Intelligence”,
“Machine Learning” is the subject, “part of” is the predicate, and “Artificial
Intelligence” is the object. A set of RDF statements is called an RDF graph
[25]. RDF Schema (RDFS) is a language for defining the vocabulary used to
describe the properties and classes of RDF resources. RDFS is used to define
graphs of RDF triples, with semantics of generalization/prioritization of such
properties and classes [23]. In RDFS, the predefined web resources rdfs:Class,
rdfs:Resource, and rdf:Property can be used to declare classes, resources, and
properties respectively. At first glance, RDFS is a simple ontology language
that supports only class and property hierarchies, as well as domain and range
restrictions for properties [25].
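
The subject, predicate, and object structure described above can be illustrated with a minimal sketch in plain Python. It assumes a tiny in-memory set of tuples rather than a real RDF library; the `http://example.org/` URIs and the `match` helper are hypothetical names introduced only for this illustration.

```python
# A minimal in-memory stand-in for an RDF graph: a set of
# (subject, predicate, object) triples identified by URIs.
# The example.org URIs are placeholders, not identifiers from the paper.
EX = "http://example.org/"

graph = {
    (EX + "MachineLearning", EX + "partOf", EX + "ArtificialIntelligence"),
    (EX + "ImageProcessing", EX + "partOf", EX + "ArtificialIntelligence"),
    (EX + "ArtificialIntelligence", EX + "partOf", EX + "ComputerScience"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Every subject stated to be part of Artificial Intelligence.
parts_of_ai = [
    s for (s, _, _) in match(graph, predicate=EX + "partOf",
                             obj=EX + "ArtificialIntelligence")
]
print(parts_of_ai)
```

A set of tuples is enough to show how a graph is queried by pattern; a production system would normally use a dedicated RDF store and query language instead.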

2.2 Ontology
An ontology is a formal, explicit specification of a given conceptualization in
terms of concepts and relationships [26]. An ontology includes individuals,
classes, properties, and relationships. Individuals are the essential elements
of an ontology. They represent objects in the domain of interest, such as
people, animals, and plants, as well as abstract individuals such as numbers
and words. Individuals are also known as instances and can be referred to as
‘instances of classes’. Classes are collections or sets of objects represented
by a collection of properties. Classes may classify individuals with the aid of
these properties. Properties are the attributes and features that classes can
have. For example, a person class or object has the properties name, age,
height, etc. Relations between objects in an ontology define how objects are
associated with other objects [27]. The Web Ontology Language (OWL)
facilitates greater machine understanding of web resources than RDFS supports
by providing additional constructors for building class and property
descriptions (vocabulary) and new axioms (constraints), along with a formal
semantics [25].
In the top-down approach, the concepts in the ontology are derived from an
analysis and study of relevant information sources about the domain. A
top-down development process begins with the definition of the most general
concepts in the domain and continues with their subsequent specialization [28].
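
The top-down idea, starting from the most general concept and specializing it, can be sketched as a small class hierarchy. The class names follow the classes listed in the Dataset section; the root name "Computer Science", the dictionary layout, and the helper functions are assumptions made for illustration only.

```python
# Top-down sketch: begin with the most general concept and specialise it.
# Class names come from the paper's dataset description; the root name and
# the dict-of-lists layout are illustrative assumptions.
SUBCLASSES = {
    "Computer Science": [
        "Programming", "Artificial Intelligence", "Databases",
        "Operating Systems", "Computer Security", "Multimedia",
    ],
    "Programming": ["Programming Languages", "Application Programs"],
    "Artificial Intelligence": [
        "Machine Learning", "Image Processing", "Natural Language Processing",
    ],
}


def descendants(cls):
    """Collect every class below `cls`, depth first."""
    found = []
    for child in SUBCLASSES.get(cls, []):
        found.append(child)
        found.extend(descendants(child))
    return found


def ancestors(cls):
    """Walk upwards from `cls` to the root of the hierarchy."""
    for parent, children in SUBCLASSES.items():
        if cls in children:
            return [parent] + ancestors(parent)
    return []


print(descendants("Artificial Intelligence"))
print(ancestors("Machine Learning"))
```

Walking upwards with `ancestors` is what lets a document annotated with a specific class also be found under its more general classes.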

3 The Proposed Approach

The proposed approach consists of three main phases. The first phase,
pre-processing, consists of five steps: preparing the Arabic documents,
removing diacritics, removing stop-words, stemming, and part-of-speech
tagging. The second phase, Arabic document annotation, consists of the
following steps: obtaining the ontology, RDF, and text correction. The third
phase is text retrieval. Figure 1 presents the stages of the proposed approach
for Arabic semantic annotation.

Fig. 1. Proposed approach for Arabic semantic annotation

3.1 Pre-processing Phase


This phase consists of five steps: preparing the Arabic documents, removing
diacritics, removing stop-words, stemming, and part-of-speech tagging.

Prepare the Arabic documents: this is one of the most important steps. In this
step, documents in the computer science domain are collected manually. They
are then used for annotating and retrieving Arabic documents.

Removing diacritics: it removes diacritics from the Arabic documents.

Removing stop-words: it removes all stop-words from the Arabic documents.
Stop-words are terms that occur too frequently in the text and are therefore
insignificant. The stop-word list collected for the proposed approach contains
13,019 entries.

Stemming: it reduces Arabic words to their stems. Stemming removes all prefix
and suffix letters from a word to obtain its stem. This is an important
process because stemming reduces the space of words and thereby improves the
effectiveness of the retrieval process. This step is divided into three
stages. The first stage removes all prefixes from the word. It depends on the
light stemmer technique.

Part of speech (POS): it finds POS tags for the stem and affixes in order to
overcome the difficulty of determining the POS tag of a word in a particular
context, which arises primarily because the same word may be spelled in
different ways. Further, detecting the difference between Arabic derivatives
is a very challenging issue for the majority of POS taggers. Hence, assigning
the correct POS tags requires advanced processing and the use of considerable
resources.
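
The first text-level steps of this phase (removing diacritics, removing stop-words, and light stemming) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word set, the prefix and suffix lists, and the `min_len` guard are small hypothetical stand-ins, and POS tagging is omitted because it needs a trained tagger.

```python
import re

# Arabic short-vowel marks (U+064B to U+0652), superscript alef (U+0670),
# and tatweel, the elongation character (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Tiny illustrative lists; the paper's own stop-word list has 13,019 entries.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "هو", "هي"}
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]  # longest first
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def remove_diacritics(text):
    """Delete diacritic marks, leaving the base letters untouched."""
    return DIACRITICS.sub("", text)


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Remove diacritics, split on whitespace, drop stop-words, then stem."""
    tokens = remove_diacritics(text).split()
    return [light_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("البرمجة في الحاسوب"))
```

Light stemming only trims affixes and does not look for a root, which keeps it fast but means that words sharing a root can still end up with different stems.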

3.2 Arabic Documents Annotation Phase


This phase begins with obtaining the ontology. The ontology is built using a
top-down approach [29] in the following stages:

Define concepts (i.e. classes and sub-classes): the classes and sub-classes
used in our ontology domain are defined.

Define instances (i.e. real elements in the chosen domain): creating instances
(individuals) is a very important step that enriches the ontology through
their direct relation with classes and sub-classes.

Getting synonyms: synonyms are words very similar in meaning to the basic
word. They are useful for retrieval, and thresholds are applied to eliminate
poorly scoring words that would unfairly penalize a particular set.
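
The stages above can be sketched as a small lexicon lookup that turns matched words into annotation triples. Everything named here is hypothetical: the Arabic labels, their similarity scores, the 0.5 threshold, and the `hasTopic` predicate are illustrative assumptions, not values from the paper's ontology.

```python
# Hypothetical ontology fragment: each concept maps its labels (the basic
# word plus candidate synonyms) to a similarity score.
# Labels and scores are illustrative, not taken from the paper.
ONTOLOGY = {
    "Programming": {"برمجة": 1.0, "ترميز": 0.8, "كتابة": 0.3},
    "Databases": {"بيانات": 1.0, "جداول": 0.6},
}


def build_lexicon(ontology, threshold=0.5):
    """Map every label scoring at least `threshold` to its concept."""
    return {
        label: concept
        for concept, labels in ontology.items()
        for label, score in labels.items()
        if score >= threshold
    }


def annotate(doc_id, tokens, lexicon):
    """Emit (document, predicate, concept) triples for tokens in the lexicon."""
    concepts = sorted({lexicon[token] for token in tokens if token in lexicon})
    return [(doc_id, "hasTopic", concept) for concept in concepts]


lexicon = build_lexicon(ONTOLOGY)
triples = annotate("doc1", ["ترميز", "كتابة", "بيانات"], lexicon)
print(triples)
```

The low-scoring synonym is dropped by the threshold before annotation, so it cannot attach a document to a concept it only loosely relates to.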

3.3 Indexing and Text Retrieval Phase


In this phase, the purpose of indexing is to optimize speed and performance in
finding relevant documents for a search query. In our work, we used term
frequency–inverse document frequency (TF-IDF), a numerical statistic intended
to reflect how important a word is to a document in our computer science
domain. Eq. (1) gives the term frequency (TF), Eq. (2) gives the inverse
document frequency (IDF), and Eq. (3) gives the TF-IDF weight.
$$TF_{i,j} = \sum_{k} n_{i,j} \tag{1}$$

$$IDF(w) = \log\left(\frac{N}{df_t}\right) \tag{2}$$

$$w_{i,j} = TF_{i,j} \times \log\left(\frac{N}{df_t}\right) \tag{3}$$

where $TF_{i,j}$ is the number of occurrences of term $i$ in document $j$,
$df_t$ is the number of documents containing the term, and $N$ is the total
number of documents. This phase retrieves the relevant documents for a query
entered by the user by linking the search word, its synonyms, and the
documents, and by ranking the documents from the closest to the farthest link.
The text is also corrected at this stage: misspellings in the search word are
detected and corrected automatically so that results can be returned for the
intended word. For example, the word “ ” is corrected to the word “ ”.
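
The retrieval steps above can be sketched end to end on a toy corpus. Several choices here are assumptions rather than details from the paper: the English tokens stand in for pre-processed Arabic stems, TF is taken as the raw occurrence count (following the definition of $TF_{i,j}$ above), the logarithm is natural, and `difflib.get_close_matches` with a 0.8 cutoff stands in for the unspecified correction method.

```python
import math
from collections import Counter
from difflib import get_close_matches

# Toy corpus of already pre-processed documents (hypothetical tokens).
DOCS = {
    "d1": ["python", "language", "compiler", "python"],
    "d2": ["database", "query", "index"],
    "d3": ["python", "database", "driver"],
}


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = TF * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)  # TF taken as the raw occurrence count
        weights[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def correct(term, vocabulary):
    """Replace a misspelled query term with its closest vocabulary entry."""
    if term in vocabulary:
        return term
    close = get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else term


def search(query, docs):
    """Correct the query term, then rank matching documents by TF-IDF weight."""
    weights = tf_idf(docs)
    vocabulary = sorted({term for tokens in docs.values() for term in tokens})
    term = correct(query, vocabulary)
    scored = [(w[term], doc_id) for doc_id, w in weights.items() if term in w]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


print(search("pyhton", DOCS))
```

A term that occurs in every document receives a weight of zero under this formula, which is the intended effect: such a term does not help distinguish one document from another.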

4 Experiments and Results

This section describes the dataset, the performance measures, the parameter
settings, the experimental setup, and the results.

4.1 Dataset
The dataset consists of 63 Arabic documents from the computer science domain.
The smallest document contains 260 words, and the largest contains 2570 words.
The longest sentence in the dataset contains 43 words, and the shortest
contains 6 words. The documents were collected manually from the Wikipedia
website “ar.wikipedia.org” and assigned to the classes programming, artificial
intelligence, databases, operating systems, computer security, and multimedia.
The programming class is divided into the sub-classes programming languages
and application programs. The artificial intelligence class is divided into
the sub-classes machine learning, image processing, and natural language
processing. These classes are divided into subclasses. Unicode was used as the
universal character encoding standard; it defines the way individual
characters are represented in text files, web pages, and other types of
documents. Because synonyms in this field are difficult to obtain in Arabic,
the site “thesaurus.com” was used to obtain English synonyms, which we then
translated into Arabic. The DBpedia Ontology “wiki.dbpedia.org” is a simple,
cross-domain ontology that was manually designed based on the most commonly
used infoboxes within Wikipedia. It was used in the dataset to link the
classes, as a prerequisite for building the ontology through RDF and
annotating the Arabic documents. The dataset contains 6 classes, 17
subclasses, 13 individuals, and 30 synonyms.

4.2 Performance Measures


To evaluate the performance of the proposed approach, the accuracy, precision,
and recall measures [21] are used. Equation (4) defines accuracy, a measure of
the overall correctness of the approach: the number of correctly classified
documents divided by the total number of documents. Equation (5) defines
precision: the number of relevant documents retrieved divided by the total
number of documents retrieved. Equation (6) defines recall: the number of
relevant documents retrieved divided by the total number of existing relevant
documents.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

where TP is true positive, FP is false positive, FN is false negative, and TN is
true negative.
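
Eqs. (4), (5), and (6) translate directly into a few lines of Python. The counts in the example call are hypothetical, chosen only to make the arithmetic easy to follow, and the zero-denominator guards are an added assumption.

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts, not a row of the paper's results.
scores = evaluate(tp=8, tn=1, fp=1, fn=2)
print({name: round(value, 3) for name, value in scores.items()})
```

Precision and recall ignore the true negatives entirely, which is why they are usually reported alongside accuracy in retrieval settings where most documents are irrelevant to any given query.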

4.3 Experiment Description


This part describes the environment used to run the proposed approach for
Arabic semantic annotation. All experiments were performed on an Intel Core i5
2.5 GHz CPU with 8 GB of RAM running 64-bit Windows 10. The software used
comprises the DBpedia ontology, Protégé 5, and Python 3.

4.4 Results and Discussion


All results are tabulated below. Table 1 shows the TP (true positive), TN
(true negative), FP (false positive), and FN (false negative) counts for each
class of the computer science domain (Computer, Programming, Artificial
Intelligence, Machine Learning, Databases, Operating Systems, Natural Language
Processing, Image Processing, Computer Security, and Multimedia). The results
were obtained by (a) the proposed approach and (b) searching with the
traditional method.

Table 1. TP, TN, FP and FN results for each class

Classes                        (a) The proposed approach   (b) The traditional method
                               TP    TN    FP    FN        TP    TN    FP    FN
Computer                       58     2     2     4        32     1     7    25
Programming                    55     3     3     5        31     3     6    26
Artificial Intelligence        57     1     3     5        24     1     6    25
Machine Learning               51     5     4     6        22     5     5    32
Databases                      55     2     1     3        26     5     6    28
Operating Systems              55     6     1     4        19     6     5    33
Natural Language Processing    57     2     2     5        27     1     6    32
Image Processing               59     6     0     1        23     3     5    35
Computer Security              52    10     0     4        20    10     3    37
Multimedia                     55     8     1     0        28     8     4    30

The results in Table 2 show differences in the evaluation measures across the
ten classes, for the following reasons. The Machine Learning class achieved
the lowest precision (92.7%) because some of its documents have mixed content
and thus belong to more than one class. It also achieved the lowest recall
(87.9%), since some of its documents were assigned to other classes for the
same reason. The Databases class achieved high recall (94.8%) and precision
(98.2%) because it is a specific category within the computer science domain
whose content does not overlap with other classes.

Table 2. Results of precision, recall and accuracy

Annotation types               (a) The proposed approach results   (b) The traditional method
                               Recall    Precision   Accuracy      Recall    Precision   Accuracy
Computer                       93.5%     96.6%       90.77%        56.14%    82.05%      50.77%
Programming                    91.6%     94.8%       87.88%        54.39%    83.78%      51.52%
Artificial Intelligence        91.9%     95%         87.87%        48.98%    80.00%      44.64%
Machine Learning               87.9%     92.7%       83.58%        40.74%    81.48%      42.19%
Databases                      94.8%     98.2%       93.75%        48.15%    81.25%      47.69%
Operating Systems              92.9%     98%         92.19%        36.54%    79.17%      39.68%
Natural Language Processing    91.9%     96.6%       89.23%        45.76%    81.82%      42.42%
Image Processing               98.3%     100%        98.41%        39.66%    82.14%      39.39%
Computer Security              94.3%     100%        95.21%        35.09%    86.96%      42.86%
Multimedia                     91.6%     98.2%       98.44%        48.28%    87.50%      51.43%
Average                        92.87%    97%         91.73%        45.37%    82.62%      45.26%
Table 2 shows the calculated values of precision, recall, and accuracy for the
ontology concepts; only some of the results are listed because of space
limitations. The results were obtained by (a) the proposed approach and (b)
searching with the traditional method, and they are calculated using Eqs. (4),
(5), and (6). We can conclude that the results of the proposed approach are
better than those of the traditional (manual) method: it achieved a high
average accuracy of 91.7%, whereas the traditional method obtained only 45.2%.
On the one hand, the studies [14, 19] depended on the GATE software to perform
document annotation. They used the OntoRoot Gazetteer to produce
ontology-based annotations and to produce a part-of-speech tag as an
annotation on each Arabic word. In this research, the Arabic semantic
annotation (ASA) approach was implemented online through the site
https://pos-project.herokuapp.com, with Python code executing the
pre-processing phase, which is divided into five sub-phases: preparing the
Arabic documents, removing diacritics, removing stop-words, stemming, and
part-of-speech tagging. The ASA approach used part of the DBpedia machine
learning ontology, which was embedded and translated in the Python code.
Consequently, our results improved on the accuracy of previous research.
On the other hand, the studies [1, 15] depended on a local GUI (Graphical User
Interface), whereas in this research we used an online GUI. Thus, our results
improved the recall rate compared with previous work.

5 Conclusion

This paper has presented an approach based on Arabic semantic annotation that
facilitates Arabic information retrieval with high precision and high recall.
The approach consists of several stages: a pre-processing stage (preparing the
Arabic documents, removing diacritics, removing stop-words, stemming, and
part-of-speech tagging), an Arabic document annotation stage (obtaining the
ontology, RDF, and text correction), and an indexing and text retrieval stage.
With this approach, we overcome the difficulties of the Arabic language and
the limitations of the traditional way of searching for and retrieving
documents. The results showed high precision and high recall for all
annotation types, as reported in the Experiments section. In future work, we
intend to enlarge our document corpus in order to retrieve more documents in
the computer science domain and obtain more accurate results. We also intend
to extend the approach to help people with special needs (the blind) by
turning it into a dynamic interactive application that facilitates their
search, while further developing the underlying Arabic semantic annotation
mechanism.

References
1. Al-Bukhitan, S., Helmy, T., Al-Mulhem, M.: Semantic annotation tool for annotating Arabic
web documents. Procedia Comput. Sci. 32, 429–436 (2014)
2. Oliveira, P., Rocha, J.: Semantic annotation tools survey. In: 2013 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM), pp. 301–307. IEEE, April 2013
3. Beseiso, M., Ahmad, A.R., Ismail, R.: A Survey of Arabic language support in semantic
web. Int. J. Comput. Appl. 9(1), 35–40 (2010)
4. Kaloub, A.: Automatic ontology-based document annotation for Arabic information
retrieval. Unpublished master’s thesis, Islamic University-Gaza, Deanery of Graduate
Studies, Faculty of Information Technology (2013)
5. El Aziz, M.A., Ewees, A.A., Hassanien, A.E.: Hybrid swarms optimization based image
segmentation. In: Hybrid Soft Computing for Image Segmentation, pp. 1–21 (2016)
6. El Aziz, M.A., Ewees, A.A., Hassanien, A.E., Mudhsh, M., Xiong, S.: Multi-objective whale
optimization algorithm for multilevel thresholding segmentation. In: Advances in Soft
Computing and Machine Learning in Image Processing, pp. 23–39 (2018)
7. El Aziz, M.A., Ewees, A.A., Hassanien, A.E.: Multi-objective whale optimization algorithm
for content-based image retrieval. Multimed. Tools Appl. 77(19), 26135–26172 (2018)
8. El Aziz, M.A., Ewees, A.A., Hassanien, A.E.: Whale optimization algorithm and Moth-
Flame optimization for multilevel thresholding image segmentation. Expert Syst. Appl. 83,
242–256 (2017)
9. Sahlol, A.T., Moemen, Y.S., Ewees, A.A., Hassanien, A.E.: Evaluation of cisplatin
efficiency as a chemotherapeutic drug based on neural networks optimized by genetic
algorithm. In: 2017 12th International Conference on Computer Engineering and Systems
(ICCES), pp. 682–685. IEEE, December 2017
10. Ahmed, K., Ewees, A.A., Hassanien, A.E.: Prediction and management system for forest
fires based on hybrid flower pollination optimization algorithm and adaptive neuro-fuzzy
inference system. In: 2017 Eighth International Conference on Intelligent Computing and
Information Systems (ICICIS), pp. 299–304, December 2017

11. Sahlol, A.T., Ewees, A.A., Hemdan, A.M., Hassanien, A.E.: Training feedforward neural
networks using Sine-Cosine algorithm to improve the prediction of liver enzymes on fish
farmed on nano-selenite. In: 2016 12th International Computer Engineering Conference
(ICENCO), pp. 35–40. IEEE, December 2016
12. Oliva, D., Ewees, A.A., Aziz, M.A., Hassanien, A., Pérez-Cisneros, M.: A chaotic improved
artificial bee colony for parameter estimation of photovoltaic cells. Energies 10(7), 865
(2017)
13. Ahmed, K., Ewees, A.A., El Aziz, M.A., Hassanien, A.E., Gaber, T., Tsai, P.W., Pan, J.S.: A
hybrid krill-ANFIS model for wind speed forecasting. In: International Conference on
Advanced Intelligent Systems and Informatics, pp. 365–372, October 2016
14. Ewees, A.A., El Aziz, M.A., Hassanien, A.E.: Chaotic multi-verse optimizer-based feature
selection. Neural Comput. Appl., 1–16 (2017)
15. Al-Yahya, M., Al-Shaman, M., Al-Otaiby, N., Al-Sultan, W., Al-Zahrani, A., Al-Dalbahie,
M.: Ontology-based semantic annotation of Arabic language text. Int. J. Mod. Educ.
Comput. Sci. 7(7), 53 (2015)
16. El-ghobashy, A.N., Attiya, G.M., Kelash, H.M.: A proposed framework for Arabic semantic
annotation tool. Int. J. Com. Dig. Syst. 3(1), 47–53 (2014)
17. Albukhitan, S., Helmy, T.: Automatic ontology-based annotation of food, nutrition and
health Arabic web content. Procedia Comput. Sci. 19, 461–469 (2013)
18. Alghamdi, H.M., Selamat, A., Karim, N.S.A.: Arabic web pages clustering and annotation
using semantic class features. J. King Saud Univ. Comput. Inf. Sci. 26(4), 388–397 (2014)
19. El Zraie, B.: Extraction of Taxonomic Relations from Arabic Text for Ontology
Construction. Unpublished master’s thesis, Islamic University-Gaza, Deanery of Graduate
Studies, Faculty of Information Technology (2016)
20. Yang, C.Y., Lin, H.Y.: Semantic annotation for the web of data - an ontology and RDF
based automated approach. J. Converg. Inf. Technol. (JCIT), Special Issue Soc. Netw. Appl.
Decis. Support 6(4), 318–327 (2011)
21. Manola, F., Miller, E., McBride, B.: RDF primer. W3C recommendation, 10(1–107), 6
(2004)
22. Champin, P.-A.: RDF Tutorial. Pierre-Antoine Champin, 1–9, 5 April 2001
23. Alatrash, E.: Using Web Tools for Constructing an Ontology of Different Natural
Languages, Doctoral dissertation, University of Belgrade (2013)
24. Corcho, O., Fernández-López, M., Gómez-Pérez, A.: Methodologies, tools and languages
for building ontologies. Where is their meeting point? Data Knowl. Eng. 46(1), 41–64
(2003)
25. Pan, J., Horrocks, I.: RDFS(FA): connecting RDF(S) and OWL DL. IEEE Trans. Knowl.
Data Eng. 19, 192–206 (2007)
26. Gruber, T.R.: A translation approach to portable ontology specifications. Knowl. Acquis.
5(2), 199–220 (1993)
27. Ahmed, Z.: Domain Specific Information Extraction for Semantic Annotation. (Unpublished
Master Thesis), Charles University (2009)
28. Al Tayyar, M.S.: Arabic information retrieval system based on morphological analysis
(AIRSMA). Ph.D. thesis, De Montfort University, July 2000
29. López-Pellicer, F.J., Vilches-Blázquez, L.M., Nogueras-Iso, J., Corcho, Ó., Bernabé, M.A.,
Rodríguez, A.F.: Using a hybrid approach for the development of an ontology in the
hydrographical domain (2008)
