On Finding Similar Verses from the Holy Quran using Word Embeddings
Quratulain Rajput
Institute of Business Administration Karachi
All content following this page was uploaded by Sumaira Saeed on 01 September 2020.
Abstract— Finding semantic text similarity (STS) between two pieces of text is a well-known problem in Natural Language Processing. It has applications in nearly every field, such as plagiarism detection, finding related user queries in customer services, and finding similar questions in search engines or forums like Stack Overflow, Quora and Stack Exchange. If applied to a religious text, it can help to relate how similar pieces of knowledge are described in different places. This paper uses Word2Vec and Sent2Vec models to facilitate the process of knowledge extraction from a given corpus. The paper makes use of several English translations of the Holy Quran, which is the most sacred book for Muslims. Sent2Vec models have been trained from several translations of the book, and the trained models are then used to study the semantic relationship between different words and sentences. The performance of the custom-built word embeddings is compared against the pre-trained embeddings provided by the spaCy library.

Keywords— Word Embedding, Word2vec, Holy Quran, Semantic Text Similarity (STS), Quranic Verses Similarity

I. INTRODUCTION

Over the past many years, extensive research has been done in the area of semantic text similarity. Researchers are interested in giving machines enough intelligence to compare the semantic similarity of two pieces of text. The aim is to automate the process of extracting related and relevant chunks of information from large dumps of data.

Every piece of writing is different in its own way and has some special characteristics, which might be due to the nature of the subject, the target audience, or the writing style of the author. Sacred or religious books which are claimed not to be written by humans are unique in their own way and have a completely different structure. Muslims consider the Holy Quran the most sacred book and a source of guidance to follow in their lives. Therefore, it becomes vital for every Muslim to understand it. This paper is part of a continuing effort to ease the process of searching in the Holy Quran. It is well established among Muslims that the Holy Quran was not revealed at once but gradually, in a step-by-step fashion. As a result, information on a single subject is scattered across multiple places. Consider the two verses given below:

• And the mountains are removed and will be [but] a mirage. (78:20)
• And the mountains will be like wool, fluffed up. (101:5)

Both of the verses ‘78:20’ and ‘101:5’ describe a similar concept: that mountains will be destroyed. Therefore, these verses can be termed similar. When studying this phenomenon, the two verses together give a more detailed description of the event than either does in isolation.

Similarly, consider the two verses given below:

• “Then your hearts became hardened after that, being like stones or even harder.” (Qur’an 2:74)
• So for their breaking of the covenant We cursed them and made their hearts hard. (Qur’an 5:13)

The verses ‘2:74’ and ‘5:13’ discuss the idea that hearts will be hardened. These verses can be termed similar because they present a similar concept.

There are many more such examples which show that the information on a single subject is scattered at different places in the Holy Quran. Thus, in order to understand a subject completely, it is essential to look at all the related verses on that topic at once.

The rest of the paper is organized as follows. Section II provides a brief survey of the previous research done in the field of semantic text similarity. The proposed methodology of learning Word2Vec [7][8] models from a given text is explained in Section III, while experimental design and results are discussed in Sections IV and V, respectively. Finally, Section VI concludes the paper and provides future research directions.

II. RESEARCH SURVEY

Text similarity search plays an important role in text-related research, as it has several applications in NLP such as document clustering, topic detection, question answering, machine translation, text summarization and so on. The task can be achieved using a) lexical similarity or b) semantic similarity. In this section, previous research on both techniques that focused on the Quran corpus is presented. A few of the works have been applied to Arabic text, while the others focused on translations of the Holy Quran.

One of the earlier works reported on the Arabic text of the Holy Quran is [13], which presents a large dataset where semantically similar or related verses are linked together. It uses the knowledge of experts (having a deep understanding of both the Quran and its exegesis) to construct the dataset containing similar verse pairs. The work used lexical-similarity-based approaches like Term Frequency-Inverse Document Frequency (TF-IDF) to further improve the dataset
[Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S.
copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in
the code is paid through Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For reprint or republication permission, email to IEEE
Copyrights Manager at pubs-permissions@ieee.org. All rights reserved. Copyright ©2020 by IEEE.]
of related verses. However, this work represents a limited dataset and does not contain all the verses of the Holy Quran.

Another work that used a lexical-similarity-based technique for computing text similarities in the Arabic text of the Holy Quran was reported in [1]. It aimed to output verses from the Holy Quran which are similar or relevant to a user-provided query verse using the TF-IDF technique and normalization. It also performed Chapter (Surah) classification as Makki or Madni using N-grams with a machine learning algorithm (LibSVM classifier in Weka). This work computes similarity based on common key terms only. One major limitation of this work is the lack of semantic-based similarity search.

Semantic-similarity-based and lexical-similarity-based approaches were merged in [10]. The focus was on computing similarity between Arabic sentences. In the paper, the authors experimented with multiple techniques, namely word embeddings, weighted embeddings using TF-IDF, and POS tagging. The results of the experiments were compared with human-evaluated results. Their models performed fairly well. However, the models fail when two sentences share similar words that are used in completely different contexts.

Efforts have also been made on the translated text of the Holy Quran. The study in [5] developed a Question Answering System which focused on one chapter of the Holy Quran, the Chapter of the Cow (Surah Baqarah). The authors performed classification on the output to reduce the retrieval of irrelevant results. Neural Networks were used to classify verses of Surah Baqarah into Fasting or Pilgrimage categories. The authors used WordNet [9] and a collection of Islamic terms to expand user queries, which could then be processed further. The scope of this work was limited to only two categories: ‘Fasting’ and ‘Pilgrimage’.

Another study, [2], worked on both the Arabic text and the English translation of the Holy Quran. The paper aimed to solve the ‘problem of text similarity in the context of multi-lingual representations of the Qur’an’. Verses were represented as vectors, and different similarity measures such as Jaccard similarity, Euclidean distance, cosine similarity and the Pearson Correlation Coefficient were used to find similarity between different verses. The study fails to obtain semantic-based similarity between Quranic verses because it uses lexical-similarity-based techniques for similarity computations.

A comprehensive study was performed by [12] to find similarities between sacred texts. This study was performed on sacred texts of the Bible and the Holy Quran. Different statistical approaches were used to extract features and to compute similarity between the texts. Several distance measures, including Euclidean, Hellinger, Manhattan and Cosine, were used. The research studied overall similarity based on the topics contained in two documents. It does not go deeper into finding similarity between sentences from each text.

The findings of the above literature survey suggest that none of the studies focused on finding semantic-based similarity between verses of the Holy Quran. Most of the studies focused only on lexical-based similarity and failed to capture semantic similarity [1][2]. Also, where semantic-text-similarity-based approaches were used, they were either applied to the Arabic text of the Quran or applied only to a small portion of a Quranic translation [5][13].

The work proposed in this study is based on semantic text similarity search applied to all the verses of the Holy Quran translated into English. It can be used at sentence level to capture similar sentences in a Quranic translation. The next section provides details of the proposed framework for computing sentence similarity.

III. METHODOLOGY

This section explains the proposed framework and its major components. The framework is generic and is not dependent on any specific corpus type. It can be used to extract text which is semantically similar to the query text.

The framework is composed of four components (stages): corpus preprocessing, model training, computing verse similarity and result output. Fig. 1 shows the complete system flow of the framework. Each of these stages is discussed in the following sub-sections.

[Figure: flowchart with the boxes Corpus, Extract Sentences, Train Word2Vec Model, Save Key Vectors, Load Word Embeddings in spaCy, Input Query, Compute Verse Similarities, and Output Relevant Results]
Fig. 1. System Architecture

A. Corpus Preprocessing

This stage creates an appropriate text corpus that can be used for training. The text can be a news article, story, report, book, or research paper. The preprocessing steps consist of segmenting the text corpus into sentences, removing stop words, and stemming or lemmatization.

The first step is to break down the given text corpus into sentences, so that each record/sample represents a single sentence. This corpus/document containing a list of sentences is labeled ‘D’.

In the next stage, any punctuation marks or symbols are removed from the text. Depending on the nature of the text corpus, stop words may or may not hold much significance. Another preprocessing step applied to the corpus is stemming or lemmatization, which simplifies words. For example, the word ‘playing’ or ‘played’ will become ‘play’ after stemming.

B. Model Training

This stage creates domain-specific representations of each word as used in the corpus. These representations are also called word embeddings; the idea is to transform each word into a vector. Word2Vec is a popular technique for training word embeddings on a particular corpus [8]. There are two ways of training a Word2Vec model: CBOW (Continuous Bag of Words) and Skip-gram. These architectures are explained in detail in papers [7-8], and a high-level description is shown in Fig. 2.
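To make the difference between the two training modes concrete, the following is a minimal sketch of how each one forms training examples from a tokenized sentence. It is an illustration only, not the authors' implementation: the function names and the window size are assumptions introduced here, and the sketch builds input/target pairs without training any embeddings.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return the tokens within `window` positions of index i, excluding i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the surrounding context words are the input, the center word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: the center word is the input, each context word is a separate target."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


# Hypothetical tokenized sentence (window size 1 chosen only for readability).
sentence = ["mountains", "will", "be", "like", "wool"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['mountains', 'be'], 'will')
print(skipgram[:3])  # [('mountains', 'will'), ('will', 'mountains'), ('will', 'be')]
```

CBOW produces one example per word position, whereas Skip-gram produces one example per (center, context) pair, so Skip-gram generates more training examples from the same text.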
The resultant word embeddings capture the context of words in the corpus, semantic and syntactic similarity, and relations with other words.

training. A single translation of the Holy Quran had only around 6249 verses, which was not enough for training.
To increase the size of the training corpus, six more
English translations of the Holy Quran available at tanzil.net
were selected. The following seven translations (named as per
the respective translators) were used to train the final models
which have been used in the experiments: Shakir, Yousuf Ali,
Sahih International, Pickthal, Sarwar, Hilali and Arberry. Data
from all seven translations were merged together to form a
single corpus.
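As a rough sketch of how several translations could be merged and passed through the preprocessing stage of Section III-A (sentence segmentation, punctuation removal, stop-word removal, stemming), consider the code below. Everything in it is an assumption made for illustration: the stop-word list, the toy suffix-stripping stemmer (a stand-in for a real stemmer or lemmatizer), the function names, and the placeholder keys `translation_a` and `translation_b`, which do not correspond to actual translators. The sample sentences are verses quoted earlier in this paper.

```python
import re
from typing import Dict, List

# Hypothetical, tiny stop-word list; the paper does not specify one.
STOP_WORDS = {"a", "an", "and", "be", "of", "the", "will"}


def split_sentences(text: str) -> List[str]:
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def naive_stem(word: str) -> str:
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence: str) -> List[str]:
    """Lowercase, drop punctuation/symbols, remove stop words, then stem."""
    cleaned = re.sub(r"[^a-z\s]", " ", sentence.lower())
    return [naive_stem(w) for w in cleaned.split() if w not in STOP_WORDS]


def build_corpus(translations: Dict[str, List[str]]) -> List[List[str]]:
    """Merge the verses of all translations into one corpus D of token lists."""
    corpus = []
    for verses in translations.values():
        for verse in verses:
            for sent in split_sentences(verse):
                tokens = preprocess(sent)
                if tokens:
                    corpus.append(tokens)
    return corpus


# Placeholder input: keys are hypothetical labels, not real translator names.
translations = {
    "translation_a": ["And the mountains will be like wool, fluffed up."],
    "translation_b": [
        "So for their breaking of the covenant We cursed them and made their hearts hard."
    ],
}
corpus = build_corpus(translations)
print(corpus[0])  # ['mountain', 'like', 'wool', 'fluff', 'up']
```

Each entry of `corpus` is one tokenized sentence, which is the input shape that Word2Vec-style training typically expects. Whether stop words should be removed at all depends on the corpus, as noted in Section III-A.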
23:12 And certainly did We create man from an extract of clay. 2 73% 78% 86%