On Finding Similar Verses from the Holy Quran
using Word Embeddings
Sumaira Saeed, Sajjad Haider, Quratulain Rajput
Artificial Intelligence Lab
Institute of Business Administration
Karachi, Pakistan
{sumairasaeed,sahaider,qrajput}@iba.edu.pk

Abstract— Finding the semantic text similarity (STS) between two pieces of text is a well-known problem in Natural Language Processing. Its applications appear in nearly every field, such as plagiarism detection, finding related user queries in customer services, or finding similar questions in search engines and forums like Stack Overflow, Quora and Stack Exchange. Applied to a religious text, it can help relate how similar pieces of knowledge are described in different places. This paper uses Word2Vec and Sent2Vec models to facilitate the process of knowledge extraction from a given corpus. The paper makes use of several English translations of the Holy Quran, the most sacred book for Muslims. Sent2Vec models have been trained on several translations of the book, and the trained models are subsequently utilized to study the semantic relationship between different words and sentences. The performance of the custom-built word embeddings is compared against the pre-trained embeddings provided by the Spacy library.

Keywords— Word Embedding, Word2Vec, Holy Quran, Semantic Text Similarity (STS), Quranic Verses Similarity

I. INTRODUCTION

Over the past many years, extensive research has been done in the area of semantic text similarity. Researchers are interested in providing machines with enough intelligence to compare the semantic similarity between two pieces of text. The aim is to automate the process of extracting related and relevant chunks of information from large dumps of data.

Every piece of writing is different in its own way and has special characteristics which may be due to the nature of the subject, the target audience, or the writing style of the author. Sacred or religious books, which are claimed not to be written by humans, are unique in their own way and have a completely different structure. Muslims consider the Holy Quran the most sacred book and a source of guidance for their lifestyle; it is therefore vital for every Muslim to understand it. This paper is an endeavor, and part of a continuing effort, to ease the process of searching in the Holy Quran. It is well established among Muslims that the Holy Quran was not revealed at once but gradually, in a step-by-step fashion. As a result, information on a single subject is scattered across multiple places. Consider the two verses given below:

• And the mountains are removed and will be [but] a mirage. (78:20)
• And the mountains will be like wool, fluffed up. (101:5)

Both verses, '78:20' and '101:5', describe a similar concept: that the mountains will be destroyed. Therefore, these verses can be termed similar. When studied together, the two verses give a more detailed description of the event than either does in isolation.

Similarly, consider the two verses given below:

• "Then your hearts became hardened after that, being like stones or even harder." (Qur'an 2:74)
• So for their breaking of the covenant We cursed them and made their hearts hard. (Qur'an 5:13)

Verses '2:74' and '5:13' both discuss the idea of hearts being hardened. These verses can be termed similar because they present a similar concept.

There are many more such examples which show that information on a single subject is scattered across different places in the Holy Quran. Thus, to understand a subject completely, one needs to look at all the related verses on that topic at once.

The rest of the paper is organized as follows. Section II provides a brief survey of previous research in the field of semantic text similarity. The proposed methodology of learning Word2Vec [7][8] models from a given text is explained in Section III, while the experimental design and results are discussed in Sections IV and V, respectively. Finally, Section VI concludes the paper and provides future research directions.

II. RESEARCH SURVEY

Text similarity search plays an important role in text-related research, as it has several applications in NLP such as document clustering, topic detection, question answering, machine translation and text summarization. The task can be approached using a) lexical similarity or b) semantic similarity. In this section, previous research on both techniques, focused on the Quran corpus, is presented. A few of the works were applied to the Arabic text, while the others focused on translations of the Holy Quran.

One of the earlier works on the Arabic text of the Holy Quran is [13], which presents a large dataset in which semantically similar or related verses are linked together. It uses the knowledge of experts (having a deep understanding of both the Quran and its exegesis) to construct a dataset of similar verse pairs, and applies lexical similarity based approaches such as Term Frequency-Inverse Document Frequency (TF-IDF) to further improve the dataset of related verses. However, this work covers a limited dataset and does not contain all the verses of the Holy Quran.

Another work that used a lexical similarity based technique for computing text similarities in the Arabic text of the Holy Quran was reported in [1]. It aimed to output verses from the Holy Quran that are similar or relevant to a user-provided query verse, using the TF-IDF technique and normalization. It also performed Chapter (Surah) classification as Makki or Madni using N-grams and a machine learning algorithm (the LibSVM classifier in Weka). This work computes similarity based on common key terms only; its major limitation is the lack of a semantic-based similarity search.

Both semantic similarity and lexical similarity based approaches were merged together in [10], with a focus on computing similarity between Arabic sentences. The authors experimented with multiple techniques, namely word embeddings, embeddings weighted by TF-IDF, and POS tagging, and compared the results against human judgments. Their models performed fairly well; however, they fail when two sentences share similar words that are used in completely different contexts.

Efforts have also been made on the translated text of the Holy Quran. The study in [5] developed a Question Answering System focused on one chapter of the Holy Quran, the Chapter of the Cow (Surah Baqarah). The authors performed classification on the output to reduce the retrieval of irrelevant results: Neural Networks were used to classify verses of Surah Baqarah into Fasting or Pilgrimage categories. The authors used WordNet [9] and a collection of Islamic terms to expand user queries for further processing. The scope of this work was limited to only two categories, 'Fasting' and 'Pilgrimage'.

Another work, [2], covered both the Arabic text and the English translation of the Holy Quran. The paper aimed to solve the 'problem of text similarity in the context of multi-lingual representations of the Qur'an'. Verses were represented as vectors, and different similarity measures such as Jaccard similarity, Euclidean distance, cosine similarity and the Pearson Correlation Coefficient were used to find similarity between verses. The study fails to obtain semantic-based similarity between Quranic verses, as it relies on lexical similarity based techniques for its similarity computations.

A comprehensive study was performed by [12] to find similarities between sacred texts, namely the Bible and the Holy Quran. Different statistical approaches were used to extract features and compute similarity between the texts, using several distance measures including Euclidean, Hellinger, Manhattan and Cosine. The research studied overall similarity based on the topics contained in the two documents; it does not go deeper into finding similarity between individual sentences of each text.

The findings of the above literature survey suggest that none of the studies focused on finding semantic-based similarity between verses of the Holy Quran. Most of the studies considered only lexical-based similarity and failed to capture semantic similarity [1][2]. Where semantic text similarity based approaches were used, they were either applied to the Arabic text of the Quran or applied only to a small portion of a Quranic translation [5][13].

The work proposed in this study is based on a semantic text similarity search applied to all the verses of the Holy Quran translated into English. It can be used at the sentence level to capture similar sentences in a Quranic translation. The next section provides details of the proposed framework for computing sentence similarity.

III. METHODOLOGY

This section explains the proposed framework and its major components. The framework is generic and not dependent on any specific corpus type; it can be used to extract text which is semantically similar to a query text. The framework is composed of four components (stages): corpus preprocessing, model training, computing verse similarity, and result output. Fig. 1 shows the complete system flow of the framework. Each of these stages is discussed in the following sub-sections.

Fig. 1. System Architecture (Corpus → Extract Sentences → Train Word2Vec Model → Save Key Vectors → Load Word Embeddings in Spacy; Input Query → Compute Verse Similarities → Output Relevant Results)

A. Corpus Preprocessing

This stage creates an appropriate text corpus for training. The text can be a news article, story, report, book, or research paper. The preprocessing steps consist of segmenting the text corpus into sentences, removing stop words, and stemming or lemmatization.

The first step is to break the given text corpus down into sentences, so that each record/sample represents a single sentence. The corpus/document containing this list of sentences is labeled 'D'.

In the next step, any punctuation marks or symbols are removed from the text. Depending on the nature of the text corpus, stop words may or may not hold much significance and can be removed. A further preprocessing step applied to the corpus is stemming or lemmatization, which simplifies words; for example, the words 'playing' and 'played' both become 'play' after stemming.
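The paper does not name the tooling used for this stage; the sketch below illustrates it with NLTK, so the library choice and helper names are assumptions for illustration only.

```python
# A minimal sketch of the preprocessing stage (assuming NLTK).
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(corpus_text, remove_stopwords=True):
    """Segment a raw corpus into the list of token lists 'D'."""
    D = []
    for sentence in sent_tokenize(corpus_text):
        # Drop punctuation marks and symbols, lower-case the tokens.
        tokens = [t.lower() for t in word_tokenize(sentence)
                  if t not in string.punctuation]
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        # Stemming: 'playing' and 'played' both reduce to 'play'.
        tokens = [STEMMER.stem(t) for t in tokens]
        if tokens:
            D.append(tokens)
    return D

# e.g. preprocess("And the mountains are removed. They will be a mirage.")
```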
B. Model Training

This stage creates domain-specific representations of each word as used in the corpus. Also called word embeddings, the idea is to transform each word into a vector. Word2Vec is a popular technique for training word embeddings on a particular corpus [8]. There are two ways of training a Word2Vec model: CBOW (Continuous Bag of Words) and Skip-gram. These architectures are explained in detail in [7][8], and a high-level description is shown in Fig. 2. The resultant word embeddings capture the context of words in the corpus, their semantic and syntactic similarity, and their relations with other words.

Fig. 2. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word [7]
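As a concrete illustration, the sketch below trains both variants with Gensim, the library the paper later uses for its custom models (Section IV); the toy corpus and parameter values here are placeholders, not the paper's final settings.

```python
# A minimal sketch of training the two Word2Vec variants with Gensim.
from gensim.models import Word2Vec

sentences = [["heart", "harden", "stone"],
             ["mountain", "wool", "fluff"]]   # toy tokenized corpus 'D'

cbow_model = Word2Vec(sentences, sg=0, vector_size=100,
                      window=5, min_count=1, epochs=50)      # sg=0 -> CBOW
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100,
                          window=5, min_count=1, epochs=50)  # sg=1 -> Skip-gram

vector = cbow_model.wv["heart"]                # 100-dimensional word embedding
similar = cbow_model.wv.most_similar("heart")  # nearest words in vector space
```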
The trained model is able to capture syntactically and semantically related words. This is one of the main characteristics that differentiates semantic-based search from traditional keyword-based search: in a keyword-based search, a user needs to know the exact vocabulary used in the corpus in order to query for a similar piece of text. With word embeddings, this limitation is relaxed, and the search can be performed with any word, without worrying about particular keywords.

C. Sentence Embeddings

The trained model provides word embeddings for each word in a sentence, but sentence embeddings are required to compute sentence similarity. The word embeddings obtained from the trained model for each word in the sentence are therefore summed together and divided by the number of word tokens. The resultant Sent2Vec representation is used in the experiments as the sentence embedding.

In the proposed system, a user gives input in the form of a target sentence, and the system outputs semantically similar sentences from the available corpus. The query sentence provided by the user is termed 'Q'. Cosine similarity is used to measure the similarity score between two sentence embeddings. The results retrieved for a query are sorted in descending order of similarity score, and the top-10, top-5, or top-3 results are considered for computing the precision score.
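A minimal sketch of this averaging scheme and the cosine score, cos(u, v) = u·v / (||u|| ||v||), assuming a trained Gensim Word2Vec model named `model`:

```python
# Sentence embedding as the mean of word vectors, plus cosine similarity.
import numpy as np

def sentence_vector(tokens, model):
    """Sum the word embeddings and divide by the number of tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:                          # no token found in the vocabulary
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# score = cosine_similarity(sentence_vector(q_tokens, model),
#                           sentence_vector(verse_tokens, model))
```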
IV. DESIGN OF EXPERIMENTS

This section explains the design of the experiments conducted to test the proposed framework. It also discusses the dataset used for corpus creation, the training parameters, and the creation of the test dataset.

A. Dataset

This research focuses on the English translations of the Holy Quran. In the initial experiments, only one English translation, 'Sahih International', downloaded from tanzil.net, was used for training. The results of these preliminary experiments were not very encouraging. The obvious reason was that Word2Vec needs a large dataset for proper training, and a single translation of the Holy Quran contains only around 6,249 verses, which is not enough.

To increase the size of the training corpus, six more English translations of the Holy Quran available at tanzil.net were selected. The following seven translations (named after the respective translators) were used to train the final models used in the experiments: Shakir, Yousuf Ali, Sahih International, Pickthal, Sarwar, Hilali and Arberry. Data from all seven translations were merged together to form a single corpus.

B. Data Pre-processing on Quran Corpus

The first step of the proposed framework is data preprocessing. Let the complete dataset containing the seven Quranic translations be called 'D'. This dataset contains 6,249 x 7 = 43,743 lines. Quranic verses vary in size: some are small and some are large; some contain a single word or sentence and some contain multiple sentences.

The first step of data preprocessing is to break all the verses in 'D' into sentences; after this step, dataset D has approximately 70,000 records. Next, normalization is performed on the dataset to remove all punctuation marks and symbols (if any). Stop words, which usually do not add significant meaning to the data, are also removed. Moreover, affixes are removed from words so that words sharing the same root become identical; this is achieved through stemming.

C. Model Training on Quran Corpus

This research uses the Word2Vec technique to train word embeddings on the translations of the Holy Quran. Multiple experiments were conducted to increase the performance of the system. The following parameters of the Word2Vec models were tweaked during the experiments:

• Window size
• Skip-gram or CBOW
• Number of epochs

The trained models contain a vector representation for each word in the vocabulary. The vectors of all the words are saved in a format that can be loaded into Spacy, a Python library, for processing user queries.
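The paper does not specify the export mechanism; one plausible route, sketched below under that assumption, is Gensim's word2vec text format together with spaCy v3's `init vectors` command. The file and directory names are illustrative.

```python
# Hypothetical export of the trained vectors for use in spaCy;
# 'model' is the trained Gensim Word2Vec model from the previous stage.
model.wv.save_word2vec_format("quran_vectors.txt")

# Then, on the command line (spaCy v3):
#   python -m spacy init vectors en quran_vectors.txt ./quran_vectors_model
import spacy

nlp = spacy.load("./quran_vectors_model")  # pipeline backed by the custom vectors
```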
D. Sample Test Verses

A total of ten sentences from the Holy Quran were manually selected as the test dataset. Precision was used as the evaluation metric in the experiments. All the selected sentences share a common property: each verse has at least ten other similar sentences available in the Holy Quran. The sample sentences are listed in Table I.

TABLE I. QUERY SENTENCES IN TEST DATA

S.# | Ayat# | Query Sentence
1 | 7:109 | Said the eminent among the people of Pharaoh, "Indeed, this is a learned magician."
2 | 76:3 | Indeed, We guided him to the way, be he grateful or be he ungrateful.
3 | 2:18 | Deaf, dumb and blind - so they will not return [to the right path].
4 | 18:107 | Indeed, those who have believed and done righteous deeds - they will have the Gardens of Paradise as a lodging,
5 | 15:33 | He said, "Never would I prostrate to a human whom You created out of clay from an altered black mud."
6 | 2:155 | And We will surely test you with something of fear and hunger and a loss of wealth and lives and fruits, but give good tidings to the patient.
7 | 35:3 | Is there any creator other than Allah who provides for you from the heaven and earth?
8 | 3:90 | Indeed, those who reject the message after their belief and then increase in disbelief - never will their [claimed] repentance be accepted, and they are the ones astray.
9 | 2:2 | This is the Book about which there is no doubt, a guidance for those conscious of Allah.
10 | 2:43 | they establish prayer and give zakah and bow with those who bow [in worship and obedience].

The next section provides details about the experiments conducted in this research.
E. Experiments

In this study, a total of five experiments were conducted. Each experiment used different parameters for training the model; variations were based on the architecture of the model and its parameters. Initially, the window size for the experiments was set to ten, but this gave very poor results. Multiple preliminary experiments were conducted with different values, and it was observed that a window size of five gave comparatively good results. Details of each model and the parameters used in the experiments are provided below.

E.1. Model # 1
The first model employs the pre-trained word embeddings provided by 'Spacy', a Python library. The model used in this experiment is en_core_web_md. It is an English multi-task Convolutional Neural Network (CNN) [4] trained on OntoNotes [14], with GloVe [11] vectors trained on Common Crawl [3]. It contains 685,000 keys and 20,000 unique words, with each word corresponding to a unique 300-dimensional vector. The similarity measure used is cosine similarity, which computes the cosine of the angle between two vectors to determine their similarity.
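For reference, this baseline can be reproduced in a few lines; the example sentences below are the verse pair from the introduction.

```python
# A minimal sketch of the spaCy baseline (Model # 1).
import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc_a = nlp("And the mountains are removed and will be [but] a mirage.")
doc_b = nlp("And the mountains will be like wool, fluffed up.")
print(doc_a.similarity(doc_b))  # cosine similarity of the averaged word vectors
```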
E.2. Model # 2
A customized Word2Vec model was trained using Gensim, a Python library. The training dataset contained all seven translations of the Holy Quran (document 'D') described in the "Dataset" section above. The model was trained using the Continuous Bag of Words (CBOW) architecture. Before training, all the Quranic verses contained in 'D' were split into sentences and then tokenized; none of the preprocessing steps were performed on the dataset. The training parameters used in this experiment were as follows:

• Window size = 5
• Epochs = 50

E.3. Model # 3
A customized Word2Vec model was trained using the Gensim library. The training dataset in this model also contained all seven translations of the Holy Quran (document 'D'). As in Model # 2, all the Quranic verses contained in 'D' were split into sentences and then tokenized. All the preprocessing steps, namely normalization, stop word removal and stemming, were performed. This trained Word2Vec model again used the Continuous Bag of Words (CBOW) architecture. The training parameters used in this experiment were as follows:

• Window size = 5
• Epochs = 50

E.4. Model # 4
This model is the same as Model # 2 explained above: no preprocessing steps were performed. The main difference, as compared to Model # 2, is the architecture type; Model # 4 used the Skip-gram architecture for training the custom word embeddings. The training parameters used in this experiment were as follows:

• Window size = 5
• Epochs = 50

E.5. Model # 5
This model is the same as Model # 4 explained above; the main difference, as compared to Model # 4, is that the preprocessing steps of normalization, stop word removal and stemming were performed. It used the Skip-gram architecture for training the custom word embeddings. The training parameters used in this experiment were as follows:

• Window size = 5
• Epochs = 50


Table II shows a consolidated summary of the training parameters used in each experiment.

TABLE II. TRAINING PARAMETERS OF EXPERIMENTS

Model # | Architecture | Window size | Pre-Processing | Epochs
1 | Spacy | - | - | -
2 | CBOW | 5 | None | 50
3 | CBOW | 5 | Normalization, Stop words Removal, Stemming | 50
4 | Skip-Gram | 5 | None | 50
5 | Skip-Gram | 5 | Normalization, Stop words Removal, Stemming | 50

The next section describes the results of each experiment and their analysis.
V. RESULTS/EVALUATION

This section analyzes the performance of the five models described above. The aim is to compare the results of the trained models with the already available 'en_core_web_md' model of Spacy when tested on the pre-defined test questions.

A single translation of the Holy Quran is considered at this stage. Any English translation of the Holy Quran could have been selected, but the experiments used the 'Sahih International' English translation because it is widely used and uses simpler vocabulary than the other translations. This translation document is referred to as 'T'; it contains the 6,249 English translated verses of the Holy Quran.

The input query verse 'Q' is fed to the trained Sent2Vec models. The similarity between the vector of query 'Q' and the vectors of all the verses of the Holy Quran contained in 'T' is computed using cosine similarity. The results of the query are sorted in descending order according to the similarity value.
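A minimal sketch of this retrieval step, reusing the `sentence_vector` and `cosine_similarity` helpers sketched in Section III-C; `verses`, a mapping from ayat numbers to preprocessed token lists for translation 'T', is an assumed data structure.

```python
# Score the query 'Q' against every verse in 'T' and return the top matches.
def top_k_similar(q_tokens, verses, model, k=10):
    q_vec = sentence_vector(q_tokens, model)
    scored = [(ayat, cosine_similarity(q_vec, sentence_vector(tokens, model)))
              for ayat, tokens in verses.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending similarity
    return scored[:k]
```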
The results were manually verified, as there was no automated mechanism to check whether the verses retrieved by the system are actually correct. All the retrieved data were manually labeled as correct or incorrect. For each experiment, the top-10 results were manually evaluated; therefore, for five experiments and ten test cases, a total of 5 x 10 x 10 = 500 sentences were manually evaluated for similarity.

For evaluating the overall performance of each model, the precision of the top-10, top-5 and top-3 results is considered and compared.
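The precision computation itself is simple; a sketch is given below, where `labels` holds the manual 0/1 judgments for a ranked result list.

```python
# Precision@k over manually labeled results (1 = correct, 0 = incorrect).
def precision_at_k(labels, k):
    top = labels[:k]
    return sum(top) / len(top)

# e.g. precision_at_k([1, 1, 0, 1, 1, 1, 0, 1, 1, 1], 10)  -> 0.8
```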
Table III shows the top-10 verses of the Holy Quran retrieved by the proposed framework as verses similar to the query verse: "He said, 'Never would I prostrate to a human whom You created out of clay from an altered black mud.'"

TABLE III. TOP-10 RESULTS RETRIEVED FOR VERSE 15:33

Ayat# | Query Verse: He said, "Never would I prostrate to a human whom You created out of clay from an altered black mud."
15:33 | He said, "Never would I prostrate to a human whom You created out of clay from an altered black mud."
15:26 | And We did certainly create man out of clay from an altered black mud.
15:28 | And [mention, O Muhammad], when your Lord said to the angels, "I will create a human being out of clay from an altered black mud."
17:61 | He said, "Should I prostrate to one You created from clay?"
7:12 | "You created me from fire and created him from clay."
38:76 | "You created me from fire and created him from clay."
38:71 | [So mention] when your Lord said to the angels, "Indeed, I am going to create a human being from clay."
23:12 | And certainly did We create man from an extract of clay.
37:11 | Indeed, We created men from sticky clay.
55:14 | He created man from clay like [that of] pottery.

Table IV shows the top-10 precision score of all five models. The score shown in the table is the average, over all test verses, of the manually labeled results (scored as 1 for correct and 0 for incorrect). It was observed that both the Continuous Bag of Words and the Skip-gram models (with and without preprocessing) show superior performance compared with the Spacy model.

TABLE IV. PRECISION (TOP-10) FOR EACH EXPERIMENT

Exp.# | Model (Word Embeddings) | Precision (Top-10)
1 | Spacy (en_core_web_md) | 60%
2 | CBOW (without Pre-processing) | 73%
3 | CBOW (with Pre-processing) | 87%
4 | Skip-Gram (without Pre-processing) | 78%
5 | Skip-Gram (with Pre-processing) | 94%

If only the CBOW models are considered, the preprocessing steps increased the precision from 73% to 87%. If only the Skip-gram models are examined, a similar pattern is observed: the preprocessing steps increased performance from 78% to 94%. This 94% is the best precision achieved by any of the five models.

Table V shows the verse-wise precision scores of each experiment. The top-10 results retrieved in each experiment are used to calculate the precision scores; the scores in Table V are an average of the top-10 answer scores for each model on each test verse.

TABLE V. QUESTION-WISE TOP-10 RESULTS FOR EACH MODEL

Ayat# | Spacy | CBOW | CBOW (with Pre-processing) | Skip-Gram | Skip-Gram (with Pre-processing)
7:109 | 30% | 60% | 100% | 80% | 100%
76:3 | 50% | 10% | 40% | 60% | 100%
2:18 | 50% | 70% | 60% | 80% | 80%
18:107 | 100% | 100% | 100% | 100% | 100%
15:33 | 70% | 100% | 100% | 100% | 100%
2:155 | 30% | 50% | 70% | 30% | 60%
35:3 | 60% | 100% | 100% | 100% | 100%
3:90 | 60% | 90% | 100% | 80% | 90%
2:2 | 50% | 50% | 100% | 50% | 100%
2:43 | 100% | 100% | 100% | 100% | 100%

Top-10, top-5 and top-3 results were also considered to evaluate the performance of each of the five models; the results are given in Table VI. It was observed that Model # 5 outperformed all other models on the top-10, top-5 and top-3 precision values.

TABLE VI. TOP-10, TOP-5 AND TOP-3 PRECISION SCORES FOR EACH EXPERIMENT

Experiment# | Precision (Top-10) | Precision (Top-5) | Precision (Top-3)
1 | 60% | 70% | 80%
2 | 73% | 78% | 86%
3 | 87% | 90% | 93%
4 | 78% | 86% | 90%
5 | 94% | 96% | 97%
Fig. 3. Summary of experimental results (bar chart of the top-10, top-5 and top-3 precision of each of the five models)

Fig. 3 shows a bar-chart plot of the top-10, top-5 and top-3 results for each model. It can be seen from the figure that precision increases from top-10 to top-5 to top-3. It is also observed that Model # 5 achieved the highest precision on all three performance measures.
VI. CONCLUSION

The paper presented ongoing work on developing a text-similarity computing framework for retrieving similar pieces of text. The goal is to develop a generalized framework that can be applied to any text; the corpus used for the experiments in this study is the Holy Quran. The paper applied the Word2Vec technique to learn customized word embeddings from English translations of the Quran, and then created sentence embeddings by taking the mean of the word embeddings of the words in each sentence.

The experiments suggest that the Skip-gram model performed best on the test set, with a best precision score of 94%. However, there is considerable room for improvement. In the future, the plan is to experiment with integrating knowledge-based models with machine learning techniques.
REFERENCES

[1] Akour, Mohammed, Izzat Alsmadi, and Iyad Alazzam. "MQVC: Measuring Quranic verses similarity and sura classification using N-gram." WSEAS Transactions on Computers, 2014.
[2] Basharat, A., D. Yasdansepas, and K. Rasheed. "Comparative study of verse similarity for multi-lingual representations of the Qur'an." Proceedings of the International Conference on Artificial Intelligence (ICAI), WorldComp, 2015.
[3] Buck, C., Heafield, K., and Van Ooyen, B. "N-gram counts and language models from the Common Crawl." Proc. LREC, volume 2, 2014.
[4] Fukushima, Kunihiko. "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position." Biological Cybernetics 36(4): 193-202, 1980. doi:10.1007/BF00344251.
[5] Hamed, Suhaib Kh, and Mohd Juzaiddin Ab Aziz. "A Question Answering System on Holy Quran Translation Based on Question Expansion Technique and Neural Network Classification." JCS 12.3 (2016): 169-177.
[6] Latifi, Majid, Horacio Rodríguez Hontoria, and Miquel Sànchez-Marrè. "ScoQAS: A semantic-based closed and open domain question answering system." 2017.
[7] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781, 2013.
[8] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems, 2013.
[9] Miller, G.A., Beckwith, R., Fellbaum, C.D., Gross, D., and Miller, K. "WordNet: An online lexical database." Int. J. Lexicography 3(4): 235-244, 1990.
[10] Nagoudi, El Moatez Billah, and Didier Schwab. "Semantic similarity of Arabic sentences with word embeddings." Proceedings of the Third Arabic Natural Language Processing Workshop, ACL, pp. 18-24, 2017.
[11] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
[12] Qahl, Salha Hassan Muhammed. "An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measures." Thesis, Rochester Institute of Technology, 2014. https://scholarworks.rit.edu/theses/8496/
[13] Sharaf, Abdul-Baquee M., and Eric Atwell. "QurSim: A corpus for evaluation of relatedness in short texts." LREC, 2012.
[14] Weischedel, R., Hovy, E., Marcus, M., Palmer, M., Belvin, R., Pradhan, S., Ramshaw, L., and Xue, N. "OntoNotes: A large training corpus for enhanced processing." In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, Eds. Joseph Olive, Caitlin Christianson, and John McCary, Springer, 2011.
