
Information Processing and Management 50 (2014) 416–425


A review of ranking approaches for semantic search on Web


Vikas Jindal a,⇑, Seema Bawa b, Shalini Batra b
a School of Computational Sciences, Apeejay Stya University, Sohna 122103 Gurgaon, India
b Computer Science and Engineering Department, Thapar University, P.O. Box 32, Patiala 147004, India

Article info

Article history:
Received 3 April 2012
Received in revised form 14 August 2013
Accepted 18 October 2013
Available online 16 November 2013

Keywords:
Semantic search
Ranking
Ontology

Abstract

With ever increasing information being available to the end users, search engines have become the most powerful tools for obtaining useful information scattered on the Web. However, it is very common that even most renowned search engines return result sets with not so useful pages to the user. Research on semantic search aims to improve traditional information search and retrieval methods where the basic relevance criteria rely primarily on the presence of query keywords within the returned pages. This work is an attempt to explore different relevancy ranking approaches based on semantics which are considered appropriate for the retrieval of relevant information. In this paper, various pilot projects and their corresponding outcomes have been investigated based on methodologies adopted and their most distinctive characteristics towards ranking. An overview of selected approaches and their comparison by means of the classification criteria has been presented. With the help of this comparison, some common concepts and outstanding features have been identified.
© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Web search is a key application of the Web. Present search technologies rely on link analysis techniques that exploit the structure of the Web to determine important documents. At the same time, they rely on simple term statistics to identify documents that are most relevant to a query. Mark-up languages such as (X)HTML are primarily aimed at documents whose content should be interpretable by human readers, and hence focus on document structure and presentation. Little effort is devoted to representing the semantics of the content itself.
The growing availability of structured information on the Web enables new opportunities for information access. Semantically oriented search engines, specifically those that use ontologies as enabling technologies, have gained considerable interest in the last decade. The ever growing amount of ontology-based semantic mark-up on the Web provides an opportunity to start working towards a new generation of open intelligent applications (Motta & Sabou, 2006). Efficient search is one such major envisioned application of this next generation Web, popularly known as the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001).
Current Web search techniques are not directly suited for indexing and retrieval of semantic mark-up. A document is treated as a bag of words, where words or word variants are recognized as indexing terms. Existing semantic mark-up is either simply ignored by many search engines for indexing purposes or not processed in a way that allows the mark-up to be distinguished from other text during the search.
Upcoming Web search is no longer limited to matching the keywords of a query against documents; instead, complex information needs can be expressed in a structured way, with precise and structured answers as results. The kind of search in which a user’s information needs are addressed by considering the meaning of the user’s query as well as the available resources is referred to as Semantic Search (Tran, Haase, & Studer, 2009).

⇑ Corresponding author. Tel.: +91 8295262540; fax: +91 0124 2013125.
E-mail addresses: jindal35@gmail.com (V. Jindal), seema@thapar.edu (S. Bawa), sbatra@thapar.edu (S. Batra).
http://dx.doi.org/10.1016/j.ipm.2013.10.004

Due to the continuing move from data to knowledge and the growing popularity of the Semantic Web vision, there is increasing interest and work in automatically extracting metadata and representing it as semantic annotations on the documents and services of the Web (Shah, Finin, Joshi, Cost, & Mayfield, 2002). It is envisaged that each Web page will carry semantic annotations that record additional details about the page itself. Annotations are based on classes of concepts and the relations among them. The “vocabulary” for the annotations is usually expressed by means of an ontology. The information contained in such an agreed-upon ontology is valuable for determining the relevance of retrieved documents based on “known” facts, relationships or other data. Table 1 compares the features of traditional keyword-based search and semantic-based search across various parameters.
Two elements of an ontology are particularly significant from the “relevant information access” point of view. The first element is the named entities, such as names of persons, objects, countries, places, research articles, artists, and museums. Techniques have been developed for entity-oriented search of documents (Aleman-Meza, Arpinar, Nural, & Sheth, 2010). The second element is the relationships, which give meaning to the entities. The value of such relationships lies in the fact that they are named relationships. Relationships play a vital role in relevant information access as the Web evolves continuously (Sheth & Ramakrishnan, 2007).

2. Motivation for ranking

Many users try to analyze information either by browsing the information space or by using a search engine. Search engine based systems generally locate documents based on keywords. Although they do return documents containing the keywords entered by the user, many of the retrieved documents have very little to do with the user’s needs. The onus lies on users to judge the relevance of the retrieved documents using their own mental model in order to obtain the desired information. Efforts are consistently being made to extend, or identify alternatives to, traditional search mechanisms that find documents using keyword-based approaches. With the advent of the Semantic Web and its enabling technologies, the stage has been set for retrieving relevant documents from massive data sources, thereby assisting information analysis.
The premise of search technologies today is primarily centered on enabling search for entities or other Semantic Web resources. Unlike traditional text-based information retrieval systems, which exclusively retrieve and rank documents, semantic search systems retrieve and rank entities of various types in response to user queries. The semantics of relationships among entities are defined in schema ontologies (e.g., through the domain and range constructs in RDF(S) or OWL). It is increasingly possible to analyze metadata extracted from the Web to discover interesting relationships. Just as document ranking is a critical component of present search engines, the ranking of complex relationships is likely to be an important component of upcoming semantic search engines. However, it is very unlikely that schemes designed for ranking entities (documents, resources, etc.) can be applied directly to ranking complex relationships among entities. Furthermore, the heterogeneous relationships among entities embedded in semantic annotations can be effectively exploited to define ranking strategies for semantically annotated Web pages.

Table 1
Comparison of features of traditional keyword-based search and semantic-based search.

Parameter | Traditional keyword-based search | Semantic-based search
Dataset | Documents | RDF triples, semantically annotated documents
Data organization | Unstructured | Semi-structured
Search orientation | Document-centric | Entity, relationship and semantic document centric
Collection | Bag of words | Bag of (RDF) assertions
Representation | Lightweight syntax-centric models | Ontology-based, more expressive models
Domain of satisfaction | Works well for topical searches | Complex queries are satisfied, more precise answers
Query processing approach | Matching and filtering | Not just matching and filtering but also joining
Scalability | Web scale | Does not scale to massive and heterogeneous Web environment

2.1. Ranking for normal search

One of the most influential and popular ranking models for ordering retrieved documents is PageRank (Page, Brin, Motwani, & Winograd, 1998). It views the Web as a large graph in which pages are nodes and hyperlinks are edges. It has been successfully applied to distinguish the popularity of different Web documents by analyzing the link structure of the Web graph. Not all Web pages containing the same keyword(s) are equally popular; for example, only a few top conferences in a research field are highly important and publish high-quality research papers. To help users quickly locate pages of interest, the popularity of retrieved pages needs to be calculated. The more popular a Web page is, the more likely the user is to be interested in it, and hence the more important that page is. The PageRank algorithm approximates this global importance for a given page. It is based on the intuition that the more random visitors or references a page has, the more popular the page is. In the process, the query context is not taken into consideration when ranking a page for a given query.
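
The random-visitor intuition behind PageRank can be illustrated with a minimal power-iteration sketch. This is an illustration only, not the implementation of Page et al.: the function name `pagerank`, the damping value 0.85, the fixed iteration count and the toy graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank uniformly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): A and B both link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
toy_ranks = pagerank(toy_links)
```

In this toy graph C ends up with the highest score, followed by A and then B, which has no incoming links. The scores depend only on the link structure; no query is involved, which is exactly the limitation noted above.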

2.2. Ranking for semantic search

RDF datasets in the Semantic Web, represented as graphs, have a unique characteristic: heterogeneity of links. An entity is linked to other entities through different types of relationships. For example, in the research domain: paper → cited_by → paper, paper → written_by → author, paper → published_in → conference/journal; similarly, in the health domain: medicine → cures → disease, medicine → manufactured_by → company, medicine → causes → disorder. Hence the popularity of relationships, along with the popularity of entities, should be considered when calculating the popularity score of an entity. Traditional link analysis methods such as PageRank treat all links/relationships as equally important, so applying these methods directly would result in unjustified popularity rankings. Learning-based approaches such as Nie, Zhang, Wen, and Ma (2005) have been proposed to automatically determine the popularity of different types of relationships from a partial ranking of entity objects given manually by domain experts.
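
To make the role of link heterogeneity concrete, the following sketch lets each relationship type carry its own propagation factor, so that an entity passes more of its popularity along some relationship types than along others. It is a simplified illustration under stated assumptions, not the method of Nie et al. (2005): the function name `typed_rank`, the hand-picked factor values and the damping value 0.85 are hypothetical, and no learning step is shown.

```python
def typed_rank(triples, factors, damping=0.85, iterations=50):
    """Popularity propagation over (subject, relation, object) triples.

    factors maps a relation name to a non-negative propagation factor
    (hand-picked here; an assumption of this sketch).
    """
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    n = len(nodes)
    outgoing = {node: [] for node in nodes}
    for subject, relation, obj in triples:
        outgoing[subject].append((obj, factors[relation]))
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            total = sum(weight for _, weight in outgoing[node])
            if total > 0:
                # Split rank in proportion to the relation-type factors.
                for target, weight in outgoing[node]:
                    new_rank[target] += damping * rank[node] * weight / total
            else:
                # No weighted outgoing links: spread rank uniformly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical data using relation names from the research-domain example.
example_triples = [
    ("paper1", "written_by", "alice"),
    ("paper1", "published_in", "conf"),
]
example_rank = typed_rank(example_triples, {"written_by": 1.0, "published_in": 3.0})
```

With these factors the conference receives a larger share of the paper's popularity than the author; setting both factors equal recovers the uniform treatment of links that plain PageRank applies.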
Another aspect is to exploit the relevance of the relationships connecting two entities when searching or ranking documents. An entity can be related to other entities through relationships of varying degrees of importance. Based on the relationships described among entities in a domain ontology, it is possible to find the set of neighboring entities that are important with respect to a seed entity, i.e. the entity that directly matches the user’s query input. The score of a document can then be determined by how many of its annotations belong to this set. Unlike link analysis methods such as PageRank, this does not require documents to be interlinked for the purpose of ranking. It is also a query-dependent approach, which emphasizes finding entities pertaining to the query terms rather than finding the global importance of an entity.
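
The seed-entity idea described above can be sketched as follows. This is a minimal illustration, assuming that each relationship in the ontology already carries an importance value and that a simple cut-off decides which neighbors count as important; the names `important_neighbors` and `rank_documents`, the threshold and the sample data are all hypothetical.

```python
def important_neighbors(ontology, seed, min_importance):
    """Return the seed entity plus neighbors reached via important relations.

    ontology maps an entity to a list of (relation, neighbor, importance).
    """
    related = {seed}
    for _relation, neighbor, importance in ontology.get(seed, []):
        if importance >= min_importance:
            related.add(neighbor)
    return related


def rank_documents(annotations, entity_set):
    """Score each document by how many of its annotations fall in entity_set."""
    scores = {
        doc: len(set(entities) & entity_set)
        for doc, entities in annotations.items()
    }
    # Highest score first; ties broken by document name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ontology fragment and document annotations.
toy_ontology = {
    "semantic_web": [
        ("studied_by", "alice", 0.9),
        ("mentioned_in", "blog", 0.2),
    ]
}
toy_annotations = {
    "doc1": ["semantic_web", "alice"],
    "doc2": ["blog"],
    "doc3": ["alice"],
}
toy_entities = important_neighbors(toy_ontology, "semantic_web", 0.5)
toy_ranking = rank_documents(toy_annotations, toy_entities)
```

Note that the documents are never linked to one another here; the ordering follows entirely from the query-dependent entity set.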

3. Classification of existing ranking approaches

Research on ranking approaches for semantic search on the Web has been broadly classified into three categories according to the stage of ranking. Some distinctive features have also been identified after careful analysis of contemporary approaches. A detailed comparison of the approaches, based on their most distinctive characteristics towards ranking, is presented in Table 2. Rather than discussing the varieties and evolution of selected ideas, a wide spectrum of approaches to ranking for semantic search is presented in order to show the diversity of ideas.

3.1. Entity ranking

In entity-oriented search, the aim is to retrieve results that match the user input, which might directly specify the entity of interest. Let the entity that matches the user query be referred to as the seed entity. This seed entity is associated with a number of neighboring entities in the ontology through one or more relationships. There is also typically more than one path connecting two entities. Hence it is argued that the closeness of two connected entities should be weighted by parameters such as path length and path direction. This necessitates ranking neighboring entities by their closeness to the seed entity, so that a threshold number of entities can be selected for the subsequent search of documents. A survey of approaches to entity ranking is presented below:
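
As a rough illustration of closeness ranking, the sketch below scores neighboring entities by the length of the shortest path from the seed entity and keeps only the k closest. It is a simplification under stated assumptions: path direction is ignored, closeness is taken as the inverse of the hop count, and the name `closest_entities` and the sample graph are hypothetical rather than drawn from any surveyed system.

```python
from collections import deque


def closest_entities(graph, seed, k):
    """Return up to k entities nearest to seed as (entity, closeness) pairs.

    graph maps an entity to an iterable of directly connected entities.
    Closeness is 1 / shortest path length (an assumption of this sketch).
    """
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in distance:
                # Breadth-first search reaches each entity by a shortest path.
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    ranked = sorted(
        (entity for entity in distance if entity != seed),
        key=lambda entity: (distance[entity], entity),
    )
    return [(entity, 1.0 / distance[entity]) for entity in ranked[:k]]


# Hypothetical ontology fragment around a seed entity "paper1".
toy_graph = {
    "paper1": ["alice", "conf"],
    "alice": ["paper2"],
    "conf": [],
    "paper2": [],
}
nearest = closest_entities(toy_graph, "paper1", 2)
```

Here `k` plays the role of the threshold number of entities passed on to the subsequent document search.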
RareRank: This is a domain-specific and query-specific approach for approximating the popularity of a page/resource. While classical IR models rank documents exclusively by content (relevance) and link analysis based methods emphasize link structure (quality), Wei, Barnaghi, and Bargiela (2011) attempt to integrate the two scores, relevance and quality, coherently and with proper tuning of parameters, a step that is often tedious and generally missing. Following a so-called Rational Research Model, they take a knowledge base in the research domain (consisting of instances such as publications, authors, and journals or conferences), represented as a directed graph. A domain topic ontology is then plugged into the graph. The derived ranking score integrates both relevance (using the domain topic ontology) and quality (using citation links).
The model generalizes previous link analysis based methods, which aim only at ranking documents. Hence it can be utilized by semantic search systems for ranking entities. In doing so, it attempts to bridge the gap of domain specificity, which remained unaddressed when PageRank-like algorithms were used for semantic ranking. Although the tuning procedure adopted in the model is simple, there is scope for further work on parameter tuning.

Table 2
Comparison of semantic search approaches based on distinctive characteristics towards ranking.

Approach | Stage of ranking | Domain relatedness of ranking features explored | Query relatedness of ranking features explored | Scope of knowledge base used | Query interface | Result set comparison criterion | Benchmarking
Wang Wei et al. (2011) | Stage I | Domain specific | Query specific | Domain-based knowledge base (research domain) | Keyword based | Accuracy | Original PageRank algorithm
Dali and Fortuna (2011) | Stage I | Domain independent | Query independent | Linked Open Data (LOD) datasets, DBpedia and YAGO | Structured query language based (SPARQL) | Accuracy | PgPop, as referred to by the authors, for the number of visits of a page during a time period (Wikipedia access logs)
Aleman-Meza et al. (2010) | Stage III | Domain specific | Query specific | SwetoDblp ontology created by the authors in Aleman-Meza, Hakimpour, Arpinar, and Sheth (2007) | Keyword based | Accuracy in terms of precision and recall | Absolute matching
Lamberti et al. (2009) | Stage III | Domain specific | Query specific | Domain-based knowledge base (travel domain) | Keyword based | Response time in terms of time complexity, and accuracy | OntoLook (Li et al., 2007) for response time, Google ranking for accuracy
Li et al. (2007) | Stage II | Domain specific | Query specific | Domain-based knowledge base (travel domain) | Form based | Response time in terms of time complexity | Independent evaluation
Castells et al. (2007) | Stage III | Domain independent | Query specific | KIM (Kiryakov, Popov, Terziev, Manov, and Ognyanoff, 2004) based knowledge base along with a corpus of documents from the CNN website | Structured query language based (RDQL) | Accuracy in terms of precision and recall | Keyword-based vector space models
Ning et al. (2008): RSS | Stage I | Domain specific | A combination of query independent and query specific features | Domain-based knowledge base, CiteSeer metadata (computer science research domain) | Structured query language based | Accuracy | PageRank algorithm
Hogan et al. (2006): ReConRank | Stage I | Domain independent | Query independent | Cross-domain knowledge base with 15 M triples | Keyword based | Scalability in terms of the effect of dataset size on ranking computation time | Independent evaluation
Hwang et al. (2006) | Stage I | Domain specific | Query specific | Domain-based bibliographic database | Keyword based | Accuracy in terms of recall | Bibliographic section of each chapter in a textbook (Ramakrishnan and Gehrke, 2003)
Ding et al. (2005): Swoogle | Stage III | Domain independent | Query independent | A self-developed, crawler-based, custom indexed dataset, an adaptation of Sire (Cost, Kallurkar, Majithia, Nicholas, and Shi, 2002) | Keyword based along with content-based constraints | Accuracy | Not mentioned
Aleman-Meza et al. (2005) | Stage II | Domain independent | Query specific | A real-world large ontology dataset, SWETO | Concept based along with an option to customize the ranking criteria by assigning weights to each individual criterion | Accuracy | Human subject ranked paths
Anyanwu et al. (2005): SemRank | Stage II | Domain independent | Query specific | Cross-domain synthetically generated knowledge base | Keyword/resource based along with reference to the query context in the form of search mode | Accuracy | Independent evaluation based on different result orderings as per the user’s needs
Nie et al. (2005) | Stage I | Domain specific | Query independent | Domain-based knowledge base, Libra (research domain) | Not mentioned | Accuracy | PageRank algorithm
Rocha et al. (2004) | Stage I | Domain specific | Query independent | Domain-based knowledge base (two exclusive test beds: research domain ontology and Portinari application ontology) | Keyword based | Accuracy | Human expert approval
Stojanovic et al. (2003) | Stage I | Domain independent | Query independent | Domain-based knowledge base, an institute ontology | F-Logic query | Accuracy in terms of the so-called top_result_ratio evaluated based on subject experts’ input; response time | Independent evaluation

Stage I – entity ranking. Stage II – relationship ranking. Stage III – (semantic) document ranking.

Learning to Rank: The main objective of Dali and Fortuna (2011) is to explore domain-independent and query-independent features for approximating the popularity of a resource in semantic search on the Web. Moreover, it attempts to combine such features to arrive at a more robust ranking model. The PageRank algorithm (Page et al., 1998) is taken as a reference ranking model to determine its applicability to RDF graphs. A set of domain-independent features is combined with an adapted PageRank to achieve higher correlation with the popularity of resources and, eventually, better results. On the one hand, the authors look into ways of combining several features to arrive at a more accurate and robust ranking model; on the other hand, they investigate to what extent the resulting ranking model is applicable to other datasets. DBpedia and YAGO, two popular RDF datasets, along with a set of SPARQL queries, have been used for the purpose. One important finding is that graph-based features lead to bigger inconsistencies than text-based features, which makes it difficult to apply them as a general ranking model. Although the present model, combining text-based and graph-based features, has shown favorable results on the sample datasets, the transferability of the ranking model across Linked Open Data (LOD) datasets representing different domains needs further investigation.
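
One common way to quantify how well a ranking feature correlates with a reference popularity measure is Spearman's rank correlation. The sketch below is an illustration only and is not necessarily the measure used by Dali and Fortuna (2011); the function name `spearman`, the no-ties assumption and the sample scores are hypothetical.

```python
def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score dicts over the same keys.

    Assumes at least two keys and no tied scores within either dict.
    """
    keys = sorted(scores_a)

    def ranks(scores):
        # Rank 1 goes to the highest score.
        ordered = sorted(keys, key=lambda key: scores[key], reverse=True)
        return {key: position for position, key in enumerate(ordered, start=1)}

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(keys)
    squared_diffs = sum((ranks_a[key] - ranks_b[key]) ** 2 for key in keys)
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


# Hypothetical feature scores versus a hypothetical popularity reference.
feature_scores = {"r1": 0.9, "r2": 0.5, "r3": 0.1}
popularity = {"r1": 1200, "r2": 300, "r3": 40}
agreement = spearman(feature_scores, popularity)
```

A value close to 1 means the feature orders resources much as the reference does, while a value close to -1 means it orders them in reverse.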
RSS: A framework for enabling ranked semantic search on the Semantic Web (Ning, Jin, & Wu, 2008), on the one hand, uses the heterogeneity of relationships to determine the global importance of resources. On the other hand, it exploits an extended spreading activation algorithm to retrieve the resources most semantically related to the query, thus facilitating inference while searching. It is observed that search results can be greatly expanded with entities semantically related to the query. Users are thus provided with the most interesting, properly ordered semantic search results. The approach achieves this by combining the global ranking values of the resources with the relevance between the resources and the query. However, the step of assigning edge weights in the schema graph needs to be automated rather than performed manually. In addition, its applicability across domains still needs to be evaluated, although the authors believe the approach to be cross-domain.
ReConRank: SWSE (Harth et al., 2007) is a search engine for searching and retrieving entities and simple knowledge. ReConRank (Hogan, Harth, & Decker, 2006) is the ranking algorithm for SWSE. It uses a three-phase computation for entity prioritization: (i) ResourceRank, (ii) ContextRank, and (iii) ReConRank (a combination of the first two). All of these are variations of the PageRank algorithm applied to semantic graphs. In the first phase, RDF data crawled from the Web is converted into a directed labeled graph (entities are taken as nodes and predicates as labeled links), and the ResourceRank algorithm is applied to compute rank scores of Web resources. In the second phase, context graphs are extracted using the provenance of the RDF data to compute ContextRank scores. In the final phase, an integrated resource-context graph is derived based on predefined rules to produce ReConRank scores, which reflect the importance of a resource as well as its context (based on the provenance data of entities). The evaluation emphasizes scalability, with results reported on a dataset of 15 M triples.
ObjectRank: Hwang, Hristidis, and Papakonstantinou (2006) rank objects in relational databases based on a principle inspired by PageRank. A number of parameters have been proposed to improve the relevance of search results to the keyword-based user query. One is a specificity metric, which measures the keyword relatedness of the resulting objects. The other is a quality metric, which measures the global importance of an object independent of the query keywords. Two more parameters, (i) the importance of results actually containing the query keywords and (ii) the weight assigned to each query keyword, have been explored, although their implementation and validation are not shown. One interesting augmentation is to exploit the domain knowledge related to the given query: a domain ontology graph is proposed to be integrated into the ObjectRank system, which enhances the quality of query results. Although the approach appears promising, it has been tested on a relatively small dataset. It needs to be validated on larger systems, up to the scale of the Web, with an ontology-based approach.
Although PageRank is one of the most popular ranking models for document-level ranking, Nie et al. (2005) reject its validity for ranking Web objects because of the heterogeneous relationships existing between them. Web information about objects relevant to a specific application domain is collected, and these objects are ranked in terms of their relevance and popularity to answer user queries. Unlike Web pages, which are connected to each other to form a so-called Web graph, objects are related to each other through different types of relationships. The authors assert that the traditional PageRank algorithm is no longer valid for object popularity calculation. Therefore a popularity propagation factor is assigned to each type of object relationship. A learning-based approach is proposed to automatically learn the popularity propagation factors for different types of links, based on a partial ranking of objects given by domain experts. It is a good attempt to rank Web objects based on their popularity and the object relationship graph, but query context, one of the significant criteria for ranking, is missing. Moreover, the approach may prove to be domain-independent, as claimed by the authors, but a cross-domain evaluation has not been presented.
A hybrid approach for searching in the Semantic Web (Rocha, Schwabe, & Aragao, 2004) combines full-text search with spreading activation search in an ontology. Search starts with a keyword-based query. The results of the full-text search are instances from the ontology. These instances are used to initiate a spreading activation search in the ontology to find additional instances. As the authors claim, it is not possible to devise a universal formula that proves to be the best for all application domains. It therefore becomes a domain-dependent approach for arriving at a relevant, ordered set of results. Also, all types of relations are considered to have the same relative weight, whereas the weight mapping value for a relation instance could be made context-sensitive, reflecting the relative importance of relation types in a given context.
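
The general idea of spreading activation from the instances returned by a text search can be sketched as below. This is a generic illustration, not the algorithm of Rocha et al. (2004) or of any other surveyed system: the function name `spread_activation`, the decay factor, the firing threshold, the step limit and the sample graph are assumptions chosen for the example.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed instances through a weighted graph.

    graph maps a node to a list of (neighbor, edge_weight) pairs.
    seeds maps the initially matched instances to their starting activation.
    Returns (node, activation) pairs, most activated first.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, value in frontier.items():
            for neighbor, weight in graph.get(node, []):
                passed = value * weight * decay
                if passed < threshold:
                    # Too weak to fire; activation stops spreading here.
                    continue
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + passed
        if not next_frontier:
            break
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return sorted(activation.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical instance graph; "paper1" is the instance matched by text search.
toy_graph = {
    "paper1": [("alice", 1.0), ("conf", 0.4)],
    "alice": [("paper2", 1.0)],
}
activated = spread_activation(toy_graph, {"paper1": 1.0})
```

In this example the second paper is reached through the author even though it never matched the keywords, and it ends up ranked above the weakly linked conference. Giving every edge the same weight corresponds to treating all relation types alike, while relation-specific weights would make the spread context-sensitive.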
Contrary to traditional IR approaches, where the relevance of search results is determined only by analyzing the underlying information repository in the form of content and hyperlink structure, Stojanovic, Studer, and Stojanovic (2003) present an approach that exploits the explicitly shared semantics of the information supported by an ontology. It aims at ranking the search results of a semantic portal. The approach combines the characteristics of the inferencing process and the content of the information repository to determine the relevance and ordering of search results. “Universal” and “user-defined” weights are assigned to each semantic relation, taking into account the context as well as other parameters such as specificity and path length. These weights are combined into a global formula in which the multiplying constants are specified by the user. The specificity of an instance of a relation is higher the less often the instances of the concepts in the relation are present in other instances of relations. In addition, the inference process behind the statements is taken into account for ranking results. Although the approach seems promising for large datasets in any domain, it has been tested only on a small dataset from a particular domain. Query context as a dominating criterion for ranking is missing. The authors themselves emphasize the need to develop task-oriented strategies for calculating relevance.

3.2. Relationship ranking

In relationship-oriented search, one of the vital issues is how to determine the relative importance of the relationships found with respect to a user’s query context. This is important because the number of relationships between two entities is likely to be very large, which may create a more acute information overload problem than currently exists on the Web. It is therefore imperative to develop techniques for ordering search results with respect to relationships, so that the most important results are presented to the user first. Although not all approaches to relationship ranking in the swiftly developing field of semantic search have been examined, it is hoped that the basic ideas have been covered.
OntoLook: The idea of Li, Wang, and Huang (2007) is that, given a graph-based representation of a Web page annotation in which concepts and relations (along with their multiplicities) are modeled as vertices and weighted edges respectively, it becomes possible to define a series of cuts that remove less relevant concepts from the graph. This allows the generation of a so-called candidate relation keyword set (CRKS) to be submitted to the annotated database, which can significantly reduce the presence of uninteresting pages in the result set. The strategy behind OntoLook, as the authors name it, only allows relations among concepts that are presumably less relevant to the user query to be identified empirically. This information is used to reformulate the user query by including only a subset of all the possible relations among concepts, which is later used to retrieve Web pages from the annotated database. Instead of using the whole semantic knowledge base, they use the user query, the page annotations and the underlying ontology. This is expected to reduce the cost of the query answering phase. Because of the decentralized and heterogeneous nature of the Web, it seems impossible for all Web pages to use the same ontology, so even within the same domain, semantic communication among ontologies will be needed. Moreover, the weight of relations in forming the property-keyword candidate set also needs to be considered. The absence of an effective ranking strategy greatly limits user satisfaction. A concrete ranking criterion needs to be in place, although the concepts being explored are query dependent.
In the direction of exploiting the semantics of complex relationships for locating relevant pieces of data on the Web,
Aleman-Meza, Halaschek-Wiener, Budak Arpinar, Ramakrishnan, and Sheth (2005) present a flexible user-centric ranking
approach to identify interesting and relevant relationships in the Semantic Web. The approach is intended to rank the
results of a query involving two entities in terms of interesting semantic associations. Semantic associations are generally
sequences of properties that link various entities. For example, if two entities e1: Person and e2: Person are involved in a
query, the result set contains the semantic associations indicating the different ways in which these two persons are related.
An overall ranking criterion has been developed using semantic metrics (Context, Subsumption and Trust) and statistical
metrics (Rarity, Popularity and Association Length) for ranking semantic associations. Among the semantic metrics, Context is
based on a blend of user perspective and ontological aspect, Subsumption is purely based on the ontological aspect and Trust
is purely based on user perspective. Among the statistical metrics, all three (Rarity, Popularity and Association Length) are
based on ontological aspects such as the number and connectivity of entities and relationships. This is considered a good
approach to relationship ranking, with a wide set of parameters involved in computing rank scores. Although a wide range of
features has been used for ranking, the idea of asking the user to assign weights to the different parameters may not
be practical for a general user.
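
The combination of metrics described above can be illustrated with a minimal sketch. It assumes that each metric has already
been normalised to the range [0, 1] and that the overall score is a normalised weighted sum with user-supplied weights. The
class name, function name and sample values are hypothetical and are not taken from the cited work, which may combine the
metrics differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AssociationMetrics:
    """Metric values for one semantic association, each assumed to lie in [0, 1]."""
    context: float
    subsumption: float
    trust: float
    rarity: float
    popularity: float
    association_length: float


def rank_score(metrics: AssociationMetrics, weights: Dict[str, float]) -> float:
    """Normalised weighted sum of the metrics named in `weights` (illustrative scheme)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * getattr(metrics, name) for name, w in weights.items()) / total


# Two hypothetical associations between the same pair of entities.
short_path = AssociationMetrics(0.9, 0.4, 0.8, 0.2, 0.7, 0.9)
rare_path = AssociationMetrics(0.3, 0.6, 0.8, 0.9, 0.1, 0.4)

# A user who cares mostly about context, then trust and rarity.
weights = {"context": 2.0, "trust": 1.0, "rarity": 1.0}

ranked = sorted(
    [("short_path", short_path), ("rare_path", rare_path)],
    key=lambda pair: rank_score(pair[1], weights),
    reverse=True,
)
```

Changing the weights changes the ordering, which shows both the flexibility of the scheme and the burden it places on a
user who has to choose the weights.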
SemRank: In contrast to contemporary approaches for ranking semantic associations or relationships, which approximate
relevance irrespective of the situation, Anyanwu, Maduko, and Sheth (2005) argue that relevance is situation dependent even
for the same query, and hence some flexibility should be built into relevance models so that different orderings may
be imposed on the same result set for the same query made in different situations. In this work, the user is given the
flexibility to select a search mode, ranging from conventional search to discovery search, according to need. For this purpose,
SemRank exploits a so-called "Modulative Ranking Model" that is capable of taking into account the particular context in
which the query is submitted.
It accounts for the specificity or uniqueness of a result based on the intuition that a commonly occurring association
is more predictable than a rarely occurring one. Discrepancy is accounted for by counting the deviations in a
path description connecting two resources: a path that starts and proceeds along the direction described in the schema layer
may suddenly change to a direction unrelated to the schema because of multiple typing of a resource. Such deviations are not
likely to be anticipated by users and are hence unpredictable. Finally, a semantic match of the results with the keywords
optionally entered by the user along with the query enhances the ranking value. The SemRank formula combines these three
factors to assign a rank to any semantic association, adapting itself as the mode changes. In a nutshell, it presents a
unified ranking model with a blend of semantic and information-theoretic techniques to determine the rank of a semantic
association. Although it provides a flexible ranking approach that offers a variety of result orderings to be chosen as per
the needs of the user, the empirical evaluation was done using a synthetically generated data set rather than existing RDF
data collections. Query context has been taken into consideration for ranking.

3.3. Semantic document ranking

In document-oriented search, a number of documents may be present in the result set, selected on the basis of the most
relevant and complete set of entities of interest along with the most relevant set of relationships existing among those
entities. The documents may further need to be sorted in order of relevance. The concepts/entities and instances in the
ontology are likely to be linked to the documents by means of explicit non-embedded annotations. Annotations may
be assigned a weight that reflects the relative importance of the instance to the meaning of the document. The weight can be
computed, possibly automatically, with an adaptation of the tf*idf algorithm. Alternatively, the relevance score of a document
can be computed as the sum of the weights of the paths from entities spotted in the document to the concepts of the ontology.
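
As a minimal sketch of the first option, the classical tf*idf weighting can be applied to annotations by treating each
ontology instance as a term and each annotated document as a bag of instance occurrences. The function name, the
normalisation of term frequency by the most frequent instance, and the sample data are assumptions made for illustration.
They are not prescribed by the surveyed systems.

```python
import math
from collections import Counter


def annotation_weights(doc_annotations):
    """Return {doc_id: {instance: weight}} using a tf*idf style weighting.

    doc_annotations maps a document id to the list of ontology instances
    annotated in it, with one list entry per occurrence.
    """
    n_docs = len(doc_annotations)

    # Document frequency: in how many documents each instance occurs.
    doc_freq = Counter()
    for instances in doc_annotations.values():
        doc_freq.update(set(instances))

    weights = {}
    for doc_id, instances in doc_annotations.items():
        counts = Counter(instances)
        if not counts:
            weights[doc_id] = {}
            continue
        max_count = max(counts.values())
        weights[doc_id] = {
            # Frequency relative to the most frequent instance, times inverse document frequency.
            inst: (count / max_count) * math.log(n_docs / doc_freq[inst])
            for inst, count in counts.items()
        }
    return weights


# Hypothetical annotations for two documents.
docs = {
    "d1": ["Paris", "Paris", "France"],
    "d2": ["France"],
}
weights = annotation_weights(docs)
```

In this toy collection the instance "France" occurs in every document and therefore receives a weight of zero, while
"Paris" is weighted highly in the only document that mentions it.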
Semantic relationships have been taken as the exclusive resource for ranking documents in (Aleman-Meza et al., 2010).
Unlike many other approaches, this approach does not exploit any specific structure in a document or links between
documents for the purpose of ranking. The relevance of documents is determined using relationships that are known to
exist between the entities in a populated ontology. A measure of relevance is introduced based on the traversal and
semantics of the relationships that link entities in an ontology. The relevance measure relies on the subjective
knowledge of a domain expert, who assigns "low/medium/high" scores to relationship sequences by referring to the
schema of the ontology. Based on these scores, the degree of relatedness of the so-called match entity to the other entities
in the ontology is determined. Finally, the score of a document is determined depending on how many of its annotations belong
to this set of related entities. Although it is a novel approach to exploiting the semantics of relationships for finding the
relevancy of documents, a poorly populated ontology may greatly limit the effectiveness of the semantic annotation step and,
in turn, the retrieval step. Moreover, a domain expert manually assigns the weights to relationship sequences. Although this is
done only once, and without involving the end user, automation of this process is still considered necessary.
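
The procedure outlined above can be sketched roughly as follows, under several simplifying assumptions: the populated
ontology is a small directed graph, the expert scores are stored per relationship sequence, the "low/medium/high" labels
are mapped to arbitrary numeric values, and a document score is the sum of the relatedness values of its annotations.
All names, the numeric mapping and the sample data are hypothetical and do not reproduce the cited implementation.

```python
# Hypothetical numeric mapping for the expert labels.
LEVEL = {"low": 0.25, "medium": 0.5, "high": 1.0}


def related_entities(graph, match_entity, sequence_scores, max_len=2):
    """Return {entity: relatedness} for entities reachable from match_entity.

    graph maps an entity to a list of (relationship, target) edges.
    sequence_scores maps a tuple of relationship names to "low", "medium" or "high".
    Only relationship sequences up to max_len edges long are followed.
    """
    best = {}
    frontier = [(match_entity, ())]
    for _ in range(max_len):
        next_frontier = []
        for entity, path in frontier:
            for relationship, target in graph.get(entity, []):
                sequence = path + (relationship,)
                next_frontier.append((target, sequence))
                level = sequence_scores.get(sequence)
                if level is not None and target != match_entity:
                    # Keep the strongest score over all sequences reaching the target.
                    best[target] = max(best.get(target, 0.0), LEVEL[level])
        frontier = next_frontier
    return best


def document_score(annotations, related):
    """Sum the relatedness of every annotated entity that is in the related set."""
    return sum(related.get(entity, 0.0) for entity in annotations)


# Hypothetical ontology fragment and expert scores.
graph = {
    "alice": [("author_of", "paper1")],
    "paper1": [("published_in", "conf1")],
}
sequence_scores = {
    ("author_of",): "high",
    ("author_of", "published_in"): "low",
}
related = related_entities(graph, "alice", sequence_scores)
score = document_score(["paper1", "conf1", "bob"], related)
```

The sketch also makes the limitation noted above visible: an entity missing from the ontology (here "bob") contributes
nothing to the score, however important it is to the document.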
The idea of a relation-based page rank algorithm is presented in (Lamberti, Sanna, & Demartini, 2009). The ranking criterion
is based on an estimate of the probability that the keywords/concepts within an annotated page are linked to one another in
a way that is the same as (or at least similar to) the one in the user's mind at the time of query definition. This probability
measure is shown to be effectively computed by defining a graph-based description of the ontology (ontology graph), of the
user query (query sub-graph) and of each annotated page containing queried concepts/keywords (both in terms of annotation
graph and page sub-graph). In other words, the ranking of semantically annotated pages is based on the intuition that the
larger the number of relations linking one concept with another in a page, relative to the total number of relations among
those concepts in the ontology, the higher the probability that the page contains exactly the relations desired by the
user, and hence the more relevant the page is to the user query. Only query-dependent features have been explored
for ranking semantically annotated Web pages. The cost of query answering seems to be reduced since only the user query, the
page annotation and the underlying ontology are used in the ranking process instead of the whole semantic knowledge base
with billions of pages. Further efforts will be required to address scalability to future Semantic Web repositories based on
multiple ontologies and characterized by billions of Web pages.
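
The intuition can be illustrated with a strongly simplified sketch. It assumes that the ontology and the page annotation are
both given as lists of (subject, relation, object) triples, that the per-pair probability is the fraction of ontology relations
between two queried concepts that also appear in the page, and that pairs are combined by multiplication as if they were
independent. This is not the published algorithm, which works on graphs and sub-graphs as described above. The function
names and sample triples are hypothetical.

```python
from itertools import combinations


def relations_between(triples, a, b):
    """Set of relation names linking concepts a and b in either direction."""
    return {r for (s, r, o) in triples if {s, o} == {a, b}}


def page_score(query_concepts, ontology_triples, page_triples):
    """Estimate how likely the page carries the relations the user has in mind."""
    score = 1.0
    for a, b in combinations(sorted(set(query_concepts)), 2):
        in_ontology = relations_between(ontology_triples, a, b)
        if not in_ontology:
            continue  # the ontology defines no relation for this pair, so it carries no evidence
        in_page = relations_between(page_triples, a, b) & in_ontology
        score *= len(in_page) / len(in_ontology)
    return score


# Hypothetical travel ontology and two annotated pages.
ontology = [
    ("Hotel", "locatedIn", "City"),
    ("Hotel", "nearTo", "City"),
    ("City", "hasAttraction", "Museum"),
]
page_a = [("Hotel", "locatedIn", "City"), ("City", "hasAttraction", "Museum")]
page_b = ontology  # a page annotated with every relation in the ontology
query = ["Hotel", "City", "Museum"]

score_a = page_score(query, ontology, page_a)
score_b = page_score(query, ontology, page_b)
```

Only the query, the page annotation and the ontology are consulted, which is why the cost per page stays small in this
style of ranking.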
Query-specific and domain-independent ranking features have been explored in (Castells, Fernández, & Vallet, 2007).
The authors propose an ontology-based information retrieval model using full-fledged domain ontologies and knowledge bases
(KBs) to support semantic search in document repositories. Unlike Boolean semantic search systems, full documents rather than
specific ontological instances are returned in response to a user query. Once the list of documents is formed, the search
engine computes a semantic similarity value between the query and each document. The added value of semantic information
retrieval relies on additional explicit information (type, structure, relations, classification, and rules) about the concepts
referenced in the documents, represented in an ontology-based KB. The documents that are annotated with the returned
instances are retrieved, ranked and presented to the user. The performance of the model depends significantly upon the
amount and quality of information within the KB it runs upon. The authors assume that annotations do not describe
all the meaning conveyed by a document in a complete manner. Hence the document retrieval phase leads to an approximate
match. Further, independently developed and maintained cross-domain KBs can be integrated so as to deal with multiple
heterogeneous data sets.
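
A similarity value of this kind can be sketched, in the spirit of the vector space model, as the cosine between a query vector
and a document vector whose dimensions are ontology instances. The sketch assumes that the query vector marks the
instances returned for the query and that the document vector holds annotation weights. The function names and sample
vectors are hypothetical.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse vectors stored as {instance: weight}."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[k] * doc_vec[k] for k in shared)
    norm_q = math.sqrt(sum(v * v for v in query_vec.values()))
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank_documents(query_vec, doc_vectors):
    """Return (similarity, doc_id) pairs, most similar document first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, reverse=True)


# Hypothetical query result and annotated documents.
query_vec = {"Paris": 1.0}
doc_vectors = {
    "d1": {"Paris": 1.0},
    "d2": {"Rome": 1.0},
}
ranking = rank_documents(query_vec, doc_vectors)
```

Because documents are compared through their annotations only, an incomplete KB directly lowers the similarity values,
which is consistent with the dependence on KB quality noted above.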
Swoogle: A prototype of a semantic search engine that helps to search and rank Semantic Web Documents (SWDs) is presented
in (Ding et al., 2004). It follows a query-independent approach to ranking. Four types of semantic
links between SWDs have been explored, and different weights are assigned to them: (i) Imports (A, B): A imports all the
contents of B; (ii) Uses-term (A, B): instead of importing all the terms of B, A uses some of the terms defined by B; (iii)
Extends (A, B): A extends the definitions of the terms defined by B; (iv) Asserts (A, B): A makes assertions about the terms
defined by B. The algorithm for ranking SWDs is termed OntoRank (Ding et al., 2005). The ranking score of an SWD is
computed using an adaptation of the PageRank algorithm. The approach uses domain-independent and query-independent
features for ranking. Query context remains unaddressed.
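
The idea of adapting PageRank to typed links can be sketched as a generic weighted PageRank computed by power iteration.
The link weights below are hypothetical values chosen for illustration, and the code does not reproduce the actual
OntoRank definition given by the cited authors. Rank flowing out of a document is split among its targets in proportion to
the weights of the connecting links.

```python
# Hypothetical weights for the four link types; the cited work may use different values.
LINK_WEIGHTS = {"imports": 1.0, "extends": 0.8, "uses-term": 0.5, "asserts": 0.3}


def weighted_pagerank(links, damping=0.85, iterations=50):
    """Compute rank scores for documents connected by typed, weighted links.

    links is a list of (source, link_type, target) tuples.
    """
    nodes = sorted({n for s, _, t in links for n in (s, t)})
    out = {n: {} for n in nodes}
    for source, link_type, target in links:
        out[source][target] = out[source].get(target, 0.0) + LINK_WEIGHTS[link_type]

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            total = sum(out[source].values())
            if total == 0:
                # A document without outgoing links spreads its rank evenly.
                share = damping * rank[source] / len(nodes)
                for n in nodes:
                    new_rank[n] += share
            else:
                for target, weight in out[source].items():
                    new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


# Hypothetical link structure among four SWDs.
links = [
    ("A", "imports", "B"),
    ("C", "uses-term", "B"),
    ("B", "extends", "D"),
]
ranks = weighted_pagerank(links)
```

The scores depend only on the link structure, which illustrates why such a ranking is query independent.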

4. Discussion

After comparing the surveyed systems by means of the classification criteria, some particular issues have been identified
which are thought to be relevant to efficient semantic search. In this section, these issues are discussed with
the intention of reflecting their potential for further research.

4.1. Heterogeneity

Semantic search systems are supposed to find answers to user queries by directly returning information or knowledge on
entities in an efficient manner. Often, multiple ontologies need to be consulted to satisfy complex
user queries. The search system must be able to search several different domains at the same time. For example, the term
'apple' may occur in a fruit ontology and in a computer ontology. The search system would need to investigate all of these
cases and eliminate the irrelevant ones. If necessary, it might also have to integrate knowledge structured according to
different ontologies. However, it is observed that many of the surveyed approaches rely upon an ontology related to a
single domain for finding the resources relevant to the user query. A novel domain-independent approach towards
finding related entities has been presented by Vechtomova and Robertson (2012). Another major contemporary solution to
this problem is proposed in the form of "Linked Open Data". An extended discussion of related issues regarding open
semantic environments has also been presented by Motta and Sabou (2006).

4.2. Query context

In many of the surveyed approaches, query-independent features have been used for
ranking entities. Although the global importance of a resource plays a vital role in approximating its relevance to the user
query, it is asserted that query context has to be treated as one of the dominant criteria for ranking semantic resources.
However, how to map query context to ontologies with unknown structure remains an open issue. On the other hand,
personalization can be explored from the point of view of efficient context retrieval. The history of a particular user's
browsing patterns, information demands, etc., combined with ad hoc/instant context expression by the user, can be exploited to
approximate the user's intention as closely as possible.

4.3. Portability

The available ontologies often exhibit different conceptualizations of similar or overlapping domains. One of the
challenging tasks for efficient semantic search on the Web is the integration of ontologies with the purpose of building a
common ontology for all Web sources and consumers in a domain. This would allow the system to move across ontologies
without any need for domain-specific reconfiguration. It can be done by detecting semantic relations between
concepts, properties or instances of two ontologies, i.e. ontology matching. This is important not only for
portability across ontologies related to a domain but also as a step towards domain-independent
heterogeneous knowledge bases.

4.4. Query interface

Four modes of user interaction for expressing search intent have been observed: (i) keyword
based, (ii) form based, (iii) natural language based, and (iv) structured query language based (e.g. SPARQL). Query
expressiveness can be enhanced to a great extent using a structured query based approach, but a general user may not be
willing to learn a structured query language. Such a user is comfortable with the keyword based approach, which is easy but
not so expressive, and user intent may not be expressed clearly with it. So a trade-off is required between the ease of the
keyword based approach and the expressiveness of the structured query approach. One solution may be to offer a range
of different modes of search formulation, allowing users to pick the method that best suits their task.

4.5. Scalability

Efficient implementation of semantic search systems in terms of indexing time, index space, and response
time is required to compete with contemporary search engines. Only a small overhead can be tolerated in comparison with
standard search systems. Only a few works have been found that report the performance of semantic search systems
on corpora as large as the Web.

4.6. Evaluation benchmarks

Semantic search systems have started taking shape, ultimately aiming at a human-like interface to the
knowledge and services available on the Web. Despite this, the SW community still has a long way to go in defining
standard evaluation benchmarks to judge the quality of semantic search methods. Systematic evaluation of semantic
search tools involves an appropriate test collection of data and queries, standard performance criteria and independent
judgments of performance, thus supporting performance comparisons between systems. Present approaches to
semantic search evaluation are mostly based on user-centric methods, and are small in scale and difficult to repeat. The SEALS
project (Wrigley, Elbedweihy, Reinhard, Bernstein, & Ciravegna, 2010) seems to be a good initiative towards providing such
benchmarks.

5. Conclusion

In this paper, a number of promising ranking approaches for semantic search on the Web have been presented and
classified in accordance with their stage of ranking. It is observed that, unlike classical IR based search models,
semantic search models involve ranking at three stages: first, Entity Ranking; second, Relationship
Ranking; and finally, Semantic Document Ranking. Two entities are connected to each other by a single relationship or a
chain of relationships. Entity ranking is used to approximate the closeness of one entity to other related entities through
features such as path length, number of paths, and number of common/shared connections. This helps to find the set
of entities relevant to the user query. Ranking of relationships, on the one hand, can be used for finding the
rarity or popularity of relationships depending on the context of search, such as investigative search or conventional
predictable search. On the other hand, it can also be used to find the closeness of one entity to other related entities.
Eventually, the relevance score of a document can be determined based on how many of its annotations belong to such a set of
entities.
During the review process, a number of common parameters for semantic search have been identified which directly or
indirectly influence the ranking process. These parameters have been used as classification criteria in the comparison of
the reviewed search-cum-ranking approaches. In the process, a set of particular issues has been observed which are
thought to have significant relevance to efficient semantic search. These issues have been discussed with reference to
present semantic search approaches, their limitations and potential future trends. It is hoped that a thoughtful discussion
of these issues will further catalyze research efforts in this direction.

References

Aleman-Meza, B., Arpinar, I. B., Nural, M. V. & Sheth, A. P. (2010). Ranking documents semantically using ontological relationships. In Proc. of IEEE fourth
international conference on semantic computing (ICSC) (pp. 299–304).
Aleman-Meza, B., Hakimpour, F., Arpinar, I. B., & Sheth, A. P. (2007). Swetodblp: Ontology of computer science publications. Journal of Web Semantics:
Science, Services and Agents on the World Wide Web, 5(3), 151–155.
Aleman-Meza, B., Halaschek-Wiener, C., Budak Arpinar, I., Ramakrishnan, C., & Sheth, A. (2005). Ranking complex relationships on the semantic web. IEEE
Internet Computing, 9(3), 37–44.
Anyanwu, K., Maduko, A. & Sheth, A. (2005). SemRank: Ranking complex relation search results on the semantic web. In Proc. 14th international conference
on world wide web (WWW 05) (pp. 117–127).
Berners-Lee, T., Hendler, J. & Lassila, O. (2001). The semantic web. Scientific American (pp. 34–43).
Castells, P., Fernández, M., & Vallet, D. (2007). An adaptation of the vector space model for ontology-based information retrieval. IEEE Transactions on
Knowledge and Data Engineering, 19(2), 261–272.
Cost, R. S., Kallurkar, S., Majithia, H., Nicholas, C. & Shi, Y. (2002). Integrating distributed information sources with carrot ii. In: Proceedings of the 6th
international workshop on cooperative information agents VI (pp. 194–201). Springer-Verlag.
Dali, L. & Fortuna, B. (2011). Learning to rank for semantic search. In: Proc. of fourth international Semantic Search workshop located at the 20th international
World Wide Web Conference WWW2011.
Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Peng, Y., et al. (2004). Swoogle: A search and metadata engine for the semantic web. In: CIKM’04 (pp. 652–659).
New York, NY, USA.
Ding, L., Pan, R., Finin, T., Joshi, A., Peng, Y. & Kolari, P. (2005). Finding and ranking knowledge on the semantic web. In Proceedings of the 4th international
semantic web conference, LNCS 3729 (pp. 156–170). Springer.
Harth, A., Hogan, A., Delbru, R., Umbrich, J., ORiain, S. & Decker, S. (2007). Swse: Answers before links! In Proceedings of the semantic web challenge, in
conjunction with ISWC/ASWC, 295, CEUR-WS.org.
Hogan, A., Harth, A. & Decker, S. (2006). Reconrank: A scalable ranking method for semantic web with context. In: Proc. of second international workshop on
scalable semantic web knowledge base systems (SSWS2006).
Hwang, H., Hristidis, V. & Papakonstantinou, Y. (2006). Objectrank: A system for authority-based search on databases. In Proc. of SIGMOD conference (pp.
796–798).
Kiryakov, A., Popov, B., Terziev, I., Manov, D., & Ognyanoff, D. (2004). Semantic annotation, indexing and retrieval. Journal of Web Semantics: Science, Services
and Agents on the World Wide Web, 2(1), 49–79.
Lamberti, F., Sanna, A., & Demartini, C. (2009). A relation based page rank algorithm for semantic web search engines. IEEE Transactions on
Knowledge and Data Engineering, 21(2), 123–136.
Li, Y., Wang, Y., & Huang, X. (2007). A Relation based search engine in Semantic Web. IEEE Transactions on Knowledge and Data Engineering, 19(2), 273–282.
Motta, E. & Sabou, M. (2006). Next generation semantic web applications. In: Proceedings of 1st asian semantic web conference. Beijing.
Nie, Z., Zhang, Y., Wen, J. & Ma, W. (2005). Object-level ranking: Bringing order to web objects. In Proc. 14th international conference on world wide web
(WWW 05) (pp. 567–574).
Ning, X., Jin, H., & Wu, H. (2008). RSS: A framework enabling ranked search on the semantic web. Information Processing and Management, 44(2008),
893–909.
Page, L., Brin, S., Motwani, R. & Winograd, T. (1998). The pagerank citation ranking: Bringing order to the web. Technical Report, Stanford Digital Library
Technologies Project (pp. 1–17).
Ramakrishnan, R. & Gehrke, J. (2003). Database Management Systems (3rd Ed.). McGraw-Hill Book Co.
Rocha, C., Schwabe, D. & Aragao, M. P. (2004). A hybrid approach for searching in the semantic web. In Proc. of the 13th int. conf. on world wide web (WWW),
WWW2004.
Shah, U., Finin, T., Joshi, A., Cost, R. S. & Mayfield, J. (2002). Information retrieval on the semantic web. In: Proceedings of 10th international conference on
information and, knowledge management [November].
Sheth, A. P., & Ramakrishnan, C. (2007). Relationship web: Blazing semantic trails between web resources. IEEE Internet Computing, 11(4), 77–81.
Stojanovic, N., Studer, R., & Stojanovic, L. (2003). An approach for the ranking of query results in the semantic Web. In Proceedings of second international semantic
web conference, (ISWC’03) (pp. 500–516).
Tran, T., Haase, P., & Studer, R. (2009). Semantic search – Using graph-structured semantic models for supporting the search process. LNAI, 5662, 48–65.
Vechtomova, O., & Robertson, S. E. (2012). A domain-independent approach to finding related entities. Information Processing and Management, 48(2012),
654–670.
Wei, W., Barnaghi, P., & Bargiela, A. (2011). Rational research model for ranking semantic entities. Information Sciences, 181(2011), 2823–2840.
Wrigley, S. N., Elbedweihy, K., Reinhard, D., Bernstein, A. & Ciravegna, F. (2010). Evaluating semantic search tools using SEALS platform. In: Proceedings of the
international workshop on evaluation of semantic technologies (IWEST 2010).
