

Generic Hybrid Semantic Search Approach

Reham H. El-Deeb, Abdel Fatah. A. Hegazy and Aly A. Fahmy
Abstract: Information mining is a tedious, repetitive procedure, so continued research in the area is essential. A central theme of earlier research has been whether the search interface is formal or non-formal. In previously published research [1], we proposed a Semantic Form-Based Guided Search System (SFBGSS), which combines the advantages of natural language interfaces (NLIs) and guided semantic search engines. The purpose of that research was to enhance the precision of guided systems by adding a form-based interface. The precision and recall of the SFBGSS were 98% and 93% respectively, outperforming the precision and recall of NLP-Reduce by 38% and 43% respectively. However, we concluded that a generic user interface would be preferable: one that extracts the classes, instances, and properties from the data set, i.e. one that produces the values of its lists at runtime from the data set hierarchy. In this new research, a Generic Hybrid Semantic Search Approach (GHSSA) is proposed that can be applied to any standard data set and that performs data profiling using the five-number summary statistical model. This in turn maximizes the benefits of our search engine and extends its scope of use. The GHSSA was tested on the three Mooney data sets [2]. To set the bar a little higher, we also tested our approach on an Arabic data set, which shows that the approach is multilingual and flexible.

Index Terms: Semantic web, semantic guided search, form-based search engines, data profiling.

R. El-Deeb is with the Arab Academy for Science, Technology & Maritime Transport, Cairo, Egypt. A. Hegazy is with the Arab Academy for Science, Technology & Maritime Transport, Cairo, Egypt. A. Fahmy is with the Faculty of Computers and Information, Cairo University, Cairo, Egypt.

1 INTRODUCTION
The significance of semantic search has led to the emergence of natural language interfaces (NLIs), which let users express their information needs in natural language instead of requiring formal knowledge of ontologies. Ontologies support knowledge sharing and exchange, and thus act as a substantial pillar of the semantic web. Although NLIs have gone through many evolutionary stages, they still cannot generate precise results because they cannot understand the whole query: most NLIs recognize only part of a natural language query. In addition, NLIs give users no information about the resources available to search, which leaves users uncertain about which queries will be answered appropriately. This issue is raised in [3]: "Users need knowledge of what it is possible to ask in a particular domain", and likewise in [4]: "Often, users would attempt to paraphrase a sentence many times when the reason for the system's lack of understanding was due to the fact that the system did not have data about the query being asked." This mismatch is called the habitability problem. Guided systems are another type of semantic search engine. They are not as flexible as NLIs, but they have high precision rates.

Semantic search tools can be divided into four groups according to their user interface approach: keyword-based, view-based, natural-language-based, and form-based systems [5]. Keyword-based systems accept several keywords and generate their equivalent semantic entities. They superficially resemble regular information retrieval systems, but allow users to identify their information needs precisely by interpreting each query term as a semantic phrase. View-based systems support query creation and domain exploration by presenting the ontology structure and allowing navigation through it. Natural language systems translate natural language sentences submitted by the user into ontological queries using various linguistic techniques. Form-based systems guide the user in constructing semantic queries by means of form structure and form controls that reflect the ontology structure. The main difficulty with the form-based approach is scalability to large ontologies, because a scrolling list can hold only a limited number of items and a form can contain only a limited number of controls. These issues may limit the usability of form-based interfaces. That is why we believe that data profiling is essential to overcoming the limitations of form-based systems, as well as to improving data quality and the understanding of data.

Data profiling is one of the most valuable technologies for enhancing data accuracy. It examines the data in an existing data source and gathers information and statistics about that data. Data profiling uses aggregates such as count and sum, as well as descriptive statistics such as minimum, maximum, mean, standard deviation, and variation. Different analyses are performed at different structural levels. For instance, each column can be profiled independently to understand the frequency distribution of its types and values and how the column is used. One data profiling technique is the five-number summary, a descriptive statistic that provides information about a set of observations. It is composed of the five most important percentiles: the sample minimum, the lower quartile, the median, the upper quartile, and the sample maximum. Compared with the mean and standard deviation, the five-number summary is in most cases better, especially for describing a skewed distribution or a distribution with extreme outliers. The mean and standard deviation are reasonable for symmetric, outlier-free distributions, but in real life we cannot always expect the data to be symmetric. It is common practice to report the number of observations (n), mean, median, standard deviation, and range when summarizing data.
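To make the five-number summary concrete, the following minimal Python sketch computes it for a small set of made-up numeric values and contrasts it with the mean. The function name, the sample values, and the choice of the inclusive quartile method are assumptions made for illustration; the quartile convention used by the GHSSA is not stated here.

```python
from statistics import mean, quantiles
from typing import NamedTuple, Sequence


class FiveNumberSummary(NamedTuple):
    minimum: float
    lower_quartile: float
    median: float
    upper_quartile: float
    maximum: float


def five_number_summary(values: Sequence[float]) -> FiveNumberSummary:
    """Return (min, Q1, median, Q3, max) for a sample of numeric values."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    # Assumption: quartiles use linear interpolation over the sample itself
    # (the "inclusive" method); other conventions give slightly different Q1/Q3.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return FiveNumberSummary(ordered[0], q1, q2, q3, ordered[-1])


# Hypothetical values of one numeric relation, with a single large outlier.
sample = [120, 95, 430, 88, 150, 101, 9800, 132, 110]
summary = five_number_summary(sample)
sample_mean = mean(sample)

print(summary)
print(f"mean = {sample_mean:.1f}, median = {summary.median}")
```

In this sample the single outlier pulls the mean above the upper quartile, while the median stays close to the bulk of the values, which is the behavior described above for skewed distributions.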

In the coming sections, we illustrate the core features of the previously mentioned modes of interaction, concentrating on their strengths and weaknesses in terms of usability.

2.1 KEYWORD-BASED SYSTEMS
These systems exploit the availability of explicit semantics to improve the performance of conventional keyword search. Their most important benefit is that end-users can specify queries in a straightforward, familiar manner, without prior knowledge of the ontology's exact vocabulary or structure and without having to master a special query language. The success of the search is determined by how the search algorithm processes queries and by the keyword selection method. The TAP search engine [6], one of the pioneering keyword-based semantic search systems, uses conventional keyword search algorithms. Current tools perform keyword matching at the syntactic level, using string-matching techniques. This makes them domain independent, because they are not tied to domain ontologies, but it also leaves them unable to recognize the information needs of end-users, so they do not always produce successful search results. Keyword-based semantic search should therefore integrate both semantic and syntactic matching, employing domain-specific ontologies and lexical resources such as WordNet to match each user keyword with its semantic equivalents. This was demonstrated in ZOOM and in the distinctive features presented in [7] and [8], which produced interesting semantic matching results.

2.2 FORM-BASED SYSTEMS
All computer users encounter forms in their day-to-day interactions, which makes forms an accepted approach for semantic search interfaces. By having users select query values from lists of valid expressions, form-based interfaces can overcome the mapping issues that arise in other interaction modes. They let the user envision what acceptable searches look like by showing what exists in the domain, and thereby support the user's understanding of it. Form-based interfaces are supported by the Corese library tool, which is a portable tool that can assemble and tune parameterized searches. It has been tested on more than ten diverse domains [9]. However, it does not support exploratory browsing as effectively as Magnet [10], a module of Haystack. Among other approaches, mSpace [11] generates form-like interfaces directly from the domain structure, which helps when the user is introduced to an unfamiliar information space whose structure is unknown to them. On the other hand, it is incapable of searching in a heterogeneous environment.

2.3 VIEW-BASED SYSTEMS
Domain understanding, query building, and query refinement are the key features of view-based search tools. Using the relevant domain ontology, view-based systems structure the view graphically or as a tree in order to illustrate the underlying semantic metadata. One benefit is that the content categorization scheme and the query vocabulary can be presented in intuitive formats, which gives the user a better understanding of the domain. Queries are often built by means of navigation. For example, GRQL [12] lets end-users construct queries at runtime by visually browsing the specified domain ontology. Nevertheless, constructing queries via navigation takes time and can be inflexible. Among other approaches, SEWASIE [13] supports end-users in query construction; however, when the ontology becomes complex, the number of steps needed to construct a query can become large. In Ontogator [14], a multi-facet search tool was created to speed up query formulation by combining keyword search with view-based navigation. View-based tools scale poorly because of their time-consuming interaction. Another limitation is their inability to present views of the domain effectively, which leaves end-users lost in the information space.


2.4 NATURAL LANGUAGE SYSTEMS
What draws end-users to natural language semantic tools is their simple interface and interaction. The main aim of earlier natural language question answering systems was to generate the answer to a query from unprocessed text and to support query development in information retrieval [15]. Integrating semantic mark-up with question answering (QA) systems opens up new possibilities: using semantic information, such QA systems can offer accurate answers to queries posed in natural language. What distinguishes these systems is that they allow queries with several parameters to be formulated more flexibly than in the other systems mentioned above, without requiring end-users to know any query language. An example is AquaLog [16], which is ontology-based and portable.

However, semantic QA systems also have constraints. They give end-users no hints about the domain being queried, so end-users must already be acquainted with the domain in order to create valid questions. In other words, these systems do not help the user understand the domain, much like keyword search systems. The GINO system [17] tried to address this limitation by guiding the end-user step by step while a query is created in quasi-English form, so that only suitable queries are passed on. Another example is Orakel [18], which uses two lexicons: a domain lexicon and a generic, domain-independent lexicon. The domain-independent lexicon consists of English words such as the question pronouns "when" and "how". The domain lexicon is produced on the fly from each knowledge base and thus differs from one application to another. Natural language systems have a further limitation: the flexibility of constructing a natural language query with numerous query parts comes at the cost of the ability to interpret the whole query, because only a limited number of phrases can be parsed correctly. Therefore, not all complex parameterized queries can be interpreted. On the other hand, natural language systems scale well to large ontologies.

The main intention of this research is to offer a generic hybrid semantic search approach (GHSSA) that accomplishes the following goals:
- Enlarging the scope of use and maximizing the benefits of the SFBGSS so that it can be used with any standard English data set on a data-set-independent basis.
- Displaying different comparative operands according to the data type of each relation.
- Performing data profiling on data values for a better user experience.
- Validating the robustness of this novel approach using an Arabic data set, which adds a new dimension by making our approach multilingual and flexible.

To achieve these objectives, the GHSSA algorithm shown in Fig. 1 is proposed.

Fig. 1. GHSSA Algorithm.

The GHSSA is a system that does not require end-users to have prior knowledge of the components of the domain, to be skilled in any formal search language, or to be acquainted with the structure of the knowledge base. This is unlike natural language interfaces, which allow only end-users who already have an idea of the domain to question the semantic knowledge base. Each top-level class must be a subClassOf the topmost generic class Thing, which is the superclass of every OWL class [OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004]. All of these sibling classes must be at the same level of generality in the class hierarchy, and the main class is the one with the highest number of relations.

Rather than using only one specific data set as in SFBGSS [1], a library of Mooney Natural Language Learning data [2] containing four data sets is used in GHSSA, as follows:
- Geo Query Data: data for parsing queries about a simple U.S. geography database.
- Restaurant Query Data: data for parsing queries about a database of restaurant information in Northern California, which was previously used in SFBGSS.
- Jobs Query Data: data for parsing queries about job announcements posted in the newsgroup.
- Geo Query Data in Arabic: data for parsing queries about a simple U.S. geography database, translated into Arabic.

These data sets were chosen specifically because they have been used in most of the related work in this area, which is why they are considered a credible and standard source. Using them also makes comparison with other systems easier. Each data set contains a collection of English queries in addition to the domain knowledge base.

Fig. 2. Geo Data Set Layout.
Fig. 3. Jobs Data Set Layout.
Fig. 4. Restaurant Data Set Layout.
Fig. 5. Geo Data Set Layout Arabization.

The following eight figures (Figs. 6 to 13) are screenshots of the GHSSA with the different data sets. Each pair illustrates a user query and the corresponding system response.

Fig. 6. GHSSA Geo Data Set.
Fig. 7. GHSSA Geo Data Set Result.
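As a rough illustration of what a generic form like the one in these screenshots has to derive from a data set at runtime, the sketch below lists the distinct values of each relation and picks comparative operands according to whether the values are numeric. The operand sets, function names, and sample facts are hypothetical; the actual operands and type detection used by the GHSSA are not specified here.

```python
from typing import Dict, List

# Hypothetical operand sets, chosen only for illustration.
NUMERIC_OPERANDS = ["=", "<", "<=", ">", ">="]
TEXT_OPERANDS = ["is", "contains"]


def is_number(value: str) -> bool:
    """Return True if the string can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def operands_for(values: List[str]) -> List[str]:
    """Treat a relation as numeric only if every one of its values is numeric."""
    if values and all(is_number(v) for v in values):
        return NUMERIC_OPERANDS
    return TEXT_OPERANDS


def build_form(relations: Dict[str, List[str]]) -> Dict[str, dict]:
    """Derive, per relation, the operand list and the selectable values."""
    form = {}
    for name, values in relations.items():
        form[name] = {
            "operands": operands_for(values),
            "choices": sorted(set(values)),  # distinct values for a drop-down
        }
    return form


# Made-up facts loosely in the spirit of a geography data set.
relations = {
    "capital": ["austin", "sacramento", "austin"],
    "population": ["1200", "350"],
}
form = build_form(relations)
for name, control in form.items():
    print(name, control["operands"], control["choices"])
```

Because both the value lists and the operands are computed from the data, nothing in this sketch is tied to one data set, which is the property the generic approach aims for.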


Fig. 8. GHSSA Jobs Data Set.
Fig. 9. GHSSA Jobs Data Set Result.
Fig. 10. GHSSA Restaurant Data Set.
Fig. 11. GHSSA Restaurant Data Set Result.
Fig. 12. GHSSA Geo Data Set in Arabic.
Fig. 13. GHSSA Geo Data Set in Arabic Result.

The GHSSA was tested on the three Mooney data sets mentioned above and then compared with the NLP-Reduce NLI [19]. Fig. 14 shows the precision and recall of both systems. Fig. 15 shows the total precision and total recall of the SFBGSS (phase 1) [1] and the GHSSA (phase 2), compared with NLP-Reduce in both phases.

Fig. 14. Comparison between the precision and recall of the 3 Data Sets.
Fig. 15. Comparison between the precision and recall of phase 1 and phase 2.

In the past few years various semantic search engines have emerged, each implementing a different approach. The purpose of this research was to enhance the precision of guided systems by adding a form-based interface. The GHSSA succeeded in implementing a generic hybrid semantic search engine that overcomes two limitations of natural language interfaces: the habitability problem, by providing the user with the data values in the domain, and the limit on the number of query parts in a phrase that can be interpreted correctly, by displaying all existing relations so that the user can choose as many as required. To address the scalability limitation of form-based search engines, we profile the data using the five-number summary statistical model, which can be applied to any range of values no matter how large it gets. The GHSSA also includes features implemented in previous semantic search engines, such as portability, as well as new features that, as far as we know, have not been implemented in other semantic search engines, such as displaying different comparative operands according to the data type of each relation and querying an Arabic data set.

REFERENCES
[1] Reham Hesham El-Deeb, Abdel Fatah A. Hegazy, Aly Aly Fahmy. Semantic Form-Based Guided Search System. (2012). In the 22nd International Conference on Computer Theory and Applications. Alexandria, Egypt.
[2] L.R. Tang, R.J. Mooney, Using multiple clause constructors in inductive logic programming for semantic parsing. In: 12th Europe. Conf. on Machine Learning, Freiburg, Germany. 2001, pp. 466-477.
[3] A. Bernstein, & E. Kaufmann, GINO - A Guided Input Natural Language Ontology Editor. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006). Athens, Georgia, 2006, pp. 144-157.
[4] A. Bernstein, E. Kaufmann, & C. Kaiser, Querying the Semantic Web with Ginseng: A Guided Input Natural Language Search Engine. In Proceedings of the 15th Workshop on Information Technology and Systems (WITS 2005). Las Vegas, NV, 2005, pp. 45-50.
[5] Victoria Uren, Yuangui Lei, Vanessa Lopez, Haiming Liu, Enrico Motta, Marina Giordanino. (2007). The usability of semantic search tools: a review. The Knowledge Engineering Review. (pp. 361-377).
[6] Guha, R., McCool, R. & Miller, E. 2003 Semantic search. In 12th International Conference on World Wide Web. pp. 700-709.
[7] Mihalcea, R. & Moldovan, D. 2005 Semantic indexing using wordnet senses. In Proceedings of the ACL-2000 workshop on Recent advances in natural language processing and information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics. pp. 35-45.
[8] Buscaldi, D., Rosso, P. & Sanchis Arnal, E. 2005 A WordNet-based query expansion method for geographical information retrieval. In CLEF 2005 Workshop at GeoCLEF 2005. Vienna, Austria.


[9] Corby, O., Dieng-Kuntz, R., Faron-Zucker, C. & Gandon, F. 2006 Searching the semantic web: approximate query processing based on ontologies. IEEE Intelligent Systems 21(1), 20-27.
[10] Sinha, V., Karger, D. R. 2005 Magnet: supporting navigation in semistructured data environments. In 2005 ACM SIGMOD International Conference on Management of Data. Baltimore, Maryland, ACM Press, pp. 97-106.
[11] schraefel, m.c., Wilson, M., Russell, A., & Smith, D.A., 2006 mSpace: improving information access to multimedia domains with multimodal exploratory search. Communications of the ACM 49(4), 47-49.
[12] Athanasis, N., Christophides, V. & Kotzinos, D. 2004 Generating on the fly queries for the semantic web: the ICS-FORTH graphical RQL interface (GRQL). In 3rd International Semantic Web Conference (ISWC04). Hiroshima, Japan, pp. 486-501.
[13] Catarci, T., Di Mascio, T., Franconi, E., Santucci, G. & Tessaris, S. An ontology based visual tool for query formulation support. In 16th European Conference on Artificial Intelligence (ECAI04). 2004. Valencia, Spain, pp. 308-312.
[14] Hyvonen, E., Saarela, S. & Viljanen, K. 2003 Ontogator: combining view- and ontology-based search with semantic browsing. In XML Finland 2003, Open Standards, XML, and the Public Sector. Kuopio, Finland, pp. 82-85.
[15] McGuinness, D. 2004 Question answering on the semantic web. IEEE Intelligent Systems 19(1), 82-85.
[16] Lopez, V., Pasin, M. & Motta, E. 2005 AquaLog: an ontology-portable question answering system for the semantic web. In 2nd European Semantic Web Conference (ESWC 2005). Heraklion, Crete, Greece, pp. 546-562.
[17] Bernstein, A. & Kaufmann, E. 2006 GINO - a Guided Input Ontology Editor. In Proceedings of the International Semantic Web Conference. pp. 144-157.
[18] Cimiano, P. 2004 ORAKEL: A natural language interface to an F-Logic knowledge base. In 9th International Conference on Applications of Natural Language to Information Systems (NLDB). pp. 401-406.
[19] E. Kaufmann and A. Bernstein, "How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users?," Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Busan, Korea: 2007, pp. 281-294.
[20] A. Bernstein and E. Kaufmann. Making the semantic web accessible to the casual user: Empirical evidence on the usefulness of semiformal query languages. IEEE Transactions on Knowledge and Data Engineering, under review.
[21] A. Bernstein, E. Kaufmann, C. Kaiser, and C. Kiefer, "Ginseng: A Guided Input Natural Language Search Engine for Querying Ontologies," Jena User Conference, Bristol, UK: 2006.
[22] Esther Kaufmann, Abraham Bernstein, Renato Zumstein, Querix: A Natural Language Interface to Query Ontologies Based on Clarification Dialogs, In: 5th International Semantic Web Conference (ISWC 2006), Springer, November 2006.
[23] C. W. Thompson, P. Pazandak, and H. R. Tennant. Talk to your semantic web. IEEE Internet Computing, 9(6):75-78, 2005.