

Scalable distributed indexing and query processing over Linked Data


Marcel Karnstedt a,*, Kai-Uwe Sattler b, Manfred Hauswirth a

a Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland
b Faculty of Computer Science and Automation, Ilmenau University of Technology, Germany

Abstract
Linked Data is becoming the core part of modern Web applications and thus efficient access to structured information expressed in RDF gains paramount importance. A number of efficient local RDF stores exist already, while distributed indexing and distributed query processing over Linked Data with similar efficiency and data management features as known from traditional database and data integration systems are only starting to develop. Distributed approaches will necessarily co-exist with centralized schemes, as data will be owned by different stakeholders who may not want to provide their complete data sets to a central place. Additionally, central/integrated storage may be prohibited for organizational or legal reasons in certain areas. Only a few attempts to support decentralized schemes exist so far, and they are limited in terms of capabilities, the degree of distribution vs. efficiency, query expressivity, and scalability. To remedy this situation, the approach and proof-of-concept prototype presented in this paper provide a solution for these open challenges. As we argue for widely distributed systems as a possible answer to scalability issues, we first identify and discuss the main challenges, and based on this analysis we propose an approach for efficient and scalable query processing over distributed Linked Data sources, taking into account the latest advances in database technology. Our system is based on a layered architecture that makes use of the advantages of decentralized indexing and query processing approaches, which have been researched and matured over the last decade. Our approach is based on a logical algebra for queries over RDF data and a related physical query algebra to enable optimization on both the logical and physical layers in query processing. The introduced operators and strategies for processing complex query plans make extensive use of parallelism and other optimization paradigms of distributed query processing. Our query processing framework includes a sophisticated cost model to enable cost-efficient query planning and query execution. We extensively evaluate our approach through an experimental evaluation of a real proof-of-concept deployment, which demonstrates the efficiency, applicability, and correctness of the proposed concepts.

Article history: Available online 23 December 2011.
Keywords: Linked Data; Query processing; Distributed storage; Distributed indexing; Structured queries; Decentralized data management

1. Introduction

Information processing on the Semantic Web relies heavily on processing structured data in the form of triples and graphs, i.e., Linked Data. Linked Data sources use the Resource Description Framework (RDF) for encoding (graph-)structured data and simplify the linking of units of information by using URIs to identify them and to refer to them. Processing Linked Data typically involves retrieving, filtering, combining, and inferring data. According to [1], 207 large, open data sources are already available. In addition, legacy or hidden data can easily be exported as Linked Data, resulting in potentially large numbers of sources and a huge amount of data. The number of sources and the amount of data have already started to grow exponentially as Linked Data gains traction in many domains.

The Linked Data paradigm, but also many other data integration efforts, e.g., in e-science and e-government, are examples of public data management, where data of public interest is maintained and accessed by a community. In a community setting, it would be preferable that the costs of the infrastructure were shared as well. So, instead of again building data stovepipes with their proprietary structures and interfaces and varying levels of query processing capabilities, which require significant efforts for access and integration at the technical level, these systems could benefit from a "world as a database" perspective, where all data, irrespective of their internal structures and functionalities, could be accessed and combined in a uniform way. In turn, many users would certainly like to share their own knowledge/data in this way. For instance, in astronomy more than 200 scientific publications have been produced which rely mainly on the opportunity of the Sloan Digital SkyServer to answer database queries [2], which clearly demonstrates the need for an open Linked Data solution.


One of the biggest challenges in realizing this idea of the world as a database for Linked Data is scalability regarding the data volume, the number of participating nodes, and the number of users and queries. Scalability in this sense can best be achieved by scaling out, i.e., adding more nodes to the system, partitioning the data, and distributing the computation/query processing over the nodes. This approach is already successfully applied in the huge data center infrastructures of the Internet, by cloud providers for large-scale data analytics, by grid architectures for distributed computing, as well as by loosely coupled P2P-based systems. It also represents a viable approach for managing Linked Data: Instead of maintaining a separate and autonomous data store (SPARQL endpoint) for each data set, all data sets could be published or at least indexed in a single, global but distributed data store. This would not only provide better utilization of the storage and processing capacities of the nodes, but also simplify the discovery and querying of the various data sources. It is important to note here that this does not necessarily mean giving up the ownership of the data, as long as appropriate mechanisms for granting access rights are provided.

Based on these observations, this paper tries to answer the question of how such a distributed data management system has to be designed to support public data management with Linked Data. Particularly, we address the problem of efficient processing of complex queries on structured data. We present our UniStore system, which is based on a highly efficient distributed hash table (DHT) as the storage layer for a distributed triple store. The triple store allows the user to represent arbitrary structured data in a flexible and extensible way without the requirement of maintaining explicit schema information as required by classical relational approaches. On top of this triple store, UniStore supports distributed processing of queries with similarity, top-k, and skyline operators in addition to the well-known standard query operators. An important feature of our approach is that we do not use the DHT as a data storage only, but also exploit it as a substrate for routing and processing queries in a massively distributed way. This means that nodes participating in UniStore are both storage nodes and query processors, which enables highly efficient and fair parallel processing and storage. Though our approach is based on P-Grid [3], a DHT for loosely coupled networks, it can also be applied to data center solutions based on distributed key-value stores which provide similar functionalities to a DHT, e.g., Amazon Dynamo [4] or Basho Riak [5].

The specific contributions of this paper are:

- Based on an approach for indexing RDF data in a DHT, we present a physical query algebra for a subset of SPARQL. More specifically, we focus on query operators and their efficient, distributed implementation which extend SPARQL with similarity and ranking operators, because this class of queries is particularly important for large-scale and distributed data collections.
- We present M2QP, a strategy for distributed processing of queries that exploits the nature of the DHT substrate in terms of parallelism and routing for shipping subqueries to the relevant peers. Furthermore, we discuss issues of cost-based query planning, plan adaptivity, and plan synchronization to address the dynamicity of the DHT infrastructure.

The remainder of this work is structured as follows.
In Section 2 we introduce our vision of a decentralized system for indexing Linked Data and processing SPARQL queries. We discuss the main challenges we identified and align our contributions with them. The approach of using DHTs as a basis for indexing is reviewed in Section 3.1, before we propose our approach of storing Linked Data in a chosen DHT in Section 3.2 and summarize the resulting architecture in Section 3.3. The focus of the work is on Section 4, where

we introduce the underlying logical and physical query algebra and discuss different implementations of query operators in detail. That section also provides insights into overall query execution and different optimization techniques, such as pipelining and cost-based query planning. Section 5 presents an exhaustive evaluation of our approach based on real-world data and widely distributed setups. Related work is discussed in Section 6, before we summarize and conclude in Section 7.

2. Challenges of large-scale distributed data management

Building a scalable distributed system for processing structured queries raises several challenges, which we can roughly classify into three categories:

1. How to efficiently structure and organize data in massively distributed settings in a scalable way?
2. How to build a robust and practical solution?
3. How to query data and how to process these queries efficiently?

The first question mainly deals with data organization and raises two challenges:

Genericity and flexibility. Structured data is a basic requirement for building information systems. Due to the sheer number of participants and data sources, trying to agree on a common global schema is impractical. Thus, we need a generic and flexible schema for structuring and managing data. This must also be reflected in the way queries are formulated. Users should not need to worry about whether they know specific schema elements or not. RDF is the accepted standard that satisfies these needs.

Dealing with heterogeneity. The goal is to combine data from many different sources, potentially originating from many different domains. As the agreement on a global schema is unrealistic, there is a strong need for techniques resolving heterogeneities on the schema level (different naming and structuring of the same concepts) and on the data level (different representations for the same real-world objects). Thus, the creation of integrated views should happen ad-hoc and user-driven, allowing for a pay-as-you-go integration [6]. A major requirement of a universal storage system is to handle semantics for information-driven integration (e.g., schema correspondences) as standard, first-class-citizen data.

Question (2) touches on the architecture of a large-scale distributed platform and practicability issues. Among others, the main challenges are:

Scalability. While offering many advantages, centralized data management raises several problems, such as a single point of failure (specifically at the network connection level), limited data freshness, reduced resilience against attacks, and so on. There also exist clear economic arguments for decentralization: The load and costs of managing data and maintaining the required services may exceed the capabilities of a single centralized provider in terms of high initial investments and infrastructure costs during the lifetime of a system. Thus, decentralized approaches are a reasonable alternative/complementary technology. But, in order to guarantee performance of the system even in times of high load or churn (i.e., frequent membership changes of participating nodes), a fair distribution of data and load among all nodes is required. Note that this does not argue against cloud or data center solutions; we consider such infrastructures as inherently distributed. Scalability needs to be achieved in terms of the number of participants and the amount of managed data.

Robustness and availability. A main requirement of distributed storage systems is to achieve resilience against node and link failures. One popular way to address this issue is to maintain redundant links and replicas of data objects. But, such approaches


raise several other issues, from the actual degree of replication over appropriate strategies for update propagation to problems of choosing among the set of replicas. In the end, replication is not the one-size-fits-all solution for handling unreliable networks. Participating systems and links between them may fail during processing, some data or network parts may be completely unavailable at query time, messages may get lost, etc. This emphasizes the need for relying on the open world assumption, where no constant result sizes or similar guarantees can be assumed. The open world assumption implies querying in a best-effort manner. This means results are as good as possible. Nevertheless, distributed storage systems should be able to cope with node failures and other problems. Meaningful and significant results at all times and in all situations are a fundamental requirement.

The third question is related to query formulation and query processing with the following challenges:

Expressiveness of queries. The generic and flexible data model motivated above needs to be accompanied by an appropriate query language. For querying structured data, such a query language should support the core features of classical database query languages. Furthermore, the query language has to support special query operators for dealing with large-scale data (e.g., ranking operators), heterogeneities at different levels (e.g., similarity operators), and unknown schemata (e.g., mapping operators). For this purpose, techniques and methods known from information retrieval (IR) [7] need to be adopted. Ranking queries allow the user to get an overview of large data sets, which can be refined iteratively. Similarity queries support the filtering and combination of textually similar data. This helps in handling spelling errors and the like, but also in identifying correspondences. Mapping operators allow for querying schema data, like attribute names and correspondences, and treating this as plain data.

Efficient querying. As we argue above, scalability can only be achieved using a distributed architecture. The same holds true for the actual processing of queries. In most cases the query shipping approach will be much more useful and efficient: (parts of) queries are shipped to where the required data resides and are processed there. This requires distributed implementations of the supported query operators. These distributed operators should be combined with an appropriate adaptive query processing strategy in order to be able to react to changing network conditions, high load, failures, available knowledge, etc. Furthermore, increased parallelism of query processing (up to a certain degree) is fundamental for processing complex queries with satisfying performance and meaningful results, particularly in the presence of failures and changes in the network. Finally, querying should be ad-hoc (like the integration of heterogeneities) and in-situ. This means a result is complete and correct if it contains all data from all participants available in the current situation. For instance, Google uses crawling processes to collect index data in regular intervals. These processes are very long-running, due to the high amount of data available on the Web. Although this will become a problem with the steadily increasing amount of data, it is fine in principle due to the small number of highly dynamic data sources, e.g., news, weather, etc. Such services are known a priori and thus can be prioritized.
In many public data management scenarios we have to face highly dynamic data sources. Thus, the general problem of data freshness and accuracy becomes more important. An intuitive way is to provide and access the data instantaneously at the place where it is being generated, i.e., ad-hoc from and by the participants of the system. In our approach, this can be achieved by storing data directly in a global DHT as the primary data source instead of in individual provider-specific sources which are indexed separately. As a result, any new data is available from the moment it is inserted into the DHT. While this insert

can be initiated by any user or agent, we expect a participating source to initiate it as soon as Linked Data is published.

Guarantees. Query processing should provide exact guarantees for query time and overhead (in terms of messages, hops, etc.). Moreover, information on availability, freshness, completeness, and other advanced features should be provided. But, supporting such guarantees in large-scale distributed systems is an extremely challenging task. Due to the very large scale and the participants' dynamic patterns of use, data and service quality can often be guaranteed only in a best-effort way. Most of the time, this is sufficient for the services and systems we aspire to, exemplified by many services we already use on a daily basis, which follow this approach and still provide meaningful service (e.g., email, DNS). As explained in [8], the amount of possible guarantees decreases with increasing autonomy of the participants. Nevertheless, database-like applications require guarantees. Thus, trade-offs between guarantees and assumed autonomy have to be identified and methods for providing probabilistic guarantees have to be established. The latter should be used in all situations where exact guarantees are not available. This particularly applies to information on the completeness of query results, which cannot be determined straightforwardly in the presence of adaptive and highly parallelized query processing strategies without any global coordination (which would introduce a bottleneck and thus should be avoided).

The UniStore system, which we are going to present in the following sections, addresses most of these challenges in the following way:

Data organization. Issues related to logical data organization are addressed in our work by relying on RDF, or more generally data represented in the form of triples, as the generic way of representing data. On top of the generic triple store, we present techniques for indexing RDF data supporting efficient query processing (Section 3.2). In addition, UniStore provides basic mechanisms for representing mapping information as data which can be exploited in queries; sophisticated reasoning techniques, though, are out of the scope of this paper.

Architecture. The UniStore system is implemented as a DHT-based distributed storage guaranteeing upper bounds for lookup costs and supporting distributed processing of queries (Section 3.1). Furthermore, relying on a DHT infrastructure with built-in replication facilities as well as exploiting stateless, adaptive techniques for query execution allows us to improve robustness (Section 4.6.2).

Query formulation and query processing. UniStore supports SPARQL as the de-facto standard for RDF queries plus extensions for similarity and ranking operations (Sections 4.2.1 and 4.2.2). The implemented query execution strategy follows an adaptive query shipping strategy exploiting the inherent parallelism of a distributed system and the data localization mechanism of a DHT (Section 4.3). Finally, UniStore also addresses the problem of providing guarantees. In [9,10] we have presented a light-weight but effective approach to enable an approximate completeness estimation, which also supports probabilistic guarantees regarding its accuracy.
The idea is based on routing graphs: with each query plan traveling through the network, we ship a very small amount of additional information which reflects the part of a routing graph that this message corresponds to, i.e., the path it took through the DHT, from the query initiator to the node sending a reply. With the help of this information it is possible to estimate the number of outstanding replies. As the underlying DHT supports load-balancing, this can be used to estimate the number of outstanding results. The accuracy of the estimation depends on the number of already received replies (the more replies received, the more information can be used for the estimation) and the degree of adaptiveness. Comprehensive details and a full evaluation can be found in [8].
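As an illustration of the routing-graph idea, consider the following minimal sketch; the data layout is a hypothetical simplification of the bookkeeping in [9,10]. Assume each reply carries its path as a list of (node, fanout) pairs, where fanout is the number of branches the node created when forwarding the plan; every reported but not yet observed branch then implies at least one outstanding reply:

def estimate_outstanding(paths):
    # fanout[n]: number of branches node n reported when forwarding
    # seen[n]: next hops of n that have been observed in replies so far
    fanout, seen = {}, {}
    for path in paths:
        for i, (node, k) in enumerate(path):
            fanout[node] = k
            nxt = path[i + 1][0] if i + 1 < len(path) else "REPLY"
            seen.setdefault(node, set()).add(nxt)
    # lower bound: each reported but unobserved branch hides >= 1 reply
    return sum(max(fanout[n] - len(seen[n]), 0) for n in fanout)

# One reply received: node A branched twice, the replying node B is a leaf.
print(estimate_outstanding([[("A", 2), ("B", 0)]]))  # -> 1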


3. A distributed index

In this section, we discuss the use of DHTs as a basic layer of our architecture. Further, we discuss appropriate indexing approaches for RDF data on top of these systems. Finally, we introduce the layered architecture that we propose.

3.1. DHTs for index management

As has been argued above, distribution is an inherent property of Linked Data. Along with this and the requirements and issues discussed above, peer-to-peer (P2P) systems are a natural choice to address these problems. By taking advantage of the principle of resource sharing, i.e., integrating the resources available at computers in a system into a larger system, it is possible to build applications that scale well up to global size. However, only a few P2P systems are suitable to address the strongly data-oriented challenges of Linked Data: Most P2P systems are optimized towards scalable distributed routing, i.e., given an identifier, finding the node(s) in the system which host this identifier and its associated information (Chord is the typical example), or efficient content distribution (e.g., BitTorrent). In terms of query processing this means that only identity queries (simple lookups) are supported efficiently. Supporting general-purpose SPARQL query processing, however, requires more sophisticated indexing structures.

To meet the requirements outlined in the previous section, we chose the P-Grid [3] P2P system as the basis for the distributed indexing in UniStore, for which an efficient implementation exists (http://www.p-grid.org/). While P-Grid is not in the scope of this paper and has been well-published, we briefly outline the properties and characteristics which made it the system of choice for UniStore. P-Grid builds on the abstraction of a distributed trie. A trie or prefix tree is an ordered tree data structure where each node's key (prefix) is determined by its position in the tree. All the descendants of a node have a common prefix of the string associated with that node. In P-Grid we combine the trie abstraction with an order-preserving hash function to preserve the data semantics, i.e., a < b ⇒ h(a) < h(b). Without constraining general applicability, P-Grid uses a binary trie, i.e., all hash keys in P-Grid are bitwise binary in nature. In contrast to most other DHTs, which destroy data semantics in the hashing process, by using a trie abstraction P-Grid natively supports more complex predicates, e.g., <, >, range queries, etc. The inherent problem of skewed distribution, a problem in any trie, is addressed by a lightweight load-balancing strategy which guarantees logarithmic search performance. Similar to other DHTs, P-Grid replicates each path of the trie among multiple nodes in the system. Despite the fact that P-Grid is built on a hierarchical abstraction, search can start from an arbitrary node and there are no special nodes like a root node. Also, P-Grid is completely self-organizing in the construction of the index, i.e., no coordination is required, and supports the efficient partitioning and merging of indexes, which is essential in dynamic networking environments. Its randomized construction process facilitates very high robustness and availability while providing probabilistic efficiency and scalability guarantees. On the query processing side, P-Grid also includes efficient support for updates and has been extended to provide efficient similarity and top-k query processing along with completeness guarantees. Refs. [11,12] provide detailed discussions of general P2P principles and comparisons of the main system families, which are beyond the scope of this paper.

While in the following sections we describe UniStore as it is currently implemented on top of P-Grid, we only use standard DHT functionality, so our approach is generally applicable to all DHT systems. The difference would be in what functionalities need to be added to the system of choice. Approaches for providing functionalities similar to P-Grid's exist for most DHTs, yet the efficiency and scalability of these approaches varies over a wide range. In the following, where we utilize special functionality that is not provided by all systems, we highlight this and discuss why and for what exact purpose such special features are used. However, these are only options for efficiently supporting special query constructs and processing approaches, and they are again built on top of standard DHT functionality. The basic query processing proposed in this work can be implemented on top of any standard DHT system, although reimplementing the additional features is required to achieve the full efficiency proposed here. The following list summarizes the features that are desirable, but not mandatory:

- support of efficient range queries,
- support of prefix queries,
- load-balancing features.

Assuming these features is not unrealistic. Several modern DHT systems meanwhile support these and other sophisticated capabilities. The motivation is to integrate novel processing paradigms and efficient query processing techniques already on the DHT level. Recently, some DHT systems even support multi-attribute indexing and querying. There are several systems providing all three of the capabilities listed above. Meanwhile, many popular DHT systems support efficient processing of range queries, either by nature [13-15] or by extension [16-18] (if the required data semantics are not kept during hashing). Queried ranges are mapped to according key ranges and these key ranges are looked up accordingly. Thus, range query mechanisms can be applied to all data indexed in the system. They can be used to implement specific operators exceptionally efficiently. For systems not supporting range queries we can at least provide rather simple implementations using standard DHT functionality. The idea is to route to the first peer of the range and afterwards to all peers of the range in sequence.

Overall, our choice of P-Grid as the underlying infrastructure is well justified, as P-Grid supports prefix queries as an underlying concept [19], provides efficient load balancing [3], and includes efficient support for range queries by means of its shower algorithm [13]. In this algorithm the range query is first forwarded to an arbitrary peer responsible for any of the key-space partitions within the range, and then the query is forwarded to other possible partitions recursively using the current peer's routing table. Since the query is split into multiple parallel queries which appear to trickle down to all the key-space partitions in the range, it is called the shower algorithm (a small sketch follows at the end of this subsection). A detailed discussion of this algorithm is provided in [13]. The efficiency of P-Grid in terms of range queries compared to related P2P systems such as [20,21] is also shown by performance evaluation studies such as [22], where P-Grid is among the best performing systems.

In summary, DHTs provide the following advantages with respect to the challenges presented in Section 2:

- They provide the foundation for achieving scalability in terms of participants, data amounts, and query processing.
- The guarantees concerning message and hop complexity are fundamental prerequisites for realizing efficient cost-based query processing at large scale.
- Mechanisms for automated self-organization and/or maintenance algorithms for highly dynamic systems are included.
- Achieving robustness and high availability in unreliable networks is a main feature of these systems.
- Fairness and efficiency are based on solid grounds, due to the (implicit or explicit) load balancing in these systems.
- Certain aspects of privacy come for free, because nodes decide which data they index and how, and nobody has a global view of the system.

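As promised above, here is a minimal sketch of shower-style range routing over a binary-trie key space. The peer table is a simplified stand-in, not P-Grid's actual API; in the real system each recursive step is a message forwarded via the current peer's routing table:

def overlaps(prefix, lo, hi, depth):
    # the partition under `prefix` covers [prefix000..., prefix111...]
    return prefix + "0" * (depth - len(prefix)) <= hi and \
           prefix + "1" * (depth - len(prefix)) >= lo

def shower(peers, prefix, lo, hi, depth):
    # fan the query out to every subtree intersecting the range
    if prefix in peers:  # reached the peer responsible for this partition
        return [k for k in peers[prefix] if lo <= k <= hi]
    out = []
    for bit in "01":
        if overlaps(prefix + bit, lo, hi, depth):
            out += shower(peers, prefix + bit, lo, hi, depth)
    return out

# Example: 3-bit key space partitioned among four peers.
peers = {"00": ["000", "001"], "01": ["010"],
         "10": ["100", "101"], "11": ["111"]}
print(shower(peers, "", "010", "101", 3))  # -> ['010', '100', '101']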

We conclude this discussion of distributed indexing using P2P systems with some organizational remarks. While P2P systems seem to require the constant participation of all data providers in the system along with the requirement to share their resources, we would like to emphasize that this reflects only the most general case. For example, if someone only wants to provide data but does not want to participate in the indexing and query processing, the data can be delegated to another peer who is willing to provide this functionality. Further, a source might provide only a fraction of its whole data for indexing. Forward references can be maintained in the index for enabling (controlled) access to additional data hosted only at the source. This and more complex organizational structures are easy to satisfy with our proposed approach.

3.2. Indexing RDF data

The universal storage proposed in this work supports triple-based models based on RDF [23]. This section explains how UniStore uses DHTs for indexing RDF to meet the requirements for efficient distributed query processing explained in the previous section. A triple t consists of a subject s (referenced by t.s), a property p (also called predicate or attribute, referenced by t.p), and an object value o (referenced by t.o). The RDF model offers several advantages over classical relational database approaches for a wide range of application domains, as several types of (semi-)structured data can be mapped to it. From a data management point of view, the triple model is:

- self-descriptive: a triple knows to which real-world entity (identified by the subject) it refers, and all properties of that entity can be looked up; thus, there is no need for a global data dictionary, which reduces overheads,
- extensible: new data following an arbitrary schema can be added at any time and existing data can be modified by adding, removing, or modifying existing triples,
- universal: data from different application domains can be mixed and enriched by metadata without altering the triple model.

For instance, tuples conforming to a given relational schema can be integrated by splitting them into triples and inserting these triples into the DHT. A tuple looking like

(p1: o1, p2: o2, ..., pn: on)

where pi (i ≤ n, n ∈ ℕ) are the attribute names of the tuple and oi (i ≤ n) refer to the values corresponding to the attributes pi, can be split into triples as follows:

(st p1 o1), (st p2 o2), ..., (st pn on)

The subject st identifies an original tuple and all triples belonging to it. If a corresponding URI does not exist, it may be system generated, based on a primary key, etc. One advantage, among several others, of this vertical approach is that there is no need to explicitly represent NULL values.

A triple is inserted into the DHT using a key k which is determined by applying a hash function h on (parts of) the triple. A crucial question is to which part(s) of the triple h should be applied. We discuss this question in detail below. The combination of a triple and its corresponding hash key is called an index item in the following. Index items can be looked up by searching for the corresponding hash key k. All triples indexed with the same hash key are located on the same peer. Triples with adjacent hash keys are located on neighboring peers (with respect to the topology of the DHT). By inserting triples multiple times, each time using a different hash key, we can build different indexes on top of the data. This allows for providing different access paths to the data managed in the system and is the classical approach followed by database management systems. Which and how many indexes are built depends on the application requirements. It represents a trade-off between query processing performance and storage overhead. We build the following default indexes on triples of the form (s p o):

1. a subject index using h(s), for efficiently querying all triples with a given subject (e.g., for star-shaped queries),
2. a property-object index on the concatenation of p and o, h(p||o), for efficiently processing queries containing filter predicates like p θ c, where θ refers to an operation like <, =, ≥, etc. and c to a constant that is compared to the o values,
3. an object index using h(o), for querying arbitrary values without referencing a specific property.

If the underlying DHT supports prefix queries (as in the case of P-Grid), a lookup operation for h(p) can be used to query for all triples with property p using the property-object index, without the need for an additional index. Indexing the property values directly would result in heavy load on several nodes, as many properties occur very frequently. By indexing the combination of property and object, we distribute all property values over a range of nodes. Still, utilizing prefix queries we can query for all triples with a certain property value. These distributed indexes distribute storage load and processing load in a fair way. The choice of indexes allows for optimizing query processing for different queries and data distributions, as we will show in Section 4.

Our default set of indexes can easily be extended to support expensive operations like string similarity queries in an efficient way. We have included similarity indexes/queries in UniStore, as we understand similarity operations to be essential for scenarios like public data management on top of distributed RDF stores. Similarity can be used to avoid problems introduced by typos and to find correspondences, to name two examples. There exist many different distance functions for expressing the similarity between two strings. One very popular one is the Levenshtein distance [24], also called edit distance edist, which expresses the number of edit operations (delete one character, add one character, change one character) needed to transform one string into another. This distance function is, for instance, very useful to identify typos. Due to its popularity, we decided to support particularly the edit distance in our system. Other similarity measures can be introduced in a similar way.

We build a specialized index based on q-grams. A q-gram is a substring of length q. Thus, one can create |s| − q + 1 overlapping q-grams from a string s (where |s| denotes the length of the string). Among others, Navarro et al. [25] and Gravano et al. [26] propose efficient methods for evaluating the edit distance using such q-grams. The main observation of [25] is that for an edit distance of 1, the sets of q-grams from two strings will differ by at most q. Only these q substrings contain the character affected by the one edit operation. The remaining q-grams correspond to each other. Based on this, [26] introduced count filtering: if edist(s1, s2) ≤ d is true, then s1 and s2 will share at least max(|s1|, |s2|) − q + 1 − d·q corresponding q-grams (d·q is the maximum number of q-grams that can be affected by d edit distance operations). We can utilize this by searching for a q-sample [27] of d + 1 non-overlapping q-grams of s1 in order to find all similar strings s2. If none of the q-grams from this q-sample overlaps with another, none of them can be affected by the same operation, i.e., at least one of these q-grams has to be fully contained in each candidate string s2. All candidate strings s2 found this way have to be finally checked for their distance to s1. Ref. [27] discusses methods for estimating the selectivity of q-grams, which can be used to generate particularly good q-samples that result in only few candidates.

To enable this technique, we split strings into q-grams and index these q-grams, rather than (only) indexing whole strings. This is feasible on both predicate and object level. For each triple t we use the hash keys h(t.p||qg_j(t.o)) for all overlapping q-grams qg_j(t.o) of t.o to index the object level. On predicate level we use h(qg_j(t.p)) for all overlapping q-grams qg_j(t.p) of t.p. Optionally, we can also index on pure object level by using h(qg_j(t.o)). This involves a non-negligible overhead depending on the actual choice of indexed properties, but it can be used to decrease query processing costs significantly. Note that the indexed q-grams have to be overlapping in order to enable arbitrary q-samples during query processing.

For the sake of readability, we use a simplified notation for triples and queries in the following. We omit namespace prefixes, use abbreviated URIs, etc. Consider a (simplified) triple (<123> <title> "mymovie"). Building the q-gram index using all available 3-grams produces the following index items:

h(tit) → (<123> <title> "mymovie")
h(itl) → (<123> <title> "mymovie")
h(tle) → (<123> <title> "mymovie")
h(title||mym) → (<123> <title> "mymovie")
h(title||ymo) → (<123> <title> "mymovie")
h(title||mov) → (<123> <title> "mymovie")
h(title||ovi) → (<123> <title> "mymovie")
h(title||vie) → (<123> <title> "mymovie")
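To make the indexing scheme concrete, the following minimal sketch (hypothetical helper names, hash truncated for brevity) generates the index keys described in this section for a single triple, together with the count-filtering threshold of [26]:

import hashlib

Q = 3  # q-gram length; q = 3 is the default discussed in this section

def h(key):
    # stand-in for the DHT's order-preserving hash function
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]

def qgrams(s, q=Q):
    # all |s| - q + 1 overlapping q-grams of s
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def index_items(s, p, o):
    # (hash key, triple) pairs for the default and q-gram indexes
    t = (s, p, o)
    yield h(s), t              # subject index
    yield h(p + "||" + o), t   # property-object index
    yield h(o), t              # object index
    for g in qgrams(o):        # q-gram index on object level
        yield h(p + "||" + g), t
    for g in qgrams(p):        # q-gram index on predicate level
        yield h(g), t

def count_filter_threshold(s1, s2, d, q=Q):
    # count filtering [26]: edist(s1, s2) <= d implies the strings share
    # at least max(|s1|, |s2|) - q + 1 - d*q q-grams
    return max(len(s1), len(s2)) - q + 1 - d * q

for key, triple in index_items("<123>", "<title>", "mymovie"):
    print(key, triple)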

The q-grams are only used for indexing, but are not stored in the triples explicitly. This is because we only utilize them to decrease routing efforts. If we stored q-grams or references rather than full strings, this would on the one hand decrease storage costs, but on the other hand would require building or querying all candidate strings before finally computing the edit distance. Note that our q-gram approach may also be extended to include positions of the q-grams, which enables position filtering as proposed by Gravano et al. [26]. This is a technique that reduces the set of candidates that have to be compared even further, by taking into account the fact that two q-grams can correspond only if their positions in the strings do not differ by more than d. The position of a q-gram is defined as the position of the q-gram's first character in the original string.

A particular problem are very short strings, as not enough q-grams can be extracted from them to enable the outlined filtering techniques. We conceptually extend strings with q − 1 leading occurrences of '#' and q − 1 trailing occurrences of 'z'. This results in 2(q − 1) additional q-grams for each input string. We focus on q = 3, because we observed that to be a good choice for most situations. This also conforms to theoretical considerations in [28] for approximate string matching with very long strings (q = log_|Σ| |s|, where s is a string over the alphabet Σ).

In summary, we build different indexes to enable different access paths to the triples in the system. These can be used for enabling different processing strategies, as we explain in Section 4. The subject, property-object, and object indexes are beneficial for certain query and filter constructs, as explained above. They introduce a fixed overhead, as each indexed triple is inserted exactly once into each of them. Q-gram indexes are particularly beneficial for string similarity queries, but result in a noticeable storage overhead. The exact overhead depends on the chosen value of q, the length of the strings and URIs in each indexed triple, and the decision on the usage of special characters. With special characters, the overhead increases with increasing q; without special characters, it is vice versa. In Fig. 1, we show the number of generated index items for data from DBPedia and Mondial (see Section 5 for details about these data sets). In the figure, the amount of triples corresponding to the three default indexes is represented by std. As three indexes are built, each input triple results in three index items. 3gpo represents the sum of index items resulting from the q-grams of property and object values, 3go from the object values only. The results for indexes with q = 5 are depicted as well. Further, we differentiate between the inclusion and exclusion of the special leading and trailing characters used for extracting q-grams. The decision whether to accept these additional storage costs depends on the benefit of the q-gram indexes. We will show results highlighting the benefits of the different indexes in Section 5.

Fig. 1. Resulting number of index items for building default indexes (std) and q-gram indexes on DBPedia and Mondial data. (a) With special characters; (b) without special characters.

3.3. General architecture

We propose an architecture based on four layers, as shown in Fig. 2. The darker shading indicates the higher focus that we apply to it. As the starting point, the distribution layer provides basic support for transparently distributing data, query load, and index



structures over the network through the use of an overlay/P2P system. This allows us to achieve basic scalability, location transparency, and logarithmic search complexity, and provides certain guarantees offered by DHTs [29]. The query layer offers storage, indexing, and querying capabilities on local and network data. Several indexes are built using the capabilities of the distribution layer to support efficient distributed query processing, which is provided by a set of database-like operators working on these indexes. The operators implement advanced query processing functionalities by appropriately utilizing the features provided by the distribution layer; no further extensions are required. The semantic layer enables virtual grouping of data and the creation of mappings. Mappings are necessary for establishing semantic relations between different user schemata in order to share data and to integrate data sources. The creation of such mappings is easier if schemata are defined clearly. To support this, technologies from the Semantic Web are exploited in the semantic layer to represent different concepts and the relations between them. The user interface on top of all layers provides transparent access for the user from a local point of view. The system supports SPARQL queries (with a few extensions, such as string similarity and nearest-neighbor functions). This architecture is implemented in a light-weight Java-based system called UniStore [30], which is available bundled with P-Grid upon request. Our prototype also supports other models for structured data, such as relational data or XML data, that can be represented in a triple format.

Fig. 2. UniStore architecture.

4. Query processing

In this section, we describe the query engine deployed on top of the data organization introduced above. First, we briefly introduce the logical algebra used to represent SPARQL queries. Then, we focus on query operators, before we discuss concrete query processing strategies and query planning.

4.1. Query algebra

Queries are formulated using SPARQL [31], the standard for querying RDF data. In this work, we use an abbreviated syntax and a few special but intuitive constructs, such as ordering and ranking based on a nearest-neighbor semantic (see the Appendix for an example). However, our focus is not on the used query language. For several of the used constructs, similar SPARQL 1.0 extensions already exist, while others will be supported by SPARQL 1.1. The proposed algebra is based on the relational algebra and supports the core features of SPARQL. While, as presented here, it does not support the full SPARQL functionality, it in turn extends it by special query constructs like ranking and similarity queries. These extensions are particularly useful for integrating many sources, which are potentially heterogeneous and inconsistent. As such, the main ingredients of queries are triple patterns, filter statements, and variables (identified by a leading ?), which are used to formulate conjunctive queries. A query consists of a SELECT clause for projecting variable values and a WHERE clause containing the triple patterns. The following is an advanced example query containing similarity operations:

select ?v1 ?s1 ?s2 ?c
where {?s1 ?A ?v1 .
       ?s2 ?B ?v2 ; <created> ?c .
       filter (edist(?v1, "Marcel") < 2) .
       filter (edist(?A, ?B) < 3) .
       filter (?v1 = ?v2)}

The query asks for all triples (identified by ?s1, e.g., the URI of a person) with an object value similar to "Marcel" (first filter, similarity extraction). These triples are joined with all triples (?s2, which could be a conflicting URI for the same person) that contain a similar predicate (second filter, predicate similarity join) and the same object value as the ?s1 triples (third filter). Finally, for the latter an additional property <created> is extracted and joined on the values of subject ?s2. This query allows for identifying potentially conflicting URIs (e.g., from two different sources) that refer to the same person identified by the same name in two similarly named properties. One can further restrict the ?s1 triples by, for instance, adding a filter like filter (?A=<name>) or an according similarity expression on ?A.

For actual processing, queries are first transformed into logical query plans on the basis of a query algebra. Table 1 summarizes all logical query plan operators that are currently supported by the algebra and the symbols used throughout this work.

Table 1
Logical query plan operators supported by the query algebra.

Operator        Symbol    Meaning
Projection      π         Project result rows
Extraction      ξ         Extract triples at leaf level
Selection       σ         Filter triple sets
Cross-product   ×         Combine triple sets without a condition
Join            ⋈_cond    Combine triple sets under a condition cond
Natural join    ⋈         Combine triple sets with compatible solution mappings
Union           ∪         Merge triple sets
Grouping        γ         Group triple sets
Top-N           τ         Rank triple sets
Skyline         τ_sky     Rank triple sets using skyline semantics
Expand          Υ         Expand query by resolving correspondences
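To make the natural join of Table 1 concrete, here is a minimal sketch of the compatibility of solution mappings in the sense of [32], with mappings modeled as plain dictionaries (the system itself additionally keeps the underlying triple sets, as explained below):

def compatible(m1, m2):
    # two solution mappings are compatible if they agree on shared variables
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def natural_join(left, right):
    # inner-join semantics: keep merged mappings for compatible pairs only;
    # an outer join would additionally keep unmatched left mappings
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]

left = [{"?s": "<123>", "?v": "mymovie"}]
right = [{"?s": "<123>", "?c": "2011"}, {"?s": "<456>", "?c": "2010"}]
print(natural_join(left, right))
# -> [{'?s': '<123>', '?v': 'mymovie', '?c': '2011'}]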


A special operator that the algebra introduces is the extraction ξ. This operator represents the extraction of triples from the RDF graph that is distributed in the DHT. On the leaf level of query plans we always have extract operators. Each such operator indicates in a subscript the triple pattern s p o it represents. All three parts can either be a variable or a constant. In the case of constants, s and p refer to URIs; o can refer to a URI or a string constant (strings are enclosed in quotes). Each extracted triple has values as specified in the triple pattern. Further, variable references are set as specified in the pattern. In several examples throughout this work, we use the short notation [<A>] for the triple pattern _ <A> ?A. Blank nodes in a query are represented by an underscore _. As usual, blank nodes in queries are handled just as non-distinguished variables. They are integrated into the physical query processing without the need for any special constructs. Blank nodes in the RDF graph have to be replaced by unique constants (Skolemisation). References can be set only when extracting triples in the DHT. The algebra further contains operators for general joins ⋈, selections σ, etc. The query shown above results in the following logical query plan:

[Logical query plan of the example query.]

A problem for query processing are unbound triple patterns, such as two of the ξ operators in the example. Basically, such unbound patterns would require extracting all triples available in the system. Thus, their use is prohibited as long as no implicit binding results from other operators. This is provided in the example (e.g., by σ). We will get back to this in the next section when describing the physical query algebra and the according query processing.

The algebra works on the principle of solution mappings as proposed in [32]. But rather than determining only the solution mappings, all underlying statements are kept and shipped with query plans as sets of triple sets. Variable bindings pointing to the corresponding part of a single triple are used to represent the actual solution mappings for each triple set. This bears several advantages, such as the basis for advanced caching mechanisms, the ability to use query processing optimizations independently of the variables used in the query, and the support of extended information (such as showing the source(s) of a solution mapping, e.g., if the context is stored, as in the quadruple format). As there are no duplicate triples in an RDF graph, each triple set is unique, even if the resulting solution mappings are duplicated. Thus, the semantics of multisets of solution mappings [33] is preserved as well. The semantics of the query operators from [32] are adopted straightforwardly to the notion of triple sets, while respecting existing and resulting variable bindings. For instance, a natural join combines two triple sets and maintains the resulting variable bindings only if the variable bindings for both input triple sets correspond to a compatible mapping (cf. the sketch following Table 1). We denote such a natural join by ⋈, while a join under an arbitrary condition is denoted as ⋈_cond. Further, where appropriate, we use ⋈_s to indicate subject-joins (as used for star-shaped queries) and ⋈_so for subject-object joins (as used for path-shaped queries). Leaf operators extract triples from the DHT; they produce sets of triple sets in which each triple set contains one triple, i.e., triple sets of size 1. Subsequent operators extend these triple sets by joining new triples based on subject, predicate, or object values; filter out triple sets; or modify sets of triple sets (ranking, etc.). One concrete solution mapping is determined from each triple set contained in the input of the final operator, which is always the projection π. Each operator and operation is supported on all parts of a triple. This allows, for instance, querying the schema of the indexed data using similarity operations.

4.2. Query operators

For actual query processing, query plans have to be transformed from the logical algebra representation into a physical one. That is, logical operators are replaced by according physical operators, where one logical operator may correspond to several physical implementations. Each of these alternatives uses different access paths (i.e., indexes) and processing approaches. Before we focus on the actual processing of the resulting query plans in Section 4.3, in the following we explain the different general operator alternatives and those special to processing string similarity, on the basis of the example introduced above.

General strategies. The general approach for processing query plans relies on an extended version of Mutant Query Plans (MQP) [34]. The idea is to have multiple copies of a query plan traveling through the network, where each plan contains the operators to process and the data produced by already processed operators. Like this, processing is stateless and all necessary information is encapsulated in the traveling plans. We will provide more details on this concept throughout the following sections. For now, it is important to understand that a query plan has to be shipped to nodes that can replace one of the unprocessed operators by according result data. This is repeated until all operators are processed and a final MQP, containing (parts of) the query result, can be sent to the query initiator.

Generally, all physical operators rely solely on the functionalities provided by the underlying DHT. The differences arise from the different characteristics of the available indexes and, as usually multiple nodes are responsible for the required data, the choice of sequential vs. parallel processing. Thus, we classify all operators op by the following categories (listed together with the appropriate superscripts for clarity):

- local operator op^LOC: all required data are available locally, e.g., for a join,
- sequential operator op^SEQ: contact all nodes responsible for required data in sequence,
- intra-operator parallel operator op^PAR: contact all nodes responsible for required data in parallel; if we use P-Grid's advanced parallel implementation of range queries, we write op^RQ as a special case.
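Returning to the general strategy: the following is a minimal sketch of the mutant-query-plan principle described above (the structures are hypothetical simplifications; real MQPs carry serialized operator trees together with the produced triple sets):

from dataclasses import dataclass, field

@dataclass
class MutantQueryPlan:
    # a self-contained plan traveling through the network: the operators
    # still to process plus all data produced by already processed ones
    pending: list                                  # unprocessed operators
    results: list = field(default_factory=list)    # triple sets so far
    initiator: str = "node-0"

def step(plan, evaluate):
    # executed at a node responsible for the next operator's data:
    # replace the operator by its result data, then forward the plan
    if not plan.pending:
        return None                  # nothing left: reply to the initiator
    op = plan.pending.pop(0)
    plan.results = evaluate(op, plan.results)  # stateless: all state in plan
    return plan                      # ship to the next responsible node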

Figs. 3 and 4 illustrate the difference for the sequential and the parallel case. We call these illustrations routing graphs. The figures refer to the first filter statement of the example query from above, combined with an additional filter (?A=<A>). There are two first choices for the resulting left-most extract operator based on the p||o-index (remember that this index clusters all data for <A> on nodes in the same subtree):

1. ξ^SEQ, as illustrated in the top line of Fig. 3, which means contacting all nodes responsible for property <A> in sequence (|pr_A| denotes the number of responsible nodes),
2. ξ^RQ, as illustrated in the top line of Fig. 4, which means contacting all nodes responsible for property <A> using the parallel shower algorithm [13] (as outlined in Section 3.1).

Fig. 3. Sequential routing schemes for ξ^SEQ and ξ^QG,SEQ (n: |pr_A|).

Fig. 4. Parallel routing schemes for ξ^RQ and ξ^QG,PAR (n: |pr_A|).

We refer to ξ as a routing operator, as it has to be routed to the node(s) responsible for the correct triples before it can be processed. During query-plan transformation and optimization, the query engine also identifies joins that can be processed as routing operators. This holds, for instance, in the case of star-shaped or path-shaped queries. For these, the optimizer can take the existence of subject and object indexes into account to create alternative query plans. For instance, the top join operator in the query plan in Section 4.1 could be processed by:

1. using the predicate-object index to route the right-side ξ to all nodes responsible for the predicate <created>, and processing the join as a local operator there, or
2. utilizing the subject index to look up all input subjects from the left side, which results in combining ⋈ and ξ into one routing operator ⋈_s utilizing the subject index.

Note that, where required, we use a subscript to explicitly state which part of the input triple sets is used for routing by the operator. Note that, again, both variants can be processed either in sequence or in parallel. The physical query algebra supports inner joins and outer joins (for processing OPTIONAL constructs). The operator implementations differ only in the way they handle triple sets for which no compatible mapping is found: outer joins keep such triple sets, inner joins reject them. Query plans that contain only results (no operators still to process) are sent back to the initiator. The final projection π can be processed at each node that processes the final join, as all required input data is available there, or only at the node that receives all final replies. Due to the MQP-based concept and the applied parallelism, this can result in multiple independent query replies. Details follow in Section 4.3.

The introduced q-gram index provides an alternative especially suited for similarity operators. As introduced in Section 3.2, if we can find d + 1 q-grams that are not shared between two strings, we know for sure that the distance between both is larger than d. Thus, the strategy on this index is to query for d + 1 q-grams qg_i using h(A||qg_i), which, again, can be done in sequence or in parallel. This approach uses the input of the operator (in the case of ξ the input is the search term), which can be compared to the principle of a hash-based join known from relational databases. Therefore, we refer to according physical operators as hash-based operators. In the special case of using the q-gram index, we denote that by op^QG,SEQ and op^QG,PAR. This approach is illustrated in the second lines of Figs. 3 and 4. If we refer to the original query introduced above (without the additional filter), another advantage of the hash-based approach for similarity queries becomes obvious. The search term provides the implicit binding for the originally unbound query term. To process the resulting operator, we can replace h(A||qg_i) in the figures by h(qg_i), which utilizes the q-gram index on only the values. There is no option to process this operator on the p||o-index without having to contact all nodes in the DHT. Note that, in contrast, an exact match on the value could be processed on the o-index. Each implemented operator can be assigned to one of the operator classes introduced here. Before we explain this in more detail on the example of the join operator, we highlight that the following cases are covered by the discussed alternatives as well:

- exact match on property name and object value, which results in a single key lookup (no difference between the operator classes),
- exact or similarity match on a set of search terms (e.g., with an IN keyword used in the query), which can be processed with the same approaches while handling the different search terms either in sequence or in parallel,
- no constrained value, which can be handled using the p||o-index (P-Grid's prefix query support); this is also possible for similarity constraints on the property name (i.e., no explicit differentiation between predicate and object level is required),
- all the cases work also for (additionally) constraining the subject ID, whereas we regard this as a rather unlikely case (thus, we do not build q-gram indexes on subjects),
- no constraints (implicit or explicit) at all result in an unbound triple pattern, which is prohibited by the system.

4.2.1. Similarity joins

In addition to a similarity extraction, the example query introduced above also contains a similarity join. Again, the right side of the join results from an unbound triple pattern. If the right side were bound as well (e.g., by using a filter like filter (?B=<person>) rather than filter (edist(?A,?B)<3)), an operator ξ^SEQ or ξ^RQ could be used to extract all triples for property ?B. We illustrate the resulting chain of query processing using a local join ⋈^LOC in Section 4.3. In the example, the binding is again provided implicitly by the triples resulting from the left ξ operator. To highlight this, we denote the resulting operator as a combination of both the join ⋈ and the extract operation ξ. There are two options to exploit the implicit binding:

1. use the constraint ?a=?b to look up all join partners using the object value index and an operator ⋈^HASH,PAR ξ or ⋈^HASH,SEQ ξ; the similarity between ?A and ?B can be checked at each responsible node,
2. use the constraint edist(?A,?B) to look up all join partners using the q-gram index and an operator ⋈^HASH,PAR ξ^QG,PAR, ⋈^HASH,PAR ξ^QG,SEQ, ⋈^HASH,SEQ ξ^QG,PAR, or ⋈^HASH,SEQ ξ^QG,SEQ; the constraint ?a=?b can be checked at each responsible node.

As an example, we focus on the q-gram variants. All four mentioned alternatives are illustrated in Fig. 5. The according physical operators represent a combination of the hashing approach and the q-gram approach. Each triple from the left side acts as an input string for the join. From each of these strings we extract d + 1 q-grams.

As the input strings can also be processed in sequence (⋈^(HASH,SEQ)) or in parallel (⋈^(HASH,PAR)), this results in a doubled choice between sequential and parallel processing, which is denoted accordingly.

Algorithm 1: Q-gram-based similarity join ⋈^(HASH,PAR)_(edist(?A,?B)≤d ∧ ?a=?b) μ^(QG,PAR)_<B>

Input: variable names ?A, ?a, left side input T, distance d [, property name A, triple set ts]
 1  if A = NULL then
 2    forall ts ∈ T do
 3      resolve property name A for variable ?A in ts;
 4      determine d + 1 q-grams Q from A;
 5      forall qg ∈ Q do
 6        forward(qg, A, ts);
 7      end
 8    end
 9  end
10  else
11    forall local triples t : ||t_p| − |A|| ≤ d do
12      resolve object value a for variable ?a in ts;
13      if edist(t_p, A) ≤ d ∧ t_o = a then
14        add-to-result(ts ∪ {t});
15      end
16    end
17    signalize DONE;
18  end
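To make the interplay of the two sides of Algorithm 1 concrete, the following Python sketch emulates both on a single machine. It is a minimal sketch only: the dictionary standing in for the q-gram index, the function names, and the example data are all illustrative assumptions; in the real system the d + 1 lookups are routed through the DHT via h(<B>||qg).

    def edit_distance(s1, s2):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            cur = [i]
            for j, c2 in enumerate(s2, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (c1 != c2)))
            prev = cur
        return prev[-1]

    def pieces(s, d):
        # Split s into d + 1 disjoint q-grams: d edit operations can destroy
        # at most d of them, so a string within distance d shares at least one.
        k = d + 1
        bounds = [round(i * len(s) / k) for i in range(k + 1)]
        return [s[bounds[i]:bounds[i + 1]] for i in range(k)]

    def local_check(candidates, a, d):
        # Cf. lines 11-17 of Algorithm 1: length filtering [26] first,
        # then the exact edit-distance verification.
        return [t for t in candidates
                if abs(len(t) - len(a)) <= d and edit_distance(t, a) <= d]

    # Emulated q-gram index of the remote side (invented data).
    index = {"per": ["person", "persons"], "son": ["person", "poison"]}
    a, d = "person", 1
    candidates = set()
    for qg in pieces(a, d):                  # initiator: d + 1 parallel lookups
        candidates |= set(index.get(qg, []))
    print(local_check(sorted(candidates), a, d))   # -> ['person', 'persons']

Note how "poison" survives the q-gram lookup and the length filter but is rejected by the final verification, which mirrors the candidate-then-verify structure of the distributed operator.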

Fig. 5. Different routing schemes for ⋈^HASH with d = 2.

The complete algorithm for processing ⋈^(HASH,PAR)μ^(QG,PAR) is shown in Algorithm 1. This algorithm is processed by each node that receives an according query plan. The optional parameters A and ts are used to differentiate between the nodes that initiate the operator (lines 2-8) and those actually responsible for the queried q-grams (lines 11-17). Note that in line 11 we also use length filtering [26]: if edist(s_1, s_2) ≤ d is true, |s_1| and |s_2| cannot differ by more than d. Again, we should note that the described concepts are applicable for several different query constraints:

• similarity on the object values while property name ?B is constrained: use either the property index or the q-gram index on values (h(<B>||qg) instead of h(qg) in the algorithm),
• similarity on property names and object values,
• exact match on property names and object values,
• no constraint on the property names or object values: use only the object value index or only the property index (exact and similarity match); note that, if only the property names are constrained, this corresponds to a cross product between two properties as known from the relational algebra,
• nothing constrained: forbidden (this would correspond to a cross product between all triples in the system).

We support a wide range of operators known from the relational algebra and a few special operators, such as the extraction μ. All operator implementations correspond to the introduced classes and exploit the principles outlined in the previous sections. However, there are special operators that utilize the same indexes, but are based on more sophisticated processing concepts. We briefly introduce concepts for such operators from one of the most important classes, namely ranking operators, in the next section.

4.2.2. Ranking operators

Ranking operators are crucial and tailor-made for heterogeneous large-scale data collections. They allow for focusing on the most interesting results in the presence of large amounts of relevant data and many sources. For the user, they provide a way to gain an overview of the resulting huge data collections. We focus on two specific classes of ranking operations, which are known to be the most popular and powerful ones: top-N ranking and skylines. Top-N queries retrieve only the N most interesting results with respect to a ranking function based on one or multiple dimensions. We support such ranking functions on numerical as well as string values. Supported ranking functions include minimization, maximization, and nearest-neighbor ranking with respect to a provided value. If based on multiple dimensions, the single dimensions are either combined and optionally weighted in the ranking function, or used hierarchically (i.e., determine the top N results for attribute 1, rank equal values in that set on attribute 2, and so on). This is not always appropriate. For instance, for decision-making processes and recommendations, it is often desired to regard all ranking attributes as equally important. Predefining a weighting of the attributes is in many cases not straightforward; rather, one is interested in all choices that are not dominated by any other choice in any dimension. That is exactly what a skyline query computes [35]. The general operators introduced above are suited to compute both top-N and skyline queries. However, this requires to first collect all relevant data and afterwards rank and filter it, either at the query initiator or at an intermediate synchronizing node. Both query types can be computed more efficiently when they are
backed by according advanced operators. Still, these operators only use the functionality provided by the underlying DHT. In Fig. 6 we illustrate this for the top-N operator u. The example in Fig. 6(a) shows the general principle of computing the result for a top-N nearest-neighbor query, where m refers to the center of the queried range. The node responsible for m can be determined by a lookup operation supported by the DHT. The idea is to guess a range r spanning from m that contains the N queried objects. As all nodes responsible for r are located in the same neighborhood, querying the range can be done efficiently. Guessing r is based on the assumption of load balance, which is, for instance, an integral part of the P-Grid system. Using the supported range-query mechanisms, all nodes in the range can be queried either in parallel or in sequence. After collecting all temporary results at a dynamically chosen synchronization node, that node has to decide whether N objects are already determined or whether r has to be adapted. If adaptation is necessary, it is again based on the density of values over the already queried nodes. Following this approach, the required N objects can usually be determined in very few iterations (with perfect load balancing in only one iteration). Encapsulating this process into a special operator allows for including the top-N operation arbitrarily in a query plan. Fig. 6(b) exemplifies the process for numerical data. First, a range of nodes comprising three further nodes is determined and queried (solid arrows). In a second iteration, the N required objects are retrieved by querying three more neighboring nodes (dashed arrows). Fig. 6(c) shows the difference for string data based on the introduced edit distance and q-grams. In this case, due to the hashing of the q-grams, the responsible nodes may not be located in a closed neighborhood, but querying involves only single lookups. Due to the q-gram approach, extending the range usually involves fewer nodes than for numerical data. Note that this process is also possible if m denotes the maximum or minimum of an attribute. Both can be computed efficiently by the DHT functionality, and the resulting range has to be extended to only one side.

A special skyline operator U can use a related approach [36]. The idea of the FrameSkyline is based on the observation that the search space for a skyline query result can be narrowed by using the minimal and maximal values of the involved attributes. This is shown in Fig. 7, where only the objects in the shaded rectangle have to be analyzed after the minima of each dimension are known. The principle of U is therefore to first determine the extrema and then send sub-skyline queries to the relevant nodes using range queries (four nodes are indicated in the figure). This again can be done in sequence (U^SEQ) or in parallel (U^PAR). The several sub-skylines are then combined by a synchronizing node, which also computes the final global skyline. For further details we refer to [36,8,37]. This approach also works for more than two dimensions, with only slight extensions required. Note that there exist different approaches for narrowing the search space, e.g., SkyFrame [38].
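Returning to the top-N operator u, its iterative range adaptation can be summarized in a few lines. The sketch below is an assumption-laden illustration: it covers a numerical attribute, presumes a load-balanced key space and at least N stored objects, and uses a generic range_lookup function standing in for the DHT's range-query mechanism.

    def top_n_nearest(range_lookup, m, n, r):
        # Guess a range [m - r, m + r] around the centre m and widen it,
        # based on the observed value density, until n objects are found.
        while True:
            values = range_lookup(m - r, m + r)
            if len(values) >= n:
                return sorted(values, key=lambda v: abs(v - m))[:n]
            density = max(len(values), 1) / (2 * r)
            r = n / (2 * density)        # expected radius covering n objects

    data = [3, 7, 8, 12, 15, 21, 22, 30, 41]       # assumes n <= len(data)
    lookup = lambda lo, hi: [v for v in data if lo <= v <= hi]
    print(top_n_nearest(lookup, m=14, n=3, r=2))   # -> [15, 12, 8]

With perfectly balanced data, the density estimate makes the first widening step sufficient, which matches the one-iteration claim above.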

Fig. 7. Principle of processing skyline queries in U^PAR (target functions: minimize x, minimize y).
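For two dimensions with "minimize both attributes" as target functions, the frame-based pruning behind Fig. 7 can be mimicked as follows. This is a sketch under stated assumptions: fetch_range stands in for the sub-skyline range queries sent to the responsible nodes, and the point data is invented for illustration.

    def dominates(p, q):
        # p dominates q: at least as good in every dimension, and p != q.
        return all(a <= b for a, b in zip(p, q)) and p != q

    def frame_skyline(fetch_range, pt_min_x, pt_min_y):
        # Only the rectangle spanned by the points realising min(x) and
        # min(y) can contain skyline points (the shaded frame in Fig. 7).
        lo = (pt_min_x[0], pt_min_y[1])
        hi = (pt_min_y[0], pt_min_x[1])
        frame = fetch_range(lo, hi)
        return [p for p in frame
                if not any(dominates(q, p) for q in frame)]

    points = [(1, 9), (2, 4), (3, 6), (5, 3), (8, 1), (9, 9)]
    in_box = lambda lo, hi: [p for p in points
                             if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]]
    a = min(points)                        # realises min(x)
    b = min(points, key=lambda p: p[1])    # realises min(y)
    print(frame_skyline(in_box, a, b))     # -> [(1, 9), (2, 4), (5, 3), (8, 1)]

The point (9, 9) never has to be fetched or compared, which is exactly the saving the extrema lookups buy in the distributed setting.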

Which of the approaches is best suited in which situation is an open research question.

4.3. Query execution

Query plans have to be processed from bottom to top. To achieve efficient query processing, it is important that the plan optimization results in an (estimated) optimal ordering of the nodes in the query plan. The basic strategy is to process plans in post-order. Certain operators may be combined to enable more efficient processing. This should be considered in the process of logical optimization. Since the impact of combining operators can only be evaluated meaningfully on the basis of cost estimations, a reordering of operators should also be enabled during the process of physical optimization. We briefly discuss query planning below in Section 4.6.

Query plan processing is based on the idea of Mutant Query Plans (MQP) [34]. Here, we illustrate this concept using a small example query plan as shown in Fig. 8. Note that μ_<A> is short for the extraction of the triple pattern [] <A> ?A, as this pattern occurs frequently. A message containing the query plan is shipped to one or multiple peers responsible for the next operator to process. These peers insert all data that correspond to the operator. In the example, the plan is first shipped to all peers responsible for property <A>, where all local triples for that property are inserted. Adhering to the post-order processing, the query plan is next shipped to the peer(s) responsible for property <B>. Again, local triples are inserted. Like this, the plan mutates, because its operators are successively replaced by according data. If μ^SEQ_<A> is used, one plan containing all data for <A> will be sent. In contrast, if μ^RQ_<A> is used, each peer responsible for a part of <A>'s data will be contacted in parallel. Thus, multiple plans, each containing a part of the data, are forwarded. This is an extension of the original MQP concept. During the processing of each operator, peers can autonomously duplicate plans as well as change their structure and the data contained. By this, we add even more possibilities of mutations. Thus, we call this concept Mutating Mutant Query Plans (M²QP).

Fig. 6. Principle of processing top-N ranking in u.


Fig. 8. Example query plan for illustrating query plan processing.

The M²QP concept allows a higher degree of parallelism, because all plans travel through the network independently. But this involves more messages, because each plan that contains a part of <A>'s data has to be sent to each peer responsible for a part of <B>'s data; otherwise, we would probably miss matching pairs. As soon as the data of one operator is not needed anymore, it is deleted from the plan. In the example, each peer responsible for property <B> can process a part of the join. Thus, only matching pairs are inserted into the plan, replacing the operator. Afterwards, the output data of the μ operators are not needed anymore and can be removed. If μ^SEQ_<B> is used, the processing of ⋈ and the deletion of the input data will happen at the last peer in the sequence of peers responsible for <B>.

If |pr_A| refers to the number of peers responsible for the range of <A> and |pr_B| refers to the number of peers responsible for <B>, the different combinations of physical operators will result in the following numbers of final query plans, i.e., replies. Fig. 9 illustrates this on the basis of according routing graphs; a sketch of the underlying plan mutation follows after the list.

• μ^RQ_<A> and μ^RQ_<B> (Fig. 9(a)): |pr_A| · |pr_B| query plans
• μ^SEQ_<A> and μ^RQ_<B> (Fig. 9(b)): |pr_B| query plans
• μ^RQ_<A> and μ^SEQ_<B> (Fig. 9(c)): |pr_A| query plans
• μ^SEQ_<A> and μ^SEQ_<B> (Fig. 9(d)): 1 query plan
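The following toy sketch mimics how such a plan message mutates while it travels: extraction operators are replaced by local data, and a join whose inputs are both bound is evaluated and pruned on the spot. The tuple-based plan encoding and the membership-test join predicate are deliberately simplistic assumptions, not the system's actual message format.

    def mutate(plan, store):
        # One mutation step at a receiving peer (post-order: left first).
        kind = plan[0]
        if kind == 'extract':
            return ('data', store.get(plan[1], []))
        if kind == 'join':
            _, left, right = plan
            if left[0] != 'data':
                return ('join', mutate(left, store), right)
            if right[0] != 'data':
                return ('join', left, mutate(right, store))
            # Both inputs bound: evaluate the join locally, prune the inputs.
            return ('data', [v for v in left[1] if v in right[1]])
        return plan

    plan = ('join', ('extract', '<A>'), ('extract', '<B>'))
    plan = mutate(plan, {'<A>': ['s1', 's2', 's3']})   # at peers holding <A>
    plan = mutate(plan, {'<B>': ['s2', 's3', 's4']})   # forwarded to <B> peers
    plan = mutate(plan, {})                            # join, then prune inputs
    print(plan)                                        # -> ('data', ['s2', 's3'])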

Inter-operator parallelism is an alternative to the post-order processing of query plans. Using this strategy, the children of a binary operator are processed in parallel. To process the binary operator, the resulting query plans have to be collected at a synchronization peer. The difference between post-order processing and inter-operator parallelism relates to the difference between depth-first traversal and bottom-up evaluation of parse trees briefly discussed in [32], where UniStore supports both. Inter-operator parallelism is especially useful for operators like the union operator ∪. This operator does not require explicit synchronization during query time: since the synchronization is data-independent, it degrades to a simple combination of result sets. Other operators that can benefit from inter-operator parallelism are join operators ⋈ and cross products ×. Here, the two child branches can be processed in parallel, but need explicit synchronization before subsequent operators can be processed. For data-independent binary operators like the union ∪, post-order processing implies unnecessary overhead; moreover, dropping operators is not applicable in this case. Thus, inter-operator parallelism should generally be applied for such operators. Synchronization itself can be implemented in a blocking fashion or in a way that enables online query processing, i.e., by forwarding partial results as soon as they are available. This can only be applied in conjunction with some pipelining mechanism, which is discussed next.

4.4. Pipelining

Pipelining is a technique known from traditional DBMS, where it is used to speed up query processing. As soon as output of an operator is available, it is passed to the next operator in the query plan and processed further. Like this, several operators can be processed in an interleaved manner. In our implementation, this means a triple is processed by all piped operators without intermediate routing states. Thus, the processing of subsequent operators is delayed only minimally. We refer to this as triple set pipelining. As an example, consider an extract operation μ followed by a selection σ_P. Extracted triples may be directly passed to the selection, which makes storing the output triples of μ unnecessary. Triple set pipelining is useful to speed up the local processing of query plans. Between operators, a query plan is always handled in a routing phase, even if the next responsible node is the current node itself (i.e., the node would send the plan to itself). This unnecessary step is omitted with triple set pipelining. If it is not used, we will first finish the extract operator and afterwards process the selection. While triple set pipelining can decrease the used bandwidth, it has no impact on the number of generated messages.
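Triple set pipelining maps naturally onto generator-style processing, as the following minimal Python sketch shows; extract and select are illustrative stand-ins for the physical μ and σ_P implementations, and the triples are invented.

    def extract(triples, prop):
        for t in triples:                  # mu: emit matching triples lazily
            if t[1] == prop:
                yield t

    def select(stream, pred):
        for t in stream:                   # sigma_P consumes each triple as
            if pred(t):                    # soon as the extraction emits it
                yield t

    triples = [('s1', '<age>', 31), ('s2', '<age>', 17), ('s3', '<name>', 'Ann')]
    piped = select(extract(triples, '<age>'), lambda t: t[2] >= 18)
    print(list(piped))                     # -> [('s1', '<age>', 31)]

No intermediate list of extracted triples is ever materialized, which is precisely the storage saving described above.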

Fig. 9. Different combinations of μ operators (n := |pr_A|, m := |pr_B|).


As a modification, according operators can be processed in parallel; this is called pipelined parallelism [39] in traditional DBMS.

We introduce a modified version of pipelining called peer pipelining, which is specifically designed for widely distributed systems. The concept is basically the same, but does not only involve operators that are processed at the same peer. Rather, it is extended to also include routing operators. This is mainly effective for sequential operators, but also for synchronizing operators like U and inter-operator parallel joins ⋈. The idea is to pass intermediate results to routing operators like ⋈_s as soon as local processing or an operator phase is finished. Fig. 10 illustrates this for μ^SEQ_<A>. In the usual case, we first pass through the whole sequence of peers responsible for the triples to be extracted. Afterwards, one query plan containing the whole set of output triples is forwarded to the peer(s) responsible for processing the next operator. If one peer in the sequence fails or is slow, this may break or delay the chain of processing unnecessarily. With peer pipelining, the output produced so far is forwarded immediately. This is still different from parallel processing, i.e., from choosing μ^RQ_<A>: only the first peer of the sequence is contacted by the initiating peer. The approach results in the same number of forwarded query plans as in the parallel case. But the fact that all peers are contacted in sequence eases the estimation of the number of resulting query plans. Further, sending multiple independent query plans to the same peer is prevented, which can be guaranteed only for sequential operators. This still comes with an increased vulnerability to broken chains and slow peers: peers succeeding a slow peer in a sequence are still delayed in processing. Peer pipelining is a prerequisite for enabling online and evolving processing. Only with this method, combined with an according online projection π, can online and evolving processing improve the search experience of the user.

4.5. Overview of query execution

Algorithm 2: General plan processing process(Q)

Input: query plan Q
 1  op ← get-next-operator(Q);
 2  if is-logical(op) then
 3    op ← plan-query(Q);
 4  end
 5  process-operator(op);
 6  if is-done(op) then
 7    prune-plan(Q);
 8    process(Q);                      /* restart with next operator */
 9  end
10  else
11    if peer-pipelining ∧ new-data-available(op) then
12      C ← clone-plan(Q);
13      forward(C);
14      mark-partly-processed(op);
15      mark-done(op);
16      process(Q);                    /* restart with next operator */
17    end
18    else
19      forward(Q);
20    end
21  end

Fig. 10. Routing scheme for μ^SEQ_<A> with peer pipelining (n := |pr_A|).

An overview of the general process of query plan processing is provided in Algorithm 2; we illustrate only the main steps. This procedure is executed at each peer that receives a query plan copy. First, the next operator to process is determined. This respects the options of post-order processing and inter-operator parallelism. If that operator is still a logical operator (at query start, or during query time when dynamic planning is enabled, see Section 4.6), the process of query planning is initiated. Here, one or multiple logical operators are replaced by physical ones, synchronization peers are chosen, and operators may be reordered. Afterwards, process-operator() is called for the current operator. If no data for processing is available, the operator will return without doing anything. Otherwise, all data (including local data) that can be processed are processed. Result data are inserted into the query plan Q. If triple set pipelining is enabled, it is integrated here as well. If the operator finishes, the DONE flag is set. This flag decides whether the process() method is restarted or not (checked by is-done(op)). If it is restarted, this results in processing the next operator. Otherwise, the query plan Q is pruned (unnecessary data and operators are removed) and forwarded to the peer(s) that are the next to process the current operator. This forwarding implements the different routing calls introduced for the physical operators. Also, finally processed queries are returned to the query initiator, and inter-operator parallelism is initiated in the forward() method. The integration of peer pipelining is illustrated in lines 11-17. The created plan copy C is forwarded to continue the processing of the current operator. Both query plan copies can be pruned in this step. The current operator in the local plan copy Q is marked as partially processed (method mark-partly-processed(op)) and DONE (method mark-done(op)), which results in processing the next operator by restarting the plan processing procedure. If the next operator is a routing operator, this results in forwarding the original plan copy as well. Following this procedure at each peer, each created plan copy will be replied at some point in time. Next, we discuss the process of query planning.

4.6. Query planning

Query planning is the process of transforming a logical query plan into a physical execution plan. This involves logical and physical plan optimization. Logical optimization involves the simplification of expressions, the combination or splitting of operators, and other rule-based optimizations. For this, methods from traditional DBMS can be adopted. General optimization rules are similar to those known from relational DBMS, such as pushing selections σ downwards and pushing subject joins upwards. Further issues that should be regarded have been indicated throughout this work and are not discussed here in detail. Ref. [40] provides some more insights into the process of logical query optimization.

To get an executable plan, the contained logical operators have to be replaced by physical operators. This is a matter of physical optimization. With static planning, this happens at the peer initiating the query. The optimizer has to be aware of the fact that there exists an m:n mapping between logical and physical operators. This means there exist several choices of physical operators for one logical operator. Further, multiple logical operators can be combined into one physical operator, and vice versa. Physical optimization should also support the reordering and combination of operators.
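The m:n mapping can be pictured as a candidate enumeration per logical operator, from which a cost-based choice is made. The sketch below is illustrative only: the candidate lists and cost numbers are invented placeholders for the estimates the cost model of Section 4.6.1 would deliver.

    CANDIDATES = {
        'extract': ['mu_SEQ', 'mu_RQ', 'mu_QG_SEQ', 'mu_QG_PAR'],
        'join': ['join_LOC', 'join_SYNC', 'join_HASH_SEQ', 'join_HASH_PAR'],
    }

    def pick_physical(logical_op, estimate, prefer='messages'):
        # estimate(op) -> (messages, hops); minimise the preferred measure,
        # breaking ties with the other one.
        if prefer == 'messages':
            return min(CANDIDATES[logical_op], key=lambda op: estimate(op))
        return min(CANDIDATES[logical_op], key=lambda op: estimate(op)[::-1])

    costs = {'mu_SEQ': (12, 9), 'mu_RQ': (15, 3),
             'mu_QG_SEQ': (20, 11), 'mu_QG_PAR': (28, 3)}
    print(pick_physical('extract', lambda op: costs[op]))           # -> mu_SEQ
    print(pick_physical('extract', lambda op: costs[op], 'hops'))   # -> mu_RQ

The two calls illustrate the trade-off discussed throughout this section: preferring bandwidth yields the sequential variant, preferring latency yields the parallel one.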

To meaningfully enable this, some tasks that could be integrated into logical optimization should be delayed until after the physical one. Further, several operator orders can only be identified and evaluated with knowledge about the available physical operators. A cost model should be applied to choose the best combination of physical operators. Implementing sophisticated optimizers is a complex and challenging task. In the following, we provide a basis for this by discussing costs for the physical operators presented in this work.

4.6.1. Cost model

In order to support cost-based query planning, we analyzed all of the supported operators and determined according cost formulae. For exact cost prediction, the cost factors have to be determined by gathering and maintaining data statistics from all involved nodes. As this results in additional overhead, they can alternatively be approximated locally. This is possible with satisfying accuracy if the underlying DHT supports load balancing. Approximated factors are also successfully used for cost-based optimization in centralized database engines and for maintenance tasks in DHTs if no exact statistics are available. Cost-based planning in our system faces similar advantages and drawbacks as these approaches when based on local approximations. Evaluating alternatives for determining the cost factors is out of the scope of this paper and part of our ongoing work. The cost model allows for an adaptive plan processing, where each involved node can decide independently how to further process a received query plan.

The main cost measures in a distributed setup are the number of messages m and the number of hops h. Reply times usually cannot be predicted accurately, as they depend on too many unknown factors. However, together with other measures, such as bandwidth consumption, they are usually linearly related to m and h. We will provide some empirical evidence for this in Section 5. In Table 2 we show the costs of the main routing operators introduced above. The important cost factors are the number of key space partitions that have to be queried (e.g., by a given range) and the number of input triple sets for an operator. We list the symbols that we use in the cost formulae in the upper part of Table 2. We omit the costs of a few operators due to the complexity of the underlying formulae. The shower algorithm used in μ^RQ results in sublinear costs (in |P|) and mainly depends on the number of nodes in the queried range and their storage capacity. Ref. [13] provides a detailed discussion and a comparison to the algorithm used in μ^SEQ. For the q-gram similarity joins, we only list the main variants of fully parallel and fully sequential processing. The costs of the

variants mixing parallel and sequential processing reflect the trade-off between required hops (more parallelism results in fewer hops) and messages (more sequential parts can reduce the number of messages). For a detailed cost discussion we further refer to [8]. All operators rely on the functionality of the DHT, which guarantees logarithmic hops and messages for each lookup. Thus, all costs in Table 2 contain the term log |P|. Besides that, sequential operators usually come with fewer messages, but more hops. Thus, they are usually suited to save bandwidth, but result in higher reply times and are not suited for rather unreliable networks. Parallel operators require fewer hops, at the cost of more messages. Costs for hash-based processing (e.g., using ⋈^HASH) depend on the size of the input data, whereas range-based processing costs (e.g., combining ⋈^LOC with μ^RQ) are dominated by the number of nodes in the queried ranges.

The costs above represent the general costs for processing operators in a single instance of a query plan. For query planning, more details are required, such as the number of query plans that are created by an operator and have to be processed in parallel by subsequent operators. In the following, we introduce the additional parameters used for that and finally provide a concrete example of cost comparison. The costs of complex query plans Q are determined using the costs of the single operators:

m(Q) = Σ_{op∈Q} m(op) + f_root
h(Q) = Σ_{op∈Q} h(op) + 1

Here, f(op) represents the number of follow-up query plans produced by an operator op (i.e., f = 1 for sequential operators, f ≥ 1 for parallel operators). The exact costs of each operator op, as well as f(op), depend on the f values of all child operators. We omit the details here and just note that these values are incorporated into the formulae for m(op) and h(op). The number of follow-up query plans of the root operator, f_root, corresponds to the estimated number of leaves in the routing graph for Q. Thus, it reflects the number of estimated replies. Each reply is sent directly to the query initiator. The final decision which plan to choose has to be based on the two factors m(Q) and h(Q). It further depends on the characteristics of the network environment, user preferences, and general requirements concerning the trade-off between query answer times, robustness, and load.

Table 2
Costs of main routing operators.

Symbol      | Meaning
|P|         | Number of key space partitions p ∈ P in the trie
|P_R|       | Number of key space partitions in queried range R
|T|         | Number of triple sets ts ∈ T as operator input
|P_T|       | Number of key space partitions all ts ∈ T are hashed to by the operator
d           | Queried distance in similarity queries
|P_{q,d+1}| | Number of key space partitions all d + 1 queried q-grams are hashed to

Operator op                          | m(op)                              | h(op)
μ^SEQ                                | O(log |P|) + |P_R| − 1             | O(log |P|) + |P_R| − 1
μ^RQ                                 | see shower algorithm [13]          | O(log |P|)
⋈^(HASH,SEQ) [exact match only]      | |P_T| · O(log |P|)                 | |P_T| · O(log |P|)
⋈^(HASH,PAR) [exact match only]      | |T| · O(log |P|)                   | O(log |P|)
μ^(QG,SEQ)                           | |P_{q,d+1}| · O(log |P|)           | |P_{q,d+1}| · O(log |P|)
μ^(QG,PAR)                           | (d + 1) · O(log |P|)               | O(log |P|)
⋈^(HASH,SEQ)μ^(QG,SEQ)               | |P_{q,|T|·(d+1)}| · O(log |P|)     | |P_{q,|T|·(d+1)}| · O(log |P|)
⋈^(HASH,PAR)μ^(QG,PAR)               | |T| · (d + 1) · O(log |P|)         | O(log |P|)

Table 3
Example statistics needed for cost estimation.

Variable                  | Value
|pr_price|                | 10
Unique values of <price>  | 200
Minimal value of <price>  | 500
Maximal value of <price>  | 20,500
|pr_dlr|                  | 7
|pr_id|                   | 5
|pr_name|                 | 7

Fig. 11. Logical query plan Q used as example for cost estimations.

Fig. 12. First physical plan Q_1 for Q from Fig. 11.

Fig. 13. Second physical plan Q_2 for Q from Fig. 11.

An example logical query plan is shown in Fig. 11. Figs. 12 and 13 show two possible physical query plans. Note that, for each physical operator, we only indicate the variables and properties that are used to determine the responsible nodes, i.e., μ^RQ_<dlr> uses the property-object index to look up all triples for the property <dlr>, and ⋈^(HASH,PAR)_?s uses the subject index to look up all triples that correspond to one of the input subjects referenced by ?s. Assuming statistical values as shown in Table 3, the costs of the first plan are estimated as shown in Table 4. An example of how such cost factors can in principle be approximated with only local knowledge can be found in Section 4.2.2, where we discuss how to guess the range to be queried in the top-N operator. m_l refers to the number of messages that a single look-up operation in the DHT requires, h_l to the number of hops, respectively. These values are approximated to be logarithmic in N, and we do not have to determine them precisely for comparing execution plans. This results in total costs:

m(Q_1) = 10 m_l + 70 m_l + (70 m_l + 280) + 70 + 49 m_l + 49 = 199 m_l + 399
h(Q_1) = h_l + h_l + (h_l + 4) + h_l + h_l + 1 = 5 h_l + 5

If we chose to use inter-operator parallelism for the join operator, ⋈^LOC_(edist(?dlr,?id)<2) would be replaced by ⋈^SYNC_(edist(?dlr,?id)<2). This would result in modified costs as shown in Table 5 and in total costs:

m(Q_1) = 10 m_l + 70 m_l + (m_l + 4) + 71 + 1 + 7 m_l + 7 = 88 m_l + 83
h(Q_1) = h_l + h_l + (h_l + 4) + 1 + h_l + h_l + 1 = 5 h_l + 6

Cost estimations for Q_2 from Fig. 13 are shown in Table 6. Using this variant, the total costs are:

m(Q_2) = 2 m_l + 14 m_l + 80 m_l + 80 + 5 m_l + 5 = 101 m_l + 85
h(Q_2) = h_l + h_l + h_l + h_l + h_l + 1 = 5 h_l + 1
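These totals follow mechanically from m(Q) = Σ m(op) + f_root and h(Q) = Σ h(op) + 1. As a sanity check, the short Python script below (a sketch with abbreviated operator names) recomputes the Q_1 totals from the per-operator entries of Table 4, keeping m_l and h_l symbolic as (coefficient, constant) pairs.

    # Each operator: (m_l coeff, m const, h_l coeff, h const), cf. Table 4.
    ops_q1 = [
        (10, 0,   1, 0),   # mu_RQ_<price>
        (0,  0,   0, 0),   # sigma_LOC
        (70, 0,   1, 0),   # mu_RQ_<dlr>
        (0,  0,   0, 0),   # join_LOC_?s
        (70, 280, 1, 4),   # mu_SEQ_<id>
        (0,  0,   0, 0),   # join_LOC_edist
        (0,  70,  1, 0),   # u_SYNC
        (49, 0,   1, 0),   # mu_RQ_<name>
        (0,  0,   0, 0),   # join_LOC_?s
        (0,  0,   0, 0),   # pi_LOC
    ]
    f_root = 49            # estimated number of final replies

    m = (sum(o[0] for o in ops_q1), sum(o[1] for o in ops_q1) + f_root)
    h = (sum(o[2] for o in ops_q1), sum(o[3] for o in ops_q1) + 1)
    print(f"m(Q1) = {m[0]} m_l + {m[1]}")   # -> m(Q1) = 199 m_l + 399
    print(f"h(Q1) = {h[0]} h_l + {h[1]}")   # -> h(Q1) = 5 h_l + 5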
Since Q_2 contains several hashing operators, we have to estimate selectivities. This is based on the assumption of a uniform data distribution. It is not needed for the last join operator, because the preceding u is blocking and produces only one query plan containing the top-5 triple sets. An interesting observation is that synchronization operators are suited to reduce the number of independent plan copies. This also holds for hashing operators, because for both f(op) is independent of the f values of the children. But this can also result in the opposite effect. Nevertheless, such operators can be used to reduce the impact of parallelism on the number of messages. This fact should be regarded in an according optimizer.

4.6.2. Adaptive query processing

Due to limited local knowledge, changing network situations, dynamics of load and peers, etc., static planning will usually not result in the best suited query plans. Rather, dynamic planning should be preferred. Technically, achieving dynamic planning is rather simple. The idea is to not replace all logical operators by physical ones before starting the processing of a query plan. Rather, only the operator(s) on leaf level have to be chosen in order to start query processing. Each time an operator is finished and the next operator has to be processed, we can check whether this is still a logical operator. If so, this is the latest point at which we have to replace it by a physical implementation. Before that, the operator can be replaced at any time of query processing, and the choice of operators can be restricted a priori. We achieve this by shipping so-called plan rewriters along with a query plan. Dynamic planning represents a basis for enabling adaptive query processing.

Table 4
Cost estimation for Q_1 (r_ev(u) = 0.1).

Operator op                                    | m(op)                                | h(op)                 | f(op)
μ^RQ_<price>                                   | 10 m_l                               | h_l                   | 10
σ^LOC_(1000≤?price≤5000)                       | 0                                    | 0                     | 10
μ^RQ_<dlr>                                     | 10 · 7 m_l = 70 m_l                  | h_l                   | 10 · 7 = 70
⋈^LOC_?s                                       | 0                                    | 0                     | 70
μ^SEQ_<id>                                     | 70 m_l + (5 − 1) · 70 = 70 m_l + 280 | h_l + 5 − 1 = h_l + 4 | 70
⋈^LOC_(edist(?dlr,?id)<2)                      | 0                                    | 0                     | 70
u^SYNC_(MAX,?price,5)                          | 70                                   | h_l                   | ⌈0.1 · 70⌉ = 7
μ^RQ_<name>                                    | 7 · 7 m_l = 49 m_l                   | h_l                   | 7 · 7 = 49
⋈^LOC_?s                                       | 0                                    | 0                     | 49
π^LOC_(?name,?price,?dlr)                      | 0                                    | 0                     | 49

Table 5
Cost estimation for Q_1 with inter-operator parallel blocking join (r_ev(⋈) = 0, r_ev(u) = 0.1).

Operator op                                    | m(op)                                | h(op)                 | f(op)
μ^RQ_<price>                                   | 10 m_l                               | h_l                   | 10
σ^LOC_(1000≤?price≤5000)                       | 0                                    | 0                     | 10
μ^RQ_<dlr>                                     | 10 · 7 m_l = 70 m_l                  | h_l                   | 10 · 7 = 70
⋈^LOC_?s                                       | 0                                    | 0                     | 70
μ^SEQ_<id>                                     | m_l + 5 − 1 = m_l + 4                | h_l + 5 − 1 = h_l + 4 | 1
⋈^SYNC_(edist(?dlr,?id)<2)                     | 70 + 1 = 71                          | 1                     | 1
u^SYNC_(MAX,?price,5)                          | 1                                    | h_l                   | ⌈0.1 · 1⌉ = 1
μ^RQ_<name>                                    | 1 · 7 m_l = 7 m_l                    | h_l                   | 1 · 7 = 7
⋈^LOC_?s                                       | 0                                    | 0                     | 7
π^LOC_(?name,?price,?dlr)                      | 0                                    | 0                     | 7

Fig. 14. Example of utilizing mutations for adaptive query processing.

Table 6
Cost estimation for Q_2 (r_ev(u) = 0).

Operator op                                    | m(op)                                | h(op)                 | f(op)
σ_(1000≤?price≤5000)μ^RQ_<price>               | 40/⌈200/10⌉ m_l = 2 m_l              | h_l                   | 40/⌈200/10⌉ = 2
μ^RQ_<dlr>                                     | 2 · 7 m_l = 14 m_l                   | h_l                   | 2 · 7 = 14
⋈^LOC_?s                                       | 0                                    | 0                     | 14
⋈^(HASH,PAR)_(edist(?dlr,?id)<2)μ^(QG,PAR)_<id> | 40 · 2 m_l = 80 m_l                 | h_l                   | 40 · 2 = 80
u^SYNC_(MAX,?price,5)                          | 80                                   | h_l                   | 1
⋈^(HASH,PAR)_?s μ_<name>                       | 5 m_l                                | h_l                   | 5
π^LOC_(?name,?price,?dlr)                      | 0                                    | 0                     | 5

Adaptive techniques should be applied on more levels than only the choice of physical operators. This is a main ingredient of the proposed M²QP approach. By allowing each peer to autonomously change the structure of a query plan, we provide a maximum of flexibility in query planning and processing. It can be used for supporting certain advanced operators on the basis of existing ones. We illustrate this using the example of a q-gram-based join in Fig. 14. Assuming a simple similarity join on the values of two properties <A> and <B>, we start with a query plan as shown on the left of the figure. At the query initiator, we start query planning by replacing only the first extraction operator, as shown in the second query plan. Each peer responsible for a part of property <A> receives a copy of that plan, inserts local triples, and continues with processing the next operator. At this point, the peers have to decide which physical operator to choose for μ_<B>. Depending on local knowledge and estimated costs, they may decide to use a parallel q-gram-based similarity join that looks up the q-grams of each input string s_A from property <A> in parallel. The final plan then looks like the third plan in the figure. To finally process the plan while utilizing existing operator implementations, the peers can generate one plan for each input string s_A, as shown at the end of the sequence. This is only possible if they can manipulate query plans autonomously. Note that triple set pipelining should be applied to prevent the doubled evaluation of edist. Similar insertions of temporary operators are used, for instance, to process the different phases of a skyline operator.

A powerful aspect of M²QP is that each peer can decide autonomously. Thus, some may decide to use an operator implementation as in the example. Others may decide to query the q-grams of each input string in sequence, using ⋈^(HASH,PAR)_(edist(?A,?B)<2)μ^(QG,SEQ)_<B>. Some can even decide to use a local join and process ⋈^LOC_(edist(?A,?B)<2)μ^RQ_<B>. As all plans travel the network independently, they do not influence each other, and the final result will be correct in each case.

Adaptivity in query processing has its limits as well. For instance, if a query plan contains a ranking operator u, the peers cannot decide absolutely autonomously about the chosen implementation. If any of the operators preceding u is executed in parallel, u^LOC cannot be chosen. Rather, all intermediate query plans have to be routed to the same synchronization peer. That peer has to be defined before parallel processing starts. This applies to all operators that need some synchronization. A good choice is to use the query initiator as the synchronization peer and shift according operators in the plans upwards. Then, distributing query load through parallelization can help to increase robustness, because ranking is processed locally on all received final replies. In contrast, shifting ranking operators downwards can reduce the number of generated messages at the expense of reduced robustness and increased query answer times. Fig. 15 exemplarily illustrates this. Note that the first join has to be processed before the ranking can happen; otherwise, results could be inaccurate. Thus, the special implementations u^SEQ and u^PAR cannot be chosen. Further discussions of adaptive query processing techniques are an interesting issue for future work.
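As a closing illustration of such autonomous decisions, the heuristic below mimics how a peer might choose between per-string q-gram lookups and a range-based local join; the decision rule and all numbers are invented for the sketch and are no substitute for the full cost model.

    def choose_join_impl(num_input_strings, d, est_peers_in_range):
        # d + 1 lookups per input string vs. one visit per peer in the range.
        qgram_lookups = num_input_strings * (d + 1)
        if qgram_lookups < est_peers_in_range:
            return 'join_HASH_PAR + mu_QG_PAR'   # few inputs: q-gram lookups
        return 'join_LOC + mu_RQ'                # many inputs: scan the range

    print(choose_join_impl(5, 1, 40))    # -> join_HASH_PAR + mu_QG_PAR
    print(choose_join_impl(50, 1, 40))   # -> join_LOC + mu_RQ

Because every plan copy yields a correct partial result regardless of the chosen implementation, peers applying different rules of this kind can safely coexist within one query.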


Fig. 15. Late vs. early synchronization.

5. Evaluation

To evaluate our approach and the implemented system, we ran a series of experiments. The results presented in the following have to be understood as a proof of concept, with a particular focus on the distributed processing paradigm. The number of nodes (and their resources) that were available for our experiments did not allow us to run a larger-scale evaluation. Further, we do not directly compare to existing systems. This is because for many proposed systems there is no implementation available, while other available distributed RDF management systems do not support similarly complex operations as our system. A direct comparison to centralized engines is out of the scope of this evaluation. This evaluation should be understood as showcasing the general applicability and scalability of the proposed approach, as well as the characteristic differences between the discussed operators and processing strategies. More details on most of the discussed experiments and results can be found in [8], while [37] focuses on an extensive evaluation of the introduced processing strategies for skyline queries.

Setup. To enable a large-scale and realistic evaluation, we ran a series of experiments in the PlanetLab [41] environment. PlanetLab is a world-wide consortium created especially for running large-scale distributed experiments. At the time of our experiments, it consisted of more than 700 nodes and was specifically dynamic and unreliable. PlanetLab was developed for evaluating network protocols and distributed systems such as overlay networks, rather than for running large-scale data-oriented experiments. As a consequence, the resources (particularly main memory) available at the participating nodes are very limited. Therefore, PlanetLab bears several challenges for running meaningful large-scale experiments with a database background. Several of these issues have been observed by the designers of PIER as well [42]. Despite these limitations, PlanetLab still provides the best distributed infrastructure to evaluate large-scale distributed systems in a true Internet setting. We tried to use specifically reliable nodes (online and available with high probability) to eliminate unwanted effects on our evaluation due to node failures. The number of nodes involved in our experiments varied between 30 and 120, and the nodes were distributed all over the world. We also ran the same series of tests in full-scale PlanetLab setups. For this we gathered as many nodes as possible, resulting in setups with up to 460 nodes.

Due to the reliability issues we encountered on PlanetLab, we also ran the same series of experiments in a local setup. For this we used as many machines as we could in our local network, running multiple instances (i.e., nodes) on each machine. This allowed us to build networks with up to 74 nodes utilizing all available resources. As this environment was more stable, we could use a lower replication factor in the underlying DHT. This system configuration factor has to be larger in more widely distributed and unreliable environments, so the local network can be

understood as a representative of actually larger networks. The presented results are mostly based on the PlanetLab runs; we point to the local setup where appropriate. However, the results gained in both setups are not absolutely satisfying. The local setup is quite small, and due to the many instances on one machine, the resources of the single nodes were exploited to the maximum (and beyond). The PlanetLab setup provides a very realistic environment in terms of geographical spread and unreliability of nodes. But, due to the unusually high number of experiments running in parallel, the resources are too sparse and the whole network generally under too much load to achieve really satisfying performance. Thus, we ran a third series of experiments on G-Lab [43]. This is a recent sub-project of PlanetLab with relatively powerful machines and significantly less load. We were able to use between 105 and 115 machines in each run. This setup can be understood as a good mix between the advantages of a local-only cluster and the spread of the actual PlanetLab network.

Data. As the concepts of similarity and ranking queries form an integral part of our proposal, we had to use data sets that adequately support such an evaluation. We used data from the Internet Movie Database (IMDb).² The data is a collection of movie titles with some additional information like IDs and year of production. The data set exhibits a skewed heavy-tail key distribution (power-law like). We built two heterogeneous RDF data sets by using different names for properties with an actually identical semantics (e.g., title and name). More heterogeneities come for free due to the character of the used data (e.g., typos in actor names). When running the experiments, these data were not part of the Linked Open Data cloud. However, in the context of our evaluation they show an appropriate similarity to the available Linked Data. The second data set, representing a realistic scenario combining geographical data from different sources, is composed of triples from DBpedia³ and geographical data in relational format from Mondial.⁴ The Mondial data were transformed to RDF before using them in our tests. Varying data distributions are represented by the different test data, while varying failure rates and message delays come inherently with the usage of the realistic distributed environments.

Unfortunately, the issues of the PlanetLab environment described above limited the number of triples we could actually inject into the system. We chose the injected data randomly from the complete sets. After building the default indexes, each peer managed about 200 triples without the q-gram indexes, i.e., about 6000-24,000 triples in total. With q-gram indexes, these numbers increased to about 830 triples per peer, i.e., about 25,000-100,000 triples. Consequently, in the full-scale PlanetLab tests the total number of managed triples was about 380,000. Although these numbers are relatively low, they enable the targeted proof of concept. The number of data items is mainly limited by technical issues of the implementation. The used P-Grid version is based on an in-memory Java database, where non-optimized hard-disk swapping is used when free memory gets low. However, the available memory resources were constrained in all setups. Further, message sizes can become an issue in any DHT exchange mechanism. We are aware of these issues and plan to approach them utilizing the latest achievements of the DHT community.
But we were not able to integrate these into the current implementation. To achieve larger scale in the experiments, we require more nodes and resources and an appropriate optimization of the locally used databases (which is not in our focus); in the best case, we would use well-performing local RDF stores at each participating node. Still, we are able to show the performance trends with increasing network size as well as the accuracy of our cost model. These costs in turn can be used to show the theoretical scalability of our approach to a significantly larger number of nodes, and thus triples.
² http://www.imdb.com/
³ http://dbpedia.org
⁴ http://www.dbis.informatik.uni-goettingen.de/Mondial/


Further, we are able to show the performance of the proposed methods and techniques with an increasing data volume but a constant number of nodes. For these tests, all run on the G-Lab setup, we used crawled Linked Data. The crawl started from the FOAF file of a DERI researcher and considered only the content of URIs serving content of type application/rdf+xml. New URIs were extracted from the subject and object RDF terms of the found triples. From the so-created set of 23 million statements, we picked random sets of different sizes and built the default indexes, resulting in ca. 25,000, 150,000, 280,000 and 400,000 managed triples. Again, the management of more data items while keeping realistic query times was impossible due to limits in the current implementation.

Queries. As a prerequisite for enabling query performance tests, the complete distributed data set needed to be indexed. This index generation from scratch is not a well-researched domain, as most P2P systems provide the index construction and maintenance algorithms and usually assume an already existing index of a certain size in their performance evaluations (routing, churn, latencies, etc.). This is not a realistic assumption for data-oriented approaches such as UniStore. Fortunately, the required index construction effort analysis has been carried out for P-Grid already: Ref. [3] provides an extensive discussion and evaluation of index creation from scratch with P-Grid that we could rely on. The index construction times can be derived easily from [3], and thus we do not include them in the following evaluations. To start with a stable index, we allowed the system 20-60 min for index construction to minimize the effects of churn on the initial setup for the query performance tests. The query processing tests themselves were again exposed to churn as in a realistic environment, but it was important to decouple these effects from the effects caused by churn in the initial index construction phase.

For the first series of tests, we generated a query mix of 120 test queries. The query complexity ranges from simple extractions to complex join queries, including similarity joins and selections based on q-gram operators, as well as aggregation operators and ranking operators. The actual shape of a query plan and the contained operators are more relevant than the actual query. Considering this, the queries used in the evaluation can be assigned to five general classes as shown in Fig. 16. We refer to these classes using the identifiers e (single extraction operators), m (extraction combined with subject join and optional selection), c (join queries with multiple extractions and star-shaped patterns), a (aggregation queries) and ac (complex queries containing also aggregations). These classes form a representative mix of all discussed operator functionalities with a focus on the routing operators.

Fig. 16. Shape of evaluation queries and class identifiers.

Thus, they are tailor-made to compare the proposed processing strategies and to evaluate how the combination of different operator implementations influences performance. In the figures and explanations we use corresponding symbols and abbreviations, which are introduced where needed. These query classes are used on all involved data sets. Example SPARQL queries for the classes are listed in the Appendix. We also consider increasing query load and complexity for queries of the same class, e.g., by increasing the number of joins and the queried distances in similarity queries. All queries with all variations were run 10 times, and the average values are presented.

For the second series of tests, run on G-Lab and the crawled Linked Data, we automatically created a set of queries. In order to create queries that produce non-empty results, the query generator parsed for all dereferenceable URIs in the data and tried to produce as many queries of different shapes as possible. The different shapes ranged from single object/subject lookups, over path queries of different lengths, to star-shaped queries and queries containing a mix of star and path shapes. This represents a mix of queries that was found to be common in practice [44], where some of the queries even exceed the typically observed complexity in order to stress the system. The queries finally used for the experiments were randomly chosen and again issued up to 10 times, where average values are presented. We explicitly mention the type of query where appropriate; the corresponding SPARQL queries are again listed in the Appendix.

Metrics. The most important performance factors in distributed data-management systems are the number of network messages, the number of hops (or hop count) these messages require to reach their destination, and the total bandwidth consumption of a query. Bandwidth consumption usually shows a linear correlation with the number of messages. The number of hops corresponds to the number of nodes a query is processed on in sequence, and thus represents the delay that can occur during the processing. For illustration, consider a query that has to be processed by three nodes. If all nodes are contacted in sequence, the number of hops will be 3. If each of the nodes is contacted in parallel, and each node can process its part independently of the other nodes, the number of hops is only 1. In order to achieve scalability, the evaluation metrics for the system should scale sub-linearly with an increasing number of nodes (and data). In the ideal case, we can even achieve constant behavior (e.g., as we show later, for the hop count by exploiting parallelism).

The most interesting metric is the time to finish the processing of a query, which could be measured in a more reliable and robust setup. However, in a distributed system, particularly the number of hops provides a very good indicator for this. This is particularly relevant in the PlanetLab setup, as highly loaded or temporarily unavailable nodes can drastically slow down the complete processing of a query. Consequently, the more parallelism we can apply, the lower the number of hops, and the faster partial results are available. Thus, we use the number of received results over time as well as the time required for receiving first results as additional metrics. Both should behave rather independently of the actual scale of the system. These metrics are further particularly relevant for enabling online query processing.
This is based on the idea of using partial results for presentation or further processing as soon as they are available, which is a very popular strategy in large-scale Web-based systems. The practicability of this is especially leveraged when combined with an appropriate completeness estimation, as briefly discussed in Section 2. The number of messages required to process a query represents the additional overhead that an increased degree of parallelism imposes. Note that we always refer to the number of actual hops and messages, not the overlay hops and messages, i.e., a look-up operation will result in log |P| hops. The series of experiments on the G-Lab setup is mainly meant to assess the query times we can achieve. Further, these experiments are used to show the relation between hop count and query time, between the number of messages and bandwidth consumption, as well as between the degree of parallelism and the number of replies vs. the time required for first/last replies.


Fig. 17. Comparison of μ^RQ and μ^SEQ for two different triple patterns.

These experiments are particularly suited for these metrics due to the character of the G-Lab network and the used data crawled from the Web.

Physical operators. In this part of the evaluation, we focus on comparing the different operator implementations and processing strategies. The tests were run on 30, 60, 90, and 120 nodes from the PlanetLab test-bed using the two generated IMDb data sets. We built four indexes: a subject index, a property-object index, one q-gram index on the object values of one property from each set, and a second q-gram index on only the names of four chosen properties from each set. In general, we expect parallel operators to show lower reply times and decreased hop counts, but in turn to result in more replies and more messages. These differences should conform to the cost model introduced for physical optimization.

First, we focus on implementations for the basic routing operators μ and ⋈_s (as a representative of other join operators that can use the same processing approach, e.g., ⋈_so). In Fig. 17 we compare the number of messages, the number of hops, and the time for the first replies for two different extractions. The queries belong to class e introduced before. While the number of messages shows only small differences, the number of hops and the small reply times highlight the advantages of the parallel approach. As expected, μ^RQ is the best choice for extraction operators that address a range of nodes. In Figs. 18 and 19 we show the estimations that are gained from the proposed cost model. Figs. 18(a) and 19(a) show the estimated absolute values (left y axis) and the resulting relative estimation error (right y axis). The general differences and trends in the absolute values are reflected very well, but the relative error reaches 25 percent in places. As the errors do not seem to depend on the network size, we have to look for the reasons in other factors. A crucial factor is the failure rate of peers and messages, which decreases actual cost values (if, for instance, a query plan gets lost and is not completely processed). But these issues are not covered by the cost model as introduced. Second, P-Grid's advanced algorithm for processing range queries makes the cost estimation particularly difficult. However, as the costs reflect the main differences between the operator alternatives, they are suited to fulfill the main task they were designed for, the decision on choosing operator implementations. This is why we also use them to simulate the scalability of the operators. The results in Figs. 18(b) and 19(b) (note the logarithmic scale) confirm the basic trends that we identified in the experiments.

Our experiments showed that the general approaches for subject-join operators result in the same differences as can be found for the μ operators. Again, the differences are reflected in the estimated costs. We omit these results and rather focus on the scalability of the variants in Fig. 20.

Fig. 18. Cost comparison of μ^RQ and μ^SEQ: messages.

Regarding the choice between the sequential and range-based parallel variants, we can draw conclusions similar to those for the extraction operator. The hash-based variant introduces two more alternatives, which we discuss in more detail below. For this, we take a closer look at the influence of the number of subject joins when using the subject index to process them. To illustrate the differences between the range-based (using the property-object index) and hash-based (using the subject index; all subject joins for one particular subject can be processed at the same node) approaches more clearly, we chose queries from class m that address larger ranges than before. A comparison to the sequential approach is omitted here. Fig. 21 contains the achieved results for the number of hops, with respect to the number of subject joins. The first observation is that the range-based approach performs significantly worse as the addressed ranges and the number of subject joins increase. This is because with this operator we can only process one join in each routing step: to reach the nodes responsible for the next property, we have to issue the next routing. This is different for the hash-based operators, where all properties available for one particular subject are indexed on the same node. Thus, the hop count (as well as the number of messages and the number of replies, which are omitted here) stays constant for the hash-based operators that utilize the subject index. As expected, the hop count for parallel operators is below that of sequential ones.


Fig. 21. Variants of star-shaped queries for different numbers of subject joins.

Fig. 19. Cost comparison of n_RQ and n_SEQ: hops.

Fig. 22. Cost comparison of similarity extractions.

Fig. 20. Cost comparison of different subject join variants (star-shaped query with 2 joins).

We additionally show an important characteristic of sequential operators: they can be used to reduce overhead by processing as much of an input set as possible at each peer. The impact of this is illustrated in the figure by a variant of s_HASH,SEQ that does not apply this optimization step.

When extractions (n) and subject joins (s) are combined, the choice of the join operator usually has the most influence on the resulting performance. This is accurately reflected by the introduced cost model. Further, when comparing different join variants, we observed the same effects of the choice between parallel and sequential approaches. Additionally, variants that use a designated synchronization node usually perform slightly better than local variants.

Of particular interest are those operators that advance beyond standard database functionality, such as the similarity operators. We first focus on the leaf level of query plans and analyze similarity extractions. Again, we show absolute estimations and the resulting relative errors for selected operator variants in Fig. 22(a). As expected, with increasing distance defined in the query, the q-gram operators become more expensive than the range-based operators. The exact cut-off point depends on the used data and the characteristics of the network. In the results presented here, a rather small range of peers is responsible for the triples that have to be checked, which makes the parallel range-based operator a meaningful alternative. A theoretical look at the scalability of q-gram operators is provided in Fig. 22(b).
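The q-gram machinery behind these operators can be made concrete in a few lines. The following sketch is our own minimal rendering, not the system's implementation: it shows the exact edit distance used in filters such as edist(?n, "stranger") < 6, and the count filter with which a q-gram index prunes candidates before that expensive check:

```python
from collections import Counter

def qgrams(s, q=3):
    """Overlapping q-grams of a string padded with q-1 sentinel characters."""
    s = "#" * (q - 1) + s + "#" * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edist(a, b):
    """Levenshtein edit distance via dynamic programming, O(|a|*|b|)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def candidates(index, query, k, q=3):
    """Count filter: a string within edit distance k of the query shares at
    least len(qgrams(query)) - k*q of its q-grams, since each edit destroys
    at most q grams. Only the survivors need the exact edist check.
    (Counting distinct grams here; if the threshold drops to 0 or below,
    the filter is void and a full scan would be needed instead.)"""
    grams = qgrams(query, q)
    threshold = len(grams) - k * q
    counts = Counter()
    for g in set(grams):
        counts.update(index.get(g, ()))   # index: q-gram -> values containing it
    return [v for v, c in counts.items() if c >= threshold]

# verify only the survivors, e.g. for a filter edist(?v, "stranger") < 6:
# hits = [v for v in candidates(idx, "stranger", k=5) if edist(v, "stranger") <= 5]
```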


Fig. 23. Similarity on predicate level.

Fig. 25. Received replies over time for aggregation queries in the local setup.

Next, we take a closer look at similarity queries on property level. As argued before, range-based operators, whether sequential or parallel, are not suited for this purpose, because given the proposed indexes they would have to query all triples in the system. Thus, similarity on property level is only feasible with q-gram operators. The differences in the number of messages, bandwidth consumption, and hop count are analogous to similarity operators on object level. This is an intuitive result, because there is no difference in the actual processing. More interesting are the answer times, which are shown in Fig. 23. The used queries are from class m; they contain a similarity extraction on schema level and two further subject joins. The query identified as r_QG,SEQ/s uses s_HASH,SEQ, and r_QG,SEQ/p uses s_RQ; similarly, r_QG,PAR/s uses s_HASH,SEQ, and r_QG,PAR/p uses s_RQ. The average answer times in Fig. 23 show the dominance of the subject-join operator. All in all, similarity on property level is a particularly expensive operation.

The most complex similarity operators that we discussed are similarity joins. We evaluate different physical operators that support this operation next. The queries used for this are from class c; they contain two extractions and a similarity join on object level, with no subsequent joins included. We included the following combinations in the tests:

• c1: uses the HASH,PAR similarity join with n_QG,SEQ and n_SEQ
• c2: uses the HASH,PAR similarity join with n_QG,SEQ and n_RQ
• c3: uses the HASH,PAR similarity join with n_QG,PAR and n_SEQ
• c4: uses the HASH,PAR similarity join with n_QG,PAR and n_RQ
• c5: uses the LOC join and n_RQ

In Fig. 24 we compare the resulting number of messages and hops for the different variants, with respect to the number of peers and the queried distance. The bandwidth consumption looks analogous to the number of messages. The figures reveal that the driving performance factor in these queries is the join operation; the used extraction operators make almost no difference. The effort for all q-gram similarity joins scales linearly with the number of participating nodes and the queried distance. The hop count of the range-based approach is very close to that of the parallel q-gram approaches, but it requires significantly fewer messages. The hop counts perfectly reflect the achieved answer times. Similar to q-gram operators for extraction, q-gram-based similarity joins are very valuable: they come with a noticeable overhead, but improve the robustness and performance of query processing significantly.
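The q-gram similarity joins compared here rest on the same machinery, lifted to two inputs: every value is published under its q-grams, so candidate pairs meet at the peers responsible for shared grams. The sketch below is again only our simplification (it reuses the qgrams and edist helpers from the previous sketch, and the postings dictionary stands in for the distributed q-gram index):

```python
from collections import defaultdict

def qgram_similarity_join(left, right, k, q=3):
    """Candidate pairs meet wherever they share a q-gram; a count filter
    and the exact edit-distance check then weed out false positives."""
    postings = defaultdict(lambda: (set(), set()))   # gram -> (left, right) values
    for v in left:
        for g in set(qgrams(v, q)):
            postings[g][0].add(v)
    for v in right:
        for g in set(qgrams(v, q)):
            postings[g][1].add(v)

    shared = defaultdict(int)                        # (a, b) -> #common distinct grams
    for ls, rs in postings.values():
        for a in ls:
            for b in rs:
                shared[(a, b)] += 1

    return [(a, b) for (a, b), c in shared.items()
            if c >= len(set(qgrams(a, q))) - k * q   # count filter (distinct grams)
            and edist(a, b) <= k]                    # exact verification
```

Pairs that share no q-gram never produce a candidate, so the candidate-generation load is spread over exactly the peers holding shared grams; this is consistent with the linear scaling in network size and queried distance observed in Fig. 24.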

Fig. 24. Evaluation of different similarity join operators.


Fig. 26. Q-gram queries with 384 PlanetLab nodes.

Effect of parallelism. Next we provide some insights into the effects of sequential and parallel operators by choosing a different metric for illustration. Fig. 25 shows the number of replies received over time. The total number of received replies reflects the degree of parallelism and thus the total number of generated messages. The time of receiving the last reply is the point where the plot shows the last increase, i.e., about 5 s for ac_PAR. The plots also indicate the minimal and average answer times. We present results for queries from classes c and ac run in the local setup on the DBpedia and Mondial data. All queries use n_RQ for extraction on leaf level. For the remaining subject joins, they apply s_HASH,SEQ (ac_SEQ), s_RQ (ac_RQ), and s_HASH,PAR (ac_PAR), respectively. By this, we varied the general degree of parallelism in the query plans. The figures show that sequential operators produce fewer reply messages, but these arrive later than with the parallel operators. The parallel hash-based subject join can result in a very high number of messages and replies.

Finally, we illustrate the number of replies over time for some q-gram queries from class c run on PlanetLab. Fig. 26 underlines the differences between the q-gram operators and shows that first results are received after ca. 5 s. After that, the number of replies increases constantly, where the slope depends on the type of execution plan. For the PlanetLab environment, these figures are quite satisfying. The used queries include the following combinations of operators:

• c1: the HASH,PAR similarity join with n_QG,SEQ and r_QG,SEQ
• c2: the HASH,PAR similarity join with n_QG,SEQ and r_QG,PAR
• c3: the HASH,PAR similarity join with n_QG,PAR and r_QG,SEQ
• c4: the HASH,PAR similarity join with n_QG,PAR and r_QG,PAR
• c5: the HASH,PAR similarity join with n_QG,SEQ
• c6: the HASH,PAR similarity join with n_QG,PAR

The last presented results confirm that the proposed concepts are also suited for query processing in extremely dynamic and unreliable environments, such as a full-scale PlanetLab setup.

G-Lab experiments. As mentioned before, the main purpose of the experiments on G-Lab is to assess the achievable query times, as well as to show the relation between hop count and query time, between message number and bandwidth consumption, and between the degree of parallelism and the number of replies vs. the time required for first/last replies. From all queries we generated and ran, we show selected results; the corresponding SPARQL queries are listed in the Appendix. For each query, we evaluated different physical query plans. Table 7 lists the abbreviations we use to refer to the different execution plans.

In Fig. 27, we compare path-shaped queries on the basis of range queries (property–object index) in post order (RQ) with range-query-based processing using inter-operator parallelism (RQ-par: all branches of the query tree are processed in parallel).

Table 7
Types of query execution plans used in G-Lab experiments.

sidx: Processes all subject joins in the query using the subject index with s_HASH,SEQ; all other operators are processed using the property–object index and range queries; the query plan is processed in post order.
sidx-par: Same as sidx, but the query plan is processed using inter-operator parallelism, i.e., parallel joins (except the subject joins).
RQ: Processes all routing operators using P-Grid's range-query processing, i.e., n_RQ extractions and local joins (LOC), including the subject joins; the query plan is processed in post order.
RQ-par: Same as RQ, but the query plan is processed using inter-operator parallelism, i.e., parallel joins.
seq: The query plan is processed in post order and all operators are executed strictly sequentially using the property–object index, e.g., n_SEQ.
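To make the distinction between post-order and inter-operator-parallel plans in Table 7 concrete, the following sketch (our own illustration; the class and function names do not come from the system) evaluates a small operator tree both ways:

```python
from concurrent.futures import ThreadPoolExecutor

class PlanNode:
    """A tiny operator-tree node; op stands in for shipping an operator
    into the network and collecting its replies."""
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def run_post_order(self):
        # sidx/RQ/seq-style plans: sub-trees strictly one after another
        return self.op(*[c.run_post_order() for c in self.children])

    def run_parallel(self, pool):
        # sidx-par/RQ-par-style plans: independent branches run concurrently
        # (note: a real engine would avoid nested blocking waits)
        futures = [pool.submit(c.run_parallel, pool) for c in self.children]
        return self.op(*[f.result() for f in futures])

# a join over two extractions, one branch per triple pattern:
plan = PlanNode(lambda l, r: ("join", l, r),
                PlanNode(lambda: "bindings of pattern 1"),
                PlanNode(lambda: "bindings of pattern 2"))
with ThreadPoolExecutor() as pool:
    assert plan.run_post_order() == plan.run_parallel(pool)  # same result
```

Both strategies compute the same result; with inter-operator parallelism, the independent branches feeding a join proceed concurrently, which is exactly what shortens the reply times of the -par plans.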

The figure further shows results for a selected star-shaped query. In this case, we additionally include subject joins using the subject index in the comparison, again distinguishing between post-order processing (sidx) and inter-operator parallelism (sidx-par). The figures show that all variants perform well for smaller numbers of index items. With increasing load on the single nodes, we can identify a clear increase in the reply times. But we can also see that those execution plans applying a high degree of parallelism, particularly RQ-par, still provide satisfying reply times. As expected, the times for first replies are relatively low in this case. Sequential processing is the slowest for all queries. Obviously, the effect of inter-operator parallelism becomes more evident with more complex queries. The fact that we also see a few decreases in performance with an increasing number of index items shows that the performance of query processing still depends on the general load in the network. This is also influenced by the load distribution over the nodes: due to the heavily skewed distribution of the data, some selected nodes were under much higher load than others. We will look into the applied hashing scheme to approach this issue. Still, the figures reveal the trends we expected.

The shown numbers of messages, hops, and bandwidth consumption also underline our assumptions. The more parallelism is applied, the higher the number of required messages and the lower the number of hops. With an increasing number of index items, query times increase more significantly than the numbers of messages and hops. This points to issues in the current implementation, where query times are significantly shaped by the local processing times on the single nodes. Further, we can see the expected relation between the number of messages and bandwidth consumption, as well as between the number of hops and query times. However, also here the effects of the local load are visible.

In Fig. 27(j)–(o), we show similar results for more complex queries that combine star-shaped and path-shaped parts (what we call entity queries). These figures support the same observations as before. They illustrate even more clearly that increasing query complexity results in higher reply times, particularly with an increasing number of index items, i.e., higher local load. Further, processing subject joins using the subject index is seldom beneficial compared to processing based on range queries using the property–object index. Entity query 1 also reveals what we observed before: inter-operator parallelism brings benefits particularly for more complex queries. The bad performance of the sequential approach for the smallest number of index items for entity query 1 came as a surprise; we assume this is due to a particular issue in the network during the time of our tests. Other values again show the assumed differences between the alternative variants. Other performance metrics, shown in Fig. 28, again highlight the assumptions we made before: parallelism comes with more messages, more bandwidth consumption, fewer hops, and lower reply times.

In general, the G-Lab experiments revealed what we expected. We can identify the assumed relations between the metrics, and query times are satisfying, particularly for lower load. However, they also show the need for looking into the current implementation in order to improve the local performance achieved at each node. Besides adding more resources, a very promising approach is to use full-fledged triple stores at each node.



Fig. 27. Evaluation of first queries on G-Lab: time of last reply (a)–(c); time of first reply (d)–(f); messages, hops, and bandwidth for a selected star-shaped query (g)–(i). Second set of queries on G-Lab: time of last reply (j)–(l); time of first reply (m)–(o).

We give a final indicator of this issue by inspecting the performance of a simple look-up query in Fig. 29. We show all the metrics from before: time of the reply (time, right axis), bandwidth consumption (bw, right axis), number of messages (msgs, left axis), and number of hops (hops, left axis). Again we can see that the reply time increases significantly with increasing load, although the other metrics stay almost constant. This can only be due to the significantly increased processing time at the node that receives this look-up query.


Fig. 28. Alternative metrics for second set of queries on G-Lab: messages ((a) and (d)), hops ((b) and (e)), and bandwidth ((c) and (f)) for selected entity queries.

Semantic expansion. Finally, we show some results from evaluating a first, simple semantic expansion. The idea is to expand query results by including mappings between properties, URIs, etc. (e.g., based on constructs like owl:sameAs). The information required for expansion is stored in the DHT and thus has to be queried as well; this mapping lookup, and the expanded query it yields, is processed in parallel to the original query. In these preliminary tests, we show that we support the direct integration of the respective techniques into the flow of query processing, without additional complex constructs. Due to the use of parallelism, the performance of an original, unexpanded query is not influenced, and the additional overhead for resolving correspondences is very low.

We ran the tests on a network of up to 70 instances in our local environment. We used the geographical data from the DBpedia and Mondial data sets, enriched by small, manually defined direct correspondences like owl:sameAs. Further, we use small ontology data to express more complex correspondences. From this set of about 30,000 triples, we inserted about 7000 randomly chosen triples into the system. On these triples, we built a property–object index and a subject index; thus, we handled about 14,000 index items in the system. In the following, we present a selected subset of all results gathered. In each test, we ran a set of chosen queries, each initiated five times by random nodes. We used two complex queries of class s that use such straightforward mapping data. The shape of the original queries is shown in the Appendix; results for other queries look similar.

The first results show the performance of queries on the largest network size of 70 nodes. In Fig. 30 we plot the number of replies received over time. We plot the number of replies rather than the result size, because the latter depends too much on the query and data selectivity. The result size increases nearly in parallel to the number of replies, but shows more leaps; the number of replies gives better insights into the actual performance. The time of receiving the last reply is the point where the plot shows the last increase, i.e., between 4 and 5 s for all queries in Fig. 30(b).
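The expansion step itself can be illustrated with a small sketch. The dictionary below stands in for the sameAs-style mapping triples that are in reality stored in, and fetched from, the DHT; all names are ours, and the real system operates on query plans rather than raw pattern lists:

```python
def expand(patterns, mappings):
    """Rewrite triple patterns with term correspondences (e.g., owl:sameAs)."""
    return [tuple(mappings.get(term, term) for term in p) for p in patterns]

original = [("?x", "<title>", "?t"), ("?x", "<cnt>", "?c")]
mappings = {"<title>": "<name>"}        # hypothetical correspondence
plans = [original, expand(original, mappings)]
# both plans are issued concurrently; replies are unioned as they arrive,
# so the expanded run does not delay the original query's results
```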

Fig. 29. Evaluation of simple look-up query in G-Lab.

First, the queries s1 and s2 were run separately. Afterwards, each query was expanded by one mapping (the map1 variants) and by two mappings (the map2 variants). Further, we used a complex ontology query for expanding the original queries, represented by the ont variants (also shown in the Appendix). Queries were run using only parallel operators. Both plots indicate that running an expanded query does not influence response time negatively; rather, in some cases the results of expanded queries arrive earlier. This is due to the strict use of parallelism: the mapping data is queried very quickly and in parallel to the processing of the original query, and right afterwards the processing of the expanded query can start. This may result in the observed effects. All in all, the plot for the number of replies shows a similar shape for all queries, regardless of how much parallelism is involved or how complex the query is (e.g., the ont variants). However, as our experiments on G-Lab show, this might look different when the overall load at the single nodes increases or when the degree of parallelism increases significantly. The preliminary results presented here are nevertheless promising.

The parallel processing approach, however, comes at the cost of messages. This is illustrated in Fig. 31(a) (note the logarithmic scale). The number of messages increases almost exponentially with the network size for some of the queries.


Fig. 30. Evaluation of query expansion (replies received over time): (a) s1 expanded on 70 nodes; (b) s2 expanded on 70 nodes.
For others, the overhead is constant. This mainly depends on the number of intra-operator parallel processing stages that a query plan contains. Bandwidth consumption rises at almost the same scale as the number of messages. The benefits we gain with respect to response times are reflected by the number of hops shown in Fig. 31(b). There may be a point where such an amount of generated messages results in performance drops. The more parallel operators a post-order plan contains, the higher the number of messages. This can be avoided by increasing inter-operator parallelism (processing more sub-trees in parallel) or by using alternative operators in specific situations (a decision that is up to a sophisticated optimizer). However, we observed that the actual size of the inserted data has low impact on the resulting performance.

These results only give a first insight into the actual performance of query processing. But they indicate that the proposed approach fits well into the implemented query processing strategies. With this foundation, querying a heterogeneous, semantically enriched data storage becomes as simple and efficient as querying the Web itself, but also as powerful as querying database systems, with low effort and no expert user knowledge or intervention needed. Expanding the support of semantic expansion is on the agenda for our future research.

6. Related work

The first works considering complex database functionality on top of P2P overlays are [45] and [46]. Both propose first visions of combining the two research fields: while [45] focuses on general opportunities and challenges and discusses the problem of data placement, [46] deals with problems of data integration.

A very closely related work, also motivated by database requirements for DHT overlays on the physical layer, is PIER [47]. PIER supports querying homogeneous relational data, i.e., there is a globally known data schema. Fragments of relations are distributed over the participating peers following the idea of horizontal data partitioning. PIER supports standard SQL queries, supported by primary and secondary indexes, and has been implemented on top of several DHT systems. Different routing strategies are not discussed in as much detail as in our work; in contrast, specific operators like joins are handled in more detail.

Fig. 31. Overhead of query expansion: (a) messages and (b) hops vs. the number of nodes.

Besides joins, PIER supports selections, projections, group-by, tee, union, and duplicate elimination. All queries can be run as snapshot queries or as continuous queries. The supported operators make excessive use of rehashing (parts of) relations, which we consider too expensive and infeasible for many applications. Further, the system is (up to this point) not able to provide any guarantees on running queries, such as estimates for query completeness. Similar to our work, the focus of PIER is on query processing; in contrast to our work, it is not designed for RDF and lacks functionality to handle heterogeneous data.

Similarly, PeerDB [48,49], which is based on the DHT BestPeer [50], supports only full-fledged database systems as sources. These sources are accompanied by centralized directory servers; thus, the system bears disadvantages concerning scalability in terms of data and participants. The utilized query agents are comparable to M²QPs. In order to support querying with limited knowledge about schemata, managed relations are automatically annotated with semantic metadata, motivated by IR approaches. But this integration into query processing follows a two-phase approach and is not as stateless and tightly integrated as proposed by us. Further, query processing is based on a flooding approach similar to the Gnutella [51] way of processing queries.

Similar to our work, Ref. [52] recently proposed database-oriented query optimization strategies for RDF query processing on top of DHTs, in the context of the Atlas system [53]. While one of our main goals is the optimization of the number of messages required for processing a query, in that work the authors explicitly focus on reducing the required bandwidth. They advance in that direction by discussing several effective heuristics for achieving query optimization based on a distributed dictionary and on statistics that are gathered from the underlying DHT on demand. The implemented system is designed for large distributed, but rather robust systems. Query processing is based on so-called query chains, which mainly resemble a sequential processing strategy. In [54], the authors also propose a variant that exploits the values of matching triples found while processing the query incrementally, by constructing multiple chains for each query. This introduces some limited degree of parallelism into query processing.


In Atlas, nodes are expected to be reliable. In contrast, UniStore focuses on less reliable systems and proposes to make extensive use of parallel processing strategies. Ref. [52] discusses extensions that are very interesting and important for our system, while, on the other hand, the Atlas system can benefit from the different processing strategies we discuss. An evaluation in a large local cluster of powerful machines shows that the idea of efficient RDF query processing in DHT systems can scale to millions of triples. One of our next steps is to investigate possibilities of integrating both approaches appropriately.

RDFPeers [55] is a distributed infrastructure for managing large-scale sets of RDF data. It is based on the multi-attribute addressable network MAAN [56]. Similar to the approach proposed here, each part of a triple is indexed, but whole triples are stored each time. Numerical data is hashed using a locality-preserving hash function, and load balancing is discussed as well. Queries formulated in formal query languages, such as RDQL [57], can be mapped to the supported native queries. RDFPeers supports only exact-match queries, disjunctive queries for sets of values, range queries on numerical data, and conjunctive queries for a common triple subject. Sophisticated database-like query processing capabilities and heterogeneities are not considered. In contrast to our approach, query resolution is done locally and iteratively.

Most RDF query engines follow such a centralized approach based on data shipping. In contrast, DARQ [58] is a federated query engine for SPARQL queries. Queries are decomposed into subqueries and shipped to the corresponding RDF repositories. DARQ uses query rewriting and cost-based optimizations. It strongly relies on standards and Web service descriptions and can use any endpoint that conforms to the SPARQL protocol. Further, it does not require an integrated schema. DARQ supports standard query operators and two join variants on bound predicates. Query planning is static and centralized, utilizing the service descriptions that replace any indexes. Subqueries are distributed using a broadcasting mechanism, which makes DARQ infeasible for the scenarios that we propose. There are several other approaches that also follow this federated querying paradigm, even for P2P systems. However, because these approaches aim more at interoperability than scalability and, therefore, use different techniques, we omit their discussion here.

In [59] the whole RDF model graph is mapped to nodes of a DHT, and DHT lookups are used to locate triples. This implements a rather simplistic query processing based on query graphs that can be mapped to the model graph: query processing basically amounts to matching the query graph against the model graph. RDF-Schema data is also indexed in a distributed manner and used by applying RDF-Schema entailment rules. On the downside, query processing has two subsequent phases, and sophisticated query constructs that would increase the expressiveness of queries are not supported. The same group further discussed top-k query processing [60] and highlighted the importance of parallel query processing in DHTs [61] as well as the need for advanced query planning techniques [62]. There exist several other proposals for large-scale distributed RDF repositories [63] and for managing and querying RDF metadata [64]. All these systems index different parts of triples in parallel, but none of these works discusses different indexing schemes as comprehensively as we do.
Generally speaking, the idea of federated query processing for RDF, particularly for Linked Data, gains more and more attention. Ref. [65] presents a survey of the state of the art in that area. However, while P2P approaches like DHT technologies are mentioned, they are not discussed in detail. Thus, our work and the works mentioned above represent a very important development in that direction.

DHTs on the physical layer can be found in a couple of commercial and open-source cloud computing projects. To the best of our knowledge, Ref. [66] is the first work discussing data management overlays on top of storage services like S3.

This combines database functionality with utility computing [67] in a way similar to how we propose combining it with DHT overlays. The authors focus on traditional database aspects, such as the trade-off between consistency and availability, particularly in the presence of small objects and frequent updates. Query processing is part of their considerations, but not discussed in as much depth as in our work. In contrast to this and other NoSQL systems, a stream of work that came up together with the "one size does not fit all" discussion of relational DBMS technology initiated by [68], we believe in the importance of query processing and planning techniques known from the world of relational databases to leverage large-scale Semantic Web applications.

XML P2P data management is closely related to our work, due to the semi-structured and graph-based nature of both XML and RDF. However, there are still crucial practical differences between the two, owing to the specific tree-based character of XML. A discussion of the resulting issues, including indexing, clustering, replication, and query processing, can be found in [69].

The integration and resolution of semantic correspondences is thoroughly discussed in works like [70,71]. GridVine [72,70] is a peer data management infrastructure addressing both scalability and semantic heterogeneity. Scalability is addressed by peers organized in a structured overlay network forming the physical layer, in which data, schemata, and schema mappings are stored. Semantic interoperability is achieved through a purely decentralized and self-organizing process of pair-wise schema mappings and query reformulation. This forms a semantic mediation layer on top of, and independent of, the physical layer. GridVine offers a recursive and an iterative gossiping (i.e., query reformulation) approach. The semantic gossiping approach was first formalized in the context of the chatty Web in [71]. GridVine shares several aims and features with our approach; e.g., both use P-Grid as the underlying overlay network and are able to handle data and schema heterogeneity to a similar extent. GridVine supports triple pattern queries with conjunction and disjunction, implemented by distributed joins across the network, but it does not apply the idea of cost-based, database-like query processing over multiple indexes. Additionally, we support similarity-enriched SPARQL-like queries with in-network query execution.

The integration of semantic correspondences into query processing as proposed by Kokkinidis et al. [73] is similar to our approach. RVL views [74] are used to describe participating sources; these views are indexed in a DHT. The method is not as flexible as the M²QP approach, where different synchronization peers can be used and the resolution of mappings is completely stateless. Further, the approach assumes globally agreed schemata, which are managed in corresponding semantic overlay networks, one for each agreed schema.

There are several centralized approaches enabling database functionality for large-scale RDF data. A recent system using B+-trees to index RDF data is RDF-3X [75,76]. To answer queries with variables in any position of an RDF triple, RDF-3X holds indexes for querying all possible combinations of subject, predicate, and object, an idea introduced in [77]. RDF-3X uses sophisticated join optimization techniques based on statistics derived from the data. Other systems, such as [78], implement live (i.e., distributed) querying, but still rely on centralized indexes.
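For comparison with our distributed indexing schemes, the exhaustive-permutation idea behind such centralized stores can be sketched as follows; the dictionary of sorted lists is only our stand-in for the compressed B+-trees used by RDF-3X:

```python
from itertools import permutations

def build_permutation_indexes(triples):
    """Index a triple set under all six (S, P, O) orderings so that any
    triple pattern, wherever its variables sit, maps to a prefix/range
    scan in at least one index."""
    indexes = {}
    for order in permutations(range(3)):              # SPO, SOP, PSO, POS, OSP, OPS
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in triples)
    return indexes

idx = build_permutation_indexes([("s1", "p1", "o1"), ("s1", "p2", "o2")])
# a pattern (?s, "p1", ?o) becomes a prefix scan over idx["PSO"] (or idx["POS"])
```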
While the indexing approaches in many of these systems are comparable to our standard indexes, the main distinguishing feature of our work is the discussion of totally distributed and decentralized approaches.

7. Conclusion

In this paper, we have presented a system for scalable, distributed query processing over distributed RDF data sets. Our approach exploits a DHT-based distributed index both for storing data as well as for routing queries.


Query processing is supported by optimized, cost-based, distributed query execution strategies and fully utilizes the inherent parallelism of distributed systems. To address heterogeneity problems, the approach includes similarity-based and ranking-based query extensions to SPARQL, and we discuss their implementation in our UniStore system. We provided an extensive experimental evaluation using a real deployment on a local cluster as well as on PlanetLab and G-Lab, which demonstrates the efficiency and scalability of our solution.

Our distributed RDF store with its efficient query processing engine and extended SPARQL syntax is an important building block for the support of ontologies, rules, and reasoning approaches in distributed settings. As UniStore supports RDF, in principle, OWL can be used on top of it in its RDF representation. While at the logical level the conceptual problems of supporting OWL, reasoning, and rules would be the same as for centralized solutions, at the storage level a significant amount of research may be necessary to make a distributed solution efficient. Despite these problems, UniStore provides a substrate to implement possible approaches and obtain experimental results to assess their practical applicability and efficiency in this interesting and relevant area of research. Also, the availability of a distributed RDF store with efficient query processing will help to identify fragments of OWL and rule languages that are meaningful in practical applications and can be supported efficiently in a distributed setting. In this respect, UniStore may turn out to be a relevant stepping stone for research into these areas.

Acknowledgements

Part of the work was done while the first author was at Ilmenau University of Technology. This work has further received support from the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

Appendix A. Example queries used in the evaluation

In the following, we list an example query for each class used in the evaluation run on PlanetLab. The actual queries vary in the used values for property names and object values, optional filter statements, the applied level(s) of querying (property, object, and subject level), and the exact number of operators (i.e., triple patterns). We highlight the specific differences accordingly in the text.

• class e:
select ?n where {
  ?x <title> ?n .
  filter (edist(?n, "stranger") < 6)
}

• class m:
select ?v ?c where {
  ?x <title> ?v ; <cnt> ?c .
  filter (?c >= 30) .
  filter (edist(?v, "slice_the") < 4)
}

• class a:
select ?v ?r where {
  ?x <created> ?v ; <rev> ?r
} order by desc(?v) limit 20

• class ac:
select ?d ?l where {
  ?y <name> ?k ; <rev> ?r .
  ?x <title> ?k ; <length> ?l ; <created> ?d
} order by NN(?k, "Haven_The") limit dist 6

• class c:
select ?xt ?xl ?yn where {
  ?x <title> ?xt ; <length> ?xl .
  ?y <name> ?yn .
  filter (edist(?xt, ?yn) < 3)
}

The following queries are the exact queries used in the experiments on G-Lab (as generated, before query rewriting during logical optimization).

• lookup:
select ?o0 where {
  <http://aims.fao.org/aos/geopolitical.owl> <http://www.w3.org/2000/01/rdf-schema#comment> ?o0
}

• Path-shaped query, 1 join:
select ?o0 where {
  <http://www.georgikobilarov.com/foaf.rdf> <http://xmlns.com/foaf/0.1/primaryTopic> ?join_os_0_0 .
  ?join_os_0_0 <http://xmlns.com/foaf/0.1/name> ?o0
}

• Path-shaped query, 3 joins:
select ?s0 where {
  ?join_so_0_0 <http://xmlns.com/foaf/0.1/homepage> <http://www.snee.com/bobdc.blog/bobdcblog.rdf> .
  ?join_so_0_1 <http://purl.org/dc/elements/1.1/contributor> ?join_so_0_0 .
  ?join_so_0_2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#_6> ?join_so_0_1 .
  ?s0 <http://purl.org/rss/1.0/items> ?join_so_0_2
}



• Entity query 1:
select ?s0 ?s1 where {
  ?join_so_0_0 <http://xmlns.com/foaf/0.1/maker> <http://sebastian.dietzold.de/terms/me> .
  ?s0 <http://xmlns.com/foaf/0.1/pastProject> ?join_so_0_0 .
  ?join_so_1_0 <http://xmlns.com/foaf/0.1/member> <http://sebastian.dietzold.de/terms/me> .
  ?s1 <http://xmlns.com/foaf/0.1/maker> ?join_so_1_0
}
Fig. A.32. Shape of ontology query used for evaluating query expansion.

Fig. A.33. Shape of queries used for evaluating query expansion.


• Entity query 2:
select ?o0 ?o1 where {
  <http://www.ivan-herman.net/foaf> <http://xmlns.com/wot/0.1/assurance> ?join_os_0_0 .
  ?join_os_0_0 <http://xmlns.com/wot/0.1/endorser> ?join_os_0_1 .
  ?join_os_0_1 <http://xmlns.com/wot/0.1/pubKeyAddress> ?o0 .
  <http://www.ivan-herman.net/foaf> <http://xmlns.com/foaf/0.1/primaryTopic> ?join_os_1_0 .
  ?join_os_1_0 <http://xmlns.com/foaf/0.1/knows> ?join_os_1_1 .
  ?join_os_1_1 <http://xmlns.com/foaf/0.1/name> ?o1
}

• Entity query 3:
select ?o0 ?o1 ?o2 ?o3 where {
  <http://sebastian.dietzold.de/terms/me> <http://purl.org/vocab/relationship/closeFriendOf> ?join_os_0_0 .
  ?join_os_0_0 <http://xmlns.com/foaf/0.1/nick> ?o0 .
  <http://sebastian.dietzold.de/terms/me> <http://purl.org/vocab/relationship/worksWith> ?join_os_1_0 .
  ?join_os_1_0 <http://xmlns.com/foaf/0.1/name> ?o1 .
  <http://sebastian.dietzold.de/terms/me> <http://xmlns.com/foaf/0.1/pastProject> ?join_os_2_0 .
  ?join_os_2_0 <http://xmlns.com/foaf/0.1/maker> ?o2 .
  <http://sebastian.dietzold.de/terms/me> <http://xmlns.com/foaf/0.1/currentProject> ?join_os_3_0 .
  ?join_os_3_0 <http://xmlns.com/foaf/0.1/maker> ?o3
}

• Star-shaped query:
select ?o0 ?o1 ?s0 ?s1 where {
  <http://sebastian.dietzold.de/terms/me> <http://purl.org/vocab/relationship/closeFriendOf> ?join_os_0_0 .
  ?join_os_0_0 <http://xmlns.com/foaf/0.1/nick> ?o0 .
  <http://sebastian.dietzold.de/terms/me> <http://purl.org/vocab/relationship/worksWith> ?join_os_1_0 .
  ?join_os_1_0 <http://xmlns.com/foaf/0.1/name> ?o1 .
  ?join_so_0_0 <http://xmlns.com/foaf/0.1/maker> <http://sebastian.dietzold.de/terms/me> .
  ?s0 <http://xmlns.com/foaf/0.1/pastProject> ?join_so_0_0 .
  ?join_so_1_0 <http://xmlns.com/foaf/0.1/member> <http://sebastian.dietzold.de/terms/me> .
  ?s1 <http://xmlns.com/foaf/0.1/maker> ?join_so_1_0
}

Finally, we show the query plans for the queries used in the evaluation of the semantic expansion (Figs. A.32 and A.33).

References
[1] C. Bizer, A. Jentzsch, R. Cyganiak, State of the LOD cloud, 2011. Available from: <http://www4.wiwiss.fu-berlin.de/lodcloud/state/>.
[2] J. Gray, A. Szalay, A. Thakar, P.Z. Zunszt, T. Malik, J. Raddick, C. Stoughton, J. van den Berg, The SDSS SkyServer: public access to the Sloan Digital Sky Server Data, Technical Report MSR-TR-2001-104, Microsoft Research, 2001.
[3] K. Aberer, A. Datta, M. Hauswirth, R. Schmidt, Indexing data-oriented overlay networks, in: VLDB'05, 2005, pp. 685–696.
[4] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon's highly available key-value store, in: Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), 2007, pp. 205–220.
[5] Basho, Basho: Riak, an open source scalable data store, 2011. Available from: <www.basho.com/Riak>.
[6] J. Madhavan, S.R. Jeffery, S. Cohen, X.L. Dong, D. Ko, C. Yu, A. Halevy, Web-scale data integration: you can only afford to pay as you go, in: CIDR'07, 2007, pp. 342–350.
[7] A. Singhal, Modern information retrieval: a brief overview, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4) (2001) 35–42.
[8] M. Karnstedt, Query Processing in a DHT-based Universal Storage, Ph.D. Thesis, Ilmenau University of Technology, AVM-Verlag, 2009.
[9] M. Karnstedt, K. Sattler, M. Ha, M. Hauswirth, B. Sapkota, R. Schmidt, Estimating the number of answers with guarantees for structured queries in P2P databases (poster), in: CIKM'08, 2008, pp. 1407–1408.
[10] M. Karnstedt, K. Sattler, M. Ha, M. Hauswirth, B. Sapkota, R. Schmidt, Approximating query completeness by predicting the number of answers in DHT-based web applications, in: International Workshop on Web Information and Data Management (WIDM'08), 2008, pp. 71–78.
[11] K. Aberer, M. Hauswirth, Practical Handbook of Internet Computing, CRC Press, 2004, Ch. Peer-to-Peer Systems.
[12] K. Aberer, L.O. Alima, A. Ghodsi, S. Girdzijauskas, S. Haridi, M. Hauswirth, The essence of P2P: a reference architecture for overlay networks, in: Fifth IEEE International Conference on Peer-to-Peer Computing, Use of Computers at the Edge of Networks (P2P, Grid, Clusters), 2005.
[13] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer, Range queries in trie-structured overlays, in: P2P'05, 2005, pp. 57–66.
[14] J. Aspnes, J. Kirsch, A. Krishnamurthy, Load balancing and locality in range-queriable data structures, in: PODC'04, 2004, pp. 115–124.
[15] A. Bharambe, M. Agrawal, S. Seshan, Mercury: supporting scalable multi-attribute range queries, SIGCOMM Computer Communication Review 34 (4) (2004) 353–366.
[16] O.D. Sahin, A. Gupta, D. Agrawal, A.E. Abbadi, A peer-to-peer framework for caching range queries, in: ICDE'04, 2004, p. 165.
[17] A. Gupta, D. Agrawal, A.E. Abbadi, Approximate range selection queries in peer-to-peer systems, in: CIDR'03, 2003, pp. 141–151.
[18] C.Y. Liau, W.S. Ng, Y. Shu, K.-L. Tan, S. Bressan, Efficient range queries and fast lookup services for scalable P2P networks, in: International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P'04), 2004, pp. 93–106.
[19] K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, R. Schmidt, P-Grid: a self-organizing structured P2P system, SIGMOD Record 32 (3) (2003) 29–33.
[20] H.V. Jagadish, B.C. Ooi, Q.H. Vu, BATON: a balanced tree structure for peer-to-peer networks, in: K. Böhm, C.S. Jensen, L.M. Haas, M.L. Kersten, P.-Å. Larson, B.C. Ooi (Eds.), VLDB, ACM, 2005, pp. 661–672.
[21] H.V. Jagadish, B.C. Ooi, Q.H. Vu, R. Zhang, A. Zhou, VBI-tree: a peer-to-peer framework for supporting multi-dimensional indexing schemes, in: L. Liu, A. Reuter, K.-Y. Whang, J. Zhang (Eds.), ICDE, IEEE Computer Society, 2006, p. 34.
[22] S. Blanas, V. Samoladas, Contention-based performance evaluation of multidimensional range search in peer-to-peer networks, Future Generation Computer Systems 25 (1) (2009) 100–108.
[23] RDF at W3C homepage. <http://www.w3.org/RDF> (last visited 2009/02/21).
[24] V. Levenshtein, Binary codes of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10 (8) (1966) 707–710.
[25] G. Navarro, R.A. Baeza-Yates, A practical q-gram index for text retrieval allowing errors, CLEI Electronic Journal 1 (2) (1998).
[26] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, Approximate string joins in a database (almost) for free, in: VLDB'01, 2001, pp. 491–500.
[27] E. Schallehn, I. Geist, K. Sattler, Supporting similarity operations based on approximate string matching on the web, in: CoopIS'04, 2004, pp. 227–244.
[28] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys 33 (1) (2001) 31–88.
[29] K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, R. Schmidt, J. Wu, Advanced peer-to-peer networking: the P-Grid system and its applications, PIKJ 26 (2003) 86–89. Special issue on P2P systems. Available from: <http://citeseer.ist.psu.edu/aberer03advanced.html>.
[30] M. Karnstedt, K. Sattler, M. Richtarsky, J. Müller, M. Hauswirth, R. Schmidt, R. John, UniStore: querying a DHT-based universal storage, in: ICDE'07 Demonstrations Program, 2007, pp. 1503–1504.
[31] E. Prud'hommeaux, A. Seaborne, SPARQL Query Language for RDF, W3C Candidate Recommendation, 2006. Available from: <http://www.w3.org/TR/rdf-sparql-query/>.
[32] J. Pérez, M. Arenas, C. Gutierrez, Semantics and complexity of SPARQL, ACM Transactions on Database Systems 34 (2009) 16:1–16:45, doi:10.1145/1567274.1567278.
[33] B. Glimm, M. Krötzsch, SPARQL beyond subgraph matching, in: ISWC'10, 2010, pp. 241–256.
[34] V. Papadimos, D. Maier, Mutant query plans, Information and Software Technology 44 (4) (2002) 197–206.
[35] S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: ICDE'01, 2001, pp. 421–432.
[36] M. Karnstedt, J. Müller, K. Sattler, Cost-aware skyline queries in structured overlays, in: ICDE Workshop on Ranking in Databases (DBRank'07), 2007, pp. 285–288.
[37] J. Müller, Berechnung von Skylines in strukturierten Overlaynetzwerken, Diploma Thesis, TU Ilmenau (available in German only), 2007.
[38] S. Wang, Q.H. Vu, B.C. Ooi, A.K.H. Tung, L. Xu, Skyframe: a framework for skyline query processing in peer-to-peer systems, The VLDB Journal 18 (1) (2009) 345–362.
[39] C.T. Yu, W. Meng, Principles of Database Query Processing for Advanced Applications, Morgan Kaufmann Publishers Inc., 1998.
[40] S. Schwalm, Anfragesystem für vertikal organisierte Universalrelationen in P2P-Systemen, Diploma Thesis, TU Ilmenau (available in German only), 2006.
[41] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, M. Bowman, PlanetLab: an overlay testbed for broad-coverage services, SIGCOMM Computer Communication Review 33 (3) (2003) 3–12.
[42] R. Huebsch, B. Chun, J.M. Hellerstein, PIER on PlanetLab: initial experience and open problems, Technical Report, University of California, Berkeley, 2003.
[43] D. Schwerdel, D. Günther, R. Henjes, B. Reuther, P. Müller, German-Lab experimental facility, in: Future Internet (FIS'10), 2010, pp. 1–10.
[44] M. Arias, J.D. Fernández, M.A. Martínez-Prieto, P. de la Fuente, An empirical study of real-world SPARQL queries, CoRR abs/1103.5043.
[45] S.D. Gribble, A.Y. Halevy, Z.G. Ives, M. Rodrig, D. Suciu, What can databases do for peer-to-peer?, in: WebDB'01, 2001, pp. 31–36.
[46] P.A. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, I. Zaihrayeu, Data management for peer-to-peer computing: a vision, in: Fifth International Workshop on the Web and Databases (WebDB'02), 2002, pp. 89–94.
[47] R. Huebsch, J.M. Hellerstein, N. Lanham, B. Thau Loo, S. Shenker, I. Stoica, Querying the internet with PIER, in: VLDB'03, 2003, pp. 321–332.
[48] B.C. Ooi, K.-L. Tan, A. Zhou, C.H. Goh, Y. Li, C.Y. Liau, B. Ling, W.S. Ng, Y. Shu, X. Wang, M. Zhang, PeerDB: peering into personal databases, in: SIGMOD'03, 2003, p. 659.
[49] W.S. Ng, B.C. Ooi, K.-L. Tan, PeerDB: a P2P-based system for distributed data sharing, in: ICDE'03, 2003, pp. 633–644.
[50] W.S. Ng, B.C. Ooi, K.-L. Tan, BestPeer: a self-configurable peer-to-peer system, in: ICDE'02, 2002, p. 272.
[51] Gnutella homepage. <http://rfc-gnutella.sourceforge.net> (last visited 2008/11/6).
[52] Z. Kaoudi, K. Kyzirakos, M. Koubarakis, SPARQL query optimization on top of DHTs, in: ISWC'10, 2010, pp. 418–435.
[53] Z. Kaoudi, M. Koubarakis, K. Kyzirakos, I. Miliaraki, M. Magiridou, A. Papadakis-Pesaresi, Atlas: storing, updating and querying RDF(S) data on top of DHTs, Web Semantics: Science, Services and Agents on the World Wide Web 8 (4) (2010) 271–277, doi:10.1016/j.websem.2010.07.001.
[54] E. Liarou, S. Idreos, M. Koubarakis, Evaluating conjunctive triple pattern queries over large structured overlay networks, in: ISWC'06, LNCS 4273, 2006, pp. 399–413.
[55] M. Cai, M.R. Frank, B. Yan, R.M. MacGregor, A subscribable peer-to-peer RDF repository for distributed metadata management, Journal of Web Semantics 2 (2) (2004) 109–130.
[56] M. Cai, M. Frank, J. Chen, P. Szekely, MAAN: a multi-attribute addressable network for grid information services, in: International Workshop on Grid Computing (GRID'03), 2003, p. 184.
[57] L. Miller, A. Seaborne, A. Reggiori, Three implementations of SquishQL, a simple RDF query language, in: ISWC'02, 2002, pp. 423–435.
[58] B. Quilitz, U. Leser, Querying distributed RDF data sources with SPARQL, in: ESWC'08, 2008, pp. 524–538.
[59] F. Heine, Scalable P2P based RDF querying, in: International Conference on Scalable Information Systems (InfoScale'06), 2006, pp. 17–22.
[60] D. Battré, F. Heine, O. Kao, Top-k RDF query evaluation in structured P2P networks, in: W. Nagel, W. Walter, W. Lehner (Eds.), Euro-Par 2006 Parallel Processing, Lecture Notes in Computer Science, vol. 4128, Springer, Berlin/Heidelberg, 2006, pp. 995–1004, doi:10.1007/11823285_105.
[61] B. Lohrmann, D. Battré, O. Kao, Towards parallel processing of RDF queries in DHTs, in: Data Management in Grid and Peer-to-Peer Systems (Globe'09), 2009, pp. 36–47, doi:10.1007/978-3-642-03715-3_4.
[62] D. Battré, Query planning in DHT based RDF stores, in: Signal Image Technology and Internet Based Systems (SITIS'08), 2008, pp. 187–194, doi:10.1109/SITIS.2008.15.
[63] H. Stuckenschmidt, R. Vdovjak, G.-J. Houben, J. Broekstra, Index structures and algorithms for querying distributed RDF repositories, in: WWW'04, 2004, pp. 631–639.
[64] W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, T. Risch, Edutella: a P2P networking infrastructure based on RDF, in: WWW'02, 2002.
[65] O. Görlitz, S. Staab, Federated data management and query optimization for linked open data, in: A. Vakali, L. Jain (Eds.), Web Data Management Trails, Springer, 2011. Preprint available from: <http://www.uni-koblenz.de/staab/Research/Publications/2010/LOD-Federation.pdf>.
[66] M. Brantner, D. Florescu, D. Graf, D. Kossmann, T. Kraska, Building a database on S3, in: SIGMOD'08, 2008, pp. 251–264.
[67] J.W. Ross, G. Westerman, Preparing for utility computing: the role of IT architecture and relationship management, IBM Systems Journal 43 (1) (2004) 5–19.
[68] M. Stonebraker, U. Çetintemel, "One size fits all": an idea whose time has come and gone, in: ICDE'05, 2005, pp. 2–11.
[69] G. Koloniari, E. Pitoura, Peer-to-peer management of XML data: issues and research challenges, SIGMOD Record 34 (2) (2005) 6–17.
[70] P. Cudré-Mauroux, S. Agarwal, K. Aberer, GridVine: an infrastructure for peer information management, IEEE Internet Computing 11 (5) (2007) 864–875, doi:10.1109/MIC.2007.108.
[71] K. Aberer, P. Cudré-Mauroux, M. Hauswirth, Start making sense: the chatty web approach for global semantic agreements, Journal of Web Semantics 1 (1) (2003) 89–114.
[72] K. Aberer, P. Cudré-Mauroux, M. Hauswirth, T.V. Pelt, GridVine: building internet-scale semantic overlay networks, in: ISWC'04, 2004, pp. 107–121.
[73] G. Kokkinidis, E. Sidirourgos, V. Christophides, Query processing in RDF/S-based P2P database systems, in: S. Staab, H. Stuckenschmidt (Eds.), Semantic Web and Peer-to-Peer, Springer, Berlin/Heidelberg, 2006, pp. 59–81 (Chapter 4).
[74] A. Magkanaraki, V. Tannen, V. Christophides, Viewing the semantic web through RVL lenses, in: ISWC'03, 2003, pp. 98–112.
[75] T. Neumann, G. Weikum, RDF-3X: a RISC-style engine for RDF, VLDB Endowment 1 (1) (2008) 647–659.
[76] T. Neumann, G. Weikum, Scalable join processing on very large RDF graphs, in: SIGMOD'09, 2009, pp. 627–640, doi:10.1145/1559845.1559911.
[77] A. Harth, S. Decker, Optimized index structures for querying RDF from the web, in: Third Latin American Web Congress, 2005, pp. 71–80.
[78] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, J. Umbrich, Data summaries for on-demand queries over linked data, in: WWW'10, 2010.
