Nelson Kotowski1, Alexandre A. B. Lima3, Esther Pacitti2, Patrick Valduriez2, Marta Mattoso1
1 COPPE/UFRJ, Rio de Janeiro, Brazil
2 Atlas Group, INRIA and LINA, University of Nantes, France
3 UNIGRANRIO, Rio de Janeiro, Brazil
email@example.com, Esther.Pacitti@univ-nantes.fr, Patrick.Valduriez@inria.fr, firstname.lastname@example.org, email@example.com
Abstract. OLAP query processing is critical for enterprise grids. Capitalizing on our experience with the ParGRES database cluster, we propose a middleware solution, GParGRES, which exploits database replication and inter- and intra-query parallelism to efficiently support OLAP queries in a grid. GParGRES has been partially implemented as database grid services on Grid5000. We give preliminary experimental results obtained with two clusters of Grid5000 using queries of the TPC-H Benchmark. The results show linear or almost linear speedup in query execution, as more nodes are added in all tested configurations.
Initially developed for the scientific community, Grid computing is now gaining much interest in other areas such as enterprise information systems, thus making Grid data management critical. For instance, IBM, Oracle and Microsoft are all promoting tools and services for enterprise grids. Data management in grids has been initially achieved using distributed file systems. However, more general database solutions are needed to enable the virtualization of distributed, autonomous databases using Web services and provide transparent support for database queries. Ideally, a grid database solution must respect database autonomy (i.e. avoid database or application migration) while taking advantage of distributed and parallel computing. This can be achieved through the development of a middleware layer between the user applications and the databases. Such a middleware should provide for distributed and parallel query processing with non-intrusive techniques, considering DBMS as black-box components so there is no need for database or application migration.
Work partially funded by CAPES-COFECUB (DAAD project), CNPq-INRIA (GriData project), French ANR Massive Data (Respire project) and the European Strep Grid4All project.
An important kind of database query is OLAP, which tends to access massive amounts of data and is thus time-consuming. The typical solution to efficient OLAP query processing is to exploit inter- and intra-query parallelism using a parallel database system on a multiprocessor system or a cluster of PCs (e.g. Oracle's Real Application Cluster). However, this solution requires heavy migration of the existing databases and applications to the parallel database system. A cost-effective alternative which respects database autonomy is database clusters. By replicating (parts of) the database on cluster (PC) nodes, each node running a black-box DBMS, a database cluster can provide inter- and intra-query processing in a non-intrusive way through a middleware layer. For instance, our database cluster middleware ParGRES provides transparent inter- and intra-query processing to applications accessing any DBMS that supports SQL-99 and has a JDBC driver for client connections.

In this paper, we propose a middleware solution to OLAP query processing in grids, called GParGRES, which capitalizes on ParGRES to provide transparent inter- and intra-query processing. We consider a typical grid environment with multiple clusters at different sites. Thus, database replication and parallel query processing must be addressed at two levels: grid level and cluster level. Compared to the database cluster approach where the database is replicated at a single site, GParGRES enables the database to be replicated at multiple sites of the grid, thus increasing data availability and quality of service. For instance, if one grid site is unavailable (e.g. for maintenance), it is still possible to run OLAP queries using other sites. GParGRES has been partially implemented as grid services on Grid5000, a large and flexible configurable grid platform in France. We give preliminary experimental results obtained with two clusters of Grid5000 using queries of the TPC-H Benchmark. The results show linear or almost linear speedup in query execution, as more nodes are added in all tested configurations.

The paper is organized as follows. Section 2 introduces ParGRES. Section 3 presents GParGRES. Section 4 gives experimental results. Section 5 presents related work. Section 6 concludes.

2. ParGRES

ParGRES is a database cluster middleware which exploits inter- and intra-query parallelism during query processing. Similar to other database clusters, ParGRES manages the parallel execution of queries using DBMS instances at cluster nodes. The only requirement for the DBMS is that it supports SQL-99. Parallelism is obtained through full database replication and Adaptive Virtual Partitioning (AVP). AVP provides for dynamic load balancing among cluster nodes during query processing in a non-intrusive way. It provides flexibility with respect to node allocation for query processing: any query can be processed by any set of cluster nodes. ParGRES eases database migration from centralized environments since no new physical database design is required. However, database clusters like C-JDBC and
PowerDB use a centralized layer executed at a single node which acts as coordinator. To avoid such centralized bottleneck, ParGRES performs decentralized control and its components are distributed among cluster nodes (see Figure 1). There are global and local components. Global components (i.e. Mediator and Cluster Query Processor (CQP)) execute tasks that involve several cluster nodes. Local components (i.e. Node Query Processor (NQP) and DBMS) execute tasks in one node.

The Mediator is responsible for receiving requests from the applications, passing them to the CQP, and passing back CQP responses to the applications. Since most clusters have a single node accessible to external applications (the entry node), the Mediator component is typically allocated at this node. Thus, only the Mediator component is centralized which gives it full flexibility in the physical allocation of CQP for each request, thus improving the overall environment availability. CQP coordinates all other components in the context of a query. CQP is responsible for choosing the type of parallelism used during query processing and allocating nodes that will be used for it. NQP locally coordinates query execution at the DBMS and helps CQP during load balancing.

ParGRES executes four types of tasks:

(i) SQL query parsing. CQP contains a syntactic analyzer to parse SQL commands from the client application. It uses a context-free grammar for SQL-99. Commands not parsed by this grammar are sent directly to the DBMS. The information generated includes: (i) a set of relations and attributes referenced by the query that may be used to obtain intra-query parallelism, required by AVP, (ii) information needed to perform result composition, (iii) a set of attributes used in aggregation operations. No specific information from the DBMS is needed, thus preserving database autonomy and considering each DBMS as a "black-box" component.

(ii) Query processing with inter/intra-query parallelism. Inter-query parallelism is relatively straightforward. CQP sends the query to the NQP of the node with the smallest number of pending tasks. The intra-query parallel strategy decomposes complex queries into sub-queries that will be executed in parallel over different data fragments. CQP rewrites the original query into sub-queries. It uses information from the Catalog, which stores the metadata needed to implement AVP only. Those subqueries are a version of the original query containing a predicate that determines the ranges of the virtual partitions. They are sent to NQPs, which adaptively fine tune the virtual partitions locally. Each NQP executes its subquery and generates a partial result, which is sent to the CQP. After receiving partial results from all NQPs, CQP finishes the result composition and sends it to the client application. Non-uniform data distribution can lead to load skew. ParGRES's non-intrusive dynamic load balancing technique addresses this issue. NQPs perform balancing by exchanging messages among themselves to redefine virtual partitions, thus minimizing communication between nodes. The results in  show the technique is very efficient, especially for cases of extreme skew.

(iii) Result composition. ParGRES does result composition in a two-phase aggregation. It uses parallel processing in this composition. In the first phase, the nodes aggregate the groups returned by the local sub-queries. In the second phase, the groups are distributed to their respective
nodes through a hash function. Finally, each node sends its subset of the global result to the coordinator node, which executes their union. Sort operations are also done in parallel in a similar fashion.

(iv) Update processing. Although ParGRES focuses on read-only query processing, typical of OLAP, updates may also be performed. Since updates in OLAP environments are usually fast and executed at predefined times, ParGRES adopts a strong consistency policy: it does not allow the concurrent execution of updates and queries. To implement this policy, ParGRES has a scheduler that orders queries and updates. When there are just read-only queries, CQP allows them to execute in parallel. While updates are processed, all the remaining queries coming from the application are blocked.

3. GParGRES: a Database Grid Middleware

GParGRES is a middleware to transparently access distributed databases in a grid. As its name suggests, it is based on ParGRES and shares the same objective of efficient support of OLAP through inter/intra-query parallelism. We assume that each grid node is a PC cluster. In a grid, the databases are managed by DBMS that are orchestrated by ParGRES instances. GParGRES is a layer on top of ParGRES instances. Our approach has two levels of query splitting: grid-level splitting, implemented by GParGRES, and node-level splitting, implemented by ParGRES. GParGRES is designed as a wrapper that enables the use of ParGRES in a grid (in our case, Grid5000). We are working on using established standards for the development of grid applications for the implementation of GParGRES. We discuss some of these issues after presenting our architecture.

Figure 1 shows GParGRES's architecture. Its main components are described as follows.

Distributed Query Service (DQS) – this service directly interacts with client applications. DQS receives queries and splits them into subqueries to implement intra-query parallelism using an approach similar to ParGRES. It takes advantage of database replication to perform virtual partitioning. Such partitioning generates adaptive virtual partitions (AVP) to be processed in parallel. This service also performs final result composition, similar to ParGRES' CQP.

Factory Service (FS) – responsible for creating new instances of DQS. When a client application intends to submit queries to GParGRES, it initially asks FS to create a new DQS instance. Each new instance receives a unique service identifier that associates it with its respective factory. Such an identifier is not reused for other new instances even when the service is finished. According to , this behavior is one of the basic characteristics of grid services.

Registry Service (RS) – concentrates information concerning GParGRES services, such as the state of each FS and DQS instance, and ParGRES execution in the nodes.

Grid Local Query Service (GLQS) – local component responsible for receiving subqueries from DQS and passing them to the local ParGRES. This service monitors
subquery execution on ParGRES to allow for query redistribution if the node is too busy or redirect the subquery to another node in case of failure.

ParGRES performs node-level load balancing while GParGRES performs grid-level load balancing, using information about the subqueries being processed in each grid node involved in the global query execution.

Figure 1 – GParGRES architecture with ParGRES in detail

Let us now discuss the implementation of GParGRES as grid services and their compatibility with existing grid solutions, such as the Open Grid Service Architecture (OGSA), OGSA-Data Access and Integration (OGSA-DAI) and Web Services Resource Framework (WSRF).

Factory Service (FS) – To create new instances of DQS, FS can be implemented with the help of GDSF – Grid Data Service Factory, as they have similar functions, i.e., they create instances of the main service of an application.

Grid Local Query Service (GLQS) – This service can take advantage of the GDS (Grid Data Service) service specified by OGSA-DAI, which is able to interact with a data resource. GDS can be used to interact with the ParGRES instance running on a grid node. In our case, ParGRES would act as a relational DBMS to GDS.

Registry Service (RS) – RS can be implemented as a WSRF-compliant OGSA-DAI service. It is possible to use the WSDL specification adopted by the Index Service described in MDS4 (WS MDS – Web Service – Monitoring and Discovery System) as our basis to RS in collecting metadata about grid computing resources. In Grid5000, in particular, it is supplied by the Ganglia tool that also follows MDS4.
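The grid-level behavior described above (sending a subquery to a lightly loaded site and redirecting it when a site fails) can be sketched as follows. This is only a minimal illustration, not the GParGRES implementation: the `Site` class, the `pick_site` and `dispatch` functions, the `run` callable and the use of `ConnectionError` to signal a failed site are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One grid site (a cluster running ParGRES); all fields are illustrative."""
    name: str
    available: bool = True
    pending: list = field(default_factory=list)  # subqueries currently running


def pick_site(sites, exclude=()):
    """Return the available site with the fewest pending subqueries."""
    candidates = [s for s in sites if s.available and s.name not in exclude]
    if not candidates:
        raise RuntimeError("no grid site available")
    return min(candidates, key=lambda s: len(s.pending))


def dispatch(subquery, sites, run):
    """Send a subquery to the least-loaded site and redirect it on failure.

    `run(site, subquery)` is a hypothetical callable standing in for the
    call to the local query service; it raises ConnectionError when the
    site cannot be reached.
    """
    tried = set()
    while True:
        site = pick_site(sites, exclude=tried)
        site.pending.append(subquery)
        try:
            return site.name, run(site, subquery)
        except ConnectionError:
            site.available = False  # treat the site as failed
            tried.add(site.name)    # and try the next least-loaded site
        finally:
            site.pending.remove(subquery)
```

A real service would obtain the pending-subquery counts from a registry rather than from in-memory lists; the sketch only shows the selection and redirection logic.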
ParGRES is developed in Java. The communication between ParGRES and each DBMS is done through JDBC, and between each internal module through RMI. The use of ParGRES assumes a computational environment of three layers: the application layer (i.e. an OLAP tool), the ParGRES layer, and the database layer (with a DBMS instance in each node of the cluster). We assume the OLAP client tool acquires data using SQL, in which the data cubes have already been generated and are ready to be queried. Each DBMS accesses its local database, which reduces the costs of message communication.

4. Experimental Results

To validate our approach, we implemented GParGRES and did a first performance evaluation based on experiments with Grid5000, a French project that creates a large-scale reconfigurable grid infrastructure to support distributed and parallel experiments. Today, Grid5000 has nine sites spread over France. Each site is itself a cluster of PCs; all of them are interconnected by a high-speed network. One main advantage of Grid5000 is that it is possible to reconfigure each grid node for experiments. With the Kadeploy tool, it is possible to generate customized images of operating systems and applications (e.g. DBMS), store and automatically or interactively load them through the job scheduler tool OAR. This way, the time required to set up the environment needed for experiments is reduced through the use of Grid5000 official tools. Also, Grid5000 has a tool (Ganglia) that supplies information concerning the activities of each grid node, from CPU load to hard disk consumption.

Our preliminary experiments are performed on two clusters (located in Rennes). The Paraquad cluster has 64 nodes, each with 2 Dual Core Xeon 2.33GHz CPUs, 4GB RAM and 160GB HD. The Parasol cluster has 64 nodes, each with 2 Opteron 2.2GHz CPUs, 2GB RAM and 73GB HD. The clusters are interconnected by 1 Gbps network links. In all experiments, we used the Kadeploy tool to generate an image of the 64 bits Debian OS (available on Grid5000) along with the PostgreSQL 8.2.4 DBMS and a ParGRES instance for each cluster node. Jobs were interactively submitted through OAR.

Our tests are based on the TPC-H benchmark which is representative of ad-hoc OLAP applications. We generated the database according to the TPC-H specifications with a scale factor of 1 using the DBMS PostgreSQL 8.2.4, which gave us a database of approximately 2.2GB (including all indexes). Clustered indexes based on the first attribute of the primary key were generated for each fact table (Orders and LineItem). They are necessary for AVP. Indexes were also generated for all other primary and foreign keys, as required by TPC-H. No other indexes were created. The TPC-H queries used for the preliminary tests of GParGRES are Q1, Q5, Q6, Q12, Q14 and Q18. They have different levels of complexity and are quite representative of OLAP applications. We restrict our analysis to these queries due to space limitations.

We performed two kinds of experiments (see Figure 2 and Figure 3). These experiments make it possible to evaluate the prototype performance in each cluster. In the first experiment, the clusters are isolated and each query is entirely processed by each one.
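To illustrate why a clustered index on the first attribute of the primary key matters, the sketch below shows how a query could be rewritten into range-restricted subqueries at two levels (grid sites, then nodes within a site). It is a simplified static split under stated assumptions, not the adaptive algorithm of AVP: the function names, the key interval 1 to 6,000,000 for `l_orderkey`, and the simplified aggregate query are illustrative choices, and the rewrite assumes the query already has a WHERE clause.

```python
def virtual_partitions(lo, hi, n):
    """Split the key interval [lo, hi) into n contiguous, near-equal ranges."""
    size, extra = divmod(hi - lo, n)
    bounds, start = [], lo
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def rewrite(query, key, lo, hi):
    """Append a range predicate on the partitioning key.

    Assumes `query` already ends with a WHERE clause (illustrative only).
    """
    return f"{query} AND {key} >= {lo} AND {key} < {hi}"


def two_level_split(query, key, lo, hi, sites, nodes_per_site):
    """Grid-level split across sites, then node-level split inside each site."""
    plan = {}
    site_ranges = virtual_partitions(lo, hi, len(sites))
    for site, (s_lo, s_hi) in zip(sites, site_ranges):
        plan[site] = [
            rewrite(query, key, n_lo, n_hi)
            for n_lo, n_hi in virtual_partitions(s_lo, s_hi, nodes_per_site)
        ]
    return plan


# Simplified aggregate query over the LineItem fact table (illustrative).
QUERY = ("SELECT SUM(l_extendedprice * l_discount) FROM lineitem "
         "WHERE l_quantity < 24")

if __name__ == "__main__":
    plan = two_level_split(QUERY, "l_orderkey", 1, 6_000_001,
                           ["parasol", "paraquad"], nodes_per_site=2)
    for site, subqueries in plan.items():
        for sq in subqueries:
            print(site, "->", sq)
```

Because each subquery only touches a contiguous key range, a DBMS with a clustered index on that key can answer it with a range scan, and the partial sums returned by the subqueries can simply be added to obtain the global result.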
Each query is run ten times for each cluster configuration. Then, we take the average time of the last nine runs (the first one is not considered).

The results obtained with the Parasol and Paraquad clusters are shown in Figure 2. When considering one node-only query processing, the results in Paraquad are better because the cluster is more powerful. However, as more nodes are added (so virtual fragments are more likely to be found in memory), the results of the two clusters tend to be very close.

Figure 2 – Results with Isolated Clusters (panels (a) and (b))

In our second kind of experiments, both clusters process the same set of queries. For the configuration with 1 node, in order not to eliminate the inter-cluster communication costs, we use an NQP running in Paraquad and the CQP running in Parasol. We call this Mixed Configuration and the experimental results are shown in Figure 3.

Figure 3 – Mixed Configuration Results

Both kinds of experiments show similar performance with a slight improvement during the second one (which still does not characterize it as the best scenario). Thus, it can be inferred that message passing costs are not significant for the final performance. Furthermore, the results obtained with GParGRES achieve linear or almost linear speedup in query execution, as more nodes are added in all tested configurations. And this is obtained without any DBMS, network or machine specific optimization, just as ParGRES. This demonstrates the effectiveness of ParGRES's non-intrusive approach and thus that of GParGRES. These are very encouraging results which make GParGRES an attractive solution for OLAP support in grids.
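The measurement procedure above (ten runs per configuration, first run discarded) and the notion of speedup can be expressed in a few lines. The sketch below is illustrative only: the function names are ours, and the timings in the usage example are made-up numbers, not measurements from the experiments.

```python
from statistics import mean


def warm_average(times):
    """Average elapsed time over all runs except the first one."""
    if len(times) < 2:
        raise ValueError("need at least two runs to discard the first")
    return mean(times[1:])


def speedup(t_one_node, t_n_nodes):
    """Speedup of an n-node configuration relative to one node."""
    return t_one_node / t_n_nodes


if __name__ == "__main__":
    # Hypothetical timings in seconds: ten runs each, first run is a cold start.
    one_node = [130.0] + [100.0] * 9
    four_nodes = [40.0] + [25.0] * 9
    s = speedup(warm_average(one_node), warm_average(four_nodes))
    print(f"speedup with 4 nodes: {s:.2f}")
```

With these made-up timings the speedup equals the number of nodes, which is what linear speedup means; a value slightly below the node count would correspond to almost linear speedup.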
5. Related Work

OGSA-DAI  is a middleware based on OGSA to provide access to relational and XML databases in grids. It provides a standard way to send a query to a grid data resource and obtain its corresponding result. OGSA-DAI is complementary rather than alternative to GParGRES as it does not provide services for query processing. OGSA-DAI can be used as a basis for GParGRES implementation, as some of its services are useful to those presented in GParGRES architecture.

An alternative for distributed query processing in grids is through OGSA-DQP, which is based on OGSA-DAI. However, OGSA-DQP does not automatically provide for intra-query parallelism at the operator level, which is important for OLAP query processing, since it requires special-purpose implementation. GParGRES provides for intra-operator parallelism as it supports data partitioning. With intra-operator parallelism, the same operator (e.g. join operator) is executed in parallel by GParGRES using several ParGRES instances which process different data subsets. Our solution is thus better for OLAP queries.

Some related works propose new data models for data warehouses in grids, e.g. . Using data fragmentation, physical fragments of the data warehouse are distributed among grid nodes. Then, grid services are built to identify and index such fragments. Some grid services are also proposed for generating distributed query execution plans. One main advantage of GParGRES is to provide intra-query parallelism without requiring any physical fragmentation of the database. Furthermore, the approach proposed in  uses spatial indexes based on X-Trees. This index structure is not commonly found in standard DBMS. GParGRES does not require any special index structure. It requires only standard clustered ordered indexes, which can be found in many DBMS. In addition, GParGRES works with standard relational DBMS while the approach presented in  does not make it clear if data are stored in relational databases or in flat files.

6. Conclusion

In this paper, we proposed GParGRES, a middleware for OLAP query processing in grids. GParGRES capitalizes on our previous work on ParGRES, a database cluster middleware, to provide transparent inter- and intra-query processing without compromising database and application autonomy. Considering a typical grid environment with multiple clusters at different sites, our approach has two levels of query splitting: grid-level splitting, implemented by GParGRES, and node-level splitting, implemented by ParGRES. GParGRES enables the database to be replicated at multiple sites of the grid, thus increasing data availability and quality of service. Finally, GParGRES can be implemented as grid services and is compatible with existing grid solutions, in particular OGSA.

GParGRES has been partially implemented as grid services on Grid5000. We gave preliminary experimental results obtained with two clusters of Grid5000 using queries of
the TPC-H Benchmark. The results show linear or almost linear speedup in query execution, as more nodes are added in all tested configurations. These are very encouraging results which make GParGRES an attractive solution for OLAP application support in grids.

Besides more substantial performance experiments as was done for ParGRES, this work can be extended in several interesting directions. The first one is to provide support for partial replication  (as opposed to full replication) which is required for very large databases. Another promising direction is the support of top-k queries, another important kind of queries whose support in grids has not yet received attention. In particular, extending best position algorithms  to work in GParGRES is a challenging problem.

References

1. Akbarinia, R., Pacitti, E., and Valduriez, P.: Best Position Algorithms for Top-k Queries. In: VLDB, Vienna, Austria (2007)
2. Anjomshoaa, A., et al.: The Design and Implementation of Grid Database Services in OGSA-DAI. Concurrency and Computation: Practice and Experience, 17(2-4), pp. 357--376 (2005)
3. Alpdemir, M. N., et al.: Service-Based Distributed Querying on the Grid. In: ICSOC 2003, LNCS, vol. 2910, pp. 467--483, Springer (2003)
4. Bell, W. H., Bosio, D., Hoschek, W., Kunszt, P., McCance, G., and Silander, M.: Project Spitfire – Towards Grid Web Service Databases. Technical Report, GGF (2002)
5. Bellatreche, L., Karlapalem, K., and Mohania, M.: OLAP Query Processing for Partitioned Data Warehouses. In: DANTE, pp. 35--42, IEEE (1999)
6. Berchtold, S., Keim, D. A., and Kriegel, H.-P.: The X-tree: An Index Structure for High-Dimensional Data. In: VLDB, pp. 28--39, Mumbai (1996)
7. Cappello, F., Desprez, F., Dayde, M., et al.: Grid5000: a large scale and highly reconfigurable Grid experimental testbed. In: Int. Workshop on Grid Computing, pp. 99--106, IEEE (2005)
8. Cecchet, E., Marguerite, J., and Zwaenepoel, W.: C-JDBC: Flexible Database Clustering Middleware. In: Freenix 2004, pp. 9--18, USENIX Association, Boston (2004)
9. Furtado, C., Lima, A., Pacitti, E., Valduriez, P., and Mattoso, M.: Physical and Virtual Partitioning in OLAP Database Clusters. In: SBAC, pp. 143--150, IEEE, Rio de Janeiro (2005)
10. Foster, I., Kesselman, C., and Tuecke, S.: The Anatomy of the Grid, Enabling Scalable Virtual Organizations. International Journal of High Performance Computing Applications, 15(3), pp. 200--222 (2001)
11. Georgiou, Y., Richard, O., Neyron, P., Huard, G., and Martin, C.: A batch scheduler with high level components. In: CCGRID'2005, pp. 776--783, Cardiff (2005)
12. Kadeploy, http://kadeploy.imag.fr
13. Grid5000, http://www.grid5000.fr/
14. Lima, A. A. B.: Intra-Query parallelism in database clusters. Ph.D. Thesis, COPPE/UFRJ, Rio de Janeiro (2004)
15. Lima, A. A. B., Mattoso, M., and Valduriez, P.: Adaptive Virtual Partitioning for OLAP Query Processing in a Database Cluster. In: SBBD, pp. 92--105, Brasília (2004)
16. Massie, M. L., Chun, B. N., and Culler, D. E.: The Ganglia Distributed Monitoring System: design, implementation, and experience. Parallel Computing, 30(7), pp. 817--840 (2004)
17. Mattoso, M., et al.: ParGRES: a middleware for executing OLAP queries in parallel. Technical Report, http://pargres.nacad.ufrj.br/Documentos/ES-690.pdf (2005)
18. Open Grid Services Architecture, http://www.globus.org/ogsa
19. Pacitti, E., Valduriez, P., and Mattoso, M.: Grid Data Management: open problems and new issues. Journal of Grid Computing, 5(3) (2007)
20. PostgreSQL, http://www.postgresql.org
21. Röhm, U., Böhm, K., Schek, H.-J., et al.: FAS - A Freshness-Sensitive Coordination Middleware for a Cluster of OLAP Components. In: VLDB, pp. 754--765, Hong Kong (2002)
22. Nieto-Santisteban, M. A., Gray, J., Szalay, A., Annis, J., Thakar, A. R., and O'Mullane, W.: When Database Systems Meet the Grid. In: CIDR, pp. 154--161, California (2004)
23. Smith, J., Gounaris, A., Watson, P., Paton, N. W., Fernandes, A. A. A., and Sakellariou, R.: Distributed Query Processing on the Grid. In: GRID, LNCS, vol. 2536, pp. 279--290, Springer (2002)
24. TPC-H Benchmark, http://www.tpc.org
25. Watson, P.: Databases and the Grid. Technical Report, UK e-Science (2003)
26. Wehrle, P., Miquel, M., and Tchounikine, A.: A Model for Distributing and Querying a Data Warehouse on a Computing Grid. In: ICPADS, pp. 203--209, IEEE, Fukuoka (2005)
27. Web Services Resource Framework, http://www.oasis-open.org/committees/wsrf/