
Comparison of Different Solutions for Solving the Optimization Problem of Large
Join Queries

Dušan Petković
University of Applied Sciences
Hochschulstrasse 1
Rosenheim, Germany
petkovic@fh-rosenheim.de

Abstract—The article explores the optimization of queries using genetic algorithms and compares it with the conventional query optimization component. Genetic algorithms (GAs), as a data mining technique, have been shown to be a promising technique for solving the ordering of join operations in large join queries. In practice, a genetic algorithm has been implemented in the PostgreSQL database system. Using this implementation, we compare the conventional component for exhaustive search with the corresponding module based on a genetic algorithm. Our results show that the use of a genetic algorithm is a viable solution for the optimization of large join queries, i.e., that the use of such a module outperforms the conventional query optimization component for queries with more than 12 join operations.

Keywords-Database systems, Query optimization, Large join queries, Genetic algorithms, PostgreSQL.

I. INTRODUCTION

The question that generally arises when a relational database system executes a query is how the data needed by the query can be accessed and processed in the most efficient manner. The component of a database system that is responsible for this processing is called the query optimizer. The task of the query optimizer is to consider a variety of possible execution strategies for a given query and to select the most efficient one. The selected strategy is called the query execution plan (QEP). The optimizer makes its decisions using considerations such as how big the tables involved in the query are, which indices exist, and which Boolean operators are used in the existing conditions.

The query optimization process comprises several phases [14]. In the first phase, the query is transformed into a query tree, where the leaves of the tree are database relations and its non-leaf nodes represent operations of relational algebra, such as projection, selection and join. After that, the query tree is compiled. In the third step, the query optimizer takes the compiled query tree as input and investigates several access strategies before it decides how to process the given query. To find the most efficient strategy, the query optimizer first performs the query analysis, during which it searches for arguments and join operations. The optimizer then selects which indices to use. Finally, if join operations exist, the optimizer selects the order of join operations and chooses one of the join processing techniques. In the last step, the QEP is permanently stored and executed.

Genetic algorithms belong to the data mining techniques that are based on an analogy to biological processes [15]. Generally, they apply natural selection to find the globally optimal solution to a given problem. A genetic algorithm is an iterative process, which maintains a population of feasible solutions of constant size. During each step, the fitness of the current population is evaluated and a new population is selected based on the values of the fitness function. Chromosomes, i.e., members of the population, with higher fitness are selected, while those with lower fitness are eliminated. The new population is evaluated and selected for the next generation (iteration). The iterative process finishes when an optimal solution is reached.

The population is modified in each step using the following operators: selection, crossover and mutation. Selection is similar to natural selection, i.e., to the principle of "survival of the fittest". Unlike nature, the size of the population remains constant from one generation to the next. The chance of a chromosome being selected for the next generation is proportional to its fitness. In other words, chromosomes with higher fitness are selected, while only a few (or none) of the chromosomes with poor or lower fitness are selected. For this reason, genetic algorithms use the fitness function, which quantifies the optimality of chromosomes so that a particular chromosome can be ranked against all the others.

The next operator applied to the surviving chromosomes is crossover. Crossover creates two new chromosomes from two existing ones by mixing together pieces of each of them. This operator starts with two chromosomes and a random position. The first part of one chromosome swaps places with the corresponding part of the second one. The intuition behind the crossover operator is the creation of new solutions by exploiting the old ones. Genetic algorithms construct a better solution by mixing good characteristics of chromosomes. Hence, the probability that chromosomes with better fitness will be selected is higher than for chromosomes with lower fitness.
The final operator is mutation. In nature, mutation is the result of miscoded genes being passed from a parent to a child. New chromosomes may be better (or poorer) than old ones. If they are poorer than the old chromosomes, they will be eliminated in the following generations. Generally, mutation is a second-order effect (in relation to selection and crossover), which helps to avoid getting stuck in a local maximum.

As we already stated, a set of semantically equivalent query trees can be generated for a given query. The task of the query optimizer is to select an optimal query execution plan out of all generated ones. On the other hand, genetic algorithms can be used to solve optimization problems by manipulating the patterns of bits in the chromosomes. For this reason, genetic algorithms can be used to solve the complex problem of searching for and selecting an optimal plan out of the set of all plans generated by the query optimizer.

This paper shows that the implementation of the query optimization component for ordering of join operations in large join queries outperforms the traditional component, i.e., the one which uses exhaustive search, for queries with more than twelve joins.

A. Related Work

The problem of ordering relational joins has been discussed in one of the first papers concerning System R [1]. The algorithm used in [1] belongs to the deterministic algorithms, because it constructs a solution step by step in a deterministic manner. Exhaustive search is a deterministic algorithm in which all possible combinations of the joins in a given query are tested and the best solution in relation to the costs is chosen. Almost all query optimizer components of existing relational database systems use a form of exhaustive search to find the optimal QEP.

The optimization technique called Iterative Improvement starts with a random execution plan and searches for the optimal execution plan similarly to hill-climbing [2, 3]. In other words, the execution plan that offers the best improvement in relation to the costs is taken. The process is repeated until one settles at a local minimum, where no further improvement is possible. Another algorithm, which belongs to the same class as the Iterative Improvement algorithm, is called Simulated Annealing [4]. It differs from the former one because it avoids being trapped in one of the local minima by searching further for the globally optimal solution. The disadvantage of this algorithm is the increase in optimization cost. The Two-Phase optimization algorithm [5] combines both previous algorithms: the Iterative Improvement algorithm is run until a local minimum is reached; after that, the Simulated Annealing algorithm attempts to find less obvious improvements. The algorithms mentioned above are called randomized algorithms, because each of them performs a random walk, according to certain rules, along the different solutions of the search space.

The paper [6] introduces genetic algorithms as a way to solve the ordering problem of large join queries. In that paper, genetic algorithms are compared with the corresponding component of System R for relational queries with up to 16 join operations. The experimental part of the paper shows that a GA can find a better execution plan than the corresponding component of System R. The three algorithm groups discussed above (deterministic, randomized and GAs) were compared in [7]. That paper concludes that randomized and genetic algorithms are significantly better than the deterministic one, although they need more time for the optimization process.

As we already stated, during the query optimization process, every QEP is represented as a query tree. Each such tree can be viewed as a program, and a specialization of GAs called genetic programming can be applied to solve the optimization problem for large join queries. In this case, the database relations are the terminals and the algebraic relational operations are the functions of the genetic program.

Lahiri, in his article [8], compares GAs with genetic programming and concludes from his experimentation that genetic programming works better for the optimization problem of large join queries than GAs. Another paper which uses genetic programming to solve the optimization problem of large join queries is [9]. The conclusion of this paper, where genetic programming is compared to the Iterative Improvement algorithm, is that the latter is more effective, while the former can find a better solution.

Genetic algorithms were applied for the first time as a technique for the optimization of relational queries at the beginning of the 1990s. Initial papers showed that genetic algorithms are a viable solution for the optimization of large join queries. Based on these results, members of the PostgreSQL development community implemented a genetic algorithm as an alternative to the conventional module, which uses exhaustive search to solve the optimization problem for the ordering of participating tables in large join queries. The query optimization module of the PostgreSQL database system, which uses a GA to order tables in large join queries, is called GEQO [10]. Properties of the GEQO module will be discussed in Section 2.

Besides the implementation of the GEQO module for the PostgreSQL database system, there exist several papers describing the implementation of a GA for the IBM DB2 query optimizer. The papers of Muntes-Mulero et al. [11, 12] introduce the Carquinyoli Genetic Optimizer (CGO), which can be used as a stand-alone optimizer or a post-optimizer together with IBM DB2. The experimental results of this work show that the CGO component generates QEPs that are several times cheaper than those obtained by the IBM DB2 query optimizer. Also, the average cost of QEPs generated with CGO is roughly four times lower.
B. Roadmap

The rest of the paper is organized as follows. Section 2 gives a short insight into the implementation of the genetic algorithm called GENITOR used in the PostgreSQL database system. The results of our experiments are shown in Section 3. First, we run several large join queries using the GEQO module and compare their execution times with the execution times obtained when the query component for exhaustive search is used. Second, we investigate how an increasing number of generations influences the execution time of large join queries. Section 4 gives the conclusions of this paper.

II. IMPLEMENTATION OF THE GENETIC ALGORITHM IN POSTGRESQL

The particular genetic algorithm used to implement the GEQO module is called GENITOR. GENITOR is an acronym for GENetic ImplemenTOR, a genetic search algorithm that differs in two major ways from standard algorithms. The algorithm reproduces new chromosomes on an individual basis. It does so in such a way that parents and their offspring can exist together at the same time. The GENITOR algorithm has been successfully used in solving different optimization problems. The advantage of the GENITOR algorithm is that it produces one new chromosome at a time, so inserting a single new individual is simple. Another advantage is that the insertion operation automatically ranks the individual relative to the existing ones. For this reason, no further measures of relative fitness are needed.

Theoretical results show that GENITOR achieves faster feedback relative to the rate at which new elements of the search space are being sampled, because it uses a one-at-a-time reproduction scheme. Also, this algorithm may be less biased with respect to schemata with long defining lengths than the standard one. The latter property is one of the main reasons why this algorithm is used to implement the GEQO module, because large join queries belong to the group of problems with long defining lengths.

The two GEQO system functions used in our tests are:
- geqo (boolean)
- geqo_generations (integer)

Each user of the PostgreSQL database system can optionally use either the GEQO module or the exhaustive search component for the execution of join queries. To disable or enable the GEQO module, the geqo() system function is used. Another important system function concerning the GEQO module is geqo_generations(). This function controls the number of generations (i.e., iterations) used by the GEQO module. If the value is set to zero, the system chooses the most appropriate value.
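As an illustration, the following statements sketch how these settings can be changed from a psql session. In current PostgreSQL releases, geqo and geqo_generations are exposed as run-time configuration parameters that can be changed with SET (together with geqo_threshold, which defines the number of FROM items at which GEQO starts to be used); the concrete values below are illustrative and are not taken from our tests.

-- Sketch of switching between the two optimizer components in one session.
-- geqo, geqo_generations and geqo_threshold are PostgreSQL configuration
-- parameters; the values shown are examples only.
SET geqo = off;              -- use the conventional exhaustive-search component
SET geqo = on;               -- re-enable the GEQO module
SET geqo_threshold = 12;     -- apply GEQO to queries with at least 12 FROM items
SET geqo_generations = 0;    -- 0 lets the system choose a suitable number of iterations
SHOW geqo;                   -- display the current setting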
III. EXPERIMENTAL EVALUATIONS

For our experiments, we use the AdventureWorks 2008 sample database of the MS SQL Server system. This database contains more than 60 tables and is therefore well suited for testing large join queries [13]. (The original sample database of the PostgreSQL system contains only 6 tables [10] and cannot be used to test large join queries.) All tables from the AdventureWorks 2008 database have been exported from the SQL Server database system and imported into the PostgreSQL database system, Version 8.4. For each query optimization component we ran 18 different queries with up to 14 join operations. The following list shows the structures of some of the tables from the AdventureWorks database used in our tests:

BusinessEntity (businessentityid: int, modifieddate: int)
Employee (businessentityid: int, nationalidnumber: char(15), jobtitle: char(50), gender: char(1), vacationhours: int)
EmployeePayHistory (businessentityid: int, rate: int)
JobCandidate (jobcandidateid: int, businessentityid: int)
Person (businessentityid: int, persontype: char(2), firstname: char(50), lastname: char(50))
Product (productid: int, name: char(50), productnumber: char(25), color: char(15), listprice: dec(10,2), size: char(5), sizeunitmeasurecode: char(3), weightunitmeasurecode: char(3), weight: dec(8,2), productionsubcategoryid: int, productmodelid: int)
ProductCategory (productcategoryid: int, name: char(50))
ProductListPriceHistory (productid: int, startdate: date, listprice: dec(9,2))
ProductReview (productreviewid: int, productid: int, reviewername: char(50), rating: int)
ProductSubCategory (productsubcategoryid: int, productcategoryid: int, name: char(50))
ProductVendor (productid: int, businessentityid: int, minorderqty: int, maxorderqty: int, unitmeasurecode: char(3))
PurchaseOrderDetail (purchaseorderid: int, purchaseorderdetailid: int, orderqty: int, productid: int)
PurchaseOrderHeader (purchaseorderid: int, status: int, employeeid: int)
SalesOrderHeader (salesorderid: int, revisionnumber: int, status: int, customerid: int, salespersonid: int, territoryid: int, billtoaddressid: int, shiptoaddressid: int, shipmethodid: int)
SalesPerson (businessentityid: int, territoryid: int, salesquota: int, bonus: int)
TransactionHistory (transactionid: int, productid: int, quantity: int)
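To make these structures concrete, the following declaration shows how one of the imported tables can be defined in PostgreSQL. It is a sketch derived from the structure list above; it is not the exact DDL produced by the export and import tools, and it omits all keys and constraints of the original AdventureWorks schema.

-- Illustrative PostgreSQL definition of the Product table, derived from the
-- structure list above (assumed column types; no keys or constraints shown).
CREATE TABLE product (
    productid                int,
    name                     char(50),
    productnumber            char(25),
    color                    char(15),
    listprice                dec(10,2),
    size                     char(5),
    sizeunitmeasurecode      char(3),
    weightunitmeasurecode    char(3),
    weight                   dec(8,2),
    productionsubcategoryid  int,
    productmodelid           int
);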
In processing the input queries, we considered the following generic query. From it we derived the other test queries:

SELECT p.productID, e.businessentityID, pod.orderqty,
pod.purchaseorderdetailID, eph.rate,
jc.jobcandidateID, pc.name, plph.startdate,
pv.unitmeasurecode, pod.productID, sp.bonus
FROM product p
INNER JOIN productSubCategory psc ON
p.productionsubcategoryid = psc.productsubcategoryid
INNER JOIN productCategory pc ON
pc.productcategoryid = psc.productcategoryid
INNER JOIN transactionHistory th ON
th.productID = p.productID
INNER JOIN productListPriceHistory plph ON
plph.productID = p.productID
INNER JOIN productVendor pv ON
pv.productID = p.productID
INNER JOIN purchaseOrderDetail pod ON
pod.productID = pv.productID
INNER JOIN purchaseOrderHeader poh ON
poh.purchaseOrderID = pod.purchaseOrderID
INNER JOIN employee e ON
e.businessEntityID = poh.employeeID
INNER JOIN salesPerson sp ON
sp.businessentityID = e.businessEntityID
INNER JOIN employeePayHistory eph ON
eph.businessentityID = e.businessentityID
INNER JOIN jobcandidate jc ON
jc.businessentityID = e.businessentityID
INNER JOIN businessentity be ON
be.businessentityID = e.businessentityID
WHERE p.productID = 317
AND pod.purchaseorderdetailID BETWEEN 100 and 200;
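The following sketch illustrates how such a test query can be run under each of the two components simply by toggling the geqo parameter; PostgreSQL's EXPLAIN ANALYZE is used here as one possible way to obtain execution times. For brevity, it uses a shortened three-join variant of the generic query instead of one of the 18 original test queries, and geqo_threshold is lowered only so that GEQO also handles this small join count.

-- One measurement run: the same query planned first by the conventional
-- exhaustive-search component and then by the GEQO module. The query is a
-- shortened, illustrative variant of the generic test query above.
SET geqo = off;                      -- conventional exhaustive search
EXPLAIN ANALYZE
SELECT p.productID, pc.name, plph.startdate
FROM product p
INNER JOIN productSubCategory psc ON p.productionsubcategoryid = psc.productsubcategoryid
INNER JOIN productCategory pc ON pc.productcategoryid = psc.productcategoryid
INNER JOIN productListPriceHistory plph ON plph.productID = p.productID
WHERE p.productID = 317;

SET geqo = on;                       -- GEQO module
SET geqo_threshold = 2;              -- force GEQO even for this small join count
EXPLAIN ANALYZE
SELECT p.productID, pc.name, plph.startdate
FROM product p
INNER JOIN productSubCategory psc ON p.productionsubcategoryid = psc.productsubcategoryid
INNER JOIN productCategory pc ON pc.productcategoryid = psc.productcategoryid
INNER JOIN productListPriceHistory plph ON plph.productID = p.productID
WHERE p.productID = 317;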
A. GEQO Module vs. Exhaustive Search

As we already stated, the conventional component, which uses exhaustive search, examines all possible combinations of the joins in a given query and chooses the best one in relation to costs. For this reason, exhaustive search is not recommended for large join queries. In our tests (see Fig. 1) we compare the execution times of the queries with 11, 12, 13 and 14 join operations. Each left-hand bar of the diagram in Fig. 1 shows the execution time with the GEQO module, while each right-hand bar represents the execution time of the query optimization component which uses exhaustive search. Execution times represented in Fig. 1 show the average time over all queries. As can be seen from Fig. 1, both components perform equally for queries with up to 12 join operations. When the number of join operations is higher than 12, the GEQO module outperforms the conventional component by about 15% or more. We did not use queries where the number of joins exceeds 14, but according to the results of this paper, one can expect that the outcomes are similar. Note also that our results differ from the results obtained in [7]. The reason for this can be the use of a different class of queries and/or the use of different constraints.

Figure 1: Comparison of the GEQO module and the exhaustive search

B. Increasing the Number of Generations

In the second test, we investigate how an increasing number of generations, i.e., iterations, influences the execution time of a query. This test differs from the first one because it uses only the GEQO module, i.e., it does not compare the two components used in the first test. The queries used for experimentation are the same as the queries from Section 3.A. In this test, the parameter value of the geqo_generations() system function was changed, starting with 1 and increasing the value up to 13. The results are shown in Fig. 2.

Figure 2: Modifying the number of generations

The figure demonstrates that increasing the number of generations decreases the execution time of a query. In the initial generation (geqo_generations = 1), the deviation from the best value is around 25%. As the generations proceed, the deviation decreases and begins to converge starting with the 9th generation.
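A minimal psql sketch of this second test is shown below. It assumes that the 14-join test query has been saved in a file named large_join_query.sql (a hypothetical file name); the script simply reruns that file while the number of GEQO generations is increased, and \timing reports the elapsed time of each run.

-- Generation sweep: rerun the same large join query while raising
-- geqo_generations. large_join_query.sql is a hypothetical file holding
-- the 14-join test query; only a few of the tested values are shown.
\timing on
SET geqo = on;
SET geqo_generations = 1;
\i large_join_query.sql
SET geqo_generations = 5;
\i large_join_query.sql
SET geqo_generations = 9;
\i large_join_query.sql
SET geqo_generations = 13;
\i large_join_query.sql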
IV. CONCLUSIONS

Until now, the theoretical groundwork for applying genetic algorithms to the optimization of large join queries has been laid. For this reason, the emphasis in the future should be on implementing genetic algorithms in relational database systems other than PostgreSQL. Therefore, future implementation prospects can be divided into the following groups:

a) Future extensions of the GEQO module
b) Further development of the CGO module
c) Implementation of genetic algorithms in relational database systems other than PostgreSQL and IBM DB2

A. Future Extensions of the GEQO Module

As the results of Section 3.A of this paper show, the GEQO module is a viable alternative for the execution of large join queries in the PostgreSQL database system.
On the other hand, the implementation of the GEQO module supports only left-deep tree query processing. Left-deep tree query processing means that every internal node of the query tree has at least one leaf as a child, i.e., the inner input of each join operation is always a database relation. For this reason, the query optimizer of the PostgreSQL system would benefit from extensions of the GEQO module in which the more general representation of query trees, called bushy trees, is supported. The reason for this is that, as the paper [7] shows, bushy-tree query processing is preferable to left-deep tree query processing in the case of the hash join technique.

B. Further Development of the CGO Module

The future efforts in relation to the CGO module should be twofold. The first effort should be to implement the module as an integral part of the DB2 system software. One way to do this is to implement the module as a new DB2 Extender module. Second, the CGO module should be tested together with advanced features of DB2, such as multidimensional clustering indexes, index ANDing and index ORing, to get a complete picture of the viability of the module. All these features were disabled during the tests executed for the work in [11].

C. Implementation of Genetic Algorithms in Database Systems other than PostgreSQL and IBM DB2

We are not aware of any efforts to implement genetic algorithms as a component of relational database systems other than PostgreSQL and DB2. As the first test of this paper as well as the experimentation in [11] show, the optimal plans generated by genetic algorithms are competitive with the corresponding optimal QEPs generated by conventional query optimizer components in relation to large join queries. For this reason, further research in applying and implementing GAs as an extended component of the query optimizers of other enterprise database systems, such as Oracle and MS SQL Server, is recommended.

REFERENCES

[1] Astrahan, M.M.; Blasgen, M.W.; Chamberlain, D.D.; Eswaran, K.P.; Gray, J.N. - Access Path Selection in a Relational Database Management System, in Proc. of the ACM SIGMOD Conf. on Management of Data, Boston, June 1979, pp. 23-34.
[2] Swami, A.; Gupta, A. - Optimization of Large Join Queries, Proc. of the 1988 ACM SIGMOD Int. Conf. on the Management of Data, Vol. 17, No. 3, 1988, pp. 8-17.
[3] Swami, A. - Optimization of Large Join Queries: Combining Heuristic and Combinatorial Techniques, Proc. ACM SIGMOD Conf. on Management of Data, Portland, 1989, pp. 367-76.
[4] Ioannidis, Y. E.; Wong, E. - Query Optimization by Simulated Annealing, Proc. of the 1987 ACM SIGMOD Int. Conf. on the Management of Data, San Francisco, 1987, pp. 9-22.
[5] Ioannidis, Y. E.; Kang, Y. C. - Randomized Algorithms for Optimizing Large Join Queries, Proc. of the 1990 ACM SIGMOD Conf. on the Management of Data, Atlantic City, NJ, 1990, pp. 312-321.
[6] Bennett, K.; Ferris, M. C.; Ioannidis, Y. - A Genetic Algorithm for Database Query Optimization, Tech. Report TR1004, Univ. of Wisconsin, Madison, 1991.
[7] Steinbrunn, M.; Moerkotte, G.; Kemper, A. - Heuristic and Randomized Optimization for the Join Ordering Problem, VLDB Journal, 6, 3 (Aug. 1997), Springer, New York, pp. 191-208.
[8] Lahiri, T. - Genetic Optimization Techniques for Large Join Queries, in Proc. of the 3rd Genetic Programming Conf., 1998, pp. 535-40.
[9] Stilger, M.; Spiliopoulou, M. - Genetic Programming in Database Query Optimization, in Proc. of the 1st Genetic Programming Conference, 1996, pp. 388-93.
[10] PostgreSQL, http://www.postgresql.org, last access 24.12.2009.
[11] Muntes-Mulero, V.; Aguilar-Saborit, J.; Zuzarte, C.; Larriba-Pey, J. - CGO: A Sound Genetic Optimizer for Cyclic Query Graphs, in Alexandrov, V.N. et al. (Eds.), Proceedings of ICCS 2006, Part I, LNCS 3991, 2006, pp. 156-163.
[12] Muntes-Mulero, V.; Aguilar-Saborit, J.; Zuzarte, C.; Larriba-Pey, J. - Analyzing the Genetic Operations of an Evolutionary Query Optimizer, in Bell, D.; Hong, J. (Eds.), Proc. of BNCOD 2006, LNCS 4042, 2006, pp. 240-244.
[13] Petković, D. - SQL Server 2008: A Beginner's Guide, McGraw-Hill, 2008, ISBN: 0071546383.
[14] Ioannidis, Y. E. - Query Optimization, http://infolab.stanford.edu/~widom/cs346/Ioannidis.pdf
[15] Goldberg, D. - Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989.
