
Front. Comput. Sci.

DOI 10.1007/s11704-014-4025-6

Efficient query processing framework for big data warehouse: an almost join-free approach

Huiju WANG1,2,3, Xiongpai QIN1,2, Xuan ZHOU1, Furong LI1,2, Zuoyan QIN1, Qing ZHU1,2, Shan WANG1,2

1 DEKE Lab (Renmin University of China), Beijing 100872, China


2 School of Information, Renmin University of China, Beijing 100872, China
3 School of Computing, National University of Singapore, Singapore 117417, Singapore


© Higher Education Press and Springer-Verlag Berlin Heidelberg 2014

Abstract  The rapidly increasing scale of data warehouses is challenging today's data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema — it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users' demands. This model is space economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prohibits a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, while numerous join results can actually be reused by different queries.

In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the post-processing phase, so that data warehouse queries can be transformed into massively parallelized filter-aggregation operations on the fact table. In contrast to the conventional query processing models, our approach is efficient, scalable and stable regardless of the number of tables involved in a join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.

Keywords  data warehouse, large scale, TAMP, join-free, multi-version schema

Received January 20, 2014; accepted August 20, 2014

E-mail: wanghj@comp.nus.edu.sg

1 Introduction

A data warehouse is usually organized in one or multiple star (snowflake) schemas [1]. In a typical star (snowflake) schema, a large central fact table is linked to multiple dimension tables through primary–foreign key relationships. Processing an OLAP query over a star (snowflake) schema is expensive. It needs to first perform a multi-table join to connect the fact table to the dimension tables, and then impose restrictions on the dimension tables to select the desired data from the fact table. The selected data are finally grouped and aggregated according to the user's demands. The major bottleneck in star query processing is the execution of the multi-table join, which is also known as a star join.

The data in today's enterprise information systems continue to explode. Large computer clusters, especially those composed of low-cost machines, have become a popular platform for processing big data. To architect a data warehouse over a computer cluster, we can choose from the following architectures, each of which has its own strengths:

• Parallel DBMS (PDBMS)  A parallel database is usually deployed on hundreds of high-end servers, where queries take no more than a few hours to finish. Failures are relatively rare in such an environment. Once a failure happens, a parallel database can simply re-execute the query.

When deployed on a large cluster consisting of thousands of low-cost machines, however, a PDBMS encounters a scalability wall. As the machines are less reliable, system failures become more common, and a PDBMS tolerates them less well than MapReduce.

• MapReduce  Most big data analysis systems employ the MapReduce [2] framework to achieve scalability and fault tolerance. A typical example is Hadoop [3]. MapReduce partitions a data processing task into a large number of independent subtasks, such that a failure of a single subtask does not affect the others. Usually, these subtasks are very simple, such as filtering and aggregation. While MapReduce-based systems are superior to PDBMSs in scalability and fault tolerance, they are inferior in efficiency [4]. In particular, MapReduce-based systems are very inefficient in processing join operations, let alone star (snowflake) joins that involve a significant number of tables. Although a number of approaches have been proposed to improve the efficiency of join operations [4–9], most of them incur some extra cost, such as network transmission cost [4–6], I/O cost [9] or storage cost [8, 10], or assume strong hardware configurations [4]. These approaches do not solve the efficiency problem fundamentally.

• Hybrid systems of MapReduce and PDBMS  In recent years, researchers have proposed a number of solutions to combine the advantages of MapReduce with those of PDBMS. Examples include Hive1) and Pig Latin [7], Teradata [11], Vertica2), Greenplum3), Aster Data4), etc. Some approaches try to import a DBMS interface into MapReduce platforms, or import MapReduce interfaces into DBMSs. However, as these approaches do not touch the internal execution engines, they do not take full advantage of either PDBMS or MapReduce.

Some pioneering work, such as HadoopDB [12] (whose commercialized version is called Hadapt5)), attempts to integrate PDBMS and MapReduce tightly. Some improvements have been made on both scalability and efficiency. However, the results are still not satisfactory for data warehouse applications. Especially when a join operation involves multiple join attributes, as a star join does, HadoopDB loses its performance advantage. If we replicate each dimension table to all the database nodes, we incur high space cost. If we adopt a partitioning method, we can only partition one dimension table and the fact table on one join key; in this case, access to the other dimension tables will be very expensive.

Most of the existing data warehouse approaches do not perform satisfactorily when confronted with big data. This is mainly due to the common processing model they adopt — they first extract, transform and load data into a fact table and a number of dimension tables, and join them together again during query processing. This model faces two problems: 1) join is an expensive and complex operation, which prohibits a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously; 2) join operations have to be executed repeatedly, although numerous join results can actually be reused by different data warehouse queries.

In this paper, we propose a novel query processing framework for big data warehouses, which aims to achieve high scalability and high performance simultaneously. The framework, in principle, can be applied to any of the data analytical platforms mentioned above (as detailed in Section 4.3). Under this framework, a data warehouse query can easily be partitioned into massive independent sub-tasks and handled by a large-scale computer cluster. Our framework consists of a new storage model and a query processing model customized for data warehouse queries. The storage model employs hierarchy encoding to compress the hierarchical information of the dimension tables into the fact table, in order to minimize the space consumption. Our query processing model is called TAMP, standing for transform, aggregate, merge and post-process, which are the crucial steps for processing data warehouse queries. TAMP translates a data warehouse query into a unified plan which scans, filters and aggregates the fact table in a pipeline, and then merges all aggregated data returned from each data node and performs a residual join to complement the final aggregation results with additional information from the dimension tables. To handle the dimension update problem, we design a multi-version schema update protocol. We also propose a scan-index algorithm to ensure the efficiency of our approach.

We implemented our query processing framework and deployed it on Hadoop. Extensive experiments were conducted to evaluate its scalability and performance. The results show that our approach scales almost linearly with the size of the data warehouse, and that the Hadoop platform adopting our framework outperforms Hadoop and HadoopDB significantly in star query processing.
1) Apache Hive. http://hive.apache.org/
2) Vertica. http://www.vertica.com/the-analytics-platform/native-bi-etl-and-hadoop-mapreduce-integration/
3) Greenplum. http://www.greenplum.com/technology/mapreduce/
4) Aster. http://www.asterdata.com/product/mapreduce.php
5) Hadapt Inc. http://www.hadapt.com

The rest of the paper is organized as follows. We first compare our approach with related work in Section 2. Section 3 introduces the storage model we designed for the star schema. Section 4 presents our query processing framework. Section 5 discusses our dimension update protocol and optimization issues. In Section 6, we provide the experimental results. Finally, Section 7 concludes the paper and discusses future research directions.

2 Related work

• Processing framework  Our framework shares some ideas with MapReduce, such as the scan-based processing model and a new abstraction of processing. However, our framework is a domain-specific abstraction for data warehouse query processing, which answers all queries in one scan job, regardless of the number of tables involved in the query. In contrast, MapReduce is a universal framework and usually has to launch multiple jobs to process a data warehouse query.

Our processing framework is also dramatically different from conventional RDBMSs. Typically, an RDBMS translates different queries into different plans and finds one optimal plan to execute each query, while our approach uses one unified plan to answer all queries. Different processing schemes are applicable to different scenarios. RDBMSs may benefit from sophisticated optimization techniques to achieve high performance. However, it is difficult for them to handle data of very large size because of the scalability issue (see the discussion in the introduction); and if the join involves more relations, e.g., eight relations, the optimization search space is too large to be affordable [13]. Our approach, in contrast, depends on scans to process queries, which is more scalable and stable regardless of the number of tables involved in the join. Many traditional optimization techniques, especially indexes, can also be applied in our processing framework to improve the performance further.

• Pre-join  Pre-join has been utilized to speed up database query processing. In [8, 14], the universal relation is used to put all information into the fact table, so that queries can be processed by sequential scans. However, this approach incurs high storage space cost. Different from the universal relation, our approach compresses only the key hierarchy information of the dimension tables into the surrogate keys of the fact table, and pushes the residual joins to the post-processing of the aggregation results. This is supposed to achieve a good tradeoff between storage cost and join cost. The join-index [15] is another form of pre-join, which materializes the addresses of the join keys in the foreign key table. If we apply the join-index to star query processing, the dimension tables have to be accessed repeatedly to obtain the join results, which introduces expensive random accesses. In contrast, our approach materializes the values of the join keys and their parent hierarchies in the fact table, so that a single scan is sufficient to generate the join results.

• Hierarchy encoding  Optimizing star join queries using hierarchy encoding was first introduced in [16, 17], where surrogate keys based on the dimension hierarchies are used to link dimension tables and fact tables. In their approach, a fact table is hierarchically clustered and star joins can be transformed into multidimensional range queries using the UB-tree [18]. In [19], the authors proposed a CSB star schema based on a composite surrogate key, which is similar to our dimension surrogate key. Compared to the above approaches, our hierarchy encoding method is more space efficient: as we encode hierarchy members based on local domains rather than global domains, our method needs fewer bits to represent a hierarchy member.

3 Join-free storage format

In a nutshell, we compress the important information of the dimension tables into a set of surrogate keys and put them into the fact table, so that the queries on the dimension tables can mostly be answered by accessing the surrogate keys. The compression is achieved by hierarchy encoding [16, 17]. In this section, we discuss how to use hierarchy encoding to design an efficient storage strategy for processing data warehouse queries.

3.1 Dimension storage

• Local hierarchy domain  Both the star schema and the snowflake schema are used to describe multidimensional data. In general, the values of each dimension in a schema can be organized into a hierarchy tree (a value domain without hierarchical structure can be regarded as a special hierarchy consisting of a single level), such as the City dimension in Fig. 1. Let L be the hierarchy tree of a dimension. For each node in L, we call its children a local hierarchy domain (LHD). As shown in Fig. 1, China, Japan, etc. share the same parent Asia; then China, Japan, ... is an LHD of the Region level. The highest level of a hierarchy tree is also an LHD, denoted by All.

• Local hierarchy surrogate  For each local hierarchy domain, its values can always be ordered.

Let LHD_x be a local hierarchy domain with C values, and let member(LHD_x) be the possible values of LHD_x. We can define a one-to-one function S : member(LHD_x) → [0, C−1], such that, for every u, u' ∈ member(LHD_x), u < u' implies S(u) < S(u'). S(u) is called the local hierarchy surrogate of u, denoted hierarchy surrogate key (hsk). (Note that if < is not defined for the domain of LHD_x, S is simply defined as an arbitrary one-to-one function from the value domain to [0, C−1].) Figure 1 shows some examples of hsks. The number of bits reserved to represent the hsks of a level is ⌈log2 Cmax⌉, in which Cmax is the size of the largest LHD in that level. For example, in the City level, Beijing, Hongkong and Shanghai share the common parent China, which has the maximum number of children. Thus we use ⌈log2 3⌉ = 2 bits to encode the City level.

Fig. 1 An example of hierarchy encoding

• Global hierarchy surrogate  The global hierarchy surrogate of a node v, denoted dsk(v), is the concatenation of all the hsks on the pre-order traversal path of v. In other words, each dsk contains all the hierarchical information of the corresponding dimension. As shown in Fig. 1, the dsk of Osaka is the concatenation of the hsks of Asia, Japan and Osaka, which is 01.10.01. We denote the global hierarchy surrogate of the lowest level of a dimension as the dimension surrogate key (dsk).

In some special cases, a dimension may correspond to multiple hierarchies. In such a case, we treat each hierarchy as a separate dimension and encode them separately.

Using the hierarchy encoding, we encode each hierarchy to form a new dimension table. Therefore, a dimension table consists of: 1) the global hierarchy surrogate key dsk, and 2) the other attributes of the original dimension table. For example, the dimension tables of the hierarchy in Fig. 1 are shown in Table 1.

• Physical storage  In most cases, as a modern server is usually equipped with a large main memory (normally from hundreds of GBs to several TBs), we can actually keep the dimension tables in the server's memory and finish the transformation efficiently. For some big hierarchies which cannot be held in memory, we can store them on disk and build indexes on each field to accelerate the lookup operation.

Table 1  Encoding of the dimension tables for City

Dimension region        Dimension nation
dsk   Region            dsk     Nation
01    Asia              01.01   China
10    Europe            01.10   Japan
...   ...               ...     ...

Dimension city
dsk        City      dsk        City
01.01.01   Beijing   01.10.01   Osaka
01.01.10   Henan     01.10.10   Tokyo
...        ...       ...        ...
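To make the encoding concrete, the following C sketch computes the hsk width of a level and assembles the dsk of a node. It is an illustration only: it assumes a dsk fits in 64 bits, and the names (hsk_bits, make_dsk, level_bits, hsk_path) are ours, not taken from the implementation described in this paper.

#include <stdint.h>

/* Bits reserved for one hierarchy level: ceil(log2(cmax)), where cmax is
 * the size of the largest LHD on that level (e.g., 2 bits for the City
 * level of Fig. 1, whose largest LHD has three members). */
static unsigned hsk_bits(unsigned cmax)
{
    unsigned bits = 0;
    while ((1u << bits) < cmax)
        bits++;
    return bits;
}

/* dsk of a node: the concatenation of the hsks on the path from the root,
 * one field per level, each field level_bits[i] bits wide. */
static uint64_t make_dsk(const unsigned *hsk_path,
                         const unsigned *level_bits, unsigned depth)
{
    uint64_t dsk = 0;
    for (unsigned i = 0; i < depth; i++)
        dsk = (dsk << level_bits[i]) | hsk_path[i];
    return dsk;
}

With the hierarchy of Fig. 1, the path Asia–Japan–Osaka has the hsk values 01, 10 and 01 and level widths of 2, 2 and 2 bits, so make_dsk returns the dsk 01.10.01 listed in Table 1.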
3.2 Fact table storage

• Compress dimension surrogate keys into the fact table  To integrate the dimension information into the fact table, we add a surrogate attribute named mdsk to the fact table. We first replace each foreign key of the fact table with its corresponding dsk in the dimension table. Then we combine these dsks into a single surrogate key by interleaving their hsks. We call the result the mdsk (multidimensional surrogate key) of the fact table. After obtaining the mdsk, we may drop the former foreign keys from the fact table to save space.

For instance, let dsk_i be the hierarchy surrogate of the i-th dimension of a fact table tuple, and let hsk_i.1, hsk_i.2, ..., hsk_i.n be the elementary surrogates of dsk_i. Then the mdsk of the tuple is hsk_1.1 ... hsk_m.1 . hsk_1.2 ... hsk_m.2 . ... . hsk_1.n ... hsk_m.n, where m is the number of dimensions; that is, the mdsk interleaves all the dsks level by level. Suppose we have two dimensions, the Date dimension (Year–Month–Day) and the City dimension shown in Fig. 1. The mdsk will be hsk_Year.hsk_Region.hsk_Month.hsk_Nation.hsk_Day.hsk_City. If the Date dimension has only two levels, i.e., Year–Month, the mdsk will be hsk_Year.hsk_Region.hsk_Month.hsk_Nation.hsk_City.

By combining all the dsks into a single multidimensional surrogate key, we can transform all the operations on the dimension tables into bit string operations on the mdsk, which can be executed highly efficiently [14]. This interleaving encoding also ensures that if we pre-sort the data on mdsk, the higher the hierarchy level, the better the clustering effect. As most data warehouse-style queries operate on the relatively high hierarchy levels, they can benefit significantly from such a pre-sort. Utilizing pre-sorting is our future work.
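The interleaving itself can be sketched in C as follows, again under the assumption of a 64-bit mdsk and with illustrative names; a dimension that has no member at some level simply contributes a zero-width field there.

#include <stdint.h>

/* Interleave the hsks of ndims dimensions into one mdsk, level by level:
 * level 1 of every dimension first, then level 2, and so on (up to 8
 * levels in this sketch).  hsk[i][l] is the hsk of dimension i at level l;
 * bits[i][l] is its width (0 if dimension i has no level l). */
static uint64_t make_mdsk(const unsigned hsk[][8], const unsigned bits[][8],
                          unsigned ndims, unsigned nlevels)
{
    uint64_t mdsk = 0;
    for (unsigned l = 0; l < nlevels; l++)
        for (unsigned i = 0; i < ndims; i++)
            mdsk = (mdsk << bits[i][l]) | hsk[i][l];
    return mdsk;
}

For the Date (Year–Month–Day) and City (Region–Nation–City) dimensions above, this yields hsk_Year.hsk_Region.hsk_Month.hsk_Nation.hsk_Day.hsk_City.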

• Key-value pair storage model  For analytical workloads, a column-store is commonly regarded as a superior solution to a row-store. We therefore adopt a ⟨key, value⟩ storage model to distribute the fact table on HDFS, where each key attribute stores an mdsk and each value attribute stores a single attribute value of the fact table. In this model, the key-value pairs sharing the same key constitute a tuple of the fact table, and the key-value pairs corresponding to the same attribute constitute a column of the fact table. Such a ⟨key, value⟩ model shares the advantages of the column-store, as it allows us to avoid redundant data accesses. The advantage of our ⟨key, value⟩ model over the column-store is that it does not need to perform tuple reconstruction during query processing, as the keys already contain all the information needed for query processing.
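One plausible layout of such a pair, assuming a 64-bit mdsk and one numeric value per pair, is sketched below; the struct and field names are ours, not the paper's.

#include <stdint.h>

/* One <key, value> pair of the distributed fact table: the key is the mdsk
 * of the tuple, the value is a single attribute value.  Pairs sharing the
 * same key form a fact tuple; the pairs of one attribute file form a column. */
struct kv_pair {
    uint64_t mdsk;     /* multidimensional surrogate key          */
    double   measure;  /* one measure value, e.g. a revenue field */
};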
4 TAMP execution model

In this section, we introduce the design of our framework. We discuss the key design idea first, and then introduce our data warehouse query processing model TAMP.

4.1 Key idea

The dependence on join operations makes it difficult for current query processing frameworks to achieve performance and scalability at the same time. Although the universal relation [20] can avoid joins, it consumes too much storage space and is thus not applicable. We aim to avoid expensive join operations without high storage cost. To achieve this, we encode the dimension tables' hierarchical information (not all of their information) into the fact table at the data loading phase (as discussed in the last section), so that the fact table can be processed without referring to the dimension tables. In this way, we get the best trade-off: we eliminate the expensive join operation without the space overhead of the universal relation. When a query arrives, we first transform the operations on the dimension tables into operations on the hierarchical information in the fact table. Then we conduct selection and aggregation on the fact table. Finally, we decode the aggregation results by joining them with the dimension tables.

4.2 TAMP execution model

The high-level architecture of our framework is depicted in Fig. 2. In a typical data warehouse, the dimension tables occupy only a small proportion of the storage space compared to the fact table. Therefore, we store the dimension tables on the master node and distribute the fact table over the cluster.

Fig. 2 TAMP execution model

4.2.1 Transform

When a query is issued, the transform module interprets the SQL query, extracts the operations (predicates, group-by, etc.) on the dimension tables and translates them into operations on the fact table. The predicates on the fact table, on the other hand, remain unchanged. After the transform phase, a new query for processing the fact table is produced.

We restrict ourselves to those queries whose selection predicate can be expressed as a conjunction of predicates of the form attrib op const, where attrib is an attribute of the relation, op is a comparison operator, and const is a constant value. Queries containing disjunctions can always be expressed as a union of such queries, so this is not a limitation. The transformation rules are as follows:

• Predicates on hierarchies

Equality conjunctive predicates transformation. Predicates in an equality conjunction are all equality predicates. To transform this type of predicate, we produce two bit strings. One is a mask, extr_mask_equal, to extract the needed hierarchies by a bitwise AND operation on the mdsk. The other is the expected result, which contains the corresponding hierarchies' dsks at the appropriate offsets; we denote this bit string by mdsk_e. One conjunction produces a pair of extr_mask_equal and mdsk_e.



Then, on each tuple t, we test: extr_mask_equal & t = mdsk_e, which needs only a single comparison to evaluate all the predicates in the conjunction.
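In C, this test is a single mask-and-compare over the mdsk. The sketch below assumes a 64-bit mdsk and uses the names introduced above; it is illustrative rather than the paper's actual code.

#include <stdint.h>
#include <stdbool.h>

/* Equality conjunction: extr_mask_equal selects the hierarchy bits that the
 * predicates constrain, mdsk_e holds the expected dsk values at the same
 * offsets.  One comparison evaluates the whole conjunction. */
static bool matches_equal(uint64_t mdsk, uint64_t extr_mask_equal,
                          uint64_t mdsk_e)
{
    return (mdsk & extr_mask_equal) == mdsk_e;
}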
Range conjunctive predicates transformation. Range conjunction transformation is more complex, because the conjunction may contain both range predicates and equality predicates. For this type of transformation, we produce a new conjunction containing two predicates. One is an equality predicate which gathers all the equality predicates in the conjunction; we obtain it in the same way as in the equality conjunction transformation described above. The other is a range predicate which evaluates all the range predicates in the conjunction. We transform all the range predicates in a conjunction into two bit strings which represent the expected bounds if all predicates pass: one represents the lower bound (denoted mdsk_l) and the other the upper bound (denoted mdsk_u). The mdsk_l is composed of the dsks of all the hierarchies' lower bounds in the conjunction. If a hierarchy has no lower bound, its corresponding bits in the mdsk are left false (representing the minimal value). In the same way, the upper bound mdsk_u is composed of the dsks of all the hierarchies' upper bounds in the conjunction. If a hierarchy has no upper bound, its corresponding bits in the mdsk are left true (representing the maximum value). Bits that do not correspond to a range predicate are set to false in both mdsk_l and mdsk_u. Two masks are also produced: 1) extr_mask_rng, which extracts the appropriate surrogates for the range predicate from the mdsk; 2) group_mask, which extracts the codes of the hierarchies that appear in the group-by clause for aggregation. When testing a tuple t, we evaluate the conjunction: extr_mask_equal & t = mdsk_e and (extr_mask_rng & t) between mdsk_l and mdsk_u. If a range predicate has the form attrib < const (or attrib > const), we translate it into the form attrib ≤ const' (or attrib ≥ const'), where const' is the biggest sibling that is less than const (or the smallest sibling that is greater than const) in the hierarchy tree. For example, year < 1995 can be translated into year ≤ 1994.
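The combined test of the equality part and the range part can be sketched in the same style, again assuming a 64-bit mdsk and illustrative names.

#include <stdint.h>
#include <stdbool.h>

/* Mixed conjunction: the equality part is tested as before; the range part
 * extracts the constrained hierarchies with extr_mask_rng and checks them
 * against the interleaved lower and upper bounds mdsk_l and mdsk_u. */
static bool matches_conjunct(uint64_t mdsk,
                             uint64_t extr_mask_equal, uint64_t mdsk_e,
                             uint64_t extr_mask_rng,
                             uint64_t mdsk_l, uint64_t mdsk_u)
{
    uint64_t rng = mdsk & extr_mask_rng;
    return (mdsk & extr_mask_equal) == mdsk_e
        && rng >= mdsk_l && rng <= mdsk_u;
}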
Like and IN conjunctive predicates transformation. We handle these types of predicates by transforming them into an IN predicate. The value list is a set of mdsk keys, which are produced by evaluating the predicates on the dimension tables.

• Predicates on other fields

There are two ways to transform this type of predicate. One way is to transform the predicates into an IN predicate, as just discussed above. For some hot fields, we can use another way: encode these fields into the dsk and mdsk keys, so that predicates on these fields can be evaluated on the fact table in batch, just like predicates on hierarchies.

• Removal of join clauses from the SQL query

In our design, the join operation is replaced with a table scan, so the join clauses are unnecessary and can be removed safely.

To help readers understand the transformation rules, we give a simple example. Before transformation, most operations are performed on the dimension tables:

SELECT d.year, c.nation, sum(revenue)
FROM fact, d_date, d_customer
WHERE d.year >= 1994 AND d.year <= 1996
  AND c.region = 'Asia'
GROUP BY d.year, c.nation

After transformation, all operations are translated into new ones on the fact table (the binary codes are the new operations; see Section 3 for details):

SELECT mdsk & '1111001100000', sum(revenue)
FROM fact
WHERE mdsk & '0011000000000' = '0001000000000'
  AND mdsk & '1100000000000' BETWEEN '0100000000000' AND '1100000000000'
GROUP BY mdsk & '1111001100000'

4.2.2 Aggregate

In this phase, each data node executes the following operations in a pipeline: (1) scan the corresponding columns of the fact table; (2) aggregate the data that satisfy the predicates; (3) sort the results.

4.2.3 Merge

After the aggregation phase, a merger starts to merge the results produced by each data node. As the aggregation results are usually small in size, the network transmission cost is limited and the merge operation can be finished quickly.

4.2.4 Post-process

As the results returned by the slave nodes consist of encoded dimensional information, residual joins are needed to turn the codes into the values of the dimension tables. Following that, we execute the having-clause and sort the final results if necessary.

As we can see, after transformation, each data node filters and aggregates its own data without the need to move

dimension tables or the fact table across remote nodes. The network transmission and memory overheads are minimized. By using hierarchy encoding (see Section 3), the storage overhead is far lower than that of the universal relation. Besides, given a set of queries over the same star schema, existing query models will conduct one star join for each query. In contrast, under our scheme, each query is processed on the same join-free schema without any join operation, which means such a set of queries can reuse the join results, and the pre-processing cost is also amortized over the set of queries.

4.3 Application scenarios

In this section, we show how the TAMP model can be applied to a class of data analytical platforms.

• RDBMS  The TAMP model performs query processing with a common execution plan. Thus it can be naturally applied to a traditional RDBMS. Especially for a PDBMS, it can improve the scalability greatly. Using this framework, we can divide a query into massive independent sub-tasks, each of which can be accomplished through a scan operation. This allows a PDBMS to achieve the same level of fault tolerance as MapReduce — when a sub-task fails, we only need to re-execute that particular sub-task and avoid re-executing the whole query. A conventional relational database can also benefit from our model, especially when processing multi-way star joins.

• MapReduce-based systems  As the TAMP model depends on scan operations to answer queries, it fits well with the filter-aggregation processing model of MapReduce. Thus, it is easy to integrate it into the MapReduce framework. In the experiment section, we will provide more details.

• HadoopDB  In our framework, the fact table is independent of the dimension tables. Therefore, we can store the dimension tables on the master node and partition the fact table over the local databases on the HadoopDB nodes. Then we can push all the star queries into the database layer and obtain better performance. This deployment can also be applied to a database cluster with a middleware.

5 Other optimization issues

In this section, we focus on the most crucial parts of our framework: the tuple reconstruction and dimension update problems.

5.1 Scan-index

The logical data units in OLAP queries are tuples. Thus, we usually need to perform tuple reconstruction when conducting queries over our storage format. To speed up the process, our tuple reconstruction algorithm utilizes the following principle: as the selection conditions in a query are normally composed of conjunctions of predicates, if one field of a tuple fails the predicate evaluation, the complete tuple fails the evaluation. Therefore, if some tuples have been eliminated during the scan of one column, they can be skipped when we scan the successive columns. Applying this principle, our system scans the columns in the descending order of the selectivity of their corresponding predicates. During the scan, it uses a scan-index to record the offsets of the tuples that have passed the predicate evaluation. We implement the scan-index as a bitmap, in which the i-th bit indicates whether the i-th value in the preceding columns satisfies the predicates. By using the bitmap, we can skip the records that have already been eliminated.
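A minimal C sketch of such a bitmap-guided column pass is given below. The bitmap layout, the int64_t column type and the predicate callback are our own assumptions for illustration and are not the paper's actual interface.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Scan-index as a bitmap: bit i stays 1 as long as tuple i has passed every
 * predicate evaluated on the columns scanned so far. */
static bool si_get(const uint8_t *si, size_t i)  { return (si[i >> 3] >> (i & 7)) & 1; }
static void si_clear(uint8_t *si, size_t i)      { si[i >> 3] &= (uint8_t)~(1u << (i & 7)); }

/* One column pass: tuples already eliminated are skipped; tuples whose value
 * fails this column's predicate are cleared from the scan-index. */
static void scan_column(const int64_t *col, size_t ntuples, uint8_t *si,
                        bool (*pred)(int64_t))
{
    for (size_t i = 0; i < ntuples; i++) {
        if (!si_get(si, i))
            continue;               /* eliminated by an earlier column */
        if (!pred(col[i]))
            si_clear(si, i);
    }
}

Columns are passed to scan_column in the order described above, so later passes touch fewer and fewer live tuples.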
5.2 Multi-version schema update protocol

Updates on dimensions are not rare. For example, a company often adds a new product to its product list. Once a dimension changes, we need to update the existing data's mdsks. The most straightforward implementation is to update the dimension surrogate codes first and then update the fact table. However, this implementation may incur expensive data synchronization cost.

To handle this problem, we designed an update protocol, called the multi-version schema update protocol. We annotate the star schema with version information. Whenever a dimension table is updated, we generate a new version that can be stored, accessed and managed separately from the old version. In our protocol, we can keep both the old and the new versions.

• Insert  When an insertion occurs, the corresponding hsk needs to be extended. If the hsk has spare codes, we can simply use a new code to represent the new item. If the hsk is already fully utilized, we have to increase its length; in that case, we generate a new version of the star schema. When new queries arrive, each query is transformed into two queries, one for the old version and another for the new version, and a final merge operation is needed after finishing the two queries.

• Update  After an update, the new value should not overwrite the old value in most cases. For instance, Hungary was a developing country before 2008, and now it is counted as a developed country. In this circumstance, we cannot erase its historical data. In other words, even though the dimension data was

updated, its history information should be preserved. Therefore, we keep the data after the update in a new version. For a query, if it accesses data before 2008, we use the old mdsk; otherwise, we use the new mdsk. If a new value does need to overwrite the old value, a mapping between the new mdsks and the old ones is maintained for accessing the old version of the star schema.

To reduce storage cost, we can combine multiple versions of data into just one by unifying their surrogate keys.
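The insert case above can be sketched in C as follows; the bookkeeping structure and its names are illustrative assumptions rather than the paper's implementation.

#include <stdbool.h>

/* Per-level bookkeeping: current hsk width and the next unassigned code. */
struct lhd_level {
    unsigned bits;       /* current width of this level's hsk       */
    unsigned next_code;  /* next free code within the current width */
};

/* Assign an hsk to a newly inserted dimension member.  If a spare code is
 * left, the current schema version is kept; otherwise the level is widened
 * and a new schema version has to be created (Section 5.2). */
static unsigned assign_hsk(struct lhd_level *lvl, bool *new_version_needed)
{
    if (lvl->next_code >= (1u << lvl->bits)) {
        lvl->bits++;                  /* widen the level               */
        *new_version_needed = true;   /* existing mdsks use old widths */
    } else {
        *new_version_needed = false;  /* spare code: same version      */
    }
    return lvl->next_code++;
}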
6 Experiments

We conducted extensive experiments to analyze the performance of our framework. We also measured its various overheads, such as the overheads of data loading, storage, etc. We implemented the modules of TAMP as well as all the optimization algorithms in C and deployed them on Hadoop. To guarantee that all the fields of a fact table record are co-located on the same node of the cluster, we implemented a column store similar to [21] on the Hadoop distributed file system.

• Platform  All experiments were conducted on a cluster of 14 nodes, one serving as the NameNode and the others as DataNodes. Each node has 2 GB of memory, an Intel Core2 Duo 1.87 GHz processor and 140 GB of disk space, and runs the 32-bit Ubuntu 10.10 OS. We installed Hadoop 0.21. To achieve the best performance, we changed its configuration settings: (1) we set the data block size of HDFS to 64 MB; (2) we enabled each task tracker to run with a maximum heap size of 500 MB; (3) we increased the sort buffer to 200 MB; (4) we set the concurrency of each node to two map slots and one reduce slot.

• SSB test data  Our experiments used the Star Schema Benchmark6), which includes one fact table (lineorder) and four dimension tables. With scale factor 500, we used SSB to generate 500 GB of data: a fact table of 498 GB and dimension tables totaling 1.3 GB.

• Query workload  The full set of SSB queries consists of four "query flights" of three to four queries each, for a total of 13 queries. According to the number of measures a query uses, we classify the SSB queries into two categories: single-measure queries and multi-measure queries. Single-measure queries (Q2.1–Q3.4) operate on only one measure; multi-measure queries include the first query group (Q1.1–Q1.3) and the last query group (Q4.1–Q4.2), which use three measures and two measures respectively. Multi-measure queries may be used to evaluate the performance of the scan-index. Below are two sample queries of varying complexity from the SSB query set.

Query 1.1:
SELECT SUM(lo_extendedprice*lo_discount)
  AS revenue
FROM lineorder, date
WHERE lo_orderdate = d_datekey
  AND d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25;

Query 1.1 is the first query in the set and returns a single row; it performs an aggregate over the entire table, filtering with selectivity 0.019. Query 4.3, on the other hand, is the last query in the set and returns the fewest rows of all test queries. Its filters have a selectivity of 0.000 091.

Query 4.3:
SELECT d_year, s_city, p_brand1,
  sum(lo_revenue - lo_supplycost) AS profit
FROM date, customer, supplier, part, lineorder
WHERE lo_custkey = c_custkey
  AND lo_suppkey = s_suppkey
  AND lo_partkey = p_partkey
  AND lo_orderdate = d_datekey
  AND s_nation = 'UNITED STATES'
  AND (d_year = 1997 OR d_year = 1998)
  AND p_category = 'MFGR#14'

6.1 Performance

To study the performance of our framework, we compared the performance of Hadoop, HadoopDB and Hadoop with our framework (denoted HadoopTAMP). For HadoopTAMP, we deployed the transform module and the post-processing module on the master node, and used mapper tasks to conduct the aggregation operations based on Hadoop streaming and a reducer task to conduct the merge operations.

To conduct joins on Hadoop, a small difference in implementation can result in a big difference in performance. Therefore, we conducted a set of initial experiments on Hive and Hadoop to choose the fastest implementation. For Hive, we tested the hash-based broadcast join and the left-deep tree join. For Hadoop, we tested our own hash-based broadcast join. It turned out that the hash-based broadcast join is the best scheme for most queries [4].
6) SSB benchmark. http://www.cs.umb.edu/~xuedchen/research/publications/StarSchemaB.PDF

We used HadoopDB for comparison; this shows the performance difference between our module and a conventional RDBMS (HadoopDB uses PostgreSQL as its execution engine). We did not compare with Hadoop++, because HadoopDB outperformed Hadoop++ in data loading time and in performance in general cases [8]. To get the best possible performance out of HadoopDB, we re-engineered its processor. Basically, we co-partitioned the customer table (the biggest dimension table) and the fact table on their join key and replicated all the other dimension tables on each slave node. This method is supposed to be highly optimized for HadoopDB, as all joins are pushed to the database layer. We used PostgreSQL 9.0.2 as the local database of HadoopDB and increased the memory for shared buffers to 256 MB and the working memory to 128 MB. We did not use PostgreSQL's data compression feature, as the compression ratio of SSB is very small. Each node in HadoopDB was installed with Hive 0.6.0.

6.1.1 Single measure with full scan

To evaluate the performance of our approach on single-measure queries, we conducted a set of experiments using SSB Q2.1–Q3.4. In these experiments, we turned off the scan-index optimization module, as it is not helpful for these queries. The results are shown in Fig. 4. We can see that even though we tuned HadoopDB and Hadoop to their optimal status, Hadoop with our framework (denoted HadoopTAMP) is eight times as fast as HadoopDB, and 13 times as fast as Hadoop. HadoopTAMP also exhibits better stability than HadoopDB and Hadoop. Firstly, Hadoop needs to materialize intermediate results to the local disk, which incurs high I/O cost, and HadoopDB relies on costly join operations to process data warehouse queries. In contrast, HadoopTAMP depends on scan operations and performs joins only on aggregated data. Moreover, it only needs to materialize the aggregation results to HDFS, whose size is usually very small (in SSB, all queries produce fewer than 1 000 tuples). Secondly, as HadoopTAMP adopts the PAX storage model, it can avoid accessing unnecessary columns. By contrast, HadoopDB uses PostgreSQL, which builds upon a row store and has to perform more I/O operations; Hadoop performs poorly for the same reason.

6.1.2 Scan-index performance

The first query group (Q1.1–Q1.3) and the last query group (Q4.1–Q4.2) of SSB use more than one measure (three measures and two measures respectively). In this circumstance, the scan-index algorithm is very useful. In the second set of experiments, we compared HadoopTAMP against Hadoop, HadoopDB and HadoopTAMP without scan-index (denoted HadoopTAMP-NoSI). To perform tuple reconstruction, HadoopTAMP without scan-index directly accesses the data files to retrieve the values belonging to the same tuple. Figure 3 shows the experiment results.

Fig. 3 Scan-index (SF=500)

Fig. 4 SSB Query performance (SF=500)

As we can see, HadoopTAMP is still the best. On average, it is about 13 times faster than Hadoop and six times faster than HadoopDB. The performance advantage of HadoopTAMP over HadoopDB is reduced a little, as there is no long-running query such as Q3.1. The query execution time of HadoopTAMP is only about 40% of the time consumed by HadoopTAMP without scan-index. We attribute this to the scan-index algorithm: scan-index allows the system to avoid a significant amount of redundant data accesses.

6.2 Scalability

As described in Section 4, we divide the total cost of the TAMP execution model into four parts: the transformation cost (TC), the scan and aggregation cost (AC), the merge cost (MC) and the post-processing cost (PC). The total cost expression is: TotalCost = TC + AC + MC + PC.

A significant part of TC is consumed by the lookup operation that obtains the surrogate keys corresponding to the predicates in the query. As discussed in Section 3.1, this operation can be facilitated by an index and thus can be very efficient.

The AC should be proportional to the size of the fact table, as the operations on each node are limited to table scans. The MC is determined by the result size, as it consists mostly of network transmission cost and I/O cost; since there is only a small amount of aggregated data, MC is usually very low. As post-processing is the inverse operation of transformation, its cost (PC) is close to TC.

Based on the analysis above, we carried out an experiment to study the time distribution of the TAMP execution model. The experiment was performed on the master node and one data node. The latter performs the scan, aggregation and sort operations on the fact table; the other operations of the TAMP model were executed on the master node. We ran SSB Query 3.2 over a dataset of about 32 GB. This query is expected to incur a large TC + MC + PC, because it produces a large result (six hundred records) from little data (just one column). Without loss of generality, instead of keeping the dimension tables in memory, we stored all dimension tables on disk and used a B+-tree index-based access method. Nevertheless, our experiments show that the proportion of TC + MC + PC in TotalCost is less than 0.12% for Query 3.2. It is predictable that the bigger the data volume, the lower the ratio between results and data. Hence, the query processing overhead is dominated by the scan and aggregation cost. Suppose the size of the fact table is F and there are N nodes of the same configuration in the cluster. Then the fragment on each data node is F/N, and the total cost is proportional to F/N. This indicates that the system scales almost linearly with the data size (as illustrated in Fig. 5).

Fig. 5 Processing time vs data size

6.3 Data loading

As the loading time for the dimension tables is relatively small and stable, we only report the experiment results on the fact table.

To load the fact data, Hadoop only needs to copy the files from the local hard disk to HDFS. In contrast, HadoopDB needs three steps to load data: first, it distributes the data based on a key; second, it partitions the data locally on each node; finally, it loads the locally partitioned data into PostgreSQL. Our framework needs two steps to load data: it first generates the surrogate keys and inserts them into the fact table, and then writes the data to the slave nodes in the PAX format. Figure 6 shows the loading time of the 500 GB dataset using the three different approaches.

Fig. 6 Data loading (SF=500)

As we can see from Fig. 6, the loading time of HadoopTAMP is 3.4 times that of Hadoop and 29% that of HadoopDB. By analyzing our framework's loading process in detail, we found that the major overhead is incurred by the hash-based broadcast join used to obtain the dimension surrogates; the generation of surrogate keys occupies 80% of the total cost. In fact, the data pre-processing cost could be reduced greatly if we implemented a dedicated ETL module, as Hadoop's performance is not as good as that of an RDBMS.

Although our system incurs a higher loading cost than Hadoop, the cost is rewarded at runtime, as can be seen in the performance results.

6.4 Storage

In this experiment, we compared the storage overhead of our framework against that of HadoopDB and Hadoop, using a 32 GB SSB dataset. As shown in Fig. 7, HadoopTAMP consumes 12% more space than Hadoop and 35% more space than HadoopDB. The extra cost, though not much, is mainly due to the ⟨key, value⟩ storage model, which replicates the mdsk attribute for each column. In fact, our ⟨key, value⟩ storage model is similar to a column-store, which indicates that we can design an efficient compression algorithm to further reduce the storage overhead. This is our future work.

Fig. 7 Storage (SF=32)

6.5 Dimension update

Based on our multi-version schema update protocol, one dimension update (either insert or update) operation will either generate a new version of the schema or insert a new code into the current version of the schema. As dimension update operations usually occur periodically, are far less frequent than queries and can be done at some pre-defined time, we mainly focus on the evaluation of query performance. To evaluate the query performance, we use the same dataset and queries; to see the effect of the number of schema versions on query performance, we generate several derived datasets with different numbers of schema versions based on the same dataset. For simplicity, we choose one dimension, c_nation, as the target dimension to be updated. N versions of data are generated by updating 100/N% of the fact table tuples and the corresponding values in the c_nation dimension. For example, if there are three versions, each version contains about 33% of the data, which means each version holds about 167 GB of the 500 GB dataset on average.

We ran the second query in each set of SSB queries over each derived dataset. From Fig. 8, we can see that for each query, as the number of versions increases, the execution time increases accordingly: the more versions, the more time it takes to execute the query. This is reasonable because, with the same size of data, one more version introduces one more transformation operation and one more job launch operation, which increases the execution time. It is predictable that the time added by additional versions will eventually dominate when the number of versions reaches some point. However, this can be avoided by simply combining multiple versions of schema data into one periodically, as discussed in Section 5.2. As the increased time is introduced mainly by the job launch operation, which is bounded by the Hadoop platform, the situation should be alleviated greatly on other platforms, such as PDBMSs, which can start a query more efficiently.

Fig. 8 Performance vs #Schema Versions

7 Conclusion

In this paper, we propose a novel framework for efficient processing of data warehouse queries over big data. Our framework abstracts data warehouse processing into a single procedure: transform, aggregate, merge and post-process. We achieve this by compressing dimension hierarchy information into the fact table using hierarchy encoding, which eliminates the star (snowflake) joins in query processing. Therefore, the fact table can be filtered and aggregated without referring to the dimension tables, and the query execution time is not affected by the number of joins.

Much work remains in the future. First, our ⟨key, value⟩ store is similar to a column-store, thus it is possible to achieve a high compression ratio if we compress the data. Second, as all queries in our framework are answered by scans, we can devise an inter-query parallel mechanism to allow multiple queries to share I/Os and the network bandwidth. Finally, as our framework turns multidimensional data into a single-dimension surrogate key, it is interesting to investigate how to utilize traditional one-dimensional indexes, such as hash, to further optimize the query processing.

References

1. Chaudhuri S, Dayal U. An overview of data warehousing and OLAP technology. SIGMOD Record, 1997, 26(1): 65–74
2. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation. 2004, 137–150
3. Apache Hadoop. http://hadoop.apache.org

4. Pavlo A, Paulson E, Rasin A, Abadi D J, DeWitt D J, Madden S, Stonebraker M. A comparison of approaches to large-scale data analysis. In: Proceedings of the 35th SIGMOD International Conference on Management of Data. 2009, 165–178
5. Afrati F N, Ullman J D. Optimizing joins in a map-reduce environment. In: Proceedings of the 2010 International Conference on Extending Database Technology. 2010, 99–110
6. Jiang D, Tung A K H, Chen G. Map-join-reduce: towards scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1299–1311
7. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 SIGMOD International Conference on Management of Data. 2008, 1099–1110
8. Dittrich J, Quiané-Ruiz J A, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 2010, 3(1): 518–529
9. Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techniques for MapReduce. Proceedings of the VLDB Endowment, 2011, 4(7): 419–429
10. Lin Y, Agrawal D, Chen C, Ooi B C, Wu S. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In: Proceedings of the 2011 SIGMOD International Conference on Management of Data. 2011, 961–972
11. Xu Y, Kostamaa P, Gao L. Integrating Hadoop and parallel DBMS. In: Proceedings of the 2010 SIGMOD Conference on Management of Data. 2010, 969–974
12. Abouzeid A, Bajda-Pawlikowski K, Abadi D J, Rasin A, Silberschatz A. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2009, 2(1): 922–933
13. Swami A, Gupta A. Optimization of large join queries. SIGMOD Record, 1988, 17(3): 8–17
14. Raman V, Swart G, Qiao L, Reiss F, Dialani V, Kossmann D, Narang I, Sidle R. Constant-time query processing. In: Proceedings of the 2008 International Conference on Data Engineering. 2008, 60–69
15. Valduriez P. Join indices. ACM Transactions on Database Systems, 1987, 12: 218–246
16. Markl V, Ramsak F, Bayer R. Improving OLAP performance by multidimensional hierarchical clustering. In: Proceedings of the 1999 International Symposium on Database Engineering and Applications. 1999, 165–177
17. Karayannidis N, Tsois A, Sellis T K, Pieringer R, Markl V, Ramsak F, Fenk R, Elhardt K, Bayer R. Processing star queries on hierarchically-clustered fact tables. In: Proceedings of the 28th VLDB Conference. 2002, 730–741
18. Bayer R. The universal B-tree for multidimensional indexing: general concepts. In: Proceedings of the 1997 International Conference on Worldwide Computing and Its Applications. 1997, 198–209
19. Theodoratos D, Tsois A. Heuristic optimization of OLAP queries in multidimensionally hierarchically clustered databases. In: Proceedings of the ACM 4th International Workshop on Data Warehousing and OLAP. 2001, 48–55
20. Korth H F, Kuper G M, Feigenbaum J, Van Gelder A, Ullman J D. System/U: a database system based on the universal relation assumption. ACM Transactions on Database Systems, 1984, 9(3): 331–347
21. Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techniques for MapReduce. Proceedings of the VLDB Endowment, 2011, 4(7): 419–429

Huiju Wang graduated from Renmin University of China in 2012 and works as a postdoctoral research fellow at the School of Computing, National University of Singapore. His research spans the areas of big data, cloud computing, databases and data management, with emphasis on graph databases, graph indexes and graph data exploration.

Xiongpai Qin received his MS and PhD degrees in computer science from Renmin University of China in 1998 and 2009 respectively, and works as a lecturer at the Information School of Renmin University of China. His research interests include semantic-based information retrieval, high performance databases and big data.

Xuan Zhou is an associate professor at the Renmin University of China. He obtained his BS in computer science from Fudan University, China in 2001, and his PhD from the National University of Singapore in 2005. His research interests include database and information management. He has published his work in the top conferences and journals on data management.

Furong Li is a PhD candidate at National University of Singapore. She obtained her BS from Renmin University of China in 2012. Her research interests include data integration, social networks and big data management.

Zuoyan Qin received his BS and MS from Renmin University of China in 2008 and 2011 respectively. He is a senior engineer at Baidu. Before joining Baidu, he worked at Tencent. His main focus is big data processing and cloud computing.

Qing Zhu is an associate professor at the School of Information, Renmin University of China. She completed her PhD in 2005 at Renmin University of China and her MS in 1991 at Beijing University of Technology, China. Her research interests include grid computing, distributed algorithms, Semantic Web services and high performance databases.

Professor Shan Wang finished her undergraduate studies at Peking University, China in 1968, and completed her Master's study at Renmin University of China in 1981. Her research interests include high performance databases, data warehousing and knowledge engineering, information retrieval, etc.
