
HDW: A High Performance Large Scale Data Warehouse

Jinguo You (1), Jianqing Xi (1), Chuan Zhang (1), Gengqi Guo (1, 2)
(1) School of Computer Science and Engineering, South China University of Technology, Guang Zhou 510641, China
(2) Guangdong Communication Polytechnic, Guang Zhou 510641, China
jgyou@126.com; csjqxi@scut.edu.cn; chuanzh69@yahoo.com.cn; ggengqi@tom.com

Abstract∗

As data warehouses grow in size, ensuring adequate database performance becomes a major challenge. This paper presents a solution, called HDW, based on Google infrastructure such as GFS, Bigtable and MapReduce to build and manage a large scale distributed data warehouse for high performance OLAP analysis. In addition, HDW provides the XMLA standard interface for front end applications. The results show that HDW achieves good performance and high scalability, which has been demonstrated on at least 18 nodes with 36 cores.

Keywords: data warehouse, GFS, Bigtable, MapReduce, high performance

∗ The research is sponsored by the Science & Technology Program of Guangdong Province, China (NO. 2006B11301001, NO. 2006B80407001) and The International Science & Technology Cooperation Program of Guangdong Province, China (NO. 2007A050100026).

1. Introduction

With information generated continuously, data warehouses grow in size [1]. This challenges current data warehouse systems. Accommodating and analyzing vast quantities of data over time becomes more difficult than ever. In some business scenarios, data warehouses and business intelligence applications are required to answer exactly the types of ad hoc OLAP queries that users would like to pose against real time data. Further, [2] proposed C-Store, a column-oriented DBMS, which outperforms current traditional DBMSs in read optimization. As data warehouses are read-mostly, new data warehouse designs need to consider and incorporate the C-Store feature.

Recently, Google has provided server infrastructure such as GFS [3], Bigtable [4] and MapReduce [5] to handle large amounts of data on thousands of low cost commodity machines. Considering that Google searches information across the very large World Wide Web, how about using this infrastructure to build a huge data warehouse?

This paper presents a solution built upon the Google infrastructure, called HDW. HDW is a large scale distributed data warehouse where high performance OLAP analysis is executed. It uses GFS and Bigtable to store data. It also uses MapReduce to parallelize computation tasks such as data cube construction and OLAP query answering.

HDW is different from previous parallel data warehouse and OLAP systems. The most comparable system to HDW is Panda [6, 7, 8]. Both aim at a high performance scalable parallel OLAP system built on low cost shared-nothing clusters, but they differ considerably in implementation. Panda constructs ROLAP data cubes mainly based on the pipesort algorithm, while HDW builds MOLAP data cubes around the closed cube algorithm [9]. Panda and most other systems [10, 11, 12] employ MPI (Message Passing Interface) to communicate between nodes. They need to take extra care of data partitioning, load balancing and fault tolerance, which are automatically handled by GFS and MapReduce in HDW. In addition, for high availability, HDW provides the XMLA (XML for Analysis) standard API for front end applications. Thus, all visualization tools which support XMLA can be easily plugged into the HDW system without much modification.

2. Design Overview
2.1 Architecture

The HDW system consists of four layers as illustrated in Figure 1: the storage layer, the calculation layer, the message layer and the presentation layer.

    presentation layer | JPivot, Manager, Excel
    message layer      | XMLA Engine
    calculation layer  | MapReduce
    storage layer      | GFS / Bigtable, RDB

Figure 1. The layers of HDW

The storage layer stores metadata and summary data in GFS or Bigtable. Large data is split into blocks which are spread across different machines. The storage layer also extracts data from a relational database.

The calculation layer uses the MapReduce framework to parallelize computation tasks such as the data cube
construction (i.e. aggregating data into a data cube) and OLAP queries. The calculation layer accesses data in the storage layer via the read / write interface provided by the storage layer, pre-aggregates the data and stores the aggregated data back to the storage layer.

The message layer includes an XMLA engine which processes requests from users and invokes the calculation layer to execute parallel computation tasks. Through XMLA, the message layer provides a unified interface for client applications. Metadata discovery, OLAP queries, pre-computation, etc. are all wrapped into two standard methods: Discover and Execute.

In the presentation layer, users can perform analysis via visualization tools that support the XMLA specification, such as JPivot and Excel pivot tables. The system also provides a management console to manage GFS / Bigtable, create data warehouse schema metadata, send query requests, and monitor execution.

2.2 Interface

The storage layer provides the data access interface for the calculation layer. The Map tasks and Reduce tasks read data in Bigtable via TableInputFormat and write it via TableOutputFormat. When the table comes from a relational database, the interface is DbInputFormat and DbOutputFormat. For files, access to GFS is wrapped as FileInputFormat and FileOutputFormat.

In the calculation layer, the interface for data cube construction is CubingMap and CubingReduce. The OLAP querying interface is QueryMap and QueryReduce. These interfaces are all invoked by the message layer's interface: Discover and Execute, two methods defined in XMLA.

3. Implementation Detail
3.1 Storage Structure

If the summary data in the data cube is structured, it is stored in a distributed table which we call Cubetable. Cubetable is built and managed by Bigtable. Therefore Cubetable is column oriented, which is more efficient for storage and querying of data cubes since data cubes are sparse in most situations.

    row name   | "dimension1:" | "dimension2:level1" | "dimension2:level2"
    data cube1 | t1: V1        | t1: V21             | t1: V22
    data cube1 | *             | t2: V21′            | *

Figure 2. A slice of Cubetable conceptual view

The Cubetable conceptual view is shown in Figure 2. The data cube unique name acts as the row name. The dimension unique name acts as the column family name, while the level unique name in the dimension acts as the qualifier of the column family. As soon as a record of the data cube is inserted into Cubetable, each cell in the same row for the record is under the same version, i.e. the same timestamp. When the record has the special value ALL or * in a dimension, the cell value is set to NULL.

3.2 Data Cube Construction

Because the closed cube achieves a very high data compression ratio, we used it to implement the data cube construction.

For the input data in files or tables specified by FileInputFormat or TableInputFormat respectively, the system partitions them into data blocks. As shown in Figure 3, every block's content as a whole is an input value with a unique id. The map function accepts the (blockid, blockdata) pairs and computes every local data cube by calling the DFS function (see the DFS definition in [9]). The cells are stored in a local data cube closedCells. Finally the closedCells identified by a blockid are output and stored in GFS/Bigtable. The CubingReduce does nothing except output the closed cube.

    Class CubingMap
      local variable closedCells;
      map(InputKey blockid, InputValue blockdata)
        cl = (ALL, …, ALL);
        call DFS(cl, blockdata, 0);   //generating and collecting the closed cells
        emit(blockid, closedCells);

Figure 3. The pseudocode of the cubing interface

The data cube is constructed locally. As the id keys of local data cubes are small and unique, the subsequent partitioning and merging of these keys produces little communication overhead and little data exchange between nodes.

We conducted the experiments for data cube construction on an 18-node PC cluster with a total of 36 cores, 18 GB RAM and 540 GB of disk. Although the cluster is not large, it is easy to add more nodes (say, to reach 100-1000 cores and 1 terabyte of disk) since it is shared-nothing.

For a fact table with 60 million rows (stored in a text file of size 1.37 GB), the data cubes are constructed in less than 5 minutes, and the output is a 2.98 GB file spread over 18 nodes. Even when the number of rows reaches 100 million, the construction time is about 7 minutes. The speedup is almost linear up to at least 36 cores. For high dimensions, say a fact table with 12 dimensions and 20 million rows, the construction time is only 273 seconds.

3.3 OLAP Query

The same query is sent to every data node. The parallel query task includes the QueryMap class and the QueryReduce class. As shown in Figure 4, the (blockid, closedcells) pair from files/tables is the input of the map
function. The query strings are stored in a file which can be accessed by all map functions. Then the map function searches every queried cell in its local data cube and emits a (cell, msr) intermediate key/value pair. These intermediate pairs are partitioned by the key, i.e. the name of the cell, so the measures are grouped by cell. Finally, the measures for a cell are reduced to one measure by applying the aggregate function (e.g. sum).

    Class QueryMap
      map(InputKey blockid, InputValue closedcells)
        get queried cells from a file;
        for each cell in queried cells
          msr = query(cell, closedcells);   //query cell in closedcells
          emit(cell, msr);
    Class QueryReduce
      reduce(InputKey cell, InputValue msrlist)
        result = 0;
        for each msr in msrlist
          result += msr;
        emit(cell, result);

Figure 4. The pseudocode of the querying interface

Even though the OLAP query involves every node, the result sets are small. Thus the partitioning and merging of the results causes little communication.

For the fact table with 60 million rows, answering 1,000 point queries takes only 203 seconds (the experimental environment is the same as for the data cube construction). Each point query takes 0.2 seconds on average. The query time also approaches linear speedup when the number of nodes increases from 5 to 17.

3.4 The Code Base

We implemented our system based on Hadoop [13]. Hadoop is a software platform that allows one to easily write and run parallel or distributed applications that process vast amounts of data. It incorporates features similar to those of the Google File System and of MapReduce. Hadoop also includes HBase, which is a column-oriented store modeled on Bigtable.

Although Hadoop is implemented in Java, the map and reduce computation tasks were all coded in C++ for efficiency. The C++ programs communicate with Hadoop through Hadoop Streaming.

4. Conclusions

HDW aims at building a large scale data warehouse that accommodates terabytes of data atop inexpensive PC clusters with thousands of nodes. Due to limited experimental conditions, at present we have demonstrated it on only 18 nodes with 36 cores. But in view of Hadoop's successfully sorting 20 terabytes on a 2000-node cluster within 2.5 hours [13], we believe that HDW has the same potential, which will be verified in the next step. Data extraction, transformation and loading (ETL) will also be considered for incorporation into HDW.

References

[1] David J. DeWitt, Samuel Madden, Michael Stonebraker. How to Build a High-Performance Data Warehouse. http://db.lcs.mit.edu/madden/high_perf.pdf
[2] Michael Stonebraker et al. C-Store: A Column Oriented DBMS. In Proceedings of VLDB, 2005.
[3] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. In 19th Symposium on Operating Systems Principles, 2003.
[4] Fay Chang, Jeffrey Dean, Sanjay Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating System Design and Implementation, 2006.
[5] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Symposium on Operating Systems Design and Implementation, 2004.
[6] PANDA. http://projects.cs.dal.ca/panda/
[7] Ying Chen, Frank Dehne, Todd Eavis et al. Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors. Distributed and Parallel Databases, 2004.
[8] Frank Dehne, Todd Eavis, Andrew Rau-Chaplin. The cgmCUBE project: Optimizing parallel data cube generation for ROLAP. Distributed and Parallel Databases, 2006.
[9] Laks V.S. Lakshmanan, Jian Pei, Yan Zhao. QC-Trees: An Efficient Summary Structure for Semantic OLAP. SIGMOD, 2003.
[10] Sanjay Goil, Alok Choudhary. High performance OLAP and data mining on parallel computers. Journal of Data Mining and Knowledge Discovery, 1(4):391–417, 1997.
[11] Sanjay Goil, Alok Choudhary. A parallel scalable infrastructure for OLAP and data mining. In Proc. International Data Engineering and Applications Symposium (IDEAS'99), Montreal, 1999.
[12] Raymond T. Ng, Alan Wagner, Yu Yin. Iceberg-cube Computation with PC Clusters. SIGMOD, 2001.
[13] Apache org. Hadoop. http://lucene.apache.org/hadoop/
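
The QueryMap and QueryReduce pseudocode of Figure 4 can be illustrated with a minimal single-process Python sketch. It is not the authors' C++ implementation. The names query_map, query_reduce and run_query, the dictionary representation of a local cube, and the exact-match query() stand-in are all assumptions made for illustration, and the sample data is invented. A real closed cube lookup follows [9] and is not reproduced here.

```python
from collections import defaultdict

ALL = "*"  # special dimension value, written ALL or * in the paper


def query(cell, closed_cells):
    """Hypothetical stand-in for query() in Figure 4: exact-match lookup only.

    A real closed-cube lookup would locate the closed cell covering `cell`;
    that logic is defined in [9] and is deliberately left out of this sketch.
    """
    return closed_cells.get(cell)


def query_map(block_id, closed_cells, queried_cells):
    """QueryMap: look up every queried cell in one block's local data cube."""
    for cell in queried_cells:
        msr = query(cell, closed_cells)
        if msr is not None:  # assumption: blocks without the cell emit nothing
            yield cell, msr


def query_reduce(cell, msr_list):
    """QueryReduce: reduce the per-block measures of one cell with sum."""
    result = 0
    for msr in msr_list:
        result += msr
    return cell, result


def run_query(local_cubes, queried_cells):
    """Toy driver that stands in for the MapReduce shuffle phase."""
    grouped = defaultdict(list)
    for block_id, closed_cells in local_cubes.items():
        for cell, msr in query_map(block_id, closed_cells, queried_cells):
            grouped[cell].append(msr)  # intermediate pairs grouped by cell
    return dict(query_reduce(cell, msrs) for cell, msrs in grouped.items())


# Two invented local cubes over dimensions (store, product), measure = sales.
local_cubes = {
    "block-0": {("s1", ALL): 10, (ALL, ALL): 25},
    "block-1": {("s1", ALL): 5, (ALL, ALL): 7},
}
totals = run_query(local_cubes, [("s1", ALL), (ALL, ALL)])
```

Because each map call only reads its own block's cube and the reduce step receives one short list of measures per queried cell, the intermediate data stays small, which matches the paper's observation that merging query results causes little communication.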