
MonetDB/X100: Hyper-Pipelining Query Execution

Peter Boncz, Marcin Zukowski, Niels Nes


CWI
Kruislaan 413
Amsterdam, The Netherlands
{P.Boncz,M.Zukowski,N.Nes}@cwi.nl

Abstract

Database systems tend to achieve only low IPC (instructions-per-cycle) efficiency on modern CPUs in compute-intensive application areas like decision support, OLAP and multimedia retrieval. This paper starts with an in-depth investigation of the reasons why this happens, focusing on the TPC-H benchmark. Our analysis of various relational systems and MonetDB leads us to a new set of guidelines for designing a query processor. The second part of the paper describes the architecture of our new X100 query engine for the MonetDB system, which follows these guidelines. On the surface, it resembles a classical Volcano-style engine, but the crucial difference of basing all execution on the concept of vector processing makes it highly CPU efficient. We evaluate the power of MonetDB/X100 on the 100GB version of TPC-H, showing its raw execution power to be between one and two orders of magnitude higher than previous technology.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 2005 CIDR Conference.

1 Introduction

Modern CPUs can perform enormous amounts of calculations per second, but only if they can find enough independent work to exploit their parallel execution capabilities. Hardware developments during the past decade have significantly increased the speed difference between a CPU running at full throughput and at minimal throughput, which can now easily be an order of magnitude.

One would expect that query-intensive database workloads such as decision support, OLAP and data mining, but also multimedia retrieval, which all require many independent calculations, should give modern CPUs the opportunity to reach near-optimal IPC (instructions-per-cycle) efficiencies.

However, research has shown that database systems tend to achieve low IPC efficiency on modern CPUs in these application areas [6, 3]. We question whether it should really be that way. Going beyond the (important) topic of cache-conscious query processing, we investigate in detail how relational database systems interact with modern hyper-pipelined CPUs in query-intensive workloads, in particular the TPC-H decision support benchmark.

The main conclusion we draw from this investigation is that the architecture employed by most DBMSs inhibits compilers from using their most performance-critical optimization techniques, resulting in low CPU efficiencies. In particular, the common way of implementing the popular Volcano [10] iterator model for pipelined processing leads to tuple-at-a-time execution, which both causes high interpretation overhead and hides opportunities for CPU parallelism from the compiler.

We also analyze the performance of the main-memory database system MonetDB (now open-source, see monetdb.cwi.nl), developed in our group, and its MIL query language [4]. MonetDB/MIL uses a column-at-a-time execution model, and therefore does not suffer from problems generated by tuple-at-a-time interpretation. However, its policy of full column materialization causes it to generate large data streams during query execution. On our decision support workload, we found MonetDB/MIL to become heavily constrained by memory bandwidth, causing its CPU efficiency to drop sharply.

Therefore, we argue for combining the column-wise execution of MonetDB with the incremental materialization offered by Volcano-style pipelining.

We designed and implemented from scratch a new query engine for the MonetDB system, called X100, that employs a vectorized query processing model.

Apart from achieving high CPU efficiency, MonetDB/X100 is intended to scale out towards non-main-memory (disk-based) datasets. The second part of this paper is dedicated to describing the architecture of MonetDB/X100 and evaluating its performance on the full TPC-H benchmark of size 100GB.

[Figure 1: A Decade of CPU Performance. Plot of CPU MHz and SPECcpu int+fp performance for the fastest CPU of each year, 1994-2002 (Alpha21064, Alpha21064A, Alpha21164A, Alpha21164B, Athlon, Pentium4, POWER4, Itanium2), together with the chip manufacturing scale in production (500nm, 350nm, 250nm, 130nm) and trend lines for inverted gate distance, pipelining and hyper-pipelining.]

1.1 Outline

This paper is organized as follows. Section 2 provides an introduction to modern hyper-pipelined CPUs, covering the issues most relevant for query evaluation performance. In Section 3, we study TPC-H Query 1 as a micro-benchmark of CPU efficiency, first for standard relational database systems, then in MonetDB, and finally we descend into a standalone hard-coded implementation of this query to get a baseline of maximum achievable raw performance. Section 4 describes the architecture of our new X100 query processor for MonetDB, focusing on query execution, but also sketching topics like data layout, indexing and updates. In Section 5, we present a performance comparison of MIL and X100 inside the Monet system on the TPC-H benchmark. We discuss related work in Section 6, before concluding in Section 7.

2 How CPUs Work

Figure 1 displays, for each year in the past decade, the fastest CPU available in terms of MHz as well as in terms of highest performance (the one does not necessarily imply the other), together with the most advanced chip manufacturing technology in production that year.

The root cause for CPU MHz improvements is progress in chip manufacturing process scales, which typically shrink by a factor 1.4 every 18 months (a.k.a. Moore's law [13]). Every smaller manufacturing scale means twice (the square of 1.4) as many, and twice smaller, transistors, as well as 1.4 times smaller wire distances and signal latencies. Thus one would expect CPU MHz to increase with inverted signal latencies, but Figure 1 shows that clock speed has increased even further. This is mainly done by pipelining: dividing the work of a CPU instruction into ever more stages. Less work per stage means that the CPU frequency can be increased. While the 1988 Intel 80386 CPU executed one instruction in one (or more) cycles, the 1993 Pentium already had a 5-stage pipeline, which was increased in the 1999 PentiumIII to 14, while the 2004 Pentium4 has 31 pipeline stages.

Pipelines introduce two dangers: (i) if one instruction needs the result of a previous instruction, it cannot be pushed into the pipeline right after it, but must wait until the first instruction has passed through the pipeline (or a significant fraction thereof), and (ii) in the case of IF-a-THEN-b-ELSE-c branches, the CPU must predict whether a will evaluate to true or false. It might guess the latter and put c into the pipeline, just after a. Many stages further, when the evaluation of a finishes, it may determine that it guessed wrongly (i.e. mispredicted the branch), and must then flush the pipeline (discard all instructions in it) and start over with b. Obviously, the longer the pipeline, the more instructions are flushed away and the higher the performance penalty. Translated to database systems, branches that are data-dependent, such as those found in a selection operator on data with a selectivity that is neither very high nor very low, are impossible to predict and can significantly slow down query execution [17].

In addition, hyper-pipelined CPUs offer the possibility to take multiple instructions into execution in parallel if they are independent. That is, the CPU has not one, but multiple pipelines. Each cycle, a new instruction can be pushed into each pipeline, provided, again, that it is independent of all instructions already in execution. With hyper-pipelining, a CPU can get to an IPC (Instructions Per Cycle) of > 1. Figure 1 shows that this has allowed real-world CPU performance to increase faster than CPU frequency.

Modern CPUs are balanced in different ways. The Intel Itanium2 processor is a VLIW (Very Large Instruction Word) processor with many parallel pipelines (it can execute up to 6 instructions per cycle) but only few (7) stages, and therefore a relatively low clock speed of 1.5GHz. In contrast, the Pentium4 has a very long 31-stage pipeline allowing for a 3.6GHz clock speed, but can only execute 3 instructions per cycle. Either way, to get to its theoretical maximum throughput, an Itanium2 needs 7x6 = 42 independent instructions at any time, while the Pentium4 needs 31x3 = 93. Such parallelism cannot always be found, and therefore many programs use the resources of the Itanium2 much better than those of the Pentium4, which explains why in benchmarks the performance of both CPUs is similar, despite the big clock speed difference.
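As an illustration of dependent versus independent work (ours, not taken from the paper), consider summing a column of doubles. A single accumulator creates one long dependency chain, while splitting the sum over several independent partial sums gives a hyper-pipelined CPU independent additions with which to fill its pipelines; a compiler may only perform such a rewrite itself when it can prove the iterations are independent.

    /* Illustration: exposing independent work to a hyper-pipelined CPU.     */
    double sum_chained(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];              /* each addition depends on the previous one */
        return s;
    }

    double sum_parallel(const double *a, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];             /* these four additions are independent,      */
            s1 += a[i + 1];         /* so they can occupy different pipelines     */
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)          /* remainder loop */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }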
Most programming languages do not require programmers to explicitly specify in their programs which instructions (or expressions) are independent. Therefore, compiler optimizations have become critical to achieving good CPU utilization. The most important technique is loop pipelining, in which an operation consisting of multiple dependent operations F(), G() on all n independent elements of an array A is transformed from:

F(A[0]),G(A[0]), F(A[1]),G(A[1]),.. F(A[n]),G(A[n])

into:

F(A[0]),F(A[1]),F(A[2]), G(A[0]),G(A[1]),G(A[2]), F(A[3]),..

Supposing the pipeline dependency latency of F() is 2 cycles, when G(A[0]) is taken into execution, the result of F(A[0]) has just become available.

In the case of the Itanium2 processor, the importance of the compiler is even stronger, as it is the compiler that has to find instructions that can go into different pipelines (other CPUs do that at run-time, using out-of-order execution). As the Itanium2 chip does not need any complex logic dedicated to finding out-of-order execution opportunities, it can contain more pipelines that do real work. The Itanium2 also has a feature called branch predication for eliminating branch mispredictions, by allowing both the THEN and ELSE blocks to be executed in parallel and discarding one of the results as soon as the outcome of the condition becomes known. It is also the task of the compiler to detect opportunities for branch predication.

Figure 2 shows a micro-benchmark of the selection query SELECT oid FROM table WHERE col < X, where col is uniformly and randomly distributed over [0:100] and we vary the selectivity X between 0 and 100. Normal CPUs like the AthlonMP show worst-case behavior around 50% selectivity, due to branch mispredictions. As suggested in [17], by rewriting the code cleverly, we can transform the branch into a boolean calculation (the "predicated" variant). Performance of this rewritten variant is independent of the selectivity, but incurs a higher average cost. Interestingly, the "branch" variant on Itanium2 is highly efficient and independent of selectivity as well, because the compiler transforms the branch into hardware-predicated code.

    int sel_lt_int_col_int_val(int n, int* res, int* in, int V) {
        for(int i=0, j=0; i<n; i++) {
            /* branch version */
            if (in[i] < V)
                res[j++] = i;
            /* predicated version */
            bool b = (in[i] < V);
            res[j] = i;
            j += b;
        }
        return j;
    }

Figure 2: Itanium Hardware Predication Eliminates Branch Mispredictions. (The accompanying plot shows elapsed time in msec as a function of query selectivity, 0-100%, for the "branch" and "predicated" variants on Itanium2 and AthlonMP.)

Finally, we should mention the importance of on-chip caches to CPU throughput. About 30% of all instructions executed by a CPU are memory loads and stores, which access data on DRAM chips, located inches away from the CPU on a motherboard. This imposes a physical lower bound on memory latency of around 50ns. This (ideal) minimum latency of 50ns already translates into 180 wait cycles for a 3.6GHz CPU. Thus, only if the overwhelming majority of the memory accessed by a program can be found in an on-chip cache does a modern CPU have a chance to operate at its maximum throughput. Recent database research has shown that DBMS performance is strongly impaired by memory access cost ("cache misses") [3], and can significantly improve if cache-conscious data structures are used, such as cache-aligned B-trees [15, 7] or column-wise data layouts such as PAX [2] and DSM [8] (as in MonetDB). Also, query processing algorithms that restrict their random memory access patterns to regions that fit a CPU cache, such as radix-partitioned hash-join [18, 11], strongly improve performance.

All in all, CPUs have become highly complex devices, where the instruction throughput of a processor can vary by orders of magnitude (!) depending on the cache hit-ratio of the memory loads and stores, the amount of branches and whether they can be predicted or predicated, as well as the amount of independent instructions a compiler and the CPU can detect on average. It has been shown that query execution in commercial DBMS systems gets an IPC of only 0.7 [6], thus executing less than one instruction per cycle. In contrast, scientific computation (e.g. matrix multiplication) or multimedia processing does extract average IPCs of up to 2 out of modern CPUs. We argue that database systems do not need to perform so badly, especially not on large-scale analysis tasks, where millions of tuples need to be examined and expressions to be calculated. This abundance of work contains plenty of independence that should be able to fill all the pipelines a CPU can offer. Hence, our quest is to adapt database architecture to expose this to the compiler and CPU where possible, and thus significantly improve query processing throughput.
3 Microbenchmark: TPC-H Query 1

While we target CPU efficiency of query processing in general, we first focus on expression calculation, discarding more complex relational operations (like join) to simplify our analysis. We choose Query 1 of the TPC-H benchmark, shown in Figure 3, because on all RDBMSs we tested, this query was CPU-bound and parallelizes trivially over multiple CPUs. Also, this query requires virtually no optimization or fancy join implementations, as its plan is so simple. Thus, all database systems operate on a level playing field and mainly expose their expression evaluation efficiency.

    SELECT l_returnflag, l_linestatus,
           sum(l_quantity) AS sum_qty,
           sum(l_extendedprice) AS sum_base_price,
           sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
           sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           avg(l_quantity) AS avg_qty,
           avg(l_extendedprice) AS avg_price,
           avg(l_discount) AS avg_disc,
           count(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date '1998-09-02'
    GROUP BY l_returnflag, l_linestatus

Figure 3: TPC-H Query 1

The TPC-H benchmark operates on a data warehouse of 1GB, the size of which can be increased with a Scaling Factor (SF). Query 1 is a scan on the lineitem table of SF*6M tuples, which selects almost all tuples (SF*5.9M), and computes a number of fixed-point decimal expressions: two column-to-constant subtractions, one column-to-constant addition, three column-to-column multiplications, and eight aggregates (four SUM()s, three AVG()s and a COUNT()). The aggregate grouping is on two single-character columns, and yields only 4 unique combinations, such that it can be done efficiently with a hash-table, requiring no additional I/O.

In the following, we analyze the performance of Query 1 first on relational database systems, then on MonetDB/MIL and finally in a hard-coded program.

TPC-H Query 1 Experiments
                  sec/(#CPU*SF)    SF  #CPU  platform, SPECcpu int/fp
DBMS "X"          28.1              1   1    AthlonMP 1533MHz, 609/547
MySQL 4.1         26.6              1   1    AthlonMP 1533MHz, 609/547
MonetDB/MIL        3.7              1   1    AthlonMP 1533MHz, 609/547
MonetDB/MIL        3.4              1   1    Itanium2 1.3GHz, 1132/1891
hard-coded         0.22             1   1    AthlonMP 1533MHz, 609/547
hard-coded         0.14             1   1    Itanium2 1.3GHz, 1132/1891
MonetDB/X100       0.50             1   1    AthlonMP 1533MHz, 609/547
MonetDB/X100       0.31             1   1    Itanium2 1.3GHz, 1132/1891
MonetDB/X100       0.30           100   1    Itanium2 1.3GHz, 1132/1891

TPC-H Query 1 Reference Results (www.tpc.org)
Oracle10g         18.1            100  16    Itanium2 1.3GHz, 1132/1891
Oracle10g         13.2           1000  64    Itanium2 1.5GHz, 1408/2161
SQLserver2000     18.0            100   2    Xeon P4 3.0GHz, 1294/1208
SQLserver2000     21.8           1000   8    Xeon P4 2.8GHz, 1270/1094
DB2 UDB 8.1        9.0            100   4    Itanium2 1.5GHz, 1408/2161
DB2 UDB 8.1        7.4            100   2    Opteron 2.0GHz, 1409/1514
Sybase IQ 12.5    15.6            100   2    USIII 1.28GHz, 704/1054
Sybase IQ 12.5    15.8           1000   2    USIII 1.28GHz, 704/1054

Table 1: TPC-H Query 1 Performance

3.1 Query 1 on Relational Database Systems

Since the early days of RDBMSs, query execution functionality has been provided by implementing a physical relational algebra, typically following the Volcano [10] model of pipelined processing. Relational algebra, however, has a high degree of freedom in its parameters. For instance, even a simple ScanSelect(R, b, P) only receives at query time full knowledge of the format of the input relation R (number of columns, their types, and record offsets), the boolean selection expression b (which may be of any form), and a list of projection expressions P (each of arbitrary complexity) that define the output relation. In order to deal with all possible R, b, and P, DBMS implementors must in fact implement an expression interpreter that can handle expressions of arbitrary complexity.

One of the dangers of such an interpreter, especially if the granularity of interpretation is a tuple, is that the cost of the "real work" (i.e. executing the expressions found in the query) is only a tiny fraction of total query execution cost. We can see this happening in Table 2, which shows a gprof trace of MySQL 4.1 executing TPC-H Query 1 on a database of SF=1. The second column shows the percentage of total execution time spent in the routine, excluding time spent in routines it called (excl.). The first column is a cumulative sum of the second (cum.). The third column lists how many times the routine was called, while the fourth and fifth columns show the average number of instructions executed on each call, as well as the IPC achieved.

cum.  excl.  calls   ins.  IPC   function
11.9  11.9    846M      6  0.64  ut_fold_ulint_pair
20.4   8.5   0.15M    27K  0.71  ut_fold_binary
26.2   5.8     77M     37  0.85  memcpy
29.3   3.1     23M     64  0.88  Item_sum_sum::update_field
32.3   3.0      6M    247  0.83  row_search_for_mysql
35.2   2.9     17M     79  0.70  Item_sum_avg::update_field
37.8   2.6    108M     11  0.60  rec_get_bit_field_1
40.3   2.5      6M    213  0.61  row_sel_store_mysql_rec
42.7   2.4     48M     25  0.52  rec_get_nth_field
45.1   2.4      60    19M  0.69  ha_print_info
47.5   2.4    5.9M    195  1.08  end_update
49.6   2.1     11M     89  0.98  field_conv
51.6   2.0    5.9M     16  0.77  Field_float::val_real
53.4   1.8    5.9M     14  1.07  Item_field::val
54.9   1.5     42M     17  0.51  row_sel_field_store_in_mysql..
56.3   1.4     36M     18  0.76  buf_frame_align
57.6   1.3     17M     38  0.80  Item_func_mul::val
59.0   1.4     25M     25  0.62  pthread_mutex_unlock
60.2   1.2    206M      2  0.75  hash_get_nth_cell
61.4   1.2     25M     21  0.65  mutex_test_and_set
62.4   1.0    102M      4  0.62  rec_get_1byte_offs_flag
63.4   1.0     53M      9  0.58  rec_1_get_field_start_offs
64.3   0.9     42M     11  0.65  rec_get_nth_field_extern_bit
65.3   1.0     11M     38  0.80  Item_func_minus::val
65.8   0.5    5.9M     38  0.80  Item_func_plus::val

Table 2: MySQL gprof trace of TPC-H Q1: +, -, *, SUM, AVG take <10%, low IPC of 0.7

The first observation to make is that the five operations that do all the "work" (the Item_ routines that implement +, -, *, SUM and AVG) correspond to only 10% of total execution time. Closer inspection shows that 28% of execution time is taken up by creation and lookup in the hash-table used for aggregation. The remaining 62% of execution time is spread over functions like rec_get_nth_field, which navigate through MySQL's record representation and copy data in and out of it.

The second observation is the cost of the Item operations that correspond to the computational "work" of the query. For example, Item_func_plus::val has a cost of 38 instructions per addition. This performance trace was made on an SGI machine with a MIPS R12000 CPU (on our Linux test platforms, no multi-threaded profiling tools seem to be available), which can execute three integer or floating-point instructions and one load/store per cycle, with an average operation latency of about 5 cycles. A simple arithmetic operation +(double src1, double src2) : double in RISC instructions would look like:

    LOAD src1,reg1
    LOAD src2,reg2
    ADD  reg1,reg2,reg3
    STOR dst,reg3

The limiting factor in this code are the three load/store instructions, thus a MIPS processor can do one such operation per 3 cycles. This is in sharp contrast to the MySQL cost of #ins/IPC = 38/0.8 = 49 cycles! One explanation for this high cost is the absence of loop pipelining. As the routine called by MySQL only computes one addition per call, instead of an array of additions, the compiler cannot perform loop pipelining. Thus, the addition consists of four dependent instructions that have to wait for each other. With a mean instruction latency of 5 cycles, this explains a cost of about 20 cycles. The rest of the 49 cycles are spent on jumping into the routine, and pushing and popping the stack.
The consequence of the MySQL policy of executing expressions tuple-at-a-time is twofold:

• Item_func_plus::val only performs one addition, preventing the compiler from creating a pipelined loop. As the instructions for one operation are highly dependent, empty pipeline slots (stalls) must be generated to wait for the instruction latencies, such that the cost of the loop becomes 20 instead of 3 cycles.

• The cost of the routine call (in the ballpark of 20 cycles) must be amortized over only one operation, which effectively doubles the operation cost.

We also tested the same query on a well-known commercial RDBMS (see the first row of Table 1). As we obviously lack the source code of this product, we cannot produce a gprof trace. However, the query evaluation cost on this DBMS is very similar to MySQL. The lower part of Table 1 includes some official Query 1 results taken from the TPC website. We normalized all times towards SF=1 and a single CPU by assuming linear scaling. We also provide the SPECcpu int/float scores of the various hardware platforms used. We mainly do this in order to check that the relational DBMS results we obtained are roughly in the same ballpark as what is published by TPC. This leads us to believe that what we see in the MySQL trace is likely representative of what happens in commercial RDBMS implementations.

3.2 Query 1 on MonetDB/MIL

The MonetDB system [4], developed at our group, is mostly known for its use of vertical fragmentation, storing tables column-wise, with each column in a Binary Association Table (BAT) that contains [oid,value] combinations. A BAT is a 2-column table where the left column is called head and the right column tail. The algebraic query language of MonetDB is a column-algebra called MIL [5].

In contrast to the relational algebra, the MIL algebra does not have any degree of freedom. Its algebraic operators have a fixed number of parameters of a fixed format (all two-column tables or constants). The expression calculated by an operator is fixed, as well as the shape of the result. For example, the MIL join(BAT[tl,te] A, BAT[te,tr] B) : BAT[tl,tr] is an equi-join between the tail column of A and the head column of B, which for each matching combination of tuples returns the head value from A and the tail value from B. The mechanism in MIL to join on the other column (i.e. the head, instead of the tail) of A is to use the MIL reverse(A) operator, which returns a view on A with its columns swapped: BAT[te,tl]. This reverse is a zero-cost operation in MonetDB that just swaps some pointers in the internal representation of a BAT. Complex expressions must be executed using multiple statements in MIL. For example, extprice * (1 - tax) becomes tmp1 := [-](1,tax); tmp2 := [*](extprice,tmp1), where [*]() and [-]() are multiplex operators that "map" a function onto an entire BAT (column). MIL executes in column-wise fashion in the sense that its operators always consume a number of materialized input BATs and materialize a single output BAT.
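What "multiplex" means operationally is a tight loop over aligned columns that writes a fully materialized result column. The following C sketch is our illustration (MonetDB's real implementation is macro-generated and handles oids, types and BAT properties); it shows both why the loop itself is compiler-friendly and why every intermediate expression costs a full column of memory traffic.

    /* Illustration (not MonetDB source): a MIL-style multiplex operator such
     * as [*](extprice, tmp1). It consumes whole materialized columns and
     * produces a new, fully materialized result column. The loop is easily
     * loop-pipelined, but every intermediate is written to and later read
     * from memory, which is what makes MIL bandwidth-bound at SF=1. */
    #include <stdlib.h>

    double *multiplex_mul_dbl(size_t n, const double *col1, const double *col2) {
        double *res = malloc(n * sizeof(double));   /* full result column */
        if (res == NULL)
            return NULL;
        for (size_t i = 0; i < n; i++)
            res[i] = col1[i] * col2[i];
        return res;                                 /* caller owns the new BAT tail */
    }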
We used the MonetDB/MIL SQL front-end to translate TPC-H Query 1 into MIL and run it. Table 3 shows all 20 MIL invocations that together span more than 99% of elapsed query time. On TPC-H Query 1, MonetDB/MIL is clearly faster than MySQL and the commercial DBMS on the same machine, and is also competitive with the published TPC-H scores (see Table 1). However, closer inspection of Table 3 shows that almost all MIL operators are memory-bound instead of CPU-bound! This was established by running the same query plan on the TPC-H dataset with SF=0.001, such that all used columns of the lineitem table as well as all intermediate results fit inside the CPU cache, eliminating any memory traffic. MonetDB/MIL then becomes almost 2 times as fast.

     SF=1          SF=0.001
ms    BW      us    BW     tot MB  res size  MIL statement
127   352     150   305    45      5.9M      s0 := select(l_shipdate).mark
134   505     113   608    68      5.9M      s1 := join(s0,l_returnflag)
134   506     113   608    68      5.9M      s2 := join(s0,l_linestatus)
235   483     129   887    114     5.9M      s3 := join(s0,l_extprice)
233   488     130   881    114     5.9M      s4 := join(s0,l_discount)
232   489     127   901    114     5.9M      s5 := join(s0,l_tax)
134   507     104   660    68      5.9M      s6 := join(s0,l_quantity)
290   155     324   141    45      5.9M      s7 := group(s1)
329   136     368   124    45      5.9M      s8 := group(s7,s2)
0     0       0     0      0       4         s9 := unique(s8.mirror)
206   440     60    1527   91      5.9M      r0 := [+](1.0,s5)
210   432     51    1796   91      5.9M      r1 := [-](1.0,s4)
274   498     83    1655   137     5.9M      r2 := [*](s3,r1)
274   499     84    1653   137     5.9M      r3 := [*](r2,r0)
165   271     121   378    45      4         r4 := {sum}(r3,s8,s9)
165   271     125   366    45      4         r5 := {sum}(r2,s8,s9)
163   275     128   357    45      4         r6 := {sum}(s3,s8,s9)
163   275     128   357    45      4         r7 := {sum}(s4,s8,s9)
144   151     107   214    22      4         r8 := {sum}(s6,s8,s9)
112   196     145   157    22      4         r9 := {count}(s7,s8,s9)
3724          2327                           TOTAL

Table 3: MonetDB/MIL trace of TPC-H Query 1 (BW in MB/s)

Columns 2 and 4 list the bandwidth (BW) in MB/s achieved by the individual MIL operations, counting both the size of the input BATs and the produced output BAT. On SF=1, MonetDB gets stuck at 500MB/s, which is the maximum bandwidth sustainable on this hardware [1]. When running purely in the CPU cache at SF=0.001, bandwidths can get above 1.5GB/s. For the multiplexed multiplication [*](), a bandwidth of only 500MB/s means 20M tuples per second (16 bytes in, 8 bytes out), thus 75 cycles per multiplication on our 1533MHz CPU, which is even worse than MySQL.

Thus, the column-at-a-time policy in MIL turns out to be a two-edged sword. To its advantage is the fact that MonetDB is not prone to the MySQL problem of spending 90% of its query execution time in tuple-at-a-time interpretation "overhead". As the multiplex operations that perform expression calculations work on entire BATs (basically arrays whose layout is known at compile-time), the compiler is able to employ loop-pipelining, such that these operators achieve high CPU efficiencies, embodied by the SF=0.001 results.

However, we identify the following problems with full materialization. First, queries that contain complex calculation expressions over many tuples will materialize an entire result column for each function in the expression. Often, such function results are not required in the query result, but just serve as inputs to other functions in the expression. For instance, if an aggregation is the top-most operator in the query plan, the eventual result size might even be negligible (such as in Query 1). In such cases, MIL materializes much more data than strictly necessary, causing its high bandwidth consumption.

Also, Query 1 starts with a 98% selection of the 6M-tuple table, and performs the aggregations on the remaining 5.9M tuples. Again, MonetDB materializes the relevant result columns of the select() using six positional join()s. These joins are not required in a Volcano-like pipelined execution model, which can do the selection, computations and aggregation all in a single pass, without materializing any data.

While in this paper we concentrate on CPU efficiency in main-memory scenarios, we point out that the "artificially" high bandwidths generated by MonetDB/MIL make it harder to scale the system to disk-based problems efficiently, simply because memory bandwidth tends to be much greater (and cheaper) than I/O bandwidth. Sustaining, say, a 1.5GB/s data transfer would require a truly high-end RAID system with an awful lot of disks.

3.3 Query 1: Baseline Performance

To get a baseline of what modern hardware can do on a problem like Query 1, we implemented it as a single UDF in MonetDB, as shown in Figure 4. The UDF gets passed only those columns touched by the query. In MonetDB, these columns are stored as arrays in BAT[void,T]s. That is, the oid values in the head column are densely ascending from 0 upwards. In such cases, MonetDB uses voids ("virtual-oids") that are not stored, and the BAT takes the form of an array. We pass these arrays as restrict pointers, such that the C compiler knows that they are non-overlapping. Only then can it apply loop-pipelining!

    static void tpch_query1(int n, int hi_date,
                            unsigned char*__restrict__ p_returnflag,
                            unsigned char*__restrict__ p_linestatus,
                            double*__restrict__ p_quantity,
                            double*__restrict__ p_extendedprice,
                            double*__restrict__ p_discount,
                            double*__restrict__ p_tax,
                            int*__restrict__ p_shipdate,
                            aggr_t1*__restrict__ hashtab)
    {
        for(int i=0; i<n; i++) {
            if (p_shipdate[i] <= hi_date) {
                aggr_t1 *entry = hashtab +
                    (p_returnflag[i]<<8) + p_linestatus[i];
                double discount = p_discount[i];
                double extprice = p_extendedprice[i];
                entry->count++;
                entry->sum_qty += p_quantity[i];
                entry->sum_disc += discount;
                entry->sum_base_price += extprice;
                entry->sum_disc_price += (extprice *= (1-discount));
                entry->sum_charge += extprice*(1-p_tax[i]);
            }
        }
    }

Figure 4: Hard-Coded UDF for Query 1 in C

This implementation exploits the fact that a GROUP BY on two single-byte characters can never yield more than 65536 combinations, such that their combined bit-representation can be used directly as an array index into the table with aggregation results. Like in MonetDB/MIL, we performed some common subexpression elimination such that one minus and three AVG aggregates can be omitted.
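Figure 4 leaves the aggr_t1 aggregate record implicit. A plausible definition, reconstructed from the fields the UDF updates (our reconstruction, not the original source), together with the 65536-slot table it indexes, would be the following; the AVG columns are then derived from the SUMs and the count after the scan, which is the common-subexpression elimination mentioned above.

    /* Reconstructed from the fields used in Figure 4; not the original code. */
    typedef struct {
        long   count;            /* COUNT(*)                                 */
        double sum_qty;          /* SUM(l_quantity)                          */
        double sum_base_price;   /* SUM(l_extendedprice)                     */
        double sum_disc_price;   /* SUM(l_extendedprice * (1 - l_discount))  */
        double sum_charge;       /* SUM(extprice * (1-discount) * (1+tax))   */
        double sum_disc;         /* SUM(l_discount), used to derive avg_disc */
    } aggr_t1;

    /* one slot per possible (returnflag, linestatus) byte pair */
    aggr_t1 hashtab[256 * 256];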
Table 1 shows that this UDF implementation (labeled "hard-coded") reduces query evaluation cost to a stunning 0.22 seconds. From the same table, you will notice that our new X100 query processor, which is the topic of the remainder of this paper, is able to get within a factor 2 of this hard-coded implementation.

4 X100: A Vectorized Query Processor

The goal of X100 is to (i) execute high-volume queries at high CPU efficiency, (ii) be extensible to other application domains like data mining and multi-media retrieval, and achieve those same high efficiencies on extensibility code, and (iii) scale with the size of the lowest storage hierarchy (disk).

In order to achieve our goals, X100 must fight bottlenecks throughout the entire computer architecture:

Disk: the ColumnBM I/O subsystem of X100 is geared towards efficient sequential data access. To reduce bandwidth requirements, it uses a vertically fragmented data layout, which in some cases is enhanced with lightweight data compression.

RAM: like I/O, RAM access is carried out through explicit memory-to-cache and cache-to-memory routines (which contain platform-specific optimizations, sometimes including e.g. SSE prefetching and data movement assembly instructions). The same vertically partitioned and even compressed disk data layout is used in RAM to save space and bandwidth.

Cache: we use a Volcano-like execution pipeline based on a vectorized processing model. Small (<1000 values) vertical chunks of cache-resident data items, called "vectors", are the unit of operation for X100 execution primitives. The CPU cache is the only place where bandwidth does not matter, and therefore (de)compression happens on the boundary between RAM and cache. The X100 query processing operators should be cache-conscious and fragment huge datasets efficiently into cache-chunks, performing random data access only there.

CPU: vectorized primitives expose to the compiler that processing a tuple is independent of the previous and next tuples. Vectorized primitives for projections (expression calculation) do this easily, but we try to achieve the same for other query processing operators as well (e.g. aggregation). This allows compilers to produce efficient loop-pipelined code. To improve the CPU throughput further (mainly by reducing the amount of load/stores in the instruction mix), X100 contains facilities to compile vectorized primitives for whole expression sub-trees rather than single functions. Currently, this compilation is statically steered, but it may eventually become a run-time activity mandated by an optimizer.

To maintain focus in this paper, we only summarily describe disk storage issues, also because the ColumnBM buffer manager is still under development. In all our experiments, X100 uses MonetDB as its storage manager (as shown in Figure 5), where it operates on in-memory BATs.

[Figure 5: X100 Software Architecture. Components shown include the SQL front-end and MetaData feeding the X100 Parser, X100 Cost Model and X100 Optimizer, which produce X100 algebra executed by X100 vectorized primitives; a primitive generator that turns code patterns and signature requests into a makefile and dynamically loaded (dll) primitives with dynamic signatures; and the two storage back-ends, MonetDB 4.3 (MIL operators, MILgen) and ColumnBM, underlying 'MonetDB/X100' next to the existing 'MonetDB/MIL' stack.]

4.1 Query Language

X100 uses a rather standard relational algebra as its query language. We departed from the column-at-a-time MIL language so that the relational operators can process (vectors of) multiple columns at the same time, allowing a vector produced by one expression to be used as the input to another, while the data is in the CPU cache.

4.1.1 Example

To demonstrate the behavior of MonetDB/X100, Figure 6 presents the execution of a simplified version of TPC-H Query 1, with the following X100 relational algebra syntax:

    Aggr(
      Project(
        Select(
          Table(lineitem),
          < (shipdate, date('1998-09-03'))),
        [ discountprice = *( -( flt('1.0'), discount),
                             extendedprice) ]),
      [ returnflag ],
      [ sum_disc_price = sum(discountprice) ])
Execution proceeds using Volcano-like pipelining, on the granularity of a vector (e.g. 1000 values). The Scan operator retrieves data vector-at-a-time from Monet BATs. Note that only attributes relevant for the query are actually scanned.

A second step is the Select operator, which creates a selection-vector, filled with the positions of tuples that match our predicate. Then the Project operator is executed to calculate the expressions needed for the final aggregation. Note that the "discount" and "extendedprice" columns are not modified during selection. Instead, the selection-vector is taken into account by map-primitives to perform calculations only for relevant tuples, writing results at the same positions in the output vector as they had in the input one. This behavior requires propagating the selection-vector to the final Aggr. There, for each tuple its position in the hash table is calculated, and then, using this data, aggregate results are updated. Additionally, for new elements in the hash table, the values of the grouping attribute are saved. The contents of the hash-table become available as the query result as soon as the underlying operators become exhausted and cannot produce more vectors.

[Figure 6: Execution scheme of a simplified TPC-H Query 1 in MonetDB/X100. The diagram shows the pipeline bottom-up: a SCAN of the shipdate, returnflag, discount and extendedprice columns; a SELECT producing a selection vector via select_lt_date_col_date_val (shipdate < 1998-09-03); a PROJECT computing the discountprice vector with map_sub_flt_val_flt_col (1.0 - discount) and map_mul_flt_col_flt_col; and an AGGREGATE that uses map_hash_chr_col to compute positions in the hash table, aggr_sum_flt_col to update sum_disc_price, and hash-table maintenance for returnflag.]

4.1.2 X100 Algebra

Figure 7 lists the currently supported X100 algebra operators. In X100 algebra, a Table is a materialized relation, whereas a Dataflow just consists of tuples flowing through a pipeline.

    Table(ID) : Table
    Scan(Table) : Dataflow
    Array(List<Exp<int>>) : Dataflow
    Select(Dataflow, Exp<bool>) : Dataflow
    Join(Dataflow, Table, Exp<bool>, List<Column>) : Dataflow
      CartProd(Dataflow, Table, List<Column>)
      Fetch1Join(Dataflow, Table, Exp<int>, List<Column>)
      FetchNJoin(Dataflow, Table, Exp<int>,
                 Exp<int>, Column, List<Column>)
    Project(Dataflow, List<Exp<*>>) : Dataflow
    Aggr(Dataflow, List<Exp<*>>, List<AggrExp>) : Dataflow
      OrdAggr(Dataflow, List<Exp<*>>, List<AggrExp>)
      DirectAggr(Dataflow, List<Exp<*>>, List<AggrExp>)
      HashAggr(Dataflow, List<Exp<*>>, List<AggrExp>)
    TopN(Dataflow, List<OrdExp>, List<Exp<*>>, int) : Dataflow
    Order(Table, List<OrdExp>, List<AggrExp>) : Table

Figure 7: X100 Query Algebra

Order, TopN and Select return a Dataflow with the same shape as their input. The other operators define a Dataflow with a new shape. Some peculiarities of this algebra are that Project is just used for expression calculation; it does not eliminate duplicates. Duplicate elimination can be performed using an Aggr with only group-by columns. The Array operator generates a Dataflow representing an N-dimensional array as an N-ary relation containing all valid array index coordinates in column-major dimension order. It is used by the RAM array-manipulation front-end for the MonetDB system [9].

Aggregation is supported by three physical operators: (i) direct aggregation, (ii) hash aggregation, and (iii) ordered aggregation. The latter is chosen if all group-members arrive right after each other in the source Dataflow. Direct aggregation can be used for small datatypes where the bit-representation is limited to a known (small) domain, similar to the way aggregation was handled in the "hard-coded" solution (Section 3.3). In all other cases, hash-aggregation is used.

X100 currently only supports left-deep joins. The default physical implementation is a CartProd operator with a Select on top (i.e. nested-loop join). If X100 detects a foreign-key condition in a join condition, and a join-index is available, it exploits it with a Fetch1Join or FetchNJoin.

The inclusion of these fetch-joins in X100 is no coincidence. In MIL, the "positional-join" of an oid into a void column has proven valuable on vertically fragmented data stored in dense columns. Positional joins allow the "extra" joins needed for vertical fragmentation to be handled in a highly efficient way [4]. Just like the void type in MonetDB, X100 gives each table a virtual #rowId column, which is just a densely ascending number from 0. The Fetch1Join allows column values to be fetched positionally by #rowId.
4.2 Vectorized Primitives

The primary reason for using the column-wise vector layout is not to optimize memory layout in the cache (X100 is supposed to operate on cached data anyway). Rather, vectorized execution primitives have the advantage of a low degree of freedom (as discussed in Section 3.2). In a vertically fragmented data model, the execution primitives only know about the columns they operate on, without having to know about the overall table layout (e.g. record offsets). When compiling X100, the C compiler sees that the X100 vectorized primitives operate on restricted (independent) arrays of fixed shape. This allows it to apply aggressive loop pipelining, critical for modern CPU performance (see Section 2). As an example, we show the (generated) code for vectorized floating-point addition:

    map_plus_double_col_double_col(int n,
        double*__restrict__ res,
        double*__restrict__ col1, double*__restrict__ col2,
        int*__restrict__ sel)
    {
        if (sel) {
            for(int j=0; j<n; j++) {
                int i = sel[j];
                res[i] = col1[i] + col2[i];
            }
        } else {
            for(int i=0; i<n; i++)
                res[i] = col1[i] + col2[i];
        }
    }

The sel parameter may be NULL or point to an array of n selected array positions (i.e. the "selection-vector" from Figure 6). All X100 vectorized primitives allow passing such selection vectors. The rationale is that after a selection, leaving the vectors delivered by the child operator intact is often quicker than copying all selected data into new (contiguous) vectors.

X100 contains hundreds of vectorized primitives. These are not written (and maintained) by hand, but are generated from primitive patterns. The primitive pattern for addition is:

    any::1 +(any::1 x, any::1 y) plus = x + y

This pattern states that an addition of two values of the same type (but without any type restriction) is implemented in C by the infix operator +. It produces a result of the same type, and the name identifier should be plus. Type-specific patterns later in the specification file may override this pattern (e.g. str +(str x, str y) concat = str_concat(x,y)).

The other part of primitive generation is a file with map signature requests:

    +(double*, double*)
    +(double,  double*)
    +(double*, double)
    +(double,  double)

This requests the generation of all possible combinations of addition between single values and columns (the latter identified with an extra *). Other extensible RDBMSs often only allow UDFs with single-value parameters [19]. This inhibits loop pipelining, reducing performance (see Section 3.1). (If X100 is used in resource-restricted environments, the size of the X100 binary, currently less than a MB, could be further reduced by omitting the column-versions of certain execution primitives. X100 will still be able to process those primitives, although more slowly, with a vector size of 1.)

We can also request compound primitive signatures:

    /(square(-(double*, double*)), double*)

The above signature is the Mahalanobis distance, a performance-critical operation for some multi-media retrieval tasks [9]. We found that the compound primitives often perform twice as fast as the single-function vectorized primitives. Note that this factor 2 is similar to the difference between MonetDB/X100 and the hard-coded implementation of TPC-H Query 1 in Table 1. The reason why compound primitives are more efficient is a better instruction mix. As in the example with addition on the MIPS processor in Section 3.1, vectorized execution often becomes load/store bound, because for simple 2-ary calculations, each vectorized instruction requires loading two parameters and storing one result (1 work instruction, 3 memory instructions). Modern CPUs can typically only perform 1 or 2 load/store operations per cycle. In compound primitives, the results from one calculation are passed via a CPU register to the next calculation, with load/stores only occurring at the edges of the expression graph.

Currently, the primitive generator is not much more than a macro expansion script in the make sequence of the X100 system. However, we intend to implement dynamic compilation of compound primitives as mandated by an optimizer.

A slight variation on the map primitives are the select_* primitives (see also Figure 2). These only exist for code patterns that return a boolean. Instead of producing a full result vector of booleans (as the map does), the select primitives fill a result array of selected vector positions (integers), and return the total number of selected tuples.

Similarly, there are the aggr_* primitives that calculate aggregates like count, sum, min, and max. For each, an initialization, an update, and an epilogue pattern need to be specified. The primitive generator then generates the relevant routines for the various implementations of aggregation in X100.

The X100 mechanism of allowing database extension developers to provide (source-)code patterns instead of compiled code allows all ADTs to get first-class-citizen treatment during query execution. This was also a weak point of MIL (and of most extensible DBMSs [19]), as its main algebraic operators were only optimized for the built-in types.
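To see where the factor 2 comes from, compare what the generator would emit for the compound Mahalanobis signature with the three separate map primitives it replaces. The sketch below is our illustration of the idea (the function and parameter names, and the exact generated shape, are assumptions, not X100 output): in the compound version the intermediate difference and square never touch memory.

    /* Illustration (ours): separate primitives store and reload every
     * intermediate vector ... */
    void map_sub_dbl_col_dbl_col(int n, double *res, const double *a, const double *b) {
        for (int i = 0; i < n; i++) res[i] = a[i] - b[i];
    }
    void map_square_dbl_col(int n, double *res, const double *a) {
        for (int i = 0; i < n; i++) res[i] = a[i] * a[i];
    }
    void map_div_dbl_col_dbl_col(int n, double *res, const double *a, const double *b) {
        for (int i = 0; i < n; i++) res[i] = a[i] / b[i];
    }

    /* ... while a compound primitive for /(square(-(double*,double*)),double*)
     * keeps the intermediates in registers, so only three input loads and one
     * result store remain per tuple. */
    void map_div_square_sub_dbl_col(int n, double *res,
                                    const double *x, const double *mu,
                                    const double *sigma) {
        for (int i = 0; i < n; i++) {
            double d = x[i] - mu[i];     /* stays in a register */
            res[i] = (d * d) / sigma[i];
        }
    }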
4.3 Data Storage

MonetDB/X100 stores all tables in vertically fragmented form. The storage scheme is the same whether the new ColumnBM buffer manager is used or MonetDB BAT[void,T] storage. While MonetDB stores each BAT in a single continuous file, ColumnBM partitions those files into large (>1MB) chunks.

A disadvantage of vertical storage is an increased update cost: a single row update or delete must perform one I/O for each column. MonetDB/X100 circumvents this by treating the vertical fragments as immutable objects. Updates go to delta structures instead. Figure 8 shows that deletes are handled by adding the tuple ID to a deletion list, and that inserts lead to appends in separate delta columns. ColumnBM actually stores all delta columns together in a chunk, which equates to PAX [2]. Thus, both operations incur only one I/O. Updates are simply a deletion followed by an insertion. Updates make the delta columns grow, such that whenever their size exceeds a (small) percentile of the total table size, data storage should be reorganized, such that the vertical storage is up-to-date again and the delta columns are empty.

[Figure 8: Vertical Storage and Updates. The example table (columns key, flag, shipmod; rows #0-#9) is stored in immutable column blocks managed by the buffer manager; "delete from TABLE where key=F" only adds tuple #5 to the #del deletion list, and "insert into TABLE values (K,d,m)" only appends row #10 (K, d, m) to separate delta columns, leaving the column storage blocks untouched on updates.]

An advantage of vertical storage is that queries that access many tuples but not all columns save bandwidth (this holds both for RAM bandwidth and I/O bandwidth). We further reduce bandwidth requirements using lightweight compression. MonetDB/X100 supports enumeration types, which effectively store a column as a single-byte or two-byte integer. This integer refers to the #rowId of a mapping table. MonetDB/X100 automatically adds a Fetch1Join operation to retrieve the uncompressed value using the small integer when such columns are used in a query. Notice that since the vertical fragments are immutable, updates just go to the delta columns (which are never compressed) and do not complicate the compression scheme.

MonetDB/X100 also supports simple "summary" indices, similar to [12], which are used if a column is clustered (almost sorted). These summary indices contain a #rowId, the running maximum value of the column until that point in the base table, and a reversely running minimum, at a very coarse granularity (the default size is 1000 entries, with #rowIds taken at fixed intervals from the base table). These summary indices can be used to quickly derive #rowId bounds for range predicates. Notice again that, due to the property that vertical fragments are immutable, indices on them effectively require no maintenance. The delta columns, which are supposed to be small and in-memory, are not indexed and must always be accessed.
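The range-restriction use of a summary index can be made concrete with a small sketch (ours; the paper does not specify X100's actual index layout beyond the description above): with a running maximum and a reversely running minimum per coarse granule, the first and last granules that can contain values in [lo,hi] bound the #rowIds a clustered-column scan has to visit.

    /* Illustration (ours) of the summary-index idea: one entry per coarse
     * granule of an (almost sorted) column, holding the running maximum up
     * to that point and the reversely running minimum from that point on. */
    typedef struct {
        int    first_rowid;   /* #rowId where this granule starts            */
        double run_max;       /* max of column[0 .. end of this granule]     */
        double rev_min;       /* min of column[start of this granule .. n-1] */
    } summary_entry;

    /* Derive [*lo_rowid, *hi_rowid) bounds for the predicate lo <= col <= hi. */
    void summary_range(const summary_entry *idx, int entries, int nrows,
                       double lo, double hi, int *lo_rowid, int *hi_rowid) {
        int first = 0, last = entries;
        /* granules whose running maximum is still below lo cannot contain
         * qualifying tuples yet; granules whose reversely running minimum
         * already exceeds hi cannot contain qualifying tuples anymore */
        while (first < entries && idx[first].run_max < lo)
            first++;
        while (last > first && idx[last - 1].rev_min > hi)
            last--;
        *lo_rowid = (first < entries) ? idx[first].first_rowid : nrows;
        *hi_rowid = (last < entries) ? idx[last].first_rowid : nrows;
    }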
5 TPC-H Experiments

Table 4 shows the results of executing all TPC-H queries on both MonetDB/MIL and MonetDB/X100. We ran the SQL benchmark queries on an out-of-the-box MonetDB/MIL system with its SQL front-end on our AthlonMP platform (1533MHz, 1GB RAM, Linux 2.4) at SF=1. We also hand-translated all TPC-H queries to X100 algebra and ran them on MonetDB/X100. The comparison between the first two result columns clearly shows that MonetDB/X100 overpowers MonetDB/MIL.

Both MonetDB/MIL and MonetDB/X100 use join indices over all foreign-key paths. For MonetDB/X100 we sorted the orders table on date, and kept lineitem clustered with it. We use summary indices (see Section 4.3) on all date columns of both tables. We also sorted both suppliers and customers on (region, country). In all, total disk storage was about 1GB for MonetDB/MIL and around 0.8GB for MonetDB/X100 (SF=1). The reduction was achieved by using enumeration types, where possible.

We also ran TPC-H both at SF=1 and SF=100 on our Itanium2 1.3GHz (3MB cache) server with 12GB RAM running Linux 2.4. The last column of Table 4 lists official TPC-H results for the MAXDATA Platinum 9000-4R, a server machine with four 1.5GHz (6MB cache) Itanium2 processors and 32GB RAM running DB2 8.1 UDB.

We should clarify that all MonetDB TPC-H numbers are in-memory results; no I/O occurs. This should be taken into account especially when comparing with the DB2 results. It also shows that even at SF=100, MonetDB/X100 needs less than our 12GB of RAM for each individual query. If we had had 32GB of RAM like the DB2 platform, the hot-set for all TPC-H queries would have fit in memory.

While the DB2 TPC-H numbers obviously do include I/O, its impact may not be that strong, as its test platform uses 112 SCSI disks. This suggests that disks were added until DB2 became CPU-bound. In any case, and taking into account that CPU-wise the DB2 hardware is more than four times stronger, MonetDB/X100 performance looks very solid.

     MonetDB/MIL   MonetDB/X100, 1 CPU                    DB2, 4 CPU
Q    SF=1          SF=1        SF=1        SF=100         SF=100
     (AthlonMP)    (AthlonMP)  (Itanium2)  (Itanium2)
1    3.72          0.50        0.31        30.25          229
2    0.46          0.01        0.01        0.81           19
3    2.52          0.04        0.02        3.77           16
4    1.56          0.05        0.02        1.15           14
5    2.72          0.08        0.04        11.02          72
6    2.24          0.09        0.02        1.44           12
7    3.26          0.22        0.22        29.47          81
8    2.23          0.06        0.03        2.78           65
9    6.78          0.44        0.44        71.24          274
10   4.40          0.22        0.19        30.73          47
11   0.43          0.03        0.02        1.66           20
12   3.73          0.09        0.04        3.68           19
13   11.42         1.26        1.04        148.22         343
14   1.03          0.02        0.02        2.64           14
15   1.39          0.09        0.04        14.36          30
16   2.25          0.21        0.14        15.77          64
17   2.30          0.02        0.02        1.75           77
18   5.20          0.15        0.11        10.37          600
19   12.46         0.05        0.05        4.47           81
20   2.75          0.08        0.05        2.45           35
21   8.85          0.29        0.17        17.61          428
22   3.07          0.07        0.04        2.30           93

Table 4: TPC-H Performance (seconds)

5.1 Query 1 performance

As we did for MySQL and MonetDB/MIL, we now also study the performance of MonetDB/X100 on TPC-H Query 1 in detail. Figure 9 shows its translation into X100 Algebra. X100 implements detailed tracing and profiling support using low-level CPU counters, to help analyze query performance. Table 5 shows the tracing output generated by running TPC-H Query 1 on our Itanium2 at SF=1. The top part of the trace provides statistics at the level of the vectorized primitives, while the bottom part contains information at the (coarser) level of X100 algebra operators.

    Order(
      Project(
        Aggr(
          Select(
            Table(lineitem),
            < ( l_shipdate, date('1998-09-03'))),
          [ l_returnflag, l_linestatus ],
          [ sum_qty = sum(l_quantity),
            sum_base_price = sum(l_extendedprice),
            sum_disc_price = sum(
              discountprice = *( -(flt('1.0'), l_discount),
                                 l_extendedprice ) ),
            sum_charge = sum(*( +( flt('1.0'), l_tax),
                                discountprice ) ),
            sum_disc = sum(l_discount),
            count_order = count() ]),
        [ l_returnflag, l_linestatus, sum_qty,
          sum_base_price, sum_disc_price, sum_charge,
          avg_qty = /( sum_qty, cnt=dbl(count_order)),
          avg_price = /( sum_base_price, cnt),
          avg_disc = /( sum_disc, cnt), count_order ]),
      [ l_returnflag ASC, l_linestatus ASC])

Figure 9: Query 1 in X100 Algebra

A first observation is that X100 manages to run all primitives at a very low number of CPU cycles per tuple; even relatively complex primitives like aggregation run in 6 cycles per tuple. Notice that a multiplication (map_mul_*) is handled in 2.2 cycles per tuple, which is way better than the 49 cycles per tuple achieved by MySQL (see Section 3.1).

A second observation is that since a large part of the data being processed by the primitives comes from vectors in the CPU cache, X100 is able to sustain a really high bandwidth. Where multiplication in MonetDB/MIL was constrained by the RAM bandwidth of 500MB/s, MonetDB/X100 exceeds 7.5GB/s on the same operator (on the AthlonMP it is around 5GB/s).

Finally, Table 5 shows that Query 1 uses three columns that are stored as enumerated types (i.e. l_discount, l_tax and l_quantity). X100 automatically adds three Fetch1Joins to retrieve the original values from the respective enumeration tables. We can see that these fetch-joins are truly efficient, as they cost less than 2 cycles per tuple.

input   total  time    BW     avg.    X100 primitive
count   MB     (us)    MB/s   cycles
6M      30     8518    3521   1.9     map_fetch_uchr_col_flt_col
6M      30     8360    3588   1.9     map_fetch_uchr_col_flt_col
6M      30     8145    3683   1.9     map_fetch_uchr_col_flt_col
6M      35.5   13307   2667   3.0     select_lt_usht_col_usht_val
5.9M    47     10039   4681   2.3     map_sub_flt_val_flt_col
5.9M    71     9385    7565   2.2     map_mul_flt_col_flt_col
5.9M    71     9248    7677   2.1     map_mul_flt_col_flt_col
5.9M    47     10254   4583   2.4     map_add_flt_val_flt_col
5.9M    35.5   13052   2719   3.0     map_uidx_uchr_col
5.9M    53     14712   3602   3.4     map_directgrp_uidx_col_uchr_col
5.9M    71     28058   2530   6.5     aggr_sum_flt_col_uidx_col
5.9M    71     28598   2482   6.6     aggr_sum_flt_col_uidx_col
5.9M    71     27243   2606   6.3     aggr_sum_flt_col_uidx_col
5.9M    71     26603   2668   6.1     aggr_sum_flt_col_uidx_col
5.9M    71     27404   2590   6.3     aggr_sum_flt_col_uidx_col
5.9M    47     18738   2508   4.3     aggr_count_uidx_col
input          time                   X100 operator
count          (us)
0              3978                   Scan
6M             10970                  Fetch1Join(ENUM)
6M             10712                  Fetch1Join(ENUM)
6M             10656                  Fetch1Join(ENUM)
6M             15302                  Select
5.9M           236443                 Aggr(DIRECT)

Table 5: TPC-H Query 1 performance trace (Itanium2, SF=1)

5.1.1 Vector Size Impact

We now investigate the influence of vector size on performance. X100 uses a default vector size of 1024, but users can override it. Preferably, all vectors together should comfortably fit the CPU cache size, hence they should not be too big. However, with really small vector sizes, the possibility of exploiting CPU parallelism disappears. Also, in that case, the impact of interpretation overhead in the X100 Algebra next() methods will grow.
[Figure 10: Query 1 performance w.r.t. vector-size. Log-log plot of elapsed time (seconds, roughly 0.1-10) against vector size (1 tuple up to 4M tuples) for the AthlonMP and Itanium2.]

Figure 10 presents the results of this experiment, in which we execute TPC-H Query 1 on both the Itanium2 and the AthlonMP with varying vector sizes. Just like MySQL, interpretation overhead also hits MonetDB/X100 strongly if it uses tuple-at-a-time processing (i.e. a vector size of 1). With increasing vector size, the execution time quickly improves. For this query and these platforms, the optimal vector size seems to be 1000, but all values between 128 and 8K actually work well. Performance starts to deteriorate when intermediate results do not fit in the cache anymore. The total width of all vectors used in Query 1 is just over 40 bytes. Thus, when we start using vectors larger than 8K, the cache memory requirements start to exceed the 320KB combined L1 and L2 cache of the AthlonMP, and performance starts to degrade. For the Itanium2 (16KB L1, 256KB L2, and 3MB L3), the performance degradation starts a bit earlier, and then continues until the data does not fit even in L3 (after 64K x 40 bytes).

When the vectors do not fit in any cache anymore, we are materializing all intermediate results in main memory. Therefore, at the extreme vector size of 4M tuples, MonetDB/X100 behaves very similarly to MonetDB/MIL. Still, X100 performance is better, since it does not have to perform the extra join steps present in MIL, required to project selected tuples (see Section 3.2).

6 Related Work

This research builds a bridge between the classical Volcano iterator model [10] and the column-wise query processing model of MonetDB [4].

The work closest to our paper is [14], where a blocked execution path in DB2 is presented. Unlike MonetDB/X100, which is designed from the ground up for vectorized execution, the authors only use their approach to enhance aggregation and projection operations. In DB2, the tuple layout remains NSM, although the authors discuss the possibility of dynamically remapping NSM chunks into vertical chunks. The overheads introduced by this may be the cause of the only modest performance gains reported.

Also closely related is [21], which also suggests block-at-a-time processing, again focusing on NSM tuple layouts. The authors propose to insert "Buffer" operators into the operator pipeline, which call their child N times after each other, buffering the results. This helps in situations where the code footprint of all operators that occur in a query tree together exceeds the instruction cache. Then, when the instructions of one operator are "hot", it makes sense to call it multiple times. Thus, this paper proposes block-wise processing, but without modifying the query operators to make them work on blocks. We argue that if our approach is adopted, we get the instruction-cache benefit discussed in [21] for free. We had already noticed in the past that MonetDB/MIL, due to its column-wise execution, spends so much time in each operator that instruction cache misses are not a problem.

A similar proposal for block-at-a-time query processing is [20], this time regarding lookup in B-trees. Again the goals of the authors are different, mainly better use of the data caches, while the main goal of MonetDB/X100 is to increase the CPU efficiency of query processing by loop pipelining.

As far as data storage is concerned, the update scheme of MonetDB/X100 combines the decomposed storage model (DSM) [8] with PAX [2] for tuples that are updated. This idea is close to the suggestion in [16] to combine DSM and NSM for more flexible data mirroring, and to use inverted lists to handle updates efficiently. In fact, a PAX block can be seen as a collection of vertical vectors, such that X100 could run right on top of this representation, without conversion overhead.

7 Conclusion and Future Work

In this paper, we investigate why relational database systems achieve low CPU efficiencies on modern CPUs. It turns out that the Volcano-like tuple-at-a-time execution architecture of relational systems introduces interpretation overhead, and inhibits compilers from using their most performance-critical optimization techniques, such as loop pipelining.

We also analyzed the CPU efficiency of the main-memory database system MonetDB, which does not suffer from problems generated by tuple-at-a-time interpretation, but instead employs a column-at-a-time materialization policy, which makes it memory-bandwidth bound.

Therefore, we propose to strike a balance between
the Volcano and MonetDB execution models. This leads to pipelined operators that pass to each other small, cache-resident, vertical data fragments called vectors. Following this principle, we present the architecture of a brand new query engine for MonetDB, called X100. It uses vectorized primitives to perform the bulk of query processing work in a very efficient way. We evaluated our system on the TPC-H decision support benchmark with size 100GB, showing that MonetDB/X100 can be up to two orders of magnitude faster than existing DBMS technology.

In the future, we will continue to add more vectorized query processing operators to MonetDB/X100. We also plan to port the MonetDB/MIL SQL front-end to it, and fit it with a histogram-based query optimizer. We intend to deploy MonetDB/X100 in data-mining, XML processing, and multimedia and information retrieval projects ongoing in our group.

We will also continue our work on the ColumnBM buffer manager. This work embodies our goal to make MonetDB/X100 scale out of main memory, and preferably achieve the same high CPU efficiencies if data is sequentially streamed in from disk instead of from RAM. Therefore, we plan to investigate lightweight compression and multi-query optimization of disk access to reduce I/O bandwidth requirements.

Finally, we are considering the use of X100 as an energy-efficient query processing system for low-power (embedded, mobile) environments, because it has a small footprint, and its property of performing as much work in as few CPU cycles as possible translates into improved battery life in such environments.

References

[1] The STREAM Benchmark: Computer Memory Bandwidth. http://www.streambench.org.

[2] A. Ailamaki, D. DeWitt, M. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In Proc. VLDB, Rome, Italy, 2001.

[3] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a Modern Processor: Where Does Time Go? In Proc. VLDB, Edinburgh, 1999.

[4] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. Ph.D. thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

[5] P. A. Boncz and M. L. Kersten. MIL Primitives for Querying a Fragmented World. VLDB J., 8(2):101-119, 1999.

[6] Q. Cao, J. Torrellas, P. Trancoso, J.-L. Larriba-Pey, B. Knighten, and Y. Won. Detailed characterization of a quad Pentium Pro server running TPC-D. In Proc. ICCD, Austin, USA, 1999.

[7] S. Chen, P. B. Gibbons, and T. C. Mowry. Improving index performance through prefetching. In Proc. SIGMOD, Santa Barbara, USA, 2001.

[8] G. P. Copeland and S. Khoshafian. A Decomposition Storage Model. In Proc. SIGMOD, Austin, USA, 1985.

[9] R. Cornacchia, A. van Ballegooij, and A. P. de Vries. A case study on array query optimisation. In Proc. CVDB, 2004.

[10] G. Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1):120-135, 1994.

[11] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Main-Memory Join On Modern Hardware. IEEE Trans. Knowl. Data Eng., 14(4):709-730, 2002.

[12] G. Moerkotte. Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. In Proc. VLDB, New York, USA, 1998.

[13] G. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), Apr. 1965.

[14] S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In Proc. ICDE, Heidelberg, Germany, 2001.

[15] J. Rao and K. A. Ross. Making B+-Trees Cache Conscious in Main Memory. In Proc. SIGMOD, Madison, USA, 2000.

[16] R. Ramamurthy, D. J. DeWitt, and Q. Su. A Case for Fractured Mirrors. In Proc. VLDB, Hong Kong, 2002.

[17] K. A. Ross. Conjunctive selection conditions in main memory. In Proc. PODS, Madison, USA, 2002.

[18] A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query processing. In Proc. VLDB, Santiago, 1994.

[19] M. Stonebraker, J. Anton, and M. Hirohama. Extendability in POSTGRES. IEEE Data Eng. Bull., 10(2):16-23, 1987.

[20] J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In Proc. VLDB, Toronto, Canada, 2003.

[21] J. Zhou and K. A. Ross. Buffering database operations for enhanced instruction cache performance. In Proc. SIGMOD, Paris, France, 2004.
