Apart from achieving high CPU efficiency, MonetDB/X100 is intended to scale out towards non main-memory (disk-based) datasets. The second part of this paper is dedicated to describing the architecture of MonetDB/X100.

1.1 Outline

[Figure: CPU performance (SPECcpu int+fp) as a function of CPU MHz, pipelining and inverted gate distance, for the Alpha 21064, 21064A, 21164, 21164A and 21164B, the Athlon, the POWER4, and the hyper-pipelined 130nm Itanium2]

Therefore, compiler optimizations have become critical to achieving good CPU utilization. The most important technique is loop pipelining, in which an operation consisting of multiple dependent operations F(), G() on all n independent elements of an array A is transformed from:

    F(A[0]),G(A[0]), F(A[1]),G(A[1]),.. F(A[n]),G(A[n])

into:

    F(A[0]),F(A[1]),F(A[2]), G(A[0]),G(A[1]),G(A[2]), F(A[3]),..

Supposing the pipeline dependency latency of F() is 2 cycles, when G(A[0]) is taken into execution, the result of F(A[0]) has just become available.

In the case of the Itanium2 processor, the importance of the compiler is even stronger, as it is the compiler that has to find instructions that can go into different pipelines (other CPUs do that at run-time, using out-of-order execution). As the Itanium2 chip does not need any complex logic dedicated to finding out-of-order execution opportunities, it can contain more pipelines that do real work. The Itanium2 also has a feature called branch predication for eliminating branch mispredictions: both the THEN and ELSE blocks are executed in parallel, and one of the results is discarded as soon as the outcome of the condition becomes known. It is also the task of the compiler to detect opportunities for branch predication.

    /* branch version */
    if (src[i] < V)
        out[j++] = i;

    /* predicated version */
    bool b = (src[i] < V);
    out[j] = i;
    j += b;

Figure 2: Itanium Hardware Predication Eliminates Branch Mispredictions (elapsed msec as a function of query selectivity, for the AthlonMP "branch" and "predicated" variants and the Itanium2)

Figure 2 shows a micro-benchmark of the selection query SELECT oid FROM table WHERE col < X, where the values of col are uniformly and randomly distributed over [0:100], and we vary X (and hence the selectivity) between 0 and 100. Normal CPUs like the AthlonMP show worst-case behavior around 50% selectivity, due to branch mispredictions. As suggested in [17], by rewriting the code cleverly, we can transform the branch into a boolean calculation (the "predicated" variant). Performance of this rewritten variant is independent of the selectivity, but it incurs a higher average cost. Interestingly, the "branch" variant on Itanium2 is highly efficient and independent of selectivity as well, because the compiler transforms the branch into hardware-predicated code.

Finally, we should mention the importance of on-chip caches to CPU throughput. About 30% of all [...] inches away from the CPU on a motherboard. This imposes a physical lower bound on memory latency of around 50 ns. This (ideal) minimum latency of 50 ns already translates into 180 wait cycles for a 3.6GHz CPU. Thus, only if the overwhelming majority of the memory accessed by a program can be found in an on-chip cache does a modern CPU have a chance to operate at its maximum throughput. Recent database research has shown that DBMS performance is strongly impaired by memory access cost ("cache misses") [3], and can improve significantly if cache-conscious data structures are used, such as cache-aligned B-trees [15, 7] or column-wise data layouts such as PAX [2] and DSM [8] (as in MonetDB). Also, query processing algorithms that restrict their random memory access patterns to regions that fit a CPU cache, such as radix-partitioned hash-join [18, 11], strongly improve performance.

All in all, CPUs have become highly complex devices, where the instruction throughput of a processor can vary by orders of magnitude (!) depending on the cache hit-ratio of the memory loads and stores, the number of branches and whether they can be predicted or predicated, as well as the number of independent instructions a compiler and the CPU can detect on average. It has been shown that query execution in commercial DBMS systems gets an IPC of only 0.7 [6], thus executing less than one instruction per cycle. In contrast, scientific computation (e.g. matrix multiplication) or multimedia processing does extract average IPCs of up to 2 out of modern CPUs. We argue that database systems do not need to perform so badly, especially not on large-scale analysis tasks, where millions of tuples need to be examined and expressions to be calculated. This abundance of work contains plenty of independence that should be able to fill all the pipelines a CPU can offer. Hence, our quest is to adapt database architecture to expose this to the compiler and CPU where possible, and thus significantly improve query processing throughput.

    SELECT l_returnflag, l_linestatus,
           sum(l_quantity) AS sum_qty,
           sum(l_extendedprice) AS sum_base_price,
           sum(l_extendedprice * (1 - l_discount))
               AS sum_disc_price,
           sum(l_extendedprice * (1 - l_discount) *
               (1 + l_tax)) AS sum_charge,
           avg(l_quantity) AS avg_qty,
           avg(l_extendedprice) AS avg_price,
           avg(l_discount) AS avg_disc,
           count(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date '1998-09-02'
    GROUP BY l_returnflag, l_linestatus

Figure 3: TPC-H Query 1
3 Microbenchmark: TPC-H Query 1

While we target CPU efficiency of query processing in general, we first focus on expression calculation, discarding more complex relational operations (like join) to simplify our analysis. We choose Query 1 of the TPC-H benchmark, shown in Figure 3, because on all RDBMSs we tested, this query was CPU-bound and parallelizes trivially over multiple CPUs. Also, this query requires virtually no optimization or fancy join implementations, as its plan is so simple. Thus, all database systems operate on a level playing field and mainly expose their expression evaluation efficiency.

The TPC-H benchmark operates on a data warehouse of 1GB, the size of which can be increased with a Scaling Factor (SF). Query 1 is a scan on the lineitem table of SF*6M tuples that selects almost all tuples (SF*5.9M) and computes a number of fixed-point decimal expressions: two column-to-constant subtractions, one column-to-constant addition, three column-to-column multiplications, and eight aggregates (four SUM()s, three AVG()s and a COUNT()). The aggregate grouping is on two single-character columns and yields only 4 unique combinations, such that it can be done efficiently with a hash-table, requiring no additional I/O.

In the following, we analyze the performance of Query 1 first on relational database systems, then on MonetDB/MIL, and finally in a hard-coded program.

TPC-H Query 1 Experiments
                 sec/(#CPU*SF)    SF  #CPU  platform, SPECcpu int/fp
DBMS "X"             28.1           1    1  AthlonMP 1533MHz, 609/547
MySQL 4.1            26.6           1    1  AthlonMP 1533MHz, 609/547
MonetDB/MIL           3.7           1    1  AthlonMP 1533MHz, 609/547
MonetDB/MIL           3.4           1    1  Itanium2 1.3GHz, 1132/1891
hard-coded            0.22          1    1  AthlonMP 1533MHz, 609/547
hard-coded            0.14          1    1  Itanium2 1.3GHz, 1132/1891
MonetDB/X100          0.50          1    1  AthlonMP 1533MHz, 609/547
MonetDB/X100          0.31          1    1  Itanium2 1.3GHz, 1132/1891
MonetDB/X100          0.30        100    1  Itanium2 1.3GHz, 1132/1891

TPC-H Query 1 Reference Results (www.tpc.org)
Oracle10g            18.1         100   16  Itanium2 1.3GHz, 1132/1891
Oracle10g            13.2        1000   64  Itanium2 1.5GHz, 1408/2161
SQLserver2000        18.0         100    2  Xeon P4 3.0GHz, 1294/1208
SQLserver2000        21.8        1000    8  Xeon P4 2.8GHz, 1270/1094
DB2 UDB 8.1           9.0         100    4  Itanium2 1.5GHz, 1408/2161
DB2 UDB 8.1           7.4         100    2  Opteron 2.0GHz, 1409/1514
Sybase IQ 12.5       15.6         100    2  USIII 1.28GHz, 704/1054
Sybase IQ 12.5       15.8        1000    2  USIII 1.28GHz, 704/1054

Table 1: TPC-H Query 1 Performance

3.1 Query 1 on Relational Database Systems

Since the early days of RDBMSs, query execution functionality has been provided by implementing a physical relational algebra, typically following the Volcano [10] model of pipelined processing. Relational algebra, however, has a high degree of freedom in its parameters. For instance, even a simple ScanSelect(R, b, P) only at query-time receives full knowledge of the format of the input relation R (number of columns, their types, and record offsets), the boolean selection expression b (which may be of any form), and a list of projection expressions P (each of arbitrary complexity) that define the output relation. In order to deal with all possible R, b, and P, DBMS implementors must in fact implement an expression interpreter that can handle expressions of arbitrary complexity.

One of the dangers of such an interpreter, especially if the granularity of interpretation is a tuple, is that the cost of the "real work" (i.e. executing the expressions found in the query) is only a tiny fraction of total query execution cost. We can see this happening in Table 2, which shows a gprof trace of MySQL 4.1 executing TPC-H Query 1 on a database of SF=1. The second column shows the percentage of total execution time spent in the routine, excluding time spent in routines it called (excl.). The first column is a cumulative sum of the second (cum.). The third column lists how many times the routine was called, while the fourth and fifth columns show the average number of instructions executed on each call, as well as the IPC achieved.

The first observation to make is that the five operations that do all the "work" (displayed in boldface) correspond to only 10% of total execution time. Closer inspection shows that 28% of execution time is taken up by creation of and lookup in the hash-table used for aggregation. The remaining 62% of execution time is spread over functions like rec_get_nth_field that navigate through MySQL's record representation and copy data in and out of it.

The second observation is the cost of the Item operations that correspond to the computational "work" of the query. For example, Item_func_plus::val has a cost of 38 instructions per addition. This performance trace was made on an SGI machine with a MIPS R12000 CPU (on our Linux test platforms, no multi-threaded profiling tools seem to be available), which can execute three integer or floating-point instructions and one load/store per cycle, with an average operation latency of about 5 cycles. A simple arithmetic operation +(double src1, double src2) : double in RISC instructions would look like:

    LOAD src1,reg1
    LOAD src2,reg2
    ADD  reg1,reg2,reg3
    STOR dst,reg3

The limiting factor in this code is the three load/store instructions; thus a MIPS processor can do one +(double,double) per 3 cycles. This is in sharp contrast to the MySQL cost of #ins/Instructions-Per-Cycle (IPC) = 38/0.8 = 49 cycles! One explanation for this high cost is the absence of loop pipelining.
cum.  excl.  calls   ins.  IPC   function
11.9  11.9    846M      6  0.64  ut_fold_ulint_pair
20.4   8.5   0.15M    27K  0.71  ut_fold_binary
26.2   5.8     77M     37  0.85  memcpy
29.3   3.1     23M     64  0.88  Item_sum_sum::update_field
32.3   3.0      6M    247  0.83  row_search_for_mysql
35.2   2.9     17M     79  0.70  Item_sum_avg::update_field
37.8   2.6    108M     11  0.60  rec_get_bit_field_1
40.3   2.5      6M    213  0.61  row_sel_store_mysql_rec
42.7   2.4     48M     25  0.52  rec_get_nth_field
45.1   2.4      60    19M  0.69  ha_print_info
47.5   2.4    5.9M    195  1.08  end_update
49.6   2.1     11M     89  0.98  field_conv
51.6   2.0    5.9M     16  0.77  Field_float::val_real
53.4   1.8    5.9M     14  1.07  Item_field::val
54.9   1.5     42M     17  0.51  row_sel_field_store_in_mysql..
56.3   1.4     36M     18  0.76  buf_frame_align
57.6   1.3     17M     38  0.80  Item_func_mul::val
59.0   1.4     25M     25  0.62  pthread_mutex_unlock
60.2   1.2    206M      2  0.75  hash_get_nth_cell
61.4   1.2     25M     21  0.65  mutex_test_and_set
62.4   1.0    102M      4  0.62  rec_get_1byte_offs_flag
63.4   1.0     53M      9  0.58  rec_1_get_field_start_offs
64.3   0.9     42M     11  0.65  rec_get_nth_field_extern_bit
65.3   1.0     11M     38  0.80  Item_func_minus::val
65.8   0.5    5.9M     38  0.80  Item_func_plus::val

Table 2: MySQL gprof trace of TPC-H Q1: +,-,*,SUM,AVG takes <10%, low IPC of 0.7

Since the routine called by MySQL only computes one addition per call, instead of an array of additions, the compiler cannot perform loop pipelining. Thus, the addition consists of four dependent instructions that have to wait for each other. With a mean instruction latency of 5 cycles, this explains a cost of about 20 cycles. The rest of the 49 cycles are spent on jumping into the routine, and on pushing and popping the stack.

The consequence of the MySQL policy of executing expressions tuple-at-a-time is thus twofold:

• Item_func_plus::val only performs one addition, preventing the compiler from creating a pipelined loop. As the instructions for one operation are highly dependent, empty pipeline slots (stalls) must be generated to wait for the instruction latencies, such that the cost of the loop becomes 20 instead of 3 cycles.

• the cost of the routine call (in the ballpark of 20 cycles) must be amortized over only one operation, which effectively doubles the operation cost.

We also tested the same query on a well-known commercial RDBMS (see the first row of Table 1). As we obviously lack the source code of this product, we cannot produce a gprof trace. However, the query evaluation cost on this DBMS is very similar to MySQL. The lower part of Table 1 includes some official Query 1 results taken from the TPC website. We normalized all times towards SF=1 and a single CPU by assuming linear scaling. We also provide the SPECcpu int/float scores of the various hardware platforms used. We mainly do this in order to check that the relational DBMS results we obtained are roughly in the same ballpark as what is published by TPC. This leads us to believe that what we see in the MySQL trace is likely representative of what happens in commercial RDBMS implementations.

3.2 Query 1 on MonetDB/MIL

The MonetDB system [4], developed at our group, is mostly known for its use of vertical fragmentation, storing tables column-wise, each column in a Binary Association Table (BAT) that contains [oid,value] combinations. A BAT is a 2-column table where the left column is called head and the right column tail. The algebraic query language of MonetDB is a column-algebra called MIL [5].

In contrast to the relational algebra, the MIL algebra does not have any degree of freedom. Its algebraic operators have a fixed number of parameters of a fixed format (all two-column tables or constants). The expression calculated by an operator is fixed, as is the shape of the result. For example, the MIL join(BAT[tl,te] A, BAT[te,tr] B) : BAT[tl,tr] is an equi-join between the tail column of A and the head column of B that, for each matching combination of tuples, returns the head value from A and the tail value from B. The mechanism in MIL to join on the other column (i.e. the head, instead of the tail) of A is to use the MIL reverse(A) operator, which returns a view on A with its columns swapped: BAT[te,tl]. This reverse is a zero-cost operation in MonetDB that just swaps some pointers in the internal representation of a BAT. Complex expressions must be executed using multiple statements in MIL. For example, extprice * (1 - tax) becomes tmp1 := [-](1,tax); tmp2 := [*](extprice,tmp1), where [*]() and [-]() are multiplex operators that "map" a function onto an entire BAT (column). MIL executes in column-wise fashion in the sense that its operators always consume a number of materialized input BATs and materialize a single output BAT.

      SF=1         SF=0.001    tot   res
  ms    BW       us    BW       MB  size   MIL statement     (BW = MB/s)
 127   352      150   305       45  5.9M   s0 := select(l_shipdate).mark
 134   505      113   608       68  5.9M   s1 := join(s0,l_returnflag)
 134   506      113   608       68  5.9M   s2 := join(s0,l_linestatus)
 235   483      129   887      114  5.9M   s3 := join(s0,l_extprice)
 233   488      130   881      114  5.9M   s4 := join(s0,l_discount)
 232   489      127   901      114  5.9M   s5 := join(s0,l_tax)
 134   507      104   660       68  5.9M   s6 := join(s0,l_quantity)
 290   155      324   141       45  5.9M   s7 := group(s1)
 329   136      368   124       45  5.9M   s8 := group(s7,s2)
   0     0        0     0        0  4      s9 := unique(s8.mirror)
 206   440       60  1527       91  5.9M   r0 := [+](1.0,s5)
 210   432       51  1796       91  5.9M   r1 := [-](1.0,s4)
 274   498       83  1655      137  5.9M   r2 := [*](s3,r1)
 274   499       84  1653      137  5.9M   r3 := [*](r2,r0)
 165   271      121   378       45  4      r4 := {sum}(r3,s8,s9)
 165   271      125   366       45  4      r5 := {sum}(r2,s8,s9)
 163   275      128   357       45  4      r6 := {sum}(s3,s8,s9)
 163   275      128   357       45  4      r7 := {sum}(s4,s8,s9)
 144   151      107   214       22  4      r8 := {sum}(s6,s8,s9)
 112   196      145   157       22  4      r9 := {count}(s7,s8,s9)
3724           2327                        TOTAL

Table 3: MonetDB/MIL trace of TPC-H Query 1

We used the MonetDB/MIL SQL front-end to translate TPC-H Query 1 into MIL and run it. Table 3 shows all 20 MIL invocations that together span more than 99% of elapsed query time. On TPC-H Query 1, MonetDB/MIL is clearly faster than MySQL and the commercial DBMS on the same machine, and is also competitive with the published TPC-H scores (see Table 1). However, closer inspection of Table 3 shows that almost all MIL operators are memory-bound instead of CPU-bound! This was established by running the same query plan on the TPC-H dataset with SF=0.001, such that all used columns of the lineitem table, as well as all intermediate results, fit inside the CPU cache, eliminating any memory traffic. MonetDB/MIL then becomes almost 2 times as fast. Columns 2 and 4 list the bandwidth (BW) in MB/s achieved by the individual MIL operations, counting both the size of the input BATs and the produced output BAT. On SF=1, MonetDB gets stuck at 500MB/s, which is the maximum bandwidth sustainable on this hardware [1]. When running purely in the CPU cache at SF=0.001, bandwidths can get above 1.5GB/s. For the multiplexed multiplication [*](), a bandwidth of only 500MB/s means 20M tuples per second (16 bytes in, 8 bytes out), thus 75 cycles per multiplication on our 1533MHz CPU, which is even worse than MySQL.

Thus, the column-at-a-time policy in MIL turns out to be a two-edged sword. To its advantage is the fact that MonetDB is not prone to the MySQL problem of spending 90% of its query execution time in tuple-at-a-time interpretation "overhead". As the multiplex operations that perform expression calculations work on entire BATs (basically arrays, of which the layout is known at compile-time), the compiler is able to employ loop-pipelining, such that these operators achieve high CPU efficiencies, embodied by the SF=0.001 results.

However, we identify the following problems with full materialization. First, queries that contain complex calculation expressions over many tuples will materialize an entire result column for each function in the expression. Often, such function results are not required in the query result, but just serve as inputs to other functions in the expression. For instance, if an aggregation is the top-most operator in the query plan, the eventual result size might even be negligible (such as in Query 1). In such cases, MIL materializes much more data than strictly necessary, causing its high bandwidth consumption.

Also, Query 1 starts with a 98% selection of the 6M tuple table and performs the aggregations on the remaining 5.9M tuples. Again, MonetDB materializes the relevant result columns of the select() using six positional join()s. These joins are not required in a Volcano-like pipelined execution model, which can do the selection, computations and aggregation all in a single pass, not materializing any data.

While in this paper we concentrate on CPU efficiency in main-memory scenarios, we point out that the "artificially" high bandwidths generated by MonetDB/MIL make it harder to scale the system to disk-based problems efficiently, simply because memory bandwidth tends to be much greater (and cheaper) than I/O bandwidth. Sustaining, say, a 1.5GB/s data transfer would require a truly high-end RAID system with an awful lot of disks.

    static void tpch_query1(int n, int hi_date,
        unsigned char*__restrict__ p_returnflag,
        unsigned char*__restrict__ p_linestatus,
        double*__restrict__ p_quantity,
        double*__restrict__ p_extendedprice,
        double*__restrict__ p_discount,
        double*__restrict__ p_tax,
        int*__restrict__ p_shipdate,
        aggr_t1*__restrict__ hashtab)
    {
        for(int i=0; i<n; i++) {
            if (p_shipdate[i] <= hi_date) {
                aggr_t1 *entry = hashtab +
                    (p_returnflag[i]<<8) + p_linestatus[i];
                double discount = p_discount[i];
                double extprice = p_extendedprice[i];
                entry->count++;
                entry->sum_qty += p_quantity[i];
                entry->sum_disc += discount;
                entry->sum_base_price += extprice;
                entry->sum_disc_price += (extprice *= (1-discount));
                entry->sum_charge += extprice*(1-p_tax[i]);
    }}}

Figure 4: Hard-Coded UDF for Query 1 in C

3.3 Query 1: Baseline Performance

To get a baseline of what modern hardware can do on a problem like Query 1, we implemented it as a single UDF in MonetDB, as shown in Figure 4. The UDF gets passed only those columns touched by the query. In MonetDB, these columns are stored as arrays in BAT[void,T]s. That is, the oid values in the head column are densely ascending from 0 upwards. In such cases, MonetDB uses voids ("virtual-oids") that are not stored. The BAT then takes the form of an array. We pass these arrays as __restrict__ pointers, such that the C compiler knows that they are non-overlapping. Only then can it apply loop-pipelining!

This implementation exploits the fact that a GROUP BY on two single-byte characters can never yield more than 65536 combinations, such that their combined bit-representation can be used directly as an array index into the table with aggregation results. As in MonetDB/MIL, we performed some common subexpression elimination such that one minus and three AVG aggregates can be omitted.

Table 1 shows that this UDF implementation (labeled "hard-coded") reduces query evaluation cost to a stunning 0.22 seconds. From the same table, you will notice that our new X100 query processor, which is the topic of the remainder of this paper, is able to get within a factor 2 of this hard-coded implementation.

4 X100: A Vectorized Query Processor

The goal of X100 is to (i) execute high-volume queries at high CPU efficiency, (ii) be extensible to other application domains like data mining and multi-media retrieval, and achieve those same high efficiencies on extensibility code, and (iii) scale with the size of the lowest storage hierarchy (disk).

In order to achieve our goals, X100 must fight bottlenecks throughout the entire computer architecture:

Disk   the ColumnBM I/O subsystem of X100 is geared towards efficient sequential data access. To reduce bandwidth requirements, it uses a vertically fragmented data layout that in some cases is enhanced with lightweight data compression.

RAM   like I/O, RAM access is carried out through explicit memory-to-cache and cache-to-memory routines (which contain platform-specific optimizations, sometimes including e.g. SSE prefetching and data movement assembly instructions). The same vertically partitioned and even compressed disk data layout is used in RAM to save space and bandwidth.

Cache   we use a Volcano-like execution pipeline based on a vectorized processing model. Small (<1000 values) vertical chunks of cache-resident data items, called "vectors", are the unit of operation for X100 execution primitives. The CPU cache is the only place where bandwidth does not matter, and therefore (de)compression happens on the boundary between RAM and cache. The X100 query processing operators should be cache-conscious and fragment huge datasets efficiently into cache-chunks, performing random data access only there.

CPU   vectorized primitives expose to the compiler that processing a tuple is independent of the previous and next tuples. Vectorized primitives for projections (expression calculation) do this easily, but we try to achieve the same for other query processing operators as well (e.g. aggregation). This allows compilers to produce efficient loop-pipelined code. To improve the CPU throughput further (mainly by reducing the number of load/stores in the instruction mix), X100 contains facilities to compile vectorized primitives for whole expression sub-trees rather than single functions. Currently, this compilation is statically steered, but it may eventually become a run-time activity mandated by an optimizer.

To maintain focus in this paper, we only summarily describe disk storage issues, also because the ColumnBM buffer manager is still under development. In all our experiments, X100 uses MonetDB as its storage manager (as shown in Figure 5), where it operates on in-memory BATs.

[Figure 5: X100 Software Architecture: an SQL front-end and MIL operators on top of MonetDB 4.3 form 'MonetDB/MIL'; the X100 Parser, Cost Model and Optimizer (plus MILgen) produce X100 algebra executed by vectorized primitives over MonetDB 4.3 or ColumnBM, forming 'MonetDB/X100'; a primitive generator turns code patterns and signatures from the MetaData into a makefile and a dll of dynamically signatured vectorized primitives]

4.1 Query Language

X100 uses a rather standard relational algebra as its query language. We departed from the column-at-a-time MIL language so that the relational operators can process (vectors of) multiple columns at the same time, allowing a vector produced by one expression to be used as the input to another while the data is in the CPU cache.

4.1.1 Example

To demonstrate the behavior of MonetDB/X100, Figure 6 presents the execution of a simplified version of TPC-H Query 1, with the following X100 relational algebra syntax:

    Aggr(
      Project(
        Select(
          Table(lineitem),
          < (shipdate, date('1998-09-03'))),
        [ discountprice = *( -( flt('1.0'), discount),
                             extendedprice) ]),
      [ returnflag ],
      [ sum_disc_price = sum(discountprice) ])

Execution proceeds using Volcano-like pipelining, on the granularity of a vector (e.g. 1000 values). The Scan operator retrieves data vector-at-a-time from
Monet BATs. Note that only attributes relevant for the query are actually scanned.

A second step is the Select operator, which creates a selection-vector, filled with the positions of tuples that match our predicate. Then the Project operator is executed to calculate the expressions needed for the final aggregation. Note that the "discount" and "extendedprice" columns are not modified during selection. Instead, the selection-vector is taken into account by map-primitives to perform calculations only for relevant tuples, writing results at the same positions in the output vector as they had in the input one. This behavior requires propagating the selection-vector to the final Aggr. There, for each tuple its position in the hash table is calculated, and then, using this data, aggregate results are updated. Additionally, for new elements in the hash table, the values of the grouping attribute are saved. The contents of the hash table become available as the query result as soon as the underlying operators become exhausted and cannot produce more vectors.

[Figure 6: Execution scheme of a simplified TPC-H Query 1 in MonetDB/X100: a SCAN of shipdate, returnflag, discount and extendedprice feeds a SELECT (select_lt_date_col_date_val against the constant 1998-09-03) that produces a selection vector; a PROJECT computes the discountprice vector with map_sub_flt_val_flt_col (constant 1.0) and map_mul_flt_col_flt_col; the AGGREGATE computes positions in the hash table with map_hash_chr_col and maintains returnflag and sum_disc_price via aggr_sum_flt_col]

4.1.2 X100 Algebra

Figure 7 lists the currently supported X100 algebra operators. In X100 algebra, a Table is a materialized relation, whereas a Dataflow just consists of tuples flowing through a pipeline.

    Table(ID) : Table
    Scan(Table) : Dataflow
    Array(List<Exp<int>>) : Dataflow
    Select(Dataflow, Exp<bool>) : Dataflow
    Join(Dataflow, Table, Exp<bool>, List<Column>) : Dataflow
    CartProd(Dataflow, Table, List<Column>)
    Fetch1Join(Dataflow, Table, Exp<int>, List<Column>)
    FetchNJoin(Dataflow, Table, Exp<int>,
               Exp<int>, Column, List<Column>)
    Project(Dataflow, List<Exp<*>>) : Dataflow
    Aggr(Dataflow, List<Exp<*>>, List<AggrExp>) : Dataflow
    OrdAggr(Dataflow, List<Exp<*>>, List<AggrExp>)
    DirectAggr(Dataflow, List<Exp<*>>, List<AggrExp>)
    HashAggr(Dataflow, List<Exp<*>>, List<AggrExp>)
    TopN(Dataflow, List<OrdExp>, List<Exp<*>>, int) : Dataflow
    Order(Table, List<OrdExp>, List<AggrExp>) : Table

Figure 7: X100 Query Algebra

Order, TopN and Select return a Dataflow with the same shape as their input. The other operators define a Dataflow with a new shape. Some peculiarities of this algebra are that Project is just used for expression calculation; it does not eliminate duplicates. Duplicate elimination can be performed using an Aggr with only group-by columns. The Array operator generates a Dataflow representing an N-dimensional array as an N-ary relation containing all valid array index coordinates in column-major dimension order. It is used by the RAM array manipulation front-end for the MonetDB system [9].

Aggregation is supported by three physical operators: (i) direct aggregation, (ii) hash aggregation, and (iii) ordered aggregation. The latter is chosen if all group-members arrive right after each other in the source Dataflow. Direct aggregation can be used for small datatypes where the bit-representation is limited to a known (small) domain, similar to the way aggregation was handled in the "hard-coded" solution (Section 3.3). In all other cases, hash-aggregation is used.

X100 currently only supports left-deep joins. The default physical implementation is a CartProd operator with a Select on top (i.e. nested-loop join). If X100 detects a foreign-key condition in a join condition, and a join-index is available, it exploits it with a Fetch1Join or FetchNJoin.

The inclusion of these fetch-joins in X100 is no coincidence. In MIL, the "positional-join" of an oid into a void column has proven valuable on vertically fragmented data stored in dense columns. Positional joins allow the "extra" joins needed for vertical fragmentation to be handled in a highly efficient way [4]. Just like the void type in MonetDB, X100 gives each table a virtual #rowId column, which is just a densely ascending number from 0. The Fetch1Join allows column values to be fetched positionally by #rowId.
4.2 Vectorized Primitives

The primary reason for using the column-wise vector layout is not to optimize memory layout in the cache (X100 is supposed to operate on cached data anyway). Rather, vectorized execution primitives have the advantage of a low degree of freedom (as discussed in Section 3.2). In a vertically fragmented data model, the execution primitives only know about the columns they operate on, without having to know about the overall table layout (e.g. record offsets). When compiling X100, the C compiler sees that the X100 vectorized primitives operate on restricted (independent) arrays of fixed shape. This allows it to apply aggressive loop pipelining, critical for modern CPU performance (see Section 2). As an example, we show the (generated) code for vectorized floating-point addition:

    void map_plus_double_col_double_col(int n,
        double*__restrict__ res,
        double*__restrict__ col1, double*__restrict__ col2,
        int*__restrict__ sel)
    {
        if (sel) {
            for(int j=0; j<n; j++) {
                int i = sel[j];
                res[i] = col1[i] + col2[i];
            }
        } else {
            for(int i=0; i<n; i++)
                res[i] = col1[i] + col2[i];
        }
    }

The sel parameter may be NULL or point to an array of n selected array positions (i.e. the "selection-vector" from Figure 6). All X100 vectorized primitives allow passing such selection vectors. The rationale is that after a selection, leaving the vectors delivered by the child operator intact is often quicker than copying all selected data into new (contiguous) vectors.

X100 contains hundreds of vectorized primitives. These are not written (and maintained) by hand, but are generated from primitive patterns. The primitive pattern for addition is:

    any::1 +(any::1 x, any::1 y) plus = x + y

This pattern states that an addition of two values of the same type (but without any type restriction) is implemented in C by the infix operator +. It produces a result of the same type, and the name identifier should be plus. Type-specific patterns later in the [...] latter identified with an extra *). Other extensible RDBMSs often only allow UDFs with single-value parameters [19]. This inhibits loop pipelining, reducing performance (see Section 3.1).

We can also request compound primitive signatures:

    /(square(-(double*, double*)), double*)

The above signature is the Mahalanobis distance, a performance-critical operation for some multi-media retrieval tasks [9]. We found that the compound primitives often perform twice as fast as the single-function vectorized primitives. Note that this factor 2 is similar to the difference between MonetDB/X100 and the hard-coded implementation of TPC-H Query 1 in Table 1. The reason why compound primitives are more efficient is a better instruction mix. As in the example with addition on the MIPS processor in Section 3.1, vectorized execution often becomes load/store bound, because for simple 2-ary calculations each vectorized instruction requires loading two parameters and storing one result (1 work instruction, 3 memory instructions). Modern CPUs can typically only perform 1 or 2 load/store operations per cycle. In compound primitives, the results from one calculation are passed via a CPU register to the next calculation, with load/stores only occurring at the edges of the expression graph.

Currently, the primitive generator is not much more than a macro expansion script in the make sequence of the X100 system. However, we intend to implement dynamic compilation of compound primitives as mandated by an optimizer.

A slight variation on the map primitives are the select_* primitives (see also Figure 2). These only exist for code patterns that return a boolean. Instead of producing a full result vector of booleans (as map does), the select primitives fill a result array of selected vector positions (integers) and return the total number of selected tuples.

Similarly, there are the aggr_* primitives that calculate aggregates like count, sum, min, and max. For each, an initialization, an update, and an epilogue pattern need to be specified. The primitive generator then generates the relevant routines for the various implementations of aggregation in X100.
sion developers to provide (source-)code patterns in-
specification file may override this pattern (e.g. str
stead of compiled code, allows all ADTs to get first-
+(str x,str y) concat = str concat(x,y)).
class-citizen treatment during query execution. This
The other part of primitive generation is a file with
was also a weak point of MIL (and most extensible
map signature requests:
DBMSs [19]), as its main algebraic operators were only
+(double*, double*) optimized for the built-in types.
+(double, double*)
+(double*, double) 3 If X100 is used in resource-restricted environments, the size
+(double, double)
of the X100 binary (less than a MB now) could be further re-
duced by omitting the column-versions of (certain) execution
This requests to generate all possible combinations primitives. X100 will still able to process those primitives al-
of addition between single values and columns (the though more slowly, with a vector size of 1.
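To make the load/store argument concrete, the following sketch (our illustration, not actual X100-generated code) contrasts two chained single-function primitives with one compound primitive for a Mahalanobis-style expression square(a - b) / c:

```c
#include <stddef.h>

/* Two chained single-function primitives: each loop loads its
   inputs from memory and stores a full intermediate vector. */
void map_sub_double_col_double_col(size_t n, double *restrict res,
                                   const double *restrict a,
                                   const double *restrict b) {
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] - b[i];
}

void map_square_double_col(size_t n, double *restrict res,
                           const double *restrict a) {
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] * a[i];
}

/* Compound primitive for square(a - b) / c: the intermediate value
   stays in a register, so each tuple costs 3 loads and 1 store
   instead of the 6 loads and 3 stores of the chained version. */
void map_mahalanobis_step(size_t n, double *restrict res,
                          const double *restrict a,
                          const double *restrict b,
                          const double *restrict c) {
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        res[i] = (d * d) / c[i];
    }
}
```

With 1-2 load/store slots per cycle, halving the memory instructions per tuple is exactly where the reported factor-2 speedup can come from.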
4.3 Data Storage

MonetDB/X100 stores all tables in vertically fragmented form. The storage scheme is the same whether the new ColumnBM buffer manager is used, or MonetDB BAT[void,T] storage. While MonetDB stores each BAT in a single continuous file, ColumnBM partitions those files in large (>1MB) chunks.

A disadvantage of vertical storage is an increased update cost: a single row update or delete must perform one I/O for each column. MonetDB/X100 circumvents this by treating the vertical fragments as immutable objects. Updates go to delta structures instead. Figure 8 shows that deletes are handled by adding the tuple ID to a deletion list, and that inserts lead to appends in separate delta columns. ColumnBM actually stores all delta columns together in a chunk, which corresponds to PAX [2]. Thus, both operations incur only one I/O. Updates are simply a deletion followed by an insertion. Updates make the delta columns grow, such that whenever their size exceeds a (small) percentage of the total table size, data storage should be reorganized, such that the vertical storage is up-to-date again and the delta columns are empty.

[Figure 8: Vertical Storage and Updates. The figure shows a table stored as immutable column blocks (key, flag, shipmod) in the buffer manager: "delete from TABLE where key=F" only adds tuple ID #5 to the deletion list (#del), and "insert into TABLE values (K,d,m)" only appends row #10 to the delta columns; the column storage blocks themselves are left untouched on updates.]

An advantage of vertical storage is that queries that access many tuples but not all columns save bandwidth (this holds both for RAM bandwidth and I/O bandwidth). We further reduce bandwidth requirements using lightweight compression. MonetDB/X100 supports enumeration types, which effectively store a column as a single-byte or two-byte integer. This integer refers to the #rowId of a mapping table. MonetDB/X100 automatically adds a Fetch1Join operation to retrieve the uncompressed value using the small integer when such columns are used in a query. Notice that since the vertical fragments are immutable, updates just go to the delta columns (which are never compressed) and do not complicate the compression scheme.

MonetDB/X100 also supports simple "summary" indices, similar to [12], which are used if a column is clustered (almost sorted). These summary indices contain a #rowId, the running maximum value of the column until that point in the base table, and a reversely running minimum, at a very coarse granularity (the default size is 1000 entries, with #rowIds taken at fixed intervals from the base table). These summary indices can be used to quickly derive #rowId bounds for range predicates. Notice again that, due to the property that vertical fragments are immutable, indices on them effectively require no maintenance. The delta columns, which are supposed to be small and in-memory, are not indexed and must always be accessed.

5 TPC-H Experiments

Table 4 shows the results of executing all TPC-H queries on both MonetDB/MIL and MonetDB/X100. We ran the SQL benchmark queries on an out-of-the-box MonetDB/MIL system with its SQL frontend on our AthlonMP platform (1533MHz, 1GB RAM, Linux 2.4) at SF=1. We also hand-translated all TPC-H queries to X100 algebra, and ran them on MonetDB/X100. The comparison between the first two result columns clearly shows that MonetDB/X100 overpowers MonetDB/MIL.

Both MonetDB/MIL and MonetDB/X100 use join indices over all foreign key paths. For MonetDB/X100 we sorted the orders table on date, and kept lineitem clustered with it. We use summary indices (see Section 4.3) on all date columns of both tables. We also sorted both suppliers and customers on (region,country). In all, total disk storage for MonetDB/MIL was about 1GB, and around 0.8GB for MonetDB/X100 (SF=1). The reduction was achieved by using enumeration types, where possible.

We also ran TPC-H both at SF=1 and SF=100 on our Itanium2 1.3GHz (3MB cache) server with 12GB RAM running Linux 2.4. The last column of Table 4 lists official TPC-H results for the MAXDATA Platinum 9000-4R, a server machine with four 1.5GHz (6MB cache) Itanium2 processors and 32GB RAM, running DB2 8.1 UDB.

We should clarify that all MonetDB TPC-H numbers are in-memory results; no I/O occurs. This should be taken into account especially when comparing with the DB2 results. It also shows that even at SF=100, MonetDB/X100 needs less than our 12GB RAM for each individual query. If we had had 32GB of RAM like the DB2 platform, the hot set for all TPC-H queries would have fit in memory.

While the DB2 TPC-H numbers obviously do include I/O, its impact may not be that strong, as the test platform uses 112 SCSI disks. This suggests that disks were added until DB2 became CPU-bound. In any case, and taking into account that CPU-wise the DB2 hardware is more than four times stronger, MonetDB/X100 performance looks very solid.
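The range-predicate use of the summary indices described in Section 4.3 can be sketched as follows (data layout and names are our illustration; the actual X100 index additionally keeps the reversely running minimum for upper bounds):

```c
#include <stddef.h>

/* Sketch of a coarse summary index: one entry per GRAIN rows,
   holding the running maximum of an (almost sorted) column over
   all rows up to that point.  A predicate col >= lo can then be
   answered by scanning the few summary entries to find a
   conservative first #rowId, skipping everything before it. */
enum { GRAIN = 1000 };

typedef struct {
    size_t n_entries;   /* number of summary entries                */
    double *run_max;    /* run_max[e] = max(col[0 .. e*GRAIN])      */
} summary_idx;

/* First rowId from which values >= lo may occur: every entry whose
   running maximum is still below lo proves that no earlier row can
   qualify, so those rows need not be scanned at all. */
size_t summary_lower_bound(const summary_idx *ix, double lo) {
    size_t e = 0;
    while (e < ix->n_entries && ix->run_max[e] < lo)
        e++;
    return e == 0 ? 0 : (e - 1) * GRAIN;
}
```

Because the vertical fragments are immutable, such an index is built once per fragment and, as the text notes, never needs maintenance.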
  Q  MonetDB/MIL       MonetDB/X100, 1CPU          DB2, 4CPU
        SF=1        SF=1      SF=1      SF=100       SF=100
     (AthlonMP)  (AthlonMP) (Itanium2) (Itanium2)
  1     3.72        0.50      0.31       30.25         229
  2     0.46        0.01      0.01        0.81          19
  3     2.52        0.04      0.02        3.77          16
  4     1.56        0.05      0.02        1.15          14
  5     2.72        0.08      0.04       11.02          72
  6     2.24        0.09      0.02        1.44          12
  7     3.26        0.22      0.22       29.47          81
  8     2.23        0.06      0.03        2.78          65
  9     6.78        0.44      0.44       71.24         274
 10     4.40        0.22      0.19       30.73          47
 11     0.43        0.03      0.02        1.66          20
 12     3.73        0.09      0.04        3.68          19
 13    11.42        1.26      1.04      148.22         343
 14     1.03        0.02      0.02        2.64          14
 15     1.39        0.09      0.04       14.36          30
 16     2.25        0.21      0.14       15.77          64
 17     2.30        0.02      0.02        1.75          77
 18     5.20        0.15      0.11       10.37         600
 19    12.46        0.05      0.05        4.47          81
 20     2.75        0.08      0.05        2.45          35
 21     8.85        0.29      0.17       17.61         428
 22     3.07        0.07      0.04        2.30          93

Table 4: TPC-H Performance (seconds)

Order(
  Project(
    Aggr(
      Select(
        Table(lineitem),
        <(l_shipdate, date('1998-09-03'))),
      [ l_returnflag, l_linestatus ],
      [ sum_qty = sum(l_quantity),
        sum_base_price = sum(l_extendedprice),
        sum_disc_price = sum(
          discountprice = *( -(flt('1.0'), l_discount),
                             l_extendedprice ) ),
        sum_charge = sum(*( +(flt('1.0'), l_tax),
                            discountprice ) ),
        sum_disc = sum(l_discount),
        count_order = count() ]),
    [ l_returnflag, l_linestatus, sum_qty,
      sum_base_price, sum_disc_price, sum_charge,
      avg_qty = /( sum_qty, cnt=dbl(count_order)),
      avg_price = /( sum_base_price, cnt),
      avg_disc = /( sum_disc, cnt), count_order ]),
  [ l_returnflag ASC, l_linestatus ASC])

Figure 9: Query 1 in X100 Algebra

5.1 Query 1 performance

As we did for MySQL and MonetDB/MIL, we now also study the performance of MonetDB/X100 on TPC-H Query 1 in detail. Figure 9 shows its translation in X100 Algebra. X100 implements detailed tracing and profiling support using low-level CPU counters, to help analyze query performance. Table 5 shows the tracing output generated by running TPC-H Query 1 on our Itanium2 at SF=1. The top part of the trace provides statistics on the level of the vectorized primitives, while the bottom part contains information on the (coarser) level of X100 algebra operators.

…from the respective enumeration tables. We can see that these fetch-joins are truly efficient, as they cost less than 2 cycles per tuple.

5.1.1 Vector Size Impact

We now investigate the influence of vector size on performance. X100 uses a default vector size of 1024, but users can override it. Preferably, all vectors together should comfortably fit the CPU cache size, hence they should not be too big. However, with really small vector sizes, the possibility of exploiting CPU parallelism disappears. Also, in that case, the impact of interpretation overhead in the X100 Algebra next() methods will grow.
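As an aside connecting Figure 9 back to Section 4.2: the sum_disc_price aggregate is exactly the kind of expression where a compound primitive avoids materializing the intermediate discountprice vector. A hypothetical fused update primitive (names and grouping scheme are ours, not X100's actual generated code) might look like:

```c
#include <stddef.h>

/* Sketch: fused aggregate-update primitive for
   sum_disc_price += (1.0 - l_discount) * l_extendedprice.
   group[i] selects the accumulator slot (e.g. one slot per
   (l_returnflag, l_linestatus) combination); the intermediate
   product stays in a register and never hits memory. */
void aggr_sum_discountprice(size_t n, double *restrict acc,
                            const unsigned *restrict group,
                            const double *restrict discount,
                            const double *restrict extprice) {
    for (size_t i = 0; i < n; i++)
        acc[group[i]] += (1.0 - discount[i]) * extprice[i];
}
```

One such loop per input vector replaces two map primitives plus a separate aggregate update, which is consistent with the per-primitive cycle counts the trace in Table 5 is meant to expose.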
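The vector-size trade-off just described can be made concrete with a rough cost model (a sketch with made-up constants, not measured X100 numbers):

```c
/* Illustrative cost model for the vector-size trade-off: each
   next() call pays a fixed interpretation overhead that is
   amortized over the tuples in the vector, while vectors too
   large for the CPU cache pay an extra per-tuple memory penalty.
   All constants are invented for illustration only. */
double cost_per_tuple(unsigned vector_size) {
    const double interp_overhead = 200.0;    /* cycles per next() call   */
    const double work = 2.0;                 /* cycles of real work/tuple */
    const unsigned cache_tuples = 64 * 1024; /* beyond this, vectors spill */

    double c = work + interp_overhead / (double)vector_size;
    if (vector_size > cache_tuples)
        c += 10.0;                           /* cache-miss penalty/tuple */
    return c;
}
```

Evaluating this model over increasing vector sizes reproduces the qualitative shape the section describes: cost explodes towards vector size 1 (pure tuple-at-a-time interpretation), flattens once the overhead is amortized, and creeps up again when the vectors no longer fit the cache.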