
Prof. Hasso Plattner

A Course in In-Memory Data Management
The Inner Mechanics of In-Memory Databases

August 30, 2013

This learning material is part of the reading material for Prof. Plattner's online lecture "In-Memory Data Management" taking place at www.openHPI.de. If you have any questions or remarks regarding the online lecture or the reading material, please send us a note at openhpi-imdb@hpi.uni-potsdam.de. We are glad to further improve the material.
Chapter 8
Data Layout In Main Memory

In this chapter, we address the question of how data is organized in memory.
Relational database tables have a two-dimensional structure, but main memory is organized one-dimensionally, providing memory addresses that start at zero and increase serially to the highest available location. The database storage layer has to decide how to map the two-dimensional table structures to the linear memory address space.
We will consider two ways of representing a table in memory, called row and columnar layout, and a combination of both ways, a hybrid layout.

8.1 Cache Effects on Application Performance

In order to understand the implications introduced by row-based and column-based layouts, a basic understanding of memory access performance is essential. Due to the different available types of memory as described in Section 4.1, modern computer systems leverage a so-called memory hierarchy as described in Section 4.2. These caching mechanisms, plus techniques like the Translation Lookaside Buffer (TLB, see Section 4.4) or hardware prefetching (see Section 4.5), introduce various performance implications, which will be outlined in this section, which is based on [SKP12].
The described caching and virtual memory mechanisms are implemented transparently from the viewpoint of an actual application. However, knowing the characteristics of the system in use and optimizing applications based on this knowledge can have a crucial impact on application performance.
The following two sections describe two small experiments, outlining performance differences when accessing main memory. These experiments are for the interested reader and will not be relevant for the exam.


Fig. 8.1: Sequential vs. Random Array Layout (each array element consists of a pointer followed by padding)

8.1.1 The Stride Experiment

As the name random access memory suggests, memory can be accessed randomly and one would expect constant access costs. In order to test this assumption, we run a simple benchmark accessing a constant number of addresses with an increasing stride, i.e. distance, between the accessed addresses.
We implemented this benchmark by iterating through an array, chasing a pointer. The array is filled with structs. Structs are data structures that allow the creation of user-defined aggregate data types grouping multiple individual variables together. The structs consist of a pointer and an additional data attribute realizing the padding in memory, resulting in a memory access with the desired stride when following the pointer-chained list.
struct element {
    struct element *pointer;  /* next element in the pointer-chained list */
    size_t padding[PADDING];  /* realizes the desired stride in memory */
};
In case of a sequential array, the pointer of element i points to element
i + 1 and the pointer of the last element references the first element so that
the loop through all array elements is closed. In case of a random array,
the pointer of each element points to a random element of the array while
ensuring that every element is referenced exactly once. Figure 8.1 outlines
the created sequential and random arrays.
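The setup described above can be sketched in a few lines of C. This is a minimal sketch, not the original benchmark code; the names `build_chain` and `chase` and the `PADDING` value are illustrative assumptions.

```c
#include <stddef.h>

#define PADDING 7  /* pads each element to one 64-byte cache line (assumed) */

struct element {
    struct element *pointer;  /* next element in the chain */
    size_t padding[PADDING];  /* realizes the stride in memory */
};

/* Link the elements following the order given in perm: element perm[i]
 * points to element perm[i+1], and the last one points back to the first,
 * closing the loop. The identity permutation yields the sequential array;
 * a shuffled permutation (e.g. Fisher-Yates) yields the random array while
 * still referencing every element exactly once. */
void build_chain(struct element *arr, const size_t *perm, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        arr[perm[i]].pointer = &arr[perm[i + 1]];
    arr[perm[n - 1]].pointer = &arr[perm[0]];
}

/* Follow the pointer chain for n steps. Because each load depends on the
 * result of the previous one, the hardware cannot batch or reorder the
 * accesses, which is what makes pointer chasing a useful latency probe. */
struct element *chase(struct element *start, size_t n) {
    struct element *p = start;
    for (size_t i = 0; i < n; i++)
        p = p->pointer;
    return p;
}
```

Timing the `chase` loop and dividing by the number of steps gives the cycles-per-element numbers plotted below.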
If the assumption holds and random memory access costs are constant, then the size of the padding in the array and the array layout (sequential or random) should make no difference when iterating over the array. Figure 8.2 shows the result for iterating through a list with 4,096 elements, while following the pointers inside the elements and increasing the padding between the elements. As we can clearly see, the access costs are not constant and increase with an increasing stride. We also see multiple points of discontinuity in the curves, e.g. the access times increase heavily up to a stride of 64 bytes and then continue increasing with a smaller slope.
Fig. 8.2: Cycles for Cache Accesses with Increasing Stride (CPU cycles per element over strides from 2^0 to 2^18 bytes, for the random and sequential layouts; the cache line size and page size are marked)

Fig. 8.3: Cache Misses for Cache Accesses with Increasing Stride ((a) sequential access, (b) random access; L1, L2, L3 and TLB misses per element over strides from 2^0 to 2^18 bytes)

Figure 8.3 indicates that an increasing number of cache misses is causing the increase in access times. The first point of discontinuity in Figure 8.2 corresponds almost exactly to the size of the cache lines of the test system. The strong increase is due to the fact that with a stride smaller than 64 bytes, multiple list elements are located on one cache line and the overhead of loading one line is amortized over these elements.
For strides greater than 64 bytes, we would expect a cache miss for every single list element and no further increase in access times. However, as the stride gets larger, the array is spread over multiple pages in memory and more TLB misses occur, as the virtual addresses on the new pages have to be translated into physical addresses. The number of TLB misses increases up to the page size of 4 KB and then stays at its worst case of one miss per element. With strides greater than the page size, the TLB misses can induce additional cache misses when translating a virtual to a physical address. These cache misses are due to accesses to the paging structures, which reside in main memory [BCR10, BT09, SS95].

Fig. 8.4: Cycles and Cache Misses for Cache Accesses with Increasing Working Sets ((a) sequential and random access: CPU cycles per element, (b) random access: L1, L2, L3 and TLB misses per element; both over accessed areas from 16 KB to 256 MB, with the L1, L2 and L3 cache sizes marked)

To summarize, the performance of main memory accesses can differ largely depending on the access pattern. In order to improve application performance, main memory access should be optimized to exploit the caches.

8.1.2 The Size Experiment

In a second experiment, we access a constant number of addresses in main memory with a constant stride of 64 bytes and vary the size of the working set, i.e. the accessed area in memory. A run with n memory accesses and a working set size of s bytes would iterate n/(s/64) times through the array, which is created as described earlier in the stride experiment in Section 8.1.1.
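The pass count follows from the fact that, with a 64-byte stride, a working set of s bytes holds s/64 elements. A one-line helper makes this concrete; the name `passes` is illustrative, not from the original benchmark:

```c
#include <stddef.h>

/* Number of complete passes through the array in the size experiment:
 * a working set of working_set_bytes holds working_set_bytes/64 elements
 * (one per 64-byte stride), so n accesses correspond to
 * n / (working_set_bytes/64) passes. */
size_t passes(size_t n_accesses, size_t working_set_bytes) {
    size_t elements = working_set_bytes / 64;
    return n_accesses / elements;
}
```

For example, with n = 1,048,576 accesses and a 64 KB working set (1,024 elements), the benchmark iterates 1,024 times through the array.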
Figure 8.4a shows that the access costs differ by up to a factor of 100, depending on the working set size. The points of discontinuity correlate with the sizes of the caches in the system. As long as the working set size is smaller than the size of the L1 cache, only the first iteration results in cache misses and all other accesses can be answered out of the cache. As the working set size increases, the accesses in one iteration start to evict the earlier accessed addresses, resulting in cache misses in the next iteration.
Figure 8.4b shows the individual cache misses with increasing working set sizes. Up to working sets of 32 KB, the misses for the L1 cache go up to one per element; the L2 cache misses reach their plateau at the L2 cache size of 256 KB, and the L3 cache misses at 12 MB.
As we can see, the larger the accessed area in main memory, the more capacity cache misses occur, resulting in poorer application performance. Therefore, it is advisable to process data in cache-sized chunks if possible.
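Chunked processing can be sketched as follows. This is a minimal illustration, not from the original text; the `CHUNK` size and the choice of two aggregate passes are assumptions.

```c
#include <stddef.h>

/* Roughly L2-sized chunks (assumed 256 KB L2, as on the test system). */
#define CHUNK (256 * 1024 / sizeof(long))

/* Compute two aggregates over the data chunk by chunk. Because the second
 * pass over each chunk runs while the chunk is still cached, this avoids
 * the capacity misses that two full passes over the whole array would
 * incur once the array exceeds the cache size. The results are identical
 * to those of two global passes. */
void aggregate(const long *data, size_t n, long *sum, long *sum_sq) {
    *sum = 0;
    *sum_sq = 0;
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = base + CHUNK < n ? base + CHUNK : n;
        for (size_t i = base; i < end; i++)          /* pass 1: sum */
            *sum += data[i];
        for (size_t i = base; i < end; i++)          /* pass 2: sum of squares */
            *sum_sq += data[i] * data[i];
    }
}
```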

8.2 Row and Columnar Layouts

Let us consider a simple example to illustrate the two mentioned approaches for representing a relational table in memory. For simplicity, we assume that all values are stored as strings directly in memory and that we do not need to store any additional data. As an example, let us look at the simple world population table:

Id  Name         Country    City
1   Paul Smith   Australia  Sydney
2   Lena Jones   USA        Washington
3   Marc Winter  Germany    Berlin

As discussed above, the database must transform its two-dimensional table into a one-dimensional series of bytes for the operating system to write them to memory. The classical and obvious approach is a row- or record-based layout. In this case, all attributes of a tuple are stored consecutively and sequentially in memory. In other words, the data is stored tuple-wise. Considering our example table, the data would be stored as follows: "1, Paul Smith, Australia, Sydney; 2, Lena Jones, USA, Washington; 3, Marc Winter, Germany, Berlin".
In contrast, in a columnar layout, the values of one column are stored together, column by column. The resulting layout in memory for our example would be: "1, 2, 3; Paul Smith, Lena Jones, Marc Winter; Australia, USA, Germany; Sydney, Washington, Berlin".
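The two layouts differ only in the address computation used to locate attribute c of tuple r. A minimal sketch with the example table (the helper names `row_get` and `col_get` are illustrative):

```c
/* The example table serialized both ways, as arrays of string values.
 * Row layout: all attributes of one tuple are adjacent.
 * Columnar layout: all values of one attribute are adjacent. */
static const char *row_layout[] = {
    "1", "Paul Smith",  "Australia", "Sydney",
    "2", "Lena Jones",  "USA",       "Washington",
    "3", "Marc Winter", "Germany",   "Berlin",
};
static const char *col_layout[] = {
    "1", "2", "3",
    "Paul Smith", "Lena Jones", "Marc Winter",
    "Australia", "USA", "Germany",
    "Sydney", "Washington", "Berlin",
};

enum { ROWS = 3, COLS = 4 };

/* Attribute c of tuple r under each layout. */
const char *row_get(int r, int c) { return row_layout[r * COLS + c]; }
const char *col_get(int r, int c) { return col_layout[c * ROWS + r]; }
```

A scan over the Country column reads three adjacent entries of `col_layout`, while in `row_layout` it has to jump COLS entries between consecutive values, touching more cache lines for the same result.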
The columnar layout is especially effective for set-based reads. In other words, it is useful for operations that work on many rows but only on a notably smaller subset of all columns, as the values of one column can be read sequentially, e.g. when performing aggregate calculations. However, when performing operations on single tuples or when inserting new rows, a row-based layout is beneficial. The different access patterns for row-based and column-based operations are illustrated in Figure 8.5.
Currently, row-oriented architectures are widely used for OLTP workloads, while column stores are widely utilized in OLAP scenarios like data warehousing, which typically involve a smaller number of highly complex queries over the complete data set.

8.3 Benefits of a Columnar Layout

As mentioned above, there are use cases where a row-based table layout can be more efficient. Nevertheless, many arguments speak in favor of using a columnar layout in an enterprise scenario.
(a) Row Data Layout:
- Data is stored tuple-wise
- Leverages co-location of the attributes of a single tuple
- Low cost for tuple reconstruction, but higher cost for a sequential scan of a single attribute

(b) Columnar Data Layout:
- Data is stored attribute-wise
- Leverages sequential scan speed in main memory
- Tuple reconstruction is expensive

Fig. 8.5: Illustration of Memory Accesses for Row-Based and Column-Based Operations on Row and Columnar Data Layouts


First, when analyzing the workloads that enterprise databases are facing, it turns out that the actual workloads are more read-oriented and dominated by set processing [KKG+11].
Second, despite the fact that hardware technology develops very rapidly and the size of available main memory constantly grows, the use of efficient compression techniques is still important in order to a) keep as much data in main memory as possible and b) minimize the amount of data that has to be read from memory to process queries, as well as the data transfer between non-volatile storage media and main memory.
Using column-based table layouts enables the use of efficient compression techniques leveraging the high data locality in columns (see Chapter 7). These techniques mainly exploit the similarity of the data stored in a column. Dictionary encoding can be applied to row-based as well as column-based table layouts, whereas other techniques like prefix encoding, run-length encoding, cluster encoding or indirect encoding only leverage their full benefits on columnar table layouts.
Third, using columnar table layouts enables very fast column scans, as they can sequentially scan the memory, allowing e.g. on-the-fly calculation of aggregates. Consequently, storing pre-calculated aggregates in the database can be avoided, thus minimizing redundancy and complexity of the database.

8.4 Hybrid Table Layouts

As stated above, set processing operations dominate enterprise workloads. Nevertheless, each concrete workload is different and might favor a row-based or a column-based layout. Hybrid table layouts combine the advantages of both worlds, allowing single attributes of a table to be stored column-oriented while grouping other attributes into a row-based layout [GKP+11]. The optimal combination highly depends on the actual workload and can be calculated by layouting algorithms.
As an illustrating example, think of attributes that inherently belong together in commercial applications, e.g. quantity and measuring unit, or payment conditions in accounting. The idea of the hybrid layout is that if a set of attributes is processed together, it makes sense from a performance point of view to physically store them together. Considering the example table provided in Section 8.2 and assuming that the attributes Id and Name are often processed together, we can outline the following hybrid data layout for the table: "1, Paul Smith; 2, Lena Jones; 3, Marc Winter; Australia, USA, Germany; Sydney, Washington, Berlin". In this case, the first two attributes are stored row-based, while country and city are stored column-based. This hybrid layout may decrease the number of cache misses caused by the expected workload, resulting in increased performance.
The usage of hybrid layouts can be beneficial, but it also introduces new questions, like how to find the optimal layout for a given workload or how to react to a changing workload.
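The hybrid layout above can be sketched directly in C: the attributes assumed to be accessed together become one struct stored contiguously per tuple, while the remaining attributes stay columnar. The names `id_name` and `id_name_group` are illustrative.

```c
/* Hybrid layout for the example table: Id and Name, assumed to be
 * processed together, are grouped row-wise ... */
struct id_name {
    const char *id;
    const char *name;
};

static const struct id_name id_name_group[] = {
    {"1", "Paul Smith"},
    {"2", "Lena Jones"},
    {"3", "Marc Winter"},
};

/* ... while Country and City remain column-oriented. */
static const char *countries[] = { "Australia", "USA", "Germany" };
static const char *cities[]    = { "Sydney", "Washington", "Berlin" };
```

Reading Id together with Name now touches one contiguous struct (ideally a single cache line), while a scan over Country alone still reads sequential memory, which is how the hybrid layout serves both access patterns of the assumed workload.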

8.5 References

[BCR10] T.W. Barr, A.L. Cox, and S. Rixner. Translation Caching: Skip, Don't Walk (the Page Table). ACM SIGARCH Computer Architecture News, 38(3):48-59, 2010.
[BT09] V. Babka and P. Tůma. Investigating Cache Parameters of x86 Family Processors. Computer Performance Evaluation and Benchmarking, pages 77-96, 2009.
[GKP+11] M. Grund, J. Krueger, H. Plattner, A. Zeier, S. Madden, and P. Cudre-Mauroux. HYRISE - A Hybrid Main Memory Storage Engine. In VLDB, 2011.
[KKG+11] Jens Krueger, Changkyu Kim, Martin Grund, Nadathur Satish, David Schwalb, Jatin Chhugani, Hasso Plattner, Pradeep Dubey, and Alexander Zeier. Fast Updates on Read-Optimized Databases Using Multi-Core CPUs. PVLDB, 2011.
[SKP12] David Schwalb, Jens Krueger, and Hasso Plattner. Cache conscious column organization in in-memory column stores. Technical Report 60, Hasso-Plattner-Institute, December 2012.

[SS95] R.H. Saavedra and A.J. Smith. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers, 1995.
