Alexandru Ionescu
Abstract
Databases nowadays play a huge role in the field of computer science. Formally, a
database is an organized collection of data, generally stored and accessed electronically
from a computer system. The database management system (DBMS) is the software
that interacts with end users, applications and the database itself.
Some of the major advantages of using databases are accuracy, the ability to manage
large amounts of data, security of information and data integrity. Thus we are interested
in a performant system that provides all these benefits and can also be adapted to future
requirements. When we think about the performance of a database, indexing is the first
thing that comes to mind. It is a data structure technique used to quickly locate and
access the data within the database.
In this project, I investigate the existing literature, critically evaluating previous work
done in this area, while also conducting experiments to better understand why some
data structures perform better than others in various scenarios. I then propose an
implementation of a custom tree-based index that can be used in real-time environments.
Acknowledgements
Acknowledgements go here.
Table of Contents
1 Introduction
1.1 Project goals
1.2 Summary of contributions
1.2.1 Analysis of existing work
1.2.2 Practical work in DBToaster
1.3 Report Structure
2 Background
2.1 The concept of database indexing and why it is needed
2.1.1 Database Index
2.1.2 Why indexing is useful?
2.2 Introduction of B+Tree
2.3 Bw-Tree
2.4 Adaptive Radix Tree
2.5 MassTree
5 Future Work
5.1 Implementation part
5.2 Experimental part
5.2.1 Increase number of threads and relevant statistics
5.3 Integration part
6 Conclusions
Bibliography
Introduction
The concept of a database was made possible by the introduction of direct access
storage media such as magnetic disks, which gained popularity in the mid 1960s. Since
then, the size and performance of databases and their respective software have grown by
orders of magnitude. Support for larger and more complex types of information was
enabled by advances in computer memory, processors, storage and networks.
In this context, database indexing is increasingly useful as it is a way to optimize the
performance of the system by minimizing the number of disk accesses required when
a query is processed. To put it simply, an index is a data structure that improves the
speed of data retrieval operations on a database table at the cost of additional writes
and storage space to maintain the index data structure.
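As a rough illustration of this trade-off, the following C++ sketch keeps a small in-memory table together with an ordered index on one column. The names Row, IndexedTable, find_by_scan and find_by_index are hypothetical, and std::map merely stands in for a real index structure; this is a sketch under those assumptions, not how any particular DBMS implements indexing.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical row type, used only for illustration.
struct Row {
    int id;
    std::string name;
};

// A tiny table with an ordered index on Row::id.
class IndexedTable {
public:
    // Every insert pays for one extra write into the index and for the
    // storage of one extra (key, position) entry.
    void insert(const Row& row) {
        rows_.push_back(row);
        // emplace keeps the first position if the id is already indexed.
        index_.emplace(row.id, rows_.size() - 1);
    }

    // Without an index: inspect the rows one by one, O(n).
    const Row* find_by_scan(int id) const {
        for (const Row& row : rows_) {
            if (row.id == id) return &row;
        }
        return nullptr;
    }

    // With an index: O(log n) search in a balanced tree.
    // The returned pointer is only valid until the next insert.
    const Row* find_by_index(int id) const {
        auto it = index_.find(id);
        return it == index_.end() ? nullptr : &rows_[it->second];
    }

private:
    std::vector<Row> rows_;              // the "table"
    std::map<int, std::size_t> index_;   // key -> position in rows_
};
```

Both lookup functions return the same row; the index only changes how many rows have to be inspected to find it, while insert becomes slightly more expensive.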
There is a range of operations that we can perform on databases, such as insert, read,
update or scan. Although some index data structures perform especially well on some of
these operations, none of them performs well on all of them. Thus, experimental studies
are performed in order to understand which index structure is best suited to a
particular situation.
tended to accommodate future hardware and software designs, what I could have
done in more detail in my project, and the future work that can be done in the DBToaster
software, should one choose to do so.
Chapter 2
Background
In this chapter I present the concept of indexing in databases and explain why it is
important. Then I introduce the tree-based indexes that I have analysed as part of my
experiments in this project, together with a short summary of the existing literature
covering their implementation and expected performance in different situations.
single entry for a single node, a B-Tree uses an array of entries for a single node, with
a reference to a child node for each of these entries, which leads to a smaller tree
height. Another property is that all leaf nodes of a B-Tree must be at the same level.
Figure 2.1
B+Trees are the ones generally implemented in databases, as they eliminate the above
drawback by storing data pointers only at the leaf nodes of the tree. As a result, the
structure of the leaf nodes of a B+Tree is quite different from the structure of its
internal nodes. We must also note that, because data pointers are present only at the
leaf nodes, the leaf nodes must store all the key values along with their corresponding
data pointers to the disk file block. Also, the leaf nodes are linked to provide ordered
access to the records.
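The role of the linked leaves can be sketched in C++ as follows. The types LeafNode and RecordId and the function range_scan are hypothetical and heavily simplified (integer keys, no internal nodes, no disk access); the sketch only shows how a range scan walks sideways along the leaf level once the first leaf has been located.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pointer to a disk file block.
using RecordId = std::size_t;

// Simplified B+Tree leaf: sorted keys, one data pointer per key, and a link
// to the next leaf on the right.
struct LeafNode {
    std::vector<int> keys;          // sorted within the leaf
    std::vector<RecordId> records;  // records[i] belongs to keys[i]
    LeafNode* next;                 // right sibling, or nullptr for the last leaf
};

// Collect the records of all keys in [low, high], starting from the leaf
// that a root-to-leaf search for `low` would have reached.
std::vector<RecordId> range_scan(const LeafNode* leaf, int low, int high) {
    std::vector<RecordId> result;
    for (; leaf != nullptr; leaf = leaf->next) {
        for (std::size_t i = 0; i < leaf->keys.size(); ++i) {
            if (leaf->keys[i] > high) return result;  // past the range: stop
            if (leaf->keys[i] >= low) result.push_back(leaf->records[i]);
        }
    }
    return result;
}
```

Because every key lives in some leaf and the leaves are chained in key order, the scan never has to climb back up through the internal nodes.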
2.3 Bw-Tree
Although B+Trees are very useful data structures for databases, they need to be adapted
to the software and hardware of the future, which will have to handle the enormous
amount of data generated each day. Bw-Trees address this issue. They preserve the
elementary ideas of B-Trees but focus on improving performance on more recent software
paradigms and hardware. These trees were introduced by a team at Microsoft Research
in 2013. Two important concerns that Bw-Trees address are multi-core processing
and disk latency.
Multi-core processing performance depends on two things: high concurrency and
high CPU cache hit ratios. As the amount of concurrency increases, designs involving
mutex locks (latches) limit the system’s scalability. Bw-Trees sustain this concurrency
without using latches. Also, to achieve a high CPU cache hit ratio, they do not update
pages in place in memory, but instead use so-called delta updates.
Disk latency is a major problem: hard disks offer a low number of I/O operations per
second, which is not ideal for a system. Flash storage is a better alternative as it offers
more I/O operations per second at a lower cost. Bw-Trees are aimed at flash storage and
perform log structuring themselves at the storage layer. This approach ensures that write
performance is as high as possible for both high-end and low-end flash devices.
Figure 2.2
Mapping Table
A mapping table, maintained in the cache layer, maps logical pages to physical pages.
Logical pages are identified by a logical “page identifier” (PID). The mapping table
translates a PID into either a flash offset (the address of a page on stable storage)
or a memory pointer (the address of the page in memory). The mapping table severs the
connection between physical location and inter-node links. This enables the physical
location of a Bw-Tree node to change on every update and every time a page is written
to stable storage, without the location change being propagated to the root of the tree.
Delta Updates
Page state changes are made by creating a delta record (describing the change) and
prepending it to the existing page state. The memory address of the new delta record is
installed into the page’s physical address slot in the mapping table using the atomic
compare-and-swap (CAS) instruction. Pages are consolidated (a new page is created that
applies all delta changes) both to reduce memory footprint and to improve search
performance.
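The interplay of the mapping table and delta updates can be sketched in C++ with std::atomic. The names Delta, MappingTable and kPages are hypothetical, and the sketch makes strong simplifying assumptions: the only delta type is "set key to value", the base page is represented by the end of the chain, pages live only in memory, and consolidation and safe memory reclamation under concurrency are left out. It is meant to show the latch-free CAS install, not the actual Bw-Tree implementation.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// One record in a page's delta chain (hypothetical, simplified).
struct Delta {
    int key;
    int value;
    const Delta* next;  // older page state, nullptr at the end of the chain
};

using PID = std::size_t;  // logical page identifier

class MappingTable {
public:
    static constexpr std::size_t kPages = 16;  // assumed fixed table size

    MappingTable() {
        for (auto& slot : slots_) slot.store(nullptr);
    }

    // Single-threaded cleanup only; a concurrent design needs safe reclamation.
    ~MappingTable() {
        for (auto& slot : slots_) {
            const Delta* d = slot.load();
            while (d != nullptr) {
                const Delta* older = d->next;
                delete d;
                d = older;
            }
        }
    }

    // Prepend a delta to page `pid` (assumes pid < kPages). The CAS succeeds
    // only if no other thread changed the slot in the meantime; on failure
    // `head` is refreshed with the current slot value and we retry.
    void upsert(PID pid, int key, int value) {
        Delta* delta = new Delta{key, value, nullptr};
        const Delta* head = slots_[pid].load();
        do {
            delta->next = head;
        } while (!slots_[pid].compare_exchange_weak(head, delta));
    }

    // Walk the chain from newest to oldest; the first match is the current value.
    bool lookup(PID pid, int key, int& value_out) const {
        for (const Delta* d = slots_[pid].load(); d != nullptr; d = d->next) {
            if (d->key == key) {
                value_out = d->value;
                return true;
            }
        }
        return false;
    }

private:
    std::array<std::atomic<const Delta*>, kPages> slots_;
};
```

Note that an update never touches existing records: it only swings one pointer in the mapping table, which is why other nodes that refer to the page by its PID do not need to change. The sketch also makes the cost visible, since every lookup has to walk the chain until consolidation shortens it.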
Figure 2.3
In contrast to regular trees, the edges of a radix tree can be labeled with sequences of
elements as well as with single elements, which makes radix trees more efficient on
small sets. In regular trees, whole keys are compared en masse from their beginning up
to the point of inequality. In a radix tree, however, the key at each node is compared
chunk-of-bits by chunk-of-bits, where the quantity of bits in that chunk at that node is
the radix r of the radix trie.
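To make the chunk-by-chunk comparison concrete, the following C++ sketch extracts the chunk that would be inspected at a given level for a 32-bit key. The function name chunk_at and its parameters are hypothetical; the sketch assumes that the chunk width divides 32 and that levels are counted from the most significant bits.

```cpp
#include <cstdint>

// Return the chunk of `span_bits` bits that a radix tree would inspect at the
// given level of a 32-bit key, counting levels from the most significant bits.
// Assumes span_bits divides 32 and level < 32 / span_bits.
std::uint32_t chunk_at(std::uint32_t key, unsigned level, unsigned span_bits) {
    const unsigned shift = 32 - span_bits * (level + 1);
    const std::uint32_t mask =
        (span_bits == 32) ? 0xFFFFFFFFu : ((1u << span_bits) - 1u);
    return (key >> shift) & mask;
}
```

With 8-bit chunks a 32-bit key is resolved in at most four steps, regardless of how many keys the tree holds; with 4-bit chunks the same key needs up to eight steps but each node has far fewer possible children.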
The adaptive radix tree (ART) is a variant of the radix tree that uses adaptive node
sizes. One major drawback of ordinary radix trees is their space usage, because they
use a constant node size at every level. The major difference between a radix tree and
the ART is that each ART node has a variable size based on its number of children,
growing as new entries are added. Thus, the adaptive radix tree makes better use of
space without sacrificing speed.
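The idea of adaptive node sizes can be sketched in C++ as choosing the smallest node layout that fits the current number of children. The capacities 4, 16, 48 and 256 are those of the published ART design and are not stated in this report, so they should be read as an assumption here; the names NodeType, node_type_for and reserved_slots are hypothetical.

```cpp
#include <cstddef>

// Node capacities assumed from the published ART design.
enum class NodeType { Node4 = 4, Node16 = 16, Node48 = 48, Node256 = 256 };

// Smallest node type that can hold `children` child pointers
// (assumes children <= 256, i.e. an 8-bit chunk per level).
NodeType node_type_for(std::size_t children) {
    if (children <= 4) return NodeType::Node4;
    if (children <= 16) return NodeType::Node16;
    if (children <= 48) return NodeType::Node48;
    return NodeType::Node256;
}

// Child slots reserved by an adaptive node, to be compared with the 256 slots
// that a fixed-size node with an 8-bit chunk would always reserve.
std::size_t reserved_slots(std::size_t children) {
    return static_cast<std::size_t>(node_type_for(children));
}
```

A node with three children therefore reserves 4 slots instead of 256, and it is only replaced by the next larger layout when an insertion no longer fits.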
Looking at the performance of this data structure, we observe the following: in terms
of lookups, ART outperforms tuned read-only search trees, while also performing well
under insertions and deletions. This type of tree solves the problem of space
consumption, often seen in radix trees, by adapting the data structures used for its
internal nodes. Although its performance is similar to that of hash tables, ART also
keeps the data in order, so it can support range scans and prefix lookups too.
2.5 MassTree
When storing millions of (key, value) pairs in memory, a hash table is a natural choice
if only point lookups need to be supported. However, to support range queries, a tree
structure is preferable. One candidate is the B+Tree. Its number of levels is kept
small because each node has a high fan-out. But this means that a single node packs a
large number of keys, which results in a large number of key comparisons when searching
through the tree. Also, the cost of key comparisons can be quite high with
variable-length keys. If the keys are really long they can each occupy several cache
lines, so comparing two of them can hurt cache locality.
MassTree is an efficient tree data structure that relies on splitting variable-length
keys into a number of fixed-length pieces called slices. When going down the tree, you
compare the first slice of each key, then the second, and so on, with each comparison
having a constant cost. This is far more efficient than comparing whole strings again
and again. The data structure exploits cache locality by doing these fixed-size
comparisons.
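The slicing idea can be sketched in C++ as follows. The sketch assumes 8-byte slices packed big-endian into 64-bit integers, so that one integer comparison orders two slices the same way a byte-wise comparison would. The function names to_slices and compare_keys are hypothetical, and the zero padding means that keys differing only in trailing zero bytes are not distinguished; a real implementation would also have to keep track of key lengths.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Split a key into 8-byte slices. Each slice is packed big-endian into a
// 64-bit integer (zero-padded at the end), so comparing two slices as
// integers gives the same order as comparing the underlying bytes.
std::vector<std::uint64_t> to_slices(const std::string& key) {
    std::vector<std::uint64_t> slices;
    for (std::size_t start = 0; start < key.size(); start += 8) {
        std::uint64_t slice = 0;
        for (std::size_t i = 0; i < 8; ++i) {
            std::uint64_t byte = 0;
            if (start + i < key.size()) {
                byte = static_cast<unsigned char>(key[start + i]);
            }
            slice = (slice << 8) | byte;
        }
        slices.push_back(slice);
    }
    return slices;
}

// Three-way comparison, slice by slice: each step is one integer comparison.
// Returns -1, 0 or 1.
int compare_keys(const std::string& a, const std::string& b) {
    const std::vector<std::uint64_t> sa = to_slices(a);
    const std::vector<std::uint64_t> sb = to_slices(b);
    const std::size_t n = sa.size() < sb.size() ? sa.size() : sb.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (sa[i] != sb[i]) return sa[i] < sb[i] ? -1 : 1;
    }
    if (sa.size() == sb.size()) return 0;
    return sa.size() < sb.size() ? -1 : 1;
}
```

Two keys that share their first eight bytes are told apart only at the second slice, which corresponds to descending one layer further in the tree.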
A classic search structure for variable-length strings over a fixed alphabet is the
trie. Its disadvantage is that a trie which splits strings into single characters
becomes too deep when there are many long keys. The child pointers in an
English-alphabet trie are generally stored in an array indexed by the next character.
With large alphabets this is no longer feasible. The idea is then to use B+Trees, by
making each node in the trie its own B+Tree. B+Trees represent the child pointers
compactly by using key ranges; the price is that a search takes logarithmic time,
because a B+Tree cannot be indexed in constant time. So we can define MassTree as a
trie-like concatenation of B+Trees.
Chapter 3
This chapter contains my critical evaluation of previous work in the area of tree-based
index data structures, together with their performance results and the optimization
techniques used. I have also included statistics that I obtained by running similar
experiments, along with explanations of the results.
Chapter 3. Experimental Analysis of Tree-Based Indices
In terms of memory, among all compared indexes, the ART has the lowest usage for
Mono-Int keys, while the B+Tree has the lowest for Rand-Int keys, due to its compact
internal structure and large node size. The SkipList consumes more memory than
the B+Tree and the ART due to its customized memory allocator and preallocation; its
memory usage is comparable to that of the Bw-Tree. The MassTree always has the highest
memory usage.
3.2 Similar experiments
3.2.1 Observations
I now present the results obtained by running similar experiments. I also included
SkipList in the analysis to give a broader picture of the performance statistics,
even though this data structure is not a tree.
On workload A, ART performs best, being well adapted to lookups and updates, followed
by MassTree (slower by a factor of 1.5x). B+Tree is slower than MassTree, although on
Rand-Ints they have similar performance. Bw-Tree is the slowest, by a factor of 1.3x
relative to B+Tree.
On workload C, ART performs best, as it is well adapted to point lookups. MassTree and
B+Tree perform similarly to each other. On Rand-Ints they are slower than ART by a
factor of 1.7x, with B+Tree being the better of the two. On Mono-Ints they are slower
than ART by a factor of 2.8x, with MassTree being the better of the two. Bw-Tree
performs worst of these four data structures on this workload (slower by a factor of
4.3x on Mono-Ints and by a factor of 2.1x on Rand-Ints).
We can see that on all workloads Bw-Tree is slower than its competitors (B+Tree,
MassTree, ART). This comes from the overhead that the indirection layer and the delta
records impose on the Bw-Tree, which causes this lock-free index implementation to
underperform the lock-based indexes, in spite of some previous claims that lock-free
indexes are superior to lock-based ones.
ART is the leader on workloads dominated by read and update operations, but does not
perform as well on the workload based on insert and scan operations. The exceptional
running time of this data structure on reads and updates comes from the fact that its
performance depends on the length of the keys and not on the total number of elements
stored at a particular moment.
From the latest graphs, it can be observed that on Rand-Ints MassTree generally has
the highest memory usage, followed by SkipList, Bw-Tree, ART and B+Tree, in this order.
On Mono-Ints, though, SkipList comes first, followed by MassTree, Bw-Tree, B+Tree and
ART, in descending order of memory consumption.
Chapter 4
In the second part of the project, my goal was to use tree-based index data structures
to add new functionality to the DBToaster software and, in some cases, to speed up its
queries substantially.
I started the work on DBToaster with the aim of integrating tree-based data structures
into the code used to run queries that may contain various and complicated aggregate
functions. After studying this kind of data structure in the first part of this
project, it made sense to put tree-based algorithms to work in order to obtain better
performance.
To put it simply, DBToaster is an SQL-to-native-code compiler: it generates specialized
query engines and is used in applications that deal with real-time, low-latency data
processing. Some of the features that distinguish it from other systems of this kind
are:
1) Compilation of database queries to low-level code
DBToaster does not need to interpret queries and data; instead it compiles SQL queries
to low-level code, eliminating the overheads resulting from interpretation.
2) Online query processing
DBToaster generates code that maintains a query result as an in-memory materialized
view, which is kept fresh as a stream of updates to the base data arrives.
3) Embedded query engines
Code that is generated by DBToaster can be linked into applications without a separate
runtime system.
4) Materialized views of nested queries
DBToaster is the only system that actually supports efficient materialized views of
nested SQL queries. Nested queries are vital to complex analytics.
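To illustrate what keeping a materialized view fresh under a stream of updates means, here is a hand-written C++ sketch for one fixed query. It is not code generated by DBToaster; the class SumByCustomerView, its methods and the example query are all hypothetical, and the sketch only shows the general idea of updating a stored result incrementally instead of recomputing it from the base table.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical materialized view for the query
//   SELECT customer, SUM(amount) FROM orders GROUP BY customer
// maintained incrementally on every insert and delete.
class SumByCustomerView {
public:
    void on_insert(const std::string& customer, long amount) {
        apply(customer, amount, +1);
    }

    void on_delete(const std::string& customer, long amount) {
        apply(customer, amount, -1);
    }

    // Current query result for one group; 0 if the group is absent.
    long sum(const std::string& customer) const {
        auto it = groups_.find(customer);
        return it == groups_.end() ? 0 : it->second.sum;
    }

    // Number of groups currently present in the result.
    std::size_t group_count() const { return groups_.size(); }

private:
    struct Group {
        long sum = 0;
        long rows = 0;  // how many base rows contribute to this group
    };

    // Each update touches exactly one group: O(log g) for g groups.
    void apply(const std::string& customer, long amount, long sign) {
        Group& g = groups_[customer];
        g.sum += sign * amount;
        g.rows += sign;
        if (g.rows == 0) groups_.erase(customer);  // no rows left in the group
    }

    std::map<std::string, Group> groups_;
};
```

A SUM can be maintained this way because a deletion simply subtracts what the insertion added. MIN and MAX are harder, since deleting the current minimum requires knowing the next smallest value, which is where an ordered tree of values becomes useful.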
Chapter 4. Tree-Based Data Structures in DBToaster
* When another MinStruct is added via the += operator, the two trees are merged by
iterating over the second tree and adding each pair (v, c) to the current tree (*this).
Suppose there are two trees, T1 and T2, and a pair (v, c2) from T2 needs to be added
to T1. If there is no element v in T1, the pair (v, c2) is inserted into T1. Otherwise,
if v is already present in T1 with count c1, its count is increased, that is, the pair
(v, c1 + c2) appears in the final tree T1. If the final count of an element is 0, that
element is deleted from the tree;
* Multiplying a MinStruct by a scalar 's' means multiplying the count of every element
in its tree by 's' (the keys do not change).
I used the std::map container from the C++ Standard Template Library, which is
typically implemented as a self-balancing binary search tree, and successfully compiled
queries that contained MIN/MAX operations.
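The two operations described above can be sketched with std::map as follows. Only the name MinStruct and the merge and scaling rules come from the text; the member names, the use of long for values and counts, the min() accessor and the handling of a zero scalar are assumptions made for the sake of a compact example, and this is not the code integrated into DBToaster.

```cpp
#include <map>

// Sketch of a MIN aggregate stored as an ordered multiset: value -> count.
struct MinStruct {
    std::map<long, long> counts;  // ordered, so the minimum is the first key

    // Merge: add each pair (v, c) of `other` to this tree and drop elements
    // whose final count is 0.
    MinStruct& operator+=(const MinStruct& other) {
        for (const auto& entry : other.counts) {
            auto it = counts.find(entry.first);
            if (it == counts.end()) {
                if (entry.second != 0) counts.emplace(entry.first, entry.second);
            } else {
                it->second += entry.second;
                if (it->second == 0) counts.erase(it);
            }
        }
        return *this;
    }

    // Scale: multiply every count by s; the keys are unchanged.
    // Assumption: scaling by 0 empties the structure, since all counts become 0.
    MinStruct& operator*=(long s) {
        if (s == 0) {
            counts.clear();
            return *this;
        }
        for (auto& entry : counts) entry.second *= s;
        return *this;
    }

    bool empty() const { return counts.empty(); }

    // Current minimum; must only be called when the structure is not empty.
    long min() const { return counts.begin()->first; }
};
```

In this sketch a deletion is expressed as a merge with a negative count, so removing the last copy of the current minimum erases its key and the next smallest value automatically becomes the new minimum.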
Also, the use of the ring abstraction made development more demanding, as I was aiming
to integrate new implementations into the existing software infrastructure. Last but
not least, using non-trivial tree-based data structures to compute queries efficiently
was a challenge in itself.
Future Work
There are certainly aspects of this work that can be improved or explored further, but
which could not be tackled in this project because of the time available this year.
In this chapter I present some modifications that would make the analysis more
relevant, as well as some ideas for improving the data structures discussed and for
extending the performance comparison in the future. I divide these ideas into three
parts: Implementation, Experimental and Integration.
When running the experiments following the YCSB framework, one and two threads were
used to obtain the results. Using 5, 10 or 20 threads would extend the analysis and
help draw better conclusions about how performance scales. Other relevant statistics
that could have been included in the analysis are cache misses and instruction counts.
Conclusions
In this project, I studied existing tree-based indexing data structures, as well as
other types of data structures, and experimentally evaluated their performance on
different types of workloads. The evaluation aimed to provide deeper insight into the
behaviour of these data structures in various situations. I have achieved this by
explaining, to the best of my knowledge, the results obtained in the process.
I then presented my work within the DBToaster software, where I used the research on
index data structures and my programming experience to help optimize different types
of queries and add new functionality to the software. I managed to provide
implementations of tree-based data structures that fit into the existing infrastructure
of DBToaster and that result in a significant improvement in the running time of
queries.
In Chapter 2 I start by introducing the concept of database indexing and why it is
useful. Then I introduce the main types of tree-based index data structures that I have
analysed in this project and present their structure, their usage and the reasons why
they are suitable in particular situations.
Chapter 3 includes results of some of the previous work in the area, together with the
analysis that I carried out in a similar environment. The aim has been to understand
which data structures are suited to a range of situations and how well they actually
perform, as well as to develop better ways of thinking about the design of index data
structures.
In Chapter 4 I present the work done on optimizing queries by using tree-based index
data structures and integrating their implementation into DBToaster’s existing
software. I have observed the difference in performance that these attempts make, and
I have succeeded in adding new functionality for computing queries.
Finally, in Chapter 5 I propose some ideas regarding the development of data
structures to support future workloads and software, discuss what else I could have
included in my analysis, and suggest ways to extend the analysis and continue this
project, should one wish to do so.
To conclude, in this project I have studied different tree-based index data structures,
analysed scenarios in which they can provide good performance, and shown results
obtained by running experiments. I compared these with some previous work done in the
area. I also did a considerable amount of work on DBToaster by writing and using
tree-based data structures to improve the performance of running various queries.
Appendix A. Code for queries operations using Rings