
NoSQL Database for Software Project Data

Anna Björklund

January 18, 2011
Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Ola Ågren
Examiner: Fredrik Georgsson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN

Abstract

The field of databases has exploded in the last couple of years. New architectures try to meet the need to store more and more data and new kinds of data. The old relational model is no longer the only way, and the NoSQL movement is not a trend but a new way of making the database fit the data, not the other way around. This master thesis report aims to find an efficient and well designed solution for storing and retrieving huge amounts of software project data at Tieto. It starts by looking at different architectures and trying three of them to see if any can solve the problem. The three databases selected are the relational database PostgreSQL, the graph database Neo4j and the key value store Berkeley DB. These are all implemented as a Web service and times are measured to find out which, if any, can handle the data at Tieto. In the end it is clear that the best database for Tieto is Berkeley DB. Even if Neo4j is almost as fast, it is still new and not as mature as Berkeley DB.


Contents

1 Introduction
  1.1 Paper outline

2 Modern Databases
  2.1 A brief history
  2.2 The CAP Theorem
      2.2.1 ACID v. BASE - two different ways of achieving partitioning
  2.3 Storing data today
      2.3.1 Column store
      2.3.2 Key value store
      2.3.3 Document store
      2.3.4 Graph database

3 The problem at Tieto
  3.1 The data
  3.2 The Questions
  3.3 The databases

4 The solutions
  4.1 PostgreSQL
      4.1.1 The questions
      4.1.2 Strengths and weaknesses
  4.2 Neo4j
      4.2.1 The questions
      4.2.2 Strengths and weaknesses
  4.3 Berkeley DB
      4.3.1 The questions
      4.3.2 Strengths and weaknesses

5 Results

6 Conclusions
  6.1 Future work

7 Acknowledgements

References

A Data from test runs
  A.1 Server times
  A.2 Client times

List of Figures
3.1 An overview of the data
3.2 An example of how a property is shared between different nodes
3.3 An example of a consistent tree
3.4 An example of an inconsistent tree
4.1 An example of how a shared property gets duplicated in Berkeley DB
5.1 Total time for the client
5.2 Client times for 10% of the data without PostgreSQL
5.3 Total time for the server
5.4 Client times for questions 1-6
5.5 Total time for the client for question 7 at different amounts of data
5.6 Client times for questions 8-13

List of Tables
2.1 An example of data organized in a table
4.1 The table returned by PostgreSQL for the inconsistent example
A.1 The server times for 0.1% of the data. Time in milliseconds.
A.2 The server times for 1% of the data. Time in milliseconds.
A.3 The server times for 10% of the data. Time in milliseconds.
A.4 The server times for 100% of the data. Time in milliseconds.
A.5 The server times for the hash table. Time in milliseconds.
A.6 Client times for all test runs at 0.1%. Time in seconds.
A.7 Client times for all test runs at 1%. Time in seconds.
A.8 Client times for all test runs at 10%. Time in seconds.
A.9 Client times for all test runs at 100%. Time in seconds.

Chapter 1

Introduction
Today there exist many different types of databases: not only the traditional relational SQL database but several other architectures designed to handle different types of data. From the 70s and into the new millennium the relational model was dominant, and almost all databases followed the same basic architecture. At the beginning of the new millennium developers started to realize that their data did not fit the relational model, and some of them started to develop other architectures for storing data in databases. When choosing a database today the problem is much more complex than deciding on a vendor for the relational database; the main problem is deciding which data storage architecture is best suited for the data. When that decision is made it is time to choose a vendor that meets the company's requirements regarding price, reliability and so forth. This paper will look at three different database solutions for software project data at Tieto Umeå and compare them. First a theoretical comparison is made, and then all three are implemented and tested to see which is fastest with the real data.

1.1 Paper outline

Chapter 2 begins with a brief history and then takes a deeper look at the different solutions for data storage that exist today. Chapter 3 takes a deeper look at the problem at Tieto, the data they have and how this data fits different architectures. Chapter 4 describes the three different solutions implemented and where the strengths and weaknesses lie in each solution from a theoretical point of view. Chapter 5 presents the results of the implementation with extra attention to performance and the specific requirements from Tieto. Chapter 6 addresses what is left to do and how Tieto can move forward with this.


Chapter 2

Modern Databases
2.1 A brief history

In the 70s databases were a growing field and there was some debate on how to organize the data. IBM developed System R, the first system to implement a Structured Query Language (SQL). System R is the foundation for many of today's popular DBMSs (Database Management Systems) [9]. The hardware of the 70s and 80s was much different from today's. Today processors are thousands of times faster, memory is thousands of times larger, and the main bottleneck is the bandwidth between disk and main memory. The main market for RDBMSs (Relational DBMSs) in those days was business data processing; today there are many different markets with completely different requirements. Yet another difference is the user interface: in the beginning there was a text terminal, today there is a graphical interface.

Despite the changes in requirements and hardware, the relational model remained the dominant one until the beginning of the new millennium. At that time developers started to think outside the box and realized that they had data that did not fit the relational model. Several started to develop different ways to organize their data depending on their specific needs. There were some products, but most of them were only available within one company and for a specific solution.

The phrase NoSQL was first used in 1998 as a name for a lightweight relational database that did not expose a SQL interface. In early 2009 it was reused by the organizers of an event to discuss open source distributed databases, as a reference to the naming convention of traditional relational databases such as MySQL and PostgreSQL. Today the expression is often read as "Not only SQL" and denotes the movement towards database solutions other than relational databases. The idea is not that relational databases are bad and wrong, just that in some cases the relational model is not enough. If the relational model fits the data then it is a good idea to use it, but if the data does not fit the relational model it is worthwhile to look at other types of databases. The two main disadvantages of RDBMSs are that they do not scale easily (the next section will show why) and that they often fail at capturing the relations between the data. Only a few years ago these problems were not such a big issue, but the amount of data stored today is vastly larger than only ten years ago. The continuing trends of cloud computing and the growth of social networks will only fuel the need for large data stores even more.


2.2 The CAP Theorem

To be able to discuss the different database solutions that exist today it is important to have an understanding of the CAP theorem [3]. The CAP theorem states that it is impossible for a Web service to guarantee all three of the following properties:

- Consistency: all clients have the same view of the same data at all times
- Availability: all clients can always read and write
- Partition tolerance: the system works well despite physical network partitions

All three are desirable for all Web services, but at PODC 2000 Brewer [3] made the conjecture that it is impossible to have all three; a Web service can at most choose two. In 2002 Gilbert and Lynch proved that Brewer was right for asynchronous networks, that is, networks where the nodes do not share a clock but have to rely on the messages sent between them. Since this is the case for most Web services, it has a major impact on the decision of the right model for storing data.

The CAP theorem states that any database solution can only fulfil two of the criteria and that it is up to the architecture to choose which two. Most relational databases promise consistency and availability, and this is good for smaller systems. If this is the main goal, the data fits the relational model and there are no requirements on uptime, then the relational model is a good choice. If there are requirements on uptime, or the data is massive, it might be necessary to partition the data between several nodes and compromise on one of the other two properties. One node can never guarantee a given uptime, and for some companies uptime is so important that they can tolerate a database that is inconsistent at times in order to guarantee availability. One important note is that this kind of inconsistency is not permanent; it only means that the database cannot guarantee that every node has the exact same picture of the data at all times. It does guarantee that all nodes will have the same picture at some time, just not all the time. This is referred to as eventual consistency: as the term implies, the database will be consistent at some time, but not at all times.

2.2.1 ACID v. BASE - two different ways of achieving partitioning

If a database needs to be physically partitioned then the CAP theorem states that it needs to give up either A (availability) or C (consistency). ACID (Atomicity, Consistency, Isolation, Durability) and BASE (Basically Available, Soft state, Eventually consistent) are two different ways of doing this. ACID and BASE are not databases but rather organising principles that give guidelines for how a database can operate to be as good as possible with respect to the third criterion. In 1981, Jim Gray [4] proposed how a partitioned database could guarantee consistency by making sure updates were done in transactions that followed some given guidelines. Today transactions are something natural and most databases support them; some even demand them for some types of updates. But in 1981 when Jim Gray reinvented transactions it was something new, and it is on that foundation most systems are built today. The properties of a transaction are:

- Consistency: a transaction only commits if it preserves the consistency of the database
- Atomicity: a transaction either commits or not; it acts as an atomic operation


- Durability: once a transaction is committed, it cannot roll back
- Isolation: no other transaction can see the events in a non-committed transaction

In the original paper there was no I; it was added in 1983 by Andreas Reuter and Theo Haerder [5] to form the acronym ACID. In 1998 Jim Gray was awarded the Turing Award¹ for "seminal contributions to database and transaction processing research and technical leadership in system implementation" [14]. Gray's rules guarantee that a database stays consistent even when partitioned, but according to the CAP theorem the price for this must be availability. Another problem with the rules is performance: keeping the data consistent has a real cost. As a solution to these problems Dan Pritchett of eBay [7] suggested that trading some consistency for availability can lead to dramatic improvements in scalability. His solution has the acronym BASE (Basically Available, Soft state, Eventually consistent) and uses functional partitioning to partition the data. Functional partitioning depends on the actual data stored, and for some systems this technique will not work well. It allows some data to be inconsistent between different partitions for some period of time and uses persistent messaging to make the data consistent at a later point. The main point is to allow some of the data to be inconsistent at some times, but not all the time; hence the "eventually consistent" part of the acronym.

The notion of having inconsistent data, even if only for a moment, is very scary to some computer scientists. The important thing is to choose which data to allow inconsistency on and partition the system accordingly. This is something we all come in contact with at some point in our lives; for example, when we pay with a credit card it takes a day or two before the payment can be seen on the bank statement. Another example is Amazon and their solution Dynamo [2]. Amazon has customers from all around the world and risks losing millions in revenue if customers cannot access the web store at all times. When it is night in one part of the world it is daytime in another, and millions of potential customers choose between Amazon and other online book stores. If Amazon tolerated downtime on parts of the store at any given time the word would spread, and they would risk losing reputation and customers. Because of this they tolerate that different nodes have different views of some of the data for short periods of time.

It is worth mentioning that not all non-relational databases operate in the same space of the CAP theorem, and there is no clear way of saying that a specific type of NoSQL database is in any specific area. Today there are several solutions that operate in different areas of the CAP theorem, and the same database can exist in different areas depending on configuration.
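To make the transaction concept concrete, here is a minimal JDBC sketch of an update that either commits in full or not at all. The account table and its columns are invented for illustration; they are not taken from any system discussed in this thesis.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Sketch: two updates grouped into one atomic transaction.
    public class TransferExample {
        static void transfer(Connection con, int from, int to, long amount) throws SQLException {
            con.setAutoCommit(false); // start a transaction instead of auto-committing each statement
            try {
                PreparedStatement debit = con.prepareStatement(
                        "UPDATE account SET balance = balance - ? WHERE id = ?");
                debit.setLong(1, amount);
                debit.setInt(2, from);
                debit.executeUpdate();

                PreparedStatement credit = con.prepareStatement(
                        "UPDATE account SET balance = balance + ? WHERE id = ?");
                credit.setLong(1, amount);
                credit.setInt(2, to);
                credit.executeUpdate();

                con.commit();   // durability: both updates become permanent together
            } catch (SQLException e) {
                con.rollback(); // atomicity: on failure, neither update is applied
                throw e;
            }
        }
    }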

2.3 Storing data today

Today there exist some famous non-relational database systems: Google's Bigtable, Amazon's Dynamo and Cassandra (used by Facebook and Twitter), to name a few. There are also several open source solutions of varying quality. There is no reason to choose an RDBMS and try to fit the data into it; instead there is money and time to be saved by choosing carefully and finding the model that fits the data best. The term NoSQL does not denote a specific type of database but can be divided into several different types of
1 The Turing Award is recognized as the highest distinction in computer science and the "Nobel Prize of computing".


non-relational databases that all have different characteristics and are suitable for different types of data in different situations.

2.3.1 Column store

Most data today is organized in tables; this is a model that suits lots of data and is easy for most humans to understand.

Name            Birth date   Address         Zipcode   City
John Svensson   1976-02-10   Nygatan 12      123 94    Örebro
Malin Olsson    1986-09-23   Storgatan 54    345 19    Göteborg
Ove Nykvist     1967-05-02   Hammargränd 2   735 12    Sundsvall

Table 2.1: An example of data organized in a table

The data cannot be stored in two dimensions on disk, since a disk is accessed sequentially. The traditional way an RDBMS organizes the data is in records that are placed continuously in storage. This row-oriented architecture gives fast writes and is called write optimized:

John Svensson, 1976-02-10, Nygatan 12, 123 94, Örebro
Malin Olsson, 1986-09-23, Storgatan 54, 345 19, Göteborg
Ove Nykvist, 1967-05-02, Hammargränd 2, 735 12, Sundsvall

This is optimized for systems that do lots of writes, but it does not work well for systems that handle few writes with lots of data in each write and lots of querying in between the writes. In that case a read-optimized system is better suited, and a way to achieve this is a column-oriented organization [10]. In a column store the data is stored by column instead, making it faster to read a particular column into memory and make calculations on all values in it:

John Svensson, Malin Olsson, Ove Nykvist
1976-02-10, 1986-09-23, 1967-05-02
Nygatan 12, Storgatan 54, Hammargränd 2
123 94, 345 19, 735 12
Örebro, Göteborg, Sundsvall

The columns of data are stored together, and when querying it is not necessary to read unimportant columns into memory, making some types of operations faster. One disadvantage of this type of storage is that it makes joins very time consuming, and some column stores do not support join operations on the data. Cassandra is one of the most famous of the wide column stores and is used by both Facebook and Twitter, though Twitter uses a slightly different configuration called Twissandra. Another famous one is Google's Bigtable.
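The difference between the two layouts can be sketched in a few lines of Java. This illustrates only the storage idea, not any particular product:

    // Row-oriented: one record object per row, all attributes together.
    class PersonRow {
        String name;
        String birthDate;
        String address;
        String zipcode;
        String city;
    }

    // Column-oriented: one array per attribute. A query that only needs
    // birth dates scans birthDates and never touches the other columns.
    class PersonColumns {
        String[] names;
        String[] birthDates;
        String[] addresses;
        String[] zipcodes;
        String[] cities;

        int countBornBefore(String isoDate) {
            int n = 0;
            for (String d : birthDates) {
                if (d.compareTo(isoDate) < 0) n++; // ISO dates sort lexicographically
            }
            return n;
        }
    }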

2.3.2 Key value store

A key value store stores anything as a key/value pair. The key is used to access the stored value, and the stored value can be anything. This may seem very simple, and it is, but only on the surface; the database engine handling the persistent data is often very advanced. The main advantage of this type of storage is that it is schemaless; in theory there


are no constraints on the key or the value. In practice the key is usually some primitive of the programming language (a string, an integer or something like that), and the value is usually an object of the implementing language or a string. The main advantage is the speed and ease with which data can be stored in a persistent way. It is easy to support physical partitioning, and most key value stores support the eventually consistent idea behind BASE. The main disadvantage is that inherent relations between data are lost, and since anything can be stored it is up to the client to interpret the data returned by the store. The most famous of the key value stores is Amazon's Dynamo, which was discussed earlier. An open source alternative to Dynamo is Voldemort [13]. Another key value store is Berkeley DB, which is one of the databases used in this paper; more on that will follow later.
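As an illustration, a toy in-memory store (not Berkeley DB's or Dynamo's API) shows how small the client-visible surface of a key value store can be, and why interpretation is left to the client:

    import java.util.HashMap;
    import java.util.Map;

    // Toy key value store: keys are strings, values are opaque bytes.
    // Real stores add persistence and partitioning behind the same interface.
    public class TinyKeyValueStore {
        private final Map<String, byte[]> data = new HashMap<String, byte[]>();

        public void put(String key, byte[] value) {
            data.put(key, value);
        }

        public byte[] get(String key) {
            // The store knows nothing about the value; the client
            // must decode the returned bytes itself.
            return data.get(key);
        }
    }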

2.3.3 Document store

A document store is a special kind of key value store; it does not store the document (the value) as an opaque mass of data, but uses information in the document to index the document. Because of this there are demands on the data in the document: it has to be structured in some way. This is usually accomplished with XML, JSON or something else that the database can understand. This allows queries on the data and not just on the keys, as is the case for a key value store. It also allows for a much more flexible solution than an RDBMS, since the database has no schema. There are no problems adding attributes to records after they have been inserted into the database, even if the attribute is something not even conceived of at design time. This makes the document store very flexible, something that is hard to achieve with a traditional RDBMS. Some famous document stores are Raven [8], MongoDB [1] and CouchDB [12].
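A sketch of the idea, assuming the MongoDB Java driver as it looked around 2010 (database and collection names invented): documents need no predeclared schema, and queries can reach into document fields rather than just keys.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.Mongo;

    public class DocumentStoreSketch {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost");
            DB db = mongo.getDB("example");
            DBCollection people = db.getCollection("people");

            // No schema: a later insert may add fields this document lacks.
            people.insert(new BasicDBObject("name", "John Svensson")
                    .append("city", "Orebro"));

            // Query on a field inside the document, not just on a key.
            DBCursor found = people.find(new BasicDBObject("city", "Orebro"));
            while (found.hasNext()) {
                System.out.println(found.next());
            }
            mongo.close();
        }
    }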

2.3.4 Graph database

In a graph database the data is stored as nodes and edges between the nodes. The nodes have attributes or properties and the edges have types or names. The data is extracted by traversing the nodes and edges in different ways. Some vendors include some way of indexing the nodes for easy access. The main advantage of this type of storage is the possibility to traverse the nodes and edges with well-known graph traversal algorithms. As with all these different ways of storing data, it will only work if the data fits the model. There will be problems if the data is tabular in nature with little or no relationship between nodes; this model will then work poorly and it would have been better to use another type of database.

The notion of storing data in something besides an RDBMS is nothing new; there have been such projects for as long as there have been computers. In the last years there has been an exponential growth of data, and the need to use something else than an RDBMS has grown with it.


Chapter 3

The problem at Tieto


The database today has two major problems: it is designed to handle any type of data, and scalability was not an issue for the designers. This became a big problem when the amount of data became much larger than anticipated at design time. One of the questions Tieto wanted answered is whether a more specific design of an RDBMS can help with scalability. At the same time they want to know if a different kind of database can do the job better. The choice of databases is described later; first a look at the data at Tieto.

3.1 The data

The content and the nature of the data is a company secret. Therefore this thesis will only give a general schematic picture of the data and will not use the correct names or labels. The easiest way of describing the data is as a graph. There are four different levels of nodes: A, B, C and D. There are no edges between nodes of the same level, only edges to nodes of the adjacent level. The information about the relationships lies entirely in the upper nodes: the C nodes have information about which D nodes they relate to, the B nodes have information about which C nodes they relate to, and so forth. A C node has no information about which B nodes it connects to. This is the nature of the data; in the implementations the relationship exists both ways. Because of this it is a requirement in all solutions that nodes are entered in the right order: if a B node is entered into the database without all C nodes it relates to already being there, the B node cannot be stored.

Figure 3.1 is a picture to help understand the organization of the data and to give an idea of how many nodes there are in each level. In total there are 4 million nodes, 40 million relationships between nodes and 100 million values. This is a simplification of the real data; in the real data there are some edges from A to C, the nodes also have some metadata attached to them, and each node has a predecessor. These properties were removed from the data for this thesis to make the implementation a little bit easier. Since the main goal of this thesis is to implement and test different databases and see if they can handle the amount of data, these simplifications should not make the result differ too much from reality. The picture is an overview of how the data is organized; the B and C nodes are very similar and have properties that they share, making the picture a little more complicated. Figure 3.2 is a small section of the big graph and an illustration of how one of the properties of the data in the nodes makes them connected.


Figure 3.1: An overview of the data

A node has on average one or zero connections to properties, but the number varies a lot and some nodes have up to ten. The nature of the graph (the number of edges between the different levels, the number of nodes and the number of properties shared between the nodes) differs a lot depending on where in the graph the calculations are made. This makes it harder to implement an optimal solution: an algorithm that works nicely in one part of the graph may be a catastrophe in another part. These implementations try to be as good as possible for the majority of the graph, but are not optimal for any one part of it.

Figure 3.2: An example of how a property is shared between different nodes
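Since the real labels are secret, the levels can only be sketched with invented names. The sketch below is purely illustrative; it shows the one-directional relationship information described above, where the upper node knows its children but not the other way around:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: each upper node knows its children in the level
    // below; a C node does not know which B nodes point to it.
    class DNode { String id; }
    class CNode { String id; List<DNode> dChildren = new ArrayList<DNode>(); }
    class BNode { String id; List<CNode> cChildren = new ArrayList<CNode>(); }
    class ANode { String id; List<BNode> bChildren = new ArrayList<BNode>(); }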


3.2 The Questions

The questions asked of the database can be divided into two different types. The first type is a simple "get a set of nodes with a given property". The properties are one or two, and all nodes returned must have the given properties. In PostgreSQL a typical question is SELECT * FROM table WHERE table.a = X AND table.b = Y. This type is called the simple questions without connections, and there are nine of them in this solution. The other type is a set of more complicated questions where the connections between the nodes are explored. These questions are:

- Return the sub tree of a given A
- Return all B that connect to a given C
- Check the difference between two A; return all nodes unique to A1, all nodes unique to A2 and all nodes A1 and A2 have in common, in three different lists
- Check if the tree under an A is consistent and, if it is inconsistent, return how it is inconsistent

A sub tree under an A is inconsistent if two C nodes have the same value in a specific attribute but are not the same node. For all cases of inconsistency the database returns the value of the attribute and all B-C pairs where the C node has that value of the attribute. If the sub tree is consistent the question only returns true. To illustrate this, see Figures 3.3 and 3.4. In Figure 3.3 the sub tree of A is consistent, but in Figure 3.4 there are two C nodes that have the same value a=4 without being the same node. In this case the question will return a=4 and a list of B-C pairs: B2-C2, B3-C4 and B4-C4.

Figure 3.3: An example of a consistent tree

Note that the attribute in question is a very specific attribute and that it is only within the sub tree of a given A that this is interesting. In two different sub trees there are some C nodes with the same value that are not the same C node, but this is permitted. It is only within the sub tree of one A that all C nodes with the same value in the attribute must be the same node. How and why inconsistencies occur is too closely connected to the nature of the data to be revealed here. They do occur at some points, and the solution at Tieto today cannot tell in which nodes the problem lies; it only answers true or false depending on whether the sub tree is consistent or not.


Figure 3.4: An example of an inconsistent tree

3.3 The databases

One of the databases chosen was PostgreSQL, a free open source RDBMS that is known to have good performance. One major advantage was also that Tieto already had it installed for other testing purposes. Because of the nature of the data, one of the databases is a graph oriented database, and the choice was Neo4j, a Swedish open source graph database. One major advantage is also that it is written in Java and well documented. The last database was a choice between a document store and a key value store, because they are very different in their architecture from the other two, and it is interesting to see if they are as good as they should be on the simple requests and how bad they are on the more complicated questions. Several were considered, but the choice fell on Berkeley DB, a key value store that had recently been rewritten in Java. It previously only existed in C, with a library that could call the C version from Java, but with the new version the cost of inter-language translation is avoided. Berkeley DB is also well documented and has several examples.

Chapter 4

The solutions
The solution is implemented in Java 1.6 as a Web service. This was a requirement from Tieto, as it makes it easier to integrate the solution into their existing systems. The three different data stores implement the same interface and are therefore interchangeable without any changes to the Web service methods. Since this is intended as an internal system the data source is trusted, and there is no protection against malicious data; the only protection that exists is against faulty data. This may seem a bit strange, and if any of these databases are integrated into Tieto's existing systems they need some work in this area. The main reason for this limitation is time: the time for this thesis is limited, and it was decided that it was better to get as much functionality as possible instead of spending time on error handling for something that may be discarded. All three solutions operate in the same space of the CAP theorem; none of them are physically partitioned. Both Neo4j and Berkeley DB can be partitioned should the need arise.
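The shared interface is not public, but its shape can be sketched. All names below are hypothetical stand-ins for the real methods and types:

    import java.util.List;

    // Hypothetical sketch of the interface the three stores implement;
    // Node stands in for the real data carrier class.
    class Node { String id; }

    interface ProjectDataStore {
        void store(Node node);                               // nodes must arrive children-first
        List<Node> findByProperty(String key, String value); // the nine simple questions
        List<Node> subTree(String aId);                      // question 10: sub tree of a given A
        List<Node> parentsOf(String cId);                    // question 11: all B connecting to a C
        List<List<Node>> diff(String aId1, String aId2);     // question 12: unique, unique, common
        boolean isConsistent(String aId);                    // question 13 (simplified signature)
    }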

4.1 PostgreSQL

The PostgreSQL solution is implemented on version 8 and uses the java.sql.* library to communicate with the database. The main table structure is straightforward: one table for each of the levels of nodes, with a serial id field, and three tables containing the relationships between the different layers of nodes. The property described in Section 3.1 is also in its own table, with a join table that tells which property belongs to which node. This is not the structure of Tieto's current solution; this is a new design that tries to be as good as it can be for this version of the data.

One of the main problems with PostgreSQL can be found in this design: there are approximately 18 million rows in the table joining B and C nodes. All questions regarding the relationship between the B and C nodes will be costly no matter how they are done. When asking for a specific B it is necessary to query over this table, because one important part of a B is which C nodes it connects to. The information about these connections must be in the database, and other ways of achieving this would have other problems. One way would be to let the table for B nodes contain this information, but the number of connections differs widely between different B nodes. This makes it hard to have any other solution than the one chosen. Pruning the data and discarding some of the nodes is not an option; all data is still relevant in one way or another. Another possibility would be if the graph could be divided into several smaller sub graphs; then the table could be split into several smaller tables. This is not possible for this data, not even when looking at only the B and C levels, so this cannot be done without having more than one copy of some of the nodes.


a    C.id   B.id
3    C1     B1
3    C1     B2
4    C2     B2
4    C4     B3
4    C4     B5
2    C3     B3
2    C3     B4

Table 4.1: The table returned by PostgreSQL for the inconsistent example

The logic of this solution strives to make as few queries to the database as possible without having to loop over the resulting rows more than once. The overhead of querying the database is something to consider, and it is worth a little more logic in the program to not have to query the database more than once for each question. This is the only solution that requires anything other than a Java library, since the PostgreSQL server runs independently of the Java program. In the test runs the PostgreSQL server was run on the same computer, thus eliminating the time it would take to transfer the data over a network.
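A sketch of the kind of schema described, with invented table and column names (the real names are not public). The DDL is executed through JDBC, matching how the solution talks to the server:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Hypothetical schema sketch: one table per node level plus join
    // tables for the relationships between adjacent levels.
    public class SchemaSketch {
        static void createTables(Connection con) throws SQLException {
            Statement st = con.createStatement();
            st.executeUpdate("CREATE TABLE a_node (id SERIAL PRIMARY KEY, label TEXT)");
            st.executeUpdate("CREATE TABLE b_node (id SERIAL PRIMARY KEY, label TEXT)");
            st.executeUpdate("CREATE TABLE c_node (id SERIAL PRIMARY KEY, label TEXT, a_value TEXT)");
            st.executeUpdate("CREATE TABLE d_node (id SERIAL PRIMARY KEY, label TEXT)");
            // The b_c table is the one with roughly 18 million rows.
            st.executeUpdate("CREATE TABLE a_b (a_id INT REFERENCES a_node, b_id INT REFERENCES b_node)");
            st.executeUpdate("CREATE TABLE b_c (b_id INT REFERENCES b_node, c_id INT REFERENCES c_node)");
            st.executeUpdate("CREATE TABLE c_d (c_id INT REFERENCES c_node, d_id INT REFERENCES d_node)");
            st.close();
        }
    }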

4.1.1 The questions

In this section the more interesting solutions will be described in some detail. The first nine questions are simple selects with some joins for retrieving the data; they are not particularly interesting or special. The queries are written to allow the query planner in PostgreSQL as much freedom as possible, since it probably is better at the planning than the author of this paper. Questions 10 and 11 do nothing special; they only return the sub tree or the list of B nodes and do nothing unexpected. Question 12 gets the unique labelling strings for the two different sub trees of the different A nodes. It then uses Java's set operations with a hash set on the two sets of strings to get the three different subsets. Given the results of the test runs this is probably not the optimal way of doing this, even though Java is good at hashing strings. If any more work is done on this solution, this question is definitely worth looking at and implementing a better solution for.

The last query determines whether the sub tree of an A is consistent or not; for examples of this see Figures 3.3 and 3.4. This was really hard to implement, and the final solution is one that uses PostgreSQL for the most part and some Java for the final logic. The query to PostgreSQL returns Table 4.1 for the previous inconsistent example. This is sorted primarily on the value of a and secondarily on C. The Java program then uses the following algorithm to filter out the rows that do not contain an inconsistent a-value. The return structure is not in row format but contains the same information, organized in a slightly different way. Note that before the first row is processed, the variables keeping track of the previous row are set to empty strings and therefore hold a value but match nothing from the database.

1 Fetch the values for the new row
2 Is this a-value the same as in the previous row?
  2.1 Is this C the same as in the previous row?
    2.1.1 Keep C and D in a structure


  2.2 Else (the C is different, meaning an inconsistent tree)
    2.2.1 Set this a-value as inconsistent
    2.2.2 Keep the C and B nodes in a structure
3 Else (this is not the same a-value as in the previous row)
  3.1 Is the previous a-value inconsistent?
    3.1.1 Set the return structure as inconsistent
    3.1.2 Save all data in the return structure
  3.2 Else (the previous a-value is consistent)
    3.2.1 Discard all saved data
4 Set this row as the previous row
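The scan can be sketched in Java as follows, assuming the rows arrive sorted on the a-value and then on C, with hypothetical column labels. This simplified version only collects the inconsistent a-values; the real solution also keeps the B and C nodes for the answer:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the post-processing pass over the sorted (a, C.id, B.id)
    // rows; reports every a-value seen with more than one distinct C node.
    public class ConsistencyScan {
        static List<String> inconsistentValues(ResultSet rs) throws SQLException {
            List<String> inconsistent = new ArrayList<String>();
            String prevA = "", prevC = "";
            boolean currentInconsistent = false;
            while (rs.next()) {
                String a = rs.getString("a");
                String c = rs.getString("c_id");
                if (a.equals(prevA)) {
                    if (!c.equals(prevC)) {
                        currentInconsistent = true; // same a-value, different C node
                    }
                } else {
                    if (currentInconsistent) inconsistent.add(prevA);
                    currentInconsistent = false;    // a new a-value group starts
                }
                prevA = a;
                prevC = c;
            }
            if (currentInconsistent) inconsistent.add(prevA); // flush the last group
            return inconsistent;
        }
    }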

4.1.2 Strengths and weaknesses

The PostgreSQL implementation was the first one, since this was the database most familiar from previous experience. In the end this database proved to be the hardest and most time consuming to implement. The amount of code needed to handle calls to the database, exceptions and similar things is massive. Almost all exception handling simply prints an error message on stderr and moves on, or returns false, since there is no use spending time implementing fancy error handling for something that may be discarded shortly. All changes to the database are made in transactions, and time and effort were spent on making sure the data in the database did not get corrupted. One problem with PostgreSQL and other SQL databases is that the programmer needs to be good at SQL to be able to write the queries, set up the tables and similar things. There is a big hurdle to get over in order to do things nicely and efficiently.

4.2 Neo4j

Neo4j is a graph database and as such it uses nodes and edges to store data. A node can have several properties and several edges, or relationships, to other nodes. A relationship must be of a specific type, and it can also have several properties, just like the nodes. A property is a key, which is a string, and a value; the value must be one of Java's primitive types, a string, or an array of primitives or strings. The data is retrieved from the database by traversing the graph. Since the data is in a graph structure, there was no need to think of any other structure for storing it. All attributes in a node are stored as properties, and the relationships are set as the nodes get entered into the database. Relationships are set both ways, so between two nodes there are two relationships with different types, one going up and one going down. This helps with the traversal of the tree, making sure that only nodes in the right direction get traversed.

To index the nodes this solution uses the LuceneIndexService, which is closely integrated with the database but not a part of it. There is no indexing in the graph engine, but this semi built-in index service uses Lucene as backend and is as close as it gets to an integrated index. Neo4j is not intended as a key value store, and therefore indexing is not a priority; the main way of finding the right nodes should be by traversing the graph with different algorithms. Because the indexing is separate from the database it is possible to


remove a node from the database and still have it indexed. The index service may still return the node, but it will not have the correct properties set, and when asked for the value of the property an exception will be thrown. The first time this happened it was a somewhat hard error to find: the database said that the property was not set, but the problem was that the node was still indexed by the property although removed from the database. It is vital that all indices are removed when a node is removed from the database. The only time nodes are removed from the database in this solution is when all data is removed for testing purposes.

This solution uses version 1.1 of Neo4j. The new version 1.2 came out in December 2010; the implementation was finished in November 2010 and has not been checked against the new version of the database. One of the main differences is how the indexing is handled. If any future work is done on this solution, one of the first steps should be integrating the new way of indexing. More information on this can be found on Neo4j's web page [11].

4.2.1 The questions

The first nine questions are handled by the index service. All nodes that are put in the database get indexed on the different properties that are needed for this. When the correct node has been found, the information is moved from the node to the Java object that gets returned. A new object is created for every node that is returned; it is not the same object that was put in the database, but it has the same information.

The more complex questions use the graph structure of Neo4j to return the correct nodes. Question 10 gets the correct A from the index service and then simply traverses the sub tree and returns the nodes. Neo4j makes this traversal very easy; it is possible to ask for every node that is at a maximum depth from this node and that can be reached through the correct relationship type. The same is true for question 11: the correct C node is found by asking the index service for the node and then asking the database for all its neighbours with the correct relationship. Question 12 uses Java's hash sets in the same way as PostgreSQL to calculate the different sets from the given sub trees. As previously mentioned this is probably not the best way, and a different algorithm should be considered if any more work is done on this database.

The consistency check really uses the graph properties of this database. It begins by making a depth first search with a maximum depth of two. For each C node it saves the a-value and the C node in a hash table, with the a-value as key and the C node put in a list. When all nodes are traversed, the program goes through the hash table searching for a-values that have more than one node in their list. If such a list is found, the sub tree is inconsistent. To get the right B nodes, all B nodes are examined and the ones with the correct A are stored in the return structure. In most cases a sub tree will be consistent, and for those cases this should be faster than Berkeley DB, since it really uses the nature of the graph and only does the extra work when it is needed.
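As a sketch of the depth-limited traversal, assuming the Neo4j 1.x core traversal API as described in its documentation (the relationship type name here is invented):

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.ReturnableEvaluator;
    import org.neo4j.graphdb.StopEvaluator;
    import org.neo4j.graphdb.TraversalPosition;
    import org.neo4j.graphdb.Traverser;

    // Sketch: visit at most two levels below an A node along a
    // hypothetical CHILD relationship, as in the consistency check.
    public class SubTreeTraversal {
        static Iterable<Node> twoLevelsDown(Node aNode) {
            return aNode.traverse(
                Traverser.Order.DEPTH_FIRST,
                new StopEvaluator() { // stop expanding below depth two
                    public boolean isStopNode(TraversalPosition pos) {
                        return pos.depth() >= 2;
                    }
                },
                ReturnableEvaluator.ALL_BUT_START_NODE,
                DynamicRelationshipType.withName("CHILD"),
                Direction.OUTGOING);
        }
    }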

4.2.2 Strengths and weaknesses

It was fairly easy to start implementing Neo4j. There are some examples on the web site and really good API documentation for all of the classes. The only major problem was described above: if an index was not removed, it showed up as really strange behaviour whose root cause was hard to find. A completely missing piece of functionality is the ability to truncate the entire database. Since this was a test, there was a need to store data, perform the tests and then remove all data to make the database clean for the next test. This is very easy in both


PostgreSQL and Berkeley DB, but no easy solution was found for Neo4j. Neo4j is still young, and the change of indexing from version 1.1 to 1.2 shows that big changes are still being made; both PostgreSQL and Berkeley DB are older and more mature products. Neo4j requires really fast disks or huge amounts of memory to work well. The Linux machine used in the development for this thesis had some real problems with speed; the test runs were done on a solid state drive, and Neo4j needs better hardware than the other two databases. Because the data is organized as a graph, there are several graph algorithms that can be used to solve various problems. A graph is easy to understand, and most data with lots of relationships is naturally described as a graph. There exist programs that allow for a graphical presentation of the data in the database, but none were tested by the author of this thesis.

4.3 Berkeley DB

Berkeley DB is a key value store that was originally written in C but now has a completely rewritten version in Java. Berkeley DB has been owned by Oracle since 2006. Berkeley DB stores any Java class that is set to be persistent. It uses annotations to mark a class as persistent or as an entity class, and to mark the members of the class that are primary and secondary keys. The secondary keys have four different ways of relating to other instances of the same class. An example will clarify this:

    @Entity
    class ExampleClass {
        @PrimaryKey
        long id;

        @SecondaryKey(relate=ONE_TO_ONE)
        int ssn;

        @SecondaryKey(relate=MANY_TO_ONE)
        String name;

        @SecondaryKey(relate=ONE_TO_MANY)
        String[] email;

        @SecondaryKey(relate=MANY_TO_MANY)
        String[] family;
    }

ONE_TO_ONE says that the value is unique for every instance in the database. A primary key is of this type, but it is unusual for secondary keys. MANY_TO_ONE means that this instance only has one value but may share it with several other instances of the class. If an instance may have many values but no other instance may share any of them, the relation type is ONE_TO_MANY, and if an instance can have many values that it shares with others, the relation type is MANY_TO_MANY. For more information on implementation details, see the API [6].

Since Berkeley DB requires all objects to be stored to be set as persistent, the information has to be moved from the original object coming in to a Berkeley object that looks the same except for the Berkeley-specific annotations. When returning an object from the database


the reverse is done to make sure the correct type of object gets returned. When doing this, some information about relationships between the nodes is lost. In Figure 3.2 there is a shared property between the nodes; in Berkeley DB each node will get its own copy of this property, and when returned the nodes will have distinct but identical copies of it, see Figure 4.1.

Figure 4.1: An example of how a shared property gets duplicated in Berkeley DB
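For completeness, a sketch of how a store is opened and the entity class from the example above is accessed through its primary index, assuming the Berkeley DB Java Edition persistence API [6] (the environment path and store name are invented):

    import java.io.File;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.persist.EntityStore;
    import com.sleepycat.persist.PrimaryIndex;
    import com.sleepycat.persist.StoreConfig;

    // Sketch: open an environment and a store, then read and write
    // entities through the primary index.
    public class StoreSketch {
        public static void main(String[] args) throws Exception {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("dbEnv"), envConfig);

            StoreConfig storeConfig = new StoreConfig();
            storeConfig.setAllowCreate(true);
            EntityStore store = new EntityStore(env, "ExampleStore", storeConfig);

            PrimaryIndex<Long, ExampleClass> byId =
                store.getPrimaryIndex(Long.class, ExampleClass.class);
            // byId.put(obj) stores an entity; byId.get(42L) fetches one by key.

            store.close();
            env.close();
        }
    }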

4.3.1 The questions

As with the others, the first nine questions were easy once the indexing part was understood. If a Berkeley DB object could be returned directly, the search methods would consist of only one line of code.

Returning a sub tree requires first getting the right A node, then looping over all its B nodes, getting each from the database and finding all C nodes it connects to, and then doing the same for all C nodes to get all D nodes. This means that several nodes may get visited more than once, which is not optimal. If a get from the database is costly and there are lots of C nodes that belong to different B nodes in the same sub tree, this approach will be expensive. But if there are only a few such nodes, it costs more to keep track of which nodes have already been explored than to let them be explored once more. Therefore this solution uses the naive approach and lets the same C node get explored several times.

Getting down the tree is fairly easy with Berkeley DB, as with both the other databases. In the other two there is also information on how to get up the tree, but this information does not exist in Berkeley DB. Information on which B nodes a C node belongs to is only stored in the B nodes, and therefore the search must be done among the B nodes. The search is not hard, since the id of the C node is set to be a secondary index, so it is simple but possibly time consuming. The difference between two sub trees is handled in much the same way as in the other databases; the difference here lies in the speed of the initial search, since it is done twice.

The inconsistency check was hard to do in Berkeley DB. In Neo4j it was possible to store only the C nodes and the a-value and then retrieve the information on how the sub tree is inconsistent. In Berkeley DB this would be much harder, since there is no link from a C node to its B nodes and on to the A node to make sure only the correct B nodes get returned.


The supervisor at Tieto, Anders Martinsson, made a solution for his hash table, and that solution seemed to be a good approach: it stores all the information needed as it makes a depth first search of the sub tree. In Neo4j only the a-value and its corresponding C node are saved, and there it is possible to retrieve the information about the B nodes without having to search the whole tree again. In Berkeley DB this is not possible, so all information needs to be saved as the initial search proceeds. The hash table has the structure Hashtable<String, Hashtable<String, List<String>>> to keep track of all nodes visited so far. The outer hashtable has the a-value as key and a hashtable as value. The inner hashtable has the C node as key and a list of B nodes as value. When the initial search is done, it is simply a matter of looking at the size of each of the inner hashtables to see if any of them has more than one entry. If so, that a-value occurs in more than one C node and the tree is inconsistent.
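A sketch of the final check over that structure:

    import java.util.Hashtable;
    import java.util.List;
    import java.util.Map;

    // Sketch of the structure described above: a-value -> (C node -> B nodes).
    // After the depth first search, any inner table with more than one
    // C node marks its a-value as inconsistent.
    public class InconsistencyCheck {
        static boolean isInconsistent(
                Hashtable<String, Hashtable<String, List<String>>> seen) {
            for (Map.Entry<String, Hashtable<String, List<String>>> e : seen.entrySet()) {
                if (e.getValue().size() > 1) {
                    // e.getKey() is the offending a-value; e.getValue() holds
                    // the C nodes and their B nodes to report to the client.
                    return true;
                }
            }
            return false;
        }
    }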

4.3.2 Strengths and weaknesses

Getting started on the implementation was a little bit harder than with Neo4j, but after the initial hurdle was cleared there were few problems getting the code to work. When first tested on the development machine, the first reaction was that it was really fast. Those tests were only meant to test the functionality, but even then there was a clear difference in the speed of the test program. The author of this paper has tried to find a graphical program to view the data in the database and handle it manually. No such program has been found, but it would be good if one existed.


Chapter 5

Results
The test program was developed by Anders Martinsson, my supervisor at Tieto, and all test runs were done by him on his computer. The test runs are made at four different percentages of the data: 0.1%, 1%, 10% and 100%. Each question was run 2048 times, except for Neo4j and PostgreSQL at 100%: they took so long that it was not possible to run that many tests, so Neo4j had 512 and PostgreSQL 256 tests per question. Time was measured at both client and server; the client measured the wall clock time for all test runs in milliseconds, and the server measured each question's time in nanoseconds. In total there were almost 500 000 times to analyse at the end of all test runs. The server times will not be presented in full here; only the mean, the median and a trimmed mean will be presented. The trimmed mean has 5% cut off at each end to get rid of any extreme values. All times are in tables in Appendix A. Note that in the listings of client times for 100%, the numbers for Neo4j and PostgreSQL are multiplied by 4 and 8 respectively to give an accurate picture for comparison.

This is maybe the most important result of them all: the fact that the total time for running the entire test set was so high that it could not be completed for Neo4j and PostgreSQL is very telling of which database is fastest for the large amount of data. The test program was developed at the same time as the databases, and to test it Anders designed a hash table to handle the data. This is not a persistent database, but for comparison test runs were done with it as well for the three smaller data sets.

First a look at the client times for the different questions at the different percentage levels. Note that the scale on the x-axis is logarithmic, not linear. It can be seen in Figures 5.1(a) and 5.1(b) that PostgreSQL has a high overhead cost since it needs to call an external database; for that amount of data there can be no other explanation as to why the cost is so much higher. The impact of this overhead should decrease as the amount of data increases. Neo4j struggles with some of the questions when it comes to 100% of the data, but Berkeley DB is still quite fast. Neither Neo4j nor Berkeley DB can keep all the information in memory, and they need to read from disk. This seems to affect Neo4j more than Berkeley DB, even though they both keep their data on the SSD. If the client times are divided by the number of test runs, Berkeley DB completes question 12 in less than 5 seconds, Neo4j needs almost 25 seconds, but PostgreSQL needs as much as 115 seconds, or almost 2 minutes. Even for a question asked once a day, 2 minutes is a very long time to wait for an answer. Another thing that is obvious from these graphs is that the hardest of the more complex questions seem to be question 12, the difference between two sub trees, and question 10, returning a sub tree of A.

Figure 5.1: Total time for the client. (a) 0.1% of the data. (b) 1% of the data. (c) 10% of the data. (d) 100% of the data.

When comparing the times, for 10% of the data the time for question 12 is almost double that of question 10, which means that the time spent calculating the sets is almost nothing compared to the time spent getting the sub trees. For 100% the time is still only slightly more than double the time for getting one sub tree. One interesting fact is that Berkeley DB and Neo4j can almost keep up with the hash table. One of the main reasons for this is probably the speed of the solid state drive; with a slower hard drive this would probably not be possible, especially for Neo4j. In Figure 5.2 the client times for Neo4j, Berkeley DB and the hash table are plotted for 10% of the data. For question 9 the hash table appears to be the slowest, but only just.

On the server side the times plotted are the trimmed means, because they are generally somewhere in between the mean and the median. The graphs are almost the same; PostgreSQL is the slowest for almost every question. One thing that becomes apparent from these figures is that some of the simple questions are not so simple after all. A look at question 7 in Figure 5.5 gives very interesting results. The maximum seems to be when only 1% of the data is tested, except for Neo4j, which has a top at 100% where the others actually go down in time. There are some explanations for this, but the most likely is that the data scales badly for this example: if the original data contains 20 unique values of the attribute in question, that drops to 0.2 for 1% and is rounded to 1, which means that a much larger portion of the data is returned than for 10%. A closer look at the exact behaviour of the data and the databases for this question would be interesting, but it probably does not influence the final result.


Figure 5.2: Client times for 10% of the data without PostgreSQL


Figure 5.3: Total time for the server. (a) 0.1% of the data. (b) 1% of the data. (c) 10% of the data. (d) 100% of the data.

Figure 5.4: Client times for questions 1-6. (a) Question 1. (b) Question 2. (c) Question 3. (d) Question 4. (e) Question 5. (f) Question 6.

Figure 5.5: Total time for the client for question 7 at different amounts of data

Figure 5.6: Client times for questions 8-13. (a) Question 8. (b) Question 9. (c) Question 10. (d) Question 11. (e) Question 12. (f) Question 13.


Chapter 6

Conclusions
In this paper a question was asked: is there any database that can handle the amount of data that Tieto has, and how good can it get? To answer this question it was necessary to take a look at what architectures exist today and find some different databases that could do the job. Based on the nature of the data it became clear that one should be a graph database, and Neo4j was chosen because it is considered to be one of the best on the market. The choice of relational database was based on the fact that it already existed in the company and that it is one of the fastest relational databases available. A third and completely different database was needed, and Berkeley DB is a key value store that had all the qualities. All three were implemented as a Web service and their performance was measured.

The results were pretty clear: the database that is best at handling the data is Berkeley DB. This was a surprising result, especially that it was faster even for the questions that are closely connected to the graph structure of the data. Even though the data fits the graph model best, Neo4j was simply not fast enough to be able to use its advantage. There is also the problem with the maturity of the product: the first version of Neo4j was released in February 2010, version 1.1 half a year later, and in December 2010 yet another version. Neo4j needs time to mature and become a more stable product before it suits companies such as Tieto.

6.1 Future work

The results are promising, and continued development of the Berkeley DB part of the solution is definitely worthwhile. Even though a solution exists today it is not optimal, and at the rate the data is growing Tieto may find itself in trouble a lot faster than it anticipates. The real data has some properties that were excluded from this first test to make the task a little easier. A good first step would be to identify these and implement them as well, to see if the results still hold.


Chapter 7

Acknowledgements
I would like to start by thanking my supervisor at Tieto, Anders Martinsson, for all his support and help. Without him this master thesis would not have existed, since the whole thing was his idea. I also thank everyone at Tieto's office in Umeå for making my workdays a pleasant time. I thank my internal supervisor at the Department of Computing Science at Umeå Universitet, Ola Ågren. Last but not least I thank my husband for everything he has done to support me throughout the entire master thesis project.



Appendix A

Data from test runs


Here is the raw data from the test runs. The server times are rounded to 4 significant digits. The values for the hash table did not fit in the table with the others and are therefore given in their own table.
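Each table reports a mean, a trimmed mean ("trim") and a median of the measured times. As a rough illustration, the sketch below computes these three statistics for a set of timing samples; note that the trim fraction is not stated in the text, so the 5% per tail used here is an assumption.

    import java.util.Arrays;

    public class TimingStats {
        // Plain arithmetic mean of the samples.
        static double mean(double[] samples) {
            double sum = 0;
            for (double v : samples) sum += v;
            return sum / samples.length;
        }

        // Trimmed mean: sort, drop the lowest and highest 'trimEach' fraction,
        // and average what remains. NOTE: 0.05 per tail is an assumed fraction.
        static double trimmedMean(double[] samples, double trimEach) {
            double[] s = samples.clone();
            Arrays.sort(s);
            int cut = (int) (s.length * trimEach);
            double sum = 0;
            for (int i = cut; i < s.length - cut; i++) sum += s[i];
            return sum / (s.length - 2 * cut);
        }

        // Median: middle element, or the average of the two middle elements.
        static double median(double[] samples) {
            double[] s = samples.clone();
            Arrays.sort(s);
            int n = s.length;
            return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
        }
    }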

A.1 Server times
       PostgreSQL                  Neo4j                        Berkeley DB
       mean     trim     median    mean     trim     median     mean     trim     median
Q1     13.41    13.14    12.9      0.1889   0.1657   0.1647     0.1363   0.114    0.1078
Q2     12.38    12.37    12.36     0.09073  0.09009  0.09068    0.05806  0.05725  0.05615
Q3     0.7108   0.7051   0.7035    0.0648   0.06052  0.06033    0.04722  0.04666  0.04629
Q4     82.43    81.49    68.1      3.679    3.648    2.709      3.36     3.295    2.403
Q5     53.41    53.37    53.36     8.909    8.704    8.457      9.297    9.134    8.978
Q6     298.4    297.7    297.4     18.74    18.59    18.6       16.6     16.29    15.98
Q7     326.9    326.2    326       27.47    27.22    27.29      25.53    25.16    24.6
Q8     27.97    27.89    28.13     3.619    3.575    3.563      0.421    0.4028   0.4147
Q9     32.83    32.74    32.53     0.8853   0.8656   0.8562     0.813    0.7952   0.7852
Q10    29.17    29.1     28.86     0.3765   0.3518   0.346      0.3681   0.3437   0.3003
Q11    5.288    5.285    5.408     0.06893  0.06775  0.0664     0.2815   0.2452   0.1964
Q12    59.67    59.52    59.18     0.5257   0.4977   0.5086     0.5271   0.4973   0.5486
Q13    1.722    1.715    1.713     0.09877  0.09506  0.09334    0.3445   0.3281   0.2868

Table A.1: The server times for 0.1% of the data. Time in milliseconds.


       PostgreSQL                  Neo4j                        Berkeley DB
       mean     trim     median    mean     trim     median     mean     trim     median
Q1     18.8     18.58    18.29     0.1265   0.1246   0.1218     0.06133  0.06115  0.05767
Q2     17.91    17.91    17.9      0.08418  0.08166  0.08158    0.03206  0.03178  0.03149
Q3     0.7133   0.7109   0.7107    0.06575  0.06573  0.0664     0.0218   0.02161  0.02163
Q4     107.1    105.6    105.1     3.795    3.734    3.668      3.381    3.267    3.075
Q5     161.3    161.9    163.1     14.96    14.99    15.09      17.13    17.09    17.01
Q6     631.4    631.4    639.3     31.5     31.47    31.89      25.94    25.8     25.94
Q7     1044     1061     1092      85.15    86.93    105.3      105.9    106.3    120.9
Q8     26.76    26.91    27.09     29.22    30.14    33.4       0.4238   0.4219   0.4153
Q9     65.11    65.04    64.87     1.024    1.018    1.016      0.8452   0.8292   0.8298
Q10    160.2    159.7    159.9     4.833    4.811    4.807      3.21     3.119    3.099
Q11    5.266    5.239    5.225     0.1018   0.09598  0.08651    0.9761   0.8173   0.6008
Q12    319.7    319.2    318.2     9.418    9.499    9.53       6.97     6.929    6.206
Q13    3.434    3.429    3.429     0.3105   0.3028   0.3035     8.692    8.54     8.136

Table A.2: The server times for 1% of the data. Time in milliseconds.

       PostgreSQL                  Neo4j                        Berkeley DB
       mean     trim     median    mean     trim     median     mean     trim     median
Q1     61.75    61.62    61.54     0.5108   0.4027   0.388      0.1131   0.1026   0.1032
Q2     61.37    61.37    61.33     0.1364   0.1335   0.1332     0.05555  0.05541  0.0554
Q3     1.072    1.066    1.064     0.1775   0.1761   0.1757     0.08425  0.0829   0.08234
Q4     200.9    199.4    201.9     5.118    5.015    5.004      3.745    3.69     3.676
Q5     209.4    209.3    209.2     21.59    21.27    21.23      21.62    21.48    21.37
Q6     1097     1096     1098      46.82    46.57    46.37      37.03    36.88    36.44
Q7     567.3    566.7    567.2     39.17    39.23    39.04      179.3    179.8    187.4
Q8     54.51    49.95    18.76     32.76    32.59    32.5       0.5008   0.4963   0.4841
Q9     382.7    381.8    375.2     1.347    1.338    1.33       1.007    0.9983   0.9945
Q10    1808     1807     1806      69.08    60.8     60.32      37.71    37.24    37.21
Q11    6.072    6.037    6.335     0.187    0.1742   0.1559     1.481    1.245    0.9066
Q12    3641     3639     3637      120      119.9    120.3      74.42    74.81    74.77
Q13    27.5     27.49    27.48     3.368    3.238    3.237      114.1    113.9    113.8

Table A.3: The server times for 10% of the data. Time in milliseconds.


       PostgreSQL                  Neo4j                        Berkeley DB
       mean     trim     median    mean     trim     median     mean     trim     median
Q1     501.6    492.7    492.6     55.14    17.95    14.56      1.332    1.147    0.896
Q2     495.7    495.5    495.1     18.87    9.12     8.794      0.6003   0.5044   0.4637
Q3     19.53    19.38    17.89     90.93    63.22    46.37      0.5779   0.5593   0.5589
Q4     563.7    562.8    558.6     394.9    353.3    163.5      21.14    21.09    22.88
Q5     761.8    747.9    697.6     338.4    291      249.7      157.2    153      22.5
Q6     2153     2144     2135      4075     4096     4092       84.07    77.21    36.11
Q7     463.2    462.7    462.4     584.4    524.9    490.5      378      262.8    193
Q8     57.72    25.97    22.05     693.7    599.5    357.8      3.302    3.301    3.269
Q9     4059     4058     4058      93.71    55.95    32.2       5.644    5.669    5.648
Q10    52980    52890    52660     8455     8473     8559       504.5    442.7    410.7
Q11    21.65    18.87    14.65     16       12.37    8.832      24.99    17.14    11.91
Q12    112400   111900   112200    22320    22540    22750      740.5    740.8    731.8
Q13    374.3    373.9    371.8     1540     1505     1489       1136     1134     1133

Table A.4: The server times for 100% of the data. Time in milliseconds.

       0.1%                            1%                              10%
       mean      trim      median      mean      trim      median      mean      trim      median
Q1     0.001531  0.001521  0.001518    0.00117   0.001157  0.001138    0.00144   0.00142   0.001518
Q2     0.001579  0.001573  0.001518    0.001185  0.001189  0.001138    0.001467  0.001458  0.001518
Q3     0.001787  0.001779  0.001897    0.001569  0.001563  0.001518    0.001744  0.001719  0.001897
Q4     0.008888  0.008829  0.008348    0.008475  0.008419  0.007589    0.009677  0.009643  0.009106
Q5     0.02186   0.02147   0.02125     0.03216   0.03121   0.03111     0.0408    0.04063   0.0406
Q6     0.01621   0.01608   0.01594     0.02497   0.02483   0.02466     0.03602   0.03593   0.03605
Q7     0.1925    0.1872    0.1867      1.114     1.134     1.293       1.242     1.239     1.233
Q8     0.0371    0.03555   0.03529     0.2556    0.261     0.2831      0.62      0.6182    0.6575
Q9     0.05229   0.05061   0.05122     0.6377    0.6278    0.631       11.55     11.56     11.97
Q10    0.008697  0.008618  0.009106    0.05151   0.05172   0.05084     0.5432    0.5411    0.5418
Q11    0.004694  0.004403  0.004174    0.006041  0.005718  0.005312    0.007546  0.006902  0.006071
Q12    0.0187    0.01792   0.01404     0.1213    0.1217    0.1226      1.366     1.365     1.366
Q13    0.01769   0.0172    0.0167      0.08085   0.07295   0.06412     0.8914    0.8829    0.8852

Table A.5: The server times for the hash table. Time in milliseconds.


A.2 Client times

Here are the times recorded at the client, in seconds. Each figure is the total time for 2048 test runs; the times for Neo4j and PostgreSQL are multiplied by an appropriate scalar to give comparable totals.
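As an illustration of how such totals can be gathered, the following sketch times repeated runs of one question and scales the result. The 2048-run total comes from the text above, while the service stub, the reduced run count and the scale-factor handling are assumptions.

    public class ClientTimingHarness {
        // Total wall-clock seconds for 'runs' consecutive executions of one question.
        static double timeQuestion(Runnable question, int runs) {
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                question.run(); // one round trip to the Web service
            }
            return (System.nanoTime() - start) / 1e9;
        }

        public static void main(String[] args) {
            // Hypothetical stand-in for a real request to the Web service.
            Runnable question1 = () -> { /* send request, wait for the reply */ };

            // Assumed scheme: a slow database is run fewer times and its total
            // is scaled up to the full 2048 runs for comparison.
            int runs = 128;
            double scale = 2048.0 / runs;
            double total = timeQuestion(question1, runs) * scale;
            System.out.printf("Question 1: %.3f s (scaled to 2048 runs)%n", total);
        }
    }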

       PostgreSQL   Neo4j    Berkeley DB   Hash table
Q1     29.75        2.221    2.217         1.86
Q2     26.72        1.383    1.455         1.239
Q3     2.54         1.147    1.142         0.952
Q4     198.7        37.52    38.9          30.42
Q5     175.4        83.29    97.62         74.35
Q6     75.15        181.7    182.7         139.2
Q7     876.9        262.6    285.8         215.1
Q8     61.33        10.94    4.53          3.561
Q9     73.91        7.958    8.266         6.402
Q10    63.16        3.745    4.004         3.065
Q11    11.84        0.889    1.341         0.674
Q12    127.1        5.233    5.645         4.502
Q13    4.52         1.195    1.83          1.006

Table A.6: Client times for all test runs at 0.1%. Time in seconds.

       PostgreSQL   Neo4j    Berkeley DB   Hash table
Q1     39.84        1.296    1.357         1.007
Q2     37.74        0.983    0.962         0.776
Q3     2.192        0.872    0.837         0.687
Q4     242.8        31.02    39.59         23.9
Q5     457          152.6    182.6         133.8
Q6     1491         263.7    268.6         197.7
Q7     2754         791.2    905.3         641.7
Q8     57.92        63.22    3.875         3.431
Q9     139.8        8.397    9.377         7.722
Q10    350.7        32.57    31.66         23.22
Q11    11.58        0.956    3.038         0.743
Q12    698.5        63.31    67            45.28
Q13    7.891        1.485    19.59         0.984

Table A.7: Client times for all test runs at 1%. Time in seconds.


       PostgreSQL   Neo4j    Berkeley DB   Hash table
Q1     127.9        2.628    1.457         1.193
Q2     126.9        1.38     0.998         0.865
Q3     3.478        2.006    1.533         1.273
Q4     436.8        41.96    53.12         26.51
Q5     574.5        193      204.2         154.5
Q6     2494         360.1    334.2         254.2
Q7     1270         193.1    481.6         114.4
Q8     113          68.62    2.041         2.262
Q9     791          9.891    9.14          31.07
Q10    3940         400.3    328.9         246.1
Q11    13.36        1.405    3.874         0.794
Q12    7932         762.5    661.7         496.3
Q13    58.98        9.899    236.7         4.437

Table A.8: Client times for all test runs at 10%. Time in seconds.

       PostgreSQL   Neo4j    Berkeley DB
Q1     1032         116.9    4.127
Q2     1018         41.56    2.143
Q3     48.68        198.1    7.885
Q4     1181         838.1    69.37
Q5     1710         844.6    494.7
Q6     4659         8604     428.8
Q7     961.6        1212     78.14
Q8     120.6        1423     7.619
Q9     8322         201.1    18.67
Q10    111000       19970    3758
Q11    47.34        35.38    52.44
Q12    235200       50860    9654
Q13    789.7        3186     2349

Table A.9: Client times for all test runs at 100%. Time in seconds.
