CS 708 Seminar NoSQL

Prepared by: Fayaz Yusuf Khan, Reg.No: CSU071/16
Guided by: Nimi Prakash P., System Analyst, Computer Science & Engineering Department
September 29, 2010

NoSQL DEFINITION
Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent/BASE (not ACID), a huge data amount, and more. So the misleading term “NoSQL” (the community now translates it mostly with “not only SQL”) should be seen as an alias to something like the definition above. [1]

Contents

1 Introduction
  1.1 Why relational databases are not enough
  1.2 What NoSQL has to offer
  1.3 ACIDs & BASEs
    1.3.1 ACID Properties of Relational Databases
    1.3.2 CAP
    1.3.3 BASE
2 NoSQL Features
  2.1 No entity joins
  2.2 Eventual Consistency
    2.2.1 Historical Perspective
    2.2.2 Consistency: Client and Server
  2.3 Cloud Based Memory Architecture
    2.3.1 Memory Based Architectures
3 Different NoSQL Database Choices
  3.1 Document Databases & BigTable
  3.2 Graph Databases
  3.3 MapReduce
  3.4 Distributed Key-Value Stores
4 NoSQL: Merits & Demerits
  4.1 Merits
    4.1.1 Semi-Structured Data
    4.1.2 Alternative Model Paradigms
    4.1.3 Multi-Valued Properties
    4.1.4 Generalized Analytics
    4.1.5 Version History
    4.1.6 Predictable Scalability
    4.1.7 Schema Evolution
    4.1.8 More Natural Fit with Code
  4.2 Demerits
    4.2.1 Limitations on Analytics
5 Conclusion
  5.1 Data inconsistency
  5.2 Making a Decision
A Different popular NoSQL databases
  A.1 The Shortlist
  A.2 Cloud-Service Contenders
    A.2.1 Amazon: SimpleDB
    A.2.2 Google AppEngine Data Store
    A.2.3 Microsoft: SQL Data Services
  A.3 Non-Cloud Service Contenders
    A.3.1 Tokyo Cabinet
    A.3.2 CouchDB
    A.3.3 Project Voldemort
    A.3.4 Mongo
    A.3.5 Drizzle
    A.3.6 Cassandra
    A.3.7 BigTable
    A.3.8 Dynamo


Chapter 1 Introduction
The history of the relational database has been one of continual adversity: initially, many claimed that mathematical set-based models could never be the basis for efficient database implementations; later, aspiring object oriented databases claimed they would remove the “middle man” of relational databases from the OO design and persistence process. In all of these cases, through a combination of sound concepts, elegant implementation, and general applicability, relational databases have become and remained the lingua franca of data storage and manipulation.

Most recently, a new contender has arisen to challenge the supremacy of relational databases. Referred to generally as “non-relational databases” (among other names), this class of storage engine seeks to break down the rigidity of the relational model, in exchange for leaner models that can perform and scale at higher levels, using various models (including key/value pairs, sharded arrays, and document-oriented approaches) which can be created and read efficiently as the basic unit of data storage.

Primarily, these new technologies have arisen in situations where traditional relational database systems would be extremely challenging to scale to the degree needed for global systems (for example, at companies such as Google, Yahoo, Amazon, LinkedIn, etc., which regularly collect, store and analyze massive data sets with extremely high transactional throughput and low latency). As of this writing, there exist dozens of variants of this new model, each with different capabilities and trade-offs, but all with the general property that traditional relational design (as practiced on relational database management systems like Oracle, Sybase, etc.) is neither possible nor desired. [11]


1.1 Why relational databases are not enough

Even though RDBMS have provided database users with the best mix of simplicity, robustness, flexibility, performance, scalability, and compatibility, their performance in each of these areas is not necessarily better than that of an alternate solution pursuing one of these benefits in isolation. This has not been much of a problem so far because the universal dominance of RDBMS has outweighed the need to push any of these boundaries. Nonetheless, if you really had a need that couldn’t be answered by a generic relational database, alternatives have always been around to fill those niches.

Today, we are in a slightly different situation. For an increasing number of applications, one of these benefits is becoming more and more critical; and while still considered a niche, it is rapidly becoming mainstream, so much so that for an increasing number of database users this requirement is beginning to eclipse others in importance. That benefit is scalability. As more and more applications are launched in environments that have massive workloads, such as web services, their scalability requirements can, first of all, change very quickly and, secondly, grow very large. The first scenario can be difficult to manage if you have a relational database sitting on a single in-house server. For example, if your load triples overnight, how quickly can you upgrade your hardware? The second scenario can be too difficult to manage with a relational database in general.

Relational databases scale well, but usually only when that scaling happens on a single server node. When the capacity of that single node is reached, you need to scale out and distribute that load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of nodes, rather than a few, and the complexities become overwhelming, and the characteristics that make RDBMS so appealing drastically reduce their viability as platforms for large distributed systems.

For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a scalable data store is not much of a platform at all. So, to provide customers with a scalable place to store application data, vendors had only one real option. They had to implement a new type of database system that focuses on scalability, at the expense of the other benefits that come with relational databases. These efforts, combined with those of existing niche vendors, have led to the rise of a new breed of database management system. [2]


1.2 What NoSQL has to offer

This new kind of database management system is commonly called a key/value store. In fact, no official name yet exists, so you may see it referred to as document-oriented, Internet-facing, attribute-oriented, distributed database (although this can be relational also), sharded sorted arrays, distributed hash table, and key/value database. While each of these names points to specific traits of this new approach, they are all variations on one theme, which we’ll call key/value databases.

Whatever you call it, this “new” type of database has been around for a long time and has been used for specialized applications for which the generic relational database was ill-suited. But without the scale that web and cloud applications have brought, it would have remained a mostly unused subset. Now, the challenge is to recognize whether it or a relational database would be better suited to a particular application.

Relational databases and key/value databases are fundamentally different and designed to meet different needs. A side-by-side comparison only takes you so far in understanding these differences, but to begin, let’s lay one down (Table 1.1): [2]


Database Definition

Relational Database
• Database contains tables, tables contain columns and rows, and rows are made up of column values. Rows within a table all have the same schema.
• The data model is well defined in advance. A schema is strongly typed and it has constraints and relationships that enforce data integrity.
• The data model is based on a “natural” representation of the data it contains, not on an application’s functionality.
• The data model is normalized to remove data duplication. Normalization establishes table relationships. Relationships associate data between tables.

Key/Value Database
• Domains can initially be thought of like a table, but unlike a table you don’t define any schema for a domain. A domain is basically a bucket that you put items into. Items within a single domain can have differing schema.
• Items are identified by keys, and a given item can have a dynamic set of attributes attached to it.
• In some implementations, attributes are all of a string type. In other implementations, attributes have simple types that reflect code types, such as ints, string arrays, and lists.
• No relationships are explicitly defined between domains or within a given domain.

Table 1.1: Fundamental differences between relational databases and key/value stores

1.3 ACIDs & BASEs
1.3.1 ACID Properties of Relational Databases

• The claim to fame for relational databases is that they make the ACID promise:
  – Atomicity: a transaction is all or nothing.
  – Consistency: only valid data is written to the database.
  – Isolation: pretend all transactions are happening serially and the data is correct.
  – Durability: what you write is what you get.
• The problem with ACID is that it gives too much; it trips you up when you try to scale a system across multiple nodes.
• Down time is unacceptable, so the system needs to be reliable. Reliability requires multiple nodes to handle machine failures.
• Making a scalable system that can handle lots and lots of reads and writes requires many more nodes.
• Once we try to scale ACID across many machines we hit problems with network failures and delays. The algorithms don’t work in a distributed environment at any acceptable speed.

1.3.2 CAP

• If we can’t have all of the ACID guarantees, it turns out we can have two of the following three characteristics:
  – Consistency: your data is correct all the time. What you write is what you read.
  – Availability: you can read and write your data all the time.
  – Partition Tolerance: if one or more nodes fail, the system still works, and it becomes consistent again once the system is fully back on-line.

1.3.3 BASE

• The types of large systems based on CAP aren’t ACID; they are BASE:
  – Basically Available: the system seems to work all the time.
  – Soft State: it doesn’t have to be consistent all the time.
  – Eventually Consistent: it becomes consistent at some later time.
• Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc. [7]
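
The three BASE properties can be made concrete with a toy model. The following is only a minimal sketch, not the design of any system named above; the class `EventuallyConsistentStore` and its methods are hypothetical names introduced for illustration. A write is acknowledged as soon as one replica holds it (basically available), the replicas may disagree for a while (soft state), and a later `sync()` brings them into agreement (eventually consistent).

```python
class EventuallyConsistentStore:
    """Toy store with several replicas and lazy propagation (illustrative only)."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet applied to every replica

    def write(self, key, value):
        # Acknowledge after updating a single replica; queue the update for the rest.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # A read may be served by any replica and may therefore return stale data.
        return self.replicas[replica].get(key)

    def sync(self):
        # Apply the queued updates, in order, to every replica.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart", ["book"])
stale = store.read("cart", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("cart", replica=2)  # after propagation all replicas agree
```

In a real deployment the propagation step would run in the background and over a network; it is an explicit call here only so that the stale read is easy to reproduce.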



Chapter 2 NoSQL Features
2.1 No entity joins

Key/value databases are item-oriented, meaning all relevant data relating to an item are stored within that item. A domain (which you can think of as a table) can contain vastly different items. For example, a domain may contain customer items and order items. This means that data are commonly duplicated between items in a domain. This is accepted practice because disk space is relatively cheap. But this model allows a single item to contain all relevant data, which improves scalability by eliminating the need to join data from multiple tables. With a relational database, such data needs to be joined to be able to regroup relevant attributes.

But while the need for relationships is greatly reduced with key/value databases, certain ones are inevitable. These relationships usually exist among core entities. For example, an ordering system would have items that contain data about customers, products, and orders. Whether these reside on the same domain or separate domains is irrelevant; but when a customer places an order, you would likely not want to store both the customer’s and the product’s attributes in the same order item. Instead, orders would need to contain relevant keys that point to the customer and product. While this is perfectly doable in a key/value database, these relationships are not defined in the data model itself, and so the database management system cannot enforce the integrity of the relationships. This means you can delete customers, and the products they have ordered, while orders still point to them. The responsibility of ensuring data integrity falls entirely to the application. [2]
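
A short sketch may help show what “integrity falls to the application” means in practice. Everything here is hypothetical: `Domain` is a plain in-memory stand-in for a key/value domain, and the attribute names (`kind`, `customer_key`, `product_key`) are assumptions made for the example, not part of any particular product’s API. The store happily deletes a customer that an order still references, and only an application-level check notices.

```python
class Domain:
    """A schema-less bucket of items; each item is a dict of attributes."""

    def __init__(self):
        self.items = {}

    def put(self, key, **attributes):
        self.items[key] = dict(attributes)

    def get(self, key):
        return self.items.get(key)

    def delete(self, key):
        # No foreign-key check: the store has no notion of relationships.
        self.items.pop(key, None)


def dangling_orders(domain):
    """Return keys of order items whose customer or product key no longer exists."""
    broken = []
    for key, item in domain.items.items():
        if item.get("kind") != "order":
            continue
        if domain.get(item["customer_key"]) is None or domain.get(item["product_key"]) is None:
            broken.append(key)
    return broken


shop = Domain()
shop.put("cust:1", kind="customer", name="Alice")
shop.put("prod:7", kind="product", title="Kettle", price="25.00")
shop.put("order:100", kind="order", customer_key="cust:1", product_key="prod:7")

before = dangling_orders(shop)
shop.delete("cust:1")  # accepted without complaint
after = dangling_orders(shop)
```

Note that customer, product, and order items share one domain and carry different attribute sets, as the text describes; a relational schema would instead reject the delete through a foreign-key constraint.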
Data Access

Relational Database
• Data is created, updated, deleted, and retrieved using SQL.
• SQL queries can access data from a single table or multiple tables through table joins.
• SQL queries include functions for aggregation and complex filtering.
• Usually contains means of embedding logic close to data storage, such as triggers, stored procedures, and functions.

Key/Value Database
• Data is created, updated, deleted, and retrieved using API method calls.
• Some implementations provide basic SQL-like syntax for defining filter criteria.
• Often only basic filter predicates (such as =, ≠, <, >, ≤, and ≥) can be applied.
• All application and data integrity logic is contained in the application code.

Table 2.1: Data access for relational databases and key/value stores

Application Interface

Relational Database
• Tend to have their own specific API, or make use of a generic API such as OLE-DB or ODBC.
• Data is stored in a format that represents its natural structure, so it must be mapped between application code structure and relational structure.

Key/Value Database
• Tend to provide SOAP and/or REST APIs over which data access calls can be made.
• Data can be more effectively stored in application code that is compatible with its structure, requiring only relational “plumbing” code for the object.

Table 2.2: Application interface for relational databases and key/value stores
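
To illustrate the style of data access summarized in the tables above, here is a minimal sketch of a method-call interface that supports only the basic comparison predicates. The function `select` and the `PREDICATES` table are hypothetical; they do not reproduce the API of SimpleDB or any other store, and the sample items are made up.

```python
import operator

# Only the basic predicates; anything richer is left to application code.
PREDICATES = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def select(items, attribute, op, value):
    """Return the keys of items whose attribute satisfies the predicate, sorted by key."""
    test = PREDICATES[op]
    return sorted(
        key
        for key, attrs in items.items()
        if attribute in attrs and test(attrs[attribute], value)
    )


items = {
    "prod:1": {"title": "Kettle", "price": 25},
    "prod:2": {"title": "Toaster", "price": 40},
    "prod:3": {"title": "Gift card"},  # no price attribute at all
}

cheap = select(items, "price", "<", 30)
priced = select(items, "price", ">=", 0)
```

There is no join and no aggregation here: combining results from several calls, or summing prices, would be ordinary application code rather than part of the query.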



2.2 Eventual Consistency

Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability. At the foundation of Amazon’s cloud computing are infrastructure services such as Amazon’s S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.

Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.

One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems, we ought to use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. [4]


2.2.1 Historical Perspective

In an ideal world there would be only one consistency model: when an update is made all observers would see that update. The first time this surfaced as difficult to achieve was in the database systems of the late ’70s. The best “period piece” on this topic is “Notes on Distributed Databases” by Bruce Lindsay et al. It lays out the fundamental principles for database replication and discusses a number of techniques that deal with achieving consistency. Many of these techniques try to achieve distribution transparency; that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.

In the mid-’90s, with the rise of larger Internet systems, these practices were revisited. At that time people began to consider the idea that availability was perhaps the most important property of these systems, but they were struggling with what it should be traded off against. Eric Brewer, systems professor at the University of California, Berkeley, and at that time head of Inktomi, brought the different trade-offs together in a keynote address to the PODC (Principles of Distributed Computing) conference in 2000. He presented the CAP theorem, which states that of three properties of shared-data systems (data consistency, system availability, and tolerance to network partition) only two can be achieved at any given time. A more formal confirmation can be found in a 2002 paper by Seth Gilbert and Nancy Lynch.

A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems must be part of the same environment; they fail as a whole under certain scenarios, and as such, clients cannot observe partitions. An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, then the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data, and they are served well under this model.

In principle the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is often the responsibility of the developer writing the transaction but can be assisted by the database managing integrity constraints. [4]
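
The money-transfer example can be tried out with Python’s built-in `sqlite3` module. This is a small sketch under assumed names (the `accounts` table, the `transfer` and `total` helpers, and the balances are all made up for illustration). The developer writes the transaction, and the database assists through a `CHECK` constraint that forbids negative balances; when the constraint fails, the whole transaction is rolled back, so the total held in both accounts does not change.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move amount between two accounts as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, target)
            )
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def total(conn):
    """Sum of all balances; a successful or failed transfer must leave it unchanged."""
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

ok = transfer(conn, "a", "b", 30)         # allowed: a = 70, b = 80
rejected = transfer(conn, "a", "b", 500)  # would overdraw a, so nothing is applied
```

The credit is deliberately issued before the debit so that the rollback has something visible to undo; in both outcomes `total(conn)` stays at its starting value.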


2.2.2 Consistency: Client and Server

There are two ways of looking at consistency. One is from the developer/client point of view: how they observe data updates. The second way is from the server side: how updates flow through the system and what guarantees systems can give with respect to updates. [4]

Client-side Consistency

The client side has these components:
• A storage system. For the moment we’ll treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed, and that it is built to guarantee durability and availability.
• Process A. This is a process that writes to and reads from the storage system.
• Processes B and C. These two processes are independent of process A and write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.

Client-side consistency has to do with how and when observers (in this case the processes A, B, or C) see updates made to a data object in the storage systems. In the following examples illustrating the different types of consistency, process A has made an update to a data object:
• Strong consistency. After the update completes, any subsequent access (by A, B, or C) will return the updated value.
• Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
• Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.

The eventual consistency model has a number of variations that are important to consider:
• Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
• Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.
• Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.
• Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
• Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.

A number of these properties can be combined. For example, one can get monotonic reads combined with session-level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventual consistency system, but not always required.
These two properties make it simpler for developers to build applications, while allowing the storage system to relax consistency and provide high availability. As you can see from these variations, quite a few different scenarios are possible. It depends on the particular applications whether or not one can deal with the consequences.

Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs (relational database management systems) that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction. In asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode, if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also, to support better scalable read performance, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual consistency guarantees in which the inconsistency windows depend on the periodicity of the log shipping. [4]

Server-side Consistency

On the server side we need to take a deeper look at how updates flow through the system to understand what drives the different modes that the developer who uses the system can experience. Let’s establish a few definitions before getting started:
• N = the number of nodes that store replicas of the data
• W = the number of replicas that need to acknowledge the receipt of the update before the update completes
• R = the number of replicas that are contacted when a data object is accessed through a read operation

If W + R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N = 2, W = 2, and R = 1. No matter from which replica the client reads, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N = 2, W = 1, and R = 1. In this case R + W = N, and consistency cannot be guaranteed.

The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N = 3 and W = 3 and only two nodes available, the system will have to fail the write.
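
The overlap rule can be checked mechanically. The sketch below is illustrative only, with hypothetical function names: `sets_always_overlap` enumerates every possible write set of size W and read set of size R over N replicas and reports whether they always share at least one node, and `quorum_properties` applies the W + R > N test and counts how many replica failures a write can survive.

```python
from itertools import combinations


def sets_always_overlap(n, w, r):
    """Brute force: does every write set of size w intersect every read set of size r?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def quorum_properties(n, w, r):
    """Summarize a replication configuration using the definitions of N, W, and R."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each lie between 1 and N")
    return {
        "strong_consistency": w + r > n,
        # A write needs W reachable replicas, so it tolerates N - W failures.
        "write_failures_tolerated": n - w,
    }


sync_backup = quorum_properties(n=2, w=2, r=1)   # synchronous primary-backup
async_backup = quorum_properties(n=2, w=1, r=1)  # asynchronous, reads from backup
all_replicas = quorum_properties(n=3, w=3, r=1)  # one failed node blocks every write
```

For small N the brute-force check and the arithmetic rule agree in every case, which is the reason the simple inequality is enough in practice.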
In distributed-storage systems that need to provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N = 3 (with W = 2 and R = 2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read will return a result. Systems that are concerned with consistency are set to W = N for updates, which may decrease the probability of the write succeeding. A common configuration for these systems that are concerned about fault tolerance but not consistency is to run with W = 1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.

How to configure N, W, and R depends on what the common case is and which performance path needs to be optimized. In R = 1 and N = W we optimize for the read case, and in W = 1 and R = N we optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N + 1)/2, there is the possibility of conflicting writes when the write sets do not overlap.

Weak/eventual consistency arises when W + R ≤ N, meaning that there is a possibility that the read and write set will not overlap. If this is a deliberate configuration and not based on a failure case, then it hardly makes sense to set R to anything but 1. This happens in two very common cases: the first is the massive replication for read scaling mentioned earlier; the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine the latest value written to the system, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most of these systems where the write set is smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica’s set. The period until all replicas have been updated is the inconsistency window discussed before. If W + R ≤ N, then the system is vulnerable to reading from nodes that have not yet received the updates.

Whether or not read-your-writes, session, and monotonic consistency can be achieved depends in general on the “stickiness” of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault tolerance, but it is a simple solution.
Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about. Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.

Partitions happen when some nodes in the system cannot reach other nodes, but both sets are reachable by groups of clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don’t happen frequently, but they do occur between data centers, as well as inside data centers.

In some applications the unavailability of any of the partitions is unacceptable, and it is important that the clients that can reach that partition make progress. In that case both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. [4]
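
The client-side technique mentioned above (adding versions on writes and discarding reads that precede the last-seen version) can be sketched in a few lines. This is a simplified illustration that assumes each key carries a single integer version that grows with every write; the class and method names are hypothetical, and real systems may use richer version information.

```python
class VersionTrackingClient:
    """Client-side filter giving read-your-writes and monotonic reads per key."""

    def __init__(self):
        self.last_seen = {}  # key -> highest version this client has written or read

    def record_write(self, key, version):
        # Remember the version of our own write so older replicas can be detected.
        self.last_seen[key] = max(version, self.last_seen.get(key, 0))

    def accept_read(self, key, version):
        """Return True if a reply with this version may be used, False if it is stale."""
        if version < self.last_seen.get(key, 0):
            return False  # an out-of-date replica answered; retry against another one
        self.last_seen[key] = version
        return True


client = VersionTrackingClient()
client.record_write("profile", 3)
stale_reply = client.accept_read("profile", 2)    # older than our own write
current_reply = client.accept_read("profile", 3)  # our write is visible
newer_reply = client.accept_read("profile", 5)    # someone else wrote since
regressed = client.accept_read("profile", 4)      # would move backwards in time
```

The client never blocks writes or coordinates with servers; it only refuses to go backwards, which is why this approach combines well with a store that offers nothing stronger than eventual consistency.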


2.3 Cloud Based Memory Architecture
2.3.1 Memory Based Architectures

Google query results are now served in under 200 ms, down from 1000 ms in earlier years. Most of this improvement comes from holding the indexes completely in memory. Thousands of machines process each query so that search results appear nearly instantaneously. (These figures are adapted from notes on Google Fellow Jeff Dean’s keynote speech at WSDM 2009.)

What makes Memory Based Architectures different from traditional architectures is that memory is the system of record. Typically, disk-based databases have been the system of record: all the data is stored on disk and, because disk is slow, we have ended up wrapping disks in complicated caching layers and distributed file systems to make them perform. Even though memory is used all over the place as cache, it is simply assumed that a cache can be invalidated at any time. In Memory Based Architectures, memory is where the “official” data values are stored.

Caching also serves a different purpose. The purpose of cache-based architectures is to minimize the data bottleneck caused by disk. Memory based architectures can address the entire end-to-end application stack. Data in memory can be of higher reliability and availability than in traditional architectures.

Memory Based Architectures initially developed out of the need in some application spaces for very low latencies. The dramatic drop in RAM prices, along with the ability of servers to handle larger and larger amounts of RAM, has brought memory architectures to the verge of going mainstream. For example, someone recently calculated that 1 TB of RAM across 40 servers at 24 GB per server would cost an additional $40,000, which is quite affordable given the cost of the servers. Projecting out, 1U and 2U rack-mounted servers will soon support a terabyte or more of memory. [6]

RAM: High Bandwidth and Low Latency

Compared to disk, RAM is a high-bandwidth and low-latency storage medium. The bandwidth of RAM is typically 5 GB/s, while the bandwidth of disk is about 100 MB/s, so by these figures RAM bandwidth is roughly 50 times higher. Modern hard drives have latencies under 13 milliseconds, and when many applications are queued for disk reads, latencies can easily reach many seconds. Memory latency is in the 5 nanosecond range, which is several orders of magnitude lower. [6]

RAM is the New Disk

The superiority of RAM is at the heart of the “RAM is the New Disk” paradigm. As an architecture it combines the holy quadrinity of computing:

• Performance is better because data is accessed from memory instead of through a database to a disk.
• Scalability is linear because, as more servers are added, data is transparently load balanced across the servers, giving automated in-memory sharding.
• Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure.
• Application development is faster because there is only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer, which means all a developer has to do is get and put data.

Accessing disk on the critical path of any transaction limits both throughput and latency. Committing a transaction over the network in memory is faster than writing through to disk. Reading data from memory is also faster than reading data from disk. So the idea is to skip disk, except perhaps as an asynchronous write-behind option, for archival storage, and for large files. [6]
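The idea of keeping memory as the system of record, with disk reduced to an asynchronous write-behind option, can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the class name `WriteBehindStore`, the `persist` callback and the dictionary standing in for the disk archive are all hypothetical, and a real system would also replicate the in-memory data across servers.

```python
import queue
import threading


class WriteBehindStore:
    """In-memory key/value store that archives writes asynchronously."""

    def __init__(self, persist):
        self._data = {}                # memory holds the "official" values
        self._pending = queue.Queue()  # writes waiting to be archived
        self._persist = persist        # hypothetical slow writer (stands in for disk)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        self._data[key] = value          # visible to readers immediately
        self._pending.put((key, value))  # archived off the critical path

    def get(self, key, default=None):
        return self._data.get(key, default)

    def flush(self):
        """Block until every queued write has been archived."""
        self._pending.join()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            try:
                self._persist(key, value)
            finally:
                self._pending.task_done()


archive = {}  # assumption: a dict plays the role of archival storage
store = WriteBehindStore(persist=archive.__setitem__)
store.put("user:1", "homer")
print(store.get("user:1"))
store.flush()
print(archive)
```

Reads and writes never wait for the archive; only `flush` does, which is the trade-off the write-behind approach makes.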


Chapter 3 Different NoSQL Database Choices
3.1 Document Databases & BigTable

The BigTable paper describes how Google developed their own massively scalable database for internal use, as the basis for several of their services. The data model is quite different from relational databases: columns don’t need to be predefined, and rows can be added with any set of columns. Empty columns are not stored at all. BigTable inspired many developers to write their own implementations of this data model; amongst the most popular are HBase, Hypertable and Cassandra. The lack of a predefined schema can make these databases attractive in applications where the attributes of objects are not known in advance, or change frequently.

Document databases have a related data model (although the way they handle concurrency and distributed servers can be quite different): a BigTable row with its arbitrary number of columns/attributes corresponds to a document in a document database, which is typically a tree of objects containing attribute values and lists, often with a mapping to JSON or XML. Open source document databases include Project Voldemort, CouchDB, MongoDB, ThruDB and Jackrabbit.

How is this different from just dumping JSON strings into MySQL? Document databases can actually work with the structure of the documents, for example extracting, indexing, aggregating and filtering based on attribute values within the documents. Alternatively you could of course build the attribute indexing yourself, but I wouldn’t recommend that unless it makes working with your legacy code easier.

The big limitation of BigTables and document databases is that most implementations cannot perform joins or transactions spanning several rows or documents. This restriction is deliberate, because it allows the database to do automatic partitioning, which can be important for scaling (see Section 3.4 on distributed key-value stores below). If the structure of our data is lots of independent documents, this is not a problem; but if the data fits nicely into a relational model and we need joins, we shouldn’t try to force it into a document model. [9]
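To make the point about working with document structure concrete, here is a minimal sketch of indexing documents by attribute value. It uses plain Python dictionaries rather than any particular document database, and the document contents and the helper `build_index` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical schema-free documents: each may carry a different set of attributes.
documents = {
    "doc1": {"type": "post", "author": "alice", "tags": ["nosql", "couchdb"]},
    "doc2": {"type": "post", "author": "bob", "tags": ["sql"]},
    "doc3": {"type": "comment", "author": "alice"},
}


def build_index(docs, attribute):
    """Map each value of `attribute` to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        if attribute not in doc:
            continue  # documents without the attribute are simply not indexed
        values = doc[attribute]
        if not isinstance(values, list):
            values = [values]  # treat scalars and lists uniformly
        for value in values:
            index[value].add(doc_id)
    return index


author_index = build_index(documents, "author")
tag_index = build_index(documents, "tags")
print(sorted(author_index["alice"]))
print(sorted(tag_index["nosql"]))
```

An opaque JSON string stored in a relational column offers none of this unless the application parses every row itself.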


Graph Databases

Graph databases live at the opposite end of the spectrum. While document databases are good for storing data which is structured in the form of lots of independent documents, graph databases focus on the relationships between items, which makes them a better fit for highly interconnected data models. Plain SQL joins cannot express transitive relationships, i.e. variable-length chains of joins which continue until some condition is reached; recursive query extensions exist, but they are awkward and unevenly supported. Graph databases, on the other hand, are optimised precisely for this kind of data.

There is less choice in graph databases than there is in document databases: Neo4j, AllegroGraph and Sesame (which typically uses MySQL or PostgreSQL as storage back-end) are ones to look at. FreeBase and DirectedEdge have developed graph databases for their internal use. Graph databases are often associated with the semantic web and RDF datastores, which is one of the applications they are used for. [9]
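A transitive relationship query of the kind described above amounts to a graph traversal. The following sketch uses a breadth-first search over an in-memory adjacency list; the `follows` data and the function `reachable` are illustrative assumptions, not the API of any graph database.

```python
from collections import deque

# Hypothetical "follows" relationships: each key points at the people it follows.
follows = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
    "erin": ["alice"],
}


def reachable(graph, start, max_hops=None):
    """Return every node reachable from `start`, optionally within `max_hops` edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if max_hops is not None and hops >= max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    seen.discard(start)
    return seen


print(sorted(reachable(follows, "alice")))
print(sorted(reachable(follows, "alice", max_hops=2)))
```

With fixed-length joins, each extra hop would require another self-join written into the query; here the chain length is decided by the data.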



MapReduce

Popularised by another Google paper, MapReduce is a way of writing batch processing jobs without having to worry about infrastructure. Different databases lend themselves more or less well to MapReduce. Hadoop is the big one amongst the open MapReduce implementations, and Skynet and Disco are also worth looking at. CouchDB also includes some MapReduce ideas on a smaller scale. [9]
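The programming model itself is small enough to sketch in a few lines. The example below is a single-process word count, assuming nothing about Hadoop or any other framework; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here for illustration. A real implementation would run the map and reduce steps on many machines and handle failures.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["NoSQL is not only SQL", "SQL is declarative"])
print(counts)
```

The job author writes only the map and reduce functions; distribution, grouping and scheduling are the infrastructure’s concern.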


Distributed Key-Value Stores

A key-value store is a very simple concept, much like a hash table: you can retrieve an item based on its key, you can insert a key/value pair, and you can delete a key/value pair. The value can just be an opaque sequence of bytes, or might be a structured document (most of the document databases and BigTable implementations above can also be considered to be key-value stores).

Document databases, graph databases and MapReduce introduce new data models and new ways of thinking which can be useful even in small-scale applications. Distributed key-value stores, on the other hand, are really just about scalability. They can scale to truly vast amounts of data, much more than a single server could hold.

Distributed databases can transparently partition and replicate the data across many machines in a cluster. We don’t need to figure out a sharding scheme to decide on which server we can find a particular piece of data; the database can locate it for us. If one server dies, no problem: others can immediately take over. If we need more resources, we just add servers to the cluster, and the database will automatically give them a share of the load and the data.

When choosing a key-value store we need to decide whether it should be optimised for low latency (for lightning-fast data access during the request-response cycle) or for high throughput (which is needed for batch processing jobs). Other than the BigTables and document databases above, Scalaris, Dynomite and Ringo provide certain data consistency guarantees while taking care of partitioning and distributing the dataset. MemcacheDB and Tokyo Cabinet (with Tokyo Tyrant for network service and LightCloud to make it distributed) focus on latency.

The caveat about limited transactions and joins applies even more strongly for distributed databases. Different implementations take different approaches, but in general, if we need to read several items, manipulate them in some way and then write them back, there is no guarantee that we will end up in a consistent state immediately (although many implementations try to become eventually consistent by resolving write conflicts or using distributed transaction protocols). [9]
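Transparent partitioning and replication can be illustrated with a toy cluster that places each key by hashing it. This is a sketch under stated assumptions: the `Cluster` class is hypothetical, servers are plain dictionaries, and simple modulo placement is used for brevity even though it reshuffles most keys when servers are added (real stores typically use consistent hashing or a similar scheme to avoid that).

```python
import hashlib


class Cluster:
    """Toy key/value cluster: each key lives on `replicas` consecutive servers."""

    def __init__(self, servers, replicas=2):
        self.servers = list(servers)
        self.replicas = replicas
        self.storage = {name: {} for name in self.servers}

    def _owners(self, key):
        # Hash the key to pick the first owner, then take the next servers as replicas.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.servers)
        return [self.servers[(first + i) % len(self.servers)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for name in self._owners(key):
            self.storage[name][key] = value

    def get(self, key, down=()):
        # Skip servers listed in `down` to simulate failures.
        for name in self._owners(key):
            if name not in down:
                return self.storage[name][key]
        raise KeyError(key)


cluster = Cluster(["s1", "s2", "s3"])
cluster.put("user:42", "homer")
primary = cluster._owners("user:42")[0]
print(cluster.get("user:42"))
print(cluster.get("user:42", down=(primary,)))  # a replica answers instead
```

The caller never names a server; the hash decides placement, and a replica takes over when the first owner is unavailable.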



Chapter 4 NoSQL: Merits & Demerits
4.1 Merits

There is a long list of potential advantages to using non-relational databases. Of course, not all non-relational databases are the same; but the following list covers areas common to many of them. [11]


Semi-Structured Data

Here we have a structure in which each entity can have any number of properties, defined at run-time. This approach is clearly helpful in domains where the problem is itself amenable to expansion or change over time. We can begin simply, and alter the details of our problem as we go, with minimal administrative burden. This approach has much in common with the dynamic typing of scripting languages like Python, which, while often less efficient than statically typed languages like C and Java, usually more than make up for this deficiency by giving programmers improved usability; they can get started quickly and add structure and overhead only as needed.

But there is another, more important aspect to this tendency towards storing non-structured, or semi-structured, data: the idea that our understanding of a problem, and its data, might legitimately emerge over time, and be entirely data-driven after the fact. As one observer put it:

“RDBMSs are designed to model very highly and statically structured data which has been modeled with mathematical precision; data and designs that do not meet these criteria, such as data designed for direct human consumption, lose the advantages of the relational model, and result in poorer maintainability than with less stringent models.”

This kind of emergent behavior is atypical when dealing with the programming problems of the past 40 years, such as accounting systems, desktop word processing software, etc. However, many of today’s interesting problems involve unpredictable behavior and inputs from extremely large populations; consider web search, social network graphs, large scale purchasing habits, etc. In these “messy” arenas, the impulse to exactly model and define all the possible structures in the data in advance is exactly the wrong approach. Relational data design tends to turn programmers into “structure first” proponents, but in many cases, the rest of the world (including the users we are writing programs for) is thinking “data first”. [11]


Alternative Model Paradigms

Modelling data in terms of relations, tuples and attributes (or equivalently, tables, rows and columns) is but one conceptual approach. There are entirely different ways of considering, planning, and designing a data model. These include hierarchical trees, arbitrary graphs, structured objects, cube or star schema analytical approaches, tuple spaces, and even undifferentiated (emergent) storage. By moving into the realm of semi-structured non-relational data, we gain the possibility of accessing our data along these lines instead of simply in relational database terms.

Consider, for example, graph-oriented databases such as Neo4j. This paradigm attempts to map persistent storage capabilities directly onto the graph model of computation: sets of nodes connected by sets of edges. The database engine then innately provides many algorithmic services that one would expect on graph representations: establishing spanning trees, finding shortest paths, depth-first and breadth-first search, etc.

Object databases are another paradigm that has, at various times, appeared poised to challenge the supremacy of the relational database. An example of a current contender in this space is Persevere (http://www.persvr.org/), which is an object store for JSON (JavaScript Object Notation) data. Advantages gained in this space include a consistent execution model between the storage engine and the client platform (JavaScript, in this case), and the ability to natively store objects without any translation layer.

Here again, the general principle is that by moving away from the strictly modeled structure of SQL, we untie the hands of developers to model data in terms they may be more familiar with, or that may be more conducive to solving the problem at hand. This is very attractive to many developers. [11]


Multi-Valued Properties

Even within the bounds of the more traditional relational approach, there are ways in which the semi-structured approach of non-relational databases can give us a helping hand in conceptual data design. One of these is by way of multi-valued properties, that is, attributes that can simultaneously take on more than one value.

A credo of relational database design is that for any given tuple in a relation, there is only one value for any given attribute; storing multiple values in the same attribute for the same tuple is considered very bad practice, and is not supported by standard SQL. Generally, cases where one might be tempted to store multiple values in the same attribute indicate that the design needs further normalization.

As an example, consider a User relation with an attribute emails. Since people typically have more than one email address, a simple (but wrong, at least for relational database design) decision might be to store the email addresses as a comma-delimited list within the “emails” attribute. The problems with this are myriad. For example, simple membership tests like

    SELECT * FROM User WHERE emails = 'homer@simpson.com';

will fail if there is more than one email address in the list, because that is no longer the value of the attribute; a more general test using wildcards such as

    SELECT * FROM User WHERE emails LIKE '%homer@simpson.com%';

will succeed, but raises serious performance issues in that it defeats the use of indexes and causes the database engine to do (at best) linear-time text pattern searches against every value in the table. Worse, it may actually impact correctness if entries in the list can be proper substrings of each other (as in the list “car, cart, art”).

The proper way to design for this situation, in a relational model, is to normalize the email addresses into their own table, with a foreign key relationship to the user table.
This is a design strategy that is frequently applied to many situations in standard relational database design, even recursively: if you sense a one-to-many relationship in an attribute, break it out into two relations with a foreign key.

The trouble with this pattern, however, is that it still does not elegantly serve all the possible use cases of such data, especially in situations with a low cardinality; either it is overkill, or it is a clumsy way to store data. In the above example, there is a very small set of things that we might typically do with email addresses, including:

• Return the user, along with their one “primary” email address, for normal operations involving sending an email to the user.
• Return the user with a list of all their email addresses, for showing on a “profile” screen, for example.
• Find which user (if any) has a given email address.

The first situation requires an additional attribute along the lines of is_primary on the email table, not to mention logic to ensure that only one email tuple per user is marked as primary (which cannot be done natively in a relational database, because a UNIQUE constraint on the user_id and the is_primary field would only allow one primary and one non-primary email address per user_id). Alternately, a primary_email field can be kept on the User table, acting as a cache of which email address is the primary one; this too requires coordination by code to ensure that this field actually exists in the User_Email table, etc.

To use standard SQL to return a single tuple containing the user and all of their email addresses, comma delimited like our original (“wrong”) design concept, is actually quite difficult under this two-table structure. Standard SQL has no way of rendering this output, which is surprising considering how common it is. The only mechanisms would be constructing intermediate temporary tables of the information, looping through records of the join relation and outputting one tuple per user_id with the concatenation of email addresses as an attribute.

Under key/value stores, we have a different paradigm entirely for this problem, and one which much more closely matches the real-world uses of such data. We can simply model the email attribute as a substructure: a list of emails within the attribute. For example, Google App Engine has a “List” type that can store exactly this type of information as an attribute:

    from google.appengine.ext import db

    class User(db.Model):
        name = db.StringProperty()
        emails = db.StringListProperty()  # ordered list of strings

The query system then has the ability to not only return the contained lists as structured data, but also to do membership queries, such as:

    results = db.GqlQuery("SELECT * FROM User WHERE emails = 'homer@simpson.com'")

Since order is preserved, the semantics of “primary” versus “additional” can be encoded into the order of items, so no additional attribute is needed for this purpose; we can always get the primary email by saying something like “results.get().emails[0]”. In effect, we have expressed our actual data requirements in a much more succinct and powerful way using this notation, without any noticeable loss in precision, abstraction, or expressive power. [11]
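The difference between wildcard matching on a delimited string and true list membership can be shown with plain Python values. This sketch does not use App Engine or SQL; `substring_match` merely mimics the behaviour of a LIKE '%...%' test, and the second email address is a made-up placeholder.

```python
def substring_match(delimited, needle):
    """Mimic LIKE '%needle%' against a comma-delimited string."""
    return needle in delimited


def list_match(values, needle):
    """Membership test against a real list-valued property."""
    return needle in values


# Hypothetical user record with a multi-valued "emails" property.
user = {"name": "Homer", "emails": ["homer@simpson.com", "homer@example.com"]}
delimited = ",".join(user["emails"])

# Both approaches find an address that is really present.
print(substring_match(delimited, "homer@simpson.com"))
print(list_match(user["emails"], "homer@simpson.com"))

# Only the substring test is fooled when one entry contains another.
print(substring_match("car,cart", "art"))
print(list_match(["car", "cart"], "art"))

# Order is preserved, so the first entry can serve as the primary address.
primary_email = user["emails"][0]
print(primary_email)
```

The list form gives correct membership tests and a natural notion of “primary” without any extra attribute.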


Generalized Analytics

If the nature of the analysis falls outside of SQL’s standard set of operations, it can be extremely difficult to produce results within the operational silo of SQL queries. Worse, this has a pernicious effect on the mindset of data developers, sometimes called “SQL Myopia”: if you can’t do it in SQL, you can’t do it.¹ This is unfortunate, because there are many interesting and useful modes of interacting with data sets that are outside of this paradigm; consider matrix transformations, data mining, clustering, Bayesian filtering, probability analysis, etc.

Additionally, besides simply lacking Turing-completeness,² SQL has a long list of faults that non-SQL developers regularly present. These include a verbose, non-customizable syntax; inability to reduce nested constructions to recursive calls, or generally work with graphs, trees, or nested structures; inconsistency in specific implementation between vendors, despite standardization; and so forth. It is no wonder that the moniker for the current non-relational database movement is converging on the tag “NOSQL”: to its critics, SQL is a limited, inelegant language. [11]


Version History

Part of the design of many (but not all) non-relational databases is the explicit inclusion of version history in the storage unit of data. For example, when you store the value 123 in an attribute, and later change it to the value 234, your data store actually now contains both values, each with a timestamp or vector clock version stamp. This approach has many benefits from an efficiency point of view: primary interaction with the database disks is always in write-forward mode, and multi-version concurrency control can be easily modeled with this structure.

From a modeling point of view, however, there are other distinct advantages to this format. One of them is the ability to intentionally keep, and interact with, older versions of data in a structured way. An example of this, which almost certainly uses the versioned characteristics of Google’s Bigtable infrastructure, is Google Docs: any document can be instantly viewed in, or reverted to, its state at any point in its history, giving a granular, infinite “undo”.

Implementing this kind of revision ability in typical relational database applications is prohibitive both from a programming complexity standpoint (this ability must be consciously designed in to each entity that might need it) as well as from a performance standpoint.³ We have two main options when keeping a history of information for a table. On the one hand, we can keep a full additional copy of every row whenever it changes. This can be done in place, by adding an additional component to the primary key which is a timestamp or version number. This is problematic in that all application code that interacts with this entity needs to know about the versioning scheme; it also complicates the indexing of the entities, because relational database storage with a composite primary key including a date is significantly less optimized than for a single integer key. Alternately, the entire-row history method can be done in a secondary table which only keeps historical records, much like a log. This is less obtrusive on the application (which need not even be aware of its existence, especially if it is produced via a database-level procedure or trigger), and has the benefit that it can be populated asynchronously. However, both of these cases require O(sn) storage, where s is the row size and n is the number of updates. For large row sizes, this approach can be prohibitive.

The other mechanism for doing this is to keep what amounts to an Entity/Attribute/Value table for the historical changes: a table where only the changed value is kept. This is easier to do in situations where the table design itself is already in the EAV paradigm, but can still be done dynamically (if not efficiently) by using the string name of the updated attribute. For sparsely updated tables, this approach does save space over the entire-row versions, but it suffers from the drawback that any use of this data via interactive SQL queries is nearly impossible.

Overall, the non-relational database stores that support column-based version history have a huge advantage in any situations where the application might need this level of historical data snapshots. [11]

Footnote 1: Note that this is not a fault of the relational model itself, only of SQL, which is ultimately just one possible declarative grammar for interacting with relational structures.

Footnote 2: For the record, this lack of Turing-completeness is by design, so that all queries would be able to run in bounded time; never mind that every major commercial vendor has extended SQL with operations that do make it Turing complete, albeit still awkward.
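Column-based version history of this kind can be sketched as a store that appends a new (version, value) pair on every write instead of overwriting. The class below is a hypothetical in-memory illustration: a simple counter stands in for timestamps or vector clocks, and nothing here reflects how Bigtable is actually implemented.

```python
import itertools


class VersionedStore:
    """Keeps every value ever written to a (row, column) cell."""

    def __init__(self):
        self._cells = {}                  # (row, column) -> [(version, value), ...]
        self._clock = itertools.count(1)  # logical clock standing in for timestamps

    def put(self, row, column, value):
        version = next(self._clock)
        # Write-forward: append the new version, never modify older ones.
        self._cells.setdefault((row, column), []).append((version, value))
        return version

    def get(self, row, column, as_of=None):
        """Return the latest value, or the value as it stood at version `as_of`."""
        for version, value in reversed(self._cells.get((row, column), [])):
            if as_of is None or version <= as_of:
                return value
        return None

    def history(self, row, column):
        return list(self._cells.get((row, column), []))


store = VersionedStore()
v1 = store.put("row1", "amount", 123)
v2 = store.put("row1", "amount", 234)
print(store.get("row1", "amount"))            # current value
print(store.get("row1", "amount", as_of=v1))  # value before the change
print(store.history("row1", "amount"))
```

Only the changed cell is stored per update, so history costs grow with the size of the changed values rather than with the full row size s of the O(sn) schemes described above.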


Predictable Scalability

These databases are simple and thus scale much better than today’s relational databases. If you are putting together a system in-house and intend to throw dozens or hundreds of servers behind your data store to cope with what you expect will be a massive demand in scale, then consider a key/value store.
Footnote 3: Consider how many traditional products built on relational databases you know of that offer any kind of Undo functionality.


This definitely impacts the modelling concepts supported by the systems, because it elevates scalability concerns to a first-class modeling directive, part of the logical and conceptual modeling process itself. Rather than designing an elegant relational model and only later considering how it might reasonably be “sharded” or replicated in such a way as to provide high availability in various failure scenarios (typically accompanied by great cost, in commercial relational database products), the bedrock of the logical design instead asks: how can we conceive of this data in such a way that it is scalable by its definition?

As an example, consider the mechanism for establishing the locality of transactions in Bigtable and its ilk (including the Google App Engine data store). Obviously, when involving multiple entities in a transaction on a distributed data store, it is desirable to restrict the number of nodes that actually must participate in the transaction. (While protocols do of course exist for distributed transactions, the performance of these protocols suffers immensely as the size of the machine cluster increases, because the risk of a node failure, and thus a timeout on the distributed transaction, increases.) It is therefore most beneficial to couple related entities tightly, and unrelated entities loosely, so that the most common entities to participate in a transaction would be those that are already tightly coupled. In a relational database, you might use foreign key relationships to indicate related entities, but the relationship carries no additional information that might indicate “these two things are likely to participate in transactions together”. By contrast, in Bigtable, this is enabled by allowing entities to indicate an “ancestor” relation chain, of any depth. That is, entity A can declare entity B its “parent”, and henceforth, the data store organizes the physical representation of these entities on one (or a small number of) physical machines, so that they can easily participate in shared transactions. This is a natural design inclination, but one that is not easily expressed in the world of relational databases (you could certainly provide self-relationships on entities, but since SQL does not readily express recursive relationships, that is only beneficial in cases where the self-relationship is a key part of the data design itself, with business import).

Many commercial relational database vendors make the claim that their solutions are highly scalable. This is true, but there are two caveats. First, of course, is cost: sharded, replicated instances of Oracle or DB2 are not a cheap commodity, and the cost scales with the load. Second, however, and less obvious, is the predictability factor. This is highly touted by systems such as Project Voldemort, which point out that with a simple data model, as in many non-relational databases, not only can you scale more easily, but you can scale more predictably: the requirements to support additional operations, in terms of CPU and memory, are known fairly exactly, so load planning can be an exact science. Compare this with SQL/relational database scaling, which is highly unpredictable due to the complex nature of the RDBMS engine.

There are, naturally, other criteria that are involved in the quest for performance and scalability, including topics like low-level data storage (b-tree-like storage formats, disk access patterns, solid state storage, etc.); issues with the raw networking of systems and their communications overhead; data reliability, both considered for single-node and multi-node systems, etc.

Because key/value databases easily and dynamically scale, they are also the database of choice for vendors who provide a multi-user, web services platform data store. The database provides a relatively cheap data store platform with massive potential to scale. Users typically only pay for what they use, but their usage can increase as their needs increase. Meanwhile, the vendor can scale the platform dynamically based on the total user load, with little limitation on the entire platform’s size. [2, 11]
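The effect of an “ancestor” chain on data placement can be sketched as follows: every entity is stored on the machine chosen by hashing its root ancestor, so entities in the same chain can share a single-machine transaction. The entity names, the `parents` table and the helper functions are hypothetical, and the placement rule is an assumption made for illustration rather than a description of how Bigtable or App Engine actually assign data to machines.

```python
import hashlib

# Hypothetical entities: each maps to its declared parent (None for a root).
parents = {
    "user:homer": None,
    "email:homer/1": "user:homer",
    "email:homer/2": "user:homer",
    "user:marge": None,
    "email:marge/1": "user:marge",
}


def root_of(entity, parents):
    """Follow the ancestor chain, of any depth, up to the root entity."""
    while parents[entity] is not None:
        entity = parents[entity]
    return entity


def machine_for(entity, parents, machines):
    """Place an entity on the machine chosen by hashing its root ancestor."""
    root = root_of(entity, parents)
    digest = hashlib.sha1(root.encode("utf-8")).hexdigest()
    return machines[int(digest, 16) % len(machines)]


def single_machine_transaction(entities, parents, machines):
    """True if all entities live on one machine, so no distributed protocol is needed."""
    return len({machine_for(e, parents, machines) for e in entities}) == 1


machines = ["m1", "m2", "m3", "m4"]
group = ["user:homer", "email:homer/1", "email:homer/2"]
print(single_machine_transaction(group, parents, machines))
```

A foreign key alone carries no such placement hint; here the declared parent relationship is what drives physical locality.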


Schema Evolution

In addition to the static existence of a database schema, it is also important to consider what happens over time as an application’s needs or requirements change. Non-relational databases have a distinct advantage in this realm, because they offer more options for how the version update should proceed.

To be sure, relational databases have mechanisms for handling ongoing updates to data schema; indeed, one of the strengths of the relational model is that the schema is data: databases keep system tables which define schema metadata, which are handled by the exact same database primitives as user-space tables. This generality has advantages in terms of manageability, but it also provides a clean abstraction that vendors can use to provide valuable schema update facilities. Indeed, commercial RDBMS products have applied a great deal of engineering resources to the problem, and have developed sophisticated mechanisms that allow production databases to ALTER their schema without downtime in most scenarios.⁴

However, there are two issues with the relational database approach to this. First, relational database schemas exist in only one state at any given time. This means that if the specific form of an attribute changes, it must change immediately for all records, even in cases where the new form of the attribute would rightfully require processing that the database cannot do (for example, application-specific business logic). It also implies that if there is a high-volume update, such as one that might need to write many gigabytes of changed data back to disk, the RDBMS is obligated to do this operation atomically and in real-time (because DDL updates are transactional); regardless of how efficiently implemented it is, this type of operation cannot be made seamless in a highly transactional production environment.

Second, the release of relational database schema changes typically requires precise coordination with application-layer code; the code version must exactly match the data version. In any highly available application, there is a high likelihood that this implies downtime,⁵ or at least advanced operational coordination that takes a great deal of precision and energy.

Non-relational databases, by comparison, can use a very different approach for schema versioning. Because the schema (in many cases) is not enforced at the data engine level, it is up to the application to enforce (and migrate) the schema. Therefore, a schema change can be gradually introduced by code that understands how to interact with both the N − 1 version and the N version, and leaves each entity updated as it is touched. “Gardener” processes can then periodically sweep through the data store, updating nodes as a lower-priority process.

Naturally, this approach produces more complex code in the short term, especially if the schema of the data is relied upon by analytical (map/reduce) jobs. But in many cases, the knowledge that no downtime will be required during a schema evolution is worth the additional complexity. In fact, this approach might be seen to encourage a more agile development methodology, because each change to the internal schema of the application’s data is bundled with the update to the codebase, and can be collectively versioned and managed accordingly.

Footnote 4: Non-commercial databases such as MySQL also have such mechanisms, but in general their methods are much less sophisticated, often requiring downtime to do even simple operations such as rebuilding indices.
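Gradual migration from schema version N − 1 to N might look like the sketch below. Everything in it is an assumption made for illustration: the store is a plain dictionary, version 1 entities are taken to hold a single `email` string, and version 2 entities hold an `emails` list.

```python
CURRENT_VERSION = 2


def migrate(entity):
    """Return the entity in the current schema, upgrading a version 1 entity if needed."""
    if entity.get("schema_version", 1) < CURRENT_VERSION:
        entity = dict(entity)  # work on a copy, leave the stored object alone
        email = entity.pop("email", None)
        entity["emails"] = [email] if email else []
        entity["schema_version"] = CURRENT_VERSION
    return entity


def load(store, key):
    """Read an entity and leave it upgraded in the store as it is touched."""
    entity = store[key]
    upgraded = migrate(entity)
    if upgraded is not entity:
        store[key] = upgraded
    return upgraded


def gardener(store):
    """Low-priority sweep that upgrades entities nobody has touched yet."""
    for key in list(store):
        load(store, key)


store = {
    "u1": {"name": "Homer", "email": "homer@simpson.com"},         # old schema
    "u2": {"name": "Marge", "emails": [], "schema_version": 2},    # new schema
}
print(load(store, "u1"))
gardener(store)
print(store)
```

The application code carries the knowledge of both versions, which is the extra short-term complexity the text refers to.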


More Natural Fit with Code

Relational data models and application code object models are typically built differently, which leads to incompatibilities. Developers overcome these incompatibilities with code that maps relational models to their object models, a process commonly referred to as object-to-relational mapping. This process, which essentially amounts to “plumbing” code and has no clear and immediate value, can take up a significant chunk of the time and effort that goes into developing the application. On the other hand, many key/value databases retain data in a structure that maps more directly to object classes used in the underlying application code, which can significantly reduce development time. [2]

Footnote 5: The exception to this is that, thanks to the relational model’s implicit lack of attribute order, there are situations in which new attributes can be added and it is guaranteed that no application code would even know of the existence of the new attributes, let alone be affected by them. This is a case where the relational model has the upper hand; however, because it is not a comprehensive solution for every situation, the end result is that, for safety, most relational database schema updates are treated as downtime events.
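As a rough illustration of how directly an application object can map onto a key/value entry, the sketch below serializes a small class to JSON and back. The `User` class and the dictionary standing in for the key/value database are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class User:
    name: str
    emails: list = field(default_factory=list)


def save(store, key, user):
    """Store the object as one JSON value under one key."""
    store[key] = json.dumps(asdict(user))


def load(store, key):
    """Rebuild the object from its stored JSON value."""
    return User(**json.loads(store[key]))


store = {}  # assumption: a dict plays the role of the key/value database
save(store, "user:homer", User("Homer", ["homer@simpson.com"]))
restored = load(store, "user:homer")
print(restored)
```

There is no separate mapping layer between tables and classes here; the stored value has the same shape as the object.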



The inherent constraints of a relational database ensure that data at the lowest level have integrity. Data that violate integrity constraints cannot physically be entered into the database. These constraints don’t exist in a key/value database, so the responsibility for ensuring data integrity falls entirely to the application. But application code often carries bugs. Bugs in a properly designed relational database usually don’t lead to data integrity issues; bugs in a key/value database, however, quite easily lead to data integrity issues.

One of the other key benefits of a relational database is that it forces you to go through a data modeling process. If done well, this modeling process creates in the database a logical structure that reflects the data it is to contain, rather than reflecting the structure of the application. Data, then, become somewhat application-independent, which means other applications can use the same data set and application logic can be changed without disrupting the underlying data model. To facilitate this process with a key/value database, try replacing the relational data modeling exercise with a class modeling exercise, which creates generic classes based on the natural structure of the data.

And don’t forget about compatibility. Unlike relational databases, cloud-oriented databases have little in the way of shared standards. While they all share similar concepts, they each have their own API, specific query interfaces, and peculiarities. So, you will need to really trust your vendor, because you won’t simply be able to switch down the line if you’re not happy with the service. And because almost all current key/value databases are still in beta, that trust is far riskier than with old-school relational databases. [2]


Limitations on Analytics

In the cloud, key/value databases are usually multi-tenanted, which means that a lot of users and applications will use the same system. To prevent any one process from overloading the shared environment, most cloud data stores strictly limit the total impact that any single query can cause. For example, with SimpleDB, you can’t run a query that takes longer than 5 seconds. With Google’s AppEngine Datastore, you can’t retrieve more than 1000 items for any given query. These limitations aren’t a problem for your bread-and-butter application logic (adding, updating, deleting, and retrieving small numbers of items).

But what happens when your application becomes successful? You have attracted many users and gained lots of data, and now you want to create new value for your users or perhaps use the data to generate new revenue. You may find yourself severely limited in running even straightforward analysis-style queries. Things like tracking usage patterns and providing recommendations based on user histories may be difficult at best, and impossible at worst, with this type of database platform. In this case, you will likely have to implement a separate analytical database, populated from your key/value database, on which such analytics can be executed.

Think in advance about where and how you would be able to do that. Would you host it in the cloud or invest in on-site infrastructure? Would latency between you and the cloud-service provider pose a problem? Does your current cloud-based key/value database support this? If you have 100 million items in your key/value database, but can only pull out 1000 items at a time, how long would queries take? Ultimately, while scale is a consideration, don’t put it ahead of your ability to turn data into an asset of its own. All the scaling in the world is useless if your users have moved on to your competitor because it has cooler, more personalized features. [2]
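The question of how long a full scan would take can be roughed out with simple arithmetic. The sketch below uses the 100 million items and the 1000-item query limit from the text; the 200 ms round trip per query is purely an assumed figure for illustration, not a measured value for any particular service.

```ruby
# Back-of-the-envelope estimate for scanning a whole key/value store
# through a paged query interface.
total_items       = 100_000_000  # from the text
items_per_query   = 1_000        # per-query limit from the text
seconds_per_query = 0.2          # ASSUMPTION: 200 ms round trip per query

# Every query returns at most items_per_query items, so round up.
queries_needed = (total_items.to_f / items_per_query).ceil
total_seconds  = queries_needed * seconds_per_query

puts "queries needed: #{queries_needed}"
puts "sequential scan time: #{(total_seconds / 3600.0).round(2)} hours"
```

Under these assumptions a single analytical pass needs 100,000 round trips, which is why a separate analytical database is usually the more practical route.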



Chapter 5 Conclusion
5.1 Data inconsistency

Data inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running. Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer needs to be aware of which consistency guarantees are provided by the storage systems, and these need to be taken into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session-level consistency and monotonic reads, which provide better tools for the developer. Many times the application is capable of handling the eventual consistency guarantees of the storage system without any problem. A specific popular case is a Web site in which we can have the notion of user-perceived consistency. In this scenario the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system before the next read is expected. [4]

BigTable inspired many developers to write their own implementations of this data model; amongst the most popular are HBase, Hypertable and Cassandra. The lack of a predefined schema can make these databases attractive in applications where the attributes of objects are not known in advance, or change frequently. Document databases have a related data model (although the way they handle concurrency and distributed servers can be quite different): a BigTable row with its arbitrary number of columns/attributes corresponds to a document in a document database, which is typically a tree of objects containing attribute values and lists, often with a mapping to JSON or XML. Open source document databases include Project Voldemort, CouchDB, MongoDB, ThruDB and Jackrabbit.

How is this different from just dumping JSON strings into MySQL? Document databases can actually work with the structure of the documents, for example extracting, indexing, aggregating and filtering based on attribute values within the documents. Alternatively you could of course build the attribute indexing yourself, but I wouldn’t recommend that unless it makes working with your legacy code easier.

The big limitation of BigTables and document databases is that most implementations cannot perform joins or transactions spanning several rows or documents. This restriction is deliberate, because it allows the database to do automatic partitioning, which can be important for scaling (see Section 3.4 on distributed key-value stores). If the structure of your data is lots of independent documents, this is not a problem; but if your data fits nicely into a relational model and you need joins, please don’t try to force it into a document model.
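The difference between storing opaque JSON strings and storing documents the database can look inside may be sketched as follows. This is a plain-Ruby illustration using only the standard `json` library; `build_index`, the sample documents and their attribute names are hypothetical and do not correspond to the API of any particular document database.

```ruby
require "json"

# Documents as opaque JSON strings, as they would sit in a MySQL text column.
raw_rows = [
  '{"id": 1, "type": "post", "author": "alice", "tags": ["nosql", "couchdb"]}',
  '{"id": 2, "type": "post", "author": "bob", "tags": ["mysql"]}',
  '{"id": 3, "type": "comment", "author": "alice", "post_id": 1}'
]

# A document store parses each document, so it can work with its structure.
documents = raw_rows.map { |row| JSON.parse(row) }

# Hypothetical secondary index: attribute value -> ids of matching documents.
# Documents that lack the attribute are simply skipped (no schema required).
def build_index(documents, attribute)
  index = Hash.new { |hash, key| hash[key] = [] }
  documents.each do |doc|
    next unless doc.key?(attribute)
    index[doc[attribute]] << doc["id"]
  end
  index
end

author_index = build_index(documents, "author")
p author_index["alice"]   # => [1, 3]

# Filtering and aggregating on attribute values inside the documents.
posts_per_author = documents
  .select { |doc| doc["type"] == "post" }
  .group_by { |doc| doc["author"] }
  .transform_values(&:size)
p posts_per_author
```

With plain strings in a relational column, each of these operations would require the application to fetch and parse every row itself.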


Making a Decision

Ultimately, there are four reasons why you would choose a non-relational key/value database platform for your application:

1. Your data is heavily document-oriented, making it a more natural fit with the key/value data model than the relational data model.
2. Your development environment is heavily object-oriented, and a key/value database could minimize the need for “plumbing” code.
3. The data store is cheap and integrates easily with your vendor’s web services platform.
4. Your foremost concern is on-demand, high-end scalability; that is, large-scale, distributed scalability, the kind that can’t be achieved simply by scaling up.

But in making your decision, remember the database’s limitations and the risks you face by branching off the relational path. For all other requirements, you are probably best off with the good old RDBMS. So, is the relational database doomed? Clearly not. Well, not yet at least. [2]

Appendix A Different popular NoSQL databases
A.1 The Shortlist

Table A.1 provides a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren’t suitable for low-latency data serving, but are interesting nonetheless. [3]


Cloud-Service Contenders

A number of web service vendors now offer multi-tenanted key/value databases on a pay-as-you-go basis. Most of them meet the criteria discussed to this point, but each has unique features and varies from the general standards described thus far. Let’s take a look now at particular databases, namely SimpleDB, Google AppEngine Datastore, and SQL Data Services. [2]


Amazon: SimpleDB

SimpleDB is an attribute-oriented key/value database available on the Amazon Web Services platform. SimpleDB is still in public beta; in the meantime, users can sign up online for a “free” version – free, that is, until you exceed your usage limits.

| Name | Language | Fault-tolerance | Persistence | Client protocol | Data model | Docs | Community |
|---|---|---|---|---|---|---|---|
| Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no |
| Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append only log) | HTTP | blob | - | Nokia, no |
| Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | - | OnScale, no |
| Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | - | no |
| Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ASCII, Thrift | blob | - | Powerset, no |
| MemcacheDB | C | replication | BerkeleyDB | Memcached | blob | B | some |
| ThruDB | C++ | replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure |
| CouchDB | Erlang | replication, partitioning? | Custom on-disk | HTTP, JSON | Document oriented (JSON) | - | Apache, yes |
| Cassandra | Java | replication, partitioning | Custom on-disk | Thrift | BigTable meets Dynamo | - | Facebook, no |
| HBase | Java | replication, partitioning | Custom on-disk | Custom API, Thrift, Rest | BigTable | - | Apache, yes |
| Hypertable | C++ | replication, partitioning | Custom on-disk | Thrift, other | BigTable | - | Zvents, Baidu, yes |

Table A.1: Some NoSQL initiatives

SimpleDB has several limitations. First, a query can only execute for a maximum of 5 seconds. Secondly, there are no data types apart from strings. Everything is stored, retrieved, and compared as a string, so date comparisons won’t work unless you convert all dates to ISO 8601 format. Thirdly, the maximum size of any string is limited to 1024 bytes, which limits how much text (e.g. product descriptions) you can store in a single attribute. But because the schema is dynamic and flexible, you can get around the limit by adding “ProductDescription1”, “ProductDescription2”, etc. The catch is that an item is limited to 256 attributes. While SimpleDB is in beta, domains can’t be larger than 10GB, and entire databases cannot exceed 1TB.

One key feature of SimpleDB is that it uses an eventual consistency model. This consistency model is good for concurrency, but means that after you have changed an attribute for an item, those changes may not be reflected in read operations that immediately follow. While the chances of this actually happening are low, you should account for such situations. For example, you don’t want to sell the last concert ticket in your event booking system to five people because your data wasn’t consistent at the time of sale. [2]
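The two workarounds mentioned above (ISO 8601 dates and numbered attributes) can be illustrated in plain Ruby. This is a sketch only: it does not call SimpleDB, `split_into_attributes` is a hypothetical helper, and it assumes ASCII text so that characters and bytes coincide.

```ruby
require "date"

# With string-only comparison, values must sort lexicographically in the
# same order as their real meaning.
us_style = ["9/29/2010", "10/1/2010"]
iso_8601 = [Date.new(2010, 9, 29).iso8601, Date.new(2010, 10, 1).iso8601]

p us_style.sort  # => ["10/1/2010", "9/29/2010"]   (October sorts before September)
p iso_8601.sort  # => ["2010-09-29", "2010-10-01"] (chronological)

ATTRIBUTE_LIMIT = 1024  # bytes per attribute value, from the text

# Hypothetical helper: spread a long ASCII text over numbered attributes
# such as ProductDescription1, ProductDescription2, ...
def split_into_attributes(base_name, text, limit = ATTRIBUTE_LIMIT)
  chunks = text.scan(/.{1,#{limit}}/m)
  chunks.each_with_index.map { |chunk, i| ["#{base_name}#{i + 1}", chunk] }.to_h
end

description = "Hand-made oak table. " * 150
attributes  = split_into_attributes("ProductDescription", description)

puts attributes.keys.join(", ")
# Reassembling the text means reading the attributes back in numeric order.
puts attributes.values.join == description
```

Each extra chunk counts against the 256-attribute limit per item, so the workaround only stretches so far.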


Google AppEngine Data Store

Google’s AppEngine Datastore is built on BigTable, Google’s internal storage system for handling structured data. In and of itself, the AppEngine Datastore is not a direct access mechanism to BigTable, but can be thought of as a simplified interface on top of BigTable. The AppEngine Datastore supports much richer data types within items than SimpleDB, including list types, which contain collections within a single item. You will almost certainly use this data store if you plan on building applications within the Google AppEngine. However, unlike with SimpleDB, you cannot currently interface with the AppEngine Datastore (or with BigTable) using an application outside of Google’s web service platform. [2]


Microsoft: SQL Data Services

SQL Data Services is part of the Microsoft Azure Web Services platform. The SDS service is also in beta and so is free but has limits on the size of databases. SQL Data Services is actually an application itself that sits on top of many SQL servers, which make up the underlying data storage for the SDS platform. While the underlying data stores may be relational, you don’t have access to these; SDS is a key/value store, like the other platforms discussed thus far. Microsoft seems to be alone among these three vendors in acknowledging that while key/value stores are great for scalability, they come at the great expense of data management, when compared to RDBMS. Microsoft’s approach seems to be to strip to the bare bones to get the scaling and distribution mechanisms right, and then over time build up, adding features that help bridge the gap between the key/value store and relational database platform. [2]


Non-Cloud Service Contenders

Outside the cloud, a number of key/value database software products exist that can be installed in-house. Almost all of these products are still young, either in alpha or beta, but most are also open source; having access to the code, you can perhaps be more aware of potential issues and limitations than you would with closed-source vendors. [2]


Tokyo Cabinet

Developed and sponsored by Mixi Inc., Tokyo Cabinet is an incredibly fast and feature-rich database library. [5]

Tokyo Cabinet Highlights

Speed and efficiency are two consistent themes for Tokyo Cabinet. Benchmarks show that it only takes 0.7 seconds to store 1 million records in the regular hash table and 1.6 seconds for the B-Tree engine. To achieve this, the overhead per record is kept as low as possible, ranging between 5 and 20 bytes: 5 bytes for B-Tree, 16 – 20 bytes for the Hash-table engine. And if small overhead is not enough, Tokyo Cabinet also has native support for Lempel-Ziv or BWT compression algorithms, which can reduce your database to 25% of its size (typical text compression rate). Also, it is thread safe (uses pthreads) and offers row-level locking. [5]

Features

• Similar use cases as for BerkeleyDB.
• Disk persistence. Can store data larger than RAM.
• Performs well.
• Actively developed. Lots of developers adding new features (but not bug fixes).
• Similar replication strategy to MySQL. Not useful for scalability as it limits the write throughput to one node.
• Optional compressed pages, so it has some compression advantages. [7]

Hash and B-Tree Database Engines

The Hash database engine is a direct competitor to BerkeleyDB and other key-value stores: one key, one value, no duplicates, and very fast. Functionally, the B-Tree database engine is equivalent to the Hash database. However, because of its underlying structure, the keys can be ordered via a user-specified function, which in turn allows us to do prefix and range matching on a key, as well as traverse the entries in order. Let’s look at some examples:

    require "rubygems"
    require "tokyocabinet"
    include TokyoCabinet

    bdb = BDB::new  # B-Tree database; keys may have multiple values
    bdb.open("casket.bdb", BDB::OWRITER | BDB::OCREAT)

    # store records in the database, allowing duplicates
    bdb.putdup("key1", "value1")
    bdb.putdup("key1", "value2")
    bdb.put("key2", "value3")
    bdb.put("key3", "value4")

    # retrieve all values
    p bdb.getlist("key1")
    # => ["value1", "value2"]

    # range query, find all matching keys
    p bdb.range("key1", true, "key3", true)
    # => ["key1", "key2", "key3"]


Fixed-length and Table Database Engines

Next, we have the ‘fixed length’ engine, which is best understood as a simple array. There is absolutely no hashing and access is done via natural number keys, which also means no key overhead. This method is extremely fast.

Saving the best for last, we have the Table engine, which mimics a relational database, except that it requires no predefined schema (in this, it is a close cousin to CouchDB, which allows arbitrary properties on any object). Each record still has a primary key, but we are allowed to declare arbitrary indexes on our columns, and even perform queries on them:

    require "rubygems"
    require "rufus/tokyo/cabinet/table"

    t = Rufus::Tokyo::Table.new('table.tdb', :create, :write)

    # populate table with arbitrary data (no schema!)
    t['pk0'] = { 'name' => 'alfred', 'age' => '22', 'sex' => 'male' }
    t['pk1'] = { 'name' => 'bob', 'age' => '18' }
    t['pk2'] = { 'name' => 'charly', 'age' => '45', 'nickname' => 'charlie' }
    t['pk3'] = { 'name' => 'doug', 'age' => '77' }
    t['pk4'] = { 'name' => 'ephrem', 'age' => '32' }

    # query table for age >= 32
    p t.query { |q|
      q.add_condition 'age', :numge, '32'
      q.order_by 'age'
    }
    # => [ {"name"=>"ephrem", :pk=>"pk4", "age"=>"32"},
    #      {"name"=>"charly", :pk=>"pk2", "nickname"=>"charlie", "age"=>"45"},
    #      {"name"=>"doug", :pk=>"pk3", "age"=>"77"} ]

    t.close




CouchDB

CouchDB is a free, open-source, distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Derived from the key/value store, it uses JSON to define an item’s schema. Data is stored in ‘documents’, which are essentially key/value maps themselves. CouchDB is meant to bridge the gap between document-oriented and relational databases by allowing “views” to be dynamically created using JavaScript. These views map the document data onto a table-like structure that can be indexed and queried. It can also do full text indexing of the documents.

At the moment, CouchDB isn’t really a distributed database. It has replication functions that allow data to be synchronized across servers, but this isn’t the kind of distribution needed to build highly scalable environments. The CouchDB community, though, is no doubt working on this. [3, 2]
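The idea of a view that maps documents onto a table-like, sorted structure can be sketched in Ruby. Real CouchDB views are written in JavaScript; the code below is only a simulation of the concept, and `build_view`, `posts_by_author` and the sample documents are hypothetical names chosen for illustration.

```ruby
# Schema-free documents, as they might be stored in a document database.
documents = [
  { "_id" => "a1", "type" => "post",    "title" => "Hello", "author" => "alice" },
  { "_id" => "b2", "type" => "comment", "post"  => "a1",    "author" => "bob" },
  { "_id" => "c3", "type" => "post",    "title" => "Again", "author" => "bob" }
]

# Hypothetical view definition: called once per document, it returns zero or
# more [key, value] rows. Documents that do not match contribute nothing.
posts_by_author = lambda do |doc|
  doc["type"] == "post" ? [[doc["author"], doc["title"]]] : []
end

# Building the view: run the map function over every document and keep the
# rows sorted by key, which is what makes the result indexable and queryable.
def build_view(documents, map_function)
  documents.flat_map { |doc| map_function.call(doc) }
           .sort_by { |key, _value| key }
end

view = build_view(documents, posts_by_author)
view.each { |key, value| puts "#{key}\t#{value}" }
```

The resulting rows behave like a two-column table keyed by author, even though the underlying documents share no fixed schema.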


Project Voldemort

Project Voldemort is a distributed key/value database that is intended to scale horizontally across a large number of servers. It spawned from work done at LinkedIn and is reportedly used there for a few systems that have very high scalability requirements. Project Voldemort also uses an eventual consistency model, based on Amazon’s Dynamo. Project Voldemort handles replication and partitioning of data, and appears to be well written and designed using Java. [3, 2, 10]



MongoDB

Mongo is the database system being developed at 10gen by Geir Magnusson and Dwight Merriman. Like CouchDB, Mongo is a document-oriented JSON database, except that it is designed to be a true object database, rather than a pure key/value store. Originally, 10gen focused on putting together a complete web services stack; more recently, though, it has refocused mainly on the Mongo database. [2]

Features

• Written in C++.
• Significantly faster than CouchDB.
• JSON and BSON (binary JSON-ish) formats.
• Asynchronous replication with auto-sharding.
• Supports indexes. Querying a property is quick because an index is automatically kept on updates. Trades off some write speed for more consistent read speed.
• Documents can be nested, unlike in CouchDB, which requires applications to keep track of relationships. The advantage is that the whole object doesn’t have to be written and read, because the system knows about the relationship. An example is a blog post and its comments: in CouchDB the post and comments are stored together, and creating a view walks through all the comments even though you are only interested in the blog post. The result is better write and query performance.
• More advanced queries than CouchDB. [7]



Drizzle

Drizzle can be thought of as a counter-approach to the problems that key/value stores are meant to solve. Drizzle began life as a spin-off of the MySQL (6.0) relational database. Over the last few months, its developers have removed a host of non-core features (including views, triggers, prepared statements, stored procedures, query cache, ACL, and a number of data types), with the aim of creating a leaner, simpler, faster database system. Drizzle can still store relational data; as Brian Aker of MySQL/Sun puts it, “There is no reason to throw out the baby with the bath water.” The aim is to build a semi-relational database platform tailored to web- and cloud-based apps running on systems with 16 cores or more. [2]



Cassandra

The source code for Cassandra was released recently by Facebook, where it is used for inbox search. It is BigTable-esque, but uses a distributed hash table (DHT) and so doesn’t need a central server. It was originally developed at Facebook by some of the key engineers behind Amazon’s famous Dynamo database.

Cassandra can be thought of as a huge 4-or-5-level associative array, where each dimension of the array gets a free index based on the keys in that level. The real power comes from that optional 5th level in the associative array, which can turn a simple key-value architecture into an architecture where you can now deal with sorted lists, based on an index of your own specification. That 5th level is called a SuperColumn, and it’s one of the reasons that Cassandra stands out from the crowd.

Cassandra has no single points of failure, and can scale from one machine to several thousands of machines clustered in different data centers. It has no central master, so any data can be written to any of the nodes in the cluster, and can be read likewise from any other node in the cluster. It provides knobs that can be tweaked to slide the scale between consistency and availability, depending on a particular application and problem domain. And it provides a high availability guarantee: if one node goes down, another node will step in to replace it smoothly. [3, 8]

Pros:

• Open source.
• Incrementally scalable: as data grows one can add more nodes to the storage mesh.
• Minimal administration: because it’s incremental we don’t have to do a lot of up-front planning for migration. [7]

Cons:

• Not polished yet. It was built for inbox searching so may not work well for other use cases.
• No compression yet. [7]
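The "4-or-5-level associative array" picture can be made concrete with nested Ruby hashes. This is only a mental model, not Cassandra's client API; the level names in the comments follow Cassandra's usual terminology, while `nested_hash` and the sample names ("Blog", "Users", "Timeline") are hypothetical.

```ruby
# Auto-vivifying nested Hash: missing levels are created on first access.
def nested_hash
  Hash.new { |hash, key| hash[key] = nested_hash }
end

store = nested_hash

# Four levels: keyspace -> column family -> row key -> column => value
store["Blog"]["Users"]["alice"]["email"] = "alice@example.com"

# Five levels: the extra SuperColumn level groups columns under a name
# that can be sorted, here a timestamp string.
store["Blog"]["Timeline"]["alice"]["2010-09-29T10:00:00"]["body"] = "first post"
store["Blog"]["Timeline"]["alice"]["2010-09-28T09:00:00"]["body"] = "earlier post"

# Reading the SuperColumn names in sorted order yields a sorted list per row.
entries = store["Blog"]["Timeline"]["alice"]
entries.keys.sort.each { |name| puts "#{name}: #{entries[name]["body"]}" }
```

In this model the fifth level is what turns a plain key-to-columns lookup into an ordered list of entries per key.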



BigTable

• Google BigTable: manages data across many nodes.
• Paxos (Chubby): distributed transaction algorithm that manages locks across systems.
• BigTable characteristics:
  – Stores data in tablets using GFS, a distributed file system.
  – Compression: great gains in throughput. More can be stored, and the IO bottleneck is reduced because less data has to be stored, so the disks are accessed less and performance improves.
  – Single master: one node knows everything about all the other nodes (backed up and cached).
  – Hybrid between row and column database:
    ∗ Row database: stores objects together.
    ∗ Column database: stores attributes of objects together. Makes sequential retrieval very fast, allows very efficient compression, reduces disk seeks and random IO.
  – Versioning.
  – Bloom filters: allow data to be distributed across a bunch of nodes. A Bloom filter is a calculation on data that probabilistically maps the data to the nodes it can be found on.
  – Eventually consistent: append-only system using a row time stamp. When a client queries it gets several versions, and the client is in charge of picking the most recent.
• Pros:
  – Compression is available.
  – Clients are simple.
  – Integrates with map-reduce.
• Cons:
  – Proprietary to Google, so unavailable for our own use. [7]
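A Bloom filter of the kind mentioned in the list above can be sketched in a few lines. The arrangement of one filter per node is an assumption made to mirror the description given here, not a statement of how BigTable is implemented; the class, the node names, the filter size and the use of MD5 to derive bit positions are all illustrative choices.

```ruby
require "digest/md5"

# Minimal Bloom filter: each key sets k positions in a fixed-size bit array.
class BloomFilter
  def initialize(size = 1024, hashes = 3)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, false)
  end

  def add(key)
    positions(key).each { |i| @bits[i] = true }
  end

  # false means "definitely not present"; true means "possibly present".
  def include?(key)
    positions(key).all? { |i| @bits[i] }
  end

  private

  # Derive k positions by salting the key with the hash number.
  def positions(key)
    (0...@hashes).map do |n|
      Digest::MD5.hexdigest("#{n}:#{key}").to_i(16) % @size
    end
  end
end

# ASSUMPTION: one filter per node, consulted before contacting the node.
nodes = { "node-a" => BloomFilter.new, "node-b" => BloomFilter.new }
nodes["node-a"].add("user:42")
nodes["node-b"].add("user:99")

candidates = nodes.select { |_name, filter| filter.include?("user:42") }.keys
p candidates  # always contains "node-a"; other nodes appear only as rare false positives
```

The filter never gives a false negative, so a node can safely be skipped whenever its filter answers "no".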



Dynamo

• Amazon’s Dynamo: a giant distributed hash table.
• Uses consistent hashing to distribute data to one or more nodes for redundancy and performance.
  – Consistent hashing: a ring of nodes, and a hash function picks which node(s) store the data.
  – Consistency between nodes is based on vector clocks and read repair.
  – Vector clocks: a time stamp on every row for every node that has written to it.
  – Read repair: when a client does a read and the nodes disagree on the data, it’s up to the client to select the correct data and tell the nodes the new correct state.
• Pros:
  – No master: eliminates single point of failure.
  – Highly available for writes: this is the partition failure aspect of CAP. You can write to many nodes at once, so depending on the number of replicas maintained (which is configurable) you should always be able to write somewhere. So users will never see a write failure.
  – Relatively simple, which is why we see so many clones.
• Cons:
  – Proprietary.
  – Clients have to be smart to handle read repair, rebalancing a cluster, hashing, etc. Client proxies can handle these responsibilities but that adds another hop.
  – No compression, so IO is not reduced.
  – Not suitable for column-like workloads; it’s just a key-value store, so it’s not optimized for analytics. Aggregate queries, for example, aren’t in its wheelhouse. [7]
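The consistent hashing scheme described in the list above can be sketched as a small ring. This is a minimal illustration, not Dynamo's implementation: it omits virtual nodes, and the `HashRing` class, the node names and the choice of MD5 as the hash function are assumptions made for the example.

```ruby
require "digest/md5"

# Minimal consistent-hash ring. Each node sits on the ring at the position
# given by hashing its name; a key is stored on the first N nodes found
# walking clockwise from the key's own position.
class HashRing
  def initialize(node_names)
    @ring = node_names.map { |name| [position(name), name] }.sort
  end

  def nodes_for(key, replicas = 2)
    key_position = position(key)
    # First node at or after the key; wrap around to the start if none.
    start = @ring.index { |node_position, _name| node_position >= key_position } || 0
    @ring.rotate(start).first(replicas).map { |_position, name| name }
  end

  private

  def position(value)
    Digest::MD5.hexdigest(value).to_i(16)
  end
end

ring    = HashRing.new(%w[node-a node-b node-c node-d])
smaller = HashRing.new(%w[node-a node-b node-c])  # node-d has left the cluster

p ring.nodes_for("user:42")  # two distinct nodes holding the replicas

# Only keys whose primary node was node-d need to move when it leaves.
keys  = (1..1000).map { |i| "key#{i}" }
moved = keys.count { |k| ring.nodes_for(k, 1) != smaller.nodes_for(k, 1) }
puts "#{moved} of #{keys.size} keys changed primary node"
```

Because only the departed node's share of the keys is reassigned, nodes can join or leave without reshuffling the whole data set.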



List of Tables
1.1 Fundamental differences between relational databases and key/value stores
2.1 Data access for relational databases and key/value stores
2.2 Data access for relational databases and key/value stores
A.1 Some NoSQL initiatives



[1] NoSQL Databases http://nosql-database.org/
[2] “Is the Relational Database Doomed?”, Tony Bain http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
[3] “Anti-RDBMS: A list of distributed key-value stores”, Richard Jones http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-store
[4] “Eventually Consistent”, Werner Vogels (Amazon) http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
[5] “Tokyo Cabinet: Beyond Key-Values Store” http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/
[6] “Are cloud based memory architectures the next big thing?”, High Scalability Blog http://highscalability.com/blog/2009/3/16/are-cloud-based-memory-architectures-the-next-big-thing.html
[7] “Drop ACID and think about Data”, High Scalability Blog http://highscalability.com/drop-acid-and-think-about-data
[8] “Thoughts on NOSQL”, Eric Florenzano http://www.eflorenzano.com/blog/post/my-thoughts-nosql/
[9] “Should you go beyond relational databases?”, Martin Kleppmann http://thinkvitamin.com/dev/should-you-go-beyond-relational-databases/
[10] “Notes from the NoSQL Meetup”, Toby Negrin http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html
[11] “The mixed blessing of Non-Relational Databases”, Ian Thomas Varley http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf