
CS 708 Seminar

NoSQL

Prepared by,
Fayaz Yusuf Khan,
Reg.No: CSU071/16
Guided by,
Nimi Prakash P.
System Analyst
Computer Science & Engineering Department

September 29, 2010


NoSQL DEFINITION

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent/BASE (not ACID), huge amounts of data, and more. So the misleading term "NoSQL" (the community now translates it mostly with "not only SQL") should be seen as an alias for something like the definition above. [1]
Contents

1 Introduction 5
1.1 Why relational databases are not enough . . . . . . . . . . . . . 6
1.2 What NoSQL has to offer . . . . . . . . . . . . . . . . . . . . . 7
1.3 ACIDs & BASEs . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 ACID Properties of Relational Databases . . . . . . . . . 7
1.3.2 CAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 BASE . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 NoSQL Features 11
2.1 No entity joins . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Eventual Consistency . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Historical Perspective . . . . . . . . . . . . . . . . . . . 13
2.2.2 Consistency — Client and Server . . . . . . . . . . . . . 15
2.3 Cloud Based Memory Architecture . . . . . . . . . . . . . . . . 19
2.3.1 Memory Based Architectures . . . . . . . . . . . . . . . 19

3 Different NoSQL Database Choices 21


3.1 Document Databases & BigTable . . . . . . . . . . . . . . . . . 21
3.2 Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Distributed Key-Value Stores . . . . . . . . . . . . . . . . . . . 22

4 NoSQL: Merits & Demerits 25


4.1 Merits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 Semi-Structured Data . . . . . . . . . . . . . . . . . . . 25
4.1.2 Alternative Model Paradigms . . . . . . . . . . . . . . . 26
4.1.3 Multi-Valued Properties . . . . . . . . . . . . . . . . . . 27
4.1.4 Generalized Analytics . . . . . . . . . . . . . . . . . . . 29
4.1.5 Version History . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.6 Predictable Scalability . . . . . . . . . . . . . . . . . . . 30
4.1.7 Schema Evolution . . . . . . . . . . . . . . . . . . . . . 32

4.1.8 More Natural Fit with Code . . . . . . . . . . . . . . . . 33
4.2 Demerits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Limitations on Analytics . . . . . . . . . . . . . . . . . . 34

5 Conclusion 37
5.1 Data inconsistency . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Making a Decision . . . . . . . . . . . . . . . . . . . . . . . . . 38

A Different popular NoSQL databases 39


A.1 The Shortlist . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Cloud-Service Contenders . . . . . . . . . . . . . . . . . . . . . 39
A.2.1 Amazon: SimpleDB . . . . . . . . . . . . . . . . . . . . 39
A.2.2 Google AppEngine Data Store . . . . . . . . . . . . . . . 41
A.2.3 Microsoft: SQL Data Services . . . . . . . . . . . . . . . 41
A.3 Non-Cloud Service Contenders . . . . . . . . . . . . . . . . . . . 42
A.3.1 Tokyo Cabinet . . . . . . . . . . . . . . . . . . . . . . . 42
A.3.2 CouchDB . . . . . . . . . . . . . . . . . . . . . . . . . . 45
A.3.3 Project Voldemort . . . . . . . . . . . . . . . . . . . . . 45
A.3.4 Mongo . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
A.3.5 Drizzle . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
A.3.6 Cassandra . . . . . . . . . . . . . . . . . . . . . . . . . 47
A.3.7 BigTable . . . . . . . . . . . . . . . . . . . . . . . . . . 47
A.3.8 Dynamo . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Chapter 1

Introduction

The history of the relational database has been one of continual adversity: ini-
tially, many claimed that mathematical set-based models could never be the basis
for efficient database implementations; later, aspiring object oriented databases
claimed they would remove the “middle man” of relational databases from the
OO design and persistence process. In all of these cases, through a combination
of sound concepts, elegant implementation, and general applicability, relational
databases have become and remained the lingua franca of data storage and ma-
nipulation.

Most recently, a new contender has arisen to challenge the supremacy of re-
lational databases. Referred to generally as “non-relational databases” (among
other names), this class of storage engine seeks to break down the rigidity of
the relational model, in exchange for leaner models that can perform and scale
at higher levels, using various models (including key/value pairs, sharded arrays,
and document-oriented approaches) which can be created and read efficiently as
the basic unit of data storage. Primarily, these new technologies have arisen in
situations where traditional relational database systems would be extremely chal-
lenging to scale to the degree needed for global systems (for example, at com-
panies such as Google, Yahoo, Amazon, LinkedIn, etc., which regularly collect,
store and analyze massive data sets with extremely high transactional throughput
and low latency). As of this writing, there exist dozens of variants of this new
model, each with different capabilities and trade-offs, but all with the general
property that traditional relational design — as practiced on relational database
management systems like Oracle, Sybase, etc. — is neither possible nor desired.
[11]

1.1 Why relational databases are not enough

Even though RDBMS have provided database users with the best mix of sim-
plicity, robustness, flexibility, performance, scalability, and compatibility, their
performance in each of these areas is not necessarily better than that of an al-
ternate solution pursuing one of these benefits in isolation. This has not been
much of a problem so far because the universal dominance of RDBMS has out-
weighed the need to push any of these boundaries. Nonetheless, if you really had
a need that couldn’t be answered by a generic relational database, alternatives
have always been around to fill those niches.
Today, we are in a slightly different situation. For an increasing number
of applications, one of these benefits is becoming more and more critical; and
while still considered a niche, it is rapidly becoming mainstream, so much so
that for an increasing number of database users this requirement is beginning
to eclipse others in importance. That benefit is scalability. As more and more
applications are launched in environments that have massive workloads, such as
web services, their scalability requirements can, first of all, change very quickly
and, secondly, grow very large. The first scenario can be difficult to manage if
you have a relational database sitting on a single in-house server. For example,
if your load triples overnight, how quickly can you upgrade your hardware? The
second scenario can be too difficult to manage with a relational database in
general.
Relational databases scale well, but usually only when that scaling happens on
a single server node. When the capacity of that single node is reached, you need
to scale out and distribute that load across multiple server nodes. This is when
the complexity of relational databases starts to rub against their potential to
scale. Try scaling to hundreds or thousands of nodes, rather than a few, and the
complexities become overwhelming, and the characteristics that make RDBMS
so appealing drastically reduce their viability as platforms for large distributed
systems.
For cloud services to be viable, vendors have had to address this limitation,
because a cloud platform without a scalable data store is not much of a platform
at all. So, to provide customers with a scalable place to store application data,
vendors had only one real option. They had to implement a new type of database
system that focuses on scalability, at the expense of the other benefits that come
with relational databases.
These efforts, combined with those of existing niche vendors, have led to the
rise of a new breed of database management system. [2]

1.2 What NoSQL has to offer
This new kind of database management system is commonly called a key/value
store. In fact, no official name yet exists, so you may see it referred to as
document-oriented, Internet-facing, attribute-oriented, distributed database (al-
though this can be relational also), sharded sorted arrays, distributed hash table,
and key/value database. While each of these names points to specific traits of this
new approach, they are all variations on one theme, which we’ll call key/value
databases.
Whatever you call it, this “new” type of database has been around for a
long time and has been used for specialized applications for which the generic
relational database was ill-suited. But without the scale that web and cloud
applications have brought, it would have remained a mostly unused subset. Now,
the challenge is to recognize whether it or a relational database would be better
suited to a particular application.
Relational databases and key/value databases are fundamentally different and
designed to meet different needs. A side-by-side comparison only takes you so
far in understanding these differences, but to begin, let’s lay one down: [2]

1.3 ACIDs & BASEs


1.3.1 ACID Properties of Relational Databases
• The claim to fame for relational databases is that they make the ACID promise:

– Atomicity — a transaction is all or nothing.
– Consistency — only valid data is written to the database.
– Isolation — pretend all transactions are happening serially and the data is correct.
– Durability — what you write is what you get.

• The problem with ACID is that it guarantees too much; those guarantees trip you up when trying to scale a system across multiple nodes.

• Down time is unacceptable, so the system needs to be reliable. Reliability requires multiple nodes to handle machine failures.

• To make a scalable system that can handle lots and lots of reads and writes requires many more nodes.

Database Definition

Relational Database: A database contains tables, tables contain columns and rows, and rows are made up of column values. Rows within a table all have the same schema.
Key/Value Database: Domains can initially be thought of like a table, but unlike a table you don't define any schema for a domain. A domain is basically a bucket that you put items into. Items within a single domain can have differing schemas.

Relational Database: The data model is well defined in advance. A schema is strongly typed and it has constraints and relationships that enforce data integrity.
Key/Value Database: Items are identified by keys, and a given item can have a dynamic set of attributes attached to it.

Relational Database: The data model is based on a "natural" representation of the data it contains, not on an application's functionality.
Key/Value Database: In some implementations, attributes are all of a string type. In other implementations, attributes have simple types that reflect code types, such as ints, string arrays, and lists.

Relational Database: The data model is normalized to remove data duplication. Normalization establishes table relationships. Relationships associate data between tables.
Key/Value Database: No relationships are explicitly defined between domains or within a given domain.

Table 1.1: Fundamental differences between relational databases and key/value stores

• Once we try to scale ACID across many machines we hit problems with
network failures and delays. The algorithms don’t work in a distributed
environment at any acceptable speed.

1.3.2 CAP
• If we can't have all of the ACID guarantees, it turns out we can have two of the following three characteristics:

– Consistency — your data is correct all the time. What you write is what you read.
– Availability — you can read and write your data all the time.
– Partition Tolerance — if one or more nodes fail, the system still works and becomes consistent when it comes back on-line.

1.3.3 BASE
• The types of large systems based on CAP aren't ACID; they are BASE:

– Basically Available — the system seems to work all the time.
– Soft State — it doesn't have to be consistent all the time.
– Eventually Consistent — it becomes consistent at some later time.

• Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc. [7]

Chapter 2

NoSQL Features

2.1 No entity joins


Key/value databases are item-oriented, meaning all relevant data relating to an
item are stored within that item. A domain (which you can think of as a table)
can contain vastly different items. For example, a domain may contain customer
items and order items. This means that data are commonly duplicated between
items in a domain. This is accepted practice because disk space is relatively
cheap. But this model allows a single item to contain all relevant data, which
improves scalability by eliminating the need to join data from multiple tables.
With a relational database, such data needs to be joined to be able to regroup
relevant attributes.

But while the need for relationships is greatly reduced with key/value databases,
certain ones are inevitable. These relationships usually exist among core enti-
ties. For example, an ordering system would have items that contain data about customers, products, and orders. Whether these reside in the same domain or in separate domains is irrelevant; but when a customer places an order, you would likely not want to store both the customer's and the product's attributes in the same order item.
Instead, orders would need to contain relevant keys that point to the cus-
tomer and product. While this is perfectly doable in a key/value database, these
relationships are not defined in the data model itself, and so the database man-
agement system cannot enforce the integrity of the relationships. This means
you can delete customers and the products they have ordered. The responsibility
of ensuring data integrity falls entirely to the application. [2]
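
To make this concrete, here is a minimal sketch of that pattern using plain Python dictionaries standing in for items in a key/value store (the key names and layout are illustrative, not any particular product's API):

customer = {"key": "customer:42", "name": "Homer Simpson"}
product = {"key": "product:7", "title": "Duff Beer", "price": 4.99}

# The order keeps only references (keys) to the customer and product;
# nothing in the store enforces that those keys still exist.
order = {
    "key": "order:1001",
    "customer_key": customer["key"],
    "product_key": product["key"],
    "quantity": 6,
}

# "Joins" happen in application code: follow the keys yourself.
store = {item["key"]: item for item in (customer, product, order)}
buyer = store[order["customer_key"]]   # -> the customer item

If the customer item is deleted, the order's customer_key silently dangles, which is exactly the integrity burden described above.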

Data Access

Relational Database: Data is created, updated, deleted, and retrieved using SQL.
Key/Value Database: Data is created, updated, deleted, and retrieved using API method calls.

Relational Database: SQL queries can access data from a single table or from multiple tables through table joins.
Key/Value Database: Some implementations provide basic SQL-like syntax for defining filter criteria.

Relational Database: SQL queries include functions for aggregation and complex filtering.
Key/Value Database: Often only basic filter predicates (such as =, ≠, <, >, ≤, and ≥) can be applied.

Relational Database: Usually contains means of embedding logic close to the data storage, such as triggers, stored procedures, and functions.
Key/Value Database: All application and data-integrity logic is contained in the application code.

Table 2.1: Data access for relational databases and key/value stores

Application Interface

Relational Database: Tend to have their own specific API, or make use of a generic API such as OLE-DB or ODBC.
Key/Value Database: Tend to provide SOAP and/or REST APIs over which data access calls can be made.

Relational Database: Data is stored in a format that represents its natural structure, so it must be mapped between the application code structure and the relational structure.
Key/Value Database: Data can be stored more effectively in a structure that is compatible with the application code, requiring only minimal "plumbing" code for the object.

Table 2.2: Application interface for relational databases and key/value stores

2.2 Eventual Consistency
Eventually Consistent - Building reliable distributed systems at a worldwide scale
demands trade-offs between consistency and availability.
At the foundation of Amazon’s cloud computing are infrastructure services
such as Amazon’s S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic
Compute Cloud) that provide the resources for constructing Internet-scale com-
puting platforms and a great variety of applications. The requirements placed on
these infrastructure services are very strict; they need to score high marks in the
areas of security, scalability, availability, performance, and cost effectiveness, and
they need to meet these requirements while serving millions of customers around
the globe, continuously.
Under the covers these services are massive distributed systems that operate
on a worldwide scale. This scale creates additional challenges, because when a
system processes trillions and trillions of requests, events that normally have a low
probability of occurrence are now guaranteed to happen and need to be accounted
for up front in the design and architecture of the system. Given the worldwide
scope of these systems, we use replication techniques ubiquitously to guarantee
consistent performance and high availability. Although replication brings us closer
to our goals, it cannot achieve them in a perfectly transparent manner; under a
number of conditions the customers of these services will be confronted with the
consequences of using replication techniques inside the services.
One of the ways in which this manifests itself is in the type of data consistency
that is provided, particularly when the underlying distributed system provides an
eventual consistency model for data replication. When designing these large-scale
systems, we ought to use a set of guiding principles and abstractions related to
large-scale data replication and focus on the trade-offs between high availability
and data consistency. [4]

2.2.1 Historical Perspective


In an ideal world there would be only one consistency model: when an update is
made all observers would see that update. The first time this surfaced as difficult
to achieve was in the database systems of the late ’70s. The best “period piece”
on this topic is “Notes on Distributed Databases” by Bruce Lindsay et al. It lays
out the fundamental principles for database replication and discusses a number
of techniques that deal with achieving consistency. Many of these techniques
try to achieve distribution transparency — that is, to the user of the system
it appears as if there is only one system instead of a number of collaborating
systems. Many systems during this time took the approach that it was better to
fail the complete system than to break this transparency.

In the mid-’90s, with the rise of larger Internet systems, these practices were
revisited. At that time people began to consider the idea that availability was
perhaps the most important property of these systems, but they were struggling
with what it should be traded off against. Eric Brewer, systems professor at
the University of California, Berkeley, and at that time head of Inktomi, brought
the different trade-offs together in a keynote address to the PODC (Principles
of Distributed Computing) conference in 2000. He presented the CAP theorem,
which states that of three properties of shared-data systems — data consistency,
system availability, and tolerance to network partition — only two can be achieved
at any given time. A more formal confirmation can be found in a 2002 paper by
Seth Gilbert and Nancy Lynch.

A system that is not tolerant to network partitions can achieve data consis-
tency and availability, and often does so by using transaction protocols. To make
this work, client and storage systems must be part of the same environment;
they fail as a whole under certain scenarios, and as such, clients cannot observe
partitions. An important observation is that in larger distributed-scale systems,
network partitions are a given; therefore, consistency and availability cannot be
achieved at the same time. This means that there are two choices on what to
drop: relaxing consistency will allow the system to remain highly available under
the partitionable conditions, whereas making consistency a priority means that
under certain conditions the system will not be available.

Both options require the client developer to be aware of what the system is
offering. If the system emphasizes consistency, the developer has to deal with
the fact that the system may not be available to take, for example, a write. If
this write fails because of system unavailability, then the developer will have to
deal with what to do with the data to be written. If the system emphasizes
availability, it may always accept the write, but under certain conditions a read
will not reflect the result of a recently completed write. The developer then has
to decide whether the client requires access to the absolute latest update all the
time. There is a range of applications that can handle slightly stale data, and
they are served well under this model.

In principle the consistency property of transaction systems as defined in the


ACID properties (atomicity, consistency, isolation, durability) is a different kind of
consistency guarantee. In ACID, consistency relates to the guarantee that when
a transaction is finished the database is in a consistent state; for example, when
transferring money from one account to another the total amount held in both
accounts should not change. In ACID-based systems, this kind of consistency
is often the responsibility of the developer writing the transaction but can be
assisted by the database managing integrity constraints. [4]

2.2.2 Consistency — Client and Server
There are two ways of looking at consistency. One is from the developer/client
point of view: how they observe data updates. The second way is from the server
side: how updates flow through the system and what guarantees systems can
give with respect to updates. [4]

Client-side Consistency
The client side has these components:

A storage system. For the moment we’ll treat it as a black box, but one
should assume that under the covers it is something of large scale and highly
distributed, and that it is built to guarantee durability and availability.

Process A. This is a process that writes to and reads from the storage system.

Processes B and C. These two processes are independent of process A and


write to and read from the storage system. It is irrelevant whether these are
really processes or threads within the same process; what is important is
that they are independent and need to communicate to share information.

Client-side consistency has to do with how and when observers (in this case
the processes A, B, or C) see updates made to a data object in the storage
systems. In the following examples illustrating the different types of consistency,
process A has made an update to a data object:

Strong consistency. After the update completes, any subsequent access (by
A, B, or C) will return the updated value.

Weak consistency. The system does not guarantee that subsequent accesses
will return the updated value. A number of conditions need to be met
before the value will be returned. The period between the update and
the moment when it is guaranteed that any observer will always see the
updated value is dubbed the inconsistency window.

Eventual consistency. This is a specific form of weak consistency; the stor-


age system guarantees that if no new updates are made to the object,
eventually all accesses will return the last updated value. If no failures
occur, the maximum size of the inconsistency window can be determined
based on factors such as communication delays, the load on the system, and
the number of replicas involved in the replication scheme. The most pop-
ular system that implements eventual consistency is DNS (Domain Name

System). Updates to a name are distributed according to a configured pat-
tern and in combination with time-controlled caches; eventually, all clients
will see the update.

The eventual consistency model has a number of variations that are important
to consider:

Causal consistency. If process A has communicated to process B that it has


updated a data item, a subsequent access by process B will return the
updated value, and a write is guaranteed to supersede the earlier write.
Access by process C that has no causal relationship to process A is subject
to the normal eventual consistency rules.

Read-your-writes consistency. This is an important model where process


A, after it has updated a data item, always accesses the updated value and
will never see an older value. This is a special case of the causal consistency
model.

Session consistency. This is a practical version of the previous model, where


a process accesses the storage system in the context of a session. As long
as the session exists, the system guarantees read-your-writes consistency.
If the session terminates because of a certain failure scenario, a new session
needs to be created and the guarantees do not overlap the sessions.

Monotonic read consistency. If a process has seen a particular value for the
object, any subsequent accesses will never return any previous values.

Monotonic write consistency. In this case the system guarantees to serial-


ize the writes by the same process. Systems that do not guarantee this
level of consistency are notoriously hard to program.

A number of these properties can be combined. For example, one can get
monotonic reads combined with session-level consistency. From a practical point
of view these two properties (monotonic reads and read-your-writes) are most
desirable in an eventual consistency system, but not always required. These two
properties make it simpler for developers to build applications, while allowing the
storage system to relax consistency and provide high availability.
As you can see from these variations, quite a few different scenarios are
possible. It depends on the particular applications whether or not one can deal
with the consequences.
Eventual consistency is not some esoteric property of extreme distributed
systems. Many modern RDBMSs (relational database management systems)
that provide primary-backup reliability implement their replication techniques in

both synchronous and asynchronous modes. In synchronous mode the replica
update is part of the transaction. In asynchronous mode the updates arrive
at the backup in a delayed manner, often through log shipping. In the latter
mode if the primary fails before the logs are shipped, reading from the promoted
backup will produce old, inconsistent values. Also to support better scalable
read performance, RDBMSs have started to provide the ability to read from the
backup, which is a classical case of providing eventual consistency guarantees in
which the inconsistency windows depend on the periodicity of the log shipping.
[4]

Server-side Consistency
On the server side we need to take a deeper look at how updates flow through
the system to understand what drives the different modes that the developer who
uses the system can experience. Let’s establish a few definitions before getting
started:
N = the number of nodes that store replicas of the data
W = the number of replicas that need to acknowledge the receipt of the
update before the update completes
R = the number of replicas that are contacted when a data object is accessed
through a read operation
If W + R > N , then the write set and the read set always overlap and one
can guarantee strong consistency. In the primary-backup RDBMS scenario, which
implements synchronous replication, N = 2, W = 2, and R = 1. No matter
from which replica the client reads, it will always get a consistent answer. In
asynchronous replication with reading from the backup enabled, N = 2, W = 1,
and R = 1. In this case R + W = N , and consistency cannot be guaranteed.
The problem with these configurations, which are basic quorum protocols, is
that when the system cannot write to W nodes because of failures, the write
operation has to fail, marking the unavailability of the system. With N = 3 and
W = 3 and only two nodes available, the system will have to fail the write.
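
The quorum arithmetic above is simple enough to express directly; the following is a small illustrative sketch (not tied to any particular product):

def consistency_mode(n, w, r):
    # W + R > N means the read set and write set always overlap.
    return "strong" if w + r > n else "weak/eventual"

def write_possible(w, available_nodes):
    # A write must reach W replicas, otherwise it has to fail.
    return available_nodes >= w

print(consistency_mode(n=2, w=2, r=1))          # synchronous primary-backup: strong
print(consistency_mode(n=2, w=1, r=1))          # async replication, read from backup: weak/eventual
print(write_possible(w=3, available_nodes=2))   # N=3, W=3 with one node down: False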
In distributed-storage systems that need to provide high performance and
high availability, the number of replicas is in general higher than two. Systems
that focus solely on fault tolerance often use N = 3 (with W = 2 and R = 2
configurations). Systems that need to serve very high read loads often replicate
their data beyond what is required for fault tolerance; N can be tens or even
hundreds of nodes, with R configured to 1 such that a single read will return
a result. Systems that are concerned with consistency are set to W = N for
updates, which may decrease the probability of the write succeeding. A common
configuration for these systems that are concerned about fault tolerance but not
consistency is to run with W = 1 to get minimal durability of the update and

then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N , W , and R depends on what the common case is and
which performance path needs to be optimized. In R = 1 and N = W we
optimize for the read case, and in W = 1 and R = N we optimize for a very fast
write. Of course in the latter case, durability is not guaranteed in the presence
of failures, and if W < (N + 1)/2, there is the possibility of conflicting writes
when the write sets do not overlap.
Weak/eventual consistency arises when W + R ≤ N , meaning that there is
a possibility that the read and write set will not overlap. If this is a deliberate
configuration and not based on a failure case, then it hardly makes sense to set
R to anything but 1. This happens in two very common cases: the first is the
massive replication for read scaling mentioned earlier; the second is where data
access is more complicated. In a simple key-value model it is easy to compare
versions to determine the latest value written to the system, but in systems that
return sets of objects it is more difficult to determine what the correct latest
set should be. In most of these systems where the write set is smaller than the
replica set, a mechanism is in place that applies the updates in a lazy manner
to the remaining nodes in the replica’s set. The period until all replicas have
been updated is the inconsistency window discussed before. If W + R ≤ N , then
the system is vulnerable to reading from nodes that have not yet received the
updates.
Whether or not read-your-writes, session, and monotonic consistency can be
achieved depends in general on the “stickiness” of clients to the server that
executes the distributed protocol for them. If this is the same server every time,
then it is relatively easy to guarantee read-your-writes and monotonic reads. This
makes it slightly harder to manage load balancing and fault tolerance, but it is a
simple solution. Using sessions, which are sticky, makes this explicit and provides
an exposure level that clients can reason about.
Sometimes the client implements read-your-writes and monotonic reads. By
adding versions on writes, the client discards reads of values with versions that
precede the last-seen version.
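
A minimal sketch of that client-side technique follows; the names are illustrative, and it assumes the store hands back a monotonically increasing version number alongside each value:

class MonotonicClient:
    """Discard reads whose version precedes the last version this client saw."""

    def __init__(self, store):
        self.store = store        # maps key -> (version, value)
        self.last_seen = {}       # key -> highest version observed so far

    def write(self, key, value):
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)
        self.last_seen[key] = version

    def read(self, key):
        version, value = self.store[key]
        if version < self.last_seen.get(key, 0):
            # A stale replica answered; refuse to go "back in time".
            raise LookupError("stale read, retry against another replica")
        self.last_seen[key] = version
        return value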
Partitions happen when some nodes in the system cannot reach other nodes,
but both sets are reachable by groups of clients. If you use a classical majority
quorum approach, then the partition that has W nodes of the replica set can
continue to take updates while the other partition becomes unavailable. The
same is true for the read set. Given that these two sets overlap, by definition the
minority set becomes unavailable. Partitions don’t happen frequently, but they
do occur between data centers, as well as inside data centers.
In some applications the unavailability of any of the partitions is unacceptable,
and it is important that the clients that can reach that partition make progress.
In that case both sides assign a new set of storage nodes to receive the data,

and a merge operation is executed when the partition heals. [4]

2.3 Cloud Based Memory Architecture


2.3.1 Memory Based Architectures
Google query results are now served in under an astonishingly fast 200ms, down
from 1000ms in the olden days. The vast majority of this great performance
improvement is due to holding indexes completely in memory. Thousands of
machines process each query in order to make search results appear nearly in-
stantaneously.
This text was adapted from notes on Google Fellow Jeff Dean keynote speech
at WSDM 2009.
What makes Memory Based Architectures different from traditional architec-
tures is that memory is the system of record. Typically disk based databases
have been the system of record. All the data is stored on the disk. Disk being
slow we’ve ended up wrapping disks in complicated caching and distributed file
systems to make them perform.
Even though memory is used all over the place as a cache, it is simply assumed that the cache can be invalidated at any time. In Memory Based Architectures, memory is where the "official" data values are stored.
Caching also serves a different purpose. The purpose behind cache based
architectures is to minimize the data bottleneck due to disk. Memory based ar-
chitectures can address the entire end-to-end application stack. Data in memory
can be of higher reliability and availability than traditional architectures.
Memory Based Architectures initially developed out of the need in some application spaces for very low latencies. The dramatic drop in RAM prices, along with the ability of servers to handle larger and larger amounts of RAM, has caused memory architectures to verge on going mainstream. For example, someone recently calculated that 1TB of RAM across 40 servers at 24GB per server would cost an additional $40,000, which is really quite affordable given the cost of the servers. Projecting out, 1U and 2U rack-mounted servers will soon support a terabyte or more of memory. [6]

RAM: High Bandwidth and Low Latency


Compared to disk, RAM is a high-bandwidth and low-latency storage medium. The bandwidth of RAM is typically around 5GB/s, while the bandwidth of disk is about 100MB/s, roughly fifty times less. Modern hard drives have latencies under 13 milliseconds, and when many applications are queued for disk reads, latencies can easily reach many seconds. Memory latency is in the 5 nanosecond range, orders of magnitude lower. [6]

RAM is the New Disk


The superiority of RAM is at the heart of the RAM is the New Disk paradigm.
As an architecture it combines the holy quadrinity of computing:

• Performance is better because data is accessed from memory instead of through a database to a disk.

• Scalability is linear because as more servers are added, data is transparently load balanced across the servers, so there is automated in-memory sharding.

• Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure.

• Application development is faster because there's only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer, which means all a developer has to do is get and put data.

Accessing disk on the critical path of any transaction limits both throughput and latency. Committing a transaction over the network in-memory is faster than writing through to disk, and reading data from memory is also faster than reading data from disk. So the idea is to skip disk, except perhaps as an asynchronous write-behind option, as archival storage, and for large files. [6]

Chapter 3

Different NoSQL Database Choices

3.1 Document Databases & BigTable


The BigTable paper describes how Google developed their own massively scalable database for internal use, as the basis for several of their services. The data model is quite different from relational databases: columns don't need to be predefined, and rows can be added with any set of columns. Empty columns are not stored at all.
BigTable inspired many developers to write their own implementations of this
data model; amongst the most popular are HBase, Hypertable and Cassandra.
The lack of a predefined schema can make these databases attractive in appli-
cations where the attributes of objects are not known in advance, or change
frequently.
Document databases have a related data model (although the way they han-
dle concurrency and distributed servers can be quite different): a BigTable row
with its arbitrary number of columns/attributes corresponds to a document in
a document database, which is typically a tree of objects containing attribute
values and lists, often with a mapping to JSON or XML. Open source docu-
ment databases include Project Voldemort, CouchDB, MongoDB, ThruDB and
Jackrabbit.
How is this different from just dumping JSON strings into MySQL? Document
databases can actually work with the structure of the documents, for example
extracting, indexing, aggregating and filtering based on attribute values within
the documents. Alternatively you could of course build the attribute indexing
yourself, but I wouldn't recommend that unless it makes working with your legacy
code easier.
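
For instance, a document store such as MongoDB can index and filter on attributes inside the documents themselves. The following is a small sketch using the pymongo driver (it assumes a MongoDB server on localhost and an illustrative users collection):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
users = client["demo"]["users"]

# No schema is declared up front; documents may carry different attributes.
users.insert_one({"name": "Homer", "emails": ["homer@simpson.com"]})
users.insert_one({"name": "Apu", "emails": ["apu@kwikemart.com"], "shift": "night"})

# The database indexes and queries attributes within the documents.
users.create_index("emails")
match = users.find_one({"emails": "homer@simpson.com"})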

The big limitation of BigTables and document databases is that most imple-
mentations cannot perform joins or transactions spanning several rows or doc-
uments. This restriction is deliberate, because it allows the database to do
automatic partitioning, which can be important for scaling — see the section
3.4 on distributed key-value stores below. If the structure of our data is lots of
independent documents, this is not a problem — but if the data fits nicely into a
relational model and we need joins, we shouldn’t try to force it into a document
model. [9]

3.2 Graph Databases


Graph databases live at the opposite end of the spectrum. While document
databases are good for storing data which is structured in the form of lots of
independent documents, graph databases focus on the relationships between
items — a better fit for highly interconnected data models.
Standard SQL cannot query transitive relationships, i.e. variable-length chains
of joins which continue until some condition is reached. Graph databases, on the
other hand, are optimised precisely for this kind of data.
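
To make the contrast concrete, here is a small sketch of that kind of variable-length traversal, written as plain Python over an in-memory adjacency list (a toy stand-in, not a real graph database):

from collections import deque

follows = {                      # a tiny social graph: who follows whom
    "alice": ["bob"],
    "bob": ["carol", "dave"],
    "carol": ["erin"],
    "dave": [],
    "erin": [],
}

def reachable(graph, start):
    """Breadth-first search: everyone reachable through any chain of 'follows'."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}

print(reachable(follows, "alice"))   # {'bob', 'carol', 'dave', 'erin'}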
There is less choice in graph databases than there is in document databases:
Neo4j, AllegroGraph and Sesame (which typically uses MySQL or PostgreSQL
as storage back-end) are ones to look at. FreeBase and DirectedEdge have
developed graph databases for their internal use.
Graph databases are often associated with the semantic web and RDF data-
stores, which is one of the applications they are used for. [9]

3.3 MapReduce
Popularised by another Google paper, MapReduce is a way of writing batch
processing jobs without having to worry about infrastructure. Different databases
lend themselves more or less well to MapReduce.
Hadoop is the big one amongst the open MapReduce implementations, and
Skynet and Disco are also worth looking at. CouchDB also includes some MapRe-
duce ideas on a smaller scale. [9]
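
The programming model can be illustrated with the classic word-count job, sketched here as ordinary Python functions (in a real deployment the framework, e.g. Hadoop, distributes the map, shuffle and reduce phases across machines):

from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # Sum all counts emitted for one word.
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}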

3.4 Distributed Key-Value Stores


A key-value store is a very simple concept, much like a hash table: you can
retrieve an item based on its key, you can insert a key/value pair, and you
can delete a key/value pair. The value can just be an opaque list of bytes, or

might be a structured document (most of the document databases and BigTable
implementations above can also be considered to be key-value stores).
Document databases, graph databases and MapReduce introduce new data
models and new ways of thinking which can be useful even in small-scale
applications. Distributed key-value stores, on the other hand, are really just
about scalability. They can scale to truly vast amounts of data — much more
than a single server could hold.
Distributed databases can transparently partition and replicate the data across
many machines in a cluster. We don't need to figure out a sharding scheme to
decide on which server we can find a particular piece of data; the database can
locate it for us. If one server dies, no problem — others can immediately take
over. If we need more resources, just add servers to the cluster, and the database
will automatically give them a share of the load and the data.
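
Under the covers this usually comes down to hashing the key to choose a node. The sketch below uses a naive hash-modulo partitioner with a fixed replica count purely for illustration (real systems typically use consistent hashing so that adding nodes moves far less data):

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2

def nodes_for_key(key, nodes=NODES, replicas=REPLICAS):
    """Deterministically map a key to the nodes holding its replicas."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = digest % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

print(nodes_for_key("customer:42"))   # e.g. ['node-c', 'node-d']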
When choosing a key-value store we need to decide whether it should be opti-
mised for low latency (for lightning-fast data access during the request-response
cycle) or for high throughput (which is needed for batch processing jobs).
Other than the BigTables and document databases above, Scalaris, Dynomite
and Ringo provide certain data consistency guarantees while taking care of par-
titioning and distributing the dataset. MemcacheDB and Tokyo Cabinet (with
Tokyo Tyrant for network service and LightCloud to make it distributed) focus
on latency.
The caveat about limited transactions and joins applies even more strongly
for distributed databases. Different implementations take different approaches,
but in general, if we need to read several items, manipulate them in some way and
then write them back, there is no guarantee that we will end up in a consistent
state immediately (although many implementations try to become eventually
consistent by resolving write conflicts or using distributed transaction protocols).
[9]

Chapter 4

NoSQL: Merits & Demerits

4.1 Merits
There is a long list of potential advantages to using non-relational databases.
Of course, not all non-relational databases are the same; but the following list
covers areas common to many of them. [11]

4.1.1 Semi-Structured Data


Here we have a structure in which each entity can have any number of properties defined
at run-time. This approach is clearly helpful in domains where the problem is
itself amenable to expansion or change over time. We can begin simply, and
alter the details of our problem as we go with minimal administrative burden.
This approach has much in common with the imputed typing systems of scripting
languages like Python, which, while often less efficient than strongly typed lan-
guages like C and Java, usually more than make up for this deficiency by giving
programmers improved usability; they can get started quickly and add structure
and overhead only as needed.
But there is another, more important aspect to this tendency towards storing
non-structured, or semi-structured, data: the idea that our understanding of
a problem, and its data, might legitimately emerge over time, and be entirely
data-driven after the fact. As one observer put it:
RDBMSs are designed to model very highly and statically structured data
which has been modeled with mathematical precision — data and designs that
do not meet these criteria, such as data designed for direct human consumption,
lose the advantages of the relational model, and result in poorer maintainability
than with less stringent models.
This kind of emergent behavior is atypical when dealing with the program-

ming problems of the past 40 years, such as accounting systems, desktop word
processing software, etc. However, many of today’s interesting problems involve
unpredictable behavior and inputs from extremely large populations; consider
web search, social network graphs, large scale purchasing habits, etc. In these
“messy” arenas, the impulse to exactly model and define all the possible struc-
tures in the data in advance is exactly the wrong approach. Relational data
design tends to turn programmers into “structure first” proponents, but in many
cases, the rest of the world (including the users we are writing programs for) are
thinking “data first”. [11]

4.1.2 Alternative Model Paradigms

Modelling data in terms of relations, tuples and attributes —or equivalently, ta-
bles, rows and columns — is but one conceptual approach. There are entirely
different ways of considering, planning, and designing a data model. These in-
clude hierarchical trees, arbitrary graphs, structured objects, cube or star schema
analytical approaches, tuple spaces, and even undifferentiated (emergent) stor-
age. By moving into the realm of semi-structured non-relational data, we gain the
possibility of accessing our data along these lines instead of simply in relational
database terms.
Consider, for example, graph-oriented databases such as Neo4j. This paradigm at-
tempts to map persistent storage capabilities directly onto the graph model of
computation: sets of nodes connected by sets of edges. The database engine
then innately provides many algorithmic services that one would expect on graph
representations: establishing spanning trees, finding shortest path, depth and
breadth-first search, etc.
Object databases are another paradigm that have, at various times, appeared
poised to challenge the supremacy of the relational database. An example of a
current contender in this space is Persevere (http://www.persvr.org/), which
is an object store for JSON (JavaScript Object Notation) data. Advantages
gained in this space include a consistent execution model between the storage
engine and the client platform (JavaScript, in this case), and the ability to natively
store objects without any translation layer.
Here again, the general principle is that by moving away from the strictly
modeled structure of SQL, we untie the hands of developers to model data in
terms they may be more familiar with, or that may be more conducive to solving
the problem at hand. This is very attractive to many developers. [11]

4.1.3 Multi-Valued Properties
Even with the bounds of the more traditional relational approach, there are ways
in which the semi-structured approach of non-relational databases can give us a
helping hand in conceptual data design. One of these is by way of multi-value
properties — that is, attributes that can simultaneously take on more than one
value.
A credo of relational database design is that for any given tuple in a relation,
there is only one value for any given attribute; storing multiple values in the same
attribute for the same tuple is considered very bad practice, and is not supported
by standard SQL. Generally, cases where one might be tempted to store multiple
values in the same attribute indicate that the design needs further normalization.
As an example, consider a User relation, with an attribute email. Since
people typically have more than one email address, a simple (but wrong, at least
for relational database design) decision might be to store the email addresses as
a comma-delimited list within the “emails” attribute.
The problems with this are myriad. For example, a simple membership test like

SELECT * FROM User WHERE emails = 'homer@simpson.com';

will fail if there is more than one email address in the list, because that is no longer the value of the attribute; a more general test using wildcards, such as

SELECT * FROM User WHERE emails LIKE '%homer@simpson.com%';

will succeed, but raises serious performance issues in that it defeats the use of indexes and causes the database engine to do (at best) linear-time text pattern searches against every value in the table. Worse, it may actually impact correctness if entries in the list can be proper substrings of each other (as in the list "car, cart, art").
The proper way to design for this situation, in a relational model, is to nor-
malize the email addresses into their own table, with a foreign key relationship
to the user table.
This is a design strategy that is frequently applied to many situations in
standard relational database design, even recursively: if you sense a one-to-many
relationship in an attribute, break it out into two relations with a foreign key.
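
A minimal sketch of that normalization, shown here through Python's built-in sqlite3 module purely for illustration (the table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE user_emails (
        email_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(user_id),
        email    TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users (user_id, name) VALUES (1, 'Homer')")
conn.executemany(
    "INSERT INTO user_emails (user_id, email) VALUES (?, ?)",
    [(1, "homer@simpson.com"), (1, "homer@springfieldpower.com")],
)

# Membership tests become an index-friendly join instead of LIKE '%...%'.
row = conn.execute("""
    SELECT u.name FROM users u
    JOIN user_emails e ON e.user_id = u.user_id
    WHERE e.email = 'homer@simpson.com'
""").fetchone()
print(row)   # ('Homer',)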
The trouble with this pattern, however, is that it still does not elegantly
serve all the possible use cases of such data, especially in situations with a low
cardinality; either it is overkill, or it is a clumsy way to store data. In the above
example, there are a very small set of use cases that we might typically do with
email addresses, including:
• Return the user, along with their one “primary” email address, for normal
operations involving sending an email to the user.

• Return the user with a list of all their email addresses, for showing on a
“profile” screen, for example.
• Find which user (if any) has a given email address.
The first situation requires an additional attribute along the lines of is_primary on the email table, not to mention logic to ensure that only one email tuple per user is marked as primary (which cannot be done natively in a relational database, because a UNIQUE constraint on the user_id and the is_primary field would only allow one primary and one non-primary email address per user_id). Alternately, a primary_email field can be kept on the User table, acting as a cache of which email address is the primary one; this too requires coordination by code to ensure that this field actually exists in the User_Email table, etc.
To use standard SQL to return a single tuple containing the user and all
of their email addresses, comma delimited like our original (“wrong”) design
concept, is actually quite difficult under this two-table structure. Standard SQL
has no way of rendering this output, which is surprising considering how common
it is. The only mechanisms would be constructing intermediate temporary tables
of the information, looping through records of the join relation and outputting
one tuple per user_id with the concatenation of email addresses as an attribute.
Under key/value stores, we have a different paradigm entirely for this problem,
and one which much more closely matches the real-world uses of such data. We
can simply model the email attribute as a substructure: a list of emails within
the attribute.
For example, Google App Engine has a “List” type that can store exactly this
type of information as an attribute:
class User(db.Model):
    name = db.StringProperty()
    emails = db.StringListProperty()
The query system then has the ability to not only return the contained lists
as structured data, but also to do membership queries, such as:
results = db.GqlQuery(
    "SELECT * FROM User WHERE emails = 'homer@simpson.com'"
)
Since order is preserved, the semantics of "primary" versus "additional" can be encoded into the order of the items, so no additional attribute is needed for this purpose; we can always get the primary email by saying something like results.emails[0].
In effect, we have expressed our actual data requirements in a much more
succinct and powerful way using this notation, without any noticeable loss in
precision, abstraction, or expressive power. [11]

4.1.4 Generalized Analytics
If the nature of the analysis falls outside of SQL's standard set of operations, it can be extremely difficult to produce results within the operational silo of SQL queries. Worse, this has a pernicious effect on the mindset of data developers, sometimes called "SQL Myopia": if you can't do it in SQL, you can't do it. (This is not a fault of the relational model itself, only of SQL, which is ultimately just one possible declarative grammar for interacting with relational structures.) This is unfortunate, because there are many interesting and useful modes of interacting with data sets that are outside of this paradigm — consider matrix transformations, data mining, clustering, Bayesian filtering, probability analysis, etc.

Additionally, besides simply lacking Turing-completeness (a deliberate design decision, so that all queries can run in bounded time; never mind that every major commercial vendor has extended SQL with operations that do make it Turing complete, albeit awkwardly), SQL has a long list of faults that non-SQL developers regularly present. These include a verbose, non-customizable syntax; the inability to reduce nested constructions to recursive calls, or generally to work with graphs, trees, or nested structures; inconsistency in specific implementations between vendors, despite standardization; and so forth. It is no wonder that the moniker for the current non-relational database movement is converging on the tag "NoSQL": it is a limited, inelegant language. [11]

4.1.5 Version History


Part of the design of many (but not all) non-relational databases is the explicit
inclusion of version history in the storage unit of data. For example, when you
store the value 123 in an attribute, and later change it to the value 234, your
data store actually now contains both values, each with a timestamp or vector
clock version stamp. This approach has many benefits from an efficiency point
of view: primary interaction with the database disks is always in write-forward
mode, and multi-version concurrency control can be easily modeled with this
structure.
From a modeling point of view, however, there are other distinct advantages
to this format. One of them is the ability to intentionally keep, and interact with,
older versions of data in a structured way. An example of this, which almost
certainly uses the versioned characteristics of Google’s Bigtable infrastructure, is
Google Docs: any document can be instantly viewed in, or reverted to, its state
at any point in its history — a granular, infinite “undo”.
Implementing this kind of revision ability in typical relational database applications is prohibitive both from a programming complexity standpoint (this ability must be consciously designed into each entity that might need it) as well as from a performance standpoint. (Consider how many traditional relational database products you know of that offer any kind of undo functionality.)
We have two main options when keeping a history of information for a table.
On the one hand, we can keep a full additional copy of every row whenever it
changes. This can be done in place, by adding an additional component to the
primary key which is a timestamp or version number.
This is problematic in that all application code that interacts with this entity
needs to know about the versioning scheme; it also complicates the indexing of
the entities, because relational database storage with a composite primary key
including a date is significantly less optimized than for a single integer key.
Alternately, the entire-row history method can be done in a secondary table
which only keeps historical records, much like a log.
This is less obtrusive on the application (which need not even be aware of its existence, especially if it is produced via a database-level procedure or trigger), and it has the benefit that it can be populated asynchronously.
However, both of these cases require O(sn) storage, where s is the row size
and n is the number of updates. For large row sizes, this approach can be
prohibitive.
The other mechanism for doing this is to keep what amounts to an Entity/
Attribute/Value table for the historical changes: a table where only the changed
value is kept. This is easier to do in situations where the table design itself is
already in the EAV paradigm, but can still be done dynamically (if not efficiently)
by using the string name of the updated attribute.
For sparsely updated tables, this approach does save space over the entire-row
versions, but it suffers from the drawback that any use of this data via interactive
SQL queries is nearly impossible.
Overall, the non-relational database stores that support column-based version
history have a huge advantage in any situations where the application might need
this level of historical data snapshots. [11]
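
The underlying storage idea can be sketched in a few lines; this toy in-memory version simply keeps every (timestamp, value) pair per attribute instead of overwriting in place:

import time

class VersionedStore:
    """Append-only attribute store: every write keeps the older versions."""

    def __init__(self):
        self._data = {}   # (row_key, column) -> [(timestamp, value), ...]

    def put(self, row_key, column, value):
        self._data.setdefault((row_key, column), []).append((time.time(), value))

    def get(self, row_key, column, at=None):
        versions = self._data[(row_key, column)]
        if at is None:
            return versions[-1][1]   # latest value
        # Most recent value written at or before the requested point in time.
        return max((v for v in versions if v[0] <= at), key=lambda v: v[0])[1]

store = VersionedStore()
store.put("doc:1", "body", "first draft")
checkpoint = time.time()
store.put("doc:1", "body", "revised draft")
print(store.get("doc:1", "body"))                 # 'revised draft'
print(store.get("doc:1", "body", at=checkpoint))  # 'first draft'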

4.1.6 Predictable Scalability


These databases are simple and thus scale much better than today’s relational
databases. If you are putting together a system in-house and intend to throw
dozens or hundreds of servers behind your data store to cope with what you
expect will be a massive demand in scale, then consider a key/value store.

This definitively impacts the modelling concepts supported by the systems,
because it elevates scalability concerns to a first class modeling directive — part
of the logical and conceptual modeling process itself. Rather than designing an
elegant relational model and only later considering how it might reasonably be
“sharded” or replicated in such a way as to provide high availability in various
failure scenarios (typically accompanied by great cost, in commercial relational
database products), instead the bedrock of the logical design asks: how can we
conceive of this data in such a way that it is scalable by its definition?
As an example, consider the mechanism for establishing the locality of trans-
actions in Bigtable and its ilk (including the Google App Engine data store).
Obviously, when involving multiple entities in a transaction on a distributed data store, it is desirable to restrict the number of nodes that actually must participate in the transaction. (While protocols do of course exist for distributed transactions, their performance suffers immensely as the size of the machine cluster increases, because the risk of a node failure, and thus a time-
out on the distributed transaction, increases.) It is therefore most beneficial to
couple related entities tightly, and unrelated entities loosely, so that the most
common entities to participate in a transaction would be those that are already
tightly coupled. In a relational database, you might use foreign key relationships
to indicate related entities, but the relationship carries no additional information
that might indicate “these two things are likely to participate in transactions
together”.
By contrast, in Bigtable, this is enabled by allowing entities to indicate an
“ancestor” relation chain, of any depth. That is, entity A can declare entity B its
“parent”, and henceforth, the data store organizes the physical representation of
these entities on one (or a small number of) physical machines, so that they can
easily participate in shared transactions. This is a natural design inclination, but
one that is not easily expressed in the world of relational databases (you could
certainly provide self-relationships on entities but since SQL does not readily
express recursive relationships, that is only beneficial in cases where the self-
relationship is a key part of the data design itself, with business import.)
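
A rough sketch of the idea (illustrative only; this is not the actual Bigtable or App Engine API): if an entity's key embeds its ancestor chain, the store can route everything under the same root ancestor to the same node, so a transaction within that group never leaves the machine.

require "digest/md5"

# Hypothetical keys of the form ["User:alice", "Order:42", "Item:7"]:
# the first element is the root ancestor. Everything under one root
# lands on the same node, so transactions inside that group stay local.
def node_for(key_path, node_count)
  root = key_path.first
  Digest::MD5.hexdigest(root).to_i(16) % node_count
end

node_for(["User:alice", "Order:42"], 8)            # => some node n
node_for(["User:alice", "Order:42", "Item:7"], 8)  # => the same node n
node_for(["User:bob"], 8)                          # => likely a different node

Entities with different root ancestors may land anywhere, which is exactly the loose coupling between unrelated entities described above.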
Many commercial relational database vendors make the claim that their solu-
tions are highly scalable. This is true, but there are two caveats. First, of course,
is cost: sharded, replicated instances of Oracle or DB2 are not a cheap com-
modity, and the cost scales with the load. Second, however, and less obvious, is
the predictability factor. This is highly touted by systems such as Project Volde-
mort, which point out that with a simple data model, as in many non-relational
databases, not only can you scale more easily, but you can scale more predictably:
the requirements to support additional operations, in terms of CPU and memory, are known fairly exactly, so load planning can be an exact science. Compare this
with SQL/relational database scaling, which is highly unpredictable due to the
complex nature of the RDBMS engine.
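
A back-of-the-envelope sketch of what load planning "as an exact science" looks like; every number below is invented purely for illustration:

# If a single node of a simple key/value store handles a measured, stable
# rate of operations, the node count follows directly from the expected load.
ops_per_node = 20_000          # measured throughput of one node (ops/sec)
replication  = 3               # copies kept of each item
peak_ops     = 500_000         # expected peak load (ops/sec)
headroom     = 1.5             # safety factor

nodes_needed = (peak_ops.to_f / ops_per_node * replication * headroom).ceil
# => 113 nodes
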
There are, naturally, other criteria that are involved in the quest for per-
formance and scalability, including topics like low level data storage (b-tree-like
storage formats, disk access patterns, solid state storage, etc.); issues with the
raw networking of systems and their communications overhead; data reliability,
both considered for single-node and multi-node systems, etc.
Because key/value databases easily and dynamically scale, they are also the
database of choice for vendors who provide a multi-user, web services platform
data store. The database provides a relatively cheap data store platform with
massive potential to scale. Users typically only pay for what they use, but their
usage can increase as their needs increase. Meanwhile, the vendor can scale the
platform dynamically based on the total user load, with little limitation on the
entire platform’s size. [2, 11]

4.1.7 Schema Evolution


In addition to the static existence of a database schema, it is also important
to consider what happens over time as an application’s needs or requirements
change. Non-relational databases have a distinct advantage in this realm, be-
cause they offer more options for how the version update should proceed.
To be sure, relational databases have mechanisms for handling ongoing up-
dates to data schema; indeed, one of the strengths of the relational model is that
the schema is data: databases keep system tables which define schema meta-
data, which are handled by the exact same database primitives as user-space
tables. This generality has advantages in terms of manageability, but it also
provides a clean abstraction that vendors can use to provide valuable schema
update facilities. Indeed, commercial RDBMS products have applied a great
deal of engineering resources to the problem, and have developed sophisticated
mechanisms that allow production databases to ALTER their schema without
downtime in most scenarios.4 However, there are two issues with the relational
database approach to this.
First, relational database schemas exist in only one state at any given time.
This means that if the specific form of an attribute changes, it must change
immediately for all records, even in cases where the new form of the attribute
would rightfully require processing that the database cannot do (for example,
application-specific business logic). It also implies that if there is a high-volume
update, such as one that might need to write many gigabytes of changed data
4. Non-commercial databases such as MySQL also have mechanisms like this, but in general their methods are much less sophisticated, often requiring downtime to do even simple operations such as rebuilding indices.

back to disk, the RDBMS is obligated to do this operation atomically and in
real-time (because DDL updates are transactional); regardless of how efficiently
implemented it is, this type of operation cannot be made seamless in a highly
transactional production environment.
Second, the release of relational database schema changes typically requires
precise coordination with application-layer code; the code version must exactly
match the data version. In any highly available application, there is a high likeli-
hood that this implies downtime,5 or at least advanced operational coordination
that takes a great deal of precision and energy.
Non-relational databases, by comparison, can use a very different approach
for schema versioning. Because the schema (in many cases) is not enforced at
the data engine level, it is up to the application to enforce (and migrate) the
schema. Therefore, a schema change can be gradually introduced by code that
understands how to interact with both the N − 1 version and the N version,
and leaves each entity updated as it is touched. “Gardener” processes can then
periodically sweep through the data store, updating nodes as a lower-priority
process.
Naturally, this approach produces more complex code in the short term, es-
pecially if the schema of the data is relied upon by analytical (map/reduce) jobs.
But in many cases, the knowledge that no downtime will be required during
a schema evolution is worth the additional complexity. In fact, this approach
might be seen to encourage a more agile development methodology, because
each change to the internal schema of the application’s data is bundled with the
update to the codebase, and can be collectively versioned and managed accord-
ingly.
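
A minimal sketch of this read-both-versions, write-the-new-version pattern (the field names and the version marker are invented for illustration):

CURRENT_VERSION = 2

# Version 1 stored a single "name" field; version 2 splits it into
# "first_name" / "last_name". Readers accept both; writers emit only v2.
def upgrade(record)
  return record if record["schema_version"] == CURRENT_VERSION
  first, last = record["name"].to_s.split(" ", 2)
  {
    "schema_version" => CURRENT_VERSION,
    "first_name"     => first,
    "last_name"      => last
  }
end

def load_user(store, key)
  record = upgrade(store[key])
  store[key] = record        # write-back: the entity is migrated as it is touched
  record
end

# A "gardener" pass can sweep the remaining old records in the background:
def gardener_sweep(store)
  store.each_key { |k| load_user(store, k) }
end

The gardener pass is optional; even without it, every record that is actually read gets migrated lazily.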

4.1.8 More Natural Fit with Code


Relational data models and Application Code Object Models are typically built
differently, which leads to incompatibilities. Developers overcome these incom-
patibilities with code that maps relational models to their object models, a pro-
cess commonly referred to as object-to-relational mapping. This process, which
essentially amounts to “plumbing” code and has no clear and immediate value,
can take up a significant chunk of the time and effort that goes into develop-
ing the application. On the other hand, many key/value databases retain data
5. The exception to this is that, thanks to the relational model's implicit lack of attribute order, there are situations in which new attributes can be added and it is guaranteed that no application code would even know of the existence of the new attributes, let alone be affected by them. This is a case where the relational model has the upper hand; however, because it is not a comprehensive solution for every situation, the end result is that, for safety, most relational database schema updates are treated as downtime events.

in a structure that maps more directly to object classes used in the underlying
application code, which can significantly reduce development time. [2]
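
As a small illustration of the point about "plumbing" code, the stored document and the in-memory object can share the same shape, so the mapping layer all but disappears (the class and store below are purely illustrative):

require "json"

class UserProfile
  attr_accessor :name, :email, :tags

  # the stored document is simply the object's fields
  def to_doc
    { "name" => name, "email" => email, "tags" => tags }
  end

  def self.from_doc(doc)
    u = new
    u.name, u.email, u.tags = doc["name"], doc["email"], doc["tags"]
    u
  end
end

store = {}  # stand-in for any key/value store
u = UserProfile.new
u.name, u.email, u.tags = "alice", "alice@example.com", ["nosql", "seminar"]

store["user:1"] = JSON.generate(u.to_doc)                   # save
again = UserProfile.from_doc(JSON.parse(store["user:1"]))   # load
again.tags  # => ["nosql", "seminar"]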

4.2 Demerits
The inherent constraints of a relational database ensure that data at the lowest
level have integrity. Data that violate integrity constraints cannot physically be
entered into the database. These constraints don’t exist in a key/value database,
so the responsibility for ensuring data integrity falls entirely to the application.
But application code often carries bugs. Bugs in a properly designed relational
database usually don’t lead to data integrity issues; bugs in a key/value database,
however, quite easily lead to data integrity issues.
One of the other key benefits of a relational database is that it forces you to
go through a data modeling process. If done well, this modeling process creates
in the database a logical structure that reflects the data it is to contain, rather
than reflecting the structure of the application. Data, then, become somewhat
application-independent, which means other applications can use the same data
set and application logic can be changed without disrupting the underlying data
model. To facilitate this process with a key/value database, try replacing the
relational data modeling exercise with a class modeling exercise, which creates
generic classes based on the natural structure of the data.
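
Since the store itself will not reject bad data, such invariants end up as explicit checks in the application; below is a minimal sketch of the kind of generic class this modeling exercise might produce (all names are invented for illustration):

# Invariants the database would normally enforce (NOT NULL, foreign keys,
# value ranges) become explicit checks in application code.
class Order
  attr_reader :id, :customer_id, :total

  def initialize(id:, customer_id:, total:)
    @id, @customer_id, @total = id, customer_id, total
  end

  def valid?(store)
    !id.nil? &&
      total.is_a?(Numeric) && total >= 0 &&
      store.key?("customer:#{customer_id}")   # a "foreign key" check done by hand
  end
end

store = { "customer:7" => { "name" => "alice" } }
order = Order.new(id: "order:1", customer_id: 7, total: 49.95)
store[order.id] = order if order.valid?(store)
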
And don’t forget about compatibility. Unlike relational databases, cloud-
oriented databases have little in the way of shared standards. While they all
share similar concepts, they each have their own API, specific query interfaces,
and peculiarities. So, you will need to really trust your vendor, because you won’t
simply be able to switch down the line if you’re not happy with the service. And
because almost all current key/value databases are still in beta, that trust is far
riskier than with old-school relational databases. [2]

4.2.1 Limitations on Analytics


In the cloud, key/value databases are usually multi-tenanted, which means that a
lot of users and applications will use the same system. To prevent any one process
from overloading the shared environment, most cloud data stores strictly limit the
total impact that any single query can cause. For example, with SimpleDB, you
can’t run a query that takes longer than 5 seconds. With Google’s AppEngine
Datastore, you can’t retrieve more than 1000 items for any given query.
These limitations aren’t a problem for your bread-and-butter application logic
(adding, updating, deleting, and retrieving small numbers of items). But what
happens when your application becomes successful? You have attracted many

users and gained lots of data, and now you want to create new value for your
users or perhaps use the data to generate new revenue. You may find yourself
severely limited in running even straightforward analysis-style queries. Things like
tracking usage patterns and providing recommendations based on user histories
may be difficult at best, and impossible at worst, with this type of database
platform.
In this case, you will likely have to implement a separate analytical database,
populated from your key/value database, on which such analytics can be exe-
cuted. Think in advance about where and how you would do that. Would
you host it in the cloud or invest in on-site infrastructure? Would latency be-
tween you and the cloud-service provider pose a problem? Does your current
cloud-based key/value database support this? If you have 100 million items in
your key/value database, but can only pull out 1000 items at a time, how long
would queries take?
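
To make that last question concrete, here is a rough estimate, assuming (purely for illustration) a 50 ms round trip per request:

items       = 100_000_000
page_size   = 1_000
latency_sec = 0.05                  # assumed 50 ms per request

requests = items / page_size        # => 100_000 requests
hours    = requests * latency_sec / 3600.0
# => roughly 1.4 hours just to page through the data once, before any computation
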
Ultimately, while scale is a consideration, don’t put it ahead of your ability to
turn data into an asset of its own. All the scaling in the world is useless if your
users have moved on to your competitor because it has cooler, more personalized
features. [2]

Chapter 5

Conclusion

5.1 Data inconsistency


Data inconsistency in large-scale reliable distributed systems has to be tolerated
for two reasons: improving read and write performance under highly concurrent
conditions; and handling partition cases where a majority model would render
part of the system unavailable even though the nodes are up and running.
Whether or not inconsistencies are acceptable depends on the client applica-
tion. In all cases the developer needs to be aware of which consistency guarantees are provided by the storage system, and these need to be taken into account when develop-
ing applications. There are a number of practical improvements to the eventual
consistency model, such as session-level consistency and monotonic reads, which
provide better tools for the developer. Many times the application is capable of
handling the eventual consistency guarantees of the storage system without any
problem. A specific popular case is a Web site in which we can have the notion
of user-perceived consistency. In this scenario the inconsistency window needs to
be smaller than the time expected for the customer to return for the next page
load. This allows for updates to propagate through the system before the next
read is expected. [4]
BigTable inspired many developers to write their own implementations of this
data model; amongst the most popular are HBase, Hypertable and Cassandra.
The lack of a predefined schema can make these databases attractive in appli-
cations where the attributes of objects are not known in advance, or change
frequently.
Document databases have a related data model (although the way they han-
dle concurrency and distributed servers can be quite different): a BigTable row
with its arbitrary number of columns/attributes corresponds to a document in
a document database, which is typically a tree of objects containing attribute
values and lists, often with a mapping to JSON or XML. Open source document databases include CouchDB, MongoDB, ThruDB and Jackrabbit.
How is this different from just dumping JSON strings into MySQL? Document
databases can actually work with the structure of the documents, for example
extracting, indexing, aggregating and filtering based on attribute values within
the documents. Alternatively you could of course build the attribute indexing
yourself, but I wouldn't recommend that unless it makes working with your legacy
code easier.
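
As a tiny illustration of the difference, "working with the structure of the documents" means the store (here simulated with plain Ruby) can filter and aggregate on attributes inside each document rather than treating the document as an opaque string:

docs = [
  { "type" => "post", "author" => "alice", "tags" => ["nosql"], "views" => 120 },
  { "type" => "post", "author" => "bob",   "tags" => ["sql"],   "views" => 40  },
  { "type" => "post", "author" => "alice", "tags" => ["couch"], "views" => 75  }
]

# filter and aggregate on attribute values inside the documents
alice_posts = docs.select { |d| d["author"] == "alice" }
total_views = alice_posts.sum { |d| d["views"] }   # => 195
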
The big limitation of BigTables and document databases is that most implementations cannot perform joins or transactions spanning several rows or documents. This restriction is deliberate, because it allows the database to do automatic partitioning, which can be important for scaling (see the section on distributed key-value stores in Chapter 3). If the structure of your data is lots of independent documents, this is not a problem; but if your data fits nicely into a relational model and you need joins, please don't try to force it into a document model.

5.2 Making a Decision


Ultimately, there are four reasons why you would choose a non-relational key/value database platform for your application:

1. Your data is heavily document-oriented, making it a more natural fit with the key/value data model than the relational data model.

2. Your development environment is heavily object-oriented, and a key/value database could minimize the need for “plumbing” code.

3. The data store is cheap and integrates easily with your vendor’s web services platform.

4. Your foremost concern is on-demand, high-end scalability — that is, large-scale, distributed scalability, the kind that can’t be achieved simply by scaling up.

But in making your decision, remember the database’s limitations and the
risks you face by branching off the relational path.
For all other requirements, you are probably best off with the good old
RDBMS. So, is the relational database doomed? Clearly not. Well, not yet
at least. [2]

Appendix A

Different popular NoSQL databases

A.1 The Shortlist


Table A.1 provides a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren't suitable for low-latency data serving, but are interesting nonetheless. [3]

A.2 Cloud-Service Contenders


A number of web service vendors now offer multi-tenanted key/value databases
on a pay-as-you-go basis. Most of them meet the criteria discussed to this point,
but each has unique features and varies from the general standards described thus
far. Let’s take a look now at particular databases, namely SimpleDB, Google
AppEngine Datastore, and SQL Data Services. [2]

A.2.1 Amazon: SimpleDB

SimpleDB is an attribute-oriented key/value database available on the Amazon Web Services platform. SimpleDB is still in public beta; in the meantime, users can sign up online for a “free” version – free, that is, until you exceed your usage limits.

Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community
Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no
Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append only log) | HTTP | blob | B | Nokia, no
Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no
Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no
Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ASCII, Thrift | blob | D+ | Powerset, no
MemcacheDB | C | replication | BerkleyDB | Memcached | blob | B | some
ThruDB | C++ | replication | Pluggable: BerkleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure
CouchDB | Erlang | replication, partitioning? | Custom on-disk | HTTP, JSON | Document oriented (JSON) | A | Apache, yes
Cassandra | Java | replication, partitioning | Custom on-disk | Thrift | BigTable meets Dynamo | A | Facebook, no
HBase | Java | replication, partitioning | Custom on-disk | Custom API, Thrift, Rest | BigTable | F | Apache, yes
Hypertable | C++ | replication, partitioning | Custom on-disk | Thrift, other | BigTable | A | Zvents, Baidu, yes

Table A.1: Some NoSQL initiatives

SimpleDB has several limitations. First, a query can only execute for a max-
imum of 5 seconds. Secondly, there are no data types apart from strings. Every-
thing is stored, retrieved, and compared as a string, so date comparisons won’t
work unless you convert all dates to ISO8601 format. Thirdly, the maximum size
of any string is limited to 1024 bytes, which limits how much text (i.e. product
descriptions, etc.) you can store in a single attribute. But because the schema is
dynamic and flexible, you can get around the limit by adding “ProductDescrip-
tion1”, “ProductDescription2”, etc. The catch is that an item is limited to 256
attributes. While SimpleDB is in beta, domains can’t be larger than 10GB, and
entire databases cannot exceed 1TB.
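
Because everything is compared as a string, numbers and dates must be stored in a form whose lexicographic order matches their natural order. Below is a small sketch of the usual workarounds (zero padding and ISO 8601 timestamps):

require "time"

# zero-pad non-negative integers to a fixed width so "2" sorts before "10"
def encode_int(n, width = 10)
  format("%0#{width}d", n)
end

encode_int(2)    # => "0000000002"
encode_int(10)   # => "0000000010"
encode_int(2) < encode_int(10)   # => true (plain "2" < "10" would be false)

# ISO 8601 timestamps sort lexicographically in chronological order
Time.utc(2010, 9, 29).iso8601    # => "2010-09-29T00:00:00Z"

Without the padding, "10" sorts before "2" as a string, which silently breaks range queries.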
One key feature of SimpleDB is that it uses an eventual consistency model. This
consistency model is good for concurrency, but means that after you have changed
an attribute for an item, those changes may not be reflected in read operations
that immediately follow. While the chances of this actually happening are low,
you should account for such situations. For example, you don’t want to sell the
last concert ticket in your event booking system to five people because your data
wasn’t consistent at the time of sale. [2]

A.2.2 Google AppEngine Data Store

Google’s AppEngine Datastore is built on BigTable, Google’s internal storage system for handling structured data. In and of itself, the AppEngine Datastore is
not a direct access mechanism to BigTable, but can be thought of as a simplified
interface on top of BigTable.
The AppEngine Datastore supports much richer data types within items than
SimpleDB, including list types, which contain collections within a single item.
You will almost certainly use this data store if you plan on building applica-
tions within the Google AppEngine. However, unlike with SimpleDB, you cannot
currently interface with the AppEngine Datastore (or with BigTable) using an
application outside of Google’s web service platform. [2]

A.2.3 Microsoft: SQL Data Services

SQL Data Services is part of the Microsoft Azure Web Services platform. The
SDS service is also in beta and so is free but has limits on the size of databases.
SQL Data Services is actually an application itself that sits on top of many SQL
servers, which make up the underlying data storage for the SDS platform. While
the underlying data stores may be relational, you don’t have access to these;
SDS is a key/value store, like the other platforms discussed thus far.
Microsoft seems to be alone among these three vendors in acknowledging that
while key/value stores are great for scalability, they come at the great expense
of data management, when compared to RDBMS. Microsoft’s approach seems
to be to strip to the bare bones to get the scaling and distribution mechanisms
right, and then over time build up, adding features that help bridge the gap
between the key/value store and relational database platform. [2]

A.3 Non-Cloud Service Contenders
Outside the cloud, a number of key/value database software products exist that
can be installed in-house. Almost all of these products are still young, either in
alpha or beta, but most are also open source; having access to the code, you can perhaps be more aware of potential issues and limitations than you would with closed-source vendors. [2]

A.3.1 Tokyo Cabinet


Developed and sponsored by Mixi Inc., Tokyo Cabinet is an incredibly fast and feature-rich database library. [5]

Tokyo Cabinet Highlights

Speed and efficiency are two consistent themes for Tokyo Cabinet. Benchmarks
show that it only takes 0.7 seconds to store 1 million records in the regular hash
table and 1.6 seconds for the B-Tree engine. To achieve this, the overhead per
record is kept at as low as possible, ranging between 5 and 20 bytes: 5 bytes
for B-Tree, 16 – 20 bytes for the Hash-table engine. And if small overhead
is not enough, Tokyo Cabinet also has native support for Lempel-Ziv or BWT
compression algorithms, which can reduce your database to 25% of it’s size
(typical text compression rate). Also, it is thread safe (uses pthreads) and offers
row-level locking. [5]

Features

• Similar use cases as for BerkeleyDB.

• Disk persistence. Can store data larger than RAM.

• Performs well.

• Actively developed. Lots of developers adding new features (but not bug
fixes).

• Similar replication strategy to MySQL. Not useful for scalability as it limits the write throughput to one node.

• Optional compressed pages so has some compression advantages. [7]

Hash and B-Tree Database Engines

The Hash database engine is a direct competitor to BerkeleyDB and other key-value stores: one key, one value, no duplicates, and very fast.
Functionally, the B-Tree database engine is equivalent to the Hash database.
However, because of its underlying structure, the keys can be ordered via a user-
specified function, which in turn allows us to do prefix and range matching on a
key, as well as traverse the entries in order. Let’s look at some examples:

require "rubygems"
require "tokyocabinet"

include TokyoCabinet

bdb = BDB::new   # B-Tree database; keys may have multiple values
bdb.open("casket.bdb", BDB::OWRITER | BDB::OCREAT)

# store records in the database, allowing duplicates
bdb.putdup("key1", "value1")
bdb.putdup("key1", "value2")
bdb.put("key2", "value3")
bdb.put("key3", "value4")

# retrieve all values
p bdb.getlist("key1")
# => ["value1", "value2"]

# range query, find all matching keys
p bdb.range("key1", true, "key3", true)
# => ["key1", "key2", "key3"]

Fixed-length and Table Database Engines

Next, we have the ‘fixed length’ engine, which is best understood as a simple
array. There is absolutely no hashing and access is done via natural number keys,
which also means no key overhead. This method is extremely fast.
Saving the best for last, we have the Table engine, which mimics a relational
database, except that it requires no predefined schema (in this, it is a close
cousin to CouchDB, which allows arbitrary properties on any object). Each
record still has a primary key, but we are allowed to declare arbitrary indexes on
our columns, and even perform queries on them:

require "rubygems"
require "rufus/tokyo/cabinet/table"

t = Rufus::Tokyo::Table.new('table.tdb', :create, :write)

# populate table with arbitrary data (no schema!)
t['pk0'] = { 'name' => 'alfred', 'age' => '22', 'sex' => 'male' }
t['pk1'] = { 'name' => 'bob', 'age' => '18' }
t['pk2'] = { 'name' => 'charly', 'age' => '45', 'nickname' => 'charlie' }
t['pk3'] = { 'name' => 'doug', 'age' => '77' }
t['pk4'] = { 'name' => 'ephrem', 'age' => '32' }

# query table for age >= 32
p t.query { |q|
  q.add_condition 'age', :numge, '32'
  q.order_by 'age'
}
# => [ {"name"=>"ephrem", :pk=>"pk4", "age"=>"32"},
#      {"name"=>"charly", :pk=>"pk2", "nickname"=>"charlie", "age"=>"45"},
#      {"name"=>"doug", :pk=>"pk3", "age"=>"77"} ]

t.close

A.3.2 CouchDB
CouchDB is a free, open-source, distributed, fault-tolerant and schema-free
document-oriented database accessible via a RESTful HTTP/JSON API. Derived
from the key/value store, it uses JSON to define an item’s schema. Data is stored
in ‘documents’, which are essentially key/value maps themselves. CouchDB is
meant to bridge the gap between document-oriented and relational databases by
allowing “views” to be dynamically created using JavaScript. These views map
the document data onto a table-like structure that can be indexed and queried.
It can also do full text indexing of the documents.
At the moment, CouchDB isn’t really a distributed database. It has replica-
tion functions that allow data to be synchronized across servers, but this isn’t the
kind of distribution needed to build highly scalable environments. The CouchDB
community, though, is no doubt working on this. [3, 2]

A.3.3 Project Voldemort


Project Voldemort is a distributed key/value database that is intended to scale
horizontally across a large number of servers. It spawned from work done at
LinkedIn and is reportedly used there for a few systems that have very high
scalability requirements. Project Voldemort also uses an eventual consistency
model, based on Amazon’s.
Project Voldemort handles replication and partitioning of data, and appears
to be well written and designed using Java. [3, 2, 10]

A.3.4 Mongo

Mongo is the database system being developed at 10gen by Geir Magnusson and Dwight Merriman. Like CouchDB, Mongo is a document-oriented JSON
database, except that it is designed to be a true object database, rather than a
pure key/value store. Originally, 10gen focused on putting together a complete
web services stack; more recently, though, it has refocused mainly on the Mongo
database. [2]

Features

• Written in C++.

• Significantly faster than CouchDB.

• JSON and BSON (binary JSON-ish) formats.

• Asynchronous replication with auto-sharding.

• Supports indexes. Querying a property is quick because an index is automatically kept on updates. Trades off some write speed for more consistent read speed.

• Documents can be nested, unlike CouchDB, which requires applications to keep track of relationships themselves. The advantage is that the whole object doesn’t have to be written and read, because the system knows about the relationship. An example is a blog post and its comments: in CouchDB the post and comments are stored together, and creating a view walks through all the comments even though you are only interested in the blog post. Nesting gives better write and query performance.

• More advanced queries than CouchDB. [7]

A.3.5 Drizzle

Drizzle can be thought of as a counter-approach to the problems that key/value stores are meant to solve. Drizzle began life as a spin-off of the MySQL
(6.0) relational database. Over the last few months, its developers have removed
a host of non-core features (including views, triggers, prepared statements, stored
procedures, query cache, ACL, and a number of data types), with the aim of
creating a leaner, simpler, faster database system. Drizzle can still store relational
data; as Brian Aker of MySQL/Sun puts it, “There is no reason to throw out
the baby with the bath water.” The aim is to build a semi-relational database
platform tailored to web- and cloud-based apps running on systems with 16 cores
or more. [2]

A.3.6 Cassandra
The source code for Cassandra was released recently by Facebook, which uses it for inbox search. It is BigTable-esque, but uses a DHT, so it doesn’t need a central server.
It was originally developed at Facebook by some of the key engineers behind Amazon’s famous Dynamo database.
Cassandra can be thought of as a huge 4-or-5-level associative array, where
each dimension of the array gets a free index based on the keys in that level.
The real power comes from that optional 5th level in the associative array, which
can turn a simple key-value architecture into an architecture where you can now
deal with sorted lists, based on an index of your own specification. That 5th level
is called a SuperColumn, and it’s one of the reasons that Cassandra stands out
from the crowd.
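
The "4-or-5-level associative array" picture can be written down directly as nested hashes (keyspace, column family, row key, super column, column); this is only a mental model of the data layout, not Cassandra's API:

# keyspace -> column family -> row key -> super column -> column -> value
inbox = {
  "UserInbox" => {                                   # keyspace
    "MessagesByUser" => {                            # column family
      "alice" => {                                   # row key
        "2010-09-28" => {                            # super column (a sorted bucket)
          "msg-001" => "Seminar draft attached",
          "msg-002" => "Re: NoSQL references"
        },
        "2010-09-29" => {
          "msg-003" => "Final slides"
        }
      }
    }
  }
}

inbox["UserInbox"]["MessagesByUser"]["alice"]["2010-09-29"]["msg-003"]
# => "Final slides"
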
Cassandra has no single points of failure, and can scale from one machine
to several thousands of machines clustered in different data centers. It has no
central master, so any data can be written to any of the nodes in the cluster,
and can be read likewise from any other node in the cluster.
It provides knobs that can be tweaked to slide the scale between consistency
and availability, depending on a particular application and problem domain. And
it provides a high availability guarantee, that if one node goes down, another
node will step in to replace it smoothly. [3, 8]

Pros:
• Open source.

• Incrementally scalable — as data grows one can add more nodes to the storage mesh.

• Minimal administration — because it’s incremental we don’t have to do a lot of up-front planning for migration. [7]

Cons:
• Not polished yet. It was built for inbox searching, so it may not work well for other use cases.

• No compression yet. [7]

A.3.7 BigTable
• Google BigTable — manages data across many nodes.

• Paxos (Chubby) — distributed transaction algorithm that manages locks
across systems.

• BigTable Characteristics:

– Stores data in tablets using GFS, a distributed file system.


– Compression — great gains in throughput, can store more, reduces
IO bottleneck because you have to store less so you have to talk to
the disks less so performance improves.
– Single master — one node knows everything about all the other nodes (backed up and cached).
– Hybrid between row and column database:
∗ Row database — store objects together.
∗ Column database — store attributes of objects together. Makes
sequential retrieval very fast, allows very efficient compression,
reduces disks seeks and random IO.
– Versioning
– Bloom filters — allow data to be distributed across a bunch of nodes. It’s a calculation on data that probabilistically maps the data to the nodes it can be found on (a generic sketch of the data structure follows at the end of this subsection).
– Eventually consistent — append only system using a row time stamp.
When a client queries they get several versions and the client is in
charge of picking the most recent.

• Pros:

– Compression is available.
– Clients are simple.
– Integrates with map-reduce.

• Cons:

– Proprietary to Google — Unavailable for our own use. [7]
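
As mentioned in the Bloom filters bullet above, here is a generic Bloom filter sketch: a probabilistic set that answers "definitely not present" or "probably present" using a small bit array and a few hash functions. This illustrates the data structure itself, not Bigtable's implementation of it.

require "digest/md5"

# Generic Bloom filter: false positives are possible, false negatives are not.
class BloomFilter
  def initialize(bits: 1024, hashes: 3)
    @bits, @hashes = bits, hashes
    @bitmap = Array.new(bits, false)
  end

  # derive several bit positions from one item by salting the hash
  def positions(item)
    (0...@hashes).map { |i| Digest::MD5.hexdigest("#{i}:#{item}").to_i(16) % @bits }
  end

  def add(item)
    positions(item).each { |p| @bitmap[p] = true }
  end

  def maybe_include?(item)
    positions(item).all? { |p| @bitmap[p] }
  end
end

bf = BloomFilter.new
bf.add("row:alice")
bf.maybe_include?("row:alice")   # => true
bf.maybe_include?("row:zed")     # => false (almost certainly)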

A.3.8 Dynamo
• Amazon’s Dynamo — A giant distributed hash table.

• Uses consistent hashing to distribute data to one or more nodes for redun-
dancy and performance.

– Consistent hashing — a ring of nodes and a hash function picks which node(s) store the data (a minimal sketch follows at the end of this subsection).
– Consistency between nodes is based on vector clocks and read repair.
– Vector clocks — time stamp on every row for every node that has
written to it.
– Read repair — When a client does a read and the nodes disagree on
the data it’s up to the client to select the correct data and tell the
nodes the new correct state.

• Pros:

– No Master — eliminates single point of failure.


– Highly Available for Write — This is the partition failure aspect of
CAP. You can write to many nodes at once so depending on the num-
ber of replicas (which is configurable) maintained you should always
be able to write somewhere. So users will never see a write failure.
– Relatively simple which is why we see so many clones.

• Cons:

– Proprietary.
– Clients have to be smart to handle read-repair, rebalancing a cluster,
hashing, etc. Client proxies can handle these responsibilities but that
adds another hop.
– No compression which doesn’t reduce IO.
– Not suitable for column-like workloads; it’s just a key-value store, so it’s not optimized for analytics. Aggregate queries, for example, aren’t in its wheelhouse. [7]
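
As referenced in the consistent hashing bullet above, here is a minimal ring sketch (no virtual nodes or replica lists, purely illustrative):

require "digest/md5"

# Minimal consistent hashing ring.
class Ring
  def initialize(nodes)
    @ring = nodes.map { |n| [hash_of(n), n] }.sort_by(&:first)
  end

  def hash_of(s)
    Digest::MD5.hexdigest(s).to_i(16)
  end

  # first node clockwise from the key's position; wrap around if needed
  def node_for(key)
    h = hash_of(key)
    entry = @ring.find { |point, _| point >= h } || @ring.first
    entry.last
  end
end

ring = Ring.new(%w[node-a node-b node-c])
ring.node_for("user:alice")   # => one of the nodes, stable across calls

Because only the arc between a removed node and its predecessor is affected, adding or removing a node remaps a small fraction of the keys instead of rehashing everything.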

List of Tables

1.1 Fundamental differences between relational databases and key/value stores . . . 8

2.1 Data access for relational databases and key/value stores . . . 12

2.2 Data access for relational databases and key/value stores . . . 12

A.1 Some NoSQL initiatives . . . 40

Bibliography

[1] NoSQL Databases
http://nosql-database.org/

[2] “Is the Relational Database Doomed?”, Tony Bain
http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php

[3] “Anti-RDBMS: A list of distributed key-value stores”, Richard Jones
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-store

[4] “Eventually Consistent”, Werner Vogels (Amazon)
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

[5] “Tokyo Cabinet: Beyond Key-Values Store”
http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/

[6] “Are cloud based memory architectures the next big thing?”, High Scalability Blog
http://highscalability.com/blog/2009/3/16/are-cloud-based-memory-architectures-the-next-big-thing.html

[7] “Drop ACID and think about Data”, High Scalability Blog
http://highscalability.com/drop-acid-and-think-about-data

[8] “Thoughts on NOSQL”, Eric Florenzano
http://www.eflorenzano.com/blog/post/my-thoughts-nosql/

[9] “Should you go beyond relational databases?”, Martin Kleppmann
http://thinkvitamin.com/dev/should-you-go-beyond-relational-databases/

[10] “Notes from the NoSQL Meetup”, Toby Negrin
http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html

[11] “The mixed blessing of Non-Relational Databases”, Ian Thomas Varley
http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf

