NoSQL
Prepared by,
Fayaz Yusuf Khan,
Reg.No: CSU071/16
Guided by,
Nimi Prakash P.
System Analyst
Computer Science & Engineering Department
Next-generation databases mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable. The original intention was modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), huge amounts of data, and more. So the misleading term "NoSQL" (which the community now mostly translates as "not only SQL") should be seen as an alias for something like the definition above. [1]
Contents
1 Introduction 5
1.1 Why relational databases are not enough . . . . . . . . . . . . . 6
1.2 What NoSQL has to offer . . . . . . . . . . . . . . . . . . . . . 7
1.3 ACIDs & BASEs . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 ACID Properties of Relational Databases . . . . . . . . . 7
1.3.2 CAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 BASE . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 NoSQL Features 11
2.1 No entity joins . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Eventual Consistency . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Historical Perspective . . . . . . . . . . . . . . . . . . . 13
2.2.2 Consistency — Client and Server . . . . . . . . . . . . . 15
2.3 Cloud Based Memory Architecture . . . . . . . . . . . . . . . . 19
2.3.1 Memory Based Architectures . . . . . . . . . . . . . . . 19
4.1.8 More Natural Fit with Code . . . . . . . . . . . . . . . . 33
4.2 Demerits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Limitations on Analytics . . . . . . . . . . . . . . . . . . 34
5 Conclusion 37
5.1 Data inconsistency . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Making a Decision . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 1
Introduction
The history of the relational database has been one of continual adversity: ini-
tially, many claimed that mathematical set-based models could never be the basis
for efficient database implementations; later, aspiring object oriented databases
claimed they would remove the “middle man” of relational databases from the
OO design and persistence process. In all of these cases, through a combination
of sound concepts, elegant implementation, and general applicability, relational
databases have become and remained the lingua franca of data storage and ma-
nipulation.
Most recently, a new contender has arisen to challenge the supremacy of re-
lational databases. Referred to generally as “non-relational databases” (among
other names), this class of storage engine seeks to break down the rigidity of
the relational model, in exchange for leaner models that can perform and scale
at higher levels, using various models (including key/value pairs, sharded arrays,
and document-oriented approaches) which can be created and read efficiently as
the basic unit of data storage. Primarily, these new technologies have arisen in
situations where traditional relational database systems would be extremely chal-
lenging to scale to the degree needed for global systems (for example, at com-
panies such as Google, Yahoo, Amazon, LinkedIn, etc., which regularly collect,
store and analyze massive data sets with extremely high transactional throughput
and low latency). As of this writing, there exist dozens of variants of this new
model, each with different capabilities and trade-offs, but all with the general
property that traditional relational design — as practiced on relational database
management systems like Oracle, Sybase, etc. — is neither possible nor desired.
[11]
1.1 Why relational databases are not enough
Even though RDBMS have provided database users with the best mix of sim-
plicity, robustness, flexibility, performance, scalability, and compatibility, their
performance in each of these areas is not necessarily better than that of an al-
ternate solution pursuing one of these benefits in isolation. This has not been
much of a problem so far because the universal dominance of RDBMS has out-
weighed the need to push any of these boundaries. Nonetheless, if you really had
a need that couldn’t be answered by a generic relational database, alternatives
have always been around to fill those niches.
Today, we are in a slightly different situation. For an increasing number
of applications, one of these benefits is becoming more and more critical; and
while still considered a niche, it is rapidly becoming mainstream, so much so
that for an increasing number of database users this requirement is beginning
to eclipse others in importance. That benefit is scalability. As more and more
applications are launched in environments that have massive workloads, such as
web services, their scalability requirements can, first of all, change very quickly
and, secondly, grow very large. The first scenario can be difficult to manage if
you have a relational database sitting on a single in-house server. For example,
if your load triples overnight, how quickly can you upgrade your hardware? The
second scenario can be too difficult to manage with a relational database in
general.
Relational databases scale well, but usually only when that scaling happens on
a single server node. When the capacity of that single node is reached, you need
to scale out and distribute that load across multiple server nodes. This is when
the complexity of relational databases starts to rub against their potential to
scale. Try scaling to hundreds or thousands of nodes, rather than a few, and the
complexities become overwhelming, and the characteristics that make RDBMS
so appealing drastically reduce their viability as platforms for large distributed
systems.
For cloud services to be viable, vendors have had to address this limitation,
because a cloud platform without a scalable data store is not much of a platform
at all. So, to provide customers with a scalable place to store application data,
vendors had only one real option. They had to implement a new type of database
system that focuses on scalability, at the expense of the other benefits that come
with relational databases.
These efforts, combined with those of existing niche vendors, have led to the
rise of a new breed of database management system. [2]
1.2 What NoSQL has to offer
This new kind of database management system is commonly called a key/value
store. In fact, no official name yet exists, so you may see it referred to as
document-oriented, Internet-facing, attribute-oriented, distributed database (al-
though this can be relational also), sharded sorted arrays, distributed hash table,
and key/value database. While each of these names points to specific traits of this
new approach, they are all variations on one theme, which we’ll call key/value
databases.
Whatever you call it, this “new” type of database has been around for a
long time and has been used for specialized applications for which the generic
relational database was ill-suited. But without the scale that web and cloud
applications have brought, it would have remained a mostly unused subset. Now,
the challenge is to recognize whether it or a relational database would be better
suited to a particular application.
Relational databases and key/value databases are fundamentally different and
designed to meet different needs. A side-by-side comparison only takes you so
far in understanding these differences, but to begin, let’s lay one down: [2]
Database Definition

• Relational database: A database contains tables; tables contain columns and rows; and rows are made up of column values. Rows within a table all have the same schema.
Key/value database: A domain can initially be thought of like a table, but unlike a table you don't define any schema for a domain. A domain is basically a bucket that you put items into. Items within a single domain can have differing schemas.

• Relational database: The data model is well defined in advance. A schema is strongly typed, and it has constraints and relationships that enforce data integrity.
Key/value database: Items are identified by keys, and a given item can have a dynamic set of attributes attached to it.

• Relational database: The data model is based on a "natural" representation of the data it contains, not on an application's functionality.
Key/value database: In some implementations, attributes are all of a string type. In other implementations, attributes have simple types that reflect code types, such as ints, string arrays, and lists.

• Relational database: The data model is normalized to remove data duplication. Normalization establishes table relationships. Relationships associate data between tables.
Key/value database: No relationships are explicitly defined between domains or within a given domain.

1.3 ACIDs & BASEs

1.3.1 ACID Properties of Relational Databases

• The problem with ACID is that it guarantees too much; it trips up when you try to scale a system across multiple nodes.

• Making a scalable system that can handle lots and lots of reads and writes requires many more nodes.

• Once we try to scale ACID across many machines, we hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.
1.3.2 CAP
• If we can't have all of the ACID guarantees, it turns out we can have only two of the following three characteristics:

– Consistency: your data is correct all the time. What you write is what you read.

– Availability: you can read and write your data all the time.

– Partition tolerance: if one or more nodes fail, the system still works, and it becomes consistent again when the nodes come back on-line.
1.3.3 BASE

• The types of large systems built around CAP aren't ACID; they are BASE (Basically Available, Soft state, Eventually consistent).

• Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc. [7]
Chapter 2
NoSQL Features
2.1 No entity joins

While the need for relationships is greatly reduced with key/value databases, certain ones are inevitable. These relationships usually exist among core entities. For example, an ordering system would have items that contain data about
customers, products, and orders. Whether these reside in the same domain or in separate domains is irrelevant; but when a customer places an order, you would likely not want to store both the customer's and the product's attributes in the same order item.
Instead, orders would need to contain relevant keys that point to the cus-
tomer and product. While this is perfectly doable in a key/value database, these
relationships are not defined in the data model itself, and so the database man-
agement system cannot enforce the integrity of those relationships. This means you can, for example, delete a customer even though orders still reference it. The responsibility of ensuring data integrity falls entirely to the application. [2]
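The pattern above can be sketched with a toy in-memory store. All names here are hypothetical illustrations, not any particular product's API: the order item holds keys that point at the customer and product items, and nothing in the store itself prevents a dangling reference.

```python
# Toy in-memory key/value store: domains are buckets of free-form items.
store = {}  # domain -> {key -> item}

def put(domain, key, item):
    store.setdefault(domain, {})[key] = item

def get(domain, key):
    return store.get(domain, {}).get(key)

put("customers", "cust-1", {"name": "Homer Simpson"})
put("products", "prod-7", {"title": "Donut box", "price": 4.99})

# The order stores *keys*, not copies of the customer/product attributes.
put("orders", "order-42",
    {"customer_key": "cust-1", "product_key": "prod-7", "qty": 2})

# Nothing stops us from deleting a customer that orders still reference:
del store["customers"]["cust-1"]

# So the integrity check lives in application code, not in the database:
order = get("orders", "order-42")
assert get("customers", order["customer_key"]) is None  # dangling reference
```

The database happily serves the order with its now-dangling customer key; detecting and handling that situation is the application's job.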
Data Access

• Relational database: Data is created, updated, deleted, and retrieved using SQL.
Key/value database: Data is created, updated, deleted, and retrieved using API method calls.

• Relational database: SQL queries can access data from a single table or from multiple tables through table joins.
Key/value database: Some implementations provide basic SQL-like syntax for defining filter criteria.

• Relational database: SQL queries include functions for aggregation and complex filtering.
Key/value database: Often only basic filter predicates (such as =, ≠, <, >, ≤, and ≥) can be applied.

• Relational database: Usually contains means of embedding logic close to the data storage, such as triggers, stored procedures, and functions.
Key/value database: All application and data-integrity logic is contained in the application code.

Table 2.1: Data access for relational databases and key/value stores

Application Interface

• Relational database: Tends to have its own specific API, or makes use of a generic API such as OLE DB or ODBC.
Key/value database: Tends to provide SOAP and/or REST APIs over which data-access calls can be made.

• Relational database: Data is stored in a format that represents its natural structure, so it must be mapped between application-code structures and relational structures.
Key/value database: Data can be stored in a format closer to the structures in application code, requiring only minimal "plumbing" code for each object.

Table 2.2: Application interfaces for relational databases and key/value stores
2.2 Eventual Consistency
Eventually consistent: building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.
At the foundation of Amazon’s cloud computing are infrastructure services
such as Amazon’s S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic
Compute Cloud) that provide the resources for constructing Internet-scale com-
puting platforms and a great variety of applications. The requirements placed on
these infrastructure services are very strict; they need to score high marks in the
areas of security, scalability, availability, performance, and cost effectiveness, and
they need to meet these requirements while serving millions of customers around
the globe, continuously.
Under the covers these services are massive distributed systems that operate
on a worldwide scale. This scale creates additional challenges, because when a
system processes trillions and trillions of requests, events that normally have a low
probability of occurrence are now guaranteed to happen and need to be accounted
for up front in the design and architecture of the system. Given the worldwide
scope of these systems, we use replication techniques ubiquitously to guarantee
consistent performance and high availability. Although replication brings us closer
to our goals, it cannot achieve them in a perfectly transparent manner; under a
number of conditions the customers of these services will be confronted with the
consequences of using replication techniques inside the services.
One of the ways in which this manifests itself is in the type of data consistency
that is provided, particularly when the underlying distributed system provides an
eventual consistency model for data replication. When designing these large-scale
systems, we ought to use a set of guiding principles and abstractions related to
large-scale data replication and focus on the trade-offs between high availability
and data consistency. [4]
In the mid-’90s, with the rise of larger Internet systems, these practices were
revisited. At that time people began to consider the idea that availability was
perhaps the most important property of these systems, but they were struggling
with what it should be traded off against. Eric Brewer, systems professor at
the University of California, Berkeley, and at that time head of Inktomi, brought
the different trade-offs together in a keynote address to the PODC (Principles
of Distributed Computing) conference in 2000. He presented the CAP theorem,
which states that of three properties of shared-data systems — data consistency,
system availability, and tolerance to network partition — only two can be achieved
at any given time. A more formal confirmation can be found in a 2002 paper by
Seth Gilbert and Nancy Lynch.
A system that is not tolerant to network partitions can achieve data consis-
tency and availability, and often does so by using transaction protocols. To make
this work, client and storage systems must be part of the same environment;
they fail as a whole under certain scenarios, and as such, clients cannot observe
partitions. An important observation is that in larger distributed-scale systems,
network partitions are a given; therefore, consistency and availability cannot be
achieved at the same time. This means that there are two choices on what to
drop: relaxing consistency will allow the system to remain highly available under
the partitionable conditions, whereas making consistency a priority means that
under certain conditions the system will not be available.
Both options require the client developer to be aware of what the system is
offering. If the system emphasizes consistency, the developer has to deal with
the fact that the system may not be available to take, for example, a write. If
this write fails because of system unavailability, then the developer will have to
deal with what to do with the data to be written. If the system emphasizes
availability, it may always accept the write, but under certain conditions a read
will not reflect the result of a recently completed write. The developer then has
to decide whether the client requires access to the absolute latest update all the
time. There is a range of applications that can handle slightly stale data, and
they are served well under this model.
2.2.2 Consistency — Client and Server
There are two ways of looking at consistency. One is from the developer/client
point of view: how they observe data updates. The second way is from the server
side: how updates flow through the system and what guarantees systems can
give with respect to updates. [4]
Client-side Consistency
The client side has these components:

A storage system. For the moment we'll treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed, and that it is built to guarantee durability and availability.

Process A. This is a process that writes to and reads from the storage system.

Processes B and C. These two processes are independent of process A and also write to and read from the storage system.
Client-side consistency has to do with how and when observers (in this case
the processes A, B, or C) see updates made to a data object in the storage
systems. In the following examples illustrating the different types of consistency,
process A has made an update to a data object:
Strong consistency. After the update completes, any subsequent access (by
A, B, or C) will return the updated value.
Weak consistency. The system does not guarantee that subsequent accesses
will return the updated value. A number of conditions need to be met
before the value will be returned. The period between the update and
the moment when it is guaranteed that any observer will always see the
updated value is dubbed the inconsistency window.
Eventual consistency. This is a specific form of weak consistency: the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. The most popular system that implements eventual consistency is DNS (the Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.
The eventual consistency model has a number of variations that are important
to consider:
Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.

Read-your-writes consistency. After a process has updated a data object, it will always access the updated value and will never see an older value.

Session consistency. A practical version of read-your-writes consistency, where the guarantee holds within the scope of a session the process has with the storage system.
A number of these properties can be combined. For example, one can get
monotonic reads combined with session-level consistency. From a practical point
of view these two properties (monotonic reads and read-your-writes) are most
desirable in an eventual consistency system, but not always required. These two
properties make it simpler for developers to build applications, while allowing the
storage system to relax consistency and provide high availability.
As you can see from these variations, quite a few different scenarios are possible. Whether or not one can deal with the consequences depends on the particular application.
Eventual consistency is not some esoteric property of extreme distributed
systems. Many modern RDBMSs (relational database management systems)
that provide primary-backup reliability implement their replication techniques in
both synchronous and asynchronous modes. In synchronous mode the replica
update is part of the transaction. In asynchronous mode the updates arrive
at the backup in a delayed manner, often through log shipping. In the latter
mode if the primary fails before the logs are shipped, reading from the promoted
backup will produce old, inconsistent values. Also, to support scalable read performance, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual-consistency guarantees in which the inconsistency window depends on the periodicity of the log shipping.
[4]
Server-side Consistency
On the server side we need to take a deeper look at how updates flow through
the system to understand what drives the different modes that the developer who
uses the system can experience. Let’s establish a few definitions before getting
started:
N = the number of nodes that store replicas of the data
W = the number of replicas that need to acknowledge the receipt of the
update before the update completes
R = the number of replicas that are contacted when a data object is accessed
through a read operation
If W + R > N , then the write set and the read set always overlap and one
can guarantee strong consistency. In the primary-backup RDBMS scenario, which
implements synchronous replication, N = 2, W = 2, and R = 1. No matter
from which replica the client reads, it will always get a consistent answer. In
asynchronous replication with reading from the backup enabled, N = 2, W = 1,
and R = 1. In this case R + W = N , and consistency cannot be guaranteed.
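The W + R > N rule can be captured in a few lines, using nothing beyond the definitions in the text; the configurations below are the ones just discussed.

```python
def overlap_guaranteed(n, w, r):
    """True when every read quorum must intersect every write quorum,
    i.e. when W + R > N and strong consistency can be guaranteed."""
    return w + r > n

# Synchronous primary-backup replication: N=2, W=2, R=1
print(overlap_guaranteed(2, 2, 1))  # True  -> strong consistency
# Asynchronous replication with reads from the backup: N=2, W=1, R=1
print(overlap_guaranteed(2, 1, 1))  # False -> eventual consistency
# Common fault-tolerant configuration: N=3, W=2, R=2
print(overlap_guaranteed(3, 2, 2))  # True
```

Any write quorum of W nodes and read quorum of R nodes drawn from N replicas must share a node whenever W + R exceeds N, so the read always sees the latest acknowledged write.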
The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N = 3 and W = 3 and only two nodes available, the system will have to fail the write.
In distributed-storage systems that need to provide high performance and
high availability, the number of replicas is in general higher than two. Systems
that focus solely on fault tolerance often use N = 3 (with W = 2 and R = 2
configurations). Systems that need to serve very high read loads often replicate
their data beyond what is required for fault tolerance; N can be tens or even
hundreds of nodes, with R configured to 1 such that a single read will return
a result. Systems that are concerned with consistency are set to W = N for
updates, which may decrease the probability of the write succeeding. A common
configuration for these systems that are concerned about fault tolerance but not
consistency is to run with W = 1 to get minimal durability of the update and
then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N , W , and R depends on what the common case is and
which performance path needs to be optimized. In R = 1 and N = W we
optimize for the read case, and in W = 1 and R = N we optimize for a very fast
write. Of course in the latter case, durability is not guaranteed in the presence
of failures, and if W < (N + 1)/2, there is the possibility of conflicting writes
when the write sets do not overlap.
Weak/eventual consistency arises when W + R ≤ N , meaning that there is
a possibility that the read and write set will not overlap. If this is a deliberate
configuration and not based on a failure case, then it hardly makes sense to set
R to anything but 1. This happens in two very common cases: the first is the
massive replication for read scaling mentioned earlier; the second is where data
access is more complicated. In a simple key-value model it is easy to compare
versions to determine the latest value written to the system, but in systems that
return sets of objects it is more difficult to determine what the correct latest
set should be. In most of these systems where the write set is smaller than the
replica set, a mechanism is in place that applies the updates in a lazy manner
to the remaining nodes in the replica’s set. The period until all replicas have
been updated is the inconsistency window discussed before. If W + R ≤ N , then
the system is vulnerable to reading from nodes that have not yet received the
updates.
Whether or not read-your-writes, session, and monotonic consistency can be
achieved depends in general on the “stickiness” of clients to the server that
executes the distributed protocol for them. If this is the same server every time,
then it is relatively easy to guarantee read-your-writes and monotonic reads. This
makes it slightly harder to manage load balancing and fault tolerance, but it is a
simple solution. Using sessions, which are sticky, makes this explicit and provides
an exposure level that clients can reason about.
Sometimes the client implements read-your-writes and monotonic reads. By
adding versions on writes, the client discards reads of values with versions that
precede the last-seen version.
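The version-tracking idea above can be sketched as a hypothetical client wrapper (the class and method names are illustrative, not any product's API): the client remembers the highest version it has seen per key and discards any staler reply.

```python
class VersionedClient:
    """Client-side read-your-writes / monotonic reads via version tracking."""

    def __init__(self):
        self.last_seen = {}  # key -> highest version observed so far

    def on_write(self, key, version):
        # Record the version the storage system assigned to our write.
        self.last_seen[key] = version

    def accept_read(self, key, version, value):
        """Accept a read only if it is at least as new as anything seen;
        a stale reply (from a lagging replica) is discarded."""
        if version < self.last_seen.get(key, -1):
            return None  # stale; the caller should retry another replica
        self.last_seen[key] = version
        return value

c = VersionedClient()
c.on_write("x", version=3)            # the client wrote version 3
print(c.accept_read("x", 2, "old"))   # None: precedes the last-seen version
print(c.accept_read("x", 3, "new"))   # accepted: at least as new
```

This keeps the guarantee entirely on the client side, at the cost of the client having to carry its version state between requests.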
Partitions happen when some nodes in the system cannot reach other nodes,
but both sets are reachable by groups of clients. If you use a classical majority
quorum approach, then the partition that has W nodes of the replica set can
continue to take updates while the other partition becomes unavailable. The
same is true for the read set. Given that these two sets overlap, by definition the
minority set becomes unavailable. Partitions don’t happen frequently, but they
do occur between data centers, as well as inside data centers.
In some applications the unavailability of any of the partitions is unacceptable,
and it is important that the clients that can reach that partition make progress.
In that case both sides assign a new set of storage nodes to receive the data,
and a merge operation is executed when the partition heals. [4]
For disk reads, latencies can easily be in the many-millisecond range, while memory latency is on the order of nanoseconds; memory is thousands of times faster. [6]

Accessing disk on the critical path of any transaction limits both throughput and latency. Committing a transaction over the network, in memory, is faster than writing through to disk, and reading data from memory is likewise faster than reading it from disk. So the idea is to skip disk, except perhaps as an asynchronous write-behind option, as archival storage, and for large files. [6]
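The skip-disk idea can be sketched as a toy write-behind store (a hypothetical class, not any product's API): writes commit in memory on the critical path, while a background thread lazily appends them to an archival log.

```python
import json
import os
import queue
import tempfile
import threading

class WriteBehindStore:
    """Toy store: reads and writes hit memory; disk is write-behind only."""

    def __init__(self, path):
        self.mem = {}
        self.q = queue.Queue()
        self.path = path
        threading.Thread(target=self._flusher, daemon=True).start()

    def put(self, key, value):
        self.mem[key] = value      # the write "commits" in memory
        self.q.put((key, value))   # the disk write happens later

    def get(self, key):
        return self.mem.get(key)   # reads never touch disk

    def _flusher(self):
        # Off the critical path: append updates to an archival log.
        while True:
            key, value = self.q.get()
            with open(self.path, "a") as f:
                f.write(json.dumps({key: value}) + "\n")
            self.q.task_done()

path = os.path.join(tempfile.gettempdir(), "writebehind.log")
s = WriteBehindStore(path)
s.put("a", 1)
print(s.get("a"))  # served from memory, regardless of flush progress
s.q.join()         # wait for the background flush before exiting
```

The transaction's latency is that of the in-memory update; durability arrives later, which is exactly the trade-off the text describes.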
Chapter 3
The big limitation of BigTables and document databases is that most imple-
mentations cannot perform joins or transactions spanning several rows or doc-
uments. This restriction is deliberate, because it allows the database to do
automatic partitioning, which can be important for scaling; see section 3.4 on distributed key-value stores below. If our data consists of lots of independent documents, this is not a problem; but if the data fits nicely into a
relational model and we need joins, we shouldn’t try to force it into a document
model. [9]
3.3 MapReduce
Popularised by another Google paper, MapReduce is a way of writing batch
processing jobs without having to worry about infrastructure. Different databases
lend themselves more or less well to MapReduce.
Hadoop is the big one amongst the open MapReduce implementations, and
Skynet and Disco are also worth looking at. CouchDB also includes some MapRe-
duce ideas on a smaller scale. [9]
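The shape of a MapReduce job can be sketched in a few lines of single-machine Python; real frameworks such as Hadoop distribute exactly these phases (map, shuffle, reduce) across many nodes.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) for every word in the document.
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # Group all emitted values by key; in a real framework this is the
    # network phase that routes each key to one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for a key into the final count.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because the map calls are independent and each reduce sees only its own key's values, the framework can run both phases in parallel without the programmer worrying about the infrastructure.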
might be a structured document (most of the document databases and BigTable
implementations above can also be considered to be key-value stores).
Document databases, graph databases, and MapReduce introduce new data models and new ways of thinking that can be useful even in small-scale applications. Distributed key-value stores, on the other hand, are really just
about scalability. They can scale to truly vast amounts of data — much more
than a single server could hold.
Distributed databases can transparently partition and replicate the data across
many machines in a cluster. We don't need to figure out a sharding scheme to decide which server holds a particular piece of data; the database can
locate it for us. If one server dies, no problem — others can immediately take
over. If we need more resources, just add servers to the cluster, and the database
will automatically give them a share of the load and the data.
When choosing a key-value store we need to decide whether it should be opti-
mised for low latency (for lightning-fast data access during the request-response
cycle) or for high throughput (which is needed for batch processing jobs).
Other than the BigTables and document databases above, Scalaris, Dynomite
and Ringo provide certain data consistency guarantees while taking care of par-
titioning and distributing the dataset. MemcacheDB and Tokyo Cabinet (with
Tokyo Tyrant for network service and LightCloud to make it distributed) focus
on latency.
The caveat about limited transactions and joins applies even more strongly
for distributed databases. Different implementations take different approaches,
but in general, if we need to read several items, manipulate them in some way and
then write them back, there is no guarantee that we will end up in a consistent
state immediately (although many implementations try to become eventually
consistent by resolving write conflicts or using distributed transaction protocols).
[9]
Chapter 4
4.1 Merits
There is a long list of potential advantages to using non-relational databases.
Of course, not all non-relational databases are the same; but the following list
covers areas common to many of them. [11]
Relational design has served the well-understood programming problems of the past 40 years, such as accounting systems, desktop word-processing software, etc. However, many of today's interesting problems involve
unpredictable behavior and inputs from extremely large populations; consider
web search, social network graphs, large scale purchasing habits, etc. In these
“messy” arenas, the impulse to exactly model and define all the possible struc-
tures in the data in advance is exactly the wrong approach. Relational data
design tends to turn programmers into “structure first” proponents, but in many
cases, the rest of the world (including the users we are writing programs for) are
thinking “data first”. [11]
Modelling data in terms of relations, tuples and attributes —or equivalently, ta-
bles, rows and columns — is but one conceptual approach. There are entirely
different ways of considering, planning, and designing a data model. These in-
clude hierarchical trees, arbitrary graphs, structured objects, cube or star schema
analytical approaches, tuple spaces, and even undifferentiated (emergent) stor-
age. By moving into the realm of semi-structured non-relational data, we gain the
possibility of accessing our data along these lines instead of simply in relational
database terms.
Consider, for example, graph-oriented databases such as Neo4j. This paradigm attempts to map persistent storage capabilities directly onto the graph model of computation: sets of nodes connected by sets of edges. The database engine then innately provides many algorithmic services that one would expect on graph representations: establishing spanning trees, finding shortest paths, depth- and breadth-first search, etc.
Object databases are another paradigm that has, at various times, appeared poised to challenge the supremacy of the relational database. An example of a current contender in this space is Persevere (http://www.persvr.org/), which
is an object store for JSON (JavaScript Object Notation) data. Advantages
gained in this space include a consistent execution model between the storage
engine and the client platform (JavaScript, in this case), and the ability to natively
store objects without any translation layer.
Here again, the general principle is that by moving away from the strictly
modeled structure of SQL, we untie the hands of developers to model data in
terms they may be more familiar with, or that may be more conducive to solving
the problem at hand. This is very attractive to many developers. [11]
4.1.3 Multi-Valued Properties
Even within the bounds of the more traditional relational approach, there are ways in which the semi-structured approach of non-relational databases can give us a helping hand in conceptual data design. One of these is by way of multi-valued properties, that is, attributes that can simultaneously take on more than one value.
A credo of relational database design is that for any given tuple in a relation,
there is only one value for any given attribute; storing multiple values in the same
attribute for the same tuple is considered very bad practice, and is not supported
by standard SQL. Generally, cases where one might be tempted to store multiple
values in the same attribute indicate that the design needs further normalization.
As an example, consider a User relation, with an attribute email. Since
people typically have more than one email address, a simple (but wrong, at least
for relational database design) decision might be to store the email addresses as
a comma-delimited list within the “emails” attribute.
The problems with this are myriad - for example, simple membership tests
like
SELECT * FROM User WHERE emails = 'homer@simpson.com';
will fail if there is more than one email address in the list, because that is no
longer the value of the attribute; a more general test using wildcards such as
SELECT * FROM User WHERE emails LIKE '%homer@simpson.com%';
will succeed, but raises serious performance issues in that it defeats the use of
indexes and causes the database engine to do (at best) linear-time text pattern
searches against every value in the table. Worse, it may actually impact correct-
ness if entries in the list can be proper substrings of each other (as in the list
“car, cart, art”).
The proper way to design for this situation, in a relational model, is to nor-
malize the email addresses into their own table, with a foreign key relationship
to the user table.
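A minimal sketch of that normalized design, using SQLite purely for illustration (the table and column names here are invented):

```python
import sqlite3

# Normalized design: email addresses live in their own table, joined to
# the user table by a foreign key, one address per row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (
        user_id INTEGER PRIMARY KEY,
        name    TEXT
    );
    CREATE TABLE UserEmail (
        user_id INTEGER REFERENCES User(user_id),
        email   TEXT
    );
""")
conn.execute("INSERT INTO User VALUES (1, 'Homer')")
conn.executemany("INSERT INTO UserEmail VALUES (1, ?)",
                 [("homer@simpson.com",), ("max.power@globex.com",)])

# Membership tests now compare a single, indexable value per row:
row = conn.execute("""
    SELECT u.name FROM User u
    JOIN UserEmail e ON e.user_id = u.user_id
    WHERE e.email = 'homer@simpson.com'
""").fetchone()
print(row[0])  # Homer
```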
This is a design strategy that is frequently applied to many situations in
standard relational database design, even recursively: if you sense a one-to-many
relationship in an attribute, break it out into two relations with a foreign key.
The trouble with this pattern, however, is that it still does not elegantly
serve all the possible use cases of such data, especially in situations with a low
cardinality; either it is overkill, or it is a clumsy way to store data. In the above
example, there is a very small set of use cases that we typically have for
email addresses, including:
• Return the user, along with their one “primary” email address, for normal
operations involving sending an email to the user.
• Return the user with a list of all their email addresses, for showing on a
“profile” screen, for example.
• Find which user (if any) has a given email address.
The first situation requires an additional attribute along the lines of is_primary
on the email table, not to mention logic to ensure that only one email tuple per
user is marked as primary (which cannot be done natively in a relational database,
because a UNIQUE constraint on the user_id and the is_primary field would
only allow one primary and one non-primary email address per user_id). Alter-
nately, a primary_email field can be kept on the User table, acting as a cache of
which email address is the primary one; this too requires coordination by code to
ensure that the cached address actually exists in the UserEmail table, etc.
To use standard SQL to return a single tuple containing the user and all
of their email addresses, comma delimited like our original (“wrong”) design
concept, is actually quite difficult under this two-table structure. Standard SQL
has no way of rendering this output, which is surprising considering how common
it is. The only mechanisms would be constructing intermediate temporary tables
of the information, looping through records of the join relation and outputting
one tuple per user id with the concatenation of email addresses as an attribute.
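For completeness, many engines do offer a non-standard aggregate for exactly this job. The sketch below uses SQLite's group_concat(), which (like MySQL's GROUP_CONCAT) is vendor-specific and not part of standard SQL:

```python
import sqlite3

# One row per user with a comma-delimited email list, via a
# vendor-specific aggregate (SQLite's group_concat; not standard SQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserEmail (user_id INTEGER, email TEXT);
    INSERT INTO UserEmail VALUES (1, 'homer@simpson.com');
    INSERT INTO UserEmail VALUES (1, 'max.power@globex.com');
""")
row = conn.execute("""
    SELECT user_id, group_concat(email, ',') FROM UserEmail
    GROUP BY user_id
""").fetchone()
# Note: the order of addresses within the concatenated list is not
# guaranteed by the engine.
print(row)
```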
Under key/value stores, we have a different paradigm entirely for this problem,
and one which much more closely matches the real-world uses of such data. We
can simply model the email attribute as a substructure: a list of emails within
the attribute.
For example, Google App Engine has a “List” type that can store exactly this
type of information as an attribute:
class User(db.Model):
    name = db.StringProperty()
    emails = db.StringListProperty()
The query system then has the ability to not only return the contained lists
as structured data, but also to do membership queries, such as:
results = db.GqlQuery(
    "SELECT * FROM User WHERE emails = 'homer@simpson.com'"
)
Since order is preserved, the semantics of “primary” versus “additional” can
be encoded into the order of items, so no additional attribute is needed for
this purpose; we can always get the primary email by saying something like
“results.emails[0]”.
In effect, we have expressed our actual data requirements in a much more
succinct and powerful way using this notation, without any noticeable loss in
precision, abstraction, or expressive power. [11]
4.1.4 Generalized Analytics
If the nature of the analysis falls outside of SQL’s standard set of operations,
it can be extremely difficult to produce results within the operational silo of SQL
queries. Worse, this has a pernicious effect on the mindset of data developers,
sometimes called “SQL Myopia”: if you can’t do it in SQL, you can’t do it.
This is unfortunate, because there are many interesting and useful modes of
interacting with data sets that are outside of this paradigm — consider matrix
transformations, data mining, clustering, Bayesian filtering, probability analysis,
etc.
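As a toy illustration of analysis that sits awkwardly in SQL, the following pure-Python sketch (with invented data) counts two-step paths in a small follower graph by squaring its adjacency matrix, the kind of matrix transformation the text mentions:

```python
# Counting two-step paths (friend-of-a-friend suggestions) via matrix
# multiplication; the graph data is invented for illustration.
follows = {"alice": ["bob"], "bob": ["carol"], "carol": []}
users = sorted(follows)
idx = {u: i for i, u in enumerate(users)}

# Build the adjacency matrix A, where A[i][j] = 1 if user i follows j.
n = len(users)
A = [[0] * n for _ in range(n)]
for u, targets in follows.items():
    for t in targets:
        A[idx[u]][idx[t]] = 1

# A squared counts the paths of length two between each pair of users.
A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
print(A2[idx["alice"]][idx["carol"]])  # alice -> bob -> carol: 1 path
```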
Additionally, besides simply lacking Turing-completeness, SQL has a long list
of faults that non-SQL developers regularly present. These include a verbose,
non-customizable syntax; inability to reduce nested constructions to recursive
calls, or generally work with graphs, trees, or nested structures; inconsistency in
specific implementation between vendors, despite standardization; and so forth.
It is no wonder that the moniker for the current non-relational database move-
ment is converging on the tag “NoSQL”: it is a limited, inelegant language.
[11]
plications is prohibitive both from a programming complexity standpoint (this
ability must be consciously designed into each entity that might need it) as well
as from a performance standpoint.
We have two main options when keeping a history of information for a table.
On the one hand, we can keep a full additional copy of every row whenever it
changes. This can be done in place, by adding an additional component to the
primary key which is a timestamp or version number.
This is problematic in that all application code that interacts with this entity
needs to know about the versioning scheme; it also complicates the indexing of
the entities, because relational database storage with a composite primary key
including a date is significantly less optimized than for a single integer key.
Alternately, the entire-row history method can be done in a secondary table
which only keeps historical records, much like a log.
This is less obtrusive on the application (which need not even be aware of
its existence, especially if it is produced via a database-level procedure or trigger),
and has the benefit that it can be populated asynchronously.
However, both of these cases require O(sn) storage, where s is the row size
and n is the number of updates. For large row sizes, this approach can be
prohibitive.
The other mechanism for doing this is to keep what amounts to an Entity/
Attribute/Value table for the historical changes: a table where only the changed
value is kept. This is easier to do in situations where the table design itself is
already in the EAV paradigm, but can still be done dynamically (if not efficiently)
by using the string name of the updated attribute.
For sparsely updated tables, this approach does save space over the entire-row
versions, but it suffers from the drawback that any use of this data via interactive
SQL queries is nearly impossible.
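A minimal sketch of the Entity/Attribute/Value history approach in Python (the names and data are invented): only the changed attribute is logged, rather than a full copy of the row.

```python
import time

# EAV-style change log: each entry records only the entity, the changed
# attribute, its previous value, and when the change happened.
history = []  # entries: (entity_id, attribute, old_value, timestamp)

def update(entity, entity_id, attribute, new_value):
    old = entity.get(attribute)
    if old != new_value:
        history.append((entity_id, attribute, old, time.time()))
        entity[attribute] = new_value

user = {"name": "Homer", "email": "homer@simpson.com"}
update(user, 1, "email", "max.power@globex.com")
update(user, 1, "name", "Max Power")

# Storage grows with the number of changed attributes, not with the full
# row size per update as in the entire-row schemes above.
print(len(history))  # 2
```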
Overall, the non-relational database stores that support column-based version
history have a huge advantage in any situations where the application might need
this level of historical data snapshots. [11]
This directly affects the modeling concepts supported by these systems,
because it elevates scalability concerns to a first-class modeling directive — part
of the logical and conceptual modeling process itself. Rather than designing an
elegant relational model and only later considering how it might reasonably be
“sharded” or replicated in such a way as to provide high availability in various
failure scenarios (typically accompanied by great cost, in commercial relational
database products), instead the bedrock of the logical design asks: how can we
conceive of this data in such a way that it is scalable by its definition?
As an example, consider the mechanism for establishing the locality of trans-
actions in Bigtable and its ilk (including the Google App Engine data store).
Obviously, when involving multiple entities in a transaction on a distributed data
store, it is desirable to restrict the number of nodes that must actually par-
ticipate in the transaction. (While protocols do of course exist for distributed
transactions, the performance of these protocols suffers immensely as the size of
the machine cluster increases, because the risk of a node failure, and thus a time-
out on the distributed transaction, increases.) It is therefore most beneficial to
couple related entities tightly, and unrelated entities loosely, so that the most
common entities to participate in a transaction would be those that are already
tightly coupled. In a relational database, you might use foreign key relationships
to indicate related entities, but the relationship carries no additional information
that might indicate “these two things are likely to participate in transactions
together”.
By contrast, in Bigtable, this is enabled by allowing entities to indicate an
“ancestor” relation chain, of any depth. That is, entity A can declare entity B its
“parent”, and henceforth, the data store organizes the physical representation of
these entities on one (or a small number of) physical machines, so that they can
easily participate in shared transactions. This is a natural design inclination, but
one that is not easily expressed in the world of relational databases (you could
certainly provide self-relationships on entities but since SQL does not readily
express recursive relationships, that is only beneficial in cases where the self-
relationship is a key part of the data design itself, with business import.)
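The locality idea can be sketched roughly as follows. The shard count, entity names, and hashing scheme here are illustrative assumptions, not Bigtable's actual mechanism; the point is only that everything in one ancestor chain lands on the same node.

```python
# Ancestor-based locality: entities that share a root ancestor are routed
# to the same shard, so they can join cheap, local transactions.
NUM_SHARDS = 4
parents = {}  # entity -> its declared parent (the "ancestor" chain)

def root(entity):
    # Walk the ancestor chain up to the root entity.
    while entity in parents:
        entity = parents[entity]
    return entity

def shard_for(entity):
    # All entities in one ancestor chain hash to the same shard.
    return hash(root(entity)) % NUM_SHARDS

parents["order:17"] = "customer:42"
parents["orderline:99"] = "order:17"

# The customer, the order, and the order line land together:
assert shard_for("orderline:99") == shard_for("customer:42")
```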
Many commercial relational database vendors make the claim that their solu-
tions are highly scalable. This is true, but there are two caveats. First, of course,
is cost: sharded, replicated instances of Oracle or DB2 are not a cheap com-
modity, and the cost scales with the load. Second, however, and less obvious, is
the predictability factor. This is highly touted by systems such as Project Volde-
mort, which point out that with a simple data model, as in many non-relational
databases, not only can you scale more easily, but you can scale more predictably:
the requirements to support additional operations, in terms of CPU and memory,
are known fairly exactly, so load planning can be an exact science. Compare this
with SQL/relational database scaling, which is highly unpredictable due to the
complex nature of the RDBMS engine.
There are, naturally, other criteria that are involved in the quest for per-
formance and scalability, including topics like low level data storage (b-tree-like
storage formats, disk access patterns, solid state storage, etc.); issues with the
raw networking of systems and their communications overhead; data reliability,
both considered for single-node and multi-node systems, etc.
Because key/value databases easily and dynamically scale, they are also the
database of choice for vendors who provide a multi-user, web services platform
data store. The database provides a relatively cheap data store platform with
massive potential to scale. Users typically only pay for what they use, but their
usage can increase as their needs increase. Meanwhile, the vendor can scale the
platform dynamically based on the total user load, with little limitation on the
entire platform’s size. [2, 11]
back to disk, the RDBMS is obligated to do this operation atomically and in
real-time (because DDL updates are transactional); regardless of how efficiently
implemented it is, this type of operation cannot be made seamless in a highly
transactional production environment.
Second, the release of relational database schema changes typically requires
precise coordination with application-layer code; the code version must exactly
match the data version. In any highly available application, there is a high likeli-
hood that this implies downtime, or at least advanced operational coordination
that takes a great deal of precision and energy.
Non-relational databases, by comparison, can use a very different approach
for schema versioning. Because the schema (in many cases) is not enforced at
the data engine level, it is up to the application to enforce (and migrate) the
schema. Therefore, a schema change can be gradually introduced by code that
understands how to interact with both the N − 1 version and the N version,
and leaves each entity updated as it is touched. “Gardener” processes can then
periodically sweep through the data store, updating nodes as a lower-priority
process.
Naturally, this approach produces more complex code in the short term, es-
pecially if the schema of the data is relied upon by analytical (map/reduce) jobs.
But in many cases, the knowledge that no downtime will be required during
a schema evolution is worth the additional complexity. In fact, this approach
might be seen to encourage a more agile development methodology, because
each change to the internal schema of the application’s data is bundled with the
update to the codebase, and can be collectively versioned and managed accord-
ingly.
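A rough sketch of this lazy-migration pattern in Python (the version numbers, field names, and store are invented for the example): code understands both the old and new schema, and each entity is upgraded as it is touched.

```python
# Lazy schema migration: readers understand schema versions N-1 and N,
# upgrading each entity on read so the store converges over time.
CURRENT_VERSION = 2

def upgrade(entity):
    if entity.get("_version", 1) == 1:
        # v1 stored a single 'email' string; v2 stores a list of 'emails'.
        entity["emails"] = [entity.pop("email")] if "email" in entity else []
        entity["_version"] = 2
    return entity

def load(store, key):
    entity = upgrade(store[key])
    store[key] = entity  # write back, as a "gardener" sweep also would
    return entity

store = {"user:1": {"email": "homer@simpson.com"}}  # old v1 record
user = load(store, "user:1")
print(user["emails"], user["_version"])  # ['homer@simpson.com'] 2
```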
in a structure that maps more directly to object classes used in the underlying
application code, which can significantly reduce development time. [2]
4.2 Demerits
The inherent constraints of a relational database ensure that data at the lowest
level have integrity. Data that violate integrity constraints cannot physically be
entered into the database. These constraints don’t exist in a key/value database,
so the responsibility for ensuring data integrity falls entirely to the application.
But application code often carries bugs. Bugs in a properly designed relational
database usually don’t lead to data integrity issues; bugs in a key/value database,
however, quite easily lead to data integrity issues.
One of the other key benefits of a relational database is that it forces you to
go through a data modeling process. If done well, this modeling process creates
in the database a logical structure that reflects the data it is to contain, rather
than reflecting the structure of the application. Data, then, become somewhat
application-independent, which means other applications can use the same data
set and application logic can be changed without disrupting the underlying data
model. To facilitate this process with a key/value database, try replacing the
relational data modeling exercise with a class modeling exercise, which creates
generic classes based on the natural structure of the data.
And don’t forget about compatibility. Unlike relational databases, cloud-
oriented databases have little in the way of shared standards. While they all
share similar concepts, they each have their own API, specific query interfaces,
and peculiarities. So, you will need to really trust your vendor, because you won’t
simply be able to switch down the line if you’re not happy with the service. And
because almost all current key/value databases are still in beta, that trust is far
riskier than with old-school relational databases. [2]
users and gained lots of data, and now you want to create new value for your
users or perhaps use the data to generate new revenue. You may find yourself
severely limited in running even straightforward analysis-style queries. Things like
tracking usage patterns and providing recommendations based on user histories
may be difficult at best, and impossible at worst, with this type of database
platform.
In this case, you will likely have to implement a separate analytical database,
populated from your key/value database, on which such analytics can be exe-
cuted. Think in advance about where and how you would do that. Would
you host it in the cloud or invest in on-site infrastructure? Would latency be-
tween you and the cloud-service provider pose a problem? Does your current
cloud-based key/value database support this? If you have 100 million items in
your key/value database, but can only pull out 1000 items at a time, how long
would queries take?
Ultimately, while scale is a consideration, don’t put it ahead of your ability to
turn data into an asset of its own. All the scaling in the world is useless if your
users have moved on to your competitor because it has cooler, more personalized
features. [2]
Chapter 5
Conclusion
values and lists, often with a mapping to JSON or XML. Open source docu-
ment databases include Project Voldemort, CouchDB, MongoDB, ThruDB and
Jackrabbit.
How is this different from just dumping JSON strings into MySQL? Document
databases can actually work with the structure of the documents, for example
extracting, indexing, aggregating and filtering based on attribute values within
the documents. Alternatively you could of course build the attribute indexing
yourself, but I wouldn't recommend that unless it makes working with your legacy
code easier.
The big limitation of BigTables and document databases is that most imple-
mentations cannot perform joins or transactions spanning several rows or doc-
uments. This restriction is deliberate, because it allows the database to do
automatic partitioning, which can be important for scaling; see the section on
distributed key/value stores below. If the structure of your data is lots of in-
dependent documents, this is not a problem; but if your data fits nicely into a
relational model and you need joins, please don't try to force it into a document
model.
3. The data store is cheap and integrates easily with your vendor’s web services
platform.
But in making your decision, remember the database’s limitations and the
risks you face by branching off the relational path.
For all other requirements, you are probably best off with the good old
RDBMS. So, is the relational database doomed? Clearly not. Well, not yet
at least. [2]
Appendix A
The following entries summarize a number of distributed key/value stores by
language, fault-tolerance, persistence, client protocol, data model, documentation
quality, and community backing:

Project Voldemort (Java). Fault-tolerance: partitioned, replicated, read-repair.
Persistence: pluggable (BerkleyDB, MySQL). Client protocol: Java API.
Data model: structured / blob / text. Docs: A. Community: LinkedIn, no.

Ringo (Erlang). Fault-tolerance: partitioned, replicated, immutable.
Persistence: custom on-disk (append-only log). Client protocol: HTTP.
Data model: blob. Docs: B. Community: Nokia, no.

Scalaris (Erlang). Fault-tolerance: partitioned, replicated, paxos.
Persistence: in-memory only. Client protocol: Erlang, Java, HTTP.
Data model: blob. Docs: B. Community: OnScale, no.

Kai (Erlang). Fault-tolerance: partitioned, replicated?
Persistence: on-disk Dets file. Client protocol: Memcached.
Data model: blob. Docs: C. Community: no.

Dynomite (Erlang). Fault-tolerance: partitioned, replicated.
Persistence: pluggable (couch, dets). Client protocol: custom ASCII, Thrift.
Data model: blob. Docs: D+. Community: Powerset, no.

MemcacheDB (C). Fault-tolerance: replication. Persistence: BerkleyDB.
Client protocol: Memcached. Data model: blob. Docs: B. Community: some.

ThruDB (C++). Fault-tolerance: replication.
Persistence: pluggable (BerkleyDB, custom, MySQL, S3). Client protocol: Thrift.
Data model: document-oriented. Docs: C+. Community: Third rail, unsure.

CouchDB (Erlang). Fault-tolerance: replication, partitioning?
Persistence: custom on-disk (JSON). Client protocol: HTTP, JSON.
Data model: document-oriented. Docs: A. Community: Apache, yes.

Cassandra (Java). Fault-tolerance: replication, partitioning.
Persistence: custom on-disk. Client protocol: Thrift.
Data model: BigTable meets Dynamo. Docs: A. Community: Facebook, no.

HBase (Java). Fault-tolerance: replication, partitioning.
Persistence: custom on-disk. Client protocol: custom API, Thrift, Rest.
Data model: BigTable. Docs: F. Community: Apache, yes.

Hypertable (C++). Fault-tolerance: replication, partitioning.
Persistence: custom on-disk. Client protocol: Thrift, other.
Data model: BigTable. Docs: A. Community: Zvents, Baidu, yes.
SimpleDB has several limitations. First, a query can only execute for a max-
imum of 5 seconds. Secondly, there are no data types apart from strings. Every-
thing is stored, retrieved, and compared as a string, so date comparisons won’t
work unless you convert all dates to ISO8601 format. Thirdly, the maximum size
of any string is limited to 1024 bytes, which limits how much text (i.e. product
descriptions, etc.) you can store in a single attribute. But because the schema is
dynamic and flexible, you can get around the limit by adding “ProductDescription1”,
“ProductDescription2”, etc. The catch is that an item is limited to 256
attributes. While SimpleDB is in beta, domains can’t be larger than 10GB, and
entire databases cannot exceed 1TB.
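A sketch of that chunking workaround in Python: the helper functions are invented for illustration; only the numbered attribute-naming pattern comes from the text above.

```python
# Split a long value across numbered attributes ("ProductDescription1",
# "ProductDescription2", ...) to stay under a per-attribute byte limit,
# and reassemble it on read.
LIMIT = 1024

def to_chunked_attrs(name, value):
    return {f"{name}{i + 1}": value[i * LIMIT:(i + 1) * LIMIT]
            for i in range((len(value) + LIMIT - 1) // LIMIT)}

def from_chunked_attrs(name, attrs):
    parts = []
    i = 1
    while f"{name}{i}" in attrs:
        parts.append(attrs[f"{name}{i}"])
        i += 1
    return "".join(parts)

desc = "x" * 2500  # longer than a single attribute allows
attrs = to_chunked_attrs("ProductDescription", desc)
print(len(attrs))  # 3 chunks of at most 1024 bytes
assert from_chunked_attrs("ProductDescription", attrs) == desc
```

Remember the 256-attributes-per-item cap mentioned above; very long values still will not fit in one item.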
One key feature of SimpleDB is that it uses an eventual consistency model. This
consistency model is good for concurrency, but means that after you have changed
an attribute for an item, those changes may not be reflected in read operations
that immediately follow. While the chances of this actually happening are low,
you should account for such situations. For example, you don’t want to sell the
last concert ticket in your event booking system to five people because your data
wasn’t consistent at the time of sale. [2]
SQL Data Services is part of the Microsoft Azure Web Services platform. The
SDS service is also in beta and so is free but has limits on the size of databases.
SQL Data Services is actually an application itself that sits on top of many SQL
servers, which make up the underlying data storage for the SDS platform. While
the underlying data stores may be relational, you don’t have access to these;
SDS is a key/value store, like the other platforms discussed thus far.
Microsoft seems to be alone among these three vendors in acknowledging that
while key/value stores are great for scalability, they come at the great expense
of data management, when compared to RDBMS. Microsoft’s approach seems
to be to strip to the bare bones to get the scaling and distribution mechanisms
right, and then over time build up, adding features that help bridge the gap
between the key/value store and relational database platform. [2]
A.3 Non-Cloud Service Contenders
Outside the cloud, a number of key/value database software products exist that
can be installed in-house. Almost all of these products are still young, either in
alpha or beta, but most are also open source; having access to the code, you can
perhaps be more aware of potential issues and limitations than you would with
closed-source vendors. [2]
A.3.1 Tokyo Cabinet
Speed and efficiency are two consistent themes for Tokyo Cabinet. Benchmarks
show that it only takes 0.7 seconds to store 1 million records in the regular hash
table and 1.6 seconds for the B-Tree engine. To achieve this, the overhead per
record is kept at as low as possible, ranging between 5 and 20 bytes: 5 bytes
for B-Tree, 16 – 20 bytes for the Hash-table engine. And if small overhead
is not enough, Tokyo Cabinet also has native support for Lempel-Ziv or BWT
compression algorithms, which can reduce your database to 25% of its size
(typical text compression rate). Also, it is thread safe (uses pthreads) and offers
row-level locking. [5]
Features
• Performs well.
• Actively developed. Lots of developers adding new features (but not bug
fixes).
Hash and B-Tree Database Engines
require "rubygems"
require "tokyocabinet"
include TokyoCabinet

# open a B-Tree database file, creating it if it does not yet exist
bdb = BDB.new
bdb.open("casket.tcb", BDB::OWRITER | BDB::OCREAT)

# store records in the database, allowing duplicates
bdb.putdup("key1", "value1")
bdb.putdup("key1", "value2")
bdb.put("key2", "value3")
bdb.put("key3", "value4")

# retrieve all values
p bdb.getlist("key1")
# => ["value1", "value2"]
Fixed-length and Table Database Engines
Next, we have the ‘fixed length’ engine, which is best understood as a simple
array. There is absolutely no hashing and access is done via natural number keys,
which also means no key overhead. This method is extremely fast.
Saving best for last, we have the Table engine, which mimics a relational
database, except that it requires no predefined schema (in this, it is a close
cousin to CouchDB, which allows arbitrary properties on any object). Each
record still has a primary key, but we are allowed to declare arbitrary indexes on
our columns, and even perform queries on them:
require "rubygems"
require "rufus/tokyo/cabinet/table"

t = Rufus::Tokyo::Table.new('table.tdb', :create, :write)

# populate table with arbitrary data (no schema!)
t['pk0'] = { 'name' => 'alfred', 'age' => '22', 'sex' => 'male' }
t['pk1'] = { 'name' => 'bob', 'age' => '18' }
t['pk2'] = { 'name' => 'charly', 'age' => '45', 'nickname' => 'charlie' }
t['pk3'] = { 'name' => 'doug', 'age' => '77' }
t['pk4'] = { 'name' => 'ephrem', 'age' => '32' }

# query table for age >= 32
p t.query { |q|
  q.add_condition 'age', :numge, '32'
  q.order_by 'age'
}

t.close
A.3.2 CouchDB
CouchDB is a free, open-source, distributed, fault-tolerant and schema-free
document-oriented database accessible via a RESTful HTTP/JSON API. Derived
from the key/value store, it uses JSON to define an item’s schema. Data is stored
in ‘documents’, which are essentially key/value maps themselves. CouchDB is
meant to bridge the gap between document-oriented and relational databases by
allowing “views” to be dynamically created using JavaScript. These views map
the document data onto a table-like structure that can be indexed and queried.
It can also do full text indexing of the documents.
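CouchDB's actual views are JavaScript map functions; the following Python sketch only emulates the idea, with invented documents: a map function emits (key, value) rows from each document, and the collected rows form a sorted, queryable index.

```python
# Emulating a CouchDB-style "view": a map function emits (key, value)
# rows from each document; the rows form a table-like, sorted index.
docs = [
    {"_id": "d1", "type": "post", "author": "alice", "words": 120},
    {"_id": "d2", "type": "post", "author": "bob", "words": 300},
    {"_id": "d3", "type": "comment", "author": "alice"},
]

def map_posts_by_author(doc):
    # Only documents of type "post" contribute rows to this view.
    if doc["type"] == "post":
        yield (doc["author"], doc["words"])

view = sorted(row for doc in docs for row in map_posts_by_author(doc))
print(view)  # [('alice', 120), ('bob', 300)]
```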
At the moment, CouchDB isn’t really a distributed database. It has replica-
tion functions that allow data to be synchronized across servers, but this isn’t the
kind of distribution needed to build highly scalable environments. The CouchDB
community, though, is no doubt working on this. [3, 2]
A.3.4 Mongo
Features
• Written in C++.
A.3.5 Drizzle
A.3.6 Cassandra
The source code for Cassandra was released recently by Facebook, which uses
it for inbox search. It is BigTable-esque, but uses a DHT, so it doesn't need a
central server. It was developed at Facebook by some of the key engineers behind
Amazon's famous Dynamo database.
Cassandra can be thought of as a huge 4-or-5-level associative array, where
each dimension of the array gets a free index based on the keys in that level.
The real power comes from that optional 5th level in the associative array, which
can turn a simple key-value architecture into an architecture where you can now
deal with sorted lists, based on an index of your own specification. That 5th level
is called a SuperColumn, and it’s one of the reasons that Cassandra stands out
from the crowd.
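The nested-map view of this model can be sketched as plain Python dictionaries; the level names and data below are invented for illustration, not Cassandra's API:

```python
# Cassandra's data model as nested maps, roughly: keyspace -> column
# family -> row key -> supercolumn -> column -> value.
keyspace = {
    "Inbox": {                      # column family
        "user:42": {                # row key
            "2009-06": {            # supercolumn (messages by month)
                "msg-001": "Hello",
                "msg-002": "Re: Hello",
            },
        },
    },
}

# Because every level is keyed, a lookup at any depth is direct:
print(keyspace["Inbox"]["user:42"]["2009-06"]["msg-002"])  # Re: Hello
```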
Cassandra has no single points of failure, and can scale from one machine
to several thousands of machines clustered in different data centers. It has no
central master, so any data can be written to any of the nodes in the cluster,
and can be read likewise from any other node in the cluster.
It provides knobs that can be tweaked to slide the scale between consistency
and availability, depending on a particular application and problem domain. And
it provides a high-availability guarantee: if one node goes down, another
node will step in to replace it smoothly. [3, 8]
Pros:
• Open source.
• Incremental scalable — as data grows one can add more nodes to storage
mesh.
Cons:
• Not polished yet. It was built for inbox searching, so it may not work well
for other use cases.
A.3.7 BigTable
• Google BigTable — manages data across many nodes.
• Paxos (Chubby) — distributed transaction algorithm that manages locks
across systems.
• BigTable Characteristics:
• Pros:
– Compression is available.
– Clients are simple.
– Integrates with map-reduce.
• Cons:
A.3.8 Dynamo
• Amazon’s Dynamo — A giant distributed hash table.
• Uses consistent hashing to distribute data to one or more nodes for redun-
dancy and performance.
– Consistent hashing — a ring of nodes and hash function picks which
node(s) to store data.
– Consistency between nodes is based on vector clocks and read repair.
– Vector clocks — time stamp on every row for every node that has
written to it.
– Read repair — When a client does a read and the nodes disagree on
the data it’s up to the client to select the correct data and tell the
nodes the new correct state.
• Pros:
• Cons:
– Proprietary.
– Clients have to be smart to handle read-repair, rebalancing a cluster,
hashing, etc. Client proxies can handle these responsibilities but that
adds another hop.
– No compression, so IO is not reduced.
– Not suitable for column-like workloads; it’s just a key-value store,
so it’s not optimized for analytics. Aggregate queries, for example,
aren’t in its wheelhouse. [7]
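The consistent-hashing bullet above can be sketched as follows. The node names, replica count, and use of MD5 here are illustrative assumptions, not Dynamo's exact scheme: nodes sit on a hash ring, and a key is stored on the first nodes found walking clockwise from the key's position.

```python
import bisect
import hashlib

# Minimal consistent-hashing ring: hash each node onto a ring, then map
# a key to the next node(s) clockwise from the key's own hash.
def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((h(n), n) for n in nodes)

def nodes_for(key, replicas=2):
    # Walk clockwise from the key's position to pick replica holders.
    start = bisect.bisect(ring, (h(key), ""))
    return [ring[(start + i) % len(ring)][1] for i in range(replicas)]

owners = nodes_for("user:42")
print(owners)  # two distinct nodes from the ring
```

Adding or removing a node only remaps the keys adjacent to it on the ring, which is why this scheme rebalances cheaply.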
Bibliography
[11] Ian Thomas Varley, “The Mixed Blessing of Non-Relational Databases”,
http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf