
Why I love databases - Jeeyoung Kim - Medium https://medium.com/@jeeyoungk/why-i-love-databases-1d4cc4...

Why I love databases


Jeeyoung Kim
Nov 9, 2014 · 9 min read

Last year I worked with various databases at Square. I did:

Investigating and resolving database performance issues.

Designing data models and sharding strategies for new applications.

Evaluating and operationalizing new databases.

Though initially driven by necessity, I soon became fascinated by databases.


The study of databases intersects almost every topic in computer
science — its theory and implementation are both sophisticated and
challenging.

However, I soon realized that this is not a universally shared passion. To many of my colleagues and friends, the database is a magical black-box system, too scary and complicated to understand. I wanted to change that.

While talking about databases, the topic of distributed systems cannot be ignored. Most modern databases are distributed, either implicitly (distributed clustered databases) or externally (a single application connected to multiple databases via application-level sharding).

This post is a confession of my love for databases and distributed systems. It is mostly targeted towards programmers like me, application developers who regularly interact with databases. We program mostly in Java, Python, or Ruby, writing server-side applications. I will cover:

Comparing and evaluating different databases.

Understanding and fully utilizing your database.

Understanding how databases work at a high level.

First of all, what is a database?

In this post, any software which accepts and persists data for future retrieval is a database. This includes both traditional RDBMSs and NoSQL databases, as well as systems such as Apache ZooKeeper and Kafka.

CAP theorem
The CAP theorem is my favorite impossibility result after Turing’s Halting Problem and P ≠ NP (which is technically a conjecture, not a result). The CAP theorem states that, at best, any distributed system can only satisfy CP (Consistency & Partition Tolerance), AP (Availability & Partition Tolerance), or somewhere between the two. As a consequence, interesting tradeoffs between consistency and availability arise.
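To make the tradeoff concrete, here is a toy sketch, not any real database’s API; the Replica class and every name in it are invented for illustration. During a partition, each side must either refuse writes (staying consistent but unavailable) or accept them (staying available but risking divergence):

```python
# Toy illustration: two replicas that lose contact with each other.
# During a partition, a replica must choose between refusing writes
# (the "CP" mode below) or accepting them and diverging ("AP").

class Replica:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode              # "CP" or "AP" (invented flag)
        self.data = {}
        self.peer_reachable = True

    def write(self, key, value):
        if not self.peer_reachable and self.mode == "CP":
            # Stay consistent by becoming unavailable.
            raise RuntimeError("unavailable: cannot reach peer")
        self.data[key] = value        # AP: accept the write, risk divergence
        return "ok"

a = Replica("a", mode="AP")
b = Replica("b", mode="AP")
a.peer_reachable = b.peer_reachable = False    # network partition

a.write("balance", 100)
b.write("balance", 200)
print(a.data["balance"], b.data["balance"])    # 100 200 -> replicas diverged

c = Replica("c", mode="CP")
c.peer_reachable = False
try:
    c.write("balance", 100)
except RuntimeError as e:
    print(e)                                   # unavailable: cannot reach peer
```

Neither mode is wrong; which one you want depends entirely on the application.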


A few important misconceptions about the CAP theorem:

The traditional “choose 2 of 3” argument doesn’t make sense. You cannot give up partition tolerance, because that would mean “the behavior of operations performed during the partition is undefined”, and in that case the database isn’t really consistent. (See [1])

Reaching the limit of the CAP theorem is not a given. There are many databases which are neither consistent, available, nor partition tolerant. Achieving the limits of the CAP theorem requires careful design and implementation.

Distributed Systems
As stated earlier, many modern databases are distributed in some way. There are two main motivations for this:

To scale beyond a single machine — storing and processing data in multiple nodes.

To increase availability — ensuring a database isn’t a single point of failure.

These two goals are closely related. In general, scaling a system by increasing the number of machines negatively affects availability, since the probability of encountering failures increases statistically. Thus, achieving high availability is almost a prerequisite to scalability.
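A back-of-the-envelope calculation shows why: if each machine fails independently with probability p over some window, the chance that at least one of n machines fails is 1 − (1 − p)^n, which grows quickly with n. The numbers below are purely illustrative:

```python
# If each machine independently fails with probability p over some
# window, the probability that AT LEAST ONE of n machines fails is
# 1 - (1 - p)**n.

def p_any_failure(p, n):
    return 1 - (1 - p) ** n

for n in (1, 10, 100):
    # With p = 1%, the chance of some failure rises from 1% to ~63%.
    print(n, round(p_any_failure(0.01, n), 3))
```

At a hundred machines, "some machine is failing" is closer to the normal state of the cluster than the exception, which is why the system itself must tolerate it.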


Correctness vs. Efficiency
Both correctness and efficiency matter, and are closely related in
distributed databases.

Correctness is important in any software, but it is essential for databases because (1) databases persist data, so incorrect data will survive reboots, and (2) databases are regarded as the most trusted foundation of the software stack.

What does it mean for a database to be correct? Many distributed databases have exotic consistency semantics written in fine print. You can make a tradeoff here. In general, more stringent consistency makes application code easier to reason about, at the cost of efficiency and availability.
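One classic piece of fine print is read-your-own-writes. A toy sketch of asynchronous replication (the Store class and all names are invented; no real database works exactly like this) shows how a perfectly "successful" write can be invisible to the very next read:

```python
# Toy primary/replica pair with asynchronous replication. A read from
# the replica right after a write to the primary can be stale -- the
# "read your own writes" fine print in action.

class Store:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.log = []                  # writes not yet replicated

    def write(self, k, v):
        self.primary[k] = v
        self.log.append((k, v))        # replication happens later

    def replicate(self):               # e.g. runs on a background thread
        for k, v in self.log:
            self.replica[k] = v
        self.log.clear()

s = Store()
s.write("name", "alice")
print(s.replica.get("name"))   # None -> our own write is not visible yet
s.replicate()
print(s.replica.get("name"))   # alice
```

A database with stricter semantics would block the read (or route it to the primary) instead, trading latency and availability for a simpler mental model.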

Theories aside, there are implementation and operational challenges to correctness. A distributed system is inherently complex. An algorithm like Paxos is notoriously hard to understand and implement correctly. As the system becomes more complex, more obscure failure scenarios arise. Systems like Redis and ElasticSearch suffer from the non-axiomatic design of their distributed systems.

Beyond the tradeoff above, efficiency is important because of these very difficulties. The more I learn about low-level programming, the more I realize how many raw operations per second a single machine can perform. In many cases, efficiency reduces complexity, making the overall system simpler. The fact that distributed systems are harder than low-level programming motivates me to seek these efficiencies. I would gladly choose the database that requires fewer machines to operate, given the same load.

Finally, the computational and latency overhead of inter-machine coordination can be significant. In conclusion: the fewer moving parts, the better.

To squeeze more performance and efficiency out of the code, one must journey deeper into lower-level abstractions, including:

Memory allocators & garbage collectors [2]

Filesystem schedulers & IO device characteristics

Kernel settings

Implementation details of various system calls (fork, execv, malloc)

Any disharmony between them can cause poor performance. You don’t have to be a kernel hacker, but you do need a high-level overview of how these components interact.

Empowering the Applications


Just as different programming languages have their own benefits and drawbacks, databases have their own unique characteristics. It is important to fully understand them. Doing so allows you to implement a sophisticated yet efficient application while delegating most of the complex, error-prone work to the database.

A traditional RDBMS is generally a good choice if you don’t have stringent performance or availability requirements. The ACID guarantees are very powerful and the tooling is great. Sharding an RDBMS is painful, but it is well understood. MySQL and Postgres are two popular choices.

Full text search engines allow you to build advanced indexing &
searching functionalities. Their eventually consistent nature is
seldom a problem, since search is an inherently fuzzy operation.
Lucene and its descendants (Solr, ElasticSearch) are popular choices.

Message queues and event processing systems eliminate error-prone, hard-to-implement code. Kafka, Storm, Spark SQL, RabbitMQ, and Redis are popular choices.

Databases with cross-region replication make regional failover and high availability much easier. There aren’t many open source choices here, but Apache Cassandra is probably the most mature one.

Consensus, leader election, and distributed locks are hard to implement and test. Don’t implement your own. Use ZooKeeper, etcd, or Raft-as-a-library.

Now to the fine print. A database is an inherently leaky abstraction. Databases generally do a very good job of hiding their underlying complexities, but ignoring their limitations will eventually bite you. A few important things:

Understanding the correctness guarantees of databases. What does a failed operation mean [3]? Which operations are fully consistent, and which are not?


Understanding how the data is persisted & retrieved. Which operations are efficient, and which are not? Is there a query planner, or detailed statistics per operation?

Sharding and clustering topologies. Understanding how data is distributed amongst a cluster. Does your sharding strategy evenly distribute data, or are there hot spots?

Data modelling patterns and anti-patterns.
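The hot-spot question above is easy to simulate. In this sketch (the event data, shard count, and choice of md5 are all made up for illustration), sharding events by a hash of the user id spreads load evenly, while sharding by date sends all of today’s traffic to a single shard:

```python
# Hot spots, illustrated: hashing a high-cardinality key (user id)
# spreads writes evenly; sharding by a skewed attribute (today's
# date) concentrates them on one shard.

import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_of(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

events = [(f"user-{i}", "2014-11-09") for i in range(10_000)]

by_user = Counter(shard_of(user) for user, _ in events)
by_date = Counter(shard_of(date) for _, date in events)

print(sorted(by_user.values()))  # roughly even, about 2500 per shard
print(sorted(by_date.values()))  # [10000] -- one hot shard takes everything
```

The same reasoning applies to time-ordered primary keys, auto-increment ids, and any other key whose values cluster.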

Operational Challenges
Once part of your software stack, databases live and breathe with your infrastructure 24/7. They introduce unique operational challenges.

Operating a database is like sailing in the middle of an ocean. Whenever you encounter a problem, you need to fix it without sinking the database, even in the midst of a storm. Thus a database must have:

Ways to introspect & monitor the system.

Knobs to maintain & administrate the system.

Replication, backup, and restoration. Losing data is critically bad. Machines will crash at some point, and since databases are stateful, you can’t simply redeploy to restore failed machines.

A database must provide all of these capabilities while it is running. Frankly, all databases have shortcomings here. Allowing every configuration to be changed while the system is running is a difficult challenge. Many operations require a database-wide mutex, additional system resources, or a restart. Examples include:

In MySQL prior to 5.6, adding a column required a full table lock. Beautiful solutions like pt-online-schema-change exist to mitigate this problem, and MySQL now supports online schema migrations.

Cassandra allows you to easily add, remove, and repair nodes. However, these operations add extra load to the system.

Moreover, you can’t easily replace a database. Even the task of migrating data within the same database isn’t simple. Migrating to another database is even harder, if not infeasible [4]. Application code is much easier to incrementally roll out and revert. Data lives much longer than code, and data schemas and stored data are normally shared amongst multiple applications. Thus, the choice of the initial database system and the corresponding data model is very important.

Finally, databases will eventually fail. No matter what platform or infrastructure-as-a-service you use, some failures are unavoidable:

An application bug corrupting or dropping data. [5]

Misunderstanding the safety and consistency guarantees of a database and losing writes.

Data model & database mismatch — for example, having a dataset that doesn’t fit in a single shard, or expecting compare-and-swap to work in an eventually consistent database.

Operational failures — machines crashing, hard drives dying, OS upgrades.

Network Partitions [6].

Thundering herd — a single failure cascading across the entire system.

And the worst of all: “things are suddenly slow”, “random spikes in latency”, “sporadic errors once a day”, “this record should’ve existed”.
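Two of the failure modes in the list above, lost writes and broken compare-and-swap, can be reproduced in a few lines. This is a toy last-write-wins (LWW) register, invented for illustration, not any real database’s conflict resolution code:

```python
# Toy last-write-wins register: conflicts are resolved by comparing
# timestamps supplied by the writers' own (possibly skewed) clocks.

class LWWRegister:
    def __init__(self):
        self.value = None
        self.ts = -1

    def write(self, value, ts):
        if ts > self.ts:               # bigger timestamp wins, always
            self.value, self.ts = value, ts

r = LWWRegister()
r.write("v1", ts=100)    # written by a node with a fast (skewed) clock
r.write("v2", ts=50)     # happens later in real time, smaller timestamp
print(r.value)           # v1 -- the newer write was silently lost
```

The same mechanism breaks compare-and-swap: a "check then write" against such a register can always be undone by a concurrent write carrying a larger timestamp, so the swap is never truly atomic.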

There is no single solution to this. The art of operating a database belongs to the art of maintaining a high-SLA system, but if I had to give a few tips:

Application developers understanding limitations and failure-mode behaviors.

Writing a resilient application. Multi-datacenter deployment, with automatic failover between datacenters.

Having a team of competent operations engineers (Site Reliability Engineers, DBAs) who understand different failure scenarios and recovery methods.

PS: At Square, we have an awesome online data storage (ODS) team that abstracts this problem away from us.

Basic Building Blocks


The abstractions provided by a database are truly magical. Data ingestion, querying, replication, and failover, all in the same package? But as you get accustomed to it, you start to learn that there are basic building blocks — common patterns and components shared amongst all databases.

First, data retrieval reduces down to one of:

key-value lookup (hash tables)

range lookup (trees and LSMs)

file offset lookup (Kafka, HDFS)

At the end of the day, computers don’t understand SQL, indexes, joins, or other fancy frills. High-level operations need to be translated to something the machine can execute.
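The three retrieval primitives above can be sketched in a few lines; the data, keys, and byte offsets here are invented for illustration:

```python
# The three retrieval primitives in miniature: key-value lookup,
# range lookup over sorted keys, and offset lookup into a log.

import bisect
import io

kv = {"user:1": "alice", "user:2": "bob"}       # hash table
print(kv["user:2"])                              # bob

keys = sorted(["a", "c", "f", "k", "z"])         # sorted keys (tree/LSM stand-in)
lo = bisect.bisect_left(keys, "c")
hi = bisect.bisect_right(keys, "k")
print(keys[lo:hi])                               # ['c', 'f', 'k'] -- a range scan

log = io.BytesIO(b"rec1\nrec2\nrec3\n")          # Kafka/HDFS-style offset read
log.seek(5)                                      # byte offset of the second record
print(log.read(5))                               # b'rec2\n'
```

Every index, join, and query plan is ultimately compiled down to some combination of these three operations.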

For persistent data structures, B-trees, hash tables, and log-structured merge-trees (LSM trees) are popular choices. Most likely, your data is stored in one of these unless it requires some special lookup (e.g. geospatial queries). The LSM tree is a popular modern choice, used in BigTable, HBase, Cassandra, LevelDB, and RocksDB, due to its superb write performance and reasonable read performance.
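The core idea of an LSM tree fits in a toy class (invented for illustration; real implementations add write-ahead logs, bloom filters, and compaction): writes land in an in-memory memtable, which is periodically flushed to immutable sorted runs, and reads check the memtable first, then runs from newest to oldest:

```python
# A miniature LSM tree: cheap buffered writes, flushes to immutable
# sorted runs, reads consulting newest data first.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                 # flushed sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value     # in-memory write: fast path
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # freshest data wins
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:           # real LSMs binary-search here
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)                         # memtable full: flushed to a run
db.put("a", 3)                         # newer value shadows the flushed one
print(db.get("a"), db.get("b"))        # 3 2
```

This is why LSM writes are so fast (sequential, buffered) while reads pay a cost proportional to the number of runs, the cost that compaction exists to bound.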

Finally, there are popular patterns and algorithms reused throughout different systems: Paxos, Raft, consistent hashing, quorum reads/writes, Merkle trees, and vector clocks are some of the fundamental building blocks.
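Consistent hashing is the easiest of these to demonstrate. A minimal ring, with node names, virtual-node counts, and the choice of md5 all invented for the sketch, shows its key property: when a node joins, only a fraction of the keys move, instead of nearly all of them as with plain modulo hashing:

```python
# A minimal consistent-hash ring with virtual nodes.

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=50):
        # Each node claims many points on the ring to smooth the load.
        self.points = sorted(
            (h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

    def owner(self, key):
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]   # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
before = {f"key-{i}": ring.owner(f"key-{i}") for i in range(1000)}

ring2 = Ring(["node-a", "node-b", "node-c", "node-d"])   # one node joins
moved = sum(1 for k, v in before.items() if ring2.owner(k) != v)
print(moved)   # only a fraction of the 1000 keys move, not all of them
```

Every key that moves lands on the new node; the existing nodes never shuffle data among themselves, which is exactly what makes cluster membership changes tolerable.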


Summary
This post is a simple, high-level overview of some of the topics. There are many other topics that I haven’t covered, such as optimizing for different workloads (OLAP, OLTP, batch processing) and the UX of the database (query language, transport protocol, client libraries), which are equally important. The implications of different consistency semantics, such as sequential consistency, read-your-own-writes, and at-least-once delivery, are also very interesting.

The best thing about a database is that it is a very mature abstraction. It mostly works, and as an application developer, it is very easy to save and retrieve data without much thought. This is definitely worth celebrating, but peeling back the skin of this sufficiently advanced technology is well worth the effort.

I hope more people become fascinated by this subject and fully utilize it.

. . .

References & Side Notes


[1] You can’t sacrifice partition tolerance — http://codahale.com/you-cant-sacrifice-partition-tolerance/. Aphyr’s Jepsen posts are a good starting resource — http://aphyr.com/tags/Jepsen

[2] Modern databases heavily utilize the OS file system cache to transparently accelerate file system access; unused memory is automatically used as cache. The recommended production configuration of such a system looks unusual to untrained eyes: machines with as much as 90% “unused” memory.


[3] A common pitfall in Dynamo-like quorum-based systems is that failed writes give no information. A write can fail externally to the client when the writes to the internal replicas don’t finish in time, so failed writes are still quite likely to have succeeded. Worse, under a last-write-wins resolution strategy with skewed system clocks, future writes can be overwritten by this failed write.

[4] Migrating between databases involves reading and writing to multiple databases at once. The ownership of data (commonly known as the “source of truth”) is uncertain, and data may go out of sync between the legacy and the new system.

[5] Application bugs are probably the biggest offenders against reliability and availability in most startups’ infrastructure. Scaling and performance issues can be predicted, and hardware failures aren’t common when you have a handful of servers. New code, however, is deployed constantly.

[6] From The Network is Reliable — network partitions are more common than you think. Fundamentally, there’s no way to distinguish between high latency, network partitions, GC pauses, and failed machines — they all appear as slow connections. This is a commonly faced problem in ElasticSearch: a node suffers a long GC pause, the entire cluster thinks it is down and tries to aggressively reshuffle the data, cascading the problem.

Multiple Databases
Once a system interacts with multiple databases, the system is eventually consistent. You cannot modify multiple databases at the exact same moment unless you implement two-phase commit (2PC). This is analogous to “a composition of atomic operations is not atomic”.
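A sketch of the prepare/commit split at the heart of 2PC, using toy in-memory "databases" (the classes and method names are invented; real coordinators also need durable logs and timeout/recovery handling):

```python
# Two-phase commit in miniature: first every participant stages the
# write and promises it can commit (phase 1), then the coordinator
# tells everyone to commit (phase 2). Without the prepare phase, a
# crash between the two writes leaves the databases out of sync.

class ToyDB:
    def __init__(self):
        self.data = {}
        self.staged = {}

    def prepare(self, k, v):           # phase 1: promise to commit
        self.staged[k] = v
        return True

    def commit(self, k):               # phase 2: make it visible
        self.data[k] = self.staged.pop(k)

    def abort(self, k):
        self.staged.pop(k, None)

def transfer(db_a, db_b, key, value):
    if db_a.prepare(key, value) and db_b.prepare(key, value):
        db_a.commit(key)
        db_b.commit(key)
        return True
    db_a.abort(key)
    db_b.abort(key)
    return False

a, b = ToyDB(), ToyDB()
print(transfer(a, b, "order", "paid"))         # True
print(a.data == b.data == {"order": "paid"})   # True -- both or neither
```

The cost is also visible here: every write now takes two round trips to every participant, and a stalled coordinator leaves keys staged, which is why most systems accept eventual consistency instead.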

On Deletion
In any distributed system, deleting data is difficult and unsafe. Data is replicated everywhere, either by the database or by the application. Without proper coordination, it’s possible for deleted data to be replicated back. A typical strategy is writing a tombstone record to represent the deletion. However, tombstones have their own problems:

Tombstones occupy disk space, so to reclaim it, tombstones need to be expired. If tombstones are expired and deleted before they are fully replicated, deleted records may come back to life.

You may end up deleting a future write with an old tombstone. This is colloquially referred to as a doomstone. Hilarious as the name is, it’s a real problem.
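The resurrection problem from the list above is easy to simulate. Everything here, the replicas as plain dicts, the naive copy-missing-keys sync, is invented for the sketch, but the sequence of events mirrors the real failure:

```python
# Tombstone resurrection: a delete lands on one replica, the
# tombstone is garbage-collected before the other replica ever sees
# it, and anti-entropy copies the "missing" record back to life.

TOMBSTONE = object()

def sync(src, dst):
    for k, v in src.items():            # naive anti-entropy: fill in gaps
        if k not in dst:
            dst[k] = v

def expire_tombstones(replica):
    for k in [k for k, v in replica.items() if v is TOMBSTONE]:
        del replica[k]

r1 = {"user:1": "alice"}
r2 = {"user:1": "alice"}

r1["user:1"] = TOMBSTONE     # delete reaches r1 only (r2 is partitioned)
expire_tombstones(r1)        # tombstone GC'd before r2 ever saw the delete
sync(r2, r1)                 # r2 replicates the record right back
print(r1)                    # {'user:1': 'alice'} -- the deleted data returns
```

This is exactly why systems like Cassandra insist that the tombstone grace period exceed the time needed to repair every replica.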


