
Why I love databases - Jeeyoung Kim - Medium https://medium.com/@jeeyoungk/why-i-love-databases-1d4cc4...

Why I love databases


Jeeyoung Kim
Nov 9, 2014 · 9 min read

Last year I worked with various databases at Square. I did:

Investigating and resolving database performance issues.

Designing data models and sharding strategies for new applications.

Evaluating and operationalizing new databases.

Though initially driven by necessity, I soon became fascinated by databases.


The study of databases intersects almost every topic in computer
science — its theory and implementation are both sophisticated and
challenging.

However, I soon realized that this is not a universally shared passion. To many of my colleagues and friends, the database is a magical black-box system, too scary and complicated to understand. I wanted to change that.

While talking about databases, the topic of distributed systems cannot be ignored. Most modern databases are distributed, either implicitly (distributed clustered databases) or externally (a single application connected to multiple databases via application-level sharding).

This post is a confession of my love for databases and distributed systems. It is mostly targeted towards programmers like me, application developers who regularly interact with databases. We program mostly in Java, Python, or Ruby, writing server-side applications. I will cover:

Comparing and evaluating different databases.

Understanding and fully utilizing your database.

Understanding how databases work at a high level.

First of all, what is a database?

In this post, any software which accepts and persists data for future retrieval is a database. This includes both traditional RDBMSs and NoSQL databases, as well as systems such as Apache ZooKeeper and Kafka.

CAP theorem
The CAP theorem is my favorite impossibility result after Turing’s Halting Problem and P ≠ NP (which is technically a conjecture, not a result). The CAP theorem states that, at best, any distributed system can only satisfy CP (Consistency & Partition Tolerance), AP (Availability & Partition Tolerance), or somewhere between the two. As a consequence, interesting tradeoffs between consistency and availability arise.
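To make the tradeoff concrete, here is a toy sketch, not any real database’s API; the Replica class and every name in it are invented for illustration. During a partition, each side must either refuse writes (staying consistent but unavailable) or accept them (staying available but risking divergence):

```python
# Toy illustration: two replicas that lose contact with each other.
# During a partition, a replica must choose between refusing writes
# (the "CP" mode below) or accepting them and diverging ("AP").

class Replica:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode              # "CP" or "AP" (invented flag)
        self.data = {}
        self.peer_reachable = True

    def write(self, key, value):
        if not self.peer_reachable and self.mode == "CP":
            # Stay consistent by becoming unavailable.
            raise RuntimeError("unavailable: cannot reach peer")
        self.data[key] = value        # AP: accept the write, risk divergence
        return "ok"

a = Replica("a", mode="AP")
b = Replica("b", mode="AP")
a.peer_reachable = b.peer_reachable = False    # network partition

a.write("balance", 100)
b.write("balance", 200)
print(a.data["balance"], b.data["balance"])    # 100 200 -> replicas diverged

c = Replica("c", mode="CP")
c.peer_reachable = False
try:
    c.write("balance", 100)
except RuntimeError as e:
    print(e)                                   # unavailable: cannot reach peer
```

Neither mode is wrong; which one you want depends entirely on the application.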


A few important misconceptions about the CAP theorem:

The traditional “choose 2 of 3” argument doesn’t make sense. You cannot give up partition tolerance, because that would mean “the behavior of operations performed during the partition is undefined”, and in that case the database isn’t really consistent. (See [1])

Reaching the limit of the CAP theorem is not a given. There are many databases which are neither consistent, available, nor partition tolerant. Achieving the limits of the CAP theorem requires careful design and implementation.

Distributed Systems
As stated earlier, many modern databases are distributed in some way. There are two main motivations for this:

To scale beyond a single machine — storing and processing data in multiple nodes.

To increase availability — ensuring a database isn’t a single point of failure.

These two goals are closely related. In general, scaling a system by increasing the number of machines negatively affects availability, since the probability of encountering failures increases statistically. Thus, achieving high availability is almost a prerequisite to scalability.
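A back-of-the-envelope calculation shows why: if each machine fails independently with probability p over some window, the chance that at least one of n machines fails is 1 − (1 − p)^n, which grows quickly with n. The numbers below are purely illustrative:

```python
# If each machine independently fails with probability p over some
# window, the probability that AT LEAST ONE of n machines fails is
# 1 - (1 - p)**n.

def p_any_failure(p, n):
    return 1 - (1 - p) ** n

for n in (1, 10, 100):
    # With p = 1%, the chance of some failure rises from 1% to ~63%.
    print(n, round(p_any_failure(0.01, n), 3))
```

At a hundred machines, "some machine is failing" is closer to the normal state of the cluster than the exception, which is why the system itself must tolerate it.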


Correctness vs. Efficiency
Both correctness and efficiency matter, and are closely related in
distributed databases.

Correctness is important in any software, but it is essential for databases because (1) databases persist data, so incorrect data will survive reboots, and (2) databases are regarded as the most trusted foundation of the software stack.

What does it mean for a database to be correct? Many distributed databases have exotic consistency semantics written in fine print. You can make a tradeoff here. In general, more stringent consistency makes application code easier to reason about, at the cost of efficiency and availability.
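One classic piece of fine print is read-your-own-writes. A toy sketch of asynchronous replication (the Store class and all names are invented; no real database works exactly like this) shows how a perfectly "successful" write can be invisible to the very next read:

```python
# Toy primary/replica pair with asynchronous replication. A read from
# the replica right after a write to the primary can be stale -- the
# "read your own writes" fine print in action.

class Store:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.log = []                  # writes not yet replicated

    def write(self, k, v):
        self.primary[k] = v
        self.log.append((k, v))        # replication happens later

    def replicate(self):               # e.g. runs on a background thread
        for k, v in self.log:
            self.replica[k] = v
        self.log.clear()

s = Store()
s.write("name", "alice")
print(s.replica.get("name"))   # None -> our own write is not visible yet
s.replicate()
print(s.replica.get("name"))   # alice
```

A database with stricter semantics would block the read (or route it to the primary) instead, trading latency and availability for a simpler mental model.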

Theories aside, there are implementation and operational challenges to correctness. A distributed system is inherently complex. An algorithm like Paxos is notoriously hard to understand and implement correctly. As the system becomes more complex, more obscure failure scenarios arise. Systems like Redis and ElasticSearch suffer from the non-axiomatic design of their distributed systems.

Beyond the tradeoff above, efficiency is important because of these very difficulties. The more I learn about low-level programming, the more I realize how many raw operations per second a single machine can perform. In many cases, efficiency reduces complexity, making the overall system simpler. The fact that distributed systems are harder than low-level programming motivates me to seek these efficiencies. I would gladly choose the database that requires fewer machines to operate, given the same load.

Finally, the computational and latency overhead of inter-machine coordination can be significant. In conclusion: the fewer moving parts, the better.

To squeeze more performance and efficiency out of the code, one must journey deeper into lower-level abstractions, including:

Memory allocators & garbage collectors [2]

Filesystem schedulers & IO device characteristics

Kernel settings

Implementation details of various system calls (fork, execv, malloc)

Any disharmony between them can cause poor performance. You don’t have to be a kernel hacker, but you do need a high-level overview of how these components interact.

Empowering the Applications


Just as different programming languages have their own benefits and drawbacks, databases have their own unique characteristics. It is important to fully understand them. Doing so allows you to implement a sophisticated yet efficient application while delegating most of the complex, error-prone work to the database.

A traditional RDBMS is generally a good choice if you don’t have stringent performance or availability requirements. The ACID guarantees are very powerful and the tooling is great. Sharding an RDBMS is painful, but it is well understood. MySQL and Postgres are two popular choices.

Full text search engines allow you to build advanced indexing &
searching functionalities. Their eventually consistent nature is
seldom a problem, since search is an inherently fuzzy operation.
Lucene and its descendants (Solr, ElasticSearch) are popular choices.

Message queues and event processing systems eliminate error-prone, hard-to-implement code. Kafka, Storm, Spark SQL, RabbitMQ, and Redis are popular choices.

Databases with cross-region replication make regional failover and high availability much easier. There aren’t many open source choices here, but Apache Cassandra is probably the most mature one.

Consensus, leader election, and distributed locks are hard to implement and test. Don’t implement your own. Use ZooKeeper, etcd, or Raft-as-a-library.

Now to the fine print. A database is an inherently leaky abstraction. Databases generally do a very good job of hiding their underlying complexities, but ignoring their limitations will eventually bite you. A few important things:

Understanding the correctness guarantees of databases. What does a failed operation mean [3]? Which operations are fully consistent, and which are not?


Understanding how the data is persisted & retrieved. Which operations are efficient, and which are not? Is there a query planner, or detailed statistics per operation?

Sharding and clustering topologies. Understanding how data is distributed amongst a cluster. Does your sharding strategy evenly distribute data, or are there hot spots?

Data modelling patterns and anti-patterns.
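The hot-spot question above is easy to simulate. In this sketch (the event data, shard count, and choice of md5 are all made up for illustration), sharding events by a hash of the user id spreads load evenly, while sharding by date sends all of today’s traffic to a single shard:

```python
# Hot spots, illustrated: hashing a high-cardinality key (user id)
# spreads writes evenly; sharding by a skewed attribute (today's
# date) concentrates them on one shard.

import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_of(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

events = [(f"user-{i}", "2014-11-09") for i in range(10_000)]

by_user = Counter(shard_of(user) for user, _ in events)
by_date = Counter(shard_of(date) for _, date in events)

print(sorted(by_user.values()))  # roughly even, about 2500 per shard
print(sorted(by_date.values()))  # [10000] -- one hot shard takes everything
```

The same reasoning applies to time-ordered primary keys, auto-increment ids, and any other key whose values cluster.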

Operational Challenges
Once part of your software stack, databases live and breathe with your infrastructure 24/7. They introduce unique operational challenges.

Operating a database is like sailing in the middle of an ocean. Whenever you encounter a problem, you need to fix it without sinking the database, even in the midst of a storm. Thus a database must have:

Ways to introspect & monitor the system.

Knobs to maintain & administrate the system.

Replication, backup, and restoration. Losing data is critically bad. Machines will crash at some point, and since databases are stateful, you can’t simply redeploy to restore failed machines.

A database must provide all of these capabilities while it is running. Frankly, all databases have shortcomings here. Allowing every configuration to be changed while the system is running is a difficult challenge. Many operations require a database-wide mutex, additional system resources, or a restart. Examples include:

In MySQL prior to 5.6, adding a column required a full table lock. Beautiful solutions like pt-online-schema-change exist to mitigate this problem, and MySQL now supports online schema migrations.

Cassandra allows you to easily add, remove, and repair nodes. However, these operations add extra load to the system.

Moreover, you can’t easily replace a database. Even the task of migrating data within the same database isn’t simple. Migrating to another database is even harder, if not infeasible [4]. Application code is much easier to incrementally roll out and revert. Data lives much longer than code, and data schemas and stored data are normally shared amongst multiple applications. Thus, the choice of the initial database system and the corresponding data model is very important.

Finally, databases will eventually fail. No matter what platform or infrastructure-as-a-service you use, some failures are unavoidable:

An application bug corrupting or dropping data. [5]

Misunderstanding the safety and consistency guarantees of a database and losing writes.

Data model & database mismatch — for example, having a dataset that doesn’t fit in a single shard, or expecting compare-and-swap to work in an eventually consistent database.

Operational failures — machines crashing, hard drives dying, OS upgrades.

Network Partitions [6].

Thundering herd — a single failure cascading across the entire system.

And the worst of all: “things are suddenly slow”, “random spikes in latency”, “sporadic errors once a day”, “this record should’ve existed”.
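Two of the failure modes in the list above, lost writes and broken compare-and-swap, can be reproduced in a few lines. This is a toy last-write-wins (LWW) register, invented for illustration, not any real database’s conflict resolution code:

```python
# Toy last-write-wins register: conflicts are resolved by comparing
# timestamps supplied by the writers' own (possibly skewed) clocks.

class LWWRegister:
    def __init__(self):
        self.value = None
        self.ts = -1

    def write(self, value, ts):
        if ts > self.ts:               # bigger timestamp wins, always
            self.value, self.ts = value, ts

r = LWWRegister()
r.write("v1", ts=100)    # written by a node with a fast (skewed) clock
r.write("v2", ts=50)     # happens later in real time, smaller timestamp
print(r.value)           # v1 -- the newer write was silently lost
```

The same mechanism breaks compare-and-swap: a "check then write" against such a register can always be undone by a concurrent write carrying a larger timestamp, so the swap is never truly atomic.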

There is no single solution to this. The art of operating a database belongs to the art of maintaining a high-SLA system, but if I had to give a few tips:

Application developers understanding limitations and failure-mode behaviors.

Writing a resilient application. Multi-datacenter deployment, with automatic failover between datacenters.

Having a team of competent operations engineers (Site Reliability Engineers, DBAs) who understand different failure scenarios and recovery methods.

PS: At Square, we have an awesome online data storage (ODS) team that abstracts this problem away from us.

Basic Building Blocks


The abstractions provided by a database are truly magical. Data ingestion, querying, replication, and failover, all in the same package? But as you get accustomed to it, you start to learn that there are basic building blocks — common patterns and components shared amongst all databases.

First, data retrieval reduces down to one of:

key-value lookup (hash tables)

range lookup (trees and LSMs)

file offset lookup (Kafka, HDFS)

At the end of the day, computers don’t understand SQL, indexes, joins, or other fancy frills. High-level operations need to be translated to something the machine can execute.
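The three retrieval primitives above can be sketched in a few lines; the data, keys, and byte offsets here are invented for illustration:

```python
# The three retrieval primitives in miniature: key-value lookup,
# range lookup over sorted keys, and offset lookup into a log.

import bisect
import io

kv = {"user:1": "alice", "user:2": "bob"}       # hash table
print(kv["user:2"])                              # bob

keys = sorted(["a", "c", "f", "k", "z"])         # sorted keys (tree/LSM stand-in)
lo = bisect.bisect_left(keys, "c")
hi = bisect.bisect_right(keys, "k")
print(keys[lo:hi])                               # ['c', 'f', 'k'] -- a range scan

log = io.BytesIO(b"rec1\nrec2\nrec3\n")          # Kafka/HDFS-style offset read
log.seek(5)                                      # byte offset of the second record
print(log.read(5))                               # b'rec2\n'
```

Every index, join, and query plan is ultimately compiled down to some combination of these three operations.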

For persistent data structures, B-trees, hash tables, and log-structured merge-trees (LSM trees) are popular choices. Most likely, your data is stored in one of these unless it requires some special lookup (e.g. geospatial queries). The LSM tree is a popular modern choice, used in BigTable, HBase, Cassandra, LevelDB, and RocksDB, due to its superb write performance and reasonable read performance.
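The core idea of an LSM tree fits in a toy class (invented for illustration; real implementations add write-ahead logs, bloom filters, and compaction): writes land in an in-memory memtable, which is periodically flushed to immutable sorted runs, and reads check the memtable first, then runs from newest to oldest:

```python
# A miniature LSM tree: cheap buffered writes, flushes to immutable
# sorted runs, reads consulting newest data first.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                 # flushed sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value     # in-memory write: fast path
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # freshest data wins
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:           # real LSMs binary-search here
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)                         # memtable full: flushed to a run
db.put("a", 3)                         # newer value shadows the flushed one
print(db.get("a"), db.get("b"))        # 3 2
```

This is why LSM writes are so fast (sequential, buffered) while reads pay a cost proportional to the number of runs, the cost that compaction exists to bound.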

Finally, there are popular patterns and algorithms reused throughout different systems: Paxos, Raft, consistent hashing, quorum reads/writes, Merkle trees, and vector clocks are some of the fundamental building blocks.
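Consistent hashing is the easiest of these to demonstrate. A minimal ring, with node names, virtual-node counts, and the choice of md5 all invented for the sketch, shows its key property: when a node joins, only a fraction of the keys move, instead of nearly all of them as with plain modulo hashing:

```python
# A minimal consistent-hash ring with virtual nodes.

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=50):
        # Each node claims many points on the ring to smooth the load.
        self.points = sorted(
            (h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

    def owner(self, key):
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]   # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
before = {f"key-{i}": ring.owner(f"key-{i}") for i in range(1000)}

ring2 = Ring(["node-a", "node-b", "node-c", "node-d"])   # one node joins
moved = sum(1 for k, v in before.items() if ring2.owner(k) != v)
print(moved)   # only a fraction of the 1000 keys move, not all of them
```

Every key that moves lands on the new node; the existing nodes never shuffle data among themselves, which is exactly what makes cluster membership changes tolerable.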


Summary
This post is a simple, high-level overview of some of the topics. There are many other topics that I haven’t covered, such as optimizing for different workloads (OLAP, OLTP, batch processing) and the UX of the database (query language, transport protocol, client libraries), which are equally important. The implications of different consistency semantics, such as sequential consistency, read-your-own-writes, and at-least-once delivery, are also very interesting.

The best thing about a database is that it is a very mature abstraction. It mostly works, and as an application developer, it is very easy to save and retrieve data without much thought. This is definitely worth celebrating, but peeling back the skin of this sufficiently advanced technology is well worth the effort.

I hope more people become fascinated by this subject and fully utilize it.

. . .

References & Side Notes


[1] You can’t sacrifice partition tolerance — http://codahale.com/you-cant-sacrifice-partition-tolerance/. Aphyr’s Jepsen posts are a good starting resource — http://aphyr.com/tags/Jepsen

[2] Modern databases heavily utilize the OS file system cache to transparently accelerate file system access; unused memory is automatically used as cache. The recommended production configuration of such a system looks unusual to untrained eyes: machines with as much as 90% “unused” memory.


[3] A common pitfall in Dynamo-like quorum-based systems is that failed writes give no information. A write can fail externally to the client when the writes to the internal replicas don’t finish in time, so failed writes are still quite likely to have succeeded. Worse, under a last-write-wins resolution strategy with skewed system clocks, future writes can be overwritten by this failed write.

[4] Migrating between databases involves reading and writing to multiple databases at once. The ownership of data (commonly known as the “source of truth”) is uncertain, and data may go out of sync between the legacy and the new system.

[5] Application bugs are probably the biggest offenders against reliability and availability in most startups’ infrastructure. Scaling and performance issues can be predicted, and hardware failures aren’t common when you have a handful of servers. New code, however, is deployed constantly.

[6] From The Network is Reliable — network partitions are more common than you think. Fundamentally, there’s no way to distinguish between high latency, network partitions, GC pauses, and failed machines — they all appear as slow connections. This is a commonly faced problem in ElasticSearch: a node suffers a long GC pause, the entire cluster thinks it is down and tries to aggressively reshuffle the data, cascading the problem.

Multiple Databases
Once a system interacts with multiple databases, the system is eventually consistent. You cannot modify multiple databases at the exact same moment unless you implement two-phase commit (2PC). This is analogous to “a composition of atomic operations is not atomic”.
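A sketch of the prepare/commit split at the heart of 2PC, using toy in-memory "databases" (the classes and method names are invented; real coordinators also need durable logs and timeout/recovery handling):

```python
# Two-phase commit in miniature: first every participant stages the
# write and promises it can commit (phase 1), then the coordinator
# tells everyone to commit (phase 2). Without the prepare phase, a
# crash between the two writes leaves the databases out of sync.

class ToyDB:
    def __init__(self):
        self.data = {}
        self.staged = {}

    def prepare(self, k, v):           # phase 1: promise to commit
        self.staged[k] = v
        return True

    def commit(self, k):               # phase 2: make it visible
        self.data[k] = self.staged.pop(k)

    def abort(self, k):
        self.staged.pop(k, None)

def transfer(db_a, db_b, key, value):
    if db_a.prepare(key, value) and db_b.prepare(key, value):
        db_a.commit(key)
        db_b.commit(key)
        return True
    db_a.abort(key)
    db_b.abort(key)
    return False

a, b = ToyDB(), ToyDB()
print(transfer(a, b, "order", "paid"))         # True
print(a.data == b.data == {"order": "paid"})   # True -- both or neither
```

The cost is also visible here: every write now takes two round trips to every participant, and a stalled coordinator leaves keys staged, which is why most systems accept eventual consistency instead.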

On Deletion
In any distributed system, deleting data is difficult and unsafe. Data is replicated everywhere, either by the database or by the application. Without proper coordination, it’s possible for deleted data to be replicated back. A typical strategy is writing a tombstone record to represent the deletion. However, tombstones have their own problems:

Tombstones occupy disk space, so to reclaim it, tombstones need to be expired. If tombstones are expired and deleted before they are fully replicated, deleted records may come back to life.

You may end up deleting a future write with an old tombstone. This is colloquially referred to as a doomstone. Hilarious as the name is, it’s a real problem.
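The resurrection problem from the list above is easy to simulate. Everything here, the replicas as plain dicts, the naive copy-missing-keys sync, is invented for the sketch, but the sequence of events mirrors the real failure:

```python
# Tombstone resurrection: a delete lands on one replica, the
# tombstone is garbage-collected before the other replica ever sees
# it, and anti-entropy copies the "missing" record back to life.

TOMBSTONE = object()

def sync(src, dst):
    for k, v in src.items():            # naive anti-entropy: fill in gaps
        if k not in dst:
            dst[k] = v

def expire_tombstones(replica):
    for k in [k for k, v in replica.items() if v is TOMBSTONE]:
        del replica[k]

r1 = {"user:1": "alice"}
r2 = {"user:1": "alice"}

r1["user:1"] = TOMBSTONE     # delete reaches r1 only (r2 is partitioned)
expire_tombstones(r1)        # tombstone GC'd before r2 ever saw the delete
sync(r2, r1)                 # r2 replicates the record right back
print(r1)                    # {'user:1': 'alice'} -- the deleted data returns
```

This is exactly why systems like Cassandra insist that the tombstone grace period exceed the time needed to repair every replica.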


