
contributed articles

How Amazon Web Services Uses Formal Methods

Engineers use TLA+ to prevent serious but subtle bugs from reaching production.

BY CHRIS NEWCOMBE, TIM RATH, FAN ZHANG, BOGDAN MUNTEANU, MARC BROOKER, AND MICHAEL DEARDEUFF

DOI:10.1145/2699417

key insights
- Formal methods find bugs in system designs that cannot be found through any other technique we know of.
- Formal methods are surprisingly feasible for mainstream software development and give good return on investment.
- At Amazon, formal methods are routinely applied to the design of complex real-world software, including public cloud services.

Since 2011, engineers at Amazon Web Services (AWS) have used formal specification and model checking to help solve difficult design problems in critical systems. Here, we describe our motivation and experience, what has worked well in our problem domain, and what has not. When discussing personal experience we refer to the authors by their initials.

At AWS we strive to build services that are simple for customers to use. External simplicity is built on a hidden substrate of complex distributed systems. Such complex internals are required to achieve high availability while running on cost-efficient infrastructure and cope with relentless business growth. As an example of this growth, in 2006, AWS launched S3, its Simple Storage Service. In the following six years, S3 grew to store one trillion objects.3 Less than a year later it had grown to two trillion objects and was regularly handling 1.1 million requests per second.4

S3 is just one of many AWS services that store and process data our customers have entrusted to us. To safeguard that data, the core of each service relies on fault-tolerant distributed algorithms for replication, consistency, concurrency control, auto-scaling, load balancing, and other coordination tasks. There are many such algorithms in the literature, but combining them into a cohesive system is a challenge, as the algorithms must usually be modified to interact properly in a real-world system. In addition, we have found it necessary to invent algorithms of our own. We work hard to avoid unnecessary complexity, but the essential complexity of the task remains high.

Complexity increases the probability of human error in design, code, and operations. Errors in the core of the system could cause loss or corruption of data, or violate other interface contracts on which our customers depend. So, before launching a service, we need to reach extremely high confidence that the core of the system is correct. We have found the standard verification techniques in industry are necessary but not sufficient. We routinely use deep design reviews, code reviews, static code analysis, stress testing, and fault-injection testing but still find that subtle bugs can hide in complex concurrent fault-tolerant systems. One reason they do is that human intuition is poor at estimating the true probability of supposedly “extremely rare” combinations of events in systems operating at a scale of millions of requests per second.
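
As a rough illustration of why “extremely rare” is misleading at this scale, the following back-of-the-envelope sketch may help. The request rate is the S3 figure quoted above; the one-in-a-billion probability per request is purely an assumed number chosen for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic: how often does a "rare" event occur at scale?
# REQUESTS_PER_SECOND is the S3 figure quoted in the article.
# P_RARE is an ASSUMPTION for illustration only (one in a billion per request).

REQUESTS_PER_SECOND = 1.1e6
P_RARE = 1e-9
SECONDS_PER_DAY = 24 * 60 * 60


def expected_seconds_between_events(rate_per_second, probability_per_request):
    """Mean time between occurrences, treating requests as independent trials."""
    return 1.0 / (rate_per_second * probability_per_request)


def expected_events_per_day(rate_per_second, probability_per_request):
    """Expected number of occurrences in one day at a constant request rate."""
    return rate_per_second * probability_per_request * SECONDS_PER_DAY


if __name__ == "__main__":
    gap = expected_seconds_between_events(REQUESTS_PER_SECOND, P_RARE)
    per_day = expected_events_per_day(REQUESTS_PER_SECOND, P_RARE)
    print(f"about one occurrence every {gap / 60:.1f} minutes")
    print(f"about {per_day:.0f} occurrences per day")
```

Under these assumed numbers, a one-in-a-billion combination of events would be expected roughly every 15 minutes, or about 95 times a day.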

66 COMMUNICATIONS OF THE ACM | APRIL 2015 | VOL. 58 | NO. 4


NASA’s C. Michael Holloway says, “To a first approximation, we can say that accidents are almost always the result of incorrect estimates of the likelihood of one or more things.”8 Human fallibility means some of the more subtle, dangerous bugs turn out to be errors in design; the code faithfully implements the intended design, but the design fails to correctly handle a particular “rare” scenario. We have found that testing the code is inadequate as a method for finding subtle errors in design, as the number of reachable states of the code is astronomical. So we look for a better approach.

Precise Designs

In order to find subtle bugs in a system design, it is necessary to have a precise description of that design. There are at least two major benefits to writing a precise design: the author is forced to think more clearly, helping eliminate “plausible hand waving,” and tools can be applied to check for errors in the design, even while it is being written. In contrast, conventional design documents consist of prose, static diagrams, and perhaps pseudo-code in an ad hoc untestable language. Such descriptions are far from precise; they are often ambiguous or missing critical aspects (such as partial failure or the granularity of concurrency). At the other end of the spectrum, the final executable code is unambiguous but contains an overwhelming amount of detail. We had to be able to capture the essence of a design in a few hundred lines of precise description. As our designs are unavoidably complex, we needed a highly expressive language, far above the level of code, but with precise semantics. That expressivity must cover real-world concurrency and fault tolerance. And, as we wish to build services quickly, we wanted a language that is simple to learn and apply, avoiding esoteric concepts. We also very much wanted an existing ecosystem of tools. We were thus looking for an off-the-shelf method with high return on investment.

We found what we were looking for in TLA+,11 a formal specification language based on simple discrete math, or basic set theory and predicates, with which all engineers are familiar. A TLA+ specification describes the set of all possible legal behaviors, or execution traces, of a system. We found it helpful that the same language is used to describe both the desired correctness properties of the system (the “what”) and the design of the system (the “how”). In TLA+, correctness properties and system designs are just steps on a ladder of abstraction, with correctness properties occupying higher levels, system designs and algorithms in the middle, and executable code and hardware at the lower levels. TLA+ is intended to make it as easy as possible to show a system design correctly implements the desired correctness properties, through either conventional mathematical reasoning or tools like the TLC model checker9 that take a TLA+ specification and exhaustively check the desired correctness properties across all possible execution traces. The ladder of abstraction also helps designers manage the complexity of real-world systems; designers may choose to describe the system at several “middle” levels of abstraction, with each lower level serving a different purpose (such as to understand the consequences of finer-grain concurrency or more detailed behavior of a communication medium). The designer can then verify that each level is correct with respect to a higher level. The freedom to choose and adjust levels of abstraction makes TLA+ extremely flexible.

At first, the syntax and idioms of TLA+ are somewhat unfamiliar to programmers. Fortunately, TLA+ is accompanied by a second language called PlusCal that is closer to a C-style programming language but much more expressive, as it uses TLA+ for expressions and values. PlusCal is intended to be a direct replacement for pseudo-code. Several engineers at Amazon have found they are more productive using PlusCal than they are using TLA+. However, in other cases, the additional flexibility of plain TLA+ has been very useful. For many designs the choice is a matter of taste, as PlusCal is automatically translated to TLA+ with a single key press. PlusCal users do have to be familiar with TLA+ in order to write rich expressions and because it is often helpful to read the TLA+ translation to understand the precise semantics of a piece of code. Moreover, tools (such as the TLC model checker) work at the TLA+ level.

Formal Methods for Real-World Systems

In industry, formal methods have a reputation for requiring a huge amount of training and effort to verify a tiny piece of relatively straightforward code, so the return on investment is justified only in safety-critical domains (such as medical systems and avionics). Our experience with TLA+ shows this perception to be wrong. At the time of this writing, Amazon engineers have used TLA+ on 10 large complex real-world systems. In each, TLA+ has added significant value, either finding subtle bugs we are sure we would not have found by other means, or giving us enough understanding and confidence to make aggressive performance optimizations without sacrificing correctness. Amazon now has seven teams using TLA+, with encouragement from senior management and technical leadership. Engineers from entry level to principal have been able to learn TLA+ from scratch and get useful results in two to three weeks, in some cases in their personal time on weekends and evenings, without further help or training.

In this article, we have not included snippets of specifications because their unfamiliar syntax can be off-putting to potential new users. We find that potential new users benefit from hearing about the value of formal methods in industry before tackling tutorials and examples. We refer readers to Lamport et al.11 for tutorials, Lamport’s Viewpoint on page 38 in this issue, and Lamport13 for an example of a TLA+ specification from industry similar in size and complexity to some of the larger specifications at Amazon (see the table here). We find TLA+ to be effective in our problem domain, but there are many other formal specification languages and tools, some of which we describe later.

Side Benefit

TLA+ has been helping us shift to a better way of designing systems. Engineers naturally focus on designing the “happy case” for a system, or the processing path in which no errors occur. This is understandable, as the happy case is by far the most common case. That code path must solve the customer’s problem, perform well, make efficient use of resources, and scale with the business, all significant challenges in their own right. When the design for the happy case is done, the engineer then tries to think of “what could go wrong” based on personal experience and that of colleagues and reviewers. The engineer then adds mitigations for these scenarios, prioritized by intuition and perhaps statistics on the probability of occurrence. Almost always, the engineer stops well short of handling “extremely rare” combinations of events, as there are too many such scenarios to imagine.

In contrast, when using formal specification we begin by stating precisely “what needs to go right.” We first specify what the system should do by defining correctness properties, which come in two varieties:

Safety. What the system is allowed to do. For example, at all times, all committed data is present and correct, or, equivalently, at no time can the system have lost or corrupted any committed data; and

Liveness. What the system must eventually do. For example, whenever the
system receives a request, it must eventually respond to that request.

After defining correctness properties, we then precisely describe an abstract version of the design, along with an abstract version of its operating environment. We express “what must go right” by explicitly specifying all properties of the environment on which the system relies. Examples of such properties might be “If a communication channel has not failed, then messages will be propagated along it,” and “If a process has not restarted, then it retains its local state, modulo any intentional modifications.” Next, with the goal of confirming our design correctly handles all dynamic events in the environment, we specify the effects of each of those possible events: network errors and repairs, disk errors, process crashes and restarts, data-center failures and repairs, and actions by human operators. We then use the model checker to verify that the specification of the system in its environment implements the chosen correctness properties, despite any combination or interleaving of events in the operating environment. We find this rigorous “what needs to go right” approach to be significantly less error prone than the ad hoc “what might go wrong” approach.

More Side Benefits

We also find that writing a formal specification pays dividends over the lifetime of the system. All production services at Amazon are under constant development, even those released years ago; we add new features customers have requested, we redesign components to handle massive increases in scale, and we improve performance by removing bottlenecks. Many of these changes are complex and must be made to the running system with no downtime. Our first priority is always to avoid causing bugs in a production system, so we often have to answer “Is this change safe?” We find a major benefit of having a precise, testable model of the core system is that we can quickly verify that even deep changes are safe or learn they are unsafe without doing harm. In several cases, we have prevented subtle but serious bugs from reaching production. In other cases we have been able to make innovative performance optimizations (such as removing or narrowing locks or weakening constraints on message ordering) we would not have dared to do without having model-checked those changes. A precise, testable description of a system becomes a what-if tool for designs, analogous to how spreadsheets are a what-if tool for financial models. We find that using such a tool to explore the behavior of the system can improve the designer’s understanding of the system.

In addition, a precise, testable, well-commented description of a design is an excellent form of documentation, which is important, as AWS systems have unbounded lifetimes. Over time, teams grow as the business grows, so we regularly have to bring new people up to speed on systems. This education must be effective. To avoid creating subtle bugs, we need all engineers to have the same mental model of the system and for that shared model to be accurate, precise, and complete. Engineers form mental models in various ways: talking to each other, reading design documents, reading code, and implementing bug fixes or small features. But talk and design documents can be ambiguous or incomplete, and the executable code is much too large to absorb quickly and might not precisely reflect the intended design. In contrast, a formal specification is precise, short, and can be explored and experimented on with tools.

Table. Applying TLA+ to some of Amazon’s more complex systems.

System | Components | Line Count (Excluding Comments) | Benefit
S3 | Fault-tolerant, low-level network algorithm | 804 PlusCal | Found two bugs, then others in proposed optimizations
S3 | Background redistribution of data | 645 PlusCal | Found one bug, then another in the first proposed fix
DynamoDB | Replication and group-membership system | 939 TLA+ | Found three bugs requiring traces of up to 35 steps
EBS | Volume management | 102 PlusCal | Found three bugs
Internal distributed lock manager | Lock-free data structure | 223 PlusCal | Improved confidence though failed to find a liveness bug, as liveness not checked
Internal distributed lock manager | Fault-tolerant replication-and-reconfiguration algorithm | 318 TLA+ | Found one bug and verified an aggressive optimization

What Formal Specification Is Not Good For

We are concerned with two major classes of problems with large distributed systems: bugs and operator errors that cause a departure from the system’s logical intent; and surprising “sustained emergent performance degradation” of complex systems that inevitably contain feedback loops. We know how to use formal specification to find problems in the first class. However, problems in the second class can cripple a system even though no logic bug is involved. A common example is when a momentary slowdown in a server (due, perhaps, to Java garbage collection) causes timeouts to be breached on clients, causing the clients to retry requests, thus adding load to the server, and further slowdown. In such scenarios the system eventually makes progress; it is not stuck in a logical deadlock, livelock, or other cycle. But from the customer’s perspective it is effectively unavailable due to sustained unacceptable response times. TLA+ can be used to specify an upper bound on response time, as a real-time safety property. However, AWS systems are built on infrastructure (disks, operating systems, network) that does not support hard real-time scheduling or guarantees, so real-time safety properties would not be realistic. We build soft real-time systems in which very short periods of slow responses are not considered errors. However, prolonged
severe slowdowns are considered errors. We do not yet know of a feasible way to model a real system that would enable tools to predict such emergent behavior. We use other techniques to mitigate these risks.

First Steps to Formal Methods

With hindsight, Amazon’s path to formal methods seems straightforward; we had an engineering problem and found a solution. Reality was somewhat different. The effort began with author C.N.’s dissatisfaction with the quality of several distributed systems he had designed and reviewed, and with the development process and tools that had been used to construct those systems. The systems were considered successful, yet bugs and operational problems persisted. To mitigate the problems, the systems used well-proven methods (pervasive contract assertions enabled in production) to detect symptoms of bugs, and mechanisms (such as “recovery-oriented computing”20) to attempt to minimize the impact when bugs are triggered. However, reactive mechanisms cannot recover from the class of bugs that cause permanent damage to customer data; we must instead prevent such bugs from being created.

When looking for techniques to prevent bugs, C.N. did not initially consider formal methods, due to the pervasive view that they are suitable for only tiny problems and give very low return on investment. Overcoming the bias against formal methods required evidence they work on real-world systems. This evidence was provided by Zave,22 who used a language called Alloy to find serious bugs in the membership protocol of a distributed system called Chord. Chord was designed by an expert group at MIT and is successful, having won a “10-year test of time” award at the SIGCOMM 2011 conference and influenced several systems in industry. Zave’s success motivated C.N. to perform an evaluation of Alloy by writing and model checking a moderately large Alloy specification of a non-trivial concurrent algorithm.18 We liked many characteristics of the Alloy language, including its emphasis on “execution traces” of abstract system states composed of sets and relations. However, we also found that Alloy is not expressive enough for many use cases at AWS; for instance, we could not find a practical way in Alloy to represent rich data structures (such as dynamic sequences containing nested records with multiple fields).

Alloy’s limited expressivity appears to be a consequence of the particular approach to analysis taken by the Alloy Analyzer tool. The limitations do not seem to be caused by Alloy’s conceptual model (“execution traces” over system states). This hypothesis motivated C.N. to look for a language with a similar conceptual model but with richer constructs for describing system states. C.N. eventually stumbled on a language with those properties when he found a TLA+ specification in the appendix of a paper on a canonical algorithm in our problem domain: the Paxos consensus algorithm.12

The fact that TLA+ was created by the designer of such a widely used algorithm gave us some confidence that TLA+ would work for real-world systems. We became more confident when we learned a team of engineers at DEC/Compaq had used TLA+ to specify and verify some intricate cache-coherency protocols for the Alpha series of multicore CPUs.5,16 We read one of the specifications13 and found they were sophisticated distributed algorithms involving rich message passing, fine-grain concurrency, and complex correctness properties. That left only the question of whether TLA+ could handle real-world failure modes. (The Alpha cache-coherency algorithm does not consider failure.) We knew from Lamport’s Fast Paxos paper12 that TLA+ could model fault tolerance at a high level of abstraction and were further convinced when we found other papers showing TLA+ could model lower-level failures.15

C.N. evaluated TLA+ by writing a specification of the same non-trivial concurrent algorithm he had written in Alloy.18 Both Alloy and TLA+ were able to handle the problem, but the comparison revealed that TLA+ is much more expressive than Alloy. This difference is important in practice; several of the real-world specifications we have written in TLA+ would have been infeasible in Alloy. We initially had the opposite concern about TLA+; it is so expressive that no model checker can hope to evaluate everything that can be expressed in the language. But so far we have always been able to find a way to express our intent in a way that is clear, direct, and can be model checked.

After evaluating Alloy and TLA+, C.N. tried to persuade colleagues at Amazon to adopt TLA+. However, engineers have almost no spare time for such things, unless compelled by need. Fortunately, a need was about to arise.

First Big Success at Amazon

In January 2012, Amazon launched DynamoDB, a scalable high-performance “no SQL” data store that replicates customer data across multiple data centers while promising strong consistency.2 This combination of requirements leads to a large, complex system.

The replication and fault-tolerance mechanisms in DynamoDB were created by author T.R. To verify correctness of the production code, T.R. performed extensive fault-injection testing using a simulated network layer to control message loss, duplication, and reordering. The system was also stress tested for long periods on real hardware under many different workloads. We know such testing is absolutely necessary but can still fail to uncover subtle flaws in design. To verify the design of DynamoDB, T.R. wrote detailed informal proofs of correctness that did indeed find several bugs in early versions of the design. However, we have also learned that conventional informal proofs can miss very subtle problems.14 To achieve the highest level of confidence in the design, T.R. chose TLA+.

T.R. learned TLA+ and wrote a detailed specification of these components in a couple of weeks. To model-check the specification, we used the distributed version of the TLC model checker running on a cluster of 10 cc1.4xlarge EC2 instances, each with eight cores plus hyperthreads and 23GB of RAM. The model checker verified that a small, complicated part of the algorithm worked as expected for a sufficiently large instance of the system to give high confidence it is correct. T.R. then checked the broader fault-tolerant algorithm. This time the model checker found a bug that could lead to losing data if a particular sequence of failures and recovery steps were interleaved with other processing. This was a very subtle bug; the
shortest error trace exhibiting the bug included 35 high-level steps. The improbability of such compound events is not a defense against such bugs; historically, AWS engineers have observed many combinations of events at least as complicated as those that could trigger this bug. The bug had passed unnoticed through extensive design reviews, code reviews, and testing, and T.R. is convinced we would not have found it by doing more work in those conventional areas. The model checker later found two bugs in other algorithms, both serious and subtle. T.R. fixed all these bugs, and the model checker verified the resulting algorithms to a very high degree of confidence.

T.R. says that, had he known about TLA+ before starting work on DynamoDB, he would have used it from the start. He believes the investment he made in writing and checking the formal TLA+ specifications was more reliable and less time consuming than the work he put into writing and checking his informal proofs. Using TLA+ in place of traditional proof writing would thus likely have improved time to market, in addition to achieving greater confidence in the system’s correctness.

After DynamoDB was launched, T.R. worked on a new feature to allow data to be migrated between data centers. As he already had the specification for the existing replication algorithm, T.R. was able to quickly incorporate this new feature into the specification. The model checker found the initial design would have introduced a subtle bug, but it was easy to fix, and the model checker verified the resulting algorithm to the necessary level of confidence. T.R. continues to use TLA+ and model checking to verify changes to the design for both optimizations and new features.

[Pull quote: Formal methods have helped us devise aggressive optimizations to complex algorithms without sacrificing quality.]

Persuading More Engineers

Success with DynamoDB gave us enough evidence to present TLA+ to the broader engineering community at Amazon. This raised a challenge: how to convey the purpose and benefits of formal methods to an audience of software engineers. Engineers think in terms of debugging rather than “verification,” so we called the presentation “Debugging Designs.”18 Continuing the metaphor, we have found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it “exhaustively testable pseudo-code.” We initially avoid the words “formal,” “verification,” and “proof” due to the widespread view that formal methods are impractical. We also initially avoid mentioning what TLA stands for, as doing so would give an incorrect impression of complexity.

Immediately after seeing the presentation, a team working on S3 asked for help using TLA+ to verify a new fault-tolerant network algorithm. The documentation for the algorithm consisted of many large, complicated state-machine diagrams. To check the state machine, the team had been considering writing a Java program to brute-force explore possible executions: essentially a hard-wired form of model checking. They were able to avoid the effort by using TLA+ instead. Author F.Z. wrote two versions of the spec over a couple of weeks. For this particular problem, F.Z. found that she was more productive in PlusCal than TLA+, and we have observed that engineers often find it easier to begin with PlusCal.

Model checking revealed two subtle bugs in the algorithm and allowed F.Z. to verify fixes for both. F.Z. then used the spec to experiment with the design, adding new features and optimizations. The model checker quickly revealed that some of these changes would have introduced bugs.

This success led AWS management to advocate TLA+ to other teams working on S3. Engineers from those teams wrote specs for two additional critical algorithms and for one new feature. F.Z. helped teach them how to write their first specs. We find it encouraging that TLA+ can be taught by engineers who are still new to it themselves; this is important for quickly scaling adoption in an organization as large as Amazon.

Author B.M. was one such engineer. His first spec was for an algorithm known to contain a subtle bug. The bug had passed unnoticed through multiple design reviews and code reviews and had surfaced only after months of testing. B.M. spent two weeks learning TLA+ and writing the spec. Using it, the TLC model checker found the bug in seconds. The team had already designed and reviewed a fix for the bug,
so B.M. changed the spec to include the data that were much richer than
the proposed fix. The model checker standard multiplicity constraints and
found the problem still occurred in a foreign key constraints. We then added
different execution trace. A stronger fix high-level specifications of some of
was proposed, and the model checker
verified the second fix. B.M. later wrote Executive the main operations on the data that
helped us correct and refine the sche-
another spec for a different algorithm.
That spec did not uncover any bugs but
management ma. This result suggests a data model
can be viewed as just another level of
did uncover several important ambi- actively encourages abstraction of the entire system. It also
guities in the documentation for the
algorithm the spec helped resolve.
teams to write suggests TLA+ may help designers im-
prove a system’s scalability. In order to
Somewhat independently, after see- TLA+ specs for new remove scalability bottlenecks, design-
ing internal presentations about TLA+,
authors M.B and M.D. taught them-
features and other ers often break atomic transactions
into finer-grain operations chained
selves PlusCal and TLA+ and started significant design together through asynchronous work-
using them on their respective projects
without further persuasion or assis- changes. flows; TLA+ can help explore the conse-
quences of such changes with respect
tance. M.B. used PlusCal to find three to isolation and consistency.
bugs and wrote a public blog about his
personal experiments with TLA+ outside of Amazon.7 M.D. used PlusCal to check a lock-free concurrent algorithm and then used TLA+ to find a critical bug in one of AWS’s most important new distributed algorithms. M.D. also developed a fix for the bug and verified the fix. Independently, C.N. wrote a spec for the same algorithm that was quite different in style from the spec written by M.D., but both found the same bug in the algorithm. This suggests the benefits of using TLA+ are quite robust to variations among engineers. Both specs were later used to verify that a crucial optimization to the algorithm did not introduce any bugs.

Engineers at Amazon continue to use TLA+, adopting the practice of first writing a conventional prose-design document, then incrementally refining parts of it into PlusCal or TLA+. This method often yields important insight about the design, even without going as far as full specification or model checking. In one case, C.N. refined a prose design of a fault-tolerant replication system that had been designed by another Amazon engineer. C.N. wrote and model checked specifications at two levels of concurrency; these specifications helped him understand the design well enough to propose a major protocol optimization that radically reduced write-latency in the system. We have also discovered that TLA+ is an excellent tool for data modeling, as when designing the schema for a relational or “no SQL” database. We used TLA+ to design a non-trivial schema with semantic invariants over

Most Frequently Asked Question
On learning about TLA+, engineers usually ask, “How do we know that the executable code correctly implements the verified design?” The answer is we do not know. Despite this, formal methods still help in multiple ways:

Get design right. Formal methods help engineers get the design right, which is a necessary first step toward getting the code right. If the design is broken, then the code is almost certainly broken, as mistakes during coding are extremely unlikely to compensate for mistakes in design. Worse, engineers are likely to be deceived into believing the code is “correct” because it appears to correctly implement the (broken) design. Engineers are unlikely to realize the design is incorrect while focused on coding;

Gain better understanding. Formal methods help engineers gain a better understanding of the design. Improved understanding can only increase the chances they will get the code right; and

Write better code. Formal methods can help engineers write better “self-diagnosing code” in the form of assertions. Independent evidence10 and our own experience suggest pervasive use of assertions is a good way to reduce errors in code. An assertion checks a small, local part of an overall system invariant. A good system invariant captures the fundamental reason the system works; the system will not do anything wrong that could violate a safety property as long as it continuously maintains the system invariant.
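The relationship between local assertions and an overall system invariant can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any AWS system: the `Entry` and `Replica` types, the `check_system_invariant` helper, and the invariant itself (every replica's log is a gap-free prefix of one agreed sequence) are all hypothetical.

```python
"""Toy replica: each assertion checks a small, local slice of a system invariant.

Hypothetical system invariant (an assumption made for this sketch): every
replica's log is a gap-free prefix of one agreed sequence of entries.
"""
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    seq: int    # position in the agreed sequence, starting at 0
    key: str
    value: str


@dataclass
class Replica:
    log: list = field(default_factory=list)
    state: dict = field(default_factory=dict)

    def apply(self, entry: Entry) -> None:
        # Local precondition: the entry must extend the log with no gap and
        # no duplicate. This is the part of the invariant visible from here.
        assert entry.seq == len(self.log), (
            f"expected seq {len(self.log)}, got {entry.seq}"
        )
        self.log.append(entry)
        self.state[entry.key] = entry.value
        # Local postcondition: the materialized state reflects the last write.
        assert self.state[entry.key] == self.log[-1].value


def check_system_invariant(replicas) -> bool:
    """Global check, affordable in tests: each log is a prefix of the longest log."""
    longest = max((r.log for r in replicas), key=len, default=[])
    return all(r.log == longest[: len(r.log)] for r in replicas)
```

In this sketch the assertions in `apply` can only see one replica, so each checks just the piece of the invariant that is visible locally, while `check_system_invariant` states the whole property and is the kind of check a test harness could run across replicas. Note that Python drops `assert` statements when run with `-O`, so a production system would need a deliberate policy on whether such checks stay enabled.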

72 COMMUNICATIONS OF THE ACM | APRIL 2015 | VOL. 58 | NO. 4



The challenge is to find a good system invariant, one strong enough to ensure no safety properties are violated. Formal methods help engineers find strong invariants, so formal methods can help improve assertions that help improve the quality of code.

While we would like to verify that executable code correctly implements the high-level specification or even generate the code from the specification, we are not aware of any such tools that can handle distributed systems as large and complex as those being built at Amazon. We do routinely use conventional static analysis tools, but they are largely limited to finding “local” issues in the code, and are unable to verify compliance with a high-level specification.

We have seen research on using the TLC model checker to find “edge cases” in the design on which to test the code,21 an approach that seems promising. However, Tasiran et al.21 covered hardware design, and we have not yet tried to apply the method to software.

Alternatives to TLA+
There are many formal specification methods. We evaluated several and published our findings in Newcombe,19 listing the requirements we think are important for a formal method to be successful in our industry segment. When we found TLA+ met those requirements, we stopped evaluating methods, as our goal was always practical engineering rather than an exhaustive survey.

Related Work
We find relatively little published literature on using high-level formal specification for verifying the design of complex distributed systems in industry. The Farsite project6 is complex but somewhat different from the types of systems we describe here and apparently never launched commercially. Abrial1 cited applications in commercial safety-critical control systems, but they seem less complex than our problem domain. Lu et al.17 described post-facto verification of a well-known algorithm for a fault-tolerant distributed hash table, and Zave22 described another such algorithm, but we do not know if these algorithms have been used in commercial products.

Conclusion
Formal methods are a big success at AWS, helping us prevent subtle but serious bugs from reaching production, bugs we would not have found through any other technique. They have helped us devise aggressive optimizations to complex algorithms without sacrificing quality. At the time of this writing, seven Amazon teams have used TLA+, all finding value in doing so, and more Amazon teams are starting to use it. Using TLA+ will improve both time-to-market and quality of our systems. Executive management actively encourages teams to write TLA+ specs for new features and other significant design changes. In annual planning, managers now allocate engineering time to TLA+.

While our results are encouraging, some important caveats remain. Formal methods deal with models of systems, not the systems themselves, so the adage “All models are wrong, some are useful” applies. The designer must ensure the model captures the significant aspects of the real system. Achieving it is a special skill, the acquisition of which requires thoughtful practice. Also, we were solely concerned with obtaining practical benefits in our particular problem domain and have not attempted a comprehensive survey. Therefore, mileage may vary with other tools or in other problem domains.

References
1. Abrial, J. Formal methods in industry: Achievements, problems, future. In Proceedings of the 28th International Conference on Software Engineering (Shanghai, China, 2006), 761–768.
2. Amazon.com. Supported Operations in DynamoDB: Strongly Consistent Reads. System documentation; http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/APISummary.html
3. Barr, J. Amazon S3: The first trillion objects. Amazon Web Services Blog, June 2012; http://aws.typepad.com/aws/2012/06/amazon-s3-the-first-trillion-objects.html
4. Barr, J. Amazon S3: Two trillion objects, 1.1 million requests per second. Amazon Web Services Blog, Mar. 2013; http://aws.typepad.com/aws/2013/04/amazon-s3-two-trillion-objects-11-million-requests-second.html
5. Batson, B. and Lamport, L. High-level specifications: Lessons from industry. In Formal Methods for Components and Objects, Lecture Notes in Computer Science Number 2852, F.S. de Boer, M. Bonsangue, S. Graf, and W.-P. de Roever, Eds. Springer, 2003, 242–262.
6. Bolosky, W., Douceur, J., and Howell, J. The Farsite Project: A retrospective. ACM SIGOPS Operating Systems Review: Systems Work at Microsoft Research 41, 2 (Apr. 2007), 17–26.
7. Brooker, M. Exploring TLA+ with two-phase commit. Personal blog, Jan. 2013; http://brooker.co.za/blog/2013/01/20/two-phase.html
8. Holloway, C. Michael. Why you should read accident reports. Presented at the Software and Complex Electronic Hardware Standardization Conference (Norfolk, VA, July 2005); http://klabs.org/richcontent/conferences/faa_nasa_2005/presentations/cmh-why-read-accident-reports.pdf
9. Joshi, R., Lamport, L. et al. Checking cache-coherence protocols with TLA+. Formal Methods in System Design 22, 2 (Mar. 2003), 125–131.
10. Kudrjavets, G., Nagappan, N., and Ball, T. Assessing the relationship between software assertions and code quality: An empirical investigation. In Proceedings of the 17th International Symposium on Software Reliability Engineering (Raleigh, NC, Nov. 2006), 204–212.
11. Lamport, L. The TLA Home Page; http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html
12. Lamport, L. Fast Paxos. Distributed Computing 19, 2 (Oct. 2006), 79–103.
13. Lamport, L. The Wildfire Challenge Problem; http://research.microsoft.com/en-us/um/people/lamport/tla/wildfire-challenge.html
14. Lamport, L. Checking a multithreaded algorithm with +CAL. In Distributed Computing: 20th International Conference, S. Dolev, Ed. Springer-Verlag, 2006, 11–163.
15. Lamport, L. and Merz, S. Specifying and verifying fault-tolerant systems. In Formal Techniques in Real-Time and Fault-Tolerant Systems, Lecture Notes in Computer Science, Number 863, H. Langmaack, W.-P. de Roever, and J. Vytopil, Eds. Springer-Verlag, Sept. 1994, 41–76.
16. Lamport, L., Sharma, M., Tuttle, M., and Yu, Y. The Wildfire Challenge Problem. Jan. 2001; http://research.microsoft.com/en-us/um/people/lamport/pubs/wildfire-challenge.pdf
17. Lu, T., Merz, S., and Weidenbach, C. Towards verification of the Pastry Protocol using TLA+. In Proceedings of Joint 13th IFIP WG 6.1 International Conference and 30th IFIP WG 6.1 International Conference Lecture Notes in Computer Science Volume 6722 (Reykjavik, Iceland, June 6–9). Springer-Verlag, 2011, 244–258.
18. Newcombe, C. Debugging Designs. Presented at the 14th International Workshop on High-Performance Transaction Systems (Monterey, CA, Oct. 2011); http://hpts.ws/papers/2011/sessions_2011/Debugging.pdf and associated specifications http://hpts.ws/papers/2011/sessions_2011/amazonbundle.tar.gz
19. Newcombe, C. Why Amazon chose TLA+. In Proceedings of the Fourth International Conference Lecture Notes in Computer Science Volume 8477, Y.A. Ameur and K.-D. Schewe, Eds. (Toulouse, France, June 2–6). Springer, 2014, 25–39.
20. Patterson, D., Fox, A. et al. The Berkeley/Stanford Recovery-Oriented Computing Project. University of California, Berkeley; http://roc.cs.berkeley.edu/
21. Tasiran, S., Yu, Y., Batson, B., and Kreider, S. Using formal specifications to monitor and guide simulation: Verifying the cache coherence engine of the Alpha 21364 microprocessor. In Proceedings of the Third IEEE International Workshop on Microprocessor Test and Verification (Austin, TX, June). IEEE Computer Society, 2002.
22. Zave, P. Using lightweight modeling to understand Chord. ACM SIGCOMM Computer Communication Review 42, 2 (Apr. 2012), 49–57.

Chris Newcombe (chris.newcombe@gmail.com) is an architect at Oracle, Seattle, WA, and was a principal engineer in the AWS database services group at Amazon.com, Seattle, WA, when this article was written.

Tim Rath (rath@amazon.com) is a principal engineer in the AWS database services group at Amazon.com, Seattle, WA.

Fan Zhang (fanxhang58@gmail.com) is a software engineer and technical product and program manager at Cyanogen, Seattle, WA, and was a software engineer for AWS S3 at Amazon.com, Seattle, WA, when this article was written.

Bogdan Munteanu (bogdanmunte@gmail.com) is a software engineer at Dropbox, and was a software engineer in the AWS S3 Engines group at Amazon.com, Seattle, WA, when this article was written.

Marc Brooker (mbrooker@amazon.com) is a principal engineer for AWS EC2 at Amazon.com, Seattle, WA.

Michael Deardeuff (mdearde@amazon.com) is a software engineer in the AWS database services group at Amazon.com, Seattle, WA.

Copyright held by Owners/Authors. Publication rights licensed to ACM. $15.00

