A Second Conversation with Werner Vogels

The Amazon CTO sits with Tom Killalea to discuss designing for evolution at scale.
When I joined Amazon in 1998, the company
had a single US-based website selling only
books and running a monolithic C application
on five servers, a handful of Berkeley DBs
for key/value data, and a relational database.
That database was called “ACB,” which stood for “Amazon.Com
Books,” a name that failed to reflect the range of our
ambition. In 2006 acmqueue published a conversation
between Jim Gray and Werner Vogels, Amazon’s CTO, in
which Werner explained that Amazon should be viewed not
just as an online bookstore but as a technology company.
In the intervening 14 years, Amazon’s distributed systems,
and the patterns used to build and operate them, have
grown in influence. In this follow-up conversation, Werner
and I pay particular attention to the lessons to be learned
from the evolution of a single distributed system, S3, which
was publicly launched close to the time of that 2006
conversation.
[Sidebar] Amazon used the following principles of distributed system design to meet Amazon S3 requirements:

Decentralization. Use fully decentralized techniques to remove scaling bottlenecks and single points of failure.

Asynchrony. The system makes progress under all circumstances.

Autonomy. The system is designed such that individual components can make decisions based on local information.

Local responsibility. Each individual component is responsible for achieving its consistency; this is never the burden of its peers.

Controlled concurrency. Operations are designed such that no or limited concurrency control is required.

Tom Killalea …great interest to our software practitioner community. This is evolution at a scale that is unseen and certainly hasn’t been broadly discussed.

Werner Vogels I absolutely agree that this is unparalleled scale. Even today, even though there are Internet services these days that have reached incredible scale—I mean look at Zoom, for example [this interview took place over Zoom]—I think S3 is still two or three generations ahead of that. And why? Because we started earlier; it’s just a matter of time, and at the same time having a strict feedback loop with your customers that continuously evolves the service. Believe me, when we were designing it, when we were building it, I don’t think that anyone anticipated the complexity of it…eventually. I think what we did realize is that we would not be running the same architecture six months later, or a year later.

So, I think one of the tenets up front was don’t lock yourself into your architecture, because two or three orders of magnitude of scale and you will have to rethink it. Some come with everything and the kitchen sink—not small building blocks but rather, “This is how you should build software.”

“We went down a path, one that Jeff [Bezos] described years before, as building tools instead of platforms.”

A little before we started S3, we began to realize that what we were doing might radically change the way that software was being built and services were being used. But we had no idea how that would evolve, so it was more important to build small, nimble tools that customers
could build on (or we could build on ourselves) instead
of having everything and the kitchen sink ready at that
particular moment. It was not necessarily a timing issue;
it was much more that we were convinced that whatever
we would be adding to the interfaces of S3, to the
functionality of S3, should be driven by our customers—and
how the next generation of customers would start building
their systems.
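The decentralization principle in the sidebar has a canonical technique behind it: consistent hashing, which the Dynamo paper cited in the references describes as a way to spread keys across nodes without a central coordinator. The sketch below is a minimal illustration of that technique under invented names; it is not S3's or DynamoDB's implementation.

```python
# Minimal sketch of consistent hashing, the decentralization technique
# described in the Dynamo paper. Illustrative only; all names invented.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring so that
        # load spreads evenly and adding or removing a node moves few keys.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's hash to the first node position.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["storage-a", "storage-b", "storage-c"])
print(ring.node_for("bucket/photo.jpg"))  # routed without a central coordinator
```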
If you build everything and the kitchen sink as one big
platform, you build with technology that is from five years
before, because that’s how long it takes to design and build
and give everything to your customers. We wanted to
move much faster and have a really quick feedback cycle
with our customers that asks, “How would you develop for
2025?”
…it, but that’s really what our customers are used to using.
Again, you build a minimalistic interface, and you can build
it in a robust and solid manner, in a way that would be much
harder if you started adding complexity from day one, even
though you know you’re adding something that customers
want.
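For a sense of how small such an interface can be: S3 launched with little more than operations to put, get, delete, and list objects. The toy below is an in-memory sketch of a minimal object-store interface in that spirit; the class and method names are illustrative, not AWS's API.

```python
# Toy illustration of a deliberately minimal object-store interface,
# in the spirit of S3's small launch API. In-memory only; names invented.
class ObjectStore:
    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list(self, prefix: str = "") -> list[str]:
        # Prefix listing is the one query primitive; anything richer is
        # left for customers to build on top of the minimal interface.
        return sorted(k for k in self._objects if k.startswith(prefix))

store = ObjectStore()
store.put("photos/cat.jpg", b"...")
print(store.list("photos/"))
```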
There were things that we didn’t know on day one, but a
better example here is when we launched DynamoDB and
took a similar minimalistic approach. We knew on the day of the launch that customers already wanted secondary indices, but we decided to launch without it. It turned out that customers came back saying that they wanted IAM (Identity and Access Management)—access control on individual fields within the database—much more than they wanted secondary indices. Our approach allows us to reorient the roadmap and figure out the most important
things for our customers. In the DynamoDB case it turned
out to be very different from what we thought.
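The field-level access control Werner describes later shipped as DynamoDB's fine-grained access control, expressed through IAM policy conditions. A sketch of such a policy follows, written as a Python dict; the table name, attribute names, and account ID are hypothetical placeholders.

```python
# Sketch of an IAM policy for DynamoDB fine-grained access control:
# a caller may read only their own items (partition key must match their
# identity) and only two named attributes. Table name, attributes, and
# account ID are hypothetical placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/UserProfiles",
        "Condition": {
            "ForAllValues:StringEquals": {
                # Rows: the partition key must equal the caller's identity.
                "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"],
                # Fields: only these attributes may be requested.
                "dynamodb:Attributes": ["UserId", "DisplayName"],
            },
            # Force requests to name specific attributes rather than select all.
            "StringEqualsIfExists": {"dynamodb:Select": "SpecificAttributes"},
        },
    }],
}
```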
TK I think that much of this conversation is going to be
about evolvability. As I listened to you at re:Invent, my
mind turned to Gall’s Law: “A complex system that works is
invariably found to have evolved from a simple system that
worked. A complex system designed from scratch never
works and cannot be patched up to make it work. You have
to start over with a working simple system.” How do you
think this applies to how S3 has evolved?
WV That was the fundamental thinking behind S3. Could
we have built a complex system? Probably. But if you build
a complex system, it’s much harder to evolve and change,
because you make a lot of long-term decisions in a complex
system. It doesn’t hurt that much if you make long-term…

“You build it, you run it.”

…leadership principle is to be customer obsessed, then you need to have that come back into your operations as well. We want everybody to be customer obsessed, and for that you need to be in contact with customers. I do also think,
and I always joke about it, that if your alarm goes off at
4 am, there’s more motivation to fix your systems. But it
allows us to be agile, and to be really fast.
I remember during an earlier period in Amazon Retail,
we had a whole year where we focused on performance,
especially at the 99.9 percentile, and we had a whole year
where we focused on removing single points of failure, but
then we had a whole year where we focused on efficiency.
Well, that last one failed completely, because it’s not a
customer-facing opportunity. Our engineers are very
well attuned to removing single points of failure because
it’s good for our customers, or to performance, and
understanding our customers. Becoming more efficient is
bottom-line driven, and all of the engineers go, “Yes, but we
could be doing all of these other things that would be good
for our customers.”
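A brief aside on the 99.9 percentile Werner refers to: it is tail latency, the experience of the worst one-in-a-thousand requests, which a mean can hide entirely. A few lines of Python with invented numbers show why the tail gets its own program of work.

```python
# Why optimize the 99.9 percentile rather than the mean: tail latency is
# what the unluckiest requests see. All numbers here are invented.
import random
import statistics

random.seed(7)
# Mostly fast requests, plus a rare slow path (cache miss, retry, GC pause).
latencies_ms = [random.gauss(20, 3) for _ in range(100_000)]
latencies_ms += [random.gauss(900, 150) for _ in range(200)]

mean = statistics.fmean(latencies_ms)
p999 = statistics.quantiles(latencies_ms, n=1000)[-1]  # 99.9th percentile
print(f"mean = {mean:.1f} ms, p99.9 = {p999:.1f} ms")
# The mean barely moves, while p99.9 exposes the slow path customers feel.
```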
TK Right. That was a tough program to lead.
WV That’s the engineering culture you want. Of course, you
also don’t want to spend too much money, but… I remember
what else is going on and what are the best practices that
some of the other teams have developed—at our scale,
these things become a challenge, and as such you need to
change your organization, and you need to hire into the
organization people who are really good at…I won’t say
oversight because that implies that they have decision
control, but let’s say they are the teachers.
TK Approachability is a key characteristic for a principal
engineer. Even if junior engineers go through a design
review and learn that their design is terrible, they should
still come away confident that they could go back to
that same principal engineer once they believe that their
reworked design is ready for another look.
WV Yes.
TK Could you talk about partitioning of responsibility that
enables evolution across team boundaries, where to draw
the boundaries and to scope appropriately?
WV I still think there are two other angles to this topic
of information sharing that we’ve always done well at
Amazon. One, which predates AWS of course, is getting
everybody in the room to review operational metrics, or
to review the business metrics. The database teams may
get together on a Tuesday morning to review everything
that is going on in their services, and they’ll show up on
Wednesday morning at the big AWS meeting where, now
that we have 175 services, not every service presents
anymore, but a die is being rolled.
TK Actually I believe that a wheel is being spun.
WV Yes. So, you need to be ready to talk about your
operational results of the past week. An important part
of that is there are senior people in the room, and there
are junior folks who have just launched their first service.
A lot of learning goes on in those two hours in that room
that is probably the highest-value learning I’ve ever seen.
The same goes for the business meeting, whether it’s
Andy [Jassy, CEO of AWS] in the room, or Jeff [Bezos, CEO
of Amazon] or [Jeff] Wilke [CEO of Amazon Worldwide
Consumer]; there’s a lot of learning going on at a business
level. Why did S3 delete this much data? Well, it turns out
that this file-storage service that serves others deletes
only once a month, and prior to that they mark all of their
objects (for deletion).
You need to know; you need to understand; and you need
to talk to others in the room and share this knowledge.
These operational review and business review meetings
have become extremely valuable as an educational tool
where the most senior engineers can shine in showing this
is how you build and operate a scalable service.
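The deletion spike in that anecdote comes from a common client pattern: mark objects for deletion continuously, then remove them in one periodic batch. A minimal sketch of the pattern, with invented names and data, not an AWS API:

```python
# Minimal sketch of the mark-then-sweep pattern behind the S3 deletion
# anecdote: objects are tagged throughout the month, then removed in one
# burst, which the storage service sees as a deletion spike.
objects: dict[str, bytes] = {f"logs/2020-09-{d:02d}": b"..." for d in range(1, 31)}
marked: set[str] = set()

def mark_for_deletion(key: str) -> None:
    # Cheap, continuous operation: record intent, delete nothing yet.
    marked.add(key)

def monthly_sweep() -> int:
    # Bursty operation: a month's worth of deletes lands all at once.
    swept = 0
    for key in list(marked):
        objects.pop(key, None)
        swept += 1
    marked.clear()
    return swept

for key in list(objects):
    mark_for_deletion(key)
print(monthly_sweep())  # 30 deletes arrive in a single batch
```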
TK Fourteen years is such a long time in the life of a large-
scale service with a continuously evolving architecture.
Are there universal lessons that you could share for
service owners who are much earlier in their journey? On
S3’s journey from 8 to 262 services over 14 years, any other
learnings that would benefit our readers?
WV As always, security is number one. Otherwise, you
have no business. Whatever you build, with whatever you
architect, you should start with security. You cannot have
a customer-facing service, or even an internally facing
service, without making that your number-one priority.
Finally, I have advice that is twofold. There’s the Well-
Architected Framework, where basically over five different
pillars we’ve collected 10 or 15 years of knowledge from
References

1. Amazon Builders’ Library; https://aws.amazon.com/builders-library/.
2. Amazon Press Center. 2006. Amazon Web Services launches; https://press.aboutamazon.com/news-releases/news-release-details/amazon-web-services-launches-amazon-s3-simple-storage-service.
3. AWS Well-Architected; https://aws.amazon.com/architecture/well-architected/.
4. DeCandia, G., et al. 2007. Dynamo: Amazon’s highly available key-value store. Proceedings of the 21st Annual Symposium on Operating Systems Principles (October), 205-220; https://dl.acm.org/doi/10.1145/1294261.1294281.
5. Dixon, J. 2010. Pentaho, Hadoop, and data lakes; https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/.
6. Gray, J. 2006. A conversation with Werner Vogels. acmqueue 4(4); https://queue.acm.org/detail.cfm?id=1142065.
Copyright © 2020 held by owner/author. Publication rights licensed to ACM.