Here's How Google Makes Sure It (Almost) Never Goes Down

When was the last time you needed to Google something and Google
wasn't there?
Odds are, you don't remember that ever happening. Sure, there are
times when you can't reach Google because your internet connection
is down. But Google's primary online services, from its search engine
to Gmail to Google Docs and more, are nearly always accessible. The
company's Google Apps suite, including Gmail and Docs, was available
about 99.97 percent of the time in 2015, according to the company's
own numbers. The world pretty much takes this for granted, but it's a
remarkable reality. The billions who use Google hardly stop to consider
how Google made something so impressive seem so mundane.
Google explains the feat in three words: Site Reliability Engineering.
OK, they aren't the best three words. But that's the rather unsexy
name Google gave to this seminal philosophy more than a decade ago.
It's a rather nuanced and expansive philosophy, but it really boils down
to one central idea: Don't get IT people who specialize in running
Internet services to run your Internet services. Have software coders
run them instead. If you do this, the thinking goes, the software coders
will build tools that can help run the operation without the active
involvement of real live people.
'We long for the day when nobody runs anything.' Todd Underwood,
Google
"The result of our approach," writes Googler Ben Treynor Sloss in a new
essay, "is that we end up with a team of people who will quickly
become bored by performing tasks by hand and have the skill set
necessary to write software to replace their previously manual work."
For many in Silicon Valley, that may seem like a common idea. This
kind of thing is now practiced across the tech world, from Amazon to
Box.com. People call it DevOps (development plus operations), an
effort to combine the ways of the software coder with the aims of the
systems administrator. But the DevOps movement, embodied by tools
like Chef and Puppet, evolved separately from and largely after the
SRE philosophies that arose inside Google (and similar ideas that took
hold at Amazon). It's just that Google has kept largely quiet about this
over the last decade, as it often did when the topic was the inner
workings of its enormously efficient online operation.


But the company has entered a new period, one in which it's more
willing to discuss such things (mainly because it wants to promote the
cloud services that allow outside businesses to run their own software
atop its vast network of data centers and machines). Google has even
gone so far as to write a book about Site Reliability Engineering.
The book is called, well, Site Reliability Engineering. It was just
published by O'Reilly, and the essay from Sloss serves as the first
chapter. If you're into DevOps, it's a must-read. And even if you're not,
the opening of the book (the preface, the introduction, and the first
chapter) is a fascinating look at the attitudes that drive the world's
largest online empire.
For many in tech, and almost everyone outside of tech, system
administration (or operations or whatever you want to call it) is an
afterthought, one of the more boring aspects of computer technology.
But Sloss, officially known as Google's Vice President for 24/7
Operations, turns this notion upside down, arguing that site reliability is
"the most fundamental feature of any product." After all: "A system
isn't very useful if nobody can use it."
Ground Zero
Sloss is ground zero for the SRE movement. It began when Google
hired him to run its operations, and it was he who coined the term.
"SRE is what happens when you ask a software engineer to design an
operations team," he says. "I designed and managed the group the
way I would want it to work if I worked as an SRE myself."
For Todd Underwood, now an SRE director at Google, it's only natural
that the company would hire a coder like Sloss for the job. "When
Google was in its infancy, there were so many software engineers who
had a better sense of how things broke and a better sense of how
engineering could be done well," he tells WIRED. "But not one of them
wanted to do any of that by hand."
That's a very Googly thing to say. But Adam Jacob, chief technology
officer at Chef, pretty much agrees, explaining that this is the expected
transition for an online operation that's growing to such a large size.
"It's natural to have a conversation to combine software development
and the practical pieces of operation, and to have no real divide
between the two," he says. "When you look at the problem holistically,
you get better results."

The shift is particularly interesting when you consider that dev and ops
were traditionally opposing forces. The devs wanted to build new
software and change it and get the changes out to the public as fast
as possible. But the ops folks wanted to ensure that nothing went
wrong, and the best way to do that was to keep changes to a
minimum. "These are incommensurate goals," Underwood says. The
trick is that, if you combine dev and ops, you can start to eliminate
their competing aims.
Underwood calls it a "Hegelian thesis-antithesis synthesis." He then
acknowledges that when he says this, no one really buys it. "People
just don't read Hegel anymore," he quips. But the description is spot
on. And once this synthesis was in place, Google accelerated the
process by adding all sorts of other Googly ideas to the mix.
The Error Budget
One big idea is that, in an effort to reduce the conflict between dev and
ops, the company doesn't strive for 100 percent uptime. The reality,
Sloss writes, is that you don't need an internet service to be 100
percent available. Users can't really tell the difference between 100
percent and, say, 99.999 percent (their laptop or WiFi or electricity or
ISP are down far more than 0.001 percent of the time). If you set a
reasonable uptime goal below 100 percent, leaving an error budget, you
have more room to make changes and roll out experiments.
"The use of an error budget resolves the structural conflict of
incentives between development and SRE," Sloss says. "An outage is
no longer a bad thing. It is an expected part of the process of
innovation, and an occurrence that both development and SRE teams
manage rather than fear."
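
The arithmetic behind an error budget can be sketched in a few lines of Python. This is only an illustration, not Google's actual tooling: the function names, the one-year window, and the rule that releases pause once the budget is spent are all assumptions made for the example. The two targets are the figures quoted in this article.

```python
# Hypothetical sketch of error-budget arithmetic. The names and the
# release-gating rule below are illustrative assumptions, not Google's tools.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_target, period_minutes=MINUTES_PER_YEAR):
    """Downtime the target permits over the period: the error budget."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return (1.0 - availability_target) * period_minutes


def budget_remaining(availability_target, downtime_so_far_minutes,
                     period_minutes=MINUTES_PER_YEAR):
    """Minutes of budget left after the downtime already incurred."""
    budget = allowed_downtime_minutes(availability_target, period_minutes)
    return budget - downtime_so_far_minutes


def can_roll_out(availability_target, downtime_so_far_minutes):
    """Assumed policy: risky changes proceed only while budget remains."""
    return budget_remaining(availability_target, downtime_so_far_minutes) > 0


if __name__ == "__main__":
    for target in (0.99999, 0.9997):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.3%} uptime allows about {minutes:.1f} minutes down per year")
```

Under these assumptions, a 99.999 percent target leaves a little over five minutes of downtime a year, while the 99.97 percent figure Google reported for 2015 corresponds to roughly two and a half hours. The looser the target, the more room a team has to ship changes before the budget runs out.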
At the same time, the company put rules in place to ensure that SREs
didn't end up morphing into good old-fashioned sysadmins. Basically, it
decreed that no SRE could spend more than 50 percent of his or her
time on traditional operations as opposed to coding. If ops starts to
take precedence over dev on a particular SRE team, Google shifts
some of the ops load onto the team that typically just builds the
software: the regular Google software engineers. "Consciously
maintaining this balance between ops and development work allows us
to ensure that SREs have the bandwidth to engage in creative,
autonomous engineering," Sloss writes, "while still retaining the
wisdom gleaned from the operations side of running a service."
Chef's Jacob says that the ratio here (50 percent) isn't that important.
But he likes the attitude. "This is just economics," he says. "There's
always demand for people to do operational bullshit. There is an
almost infinite amount of bullshit that people will ask an operational
person to do. So the idea that you would put a cap on that is legit."
Google even created strict guidelines for hiring its SREs. It hires about
50 to 60 percent through exactly the same process that applies to all
other Google engineers, and the rest have about 85 to 99 percent of
the same skills, plus a set of technical skills that is useful to SRE but
is rare for most software engineers, such as an intimate knowledge of
the inside of the UNIX operating system or hardware networking
protocols. This too aims to ensure that dev and ops maintain the
proper balance.
The Moonshot That Keeps Google Online
In many ways, this was a new philosophy. But in their book, as they
seek to describe the philosophy, the Google team uses a much older
example. The spiritual forebear of the Google SREs is Margaret
Hamilton, the MIT programmer who spent the '60s building software for
Apollo spacecraft that would one day land on the moon. As explained
by Hamilton herself, who was interviewed for the book, part of the
culture on the Apollo program was to learn from everyone and
everything, including from that which one would least expect.
Hamilton was a coder. But she played an important role in operations.
To show this, the book recounts the day Hamilton's young daughter,
Lauren, who she often brought to the computer lab, happened to hit a
button and feed an Apollo pre-launch program into a computer that
was running a post-launch scenario.
This crashed the scenario, and Hamilton tried to add new error-checking
code to the system that would automatically prevent this
during a real flight. Her superiors rejected the idea, arguing that
astronauts would never do such a thing, but on Apollo 8, the astronauts
did such a thing. Luckily, Hamilton had added a workaround to the
system documentation. And for subsequent missions, she added the
error checking code.
"If you come along and say 'That's going to break,' it's really not that
useful. But if you say: 'That's going to break, and let me tell you how,'
you've done something amazing," Underwood explains. "Here's a
person who saw that something was going to break and saw how it
was going to break and devised a way to prevent it from breaking."
That's DevOps, or, in Google parlance, Site Reliability Engineering. As
three words, it doesn't sound like much. But it's an enormously
powerful idea. It has already produced Google. But particularly
philosophical SREs like Underwood have even bigger ambitions. They
envision a world where operations shift even further towards code. "We
long for the day," Underwood says, "when nobody runs anything."