Professional Documents
Culture Documents
Designing and Operating Highly Available Software Systems
Designing and Operating Highly Available Software Systems
What is
“Site Reliability Engineering”?
Proprietary + Confidential
Proprietary + Confidential
Site Reliability
Engineering (SRE)
SRE: Proactive
Design
Planning
Automation
Disaster Games
Proprietary + Confidential
SRE: Reactive
Monitoring
Root-cause Analysis
Performance Tuning
Proprietary + Confidential
Proprietary + Confidential
Proprietary + Confidential
A “typical” day.
An “interesting” day.
Web server
Database
Proprietary + Confidential
What needs
to change?
Proprietary + Confidential
100M
users
Proprietary + Confidential
10B
requests/day
Proprietary + Confidential
100k
requests/second
(average)
Proprietary + Confidential
200k requests/second
(peak)
Proprietary + Confidential
200k requests/second
(peak)
What do we need
to serve them?
Lots of servers.
Anatomy of a
large-scale web
application
Proprietary + Confidential
Proprietary + Confidential
Disclaimer
Users
Back to our
simple service Web server
Database
Proprietary + Confidential
Users
Replication
Web server
Web server
Database
Database
Proprietary + Confidential
Users
Meshed Topology
Web server
Not all connections shown.
Web server
Database
Database
Proprietary + Confidential
Users
Simplified
Web servers
One box for all replicas of
each server type.
Databases
Proprietary + Confidential
Hierarchy of
servers talking Server Server Server
to each other.
Server Server
Server Server
Proprietary + Confidential
Language
Disambiguation
Detection
Proprietary + Confidential
Cyclic?!? Server
Occasionally,
servers talk to Server Server Server
themselves.
Server Server
Server Server
Proprietary + Confidential
DNS Server is IP
DNS Server
Geo Location
Users
aware.
Monitoring
Engineers
DC-x
Alerting
Collection
Visualisation
Storage Retrieval
Monitoring
Proprietary + Confidential
“Failure is always
an option.”
Things that fail, from low to high impact:
● machines
● switches
● power distribution units
● routers
● fiber lines
● power substations
● imperfect software
● human error
● adversarial action / malicious behavior
Actions:
Failure Domain:
(Network) Switch
Symptom: Dozens of machines connected to
one switch go offline at the same time.
Action:
● Get traffic away from affected machines
● and get someone to replace the switch.
Failure domain:
Datacenter
Some failure modes can take out entire
datacenters:
● hurricanes
● flooding
● earthquakes
● …
Failure Domain:
Imperfect Software
This can really ruin your day
Action:
● when in doubt, roll back
● your roll-out strategy should make this
easy and fast
Now we can
DNS Server
make good use
Users
of DNS IP Geo
Location.
EMEA
APAC
US-
East
US-
West
Datacenter Location Diversity Proprietary + Confidential
DNS Server
Users
US- APAC
East EMEA
US-
West
Proprietary + Confidential
Server Server
Proprietary + Confidential
So we're provisioning
Load Balancer
for 200% of expected
load.
Load Balancer
Operational
Considerations
Proprietary + Confidential
Proprietary + Confidential
Change Management
Disaster Recovery
We talked about failures, let's talk about
disasters! Examples:
● One of your major datacenters burns down
● Software bug silently corrupts data over
months
Resolutions:
● ‘oops, my bad’ - probably not going to cut it