Professional Documents
Culture Documents
Nightmares
The coloring book
Reliability
Nightmares
The coloring book
Gino passed away and left the restaurant to his son, Leo, who quit his job to run it.
Gino had a natural instinct for running a restaurant. Leo does not.
2
Wrong food.
Cold food.
4
Ugh...Mia!
What do you say?
Want to give it a go?
Yes!
Your sister Mia called us! ...Sure, let’s do it.
The first task is for Chef Cache to evaluate Gino’s dining experience—
sample the menu and check out the service.
5
Our specials of the day are J2EE
spaghetti with artisanal memory leaks,
bash’ed shellfish on bare metal, and
home-cooked biz-critical ancient
LAMP stack pancakes.
6
Sssssorry!! You saved me!
Here you are!
Help!
HELP!
Five Challenges
8
To accomplish this, we’ll define service-level
objectives for Gino’s. SLOs are a set of goals
you’d like to achieve that represent what your
customers expect. We’ll collect data to measure
whether we meet our SLOs or not.
9
Yep! Let’s set the SLO to 99% of
customers will receive their food within
20 minutes of placing their order. How will
we know we are meeting this SLO?
I have this new ticket Great, let’s set it up. Chef, you scan
scanning system... when you get a ticket, scan when the
food is ready, and waitstaff—scan when
you pick the meals up to serve.
10
How about supplies? So we don’t
have to tell customers we’re out of their
favorite dishes quite so much?
11
#2. Safety and self-healing
12
What about when it’s less clear there’s
a problem? I had to bring 15 orders back
last week because the customers disagreed
about the steak doneness.
13
So measure it! Measure before And if you hit two steak returns in
you send them out. Time them and one day, you can self-heal by using
use a thermometer—don’t just go by thermometers on every steak afterwards
your gut—until you meet the SLO. to make sure no more get returned and
so you can recalibrate yourself.
Hmm...maybe something
is wrong with the stove?
14
#3. Scalability
17
Dashboard
What is it today?
18
Shifting left is moving towards
preventing problems as far in advance
(leftward) as possible. The further in advance
you see and address issues, the better
experience your customers will have!
Exactly.
Dining here like how you did?
19
That’s a great idea.
I didn’t think of that!
20
Of course, of course! You need
breaks to do your best work—
we want to prevent burnout and
help you do your best work!
22
Great ideas—except for
the thought of Chef leaving us.
Let’s work on adjusting our
staffing schedules accordingly!
The dashboard pointed out when three reception tables were in danger of
not meeting a 20 min food wait SLO—Leo was able to step in and help
the waitstaff get the food to the guests on time.
Out of 200 guests, four sent their steaks back—and only one was outside
of the error budget for doneness. Chef’s steak SLO was met for the event!
24
#3. Scalability
When the dance floor opened up, the bar was overwhelmed with drink orders.
Leo took waitstaff off of dessert service (everyone was dancing anyway) to help
auto-scale the bar. Once the bar quieted down, those waitstaff continued serving dessert.
Leo and Chef did a dry-run of the reception menu and improved the wine pairing
and dessert options based on the experience—both were a big hit at the event!
25
#5. Zero downtime
The reception ran six hours—well past midnight. From monthly staff meetings
and regular communication, Leo knew Annie had to leave at 11. Leo started taking
over her guests and by 11 she was all set to go home.
Mia, Leo’s sister, was thrilled, and the entire family had a blast!
Gino’s, the family restaurant, was saved, and became a much greater success.
Leo’s dad would have been proud!
26
Observability
1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
Never tried to get Regularly enable new metrics for SRE team working closely
metrics out of it with service development and product teams
Scalability
1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
I have no idea of the capacity limits of My codebase’s default deployment includes capacity-
my codebase in production. limit based auto-scaling and load shedding.
Shifting left
1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
I’ve never run my codebase I work in a DevOps team that is regularly involved in production
in Production before. deployment and operations, and am directly responsible for its reliability.
0-10................. beginner
11-20................. advanced
21+................. expert
Stable, Secure,
Performant, and Boring
red.ht/sre-coloring-book
Sponsored by Red Hat
30