Reliability Nightmares Coloring e Book Red Hat Developer

Reliability
Nightmares
The coloring book
Reliability
Nightmares
The coloring book
Script and storyboard by Máirín Duffy

Illustration by Wildfire
Concept and technical information by Jeremy Eder, Irit Goihman,
David Martin, and Craig Robinson
Gino’s was a successful family-owned restaurant,
renowned for good food and good times.
Gino passed away and left the restaurant to his son, Leo, who quit his job to run it.
Gino had a natural instinct for running a restaurant. Leo does not.
2
Wrong food.
Cold food.
The waitstaff are either

bombarded or bored.
Gross food.
Loyal customers were

Food spoils due to a freezer failure,
disappointed and started leaving.
and a poorly maintained oven
Word was getting around that Gino’s
causes a kitchen fire.
was “not what it used to be.”
3
I can’t do this.
45 years in operation!
Will this be the last?
Who’s that at the door?
Hi! I’m celebrity chef Cookie Cache!

You’re hosting your sister’s wedding
reception! Let’s get this place
whipped into shape!!
4
Ugh...Mia!
What do you say?
Want to give it a go?
Yes!
Your sister Mia called us! ...Sure, let’s do it.
The first task is for Chef Cache to evaluate Gino’s dining experience—
sample the menu and check out the service.
5
Our specials of the day are J2EE
spaghetti with artisanal memory leaks,
bash’ed shellfish on bare metal, and
home-cooked biz-critical ancient
LAMP stack pancakes.
I’ll try the LAMP stack, please!
[90 minutes pass]

... Hey, uh, Leo?
Apologies for the

wait! I’ll check!
6
Sssssorry!! You saved me!
Here you are!
Help!
HELP!
Go home, get a good

night’s sleep. We’ll start
again in the morning.
Five Challenges
There is a better way. I will teach you

how to run Gino’s as a managed service.
You’ll have five challenges, one per day.
We’ll close Gino’s for the week and get
you ready for the wedding! 7
#1. Observability • Break down silos
• Consensus
• Enforce accountability
• Prioritize work
• Keep customers happy
The first challenge is to set up

observability for the restaurant.
This is foundational for
improving Gino’s service.
1. We need to break down the silos between each

role with regular open communication.
2. We must be in consensus about the level of
service we provide.
3. We enforce accountability by measuring
causes of issues and working to address them.
4. We need to prioritize our work...
5. ...so we can keep our customers happy!
You need to know your equipment

works, your food supplies are fresh
How do we do it?
and stocked, and that customers are
served in a timely manner. You need
data to achieve all of this.
8
To accomplish this, we’ll define service-level
objectives for Gino’s. SLOs are a set of goals
you’d like to achieve that represent what your
customers expect. We’ll collect data to measure
whether we meet our SLOs or not.
We’ve had a lot of dishes sent

back lately because they weren’t
served right away and got cold.
Can we set an SLO for that?
9
Yep! Let’s set the SLO to 99% of
customers will receive their food within
20 minutes of placing their order. How will
we know we are meeting this SLO?
I have this new ticket Great, let’s set it up. Chef, you scan
scanning system... when you get a ticket, scan when the
food is ready, and waitstaff—scan when
you pick the meals up to serve.
10
How about supplies? So we don’t
have to tell customers we’re out of their
favorite dishes quite so much?
Brilliant! Our SLO can be no more

The scanners can do
than two dishes omitted per week due to
inventory tracking too!
supply. How do we measure this?
Great, let’s set the scanners up to track

Dashboard inventory. Once everything is set up, let’s
schedule monthly SLO meetings. We can talk
about how we’re doing at meeting our SLOs,
and tweak, retire, or create new ones.
11
#2. Safety and self-healing
Great work on instrumenting for

observability! Today’s challenge is safety
and self-healing. We’ll know when things
go wrong with the observability we set up
yesterday. For issues that can be addressed
through automation, we can self-heal!
So instead of simply tracking Yes! If Chef spends less time with

inventory—we automatically order ordering forms, that’s more time to spend
more when low? on customer experience. You can
self-heal inventory issues.
12
What about when it’s less clear there’s
a problem? I had to bring 15 orders back
last week because the customers disagreed
about the steak doneness.
We can’t be perfect on subjective things.

Let’s create an SLO—1/100 customers will
send their steaks back. We can set an error
budget for doneness. For a medium steak
what would your range be?
Medium rare to medium-well done.

Not well done or rare.
That means that of 100 steaks ordered,

only one will be out of range. If we get
more than one out of range, it is time to
do something about it ASAP. Let’s call
that an ‘incident’.
13
So measure it! Measure before And if you hit two steak returns in
you send them out. Time them and one day, you can self-heal by using
use a thermometer—don’t just go by thermometers on every steak afterwards
your gut—until you meet the SLO. to make sure no more get returned and
so you can recalibrate yourself.
Hmm...maybe something
is wrong with the stove?
#2. Safety and self-
So let’s list out all of these

kinds of potential issues and set error
budgets for them. Then we can track them
each day on our dashboard!
14
#3. Scalability
What’s our Today we’re looking at scalability.

challenge today? Both the observability and self-healing we
set up over the past two days of challenges
will enable today’s challenge.
Will scalability help us handle the

dining room better when we have a
huge batch of take-out orders? The
guests get so mad at me—sometimes
they wait two hours for food!
Takeout pick-up
Yes, it can help! Scalability

refers to Gino’s ability to scale up
for the dinner rush, and scale down
as demand comes back down.
15
We had an online coupon for 15% off that
went viral last week. We got flooded with takeout
orders—600 orders between 5–7 PM. We couldn’t
figure out how to shut off online orders...
So we set up an alert when

Let’s address this. First, let’s add we have 10+ takeout orders.
the number of active online takeout Then what?
orders to the dashboard. How many
can you handle per hour? load shedding
+
auto-scaling
Dashboard
Active takeouts
12
Maybe 10–12? Then, you should start

load shedding and autoscaling.
16
Load shedding is when you drop some tasks
purposely to work through your queue. To load shed
here, you could add a 20 minute wait to new dine-in
guests. You can also pad out time estimates online
to buy time to work thru the queue.
Autoscaling is when you horizontally scale out

your capacity on demand. Say you have a large party
reservation coming when takeout orders are high—
you could call in extra staff to meet the demand if
load shedding can’t free up enough resources.
17
Dashboard
Large Active takeouts

Reservations
12
Let’s make sure that the upcoming daily

reservations—especially for large parties—
are on the dashboard as well!
#4. Shifting left

Ready for today’s challenge!
What is it today?
We’re going to shift left today.
What does that mean?
18
Shifting left is moving towards
preventing problems as far in advance
(leftward) as possible. The further in advance
you see and address issues, the better
experience your customers will have!
You could regularly dine here as a

customer yourself and experience it first-
hand, note any improvements to be made,
and plan out SLOs to meet them.
Exactly.
Dining here like how you did?
For seasonal menu testing, you could

hold a ‘test kitchen’ event and have diners
come in to test out your new menu ahead of time.
That gives you time to address any issues and
improve the menu before launch based on their
feedback. You’ll shift your menu left!
19
That’s a great idea.
I didn’t think of that!
So let’s prep a service for Leo

and Chef—I’ll be the stand-in chef,
and I’ll work off your menu.
Let’s see how it goes!
#5. Zero downtime

Ready for today’s challenge?
It’s...zero downtime upgrades!
Zero downtime? We need

our 15 minute breaks!
20
Of course, of course! You need
breaks to do your best work—
we want to prevent burnout and
help you do your best work!
Zero downtime upgrades are more about

seamless transitions during needed changes.
For example, waitstaff shift changes—
instead of scheduling Annie to leave at
the beginning of Thomas’s shift...
...schedule their shifts to

overlap, and slowly transfer guests
from the departing waitstaff
to the incoming waitstaff.
Once all guests are

transferred, Annie can
go home, confident she
won’t be called back.
21
That would be great! We’re always
so rushed between shifts, sometimes we
neglect a customer waiting on a drink they
ordered before our shift, for example!
Now say your chef gives you notice to

open her own restaurant. A handover
between Chefs will ensure a seamless
experience for customers.
22
Great ideas—except for
the thought of Chef leaving us.
Let’s work on adjusting our
staffing schedules accordingly!
Chef Cookie Cache left Gino’s

at the end of day five.
In the months leading up to his

sister’s wedding reception, Leo put ...and the wedding reception at Gino’s
into practice the lessons from the went off without a hitch. Each lesson
five managed service challenges... helped avert issues.
23
#1. Observability
The dashboard pointed out when three reception tables were in danger of
not meeting a 20 min food wait SLO—Leo was able to step in and help
the waitstaff get the food to the guests on time.
#2. Safety and self-healing
Out of 200 guests, four sent their steaks back—and only one was outside
of the error budget for doneness. Chef’s steak SLO was met for the event!
24
#3. Scalability
When the dance floor opened up, the bar was overwhelmed with drink orders.
Leo took waitstaff off of dessert service (everyone was dancing anyway) to help
auto-scale the bar. Once the bar quieted down, those waitstaff continued serving dessert.
#4. Shifting left
Leo and Chef did a dry-run of the reception menu and improved the wine pairing
and dessert options based on the experience—both were a big hit at the event!
25
#5. Zero downtime
The reception ran six hours—well past midnight. From monthly staff meetings
and regular communication, Leo knew Annie had to leave at 11. Leo started taking
over her guests and by 11 she was all set to go home.
Mia, Leo’s sister, was thrilled, and the entire family had a blast!
Gino’s, the family restaurant, was saved, and became a much greater success.
Leo’s dad would have been proud!
26
Observability
1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
Never tried to get Regularly enable new metrics for SRE team working closely
metrics out of it with service development and product teams
Safety and self-healing

1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
My codebase OOM kills My codebase checks for common
routinely, oops. preventable failures and avoids them.
Scalability
1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
I have no idea of the capacity limits of My codebase’s default deployment includes capacity-
my codebase in production. limit based auto-scaling and load shedding.
Shifting left
1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
I’ve never run my codebase I work in a DevOps team that is regularly involved in production
in Production before. deployment and operations, and am directly responsible for its reliability.
Zero downtime upgrades

1. . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . 3.. . . . . . . . . . . . . . . . . . . . . . 4.. . . . . . . . . . . . . . . . . . . . . . 5
Upgrading my codebase requires a My codebase can gracefully handoff
weeks-long consulting engagement. connections to a new upgraded version
of my application and take those
clients back in case of failure.
Check your score

on the next page!
So what’s the operability

of your codebase? For each
of the five challenges...
How many points did you get?
Add them up, let’s see
where you are!
0-10................. beginner
11-20................. advanced
21+................. expert
Stable, Secure,
Performant, and Boring
red.ht/sre-coloring-book
Sponsored by Red Hat
30

Reliability Nightmares Coloring e Book Red Hat Developer

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Reliability Nightmares Coloring e Book Red Hat Developer

Uploaded by

Copyright:

Available Formats

Reliability

Script and storyboard by Máirín Duffy

The waitstaff are either

Loyal customers were

Who’s that at the door?

Hi! I’m celebrity chef Cookie Cache!

I’ll try the LAMP stack, please!

[90 minutes pass]

Apologies for the

Go home, get a good

There is a better way. I will teach you

The first challenge is to set up

1. We need to break down the silos between each

You need to know your equipment

We’ve had a lot of dishes sent

Brilliant! Our SLO can be no more

Great, let’s set the scanners up to track

Great work on instrumenting for

So instead of simply tracking Yes! If Chef spends less time with

We can’t be perfect on subjective things.

Medium rare to medium-well done.

That means that of 100 steaks ordered,

#2. Safety and self-

So let’s list out all of these

What’s our Today we’re looking at scalability.

Will scalability help us handle the

Yes, it can help! Scalability

So we set up an alert when

Maybe 10–12? Then, you should start

Autoscaling is when you horizontally scale out

Large Active takeouts

Let’s make sure that the upcoming daily

#4. Shifting left

We’re going to shift left today.

What does that mean?

You could regularly dine here as a

For seasonal menu testing, you could

So let’s prep a service for Leo

#5. Zero downtime

Zero downtime? We need

Zero downtime upgrades are more about

...schedule their shifts to

Once all guests are

Now say your chef gives you notice to

Chef Cookie Cache left Gino’s

In the months leading up to his

#2. Safety and self-healing

#4. Shifting left

Safety and self-healing

Zero downtime upgrades

Check your score

So what’s the operability

You might also like