
Incident Analysis:

How Learning Is Different Than Fixing

John Allspaw
Adaptive Capacity Labs
The Main Gist
• Current and typical approaches to “learning from incidents” have very little
to do with actual learning.

• Learning is not the same as fixing.

• Most post-incident review documents are written to be filed, not written to be read.

• Changing the primary focus from fixing to learning will result in a significant competitive advantage.
When you focus on fixing things,
they tend to get fixed quickly, with
the materials and methods at hand.

The quality and effectiveness of the fixing (and of preventing future issues) is
proportional to how well you understand how the thing works.

Attention paid to learning will always yield higher-quality fixing.

Focusing exclusively on fixing is a barrier to learning.
Conventional View on Where Post-Incident Analysis Has Value

[Diagram: timeline of the conventional process]
💥 incident happens… → (maybe) someone preps a timeline… → (maybe) a postmortem meeting, where you fill out some template… (very expensive meeting) → (maybe) someone compiles a report… → FINALLY! ACTION ITEMS

Tendency is to think the only/greatest value is here, at the action items.


On “learning”
• Learning is happening all of the time; it’s a core part of being human.

• What is learned, who learns, when they learn, and how they learn depend on how well practices are set up to support it.

• No one can understand everything about everything. We should be surprised that things work as well as they do, given this!

• Frequency of incidents has nothing to do with how well an organization learns! (See first bullet.) It may be a signal about what they’re learning.
Conventional Myth

A canonical set of “lessons” can be extracted from an incident, which is then “shared” to a group.

The perceived problem to be solved, then, is to somehow “share” better.

Reality
Different people will have varying understandings before and after an incident, and what mysteries remain for them cannot be captured or addressed in a “one size fits all” package.

What is important/notable/interesting will differ from person to person.

* We tend to say “share” when typically we mean “make available to others.”

what they actually remember!
If you can’t remember something, you can’t say you’ve learned it.

When we ask people about incidents:

• they become animated when they tell the story

• they include elements of suspense in the structure of how they tell it

• they include elements of surprise (“what we didn’t know at the time was…”)

• they set some context (“now remember, this is the day we did our IPO…”)

• they recall it in detail even if it was many years ago

Stories that you remember have elements of challenge, struggle, and difficulty.
Interesting incident analysis documents get read.

Compelling incident analysis documents get read and shared with others.

Fascinating documents get read, shared with others, commented on, asked about, and referenced in code comments,
…in pull requests,
…in architecture diagrams,
…in other incident writeups,
…in new-hire onboarding.

Uninteresting documents…don’t.

This film was made available in thousands of theaters around the world.
Make an Effort to Highlight the Messy Details

• What was difficult for people to understand during the incident?

• What was surprising for people about the incident?

• How do people understand the origins of the incident?

• What mysteries still remain for people?


I want to know what was difficult
about this, and I want to be able to
ask questions about that.
Flip from “severity” to difficulty

• “Customer impact” is not equivalent to the difficulty of solving the issue.

• Multiple difficulties can exist in the same incident.

• Fielding questions about what was (or still is) difficult is how critical understandings are spread and how lasting memories are formed.
[Figure: “this is how it all works,” repeated fifteen times]
“oh…I thought it did X nightly, not weekly…”

“wait - I thought everyone knew that Y was an issue… only I knew that?”

“I knew about N but didn’t know how it got to be that way…”

“I didn’t know M could break silently like that…”

“ok, got it - A feeds B, but C also feeds B…”
Fine Goals For A Post-Incident Meeting

…when participants in a post-incident group meeting leave the meeting knowing:
a. new things they didn’t know when they entered the meeting
b. new things they didn’t know about what their colleagues know
c. how to continue discussions and where to capture them
Fine Goals For A Post-Incident Review Writeup

…when readers of a post-incident review writeup are finished reading, they know:
a. new things they didn’t know when they started reading
b. new things they didn’t know about what their colleagues know
c. how to continue discussions and where to capture them
Some things to experiment with

• Separate the generation of “follow-up” items from a group incident review meeting

• Record in the document who responded to the incident, and who attended the group meeting

• Capture things that were done after the incident but before the group meeting in the document

• Give write-ups to brand new engineers and ask them to record any and all questions they have after reading it

• Link company-specific jargon/terms to documents that describe them

• Ask more people to draw diagrams in debriefings and include them in the writeup
Have someone who was
not involved in the event
lead the analysis.
“If the incident analyst participated in the incident, they will inevitably have a deeper understanding and bias towards the incident that will be impossible to remove in the process of analysis.”

Ryan Kitchens
https://www.learningfromincidents.io/blog/
Resist focusing on reducing the number of incidents.

Focus instead on increasing the number of people who want to read reports and attend the PIR meetings.

“The cliche idea that we would do this work to reduce the number of incidents or to lessen the time to remediate is too simplistic. Of course organizations want to have fewer incidents, however stating this as an end goal actually hurts our organizations. Indeed, it will lead to a reduction in incident count–not from actually reducing the number of incidents, but rather lessening how and how often they are reported.”

Ryan Kitchens
https://www.learningfromincidents.io/blog/
Thank You!
