Atlassian Incident Management Handbook
Foreword
Or: How I learned to stop worrying and love on-call
In early 2014, driven by the broad industry shift to SaaS, the Atlassian
Cloud suite was evolving from a cobbled-together set of our
downloadable software into a true “cloud native” system. This was
a big change that required careful planning and execution across
most of the company. For eight months, a team had been working to
make user and group management easier for our cloud customers by
unifying them across all products. So, instead of having to manage
users and groups separately in each product, there would be only one
set per customer and one interface to manage them with.
We immediately recognized this as a very serious problem and
jumped on it right away. Although it affected a relatively small
number of customers, given the impact we treated this problem
as our highest priority. We released a fix in short order...at least we
thought it was a fix (and I bet you can guess what happened next).
It turned out that the “fix” actually made the problem worse: We
had inadvertently broadened the impact by an order of magnitude.
Now, instead of a few users or a few customers being able to see
content they shouldn’t have had access to, we had thousands of
people across hundreds of customers in this situation. It was a very
hairy knot of access control misconfiguration, and some of our best
customers were understandably furious.
detectors rather than stop making cars and buildings. The ability
to make more changes with the same or less risk is a competitive
advantage. We can adopt practices like canaries and feature flags
that reduce the risk per change, but some risk always remains—it’s
never completely eliminated. “Risk tolerance” is the level of risk that’s
appropriate for your organization and its activities, and it varies
depending on the nature of the business and services. The more
changes an organization can make within its “risk tolerance”, the
more competitive that organization can be. This is expressed well in
the Google SRE book’s Chapter 3, “Embracing Risk.”
In a way, yes. You need to accept that in the presence of risk, incidents
are bound to happen sooner or later, and having an incident doesn’t
mean that you have permanently failed. An incident should properly
be regarded as the beginning of a process of improvement. Your goal
should be to do that in the best possible way. Once you understand
that, you can focus on two important things that are entirely in your
control: How well your organization responds to incidents, and how
well you learn from them.
I’ve heard incidents described as “unscheduled investments in
the reliability of your services.” I think this is accurate. Given that
incidents are going to happen whether we want them or not, your
job as the owner of an incident and postmortem process is to make
this “investment” as cheap as possible (in terms of reducing incident
impact and duration), and to extract every last drop of value from
each incident that you have (in the form of learnings and mitigations).
Like so much else in the world it boils down to a matter of minimizing
the cost and maximizing the benefit.
Incident overview
Recommended solution:
Many of our customers use Opsgenie at the center of their incident response
process. It helps them rapidly respond to, resolve, and learn from incidents.
Opsgenie streamlines collaboration by automatically posting information to
chat tools like Slack and Microsoft Teams, and creating a virtual war room
complete with a native video conference bridge. The deep integrations with
Jira Software and Jira Service Desk provide visibility into all post-incident tasks.
Each incident is driven by the incident manager (IM), who has overall
responsibility and authority for the incident. This person is indicated
by the assignee of the incident issue. During a major incident, the
incident manager is empowered to take any action necessary to
resolve the incident, which includes paging additional responders
in the organization and keeping those involved in an incident focused
on restoring service as quickly as possible.
Incident response
The following sections describe Atlassian’s process for
responding to incidents. The incident manager (IM) takes
the team through this series of steps to drive the incident
from detection to resolution.
Detect
Faulty service (single-select): The service that has the fault that’s causing the incident. Take your best guess if unsure. Select “Unknown” if you have no idea.
Once the incident is created, its issue key (eg “HOT-12345”) is used in
all internal communications about the incident.
Customers will often open support cases about an incident that affects
them. Once our customer support teams determine that these cases
all relate to an incident, they label those cases with the incident’s
issue key in order to track the customer impact and to more easily
follow up with affected customers when the incident is resolved.
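As a rough sketch of how that labeling can be put to use (the search endpoint is the standard Jira Cloud REST API, but the site URL, credentials, and label convention below are placeholders rather than part of our internal tooling), a support engineer or a small script could list the cases tagged with an incident’s issue key:

```python
import requests

JIRA_BASE = "https://your-domain.atlassian.net"   # placeholder Jira site
AUTH = ("support-bot@example.com", "api-token")   # placeholder credentials

def cases_for_incident(incident_key: str):
    """Return support cases labeled with the incident's issue key."""
    jql = f'labels = "{incident_key}" ORDER BY created ASC'
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/2/search",
        params={"jql": jql, "fields": "summary,status"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()["issues"]

# List the cases affected by a (hypothetical) incident
for case in cases_for_incident("HOT-12345"):
    print(case["key"], case["fields"]["summary"])
```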
Another option:
In Opsgenie, you can use Incident Templates for specific teams or services
to set up the communication flow to incident commanders, responders,
and stakeholders.
The page alert includes the incident’s issue key, severity, and
summary, which tells the incident manager where to go to start
managing the incident (a link to the Jira issue), what’s generally
wrong, and how severe it is.
Open communications
The first thing the IM does when they come online is assign the
incident issue to themselves and progress the issue to the “Fixing”
state to indicate that the incident is being actively managed. The Jira
issue’s assignee field always indicates who the current IM is. In an
emergency response, it’s very important to be clear who’s in charge,
so we’re pretty strict about making sure this field is accurate.
We’ve found that using both video chat and a chat room synchronously
works best during most incidents, because they are optimized for
different things. Video chat excels at quickly building a shared mental
picture of the incident through group discussion, while the chat room
keeps a timestamped record of observations, links, and decisions.
Assess
After the incident team has their communication channels set up,
the next step is to assess the incident’s severity so the team can
decide what level of response is appropriate.
Severity 3: A minor incident with low impact, for example a minor
inconvenience to customers with a workaround available, or usable
performance degradation.
Note that we always include the incident’s Jira issue key on all
internal communications about the incident, so everyone knows how
to find out more or join the incident (remember, the chat room is
automatically named for the Jira issue key).
The team focused on the entire incident from start to finish: from the
point that engineering received the first monitoring alert to our last
customer update.
Once the team had assessed the incident, we put an action plan in place
that included people, process, and technology actions. We assigned
each action an “owner” along with a recommendation on next steps. A
few weeks later we regrouped, and the owners provided updates on their
actions, which have since improved our incident communications.
But this isn’t a one-off exercise. It’s something that we will continue
to repeat every few months to help refine and improve our incident
communication process. A good incident communications plan is a
journey, not a destination. It’s something you can constantly improve
and iterate on. If you’re ready to kick off that journey, this exercise is a
great start.
Nick Coates is a Product Owner at Symantec. To find this and other Team
Playbook plays, visit Atlassian.com/team-playbook.
Your first responders might be all the people you need in order to
resolve the incident, but more often than not, you need to bring other
teams into the incident by paging them. We call this escalation.
The key system in this step is a page rostering and alerting tool
like Opsgenie. Opsgenie and similar products allow you to define
on-call rosters so that any given team has a rotation of staff who
are expected to be contactable to respond in an emergency. This is
superior to needing a specific individual all the time (“get Bob again”)
because individuals won’t always be available (they tend to go on
vacation from time to time, change jobs, or burn out when you call
them too much). It is also superior to “best efforts” on-call because
it’s clear which individuals are responsible for responding.
We always include the incident’s Jira issue key on a page alert about
the incident. This is the key that the person receiving the alert uses to
join the incident’s chat room.
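As a rough illustration (this uses the public Opsgenie Alerts API; the API key, team name, and severity-to-priority mapping are placeholder assumptions, not our internal configuration), a page carrying the issue key and severity might be created like this:

```python
import requests

OPSGENIE_API_KEY = "YOUR-GENIE-KEY"  # placeholder API integration key

def page_team(team: str, issue_key: str, severity: str, summary: str):
    """Create an Opsgenie alert that pages a team's on-call responder."""
    resp = requests.post(
        "https://api.opsgenie.com/v2/alerts",
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
        json={
            # Lead with the issue key so the responder knows which
            # incident chat room to join.
            "message": f"{issue_key} [{severity}] {summary}",
            "alias": issue_key,  # de-duplicates repeated pages for the same incident
            "responders": [{"type": "team", "name": team}],
            "priority": "P1" if severity == "critical" else "P3",
        },
    )
    resp.raise_for_status()

# Example: escalate to a hypothetical database team
page_team("db-platform", "HOT-12345", "critical",
          "Connection pool exhaustion in user service")
```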
Delegate
The IM can also devise and delegate roles as required by the incident,
for example, multiple tech leads if more than one stream of work
is underway, or separate internal and external communications
managers.
We use the chat room’s topic to show who is currently in which role,
and this is kept up-to-date if roles change during an incident.
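For teams using Slack as the chat tool, a small helper along these lines could keep the topic current as roles change (the bot token, channel ID, and role names below are placeholders, not part of our tooling):

```python
import requests

SLACK_TOKEN = "xoxb-your-token"  # placeholder bot token with channels:manage scope

def set_incident_roles(channel_id: str, roles: dict):
    """Write the current role assignments into the incident channel's topic."""
    topic = " | ".join(f"{role}: {person}" for role, person in roles.items())
    resp = requests.post(
        "https://slack.com/api/conversations.setTopic",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        json={"channel": channel_id, "topic": topic},
    )
    resp.raise_for_status()
    data = resp.json()
    if not data.get("ok"):  # Slack reports API errors in the body, not the status code
        raise RuntimeError(f"Slack API error: {data.get('error')}")

set_incident_roles("C0INCIDENT1", {"IM": "@alice", "Tech Lead": "@bob", "Comms": "@carol"})
```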
You’ve already sent out initial communications. Once the incident team
is up and running, you need to update staff and customers on the
incident and keep them in the loop as it progresses.
Iterate
The biggest challenges for the IM at this point are around maintaining
the team’s discipline:
In any case, don’t panic—it doesn’t help. Stay calm and the rest of the
team will take that cue.
The IM has to keep an eye on team fatigue and plan team handovers.
A dedicated team risks burning itself out while resolving a complex
incident, so IMs should track how long members have been awake and
how long they’ve been working on the incident, and decide who will
fill their roles next.
Resolve
QUESTION:
What tips would you suggest for writing a good postmortem report?
Structure of postmortems is something many companies seem
to struggle with e.g., what to include, what really would make a
difference?
JOHN ALLSPAW:
What will the reader want to know about the context of the incident?
We could imagine that indicating when an outage happened (Cyber
Monday? Same day as your IPO? When your CEO is on stage giving
an interview?) might be important in some cases and not really
important in others, for example.
Incident postmortems
The blameless postmortem is the alchemy that turns
bitter incident tears into sweet resilience sauce. We use
it to capture learnings and decide on mitigations for the
incident. Here’s a summarized version of our internal
documentation describing how we run postmortems
at Atlassian.
What is a postmortem?
Why do we do postmortems?
Usually the team that delivers the service that caused the incident
is responsible for completing the associated postmortem. They
nominate one person to be accountable for completing the
postmortem, and the issue is assigned to them. They are the
“postmortem owner” and they drive the postmortem through drafting
and approval, all the way until it’s published.
·· When people perceive a risk to their standing in the eyes of their
peers or to their career prospects, that risk usually outranks “my
employer’s corporate best interests” in their personal hierarchy,
so they will naturally dissemble or hide the truth in order to
protect their basic needs;
·· Blaming individuals is unkind and, if repeated often enough,
will create a culture of fear and distrust; and
·· Even if a person took an action that directly led to an incident,
what we should ask is not “why did individual X do this”, but
“why did the system allow them to do this, or lead them to
believe this was the right thing to do.”
3. Use the “Five Whys” technique to traverse the causal chain and
discover the underlying causes of the incident. Prepare a theory
of what actually happened vs. a more ideal sequence of events,
and list proposed mitigations for the causes you identify.
5. Meet with the team and run through the agenda (see
“Postmortem meetings” below).
Another option:
If you’re tracking and collaborating on incidents within Opsgenie, you can
automatically generate a postmortem report using a predefined template.
All critical actions that occurred during the incident are documented in a
timeline, and related Jira tickets are included for reference and remediation.
Detection
Prompt: How and when did Atlassian detect the incident? How could time to detection be improved? As a thought exercise, how would you have cut the time in half?
Example: The incident was detected when the <type of alert> was triggered and <team or person paged> were paged. They then had to page <secondary response person or team> because they didn’t own the service writing to the disk, delaying the response by <length of time>. <Description of the improvement> will be set up by <team owning the improvement> so that <impact of improvement>.

Five whys
Prompt: Use the “Five Whys” root cause identification technique. Start with the impact and ask why it happened and why it had the impact it did. Continue asking why until you arrive at the root cause. Document your “whys” as a list here or in a diagram attached to the issue.
Example:
1. The service went down because the database was locked.
2. Because there were too many database writes.
3. Because a change was made to the service and the increase was not expected.
4. Because we don’t have a development process set up for when we should load test changes.
5. We’ve never done load testing and are hitting new levels of scale.

Root cause
Prompt: What was the root cause? This is the thing that needs to change in order to stop this class of incident from recurring.
Example: A bug in <cause of bug or service where it occurred> connection pool handling led to leaked connections under failure conditions, combined with lack of visibility into connection state.
With the advent of SaaS and IaaS, companies are more dependent
than ever before on their suppliers. Companies like AWS, Google,
Microsoft, and Salesforce provide much of the infrastructure that
powers modern services. When they fail and you are impacted,
how can you get value from the postmortem?
Expectation: None. We don’t really have an expectation!
What to do: The person responsible for the use of the 3rd party dependency needs to establish an SLE and share it with teams so they know what level of resilience they need to build into their dependent services.
The right wording for a postmortem action can make the difference
between an easy completion and indefinite delay due to infeasibility
or procrastination. A well-crafted postmortem action should have
these properties:
From: Investigate monitoring for this scenario.
To (actionable): Add alerting for all cases where this service returns >1% errors.

From: Fix the issue that caused the outage.
To (specific): Handle invalid postal code in user address form input safely.
Postmortem meetings
3. Present your theory of the incident’s causes, and where you think
mitigations could best be applied to prevent this class of incident
in the future. Discuss with the room and try to build a shared
understanding of what happened, why, and what we generally
need to do about it.
5. Ask the team “What went well / What could have gone better
/ Where did we get lucky?” These questions help catch near-
misses, learnings, mitigations and actions that are outside
the direct blast radius of the incident. Remember, our goal is to
extract as much value as possible given that we’ve already paid
the price of having an incident.
In this meeting we’ll review the incident, determine its significant causes and
decide on actions to mitigate them.
For this reason, it’s usually more effective in practice to follow up with
the responsible managers after the meeting to get firmer, time-bound
commitments to the actions that you identified in the meeting, and
then note that on the postmortem issue.
We use Jira issue links to track actions because our teams use many
different instances of Jira Cloud and Server, but we still have to track
and report on this data. To do this we built some custom tooling that
spiders remote issue links and populates a database with the results.
This allows us to track state over time and create detailed reports.
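As a rough sketch of what such tooling might look like (the remote-link endpoint is the standard Jira REST API, but the site URL, credentials, issue keys, and database schema here are hypothetical), a small crawler could walk each postmortem’s remote links and record them for reporting:

```python
import sqlite3
import requests

JIRA_BASE = "https://your-domain.atlassian.net"    # placeholder Jira site
AUTH = ("reporting-bot@example.com", "api-token")  # placeholder credentials

def remote_links(issue_key: str):
    """Fetch the remote issue links attached to a postmortem issue."""
    resp = requests.get(f"{JIRA_BASE}/rest/api/2/issue/{issue_key}/remotelink", auth=AUTH)
    resp.raise_for_status()
    # Each remote link carries an 'object' with the linked action's URL and title.
    return [(link["object"]["url"], link["object"]["title"]) for link in resp.json()]

def record_actions(postmortem_keys, db_path="postmortem_actions.db"):
    """Spider remote links for each postmortem and store them for reporting."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS actions (postmortem TEXT, url TEXT, title TEXT)")
    for key in postmortem_keys:
        for url, title in remote_links(key):
            db.execute("INSERT INTO actions VALUES (?, ?, ?)", (key, url, title))
    db.commit()

record_actions(["PM-101", "PM-102"])  # hypothetical postmortem issue keys
```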
Assess
·· Assess severity: go/severity
–– What is the impact to customers?
–– What are they seeing?
–– How many customers are affected?
–– When did it start?
–– How many support tickets are there?
–– Any other factors, eg Twitter, data loss, security?
·· Confirm HOT “Affected Products”
·· Confirm HOT “Severity”
·· Post Initial Communications quickly (<=30m)
–– Internal: go/internalcomms
–– External: go/externalcomms
·· Security Incident? Page SecInt: go/pagesecint

Delegate
·· Explicitly delegate these roles:
–– Tech Lead(s)
–– Internal Communications
–– External Communications

Periodic Tasks
·· Update communications: GOTO Communicate
·· Has impact changed?
–– What was observed? Record with timestamps
–– Has severity changed?
·· Incident duration: who needs to be relieved?
–– Who will hand over which roles to whom, when?
–– GOTO Handover

Communicate
·· Post update communications regularly
–– Max 180m interval
·· Follow this five-point checklist for communications:
–– What is the actual impact to customers?
–– How many internal and external customers are affected?
–– If a root cause is known, what is it?
–– If there is an ETA for restoration, what is it?
–– When and where will the next update be?

Resolve
·· Confirm with CSS that service is restored
–– Customer feedback
–– Firsthand observations
–– External vendors
·· Final update to internal and external communications
–– Resolve Statuspage incidents
–– Close incident in go/multistatuspageupdater
–– Resolve the HOT; set impact and detection times
–– Create PIR and get accountability for completion

Fix
·· What is the observed vs expected behavior?
·· Any changes (eg deploys, feature flags)? go/changedashboard
·· Share observations: graphs, logs, GSAC tickets, etc.
·· Develop theories of what is happening
·· Work to prove or disprove theories
·· What are the streams of work?
·· Who is the Tech Lead for each stream?
·· What are current steps and next steps for each?
·· How do those steps move us towards resolution?
·· Track changes, decisions, and observations

Handover
·· Outgoing MIM: 1-1 call with incoming MIM
–– Who is out and in of which roles
–– What are the current theories and streams of work
–– When next handover is expected and to whom
·· Incoming MIM: Come up to speed first
–– Ask lots of questions!

Backup Systems
·· Communications: Zoom IM; Google Hangouts for video
·· Docs: “MIM Offline Material” in Google Drive
·· ISD: Google Docs
·· HOT ticket: Hello/J IMFB project
Have questions?
Contact us at incident-handbook@atlassian.com