[Figure panels: Lost revenue per minute of downtime; Lost staff time spent resolving incident; Staff dissatisfaction from toil; Loss of focus by developer team]
Manual: In these organizations, automation can only be run by experts. Incident resolution follows a ticketed
system and incidents are almost always escalated to a subject matter expert.
Reactive: At this stage, there is some general automation that triggers basic diagnostics in the event of an
incident. But automation is still only initiated by experts, and these organizations also rely on ticketed escalations.
Responsive: As maturity develops, delegation of self-service operations improves. Here, there is some kind of
Runbook Automation platform in use, and front-line responders and operators can run common diagnostics and
automation rather than having to escalate to subject matter experts every time.
Proactive: At this level, context-specific diagnostics and mitigation are triggered automatically when an
incident hits. Deeper automation means front-line responders can not only diagnose but also resolve
recurring issues without escalating to an expert.
Preventative: This final evolution is the “gold standard” of maturity. Here, automation seeks to resolve commonly
occurring incidents before a responder is notified. Escalations are only needed for complex or unusual
problems, and service uptime is maximized.
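The five levels above can be read as a ladder over a few yes/no capabilities. The sketch below is one illustrative way to encode that ladder in Python. The `AutomationProfile` fields and the `maturity_level` function are hypothetical names introduced for this example, not part of any Rundeck or PagerDuty API, and the order of the checks is an assumption drawn from the level descriptions.

```python
from dataclasses import dataclass


@dataclass
class AutomationProfile:
    """Hypothetical yes/no capabilities taken from the level descriptions."""
    diagnostics_auto_triggered: bool = False     # basic diagnostics fire on an incident
    responders_can_run_automation: bool = False  # front-line self-service, not experts only
    mitigation_auto_triggered: bool = False      # context-specific mitigation fires on an incident
    resolves_before_notification: bool = False   # common incidents fixed before anyone is notified


def maturity_level(profile: AutomationProfile) -> str:
    """Return the highest level whose description the profile satisfies."""
    if profile.resolves_before_notification:
        return "Preventative"
    if profile.mitigation_auto_triggered and profile.responders_can_run_automation:
        return "Proactive"
    if profile.responders_can_run_automation:
        return "Responsive"
    if profile.diagnostics_auto_triggered:
        return "Reactive"
    return "Manual"


# Example: diagnostics fire automatically and operators have self-service access.
team = AutomationProfile(
    diagnostics_auto_triggered=True,
    responders_can_run_automation=True,
)
print(maturity_level(team))
```

Checking from the top level downward means a team is credited with the most advanced behaviour it has reached, which matches how the levels build on one another.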
Crawl, walk, run: getting started with automation for incident response
To mature from a manual and reactive approach to a preventative one, organizations
should start with small, achievable goals. Successfully deploying automation
requires a progressive “crawl, walk, run” approach. To begin, technical teams need
to understand what automation is already in place, how complex the system in
question is, and the potential risk/revenue impact of a given task.
The key to getting started is to avoid the most sophisticated systems (those
with the most dependencies) and the actions with the highest risk of impacting
revenue. By starting small, organizations can expand their capacity for automation as
they learn and realize benefits.
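One way to make "start small" concrete is to rank candidate tasks by revenue risk and by number of dependencies, then automate from the top of the list. This is a minimal sketch under assumed inputs: the `Candidate` fields, the 1-to-5 risk scale, and the example tasks are all hypothetical, and a real assessment would likely weigh more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dependencies: int        # how many other systems the task touches
    revenue_risk: int        # assumed scale: 1 (negligible) to 5 (severe)
    already_scripted: bool   # automation exists but only experts run it today


def rank_candidates(candidates):
    """Order tasks so the lowest-risk, simplest, already-scripted ones come first."""
    return sorted(
        candidates,
        key=lambda c: (c.revenue_risk, c.dependencies, not c.already_scripted),
    )


# Hypothetical backlog of automation candidates.
backlog = [
    Candidate("roll back deployment", dependencies=5, revenue_risk=5, already_scripted=False),
    Candidate("pull disk metrics", dependencies=0, revenue_risk=1, already_scripted=True),
    Candidate("restart cache node", dependencies=2, revenue_risk=3, already_scripted=True),
]

for task in rank_candidates(backlog):
    print(task.name)
```

Sorting on a tuple keeps the priorities explicit: revenue risk dominates, dependency count breaks ties, and existing scripts are preferred last because they cost the least to expose to operators.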
Crawl: The first stage is read actions requiring little processing. These are simple, single-step actions with
no impact on service performance or availability. Read actions could include enriching an incident
description with system information or pulling system metrics for diagnostics. Many organizations have this kind
of automation already on hand. All that needs to be done is to make it available to operations teams rather
than just subject matter experts.
Walk: The second stage is multi-step diagnostic and remediation actions. Think of these as the steps your
experts take to resolve many incidents, like sequences that provide deeper diagnoses and remediate many
common or even recurring problems. They just need to be automated and delegated to operators. These actions
often require multiple steps and higher-access privileges, and could potentially cause more damage when used
incorrectly. Runbook automation ensures operators can only invoke relevant operations at the right time, and
allows privileged operations to be run without having to share superuser credentials.
Run: The final stage is complex change actions. These are actions that can significantly impact
performance or service availability, and typically involve privileged access for many steps across multiple
systems. While that might sound scary, for many common types of problems it is appropriate to delegate a
proven workflow to an operator trying to resolve a P1 incident and restore availability. Examples could be a
multi-service rolling restart, or rolling back a software deployment to the last known good version.
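The three stages differ along a few observable properties of an action: whether it changes anything, how many steps it has, and whether it can significantly affect availability. The sketch below encodes that reading. `Action`, `Tier`, and `classify` are illustrative names, and the exact cut-offs are assumptions rather than rules stated here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRAWL = "crawl"
    WALK = "walk"
    RUN = "run"


@dataclass(frozen=True)
class Action:
    name: str
    read_only: bool                        # only reads state, changes nothing
    steps: int                             # number of steps in the workflow
    significant_availability_impact: bool  # can noticeably affect performance or uptime


def classify(action: Action) -> Tier:
    """Map an action onto a crawl/walk/run tier (assumed cut-offs)."""
    if action.significant_availability_impact:
        return Tier.RUN
    if action.read_only and action.steps == 1:
        return Tier.CRAWL
    # Multi-step diagnostics and lower-impact remediation fall in between.
    return Tier.WALK


# Hypothetical examples, loosely based on the actions named above.
examples = [
    Action("enrich incident with system info", True, 1, False),
    Action("collect logs and clear a stuck queue", False, 3, False),
    Action("multi-service rolling restart", False, 6, True),
]

for action in examples:
    print(f"{action.name}: {classify(action).value}")
```

Availability impact is checked first so that a risky action is never filed under a lower tier merely because it happens to be short.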
Using Rundeck and PagerDuty together offers organizations a powerful way to further improve mean time to
acknowledge (MTTA) and mean time to resolve (MTTR), protect revenue, unleash operations productivity, and
reduce burnout.
Rundeck by PagerDuty
Rundeck by PagerDuty delivers the leading automated runbook platform for Enterprise IT. Automated runbooks
enable faster, more effective Ops actions and maximize existing investments in people and automation. Streamline
workflows and speed up incident response by connecting people with standard operating procedures and tools
across organizational and technology silos. Rundeck automates Ops actions so your team can focus on improving
customer experience and staying ahead of the competition.