
Supercharge your incident response with runbook automation
Achieving shorter incidents and fewer escalations with Rundeck and PagerDuty

Contents
Changing the status quo and embracing automation
What is a modern incident response approach?
Speed up incident resolution with runbook automation
Crawl, walk, run: Getting started with automation for incident response
Deploying automated runbooks with Rundeck and PagerDuty

Changing the status quo and embracing automation
The shift to digital accelerated faster than we all planned. A study from McKinsey found that the pandemic sped up digital transformation by at least seven years. Digitalization closed the gaps that sheltering in place opened, and now much of the world is continuing to live, work, and learn online. Companies with strong digital offerings have thrived, while those without digital investments have seen just how critical exceptional digital experiences have become.

As the world settles into a new normal, companies are investing even more resources in software-driven experiences. A recent forecast from Gartner finds worldwide IT spending will total $4.1 trillion in 2021, an increase of 8.4% from 2020. Much of this spending will come from outside of IT as digital technologies extend beyond traditional IT use cases and into value-adding digital services.

This continued shift of revenue towards digital services means that high availability is paramount. Both online revenue and competition have never been greater. Businesses’ digital services therefore have to be available around the clock. Meanwhile, ITOps and DevOps teams also have to deliver more capacity and release new services and applications faster than ever.

Incidents are inevitable. Fixing them quickly when they occur is critical to a company’s bottom line and customer experience. However, the challenge for operations teams is that IT environments are continuing to become more complex and interdependent, making it harder to troubleshoot and resolve incidents. The sheer volume of data that has to be collected and made sense of when responding to an incident is growing every year. Dealing with these challenges and the high-pressure environment that comes with incidents can leave teams feeling frustrated and burned out.

Manual and reactive incident response processes waste precious time
The growth in the complexity of the infrastructure and quantity of applications that teams support means that, in many organizations, incident resolution simply takes too long. Manual and reactive incident response increases mean time to resolve (MTTR) and wastes precious time that teams just don’t have.

Historically, the answer has been to swarm the problem with more responders. This can result in dozens, if not hundreds, of team members being called into the fray following a customer-impacting disruption. Zoom calls or bridges like the one in the image below attempt to answer these questions in a

This situation has become the norm for many operations teams because responders are not armed with the information and insights they need to take action when an incident initially occurs. Teams need to be able to answer a series of questions quickly to get to the bottom of an incident and apply a fix. These questions include:

• What changed in the environment?
• What services are impacted?
• Who owns those services?
• What are their dependencies?
• What signals hold the clues?
• Have we solved this in the past?

When it comes to incident response, responders may not know how to use these tools, scripts, and commands to diagnose and resolve an issue, so they escalate the problem.

When teams are scrambling in a hectic all-hands-on-deck call and incidents are inevitably escalated, individuals are pulled away from their day job or called in unnecessarily during their time off. In a survey conducted by PagerDuty of 700 developers and IT operations professionals, more than half (55%) said they are having to fix incidents during their time off at least five times a week, and 62% are working more than 10 additional hours a week.

This has several negative outcomes. It not only prevents operations professionals from spending time innovating, it also causes frustration and runs the risk of burning teams out. The short- and long-term costs of an incident add up quickly, both in terms of dollars and the impact on people.
In the short term:
• Lost revenue per minute of downtime
• Lost staff time spent resolving incident

In the long term:
• Staff dissatisfaction from toil
• Loss of focus by developer team

This current approach epitomizes inefficiency and is representative of organizations where
operations processes haven’t evolved to meet the current state. Managing incidents “on the fly”
in this way, without defined and quickly executable processes that can be run by people other
than subject matter experts, is not sustainable. IT and operations leaders need to adopt a modern,
automated approach to incident management.

What is a modern incident response approach?
Traditionally, IT Service Management (ITSM) solutions have played a role in managing the queued work in an organization and responding to operations issues that are not urgent. But when it comes to urgent, mission-critical issues that need to be fixed right away, a ticketing system can slow things down. In an always-on world, operations teams must work in real time to meet application and performance demands.

PagerDuty Incident Response brings modern incident best practices to an organization with end-to-end response automation, seamless incident response integration with ITSM toolchains, and friction-free postmortems.

This speeds up incident detection and triage and mobilizes a fast response from the right people before customers and users are impacted. Applying real-time response capabilities improves key metrics such as mean time to acknowledge (MTTA) and MTTR.

However, you can do more to further reduce MTTA/MTTR, including automating incident resolution.
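
To make the two metrics concrete, the following minimal Python sketch computes MTTA and MTTR from a list of incident timestamps. The Incident record and its field names are assumptions made for illustration and do not reflect PagerDuty's data model. MTTR is measured here from trigger to resolution, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative, not a PagerDuty schema.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: trigger to first acknowledgement."""
    return mean_delta([i.acknowledged_at - i.triggered_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: trigger to resolution."""
    return mean_delta([i.resolved_at - i.triggered_at for i in incidents])


t0 = datetime(2021, 6, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=50)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=70)),
]
print(mtta(incidents))  # 0:05:00
print(mttr(incidents))  # 1:00:00
```

Tracking both numbers before and after introducing automation gives a simple way to check whether acknowledgement or resolution is the slower step.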
Speed up incident resolution with runbook automation
Technical teams feel the pain of an inefficient incident response firsthand. This pressure is compounded by business outcomes for technical leaders; the loss of revenue and customer impact due to digital disruption ultimately falls upon the CIO and CTO. It is increasingly their responsibility to ensure the best customer experience and continuity of revenue while still advancing innovation.

The good news is that there is a way to change the narrative and accelerate operational maturity. By embracing runbook automation, organizations can leave the toil and inefficiencies of traditional incident response behind and free up ITOps and DevOps teams to create, innovate, and deliver high-quality digital experiences.

All teams have a variety of methods to complete repetitive tasks and resolve incidents, such as restarting servers, copying artifacts, and manipulating files. Runbook automation standardizes incident response by capturing and automating these methods and delegating them to the right people within the organization. With runbook automation in place, responders are empowered to run automated workflows for diagnostic and remediation activities. They can directly resolve known issues, thereby reducing the volume of incidents that get escalated while significantly speeding up resolution.

Automating incident response requires cultural and platform changes for an organization. It can be helpful to think of these changes as an evolution. The image below outlines the journey through a digital operations maturity model for IT automation developed by PagerDuty. The model describes characteristics of each maturity stage so businesses can assess where they are today and understand what the next phase looks like. The five phases are:

Manual: In these organizations, automation can only be run by experts. Incident resolution follows a ticketed
system and incidents are almost always escalated to a subject matter expert.

Reactive: At this stage, there is some general automation that triggers basic diagnostics in the event of an
incident. But automation is still only initiated by experts and these organizations also rely on ticketed escalations.

Responsive: As maturity develops, delegation of self-service operations improves. Here, there is some kind of
Runbook Automation platform in use, and front-line responders and operators can run common diagnostics and
automation rather than having to escalate to subject matter experts every time.

Proactive: Now, organizations are really reaping the benefits of better maturity as context-specific
diagnostics and mitigation are automatically triggered when an incident hits. Deeper automation means front-line
responders can not only diagnose but can also resolve recurring issues without escalating to an expert.

Preventative: This final evolution is the “gold standard” of maturity. Here, automation seeks to resolve commonly
occurring incidents before a responder is notified. Escalations are only needed for complex or unusual
problems, and service uptime is maximized.
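
As a rough illustration of how a team might place itself on this model, the sketch below maps four yes/no practices to the five phases. The Practices fields are hypothetical paraphrases of the phase descriptions above, and the highest-practice-wins rule is an assumption, not part of PagerDuty's model.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    MANUAL = 1
    REACTIVE = 2
    RESPONSIVE = 3
    PROACTIVE = 4
    PREVENTATIVE = 5


@dataclass
class Practices:
    # Hypothetical yes/no answers that paraphrase the phase descriptions.
    diagnostics_triggered_on_incident: bool = False
    responders_can_run_automation: bool = False
    context_specific_automation_triggered: bool = False
    common_incidents_resolved_before_notification: bool = False


def assess(p: Practices) -> Phase:
    """Return the highest phase whose defining practice is in place."""
    if p.common_incidents_resolved_before_notification:
        return Phase.PREVENTATIVE
    if p.context_specific_automation_triggered:
        return Phase.PROACTIVE
    if p.responders_can_run_automation:
        return Phase.RESPONSIVE
    if p.diagnostics_triggered_on_incident:
        return Phase.REACTIVE
    return Phase.MANUAL


print(assess(Practices()).name)  # MANUAL
print(assess(Practices(diagnostics_triggered_on_incident=True,
                       responders_can_run_automation=True)).name)  # RESPONSIVE
```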
Crawl, walk, run: getting started with automation
for incident response
To mature from a manual and reactive approach to a preventative one, organizations
should start with small, achievable goals. Successfully deploying automation
requires a progressive “crawl, walk, run” approach. To begin, technical teams need
to understand what automation is already in place, how complex the system in
question is, and the potential risk/revenue impact of a given task.

The key to getting started is not opting for the most sophisticated systems (those
with the most dependencies) or the actions with the highest risk of impacting
revenue. By starting small, organizations can improve the capacity for automation as
they learn and realize benefits.
Crawl: The first stage is read actions requiring little processing. These are simple, single-step actions with
no impact on service performance or availability. Read actions could include enriching an incident description
with system information or pulling system metrics for diagnostics. Many organizations have this kind
of automation already on hand. All that needs to be done is to make it available to operations teams rather
than just subject matter experts.

Walk: Think of these as the steps your experts take to resolve many incidents, like sequences that provide
deeper diagnoses and remediate many common or even recurring problems. They just need to be automated
and delegated to operators. These actions often require multiple steps and higher-access privileges,
and could potentially cause more damage when used incorrectly. Runbook automation ensures operators
can only invoke relevant operations at the right time, and allows privileged operations to be run without
having to share superuser credentials.

Run: The final stage is complex change actions. These are actions that can significantly impact performance
or service availability, and typically involve privileged access for many steps between multiple
systems. While that might sound scary, for many common types of problems it is appropriate to delegate a
proven workflow to an operator trying to resolve a P1 incident and restore availability. Examples could be a
multi-service rolling restart, or rolling back a software deployment to the last known good version.
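
The three stages can be sketched as a simple classification of candidate actions. This is a minimal illustration, not a Rundeck feature: the Action fields and the decision rules are assumptions that simplify the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRAWL = "crawl"  # read-only, single-step actions
    WALK = "walk"    # multi-step diagnostics and remediation
    RUN = "run"      # complex changes that can affect availability


@dataclass
class Action:
    # Hypothetical description of a runbook action; all fields are assumptions.
    name: str
    read_only: bool
    steps: int
    can_impact_availability: bool


def classify(action: Action) -> Tier:
    """Place an action in the crawl/walk/run progression."""
    if action.can_impact_availability:
        return Tier.RUN
    if action.read_only and action.steps == 1:
        return Tier.CRAWL
    return Tier.WALK


actions = [
    Action("pull system metrics", True, 1, False),
    Action("collect diagnostics and clear temp files", False, 3, False),
    Action("multi-service rolling restart", False, 8, True),
]
for a in actions:
    print(f"{a.name}: {classify(a).value}")
```

Sorting an existing inventory of scripts this way gives a starting order: automate and delegate the crawl actions first, then move up as confidence grows.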

Deploying automated runbooks with Rundeck and PagerDuty
With digital experience expectations on the rise, organizations need to maximize their investment in people and process. With Rundeck by PagerDuty, organizations can modernize incident response with runbook automation. Rundeck doesn’t replace existing automation. Instead, it makes existing automation, scripts, and commands more secure, auditable, and easier to run. Rundeck enables the safe delegation of tasks to responders via self-service. Subject matter experts define automated workflows (across tools, scripts, APIs, and manual commands) and delegate to others to execute. Key tools and infrastructure in the operations workflow can be connected with Rundeck as a central hub, and executed through PagerDuty (or the Rundeck GUI for non-incident-response use cases).

Rundeck also provides role-based access control, so organizations can define who has privileges to invoke or publish workflows. It logs actions taken to satisfy compliance requirements, and appropriately handles the secrets for your systems so there is no need to provide root-level passwords or keys to users.

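
The idea of delegated, audited execution can be sketched in a few lines of Python. This is a conceptual illustration only and does not use or represent Rundeck's API: the role names, job names, and the invoke function are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical role-to-job mapping; role and job names are illustrative only.
ROLE_CAN_INVOKE = {
    "responder": {"collect-diagnostics", "restart-service"},
    "expert": {"collect-diagnostics", "restart-service", "rollback-deploy"},
}

audit_log: list[dict] = []


def invoke(user: str, role: str, job: str) -> bool:
    """Allow a job only if the role permits it, recording every attempt."""
    allowed = job in ROLE_CAN_INVOKE.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "job": job,
        "allowed": allowed,
    })
    # A real platform would execute the job here with credentials it holds
    # itself, so the user never sees a password or key.
    return allowed


print(invoke("alice", "responder", "restart-service"))  # True
print(invoke("alice", "responder", "rollback-deploy"))  # False
```

Denied attempts are logged as well as permitted ones, which is what makes the record useful for compliance review.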
Customers see a reduction in MTTA / MTTR by using Rundeck and PagerDuty

Using Rundeck and PagerDuty together offers organizations a powerful way to further improve MTTA and MTTR,
protect revenue, unleash operations productivity, and reduce burnout.

To learn more about PagerDuty Runbook Automation, visit https://www.pagerduty.com/use-cases/automation/. Or,
you can get started with Rundeck today and enable anyone in your organization to have safe, self-service access to IT
operations tasks. Visit https://www.rundeck.com/see-demo to schedule a demo.

Rundeck by PagerDuty
Rundeck by PagerDuty delivers the leading automated runbook platform for Enterprise IT. Automated runbooks
enable faster, more effective Ops actions and maximize existing investments in people and automation. Streamline
workflows and speed up incident response by connecting people with standard operating procedures and tools
across organizational and technology silos. Rundeck automates Ops actions so your team can focus on improving
customer experience and staying ahead of the competition.
