
The cost of downtime and how to use visuals to mitigate loss

v20.08.31-20.06
Table of contents

• The true costs of downtime
• How to prevent downtime
• How to resolve incidents faster
• Reign over your cloud with tangible next steps

Introduction

In March of 2015, Apple lost $25 million in 12 hours. That same year, trading at the New York Stock Exchange was suspended for four hours, costing the Intercontinental Exchange $14 million. And more recently, in March of 2019, Facebook lost an estimated $90 million in 14 hours.

The common denominator between each of these events? Network outages. One outage can lead to security breaches, customer dissatisfaction, and lost revenue. As with any other crisis, the best way to combat the threat of downtime is to carefully prepare response plans ahead of time. However, to make your plan actionable and agile, you must take the next step and design the necessary documentation to align your teams and guide your decision-making.

In this paper, we’ll cover how to prevent costly downtimes by doing the following:

• Following an architecture framework
• Visualizing your cloud implementation or migration
• Maintaining visibility into your existing environments
• Keeping teams aligned with architecture diagrams
• Setting and enforcing internal best practices

Finally, while there are many effective tools available that are dedicated to incident management, this paper will not discuss these tools in great detail, but instead review critical strategies for resolving incidents efficiently: investing in key pre-work; documenting your troubleshooting processes; collaborating on incident response in a shared visual workspace; and holding post-mortems after incidents to solidify learnings.

The cost of downtime and how to use visuals to mitigate loss sales@lucidchart.com 650-733-6172 2
The true costs of downtime

Although Apple, Facebook, and the NYSE were large enough to absorb the high monetary costs, their outages also cost them time and customer trust. And for smaller companies, similar damages could be catastrophic. Understanding these costs, and creating a plan to prevent and resolve outages, is critical.

Tangible costs

So what, exactly, are the costs of network outages and downtime? Let’s start with financial costs. According to Gartner, network downtime costs an average of $5,600 per minute, or more than $300,000 per hour. And that might be a conservative estimate. In their 2016 report, “Cost of Data Center Outages,” the Ponemon Institute raised that cost estimate to $9,000 per minute. Even if the downtime only lasts one hour, that is still a sizable cost, and most outages last much longer. One simple error, such as a failed software upgrade or a change in device configuration, can lead to millions of dollars in losses.

And the costs don’t stop there. An outage can result in a costly lack of productivity from your teams. When servers are down, employees are unable to perform their jobs. Since their salaries are likely a fixed cost, they will continue to get paid regardless of their inability to work during an outage. You’ll end up paying them for the time they couldn’t work and then paying them to catch up once you’re back online.

Beyond that loss of productivity, consider how much time and attention outages demand from your IT team. Any other projects they were working on, and any other tasks in the queue, have to be set aside during an outage emergency, and that translates into backlogs of work and missed deadlines.

You can calculate this loss of productivity with this formula from Atlassian:

Lost productivity = employee salary/hr x utilization % x number of employees

Finally, you might be hit with litigation costs. Take Robinhood, for example. The online brokerage experienced outages and trading glitches in March of 2020. Within two weeks of this outage, they faced three separate lawsuits from angry customers.

Intangible costs

The effects of unplanned downtime and outages extend beyond monetary considerations. For instance, if your company needs to maintain regulatory compliance (think GDPR for protecting personal user data or PCI DSS if you handle credit card data), your standing can be severely compromised by an outage if it leads to or was caused by a data breach.

Any outage is an opportunity for sensitive data loss or exposure, so you can expect compliance auditors and other stakeholders to express concern when you experience an outage. This concern can lead to more audits, with each audit eating up more of your company’s time and resources on top of what you lost to the downtime itself.

However, it’s not just about your standing with regulatory bodies. It’s about your standing with your customers. Outages are a surefire way to lose customer trust. This is the ultimate breach in the provider-client service contract and can quickly damage your customer satisfaction and retention.

And with so many service provider options available to them, it’s easier than ever for customers to take their business elsewhere. This is the case whether you’re a financial service provider or a travel agent. A single unsatisfied customer might share their grievances on social media, and soon that one negative experience can snowball into a larger problem. What may have started as an intangible cost is now a very tangible hit to your bottom line.
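As a rough illustration of the tangible costs in this section, the following Python sketch applies the Atlassian lost-productivity formula and the Gartner per-minute average quoted above. The function names and the sample salary, utilization, and headcount figures are hypothetical. Multiplying the hourly productivity figure by the outage duration is an assumption added here to produce a total for a single incident; it is not part of the quoted formula.

```python
GARTNER_COST_PER_MINUTE = 5_600  # average quoted in this section, in US dollars


def lost_productivity_per_hour(hourly_salary, utilization, employees):
    """Atlassian formula: employee salary/hr x utilization % x number of employees."""
    return hourly_salary * utilization * employees


def outage_estimate(minutes, hourly_salary, utilization, employees):
    """Estimate the cost of one outage (scaling by duration is an assumption)."""
    hours = minutes / 60
    per_hour = lost_productivity_per_hour(hourly_salary, utilization, employees)
    return {
        "network_cost": GARTNER_COST_PER_MINUTE * minutes,
        "lost_productivity": per_hour * hours,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 90-minute outage, 200 employees at $40/hr, 75% utilization.
    for label, dollars in outage_estimate(90, 40.0, 0.75, 200).items():
        print(f"{label}: ${dollars:,.0f}")
```

Your own per-minute cost may differ substantially from the industry average, so treat the constant as a placeholder to replace with figures from your business.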

How to prevent downtime

The immediate and residual costs of downtime pose a danger to businesses and their customers. And although downtime is virtually impossible to eliminate entirely, there are preventative measures businesses can take to reduce the number of incidents and the associated time and costs it takes to fully recover.

Even the latest AWS user agreement suggests that “you are responsible for … taking appropriate action to secure, protect and backup your accounts.”

So, what can you do to prevent downtime and protect and secure your accounts?

Follow an architecture framework

Correctly designing and building your architecture is a crucial first step in preventing downtime. Best practices for architecting aren’t always clear, especially because the needs of each business differ. What may be right for one organization may not be for the next.

Luckily, many cloud providers, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, offer their own versions of a well-architected framework, which outline best practices for how to architect and monitor your cloud environment.

These frameworks provide customers with core strategies that are categorized into five pillars and a friendly mnemonic: CROPS.

• Cost optimization
• Reliability
• Operational excellence
• Performance efficiency
• Security

Essentially, these five pillars help you plan, build, and optimize your architecture with extra attentiveness and consistency.

With these pillars in mind, you can take extra precautions to protect any associated information and assets, verify that computing resources efficiently meet system requirements and present-day technology demands, and guarantee that everything is running at its best, at the lowest possible price point.

When you design carefully from the start, you can mitigate the risk of downtime in the future. If you’re looking for a more comprehensive architecture framework, there are also many certifications, such as SOC 2 and the ISO 27000 series, that provide additional guidelines and best practices.

Not only should these frameworks be used by architects as they architect, or by coders as they code, but they should also be referenced during internal architecture reviews. Ideally, reviews should be conducted on a regular basis.

When supplemented with a well-architected framework, architecture reviews provide extra opportunities to reduce the risk of downtime. During architecture reviews, you should inspect your architecture design to ensure it is scalable and reliable, while also evaluating risks to ensure proper security measures have been taken.

With your architecture diagram on display, your teams can easily run through risk assessments and mitigation strategies together, flagging resources such as unencrypted databases.

Visualize your cloud implementation or migration to ensure it is done correctly

The world’s best architecture design means nothing if what was designed isn’t actually what’s implemented.

Deploying something into the cloud often requires the work and input of several teams. Unfortunately, it’s possible for your plans to be lost in translation. For example, breakdowns can happen as a cloud architect hands off their design for review by security teams, or as the architecture is passed to a cloud engineer to code. By the time your code arrives in the hands of your quality assurance teams, there may already be multiple small (or big) breakdowns that put your infrastructure at a greater risk for downtime.

If designs are not implemented as planned, how can you be certain you’re not at risk for downtime? All it takes is one bad game of telephone to result in a botched deployment, potentially leading to multiple single points of failure, scaling limits, incorrect communication between your inputs and outputs, or any of the other common reasons for downtime.

With all of these risks in mind, if you design based on your cloud provider’s version of a well-architected framework, you then need to tackle two very important tasks during implementation:

• Verify that migration and implementation were done correctly.
• Understand why migration and implementation were not done correctly if you encounter issues.

Essentially, you need to quickly cross-check plans with what your engineers actually build.

Automatically generating an architecture diagram based on your deployed infrastructure helps you compare your “blueprint” diagram with your current state diagram. If the diagrams are not identical, direct stakeholders to the exact location of misalignment and prompt conversations about potential engineering mistakes that need to be corrected or possible technical limitations in the architect’s design that may require alterations to your architecture templates for future use.

Maintain visibility into your existing environments

So you’ve done all of the work to design your cloud architecture by following a framework to design and review your architecture and by verifying that the design was implemented correctly. Now what?

It’s also important to understand what your environment looks like at all times, especially as you iterate and deploy in smaller batches.

Creating an architecture diagram is one easy way to maintain visibility into what your environment looks like. You’ll score bonus points if you can visualize it in real time or by simply refreshing your cloud provider data.

Visualizing your infrastructure is a powerful way to create a bird’s-eye view of a complicated system. Doing so can help you, your teams, and your stakeholders more clearly understand exactly where certain elements live, why they matter, and how they can affect other systems and processes downstream should something go awry.

With accurate and up-to-date visibility, your employees will be better equipped to:

• Conduct thorough security gap analysis.
• Complete detailed network evaluations.
• Provide informed recommendations.
• Make smarter network decisions and changes.
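The cross-check between a “blueprint” and the deployed state described on this page can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resource names, the dictionary layout, and the `diff_architecture` helper are hypothetical, and a real comparison would read from your design documents and from your cloud provider’s inventory data.

```python
def diff_architecture(blueprint, deployed):
    """Return resources that are missing, unplanned, or configured differently."""
    planned_names, actual_names = set(blueprint), set(deployed)
    changed = {
        name: {"planned": blueprint[name], "actual": deployed[name]}
        for name in sorted(planned_names & actual_names)
        if blueprint[name] != deployed[name]
    }
    return {
        "missing": sorted(planned_names - actual_names),
        "unplanned": sorted(actual_names - planned_names),
        "changed": changed,
    }


if __name__ == "__main__":
    # Hypothetical blueprint and deployed inventories keyed by resource name.
    blueprint = {
        "web-lb": {"type": "load_balancer", "zones": 2},
        "orders-db": {"type": "database", "encrypted": True},
        "worker": {"type": "instance_group", "min_size": 2},
    }
    deployed = {
        "web-lb": {"type": "load_balancer", "zones": 1},
        "orders-db": {"type": "database", "encrypted": True},
        "debug-box": {"type": "instance"},
    }
    print(diff_architecture(blueprint, deployed))
```

Each entry in the report gives a concrete place to point stakeholders: a missing resource suggests an incomplete deployment, an unplanned one suggests drift, and a changed one may reveal either an engineering mistake or a limitation in the original design.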

Keep teams aligned

At the end of the day, your employees are responsible for preventing downtime. And they can only be successful if they’re on the same page from start to end.

From the moment an employee is hired and throughout the onboarding process, you can use your cloud visuals to help onboard new team members quickly and effectively with accurate documentation. Then, these diagrams can keep new and seasoned team members aligned.

As mentioned earlier, architecture passes through many hands, and teams can easily be misaligned if you’re not careful.

Whether you’re migrating to the cloud, managing a hybrid cloud, or optimizing your current cloud infrastructure, keeping architecture diagrams up to date ensures that teams can more easily collaborate and make decisions together about the current and future states of your cloud environment.

Use accurate and up-to-date cloud diagrams to keep architects and implementation teams informed and on the same page when it comes to:

• How things currently work vs. how they should work
• What limitations exist
• What changes or updates have been made
• When changes were made
• Who owns the changes or is responsible for them
• How ideas, processes, and initiatives should be implemented
• What the purpose behind these decisions is
• How the cloud impacts the business as a whole

While it’s important that current employees are aligned while they manage your cloud together, it’s equally important that you don’t allow critical tribal knowledge to leave with an employee if they leave your company. A well-maintained architecture diagram mitigates the risk of critical knowledge loss, ensuring current employees have access to the information they need, regardless of employee turnover.

As you leverage a well-architected framework to streamline alignment on designing and reviewing existing and proposed architecture, how can you be sure employees follow through on the small yet high-impact details of designing and coding your infrastructure?

Set and enforce internal best practices

Often, downtime is the result of poorly enforced internal best practices and standards, or a lack thereof, with regard to your CROPS. Occasionally, you will discover knowledge gaps that need to be addressed and will require you to update your best practices and standards.

Standardizing internal best practices (and keeping them up to date) instills confidence in your teams and ensures a cloud-enabled workforce that is properly evaluating the current state of your infrastructure and securely planning for future states.

IT leadership will likely develop these best practices, which should be documented and centralized for easy access, so architects and engineers know exactly where to turn should questions arise.

While internal best practices may vary by organization, they could be a combination of the following on this non-exhaustive list:

• Secure your endpoints.
• Encrypt data in storage and in transit.
• Use multi-factor authentication to reduce the risk of credential compromise.
• Always use auto-scaling.
• Manage infrastructure in a source code control system.
• Ensure continued access to critical data in the event of errors and failures.
• Conduct penetration testing.
• Reduce interdependencies by loose coupling.

After you develop your internal best practices, how do you ensure they’re enforced and that teams follow through? Regularly conducted architecture reviews.

An effective architecture review should identify and highlight all security weaknesses or critical issues in your applications.

Rather than dive into lines of spreadsheet data or code, present your architecture diagram and focus on the information that matters most so you can quickly arrive at a set of actions that should improve the performance of your cloud environment and achieve your internal standards.

If possible, your architecture diagrams could be automatically generated based on a simple data import from your cloud provider. This way, you’ll know you’re always looking at your current state, not a version from six (or 60) deployments ago.
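To make the idea of enforcing internal best practices during an architecture review more concrete, here is a minimal Python sketch that checks a resource inventory against three of the practices listed in this section. Everything in it is an assumption for illustration: the inventory layout, the field names (`encrypted`, `mfa`, `autoscaling`), and the `review` function are hypothetical and do not correspond to any particular cloud provider’s data export.

```python
def database_is_encrypted(resource):
    """Pass unless the resource is a database without encryption at rest."""
    return resource.get("type") != "database" or resource.get("encrypted") is True


def user_has_mfa(resource):
    """Pass unless the resource is a user account without MFA enabled."""
    return resource.get("type") != "user" or resource.get("mfa") is True


def group_autoscales(resource):
    """Pass unless the resource is an instance group without auto-scaling."""
    return resource.get("type") != "instance_group" or resource.get("autoscaling") is True


# Each rule pairs a best practice with the check that enforces it.
RULES = [
    ("Encrypt data in storage", database_is_encrypted),
    ("Use multi-factor authentication", user_has_mfa),
    ("Always use auto-scaling", group_autoscales),
]


def review(resources):
    """Return (resource name, violated practice) pairs to discuss in a review."""
    return [
        (resource["name"], practice)
        for resource in resources
        for practice, passes in RULES
        if not passes(resource)
    ]


if __name__ == "__main__":
    inventory = [
        {"name": "orders-db", "type": "database", "encrypted": False},
        {"name": "alice", "type": "user", "mfa": False},
        {"name": "workers", "type": "instance_group", "autoscaling": True},
    ]
    for name, practice in review(inventory):
        print(f"{name}: violates '{practice}'")
```

Keeping the rules in one list mirrors the advice to document and centralize best practices: when a standard changes, there is a single place to update.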

How to resolve incidents faster

While you may do everything in your power to mitigate the risks of downtime, it’s impossible to eradicate them completely. As we’ve seen, downtime can be costly. And while you may not be able to prevent downtime entirely, you can certainly mitigate the impact.

You may use dedicated incident response and management tools, such as Splunk, IBM QRadar, or Demisto, that help monitor networks, provide issue alerts, and so on. But applying best practices and using visuals alongside these tools can better equip your teams to plan and prepare for downtime.

Think ahead, make a plan

When your network goes down, there’s no time to lag, because the cloud moves quickly. It may seem cliché, but the most effective way to resolve incidents faster is to have a plan already in place. If you don’t have a plan, your teams won’t know where to begin or how to work together.

Thinking about how to approach incident response in the cloud before an event happens reduces response time. An incident that could last days or hours could be reduced to minutes or even seconds, saving you hundreds (if not thousands) of dollars.

Keep in mind that this should be more than a set-and-sit kind of plan. You should review your response plan regularly and iterate alongside changes made to your cloud environment.

Consider making a flowchart of all the possible downtime scenarios so you can visualize the areas of your cloud environment that are most at risk.

Document your troubleshooting processes

It’s one thing to agree upon a troubleshooting process, but you need to ensure that each scenario you imagine during your pre-work phases has an associated process to follow during an incident. Luckily, there are multiple resources and existing processes that you can follow, but you should always tailor the processes to fit the needs and resources of your business.

As you determine which processes you need, try to cover the following bases.

Get to the root of the incident

When your network goes down, your first instinct will likely be to identify the issue. And while you need to get to the root of the incident, utilizing efficient tools and processes can help you identify the cause of downtime faster.

As mentioned earlier, you may already be using tools such as Splunk, IBM QRadar, or Demisto. But without a process in place, you may not be using these tools to their fullest potential.

Consider popular processes outlined in the Information Technology Infrastructure Library (ITIL), such as incident management. Incident management includes a set of steps to follow when there is an unplanned interruption that impacts something down the line, such as another system.

From there, you’d follow another set of protocols, known as problem management, to discover the underlying cause of the incident. While these ITIL processes are effective, it’s useful to create a process flow diagram so your employees can more easily follow along during downtime, rather than poring over chapters of text.

Understand relative priority

Just as you take steps to determine what the issues are, you should also document a decision process that helps incident management teams understand the priority of events or incidents and when to escalate a problem.

Because your systems and data flows are in constant states of change, a decision tree diagram can help new and seasoned team members alike quickly determine the next best course of action when it comes to identification, containment, eradication, and recovery.

Here are a few examples of questions you could include:

• Did you review recent network configurations?
• Are the application and all services operational from end to end?
• Can the incident be isolated?
• Have backups been created to protect critical data?

By creating a decision guide, employees can resolve incidents more quickly because they won’t waste costly downtime searching for answers or the person who may hold the right answer.

Collaborate during incident response

Say you have multiple dispersed offices, and your network goes down at the end of the workday in the United States. Are you going to ask your employees in the U.S. to work through the night, or do you have a process in place that helps your global team coordinate handoffs during an outage?

In this instance, you could adopt a follow-the-sun model that can be used to coordinate in real time across teams and time zones. Just as you would document the flow from team to team, you could include a step for the coordinating teams to review an up-to-date architecture diagram that accurately depicts your current cloud environment, and together they can determine what’s already been discovered and what still needs to be done.

These diagrams can also help teams coordinate and answer these questions:

• What subnet is it in?
• What VPC is it in?
• How is routing set up?
• What’s the IP of the actual box? How is it routing?
• Where’s it going?
• Where’s it getting stuck?

Ideally, your cloud diagram will be hosted on an application with advanced collaboration features to further support these efforts. Imagine if incident response teams could comment directly on a specific component of your architecture diagram, quickly directing each other to the areas where action is needed.
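The follow-the-sun handoff mentioned on this page can be illustrated with a small Python sketch that decides which regional team owns an incident at a given moment and who takes over next. The shift boundaries, the team names, and both helper functions are hypothetical; a real rotation would come from your own on-call schedule.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start hour, end hour, team), hours in UTC.
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "Americas"),
]


def on_call_team(moment):
    """Return the team whose shift covers a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError(f"shift table does not cover hour {hour}")


def next_team(moment):
    """Return the team that takes over when the current shift ends."""
    teams = [team for _, _, team in SHIFTS]
    current = teams.index(on_call_team(moment))
    return teams[(current + 1) % len(teams)]


if __name__ == "__main__":
    outage_start = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)
    print(f"Owner: {on_call_team(outage_start)}, hands off to: {next_team(outage_start)}")
```

Documenting the rotation this explicitly means nobody has to decide in the middle of an outage who should be awake and who should be briefed next.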

Focus on the information that really matters

When you experience an outage, you first need to determine exactly what is happening. Because cloud environments are often large and complex, you need to quickly drill into the information and components of your architecture that really matter.

When you maintain architecture diagrams that accurately depict your current environment, your incident response teams can quickly understand and determine:

• Where a problem is occurring
• What the issue is
• Why a problem is occurring
• How the associated systems are connected

Putting a network diagram at the forefront of your incident response processes and further weeding out distracting details can speed up how quickly response teams locate issues.

From there, they can direct support engineers to exactly where the issue is, highlighting where action needs to be taken to resolve the incident.

However, given how quickly teams need to react during incident response, there often isn’t time to create a new architecture diagram. So you must always have an updated and accessible diagram available. This way, your incident response teams can focus on resolving incidents, not searching for documents.

Centralize documents

How you craft your troubleshooting processes is important, but it’s just as important that your documentation is accessible.

Unfortunately, when you really need your architecture diagrams and process flows, they’re often not available. This challenge may be a result of only one or two people managing your architecture diagrams or using a desktop-only solution like Visio. So when your network goes down, the doc is stuck on one person’s desktop, unavailable to your incident response teams and support engineers.

Or, given how quickly your environment changes, you may run into versioning issues, unable to identify the correct visual, partially due to non-standardized titling and slight nuances in dozens of PNGs and JPEGs scattered across your teams.

Luckily, it doesn’t have to be this way, but only if you centralize your documents.

If you host your architecture diagrams in the cloud, and keep them up to date with simple data refreshes, then you can rest assured that your teams will have access to the documents they need when they need them, restoring uptime as quickly as possible.

And, while it’s important to keep your diagrams centralized, don’t forget about the other apps that may include your diagrams, such as Confluence or Jira. Invest in a diagramming solution that offers automated integrations, so when you make changes to your architecture diagram, the changes will be automatically reflected anywhere else, like your Confluence wiki pages.
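The earlier point about knowing how the associated systems are connected can be sketched as a small dependency graph, which is essentially what an accurate architecture diagram encodes. The Python below is a hypothetical illustration: the component names and the `downstream_of` helper are invented for this example, and a real graph would be derived from your current-state diagram or cloud inventory.

```python
from collections import deque


def downstream_of(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert "X depends on Y" edges into "Y is used by X".
    used_by = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            used_by.setdefault(dependency, []).append(component)

    # Breadth-first walk outward from the failed component.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return sorted(affected)


if __name__ == "__main__":
    # Hypothetical architecture: each key lists the components it depends on.
    architecture = {
        "web": ["api"],
        "api": ["orders-db", "cache"],
        "reports": ["orders-db"],
        "cache": [],
    }
    print(downstream_of("orders-db", architecture))
```

With a list like this in hand, a response team can tell support engineers not only where the problem is occurring but also which other systems are likely to show symptoms.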

Host a post-mortem

After an incident is resolved and uptime has been restored, rather than indulge in a culture of blame, create a culture of improvement. Treat every minute of downtime as a learning experience and opportunity to revise troubleshooting processes and architecting best practices.

Gather everyone involved for a comprehensive post-mortem, where fear of punishment or retribution is left at the door. An effective post-mortem holds individuals, teams, and departments accountable while encouraging improvement.

With your architecture diagram at the ready, invite engineers and response teams to give a detailed account of where mistakes were made, why, and what assumptions were made, and then evaluate what could have been done differently.

You might even consider periodically hosting a “game day” where you replay past incidents. This time, though, you’ll be on the winning team.

By thoroughly reliving these experiences, you can better prevent downtime in the future and apply new knowledge gleaned to better, quicker incident response the next time around (and there will be a next time).

Reign over your cloud with tangible next steps

Your cloud environment is always on the move: code is being written, deployments are being prepared, and incidents are lurking.

Dedicate the time and resources to developing a truly comprehensive downtime response plan, one that starts with developing state-of-the-art cloud architecture and results in dwindling incidents, fast uptime restoration, and avoidance of the ghastly costs of outages.

With the proper processes and diagrams in place, together your cloud teams can mitigate the risks and losses of downtime, using new cloud insights to come out on top and deliver uninterrupted service to customers worldwide.

Lucidchart Cloud Insights helps you automatically visualize your architecture so your organization can better understand, optimize, and govern the cloud through accurate, up-to-date, and interactive diagrams that are consistent across your org.

Learn more today.

Lucidchart is a visual workspace that combines diagramming, data
visualization, and collaboration to accelerate understanding and drive
innovation. With this intuitive, cloud-based solution, everyone can work
visually and collaborate in real time while building flowcharts, mockups, UML
diagrams, and more. Lucidchart is utilized in over 180 countries by more than
15 million users, from sales managers mapping out target organizations to
IT directors visualizing their network infrastructure. Ninety-nine percent of
the Fortune 500 use Lucidchart, and customers include Google, GE, NBC
Universal and Johnson & Johnson. Since the Utah-based company’s founding
in 2010, it has received numerous awards for its product, business and
workplace culture. For more information, visit lucidchart.com.

sales@lucidchart.com 650-733-6172
