An Introduction to AIOps

Application development has evolved over the past 25 years, but
monitoring and operations management have stood still. As
cloud computing environments grow more complex, enterprise
firms face new challenges that burn time and capital expense,
inhibiting growth and stifling their success. In this paper, we
explore a new paradigm of IT operations, AIOps, which
provides cloud-first companies a path forward to more
intelligent, efficient operations.

The prominence of cloud computing has changed the way enterprises
build, manage, and scale apps. Large organizations have reorganized
their IT and development centers. Enterprise companies must deliver
great products while facing the challenge of rising Operating
Expenses (OpEx). Agile development practices have changed the way we
architect every aspect of the application stack. These systems output
tremendous amounts of data about user behaviors and application
health. Companies can find insights within this data to create
efficient operations, reducing OpEx. Due to environment complexity,
businesses must use advanced analytics to uncover these insights.

These analytical practices ushered in the discipline of IT operations
analytics (ITOA). In 2012, a Forrester report [1] shaped the early
ITOA narrative. “IT analytics tools hold the promise of helping IT
organizations better manage the technology that runs their business,”
the report notes. “Think of it as turning the concept of big data
inward to make better decisions about the business technology
services and the underlying infrastructure and applications.” By
2020, 25% of the Global 2000 companies will deploy an ITOA platform,
compared to 2% today [2]. Spend on ITOA amounted to $1.7B in 2014,
and will grow by an estimated 70% in 2015 [3]. These trends show that
businesses believe in ITOA as a critical discipline in the cloud
computing age.

The ample amount of raw data combined with mathematical algorithms
gives businesses an edge in IT management. Cloud device data (i.e.,
telemetry data) provides a clear depiction of environment activity.
With OpEx on the rise, businesses hope insights from this data can
curb cloud spend. Furthermore, this data can create other competitive
advantages as well. YMor, a Netherlands-based ITOA consultant, notes
ITOA “[brings] together multiple sources of data to enable
data-driven IT Operations, delivering consistent, high-quality
results for maximum digital performance, availability, security and
agility” [4]. To underscore this point, ITOA provides the most value
when organizations can correlate several data streams together. For
example, a DevOps team determining the root cause of an application
issue would need data from the host as well as from the application.
ITOA platforms that bring these streams together can provide a
significant operational advantage.

This telemetry data and information from systems of record need
further human expertise and interpretation, making it a challenge for
businesses to adopt an ITOA practice. Cloud computing has contributed
to an explosion of available data, which can be cumbersome for IT to
manage. According to Big Panda, “IT is struggling to keep up with the
pace of change, and the rush to modernize is leaving DevOps leaders
looking for a better solution” [5]. Many companies rely on several
tools to monitor and analyze telemetry data. Most of these companies
dislike their ITOA and monitoring strategy. One major pain point
stems from human operators using intuition to create arbitrary,
threshold-based alerts to stay on top of IT data. Many monitoring
platforms lack the flexibility and ease of use to help operators pick
the right thresholds for alerting. Big Panda notes, “Of those who
receive 100+ alerts per day, only 17% are able to investigate and
remediate the majority within 24 hours.” As a result, one can trace
most OpEx spend back to the need for talent to close the monitoring
gap and make ITOA practices succeed. Unfortunately, firms are
struggling to find and keep experienced IT operators. In one study by
McKinsey&Company, 35% of executives surveyed believed improving IT
talent and capabilities would lead to better IT performance. The
study notes that “two-thirds [of executives] agree that it’s a
significant challenge for their organizations to find, develop, and
retain talent, with IT executives even more concerned about this than
their business peers” [6]. While ITOA can provide promising insights,
tools within this space today are still unable to address these
concerns.

Even with ITOA platforms, enterprise organizations have trouble
bridging the gap from insight to action. The most advanced ITOA
platforms often serve up information with little additional value.
During a major incident, ITOA tools provide tertiary information that
is most helpful in hindsight. Companies spend between 3 and 6 hours
repairing an app-related problem [7]. Almost 15% of the time, more
than 10 people are necessary to resolve such problems. According to
the EMA, “[a] majority of companies are still trying to manage
complex applications with a combination of ‘all hands on deck’
interactive marathons and tribal knowledge” [7]. While ITOA can
reduce some monitoring complexity, it does not tie into the rest of
the DevOps ecosystem to provide significant value on its own. In the
end, enterprises need a comprehensive solution to reduce the IT
burden ushered in by cloud computing.
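
To make the pain point around arbitrary, threshold-based alerts more
concrete, here is a minimal Python sketch. It is an illustration
only, not a description of how any ITOA or AIOps product works. It
contrasts a fixed, operator-chosen threshold with a simple baseline
computed from recent history. The function names, the sample CPU
readings, the window size, and the z-score limit are all assumptions
made up for this example.

```python
from statistics import mean, stdev


def fixed_threshold_alerts(samples, threshold):
    """Return indices of samples above an operator-chosen fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def rolling_baseline_alerts(samples, window=5, z_limit=3.0):
    """Return indices of samples that sit far from the recent baseline.

    A sample is flagged when it differs from the mean of the preceding
    `window` samples by more than `z_limit` sample standard deviations.
    The first `window` samples are never flagged (no history yet).
    """
    if window < 2:
        raise ValueError("window must be at least 2 to estimate a spread")
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        deviation = abs(samples[i] - baseline)
        if spread == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if deviation > 0:
                alerts.append(i)
        elif deviation / spread > z_limit:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    # Hypothetical CPU utilisation (%) for two hosts, one reading per minute.
    quiet_host = [40, 42, 41, 43, 42, 41, 43, 90, 42, 41]
    busy_host = [86, 88, 87, 89, 88, 87, 88, 89, 87, 88]

    # A hand-picked 95% threshold misses the spike on the quiet host.
    print(fixed_threshold_alerts(quiet_host, 95))   # []
    # A hand-picked 85% threshold fires on every reading from the busy host.
    print(fixed_threshold_alerts(busy_host, 85))    # [0, 1, ..., 9]

    # A per-host baseline flags only the reading that is unusual for that host.
    print(rolling_baseline_alerts(quiet_host))      # [7]
    print(rolling_baseline_alerts(busy_host))       # []
```

In this toy data, no single fixed threshold suits both hosts: one
value misses a real spike and another floods the operator. A
per-host baseline is one simple way to reduce that guesswork, though
real telemetry (seasonality, trends, missing data) would need
considerably more care than this sketch shows.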

Businesses that want to receive the most benefit from AIOps should
focus first on building a firm foundation in process (ITIL) in order
to apply a base of intelligent algorithms to drive IT automation. An
effective AIOps strategy must envision both in-line learning and
reinforcement learning. For telemetry data (system generated),
in-line learning works best to draw out anomalies and event
correlations. But algorithms have limitations in understanding
complicated associations. Therefore, it is critical to have feedback
mechanisms built across the Service Management lifecycle. Such
training opportunities should be integrated with a solid process
model. Such a vision for AIOps will guide the investments made in
underlying systems and tools and prioritize where those investments
should be made first.

Some of the most promising areas to focus on with your AIOps strategy
include event & incident management streamlining, closed loop change
and configuration management, and incident response automation.
First, AIOps should eliminate the event vs. incident dichotomy and
drive down the cost of each Incident while increasing service
reliability. If an event occurs but isn’t captured as an incident,
does it really matter? As the number and complexity of services has
increased, managing the inflow of automated feeds to incident
management (i.e., monitoring tools) has become an increasingly
difficult challenge. Organizations must constantly make decisions
that balance a tradeoff: tuning the sensitivity of monitoring
thresholds so that support teams are not flooded, while still
ensuring nothing gets missed. Applying intelligent algorithms to do
the work of answering the ‘what is important’ question finally makes
it possible to be proactive, enabling Level 1 and 2 support engineers
to focus on other tasks that require that personal touch.

AIOps should make closed loop change and configuration management
possible. Even with the best discovery tools and the most refined and
controlled change management process, the rate of “unplanned” changes
in a modern data center will only increase over time. Engineers and
operations personnel must be empowered to make quick decisions when
emergencies occur, and configuration drift in an era of multi-cloud
and microservices is ‘designed-in’ to modern IT services. The goal
since the onset of the truly heterogeneous, distributed computing
infrastructure has been to have complete visibility into the
configuration and state of all underlying systems. If we design our
processes to require ‘completeness’
then they are designed for failure, and gaps in visibility compound
exponentially, ensuring that engineers will always revert to command
line tools and domain tools to pick through configuration data when
failures occur. By embedding learning algorithms into the change and
configuration management process, organizations can drive inferences
across diverse sources of data to:

• Quickly associate the root cause of an incident to the Incident so
  it can be addressed in real time.
• Take proactive measures to ensure stability when a high risk change
  is approved.
• Orchestrate problem resolution procedures to ensure that the right
  personnel are working on the right problem.

AIOps makes it possible to automate and orchestrate the response to
Incidents (availability or security) without anchoring the response
model to a fixed workflow, thereby allowing for a much higher degree
of automation in the environment. Traditionally, automation is tied
to fixed conditions that ‘trigger’ the automation, such as a
monitoring threshold or a selection of a provisioning request in a
service catalogue. With this approach, each nuanced automation
opportunity requires either an increasing set of conditional branches
on the root automation model or copies of automations that are
tailored to very specific input conditions. Each branch on an
automation model has similar fixed input conditions from the prior
step in the automation. By injecting learning algorithms into the
automation workflow we eliminate the need to manage and maintain
these fixed input conditions across a growing body of work, thus
eliminating the natural point of diminishing returns that every
organization has encountered in the depths of an IT automation
initiative.

The overall strategy for AIOps should be framed around 4 central
questions.
1. What layer and element of the ITSM process model might benefit
   most from deeper insights?
2. What sources of data are the most promising for anchoring a
   learning algorithm that could be trained to take action?
3. How can automated insights be fed back into the process and
   underlying systems and tools to take action?
4. How can the results of such action be fed back into the algorithm
   to make it smarter over time?

Tribal Genius (online:, works with our clients to establish an IT
Strategy that is deeply anchored in a Decision by Design (DBD)
approach. By conducting a series of vision and strategy sessions we
help our clients map out (a) their vision for AIOps, (b) the gaps in
their current data, tools, or process architecture, and (c) a roadmap
for moving them towards their AIOps vision. For Tribal Genius and our
clients, the greatest benefit of big data and
analytics lies in harnessing the data to feed intelligent algorithms
that can be directed to benefit their organizations. Tribal Genius
works with our strategic partners, like BMC, to help our joint
clients understand how best to harness the power of their ITSM
portfolio to hasten their path to the AIOps vision. Often that work
takes the form of identifying opportunities to better leverage
existing tools and close gaps in existing processes.

Enterprise, cloud-first companies across the world need AIOps more
than ever so they can stop fighting fires and focus on innovation. It
only takes a few minutes to begin the AIOps journey with Tribal
Genius, and our combined 30+ years of enterprise expertise can help
guide you on the journey to smarter, more efficient operations.

We look forward to helping businesses on their AIOps journey. Visit
our website at tribalgenius.com for more information. If you’d like
to request a demo or learn more about our AIOps vision, please email
us at:

[1] O’Donnell et al. (2012). “Turn Big Data Inward With IT
Analytics.” Forrester. Available Online.

[2] Cappelli (2013). “IT Operations Analytics: Big Data for the Data
Center.” Gartner IT Infrastructure & Operations Management Summit:
Orlando.

[3] Cappelli (2015). “Organizations Must Sequentially Implement the
Four Phases of ITOA to Maximize Investment.” Gartner. Available
Online.

[4] YMor. “IT operations Analytics, from rear view mirror to glass
globe.” Available Online.

[5] Big Panda (2016). “State of Monitoring 2016 - Full Report.”
Available Online.

[6] Khan and Sikes (2014). “IT under pressure: McKinsey Global Survey
results.” McKinsey&Company. Available Online.

[7] Enterprise Management Associates (2015). “Application Performance
Monitoring (APM), 2015 -- Industry Challenges, State of the Art, and
the Case for Unified Monitoring.” EMA Whitepaper. Available Online.

[8] Zurier, Steve (2016). “ITOA to AIOps: The next generation of
network analytics.” SearchNetworking by TechTarget. Available Online.

[9] Extrahop (2017). “Machine Learning Survey -- Hope or Hype for
IT?” Extrahop. Available Online.

[10] Numenta (2017). “NAB -- The Numenta Anomaly Detection
Benchmark.” Numenta. Available Online.