
AIOps: AI for IT Operators

Jian Zhang
Microsoft Azure
May 19th, 2021

Stanford University - EE392B


Agenda

• What’s AIOps
• AIOps at Azure
• Announcement

What’s AIOps

AIOps definition (Gartner)

Adoption has increased with the uptick of digital transformation

Business value: Improve engineering efficiency and service quality; Reduce COGS

Source: Gartner
AIOps Problem Space

• Detection
• Diagnosis
• Prediction
• Optimization

AIOps Products

Embedded AIOps features from cloud service providers
• Amazon
• Google
• Microsoft

AIOps product offering from more IT service providers
• IBM
• Splunk
• Cisco (AppDynamics)
• New Relic
• DataDog

Azure - a global cloud

Datacenters

100,000 miles of fiber optic and subsea cable

Azure Regions
AIOps at Azure

Motivation
Services Customers
• Scale
• Complexity
• Quality Engineering

Empower Building & Operating Services at Scale
AIOps Methodologies: From Data to Actions

Data
• Customer
• System
• DevOps process

Insights
• Detect
• Diagnose
• Predict
• Optimize

Actions
• Mitigate/Resolve
• Avert future pain
• Optimize resource allocation
• Improve architecture & process

Infusing AI into Systems & Operations

M. Fontoura et al., “Toward intelligent cloud platforms and AIOps”, AAAI-20
AI for System Resilience
Scenario – Platform resilience
• Reduce VM service interruption caused by physical host reboot

Approach
• Hardware failure prediction
• Memory leak detection
• Physical host recovery only pauses VM execution
• VM live migration

Result & Learnings
• VMs can survive host reboot and continue running
• Improved VM service availability

Project Tardigrade
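
The slide lists memory leak detection as part of the approach but does not say how leaks are detected. As a minimal sketch only, one simple heuristic is to fit a trend line to a host's memory usage samples and flag steady growth. The function names, thresholds, and sample values below are hypothetical and are not taken from the Azure system.

```python
from statistics import mean

def leak_slope(samples):
    """Least-squares slope of memory usage (MB) per sampling interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def looks_like_leak(samples, min_slope_mb=1.0, min_rising_fraction=0.8):
    """Flag steady growth: a positive trend and mostly non-decreasing steps.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    slope = leak_slope(samples)  # also validates the input length
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b >= a)
    rising_fraction = rising / (len(samples) - 1)
    return slope >= min_slope_mb and rising_fraction >= min_rising_fraction

# Hypothetical memory usage series in MB, one sample per interval.
steady = [500, 502, 499, 501, 500, 498, 501, 500]
leaky = [500, 510, 521, 529, 541, 550, 561, 570]

print(looks_like_leak(steady))  # False: no upward trend
print(looks_like_leak(leaky))   # True: grows by roughly 10 MB per interval
```

A production detector would also have to separate leaks from legitimate load growth, which this sketch does not attempt.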
AI for DevOps (Sample Solutions)

Dev Ops

Picture courtesy of addteq.com

Incident Management Procedure

Z. Chen et al., “AIOps Innovations of Incident Management for Cloud Services”, AAAI-20

KEY TECH
• Robust anomaly detection with ensemble weighting
• Temporal/spatial-aware correlation with exponential decay

[Figure: deployment assessment workflow. Labels: manual deployment assessment; rollout records; deployment events; various health signals; anomaly detection; anomalies; features; auto assessment; decision "Any anomaly caused by rollout?" (Yes/No); deployment teams.]

Z. Li et al., “Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure”, in NSDI’20
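
The KEY TECH bullet on temporal/spatial-aware correlation with exponential decay can be illustrated with a small sketch. It assumes that an anomaly counts more toward blaming a rollout the sooner it follows that rollout, and only when it occurs on a node the rollout touched. The half-life, the data layout, and all names are hypothetical; this is not the scoring used by Gandalf itself.

```python
import math

def decay_weight(delay_minutes, half_life_minutes=60.0):
    """Weight in (0, 1] that halves every `half_life_minutes`."""
    return math.exp(-math.log(2) * delay_minutes / half_life_minutes)

def blame_scores(deployments, anomalies, half_life_minutes=60.0):
    """Sum decayed weights of anomalies that follow each deployment.

    deployments: list of (name, deploy_time_minutes, set_of_nodes)
    anomalies:   list of (anomaly_time_minutes, node)
    An anomaly counts only if it is on a node the deployment touched
    (a crude stand-in for spatial awareness) and happens at or after it.
    """
    scores = {}
    for name, deploy_time, nodes in deployments:
        score = 0.0
        for anomaly_time, node in anomalies:
            delay = anomaly_time - deploy_time
            if node in nodes and delay >= 0:
                score += decay_weight(delay, half_life_minutes)
        scores[name] = score
    return scores

# Hypothetical rollouts and anomalies, times in minutes.
deployments = [
    ("rollout-A", 0, {"node1", "node2"}),
    ("rollout-B", 100, {"node3"}),
]
anomalies = [(10, "node1"), (20, "node2"), (400, "node3")]

scores = blame_scores(deployments, anomalies)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

In this example rollout-A ranks first because two anomalies follow it within minutes, while the only anomaly on rollout-B's node arrives five half-lives later and contributes very little.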
Möbius: Online Anomaly Detection

KEY TECH
• Model recommendation from a model pool based on workload characteristics
• Feedback loop for continuous improvement and adaptation to dynamic workload

J. Zhang et al., “Möbius: Online Anomaly Detection and Diagnosis”, KDD’18
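
The two KEY TECH bullets above (recommending a model from a pool based on workload characteristics, and a feedback loop) can be sketched as follows. This is a toy illustration under stated assumptions: the two detectors, the "spiky"/"smooth" workload label, and the accuracy-based ranking are all hypothetical choices, not the models or the recommendation logic described in the Möbius paper.

```python
from statistics import mean, median, pstdev

def zscore_detector(history, value, threshold=3.0):
    """Flag values far from the mean, measured in standard deviations."""
    sigma = pstdev(history) or 1e-9
    return abs(value - mean(history)) / sigma > threshold

def mad_detector(history, value, threshold=6.0):
    """Flag values far from the median, measured in median absolute deviations."""
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1e-9
    return abs(value - med) / mad > threshold

# Hypothetical model pool: name -> detector function.
MODEL_POOL = {"zscore": zscore_detector, "mad": mad_detector}

def workload_kind(history):
    """Crude workload characteristic: 'spiky' if the peak dwarfs the median."""
    return "spiky" if max(history) > 3 * median(history) else "smooth"

class Recommender:
    def __init__(self):
        # (workload kind, model name) -> [correct verdicts, total verdicts]
        self.stats = {}

    def record_feedback(self, kind, model, was_correct):
        """Feedback loop: store whether a model's verdict matched the label."""
        counts = self.stats.setdefault((kind, model), [0, 0])
        counts[0] += int(was_correct)
        counts[1] += 1

    def recommend(self, kind):
        """Pick the model with the best smoothed accuracy for this workload kind."""
        def accuracy(model):
            correct, total = self.stats.get((kind, model), (0, 0))
            return (correct + 1) / (total + 2)  # Laplace smoothing
        return max(MODEL_POOL, key=accuracy)

# Hypothetical metric history containing one past spike.
history = [10, 11, 9, 10, 95, 10, 11, 10, 9, 10]
kind = workload_kind(history)

recommender = Recommender()
labeled_points = [(12, False), (40, True)]  # (value, operator-confirmed anomaly?)
for name, detector in MODEL_POOL.items():
    for value, is_anomaly in labeled_points:
        verdict = detector(history, value)
        recommender.record_feedback(kind, name, verdict == is_anomaly)

print(kind, recommender.recommend(kind))  # spiky mad
```

The past spike inflates the standard deviation, so the z-score detector misses the value 40, while the median-based detector catches it. After feedback, the recommender prefers the median-based model for this workload kind.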
emerging issues from an enormous number of attribute combinations

Problem Formulation

Objective Function
Best from Multiple Selection

J. Gu, C. Luo et al., “Efficient Incident Identification from Multi-dimensional Issue Reports via Meta-heuristic Search”, ESEC/FSE’20
AI/ML Platform - Resource Central
ML and prediction-serving system for improving resource management

RC clients: Platform resource managers
• VM scheduling
• Cluster selection
• Power oversubscription
• Server maintenance
• VM rightsizing recommendation

R. Bianchini, M. Fontoura et al., “Toward ML-Centric Cloud Platforms”, Communications of the ACM, Feb 2020
Resource Central

Video
AIOps in Azure: Summary
• AIOps is critical for digital transformation and an emerging innovation area
• AIOps is a multidisciplinary research area involving software engineering, systems, machine learning, big data and visualization
• AIOps is comprehensive: from making the system smart and resilient to enhancing developer efficiency and improving customer experience
• AIOps is what makes modern clouds scalable and efficient enough to support the next generation of computing
• AIOps calls for close collaboration between industry and academia

Announcement

https://cloudintelligenceworkshop.org Registration deadline: May 23rd, 2021
https://aiotworkshop.github.io Submission deadline: May 24th, 2021
References
• Rex: Preventing Bugs and Misconfiguration in Large Services Using Correlated Change Analysis, Sonu Mehta, Ranjita Bhagwan et al., NSDI’20
• Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure, Z. Li et al., NSDI’20
• Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services, Jacob R. Lorch et al., NSDI’15
• Efficient Incident Identification from Multi-dimensional Issue Reports via Meta-heuristic Search, Jiazhen Gu, Chuan Luo et al., ESEC/FSE’20
• Toward ML-Centric Cloud Platforms, Ricardo Bianchini, Marcus Fontoura et al., Communications of the ACM, 2020
