Jian Zhang
Microsoft Azure
May 19th, 2021
• What’s AIOps
• AIOps at Azure
• Announcement
What’s AIOps
Business value: improve engineering efficiency and service quality; reduce COGS
Source: Gartner
AIOps Problem Space
• Detection
• Diagnosis
• Prediction
• Optimization
AIOps Products
Motivation
Azure - a global cloud: Datacenters, Services, Customers
• Scale
• Complexity
• Quality Engineering
Infusing AI into Systems & Operations
M. Fontoura et al., “Toward intelligent cloud platforms and AIOps”, AAAI-20
AI for System Resilience
Scenario – Platform resilience
• Reduce VM service interruption caused
by physical host reboot
Approach
• Hardware failure prediction
• Memory leak detection
• Physical host recovery only pauses VM
execution
• VM live migration
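One of the approaches above, memory leak detection, can be illustrated with a simple trend test. This is a minimal sketch and not the detector used in Azure: it assumes evenly spaced memory samples from a host, and the names `fit_trend` and `looks_like_leak` and both thresholds are hypothetical. It flags a leak when memory usage grows steadily, that is, when a least-squares line has a large positive slope and fits the samples well.

```python
from statistics import mean


def fit_trend(samples):
    """Least-squares line through evenly spaced samples.

    Returns (slope per sample interval, R^2 goodness of fit).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(samples)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    syy = sum((y - y_bar) ** 2 for y in samples)
    slope = sxy / sxx
    # A perfectly flat series has no variance to explain; report R^2 = 0.
    r2 = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r2


def looks_like_leak(samples, min_slope=1.0, min_r2=0.9):
    """Assumed rule: steady growth means a steep slope AND a tight linear fit."""
    slope, r2 = fit_trend(samples)
    return slope >= min_slope and r2 >= min_r2


leaking = [100 + 5 * i for i in range(20)]       # grows 5 MB per interval
steady = [100, 102, 99, 101, 100, 98, 101, 100]  # noisy but flat
print(looks_like_leak(leaking), looks_like_leak(steady))
```

A host flagged this way could then be handled by the remaining approaches on the slide, for example by live-migrating its VMs before the host is rebooted.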
Dev Ops
Incident Management Procedure
Z. Chen et al., “AIOps Innovations of Incident Management for Cloud Services”, AAAI-20
KEY TECH
• Robust anomaly detection with ensemble weighting
• Temporal/spatial-aware correlation with exponential decay
[Figure: deployment assessment flow. Labels: Manual deployment assessment; Auto Assessment; Rollout Records; Various health signals; Deployment events; Features; Anomaly Detection; Deployment teams; decision “Any anomaly caused by rollout?” (Yes / No)]
Z. Li et al., “Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure”, in NSDI’20
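The temporal correlation with exponential decay listed under KEY TECH can be illustrated as follows. This is a sketch under stated assumptions, not Gandalf's actual formulation: the function `blame_scores`, the half-life parameter, and the use of minutes as the time unit are all hypothetical. The idea shown is that a rollout deployed shortly before an anomaly is more suspicious than one deployed long ago, so each rollout's score decays exponentially with its age at the time of the anomaly.

```python
import math


def blame_scores(anomaly_time, rollouts, half_life_min=60.0):
    """Score each rollout as a suspect for an anomaly.

    rollouts: iterable of (name, deploy_time) pairs, times in minutes.
    A rollout's score halves for every `half_life_min` minutes that
    passed between its deployment and the anomaly.
    """
    decay = math.log(2) / half_life_min
    scores = {}
    for name, deployed_at in rollouts:
        age = anomaly_time - deployed_at
        if age < 0:
            continue  # deployed after the anomaly, so it cannot be the cause
        scores[name] = scores.get(name, 0.0) + math.exp(-decay * age)
    return scores


rollouts = [("A", 0), ("B", 100), ("C", 130)]
scores = blame_scores(anomaly_time=120, rollouts=rollouts)
suspect = max(scores, key=scores.get)
print(suspect, scores)
```

Here rollout B, deployed 20 minutes before the anomaly, outranks rollout A, deployed two hours before it, and rollout C is excluded because it was deployed afterwards.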
Möbius: Online Anomaly Detection
KEY TECH
• Model recommendation from a model pool based on workload characteristics
• Feedback loop for continuous improvement and adaptation to dynamic workload
J. Zhang et al., “Möbius: Online Anomaly Detection and Diagnosis”, KDD’18
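The two KEY TECH points above, recommending a model from a pool and adapting through feedback, can be sketched together. Everything here is assumed for illustration and is not taken from Möbius: the model names in `MODEL_POOL`, the coefficient-of-variation rule in `workload_kind`, and the `ModelRecommender` class are all hypothetical. The sketch keeps a success rate per workload kind and model, recommends the model with the best rate, and updates that rate as feedback arrives.

```python
from collections import defaultdict
from statistics import mean, pstdev

MODEL_POOL = ("zscore", "median_mad", "seasonal_diff")  # hypothetical model names


def workload_kind(series):
    """Assumed workload characteristic: bursty if coefficient of variation > 0.5."""
    mu = mean(series)
    cv = pstdev(series) / abs(mu) if mu else float("inf")
    return "bursty" if cv > 0.5 else "stable"


class ModelRecommender:
    def __init__(self, pool=MODEL_POOL):
        self.pool = tuple(pool)
        # [helpful, total] feedback counts per (workload kind, model).
        # Starting at [1, 2] gives every model a neutral 0.5 prior.
        self.counts = defaultdict(lambda: [1, 2])

    def _rate(self, kind, model):
        helpful, total = self.counts[(kind, model)]
        return helpful / total

    def recommend(self, series):
        """Pick the model with the best feedback rate for this workload kind."""
        kind = workload_kind(series)
        return max(self.pool, key=lambda m: self._rate(kind, m))

    def feedback(self, series, model, helpful):
        """Record whether the model's alerts on this workload were useful."""
        counts = self.counts[(workload_kind(series), model)]
        counts[0] += int(helpful)
        counts[1] += 1


rec = ModelRecommender()
stable = [10, 11, 10, 9, 10, 11]
first = rec.recommend(stable)
rec.feedback(stable, first, helpful=False)  # operators marked its alerts as noise
print(first, "->", rec.recommend(stable))
```

Feedback is stored per workload kind, so a model that performs poorly on stable workloads can still be recommended for bursty ones.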
emerging issues from an enormous number of attribute combinations
Problem Formulation
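To make the search space concrete, the sketch below enumerates attribute combinations in issue reports and flags those that are far more frequent in recent reports than in a baseline. It is an assumption-laden illustration: it uses exhaustive enumeration, not the meta-heuristic search of the cited paper, and the function names, thresholds, smoothing, and report fields are hypothetical. Exhaustive enumeration is only feasible for a few attributes, which is the scaling problem the slide points at.

```python
from collections import Counter
from itertools import combinations


def attribute_combinations(report, max_size=2):
    """Yield every combination of up to max_size (attribute, value) pairs."""
    items = sorted(report.items())
    for size in range(1, max_size + 1):
        yield from combinations(items, size)


def emerging_combinations(baseline, recent, min_count=3, min_ratio=2.0):
    """Return (combination, recent count, frequency ratio), highest ratio first."""
    base = Counter(c for r in baseline for c in attribute_combinations(r))
    now = Counter(c for r in recent for c in attribute_combinations(r))
    hits = []
    for combo, count in now.items():
        # Add-one smoothing so combinations unseen in the baseline
        # do not cause a division by zero.
        base_freq = (base[combo] + 1) / (len(baseline) + 1)
        ratio = (count / len(recent)) / base_freq
        if count >= min_count and ratio >= min_ratio:
            hits.append((combo, count, ratio))
    return sorted(hits, key=lambda h: (-h[2], -len(h[0])))


baseline = [
    {"region": "us", "os": "linux"},
    {"region": "eu", "os": "windows"},
    {"region": "us", "os": "windows"},
    {"region": "asia", "os": "linux"},
]
recent = [
    {"region": "eu", "os": "linux"},
    {"region": "eu", "os": "linux"},
    {"region": "eu", "os": "linux"},
    {"region": "us", "os": "windows"},
]
for combo, count, ratio in emerging_combinations(baseline, recent):
    print(dict(combo), count, round(ratio, 2))
```

In this toy data neither `region=eu` nor `os=linux` alone crosses the threshold, but their combination does, which is why single attributes cannot simply be inspected one at a time.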
J. Gu, C. Luo et al., “Efficient Incident Identification from Multi-dimensional Issue Reports via Meta-heuristic Search”, ESEC/FSE’20
AI/ML Platform - Resource Central
ML and prediction-serving system for improving resource management
• VM scheduling
• Cluster selection
• Power oversubscription
• Server maintenance
• VM rightsizing recommendation
R. Bianchini, M. Fontoura et al., “Toward ML-Centric Cloud Platforms”, Communications of the ACM, Feb 2020
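How a prediction-serving system can feed a resource manager is sketched below for the VM scheduling case. The slides do not describe Resource Central's models or API, so every name here (`Server`, `predict_peak_utilization`, `place_vm`) is hypothetical, and a nearest-rank 95th percentile of past utilization stands in for a served ML prediction. The scheduler reserves a VM's predicted peak usage instead of its full core allocation, which lets it oversubscribe servers safely.

```python
import math
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cores: int
    predicted_load: float = 0.0  # sum of predicted peak core usage of placed VMs


def predict_peak_utilization(history, default=1.0):
    """Stand-in for a served model: nearest-rank 95th percentile of past utilization.

    history: past CPU utilization fractions (0.0 to 1.0) of similar VMs.
    """
    if not history:
        return default  # no history: conservatively assume full utilization
    ranked = sorted(history)
    index = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return ranked[index]


def place_vm(servers, vm_cores, history):
    """First-fit placement using predicted peak demand instead of allocated cores."""
    demand = vm_cores * predict_peak_utilization(history)
    for server in servers:
        if server.predicted_load + demand <= server.cores:
            server.predicted_load += demand
            return server.name
    return None  # no server can host the VM at its predicted demand


servers = [Server("s1", cores=8)]
history = [0.2, 0.3, 0.5, 0.4]
placements = [place_vm(servers, 4, history) for _ in range(5)]
print(placements)
```

With a predicted peak of 0.5, four 4-core VMs fit on an 8-core server, where allocation-based placement would fit only two; the fifth request is rejected.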
Resource Central (video)
AIOps in Azure: Summary
• AIOps is critical for digital transformation and an emerging innovation
area
• AIOps is a multidisciplinary research area involving software engineering,
systems, machine learning, big data and visualization
• AIOps is comprehensive: from making the system smart and resilient to
enhancing developer efficiency and improving customer experience
• AIOps is what makes modern clouds scalable and efficient enough to support
the next generation of computing
• AIOps calls for close collaboration between industry and academia
Announcement
https://cloudintelligenceworkshop.org Registration deadline: May 23rd, 2021
https://aiotworkshop.github.io Submission deadline: May 24th, 2021
References
• Rex: Preventing Bugs and Misconfiguration in Large Services Using
Correlated Change Analysis, Sonu Mehta, Ranjita Bhagwan et al., NSDI’20
• Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment
in Large-Scale Cloud Infrastructure, Z. Li et al., NSDI’20
• Tardigrade: Leveraging Lightweight Virtual Machines to Easily and
Efficiently Construct Fault-Tolerant Services, Jacob R. Lorch et al., NSDI’15
• Efficient Incident Identification from Multi-dimensional Issue Reports via
Meta-heuristic Search, Jiazhen Gu, Chuan Luo et al., ESEC/FSE’20
• Toward ML-Centric Cloud Platforms, Ricardo Bianchini, Marcus Fontoura et
al., Communications of the ACM, 2020