Causal Reasoning Workshop Armen RCA NOOK

Root Cause Analysis (RCA) for Future Networks:
Technology, Vision and Challenges
Armen Aghasaryan, Bell Labs, Paris-Saclay
Shonan Seminar on Causal Reasoning in Systems

24-27 June, 2019
1 © Nokia 2016
2019
Management of Future Networks with Root Cause Analysis
Introduction
The digital era creates an unprecedented need for real-time monitoring and Root Cause Analysis (RCA) of complex systems deployed at global network scale.
RCA duality character: RCA is applied to methodically identify and correct the root causes of events, as opposed to simply addressing their symptomatic result.
RCA impact perspective: “Root cause” may be described as the point in a causal chain where applying a corrective action or intervention would prevent the problem from occurring.
[Figure: explain hidden causes]
Correlation is not Causation
Model-based approach
• Model explains the occurrence, manifestation, and propagation of fault situations
• Model learning techniques: interventional / observational / knowledge-based
[Figure: tiles (tile1 to tile4) linking the visible plane and the hidden plane]
RCA for Cloudified Networks
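[Figure: a performance degradation (EFFECT) observed on the client request/response path is traced back along a PROPAGATION TRAJECTORY through the App Layer (components A to E), the Virtual Resource Layer (VMs such as vm-A and vm-B) and the Hardware Resource Layer, where the CAUSE is located; question marks denote the unknown states along the way]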
High volume of measurements and alarms to be handled at different layers

• Relevant alarm definitions and measurements not trivial to identify
Ambiguity in the interpretation of observations by the human operator
• Single fault may produce multiple alarms, a given alarm can be caused by different fault conditions
• The number of possible failures is expected to grow exponentially (terminals: IoT; network components: 5G)
Delayed understanding of faults leads to negative impact on businesses

RCA for Cloudified Networks
Example of a Web app
[Figure: two hosts, Host A running VM-A-1 (App) and Host B running VM-B-1 (DB). Observable behavior (alarms): App Delay alarm, Request rate alarm, DB Delay alarm, Page I/O rate alarm, Host Bw alarm, Host CPU alarm. Hidden behavior (root causes and secondary causes): App Extra demand, DB Extra demand, Host CPU overloaded, Host NIC saturated]
Symptoms: (web) application performance degradation - generation of infrastructure and application level alarms
Multiple possible underlying causes

- User request peak or permanent increase (App extra demand)
- Underlying resource overload by other tenants (host network interface, host CPU saturated)
- Remote fault propagation via app dependencies (e.g. DB delay)
- Remote host failures, etc.

Fault model-based approach
[Figure: sources of the model: standards, product specifications, system structure, expert knowledge, data (Machine Learning)]
Setting up the RCA design as a model-based fault diagnosis using a “model inversion” approach
• Fault model replicates the fault behavior of the network
- representation of initial faults, fault propagation, and alarm generation
- different types of models representing the interplay between observed variables and hidden states
• Labelled automata, Petri Nets, Pattern matching, Hidden Markov models, Bayesian Networks, …
• Diagnosis algorithms invert the model
- an inverted model is deployed online for run-time RCA
- without reconstructing the global state (to avoid state space explosion)
[Figure: Logical view of the RCA design process: fault scenarios feed the fault model, which yields generated alarms; observed alarms feed the diagnosis process (inference), which yields the diagnosed fault scenario(s)]
Framework : from model discovery to fault diagnosis
• Automated (fault) model discovery modes
- Interventional learning: stimulus-based App Profiling
- Observational learning: lagged correlation analysis & pattern learning
- Self-modelled data/systems, e.g. perf. logs allowing auto-extraction of a Bayesian Network structure
• Model Inversion: from symptoms to causes
- Identification of possible causes - provides the most likely explanations given the observed alarms.
- Using Bayesian network inference, tile-based trajectory composition / Petri Net unfolding, Viterbi-like algorithms, etc.
[Figure: model discovery modes (interventional learning, observational learning, parameter learning) positioned by environment actionability (sandbox, controlled and uncontrolled environments) and by system description availability (black-box vs self-modeled systems), followed by model inversion (diagnosis inference). Pipeline: intervention generation, correlation analysis, causal modeling, symptom explanations]
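As a rough illustration of "model inversion", the sketch below computes the posterior probability of each hidden cause given a set of observed alarms by brute-force enumeration over a tiny noisy-OR Bayesian network. The cause names, alarm names, priors and strengths are all invented for the example and are not taken from the slides; a real deployment would use a proper inference engine rather than enumeration.

```python
from itertools import product

# Hypothetical prior probability that each hidden cause is active.
PRIOR = {"app_extra_demand": 0.05, "host_cpu_overloaded": 0.02}

# Hypothetical noisy-OR strengths: P(alarm raised | only this cause active).
STRENGTH = {
    "app_delay_alarm": {"app_extra_demand": 0.8, "host_cpu_overloaded": 0.7},
    "host_cpu_alarm": {"host_cpu_overloaded": 0.9},
    "request_rate_alarm": {"app_extra_demand": 0.9},
}
LEAK = 0.01  # probability that an alarm fires with no modeled cause


def p_alarm(alarm, active):
    """Noisy-OR probability that `alarm` is raised given the active causes."""
    p_off = 1.0 - LEAK
    for cause, strength in STRENGTH[alarm].items():
        if cause in active:
            p_off *= 1.0 - strength
    return 1.0 - p_off


def posterior(observed):
    """Map each cause to P(cause active | observed alarms).

    `observed` maps alarm name -> True (raised) or False (not raised).
    """
    causes = list(PRIOR)
    joint = {}
    for states in product([False, True], repeat=len(causes)):
        active = {c for c, on in zip(causes, states) if on}
        p = 1.0
        for c, on in zip(causes, states):
            p *= PRIOR[c] if on else 1.0 - PRIOR[c]
        for alarm, raised in observed.items():
            pa = p_alarm(alarm, active)
            p *= pa if raised else 1.0 - pa
        joint[states] = p
    z = sum(joint.values())
    return {
        c: sum(p for s, p in joint.items() if s[i]) / z
        for i, c in enumerate(causes)
    }


if __name__ == "__main__":
    obs = {"app_delay_alarm": True, "host_cpu_alarm": True,
           "request_rate_alarm": False}
    for cause, prob in posterior(obs).items():
        print(f"{cause}: {prob:.3f}")
```

With an application delay alarm and a host CPU alarm but no request-rate alarm, the host-level cause dominates the explanation, which is the kind of symptom-to-cause ranking the slide describes.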
Interventional Learning
Stimulus-based App profiling
[Pipeline: intervention generation, correlation analysis, causal modeling, symptom explanations]
Stimulus-based App Profiling Sandbox
Goal: Automated fault model discovery
Approach: Stimulus creation framework for virtualized distributed apps
- Interventional learning of causal Tiles via systematic resource perturbations and their impact analysis
- Algorithm for causal chaining of fault trajectories
Intervention Generation
“Stimulus” creation: system-level fine-tuned control of available computing resources on a given app node (CPU, RAM, Disk I/O, Ntwk I/O)
Discovery of causalities: analysis of the reactions of other app nodes to the current stimulus
Methodical elucidation of causal dependencies
[Figure: test application with a load balancer (LB), two web server nodes (WS1, WS2) and a database (DB); input load and response time are measured at the LB; a CPU stimulus is applied on the WS1 node and the stimulus propagation is observed through the network traffic]

Stimulus-based modeling
Correlation analysis
One datasheet per stimulus
[Figure: correlation matrix over the meters of DB, WS1, WS2, LB and hosts, with the WS1-CPU and LB-response-time entries highlighted; positive and negative correlations are distinguished]
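A per-stimulus datasheet of this kind can be sketched as follows: the stimulated meter is correlated (Pearson) with every other meter recorded during the run, and only strong positive or negative correlations are kept. The meter names, sample values and the 0.8 threshold are assumptions made for the illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (sx * sy)


def datasheet(stimulus, meters, threshold=0.8):
    """Return {meter: r} for meters whose |r| with the stimulus >= threshold."""
    sheet = {}
    for name, series in meters.items():
        r = pearson(stimulus, series)
        if abs(r) >= threshold:
            sheet[name] = r
    return sheet


if __name__ == "__main__":
    # Hypothetical run: CPU available to WS1 is stepped down from 100% to 20%.
    ws1_cpu_available = [100, 80, 60, 40, 20]
    recorded = {
        "LB.response_time": [10, 12, 15, 21, 35],  # grows as CPU shrinks
        "WS2.cpu_util": [30, 31, 29, 30, 30],      # essentially unaffected
    }
    for meter, r in datasheet(ws1_cpu_available, recorded).items():
        sign = "positive" if r > 0 else "negative"
        print(f"{meter}: r={r:.2f} ({sign})")
```

The sign of each retained coefficient is what distinguishes the positive from the negative correlations in the datasheet.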
Causal Modeling: Interpretation of correlations given the stimulus
[Figure: correlation matrix over the meters of DB, WS1 and WS2. Current stimulus is on DB: network Bw down]

Tile extraction
NETW NETW
Hidden faults (primary or propagated)
CPU CPU
- we stimulate (and know) the ground truth MEM
MEM
• Network Bw, CPU, Memory, Disk I/O DISK IO
DISK IO
• Input Load
LOAD
Hidden states – alert enablers network.outgoing.bytes.rate network.outgoing…

network.incoming.bytes.rate network.incoming…
- enable alarm generation
cpu_util cpu_util
- defined by OpenStack & proprietary meters
Visible events - alarms Alarm (network.outgoing.bytes.rate )

- defined on the basis of the existing meters Alarm (cpu_util)
Alarm (network.outgoing…)
Alarm (cpu_util )

Tile extraction : example of stimulus on DB network bandwidth down
resourceType  resourceID    meterType                    CorrCoeff  resourceType  resourceID  meterType
DB            db_server_v2  network.outgoing.bytes.rate  0.87       WS            WebServer1  network.incoming.bytes.rate
DB            db_server_v2  network.outgoing.bytes.rate  0.85       WS            WebServer1  cpu_util
[Figure: extracted tiles. On DB: primary cause occurrence "Network BW down" (NETW), local propagation to the directly impacted meters out.bytes.rate and out.packet.rate, then Alarm(out.bytes.rate). On WS: primary cause occurrence "Insufficient CPU" or remote propagation DB -> WS (NETW [DB] to CPU), local propagation to the directly observed meter cpu_util, then Alarm(cpu_util)]
Diagnosis
• Interventional learning of causal Tiles:
stimulus-reaction transitions (tiles) represent causal relationships between resource stimuli and the reactions at local & remote nodes
• Causality chaining via net unfolding:
the alarm flow guides the on-line composition of reusable tiles into hypothetical fault propagation trajectories
• Fault trajectory inference & diagnosis
Iterative selection of the best explanations (trajectories) using Max likelihood estimation algorithms (Viterbi algorithm variants)
[Figure: tiles composed into fault trajectories]
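To make the "Viterbi algorithm variants" reference concrete, here is the textbook Viterbi algorithm on a small hidden Markov model, where hidden states stand for fault conditions and observations stand for alarms. This is the standard HMM formulation, not the tile-based variant used in the talk, and every state name and probability below is a made-up value.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state path, its probability)."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (prob * emit_p[s][o], prev)
        best.append(layer)
    # Backtrack from the most likely final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical model: a DB network fault tends to propagate to WS CPU shortage.
STATES = ["ok", "db_netw_down", "ws_cpu_short"]
START = {"ok": 0.8, "db_netw_down": 0.1, "ws_cpu_short": 0.1}
TRANS = {
    "ok": {"ok": 0.8, "db_netw_down": 0.1, "ws_cpu_short": 0.1},
    "db_netw_down": {"ok": 0.1, "db_netw_down": 0.5, "ws_cpu_short": 0.4},
    "ws_cpu_short": {"ok": 0.1, "db_netw_down": 0.1, "ws_cpu_short": 0.8},
}
EMIT = {
    "ok": {"none": 0.9, "alarm_out_bytes": 0.05, "alarm_cpu_util": 0.05},
    "db_netw_down": {"none": 0.1, "alarm_out_bytes": 0.8, "alarm_cpu_util": 0.1},
    "ws_cpu_short": {"none": 0.1, "alarm_out_bytes": 0.1, "alarm_cpu_util": 0.8},
}

if __name__ == "__main__":
    alarms = ["none", "alarm_out_bytes", "alarm_cpu_util"]
    trajectory, likelihood = viterbi(alarms, STATES, START, TRANS, EMIT)
    print(trajectory, likelihood)
```

The returned path plays the role of a fault trajectory: the sequence of hidden conditions that best explains the observed alarm flow under the model.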

Observational learning
Correlation and Causality Analysis with CloudBand (CB) alerts
[Pipeline: intervention generation, correlation analysis, causal modeling, symptom explanations]
Causality inference from CB alert logs
Goal: automatic inference of correlation & causality rules from observed alert logs
Approach:
- Optimized computation of type-based correlations
- Causality inference based on time lagged correlations
Implemented by OpenStack Vitrage
Presented at OpenStack Summit 2017, Boston
[Figure: alert time-series on distinct resources, then instance-based correlations, then type-based correlations, then inference of causal relationships]
Data transformation pipeline
Correlation Analysis
Datasets from CB:
- e.g. Cloud platform @Naperville: ~ 14K alarms, 1.5K distinct resources
Computing correlations between alert time series:
- identified by <ResType x ResId x AlertType>
- using co-occurrence measures: Jaccard, Pearson, …
- filtering according to topological constraints
[Figure: alert activation state over time for two alert series x and y, each identified by ResType x ResId x AlertType; pairs with Corr(x,y) > threshold are retained]
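The instance-level step can be sketched as below: each alert series is a binary activation vector keyed by (ResType, ResId, AlertType), pairs are scored with the Jaccard index, and an optional topology filter discards pairs of unrelated resources. The sample series, the resource identifiers and the `related` set are hypothetical.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard index of two binary activation series of equal length."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def correlated_pairs(series, threshold, related=None):
    """Return [(key_x, key_y, score)] for pairs scoring above the threshold.

    `series` maps (res_type, res_id, alert_type) -> list of 0/1 states.
    `related`, if given, is a set of frozensets {res_id_x, res_id_y} naming
    the resource pairs allowed by the topology.
    """
    pairs = []
    for kx, ky in combinations(sorted(series), 2):
        if related is not None and frozenset({kx[1], ky[1]}) not in related:
            continue
        score = jaccard(series[kx], series[ky])
        if score > threshold:
            pairs.append((kx, ky, score))
    return pairs


if __name__ == "__main__":
    alerts = {
        ("HOST", "h1", "HOST_HIGH_CPU_LOAD"): [0, 1, 1, 1, 0, 0, 1, 1],
        ("MACHINE", "vm1", "VM_CPU_SUBOPTIMAL_PERF."): [0, 0, 1, 1, 1, 0, 1, 1],
        ("MACHINE", "vm9", "FAILED_TO_GET_METRICS"): [1, 0, 0, 0, 0, 1, 0, 0],
    }
    topology = {frozenset({"h1", "vm1"})}  # vm1 is hosted on h1
    for kx, ky, score in correlated_pairs(alerts, 0.6, topology):
        print(kx, ky, round(score, 2))
```

Filtering on topology before scoring keeps the number of compared pairs far below the quadratic worst case, which matters at the dataset sizes quoted on the slide.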
Correlation Analysis (cont.)
Identify frequent co-occurrence patterns in terms of Types
• using Spark computing
[Figure: the sparse representation of the correlation matrix (columns: Resource Type, Resource Id, Alert Type, Corr. Coeff., Resource Type, Resource Id, Alert Type) is grouped by types (resourceType x alertType on both sides) with aggregation functions (counter, mean) and threshold constraints, giving the inferred set of co-occurrence rules (columns: Resource Type, Alert Type, Agg. Score, Resource Type, Alert Type)]
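The grouping of instance-level correlations into type-level co-occurrence rules can be sketched in plain Python as follows. The slides mention Spark for this step; the version below only illustrates the logic (drop the resource identifiers, then aggregate with a counter and a mean), and the sample rows and the `min_count` and `min_mean` parameters are assumptions.

```python
from collections import defaultdict


def type_rules(rows, min_count=1, min_mean=0.0):
    """Aggregate instance-level correlations into type-level rules.

    Each row is (res_type_x, res_id_x, alert_type_x, coeff,
                 res_type_y, res_id_y, alert_type_y).
    Returns {(res_type_x, alert_type_x, res_type_y, alert_type_y):
             (count, mean)} for groups meeting both thresholds.
    """
    groups = defaultdict(list)
    for rtx, _idx, atx, coeff, rty, _idy, aty in rows:
        groups[(rtx, atx, rty, aty)].append(coeff)
    rules = {}
    for key, coeffs in groups.items():
        count, mean = len(coeffs), sum(coeffs) / len(coeffs)
        if count >= min_count and mean >= min_mean:
            rules[key] = (count, mean)
    return rules


if __name__ == "__main__":
    sparse_matrix = [
        ("HOST", "h1", "HOST_CEPH_PROBLEM", 0.9,
         "MACHINE", "vm1", "VM_STORAGE_SUBOPTIMAL_PERF."),
        ("HOST", "h2", "HOST_CEPH_PROBLEM", 0.8,
         "MACHINE", "vm7", "VM_STORAGE_SUBOPTIMAL_PERF."),
        ("HOST", "h1", "HOST_HIGH_CPU_LOAD", 0.98,
         "MACHINE", "vm1", "VM_CPU_SUBOPTIMAL_PERF."),
    ]
    for key, (count, mean) in type_rules(sparse_matrix, min_count=2).items():
        print(key, count, round(mean, 2))
```

The (count, mean) pair corresponds to the Count and Mean columns reported for each rule in the examples of this deck.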

Causal analysis
Transform co-occurrence rules into causality rules
[Figure: symmetric co-occurrence rules (columns: Resource Type, Alert Type, Agg. Value, Resource Type, Alert Type) go through lagged correlation and sensitivity tests, combined with expert knowledge on Resource & Alert Type semantics (if available), to produce suggested causality rules (→); unresolved co-occurrences (~) are maintained without causality]
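One simple way to picture a lagged correlation test is shown below: the co-occurrence of x at time t with y at time t + lag is compared with the reverse direction, and a direction is suggested only when one side clearly wins. The slides do not detail the actual test, so the scoring function, `max_lag` and `margin` are assumptions chosen for the illustration.

```python
def lagged_jaccard(x, y, lag):
    """Jaccard index between x[t] and y[t + lag] over the overlapping window."""
    pairs = list(zip(x[:len(x) - lag], y[lag:]))
    both = sum(1 for a, b in pairs if a and b)
    either = sum(1 for a, b in pairs if a or b)
    return both / either if either else 0.0


def suggest_direction(x, y, max_lag=3, margin=0.2):
    """Return 'x -> y', 'y -> x', or 'x ~ y' when no direction stands out."""
    forward = max(lagged_jaccard(x, y, lag) for lag in range(1, max_lag + 1))
    backward = max(lagged_jaccard(y, x, lag) for lag in range(1, max_lag + 1))
    if forward - backward > margin:
        return "x -> y"
    if backward - forward > margin:
        return "y -> x"
    return "x ~ y"  # keep the co-occurrence, unresolved


if __name__ == "__main__":
    # Hypothetical alert activations: y fires one step after x.
    host_alert = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
    vm_alert = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    print(suggest_direction(host_alert, vm_alert))
```

Pairs for which neither direction dominates stay as symmetric rules, matching the "maintain unresolved co-occurrences without causality" branch of the slide.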
Causality analysis (examples)
Correlation threshold: 0.9
resourceType   alertType                    resourceType   alertType                     Count   Mean
HOST           HOST_CEPH_PROBLEM        →   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   5742    0.96
HOST           HOST_HIGH_CPU_LOAD       →   MACHINE        VM_CPU_SUBOPTIMAL_PERF.       47      0.98
HOST           RESOURCE_INVALID_STATE   ~   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   45      0.95

Correlation threshold: 0.7
resourceType   alertType                    resourceType   alertType                     Count   Mean
HOST           HOST_CEPH_PROBLEM        →   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   23490   0.86
MACHINE        FAILED_TO_GET_METRICS    ~   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   5953    0.72
HOST           RESOURCE_INVALID_STATE   ~   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   5525    0.77
HOST           HOST_CEPH_PROBLEM        ~   HOST           RESOURCE_INVALID_STATE        232     0.77
HOST           HOST_HIGH_CPU_LOAD       →   MACHINE        VM_CPU_SUBOPTIMAL_PERF.       47      0.98
HOST           RESOURCE_INVALID_STATE   ~   MACHINE        FAILED_TO_GET_METRICS         8       0.73
CLOUD_NODE     RESOURCE_INVALID_STATE   ~   COMPONENT      RESOURCE_INVALID_STATE        6       0.82
CLOUD_NODE     METRICS_READING_ERROR    ~   MACHINE        FAILED_TO_GET_METRICS         4       0.71
CLUSTER_INST.  MODULE_ERROR             →   CLUSTER_INST.  SYSTEM_ERROR                  1       0.85

Correlation threshold: 0.5
resourceType   alertType                    resourceType   alertType                     Count   Mean
MACHINE        FAILED_TO_GET_METRICS    ~   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   84109   0.60
HOST           HOST_CEPH_PROBLEM        →   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   23548   0.86
HOST           RESOURCE_INVALID_STATE   ~   MACHINE        VM_STORAGE_SUBOPTIMAL_PERF.   6608    0.75
HOST           HOST_CEPH_PROBLEM        ~   HOST           RESOURCE_INVALID_STATE        232     0.77
HOST           RESOURCE_INVALID_STATE   ~   MACHINE        FAILED_TO_GET_METRICS         60      0.64
HOST           HOST_CEPH_PROBLEM        ~   MACHINE        FAILED_TO_GET_METRICS         58      0.56
HOST           HOST_HIGH_CPU_LOAD       →   MACHINE        VM_CPU_SUBOPTIMAL_PERF.       47      0.98
CLOUD_NODE     RESOURCE_INVALID_STATE   ~   COMPONENT      RESOURCE_INVALID_STATE        6       0.82
CLOUD_NODE     METRICS_READING_ERROR    ~   HOST           RESOURCE_INVALID_STATE        2       0.53
CLUSTER_INS.   MODULE_ERROR             →   CLUSTER_INS.   SYSTEM_ERROR                  2       0.69
Diagnosis
Topology based RCA templates [Nokia CloudBand / OpenStack Vitrage project]

- Expert rules + Inferred Rules + dynamic instantiation of RCA templates => root causes
Dynamic instantiation of RCA templates

Inferring propagation scenarios from alarm logs (Radio Access Network)
Complexity reduction and automated causal analysis
Presented at Global Analyst Forum 2018
Goal: Reducing the time to dispatch KPI (time to resolve an incident)

Approach: Structural and Causal analysis of the alarm “chaos”
- Using statistical causal inference techniques reinforced with exogenous knowledge, e.g. physical/logical topology, or expertise
Impact: Number of possible explanations reduced by 10K
[Figure: NSW DI demonstrator on a smartphone: alarm chaos, reduced space, zooming on an alert propagation scenario]
Inferring propagation scenarios from alarm logs (2)
Complexity reduction and automated causal analysis
Dataset: large radio access network
- 200K network elements, 1 million alarms every day (about 40,000 alarms/hour), several hours per incident, up to 24 hours in some cases
Structural and Causal analysis of the alarm “chaos”
Using statistical causal inference techniques reinforced with exogenous knowledge injection (e.g. physical/logical topology, or expertise)
- Dynamic correlation graphs based on local/vertical/remote resource relations
- Graph summarization
- Number of possible explanations reduced by 10k
[Figure: local, vertical and remote resource relations]
Root Cause Analysis of Future Networks
Summary of important challenges
Model-based approaches are challenged by the fact that in real systems models are often unknown!
• Fault injection and interventional approaches to causality discovery – completeness, combinatorics
• Observational analysis / statistical causal inference - insufficient volumes of fault data
• Observational analysis assisted by some structural / topological knowledge - partial topology knowledge
Problem translation into the event space : adequate anomaly detection mechanisms
• Not specified in complex cloud-based environments
• Need for methods coupling anomaly detection and fault diagnosis
Challenges for Model Learning and Diagnosis

• High-dimensionality
• Dynamically reconfigurable systems, dynamically tracking the changes in the model
• Partially specified and partially observed models (controlled environments)
• Non-causal observation
• High-responsiveness : self-healing control loops
Fault model-based approach
Tile-based fault model
Tile-based model
elementary fault behaviors of various node types: <pre-conditions, alarm/event, post-conditions>
Petri Net semantics
distributed state / local conditions
concurrency
Uncertainty representation
unobserved tiles
likelihood measure
Model execution : fault trajectory building & diagnosis
Max likelihood estimation / Viterbi algorithm variants
Stream-based puzzle assembling using tiles (Spark Streaming)
[Figure: tile1, tile2 and tile3 linking the visible plane and the hidden plane]
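The tile triple and the "puzzle assembling" idea can be pictured with the simplified sketch below. It is not a Petri net unfolding: conditions are only accumulated, never consumed, and unobserved tiles are limited by a small budget. The tile names, conditions and likelihood values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tile:
    name: str
    pre: frozenset          # hidden conditions required before firing
    alarm: Optional[str]    # visible alarm emitted, None for an unobserved tile
    post: frozenset         # hidden conditions holding after firing
    likelihood: float = 1.0


def explain(alarms, tiles, state=frozenset(), silent_budget=2):
    """Enumerate (tile names, likelihood) trajectories explaining `alarms` in order."""
    if not alarms:
        return [((), 1.0)]
    found = []
    for tile in tiles:
        if not tile.pre <= state:
            continue  # pre-conditions not met, tile cannot be attached
        if tile.alarm is None:
            # Unobserved tile: allowed only if it adds something new.
            if silent_budget == 0 or tile.post <= state:
                continue
            rest = explain(alarms, tiles, state | tile.post, silent_budget - 1)
        elif tile.alarm == alarms[0]:
            rest = explain(alarms[1:], tiles, state | tile.post, silent_budget)
        else:
            continue
        found += [((tile.name,) + names, tile.likelihood * p) for names, p in rest]
    return found


TILES = [
    Tile("db_netw_down", frozenset(), None, frozenset({"DB.NETW"}), 0.1),
    Tile("db_out_bytes_alarm", frozenset({"DB.NETW"}),
         "Alarm(DB.out.bytes.rate)", frozenset({"DB.out.bytes.rate"}), 0.9),
    Tile("db_to_ws1", frozenset({"DB.NETW"}), None, frozenset({"WS1.CPU"}), 0.8),
    Tile("ws1_cpu_alarm", frozenset({"WS1.CPU"}),
         "Alarm(WS1.cpu_util)", frozenset({"WS1.cpu_util"}), 0.9),
    Tile("ws1_cpu_short", frozenset(), None, frozenset({"WS1.CPU"}), 0.05),
]

if __name__ == "__main__":
    flow = ["Alarm(DB.out.bytes.rate)", "Alarm(WS1.cpu_util)"]
    best = max(explain(flow, TILES), key=lambda e: e[1])
    print(best)
```

Ranking the enumerated trajectories by likelihood stands in for the maximum likelihood selection step; an exhaustive enumeration like this one would not scale, which is why the slide points to Viterbi-style and streaming algorithms.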

Trajectory building
[Figure: two candidate trajectories over the Database (DB) and Web servers 1 to n (WS1 ... WSn). Trajectory 1: primary cause "Network BW down" on DB (NETW, out.bytes.rate, Alarm(out.bytes.rate)), propagating to CPU and cpu_util on WS1 and WS2 with Alarm(cpu_util) on each. Trajectory 2: primary cause "Insufficient CPU" on WS1 (CPU, cpu_util, Alarm(cpu_util))]
Model discovery in controlled environments
Stimulus-based modeling for Cloud : Open questions
Verification & testing of derived tiles
• Model properties, coverage of real system faults and behaviors
Automatic learning of “complex” behaviors

• Forward propagation “Conflict” behaviors,
• “Synchronization” behaviors induced by the co-occurrence of multiple faults
Diagnosis algorithms under non-causal ordering of observed events

• On-line vs batch processing with sliding window, distribution vs stream analytics
Dealing with dynamically reconfigurable systems, e.g. auto-scaling resources

• Relevance of the tile model learned in the Sandbox => asymptotic analysis of correlations, parametric tile model?
• (distributed) diagnosis algorithms for reconfigurable models (changing the number of resources)
Dealing with partially modeled systems, e.g. unknown types of resources (not covered by the model)
• On-line micro-stimulus approach, …?
