
Micro Failures, Macro Insights: Unveiling the MIF Phenomenon in the Internet

JP Vasseur1 (jpv@cisco.com), PhD - Fellow/VP, Gregory Mermoud1 (gmermoud@cisco.com), PhD – Distinguished Engineer,
Eduard Schornig1 (eschornig@cisco.com) – Principal Engineer, Mukund Yelahanka Raghuprasad1 (myelahan@cisco.com) –
Technical Leader, Gregoire Magendie (gmagendie@cisco.com), Software Engineer - 1Cisco Systems

Release v1 – August 2023

Abstract: For years, the concept of “failure” has been associated with a clear and easily observable phenomenon where
a resource (e.g., link, router, server) ceases to function entirely. These instances are sometimes referred to as “Dark”
failures and can last for varying durations. We have expanded this definition to include the concept of “Grey” failures,
where a specific resource undergoes performance degradation that affects the user experience. In this paper, we further
extend the notion of grey failures to introduce the concept of Micro-failures (also termed MIF). MIFs are characterized
by specific motifs observed in path characteristics such as delay, loss, and jitter, which may impact the user's
Quality of Experience (QoE) while being difficult to observe with low-frequency probing. As defined, MIFs often go
unnoticed since they typically do not affect the control plane (no rerouting action) despite their potential impact on QoE.
Micro-failures are not only more common than dark failures but can also be equally detrimental to the user experience.
Furthermore, they pose a significant challenge in detection and remediation. In this paper, for the first time, we delve
into several intricate, multi-dimensional patterns of micro-failures through an exhaustive analysis based on fast-
frequency probing performed at scale in the Internet. The ability to identify and detect such micro-failures could
enable remediation in forwarding mechanisms in the future, thereby improving the QoE for
applications requiring stringent SLAs.

1 Dark, Grey and Micro failures (MIF)


Dark failures refer to a complete outage of a network element such as a link, a router, or a server. Detecting dark failures
is usually relatively simple using mechanisms such as inter-layer signaling (e.g., a lower layer reports that the link is down)
or routing/Keep-Alive (KA) protocols. (Fast) KA approaches rely on sending probes at regular intervals that are
echoed by the receiver: after k out of n consecutive probes fail to be echoed, a failure event is triggered, which may in turn
generate a cascade of mitigation tasks such as activating a redundant standby resource or recomputing a new topology.
For example, optical networks have been using such active-bypass strategies whereby a second lambda is used should
the active one fail. Alternatively, a single-ended approach may be used where both resources are simultaneously active
and the receiver "flips a switch" to receive traffic from the secondary lambda should the first fail (also called "1+1").
In the case of routing protocols, the detection of a link or node failure triggers a re-computation of the routing
topology followed by the rerouting of the traffic.
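As an illustration of the k-out-of-n detection logic described above, the following minimal sketch (hypothetical Python, not drawn from any actual KA implementation) declares a failure event once k of the last n probes go unanswered:

```python
from collections import deque

def ka_monitor(probe_results, k=3, n=5):
    """Yield True whenever k of the last n Keep-Alive probes were lost.

    probe_results: iterable of booleans, True if the probe was echoed
    back by the receiver, False if it timed out.
    """
    window = deque(maxlen=n)          # sliding window over the last n probes
    for echoed in probe_results:
        window.append(echoed)
        lost = sum(1 for ok in window if not ok)
        yield lost >= k               # failure event condition

# Example: probes 3-5 are lost, so a failure is declared from the 5th probe on.
results = [True, True, False, False, False, True]
print(list(ka_monitor(results)))      # [False, False, False, False, True, True]
```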
In contrast with dark failures, where both failure detection and mitigation are quite straightforward, "Grey" failures
([1]) relate to situations where application performance (and thus QoE) is impacted without an easily identifiable
root cause. Consider the example of traffic congestion in the network: such a phenomenon may be detected
by upper layers (e.g., transport protocols such as TCP or application layers) due to non-acknowledged traffic segments
after the expiration of (dynamically) computed timers, or upon receiving an explicit congestion notification. The sender
could then trigger congestion control mechanisms. However, congestion is often completely ignored at layer 3.
The situation becomes more challenging when "grey" failures are application-specific and are not easily detectable at
the scale at which they occur. Furthermore, the impact on applications may not even be known. For the past 20 years,
the user (application) Quality of Experience (QoE) has been observed through the prism of networking KPIs such as
delay, loss, and jitter. Several attempts have been made to specify hard-bound values for each of these Layer-3 KPIs that
should not be exceeded to meet the application Service Level Agreement (SLA). Although such an approach is known
to be (too) simplistic, in this paper, for the first time, we analyze micro-failures (MIF), which have the property of
being both transient and potentially impactful for applications.
Below are examples of MIF telemetry reported by a Cisco Webex client. The Webex client application sends
performance reports every 60 seconds with one-second granularity for dozens of variables.
During the experiment, we equipped various computers with real-time software capable of generating probes at
high frequency towards a set of destinations deeper in the Internet.
What is referred to as "Raw MIF" are the network KPIs reported by the fast-probing agent, whereas "Webex Telemetry"
refers to the same KPIs as reported by a Webex agent at lower frequency. As shown in Figure 1, the MIF motif
expectedly provides higher resolution than probes at lower frequency.

Figure 1 - Networking KPIs reported by L3 fast probing versus L7 (Webex) probes at lower frequency

Figure 2 - Shorter-lived MIF reported by fast probing compared to 1-minute resolution telemetry

Figure 2 shows that for shorter-lived phenomena impacting packet loss, the MIF motifs are quite expressive, whereas
lower-granularity telemetry (although still at 1-minute resolution) barely reports the issue. Many other examples show
even shorter-lived issues that cannot be detected even at a very reasonable reporting frequency of 1 minute (which is
considered fairly high in networking). It is not rare to observe a series of such MIFs, or even many (periodic) MIFs.
Also, most of the time, MIFs are of a very short duration (i.e., between 500ms and 5 seconds).
Unfortunately, most applications and network solutions are unable to detect short, temporary failures, simply because
telemetry polling operates at a different granularity. Furthermore, even when higher granularity is available, most
variables are aggregated using averaged values over a specific time window for obvious scalability reasons. For example,
the Cisco SD-WAN solution measuring loss, jitter, and latency on a tunnel uses by default a 10-minute time window
that averages the results from 600 probes sent every second. Configuring more aggressive timers is not advised.

2 Methodology
MIF telemetry was collected for more than 350 individual network paths via probing agents deployed in 30 global
locations in 19 countries. Dedicated nodes in each location were configured to initiate Webex voice and video calls
while at the same time using fast-frequency ICMP probes to monitor network conditions towards destinations of interest
such as:
• Webex Monitoring Endpoints: Dedicated monitoring endpoints are deployed in each Webex Data Center. The
L3 metrics from any site to a monitoring-endpoint are deemed to be representative of the network behavior to
all the media-servers that the monitoring-endpoint represents.
• Other targets: Google DNS, etc.
• Starlink First Hop (optional): For sites where Starlink was used as a connectivity provider, an additional probe
was configured to monitor the first L3 network hop in the Starlink network.

Figure 3 - MIF Deployment Examples

For a given MIF path (site -> target), the agent sends probes at 100ms intervals and records the associated RTT or an
error code (timeout, unreachable, TTL exceeded). Results are then aggregated into 500ms intervals, and loss, latency,
and jitter metrics are computed.
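To make this aggregation step concrete, below is a minimal sketch (hypothetical Python/pandas with illustrative column names; the actual agent implementation is not described in the paper) of rolling 100ms probe results into 500ms windows. The jitter definition used here, the mean absolute difference between consecutive RTTs, is one common convention and is an assumption:

```python
import pandas as pd

# Hypothetical probe log: one row per 100ms probe.
# "rtt_ms" is the measured RTT, NaN when the probe returned an error
# (timeout, unreachable, TTL exceeded).
probes = pd.DataFrame({
    "timestamp": pd.date_range("2023-07-18 10:00:00", periods=10, freq="100ms"),
    "rtt_ms": [21.0, 22.5, None, 24.0, 23.1, None, None, 25.2, 22.8, 23.0],
}).set_index("timestamp")

def aggregate_500ms(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate 100ms probes into 500ms windows (5 probes per window)."""
    grouped = df.resample("500ms")["rtt_ms"]
    return pd.DataFrame({
        # loss: fraction of probes in the window without a valid RTT
        "loss": grouped.apply(lambda s: s.isna().mean()),
        # latency: mean RTT over the probes that did answer
        "latency_ms": grouped.mean(),
        # jitter: mean absolute difference between consecutive RTTs (assumed definition)
        "jitter_ms": grouped.apply(lambda s: s.dropna().diff().abs().mean()),
    })

print(aggregate_500ms(probes))
```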

Figure 4: MIF Telemetry Overview

3 MIF Motifs
The fast-probing methodology described in the previous section generates high-frequency MIF telemetry for various
network paths across the world. MIF telemetry consists of loss, latency, and jitter values for the network paths at
a 500ms granularity. Such high-frequency telemetry can provide insight into network behavior that is
otherwise invisible to telemetry at lower granularities.

Figure 5 - Example of MIF Motif

The patterns can be identified at two levels of abstraction:


1. Patterns per path: Determine whether a given path exhibits micro-failures (MIF), the various patterns of failures
and the frequency of occurrence for each pattern.
2. Patterns across paths: Determine whether similar patterns of failures are exhibited by paths across the network.
Extracting timeseries patterns involves identifying timeseries snippets that exhibit a similar behavior or shape. The
patterns can be identified for a given KPI timeseries (univariate) or for multiple KPIs (multivariate). The study primarily
focuses on identifying univariate patterns; however, we also provide a brief insight into multivariate patterns in later
sections. A common approach to extracting timeseries patterns is to use the matrix profile for motif extraction,
where, for a given pattern, one identifies occurrences of similar patterns in a timeseries. However, considering our use
case, we approach this problem through clustering, where we cluster timeseries snippets that exhibit similar patterns.
The per-path pattern extraction approach can be described in three stages, as shown in the figure below. Once the
patterns are identified per path, they can be used to further identify common patterns across paths.

Figure 6 - MIF Pattern Extraction

3.1 Timeseries Snippets


3.1.1 Preprocessing
This section outlines the pre-processing steps aimed at reducing noise in the MIF telemetry. The goal of pre-processing
is to preserve as much information as possible in the interesting range of the telemetry values and to reduce noise elsewhere.
Considering that the study is aimed at understanding "failures", the pre-processing preserves all information at higher
values of a metric (the failure range) and only reduces noise at lower values (the range of normal operation). To this end,
we clip a metric timeseries at its 25th percentile value. This filtering is done per path and per metric, and improves the
clustering considerably.
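A minimal sketch of this clipping step (hypothetical Python/NumPy, assuming one array per path and per metric) could look as follows:

```python
import numpy as np

def quartile_clip(series: np.ndarray) -> np.ndarray:
    """Clip a per-path, per-metric timeseries at its 25th percentile.

    Values below the 25th percentile (normal operation) are flattened up to
    that percentile, while higher values (the failure range) are preserved.
    """
    q25 = np.percentile(series, 25)
    return np.clip(series, a_min=q25, a_max=None)

# Example: small delay values are flattened, spikes are kept intact.
delay = np.array([20.0, 21.0, 19.5, 22.0, 180.0, 175.0, 20.5, 21.5])
print(quartile_clip(delay))
```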

Figure 7 - Quartile Filtered Delay Time Series

3.1.2 Snippets
The study considers a snippet length of 1 minute. Given the 500ms granularity of the MIF telemetry, the timeseries
snippets are 120-dimensional for each of loss, latency, and jitter. Snippets are created for each timeseries using a
rolling-window approach; as a result, each snippet overlaps with the snippets from the previous few windows. While
this causes overlapping snippets corresponding to the same occurrence of a pattern to be clustered together, it can
easily be remedied with some post-processing of the generated pattern clusters (detailed in a later section).
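The rolling-window snippet creation can be sketched as follows (hypothetical Python/NumPy; the 120-sample snippet length comes from the text, while the stride value is an assumption made for illustration):

```python
import numpy as np

def make_snippets(series: np.ndarray, length: int = 120, stride: int = 10) -> np.ndarray:
    """Cut a 500ms-granularity timeseries into 1-minute (120-sample) snippets.

    A rolling window with a stride smaller than the snippet length is used,
    so consecutive snippets overlap (the stride value here is an assumption).
    Returns an array of shape (num_snippets, length).
    """
    starts = range(0, len(series) - length + 1, stride)
    return np.stack([series[s:s + length] for s in starts])

# Example: one hour of 500ms telemetry -> 7200 samples.
delay = np.random.default_rng(0).normal(20.0, 1.0, size=7200)
snippets = make_snippets(delay)
print(snippets.shape)   # (709, 120) with a stride of 10
```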

Figure 8 - Timeseries snippet

3.2 Snippet Clustering


3.2.1 K-Shape Clustering
The snippets created from the MIF telemetry are clustered for each of the loss, latency, and jitter KPIs. An important aspect
of clustering timeseries data is the distance metric used to compare two timeseries snippets. Euclidean distance is not
suitable for clustering timeseries, as it is neither shift-invariant nor scale-invariant (see figure below). Other metrics, such
as Dynamic Time Warping (DTW), provide better estimates of similarity and can be used as the distance
metric with popular clustering algorithms such as k-means or k-medoids. However, the DTW measure (and its constrained
version cDTW) is expensive to compute and does not scale easily. We use the k-Shape clustering algorithm, which
overcomes the scaling problem by using a computationally efficient distance measure while achieving accuracy
comparable to methods that use DTW. The k-Shape approach uses a shift- and scale-invariant distance
measure called the Shape-Based Distance (SBD), which is based on cross-correlation. The cross-correlation can
be computed efficiently using the Fast Fourier Transform.

Figure 9 - K-Shape Clustering

Prior to clustering, snippets that have a low coefficient of variation (<0.05) are filtered out. These snippets mostly
correspond to flat timeseries that do not exhibit any significant pattern.
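One possible implementation of this step, assuming the tslearn library (a library choice made here for illustration; the study does not name its tooling), filters out low-variation snippets and runs k-Shape on the remainder:

```python
import numpy as np
from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

def cluster_snippets(snippets: np.ndarray, n_clusters: int = 8):
    """Filter flat snippets and cluster the rest with k-Shape.

    snippets: array of shape (num_snippets, 120) for one path and one KPI.
    """
    # Coefficient of variation filter: drop near-flat snippets (< 0.05).
    cv = snippets.std(axis=1) / (np.abs(snippets.mean(axis=1)) + 1e-9)
    kept = snippets[cv >= 0.05]

    # k-Shape expects z-normalized series of shape (n_ts, sz, 1).
    X = TimeSeriesScalerMeanVariance().fit_transform(kept[..., np.newaxis])
    model = KShape(n_clusters=n_clusters, n_init=3, random_state=0)
    labels = model.fit_predict(X)
    return kept, labels, model
```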

3.2.2 Filtering noisy/non-significant clusters


The clustering is performed on all the filtered snippets for a given path and is carried out for each metric
separately (univariate). Once the clustering is complete, silhouette scores (based on the SBD measure), which indicate
the quality of a cluster, are computed for all clusters. Clusters with a low silhouette score (<0.5) are filtered out, as
they mostly contain noisy or outlier snippets. Non-significant clusters that contain very few snippets (<10) are also filtered
out. The final set of clusters still needs to be cleaned of overlapping snippets, as detailed in the next
section.
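The filtering can be sketched as follows (hypothetical Python; the SBD helper is a straightforward cross-correlation-based illustration rather than an optimized FFT implementation, and the full pairwise distance matrix is computed for clarity, not scale):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def sbd(x: np.ndarray, y: np.ndarray) -> float:
    """Shape-Based Distance: 1 minus the maximum normalized cross-correlation."""
    cc = np.correlate(x, y, mode="full")
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1.0 - cc.max() / denom if denom > 0 else 0.0

def filter_clusters(snippets, labels, min_silhouette=0.5, min_size=10):
    """Keep only clusters with a good silhouette score and enough members."""
    n = len(snippets)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = sbd(snippets[i], snippets[j])

    # Per-sample silhouettes on the precomputed SBD matrix, averaged per cluster.
    sil = silhouette_samples(dist, labels, metric="precomputed")
    kept = []
    for c in np.unique(labels):
        members = labels == c
        if members.sum() >= min_size and sil[members].mean() >= min_silhouette:
            kept.append(c)
    return kept
```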

3.2.3 Cleaning clusters


One consequence of the rolling-window approach to snippet creation is the presence of overlapping snippets
that correspond to a single occurrence of a pattern. Such redundant snippets need to be removed so that each snippet
in the cluster corresponds to a single occurrence of the clustered pattern. This can be done by retaining only the first
snippet in any group of snippets that overlap significantly (>25%).
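A minimal sketch of this cleaning step (hypothetical Python, assuming each snippet is tracked by its start index in the original timeseries) could be:

```python
def deduplicate(start_indices, length=120, max_overlap=0.25):
    """Keep the first snippet of each group of overlapping snippets.

    start_indices: start positions (in samples) of the snippets in a cluster.
    Two snippets are considered redundant if they overlap by more than 25%.
    """
    kept = []
    for start in sorted(start_indices):
        if not kept:
            kept.append(start)
            continue
        overlap = max(0, kept[-1] + length - start) / length
        if overlap <= max_overlap:
            kept.append(start)
    return kept

# Example: three heavily overlapping snippets collapse into one occurrence.
print(deduplicate([0, 10, 20, 400]))   # [0, 400]
```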

Figure 10 - Cleaning Clusters

3.2.4 Example clusters


This section provides some example clusters extracted for each of loss, latency, and jitter. Each cluster is
represented by the 25th, 50th, and 75th percentile values observed at each snippet timestep. In addition to the snippet
that represents the cluster, the figures also visualize, in the shaded regions, the snippets that occur before and after.

Figure 11 - Example Latency Clusters

Figure 12 - Example Loss Clusters

Figure 13 - Example Jitter Clusters

3.3 Clusters across paths


Once clusters are extracted for a given path, they are examined to identify common patterns across multiple paths. The
same k-Shape clustering algorithm is run again to extract common patterns across paths. To perform this
second level of clustering, each per-path cluster is represented by a single timeseries corresponding to the median
value at each snippet timestep. Since each per-path cluster is now represented by a single timeseries snippet, the
k-Shape algorithm can be run on these representative snippets to identify common patterns across paths.
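Building on the earlier clustering sketch (again a hypothetical Python example assuming tslearn; the number of cross-path clusters is an illustrative choice), this second level can be expressed as:

```python
import numpy as np
from tslearn.clustering import KShape

def cross_path_patterns(per_path_clusters, n_clusters=10):
    """Cluster per-path patterns to find motifs common to several paths.

    per_path_clusters: list of arrays, one per (path, cluster), each of
    shape (num_snippets, 120). Every per-path cluster is reduced to its
    per-timestep median before the second-level k-Shape clustering.
    """
    representatives = np.stack([np.median(c, axis=0) for c in per_path_clusters])
    model = KShape(n_clusters=n_clusters, n_init=3, random_state=0)
    labels = model.fit_predict(representatives[..., np.newaxis])
    return labels
```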
Some example patterns commonly found across paths are shown in the figures below, each represented using the same
convention as in the previous section. The figures also note the path (device#application-server-region) to which the
pattern corresponds, the number of occurrences of the pattern on that path, and the time window of the timeseries.

Figure 14 - Common Latency Pattern Examples

Figure 15 - Common Loss Pattern Examples

Figure 16 - Common Jitter Pattern Examples

4 Multi-variate MIF clustering


Initial experiments with a novel approach based on deep learning have been conducted to perform end-to-end clustering
of the 3-dimensional MIF time series. The idea is to train, in an unsupervised fashion, a multi-output deep neural network
(DNN) to associate a 3D sample (loss, jitter, and latency time series) with a cluster.
The DNN architecture is composed of convolutional layers that project the MIF sample onto a latent space. Fully
connected layers then process this latent projection into a normalized vector P of size N, where P_i is the probability that
the input sample is associated with cluster i. At inference time, the predicted cluster for the input sample is argmax_i P_i.
During training, the DNN has to learn to associate similar samples with the same cluster and to separate dissimilar
samples. This can be achieved by training the model to minimize the distance between points of the same cluster
(intra-cluster distance) while maximizing the distance between points of different clusters (inter-cluster distance). To this
end, a custom loss that approximates the ratio of intra-cluster to inter-cluster distance on the latent space was used.
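A minimal sketch of such an architecture and loss is given below (hypothetical PyTorch; the layer sizes, number of clusters, and exact form of the loss used in the study are not specified in the paper and are assumptions here):

```python
import torch
import torch.nn as nn

class MIFClusterNet(nn.Module):
    """Conv encoder + fully connected head mapping a 3x120 MIF sample
    (loss, jitter, latency at 500ms granularity) to cluster probabilities."""

    def __init__(self, n_clusters: int = 10, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        self.head = nn.Sequential(nn.Linear(latent_dim, n_clusters), nn.Softmax(dim=1))

    def forward(self, x):                      # x: (batch, 3, 120)
        z = self.encoder(x)                    # latent projection
        return z, self.head(z)                 # P_i = probability of cluster i

def intra_inter_loss(z, p, eps=1e-6):
    """Approximate intra-cluster / inter-cluster distance ratio on the latent space.

    Soft assignments p weight the latent points into centroids; the loss pushes
    points towards their centroid and centroids away from each other.
    """
    centroids = (p.t() @ z) / (p.sum(dim=0, keepdim=True).t() + eps)   # (K, latent)
    intra = ((torch.cdist(z, centroids) ** 2) * p).sum(dim=1).mean()
    inter = torch.cdist(centroids, centroids).mean()
    return intra / (inter + eps)
```

At inference time, the predicted cluster is simply `p.argmax(dim=1)`, matching the argmax rule described above.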

Figure 17 - DNN used for Multi-dimensional Pattern Clustering

The first experiments with this new approach gave promising results, with many similar 3D patterns assigned to the
same cluster.

5 Interesting Motifs
In this section, we show a set of interesting MIF motifs found in the study.

Figure 18 – Latency Motif Cluster

Figure 18 shows an example of a motif encountered on the path between a Webex client located in Krakow, Poland,
and the Webex Data Center in Frankfurt, Germany. Network latency is seen to suddenly increase from ~45ms to ~200ms
(a 4x increase) for 10 to 12 seconds, then drop to ~75ms for another 15 seconds before finally returning to the initial
value. This event is seen 3673 times during a period of 9 days.

Figure 19 – Latency Motif Cluster

Figure 19 depicts yet another example of a latency motif, this time seen on a network path between two very closely
located endpoints: a host located in San Francisco, CA, US and the Webex Data Center in San Jose, CA, US. In this
case, latency increases from ~20ms to ~70ms over the course of 20 seconds before dropping back to the initial value.
The event is seen 4240 times over the course of 9 days.

Figure 20 – Packet Loss Motif Cluster

Figure 20 shows an example of a packet loss motif, captured on multiple paths originating from a host located in
Lisbon, Portugal. The pattern consists of a short-lived but large packet loss spike of up to 80%, along with multiple
smaller loss spikes of about 20%. The motif is initially seen over the course of 5 days (18th to 22nd July), after which it
disappears, only to be detected again on the 2nd and 3rd of August.

Figure 21 – Jitter Motif Cluster

Figure 21 shows a jitter motif cluster captured between a host located in Toronto, Canada and the Webex DC in
Dallas, TX, USA. Jitter is seen to suddenly increase from <10ms to anywhere between 40ms and 80ms for ~5 seconds
before dropping back down. The event is only seen 14 times over the course of 3 days.

6 Conclusion
For the first time, a study has revealed the existence of micro-failures in the Internet based on various path characteristics
such as delay, loss, and jitter. These micro-failures, referred to as MIF in this document, significantly broaden the
definition of failures. They shift our understanding from overt events, where resources cease to function for a certain
period, to subtler phenomena that, though sometimes hard to detect, can drastically affect the user experience. The study
has identified patterns, both unidimensional and multidimensional, that can be detected using fast probing or by
employing ML algorithms trained to compute the probability of such patterns occurring on network paths, drawing
from multiple path characteristics. It might then be possible to train classifiers that can be used on premises to detect
MIFs and subsequently trigger potential control and forwarding plane remediations to continually enhance QoE.

7 Bibliography

[1] J. Vasseur, "From Dark to Grey Failures In The Internet," June 2021. [Online]. Available: https://assets.zyrosite.com/A1aglvGGy1F64BNz/wp-grey-failures-june-21-v1-A1agLPKg00uWaBLV.pdf.

