WP Mif v1 Aug 23 mnlnzL2aRqhGL7lL
Internet
JP Vasseur¹ (jpv@cisco.com), PhD – Fellow/VP, Gregory Mermoud¹ (gmermoud@cisco.com), PhD – Distinguished Engineer,
Eduard Schornig¹ (eschornig@cisco.com) – Principal Engineer, Mukund Yelahanka Raghuprasad¹ (myelahan@cisco.com) –
Technical Leader, Gregoire Magendie (gmagendie@cisco.com) – Software Engineer
¹Cisco Systems
Abstract: For years, the concept of “failure” has been associated with a clear and easily observable phenomenon where
a resource (e.g., link, router, server) ceases to function entirely. These instances are sometimes referred to as “Dark”
failures and can last for varying durations. We have expanded this definition to include the concept of “Grey” failures,
where a specific resource undergoes performance degradation that affects the user experience. In this paper, we further
extend the notion of grey failures to introduce the concept of Micro-failures (also termed MIF). MIFs are characterized
by specific motifs observed in path characteristics, such as delay, loss, and jitter, which may impact the user's
Quality of Experience (QoE) while being difficult to observe with low-frequency probing. As defined, MIFs often go
unnoticed since they don't typically affect the control plane (no rerouting action) despite their potential impact on QoE.
Micro-failures are not only more common than dark failures but can also be equally detrimental to the user experience.
Furthermore, they pose a significant challenge in detection and remediation. In this paper, for the first time, we delve
into several intricate, multi-dimensional patterns of micro-failures through an exhaustive analysis based on fast-
frequency probing realized at scale in the Internet. Being able to identify and detect such micro-failures could
undoubtedly lead to triggering remediation in forwarding mechanisms in the future, thereby improving the QoE for
applications requiring stringent SLAs.
Figure 1 - Networking KPI reporting by L3 Fast probing versus L7 (Webex) probes at lower frequency
Figure 2 - Shorter-lived MIF reporting by fast probing compared to 1-minute resolution telemetry
Figure 2 shows that, for a shorter-lived phenomenon impacting packet loss, the MIF motifs are quite expressive, whereas
lower-granularity telemetry (even at a 1-minute resolution) barely reports the issue. Many other examples could show
even shorter-lived issues that cannot be detected even at a very reasonable telemetry reporting interval of 1 minute
(which is considered a fairly high frequency in networking). It is not rare to observe a series of such MIFs, or even
many periodic MIFs. Most of the time, MIFs are of very short duration (i.e., between 500ms and 5 seconds).
Unfortunately, most applications and network solutions are unable to detect short temporary failures, simply because
telemetry polling operates at a coarser granularity. Furthermore, even when higher granularity is available, most
variables are aggregated into averages over a specific time window for obvious scalability reasons. For example, the
Cisco SD-WAN solution, which measures loss, jitter and latency on tunnels, by default uses a 10-minute time window that
averages the results of 600 probes sent at 1-second intervals. It is not advised to configure more aggressive timers.
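To make the dilution effect of averaging concrete, the following minimal sketch computes the loss reported over a 600-probe window that contains a single short outage. The 5-second outage duration and its position in the window are assumptions chosen for illustration; only the window size and probe interval come from the text.

```python
# Illustrative arithmetic only: a short outage inside a 10-minute averaging window.
PROBES_PER_WINDOW = 600   # from the text: 600 probes, one per second
OUTAGE_SECONDS = 5        # assumed MIF duration (upper end of the 500ms-5s range)
OUTAGE_START = 300        # assumed offset of the outage within the window

# 1 = probe lost, 0 = probe answered.
lost = [1 if OUTAGE_START <= t < OUTAGE_START + OUTAGE_SECONDS else 0
        for t in range(PROBES_PER_WINDOW)]

window_loss_pct = 100.0 * sum(lost) / len(lost)   # what the averaged KPI reports
peak_loss_pct = 100.0 * max(lost)                 # what the user sees during the outage

print(f"loss averaged over the window: {window_loss_pct:.2f}%")
print(f"loss during the outage itself: {peak_loss_pct:.0f}%")
```

Under these assumptions, five seconds of total loss are reported as well under 1% loss for the window, which is easy to dismiss as noise.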
2 Methodology
MIF telemetry was collected for more than 350 individual network paths via probing agents deployed in 30 global
locations in 19 countries. Dedicated nodes in each location were configured to initiate Webex Voice and Video calls
while at the same time use fast frequency ICMP probes to monitor network conditions towards destinations of interest
such as:
• Webex Monitoring Endpoints: Dedicated monitoring endpoints are deployed in each Webex Data Center. The
L3 metrics from any site to a monitoring-endpoint are deemed to be representative of the network behavior to
all the media-servers that the monitoring-endpoint represents.
• Other targets: Google DNS, etc.
• Starlink First Hop (optional): For sites where Starlink was used as a connectivity provider, an additional probe
was configured to monitor the first L3 network hop in the Starlink network.
For a given MIF path (site -> target), the agent sends probes at 100ms intervals and records the associated RTT or an
error code (timeout, unreachable, TTL exceeded). Results are then aggregated at 500ms intervals, and loss, latency
and jitter metrics are computed.
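The aggregation step can be sketched as follows. This is a minimal illustration, not the agents' actual implementation: the function name `aggregate`, the use of `None` for any error code, the mean as the latency statistic, and the jitter definition (mean absolute difference between consecutive successful RTTs) are all assumptions, since the text does not specify them.

```python
from statistics import mean

PROBE_INTERVAL_MS = 100                       # from the text
BIN_MS = 500                                  # from the text
PROBES_PER_BIN = BIN_MS // PROBE_INTERVAL_MS  # 5 probes per aggregated sample

def aggregate(rtts_ms):
    """Aggregate per-probe results into 500ms bins.

    rtts_ms: list of RTTs in ms, with None standing for any error code
    (timeout, unreachable, TTL exceeded). Only full bins are returned.
    """
    bins = []
    for start in range(0, len(rtts_ms) - PROBES_PER_BIN + 1, PROBES_PER_BIN):
        chunk = rtts_ms[start:start + PROBES_PER_BIN]
        ok = [r for r in chunk if r is not None]
        loss_pct = 100.0 * (len(chunk) - len(ok)) / len(chunk)
        latency_ms = mean(ok) if ok else None
        # Assumed jitter definition: mean absolute difference between
        # consecutive successful RTTs within the bin.
        diffs = [abs(b - a) for a, b in zip(ok, ok[1:])]
        jitter_ms = mean(diffs) if diffs else None
        bins.append({"loss_pct": loss_pct,
                     "latency_ms": latency_ms,
                     "jitter_ms": jitter_ms})
    return bins

sample = [20.0, 22.0, None, 21.0, 25.0,   # first 500ms bin: one lost probe
          20.0, 20.0, 20.0, 20.0, 20.0]   # second bin: clean
result = aggregate(sample)
for row in result:
    print(row)
```

With five probes per bin, loss can only take values in steps of 20%, which is consistent with the 20% and 80% loss spikes that appear in 500ms telemetry.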
3 MIF Motifs
The fast-probing methodology described in the previous section generates high-frequency MIF telemetry for various
network paths across the world. MIF telemetry consists of loss, latency and jitter corresponding to the network paths at
a 500ms granularity. Such high-frequency telemetry can provide an insight into the behavior of the network that can be
otherwise invisible with telemetry at lower granularities.
Figure 7 - Quartile Filtered Delay Time Series
3.1.2 Snippets
The study considers a snippet length of 1 minute. Given that the granularity of the MIF telemetry is 500ms, the timeseries
snippets are 120-dimensional for each of loss, latency, and jitter. Snippets are created for each timeseries using a
rolling-window approach, which causes each snippet to overlap with the snippets of the previous few windows. While this
causes overlapping snippets corresponding to the same occurrence of a pattern to be clustered together, it can be easily
remedied with some post-processing of the generated pattern clusters (detailed in next section).
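A rolling-window snippet extraction can be sketched as below. The snippet length of 120 samples follows from the text (1 minute at 500ms); the stride of 20 samples and the helper name `rolling_snippets` are assumptions for illustration, as the study does not state the step between windows.

```python
SNIPPET_LEN = 120   # 1 minute at 500ms granularity (from the text)
STRIDE = 20         # assumed step between windows, in samples (10 seconds)

def rolling_snippets(series, length=SNIPPET_LEN, stride=STRIDE):
    """Return every full window of `length` samples, advancing by `stride`."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, stride)]

# Two minutes of synthetic telemetry (240 samples at 500ms).
series = list(range(240))
snippets = rolling_snippets(series)

# Neighbouring snippets share all but `stride` samples.
overlap = SNIPPET_LEN - STRIDE
print(len(snippets), "snippets,", overlap, "samples shared between neighbours")
```

Because neighbouring windows share most of their samples, a single occurrence of a pattern appears in several snippets at different offsets, which is why the clusters need post-processing afterwards.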
Figure 9 - K-Shape Clustering
Prior to the clustering, snippets that have a low coefficient of variation (<0.05) are filtered out. These snippets mostly
correspond to flat timeseries that do not exhibit any significant pattern.
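The pre-clustering filter can be sketched as follows. The 0.05 threshold comes from the text; the use of the population standard deviation, the handling of all-zero snippets, and the function names are assumptions made for this illustration.

```python
from statistics import mean, pstdev

CV_THRESHOLD = 0.05   # from the text

def coefficient_of_variation(snippet):
    """Standard deviation divided by the mean (population std is an assumption)."""
    m = mean(snippet)
    if m == 0:
        # Assumed convention: for non-negative metrics a zero mean implies an
        # all-zero snippet (e.g. no loss at all), which is treated as flat.
        return 0.0
    return pstdev(snippet) / abs(m)

def keep_snippet(snippet, threshold=CV_THRESHOLD):
    """Keep only snippets whose variability reaches the threshold."""
    return coefficient_of_variation(snippet) >= threshold

flat = [45.0] * 120                      # constant latency, no pattern
spiky = [45.0] * 100 + [200.0] * 20      # 10-second latency step
kept = [s for s in (flat, spiky) if keep_snippet(s)]
print(len(kept), "of 2 snippets kept for clustering")
```

Normalizing by the mean makes the threshold independent of the baseline value of the metric, so the same cut-off can be applied to paths with very different nominal latency.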
Figure 11 - Example Latency Clusters
Figure 15 - Common Loss Pattern Examples
Figure 17 - DNN used for Multi-dimensional Pattern Clustering
The first experiments with this new approach gave promising results, with many similar 3D patterns assigned to the
same cluster.
5 Interesting Motifs
In this section, we show a set of interesting MIF motifs found in the study.
Figure 18 – Latency Motif Cluster
Figure 18 shows an example of a motif encountered on the path between a Webex client located in Krakow, Poland and
the Webex Data Center in Frankfurt, Germany. Network latency is seen to suddenly increase from ~45ms to ~200ms
(roughly a 4x increase) for 10 to 12 seconds, followed by a drop to ~75ms for another 15 seconds before finally
returning to the initial value. This event is seen 3673 times during a period of 9 days.
Figure 19 – Latency Motif Cluster
Figure 19 depicts yet another example of a latency motif, this time seen on a network path between two very closely
located endpoints: a host located in San Francisco, CA, US and the Webex Data Center in San Jose, CA, US. In this
case, latency increases from ~20ms to ~70ms over the course of 20 seconds before dropping back to the initial value.
The event is seen 4240 times over the course of 9 days.
Figure 20 shows an example of a packet loss motif, captured on multiple paths originating from a host located in
Lisbon, Portugal. The pattern consists of a short-lived but large packet loss spike of up to 80% along with multiple
smaller 20% loss spikes. The motif is initially seen over the course of 5 days (18th to 22nd July), after which it
disappears, only to be detected again on the 2nd and 3rd of August.
Figure 21 – Jitter Motif Cluster
Figure 21 shows a jitter motif cluster captured between a host located in Toronto, Canada and the Webex DC in
Dallas, TX, USA. Jitter is seen to suddenly increase from <10ms to anywhere between 40ms and 80ms for ~5 seconds
before dropping back down. The event is only seen 14 times over the course of 3 days.
6 Conclusion
For the first time, a study has revealed the existence of micro-failures in the Internet based on various path characteristics
such as Delay, Loss, and Jitter. These micro-failures, referred to as MIF in this document, significantly broaden the
definition of failures. They shift our understanding from overt events, where resources cease to function for a certain
period, to subtler phenomena that, though sometimes hard to detect, can drastically affect the user experience. The study
has identified patterns, both unidimensional and multidimensional, that can be detected using fast probing or by
employing ML algorithms trained to compute the probabilities of such patterns' occurrence in network paths, drawing
from multiple path characteristics. It might then be possible to train classifiers that can be used on-premise to detect
MIFs and subsequently trigger potential control and forwarding plane remediations to continually enhance QoE.
7 Bibliography
[1] J. Vasseur, "From Dark to Grey Failures In The Internet," June 2021. [Online]. Available:
https://assets.zyrosite.com/A1aglvGGy1F64BNz/wp-grey-failures-june-21-v1-A1agLPKg00uWaBLV.pdf.