
Large-scale Internet Path modelling and applications

Vinay Kolar, Ph.D., Principal Engineer, JP Vasseur, Ph.D., Cisco Fellow, Mukund Raghuprasad
(myelahan@cisco.com) - Sept 2020

1 Introduction

A crucial part of providing better quality-of-service (QoS) to applications is to route the application packets through paths that provide the best experience. Existing technologies such as Software Defined Wide Area Networks (SDWANs) construct an overlay network connecting the branches of an enterprise (branch offices or datacenters). The SDWAN continually measures end-to-end path performance between different branches using probes that measure the critical path KPIs, such as loss, latency and jitter, along a series of paths.

The path KPIs provide continuous performance metrics of the path that can be used to route application traffic accordingly. For example, real-time video applications can be sent over paths with low loss, latency and jitter, whereas elastic applications such as e-mail can be sent over other paths.

Existing protocols make use of static metrics that may (partially) reflect some of these KPIs. However, current protocols do not exploit or learn from long-term KPI values of the path; they buffer recent KPIs (usually the last few minutes) and perform path selection (not path computation) according to this recent history. This is a real drawback of existing protocols. For example, the historical data may indicate a pattern of large variance in loss or jitter, indicating a bad fit for, say, real-time voice applications; this information is usually not available in the instantaneous measurements, and the protocol would have to query long-term KPIs to infer the presence of such variance.

This paper discusses data-driven path modelling. Unlike traditional protocols, the path modelling approach differs in three important ways. First, the KPI data is collected across the Internet into a cloud-based data-lake. Second, the path model is constructed using statistical or machine learning models. The models enable (1) statistical understanding of long- and short-term path behavior, and (2) fast decision making by protocols without querying the large data-lake for every packet being routed. Third, the models of multiple paths can be evaluated for downstream use cases, for either routing or decision making. In one use case, the models can be used to select the best path for an application. For example, voice applications may have a Service Level Agreement (SLA) requiring loss < 1%, jitter < 30 ms and latency < 150 ms. The models are used to compute the probability of SLA violation, which can then be used to route the voice application via the best path. In another use case, path models of several related paths can be combined to infer probable root-causes. For example, the models can be used to understand whether paths transiting through a given service provider exhibit clusters of similar loss, latency and jitter. If the paths do exhibit strong clusters, this can be used to predict or conclude whether, say, the enterprise can expect a benefit from adding a path through some service provider to enable better quality for voice calls.

This paper will first briefly discuss the path data collected across the Internet. Then, interpretable path-modelling approaches that enable understanding and characterizing path behavior are discussed, and examples of latency and loss KPI models are shown. Finally, two use cases of path models are discussed: (1) SLA violation: this use case utilizes the path model to evaluate if a given path is suitable for several applications; and (2) Service-provider path behavior: this use case analyzes paths between different pairs of service providers across the Internet and infers if there are any patterns in path KPIs.

2 Internet-Scale Data collection

An SDWAN usually consists of multiple branches and datacenters connected via tunnels. Each branch or datacenter may have one or more edge-routers. A tunnel connects two edge-routers, and each pair of edge-routers may have multiple tunnels. A tunnel encapsulates and sends packets via a chosen fabric such as MPLS, Internet or 3G/LTE. SDWANs use several active and passive probes to measure path KPIs. A primary active probe is Bidirectional Forwarding Detection (BFD), which sends heart-beat messages periodically and computes the loss, latency and jitter on the tunnel, potentially for multiple DSCP values reflecting QoS. The system also collects other KPIs such as the amount of traffic seen on the tunnel.

The paths in the above scenario are the tunnels, which connect the edge-routers of an enterprise. An SDWAN network has a rich measurement history of how a path behaves due to constant monitoring of the primary KPIs. This data is measured at the edge-routers, which periodically transfer the telemetry to the cloud. The details of the platform and mechanisms for data collection and transfer are not discussed in this paper; it is assumed that the path data is stored in a data-lake, which is used for modelling.

A path can be represented as a time-series of several KPIs. For the study in this paper, we consider three KPIs for any path P: latency(P, t), loss(P, t) and jitter(P, t), which denote the respective KPIs for a path measured at time t. In addition to these primary KPIs, the path can also be described by certain attributes. For example, each tunnel may have SP_head(P) and SP_tail(P), which represent the service providers connected to the head-end and the tail-end of the SDWAN tunnel (determined by its public IP). Similarly, fabric(P) may represent the fabric of the path, which may be {MPLS, Internet, 3G, ...}.
The large-scale data collected by Cisco SDWAN provides extensive information on how paths look in the Internet.

[Figure 1: Path data across the Internet. Key statistics: 7.1 billion hours of active tunnel telemetry (12 months); up to 935,000 active tunnels seen every week; 1,800 Service Providers; 248 million tunnel failures (12 months); up to 2.1 million tunnel failures per day; distribution of 13,024 routers across the world, color coded by the number of service providers attached.]

Figure 1 shows the large-scale data across around 50 diverse SDWAN customers. This large-scale data is used to model paths across the Internet.

For this study, we use the data from 46 customers across 6 months (July to Dec 2019). There are 253,000 tunnels across 6 continents and 89 countries, reported from 5,500 edge-routers. The public IPs of the edge-routers are used to determine the Service Provider (SP) and the SP city, region and country. We define an SP as a combination of <SP-name, SP-region>; this allows characterizing the performance of SPs per geo-region. The data has 1,033 unique SPs.

3 Path Modelling

This section details the path modelling approach taken. Each KPI for the path is individually modelled using a generative model that describes the KPI characteristics.

3.1 Path Model objectives

The path model will be used to optimize several applications or to understand how the paths through a given, say, service provider behave. The primary objectives when designing path models for such generic goals are:

1. Uncertainty estimation: The KPI models should be designed such that they can be used not only to estimate a KPI value, but also to provide the uncertainty of the KPI. For example, the KPI model should be able to estimate the predicted variance of the KPI so that, for example, voice applications can choose a path only if it satisfies the loss, latency and jitter SLA 99% of the time or more.

2. Low-cost retraining: There are millions of paths across the Internet, and each path has multiple KPIs. While modelling each KPI for each path at every second is highly desirable, the computational requirement for maintaining such an up-to-date, accurate model is heavy. Hence, for maintainability, we choose a "train once and use many times" approach. We model how a KPI behaves over longer periods of time; this enables retraining path models at longer intervals while still capturing the major patterns in KPI behavior. Note that this is not to say that such coarse models are applicable everywhere. Wherever needed, detailed models that estimate the KPI more accurately for a specific use case should be preferred over the model proposed in this paper.

The next sections detail the latency and loss models.

3.2 Latency model

[Figure 2: Example path latency with two modes. Top panel: histogram of observed latency (probability vs. latency bin); bottom panel: Gaussian Mixture Model of latency (probability density vs. latency).]

The top graph in Figure 2 shows the histogram of latency for one selected path. The X-axis shows the latency bin, and the Y-axis shows the fraction of samples in a given latency bin. In this example, it can be seen that latency can have multiple modes. The bottom graph in Figure 2 shows that the distribution can be modelled as a mixture of two Gaussians: it plots the estimated probability density function (pdf), where the X-axis is the latency value and the Y-axis is the probability density of observing a given latency. The first Gaussian corresponds to the left mode, with a mean of around 50 ms and slightly higher variance. The second Gaussian is the right curve, with a 55 ms mode and lower variance.

We model latency as a univariate Gaussian Mixture Model (GMM) with multiple modes. If needed, extreme outliers can be removed before fitting the model.

A GMM requires estimating the number of Gaussian mixture components, k, present in the distribution; this is one of the main challenges in GMM fitting. We automatically find the value of k using Silverman's mode-estimation methodology.

The latency for the path can be expressed as a mixture of k components, with the i-th component represented by

latency_i(P) ~ N(μ_i, σ_i),

where (μ_i, σ_i) are the mean and standard deviation of the i-th component. The probability density function (pdf) for latency is then defined as:
p(x) = Σ_{i=1}^{k} w_i · N(x; μ_i, σ_i),

where w_i is the weight of the i-th component. GMM fitting uses Expectation Maximization (EM) to fit multiple Gaussian distributions to the observed values and returns appropriate values for (μ_i, σ_i, w_i).

One key challenge is to estimate the value of k (the number of components). Several mode-estimation methods, such as Silverman's mode estimation [1], can be used to estimate the number of modes in the latency distribution; these techniques utilize kernel-density estimation.
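A simplified sketch of this step is shown below: it counts the local maxima of a Gaussian kernel-density estimate of the latency samples on a fixed grid. This approximates the idea behind Silverman's method [1] rather than implementing the full test; the function and parameter names are illustrative.

```python
# Simplified sketch: estimate the number of latency modes by counting local maxima
# of a Gaussian kernel-density estimate (not the full Silverman test from [1]).
import numpy as np
from scipy.stats import gaussian_kde

def estimate_num_modes(latency_ms, grid_points=512):
    samples = np.asarray(latency_ms, dtype=float)
    kde = gaussian_kde(samples)                 # bandwidth chosen by Scott's rule by default
    grid = np.linspace(samples.min(), samples.max(), grid_points)
    density = kde(grid)
    # A grid point is a mode if its density exceeds both of its neighbours.
    is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
    return int(is_peak.sum())
```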

This pdf, or the corresponding cumulative distribution function (CDF), is then used to measure the probability that latency will be within a certain range.

Once the model is trained, the values of (μ_i, σ_i, w_i) are used by downstream use cases to check the latency characteristics of the path.
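The following sketch shows how such a per-path latency GMM could be fitted and queried, assuming scikit-learn and SciPy are available; it illustrates the approach rather than the production pipeline. The helper prob_latency_below (a hypothetical name) returns P(latency ≤ threshold) from the fitted (μ_i, σ_i, w_i), which is the CDF query used by the downstream use cases.

```python
# Sketch: fit a per-path latency GMM with k components and query P(latency <= threshold)
# from the fitted (mu_i, sigma_i, w_i). k can come from the mode-estimation step above.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def fit_latency_gmm(latency_ms, k):
    X = np.asarray(latency_ms, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())   # univariate data: one variance per component
    w = gmm.weights_
    return mu, sigma, w

def prob_latency_below(threshold_ms, mu, sigma, w):
    # The CDF of a Gaussian mixture is the weighted sum of the component CDFs.
    return float(np.sum(w * norm.cdf(threshold_ms, loc=mu, scale=sigma)))
```

Since jitter is also modelled as a GMM, the same two helpers apply unchanged to the jitter samples of a path.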

The bottom graph in Figure 2 shows the GMM fit on the example observations shown in the top graph.

Jitter can also be modelled as a GMM, similar to latency.

3.2.1 Latency models for multiple paths

Once all the paths are modelled, they can be analyzed by overlaying the distributions of all paths to infer how a set of paths behaves. For example, consider a set of paths which pass through one source region and source SP to a destination region and destination SP. One way to analyze how all the paths behave is to plot a time-series of all the paths. However, when there are hundreds of paths, it is usually hard to summarize the group behavior this way.

Figure 3: Latency distributions of all the paths passing through a pair of SPs.

Figure 3 shows the probability density functions (pdfs) of all paths between a source and destination SP-pair; these are computed from the GMM models above. Each line represents the pdf of one path. It can be clearly seen that: (a) there are two types of paths: one set of paths has a latency of around 60-150 ms, and another set of paths has latency > 200 ms; and (b) most of the paths with a mode of around 100-150 ms have very small standard deviation, i.e., the latencies of these paths can be expected to be stable around their mean. The application of path models can, hence, help analyze a set of paths and summarize its behavior.

3.2.2 Multiple modes

The latency is modelled as a GMM, which enables modeling latencies that have multiple modes. Such distributions can occur in scenarios where, say, the latency peaks during working hours or peak loads (corresponding to the right peak) when compared to non-working hours.

[Figure 4: Most paths have a single mode; around 3.9% of the paths have two or more modes. Distribution of the number of modes: 96.1% of paths have a single mode, 3.6% have two modes, 0.17% have three modes, and so on.]

As shown in Figure 4, most paths (96% of paths in our dataset) have a single mode. However, the model below is generic and captures both single and multiple modes.

[Figure 5: Different service providers (X-axis) and the percentage of multi-modal paths passing through each service provider, together with the number of paths per provider. One service provider with relatively few paths has around 70% of its paths with multiple modes; another service provider carries around 3,500 paths, of which 7% are multimodal.]

Figure 5 shows 16 SPs (X-axis) and the percentage of their paths that have more than one mode (Y-axis). A few service providers have a very high fraction of paths with multiple modes. This does not conclude anything about the SP itself; different properties of the paths need to be analyzed to infer what the cause might be. However, such an analysis provides a starting point to troubleshoot why latencies have multiple modes, which may affect applications. For example, the higher mode of latency may affect latency-sensitive applications.

3.3 Loss model

[Figure 6: Example of loss percentage for one path. Top panel: histogram of observed loss (probability vs. loss percentage); bottom panel: Beta model for loss (probability density vs. loss percentage).]

In this context, loss is defined as the percentage of packets that are lost. The top graph in Figure 6 shows the histogram of loss for one path. It can be seen that the loss is usually 0%, and sometimes increases up to 6%. In fact, most of the paths have a highly skewed loss distribution, with most values around 0%.

Loss for a path P is, hence, first modelled as a beta distribution:

loss(P) ~ Beta(α, β)

This provides a distribution over [0, 1]. α and β are called the shape parameters and control how the distribution behaves within the range [0, 1]. These shape parameters are automatically learnt by fitting the loss distributions.

Note that loss percentage values range over [0, 100], whereas the Beta distribution is defined on [0, 1]. Hence, we first scale the loss values to [0, 1] by mapping [-1, 100%] to the range [0, 1]. The left boundary for the model is set to -1 since it is well known that the beta distribution usually does not fit very well near the extreme ends of its range; extending the range is a well-known method to get good fits at the extreme values [2]. Other variations, such as inflated beta distributions [3], can also be used to overcome this artefact of fitting on [0, 1].
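A minimal sketch of this fitting step, assuming SciPy, is shown below. The loss samples are mapped from the extended range [-1, 100] onto [0, 1] (so that the common 0% observations do not sit exactly on the Beta boundary), and the shape parameters (α, β) are learnt by maximum likelihood; the function names are illustrative.

```python
# Sketch: fit the per-path loss Beta model on loss percentages scaled from [-1, 100] to [0, 1].
import numpy as np
from scipy.stats import beta

def fit_loss_beta(loss_pct):
    loss = np.asarray(loss_pct, dtype=float)
    scaled = (loss + 1.0) / 101.0                      # map [-1, 100] -> [0, 1]; 0% maps to ~0.0099
    a, b, _, _ = beta.fit(scaled, floc=0, fscale=1)    # fix location/scale, learn the shapes
    return a, b

def prob_loss_below(threshold_pct, a, b):
    # P(loss <= threshold) via the Beta CDF evaluated on the scaled axis.
    return float(beta.cdf((threshold_pct + 1.0) / 101.0, a, b))
```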

4 SLA Violation Probability

A direct application of the loss, latency and jitter models is to estimate application experience. If the SLA template for an application is known, the models can be used to compute the probability with which the path may violate the SLA.

For example, consider an application A with the SLA requirements:

• latency ≤ latencyThresh_A, and
• loss ≤ lossThresh_A, and
• jitter ≤ jitterThresh_A

From the latency GMM model for path P, it is quite straightforward to compute the SLA violation probability with respect to latency, pSla_latency(A, P).

[Figure 7: Probability of SLA violation due to latency is the cumulative probability from latencyThresh_A. The figure annotates the SLA latency threshold latencyThresh_A = 150 ms and pSla_latency(A, P) = 1 − CDF(latencyThresh_A).]

Note that the CDF at the point latencyThresh_A, i.e., F(latencyThresh_A), is the probability that the latency of path P is less than or equal to latencyThresh_A (as shown in Figure 7). Hence the probability of SLA violation of application A due to latency on path P is:

pSla_latency(A, P) = 1 − F(latencyThresh_A).

Similarly, the probabilities of SLA violation due to loss and jitter are:

pSla_loss(A, P) = 1 − F(lossThresh_A), and

pSla_jitter(A, P) = 1 − F(jitterThresh_A).

Since we statistically model loss, latency and jitter as standard distributions (or mixtures of standard distributions), it is straightforward to compute the CDF of these distributions from the parameters estimated during the distribution fitting.

The overall probability of SLA violation can then be defined as:

pSla(A, P) = 1 − [(1 − pSla_latency(A, P)) (1 − pSla_loss(A, P)) (1 − pSla_jitter(A, P))]

There is a limitation to the above formula: it assumes that loss, latency and jitter are independent, which is strictly not the case. In practice, however, it can be used as a representative value for the SLA violation probability.

We now discuss a few salient points about the models and their use in the downstream SLA violation probability computation. First, the models discussed in Section 3 can be trained at a low frequency, say, once every 15 days, and used for the next 15 days. Second, the models are much simpler to store and access for downstream use cases. For example, when the edge-router has to decide whether a given path P is suitable for some arbitrary application A, it only has to access (μ_i, σ_i, w_i) for the latency and jitter GMM models and (α, β) for the loss model of the given path; a handful of parameters in total. Compare this with accessing the entire time-series of loss, latency and jitter (which are often computed and stored at one-minute granularity) to compute how the path behaves. Thus, many downstream use cases benefit from the light-weight estimation of path properties.
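As an illustration of how light-weight this computation is, the sketch below combines the CDF helpers sketched in Sections 3.2 and 3.3 (prob_latency_below and prob_loss_below, both hypothetical names) to compute pSla(A, P) from the stored model parameters only; the default thresholds are the example voice SLA from the introduction.

```python
# Sketch: overall SLA-violation probability for application A on path P, computed only
# from the stored model parameters (GMM components for latency/jitter, Beta shapes for loss).
# Assumes the helper functions sketched earlier in this paper's examples.
def sla_violation_probability(path_model,
                              latency_thresh_ms=150.0,
                              jitter_thresh_ms=30.0,
                              loss_thresh_pct=1.0):
    # path_model is an illustrative dict, e.g.
    #   {"latency_gmm": (mu, sigma, w), "jitter_gmm": (mu, sigma, w), "loss_beta": (a, b)}
    p_latency = 1.0 - prob_latency_below(latency_thresh_ms, *path_model["latency_gmm"])
    p_jitter = 1.0 - prob_latency_below(jitter_thresh_ms, *path_model["jitter_gmm"])
    p_loss = 1.0 - prob_loss_below(loss_thresh_pct, *path_model["loss_beta"])
    # Combine per-KPI probabilities under the independence assumption noted above.
    return 1.0 - (1.0 - p_latency) * (1.0 - p_loss) * (1.0 - p_jitter)
```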

5 Service-Provider Path Behavior

Recall that a tunnel connects a head and a tail edge-router. The head and the tail can be connected to different Service Providers (SPs). This section discusses the use of path models to quantify the path KPIs between a pair of service providers. As described in Section 2, there are 1,033 distinct SPs spread across the world in this dataset. This section analyzes the models for paths passing through a pair of SPs: intra-country, intra-continental, and also inter-continental.

5.1 Latency analysis across SPs

Figure 8: Time-series of latencies of paths for an SP-pair connecting New Jersey to California. There are three clusters, with one prominent cluster containing paths from multiple customers.

Figure 9: Distribution of latency for all paths between an SP in New Jersey and another SP in California. The one prominent cluster between 60-100 ms (most of it with low variance), containing many customer paths, is clearly visible. Another cluster of low-latency paths at ~50 ms is also seen.

Figure 8 shows the time-series of latencies of several paths of different customers between an SP in NJ and another SP in California. Each line is the latency of one path from Oct 2019 to Dec 2019. The color represents the customer; all tunnels (paths) belonging to one customer are shown in one color. Figure 9 shows the corresponding latency distributions from the GMM models of these paths; each line here is the pdf of the latency. As before, one color indicates one customer. It can be seen in both Figure 8 and Figure 9 that the paths between these SPs can be divided into three clusters: (1) Cluster 1, composed of paths across multiple customers, contains a majority of the paths. (2) Cluster 2, very close to Cluster 1, has very few paths from one

customer. (3) Cluster 3 has an interesting mix of paths from some customers with low latency.

[Figure 10: Paths connecting inter-continental SPs. US-Argentina (SP-3/Florida to SP-4/Buenos Aires): most tunnels have similar mean latencies, but latency has high variance. England-Texas (SP-5/England to SP-6/Texas): tunnels have a mean of either ~125 ms or ~250 ms; tunnels with a mean of ~250 ms have high variance. US-India (SP-6/Texas to SP-7/Bangalore/India): highly varying latencies (from 100 ms to 250 ms); tunnels with larger latencies have higher variance.]

Figure 10 shows paths connecting inter-continental SPs. US to Argentina has predictable paths with a mean of around 60 ms, but with a large variance. England-Texas paths can be clustered into paths with lower latency (~125 ms) and paths with higher latency (200-300 ms). US-India links vary a lot: one customer has good paths with a latency of around 120 ms, but there are paths with very high latencies too. The red path with a latency of around 50 ms is an outlier.

5.2 Loss across SPs

Recall that loss is modelled as a beta distribution. We now show the losses across different SPs across continents.

[Figure 11: Path loss distribution across SPs in US-US and US-India. Left panel, US-US low loss (SP1/Texas to SP2/California): paths from Customer-3, Customer-4 and Customer-5 mostly have low loss, with a small portion of paths having high loss. Middle panel, US-India high loss (SP1/Texas to SP2/India): high mean losses for Customer-3 and Customer-4. Right panel, US-US high loss across customers (SP4/NJ to SP5/California): five customers have moderate loss, while one customer (Customer-2) has high loss.]

Figure 11 shows the distribution of losses for different paths across three SP-pairs: US/Texas to US/California, US/Texas to India/Bangalore, and US/NJ to US/California. Like before, each line represents the pdf for one path, and the color indicates one customer. The first graph shows many paths having similar characteristics (low loss) for three customers. The second graph shows the high mean loss observed from Texas to India. The third graph shows an interesting characteristic: five customers have moderate loss, but one customer (Customer-2) has very high loss. Such customer-specific behavior can be easily identified by observing the loss distributions.

[Figure 12: Path models can show customer-specific behavior (US-US, SP4/New Jersey to SP6/Massachusetts; high loss for one customer). Customer-2 has very high losses, Customer-1 has relatively low loss, and each customer has a specific pattern of loss. This indicates that the loss is probably occurring at the customer edge rather than in the SP.]

Figure 12 shows an extreme case of losses probably occurring at the customer edges rather than being a characteristic of the SP. Here there are two customers observed between a pair of SPs from US/NJ to US/Massachusetts. Customer-1 has low loss, whereas Customer-2 has very high losses.

6 Conclusion

The paper analyzes paths across the Internet, spanning 6 continents, 89 countries and 1,300 service-provider networks. The data collected has been reported from 5,500 edge-routers across multiple customers. The paper then proposes modelling path KPIs such as loss, latency and jitter. The models are built so that they capture the uncertainty in path KPIs and do not rely on frequent retraining. These models can be used by multiple downstream applications.

Latency and jitter are modelled as Gaussian Mixture Models. Loss is modelled as a Beta distribution. All the parameters for the models are automatically estimated.

As examples, we illustrate two use cases. First, we use the models to estimate the SLA violation probability for different applications without training one model for each application. Second, we show how to analyze paths that flow between service providers, and we analyze and show path properties across the Internet between different SPs.

References

[1] B. W. Silverman, "Using kernel density estimates to investigate multimodality," Journal of the Royal Statistical Society: Series B (Methodological), vol. 43, no. 1, pp. 97-99, 1981.

[2] F. Cribari-Neto and A. Zeileis, "Beta regression in R," Journal of Statistical Software, vol. 34, no. 10, pp. 1-24, 2010.

[3] R. Ospina and S. L. Ferrari, "A general class of zero-or-one inflated beta regression models," Computational Statistics & Data Analysis, vol. 56, no. 6, pp. 1609-1623, 2012.

[4] C. Molnar, Interpretable Machine Learning, 2019. [Online]. Available: https://christophm.github.io/interpretable-ml-book/global.html.

[5] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[6] L. Franceschi, M. Donini, P. Frasconi and M. Pontil, "Forward and reverse gradient-based hyperparameter optimization," in Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2017.

[7] hyperopt/hyperopt, GitHub. [Online]. Available: https://github.com/hyperopt/hyperopt.

[8] "Cross-validation (statistics)," Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Cross-validation_(statistics).

[9] "SHAP (SHapley Additive exPlanations)." [Online]. Available: https://github.com/slundberg/shap.

[10] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu and K. Knight, "Sparsity and smoothness via the fused lasso," Journal of the Royal Statistical Society: Series B, vol. 67, no. 1, pp. 91-108, 2005.