1 ISP only have 1.54Mbps capacity. This is because customers often do not have an incentive to pay for their providers to over-provision the access link capacity.

We define the operational constraints to be the set of constraints or rules that the monitoring system should comply with. For example, a typical constraint can be that the total measurement overhead over a link cannot exceed 1% of the link capacity.

In this paper, our goal is to design a monitoring and diagnosis system for the VPN infrastructure that ISPs deploy to host VPN services. Usually the backbone links are over-provisioned (e.g., 10Gbps) while access links may have several orders of magnitude less capacity (e.g., the 1.54Mbps mentioned above). Such high heterogeneity of link capacities and router capabilities makes it very likely that some monitors and links will be severely overloaded when monitors or paths are selected for monitoring (as in [1, 4, 5, 7]) without considering these constraints. Taking the operational constraints into account makes this problem very challenging and distinct from existing work for the following reasons.
• Considering the operational constraints makes some well-known problems much harder. As we show in Section 3.2, minimizing the number of monitors under the constraints becomes harder than some notorious NP-hard problems.
• Most tomography work assumes that all the paths to be monitored will be measured simultaneously [1, 4, 5, 7]. However, this setup may not be feasible or efficient under real-world constraints. Under tight constraints, the monitoring system may have to schedule the measurements in different time slots. We found this particularly crucial for VPN topologies, which are star-like: a backbone router connects to a large number of customer routers.

To address these challenges, in this paper, we propose
VScope, a monitoring and diagnosis system for VPN. While we mainly focus on VPN service in this paper, our system is general enough to work on any other network whose resources are limited and whose active monitoring system must respect operational constraints (e.g., the IP network of a small Tier-3 ISP). VScope consists of two phases: 1) a monitor setup phase, which selects the candidate routers as monitors and schedules the paths to be measured by the monitors in multiple rounds, and 2) a continuous monitoring and diagnosis phase. We make the following contributions in designing VScope.

First, we design algorithms to select as few monitors as possible that can monitor the whole network under the operational constraints. The special case of our problem ignoring the operational constraints is shown to be NP-hard in . Considering the operational constraints, we model our problem as a unique combination of the two-level nested Set Cover problem and the constraint satisfaction problem. We found that no existing solutions, such as those for variants of the Set Cover problem, can be directly applied to solve this new problem. Thus we design a greedy-assisted linear programming algorithm for it. In addition, we develop a simple but scalable greedy algorithm for a smooth efficiency-optimality tradeoff.

Secondly, VScope takes a multi-round measurement approach, which gives a smooth tradeoff between measurement frequency and monitor deployment/management cost. With the single-round measurement algorithms as the basis, we propose three algorithms to schedule the path measurements in different rounds while obeying the operational constraints.

Finally, VScope not only detects the
existence of some fault (e.g., large loss rates or latency) but also quickly identifies exactly which links are faulty so that operators can take action for mitigation. We design a continuous monitoring and diagnosis mechanism which aims to select the minimal number of paths to measure for diagnosis under operational constraints.

Besides some synthetic topologies, we mainly evaluate the VScope system with one IP network topology and two VPN topologies,
all with the real topologies, capacities, loss rates and constraints, from a tier-1 ISP. The sizes of the networks vary from hundreds to hundreds of thousands of routers. The results demonstrate that we are able to select about 5% of routers as monitors to cover all the links while satisfying all the operational constraints. When faulty paths are discovered, in about 17 seconds, we identify only a small number of extra paths to be measured for locating the faulty links at a fine granularity close to that of the physical link. Both false positives and false negatives for locating the faulty links are less than 0.5%.

The rest of the paper is organized as follows. We introduce the problem and the VScope architecture in Section 2. We present our design for monitor selection in Section 3 and our design for diagnosis in Section 4. The dynamics issues are discussed in Section 5. Then we show the evaluation methodology and results in Section 6. Finally, we present related work in Section 7 and conclude in
Faulty means lossy in this paper as we currently focus on loss rates.