Professional Documents
Culture Documents
2013-02
Abstract As the Internet becomes increasingly impor- Keywords resilient survivable disruption-tolerant
tant to all aspects of society, the consequences of dis- network · dependability performability · diverse topol-
ruption become increasingly severe. Thus it is critical ogy generation · network analysis experimentation ·
to increase the resilience and survivability of future net- ns-3 simulation methodology
works. We define resilience as the ability of the network
to provide desired service even when challenged by at-
tacks, large-scale disasters, and other failures. This pa- 1 Introduction and Motivation
per describes a comprehensive methodology to evaluate
network resilience using a combination of topology gen- The increasing importance of the Global Internet has
eration, analytical, simulation, and experimental emu- lead to it becoming one of the critical infrastructures [2]
lation techniques with the goal of improving the re- on which almost every aspect of our lives depend. Thus
silience and survivability of the Future Internet. it is essential that the Internet be resilient, which we
define as the ability of the network to provide and
This research was supported in part by the the National Sci- maintain an acceptable level of service in the face of
ence Foundation FIND (Future Internet Design) Program un- various faults and challenges to normal operation [133,
der grant CNS-0626918 (Postmodern Internet Architecture), 132]. It is generally recognised that the current Inter-
by NSF grant CNS-1050226 (Multilayer Network Resilience
net is not as resilient, survivable, dependable, and se-
Analysis and Experimentation on GENI), and the European
Commission FIRE (Future Internet Research and Experimen- cure as needed given its increasingly central role in so-
tation Programme) under grant FP7-224619 (ResumeNet). ciety [14, 70, 126, 20, 17, 148]. Thus, we need to ensure
James P.G. Sterbenz
that resilience is a fundamental design property of the
Electrical Engineering and Computer Science, Future Internet, and seek ways to increase the resilience
Information and Telecommunication Technology Center, of the current and future Internet. This requires an un-
The University of Kansas, Lawrence, Kansas, USA derstanding of vulnerabilities of the current Internet,
http://www.ittc.ku.edu/resilinets
jpgs@ittc.ku.edu, +1 785 864 7890
as well as a methodology to test alternative proposals
Computing Department, InfoLab21 to increase resilience. In particular, we are interested in
Lancaster University, Lancaster, UK, understanding, modelling, and analysing the properties
jpgs@comp.lancs.ac.uk of dependability that quantifies the reliance that can be
Egemen K. Çetinkaya placed on the service delivered including reliability and
ekc@ittc.ku.edu availability [89] and performability that quantifies the
Mahmood A. Hameed level of performance [101] when the network is chal-
hameed@ittc.ku.edu lenged. This notion of resilience subsumes survivability
Abdul Jabbar that is the ability to tolerate the correlated failures that
jabbar@ge.com result from attacks and large-scale disasters [134, 105,
Shi Qian 59, 66] and disruption-tolerance that is the ability to
shiqian@ittc.ku.edu communicate even when stable end-to-end paths may
Justin P. Rohrer not exist due to weak channel connectivity, mobility,
rohrej@ittc.ku.edu unpredictable delay, and energy constraints [134, 62].
ITTC © James P.G. Sterbenz
ResiliNets Strategy
2 James P.G. Sterbenz et al.
D
d
Section 2 reviews the ResiliNets architectural frame-
et
en
ec
ef
work for network resilience based on a two-phase strat-
t
D
egy and principles for resilient network design. Section 3 Defend
presents the problem of generating realistic topologies
te
that can be used to evaluate network resilience, in-
ia
ec
ed
troduces the KU-LoCGen topology generator and KU-
ov
em
er
R
TopView, and discusses the issues of multi-level topol-
ogy overlays. Section 4 describes an analytical formula-
Refine
tion of resilience as the trajectory through a multilevel
two-dimensional state space with operational and ser- Fig. 1 ResiliNets strategy
vice dimensions and presents a few examples of this
analysis. Section 5 describes a simulation methodology
to evaluate the resilience of alternative22network
January 2008
archi- KU0.EECS 983 – Resilient & Survivable Nets – Introduction
Faults are inevitable; it is not possible to construct
RSN-IM-4
tectures with emphasis on attacks and area-based chal- perfect systems, nor is it possible to prevent chal-
lenges using the KU-CSM challenge simulation module lenges and threats1 .
with example simulation results. Section 6 briefly de- 1. Understanding normal operation is necessary, in-
scribes how the GpENI large-scale programmable testbed cluding the environment and application demands.
infrastructure can be used to experimentally validate It is only by understanding normal operation that
and cross-verify with analytical and simulation-based we have any hope of determining when the network
resilience analysis. Finally, Section 7 summarises the is challenged or threatened.
main points of the the paper and discusses future re- 2. Expectation and preparation for adverse events
search in this area. and conditions is necessary, so that that defences
and detection of challenges that disrupt normal op-
erations can occur. These challenges are inevitable.
2 Resilience Framework, Strategy, Principles 3. Response to adverse events and conditions is re-
quired for resilience, by remediation ensuring cor-
This section reviews the ResiliNets framework for re- rect operation and graceful degradation, restoration
silient, survivable, and disruption-tolerant network ar- to normal operation, diagnosis of root cause faults,
chitecture and design [133, 132]. First, a two-phase re- and refinement of future responses.
silience strategy is described that provides the basis of The ResiliNets strategy consists of two phases D2 R2
the metrics framework presented in Section 4. Then, a +DR, as shown in Figure 1. At the core are passive
set of design principles is presented with emphasis on structural defences. The first active phase D2 R2 : de-
heterogeneity, redundancy, and diversity that are used fend, detect, remediate, recover, is the inner control loop
in the topology analysis in Section 3 and simulation and describes a set of activities that are undertaken in
methodology in Section 5. order for a system to rapidly adapt to challenges and
attacks and maintain an acceptable level of service. The
second active phase DR: diagnose, refine, is the outer
2.1 ResiliNets Strategy loop that enables longer-term evolution of the system
in order to enhance the approaches to the activities of
There have been several systematic resilience strategies, the inner loop. The following sections briefly describe
including ANSA [56], T1A1.2 [140], CMU-CERT [58], the steps in this strategy.
and SUMOWIN [134]. This ResiliNets resilience frame-
work and strategy [133, 132] are based in part on these 2.1.1 D2 R2 Inner Loop
previous frameworks and provides the basis for the re-
silience evaluation methodology described in the rest The first strategy phase consists of a passive core and
of the paper. More recently, the policy aspects of re- a cycle of four steps that are performed in real time
silience mechanisms are being studied [130, 125]. The 1
The strict usage of fault, error, and failure terminology
framework begins with a set of four axioms that moti- is based on [28] and fully explained in the ResiliNets context
vate the strategy: in [133].
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 3
and are directly involved in network operation and ser- ing an adverse condition. This requires adaptation and
vice provision. In fact, there is not just one of these autonomic behaviour so that corrective action can be
cycles, but many operating simultaneously throughout taken at all levels without direct human intervention, to
and across the network for each resilient subsystem, minimise the impact of service failure, including correct
triggered whenever an adverse event or condition is de- operation with graceful degradation of performance.
tected. A common example of remediation is for dynamic
Defend against challenges and threats to normal routing protocols to reroute around failures (e.g. [80])
operation. The basis for a resilient network is a set of and for adaptive applications and congestion control al-
defences that reduce the probability of a fault leading gorithms to degrade gracefully from acceptable to im-
to a failure (fault-tolerance) and reduce the impact of paired service (Section 4).
an adverse event on network service delivery. These de- Recover to original and normal operations. Once
fences are identified by developing and analysing threat the challenge is over after an adverse event or the end
models, and consist of a passive and active component. of an adverse condition, the network may remain in a
Passive defences are primarily structural, suggest- degraded state (Section 4). When the end of a challenge
ing the use of trust boundaries, redundancy, diversity, has been detected (e.g., a storm has passed, which re-
and heterogeneity. The main network techniques are to stores wireless connectivity), the system must recover
provide geographically diverse redundant paths and al- to its original optimal normal operation, since the net-
ternative technologies such as simultaneous wired and work is likely not to be in an ideal state, and continued
wireless links, so that a challenge to part of the net- remediation activities may incur an additional resource
work permits communication to be routed around the cost.
failure [123].
Active defences consist of self-protection mechanisms 2.1.2 D+R Outer Loop
operating in the network that defend against challenges,
such as firewalls that filter traffic for anomalies and The second phase consists of two background opera-
known attack signatures, and the eventual connectivity tions that observe and modify the behaviour of the
paradigm that permits communication to occur even D2 R2 cycle: diagnosis of faults and refinement of fu-
when stable end-to-end paths cannot be maintained. ture behaviour. While currently these activities gener-
Clearly, defences will not always prevent challenges from ally have a significant human involvement, a future goal
penetrating the network, which leads to the next strat- is for autonomic systems to automate diagnosis and re-
egy step: detect. finement.
Detect when an adverse event or condition has oc- Diagnose the fault that was the root cause. While
curred. The second step is for the network as a dis- it is not possible to directly detect faults, we may be
tributed system, as well as individual components such able diagnose the fault that caused an observable er-
as routers, to detect challenges and to understand when ror. In some cases this may be automated, but more
the defence mechanisms have failed [67]. There are three generally it is an offline process of root-cause analy-
main ways to determine if the network is challenged. sis. The goal is to either remove the fault (generally a
The first of these involves understanding the service design flaw as opposed to an intentional design com-
requirements and normal operational behaviour of a promise) or add redundancy for fault-tolerance so that
system and detecting deviations from it – anomaly de- service failures are avoided in the future. An example of
tection based on metrics (described in Section 4). The network-based fault diagnosis is the analysis of packet
second approach involves detecting when errors occur traces to determine a protocol vulnerability that can
in a system, for example, by calculating CRCs (cyclic- then be fixed.
redundancy checks) to determine the existence of bit Refine behaviour for the future based on past D2 R2
errors that could lead to a service failure. Finally, a cycles. The final aspect of the strategy is to refine be-
system should detect service failures; an essential facet haviour for the future based on past D2 R2 cycles. The
of this is an understanding of service requirements. An goal is to learn and reflect on how the system has de-
important aspect of detecting a challenge is determin- fended, detected, remediated, and recovered so that all
ing its nature, which requires context awareness. of these can be improved to continuously increase the
Remediate the effects of the adverse event or con- resilience of the network using the evaluation techniques
dition. The next step is to remediate the effects of the described in this paper. This is an ongoing process that
detected adverse event or condition to minimise the im- requires that the network infrastructure, protocols, and
pact on service delivery. The goal is to do the best resilience mechanisms be evolvable. This is a significant
possible at all levels after an adverse event and dur- challenge given the current Internet hourglass waist of
4 James P.G. Sterbenz et al.
IPv4, BGP (border gateway protocol), and DNS (do- severely-degraded) and service state P (in the range ac-
main name system), as well as other mechanisms (e.g. ceptable ↔ impaired ↔ unacceptable) to detect, reme-
NAT – network address translation) and protocol ar- diate, and quantify resilience, as well as to refine future
chitectures (e.g. TCP and HTTP) that are entrenched behaviour. The set of parameters (N, P ) and the way in
and resist innovation. which they are combined as objective functions to de-
termine the scales (N, P) of the two dimensional state
space are the fundamental basis for the measurement of
2.2 Resilience Design Principles resilience R at a particular service level, leading to the
multi-level composition into overall network resilience
The D2 R2 +DR strategy leads to a set of principles for
< described in Section 4.
the design of resilient networks and systems, developed
P5. Heterogeneity in mechanism, trust, and pol-
as part of the ResiliNets framework [133, 127] that pro-
icy are the realities that no single technology is appro-
vides detailed explanations with their derivation from
priate for all scenarios, and choices change as time pro-
the strategy and inter-relaitonships, as well as more ex-
gresses. The emerging Future Internet will be a collec-
tensive background references. This section provides a
tion of realms [46] of disparate technologies [34], which
brief summary of the principles, shown in Figure 2, and
also define trust and policy boundaries across which
describes the way in which they relate to the evaluation
there is tussle [47]. Resilience mechanisms must not
of network resilience that is the subject of the rest of
only deal with this heterogeneity, but can also exploit
this paper.
it by using diversity in mechanism as a defence, and by
providing self-protection mechanisms at realm bound-
2.2.1 Prerequisites
aries.
The first set of five principles span the domain of pre-
requisites necessary to build a resilient system. Three 2.2.2 Design Tradeoffs
of these are essential for the evaluation and analysis of
resilience: service requirements, normal behaviour, and The second set of principles describe fundamental trade-
metrics. offs that must be made while developing and analysing
P1. Service requirements of applications need to be a resilient system.
determined to understand the level of resilience the sys- P6. Resource tradeoffs determine the deployment
tem should provide. Service parameters are the vertical of resilience mechanisms. The relative composition and
axis P of the metrics state space described in Section 4. placement of these resources must be balanced to opti-
The resilience requirements at a particular service level, mise resilience and cost. Resources to be traded against
consisting of a set of parameters P , define acceptable, one-another include bandwidth, memory [116], process-
impaired, and unacceptable service. ing, latency [136], energy, and monetary cost. These can
P2. Normal behaviour of the network is a combina- either be viewed as resources contributing to the oper-
tion of design and engineering specification, along with ational state N or as constraints that define the service
monitoring while unchallenged to learn the network’s state P. Of particular note is that maximum resilience
normal operational parameters [129]. Operational pa- can be obtained with unlimited cost, consisting in part
rameters are the horizontal axis N of the metrics state of a full mesh of hardened overprovisioned links, but
space described in Section 4. The resilience of the un- there are cost constraints that limit the use of enablers
derlying system when challenged, consisting of a set of such as redundancy and diversity.
parameters N , define normal, partially degraded, and P7. Complexity of the network results due to the in-
severely degraded operation. Understanding normal be- teraction of systems at multiple levels of hardware and
haviour is a fundamental requirement for detecting chal- software, and is related to scalability. While many of the
lenges to normal operation. resilience principles and mechanisms increase this com-
P3. Threat and challenge models are essential to plexity, complexity itself makes systems difficult to un-
understanding and detecting potential adverse events derstand and manage, and thereby threatens resilience.
and conditions. It is not possible to understand, define, The degree of complexity must be carefully balanced in
and implement mechanisms for resilience that defend terms of cost vs. benefit, and unnecessary complexity
against, detect, and remediate challenges without such should be eliminated.
a model. P8. State management is an essential part of any
P4. Metrics quantifying the service requirements and large complex system. It is related to resilience in two
operational state are needed to measure the operational ways: First, the choice of state management impacts
state N (in the range normal ↔ partially-degraded ↔ the resilience of the network. Second, resilience mech-
Principles
service self-protection
requirements
resource connectivity self-organising
normal tradeoffs and autonomic
redundancy
behaviour
complexity diversity adaptable
threat and
challenge models multilevel
state
evolvable
metrics management context awareness
heterogeneity translucency
prerequisites tradeoffs enablers behaviour
Fig. 2 ResiliNets principles
anisms themselves require state and it is important fault-tolerance. In the case that a fault is activated and
that they achieve their goal in increasing overall re- results in an error, redundant components are able to
22 January !""#Introduction © James P.G. Sterbenz
silience by the way2008
in whichKU EECS
they 983 – Resilient
manage & Survivable
state, re- operate Nets
and –prevent RSN-IM-3
a service failure. It is important
quiring a tradeoff among the design choices. Resilience to note that redundancy does not inherently prevent
tends to favour soft, distributed, inconsistency-tolerant !"#$%&"'()*+,-./$
the redundant components from sharing the same fate,
state rather than hard, centralised, consistent state, but motivating the need for diversity.
careful choices must be made in every case, and it is the
measurement of resilience < that helps determine the
proper state management decisions. backbone
wireless provider 1 wireless
access net access net
2.2.3 Enablers
(a) L3 IP PoP interconnection (b) MPLS backbone (c) Physical fiber links
Fig. 5 Comparison of logical overlay and physical topologies for Sprint
grounded in a understanding of the structure of real However, an edge between a pair of vertices almost
networks constrained by cost, location, and practical- never corresponds to a direct physical link without any
ity of infrastructure deployment. The rest of this sec- intermediary lower layer nodes due to the underlay-
tion will explore these issues and introduce the topology ing structures that provide IP connectivity, including
generator KU LoCGen (The University of Kansas Lo- L2.5-traffic-engineering underlays such as MPLS (mul-
cation and Cost-Constrained Topology Generator) [81] tiprotocol label switching – Figure 5b [19]), L2 struc-
and the topology viewer KU-TopView (The University ture such as SONET/SDH (synchronous optical net-
of Kansas Topology Viewer and Combiner) [22]. work / synchronous digital hierarchy) rings including
cross connects and ADMs (add-drop multiplexors), and
fibre links interconnected by regenerators and ampli-
fiers (Figure 5c [85]). Hence, neither the AS-level nor
3.1 Logical vs. Physical Topologies router-level graphs represents the actual physical con-
nection between nodes, as can be seen in Figure 5c for
The majority of the existing body of research is based the Sprint network. In this example, the San Jose –
on logical topology models focusing on the generation Kansas City IP interconnection might go through the
of either AS-level [149, 100] or router-level [100] topolo- Stockton or Anaheim – Ft. Worth MPLS nodes, which
gies. In an AS-level topology, each Internet autonomous follows a geographic fiber path significantly different
system (AS) is represented as a single node and the from that implied by either the L3 or L2.5 graphs.
BGP (border gateway protocol) connectivity between While L3 topologies are useful for modelling the the
the ASes represents the graph edge connectivity; this resilience of L3 services such as BGP and IGP routing,
models the highest-level service provider structure of they are not sufficient to understand the resilience of the
the Internet. While most of the approaches aggregate physical infrastructure to a number of challenges, in-
the intra-AS topology to a single node, some do con- cluding large-scale disasters and attacks against the in-
sider the complexity or structure within a given AS as frastructure, explored in Section 5. Furthermore, since
a mesh. it is possible for two distinct IP paths of different ser-
In the router-level L3 (layer-3) graph, each IP router vice providers to share the same physical conduit, it
is represented as a vertex and a logical IP link between is difficult to understand and engineer the resilience of
a pair of routers forms the edge between the vertices. the network by assuming that IP links correspond to
An example of the Rocketfuel-inferred [131] Sprint L3 physical links. Without an understanding of the geo-
topology is shown in Figure 5a. graphic location of physical network nodes and links
One of motivating factors for the study of logical and their correspondence to logical links, it is not pos-
topologies is that the L3 protocols such as IP, IGPs sible to know if the logical components share fate, as
(interior gateway protocols), and BGP only see L3 con- was the case in the Baltimore tunnel fire [138] in which
nectivity of the Internet. Furthermore, the majority many logically distinct links failed at the same time
of inference mechanisms [73] are only able to collect when all the fibre running through the tunnel melted.
data on the the router-level connectivity of commer- Therefore, we argue that resilience evaluation of a
cial networks. To date, results from topology modelling network must begin with the physical topology and ge-
have been used for evaluating various aspects of net- ography because it is the physical topology that ulti-
works [36] including security, performance, traffic mod- mately determines the ability to survive infrastructure
eling and engineering, protocol development and analy- failures. Service and network dependability and per-
sis, as well as evaluation of numerous other algorithms. formability in the face of failures is highly dependent
8 James P.G. Sterbenz et al.
on the physical topology. For example, in the case of as representing an entire PoP (point-of-presence) area,
recovery after a large-scale disaster, it becomes the sur- and a link as the physical bundle of fibers buried in a
viving physical infrastructure that drives traffic man- given right-of-way. This distinction between WAN and
agement decisions. This leads to the need for realistic LAN (local-area network) component identifiers affects
topology models and generators that reflect the physi- only the population of the path database, not the usage
cal and geographic structure of the network, as well as of the diversity metric.
the logical topology overlays.
We note that one of the reasons for the previous lack 3.2.1 Diversity Metric
of interest in physical layer topologies may in part be
the abstraction of protection mechanisms. For example, Given a (source s, destination d) node pair, a path P
link-level protection such as SONET/SDH automatic between them is a vector containing all links L and
protection switching (APS) [57], p-cycles [72], and f - all intermediate nodes N traversed by that path P =
cycles [106] provide fault-tolerant masking of uncorre- L ∪ N and the length of this path |P | = |L| + |N | is the
lated failures, and shared-link risk groups (SLRGs) [137] combined total number of elements in L and N .
provide topological diversity, but not necessarily geo- Let the shortest path between a given (s, d) pair be
graphic diversity. These mechanisms do not solve the P0 , L0 be a vector containing the set of links traversed
fate-sharing problem nor provide resilience against cor- by P0 , and N0 be a vector containing the nodes which
related failures and attacks. lie on P0 . Then, for any other path Pk between the same
One of the fundamental challenges in developing a source and destination, we define the diversity function
physical topology model is the lack of real data for val- D(x) with respect to P0 as:
idation of the models. The physical topology of com-
|Pk ∩ P0 |
mercial networks including the Internet are not readily D(Pk ) = 1 −
available. Previous research has considered several in- |P0 |
ference mechanisms to determine geographic node lo- The path diversity has a value of 1 if Pk and P0 are
cations and physical link distances [64, 113, 88], but de- completely disjoint and a value of 0 if Pk and P0 are
spite these efforts, the inference of physical topologies identical. For two arbitrary paths Pa and Pb the path
remains an open problem. There are, however, a few diversity is given as:
educational and research networks such as GÉANT2,
|Pb ∩ Pa |
NLR (National LambdaRail), and Internet2 for which D(Pb , Pa ) = 1 −
the physical topology is available for validation, but un- |Pa |
fortunately research networks are generally significantly where |Pa | ≤ |Pb |.
smaller than large commercial ISPs. It should be noted
that physical topology generation and analysis is fun-
damentally an intra-domain issue. Hence, we can inde- 0 1 2
pendently validate a model against an individual ISP D(P2) = 2/3
As described in Section 2.2.3, a key enabler to resilience It has been claimed [109] that measuring diversity
is diversity [124, 123, 120] such that when challenges im- (referred to as novelty) with respect to either nodes or
pact part of the network, other parts do not share fate links is sufficient, however we assert that this is not the
and are able to continue communicating. In the case of case. Figure 6 shows the shortest path, P0 , along with
topological resilience, it is important that diverse phys- the alternate paths P1 and P2 both of which have a
ical paths exist, and that end-to-end communication is (link) novelty of 1. However, given a failure on node 1,
able to exploit this capability and choose paths that are both P0 and P2 will fail. In our approach, D(P2 ) = 32 ,
unlikely to experience correlated failures. To this end, which reflects this vulnerability. P1 on the other hand
we define a measure of diversity (introduced in [124]; has both a novelty of 1 and a diversity of 1, and does not
further developed in [123]) that quantifies the degree to share any common point of failure with P0 . Similarly,
which alternate paths share the same nodes and links. the wavelengths or fibres from multiple nodes may in
Note that in the WAN (wide-area network) context in fact be shared by a single physical conduit such as was
which we are concerned with events and connections the case in the Baltimore tunnel fire [138], resulting
on a large geographic scale, a node may be thought of in a single point of failure, thus illustrating the need
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 9
for including both nodes and links into the diversity first mode is to find k maximally diverse paths. We first
measure. find the shortest fully disjoint paths, and if additional
paths are required we continue finding paths that add
3.2.2 Effective Path Diversity maximum diversity as calculated using the equation for
ksd . The second mode selects as many maximally di-
Effective path diversity (EPD) is an aggregation of path verse paths as are required to achieve the requested
diversities for a selected set of paths between a given EPD. Finally, the third mode selects all maximally di-
node-pair (s, d). To calculate EPD we use the exponen- verse paths with stretch less than the stretch limit. In
tial function all modes, the set of maximally diverse paths are found
EPD = 1 − e−λksd using the Floyd-Warshall algorithm with modified edge
where ksd is a measure of the added diversity defined weights [33]. In this algorithm, only those paths are
as used that increase the EPD for the node pair in ques-
k
X tion. Recall that only paths with one or more disjoint
ksd = Dmin (Pi ) elements (links+nodes) will result in non-zero Dmin and
i=1
consequently increase EPD.
where Dmin (Pi ) is the minimum diversity of path i
when evaluated against all previously selected paths for
that pair of nodes. λ is an experimentally determined 3.2.4 Measuring Graph Diversity
constant that scales the impact of ksd based on the util-
The total graph diversity (TGD) is simply the average
ity of this added diversity. A high value of λ (> 1) indi-
of the EPD values of all node pairs within that graph.
cates lower marginal utility for additional paths, while
This allows us to quantify the diversity that can be
a low value of λ indicates a higher marginal utility for
achieved for a particular topology, not just for a par-
additional paths. Using EPD allows us both to bound
ticular flow. For example a star or tree topology will
the diversity measurement on the range [0,1) (an EPD
always have a TGD of 0, while a ring topology will have
of 1 would indicate an infinite diversity) and also re-
a TGD of 0.6 given a λ of 1. In Figure 7 we compare
flect the decreasing marginal utility provided by addi-
three different real network topology TGD plots with
tional paths in real networks. This property is based on
those of four regular topologies (full-mesh, Manhattan
the aggregate diversity of the paths connecting the two
grid, ring, and star). The Sprint and AT&T topologies
nodes.
are inferred from Rocketfuel [4]; GÉANT2 [9] nodes are
the actual location.
1
Table 1 Network statistics
0.8
Nodes Links Avg. deg TGD
full-mesh 20 190 19.00 0.99
total graph diversity
of inference mechanisms and the factors that lead to HFC (hybrid-fibre coax) trees, to a modified power-
such properties [39, 44, 61]. The latest studies have em- law preferential attachment of subscribers to access net-
phasised the need to model the process of network de- works. Figure 11 shows an example of a topology mod-
velopment instead of replicating properties in random elled by KU-LoCGen representing a mesh backbone at
graphs [90, 25]. level 1, various access topologies at level 2, and prefer-
Emphasis is increasingly placed on reconciling the ential attachment of subscribers at level 3.
differences between purely analytical models and prac- The current KU-LoCGen implementation generates
tical design principles. It was observed that Internet three levels representing the backbone, access, and sub-
does not have an “achilles heel” [55, 25] of a purely scriber networks whose geographic distribution is rep-
power-law graph and that the processes that led to resented by Ψp , Ψn , Ψs respectively. Furthermore, the
Internet properties are quite complex involving vari- number and geographic spread of nodes at any given
ous optimisations that are characteristics of real net- level is strongly correlated to the higher level nodes in
works [24, 39]. The latest understanding of the research the hierarchy as discussed below. The backbone node
community is that the purely analytical models are not (level-1) distribution model Ψp supports three differ-
very representative of actual networks and are hence ent location constraints including fixed geographic po-
not capable of producing realistic topology models. Sev- sitions based on known point-of-presence geolocations
eral heuristic methods that incorporate real world op- of existing networks, user defined location, and a ran-
timisations and tradeoffs have been proposed [55, 146]. dom distribution as discussed below.
Specifically, the method of highly-optimised tolerance [39, The number of access networks (level-2) N (i) are
24,25] proposes network topology as a result of resilience chosen based on a uniformly distributed random vari-
optimisation with non-generic, highly engineered con- able. N = U (nmin , nmax ), where nmin and nmax are the
figurations. lower and upper limits on the number of access net-
works per backbone node. The N (i) access networks
are distributed around a given PoP using a Gaussian
distribution: Ψn = N [µn , σn2 ], where µn represents the
PoP location and σn2 is the variance. The subscriber net-
works are distributed normally: Ψs = N [µs , σs2 ], where
µs represents the access network location and σs2 is the
variance. Obviously, the variance determines the geo-
graphical extent and the spread of the subscribers. Ad-
ditionally, the variance of each access network may vary
according to the size and location of the access network
as well as the PoP to which the access network is con-
nected: σs2 (i) ∝ N1(i)
Backbone network
Access network
The number of access nodes M (i) in the ith access
Subscriber network network of the j th backbone node is based on the dis-
Fig. 11 KU-LoCGen hierarchical topology model tance of the access network from the backbone node
relative to the other access networks connected to the
same backbone node. The number of nodes in the access
network is given as
3.3.2 Hierarchy in KU-LoCGen
max(dt ); t = 1, 2, ...N (j)
The goal of KU-LoCGen (The University of Kansas Lo- M (i) = × Mmin
di
cation and Cost-Constrained Topology Generator) is
to provide a flexible framework that allows for an n-
level hierarchical, modular structure with level-specific where di is the distance of the ith access network and
graphs, constrained by cost, population, and infrastruc- Mmin is the minimum number of access networks de-
ture location. Furthermore, the graph models used at fined per PoP. Furthermore M (i) is also the upper bound
each level differ significantly. These vary from closed- to a predefined maximum value of Mmax . The access
form general-purpose mesh models (e.g. Waxman [147]) network nodes are then uniformly distributed in a cir-
typical in backbones, to pre-structured models such as cular region of radius r around the first access node.
rings and trees typical in access networks based on par- Therefore Ψm = U (0, r). The number of subscribers in
ticular technologies such as SONET/SDH rings and an access network is directly proportional to the size of
12 James P.G. Sterbenz et al.
capital city
other major city
generated PoP
(a) Predicted PoPs (20) (b) Actual population density chart [42]
(reprinted with permission)
Fig. 13 Cluster centres for Africa
providers, so as to not neglect certain parts of the US erate the optimal location of backbone PoPs that could
that may be under-served by a particular ISP. Fig- be used by an ISP desiring to have a continent-wide
ure 12 shows a comparison of 112 PoPs generated using topology. Figure 13a shows the predicted location of
our population based model with the existing 112 com- 20 PoPs for Africa, next to a population-density map
bined Rocketfuel-inferred [131] L3 PoP cities of Sprint, (Figure 13b) [23] for visual comparison. Since there is
AT&T, and Level3. no continent-wide ISP in Africa, we cannot compare
predicted node locations with existing infrastructure.
(a) Sprint physical topology (b) Sprint with railway mainlines (c) Sprint with Interstate highways
Fig. 15 Comparison of Sprint fiber topology and main transportation routes
is much denser in population than Chennai and the PoP ditional step of “snapping-to-grid” nodes to infrastruc-
placed near Bangalore is close enough to Chennai for ture, and should improve the accuracy of node place-
our algorithm to place another PoP. ment over purely population based. For example, in
The quarterly report released by Telecom Regula- Figure 12 there are a number of nodes in the sparsely-
tory Authority of India [143] is used to get the state- populated Western US that would snap to larger cities
wide list of broadband subscribers in India. Technology at fibre junctions and be located even more closely to
penetration is incorporated into our model by weight- existing PoP cities. Figure 15 shows the relationship of
ing the population of each grid in an area by the corre- the Sprint physical fibre topology to railway mainlines
sponding γ and then clustering the resulting data set. (based on [121]) and Interstate freeways (based on [21])
After incorporating γ, the 4 PoPs which matched ear- in the US.
lier get closer to the real locations, while the one in
Patna moves to Kolkata as it is one of the metropolitan
areas with a high number of Internet users. 3.5 Cost-Constrained Connectivity
link, vi,j is the variable cost per unit distance for link Table 2 Offset distance with existing PoPs in km
deployment, and di,j is the length of the link. For sim- Network (PoPs) Mean σ Min. Max.
plicity we generally choose vi,j = d¯× vi,j where d¯ is the Sprint (27) 54.2 45.3 2.6 163.6
average link length of the network. The level-1 nodes AT&T (106) 26.5 37.3 1.1 265.2
in our model are connected using a cost-constrained Level 3 (38) 43.4 31.7 9.6 118.6
Waxman model, which is a reasonable representation GÉANT2 (34) 101.5 54.1 20.2 252.3
Ebone (27) 56.3 27.9 17.7 131.1
of link connectivity in a backbone network [88]. While
Tiscali (47) 34.8 22.3 2.47 80.6
it is generally agreed that backbone networks are mesh- VSNL (5) 26.7 34.9 2.6 265.1
like [55], there is some contention as to exact relation-
ship between link probability and its distance; some
works claim that this is exponential [88], but others Table 2 is a summary of our results pertaining to the
claim that it is linear [150]. locations of PoPs for various ISPs in the US, Europe,
According to the Waxman model [147] the proba- and India [75]. Noted that all of them are Rocketfuel-
bility that two nodes u and v have a link between them inferred topologies except for the European GÉANT2 [9]
is given by research network. Our predictions match very well with
−d(u,v) ISPs with large infrastructure. For example, in the case
P (u, v) = βe Lα
of AT&T, the mean separation between real and clus-
where 0 < α, β ≤ 1 and L is the maximum distance be- tered nodes is 26.5 km and the closest match is with an
tween any two nodes. The Waxman parameters α and offset of 1.1 km.
β are controlled by the cost. A high value of α corre-
sponds to a high fraction of short to long links and β
is directly proportional to the link density; d is the Eu- 3.6 Example Synthetic Network Graph
clidian distance between the two nodes. The Waxman
model as traditionally applied begins with uniform node In this section, we demonstrate the ability to generate a
distribution, but we use the location constrained node realistic 27 node topology based on US population den-
locations for a realistic backbone topology model. sity data set. We use a cost-constrained Waxman model
Figure 17 shows an example level-2 topology gener- to connect the backbone nodes. The objective is to go
ated by our model using the 27-node topology (equal from realistic node locations to understanding realistic
to the number of Sprint PoPs) with population-based topologies and evaluate resilience of synthetic graphs.
node clustering and random node placement about the Figure 18 shows the backbone topology generated by
PoPs for the 2nd level. The objective is to generate al- our model.
ternative realistic topologies to compare their resilience
with one another as well as against existing network de-
ployment. This motivates the need to for metrics and
a methodology to quantify resilience, described in Sec-
tion 4.
Topology of Network
55
50
45
35
Table 3 Topological characteristics of sample networks systems [96]. In 1974, one of first resilience works in
Network Topology Sprint PoPs Synthetic communication networks was presented [65] as the sur-
Number of nodes 27 27
vivability analysis of command and control networks
Number of edges 68 71 in the context of military systems. The inability to de-
Maximum degree 12 14 sign systems with sufficient redundancy to overcome all
Average degree 5.04 5.23 failures was realised in late 1970s in the context of fault-
Clustering coeff. 0.43 0.28 tolerant computing systems [32, 114, 95]. Hence the con-
Network diameter 6 6
Average hop count 2.44 2.22
cept of degradable systems was introduced in which the
Node betweenness system has at least some degraded performance under
144/28/72 124/1/32
(max/min/avg) the presence of challenges without failing completely.
Link betweenness Markov models are used to evaluate the performance
72/2/12.6 35.1/2.9/11.3
(max/min/avg)
and reliability of the degradable system [78, 101, 69].
Meyer [101] first coined the term performability as the
particular node or link [97]. A high value of between- probability that the system will stay above a certain ac-
ness (for example, average node betweenness of 72 for complishment level over a fixed period of time [74]. Un-
Sprint topology) indicates a high stress level. This in- til then, reliability and performance of communication
dicates presence of more critical nodes in a graph. An networks were treated separately. Huslende [78] defined
average node betweenness value of 32 for the gener- performance as the second dimension of the classical re-
ated synthetic topology implies uniform stress levels liability, thereby defining the reliability of a degradable
for most nodes. Similar reasoning applies for link or system as the probability that the system will operate
edge betweenness metric. A higher average node de- with a performance measure above certain threshold.
gree value generally indicates that a graph is better Since then there have been a number of rigorous def-
connected and is more robust [97]. We observe that av- initions of survivability, reliability and availability [59,
erage node degrees for both graphs are almost equal. 86, 71, 105]. Existing research on fault-tolerance mea-
Clustering coefficient, almost same for both topologies, sures such as reliability and availability targets single
is the measure of how well neighbors of a node are con- (or at most several) instances of random faults, such
nected. The other metrics for both topologies compare as topology based analysis considering node and link
well. The ability to generate graphs with specified re- failures [91, 92, 26, 83].
silience properties such as TGD(d) is a key part of our Frameworks to characterise survivability were also
future work. developed in specific contexts, such as for large-scale
disasters [91, 92], vulnerabilities under the presence of
DDoS attacks [119, 76], and in conjunction with net-
4 Analytical Resilience Framework work dynamics [86, 68, 104]. The T1A1.2 working group
defines survivability [141, 142] based on availability and
This section describes a new analytical framework to network performance models; later approaches have used
evaluate network resilience based on a two-dimensional this approach to quantify survivability [94, 77, 144]. In
state space: the horizontal axis representing the oper- this paper, we quantify network resilience as a measure
ational state of the network and the vertical axis the of service degradation in the presence of challenges that
service delivered. The resilience of a network is quan- are perturbations to the operational state of the net-
tified as the trajectory through the state space as the work.
network is challenged by failures, attacks, or large-scale
disasters. We show that for particular scenarios, and at
particular service levels, we can indeed characterise the 4.2 Metrics Framework
resilience with a single number given by the area under
the resilience trajectory. In this section, we present a framework to quantify net-
work resilience in the presence of challenges using func-
tional metrics [79, 103], beginning with a brief overview
4.1 Background of our approach. Recall that we define resilience the
ability of the network to provide and maintain an ac-
The earliest works in fault tolerance include the 1956 ceptable level of service in the face of various faults and
Moore and Shannon paper [108] on reliable circuits and challenges to normal operation. The major complexity
the Peirce [118] and Avižienis [27] publications on fault- in resilience evaluation comes from the varied nature of
tolerant computing. The initial work on reliability and services that the network provides, the numerous lay-
fault tolerance was focused on the design of computing ers and their parameters over which these services de-
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 17
pend, and the plethora of adverse events and conditions metric Ni , 1 ≤ i ≤ `, is in itself a set of m values, repre-
that present as challenges to the network as a whole. senting all possible settings of the particular operational
This complexity renders an exhaustive resilience anal- metric, Ni = {ni,1 , . . . , ni,m }. For example, at the phys-
ysis intractable. Our approach simplifies the resilience ical layer of an ISP network, the number of link failures
evaluation process by using two novel methods. First, and link capacities could be two operational metrics.
we isolate the impact of challenges at each layer in the The operational state space of S is NS = Xi Ni where
network by evaluating resilience at each service-layer X represents the cross product operator. Therefore, the
boundary, thereby avoiding a continually increasing pa- operational state space consists of all possible combina-
rameter set as we move up the network layers. Secondly, tions of the operational metrics.
we quantify resilience as a (negative) change in service We now define an operational state, N as a subset of
corresponding to a (negative) change in the operating the complete state space NS , represented as the hori-
conditions at any given layer [103]. Therefore resilience zontal axis in Figure 19. Therefore, N is an operational
is characterised as a mapping between the network op- state if N ⊆ NS . Let NS be a set of operational states,
eration and service, wherein the operation is affected NS = {N1 , . . . , Nk }. NS is valid if NS is a partition
by challenges, which in turn may result in degradation of NS . That is Ni ∩ Nj = ∅, Ni , Nj ∈ NS and i 6= j
of the service at the servcie-layer boundary. In other and ∪i Ni = NS where ∪ represents the union opera-
words, instead of evaluating the impact of each chal- tor. Hence, in the generic case, an operational state is
lenge or attack separately, which leads to an intractable defined as a subset of NS .
number of cases, we focus on quantifying the service to If Ni is numeric and ordered ∀i such that Ni ∈ NS ,
varying operational conditions. Given the right set of then the k th operational state Nk can be defined us-
metrics, the operations can always be defined such that ing the same notation used to define the complete state
most challenges manifest as perturbations in these oper- space instead of specifying it as a subset of NS . There-
ational metrics. We now present the formal framework. fore, Nk = {N1k , . . . , Nik , . . . , N`k }. A member Nik in
the set Nk is in itself a set of valid values bounded
by [nik , nik ], representing the lower and upper limit of
Operational State the ith operational metric. We can now define Nik ≡
Normal Partially Severely {nik , . . . , nik }. Thus Nik represents the set of ith oper-
Operation Degraded Degraded ational metric values that correspond to the operational
Impaired Unacceptable
and i 6= j and ∪i Pi = PS . Note that a union of all ser- states labelled Sc , Sa . Depending on the service specifi-
vice states forms the complete service state space. In cation and resilience, various trajectories are possible.
other words, service states are simply partitions of the The lower Sa is preferable since the service remains ac-
complete service space. ceptable even when the network degrades. Next, we de-
If Pi is numeric and ordered, then the k th service scribe how this can be quantified.
state can be represented as Pk = {P1k , . . . , Pik , . . . , P`k }.
A member Pik in the set Pk is in itself a set of val-
ues bounded by [pik , pik ], representing the lower and Operational State
upper n limit of the oith service metric. We can define Normal Partially Severely
Operation Degraded Degraded
Pik ≡ pik , . . . , pik . Thus, Pik represents the set ith
Impaired Unacceptable
service parameter values that correspond to the service
state Pk .
Sc
Service Parameters
4.2.3 Network State S
Acceptable
a tuple of operational state and service state: (N, P).
Therefore the k th network state is Sk = (Nk , Pk ). The
S!
network state represents a mapping between the op-
erational state space NS and service state space PS .
Furthermore, this mapping is an onto mapping, mean- Fig. 20 Resilience R measured in state space
ing that for every service state there is an operational
state.
In a deterministic system, the mapping of NS to PS
is functional, meaning that for each operational state
there is one and only one service state. However, if the 4.2.4 Resilience Evaluation R
system is stochastic then this mapping is also stochastic
in which one operational state maps to multiple service Under normal conditions, the network continues to op-
states based on the randomness in the execution of the erate in a given state corresponding to normal oper-
system. In order to eliminate the stochastic nature of ational and service states. When a challenge causes a
the NS to PS mapping in our analysis, we present the large perturbation in the operational state, the service
NS to PS mapping, thereby focussing on the mapping of may also be impaired below the acceptable service spec-
aggregates rather than individual operational or service ification. A significant change in either dimension leads
states. In other words, instead of looking at the map- to a network state transition. We formulate that chal-
ping of a instantaneous value of an operational metric lenges in the form of adverse events transform the net-
to a service parameter, we focus on the mapping of the work from one state to another based on the severity
operational state to the service state. of the event. Network resilience can be evaluated in
Note that both the operational and the service state terms of the various network state transitions under
spaces are multi-variate. In order to visualise this state the presence of challenges. Resilience Rij is defined at
space on a two dimensional state space as in Figure 19, the boundary Bij between any two adjacent layers Li ,
we project both the operational state space and service Lj . Resilience Rij at the boundary Bij is then evalu-
state space on to one dimension each. This projection ated as the transition of the network through this state
is achieved via objective functions that are applied to space. The goal is to derive the Rij as a function of
both service and operational parameters. The specific N and P. The operational and service space is covered
function used depends on the scenario. For example, it fully by its states and can be decomposed in a fixed set
may be be a linear combination with normalised weights of large states which we term as regions. For simplicity,
or logical functions (e.g., AND, OR) the network operational space N is divided into normal
Figure 19 shows the system in an initial state S0 operation, partially degraded, and severely degraded re-
for which acceptable service is delivered during nor- gions as shown in Figure 20. Similarly, the service space
mal operations. As the network is challenged, its op- P is divided into acceptable, impaired, and unacceptable
erational state may be degraded, represented by the regions.
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 19
We then quantify the resilience Rij for a particular To measure the benefits of remediation mechanisms,
scenario at a particular layer boundary Bij as the area a temporal resilience analysis should not only consider
under the resilience trajectory, shown by the shaded the area under the trajectory, but factor in the time
triangular area under the S0 → Sc trajectory in Fig- that the system is in challenged and remediated states,
ure 20. This results in a static resilience analysis [79] as shown in Figure 22 (based in part on on [140]). At
that does not consider the temporal aspects of the state time t0 a challenge lowers performability by Pu (fraction
space trajectory. unserved) corresponding to the S0 → Sc state transi-
tion, but then remediation mechanisms at t1 increase
performability by Pr (fraction remediated) correspond-
Operational State ing to the Sc → Sr state transition. Eventually recovery
Normal Partially Severely at t2 returns the network to its original normailsed per-
Operation Degraded Degraded formability of P = 1. Clearly the shorter the time to
Impaired Unacceptable
Operational State
Impaired Unacceptable
Detect Sr
Acceptable
Defend
Sc
Service Parameters
Recover
S!"
Sc
Sr
Sr
Acceptable
S!"
4.2.5 Relationship to the Strategy
Fig. 23 Resilience state space
The relationship of the the state-space formulation to
the ResiliNets strategy described in Section 2 is shown
in Figure 21. The inner D2 R2 loop trajectory is shown.
Defence prevents the system from leaving its initial The outer control loop reduces the impact of a given
state S0 . If a challenge causes the state to change signifi- challenge in the future, as shown in Figure 23, in which
cantly, this is detected by a change in the operational or the challenged state Sc0 is not as bad as the previous
service parameters when the state goes to a challenged Sc , and remediation performs better with Sr0 resulting
state Sc . Remediation improves the situation to Sr , and in a smaller area R0 and better overall resilience. Tem-
recovery finally returns the system to its original state poral analysis considers the improvement not only for
S0 . the static condition of R0 < R, but also for reduced
remediation and recovery times: t0r ≤ tr and t0R ≤ tR .
P ( t)
!" 4.2.6 Multilevel Multiscenario Resilience <
Recover
Pa + Pr
Pu Remediate
In the multilevel analysis, as shown in Figure 24, the
Pr
service parameters at the boundary Bij become the op-
Pa tR eration metrics at boundary Bi+1,j+1 . In other words,
tr the service provided by a given layer becomes the op-
#"
t0 t1 t2 t erational state of the layer above, which has a new set
Fig. 22 Temporal aspects of resilience of service parameters characterizing its service to the
layer above.
20 James P.G. Sterbenz et al.
p1
p2 < 0.2
Fig. 24 Resilience across multiple levels 2.0
impaired
0.9 ≤ p1 < 1
1.5
GÉANT2
p2 ≥ 0.33 Sprint
it may be possible to derive a single resilience value <. 0.0
AT&T
0.0
normal
0.5 1.0
partially degraded
1.5 2.0
severely degraded
2.5 3.0
resilience; in the limiting case, if the service remains ac- challenges are presented. Then, the KU-CSM (The Uni-
ceptable for all operational conditions, the area under versity of Kansas Challenge Simulation Module) is de-
the curve will be zero, representing perfect resilience scribed, followed by a few example simulation results.
R = 1. In order to get a normalised value of resilience, More details are presented in [41, 40].
we define resilience R = 1−normalised area, where nor-
malised area is the total area divided by the span of the
x-axis. 5.1 Simulation Framework
The resilience R for AT&T is 0.63, for sprint 0.54
and for GÉANT2 0.47. We observe that due to a fewer Simulation via abstraction is one of the techniques to
number of links, the GÉANT2 topology has very low analyse networks in a cost-effective manner. We have
clustering coefficient and the topology service performs chosen the ns-3 [110] network simulator since it is open
poorly even in the normal operational regions. source, flexible, provides mixed wired and wireless ca-
pability (unlike ns-2), and the models can be extended.
3.0
Unfortunately, the simulation model space increases mul-
tiplicatively with the different number of challenges and
unacceptable network topologies being simulated. Hence, for n differ-
< 0.75
2.5
p1
ent topologies subjected to c different challenges, n × c
path reliability and stretch
p2 > 1.50
2.0
p1
1.10 < p2 ≤ 1.50
1.0
work, thereby reducing the simulation model space to
n+c consisting of c challenges applied to n network sce-
acceptable
p1 ≥ 0.99
0.5
k=1
narios. We have created an automated simulation model
p2 ≤ 1.10 k=2 generator that arbitrarily combines network topologies
k=3
0.0
0.0
normal
0.5 1.0
partially degraded
1.5 2.0
severely degraded
2.5 3.0
and challenge specifications, thus increasing the effi-
n1 = 1.00 0.90 ≤ n1 < 1.00 n1 < 0.90
n2 ≥ 0.33 0.25 ≤ n2 < 0.33 n2 < 0.25
ciency of the simulation model generation process. Our
LC size and clustering coefficient simulation framework consists of four distinct steps as
shown in Figure 27.
Fig. 26 Resilience of multi-path routing on AT&T topology
The first step is to provide a challenge specification
that includes the type of the challenge and specifics of
Now we go up a level in the analysis in which N3r = the challenge type. The second step is to provide a de-
P3t to analyse the resilience of the routing service given scription of the network topology, consisting of node
a topology. Figure 26 shows the resilience of the path- geographical or logical coordinates and an adjacency
diversity mechanism explained in Section 3.2, when us- matrix. The third step is the automated generation of
ing 1, 2, and 3 paths over the AT&T topology. We ob- ns-3 simulation code based on the topology and chal-
serve that as the number of paths selected (k) increases, lenge descriptor. Finally, we run the simulations with
the service remains longer in the acceptable region and traffic sources and protocols as appropriate, and anal-
degrades more gracefully. Note that k = 1 represents yse the network performance under challenge scenar-
unipath routing in which even a single link failure will ios. Additionally, the simulation framework can also be
result in failure of certain paths even if the network is enabled to generate ns-3 network animator (NetAnim)
connected. As expected, multipath routing is more re- traces for visualisation purposes.
silient to poorly connected topologies. A significantly
more detailed analysis with more examples (including
MANET resilience) is presented in [79]; dynamic tem- 5.2 Challenge Modelling
poral analysis remains in future work. A related ana-
lytical approach computes the robustness R-value [145] A challenge is an event that impacts normal opera-
and uses GraphExplorer [54]. tion [133]. A threat is a potential challenge that might
exploit a vulnerability. A challenge triggers faults, which
are the hypothesised cause of errors. Eventually, a fault
5 Simulation Methodology may manifest itself as an error. If the error propagates it
may cause the delivered services to fail [28]. Challenges
This section describes our simulation framework and to the normal operation of networks include uninten-
methodology for understanding the resilience of net- tional misconfiguration or operational mistakes, mali-
works and the impact of challenges. First the types of cious attacks, large-scale disasters, and environmental
22 James P.G. Sterbenz et al.
challenge
specification
.txt
adjacency simulation plotted trace
matrix ns-3
description
.txt 0 0 simulator
.cc
node (C++)
1 0
coordinates
.txt
challenges [133, 134]. Network challenges can be cate- 5.3 Implementation of Challenge Models
gorised based on intent, scope, and domain they im-
pact [41]. Considerably more detail on challenge mod- Modelling and simulating network performance under
elling is presented in a companion paper [40]. challenge conditions is non-trivial [115]. There have been
We model the challenges based on the intent as several studies that analyse different aspects of net-
non-malicious or malicious. Non-malicious challenges works under challenges (see [40]). Here we briefly de-
can be due to incompetence of an operator or designer. scribe the way in which challenges are implemented in
These random events affect node and link availability, KU-CSM.
and result in the majority of the failures observed [87,
117,83]. On the other hand, malicious attacks, orches- 5.3.1 Non-malicious challenges
trated by an intelligent adversary, target specific parts
of a network and can have significant impact if critical In the case of wired domain challenges in this category,
elements of the network fail. the number of nodes or links k and challenge period is
The scope of a challenge can be further categorised specified in the challenge specification file. This type of
based on nodes, links, or network elements affected within challenge models failures that are uncorrelated with re-
a geographic area. Hurricanes, earthquakes, and solar spect to topology and geography, that is, random node
storms are examples of natural disasters that can im- and link failures.
pact the network at a large scale [84]. Furthermore, ge-
ographically correlated failures can result from depen- 5.3.2 Malicious attacks
dency among critical infrastructures, as experienced in
the 2003 Northeast US blackout [93, 51]. We use topological properties of the graph in order to
A domain of a challenge is wired or wireless; some determine the critical elements in the network based on
challenges can affect both, particularly area-based chal- properties such as the degree of connectivity of nodes
lenges. Challenges to the wired domain include fibre- and betweenness of nodes and links [97, 107]). The crit-
optic cable cuts and failure of switching nodes and trans- ical nodes or links are shut down for the duration of the
mission equipment. Challenges that are inherent in the challenge period to simulate an attacker with knowledge
wireless domain include weakly connected channels, mo- of the network structure.
bility of nodes in an ad-hoc network, and unpredictably
long delays [134]. These are the natural result of noise, 5.3.3 Large-scale disasters
interference, and other effects of RF propagation such
as scattering and multipath, as well as the mobility of The challenge specification for area-based challenges is
untethered nodes. Furthermore, weather events such as an n-sided polygon with vertices located at a particu-
rain and snow can cause the signals to attenuate in lar set of geographic coordinates or a circle centered at
wireless communication networks [80]. Malicious nodes a user specified coordinates with radius r. The simu-
may jam the signal of legitimate users to impair com- lation framework then determines the nodes and links
munication in the open wireless medium. that are encompassed by the polygon or circle, and dis-
While these challenge model categories are orthog- ables them during the challenge interval using the Com-
onal to one other, particular challenge scenarios are a putational Geometry Algorithms Library (CGAL) [1].
combination of challenge sub-categories. For example, We also implement dynamic area-based challenges, in
a failure due to natural aging of a component can be which the challenge area can evolve in shape over time:
categorised as a non-malicious, wired or wireless, node scale (expand or contract), rotate, and move on a tra-
failure. jectory during the simulation. Examples of the need
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 23
0.5
PoP locations match the cities on the physical topology.
0.4
Since not all cities are traffic source or sinks, the statis-
0.3
tical failure scenarios would not be useful determining random-link
0.2 link-betweenness
the performability of the network; therefore we do not random-node
0.1 node-degree
do random node or link failures on the physical topol- node-betweenness
ogy. A full explanation of the challenge specifications, 0.0
0 2 4 6 8 10
as well as details of simulation parameters and further # node, link failures
example results are presented in [41, 40].
Fig. 30 PDR during non-malicious and malicious challenges
5.4.1 Non-malicious and Malicious Challenges
First, we evaluate the performance of the Sprint topol- The top curve in Figure 30 shows the PDR with
ogy (Figure 28) under the presence of malicious and random link failures. In this case for 10 random link
non-malicious failures of up to 10 nodes or links, with failures averaged over 100 runs, the PDR drops to 87%.
the packet delivery ratio (PDR) shown in Figure 30. The second curve from the top shows the PDR for link
We measure the instantaneous PDR at the steady-state attacks. In this case, we first calculate the betweenness
24 James P.G. Sterbenz et al.
PDR
PDR
(a) Scaling circle PDR (b) Moving circle PDR (c) Scaling polygon PDR
for each link in the topology, and provide the challenge to 80%. This example confirms the intuition that at-
file as the list of the links to be brought down. As can tacks launched with knowledge of the network topology
be seen, link attacks have more degrading impact than can cause the most severe damage.
the random failures: 50% PDR for highest ranked 10
links. The middle curve shows random node failures, 5.4.2 Area-based Challenges
worse than link attacks or failures, since each node fail-
ure is equivalent of the failure of all links incident to Recently, the research community has recognised the
that node. The bottom two curves show the PDR dur- importance of understanding the impact of geographi-
ing node attacks based on degree of connectivity and cally correlated failures on networks [98, 41, 31, 111, 112].
betweenness; these are the most damaging attacks to Our framework uses circles or polygons to model ge-
the network. The primary difference between the two ographically correlated failures representative of large-
attack scenarios is that an attack based on between- scale disasters needed for network survivability [134, 59]
ness can be more damaging for the few highest ranked analysis. Next, we present the results of three scenar-
nodes. When the highest betweenness two nodes in rank ios that demonstrate area-based challenges that evolve
are attacked, PDR is reduced to 60%, while an attack spatially and temporally using the Sprint logical and
based on degree of connectivity only reduces the PDR physical topologies, as shown in Figure 31 and Fig-
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 25
ure 32. Application traffic is generated from 2 to 29 The PDR throughout the simulation is shown in
sec. and challenge scenarios were applied from 10 until Figure 33c. In this scenario, the edges of the irregular
22 sec. for the plots as shown in Figure 33, which verify polygon increase 1.8 times every three sec. The charac-
the impact of the example challenges. teristics of the network performability in this scenario
To demonstrate a scaling circle area-based challenge is similar for both physical and logical topologies. The
scenario, we simulate a circle centered at in New York PDR drops as the challenge area affects Chicago in the
as shown in Figure 31a and in Figure 32a for inferred smallest area at 10 sec. Despite the increase in the chal-
and physical topologies respectively, with a radius of lenge area at 13 sec., the PDR improves due to com-
approximately 111 km. We choose the scenario to be pletion of route reconvergence. As the area increases,
representative of an electromagnetic pulse (EMP) at- the PDR drops as low as 40% since the network is par-
tack [3]. The PDR is shown in Figure 33a. We choose titioned. This type of scenario can be used either to
the simulation parameters such that the radius doubles understand the relationship between the area of a chal-
in every 4 sec. As can be seen, the PDR reduces as lenge and network performability, or to model a tem-
the circular area doubles. The PDR drops depending porally evolving challenge, such as a cascading power
on how many nodes and links are covered in each step failure that increases in scope over time.
for both physical and logical topologies.
Next, we demonstrate an area-based scenario that
can evolve spatially and temporally. We simulate a mov-
ing circle in a trajectory from Orlando to New York
that might model a severe hurricane, but with rapid
restoration of links as the challenge moves out of a par-
ticular area. Three snapshots of the evolving challenge
are shown in Figure 31b and Figure 32b. The radius of
the circle is kept at approximately 222 km. We choose
the simulation parameters for illustration such that the
circle reaches NY in seven seconds (to constrain simu-
lation time), with route recomputation every 3 sec.
As shown in Figure 33b PDR reduces to 93% for
Fig. 34 South central area-based challenge scenario
the logical topology as the challenge starts only cov-
ering the node in Orlando at 10 sec and 82% for the
physical topologies, since there are several PoPs being Next, we demonstrate an area-based scenario repre-
affected. As the challenge moves towards New York in sentative of a hurricane hitting the South-central US as
its trajectory, the PDR reaches 1.0 at the 13 sec. In this shown in Figure 34. In the smallest area are the physical
case, the challenge area includes only the link between nodes in New Orleans and Biloxi of which only New Or-
Orlando and New York, but since there are multiple leans node is a MPLS PoP node generating and sinking
paths for the Rocketfuel-inferred topology a single link traffic. In the second circular area challenge, the phys-
failure does not affect the PDR, showing that geographic ical nodes are: New Orleans, Baton Rouge, Lafayette,
diversity for survivability is crucial [133]. On the other Biloxi, and Mobile, in which 4 out of the 5 affected
hand, for the physical topology in the same instance, nodes are MPLS PoP nodes. In the largest affected
the traffic source in Fairfax, South Carolina resides in area there are total of 10 physical nodes, 6 of which
the challenge area, therefore the cumulative PDR does are MPLS PoP nodes. However none of the three circu-
not reach 100%. As the challenge moves into the North- lar challenge areas cover any logical links or nodes on
east US region, cumulative PDR values depend on the the map in Figure 28, permitting us to investigate the
number of traffic sources being affected by the challenge differences between logical and physical topologies.
area. The network performance of physical and logical
Polygons are useful to model specific geographic chal- topologies when the South-central US region is chal-
lenges such as power failures, earthquakes, and floods. lenged is shown in Figure 35. Since there are no nodes
For a scaling polygon example, we show a 6-sided irreg- or links in the logical topology impacted, the PDR is
ular polygon in the Midwest US, roughly representative 100%. On the other hand, the PDR of the physical
of the North American Electric Reliability Corporation topology drops to 98%, 91%, and 86% respectively as
(NERC) Midwest region [3], as shown in Figure 31c and the challenge area covers more physical nodes and links.
in Figure 32c for logical and physical topologies respec- This demonstrates that it is imperative to study the im-
tively. pact of area-based challenges on the physical topologies.
26 James P.G. Sterbenz et al.
1.0
Table 4 GpENI programmability layers
GpENI Layer Programmability
0.8
experiment Gush, Raven
7 application
0.6 PlanetLab
4 end-to-end
PDR
Tampere UT
Uppsala Skt. Peterburg
Simula
SICS TKK Helsinki IIRAS
KTH Stockholm
Karlstads Moscow
JANET NORDUnet
CUC Beijing HEAnet Lancaster
POSTECH
UC Dublin Warszawa
Cambridge
IIT Guwahati G-Lab
GÉANT2
Kaiserslautern
Karlsruhe Passau
ERNET Wien
SDSMT DSU
Internet2 U-Zürich München
IIT Mumbai Bern Konstanz
USD
DANTE ETH
APAN IIT SWITCH
IISc Bangalore UNL IU
Internet2 KSU
UMC
UPC
GMOC Bilkent
KU UMKC ISCTE Barcelona
Lisboa
tional networks, as shown in Figure 36. Currently, con- Sections 3 and 5 to GpENI experiments. GpENI ex-
nectivity is achieved using L2TPv3 and IP tunnels. A periments will have the advantage of incorporating real
direct fibre link over JANET is deployed between Lan- networking not easy to emulate in a simulation, albeit
caster and Cambridge Universities. The principal Eu- still at smaller scale than large simulation topologies.
ropean GpENI institutions are Lancaster University in
the UK and ETH Zürich in Switzerland, with addi-
7 Summary and Future Outlook
tional core nodes at Universität Bern in Switzerland, G-
Lab Kaiserslautern in Germany, and Simula Research
Resilience is an essential property of the Future Inter-
Laboratory in Norway. Similarly, GpENI is extended
net, including performability, dependability, and sur-
to Asia across Internet2 to APAN, then to national
vivability. While a number of aspects of resilience have
research network infrastructure including ERNET in
been an active area of research for a half-century, it is
India. Furthermore, GpENI is interconnected to the
generally recognised that the Global Internet lacks re-
Emulab-based ProtoGENI cluster [18] in Utah, and is
silience and is vulnerable to attacks and disasters. It is
deploying several small ProtoGENI clusters of its own.
critical to make progress toward evaluating proposals
Thus GpENI provides a large scale, rich topology on
for alternative topologies, protocols, and mechanisms
which to perform resilience experiments.
that are candidates for deployment in the Future Inter-
net.
6.2 Experimental Evaluation of Resilience However, we have lacked a comprehensive frame-
work to evaluate the resilience of current and proposed
Resilient topologies generated by KU-LoCGen using network architectures, in part due to the complexity
GpENI node geographic coördinates and analysed by of the problem. This requires metrics to quantify re-
KU-CSM can be used to generate layer-2 topologies silience, and a tractable methodology to evaluate net-
that configure the topology of GpENI experiments. We work resilience using appropriate abstractions in analy-
can then evaluate performance when GpENI slices are sis, simulation, and emulation permitting cross-verification
challenged by correlated failures of nodes and links, among these techniques.
measuring connectivity, packet delivery ratio, goodput, This paper aims to contribute to this task by de-
and delay, when subject to CBR (constant bit rate), scribing a comprehensive framework consisting of a re-
bulk data transfer, and transactional (HTTP) traffic. silience strategy, metrics for quantifying resilience, and
We can also characterise the packet-loss probability of evaluation techniques. The key to a tractable solution
wireless links at the Utah Emulab [8], and the capa- is multilevel composition of scenario-based evaluation
bilities for emulating jamming and misbehaving nodes of the resilience R at each level, measured as the nor-
within the Emulab-federated CMU wireless emulator. malised inverse of the area under the trajectory through
The goal is to cross-verify identical configurations the N, P state space. Complex scenarios are simulated
of the simulated topologies and protocols discussed in using the KU-LoCGen topology generator and KU-CSM
28 James P.G. Sterbenz et al.
44. Chen, Q., Chang, H., Govindan, R., Jamin, S.: The ori- 61. Fabrikant, A., Koutsoupias, E., Papadimitriou, C.:
gin of power laws in Internet topologies revisited. In: Heuristically optimized trade-offs: A new paradigm for
Proceedings of the 21st Annual Joint Conference of the power laws in the Internet. Lecture notes in computer
IEEE Computer and Communications Societies (INFO- science pp. 110–122 (2002)
COM), vol. 2, pp. 608–617 (2002) 62. Fall, K.: A delay-tolerant network architecture for chal-
45. Cherukuri, R., Liu, X., Bavier, A., Sterbenz, J., Medhi, lenged internets. In: SIGCOMM ’03: Proceedings of the
D.: Network virtualization in GpENI: Framework, im- 2003 conference on Applications, technologies, architec-
plementation & integration experience. In: IEEE/IFIP tures, and protocols for computer communications, pp.
ManFI. Dublin, Ireland (2011). (to appear) 27–34. ACM, New York, NY, USA (2003)
46. Clark, D., Sollins, K., Wroclawski, J., Katabi, D., Kulik, 63. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-
J., Yang, X., Braden, R., Faber, T., Falk, A., Pingali, V., law relationships of the Internet topology. In: SIG-
Handley, M., Chiappa, N.: New arch: Future generation COMM ’99: Proceedings of the conference on Applica-
Internet architecture. Technical report, DARPA, MIT, tions, technologies, architectures, and protocols for com-
ISI (2003) puter communication, pp. 251–262. ACM, New York,
47. Clark, D.D., Wroclawski, J., Sollins, K.R., Braden, NY, USA (1999)
R.: Tussle in cyberspace: Defining tomorrow’s Internet. 64. Francis, P., Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt,
IEEE/ACM Transactions on Networking 13(3), 462– Y., Zhang, L.: IDMaps: a global Internet host distance
475 (2005) estimation service. IEEE/ACM Transactions on Net-
48. Cowie, J.: Lights Out in Rio. http://www.renesys.com/ working 9(5), 525–540 (2001)
blog/2009/11/lights-out-in-rio.shtml (2009) 65. Frank, H.: Survivability Analysis of Command and Con-
49. Cowie, J.: Japan Quake. http://www.renesys.com/ trol Communications Networks–Part I. IEEE Transac-
blog/2011/03/japan-quake.shtml (2011) tions on Communications, [legacy, pre-1988] 22(5), 589–
50. Cowie, J., Popescu, A., Underwood, T.: Impact of hurri- 595 (1974)
cane Katrina on Internet infrastructure. report, Renesys 66. Frank, H., Frisch, I.: Analysis and Design of Survivable
(2005) Networks. IEEE Transactions on Communication Tech-
51. Cowie, J.H., Ogielski, A.T., Premore, B., Smith, E.A., nology 18(5), 501–519 (1970)
Underwood, T.: Impact of the 2003 blackouts on Inter- 67. Fry, M., Fischer, M., Karaliopoulos, M., Smith, P.,
net communications. Preliminary report, Renesys Cor- Hutchison, D.: Challenge identification for network re-
poration (2003). (updated March 1, 2004) silience. In: Proc. of the IEEE 6th EURO-NF Confer-
52. Davis, T., Rogers, H., Shays, C., Others: A failure of ence on Next Generation Internet (NGI), pp. 1–8 (2010)
initiative: The final report of the select bipartisan com- 68. Gan, Q., Helvik, B.: Dependability modelling and analy-
mittee to investigate the preparation for and response sis of networks as taking routing and traffic into account.
to hurricane Katrina. Congressional Report H.Rpt. In: NGI ’06: Proceedings of the Conference on Next
109-377, US House of Representatives, Washington, DC Generation Internet Design and Engineering (2006)
(2006) 69. Gay, F., Ketelsen, M.: Performance evaluation for grace-
53. Dobson, S., Denazis, S., Fernández, A., Gaı̈ti, D., Ge- fully degrading systems. In: Proc. of the 9th Annual Int.
lenbe, E., Massacci, F., Nixon, P., Saffre, F., Schmidt, Symp. on Fault Tolerant Computing, pp. 51–58 (1979)
N., Zambonelli, F.: A survey of autonomic communica- 70. Goodman, S., Lin, H.: Toward a Safer and More Secure
tions. ACM Transactions on Autonomous and Adaptive Cyberspace. National Academies Press (2007)
Systems 1(2), 223–259 (2006) 71. Grover, W.D.: Mesh-Based Survivable Networks. Pren-
54. Doerr, C., Hernandez, J.M.: A Computational Approach tice Hall PTR Pearson, Upper Saddle River, New Jersey
to Multi-level Analysis of Network Resilience. In: Pro- (2004)
ceedings of the Third International Conference on De- 72. Grover, W.D., Stamatelakis, D.: Cycle-oriented dis-
pendability (DEPEND), pp. 125–132 (2010) tributed preconfiguration: Ring-like speed with mesh-
55. Doyle, J.C., Alderson, D.L., Li, L., Low, S., Roughan, like capacity for self-planning network restoration. In:
M., Shalunov, S., Tanaka, R., Willinger, W.: The “ro- Proceeding of the IEEE International Conference on
bust yet fragile” nature of the Internet. Proceedings of Communications (ICC’98), vol. 1, pp. 537–543 (1998)
the National Academy of Sciences of the United States 73. Haddadi, H., Rio, M., Iannaccone, G., Moore, A.,
of America 102(41), 14,497–14,502 (2005) Mortier, R.: Network topologies: inference, modeling,
56. Edwards, N.: Building dependable distributed systems. and generation. Communications Surveys and Tutori-
Technical report APM.1144.00.02, ANSA (1994) als, IEEE 10(2), 48–69 (2008)
57. Ellinas, G., Stern, T.: Automatic protection switching 74. Hagin, A.A.: Performability, reliability, and survivabil-
for link failures in optical networks with bi-directional ity of communication networks: system of methods and
links. In: Proceedings of the Global Telecommunications models for evaluation. In: Proceedings of the 14th In-
Conference (GLOBECOM), vol. 1, pp. 152–156 (1996) ternational Conference on Distributed Computing Sys-
58. Ellison, R., Fisher, D., Linger, R., Lipson, H., Longstaff, tems, pp. 562–573 (1994)
T., Mead, N.: Survivable network systems: An emerg- 75. Hameed, M.A., Jabbar, A., Çetinkaya, E.K., Sterbenz,
ing discipline. Tech. Rep. CMU/SEI-97-TR-013, Soft- J.P.: Deriving network topologies from real world con-
ware Engineering Institute, Carnegie Mellon University straints. In: Proceedings of IEEE GLOBECOM Work-
(1997) shop on Complex and Communication Networks (CC-
59. Ellison, R.J., Fisher, D.A., Linger, R.C., Lipson, H.F., Net), pp. 415–419 (2010)
Longstaff, T., Mead, N.R.: Survivable network systems: 76. Hariri, S., Qu, G., Dharmagadda, T., Ramkishore, M.,
An emerging discipline. Tech. Rep. CMU/SEI-97-TR- Raghavendra, C.S.: Impact analysis of faults and at-
013, Software Engineering Institute, Carnegie Mellon tacks in large-scale networks. IEEE Security and Pri-
University, PA (1999) vacy 01(5), 49–54 (2003)
60. Erdös, P., Rényi, A.: On the evolution of random graphs. 77. Heegaard, P.E., Trivedi, K.S.: Network survivability
Publ. Math. Inst. Hung. Acad. Sci 5, 17–61 (1960) modeling. Computer Networks 53(8), 1215–1234 (2009).
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 31
Performance Modeling of Computer Networks: Special the 15th International Symposium on Software Reliabil-
Issue in Memory of Dr. Gunter Bolch ity Engineering, pp. 367–378. IEEE Computer Society
78. Huslende, R.: A combined evaluation of performance Washington, DC, USA (2004)
and reliability for degradable systems. In: Proceedings 95. Losq, J.: Effects of Failures on Gracefully Degradable
of the ACM Conference on Measurement and Model- Systems. In: Proc. of 7th Fault-Tolerant Computing
ing of Computer Systems (SIGMETRICS), pp. 157–164. Symposium, pp. 29–34 (1977)
ACM Press, New York, NY, USA (1981) 96. Lyons, R., Vanderkulk, W.: The use of triple-modular
79. Jabbar, A.: A framework to quantify network resilience redundancy to improve computer reliability. IBM Jour-
and survivability. Ph.D. thesis, The University of nal of Research and Development 6(2), 200–209 (1962)
Kansas, Lawrence, KS (2010) 97. Mahadevan, P., Krioukov, D., Fomenkov, M., Dim-
80. Jabbar, A., Rohrer, J.P., Oberthaler, A., Çetinkaya, itropoulos, X., Claffy, K.C., Vahdat, A.: The Internet
E.K., Frost, V., Sterbenz, J.P.G.: Performance compar- AS-level topology: three data sources and one definitive
ison of weather disruption-tolerant cross-layer routing metric. ACM SIGCOMM CCR 36(1), 17–26 (2006)
algorithms. In: Proc. IEEE INFOCOM 2009. The 28th 98. Mahmood, R.A.: Simulating challenges to communica-
Conference on Computer Communications, pp. 1143– tion networks for evaluation of resilience. Master’s the-
1151 (2009) sis, The University of Kansas, Lawrence, KS (2009)
81. Jabbar, A., Shi, Q., Hameed, M., Çetinkaya, E.K., 99. Medhi, D., Tipper, D.: Multi-layered network
Sterbenz, J.P.: ResiliNets topology modelling. https: survivability-models, analysis, architecture, framework
//wiki.ittc.ku.edu/resilinets/Topology_Modelling and implementation: An overview. In: Proceedings of
(2011) the DARPA Information Survivability Conference and
82. Jackson, A.W., Sterbenz, J.P.G., Condell, M.N., Hain, Exposition (DISCEX), vol. 1, pp. 173–186 (2000)
R.R.: Active network monitoring and control: The SEN- 100. Medina, A., Matta, I., Byers, J.: On the origin of power
COMM architecture and implementation. In: DARPA laws in Internet topologies. SIGCOMM Computer Com-
Active Networks Conference and Exposition (DANCE), munication Review 30(2), 18–28 (2000)
pp. 379—393. IEEE Computer Society, Los Alamitos, 101. Meyer, J.: On Evaluating the Performability of Degrad-
CA, USA (2002) able Computing Systems. IEEE Transactions on Com-
83. Jamakovic, A., Uhlig, S.: Influence of the network struc- puters 100(29), 720–731 (1980)
ture on robustness. In: IEEE International Conference 102. Minden, G., Evans, J., Searl, L., DePardo, D., Petty,
on Networks (ICON), pp. 278–283 (2007) V., Rajbanshi, R., Newman, T., Chen, Q., Weidling,
84. Kitamura, Y., Lee, Y., Sakiyama, R., Okamura, K.: Ex- F., Guffey, J., Datla, D., Barker, B., Peck, M., Cordill,
perience with restoration of asia pacific network failures B., Wyglinski, A., Agah, A.: KUAR: A flexible software-
from taiwan earthquake. IEICE Transactions on Com- defined radio development platform. In: 2nd IEEE Inter-
munications E90-B(11), 3095–3103 (2007) national Symposium on New Frontiers in Dynamic Spec-
85. KMI Corporation: North American Fiberoptic Long- trum Access Networks (DySPAN), pp. 428–439 (2007)
haul Routes Planned and in Place (1999) 103. Mohammad, A.J., Hutchison, D., Sterbenz, J.P.G.: To-
86. Knight, J.C., Strunk, E.A., Sullivan, K.J.: Towards a wards quantifying metrics for resilient and survivable
rigorous definition of information system survivability. networks. In: Proceedings of the 14th IEEE Interna-
In: Proceedings of the DARPA Information Survivabil- tional Conference on Network Protocols (ICNP), pp.
ity Conference and Exposition DISCEX III, pp. 78–89. 17–18 (2006)
Washington DC (2003) 104. Molisz, W.: Survivability function-a measure of disaster-
87. Kuhn, D.: Sources of failure in the public switched tele- based routing performance. IEEE Journal on Selected
phone network. Computer 30(4), 31–36 (1997) Areas in Communications 22(9), 1876–1883 (2004)
88. Lakhina, A., Byers, J., Crovella, M., Matta, I.: On the 105. Molisz, W., Rak, J.: End-to-end service survivability un-
geographic location of Internet resources. IEEE Journal der attacks on networks. Journal of Telecommunications
on Selected Areas in Communications 21(6), 934–948 and Information Technology 3, 19–26 (2006)
(2003) 106. Molisz, W., Rak, J.: f-cycles–a new approach to provid-
89. Laprie, J.C.: Dependability: Basic concepts and termi- ing fast service recovery at low backup capacity over-
nology. Draft, IFIP Working Group 10.4 – Dependable head. In: 10th Anniversary International Conference on
Computing and Fault Tolerance (1994) Transparent Optical Networks (ICTON), vol. 3, pp. 59–
90. Li, L., Alderson, D., Willinger, W., Doyle, J.: A first- 62 (2008)
principles approach to understanding the Internet’s 107. Molisz, W., Rak, J.: Impact of WDM network topol-
router-level topology. SIGCOMM Computer Commu- ogy characteristics on the extent of failure losses. In:
nication Review 34(4), 3–14 (2004) 12th International Conference on Transparent Optical
91. Liew, S., Lu, K.: A framework for network survivability Networks (ICTON), pp. 1–4 (2010)
characterization. In: SUPERCOMM/ICC’92: Proceed- 108. Moore, E., Shannon, C.: Reliable circuits using less re-
ings of IEEE International Conference on Communica- liable relays. Journal of the Franklin Institute 262(3),
tions (ICC), pp. 405–410 (1992) 191–208 (1956)
92. Liew, S., Lu, K.: A framework for characterizing 109. Motiwala, M., Elmore, M., Feamster, N., Vempala, S.:
disaster-based network survivability. IEEE Journal on Path splicing. In: Proceedings of the ACM SIGCOMM
Selected Areas in Communications 12(1), 52–58 (1994) conference on data communication, pp. 27–38. ACM,
93. Liscouski, B., Elliot, W.J.: Final report on the august New York, NY, USA (2008)
14, 2003 blackout in the United States and Canada: 110. The ns-3 network simulator. http://www.nsnam.org
Causes and recommendations. Tech. rep., U.S. – Canada (2009)
Power System Outage Task Force (2004) 111. Neumayer, S., Modiano, E.: Network reliability with ge-
94. Liu, Y., Mendiratta, V., Trivedi, K.: Survivability anal- ographically correlated failures. In: Proc. of IEEE IN-
ysis of telephone access network. In: Proceedings of FOCOM, pp. 1–9 (2010)
32 James P.G. Sterbenz et al.
112. Neumayer, S., Zussman, G., Cohen, R., Modiano, E.: 129. Smith, P., Hutchison, D., Banfield, M., Leopold, H.:
Assessing the vulnerability of the fiber infrastructure to On understanding normal protocol behaviour to mon-
disasters. In: Proc. of IEEE INFOCOM, pp. 1566–1574 itor and mitigate the abnormal. In: Proceedings of
(2009) the IEEE/IST Workshop on Monitoring, Attack Detec-
113. Ng, T.S.E., Zhang, H.: Predicting Internet network dis- tion and Mitigation (MonAM), pp. 105–107. Tuebingen,
tance with coordinates-based approaches. In: INFO- Germany (2006)
COM, pp. 170–179 (2001) 130. Smith, P., Schaeffer-Filho, A., Ali, A., Schöller, M.,
114. Ng, Y., Avizienis, A.: A Reliability Model for Gracefully Kheir, N., Mauthe, A., Hutchison, D.: Strategies for
Degrading and Repairable Fault-tolerant Systems. In: Network Resilience: Capitalising on Policies. In:
Proceedings of 7th International Symposium on Fault- B. Stiller, F. De Turck (eds.) Mechanisms for Au-
Tolerant Computing, pp. 29–34. IEEE Computing Soci- tonomous Management of Networks and Services, Lec-
ety Publications (1977) ture Notes in Computer Science, vol. 6155, pp. 118–122.
115. Nicol, D.M., Sanders, W.H., Trivedi, K.S.: Model-based Springer Berlin / Heidelberg (2010)
evaluation: From dependability to security. IEEE Trans- 131. Spring, N., Mahajan, R., Wetherall, D.: Measuring ISP
actions on Dependable and Secure Computing 01(1), Topologies with Rocketfuel. IEEE/ACM TON 12(1),
48–65 (2004) 2–16 (2004)
132. Sterbenz, J.P.G., Hutchison, D.: Resilinets: Multilevel
116. Nussbaumer, J., Patel, B.V., Schaffa, F., Sterbenz,
resilient and survivable networking initiative wiki. http:
J.P.G.: Networking requirements for interactive video
//wiki.ittc.ku.edu/resilinets (2008)
on demand. IEEE Journal on Selected Areas in Com-
133. Sterbenz, J.P.G., Hutchison, D., Çetinkaya, E.K., Jab-
munications 13, 779–787 (1995)
bar, A., Rohrer, J.P., Schöller, M., Smith, P.: Resilience
117. Oppenheimer, D., Ganapathi, A., Patterson, D.A.: Why and survivability in communication networks: Strate-
do Internet services fail, and what can be done about gies, principles, and survey of disciplines. Computer
it? In: Proc. of USENIX USITS, pp. 1–16 (2003) Networks: Special Issue on Resilient and Survivable Net-
118. Pierce, W.: Failure-tolerant Computer Design. Aca- works (COMNET) 54(8), 1245–1265 (2010)
demic Press (1965) 134. Sterbenz, J.P.G., Krishnan, R., Hain, R.R., Jackson,
119. Qu, G., Jayaprakash, R., Hariri, S., Raghavendra, C.: A.W., Levin, D., Ramanathan, R., Zao, J.: Surviv-
A framework for network vulnerability analysis. In: CT able mobile wireless networks: issues, challenges, and re-
’02: Proceedings of the 1st IASTED International Con- search directions. In: WiSE ’02: Proceedings of the 3rd
ference on Communications, Internet, Information Tech- ACM workshop on Wireless security, pp. 31–40. ACM
nology, pp. 289–298. St. Thomas, Virgin Islands, USA Press, New York, NY, USA (2002)
(2002) 135. Sterbenz, J.P.G., Medhi, D., Ramamurthy, B., Scoglio,
120. Rak, J.: k-penalty: a novel approach to find k-disjoint C., Hutchison, D., Plattner, B., Anjali, T., Scott,
paths with differentiated path costs. IEEE Communi- A., Buffington, C., Monaco, G.E., Gruenbacher, D.,
cations Letters 14(4), 354–356 (2010) McMullen, R., Rohrer, J.P., Sherrell, J., Angu, P.,
121. Richards, C.W.: Map of the Month: Mainline Tonnage Cherukuri, R., Qian, H., Tare, N.: The Great plains
1980 / 2005 (2007) Environment for Network Innovation (GpENI): A pro-
122. Rohrer, J.P., Jabbar, A., Perrins, E., Sterbenz, J.P.G.: grammable testbed for future internet architecture re-
Cross-layer architectural framework for highly-mobile search. In: Proceedings of the 6th International Con-
multihop airborne telemetry networks. In: Proceed- ference on Testbeds and Research Infrastructures for
ings of the IEEE Military Communications Conference the Development of Networks & Communities (Trident-
(MILCOM), pp. 1–9. San Diego, CA, USA (2008) Com), pp. 428–441. Berlin, Germany (2010)
123. Rohrer, J.P., Jabbar, A., Sterbenz, J.P.G.: Path diver- 136. Sterbenz, J.P.G., Touch, J.D.: High-Speed Network-
sification: A multipath resilience mechanism. In: Pro- ing: A Systematic Approach to High-Bandwidth Low-
ceedings of the IEEE 7th International Workshop on the Latency Communication, 1st edn. Wiley (2001)
Design of Reliable Communication Networks (DRCN), 137. Strand, J., Chiu, A., Tkach, R.: Issues for routing in the
pp. 343–351. Washington, DC, USA (2009) optical layer. IEEE Communications Magazine 39(2),
81–87 (2001)
124. Rohrer, J.P., Naidu, R., Sterbenz, J.P.G.: Multipath
138. Styron, H.C.: CSX tunnel fire: Baltimore, MD. US Fire
at the transport layer: An end-to-end resilience mecha-
Administration Technical Report USFA-TR-140, Fed-
nism. In: Proceedings of the IEEE/IFIP International
eral Emergency Management Administration, Emmits-
Workshop on Reliable Networks Design and Modeling
burg, MD (2001)
(RNDM), pp. 1–7. St. Petersburg, Russia (2009)
139. Sydney, A., Scoglio, C., Youssef, M., Schumm, P.: Char-
125. Schaeffer-Filho, A., Smith, P., Mauthe, A.: Policy-driven acterising the robustness of complex networks. Inter-
Network Simulation: a Resilience Case Study. In: national Journal of Internet Technology and Secured
26th ACM Symposium on Applied Computing (SAC). Transactions 2, 291–320 (2010)
Taichung, Taiwan (2011) 140. T1A1.2 Working Group: Network survivability perfor-
126. Schneider, F.: Trust in Cyberspace. National Academies mance. Technical Report T1A1.2/93-001R3, Alliance
Press (1999) for Telecommunications Industry Solutions (ATIS)
127. Schöller, M., Smith, P., Rohner, C., Karaliopoulos, M., (1993)
Jabbar, A., Sterbenz, J., Hutchison, D.: On realising 141. T1A1.2 Working Group: Enhanced network survivabil-
a strategy for resilience in opportunistic networks. In: ity performance. Technical Report T1.TR.68-2001,
Future Network and Mobile Summit, pp. 1–8 (2010) Alliance for Telecommunications Industry Solutions
128. Siganos, G., Faloutsos, M., Faloutsos, P., Faloutsos, (ATIS) (2001)
C.: Power laws and the AS-level Internet topology. 142. T1A1.2 Working Group: Reliability-related metrics and
IEEE/ACM Transactions on Networking 11(4), 514– terminology for network elements in evolving com-
524 (2003) munications networks. American National Standard
Evaluation of Network Resilience, Survivability, and Disruption Tolerance 33
for Telecommunications T1.TR.524-2004, Alliance for research management positions at BBN Technologies,
Telecommunications Industry Solutions (ATIS) (2004) GTE Laboratories, and IBM Research, where he has
143. Telecom Regulatory Authority of India: The Indian
lead DARPA- and internally-funded research in mo-
Telecom Services Performance Indiactors April–June
2008. Quarterly press release 109 (2008) bile, wireless, active, and high-speed networks. He has
144. Trivedi, K., Kim, D., Roy, A., Medhi, D.: Dependability been program chair for IEEE GI, GBN, and HotI; IFIP
and security models. In: Proceedings of the Interna- IWSOS, PfHSN, and IWAN; and is on the editorial
tional Workshop of Design of Reliable Communication
board of IEEE Network. He has been active in Sci-
Networks (DRCN), pp. 11–20. IEEE (2009)
145. Van Mieghem, P., Doerr, C., Wang, H., Hernandez, J., ence and Engineering Fair organisation and judging in
Hutchison, D., Karaliopoulos, M., Kooij, R.: A Frame- Massachusetts and Kansas for middle and high-school
work for Computing Topological Network Robustness. students. He is principal author of the book High-Speed
Tech. rep., Delft University of Technology (2010).
Networking: A Systematic Approach to High-Bandwidth
URL http://www.nas.ewi.tudelft.nl/people/Piet/
papers/RobustnessRmodel_TUDreport20101218.pdf Low-Latency Communication. He is a member of the
146. Wang, C., Byers, J.W.: Generating representative ISP IEEE, ACM, IET/IEE, and IEICE. His research in-
topologies from first-principles. In: Proceedings of the terests include resilient, survivable, and disruption tol-
ACM international conference on measurement and erant networking, future Internet architectures, active
modeling of computer systems (SIGMETRICS), pp.
365–366. ACM, New York, NY, USA (2007) and programmable networks, and high-speed network-
147. Waxman, B.: Routing of multipoint connections. Se- ing and systems.
lected Areas in Communications, IEEE Journal on 6(9), Egemen K. Çetinkaya: is a Ph.D. candidate in the
1617–1622 (1988)
148. Winick, J., Jamin, S.: Information Security: Computer
department of Electrical Engineering and Computer Sci-
Hacker Information Available on the Internet. Tech. ence at The University of Kansas. He received the B.S.
Rep. T-AIMD-96-108, United States General Account- degree in Electronics Engineering from Uludağ Uni-
ing Office (1996). URL http://www.fas.org/irp/gao/ versity (Bursa, Turkey) in 1999 and the M.S. degree
aimd-96-108.htm
149. Winick, J., Jamin, S.: Inet-3.0: Internet topology gen-
in Electrical Engineering from University of Missouri–
erator. Tech. Rep. UM-CSE-TR-456-02, EECS, Univer- Rolla in 2001. He held various positions at Sprint as a
sity of Michigan (2002). URL http://topology.eecs. support, system, design engineer from 2001 until 2008.
umich.edu/inet/inet-3.0.pdf He is a graduate research assistant in the ResiliNets re-
150. Yook, S., Jeong, H., Barabasi, A.: Modeling the Inter-
search group at the KU Information & Telecommunica-
net’s large-scale topology. Proceedings of the National
Academy of Sciences 99(21), 13,382–13,386 (2002) tion Technology Center (ITTC). His research interests
151. Zegura, E., Calvert, K., Bhattacharjee, S.: How to model are in resilient networks. He is a member of the IEEE
an internetwork. In: Proceedings of the 15th Annual Communications Society, ACM SIGCOMM, and Sigma
Joint Conference of the IEEE Computer Societies (IN-
Xi.
FOCOM), vol. 2, pp. 594–602 (1996)
Mahmood A. Hameed: received the B.S. degree in
Electronics and Communications Engineering from Os-
Bios mania University (Hyderabad, India) in 2005 and the
M.S. degree in Electrical Engineering from the Univer-
Dr. James P.G. Sterbenz: is Associate Professor of sity of Kansas in 2008. He is currently working toward
Electrical Engineering & Computer Science and on staff the Ph.D. degree in Electrical Engineering at the Uni-
at the Information & Telecommunication Technology versity of Kansas, where he is engaged in research in
Center at The University of Kansas, and is a Visit- energy efficient optical technologies for switching in fu-
ing Professor of Computing in InfoLab 21 at Lancaster ture networks. His research interests include nonlinear
University in the UK. He received a doctorate in com- optics, advanced optical modulation formats as well as
puter science from Washington University in St. Louis signal processing.
in 1991, with undergraduate degrees in electrical engi- Dr. Abdul Jabbar: is currently a Research Engineer
neering, computer science, and economics. He is direc- in the Advanced Communication Systems Lab at GE
tor of the ResiliNets research group at KU, PI for the Global Research in Nikayuna, NY. He received his Ph.D
NSF-funded FIND Postmodern Internet Architecture in Electrical Engineering from The University of Kansas
project, PI for the NSF Multilayer Network Resilience in 2010 with honors. He received his M.S. degree in
Analysis and Experimentation on GENI project, lead Electrical Engineering from KU in 2004 and B.S. de-
PI for the GpENI (Great Plains Environment for Net- gree in Electrical Engineering from Osmania Univer-
work Innovation) international GENI and FIRE testbed, sity, India in 2001. He also holds an Adjunct Research
co-I in the EU-funded FIRE ResumeNet project, and Associate position at the KU Information & Telecom-
PI for the US DoD-funded highly-mobile airborne net- munication Technology Center. His interests include re-
working project. He has previously held senior staff and silience and survivability, network algorithms, design
34 James P.G. Sterbenz et al.