
Proceedings of the 37th Hawaii International Conference on System Sciences - 2004

A Definition for Information System Survivability

Vickie R. Westmark
University of Central Florida
Department of Electrical and Computer Engineering, Adjunct Professor
vwestmark@cfl.rr.com

Abstract

Society has become dependent on information systems. As networks develop into large-scale systems, often critical to personal and business operations, survivability of these systems is imperative. While these systems continue to emerge and grow, answers to questions like “What does survivability mean?”, “How is survivability being measured?”, and “How is survivability computed?” become very important. This paper summarizes the standard or lack of standard methods for defining and computing survivability while providing an easy-to-reference baseline of the current state. It also provides a template for defining survivability to facilitate subsequent research into computational quality attributes by using standard definitions. Where there are gaps or inconsistencies in current research and practice, assessments can be made to continue research and development in the areas most needed to develop a taxonomy of survivability.

1. Introduction

The information systems that society has become so dependent on are typically distributed network systems that consist of components of varying quality that have been integrated to provide services for the end-user. With the continuing emergence of these systems, answers to the above questions become very important to developers and maintainers of these networks and distributed systems. The goal of this paper is twofold: 1) to summarize and analyze the standard or lack of standard methods for defining and computing survivability of distributed network systems and 2) to provide a template for defining survivability. In areas where there are gaps in definition and computation, this paper proposes guidelines for composition and calculation.

The objective of this research was to summarize the standard methods or lack of standard methods for computing the specific quality attribute, survivability, of distributed network systems. The findings of this research within current literature will provide summary and analysis to the following questions: 1) How is survivability defined? 2) How is survivability composed? 3) How is survivability calculated? and 4) How are survivability trends and gaps characterized?

The need for this research was to provide a baseline for scholars and industry professionals to describe the current state of computing survivability of distributed network systems. Since our social infrastructure has become dependent on certain essential services provided over distributed network systems such as defense, education, emergency services, energy, finance and banking, government, health care, telecommunications, transportation, and utility, it has become increasingly important to understand the composition and calculation of survivability for these systems. If a network system is threatened and fails to provide any one of these essential services, the social and economic consequences may be catastrophic and even fatal.

The community of high-speed network users and providers finds survivability important because “most of the high-speed networks are designed to have restoration and protection capabilities.” [1] The telecommunications industry finds survivability important: “a significant number of end-users have absolute dependence on telecommunications, so a primary concern is the integrity of the network.” [2] Researchers whose primary focus was once on security are now focusing on survivability: “while security traditionally has been the focus on confidentiality of information, the problems of greatest concern today relate to the availability of information and continued services.” [3]

This research differs from previous work by categorically organizing and summarizing the current literature to provide an easy-to-reference baseline to better understand current trends or lack of trends in computational survivability. Where there are gaps or inconsistencies in the current research and practice, professionals and scholars can assess the void or

0-7695-2056-1/04 $17.00 (C) 2004 IEEE 1



conflict to continue research and development in the areas that are most needed to improve and/or develop a taxonomy of system survivability.

2. Methodology

Specifically, the current literature used to support this research came from the results of thorough on-line searches in the database archives of the major journal libraries of the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), and the Software Engineering Institute (SEI). The literature contained information used to determine the current standard practice or lack of standard practice for computing survivability in a distributed network environment. Each of these major publishers has an on-line search engine that allows users to construct a query based on keywords and a combination of search operators (where available) to allow a search within the respective publisher’s archives.

Searches performed on the obvious keyword combinations “computational survivability” and “computing survivability” were unfruitful, so a new list of keywords was developed for the searches. Since over a million papers are available within these publishers’ archives, a subjective selection criterion was created for choosing and analyzing literature to support this research. In order to reasonably support this research without reading each and every paper, an initial pass review on the title was performed to select potentially applicable literature.

Titles of journal, conference, proceeding, or other similar type papers were neither standard nor uniform and presented a challenge to select literature that might contribute to the research. To address this challenge, a subjective assessment based solely on title was performed to select articles to support the research. The following five-step process was developed for searching and selecting the articles that supported the research for this paper.

1. Perform separate on-line searches using the search utility provided by each of the major publishers (ACM, IEEE, and SEI) on six specific keywords or keyword combinations: survivability, survivable, component-based software engineering, component-based distributed systems, distributed network systems, and distributed network environment. The publishers’ search utility websites were: a) ACM: http://portal.acm.org/, b) IEEE: http://ieeexplore.ieee.org/, and c) SEI: http://www.sei.cmu.edu/.
2. Perform a subjective first pass review of the titles yielded by the on-line searches to support the research on computational survivability within a distributed network environment.
3. Review the abstract, introduction, paragraph headings, and conclusion/summary for content related to definitions, metrics, and computations of survivability for each paper selected by title as potential support to the research.
4. Analyze and summarize the current literature that passed the first and second selection criteria, step 2 and step 3 respectively, and which also provided a definition, metrics, or computation of survivability.
5. Review the reference section of any article supporting the research to determine if there were any other authors that contributed to the field of survivability or, more specifically, computational survivability. Perform a new search using these authors’ names as keywords in the search and follow the selection criteria from steps 1 through 4.

3. Findings

At the time of this research the ACM, IEEE, and SEI libraries contained approximately 360,000, 749,000, and 1,500 publications, respectively. The initial search results based on keywords by subject, phrase, or author used in each of the publishers’ on-line search utilities matched 3760 publications that could potentially be used for review and analysis, as shown in Table 1. The remainder of Table 1 shows the 3760 publications reduced to 270, which is the count of publications selected for review based on the methodology for selection in section 2. Of the 270 articles selected for review by title, only 107 were categorized as discussed in section 3.1 and only 63 were actually referenced within this research, which means 163 papers were not considered as contributors to the research upon a second review of the title.

Eliminating articles in the secondary review of the title was based on comparison with similar titles of papers that had actually been reviewed for content and found not to be related to the subject of survivability or computational survivability. Most of the component-based software engineering papers were eliminated because the information provided by the paper was too specific to software development. The distributed network papers were eliminated because they were more concerned with network routing and data encryption schemes. Other articles that were skipped for review specifically discussed topics such as system specification, other quality attributes not related to survivability, or computer network architecture at a client/server level.


Table 1. Findings: papers found by keyword and publisher then selected by initial review
First pass: papers found by keywords. Second pass: papers selected for review.

                                                           ------- First pass -------    ------ Second pass -------
Keywords by        Keyword used in on-line search             ACM  IEEE   SEI  Totals       ACM  IEEE   SEI  Totals
subject or phrase  Survivability                              227   903    84    1214         7   140    11     158
subject or phrase  Survivable                                 972   790    65    1827         2    36     0      38
subject or phrase  component-based software engineering       111    58    41     210         7    21     0      28
subject or phrase  component-based distributed systems         12     4     1      17         2     2     0       4
subject or phrase  distributed network systems                 25    19     9      53         2     4     0       6
subject or phrase  distributed network environment             23    25     3      51         1     7     0       8
author             Knight                                      27    61    13     101         3     8     0      11
author             Sullivan                                    25    24    13      62         1     5     0       6
author             Yacoub                                       9     9     0      18         0     3     0       3
author             Weiss                                        6    44    23      73         0     2     0       2
author             Hoffman                                     12    67     1      80         0     1     0       1
author             Ellison                                      4     3    47      54         2     1     2       5
Totals by publisher                                          1453  2007   300    3760        27   230    13     270
Percent to total by publisher                                 39%   53%    8%               10%   85%    5%
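
The totals and percentages in Table 1 can be re-derived from the per-keyword counts. The following is a minimal Python sketch of that check. The counts are copied from Table 1; the variable names, the (ACM, IEEE, SEI) tuple layout, and the two helper functions are assumptions made for illustration only.

```python
# Counts copied from Table 1 as (ACM, IEEE, SEI). The names and layout
# below are illustrative assumptions, not part of the original study.
first_pass = {
    "survivability": (227, 903, 84),
    "survivable": (972, 790, 65),
    "component-based software engineering": (111, 58, 41),
    "component-based distributed systems": (12, 4, 1),
    "distributed network systems": (25, 19, 9),
    "distributed network environment": (23, 25, 3),
    "Knight": (27, 61, 13),
    "Sullivan": (25, 24, 13),
    "Yacoub": (9, 9, 0),
    "Weiss": (6, 44, 23),
    "Hoffman": (12, 67, 1),
    "Ellison": (4, 3, 47),
}
second_pass = {
    "survivability": (7, 140, 11),
    "survivable": (2, 36, 0),
    "component-based software engineering": (7, 21, 0),
    "component-based distributed systems": (2, 2, 0),
    "distributed network systems": (2, 4, 0),
    "distributed network environment": (1, 7, 0),
    "Knight": (3, 8, 0),
    "Sullivan": (1, 5, 0),
    "Yacoub": (0, 3, 0),
    "Weiss": (0, 2, 0),
    "Hoffman": (0, 1, 0),
    "Ellison": (2, 1, 2),
}


def totals_by_publisher(counts):
    """Sum each publisher column over all keyword rows."""
    return tuple(sum(column) for column in zip(*counts.values()))


def percent_of_total(totals):
    """Share of each publisher in the grand total, as a whole percent."""
    grand_total = sum(totals)
    return tuple(round(100 * t / grand_total) for t in totals)


first_totals = totals_by_publisher(first_pass)
second_totals = totals_by_publisher(second_pass)
print(first_totals, sum(first_totals), percent_of_total(first_totals))
print(second_totals, sum(second_totals), percent_of_total(second_totals))
```

Run as written, the sketch reproduces the "Totals by publisher" and "Percent to total by publisher" rows of Table 1 for both passes, including the grand totals of 3760 and 270.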

3.1. Characterization of survivability trends and gaps in current literature

The trends and the gaps in trends for computing survivability of distributed network systems were reviewed, and each article could be categorized into one of the following categories: 1) did not support the research at all, 2) provided a definition of survivability only, 3) recognized the need for survivability, but did not really do anything about computing survivability, 4) recognized the need for survivability metrics, but did not really do anything, 5) recognized the need for computation, but did not really do anything, 6) provided methods to compute survivability, but very informally and rarely used in practical applications, and 7) computed survivability very well.

3.2. Definition of survivability in the current literature

The definition of survivability was not always consistent or even present in the current literature discussing survivability. Only 52 articles were cited for the definition of survivability, shown in Table 2, with a quality perspective definition of survivability cited from IEEE Standard 1061-1992 listed as the last entry. The summary of the citations highlights the non-standard definitions of survivability.

Large-scale network systems include many components (nodes) that are required to deliver services to the end user. Since the end-user typically invokes the service request, the definition of survivability should also consider the expectations of the user. References to network systems themselves also varied in definition, mainly in stating whether or not a system was bounded or unbounded. This difference plays a major role in the computation of survivability.

Based on the research, the key components of the definition of survivability could be summarized from the definitions in Table 2 as follows. No single author or reference included consideration of all of these components. It is recommended that these key components be used as the basis for developing a standard definition of survivability for distributed network systems.

1. System: if the definition of survivability must vary, then at least the distributed network system environment for which it has been defined should be mentioned. The different types of essential services may warrant a special definition of survivability. In addition, whether the system is bounded or unbounded should be addressed.
2. Threat: a threat to a system may prevent the system from providing services to the user in the prescribed amount of time or may prevent the system from providing the services at all. Threats to a system can be categorized as accidental, intentional (malicious), or catastrophic. Accidental threats include software errors, hardware errors, and human errors. Intentional or malicious threats include sabotage, intrusion, or terrorist attacks. Catastrophic threats, which include acts of nature (thunderstorms, hurricanes, lightning, flood, earthquake, etc.), acts of war, and power failures, typically do not allow delivery of required service to the user.
3. Adaptability: in the event of a threat the system should have the capability to adapt to the threat and continue to provide the required service to the user.
4. Continuity of Service: services should be available to the user as defined by the requirements of the system and expected by the user, even in the event of a threat. Network performance should not appear degraded to the end user.
5. Time: services should be available to the user within the time required by the system and expected by the user.

From these five key components, a standard definition of survivability should be developed and used throughout academia and industry to avoid ambiguity and confusion. A standard definition might also help to determine the composition and calculation of survivability.
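
One way to picture the five key components (system, threat, adaptability, continuity of service, and time) is as a record that a candidate definition fills in, so that missing components become visible. The following is a minimal Python sketch under that assumption. The class names, field names, and the example values are hypothetical and are not taken from any of the reviewed papers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ThreatCategory(Enum):
    """Threat categories named in the text."""
    ACCIDENTAL = "accidental"      # software, hardware, and human errors
    INTENTIONAL = "intentional"    # sabotage, intrusion, terrorist attacks
    CATASTROPHIC = "catastrophic"  # acts of nature, acts of war, power failures


@dataclass(frozen=True)
class SurvivabilityDefinition:
    """Hypothetical template: one slot per key component of a definition."""
    system: Optional[str] = None               # environment the definition applies to
    bounded: Optional[bool] = None             # bounded or unbounded system
    threats: Tuple[ThreatCategory, ...] = ()   # threat categories considered
    adaptability: Optional[str] = None         # how the system adapts to a threat
    continuity_of_service: Optional[str] = None
    time: Optional[str] = None                 # time within which service is required

    def missing_components(self) -> List[str]:
        """Return the key components this definition does not address."""
        missing = []
        if self.system is None or self.bounded is None:
            missing.append("system")
        if not self.threats:
            missing.append("threat")
        if self.adaptability is None:
            missing.append("adaptability")
        if self.continuity_of_service is None:
            missing.append("continuity of service")
        if self.time is None:
            missing.append("time")
        return missing


# Hypothetical draft definition that covers only three of the five components.
draft = SurvivabilityDefinition(
    system="telecommunications network",
    bounded=True,
    threats=(ThreatCategory.ACCIDENTAL, ThreatCategory.INTENTIONAL),
    time="services available within the time required by the system",
)
print(draft.missing_components())
```

A record of this kind only checks coverage of the components; it does not judge whether a given definition is a good one.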


Table 2. Analysis: summary of survivability definitions in current literature


Reference Definition of survivability is…
[1] To “provide quantitative measures for the network's capability to tolerate failures and to provide continuous service.”
[2] Defined in terms of network survivability where it is “1) the ability of a network to maintain or restore an acceptable level of performance during network failure conditions
by applying various restoration techniques and 2) the mitigation or prevention of service outages from potential network failures by applying preventative techniques.”
[4] The “quality of a system to handle all essentially critical operation instances successfully.”
[3] [5] [6] [7] [8] [9] [10] [11] [12] The “capability of a system to fulfill its mission in a timely manner in the presence of attacks, failures, or accidents.”
[13] The “ability of a system to continue operation despite the presence of abnormal events such as failures and intrusions.”
[14] A “network's ability to perform its designated set of functions given network infrastructure component failures, resulting in a service outage, which can be described by the
number of services affected, the number of subscribers affected, and the duration of the outage.”
[15] “Robustness under conditions of intrusion, failure, or accident.”
[16] The “ability of a system to maintain a set of essential services despite the presence of abnormal events such as faults and intrusions.”
[17] “That a system can be made robust to partially successful attack through general architecture features, through adaptability (flexible response to unanticipated changes) and
flexibility (ability to adapt to a range of adverse events without having to anticipate the particular response in advance).”
[18] To “provide network design and management procedures towards minimizing the impact of failures on multi-networks.”
[19] The “ability of a system to tolerate intentional attacks or accidental failures or errors.”
[20] Defined in terms of information survivability where it is “the ability of an information system to continue to operate in the presence of faults, anomalous system behavior,
or malicious attack.”
[21] The “ability of a system to provide service (possibly degraded) when various changes occur in the system or operating environment.”
[22] Where network systems “continue functioning even when under attack.”
[23] The “ability of a system/network to be maintained in the working state, given that a deterministic set of failures occurs to the system/network; therefore, the survivability is
always “yes” or “no” for a given failure scenario.”
[24] “Phases of survivability are attack detection, damage confinement, damage assessment and repair, and attack avoidance focusing on continued service and recovery.”
[25] The “capacity of a system to provide essential services even after successful intrusion and compromise, and to recover full services in a timely manner.”
[26] “The availability within a crucial time period”
[27] “Network design and management procedures to minimize the impact of failures on the network.”
[28] Defined in the terms of a telecommunications network where it is “the ability of the network to maintain or restore an acceptable level of performance in the event of
deterministic or random network failures, such as link failures and node failures.”
[29] Defined in terms of performance where it will “ensure that, under given failure scenarios, network performance will not degrade below predetermined levels.”
[30] The “ability of a network to cope with facility outages, capacity overloads, and natural disasters.”
[31] The “robustness of communication networks vis-à-vis events that affect a significant portion of the network topology.”
[32] Where “integrity is not compromised at the occurrence of unexpected disasters.”
[33] The “measure of the degree of keeping the performances of a kind of military weaponry or equipments or other military forces, which undergoing enemy's attacks.”
[34] The “ability of an item to perform a required function at a given instant in time after a specified subset of components of the item to become unavailable.”
[35] The “measure of a network's endurance in the presence of possible component failures (of the measure of the magnitude of attack needed to render a network
nonfunctional).”
[36] Where “survivable network must achieve an acceptable level of performance under demanding conditions.”
[37] The “assurance of stored information's integrity, confidentiality, and continuous availability guaranteed over time.”
[38] Defined in terms of survivable information systems through adaptation where it is “allowing a system to continue running, albeit with reduced functionality or performance
in the face of reduced resources, attacks, or broken components is often preferable to either complete shutdown or continued normal operation in compromised mode.”
[39] Defined in terms of a survivable system where it “must be adaptable, able to respond to attacks and achieve its goals.”
[40] The “capability of a system to complete its mission in a timely manner, even if significant portions are incapacitated by attack or accident.”
[41] A “certain percentage of traffic can still be carried immediately after a failure.”
[42] The “degree to which a system has been able to withstand an attack or attacks, and is still able to function at a certain level in its new state after the attack.”
[43] The “capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents.” And was also defined as “preserving essential
services in unbounded environments, even when systems in such environments are penetrated and compromised.”
[44] Defined in terms of a survivable system where it is “available to fulfill its mission in a timely manner, in the presence of attacks, failures, or accidents.”
[45] Defined in terms of a survivable system where it “satisfies its survivability specification of essential services and adverse environments.”
[46] The “capability of a system to fulfill its mission in a timely manner despite intrusions, failures, or accidents.”
[47] The “capability of an enterprise to continue to fulfill its mission by preserving essential services, even when systems are penetrated and compromised.”
[48] “Service stream over time”
[49] “Assured continuity of essential infrastructure services under defined adverse conditions: natural, accidental, or hostile.”
[50] “Certain path-connectivity is preserved under limited failures of network elements.”
[51] Where systems “must continue to perform adequately in the face of various kinds of adversity.”
[52] The “capability of a network system to complete its mission in a timely manner, even if significant portions are incapacitated by attack or accident.”
[53] The “Extent to which the software will perform and support critical functions without failures within a specified time period when a portion of the system is inoperable.”

3.3. Elements of survivability in current literature

The current literature indicates that the survivability model is a combination of at least twenty recognized quality models, their sub-characteristics, their sub-factors and other non-recognized quality specific models. Often these items overlapped. Current literature used an a la carte approach for describing the survivability model and mixed and matched other quality attributes, sub-characteristics and sub-factors. Definitions from ISO/IEC 9126 and IEEE Standard 1061-1992 of standardized quality models with their sub-characteristics and sub-factors were used to build the list below for the survivability model. Where no formal quality definition could be cited, a general description of the characteristic was provided, based on the collective reference to these attributes within the literature.

1. Availability: “The degree to which software remains operable in the presence of system failures.” [53]
2. Architectural design: Hardware (HW) Dependence: “The degree to which software does not depend on specific hardware environments.” [53] Software (SW) Dependence: “The degree to which hardware does not depend on specific software environments.” [53]
3. Connectivity: the degree to which a system will perform when all nodes and links are available.
4. Correctness: “The degree to which all software functions are specified.” [53]


5. Dependability: the degree to which the system can provide services, even in the event of a threat.
6. Endurability: the degree to which a system can tolerate a threat and still provide service.
7. Fairness: the ability of a network system to organize and route information without failure.
8. Fault Tolerance: “The degree to which the software will continue to work without a system failure that would cause damage to the users. Also, the degree to which software includes degraded operation and recovery functions.” [53]
9. Interoperability: “The degree to which software can be connected easily with other systems and operated.” [53]
10. Modifiability (Similar to Expandability): “The degree of effort required to improve or modify the efficiency or functions of the software.” [53]
11. Performance: “This acquisition concern is composed of the quality factors: Efficiency, Integrity, Reliability, Survivability, and Usability. Sub-factors include: Speed, Efficiency, Resource Needs, Throughput, and Response Time.” [53]
12. Predictability: the degree of providing countermeasures to system failures in the event of a threat.
13. Recoverability: “The ability to restore services in a timely manner.” [8]
14. Reliability: “A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time.” [54]
15. Restorability: the ability of a system to recover from threat and provide services in a timely manner.
16. Reusability: “The degree to which software can be reused in applications other than the original application.” [53]
17. Safety: the ability of the system to not cause harm to the network or personnel.
18. Security: “The degree to which software can detect and prevent information leak, information loss, illegal use, and system resource destruction.” [53]
19. Testability: “The effort required to test software.” [53]
20. Verifiability: “Relative efforts to verify the specified software operation and performance.” [53]

A comparison of the ISO/IEC 9126 and IEEE Standard 1061-1992 quality models and their respective sub-characteristics and sub-factors to the collection of survivability quality attributes that were mentioned in the various papers reviewed for this research showed that no one article listed all of these attributes, and no two articles were consistent in the definition of the survivability model.

The comparison indicates that the survivability model is a combination of the main quality model: reliability, and various sub-traits of other main quality models: functionality, reliability, maintainability, and portability. It would appear that survivability could not be classified as a main quality model, since it uses various sub-categories of four specific main quality models. None of the quality models listed survivability as a sub-trait. When developing a survivable network system, the current literature referenced the following methods that could be used for achieving survivability: Access control, Adaptive reconfiguration, Distribution, Diversity, Intrusion monitoring and detection, Replication, Redundancy, and Separation.

3.4. Computation of survivability in current literature

For the literature reviewed in Table 3, computational survivability was present, but calculations were made very informally and rarely used in fielded applications. Many papers agreed that a distributed network system could be represented as a state machine with services as the nodes and arcs as the links to the services. The assignment of probabilities to the arcs differed, and the articles diverged in the assessment and computation of survivability once a state machine was graphed.

4. Conclusions

4.1. Analysis of findings

Less than 1% of the articles originally selected for potential support to the research area of computational system survivability actually computed survivability. Most of the articles agreed that system survivability is very important to our social and economic infrastructures since it provides many essential services to support our existence. Most of the articles also agreed that if these systems are threatened and fail to provide the required services, the consequences might be catastrophic and even fatal. Only a very few references in the current literature could be used to support this area of research. Of the current literature that did support computational survivability, informal calculations were used, and the calculations were not currently used in practice. In addition, there was no one single definition that was embraced by the survivability community. Even the definition of survivability from an IEEE quality standard was not used explicitly.
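
The state-machine representation described in section 3.4 can be made concrete with a small example. The following is a minimal Python sketch, assuming an acyclic service graph with a probability on each arc, that computes the probability that a request eventually reaches a completed state. The state names, the probabilities, and the function names are hypothetical, and the calculation is a generic illustration rather than the method of any paper reviewed here.

```python
from functools import lru_cache

# Hypothetical service state machine: each state maps to its outgoing arcs,
# written as (next_state, probability). All values are made up for illustration.
TRANSITIONS = {
    "request": (("authenticate", 0.9), ("failed", 0.1)),
    "authenticate": (("deliver", 0.8), ("failed", 0.2)),
    "deliver": (("completed", 0.95), ("failed", 0.05)),
}

# Terminal states and the probability of completion once they are reached.
TERMINAL = {"completed": 1.0, "failed": 0.0}


def check_probabilities(transitions):
    """Raise ValueError if the arcs leaving any state do not sum to 1."""
    for state, arcs in transitions.items():
        total = sum(probability for _, probability in arcs)
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"arcs leaving {state!r} sum to {total}")


@lru_cache(maxsize=None)
def completion_probability(state):
    """Probability of eventually reaching 'completed' from the given state.

    Assumes the graph has no cycles, so the recursion terminates.
    """
    if state in TERMINAL:
        return TERMINAL[state]
    return sum(
        probability * completion_probability(next_state)
        for next_state, probability in TRANSITIONS[state]
    )


check_probabilities(TRANSITIONS)
print(round(completion_probability("request"), 3))
```

With these made-up numbers the only path to completion is request, authenticate, deliver, so the result is the product 0.9 × 0.8 × 0.95. A graph with cycles, such as one with a restoration state that returns to a functioning state, would need a different solution method, for example solving the corresponding linear system.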


Table 3. Summary of literature that computed survivability


Reference Summary of Literature Analysis or Computation
[1] The author of this paper developed a tool to assess survivability by evaluating reliability, availability, and restorability of SONET networks using a Parametric State
Reward Markov Model (SRMM/p). The survivability model has three states: functioning (satisfactory amount of service to users), restoration (where recovery
procedure takes place), and failure (the opposite of the functioning state). The evaluation of survivability uses three metrics and their probabilistic values: reliability
(“transient behavior of the model’s functioning and restoration states with the integration of the states’ reward values”), availability (“steady-state behavior of the same
states with the same performance integration”), and restorability (“average amount of recovery and average restoration duration”). Simulated the New Jersey network
with 11 nodes and 23 links. The results supported the conclusion that the survivability of a network is affected by the restoration time and the amount of recovery after restoration.
SUMMARY: Survivability is measured by reliability, availability and restorability.
[2] Authors use trellis graphs to find disjoint routing paths of network systems, which can be used to address survivability. Focus is on shortest path, which minimizes
delay, minimization of bandwidth, and maximization of bandwidth. “Survivability techniques are classified as 1) prevention, 2) network design, and 3) traffic
management and restoration.” The proposed algorithm transforms a network to a trellis graph then finds the k-best path through the trellis. This in turn is transformed
into a Minimum Cost Network Flow (MCNF) problem. SUMMARY: Survivability is measured by connectivity.
[13] ([16], [27]: similar papers) Analyzes survivability of network systems, which are service dependent; therefore a system architect should focus on the design of the system by analyzing only the
service required of that system. The authors of this paper use a Constrained Markov Decision Process (CMDP) to form the basis of the survivability analysis, which is
composed of reliability, latency, and cost-benefit. The survivability analysis process, using techniques such as model checking, Bayesian techniques, probabilistic
techniques, and cost-benefit analysis, is covered in six steps: 1) Model the network using a finite state machine, 2) Inject faults into the model annotated by a special
state variable with specified assumptions, 3) Specify survivability properties classified by faults (where a service node may reach an undesired or unsafe state) and
services (where an issued service is monitored for completion of that particular service, which eventually does complete), 4) Construct fault scenario graphs and use
model checking. Since the graphs can get quite large a querying process was developed to select a subset of scenarios that represent the events of interest to the
architect, 5) Perform the reliability and latency analysis by assigning a Boolean variable to each state (indicating if an event occurred), a conditional probability
(indicating probability of reaching a state), and a cost to the edges. The reliability metric is the probability that an event will eventually finish and the latency metric is
the time to complete the event, and 6) Perform cost-benefit analysis to possible improvements to links based on cost, reliability, and latency. The analysis can identify
critical nodes and determine survivability of a system with respect to the properties: fault and services. A tool, Trishul, was developed to simulate the basic algorithms
presented in the paper. SUMMARY: Survivability is measured by reliability, latency, and cost-benefit.
[28] A class of traffic-based survivability measures is defined, where the performance of the network is used as the analysis of survivability. Networks are evaluated with
and without restoration. Mentions two types of survivability measures: 1) “deterministic survivability measures depend solely on topology of the network”, 2)
“probabilistic survivability measures depend on topology and reliability of each component on the network, which is further split into connectivity-based and traffic-
based measures.” The authors use a model that is an undirected graph of nodes and links with probabilities that the link and/or node are operative. The analysis is used
to find three measures of survivability: 1) “Terminal survivability: the fraction of traffic between a specific pair of nodes that can be carried by the network”, 2)
“Network survivability: the fraction of traffic of the entire network that can be carried by the network,” 3) “Subnet survivability: the fraction of traffic of a subnet that
can be carried by the network.” NOTE: 4 pages of the 8 pages were missing from the PDF file. SUMMARY: Survivability is measured by network performance.
[29] Assesses and analyzes survivability based on a survivability framework. A threshold is determined for an acceptable level of network performance in the context of user
expectations. Outages have three major features, 1) unservability, 2) duration, and 3) weight, and fall into three major categories: 1) catastrophic, 2) major, and 3) minor.
Two approaches to survivability analysis: 1) “Probability of network failure and rates of repair to calculate network availability or unservability.” and 2) “Measures of
a network after a given failure has occurred using probabilistic weighting of the resulting states of the network and resulting restored after the failure.” In the context of
a telecommunications network at the service layer, “survivability measures may include end-to-end grade of service, number of calls, number of connected subscribers,
network operator’s revenue and traffic volume.” SUMMARY: Survivability measurements are to be determined by the analyst.
[31] The authors use Monte Carlo simulation and reliability algorithms to determine the probability of a surviving connection for node pairs. Probabilities are assigned
to edges of a network graph. The simulation randomly generates graphs of a network system with the nodes predefined to represent the system. The minimum
reliability value of the node pairs in the system is used as the network survivability value. SUMMARY: Survivability is measured by connectivity; the
lowest reliability value between node pairs is assigned as the network survivability value.
[32] A survivability function is used as the measure instead of a single value for survivability. The author evaluates network survivability in terms of nodes connected after
a failure (disaster) that results in unavailable or destroyed nodes. The survivability function is described as “the probability that a fraction of the nodes are connected to
the central node.” The function allows for different quantities to be calculated based on the network characteristics such as type of failure (disaster) and goodness of the
network. The survivability function can calculate expected, worst-case, r-percentile, and probability of zero survivability. SUMMARY: Survivability is calculated as a
function that depends on the type of network failure and the remaining links available after the failure.
[33] Survivability is measured by traffic capacity, not network connectivity. Survivability is calculated as a percentage of remaining network traffic flow to the original
traffic after the communication network has been destroyed. SUMMARY: Survivability is measured by traffic capacity.
[35] Measures survivability in formulas as vectors in terms of cutsets. A cutset is “a set of edges whose removal results in a disconnected network.” The computation of
survivability limits the number of cutsets and classifies two types of problems: 1) minimum cutsets and 2) weakest cutsets. SUMMARY: Survivability is formulated as
vectors in terms of cutsets.
[42] (similar paper: [62]) The authors propose a model to assess the survivability of a network system. Different parameters affect survivability, such as the frequency and impact of attacks on a
network system. The measure of survivability = (performance level of the new state of the system after an attack)/(system performance at a normal level). The
possible values of survivability range from 1 (completely normal) to 0 (failure). Another possible calculation is a weighted sum of the importance level of the service
times the degree of compromise of the service in the survived state. The authors finally conclude that there is no “absolute survivability” and cite other measures of
survivability such as relative survivability, worst-case survivability, and survivability with expected compromise. Simulations to analyze survivability used the Poisson
model. SUMMARY: Survivability is calculated in terms of network performance.
[50] Survivability is analyzed using the Steiner network problem, which addresses connectivity of a network system under node and link failures. SUMMARY:
Survivability is measured by connectivity.
[55] Similar paper to reference [13] by same author(s). Measure for survivability is based on topological structures of network systems, specifically military
communications networks (MCNs). The measure of survivability is based on connectivity, under the following assumptions: 1) nodes have only two
statuses, damaged or undamaged, 2) links between nodes are wireless, and 3) only one node is destroyed or moved at a time. SUMMARY: Survivability is measured by
connectivity.
[56] Survivability is computed as the probability that communication across a network is a success. The indexes are based on a function of actions causing the network to be
down. The authors use Boolean algebra, probability, and queuing theories to support the computation of survivability. SUMMARY: Survivability is measured by
success of communication (network performance).
[57] Survivability is calculated as network performance where the fraction of time in failure state affects the performance. The authors choose to measure performance as a
time interval, called traffic blocking level, rather than using perceived service effects as a measure. The magnitude, duration, and frequency of failure are used to determine
the impact to traffic performance. SUMMARY: Survivability is calculated as network performance.
[58] The authors use capacity related reliability (CRR) for the survivability index and developed a tool called SACHEL (Survivability Analysis of complex Computer-
networks with Heterogeneous Link-capacities) to perform the survivability analysis. Networks are graphed as nodes (services) and links (connection services).
SUMMARY: Survivability is calculated as network capacity.
[59] (similar paper: [60]) Elaborates on the computation of the survivability metric called the Node Connectivity Factor (NCF). NCF is concerned with the remaining nodes after a connection to
nodes or links fails. The authors introduce knowledge-based computations to determine the NCF values for networks with large numbers of nodes (greater than 15).
“The final NCF value can be formed by combining the NCF values determined for subgraphs at lower levels.” SUMMARY: Survivability is measured by connectivity.
[61] The terms connectivity and survivability are used interchangeably in this article. The author measures survivability using Node Connectivity Factor (NCF) and Link
Connectivity Factor (LCF). For a survivable network, high values of NCF and LCF are ideal. NCF represents the physical stability and LCF represents the electronic
stability. Probabilistic values are assigned to nodes and links. A modified cut-saturation algorithm in conjunction with Floyd’s algorithm is used in the design process for
networks. Inputs to the algorithm include: network topology, traffic flow, and traffic requirements between pairs of nodes. SUMMARY: Survivability is measured by
connectivity.
[63] Authors “identify real-time metrics to quantify system survivability” and propose data visualization. Analysis of survivability depends on system performance during
three periods: 1) the period after failure, 2) the period during failure, and 3) the period following recovery from failure. The calculations are too specific to the mobile network
system. SUMMARY: Survivability is measured by network performance.
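
Two of the calculations summarized in the table above lend themselves to a short illustration: the performance-ratio measure described for [42] and the summary quantities of the survivability function described for [32]. The Python sketch below is illustrative only and is not taken from either paper. The function names, the clamping of the ratio to the range 0 to 1, the representation of the survivability function as a discrete distribution, and the percentile convention are all assumptions made for this example.

```python
from typing import Dict


def performance_ratio(perf_after_attack: float, perf_normal: float) -> float:
    """Ratio measure from the [42] row: post-attack performance / normal performance."""
    if perf_normal <= 0:
        raise ValueError("normal performance must be positive")
    ratio = perf_after_attack / perf_normal
    # Assumption: clamp to [0, 1] so that 1 means completely normal and 0 means failure.
    return max(0.0, min(1.0, ratio))


def survivability_summary(dist: Dict[float, float], r: float) -> Dict[str, float]:
    """Summarize a survivability function given as {fraction_connected: probability}.

    Hypothetical representation: each key is a fraction of nodes still connected
    to the central node after a failure, and each value is its probability.
    """
    if abs(sum(dist.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    if not 0.0 < r <= 1.0:
        raise ValueError("r must be in (0, 1]")

    expected = sum(s * p for s, p in dist.items())
    worst_case = min(s for s, p in dist.items() if p > 0)

    # Assumed convention: the r-percentile is the smallest fraction s
    # such that P(S <= s) >= r.
    cumulative = 0.0
    r_percentile = max(dist)
    for s in sorted(dist):
        cumulative += dist[s]
        if cumulative >= r - 1e-12:
            r_percentile = s
            break

    return {
        "expected": expected,
        "worst_case": worst_case,
        "r_percentile": r_percentile,
        "p_zero": dist.get(0.0, 0.0),
    }


if __name__ == "__main__":
    print(performance_ratio(60.0, 80.0))
    example = {0.0: 0.1, 0.5: 0.3, 1.0: 0.6}
    print(survivability_summary(example, r=0.25))
```

Returning several summary quantities from one distribution mirrors the point made for [32], that a function carries more information than a single survivability value. A real analysis would derive the distribution from the network topology and failure model rather than supply it by hand.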


Since a non-exhaustive search was performed in the ACM, IEEE, and SEI publishers’ library archives, it is possible that a relevant paper that supports the research area of computational survivability was overlooked. However, the overwhelming evidence from this study supports the conclusion that the current state and understanding of survivability among researchers and industry professionals is inconsistent and non-standard. Another source that could have been used is the consolidated body of system survivability work found within the Information Survivability Workshops, which provide “a forum for researchers, practitioners, and sponsors to discuss the area of survivability, the nature of the unique (and sometimes not-so-unique) problems associated with survivability, and promising approaches to finding solutions to these problems” [64]. Since the time this research began, the International Conference on COTS-based Software Systems (ICCBSS) held its first conference in early 2002. ICCBSS is the first conference series to focus on exchanging ideas about current best practices and promising research directions in creating and maintaining systems that incorporate COTS software products [65]. These additional resources are mentioned to highlight that industry and scholars are continuing research in the area of system survivability and to note that many other sources are available.

In general, the reported measurement of survivability falls into one of three major categories: 1) connectivity, 2) network performance, or 3) a function of other quality or cost measures. Unless there is a reason to support a bounded system, research should focus on unbounded systems, since most distributed network systems are much like the Internet, where all nodes that provide essential services may not always be known. The specific models that should be expanded are: 1) the “reliability, latency, and cost benefit model” [13], 2) the “survivability function” model [32], 3) the node and link connectivity models [61], and 4) the “simulation model for managing survivability of network information systems” [42]. These models were chosen because 1) they were simple and understandable, 2) the mathematics was thorough and sound, and 3) the calculations made it easy to talk about components.

4.2. Survivability definition template

Survivability must be a context-specific definition. A standard template is provided here for defining survivability to support research and practice in the composition and calculation of survivability, enhance survivability-related communication across the system life cycle, and support improvements in survivability analyses and subsequent risk assessments.

Survivability = the ability of a given system with a given intended usage to provide a pre-specified minimum level of service in the event of one or more pre-specified threats.

Thus, a precise definition of survivability requires a precise definition of the system, the usage, the minimum level of service, and the threats.

System: The system is typically a large-scale network system that includes many components (nodes) that are required to deliver services to the end user. The system environment and the essential services that the system provides are defined for this survivable network system. State whether the system is bounded or unbounded. The system is unbounded if not all of the nodes that provide the essential services are known.

Usage: Since the end user typically invokes the service request, the definition of survivability should consider the expectations of the user (including any preconditions that determine the applicability of this definition of survivability).

Minimum level of service: The minimum level of service is a set of functional specifications for each required service, each associated with a set of quality attributes and their associated values. For example: For a networked distributed system, a required service may be a specified response time to the end user. For a time-critical system, the description of a required service may include the maximum time allowable between user request and system response.

Threats: Include the type of threat to the system that may prevent the system from providing services to the user in the prescribed amount of time or may prevent the system from providing the services at all. Threat categories include: 1) Accidental threats: software errors, hardware errors, and human errors, 2) Intentional or malicious threats: sabotage, intrusion, or terrorist attacks, and 3) Catastrophic threats, which typically do not allow delivery of required service to the user: acts of nature (thunderstorms, hurricanes, lightning, flood, earthquake, etc.), acts of war, and power failures.

A business case: A business case is required for each survivability definition. The rationale provides the business case for the definition of survivability. There is an extra cost associated with the design, development, and operation of a survivable system. The business case is developed based on the cost/benefit analysis from which the threat is identified and required responses are specified. Note: A separate cost/benefit analysis is required for each level of threat. A non-malicious virus may degrade system performance but not shut the system down. If the degraded performance is within the defined minimum


level of service, no action may be required with respect to survivability.

4.3. Contributions and future research

The survivability analysis of information systems and the new survivability definition template provided by this research is useful to several groups of people such as researchers, managers, and ultimately end-users. For researchers, current literature has been thoroughly reviewed and base-lined for current practices of computational survivability. Researchers can use the results of this paper as a baseline from which to build subsequent efforts while recognizing the need to move the study of computation of survivability from theory to practice. In particular, field-testing of survivability issues is needed to obtain empirical results for actual applications. The findings of this paper also support the need for a standardization of the definition of survivability that may facilitate subsequent research into computational quality attributes.

Managers can use the findings of this research to support improved risk assessment of distributed network systems. With the knowledge that there are inconsistent and non-standard definitions of survivability and that computing survivability may be system specific, a manager can better understand the risks associated with making business decisions regarding survivable systems. Managers may not need to know exactly how to compute survivability, but they should be able to ask questions about survivability (and to understand the answers to those questions) of vendors providing services and of designers proposing the development or enhancement of a distributed network system.

As researchers and managers recognize the need for computational survivability and move forward to develop this area of study, ultimately the end-users will benefit from this research if improved communication about survivability results in users receiving the services that they need without interruption and in a timely manner.

5. References

[1] Cankaya, H.C., Nair, V.S.S., “A survivability assessment tool for restorable networks,” 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology, IEEE, 2000, pp. 319-324.

[2] Louca, S., Pitsillides, A.; Samaras, G., “On network survivability algorithms based on trellis graph transformations,” International Symposium on Computers and Communications, IEEE, 1999, pp. 1008-1023.

[3] Lipson, Howard F., David A. Fisher, “Survivability—a new technical and business perspective on security,” New Security Paradigms Workshop, Proceedings of the 1999 workshop on New security paradigms, ACM, Sep 1999.

[4] Wedde, H.F., Bohm, S.; Freund, W., “Adaptive protocols for survivability of transactions operating on replicated objects,” Sixth International Workshop on Object-Oriented Real-Time Dependable Systems, 2001, IEEE, 2001, pp. 61-66.

[5] Chenxi Wang, Davidson, J.; Hill, J.; Knight, J., “Protection of software-based survivability mechanisms,” The International Conference on Dependable Systems and Networks, IEEE, 2001, pp. 1413-1420.

[6] Ellison, R.J., Linger, R.C.; Longstaff, T.; Mead, N.R., “Survivable network system analysis: a case study,” IEEE Software, Volume 16, Issue 4, IEEE, Jul-Aug 1999, pp. 58-62.

[7] Ellison, R.J., Fisher, D.A.; Linger, R.C.; Lipson, H.F.; Longstaff, T.A.; Mead, N.R., “Survivability: protecting your critical systems,” IEEE Internet Computing, Volume 3, Issue 6, IEEE, Nov-Dec 1999, pp. 9.B.3-1-9.B.3-8.

[8] Fisher, D.A., Lipson, H.F., “Emergent algorithms-a new method for enhancing survivability in unbounded systems,” Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences, IEEE, 1999, pp. 351-357.

[9] Perraju, T.S., “An agent framework for survivable network systems,” International Performance, Computing and Communications Conference, IEEE, 1999, pp. 235-243.

[10] Byon, Imju, “Survivability of the U.S. Electric Power Industry,” SEI, May 2002.

[11] Ellison, Robert J., Richard C. Linger, Thomas Longstaff, Nancy R. Mead, “A Case Study in Requirements for Survivable Systems,” SEI, Dec 2002.

[12] Ellison, R. J., R. C. Linger, T. Longstaff, N. R. Mead, “A Case Study in Survivable Network System Analysis,” SEI, Sep 1998.

[13] Jha, S., Wing, J.M., “Survivability analysis of networked systems,” Proceedings of the 23rd International Conference on Software Engineering, IEEE, 2001, pp. 872-874.

[14] Snow, A.P., Varshney, U.; Malloy, A.D., “Reliability and survivability of wireless and mobile networks,” IEEE Computer, Volume 33, Issue 7, IEEE, Jul 2000, pp. 449-454.

[15] Kyamakya, K., Jobman, K.; Meincke, M., “Security and survivability of distributed systems: an overview,” 21st Century Military Communications Conference Proceedings, Volume 1, IEEE, 2000, pp. 1204-1208.


[16] Jha, S., Wing, J.; Linger, R.; Longstaff, T., “Survivability analysis of network specifications,” International Conference on Dependable Systems and Networks, IEEE, 2000, pp. 53-58.

[17] Eegleston, J.E., Jamin, S.; Kelly, T.P.; Mackie-Mason, J.K.; Walsh, W.E.; Wellman, P.P., “Survivability through market based adaptivity: the MARX project,” DARPA Information Survivability Conference and Exposition, Volume 2, IEEE, 1999, pp. 380-390.

[18] Medhi, D., Tipper, D., “Multi-layered network survivability-models, analysis, architecture, framework and implementation: an overview,” DARPA Information Survivability Conference and Exposition, Volume 1, IEEE, 1999, pp. 421-423.

[19] Hiltunen, M.A., Schlichting, R.D.; Ugarte, C.A.; Wong, G.T., “Survivability through customization and adaptability: the Cactus approach,” DARPA Information Survivability Conference and Exposition, Volume 1, IEEE, 1999, pp. 207-221.

[20] Voas, J.M., Ghosh, A.K., “Software fault injection for survivability,” DARPA Information Survivability Conference and Exposition, Volume 2, IEEE, 1999, pp. 256-270.

[21] Knight, J.C., Sullivan, K.J.; Elder, M.C.; Chenxi Wang, “Survivability architectures: issues and approaches,” DARPA Information Survivability Conference and Exposition, Volume 2, IEEE, 1999, pp. 36-45.

[22] Bowen, T., Chee, D.; Segal, M.; Sekar, R.; Shanbhag, T.; Uppuluri, P., “Building survivable systems: an integrated approach based on intrusion detection and damage containment,” DARPA Information Survivability Conference and Exposition, Volume 2,

[23] Wilson, M.R., “The quantitative impact of survivable network architectures on service availability,” IEEE Communications Magazine, Volume 36, Issue 5, IEEE, May 1998, pp. 71-77.

[24] Ammann, P., Jajodia, S.; Peng Liu, “A fault tolerance approach to survivability,” Computer Security, Dependability and Assurance: From Needs to Solutions, IEEE, 1998, pp. 1017-1021.

[25] Linger, R.C., Mead, N.R.; Lipson, H.F., “Requirements definition for survivable network systems,” Third International Conference on Requirements Engineering, IEEE, 1998, pp. 1603-1606.

[26] Struyve, K., Van Caenegem, B.; Van Doorselare, K.; Gryseels, M.; Demeester, P., “Design and evaluation of multi-layer survivability for SDH-based ATM networks,” Global Telecommunications Conference, Volume 3, IEEE, 1997, pp. 1466-1470.

[27] Nikolopoulos, S.D., Pitsillides, A.; Tipper, D., “Addressing network survivability issues by finding the K-best paths through a trellis graph,” Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Volume 1, IEEE, 1997, pp. 370-377.

[28] Jianxu Shi, Fonseka, J.P., “Traffic-based survivability analysis of telecommunications networks,” Global Telecommunications Conference, Volume 2, IEEE, 1995, pp. 79-87.

[29] Zolfaghari, A., Kaudel, F.J., “Framework for network survivability performance,” Journal on Selected Areas in Communications, Volume 12, Issue 1, IEEE, 1994, pp. 1615-1616.

[30] Laretto, K.G., “Sprint network survivability,” Military Communications Conference, IEEE, 1994, pp. 587-597.

[31] Kalyoncu, H., Sankur, B., “Estimation of survivability of communication networks,” Electronics Letters, Volume 28, Issue 19, IEEE, Sep 1992, pp. 473-480.

[32] Liew, S.C., Lu, K.W., “A framework for network survivability characterization,” IEEE International Conference on Communications, IEEE, 1992, pp. 441-451.

[33] Jiang, T.Z., “A new definition of survivability of communication networks,” Military Communications Conference: Military Communications in a Changing World, IEEE, 1991, pp. 2007-2012.

[34] Brush, G., Marlow, N., “Assuring the dependability of telecommunications networks and services,” IEEE Network, Volume 4, Issue 1, IEEE, Jan 1990, pp. 827-828.

[35] Wu, L., Varshney, P.K., “On survivability measures for military networks,” Military Communications Conference, A New Era, IEEE, 1990, pp. 125-129.

[36] Huffman, S., Altes, T.; Chahine, K., “Issues for proliferated survivable network design,” Global Telecommunications Conference, IEEE, 1988, pp. 489-492.

[37] Jones, A., “The challenge of building survivable information-intensive systems,” Computer, Volume 33, Issue 8, IEEE, Aug 2000, pp. 39-43.

[38] Bowers, S., Delcambre, L.; Maier, D.; Cowan, C.; Wagle, P.; McNamee, D.; Le Meur, A.-F.; Hinton, H., “Applying adaptation spaces to support quality of service and survivability,” DARPA Information Survivability Conference and Exposition, Volume 2, IEEE, 1999, pp. 271-283.

[39] Shrobe, H., “Model-based troubleshooting for information survivability,” DARPA Information Survivability Conference and Exposition, Volume 2, IEEE, 1999, pp. 231-240.


[40] Linger, R.C., “Panel: Issues in Requirements Definition for Survivable Systems,” Third International Conference on Requirements Engineering, IEEE, 1998, pp. 198-199.

[41] Medhi, D., “A unified framework for survivable telecommunications network design,” International Conference on Communications, Volume 1, IEEE, 1992, pp. 411-415.

[42] Moitra, Soumyo D., Suresh L. Konda, “A Simulation Model for Managing Survivability of Networked Information Systems,” SEI, Dec 2002.

[43] Ellison, B., Fisher, D.A. Linger, R.C. Lipson, H.F. Longstaff, T. Mead, N.R., “Survivable Network Systems: An Emerging Discipline,” SEI, May 1999.

[44] Caldera, Jose, “Survivability Requirements for the U.S. Health Care Industry,” SEI, May 2000.

[45] Linger, Richard C., Andrew P. Moore, “Foundations for Survivable System Development: Service Traces, Intrusion Traces, and ...,” SEI, Oct 2001.

[46] Mead, Nancy R., Robert J. Ellison, Richard C. Linger, Thomas Longstaff, John McHugh, “Survivable Network Analysis Method,” SEI, Sep 2000.

[47] Ellison, Robert J., Andrew P. Moore, “Architectural Refinement for the Design of Survivable Systems,” SEI, Oct 2001.

[48] Sullivan, Kevin, John C. Knight; Xing Du; Steve Geist, “Information survivability control systems,” Proceedings of the 1999 international conference on Software engineering, ACM, May 1999.

[49] Sullivan, Kevin J., Steve Geist; Paul Shaw, “Mediators in infrastructure survivability enhancement,” Proceedings of the third international workshop on Software architecture, ACM, Nov 1998.

[50] Mihail, Milena, David Shallcross; Nate Dean; Marco Mostrel, “A commercial application of survivable network design,” Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, ACM, Jan 1996.

[51] Neumann, Peter G., “Survivable systems,” Communications of the ACM, Volume 35, Issue 5, ACM, May 1992.

[52] Mead, Nancy R., “Issues in software engineering for survivable systems (panel),” Proceedings of the 1999 international conference on Software engineering, ACM, May 1999.

[53] IEEE Std 1061-1992, IEEE standard for a software quality metrics methodology.

[54] Robert, P., “Quality requirements for software acquisition,” Software Engineering Standards Symposium and Forum, IEEE, 1997, pp. 136-143.

[55] Haizhuang Kang, Butler, C.; Qingping Yang; Jiamo Chen, “A new survivability measure for military communication networks,” Military Communications Conference, Volume 1, IEEE, 1998, pp. 3-4.

[56] Hagin, A.A., “Performability, reliability, and survivability of communication networks: system of methods and models for evaluation,” Proceedings of the 14th International Conference on Distributed Computing Systems, IEEE, 1994, pp. 912-916.

[57] Zolfaghari, A., “Study And Analysis Of High Capacity Survivability Performance,” Region 10 International Conference on EC3-Energy, Computer, Communication and Control Systems, Volume 4, IEEE, 1991, pp. 727-730.

[58] Rai, U., Soh, S., “Survivability analysis of complex computer-networks with heterogeneous link-capacities,” Annual Reliability and Maintainability Symposium, IEEE, 1991, pp. 440-445.

[59] Whittaker, G.M., Schroeder, M.A.; Newport, K.T., “A knowledge-based approach to the computation of network nodal survivability,” Military Communications Conference, A New Era, IEEE, 1990, pp. 424-429.

[60] Newport, K.T., Schroeder, M.A.; Whittaker, G.M., “Techniques for evaluating the nodal survivability of large networks,” Military Communications Conference, A New Era, IEEE, 1990, pp. 293-297.

[61] Newport, K.T., “Incorporating survivability considerations directly into the network design process,” Ninth Annual Joint Conference of the IEEE Computer and Communication Societies, IEEE, 1990, pp. 1963-1970.

[62] Moitra, Soumyo D., Suresh L. Konda, “The Survivability of Network Systems: An Empirical Analysis,” SEI, Dec 2000.

[63] Dahlberg, T. A., K. R. Subramanian, “Visualization of real-time survivability metrics for mobile networks,” Proceedings of the 3rd ACM international workshop on Modeling, analysis and simulation of wireless and mobile systems, ACM, Aug 2000.

[64] The Information Survivability Workshops, SEI and IEEE, http://www.cert.org/research/isw.html.

[65] International Conference on COTS-Based Software Systems. http://www.iccbss.org/.
