2 - AAS Concepts and Terminology Full Slides

Analysis of Software Artifacts
Departamento de Engenharia Informática, FCTUC
Analysis of Software Artifacts (ASA)

Henrique Madeira,
Departamento de Engenharia Informática
Faculdade de Ciências e Tecnologia da Universidade de Coimbra
2022/2023
Henrique Madeira, Analysis of Software Artifacts, DEI-FCTUC, 2022/2023
Fundamental concepts
of software quality
and software dependability

Two views of software systems

• Functional view
– What the software system does
– Quality is related to the match between the functionalities and the user
needs/expectations
• Non-Functional view
– How the software system does it (features such as performance, security,
reliability, availability, usability, maintainability, and many, many, more)
– Typically known as Quality Attributes of a software system
– Most of them cannot be measured directly
– The biggest technical challenges are in these non-functional attributes
Functional and non-functional requirements
• In software engineering, the functional versus non-functional view appears at the requirements level, that is, at the very beginning of the development process.
• Functional requirements
– Describe what a software system should do
– Function points are a common metric to characterize and assess the size of the software
• Non-functional requirements
– Define constraints (or goals) on how the system will do so
– Include basically everything that is not related to the functional aspects of the
software system

Most common quality attributes

(i.e., non-functional attributes)
• Availability
• Reliability
• Security
• Performance
• Usability
• Maintainability
• Manageability
• Cost
Some obvious observations on many of these properties:
• In general, they make sense at system level, not just at program/application level.
• They depend on both the software and the underlying hardware.
• They depend on architectural design choices and configuration choices.

Availability: the ability of the system to be available for service when requested by end-users or other systems.
Directly related to downtime:
$$\text{Availability} = \frac{\text{Expected uptime}}{\text{Expected uptime} + \text{Expected downtime}}$$
Downtime depends on both the failure rate and the time to repair.
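The availability ratio can be worked through numerically. The sketch below is only an illustration: the function names and the uptime/downtime figures are hypothetical, and both quantities are assumed to be expressed in the same time unit.

```python
def availability(expected_uptime: float, expected_downtime: float) -> float:
    """Availability = uptime / (uptime + downtime), same time unit for both."""
    total = expected_uptime + expected_downtime
    if expected_uptime < 0 or expected_downtime < 0 or total == 0:
        raise ValueError("uptime and downtime must be non-negative and not both zero")
    return expected_uptime / total


def downtime_per_year_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * hours_per_year


# Hypothetical figures: 719 hours up and 1 hour down in a 30-day month.
monthly = availability(719.0, 1.0)
print(f"availability = {monthly:.5f}")
print(f"downtime per year = {downtime_per_year_hours(monthly):.2f} h")
```

Reading the ratio as downtime per year is often easier to discuss with stakeholders than the raw probability.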


Reliability: the ability of a system to perform its functions under stated conditions for a specific period of time. It is expressed as a probability.
Often referred to as a relative property expressed by the mean time between failures (MTBF). For example, the system has an MTBF of 12 days.
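To connect the probability view with the MTBF view, a common simplification is to assume a constant failure rate (exponential model). That assumption is not stated in these slides; under it, the probability of surviving a mission of length t is exp(-t / MTBF). A minimal sketch, with a hypothetical function name:

```python
import math


def reliability(mission_time: float, mtbf: float) -> float:
    """Probability of running for mission_time without failure.

    Assumes a constant failure rate (exponential model), which is an
    assumption of this sketch and not a claim made by the slides.
    """
    if mtbf <= 0 or mission_time < 0:
        raise ValueError("mtbf must be positive and mission_time non-negative")
    return math.exp(-mission_time / mtbf)


# MTBF of 12 days, as in the example above; mission lengths are hypothetical.
for days in (1, 6, 12):
    print(f"R({days} days) = {reliability(days, 12.0):.3f}")
```

Under this model, a mission as long as the MTBF itself succeeds with a probability of only about 37%, which is why MTBF alone can be misleading.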

Security: a composite of three properties:
• Confidentiality: the absence of unauthorized disclosure of information.
• Integrity: the absence of improper system alteration.
• Availability: the readiness of the system to provide correct service (as seen before).
Security is related to intentional or malicious actions against the system.
Systems can be compromised when there is a vulnerability + attack (that exploits the vulnerability).


Performance: concerns the speed of operation of a system. Measured by two metrics:
• Response time: how quickly the system reacts to an input from the user or from another system.
• Throughput: how much work (i.e., operations) the system can accomplish within a specified amount of time.
Scalability: often associated with performance; expresses the ability of a system to handle a growing amount of work or its ability to be enlarged to accommodate that growth.

Usability: the ease of use and learning of a software system. It is composed of several attributes (adapted from Jakob Nielsen & Ben Shneiderman):
• Learnability: how easy is it for users to accomplish basic tasks for the first time?
• Efficiency: once users have learned the system, how quickly can they perform tasks?
• Memorability: when users return to the system after a period of not using it, how easily can they reestablish proficiency?
• Errors: how many errors do users make, how severe are these errors, and how easily can they recover from the errors?
• Satisfaction: how pleasant is it to use the system?


Maintainability: the ease with which a software system can be changed in order to address several situations such as:
• correct defects;
• install upgrades;
• add new features to face new requirements;
• ...

Manageability: how easily the system can be administered by support staff. May include a variety of actions such as:
• user management;
• configuration management;
• error correction;
• logging;
• ...


Cost: amount of resources invested in the system. At design time it is just an estimate.
Often referred to as total cost of ownership (TCO), which includes the estimation of direct and indirect costs (staff salaries, energy cost, facilities, etc.) of a system:
• initial development of the system;
• installation;
• cost of modifications;
• cost of keeping the system running.
Questions
• Explain why availability is an attribute of security. Give examples.

Dependability: an integrative concept

(more on concepts & terminology)
• Dependability: “delivery of service that can justifiably be trusted, thus avoidance of failures that are unacceptably frequent or severe” (J.-C. Laprie)
• Includes the following system attributes:

– Availability: readiness for correct service
– Reliability: continuity of correct service
– Safety: absence of catastrophic consequences on the user(s) and the
environment
– Confidentiality: the absence of unauthorized disclosure of information
– Integrity: absence of improper system alteration
– Maintainability: ability for a process to undergo modifications and repairs



[Slide overlay: the dependability attributes above, with the security attributes highlighted]




[Slide overlay: the dependability attributes above, with the attributes of a mission critical system highlighted]
Questions
• If a web service crashes when called with a given combination of valid inputs, can you claim that the web service is not robust? Explain.

Robustness
• Robustness: “a software system can be said to be robust if it retains its ability to deliver service in conditions which are beyond its normal domain of operation” (Laprie)
• Robustness is used very often to test software interfaces such as system calls, APIs, web services, etc. This is called robustness testing:
– In this context, robustness is defined as “the degree to which a system or component can function correctly in the presence of invalid inputs” [IEEE90]
– Experimental studies (Phil Koopman) show that approximately 15% of the OS system calls (Linux, Unix, Windows) crash when called with invalid input parameters.
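Robustness testing of an interface can be sketched as feeding invalid inputs and recording how the interface reacts. The code below is a minimal illustration, not the tooling used in the cited studies: `parse_age`, `robustness_test` and the set of accepted exception types are hypothetical names and choices.

```python
def parse_age(text):
    """Hypothetical interface under test: parse an age between 0 and 150."""
    if not isinstance(text, str):
        raise TypeError("age must be given as a string")
    value = int(text.strip())  # raises ValueError for non-numeric text
    if value < 0 or value > 150:
        raise ValueError("age out of range")
    return value


def robustness_test(func, invalid_inputs, accepted_errors=(TypeError, ValueError)):
    """Call func with each invalid input and classify the observed behavior."""
    outcomes = {}
    for arg in invalid_inputs:
        try:
            func(arg)
            outcomes[repr(arg)] = "silent"       # invalid input accepted without complaint
        except accepted_errors:
            outcomes[repr(arg)] = "rejected"     # controlled error: robust behavior
        except Exception as exc:                 # uncontrolled error: robustness problem
            outcomes[repr(arg)] = "abort: " + type(exc).__name__
    return outcomes


for arg, outcome in robustness_test(parse_age, [None, "", "abc", "-5", 3.7, "9999"]).items():
    print(arg, "->", outcome)
```

An interface that never checks its argument type would show up in the report with "abort" outcomes instead of "rejected".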
Resilience
• Resilience ≈ dependability + robustness
Resilience: the persistence of service delivery that can justifiably be trusted, when facing changes (Laprie)
• Resilience considers changes lato sensu. That is, changes include all sorts of upsets:
– Hardware and software faults
– Malicious attacks
– Configuration changes
– Software and hardware upgrades
– Etc…

Dependability (and Resilience)

Attributes, Means, and Threats
Key non-functional attributes of software systems
Questions

• Explain the differences among fault prevention, fault removal, fault
tolerance and fault forecasting and list the four techniques by order of
frequency of utilization by the software industry (put in first place the
one that is used more intensively).


Different means to solve or mitigate the effect of the threats

The problems that may damage dependability

Questions

• When what is visible to end-users is a deviation from the specified or expected behavior, this is called: a) an error; b) a fault; c) a failure; d) a defect; e) a mistake.
Faults, Errors, and Failures
Attention: this is the terminology from a dependability view
Fault → Error → Failure
• Fault: root cause
• Error: erroneous change in the state of the system
• Failure: incorrect component/system response
A system is made out of components. Each component is a system on its own. The notion of failure reflects this.

Faults, Errors, and Failures
• Correct service is delivered when the service implements the expected system function.
• Service failure is an event that occurs when the delivered service deviates from correct service.
• Failure is a transition from correct service to incorrect service.
• Restoration is the transition from incorrect service to correct service.
[Figure: two states, correct service and incorrect service, linked by the failure and restoration transitions]
Terminology (dependability view)
• Error - A measure of the difference between the actual and the ideal.
• Fault - A condition that may cause a system to fail in performing its required function.
• Failure - The inability of a system or component to perform a required function according to its specifications.

Other terminology (software reliability view)
• Error - Human action that results in software containing a fault
• Fault - A cause for an internal error
• Failure - Any observable divergence of software behavior in execution from user needs
Error (human) → may cause → Fault → may cause → Failure
No other option but learning how to deal with/understand different terminologies…

• Hardware faults
• Software faults
• Environment faults
• Human faults
• …

The first “bug”
Harvard University Mark II Aiken Relay Calculator
“On the 9th of September, 1947, when the machine was experiencing problems, an investigation showed that there was a moth trapped between the points of Relay #70, in Panel F. The operators removed the moth and affixed it to the log. The entry reads: "First actual case of bug being found."”
http://www.jamesshuggins.com/h/tek1/first_computer_bug.htm
Questions

• In your opinion, can the concept of permanent and transient faults, used for hardware faults, also be applied to software bugs?

What is a software fault?

Residual(?) software faults (bugs): faults originating from defects in the design or implementation of software components and their integration in a system, that escape testing and other fault avoidance methods
Software development process (in theory...)

Requirements
Specification
Design
Code development
Test
Deployment
Correctness from the end user point of view

Different types of software faults

• In complex systems, the failures caused by software bugs may appear in different ways, defining a first broad classification of software faults (bugs):
• Bohrbugs
– Bugs that cause failures deterministically
– Easiest to find during testing
– Fault tolerance → design diversity and redundancy
• Mandelbugs
– Re-execution after a failure caused by a Mandelbug will generally not cause another failure
– Very difficult to find and correct during testing
– Fault tolerance → simple retries and recovery-oriented computing using checkpointing
• Aging-related
– Bugs tend to be activated and cause failures after long periods of system run-time
– Difficult to find during testing (but static code analysis is effective for some of them)
– Fault tolerance → software rejuvenation
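The "simple retries" strategy mentioned for Mandelbugs can be sketched as follows. This is only an illustration under the assumption that a failure shows up as a raised exception; `call_with_retries` and `flaky_operation` are hypothetical names.

```python
def call_with_retries(operation, max_attempts=3):
    """Re-execute operation until it succeeds or max_attempts is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # assumption: a failure surfaces as an exception
            last_error = exc
    raise RuntimeError(f"operation failed after {max_attempts} attempts") from last_error


# Hypothetical Mandelbug-like behavior: the first two executions fail, the third works.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient failure")
    return "ok"


print(call_with_retries(flaky_operation))
```

Retrying does not help against a Bohrbug: a deterministic failure repeats on every attempt, which is why design diversity is listed for that class instead.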

Software faults: a persistent problem
• Software reliability is mainly based on fault avoidance using good software engineering methodologies
• In real systems (i.e., not toys) → fault avoidance not successful → fault tolerance is needed, unless the impact of failures is acceptable.
• Rule of thumb for fault density in software (Rome labs, USA)
– 10-50 faults per 1,000 lines of code → for good software
– 1-5 faults per 1,000 lines of code → for critical applications using highly mature software development methods and having intensive testing
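The rule of thumb translates directly into an estimate of the number of residual faults for a given code size. A minimal sketch, where the function name and the 100,000-line program are hypothetical:

```python
def expected_faults(lines_of_code, low_per_kloc, high_per_kloc):
    """Return the (low, high) estimate of faults for a program of the given size."""
    kloc = lines_of_code / 1000
    return kloc * low_per_kloc, kloc * high_per_kloc


size = 100_000  # hypothetical program size in lines of code
print("good software:        ", expected_faults(size, 10, 50))
print("critical applications:", expected_faults(size, 1, 5))
```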
Software faults: a persistent problem
Fault avoidance relies on:
• SW development methodologies
• Static analysis tools
• Software inspections
• Model checking
• Testing, testing, testing
• Verification and validation
• …

Size of software: examples
Half a million software bugs? (using conservative bug density statistics)
From Rich Rogers, https://twitter.com/richrogersiot/status/958112741218111489
Linux kernel size: another example
696212 patches since April 16, 2006

Classification of faults
• Caused by what?
– Physical faults
– Human-Made faults
• Why?
– Accidental faults
– Intentional non-malicious faults / Intentional malicious faults
• When?
– Development faults: design, coding, configuration, upgrading
– Operational faults: in use or maintenance (operation faults, interaction faults, configuration faults, ...)
• Where (with respect to the system)?
– Internal faults
– External faults
• How long?
– Permanent faults
– Transient faults
Classification of faults (more detailed view)

Software faults:
the main cause of computer failures
• Software faults (i.e., defects or bugs) are the major cause of computer failures.
• The increasing complexity of software, the pressure to shrink time-to-market, and the high cost of software testing contribute to keeping bugs as the main cause of computer failures.
• Many failure reports are available on the Internet: http://www.teach-ict.com/news/news_stories/news_computer_failures.htm
• They cost thousands of millions of euros every year (occasionally software bugs cost human lives)
Software faults
Alfred Z. Spector (Google Research) wrote a paper 30 years ago, when he was president of a company called Transarc Corporation, comparing bridge building to software development:
“Bridges are normally built on-time, on-budget, and do not fall down. On the other hand, software never comes in on-time or on-budget. In addition, it always breaks down.”
Things are not that different today…

Software faults
Alfred Z. Spector has an optimistic view on bridges…
Bugs (once found…) are oddly simple
Some examples (there are many failure reports on the Internet):
• Project Mercury’s FORTRAN code had the following fault: DO I=1.10 instead of ... DO I=1,10
• An F-18 crashed because of a missing exception condition: if ... then ... without the else clause for a case that was thought could not possibly arise.
• In simulation, an F-16 program bug caused the virtual plane to flip over whenever it crossed the equator, as a result of a missing minus sign to indicate south latitude.
• The Bank of New York (BoNY) had a $32 billion overdraft as the result of a 16-bit integer counter that went unchecked.
Examples taken from Spiros Mancoridis slides

What do you need to write correct code (or do anything) correctly and efficiently?
• Attention
• Knowledge
• Practice (skill)
• Motivation
Software fault types distribution

[Figure: number of faults per fault type, highlighting the top N most common software fault types]

The “Top-N” software fault types

Fault types | Perc. observed in field study | ODC classes
Missing "If (cond) { statement(s) }" | 9.96 % | Algorithm
Missing function call | 8.64 % | Algorithm
Missing "AND EXPR" in expression used as branch condition | 7.89 % | Checking
Missing "if (cond)" surrounding statement(s) | 4.32 % | Checking
Missing small and localized part of the algorithm | 3.19 % | Algorithm
Missing variable assignment using an expression | 3.00 % | Assignment
Wrong logical expression used as branch condition | 3.00 % | Checking
Wrong value assigned to a variable | 2.44 % | Assignment
Missing variable initialization | 2.25 % | Assignment
Missing variable assignment using a value | 2.25 % | Assignment
Wrong arithmetic expression used in parameter of function call | 2.25 % | Interface
Wrong variable used in parameter of function call | 1.50 % | Interface
Total faults coverage | 50.69 % |
Results obtained from the analysis of 650 real software faults in real (already deployed) programs.
“Emulation of software faults: A field data study and a practical approach”, IEEE Transactions on Software Engineering 32 (11), 849-867, 2006.
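To make one of these fault types concrete, the sketch below shows a hypothetical instance of the 'Missing "if (cond)" surrounding statement(s)' type. The functions are invented for illustration and are not taken from the field study.

```python
def average_faulty(values):
    # Fault: the "if" that should guard the division is missing,
    # so an empty list activates the fault and the call fails.
    return sum(values) / len(values)


def average_fixed(values):
    # Corrected version: the missing "if (cond)" is in place.
    if not values:
        return 0.0
    return sum(values) / len(values)


print(average_fixed([2, 4, 6]))
print(average_fixed([]))
try:
    average_faulty([])
except ZeroDivisionError as exc:
    print("failure caused by the missing check:", exc)
```

The fault stays dormant for every non-empty input, which illustrates how such a trivial omission can escape testing.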
The “Top-N” software fault types
All these bug types are incredibly trivial… but are the reality.

Questions

• Consider you have a program to calculate the list of occurrences of Fridays on the 13th day of the month (these days are considered days of bad luck by superstitious people) for the next 20 years. Give four examples of failure modes that may happen when running that program.

Components/systems may fail arbitrarily
Failures such as clean crashes (i.e., stop sending outputs) are relatively rare

Computer failures
• Computer failures are normally complex, as a result of the high complexity of systems.
• Simple failure modes such as pure silent failures (clean crash or halt failures) are relatively rare.
• Failure mode – condensed description of the way a component/computer system fails
• Critical systems assume that components/systems may fail in arbitrary failure modes (i.e., the erroneous behavior related to a failure can be of any type, including the worst cases possible).
Questions

• If a program performs correct calculations (i.e., the result is correct), can you still claim that such result represents a failure? If your answer is yes, give two examples.

Failure modes – Koopman’s CRASH scale
• Mainly from the operating system perspective
• CRASH – Catastrophic, Restart, Abort, Silent, Hindering
– Catastrophic (OS crashes/multiple tasks affected)
– Restart (task/process hangs, requiring restart)
– Abort (task/process aborts, e.g., segmentation violation)
– Silent (no error code returned when one should be)
– Hindering (incorrect error code returned)
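The CRASH scale can be sketched as a small classification routine. The `Observation` fields and the `classify` function are hypothetical names chosen for this illustration; only the five categories come from the scale above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Crash(Enum):
    CATASTROPHIC = "OS crashes / multiple tasks affected"
    RESTART = "task/process hangs, requiring restart"
    ABORT = "task/process aborts"
    SILENT = "no error code returned when one should be"
    HINDERING = "incorrect error code returned"


@dataclass
class Observation:
    """Hypothetical record of what was observed after one test call."""
    os_crashed: bool = False
    task_hung: bool = False
    task_aborted: bool = False
    expected_error: Optional[int] = None  # error code the call should return, if any
    returned_error: Optional[int] = None  # error code actually returned, if any


def classify(obs: Observation) -> Optional[Crash]:
    """Map an observation to a CRASH category, or None if no failure is observed."""
    if obs.os_crashed:
        return Crash.CATASTROPHIC
    if obs.task_hung:
        return Crash.RESTART
    if obs.task_aborted:
        return Crash.ABORT
    if obs.expected_error is not None and obs.returned_error is None:
        return Crash.SILENT
    if obs.expected_error is not None and obs.returned_error != obs.expected_error:
        return Crash.HINDERING
    return None


print(classify(Observation(task_aborted=True)))
print(classify(Observation(expected_error=22)))
```

The checks are ordered from the most to the least severe category, so an observation matching several conditions is reported under the worst one.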
Failures classification


Dependability means
Two different perspectives with strong technical implications:
• Fault Prevention techniques: prevent the occurrence of faults
– Improve development process to avoid/minimize faults
– Use selected technologies (better components, certified software tools, etc.)
• Fault Tolerance techniques: to provide correct service in presence of faults
– Triple modular redundancy, N-Version programming, checkpointing and recovery, etc.

• Fault Removal techniques: specific techniques to reduce the presence of faults (number, seriousness, ...)
– Development: regression and non-regression testing, static and dynamic verification, etc.
– Operation: preventive maintenance such as patches, updates, SW rejuvenation, etc.
• Fault Forecasting techniques: to estimate the present number, the future incidence, and the consequences of faults
– Probabilistic assessment, modeling, operational evaluation, …
Questions
• If you decide to execute the same program with the same input parameters several times, using the same hardware, and vote for majority the results obtained in the different runs, what kind of redundancy are you using?

Fault prevention techniques
• Fault prevention techniques are intended to keep faults out of the system. Applied at the design stage.
• Related to general system engineering techniques (design methodologies, construction rules, use of highly reliable components). Include:
• Rigid software development process
• Formal methods
Fault prevention techniques:

V-model example

Dependability means diagram (Laprie)
Error masking
Fault tolerance techniques
Fault → Error → Failure
Estimated using fault forecasting techniques
Fault tolerant mechanisms


Fault tolerance
Ability of the system to deliver a correct service after the occurrence of faults.
• Why fault tolerance techniques?
Even with the most careful fault avoidance, faults will eventually occur and may result in a system failure
• Fault tolerance techniques:
Carried out via error detection and system recovery, and use redundancy to counteract the effects of faults
• Protective redundancy:
Additional components or processes that mask or correct errors or faults inside a system so they do not become observable failures in its service

Organisation of fault tolerance
• Possible phases in response to fault manifestation

– Error detection
– Damage containment
– Damage assessment/diagnosis
– Reconfiguration
– Error recovery / restart
– Fault treatment / repair / reintegration
Fault tolerant techniques diagram

Replication
• Replication provides:
– Two or more copies of items that may be corrupted by a fault
– A mechanism that compares them and declares an error if they
differ
• Examples:
– Duplicated circuitry
– Transmit messages twice
– Store data in two separate places (e.g. mirrored disks)
– …
• Assumption: the two copies must be unlikely to be corrupted together and in the same way
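The two ingredients of replication listed above (copies plus a comparison mechanism) can be sketched with an in-memory stand-in for mirrored storage. `MirroredStore` and `ReplicaMismatch` are hypothetical names; a real system would keep the copies on separate devices so that they are unlikely to be corrupted together.

```python
class ReplicaMismatch(Exception):
    """Raised when the two copies of an item differ."""


class MirroredStore:
    """Keeps two copies of every item and compares them on each read."""

    def __init__(self):
        self._primary = {}
        self._mirror = {}

    def write(self, key, value):
        self._primary[key] = value
        self._mirror[key] = value

    def read(self, key):
        first, second = self._primary[key], self._mirror[key]
        if first != second:  # comparison mechanism: declare an error
            raise ReplicaMismatch(f"copies of {key!r} differ")
        return first


store = MirroredStore()
store.write("balance", 100)
print(store.read("balance"))

store._mirror["balance"] = 999  # simulate a fault corrupting one copy
try:
    store.read("balance")
except ReplicaMismatch as exc:
    print("error detected:", exc)
```

With only two copies the error is detected but not corrected, because nothing indicates which copy is the good one.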
Replication
Replication may have a very important impact on a system in the area of performance, size, weight, power consumption and others

Redundancy
Redundancy is the ingredient of Replication
• Hardware redundancy
Physical replication of HW (the most common form of redundancy). The cost
of replicating HW is decreasing rapidly.
BeagleBone Black computer: < 30 €; Raspberry Pi computer: < 20 €
Obviously, these low-end computers cannot solve everything; but the trend is clear.
• Information redundancy
Addition of redundant information to data in order to allow fault detection and
fault masking.
• Time redundancy
Attempt to reduce the amount of extra HW at the expense of using additional
time.
• Software redundancy (more relevant to the course)
Fault detection and fault tolerance implemented in SW

Passive hardware redundancy

TMR - Triple Modular Redundancy
• Passive technique
– Use error masking to hide the occurrence of faults
– Rely upon voting mechanisms to mask the occurrence of faults
– Do not require any action on the part of the system / operator
– Generally do not provide for the detection of faults
[Figure: Module 1, Module 2 and Module 3 feed a Voter, which produces the output]
• Can be applied at any level: computers, processors, memories, general circuitry
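A software version of the TMR voter can be sketched as a majority vote over three module outputs. The function name is hypothetical, and the sketch assumes the outputs are hashable values that can be compared for exact equality.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the value produced by at least two of the three modules."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three module outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three modules disagree")
    return value


# One faulty module (third output) is masked by the other two.
print(tmr_vote([42, 42, 7]))
```

The caller never learns that one module was wrong, which matches the passive nature of the technique: the fault is masked, not reported.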
Passive hardware redundancy: TMR - Triple Modular Redundancy
Voter is critical:
• Single point of failure. Can be partially replicated… but the very end is a single point of failure.
• Difficulties in replica synchronization
• Can be implemented in HW or in SW

Active hardware redundancy

DMR – Dual Modular Redundancy
• Active technique: duplication with comparison scheme
– Two identical pieces of HW are employed
– They perform the same computation in parallel
– When a failure occurs, the two copies are no longer identical and a simple comparison detects the errors
[Figure: Module 1 and Module 2 feed a Comparator, with an output and an error signal]
• No method for determining which component is faulty
• Can be applied at any level: computers, processors, …
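Duplication with comparison can be sketched in software as running two copies of a computation and raising an error signal when they disagree. The names below are hypothetical, and the two modules are modeled as plain functions.

```python
def dmr_compare(module_a, module_b, data):
    """Run both modules on the same data; return (output, error_signal)."""
    out_a = module_a(data)
    out_b = module_b(data)
    error_signal = out_a != out_b  # the comparator only detects the mismatch
    return out_a, error_signal


def healthy(x):
    return x + 1


def faulty(x):
    return x + 2  # hypothetical faulty copy


print(dmr_compare(healthy, healthy, 1))
print(dmr_compare(healthy, faulty, 1))
```

As the slide notes, the error signal says that the copies differ but not which one is wrong, so the returned output cannot be trusted when the signal is raised.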

Standby sparing technique
• One module is operational, while one or more modules
serve as standbys or spares
• When a fault is detected and located, the faulty module is
removed and replaced with one of the spares.
• Hot standby sparing: the spares operate in synchrony with the online modules, and they are prepared to take over at any time
• Cold standby sparing: the spares are unpowered until needed to replace a faulty module
→ Power consumption vs time to perform initialization prior to bringing the module into active service
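Standby sparing can be sketched as one active module plus a list of spares that are switched in when a fault is detected. In this illustration, fault detection is assumed to be an exception raised by the active module; the class and module names are hypothetical.

```python
class StandbySystem:
    """One operational module, with the remaining modules kept as spares."""

    def __init__(self, modules):
        if not modules:
            raise ValueError("at least one module is required")
        self.active = modules[0]
        self.spares = list(modules[1:])

    def run(self, data):
        while True:
            try:
                return self.active(data)
            except Exception:  # assumption: a detected fault surfaces as an exception
                if not self.spares:
                    raise      # no spare left: the failure becomes visible
                self.active = self.spares.pop(0)  # replace the faulty module


def broken_module(x):
    raise RuntimeError("module fault")


def spare_module(x):
    return x * 2


system = StandbySystem([broken_module, spare_module])
print(system.run(3))
```

In this sketch the spare is only initialized (referenced) when needed, which is closer to cold standby; a hot standby would keep the spare executing in parallel.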


Pair-and-a-spare approach
• Two modules operate in parallel at all times and their results are compared to provide the error detection capability
• Reconfiguration process can be viewed as a switch that accepts the modules’ outputs and error reports, and provides the comparator with the output of two modules.
• As long as the two outputs agree, the spares are not used. When a miscompare occurs, the switch uses the error reports from the modules to first identify the faulty module and then select a replacement module. Both modules are replaced if the faulty module is not identified
[Figure: Module 1, Module 2 and Spare feed a Reconfigurator, which feeds a Comparator producing the output and an error signal]
• Hybrid approaches that combine passive and active HW redundancy
• Active approaches that combine HW and SW redundancy
• Many implementation strategies
Information redundancy
• Coding
– Information is represented with more bits than strictly necessary: say, an n-bit information chunk is represented by n + c = m bits
– Among all the possible $2^m$ configurations of the m bits, only $2^n$ represent acceptable values (code words)
– If a non-code word appears, it indicates an error in transmitting, or storing, or retrieving …
– Examples: checksums, error correction and detection, cyclic redundancy check (CRC), …
• Self-checking circuits (or SW operations)
– A circuit that has the ability to automatically detect the existence of errors and the detection occurs during the normal course of its operations.
– Typically obtained using coding techniques (e.g., reverse operation, etc.).
– Examples: built-in self-test (BIST), coded-processors, …
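The coding idea can be sketched with a CRC: c = 32 check bits are appended to the payload, and a frame whose check bits do not match is a non-code word. The sketch uses Python's standard `zlib.crc32`; the `encode`/`decode` names and the 4-byte big-endian framing are choices made for this illustration.

```python
import zlib

CRC_BYTES = 4  # c = 32 redundant bits appended to the payload


def encode(payload: bytes) -> bytes:
    """Append the CRC-32 of the payload, producing a code word."""
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def decode(frame: bytes) -> bytes:
    """Return the payload if the frame is a code word, else raise ValueError."""
    if len(frame) < CRC_BYTES:
        raise ValueError("frame too short")
    payload, received = frame[:-CRC_BYTES], frame[-CRC_BYTES:]
    if zlib.crc32(payload).to_bytes(CRC_BYTES, "big") != received:
        raise ValueError("non-code word: error in transmission or storage")
    return payload


frame = encode(b"sensor=42")
print(decode(frame))

corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit of the frame
try:
    decode(corrupted)
except ValueError as exc:
    print("error detected:", exc)
```

This code only detects errors; correcting them would need an error-correcting code with more redundancy.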

Time redundancy techniques

• Reduce the amount of extra hardware at the expense
of using additional time
– Repetition of computations and comparison of the results to detect errors
– Good for transient faults; no protection against permanent faults
– Problem of guaranteeing the same data when a computation is re-executed (after a
transient fault, system data can be completely corrupted)
– May use a minimum of extra hardware to also detect permanent faults. E.g.,
encode data before executing the second computation
– Examples: re-execution, sending messages twice, …
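[Figure: at time t0 the computation runs on the data and its result is stored; at time t0+d the data is encoded, the computation runs again, and the result is decoded and stored; the two stored results are compared to produce the error signal]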
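
A minimal sketch of why encoding the data before the second execution helps with permanent faults. The fault model (output bit 0 stuck at 1) and all function names are assumptions made for illustration; the encoding used here is a one-bit left shift of each operand.

```python
def healthy_sum(values):
    return sum(values)


def faulty_sum(values):
    # Hypothetical permanent fault: bit 0 of the result is stuck at 1.
    return sum(values) | 1


def agree_plain(compute, data):
    """Plain re-execution: run twice on the same data and compare."""
    return compute(data) == compute(data)


def agree_encoded(compute, data):
    """Second run uses shifted operands; its result is shifted back."""
    first = compute(data)
    encoded = [v << 1 for v in data]   # encode
    second = compute(encoded) >> 1     # compute, then decode
    return first == second
```

With a permanent fault, plain repetition produces the same wrong result twice, so the comparison passes. With shifted operands the stuck bit affects a different bit position of the decoded result, so the two results disagree and the error is detected.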
98
Software redundancy
• Software implemented fault tolerance (SWIFT):

– Management of hardware faults at software level
– Management of faults originating from the design and
implementation of software components (i.e., software bugs)
• Due to the large cost of developing software, most of the
software dependability effort has focused on fault-avoidance
techniques
• The current trend of reducing transistor size to 10 nanometers and
below will increase the interest in SWIFT techniques, especially
in the cloud infrastructure (see International Technology
Roadmap for Semiconductors: http://www.itrs.net/)
99

Software redundancy techniques

• Software redundancy may occur in many forms:
– Add extra lines of code and data structures to:
– Check the acceptable range of variables or magnitude of
signals, or a routine to test a memory by writing and
reading locations, etc.
– Perform error recovery.
– Replicate program modules or the complete program
• Basic techniques:
– Consistency checks
– Capability checks
– Software diversity
– Error detection and recovery
100
Consistency check
• Uses a priori knowledge about the characteristics of
information to verify its correctness.
• Often called assertions (program assertions, executable
assertions)
• Examples:
– Data consistency: check the range of variables, input
parameters, signals, etc.
– Address consistency: check that the addresses generated by the
computer fall within the address range of the available memory.
– Time consistency: check time limits for given operations, such as
timeouts
– Detection of invalid instruction codes (if n bits are used to represent 2^k legal
instruction codes, 2^n - 2^k codes are illegal)
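
The four kinds of check listed above can be sketched as executable assertions. The memory size, the opcode set (n = 3 bits, 2^2 legal codes) and all names below are hypothetical values chosen for the example.

```python
import time

MEMORY_SIZE = 0x1000                    # hypothetical available memory
LEGAL_OPCODES = {0b000, 0b001, 0b010, 0b011}  # 2^2 legal codes in a 3-bit field


class ConsistencyError(Exception):
    """Raised when an executable assertion fails."""


def check_range(name, value, low, high):
    """Data consistency: value must lie in [low, high]."""
    if not low <= value <= high:
        raise ConsistencyError(f"{name}={value} outside [{low}, {high}]")
    return value


def check_address(address):
    """Address consistency: address must fall inside available memory."""
    if not 0 <= address < MEMORY_SIZE:
        raise ConsistencyError(f"address {address:#x} out of range")
    return address


def check_opcode(opcode):
    """Invalid instruction detection: only legal codes are accepted."""
    if opcode not in LEGAL_OPCODES:
        raise ConsistencyError(f"illegal opcode {opcode:#05b}")
    return opcode


def run_with_deadline(operation, limit_s):
    """Time consistency: report operations that exceed their time limit."""
    start = time.monotonic()
    result = operation()
    if time.monotonic() - start > limit_s:
        raise ConsistencyError("operation exceeded its time limit")
    return result
```

Note that `run_with_deadline` only detects the overrun after the operation returns; stopping a hung operation needs an external mechanism such as a watchdog.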
101

Consistency check (cont.)
Watchdog processors
• Sometimes an external mechanism is needed to check assertions
• In general, needs both HW + SW
102
Capability check
• Verify that a system has the expected capability
(hardware testing)
• Examples:
– Is the ALU working well? → Execute specific instructions on
specific data and compare the result with the expected one
(written in a ROM)
– In a multiprocessor, are the processors working properly? Are they
capable of communicating? → Execute testing programs and
compare the results with the expected ones.
– Memory test: write and read some locations
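
A minimal sketch of the memory test, assuming memory can be modelled as an indexable sequence of bytes. `StuckBitMemory` is a hypothetical faulty device used only to show a failing case.

```python
PATTERNS = (0x55, 0xAA)  # alternating bit patterns, complementary to each other


def memory_test(memory):
    """Write and read back each pattern; return the failing addresses.

    The original content of each location is written back afterwards.
    """
    bad = []
    for addr in range(len(memory)):
        saved = memory[addr]
        for pattern in PATTERNS:
            memory[addr] = pattern
            if memory[addr] != pattern:
                bad.append(addr)
                break
        memory[addr] = saved
    return bad


class StuckBitMemory:
    """Hypothetical memory in which bit 0 of one cell is stuck at 0."""

    def __init__(self, size, stuck_addr):
        self.cells = [0] * size
        self.stuck_addr = stuck_addr

    def __len__(self):
        return len(self.cells)

    def __getitem__(self, addr):
        return self.cells[addr]

    def __setitem__(self, addr, value):
        if addr == self.stuck_addr:
            value &= 0xFE  # bit 0 cannot be set
        self.cells[addr] = value
```

Using two complementary patterns means every bit is tested for both stuck-at-0 and stuck-at-1.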
103

Questions
• If a web service crashes when called with a given combination of valid inputs, can you claim that the web service is not
robust? Explain.
• Explain the differences among fault prevention, fault removal, fault tolerance and fault forecasting, and list the four
techniques in order of frequency of use by the software industry (put first the one that is used most
intensively).
• When what is visible to end-users is a deviation from the specified or expected behavior, this is called: a) an error; b) a
fault; c) a failure; d) a defect; e) a mistake.
• In your opinion, can the concepts of permanent and transient faults used for hardware faults also be applied to software
bugs?
• Consider you have a program to calculate the list of occurrences of Fridays on the 13th day of the month (these days are
considered days of bad luck by superstitious people) for the next 20 years. Give four examples of failure modes that may
happen when running that program.
• If a program performs correct calculations (i.e., the result is correct), can you still claim that such a result represents a
failure? If your answer is yes, give two examples.
• If you decide to execute the same program with the same input parameters several times, using the same hardware, and
take a majority vote on the results obtained in the different runs, what kind of redundancy are you using?
• In your opinion, is N-version programming based on error detection or on error masking techniques?
104
Software diversity: N-version programming
• Independently developed versions of design and code from the same set of requirements.
• Technique: independent design teams utilizing different design methodologies, algorithms, compilers, run-time systems and hardware components
• Vote on the N results produced
[Figure: Program Inputs feed Program Version 1, Program Version 2, ..., Program Version N; the Voter produces the Program Outputs]
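
A minimal sketch of N-version execution with a majority voter. The three "versions" are stand-ins written for this example (one of them deliberately wrong); in practice they would be developed by independent teams. Results are assumed to be hashable and exactly comparable.

```python
import math
from collections import Counter


def version_1(x):
    return math.isqrt(x)


def version_2(x):
    return int(math.sqrt(x))


def version_3(x):
    return x // 2  # hypothetical design fault


def n_version_execute(versions, *inputs):
    """Run every version on identical inputs and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(*inputs))
        except Exception:
            pass  # a crashed version casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > len(versions):
            return value
    raise RuntimeError("no majority among the versions")
```

With three versions a single faulty one is outvoted. If two versions fail in the same way (a correlated error), the voter returns the wrong value without any indication of a problem.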
105

N-version programming cons.

• Disadvantages:
– Cost of resources
– Cost of concurrent executions
– Potential source of correlated errors
o The original requirements and specification
o Humans tend to fail in similar modes (social and education commonalities)
– Requirements and specification mistakes are not tolerated (fault avoidance)
• Software voter (it is a technical challenge):

– Not replicated; single point of failure: must be simple and verifiable
– Must assure that the input data vector to each of the versions is identical
– Must receive data from each version in identical formats or make efficient
conversions
– Must implement some sort of communication protocol to wait until all versions
complete their processing or recognize the versions that do not complete
106
Software diversity: N-self-checking programming
• Based on acceptance tests rather than comparison with equivalent versions
• N versions of the program are written
• All versions run simultaneously and each includes its own acceptance tests
• The selection logic chooses the results from one of the programs that passes
the acceptance tests
• Tolerates N-1 faults (independent faults)
• Problem: the coverage of self-checking is far from perfect
[Figure: Program Inputs feed Program Version 1 ... Program Version N, each followed by its Acceptance tests; the Selection Logic produces the Program Outputs]
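
A minimal sketch of the selection logic, using sorting as the example task. The versions are executed one after another here for simplicity, whereas the slide assumes they run simultaneously; the function names and the acceptance test are assumptions for illustration.

```python
from collections import Counter


def is_sorted_permutation(result, data):
    """Acceptance test: result is ordered and has the same elements as data."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(result) == Counter(data)


def n_self_checking(versions, data):
    """versions is a list of (program, acceptance_test) pairs."""
    candidates = []
    for program, acceptance_test in versions:
        try:
            result = program(list(data))
        except Exception:
            continue  # a crashed version offers no result
        if acceptance_test(result, data):
            candidates.append(result)
    if not candidates:
        raise RuntimeError("no version passed its acceptance test")
    return candidates[0]  # selection logic: first accepted result
```

The quality of the outcome depends on the acceptance test: a wrong result that still satisfies the test is accepted, which is the coverage problem noted above.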
107

Error recovery
• Forward recovery: transform the erroneous state into a new state
from which the system can operate
• Backward recovery: bring the system back to a state prior to the
error occurrence
- Checkpointing
- Recovery block
• Backward and forward recovery are not exclusive; they can be
combined if the error persists
108
Forward error recovery
• Requires the assessment of damages caused by the detected
error or by errors propagated before detection (not easy…)
• Usually ad hoc
• Example of application:
– Real-time control systems in which an occasional missed response to a
sensor input is tolerable
– The system can recover by skipping its response to the missed
sensor input.
109

Backward error recovery: checkpointing
• Checkpoint: a copy of the current state for possible use in
rollback.
– May be taken automatically (periodically) or upon request by the program
– Needs to be correct
– Eventually needs to be discarded
– Survival of checkpoint data in the presence of faults is critical: stable storage
• Loss: computation time between the checkpointing and the rollback;
data received during that interval
• Overhead of saving system state
– Important goal: to minimize the amount of state information that must be saved
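
A minimal in-memory sketch of checkpoint and rollback. The class and its state layout are hypothetical; a real checkpoint would be written to stable storage and would try to save only the state that changed.

```python
import copy


class CheckpointedAccumulator:
    """Hypothetical stateful component with checkpoint/rollback support."""

    def __init__(self):
        self.state = {"total": 0, "items": []}
        self._checkpoint = None

    def add(self, value):
        self.state["items"].append(value)
        self.state["total"] += value

    def checkpoint(self):
        # Deep copy so later updates cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint available")
        self.state = copy.deepcopy(self._checkpoint)
```

Everything done between `checkpoint()` and `rollback()` is lost, which corresponds to the loss of computation time and received data mentioned above.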
110
Backward error recovery: checkpointing (cont.)
• Checkpoints need to be correct → when is it guaranteed to take the checkpoints?
• Stable storage: replicated storage + atomic access (either the read/write accesses complete or nothing changes)
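
A rough sketch of the two stable-storage ingredients on an ordinary file system: each replica is written atomically (temporary file followed by `os.replace`) and carries a SHA-256 digest so that a damaged copy can be recognised on read. The function names and file layout are assumptions; real stable storage would place replicas on independent devices.

```python
import hashlib
import os
import tempfile


def atomic_write(path, data: bytes):
    """Write digest + data so that path holds either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(hashlib.sha256(data).digest() + data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX systems
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def stable_write(paths, data: bytes):
    """Replicated storage: write the same checkpoint to every replica."""
    for path in paths:
        atomic_write(path, data)


def stable_read(paths) -> bytes:
    """Return the first replica whose digest matches its content."""
    for path in paths:
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except OSError:
            continue
        digest, data = raw[:32], raw[32:]
        if hashlib.sha256(data).digest() == digest:
            return data
    raise RuntimeError("all replicas are missing or corrupted")
```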
111

Backward error recovery: transactional systems and databases
• Checkpoint and backward recovery are highly successful in
databases and in transactional systems in general.
• Review:
• What is a transaction?
• What are the basic transaction operations?
• What are the ACID transaction properties?
• What are concurrency control and locking?
• What is the two-phase commit protocol?
112

Recovery block
• Each recovery block contains variables global to the
block that will be automatically checkpointed if they
are altered within the block.
• Upon entry to a recovery block, the primary alternate is executed and
subjected to an acceptance test to detect any error in the result.
- If the test is passed, the block is exited.
- If the test fails or the primary alternate fails to execute, the content of the
recovery cache pertinent to the block is reinstated, and the second alternate is
executed.
- This cycle is repeated until either an alternate is successful or no more
alternates exist. In the latter case an error is reported.
[Figure labels: checkpoint, acceptance test]
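
The cycle described above can be sketched as follows. The state is assumed to be a dictionary, a deep copy stands in for the recovery cache, and the function name is hypothetical.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Run alternates in order until one passes the acceptance test.

    state is a dict that the alternates modify in place.
    """
    checkpoint = copy.deepcopy(state)  # stands in for the recovery cache
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state  # test passed: exit the block
        except Exception:
            pass  # failure to execute is treated like a failed test
        # Reinstate the state saved on entry before trying the next alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("no alternate passed the acceptance test")
```

For example, with a primary alternate that overdraws an account and an acceptance test requiring a non-negative balance, the primary's changes are undone and the secondary runs on the original state.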
113


Recovery block (cont.)
[Figure: Program Inputs feed the Primary Version and Secondary Versions 1 to N-1; an N-to-1 Program Switch passes one result to the Acceptance Tests, whose Test Result drives the switch; the accepted result becomes the Program Outputs]
• A single acceptance test
• Only one single implementation of the program is run at a time
• Combines elements of checkpointing and backup
• Minimizes the information to be backed up
• Releases the programmer from determining which variables should be checkpointed
and when
• The linguistic structure for recovery blocks requires a suitable mechanism for
providing automatic backward error recovery.
114
Error detection
• Structural approach (duplication and comparison)
– Two or more copies of a data item that may be corrupted
– A mechanism that compares them and declares an error if they differ
– The copies must be unlikely to be corrupted together in the same way
• Behavior-based approach
– Execution of checks on the behavior (variables, results, etc.) of the target
system. The checks use a simplified view of the target behavior (behavior
abstraction)
– The detection is done by a separate mechanism called, in general, a watchdog.
In some implementations, the watchdog requires specific hardware.
– Watchdogs can range from simple watchdog timers to complex watchdog
processors
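
A minimal sketch of the simplest behavior-based detector, a watchdog timer. Time is passed in explicitly (in milliseconds) so the example stays deterministic; the class and method names are assumptions, and a real watchdog would run on separate hardware or in a separate thread.

```python
class WatchdogTimer:
    """Declares an error if the monitored task stops reporting progress."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = now_ms

    def kick(self, now_ms):
        """Called by the monitored task to signal that it is still alive."""
        self.last_kick_ms = now_ms

    def expired(self, now_ms):
        """Called by the monitor: True means the task missed its deadline."""
        return now_ms - self.last_kick_ms > self.timeout_ms
```

The behavior abstraction here is very coarse: the watchdog only knows that the task should report within a time limit, not what the task computes.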
115

Error detection effectiveness
• Coverage
– Probability that an error is detected, conditional on its occurrence
• Latency
– Time elapsing between the occurrence of an error and its detection.
• Overhead
– Cost of the error detection. It may include (extra) hardware, software,
memory space, computing time, etc.
• Damage Confinement
– Error propagation path
– The wider the propagation, the more likely that errors will spread outside the
system
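
Coverage is often estimated by injecting errors and counting how many are detected. The sketch below does this exhaustively for a single even-parity bit on an 8-bit word, under the assumption that every 1-bit and 2-bit error is equally likely. The names and the error model are assumptions for illustration only.

```python
from itertools import combinations


def with_parity(word):
    """Append an even-parity bit to the word."""
    return (word << 1) | (bin(word).count("1") % 2)


def error_detected(coded):
    """The parity check fires when the number of 1 bits is odd."""
    return bin(coded).count("1") % 2 == 1


def estimate_coverage(word, bits=9, max_flips=2):
    """Fraction of injected errors (1..max_flips bit flips) that are detected."""
    injected = detected = 0
    for k in range(1, max_flips + 1):
        for positions in combinations(range(bits), k):
            corrupted = with_parity(word)
            for p in positions:
                corrupted ^= 1 << p
            injected += 1
            detected += error_detected(corrupted)
    return detected / injected
```

Under this error model the estimate is 9/45 = 0.2, since all 9 single-bit errors are detected and none of the 36 double-bit errors are. The figure depends entirely on the assumed error model, which is why coverage is defined as conditional on the occurrence of an error.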
116

