
Analysis of Software Artifacts

Departamento de Engenharia Informática, FCTUC

Analysis of Software Artifacts (ASA)


Henrique Madeira,
Departamento de Engenharia Informática
Faculdade de Ciências e Tecnologia da Universidade de Coimbra
2022/2023

Henrique Madeira, Analysis of Software Artifacts, DEI-FCTUC, 2022/2023

Fundamental concepts
of software quality
and software dependability


Two views of software systems


• Functional view
– What the software system does
– Quality is related to the match between the functionalities and the user
needs/expectations

• Non-Functional view
– How the software system does it (features such as performance, security,
reliability, availability, usability, maintainability, and many more)
– Typically known as Quality Attributes of a software system
– Most of them cannot be measured directly
– The biggest technical challenges are in these non-functional attributes


Functional and non-functional requirements

• In software engineering, the functional versus non-functional view appears at the requirements level, that is, at the very beginning of the development process.

• Functional requirements
– Describe what a software system should do
– Function points are a common metric to characterize and assess the size of the software

• Non-functional requirements
– Define constraints (or goals) on how the system will do so
– Include basically everything that is not related to the functional aspects of the
software system

Most common quality attributes
(i.e., non-functional attributes)

• Availability
• Reliability
• Security
• Performance
• Usability
• Maintainability
• Manageability
• Cost

Some obvious observations on many of these properties:
• In general, they make sense at system level, not just at program/application level.
• They depend on both the software and the underlying hardware.
• They depend on architectural design choices and configuration choices.

Most common quality attributes


(i.e., non-functional attributes)
Availability: the ability of the system to be available for service when requested by end-users or other systems.

Directly related to downtime:

Availability = Expected uptime / (Expected uptime + Expected downtime)

Downtime depends on both the failure rate and the time to repair.
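The availability formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures (1000 hours of uptime, 2 hours of downtime) are assumptions, not values from the course.

```python
def availability(expected_uptime, expected_downtime):
    """Availability = uptime / (uptime + downtime), both in the same time unit."""
    total = expected_uptime + expected_downtime
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return expected_uptime / total


def yearly_downtime_hours(avail, hours_per_year=365 * 24):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * hours_per_year


# Hypothetical system: 1000 hours of expected uptime for every 2 hours of downtime.
a = availability(expected_uptime=1000.0, expected_downtime=2.0)
print(f"availability = {a:.4%}")
print(f"downtime per year = {yearly_downtime_hours(a):.1f} h")
```

Reducing either the failure rate (more uptime between failures) or the time to repair (less downtime per failure) raises the result, which is the point made on the slide.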

Most common quality attributes


(i.e., non-functional attributes)
Reliability: the ability of a system to perform its functions under stated conditions for a specific period of time. It is expressed as a probability.

Often referred to as a relative property expressed by the mean time between failures (MTBF). For example, the system has an MTBF of 12 days.
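To connect the two statements above (reliability as a probability, and MTBF as a figure of merit), here is a minimal sketch. It assumes a constant failure rate, i.e. the exponential model R(t) = exp(-t / MTBF); that model is an assumption added for illustration and is not stated in the slides.

```python
import math


def reliability(mission_time, mtbf):
    """Probability of surviving `mission_time` without failure.

    Assumes a constant failure rate (exponential model); both arguments
    must use the same time unit.
    """
    if mtbf <= 0 or mission_time < 0:
        raise ValueError("mtbf must be positive and mission_time non-negative")
    return math.exp(-mission_time / mtbf)


# MTBF of 12 days, as in the example above.
for days in (1, 12, 30):
    print(f"R({days} days) = {reliability(days, 12):.3f}")
```

Under this model, the probability of running for a full MTBF without any failure is exp(-1), roughly 37%, so an MTBF of 12 days does not mean the system is expected to run 12 days failure-free.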

Most common quality attributes


(i.e., non-functional attributes)
Security: a composite of three properties:
• Confidentiality: the absence of unauthorized disclosure of information.
• Integrity: the absence of improper system alteration.
• Availability: the readiness of the system to provide correct service (as seen before).

Security is related to intentional or malicious actions against the system.

Systems can be compromised when there is a vulnerability + an attack (that exploits the vulnerability).

Most common quality attributes


(i.e., non-functional attributes)
Performance: concerns the speed of operation of a system. Measured by two metrics:
• Response time: how quickly the system reacts to an input from the user or from another system.
• Throughput: how much work (i.e., operations) the system can accomplish within a specified amount of time.

Scalability: often associated with performance; expresses the ability of a system to handle a growing amount of work, or its ability to be enlarged to accommodate that growth.
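The two metrics can be measured with a very small harness. The sketch below is illustrative only: `measure` and `sample_operation` are hypothetical names, and a real benchmark would also need warm-up runs and percentile reporting.

```python
import time


def measure(operation, repetitions=1000):
    """Return (mean response time in seconds, throughput in operations/second)."""
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    response_times = []
    start = time.perf_counter()
    for _ in range(repetitions):
        t0 = time.perf_counter()
        operation()
        response_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    mean_response = sum(response_times) / repetitions
    throughput = repetitions / elapsed
    return mean_response, throughput


def sample_operation():
    """Hypothetical unit of work standing in for one request."""
    sum(range(10_000))


mean_rt, tput = measure(sample_operation, repetitions=200)
print(f"mean response time: {mean_rt * 1e6:.1f} us, throughput: {tput:.0f} ops/s")
```

Response time is observed per operation, while throughput is observed over the whole run; the two are related but not interchangeable, especially once operations run concurrently.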

Most common quality attributes


(i.e., non-functional attributes)
Usability: the ease of use and learning of a software system. It is composed of several attributes (adapted from Jakob Nielsen & Ben Shneiderman):
• Learnability: how easy is it for users to accomplish basic tasks for the first time?
• Efficiency: once users have learned the system, how quickly can they perform tasks?
• Memorability: when users return to the system after a period of not using it, how easily can they reestablish proficiency?
• Errors: how many errors do users make, how severe are these errors, and how easily can they recover from the errors?
• Satisfaction: how pleasant is it to use the system?

Most common quality attributes


(i.e., non-functional attributes)
Maintainability: the ease with which a software system can be changed in order to address several situations, such as:
• correcting defects;
• installing upgrades;
• adding new features to face new requirements;
• ...

Most common quality attributes


(i.e., non-functional attributes)
Manageability: how easily the system can be administered by support staff. May include a variety of actions such as:
• user management;
• configuration management;
• error correction;
• logging;
• ...

Most common quality attributes


(i.e., non-functional attributes)
Cost: the amount of resources invested in the system. At design time it is just an estimate.

Often referred to as total cost of ownership (TCO), which includes the estimation of the direct and indirect costs (staff salaries, energy cost, facilities, etc.) of a system:
• initial development of the system;
• installation;
• cost of modifications;
• cost of keeping the system running.

Questions

• Explain why availability is an attribute of security. Give examples.


Dependability: an integrative concept


(more on concepts & terminology)

• Dependability: “delivery of service that can justifiably be trusted, thus avoidance of failures that are unacceptably frequent or severe” (J.-C. Laprie)

• Includes the following system attributes:
– Availability: readiness for correct service
– Reliability: continuity of correct service
– Safety: absence of catastrophic consequences on the user(s) and the
environment
– Confidentiality: the absence of unauthorized disclosure of information
– Integrity: absence of improper system alteration
– Maintainability: ability for a process to undergo modifications and repairs


• Security attributes (among the dependability attributes listed above): availability, confidentiality, and integrity.


• Slide callout on the same attribute list: “Mission critical system”.


Questions

• If a web service crashes when called with a given combination of valid inputs, can you claim that the web service is not robust? Explain.


Robustness
(more on concepts & terminology)

• Robustness: “a software system can be said to be robust if it retains its ability to deliver service in conditions which are beyond its normal domain of operation” (Laprie)

• Robustness is very often evaluated by testing software interfaces such as system calls, APIs, web services, etc. This is called robustness testing:
– In this context, robustness is defined as “the degree to which a system or component can function correctly in the presence of invalid inputs” [IEEE90]
– Experimental studies (Phil Koopman) show that approximately 15% of the OS system calls (Linux, Unix, Windows) crash when called with invalid input parameters.
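A robustness test in the sense above feeds invalid inputs to an interface and observes how it reacts. The following is a minimal sketch, not a real robustness-testing tool: `parse_percentage` is a hypothetical function under test, and the three outcome labels are chosen here only for illustration.

```python
def parse_percentage(text):
    """Hypothetical function under test: parse a string holding 0..100."""
    value = int(text.strip())  # AttributeError if text is not a string
    if value > 100:
        raise ValueError("percentage above 100")
    return value  # note: negative values are not checked


def robustness_test(function, invalid_inputs, accepted_errors=(ValueError, TypeError)):
    """Call `function` with each invalid input and classify the outcome."""
    report = {"rejected": [], "crashed": [], "silent": []}
    for value in invalid_inputs:
        try:
            function(value)
        except accepted_errors:
            report["rejected"].append(value)  # clean, documented rejection
        except Exception:
            report["crashed"].append(value)   # unexpected exception type
        else:
            report["silent"].append(value)    # invalid input accepted silently
    return report


report = robustness_test(parse_percentage, ["abc", "250", None, "-5"])
for outcome, inputs in report.items():
    print(outcome, inputs)
```

Only the "rejected" outcome counts as robust behaviour here; both the unexpected exception and the silently accepted invalid value are robustness problems.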

Resilience
(more on concepts & terminology)

• Resilience ≈ dependability + robustness

Resilience: the persistence of service delivery that can justifiably be trusted, when facing changes (Laprie)

• Resilience considers changes lato sensu. That is, changes include all sorts of upsets:
– Hardware and software faults
– Malicious attacks
– Configuration changes
– Software and hardware upgrades
– Etc.

Dependability (and Resilience)
Attributes, Means, and Threats

Attributes: key non-functional attributes of software systems


Questions

• Explain the differences among fault prevention, fault removal, fault tolerance and fault forecasting, and list the four techniques in order of frequency of use by the software industry (putting first the one that is used most intensively).


Dependability (and Resilience)
Attributes, Means, and Threats

Means: different means to solve or mitigate the effect of the threats


Dependability (and Resilience)
Attributes, Means, and Threats

Threats: the problems that may damage dependability


Questions

• When what is visible to end-users is a deviation from the specified or expected behavior, this is called: a) an error; b) a fault; c) a failure; d) a defect; e) a mistake.


Faults, Errors, and Failures

Attention: this is the terminology from a dependability view

Fault → Error → Failure
• Fault: root cause
• Error: erroneous change in the state of the system
• Failure: incorrect component/system response

A system is made out of components. Each component is a system in its own right. The notion of failure reflects this.
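The chain from fault to error to failure can be made concrete with a tiny, deliberately faulty function. Everything in this sketch (the function `mean_clamped`, its clamping range, the sample inputs) is invented for illustration.

```python
def mean_clamped(values, lo=0.0, hi=10.0):
    """Mean of `values`, clamped to [lo, hi]. Contains a seeded fault."""
    count = 3                      # FAULT: should be len(values)
    mean = sum(values) / count     # ERROR: wrong internal value when len(values) != 3
    return max(lo, min(hi, mean))  # the output stage may mask the error


# Fault dormant: the list has exactly 3 items, so no error arises.
print(mean_clamped([1, 2, 3]))   # correct mean is 2.0

# Fault activated, error masked: the internal mean is wrong (20 instead of 30),
# but clamping to 10 hides it, so no failure is visible at the interface.
print(mean_clamped([20, 40]))    # correct clamped result is 10.0

# Fault activated, error propagates to the output: this is a failure.
print(mean_clamped([2, 4]))      # correct mean is 3.0
```

The same fault therefore produces no error, a masked error, or a visible failure depending on the input, which is why a fault can stay in deployed code for a long time before anyone observes a failure.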

Faults, Errors, and Failures

• Correct service is delivered when the service implements the expected system function.
• Service failure is an event that occurs when the delivered service deviates from correct service.
• Failure is a transition from correct service to incorrect service.
• Restoration is the transition from incorrect service to correct service.

[State diagram: correct service → (failure) → incorrect service; incorrect service → (restoration) → correct service]

Terminology (dependability view)

• Error - A measure of the difference between the actual and the ideal.

• Fault - A condition that may cause a system to fail in performing its required function.

• Failure - The inability of a system or component to perform a required function according to its specifications.


Other terminology (software reliability view)

• Error - Human action that results in software containing a fault

• Fault - A cause for an internal error

• Failure - Any observable divergence of software behavior in execution from user needs

Error (human) → may cause → Fault → may cause → Failure

There is no other option but to learn how to deal with and understand the different terminologies…

Dependability (and Resilience)


Attributes, Means, and Threats

• Hardware faults
• Software faults
• Environment faults
• Human faults
• …

The first “bug”

Harvard University Mark II Aiken Relay Calculator

“On the 9th of September, 1947, when the machine was experiencing problems, an investigation showed that there was a moth trapped between the points of Relay #70, in Panel F. The operators removed the moth and affixed it to the log. The entry reads: "First actual case of bug being found."”
http://www.jamesshuggins.com/h/tek1/first_computer_bug.htm

Questions

• In your opinion, can the concept of permanent and transient faults, used for hardware faults, also be applied to software bugs?


What is a software fault?


Residual(?) software faults (bugs) originate from defects in the design or implementation of software components and their integration in a system, and escape testing and other fault avoidance methods.

Software development process (in theory...)
• Requirements
• Specification
• Design
• Code development
• Test
• Deployment

Correctness from the end user's point of view

Different types of software faults


• In complex systems, the failures caused by software bugs may appear in different ways, which defines a first broad classification of software faults (bugs):
• Bohrbugs
– Bugs that cause failures deterministically
– Easiest to find during testing
– Fault tolerance → design diversity and redundancy
• Mandelbugs
– Re-execution after a failure caused by a Mandelbug will generally not cause another failure
– Very difficult to find and correct during testing
– Fault tolerance → simple retries and recovery-oriented computing using checkpointing
• Aging-related bugs
– Bugs tend to be activated and cause failures after long periods of system run-time
– Difficult to find during testing (but static code analysis is effective for some of them)
– Fault tolerance → software rejuvenation
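The "simple retries" strategy mentioned for Mandelbugs can be sketched as follows. `FlakyOperation` is a hypothetical stand-in for an operation affected by a Mandelbug (it fails only on its first few calls), and `call_with_retries` is an illustrative helper, not a library function.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=0.0):
    """Re-execute `operation` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # optional pause before re-execution
    raise last_error


class FlakyOperation:
    """Hypothetical operation that fails on its first `failures` calls only."""

    def __init__(self, failures):
        self.remaining_failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("transient failure")
        return "ok"


operation = FlakyOperation(failures=2)
print(call_with_retries(operation, attempts=3), "after", operation.calls, "calls")
```

Retrying does nothing for a Bohrbug, since the same input deterministically triggers the same failure; that is why the slide pairs Bohrbugs with design diversity and redundancy instead.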


Software faults: a persistent problem

• Software reliability is mainly based on fault avoidance using good software engineering methodologies

• In real systems (i.e., not toys) → fault avoidance not successful → fault tolerance is needed, unless the impact of failures is acceptable.

• Rule of thumb for fault density in software (Rome Labs, USA)
– 10-50 faults per 1,000 lines of code → for good software
– 1-5 faults per 1,000 lines of code → for critical applications using highly mature software development methods and having intensive testing
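Applying the rule of thumb above is simple arithmetic. The sketch below uses the two density ranges quoted on the slide; the function name and the one-million-line example are assumptions made for illustration.

```python
# Fault density ranges from the rule of thumb above (faults per 1,000 lines of code).
GOOD_SOFTWARE = (10, 50)
CRITICAL_SOFTWARE = (1, 5)


def estimated_faults(lines_of_code, faults_per_kloc_low, faults_per_kloc_high):
    """Return the (low, high) estimate of residual faults for a code base."""
    kloc = lines_of_code / 1000
    return kloc * faults_per_kloc_low, kloc * faults_per_kloc_high


# Hypothetical code base of one million lines.
for label, density in (("good", GOOD_SOFTWARE), ("critical", CRITICAL_SOFTWARE)):
    low, high = estimated_faults(1_000_000, *density)
    print(f"{label} software: between {low:.0f} and {high:.0f} faults")
```

Even at the density quoted for critical applications, a large code base is expected to contain thousands of faults, which is the motivation for fault tolerance.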


Software faults: a persistent problem

Good software engineering methodologies for fault avoidance include:
• SW development methodologies
• Static analysis tools
• Software inspections
• Model checking
• Testing, testing, testing
• Verification and validation
• …


Size of software: examples

Half a million software bugs?
(using conservative bug density statistics)

From Rich Rogers, https://twitter.com/richrogersiot/status/958112741218111489



Linux kernel size: another example

696212 patches since April 16, 2006


Classification of faults
• Caused by what?
– Physical faults
– Human-Made faults
• Why?
– Accidental faults
– Intentional non-malicious faults / Intentional malicious faults
• When?
– Development faults: design, coding, configuration, upgrading
– Operational faults: in use or maintenance (operation faults, interaction faults, configuration faults, ...)
• Where (with respect to the system)?
– Internal faults
– External faults
• How long?
– Permanent faults
– Transient faults
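The five classification questions above can be captured as a small data structure, with one enumeration per question. This is only a sketch: the class and member names are invented here, and the sample classification of a worn-out hardware component is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):          # Caused by what?
    PHYSICAL = "physical"
    HUMAN_MADE = "human-made"


class Intent(Enum):         # Why?
    ACCIDENTAL = "accidental"
    INTENTIONAL_NON_MALICIOUS = "intentional non-malicious"
    INTENTIONAL_MALICIOUS = "intentional malicious"


class Phase(Enum):          # When?
    DEVELOPMENT = "development"
    OPERATIONAL = "operational"


class Boundary(Enum):       # Where (with respect to the system)?
    INTERNAL = "internal"
    EXTERNAL = "external"


class Persistence(Enum):    # How long?
    PERMANENT = "permanent"
    TRANSIENT = "transient"


@dataclass(frozen=True)
class FaultClass:
    cause: Cause
    intent: Intent
    phase: Phase
    boundary: Boundary
    persistence: Persistence

    def describe(self):
        parts = (self.cause, self.intent, self.phase, self.boundary, self.persistence)
        return ", ".join(part.value for part in parts)


# Assumed example: a hardware component that wears out during operation.
worn_out_component = FaultClass(
    Cause.PHYSICAL, Intent.ACCIDENTAL, Phase.OPERATIONAL,
    Boundary.INTERNAL, Persistence.PERMANENT,
)
print(worn_out_component.describe())
```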

[Figure: classification of faults (more detailed view)]


Software faults:
the main cause of computer failures
• Software faults (i.e., defects or bugs) are the major cause of computer failures.
• The increasing complexity of software, the pressure to shrink time-to-market, and the high cost of software testing contribute to keeping bugs as the main cause of computer failures.
• Many failure reports are available on the Internet: http://www.teach-ict.com/news/news_stories/news_computer_failures.htm
• They cost thousands of millions of euros every year (occasionally, software bugs cost human lives).

Software faults

Alfred Z. Spector (Google Research) wrote a paper 30 years ago, when he was president of a company called Transarc Corporation, comparing bridge building to software development:
“Bridges are normally built on-time, on-budget, and do not fall down. On the other hand, software never comes in on-time or on-budget. In addition, it always breaks down.”
Things are not that different today…


Alfred Z. Spector has an optimistic view on bridges…


Bugs (once found…) are oddly simple
Some examples (there are many failure reports on the Internet):
• Project Mercury’s FORTRAN code had the following fault: DO I=1.10 instead of DO I=1,10
• An F-18 crashed because of a missing exception condition: an if ... then ... without the else clause, for a case that was thought could not possibly arise.
• In simulation, an F-16 program bug caused the virtual plane to flip over whenever it crossed the equator, as a result of a missing minus sign to indicate south latitude.
• The Bank of New York (BoNY) had a $32 billion overdraft as the result of a 16-bit integer counter that went unchecked.
Examples taken from Spiros Mancoridis' slides
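To show how small such a fault can be, the sketch below emulates an unchecked 16-bit unsigned counter. It is not the code of any of the systems mentioned above; the function names are hypothetical, and the bit mask only emulates 16-bit arithmetic because Python integers do not overflow.

```python
MAX_U16 = 0xFFFF  # largest value of an unsigned 16-bit counter (65535)


def increment_unchecked(counter):
    """Emulate a 16-bit increment: silently wraps around to 0 after 65535."""
    return (counter + 1) & MAX_U16


def increment_checked(counter):
    """Same increment, but the overflow condition is detected and reported."""
    if counter >= MAX_U16:
        raise OverflowError("16-bit counter would overflow")
    return counter + 1


print(increment_unchecked(41))       # ordinary case
print(increment_unchecked(MAX_U16))  # wraps around silently
```

The fix is a single condition, which is typical of the examples above: trivial once found, but invisible until the counter actually reaches its limit.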

What do you need to write correct code (or do anything) correctly and efficiently?

• Attention
• Knowledge
• Practice (skill)
• Motivation


Software fault types distribution

[Figure: number of faults (# Faults) per fault type; a second panel highlights the top N most common software fault types.]

The “Top-N” software fault types

Fault type | Perc. observed in field study | ODC class
Missing "If (cond) { statement(s) }" | 9.96 % | Algorithm
Missing function call | 8.64 % | Algorithm
Missing "AND EXPR" in expression used as branch condition | 7.89 % | Checking
Missing "if (cond)" surrounding statement(s) | 4.32 % | Checking
Missing small and localized part of the algorithm | 3.19 % | Algorithm
Missing variable assignment using an expression | 3.00 % | Assignment
Wrong logical expression used as branch condition | 3.00 % | Checking
Wrong value assigned to a variable | 2.44 % | Assignment
Missing variable initialization | 2.25 % | Assignment
Missing variable assignment using a value | 2.25 % | Assignment
Wrong arithmetic expression used in parameter of function call | 2.25 % | Interface
Wrong variable used in parameter of function call | 1.50 % | Interface
Total faults coverage | 50.69 % |

Results obtained from the analysis of 650 real software faults in real (already deployed) programs.
“Emulation of software faults: A field data study and a practical approach”, IEEE Transactions on Software Engineering, 32 (11), 849-867, 2006.
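As an illustration of the most frequent fault type in the table, missing "if (cond) { statement(s) }", the sketch below shows a correct function next to a variant where the whole construct is absent. The banking scenario and function names are invented for this example and do not come from the field study.

```python
def withdraw(balance, amount):
    """Correct version: the guard and its statement are present."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


def withdraw_faulty(balance, amount):
    """Faulty version: the whole "if (cond) { statement(s) }" construct is missing."""
    return balance - amount


# Both versions agree on ordinary inputs, so the fault can survive many tests.
print(withdraw(100, 30), withdraw_faulty(100, 30))

# Only an input that should have triggered the guard exposes the fault.
print(withdraw_faulty(100, 150))
```

The faulty version differs by two missing lines and behaves identically on most inputs, which is consistent with such faults escaping testing.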

All of these bug types are incredibly trivial… but are the reality.

Questions

• Consider you have a program to calculate the list of occurrences of Fridays on the 13th day of the month (these days are considered days of bad luck by superstitious people) for the next 20 years. Give four examples of failure modes that may happen when running that program.


Dependability (and Resilience)


Attributes, Means, and Threats

Components/systems may fail arbitrarily.
Failures such as clean crashes (i.e., stop sending outputs) are relatively rare.


Computer failures

• Computer failures are normally complex, as a result of the high complexity of systems.

• Simple failure modes such as pure silent failures (clean crash or halt failures) are relatively rare.

• Failure mode: a condensed description of the way a component/computer system fails.

• Critical systems assume that components/systems may fail in arbitrary failure modes (i.e., the erroneous behavior related to a failure can be of any type, including the worst cases possible).


Questions

• If a program performs correct calculations (i.e., the result is correct), can you still claim that such a result represents a failure? If your answer is yes, give two examples.


Failure modes – Koopman’s CRASH scale

• Mainly from the operating system perspective

• CRASH – Catastrophic, Restart, Abort, Silent, Hindering

– Catastrophic (OS crashes/multiple tasks affected)

– Restart (task/process hangs, requiring restart)

– Abort (task/process aborts, e.g., segmentation violation)

– Silent (no error code returned when one should be)

– Hindering (incorrect error code returned)


Failures classification


Dependability means
• Fault Prevention techniques: prevent the occurrence of faults
– Improve development process to avoid/minimize faults
– Use selected technologies (better components, certified software tools, etc.)

• Fault Tolerance techniques: to provide correct service in presence of faults
– Triple modular redundancy, N-version programming, checkpointing and recovery, etc.

(Two different perspectives with strong technical implications.)


• Fault Removal techniques: specific techniques to reduce the presence of faults (number, seriousness, ...)
– Development: regression and non-regression testing, static and dynamic verification, etc.
– Operation: preventive maintenance such as patches, updates, SW rejuvenation, etc.

• Fault Forecasting techniques: to estimate the present number, the future incidence, and the consequences of faults
– Probabilistic assessment, modeling, operational evaluation, …

Questions

• If you decide to execute the same program with the same input parameters several times, using the same hardware, and take a majority vote over the results obtained in the different runs, what kind of redundancy are you using?


Fault prevention techniques

• Fault prevention techniques are intended to keep faults out of the system. Applied at the design stage.

• Related to general system engineering techniques (design methodologies, construction rules, use of highly reliable components). Include:
– Rigid software development process
– Formal methods


Fault prevention techniques: V-model example


Dependability means diagram (Laprie)

Error masking


Fault tolerance techniques

Fault → Error → Failure

[Diagram labels: "Estimated using fault forecasting techniques"; "Fault tolerant mechanisms"]


Fault tolerance

Ability of the system to deliver a correct service after the occurrence of faults.

• Why fault tolerance techniques?
Even with the most careful fault avoidance, faults will eventually occur and may result in a system failure

• Fault tolerance techniques:
Carried out via error detection and system recovery, and use redundancy to counteract the effects of faults

• Protective redundancy:
Additional components or processes that mask or correct errors or faults inside a system so they do not become observable failures in its service


Organisation of fault tolerance

• Possible phases in response to fault manifestation


– Error detection
– Damage containment
– Damage assessment/diagnosis
– Reconfiguration
– Error recovery / restart
– Fault treatment / repair / reintegration


Fault tolerant techniques diagram


Replication

• Replication provides:
– Two or more copies of items that may be corrupted by a fault
– A mechanism that compares them and declares an error if they
differ

• Examples:
– Duplicated circuitry
– Transmit messages twice
– Store data in two separate places (e.g. mirrored disks)
– …

• Assumption: the two copies must be unlikely to be corrupted together and in the same way

Replication (cont.)

• Replication may have a very important impact on a system in the area of performance, size, weight, power consumption, and others

Redundancy
Redundancy is the ingredient of Replication
• Hardware redundancy
Physical replication of HW (the most common form of redundancy). The cost
of replicating HW is decreasing rapidly.

Examples: BeagleBone Black computer: < 30 €; Raspberry Pi computer: < 20 €

Obviously, these low-end computers cannot solve everything, but the trend is clear.

• Information redundancy
Addition of redundant information to data in order to allow fault detection and
fault masking.

• Time redundancy
Attempt to reduce the amount of extra HW at the expense of using additional
time.

• Software redundancy (more relevant to the course)
Fault detection and fault tolerance implemented in SW

Passive hardware redundancy


TMR - Triple Modular Redundancy
• Passive technique
– Use error masking to hide the occurrence of faults
– Rely upon voting mechanisms to mask the occurrence of faults
– Do not require any action on the part of the system / operator
– Generally do not provide for the detection of faults

[Diagram: Module 1, Module 2, and Module 3 feed a Voter, which produces the output]

• Can be applied at any level: computers, processors, memories, general circuitry

Passive hardware redundancy: TMR - Triple Modular Redundancy (cont.)

Voter is critical
• Single point of failure. Can be partially replicated… but the very end is a single point of failure.
• Difficulties in replica synchronization
• Can be implemented in HW or in SW
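
The voting step of TMR can be sketched in software as below. This is only an illustration under simple assumptions: the three "modules" are plain Python functions, their outputs are hashable values that can be compared exactly, and the names `majority_vote`, `tmr`, `square`, and `faulty_square` are hypothetical. Replica synchronization and the voter's own reliability are not addressed.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value reaches a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def tmr(modules, x):
    """Run the redundant modules on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


def square(x):
    return x * x


def faulty_square(x):
    # Hypothetical faulty module: always off by one.
    return x * x + 1


# The single faulty module is outvoted, so its error is masked.
print(tmr([square, square, faulty_square], 4))
```

Note that the caller only sees the voted value: the fault is masked, not reported, which matches the remark that passive techniques generally do not provide fault detection.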

Active hardware redundancy: DMR – Dual Modular Redundancy
• Active technique: duplication with comparison scheme
– Two identical pieces of HW are employed
– They perform the same computation in parallel
– When a failure occurs, the two copies are no longer identical and a simple comparison detects the error

[Diagram: Module 1 and Module 2 feed a Comparator; labels: output, error signal]

• No method for determining which component is faulty
• Can be applied at any level: computers, processors, …

Active hardware redundancy: Standby sparing technique
• One module is operational, while one or more modules serve as standbys or spares
• When a fault is detected and located, the faulty module is removed and replaced with one of the spares.
• Hot standby sparing: the spares operate in synchrony with the online modules, and they are prepared to take over at any time
• Cold standby sparing: the spares are unpowered until needed to replace a faulty module
→ Power consumption vs. time to perform initialization prior to bringing the module into active service

Active hardware redundancy: Pair-and-a-spare approach
• Two modules operate in parallel at all times and their results are compared to provide the error detection capability
• Reconfiguration process can be viewed as a switch that accepts the modules' outputs and error reports, and provides the comparator with the output of two modules.
• As long as the two outputs agree, the spares are not used. When a miscompare occurs, the switch uses the error reports from the modules to first identify the faulty module and then select a replacement module. Both modules are replaced if the faulty module is not identified.

[Diagram: Module 1, Module 2, and a Spare feed a Reconfigurator, which feeds a Comparator; labels: output, error signal]

Many implementation strategies:
• Hybrid approaches that combine passive and active HW redundancy
• Active approaches that combine HW and SW redundancy

Information redundancy

• Coding
– Information is represented with more bits than strictly necessary: say, an n-bit information chunk is represented by n + c = m bits
– Among all the possible 2^m configurations of the m bits, only 2^n represent acceptable values (code words)
– If a non-code word appears, it indicates an error in transmitting, storing, or retrieving …
– Examples: checksums, error detection and correction codes, cyclic redundancy check (CRC), …

• Self-checking circuits (or SW operations)
– A circuit that has the ability to automatically detect the existence of errors, with the detection occurring during the normal course of its operations.
– Typically obtained using coding techniques (e.g., reverse operation, etc.).
– Examples: built-in self-test (BIST), coded processors, …
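
As a small illustration of coding, the sketch below uses an even-parity bit (n = 7, c = 1, m = 8) and a CRC-32 appended to a message. The function names are hypothetical; `zlib.crc32` is the standard-library CRC-32 routine. The parity example also counts the code words to show that only 2^n of the 2^m bit patterns are valid.

```python
import zlib


def add_parity(word7):
    """Encode a 7-bit value as an 8-bit even-parity code word (n=7, c=1, m=8)."""
    parity = bin(word7).count("1") % 2
    return (word7 << 1) | parity


def is_code_word(word8):
    """A valid code word has an even number of 1 bits."""
    return bin(word8).count("1") % 2 == 0


def make_frame(payload):
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame):
    """Recompute the CRC-32 of the payload and compare it with the stored one."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


code = add_parity(0b1011001)
corrupted = code ^ 0b00010000  # flip a single bit
print(is_code_word(code), is_code_word(corrupted))

# Only 2**7 of the 2**8 possible bit patterns are code words.
print(sum(is_code_word(w) for w in range(2 ** 8)))
```

A single parity bit detects any odd number of flipped bits but cannot correct them; stronger codes trade more redundant bits for detection of more error patterns or for correction.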


Time redundancy techniques

• Reduce the amount of extra hardware at the expense of using additional time
– Repetition of computations and comparison of the results to detect errors
– Good for transient faults; no protection against permanent faults
– Problem of guaranteeing the same data when a computation is re-executed (after a transient fault, system data can be completely corrupted)
– May use a minimum of extra hardware to also detect permanent faults, e.g., encode data before executing the second computation
– Examples: re-execution, sending messages twice, …

[Diagram: at time t0, the computation runs on the data and the result is stored; at time t0+d, the data is encoded, the computation runs again, and the result is decoded and stored; the two results are compared and a mismatch signals an error]
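
The re-execution scheme with an encoded second run can be sketched as follows. This is a minimal illustration, assuming the computation is a sum of non-negative integers and the encoding is a one-bit left shift of each operand (so the decoded result is the second sum shifted right by one). The names `checked_run`, `total`, `shift_left`, and `shift_right` are hypothetical.

```python
def checked_run(compute, data, encode, decode):
    """Run the computation twice and compare the results.

    The second run works on encoded data, so a permanent fault is likely
    to affect the two runs differently and show up as a mismatch.
    """
    first = compute(data)                    # run at time t0
    second = decode(compute(encode(data)))   # run at time t0 + d
    if first != second:
        raise RuntimeError("error detected: the two results differ")
    return first


def total(values):
    return sum(values)


def shift_left(values):
    # Encoding: multiply every operand by two.
    return [v << 1 for v in values]


def shift_right(result):
    # Decoding: the sum of doubled operands is twice the original sum.
    return result >> 1


print(checked_run(total, [3, 5, 7], shift_left, shift_right))
```

The encoding must commute with the computation (here, doubling the operands doubles the sum); choosing such an encoding is the main design constraint of this technique.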

Software redundancy

• Software implemented fault tolerance (SWIFT):
– Management of hardware faults at software level
– Management of faults originating from the design and implementation of software components (i.e., software bugs)

• Due to the large cost of developing software, most of the software dependability effort has focused on fault-avoidance techniques

• The current trend of reducing transistor size to 10 nanometers and below will increase the interest in SWIFT techniques, especially in the cloud infrastructure (see International Technology Roadmap for Semiconductors: http://www.itrs.net/)

Software redundancy techniques

• Software redundancy may occur in many forms:
– Add extra lines of code and data structures to:
o Check the acceptable range of variables or magnitude of signals, or a routine to test a memory by writing and reading locations, etc.
o Perform error recovery.
– Replicate program modules or the complete program

• Basic techniques:
– Consistency checks
– Capability checks
– Software diversity
– Error detection and recovery

Consistency check
• Uses a priori knowledge about the characteristics of information to verify its correctness.
• Often called assertions (program assertions, executable assertions)
• Examples:
– Data consistency: check the range of variables, input parameters, signals, etc.
– Address consistency: check that the addresses generated by the computer are in the address range of the available memory.
– Time consistency: check time limits for given operations, such as timeouts
– Detection of invalid instruction codes (n bits are used to represent 2^k legal instruction codes, so 2^n - 2^k codes are illegal)

Consistency check (cont.)

Watchdog processors
• Sometimes an external mechanism is needed to check assertions
• In general, needs both HW + SW
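
The consistency checks listed above can be written as executable assertions. The sketch below is illustrative only: the limits, the 4-bit instruction format with 2^3 legal codes, and the function names are all assumptions. The time check measures the elapsed time after the operation returns; it does not interrupt an operation that hangs, which is where an external watchdog becomes necessary.

```python
import time

# Assumed instruction format: n = 4 bits, of which 2**3 codes are legal.
LEGAL_OPCODES = frozenset(range(2 ** 3))


def check_range(name, value, low, high):
    """Data consistency: the value must lie in a range known a priori."""
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return value


def check_opcode(opcode):
    """Reject any of the 2**4 - 2**3 illegal instruction codes."""
    if opcode not in LEGAL_OPCODES:
        raise ValueError(f"illegal instruction code {opcode}")
    return opcode


def run_with_deadline(operation, limit_s):
    """Time consistency: flag an operation that exceeded its time limit."""
    start = time.monotonic()
    result = operation()
    elapsed = time.monotonic() - start
    if elapsed > limit_s:
        raise TimeoutError(f"operation took {elapsed:.3f}s, limit {limit_s}s")
    return result


print(check_range("speed_kmh", 50, 0, 120))
print(check_opcode(5))
print(run_with_deadline(lambda: 42, limit_s=1.0))
```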

Capability check
• Verify that a system has the expected capability (hardware testing)
• Examples:
– Is the ALU working well? → Execute specific instructions on specific data and compare the result with the expected one (written in a ROM memory)
– In a multiprocessor, are processes working properly? Are they capable of communicating? → Execute testing programs and compare the results with the expected ones.
– Memory test: write and read some locations


Questions

• In your opinion, is N-version programming based on error detection or on error masking techniques?

Software diversity: N-version programming

• Independently developed versions of design and code from the same set of requirements.

• Technique: independent design teams utilizing different design methodologies, algorithms, compilers, run-time systems and hardware components

• Vote on the N results produced

[Diagram: the program inputs are fed to Program Version 1, Program Version 2, …, Program Version N; a Voter combines their results into the program outputs]
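
A toy version of N-version programming is sketched below, assuming three independently written integer square root routines and a simple majority voter. All names are hypothetical, and `isqrt_buggy` is deliberately wrong to stand in for a faulty version. A version that raises an exception simply casts no vote.

```python
import math
from collections import Counter


def isqrt_newton(n):
    """Version 1: integer Newton iteration."""
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_library(n):
    """Version 2: standard-library routine (Python 3.8+)."""
    return math.isqrt(n)


def isqrt_buggy(n):
    """Version 3: deliberately faulty, for illustration only."""
    return int(n ** 0.5) + 1


def n_version(versions, x):
    """Run every version on the same input and vote on the results."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a version that fails to complete casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count * 2 > len(versions):
            return value
    raise RuntimeError("no majority among the versions")


print(n_version([isqrt_newton, isqrt_library, isqrt_buggy], 17))
```

The voter here requires a strict majority of all N versions, not just of those that completed; other voting policies are possible and depend on the application.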


N-version programming cons.

• Disadvantages:
– Cost of resources
– Cost of concurrent executions
– Potential source of correlated errors
o The original requirements and specification
o Humans tend to fail in similar modes (social and education commonalities)
– Requirements and specification mistakes are not tolerated (fault avoidance)

• Software voter (it is a technical challenge):
– Not replicated; single point of failure: must be simple and verifiable
– Must assure that the input data vector to each of the versions is identical
– Must receive data from each version in identical formats or make efficient conversions
– Must implement some sort of communication protocol to wait until all versions complete their processing or recognize the versions that do not complete

Software diversity: N-self-checking programming
• Based on acceptance tests rather than comparison with equivalent versions
• N versions of the program are written
• Each version is running simultaneously and includes its acceptance tests
• The selection logic chooses the results from one of the programs that passes the acceptance tests
• Tolerates N-1 faults (independent faults)
• Problem: the coverage of self-checking is far from perfect

[Diagram: the program inputs are fed to Program Version 1, …, Program Version N, each followed by its acceptance tests; a Selection Logic block produces the program outputs]
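
The selection logic can be sketched as below. This is a simplified illustration: the versions run one after another rather than simultaneously, the task is sorting a list, and every name is hypothetical. The acceptance test checks that the output is ordered and is a permutation of the input, without comparing it against another version.

```python
from collections import Counter


def sort_builtin(values):
    return sorted(values)


def sort_buggy(values):
    # Deliberately faulty version: returns the input unchanged.
    return list(values)


def is_sorted_permutation(inputs, outputs):
    """Acceptance test: outputs are ordered and contain the same items."""
    ordered = all(a <= b for a, b in zip(outputs, outputs[1:]))
    return ordered and Counter(inputs) == Counter(outputs)


def select_result(checked_versions, inputs):
    """Return the result of the first version that passes its own test."""
    for version, acceptance_test in checked_versions:
        try:
            result = version(inputs)
        except Exception:
            continue  # a version that crashes is skipped
        if acceptance_test(inputs, result):
            return result
    raise RuntimeError("no version passed its acceptance test")


versions = [
    (sort_buggy, is_sorted_permutation),
    (sort_builtin, is_sorted_permutation),
]
print(select_result(versions, [3, 1, 2]))
```

The quality of the result depends entirely on the acceptance test: a weak test lets wrong results through, which is the coverage problem noted above.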


Error recovery

• Forward recovery: transform the erroneous state into a new state from which the system can operate

• Backward recovery: bring the system back to a state prior to the error occurrence
- Checkpointing
- Recovery block

• Backward and forward recovery are not exclusive; they can be combined if the error persists


Forward error recovery

• Requires the assessment of damages caused by the detected error or by errors propagated before detection (not easy…)

• Usually ad hoc

• Example of application:
Real-time control systems in which an occasional missed response to a sensor input is tolerable. The system can recover by skipping its response to the missed sensor input.
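
The control-system example can be sketched as follows, assuming a missed sensor input is represented by `None` and the controller is a simple proportional gain. The function name, the gain, and the data are hypothetical.

```python
def control_loop(readings, gain=2.0):
    """Compute one command per valid reading; skip missed readings.

    Returns the list of commands and the number of skipped inputs.
    """
    commands = []
    skipped = 0
    for reading in readings:
        if reading is None:
            # Forward recovery: do not try to restore a previous state,
            # just skip this response and carry on with the next input.
            skipped += 1
            continue
        commands.append(gain * reading)
    return commands, skipped


print(control_loop([1.0, None, 2.0]))
```

This only works because the application tolerates an occasional missing response; the recovery action is specific to this system, which is why forward recovery is usually ad hoc.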


Backward error recovery: checkpointing

• Checkpoint: a copy of the current state for possible use in rollback.
– May be taken automatically (periodically) or upon request by the program
– Needs to be correct
– Needs eventually to be discarded
– Survival of checkpoint data in the presence of faults is critical: stable storage

• Loss: computation time between the checkpointing and the rollback; data received during that interval

• Overhead of saving system state
– Important goal: to minimize the amount of state information that must be saved


Backward error recovery: checkpointing (cont.)

• Checkpoints need to be correct → when is it guaranteed to take the checkpoints?

• Stable storage:
– Replicated storage
+
– Atomic access (either the read/write accesses complete or nothing changes)
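
A minimal checkpoint and rollback sketch is shown below, assuming the state is a small JSON-serializable dictionary stored in a single file. The checkpoint is written to a temporary file and then renamed with `os.replace`, so a reader sees either the old checkpoint or the new one, never a partial write. The file names and the state layout are hypothetical, and the replication half of stable storage is not covered.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to the storage device
    os.replace(tmp_path, path)  # the rename either happens fully or not at all


def load_checkpoint(path):
    """Rollback: return the last state that was checkpointed."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as directory:
    checkpoint_path = os.path.join(directory, "checkpoint.json")
    state = {"step": 3, "total": 6}
    save_checkpoint(state, checkpoint_path)

    state["total"] = -999  # simulated corruption after the checkpoint
    state = load_checkpoint(checkpoint_path)  # roll back
    print(state)
```

Any work done between the checkpoint and the rollback is lost, so the checkpoint interval is a trade-off between the overhead of saving state and the amount of recomputation after an error.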


Backward error recovery: transactional systems and databases
• Checkpoint and backward recovery are highly successful in
databases and in transactional systems in general.
• Revision:
• What is a transaction?
• What are the basic transaction operations?
• What are the ACID transaction properties?
• What are concurrency control and locking?
• What is the two-phase commit protocol?


Backward error recovery: recovery block

• Each recovery block contains variables global to the block that will be automatically checkpointed if they are altered within the block.

• Upon entry to a recovery block, the primary alternate is executed and subjected to an acceptance test to detect any error in the result.
- If the test is passed, the block is exited.
- If the test fails or the primary alternate fails to execute, the content of the recovery cache pertinent to the block is reinstated, and the second alternate is executed.
- This cycle is repeated until either an alternate is successful or no more alternates exist. In the latter case, an error is reported.

[Diagram labels: checkpoint, acceptance test]
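
The recovery block cycle can be sketched as below. This is a simplification: the whole state dictionary is copied on entry, whereas the scheme described above checkpoints only the variables that are altered within the block. The function and variable names are hypothetical, and the primary alternate is deliberately faulty.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn until one passes the acceptance test."""
    recovery_cache = copy.deepcopy(state)  # checkpoint taken on entry
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state  # test passed: exit the block
        except Exception:
            pass  # the alternate failed to execute
        # Reinstate the checkpointed state before the next alternate.
        state.clear()
        state.update(copy.deepcopy(recovery_cache))
    raise RuntimeError("recovery block failed: no alternate was accepted")


def primary(state):
    state["values"].reverse()  # deliberately faulty primary alternate


def secondary(state):
    state["values"].sort()


def values_are_sorted(state):
    values = state["values"]
    return all(a <= b for a, b in zip(values, values[1:]))


print(recovery_block({"values": [3, 1, 2]}, [primary, secondary], values_are_sorted))
```

Unlike N-version programming, only one alternate runs at a time and a single acceptance test judges all of them.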


Backward error recovery: recovery block (cont.)

[Diagram: Program Inputs → Primary Version and Secondary Versions 1 … N-1 → N-to-1 Program Switch → Acceptance Tests → Program Outputs; label: Test Result]

• A single acceptance test
• Only one implementation of the program is run at a time
• Combines elements of checkpointing and backup
• Minimizes the information to be backed up
• Releases the programmer from determining which variables should be checkpointed and when
• The linguistic structure for recovery blocks requires a suitable mechanism for providing automatic backward error recovery.

Error detection

• Structural approach (duplication and comparison)
– Two or more copies of a data item that may be corrupted
– A mechanism that compares them and declares an error if they differ
– The two copies must be unlikely to be corrupted together in the same way

• Behavior-based approach
– Execution of checks on the behavior (variables, results, etc.) of the target system. The checks use a simplified view of the target behavior (behavior abstraction)
– The detection is done by a separate mechanism called, in general, a watchdog. In some implementations, the watchdog requires specific hardware.
– Watchdogs can range from simple watchdog timers to complex watchdog processors
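
A software watchdog timer can be sketched as below: the monitored program must "kick" the watchdog regularly, and a separate monitor declares an error when the kicks stop. The class and method names are hypothetical, and the clock is injectable so the example runs deterministically. A real watchdog timer is typically a hardware counter that resets the system when it expires.

```python
import time


class WatchdogTimer:
    """Declares an error when the target has not kicked it within the timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored program to signal that it is alive."""
        self.last_kick = self.clock()

    def expired(self):
        """Called by the monitor: True if the target missed its deadline."""
        return self.clock() - self.last_kick > self.timeout_s


# Simulated time, so the example does not have to sleep.
now = [0.0]
watchdog = WatchdogTimer(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
watchdog.kick()            # the target reports progress at t = 0.5
now[0] = 1.2
print(watchdog.expired())  # 0.7 s since the last kick
now[0] = 2.0
print(watchdog.expired())  # 1.5 s since the last kick
```

The watchdog only knows that the target is still kicking; it checks a very simplified view of the behavior, in line with the behavior abstraction mentioned above.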


Error detection effectiveness

• Coverage
– Probability that an error is detected, conditional on its occurrence

• Latency
– Time elapsing between the occurrence of an error and its detection.

• Overhead
– Cost of the error detection. It may include (extra) hardware, software,
memory space, computing time, etc.

• Damage Confinement
– Error propagation path
– The wider the propagation, the more likely that errors will spread outside the
system
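
Coverage and latency can be estimated from experimental records, as in the sketch below. It assumes a hypothetical list of records, one per observed error, each holding the time the error occurred and the time it was detected (or `None` if it was never detected), with times in milliseconds. The function name and the data are illustrative only.

```python
def detection_metrics(records):
    """Estimate coverage and mean detection latency.

    records: list of (error_time_ms, detection_time_ms or None).
    Returns (coverage, mean_latency_ms); the latency is None when
    no error was detected.
    """
    if not records:
        raise ValueError("at least one error record is required")
    detected = [(err, det) for err, det in records if det is not None]
    coverage = len(detected) / len(records)
    if not detected:
        return coverage, None
    mean_latency = sum(det - err for err, det in detected) / len(detected)
    return coverage, mean_latency


records = [(0, 20), (100, None), (200, 240), (300, 330)]
print(detection_metrics(records))
```

The coverage computed this way is an estimate of the conditional probability of detection given that an error occurred, and its accuracy depends on how representative the recorded errors are.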