Fundamental concepts
of software quality
and software dependability
Henrique Madeira, Analysis of Software Artifacts, DEI-FCTUC, 2022/2023
• Non-Functional view
– How the software system does it (features such as performance, security,
reliability, availability, usability, maintainability, and many, many more)
– Typically known as Quality Attributes of a software system
– Most of them cannot be measured directly
– The biggest technical challenges are in these non-functional attributes
• Functional requirements
– Describe what a software system should do
– Function points are a common metric to characterize and assess the size of the
software
• Non-functional requirements
– Define constraints (or goals) on how the system will do so
– Include basically everything that is not related to the functional aspects of the
software system
• Availability
• Reliability
• Security
• Performance
• Usability
• Maintainability
• Manageability
• Cost
Some obvious observations on many of these properties:
• In general, they make sense at system level, not just at program/application level.
• They depend on both the software and the underlying hardware.
• They depend on architectural design choices and configuration choices.
• Downtime depends on both failure rate and time to repair
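The dependence of downtime on failure rate and repair time can be made concrete with the usual steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below is only illustrative: the function names and the sample figures (an MTTF of 1000 hours, an MTTR of 1 or 2 hours) are assumptions, not values from the course.

```python
HOURS_PER_YEAR = 24 * 365

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def yearly_downtime_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTTF and MTTR."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR

# Hypothetical system: fails on average every 1000 h, takes 2 h to repair.
print(f"availability: {availability(1000, 2):.5f}")
print(f"downtime per year: {yearly_downtime_hours(1000, 2):.1f} h")
# Same failure rate, but repairs take half as long.
print(f"downtime per year with MTTR = 1 h: {yearly_downtime_hours(1000, 1):.1f} h")
```

Lowering the failure rate (a larger MTTF) and shortening repairs (a smaller MTTR) both reduce downtime, which is why availability is usually treated as a system-level property rather than a property of the code alone.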
Robustness
(more on concepts & terminology)
Resilience
(more on concepts & terminology)
Key non-functional
attributes of
software systems
Different means to
solve or mitigate the
effect of the threats
[Figure: state diagram with two states, correct service and incorrect service; restoration leads back to correct service]
• Hardware faults
• Software faults
• Environment faults
• Human faults
• …
Classification of faults
• Caused by what?
– Physical faults
– Human-Made faults
• Why?
– Accidental faults
– Intentional non-malicious faults / Intentional malicious faults
• When?
– Development faults: design, coding, configuration, upgrading
– Operational faults: in use or maintenance (operation faults, interaction faults,
configuration faults, ...)
• Where (with respect to the system)?
– Internal faults
– External faults
• How long?
– Permanent faults
– Transient faults
Software faults:
the main cause of computer failures
• Software faults (i.e., defects or bugs) are the major cause of
computer failures.
• The increasing complexity of software, the pressure to
shrink time-to-market, and the high cost of software testing
all help keep bugs the main cause of computer failures.
• Many failure reports are available on the Internet:
http://www.teach-ict.com/news/news_stories/news_computer_failures.htm
Software faults
“Bridges are normally built on-time, on-budget, and do not
fall down. On the other hand, software never comes in on-time
or on-budget. In addition, it always breaks down.”
Alfred Z. Spector
(who has an optimistic view on bridges…)
[Figure: Attention, Knowledge, Practice (skill), Motivation]
Fault types
Questions
• Explain the differences among fault prevention, fault removal, fault tolerance and fault forecasting, and
list the four techniques in order of how frequently the software industry uses them (put first
the one that is used most intensively).
• When what is visible to end-users is a deviation from the specified or expected behavior, this is called: a)
an error; b) a fault; c) a failure; d) a defect; e) a mistake.
• In your opinion, can the concepts of permanent and transient faults, as used for hardware faults, also be applied
to software bugs?
Computer failures
• Simple failure modes such as pure silent failures (clean crash or halt
failures) are relatively rare.
Failure classification
Dependability means
• Fault Prevention techniques: prevent the occurrence of faults
– Improve development process to avoid/minimize faults
– Use selected technologies (better components, certified software tools, etc.)
• Fault Tolerance techniques: provide correct service in the presence of faults
– Triple modular redundancy, N-version programming, checkpointing and recovery, etc.
Two different perspectives with strong technical implications
Questions
• Explain why availability is an attribute of security. Give examples.
• If a web service crashes when called with a given combination of valid inputs, can you claim that the web service is not
robust? Explain.
• Suppose you have a program that calculates the list of occurrences of Friday the 13th (days
considered unlucky by superstitious people) for the next 20 years. Give four examples of failure modes that may
occur when running that program.
• If a program performs correct calculations (i.e., the result is correct), can you still claim that such a result represents a
failure? If your answer is yes, give two examples.
• If you decide to execute the same program with the same input parameters
several times, using the same hardware, and take a majority vote over the results
obtained in the different runs, what kind of redundancy are you using?
Error masking
Fault tolerance
• Protective redundancy:
Additional components or processes that mask or correct errors or faults inside a system
so they do not become observable failures in its service
Replication
• Replication provides:
– Two or more copies of items that may be corrupted by a fault
– A mechanism that compares them and declares an error if they
differ
• Examples:
– Duplicated circuitry
– Transmit messages twice
– Store data in two separate places (e.g. mirrored disks)
– …
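A minimal sketch of this idea in Python: every value is stored in two places, and a comparison on each read declares an error if the copies differ. The class `MirroredStore` and the exception `ReplicaMismatch` are hypothetical names introduced for illustration; a real mirrored disk works at a much lower level.

```python
class ReplicaMismatch(Exception):
    """Raised when the two copies of an item differ (an error is declared)."""

class MirroredStore:
    """Keeps two copies of every item and compares them on each read."""

    def __init__(self):
        self._primary = {}
        self._mirror = {}

    def write(self, key, value):
        # Store the data in two separate places.
        self._primary[key] = value
        self._mirror[key] = value

    def read(self, key):
        a, b = self._primary[key], self._mirror[key]
        if a != b:
            # Two copies detect the error but cannot tell which one is wrong.
            raise ReplicaMismatch(f"copies of {key!r} differ: {a!r} vs {b!r}")
        return a

store = MirroredStore()
store.write("balance", 100)
print(store.read("balance"))       # copies agree

store._mirror["balance"] = 999     # simulate a fault corrupting one copy
try:
    store.read("balance")
except ReplicaMismatch as exc:
    print("error detected:", exc)
```

With only two copies the mechanism detects the error but cannot mask it; deciding which copy is correct needs additional information, such as a third copy or a checksum.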
Redundancy
Redundancy is the ingredient of Replication
• Hardware redundancy
Physical replication of HW (the most common form of redundancy). The cost
of replicating HW is decreasing rapidly.
Obviously, these low-end computers cannot solve everything; but the trend is clear.
• Information redundancy
Addition of redundant information to data in order to allow fault detection and
fault masking.
• Time redundancy
Attempt to reduce the amount of extra HW at the expense of using additional
time.
[Figure: Module 1, Module 2 and Module 3 feed a Voter, which produces the output]
[Figure: a module's output is checked by a Comparator]
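The voter in a three-module arrangement can be sketched as a majority function over the module outputs. This is a simplified software illustration, assuming the outputs are hashable values that can be compared exactly; `majority_vote` and the toy modules are hypothetical names.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority, i.e. when
    the disagreement cannot be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError(f"no majority among outputs: {outputs!r}")
    return value

def module_ok(x):
    return x * x

def module_faulty(x):
    return x * x + 1   # simulated faulty module

# Three replicas of the same computation, one of them faulty.
modules = [module_ok, module_ok, module_faulty]
outputs = [m(7) for m in modules]
print(outputs, "->", majority_vote(outputs))
```

With only two modules and a comparator, a disagreement can be detected but not masked, since there is no majority to fall back on.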
Information redundancy
• Coding
– Information is represented with more bits than strictly necessary: say, an n-bit
information chunk is represented by n + c = m bits
– Among all the possible 2^m configurations of the m bits, only 2^n represent
acceptable values (code words)
– If a non-code word appears, it indicates an error in transmitting, or storing, or
retrieving …
– Examples: checksums, error detection and correction codes, cyclic redundancy check
(CRC), ...
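As a small illustration of coding, the sketch below uses a single even-parity bit (c = 1), so n = 3 data bits are represented by m = 4 bits. It enumerates all 2^m configurations and counts how many are code words. Even parity is chosen only because it is the simplest code; the function names are assumptions made for the example.

```python
from itertools import product

N = 3  # number of data bits

def encode(data_bits):
    """Append an even-parity bit: n data bits become m = n + 1 bits."""
    return tuple(data_bits) + (sum(data_bits) % 2,)

def is_code_word(word):
    """A word is a code word when its total number of 1s is even."""
    return sum(word) % 2 == 0

all_words = list(product((0, 1), repeat=N + 1))
code_words = [w for w in all_words if is_code_word(w)]
print(len(all_words), "configurations,", len(code_words), "code words")

sent = encode((1, 0, 1))
received = list(sent)
received[0] ^= 1                      # a single bit flipped in transit
print("sent is a code word:", is_code_word(sent))
print("received is a code word:", is_code_word(received))
```

A single parity bit detects any odd number of flipped bits but cannot correct them; CRCs and error-correcting codes spend more check bits to obtain stronger guarantees.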
[Figure: block diagram with Data, Encode, Computation, Decode result, Store result, and Compare results (signalling an error), at time t0+d]
Software redundancy
• Basic techniques:
– Consistency checks
– Capability checks
– Software diversity
– Error detection and recovery
Consistency check
• Uses a priori knowledge about the characteristics of
information to verify its correctness.
• Often called assertions (program assertions, executable
assertions)
• Examples:
– Data consistency: check the range of variables, input
parameters, signals, etc.
– Address consistency: check that the addresses generated by the
computer fall within the address range of the available memory.
– Time consistency: check time limits for given operations, such as
timeouts
– Detection of invalid instruction codes (n bits are used to represent 2^k legal
instruction codes, so 2^n - 2^k codes are illegal)
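Data and time consistency checks are often written as executable assertions. The sketch below assumes a hypothetical temperature sensor with a valid range of -40 to 125 °C and a 0.5 s time limit per reading; these limits and all names are invented for illustration.

```python
import time

TEMP_MIN_C, TEMP_MAX_C = -40.0, 125.0   # assumed valid sensor range
DEADLINE_S = 0.5                        # assumed time limit for one reading

class ConsistencyError(Exception):
    """Raised when an executable assertion about data or timing fails."""

def check_range(value, low, high, name):
    # Data consistency: the value must lie inside its a priori known range.
    if not (low <= value <= high):
        raise ConsistencyError(f"{name}={value} outside [{low}, {high}]")
    return value

def timed_read(read_sensor, deadline_s=DEADLINE_S):
    # Time consistency: the operation must finish within its time limit.
    start = time.monotonic()
    value = read_sensor()
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        raise ConsistencyError(f"read took {elapsed:.3f}s > {deadline_s}s")
    return check_range(value, TEMP_MIN_C, TEMP_MAX_C, "temperature")

print(timed_read(lambda: 21.5))          # plausible value, accepted
try:
    timed_read(lambda: 900.0)            # out of range, rejected
except ConsistencyError as exc:
    print("error detected:", exc)
```

Explicit exceptions are used instead of Python's `assert` statement because `assert` is skipped when the interpreter runs with `-O`. Note also that the time check only notices a late response after the fact; it cannot interrupt an operation that never returns.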
Watchdog processors
• Sometimes an external mechanism is needed to check assertions
Capability check
• Verify that a system has the expected capability
(hardware testing)
• Examples:
– Is the ALU working well? → Execute specific instructions on
specific data and compare the result with the expected one
(written in ROM)
• Technique: independent design teams utilizing different design
methodologies, algorithms, compilers, run-time systems and
hardware components
[Figure: Program Version 1, Program Version 2, ..., Program Version N feed a Voter, which produces the program outputs]
Error recovery
• Usually ad hoc
• Example of application:
– Real-time control systems in which an occasional missed response to a
sensor input is tolerable
– The system can recover by skipping its response to the missed
sensor input
• Loss: computation time between the checkpointing and the rollback;
data received during that interval
• Overhead of saving system state
– Important goal: to minimize the amount of state information that must be saved
Stable storage:
• Replicated storage
+
• Atomic read/write accesses (either complete or nothing changes)
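The loss and overhead described above can be seen in a toy checkpoint-and-rollback sketch. It assumes the whole state fits in a dictionary and that an in-memory deep copy stands in for stable storage; `Checkpointed` is a hypothetical class, not a real library.

```python
import copy

class Checkpointed:
    """Toy state holder with checkpoint and rollback."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)   # stands in for stable storage

    def checkpoint(self):
        # Overhead: the full state is copied at every checkpoint.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Loss: everything done since the last checkpoint is discarded.
        self.state = copy.deepcopy(self._saved)

acct = Checkpointed({"balance": 100, "log": []})
acct.state["balance"] -= 30
acct.state["log"].append("withdraw 30")
acct.checkpoint()

try:
    acct.state["balance"] -= 500
    acct.state["log"].append("withdraw 500")
    if acct.state["balance"] < 0:            # error detected
        raise ValueError("negative balance")
except ValueError:
    acct.rollback()

print(acct.state)
```

Copying the complete state on every checkpoint is the simplest option and also the most expensive one, which is why reducing the amount of saved state is an important goal.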
[Figure: Secondary Version 1 through Secondary Version N-1, an N-to-1 Program Switch, Acceptance Tests, and the Test Result]
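The labels in this figure (secondary versions, an N-to-1 program switch, acceptance tests) match the scheme usually called recovery blocks: alternative versions are tried in turn until one produces a result that passes the acceptance test. The sketch below is an illustration under that assumption; the versions, the acceptance test and the function names are all hypothetical.

```python
import math

def acceptance_test(x, result, tol=1e-6):
    """Accept a square-root result if squaring it gives back the input."""
    return result is not None and abs(result * result - x) <= tol * max(1.0, x)

def primary(x):
    return x / 2.0                      # deliberately wrong for most inputs

def secondary_1(x):
    # Newton's method, an independently written alternative.
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def secondary_2(x):
    return math.sqrt(x)

def recovery_block(x, versions):
    """Try each version in order; return the first accepted result."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue                    # a crash counts as a failed version
        if acceptance_test(x, result):
            return version.__name__, result
    raise RuntimeError("all versions failed the acceptance test")

print(recovery_block(9.0, [primary, secondary_1, secondary_2]))
```

The versions here are pure functions, so there is no state to restore between attempts; a recovery block that modifies state would roll back to a checkpoint before switching to the next version.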
Error detection
• Coverage
– Probability that an error is detected, conditional on its occurrence
• Latency
– Time elapsing between the occurrence of an error and its detection.
• Overhead
– Cost of the error detection. It may include (extra) hardware, software,
memory space, computing time, etc.
• Damage Confinement
– Error propagation path
– The wider the propagation, the more likely that errors will spread outside the
system
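Coverage and latency are typically estimated from experiments in which errors occur (or are injected) and the detector's response is recorded. The sketch below computes both from a small, made-up list of observations; the record format and the numbers are assumptions for illustration only.

```python
from statistics import mean

# Each record: (time the error occurred, time it was detected, or None).
# The values are invented sample data, in milliseconds.
observations = [
    (0.0, 1.5),
    (10.0, 10.4),
    (20.0, None),     # this error was never detected
    (30.0, 33.0),
]

def coverage(records):
    """Estimated probability that an error is detected, given it occurred."""
    detected = sum(1 for _, t_det in records if t_det is not None)
    return detected / len(records)

def mean_latency(records):
    """Mean time from occurrence to detection, over the detected errors."""
    return mean(t_det - t_occ for t_occ, t_det in records if t_det is not None)

print(f"coverage: {coverage(observations):.2f}")
print(f"mean latency: {mean_latency(observations):.2f} ms")
```

Undetected errors lower the coverage but do not enter the latency figure, so the two metrics should always be reported together.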