Software Testing Techniques

Software Testing

Intro
1.1. Definitions
Testing Activities
2. Test Selection
2.1. Categories of Test Selection
2.2. Definitions
2.3. Requirements Based Testing
2.3.1. Cause Effect Graphs
2.3.2. Equivalence Partitioning
2.4. Control Flow Based Testing
2.4.1. Motivation
2.4.2. Control Flow Graph
2.4.3. Statement-, Path-, Branch-Coverage
2.4.4. Condition Coverage
2.4.5. Loops
2.4.6. Conclusions about Coverage Criteria
2.4.7. Subsumption
2.5. Data Flow Based Testing
2.5.1. Motivation
2.5.2. Intra-procedural data flows
2.5.3. Inter-procedural data flows
2.5.4. Conclusions about Data Flow Testing
3. Combinatorial Testing
3.1. Motivation
3.2. Latin Squares
3.3. Conclusions
4. Random and Statistical Testing
4.1. Motivation
4.2. Grounds
4.3. Weyuker and Jeng (fixed failure rates)
4.4. Gutjahr (failure rates as random variables)
4.5. Summary
4.6. Statistical Testing
4.6.1. Usage profiles in Software Reliability Engineering
4.6.2. Usage Profiles based on Markov chains
4.6.3. Conclusion
5. Model Based Testing
5.1. Motivation
5.2. Models
5.3. Scenarios
5.4. Selection Criteria
5.5. Test Case Generation
5.6. Cost Effectiveness
6. OO Testing
6.1. Differences of OO Software
6.2. State-based Testing
6.3. Testing with Inheritance
6.4. Testing with polymorphism
6.5. Conclusions
7. From Failures to Tests and Faults
7.1 Motivation
7.2 Test extraction by reproducing program executions
7.2.1. State-Based Test Extraction
7.2.2. Event-Based Test Extraction
7.3. Delta Debugging
7.4. Hit Spectra, Code Metrics, Code Churn
7.5. Alternatives: directly target faults
8. Assessment of the quality of test suites
8.1. Motivation
8.1. Fault Injection
8.2. Mutation Testing
9. Concurrency Testing
10. Fault Models

1. Intro

1.1. Definitions

Goal of testing

Show that functionality is implemented and detect failures (while maintaining reasonable costs). (Usually done by comparing actual and intended behaviors.)

Test Cases

Definition:

Expected Output
→ a specification is needed
→ abstract: no exception occurs

Definition of a good test case:

Ability to detect likely failures with good cost effectiveness.

What Testing is not:
Improvement of quality → debugging
Fault localization

Types of Defects

Failure:
Deviation of actual I/O behavior (from the intended behavior)
Occurs at runtime
User-centered “interface” perspective → the user directly sees the defect, e.g. system crash
Observable difference between actual and intended behaviour
→ False output

Error:
Deviation of the system’s actual state (from the intended state)
May or may not lead to a failure
May be corrected at runtime (redundancy)
Internal “transition system” perspective
→ Wrong state/case

Fault:
Actual or hypothesized reason for the deviation
E.g. faulty code (no runtime correction)
Fault can only be corrected at design time → fix & rebuild
Developer-centered perspective
→ X > 5 instead of X >= 5

➨ Error = mistake in the process, fault = mistake in code!

Test Selection
Central Problem: group possible input data into blocks of a partition, or “equivalence classes”
➨ Independence of representative: every representative of a block should reveal the same failure

Stopping Criteria

Useful:
When a coverage criterion is met → if there is a relationship with failure detection
When a specific number of failures has been detected → requires good estimates

Useless:
When time is up
When no more defects are detected with the test cases

→ Stopping and Selection Criteria are the same!

Build equivalence classes by...
Control flow and decisions → covering the control flow graph
Data flow → covering the data flow graph
Stochastic profiles → pick test cases at random (w.r.t. a distribution)

Combinatorial Testing
E.g. Automotive Infotainment Network.

With n parameters, each of which can take p values, we have p^n combinations.
Assuming that failures are a result of the interaction of only 2 (3, 4, ...) parameters.
➨ Pairwise (t-wise) interactions

Random (uniform) Testing
Any test selection criterion should be “better” than random testing!

Weyuker and Jeng
Partition-based testing compared with random testing.
Metric: Find at least one failure-causing input.
Result: Can be the same, better or worse
Example: input domain of size 100, 8 failure-causing inputs, 2 tests.

➨ In general we don’t know a priori which blocks are problematic!

(This is why partition-based testing has not been shown superior to other strategies)

Statistical Testing
Risk-Based Testing: Risk = Likelihood * Expected Damage
Do most of the testing for the high risk parts of the system’s functionality
→ Purposely live with remaining faults!

Model-Based Testing
Tests don’t describe the intended behavior of a system at all
→ Describe the intended behavior independently of a test (state charts, sequence diagrams)
Specify a test selection criterion and have some magic machine generate tests and code
→ For robustness testing, describe just the environment?
Statistical testing is one instance

OO Software Testing
Encapsulation, inheritance, polymorphism
Encapsulation: objects are state machines → model-based testing!
Expensive: testing a method with an argument of type T means testing it with all subtypes of T as well!

Test Case Extraction
When people just play around with the system → Problem: turn these executions into managed tests

Fault Localization
Ideas to get from failure to the fault:
Minimizing input
Consider two failing tests (Check intersection of invoked methods)
Consider the states of a program (Infer relevant variables)
Consider frequency of changes in svn repository

Fault Injection
To assess the quality of a test suite or a test selection criterion with a given fault model in mind,
we can inject faults of that kind and see whether the test suite finds them.
Mutation testing
e.g. fault model of simple syntactic problems (use “=” instead of “>=”)
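
A minimal Python sketch of the idea, not a real mutation testing tool. The function names and the two test suites are made up for illustration; the injected fault is the "X > 5 instead of X >= 5" example from the defect types above.

```python
def original(x):
    # Intended behaviour: accept values of 5 and above.
    return x >= 5

def mutant(x):
    # Injected fault (mutation): ">" instead of ">=".
    return x > 5

def kills(test_inputs, reference, candidate):
    """A test suite 'kills' a mutant if some input makes the outputs differ."""
    return any(reference(x) != candidate(x) for x in test_inputs)

weak_suite = [0, 10]       # hypothetical suite without the boundary value
strong_suite = [0, 5, 10]  # hypothetical suite including the boundary value 5

print(kills(weak_suite, original, mutant))    # False: the mutant survives
print(kills(strong_suite, original, mutant))  # True: the mutant is killed
```

A surviving mutant hints that the test suite is weak with respect to this fault model (here: missing boundary values).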
Testing Activities

Planning
Organization
Documentation
Test Case Derivation
Execution
Monitoring
Evaluation

Different Levels of Testing

Unit tests
→ all called units are stubbed
Integration Tests
→ individual software modules are combined and tested as a group
System tests
→ testing conducted on a complete, integrated system to check that the requirements are met
Acceptance tests
→ determine whether or not a system satisfies the acceptance criteria (user needs, requirements) → the user then decides whether to accept the system

2. Test Selection
2.1. Categories of Test Selection

Requirements-based Testing: cause-effect graphs, equivalence partitioning
Control-flow-based Testing: statements, paths, branches; conditions; loops
Data-flow-based Testing: intraprocedural, interprocedural
Statistical Testing

2.2. Definitions

Kinds of Testing

Black Box Testing: test without knowing about the internal structure of the program
→ test based upon requirements/specification
→ functionality only
White Box Testing: test with respect to the internal structure
→ e.g. testing of paths
Usability Testing: determine how well people can use the system
Stress and Volume Testing: intense testing to determine the stability
Robustness Testing: functionality is assumed to be correct
Performance Testing: e.g. test response times
Security Testing: test for flaws in the security mechanism
Reliability and recovery testing: stability during an interval of time; determine how well the system recovers from crashes
Configuration and compatibility testing: evaluate the application's compatibility with the computing environment
Regression testing: check if old tests and requirements can still be applied to a changed system

Stubs: simulate callees
Drivers: simulate callers
Big Bang Testing: test all modules alone, then combine all
Incremental Testing: add one module after another
Top-Down:
start testing with the top module
Only stubs are needed, no drivers
Requirements can be validated easily, early skeletal version
Bottom-Up:
start with leaf modules
only drivers needed (simple to write)
disadvantage: no early skeleton of the program
Random variable: “In probability and statistics, a random variable, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense).” (wiki)

Stochastic process: family of random variables indexed by time

2.3. Requirements Based Testing

Goal: have at least one test case for each requirement
Black-box
Application- and domain-specific

2.3.1. Cause Effect Graphs

a cause–effect graph is a directed graph
that maps a set of causes to a set of effects
causes = input condition of the program
effects = output or state transformation
translate natural-language specifications into a formal scheme / logical network

Use it when:
helpful in identifying all relevant situations
structure / reproducibility
also good for presentations / meetings since very visible
Caveat: we don’t know about completeness

Helpful websites for decision table generation:
http://www.softwaretestinghelp.com/causeandeffectgraphtestcasewritingtechnique/
https://amybughunter.wordpress.com/2013/10/02/blackboxtestingwithcauseeffectgraphs/

2.3.2. Equivalence Partitioning

Partitioning = group input data into blocks (‘equivalence classes’)
Domain- and application-specific → cannot be automated
Category Partition Method
Categories = functional units in the code (they can be tested individually, e.g.
a function call or a user input)
Find parameters for the unit
For each parameter: find different input choices → partitions
For each partition: find constraints
Test frame: all possible combinations of inputs
Test case: instantiation of a test frame
Tool: classification tree method
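
The steps of the category partition method can be sketched in Python as follows. The unit ("find a pattern in a file"), its parameters, the choices and the constraints are all hypothetical and only serve as an illustration; real choices come from the specification.

```python
from itertools import product

# Hypothetical parameters and input choices for a "find pattern in file" unit.
choices = {
    "pattern": ["empty", "single_char", "many_chars"],
    "file": ["missing", "empty", "non_empty"],
    "occurrences": ["none", "one", "many"],
}

def is_feasible(frame):
    # Assumed constraints: matches can only occur in a non-empty file
    # and only for a non-empty pattern.
    if frame["occurrences"] != "none":
        return frame["file"] == "non_empty" and frame["pattern"] != "empty"
    return True

# Test frames: all combinations of choices that satisfy the constraints.
all_frames = [dict(zip(choices, combo)) for combo in product(*choices.values())]
test_frames = [frame for frame in all_frames if is_feasible(frame)]

print(len(all_frames), "combinations,", len(test_frames), "feasible test frames")
# prints "27 combinations, 13 feasible test frames"
```

Each remaining frame still has to be instantiated with concrete values to become a test case.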
2.4. Control Flow Based Testing

→ form of white box testing
2.4.1. Motivation

Why use CFG-based Coverage Criteria?

Use it when:
Finding dead code
Intuitively: each line of code should be executed at least once (e.g. as a specification requirement).
Dual use:
A posteriori (measure existing test suite)
A priori (use as selection criterion) + as stopping criterion
We want a structure / methodology for deriving test cases (partition-based testing tells us to make partitions, but not how)
We want to measure some performance → easier to show to management
One could think that CFG-based coverage criteria will find all failures, because all parts of the code are covered, BUT:

No CFG-based coverage criterion guarantees the revelation of all failures!
specification wrong
missing paths in the CFG
dependence on data values

2.4.2. Control Flow Graph

Graphical representation of all paths in the code
Nodes: program statements
Edges: transfers of control
Equivalence relation == one class includes all executions with identical paths
2.4.3. Statement-, Path-, Branch-Coverage

Statement Coverage
covers all nodes
metric: # executed statements / # statements

Decision/Branch Coverage
every edge is executed at least once
→ lower number of paths
composite conditions not taken into account
→ usefulness hard to prove

Path Coverage
every path is executed once
k loops → infinite number of paths
loops with upper bound n → n paths
Exhaustive path testing == exhaustive input testing → not realistic
there are infeasible paths (e.g. paths depending on the same condition cannot be executed alternately)
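
A small Python sketch of the difference between statement and branch coverage. The function and the hand-written instrumentation are assumptions for illustration; real projects would use a coverage tool instead.

```python
covered_branches = set()
ALL_BRANCHES = {"if:true", "if:false"}

def price_after_discount(price, is_member):
    discount = 0
    if is_member:
        covered_branches.add("if:true")   # instrumentation
        discount = 10
    else:
        # The original code has no else branch (so no extra statement);
        # it is only made visible here to record the false outcome.
        covered_branches.add("if:false")  # instrumentation
    return price - discount

def branch_coverage(test_inputs):
    """Run the tests and return the fraction of covered branch outcomes."""
    covered_branches.clear()
    for price, is_member in test_inputs:
        price_after_discount(price, is_member)
    return len(covered_branches) / len(ALL_BRANCHES)

# One test executes every statement of the uninstrumented function ...
print(branch_coverage([(100, True)]))                # 0.5
# ... but both outcomes of the decision need a second test.
print(branch_coverage([(100, True), (100, False)]))  # 1.0
```

The single test with is_member=True reaches 100% statement coverage of the original function but only half of the branch outcomes.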

2.4.4. Condition Coverage

Advantage : structured way of deriving test cases (predictable)
Disadvantage : might be better or worse than choosing randomly, we don’t know

Literal
= fixed values, such as true/false, integers or characters; in this case: atomic boolean expressions
Conditions consist of literals → e.g. a<b, a||b, a&&b

Condition Coverage
each literal of a condition evaluates once to true and once to false
(a&&b)||c → minimum two cases (all 0, all 1)
outcome irrelevant → branch coverage not subsumed

Condition/Decision Coverage
condition coverage + the result must evaluate both to true and to false
outcome relevant → subsumes branch coverage

Modified Condition/Decision Coverage (MC/DC)
Independently swap the value of each literal; if the result changes, use as test case
n+1 tests for n independent parameters
subsumes condition, decision and C/D coverage

Multiple Condition Coverage
all combinations of literals → all logical faults will be detected
too many test cases: 2^n

Does MC/DC detect all failures?
→ No, not all possibilities are tested
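
To make the "n+1 tests" claim concrete, here is a brute-force Python sketch for the decision (a&&b)||c from above. The helper names are made up, and the exhaustive search is only feasible for a handful of literals.

```python
from itertools import combinations, product

def decision(a, b, c):
    return (a and b) or c

def shows_independence(t1, t2, i):
    """t1 and t2 differ only in literal i and yield different outcomes."""
    differs_only_in_i = all(
        (x != y) == (k == i) for k, (x, y) in enumerate(zip(t1, t2))
    )
    return differs_only_in_i and decision(*t1) != decision(*t2)

def satisfies_mcdc(tests, n=3):
    # Every literal needs a pair of tests showing its independent effect.
    return all(
        any(shows_independence(t1, t2, i) for t1, t2 in combinations(tests, 2))
        for i in range(n)
    )

all_inputs = list(product([False, True], repeat=3))

# Search the smallest subset of the 2^n inputs that satisfies MC/DC.
smallest = next(
    subset
    for size in range(1, len(all_inputs) + 1)
    for subset in combinations(all_inputs, size)
    if satisfies_mcdc(subset)
)
print(len(smallest), "tests instead of", len(all_inputs))  # 4 tests instead of 8
```

For this decision, 4 = n+1 tests suffice, while multiple condition coverage would need all 8 combinations.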

Non-Strict Evaluation
→ In many programming languages, conditions are not evaluated exactly as written (short-circuit evaluation)
e.g.: if (a&&b) → if (a) { if (b) …}
→ semantically equivalent programs require different tests for the same coverage

Dependent Parameters
100% Coverage is not achievable, if parameters are dependent!
Example: a == b, condition: (a&&b) → a=1, b=0 can never occur

2.4.5. Loops
loops are problematic because we sometimes don't know the boundaries, or they are
even infinite
→ BoundaryInterior Tests
zero times: don't enter the loop
once: enter and execute once
more than once, e.g. twice: test the interior
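
A boundary-interior test set for a simple loop could look like this in Python. The function under test is a made-up example.

```python
def largest(values):
    """Return the largest element, or None for an empty list."""
    result = None
    for v in values:
        if result is None or v > result:
            result = v
    return result

# One test per loop class: zero, one, and more than one iteration.
boundary_interior_tests = {
    "zero iterations": ([], None),
    "one iteration": ([3], 3),
    "two iterations": ([3, 8], 8),
}

for name, (inputs, expected) in boundary_interior_tests.items():
    assert largest(inputs) == expected, name
print("all boundary-interior tests passed")
```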

2.4.6. Conclusions about Coverage Criteria

Problems

Evaluation

in most programming languages, conditions are not evaluated exactly as written
for example, some literals are not even evaluated: if (a||b) → if a is true, b might not be evaluated
→ semantically equivalent programs require different test suites to reach the same coverage

Dependencies
if one parameter is logically dependent on another, then 100% coverage is not
achievable

Structural Coverage groups the set of inputs → is a type of Partitioning

Measurement of Coverage

inserting statements into the code
update a large table
leave the insertions in the code; removing them alters the code that was tested and might result in failures

Coverage Criteria can be used as:
Selection Criteria (before testing)
Quality indicators (after testing)
Stopping criteria

Advantages and disadvantages of coverage criteria:
structured → could be automated
measurable → may lead to optimization of the wrong parameter
purely syntactic → nothing to do with requirements

→ ALWAYS use in combination with black box / requirements based testing

2.4.7. Subsumption

2.5. Data Flow Based Testing

2.5.1. Motivation

Approximations of path coverage
Variables are crucial
Only path coverage might not be sufficient, as it doesn’t take into account data flows,
e.g. if a variable is defined in one path and used in another
→ Information about where variables are defined and where they are used is used to specify
test cases

More: ftp://ftp.ncsu.edu/pub/tech/2006/TR200622.pdf

2.5.2. Intra-procedural data flows

Idea: def-use graphs → annotate the control flow graph with definitions and usages of variables
def represents the definition of a variable. The variable on the left-hand side of an assignment statement is the one getting defined
c-use represents computational use of a variable
p-use represents predicate use. These are all the occurrences of the variables in a predicate

Assignment: def (and c-use, if the value is read from a variable)
Input statement: def
Output statement: c-use
Procedure call: c-use (and def, for values passed by reference)
Conditional: p-use
Loop: p-use
Pointer, definition of p*: c-use of p, def of p*
Pointer, use of p*: c-use of p, c-use of p*
Array, definition of a[]: c-use of x_i, def of a
Array, usage of a[]: c-use of x_i, c-use of a
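
As an illustration of these annotations, a small Python function with its defs, c-uses and p-uses marked in comments. The function and the chosen tests are assumptions for this sketch; the tests target the def-use pairs of the variable total.

```python
def sum_above(values, limit):          # def: values, limit
    total = 0                          # def: total
    i = 0                              # def: i
    while i < len(values):             # p-use: i, values
        if values[i] > limit:          # p-use: values, i, limit
            total = total + values[i]  # c-use: total, values, i; def: total
        i = i + 1                      # c-use: i; def: i
    return total                       # c-use: total

# Tests chosen to cover the def-use pairs of "total":
du_tests = [
    ([], 2, 0),       # def "total = 0" -> c-use in "return total"
    ([5], 2, 5),      # def "total = 0" -> c-use in the update; update -> return
    ([5, 7], 2, 12),  # def in the update -> c-use in the next update
]

for values, limit, expected in du_tests:
    assert sum_above(values, limit) == expected
print("all def-use tests passed")
```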

Coverage

Complete path from entry to exit node
A complete path p covers
DCU: a definition-c-use association if it has a def-clear subpath from i to j
dcu(xyz, nX) = {mX}
xyz = the name of the used variable
nX = the block where the variable is defined or redefined
mX = blocks where the variable is c-used
DPU: a definition-p-use association if it has a def-clear subpath from i to (j,k)
dpu(xyz, nX) = {mX → vX}
xyz = the name of the used variable
nX = the block where the variable is defined or redefined
mX, vX = blocks where the variable is p-used
a du-path p’ if p’ is a subpath of p

Data Flow Testing Strategies ( http://www.whiteboxtest.com/DataFlowTesting.php )

All Definitions: Test cases for each definition of each variable
All Predicate Uses: Test cases are generated so that there is at least one path from each variable definition to each p-use of the variable.
All Computational Uses: Test cases are generated so that there is at least one path from each variable definition to each c-use of the variable.
All P-uses/Some C-uses: For every variable, there is a path from every definition to every p-use of that definition. If there is a definition with no p-use following it, then a c-use of the definition is considered
All C-uses/Some P-uses: For every variable, there is a path from every definition to every c-use of that definition. If there is a definition with no c-use following it, then a p-use of the definition is considered
All uses: For every use of the variable, there is a path from the definition of that variable to the use.
All du-paths: This is the strongest data flow testing strategy. It covers every du-path from every definition of every variable to every use of that definition

Subsumption Relationships

Counter Examples

all defs → all c-uses / all p-uses
Three branchings with calculations in DFG
all c-uses → all p-uses (& vice versa)
Only p-uses in program
all c-uses → all c-uses/some p-uses (& vice versa)
all/some needs more test cases, because whenever there is no c-use, a p-use will be taken instead
the other way round it works, because all c-uses covered by all/some are also covered by all c-uses

2.5.3. Inter-procedural data flows

Idea: consider def-use relationships across procedure boundaries
Extends data flow testing to relationships between formal and actual parameters
see OO testing

2.5.4. Conclusions about Data Flow Testing

Advantages
Data flow testing overcomes ‘locality’ of decision-based testing
Structured
Measurable

Disadvantages
Almost no tool support
Requirements not taken into account
Missing def-uses cannot be tested
Might be expensive

Agreement: Use as complement, not alone!
Underlying Fault Model:
failures that materialize as erroneous def/use pairs

3. Combinatorial Testing

3.1. Motivation
Example: Automotive Infotainment, multiple components
All combinations are a lot of combinations → combinatorial testing to reduce the
number of test cases
Fault Model: failures are a result of combinations of n components, not all

With n parameters, each of which can take p values, we have p^n combinations.
Assuming that failures are a result of the interaction of only 2 (3, 4, ...) parameters.
➨ Pairwise (t-wise) interactions

3.2. Latin Squares

Each symbol occurs precisely once in each row and each column
Each entry in a Latin square represents one combination of three parameters: <row, column, value>
→ The set of all these triples guarantees pairwise coverage

Orthogonal Latin Squares:
Needed if more than three parameters must be handled

1. Square + 2. Square → orthogonal Latin square
1 2 3 1 2 3 (1,1) (2,2) (3,3)
3 1 2 2 3 1 (3,2) (1,3) (2,1)
2 3 1 3 1 2 (2,3) (3,1) (1,2)

Latin Squares when one parameter has 4 options:

1. Square + 4th row → Test Cases
1 2 3 3 1,1,1 1,2,2 1,3,3
3 1 2 2 2,1,3 2,2,1 2,3,2
2 3 1 1 3,1,2 3,2,3 3,3,1
1,4,3 2,4,2 3,4,1
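
The pairwise-coverage claim for the first square can be checked mechanically. A minimal Python sketch, assuming three parameters with the values 1 to 3 each (the helper name is made up):

```python
from itertools import combinations, product

# The first Latin square from above.
square = [
    [1, 2, 3],
    [3, 1, 2],
    [2, 3, 1],
]

# One test case per cell: <row, column, value>.
tests = [(r + 1, c + 1, square[r][c]) for r in range(3) for c in range(3)]

def covers_all_pairs(tests, domains):
    """Check that every value pair of every two parameters occurs in some test."""
    for i, j in combinations(range(len(domains)), 2):
        needed = set(product(domains[i], domains[j]))
        seen = {(t[i], t[j]) for t in tests}
        if needed - seen:
            return False
    return True

domains = [[1, 2, 3]] * 3
print(len(tests), "tests instead of", 3 ** 3)  # 9 tests instead of 27
print(covers_all_pairs(tests, domains))        # True
```

Removing any single test breaks pairwise coverage here, because each pair is covered by exactly one cell.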

3.3. Conclusions

Fault models
sneak into requirements based testing in special cases
Combinatorial Testing relies on a specific fault model

CAUTION!
→ No guarantee that this works/ the fault model holds!

4. Random and Statistical Testing

4.1. Motivation

Random Testing is considered the “gold standard” for testing; everything is compared to it
How can we prove a method is ‘better’ than random testing?
Maximization of the cost-benefit ratio → test what is used more often (usage profiles)

4.2. Grounds
Definitions:
P → program
D → input domain of size d
m → number of points that produce incorrect output
n → number of test cases (or test input data)
→ In general, d >> m
Failure rate Θ = m / d (for a subdomain D_i: Θ_i = m_i / |D_i|)
It is the probability that a failure-causing input will be selected as a test case.

// This is again followed by the comparison of RT vs. PT as better, worse, or the same, via Pr(p) = Pr(r) / Pr(p) > Pr(r) / Pr(p) < Pr(r) → Why does that come up here? //

4.3. Weyuker and Jeng (fixed failure rates)

Goal: Analytical comparison of random & partition testing
Revealing / homogeneous subdomains
One subdomain contains only failure-causing inputs
Extremely unusual in practice
Forms of partition-based testing
Code Coverage Criteria: statement, branch and path coverage divide the input domain into subdomains
Data Flow Testing Criteria: also divide the input domain into subdomains
Mutation Testing
Exhaustive Testing: single-element subdomains
Random Testing: partition consists of one class
Definitions
Ideal case: random selection is based on the operational distribution → this information is not always available, so the comparison is based solely on the probability of finding at least one failure-causing input (might be problematic!)
Replacement: we assume that testing is done with replacement → one test case might be chosen twice, although that’s unlikely
Designing Partitions
Good Partition Testing Strategy: inputs that are likely to be failure-causing are put together in one subdomain
→ Probability Pp will be high (Pp = prob. of finding at least one failure-causing input)

Subdomains of equal size
Partition testing can, in general, be better, worse or the same as random testing

→ Superiority (equal or better!) only for: same number of tests and elements in each subdomain

Refinement of domains
Difficult: Group failure-causing inputs → need to know which inputs are failure-causing.
Splitting a subdomain into equally sized subdomains with the same number of tests is sometimes advantageous and NEVER disadvantageous
Conclusions
Partition testing can be better, worse or the same as random testing, depending on how the partitioning is done.
Partition testing will be good when partitions are fault-based.
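
The "better, worse or the same" result can be reproduced numerically. The sketch below uses the standard formulas Pr = 1 - (1 - m/d)^n for random testing and Pp = 1 - Π (1 - m_i/d_i)^(n_i) for partition testing, with the numbers of the introductory example (d = 100, m = 8, n = 2). The concrete partitions are assumptions chosen for illustration, each with one test per subdomain.

```python
from math import prod

def p_random(d, m, n):
    """Probability that n random tests (with replacement) hit a failure."""
    return 1 - (1 - m / d) ** n

def p_partition(subdomains):
    """subdomains: list of (size d_i, failure-causing inputs m_i, tests n_i)."""
    return 1 - prod((1 - m_i / d_i) ** n_i for d_i, m_i, n_i in subdomains)

# Hypothetical partitions of the same domain into two subdomains.
partitions = {
    "worse (1 | 99, failures in the large block)": [(1, 0, 1), (99, 8, 1)],
    "same (50 | 50, failures spread evenly)": [(50, 4, 1), (50, 4, 1)],
    "better (50 | 50, all failures in one block)": [(50, 8, 1), (50, 0, 1)],
    "best (8 | 92, revealing subdomain)": [(8, 8, 1), (92, 0, 1)],
}

print("random:", round(p_random(100, 8, 2), 4))  # random: 0.1536
for name, subdomains in partitions.items():
    print(name, round(p_partition(subdomains), 4))
```

The partition only pays off when failure-causing inputs are concentrated in a block, which is exactly the knowledge we usually lack a priori.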

4.4. Gutjahr (failure rates as random variables)

Weyuker and Jeng assumed fixed failure rates,
BUT: the real failure rate is never known
General Conclusion: Partition based testing performs better
Failure rate is a random variable → expected value can be derived from empirical
studies and statistics
Expected value depends on the class of program (e.g. difficulty)
Effects particularly strong if a partition consists of few large and many small
subdomains
Independent failure rates with identical expected value in the subdomains guarantee
a higher Pp

Expected Failure Rate

M: number of failure causing inputs (random variable)
d: Size of the input domain

4.5. Summary

Weyuker & Jeng
Random Testing can be better, worse or the same as partition testing
In special cases, partition testing is better: equal failure rates and equal
distribution of failures
Gutjahr
deterministic assumptions on failure rates favor random testing
if failure rates are considered random variables, partition testing is favored
(assumption: equal expected failure rates and one test per block)
Different measures for ‘better’
At least one failure
# of faults detected
Weighted # of faults detected
Is random testing worthless? No, because:
Assumption: Identical expected failure rates
Uniform distributions of test selection from the input domain
It’s cheap, but there is no oracle

“Dear Gutjahr, how do you know that the expected failure rates are equal?”
We don’t know. But we assume that they are. If they weren’t, we would pick tests from the subdomain that has the highest failure rate, right? :P (# Gutjahr sending trollface)
This is the weakest point in Gutjahr’s theory, by the way.

4.6. Statistical Testing

Goal: maximize the cost-benefit ratio → test most frequent interactions first/more
(i.e. the app should not crash at the login screen…)
80% of users use 20% of the functionality
Two concepts: frequency and severity of failures
Stopping criteria: should relate to reliability of software
Reliability: probability of failure free functioning for a given time

4.6.1. Usage profiles in Software Reliability Engineering

Operational/usage profiles are models of the environment
Consists of a use case and an occurrence rate
External oracle needed (trivial one: “No exception is thrown”)

→ FoneFollower Example
Musa advises 23 tests / KLOC in order to get a reliable program, and 2033 tests / KLOC in order to get a high reliability program. (Note that these figures are old.)
Number of test cases = (available time * available staff) / average time for preparation of 1 test case
Number of new tests = N/(1-N)*T
where N is the total occurrence probability of the new operations and T is the total cumulative number of test cases from all previous versions
e.g. if a new component is used with probability 0.2, then 0.2/0.8 * number of test cases.
Between base product and variations:
U (Unique new Uses) = F (expected fraction of field use) * S (Sum of NEW occurrence probabilities)
Test cases are assigned in proportion to the “U” value of each version.
To test crucial elements (severity)
FIO (Failure Intensity Objective): # of failures per time unit or operations
Acceleration factor (A) = FIO(system)/FIO(operation)
Test cases for critical tasks: A * occurrence proportion * # total test cases
4.6.2. Usage Profiles based on Markov chains

Definition:
A Markov chain is a stochastic process (a family of random variables indexed by time) in which the next state depends only on the current state!
Consider the Markov chain a finite automaton with probabilistic transitions
Two phases
States/transitions
Probabilities

Benefits
analysis purposes
see the probabilities for reaching states from start state
Compute test cases (JUMBL) but we need an oracle or restrict ourselves
to “exception/no exception” verdicts.
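
A minimal Python sketch of test case generation from a Markov chain usage model. This is not how JUMBL works internally; the states and probabilities are invented for illustration, and each generated path still needs an oracle.

```python
import random

# Hypothetical usage model: state -> list of (successor, probability).
usage_model = {
    "Start":   [("Login", 1.0)],
    "Login":   [("Browse", 0.8), ("Exit", 0.2)],
    "Browse":  [("Browse", 0.5), ("Reserve", 0.3), ("Exit", 0.2)],
    "Reserve": [("Browse", 0.4), ("Exit", 0.6)],
}

def generate_test_case(model, rng, start="Start", end="Exit"):
    """Random walk from the start state to the end state."""
    state, path = start, [start]
    while state != end:
        successors, weights = zip(*model[state])
        state = rng.choices(successors, weights=weights)[0]
        path.append(state)
    return path

rng = random.Random(42)  # fixed seed so the generated suite is reproducible
for _ in range(3):
    print(" -> ".join(generate_test_case(usage_model, rng)))
```

Frequently used transitions show up in more generated test cases, which is the point of testing according to a usage profile.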

4.6.3. Conclusion

Motivation
use actual usage data → test most frequent interactions
take into account critical operations
optimize costbenefit ratio

Good when…
...an earlier version of the system exists, and thus we have an operational profile ?!

Difference between Usage Profiles / Markov chain based usage profiles
Granularity (attention to detail → Markov chains are more detailed, I guess...)
Probability of a function is easier to define than that of a transition

Use it when:
Boss tells you that he doesn’t want the program crashing while commercial people use it (i.e. the 20% of the program that is used by 80% of the people should run flawlessly).
When severity is not crucial. (Nothing is going to happen if my POM app crashes.)
You have a former “base” product from which you can get the statistics. If not, you can still derive them from simulation, expectancy, measurements.

DON’T use it when:
Severity is critical: nuclear reactor (go with code reading then)
5. Model Based Testing

5.1. Motivation

Oracle Problem
Automatically derive tests that include fine-grained expected output information (that’s our main motive!)
Specifications tend to be bad
→ derive test cases from an abstract model

5.2. Models

Abstraction
Model needs to be more abstract than the system
→ information is lost
2 kinds of abstraction:
Encapsulation (e.g. libraries) → encapsulate complexity
Subroutines, Garbage collector etc.
Simplification / Omission → omit details → information cannot be reinserted
Abstraction is needed to validate the model → cannot be too complex
If the model is as precise as the SUT → directly validate SUT
→ You have the same work, that’s not cost effective!
If you model 100% SUT (but 0% environment), you have to test for all possible inputs. (In the case of a car this would mean that you have a VERY precise model (unnecessarily precise: with 100% SUT we didn’t make any abstraction), but you are going to test this model for speeds from 0 to 1000 km/h, since you have no clue about the environment. That doesn’t make sense. Also, the reason for modelling is to make abstractions and not to copy the real world.)
If you model 100% environment, then you don’t have any clue about the system, so the best thing you can do is robustness testing (e.g. you apply all the inputs and see whether the system crashes).
Models can be described as Mealy machines (which are state machines with input / output!)
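
A Mealy machine model is easy to write down as a transition table, and it delivers the expected outputs for any input sequence. A Python sketch with a made-up chip card model (states, inputs and outputs are assumptions for illustration):

```python
# Hypothetical chip card model: (state, input) -> (next state, output).
MODEL = {
    ("locked", "wrong_pin"): ("locked", "error"),
    ("locked", "correct_pin"): ("unlocked", "ok"),
    ("locked", "sign"): ("locked", "denied"),
    ("unlocked", "wrong_pin"): ("unlocked", "error"),
    ("unlocked", "correct_pin"): ("unlocked", "ok"),
    ("unlocked", "sign"): ("unlocked", "signature"),
}

def expected_outputs(inputs, state="locked"):
    """Run the model on an input sequence and collect the expected outputs."""
    outputs = []
    for symbol in inputs:
        state, output = MODEL[(state, symbol)]
        outputs.append(output)
    return outputs

def covered_transitions(inputs, state="locked"):
    """Transitions of the model exercised by an input sequence."""
    covered = set()
    for symbol in inputs:
        covered.add((state, symbol))
        state, _ = MODEL[(state, symbol)]
    return covered

test_inputs = ["wrong_pin", "sign", "correct_pin", "sign"]
print(expected_outputs(test_inputs))  # ['error', 'denied', 'ok', 'signature']
print(len(covered_transitions(test_inputs)), "of", len(MODEL), "transitions covered")
# prints "4 of 6 transitions covered"
```

The test verdict comes from comparing these expected outputs with what the SUT actually returns for the same inputs.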

5.3. Scenarios

One model for both, test cases and code generation
generated from models, e.g. Matlab/Simulink
Use it when:
Code generators
Assumption on the environment
Performance
Exceptions

Cons :
no redundancy → no verification
BUT: not useful to generate code and test cases at the same time! For test
case generation you need a separate model! → no redundancy!
(so basically same as the first Con?!)

Two Models
One model for test cases, one for code
Cons
Expensive → possible solution: split between OEM and supplier
Also when there are changes in the requirements (double expense to modify both test and code model! → However, if the specification is solid (automotive industry etc.), then it probably won’t occur.)
if there is a fault you need to find out which model was faulty
Pro:
Redundancy → need to make sure the models are different
Different levels of abstraction possible

Use it when:
Specification is VERY exact (i.e. car manufacturer (system model) / suppliers (test model))

Model for test only
Pro:
Redundancy
Con:
(Expensive) → we don’t know if it’s cheaper
Code and model need to be kept in sync (interleaving) → hard!
Model has to be tested itself
Specification doesn’t profit from model based testing.
Hard when the requirements are changing.

Use it when:
Conformance tests: OEM builds the model, suppliers must show adherence to
model / conform to the model
Scenario of our running chip card example

Model Extraction from Code
Pro :
May make sense if you extract the model manually
Cons :
With automatic extraction there is NO redundancy.

Use it when:
Ex-post development of test cases only
(exception/no exception possible ?)
(Also useful if there are no requirements documents ?)
5.4. Selection Criteria

Structural criteria
Mealy Machine as a Model!
Coverage for example “find state A” or “use transition T” etc.
Do these with model checkers.
Functional selection criteria
Relate to a specific functionality or requirement! (For example, a function is: Enter correct PIN → set internal data structures → set private key → perform computation → show results.) In the case of the POM Android app it would be like “the ability to reserve a pedelec” and then a test case that goes through the whole sequence, I guess.
Then you can also set constraints like “don’t visit a given state twice” etc.
Aim is also to detect illegal permutations, missing signals or anything that relates to functionality issues
Stochastic criteria
I guess here we are creating test cases based on user profiles.
Fault Based
Sneak paths etc., so I select my test cases in a way that they aim for these sneaky paths.

Don’t forget the Mealy Machines! (Homing, Synchronizing sequence, State Identification,
State Verification, What is a minimal mealy machine, tests for mealy machines etc.)

5.5. Test Case Generation
Search Problem:
Find and enumerate traces that cover edges/nodes/special data in the control and data flow
graphs (in terms of TC specification)

Techniques:
Model checker (you tell it that you can’t get to state X and it will try to prove you wrong, i.e. come up with a counterexample trace that reaches X)
Deductive Theorem Proving: 1) Behavior models (translated or given in a specification language) 2) Search problem says “there is a path” 3) Prover then finds an input for it
Symbolic execution: Execution of a program on sets of values (difficult to handle if there are large sets of constraints)
Other search strategies (A* Chao Chen Power, Depth First, Breadth First etc.)
Don’t rely on structural criteria only!
We can automatically generate test cases based on coverage criteria, BUT it has nothing to do with the errors found later on...

5.6. Cost Effectiveness

There is no evidence that it would be… (So it is unknown)

Conclusion:
“We can automatically generate test cases with the environment model and the
model of the SUT, but we still need to tell the machine what a good test case
consists of”
“We can also automatically generate test cases based on coverage criteria, but as we know it won’t have anything to do with detecting faults / triggering failures.”
Good with network adapters, since we can abstract from the CPU/PC and everything else and concentrate only on the network / network traffic. (CISCO etc.)
6. OO Testing
6.1. Differences of OO Software

Inheritance
Object/Class based on another Object/Class using the same implementation
Mechanism for code reuse
Polymorphism
Single Interface for different types
Dynamic Binding
mechanism in which the method being called upon an object is looked up by
name at runtime
Encapsulation and state, sequencing of messages
Method a sets flag f, method b uses f (branch coverage may or may not
provoke the failure)

6.2. State-based Testing

Idea
Classes define values and operations
Classes define state machines
State of a program: values of all variables + program counter
You can create as many states as you like for a single class → Refactoring yields
different test cases
Problem: How to build state machines?
Coverage? → NO! If the problem / program is complex, Jon might create a
different state machine than Josh, so coverage measured on the state machine is arbitrary!
Take the stack as a state machine? Maybe, but here as well you can come
up with many solutions. (Use 2 states: stack empty / full, or use more states...)

Harel Statecharts
Classic state diagrams require the creation of distinct nodes for every valid combination
of parameters that define the state. This can lead to a very large number of nodes and
transitions between nodes for all but the simplest of systems (state and transition explosion).
This complexity reduces the readability of the state diagram. With Harel statecharts it is
possible to model multiple cross-functional state diagrams within the statechart. Each
of these cross-functional state machines can transition internally without affecting the other
state machines in the statechart. The current state of each cross-functional state machine in
the statechart defines the state of the system. The Harel statechart is equivalent to a state
diagram but it improves the readability of the resulting diagram.
Hierarchy
Orthogonal states (e.g. cruise control + anti-lock brake control work in parallel)

Potential Errors in Statecharts
Missing or incorrect transition
A transition is missing, or a transition leads to an existing, correct state
that it is not supposed to lead to (i.e. wrong wiring).
Missing or incorrect action
Instead of adding +1 point to the player's score, we take one from it.
Extra, missing or corrupt state
With a defined action we go into a corrupt state
Sneak path or illegal message failure
(undesired circuit through a series-parallel configuration)
With a defined method we go into an unwanted state. (e.g. Player 2 presses
“X” on his controller and wins automatically)
Trap door (undefined messages accepted)
Incorrect composite behavior (subclass superclass interaction incorrect)
Subclass method incorrect

Testing Strategies
All transitions, pairs of transitions,...
Condition tests with boundary conditions
Illegal transitions enforce all illegal events
Method pre and postcondition check

Example of sneak path testing: check from every state whether each method leads to a
defined state. Mark the pair PSP (possible sneak path) if the method is not defined for the
given state, (?) if it is conditional, and (OK) if it is defined.
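
The state/method check described above can be sketched as a small response matrix. The two-state stack model, the event names and the helper functions are hypothetical and only serve as illustration.

```python
# Hypothetical two-state stack model: only these (state, event) pairs are specified.
STATES = ["empty", "loaded"]
EVENTS = ["push", "pop", "top"]
SPECIFIED = {
    ("empty", "push"): "loaded",
    ("loaded", "push"): "loaded",
    ("loaded", "top"): "loaded",
}
# Transitions whose target depends on a guard (here: is this the last element?).
CONDITIONAL = {("loaded", "pop")}

def sneak_path_matrix(states, events, specified, conditional):
    """Classify every state/event pair as OK, ? (conditional) or PSP."""
    matrix = {}
    for s in states:
        for e in events:
            if (s, e) in conditional:
                matrix[(s, e)] = "?"
            elif (s, e) in specified:
                matrix[(s, e)] = "OK"
            else:
                matrix[(s, e)] = "PSP"
    return matrix

def sneak_path_tests(matrix):
    """Each PSP cell becomes a test: send the event in that state and expect a rejection."""
    return sorted(pair for pair, mark in matrix.items() if mark == "PSP")
```

Every PSP cell is a candidate test case: bring the object into that state, send the message, and check that it is rejected instead of silently changing the state.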
6.3. Testing with Inheritance

Fault Model
Incorrect Initialization
The New() method is often inherited
Inadvertent bindings
Private variables are invisible in the subclass, public variables might be shadowed
Problems: multiple inheritance, missing override (omission of a
subclass-specific implementation)
Naked Access
Not using getters and setters
Incorrect location of subclass
Square is subclass of rectangle → square needs equal sizes, rectangle allows
for different sizes
Naughty Children
Subclass leaves object in a state that is illegal in the superclass
Worm Holes
Subclass object computes values that are not consistent with superclass
invariant
Spaghetti Inheritance
Deep hierarchies → error-prone

How to test these?
Flattening! → A flattened class makes all inherited features explicit.
Can we reuse tests from the superclass? Yes, you can, but you have to test the
new interactions with the child class methods.
6.4. Testing with polymorphism
Conformance with Liskov’s substitution principle:
It states that, in a computer program, if S is a subtype of T, then objects of type T may be
replaced with objects of type S (i.e., objects of type S may substitute objects of type T)
without altering any of the desirable properties of that program (correctness, task
performed, etc.).
All of the inheritance problems (fault models) apply here as well.
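
A minimal sketch of how a superclass test can be reused to check substitutability, using the square/rectangle case from 6.3. The class layout and the name `check_set_width_contract` are assumptions for this example only.

```python
class Rectangle:
    def __init__(self, width, height):
        self.width, self.height = width, height

    def set_width(self, width):
        self.width = width

    def area(self):
        return self.width * self.height

class Square(Rectangle):
    def __init__(self, side):
        super().__init__(side, side)

    def set_width(self, width):
        # Keeps the square invariant, but breaks the superclass contract.
        self.width = self.height = width

def check_set_width_contract(shape):
    """Superclass test, reusable for any subtype: set_width must not touch height."""
    height_before = shape.height
    shape.set_width(height_before + 1)
    return shape.height == height_before
```

Running the same check against `Rectangle(2, 3)` and `Square(2)` shows the "incorrect location of subclass" fault: the test written for the superclass fails for the subclass.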
6.5. Conclusions

Special problems induced by special language constructs
Inheritance & Polymorphism
Encapsulation
Trade-off between benefits of inheritance/polymorphism on one side and cost of
testing on the other
State based testing reflects discussions on model based testing
Flattening Classes
Good design helps avoid errors → Contracts about design practices!

6.5. Examples
Mike has written a C++ program with fantastic object orientation and
encapsulation. Mike is, however, a very inexperienced programmer, and now
struggles to find the faults in his program. What would you suggest?
Flatten the program / state machine to see the interactions better
Write test cases that specifically target fault models (naughty children, incorrect
initialization, inadvertent bindings etc.)

7. From Failures to Tests and Faults

7.1 Motivation

Are there further ways to extract tests?
How to get from failures to faults? → Fault Localization
7.2 Test extraction by reproducing program executions

Idea: record all relevant input → “Capture & Replay”
Useful for: testing GUIs, testing systems that have produced a database, recording “playing
around” by the programmer

Advantage: inherent underlying fault model; we don’t know about cost-effectiveness

Definitions

State of a program: values of all variables + program counter
Stack: consists of frames
frames → method invocations

frames contain:
values of arguments
function-local variables
return address
…

elements of a frame can reference a heap address
Heap: dynamically allocated memory
contains data

7.2.1. State-Based Test Extraction

External Event influences the State of the program
Every Stack Frame can be used to derive a Test Case
Heap elements have to be copied as well as stack elements
Shallow Copy: only directly referenced objects (“copy only the objects that
are directly pointed by the stack frame”)
Deep copy: indirectly referenced objects (expensive!)
Two Variants
Shadow Copy / Invocation State Extraction
maintain a shadow copy of the stack before every
method invocation
→ HUGE overhead
Failure State Extraction
Serialize the stack only if an exception is thrown
zero overhead
approximation: state at failure != state at method invocation
e.g. buffer overflow → useless stack
in practice: 90% of failures can be reproduced without shadows
Top Level / Bottom Level Frames
Top Level Frames → unit tests
Bottom Level Frames → integration test
Problem
External Events are not taken into account → Non-automatic tests!
Top Level Frame is only sufficient to reproduce the failure if:
deep copying of heap element occurs
no external events between method invocation and failure
Shallow copying might mean the relevant parts are ignored

⇒ Serialize ALL frames to increase the chances of getting a test case that reproduces
the failure and is useful for debugging (i.e. it is helpful to find the fault.)

Examples
1.
We have copied the failure-causing stacks to our hard drive. Write a program that uses this
data to create test cases that trigger the failures again.
Steps:
Create pointers for variables (int *variable1; bool *booly1; etc.)
Allocate memory for these variables (variable1 = malloc(sizeof(int)); etc.)
Take the stored data and assign it to the variables (*variable1 =
storeddata1;)
Execute the method (e.g. test(variable1, booly1, ...);)
When to use:
When you are interested in generating test cases that reproduce failures.
When there are no external events (or any other nondeterministic influence)
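
The steps above can be sketched as follows. This is a minimal illustration of failure state extraction, assuming that serializing the call arguments with `pickle` is an acceptable stand-in for copying a stack frame and the heap objects it references; `record_failure_state`, `replay` and `average` are hypothetical names.

```python
import pickle

def record_failure_state(func, store):
    """Wrap func so that its arguments are serialized only when it raises
    (failure state extraction: no serialization cost on passing runs)."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # pickle copies everything reachable from the arguments (a deep copy).
            store.append(pickle.dumps((args, kwargs, type(exc).__name__)))
            raise
    return wrapper

def replay(func, blob):
    """Rebuild the recorded arguments and re-run func; return the exception name (or None)."""
    args, kwargs, _expected = pickle.loads(blob)
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return None

def average(values):          # hypothetical unit under test
    return sum(values) / len(values)
```

Note that the arguments are captured at the time of the failure, not at the time of the invocation, which is exactly the approximation mentioned above: if the method mutated its arguments before failing, the replayed state differs from the original call.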

7.2.2. Event-Based Test Extraction

Idea
automatic tests, unlike with state-based extraction
Interaction with GUI is a very relevant part of a program
Record all events and replay them

Method
Insert a layer between SUT and environment
Tests can be derived by
looking at the internal state
using a debugger
using shadow stacks and deeply copied heap structures
research stage
looking at external states
Interceptors between window manager and GUI (i.e. record the
sequence the buttons are pressed on the UI)
Industrial tools available

When to use:
When there is external (and unpredictable = stochastic) interaction with the SUT.
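
A minimal capture-and-replay sketch of the "layer between SUT and environment". The `Counter` SUT, the `Recorder` layer and the event names are invented for illustration; industrial tools intercept real GUI events instead.

```python
class Counter:                      # hypothetical SUT driven by external events
    def __init__(self):
        self.value = 0

    def handle(self, event):
        if event == "inc":
            self.value += 1
        elif event == "reset":
            self.value = 0
        else:
            raise ValueError(event)

class Recorder:
    """Layer between environment and SUT: logs every event, then forwards it."""
    def __init__(self, sut):
        self.sut, self.log = sut, []

    def handle(self, event):
        self.log.append(event)
        self.sut.handle(event)

def replay(log, make_sut):
    """Feed a recorded event sequence to a fresh SUT and return it for inspection."""
    sut = make_sut()
    for event in log:
        sut.handle(event)
    return sut
```

The recorded log is the test input; the state observed during recording (or a thrown exception) serves as the expected result when the log is replayed.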
7.3. Delta Debugging

Motivation
minimize tests by removing parts that do not contribute to the failure (since test cases are
usually huge)
Isolate the failure-inducing difference → easy fault localization
Delta Debugging automates minimization and localization, starting with a failing and
a passing test case

Method

Split failing input and test resulting parts
In general: for any subset, test both the small subset and the large
complement!
Algorithm:

Conclusions

Applicability depends on circumstances (serial input is good)
Scalability is difficult → “Still requires (2^abs(c)) - 2 test cases!” We don’t fancy
exponential stuff...

How to get around this problem: keep adding pieces as long as the run still doesn’t fail.
Hopefully at the end we get a minimal difference between a failing and a successful run. This is
not the fault itself, but may lead the programmer to the fault.
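
A simplified sketch of the minimizing variant (ddmin) for list-like inputs, assuming a user-supplied predicate `fails(candidate)` that runs the test and returns True if the failure still occurs. It follows the rule "test both the small subset and the large complement" but is not a faithful copy of the published algorithm.

```python
def ddmin(failing_input, fails):
    """Simplified ddmin: shrink a failing list while fails(candidate) stays True."""
    data = list(failing_input)
    n = 2  # current number of chunks
    while len(data) >= 2:
        size = max(1, len(data) // n)
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        reduced = False
        for i, chunk in enumerate(chunks):
            complement = [x for j, c in enumerate(chunks) if j != i for x in c]
            if fails(chunk):            # the small subset alone still fails
                data, n, reduced = chunk, 2, True
                break
            if fails(complement):       # the large complement still fails
                data, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(data):          # already at single-element granularity
                break
            n = min(len(data), n * 2)   # refine the split and try again
    return data
```

For example, with a failure that needs both 3 and 7 in the input, `ddmin(list(range(1, 9)), lambda c: 3 in c and 7 in c)` shrinks the eight-element input to the two relevant elements.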

7.4. Hit Spectra, Code Metrics, Code Churn

Hit Spectra
Single faults, multiple faults
“A posteriori” hints
Abstractions of program runs
Idea:
Set up a counter for how often each block is executed for a specific input
The more similar a column is to the failure column, the more likely the
corresponding block contains a fault (similarity based on Ochiai, Tarantula and co.)
Blocks executed by more failed tests are more likely to contain a fault
Similarity: many different measures for similarity
Conclusion:
Nice idea, but no general superiority of any similarity method.
Applicability unclear (especially poor with multiple faults …)

Code Metrics
Complexity, imports
“A priori” hints
Complexity → Defects
Idea:
Predict defect densities by looking at the imports a class makes
e.g.: import of compiler internal → more likely to contain faults
Conclusion:
@Normal (slide 64): “They correlate, but no universal metric”
@IMPORT: ⅓ false positives, ⅓ not detected = IT IS USELESS

Code Churn
“A priori” hints
Amount of Change → Defects
Idea:
Count how often the code is changed: how many lines added? how often was the file
edited? how many lines deleted? how many files? how many changes?
Conclusion:
Excellent indicator (but ONLY of # of faults)

Overall Conclusion

Open research
Promising results for single projects
Not used commercially yet
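
To make the hit spectra idea concrete, here is a small sketch that computes the Ochiai and Tarantula scores per block. The data layout (one row per test, one column per block) and the function name are assumptions for this example; the two formulas are the standard ones.

```python
import math

def suspiciousness(spectra, failed):
    """spectra[t][b] is True if test t executed block b; failed[t] is True if test t failed.
    Returns (ochiai, tarantula) score lists, one entry per block."""
    total_failed = sum(failed)
    total_passed = len(failed) - total_failed
    ochiai, tarantula = [], []
    for b in range(len(spectra[0])):
        ef = sum(1 for t, row in enumerate(spectra) if row[b] and failed[t])
        ep = sum(1 for t, row in enumerate(spectra) if row[b] and not failed[t])
        denom = math.sqrt(total_failed * (ef + ep))
        ochiai.append(ef / denom if denom else 0.0)
        fail_ratio = ef / total_failed if total_failed else 0.0
        pass_ratio = ep / total_passed if total_passed else 0.0
        total = fail_ratio + pass_ratio
        tarantula.append(fail_ratio / total if total else 0.0)
    return ochiai, tarantula
```

A block that is executed only by failing tests gets the maximum score under both measures; the measures differ in how they rank blocks that are executed by a mix of passing and failing tests.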

7.5. Alternatives: directly target faults

Idea
Reading → detect faults

Cleanroom Software Engineering
No testing or compiling to find faults, just a text editor
They read the code many many times
Rather ensure the quality of the process
0-1 faults/KLOC

Pair Programming
2 Programmers
1 writes the code
1 reads the code and interjects if he sees a fault
Also comment on style, readability, maintainability, efficiency.

Walkthroughs
informal
no moderator
no roles
initiated by author
unplanned
people just sit together and discuss a piece of code
Goal: increase technical quality

Inspections
formal
moderated
initiated by project team
Checklists (→ fault models)
Analysis of results
Problem with checklists: people rely only on the lists, blindly → miss the obvious
Example: Fagan Software Inspections
Formal peer-review technique
Based on checklists
Roles: Moderator (competent programmer), Designer, Coder, Tester
Activities: Planning, Overview, Preparation, Inspection Meeting, Rework, Follow-up

Review
Coworkers read and correct the code
often at the workplace
informal, no meetings or planning

Conclusions
Problem: Using results of inspections for assessment of employees → delicate topic
Code Inspection is as effective as testing
It finds different types of defects than testing
Testing: Integration Problems
Reading: Focus on single artifacts

→ You should always combine code reading and testing!
→ Combination (if you already have some test cases, according to Dominik):
Testing first (?!) → easy faults removed before reading
Code reading
Rerun tests

8. Assessment of the quality of test-suites

8.1. Motivation

The quality of a test suite is difficult to assess
Definition of a good test suite? Problem:
First failure, all failures, severe failures
Cost
Cost of debugging, fault localization
...
8.1. Fault Injection

Idea
Inject a fault into the program
Rerun Test Suite
If the suite finds the fault, it is a good test suite

Problems
What faults to inject?
Cannot be automated easily
It is hard to actually come up with “non-equivalent” mutations.

8.2. Mutation Testing

A way of automatically injecting faults
Does NOT inject realistic faults
Very small syntactic changes
ONE change per program
Mutations are applied to one program; a kill can be detected by using the
original program as the oracle

→ Create lots of mutant programs → Run test suite on each → Ratio == Quality

This is a good idea if…
...the injected faults correlate with real faults

Equivalent Mutant
Same semantics/behaviour as the original program, e.g. the change was made in dead code,
or doesn’t change the semantics.
→ Undecidable problem (in general you cannot determine whether two programs do the same thing)
→ Such mutants cannot be killed
→ Thus in practice we simply go by “mutants”, since we can’t decide
equivalence.

Mutation Score
Exact definition: #killed mutants / #non-equivalent mutants
Practical approximation: #killed mutants / #mutants (every killed mutant is non-equivalent)
→ Problem: We don’t know which mutants are non-equivalent!
→ Realistic fault injection: we don’t know if the injected faults are realistic at all
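
A tiny worked example of the mutation score. The unit under test, the three hand-written mutants and the helper names are hypothetical; real tools generate mutants automatically from the source code.

```python
def original(a, b):            # hypothetical unit under test
    return a if a > b else b   # maximum of two numbers

# Hand-written first-order mutants: each differs from original by ONE small change.
mutants = {
    "a < b":  lambda a, b: a if a < b else b,    # relational operator replaced
    "a >= b": lambda a, b: a if a >= b else b,   # equivalent mutant: same result for all inputs
    "swap":   lambda a, b: b if a > b else a,    # returned operands swapped
}

def killed(mutant, tests):
    """A mutant is killed if any test input makes it disagree with the original (the oracle)."""
    return any(mutant(*t) != original(*t) for t in tests)

def mutation_score(mutants, tests, equivalent=()):
    """#killed mutants / #non-equivalent mutants; 'equivalent' must be supplied by hand."""
    candidates = [name for name in mutants if name not in equivalent]
    kills = sum(killed(mutants[name], tests) for name in candidates)
    return kills / len(candidates)
```

With the single test input `(1, 1)` no mutant is killed; with `(1, 2)` both non-equivalent mutants are killed. If the equivalent mutant is not identified by hand, the best achievable score stays at 2/3, which illustrates why the score is hard to interpret in practice.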

Assumptions:
1. Competent Programmer Hypothesis
“Programmers write programs that are close to correct, i.e. they contain only small (syntactic) faults”
2. Coupling Hypothesis
“Syntactic faults correlate with other kinds of faults”
→ Test Suite that detects simple faults also detects complex faults
We don’t know whether these assumptions are justified!

1st-order and kth-order mutants
Empirical evidence shows that test data that is successful at killing 1st-order mutants is
also successful at killing kth-order mutants.

Type of mutations
Arithmetic mutations (Operator replacement, insertion, deletion etc.)
Inter-class level mutation operators (access modifier change, super keyword deletion,
argument order change, this keyword deletion etc.)

Strong Mutation & Weak Mutation Testing
Strong: requires the output to change
Weak: requires the expression to yield a different state
Applying the same test suite to mutants usually yields weak mutation score ≥ strong
mutation score

Conclusion:
For a set of eight programs with several versions, [Andrews’05] show for a specific set of
mutation operations that there is a statistically significant correlation between killing
rate and failure detection.
Usually applied with unit tests!
BUT!
But, as he jokes, only if you choose the mutations wisely.
He says it can’t be generalized!
My personal bottom line: deep down he doesn’t seem to believe that it is
an adequate approach.

9. Concurrency Testing

What is concurrency?
Two or more processes or threads execute concurrently
The same resources are accessed
Nondeterministic

Problems
Deadlocks
Livelocks
Atomicity Violation
Order Violations

Conditions for a deadlock:
1. Mutual exclusion (mutex): at least one resource can be held by only one process at a time
2. Hold and wait: one process holds one resource and requests another
3. No preemption: a resource can only be released by the process holding it (the scheduler
decides who gets which resource, but cannot take a resource away)
4. Circular wait: A and B each wait for the other process to release a resource
→ A monitor/watchdog can check if a process is in a deadlock

Atomicity Violations
Code that was meant to be executed as one block is interrupted by the scheduler
(e.g. you make sure with an if statement that the pointer is not null, then you try to
access but it throws a failure, because in the meantime another thread set the pointer
to null.)

Order Violation
Intended order is not followed
Example: Thread 2 dereferences a variable before it was initialized by thread 1.
Maybe 95% of the time thread 1 is faster and thus no failure is seen, but when you
have to present your program you can be sure that the other 5% kicks in and the
program fails in front of everyone.

Group Violation
You think that whether one innocent line gets executed before the other doesn't make a
difference. Well, if that one function calls 50 others in sequence, it might…

Watchdog
As a cunning guy (or girl!) you create a watchdog that checks whether the process is
alive (e.g. every 100 ms). You happily take your program to show it to your prof,
but he uses a very old, slow PC with a floppy drive. It is so slow that, even with no
bug in your code, the process takes longer than the timeout you preset in the
watchdog → the watchdog decides that the process is dead and shuts the whole thing down,
and you walk home with a 5.0 (a failing grade).

Most bugs: atomicity violations, order violations AND a combination of 2 threads with
usually only a few parameters.
→ Use this as a fault model!

Testing Strategies

Test all Schedules
Is it a good thing to test all interleavings?
→ No, it’s impossible. There are too many possibilities.
Even if we could, how would we enforce a specific schedule?
→ Use sleep(). Problem: it takes a long time, and sleep is inaccurate, especially
for small durations
Enforce relevant schedules
manually
toolbased (based on global clock or events)
Use global logical clock! (“Wait for tick”)
Use scheduler to specify desired ordering.
Checking all interleavings is impossible → MODEL CHECKER MAYBE!
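
A sketch of enforcing a relevant schedule with events instead of sleep(). Each thread step waits for its "tick"; the test itself plays the scheduler. The scenario (thread 1 initializes, thread 2 dereferences) mirrors the order violation example above; all names are made up for illustration.

```python
import threading

def run_with_schedule(order):
    """Run two threads whose steps are forced into the given order, e.g. ['init', 'use'].
    Each step waits for its 'tick' (a threading.Event) before it runs."""
    shared = {"ptr": None}
    result = {}
    ticks = {name: threading.Event() for name in ("init", "use")}
    done = {name: threading.Event() for name in ("init", "use")}

    def initializer():                 # thread 1: is supposed to run first
        ticks["init"].wait()
        shared["ptr"] = [1, 2, 3]
        done["init"].set()

    def user():                        # thread 2: dereferences the shared value
        ticks["use"].wait()
        try:
            result["value"] = len(shared["ptr"])
        except TypeError:
            result["value"] = "failure"   # order violation: used before initialized
        done["use"].set()

    threads = [threading.Thread(target=initializer), threading.Thread(target=user)]
    for t in threads:
        t.start()
    for step in order:                 # the test acts as the scheduler
        ticks[step].set()
        done[step].wait()
    for t in threads:
        t.join()
    return result["value"]
```

Both schedules are now reproducible: `["init", "use"]` is the intended order, `["use", "init"]` deterministically provokes the order violation that would otherwise show up only occasionally.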

10. Fault Models

Correct and incorrect programs

A correct program can be mapped to a corresponding incorrect program
The 2 programs only have one syntactic difference

Defect Model
Fault Model: exact fault type in code, e.g. division by zero
Failure Model: e.g. limit testing; try to find methods to cause failures; no fault localization

From the Exercise Quoting Dominik:
Fault Model:
2 Step Process
1st step (if you want to describe a fault): transformation from a fault-free to
a faulty program. You wanted <= but got =. You wanted a try/catch, but didn’t
get it, so it crashes your system.
2nd step: What does this transformation do to my program? (In case of x<5 vs. x<=5 the
path for 5 changes!) → You come up with input space partitions that
reflect this behaviour.
Failure Model:
You DON’T need the transformation, you only need the input space partitioning,
because from it alone you can create test cases that target potential faults with
good cost-effectiveness. E.g. someone tells you that block 7 is most probably going
to cause failures. What do I do? Create a partition that is full of test cases just for
block 7! And now back to Weyuker/Jeng: I have created a partition in which I have
concentrated the failure-causing elements, so the partition is likely failure-causing!
In conclusion: with a failure model, someone tells you that Jon is a careless programmer who
probably got every single line of his code wrong. In this case you know, or have a strong
hypothesis, where the potential faults are, so you come up with a test suite that directly
targets Jon’s messy code. With a fault model, however, you just see that instead of “chocolate”
the program writes “strawberries”. As a first step you think about the possible
transformations that could affect this output (e.g. you look at the data flow of this faulty
output and see which parts of the code affect the output and how). As a second step you
come up with (probably more than one) test groups that try to locate the mistake in the
code (e.g. one partition aims for the limits, the other for typos, and so on). The point, at
least as I see it, is that here you are not sure where the faults are; you are only trying to
come up with tests that are going to trigger the failures.
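
The two steps of a fault model can be illustrated with the x<5 vs. x<=5 case. The functions below are invented for this sketch: `faulty` is the transformed program, `failure_domain` computes where the two programs differ, and `boundary_partition` is the input space partition the fault model suggests.

```python
def correct(x):
    return "small" if x <= 5 else "large"

def faulty(x):                      # injected fault of class "relational operator off by one"
    return "small" if x < 5 else "large"

def failure_domain(lo, hi):
    """Inputs on which the faulty program deviates from the correct one."""
    return [x for x in range(lo, hi + 1) if correct(x) != faulty(x)]

def boundary_partition(lo, hi, boundary):
    """Input space partition induced by the fault model: the boundary block vs. the rest."""
    near = [x for x in range(lo, hi + 1) if abs(x - boundary) <= 1]
    rest = [x for x in range(lo, hi + 1) if abs(x - boundary) > 1]
    return near, rest
```

On the range 0..10 the only failure-causing input is 5, and it lies in the small boundary block. Picking tests from that block therefore has a much higher failure rate than picking uniformly from the whole range, which is the point of limit testing.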

Intuitive Definition of a Fault Model: understanding of specific things that can go wrong in
a program

Fault Model

1. Consider a class of programs written to a specific specification
→ There are some that satisfy that specification, some not

2. Fault = Syntactic difference between a correct and incorrect
program
→ Transformation from a correct to a faulty program possible by
fault injection

3. Classes of Faults
→ Consider all classes with a fault of class K (e.g. division by zero)

4. Consider all programs, that are derived from programs that satisfy the specification by
injecting a fault of class K

5. Approximate fault models
→ we can only make an approximation of these programs (otherwise we would already
know exactly where the failures are)

6. General Idea: Partition the input space according to mapping or heuristic with empirical
evidence to reveal a fault

→ Create Partitions with a potentially high failure rate (cf. ideas of Gutjahr and
Weyuker&Jeng: Partition based testing is better than random testing if there is an underlying
defect model!)

Good Fault Model:
Large number of specifications for which induced failure domains largely overlap with actual
failure domains.

→ Try to approximate a failure domain, that is close to the program's actual faults
→ In other words: Using a fault model can create partitions that are likely failure causing

Failure Model:
No transformation! Only input space partition. E.g. limit testing or combinatorial testing.

List of Fault Models
Boundary Problems → Limit testing (there is empirical evidence)
Division by zero
Minor syntactical problems
Sneak Paths
Stuck-at-one
Combinatorial Testing
Atomicity Violations
Naughty Children (OO)
Some things make sense, even though there is no defect model!
→ Requirements based testing
→ Severity Testing (e.g. nuclear power plant shutdown could result in huge damage/cost)

Glossary

Code Churn: measure of activity in the repository
Testability: degree to which a software artifact supports testing in a given
test context
high testability → finding faults in the system is easier
cannot be measured directly (extrinsic)
lower testability → increased test effort
Scalability: an algorithm works nicely on a small dataset, but not on a big
dataset → time/memory grow exponentially
Software Artifact: code, module, method

Questions

General Questions

1) Definition of a test case?
(Test input → Expected output) + environment conditions

1.1) What makes a test case a good test case?
Ability to detect likely failures with good cost-effectiveness

2) When to stop testing?
When one of the “coverage criteria” is met: useful if there is a relationship with
failure detection.
When a specific number of failures has been detected: great, but requires good
estimates
(Stopping, selection and assessment criteria are the same thing!)

3) Does it make sense to use failure detection as a comparison criterion for different types of
testing?
no, because we are interested in faults not in failures

4) Why is every partitioning scheme problematic?
because we don’t know anything about the underlying fault model, therefore it is
difficult to create revealing subdomains

5) Is requirements based testing a variant of partition based testing?
yes, because each requirement leads to a set of paths through the program, from
the set of paths, the input blocks can be computed

6) Is it a good idea to use requirements based testing?
no, because it is no defect model (if you consider it a variant of partition based
testing)
yes, because the requirements define the most important/ most used parts of the
code
problem: requirements change, they might be incomplete

Questions: Combinatorial based testing

You are the CTO of pornhub.com: would you use CBT?
Yes, if you know there are problems between some Components!

Questions: Random based Testing

1) What are the Problems with random testing?
→ input is generated, expected outcome is unclear
→ implicit oracle: system does not crash

2) Should you do random testing?
Problem: even though random testing is effective in detecting failures, debugging is
extremely difficult
Pros: cheap (in generating test cases), no information about an underlying defect
model needed

Questions: Control Flow Based Testing

1) Can MC/DC find all faults?
→ No, not all possibilities are tested

2) Why does MC/DC coverage make sense?
We see the influence of each single literal
Programmers often make mistakes in boolean conditions (AND | OR | NOR)
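
To make "the influence of one literal" concrete, here is a sketch that checks unique-cause MC/DC for a decision given as a Python function. The function names are assumptions for this example, and the enumeration is exhaustive, so it is only practical for small decisions.

```python
from itertools import product

def mcdc_pairs(decision, n):
    """For each of the n conditions, list input pairs that differ only in that
    condition and flip the decision outcome (unique-cause MC/DC)."""
    pairs = {i: [] for i in range(n)}
    for values in product([False, True], repeat=n):
        for i in range(n):
            if not values[i]:
                flipped = values[:i] + (True,) + values[i + 1:]
                if decision(*values) != decision(*flipped):
                    pairs[i].append((values, flipped))
    return pairs

def satisfies_mcdc(decision, n, tests):
    """True if, for every condition, the test set contains at least one such pair."""
    tests = set(tests)
    pairs = mcdc_pairs(decision, n)
    return all(any(a in tests and b in tests for a, b in pairs[i]) for i in range(n))
```

For the decision `a and (b or c)` four tests are enough to show the independent influence of each of the three conditions, compared with eight tests for all combinations. Faults that only show up in the untested combinations remain undetected.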

Partition-based testing:
What was the assumption of Weyuker and Jeng?
That the failure rate is a fixed value.
This usually does not hold, since a) it is not fixed and b) we don’t know it in
advance

Questions: Statistical Testing

1) Does it make sense to use statistical testing?
→ depends on if you have usage data
→ doesn’t help for security testing (security scenarios such as shutdown happen
rarely)

Assignment 1: Discuss the approach of testing based on usage profiles for libraries. What
strategy would you use to test libraries?
It doesn’t make sense in general, but it depends on the scenario. How would you test Java’s UI
library? How people use certain elements (i.e. who uses buttons for what)
depends on the context, even if you have statistics. When you know the context, i.e.
you know what people are going to use and how, you can use usage profiles. BUT when it is
a general library (e.g. a user interface library) it doesn’t make sense, since you usually
don’t know the context in which the functions are used.
Suggestion: Use a different strategy – but which?!
“As the developer of a program depending on the library you can use usage profiles.”
Wait, what? You tell me not to use it and then suggest using it?

Questions: Models

1) Do coverage based criteria for models make sense?
YES if you want to test the safety critical part of your System
→ Create a Model of it and test it!
It depends on whether or not the model conforms to some defect model

2) Would you use model based testing?
It depends on:
building a model helps the developers understand their systems
better communication between customer and developer
money/time for building a model
if you have a small important part of a system → create a model only for that part
Model can be adapted to new versions, problematic: for new functionality you need a
new model

3) Why do you need different levels of abstraction?
The more you can abstract away in the model, the higher is the cost saving.
The level of abstraction of the test model needs to be higher than that of the SUT, because
otherwise validating the test model takes as much effort as validating the whole system.
This works very well if you test a certain aspect of your system, e.g. safety.

Difference in abstraction (Black Box and White Box) (From slides)
►Black box testing relies on the specification
►The box is abstraction by omission
►Black box allows abstracted inputs and expected outputs
►White box testing shows all the details

Questions: OO Testing

1) Would you test a program written in C the same way as one written in Java?
→ No
→ Many classes/objects are similar to state machines
→ Specific defect models for OO

2) How would you build the control states for a program?
→ CFG is the projection of code on the program counter

3) Does it make sense to represent an object as a state machine?
→ Depends
→ Stack: No, might not be the most suitable way
→ Network Controller: yes, very adequate
→ Traffic Light: yes

Questions: From failures to tests and faults

1) Would you apply code reading?
→ Depends on time and money: should be combined with testing (not replace it)
→ Really effective if you have professional programmers.
→ Don’t use it for the assessment of employees
→ Different reading methods find different faults AND find different faults than testing!
Review: informal code reading by team members
Walkthrough: code review as a kind of informal meeting
Checklist (Inspection): meeting with defined roles and activities

2) Would you use inspection?
→ depends on time, process, quality requirement
→ Tests can be repeated but code reading not

Questions: Fault models

1) What is a fault model?
→ A fault model is an abstract model of something that could go wrong

Answers for exercise 2:

A set/partition includes all possible values for all input parameters.
The three rules for creating a partition:
1) No empty set → no block is empty
2) Unique values → no value occurs in two blocks
3) The union of all blocks is the whole set
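
The three rules can be checked mechanically. A minimal sketch, assuming the blocks and the input domain are small enough to be written down as Python sets; `is_partition` is a name chosen for this example.

```python
def is_partition(blocks, domain):
    """Check the three rules: no empty block, pairwise disjoint blocks, union equals the domain."""
    blocks = [set(b) for b in blocks]
    no_empty = all(blocks)
    disjoint = sum(len(b) for b in blocks) == len(set().union(*blocks))
    complete = set().union(*blocks) == set(domain)
    return no_empty and disjoint and complete
```

For the domain {1, 2, 3}, the blocks {1, 2} and {3} form a partition, while {1, 2} and {2, 3} overlap and {1} and {3} leave out the value 2.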

Assignment 2:
1. Explain the idea behind partition-based testing.
Dividing the input domain of a system into blocks, and then selecting random
test cases from each block.
2. What is limit testing and when does it have a high probability of triggering failures?
Testing programs at their limits, e.g. boundaries of loops, max/min inputs
The fault model: a lot of faults occur at loop boundaries, indices and so on
3. How are coverage-based and partition-based testing related?
Partition-based testing includes coverage-based testing: for instance, when we check
both paths of an if-else statement, each block corresponds to one way through the
program

What exactly do coverage criteria show?
If you have just 80% coverage, there may be dead code or a problem you haven’t
figured out yet.

Would you, as the CEO of PornHub.com with a budget of 20 million, use coverage criteria?
DEPENDS!!!
No, because we have no clue whether it’s better than RT! (Since coverage
criteria don’t partition the input space based on a defect model, according
to Weyuker/Jeng you don’t know if it is the same, better or worse than random
testing.)
But if we have to meet norms and standards, it’s good / you have to do it.

Which Testing strategy to use?

Cause Effect Graphs
Use it when:
helpful in identifying all relevant situations
structure / reproducibility
also good for presentations / meetings since very visible
we don’t know about completeness

Coverage Criteria
Use it when:
Finding dead code
Intuitively: each line of code should be executed at least once. (e.g. as a
specification requirement .)
Dual use:
A posteriori (measure existing test suite)
A priori (use as selection criterion) + as stopping criterion

Problematic when:
in most programming languages,
evaluation of code is different
Dependencies

Data Flow Testing
Use it when:
Interested in flow of data / dependencies want to see the “scope” of each
variable / definition. (i.e. how far does the given definition go, what is messed
up by changing the value of the variable)

Latin square stuff:
Use it when:
You have a fault model that tells you that the fault is caused by a combination of
2 or 3 parameters which can take a limited number of values (max 3-4 values)

Partition based testing
Use it when:
You can ensure that certain partitions will have an outstanding failure rate. (i.e.
you can concentrate failure causing tests in a certain block)

Statistical Testing:
Use it when:
Dumb commercial programs (Android games/apps) Boss tells you that he doesn’t
want the program crashing while commercial people use it. (i.e. the 20% of the
program that is used by 80% of the people should run flawless).
When severity is not crucial. (Nothing is going to happen if my POM app crashes.)
You have former existing profiles . If not, you can still derive them from simulation,
expectancy, measurements but then it comes down to money whether it is worth it.

DON’T use it when:
Severity is critical: nuclear reactor (go with code reading then)

Model based Testing
One model for both, test cases and code generation
generated from models, e.g. Matlab/Simulink
Use it when:
Code generators
Assumption on the environment
Performance
Exceptions

Cons:
no redundancy → no verification
BUT: not useful to generate code and test cases at the same time! For test
case generation you need a separate model! → no redundancy!
(so basically the same as the first con?!)

Two Models
One model for test cases, one for code
Cons
Expensive → possible solution: split between OEM and supplier
Also when there are changes in the requirements (double expense to modify
both the test and the code model! → However, if the specification is solid (automotive
industry etc.), this probably won’t occur.)
if there is a fault you need to find out which model was faulty
Pro:
Redundancy → need to make sure the models are different
Different levels of abstraction possible

Use it when:
Specification is VERY exact. (i.e. car manufacturer(System model)
/suppliers(Test model )

Model for test only
Pro:
Redundancy
Con:
(Expensive) → we don’t know if it’s cheaper
Code and model need to be kept in sync (interleaving) → hard!
Model has to be tested itself
Specification doesn’t profit from model based testing.
Hard when the requirements are changing.

Use it when:
Conformance tests: OEM builds the model, suppliers must show adherence to
model / conform to the model
Scenario of our running chip card example

Model Extraction from Code
Pro:
May make sense if you extract the model manually
Cons:
With automatic generation there is NO redundancy.

Use it when:
Ex-post development of test cases only
(exception/no exception possible?)
(Also useful if there are no requirements documents?)

Test case extraction
Copying the stack
Use it when you don’t have external events, i.e. you simply press “start” and the
program runs autonomously, purely based on the code.
Event based
Use it when you have external events such as user interactions.
Shallow copy / deep copy
If you are really interested in the line / execution step where the program went
off the rails, use deep copy.
To merely reproduce the failure,
a shallow copy is enough about 90% of the time.
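
A minimal sketch of what the two snapshot kinds capture, using Python's `copy` module. The `state` dictionary is a made-up stand-in for captured program state, and real capture tools may define the copy depth differently.

```python
import copy

# Hypothetical program state: an object graph with a nested, mutable part.
state = {"step": 3, "account": {"balance": 100}}

shallow_snapshot = copy.copy(state)    # copies only the top-level dict
deep_snapshot = copy.deepcopy(state)   # copies the whole object graph

# The program keeps running and mutates its state.
state["step"] = 4
state["account"]["balance"] = -50

print(shallow_snapshot)  # nested part follows the live state
print(deep_snapshot)     # fully frozen at capture time
```

The shallow snapshot is cheap but shares nested objects with the running program. The deep snapshot costs more time and memory but preserves the exact state at the capture point.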

Test case minimization
Delta debugging
Use it when the test case is relatively small. Since the cost model has the number
of characters in the exponent, it gets nasty for long strings.
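
A minimal sketch of the minimization idea, assuming the failing input is a string and `fails` is a test oracle we supply. This is a simplified variant that only tries removing chunks, not the full algorithm from the lecture. The oracle in the example is invented for illustration.

```python
def ddmin(failing_input, fails):
    """Shrink `failing_input` (a string) while `fails` still returns True."""
    assert fails(failing_input)
    chunks = 2  # current granularity: number of pieces the input is split into
    while len(failing_input) >= 2:
        size = max(1, len(failing_input) // chunks)
        parts = [failing_input[i:i + size]
                 for i in range(0, len(failing_input), size)]
        reduced = False
        for index in range(len(parts)):
            # Try the input with one chunk removed.
            candidate = "".join(parts[:index] + parts[index + 1:])
            if fails(candidate):
                failing_input = candidate
                chunks = max(chunks - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunks >= len(failing_input):
                break  # single-character granularity reached, nothing removable
            chunks = min(chunks * 2, len(failing_input))
    return failing_input

# Hypothetical oracle: the program "fails" whenever both brackets are present.
def fails(text):
    return "<" in text and ">" in text

minimal = ddmin("a<b>c d e", fails)
print(minimal)  # prints: <>
```

Every call to `fails` stands for one program run, so the number of runs is the cost that matters in practice.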

Fault localization
Code churn is probably the best for estimating the number of failures.
Hit spectra / code metrics: very hard to choose the best method a priori. Not really
useful.
Code Reading
Usually the most efficient way to actually localize faults.
Used especially for high-severity programs (NASA, nuclear
reactors, and the like)

Fault injection
The Andrews’05 paper says there is a correlation between the kill rate of a test suite and
its failure detection.
Pretschner, however, is of a different opinion.

Concurrency testing strategies
Sleep
Logic based
Actual scheduler
I think it is better to focus on the actual scheduler or the logic-based strategy. Sleep is not
exact, especially for short, fast executions (e.g. you call sleep(50)
and the program sleeps 150+ ms).
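
A minimal sketch contrasting the sleep strategy with explicit synchronization, using Python's `threading` module. The `writer` and `reader` functions are illustrative. A sleep only gives a lower bound on the pause, whereas an event enforces the intended order regardless of timing.

```python
import threading
import time

# Sleep based: we only know roughly how long the thread pauses.
requested = 0.05
start = time.perf_counter()
time.sleep(requested)
elapsed = time.perf_counter() - start
print(f"requested {requested:.3f}s, slept {elapsed:.3f}s")

# Event based: the order is enforced by synchronization, not by timing.
log = []
first_done = threading.Event()

def writer():
    log.append("write")
    first_done.set()      # signal that the write has happened

def reader():
    first_done.wait()     # block until the writer has run
    log.append("read")

# The reader is started first on purpose; the event still forces the order.
threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)  # prints: ['write', 'read']
```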

FORMER EXAM QUESTIONS
Does it make sense to use path coverage etc. to generate test cases?
Answer 1: No, because you may end up optimizing the wrong parameter. You can
reach 100% coverage without finding all faults. You may also never reach 100%
because of dead code.
Answer 2: It depends. But normally not: coverage criteria are not based on a
fault model, so the partitions (with regard to finding faults) are not going to be any better
than random testing. If you have the time and money you can do it (but just for fun :))
What is Mutation Testing? What is tested? Is it possible to make a syntactical change
and the output stays the same?
Mutation testing automatically inserts small syntactical faults, thus creating a lot of
mutant programs with k mutations. The test suite is assessed by measuring #killed
mutants / #non-equivalent mutants.

In reality, you cannot know which mutants are equivalent → the value is useless

You can make a syntactical change that is not reflected in the outcome: insert the change
into dead code, or make a change that has no effect on the result (the value might be directly overwritten)
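
A minimal sketch of the kill ratio and of an equivalent mutant, assuming integer inputs. The functions, the mutants, and the tiny test suite are invented for illustration. Changing `>` to `>=` in a maximum function cannot be detected by any test, because both operands are equal in the only case where the two versions take different branches.

```python
def original(a, b):
    return a if a > b else b       # maximum of two integers

def mutant_changed(a, b):
    return a if a < b else b       # '>' mutated to '<': behaves like a minimum

def mutant_equivalent(a, b):
    return a if a >= b else b      # '>' mutated to '>=': same result for all integers

# Hypothetical test suite: (inputs, expected output)
TEST_SUITE = [((1, 2), 2), ((5, 3), 5)]

def is_killed(mutant):
    """A mutant is killed if at least one test observes a wrong output."""
    return any(mutant(*args) != expected for args, expected in TEST_SUITE)

mutants = [mutant_changed, mutant_equivalent]
equivalent = {mutant_equivalent}   # known here only by manual inspection
killed = [m for m in mutants if is_killed(m)]
non_equivalent = [m for m in mutants if m not in equivalent]
score = len(killed) / len(non_equivalent)
print(score)  # prints: 1.0
```

Without the manually supplied `equivalent` set the score would be 0.5, which illustrates why unknown equivalent mutants distort the measure.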

Which kinds of Model Driven Testing are there?
Answer 1: Symbolic execution, model checking, theorem proving, search algorithms.
Answer 2: Unclear. What does he mean by “MDT”: abstraction (encapsulation / omission), or
the 4 models?
If one place in the code is changed, which method could be used to
determine the test cases that are affected by it?
Answer: Hit spectra, coverage
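
A minimal sketch of selecting the affected tests from recorded coverage, assuming each test's hits are stored as a set of (file, line) pairs. The test names, files, and line numbers are invented for illustration.

```python
# Hypothetical coverage data: test name -> set of (file, line) pairs it executed.
COVERAGE = {
    "test_login": {("auth.py", 10), ("auth.py", 11)},
    "test_logout": {("auth.py", 20)},
    "test_report": {("report.py", 5), ("auth.py", 11)},
}

def affected_tests(coverage, changed_lines):
    """Return the tests whose recorded coverage touches any changed line."""
    return sorted(name for name, hits in coverage.items() if hits & changed_lines)

print(affected_tests(COVERAGE, {("auth.py", 11)}))  # prints: ['test_login', 'test_report']
```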
Is use case testing useful? (= operational profile stuff)
Answer 1: Yes, because requirements-based testing is always useful.
What is a fault model?
Answer:
A hypothesis for the reason of a failure/fault: things that can go wrong, and
usually go wrong. A fault model is good when the induced failure domains largely
overlap with the actual failure domains. (“Intuitively a fault model is the understanding of
“specific things that can go wrong” when writing a program. In a first approximation, we
define fault models to be descriptions of the differences with a correct programs that
contain instances of fault class K. … A fault model for class K therefore is a description of
the computation of α_K or a direct description of the failure domains induced by α_K,
φ(α_K, s) for all s ∈ S.”)
Which options are there to reduce the number of test cases for a function with 3
parameters?
Answer: Combinatorial testing → Latin squares.
How can parallel systems be tested? What is difficult about it? (= Concurrency)
Answer 1: All schedules / important schedules. Problems with concurrency: deadlock, livelock,
atomicity violation, order violation. Test a schedule by: using a scheduler,
sleep,
ticks, events. Additional info:
with transactional memory, ca. 20-40% of these problems can be fixed
(“Findings V”).
Inheritance, polymorphism, flattening (see slides)
What is the Coupling Hypothesis?
Small syntactical faults correlate with complex failures.
Test cases that find small faults also find complex ones.
What is Big Bang Testing? Should it be used?
Answer: Integration testing with all components at once. Bad, because fault localization
then becomes hard (incremental or combinatorial testing would be better).
http://istqbexamcertification.com/whatisbigbangintegrationtesting/
Cyclomatic complexity
> Code metric: if it is high, the program is very complex and error-prone.
“Cyclomatic complexity is a software metric (measurement), used to indicate the complexity
of a program. It is a quantitative measure of the number of linearly independent paths
through a program's source code.”
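
A minimal sketch of computing the metric with the standard formula M = E - N + 2P (edges, nodes, connected components) on a control flow graph. The graph below is an invented example of a function with one if/else and one loop.

```python
# Hypothetical control flow graph: node -> list of successor nodes.
CFG = {
    "entry": ["if"],
    "if": ["then", "else"],
    "then": ["loop"],
    "else": ["loop"],
    "loop": ["body", "exit"],
    "body": ["loop"],
    "exit": [],
}

def cyclomatic_complexity(cfg, components=1):
    """M = E - N + 2P for a control flow graph given as an adjacency list."""
    nodes = len(cfg)
    edges = sum(len(successors) for successors in cfg.values())
    return edges - nodes + 2 * components

print(cyclomatic_complexity(CFG))  # prints: 3
```

The result matches the rule of thumb "number of decisions plus one": the graph has two decision nodes, `if` and `loop`.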

PUses/CUses
