
Landmines of Software Testing Metrics

1. Abstract
It is not only desirable but also necessary to assess the quality of
testing being delivered by a vendor. Specific to software testing,
there are some discerning metrics one can look at; however, it must
be kept in mind that multiple factors affect these metrics, and they
are not necessarily under the control of the testing team. The SLAs
for testing initiatives can, and should, only be committed after a
detailed understanding of the customer's IT organization in terms of
culture and process maturity, and after analyzing the various trends
among these metrics. This white paper lists some of the popular
testing metrics and the factors one must keep in mind while reading
into their values.

2. Introduction
This white paper discusses some of the popular metrics for testing
outsourcing engagements and the factors one must keep in mind
while looking at the values of these metrics.
Metric 1. Residual defects after a testing stage

Definition

The absolute number of defects that are detected after the testing
stage (owned by the vendor's testing team).
The lower the number of defects found after the current testing
stage, the better the quality of testing is considered.
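
As a minimal illustration of how this count might be obtained,
assuming a defect log where each record carries a detection-stage
field (the field names and stage labels below are assumptions for
the sketch, not from this paper):

    from collections import Counter

    # Hypothetical defect log; "stage" records where each defect was detected.
    defects = [
        {"id": "D-101", "stage": "System Test"},
        {"id": "D-102", "stage": "UAT"},
        {"id": "D-103", "stage": "Production"},
        {"id": "D-104", "stage": "System Test"},
    ]

    VENDOR_STAGE = "System Test"  # stage owned by the vendor's testing team (assumed label)

    # Residual defects: those detected after (i.e. outside) the vendor-owned stage.
    residual = [d for d in defects if d["stage"] != VENDOR_STAGE]
    print("Residual defects after", VENDOR_STAGE, ":", len(residual))  # 2
    print(Counter(d["stage"] for d in residual))                       # UAT: 1, Production: 1
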
Factors to consider
Quality of requirements
Ambiguity in the requirements results in misinterpretations and
misunderstandings, leading to ineffectiveness in defect detection.
The clearer the requirements, the higher the chances of the testing
team understanding them correctly and hence noticing any deviations
or defects in the system under test (SUT).
Quality of development
The planning for testing is usually done with the assumption that
the system will be thoroughly unit tested prior to handing it over
to the testing team. However, if the quality of the development
process is poor and unit testing is not done thoroughly, the testing
team is likely to encounter more unit-level defects and might have
to pause its testing for these defects (even in fundamental
processes) to get fixed, and hence would not be able to devote all
of its time to looking for functional-level, system-level, and
business-level defects.
Availability of Business users for reviews and clarifications
If the Business users are not available to answer queries in time,
the testing team is likely to struggle to establish the intended
behavior of the system under test (whether an observation is a
defect or not, etc.). Some productivity is lost, and more defects
are likely to remain by the time the testing window is used up.
Incomplete test cycles: delays in defect fixes, a higher number of
defect fixes, environment availability
The estimates and planning for testing are based on certain
assumptions and available historical data. However, if there are
more disruptions to testing than anticipated, in terms of
environment unavailability or a higher number of defects being found
and fixed, the quality time available for testing the system is
reduced and a higher number of defects slips through the testing
stage.
Metric 2. Test effectiveness of a testing stage (or Defect Containment effectiveness)

Definition

A measure of how well the testing stage (owned by the vendor's
testing team) contains system-level defects, i.e., how few of them
slip through to be detected at a later testing stage or in
Production.
The higher the effectiveness, the lower the chance of defects being
found downstream.
Test Stage Effectiveness =
    (Defects detected in the current testing stage)
  ------------------------------------------------------------------ x 100
    (Defects detected in the current testing stage + defects
     detected in all subsequent stages)
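
A minimal sketch of this calculation, assuming defect counts per
stage are already available (the counts below are made up for
illustration):

    def test_stage_effectiveness(current_stage_defects: int, downstream_defects: int) -> float:
        """Percentage of defects contained within the current testing stage."""
        total = current_stage_defects + downstream_defects
        if total == 0:
            return 0.0  # no defects found anywhere; report 0 rather than divide by zero
        return current_stage_defects / total * 100

    # Example: 45 defects found in the vendor-owned stage, 5 found in later stages or Production.
    print(f"Test stage effectiveness: {test_stage_effectiveness(45, 5):.1f}%")  # 90.0%
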
Factors to consider
Availability of accurate defect data at all stages
We must ensure that defect data for all subsequent stages is also
available and accurate. Production defects are usually handled by a
separate Production support team, and the testing team is at times
not given much insight into this data. Also, since multiple projects
and/or programs go live one after another, there are usually
challenges in identifying which defects in Production can be
attributed to which project or program. Inaccuracies in assignment
lead to an inaccurate measure of test stage effectiveness.
All the factors listed above for Residual defects also apply to this
metric.
Metric 3. % improvement in test case productivity
Definitions

Test case Productivity = # of test cases developed per person per day (or per hour)

% Improvement =
    (Test case productivity in current year - Test case productivity last year)
  ------------------------------------------------------------------------------ x 100
    (Test case productivity last year)
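
As a small worked example (the productivity figures are
illustrative), assuming productivity is measured in test cases per
person-day:

    def productivity_improvement(current: float, previous: float) -> float:
        """% improvement in test case productivity (test cases per person-day) year over year."""
        return (current - previous) / previous * 100

    # Example: 9.2 test cases per person-day this year versus 8.0 last year.
    print(f"Improvement: {productivity_improvement(9.2, 8.0):.1f}%")  # 15.0%
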
Factors to consider
Nature of changes
The nature of the changes that the system goes through might not
always be comparable. Depending on the nature of the changes, the
number of test cases required to test a unit of development effort,
and the amount of investigation/analysis required prior to deciding
or documenting the test cases, can differ drastically.
Test case definition
The very measurement of a test case itself can be a challenge. A
test case can range from very simple to very complex depending on
the specifics of the test objective, so a mere count of test cases
might not reflect the actual effort put into testing a particular
change or functionality; the complexity of the test cases must be
considered as well. Different teams might also follow different
levels of documentation for a test case. For example, a test case
with 12 conditions might be considered a single test case by one
team, while another may split it into 12 separate test cases with
one condition each. Normalizing the unit of a test case is critical
for this metric to represent the real picture.
Ambiguity in effort categorization
Some people might consider test data set-up as part of test case
development effort, while others consider it part of test execution
and set-up. Different people following different conventions leads
to erroneous values, and the data might not be comparable.
Effectiveness of test cases
Testers might be incorrectly motivated towards creating lots of test
cases in less time rather than taking the time to think through the
changes and requirements and come up with good test cases (even if
they are few) that are likely to find defects or to give the
Business people more comfort and confidence.
Increase in experience/SME
Over a period of time a testing resource is likely to become more
knowledgeable about the SUT and thus better able to anticipate which
test cases are likely to find a defect, and might cut down on test
cases that are NOT likely to find defects.
Change in Testing Resources
Whenever a resource is replaced by another, the new resource is
likely to take more time for the analysis needed to write the test
cases. The higher the number of such changes, the lower the test
case productivity (and, in a way, the test case effectiveness).
Metric 4. % reduction in testing cost
Definitions

% Reduction in testing cost =
    (Testing cost per unit development effort last year -
     Testing cost per unit development effort in current year)
  ----------------------------------------------------------------- x 100
    (Testing cost per unit development effort last year)
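
As a minimal sketch of this calculation (the figures are made up,
and the reduction is taken against last year's cost as the base):

    def cost_reduction_pct(cost_last_year: float, cost_current_year: float) -> float:
        """% reduction in testing cost per unit of development effort, against last year's cost."""
        return (cost_last_year - cost_current_year) / cost_last_year * 100

    # Example: cost per unit development effort fell from 120 to 105 (arbitrary currency units).
    print(f"Reduction: {cost_reduction_pct(120.0, 105.0):.1f}%")  # 12.5%
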
Factors to consider
Lack of accurate measurement of development effort
This metric depends heavily on the measurement of actual development
effort, which might not be accurate. Relatively few projects use
formal measurement units such as function points (FP).
Testing effort variance with development effort
Testing effort is not always directly proportional to development
effort. For example, a slight modification (with very small
development effort) to a legacy application might incur a lot of
testing effort once regression testing is factored in. Similarly,
there may be substantial structural changes to existing code with
high development effort (for performance enhancement or other
refactoring needs), but the (black box) testing effort might not be
as high if the end functionality is not changing drastically.
Scope of outsourcing
When we engage with a customer for testing projects, we might start
with a small set of applications or modules. Over a period of time,
however, we would have started servicing more applications or
modules, and the timesheet entries might not be granular enough to
distinguish between effort spent on different modules. After some
point it might be difficult to extract the actual effort spent on a
particular module and hence to calculate the reduction in effort for
that module.
Sharing of resources
Along the same lines, it might be difficult to extract information
on which resource spent how much time on which application, as
resources might be working on different modules interchangeably.
Project complexities not comparable
It is also common to encounter projects with complexities widely
differing from the projects that were used for baselining. Comparing
data between these types of projects might not give the real
picture.
Metric 5. % Automated
Definitions

% of test cases automated =
    (# of test cases automated)
  ------------------------------- x 100
    (Total # of test cases)
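
A minimal sketch, assuming each test case record carries an
automated flag (the field names and IDs are assumptions for
illustration):

    # Hypothetical test case inventory; "automated" marks scripted test cases.
    test_cases = [
        {"id": "TC-001", "automated": True},
        {"id": "TC-002", "automated": False},
        {"id": "TC-003", "automated": True},
        {"id": "TC-004", "automated": True},
    ]

    automated = sum(1 for tc in test_cases if tc["automated"])
    print(f"% automated: {automated / len(test_cases) * 100:.1f}%")  # 75.0%
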
Factors to consider
Business case
Not every application, and at times not every module within an
application (or even every test case within that module), has a
business case for automation; these might continue to be tested
manually for various business reasons (e.g. a possible replacement
of the application with a COTS package in the next 6 months).

Feasibility of automation
Depending on technology constraints and the suitability of the tools
that are available, some parts of an application might not be
technically feasible to automate even if we wanted or needed to.
Scope
It is advisable to measure the % of automation under a revised
scope: for the modules/applications that are known to give good RoI
and are technically feasible, what % is automated.
Resource allocation
Industry surveys reveal that over 60% of automation projects are not
successful, the major causes being automation attempted on an ad-hoc
basis and people not being dedicated to automation. People also
underestimate the need for effective ongoing maintenance after the
initial test bed is automated. If automation is approached
immaturely, one can expect disruptions in the build and maintenance
of automation due to a lack of constant focus and of supporting
skills/resources.
Room for exploratory testing
At times it may be desirable to allow for a bit of exploratory
testing to check whether we can detect anomalies that could not be
detected by regular testing techniques. In exploratory testing, test
execution is usually attempted along the lines of a few test
objectives, followed by documentation of the attempts and results.
This can lead to the addition of a few test cases which might not
necessarily be automated but are desirable to ensure deeper probing
into the application's behavior.

Metric 6. Requirements coverage

Definitions

% of requirements covered by test cases =
    (# of requirements covered by at least 1 test case)
  ------------------------------------------------------ x 100
    (Total # of requirements)
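
A minimal sketch, assuming a simple traceability mapping from
requirement IDs to the test cases that cover them (the IDs are made
up):

    # Hypothetical traceability matrix: requirement ID -> IDs of test cases covering it.
    traceability = {
        "REQ-1": ["TC-001", "TC-002"],
        "REQ-2": ["TC-003"],
        "REQ-3": [],          # not covered by any test case
        "REQ-4": ["TC-004"],
    }

    covered = sum(1 for test_ids in traceability.values() if test_ids)
    print(f"Requirements coverage: {covered / len(traceability) * 100:.1f}%")  # 75.0%
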
Factors to consider
Format of requirements, traceability matrix
Not every team follows standard notations while documenting
requirements, so one can expect challenges in mapping test cases to
free-format requirements.

Legacy systems with no documented requirements
Most legacy systems might not have any documentation of even their
basic functionality, making the reference point difficult to
establish.
Configuration Management
The team may not have access to a configuration management tool that
keeps the test cases systematically mapped to changing requirements.
The challenge lies in managing not only the version of the
requirements but also the corresponding version of the test cases.
Metric 7. Test case effectiveness

Definitions

Test case effectiveness =
    (# of test cases that detected defects)
  ------------------------------------------ x 100
    (Total # of test cases)
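
A minimal sketch, assuming each executed test case record carries a
flag indicating whether it detected at least one defect (the records
are made up):

    # Hypothetical execution results; "found_defect" marks test cases that detected a defect.
    executions = [
        {"id": "TC-001", "found_defect": True},
        {"id": "TC-002", "found_defect": False},
        {"id": "TC-003", "found_defect": False},
        {"id": "TC-004", "found_defect": True},
    ]

    effective = sum(1 for e in executions if e["found_defect"])
    print(f"Test case effectiveness: {effective / len(executions) * 100:.1f}%")  # 50.0%
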
Factors to consider
Stability of application
As the application or SUT stabilizes over a period of time, the
number of defects in the system goes down and it requires more
effort (more test cases) to find the remaining defects.
Risk
Where people over-emphasize this metric and refrain from writing
more test cases, there is a risk of not detecting defects we could
otherwise find (resulting in poor test stage effectiveness). To
reduce this risk, people might instead write more and more test
cases even though the likelihood of finding a defect with them is
low.
Metric 8. Defect Rejection Ratio
A defect initially raised by a tester could later be rejected for
multiple reasons. The main objective of this metric is to ensure
that testers correctly understand the application/requirements and
do their groundwork well before everyone's time and attention is
invested in solving the defect. Too many rejected defects result in
inefficiency (due to time and effort spent on something that wasn't
a problem).

Definitions

Defect Rejection Ratio =
    (Defects rejected as invalid)
  ---------------------------------
    (Total no. of defects raised)
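
A minimal sketch, assuming defect records carry a status field in
which "Rejected" marks defects rejected as invalid (the records and
status labels are assumptions):

    # Hypothetical defect records with their final status.
    defects = [
        {"id": "D-201", "status": "Fixed"},
        {"id": "D-202", "status": "Rejected"},
        {"id": "D-203", "status": "Fixed"},
        {"id": "D-204", "status": "Deferred"},
    ]

    rejected = sum(1 for d in defects if d["status"] == "Rejected")
    print(f"Defect rejection ratio: {rejected / len(defects):.2f}")  # 0.25
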
Factors to consider
Cause of rejection
Lack of application knowledge is usually the cause of rejected
defects. However, there can be other reasons as well, such as
misinterpretations, changes in the environments, and defects
becoming non-reproducible. Hence we need to take the causes of
rejection into consideration in order to correctly understand the
trends.
Defect life cycle culture
In some teams a defect is initially raised in the system as "New",
followed by a discussion, and is then rejected in the system if it
is not considered a defect. In other teams, discussions on possible
defects are held between testers and other stakeholders such as
Business and Development, and a defect is entered into the system
only after it is confirmed during those discussions; hence there
won't be a rejection at all.
Fear of risk
In some cases it is not straightforward to decide whether a behavior
is a defect or not, and one has to choose between risking a defect
rejection while covering the defect slippage, and not raising the
defect (lower defect rejection) but accepting higher defect slippage
into the next stage. Usually people are more concerned about defect
slippage and hence take the route of raising defects even if it
means a higher defect rejection ratio.
Quality of requirements
The higher the quality of the requirements, the higher the chance of
eliminating misunderstanding and the lower the defect rejection
rate. Even if the defect rejection rate is higher, good requirements
at least help establish that the testers' knowledge was insufficient
and that the requirements cannot be blamed.
Control of environment
At times the behavior of the system changes between the time a
defect is raised and the time the defect is verified ("it didn't
work when I raised the defect but now it is working; I don't know
how"). If the environment is undergoing changes without the notice
or control of the testing team, it is hard to establish the cause of
the defect rejection.
Metric 9. Currency of knowledge database
Definitions

% of application knowledge that is documented =
    (Knowledge level documented)
  ------------------------------------------------- x 100
    (Total knowledge of the application or module)
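
A minimal sketch, assuming the application knowledge has already
been broken down into a list of functions/transactions/flows (as
discussed under the factors below) and each item is flagged as
documented or not; the catalog here is made up:

    # Hypothetical catalog of application flows and whether each is documented.
    flows = {
        "Order entry": True,
        "Payment processing": True,
        "Refund handling": False,
        "Reporting": False,
        "User administration": True,
    }

    documented = sum(1 for is_documented in flows.values() if is_documented)
    print(f"Knowledge documented: {documented / len(flows) * 100:.1f}%")  # 60.0%
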
Factors to consider
Quantifying Knowledge
The parameter "knowledge" is hard to quantify and is mostly a
qualitative measure. At times people break their applications or
modules down into a list of functions/transactions/flows, and the
documentation for these can be measured relatively objectively.
However, not every module is conducive to this kind of breakdown,
and care must be taken to ensure that some sort of classification is
possible before we attempt to measure this metric.
Depth of detail expected in documentation
A functionality/transaction can be documented at different levels of
depth: the deeper the documentation, the more time it takes, ranging
from a few minutes to a few days. We must ensure that the
expectations on depth are understood correctly before we commit to
an SLA on this metric.
Frequency of changes
When estimating documentation effort, a certain amount of change to
the baseline requirements/functionality is factored in. However, if
the changes are much higher than estimated, this has an obvious
impact on the actual documentation effort.
Expectation on timing of documentation
It is also important to know the actual point in the life cycle at
which the completed documentation is expected. If the documentation
is expected to be reviewed and refined at the end of each project
(i.e. by phase, and not by duration such as every 3 or 6 months),
then it is very likely that people will not attempt to document
things during the middle of a project, thus avoiding rework due to
changes in requirements. However, if the expectation is that the
documentation is updated every 3 months (or even 6 months), it is
possible that the project is going through changes in requirements,
and hence one should anticipate rework in the documentation.
Multiple projects on the same application/module
It is also likely that the same application is undergoing changes
from multiple projects, with little time gap between them. This
brings in all the complexity that applies to configuration
management of the code: the documentation must reflect the changes
per project so that people on other projects also know where, and by
how much, to apply their changes. In the absence of a configuration
management tool, the documentation might be difficult to handle
without leading to confusion and rework.
Planning for documentation effort
Testing usually runs on tight schedules, with racing against time
becoming the norm. Because of this, whether documentation of
application knowledge is a paid service or a value addition, it
cannot be achieved without factoring the time and effort for
documentation into the project planning.

3. Conclusion
There are some discerning metrics for testing engagements which can
be considered for drafting SLAs. However, for each metric one must
take into consideration the factors (including other metrics) that
influence its value and the extent to which the testing team can
control those factors. This white paper has listed some of the
popular metrics and discussed the factors that affect each of them.
One shouldn't underestimate the fact that it usually takes some time
(a few months to over a year) to establish consistent trends for
these metrics. It is thus recommended that enough time is given to
assess the trends of these metrics before an SLA is worked out for
the engagement.
Where applicable, assumptions must be clearly documented to reflect
the influencing factors on the SLAs being committed.


