Paper FC06

Using Cyclomatic Complexity to Determine Test Coverage for SAS Programs


Michael C. Harris, Amgen Inc., Thousand Oaks, CA
ABSTRACT

The Cyclomatic Complexity metric described by Watson and McCabe [1] provides an objective measure of the complexity of a given module of program code by examining its decision structure. This article examines issues related to using the metric. These include identifying relevant SAS flow control constructs, creating flow graphs of SAS programs, and methods for deriving test cases from basis path analysis. Emphasis is placed on programs typical of the biotechnology and pharmaceutical industry, but the principles are applicable to any kind of program.

INTRODUCTION
SAS programmers often learn how to use The SAS System in the context of specific applications as opposed to approaching it from an abstract or theoretical perspective. Programming as a discipline has been discussed extensively in the literature with reference to specific languages and from a language-neutral analytical perspective with strong mathematical underpinnings. This article is a brief introduction to the application of one software engineering concept to SAS programming.

ATTRIBUTES OF QUALITY SOFTWARE


There is no single way of developing and testing programs and systems that will ensure they reliably meet their requirements over time. A sound method includes multiple layers of quality control with each layer addressing a specific part of an overall process. User requirements tend to be one of the more visible pieces because they drive everything else. In the pharmaceutical industry, we frequently express requirements in statistical analysis plans, in depictions of output such as table shells and in data dictionaries. The programmer's job is to translate requirements that might be vague or incomplete into code and output that meet the needs of a particular audience, always with one important unstated requirement: The numbers must be correct.

A number of approaches are used within our industry to foster a level of confidence that results are accurate. They vary from the popular, relatively inexpensive but scientifically indefensible independent programming approach to costly implementations of formal life cycles with unit, integration, system and user acceptance testing built into the process and carefully documented. The goal of any approach is to ensure the presence of two main quality attributes. The output must look like what the customer expects (whether for internal decision making or external review, such as by a regulatory body) and the numbers must be correct.

Secondary quality attributes might fall into the realm of design goals. For example, we consider it highly desirable to write code that can be maintained and used for new applications without being rewritten. Many organizations have libraries of standard code just for such purposes. This article discusses one tool that can be used in conjunction with structured testing. The tool is not meant to be a solution to anything more than one part of an overall quality process.

COMPLEXITY OF PROGRAMS
We can all agree that programs vary in complexity. We may not have the same understanding of what constitutes a meaningful measure of complexity. For example, non-comment lines of code and number of functions can be used to get a sense of how complex a program may be (or programmer productivity if measured over time). While not without value, those metrics do not provide a sound framework for testing strategies. Cyclomatic complexity is a mathematically rigorous approach to defining complexity and using clearly identifiable properties of programs to define how modules of code will be tested. Although not bound to any language, the metric has been associated with the C programming language and many examples in the literature are based on functions implemented in C.

Central to this article is the concept of applying software engineering principles and mathematical rigor to SAS programming. Cyclomatic complexity was originally described in the context of code modules such as C functions. For the purposes of this article, a SAS data step will be considered a module of code. SAS procedures hide their flow control from the user and we will not discuss them. The SAS Macro language has flow control constructs analogous to most of those available to the data step and will not be called out for special consideration. The reader may be able to apply the principles discussed to SAS macros, with the caveat that the SAS Macro facility is a text processor as opposed to a pure procedural language and therefore may have macro and data step flow control constructs intermingled in ways that make code difficult to understand. The particular application may determine how one analyzes and tests macro code.

FLOW CONTROL IN SAS


Experienced programmers have an arsenal of tools to design and analyze program code, including flowcharts, program structure charts and pseudocode. The primary tool for analyzing cyclomatic complexity is the flow graph. Flow graphs depict control constructs with a simple notation based on graph theory. Conceptually, you reduce a program to a collection of nodes representing procedural statements and edges connecting the nodes. All edges must terminate in nodes, and paths through a module are made up of edges. Complexity results from decisions a program must make, which is inextricably bound to the number of paths through a module of code.

SAS control constructs available to the data step are: CONTINUE, DO WHILE, DO UNTIL, END (associated with other statements), GO TO, IF THEN / ELSE, LEAVE, LINK, RETURN and SELECT [2]. We will focus our attention on the constructs most commonly associated with structured languages and will ignore the CONTINUE, GO TO, LEAVE, LINK and RETURN statements. The subsetting IF is used exclusively for data restriction rather than flow control and for that reason will be ignored. The structured constructs in flow graph form are shown below [3].

[Flow graph notation for five constructs: Sequence, While, If, Until, Case (SAS Select)]

Figure 1. Structured Constructs
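As a rough illustration of how the constructs in Figure 1 might look in a data step, consider the sketch below. It is not taken from the original paper; every variable name and value is invented purely to show one instance of each construct.

```sas
/* Hypothetical data step; all names and values are illustrative only. */
data _null_;
  /* Sequence: statements executed one after another */
  x = 3;
  y = x + 1;

  /* If: a single decision with two outgoing edges */
  if x > y then big = x;
  else big = y;

  /* While: the condition is tested before each pass */
  i = 0;
  do while (i < 3);
    i = i + 1;
  end;

  /* Until: the condition is tested after each pass,
     so the body runs at least once */
  j = 0;
  do until (j >= 3);
    j = j + 1;
  end;

  /* Case: SELECT chooses one of several branches */
  select (big);
    when (3) word = 'three';
    when (4) word = 'four';
    otherwise word = 'other';
  end;

  put big= i= j= word=;
run;
```

Each IF, DO WHILE and DO UNTIL contributes one decision to the flow graph, while a SELECT contributes one decision per WHEN branch.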

A SIMPLE EXAMPLE
THE PROGRAM

Consider the following SAS data step wrapped in a SAS macro, adapted from a C function that implements Euclid's algorithm for finding greatest common divisors [4]. Executable statements are numbered for reference purposes and are represented as nodes in the accompanying flow graph. The macro wrapper is for convenience and will be discussed later.

    %macro Euclid(m, n);
    /*  0 */ data _null_; retain m &m n &n;
    /*  1 */   if (n>m) then do;
    /*  2 */     r=m;
    /*  3 */     m=n;
    /*  4 */     n=r;
    /*  5 */   end;
    /*  6 */   r=mod(m, n);
    /*  7 */   do while (r ne 0);
    /*  8 */     m=n;
    /*  9 */     n=r;
    /* 10 */     r=mod(m, n);
    /* 11 */   end;
    /* 12 */   put n=;
    /* 13 */ run;
    %mend;

THE FLOW GRAPH

A flow graph of the data step is shown below. You can think of this graph as a mathematical abstraction of the original program.

[Flow graph with nodes 0 through 13, one for each numbered statement]

Figure 2. Flow graph of sample data step


CALCULATING CYCLOMATIC COMPLEXITY

This graph has 14 nodes (n) and 15 edges (e). Cyclomatic complexity of a module is defined as e - n + 2. Thus, the complexity of our sample data step is 15 - 14 + 2 = 3. This is the number of independent paths (known as basis paths) through the flow graph and provides the starting point for developing test cases.

You can calculate the cyclomatic number in two other ways. It is the same as the number of regions in a flow graph, where the area outside the graph counts as one region. It is also defined as P + 1, where P is the number of predicate nodes in the flow graph. Predicate nodes are those representing control structures and have two or more edges emanating from them. In Figure 2, nodes 1 and 7 represent the IF and DO WHILE conditions, respectively, and are therefore predicate nodes. In this case, P + 1 = 3. This reveals an interesting and useful fact. Since predicate nodes represent control structures, you can calculate cyclomatic complexity without drawing flow graphs by counting control structures and adding one.
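The arithmetic can be checked mechanically. The sketch below is one possible way to do so, not a method from the original paper. It assumes the edge list shown, which was transcribed from the numbered statements of the sample data step, and it assumes a SAS release that supports the TRIMMED option of the PROC SQL INTO clause. The data set name edges and the variable names src and dst are invented for illustration.

```sas
/* Assumed edge list: one row per edge, transcribed from the numbered
   statements of the sample data step. Names are illustrative only. */
data edges;
  input src dst;
  datalines;
0 1
1 2
1 5
2 3
3 4
4 5
5 6
6 7
7 8
7 11
8 9
9 10
10 7
11 12
12 13
;
run;

proc sql noprint;
  /* e: number of edges */
  select count(*) into :e trimmed from edges;

  /* n: number of distinct nodes appearing at either end of an edge */
  select count(*) into :n trimmed
    from (select src as node from edges
          union
          select dst as node from edges);

  /* P: predicate nodes, those with two or more outgoing edges */
  select count(*) into :p trimmed
    from (select src from edges group by src having count(*) > 1);
quit;

%put NOTE: e=&e n=&n e-n+2=%eval(&e - &n + 2) P+1=%eval(&p + 1);
```

With this edge list both formulas give 3, matching the value derived from the flow graph.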
PATHS AND TEST CASES

We must identify three basis (linearly independent) paths through our sample program and generate test cases for each path. Three possible paths are:

Path 1: 0-1-5-6-7-11-12-13
Path 2: 0-1-2-3-4-5-6-7-11-12-13
Path 3: 0-1-5-6-7-8-9-10-7-11-12-13

We can develop test cases from the Boolean states of nodes 1 and 7. Cases that force execution of each path are shown in the following table.

Path      Node: Statement    State
Path 1    1: n>m             False
          7: r ne 0          False
Path 2    1: n>m             True
          7: r ne 0          False
Path 3    1: n>m             False
          7: r ne 0          True
          7: r ne 0          False

Figure 3. Cases forcing the execution of each path

We need both a clear statement of the conditions and an expected result to write a test case. To exercise Path 1, the test case designer chooses values of m and n such that m is greater than n and mod(m, n) = 0. Setting m to 2 and n to 1 satisfies these criteria. The expected result is that N=1 is written to the log. We can satisfy the criteria for Path 2 by swapping the values of m and n from the first test case. The expected result is that N=1 is written to the log. We must force execution of the do while loop to traverse Path 3. The terminal value of the loop variable r is determined by the program's logic, which if correct will eventually reach 0. Setting m to 3 and n to 2 satisfies the criteria for exercising the logic of the loop. The expected result is that N=1 is written to the log. The test cases can be summarized as follows:

Path 1: Call %Euclid(2, 1), N=1 is written to the log
Path 2: Call %Euclid(1, 2), N=1 is written to the log
Path 3: Call %Euclid(3, 2), N=1 is written to the log
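One way to confirm that each test case really traverses its intended path is to instrument a copy of the module so that it records the nodes it visits. The following is a minimal sketch under stated assumptions, not part of the original program: the macro name EuclidTrace and the variable path are hypothetical additions, and the CATX function assumes SAS 9 or later.

```sas
/* Hypothetical instrumented copy of the sample data step.
   EuclidTrace and path are illustrative names, not from the paper. */
%macro EuclidTrace(m, n);
  data _null_;
    length path $ 80;   /* long enough for these short test cases only */
    retain m &m n &n;
    path = '0-1';
    if (n > m) then do;
      r = m;
      m = n;
      n = r;
      path = catx('-', path, '2-3-4');
    end;
    path = catx('-', path, '5-6-7');
    r = mod(m, n);
    do while (r ne 0);
      m = n;
      n = r;
      r = mod(m, n);
      /* one pass through the loop body, then back to the test at node 7 */
      path = catx('-', path, '8-9-10-7');
    end;
    path = catx('-', path, '11-12-13');
    put n= path=;
  run;
%mend EuclidTrace;

%EuclidTrace(2, 1)   /* intended to follow Path 1 */
%EuclidTrace(1, 2)   /* intended to follow Path 2 */
%EuclidTrace(3, 2)   /* intended to follow Path 3 */
```

Comparing the path value written to the log with the path listings above shows whether a test case exercises the path it was designed for. Inputs that iterate the loop many times would overflow the 80-character path variable, so the length would need to be increased for such cases.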

The process of deriving test cases can be summarized in five general steps.

1. Annotate your source code, numbering each procedural statement.
2. Using your annotated source code, draw the corresponding flow graph.
3. Determine the complexity of the flow graph.
4. With the flow graph as an aid, determine a set of basis paths.
5. Write test cases that force execution of each path. It may be useful to tabulate the paths, nodes and logic states in a manner similar to Figure 3.

Note that the cyclomatic complexity number specifies the minimum number of test cases required to traverse every basis path we have identified. Other paths are possible, but they are made up of edges already traversed by the basis paths and therefore do not need to be tested. One might be tempted to try calling the macro with other inputs. There is no harm in doing so, but there may be no value either, since no new logic will be executed. If the algorithm is correct and the implementation is correct, more testing does not necessarily provide more evidence that the code is sound.
TESTING IN ISOLATION

Our example illustrates an important characteristic of a truly modular approach to programming: Modules can be tested in isolation. In this case, the data step is wrapped in a macro driver that enables us to vary the input without modifying the code being tested. Depending upon your application, you may need to write drivers before test cases can be executed. If your code calls macros having dependencies not present in an isolated environment, you may need to write macros that are essentially placeholders in the code. Such macros should do little other than report they have been called. External dependencies designed for testing and having little or no functionality are traditionally known as stubs. Writing drivers and stubs is not necessarily difficult but does require attention to design, and ultimately someone to do the coding.
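A stub and a driver can each be only a few lines long. The sketch below is illustrative only: the macro names get_baseline, derive_change and run_tests, the parameter subject, and the value 101 are all hypothetical and do not come from the original paper.

```sas
/* Hypothetical stub: stands in for an external macro that is not
   available in the isolated test environment. It does nothing except
   report that it was called. */
%macro get_baseline(subject);
  %put NOTE: Stub get_baseline called with subject=&subject;
%mend get_baseline;

/* Hypothetical module under test: depends on the external macro. */
%macro derive_change(subject);
  %get_baseline(&subject)
  %put NOTE: derive_change finished for subject=&subject;
%mend derive_change;

/* Hypothetical driver: supplies the inputs for each test case
   without any change to the module being tested. */
%macro run_tests;
  %derive_change(101)
%mend run_tests;

%run_tests
```

The log notes written by the stub show that the module reached its external dependency, which is all the stub needs to demonstrate.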

USES FOR CYCLOMATIC COMPLEXITY


GENERAL USAGE

The most obvious use of cyclomatic complexity as discussed above is to facilitate the development of robust test protocols by introducing mathematical rigor into one aspect of the overall testing strategy. Your organization may have neither the time, the resources nor the desire to produce and analyze flow diagrams. You may be able to make relatively low impact changes that could result in process improvements. As we have seen, calculating cyclomatic complexity can be as simple as counting control structures.

As complexity increases, testing becomes harder, code becomes more difficult to comprehend and maintenance becomes more difficult. It may be useful to implement complexity limits as part of existing coding standards. A cyclomatic complexity limit of 10 was originally proposed by McCabe with the suggestion that higher complexities could be reserved for special circumstances in which highly experienced programmers work within a defined and rigorous development method [4]. It is relatively easy to determine complexity, so an organization could benefit from enforcing limits without necessarily changing anything else in its practices.

Another benefit may accrue if calculation of cyclomatic complexity is included in the coding process. Programmers should become more aware of simplicity as a design goal and code with testing in mind. As one becomes more sensitive to the impact of increased complexity, the question "How is this going to be tested?" becomes highly relevant regardless of your general testing strategy.
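Counting control structures can be roughly automated. The sketch below is a crude approximation rather than a real parser, and it is not a tool described in the original paper. It assumes the code to be measured is held as text lines in a data set (here named code, with the variable line, both invented for illustration) and it counts the keywords IF, DO WHILE, DO UNTIL and WHEN with a Perl regular expression. It does not recognize comments, quoted strings, subsetting IF statements or macro %IF statements, all of which would be miscounted.

```sas
/* Hold the code to be measured as data. DATALINES4 lets the data
   lines contain semicolons; the block ends at the line of four. */
data code;
  infile datalines truncover;
  input line $char80.;
  datalines4;
data _null_;
  retain m 2 n 1;
  if (n>m) then do;
    r=m; m=n; n=r;
  end;
  r=mod(m, n);
  do while (r ne 0);
    m=n; n=r; r=mod(m, n);
  end;
  put n=;
run;
;;;;

/* Count decision keywords across all lines, then add one. */
data _null_;
  set code end=last;
  retain rx;
  if _n_ = 1 then
    rx = prxparse('/\b(if|do\s+while|do\s+until|when)\b/i');
  start = 1;
  stop = length(line);
  call prxnext(rx, start, stop, line, pos, len);
  do while (pos > 0);
    decisions + 1;   /* sum statement: starts at 0 and is retained */
    call prxnext(rx, start, stop, line, pos, len);
  end;
  if last then do;
    complexity = decisions + 1;
    put decisions= complexity=;
  end;
run;
```

For the sample data step this finds one IF and one DO WHILE, which agrees with the complexity of 3 obtained from the flow graph. A count of this kind is best treated as a screening aid for coding standards, not as a substitute for drawing the flow graph when deriving test cases.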
INDUSTRY-SPECIFIC APPLICATIONS WITH SAS PROGRAMS

The fact that a particular tool could be used in a given situation does not mean it should be used. Programming in the pharmaceutical industry can have constraints such as:

- Short time lines: projects have a tendency to be compressed near the end
- A nearly constant sense of urgency: production of data sets, tables, listings and figures may be on the critical path
- Regulatory requirements that must be met, such as 21 CFR Part 11
- Vague or incomplete requirements that evolve over time
- Unexpected changes in requirements: this is science after all, and new knowledge can spawn new questions we answer by modifying existing code or writing new ad hoc programs

With the exception of regulatory requirements for validation, activities with these constraints are not good candidates for any process that adds time or requires additional resources. When should cyclomatic complexity be used to derive test cases? Consider the following cases:

- Stable systems not associated with a particular clinical protocol, such as randomization systems, clinical data warehouses and the full gamut of multi-user systems with user interfaces
- Standard report generation systems, such as standard code libraries (macro or otherwise) needing rigorous testing
- Programs that create analysis data sets; such programs frequently implement exception handling through flow control and are a good use of cyclomatic complexity
- Macro code that implements important algorithms such as derivations, imputations and endpoints, where robustness is necessary whether such code will be used for only one study or across protocols or analyses

These are all situations where the risk of producing incorrect output needs to be weighed against the resources available for developing and testing programs.
IMPEDIMENTS TO USING CYCLOMATIC COMPLEXITY

There is arguably a shortage of both native and third-party tools for managing projects built with The SAS System. Automated utilities that produce flow graphs and help determine basis paths are available for C, C++ and some proprietary languages. No such tools exist for the various SAS languages. Writing your own tools might not be feasible.

There is also a training issue tightly bound to the way an organization expects to use its programming resources. It makes little sense to use cyclomatic complexity as the only means of determining how to test software. It is best used within the context of a formal development method where there will be either a defined analyst role or the expectation that programmers will possess the analytical skills to create logical and physical system models. Small organizations may find that problematic. Even large organizations that expect programmers to do little except produce data sets, tables, listings and figures (particularly in a reactive mode) will find it nearly impossible to use cyclomatic complexity without having automated tools, providing training to programmers and setting new expectations for the way work will be done. Although the principles are not difficult to understand, the implementation is not free.

CONCLUSION
In this article we have given a brief introduction to one software engineering technique in the context of SAS programming. There is sufficient theory behind this technique that one could spend a great deal more time devising ways of applying the theory at the application level than actually writing code. Analysis of cyclomatic complexity is only one of many techniques that can be applied to SAS programming. You are encouraged to dig deeper by using resources available on the World Wide Web and the vast body of printed resources.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

Michael C. Harris
Amgen Inc.
One Amgen Center Drive, MS 24-2-C
Thousand Oaks, CA 91320
Work Phone: 805.447.1011
Email: mikeh@amgen.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

REFERENCES
1. Arthur H. Watson and Thomas J. McCabe, Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric (Gaithersburg: National Institute of Standards and Technology, 1996), 7.
2. SAS Online Documentation, Version 8.
3. Roger S. Pressman, Software Engineering: A Practitioner's Approach (New York: The McGraw-Hill Companies, Inc., 1997), 456.
4. Watson and McCabe, 8.