
SOFTWARE METRICS MODEL FOR INTEGRATING QUALITY CONTROL AND PREDICTION

Norman F. Schneidewind, Code SW/Ss, Naval Postgraduate School, Monterey, CA 93943. Voice: (408) 656-2719, Fax: (408) 656-3407, Email: schneidewind@nps.navy.mil

Abstract

A model is developed that is used to validate and apply metrics for quality control and quality prediction, with the objective of using metrics as early indicators of software quality problems. Metrics and quality factor data from the Space Shuttle flight software are used as an example. Our approach is to integrate quality control and prediction in a single model and to validate metrics with respect to a quality factor. Although a quality factor is ideally a direct measurement of software quality and, hence, of more interest to customers and users than metrics, quality factors cannot be collected early in a project. Thus the need arises to validate metrics, which developers can collect early in a project, to act as an indirect measurement of the quality factor. Boolean discriminant functions (BDFs) were developed for use in the quality control and quality prediction process. These functions provide good accuracy (i.e., ≤3% error) for classifying low quality software. This is true because the BDFs consist of more than just a set of metrics. They include additional information for discriminating quality: critical values. Critical values are threshold values of metrics that are used to either accept or reject modules when the modules are inspected during the quality control process. A series of nonparametric statistical methods is used to: 1) identify a set of candidate metrics for further analysis; 2) identify the critical values of the metrics; and 3) find the optimal BDF of metrics and critical values based on the ability of the BDF to satisfy both statistical and application (i.e., quality and cost) criteria. We show that it is important to perform a marginal analysis when making a decision about how many metrics to use in the quality control and prediction process. If many metrics are added at once, the contribution of individual metrics is obscured. Also, the marginal analysis provides an effective stopping rule for deciding when to stop adding metrics. We found that certain metrics are dominant in their effects on classifying quality and that additional metrics are not needed to accurately classify quality. This effect is called dominance. Related to the property of dominance is the property of concordance, which is the degree to which a set of metrics produces the same result in classifying software quality. A high value of concordance implies that additional metrics will not make a significant contribution to accurately classifying quality; hence, these metrics are redundant.
Keywords: validating and applying metrics, quality control, quality prediction, Boolean discriminant functions, marginal analysis.

Introduction

A model is developed that is used to validate and apply metrics for quality control and prediction, with the objective of using metrics as early indicators of software quality problems. The model is general and can be applied to any application. However, the best set of metrics to use with the model in a particular application would depend on the results of the model's validation process. Metrics and quality factor data from the Space Shuttle flight software are used to illustrate the model's validation, control, and prediction process. Throughout this paper when we use the word "validate" we do not mean a universal validation of the metrics for all applications. Rather, we refer to validating the relationship between a set of metrics and a quality factor for a given application. This is the significance of our use of the words "with respect to" in the next sentence. Our approach is to integrate quality control and prediction in a single model and to validate metrics with respect to a quality factor in accordance with the metrics validation methodology we developed [SCH92] and that is standardized in the IEEE Standard for a Software Quality Metrics Methodology (1061) [IEE93]. Although a quality factor is a direct measurement of software quality and, hence, of more interest to customers than metrics, quality factors cannot be collected early in a project. Thus the need arises to validate metrics, which developers can collect early in a project, to act as an indirect measurement for the quality factor. Boolean discriminant functions (BDFs) are developed for use in the quality control and prediction process. These functions provide good accuracy (i.e., ≤3% error) for classifying low quality software. This is true because the BDFs consist of more than just a set of metrics.



They include additional information for discriminating quality: critical values. Critical values are threshold values of metrics that are used to either accept or reject modules when the modules are inspected during the quality control process. To reject a module does not mean to discard it. Rather, it means that the module receives priority attention to see whether its metric values are a natural consequence of the functionality of the module. If this is not the case, then the module may be a candidate for redesign, or the design may be left intact and the module given priority treatment during inspection and testing. Critical values should be determined quantitatively, and they can be highly application and project dependent.

A series of nonparametric statistical methods is used to: 1) identify a set of candidate metrics for further analysis; 2) identify the critical values of the metrics; and 3) find the optimal function of metrics and critical values based on the ability of the BDF to satisfy both statistical and application (i.e., quality and cost) criteria. In order to investigate the feasibility of validating and applying metrics for controlling and predicting quality on large software projects, we validate metrics on one random sample (Validation Sample) of modules and apply metrics to three random samples (Application Samples) that are disjoint both among themselves and from the Validation Sample, drawn from a population of 1397 modules of Space Shuttle flight software.

It is important to perform a marginal analysis when making a decision about how many metrics should be used in the quality control and prediction process. If many metrics are added at once, the contribution of individual metrics is obscured. Also, the marginal analysis provides an effective stopping rule for deciding when to stop adding metrics. We also show that certain metrics are dominant in their effects on classifying quality (i.e., dominant metrics make fewer mistakes in classifying modules than non-dominant ones) and that additional metrics are not needed to accurately classify quality. This effect is called dominance. Related to the property of dominance is the property of concordance, which is the degree to which a set of metrics produces the same result in classifying software quality. A high value of concordance implies that additional metrics will not make a significant contribution to accurately classifying quality; hence, these metrics are redundant. Lastly, we found, contrary to our expectation, that the metrics comments and statements, when combined in a single BDF, were better indicators of the quality of the Space Shuttle software than complexity metrics.

Now we provide definitions that are necessary for understanding our metrics model in general and the Space Shuttle flight software measurement environment in particular. This is followed by a description of our objectives and related research. Then we explain the Discriminative Power model and our approach to validation. Next we compare validation with application results for both control and prediction. We close by drawing conclusions about the contributions of our metrics model to software quality control and prediction and the results obtained to date in applying it to the Space Shuttle.

Definitions

Quality Factor: An attribute of software that contributes to its quality [SCH92], where quality is the degree to which software meets customer or user needs or expectations [IEE90]. For example, reliability, an attribute that contributes to quality, is a factor. Quality factors are customer or user oriented. They are attributes that customers or users expect to see in the delivered software. Ideally, we want to directly measure quality factors like reliability. Unfortunately, direct measurement of reliability, for example, by absence of failure during the specified mission, can only be obtained late in a software project during the test and operations phases. Therefore the project may have to resort to an indirect measurement of quality, such as the number of discrepancy reports (DRs) written against modules, drcount, where the DRs record deviations from requirements, as is the case with the Space Shuttle flight software. Thus, in this study, drcount will be treated as a quality factor.

Quality Metric: A function (e.g., cyclomatic complexity M=e-n+2p) whose inputs are software data (elementary software measurements, such as number of edges e, number of nodes n, and number of connected components p, in a directed graph) and whose output is a single numerical value M that can be interpreted as the degree to which software possesses a given attribute (cyclomatic complexity) that may affect its quality (e.g., reliability) [IEE93, SCH92]. A special case is the identity function, wherein the software data (e.g., nodes) are used as metrics. This is the case in the Space Shuttle flight software application.
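To make the quality metric definition concrete, here is a minimal sketch in Python (ours, not part of the paper; the function name is hypothetical) that computes cyclomatic complexity from the elementary measurements named above: edges e, nodes n, and connected components p of a module's control graph.

```python
def cyclomatic_complexity(e: int, n: int, p: int) -> int:
    """McCabe cyclomatic complexity M = e - n + 2p for a directed
    control graph with e edges, n nodes, and p connected components."""
    return e - n + 2 * p

# A module whose control graph has 9 edges, 8 nodes, and 1 component
# has complexity 3 (i.e., three linearly independent paths).
assert cyclomatic_complexity(e=9, n=8, p=1) == 3
```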

Objectives

One objective is to see whether there are relationships between a quality factor and metrics that would allow us to control and predict the quality of software on large-scale projects such as the Space Shuttle flight software. In this application the quality factor is the number of discrepancy reports, drcount, written against a module. A second objective is to integrate quality control and prediction in one model by deriving prediction equations from the Contingency Table, which provides the framework for the control function. Quality control is achieved by using metrics in BDFs during design to see whether the quality of the product is within the thresholds of acceptable quality, and to indicate whether remedial action is necessary (e.g., detailed inspection and tracking of the quality of the product during test and operation). Quality prediction involves computing point estimates and confidence limits of various quality and cost quantities to give software managers forecasts of quality and the cost to achieve it.


This appears to be the first reported application of BDFs, used within the framework of a Contingency Table, to both control and predict software quality.

Related Research

A number of useful related measurement projects have been reported in the literature. Ebert found that a fuzzy classification produced better results in identifying high risk modules in digital switching systems than did Pareto classification, classification trees, factor-based discriminant analysis, and neural network classification [EBE96]. Henry et al. found a strong correlation between errors corrected per module and the impact of the software upgrade [HEN94]. This information can be used to rank modules by their upgrade impact during code inspection in order to find and correct these errors before the software enters the expensive test phase. Khoshgoftaar et al. used nonparametric discriminant analysis in each iteration of their military system project to predict fault prone modules in the next iteration [KHO961]. This approach provided an advance indication of reliability and the risk of implementing the next iteration. They also conducted a similar study involving a telecommunications application, again using nonparametric discriminant analysis, to classify modules as either fault prone or not fault prone [KHO962]. Lanning and Khoshgoftaar found that discriminant functions are useful in classifying software quality using two discriminants: functional enhancements and the lines of code necessary to implement the functional enhancements [LAN95]. Munson and Werries showed how to measure software evolution by using relative complexity to track changes in Space Shuttle software across builds, referenced to a baseline build, for both absolute and delta values of domain metrics [MUN96]. Ohlsson and Alberg used Alberg Diagrams for fault prediction of telephone switches [OHL96]. These diagrams predict percentage of faults as a function of percentage of modules by ordering modules in decreasing order of faults and noting the cumulative number of faults corresponding to various percentages of modules.
Pearse and Oman applied a maintenance metrics index to measure the maintainability of C source code before and after maintenance activities [PEA95]. This technique allowed the project engineers to track the "health" of the code as it was being maintained.

Pfleeger, Fitzgerald, and Rippy used a modified Kiviat Graph to portray the relationship between multiple metrics and software quality over time for evaluating telecommunications switch software [PLE92]. The authors claim this graphical approach is superior to a computed index of multiple metrics because the identity of the individual metrics is retained and the area of the modified Kiviat graph polygon shows the effect of quality improvements over time. Pigoski and Nelson collected and analyzed metrics on size, trouble reports, change proposals, staffing, and trouble report and change proposal completion times [PIG94]. A major benefit of this project was the use of trends to identify the relationship between the productivity of the maintenance organization and staffing levels. Porter and Selby used classification trees to partition multiple metric value space so that a sequence of metrics and their critical values could be identified that were associated with either high quality or low quality software [POR90]. This technique is closely related to our approach of identifying a set of metrics and their critical values that will satisfy quality and cost criteria. However, we use statistical analysis to make the identification.
Stark collected and analyzed metrics in the categories of customer satisfaction, cost, and schedule with the objective of focusing management's attention on improvement areas and tracking improvements over time [STA96]. This approach aided management in deciding whether to include changes in the current release, with possible schedule slippage, or to include the changes in the next release. Stark, Durst, and Vowell used process and product metrics to monitor the trends in several NASA ground and flight projects to see whether the trends followed the desired improvement in product quality and process stability over time or whether the opposite behavior was demonstrated, indicating the need for corrective action [STA94].

Although there are similarities between these projects and our research, our work differs in that we validate and apply metrics to consider both statistical and application criteria in an integrated model for control and prediction by measuring the marginal contribution of each metric in satisfying these criteria. Furthermore, our Boolean discriminant function (BDF) is a new type of discriminant for classifying software quality, and our application of Kolmogorov-Smirnov (K-S) distance is a new way to determine a metric's critical value. Lastly, we have developed a new stopping rule for adding metrics: the ratio of the relative improvement in quality to the relative increase in cost. To date these modeling concepts have only been applied to the Space Shuttle. We have plans to apply the concepts to additional space science projects at the NASA Jet Propulsion Laboratory. However, it should be noted that the basic framework of our methodology (i.e., nonparametric statistical validation of metrics with respect to a quality factor) was previously successfully applied to programs developed at the Naval Postgraduate School [SCH92].


Discriminative Power Model

Discriminative Power Validation. Using our metrics validation methodology [SCH92] and the Space Shuttle flight software metrics and discrepancy reports, we validate metrics with respect to the quality factor drcount. In brief, this involves conducting statistical tests to determine whether there is a high degree of association between drcount and candidate metrics.

Quality Control. The quality control function is applied to the Application Sample to flag software for detailed inspection that is below quality limits. Quality control is the evaluation of modules with respect to predetermined critical values of metrics. The purpose of quality control is to allow software managers to identify software that does not meet quality requirements early in the development process so corrective action can be taken when the cost is low. The Discriminative Power validity criterion is applied to the Validation Sample to validate metrics which will subsequently be used to control the quality of the Application Sample. Discriminative Power is defined as follows [SCH92]:

Discriminative Power. Given matrix Mij of n modules and m metrics (i.e., nm metric values), vector Mcj of m metric critical values, vector Fi of n quality factor values, and scalar Fc of quality factor critical value, Mcj must be able to discriminate with respect to Fc, for a specified Fc, as shown in the following relation:

(Mij ≤ Mcj) ⇒ (Fi ≤ Fc) and (Mij > Mcj) ⇒ (Fi > Fc)    (1)

for i=1,2,...,n and j=1,2,...,m with specified α, where α is the significance level of a statistical test for estimating the degree to which (1) holds. In other words, do the indicated metric relations imply corresponding quality factor relations in (1)? This criterion assesses whether Mcj has sufficient Discriminative Power to be capable of distinguishing a set of high quality modules from a set of low quality modules. If this is the case, we use the critical values to flag modules of the Application Sample that appear to have unacceptable or questionable quality. The desired quality level is set by the choice of Fc by validating Mcj with respect to Fc. If a low value of Fc (e.g., low drcount implying high quality) is selected, it would produce an Mcj that would flag more modules than would be the case if a high value of Fc (high drcount implying low quality) were selected.

Discriminative Power Validation Model

Having defined the Discriminative Power validity criterion, we now develop a model that will allow us to validate metrics for controlling quality during software design. In order to validate metrics, it is necessary to collect both quality factor data and metric data. This will only be feasible when development has progressed to the point where quality factor data, such as discrepancy reports, are available (i.e., inspections have been conducted or tests have been held). In contrast, when metrics are applied during design, only the metric data are available. Thus metrics are used as estimates of quality until the point in development is reached when the quality factor data are available. It is also important to recognize that validation is performed retrospectively. That is, with both metrics and quality factors in hand, we can evaluate how well the metrics would have performed if they had been applied to the Validation Sample. If the metrics perform well, we say they are validated, and it is our expectation that they will perform adequately when applied to the Application Sample (not as well as when applied to the Validation Sample, because of possible differences in module characteristics between the Validation and Application samples, but better than if unvalidated metrics are used).

The basis of this model is a methodology for validating BDFs and their critical values that have the ability to discriminate high quality from low quality. We use a four stage process for selecting metrics for quality control and prediction: 1) identify candidate metrics; 2) compute critical values of the candidate metrics; 3) for the set of candidate metrics and critical values, find the optimal combination based on statistical and application (i.e., quality and cost) criteria; and 4) apply a stopping rule for adding metrics.

Stage 1: Identify Candidate Metrics. In Stage 1 we rank the metrics according to their ability to discriminate between two samples of modules: Fi≤Fc versus Fi>Fc, which correspond to drcount=0 and drcount>0, respectively, in the Space Shuttle flight software. This analysis is independent of the metric critical values (i.e., only the ranks of the metrics are used). The Kruskal-Wallis One-Way Analysis By Ranks (K-W) test is used for this analysis. Data from the two samples are ordered and ranked, and mean ranks are computed for the two sets. If the difference in mean ranks is significant, as determined by the value of the test statistic and significance level (i.e., α≤.005), we conclude that the corresponding populations differ; thus, metrics that pass this test are classified as candidate metrics.

The thirteen metrics that were collected with the quality factor drcount, for 1397 modules written in HAL/S, are defined in Table 1. We show the results of the K-W test for the Validation Sample in Table 2; a bar diagram is shown in Figure 1. We use the K-W test to identify the set of candidate metrics because: 1) K-W has proven to be a good indicator of the relative ability of metrics to classify quality; 2) experience has shown that only a subset of the collected metrics is necessary to identify the optimal BDF, as described in Stage 3; and 3) the amount of computation involved in computing the Kolmogorov-Smirnov (K-S) distance for all metrics (see Stage 2) is large. Thus we use the procedure of screening the metrics initially, using K-W as the criterion, and including metrics for further analysis in Stage 2 and Stage 3 by selecting them from the top in Table 2 (i.e., comments), and continuing the process until the stopping rule in Stage 3 (to be described later) is satisfied.
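As a sketch of how the Stage 1 screening described above could be carried out, the following Python fragment (ours, not the paper's; it uses SciPy's kruskal and synthetic data in place of the Space Shuttle measurements) ranks metrics by the K-W test statistic between the drcount=0 and drcount>0 groups and keeps only those significant at α ≤ .005.

```python
import numpy as np
from scipy.stats import kruskal

def screen_candidate_metrics(metric_values, drcount, alpha=0.005):
    """Rank metrics by the Kruskal-Wallis statistic between the
    drcount=0 and drcount>0 module groups (Stage 1 screening).

    metric_values: dict mapping metric name -> per-module values
    drcount:       per-module discrepancy-report counts
    Returns (metric, H statistic, p value) tuples for metrics that pass
    the significance test, sorted by decreasing H.
    """
    drcount = np.asarray(drcount)
    low, high = drcount == 0, drcount > 0   # high quality vs. low quality modules
    results = []
    for name, values in metric_values.items():
        values = np.asarray(values)
        h_stat, p_value = kruskal(values[low], values[high])
        if p_value <= alpha:                 # keep only significant metrics
            results.append((name, h_stat, p_value))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Example with synthetic data for two of the thirteen metrics.
rng = np.random.default_rng(0)
drcount = rng.poisson(0.8, size=100)
metrics = {
    "comments":   rng.poisson(30 + 25 * (drcount > 0), size=100),
    "statements": rng.poisson(20 + 15 * (drcount > 0), size=100),
}
print(screen_candidate_metrics(metrics, drcount))
```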
Stage 2: Compute Critical Values. Once the metrics have been ranked, critical values Mcj are computed, using a new method we have developed, which is based on the Kolmogorov-Smirnov (K-S) test [CON71]. This method tests whether the sample cumulative distribution functions (CDFs) are from the same or different populations. The test statistic is the maximum vertical difference between the CDFs of two samples (e.g., the CDFs of Mij for drcount≤Fc and for drcount>Fc). If the difference is significant (i.e., α≤.005), the value of Mj corresponding to the maximum CDF difference is used for Mcj. This relationship is expressed in equation (2). This concept is illustrated in Figure 2, for the critical value of comments, where we show the CDFs for drcount=0 and drcount>0. In this example, the critical value of comments is 38. This is the value of comments where there is the maximum difference between the CDFs. Figure 3 shows the difference in CDFs for comments, where the curve reaches a maximum at 38. Table 3 shows the metric critical values Mcj and the K-S distances for the thirteen metrics of the Validation Sample. In addition, the ranks of K-W and K-S show that several metrics rank high with respect to both statistics.

Metrics are added to the BDF in the order of their K-S distance.

Stage 3: Perform Contingency Table Analysis. For each BDF identified in Stage 2 we use the Contingency Table (see Table 3) and its accompanying χ² statistic [CON71] to further evaluate the ability of the functions to discriminate high quality from low quality, from both statistical and application standpoints. In Table 3, Mcj and Fc classify modules into one of four categories. The left column contains modules where none of the metrics exceeds its critical value; this condition is expressed with a Boolean AND function of the metrics. This is the ACCEPT column, meaning that according to the classification decision made by the metrics, these modules have acceptable quality. The right column contains modules where at least one metric exceeds its critical value; this condition is expressed by a Boolean OR function of the metrics. This is the REJECT column, meaning that according to the classification decision made by the metrics, these modules have unacceptable quality. The top row contains modules that are high quality; these modules have a quality factor that does not exceed its critical value (e.g., drcount=0). The bottom row contains modules that are low quality; these modules have a quality factor that exceeds its critical value (e.g., drcount>0). Equation (3) gives the module counts, based on BDFs of Fc and Mcj, that are calculated over the n modules for m metrics. This equation is an implementation of the relation given in (1).

C11 = COUNT FOR (Fi≤Fc) ∧ ((Mi1≤Mc1) ∧ ... ∧ (Mij≤Mcj) ∧ ... ∧ (Mim≤Mcm))
C12 = COUNT FOR (Fi≤Fc) ∧ ((Mi1>Mc1) ∨ ... ∨ (Mij>Mcj) ∨ ... ∨ (Mim>Mcm))
C21 = COUNT FOR (Fi>Fc) ∧ ((Mi1≤Mc1) ∧ ... ∧ (Mij≤Mcj) ∧ ... ∧ (Mim≤Mcm))
C22 = COUNT FOR (Fi>Fc) ∧ ((Mi1>Mc1) ∨ ... ∨ (Mij>Mcj) ∨ ... ∨ (Mim>Mcm))    (3)

for j=1,...,m, and where COUNT(i)=COUNT(i-1)+1 FOR Boolean expression true and COUNT(i)=COUNT(i-1) otherwise; COUNT(0)=0. The counts correspond to the cells of the Contingency Table, as shown in Table 3, where row and column totals are also shown: n, n1, n2, N1, and N2. The analysis could be generalized to include multiple quality factors, if necessary; in this case the Contingency Table would have more than two rows. Note that Table 3 is the Contingency Table for validation that contains both quality factor and metric data.

In addition to counting modules in Table 3, we must also count the quality factor (e.g., drcount) that is incorrectly classified. This is shown as Remaining Factor, RF, in the ACCEPT column. This is the quality factor count (e.g., drcount) on modules that should have been rejected. Also shown is Total Factor, TF, the total quality factor count on all the modules in the sample. Lastly we show RFM (Remaining Factor Modules), the count of modules with quality factor count >0 (i.e., modules with Remaining Factor, RF).
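Before turning to the example validation, here is a minimal sketch, in Python (ours; scipy.stats.ks_2samp stands in for the K-S test, and the function name is hypothetical), of the Stage 2 critical-value computation described above: the critical value is taken at the metric value where the empirical CDFs of the drcount=0 and drcount>0 modules differ the most, provided the K-S test is significant.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_critical_value(metric, drcount, alpha=0.005):
    """Return (critical_value, ks_distance, p_value) for one metric.

    The critical value is the metric value at which the empirical CDFs
    of the high-quality (drcount=0) and low-quality (drcount>0) modules
    are farthest apart; None is returned if the two-sample K-S test is
    not significant at the given alpha.
    """
    metric, drcount = np.asarray(metric, float), np.asarray(drcount)
    high_q, low_q = metric[drcount == 0], metric[drcount > 0]
    stat, p_value = ks_2samp(high_q, low_q)      # two-sample K-S test
    if p_value > alpha:
        return None, stat, p_value
    # Evaluate both empirical CDFs on the pooled set of observed values.
    grid = np.sort(np.unique(metric))
    cdf_high = np.searchsorted(np.sort(high_q), grid, side="right") / high_q.size
    cdf_low = np.searchsorted(np.sort(low_q), grid, side="right") / low_q.size
    diff = np.abs(cdf_high - cdf_low)
    critical = grid[np.argmax(diff)]             # metric value at max CDF difference
    return critical, diff.max(), p_value
```

For the Validation Sample discussed below, this kind of procedure gives, for example, a critical value of 38 for comments.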
Table 3 and subsequent equations show an example validation, where the optimal combination of metrics from Table 1 and their critical values for a random sample of 100 modules (sample 1), from the population of 1397, is comments with a critical value of 38 and statements with a critical value of 26. The critical values were obtained by using the ranking of metrics from the K-W test in Table 2 and applying the K-S criterion as given by equation (2) and illustrated in Figures 2 and 3. Later we explain how we arrived at the particular combination of metrics as the optimal set. The reason that comments is the leading metric in this example is that in the case of the Space Shuttle the comments are not limited to the conventional comments describing the purpose and action of the statements in a program. In addition to this type of comment, and more important, the Space Shuttle comments contain a history of the experience with a module: requirements, design approach, inspections, tests, discrepancy reports, etc. Thus, in general, more comments means more problems with the software!
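The following sketch (hypothetical Python; the data-handling details are assumptions, not the paper's code) applies the example BDF above, reject if comments > 38 or statements > 26, to produce the cell counts of equation (3) and the χ² statistic used in Stage 3.

```python
import numpy as np
from scipy.stats import chi2_contingency

def bdf_contingency(metrics, criticals, drcount, f_c=0):
    """Classify modules with a Boolean discriminant function and count
    the Contingency Table cells C11, C12, C21, C22 of equation (3).

    metrics:   dict of metric name -> per-module values
    criticals: dict of metric name -> critical value (same keys)
    drcount:   per-module quality factor values (F_i)
    """
    drcount = np.asarray(drcount)
    # ACCEPT if every metric is at or below its critical value (Boolean AND);
    # REJECT if any metric exceeds its critical value (Boolean OR).
    accept = np.ones(drcount.size, dtype=bool)
    for name, crit in criticals.items():
        accept &= np.asarray(metrics[name]) <= crit
    high_q = drcount <= f_c
    table = np.array([
        [np.sum(high_q & accept),  np.sum(high_q & ~accept)],   # C11, C12
        [np.sum(~high_q & accept), np.sum(~high_q & ~accept)],  # C21, C22
    ])
    chi2, p_value, _, _ = chi2_contingency(table)
    return table, chi2, p_value

# Example BDF from the validation above: reject when comments > 38 OR statements > 26.
example_criticals = {"comments": 38, "statements": 26}
```

For the Validation Sample this partition is C11=30, C12=27, C21=1, and C22=42, as shown in Table 3.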
Application Contingency Table. A different Contingency Table is used for the application of validated metrics. This table contains only metrics data. This is shown in Table 4, where the "?" indicates that the quality factor data Fi are not available when the validated metrics are used in the quality control function of the application project. During the design phase of the application project, modules are classified according to the criteria that have been described. A second disjoint random sample of 100 modules (sample 2) was used to illustrate the process. Whereas 31 and 69 modules were accepted and rejected, respectively, during validation, 40 and 60 modules were accepted and rejected, respectively, during the application. The rejected modules would be given priority attention, as was described previously.

Statistical Validation. We validate Mcj statistically by demonstrating that it partitions Table 3 in such a way that C11 and C22 are large relative to C12 and C21. If this is the case, a large number of high quality modules (e.g., modules with zero drcount) would have Mij≤Mcj and would be correctly classified as high quality. Similarly, a large number of low quality modules (e.g., modules with drcount>0) would have Mij>Mcj and would be correctly classified as low quality. The degree to which this is the case is estimated by the chi-square (χ²) statistic. If the calculated χ² exceeds the critical χ² at the specified α, and if the calculated significance level is less than the specified α, we conclude that Mcj is statistically significant.

Misclassification. As part of the statistical validation, we estimate the Discriminative Power of Mcj by noting in Table 3 that ideally C11=n1=N1, C12=0, C21=0, C22=n2=N2. The extent that this is not the case is estimated by Type 1 misclassifications (i.e., the module has Low Quality and the metrics "say" it has High Quality) and Type 2 misclassifications (i.e., the module has High Quality and the metrics "say" it has Low Quality). Thus we define the following measures of misclassification:

Proportion of Type 1: P1 = C21/n    (4)

For the example, P1 = (1/100)*100 = 1%.

Proportion of Type 2: P2 = C12/n    (5)

For the example, P2 = (27/100)*100 = 27%.

Proportion of Type 1 + Type 2: P12 = (C12+C21)/n    (6)

For the example, P12 = ((1+27)/100)*100 = 28%.

Application Criteria

It is insufficient to just validate with respect to statistical criteria. In the final analysis, it is the performance of the metrics in the application context that counts. Therefore, we validate metrics with respect to the application criteria: Quality and Inspection [SCH951, SCH952, SCH94].

Quality. First we estimate the ability of the metrics to correctly classify quality, given that the quality is known to be low:

LQC: Proportion of low quality (i.e., drcount>0) software correctly classified = C22/n2    (7)

For the example, LQC = (42/43)*100 = 97.7%.

As part of the application validation, we estimate the Discriminative Power of Mcj by summing the quality factor in the ACCEPT column of Table 3 to produce the Remaining Factor RF (e.g., remaining drcount), given by equation (8). This is the sum of Fi not caught by inspection because (Fi>Fc)∧(Mij≤Mcj) for these modules.

RF = sum of Fi over modules with (Fi>Fc) ∧ ((Mi1≤Mc1) ∧ ... ∧ (Mij≤Mcj) ∧ ... ∧ (Mim≤Mcm))    (8)

for j=1,...,m. We estimate the proportion of RF by equation (9), where TF is the total Fi prior to inspection.

RFP = RF/TF    (9)

For the example, from Table 3 there is one DR on one module that is incorrectly classified (i.e., RF=1). The total number of DRs for the 100 modules is 192. Therefore, RFP = (1/192)*100 = .52%.

We estimate the density of RF by equation (10).

RFD = RF/n    (10)

For the example, RFD = 1/100 = .01 drcount/module.

In addition, we estimate the count of modules remaining that have Fi>Fc, RFM. The proportion remaining, RMP, is given by equation (11). Note that RMP=P1 (proportion of Type 1 misclassifications) when Fc=0 (i.e., the only modules with Fi>0 will be in the C21 cell); see Table 3.

RMP = RFM/n    (11)

For the example, there is one accepted module with one DR, so RMP = (1/100)*100 = 1%.

Inspection. Inspection is one of the costs of high quality. We are interested in weighing inspection requirements against the quality gained for various values of Mcj. We estimate inspection requirements by noting that all modules with Mij>Mcj must be inspected; this is the count C12+C22. Thus the proportion of modules that must be inspected is given by:

I = (C12+C22)/n    (12)

For the example, I = ((27+42)/100)*100 = 69%.

Note that I is also equal to the proportion of modules rejected as a result of applying the critical values of metrics (see Tables 3 and 4) and that 1-I is equal to the proportion of modules accepted. Thus, for the example, the proportion of modules accepted is 31%. Some inspection is "wasted" because of Type 2 misclassifications (C12) (i.e., modules are inspected because they are incorrectly flagged). We estimate wasted inspection by using equation (13), the ratio of correctly flagged to incorrectly flagged modules, for Type 2 misclassifications:

RI = C22/C12    (13)

For the example, RI = 42/27 = 1.56.

Summary of Validation Results. The results of the Validation Sample are summarized in Table 5. The properties of dominance and concordance, which were defined earlier, are evident in these validation results and can be analyzed from this data. That is, a point is reached in adding metrics where Discriminative Power is not improved because: 1) the contribution of the dominant metrics in classifying quality has already taken effect (i.e., the dominance effect) and 2) additional metrics essentially replicate the classification of the dominant metrics (i.e., the concordance effect). This is a property of the BDF used as an OR function: the metrics cause a module to be rejected if only one of the module's metrics exceeds its critical value. These effects can only be observed if a marginal analysis is performed, where metrics are added to the set one-by-one and the calculations shown in Table 5 are made after each metric is added. For each added metric, its effect is evaluated with respect to both statistical and application criteria. In addition, a suitable stopping rule must be used to know when to stop adding metrics (see the next section). Because, as mentioned earlier, the prominence of the metric comments was an unexpected result, an analysis was made in Table 5 of the best set of metrics excluding comments. Thus, entering the metrics in the set according to the K-S rule, statements, eta1, and nodes are shown in Table 5. Note that these three metrics are required to reach the same value of RFP as comments alone and that the other measures of quality, LQC and RMP, are inferior. Thus comments alone is superior to the three metric set for this sample.

Stage 4: Apply a Stopping Rule for Adding Metrics. One rule for stopping the addition of metrics to a BDF is to quit when RFP no longer decreases as metrics are added. This is the maximum quality rule. This rule is illustrated in Table 5. When a third metric, eta1, is added, there is no decrease in RFP and RMP, nor is there an increase in LQC. If it is important to strike a balance between quality and cost (i.e., between RFP and I), we add metrics until the ratio of the relative change in RFP to the relative change in I is maximum, as given by the Quality Inspection Ratio in equation (14), where i refers to the previous RFP and I:

QIR = (ΔRFPi/RFPi)/(ΔIi/Ii)    (14)

For the example, QIR(C→C,S) = ((1.56-.52)/1.56)/((69-62)/62) = 5.90 and QIR(C,S→C,S,ETA1) = 0. Therefore, we stop adding metrics after statements has been added. In this particular case, equation (14) produces the same metric set as the maximum quality rule.

Comparison of Validation with Application Results

A comparison of the Validation Sample with the Application Samples with respect to statistical criteria is shown in Table 6. A comparison of the Validation Sample with the Application Samples with respect to application criteria is shown in Tables 7 and 8. As we have mentioned, only metrics data are available when the validated metrics are applied. However, to have a basis for comparison with the validation results, we computed the values shown in Tables 6, 7, and 8 retrospectively (i.e., after the application project was far enough along to be able to collect the quality factor data).


RMP is not shown in these tables because it is equal to P1 when Fc=0. The average relative error ((validation - application)/application) across eighteen comparisons between sample 1 and samples 2, 3, and 4 in Tables 6, 7, and 8 is 32.9%, with a standard deviation of 31.3%.
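A compact sketch of the statistical and application criteria (equations 4 through 13) and the stopping rule of equation (14), in Python (ours; the dictionaries and the worked numbers simply restate the example values quoted above), is given below.

```python
def criteria(c11, c12, c21, c22, rf, rfm, tf):
    """Statistical and application criteria for one BDF (equations 4-13)."""
    n = c11 + c12 + c21 + c22
    return {
        "P1":  100.0 * c21 / n,             # Type 1 misclassification (4)
        "P2":  100.0 * c12 / n,             # Type 2 misclassification (5)
        "P12": 100.0 * (c12 + c21) / n,     # Type 1 + Type 2 (6)
        "LQC": 100.0 * c22 / (c21 + c22),   # low quality correctly classified (7)
        "RFP": 100.0 * rf / tf,             # quality factor not caught (9)
        "RFD": rf / n,                      # remaining factor density (10)
        "RMP": 100.0 * rfm / n,             # remaining factor modules (11)
        "I":   100.0 * (c12 + c22) / n,     # proportion inspected (12)
        "RI":  c22 / c12,                   # correctly vs. wastefully flagged (13)
    }

def quality_inspection_ratio(prev, new):
    """QIR of equation (14): relative RFP improvement over relative increase in I."""
    return ((prev["RFP"] - new["RFP"]) / prev["RFP"]) / ((new["I"] - prev["I"]) / prev["I"])

# Comments-plus-statements BDF on the Validation Sample (Table 3).
both = criteria(c11=30, c12=27, c21=1, c22=42, rf=1, rfm=1, tf=192)
print(both["LQC"], both["RFP"], both["I"])      # about 97.7, 0.52, 69.0

# Stopping rule, using the RFP and I values quoted in the text for
# comments alone (RFP=1.56, I=62) versus comments plus statements.
qir = quality_inspection_ratio({"RFP": 1.56, "I": 62.0}, both)
print(round(qir, 2))                            # about 5.9; adding statements pays off
```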
Predictability Validation Model

In addition to using the Contingency Table to classify the Validation Sample during validation and subsequently to apply the results during quality control to the Application Sample, various quality predictions, such as the proportions of modules with zero and non-zero drcount and their confidence limits, are derived from the Contingency Table. This approach has the advantage of integrating quality control and quality prediction in one basic model. Thus the software developer is provided with validated metrics to both control and predict the quality of the Application Sample. The predictions are used by the developer to anticipate rather than react to quality problems. For example, the predictions provide indications of the quality of the software that would be delivered to the customer if remedial steps (e.g., inspection, testing) are not taken. In addition, the predictions provide indications of resource levels that are needed to achieve quality goals. For example, if the predicted quality of the software is lower than the specified quality, the difference would be an indication of increased usage of personnel and computer time during inspection and testing, respectively. This appears to be the first reported application of BDFs, used within the framework of a Contingency Table, to both control and predict software quality.

Two types of predictions are made: module counts and quality factor counts. Using Validation Sample 1, point estimates and confidence intervals are computed for each and compared with the actual values of Application Samples 2, 3, and 4. The normal approximation to the binomial distribution is used to compute the confidence limits of the proportions. The following notation is used:

n: number of modules in the sample
N1: number of modules not flagged for inspection in the Validation Sample
N2: number of modules flagged for inspection in the Validation Sample
N1': number of modules not flagged for inspection in the Application Sample
N2': number of modules flagged for inspection in the Application Sample

Module Counts. The proportions of modules that are either not flagged or flagged for inspection are estimated below and are shown in Table 9.

o proportion of modules with Fi>0 (e.g., drcount>0 on module i) in the Validation Sample prior to inspection and correction of defects:

pv = (COUNT FOR Fi>0)/n    (15)

where COUNT(i)=COUNT(i-1)+1 FOR expression true and COUNT(i)=COUNT(i-1) otherwise; COUNT(0)=0.

o two-sided confidence limits of pv, used as predicted limits of pv' in the Application Sample (16)

o proportion of modules not flagged for inspection (i.e., contained in N1) with Fi>0 (e.g., drcount>0 on module i) in the Validation Sample:

pN1 = (COUNT FOR Fi>0 in N1)/N1    (17)

o one-sided upper confidence limit of pN1, used as predicted limit of pN1' in the Application Sample (18)

o proportion of modules flagged for inspection (i.e., contained in N2) with Fi>0 (e.g., drcount>0 on module i) in the Validation Sample:

pN2 = (COUNT FOR Fi>0 in N2)/N2    (19)

o one-sided lower confidence limit of pN2, used as predicted limit of pN2' in the Application Sample (20)

Quality Factor Counts. The proportion of the quality factor count (e.g., drcount) on modules that are either not flagged or flagged for inspection is estimated below and shown in Table 9. In addition, point estimates of the total quality factor count on the modules are estimated and shown in Table 10.

o proportion of the quality factor that occurs on modules not flagged for inspection (i.e., contained in N1) in the Validation Sample:

dv = RF/TF (the same as RFP if RFP is expressed as a proportion)    (21)

o one-sided upper confidence limit of dv, used as predicted limit of dv' in the Application Sample (22)

o proportion of the quality factor that occurs on modules flagged for inspection (i.e., contained in N2) in the Validation Sample:

dN2 = (TF - RF)/TF    (23)

o one-sided lower confidence limit of dN2, used as predicted limit of dN2' in the Application Sample (24)

o expected drcount that occurs on modules not flagged for inspection (i.e., contained in N1'), used as a predictor of D1' in the Application Sample (note that no knowledge of the Application quality factor is required to make this estimate) (25)

o expected drcount that occurs on modules flagged for inspection (i.e., contained in N2'), used as a predictor of D2' in the Application Sample (note that no knowledge of the Application quality factor is required to make this estimate) (26)
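As a sketch of the module-count predictions above (Python, ours; norm.ppf supplies the normal quantiles for the approximation the paper names), the proportions pv, pN1, and pN2 are estimated from the Validation Sample and given normal-approximation binomial limits: two-sided for pv, a one-sided upper limit for pN1, and a one-sided lower limit for pN2.

```python
import math
from scipy.stats import norm

def two_sided_limits(p, n, confidence=0.95):
    """Two-sided normal-approximation confidence limits for a proportion."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def one_sided_limit(p, n, upper=True, confidence=0.95):
    """One-sided (upper or lower) normal-approximation limit for a proportion."""
    z = norm.ppf(confidence)
    shift = z * math.sqrt(p * (1 - p) / n)
    return p + shift if upper else p - shift

# Validation Sample 1 values from Table 3: n=100, N1=31, N2=69,
# 43 modules with drcount>0, of which 1 was accepted (not flagged).
n, n1_acc, n2_rej = 100, 31, 69
p_v  = 43 / n          # proportion of modules with drcount > 0
p_n1 = 1 / n1_acc      # drcount > 0 among modules not flagged
p_n2 = 42 / n2_rej     # drcount > 0 among modules flagged

print(two_sided_limits(p_v, n))                    # predicted interval for pv'
print(one_sided_limit(p_n1, n1_acc, upper=True))   # predicted upper limit for pN1'
print(one_sided_limit(p_n2, n2_rej, upper=False))  # predicted lower limit for pN2'
```

With the Validation Sample counts, this reproduces the Sample 1 entries of Table 9 (for example, 33.3% to 52.7% for pv).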

Ten of the actual values out of the fifteen cases in Table 9 fall within the confidence limits. The average relative error across six comparisons between sample 1 and samples 2, 3, and 4 in Table 10 is 28.9%, with a standard deviation of 30.7%.

Conclusions

It is important when validating and applying metrics to consider both statistical and application criteria and to measure the marginal contribution of each metric in satisfying these criteria. When this approach is used, we observe that a point is reached where adding metrics makes no contribution to improving quality and the cost of using additional metrics increases. This phenomenon is due to the metric classification properties of dominance and concordance. The ratio of the relative improvement in quality to the relative increase in cost provides a good stopping rule for adding metrics. Our Boolean discriminant function (BDF) is a new type of discriminant for classifying software quality to support an integrated approach to control and prediction in one model, and our application of Kolmogorov-Smirnov (K-S) distance is a new way to determine a metric's critical value. We found that the metrics comments and statements, when combined in a single BDF, were better indicators (i.e., ≤3% error) than complexity metrics in classifying low quality software of the Space Shuttle.

Acknowledgments

We wish to acknowledge the support provided for this project by Dr. William Farr of the Naval Surface Warfare Center, by Mr. Allen Nikora of the Jet Propulsion Laboratory, and by the Marine Corps Tactical System Support Activity; the data provided by Prof. John Munson of the University of Idaho; and the assistance provided by Mr. Ted Keller and Ms. Patti Thornton of Lockheed-Martin Space Mission Systems & Services.

References

[CON71] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, Inc., 1971.

[EBE96] Christof Ebert, "Evaluation and Application of Complexity-Based Criticality Models", Proceedings of the Third International Software Metrics Symposium, March 25-26, 1996, Berlin, Germany, pp. 174-185.

[HEN94] Joel Henry, Sallie Henry, Dennis Kafura, and Lance Matheson, "Improving Software Maintenance at Martin Marietta", IEEE Software, Vol. 11, No. 4, July 1994, pp. 67-75.

[IEE90] IEEE Glossary of Software Engineering Terminology, IEEE Std 610.12, 1990.

[IEE93] Standard for a Software Quality Metrics Methodology, IEEE Std 1061-1992, March 12, 1993.

[KHO961] Taghi M. Khoshgoftaar, Edward B. Allen, Robert Halstead, and Gary P. Trio, "Detection of Fault-Prone Software Modules During a Spiral Life Cycle", Proceedings of the International Conference on Software Maintenance, Monterey, California, November 4-8, 1996, pp. 69-76.

[KHO962] Taghi M. Khoshgoftaar, Edward B. Allen, Kalai Kalaichelvan, and Nishith Goel, "Early Quality Prediction: A Case Study in Telecommunications", IEEE Software, Vol. 13, No. 1, January 1996, pp. 65-71.

[LAN95] D. Lanning and T. Khoshgoftaar, "The Impact of Software Enhancement on Software Reliability", IEEE Transactions on Reliability, Vol. 44, No. 4, December 1995, pp. 677-682.

[MUN96] John C. Munson and Darrell S. Werries, "Measuring Software Evolution", Proceedings of the Third International Software Metrics Symposium, March 25-26, 1996, Berlin, Germany, pp. 41-51.

[OHL96] Niclas Ohlsson and Hans Alberg, "Predicting Fault-Prone Software Modules in Telephone Switches", IEEE Transactions on Software Engineering, Vol. 22, No. 12, December 1996, pp. 886-894.

[PEA95] Troy Pearse and Paul Oman, "Maintainability Measurements on Industrial Source Code Maintenance Activities", Proceedings of the International Conference on Software Maintenance, Opio (Nice), France, October 17-20, 1995, pp. 295-303.

[PIG94] Thomas M. Pigoski and Lauren E. Nelson, "Software Maintenance Metrics: A Case Study", Proceedings of the International Conference on Software Maintenance, Victoria, British Columbia, Canada, September 19-23, 1994, pp. 392-401.

[PLE92] Shari Lawrence Pfleeger, Joseph C. Fitzgerald Jr., and Dale A. Rippy, "Using Multiple Metrics for Analysis of Improvement", Software Quality Journal, Vol. 1, 1992, pp. 27-36.

[POR90] A. A. Porter and R. W. Selby, "Empirically Guided Software Development Using Metric-Based Classification Trees", IEEE Software, Vol. 7, No. 2, pp. 151-160.

[SCH951] Norman F. Schneidewind, "Software Metrics Validation: Space Shuttle Flight Software Example", Annals of Software Engineering, J. C. Baltzer AG, Science Publishers, 1995.

[SCH952] Norman F. Schneidewind, "Controlling and Predicting the Quality of Space Shuttle Software Using Metrics", Software Quality Journal, Vol. 4, pp. 49-68, 1995, Chapman & Hall.

[SCH94] Norman F. Schneidewind, "Validating Metrics for Controlling and Predicting the Quality of Space Shuttle Flight Software", IEEE Computer, Vol. 27, No. 8, August 1994, pp. 50-57.

[SCH92] Norman F. Schneidewind, "Methodology for Validating Software Metrics", IEEE Transactions on Software Engineering, Vol. 18, No. 5, May 1992, pp. 410-422.

[STA96] George E. Stark, "Measurements for Managing Software Maintenance", Proceedings of the International Conference on Software Maintenance, Monterey, California, November 4-8, 1996, pp. 152-161.

[STA94] George Stark, Robert C. Durst, and C. W. Vowell, "Using Metrics in Management Decision Making", IEEE Computer, Vol. 27, No. 9, September 1994, pp. 42-48.

Table 1: Metrics Definitions (counts per module)

eta1      unique operator count
eta2      unique operand count
n1        total operator count
n2        total operand count
stmts     total statement count (executable code; no comments)
loc       total non-commented lines of code
comments  total comment count
nodes     total node count (in control graph)
edges     total edge count (in control graph)
paths     total path count (in control graph)
maxpath   maximum path length (edges in control graph)
avepath   average path length (edges in control graph)
cycles    total cycle count (in control graph)

Table 3: Validation Contingency Table (Sample 1, n=100 modules)

                                  ACCEPT: (C≤38)∧(S≤26)    REJECT: (C>38)∨(S>26)
High Quality (Fi≤Fc, drcount=0)   C11=30                   C12=27 (Type 2)          n1=57
Low Quality (Fi>Fc, drcount>0)    C21=1 (Type 1)           C22=42                   n2=43
                                  N1=31, RF=1, RFM=1       N2=69                    n=100, TF=192

Table 4: Application Contingency Table (Sample 2, n=100 modules)

                                  ACCEPT: (C≤38)∧(S≤26)    REJECT: (C>38)∨(S>26)
High Quality                      ?                        ? (Type 2)
Low Quality                       ? (Type 1)               ?
                                  N1'=40                   N2'=60

Table 5: Discriminative Power Validity Evaluation (Sample 1, n=100 modules)

Table 6: Statistical Criteria P1 and P2 for Metric Set C,S, Validation (Sample 1) vs. Application (Samples 2, 3, and 4), n=100 modules

P1: Percentage Type 1 Misclassification    Sample 1: 1.0    Sample 2: 1.0    Sample 3: 4.0    Sample 4: 3.0
P2: Percentage Type 2 Misclassification    Sample 1: 27.0   Sample 2: 24.0

Tables 7 and 8: Application Criteria for Metric Set C,S, Validation (Sample 1) vs. Application (Samples 2, 3, and 4). Rows include LQC: Percentage of low quality software correctly classified (97.7, 97.3, 91.1, 93.2 for Samples 1 through 4), RFP: Percentage of quality factor not caught by inspection, RFD: Density of quality factor (drcount/module) not caught by inspection, and I: Percentage of modules inspected.

Table 9: Validation Predictions (Sample 1) vs. Application Actual Values (Samples 2, 3, and 4)

pv (proportion of modules with drcount>0): point estimate 43.0%; 95% confidence limits 33.3% to 52.7%; actual values 37.0%, 45.0%, 44.0%
pN1 (proportion of modules not flagged for inspection with drcount>0): point estimate 3.22%; upper confidence limit 8.45%
pN2 (proportion of modules flagged for inspection with drcount>0): point estimate 60.9%; lower confidence limit 51.2%
dv (proportion of drcount on modules not flagged for inspection): point estimate .52%; upper confidence limit 1.38%
dN2 (proportion of drcount on modules flagged for inspection): point estimate 99.5%; lower confidence limit 98.6%

Table 10: Validation Actual Values and Predictions (Sample 1) vs. Application Actual Values (Samples 2, 3, and 4)

D2' (expected drcount on modules flagged for inspection): predictions 166.1, 163.3, 174.4 vs. actual values 191, 161, 179

Figure 1. Kruskal-Wallis Analysis of Metrics: K-W test statistic (sample 1, n=100 modules) for comments, eta1, stmts, nodes, edges, loc, maxpath, paths, avepath, eta2, n1, n2, and cycles.

Figure 2. K-S Test: comments CDF (sample 1, n=100 modules); the maximum difference between the CDFs occurs at the critical value of 38.

Figure 3. Difference in CDFs for comments (sample 1, n=100 modules).
