This action might not be possible to undo. Are you sure you want to continue?

∗

Chao Liu Xifeng Yan Jiawei Han

Department of Computer Science

University of Illinois at Urbana-Champaign

¦chaoliu, xyan, hanj¦@cs.uiuc.edu

Abstract

Analyzing the executions of a buggy program is essen-

tially a data mining process: Tracing the data generated

during program executions may disclose important pat-

terns and outliers that could eventually reveal the lo-

cation of software errors. In this paper, we investigate

program logic errors, which rarely incur memory ac-

cess violations but generate incorrect outputs. We show

that through mining program control ﬂow abnormality,

we could isolate many logic errors without knowing the

program semantics.

In order to detect the control abnormality, we pro-

pose a hypothesis testing-like approach that statistically

contrasts the evaluation probability of condition state-

ments between correct and incorrect executions. Based

on this contrast, we develop two algorithms that eﬀec-

tively rank functions with respect to their likelihood of

containing the hidden error. We evaluated these two

algorithms on a set of standard test programs, and the

result clearly indicates their eﬀectiveness.

Keywords. software errors, abnormality, ranking

1 Introduction

Recent years have witnessed a wide spread of data min-

ing techniques into software engineering researches, such

as in software speciﬁcation extraction [2, 22, 19] and in

software testings [5, 7, 23]. For example, frequent item-

set mining algorithms are used by Livshits et al. to ﬁnd,

from software revision histories, programming patterns

that programmers are expected to conform to [22]. A

closed frequent itemset mining algorithm, CloSpan [27],

is employed by Li et al. in boosting the performance of

storage systems [17] and in isolating copy-paste program

errors [18]. Besides frequent pattern-based approaches,

∗

This work was supported in part by the U.S. National Science

Foundation NSF ITR-03-25603, IIS-02-09199, and IIS-03-08215.

Any opinions, ﬁndings, and conclusions or recommendations

expressed here are those of the authors and do not necessarily

reﬂect the views of the funding agencies.

existing machine learning techniques, such as Support

Vector Machines, decision trees, and logistic regression

are also widely adopted for various software engineering

tasks [5, 6, 7, 23, 21].

In this paper, we develop a new data mining algo-

rithm that can assist programmers’ manual debugging.

Although programming language designs and software

testings have greatly advanced in the past decade, soft-

ware is still far from error (or bug, fault) free [8, 16].

As a rough estimation, there are usually 1-10 errors per

thousand lines of code in deployed softwares [11]. Be-

cause debugging is notoriously time-consuming and la-

borious, we wonder whether data mining techniques can

help speed up the process. As the initial exploration,

this paper demonstrates the possibility of isolating logic

errors through statistical inference.

The isolation task is challenging in that (1) stati-

cally, a program, even of only hundreds of lines of code,

is a complex system because of the inherent intertwined

data and control dependencies; (2) dynamically, the ex-

ecution paths can vary greatly with diﬀerent inputs,

which further complicates the analysis; and (3) most

importantly, since logic errors are usually tricky, the

diﬀerence between incorrect and correct executions is

not apparent at all. Therefore, it is almost like looking

for a needle in a haystack to isolate logic errors. Due to

these characteristics, isolating program errors could be

a very important playground for data mining research.

From a data mining point of view, isolating logic

errors is to discover suspicious buggy regions through

screening bulky program execution traces. For exam-

ple, our method ﬁrst monitors the program runtime be-

haviors, and then tries to discover the region where the

behavior of incorrect executions (or runs) diverges from

that of correct ones. Diﬀerent from conventional soft-

ware engineering methods, this data mining approach

assumes no prior knowledge of the program semantics,

thus providing a higher level of automation.

Before detailing the concepts of our method, let us

ﬁrst take a brief overview of program errors. Based

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 106

on their faulty symptoms, we can roughly classify pro-

gram errors into two categories: memory errors and

logic errors. Memory errors usually manifest themselves

through memory abnormalities, like segmentation faults

and/or memory leaks. Some tools, like Purify, Valgrind

and CCured, are designed for tracking down memory

errors. Logic errors, on the other hand, refer to the pro-

gram logic incorrectness. Instead of causing programs

to abort due to memory violations, programs with logic

errors usually exit silently without segmentation faults.

The only faulty symptom is its malfunction, like gen-

erating none or incorrect outputs. Because logic errors

do not behave abnormally at the memory access level,

oﬀ-the-shelf memory monitoring tools, like Purify, have

little chance to isolate them. Considering the plethora

of tools for memory errors, we plan to focus on isolating

logic errors in this paper. To illustrate the challenges,

let us ﬁrst examine an example.

Program 1 Buggy Code - Version 9 of replace

01 void dodash (char delim, char *src, int *i,

02 char *dest, int *j, int maxset)

03 {

04 while (...) {

05 ...

06 if ( isalnum(src[*i-1]) && isalnum(src[*i+1])

07 /* && (src[*i-1] <= src[*i+1]) */)

08 {

09 for (k = src[*i-1]+1; k<=src[*i+1]; k++)

10 junk = addstr(k, dest, j, maxset);

11 *i = *i + 1;

12 }

13 ...

14 (*i) = (*i) + 1;

15 }

16 }

Example 1. Program 1 shows a code segment of

replace, a program that performs regular expression

matching and substitutions. The replace program has

512 lines of non-blank C code, and consists of 20 func-

tions. In the dodash function, one subclause (com-

mented by /∗ and ∗/) that should have been included

in the if condition is missed by the developer. This is

a typical “incomplete logic” error. We tested this buggy

program through 5542 test cases and found that 130 out

of them failed to give the correct outputs. Moreover, no

segmentation faults happened during executions.

For logic errors as the above, without any memory

violations, the developer generally has no clues for

debugging. Probably, he will resort to conventional

debuggers, like GDB, for a step-by-step tracing and

verify observed values against his expectations in mind.

To make things worse, if the developer is not the original

author of the program, which is very likely in reality,

such as maintaining legacy codes, he may not even be

able to trace the execution until he at least roughly

understands the program. However, understanding

other programmers’ code is notoriously troublesome

because of un-intuitive identiﬁers, personalized coding

styles and, most importantly, the complex control and

data dependencies formulated by the original author.

Knowing the diﬃculty of debugging logic errors, we

are interested in ﬁnding an automated method that can

prioritize the code examination for debugging, e.g., to

guide programmers to examine the buggy function ﬁrst.

Although it is unlikely to totally unload the debugging

burden from human beings, we ﬁnd it possible to min-

imize programmers’ workloads through machine intelli-

gence. For instance, by correctly suggesting the possible

buggy regions, one can narrow down the search, reduce

the debugging cost, and speed up the development pro-

cess. We follow this moderate thought in this study.

In order to facilitate the analysis of program execu-

tions, we need to ﬁrst encode each execution in such

a way that discriminates incorrect from correct exe-

cutions. This is equivalent to a proper extraction of

error-sensitive features from program executions. Be-

cause the faulty symptom of logic error is the unex-

pected execution path, it is tempting to register the ex-

act path for each execution. However, this scheme turns

out to be hard to manipulate, as well as expensive to

register. Noticing that executions are mainly directed

by condition statements (e.g., if, while, for), we ﬁ-

nally decide to summarize the execution by the evalu-

ation frequencies of conditionals. In other words, for

each condition statement, we record how many times

it is evaluated as true and false respectively in each

execution. Although this coding scheme loses much in-

formation about the exact execution, it turns out to be

easy to manipulate and eﬀective.

Based on the above analysis, we treat each condi-

tional in the buggy program as one distinct feature and

try to isolate the logic error through contrasting the

behaviors of correct and incorrect executions in terms

of these conditionals. Speciﬁcally, we regard that the

more divergent the evaluation of one conditional in in-

correct runs is from that of correct ones, the more likely

the conditional is bug-relevant. In order to see why this

choice can be eﬀective, let us go back to Example 1.

For convenience, we declare two boolean variables

A and B as follows,

A = isalnum(src[*i-1]) && isalnum(src[*i+1]);

B = src[*i-1]<= src[*i+1];

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 107

If the program were written correctly, A ∧ B should

have been the guarding condition for the if block, i.e.,

the control ﬂow goes into the if block if and only if

A ∧ B is true. However, in the buggy program where

B is missing, any control ﬂow that satisﬁes A will

fall into the block. As one may notice and we will

explain in Section 2, an execution is logically correct

until (A∧B) is evaluated as true when the control ﬂow

reaches Line 6. Because the true evaluation of (A∧B)

only happens in incorrect runs, and it contributes to

the true evaluation of A, the evaluation distribution of

the if statement should be diﬀerent between correct

and incorrect executions. This suggests that if we

monitor the program conditionals, like the A here, their

evaluations will shed light on the hidden error and can

be exploited for error isolation. While this heuristic can

be eﬀective, one should not expect it to work with any

logic errors. After all, even experienced programmers

may feel incapable to debug certain logic errors. Given

that few eﬀective ways exist for isolating logic errors,

we believe this heuristic worths a try.

The above example motivates us to infer about the

potential buggy region from the divergent branching

probability. Heuristically, the simplest approach is to

ﬁrst calculate the average true evaluation probabilities

for both correct and incorrect runs, and treat the nu-

meric diﬀerence as the measure of divergence. However,

simple heuristic methods like the above generally suf-

fer from weak performance across various buggy pro-

grams. In this paper, we develop a novel and more

sophisticated approach to quantifying the divergence of

branching probabilities, which features a similar ratio-

nale to hypothesis testing [15]. Based on this quantiﬁ-

cation, we further derive two algorithms that eﬀectively

rank functions according to their likelihood of contain-

ing the error. For example, both algorithms recognize

the dodash function as the most suspicious in Example

1.

In summary, we make the following contributions.

1. We introduce the problem of isolating logic errors, a

problem not yet well solved in software engineering

community, into data mining research. We believe

that recent advances in data mining would help

tackle this tough problem and contribute to software

engineering in general.

2. We propose an error isolation technique that is

based on mining control ﬂow abnormity in incorrect

runs against correct ones. Especially, we choose to

monitor only conditionals to bear low overhead while

maintaining the isolation quality.

3. We develop a principled statistical approach to

quantify the bug relevance of each condition state-

ment and further derive two algorithms to locate the

possible buggy functions. As evaluated on a set of

standard test programs, both of them achieve an en-

couraging success.

The remaining of the paper is organized as follows.

Section 2 ﬁrst provides several examples that illustrate

how our method is motivated. The statistical model

and the two derived ranking algorithms are developed

in Section 3. We present the analysis of experiment

results in Section 4. After the discussion about related

work in Section 5, Section 6 concludes this study.

2 Motivating Examples

In this section, we re-visit Example 1 and explain in de-

tail the implication of control ﬂow abnormality to logic

errors. For clarity in what follows, we denote the pro-

gram with the subclause (src[*i-1]<= src[*i+1])

commented out as the incorrect (or buggy) program {,

and the one without comments is the correct one, de-

noted as

{. Because

{ is certainly not available when

debugging {,

{ is used here only for illustration pur-

poses: It helps illustrate how our method is motivated.

As one will see in Section 3, our method collects statis-

tics only from the buggy program { and performs all

the analysis.

Given the two boolean variables A and B as pre-

sented in Section 1, let us consider their evaluation com-

binations and corresponding branching actions (either

enter or skip the block) in both { and

{. Figure 1 ex-

plicitly lists the actions in { (left) and

{ (right), respec-

tively. Clearly, the left panel shows the actual actions

taken in the buggy program {, and the right one lists

the expected actions if { had no errors.

A A

B enter skip

B enter skip

A A

B enter skip

B skip skip

Figure 1: Branching Actions in { and

{

Diﬀerences between the above two tables reveal that

in the buggy program {, unexpected actions take place

if and only if A ∧ B evaluates to true. Explicitly,

when A ∧ B is true, the control ﬂow actually enters

the block, while it is expected to skip. This incorrect

control ﬂow will eventually lead to incorrect outputs.

Therefore, for the buggy program{, one run is incorrect

if and only if there exist true evaluations of A∧ B at

Line 6; otherwise, the execution is correct although the

program contains a bug.

While the boolean expression B: (A ∧ B) = true

exactly characterizes the scenario under which incorrect

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 108

executions take place, there is little chance for an

automated tool to spot B as bug relevant. The obvious

reason is that while we are debugging the program {,

we have no idea of what B is, let alone its combination

with A. Because A is observable in {, we are therefore

interested in whether the evaluation of A will give away

the error. If the evaluation of A in incorrect executions

signiﬁcantly diverge from that in correct ones, the if

statement at line 6 may be regarded as bug-relevant,

which points to the exact error location.

A A

B n

AB

n ¯

AB

B n

A

¯

B

= 0 n ¯

A

¯

B

A A

B n

AB

n

¯

AB

B n

A

¯

B

≥ 1 n

¯

A

¯

B

Figure 2: A Correct and Incorrect Run in {

We therefore contrast how A is evaluated diﬀerently

between correct and incorrect executions of {. Figure

2 shows the number of true evaluations for the four

combinations of A and B in one correct (left) and

incorrect (right) run. The major diﬀerence is that in

the correct run, A∧B never evaluates true (n

A

¯

B

= 0)

while n

A

¯

B

must be nonzero for one execution to be

incorrect. Since the true evaluation of A ∧ B implies

A = true, we therefore expect that the probability for

A to be true is diﬀerent between correct and incorrect

executions. As we tested through 5,542 test cases,

the true evaluation probability is 0.727 in a correct

and 0.896 in an incorrect execution on average. This

divergence suggests that the error location (i.e., Line 6)

does exhibit detectable abnormal behaviors in incorrect

executions. Our method, as developed in Section 3,

nicely captures this divergence and ranks the dodash

function as the most suspicious function, isolating this

logic error.

The above discussion illustrates a simple, but moti-

vating example where we can use the branching statis-

tics to capture the control ﬂow abnormality. Sometimes

when the error does not occur in a condition statement,

control ﬂows may still disclose the error trace. In Pro-

gram 2, the programmer mistakenly assigns the variable

result with i+1 instead of i, which is a typical “oﬀ-

by-one” logic error.

The error in Program 2 literally has no relation

with any condition statements. However, the error is

triggered only when the variable junk is nonzero, which

is eventually associated with the if statement at Line

5. In this case, although the branching statement is not

directly involved in this oﬀ-by-one error, the abnormal

branching probability in incorrect runs can still reveal

the error trace. For example, the makepat function

is identiﬁed as the most suspicious function by our

Program 2 Buggy Code - Version 15 of replace

01 void makepat (char *arg, int start, char delim,

02 char *pat)

03 {

04 ...

05 if(!junk)

06 result = 0;

07 else

08 result = i + 1; /* off-by-one error */

09 /* should be result = i */

10 return result;

11 }

algorithms.

Inspired by the above two examples, we recognize

that given that the dependency structure of a program

is hard to untangle, there do exist simple statistics

that capture the abnormality caused by hidden errors.

Discovering these abnormalities by mining the program

executions can provide useful information for error

isolation. In the next section, we elaborate on the

statistical ranking model that identiﬁes potential buggy

regions.

3 Ranking Model

Let T = ¦t

1

, t

2

, , t

n

¦ be a test suite for a program

{. Each test case t

i

= (d

i

, o

i

) (1 ≤ i ≤ n) has an

input d

i

and the desired output o

i

. We execute { on

each t

i

, and obtain the output o

i

= {(d

i

). We say {

passes the test case t

i

(i.e., “a passing case”) if and only

if o

i

is identical to o

i

; otherwise, { fails on t

i

(i.e., “a

failing case”). We thus partition the test suite T into

two disjoint subsets T

p

and T

f

, corresponding to the

passing and failing cases respectively,

T

p

= ¦t

i

[o

i

= {(d

i

) matches o

i

¦,

T

f

= ¦t

i

[o

i

= {(d

i

) does not match o

i

¦.

Since program { passes test case t

i

if and only if {

executes correctly, we use “correct” and “passing”, “in-

correct” and “failing” interchangeably in the following

discussion.

Given a buggy program { together with a test suite

T = T

p

∪ T

f

, our task is to isolate the suspicious error

region by contrasting {’s runtime behaviors on T

p

and

T

f

.

3.1 Feature Preparation

Motivated by previous examples, we decide to instru-

ment each condition statement in program { to col-

lect the evaluation frequencies at runtime. Speciﬁcally,

we take the entire boolean expression in each condition

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 109

statement as one distinct boolean feature. For exam-

ple, isalnum(src[*i-1]) && isalnum(src[*i+1]) in

Program 1 is treated as one feature. Since a feature

can be evaluated zero to multiple times as either true

or false in each execution, we deﬁne boolean bias to

summarize its exact evaluations.

Definition 1. (Boolean Bias) Let n

t

be the number

of times that a boolean feature B evaluates true, and n

f

the number of times it evaluates false in one execution.

π(B) =

nt−n

f

nt+n

f

is the boolean bias of B in this execution.

π(B) varies in the range of [−1, 1]. It encodes the

distribution of B’s value: π(B) is equal to 1 if B always

assumes true and −1 as it sticks to false, and in

between for all other mixtures. If a conditional is never

touched during an execution, π(B) is deﬁned to be 0

because of no evidence favoring either true or false.

3.2 Methodology Overview

Before detailed discussions about our method, we ﬁrst

lay out the main idea in this subsection. Following

the convention of statistics, we use uppercase letters for

random variables and lowercases for their realizations.

Moreover, f(X[θ) is the probability density function

(pdf) of a distribution (or population) family indexed

by the parameter θ.

Let the entire test case space be T , which concep-

tually contains all possible input and output pairs. Ac-

cording to the correctness of { on cases from T , T can

be partitioned into two disjoint sets T

p

and T

f

for pass-

ing and failing cases. Therefore, T, T

p

, and T

f

can be

thought as random samples from T , T

p

, and T

f

respec-

tively. Let X be the random variable for the boolean

bias of boolean feature B, we use f(X[θ

p

) and f(X[θ

f

)

to denote underlying probability model that generates

the boolean bias of B for cases from T

p

and T

f

respec-

tively.

Definition 2. (Bug Relevance) A boolean feature

B is relevant to the program error if its underlying

probability model f(X[θ

f

) diverges from f(X[θ

p

).

The above deﬁnition relates the probability model

f(X[θ) with the hidden program error. The more

signiﬁcantly f(X[θ

f

) diverges from f(X[θ

p

), the more

relevant B is to the hidden error. Let L(B) be a

similarity function,

(3.1) L(B) = Sim(f(X[θ

p

), f(X[θ

f

)),

and s(B), the bug relevance score of B can be deﬁned

as

(3.2) s(B) = −log(L(B)).

Using the above measure, we can rank features

according to their relevance to the error. The ranking

problem boils down to ﬁnding a proper way to quantify

the similarity function. This includes two problems: (1)

what is a suitable similarity function? and (2) how to

compute it while we do not known the closed form of

f(X[θ

p

) and f(X[θ

f

)? In the following subsections, we

will examine these two problems in detail.

3.3 Feature Ranking

Because no prior knowledge of the closed form of

f(X[θ

p

) is known, we can only characterize it through

general parameters, such as the population mean and

variance, µ

p

= E(X[θ

p

) and σ

2

p

= V ar(X[θ

p

), i.e., we

take θ = (µ, σ

2

). While µ and σ

2

are taken to char-

acterize f(X[θ), we do not regard their estimates as

suﬃcient statistics, and hence we do not take normality

assumptions on f(X[θ). Instead, the diﬀerence exhib-

ited through µ and σ

2

is treated as a measure of the

model diﬀerence.

Given an independent and identically distributed

(i.i.d.) random sample X = (X

1

, X

2

, , X

n

) from

f(X[θ

p

), µ

p

and σ

2

p

can be estimated by the sample

mean X and sample variance S

n

,

µ

p

= X =

n

i=1

X

i

n

and

σ

2

p

= S

n

=

1

n −1

n

i=1

(X

i

−X)

2

,

where X

i

represents the boolean bias of B from the ith

passing case. The population mean µ

f

and variance σ

2

f

of f(X[θ

f

) can be estimated in a similar way.

Given the estimations, it may be appealing to take

the diﬀerences of both mean and variance as the bug

relevance score. For example, s(B) can be deﬁned as

s(B) = α ∗ [µ

p

−µ

f

[ + β ∗ [σ

2

p

−σ

2

f

[ (α, β ≥ 0).

However, heuristic methods like the above usually suﬀer

from the dilemma of parameter settings: no guidance is

available to properly set the parameters, like the α and β

in the above formula. In addition, parameters that work

perfectly with one program error may not generalize to

other errors or programs. Therefore, in the following, we

develop a principled statistical method to quantify the

bug relevance, which, as one will see, is parameter-free.

Our method takes an indirect approach to quantify-

ing the divergence between f(X[θ

p

) and f(X[θ

f

), which

is supported by a similar rationale to hypothesis testing

[15]. Instead of writing a formula that explicitly spec-

iﬁes the diﬀerence, we ﬁrst propose the null hypothesis

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 110

H

0

that θ

p

= θ

f

(i.e., no divergence exists), and then

under the null hypothesis, we derive a statistic Y (X)

that conforms to a speciﬁc known distribution. Given

the realized random sample X = x, if Y (x) corresponds

to a small probability event, the null hypothesis H

0

is

invalidated, which immediately suggests that f(X[θ

p

)

and f(X[θ

f

) are divergent. Moreover, the divergence is

proportional to the extent to which the null hypothesis

is invalidated.

Speciﬁcally, the null hypothesis H

0

is

(3.3) µ

p

= µ

f

and σ

p

= σ

f

.

Let X = (X

1

, X

2

, , X

m

) be an i.i.d. random sample

from f(X[θ

f

). Under the null hypothesis, we have

E(X

i

) = µ

f

= µ

p

and V ar(X

i

) = σ

2

f

= σ

2

p

. Because

X

i

∈ [−1, 1], E(X

i

) and V ar(X

i

) are both ﬁnite.

According to the Central Limit Theorem, the following

statistic

(3.4) Y =

m

i=1

X

i

m

,

converges to the normal distribution N(µ

p

,

σ

2

p

m

) as m →

+∞.

Let f(Y [θ

p

) be the pdf of N(µ

p

,

σ

2

p

m

), then the

likelihood of θ

p

given the observation of Y is

(3.5) L(θ

p

[Y ) = f(Y [θ

p

).

A smaller likelihood implies that H

0

is less likely

to hold and this, in consequence, indicates that a

larger divergence exists between f(X[θ

p

) and f(X[θ

f

).

Therefore, we can reasonably set the similarity function

in Eq. (3.1) as the likelihood function,

(3.6) L(B) = L(θ

p

[Y ).

According to the property of normal distribution,

the normalized statistic

Z =

Y −µ

p

σ

p

/

√

m

conforms to the standard normal distribution N(0, 1),

and

(3.7) f(Y [θ

p

) =

√

m

σ

p

ϕ(Z),

where ϕ(Z) is the pdf of N(0, 1),

Combining Eq. (3.2), (3.6), (3.5), and (3.7), we

ﬁnally have the bug relevance score for boolean feature

B as

(3.8) s(B) = −log(L(B)) = log(

σ

p

√

mϕ(Z)

).

According to Eq. (3.8), we can rank all condition state-

ments of the buggy function [20]. However, we regard

that a ranking of suspicious functions is preferable to

that of individual statements because a highly relevant

condition statement is not necessarily the error root.

For example, the error in Program 2 does not take place

in the “if(!junk)” statement. Moreover, we generally

have higher conﬁdence in the quality of function rank-

ing than that of individual statements in that function

abnormality is assessed by considering all of its compo-

nent features. In the following section, we discuss how

to combine individual s(B) to a global score s(T) for

each function T.

3.4 Function Ranking

Suppose a function T encompasses k boolean features

B

1

, , B

k

, and there are m failing test cases. Let X

ij

denote the boolean bias of the i

th

boolean feature in the

j

th

test case, we tabulate the statistics in the following

table.

t

1

t

2

. . . t

m

B

1

X

11

X

12

. . . X

1m

X

1

Y

1

Z

1

B

2

X

21

X

22

. . . X

2m

X

2

Y

2

Z

2

. . . . . . . . . . . . . . . . . . . . . . . .

B

k

X

k1

X

k2

. . . X

km

X

k

Y

k

Z

k

In this table, X

i

= (X

i1

, X

i2

, . . . , X

im

) represents

the observed boolean bias from the m failing runs and

Y

i

and Z

i

are similarly derived from X

i

as in Section

3.3. For each feature B

i

, we propose the null hypothesis

H

i

0

: θ

i

f

= θ

i

p

, and obtain

f(Y

i

[θ

i

p

) =

√

m

σ

i

p

ϕ(Z

i

). (3.9)

Given that the function T has k features, we denote the

parameters in a vector form:

−→

H

0

= 'H

1

0

, H

2

0

, , H

k

0

`,

−→

θ

p

= 'θ

1

p

, θ

2

p

, , θ

k

p

`,

−→

Y = 'Y

1

, Y

2

, , Y

k

`.

Under the null hypothesis

−→

H

0

, similar arguments sug-

gest that the bug relevance score s(T) can be chosen

as

(3.10) s(T) = −log(f(

−→

Y [

−→

θ

p

)),

where f(

−→

Y [

−→

θ

p

) is the joint pdf of Y

i

’s (i = 1, 2, , k).

However, the above scoring function Eq. (3.10) does

not immediately apply because f(

−→

Y [

−→

θ

p

) is a multivari-

ate density function. Because neither the closed forms

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 111

of f(Y

i

[θ

i

p

) nor the dependencies among Y

i

’s are avail-

able, it is impossible to calculate the exact value of s(T).

Therefore, in the following discussion, we propose two

simple ways to approximate it.

3.4.1 CombineRank

One conventional approach to untangling joint distri-

bution is through the independence assumption. If we

assume that boolean features B

i

’s are mutually indepen-

dent, the population of Y

i

is also mutually independent

with each other. Therefore, we have

(3.11) f(

−→

Y [

−→

θ

p

) =

k

i=1

f(Y

i

[θ

i

p

) =

k

i=1

√

m

σ

i

p

ϕ(Z

i

)

Following Eq. (3.10),

s(T) = −log(f(

−→

Y [

−→

θ

p

)) =

k

i=1

log(

σ

i

p

√

mϕ(Z

i

)

) (3.12)

=

k

i=1

s(B

i

)

We name the above scoring schema (Eq. (3.12))

CombineRank as it sums over the bug relevance score

of each individual condition statement with the func-

tion. We note that the independence assumption is

practically unrealistic because condition statements are

usually intertwined, such as nested loops or the if state-

ments inside while loops. However, given that no as-

sumptions are made about the probability densities, one

should not expect for a devices more magic than the

independence assumption in decomposing the joint dis-

tribution.

From the other point of view, CombineRank does

make good sense in that the abnormal branchings at

one condition statement may likely trigger the abnormal

executions of other branches (like an avalanche) and

Eq. (3.12) just captures this systematical abnormality

and encode it into the ﬁnal bug relevance score for

the function T. As we will see in the experiments,

CombineRank works really well in locating buggy

functions.

3.4.2 UpperRank

In this subsection, we propose another approach to

approximating f(

−→

Y [

−→

θ

p

), and end up with another score

schema UpperRank. The main idea is based on the

following apparent inequality

(3.13) f(

−→

Y [

−→

θ

p

) ≤ min

1≤i≤k

f(Y

i

[θ

i

p

) = min

1≤i≤k

√

m

σ

i

p

ϕ(Z

i

).

Therefore, if we adopt the upper bound as an alternative

for f(

−→

Y [

−→

θ

p

), s(T) can be derived as

s(T) = −log( min

1≤i≤k

√

m

σ

i

p

ϕ(Z

i

)) = max

1≤i≤k

s(B

i

). (3.14)

This scoring schema, named UpperRank, essen-

tially picks the most suspicious feature as the “repre-

sentative” of the function. This is meaningful because if

one boolean expression is extremely abnormal, the func-

tion containing it is very likely to be abnormal. How-

ever, since UpperRank uses the upper bound as the

approximation to f(

−→

Y [

−→

θ

p

), its ranking quality is infe-

rior to that of CombineRank when multiple peak ab-

normalities exist, as illustrated in the following example.

0

50

100

150

200

250

300

350

400

B

u

g

R

e

l

e

v

a

n

c

e

S

c

o

r

e

dodash omatch

Figure 3: CombineRank vs. UpperRank

Example 2. Figure 3 visualizes the bug relevance score

of boolean features in both function dodash and omatch,

calculated on the faulty Version 5 of the replace program.

From left to right, the ﬁrst six stubs represent scores

for the six boolean features in dodash and the following

nine are for features in omatch. In this example,

the logic error is inside dodash. As one can see,

UpperRank will rank omatch over dodash due to the

maximal peak abnormality. However, it might be better

to credit the abnormality of each feature for function

ranking, as implemented by CombineRank. Therefore,

CombineRank correctly ranks dodash as the most

bug relevant. For this reason, we generally prefer

CombineRank to UpperRank, as is also supported

by experiments.

4 Experiment Results

In this section, we evaluate the eﬀectiveness of both

CombineRank and UpperRank in isolating logic er-

rors. We implemented these two algorithms using C++

and Matlab and conducted the experiments on a 3.2GHz

Intel Pentium 4 PC with 1GB physical memory that

runs Fedora Core 2.

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 112

Version Buggy Function Fail Runs combineRank upperRank Error Description: How to Fix Cat.

1 dodash 68 2 1 change *i to *i -1

2 dodash 37 1 1 add one if branch

3 subline 130 3 3 add one && subclause

4 subline 143 8 11 change i to lastm ´

5 dodash 271 1 3 change < to <=

6 locate 96 3 3 change >= to >

7 in set 2 83 5 4 change c == ANY to c == EOL ´

8 in set 2 54 3 2 add one || subclause

9 dodash 30 1 1 add one && subclause

10 dodash 23 1 2 add one && subclause

11 dodash 30 1 1 change > to <=

12 Macro 309 5 5 change 50 to 100 in define MAXPAT 50 ´

13 subline 175 5 5 add one if branch

14 omatch 137 1 1 add one && subclause

15 makepat 60 1 1 change i+1 to i in result = i+1

16 in set 2 83 5 4 remove one || subclause

17 esc 24 1 1 change result = NEWLINE to = ESCAPE ´

18 omatch 210 2 3 add one && subclause

19 change 3 15 8 rewrite the function change

20 esc 22 1 1 change result = ENDSTR to = ESCAPE ´

21 getline 3 12 5 rewrite the function getline

22 getccl 19 7 3 move one statement into if branch

23 esc 22 1 2 change s[*i] to s[*i + 1]

24 omatch 170 2 4 add one if branch

25 omatch 3 2 2 change <= to ==

26 omatch 198 6 6 change j to j + 1

28 in set 2 142 4 3 remove one || subclause

29 in set 2 64 6 5 remove one || subclause

30 in set 2 284 1 1 remove one || subclause

31 omatch 210 2 3 change >= to !=

Error Category Legend : Incorrect Branch Expression ´: Misuse of Variables or Constants : Oﬀ-By-One : Misc.

Table 1: Summary of Buggy Versions and Ranking Results

4.1 Subject Programs

We experimented on a package of standard test pro-

grams, called Siemens programs

1

. This package was

originally prepared by Siemens Corp. Research in study

of test adequacy criteria [12]. The Siemens programs

consist of seven programs: print tokens, print tokens2,

replace, schedule, schedule2, tcas, and tot info. For each

program, the Siemens researchers manually injected

multiple errors, obtaining multiple faulty versions, with

each version containing one and only one error. Be-

cause these injected errors rightly represent common

mistakes made in practice, the Siemens programs are

widely adopted as a benchmark in software engineering

1

A variant is available at http://www.cc.gatech.edu/aristo-

tle/Tools/subjects.

research [12, 25, 9, 10]. Because these injected errors are

mainly logic faults, we choose them as the benchmark

to evaluate our algorithms.

Except tcas (141 lines), the size of these programs

ranges from 292 to 512 lines of C code, excluding blanks.

Because debugging tcas is pretty straightforward due to

its small size, we focus the experiment study on the

other six programs. In Section 4.2, we ﬁrst analyze

the eﬀectiveness of our algorithms on the 30 faulty

versions of the replace program. The replace program

deserves detailed examination in that (1) it is the

largest and most complex one among the six programs,

and (2) it covers the most varieties of logic errors.

After the examination of replace, we discuss about the

experiments on the other ﬁve programs in Section 4.3.

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 113

4.2 Experimental Study on Replace

The replace program contains 32 versions in total,

among which Version 0 is error free. Each of the other

31 faulty versions contains one and only one error in

comparison with Version 0. In this setting, Version 0

serves as the oracle in labelling whether a particular

execution is correct or incorrect.

Table 1 lists the error characteristics for each

faulty version and the ranking results provided by

CombineRank and UpperRank. Because we focus

on isolating logic errors that do not incur segmenta-

tion faults, Version 27 is excluded from examination.

In Table 1, the second column lists the name of the

buggy function for each version and the third column

shows the number of failing runs out of the 5542 test

cases. The ﬁnal ranks of the buggy function provided

by CombineRank and UpperRank are presented in

the fourth and ﬁfth columns respectively. Taking ver-

sion 1 for an example, the buggy function is dodash

and the error causes incorrect outputs for 68 out of the

5542 test cases. For this case, CombineRank ranks the

dodash function at the second place and UpperRank

identiﬁes it as the most suspicious. Finally, the last two

columns brieﬂy describe each error and the category the

error belongs to (Section 4.2.2). Because we cannot list

the buggy code for each error due to the space limit, we

concisely describe how to ﬁx each error in the “Error De-

scription” column. Interested readers are encouraged to

download “Siemens Programs” from the public domain

for detailed examination.

4.2.1 Overall Eﬀectiveness

The rankings in Table 1 indicate that both

CombineRank and UpperRank work very well

with the replace program. Among the 30 faulty versions

under examination, both methods rank the buggy

function within the top ﬁve for 24 versions except

Versions 4, 19, 21, 22, 26, and 29. This implies that

the developer can locate 24 out of the 30 errors if he

only examines the top ﬁve functions of the ranked

list. Although this proposition is drawn across various

errors on a speciﬁc subject program, we believe that

this does shed light on the eﬀectiveness of the isolation

algorithms.

In order to quantify the isolation quality of the

algorithms, we choose to measure how many errors

are located if a programmer examines the suggested

ranked list from the top down. For example, because

CombineRank ranks the buggy function at the ﬁrst

place for 11 faulty versions, the programmer will locate

11 out of the 30 errors if he only examines the ﬁrst func-

tion of the ranked list for each faulty version. Moreover,

because the buggy function of another ﬁve versions is

ranked at the second place by CombineRank, the pro-

grammer can locate 16 out of the 30 errors if he takes

the top-2 functions seriously. In this way, Figure 4 plots

the number of located errors with respect to the top-k

examination.

0

5

10

15

20

25

30

0 2 4 6 8 10 12 14 16 18 20

N

u

m

.

o

f

L

o

c

a

t

e

d

E

r

r

o

r

s

Top-k Functions

Located Errors within Top-k Functions

CombineRank

UpperRank

RandomRank

Figure 4: Isolation Quality: replace

In order to assess the eﬀectiveness of the algorithms,

we also plot the curve for random rankings in Figure 4.

Because a random ranking puts the buggy function at

any place with equal probability, the buggy function has

a probability of

k

m

to be within the top-k of the entire

m functions. Furthermore, this also suggests that given

n faulty versions,

k

m

n errors are expected to be located

if the programmer only examines the top-k functions.

For the replace program under study, where m = 20

and n = 30, only 1.5 errors are expected to be located if

top-1 function is examined. In contrast, UpperRank

and CombineRank locates 9 and 11 errors respectively.

Furthermore, when top-2 functions are examined, the

number for UpperRank and CombineRank is lever-

aged to 13 and 16 respectively whereas it is merely 3 for

random rankings. Considering the tricky nature of these

logic errors, we regard that the result shown in Figure 4

is excitingly good because our method infers about the

buggy function purely from the execution statistics, and

assumes no more program semantics than the random

ranking. As one will see in Section 4.3, similar results

are also observed on other programs.

4.2.2 Eﬀectiveness for Various Errors

After the discussion about the overall eﬀectiveness, one

may also ﬁnd it instructive to explore for what kinds of

errors our method works well (or not). We thus break

down errors in Table 1 into the following four categories

and examine the eﬀectiveness of our method on each of

them.

1. Incorrect Branch Expression (IBE): This cat-

egory generally refers to errors that directly inﬂu-

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 114

ence program control ﬂows. For example, it includes

omitted (e.g., in Versions 3, 9, 10 etc.) or un-wanted

(e.g., in Version 16) logic subclauses, and misuses of

comparators (e.g., in Versions 5, 6 etc.). These er-

rors usually sneak in when the programmer fails to

think fully of the program logic.

2. Misuse of Variables or Constants (MVC):

Since similar data structures can exist in programs,

developers may misuse one variable for another.

One typical scenario of introducing this kind of

error is through the notorious copy-pastes: when

a programmer copies and pastes one segment of

code, he may forget to change certain variables

for the destination contexts. Some tools, like CP-

Miner [18], can automatically detect such copy-paste

errors.

3. Oﬀ-By-One (OBO): A variable is used with the

value one more or less than it is expected. For

example, in Version 15, result = i is mistakenly

written as result = i + 1.

4. Misc. (MIS): Other types of errors that include,

but not limited to, function reformulations (e.g.,

in Versions 19 and 21) and moving one statement

around (e.g., in Version 22).

We label the category that each faulty ver-

sion belongs to in Table 1. The isolation quality of

CombineRank for each category on replace is listed

in Table 2. Notice that the third column only lists the

number of faulty versions where the buggy function is

ranked at the top by CombineRank.

Error Category Error # Located (top-1)

IBE 18 7

MVC 5 2

OBO 4 2

MIS 3 0

Table 2: Isolation Quality by Categories

Table 2 also indicates that CombineRank ranks

the buggy function at the top not only for errors that

literally reside on condition statements but for errors

that involve variable values as well. Similar conclusions

can be also drawn for UpperRank based on Table 1.

This result reaﬃrms our belief that value related errors

may also incur control ﬂow abnormality, which justiﬁes

our choice of branching abnormality for error isolation.

Finally, we note that there do exist some logic errors

that are extremely hard for machines to isolate. Con-

sidering the error description in Table 1, we believe that

it is also a non-trivial task for experienced programmers

to isolate these errors. Therefore, we cannot unpracti-

cally expect one method to work well for all kinds of

errors.

4.2.3 CombineRank vs. UpperRank

In terms of the isolation quality, an overview of Fig-

ure 4 may suggest that UpperRank outperforms

CombineRank because the curve of UpperRank is

above CombineRank for a large portion of k’s. How-

ever, we believe that considering the utility to pro-

grammers, the top-5 ranking is less appealing than the

top-2 or even top-1 ranking. From this point of view,

CombineRank is apparently better than UpperRank:

in terms of top-1, CombineRank identiﬁes 11 while

UpperRank gets 9, and CombineRank locates in 16

within top-2’s whereas UpperRank hits 13. Results

from other subject programs as examined in Section 4.3

also support this conclusion.

4.3 More Subject Programs

We also tested our algorithms on the other ﬁve programs

in the Siemens program package. They are two lexical

analyzers print tokens and print tokens2, two priority

schedulers schedule and schedule2, and tot info that

computes the statistics of given input data. Although

print tokens and print tokens2 merely diﬀer a little bit in

name, they are two completely diﬀerent programs, and

so are schedule and schedule2.

Program Functions Error # Located (top-2)

replace 20 30 16

print tokens 18 7 3

print tokens2 19 10 7

schedule 18 5 4

schedule2 16 10 2

tot info 7 23 16

Table 3: Isolation Quality across Programs

Table 3 lists the isolation quality of CombineRank

across the six programs. The second column gives

the total number of functions in each program. With

the third column telling how many errors are under

examination, the last column tabulates how many of

them are actually located within the top-2’s. Overall,

except on schedule2, CombineRank achieves similar

or better quality than it does on replace. Especially,

the isolation quality on print tokens2 and schedule is

amazingly good. However, echoing the saying that “no

silver bullets exist”, CombineRank also fail to work

well on certain programs. For instance, CombineRank

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 115

only locates 2 out of the 10 errors in schedule2 within

the top-1, and fail to work on the other 8 errors.

0

2

4

6

8

10

0 2 4 6 8 10 12 14 16 18

N

u

m

.

o

f

L

o

c

a

t

e

d

E

r

r

o

r

s

Top-k Functions

Located Errors within Top-k Functions

CombineRank

UpperRank

Figure 5: Isolation Quality: print tokens2

We also compared the isolation quality between

CombineRank and UpperRank on these ﬁve pro-

grams. Results suggest that they are generally com-

parable, and CombineRank tends to outperform

UpperRank at the top rank range. Figure 5 plots their

comparison on print tokens2, in which we can see a pat-

tern similar to Figure 4.

5 Related Work

As the software becomes increasingly bulky in size, its

complexity has exceeded the capability of human un-

derstandings. This stimulates the wide applications of

data mining techniques in solving software engineering

problems. For example, Livshits et al. apply frequent

itemset mining algorithms to software revision histories,

and relate mining results with programming patterns

that programmers should conform to [22]. Going one

step further, Li et al. construct the PR-Miner [19], which

detects programming rules (PR) more general than that

from [22]. PR-Miner ﬁrst tokenizes the program source

code and then applies CloSpan [27] to discover program-

ming rules implicitly embedded in source code. Diﬀer-

ent from the above static analysis approaches, whose

analysis is based on the static entities, like the source

codes and revision history, our method in this paper at-

tempts to isolate logic errors from the program dynamic

behaviors in executions.

For tracking down program logic errors, there are

two conventional methods: software testing [4, 13] and

model checking [26]. The aim of testings is to ﬁnd test

cases that induce unexpected behaviors. Once a fail-

ing case is found, it is the developer’s responsibility to

trace and eradicate the hidden error. By prioritizing the

potential buggy functions, our method actually comple-

ments the necessary human eﬀorts in debugging. On

the other hand, model checking can detect logic errors

based on a well-speciﬁed program model. Theoretically,

when the model is accurate and complete, model check-

ing can detect all hidden errors, regardless of the time

it may take. However, quality models are usually too

expensive to construct in practice [26]. In comparison,

our method does not rely on any semantic speciﬁcation

from human beings, and automatically infers about the

error location from program runtime statistics.

There exist some automated debugging tools that

do not rely on speciﬁcations either. AskIgor [28], which

is available online, tries to outline the cause-eﬀect chains

of program failures by narrowing down the diﬀerence

between an incorrect and a correct execution. However,

since its analysis relies on memory abnormality, AskIgor

does not work with logic errors, as is conﬁrmed by our

practice.

Finally, our method also relates to outlier detection

[24, 1, 14] or deviation detection [3] in general. Diﬀerent

from previous methods that directly measure the devi-

ation within a certain metric spaces, like the Euclidean

space, our method quantiﬁes the abnormality through

an indirect approach that is supported by a similar ra-

tionale to hypothesis testing. Since this method makes

no assumption of the metric space structure, nor re-

quires users to set any parameters, it is more principled

and alleviates users from the common dilemma of pa-

rameter settings. To the best of our knowledge, this is

the ﬁrst piece of work that calibrates deviations in a hy-

pothesis testing alike approach. In addition, this study

also demonstrates the power and usefulness of this ap-

proach at detecting subtle deviations in complex and

intertwined environments, like the softwares.

6 Conclusions

In this paper, we investigate the capability of data

mining methods in isolating suspicious buggy regions

through abnormality analysis of program control ﬂows.

For the ﬁrst time, through mining software trace data,

we are able to, if not fully impossible, locate logic errors

without knowing the program semantics. It is observed

from our study that the abnormality of program traces

from incorrect executions can actually provide inside

information that could reveal the error location. The

statistical approach, together with two ranking algo-

rithms we developed, can successfully detect and score

suspicious buggy regions. A thorough examination of

our approach on a set of standard test programs clearly

demonstrates the eﬀectiveness of this approach.

There are still many unsolved issues related to soft-

ware error isolation, such as what if multiple errors exist

in a program and/or only a few test cases are available.

Moreover, it could be more systematic and enlightening

if the problem of error localization is viewed from fea-

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 116

ture selection point of view. These are interesting issues

in our future research.

Acknowledgements

We would like to thank the anonymous reviewers for

their insightful and constructive suggestions. We also

appreciate Zheng Shao, a ﬁnalist of TopCoder Contest,

for his participation in our user study.

References

[1] C. C. Aggarwal and P. S. Yu. Outlier detection for high

dimensional data. In Proc. of the 2001 ACM SIGMOD

Int. Conf. on Management of Data, pages 37–46, 2001.

[2] G. Ammons, R. Bodik, and J. Larus. Mining speciﬁca-

tions. In Proc. of the 29th ACM SIGPLAN-SIGACT

Symp. on Principles of Programming Languages, pages

4–16, 2002.

[3] A. Arning, R. Agrawal, and P. Raghavan. A linear

method for deviation detection in large databases.

In Proc. of the 10th ACM SIGKDD Int. Conf. on

Knowledge Discovery and Data Mining, 1995.

[4] B. Beizer. Software Testing Techniques. International

Thomson Computer Press, second edition, 1990.

[5] J. F. Bowring, J. M. Rehg, and M. J. Harrold. Active

learning for automatic classiﬁcation of software behav-

ior. In Proc. of the 2004 Int. Symp. on Software testing

and analysis, pages 195–205, 2004.

[6] Y. Brun and M. Ernst. Finding latent code errors via

machine learning over program executions. In Proc.

of the 26th Int. Conf. on Software Engineering, pages

480–490, 2004.

[7] W. Dickinson, D. Leon, and A. Podgurski. Finding

failures by cluster analysis of execution proﬁles. In

Proc. of the 23rd Int. Conf. on Software Engineering,

pages 339–348, 2001.

[8] European Space Agency. Arianne-5 ﬂight 501 inquiry

board report. In http://ravel.esrin.esa.it/docs/esa-x-

1819 eng.pdf.

[9] P. Frankl and O. Iakounenko. Further empirical studies

of test eﬀectiveness. In Proc. of the 6th ACM Int.

Symp. on Foundations of Software Engineering, pages

153–162, 1998.

[10] M. Harrold, G. Rothermel, R. Wu, and L. Yi. An

empirical investigation of program spectra. In Proc. of

the ACM SIGPLAN-SIGSOFT workshop on Program

Analysis for Software Tools and Engineering, pages 83–

90, 1998.

[11] G. J. Holzmann. Economics of software veriﬁcation. In

Proc. of the 2001 ACM SIGPLAN-SIGSOFT workshop

on Program analysis for software tools and engineering,

pages 80–89, 2001.

[12] M. Hutchins, H. Foster, T. Goradia, and T. Os-

trand. Experiments of the eﬀectiveness of dataﬂow-

and controlﬂow-based test adequacy criteria. In Proc.

of the 16th Int. Conf. on Software Engineering, pages

191–200, 1994.

[13] C. Kaner, J. Falk, and H. Q. Nguyen. Testing Com-

puter Software. Wiley, second edition, 1999.

[14] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-

based outliers: algorithms and applications. The

VLDB Journal, 8(3-4):237–253, 2000.

[15] E. Lehmann. Testing Statistical Hypotheses. Springer,

second edition, 1997.

[16] N. Leveson and C. Turner. An investigation of the

Therac-25 accidents. Computer, 26(7):18–41, 1993.

[17] Z. Li, Z. Chen, and Y. Zhou. Mining block correla-

tions to improve storage performance. Trans. Storage,

1(2):213–245, 2005.

[18] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A

tool for ﬁnding copy-paste and related bugs in oper-

ating system code. In Proceedings of the Sixth Sympo-

sium on Operating System Design and Implementation,

pages 289–302, 2004.

[19] Z. Li and Y. Zhou. PR-Miner: Automatically extract-

ing implicit programming rules and detecting viola-

tions in large software code. In 13th ACM SIGSOFT

Symposium on the Foundations of Software Engineer-

ing, Sept 2005.

[20] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiﬀ. Sober:

statistical model-based bug localization. In Proc. of

the 10th European Software Engineering Conference

held jointly with 13th ACM SIGSOFT Int. Symp. on

Foundations of Software Engineering (ESEC/FSE’05),

pages 286–295, 2005.

[21] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining

behavior graphs for ”backtrace” of noncrashing bugs.

In In Proc. 2005 SIAM Int. Conf. on Data Mining

(SDM’05), 2005.

[22] B. Livshits and T. Zimmermann. DynaMine: Finding

common error patterns by mining software revision

histories. In 13th ACM SIGSOFT Symp. on the

Foundations of Software Engineering, Sept 2005.

[23] A. Podgurski, D. Leon, P. Francis, W. Masri,

M. Minch, J. Sun, and B. Wang. Automated support

for classifying software failure reports. In Proc. of the

25th Int. Conf. on Software Engineering, pages 465–

475, 2003.

[24] S. Ramaswamy, R. Rastogi, and K. Shim. Eﬃcient

algorithms for mining outliers from large data sets.

In Proc. of the 2000 ACM SIGMOD Int. Conf. on

Management of Data, pages 427–438, 2000.

[25] G. Rothermel and M. J. Harrold. Empirical studies of

a safe regression test selection technique. IEEE Trans.

on Software Engineering, 24(6):401–419, 1998.

[26] W. Visser, K. Havelund, G. Brat, and S. Park. Model

checking programs. In Proc. of the 15th IEEE Int.

Conf. on Automated Software Engineering, 2000.

[27] X. Yan, J. Han, and R. Afshar. CloSpan: Mining

closed sequential patterns in large databases. In Proc.

of the Third SIAM Int. Conf. on Data Mining, 2003.

[28] A. Zeller. Isolating cause-eﬀect chains from computer

programs. In Proc. of ACM 10th Int. Symp. on the

Foundations of Software Engineering, 2002.

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited 117

15 } 16 } Example 1.. let us ﬁrst examine an example. int *i. we record how many times it is evaluated as true and false respectively in each execution. For instance. we regard that the more divergent the evaluation of one conditional in incorrect runs is from that of correct ones. Although this coding scheme loses much information about the exact execution. Logic errors. Program 1 Buggy Code .. k++) 10 junk = addstr(k. we treat each conditional in the buggy program as one distinct feature and try to isolate the logic error through contrasting the behaviors of correct and incorrect executions in terms of these conditionals. like Purify.. most importantly. In order to see why this choice can be eﬀective. However. like Purify. The replace program has 512 lines of non-blank C code. Although it is unlikely to totally unload the debugging burden from human beings. char *src. k<=src[*i+1].Version 9 of replace 01 void dodash (char delim. dest. programs with logic errors usually exit silently without segmentation faults. Speciﬁcally. like segmentation faults and/or memory leaks. we plan to focus on isolating logic errors in this paper. Because the faulty symptom of logic error is the unexpected execution path. int maxset) 03 { 04 while (. we need to ﬁrst encode each execution in such a way that discriminates incorrect from correct executions. as well as expensive to register. j. refer to the program logic incorrectness. e. Instead of causing programs to abort due to memory violations. In the dodash function. he will resort to conventional debuggers. Because logic errors do not behave abnormally at the memory access level. However.. a program that performs regular expression matching and substitutions. 14 (*i) = (*i) + 1. let us go back to Example 1. To make things worse.. personalized coding styles and. we can roughly classify program errors into two categories: memory errors and logic errors. 12 } 13 . B = src[*i-1]<= src[*i+1]. This is equivalent to a proper extraction of error-sensitive features from program executions. Moreover. no segmentation faults happened during executions. reduce the debugging cost. In order to facilitate the analysis of program executions. such as maintaining legacy codes. this scheme turns out to be hard to manipulate. for). for a step-by-step tracing and verify observed values against his expectations in mind. have little chance to isolate them. Probably. A = isalnum(src[*i-1]) && isalnum(src[*i+1]). by correctly suggesting the possible buggy regions. Copyright © by SIAM. For convenience. one can narrow down the search. and speed up the development process. it is tempting to register the exact path for each execution. like generating none or incorrect outputs. to guide programmers to examine the buggy function ﬁrst. Considering the plethora of tools for memory errors.) { 05 . it turns out to be easy to manipulate and eﬀective. Unauthorized reproduction of this article is prohibited 107 . if. To illustrate the challenges. we are interested in ﬁnding an automated method that can prioritize the code examination for debugging. the more likely the conditional is bug-relevant. we ﬁnd it possible to minimize programmers’ workloads through machine intelligence. the complex control and data dependencies formulated by the original author. without any memory violations. Some tools.on their faulty symptoms. Program 1 shows a code segment of replace. Based on the above analysis. understanding other programmers’ code is notoriously troublesome because of un-intuitive identiﬁers. 11 *i = *i + 1. like GDB. int *j. 02 char *dest. the developer generally has no clues for debugging. maxset). and consists of 20 functions. for each condition statement.. Knowing the diﬃculty of debugging logic errors. on the other hand. Noticing that executions are mainly directed by condition statements (e.g. For logic errors as the above. We follow this moderate thought in this study. he may not even be able to trace the execution until he at least roughly understands the program. oﬀ-the-shelf memory monitoring tools. one subclause (commented by /∗ and ∗/) that should have been included in the if condition is missed by the developer.. The only faulty symptom is its malfunction.. we ﬁnally decide to summarize the execution by the evaluation frequencies of conditionals. In other words. This is a typical “incomplete logic” error. which is very likely in reality. are designed for tracking down memory errors. We tested this buggy program through 5542 test cases and found that 130 out of them failed to give the correct outputs. Valgrind and CCured. if the developer is not the original author of the program. we declare two boolean variables A and B as follows.g. Memory errors usually manifest themselves through memory abnormalities. while. 06 if ( isalnum(src[*i-1]) && isalnum(src[*i+1]) 07 /* && (src[*i-1] <= src[*i+1]) */) 08 { 09 for (k = src[*i-1]+1.

If the program were written correctly. in the buggy program where B is missing. Because P is certainly not available when debugging P. we re-visit Example 1 and explain in detail the implication of control ﬂow abnormality to logic errors.exactly characterizes the scenario under which incorrect Copyright © by SIAM. While this heuristic can be eﬀective. respectively. Because the true evaluation of (A∧¬B) only happens in incorrect runs. The remaining of the paper is organized as follows. ment and further derive two algorithms to locate the possible buggy functions. our method collects statistics only from the buggy program P and performs all the analysis. unexpected actions take place tackle this tough problem and contribute to software if and only if A ∧ ¬B evaluates to true. A enter enter ¬A skip skip A enter skip ¬A skip skip B ¬B B ¬B 1. Section 6 concludes this study. the evaluation distribution of the if statement should be diﬀerent between correct and incorrect executions. the control ﬂow actually enters the block. The above example motivates us to infer about the potential buggy region from the divergent branching probability. any control ﬂow that satisﬁes A will fall into the block. After all. which features a similar rationale to hypothesis testing [15]. We propose an error isolation technique that is control ﬂow will eventually lead to incorrect outputs. Based on this quantiﬁcation. like the A here. let us consider their evaluation combinations and corresponding branching actions (either enter or skip the block) in both P and P. 2 Motivating Examples In this section. We introduce the problem of isolating logic errors. Explicitly. We believe Diﬀerences between the above two tables reveal that that recent advances in data mining would help in the buggy program P. both of them achieve an encouraging success. for the buggy program P. we choose to if and only if there exist true evaluations of A ∧ ¬B at monitor only conditionals to bear low overhead while Line 6. program contains a bug. i. while it is expected to skip. Especially. In summary. and the right one lists the expected actions if P had no errors. Unauthorized reproduction of this article is prohibited 108 . their evaluations will shed light on the hidden error and can be exploited for error isolation. and treat the numeric diﬀerence as the measure of divergence. we develop a novel and more sophisticated approach to quantifying the divergence of branching probabilities. we further derive two algorithms that eﬀectively rank functions according to their likelihood of containing the error.. Figure 1 explicitly lists the actions in P (left) and P (right). one should not expect it to work with any logic errors. The statistical model and the two derived ranking algorithms are developed in Section 3. into data mining research. the simplest approach is to ﬁrst calculate the average true evaluation probabilities for both correct and incorrect runs. This incorrect 2. In this paper. We present the analysis of experiment results in Section 4. we denote the program with the subclause (src[*i-1]<= src[*i+1]) commented out as the incorrect (or buggy) program P. even experienced programmers may feel incapable to debug certain logic errors. denoted as P. While the boolean expression B: (A ∧ ¬B) = true 3. Given the two boolean variables A and B as presented in Section 1. This suggests that if we monitor the program conditionals. when A ∧ ¬B is true. For example. both algorithms recognize the dodash function as the most suspicious in Example 1. we believe this heuristic worths a try. A ∧ B should have been the guarding condition for the if block. As one will see in Section 3. a Figure 1: Branching Actions in P and P problem not yet well solved in software engineering community. After the discussion about related work in Section 5. Clearly. P is used here only for illustration purposes: It helps illustrate how our method is motivated. For clarity in what follows. engineering in general. and it contributes to the true evaluation of A.e. However. otherwise. As one may notice and we will explain in Section 2. the execution is correct although the maintaining the isolation quality. we make the following contributions. Section 2 ﬁrst provides several examples that illustrate how our method is motivated. the left panel shows the actual actions taken in the buggy program P. the control ﬂow goes into the if block if and only if A ∧ B is true. and the one without comments is the correct one. However. one run is incorrect runs against correct ones. simple heuristic methods like the above generally suffer from weak performance across various buggy programs. Given that few eﬀective ways exist for isolating logic errors. As evaluated on a set of standard test programs. Heuristically. We develop a principled statistical approach to quantify the bug relevance of each condition state. based on mining control ﬂow abnormity in incorrect Therefore. an execution is logically correct until (A∧¬B) is evaluated as true when the control ﬂow reaches Line 6.

542 test cases. If the evaluation of A in incorrect executions signiﬁcantly diverge from that in correct ones. 11 } B ¬B B ¬B Figure 2: A Correct and Incorrect Run in P We therefore contrast how A is evaluated diﬀerently between correct and incorrect executions of P. 02 char *pat) 03 { 04 . · · · . we recognize that given that the dependency structure of a program is hard to untangle.Version 15 of replace 01 void makepat (char *arg. Our method. 07 else 08 result = i + 1.727 in a correct and 0. The error in Program 2 literally has no relation with any condition statements. We say P passes the test case ti (i. For example. Each test case ti = (di . A ¬A nAB nAB ¯ nAB = 0 nAB ¯ ¯¯ A ¬A nAB nAB ¯ nAB ≥ 1 nAB ¯ ¯¯ Program 2 Buggy Code . control ﬂows may still disclose the error trace. we use “correct” and “passing”.executions take place. Discovering these abnormalities by mining the program executions can provide useful information for error isolation. although the branching statement is not directly involved in this oﬀ-by-one error. otherwise. However. Speciﬁcally. In the next section. Line 6) does exhibit detectable abnormal behaviors in incorrect executions. Tf = {ti |oi = P(di ) does not match oi }. 3. but motivating example where we can use the branching statistics to capture the control ﬂow abnormality. 05 if(!junk) 06 result = 0. there do exist simple statistics that capture the abnormality caused by hidden errors. Since program P passes test case ti if and only if P executes correctly. we decide to instrument each condition statement in program P to collect the evaluation frequencies at runtime. char delim. as developed in Section 3. we are therefore interested in whether the evaluation of A will give away the error. corresponding to the passing and failing cases respectively. Figure 2 shows the number of true evaluations for the four combinations of A and B in one correct (left) and incorrect (right) run. “a passing case”) if and only if oi is identical to oi .896 in an incorrect execution on average. We thus partition the test suite T into two disjoint subsets Tp and Tf . we take the entire boolean expression in each condition Copyright © by SIAM. the programmer mistakenly assigns the variable result with i+1 instead of i. Tp = {ti |oi = P(di ) matches oi }. The obvious reason is that while we are debugging the program P. which points to the exact error location. t2 . /* off-by-one error */ 09 /* should be result = i */ 10 return result. tn } be a test suite for a program P. The major diﬀerence is that in the correct run. which is a typical “oﬀby-one” logic error. our task is to isolate the suspicious error region by contrasting P’s runtime behaviors on Tp and Tf . As we tested through 5. which is eventually associated with the if statement at Line 5. int start. isolating this logic error.. A∧¬B never evaluates true (nAB = 0) ¯ while nAB must be nonzero for one execution to be ¯ incorrect. there is little chance for an automated tool to spot B as bug relevant. In Program 2. nicely captures this divergence and ranks the dodash function as the most suspicious function. 3 Ranking Model Let T = {t1 . Unauthorized reproduction of this article is prohibited 109 . the if statement at line 6 may be regarded as bug-relevant. The above discussion illustrates a simple.e. In this case. “incorrect” and “failing” interchangeably in the following discussion. the abnormal branching probability in incorrect runs can still reveal the error trace. we therefore expect that the probability for A to be true is diﬀerent between correct and incorrect executions.. This divergence suggests that the error location (i. Given a buggy program P together with a test suite T = Tp ∪ Tf . we have no idea of what B is. “a failing case”). the makepat function is identiﬁed as the most suspicious function by our algorithms. Because A is observable in P. P fails on ti (i. Sometimes when the error does not occur in a condition statement. we elaborate on the statistical ranking model that identiﬁes potential buggy regions. let alone its combination with A. oi ) (1 ≤ i ≤ n) has an input di and the desired output oi . the true evaluation probability is 0. and obtain the output oi = P(di ). Inspired by the above two examples..e. the error is triggered only when the variable junk is nonzero.e.1 Feature Preparation Motivated by previous examples... Since the true evaluation of A ∧ ¬B implies A = true. We execute P on each ti .

we ﬁrst propose the null hypothesis (3.. π(B) is deﬁned to be 0 because of no evidence favoring either true or false. Because no prior knowledge of the closed form of f (X|θp ) is known. which conceptually contains all possible input and output pairs. as one will see. T . For example. parameters that work signiﬁcantly f (X|θf ) diverges from f (X|θp ).d. Instead.1) L(B) = Sim(f (X|θp ). For example. X2 . we do not regard their estimates as suﬃcient statistics. 3. 3. σ ). µp = E(X|θp ) and σp = V ar(X|θp ).3 Feature Ranking Before detailed discussions about our method. Following the convention of statistics. the bug relevance score of B can be deﬁned is supported by a similar rationale to hypothesis testing as [15]. (Bug Relevance) A boolean feature B is relevant to the program error if its underlying However. like the α and β f (X|θ) with the hidden program error. Let L(B) be a other errors or programs.e. s(B) can be deﬁned as 2 2 s(B) = α ∗ |µp − µf | + β ∗ |σp − σf | (α. T can be partitioned into two disjoint sets Tp and Tf for passing and failing cases. Definition 2. Therefore. Our method takes an indirect approach to quantifying the divergence between f (X|θp ) and f (X|θf ). Tp . The ranking problem boils down to ﬁnding a proper way to quantify the similarity function. f (X|θf )). the more perfectly with one program error may not generalize to relevant B is to the hidden error. develop a principled statistical method to quantify the bug relevance. and in between for all other mixtures. we 2 take θ = (µ. · · · . where Xi represents the boolean bias of B from the ith 2 passing case. t π(B) varies in the range of [−1. the diﬀerence exhibited through µ and σ 2 is treated as a measure of the model diﬀerence. µp = X = and 2 σ p = Sn = n i=1 Xi n n 1 n−1 i=1 (Xi − X)2 . µp and σp can be estimated by the sample mean X and sample variance Sn . Let X be the random variable for the boolean bias of boolean feature B. Definition 1. which. we ﬁrst lay out the main idea in this subsection. (3. It encodes the distribution of B’s value: π(B) is equal to 1 if B always assumes true and −1 as it sticks to false.i.2) s(B) = −log(L(B)). Tp . we can rank features according to their relevance to the error.statement as one distinct boolean feature. β ≥ 0). Unauthorized reproduction of this article is prohibited 110 . Xn ) from 2 f (X|θp ). Let the entire test case space be T . i. In addition. According to the correctness of P on cases from T . n −nf π(B) = nt +nf is the boolean bias of B in this execution. is parameter-free.2 Methodology Overview Using the above measure. Instead of writing a formula that explicitly speciﬁes the diﬀerence. isalnum(src[*i-1]) && isalnum(src[*i+1]) in Program 1 is treated as one feature. Therefore. Given an independent and identically distributed (i. Given the estimations. This includes two problems: (1) what is a suitable similarity function? and (2) how to compute it while we do not known the closed form of f (X|θp ) and f (X|θf )? In the following subsections. in the following. If a conditional is never touched during an execution. from the dilemma of parameter settings: no guidance is The above deﬁnition relates the probability model available to properly set the parameters. Copyright © by SIAM. The population mean µf and variance σf of f (X|θf ) can be estimated in a similar way. 1]. such as the population mean and 2 variance. we similarity function. heuristic methods like the above usually suﬀer probability model f (X|θf ) diverges from f (X|θp ). (Boolean Bias) Let nt be the number of times that a boolean feature B evaluates true. which and s(B). and Tf respectively. we can only characterize it through general parameters. The more in the above formula. we use f (X|θp ) and f (X|θf ) to denote underlying probability model that generates the boolean bias of B for cases from Tp and Tf respectively. and Tf can be thought as random samples from T .) random sample X = (X1 . Since a feature can be evaluated zero to multiple times as either true or false in each execution. it may be appealing to take the diﬀerences of both mean and variance as the bug relevance score. we use uppercase letters for random variables and lowercases for their realizations. f (X|θ) is the probability density function (pdf) of a distribution (or population) family indexed by the parameter θ. we will examine these two problems in detail. we deﬁne boolean bias to summarize its exact evaluations. and nf the number of times it evaluates false in one execution. and hence we do not take normality assumptions on f (X|θ). While µ and σ 2 are taken to characterize f (X|θ). Moreover.

B1 B2 . we can reasonably set the similarity function in Eq. In the following section. A smaller likelihood implies that H0 is less likely to hold and this.e. ..i. 1). θp . σp as →→ − − where ϕ(Z) is the pdf of N (0. Xm ) be an i. . we regard that a ranking of suspicious functions is preferable to that of individual statements because a highly relevant condition statement is not necessarily the error root. Under the null hypothesis. Yk Z1 Z2 . · · · . we discuss how to combine individual s(B) to a global score s(F) for each function F.7).4) Y = m i=1 According to Eq. E(Xi ) and V ar(Xi ) are both ﬁnite. we have 2 2 E(Xi ) = µf = µp and V ar(Xi ) = σf = σp . However.. (3.10) does B as →→ − − not immediately apply because f ( Y | θp ) is a multivariσp ). For example. 1). Bk t1 X11 X21 . the null hypothesis H0 is (3. Unauthorized reproduction of this article is prohibited 111 . 2. Because neither the closed forms mϕ(Z) Copyright © by SIAM. · · · . if Y (x) corresponds to a small probability event. the error in Program 2 does not take place in the “if(!junk)” statement. Y2 .. .9) σp Given that the function F has k features. Let X = (X1 . .H0 that θp = θf (i. we generally have higher conﬁdence in the quality of function ranking than that of individual statements in that function abnormality is assessed by considering all of its component features. In this table. we tabulate the statistics in the following table. H0 . (3.. However. − → conforms to the standard normal distribution N (0. Xkm X1 X2 . Xi2 . Therefore.d. .6) L(B) = L(θp |Y ).5). .. parameters in a vector form: the normalized statistic − → 1 2 k H0 = H0 . m ) converges to the normal distribution as m → +∞. Combining Eq. . m ). X1m .8). Because Xi ∈ [−1.6).. Xi = (Xi1 .10) s(F) = −log(f ( Y | θp )). and (3. According to the Central Limit Theorem.. · · · . Zk Xi m . Xk1 t2 X12 X22 . (3. we derive a statistic Y (X) that conforms to a speciﬁc known distribution. and − → Under the null hypothesis H0 ..3) µp = µf and σp = σf . random sample from f (X|θf ). Speciﬁcally.. the null hypothesis H0 is invalidated.. (3. For each feature Bi . indicates that a larger divergence exists between f (X|θp ) and f (X|θf ). Xk Y1 Y2 . Let Xij denote the boolean bias of the ith boolean feature in the j th test case. Y − µp √ Z= → − 1 2 k σp / m θp = θp . (3. the divergence is proportional to the extent to which the null hypothesis is invalidated..2). the following statistic (3. . . 1]. tm . and then under the null hypothesis. θp . we denote the According to the property of normal distribution.. we propose the null hypothesis i i i H0 : θf = θp . which immediately suggests that f (X|θp ) and f (X|θf ) are divergent. Yk . Given the realized random sample X = x. X2m . and obtain √ m i f (Yi |θp ) = i ϕ(Zi ).3. Xim ) represents the observed boolean bias from the m failing runs and Yi and Zi are similarly derived from Xi as in Section 3. Xk2 . 2 σp Let f (Y |θp ) be the pdf of N (µp ... (3. Moreover. . . k).4 Function Ranking Suppose a function F encompasses k boolean features B1 . Bk . . we →→ − − ﬁnally have the bug relevance score for boolean feature where f ( Y | θp ) is the joint pdf of Yi ’s (i = 1. 3.1) as the likelihood function. (3. .5) L(θp |Y ) = f (Y |θp ). (3. · · · . . Moreover.... similar arguments sug√ m gest that the bug relevance score s(F) can be chosen (3. X2 . the above scoring function Eq. in consequence.8) s(B) = −log(L(B)) = log( √ ate density function. (3. and there are m failing test cases. H0 . · · · . then the likelihood of θp given the observation of Y is (3.7) f (Y |θp ) = ϕ(Z). Y = Y1 . (3. we can rank all condition statements of the buggy function [20]. · · · . 2 σp N (µp . no divergence exists).

We implemented these two algorithms using C++ and Matlab and conducted the experiments on a 3.12) just captures this systematical abnormality and encode it into the ﬁnal bug relevance score for the function F. We note that the independence assumption is practically unrealistic because condition statements are usually intertwined. If we assume that boolean features Bi ’s are mutually indepen. its ranking quality is infe→− − → m i (3. From the other point of view. UpperRank will rank omatch over dodash due to the maximal peak abnormality. we generally prefer CombineRank to UpperRank. CombineRank works really well in locating buggy functions. (3. in the following discussion.Therefore.i of f (Yi |θp ) nor the dependencies among Yi ’s are avail. calculated on the faulty Version 5 of the replace program. we have ever. (3. The main idea is based on the following apparent inequality √ →→ − − m i ϕ(Zi ). the ﬁrst six stubs represent scores for the six boolean features in dodash and the following nine are for features in omatch. Therefore. we propose another approach to →→ − − approximating f ( Y | θp ). However. 3. if we adopt the upper bound as an alternative →→ − − able.12)) CombineRank as it sums over the bug relevance score of each individual condition statement with the function. for f ( Y | θp ). we propose two √ simple ways to approximate it. s(F) can be derived as Therefore. it is impossible to calculate the exact value of s(F).2GHz Intel Pentium 4 PC with 1GB physical memory that runs Fedora Core 2.4. one should not expect for a devices more magic than the independence assumption in decomposing the joint distribution. as implemented by CombineRank. given that no assumptions are made about the probability densities.1 CombineRank This scoring schema. (3. As we will see in the experiments. (3. 4 Experiment Results In this section. As one can see. Figure 3 visualizes the bug relevance score of boolean features in both function dodash and omatch.11) f (Yi |θp ) = f ( Y | θp ) = ϕ(Zi ) rior to that of CombineRank when multiple peak abi σp i=1 i=1 normalities exist. CombineRank correctly ranks dodash as the most bug relevant. However.sentative” of the function. This is meaningful because if dent. Therefore.12) s(F) log( √ ) Bug Relevance Score 400 dodash omatch →→ − − = −log(f ( Y | θp )) = k k i σp 350 300 250 200 150 100 50 0 i=1 mϕ(Zi ) = i=1 s(Bi ) We name the above scoring schema (Eq. CombineRank does make good sense in that the abnormal branchings at one condition statement may likely trigger the abnormal executions of other branches (like an avalanche) and Eq. the population of Yi is also mutually independent one boolean expression is extremely abnormal. UpperRank Example 2. For this reason. In this example. In this subsection. since UpperRank uses the upper bound as the →→ − − k k √ approximation to f ( Y | θp ). it might be better to credit the abnormality of each feature for function ranking. we evaluate the eﬀectiveness of both CombineRank and UpperRank in isolating logic errors. essenOne conventional approach to untangling joint distritially picks the most suspicious feature as the “reprebution is through the independence assumption.2 UpperRank Figure 3: CombineRank vs.13) f ( Y | θp ) ≤ min f (Yi |θp ) = min i 1≤i≤k 1≤i≤k σp Copyright © by SIAM. named UpperRank. the function containing it is very likely to be abnormal. such as nested loops or the if statements inside while loops. Howwith each other. (3. Following Eq. the logic error is inside dodash. as is also supported by experiments.14) s(F) = −log( min ϕ(Zi )) = max s(Bi ). as illustrated in the following example.10). Unauthorized reproduction of this article is prohibited 112 . and end up with another score schema UpperRank. From left to right.4. m (3. i 1≤i≤k σp 1≤i≤k 3.

obtaining multiple faulty versions. we choose them as the benchmark to evaluate our algorithms. Table 1: Summary of Buggy Versions and Ranking Results 4. replace. and (2) it covers the most varieties of logic errors. Except tcas (141 lines). Research in study of test adequacy criteria [12]. After the examination of replace. We experimented on a package of standard test programs. Unauthorized reproduction of this article is prohibited 113 . Because debugging tcas is pretty straightforward due to its small size.1 Subject Programs research [12. with each version containing one and only one error.edu/aristotle/Tools/subjects.3. 9. 10].2. we discuss about the experiments on the other ﬁve programs in Section 4. tcas. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 dodash dodash subline subline dodash locate in set 2 in set 2 dodash dodash dodash Macro subline omatch makepat in set 2 esc omatch change esc getline getccl esc omatch omatch omatch in set 2 in set 2 in set 2 omatch : 68 37 130 143 271 96 83 54 30 23 30 309 175 137 60 83 24 210 3 22 3 19 22 170 3 198 142 64 284 210 2 1 3 8 1 3 5 3 1 1 1 5 5 1 1 5 1 2 15 1 12 7 1 2 2 6 4 6 1 2 Error Category Legend Incorrect Branch Expression change *i to *i -1 add one if branch add one && subclause change i to lastm change < to <= change >= to > change c == ANY to c == EOL add one || subclause add one && subclause add one && subclause change > to <= change 50 to 100 in define MAXPAT 50 add one if branch add one && subclause change i+1 to i in result = i+1 remove one || subclause change result = NEWLINE to = ESCAPE add one && subclause rewrite the function change change result = ENDSTR to = ESCAPE rewrite the function getline move one statement into if branch change s[*i] to s[*i + 1] add one if branch change <= to == change j to j + 1 remove one || subclause remove one || subclause remove one || subclause change >= to != : Misuse of Variables or Constants : Oﬀ-By-One : 1 1 3 11 3 3 4 2 1 2 1 5 5 1 1 4 1 3 8 1 5 3 2 4 2 6 3 5 1 3 Misc. For each program. we focus the experiment study on the other six programs. Because these injected errors rightly represent common mistakes made in practice. called Siemens programs 1 .Version Buggy Function Fail Runs combineRank upperRank Error Description: How to Fix Cat. This package was originally prepared by Siemens Corp. the size of these programs ranges from 292 to 512 lines of C code. Copyright © by SIAM.cc. In Section 4. and tot info. schedule2. 25.gatech. The Siemens programs consist of seven programs: print tokens. schedule. excluding blanks. The replace program deserves detailed examination in that (1) it is the largest and most complex one among the six programs. Because these injected errors are mainly logic faults. we ﬁrst analyze the eﬀectiveness of our algorithms on the 30 faulty versions of the replace program. the Siemens programs are widely adopted as a benchmark in software engineering 1 A variant is available at http://www. the Siemens researchers manually injected multiple errors. print tokens2.

4. For the replace program under study. because CombineRank ranks the buggy function at the ﬁrst place for 11 faulty versions.5 errors are expected to be located if top-1 function is examined. both methods rank the buggy function within the top ﬁve for 24 versions except Versions 4. 1. In Table 1. the second column lists the name of the buggy function for each version and the third column shows the number of failing runs out of the 5542 test cases. In this setting. CombineRank ranks the dodash function at the second place and UpperRank identiﬁes it as the most suspicious. Furthermore. the number for UpperRank and CombineRank is leveraged to 13 and 16 respectively whereas it is merely 3 for random rankings. and assumes no more program semantics than the random ranking. We thus break down errors in Table 1 into the following four categories and examine the eﬀectiveness of our method on each of them. 4. After the discussion about the overall eﬀectiveness. only 1.2 Eﬀectiveness for Various Errors The rankings in Table 1 indicate that both CombineRank and UpperRank work very well with the replace program. This implies that the developer can locate 24 out of the 30 errors if he only examines the top ﬁve functions of the ranked list. Furthermore. among which Version 0 is error free. Incorrect Branch Expression (IBE): This category generally refers to errors that directly inﬂu- Copyright © by SIAM. the programmer can locate 16 out of the 30 errors if he takes the top-2 functions seriously. Interested readers are encouraged to download “Siemens Programs” from the public domain for detailed examination. we choose to measure how many errors are located if a programmer examines the suggested ranked list from the top down. the last two columns brieﬂy describe each error and the category the error belongs to (Section 4. Because we cannot list the buggy code for each error due to the space limit. Because we focus on isolating logic errors that do not incur segmentation faults. the buggy function has k a probability of m to be within the top-k of the entire m functions. 26. Taking version 1 for an example.2). one may also ﬁnd it instructive to explore for what kinds of errors our method works well (or not). where m = 20 and n = 30. the programmer will locate 11 out of the 30 errors if he only examines the ﬁrst function of the ranked list for each faulty version. similar results are also observed on other programs.3. m n errors are expected to be located if the programmer only examines the top-k functions.2 Experimental Study on Replace The replace program contains 32 versions in total. 4. we believe that this does shed light on the eﬀectiveness of the isolation algorithms. the buggy function is dodash and the error causes incorrect outputs for 68 out of the 5542 test cases. 22. Moreover. we concisely describe how to ﬁx each error in the “Error Description” column.2. this also suggests that given k n faulty versions.1 Overall Eﬀectiveness because the buggy function of another ﬁve versions is ranked at the second place by CombineRank. Version 27 is excluded from examination. of Located Errors 25 20 15 10 5 0 0 2 4 6 CombineRank UpperRank RandomRank 8 10 12 14 Top-k Functions 16 18 20 Figure 4: Isolation Quality: replace In order to assess the eﬀectiveness of the algorithms. Because a random ranking puts the buggy function at any place with equal probability. 21. As one will see in Section 4. For example. For this case. The ﬁnal ranks of the buggy function provided by CombineRank and UpperRank are presented in the fourth and ﬁfth columns respectively. Among the 30 faulty versions under examination.2. Each of the other 31 faulty versions contains one and only one error in comparison with Version 0. Located Errors within Top-k Functions 30 Num. UpperRank and CombineRank locates 9 and 11 errors respectively. 19. Table 1 lists the error characteristics for each faulty version and the ranking results provided by CombineRank and UpperRank. In order to quantify the isolation quality of the algorithms. In this way. we also plot the curve for random rankings in Figure 4. Version 0 serves as the oracle in labelling whether a particular execution is correct or incorrect. Considering the tricky nature of these logic errors. and 29.2. when top-2 functions are examined. In contrast. Finally. Figure 4 plots the number of located errors with respect to the top-k examination. Unauthorized reproduction of this article is prohibited 114 . Although this proposition is drawn across various errors on a speciﬁc subject program. we regard that the result shown in Figure 4 is excitingly good because our method infers about the buggy function purely from the execution statistics.

CombineRank identiﬁes 11 while UpperRank gets 9. 6 etc.g. result = i is mistakenly written as result = i + 1. Misuse of Variables or Constants (MVC): Since similar data structures can exist in programs. we cannot unpractically expect one method to work well for all kinds of errors. like CPMiner [18]. two priority schedulers schedule and schedule2. it includes omitted (e. CombineRank is apparently better than UpperRank: in terms of top-1. echoing the saying that “no silver bullets exist”. For instance. 4.g. which justiﬁes our choice of branching abnormality for error isolation. Especially. This result reaﬃrms our belief that value related errors may also incur control ﬂow abnormality. in Versions 5. 4. and CombineRank for each category on replace is listed so are schedule and schedule2. in Table 2. These errors usually sneak in when the programmer fails to think fully of the program logic.3 More Subject Programs We also tested our algorithms on the other ﬁve programs in the Siemens program package.2. One typical scenario of introducing this kind of error is through the notorious copy-pastes: when a programmer copies and pastes one segment of code. in Version 15. the top-5 ranking is less appealing than the top-2 or even top-1 ranking. CombineRank achieves similar or better quality than it does on replace. in Version 22). They are two lexical analyzers print tokens and print tokens2.3 value one more or less than it is expected. and misuses of comparators (e. he may forget to change certain variables for the destination contexts. However. (MIS): Other types of errors that include. it is also a non-trivial task for experienced programmers to isolate these errors. 2.. the last column tabulates how many of them are actually located within the top-2’s.) or un-wanted (e.... For example. and tot info that computes the statistics of given input data. 9. Therefore. Overall.ence program control ﬂows. With the third column telling how many errors are under examination. and CombineRank locates in 16 within top-2’s whereas UpperRank hits 13. except on schedule2. UpperRank In terms of the isolation quality. However. we believe that Table 3 lists the isolation quality of CombineRank across the six programs. The isolation quality of name. 10 etc. The second column gives the total number of functions in each program.g. in Versions 19 and 21) and moving one statement around (e. in Version 16) logic subclauses. For also support this conclusion. Some tools. Similar conclusions can be also drawn for UpperRank based on Table 1. CombineRank Copyright © by SIAM. we believe that considering the utility to programmers. Oﬀ-By-One (OBO): A variable is used with the from other subject programs as examined in Section 4..). Although We label the category that each faulty ver. Results 3. print tokens 18 7 3 Error Category Error # Located (top-1) print tokens2 19 10 7 schedule 18 5 4 IBE 18 7 schedule2 16 10 2 MVC 5 2 tot info 7 23 16 OBO 4 2 MIS 3 0 4.g. in Versions 3. can automatically detect such copy-paste errors. we note that there do exist some logic errors that are extremely hard for machines to isolate. the isolation quality on print tokens2 and schedule is amazingly good. CombineRank also fail to work well on certain programs. Notice that the third column only lists the Program Functions Error # Located (top-2) number of faulty versions where the buggy function is replace 20 30 16 ranked at the top by CombineRank. Unauthorized reproduction of this article is prohibited 115 . example. Misc. developers may misuse one variable for another. Finally. an overview of Figure 4 may suggest that UpperRank outperforms CombineRank because the curve of UpperRank is above CombineRank for a large portion of k’s. but not limited to.3 CombineRank vs.g. From this point of view. they are two completely diﬀerent programs. function reformulations (e.print tokens and print tokens2 merely diﬀer a little bit in sion belongs to in Table 1. Table 3: Isolation Quality across Programs Table 2: Isolation Quality by Categories Table 2 also indicates that CombineRank ranks the buggy function at the top not only for errors that literally reside on condition statements but for errors that involve variable values as well. Considering the error description in Table 1.

6 There exist some automated debugging tools that 4 do not rely on speciﬁcations either. For example. it is the developer’s responsibility to ware error isolation. which is available online. ming rules implicitly embedded in source code. Since this method makes no assumption of the metric space structure.from incorrect executions can actually provide inside tempts to isolate logic errors from the program dynamic information that could reveal the error location.ation within a certain metric spaces.from previous methods that directly measure the devigrams. and CombineRank tends to outperform space. nor retern similar to Figure 4. like the Euclidean parable. this is As the software becomes increasingly bulky in size. tries to outline the cause-eﬀect chains 2 CombineRank of program failures by narrowing down the diﬀerence UpperRank 0 between an incorrect and a correct execution. For tracking down program logic errors. our method quantiﬁes the abnormality through UpperRank at the top rank range. it is more principled and alleviates users from the common dilemma of pa5 Related Work rameter settings. and automatically infers about the 8 error location from program runtime statistics. can successfully detect and score two conventional methods: software testing [4. This stimulates the wide applications of also demonstrates the power and usefulness of this apdata mining techniques in solving software engineering proach at detecting subtle deviations in complex and problems. ing can detect all hidden errors. together with two ranking algobehaviors in executions. construct the PR-Miner [19]. and fail to work on the other 8 errors. of Located Errors Copyright © by SIAM. However. 0 2 4 6 8 10 12 14 16 18 since its analysis relies on memory abnormality. There are still many unsolved issues related to softing case is found. locate logic errors ent from the above static analysis approaches. our method actually comple. Diﬀerent We also compared the isolation quality between CombineRank and UpperRank on these ﬁve pro. its the ﬁrst piece of work that calibrates deviations in a hycomplexity has exceeded the capability of human unpothesis testing alike approach. regardless of the time it may take. A thorough examination of model checking [26].we are able to.in a program and/or only a few test cases are available. It is observed analysis is based on the static entities. Once a fail. Figure 5 plots their an indirect approach that is supported by a similar racomparison on print tokens2. model checkthe top-1. there are rithms we developed. The aim of testings is to ﬁnd test our approach on a set of standard test programs clearly cases that induce unexpected behaviors. On Moreover. ments the necessary human eﬀorts in debugging. Unauthorized reproduction of this article is prohibited 116 . Results suggest that they are generally com. To the best of our knowledge. in which we can see a pat. The statistical approach. In comparison. quires users to set any parameters. Theoretically. 14] or deviation detection [3] in general. like the source from our study that the abnormality of program traces codes and revision history. Finally. However. it could be more systematic and enlightening the other hand. 13] and suspicious buggy regions. our method also relates to outlier detection [24. through mining software trace data. apply frequent intertwined environments. such as what if multiple errors exist trace and eradicate the hidden error. which In this paper. model checking can detect logic errors if the problem of error localization is viewed from feaNum.tionale to hypothesis testing. itemset mining algorithms to software revision histories. PR-Miner ﬁrst tokenizes the program source through abnormality analysis of program control ﬂows. as is conﬁrmed by our Figure 5: Isolation Quality: print tokens2 practice. 1. our method in this paper at.For the ﬁrst time. if not fully impossible. we investigate the capability of data detects programming rules (PR) more general than that mining methods in isolating suspicious buggy regions from [22]. whose without knowing the program semantics. By prioritizing the potential buggy functions. when the model is accurate and complete. code and then applies CloSpan [27] to discover program. like the softwares. AskIgor [28]. AskIgor Top-k Functions does not work with logic errors. Going one step further.only locates 2 out of the 10 errors in schedule2 within based on a well-speciﬁed program model. 10 our method does not rely on any semantic speciﬁcation from human beings. Li et al. quality models are usually too Located Errors within Top-k Functions expensive to construct in practice [26].demonstrates the eﬀectiveness of this approach. this study derstandings. and relate mining results with programming patterns 6 Conclusions that programmers should conform to [22]. Diﬀer. Livshits et al. In addition.

Unauthorized reproduction of this article is prohibited 117 . [25] G. Leon. S. Economics of software veriﬁcation. on Foundations of Software Engineering (ESEC/FSE’05). Kaner. and S. Springer. In Proc. [27] X. [12] M.it/docs/esa-x1819 eng. Symp. 1999. of the 2000 ACM SIGMOD Int. Ng. Conf. Yan. and L. G. J. 1998. Trans. In Proceedings of the Sixth Symposium on Operating System Design and Implementation. [11] G. of the 10th European Software Engineering Conference held jointly with 13th ACM SIGSOFT Int. Testing Computer Software. In Proc. Ammons. Experiments of the eﬀectiveness of dataﬂowand controlﬂow-based test adequacy criteria. R. on the Foundations of Software Engineering. J. P. Brat. Falk. of the 2001 ACM SIGMOD Int. Automated support for classifying software failure reports. Mining speciﬁcations. Li. Distancebased outliers: algorithms and applications. 1(2):213–245. of the 23rd Int. Tucakov. Rothermel. T. 2004. 2001. on Foundations of Software Engineering. DynaMine: Finding common error patterns by mining software revision histories. Afshar. on Management of Data. second edition. Leveson and C. and S. and M. 2002. 2002. Sun. Mining behavior graphs for ”backtrace” of noncrashing bugs. 24(6):401–419. Francis. Holzmann. S. IEEE Trans. CP-Miner: A tool for ﬁnding copy-paste and related bugs in operating system code. [17] Z. Hutchins. Masri. pages 286–295. [28] A. Beizer. and P. Liu. [6] Y. 2000. [3] A. [8] European Space Agency. R. S. pages 195–205. Ramaswamy. Park. second edition. In Proc. In Proc. and V. and T. Brun and M. Li. In Proc. In Proc. Podgurski. Agrawal. 26(7):18–41. of the ACM SIGPLAN-SIGSOFT workshop on Program Analysis for Software Tools and Engineering. second edition. on Automated Software Engineering. 2005. 2003. [18] Z. We also appreciate Zheng Shao. [4] B. 2000. In Proc. Eﬃcient algorithms for mining outliers from large data sets. D. Copyright © by SIAM. [20] C. X. on Data Mining. Harrold. and P. R. 2001. International Thomson Computer Press. Zeller. on Knowledge Discovery and Data Mining. Rastogi. Knorr. 2000. [14] E. 1998. of the Third SIAM Int. X. on Software Engineering. of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering. Lu. Minch. D. Zhou. Sept 2005. Ostrand. of the 10th ACM SIGKDD Int. Sept 2005. Visser. Ernst. on Management of Data. and A. Conf. and H. [5] J. J. Wu. [19] Z. 2005. Foster. Chen. Aggarwal and P. [10] M. P. Acknowledgements We would like to thank the anonymous reviewers for their insightful and constructive suggestions. Dickinson. Frankl and O. Symp. [15] E. Rehg. Conf. on Software testing and analysis. Z. and R. In Proc. J. 2005. In http://ravel. and Y. Arning. Conf. [23] A. 8(3-4):237–253. Podgurski. a ﬁnalist of TopCoder Contest. Symp. [9] P.ture selection point of view. [26] W. J. L. Bodik. of the 16th Int. pages 339–348. Goradia. Raghavan. In Proc. Rothermel and M. Arianne-5 ﬂight 501 inquiry board report. Havelund. Yu. on Data Mining (SDM’05). Symp. [13] C. Yi. Myagmar. Yu. 2004. R. pages 153–162. Zhou. F. In Proc. W. 1990. Zhou. and K. of the 25th Int. 1997. In Proc. Conf. An empirical investigation of program spectra. Wiley. R. Yu. In 13th ACM SIGSOFT Symposium on the Foundations of Software Engineering. on Software Engineering. In Proc. Sober: statistical model-based bug localization. Livshits and T. on the Foundations of Software Engineering. of ACM 10th Int. pages 83– 90. 2003. of the 15th IEEE Int. Wang. Lehmann. Harrold. J. Q. on Principles of Programming Languages. on Software Engineering. C. In Proc. Nguyen. Midkiﬀ. and Y. and J. Zimmermann. Han. K. pages 37–46. M. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. of the 26th Int. Liu. Empirical studies of a safe regression test selection technique. for his participation in our user study. [24] S. Larus. pages 80–89. Yan. In Proc. Computer. Shim. In In Proc. Fei. J. Li and Y. An investigation of the Therac-25 accidents. Conf. M. Mining block correlations to improve storage performance. on Software Engineering. M. [22] B. Bowring. 2004. and B. Turner. In Proc. pages 465– 475. on Software Engineering. References [1] C. CloSpan: Mining closed sequential patterns in large databases. of the 29th ACM SIGPLAN-SIGACT Symp. 1994. Outlier detection for high dimensional data.esa. In 13th ACM SIGSOFT Symp. 2005 SIAM Int. Han. 1993. pages 191–200. Conf. J. Yan. Han. Finding latent code errors via machine learning over program executions. These are interesting issues in our future research. H. Iakounenko.pdf. The VLDB Journal. In Proc. [21] C. Model checking programs. pages 480–490. Further empirical studies of test eﬀectiveness. T. J. S. A linear method for deviation detection in large databases. of the 2004 Int.esrin. pages 289–302. Leon. 1998. Conf. 1995. Storage. Conf. Conf. pages 4–16. Isolating cause-eﬀect chains from computer programs. G. of the 6th ACM Int. [16] N. Finding failures by cluster analysis of execution proﬁles. Testing Statistical Hypotheses. Active learning for automatic classiﬁcation of software behavior. Harrold. [7] W. H. 2001. [2] G. pages 427–438. Software Testing Techniques.

Sign up to vote on this title

UsefulNot useful- Kra Definition Pm Sm Mt 299
- Salesboom Test Plan Ver1.0
- Common App
- [IJCST-V4I1P7]
- Automated Bug Assignment
- Case Study
- Release notes for installation purpose
- Testing_Articles
- Spencer Brown - Laws of Form
- Bugs and Its Basic Questions
- D2 - Opportunities for Defect
- Mech a Boolean Tutorial
- Psychology of Testing
- How Do You Know When You Are Done Testing
- Rurple Man
- MKE3B21 Semester Assignment Marking Rubric 2nd Semester 2016
- Case Study Info Appliance Test ROI
- bfm%3A978-0-387-30311-6%2F1.pdf
- Changes
- Mount Management of Traceability Transformation Reality
- pSXfin
- bugzilla
- Bugzilla tutorial
- QA Induction
- 01405993
- Basic Interview Questions for Manual Testing
- Me So to Connect With Thou
- Friedrich Waismann - The Many-Level-Structure of Language.pdf
- Grolier in Home Learning - Rajan
- Logic 6
- dm06_010liuc

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

We've moved you to where you read on your other device.

Get the full title to continue

Get the full title to continue reading from where you left off, or restart the preview.