
JOURNAL OF SOFTWARE: EVOLUTION AND PROCESS
J. Softw. Evol. and Proc. 2012; 24:51–66
Published online 25 February 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/smr.532

Discovering programming rules and violations by mining interprocedural dependences


Ray-yaung Chang1 and Andy Podgurski2,*

1 Graduate School of Resources Management and Decision Science, Management College, National Defense University, Taiwan
2 EECS Department, Case Western Reserve University, Cleveland, OH 44106, U.S.A.

SUMMARY

This paper presents a novel approach to discovering implicit programming rules and rule violations in a code base, which integrates static interprocedural analysis and graph mining techniques to identify both function-call ordering rules and conditional rules that check input parameters or return values of functions. The approach discovers rules even when rule instances cross function boundaries. Rules are modeled as graph minors of dependence graphs augmented with edges indicating shared data dependences. The approach employs two innovative algorithms: a greedy one for mining maximal frequent minors from a set of interprocedural dependence spheres and a heuristic minor-matching algorithm for discovering rule violations. We evaluated our approach on the latest versions of three applications: net-snmp, openssl, and the Apache HTTP server. It detected 62 new bugs (24 involving rules with interprocedural instances), 35 of which have been confirmed and fixed recently by developers based on our reports. Copyright © 2011 John Wiley & Sons, Ltd.
Received 18 March 2009; Revised 5 August 2010; Accepted 22 December 2010

KEY WORDS: defect detection; graph mining; software maintenance

1. INTRODUCTION

Many programming rules require preserving relationships among program elements when programmers address the same concern in different places in a code base. The elements may be function calls, parameters, assignment statements, conditional statements, variables, etc., and the relationships that must be preserved may involve data dependences, control dependences, or execution ordering constraints that are not reflected by the available dependence information. For example, Figure 1 shows an instance of a programming rule mined from the openssl project with the approach proposed in this paper. This rule requires that the first input parameter of the function PEM_read_bio_PrivateKey() must be a non-null BIO object generated by BIO_new(). Moreover, the EVP_PKEY object returned by PEM_read_bio_PrivateKey() must be checked to see if it is null. Finally, BIO_free() must be called to release the object generated by BIO_new().

Adherence to programming rules plays a significant role in ensuring the reliability of software systems, because failure to follow rules may result in erroneous output, program crashes, memory or resource leaks, etc. However, it is a tedious task to manually identify and document implicit programming rules embedded in a code base. To address this problem, several approaches have been proposed to infer programming rules automatically from software systems [1–4].
*Correspondence to: Andy Podgurski, EECS Department, Case Western Reserve University, Cleveland, OH 44106, U.S.A. E-mail: podgurski@case.edu


Figure 1. Instance of a programming rule mined with our approach.

To ascertain that co-occurring elements constitute a rule instance, it is necessary to consider whether the necessary semantic relationships hold between the elements. For instance, consider again the code in Figure 1. Note that the function BIO_new() is called twice in main(), at lines 266 and 267, and the return values from the two calls are used to define the BIO variables in and out, respectively. To accurately identify an instance of the rule described previously, it is necessary to determine that the first input parameter of PEM_read_bio_ECPrivateKey() is data dependent on the definition of in, that in is verified to be non-null, and that in is passed to BIO_free() at line 388.

To capture the semantic relationships among program elements of a rule, our previous work modeled rules as graph minors of enhanced procedure dependence graphs (EPDGs), which are explained in Section 2 [5]. In that work, programming rules were mined from a set of EPDGs using frequent subgraph mining techniques, and rule violations were detected by using graph matching techniques. However, that work was limited to discovering intraprocedural programming rules, that is, rules whose instances are wholly contained within a single function. Thus it would not detect the rule instance in Figure 1, because its elements are spread over three functions.

This paper presents a novel approach to discovering programming rules with instances that cross function boundaries and to discovering bugs (defects) corresponding to violations of such rules. To our knowledge, it is the first approach to be based on interprocedural program dependence analysis. This greatly expands the range of rules it can find, which in turn increases the number of actual bugs that are found (without increasing false positives). The discovered rules are of two basic types: ordering rules, which involve the ordering of function calls, and conditional rules, which involve checking the input parameters and return values of functions.
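For concreteness, such a rule can be thought of as a small labeled dependence graph. The sketch below is a hypothetical encoding of the Figure 1 rule, not the paper's internal representation; the node labels and edge types are illustrative assumptions.

```python
# Hypothetical encoding of the Figure 1 rule as a labeled graph.
RULE = {
    "nodes": {
        1: "call:BIO_new",
        2: "control-point",  # the (in == null) check
        3: "call:PEM_read_bio_PrivateKey",
        4: "control-point",  # the (!key) check on the return value
        5: "call:BIO_free",
    },
    "edges": {
        (1, 2, "data"),   # BIO_new's return value reaches the null check
        (2, 3, "sdde"),   # the check shares data with the actual-in parameter
        (1, 3, "data"),   # the BIO object is passed as the first parameter
        (3, 4, "data"),   # the EVP_PKEY return value reaches its null check
        (1, 5, "data"),   # the same BIO object is eventually released
    },
}

def calls_in(rule):
    """List the call-site node labels in the order they were declared."""
    return [lbl for lbl in rule["nodes"].values() if lbl.startswith("call:")]
```

A mined rule in this form can be compared against candidate code fragments by matching node labels and dependence edges, which is essentially what the approach does at the dependence-graph level.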
The new approach includes the following innovations: (i) a method for computing interprocedural dependence spheres to encompass interprocedural rule instances; (ii) a sophisticated greedy maximal frequent minor mining (GMFMM) algorithm for discovering interprocedural and intraprocedural rules embedded in dependence spheres; (iii) an incremental heuristic minor matching (IHMM) algorithm for identifying violations of such rules; and (iv) a new way of matching control points based on data dependences, which reduces false positives.

Refer again to the programming rule shown in Figure 1. By considering both intraprocedural and interprocedural dependences, our approach determines that the predicate !key at line 251 is transitively data dependent on the output of the function PEM_read_bio_PrivateKey(). This indicates that the return value of the function should be checked. Additionally, as shown in Figure 2, there is a shared data dependence edge (SDDE) (see Section 2.1) from the control point (in == null)


Figure 2. The EPDG of the rule instance.

at line 268 to the first actual-in parameter of PEM_read_bio_PrivateKey(), which indicates that the parameter should be non-null before the function is called.

We evaluated our approach on the latest versions of three applications: net-snmp [6], openssl [7], and the Apache HTTP server [8]. As described in Section 4, it discovered a number of new bugs in these projects, which were confirmed and fixed by developers, and it exhibited good precision and recall. The evaluation also characterized the precision gained by considering dependences between rule elements, by comparing our approach with one based on frequent itemset mining.

2. BACKGROUND

In this section, we provide background information about enhanced dependence graphs, graph minors, and maximal frequent subgraph mining.

2.1. Enhanced dependence graphs

A procedure dependence graph (PDG) [9] represents a procedure or function and may contain nodes representing a variety of program elements and edges representing data or control dependences between them. A system dependence graph (SDG) links PDGs with data dependences between formal-in/out parameters and actual-in/out parameters and with control dependences between the call-site nodes and the entry nodes of the callee functions. A call-site graph associated with a call-site node c is a PDG subgraph CSG(c) consisting of c and its adjacent actual-in/out parameter nodes. The edges between a call-site node and its parameter nodes in the PDGs generated by CodeSurfer [10], which we employed in our empirical study, are control dependence edges.

To specify certain semantic relationships among elements more precisely and to provide some of the benefits of interprocedural analysis, we enhance PDGs by adding a new type of edge, called shared data dependence edges (SDDEs). An intraprocedural SDDE is a directed edge (a, b) that is added to a PDG if the following two conditions are met: (i) there exists a node c on which both nodes a and b are directly data dependent and (ii) there exists a path from a to b in the corresponding control flow graph. (We will describe interprocedural SDDEs in Section 3.1.2.) A PDG augmented with SDDEs is called an enhanced procedure dependence graph (EPDG); similarly, an SDG augmented with SDDEs is called an enhanced system dependence graph (ESDG). ESDG edges are labeled by their type. Because two nodes can be connected by multiple types of edges, ESDGs are actually multigraphs. Most nodes are labeled by the abstract syntax tree of the corresponding element, with variable names discarded.
However, a call-site is given the label of the entry node of the callee function, and nodes representing actual-in/out parameters are given the labels of their corresponding formal-in/out parameters. In this work, all control points receive the same label, and they are distinguished by their adjacencies.
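The two SDDE conditions can be sketched as follows. This is a minimal sketch under assumed inputs: data_dep maps each node to the set of nodes it is directly data dependent on, and cfg_reachable is a hypothetical helper answering control-flow-graph reachability, both presumed supplied by the dependence-graph builder.

```python
from itertools import combinations

def add_intraprocedural_sddes(data_dep, cfg_reachable, nodes):
    """Add a directed SDDE (a, b) when (i) a and b are both directly
    data dependent on some common node c, and (ii) there is a CFG path
    from a to b."""
    sddes = set()
    for a, b in combinations(nodes, 2):
        for x, y in ((a, b), (b, a)):   # try both orientations
            if data_dep[x] & data_dep[y] and cfg_reachable(x, y):
                sddes.add((x, y))
    return sddes
```

For example, two statements that both use a variable defined at node 0, with one reachable from the other in the CFG, receive an SDDE between them.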

2.2. Frequent subgraph mining and graph minors

A graph H is a frequent subgraph of a graph dataset G = {G_1, G_2, ..., G_n} if the support of H, which is the number of graphs in G containing H as a subgraph, is greater than or equal to a user-specified threshold, called the minimum support [11]. If H is a frequent subgraph, all of its subgraphs must also be frequent. A frequent subgraph of a set of ESDGs represents a recurring programming pattern, which may be a programming rule. To discover such patterns, it suffices to find maximal frequent subgraphs, which are not contained in any other frequent subgraphs.

A graph M is called a minor of a graph G if M is isomorphic to a graph that can be obtained by applying zero or more edge contractions to a subgraph of G, that is, by replacing certain paths with edges. It is not difficult to prove that a directed graph M is a minor of a directed graph G if and only if M is isomorphic to a subgraph of the transitive closure of G. Our approach actually involves mining frequent ESDG minors, rather than ordinary subgraphs, so that certain variations in the dependence structure of rule instances can be represented by a single rule [5].
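The transitive-closure characterization of directed minors can be sketched directly. The fragment below checks one candidate embedding of a minor into a host graph; finding the embedding itself is the hard, mining part, and the function names are illustrative.

```python
def transitive_closure(edges, nodes):
    """Reachability closure of a directed graph (Floyd-Warshall style)."""
    reach = set(edges)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach

def is_minor_under_map(minor_edges, host_edges, host_nodes, node_map):
    """M is a minor of G iff M is isomorphic to a subgraph of the
    transitive closure of G. Given a mapping of minor nodes to host
    nodes, every minor edge must map to a closure edge."""
    closure = transitive_closure(host_edges, host_nodes)
    return all((node_map[u], node_map[v]) in closure for u, v in minor_edges)
```

For a host path 1 → 2 → 3, the single-edge minor a → c embeds via a ↦ 1, c ↦ 3, because (1, 3) lies in the closure; the reverse orientation does not.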

3. SPECIFICS OF THE APPROACH

This section describes how interprocedural dependence spheres are extracted, how candidate rules are mined and then evaluated by users, how violations are found, and the limitations of our approach.

3.1. Interprocedural dependence spheres

The first step in our approach to discovering conditional rules and ordering rules for function calls is to extract from PDGs dependence spheres encompassing rule instances. (SDDEs are added to the spheres later.) The set of dependence spheres corresponding to calls of a particular function constitutes the graph dataset that is mined in order to discover any candidate rules involving that function. To mine rules involving calls of a function f, a dependence sphere is first extracted from around each call site c of f, which is called a candidate node. We assume that such a dependence sphere contains the elements and dependences of any rule instance involving c, along with irrelevant nodes and edges. Since the program elements involved in a rule instance may be distributed across multiple functions, it is desirable to construct interprocedural dependence spheres. This is a much more complex task than constructing the intraprocedural spheres used in our previous work [12]. To increase the likelihood that an instance is contained within a dependence sphere, it is desirable to permit a sphere to have a fairly large radius. Hence, unlike in our previous work, we do not directly limit the radius of a sphere to control computation costs. Instead, the number of nodes in a sphere is bounded, and unnecessary nodes are removed as described below.

3.1.1. Choice of functions. We refer to the function containing the candidate node c as the base function (f_b). In order to include elements from different functions in the sphere, we first build a function call tree with a radius bound r around f_b, containing the functions whose PDGs will be used for creating the sphere S. (We used r = 3 in our empirical evaluation.)
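The callee side of this call-tree construction can be sketched as a bounded breadth-first walk. This is a simplified sketch: it collects only descendants within radius r of f_b, whereas the approach also adds one heuristically selected chain of callers (described next); call_graph is an assumed adjacency map from a function to the functions it calls.

```python
def build_call_tree(call_graph, f_b, r=3):
    """Collect the functions whose PDGs qualify for the dependence
    sphere: all callees reachable from f_b within radius r."""
    qualified = {f_b}
    frontier = [f_b]
    for _ in range(r):
        nxt = []
        for f in frontier:
            for g in call_graph.get(f, []):
                if g not in qualified:
                    qualified.add(g)
                    nxt.append(g)
        frontier = nxt
    return qualified
```

With r = 3, as in the paper's evaluation, a chain a → b → c → d → e would contribute a through d but not e.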
The PDGs of the functions in the tree are called qualified PDGs. The part of this tree rooted at f_b is uniquely determined by c. However, f_b may have multiple callers, which may in turn have multiple callers, etc. We select a single chain of ancestor functions of f_b to include in the function call tree, along with their descendants, using a recursive heuristic that ranks a calling function by the number of call sites (potential rule elements) that it and its called functions contain. The heuristic is based on the assumption that if a rule involving f_b is usually followed, then such an ancestor chain and its descendants are likely to include the rule's elements.

3.1.2. Interprocedural SDDEs. The intraprocedural SDDEs mentioned in Section 2 are not sufficient to discover certain rules whose instances cross procedure boundaries. Therefore, we extend

Figure 3. Examples of interprocedural SDDEs.

the definition of an SDDE to encompass two types of interprocedural SDDEs: (i) an SDDE from an actual-in parameter to another actual-in parameter and (ii) an SDDE from a predicate to an actual-in parameter. These are defined in terms of data dependences that may be either intraprocedural or interprocedural and either direct or indirect (transitive). The details of the technique used to generate interprocedural SDDEs are presented in [13]. Examples of interprocedural SDDEs are shown in Figure 3.

3.1.3. Phases of dependence sphere creation. We now describe the phases in the process of creating a dependence sphere S for the call site/candidate node c for a function f.

Initial growth: The dependence sphere S is initialized to be the call-site graph CSG(c) associated with c. Nodes from qualified PDGs (see Section 3.1.1) are then added iteratively to S if they are directly data dependent (intraprocedurally or interprocedurally) on a node in S or if a node in S is directly data dependent on them. When a parameter node p is added to S, CSG(cs(p)) is added to S at the same time, where cs(p) denotes the call-site node of p. This process continues until the nodes from qualified PDGs are exhausted or either the number of nodes or the number of call sites in S reaches a user-specified threshold. SDDEs are then added among actual-in parameter nodes and from control point nodes to actual-in parameter nodes in S. In our empirical evaluation, thresholds of 1500 total nodes and 100 call-site nodes were used; larger thresholds increased the computation times excessively.

Partial reduction: S contains a variety of nodes, such as expression, formal-in/out parameter, and declaration nodes. Because we focus on discovering function-call ordering rules and conditional rules that involve checking the input parameters or return values of functions, nodes other than call-site graph nodes and control point nodes are not of interest and so are removed.
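The initial-growth step can be sketched as a worklist expansion. This is a minimal sketch: dd_neighbors is an assumed helper yielding the nodes directly data dependent on n or that n is directly data dependent on, and the call-site-graph bookkeeping and the separate call-site budget are omitted.

```python
from collections import deque

def grow_sphere(csg_nodes, dd_neighbors, max_nodes=1500):
    """Grow a dependence sphere from the candidate call-site graph,
    pulling in directly data-dependent nodes until the qualified PDGs
    are exhausted or the node budget is reached."""
    sphere = set(csg_nodes)
    work = deque(csg_nodes)
    while work and len(sphere) < max_nodes:
        n = work.popleft()
        for m in dd_neighbors(n):
            if m not in sphere:
                sphere.add(m)
                work.append(m)
                if len(sphere) >= max_nodes:
                    break
    return sphere
```

The 1500-node threshold used in the evaluation corresponds to max_nodes here; the budget is what keeps large programs tractable despite the effectively unlimited radius.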
Call-site graphs having fewer occurrences in the set of dependence spheres than the user-specified minimum support for rules are also removed. (In our empirical evaluation, the minimum support was 80% of the number of spheres.) Edges representing transitive data dependences are then added between the remaining nodes. Such an edge (u, v) is added to the reduced sphere S if there is a data dependence path from u to v in S that does not contain any other nodes in S. An SDDE between u and v is preserved if both u and v are in S. The control dependences between a call-site node and its parameters are also preserved. We call the resulting sphere a partially reduced dependence sphere (PRDS).

Full reduction: The PRDS contains data dependence edges, SDDEs, and control dependence edges between call-site nodes and their parameter nodes. Control point nodes from S are retained in the PRDS. However, we are only interested in those control point nodes that are responsible for checking the input parameters or return values of the functions involved in candidate rule instances. To reduce the PRDS further, we compute the maximal connected subgraph (MCS) of the PRDS, starting from the candidate node c and ignoring edge directions. Nodes that are not in the MCS are removed. This is done because control point nodes or call-site graphs without direct or indirect data dependences or shared data dependences with the candidate node c are usually not relevant to actual rules. We compute transitive control dependences involving nodes in the MCS by using the original PDGs. For nodes u, v in the MCS, a control dependence edge from u to v is added to the MCS if there is a control dependence path from u to v in the original PDG that does not contain any

Table I. GMFMM Algorithm.


Notations:
  G: graph;  NTC: near transitive closure;  FM: frequent minor;  FMI: frequent minor instance;
  (F)BES: (frequent) boundary edge set;  (F)EES: (frequent) extendable edge set

Input: graph dataset G[1, ..., n] and minimum support MIN_SUPPORT
Output: a set of frequent minors FM[1, ..., m]. Each FM[i] is associated with a set of frequent minor instances FMI[i_1, ..., i_l], such that all the instances are isomorphic.

Preprocess:
  NTC[1, ..., n] <- GetNTC(G[1, ..., n])
  Initialize an initial FMI set FMI[1, ..., n], such that each FM[i] is the call-site graph of the candidate function

Phase I:
  I.1: BES[1, ..., r] <- GetBES(FMI[1, ..., r])
  I.2: FBES[1, ..., p] <- MAFIA(BES[1, ..., r], MIN_SUPPORT)
       If FBES is empty, which means that the instances cannot be extended any more, then output a new frequent minor FM[k] with FMI[1, ..., r] as its instances
  I.3: For each FBES[i]:
         Get FMI[j_1, ..., j_p] such that they have FBES[i] as a subset of their BES
         Repeat I.1-I.3 with FMI[j_1, ..., j_p]

Phase II:
  For each FM[i] generated from Phase I:
    EES[i_1, ..., i_l] <- GetEES(FMI[i_1, ..., i_l])
    FEES[1, ..., p] <- MAFIA(EES[i_1, ..., i_l], MIN_SUPPORT)
    For each FEES[j]:
      Extend FM[i] with FEES[j], and output a new frequent minor FM[k] with extended instances

other nodes in the MCS. The resulting graph is called a fully reduced dependence sphere (FRDS). Its near transitive closure (introduced in Section 3.2) is used for mining candidate rules in the next step.

3.2. The GMFMM algorithm

This section sketches our GMFMM algorithm for mining programming rules whose instances may be either intraprocedural or interprocedural. The full details of the algorithm are presented in [14]. The objective of the algorithm is to mine maximal frequent minors, which are considered to be candidate rules, from the set of FRDSs generated for a candidate node. It is more efficient than our previous rule mining algorithms [5, 12], and hence can handle larger dependence spheres. The algorithm uses an iterative two-phase process to discover maximal frequent minors. Phase I locates nodes in frequent minors, while Phase II makes the frequent minors discovered in Phase I maximal by adding additional dependences between the nodes. The essentials of the algorithm are described below; a pseudocode version is shown in Table I.

3.2.1. Phase I: Adding frequent boundary edges (FBEs) to frequent minors. To mine maximal frequent minors from a set of FRDSs, the algorithm first computes a near transitive closure of each FRDS by adding a data dependence or control dependence edge from u to v if there is a data dependence path or control dependence path between them containing no other nodes of the FRDS and having length less than a user-specified threshold. (A threshold of 3 was used in our empirical evaluation, because dependence paths of length >3 were judged as less likely to be semantically relevant.) Since all input graphs contain the call-site graph of the candidate function, the algorithm first creates an initial frequent minor containing only the call-site graph, which it subsequently

extends. We first give an overview of how frequent minors are extended and then explain some important subtleties. We refer to an input-graph minor that is isomorphic to a frequent minor as an instance of the latter.

Iterative extension: An iterative method is used in Phase I to find nodes with which to extend a frequent minor. In each iteration, the frequent minors and their instances discovered in the previous iteration are extended by adding FBEs incident to/from their nodes (see below). Iteration continues until no more FBEs can be found. Each iteration involves three stages. Based on the observation that data dependence edges and SDDEs permit more precise matching of rule elements than do control dependence edges, the former are employed to extend a frequent minor with non-control point nodes in the first stage and control point nodes in the second stage. Frequent control dependence edges are employed in the third stage to extend the frequent minor with non-control point nodes.

Node labeling: Frequent subgraph mining algorithms generally rely on node and edge labels to find isomorphic instances of frequent subgraphs in a graph dataset. In our original approach to rule mining [12], control point nodes were labeled based on their abstract syntax trees. However, this method suffers from the problem that control point nodes with the same semantics but different ASTs, such as if (a != null) and if (a), are given different labels, causing useful rules to be missed. To address this issue, we now assign all control point nodes the same label and match them based only on the data dependence edges and SDDEs they are incident with. The labeling of call-site nodes and parameter nodes is described in Section 2.1.

Discovery of FBEs: A node is given an index when it is added to a frequent minor and its instances. The nodes with the same index in a frequent minor and its instances have the same label.
An FBE used for extending a frequent minor is of the form (b, v), where node b is in the minor but node v is not. Thus, when an FBE is added to a frequent minor, it includes a new node, and the minor remains a connected graph. An edge is modeled in terms of a record consisting of (i) the index of the node b; (ii) the label of the node v; (iii) the edge label; and (iv) the edge direction. The GMFMM algorithm then employs the frequent itemset mining algorithm MAFIA [15] to discover FBEs. The GMFMM algorithm relies on the minimum support used by MAFIA to ensure that the resulting minors are frequent, meaning that their supports are at least the minimum support specified by the user.

Isomorphic instances of a frequent minor: We preprocess the call-site graphs of a function and make them identical. Thus, the instances of the initial frequent minor, which contain only the call-site graph of the candidate function, are isomorphic. As the initial frequent minor is created, nodes in the frequent minor and its instances are given an index, and nodes with the same index have the same labels. As mentioned previously, for each frequent edge (b, v) used for extending the frequent minor and its instances, the algorithm ensures that (i) the nodes corresponding to b have the same index, so that they have the same label; (ii) the nodes corresponding to v have the same label; and (iii) the edges corresponding to (b, v) have the same edge label and direction. These steps ensure that after the new edge is added to the minor and its instances, they remain isomorphic to one another. Furthermore, since each frequent edge added to a frequent minor in Phase I is incident to a new node, the frequent minors discovered in this phase are acyclic graphs, even if edge directions are ignored.

Efficiency: One of the reasons why mining frequent subgraphs is computationally expensive is that there are multiple ways to extend every instance of a frequent subgraph.
Instead of considering all of these ways, the GMFMM algorithm employs greedy heuristics, based on information from the ESDG and the source code, to select the best option for extending an instance of a frequent minor. (For details, see reference [14].) Consequently, the algorithm will not necessarily find all maximal frequent minors.
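The encoding of boundary edges as records and their frequency-based filtering can be sketched as follows. This is a stand-in sketch only: a plain support count replaces MAFIA's maximal-itemset mining, and the record fields follow the four-tuple described above.

```python
from collections import Counter

def frequent_boundary_edges(instance_bess, min_support):
    """Given one boundary-edge set per frequent-minor instance, keep
    the edge records (index of in-minor node b, label of new node v,
    edge label, edge direction) whose support meets min_support."""
    counts = Counter()
    for bes in instance_bess:
        for record in set(bes):  # count each record once per instance
            counts[record] += 1
    return {r for r, c in counts.items() if c >= min_support}
```

Because a record abstracts away the identity of the new node v, occurrences of the "same" boundary edge in different instances collapse to one item, which is what lets an itemset miner find edges shared by enough instances.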

The call-site graphs of a function generated by CodeSurfer may have different numbers of actual-in/out parameters. When a call-site node or a parameter node is added into a frequent minor, the call-site graph containing the node is added into the minor at the same time. Since the call-site graphs of a function f are isomorphic, the extended frequent minor and its instances are also isomorphic.


3.2.2. Phase II: Adding frequent edges between nodes within a frequent minor. The frequent minors generated in Phase I cannot be extended by adding more nodes. However, in order to ensure that the instances of frequent minors are easily determined to be isomorphic, some frequent edges are deliberately ignored in Phase I. These are edges whose addition would create loops or duplicate edges if edge directions are not considered. As a result, the frequent minors discovered in Phase I may not be maximal. To address this issue, frequent edges whose endpoints are both in the frequent minor are used in Phase II to extend the frequent minors discovered in Phase I so that the minors are maximal. We represent such an edge by a record consisting of the edge label and the indices of the nodes in the edge. As in Phase I, the frequent itemset mining algorithm MAFIA is used to discover frequent edges for extending the frequent minor. In addition, trivial frequent minors, such as a call-site graph without any control point nodes, are filtered out.

3.3. User evaluation of rules

Candidate rules are next examined by users with the aid of a tool that displays code with rule elements highlighted, as described in [5]. If necessary, the user may modify or delete rules.

3.4. The IHMM algorithm

In our previous work [5], we applied intraprocedural analysis to discover violations of the mined rules in order to locate bugs. However, intraprocedural analysis may result in false-positive indications of violations when the elements of a rule instance are located in different functions. This is illustrated by Figure 3(b). Assume that we have discovered a rule that requires the input parameter to the function h to be checked before the function is called. Intraprocedural analysis will indicate that the function call h(a) in function h is a violation of the rule, since the input parameter a is not checked within h.
However, in main(), the parent function of h, the parameter a of the call h(a) is checked by the statement if (a). To address this issue, we developed an incremental heuristic minor matching (IHMM) algorithm for discovering violations of confirmed rules whose instances may be interprocedural. We now outline the IHMM algorithm; a more detailed description is provided in [14].

3.4.1. Overview. To detect violations of a rule R represented by a graph G_R, the IHMM algorithm looks for ESDG minors that are very similar to G_R but lack some nodes or edges present in G_R. The user specifies a key node k in G_R, which is used as a starting point for graph matching. Since we are concerned with rules related to function calls, k is normally chosen to be a call-site node. The algorithm searches for key node instances, which are nodes with the same label as k. It determines whether possible rule instances involving a key node instance n are correct instances or violations of the rule. The function f_b containing n is called the base function. Starting from f_b, the IHMM algorithm employs a recursive method to examine the function f_b and, if necessary, the ancestor functions of f_b to identify violations involving n.

Dependence spheres are created for finding correct instances and violations of a rule. A dependence sphere is created using the PDGs of a top function f_t and its descendant functions. The descendant functions of f_t are those functions called by f_t directly or indirectly. The algorithm involves two major stages: the initial stage and the recursive stage. In the initial stage the top function f_t is the base function f_b; in the recursive stage f_t is an ancestor function of f_b.

Initial stage: An initial dependence sphere S_sb is created from the PDGs of f_t = f_b and its descendants. A graph-matching algorithm (see Section 3.4.3) is then employed to identify a graph minor M in S that is similar to G_R.
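The two-stage control flow can be sketched as follows. The helper names are hypothetical stand-ins: sphere_for(f) builds the dependence sphere whose top function is f, match(sphere) reports whether the rule graph is found in it, and parents(f) lists f's callers; the Section 3.4.2 early-stopping heuristics are omitted.

```python
def ihmm_search(match, sphere_for, parents, f_b, max_levels=2):
    """Initial stage: check the base function f_b. Recursive stage:
    climb at most max_levels of ancestor functions, widening the
    sphere each time, before concluding a violation."""
    frontier = [f_b]
    for _ in range(max_levels + 1):
        for top in frontier:
            if match(sphere_for(top)):
                return "correct instance"
        frontier = [p for f in frontier for p in parents(f)]
        if not frontier:
            break
    return "violation"
```

For the Figure 3(b) example, the check missing from the base function is found when the search widens to main(), so no false positive is reported.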
If M is isomorphic to G_R, a correct instance of the rule is reported and the algorithm stops. Otherwise, the algorithm fails to find a rule instance in S and, as mentioned below, it may need to enter the recursive stage to try to find rule elements in the ancestor functions of f_b.

Recursive stage: If the heuristics described in Section 3.4.2 indicate that ancestor functions of the top function f_t need to be examined, the IHMM algorithm examines the parent functions of f_t and the corresponding descendant functions to find missing program elements and dependences. A dependence sphere S_sp extending S_sb is created for each call site of f_t, using the PDGs of the calling (parent) function f_p and f_p's descendant functions. The parent function f_p becomes the

new top function. Obviously, S_sb is a subsphere of S_sp. As in the initial stage, the graph matching algorithm is used to find the graph minor M in S_sp that best matches G_R. The algorithm employs the heuristics described in Section 3.4.2 to determine whether (i) S_sb or some super-sphere S_sp involves a rule violation or (ii) the parent functions of f_t should be examined recursively. In the latter case, each parent function examined becomes a new top function.

3.4.2. Heuristics. For a dependence sphere S in which no instance of the rule R is found, the following heuristics are used to infer whether S involves a violation of R. Assume that M is the graph minor of S best matching G_R and that f_t is the top function used to generate S. The IHMM algorithm will not examine the ancestor functions of f_t if the heuristics indicate that S does involve a violation of R.

Deciding whether S involves a violation: There are three cases in which S is considered to involve a rule violation.

Case 1: M is missing only edges, not any nodes. Obviously, the missing edges will not appear in a super-sphere S_sp.

Case 2: Suppose that an SDDE (l, m) is missing from M, where m is a node in M, l is a missing node, and the function containing m is f. If the analysis performed by the IHMM algorithm shows that m does not use the definition of a formal-in parameter of f directly, there cannot be an interprocedural SDDE (l, m) in S_sp, and hence S is found to involve a rule violation. Suppose, on the other hand, that a data dependence edge (l, m) or (m, l) is missing from M. If the analysis indicates that there is no data dependence path between m and a formal-in/out parameter node of f, there cannot be an interprocedural data dependence path between m and l in S_sp, and hence S is found to involve a rule violation.

Case 3: S is assumed to involve a rule violation if the user-defined threshold on the level of ancestor functions of f_b is reached.
In our empirical evaluation a limit of two levels was used, so that only parent and grandparent functions were examined. This heuristic is based on the observation that when a rule element is found in an ancestor of the base function, it is usually found within two levels.

Examination of ancestor functions: If the algorithm cannot determine that S involves a violation by using the aforementioned heuristics, then the ancestor functions of f_t are checked for the missing program elements or dependences. We employ a greedy method to determine the locations of rule violations without examining the ancestor functions of f_t exhaustively. The method decides that S involves a rule violation if the number of super-spheres in which no rule instance is found is greater than a user-specified threshold. In our empirical evaluation, S was considered to involve a violation if more than 3, or at least 30%, of the super-spheres of S failed to contain an instance of the rule R. This heuristic is based on the observation that when a rule element is found in an ancestor of the top function, it is generally a near ancestor.

3.4.3. Graph matching algorithm. This algorithm searches a dependence sphere S for the graph minor M that best matches a rule graph G_R. The algorithm first finds an image in S, if possible, for each node in G_R. These images comprise the nodes of M. Then, the algorithm computes the edges among the nodes in M using the edges of S. Finally, the algorithm determines whether M is isomorphic to G_R. This algorithm is an extension of the intraprocedural matching algorithm described in [5], and a detailed description of it can be found in [14].

3.5. Limitations

The GMFMM algorithm will not discover rule instances whose elements are not linked by dependence paths of the types mentioned above. Also, it will not discover a rule instance whose elements are not contained within a dependence sphere (with given size bound).
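The case analysis and thresholds of Section 3.4.2 can be sketched as a small decision routine. This is a minimal, hypothetical sketch: the names MatchInfo, sphere_involves_violation, and greedy_violation_decision are ours, not the paper's, and the real IHMM implementation operates on dependence graphs rather than on precomputed flags.

```c
#include <stdbool.h>

/* Hypothetical summary of what the matching phase learned about the
 * best-matching minor M of sphere S (assumed fields, not the paper's API). */
typedef struct {
    bool missing_edges_only;       /* Case 1: all nodes matched, only edges missing   */
    bool missing_node_unreachable; /* Case 2: no dependence path to a formal-in/out   */
    int  ancestor_level;           /* levels of ancestors of f_b already examined     */
} MatchInfo;

/* Cases 1-3 of Section 3.4.2: decide whether sphere S involves a violation
 * without examining further ancestor functions. */
bool sphere_involves_violation(const MatchInfo *m, int level_limit) {
    if (m->missing_edges_only) return true;        /* missing edges cannot appear in a super-sphere */
    if (m->missing_node_unreachable) return true;  /* no interprocedural path can supply the node   */
    return m->ancestor_level >= level_limit;       /* Case 3: threshold reached (2 in the paper)    */
}

/* Greedy super-sphere heuristic: S involves a violation if more than 3,
 * or at least 30%, of its super-spheres contain no rule instance. */
bool greedy_violation_decision(int failed_supers, int total_supers) {
    return failed_supers > 3 || 10 * failed_supers >= 3 * total_supers;
}
```

Note that the 30% comparison is done in integer arithmetic (10 * failed >= 3 * total) to avoid floating-point rounding at the boundary.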

4. EMPIRICAL EVALUATION

We empirically evaluated our approach by applying it to code from three open source projects: openssl [7], net-snmp [6], and the Apache HTTP server [8]. The evaluation addressed the
Table II. Project characteristics and rule mining times.

Project  | Version | Files | Loc   | SDG nodes | Time (s)
openssl  | 0.9.8g  | 611   | 225 K | 361 K     | 63
net-snmp | 5.3.2   | 251   | 199 K | 340 K     | 105
Apache   | 2.2.8   | 192   | 112 K | 173 K     | 174

Loc is source lines of code (including comments and empty lines); Time is average time to mine rules for a subject function.

Table III. Results of rule mining.

Project  | #Subject functions | Total | #Invalid | Precision | #Missed | Recall | Conditional | Call sequence | #Inter
openssl  | 650 | 375 | 52 | 86.1% | 10 | 96.6% | 285 | 97 | 89
net-snmp | 324 | 191 | 42 | 78.0% |  2 | 98.5% | 133 | 33 | 41
Apache   | 244 | 138 | 54 | 60.9% |  2 | 97.3% |  73 | 27 | 15

Total is the total number of candidate rules; #Invalid is the number of invalid rules; #Missed is the number of valid rules missed by our algorithm; #Inter is the number of valid interprocedural rules.

following issues:

1. Is the approach able to discover a high proportion of conditional and ordering rules in a code base?
2. Is it able to discover rules involving interprocedural dependences?
3. Are the discovered rules relevant and precise?
4. Do reported rule violations actually involve bugs?

4.1. Experimental methodology

We employed CodeSurfer, a commercial static program analysis tool, to generate SDGs for each project. We extracted required information from SDGs using the Scheme API provided by CodeSurfer and imported it into a database. Characteristics of the source code and SDGs are shown in Table II. Our database runs on a Windows system with two 2.4 GHz CPUs and 4 GB RAM. The system for discovering rules and violations, which was implemented in Java, was run on a Linux system with two 2.4 GHz CPUs and 4 GB RAM. The machines were connected through a 1 Gbps intranet.

Creating the graph database: Call-site nodes of functions that were called in at least four functions were chosen as candidate nodes. The number of such subject functions in each project is shown in Table III. To reduce computation costs, at most 10 candidate nodes with a given label L were selected for generating dependence spheres. If more than 10 PDGs contained L, 10 of them were chosen at random and a candidate node with label L was selected from each of them.

Mining candidate rules: The GMFMM algorithm was applied to the near transitive closure of the FRDSs to find maximal frequent minors, i.e., candidate rules, with minimum support equal to 80% of the number of FRDSs.

Reviewing the discovered rules: We manually reviewed the discovered rules, confirmed those of interest, and removed any irrelevant nodes and dependence edges from them. If the GMFMM algorithm generated multiple maximal frequent minors for a subject function, we examined only those with the highest support. The candidate nodes were assigned to be key nodes for detecting rule violations.
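As a concrete illustration of the two thresholds just described, the following sketch computes them in integer arithmetic. The function names are ours, and the paper does not specify how the 80% support is rounded, so we round it up here:

```c
/* At most 10 PDGs containing a given label are sampled, one candidate
 * node per chosen PDG, to bound the number of dependence spheres. */
int spheres_to_generate(int pdgs_with_label) {
    return pdgs_with_label < 10 ? pdgs_with_label : 10;
}

/* Minimum support for a candidate rule: 80% of the number of FRDSs.
 * (4n + 4) / 5 computes ceil(4n/5) in integer arithmetic. */
int min_support(int num_frds) {
    return (4 * num_frds + 4) / 5;
}
```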
Subject functions were also examined manually to determine the precision and recall of our approach to mining rules. We also compared our approach with one based on frequent itemset mining, with respect to the precision of discovered rules.
In order to analyze larger programs, we disabled the gmod and summaries options when CodeSurfer was used to generate SDGs.

Finding rule violations: The confirmed rules were used with the IHMM algorithm to try to find rule violations. For each confirmed rule, at most 20 key node instances n were randomly selected for use in finding violations. The discovered violations were filtered to remove likely false positives, such as ones involving functions called through pointers.

Evaluating rule violations: Some of the filtered violations for each project were reported to the corresponding developer community for confirmation. To limit the effort required from developers, we reported only those violations whose ESDG subgraph was not apparently equivalent to the rule.

4.2. Effectiveness and efficiency of rule mining

The average time to mine rules from the graph dataset consisting of FRDSs generated for a candidate node was about 4.7 s, 8.0 s, and 5.6 s for the openssl, net-snmp, and Apache projects, respectively. We characterized the effectiveness of our approach to discovering rules in terms of (i) precision, as estimated by the proportion of reported rules that we concluded were valid after reviewing them, and (ii) recall, as estimated by the proportion of rules we identified by inspection (of subject functions) that were also discovered automatically. The results of rule discovery with our approach are summarized in Table III. (Example instances of rules mined with our approach can be viewed at [13].) In what follows, we will describe the results for the largest project, openssl, in detail.

Estimated precision of rule discovery: 375 candidate rules were mined from openssl. After reviewing them, we concluded that 52 were invalid, hence our estimate of the precision of rule discovery was 86.1%. Of the 323 rules we confirmed, 285 involved conditional rules and 97 involved function-call ordering rules. (Some involved both types.)

Estimated recall of conditional rule discovery: We inspected the call sites of all subject functions, to determine whether the latter involved conditional rules.
For openssl there were 650 subject functions. A total of 295 rules were identified manually, of which 10 rules were missed by automated rule mining. Hence, the estimated recall of conditional rule discovery was 96.6%. Over all three projects, some conditional rules were missed for reasons such as: (i) the limited size of dependence spheres; (ii) failure of heuristics for choosing potential rule elements; and (iii) rule instances with different EPDG minors.

Interprocedural rule instances: Inspection revealed that 89 rules were supported by at least some interprocedural instances. Some of these rules might be discovered by intraprocedural mining if the minimum support for rules is reduced, although this would cause more false positives.

Comparison with a frequent itemset mining approach: In order to evaluate the effectiveness of our approach to discovering function call ordering rules, we compared it to a restricted frequent itemset mining (RFIM) approach, which approximates part of PR-Miner's [2] functionality (see Section 5). The RFIM approach is intended to discover functions that are called frequently together. To implement it, we first built a function call sphere with radius r = 3 for each call-site node of the subject function f. Let F be the set of functions included in this sphere. Each call-site node inside a member of F was mapped to an integer, called an item, which is unique to the called function, yielding an itemset. The maximal frequent itemset mining algorithm MAFIA was applied (with support 90%) to the itemsets generated for call sites of f to discover candidate rules consisting of call-site nodes frequently appearing together with f.

To compare the two approaches to discovering ordering rules, we randomly chose 40 subject functions for investigation from each project. For each subject function f, the RFIM approach was used to find the set of functions F_f that frequently appear together with f.
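The item-mapping step of the RFIM baseline can be sketched as follows. The function item_for() and its fixed-size table are our own simplification, not the paper's implementation; MAFIA then consumes the resulting integer itemsets:

```c
#include <string.h>

#define MAX_ITEMS 64

/* Table mapping function names to integer items. Entries store pointers,
 * so callers are expected to pass string literals or long-lived strings. */
static const char *item_names[MAX_ITEMS];
static int item_count = 0;

/* Return the unique integer item for a called function's name,
 * assigning a fresh id the first time the name is seen. */
int item_for(const char *fn_name) {
    for (int i = 0; i < item_count; i++)
        if (strcmp(item_names[i], fn_name) == 0)
            return i;
    if (item_count == MAX_ITEMS)
        return -1;  /* table full */
    item_names[item_count] = fn_name;
    return item_count++;
}
```

Collecting item_for() results over the call sites inside one call sphere yields the itemset for that sphere.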
The semantic relationships between f and the functions in F_f were examined manually to determine whether f involves an ordering rule. The results of the comparison are summarized in Table IV. Of the 40 subject functions for the openssl project, our manual inspection showed that there are 15 distinct call ordering rules associated with these functions and that each rule includes calls of 3.2 functions
Note that we cannot guarantee that all conditional rules are discovered by manual inspection. PR-Miner also considers variables that appear frequently together. This high level of support reduces false positives.


Table IV. Comparison of our approach with RFIM approach to discovering function call ordering rules.

Project  | #Subject functions | Manual rules: #Rules | Manual rules: #Func/Rule | Our approach: #Func/Rule | #Complete rules | #Incomplete rules | RFIM: #Func/Rule
openssl  | 40 | 15 | 3.2 | 3.8 | 12 | 2 | 24.4
net-snmp | 40 |  8 | 2.3 | 3.9 |  7 | 0 | 21.1
Apache   | 40 | 11 | 2.5 | 2.6 |  7 | 0 | 20.9

Table V. Summary of rule violations.

Project  | Total | #After filtering | #Removed manually | Reported violations: #Viol | #Not confirmed | #Confirmed fixed/unfixed/confirmed by some | #No response | Precision
net-snmp | 80  | 62  | 24  | 38  | 7 | 10/10/3 | 8 | 50%
Apache   | 39  | 24  | 9   | 15  | 9 | 6/4/1   | 1 | 50%
openssl  | 430 | 295 | 126 | 169 | - | -/-/-   | - | 57.2%

Precision = (#Confirmed + #NotResponded)/#After Filtering. Note that 30 violations were reported to the Openssl developers. Since they have not responded, the remaining 139 violations have not been reported to the developers to date.
(1) Although developers did not respond about some violations, we still count them as real violations, since we have evidence from either documentation or the repository indicating that they are correct. (2) Since Openssl developers did not respond about the violations reported, Precision was computed by manual examination, namely (#After Filtering - #Removed Manually)/#After Filtering.

on average. For these 15 rules, there were about 24.4 functions in F_f on average, and our inspection showed that about 87% of the functions in F_f were not relevant to the subject function on average. (For net-snmp and Apache, the percentages of irrelevant functions were 89 and 88%, respectively.) This result is apparently due to the fact that the RFIM approach does not consider semantic relationships between the elements of frequent itemsets. Of the 15 ordering rules mentioned above, our approach mined 12 rules completely and 2 rules partially, and it missed one. The rules mined with our approach included calls of 3.8 functions on average. This indicates that considering data and control dependences between rule elements permits ordering rules to be identified much more precisely than with the RFIM approach. The failure of our approach to discover one rule was caused by the fact that the functions (CRYPTO_w_lock() and CRYPTO_w_unlock()) involved in the rule had no data dependences, control dependences, or shared data dependences with the subject function. With the RFIM approach, the rule was contained in a frequent itemset with a number of irrelevant functions.

4.3. Effectiveness of violation detection

We characterized the effectiveness of our approach to detecting violations in terms of precision, as measured by the proportion of reported violations that were confirmed upon review to involve actual bugs. The results of violation detection with our approach are summarized in Table V. In what follows, we will describe the results for the net-snmp project, which have been confirmed by developers, in detail.

Discarded violations: Among the 80 violations reported by the IHMM algorithm for the net-snmp project, 18 violations were removed automatically because they involved functions called

The violations of a rule R were not automatically filtered if the ratio of the number of rule violations to the number of rule instances was less than a user-specified threshold (5% in the experiments), because such violations are likely to involve intraprocedural bugs.


through pointers or involved non-entry functions without parent or grandparent functions for the net-snmp project.

Bugs confirmed and fixed by the developers: We reviewed the 62 remaining net-snmp violations and found that 24 were false positives. The other 38 violations were reported to developers. Among these, the feedback from the developers indicated that seven were false positives. The remaining 31 violations were considered to be bugs, for the following reasons: (i) 10 violations were confirmed and fixed by developers; (ii) developers considered 10 additional violations to be bugs, technically, but chose not to fix them; (iii) for each of three additional violations there was at least one developer who thought that the violation might be a bug yet also felt that confirmation by other developers was desirable; and (iv) eight violations have not been reviewed by developers to date, but the evidence available to us indicates they are real ones.

Precision: Among the 62 violations remaining after automatic filtering, 31 involved bugs, so that the precision of automated violation detection was 50%.

Bugs found overall: Overall, our approach detected 62 new bugs in the latest versions of the net-snmp (32) and Apache (30) projects, with 35 (10, 25) of them confirmed and fixed by developers and 26 (4, 22) of them involving rules with interprocedural instances, as shown in Table V. Some rule violations involved multiple bugs. For the Apache project, 25 confirmed bugs were found from six rule violations.

Analysis of violations: We examined 169 violations from the openssl project that we determined to be bugs, to identify possible patterns. 27 were found in source files with "test" in their filenames. 26 were found in the main() function, which suggests that programmers treat main() differently from other functions. Some of the bugs were associated with programming idioms. For example, instead of checking the variables a1, a2, b1, and b2 with if (!a1 || !a2 || !b1 || !b2), the following code checks only the variable b2, for programming convenience:

1: a1 = BN_CTX_get(ctx); a2 = BN_CTX_get(ctx);
2: b1 = BN_CTX_get(ctx); b2 = BN_CTX_get(ctx);
3: if (!b2) { return 1; }

Eight bugs were directly caused by this idiom. We found that omitting one predicate could cause multiple violations. For example, an omitted check of the formal-in parameter of EC_GROUP_cmp() caused two violations, which were of rules requiring the first input parameters of EC_GROUP_method_of() and EC_GROUP_get_order(), respectively, to be non-null:

1: int EC_GROUP_cmp(EC_GROUP *a, ...) {
2:   if (...(EC_GROUP_method_of(a)) != ...)
3:   if (!EC_GROUP_get_order(a, a1, ctx))

We found 15 apparent bugs in openssl that shared the same cause with others. Finally, 22 openssl bugs that violated function ordering rules occurred on error handling paths. A typical case is a function not releasing an object after an error is detected and before the function returns.

4.4. Examples of bugs discovered

This section presents examples of bugs discovered by our approach. (Others can be viewed at [13].) The first example demonstrates that the approach is able to discover interprocedural bugs. A rule discovered from the Apache project requires error handling if apr_file_flush_locked() returns !APR_SUCCESS. Our approach reported a violation of this rule in apr_file_flush(), which fails to check the output value of apr_file_flush_locked():

// Filename: /srclib/apr/file_io/unix/readwrite.c
330: APR_DECLARE(...) apr_file_flush(...) {
336:   rv = apr_file_flush_locked(thefile);
Among these violations, five are identical to the violations that have been confirmed by the net-snmp developers.


// rv is not redefined between line 336 and 342
342:   return rv;

The code indicates that apr_file_flush() returns to its caller the output value of apr_file_flush_locked() directly. Thus, the callers of apr_file_flush() take the responsibility of checking the return value of apr_file_flush_locked(). Our approach reported that apr_file_flush() involved a violation because three of its callers failed to check its return value, which is actually returned by apr_file_flush_locked(). Our investigation of the callers of apr_file_flush() showed that five violations involved failure to check the output of apr_file_flush(), and all of them were confirmed to be real bugs by the Apache developers.

The next example is a reported violation of a function call ordering rule that requires apr_thread_mutex_unlock() to be executed to unlock the apr_thread_mutex_t object r1->listlock, which is locked by executing apr_thread_mutex_lock(). The violation occurred in function reslist_cleanup():

138: apr_status_t reslist_cleanup(void *data) {
141:   apr_reslist_t *r1 = data;
144:   apr_thread_mutex_lock(r1->listlock);
146:   while (r1->nidle > 0) {
149:     rv = destroy_resource(r1, res);
150:     if (rv != APR_SUCCESS)
151:       return rv;
154:   }
159:   apr_thread_mutex_destroy(r1->listlock);

This violation involves two bugs: calls to apr_thread_mutex_unlock() were omitted before the execution of return rv at line 151 and before the call to apr_thread_mutex_destroy() at line 159. The Apache developers confirmed these two bugs and fixed them by adding a call to apr_thread_mutex_unlock() before the call to apr_thread_mutex_destroy() and by removing the return statement at line 151.

4.5. Threats to validity

One threat to the validity of our results is that the three open-source software projects studied, although substantial, may not be representative of other projects. Another threat is that we alone classified automatically discovered rules as valid or invalid, which might affect the precision results for rule discovery.
(By contrast, developers helped us classify rule violations.) Similarly, the recall results for our approach would be affected if we overlooked actual rule instances. Naturally, we are not as knowledgeable about the projects as their developers, and we may also be more or less stringent in accepting rules than they would be. We are also more familiar with our own technique than the developers using it would be initially. We have, however, developed an easy-to-use graphical tool, PatternView, that highlights the source code lines of rule and violation instances and allows the user to edit rule instances. We are currently beginning a study with a large company to see how helpful the tool is for their programmers.

5. RELATED WORK

Here we survey some of the most relevant related work; additional related work is described in [5, 14]. Engler et al. presented a seminal approach to discovering software bugs by inferring implicit programming rules and finding violations of those rules [1]. Their approach uses rule templates; specific rules are extracted by fitting rule templates to the code base. Li et al. employed frequent itemset mining in the tool PR-Miner to automatically discover intraprocedural programming rules and rule violations [2]. Rule elements are not necessarily related by dependences; hence their approach appears to require higher rule support to control false positives than our approach. Wasylkowski et al. employ static intraprocedural analysis to infer object interaction protocols, in the form of finite state automata,
Copyright q 2011 John Wiley & Sons, Ltd.

J. Softw. Evol. and Proc. 2012; 24:5166 DOI: 10.1002/smr

PROGRAMMING RULES AND VIOLATIONS

65

and to detect bugs involving abnormal method usage [11]. Krinke presented an approach to identifying duplicated code in programs by finding maximal similar subgraphs in fine-grained dependence graphs [17]. Liu et al. proposed an approach to detecting software plagiarism by identifying isomorphic subgraphs between PDGs from the original program and the plagiarized program [18]. A number of researchers have addressed the problem of finding interprocedural rules. Acharya et al.'s work [19] involves mining API specifications as partial orders from API usage scenarios extracted from static traces of client code. By contrast, our approach finds both ordering rules and conditional rules for function calls and relates rule elements in terms of dependences. Ramanathan et al. developed CHRONICLER [3], a tool that applies interprocedural path-sensitive static analysis and maximal frequent sequence mining to automatically derive function precedence protocols from source code. While this approach is mainly based on control flow analysis, they presented another approach [4], which employs data flow analysis, to extract preconditions of a procedure call that involve both control-flow properties and data-flow properties. In comparison, our approach can generate both preconditions and postconditions of interprocedural programming rules. Renieres and Reiss describe an approach in which a library of usual code patterns is extracted from several projects that have been heavily used and reviewed [20]. If a pattern in a new project appears rarely in the library, it is considered suspect. Lu et al. applied a frequent itemset mining technique to find correlated variables, which are ones that need to be accessed together [21]. These can be used to detect two kinds of bugs that our approach does not currently address: multi-variable inconsistent updates and multi-variable concurrency bugs. Their approach uses interprocedural analysis involving direct callee(s) of a function. Shoham et al.
use static interprocedural analysis to mine client-side temporal API specifications in the form of finite automata [22]. Tan et al. proposed a tool, AutoISES, to extract code-level security specifications and detect security vulnerabilities [16]. They apply interprocedural analysis for both rule mining and vulnerability detection.

6. CONCLUSION

This paper has presented an innovative approach to automatically discovering conditional and ordering rules for function calls, and also rule violations, even when rule instances span different functions. Rules are discovered with a novel heuristic algorithm for finding maximal frequent minors of interprocedural dependence spheres. Another novel algorithm, for matching such minors, is used to identify rule violations. To our knowledge, these are the first algorithms of their kind to be based on interprocedural program dependence analysis, which substantially increases their power, as characterized by recall and precision, beyond the power of purely intraprocedural analysis and of analysis that disregards program dependences. An empirical evaluation of the approach indicates that it is effective in discovering rules and violations. The rule discovery algorithm exhibited excellent precision (86%) and recall (97%). The precision of the violation detection algorithm was lower (50%) but compares very favorably with other static analysis tools. As a practical matter, this indicates that most actual rules involving function calls were discovered automatically, that most automatically discovered rules were valid, and that about half of the reported violations were actual bugs. If these results generalize to other projects, it will mean that our approach to bug detection is both powerful and widely applicable. Note that in recent work we have shown that the precision of dependence-based rule and violation mining can be enhanced significantly by use (after mining) of supervised learning with logistic regression models [23]. In future work, we intend to enhance the efficiency of our algorithms and further expand the range of rules that our approach is capable of detecting.
REFERENCES

1. Engler D, Chen DY, Hallem S, Chou A, Chelf B. Bugs as deviant behavior: A general approach to inferring errors in systems code. ACM SOSP, 2001.
2. Li Z, Zhou Y. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. ACM ESEC/FSE, 2005.


3. Ramanathan M, Grama A, Jagannathan S. Path sensitive inference of function precedence protocols. ICSE, 2007.
4. Ramanathan M, Grama A, Jagannathan S. Static specification inference using predicate mining. ACM SIGPLAN PLDI, 2007.
5. Chang RY, Podgurski A, Yang J. Discovering neglected conditions in software by mining dependence graphs. IEEE TSE, DOI: ieeecomputersociety.org/10.1109/TSE.2008.24.
6. Net-snmp project. Available at: www.net-snmp.org [25 January 2011].
7. OpenSSL project. Available at: www.openssl.org [25 January 2011].
8. Apache HTTP Server. Available at: www.apache.org [25 January 2011].
9. Ferrante J, Ottenstein KJ, Warren JD. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems 1987; 9:319-349.
10. Grammatech. CodeSurfer user guide and technical reference. Available at: www.grammatech.com.
11. Wasylkowski A, Zeller A, Lindig C. Detecting object usage anomalies. ESEC/FSE, 2007.
12. Chang RY, Podgurski A, Yang J. Finding what's NOT there: A new approach to revealing neglected conditions in software. ACM ISSTA, 2007.
13. Podgurski A, Chang R-Y, Sun B. Discovering programming rules and violations by mining dependences project. Available at: http://se-lab.case.edu/rules/ [25 January 2011].
14. Chang RY. Discovering neglected conditions in software by mining program dependence graphs. PhD Dissertation, EECS Department, Case Western Reserve University, August 2008. Available at: http://se-lab.case.edu/rules/ [25 January 2011].
15. Burdick D, Calimlim M, Gehrke J. MAFIA: A maximal frequent itemset algorithm for transactional databases. ICDE, 2001.
16. Tan L, Zhang X, Ma X, Xiong W, Zhou Y. AutoISES: Automatically inferring security specifications and detecting violations. Proceedings of the 17th USENIX Security Symposium, 2008.
17. Krinke J. Identifying similar code with program dependence graphs. Eighth Working Conference on Reverse Engineering, 2001.
18. Liu C, Yan X, Han J. GPLAG: Detection of software plagiarism by program dependence graph analysis. International Conference on Knowledge Discovery and Data Mining, 2006.
19. Acharya M, Xie T, Pei J, Xu J. Mining API patterns as partial orders from source code: From usage scenarios to specifications. ACM ESEC/FSE, 2007.
20. Renieres M, Reiss SP. Finding unusual code. IEEE ICSM, 2007.
21. Lu S, Park S, Hu C, Ma X, Jiang W, Li Z, Popa RA, Zhou Y. MUVI: Automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs. SOSP, 2007.
22. Shoham S, Yahav E, Fink S, Pistoia M. Static specification mining using automata-based abstractions. ACM ISSTA, 2007.
23. Sun B, Podgurski A, Ray S. Improving the precision of dependence-based defect mining by supervised learning of rule and violation graphs. 21st IEEE International Symposium on Software Reliability Engineering (ISSRE 2010), 2010.
AUTHORS BIOGRAPHIES

Ray-yaung Chang received his BS degree in Computer Science from Chung-Cheng Institute of Technology, Taoyuan, Taiwan, in 1989, MS degree in Engineering and Technology Management from the University of Pretoria, Pretoria, Republic of South Africa, in 1995, and PhD degree in Computer Science from Case Western Reserve University, Cleveland, Ohio, U.S.A., in 2009. He is currently an assistant professor in the Graduate School of Resources Management and Decision Science, Management College, National Defense University (NDU), Taiwan. Prior to joining the NDU, he was a software engineer at Chung-Shan Institute of Science and Technology in Taiwan from 1989 to 2004. His research interests include software testing and mining of software repositories.

Andy Podgurski received the MS and PhD degrees in Computer Science from the University of Massachusetts at Amherst in 1989. He is currently a Professor in the Electrical Engineering and Computer Science Department at Case Western Reserve University, where he has been a faculty member since 1989. His research interest is software engineering methodology, especially the application of static and dynamic program analysis in combination with data mining, statistical, and machine learning techniques to enhance software reliability and security and to facilitate software maintenance.
