REVERSE ENGINEERING

Edited by
LINDA WILLS
PHILIP NEWCOMB

KLUWER ACADEMIC PUBLISHERS

REVERSE ENGINEERING

A Special Issue of AUTOMATED SOFTWARE ENGINEERING, An International Journal, Volume 3, Nos. 1/2 (1996)

edited by

LINDA WILLS
Georgia Institute of Technology

PHILIP NEWCOMB
The Software Revolution, Inc.

KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London

AUTOMATED SOFTWARE ENGINEERING
An International Journal
Volume 3, Nos. 1/2, June 1996

Special Issue: Reverse Engineering
Guest Editors: Linda Wills and Philip Newcomb

Preface / W. Lewis Johnson ... 5
Introduction / Linda Wills ... 7
Database Reverse Engineering: From Requirements to CARE Tools / J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick and D. Roland ... 9
Understanding Interleaved Code / Spencer Rugaber, Kurt Stirewalt and Linda M. Wills ... 47
Pattern Matching for Clone and Concept Detection / K. Kontogiannis, R. De Mori, E. Merlo, M. Galler and M. Bernstein ... 77
Extracting Architectural Features from Source Code / David R. Harris, Alexander S. Yeh and Howard B. Reubenstein ... 109
Strongest Postcondition Semantics and the Formal Basis for Reverse Engineering / Gerald C. Gannod and Betty H.C. Cheng ... 139
Recent Trends and Open Issues in Reverse Engineering / Linda M. Wills and James H. Cross II ... 165
Desert Island Column / John Dobson ... 173

Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts 02061 USA

Distributors for all other countries:
Kluwer Academic Publishers Group
Distribution Centre
Post Office Box 322
3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 1996 by Kluwer Academic Publishers

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061.

Printed on acid-free paper.
Printed in the United States of America.

Automated Software Engineering 3, 5 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Preface

This issue of Automated Software Engineering is devoted primarily to the topic of reverse engineering. This is a timely topic: many organizations must devote increasing resources to the maintenance of outdated, so-called "legacy" systems. As these systems grow older, and changing demands are made on them, they constitute an increasing risk of catastrophic failure. For example, it is anticipated that on January 1, 2000, there will be an avalanche of computer errors from systems that were not designed to handle dates larger than 1999.

In software engineering the term "legacy" has a negative connotation, meaning old and decrepit. Legacy systems ought instead to be viewed as a valuable resource, embodying the organization's collective knowledge and expertise. They are often an important cultural heritage for an organization, capturing algorithms and business rules that can be reused in future software systems. But in order to unlock and preserve the value of legacy systems, we need tools that can help extract useful information and renovate the codes so that they can continue to be maintained. As Leon Osterweil has observed, the challenge of research in reverse engineering and software understanding is to give the term the positive connotation that it deserves. Thus automated software engineering plays a critical role in this endeavor.

Last year's Working Conference on Reverse Engineering (WCRE) attracted a number of excellent papers. Philip Newcomb and Linda Wills, the program co-chairs of the conference, and I decided that many of these could be readily adapted into journal articles, and so we decided that a special issue should be devoted to reverse engineering. By the time we were done, there were more papers than could be easily accommodated in a single issue, and so we decided to publish the papers as a double issue, along with a Desert Island Column that was due for publication. Even so, we were not able to include all of the papers that we hoped to publish at this time. We look to publish additional papers in forthcoming issues, and expect to include some additional reverse engineering papers in future issues.

A note of clarification is in order regarding the review process for this issue. When Philip and Linda polled the WCRE program committee to determine which papers they thought deserved consideration for this issue, they found that their own papers were among the papers receiving highest marks. This was a gratifying outcome, but also a cause for concern, as it might appear to the readership that they had a conflict of interest. In order to eliminate the conflict of interest, it was decided that these papers would be handled through the regular Automated Software Engineering admissions process, and be published when they reach completion. One of these papers, by Rugaber, Stirewalt, and Wills, is now ready for publication. After reviewing the papers myself, I concurred with the WCRE program committee: these papers constitute an important contribution to the topic of reverse engineering, and I am pleased to recommend this one for inclusion in this special issue.

I would like to express my sincere thanks to Drs. Newcomb and Wills for organizing this special issue. Their tireless efforts were essential to making this project a success.

W.L. Johnson

Automated Software Engineering 3, 7-8 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Introduction to the Special Double Issue on Reverse Engineering

LINDA M. WILLS
Georgia Institute of Technology

A central activity in software engineering is comprehending existing software artifacts. Whether the task is to maintain, test, migrate, or upgrade a legacy system or reuse software components in the development of new systems, the software engineer must be able to recover information about existing software. Relevant information includes: What are its components and how do they interact and compose? What is their functionality? How are certain requirements met? What design decisions were made in the construction of the software? How do features of the software relate to concepts in the application domain? Reverse engineering involves examining and analyzing software systems to help answer questions like these. Research in this field focuses on developing tools for assisting and automating portions of this process and representations for capturing and managing the information extracted.

Researchers actively working on these problems in academia and industry met at the Working Conference on Reverse Engineering (WCRE), held in Toronto, Ontario, in July 1995. This issue of Automated Software Engineering features extended versions of select papers presented at the Working Conference. They are representative of key technological trends in the field.

As with any complex problem, being able to provide a well-defined characterization of the problem's scope and underlying issues is a crucial step toward solving it. The Hainaut et al. and Rugaber et al. papers both do this for problems that have thus far been ill-defined and attacked only in limited ways. Hainaut et al. deal with the problem of recovering logical and conceptual data models from database applications, while Rugaber et al. characterize the difficult problem of unraveling code that consists of several interleaved strands of computation. Both papers draw together work on several related, but seemingly independent problems, providing a framework for solving them in a unified way.

The recognition of meaningful patterns in software is a widely-used technique in reverse engineering. Kontogiannis et al. describe a collection of new pattern matching techniques for detecting pairs of code "clones" as well as for recognizing abstract programming concepts. While Rugaber et al. deal with the problem of interleaving, which often arises due to structure-sharing optimizations, Kontogiannis et al. focus on the complementary problem of code duplication. This occurs as programs evolve and code segments are reused by simply duplicating them where they are needed, rather than factoring out the common structure into a single, generalized function.

Currently, there is a trend toward flexible, interactive recognition paradigms, which give the user explicit control, for example, in selecting the type of recognizers to use and the degree of dissimilarity to tolerate in partial matches. This trend can be seen in the Kontogiannis et al. and Harris et al. papers. Harris et al. focus on recognition of high-level, architectural features in code, using a library of individual recognizers. This work not only attacks the important problem of architectural recovery, it also contributes to more generic recognition issues, such as library organization and analyst-controlled retrieval, interoperability between recognizers, recognition process optimization, and recognition coverage metrics.

Another trend in reverse engineering is toward increased use of formal methods, which introduce more rigor and clarity into the reverse engineering process, making the techniques more easily automated and validated. A representative paper by Gannod and Cheng describes a formal approach to extracting specifications from imperative programs. They advocate the use of strongest postcondition semantics as a formal model that is more appropriate for reverse engineering than the more familiar weakest precondition semantics.

A more general overview of the trends and challenges of the field is provided in the summary article by Wills and Cross. The papers featured here together provide a richly detailed perspective on the state of the field of reverse engineering.

The papers in this issue are extensively revised and expanded versions of papers that originally appeared in the proceedings of the Working Conference on Reverse Engineering. We would like to thank the authors and reviewers of these papers, as well as the reviewers of the original WCRE papers, for their diligent efforts in creating high-quality presentations of this research. Finally, we would like to acknowledge the general chair of WCRE, Elliot Chikofsky, whose vision and creativity has provided a forum for researchers to share ideas and work together in a friendly, productive environment.

Automated Software Engineering 3, 9-45 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Database Reverse Engineering: From Requirements to CARE Tools*

J.-L. HAINAUT (jlh@info.fundp.ac.be), V. ENGLEBERT, J. HENRARD, J.-M. HICK AND D. ROLAND
Institut d'Informatique, University of Namur, rue Grandgagnage, 21 - B-5000 Namur

Abstract. This paper analyzes the requirements that CASE tools should meet for effective database reverse engineering (DBRE), and proposes a general architecture for data-centered applications reverse engineering CASE environments. First, the paper describes a generic, DBMS-independent DBRE methodology; then it analyzes the main characteristics of DBRE activities in order to collect a set of desirable requirements. It then describes DB-MAIN, an operational CASE tool developed according to these requirements. The main features of this tool that are described in this paper are its unique generic specification model, its repository, its transformation toolkit, its user interface, the text processors, the assistants, the methodological control and its functional extensibility. Finally, the paper describes five real-world projects in which the methodology and the CASE tool were applied.

Keywords: reverse engineering, database engineering, program understanding, methodology, CASE tools

*This is a heavily revised and extended version of "Requirements for Information System Reverse Engineering Support" by J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick and D. Roland, which first appeared in the Proceedings of the Second Working Conference on Reverse Engineering, Toronto, July 1995, pp. 136-145, IEEE Computer Society Press, 1995. This project is partially supported by the Région Wallonne, the European Union, and by a consortium comprising ACEC-OSI (Be), ARIANE-II (Be), Banque UCL (Lux), BBL (Be), Centre de recherche public H. Tudor (Lux), CGER (Be), Cockerill-Sambre (Be), CONCIS (Fr), D'Ieteren (Be), DIGITAL, EDF (Fr), EPFL (CH), Groupe S (Be), IBM, OBLOG Software (Port), ORIGIN (Be), Ville de Namur (Be), Winterthur (Be) and 3 Suisses (Be). The DB-Process subproject is supported by the Communauté Française de Belgique.

1. Introduction

1.1. The problem and its context

Reverse engineering a piece of software consists, among others, in recovering or reconstructing its functional and technical specifications, starting mainly from the source text of the programs (IEEE, 1990; Hall, 1992; Wills et al., 1995). Recovering these specifications is generally intended to redocument, convert, restructure, maintain or extend old applications. It is also required when developing a Data Administration function that has to know and record the description of all the information resources of the company.

The problem is particularly complex with old and ill-designed applications. In this case, not only can no decent documentation (if any) be relied on, but the lack of systematic methodologies for designing and maintaining them has led to tricky and obscure code.

Therefore, reverse engineering has long been recognized as a complex, painful and prone-to-failure activity, so much so that it is simply not undertaken most of the time, leaving huge amounts of invaluable knowledge buried in the programs, and therefore definitively lost.

In information systems, or data-oriented applications, i.e., in applications whose central component is a database (or a set of permanent files), the complexity can be broken down by considering that the files or databases can be reverse engineered (almost) independently of the procedural parts. This proposition to split the problem in this way can be supported by the following arguments:

— the semantic distance between the so-called conceptual specifications and the physical implementation is most often narrower for data than for procedural parts;
— the permanent data structures are generally the most stable part of applications;
— even in very old applications, the semantic structures that underlie the file structures are mainly procedure-independent (though their physical structures are highly procedure-dependent);
— reverse engineering the procedural part of an application is much easier when the semantic structure of the data has been elicited.

In such situations, concentrating on reverse engineering the data components of the application first can be much more efficient than trying to cope with the whole application.

The primary aim of database reverse engineering (DBRE) is to recover possible logical and conceptual schemas for an existing database. The database community considers that there exist two outstanding levels of description of a database or of a consistent collection of files, materialized into two documents, namely its conceptual schema and its logical schema. The first one is an abstract, technology-independent description of the data, expressed in terms close to the application domain; conceptual schemas are expressed in some semantics-representation formalism such as the ERA, NIAM or OMT models. The logical schema describes these data translated into the data model of a specific data manager, such as a commercial DBMS. A logical schema comprises tables, columns, keys, record types, segment types and the like.

1.2. Two introductory examples

The real scope of database reverse engineering has sometimes been misunderstood, and presented as merely redrawing the data structures of a database into some DBMS-independent formalism. Many early scientific proposals, and most current CASE tools, are limited to the translation process illustrated in figure 1: by merely analyzing the record structure declarations, some elementary translation rules suffice to produce a tentative conceptual schema. Unfortunately, most situations are actually far more complex, and such a translation obviously brings little information about the meaning of the data.

In figure 2, we describe a very small COBOL fragment from which we intend to extract the semantics underlying the files CF008 and PFOS. By merely analyzing the record structure declarations, only schema (a) in figure 2 can be extracted. However, by analyzing the procedural code, the user-program dialogs and, if needed, the file contents, a more expressive schema can be obtained. For instance, schema (b) can be considered as a refinement of schema (a) resulting from the following reasonings:

[Figure 1. An idealistic view of database reverse engineering. The declared SQL data structures below are translated, one construct at a time, into a conceptual schema (CUSTOMER —0-N— passes —1-1— ORDER):]

    create table CUSTOMER (
        CNUM numeric(6) not null,
        CNAME char(24) not null,
        CADDRESS char(48) not null,
        primary key (CNUM))

    create table ORDER (
        ONUM char(8) not null,
        CNUM numeric(6) not null,
        ODATE date,
        primary key (ONUM),
        foreign key (CNUM) references CUSTOMER)

[Figure 2. A more realistic view of database reverse engineering. The figure lists a small COBOL program (environment, data and procedure divisions declaring and processing the indexed files CF008 and PFOS) together with three schemas. Merely analyzing the data structure declaration statements yields a poor result (a), while further inspection of the procedural code makes it possible to recover a much more explicit schema (b), which can be expressed as a conceptual schema (c) with COMPANY, PRODUCT and PRODUCTION entity types.]

— the gross structure of the program suggests that there are two kinds of REC-PFOS-1 records, arranged into ordered sequences, each comprising one type-1 record (whose PDATA field is processed before the loop), followed by an arbitrary sequence of type-2 records (whose PRDATA field is processed in the body of the loop); all the records of such a sequence share the same first 9 characters of the key;
— the processing of type-1 records shows that the K11 part of key K1 is an identifier, the rest of the key acting as pure padding;
— the user dialog suggests that type-1 records describe products; this record type is called PRODUCT, and its key PRO-ID;
— examining the screen contents when running the program shows that PDATA is made of a product name followed by a product category; this field can then be considered as the concatenation of a PNAME field and a CATEGORY field;
— the body of the loop processes the sequence of type-2 records depending on the current PRODUCT record; they all share the PRO-ID value of their parent PRODUCT record, so that this 9-digit subfield can be considered as a foreign key to the PRODUCT record; visual inspection of the contents of the PFOS file could confirm this hypothesis;
— the processing of a type-2 record consists in displaying one line made up of constants and field values; the linguistic structure of this line suggests that it informs us about some Production of the current product (this interpretation is given by a typical user of the program); the PDATA value is expressed in tons (most probably a volume), and seems to be produced by some kind of agents described in the file CF008; hence the names PRODUCTION for the type-2 record type and VOLUME for the PRDATA field;
— the agent of a production is obtained by using the second part of the key of the PRODUCTION record; this second part can be considered as a foreign key to the REC-CF008-1 records;
— the name of the field in which the agent record is stored suggests that the latter is a company; hence the name COMPANY for this record type, and CPY-ID for its access key;
— the C-DATA field of the COMPANY record type should match the structure of the CPY-DATA variable, which in turn is decomposed into CNAME and CADDRESS.

Refining the initial schema (a) by these reasonings results in schema (b), and interpreting these technical data structures into a semantic information model (here some variant of the Entity-relationship model) leads to schema (c).

Despite its small size, this exercise emphasizes some common difficulties of database reverse engineering. In particular, it shows that the declarative statements that define file and record structures can prove a poor source of information. The analyst must often rely on the inspection of other aspects of the application, such as the procedural code, the user-program interaction, the program behaviour and the file contents. This example also illustrates the weaknesses of most data managers which, together with some common programming practices that tend to hide important structures, force the programmer to express essential data properties through procedural code. Finally, domain knowledge proves essential to discover and to validate some components of the resulting schema.
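The file-contents inspection invoked in these reasonings is easy to mechanize. The sketch below is purely illustrative: it assumes the two indexed files have been dumped to fixed-length text exports (the file names are hypothetical), that the layouts recovered above are correct, and that type-1 records pad the last six key positions with zeroes.

    # Test the hypothesized implicit foreign key: the last 6 digits of the
    # key of a PRODUCTION (type-2) record should identify a COMPANY record
    # in CF008. Data analysis only gathers evidence -- a clean run fails
    # to falsify the hypothesis; domain knowledge must still validate it.

    def keys(path, length):
        """Leading key field of every record in a fixed-length dump."""
        with open(path, encoding="ascii") as f:
            return [line[:length] for line in f if line.strip()]

    company_ids = set(keys("CF008.DAT", 6))        # K1 of REC-CF008-1
    violations = []
    for k1 in keys("PFOS.DAT", 15):                # K11 (9 chars) + 6 digits
        suffix = k1[9:]
        if suffix != "000000" and suffix not in company_ids:
            violations.append(k1)                  # type-2 record, broken ref

    print(len(violations), "records violate the foreign key hypothesis")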

1.3. State of the art

Though reverse engineering data structures is still a complex task, it appears that the current state of the art provides us with concepts and techniques powerful enough to make this enterprise more realistic. The literature proposes systematic approaches for database schema recovery:

— for standard files: (Casanova and Amaral de Sa, 1983; Nilson, 1985; Davis and Arora, 1985; Sabanis and Stevenson, 1992);
— for CODASYL databases: (Batini et al., 1992; Fong and Ho, 1994; Edwards and Munro, 1995);
— for IMS databases: (Navathe and Awong, 1988; Winans and Davis, 1990; Batini et al., 1992; Fong and Ho, 1994);
— for relational databases: (Casanova and Amaral de Sa, 1984; Navathe and Awong, 1988; Davis and Arora, 1988; Johannesson and Kalman, 1990; Markowitz and Makowsky, 1990; Springsteel and Kou, 1990; Fonkam and Gray, 1992; Batini et al., 1992; Premerlani and Blaha, 1993; Shoval and Shreiber, 1993; Chiang et al., 1994; Petit et al., 1994; Andersson, 1994; Signore et al., 1994; Vermeer and Apers, 1995).

Many of these studies, however, appear to be limited in scope, and are generally based on assumptions about the quality and completeness of the source data structures to be reverse engineered that cannot be relied on in many practical situations. For instance, they often suppose that:

— all the conceptual specifications have been translated into data structures and constraints (at least until 1993); in particular, constraints that have been procedurally expressed are ignored;
— the translation is rather straightforward (no tricky representations); for instance, a relational schema is often supposed to be in 4NF;
— the schema has not been deeply restructured for performance objectives or for any other requirements; record fragmentation or merging for disc space or access time minimization, for instance, will remain undetected and will be propagated to the conceptual schema;
— a complete and up-to-date DDL schema of the data is available;
— names have been chosen rationally (e.g., a foreign key and the referenced primary key have the same name), so that they can be used as a reliable definition of the objects they denote.

In many proposals, it appears that the only databases that can be processed are those that have been obtained by a rigourous database design method. This condition cannot be assumed for most large operational databases, particularly for the oldest ones. Premerlani and Blaha (1993) and Blaha and Premerlani (1995) are among the only proposals that cope with some non trivial representations observed in real world applications; let us mention some of them: nullable primary key attributes, almost unique primary keys, denormalized structures, degradation of inheritance, mismatched referential integrity domains, overloaded attributes and contradictory data. In addition, these proposals are most often dedicated to one data model and do not attempt to elaborate techniques and reasonings common to several models, leaving the question of a general DBRE approach still unanswered. Moreover, some authors have recognized that the procedural part of the application programs is an essential source of information on data structures (Joris et al., 1992; Hainaut et al., 1993a; Petit et al., 1994; Andersson, 1994; Signore et al., 1994).

An increasing number of commercial products (claim to) offer DBRE functionalities. Though they ignore many of the most difficult aspects of the problem, those tools provide their users with invaluable help to carry out DBRE more effectively (Rock-Evans, 1990).

Since 1992, we have proposed the theoretical baselines for a generic, DBMS-independent DBRE methodology (Hainaut et al., 1993a). These baselines have been developed and extended in Hainaut et al. (1993b) and Hainaut et al. (1994). The current paper translates these principles into practical requirements DBRE CARE tools should meet, and presents the main aspects and components of a CASE tool dedicated to database applications engineering, and more specifically to database reverse engineering.

1.4. About this paper

The paper is organized as follows. Section 2 is a synthesis of the main problems which occur in practical DBRE, and of a generic DBMS-independent DBRE methodology. Section 3 discusses some important requirements which should be satisfied by future DBRE CARE tools. Section 4 briefly presents a prototype DBRE CASE tool which is intended to address these requirements. The following sections describe in further detail some of the original principles and components of this CASE tool: the specification model and the repository (Section 5), the transformation toolkit (Section 6), the user interface (Section 7), the text analyzers and name processor (Section 8), the assistants (Section 9), functional extensibility (Section 10) and methodological control (Section 11). Section 12 evaluates to what extent the tool meets the requirements, while Section 13 describes some real world applications of the methodology and of the tool. The reader is assumed to have some basic knowledge of data management and design, and more specifically of database reverse engineering. Recent references Elmasri and Navathe (1994) and Date (1994) are suggested for data management, while Batini et al. (1992) and Teorey (1994) are recommended for database design.

2. A generic methodology for database reverse engineering (DBRE)

The problems that arise when recovering the documentation of the data naturally fall in two categories that are addressed by the two major processes in DBRE, namely data structure extraction and data structure conceptualization (Joris et al., 1992; Hainaut et al., 1993a). By naturally, we mean that these problems relate to the recovery of two different schemas, and that they require quite different concepts, reasonings and tools. In addition, like any complex process, DBRE cannot be successful without the support of adequate tools, called CARE tools.

Quite naturally, each of these processes grossly appears as the reverse of a standard database design process (resp. physical and logical design (Teorey, 1994; Batini et al., 1992)). This reverse methodology is therefore to be read from right to left, and bottom-up! Its general architecture, as developed in Hainaut et al. (1993a), is outlined in figure 3. This methodology is generic in two ways. First, its architecture and its processes are largely DMS-independent. Secondly, it specifies what problems have to be solved, and in which way, rather than the order in which the actions must be carried out. We will describe briefly these processes and the problems they try to solve. Let us mention, however, that partitioning the problems in this way is not proposed by many authors, who prefer proceeding in one step only, and that other important processes are ignored in this discussion for simplicity (see Joris et al. (1992) for instance).

[Figure 3. Main components of the generic DBRE methodology: starting from the DMS-compliant optimized schema, SCHEMA PREPARATION, SCHEMA UNTRANSLATION and SCHEMA DE-OPTIMIZATION yield the conceptual-logical-physical schema, then the raw conceptual schema, which CONCEPTUAL NORMALIZATION turns into the normalized conceptual schema.]

2.1. Data structure extraction

This phase consists in recovering the complete DMS schema, including all the implicit and explicit structures and constraints. True database systems generally supply, in some readable and processable form, a description of this schema (data dictionary contents, DDL texts, etc.).

The problem is much more complex for standard files, for which no computerized description of their structure exists in most cases. The analysis of each source program provides a partial view of the file and record structures only. Though essential information may be missing from this schema, the latter is a rich starting point that can be refined through further analysis of the other components of the application (views, subschemas, screen and report layouts, procedures, fragments of documentation, texts, database content, program execution, etc.).

In particular, this analysis must go well beyond the mere detection of the record structures declared in the programs. For most real-world (i.e., non academic) applications, three problems are encountered, namely structure hiding, non-declarative structures and lost specifications.

Structure hiding applies to a source data structure or constraint S1, which could be implemented in the DMS. It consists in declaring it as another data structure S2 that is more general and less expressive than S1, but that satisfies other requirements such as field reusability, genericity, program conciseness, simplicity or efficiency. Let us mention some examples: a compound/multivalued field in a record type is declared as a single-valued atomic field; a sequence of contiguous fields are merged into a single anonymous field (e.g., as an unnamed COBOL field); a one-to-many relationship type is implemented as a many-to-many link; a relationship type is represented by a foreign key; a referential constraint is not explicitly declared as a foreign key, but is procedurally checked. Though most visible in standard files, these practices are also common in (true) databases, i.e., those controlled by DBMS (e.g., in IMS and CODASYL databases).

Non declarative structures are structures or constraints which cannot be declared in the target DMS, and therefore are represented and checked by other means, such as procedural sections of the application. Let us mention popular examples: uniqueness constraints on sequential files, and secondary keys in IMS and CODASYL databases; referential integrity in standard files and one-to-one relationship types in CODASYL databases are other examples. Most often, the checking sections are not centralized, but are distributed and duplicated (frequently in different versions) throughout the application programs.

Lost specifications are constructs of the conceptual schema that have not been implemented in the DMS data structures nor in the application programs. This does not mean that the data themselves do not satisfy the lost constraint, but the trace of its enforcement cannot be found in the declared data structures nor in the application programs.

Recovering hidden, non-declarative and lost specifications is a complex problem, for which no deterministic methods exist so far. A careful analysis of the procedural statements of the programs, of the dataflow through local variables and files, of the file contents, of program inputs and outputs, of the user interfaces, of the organizational rules, etc., can accumulate evidence for these specifications, and that evidence must be consolidated by the domain knowledge, as illustrated by Premerlani and Blaha (1993) and Blaha and Premerlani (1995) for relational databases. Until very recently, these problems have not triggered much interest in the literature; the first proposals address the recovery of integrity constraints (mainly referential and inclusion) in relational databases through the analysis of SQL queries (Petit et al., 1994; Andersson, 1994; Signore et al., 1994).

In our generic methodology, the main processes of DATA STRUCTURE EXTRACTION are the following:

• DMS-DDL text ANALYSIS. This rather straightforward process consists in analyzing the data structures declaration statements (in the specific DDL) included in the schema scripts and application programs, and produces a first-cut logical schema (a toy extractor is sketched after this list). Let us mention some common sources: base tables and views (RDBMS), DBD and PSB (IMS), schema and subschemas (CODASYL), and file structures from all the application programs (standard files).

• PROGRAM ANALYSIS. This process is much more complex. It consists in analyzing the other parts of the application programs, among others the procedural sections, in order to detect evidence of additional data structures and integrity constraints. The first-cut schema can therefore be refined following the detection of hidden and non declarative structures.

• DATA ANALYSIS. This refinement process examines the contents of the files and databases in order (1) to detect data structures and properties (e.g., to find the unique fields or the functional dependencies in a file), and (2) to test hypotheses (e.g., "could this field be a foreign key to this file?").

• SCHEMA INTEGRATION. When more than one information source has been processed, the analyst is provided with several, generally different, extracted (and possibly refined) schemas. The final logical schema must include the specifications of all these partial views, through a schema integration process.

The end product of this phase is the complete logical schema. This schema is expressed according to the specific model of the DMS, and still includes possible optimized constructs, hence its name: the DMS-compliant optimized schema, or DMS schema for short.

The current DBRE CARE tools offer only limited DMS-DDL text ANALYSIS functionalities; the analyst is left without help as far as the PROGRAM ANALYSIS, DATA ANALYSIS and SCHEMA INTEGRATION processes are concerned. The DB-MAIN tool is intended to address all these processes and to improve the support that analysts are entitled to expect from CARE tools. The reader will find in Hainaut et al. (1993b) a more detailed description of these processes.
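As a hint of what the first of these processes involves, the following minimal sketch extracts tables, columns, identifiers and foreign keys from SQL DDL text with regular expressions (the declarations echo figure 1). A real DMS-DDL analyzer must of course handle the full grammar of each DMS, multiple dialects and partial declarations.

    import re

    DDL = """
    create table CUSTOMER ( CNUM numeric(6) not null,
      CNAME char(24) not null, CADDRESS char(48) not null,
      primary key (CNUM));
    create table ORDER ( ONUM char(8) not null,
      CNUM numeric(6) not null, ODATE date,
      primary key (ONUM),
      foreign key (CNUM) references CUSTOMER);
    """

    schema = {}
    for table, body in re.findall(r"create\s+table\s+(\w+)\s*\((.*?)\);",
                                  DDL, re.S | re.I):
        schema[table] = {
            "columns": dict(re.findall(
                r"(\w+)\s+(numeric\(\d+\)|char\(\d+\)|date)", body, re.I)),
            "id": re.search(r"primary\s+key\s*\((\w+)\)", body, re.I).group(1),
            "refs": re.findall(
                r"foreign\s+key\s*\((\w+)\)\s*references\s+(\w+)", body, re.I),
        }

    print(schema["ORDER"]["refs"])    # -> [('CNUM', 'CUSTOMER')]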

g.18 HAINAUTETAL. 1995). COBOL files and SQL structures).1 and 2.g. simplicity. This process is borrowed from standard DB design methodologies (Batini et al. CODASYL and TOTAL/IMAGE databases. 1993b). 2. and restructuring some parts of the schema can prove useful before trying to interpret them. the analyst identifies the traces of such translations. in particular. and for rather simple schemas (e. is-a relations are made explicit.. Through this process. The logical schema is searched for traces of constructs designed for optimization purposes. etc. Three main families of optimization techniques should be considered: denormalization. The schema still includes some constructs. For instance Hainaut et al. COBOL.. the data models can share an important subset of translating rules (e. but which can now be discarded. All the other sources are . 1992. 1994. translating names to make them more meaningful (e. foreign keys in IMS and CODAS YL databases). The logical schema is the technical translation of conceptual constructs.OPTIMIZATION. extensibility. • CONCEPTUAL NORMALIZATION. Rauh and Stickel. translation rules considered as specific to a data model are often used in other data models (e. but not for the more complex D^:OPTIMIZATION phsiSQ.2 can be specialized according to a specific DMS and according to specific development standards.g. IMS. minimality. Current CARE tools are limited to parsing DMS-DDL schemas only (DMS-DDL t e x t ANALYSIS).g. Summary of the limits of the state of the art in CARE tools The methodological framework developed in Sections 2.. names are standardized. two facts are worth mentioning. They generally propose elementary rules and heuristics for the SCHEMA UNTRANSLATION process and to some extent for CONCEPTUAL NORMALIZATION. such as files and access keys. expressiveness. First. most often for specific DMS. (1993b) suggests specialized versions of the CONCEPTUALIZATION phdisc for SQL. All the proposals published so far address this phase. those which underlie the current CARE tools (figure 4). This process restructures the basic conceptual schema in order to give it the desired qualities one expects from any final conceptual schema. Secondly. In addition. SCHEMA PREPARATION. with no implementation tricks). readability.. For instance. It is interesting to use this framework as a reference process model against which existing methodologies can be compared... genericity. substitute the file name for the record name). structural redundancy and restructuring (Hainaut et al. Teorey. e. The conclusions can be summarized as follows: • DATA STRUCTURE EXTRACTION.3. The DB-MAIN CARE tool has been designed to address all these processes in a flexible way..g. which may have been useful in the Data Structure Extraction phase. Though each data model can be assigned its own set of translating (and therefore of untranslating) rules. SCHEMA DE. and replaces them by their original conceptual construct. Hence the importance of generic approaches and tools. some entity types are replaced by relationship types or by attributes. SCHEMA UNTRANSLATION.

2.3. Summary of the limits of the state of the art in CARE tools

The methodological framework developed in Sections 2.1 and 2.2 can be specialized according to a specific DMS and according to specific development standards. For instance, Hainaut et al. (1993b) suggests specialized versions of the CONCEPTUALIZATION phase for SQL, COBOL, IMS, CODASYL and TOTAL/IMAGE databases. It is interesting to use this framework as a reference process model against which existing methodologies can be compared, in particular those which underlie the current CARE tools (figure 4). The conclusions can be summarized as follows:

[Figure 4. Simplified DBRE methodology proposed by most current CARE tools: DMS-DDL text ANALYSIS produces the DMS-compliant optimized schema, which SCHEMA UNTRANSLATION and CONCEPTUAL NORMALIZATION turn into the normalized conceptual schema.]

• DATA STRUCTURE EXTRACTION. Current CARE tools are limited to parsing DMS-DDL schemas only (DMS-DDL text ANALYSIS). All the other sources are ignored, and must be processed manually. For instance, these tools are unable to collect the multiple views of a COBOL application, and to integrate them to produce the global COBOL schema. A user of a popular CARE tool tells us "how he spent several weeks, cutting and pasting hundreds of sections of programs, often by hand or with very basic text editing tools, to build an artificial COBOL program in which all the files and records were fully described. Only then was the tool able to extract the file data structures".

• DATA STRUCTURE CONCEPTUALIZATION. Current CARE tools focus mainly on untranslation (SCHEMA UNTRANSLATION) and offer some restructuring facilities (CONCEPTUAL NORMALIZATION), though these processes often are merged. They generally propose elementary rules and heuristics for the SCHEMA UNTRANSLATION process and, to some extent, for CONCEPTUAL NORMALIZATION, but not for the more complex DE-OPTIMIZATION phase. Once again, some strong naming conventions must often be satisfied for the tools to help; for instance, a foreign key and the referenced primary key must have the same names. All performance-oriented constructs, as well as most non standard database structures (see Premerlani and Blaha (1993) and Blaha and Premerlani (1995) for instance), are completely beyond the scope of these tools.

The DB-MAIN CARE tool has been designed to address all these processes in a flexible way.

3. Requirements for a DBRE CARE tool

This section states some of the most important requirements an ideal DBRE support environment (or CARE tool) should meet. These requirements are induced by the analysis of the specific characteristics of the DBRE processes; they also derive from our experience of reverse engineering the files and databases of a dozen actual applications.

Flexibility

Observation. The very nature of the RE activities differs from that of more standard engineering activities. Reverse engineering a software component, and particularly a database, is basically an exploratory and often unstructured activity.

etc. and not deterministically inferred from the operational ones. It should be methodology-neutral^ unlike forward engineering tools. database RE requires browsing through huge amounts of text. workflow and dataflow analysis.). each RE project often is a new problem of its own. In addition. Name processing Observation. program execution. databases. programming cliches (Selfridge et al. Requirements. Specific functions should be easy to develop. due to the use of strict naming conventions. . screen layout. it must be highly interactive. even for one-shot use. is basically an exploratory and often unstructured activity. 1993)). Requirements. 1-087). including unstructured ones. INV-QTY. The tool must include browsing and querying interfaces with these sources. program output. Requirements.g. RE requires a great variety of information sources: data structure. Extensibility Observation.. C-DATA). Source multiplicity Observation. paper or computer-based documentation.g.. or at least less informative than expected (e. REC-001-R08. these names often happen to be meaningless (e. Object names in the operational code are an important knowledge source. extracting program slices (Weizer. The tool must allow the user to follow flexible working patterns. The latter should be language independent. Text analysis Observation. Customizable functions for automatic and assisted specification extraction should be available for each of them. and tightly coupled with the specification processing functions.. In addition. data (from files. spreadsheets. following static execution paths and dataflows. interview. requiring specific reasonings and techniques.20 HAINAUTETAL. Some important aspects of higher level specifications are discovered (sometimes by chance). More particularly. 1984). QOH. multi-programmer development often induces non consistent naming conventions. searching them for specific patterns (e. etc. Requirements. so that data names may be expressed in several languages. Many applications are multilingual. CASE repository and Data dictionary contents.. RE appears as a learning process. easy to customize and to program. domain knowledge. Frustratingly enough. The CARE tool must provide sophisticated text analysis processors. program text.g.

Name processing

Observation. Object names in the operational code are an important knowledge source. Frustratingly enough, these names often happen to be meaningless (e.g., REC-001-R08, I-087), due to the use of strict naming conventions, or at least less informative than expected (e.g., QOH, INV-QTY, C-DATA). In addition, multi-programmer development often induces non consistent naming conventions. Many applications are multilingual, so that data names may be expressed in several languages.

Requirements. The tool must include sophisticated name analysis and processing functions; these should be language independent (a small illustrative sketch follows this series of requirements).

Links with other CASE processes

Observation. RE is seldom an independent activity. For instance, (1) forward engineering projects frequently include reverse engineering of some existing components; (2) reverse engineering shares important processes with forward engineering (e.g., conceptual normalization); (3) reverse engineering is a major activity in broader processes such as migration, reengineering and data administration.

Requirements. A CARE tool must offer a large set of functions, including those which pertain to forward engineering.

Openness

Observation. There is (and probably will be) no available tool that can satisfy all corporate needs in application engineering. In addition, companies usually already make use of one or, most often, several CASE tools, software development environments, DBMS, 4GL or DDS.

Requirements. A CARE tool must communicate easily with the other development tools (e.g., via integration hooks, communications with a common repository or by exchanging specifications through a common format).

Flexible specification model

Observation. As in any CAD activity, RE applies on incomplete and inconsistent specifications. However, one of its characteristics makes it intrinsically different from design processes: at any time, the current specifications may include components from different abstraction levels. For instance, a schema in process can include record types (physical objects) as well as entity types (conceptual objects).

Requirements. The specification model must be wide-spectrum, and provide artifacts for components of different abstraction levels.

Genericity

Observation. Many RE reasonings and techniques are common to the different data models used by current applications. Tricks and implementation techniques specific to some data models have been found to be used in other data models as well (e.g., foreign keys are frequent in IMS and CODASYL databases).

Requirements. The specification model and the basic techniques offered by the tool must be DMS-independent, and therefore highly generic.
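Returning to the name processing requirement: even modest tokenizing plus abbreviation expansion helps. In the toy sketch below, the abbreviation dictionary is invented for the example; in practice it would be project-specific and multilingual.

    import re

    # Split composite data names, drop numeric noise, expand abbreviations.
    ABBREV = {"QTY": "QUANTITY", "INV": "INVENTORY", "CPY": "COMPANY",
              "NBR": "NUMBER", "ACC": "ACCOUNT"}

    def interpret(name):
        """Turn a physical name such as INV-QTY into candidate words."""
        parts = re.split(r"[-_]", name.upper())
        parts = [p for p in parts if not p.isdigit()]   # drop counters
        return [ABBREV.get(p, p) for p in parts]

    for n in ("INV-QTY", "CPY-ID", "REC-001-R08"):
        print(n, "->", interpret(n))
    # INV-QTY -> ['INVENTORY', 'QUANTITY']
    # CPY-ID -> ['COMPANY', 'ID']
    # REC-001-R08 -> ['REC', 'R08']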

Multiplicity of views

Observation. The specifications, whatever their abstraction level (e.g., physical, logical or conceptual), are most often huge and complex, and need to be examined and browsed through in several ways, according to the nature of the information one tries to obtain.

Requirements. The CARE tool must provide several ways of viewing both source texts and abstract structures (e.g., schemas). Multiple textual and graphical views, summary and fine-grained presentations must be available.

Rich transformation toolset

Observation. Actual database schemas may include constructs intended to represent conceptual structures and constraints in non standard ways, and to satisfy non functional requirements (performance, distribution, modularity, access control, etc.). These constructs are obtained through schema restructuration techniques.

Requirements. The CARE tool must provide a rich set of schema transformation techniques. In particular, this set must include operators which can undo the transformations commonly used in practical database designs.

Traceability

Observation. A DBRE project includes at least three sets of documents: the operational descriptions (e.g., DDL texts, source program texts), the logical schema (DMS-compliant) and the conceptual schema (DMS-independent). The forward and backward mappings between these specifications must be precisely recorded. The forward mapping specifies how each conceptual (or logical) construct has been implemented into the operational (or logical) specifications, while the backward mapping indicates of which conceptual (or logical) construct each operational (or logical) construct is an implementation.

Requirements. The repository of the CARE tool must record all the links between the schemas at the different levels of abstraction. More generally, the tool must ensure the traceability of the RE processes.

4. The DB-MAIN CASE tool

The DB-MAIN database engineering environment is a result of an R&D project initiated in 1993 by the DB research unit of the Institute of Informatics, University of Namur. This tool is dedicated to database applications engineering, and its scope encompasses, but is much broader than, reverse engineering alone. In particular, its ultimate objective is to assist developers in database design (including full control of logical and physical processes), database reverse engineering, database application reengineering, maintenance, migration and evolution. Further detail on the whole approach can be found in Hainaut et al. (1994).

As a wide-scope CASE tool, DB-MAIN includes the usual functions needed in database analysis and design, e.g., entry, browsing, management, validation and transformation of specifications, as well as code and report generation. However, as far as DBRE support is concerned, the tool has been designed to address as much as possible the requirements developed in the previous section. The rest of this paper, namely Sections 5 to 11, will concentrate only on the main aspects and components of the tool which are directly related to DBRE activities. In Section 5 we describe the way schemas and other specifications are represented in the repository. The tool is based on a general purpose transformational approach which is described in Section 6. Viewing the specifications from different angles and in different formats is discussed in Section 7. In Section 8, various tools dedicated to text and name processing and analysis are described. Section 9 presents some expert modules, called assistants, which help the analyst in complex processing and analysis tasks. DB-MAIN is an extensible tool which allows its users to build new functions through the Voyager-2 tool development language (Section 10). Section 11 evokes some aspects of the tool dedicated to methodological customization, control and guidance. Finally, in Section 12, we will reexamine the requirements described in Section 3, and evaluate to what extent the DB-MAIN tool meets them.

5. The DB-MAIN specification model and repository

The repository collects and maintains all the information related to a project. It comprises three classes of information:

— a structured collection of schemas and texts used and produced in the project;
— the specification of the methodology followed to conduct the project;
— the history (or trace) of the project.

We will ignore the two latter classes, which are related to methodological control and which will be evoked briefly in Section 11. A project usually comprises many (i.e., dozens to hundreds of) schemas. A schema is a description of the data structures to be processed, while a text is any textual material generated or analyzed during the project (e.g., a program or an SQL script). The schemas of a project are linked through specific relationships, which pertain to the methodological control aspects of the DB-MAIN approach and will be ignored in this paper. We will limit the presentation to the data aspects only; though they have strong links with the data structures in DBRE, the specification of the other aspects of the applications, e.g., processing, will be ignored in this section.

A schema is made up of specification constructs which can be classified into the usual three abstraction levels. The DB-MAIN specification model includes the following concepts (Hainaut et al., 1992a).

The conceptual constructs are intended to describe abstract, machine-independent, semantic structures. They include the notions of entity types (with/without attributes, with/without identifiers), of super/subtype hierarchies (single/multiple inheritance, total and disjoint properties), and of relationship types (binary/N-ary, cyclic/acyclic), whose roles are characterized by min-max cardinalities and optional names; a role can be defined

on one or several entity types. Attributes can be associated with entity and relationship types; they can be single-valued or multivalued, atomic or compound. Identifiers (or keys), made up of attributes and/or roles, can be associated with an entity type, a relationship type and a multivalued attribute. Various constraints can be defined on these objects: inclusion, exclusion, coexistence, at-least-one, etc.

The logical constructs are used to describe schemas compliant with DMS models, such as relational, CODASYL or IMS schemas. They comprise, among others, the concepts of record types (or table, segment, etc.), fields (or columns), referential constraints and redundancy, including set, bag and list multivalued attributes, and single-valued and multivalued foreign keys.

The physical constructs describe implementation aspects of the data which are related to such criteria as the performance of the database. They make it possible to specify files, access keys (index, calc key, etc.), physical data types, and other implementation details.

In database engineering, a schema describes a fragment of the data structures at a given level of abstraction. In reverse engineering, as discussed in Section 2, an in progress schema may even include constructs at different levels of abstraction. For instance, figure 5 presents a schema which includes conceptualized objects (PRODUCT, CUSTOMER, ACCOUNT, a relationship type and a multivalued attribute), logical objects (record type ORDER, with single-valued and multivalued foreign keys) and physical objects (access keys ORDER.ORD-ID and ORDER.ORIGIN, file DSK:MGT-03). Ultimately, this schema will be completely conceptualized through the interpretation of the logical and physical objects.

[Figure 5. A typical data structure schema during reverse engineering.]

Besides these concepts, annotations can be associated with each object. These annotations can include semi-formal properties, made of the property name and its value, which can be interpreted by Voyager-2 functions (see Section 10). These features provide dynamic extensibility of the repository. For instance, the gender and plural of the object names can be represented by semi-formal attributes, while statistics about entity populations can be recorded in the same way. In addition, the repository includes some generic objects which can be customized according to specific needs: new concepts such as organizational units, servers, or geographic sites can be represented by specializing the generic objects.
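The flavour of this wide-spectrum model can be suggested in a few lines of illustrative code; this is an invented miniature, not the actual DB-MAIN repository structure.

    from dataclasses import dataclass, field

    @dataclass
    class SchemaObject:
        name: str
        level: str = "logical"      # "conceptual" | "logical" | "physical"
        # semi-formal properties: free name/value pairs interpretable by tools
        annotations: dict = field(default_factory=dict)

    @dataclass
    class EntityType(SchemaObject):
        attributes: list = field(default_factory=list)
        identifiers: list = field(default_factory=list)  # attribute-name lists

    order = EntityType("ORDER", "logical",
                       attributes=["ORD-ID", "DATE", "ORIGIN", "DETAIL"],
                       identifiers=[["ORD-ID"]])
    order.annotations["plural"] = "ORDERS"  # dynamic extension, no schema change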

Rosenthal and Reiner. producing an SQL database or COBOL files. Schema transformations are essential to define formally forward and backward mappings between schemas. 1992. 1988. According to Fikas (1985) for instance. Quite naturally.1991.. 1994). 1993. Producing a database schema from another schema can be carried out through selected transformations.2). and replacing a relationship type by an equivalent entity type. the process of developing a program [can be] formalized as a set of transformations. To illustrate this concept. This approach has been put forward in database engineering by an increasing number of authors since several years. i. 1995. Moreover. or reverse engineering standard files and CODAS YL databases can be described mostly as sequences of schema transformations. 1992). in several CASE tools (Hainaut et al. its conceptualization (Section 2. traceability). some authors claim that the whole database design process. either in research papers. Joris et al. we will outline informally three of the most popular transformation techniques. Batini et al. 1981.e. 1980. then both schemas will be equivalent by construction. 1987. are three examples of schema transformations. Hainaut et al. Roughly speaking. 1996). Such a transformation ensures that the source and target schemas have the same semantic descriptive power. 1995. called mutations (type changing) used in database design. normalizing a schema. which provides import-export facilities between DB-MAIN and its environment. Some authors propose schema transformations for selected design activities (Navathe...DATABASE REVERSE ENGINEERING 25 be expressed as a pure text file through the ISL language. or in text books and.. a schema transformation consists in deriving a target schema 5' from a source schema S by some kind of local or global modification. A special class of transformations is of particular importance. The transformation toolkit The desirability of the transformational approach to software engineering is now widely recognized. 1995). If we can produce a relational schema from a conceptual schema by applying reversible transformations only. Rauh and Stickel. In other words any situation of the application domain that can be modelled by an instance of one schema can be described by an instance of the other. schema transformations have found their way into DBRE as well (Hainaut et al. 1986. 1994.. together with other related activities.1993b. As a . 1993b. deleting a relationship type. also called reversible since each of them is associated with another semantics-preserving transformation called its inverse.. The transformational approach is the cornerstone ofthe DB-MAIN approach (Hainaut. more recently. if the interpretation of a relational schema.. 6. 1992. Halpin and Proper. 1994)andCASE tool (Hainaut et al. 1996). 1992. and no semantics will be lost in the translation process. Kozaczynsky. 1993b). A formal presentation of this concept can be found in Hainaut (1991. the resulting conceptual schema will be semantically equivalent to the source schema. 1994). Conversely. namely the semanticspreserving transformations.. Rosenthal and Reiner. For instance. Adding an attribute to an entity type. and particularly between conceptual structures and DMS constructs (i. optimizing a schema. Rosenthal and Reiner. can be performed by using reversible transformations. can be described as a chain of schema transformations (Batini et al.e. 
An in-depth discussion of the concept of specification preservation can be found in Hainaut (1995). To illustrate this concept, we will outline informally three of the most popular transformation techniques.

To simplify the presentation, each transformation and its inverse are described in one figure, in which the direct transformation is read from left to right, and its inverse from right to left.

Figure 6 shows graphically how a relationship type can be replaced by an equivalent entity type, and conversely. The technique can be extended to N-ary relationship types.

Figure 6. Transforming a relationship type into an entity type. (Only the caption of the original diagram is recoverable.)

Another widely used transformation replaces a binary relationship type by a foreign key (figure 7), and conversely. Relationship type R is represented by foreign key B1, which can be either multivalued (J > 1) or single-valued (J = 1).

Figure 7. Relationship type R is represented by foreign key B1. (Only the caption of the original diagram is recoverable.)

The third standard technique transforms an attribute into an entity type, and conversely. It comes in two flavours, namely instance representation (figure 8a), in which each instance of attribute A2 in each A entity is represented by an EA2 entity, and value representation (figure 8b), in which each distinct value of A2, whatever the number of its instances, is represented by one EA2 entity.

Figure 8. Transformation of an attribute into an entity type: (a) by explicit representation of its instances; (b) by explicit representation of its distinct values. (Only the caption of the original diagram is recoverable.)

As a consequence of their reversibility, these transformations, typically applied from left to right in database design, will be used through their inverses in reverse engineering. DB-MAIN proposes a three-level transformation toolset that can be used freely, according to the skill of the user and the complexity of the problem to be solved. These tools are neutral and generic, in that they can be used in any database engineering process. As far as DBRE is concerned, they are mainly used in Data Structure Conceptualization (Section 2.2).
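The following Python sketch illustrates the figure 7 mutation and its inverse on a toy schema representation of our own devising; it is not DB-MAIN code, and the dictionary-based schema model is an assumption made purely for illustration. The round trip at the end shows the sense in which the pair is reversible.

    # Illustrative sketch (not DB-MAIN code) of the figure 7 mutation and its
    # inverse: a one-to-many relationship type replaced by a foreign key.
    # Entity types are modelled as simple dictionaries.

    def rel_type_to_foreign_key(schema, rel):
        """Direct transformation: represent the one-to-many relationship type
        `rel` by a foreign key on its 'many' side, referencing the identifier
        of its 'one' side."""
        one, many = rel["one"], rel["many"]
        fk = ("REF-" + one, one)            # (column name, referenced entity type)
        schema["entity_types"][many]["foreign_keys"].append(fk)
        schema["rel_types"].remove(rel)
        return fk

    def foreign_key_to_rel_type(schema, many, fk):
        """Inverse transformation, applied during conceptualization: reinterpret
        a foreign key as an explicit one-to-many relationship type."""
        schema["entity_types"][many]["foreign_keys"].remove(fk)
        schema["rel_types"].append({"one": fk[1], "many": many})

    # Round trip: applying the inverse after the direct transformation restores
    # the source schema, which is what semantics preservation guarantees.
    s = {"entity_types": {"A": {"foreign_keys": []}, "B": {"foreign_keys": []}},
         "rel_types": [{"one": "A", "many": "B"}]}
    fk = rel_type_to_foreign_key(s, s["rel_types"][0])
    foreign_key_to_rel_type(s, "B", fk)
    assert s["rel_types"] == [{"one": "A", "many": "B"}]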

• Elementary transformations. The selected transformation is applied to the selected object:

    apply transformation T to current object O

With these tools, the user keeps full control of the schema transformation. However, similar situations can often be solved by different transformations; a multivalued attribute, for instance, can be transformed in a dozen ways. The current version of DB-MAIN proposes a toolset of about 25 elementary transformations. Figure 9 illustrates the dialog box for the Split/Merge of an entity type.

Figure 9. The dialog box of the Split/Merge transformation, through which the analyst can either extract some components from the master entity type (left), merge two entity types, or migrate components from an entity type to another. (Only the caption of the original screen shot is recoverable.)

• Global transformations. A selected elementary transformation is applied to all the objects of a schema which satisfy a specified precondition:

    apply transformation T to the objects that satisfy condition P

DB-MAIN offers some predefined global transformations, such as: replace all one-to-many relationship types by foreign keys, or replace all multivalued attributes by entity types. In addition, the analyst can define his or her own toolset through the Transformation Assistant described in Section 9.
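The global transformation mechanism can be pictured as a simple driver that filters schema objects through a predicate before applying an elementary transformation. The following sketch is ours, not DB-MAIN's implementation; the helper names in the commented example are assumptions.

    # Illustrative driver for a global transformation:
    # "apply transformation T to the objects that satisfy condition P".

    def apply_global(schema_objects, precondition, transformation):
        """Apply `transformation` to every object satisfying `precondition`.
        Returns the list of transformed objects for inspection."""
        touched = []
        for obj in list(schema_objects):    # copy: T may modify the collection
            if precondition(obj):
                transformation(obj)
                touched.append(obj)
        return touched

    # Example: replace all multivalued attributes by entity types.
    # `is_multivalued` and `attribute_to_entity_type` are assumed helpers:
    # apply_global(schema.attributes, is_multivalued, attribute_to_entity_type)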

• Model-driven transformations. All the constructs of a schema that violate a given model M are transformed in such a way that the resulting schema complies with M:

    apply the transformation plan which makes the current schema satisfy model M

Such an operator is defined by a transformation plan, which is a sort of algorithm comprising global transformations, and which is proved (or assumed) to make any schema comply with M. A model-driven transformation implements formal techniques or heuristics applicable in such major engineering processes as normalization, model translation or untranslation, and conceptualization. DB-MAIN offers a dozen predefined model-based transformations, such as relational, CODASYL, and COBOL translation, untranslation from these models, and conceptual normalization. Here too, the analyst can define his or her own transformation plans, either through the scripting facilities of the Transformation Assistant, or, for more complex problems, through the development of Voyager-2 functions (Section 10). A more detailed discussion of these three transformation modes can be found in Hainaut et al. (1992) and Hainaut (1995).
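A transformation plan can be pictured as a loop that repeatedly removes violations of the target model until the schema complies. The sketch below is a minimal illustration under that reading; the detector/fix pairing is our own formulation, not DB-MAIN's plan language.

    # Illustrative sketch of a model-driven transformation plan: repeatedly
    # rewrite whatever violates the target model M until the schema complies.

    def make_compliant(schema, model_predicates, fixes, max_rounds=100):
        """`model_predicates` maps a violation name to a detector; `fixes` maps
        the same name to a global transformation that removes the violation."""
        for _ in range(max_rounds):
            violations = [name for name, detect in model_predicates.items()
                          if detect(schema)]
            if not violations:
                return True                 # schema now satisfies model M
            for name in violations:
                fixes[name](schema)
        return False                        # plan failed to converge

    # A relational translation plan would, e.g., pair the detector
    # "has one-to-many rel-types" with the fix "replace them by foreign keys".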

7. The user interface

The user interaction uses a fairly standard GUI. However, DB-MAIN offers some original options which deserve being discussed. Browsing through, processing, and analyzing large schemas require an adequate presentation of the specifications, and it quickly appears that more than one way of viewing them is necessary. For instance, a graphical representation of a schema allows an easy detection of certain structural patterns, but is useless when analyzing name structures to detect similarities, as is common in the DATA STRUCTURE EXTRACTION process (Section 2.1). DB-MAIN currently offers six ways of presenting a schema; all of them are illustrated in figure 10. Two views are graphical: full and compact (no attributes and no constraints). Four views are textual and use a hypertext technique: compact (sorted list of entity type, relationship type and file names), standard (same + attributes), extended (same + domains, annotations, roles and constraints), and sorted (sorted list of all the object names, with ET-RT cross-references). In addition, the text-based views make it possible to navigate from entity types to relationship types and conversely through hypertext links. Switching from one view to another is immediate, and any object selected in a view is still current when another view is chosen.

Figure 10. Six different views of the same schema. (Only the caption of the original screen shot is recoverable.)

8. Text analysis and processing

Analyzing and processing various kinds of texts are basic activities in two specific processes, namely DMS-DDL TEXT ANALYSIS and PROGRAM ANALYSIS. The first process is rather simple, and can be carried out by automatic extractors which analyze the data structure declaration statements of programs and build the corresponding abstract objects in the repository. DB-MAIN currently offers built-in standard parsers for COBOL, SQL, CODASYL, IMS, and RPG, but other parsers can be developed in Voyager-2 (Section 10). To address the requirements of the second process, through which the preliminary specifications are refined from evidence found in programs or in other textual sources, DB-MAIN includes a collection of program analysis tools comprising, at the present time, an interactive pattern-matching engine, a dataflow diagram inspector and a program slicer. The main objective of these tools is to contribute to program understanding as far as data manipulation is concerned.

The pattern-matching function allows searching text files for definite patterns or cliches expressed in PDL, a Pattern Definition Language. As an illustration, we will describe one of the most popular heuristics to detect an implicit foreign key in a relational schema. It consists in searching the application programs for some forms of SQL queries which evoke the presence of an undeclared foreign key (Signore et al., 1994; Andersson, 1994; Petit et al., 1994). The principle is simple: most multi-table queries use primary/foreign key joins. For instance, considering that column CNUM has been recognized as a candidate key of table CUSTOMER, the following query suggests that column CUST in table ORDER may be a foreign key to CUSTOMER:

    select CNAME, DATE
    from ORDER, CUSTOMER
    where ORDER.CUST = CUSTOMER.CNUM
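A minimal sketch of this heuristic, assuming a regular expression as a stand-in for the PDL pattern, might look as follows in Python. The candidate-key set and the function name are ours; the output direction follows the reasoning above.

    # Illustrative sketch (ours, not DB-MAIN's engine) of the join heuristic:
    # scan SQL text for equi-joins whose one side is a known candidate key.
    import re

    # Known candidate keys: (table, column) pairs recognized so far.
    CANDIDATE_KEYS = {("CUSTOMER", "CNUM")}

    JOIN = re.compile(r"where\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)

    def suggest_foreign_keys(sql_text):
        """Yield (referencing table, column, referenced table) suggestions."""
        for t1, c1, t2, c2 in JOIN.findall(sql_text):
            if (t2.upper(), c2.upper()) in CANDIDATE_KEYS:
                yield (t1.upper(), c1.upper(), t2.upper())
            if (t1.upper(), c1.upper()) in CANDIDATE_KEYS:
                yield (t2.upper(), c2.upper(), t1.upper())

    query = ("select CNAME, DATE from ORDER, CUSTOMER "
             "where ORDER.CUST = CUSTOMER.CNUM")
    print(list(suggest_foreign_keys(query)))
    # -> [('ORDER', 'CUST', 'CUSTOMER')]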

The pattern engine can analyze external source files, as well as textual descriptions stored in the repository (where, at the present time, the extractors store the statements they do not understand, such as comments, SQL triggers and checks). These texts can be searched for visual inspection only, but pattern instantiation can also trigger DB-MAIN actions, e.g., to automatically update the repository.

More generally, any SQL expression that looks like:

    select ... from ... T1 ... T2 ... where ... T1.C1 = T2.C2 ...

may suggest that C1 is a foreign key to table T2, or C2 a foreign key to T1. Of course, this evidence would be even stronger if we could prove that C1 (resp. C2) is a key of its table. This is just what figure 11 translates more formally in PDL. In these patterns, the symbols "+", "*", "|" and "a-z" have their usual grep or BNF meaning; "_" designates any non-empty separator and "-" any separator, built from "/n" (newline), "/t" (tab) and " "; "AN-name" denotes any alphanumeric string beginning with a letter; and the "any-but(E)" function identifies any string not including expression E. A pattern can include variables, the name of which is prefixed with @.

Figure 11. Generic and specific patterns for foreign key detection in SQL queries. (The original figure comprises two panels: "The SQL generic patterns", defining T1, T2, C1, C2, the select/from/where keywords and the join qualification @T1"."@C1 _ "=" _ @T2"."@C2; and "The COBOL/DB2 specific patterns", defining AN-name, table-name, column-name, the separators, and the begin-SQL/end-SQL brackets EXEC SQL ... END-EXEC.)

This example illustrates two essential features of PDL and of its engine.

1. A set of patterns can be split into two parts, stored in different files. In this example, the generic patterns define the skeleton of an SQL query, which is valid for any RDBMS and any host language, while the specific patterns complement this skeleton by defining the COBOL/DB2 API conventions. When a generic pattern file is opened, the unresolved patterns are to be found in the specified specific pattern file. Replacing the latter will allow processing, e.g., C/ORACLE programs.

2. When such a pattern is instantiated in a text, the variables are given a value which can be used, for instance, to automatically update the repository. If a procedure such as that presented in figure 13 (creation of a referential constraint between column C2 and table T1) is associated with this pattern, this procedure can be executed automatically (under the analyst's control) for each instantiation of the pattern.
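The trigger mechanism can be sketched as follows in Python, with a regular expression standing in for the compiled PDL pattern and a callback standing in for the associated procedure. The names and the simplified COBOL/DB2 bracket handling are assumptions for illustration only.

    # Illustrative sketch of pattern instantiation triggering a repository
    # action. The PDL engine itself is not reproduced; a regex stands in.
    import re

    SQL_JOIN = re.compile(
        r"EXEC\s+SQL.*?where\s+(?P<T1>\w+)\.(?P<C1>\w+)\s*=\s*"
        r"(?P<T2>\w+)\.(?P<C2>\w+).*?END-EXEC",
        re.IGNORECASE | re.DOTALL)

    def on_instantiation(match, action):
        """Pass the values bound to the pattern variables to an action,
        mimicking the way @T1, @C1, @T2, @C2 are handed to a procedure."""
        v = match.groupdict()
        return action(v["T1"], v["C1"], v["T2"], v["C2"])

    def make_foreign_key(t1, c1, t2, c2):
        # Placeholder for the repository update performed in figure 13.
        print("candidate foreign key:", t1 + "." + c1, "->", t2 + "." + c2)
        return True

    cobol = ("EXEC SQL select CNAME into :X from ORDER, CUSTOMER "
             "where ORDER.CUST = CUSTOMER.CNUM END-EXEC.")
    for m in SQL_JOIN.finditer(cobol):
        on_instantiation(m, make_foreign_key)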

In this way, the analyst can build a powerful custom tool which detects foreign keys in queries and which adds them to the schema automatically.

DB-MAIN also includes a name processor which can transform selected names in a schema, or in selected objects of a schema, according to substitution patterns. Here are some examples of such patterns:

    "^C-" -> "CUST-"          replaces all prefixes "C-" with the prefix "CUST-";
    " DATE" -> " TIME"        replaces each substring " DATE", whatever its
                              position, with the substring " TIME";
    "^CODE$" -> "REFERENCE"   replaces all the names "CODE" with the new
                              name "REFERENCE".

In addition, it proposes case transformations: lower-to-upper, upper-to-lower, capitalize and remove accents. These parameters can be saved as a name processing script, and reused later.

The dataflow inspector builds a graph whose nodes are the variables of the program to be analyzed, and whose edges are relationships between these variables. These relationships are defined by selected PDL patterns. For instance, the following COBOL rules can be used to build a graph in which two nodes are linked if their corresponding variables appear simultaneously in a simple assignment statement, in a redefinition declaration, in an indirect write statement or in comparisons:

    var_1     ::= cob_var
    var_2     ::= cob_var
    move      ::= "MOVE" _ @var_1 _ "TO" _ @var_2
    redefines ::= @var_1 _ "REDEFINES" _ @var_2
    write     ::= "WRITE" _ @var_1 _ "FROM" _ @var_2
    if        ::= "IF" _ @var_1 _ rel_op _ @var_2
    if_not    ::= "IF" _ @var_1 _ "NOT" _ rel_op _ @var_2

This tool can be used to solve structure hiding problems such as the decomposition of anonymous fields and procedurally controlled foreign keys, as illustrated in figure 2.

The first experiments have quickly taught us that pattern-matching and dataflow inspection work fine for small programs and for locally concentrated patterns, but can prove difficult to use for large programs: for instance, a pattern made of a dozen statements can span several thousand lines of code. With this problem in mind, we have developed a variant of program slicer (Weiser, 1984). Let us consider a point S in P (generally a statement) and an object O (generally a variable) of P. The program slice of P for O at S is the smallest subset P' of P whose execution will give O the same state at S as would the execution of P in the same environment. Our tool, given a program P, generates this new program P'. Generally P' is a very small fragment of P, and can be inspected much more efficiently and reliably than its source program P, both visually and with the help of the analysis tools described above. One application in which this program slicer has proved particularly valuable is the analysis of the statements contributing to the state of a record when it is written in its file.
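Returning to the dataflow inspector, the graph construction it performs can be sketched as follows in Python, with regular expressions standing in for the PDL rules listed above. This is a minimal illustration under those assumptions, not the actual tool.

    # Illustrative sketch of the dataflow inspector: build a variable graph
    # from COBOL statements using the assignment/redefinition/write/compare
    # rules above (regexes stand in for the PDL patterns).
    import re
    from collections import defaultdict

    RULES = [
        re.compile(r"MOVE\s+([\w-]+)\s+TO\s+([\w-]+)"),
        re.compile(r"([\w-]+)\s+REDEFINES\s+([\w-]+)"),
        re.compile(r"WRITE\s+([\w-]+)\s+FROM\s+([\w-]+)"),
        re.compile(r"IF\s+([\w-]+)\s+(?:NOT\s+)?[=<>]+\s+([\w-]+)"),
    ]

    def dataflow_graph(cobol_lines):
        graph = defaultdict(set)
        for line in cobol_lines:
            for rule in RULES:
                for v1, v2 in rule.findall(line):
                    graph[v1].add(v2)
                    graph[v2].add(v1)   # undirected: the variables are related
        return graph

    g = dataflow_graph(["MOVE CUST-ID TO W-ID.", "WRITE ORD-REC FROM W-REC."])
    # g now links CUST-ID with W-ID and ORD-REC with W-REC.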

These processors offer a collection of built-in functions that can be enriched by user-defined functions developed in Voyager-2 (Section 10).

9. The assistants

An assistant is a higher-level solver dedicated to coping with a special kind of problem, or to performing specific activities efficiently. The current version of DB-MAIN includes three general purpose assistants which can support, among others, the DBRE activities, namely the Transformation assistant, the Schema Analysis assistant and the Text Analysis assistant.

The Transformation assistant (figure 12) allows applying one or several transformations to selected objects. The left-side area of its control panel is the problem solver, which presents a catalog of problems (1st column) and suggested solutions (2nd column). Each operation appears as a problem/solution couple, in which the problem is defined by a pre-condition (e.g., the objects are the many-to-many relationship types of the current schema), and the solution is an action resulting in eliminating the problem (e.g., transform them into entity types). Several dozen problem/solution items are proposed. The right-side area is the script manager: the analyst can select an operation and execute it directly or, alternatively, build a script comprising a list of operations, save and load it, and execute it automatically or in a controlled way. Predefined scripts are available to transform any schema according to popular models (e.g., Bachman model, binary model, relational, CODASYL, standard files), or to perform standard engineering processes (e.g., conceptualization of relational and COBOL schemas, normalization). Customized operations can be added via Voyager-2 functions (Section 10).

Figure 12. Control panel of the Transformation assistant. The worksheet shows a simplified script for conceptualizing relational databases. (Only the caption of the original screen shot is recoverable.)

A second generation of the Transformation assistant is under development. It provides a more flexible approach to building complex transformation plans, thanks to a catalog of more than 200 preconditions, a library of about 50 actions, and more powerful scripting control structures including loops and if-then-else patterns.
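A Transformation assistant script can be pictured as an ordered list of problem/solution couples. The sketch below is our own rendering of that idea; the object interface (kind, is_many_to_many, etc.) and the ask_analyst stub are hypothetical.

    # Illustrative sketch of an assistant script: an ordered list of
    # problem/solution couples executed against the current schema.

    def ask_analyst(obj):
        return True     # stub standing in for the interactive confirmation

    script = [
        # (precondition on an object,        action removing the problem)
        (lambda o: o.kind == "rel-type" and o.is_many_to_many(),
         lambda o: o.schema.rel_type_to_entity_type(o)),
        (lambda o: o.kind == "attribute" and o.is_multivalued(),
         lambda o: o.schema.attribute_to_entity_type(o)),
    ]

    def run_script(schema, script, controlled=False):
        for precondition, action in script:
            for obj in [o for o in schema.objects if precondition(o)]:
                if controlled and not ask_analyst(obj):
                    continue                # controlled mode: analyst decides
                action(obj)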

The Schema Analysis assistant is dedicated to the structural analysis of schemas. It uses the concept of submodel, defined as a restriction of the generic specification model described in Section 5 (Hainaut et al., 1992). This restriction is expressed by a boolean expression of elementary predicates stating which specification patterns are valid, and which ones are forbidden. An elementary predicate can specify situations such as the following: "entity types must have from 1 to 100 attributes", "relationship types have from 2 to 2 roles", "entity type names are less than 18-character long", "names do not include spaces", "no name belongs to a given list of reserved words", "entity types have from 0 to 1 supertype", "there are no access keys", "the schema is hierarchical", etc. A submodel appears as a script which can be saved and loaded. Predefined submodels are available: Normalized ER, Binary ER, Functional ER, NIAM, Bachman, Relational, CODASYL, etc. Customized predicates can be added via Voyager-2 functions (Section 10). The Schema Analysis assistant offers two functions, namely Check and Search. Checking a schema consists in detecting all the constructs which violate the selected submodel, while the Search function detects all the constructs which comply with the selected submodel.
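The Check/Search pair can be sketched as the two complementary filters over one set of predicates. The following Python fragment is an illustration under assumed object attributes (kind, roles, name), not the assistant's actual predicate language.

    # Illustrative sketch of a submodel as a conjunction of elementary
    # predicates, with the Check and Search functions over a schema.

    PREDICATES = {
        "rel-types have from 2 to 2 roles":
            lambda o: o.kind != "rel-type" or len(o.roles) == 2,
        "entity type names are less than 18-character long":
            lambda o: o.kind != "entity-type" or len(o.name) <= 18,
        "names do not include spaces":
            lambda o: " " not in o.name,
    }

    def check(schema, submodel):
        """Constructs violating the submodel (the Check function)."""
        return [o for o in schema.objects
                if not all(PREDICATES[p](o) for p in submodel)]

    def search(schema, submodel):
        """Constructs complying with the submodel (the Search function)."""
        return [o for o in schema.objects
                if all(PREDICATES[p](o) for p in submodel)]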

The Text Analysis assistant presents in an integrated package all the tools dedicated to text analysis. In addition, it manages the active links between the source texts and the abstract objects in the repository.

10. Functional extensibility

DB-MAIN provides a set of built-in standard functions that should be sufficient to satisfy most basic needs in database engineering. However, no CASE tool can meet the requirements of all users in any possible situation, and specialized operators may be needed to deal with unforeseen or marginal situations. There are two important domains in which users require customized extensions, namely additional internal functions and interfaces with other tools. Without such extensibility, analyzing and generating texts in any language and according to any dialect, or importing and exchanging specifications with any CASE tool or Data Dictionary System, are practically impossible, even with highly parametric import/export processors. To cope with such problems, DB-MAIN provides the Voyager-2 tool development environment, allowing analysts to build their own functions, whatever their complexity. Voyager-2 offers a powerful language in which specific processors can be developed and integrated into DB-MAIN. Basically, Voyager-2 is a procedural language which proposes primitives to access and modify the repository through predicative or navigational queries, and to invoke all the basic functions of DB-MAIN. It provides a powerful list manager as well as functions to parse and generate complex text files. A user's tool developed in Voyager-2 is a program comprising possible recursive procedures and functions. Once compiled, it can be invoked by DB-MAIN just like any basic function.

Figure 13 presents a small but powerful Voyager-2 function which validates and creates a referential constraint with the arguments extracted from a COBOL/SQL program by the pattern defined in figure 11. When such a pattern instantiates, the pattern-matching engine passes the values of the four variables T1, C1, T2 and C2 to the MakeForeignKey function. The function first evaluates the possibility of attribute (i.e., column) C2 of entity type (i.e., table) T2 being a foreign key to entity type T1 with identifier (candidate key) C1. If the evaluation is positive, the referential constraint is created, and the function returns true. The repository excerpt expresses the fact that schemas have entity types, which in turn have attributes; some attributes can be identifiers (boolean property ID) or can reference (foreign key) another attribute (candidate key).

    function integer MakeForeignKey(string : T1, C1, T2, C2)

    explain(* title = "Create a foreign key from an SQL join";
              help  = "if C1 is a unique key of table T1 and if C2 is a
                       column of T2, and if C1 and C2 are compatible,
                       then define C2 as a foreign key of T2 to T1,
                       and return true; else return false" *);

    /* define the variables; any repository object type can be a domain */
    schema : S;  entity_type : E;  attribute : A, IK, FK;
    list : ID-LIST, FK-LIST;

    S := GetCurrentSchema();   /* S is the current schema */

    /* ID-LIST = list of the attributes A such that: (1) A belongs to an
       entity type E which is in schema S and whose name is T1, (2) the name
       of A is C1, (3) A is an identifier of E (the ID property of A is true) */
    ID-LIST := attribute[A]{of:entity_type[E]{in:[S] and E.NAME = T1}
                            and A.NAME = C1 and A.ID = true};

    /* FK-LIST = list of the attributes A such that: (1) A belongs to an
       entity type E which is in S and whose name is T2, (2) name of A is C2 */
    FK-LIST := attribute[A]{of:entity_type[E]{in:[S] and E.NAME = T2}
                            and A.NAME = C2};

    /* if both lists are not empty, then, if the attributes are compatible,
       define the attribute in FK-LIST as a foreign key to the attribute
       in ID-LIST */
    if not(empty(ID-LIST) or empty(FK-LIST))
    then { IK := GetFirst(ID-LIST);
           FK := GetFirst(FK-LIST);
           if IK.TYPE = FK.TYPE and IK.LENGTH = FK.LENGTH
           then { connect(reference, IK, FK); return true; }
           else { return false; } }
    else { return false; }

Figure 13. A (strongly simplified) excerpt of the repository and a Voyager-2 function which uses it. The explain section illustrates the self-documenting facility of Voyager-2 programs: it defines the answers the compiled version of this function will provide when queried by the DB-MAIN tool.

11. Methodological control and design recovery

Though this paper presents it as a CARE tool only, the DB-MAIN environment has a wider scope, namely data-centered applications engineering, and in particular it is to address the complex and critical problem of application evolution. In this context, understanding how the engineering processes have been carried out when legacy systems were developed, and guiding today's analysts in conducting application development, maintenance and reengineering, are typical design process modeling objectives.

This research domain, known as design (or software) process modeling, is still under full development, and few results have been made available to practitioners so far. The reverse engineering process is strongly coupled with these aspects in three ways. First, while the primary aim of reverse engineering is (in short) to recover technical and functional specifications from the operational code of an existing application, a secondary objective is progressively emerging, namely to recover the design of the application, i.e., the way the application has (or could have) been developed. This design includes not only the specifications, but also the reasonings, the transformations, the hypotheses and the decisions the development process consists of. Secondly, reverse engineering is an engineering activity of its own (Section 2), and therefore is submitted to rules, techniques and methods, in the same way as forward engineering; it therefore deserves being supported by methodological control functions of the CARE tool. Thirdly, DBRE is a complex process, based on trial-and-error behaviours. Exploring several solutions, comparing them, and deriving new solutions from earlier dead-end ones, are common practices. Recording the history of a RE project, analyzing it, completing it with new processes, and replaying some of its parts, are major functions that should be offered by the tool.

DB-MAIN proposes a design process model comprising concepts such as design product, design process, process strategy, decision, hypothesis and rationale. This model derives from proposals such as those of Potts and Bruns (1988) and Rolland (1993). It describes quite adequately not only standard design methodologies, such as the Conceptual-Logical-Physical approaches (Teorey, 1994; Batini et al., 1992), but also any kind of heuristic design behaviour, including those that occur in reverse engineering. We will shortly describe the elements of this design process model.

Product and product instance. A product instance is any outstanding specification object that can be identified in the course of a specific design. A conceptual schema, an SQL DDL text, a COBOL program, an entity type, a table, a collection of user's views, an evaluation report, can all be considered product instances. Similar product instances are classified into products, such as Normalized conceptual schema, DMS-compliant optimized schema or DMS-DDL schema (see figure 3).

Process and process instance. A process instance is any logical unit of activity which transforms a product instance into another product instance. Normalizing schema S1 into schema S2 is a process instance. Similar process instances are classified into processes, such as CONCEPTUAL NORMALIZATION in figure 3.

Process strategy. The strategy of a process is the specification of how its goal can be achieved, i.e., how each instance of the process must be carried out. A strategy may be deterministic, in which case it reduces to an algorithm (and can often be implemented as a primitive), or it may be non-deterministic, in which case the exact way in which each of its instances will be carried out is up to the designer.

The strategy of a design process is defined by a script that specifies, among others, what lower-level processes must/can be triggered, in what order, and under what conditions. The control structures in a script include action selection (at most one, one only, at least one, all in this order, all in any order, at least one any number of times, etc.), alternate actions, iteration, parallel actions, weak condition (should be satisfied), strong condition (must be satisfied), etc.

Decision, hypothesis and rationale. When developing an application, the analyst carries out process instances according to chosen hypotheses, and builds product instances. In many cases, the analyst/developer will carry out an instance of a process with some hypothesis in mind. This hypothesis is an essential characteristic of this process instance, since it implies the way in which its strategy will be performed. When the engineer needs to try another hypothesis, (s)he can perform another instance of the same process, generating a new instance of the same product. After a while, (s)he is facing a collection of instances of this product, from which (s)he wants to choose the best one (according to the requirements that have to be satisfied). (S)he makes decisions which (s)he can justify; a justification of the decision must be provided. Hypothesis and decision justification comprise the design rationale.

History. The history of a process instance is the recorded trace of the way in which its strategy has been carried out, together with the product instances involved and the rationale that has been formulated. The history of a product instance P (also called its design) is the set of all the process instances, product instances and rationales which contributed to P. Since a project is an instance of the highest level process, its history collects all the design activities, all the product instances and all the rationales that have appeared, and will appear, in the life of the project. In this way, the design of a database collects all the information needed to describe and explain how the database came to be what it is.

A specific methodology is described in MDL, the DB-MAIN Methodology Description Language. The description includes the specification of the products and of the processes the methodology is made up of. A product is of a certain type, described as a specialization of a generic specification object from the DB-MAIN model (Section 5), the internal product type, and more precisely as a submodel generated by the Schema Analysis assistant (Section 9). For instance, a product called Raw-conceptual-schema (figure 3) can be declared as a BINARY-ER-SCHEMA. The latter is a product type that can be defined by a SCHEMA satisfying the following predicate, stating that relationship types are binary and have no attributes, and that the attributes are atomic and single-valued:

    (all rel-types have from 2 to 2 roles)
    and (all rel-types have from 0 to 0 attributes)
    and (all attributes have from 0 to 0 components)
    and (all attributes have a max cardinality from 1 to 1)

A process is defined mainly by the input product type(s), the output product type(s) and by its strategy. The DB-MAIN CASE tool is controlled by a methodology engine which is able to interpret such a method description once it has been stored in the repository by the MDL compiler. In this way, the tool is customized according to this specific methodology.
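The history mechanism described above can be sketched as a simple log of process instances, each carrying its hypothesis and justification. The Python fragment below is our own minimal illustration of this bookkeeping, not the DB-MAIN history format.

    # Illustrative sketch of history recording: process instances, hypotheses
    # and decisions logged as the analyst transforms product instances.

    class History:
        def __init__(self):
            self.trace = []

        def record(self, process, inputs, outputs, hypothesis=None,
                   justification=None):
            self.trace.append({
                "process": process,          # e.g. "CONCEPTUAL NORMALIZATION"
                "inputs": inputs,            # product instances consumed
                "outputs": outputs,          # product instances produced
                "hypothesis": hypothesis,    # why this attempt was made
                "justification": justification,  # why a decision was kept
            })

        def design_of(self, product):
            """All recorded steps that contributed to `product`."""
            return [step for step in self.trace if product in step["outputs"]]

    h = History()
    h.record("NORMALIZATION", ["S1"], ["S2"],
             hypothesis="IS-A hierarchies are implicit in S1")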

One of the most promising applications of histories is database design recovery. Constructing a possible design history for an existing, generally undocumented database is a complex problem which we propose to tackle in the following way. Reverse engineering the database generates a DBRE history. Reversing each of the actions of this history, then reversing their order, yields a tentative, unstructured design history. This history can be cleaned by removing unnecessary actions, synthesized, and structured according to a reference methodology. By normalizing the latter, we can obtain a possible design history of the database. Replaying this history against the recovered conceptual schema should produce a physical schema which is equivalent to the current database. A more comprehensive description of how these problems are addressed in the DB-MAIN approach and CASE tool can be found in Hainaut et al. (1994), while the design recovery approach is described in Hainaut et al. (1996).
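Since each DBRE action is a reversible transformation, the inversion step can be sketched very simply. The following fragment is an illustration under that assumption; the action names and the inverse table are ours.

    # Illustrative sketch of design recovery by history inversion: each DBRE
    # action is replaced by its inverse and the sequence is reversed.

    INVERSE = {
        "entity_type_to_rel_type": "rel_type_to_entity_type",
        "entity_type_to_attribute": "attribute_to_entity_type",
        "remove_redundancy": "add_redundancy",
    }

    def design_history(dbre_history):
        """Turn a recorded DBRE history (a list of action names applied from
        the legacy schema to the conceptual schema) into a tentative design
        history leading from the conceptual schema back to a physical one."""
        return [INVERSE[a] for a in reversed(dbre_history) if a in INVERSE]

    dbre = ["entity_type_to_rel_type", "remove_redundancy"]
    print(design_history(dbre))
    # -> ['add_redundancy', 'rel_type_to_entity_type']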

12. DBRE requirements and the DB-MAIN CASE tool

We will examine the requirements described in Section 3 to evaluate how the DB-MAIN CASE tool can help satisfy them.

Flexibility. Instead of being constrained by rigid methodological frameworks, the analyst is provided with a collection of neutral toolsets that can be used to process any schema, whatever its level of abstraction and its degree of completion. In addition, backtracking and multi-hypothesis exploration are easily performed. However, by customizing the method engine, the analyst can build a specialized CASE tool that is to enforce strict methodologies.

Sources multiplicity. The most common information sources have a text format, and can be queried and analyzed through the Text Analysis assistant. Other sources can be processed through specific Voyager-2 functions. For example, a simple SQL program can extract SQL specifications from DBMS data dictionaries, generate their ISL expression, and import the result into the repository. Similarly, data analysis is most often performed by small ad hoc queries or application programs, which validate specific hypotheses about, e.g., a possible identifier or foreign key. Such queries and programs can be generated by Voyager-2 programs that implement heuristics about the discovery of such concepts.

Extensibility. Through the Voyager-2 language, the analyst can quickly develop specific functions. In addition, the assistants, the name processor and the text analysis processors allow the analyst to develop customized scripts.

Openness. DB-MAIN supports exchanges with other CASE tools in two ways. First, ISL specifications can be used as a neutral intermediate language to communicate with other processors; external analyzers and text processors can be used, provided they can generate ISL specifications, which can then be imported in DB-MAIN to update the repository. Secondly, Voyager-2 programs can be developed (1) to generate specifications in the input language of the other tools, and (2) to load into the repository the specifications produced by these tools.

Genericity. Both the repository schema and the functions of the tool are independent of the DMS and of the programming languages used in the application to be analyzed. Being neutral, they can be used to model and to process specifications initially expressed in various technologies and based on various paradigms.

Flexible specification model. The DB-MAIN repository can accommodate specifications of any abstraction level. DB-MAIN can be fairly tolerant to incomplete and inconsistent specifications, and can represent schemas which include objects of different levels and of different paradigms (see figure 5). If asked to be so, at the end of a complex process, it can carry out, through the Schema Analysis assistant, a precise analysis of the schema to sort out all the structural flaws.

Multiplicity of views. The tool proposes a rich palette of presentation layouts, both in graphical and textual formats. In the next version, the analyst will be allowed to define customized views.

Name processing. Besides the name processor, the compact and sorted views can be used as powerful browsing tools to examine name patterns or to detect similarities. If needed, specific Voyager-2 functions can be developed to cope with more specific name patterns or heuristics.

Text analysis. The DB-MAIN tool offers both general purpose and specific text analyzers and processors. Other, possibly more complex, processors can be developed in Voyager-2, such as those needed for processing PL/1-IMS, COBOL/VSAM or C/ORACLE applications.

Rich transformation toolset. DB-MAIN proposes a transformational toolset of more than 25 basic functions. Other, possibly more complex, transformations can be built by the analyst through specific scripts, or through Voyager-2 functions.

Links with other CASE processes. DB-MAIN is not dedicated to DBRE only: many functions are common to all the engineering processes, such as forward engineering. Therefore it includes, in a seamless way, supporting functions for the other DB engineering processes.

Traceability. DB-MAIN explicitly records a history, which includes the successive states of the specifications as well as all the engineering activities performed by the analyst and by the tool itself.

Viewing these activities as specification transformations has proved an elegant way to formalize the links between the specification states. For instance, these links can be processed to explain how a conceptual object has been implemented (forward mapping), and how a technical object has been interpreted (reverse mapping).

13. Implementation and applications of DB-MAIN

We have developed DB-MAIN in C++ for MS-Windows machines. The repository has been implemented as an object oriented database. For performance reasons, we have built a specific OO database manager which provides very short access and update times, and whose disc and core memory requirements are kept very low. In particular, a fully documented 40,000-object project can be developed on an 8-MB machine.

The first version of DB-MAIN was released in September 1995. It includes the basic processors and functions required to design, implement and reverse engineer large size databases according to various DMS. Its repository can accommodate data structure specifications at any abstraction level (Section 5). Version 1 supports many of the features that have been described in this paper: it provides a 25-transformation toolkit (Section 6), four textual and two graphical views (Section 7), the name processor (Section 8), the Transformation, Schema Analysis and Text Analysis assistants (Section 9), the PDL pattern-matching engine, the dataflow graph inspector, the Voyager-2 virtual machine and compiler (Section 10), and a simple history generator and its replay processor (Section 11). Among the other functions of Version 1, let us mention code generators for various DMS, and parsers for SQL, COBOL, CODASYL, IMS and RPG programs.

The DB-MAIN tool has been used to carry out several government and industrial projects. Let us describe five of them briefly.

• Design of a government agricultural accounting system. Its estimated cost was about 20 man/year. The initial information was found in the notebooks in which the farmers record the day-to-day basic data. Despite the unusual context for DBRE, we have followed the general methodology described in Section 2:

Data structure extraction. Manual encoding, refined through direct contacts with selected accounting officers. These documents were manually encoded as giant entity types with more than 1850 attributes and up to 9 decomposition levels.

Data structure conceptualization. Through conceptualization techniques, these structures were transformed into pure conceptual schemas of about 90 entity types each:

— Untranslation. The multivalued and compound attributes have been transformed into entity types; serial attributes, i.e., attributes with similar names and identical types, have been replaced with multivalued attributes; the entity types with identical semantics have been merged.

— De-optimization. The farmer is requested to enter the same data at different places; these redundancies have been detected and removed; the calculated data have been removed as well.


— Normalization. The schema included several implicit IS-A hierarchies, which have been expressed explicitly.

The cost for encoding, conceptualizing and integrating three notebooks was about 1 person/month. This rather unusual application of reverse engineering techniques was a very interesting experience, because it proved that data structure engineering is a global domain which is difficult (and sterile) to partition into independent processes (design, reverse). It also proved that there is a strong need for highly generic CASE tools.

• Migrating a hybrid file/SQL social security system into a pure SQL database. Due to a strict disciplined design, the programs were based on rather neat file structures, and used systematic cliches for integrity constraints management. This fairly standard two-month project comprised an interesting work on name patterns to discover foreign keys. In addition, the file structures included complex identifying schemes which were difficult to represent in the DB-MAIN repository, and which required manual processing.

• Redocumenting the ORACLE repository of an existing OO CASE tool. Starting from various SQL scripts, partial schemas were extracted, then integrated. The conceptualization process was fairly easy due to systematic naming conventions for candidate and foreign keys. In addition, it was performed by a developer having a deep knowledge of the database. The process was completed in two days.

• Redocumenting a medium size ORACLE hospital database. The database included about 200 tables and 2,700 columns; the largest table had 75 columns. The analyst quickly detected a dozen major tables with which one hundred views were associated. It appeared that these views defined, in a systematic way, a 5-level subtype hierarchy. Entering the description of these subtypes by hand would have required an estimated one week, so we chose to build a customized function in PDL and Voyager-2 as follows. A pattern was developed to detect and analyze the create view statements based on the main tables. Each instantiation of this pattern triggered a Voyager-2 function which defined a subtype with the extracted attributes. Then, the function scanned these IS-A relations, detected the common attributes, and cleaned the supertype, removing inherited attributes and leaving the common ones only. This tool was developed in 2 days, and its execution took 1 minute. However, a less expert Voyager-2 programmer could have spent more time, so that these figures cannot be generalized reliably. The total reverse engineering process cost 2 weeks.

• Reverse engineering of an RPG database. The application was made of 31 flat files comprising 550 fields (2 to 100 fields per file), and 24 programs totalling 30,000 LOC. The reverse engineering process resulted in a conceptual schema comprising 90 entity types, including 60 subtypes, and 74 relationship types. In the programs, data validation was concentrated in well defined sections. In addition, the programs exhibited complex access patterns. Obviously, the procedural code was a rich source of hidden structures and constraints. Due to the good quality of this code, the program analysis tools were of little help, except to quickly locate some statements. In particular, pattern detection could be done visually, and program slicing yielded too large program chunks. Only the dataflow inspector was found useful, though in some programs this graph was too large, due to the presence of working variables common to several independent program sections. At that time, no RPG parser was available, so a Voyager-2 RPG extractor was developed in about one week. The final conceptual schema was obtained in 3 weeks. The source file structures were found rather complex. Indeed, some non-trivial patterns were largely used, such as overlapping foreign keys, conditional foreign and primary keys, overloaded fields, and redundancies (Blaha and Premerlani, 1995). Surprisingly, the result was estimated unnecessarily complex as well, due to the deep type/subtype hierarchy. This hierarchy was reduced until it seemed more tractable. This problem triggered an interesting discussion about the limits of this inheritance mechanism. It appeared that the precision vs readability trade-off may lead to unnormalized conceptual schemas, a conclusion which was often formulated against object class hierarchies in OO databases, or in OO applications.

14. Conclusions

Considering the requirements outlined in Section 3, few (if any) commercial CASE/CARE tools offer the functions necessary to carry out DBRE of large and complex applications in a really effective way. In particular, two important weaknesses should be pointed out. Both derive from oversimplistic hypotheses about the way the application was developed. First, extracting the data structures from the operational code is most often limited to the analysis of the data structure declaration statements. No help is provided for further analyzing, e.g., the procedural sections of the programs, in which essential additional information can be found. Secondly, the logical schema is considered as a straightforward conversion of the conceptual schema, according to simple translation rules such as those found in most textbooks and CASE tools. Consequently, the conceptualization phase uses simple rules as well. Most actual database structures appear more sophisticated, however, resulting from the application of non-standard translation rules and including sophisticated performance-oriented constructs. Current CARE tools are completely blind to such structures, which they carefully transmit into the conceptual schema, producing, e.g., optimized IMS conceptual schemas instead of pure conceptual schemas.

The DB-MAIN CASE tool presented in this paper includes several CARE components which try to meet the requirements described in Section 3. The first version has been used successfully in several real size projects. These experiments have also put forward several technical and methodological problems, which we describe briefly.

• Functional limits of the tool. Though DB-MAIN Version 1 already offers a reasonable set of integrity constraints, a more powerful model was often needed to better describe physical data structures or to express semantic structures. Some useful schema transformations were lacking, and the scripting facilities of the assistants were found very interesting, but not powerful enough in some situations. As expected, several users asked for "full program reverse engineering".

• Problem and tool complexity. Reverse engineering is a software engineering domain based on specific, and still unstable, concepts and techniques, and in which much remains to learn. Not surprisingly, true CARE tools are complex, and DB-MAIN is no exception when used at its full potential. Mastering some of its functions requires intensive training, which can be justified for complex projects only. In addition, writing and testing specific PDL pattern libraries and Voyager-2 functions can cost several weeks.



• Performance. While some components of DB-MAIN proved very efficient when processing large projects with multiple sources, some others slowed down as the size of the specifications grew. That was the case for the pattern-matching engine when it parsed large texts for a dozen patterns, and for the dataflow graph constructor, which uses the former. However, no dramatic improvement can be expected, due to the intrinsic complexity of pattern-matching algorithms for standard machine architectures.

• Viewing the specifications. When a source text has been parsed, DB-MAIN builds a first-cut logical schema. Though the tool proposes automatic graphical layouts, positioning the extracted objects in a natural way is up to the analyst. This task was often considered painful, even on a large screen, for schemas comprising many objects and connections. In the same realm, several users found that the graphical representations were not as attractive as expected for very large schemas, and that the textual views often proved more powerful and less cumbersome.

The second version, which is under development, will address several of the observed weaknesses of Version 1, and will include a richer specification model and extended toolsets. We will mainly mention some important extensions: a view derivation mechanism, which will solve the problem of mastering large schemas; a view integration processor to build a global schema from extracted partial views; the first version of the MDL compiler, of the methodology engine, and of the history manager; and an extended program slicer. The repository will be extended to the representation of additional integrity constraints, and of other system components such as programs. A more powerful version of the Voyager-2 language and a more sophisticated Transformation assistant (evoked in Section 9) are planned for Version 2 as well. We also plan to experiment with the concept of design recovery on actual applications.

Acknowledgments

The detailed comments by the anonymous reviewers have been most useful to improve the readability and the consistency of this paper, and to make it as informative as possible. We would also like to thank Linda Wills for her friendly encouragements.

Notes
1. A table is in 4NF iff all the non-trivial multivalued dependencies are functional. The BCNF (Boyce-Codd normal form) is weaker but has a more handy definition: a table is in BCNF iff each functional determinant is a key.
2. A CASE tool offering a rich toolset for reverse engineering is often called a CARE (Computer-Aided Reverse Engineering) tool.
3. A Data Management System (DMS) is either a File Management System (FMS) or a Database Management System (DBMS).
4. Though some practices (e.g., disciplined use of COPY or INCLUDE meta-statements to include common data structure descriptions in programs), and some tools (such as data dictionaries), may simulate such centralized schemas.
5. There is no miracle here: for instance, the data are imported, or organizational and behavioural rules make them satisfy these constraints.

6. Belgium commonly uses three legal languages, namely Dutch, French and German. In order to develop contacts and collaboration, English is often used as a de facto common language.
7. This aspect has been developed in Hainaut et al. (1994), and will be evoked in Section 11.
8. But methodology-aware design recovery is intended. The part of the DB-MAIN project in charge of this aspect is the DB-Process sub-project, fully supported by the Communauté Française de Belgique.
9. An Education version (complete but limited to small applications) and its documentation have been made available. This free version can be obtained by contacting the first author at jlh@info.fundp.ac.be.

References

Andersson, M. 1994. Extracting an entity relationship schema from a relational database through reverse engineering. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Batini, C., Ceri, S., and Navathe, S.B. 1992. Conceptual Database Design. Benjamin-Cummings.
Batini, C., Di Battista, G., and Santucci, G. 1993. Structuring primitives for a dictionary of entity relationship data schemas. IEEE TSE, 19(4).
Blaha, M.R. and Premerlani, W.J. 1995. Observed idiosyncracies of relational database designs. In Proc. of the 2nd IEEE Working Conf. on Reverse Engineering, Toronto: IEEE Computer Society Press.
Bolois, G. and Robillard, P. 1994. Transformations in reengineering techniques. In Proc. of the 4th Reengineering Forum "Reengineering in Practice", Victoria, Canada.
Casanova, M.A. and Amaral de Sa, J.E. 1983. Designing entity relationship schemas for conventional information systems. In Proc. of ERA, pp. 265-278.
Casanova, M.A. and Amaral de Sa, J.E. 1984. Mapping uninterpreted schemes into entity-relationship diagrams: Two applications to conceptual schema design. IBM J. Res. & Develop., 28(1).
Chiang, R.H., Barron, T.M., and Storey, V.C. 1994. Reverse engineering of relational databases: Extraction of an EER model from a relational database. Journ. of Data and Knowledge Engineering, 12(2):107-142.
Date, C.J. 1994. An Introduction to Database Systems. Addison-Wesley.
Davis, K.H. and Arora, A.K. 1985. A methodology for translating a conventional file system into an entity-relationship model. In Proc. of ERA, IEEE/North-Holland.
Davis, K.H. and Arora, A.K. 1988. Converting a relational database model to an entity relationship model. In Proc. of ERA: A Bridge to the User, North-Holland.
Edwards, H.M. and Munro, M. 1995. Deriving a logical model for a system using recast method. In Proc. of the 2nd IEEE WC on Reverse Engineering, Toronto: IEEE Computer Society Press.
Elmasri, R. and Navathe, S. 1994. Fundamentals of Database Systems. Benjamin-Cummings.
Fikas, S. 1985. Automating the transformational development of software. IEEE TSE, SE-11:1268-1277.
Fong, J. and Ho, M. 1994. Knowledge-based approach for abstracting hierarchical and network schema semantics. In Proc. of the 12th Int. Conf. on ER Approach, Arlington-Dallas: Springer-Verlag.
Fonkam, M.M. and Gray, W.A. 1992. An approach to eliciting the semantics of relational databases. In Proc. of the 4th Int. Conf. on Advance Information Systems Engineering (CAiSE'92), LNCS, Manchester: Springer-Verlag.
Hainaut, J.-L. 1981. Theoretical and practical tools for data base design. In Proc. of the Intern. VLDB Conf., ACM/IEEE.
Hainaut, J.-L. 1991. Entity-generating schema transformation for entity-relationship models. In Proc. of the 10th ERA, San Mateo (CA), North-Holland.
Hainaut, J.-L. 1995. Transformation-based database engineering. Tutorial notes, VLDB'95, Zürich, Switzerland (available at jlh@info.fundp.ac.be).
Hainaut, J.-L. 1996. Specification preservation in schema transformations—Application to semantics and statistics. Data & Knowledge Engineering, Elsevier (to appear).
Hainaut, J.-L., Cadelli, M., Decuyper, B., and Marchand, O. 1992. Database CASE tool architecture: Principles for flexible design strategies. In Proc. of the 4th Int. Conf. on Advanced Information System Engineering (CAiSE-92), LNCS, Manchester: Springer-Verlag.
Hainaut, J.-L., Chandelon, M., Tonneau, C., and Joris, M. 1993a. Contribution to a theory of database reverse engineering. In Proc. of the IEEE Working Conf. on Reverse Engineering, Baltimore: IEEE Computer Society Press.
Hainaut, J.-L., Chandelon, M., Tonneau, C., and Joris, M. 1993b. Transformational techniques for database reverse engineering. In Proc. of the 12th Int. Conf. on ER Approach, Arlington-Dallas: E/R Institute and Springer-Verlag, LNCS, pp. 463-480.
Hainaut, J.-L., Englebert, V., Henrard, J., Hick, J.-M., and Roland, D. 1994. Evolution of database applications: The DB-MAIN approach. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Hainaut, J.-L., Henrard, J., Hick, J.-M., Roland, D., and Englebert, V. 1996. Database design recovery. In Proc. of CAiSE'96, Springer-Verlag.
Halpin, T.A. and Proper, H.A. 1995. Database schema transformation and optimization. In Proc. of the 14th Int. Conf. on ER/OO Modelling (ERA), Springer-Verlag.
IEEE Software. 1990. Special issue on reverse engineering, January.
Johannesson, P. and Kalman, K. 1990. A method for translating relational schemas into conceptual schemas. In Proc. of the 8th ERA, Toronto, North-Holland.
Joris, M. et al. 1992. PHENIX: Methods and tools for database reverse engineering. In Proc. of the 5th Int. Conf. on Software Engineering and Applications, Toulouse, EC2 Publish.
Kobayashi, I. 1986. Losslessness and semantic correctness of database schema transformation: Another look of schema equivalence. Information Systems, 11(1):41-59.
Kozaczynsky, W. and Lilien, L. 1987. An extended entity-relationship (E2R) database specification and its automatic verification and transformation. In Proc. of ERA Conf.
Markowitz, V.M. and Makowsky, J.A. 1990. Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. on Software Engineering, 16(8).
Navathe, S.B. 1980. Schema analysis for database restructuring. ACM TODS, 5(2).
Navathe, S.B. and Awong, A. 1988. Abstracting relational and hierarchical data with a semantic data model. In Proc. of ERA: A Bridge to the User, North-Holland.
Nilsson, E.G. 1985. The translation of COBOL data structure to an entity-rel-type conceptual schema. In Proc. of the 4th Int. Conf. on ER Approach, IEEE/North-Holland.
Petit, J.-M., Kouloumdjian, J., Bouliaut, J.-F., and Toumani, F. 1994. Using queries to improve database reverse engineering. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Potts, C. and Bruns, G. 1988. Recording the reasons for design decisions. In Proc. of ICSE, IEEE Computer Society Press.
Premerlani, W.J. and Blaha, M.R. 1993. An approach for reverse engineering of relational databases. In Proc. of the IEEE Working Conf. on Reverse Engineering, IEEE Computer Society Press.
Rauh, O. and Stickel, E. 1995. Standard transformations for the normalization of ER schemata. In Proc. of the CAiSE'95 Conf., Jyvaskyla, Finland, LNCS, Springer-Verlag.
Rock-Evans, R. 1990. Reverse engineering: Markets, methods and tools. OVUM report.
Rolland, C. 1993. Modeling the requirements engineering process. In Proc. of the 3rd European-Japanese Seminar in Information Modeling and Knowledge Bases, Budapest (preprints).
Rosenthal, A. and Reiner, D. 1988. Theoretically sound transformations for practical database design. In Proc. of ERA Conf.
Rosenthal, A. and Reiner, D. 1994. Tools and transformations—Rigourous and otherwise—for practical database design. ACM TODS, 19(2).
Sabanis, N. and Stevenson, N. 1992. Tools and techniques for data remodelling Cobol applications. In Proc. of the 5th Int. Conf. on Software Engineering and Applications, Toulouse, EC2 Publish, pp. 517-529.
Selfridge, P.G., Waters, R.C., and Chikofsky, E.J. 1993. Challenges to the field of reverse engineering. In Proc. of the 1st WC on Reverse Engineering, IEEE Computer Society Press, pp. 144-150.
Shoval, P. and Shreiber, N. 1993. Database reverse engineering: From the relational to the binary relationship model. Data and Knowledge Engineering, 10(10).
Signore, O., Loffredo, M., Gregori, M., and Cima, M. 1994. Reconstruction of E-R schema from database applications: A cognitive approach. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Springsteel, F.N. and Kou, C. 1990. Reverse data engineering of E-R designed relational schemas. In Proc. of Databases, Parallel Architectures and their Applications.
Teorey, T.J. 1994. Database Modeling and Design: The Fundamental Principles. Morgan Kaufmann.
Vermeer, M. and Apers, P. 1995. Reverse engineering of relational databases. In Proc. of the 14th Int. Conf. on ER/OO Modelling (ERA), Springer-Verlag.
Weiser, M. 1984. Program slicing. IEEE TSE, 10:352-357.
Wills, L., Newcomb, P., and Chikofsky, E. (Eds.) 1995. Proc. of the 2nd IEEE Working Conf. on Reverse Engineering, Toronto: IEEE Computer Society Press.
Winans, J. and Davis, K.H. 1990. Software reverse engineering from a currently existing IMS database to an entity-relationship model. In Proc. of ERA: the Core of Conceptual Modelling, North-Holland, pp. 345-360.

Nevertheless. Answering them requires not only comprehending the program text but relating it to the program's purpose .gatech. This includes knowledge of common algorithms and data structures and even concerns style issues. GA {spencer.wills@ee. — Walt Whitman. KURT STIREWALT College of Computing. due to patches. Understanding Interleaved Code SPENCER RUGABER. specification extraction. we have looked at a variety of instances of interleaving in actual programs and have distilled characteristic features. Boston. We refer to these code fragments as being interleaved. Letovsky has observed that programmers engaged in software understanding activities typically ask "how" questions and "why" questions (Letovsky.solving some sort of problem. in optimizing a program. 3. Atlanta. This paper also describes our experiences in developing tools to detect specific classes of interleaving in this software. 1981). Complex programs often contain multiple. Perhaps you need to track down a bug. such as indentation and use of comments. Our exploration of interleaving has been done in the context of a case study of a corpus of production mathematical software. Georgia Institute of Technology. analysis tools. 1988). Manufactured in The Netherlands.edu School of Electrical and Computer Engineering. WILLS linda. 47-76 (1996) © 1996 Kluwer Academic Publishers. The individual strands responsible for each goal are typically delocalized and overlap rather than being composed in a simple linear sequence. Introduction Imagine being handed a software system you have never seen before. With every leaf a miracle. The description. We know that software maintenance tasks such as these consume the majority of software costs (Boehm. domain models. Atlanta. and we know that reading and understanding the code requires more effort than actually making the changes (Fjeldstad and Hamlen.kurt}@cc. The former require an in-depth knowledge of the programming language and the ways in which programmers express their software designs. 1979). Interleaving may be intentional-for example. Georgia Institute of Technology. driven by the need to enhance a formal description of this software library's components. To understand this phenomenon. 1. each responsible for accomplishing a distinct goal. But we do not know what makes understanding the code itself so difficult. written in Fortran from the Jet Propulsion Laboratory. or other hasty maintenance practices. GA Abstract. the answers to "how" questions can be derived from the program text. This paper presents our characterization of interleaving and the implications it has for tools that detect certain classes of interleaving and extract the individual strands of computation. Keywords: software understanding. interleaving. rewrite the software in another language or extend it in some way. quick fixes.edu LINDA M. a programmer might use some intermediate result for several purposes-or it may creep into a program unintentionally. interwoven strands of computation. And .gatech. in turn aids in the automated component-based synthesis of software using the library. "Why" questions are more troublesome.Automated Software Engineering.

And the problem being solved may not be explicitly stated in the program text, nor is the rationale the programmer had for choosing the particular solution usually visible.

This paper is concerned with a specific difficulty that arises when trying to answer "why" questions about computer programs. In particular, it is concerned with the phenomenon of interleaving, in which one section of a program accomplishes several purposes, and disentangling the code responsible for each purpose is difficult. Unraveling interleaved code involves discovering the purpose of each strand of computation, as well as understanding why the programmer decided to interleave the strands. To demonstrate this problem, we examine an example program in a step-by-step fashion, trying to answer the questions "why is this program the way it is?" and "what makes it difficult to understand?"

We use the term plan to denote a description or representation of a computational structure that the designers have proposed as a way of achieving some purpose or goal in a program. This definition is distilled from definitions in (Letovsky and Soloway, 1986; Letovsky, 1988; Rich and Waters, 1990; Selfridge et al., 1993). Plans can occur at any level of abstraction from architectural overviews to code. Note that a plan is not necessarily stereotypical or used repeatedly; it may be novel or idiosyncratic. Following (Rich and Waters, 1990), we reserve the term cliche for a plan that represents a standard, stereotypical form, which can be detected by recognition techniques, such as (Hartman, 1991; Kozaczynski and Ning, 1994; Quilici, 1994; Rich and Wills, 1990; Wills, 1992).

1.1. NPEDLN

The Fortran program, called NPEDLN, is part of the SPICELIB library obtained from the Jet Propulsion Laboratory and intended to help space scientists analyze data returned from space missions. The acronym NPEDLN stands for Nearest Point on Ellipsoid to Line. The ellipsoid is specified by the lengths of its three semi-axes (A, B, and C), which are oriented with the x, y, and z coordinate axes. The line is specified by a point (LINEPT) and a direction vector (LINEDR). The nearest point is contained in a variable called PNEAR. The full program consists of 565 lines; an abridged version can be found in the Appendix with a brief description of subroutines it calls and variables it uses. The executable statements, with comments and declarations removed, are shown in Figure 1.

The lines of code in NPEDLN that actually compute the nearest point are somewhat hard to locate. We have indicated those lines by shading in Figure 2. One reason for this has to do with error checking. It turns out that SPICELIB includes an elaborate mechanism for reporting and recovering from errors, and roughly half of the code in NPEDLN is used for this purpose. The important point to note is that although it is natural to program in a way that intersperses error checks with computational code, it is not necessary to do so. In principle, an entirely separate routine could be constructed to make the checks, and NPEDLN called only when all the checks are passed. Although this approach would require redundant computation and potentially more total lines of code, the error handling code and the rest of the routine realize independent plans. By extracting the error checking plan from NPEDLN, we get the much smaller and, presumably, more understandable program shown in Figure 3.
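The restructuring just described (checks factored into a separate routine, with the computation invoked only on validated inputs) can be pictured with a brief sketch. This is our illustration in Python rather than the paper's Fortran; the function names (npedln_checks_ok and so on) are invented for exposition and are not SPICELIB code.

# Illustrative sketch: factoring error checks out of the computation.
# One routine validates, the other assumes validated inputs; the cost
# is some redundant work split across the two plans.

def npedln_checks_ok(a, b, c, line_dir):
    if line_dir == (0.0, 0.0, 0.0):
        return False        # "line direction vector is the zero vector"
    if a <= 0.0 or b <= 0.0 or c <= 0.0:
        return False        # "invalid axis length"
    return True

def npedln_core(a, b, c, line_pt, line_dir):
    """Assumes all checks have passed; the pure nearest-point plan."""
    ...  # core computation would go here

def npedln(a, b, c, line_pt, line_dir):
    if not npedln_checks_ok(a, b, c, line_dir):
        raise ValueError("NPEDLN precondition violated")
    return npedln_core(a, b, c, line_pt, line_dir)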

Figure 1. NPEDLN minus comments and declarations.

Figure 2. Code with error handling highlighted.

The structure of an understanding process begins to emerge: detect a plan, such as error checking, in the code and extract it, document the extracted plan independently, and note the ways in which it interacts with the rest of the code, leaving a smaller and more coherent residue for further analysis.

We can apply this approach further to NPEDLN's residual code in Figure 3. NPEDLN has a primary goal of computing the nearest point on an ellipsoid to a specified line. It also has a related goal of ensuring that the computations involved have stable numerical behavior, that is, that the computations are accurate in the presence of a wide range of numerical inputs. A standard trick in numerical programming for achieving stability is to scale the data involved in a computation, perform the computation, and then unscale the results.

Figure 3. The residual code without the error handling plan.

The code responsible for doing this in NPEDLN is scattered throughout the program's text. It is highlighted in the excerpt shown in Figure 4. The delocalized nature of this "scale-unscale" plan makes it difficult to gather together all the pieces involved for consistent maintenance; Letovsky and Soloway's cognitive study (Letovsky and Soloway, 1986) shows the deleterious effects of delocalization on comprehension and maintenance. It also gets in the way of understanding the rest of the code, since it provides distractions that must be filtered out. When we extract the scale-unscale code from NPEDLN, we are left with the smaller code segment shown in Figure 5 that more directly expresses the program's purpose: computing the nearest point.

There is one further complication, however. It turns out that NPEDLN not only computes the nearest point from a line to an ellipsoid; it also computes the shortest distance between the line and the ellipsoid. This additional output (DIST) is convenient to construct because it can make use of intermediate results obtained while computing the primary output (PNEAR). This is illustrated in Figure 6. (The computation of DIST using VDIST is actually the last computation performed by the subroutine NPELPT, which NPEDLN calls; we have pulled this computation out of NPELPT for clarity of presentation.) Note that an alternative way to structure SPICELIB would be to have separate routines for computing the nearest point and the distance. The two routines would each be more coherent, but the common intermediate computations would have to be repeated, both in the code and at runtime. The "pure" nearest point computation, with the distance plan also removed, is shown in Figure 7.
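The shape of the scale-unscale plan highlighted in Figure 4 can be restated compactly outside Fortran. The following sketch is ours, in Python, with invented names (nearest_point_scaled is a stand-in for the core geometric computation, not a SPICELIB routine):

# Illustrative sketch: the scale-compute-unscale reformulation wrapper.
# Inputs are scaled into a numerically safer range, the core computation
# runs on scaled data, and the results are unscaled before returning.

def nearest_point_scaled(a, b, c, line_pt, line_dir):
    """Hypothetical core computation on data scaled so the largest
    semi-axis is 1; returns a (point, distance) placeholder."""
    return line_pt, 0.0

def nearest_point(a, b, c, line_pt, line_dir):
    # Scale: divide every length by the largest semi-axis.
    scale = max(abs(a), abs(b), abs(c))
    sa, sb, sc = a / scale, b / scale, c / scale
    scaled_pt = [x / scale for x in line_pt]

    # Compute on the scaled problem.
    pnear, dist = nearest_point_scaled(sa, sb, sc, scaled_pt, line_dir)

    # Unscale: multiply the results by the same factor.
    return [x * scale for x in pnear], dist * scale

Note how the two halves of the wrapper (computing the factor and dividing, then multiplying back) would be textually separated in a linear program, which is exactly the delocalization the paper describes.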

Figure 4. Code with scale-unscale plan highlighted.

Figure 5. The residual code without the scale-unscale plan.

Figure 6. Code with distance plan highlighted.

Figure 7. The residual code without the distance plan.

The production version of NPEDLN contains several interleaved plans. A delocalized scaling plan is used to improve numerical stability, an independent error handling plan is used to deal with unacceptable input, and intermediate Fortran computations are shared by the nearest point and distance plans. Knowledge of the existence of the several plans, how they are related, and why they were interleaved is required for a deep understanding of NPEDLN.

1.2. Contributions

In this paper, we present a characterization of interleaving, incorporating three aspects that make interleaved code difficult to understand: independence, delocalization, and resource sharing. We have distilled this characterization from an empirical examination of existing software, primarily SPICELIB. Secondary sources of existing software which we also examined are a Cobol database report writing system from the US Army and a program for finding the roots of functions, presented and analyzed in (Basili and Mills, 1982) and (Rugaber et al., 1990). We relate our characterization of interleaving to existing concepts in the literature, such as delocalized plans (Letovsky and Soloway, 1986), coupling (Yourdon and Constantine, 1979), and redistribution of intermediate results (Hall, 1990, 1991).

We then describe the context in which we are exploring and applying these ideas. Our driving program comprehension problem is to elaborate and validate existing partial specifications of the JPL library routines to facilitate the automation of specification-driven generation of programs using these routines. We have developed analysis tools, based on the Software Refinery, to detect interleaving, and we describe the analyses that we have formulated to detect specific classes of interleaving that are particularly useful in elaborating specifications. We then discuss open issues concerning requirements on software and plan representations that detection imposes, the role of application knowledge in addressing the interleaving problem, scaling up the scope of interleaving, and the feasibility of building tools to assist interleaving detection and extraction. We conclude with a description of how related research in cliche recognition as well as non-recognition techniques can play a role in addressing the interleaving problem.

2. Interleaving

Programmers solve problems by breaking them into pieces. Pieces are programming language implementations of plans, and it is common for multiple plans to occur in a single code segment. We use the term interleaving to denote this merging (Rugaber et al., 1995). Interleaving expresses the merging of two or more distinct plans within some contiguous textual area of a program. It can be characterized by the delocalization of the code for the individual plans involved, the sharing of some resource, and the implementation of multiple, independent plans in the program's overall purpose.

Interleaving may arise for several reasons. It may be intentionally introduced to improve program efficiency. For example, it may be more efficient to compute two related values in one place than to do so separately. Intentional interleaving may also be performed to deal with non-functional requirements, such as numerical stability, that impose global constraints which are satisfied by diffuse computational structures. Interleaving may also creep into a program unintentionally, as a result of inadequate software maintenance, such as adding a feature locally to an existing routine rather than undertaking a thorough redesign. Or interleaving may arise as a natural by-product of expressing separate but related plans in a linear, textual medium. For example, accessors and constructors for manipulating data structures are typically interleaved throughout programs written in traditional programming languages due to their procedural, rather than object-oriented, structure.

Regardless of why interleaving is introduced, it complicates understanding a program. There are several reasons interleaving is a source of difficulties. The first has to do with delocalization: because two or more design purposes are implemented in a single segment of code, the individual code fragments responsible for each purpose are more spread out than they would be if they were segregated in their own code segments. This makes it difficult to perform tasks such as extracting reusable components, localizing the effects of maintenance changes, and migrating to object-oriented languages. Another reason interleaving presents a problem is that when it is the result of poorly thought out maintenance activities such as "patches" and "quick fixes", the original, highly coherent structure of the system may degrade. Finally, although interleaving is often introduced for purposes of optimization, the rationale behind the decision to intentionally introduce interleaving is often not explicitly recorded in the program; expressing intricate optimizations in a clean and well-documented fashion is not typically done. For all of these reasons, our ability to comprehend code containing interleaved fragments is compromised.

Interleaving cannot always be avoided (e.g., due to limitations of the available programming language) and may be desirable (e.g., for economy and avoiding duplication which can lead to inconsistent maintenance). Our goal, therefore, is not to completely eliminate interleaving from programs, since that is not always desirable or possible to do at the level of source text. Rather, it is to find ways of detecting interleaving and representing the interleaved plans at a level of abstraction that makes the individual plans and their interrelationships clear. We now examine each of the characteristics of interleaving (delocalization, sharing, and independence) in more detail.

2.1. Delocalization

Delocalization is one of the key characteristics of interleaving: one or more parts of a plan are spatially separated from other parts by code from other plans with which they are interleaved.

Figure 8. Portions of the NPEDLN Fortran program. Shaded regions highlight the lines of code responsible for scaling and unscaling.

The "scale-unscale" pattern found in NPEDLN is a simple example of a more general delocalized plan that we refer to as a reformulation wrapper, which is frequently interleaved with computations in SPICELIB. Reformulation wrappers transform one problem into another that is simpler to solve and then transfer the solution back to the original situation. Other examples of reformulation wrappers in SPICELIB are reducing a three-dimensional geometry problem to a two-dimensional one and mapping an ellipsoid to the unit sphere to make it easier to solve intersection problems.

Delocalization may occur for a variety of reasons. One is that there may be an inherently non-local relationship between the components of the plan, which makes the spatial separation necessary, as is the case with reformulation wrappers. Another reason is that the intermediate results of part of a plan may be shared with another plan, introducing fanout into the dataflow. For example, in Figure 8, part of the unscale plan (computing the scaling factor) is separated from the rest of the plan (multiplying by the scaling factor). This allows the scaling factor to be computed once and the result reused in all scalings of the inputs A, B, and C and in all unscalings of the results (DIST and PNEAR).

Realizing that a reformulation wrapper or some other delocalized plan is interleaved with a particular computation can help prevent comprehension failures during maintenance (Letovsky and Soloway, 1986). It can also help detect when the delocalized plan is incomplete, as it was in an earlier version of our example subroutine, whose modification history includes the following correction:

C-    SPICELIB Version 1.0.2, 25-NOV-1992 (NJB)
C
C        Bug fix: in the intercept case, PNEAR is now
C        properly re-scaled prior to output. Formerly,
C        it was returned without having been re-scaled.

2.2. Resource Sharing

The sharing of some resource is characteristic of interleaving. When interleaving is introduced into a program, there is normally some implicit relationship between the interleaved plans, motivating the designer to choose to interleave them. In this case, the steps of one plan separate those of the other, causing the plans to overlap and their steps to be shuffled together.

More specifically, the common resources shared by the interleaved plans are often intermediate data computations. The sharing of the results of some subcomputation in the implementation of two distinct higher level operations is termed redistribution of intermediate results by Hall (Hall, 1990, 1991). Redistribution is a class of function sharing optimizations which are implemented simply by tapping into the dataflow from some value producer and feeding it to an additional target consumer. It covers a wide range of common types of function sharing optimizations, including common subexpression elimination and generalized loop fusion. Hall developed an automated technique for redistributing results for use in optimizing code generated from general-purpose reusable software components. An example of this within NPEDLN is shown in Figure 9. The implementations for computing the nearest point and the shortest distance overlap in that a single structural element contributes to multiple goals: the shaded portions of the code shown are shared between the two computations for PNEAR and DIST.
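As a minimal illustration of redistribution, consider a routine in the spirit of SPICELIB's UNORM (whose behavior we infer here from its use in the listings: it appears to return both a unit vector and a magnitude). The sketch below is ours, in Python, and the function name normalize_and_length is invented:

# Illustrative sketch: redistribution of an intermediate result.
# The vector's length is computed once (the producer) and its value
# fans out to two consumers: the unit vector and the magnitude output.

import math

def normalize_and_length(vector):
    length = math.sqrt(sum(v * v for v in vector))   # shared producer
    unit = [v / length for v in vector]              # consumer 1
    return unit, length                              # consumer 2

unit, mag = normalize_and_length([3.0, 0.0, 4.0])    # unit=[0.6,0.0,0.8], mag=5.0

Here the nonzero-vector assumption is implicit; dividing by a zero length would fail, which is precisely the kind of precondition discussed later in the paper.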

. FOUND. Typically. for example control structures. In NPEDLN. the entire contents of a module may be lexically included in another.NOT. [error handling] END IF CALL VSCL ( SCALE. This sometimes occurs when a programmer wants to take advantage of a powerful intraprocedural optimizer limited to improving the code in a single routine. for example. lexical module structures. PRJPT ) ~| CALL VPRJPI(PRJNPT.CANDPL. intentional interleaving involves sharing higher level resources. This routine returns a control flag. indicating whether or not the intersection exists. 1979). The use of control flags allows control conditions to be determined once but used to affect execution at more than one location in the program. Control coupling. tFir&t 100 lines of NPEDLMJ CALL NPELPT { PRJPT. Anotherform of resource sharing occurs when the lexical structure of a module is shared among several related functional components. STIREWALT. flags. PNEAR SCALE * DIST DIST CALL CHKOUT ( 'NPBDLN* ) RETUKKf END Shared Figure 9. PRJNPT ) PI ST = VDIST ( PRJNPT. Often when interleaving is unintentional. C. and names. PNEAR.58 RUGABER.PRJPL. B. Content coupling. SURFPT is called to compute the intersection of the line with the ellipsoid." typically in the form of function codes. For example. 1975). the resource shared is code space: the code statements of two plans are interleaved because they must be expressed in linear text. as is shown in Figure 10. This sharing of control information between two modules increases the complexity of the code. complicating comprehension and maintenance. highlighting two overlapping computations. LINEPT. IFOUND ) THEN . AND WILLS SOBROtTTINE NPEDLN (A. The commonality between interleaved plans might be in the form of other shared resources besides data values. general-purpose reusable software components. Redistribution of results is a form of interleaving in which the resources shared are data values. Control conditions may be redistributed just as data values are.. This flag is then used outside of SURFPT to control whether the intercept or non-intercept case is to be handled. Another example occurs when a programmer uses ENTRY statements to partially overlap the contents of several routines so that they may share . Portions of NPEDLN.PNEAR IFOUND ) IF ( . or switches (Myers. The use of control flags is a special form of control coupling: "any connection between two modules that communicates elements of control (Yourdon and Constantine. PRJEL.

Figure 10. Fragment of subroutine showing control coupling.

Content coupling. Another form of resource sharing occurs when the lexical structure of a module is shared among several related functional components. For example, the entire contents of a module may be lexically included in another. This sometimes occurs when a programmer wants to take advantage of a powerful intraprocedural optimizer limited to improving the code in a single routine. Another example occurs when a programmer uses ENTRY statements to partially overlap the contents of several routines so that they may share access to some state variables. These two practices are examples of a phenomenon called content coupling (Yourdon and Constantine, 1979), in which "some or all of the contents of one module are included in the contents of another" and which often manifests itself in the form of a multiple-entry module. Content coupling makes it difficult to independently modify or maintain the individual functions.

Name sharing. A simple form of sharing is the use of the same variable name for two different purposes. This is sometimes done in a language, such as Fortran, that does not contain an encapsulation mechanism like packages or objects. This can lead to incorrect assumptions about the relationship between subcomputations within a program.

In general, the difficulty that resource sharing introduces is that it causes ambiguity in interpreting the purpose of program pieces, since the maintainer might be focusing on only one of the actual uses of the resource (variable, control flag, value, data structure slot, etc.). This can lead to incorrect assumptions about what effect changes will have.

2.3. Independence

While interleaving is introduced to take advantage of commonalities, it is also true that the interleaved plans each have a distinct purpose. Because understanding relates program goals to program code, having two goals realized in one section of code can be confusing. There are several ways for dealing with this problem. One way would be to make two copies of the code segment, each responsible for one of the goals, and both duplicating any common code. In the NPEDLN example, a separate routine could be provided that is responsible for computing DIST. Although this may make understanding each of the routines somewhat simpler, there are costs due to the extra code and the implicit need, often forgotten, to update both versions of the common code whenever it needs to be fixed.

A variant of this approach is to place the common code in a separate routine, replacing it in each of the two copies with a call to the new routine. This factoring approach works well when the common code is contiguous, but quickly becomes unworkable if the common code is interrupted by the plan specific code. The bottom line is that this style of intentional interleaving confronts the programmer with a tradeoff between efficiency and maintainability/understandability. Ironically, making the efficiency choice may hinder efforts to improve the code in the long run, such as parallelizing or "objectifying" the code (converting it to an object-oriented style).

3. Case Study

In order to better understand interleaving, we have undertaken a case study of production library software. The library, called SPICELIB, consists of approximately 600 mathematical programs, written in Fortran by programmers at the Jet Propulsion Laboratory for analyzing data sent back from space missions. The software performs calculations related to solar system geometry, such as coordinate frame conversions, intersections of rays, ellipses, planes, and ellipsoids, and light-time calculations. NPEDLN comes from this library.

We were introduced to SPICELIB by researchers at NASA Ames, who have developed a component-based software synthesis system called Amphion (Lowry et al., 1994; Stickel et al., 1994). Amphion automatically constructs programs that compose routines drawn from SPICELIB. It does this by making use of a domain theory that includes formal specifications of the library routines, connecting them to the abstract concepts of solar system geometry. The domain theory is encoded in a structured representation, expressed as axioms in first-order logic with equality. A space scientist using Amphion can schematically specify the geometry of a problem through a graphical user interface, and Amphion automatically generates Fortran programs to call SPICELIB routines to solve the described problem. Amphion is able to do this by proving a theorem about the solvability of the problem and, as a side effect, generating the appropriate calls. Amphion has been installed at JPL and used by space scientists to successfully generate over one hundred programs to solve solar system kinematics problems. The programs consist of dozens of subroutine calls and are typically synthesized in under three minutes of CPU time using a Sun Sparc 2 (Lowry et al., 1994).

Amphion's success depends on how accurate, consistent, and complete its domain theory is. An essential program understanding task is to validate the domain theory by checking it against the SPICELIB routines and extending it when incompletenesses are found. To do this, we need to be able to pull apart interleaved strands. For example, one incompleteness in Amphion's domain theory is that it does not fully cover the functionality of the routines in SPICELIB. Some routines compute more than one result: NPEDLN computes the nearest point on an ellipsoid to a line as well as the shortest distance between that point and the ellipsoid. However, the domain theory does not describe both of these values; only the nearest point computation is modelled, not the shortest distance. This is shown in the bottom half of Figure 11. In these routines, it is often the case that the code responsible for the secondary functionalities is interleaved with the code for the primary function covered by the domain theory. Uncovering the secondary functionality requires unraveling and understanding two interleaved computations.

Figure 11. Applying interleaving detection to component-based reuse.

In collaboration with NASA Ames researchers, we explored ways in which Amphion's domain theory is incomplete, and we built program comprehension techniques to extend it. As the top half of Figure 11 shows, we developed mechanisms for detecting particular classes of interleaving, with the aim of extending the incomplete domain theory. Another way in which Amphion's current domain theory is incomplete is that it does not express preconditions on the use of the library routines, for example, that a line given as input to a routine must not be the zero vector or that an ellipsoid's semi-axes must be large enough to be scalable. It is difficult to detect the code responsible for checking these preconditions because it is usually tightly interleaved with the code for the primary computation in order to take advantage of intermediate results computed for the primary computation. In the process, we also performed analyses to gather empirical information about how much of SPICELIB is covered by the domain theory, which we discuss in Section 4.

We have built interleaving detection mechanisms and empirical analyzers using a commercial tool called the Software Refinery (Reasoning Systems Inc.). This is a comprehensive tool suite including language-specific parsers and browsers for Fortran, Ada, C, and Cobol, language extension mechanisms for building analyzers for new languages, and a user interface construction tool for displaying the results of analyses. It maintains an object-oriented repository for holding the results of its analyses and provides a powerful wide-spectrum language, called Refine (Smith et al., 1985), which supports pattern matching and querying the repository. Using the Software Refinery allows us to leverage a commercially available tool as well as to evaluate the strengths and limitations of its approach to program analysis.

3.1. Domain Theory Elaboration in Synthesis and Analysis

Our motivations for validating and extending a partial domain theory of existing software come both from the synthesis and from the analysis perspectives. The primary motivations for doing this from the synthesis perspective are to make component retrieval more accurate in support of reuse, to assist in updating and growing the domain theory as new software components are added, and to improve the software synthesized. From the software analysis perspective, the refinement and elaboration of the domain theory, based on what is discovered in the code, is a primary activity, given SPICELIB and an incomplete theory of its application domain. The process of understanding software involves two parallel knowledge acquisition activities (Brooks, 1983; Soloway and Ehrlich, 1984; Ornburn and Rugaber, 1992):

1. using domain knowledge to understand the code: knowledge about the application sets up expectations about how abstract concepts are typically manifested in concrete code implementations, driving the generation of hypotheses and informing future analyses.

2. using knowledge of the code to understand the domain: what is discovered in the code is used to build up a description of various aspects of the application and to help answer questions about why certain code structures exist and what is their purpose with respect to the application.

We are studying interleaving in the context of performing these activities. We are targeting our detection of interleaving toward elaborating the existing domain theory, and we are also looking for ways in which the current knowledge in the domain theory can guide detection and ultimately comprehension.

3.2. Extracting Preconditions

Using the Software Refinery, we automated a number of program analyses, one of which is the detection of subroutine parameter precondition checks. A precondition is a Boolean guard controlling execution of a routine. Preconditions normally occur early in the code of a routine, before a significant commitment (in terms of execution time and state changes that must be reversed) is made to execute the routine. Moreover, precondition computations are usually part of a larger plan that detects exceptional, possibly erroneous conditions in the state of a running program and then takes alternative action when these conditions arise, such as returning with an error code, signaling, or invoking error handlers.

We found many examples of precondition checks on input parameters in our empirical analysis of SPICELIB. In some instances the majority of the lines of code in a routine are there to deal with the preconditions and resulting exception handling rather than to actually implement the base plan of the routine. Because precondition checks are often interspersed with the computation of intermediate results, they tend to delocalize the plans that perform the primary computational work. One such check occurs in the subroutine SURFPT and is shown in Figure 12.

C$Procedure SURFPT ( Surface point on an ellipsoid )

      SUBROUTINE SURFPT ( POSITN, U, A, B, C, POINT, FOUND )

      DOUBLE PRECISION U ( 3 )

C
C     Check the input vector to see if it's the zero vector. If it is,
C     signal an error and return.
C
      IF (       ( U(1) .EQ. 0.0D0 )
     .     .AND. ( U(2) .EQ. 0.0D0 )
     .     .AND. ( U(3) .EQ. 0.0D0 )  ) THEN

         CALL SETMSG ( 'SURFPT: The input vector is the zero vector.' )
         CALL SIGERR ( 'SPICE(ZEROVECTOR)' )
         CALL CHKOUT ( 'SURFPT' )
         RETURN

      END IF

Figure 12. A fragment of the subroutine SURFPT in SPICELIB. This fragment shows a precondition check which invokes an exception if all of the elements of the U array are 0.

SURFPT finds the intersection (POINT) of a ray (represented by a point POSITN and a direction vector U) with an ellipsoid (represented as three semi-axes lengths A, B, and C), if such an intersection exists (indicated by FOUND). One of the preconditions checked by SURFPT is that the direction vector U is not the zero-vector.

Parameter precondition checks make explicit the assumptions a subroutine places on its inputs. The process of understanding a subroutine can be facilitated by detecting its precondition checks and using the information they encode to elaborate a high-level specification of the subroutine. The logical negation of each of the predicates forms a conjunct in the precondition of the subroutine. We have created a tool that detects parameter precondition checks and extracts the preconditions into a documentation form suitable for expression as a partial specification. The specifications can then be compared against the Amphion domain model.

Precondition checks are particularly difficult to understand when they are sprinkled throughout the code of a subroutine as opposed to being concentrated at the beginning. However, we discovered that, though interleaved, these checks could be heuristically identified in SPICELIB by searching for IF statements whose predicates are unmodified input parameters (or simple dataflow dependents of them) and whose bodies invoke exception handlers. The analysis that decides whether or not IF statements test only unmodified input parameters is specific to the Fortran language, but the analysis that decides if a code fragment is an exception plan depends on the fact that exceptions are dealt with in a stylized and stereotypical manner in SPICELIB. The implication is that the Fortran specific portion is not likely to need changing when we apply the tool to a new Fortran application, whereas the SPICELIB specific portion will certainly need to change. With this in mind, we chose a tool architecture that allows flexibility in keeping these different types of pattern knowledge separate and independently adaptable.
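A minimal sketch of this heuristic follows, assuming a toy abstract syntax tree; the node classes (If, Call, Return) and field names here are invented for illustration and are not the Software Refinery's Fortran AST:

# Illustrative sketch of the precondition-check heuristic: flag IF
# statements that (a) test only still-unmodified input parameters and
# (b) invoke the exception-handling cliche (SIGERR followed by RETURN).

from dataclasses import dataclass, field

@dataclass
class Call:
    name: str

@dataclass
class Return:
    pass

@dataclass
class If:
    tested_vars: set                       # variables in the predicate
    body: list = field(default_factory=list)

def is_exception_handler(body):
    names = [s.name for s in body if isinstance(s, Call)]
    return "SIGERR" in names and any(isinstance(s, Return) for s in body)

def precondition_checks(stmts, input_params):
    unmodified = set(input_params)
    checks = []
    for s in stmts:
        if isinstance(s, If) and s.tested_vars <= unmodified \
                and is_exception_handler(s.body):
            checks.append(s)
        # a real analysis would shrink `unmodified` at assignments,
        # calls, and aliasing constructs; omitted in this sketch
    return checks

stmts = [If({"U"}, [Call("SETMSG"), Call("SIGERR"), Return()])]
print(len(precondition_checks(stmts, ["U", "A"])))   # 1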

Detecting Exception Handlers. In general, we need application specific knowledge about usage patterns in order to discover exception handlers. The developers of SPICELIB followed a strict discipline of exception propagation by registering an exception upon detection using a subroutine SIGERR and then exiting the executing subroutine using a RETURN statement. Hence, a call to SIGERR together with a RETURN indicates a cliche for handling an exception in SPICELIB. In some other application, the form of this cliche will be different. It is, therefore, necessary to design the recognition component of our architecture around this need to specialize the tool with knowledge about the system being analyzed. We recognize application specific exception handlers using two rules that search the AST for a call to SIGERR followed by a RETURN statement. These rules and the Refine code that applies them are presented in detail in (Rugaber et al., 1995). The Software Refinery provides excellent support for this design principle through the use of the rule construct and a tree-walker that applies these rules to an abstract syntax tree (AST). Rules declaratively specify state changes by listing the conditions before and after the change without specifying how the change is implemented. This is useful for including SPICELIB specific pattern knowledge because it allows the independent, declarative expression of the different facets of the pattern.

Detecting Guards. Discovering guards, which are IF statements that depend only upon input parameters, involves keeping track of whether or not these parameters have been modified. If they have been modified before the check, then the check probably is not a precondition check on inputs. In Fortran, a variable X can be modified by:

1. appearing on the left hand side of an assignment statement,
2. being passed into a subroutine which then modifies the formal parameter bound to X by the call,
3. being implicitly passed into another subroutine in a COMMON block and modified in this other subroutine, or
4. being explicitly aliased by an EQUIVALENCE statement to another variable which is then modified.

We track modifications to input parameters by using an approximate dataflow algorithm that propagates a set of unmodified variables through the sequence of statements in the subroutine. At each statement, if a variable X in the set could be modified by the execution of the statement, then X is removed from the set. After the propagation, we can easily check whether or not an IF statement is a guard. Currently our analysis does not detect modification through COMMON or EQUIVALENCE because none of the code in SPICELIB uses these features with formal parameters.

Results. Since we are targeting partial specification elaboration for Amphion, we chose to make the tool output the preconditions in LaTeX form. The result of this analysis is a table of preconditions associated with each subroutine. Figure 13 gives examples of preconditions extracted for a few SPICELIB subroutines. Our tool generated the LaTeX source included in Figure 13 without change.
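The propagation of the unmodified-variable set described above can be sketched as a single forward pass. This is our toy model in Python (the statement representation is invented); as the paper notes, a real Fortran analysis must also consider COMMON and EQUIVALENCE:

# Illustrative sketch: forward propagation of the set of unmodified
# variables. Each statement removes from the set any variable it may
# modify; a check counts as a guard only while its tested variables
# are still in the set.

def may_modify(stmt):
    """Toy over-approximation; stmt is a (kind, vars) pair."""
    kind, vars_touched = stmt
    return set(vars_touched) if kind in ("assign", "call_out") else set()

def propagate_unmodified(stmts, params):
    unmodified = set(params)
    before = []                    # set in force just before each statement
    for s in stmts:
        before.append(set(unmodified))
        unmodified -= may_modify(s)
    return before

# Example: after X is assigned, a later check on X no longer counts
# as a precondition on the inputs.
stmts = [("assign", ["X"]), ("if_guard", ["X"])]
print(propagate_unmodified(stmts, ["X", "Y"]))   # e.g. [{'X','Y'}, {'Y'}]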

RECGEO:  ¬(F ≥ 1) ∧ ¬(RE ≤ 0.0D0)

REMSUB:  ¬((LEFT > RIGHT) ∨ (RIGHT < 1) ∨ (LEFT < 1) ∨ (RIGHT > LEN(IN)) ∨ (LEFT > LEN(IN)))

SURFPT:  ¬((U(1) = 0.0D0) ∧ (U(2) = 0.0D0) ∧ (U(3) = 0.0D0))

XPOSBL:  ¬((MOD(NCOL, BSIZE) ≠ 0) ∨ (MOD(NROW, BSIZE) ≠ 0)) ∧ ¬(NROW < 1) ∧ ¬(NCOL < 1) ∧ ¬(BSIZE < 1)

Figure 13. Preconditions extracted for some of the subroutines in SPICELIB.

Taken literally, the precondition for SURFPT, for example, states that one of the first three elements of the U array parameter must be non-zero. In terms of solar system geometry, U is seen as a vector, so the more abstract precondition can be stated as "U is not the zero vector." Extracting the precondition into the literal representation is the first step to being able to express the precondition in the more abstract form.

The other preconditions listed in Figure 13, stated in their abstract form, are the following. The subroutine RECGEO converts the rectangular coordinates of a point RECTAN to geodetic coordinates, with respect to a given reference spheroid whose equatorial radius is RE, using a flattening coefficient F. Its precondition is that the radius is greater than 0 and the flattening coefficient is less than 1. The subroutine REMSUB removes the substring (LEFT:RIGHT) from a character string IN. It requires that the positions of the first character LEFT and the last character RIGHT to be removed are in the range 1 to the length of the string and that the position of the first character is less than the position of the last. Finally, the subroutine XPOSBL transposes the square blocks within a matrix BMAT. Its preconditions are that the block size BSIZE must evenly divide both the number of rows NROW in BMAT and the number of columns NCOL, and that the block size, number of rows, and number of columns are all at least 1.
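Read as executable predicates, the extracted conjuncts are straightforward. For instance, two of the Figure 13 preconditions could be rendered as follows (our illustrative Python, not the tool's LaTeX output):

# Illustrative rendering of extracted preconditions as executable
# predicates (cf. Figure 13). Each conjunct is the negation of a
# guard whose body raised a SPICELIB error.

def remsub_precondition(left, right, instr):
    n = len(instr)   # plays the role of Fortran's LEN(IN)
    return not (left > right or right < 1 or left < 1
                or right > n or left > n)

def surfpt_precondition(u):
    # "U is not the zero vector" in its literal, element-wise form.
    return not (u[0] == 0.0 and u[1] == 0.0 and u[2] == 0.0)

assert remsub_precondition(2, 5, "abcdefg")
assert not surfpt_precondition((0.0, 0.0, 0.0))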

3.3. Finding Interleaving Candidates

There are several other analyses that we have investigated using heuristic techniques for finding interleaving candidates.

3.3.1. Routines with Multiple Outputs

One heuristic for finding instances of interleaving is to determine which subroutines compute more than one output. When this occurs, either the subroutine is realized as the interleaving of multiple distinct plans, as is the case with NPEDLN's computation of both the nearest point and the shortest distance, or the subroutine is returning a result whose type cannot be directly expressed in the Fortran type system (e.g., as a data aggregate). In the former case, the subroutine is returning the results of multiple distinct computations. In the latter case, the subroutine may be implementing only a single plan, but a maintainer's conceptual categorization of the subroutine is still obscured by the appearance of some number of seemingly distinct outputs. Clearly subroutines with multiple outputs complicate program understanding.

A good example of the latter case occurs in the SPICELIB subroutine SURFPT, which conceptually returns the intersection of a vector with the surface of an ellipsoid. However, it is possible to give SURFPT a vector and an ellipsoid that do not intersect. In such a situation the output parameter POINT will be undefined, but the Fortran type system cannot express the type: DOUBLE PRECISION ∨ Undefined. The original programmer was forced to simulate a variable of this type using two variables, POINT and FOUND, adopting the convention that when FOUND is false, the return value is Undefined, and when FOUND is true, the return value is POINT.

We built a tool that determines the multiple output subroutines in a library by analyzing the direction of dataflow in parameters of functions and subroutines. Our tool bases its analysis on the structure chart (call graph) objects that the Software Refinery creates. The nodes of these structure charts are annotated with parameter direction information. A parameter's direction is either: in if the parameter is only read in the subroutine, out if the parameter is only written in the subroutine, or in-out if the parameter is both read and written in the subroutine. Multiple output subroutines will have more than one parameter with direction out or in-out. The resulting analysis showed that 25 percent of the subroutines in SPICELIB had multiple output parameters.

In addition, we performed an empirical analysis to determine, for those routines covered by the Amphion domain model (35 percent of the library), which ones have multiple output parameters, some of which are not covered by the domain model. Since the programs that Amphion creates can never make use of these return values, they have not been associated with any meaning in the domain theory. We refer to outputs that are not mapped to anything in the domain model as dead end dataflows (similar to an interprocedural version of dead code (Aho et al., 1986)). For example, NPEDLN's distance output (DIST) is a dead end dataflow as far as the domain theory is concerned. Dead end dataflows imply interleaving in the subroutine and/or an incompleteness in the domain theory. Our analysis revealed that of the subroutines covered by the domain theory, 30 percent have some output parameters that are dead end dataflows. These are good focal points for detecting interleaved plans that might be relevant to extending the domain theory, and we were thus able to focus our work on these routines first.
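The direction-based detection can be sketched compactly. The following is our illustrative Python over toy read/write sets, not the Software Refinery's structure-chart objects:

# Illustrative sketch: classify each parameter as in / out / in-out
# from read and write sets, then flag routines with more than one
# out or in-out parameter as multiple-output candidates.

def direction(param, reads, writes):
    if param in writes:
        return "in-out" if param in reads else "out"
    return "in"

def multiple_output_routines(routines):
    """routines: name -> (params, reads, writes); all toy inputs."""
    flagged = []
    for name, (params, reads, writes) in routines.items():
        outs = [p for p in params
                if direction(p, reads, writes) in ("out", "in-out")]
        if len(outs) > 1:
            flagged.append((name, outs))
    return flagged

lib = {"SURFPT": (["POSITN", "U", "A", "B", "C", "POINT", "FOUND"],
                  {"POSITN", "U", "A", "B", "C"},
                  {"POINT", "FOUND"})}
print(multiple_output_routines(lib))   # [('SURFPT', ['POINT', 'FOUND'])]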

3.3.2. Control Coupling

Another heuristic for detecting potential interleaving finds candidate routines that may be involved in control coupling. Control coupling is often implemented by using a subroutine formal parameter as a control flag; the constant parameter may be a flag that is being used to choose among a set of possible computations to perform. The heuristic strategy we use for detecting control coupling first computes a set of candidate routines that are invoked with a constant parameter at every call-site in the library or in code generated from the Amphion domain theory. That is, we focus on calls to library routines that supply a constant, as opposed to a variable, as a parameter to other routines. Each member of this set is then analyzed to see if the formal parameter associated with the constant actual parameter is used to conditionally execute disjoint sections of code. Our analysis shows that 19 percent of the routines in SPICELIB are of this form.
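The first step of this heuristic can be sketched as a scan over call sites. This is our illustrative Python with an invented argument representation (each actual is a ('const', value) or ('var', name) pair), not the authors' Refine implementation:

# Illustrative sketch: find routines whose i-th actual argument is a
# constant at every call site, a candidate signature for a control
# flag chosen among fixed function codes.

def constant_parameter_positions(call_sites):
    """call_sites: routine -> list of call argument lists; assumes a
    nonempty call list and consistent arity per routine."""
    result = {}
    for routine, calls in call_sites.items():
        nargs = len(calls[0])
        positions = [i for i in range(nargs)
                     if all(args[i][0] == "const" for args in calls)]
        if positions:
            result[routine] = positions
    return result

sites = {"SETOPT": [[("const", 1), ("var", "x")],
                    [("const", 1), ("var", "y")]]}
print(constant_parameter_positions(sites))   # {'SETOPT': [0]}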

since interleaving may occur at any level of abstraction. 2. Rich.g. Irreducible control flow graphs signify the use of unstructured GO TO statements. This diagrammatic notation is complemented with an axiomatized description of the plan that defines its formal semantics. independent plans must be localized as much as possible. Since there are a number of such representations to choose from. 3. The Plan Calculus also provides a mechanism. the representation must support multiple views of the program as the interaction of plans at various levels of abstraction. implementation and optimization relationships).68 RUGABER. and independence. STIREWALT. . Graph representations naturally express a partial execution ordering via implicit concurrency and explicit transfer of control and data. The components of the plans have to be serialized with respect to the dataflow constraints. We do this by first listing structural properties that correspond to each of the three characteristics of interleaving and then searching for a representation that has these structural properties. called overlays^ for representing correspondences and relationships between pairs of plans (e. Overlays also support a general notion of plan composition which takes into account resource sharing at all levels of abstraction by allowing overlapping points of view. This allows us to develop correctness preserving transformations to extract interleaved plans. 1990). data and control flow connections) between them. sharing must be detectable (shared resources should explicitly flow from one plan to another). we narrow the possibilities by noting that: 1. a representation must impose a partial rather than a total execution ordering on the components of plans. The partial execution ordering requirement suggests that some form of graphical representation is appropriate. A plan in the Plan Calculus is encoded as a graphical depiction of the plan's structural parts and the constraints (e. p2 both share a resource provided by a plan ps then Pi and p2 should appear in the graph as siblings with a common ancestor ps. 1981. with no explicit ordering among them. AND WILLS Style tool. In sequential languages like Fortran. The key characteristics of interleaving are delocalization. Since we want to build tools for interleaving detection we have to formulate a representation that captures the properties of interleaving. Rich and Waters. This typically means that components of plans cluster around the computation of the data being shared as opposed to clustering around other components of the same plan. uses a control flow graph that explicitly represents transfer of execution flow in programs. similarly if two plans pi. 1981. It follows then that in order to express a delocalized plan. delocalization often cannot be avoided when two or more plans share data.g. resource sharing. The style tool uses this structural property to report violations of structured programming style. This enables the viewing of plans at multiple levels of abstraction.. An existing formalism that meets these criteria is Rich's Plan Calculus (Rich. for example.. This total ordering is necessary due to the lack of support for concurrency in most high level programming languages.

are driven by the problem the program is solving. Without this understanding. For example. In fact. The underlying issue is that any scheme for code understanding based solely on a top-down or a bottom-up approach is inherently limited. The common iteration construct involved in loop fusion is another control-based mechanism. if a maintenance task requires extending NPEDLN to handle symmetric situations where more than one "nearest point" to a line exist.3. where plans generate expectations that guide program analysis and program analysis generates related segments that need explanation. its application domain. Why was DIST computed inside of the routine instead of separately? Was it only for efficiency reasons. Reformulation wrappers use a protocol mechanism.2. a pair of identical values is indicated. These plans are inherently delocalized. such as maintaining stack discipline or synchronization mechanisms for cooperating processes. And this sort of plan knowledge derives from understanding the application area. then the programmer needs to figure out what to do about the distance calculation also computed by NPEDLN. But the tasks that require the understanding . programmers need to know which plans pieces of code are implementing. but this interleaving has intraprocedural scope. Exploiting Application Knowledge Most of the current technology available to help understand programs addresses implementation questions. These form a possible design space of solutions to the interleaving problem and can help relate existing techniques that might be applicable. which may be naming. adaptive. and corrective maintenance . the best hope is to recognize that the code has uniformly applied a function and its inverse in two places. One spectrum is the scope of the interleaving. that is. Scaling the Concept of Interleaving We can characterize the ways interleaving manifests itself in source code along two spectrums. they only make sense as plans at all when considered in the context of the application: stable computations of solar system geometry. it is driven by the syntactic structure of programs written in some programming language. without knowing why this was done and how the computations are connected. usually at the intraprocedural level. For example. the use of control flags is a control-based mechanism for interleaving with interprocedural scope. And a top-down approach cannot hope to find where a plan is implemented without being able to understand how plan implementations are related syntactically and via dataflows. which can range from intraprocedural to interprocedural to object (clusters of procedures and data) to architectural. a bottom-up approach cannot hope to relate delocalized segments or disentangle interleavings without being able to relate to the application goals. The other spectrum is the structural mechanism providing the interleaving. Another example from NPEDLN concerns reformulation wrappers. or might the nearest point and the distance be considered a/?a/r of results by its callers? In the former case. 4. in the latter. To answer questions like these.UNDERSTANDING INTERLEAVED CODE 69 4. or protocol.perfective. Multiple-inheritance is an example of a . data. As illustrated by the examples. control. but they can have interprocedural scope. not the program. a single DIST return value is still appropriate. that is. Protocols are global constraints. The implication is that a coordinated strategy is indicated.

thus providing flexibility and generality. The language itself combines features of imperative. In particular. maps. Refine provides abstract data structures. a user interface builder. thereby reducing programmer work. This comprehensive toolkit provides a set of language-specific browsers and analyzers. and rule-based programming. AND WILLS data-centered interleaving mechanism with object scope. We also take full advantage of Reasoning Systems' existing Fortran language model and its structure chart analysis. . The approach taken by the Refine language and tool suite has many advantages for attacking problems like ours.70 RUGABER. 4. The object-oriented repository further reduces programmer responsibility by providing persistence and memory management. would prove useful. Robust dataflow analysis is particularly important to the precision of precondition extraction. such as control flow graphs for Fortran and general dataflow analysis. Tool Support We used the Software Refinery from Reasoning Systems in our analyses.. These allowed us a running start on our analysis and provided a robust handling of Fortran constructs that are not typically available from non-commercial research tools. We made particular use of two other features of the toolkit. In addition to the rule-based features. and cross reference lists. STIREWALT. and it provided pre-existing analyses for traditional graphs and reports such as structure charts. The first is called the Workbench. Interleaving at the scope of objects and architectures or involving global protocol mechanisms is not yet well understood. dataflow diagrams. such as sets. the availability of other analyses. Consequently. and sequences. Refine language programs such as those described in (Rugaber et al. and an object-oriented repository for holding the results of analyses. The results of the analyses can be accessed from the repository using small. objectoriented. functional. compiling a Refine program into compiled Lisp. which manage their own memory requirements. We can see several ways in which the Refine approach can be extended. few mechanisms for detection and extraction currently exist in these areas. We had merely to add a simple tree walking routine to apply the rules to the abstract syntax tree. 1995). Related Work Techniques for detecting interleaving and disentangling interleaved plans are likely to build on existing program comprehension and maintenance techniques.4. a parser generator. Before-and-after condition patterns define the properties of constructs without indicating how to find them. The Refine compiler was the other feature we used. Of particular value to us is its rule-based constructs. 5.

1992)) is a useful detection mechanism. Schwanke. For example.g. based on the detection of shared uses of global data.e. 1994. 1990. non-recognition-based methods of delineation are needed. Kozaczynski and Ning. However. However. slicing (Weiser. Cluster analysis (Biggerstaff et al. 1994). is its ability to deal with delocalization and redistribution-type function sharing optimizations. 1989) is used to group related sections of code. but also recognize familiar types of transformations or design decisions that went into constructing the program. 1994) is a widely-used technique for localizing functional components by tracing through data dependencies within the procedural scope. called "potpourri module detection" (Calliss and Cornelius. not repeatedly used plans).. Wills. One of the key features of GRASPR (Wills. clustering techniques can only provide limited assistance by roughly delineating possible locations of functionally cohesive components. For example..1. Another technique. 1979) uses a simple.. called temporal abstraction. de- . 1988. Mechanisms for dealing with specific types of interleaving have been explicitly built into existing recognition systems. Loop fusion is viewed as redistribution of sequences of values and treated as any other redistribution optimization (Wills. In the future. 1992). 1991. Domain-based clustering. Letovsky. for instance. and interleaving iterative computations. Most existing cliche recognition systems tend to deal with interleaving involving data and control mechanisms. Disentangling Unfamiliar Plans When what is interleaved is unfamiliar (i. Schwanke. 1981. Quilici. This is based on detecting coarse patterns of data and control flow at the procedural level that are indicative of common ways of constructing. KBEmacs looks for minimal sections of a loop body that have data flow feeding back only to themselves. 5.e. KBEmacs (Rich and Waters. 1985. other. 1990). Johnson. 1991. by keying in on the patterns of linguistic idioms used in the program. Rich and Wills. focuses on naming mechanisms. 1986. control paths. rather than as being an orthogonal form of recognition. stereotypical. novel. The recognition and temporal abstraction of iteration cliches is similarly used in GRASPR to enable it to deal with generalized loop fusion forms of interleaving. augmenting.. frequently used plans).. which suggest the manifestations of domain concepts.. 1990. and names. cliche recognition (e. idiosyncratic. 1992).2.UNDERSTANDING INTERLEAVED CODE 71 5. 1994. In fact. Ning et al. Hutchens and Basili. most recognition systems deal explicitly with the recognition of cliches that are interleaved in specific ways with unrecognizable code or other cliches. this process is usually done with special-purpose procedural mechanisms that are difficult to extend and that are viewed as having supporting roles to the cliche recognition process. (Hartman. as explored by DM-TAG in the DESIRE system (Biggerstaff et al. special-purpose recognition strategy to segment loops within programs. we envision recognition architectures that detect not only familiar computational patterns. which views iterative computations as compositions of operations on sequences of values. Waters. Many existing cliche recognition systems implicitly detect and undo certain types of interleaving design decisions. 1994. This decomposition enables a powerful form of abstraction. The Role of Recognition When what is interleaved is familiar (i.

Cimitile et al. 1993. Although a particular instance may be the result of an intentional decision on the part of a programmer trying to improve the efficiency of a program. We are grateful to JPL'S NAIF group for enabling our study of their SPICELIB software. in turn. To investigate the phenomenon of interleaving. 6. often so that a program resource could be shared among the plans. For example. and we were able to add to the understanding by performing a variety of interleaving-based analyses. Research into automating data encapsulation has recently provided mechanisms for hypothesizing possible locations of data plans at the object scope. based on the call graph and dominance relations. lead to each of the separate plan implementations being spread out or delocalized throughout the segment. B. and that many instances of interleaving can be detected by relatively straightforward tools. STIREWALT. Bowdidge and Griswold (Bowdidge and Griswold. We also benefited from insightful discussions with Michael Lowry at Nasa Ames Research Center concerning this study and interesting future directions.. The results of these studies reinforce our feelings that interleaving is a useful concept when understanding is important. In our studies we have observed that interleaving typically involves the implementation of several independent plans in one code segment.. SPICELIB from the Jet Propulsion Laboratory. in NPEDLN. to help progranmiers see all the uses of a particular data structure and to detect frequently occurring computations that are candidates for abstract functions. and the outputs PNEAR and DIST represent a pair of results related by interleaved. . we have studied a substantial collection of production software. SPICELIB needs to be clearly understood in order to support automated program generation as part of the Amphion project. AND WILLS tects modules that provide more than one independent service by looking for multiple proper subgraphs in an entity-to-entity interconnection graph. The interleaving can. Techniques have also been developed within the RE'^ project (Canfora et al. highly overlapping plans). 1994) for identifying candidate abstract data types and their associated modules. Acknowledgments Support for this research has been provided by ARPA. Conclusion Interleaving is a commonly occurring phenomenon in the code that we have examined.72 RUGABER. where the input parameters A. Further research is required to develop techniques for extracting objects from pieces of data that have not already been aggregated in programmer-defined data structures. These graphs show dependencies among global entities within a single module. 1994) use an extended data flow graph representation. detecting multiple pieces of data that are always used together might suggest candidates for data aggregation (as for example. it can nevertheless make understanding the program more difficult for subsequent maintainers. the independent services reflect separate plans in the code. and c are used as a tuple representing an ellipsoid. For example. (contract number NAG 2-890). called a star diagram. Presumably.

Otherwise scale the C point on the input line too. C NORMAL is a normal vector to the plane C containing the candidate ellipse. and then find the near point PRJNPT C on the projected ellipse. { C .DO.LE. If squaring any of the C scaled lengths causes it to underflow to C zero. signal an error. but may not be calculeible due to nuC merical problems (this cem only happen when the C ellipsoid is extremely flat or needle-shaped). O.*) CALL SIGERR ('SPICE(DEGENERATECASE)' ) CALL CHKOUT ('NPEDLN' ) RETURN END IF C Undo the scaling. PRJPL. UDIR.B. O. The output C DIST was coinputed in step 3 and needs only to be C re-scaled.LE.') CALL SIGERR ( 'SPICE(DEGENERATECASE)' ) CALL CHKOUT ( 'NPEDLN' ) RETURN END IF C Project the candidate ellipse onto a plane C orthogonal to the line. CANDPL ) CALL INEDPL (SCLA.PNEAR. A ) CALL ERRDP ('#'.LE. A ) '' CALL ERRDP ( # ' .CANDPL. OPPDIR) CALL SURFPT(SCLPT.DO ) . ( SCLC**2 .C=#. Only numerical problems C can prevent the intersection from being found.') CALL ERRDP ( • .NOT. PNEAR. this is the point on the camdidate C ellipse that projects to PRJNPT.PRJPL. PNEAR ) CALL CHKOUT { 'NPEDLN' ) RETURN END IF 50001 CONTINUE C Getting here means the line doesn't intersect C the ellipsoid. O. PRJPL. C so we treat the line as a pair of rays.2). CALL VMINUS(UDIR. PT(1. O.EQ. 0 ) THEN CALL SETMSG('Direction is zero vector.B=#.DO ) ) THEN CALL SETMSG ('Semi-axee: A=#.C. NORMAL(l) = UDIR(l) / SCLA**2 NORMAL(2) = UDIR(2) / SCLB**2 N0RMAL(3) = UDIR(3) / SCLC**2 CALL NVC2PL ( NORMAL.1). Find the candidate ellipse CAND.SCLB. IFOUND) IF ( .OR.LE. PNEAR. { SCLB**2 . DIST) INTEGER UBEL PARAMETER ( UBEL = 9 ) INTEGER UBPL PARAMETER { UBPL = 4 ) DOUBLE PRECISION A DOUBLE PRECISION B DOUBLE PRECISION DOUBLE PRECISION LINEPT ( 3 ) DOUBLE PRECISION LINEDR ( 3 ) DOUBLE PRECISION PNEAR ( 3 ) DOUBLE PRECISION DIST LOGICAL RETURN CANDPL ( UBPL ) DOUBLE PRECISION DOUBLE PRECISION CAND ( UBEL ) DOUBLE PRECISION OPPDIR ( 3 ) DOUBLE PRECISION PRJPL ( UBPL ) DOUBLE PRECISION MAG DOUBLE PRECISION NORMAL (3 ) PRJEL { UBEL ) DOUBLE PRECISION PRJPT (3 ) DOUBLE PRECISION PRJNPT ( 3 ) DOUBLE PRECISION DOUBLE PRECISION DOUBLE PRECISION SCALE DOUBLE PRECISION SOLA DOUBLE PRECISION SCLB DOUBLE PRECISION SCLC DOUBLE PRECISION 3 ) SCLPT DOUBLE PRECISION UDIR 3 ) INTEGER I FOUND LOGICAL 2 ) LOGICAL IFOUND LOGICAL XFOUND IF ( RETURN 0 ) THEN RETURN ELSE CALL CHKIN ( 'NPEDLN' ) END IF CALL UNORM ( LINEDR. O. PRJEL ) C Find the point on the line lying in the projectC ion plane.DO ) •OR. 2 IF ( FOUND(I) ) THEN DIST = O. FOUND(l)) CALL SURFPT(SCLPT. XFOUND ) THEN CALL SETMSG ( 'Cauididate ellipse not found. PNEAR ) CALL VSCL ( SCALE.DO ) . B ) CALL ERRDP {•*'. SCALE = MAX { DABS(A). SCLB. B ) ' CALL ERRDP ( # ' . SCLA. PRJPL ) CALL PJELPL ( CAND. O. C SURFPT determines whether rays intersect a body. O. C The distance between PRJPT and PRJNPT is DIST. CANDPL.UNDERSTANDING INTERLEAVED CODE 73 Appendix NPELDN with Some of Its Documentation C$ Nearest point on ellipsoid to line. Here PRJPT is the C point on the line lying in the projection plane. DABS{B).LINEDR. PRJEL.CAND.DO. CALL NVC2PL ( UDIR.') CALL SIGERR{'SPICE(ZEROVECTOR)* ) CALL CHKOUTCNPEDLN' ) RETURN ELSE IF (( A . SCLC.LE. SCLC. PT(1.OR. IFOUND ) THEN CALL SETMSG ('Inverse projection not found.DO ) . F0UND(2)) DO 50001 I = 1. SUBROUTINE NPEDLN(A.OR. DABS(C) ) SCLA = A / SCALE SCLB = B / SCALE SCLC = C / SCALE IF (( SCLA**2 . PRJNPT ) DIST = VDIST ( PRJNPT.LINEPT. The inverse projection of PNEAR ought C to exist.SCLC. CALL VPRJPKPRJNPT. 
C ) CALL SIGERR ('SPICE(INVALIDAXISLENGTH)') CALL CHKOUT ('NPEDLN' ) RETURN END IF C Scale the semi-zixes lengths for better C numerical behavior.DO ) ) THEN CALL SETMSG {'Axis too small: A=#.C=#. it's the intersection of C an ellipsoid centered at the origin and a plane C containing the origin. MAG ) IF ( MAG . CALL VPRJP ( SCLPT.LE. PNEAR ) DIST = SCALE * DIST CALL CHKOUT ( 'NPEDLN' ) RETURN END .') CALL ERRDP {'#'. O. UDIR. Mathematically C the ellipse must exist.B=#. We'll call the plane C PRJPL and the projected ellipse PRJEL.ODO CALL VEQU ( PT(1. PRJPT ) CALL NPELPT ( PRJPT. CALL VSCL ( SCALE.I). C ) ' CALL SIGERR {'SPICE(DEGENERATECASE)') CALL CHKOUT ('NPEDLN' ) RETURN ( END IF SCLPT(1) = LINEPT(1) / SCALE SCLPT(2) = LINEPT(2) / SCALE SCLPT(3) = LINEPT(3) / SCALE C Hand off the intersection case to SURFPT.NOT. PNEAR. SCLB. ( B . OPPDIR. PRJPT ) C Find the near point PNEAR on the ellipsoid by C taking the inverse orthogonal projection of C PRJNPT. SCLA.XFOUND) IF ( .

c SIGERR Signal Error Condition. Magnitude of line direction vector. AND WILLS C Descriptions of subroutines called by NPEDLN: C c CHKIN Module Check In (error handling). Upper bound of array containing plane. inverted. Unitized line direction vector. c ERRDP Insert DP Number into Error Message Text. Normal to the candidate plane CANDPL. Plane containing candidate ellipse. c VPRJP Project a vector onto plane orthogonally. Scaling factor. Projection plane. c VMINUS Negate a double precision 3-D vector. c SETMSG Set Long Error Message. Make one DP 3-D vector equal to another. c VEQU Vector scaling. Candidate ellipse. orthogonally. c NPELPT Find nearest point on ellipse to point. Vector in direction opposite to UDIR. Upper bound of array containing ellipse. which the candidate ellipse is projected onto to yield PRJEL. Nearest point on ellipsoid to line. . c INEDPL Intersection of ellipsoid and plane. Direction vector of input line. Length of semi-axis in the z direction.74 RUGABER. c VSCL c NVC2PL Make plane from normal and constant. Intersection point of line & ellipsoid. Nearest point on projected ellipse to projection of line point. STIREWALT. 3 dimensions. Point on input line. c c c c c c c c c c c c c c c c c c c c c c c c c c c PRJEL PRJPT PRJNPT SCALE Length of semi-axis in the x direction. c CHKOUT Module Check Out (error handling). Projection of the candidate ellipse CAND onto the projection plane PRJEL. Length of semi-axis in the y direction. c UNORM Normalize double precision 3-vector. c PJELPL Project ellipse onto plane. c SURFPT Find intersection of vector w/ ellipsoid. c VPRJPI Vector projection onto plane. Distance of ellipsoid from line. Projection of line point.

Underwood. and M. 1981. Ning. November 1990. pages 2-11. pages 12-19. MA. Reverse engineering: Resolving conflicts between expected and actual software designs. G. and WKozaczynski. Vancouver. Petrocelli Charter. 1986.... pages 48-57. In Proc. S. IEEE Computer Society Press. Program improvement by automatic redistribution of intermediate results.Q. Monterey. New Orleans.. Canada. Addison-Wesley. 1975. Parikh. R. Understanding and documenting programs. ReasoningSystems Incorporated. pages 1044-1052. Fjeldstad. In Proc. 1986. 9th Knowledge-Based Software Engineering Conference. CA.. Orlando. Program understanding and the concept assignment problem. In Proc. Automated Software Engineering. R.T. Biggerstaff. Webster. editors. M. and E. M. Sethi. and W. February 1990. Palo Alto. 2nd ACM SIGSOFT Symposium on Foundations of Software Engineering. and W. and J. R. TPressburger. In Proc. TPressburger. Menlo Park. Amphion: automatic programming for subroutine libraries. W. Automatic control understanding for natural programs. A. In Proc. 3rd Workshop on Program Comprehension. Techniques. Maryland. and Tools. A. 4 1979. August 1981. Hamlen.1983. Software Engineering Economics. May 1994. and B. Baltimore. Communications of the ACM. IEEE Software. PhD thesis. pages 73-82. Washington. A formal representation for plans in the Programmer's Apprentice. Bany. 7th International Joint Conference on Artificial Intelligence. 18:543-554.Rugaber. 9th Knowledge-Based Software Engineering Conference. In GUIDE 48. May 1994. British Columbia. Letovsky. Communications of the ACM. IEEE Transactions on Software Engineering. Automated support for legacy code understanding. Brooks. Dec. G. Monterey. Plan analysis of programs. In Proc. In IEEE Conf on Software Maintenance -1992. Order No. A. Morgan Kaufmann Publishers. Automated support for encapsulating abstract data types. CA. CA. T. A. Inc. . 3(3). Mills. Canfora. Myers. Communications of the ACM. Program improvement by automatic redistribution of intermediate results: An overview. S. and I. CA. Ning. March 1994. F. May 1982. EM453.Engberts. AAAI Press.Philpot. Technical Report AI91-161.Cornelius. Florida. 37(5):50-57. Ullman. Reliable Software through Composite Design. of the First Working Conference on Reverse Engineering. Research Report 662. CA. pages 32-40.. IEEE Transactions on Software Engineering. A reverse engineering method for identifying reusable abstract data types. D. Hall. Mitbander.R. International Journal ofMan-Machine Studies. PhD.Lowry and R. In IEEE Conference on Software Maintenance -1990. R. J. Potpourri module detection. 1991. Intention-Based Diagnosis of Novice Programming Errors. Automated program understanding by concept recognition. Cimitile. Bowdidge. Kozaczynski. A. System structure analysis: Clustering with data bindings. A. Griswold.Cimitile. May 1994. Johnson. Hutchens. MIT Artificial Intelligence Lab. Tutorial on Software Maintenance. and S. Technical Report 1251. Reading. and V. and D. S.K. and J. Boehm. Yale University. August 1985. Towards a theory of the comprehension of computer programs. May 1993. and N. 1994. PhD.McCartney. A memory-based approach to recognizing programming plans. A formal approach to domain-oriented software design environments. Automating Software Design.Basili. W.. 37(5):84-93. and H.Soloway. December 1988.Philpot.Munro. Letovsky.D. Los Altos.UNDERSTANDING INTERLEAVED CODE 75 References Aho. November 1994. Basili. IEEE Computer Society Press.Munro.Underwood. M. Prentice Hall.Zvegintozov. 37(5):72-83. 1983). 
Rich. 1994. and M. Hartman. G.. and I. Also appears in (Parikh and Zvegintozov. pages 97-110. Calliss. Lowry. R. Compilers: Principles. Application program maintenance study: Report to our respondents. November 1992. editors. Software Refinery Toolkit. J. San Diego. In M. 11(8).L. IEEE Computer Society. B. 1994.Tortorella. 1983. l(l):61-78. 1986.Q. University of Texas at Austin. A. Program comprehension through the identification of abstract data types. V. 8(3):27(>-283. 1991.C. pages 46-51. R. Hall. Delocalized plans and program comprehension. Ombum. D. IEEE Computer Society Press. CA. Lowry.. C. Quilici.

C. pages 265-274.1991. C. Automated programrecognitionby graph parsing. In Proc.76 RUGABER. visualizing. Mark. Deductive composition of astronomical software from subroutine libraries. K. 3 1981. of the Second Working Conference on Reverse Engineering.Westfold. R. pages 83-92. G. 1986. Addison-Wesley. Rich. K. In 5th International Conference on Software Engineering. IEEE Software yl{\)\%2-%9. France. and L. IEEE Computer Society Press. Pittsburgh. Morgan Kauftnann. R.M. and R. Workshop on Software Specification and Design. Nancy. D. France. September 1984.Ehrlich.C. Recognizing design decisions in programs. and M. M.Stirewalt. IEEE Software. T. Recognizing a program's design: A graph-parsing approach.Waters. IEEE Transactions on Software Engineering. 1994. IEEE Computer Society Press. In IEEE Conference on Software Maintenance 1995. P.Stirewalt. Weiser. January 1990.Platoff. In IEEE Conference on Software Maintenance -1991. Baltimore.Altucher. R. Readings in Artificial Intelligence and Software Engineering. and S. Technical Report 604. In Proc.C. L. S. R. Prentice-Hall. May 1993.Ombum. 10(5):595-609. An intelligent tool for re-engineering software modularity. Stickel. PA. of the First Working Conference on Reverse Engineering.. July 1995. and A. Waters. Wills. pages 439-449.LeBlanc. 12th International Conference on Automated Deduction. 1989. Empirical studies of programming knowledge. Selfridge. Reprinted in C. Ontario. Detecting interleaving. 5th Int. Inspection methods in programming. Soloway.C. R. R. IEEE Transactions on Software Engineering. PhD thesis. MIT Artificial Intelligence Lab. S. Schwanke. The Programmer's Apprentice. and L.Pressburger.A position paper. Rich and R.Waldinger. Toronto. 7(l):46-54... CA. MD. Reading. STIREWALT. pages 166-175. E.. pages 341-55. 1979. Baltimore. In Proc. Technical Report 1358. .. and L.Kotik. Research on knowledge-based software environments at Kestrel Institute. The interleaving problem in program understanding. MA and ACM Press. AND WILLS Rich. A method for analyzing loop programs. E. and K. pages 147-150. Rugaber. and controlling software structure. Wills. Program slicing. C. and R. Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design. Waters. MIT Artificial Intelligence Lab.Underwood.. July 1992. Discovering.Bundy. I.Chikofsky. Challenges to the field of reverse engineering . November 1985. January 1990. June 1981. Waters. S. M. San Diego. Maryland.Wills...Wills. PhD Thesis. May 1979. 1990. In Proc. S. editors. pages 144-150. and E. Constantine. IEEE Computer Society Press. Rugaber.Lowry. Smith. IEEE Transactions on Software Engineering. Schwanke. 5(3):237-247. Rich. Rugaber. Yourdon.. and L. Nice. September 1995.

The activities. over their useful lifetimes.Merlo.ca McGill University School of Computer Science 3480 University St. M. DEMORI. Integration of the tools provides opportunities for synergy. M. 1995 . Our research is concerned with developing a suite of tools to aid the maintainers of legacy systems in recovering the knowledge embodied within the system. in the normal course of events. and CLIPS.Galler. dynamic programming 1. Montreal. are even more expensive to maintain. MERLO.Bernstein.Kontogiannis. M. M. they become increasingly complex and brittle. Introduction Large-scale production software systems are expensive to build and. * This work is in part supported by IBM Canada Ltd. pattern matching.. are essential preludes for several key processes. the Natural Sciences and Engineering Research Council of Canada. July. BERNSTEIN kostas @ cs.mcgill. move on to other projects.DeMori. 1995. A. Successful large-scale systems are often called "legacy systems" because (a) they tend to have been in service for many years. Canada H3A 2A7 Abstract. bash. and hence harder to maintain. KONTOGIANNIS. In this paper we present three pattern-matching techniques: source code metrics. large-scale software system that is maintained beyond its first generation of programmers. As such systems age. They also become even more critical to the survival of their organization because the business rules encoded within the system are seldom documented elsewhere.96-103. including maintenance and design recovery for reengineering. program understanding. and (c) the systems themselves represent enormous corporate assets that cannot be easily replaced.A. R. The programmer's skill and experience are essential elements of our approach. In many cases. Institute for Robotics and Intelligent Systems. © IEEE. a Canadian Network of Centers of Excellence and. a dynamic programming algorithm for finding the best alignment between two code fragments. GALLER. Room 318. R. allowing the programmer to select the most appropriate tool for a given task. leaving the system to be maintained by successive generations of maintenance programmers. Selection of particular tools and analysis methods depends on the needs of the particular task to be accomplished.Automated Software Engineering. whichfirstappeared in Proceedings of the Second Working Conference on Reverse Enginering. It typically represents a massive economic investment and is critical to the mission of the organization it serves. pp. software metrics. Boston. Legacy systems are intrinsically difficult to maintain because of their sheer bulk and because of the loss of historical information: design documentation is seldom maintained as the system evolves. The methods are applied to detect instances of code cloning in several moderately-sized production systems including tcsh. E. E. Keywords: reverse engineering. (b) the original developers. known collectively as "program understanding*'. A legacy system is an operational. 3. 77-108 (1996) © 1996 Kluwer Academic Publishers. Pattern Matching for Clone and Concept Detection * K. Manufactured in The Netherlands. and a statistical matching algorithm between abstract code descriptions represented in an abstract language and actual source code. the source code becomes the sole repository for evolving corporate business rules.. Based on "Pattern Matching for Design Concept Localization" by K.

judgement and creativity. called RevEngE (i?everse Engineering Environment). We believe that maintaining a large legacy software system is an inherently human activity that requires knowledge. and "redocumentation". 1990) there are definitions for a variety of subtasks. During system maintenance. ART (Johnson. Ariadne is a set of pattern matching and design recovery programs implemented using a commercial tool called The Software Refinery^. including "reengineering".. In this paper we describe two types of pattern-matching techniques and discuss why pattern matching is an essential tool for program understanding. 1990). The second type is based on Dynamic Programming techniques that allow for statementlevel comparison of feature vectors that characterize source code program statements. and University of Victoria (Buss et al. "restructuring". Currently we are working on another version of the Ariadne environment implemented in C++. 1993). and normal maintenance. In particular. Facilitating the program understanding process can yield significant economic savings. that is if the two segments are implementations of the same algorithm. Individual tools in the kit include Ariadne (Konto. updates.78 KONTOGIANNIS ET AL. 1994) Over the past three years. no single tool or technique will replace the maintenance progranmier nor even satisfy all of the programmer's needs. taste. the team has been developing a toolset. Evolving real-world systems requires pragmatism and flexibility. Our research is part of a larger joint project with researchers from IBM Centre for Advanced Studies. The tools communicate through a flexible object server and single global schema implemented using the Telos information modeling language and repository (Mylopoulos. The first type is based on numerical comparison of selected metric values that characterize and classify source code fragments. experience. The first one is a comparison between two different program segments to see if one is a clone of the other. 1990). For the foreseeable future. we apply these techniques to address two types of relevant program understanding problems. 1994). and Rigi (Tilley. Rigi is a programmable environment for program visualization. The average Fortune 100 company maintains 35 million lines of source code (MLOC) with a growth rate of 10 percent per year just in enhancements. it has been estimated that 50 to 90 percent of the maintenance programmer's effort is devoted to simply understanding relationships within the program. Consequently. Similar . Our approach is to provide a suite of complementary tools from which the programmer can select the most appropriate one for the specific task at hand. An integration framework enables exploitation of synergy by allowing conmiunication among the tools. it is often necessary to move from low. implementationoriented levels of abstraction back to the design and even the requirements levels. University of Toronto. ART (Analysis of 7?edundancy in Text) is a prototype textual redundancy analysis system. The toolset is integrated through a common repository specifically designed to support program understanding (Mylopoulos. 1994).^ In (Chikofsky. The problem is in theory undecidable. based on an open architecture for integrating heterogeneous tools. but in practice it is very useftil to provide software maintainers with a tool that detects similarities between code segments. The process is generally known as "reverse engineering".

Distances between program segments can be computed based on feature differences. variable names. code cloning can be a costly practice. This paper proposes two methods for addressing the code cloning detection problem. Thirdly.e. The DP approach provides in general. insertions and. This paper introduces new techniques for detecting instances of source code cloning. The granularity for selecting and comparing code fragments is at the level of begin-end blocks. The second problem is the recognition of program segments that implement a given progranmiing concept. Secondly. Program features based on software metrics are proposed. substitutions.1. We address this problem by defining a concept description language called ACL and by applying statement-level comparison between feature vectors of the language and feature vectors of source code program statements. due to bug fixes. These features apply to basic program segments like individual statements. it results in a program that is larger than necessary. The Code Cloning Problem Source code cloning occurs when a developer reuses existing code in a new context by making a copy that is altered to provide new functionality. parameterized function.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 79 segments are proposed to the software engineer who will make the final decision about their modification or other use. and efficiency constraints may not admit the extra overhead (real or perceived) of a generalized routine. they are compared at the statement level. In the long run. requiring larger computers. literal strings and numbers). The practice is widespread among developers and occurs for several reasons: making a modified copy may be simpler than trying to exploit conunonality by writing a more general.e. The second is based on a new Dynamic Programming (DP) technique that is used to calculate the best alignment between two code fragments in terms of deletions. enhancements. scheduling pressures may not allow the time required to generalize the code. Thefirstis based on direct comparison of metric values that classify a given code fragment. The granularity for selecting code fragments for comparison is again at the level of begin-end blocks. less false positives) than the one based on direct comparison of metric values at the begin-end block level. Once two begin-end blocks have been selected. begin-end blocks and functions. more accurate results (i. often-cloned functionality is a prime candidate for repackaging and generalization for a repository of reusable components which can yield tremendous leverage during development of new applications. Firstly. This method returns clusters of begin-end blocks that may be products of cutand-paste operations. the change must be propagated to all instances of the clone. . 1. This method returns clusters of begin-end blocks that may be products of cut-and-paste operations. when a modification is required (for example. The reason is that comparison occurs at the statement level and informal information is taken into account (i. or changes in business rules). increasing the complexity that must be managed by the maintenance programmer and increasing the size of the executable program.

they do not provide any similarity measure between the pattern and the input string. ed and v i . variables defined and keywords). as in machine translation.3. and REFINE. Related Work A number of research teams have developed tools and techniques for localizing specific code patterns. Concept-to-code matching is under testing and optimization. Concept descriptions and source code are parsed. Moreover. These tools are very efficient in localizing patterns but do not provide any way for partial and hierarchical matching.5jfc. Other tools have been developed to browse source code and query software repositories based on structure. Incomplete or imperfect matching is also possible leaving to the software engineer the final decision on the similar candidates proposed by the matcher.2.80 KONTOGIANNIS ET AL. Comparison of a concept description language statement with a source code statement is achieved by comparing feature vectors (i. Such tools include CIA. S2] . permanent relations between code fragments. . The concept recognition problem becomes the problem of establishing correspondences. variables used. The UNIX operating system provides numerous tools based on regular expressions both for matching and code replacement. The comparison and selection granularity is at the statement level. The use of a statistical formalism allows a score (a probability) to be assigned to every match that is attempted. SCAN.e. The proposed concept description language. These tools are efficient on representing and storing in local repositories relationships between program components. Microscope. A2. and source code. belong to the innermost begin-end block containing 5i. and control or dataflow relationships. keywords.. a code fragment V = Si. Matching of concept representations and source code representations involves alignment that is again performed using a dynamic programming algorithm that compares feature vectors of concept descriptions. metrics. they provide effective mechanisms for querying . 1. A way of dynamically updating matching probabilities as new data are observed is also suggested in this paper. models insertions as wild characters (AbstractStatement* and AbstractStatemenf^) and does not allow any deletions from the pattern. . Moreover. between a parse tree of the concept description language and the parse tree of the code.. Rigi. awk. 1. and b) the sequence of statements 52.-Am. Given a concept description M = Ai.-Sk is selected for comparison if: a) the first concept description statement Ai matches with Si. It has been implemented using the REFINE environment and supports plan localization in C programs. Widely-used tools include grep. A concept to be recognized is a phrase of the concept language. The Concept Recognition Problem Programming concepts are described by a concept language. A new formalism is proposed to see the problem as a stochastic syntax-directed translation. Translation rules are pairs of rewriting rules and have associated a probability that can be set initially to uniform values for all the possible alternatives.

dynamic programming techniques for comparing begin-end blocks at a statementby-statement basis. Program features relevant for clone detection focus on data and control flow program properties. The number of functions called (fanout). text comparison enhanced with heuristics for approximate and partial matching (Baker. Our approach to clone detection exploits the observation that clone instances. 1990). 1987). and b) the ability to perform hierarchical recognition. Features examined include metric values and specific data. and stored so that they can be used inside other more complex composite patterns. 1977). Metric-value similarity analysis is based on the assumption that two code fragments Ci and C2 have metric values M{Ci) and M(C2) for some source code metric M. The work presented here uses feature vectors to establish similarity measures. approximate fingerprints from program text files (Johnson. Code to Code Matching In this section we discuss pattern-matching algorithms applied to the problem of clone detection. by their nature. a For or. 1994) for which their components exhibit low correlation (based on the Spearman-Pierson correlation test) were selected for our analyses: 1. Modifications of five widely used metrics (Adamov.and control-flow properties. However.s t a t e m e n t can be used allowing for multiple matches with a Whi l e . We look for identifiable characteristics or features that can be used as a signature to categorize arbitrary pieces of code. Code duplication systems use a variety of methods to localize a code fragment given a model or apattem. Other tools use metrics to detect code patterns (McCabe. Determining whether two arbitrary program functions have identical behavior is known to be undecidable in the general case. and text comparison tools such as Unix d i f f. Moreover no partial matching and no similarity measures between a query and a source code entity can be calculated. The analysis framework uses two approaches: 1. . If the two fragments are similar under the set of features measured by M. direct comparison of metric values between begin-end blocks. The closest tool to the approach discussed in this paper. then the values of M{Ci) and M{C2) should be proximate. providing similarity measures between a pattern and a matched code fragment. Moreover. they do not provide any other mechanism to localize code fragments except the stored relations. 1994). 2. is SCRUPLE (Paul. 1990). 1995).(Halstead. common dataflow (Horwitz.. An expansion process is used for unwrapping the composite pattern into its components. a Do statement in the code. and 2.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 81 and updating their local repositories. 1993). recognized patterns can be classified. In this approach. (Buss et al. One category of such tools uses structure graphs to identify the"fingerprint" of a program (Jankowitz. The major improvement of the solution proposed here is a) the possibility of performing partial matching with feature vectors. 1988). explicit concepts such as i t e r a t i v e . should have a high degree of structural similarity.

Rather than working directly with textual representations. the following heuristics are currently considered: • Adjustments between variable names by considering lexicographical distances. Specifically. where n is a parameter provided by the user. The features per statement used in the Dynamic Programming approach are: • • • Uses of variables. The ratio of input/output variables to the fanout. and file of the program and are stored as annotations in the corresponding nodes of the AST.82 KONTOGIANNIS ET AL. five different metrics are calculated compositionally for every statement. In particular. Variations in these features provide a dissimilarity value used to calculate a global dissimilarity measure of more complex and composite constructs such as begin-end blocks and functions. 2. Modified Albrecht's function point metric. In addition to the direct metric comparison techniques. The comparison function used to calculate dissimilarity measures is discussed in detail in Section 2. The comparison granularity is at the level of a begin-end block of length more than n lines long. This table is used for selecting the source code entities to be matched based on their metric proximity. numerical literals. Modified Henry-Kafura's information flow quality metric. block. strings. Dynamic programming (DP) techniques detect the best alignment between two code fragments based on insertion. more sophisticated analytical approach was to form clusters by comparing values on one or more axes in the metric space. evaluating the Euclidean distance of each pair. deletion and comparison operations. McCabe cyclomatic complexity. Similarity of two code fragments is measured using the resulting 5-dimensional vector. as opposed to begin-end blocks. are abstracted into feature sets that classify the given statement. Once metrics have been calculated and annotations have been added.3. Uses and definitions of data types. naive approach. and numerical literals. deletion and comparison operations. is to make 0{'n?) pairwise comparisons between code fragments. The five metrics as discussed previously. 4. we use dynamic programming techniques to calculate the best alignment between two code fragments based on insertion. A second. 5. The first. Two methods of comparing metric values were used. Two statements match if they define and use the same variables. and strings. when the source code is parsed an Abstract Syntax Tree (AST) Tc is created. Heuristics have been incorporated in the matching process to facilitate variations that may have occurred in cut and paste operations. function. a reference table is created that contains source code entities sorted by their corresponding metric values. 3. The selection of the blocks to be compared is based on the proximity of their metric value similarity in a selected metric axis. . definitions of variables. source code statements. Detailed descriptions and references for metrics will be given later on in this section.

First. The tree is annotated with the fan-out attribute which has been determined during an analysis phase following the initial parse. 1. and as loop index values. such as linkage information and the call graph are created automatically by the parser. Parsers for other languages may be easily constructed or obtained through the user community. The second step is to use the parser on the subject system to construct the AST representation of the source code. Program Representation and the Development of the Ariadne Environment The foundation of the Ariadne system is a program representation scheme that allows for the calculation of the feature vectors for every statement. 2) because the comparison of the feature vector is performed at the statement level. In the current implementation. Such information is typically obtained using dataflow analysis algorithms similar to the ones used within compilers. We use an object-oriented annotated abstract syntax tree (AST). The final step is to add additional annotations into the tree for information on data types. The following sections further discuss these approaches and present experimental results from analyzing medium scale (< lOOkLOC) software systems. For example. an If-Statement and a While-Statement are defined to be subclasses of the Statement class. Dynamic progranmiing is a more accurate method than the direct metric comparison based analysis (Fig. ELSE . consider the following code fragment from an IBM-proprietary PL/1-like language.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 83 • Filtering out short and trivial variable names such as i and j which are typically used for temporary storage of intermediate values. block or function of the source code. the results of external analysis. Creating the annotated AST is a three-step process. This preselection reduces the comparison space for the more computationally expensive DP match. The corresponding AST representation for the i f statement is shown in Fig. Within this framework only the begin-end blocks that have a dissimilarity measure less than a given threshold are considered for DP comparison. further steps operate in an essentially language-independent fashion. MAIN: PROCEDURE(OPTION). Some tree annotations. 2.1. The domain model defines object-oriented hierarchies for the AST nodes in which. and links to informal information. Nodes of the AST are represented as objects in a LISP-based development environment^. Once the AST is created. only variable names of more than three characters long are considered. DCL OPTION FIXED(31). a grammar and object (domain) model must be written for the programming language of the subject system. IF (OPTION>0) THEN CALL SHOW_MENU(OPTION). for example. dataflow (dataflow graphs). Code fragments are selected for Dynamic Programming comparison by preselecting potential clone candidates using the direct metric comparison analysis. The tool vendor has parsers available for such conmion languages as C and COBOL.

so each metric adds useful information.2. ." ^ J pLegend ( [^ \ J »ASTnode altributo naiiw 1 + 1 (anoul m Link from parent tocNIdvfaa named attribute. and function. MENU ^ f 1 1 J J I — NODE NAME OPTION ^ 1 1 f SHOW_ ERROR 1 \ J f I "Invalid option. Metrics Based Similarity Analysis Metrics based similarity analysis uses five source-code metrics that are sensitive to several different control and data flow program features. 1994) shows the metrics components have low correlation.. Metric values are computed for each statement. The AST for an IF Statement With Fanout Attributes. block. CALL SHOW_ERROR("Invalid o p t i o n number"). The features examined for metric computation include: • Global and local variables defined or used. f I OPTION i^ l . 2. \ B fh • Fanout attribute containing Integer value V. Empirical analysis ^ (Buss et al.. Figure 1. END MAIN.84 KONTOGIANNIS ET AL. ^ ( 0 ^ f I SHOW.._ .

Defined/used parameters passed by reference and by value.n + 2 where • • • € is the number of edges in the controlflowgraph n is the number of nodes in the graph. Let 5 be a code fragment. and minor modifications such as replacement of while with f o r loops and insertion of statements that do not alter the basic data and control flow of the original code structure. D_COMPLEXITY(s) = GLOBALS{S)/{FANJDUT{S) + 1) where • GLOBALS(s) is the number of individual declarations of global variables used or updated within s. functions.SET{s)-\USERJNPUT{s)+ FILEJNPUT{s) . and files. VARSJJSEDJiNDJSET{s)-^ GLOBAL.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 85 • • • • • Functions called. A global variable is a variable which is not declared in the code fragment s. Control flow graph. 3. ALBRECHT(s) = { P3 * [ P4 * where. MCCABE(s) = 1 + d. Partial matching may occur because the metrics are not sensitive to variable names. where d is the number of control decision predicates in j Altematively McCabe metric can be calculated using f pi * P2 * 4. A description of the metrics used is given below but a more detailed description can be found in (Adamov. Note that these metrics are computed compositionally from statements. (Moller93). I/O operations (read. write operations). The description of the five modified metrics used is given below.VARS. Files accessed. source code white space. to b e g i n end blocks. S-COMPLEXITY(s) = FANJOUT{sf where • FAN_OUT(s) is the number of individual function calls made within s. 1991). (Fenton. 1. MCCABE(5) = € . 1987).

. In a large software system though there are many begin-end blocks and such a pairwise comparison is not possible because of time and space limitations.. 5). every cluster that has been calculated by intersecting clusters in Mi and Mj contains potential clones under the criteria implied by both metrics.86 VARSJJSED^NDJ5ET{s) ment s.. For every metric axis Mi (i = 1.. We have experimented with two techniques for calculating similar code fragments in a software system. The process ends when all metric axis have been considered. The first one is based on pairwise Euclidean distance comparison of all begin-end blocks that are of length more than n lines long. The technique starts by creating clusters of potential clones for every metric axis A^^ (i = 1 . In the current implementation the values chosen are pi = 5. Once the clusters for every axis are created. p2 = 4. KAFURA JN(5) is the sum of local and global incoming dataflow to the the code fragment s. The parameter n can be changed by the user. block and function node. The clone detection algorithm that is using clustering can be summarized as: 1. For example every cluster in the axis Mi contains potential clones under the criteria implied by this metric. The selection of values for the piS' ^0 does not affect the matching process. The user may specify at the beginning the order of comparison. Select all source code begin-end blocks B from the AST that are more than n lines long. KAFURA_OUT(s) is the sum of local and global outgoing dataflow from the the code fragment s. FILEJNPUT(s) is the number offilesaccessed for reading in 5-. p4. Once the five metrics Mi to M5 are computed for every statement. 5.. p3 = 4 and. where n is a parameter given by the user.OUT{s)y where. Consequently. The factors pi. P4 = 7. the pattern matching process is fast and efficient. are weight factors. then intersections of clusters in different axes are calculated forming intermediate results. 2. KONTOGIANNIS ET AL. 1987) possible values for these factors are given. we limit the pairwise comparison between only these begin-end blocks that for a selected metric axis Mi their metric values differ in less than a given threshold di. It is simply the comparison of numeric values. and the clustering thresholds for every metric axis. Each cluster . USERJNPUT{s) is the number of read operations in statement s. KAFURA(s) = { {KAFURAJN{s) • • * KAFURA. In (Adamov. is the number of data elements set and used in the state- GLOBAL-VARSSET{s) is the number of global data elements set in the statement s. Instead. The second technique is more efficient and is using clustering per metric axis. In such a way every block is compared only with its close metric neighbors. 5) create clusters Cij that contain begin-end blocks with distance less than a given threshold di that is selected by the user.

and form a composite metric axis McurrOj. Manual inspection of the above results combined with more detailed Dynamic Programming re-calculation of distances gave some statistical data regarding false positives. (Kontogiannis. For every cluster Ccurr. that is based on Dynamic Programming. where i = 1. Different programs give different distribution of false alarms. In CLIPS. j G {1 . a 34 kLOC expert system shell. 1995) to find the best alignment between two code fragments.Mark Mj as used and set the current axis Mcurr ~ '^currQj' 4. As a refinement. 5}. resulting in a total of 20 percent of potential system duplication at the function level.28 functions per cluster. The cumulative similarity measure T> between two code fragments P . but generally the closest the distance is to 0. The following section. These results are given in Table 1. we detected 35 clusters of similar functions of average size 4. a 40KLOC Unix shell program. 2.7 percent of potential system duplication at the function level. as a similarity measure between program constructs. resulting to a total of 23 percent of potential code duplication at the function level. current axis Mcurr = Mi. The clusters in the resulting set contain potential code clone fragments under the criteria Mcurr and Mj. 1994). discusses in detail the other code to code matching technique we developed. The metric-based clone detection analysis has been applied to a several medium-sized production C programs. the user may restrict the search to code fragments having minimum size or complexity. A program feature vector is used for the comparison of two statements.. The pattern matching engine uses either the computed Euclidean distance or clustering in one or more metric dimensions combined. In tcsh. If all metric axes have been considered the stop. The distance between the two code fragments is given as a summation of comparison values as well as of insertion and deletion costs corresponding to insertions and deletions that have to be applied in order to achieve the best alignment between these two code fragments. else go to Step 3. our analysis has discovered 39 clusters or groups of similar functions of average size 3 functions per cluster resulting in a total of 17. Dynamic Programming Based Similarity Analysis The Dynamic Programming pattern matcher is used (Konto. In bash. M.3. is calculated using the function . a 45 kLOC Unix shell program. the analysis has discovered 25 clusters.0 the more accurate the result is. of average size 5.84 functions per cluster.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 87 then contains potential code clone fragments under the metric criterion Mi.m in the current metric axis Mcurr > intersect with all clusters Cj^k in one of the non used metric axis Mj. Mark Mi as used Set the 3. The features are stored as attribute values in aframe-basedstructure representing expressions and statements in the AST.

Vy) is the the distance between two feature vectors Vx.j~l. data types used or set.M)-h D{£{l.M)) D{£{l. The comparison cost is calculated by comparing the corresponding feature vectors.£{lJ-l. used per statement. Currently. • • • • • • • Al is the model code fragment 7^ is the input codefragmentto be compared with the model M £{h jt Q) is a program feature vectorfromposition / to position y in codefragmentQ -D(Vx .88 KONTOGIANNIS ET AL.0. and comparisons based on metric values (1) Note that insertion.p. 7 5 M) is the cost of deleting \hc']th statement of Al.V). D : Feature ^Vector X Feature^Vector — Real > where: A(p.P. at position / of the fragment ^ V /(i. 7^.£{lJ. The column labeled Distance Range gives the value range of distances between functions using the Dynamic Progranmiing approach. we compare ratios of variables set. The column labeled Partial Clones contains the percentage of functions which correspond .M)+ D{£{l. J. and deletion costs are used by the Dynamic Programming algorithm to calculate the best fit between two codefragments. X ) the cost of inserting the ith statement of V at position^* of the model M and C(^. J. Table 1 summarizes statistical data regarding false alarms when Dynamic Programming comparison was applied to functions that under direct metric comparison have given distance 0." The quality and the accuracy of the comparison cost is based on the program features selected and the formula used to compare these features. J. For simplicity in the implementation we have attached constant real values as insertion and deletion costs. Vy A(i.M)) and.V.V).nS{lJ-l.M)) = Mm{ I{p-lj^V. The column labeled False Alarms contains the percentage of functions that are not clones but they have been identified as such.p.M)^ C{p-lJ-l.An intuitive interpretation of the best fit using insertions and deletions is "if we insert statement i of the input at position 7 of the model then the model and the input have the smallest feature vector difference.p-l. V^ M) is the cost of comparing the ith statement of the codefragmentV with the j^Afragmentof the model M.

0% 16.0% 37. uses and definitions of variables).0 % 32.0% 82.0% 31.5-1. The matching process between two code fragments M and V is discussed with an example later in this section and is illustrated in Fig. Within the experimentation of this approach we used the following three different categories of features 1.0 % 3. There are many program features that can be considered to characterize a code fragment (indentation. 2.e.0 -15.3 The comparison cost function C{i.2.0 % 0.e numbers.V) is the key factor in producing the final distance result when DP-based matching is used.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 89 Table 1.01 .3.0% 0.M. metrics. definitions and uses of variables as well as.0% 89. for the first two categories is calculated as : .0% 100.0% 8.99 2.0% only in parts to cut and paste operations. the column labeled as Positive Clones contains the percentage of functions clearly identified as cut and paste operations. in a printf statement). > strings) within a statement (i.99 4.0 . (B) Feature2 .0-1.99 3.0 .0% 30.0% 56.j.5. False alarms for the Clips program Distance Range False Alarms Partial Clones 0.0% 10.0 % 10.0% 36.0% 6.0% 33. definitions and uses of data types : (A) Featurei • Statement within a statement.0. (B) Feature2 • Statement within a statement — String denotes the set of data type names used in > — String denotes the set of data type names defined > The comparison cost of the ith statement in the input V and the jth statement of the model M. Finally.49 1. keywords.0% 78.0 .99 1.0 % 13.0 0.0% Positive Clones 90.0% 32.0% 8. literal values within a statement: (A) Featurei : Statement statement.0 0.99 6.Statement statement — String denotes the set of variables used in within a > -^ String denotes the set of variables defined within a (C) Features • Statement — String denotes the set of literal values (i.

how much noise in terms of insertions and deletions is allowed before the matcher fails).e. zero distance) using the direct per function metric comparison.90 KONTOGIANNIS ET AL.axis) that have been already identified as clones (i. A lower deletion cost indicates the preference of the Ubcr to accept a code fragment V that is written by deleting statements from the model M.) \ A:=l . five metric values which are calculated compositionally from the statement level to function and file level: The comparison cost of the ith statement in the input V and the jth statement of the model M when the five metrics are used is calculated as : C{VuMj) ^^{Mk{V. The opposite holds when the deletion cost is lower than the corresponding insertion cost. In Fig. Note that in the Dynamic Programming based approach the metrics are used at . The following points on insertion and deletion costs need to be discussed. . 3. A lower insertion cost than the corresponding deletion cost indicates the preference of the user to accept a code fragment V that is written by inserting new statements to the model M. while the solid line shows the distance results obtained when the five metrics are used as features. terminate matching if a certain threshold is exceeded). The values for insertion and deletion should be higher than the threshold value by which two statements can be considered "similar". The dashed line shows distance results when definitions and uses of variables are used as features in the dynamic programming approach. or in other words how many features are used.e. while smaller values indicate higher tolerance. • • When different comparison criteria are used different distances are obtained.2 (Clips) distances calculated using Dynamic Programming are shown for 138 pairs of functions (X . especially if cutoff thresholds are used (i. • The insertion and deletion costs reflect the tolerance of the user towards partial matching (i. Insertion and deletion costs are constant values throughout the comparison process and can be set empirically. otherwise an insertion or a deletion could be chosen instead of a match. Higher insertion and deletion costs indicate smaller tolerance. *' ^ 1 Y^ card{InputFeaturem{Vi) O ModelFeaturem{-Mj)) V ^ card{InputFeaturem{Vi)UModelFeaturemMj)) where v is the size of the feature vector.e.MUMj))^ (3) Within this framework new metrics and features can be used to make the comparison process more sensitive and accurate.

_ Distances on data and control flow measurements 3h 1 C 40 60 80 Function Pairs 100 120 140 0 Figure 2. Distances between function pairs of possible function clones using DP-based matching.Distances on definitions and uses of variables _ Distances on data and controlflowmeasurements. As an example consider the following statements M and V: ptr = head.> i t e i t i == s e a r c h l t e m ) found = 1 else ptr = ptr->next.Distances on definitions and uses of variables Distances between Function Pairs (Bash) . instead of the begin-end block level when metrics direct comparison is performed. while(ptr != NULL && !found) { if(ptr->item == searchltem) .PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 91 Distances between Function pairs (Clips) . the statement level. while(ptr != NULL && !found) { i f ( p t r .

The comparison of the two composite while statements in the first grid at position (0.t h e n . deletions as vertical lines and.e l s e statements at position (1. . Insertions are represented as horizontal hnes.1) initiates a new nested match. the comparison of the composite t h e p a r t of the i f . In the third grid. ^ ptr I-. In the second grid the comparison of the composite i f . i£().92 KONTOGIANNIS ET AL.e l s e statements initiates the final fourth nested match. matches as diagonal hnes. %s\n". The matching process between two code fragments. In the first grid the two code fragments are initially considered. } else ptr = ptr->next. •Ise-part then-part ptr->lten •> 1 y.. 1). Finally. t^—1— than-purt M A ^-l-T£ounJk> 1 \ ^-^ alfls part prlntfO..0). 3. { printf("ELEMENT FOUND found = 1. .t h e n . 0) of the first grid a deletion is considered as it gives the best cumulative distance to this point (assuming there will be a match at position (0. 1). At position (0. i2i found • 1 ptx->it«m H . an insertion has been detected. 0). initiates a nested match (second grid). The Dynamic Programming matching based on definitions and uses of variables is illustrated in Fig. in the fourth grid at position (0. Figure 3. as it gives the best cumulative distance to this point (assuming a potential match in (1. searchltem).

1994). The probability that such a description matches with a code fragment is used to calculate a similarity measure between the description and the implementation. The concept language contains: . The selection of a fragment in the code to be compared with the conceptual representation. (Church. concepts are represented as abstract-descriptions using a concept language called ACL. (Biggerstaff. The intuitive idea is that a concept description may match with a number of different implementations. 1994) problem consists of assigning concepts described in a concept language to program fragments. source code is represented as an annotated AST Tc. The reason for this transformation is to reduce the complexity of the matching algorithm as Ta and Tc may have a very complex and different to each other structure. 1992). We use REFINE to build and transform both ASTs. An abstract-description is parsed and a corresponding AST Ta is created.1. The measure of similarity. sets of constraints between components to be retrieved (Ning. A concept language specifies in an abstract way sequences of design concepts. 3.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 93 When a nested match process finishes it passes its result back to the position from which it was originally invoked and the matching continues from this point on. 1992). Both Ta and Tc are transformed into a sequence of abstract and source code statements respectively using transformation rules. Similarly. 1994). In this approach feature vectors of statements are matched instead of Abstract Syntax Trees. The associated problems with matching concepts to code include : • • • The choice of the conceptual language. Concept assignment can also be seen as a matching problem. Moreover. (Rich. query pattern languages (Paul. 1994). 1993). 1990). In our approach. Language for Abstract Representation A number of research teams have investigated and addressed the problem of code and plan localization. the implementation of the Dynamic Programming algorithm is cleaner and faster once structural details of the ASTs have been abstracted and represented as sequences of entities. These problems are addressed in the following sections. Current successful approaches include the use of graph granmiars (Wills. (Muller. and summary relations between modules and data (Canfora. 3. 1992). In our approach a stochastic pattern matcher that allows for partial and approximate matching is used. Concept To Code Matching The concept assignment (Biggerstaff.

A type variable can generate (match) with any actual variable in the source code provided that they belong to the same data type category. characters that may used in the text of a code statement Metrics : a vector of five different complexity. Uses of variables : variables that are used in a statement or expression Definitions of variables'. Numeral: Representing Int. 2. 5. 4. Currently the following abstract types are used : 1.): To indicate one statement follows another Choice ( 0 ) : To indicate choice (one or the other abstract statement will be used in the matching process Inter Leaving (|| ) : to indicate that two statements can be interleaved during the matching process . Sequencing (. The correspondence between an abstract expression and the source code expression that it may generate is given at Table 3 Abstract feature descriptions T that contain the feature vector data used for matching purposes. An example is when we are looking for a Traversal of a list plan but we do not know the name of the pointer variable that exists in the code. and float types 2. 3. ariables that are defined in a statement or expression Keywords: strings. data and control flow metrics. 4. • • Typed Variables X Typed variables are used as a placeholders for feature vector values.94 KONTOGIANNIS ET AL. Currently the features that characterize an abstract statement and an abstract expression are: 1. numbers. For example a List type abstract variable can be matched with an Array or a Linked List node source code pointer variable. Character : Representing char types List: Representing array types Structure : Representing struct types Named : matching the actual data type name in the source code • Operators O Operators are used to compose abstract statements in sequences. 3. Currently the following operators have been defined in the language but only sequencing is implemented for the matching process : 1. when no actual values for the feature vector can be provided. 2. 3. • Abstract expressions £ that correspond to source code expression.

Generation (Allowable Matching) of source code statements from ACL statements ACL Statement Abstract Iterative Statement Abstract While Statement Abstract For Statement Abstract Do Statement Abstract Conditional Statement Abstract If Statement Abstract Switch Statement Abstract Return Statement Abstract GoTo Statement Abstract Continue Statement Abstract Break Statement Abstract Labeled Statement Abstract Statement* Generated Code Statement While Statement For Statement Do Statement While Statement For Statement Do Statement If Statement Switch Statement If Statement Switch Statement Return Statement GoTo Statement Continue Statement Break Statement Labeled Statement Zero or more sequential source code statements One or more sequential source code statements AhstractStatement^ .PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 95 Table 2.

while . Included plans are incorporated in the current pattern's AST at parse time. 19890). As an example consider the inlining p l a n : traversal-linked-list that is used to include an instance of the traversal-linked-list plan at a particular point of the pattern. This pattern expresses an iterative statement (e. array. (Chikofsky.s t a t e m e n t contains a sequence of one or more stateme nts (+-statement) .g. Special macro definition statements in the Abstract Language are used to include the necessary macros. include definitions: These are special statements in ACL that specify the name of the plan to be included and the file it is defined.96 KONTOGIANNIS ET AL. The body of I t e r a t i v e .acl. 1992). In a pattern more than one occurrence of an included plan may appear. then special preprocessor statements can be used to include this plan to compose more complex patterns. Table 3. inline uses : These are statements that direct the parser to inline the particular plan and include its AST in the original pattern's AST. Currently there are two types of macro related statements 1. linked list) and the conditional expression contains the keyword "NULL". In this way they are similar to inline functions in C++. A typical example of a design concept in our concept language is given below. Macros are entities that refer to plans that are included at parse time.acl traversal-linked-list that imports the plan traversal-linked-list defined in file planl. 2. do loop that has in its condition an inequality expression that uses variable ?x that is a pointer to the abstract type l i s t (e.g.for. For example if a plan has been identified and is stored in the plan base. Generation (Allowable Matching) of source code expressions from ACL expressions ACL Expression Abstract Function Call Abstract Equality Abstract Inequality Abstract Logical And Abstract Logical Or Abstract Logical Not Generated Code Expression Function Call Equality (==) Inequality (\ =) Logical And (Sz&z) Logical Or (\\) Logical Not (!) • Macros M Macros are used to facilitate hierarchical plan recognition (Hartman. As an example consider the statement i n c l u d e planl.

."member") && notlnOrig ) ) if (strcmp(field->Avalue.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 97 that uses at least variable ?y (which matches to the variable obj) in the code below and contains the keyword meniber.origObj. field->Avalue). Concept-tO'Code Distance Calculation In this section we discuss the mechanism that is used to match an abstract pattern given in ACL with source code.2. keywords : [ "NULL" ]) { -(--Statement abstract-description uses : [?y : string. } } 3. { Iterative-statement(Inequality-Expression abstract-description uses : [ ?x : *list].origObj) || (!strcmp(field->AvalueType. Assignment-Statement abstract-description uses : [?x. defines variable ?x which in this example matches to variable f i e l d .. ] . ."method") != 0) INSERT_THE_FACT(o->ATTLIST[num]. and an Assignment-Statement that uses at least variable ?x.Aname. keywords : [ "next" ] A code fragment that matches the pattern is: { while (field != NULL) { if (!strcmp(obj. and contains the keyword next. . field = field->nextValue.] keywords : [ "member" ]. . defines : [?x].

A Viterbi (Viterbi. In general the matching process contains the following steps : 1. For example an i f statement will be decomposed as a sequence of an e x p r e s s i o n (for its condition).5^) is parsed and an AST Tc is created. A similarity measure is established by this comparison between the features of the abstract statement and the features of the source code statement. . b) S2]S^]. A model can be in a state with certain probability. 1967) algorithm is used to find the best fit between the Dynamic Model and a code sequence selected from the candidate list. Once a candidate list of code fragments has been chosen the actual pattern matching takes place between the chosen statement and the outgoing transitions from the current active APM's state. Moreover.. Composite statements generate nested matching sessions as in the DP-based code-to-code matching. 3. If the type of the abstract statement the transition points to and the source code statement are compatible (compatibility is computed by examining the Static Model) then feature comparison takes place. a transition to another state can be taken with a given probability.. 4. The ACL pattern {Ai. Source code (^i. The selection of a code fragment to be matched with an abstract description is based on the following criteria : a) the first source code statement Si matches with the first pattern statement Ai and. If composite statements are to be compared. A Static Model called SCM provides the legal entities of the source language.. A transformation program generates from Ta a Markov Model called Abstract Pattern Model (APM). 2.Sk belong to the innermost block containing Si The process starts by selecting all program blocks that match the criteria above. This feature comparison is based on Dynamic Programming as described in section 2. The intuitive idea of using Markov models to drive the matching process is that an abstract pattern given in ACL may have many possible alternative ways to generate (match) a code fragment. From a state. an expansion function "flattens" the structure by decomposing the statement into a sequence of its components. The underlying finite-state automaton for the mapping between a APM state and an SCM state basically implements the Tables 2. 3.A^) is parsed and an AST Ta is created.. . 6. A Markov model provides an appropriate mechanism to represent these alternative options and label the transitions with corresponding generation probabilities. .3. A transition is associated with the generation (recognition) of a symbol with a specific probability. the Vitrebi algorithm provides an efficient way to find the path that maximizes the overall generation (matching) probability among all the possible alternatives. its then part and its e l s e part.. Candidate source code sequences are selected.98 KONTOGIANNIS ET AL. 5. A Markov model is a source of symbols characterized by states and transitions.

. This corresponds to approximating (4) as follows (Brown. For example a transition in APM labeled as (pointing to) an A b s t r a c t while S t a t e ment is linked with the while node of the static model. A measure of similarity between Tc and Ta is the following probability where.Sk\Ai.. Let iSi. An approximation of (4) is thus introduced.... 5fc and a pattern A = Ai. In its turn a while node in the SCM describes in terms of states and transitions the syntax of a legal while statement in c.5. because of complexity issues related to possible variations in Ta generating Tc.. a sequence of abstract descriptions is produced...4. ^2...(5i.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 99 3.3. The sequence of abstract descriptions Aj forms a pattern A in Abstract Code Language (ACL) and is used to build dynamically a Markov model called Abstract Pattern Model (APM).. ACL Markov Model Generation Let Tc be the AST of the code fragment and Ta be the AST of the abstract representation.(5i. 1992): Pr{Tc\Ta) c^ P... permanently available Markov model called a Source Code Model (SCM). Nodes in the APM represent Abstract ACL Statements and arcs represent transitions that determine what is expected to be matched from the source code via a link to a static. This is determined by examining the reachable APM transitions at the ith step. The probability in (1) cannot be computed in practice. .5fc be a sequence of program statements During the parsing that generates Ta. The best alignment between a sequence of statements S = 5i..A2]. The Source Code Model is an alternative way to represent the syntax of a language entity and the correspondence of Abstract Statements in ACL with source code statements.|%. The Abstract Pattern Model is generated an ACL pattern is parsed.An) = .Aj is computed by the Viterbi (Viterbi. (rci..)) (7) where/(/) indicates which abstract description is allowed to be considered at step /.0 must be satisfied and ^/(fc) corresponds to a final APM state. An example of which is given in Fig..ran"'raL) (6) is the sequence of rules used for generating Ta.52. Each of these descriptions is considered as a Markov source whose transitions are labeled by symbols Aj which in turn generate (match) source code.. 1967) dynamic programming algorithm using the SCM and a feature vector comparison function for evaluating the following type of probabilities: P. For the matching to succeed the constraint P^(*S'i|Ai) = 1.rcj (5) is the sequence of the grammar rules used for generating Tc and {ra^... ..rc.

52. 4 generated by the pattern ^ i .S3\As)=Max PriSi._l|Al.%^-l))•Pr(5^|%i))) i=l (8) This is similar to the code-to-code matching. The way to calculate similarities between individual abstract statements and code fragments is given in terms of probabilities of the form Pr{Si\Aj) as the probability of abstract statement Aj generating statement Si. A dynamic model for the pattern Al\ A2*.100 KONTOGIANNIS ET AL.^2.. A3* Pr{Si\Ai) = 1. we allow matching abstract description features with source code features. The dynamic model (APM) guarantees that only the allowable sequences of comparisons are considered at every step.S2\As)'Pr{Ss\As) (12) .. ^maa:(P^(5l. ^2 5 ^3» where Aj is one of the legal statements in ACL.5. The magnitude of the logarithm of the probability p is then taken to be the distance between Si and Aj. S'a: Figure 4.. The difference is that instead of matching source code features. The value ofp is computed by multiplying the probability associated with the corresponding state for Aj in SCM with the result of comparing the feature vectors of Si and Aj. Then the following probabilities are computed for a selected candidate code fragment 5i. As an example consider the APM of Fig. The feature vector comparison function is discussed in the following subsection. The probability p = Pr{Si\Aj) = Pscm{Si\Aj) * Pcomp{Si\Aj) is interpreted as "The probability that code statement Si can be generated by abstract statement Aj".0 [delineation • Pr{S2\A2) ' Pr{S2\As) criterion) (9) (10) (11) Pr{Su S2\A2) = PriSllAi) PriSuS2\As) = PriSMl) Pr{SuS2\A2)'Pr{Ss\A3) Pr{SuS2.52.

the expression in the while loop is more likely to be an inequality (Fig. Similarly. The preferred probabilities can be specified by the user while he or she is formulating the query using the ACL primitives. In such a scenario the i t e r a t i v e abstract statement can be considered to generate a while statement with higher probability than a for statement.Ss\A2) = Pr{Si. In the above mentioned example of the T r a v e r s a l of a l i n k e d l i s t plan the I t e r a t i v e . It can also be empirically set to be proportional to the amount of data stored in the cache. (equations 12 and 13) two transitions have been consumed and the reachable active states currently are A2 or A3. 1990).S t a t e m e n t pattern usually is implemented with a while loop. in the T r a v e r s a l of a l i n k e d l i s t plan the while loop condition. which is an expression. Here we assume for simplicity that only four C expressions can be generated by a P a t t e r n Expression. Once the system is used and results are evaluated these probabilities can be adjusted to improve the performance.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 101 Pr{SuS2. Moreover at every step the probabilities of the previous steps are stored and there is no need to be reevaluated. 5).S2\A2) ' Pr{Ss\A2) (13) Note that when the first two program statements 81^82 have already been matched. The value of A can be computed by deleted-interpolation as suggested in (Kuhn.A) • Pstatic{Si\Aj) (14) In this formula Pcache{Si\Aj) represents the frequency that Aj generates Si in the code examined at run time while PstaticiSi\Aj) represents the a-priori probability of Aj generating Si given in the static model. . The choice of the weighting factor A indicates user's preference on what weight he or she wants to give to the feature vector comparison. 5. The initial probabilities in the static model are provided by the user who either may give a uniform distribution in all outgoing transitions from a given state or provide some subjectively estimated values. An example of a static model for the p a t t e r n . For example. most probably generates an i n e q u a l i t y of the form (list-node-ptr 1= NULL) which contains an identifier reference and the keyword NULL. With each transition we can associate a list of probabilities based on the type of expression likely to be found in the code for the plan that we consider. Higher A values indicate a stronger preference to depend on feature vector comparison. Pcache{Si\Aj) + ( 1 . 1990).e x p r e s s i o n is given in Fig. These values may come from the knowledge that a given plan is implemented in a specific way. Static probabilities can be weighted with dynamically estimated ones as follows : Pscm{Si\Aj) = X . For example Pr{Si^S2\A2) is computed in terms of Pr{Si\Ai) which is available from the previous step. A cache is used to maintain the counts for most frequently recurring statement patterns in the code being examined. A is a weighting factor. Probabilities can be dynamically adapted to a specific software system using a cache memory method originally proposed (for a different application) in (Kuhn. Lower A values indicate preference to match on the type of statement and not on the feature vector.

^*\ Equality \ / 1.5 -Args expression expression Figure 5. different cache memories can be introduced.0 Arg2 expression 0.0 I is-a-id-ref 7 Pattern \ \ 0.0 expression Argl 1. Variables defined V : Source-Entity — {String} > 2. 3. For example the traversal of linked-list plan may have higher probability attached to the is-an-inequality transition as the programmer expects a pattern of the form (field f= NULL) As proposed in (Kuhn.^^^ / / / Pattern 0. The features used for comparing two entities (source and abstract) are: 1.0 ^'''^>. 1990). Feature Vector Comparison In this section we discuss the mechanism used for calculating the similarity between two feature vectors.25^^^. one for each Aj. The static model for the expression-pattern.. Aj returns a value p = Pr{Si\Aj). Specific values of A can also be used for each cache. Variables usedU : Source-Entity — {String} > .102 KONTOGIANNIS ET AL.25 / 1 is-an-inequality ^^-. The feature vector comparison of Si.25 \ V Id-Ref \ is-a-function-call / id-ref 1.-^-'*''^^ y Inequality J 1.0 Arg2 expression r Pattern \ is~an-equality lExpression 7 1.4. Different transition probability values may be set by the user for different plans. Note that Si's and ^^'s feature vectors are represented as annotations in the corresponding ASTs. / Pattern \ ^.0 Argl expression 1.^/ Pattern \ V Fcn~Call / id-ref Fen-Name 0.

Let Si be a source code statement or expression in program C and Aj an abstract statement or expression in pattern A. System Architecture The concept-to-code pattern matcher of the Ariadne system is composed of four modules. Such a parser builds at run time. For example a new feature may be a link or invocation to another pattern matcher (i. The ACL AST is built using Refine and its corresponding domain model maps to entities of the C language domain model. or in other words how many features are used. . Within this context two strings are considered similar if their lexicographical distance is less than a selected threshold. AbstractFeaturCj^n is the nth feature of the ACL statement Aj. and the comparison of an abstract entity with a code entity is valid if their corresponding metric values are less than a given threshold. set of Strings or set of Numbers.e. an Abstract-Iterative-Statement corresponds to an Iterative-Statement in the C domain model. As in the code to code dynamic programming matching. Within this framework we experimented with the following similarity considered in the computation as a probability: /CM \ comp % 3 -'• ^r^ car d{ Abstract Feature j^n^CodeF eaturci^n) ^ £^ card{AbstractFeaturej^n ^ CodeFeaturCi^n) where v is the size of the feature vector. For example. Let the feature vector associated with Si be Vi and the feature vector associated with Aj be Vj. an AST for the ACL pattern provided by the user. These themes show that ACL is viewed more as a vehicle where new features and new requirements can be added and be considered for the matching process.e. lexicographical distances between variable names (i. next. Keywords /C : Source-Entity — {String} > 4. SCRUPLE) so that the abstract pattern in ACL succeeds to match a source code entity if the additional pattern matcher succeeds and the rest of the feature vectors match. 4. Thefirstmodule consists of an abstract code language (ACL) and its corresponding parser. Metrics • • • • • Fan out All : Source-Entity — Number > D-Complexity M2 . CodeFeaturei^n is the nth feature of source statement Si and.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 103 3. next value) and numerical distances between metrics are used when no exact matching is the objective.Source-Entity -^ Number McCabe Ms : Source-Entity -^ Number Albrecht M4 : Source-Entity —^ Number Kafura M5 : Source-Entity — Number > These features are AST annotations and are implemented as mappings from an AST node to a set of AST nodes.

States represent Abstract Statements and are nodes of the ACL's AST. and a function c a l l . is the set of states. is a set of final states. Incoming transitions represent the nodes of the C language AST that can be matched by this Abstract Statement. A Static explicit mapping between the ACL's domain model and C's domain model is given by the SCM (Source Code Model). field->Avalue). The Viterbi algorithm is used to evaluate the best path from the start to the final state of the APM. i d e n t i f i e r r e f e r e n c e . given a model M = A\\A2\ . is matched with the abstract pattern Expression(abstract-description uses : ["ATTLIST". SCM consists of states and transitions. 5. Formally APM is an automaton <Q. and provide the pattern statements to be considered for the next matching step.origObj. is a transition function implementing statement expansion (in the case of composite abstract or C statements) and the matching process qo.104 KONTOGIANNIS ET AL. F> where • • • • • Q. Finally.i^ the Initial state. 5 where it is assumed for simplicity that an Abstract Pattern Expression can be matched by a C i n e q u a l i t y . The matching process stops when one of the final states have been reached and no more statements from the source code can be matched. An example of a match between two simple expresssions (di function call and an AbstractExpression is given below : INSERT_THE_FACT(o->ATTLIST[num]. taken from the domain of ACL's AST nodes S. APM consists of states and transitions. is the input alphabet which consists of nodes of the C language AST <5. States represent nodes of the ACL's AST. Transitions have initially attached probability values which follow a uniform distribution. Transitions model the structure of the pattern given. The set of outgoing transitions must match the first statement in the code segment considered. This model directly reflects the structure of the pattern provided by the user. 82'.-Sk. "Avalue"] Keywords : ["INSERT". e q u a l i t y . The third module builds the Abstract Pattern Model at run time for every pattern provided by the user. F. A subpart of the SCM is illustrated in Fig. The algorithm starts by selecting candidate code fragments V = Si. E. "Aname". ..Aname. qo.An. Ariadne's second module. the fourth module is the matching engine. "FACT"] ) .

The user may provide such a value if a plan favours a particular type instead of another. statement. These legal values are provided by the binding table and are initialized every time a new pattern is tried and a new APM is created.25. Ariadne maintains a global binding table and it checks if the given pattern variable is bound to one of the legal values from previous instantiations. For example in the T r a v e r s a l of a 1 inked l i s t plan the loop statement is most likely to be a whi 1 e loop. To reduce complexity when variables in the pattern statement occur. partial and inexact matching can be computed. As the pattern statement does not specify what type of expression is to be matched the static model (SCM) provides an estimate. 5) so the matching can proceed. Code-to-code matching is used for clone detection and for computing similarity distances between two code fragments. Once a final value is set then a record < abstract jpattern. This is very important as the programmer may not know how to specify in detail the code fragment that is sought. It is based on a) a dynamic programming pattern matcher that computes the best alignment between two code fragments and b) metric values obtained for every expression. and lexicographical distances between variable names in the abstract and source statement. The dynamic programming pattern matcher produces more accurate results but the metrics approach is cheaper and can be used to limit the search space when code fragments are selected for comparison using the dynamic progranuning approach. allow for partial and inexact matching. The next step is to compare features.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 105 In this scenario both abstract and code statements are simple and do not need expansion. 5 the likelihood that the Expression generates a function call is 0. In this paper we have presented a number of pattern matching techniques that are used for codeto-code and concept-to-code matching. Instead.. Expression and INSERT_THE_FACT(. The process ends when a final state of the APM has been reached and no more statements match the pattern. matched-Code. The main objective of this research was to devise methods and algorithms that are time efficient. Conclusion Pattern matching plays an important role for plan recognition and design recovery. distance-value > is created and is associated with the relevant transition of the APM. We have experimented with different code features for comparing code statements and are able to detect clones in large software systems > 300 KLOC. Moreover. and tolerate a measure of dissimilarity between two code fragments. clone detection is used to identify "conceptually" related operations in the source code.. and block of the AST. For code representation schemes the program's Abstract Syntax Tree was used because it maintains all necessary information without creating subjective views of the source code (control or dataflowbiased views).) are type compatible statements because an expression can generate a function call (Fig. Metrics are calculated by taking into account a number of control and data program properties. The finalvalue is obtained by multiplying the value obtained from the feature vectors comparison and the probability that Expression generates a Function Call. With this approach the matching process does not fail when imperfect matching between the pattern and the code occurs. 5. The performance . In the SCM given in Fig.

can be added to the features of the language as requirements for the matching process. When the DP approach was used. b) querying digital databases that may contain partial descriptions of data and c) recognizing concepts and other formalisms in plain or structured text (e.e. Clone detection analysis reveals clusters of functions with similar behaviour suggesting thus a possible system decomposition.106 KONTOGIANNIS ET AL. For 30KLOCS of the CLIPS system and for selecting candidate clones from approximately 500. Integration between the Ariadne tool and the Rigi tool is achieved via the global software repository developed at the University of Toronto. line numbers. or invocations and results from other pattern matching tools. Currently the system is used for system clustering. The corresponding DP-based algorithm implemented in Lisp took 3.HTML) . when DP is considered) of code fragment pairs from a pool of half a million possible pairs that could have been considered in total.g. Our current research efforts are focusing on the development of a generic pattern matcher which given a set of features. where m is the size of the model and n the size of the input. is limited by the fact we are using a LISP environment (frequent garbage collection calls) and the fact that metrics have to be calculated first. does not require any knowledge of the system and is computationally acceptable 0{n * m) for DP. The significant gain though in this approach is that we can limit the search space to a few hundreds (or less than a hundred. Even if the noise presents a significant percentage of the result. A problem we foresee arises when binding variables exist in the pattern. Moreover.this ratio dropped to approximately 10% in average (when zero distance is reported). Markov models and the Viterbi algorithm are used to compute similarity measures between an abstract statement and a code statement in terms of the probability that an abstract statement generates the particular code statement. This analysis is combined with other data flow analysis tools (Konto.5 minutes to complete. the method is fully automatic. statement count). New features. The false alarms using only the metric comparison was on average for the three systems 39% of the total matches reported. redocumentation and program understanding. an abstract pattern language. Concept-to-code matching uses an abstract language (ACL) to represent code operations at an abstract level. it can be filtered in almost all cases by adding new metrics (i. 1994) to obtain a multiple system decomposition view. The ACL can be viewed not only as a regular expression-like language but also as a vehicle to gather query features and an engine to perform matching between two artifacts. For the visualization and clustering aspect the Rigi tool developed at the University of Victoria is used. as opposed to a Lisp implementation that took 1.000 pairs of functions the C version of the clone detection system run in less than 10 seconds on a Sparc 10. When the algorithm using metric values for comparing program code fragments was rewritten in C it performed very well. Halstead's metric. and an input code fragment can provide a similarity measure between an abstract pattern and the input stream..9 minutes to complete. If the pattern is vague then complexity issues slow down the matching process. 
The way we have currently overcome this problem is for every new binding to check only if it is a legal one in a set of possible ones instead of forcing different alternatives when the matching occurs. Such a pattern matcher can be used a) for retrieving plans and other algorithmic structures from a variety of large software systems ( aiding software maintenance and program understanding).

. 3. M..). "Program Understanding and the Concept Assignment Problem". IBM Centre for Advanced Studies. al. ACM SIGPLAN Conference on Programming Language Design and Implementation.. et. M. The Spearman-Pearson rank correlation test was used. pp. "A Logic-Based Approach to Reverse Engineering Tools Production" Transactions of Software Engineering. Vol. Journal of Computational Linguistics. D. "Localization of Design Concepts in Legacy Systems". A. pp. 37. 153-174. The terms do not refer to illegal or unethical activities such as the reverse compilation of object code to produce a competing product. References Adamov.. Church. No. U. Baker S. 2. B. In this paper. R. "Identifying the semantic and textual differences between two versions of a program. 3. In Proceedings of International Conference on Software Maintenance 1994.. Cimitile.. Canfora.. Johnson.28. Carlini. "The Software Refinery" and REFINE are trademarks of Reasoning Systems. Chapman and Hall.. Kuhn." IEEE Software. 234-245. 1-8. Vol. AAAr92.5. Halstead. San-Jose. C-Language Integrated Production System User's Manual NASA Software Technology Division. Communications of the ACM. . pp. 1977. G. Hartman. H. Webster. Houston. No. Johnson Space Center.. 1988. E. H. TX. and Cross. "Identifying Redundancy in Source Code Using Fingerprints" In Proceedings of GASCON '93. DeMori. July 1995 Biggerstaff. Chikofsky. we investigate the use of the cloning detection technique to identify similar operations on specific data types so that generic classes and corresponding member functions can be created when migrating a procedural system to an object oriented system.J.. 1991. H.18. IBM Systems Journal. pp.. B.. No. IEEE Transactions on Pattern Analysis and Machine Intelligence. Mitbander. R.. pp. Zurich: Institutfur Informatik der Universitat Zurich. E. pp. Toronto. October 24 . /.. al. "Elements of Software Science". Victoria.H. I. 33. 414-423. Computer Journal. "A Cache-Based Natural Language Model for Speech Recognition". CA. June 1990. 12. Jankowitz.4. May 1994.6. 18. Vol.1994.. June 1993. 13 -17. 171-183. T. Buss. 31. 4. Computational and Graphical Statistics 2.. Vol.. K. In Proc. "Class-Based n-gram Models of natural Language". June 1990. 73-83. September 1994. DeMori. P. R. "Investigating Reverse Engineering Technologies for the CAS Program Understanding Project". Jan. Horwitz S. E. 12. II. "Literature review on software metrics". pp. BC. Vol. T. We are using a commercial tool called REFINE (a trademark of Reasoning Systems Corp. New York: Elsevier North-Holland.. Canada. K. Helfman.1. Inc. J. Merlo. No. "Detecting plagiarism in student PASCAL programs". Brown et. 1987.2.1. pp. 477-500. R. 1053-1063. December 1992.. Kontogiannis. J. December 1992. Bernstein. No.467-479. Moreover. "Dotplot: a program for exploring self-similarity in millions of lines of text and code". "Technical Introduction to the First Workshop on Artificial Intelligence and Automated Program Understanding" First Workshop on Al and Automated Program Understanding. pp. 1990.PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 107 Another area of research is the use of metrics for finding a measure of the changes introduced from one to another version in an evolving software system. 570-583. pp. "Software metrics: a rigorous approach". E. Fenton. pp. "reverse engineering*' and related terms refer to legitimate maintenance activities based on sourcelanguage programs.. Toronto ON. Vol. "Reverse Engineering and Design Recovery: A Taxonomy. 
"On Finding Duplication and Near-Duplication in Large Software Systems" In Proceedings of the Working Conference on Reverse Engineering 1995. Notes 1.

"Recognizing a Program's Design: A Graph-Parsing Approach. pp. 96-103. April 1992. R. No. AI Lab No. Toronto. S." IEEE Software. S.342. Prakash. Information Theory. October 1990.J.. No.89. Whitney... "Pattern matching for Design Concept Localization". K. A. TR-74. Kontogiannis..5. In Proceedings of the Second Working Conference on Reverse Engineering. M. "Reverse Engineering. Jan 1990. August 1990.50-57. "A Framework for Source Code Search Using Program Patterns".. pp. Toronto. Kozaczynski. Spatial and Visual Representations of Software Structures. S. Corrie. Wills. pp. 13(2) 1967. In CSM'94 : Proceedings of the 1994 Conference on Software Maintenance. 463-475.. E.M. IBM Canada Ltd. 10... "Telos : A Language for Representing Knowledge About Information Systems.. May 1994. reusability.. Engberts. July 1995. redundancy : the connection". C. Bernstein.. Vol.. 1358. Vol. J. of Computer Science Technical Report KRR-TR-89-1. ON. September 1994. Tilley...M. L. "Error Bounds for Convolutional Codes and an Asymptotic Optimum Decoding Algorithm". Communications of the ACM. June 1994. 82 . Rich... American Programmer 3."Automated Program Recognition by Graph Parsing". A.. 20. H.. M. Wong. Canada. W.108 KONTOGIANNIS ET AL. H. pp. Mylopoulos.1992 . IEEE Trans. Rep. Dept. pp. M. A. "Domain-retargetable Reverse Engineeringll: Personalized User Interfaces". MIT Technical Report. K. 8-13.6. 086. pp. DeMori.. B. 336 . MoUer. Tilley. Merlo. J. "McCabe T. IEEE Transactions on Software Engineering. J.37. NIng. K. Paul. Viterbi. L.. and Wills." University of Toronto.. Galler. Software metrics: a practitioner's guide to improved product development" Muller.. Tech. "Automated Support for Legacy Code Understanding". Muller.

interprocess communication. The framework provides for the recognition of architectural features in program source code by use of a library of recognizers.. interfaces. Manufactured in The Netheriands. "Reverse Engineering to the Architectural Level" by Harris. layering. (Shaw. which appeared in the Proceedings of the 17th International Conference on Software Engineering. Extracting Architectural Features from Source Code* DAVID R. This paper was written while H. © 1995 ACM. Reubenstein and Yeh. Recognizers (individual source code query modules used to analyze the target program) are used to locate architectural features in the source code. "Recognizers for Extracting Architectural Features from Source Code" by Harris. Bedford. REUBENSTEIN * Mitretek Systems. USA HOWARD B. YEH The MITRE Corporation. which appeared in the Proceedings of the 2nd Working Conference on Reverse Engineering. 109-138 (1996) © 1996 Kluwer Academic Publishers. © 1995 IEEE.g. layers. ALEXANDER S. 2. Architectural features are the constituent parts of architectural styles (Perry and Wolf.Automated Software Engineering. Introduction We have implemented an architecture recovery framework on top of a source code examination mechanism. Examples of architectural styles include pipe and filter data processing. Keywords: Reverse engineering. MA 01730. USA drh@mitre. . software documentation 1. 25 Burlington Mall Road. Recovery of higher level design information and the ability to create dynamic software documentation is crucial to supporting a number of program understanding activities. Reubenstein was at GTE Laboratories. H. Software maintainers look for standard software architectural structures (e. HARRIS. MA 01803. objects) that the code developers had employed.org hbr@mitretek. Reubenstein and Yeh. July 1995. 1991) which in turn define organizational principles that guide a programmer in developing source code. Reubenstein's current address is listed above. 202 Burlington Road. April 1995. Moreover. 1992). Boston. Burlington. Recognizers are queries that analysts or applications can run against source code to identify portions of the code with certain static properties. abstract data type. Our goals center on supporting software maintenance/evolution activities through architectural recovery tools that are based on reverse engineering technology. 3.org Abstract. We also report on representation and organization issues for the set of recognizers that are central to our approach. recognizer authors and software analysts can associate recognition results with architectural features so that the code identified by a recognizer corresponds to an instance of the associated architectural This is a revised and extended version based on two previous papers: 1. and blackboard control processing. software architecture. The work reported in this paper was sponsored by the MITRE Corporation's internal research program and was performed while all the authors were at the MITRE Corp. Our tools start with existing source code and extract architecture-level descriptions linked to the source code firagments that implement architectural features.

As a starting point. The implementation provides for analyst control over parameterization and retrieval of recognizers from a library. they still only present static abstractions that focus on code level constructs rather than architectural features.. Recovery of higher level design information and the ability to create as-built software documentation is crucial to supporting a number of program understanding activities. While it is clear that every piece of software conforms to some design. feature addition. Reubenstein. Within our implementation. In Section 4. operating system port. REUBENSTEIN. Using the framework.S. HARRIS. We argue that it is practical and effective to automatically (sometimes semi-automatically) recognize architectural features embedded in legacy systems. Concretely. YEH feature. e. we have recovered constituent features of architectural styles in our laboratory experiments (Harris. Our motivation for building our recovery framework stems from our efforts to understand legacy software systems. In addition. Using these recognizers. For example. By stressing as-built. the representation of architectural styles provides knowledge of software design beyond that defined by the syntax of a particular language and enables us to respond to questions such as the following: • • • When are specific architectural features actually present? What percent of the code is used to achieve an architectural feature? Where does any particular code fragment fall in an overall architecture? The paper describes our overall architecture recovery framework including a description of our recognition library. We begin in Section 2 by describing the overall framework. in Section 3. it is often the case that existing documentation provides little clue to that design.110 D. program upgrade. language port. we emphasize how a program is actually structured versus the structure that designers sketch out in idealized documentation. while a system block diagram portrays an idealized software architecture description. analysts can recover multiple as-built views . The problem with conventional paper documentation is that it quickly becomes out of date and it often is not adequate for supporting the wide range of tasks that a software maintainer or developer might wish to perform.descriptions of the architectural structures that actually exist in the code. we have developed an extensive set of recognizers targeted for architecture recovery applications. we have used the recognizers in a stand-alone mode as part of a number of source code quality assessment exercises.g. we describe the underlying analysis tools of the framework. Yeh: ICSE. Our framework goes beyond basic tools by integrating reverse engineering technology and architectural style representations. Next. While these views are an improvement over detailed paper designs in that they provide accurate information derived directly from the source code. or program consolidation. it typically does not even hint at the source level building blocks required to construct the system.R. we describe . general maintenance.B. 1995). we address the gap between idealized architectural descriptions and source code and how we bridge this gap with architectural feature recognizers. These technology transfer exercises have been extremely useful for identifying meaningful architectural features. A. In Section 5. H. conmiercially available reverse engineering tools (Olsem and Sittenauer. 1993) provide a set of limited views of the source under analysis.

the aspects of the recognition library that support analyst access and recognizer authoring. In Section 6, we describe our experience in using our recovery techniques on a moderately sized (30,000 lines of code) system. Finally, we provide a very preliminary notion of code coverage metrics that researchers can use for quantifying recovery results. Related work and conclusions appear in Sections 7 and 8, respectively.

2. Architecture Recovery - Framework and Process

Our recovery framework (see Figure 1) spans three levels of software representation:

• a program parsing capability (implemented using Software Refinery (Reasoning Systems, 1990)) with accompanying code level organization views, i.e., abstract syntax trees and a "bird's eye" file overview
• an architectural representation that supports both idealized and as-built architectural representations with a supporting library of architectural styles and constituent architectural features
• a source code recognition engine and a supporting library of recognizers

Figure 1 shows how these three levels interact. The idealized architecture contains the initial intentions of the system designers. Developers encode these intentions in the source code. In bottom-up recovery, the legacy source code is parsed into an internal abstract syntax tree representation. We run recognizers over this representation to discover architectural features - the components/connectors associated with architectural styles (selecting a particular style selects a set of constituent features to search for). The set of architectural features discovered in a program forms its as-built architecture, containing views with respect to many architectural styles.

Figure 1. Architectural recovery framework

In addition, note that the as-built architecture we have recovered is both less than and more than the original idealized architecture. The as-built is less than the idealized because it may miss some of the designer's original intentions and because it may not be complete. We do not have a definition of a complete architecture for a system; the notion of code coverage described later in the paper provides a simple metric to use in determining when a full understanding of the system has been obtained. The as-built is also more than the idealized because it is up-to-date and because we now have on-line linkage between architectural features and their implementation in the code.

The framework supports architectural recovery in both a bottom-up and a top-down fashion. In bottom-up recovery, analysts use the bird's eye view to display the overall file structure and file components of the system. The features we display (see Figure 2) include file type (diamond shapes for source files with entry point functions, rectangles for other source files), name, pathname of directory, number of top level forms, and file size (indicated by the size of the diamond or rectangle). Since file structure is a very weak form of architectural organization, only shallow analysis is possible. From our point of view, however, the bird's eye view is a place where our implementation can register results of progress toward recognition of various styles.

Figure 2. Bird's Eye Overview

In top-down recovery, analysts use architectural styles to guide a mixed-initiative recovery process. Within our framework, an architectural style places an expectation on what recovery tools will find in the software system. That is, the style establishes a set of architectural feature types which define component/connector types to be found in the software. Recognizers are used to find the component/connector features. Some recognizers discover source code instances of entities where developers have implemented major components - "large" segments of source code (e.g., a layer may be implemented as a set of procedures). Component participation in a relation follows from the existence of a connector - a specific code fragment (e.g., a special operating system invocation) or the infrastructure that processes these fragments. This infrastructure may or may not be part of the body of software under analysis; it may be found in a shared library or it may be part of the implementation language itself. Once the features are discovered, the set of mappings from feature types to their realization in the source code forms the as-built architecture of the system.

2.1. Architectural Styles

The research community has provided detailed examples (Garlan and Shaw, 1993; Perry and Wolf, 1992; Shaw, 1989; Shaw, 1991; Tracz, 1994; Hofmeister, Nord, Soni, 1995) of architectural styles, and we have codified many of these in an architecture modeling language. Our architecture modeling language uses entity/relation taxonomies to capture the component/connector style aspects that are prevalent in the literature (Abowd, Allen, Garlan, 1993; Perry and Wolf, 1992). Entities include clusters, layers, objects, repositories, processing elements, and tasks. Relations such as contains, spawns, initiates, and is-connected-to each describe how entities are linked.

As an illustration, in a task spawning architectural style, tasks (i.e., executable processing elements) are linked when one task initiates a second task. Figure 3 details the task entity and the spawns relation associated with a task spawning style. Task spawning is a style that is recognized by the presence of its connectors (i.e., the task invocations). Its components are tasks, repositories, and task-functions. Tasks are a kind of processing element that programmers might implement by files (more generally, by call trees). Its connectors are spawns (invocations from tasks to tasks), spawned-by (the inverse of spawns), uses (relating tasks to any tasks with direct interprocess communications and to any repositories used for interprocess communications), and conducts (relating tasks to functional descriptions of the work performed). Spawns relates tasks to tasks (i.e., parent and child tasks respectively). Spawns might be implemented by objects of type system-call (e.g., in Unix/C, programmers can use a system, execl, execlp, execv, or execvp call to start a new process via a shell command). A default recognizer named executables will extract a collection of tasks. Analysts can use the default recognizer, find-executable-links, to retrieve instances of task spawning.

    defentity TASK
      :specialization-of processing-element
      :possible-implementation file
      :recognized-by executables

    defrel SPAWNS
      :specialization-of initiates
      :possible-implementation system-call
      :recognized-by find-executable-links
      :domain task
      :range task

Figure 3. Elements in an architecture modeling language

Many of the styles we work with have been elaborated by others (e.g., layered, object-oriented, pipe and filter, implicit invocation, abstract data type, repository). In addition we have worked with a few styles that have special descriptive power for the type of programs we have studied. These include application programming interface (API) use, the task spawning associated with real time systems, and a service invocation style. Space limitations do not permit a full description of all styles here. However, we offer two more examples to help the reader understand the scope of our activities.

Layered: In a layered architecture the components (layers) form a partitioning of a subset, possibly the entire system, of the program's procedures and data structures. As mentioned in (Garlan and Shaw, 1993), layering is a hierarchical style: the connectors are the specific references that occur in components in an upper layer and reference components that are defined in a lower layer. A layering can either be opaque - components in one layer cannot reference components more than one layer away - or transparent: components in one layer can reference components more than one layer away. One way to think of a layering is that each layer provides a service to the layer(s) above it.
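Layering constraints of this kind are easy to state over a call graph. The following C sketch, our own illustration rather than one of the paper's RRL recognizers, flags references that an opaque layering would forbid; the layer assignment and call edges are invented:

    #include <stdio.h>

    /* Layer assignments and call edges are invented for illustration. */
    struct edge { int caller, callee; };

    /* Opaque layering: a component may reference only its own layer or the
       layer directly beneath it. A transparent layering would also permit
       downward references that skip layers. */
    static int violates_opaque(const int *layer, struct edge e)
    {
        int drop = layer[e.caller] - layer[e.callee];
        return drop < 0 || drop > 1;   /* upward reference, or skipping a layer */
    }

    int main(void)
    {
        int layer[] = { 2, 2, 1, 1, 0 };   /* procedures 0..4: UI=2, logic=1, storage=0 */
        struct edge calls[] = { {0, 2}, {1, 3}, {2, 4}, {0, 4} };
        int n = sizeof calls / sizeof calls[0];

        for (int i = 0; i < n; i++)
            if (violates_opaque(layer, calls[i]))
                printf("call %d -> %d violates the opaque layering\n",
                       calls[i].caller, calls[i].callee);   /* flags 0 -> 4 */
        return 0;
    }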

Data Abstractions and Objects: Two related ways to partially organize a system are to identify its abstract data types and its groups of interacting objects (Abelson and Sussman, 1984; Garlan and Shaw, 1993). A data abstraction is one or more related data representations whose internal structure is hidden to all but a small group of procedures, the procedures that implement that data abstraction. An object is an entity which has some persistent state (only directly accessible to that entity) and a behavior that is governed by that state and by the inputs the object receives. These two organization methods are often used together. Often, objects are instances of classes that are described as types of abstract data, or conversely, the instances of an abstract data type are objects.

3. Recognizers

Recognizers map parts of a program to features found in architectural styles. The recognizers traverse some or all of a parsed program representation (abstract syntax tree, or AST) to extract code fragments (pieces of concrete syntax) that implement some architectural feature. Examples of these code fragments include a string that names a data file or a call to a function with special effects. The fragments found by recognizers are components and connectors that implement architectural style features. A component recognizer returns a set of code-fragments in which each code-fragment is a component. A connector recognizer returns a set of ordered triples of code-fragment, enclosing structure, and some meaningful influence such as a referenced file, executable object, or service. In each triple, the code-fragment is a connector, and the other two elements are the two components being connected by that connector.
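The triple form can be pictured concretely. A minimal C sketch of such a result record, in which every type name, field name, and file location is our own invention rather than the implementation's, might read:

    /* All type and field names below are invented for illustration. */
    struct fragment {
        const char *file;    /* source file containing the fragment */
        int         line;    /* location of the concrete syntax     */
        const char *text;    /* e.g., "system(cmd)"                 */
    };

    struct connector_triple {
        struct fragment connector;   /* the code fragment acting as connector */
        const char *from_component;  /* one end, e.g., the spawning task      */
        const char *to_component;    /* other end, e.g., the spawned task     */
    };

    /* One row of Table 1 below might then be recorded as: */
    static const struct connector_triple example = {
        { "run_snoopy.c", 42, "system(cmd)" }, "RUN_SNOOPY", "SNOOPY"
    };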

3.1. A Sample Recognizer

The appendix contains a partial listing of the recognizers we use. Here, we examine parts of one of these in detail. Table 1 shows the results computed by a task spawning recognizer (named Find-Executable-Links) applied to a network management program. This recognizer has a static view of a task: a task is the call tree subset of the source code that might be run when the program enters the task. For each task to task connector, the ordered triple contains the special function call that is the connector, the task which makes the spawn (one end of the connector), and the task that is spawned (invoked) by the call (the other end).

Table 1. The results of task spawning recognition

    Function Call    Spawning Task    Spawned Task
    system(...       RUN_SNOOPY       SNOOPY
    system(...       SNOOPY           EXNFS
    system(...       SNOOPY           EX69
    system(...       SNOOPY           EX25
    system(...       SNOOPY           EX21
    system(...       SNOOPY           SCANP
    execlp(...       MAIN             RUN_SNOOPY

Figure 4 shows the action part of the previously mentioned task spawning recognizer. The action part of a recognizer is written in our RRL (REFINE-based recognition language). The RRL code itself may call functions written either in RRL or REFINE. The main difference between RRL and REFINE is the presence in RRL of iteration operators that make it easy for RRL authors to express iterations over pieces of a code fragment. This recognizer examines an AST that analysts generate using a REFINE language workbench, such as REFINE/C (Reasoning Systems, 1992).

The recognizer calls the function invocations-of-type, which finds and returns a set of all the calls in the program to functions that may spawn a task. For each such call, the recognizer calls process-invoked to determine if a task is indeed spawned, and if so, get the target task being spawned. If process-invoked finds a target task, the recognizer then calls go-to-top-from-root, which finds the root of the task which made the call and then returns the entire call tree (the task) starting from that root. The target task is also in the form of the entire call tree starting from the target task's root function. These triples of function calls, spawning tasks and target tasks are saved in results and then returned by the recognizer.

    let (results = {})
      (for-every call in invocations-of-type('system-calls) do
        let (target = process-invoked(call))
          if ~(target = undefined) then
            let (root = go-to-top-from-root(call))
              results <- prepend(results, [call, root, target]));
    results

Figure 4. The action part of the task spawning recognizer Find-Executable-Links

Figure 5 shows what this task spawning recognizer examines when it encounters the special function call system(cmd), which is embedded in the task Run_snoopy and is used by Run_snoopy to spawn the task Snoopy. The command "system(cmd)" is a connector. Starting from that connector, the recognizer finds and connects Run_snoopy's call tree to Snoopy's call tree. The figure also shows processing details that are described in Section 4, where we highlight our underlying analysis capabilities.

Figure 5. Task spawning recognizer examines task Run_snoopy spawning task Snoopy via the connector "system(cmd)"

3.2. Rationale for Level of Recovery

In addition to architectural features actually found in the source code, we would like to recover the idealized architecture - a description of the intended structure for a program. Unfortunately, these idealized descriptions cannot be directly recognized from source code. The structural information at the code level differs from idealized descriptions in two important ways. First, while a program's design may commit to certain architectural features (e.g., use of Unix pipes or application programming interfaces), actual programs implement these features with source code constructs (e.g., procedure parameter passing, export lists) - a one-to-many conceptual shift from the idealized to the concrete. Second, there are differences due to architectural mixing/matching and architectural violations. Reasons for such violations are varied. Some are due to a developer's failure to either honor or understand the entailments of one or more architectural features. Other violations are due to the inability of an existing or required environment (e.g., language, host platform, development tools, or commercial enabling software) to adequately support the idealized view and may occur with the earliest engineering decisions. This erosion from the ideal usually increases over the life cycle of the program due to the expanding number of developers and maintainers who touch the code.

To overcome this difficulty we use a partial recognition approach that does not require finding full compliance to an idealized architecture nor does it bog down in a detailed analysis of all of the source code. As described at the start of Section 3, we aim our recognizers at extracting code fragments that implement specific architectural features. If we just target idealized architectures directly and do not search for architectural features as they are actually built in the source code, we risk missing important structures because the ideal does not exist in the code. Note that this restriction relaxes expectations that we will find fully formed instantiations of architectural styles in existing programs. Rather, our recognizers will find partial instantiations of style concepts and are tolerant of architectural pathologies such as missing components and missing connectors. Together, a collection of these code fragments forms a view on the program's as-built architecture, but generating such an aggregation is not the responsibility of the individual recognizers.

Among other things, the recognizers cover a wide spectrum of components and connectors that C/Unix programmers typically use for implementing architectural features and encode language-specific ways of accomplishing abstract tasks. (Holtzblatt, Piazza, Reubenstein, Roberts, 1994) describes our related work on CMS2 code.

The recognizers are not fool-proof. A programmer can always find an obscure way to implement an architectural feature which the recognizers will not detect, and a programmer may write code that accidentally aligns with an architectural feature. However, the recognizers written so far capture the more common patterns and have worked well on the examples we have seen. As we encounter more examples, we will modify and expand the recognizers as needed. The more advanced recognizers from the set of recognizers (listed in the appendix) capture task spawnings and service invocations via slice evaluation and searching for special programming patterns. Section 4 highlights this analysis.

4. Analysis Tools for Supporting Recognition

The recognizers make use of commercially available reverse engineering technology, but there are several important analysis capabilities that we have added. The capabilities themselves are special functions that recognizer authors can include in a recognizer's definition. The most prominent capabilities find potential values of variables at a given line of source code, manage clusters (i.e., collections of code fragments), and analyze special patterns.

4.1. Values of Variables

Several recognizers use inter-procedural data flow. This approach is used for finding users of communication channels, data files that a procedure accesses or modifies, and references to executable programs. We implement this analysis by first computing a program slice (Gallagher and Lyle, 1991), (Weiser, 1984) that handles parameter passing and local variables. From the slice, we compute a slice evaluation to retrieve the potential variable values at given points in the source code. Our "slice evaluator" algorithm makes several assumptions to avoid intractable computation. Most notably it ignores control flow and finds the potential values of argument assignments but not the conditions under which different choices will be made. For example, if variable x is bound to 3 and 5 respectively in the "then" and "else" parts of an "if" statement, the slice evaluator identifies 3 and 5 as possible values, but does not evaluate the conditional part of the "if" statement.

Figure 6 shows two code fragments that illustrate the requirements for the slice evaluator. Starting with the first argument to the system call or the fourth argument to the execlp call, the slice evaluator finds the use of C's sprintf to assign the cmd variable with a command line string. The string contains a pathname to an executable image.
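The branch-insensitive approximation just described can be seen in a small C fragment of our own; the flag and the two pathnames are invented:

    #include <stdlib.h>

    /* 'fast' and the two pathnames are invented for illustration. */
    void launch(int fast)
    {
        const char *cmd;

        if (fast)
            cmd = "/usr/local/bin/fast_scan";   /* bound on the "then" branch */
        else
            cmd = "/usr/local/bin/slow_scan";   /* bound on the "else" branch */

        /* A branch-insensitive slice evaluator reports both strings as
           potential values of cmd at this call, without evaluating 'fast'. */
        system(cmd);
    }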

    sprintf(cmd, "%s/snoopy", bin_dir);
    if (debug == 0)
        status = system(cmd);

    sprintf(cmd, "cd %s/bin; ./snoopy", top_dir);
    if (fork() == 0) {
        execlp("/bin/sh", "sh", "-c", cmd, (char *)0);
    }

Figure 6. Two approaches for invoking an executable image

4.2. Special Patterns

Slicing provides only part of the story for the examples in Figure 6. In the first example, the function sprintf binds the variable cmd to the string "%s/snoopy" where the %s is replaced by the name of the directory stored in the variable bin_dir. In the second, the movement to the appropriate directory ("cd %s/bin; ") is separated from the actual spawning of "snoopy". To uncover this architectural feature, we need to exploit knowledge of two patterns:

1. The first pattern identifies the position - the first argument for system calls, the last but for the null string for execlp - of the key command string that contains the name of the executable.
2. The second pattern describes potential ways programmers can encode pathnames in the command strings.

Programmers use stereotypical code patterns to implement frequently occurring computations. Some of these patterns can be easily recognized in abstract syntax trees. Other examples of patterns for C/Unix systems include the use of socket calls with connect or bind calls for creating client-server architectures, and the declaration of read/write modes in fopen calls. While our approach has been somewhat catch-as-catch-can, we have found that identifying only a few of these patterns goes a long way toward recovering architectural features across many architectural styles. We designed our approach to catch such dominant patterns and to ferret out the names of files and executable images (possibly tasks) within string arguments.

4.3. Clustering

Clusters are groupings of features of the program - a set of files, a set of procedures, or other informal structures of a program. Some recognizers need to bundle up collections of objects that may be de-localized in the code. Clustering facilities follow some algorithm for gathering elements from the abstract syntax tree.


They create clusters (or match new collections to an old cluster), and, in some cases, conduct an analysis that assigns properties to pairs of clusters based on relationships among constituent parts of the clusters. For example, our OBject and Abstract Data type (OBAD) recovery sub-tool (Harris, Reubenstein, Yeh: Recovery, 1995) builds clusters whose constituents are collections of procedures, data structures, or global variables. OBAD is an interactive approach to the recovery of implicit abstract data types (ADTs) and object instances from C source code. This approach includes automatic recognition and semi-automatic techniques that handle potential recognition pitfalls. OBAD assumes that an ADT is implemented as one or a few data structure types whose internal fields are only referenced by the procedures that are part of the ADT. The basic version of OBAD finds candidate ADTs by examining a graph where the procedures and structure types are the nodes of the graph, and the references by the procedures to the internal fields of the structures are the edges. The set of connected components in this graph forms the set of candidate ADTs. OBAD has automatic and semi-automatic enhancements to handle pitfalls by modifying what is put into the above graphs. Currently, OBAD constructs the graph from the abstract syntax tree. In the future, OBAD will use graphs made from the results returned by more primitive recognizers.

Also, recognizers can use clusters as input and proceed to detect relationships among clusters. For example, a computation of pairwise cluster level dominance looks at the procedures within two clusters. If cluster A contains a reference to an entry point defined in cluster B, while cluster B does not reference cluster A, we say that A is dominant over B. This notion of generalizing properties held by individual elements of groups occurs in several of our recognizers.
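The connected-components step at the heart of the basic version of OBAD can be sketched with a small union-find over such a reference graph. The sketch below, including its node numbering and edges, is our own illustration rather than OBAD code:

    #include <stdio.h>

    #define NODES 6   /* nodes 0-2: procedures; nodes 3-5: structure types */

    static int parent[NODES];

    static int find(int x)
    {
        /* follow parents to the set representative, compressing the path */
        return parent[x] == x ? x : (parent[x] = find(parent[x]));
    }

    static void join(int a, int b) { parent[find(a)] = find(b); }

    int main(void)
    {
        /* An edge records a procedure referencing a structure type's fields. */
        int edges[][2] = { {0, 3}, {1, 3}, {2, 4} };
        int n = sizeof edges / sizeof edges[0];

        for (int i = 0; i < NODES; i++) parent[i] = i;
        for (int e = 0; e < n; e++) join(edges[e][0], edges[e][1]);

        /* Each connected component - here {0,1,3} and {2,4} - is a candidate ADT. */
        for (int i = 0; i < NODES; i++)
            printf("node %d belongs to candidate ADT %d\n", i, find(i));
        return 0;
    }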

4.4. Language/Operating-System Models

A design goal has been to write recognizers that are LOL-independent - independent of specific patterns due to the source code Language, the Operating system, and any Legacy system features. Our hope is that we will be able to reuse most recognizers across ASTs associated with different LOL combinations. While we have not explored this goal extensively, we have had some success with recognizers that work for both FORTRAN (under the MPX operating system) and C (under Unix). Our approach to this is two-fold. First, we write recognizers using special accessors and analysis functions that have distinct implementations for each LOL. That is, the special access functions need to be re-written for each LOL, but the recognizer's logic is reusable across languages. Second, we isolate LOL-specific function names (e.g., operating system calls) in separately loadable libraries of call specifications. Each call specification describes the language, operating system, and sometimes even target system approach for coding LOL-neutral behaviors such as system calls, time and date calls, communication channel creators, data accessing, data transmission, input/output calls, APIs for commercial products, and network calls. For example, Figure 7 is the C/Unix model for system-calls (i.e., calls that run operating system line commands or spawn a task) while Figure 8 shows an analogous FORTRAN/MPX model.


These specifications are also a convenient place for describing attributes of special patterns. In these examples, the key-positions field indicates the argument position of the variable that holds the name of the executable invoked.
    defcalls SYSTEM-CALLS
      :call-desc "System Calls"
      :call-type system-call
      :call-ref-names "system", "execve", "execl", "execv", "execlp", "execvp", "execle"
      :key-positions first, next-last, next-last, next-last, next-last, next-last, next-last

Figure 7. A C/Unix Call Specification

    defcalls SYSTEM-CALLS
      :call-desc "System Calls"
      :call-type system-call
      :call-ref-names "m::rsum", "m::sspnd"
      :key-positions first, first

Figure 8. A FORTRAN/MPX Call Specification
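To make the role of these specifications concrete, the following C sketch of ours models a call-specification table and a lookup. The struct layout is hypothetical, and the rows simply restate Figures 7 and 8:

    #include <stdio.h>
    #include <string.h>

    /* One entry of a loadable call specification; the layout is our own guess. */
    struct call_spec {
        const char *lol;       /* language/operating-system combination  */
        const char *name;      /* function name as it appears in code    */
        const char *behavior;  /* LOL-neutral behavior category          */
        const char *key_pos;   /* argument holding the executable name   */
    };

    static const struct call_spec specs[] = {
        { "C/Unix",      "system",  "system-call", "first"     },
        { "C/Unix",      "execlp",  "system-call", "next-last" },
        { "FORTRAN/MPX", "m::rsum", "system-call", "first"     },
    };

    /* A recognizer asks only for a behavior; the table supplies the names,
       so the recognizer logic itself stays LOL-independent. */
    static int is_system_call(const char *lol, const char *fn)
    {
        int n = sizeof specs / sizeof specs[0];
        for (int i = 0; i < n; i++)
            if (strcmp(specs[i].lol, lol) == 0 && strcmp(specs[i].name, fn) == 0)
                return strcmp(specs[i].behavior, "system-call") == 0;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", is_system_call("C/Unix", "execlp"));   /* prints 1 */
        return 0;
    }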

4.5. An Example - Putting it all together

We return to the find-executable-links recognizer described in Section 3.1. When faced with either code fragment of Figure 6, this recognizer will collect the appropriate triple. We explain this activity in terms of the above analysis capabilities. The functions go-to-top-from-root and invocations-of-type perform their job by traversing the program AST. invocations-of-type accesses the call specification to tell it which functions in the examined program can implement some architectural style feature. For example, in the Unix operating system, the system-call specification names the functions that can spawn a task (i.e., system or members of the execlp family of functions). The function process-invoked uses slice evaluation to find the value(s) of the arguments to the function calls returned by invocations-of-type. process-invoked then uses special patterns to determine the name of the executable image within the command string. In addition, process-invoked consults a map to tell it which source code file has the root for which task. The map is currently hand generated from examining system makefiles. In the file with the root, process-invoked finds the task's root function (in the C language, this is


the function named main) and then traverses the program AST to collect the call tree into a cluster starting at that root function. Figure 5 shows how these various actions are put together for the sample recognition described in Section 3.1. The database of language and operating system specific functions, the program slicing (and slice evaluation), and the special patterns described in this section are all areas where our architecture recovery tool adds value beyond that of commercially available software reverse engineering tools.
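A toy C rendition of the name-extraction step, ours alone, handles only the simple last-token-then-basename case, where the real process-invoked applies the richer patterns of Section 4.2:

    #include <stdio.h>
    #include <string.h>

    /* Take the last token of the command, then strip any leading path.
       Real command strings admit many more encodings than this handles. */
    static const char *executable_name(char *cmd)
    {
        char *tok = strtok(cmd, " ;");
        char *last = NULL;
        char *slash;

        while (tok != NULL) {
            last = tok;
            tok = strtok(NULL, " ;");
        }
        if (last == NULL)
            return NULL;
        slash = strrchr(last, '/');
        return slash ? slash + 1 : last;
    }

    int main(void)
    {
        char cmd[] = "cd /home/xsn/bin; ./snoopy";   /* invented directory */
        printf("%s\n", executable_name(cmd));        /* prints "snoopy"    */
        return 0;
    }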

5. Recognizers in Practice

As we developed a set of recognizers, it quickly became clear to us that we needed to pay attention to organization and indexing issues. Even at a preliminary stage addressing only a few architectural styles and a single implementation language, we found we could not easily manage appropriate recognizers without some form of indexing. Since we intend that software maintenance and analysis organizations treat recognizers as software assets that can be used interactively or as part of analysis applications, we have augmented the recognizer representations with retrieval and parameterization features. These features provide support so that families of recognizers can be managed as a software library. As part of this effort, we identified reusable building blocks that enable us to quickly construct new recognizers and manage the size of the library itself. This led us to codify software knowledge in canonical forms that can be uniformly accessed by the recognizers. In addition, we discovered that architectural commitments map to actual programs at multiple granularity levels and this imposed some interesting requirements on the types of recognizers we created. In this section, we describe several of the features of our framework that facilitate recognizer authoring and recognizer use. In particular, we describe a retrieval by effect mechanism and several recognizer composition issues.

5.1. Recognizer Authoring

Recognizer authors (indeed all plan/recognition library designers) face standard software development trade-off issues that impact the size of the library, the understandability of the individual library members, and the difficulty of composing new library members from old. While our REFINE-based recognition language (RRL) does not support input variables, it does have a mechanism for parameterization. These parameters have helped us keep the recognition library size small. The parameters we currently use are focus, program, reference, and functions-of-interest. The parameters provide quite a bit of flexibility for the recognizer author, who can populate the library with the most appropriate member of a family of related recognizers. As an illustration, when functions-of-interest is bound to the set of names "system", "execve", "execl", "execv", "execlp", "execvp", and "execle" and reference is bound to "system-calls", the three fragments in Figure 9 yield an equivalent enumeration (over the same sets of objects in a legacy program). The first fragment maximizes programming flexibility, but does require analyst tailoring (i.e., building the appropriate functions-of-interest list from scratch or setting it to some

Thus. For example. Within our recovery framework. and a third looked for separate executables (identified by "main" procedures) that may not have been found by the other recognizers. Second. recognizers can be stand-alone analysis methods for answering a specific question about the source code. within our architecture recovery implementation.e. more of the processing is explicitly stated. Three recognizers were employed. (for-every item in invocations-of-type(reference) do 3. Operation and Control Analysts use recognizers in two ways. First.. let (function-names = FUNCTIONS-OF-INTEREST) (for-every item in program such-that function-call(item) and name(item) in function-names do 2. A family of recognizer fragments pre-defined list of special calls). either in stand-alone or as-built architecture recovery modes. the third special purpose recognizer does not require any external parameter settings. In contrast. perhaps making the fragment more difficult to understand (i. an analyst might ask for the locations where the source code invokes the sendmail service. In general. This view was constructed using the set of default recognizers associated with the entities and relations of the task-spawning style. For example. recognizers are semi-automatically bundled together to produce a composite view. lacking abstractions). analysts can override the defaults by making selections from the recognition library. a second recognizer found instances of file/shared-memory interprocess communication (through fopen and open calls). but would co-exist in a library with many close cousins. Section 6 below shows a system's as-built architecture with respect to the task-spawning style.2. (for-every item in invocations-of-type('system-calls) do Figure 9. . 5. In addition.EXTRACTING ARCHITECTURAL FEATURES FROM SOURCE CODE 123 1. our set of parameters allows recognizer authors to modulate abstraction versus understandability issues to produce a collection that best suits the needs of their specific user community. The second fragment is a compromise. The find-executable-links recognizer found instances of the spawns relation (encoded in the system or execlp calls of the program). recovery is an interactive process and we need facilities that will help analyst make informed selections from the library.

5.2.1. Recognizer Retrieval

Since the library is large (60 or more entries), analysts may not remember the name of a recognizer. To support this retrieval, we have provided two indexing schemes that help the analyst find an appropriate recognizer. The first scheme simply uses the text strings in a description attribute associated with each recognizer. The analyst enters a text string and the implementation returns a list of all recognizers whose description contains text that matches the string. The analyst can review the list of returned descriptions and select the recognizer that looks most promising.

The second scheme allows an analyst to see and select from all the recognizers that would return some type of information. Since analysts will probably know the type of information (e.g., function-call, file, procedure) that they are looking for, we think of the "effects" of running a recognizer on the AST. The result of running a recognizer may be that some part of the source code is annotated with markers. To support this retrieval, we have attached effect descriptions to each recognizer. The format for these effect descriptions is "[<category> <type>]" where <category> is either "know" or "check" and <type> is some entry in the type hierarchy. Such tuples indicate that the recognizer will "know" about fragments of the stated type or "check" whether fragments are of the stated type. For example, the task-spawning recognizer in Figure 4 finds function calls and files (associated with tasks).

Figure 10 is the type taxonomy our implementation uses. Uppercase entries are top entries of taxonomies based on the language model (e.g., C, FORTRAN) along with our specializations (e.g., specializations of function call) and clustering extensions. The depth of indentation indicates the depth in a subtree. When analysts select a type from this list, the system shows them a list of all the recognizers that find items of that type. Figure 11 is an example that shows the restricted menu of recognizers that achieve [know function-call]. In the event that the analyst does not find a relevant recognizer in the list, the system helps by offering to expand the search to find recognizers that know generalizations of the current type. For example, a request [know special-call] would be extended to the request [know function-call] and then to the request [know expression], climbing into the upper domain model for the legacy system's language.

Once a recognizer is selected, the system prompts the analyst for parameters that the recognizer requires. If a recognizer requires other recognizers to have been run (i.e., to populate some information on the AST), its representation indicates that the second recognizer is a pre-condition; thus there is an explicit backtracking scheme encoded for the recognizers. The analyst can review the result and select some subset of the returned results for subsequent analysis. Reasons for only selecting a subset could range from abstracting away details (for understanding or analysis) to removing irrelevant details that cannot be detected syntactically (e.g., a module is only used for testing). In addition, analysts can set the reference parameter to the result of a previous recognition, thus providing a mechanism for cascading several recognizers together to retrieve a complex pattern.
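Retrieval by effect reduces to matching [category type] tuples against the library. The C sketch below is our own illustration, with entries loosely adapted from the appendix rather than taken from the implementation:

    #include <stdio.h>
    #include <string.h>

    /* A recognizer advertises effect tuples such as [know special-call]. */
    struct recognizer_entry {
        const char *name;
        const char *category;   /* "know" or "check"          */
        const char *type;       /* entry in the type taxonomy */
    };

    static const struct recognizer_entry library[] = {
        { "Find-Executable-Links", "know", "special-call"  },
        { "Find-Loops",            "know", "code-fragment" },
        { "Find-UI-Demons",        "know", "special-call"  },
    };

    /* Retrieval by effect: list every recognizer achieving [category type]. */
    static void retrieve(const char *category, const char *type)
    {
        int n = sizeof library / sizeof library[0];
        for (int i = 0; i < n; i++)
            if (strcmp(library[i].category, category) == 0 &&
                strcmp(library[i].type, type) == 0)
                printf("%s\n", library[i].name);
    }

    int main(void)
    {
        retrieve("know", "special-call");   /* an empty answer would trigger
                                               the generalization step above */
        return 0;
    }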

    CLUSTER
      Network-exchange
        RPC-exchange
        Port-exchange
        Pipe-exchange
          Unix-pipe
      Code-fragment
        Connector-fragment
      Module
      Service
      Non-source-file
        Shell-script
        Input-file
        Output-file
      Source-file
        Executable-object
    FUNCTION-CALL
      Special-call
        Network-call
        System-call
        I/O-call
        Non-POSIX-compliant-call
    FUNCTION-DEF
    STRUCT-TYPE

Figure 10. A taxonomy of recognition types
2. or a set of root nodes (i.perhaps a functionally cohesive unit . This choice has been motivated by the multiple purposes we envision for recognizer use. First. The danger is that if we have too many different output forms. Recognizer Results From among several possible representations our recognition results are either sets of objects from the AST or sets of tuples of objects from the AST. recognition results may stand by themselves in answering a question. we output results in a manner that reduces the need for repeating computationally expensive analyses in subsequent rec- .S. functional units) requires an analysis of a calling hierarchy. or they may be used as inputs to other recognizers in a more detailed analysis of the code. As we have mentioned. a distinct functional unit).2. we will drastically limit our ability to compose recognition results.B. H.. All of these are meaningful for identifying architectural components. Given a set of procedures .126 D. this is how style recognition is accomplished). REUBENSTEIN.. the entire calling hierarchy.. HARRIS. YEH FUHCTIOH-CALL-ARTIFACT NETWORK-CALL : implementations of client process NETWORK-CALL : implementations of server process SERVICE : LINKS between the program and any network services or remote procedun NETWORK-CALL : LINKS between procedures and some service NETWORK-CALL : LINKS between procedures and network services PROCESS-INVOCATION : LINKS between procedures and shell commands SPECIAL-CALL : Connection family used in a network exchange SPECIAL-CALL : Connection type used in a network exchange PROCESS-INVOCATION : Spawning LINKS between executable modules PROCESS-INVOCATION : Invocations that activate executables FUNCTION-DEF : LINKS between local and remote procedures SPECIAL-CALL : Function calls identified directly or by dereferenced function name SPECIAL-CALL : Invocations of members of a family of functions Abort Figure 11..g. We might be interested in identifying a set of common callers of these procedures.R.several aggregations are possible. many architectural features (e. Thus our library contains recognizers that return various aggregations within the calling hierarchy. Our solution deals with this problem in two ways. This notion needs to be balanced with the need to allow analyst toflexiblycompose solutions to a wide variety of questions involving multiple aggregation modes. A.e. Recognizers with effect [know function-call] 5. Standard output results are needed to support interoperability among recognizers and to provide a uniform API to applications.e. a calling hierarchy that is mutually exclusive with some other set of procedures (i.e. For example. candidates for task entry points). tasks. they may be joined with other results to form a composite picture (i.

g. procedure. the ordered triples described above) that contains contextual information. 5.EXTRACTING ARCHITECTURAL FEATURES FROM SOURCE CODE 127 ognizers of a cascaded chain.other recognizers that are run before this recognizer will run Environment: the set of parameters that analysts must set before invoking the recognizer Recognition method: the action part of the recognizer. a directory. a task.1 above) . triples will be of the form < object. Second. unless there is reason to report some other structures. written in RRL (as illustrated in Section 3. That is to say. Recognizer Representation We can summarize the above issues by displaying the internal representation we use for each recognizer. The critical concern is to identify some standard contexts so that other parts of an analysis process can rely on a uniform type of response. For the current framework.. directory) can be easily re-derived from the AST. it is useful to return a structure (i. The attributes are as follows: • • • • • • Name: a unique identifier Description: a textual description of what the recognizer finds (used in indexing) Effects: effects indicate the types of source code fragments that are found (also used in indexing) Pre-condition . a file. the enclosing structure part of a recognition could be a procedure. courser grained structure (e. we selected the procedure level as a standard context. or something else. In our implementation. Our justification for this is that. For example. we standardize output levels so that results can be compared and bundled together easily.e. We have found this approach to be unsatisfactory because many of the recognizers collect objects in the context of some useful larger structure. it cannot know how some other recognizer will use its results. This would require each recognizer to carry out a normalization step prior to using the results of another recognizer. the slice associated with the code in Figure 6) can be a relatively expensive computation. file. while procedures offer an architecture level result that embodies the results of expensive lower-level analyses such as slice evaluation. Avoiding Redundant Computations: One approach to recognition would be to assume that each recognizer always returns a single object and that adjoining architectural structures can be found piecemeal by following the AST (or using some of the analysis tools described above). Standard Contexts: Each recognizer has only a local view.g. if necessary.3. procedure >. each recognizer is an object with a set of attributes that the implementation uses for composition and retrieval. Rather. Once the recognizer completes this examination it caches the result as the third element of a triple (as in Table 1) to avoid re-computations. a slice evaluation coupled with the use of program patterns (e.. This format has enabled us to support extensive architecture recovery without excessive duplication of computations. If we do not have some standardization.

HARRIS. These executables are linked in an executive task that uses operating system calls to spawn specific tasks in accordance with switches set when the user initiates the program.S. Each task is a test routine consisting of a stimulus. Upon finding that few examples of an architectural feature are recognized. This capability complements the recognizer indexing scheme based on code level relationships. Figure 12 is a screen image of the graphical view of task spawning recovered from XSN. Calls using socket constructs provide communications between host platforms on the network to implement a client/server architecture. A. We provide additional support via specialization hierarchies among the architectural entities and relation.R. the analyst has the option of expanding a search by following generalization and specialization links and searching for architecturally related information. They set pre-conditions and environment attributes to link the recognizer into the library. we were able to present our analysis to the original code developers and receive their feedback and suggestions on identifying additional architectural features in the code. The oval is an unknown module. 6. At this time they may add the new recognizer's name to default recognizer lists for the style-level entities/relations. and interprets the RRL code in the recognizer's method. The program contains approximately 30. by indicating a text fragment of the description. or by indicating the effect desired.B. YEH In summary. Periodically. The data files' names (and indication of their existence) are recovered from the source code. Our most successful example was XSN.000 source lines of code. an analyst retrieves the recognizer either by selecting an entity/relation with a default. Experience During the past year. Our first recovery effort involved looking for the task spawning structure.128 D. The diamonds represent data files. recognizer authors build the RRL descriptions using the RRL language constructs and special analysis functions. Subsequently. It is built on top of the X window system and hence contains multiple invocations of the X application program interface. REUBENSTEIN. we employed our architecture recovery tools on six moderate sized programs. If the analyst employed the recognizer in architecture recovery. XSN contains several tasks and specific operating system calls that are used to connect these modules. H. This view also contains elements of what we call the file-based repository style . The rectangular boxes represent a static view of a task: the source code that may be run when entering that task. a MITRE-developed network-based program for Unix system management. and analysis procedures. during an investigation.the connections between the tasks and the data files that they access or modify. This program contains several common C/Unix building blocks and has the potential for matching aspects of multiple styles. It consists of executable files for multiple tasks developed individually by different groups over time. by recognizer name. asks the analyst to set any of the required parameters. the results are added to the as-built architecture with respect to some style. The implementation recursively runs recognizers in the pre-condition attribute. a listener. The arrows indicate connections of either one task spawning another task or the data flow between data files and .

We next looked for layering structure. Second. Thirteen of the recognizers were utilities producing intermediate results that could be used for recovering features of multiple styles. One predominate example. We are still at a stage where each new system we analyze . These capabilities found portions of the code identified as users of some API. We attempted an approach that bundled up cycles within the procedure calling hierarchy but otherwise used the procedure calling hierarchy in its entirety. At this point. We feel that we have gone a long way toward recognizing standard C/Unix idioms for encoding architectural features. We felt that additional clustering was possible using either deeper dominance analysis or domain knowledge. we refined its ability to identify the source of an interaction. we see the client setting up a second communication channel in which it now acts as the server. lists. Thirteen were used for client/server recovery. Over time we made several enhancements to the recognizer to improve its explanation power. In the view's legend. These recognizers proved to be particularly useful in situations where it was not possible to obtain a complete program slice (Section 4. we enhanced the recognizer so that it would recognize a certain pattern of complex. but stereotypical client/server interaction. seven for task spawning.e. four for repository. nine were used for some form of layering. Since the above profile of recognizers is based on recognition adequacy with respect to only a few systems. the numbers should be taken in context. These table and list abstractions were recognized interactively by our OBAD sub-tool (see Section 4. We discovered that there were several large blocks of code that did not participate in any of the styles. Thus. What is important is that they indicate the need for serious recognition library management of the form we have described in this paper.3). This approach lead to little reduction over the basic structure chart report. was the code that accesses the underlying X window system. two for ADT recovery. A service-invocation recognizer shown in Figure 13 recovered elements of this style successfully. it was clear that the developers had implemented several abstract data types . We have not yet been able to implement a method that would combine such bottom-up recognition with more globally-based layering recovery methods. It was necessary to recognize this pattern in order to identify the correct external program associated with the second channel. indicate the service to be contacted) rather than the procedure containing the service invocation call. In this pattern.tables.. seven for code level features. First. but we did not pursue these approaches. By examining the code. The notion we settled on was to identify the procedures that set port numbers (i. The library also contains seven recognizers that make some simplifying assumptions in order to approximate the results of more computational intensive recognizers. we inspected the code to see if there were any obvious gaps in system coverage by the as-built architecture we had found. Figure 12 is actually a view of a thinned-out XSN: several tasks and data files have been removed to reduce diagram clutter. We developed over sixty recognizers for this analysis. We did build some preliminary capabilities based on advertised API's for commercial subsystems or program layers. particularly informative for XSN. and one for implicit invocation. "query" is another term for "recognizer". 
tasks. Figure 12 is actually a view of a thinned-out XSN: several tasks and data files have been removed to reduce diagram clutter. In the view's legend, "query" is another term for "recognizer".

At this point, we inspected the code to see if there were any obvious gaps in system coverage by the as-built architecture we had found. By examining the code, it was clear that the developers had implemented several abstract data types - tables, lists - and so we set about building and applying OBAD to the XSN system. These table and list abstractions were recognized interactively by our OBAD sub-tool (see Section 4.3). We discovered that there were several large blocks of code that did not participate in any of the styles. One predominant example, particularly informative for XSN, was the code that accesses the underlying X window system. We did build some preliminary capabilities based on advertised APIs for commercial subsystems or program layers. These capabilities found portions of the code identified as users of some API.

We next looked for layering structure. We attempted an approach that bundled up cycles within the procedure calling hierarchy but otherwise used the procedure calling hierarchy in its entirety. This approach led to little reduction over the basic structure chart report. We felt that additional clustering was possible using either deeper dominance analysis or domain knowledge, but we did not pursue these approaches. We have not yet been able to implement a method that would combine such bottom-up recognition with more globally-based layering recovery methods.

XSN acts as a client (sometimes a server) in its interactions with network services such as sendmail or ftp. A service-invocation recognizer shown in Figure 13 recovered elements of this style successfully. Over time we made several enhancements to the recognizer to improve its explanation power. First, we enhanced the recognizer so that it would recognize a certain pattern of complex, but stereotypical client/server interaction. In this pattern, we see the client setting up a second communication channel in which it now acts as the server. It was necessary to recognize this pattern in order to identify the correct external program associated with the second channel. Second, we refined its ability to identify the source of an interaction. The notion we settled on was to identify the procedures that set port numbers (i.e., indicate the service to be contacted) rather than the procedure containing the service invocation call.

We developed over sixty recognizers for this analysis. Thirteen of the recognizers were utilities producing intermediate results that could be used for recovering features of multiple styles. Thirteen were used for client/server recovery, nine were used for some form of layering, seven for task spawning, seven for code level features, four for repository, two for ADT recovery, and one for implicit invocation. The library also contains seven recognizers that make some simplifying assumptions in order to approximate the results of more computationally intensive recognizers. These recognizers proved to be particularly useful in situations where it was not possible to obtain a complete program slice (Section 4.1).

Since the above profile of recognizers is based on recognition adequacy with respect to only a few systems, the numbers should be taken in context. What is important is that they indicate the need for serious recognition library management of the form we have described in this paper. We feel that we have gone a long way toward recognizing standard C/Unix idioms for encoding architectural features. We are still at a stage where each new system we analyze

Figure 12. Task spawning view of (a thinned-out version of) XSN

requires some modifications to our implementation, but the number of required modifications is decreasing. More frequently, we have found that the set of recognizers is adequate but we need to refine existing recognizers to account for subtleties that we had not seen before. In one case, we encoded a new architecture style called "context" (showing the relationship between system processes and the connections to external files and devices) as a means to best describe a new system's software architecture. We were able to recognize all features of this style by just authoring one new recognizer and reusing several others.

    let (result = [])
      (for-every call in invocations-of-type('service-invocations)
        for-each port in where-ports-set(call)
          let (target = service-at(second(port)))
            let (proc = enclosing-procedure(first(port)))
              (result <- prepend(result, [call, target, proc])));
    result

Figure 13. A service recognizer uses the invocations-of-type construct

Table 2 summarizes the amount of code in XSN covered when viewed with respect to the various styles. The first row gives the percentage of the lines of code used in the connectors for that style. The second row gives the percentage of the procedures covered by that style. A procedure is covered if it is included in some component in that style.

Table 2. Code coverage measures for XSN

    Style:             ADT    Task Spawning   c/s    Repository   API
    % connector LOC:   0      0.3             0.9    2.1          0
    % of procedures:   39.7   3.3             13.2   2.5          13.3

Combining all the styles whose statistics are given results in a total connector coverage of about 3% of the lines of code and over 47% of the procedures. Procedure coverage total is less than the sum of its constituents in the above table because the same procedure may be covered by multiple styles. We offer these statistics as elementary examples of architectural recovery metrics. This endeavor is important both to determine the effectiveness of the style representations (e.g., what is the value-added of authoring a new style) and to provide an indicator for analysts of how well they have done in understanding the system under analysis. The measures we provide are potentially subject to some misinterpretation. It is difficult to determine how strongly a system exhibits a style and how predictive that style is of the entire body of code. As an extreme example, one could fit an entire system into one layer. This style mapping is perfectly legal and covers the whole system, but provides no abstraction of the system and no detailed explanation of the components.
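The gap between the combined procedure figure and the column sum can be reproduced with a small computation over invented data - the union of the per-style procedure sets, not their sum, is what the combined coverage reports:

    #include <stdio.h>

    #define PROCS 10

    int main(void)
    {
        /* covered[s][p] = 1 if style s covers procedure p; the data is a toy
           stand-in, not the XSN measurements. */
        int covered[2][PROCS] = {
            { 1, 1, 1, 0, 0, 0, 0, 0, 0, 0 },   /* style A: 30% of procedures */
            { 1, 1, 0, 1, 1, 0, 0, 0, 0, 0 },   /* style B: 40% of procedures */
        };
        int in_union = 0;

        for (int p = 0; p < PROCS; p++)
            if (covered[0][p] || covered[1][p])
                in_union++;

        /* The column sum says 70%, but procedures 0 and 1 are counted twice;
           the union - the reported notion of coverage - is only 50%. */
        printf("union coverage: %d%%\n", in_union * 100 / PROCS);
        return 0;
    }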

In spite of these limits, there are experimental and programmatic advantages for defining code coverage metrics. The maintenance community can benefit from discussion on establishing reasonable measures of progress toward understanding large systems.

7. Related Work

We can contrast our work with related work in recovery of high-level software design, top-down and bottom-up approaches to software understanding, and interactive reverse engineering.

7.1. Recovery of high-level design

Program structure has been analyzed independently of any pre-conceived architectural styles to reveal program organization, as discussed in (Biggerstaff, 1989; Biggerstaff, Mitbander, Webster, 1994). Schwanke (Schwanke, 1991) describes a clustering approach based on similarity measurements. This notion matches well to some of the informal clustering that we are doing, although their work is not used to find components of any particular architectural style. Informal information and heuristics can also be used to reorganize and automatically refine recovered software designs/modularizations (Richardson and Wilde, 1993). DESIRE (Biggerstaff, 1989; Biggerstaff, Mitbander, Webster, 1994) relies on externally supplied cues regarding program structure, modularization heuristics, manual assistance, and informal information. Canfora et al (Canfora, De Lucia, DiLucca, Fasolino, 1994) recovers architectural modules by aggregating program units into modules via a concept of node dominance on a directed graph. This work addresses a functional architectural style that we have not considered, but again there are similarities to the clustering we perform within our OBAD subsystem. General inquiry into the structure of software can be supported by software information systems such as LaSSIE (Devanbu, Ballard, Brachman, Selfridge, 1991). LaSSIE represents programs from a relational view that misses some of the architectural grist we find deeply embedded in abstract syntax trees.

7.2. Top-down approaches

Our recognizers are intended for use in explorations of architectural hypotheses - a form of top-down hypothesis-driven recognition coupled with bottom-up recognition rules. Quilici (Quilici, 1993) also explores a mixed top-down, bottom-up recognition approach using traditional plan definitions along with specialization links and plan entailments. It is useful to compare our work to activities in the tutoring/bug detection community. In the tutoring domain, our context-independent approach is similar to MENO (Soloway, 1983). However, context-independent approaches suffer because they cannot deal with the higher


level plans in the program. PROUST (Johnson and Soloway, 1985) remedies some of this via a combination of bottom-up recognition and top-down analysis - i.e., looking up typical patterns which implement a programmer's intentions. In contrast, architectural commitments to use a particular architectural style are made at the top level, thus the mapping between intentions and code is more direct.

7.3. Bottom-up approaches

The reverse engineering and program understanding community has approached software understanding problems generally with a bottom-up approach where a program is matched to a set of pre-defined plans/cliches from a library. This work is not motivated by the architecture organizational principles essential for the construction of large programs. Current work on program concept recognition is exemplified by (Kozaczynski, Ning, Sarver, 1992), (Engberts, Kozaczynski, Ning, 1991), and (Dekker and Ververs, 1994), which continues the cliche-based tradition of (Rich and Wills, 1990). This work is based on a precise data and control flow match which indicates that the recognized source component is precisely the same as the library template. Our partial recognition approach does not require algorithmic equivalence between a plan and the source being matched; rather, our recognizers are based on events (Harandi and Ning, 1990) in the source code. That is to say, the existence of patterns of these events is sufficient to establish a match. Our style of source code event-based recognition rules is also exemplified in (Kozaczynski, Ning, Sarver, 1992) and (Engberts, Kozaczynski, Ning, 1991), which demonstrate a combination of precise control and data flow relation recognition and more abstract code event recognition.

7.4. Interactive Reverse Engineering

Wills (Wills, 1993) points out the need for flexible, adaptable control structures in reverse engineering. Her work attacks the important problem of building interactive support that cuts across multiple types of software analysis. In contrast, our work emphasizes authoring and application of multiple analysis approaches applicable for uncovering architectural features in the face of specific source code nuances and configurations. Paul and Prakash (Paul and Prakash: patterns, 1994) (Paul and Prakash: queries, 1994) investigate source code search using program patterns. This work uses a query language for specifying high level patterns on the source code. Some of these patterns correspond to specific recognition rules in our approach. Our approach focuses more on analyst use of a pre-defined set of parameterizable recognizers each written in a procedural language. That is, we restrict analyst access to a set of predefined recognizers, but allow recognizer authors the greater flexibility of a procedural language.


8. Evaluation and Conclusions

We have implemented an architecture recovery framework that merges reverse engineering and architectural style representation. This is an important first step toward long range goals of providing custom, dynamic documentation for a variety of software analysis tasks. The framework provides for analyst control over parameterization and retrieval of recognition library elements. We have described methods for recognizer execution and offered some recognizer authoring guidance for identifying recognizers that will interact well with other recognizers in the library. The recognizers make use of commercially available reverse engineering technology, but there are several important analysis capabilities that we have added. In addition, one of our major contributions has been to determine the architectural patterns to recognize and to express these patterns with respect to the underlying analysis capabilities.

Our current recognition capabilities have been motivated by thinking about a C/Unix environment, which does have its unique programming idioms. While we phrase our recognizers at a general language/operating system independent level (e.g., task spawning or service invocation), there are some biases within the recognition library itself and we would like to extend our approaches to cover idioms of other high level languages and operating systems. Primarily, there is a dependence of a set of functions on specifics of the legacy language or operating system. In addition, many of the features that are recognized through low level patterns in C/Unix implementations (e.g., a call to the system function spawns a task, a struct-type) will appear explicitly in other languages/operating systems as special constructs (e.g., tasks, class definitions).

There are four broad areas in which we intend to extend our work:

• Additional automation

We would like to expand our ability to index into the growing library of recognizers and would like to develop additional capabilities for bridging the gap from source code to style descriptions. The ultimate job of recognizers is to map the entities/relations (i.e., objects in the domain of system design such as pipes or layers) to recognizable syntactic features of programs (i.e., objects in the implementation domain). Clearly, we are working with a moving target. New programming languages, COTS products, and standard patterns present the reverse engineering community with the challenge of recovering the abstractions from source code. We are hopeful that many of the mechanisms we have put in place will enable us to rapidly turn out new recognizers that can deal with new abstractions. An enhancement that we intend to consider is the automatic generation of effect descriptions from information encoded in explicit output lists of recognizers. This scheme is similar to the transformation indexing scheme of Aries (Johnson, Feather, Harris, 1992).

• Combining Styles

We intend to investigate combining architectural styles. The as-built architectural views each provide only a partial view into the structure of a program, and such partial views can overlap in fundamental ways (e.g., a repository view emphasizing data elements


contains much in common with an interprocess-communication view emphasizing data transmissions through shared memory or data files on disk). In addition, style combinations can be used to uncover hybrid implementations where individual components with respect to one style are implemented in terms of a second style.

• COTS modeling

Systems that we wish to analyze do not always come with the entire body of source code; e.g., they may make use of COTS (commercial off-the-shelf) packages that are simply accessed through an API. For example, from the analysis point of view, the Unix operating system is a COTS package. We have developed representations for COTS components that allow us to capture the interface and basic control and data flow dependencies of the components. This modeling needs to be extended to represent architectural invariants required by the package.

• Requirements modeling

The distinction between functional and non-functional requirements suggests two broad thrusts for reverse engineering to the requirements level. For functional requirements we want to answer the important software maintenance question: "Where is X implemented?". For example, a user may want to ask where message decoding is implemented. Message and decoding are concepts at the user requirements level. Answering such questions will require building functional models of systems. These models will contain parts and constraints that we can use to map function to structure. For non-functional requirements, we need to first recognize structural components that implement the non-functional requirements. For example, fault tolerance requirements will to some degree be observable as exception handling in the code. We believe our framework is well suited for extensions in this direction. As a second step, we need to identify measures of compliance (e.g., high "coverage" by abstract data types means high data modifiability). Preliminary work in this area appears in (Chung, Nixon, Yu, 1995) and (Kazman, Bass, Abowd, Clements, 1995).

While we are continuing to refine our representations to provide more automated assistance both for recognizer authors and for analysts such as software maintainers, the current implementation is in usable form and provides many insights for long range development of architectural recognition libraries.

Acknowledgments We would like to thank The MITRE Corporation for sponsoring this work under its internal research program. We also thank MITRE colleagues Melissa Chase, Susan Roberts, and Richard Piazza. Their work on related MITRE efforts and enabling technology has been of great benefit for the research we have reported on above. Finally, we acknowledge the many insightful suggestions of the anonymous reviewers.

Typed function calls . Clusters derived from dependency analysis • • Find-Upper-Layers: clusters that are layered above iht focus cluster Find-Global-Var-Based-Clusters: find clusters based on common global variable reference 6.found directly on abstract syntax trees (ASTs) • • • Find-Structu re-With-Attribute: structures that have reference as an attribute name Find-Loops: find all loops Hill-Climbing: instances of hill-climbing algorithms 2. reference) highlight a parameter that analysts must set before running the recognizer.136 D.B. Structures referenced in special invocations Find-Executable-Links: links between spawned tasks and the tasks that spawned them Task-invocation: task invoked (spawned) by a special function call File-Access-Links: links between procedures and the files that they access File-IPC: files touched by more than one process Service-Thru-Port: all relations to a reference port ..R. HARRIS.S.use special call specifications • • • Find-Interesting-Invocations: invocations oifunctions-of-interest Find-lnvocations-Of-Executables: invocations that activate other executables Find-UI-Demons: registrations of user-interface demons 3. focus. Italicized words in the descriptions (e. YEH Appendix The Recognizer Library Our recognition library contains approximately sixty recognizers directed toward discovery of the architectural components and relations of nine styles. Forward references . REUBENSTEIN. Program structure .g. Clusters of objects • • Decomposables: decomposable objects of an architecture Top-Clusters: top clusters of the current architecture 5. H. The following partial list of recognizers shows the variety of the elements of our recognition library and is organized by analysis method. 1. A.procedures that use a variable set by a special call • Envelope-Of-A-Conduit: procedures that use the communication endpoint created hy focus 4.

EXTRACTING ARCHITECTURAL FEATURES FROM SOURCE CODE

137

Find-Port-Connections: links (relations) between program layers and local or network services

7. Clusters derived from calling hierarchy • • • • Find-Upper-Functional-Entry-Points: high level functional entry points Find-Mid-Level-Functional-Entry-Points: mid-level functional entry points Find-Common-Callers: common callers of a set of functions Who-Calls-lt: procedures that call focus

8. Procedures within some context - using containment within clusters • • • • Find-Functions-Of-Cluster: Functions of a cluster Find-Exported-Functions: Exported functions of focus cluster Find-Localized-Function-Calls: procedure invocations within the focus procedure Has-Non-Local-Referents: non-local procedures that call definitions located in focus

References
H. Abelson and G. Sussman. Structure and Interpretation of Computer Programs. The MIT Press, 1984. G. Abowd, R. Allen, and D. Garlan. Using style to understand descriptions of software architecture. ACM Software Engineering Notes, 18(5), 1993. Also in Proc. of the 1st ACM SIGSOFT Symposium on the Foundations of Softwa re Engineering, 1993. T. Biggerstaff. Design recovery for maintenance and reuse. IEEE Computer, July 1989. T. Biggerstaff, B. Mitbander, and D. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5), May 1994. G. Canfora, A. De Lucia, G. DiLucca, and A. Fasolino. Recovering the architectural design for software comprehension. In IEEE 3rd Workshop on Program Comprehension, pages 30-38. IEEE Computer Society Press, November 1994. L. Chung, B. Nixon, and E. Yu. Using non-functional requirements to systematically select among alternatives in architectural design. In First International Workshop on Architectures for Software Systems, April 1995. R. Dekker and F. Ververs. Abstract data structure recognition. The Ninth Knowledge-Based Software Engineering Conference, 1994. P. Devanbu, B. Ballard, R. Brachman, and P. Selfridge. Automating Software Design, chapter LaSSIE: A Knowledge-Based Software Information System. AAAI/MIT Press, 1991. A. Engberts, W. Kozaczynski, and J. Ning. Concept recognition-based program transformation. In 1991 IEEE Conference on Software Maintenance, 1991. K. Gallagher and J. Lyle. Using program slicing in software maintenance. IEEE Transactions on Software Engineering, 17(8), 1991. D. Garlan and M. Shaw. An introduction to software architecture. Tutorial at 15th International Conference on Software Engineering, 1993. M. Harandi and J. Ning. Knowledge-based program analysis. IEEE Software, 7(1), 1990. D. Harris, H. Reubenstein, and A. Yeh. Recognizers for extracting architectural features from source code. In Second Working Conference on Reverse Engineering, July 1995. D. Harris, H. Reubenstein, and A. Yeh. Recoverying abstract data types and object instances from a conventional procedure language. In Second Working Conference on Reverse Engineering, July 1995.

138

D.R. HARRIS, H.B. REUBENSTEIN, A.S. YEH

D. Harris, H. Reubenstein, and A. Yeh. Reverse engineering to the architectural level. In ICSE-I7 Proceedings, April 1995. C. Hofmeister, R. Nord, and D. Soni. Architectural descriptions of software systems. In First International Workshop on Architectures for Software Systems, April 1995. L. Holtzblatt, R. Piazza, H. Reubenstein, and S. Roberts. Using design knowledge to extract real-time task models. In Proceedings of the 4th Systems Reengineering Technology Workshop, 1994. W. L. Johnson and E. Soloway. Proust: Knowledge-based program understanding. IEEE Transactions on Software Engineering, 11(3), March 1985. W.L. Johnson, M. Feather, and D. Harris. Representation and presentation of requirements knowledge. IEEE Transactions on Software Engineering, 18(10), October 1992. R. Kazman, L. Bass, G. Abowd, and R Clements. An architectural analysis case study: Internet information systems. In First International Workshop on Architectures for Software Systems, April 1995. W. Kozaczynski, J. Ning, and T. Sarver. Program concept recognition. In 7th Annual Knowledge-Based Software Engineering Conference, 1992. E. Mettala and M. Graham. The domain specific software architecture program. Technical Report CMU/SEI-92SR-9, SEI, 1992. M. Olsem and C. Sittenauer. Reengineering technology report. Technical report. Software Technology Support Center, 1993. S. Paul and A. Prakash. A framework for source code search using program patterns. IEEE Transactions on Software Engineering, 20(6), June 1994. S. Paul and A. Prakash. Supporting queries on source code: A formal framework. International Journal of Software Engineering and Knowledge Engineering, September 1994. D. Perry and A. Wolf. Foundations for the study of software architecture. ACM Software Engineering Notes, 17(4), 1992. A. Quilici. A hybrid approach to recognizing program plans. In Proceedings of the Working Conference on Reverse Engineering, 1993. Reasoning Systems, Inc., Palo Alto, CA. REFINE User's Guide, 1990. For R E F I N E ^ ^ Version 3.0. Reasoning Systems. Refine/C User's Guide, March 1992. C. Rich and L. Wills. Recognizing a program's design: A graph parsing approach. IEEE Software, 7(1), 1990. R. Richardson and N. Wilde. Applying extensible dependency analysis: A case study of a heterogeneous system. Technical Report SERC-TR-62-F, SERC, 1993. R. Schwanke. An intelligent tool for re-engineering software modularity. In 13th International Conference on Software Engineering, 1991. M. Shaw. Larger scale systems require higher-level abstractions. In Proceedings of the 5th Intematioruzl Workshop on Software Specification and Design, 1989. M. Shaw. Heterogeneous design idioms for software architecture. In Proceedings of the 6th International Workshop on Software Specification and Design, 1991. E. Soloway. Meno-ii: An intelligent program tutor. Computer-based Instruction, 10,1983. W. Tracz. Domain-specific software architecture (DSSA) frequently asked questions (FAQ). ACM Software Engineering Notes, 19(2), 1994. M. Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4), July 1984. L. Wills. Flexible control for program recognition. In Working Conference on Reverse Engineering, May 1993.

Automated Software Engineering, 3,139-164 (1996) © 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Strongest Postcondition Semantics as the Formal Basis for Reverse Engineering*
GERALD C. GANNOD** AND BETTY H.C. CHENGt Department of Computer Science Michigan State University East Lansing, Michigan 48824-1027 {gannod,chengb}@cps.msu.edu

Abstract. Reverse engineering of program code is the process of constructing a higher level abstraction of an implementation in order to facihtate the understanding of a system that may be in a "legacy" or "geriatric" state. Changing architectures and improvements in programming methods, including formal methods in software development and object-oriented programming, have prompted a need to reverse engineer and re-engineer program code. This paper describes the application of the strongest postcondition predicate transformer (sp) as the formal basis for the reverse engineering of imperative program code. Keywords: formal methods, formal specification, reverse engineering, software maintenance

1.

Introduction

The demand for software correctness becomes more evident when accidents, sometimes fatal, are due to software errors. For example, recently it was reported that the software of a medical diagnostic system was the major source of a number of potentially fatal doses of radiation (Leveson and Turner, 1993). Other problems caused by or due to software failure have been well documented and with the change in laws concerning liability (Flor, 1991), the need to reduce the number of problems due to software increases. Software maintenance has long been a problem faced by software professionals, where the average age of software is between 10 to 15 years old (Osborne and Chikofsky, 1990). With the development of new architectures and improvements in programming methods and languages, including formal methods in software development and object-oriented programming, there is a strong motivation to reverse engineer and re-engineer existing program code in order to preserve functionality, while exploiting the latest technology. Formal methods in software development provide many benefits in the forward engineering aspect of software development (Wing, 1990). One of the advantages of using formal methods in software development is that the formal notations are precise, verifiable, and facilitate automated processing (Cheng, 1994). Reverse Engineering is the process of constructing high level representations from lower level instantiations of an existing system. One method for introducing formal methods, and therefore taking advantage of the benefits
This work is supported in part by the National Science Foundation grants CCR-9407318, CCR-9209873, and CDA-9312389. ** This author is supported in part by a NASA Graduate Student Researchers Program Fellowship. t Please address all correspondences to this author.

Software Maintenance One of the most difficult aspects of re-engineering is the recognition of the functionality of existing programs. sp) versus using a predicate transformer as a guideline for constructing formal specifications (i. and implementation levels (Chikofsky and Cross. and domain specific details are often significant obstacles to successfully re-engineering a system. 1990). design. component relationships.140 GANNOD AND CHENG of formal methods. is through the reverse engineering of existing program code into formal specifications (Gannod and Cheng. and intended behavior (Chikofsky and Cross. and Section 4 gives the sp semantics for iterative and procedural constructs. This process does not require semantic understanding of the system and is best characterized by the task of transforming unstructured code into structured code. Forward Engineering is the process of developing a system by moving from high level abstract specifications to detailed. The explicit use of the word "forward" is used to contrast the process with Reverse Engineering. the process of analyzing a system in order to identify system components. 1989. 2. Lano and Breuer. Ward et al.. An example applying the reverse engineering technique is given in Section 5. where Section 3 discusses the sp semantics for assignment. Background This section provides background information for software maintenance and formal methods for software development. . 1990).e. 1990). Section 2 provides background material for software maintenance and formal methods. we investigated the use of the weakest precondition predicate transformer wp as the underlying formal model for constructing formal specifications from program code (Cheng and Gannod. wp). implementation-specific manifestations (Chikofsky and Cross. The formal approach to reverse engineering based on sp is described in Sections 3 and 4. Included in this discussion is the formal model of program semantics used throughout the paper. Finally. and sequence. The remainder of this paper is organized as follows. 1990). Several terms are frequently used in the discussion of re-engineering (Chikofsky and Cross.1.. 1990). Re-Engineering is the examination and alteration of a system to reconstitute it in a new form. Section 7 draws conclusions and suggest future investigations. Restructuring is the process of creating a logically equivalent system at the same level of abstraction (Chikofsky and Cross. which potentially involves changes at the requirements. alternation. 1969). Gannod and Cheng.e. 1990). The difference between the two approaches is in the ability to directly apply a predicate transformer to a program (i. Previously. intended use. 1994). 2. 1991. Identifying design decisions. 1989). This step in re-engineering is known as reverse engineering.. This paper describes an approach to reverse engineering based on the formal semantics of the strongest postcondition predicate transformer sp (Dijkstra and Scholten. 1994. Related work is discussed in Section 6. and the partial correctness model of program semantics introduced by Hoare (Hoare.

The next step is Alteration. Byrne and Gustafson. where the system is constituted into a new form at a different level of abstraction. The lower levels include designs and implementations. The motivation for operating in such an implementation-bound level of abstraction is that it provides a means of traceability between the program source code and the formal specifications constructed using the techniques described in this paper. 1992.2. The process model appears in the form of two sectioned triangles. That is. Structured Analysis and . Alteration "Reverse Engineering Abstraction "Forward Engineering" Refinement System A System B Figure 1. the context for this paper is represented by the dashed arrow. currently existing development teams must be able to understand the relationship between the source code and the specifications. 2. we address the construction of formal low-level or ''as-builf design specifications.e. Reverse Engineering Process Model This paper describes an approach to reverse engineering that is applicable to the implementation and design levels.STRONGEST POSTCONDITION SEMANTICS 141 Byrne described the re-engineering process using a graphical model similar to the one shown in Figure 1 (Byrne. where Abstraction (or reverse engineering) is performed to an appropriate level of detail. 1992). Refinement of the new form into an implementation can be performed to create system B. the design methodologies that support the life-cycle (i.. That is. Entry into this re-engineering process model begins with system A. In Figure 1. where each section in the triangles represents a different level of abstraction. The relative size of each of the sections is intended to represent the amount of information known about a system at a given level of abstraction. Finally. The higher levels in the model are concepts and requirements. Formal Methods Although the waterfall development life-cycle provides a structured process for developing software. This traceability is necessary in order to facilitate technology transfer of formal methods.

R) describes the set of all states in which the statement S can begin execution and terminate with postcondition R true. Therefore.142 GANNOD AND CHENG Design (Yourdon and Constantine. using the properties listed in Table 1. 2. A rearrangement of the braces to produce {Q} S {R}/m contrast. if the execution of program S terminates.2. The predicate . the weakest precondition wp{S. AV B) wp(S. and the weakest liberal precondition wlp{S^ R) is the set of all states in which the statement S can begin execution and establish R as true if S terminates. developing. 1990). A) A wp{S. formal methods used in software development are rigorous techniques for specifying. where. represents a total correctness model of execution. The wp and wlp are called predicate transformers because they take predicate R and. Given a statement S and a postcondition R. That is. wp{S^ R) establishes the total correctness of S. if condition Q holds.7.2. and verifying computer software (Wing. true) A wlp{S. A) V wp{S. are amenable to automated processing (Cheng. 1994). 1969) is used to represent a partial correctness model of execution. the partial correctness model is sufficient for these purposes. A) -^wlp{S. 1990). In contrast. 1990). A formal method consists of a well-defined specification language with a set of well-defined inference rules that can be used to reason about a specification (Wing. Strongest Postcondition Consider the predicate -twlp{S. Table 1. and wlp{S^ R) establishes the partial correctness of 5. Program Semantics The notation Q {S} R (Hoare. B) The context for our investigations is that we are reverse engineering systems that have desirable properties or functionality that should be preserved or extended. 1978)) make use of informal techniques. 2. produce a new predicate. B) wp{S. A precondition describes the initial state of a program. given that a logical condition Q holds. Properties of the wp and wlp predicate transformers wp{S. thus increasing the potential for introducing ambiguity. inconsistency. then S is guaranteed to terminate with condition R true. That is. and di postcondition describes the final state. B) wp{S. which is the set of all states in which there exists an execution of S that terminates with R true. In this respect.2. A) -)• wp{S.-^R). ^A) false wp(S. A) wp{S. and incompleteness in designs and implementations. then logical condition R will hold. A-^ B) = => = = =^ => wp{S. AAB) wp{S. we wish to describe the set of states in which satisfaction of R is possible (Dijkstra and Scholten. A benefit of formal methods is that their notations are well-defined and thus. A) wp(Syfalse) wp{S.

We contrast this model with the sp model. meaning that given S and R. Finally. That is. ->R) is contrasted to wlp{S^ R) which is the set of states in which the computation of S either fails to terminate or terminates with R true. if the computation of S begins in state wp{S. Second. First. and produces a predicate wp{S^ R). The use of sp assumes that a precondition Q is known and that a postcondition will be derived through the direct application of sp. it provides a formal basis for translating programming statements into formal specifications. where the input to the predicate transformer is "S" and "Q". the symmetry of sp and wlp provides a method for verifying the correctness of a reverse engineering process that utilizes the properties of wlp and sp in tandem. which is the set of all states in which there exists a computation of 5 that begins with Q true. in that a derivation of a specification begins with R.wc note that wp is a backward rule. sp is more applicable to reverse engineering. Figure 2(a) gives the case where the input to the predicate transformer is "S" and "R". sp{S^ Q) assumes partial correctness. The predicate transformer sp assumes a partial correctness model of computation meaning that if a program starts in state Q. As such. As such. 2. R).3.Q) wlp{S. if 5 terminates. The sp case (Figure 2(b)) is similar. The predicate transformer wp assumes a total correctness model of computation. wp Given a Hoare triple Q{S} R. .Q)". However. That is. and the output to the transformer is "sp(S. where the input to the predicate transformer produces the corresponding predicate. then the execution of S will place the program in state sp{S. Q) predicate transformer (Dijkstra and Scholten.R) => R The importance of this relationship is two-fold. spvs. therefore wp can only be used as a guideline for performing reverse engineering.R)". An analogous characterization can be made in terms of the computation state space that describes initial conditions using the strongest postcondition sp{S. determining R is the objective. 1990): Q ^ sp{S. and the output to the predicate transformer (given by the box and appropriately named "wp") is "wp(S. a forward derivation rule. Q) true. Using wp implies that a postcondition R is known. with respect to reverse engineering. given the Hoare triple Q{5}/?(Dijkstra and Scholten. the program S will halt with condition R true. given a precondition Q and a program 5.2. given that Q holds. sp derives a predicate sp{Sy Q). execution of S results in sp(5. The use of these predicate transformers for reverse engineering have different implications. Q) if S terminates. Figure 2 gives a pictorial depiction of the differences between sp and wp. we make the following observation about sp{S^ Q) and wlp{S.STRONGEST POSTCONDITION SEMANTICS 143 ^wlp{S. 1990). R) and the relationship between the two predicate transformers.

initially. which represents the postcondition R with every free occurrence of x replaced by the expression e. The sp of an assignment statement is expressed as follows (Dijkstra and Scholten. and *::' indicates that the range of the quantified variable v is not relevant in the current context.^ sp(S. the notation {Q} S {R} will be used to indicate a partial correctness interpretation. performing the textual substitution Q^ in Expression (1) is a redundant operation if. respectively. for reverse engineering purposes. Assignment An assignment statement has the form x: = e. We conjecture that the removal of the quantification for the initial values of a variable is valid if the precondition Q has a conjunct that specifies the textual substitution. where x is a variable. where each yi is replaced by Ei. v is the quantified variable. 1990) 5p(x:=e. in expression R. the Hoare triple formulation for assignment statements is as follows: . Notationally. we first describe the semantics of the predicate transformers wlp and sp as they apply to each primitive and then. and sequences. ahernation. This type of replacement is termed a textual substitution of x by e in expression R. throughout the remainder of this paper. (1) where Q is the precondition. If x corresponds to a vector y of variables and e represents a vector E of expressions. Black box representation and differences between wp and sp\ (a) wp (b) sp 3. and e is an expression. 1976) is used to represent each primitive construct but the techniques are applicable to the general class of imperative languages. That is. Refer to Appendix A where this case is described in more depth. For each primitive. 3J. Q has a conjunct of the form x = v. R) = R^.144 GANNOD AND CHENG {Q Q) sp (a) .R) -^ wp (b) Figure 2. describe specification derivation in terms of Hoare triples.Q) = {3v :: Q^Ax = e^). The wlp of an assignment statement is expressed as wlp{yi: =e. then the wlp of the assignment is of the form R^. Primitive Constructs This section describes the derivation of formal specifications from the primitive programming constructs of assignment.Q) wp(S. The Dijkstra guarded command language (Dijkstra. Given the imposition of initial (or previous) values on variables.

x : = b . For instance.} {x = cAb = b} X := d.) A Q } /* p o s t c o n d i t i o n */ where Xj represents the initial value of the variable x.. fAx5=eA. {xi=aAxo=X} x := b.. {x4 = d} x : = e.. {x5=eAx4=dA. A historical subscript is an integer number used to denote the i^"' textual assignment to a variable. using . {x6 = f} x := g. Q is the precondition. {x3=cAx2=bA. {x2 = bAxi = x := c. Consider a program that consists of a series of assignments to a variable x. x:= e. {xi = a} x := b. {x7 = 9Ax6 = {xs = {x = h Ag = 9} {xs = h} hAx7=gA. /* p r e c o n d i t i o n * / {(xj+i = e^.} fA. Xj^i is the subsequent value of x.. Subscripts are added to variables to convey historical information for a given variable." Despite its simplicity. when using historical subscripts. x:= h..} x := d.} (a) Code with strict sp application (b) Code with historical subscripts (c) Code with historical subscripts and propagation Figure 3.} {x = gAf = f} x := h. "x : = a. {x = dAc= c} X : = e.. {xe = x := g.. Figure 3(a) depicts the specification of the program by strict application of the strongest postcondition.... {x = aAX = X} X := b .. {x4 = dAx3 = x := e. However. special care must be taken to maintain the consistency of the specification with respect to the semantics of other programming constructs.} x:=f. cA. x:= g. {x7 = g} X := h. {xo = X} x := a.. aA. {x5=e} x:=f. the {x = X} X := a. {x3=c} X := d.. x := h. x:= f. x:= c. {x = f Ae = e} X := g. Different approaches to specifying the history of a variable example is useful in illustrating the different ways that the effects of an assignment statement on a variable can be specified. x:= d.} {x = e Ad = d} x:=f. [x = bAa = a} X : = C. That is. An example of the use of historical subscripts is given in Figure 3(b). {xo = X} X := a. where a textual assignment is an occurrence of an assignment statement in the program source (versus the number of times the statement is executed).. {x2=b} X : = c. Another possible way to specify the program is through the use of historical subscripts for a variable.STRONGEST POSTCONDITION SEMANTICS 145 {Q} X := e.

The wlp for alternation statements is given by (Dijkstra and Scholten. V Sp{Sn. is that given B^ is true. The existential expression can be expanded into the following form Sp{lF.2. The precondition of a given statement must be propagated to the postcondition. 1990) 5p(lF. Bn A Q)). The main motivation for using histories is to remove the need to apply textual substitution to a complex precondition and to provide historical context to complex disjunctive and conjunctive expressions. 1976) is expressed as if Bi ^ Si. (3) (2) Expression (3) illustrates the disjunctive nature of alternation statements where each disjunct describes the postcondition in terms of both the precondition Q and the guard and guarded command pairs. respectively. where B^ ^^ s^ is a guarded command such that Si is only executed if logical expression (guard) Bi is true. 3. extra information is appended that provides a historical context to all variables of a program during some "snapshot" or state of a program. Bi A Q)). Bi A (5) V . Q) = {Bi :: sp{Si. The translation of alternation statements to specifications is based on the similarity of the semantics of Expression (3) and the execution behaviour for alternation statements. Note that we have not changed the semantics of the strongest postcondition. rather. . . but. in the application of strongest postcondition. Alternation An alternation statement using the Dijkstra guarded command language (Dijkstra. if the alternation statement terminates.146 GANNOD AND CHENG the technique shown in Figure 3(b) is not sufficient. a specification is constructed as follows . given by B^ and s^. R) = (Vi : Bi : wlp{Si. Using the Hoare triple notation. where IF represents the alternation statement.R)). Q) = (5p(Si. The equation states that the necessary condition to satisfy R. the wlp for each guarded statement Si with respect to R holds. as shown in Figure 3(c). \\ ^n ^ ^Uf fi. The disadvantage to using such a technique is that the propagation of the precondition can potentially be complex visually. 1990): wlp{iF. This characterization follows the intuition that a statement Si is only executed if B^ is true. The sp for alternation has the form (Dijkstra and Scholten.

. V Sp{Sn. {sp{s2. We deviate from our previous convention of providing the formalisms for wlp and sp for each construct and use an operational definition of how specifications are constructed. The Hoare triple formulation and construction process is as follows: {Q} Si. 4. R) = wlp{Si. . Q). . This section discusses the formal specification of iteration and procedural abstractions without recursion.sp{Si. The wlp for sequences is defined as follows (Dijkstra and Scholten. The wlp and sp for sequences follow accordingly. 1990) is Sp{Si. Likewise. Iterative and Procedural Constructs The programming constructs of assignment. .> S i . . . In the case of wlp. 5 i A Q) V . {sp(Si. BnAQ)} 3. the derived postcondition for the sequence Si. alternation. the set of states for which the sequence Si. II ^n ^ ^n> fi. and sequence can be combined to produce straight-line programs (programs without iteration or recursion). Sequence For a given sequence of statements S i .Q)). even for the human specifier. Sn.R)' For 577. The introduction of iteration and recursion into programs enables more compactness and abstraction in program development.Q)} S2.S2 with respect to the precondition Q is equivalent to the derived postcondition for S2 with respect to a precondition given by sp{si. { Sp(Si.3. it follows that the postcondition for some statement Si is the precondition for some subsequent statement S^+i. However. the sp (Dijkstra and Scholten. S2. constructing formal specifications of iterative and recursive programs can be problematic.Q) = Sp{S2.STRONGEST POSTCONDITION SEMANTICS 147 {Q} if Bi .S2. S2 can execute with R true (if the sequence terminates) is equivalent to the wlp of Si with respect to the set of states defined by wlp{S2. This approach is .Q)) }. 1990): wlp{Si.Sp{Si.wlp{S2. (4) R)).

1990): 5P(DO. has the form do Bi -^ S i . given that condition Q holds. provided that the iteration statement terminates. od. In more general terms. Iteration. The strongest postcondition semantics for repetition has a similar but notably distinct formulation (Dijkstra and Scholten. Gries. Iteration Iteration allows for the repetitive application of a statement. and thus can be relaxed. (6) Expression (6) states that the strongest condition that holds after executing an iterative statement. > A simplified form of repetition is given by "do B — s od ". using the Dijkstra language. An invariant is a predicate that is true before and after each iteration of a loop. > In the context of iteration.148 GANNOD AND CHENG necessary because the formalisms for the wlp and sp for iteration are defined in terms of recursive functions (Dijkstra and Scholten. Although the semantics for repetition in terms of strongest postcondition and weakest liberal precondition are less complex than that of the weakest precondition (Dijkstra and . Operationally. 1990. The problem of constructing formal specifications of iteration statements is difficult because the bound functions and the invariants must be determined. where i > 0. the semantics for iteration ^ in terms of the weakest liberal precondition predicate transformer wlp is given by the following (Dijkstra and Scholten. is equivalent to the condition where the loop guard is false {-^B). R) = {Wi:0<i: wlp{iF\ B V R)). for a partial correctness model of execution. However. Expression (5) states that the weakest condition that must hold in order for the execution of an iteration statement to result with R true.1. a bound function determines the upper bound on the number of iterations still to be performed on the loop. difficult to practically apply. and a disjunctive expression describing the effects of iterating the loop i times. 1981) that are. concerns of boundedness and termination fall outside of the interpretation. (5) where the notation " I F * " is used to indicate the execution of " i f B ^^ s f i " i times. Q) = -^^ A (3i : 0 < ^ : sp{iF\ Q)). in general. 4. is equivalent to a conjunctive expression where each conjunct is an expression describing the semantics of executing the loop i times. 1990): wlp{BO. Using the abbreviated form of repetition "do B — s od". where i > 0. the iteration statement may contain any number of guarded commands of the form B^ — s^. such that the loop is executed as long as any guard B^ is true.

fi if i < n —> i : = i + 1. the derived specification for the code sequence is ( ( n . 5. 10. For instance.Q). if i < n —> i : = i + 1. where start is the initial value of variable i. we find that the solution is non-trivial when applying the formal definition of sp{DO. 7. 9. In the construction of specifications of iteration statements. the specification process must rely on a user-guided strategy for constructing a specification.1 < n) A (z = n)). consider the counter program "do i < n ^ i : = i + 1 od". A strategy for obtaining a specification of a repetition statement is given in Figure 5. As such. fi if i < n --> i : = i + 1. 11. For this simple example. If j is set to n — start. {start < n) A {i = start)). The application of the sp semantics for repetition leads to the following specification: sp{do i < n . then the unrolled version of the loop would have the following form: 1. in line 19 of Figure 4 the inductive assertion that "i = start + (n — start — 1)" is made.> i : = i + l od. As such. fi Application of the rule for alternation (Expression (2)) yields the sequence of annotated code shown in Figure 4. 6. The closed form for iteration suggests that the loop be unrolled j times. This assertion is based on a specifier providing the information that (n — start — 1) additions have been performed if the loop were unrolled at least (n — start — 1) times.Q) = {i>n)A{3j:0<j:sp{IF^. the recurrent nature of the closed forms make the appHcation of such semantics difficult. where the goal is to derive 5p(do i < n — > i : = i + l od.STRONGEST POSTCONDITION SEMANTICS 149 Scholten. 2. 4. knowledge must be introduced by a human specifier. 3. 1990). For instance.Q)). i : = start. . 8. by using loop unrolling and induction.

{i < n) A {i = start + 1) A {start < n)) V {{i >= n) A{i = start -h 1) A {start < n)) = {{i = start + 2) A {start + 1 < n)) V {{i > = n) A{i = start + 1) A {start < n)) } { ((^ = start -h (n . 8. 5. 4. 7.start . 28. {{'^ = start) A {start < n)} i f i < n -> i : = i + 1 f i { sp{i := i -\-1. 22. 17.1 < n)) V {{i >= n)A{i = start-\-{n-start-2))A{start-h{n-start-2)-l < n)) = {{i = n-l)A{n-2<n))} i f i < n -> i : = i + 1 f i { sp{i := i -h 1. 3. 12. { {i = I) A {start < n) } 2. 19.150 GANNOD AND CHENG 1. 20.1) A (n . 9. 27. 6. {i < n) A {i = start) A {start < n)) V {{i >= n) A{i = start) A {start < n)) = {{i = start + 1) A {start < n)) } i f i < n -> i : = i + 1 f i { sp{i := i -\-1. 10. Annotated Source Code for Unrolled Loop . 11. 21. 13. 29.1)) A {start + (n . i:= s t a r t . 23. 16. 18. 24. (z < n) A (i = n .start . 14. 25.1) .2 < n)) V {{i>=n)A{i = n-l)A{n-2< n)) = {^ = n)} Figure 4. 26. 15.

W BnmdR is the postcondition). 2. 6. Using the specification obtained from step 4 as a guideline. then the postcondition of the loop should be satisfied (P A ^BB -^ R. 3.2. query the user for a loop invariant. The following criteria are the main characteristics to be identified during the specification of the repetition statement: • • invariant (P): an expression describing the conditions prior to entry and upon exit of the iterative structure. Using the relationship stated above ( P A -*BB -^ R). Bi — Si terminates with P true. so that P is an invariant > of the loop. Apply the strongest postcondition to the loop body Si using the precondition given by step 3. 1981). Gries. 5. Strategy for constructing a specification for an iteration statement 4. Although this step is non-trivial. techniques exist that aid in the construction of loop invariants (Katz and Manna. where BB = BiV . 1976. {P A Bi}Si{P}. fori <i<n When none of the guards is true and the invariant is true. wlp can be applied to the assertion. In order to verify that the modifications made by a user are valid.STRONGEST POSTCONDITION SEMANTICS 151 1. Query the user for modifications to the assertion made in step 2. 4. Figure 5. Procedural Abstractions This section describes the construction of formal specifications from code containing the use of non-recursive procedural abstractions. and the loop invariant. guards (B): Boolean expressions that restrict the entry into the loop.. Execution of each guarded command. . This guided interaction allows the user to provide generalizations about arbitrary iterations of the loop. Begin by introducing the assertion "Q ^ ^ ^ " ^s the precondition to the body of the loop. A procedure declaration can be represented using the following notation . construct the specification of the loop by taking the negation of the loop guard.

respectively. That is. alternation. respectively. where a. result z ). and Ei is one or more output parameter types with attribute value-result or result. Using this theorem for the procedure call. is constructed using the previously defined guidelines for assignment. 1981) using a total correctness model of execution. non-recursive. sequence. Informally. 1981) {PRT : P f f A iWu. value-result. Likewise. the condition states that PRT must hold before the execution of procedure p in order to satisfy R.i => I^l)}pia. Given a procedure declaration of the above form. and terminates. and 'z represent the value. respectively. A specification of a procedure can be constructed to be of the form {P:f/} proc p : EQ —^ El (body) {Q:sp(bodyM)AU} where EQ is one or more input parameter types with attribute value or value-result.c) {R} (8) for a procedure call p(a. c). Gries defines a theorem for specifying the effects of a procedure call (Gries. PRT states that the precondition for procedure p must hold for the parameters passed to the procedure and that the postcondition for procedure/? implies R for each value-result and result parameter. The formulation of Equation (8) in terms of a partial correctness model of execution is identical. The signature of a procedure appears as proc/7: {inputjtype)* — {output Jype)* > (7) where the Kleene star (*) indicates zero or more repetitions of the preceding unit.152 GANNOD AND CHENG proc p (value x\ value-result y. the construction of a formal specification from a procedure call can be performed by inlining a procedure call and using the strongest postcondition for assignment. 6. {P}{body){Q} where x. The notation (body ) represents one or more statements making up the "procedure". 6. and result. The postcondition for the body of the procedure. y. Local variables of procedure p used to compute value-result and result parameters are represented using u and v. value-result. the following condition holds (Gries. . In addition. and output-type denotes the one or more names of output parameters of procedure p. sp(body. A parameter of type value means that the parameter is used only for input to the procedure. Parameters that are known as value-result indicate that the parameters can be used for both input and output to the procedure. and c represent the actual parameters of type value. and result parameters for the procedure. input-type denotes the one or more names of input parameters to the procedure p. while {P} and {Q} are the precondition and postcondition.v :: Q|. U). a parameter of type result indicates that the parameter is used only for output from the procedure. an abstraction of the effects of a procedure call can be derived using a specification of the procedure declaration. assuming that the procedure is straight-line. and iteration as applied to the statements of the procedure body.b. respectively.

{«} end Figure 6. parameter binding can be achieved through multiple assignment statements and a postcondition R can be established by using the sp for assignment. y. {PR} is the precondition for the call to procedure/?. c) end begin d e c l a r e x. By representing a procedure call in this manner. v.y : = a. and { /?} is the specification of the program after the actual parameters to the procedure call have been "returned". {e^} _ t>. z . 6. {Q} is the specification of the program after the procedure has been executed.c : = y . { QR} is the specification of the program after formal parameters have been assigned with the values of local variables.v. where (body) comprises the statements of the procedure declaration for/?. 6. c) abstraction A procedure call p(a. {P}is the specification of the program after the formal parameters have been replaced by actual parameters. z". u.z : = u.b. Removal of a procedural abstraction enables the extension of the notion of straight-line programs to include non-recursive straight-line procedures. Removal of procedure call p(a. c) can be represented by the program block (Gries. 6. 1981) found in Figure 6. {p} (body) {Q}__ y. {PR } _ _ x. we can annotate the code sequence from Figure 6 to appear as follows: . Making the appropriate sp substitutions.STRONGEST POSTCONDITION SEMANTICS 153 begin '{PR} p(a.

P).z := u. .-G/?Ab = y A c = ^ } where Q is derived using sp{{body). Gannod and Cheng.z : = u. y (after execution of the procedure body). ^. Example The following example demonstrates the use of four major programming constructs described in this paper (assignment.c : = YJZ". ^ {P. and c. { R: (3^.b. the above sequence can be simplified using the semantics of 577 for assignments to obtain the following annotated code sequence: {PR } _ _ x.1.-(37. 7. A U T O S P E C (Cheng and Gannod.y : = a. alternation. _ _ {ei?. has four procedures. 1991. 1993.-P/?Ax = a A y = b } {body) {G}__ y.-QAy = u ^ A z = v ^ } b./3:: PR1% A X = a^'^ A y = b1%) } {body) {e}__ y. shown in Figure 7.v. b. sequence. Gannod and Cheng. we described how the existential operators and the textual substitution could be removed from the calculation of the sp. and procedure call) along with the application of the translation rules for abstracting formal specifications from code.b.^ {/?. respectively. 5. and that local variables are used to compute the values of the value-result parameters. C ^» and ip are the initial values of x.c : = y.C::e^|Ay = u | j A z = v ^ | ) } b. { P: {3a. 1994) is a tool that we have developed to support the derivational approach to the reverse engineering of formal specifications fi-om program code.154 GANNOD AND CHENG {PR } _ _ x.^ :: fi/^^f Ab = y5'5 A c = z^f) } where a. including three different implementations of "swap". {G/?. The program. ^.'z .y : = a. Applying that technique to assignments and recognizing that formal and actual result parameters have no initial values. Recall that in Section 3. y (before execution of the procedure » body).v.

Largest. var temp : integer.b). var Max.b). end else begin Max := NumTwo.b. Min := NumOne. Y:integer ) . begin if NumOne > NumTwo then begin Max := NumOne. var temp : integer. NumTwo:real. Min := NumTwo. begin a := 5. Y := tenp end. var a. FindMaxMin{a. end end. procedure FindMaxMin{NumOne.X end. and 10 depict the output of A U T O S P E C when applied to the program code given in Figure 7 where the notation id{scope}instance is used to indicate a variable i d with scope defined by the referencing environment for scope. 9. var Y:integer ) . end.Smallest). procedure swapa( var X:integer. begin temp := X. b := 10. begin Y + X = Y . Figure 7. swapa(a. The i n s t a n c e identifier . var Y:integer ) . X := Y. Largest. b. Min:real ) .X Y = Y . funnyswap{a. c := Largest. c. Smallest : real. begin temp := X. X := Y. procedure funnyswap( X:integer. procedure swapb( var X:integer. output ) . swapb(a. Example Pascal program Figures 8.b). Y := temp end.STRONGEST POSTCONDITION SEMANTICS 155 program MaxMin ( input.

Although the variables being referenced are outside the scope of the calling procedure. ( Min{2}l = NumOneO & U *) * end J: ( (Max{2)l = NumTwoO & Min{2}l = Nu* mOneO) & U *) K: ( (((NumOneO > NumTwoO) & * (Max{0}l = NumOneO & Min{0}l = NumTwoO)) | (not (NumOneO > NumTwoO) & (Max{0}l = NumTwoO & Min{0}l = NumOneO ) ) ) & U *) end L: ( (((NumOneO > NumTwoO) & * (Max{0}l = NumOneO & Min{0}l = N\JimTwoO)) | (not (NumOneO > NumTwoO) & (Max{0}l = NumTwoO & Min{0}l = NumOneO ) ) ) & U *) Figure 8. procedure FindMaxMin( NumOne. NumTwo:real. When scope is an identifier. c. ( Max{2}l = NiimTwoO & U *) * Min := NumOne. var Max. var a.156 GANNOD AND CHENG program McixMin { input. if a call to some arbitrary procedure called f oo is invoked. it might appear in a specification outside its local context as q{f oo}4. a specification of the input and output parameters for f oo can provide valuable information. the specification of the calling procedure will have references to variables local to f oo. The scope identifier has two purposes. ( Max{2)l = NumOneO & U *) * Min := NumTwo. . Smallest : real. b. begin if (NumOne > NumTwo) then begin Max := NumOne. we use the scope label So. Output created by applying AUTOSPEC to example is used to provide an ordering of the assignments to a variable. Largest. it provides information about variables specified in a different context. output ) . it indicates the level of nesting within the current program or procedure. then specifications for variables local to f oo are labeled with an integer scope. When scope is an integer. such as the logic used to obtain the specification for the output variables to f oo. in the specification for the variables local to f oo but outside the scope of the calling procedure. if we have a variable q local to f oo. Upon return. Therefore. ( Min{2}l = NumTwoO & U *) * end I: ( (Max{2)l = Nu* mOneO & Min{2}l = NumTwoO) & U *) else begin Max := NumTwo. For instance. Min:real ) . As such. where "4" indicates the fourth instance of variable q in the context of f oo.

we see the specification of variables x and Y from the context of swapa. which gives the specification for the entire procedure. where lines I. The variables x and Y are specified using the notation described above. and p. respectively. The specification of the main begin-end block of the program MaxMin is given in Figure 10. of interest are the final values of the variables that are local to the program MaxMin (i. Of particular interest are the specifications for the swap procedures given in Figure 9 named swapa and swapb. '{o}' describes the level of nesting (here. which is shown by line M of Figure 10. However. The effects of the call to procedure FindMaxMin provides another example of the specification of a procedure call (line N). and ' 1 ' is the historical subscript. and x{o}l = Y is the O O specification of the final value of x. and the effect of the entire procedure (L). The specification at line K demonstrates the use of identifier scope labels. In addition. The final comment for swapa (line M). a property appropriately captured by the respective specifications for swapa and swapb with respect to the final values of the variables x and Y. '&' to denote a logical-and. J. Figure 10 shows the formal specification of the funnyswap procedure. Thus. denoted Y{O}I. Although each implementation of the swap operation is different. according to the rules for historical subscripts. As such. L. b{o}3. Line N is the specification after the execution of the last line and reads as: (* (Y{0}1 = X & X{0}1 = Y & temp{0}l = XO) & U *) O O where Y{O}I = X is the specification of the final value of Y. we use the notation ' | ' to denote a logical-or. no variables local to the scope of the call to funnyswap are affected by funnyswap due to the pass by value nature of funnyswap. reads as: ( (Y{0}2 = XO S X{0}1 = YO & Y{0}1 = YO + XO) & U *) * c where Y{O}2 = X is the specification of the final value of Y. Finally. labeled i. M. and c{o}l . with every precondition propagated to the final postcondition as described in Section 3. the intermediate value of Y. respectively. the *l' indicating the first instance of Y after the initial value. In the main program. specifications). and thus the specification shows no change in variable values.1. the code in each procedure effectively produces the same results. where in this case. b. In Figure 8. are given.. line P is the specification of the entire program. and the symbols ' (* * ) ' to delimit comments (i.e. a. the code for the procedure FindMaxMin contains an alternation statement. o. Lines I and J specify the effects of assignment statements. where Y is the variable. and C). Line L is another example of the same idea. the parameter passing scheme used in this procedure is pass by value.. the a{o}3. K. N. and x{o}l = Y is the O D specification of the final value of x. There are eight lines of interest. Here.e. In this case. K. the effect of the alternation statement (K). where the specification of variables from the context of swapb (x and Y). and L specify the guarded commands of the alternation statement (i and j ) . J. the first assignment to Y is written using Y{O}I. with value Y + xo is not considered in the final value of Y. O Procedure swapb uses a temporary variable algorithm for swap. it is zero). The semantics for the funnyswap procedure are similar to that of swapb.STRONGEST POSTCONDITION SEMANTICS 157 In addition to the notations for variables.

X ) .. var temp : integer. begin temp := X. ( (X{0)1 = YO) & U *) * Y := temp. the logic that was used to obtain the values for the variables of interest can be analyzed.) are of interest. ( (Y{0}1 = XO) & U *) * end ( (Y{0)1 = XO & X{0}1 = YO & temp{0}l = XO) & U *) * procedure funnyswap( X:integer. begin temp := X. Output created by applying AUTOSPEC to example (cont. ( (Y{0}1 = XO) & U *) * end ( (Y{0}1 = XO & X{0}1 = YO & temp{0}l = XO) & U *) * Figure 9. ( (tenp{0}l = XO) & U *) * X := Y. begin Y := (Y + X ) . ).158 GANNOD AND CHENG procedure swapa( var X:integer. formal approaches to reverse engineering have used the semantics of the weakest precondition predicate transformer wp as the underlying formalism of their technique.XO))) & U *) & Y(0)1 = YO + XO) & U *) procedure swapb( var X:integer. by propagating the preconditions for each statement. A knowledge-base manages the correctness preserving transforma- . ( {Y(0)1 = (YO + XO)) & * X := (Y . var Y:integer var tenp : integer. In addition. 1989). ( (X{0)1 = YO) & U *) * Y := temp.X) . ( (Y{0}2 = ((YO + XO) * end ( (Y{0}2 = XO & X{0)1 = YO * U *) XO)) & U *) ((YO + XO) . ( (X{0}1 = ((YO + XO) * Y := (Y . The Maintainer's Assistant uses a knowledge-based transformational approach to construct formal specifications from program code via the use of a Wide-Spectrum Language (WSL)(Ward et al. 6. A WSL is a language that uses both specification and imperative language constructs. ( (temp{0)l = XO) & U *) * X := Y. Related Work Previously. Y:integer ) . var Y:integer ) .

b) ( (b{0}3 = 10 & * (a{0}3 = 5 & (Y{swapb}l = 10 & (X{swapb}l = 5 & temp{swapb}l =10)))) & U *) funnyswap(a.) tions of concrete. implementation constructs in a WSL to abstract specification constructs in the same WSL. b{0}l = 10' & U *) swapa{a.b) ( (Y{funnyswap}l = 5 & X{funnyswap)l = 10 & * tenp(funnyswap}l = 5 ) & U *) FindMaxMin{a.Smallest) ( (Smallest{0}l = Min{FindMaxMin)l & * Largest{0)1 = Max{FindMaxMin}l & (({5 > 10) & {Max{FindMaxMin}l = 5 & Min{FincaMaxMin}l = 10)) | (not (5 > 10) & (Max{FindMaxMin)l = 10 & Min{FindMaxMin)l = 5)))) & U *) c := Largest.b) ( (b{0)2 = 5 & * (a{0}2 = 10 & {Y{swapa}2 ^ 5 & (X{swapa}l = 10 & Y{swapa)l =15)))) & U *) swapb(a.b. ( c{0}l = Max{FindMaxMin)l & U *) * ( ((c{0)l = Max{FindMaxMin}l) & * (Smallest{0)l = Min{FindMaxMin)l & Largest{0}l = Max{FindMaxMin)l & (((5 > 10) & (Max{Finc3MaxMin}l = 5 & Min{FindMaxMin}l = 1 0 ) ) | (not(5 > 10) & (Max{FindMaxMin)l = 10 & Min{FindMaxMin)l = 5))))) & ( Y{funnyswap}l = 5 & X{fvinnyswap) 1 = 1 tenip{funnyswap)l = 5 ) & ( b{0)3 = 10 6 c a{0}3 = 5 & (Y{swapb}l = 10 & X{swapb}l = 5 & teirp{swapb)l = 10)) & ( b{0}2 = 5 & a{0}2 = 10 & (Y{swapa}2 = 5 & X{swapa)l = 10 & Y{swapa}l = 15)) & (b{0}l = 10 & a{0)l = 5 ) & U *) Figure 10. a{0}l = 5 & U *) ( * ( * b := 10.STRONGEST POSTCONDITION SEMANTICS 159 ( Main Program for MaxMin *) * begin a := 5. . Output created by applying AUTOSPEC to example (cont.Largest.

As such. Our current investigations into the use of strongest postcondition for reverse engineering focus on three areas. no strongest total postcondition stp (Dijkstra and Scholten. 1991). The approach used to reverse engineer COBOL involves the development of general guidelines for the process of deriving objects and specifications from program code as well as providing a framework for formally reasoning about objects (Haughton and Lano. Automating the process of abstracting formal specifications from program code is sought but. Conclusions and Future Investigations Formal methods provide many benefits in the development of software. 1993). unfortunately. Validation and Documentation of Software Systems) is an Espirit II project whose objective is to improve applications by making them more maintainable through the use of reverse engineering techniques. much can be learned about the functionality of a system. 1990)). automated techniques for verifying the correctness of straight-line programs can be facilitated.160 GANNOD AND CHENG REDO (Lano and Breuer. To this end. Some differences in applying wp and sp are that wp is a backward rule for program semantics and assumes a total correctness model of execution. Directly related to this work is the potential for . the specifications contain implementationspecific information. Second. 7.e. not completely realizable as of yet. By using a partial correctness model of execution. by providing the tools that support the reverse engineering of software. both a forward rule {sp) and backward rule {wlp) can be used to verify and refine formal specifications generated by program understanding and reverse engineering tasks. the total correctness interpretation has no forward rule (i. the applied formalisms are based on the semantics of the weakest precondition predicate transformer wp. a rigorous technique for re-engineering specifications from the imperative programming paradigm to the object-oriented programming paradigm is being developed (Gannod and Cheng. 1989) (Restructuring. methods for constructing higher level abstractions from lower level abstractions are being investigated. For straight-line programs (programs without iteration or recursion) the techniques described herein can be applied in order to obtain a formal specification from program code. Maintenance. First. Since our technique to reverse engineering is based on the use of strongest postcondition for deriving formal specifications from program code. However. we are extending our method to encompasses all major facets of imperative programming constructs. In each of these approaches. that is. we are in the process of defining the formal semantics of the ANSI C programming language using strongest postcondition and are applying our techniques to a NASA mission control application for unmanned spacecraft. including iteration and recursion. The level of abstraction of specifications constructed using the techniques described in this paper are at the "as-built" level. The main difference between the two approaches is the ability to directly apply the strongest postcondition predicate transformer to code to construct formal specifications versus using the weakest precondition predicate transformer as a guideline for constructing formal specifications. Finally. and then applying those semantics to the programming constructs of a program. However. 
the application of the technique to other programming languages can be achieved by defining the formal semantics of a programming language using strongest postcondition.

July 1995.3) .H. it follows that spp{x:= e.(3) = > 5p(x:= e . 1992). Recall that 5p(x:= e. pp. IEEE Computer Society Press. the authors wish to thank Linda Wills for her efforts in organizing this special issue.4) (A. we address the elimination of the existential quantifier. Finally. Elimination of the existential quantifier 2.l) There are two goals that must be satisfied in order to use the definition of strongest postcondition for assignment. This is a revised and extended version of "Strongest Postcondition semantics as the Formal Basis for Reverse Engineering" by G. Acknowledgments The authors greatly appreciate the comments and suggestions from the anonymous referees. Also.STRONGEST POSTCONDITION SEMANTICS 161 applying the results to facilitate software reuse.1 states a conjecture that the removal of the quantification for the initial values of a variable is valid if the precondition Q has a conjunct that specifies the textual substitution. Q) (pronounced "s-p-rho") as the strongest postcondition for assignment with the quantifier removed. (A. 188-197. where automated reasoning is applied to the specifications of existing components to determine reusability (Jeng and Cheng. That is. the authors would like to thank the participants of the IEEE 1995 Working Conference on Reverse Engineering for the feedback and comments on an earlier version of this paper. Appendix A Motivations for Notation and Removal of Quantification Section 3. Given the definition of spp. Consider the RHS of definition A.Q) = {3v::QlAx = el).Q) = {Ql Ax = ey) forsome}'.2) Define spp{K:= e.l. (A. They are: 1. Gannod and B. which first appeared in the Proceedings of the Second Working Conference on Reverse Engineering.C. spp{K:= e. Let y be a variable such that (Q^ A X = el) ^ {3v :: Q^ A X = el). First. Cheng. Q ) . Eliminating the Quantifier. 4 (A. Development and use of a traceable notation. This Appendix discusses this conjecture.C.

The choice of y must be made carefully. Ideally, it is desired that the specification of the assignment statement satisfy two requirements. It must:

1. Describe the behaviour of the assignment to the variable x, and
2. Adjust the precondition Q so that the free occurrences of x are replaced with the value of x before the assignment is encountered.

Informally, the y of equation (A.3) can either be identified explicitly or named implicitly. As an example, consider the following. Let Q := P ∧ (x = z), where P is an expression that contains no free occurrences of x. At first glance, it would seem that the specification of the assignment statement can be made more simple by letting y in equation (A.3) be an arbitrary variable, since any such choice satisfies the first goal, namely removal of the quantification. However, this is not the case. Choosing an arbitrary α for y in (A.3) leads to the following derivation:

    spρ(x := e, Q)
      =   (Q := P ∧ (x = z))
    (P ∧ (x = z))^x_α ∧ (x = e^x_α)
      =   (textual substitution)
    P^x_α ∧ (x = z)^x_α ∧ (x = e^x_α)
      =   (P has no free occurrences of x; textual substitution)
    P ∧ (α = z) ∧ (x = e^x_α).

Notice that the last conjunct is (x = e^x_α) and that, since P contains no free occurrences of x, P is an invariant. Now suppose P were replaced with P' ∧ (α ≠ z). The derivation would lead to

    spρ(x := e, Q) = P' ∧ (α ≠ z) ∧ (α = z) ∧ (x = e^x_α).

This is unacceptable because it leads to a contradiction, meaning that the specification of a program describes impossible behaviour. If no suitable expression can be identified, use a place holder γ such that the precondition Q has no occurrence of γ. The convention used by the approach described in this paper is to choose for y the expression β for which (x = β) is a conjunct of the precondition; it can be proven that, through successive assignments to a variable x, the specification spρ will have only one conjunct of the form (x = β). That is, each successive application of spρ uses a textual substitution that eliminates free references to x in the precondition and introduces a conjunct of the form (x = β).

Notation. As such, the construction of the specification of assignment statements involves the propagation of the precondition Q as an invariant, conjuncted with the specification of the effects of setting a variable to a dependent value. It is observed that, by using historical subscripts, the evaluation of a specification annotation is made traceable, by avoiding the elimination of descriptions of variables and their values at certain steps in the program. This is especially helpful in the case where choice statements (alternation and iteration) create alternative values for specific variable instances.

Define spρι (pronounced "s-p-rho-iota") as the strongest postcondition for assignment with the quantifier removed and indices. Formally, spρι has the form

    spρι(x_k := e, Q) = (Q^x_y ∧ x_k = e^x_y)  for some y.    (A.5)

Let Q := P ∧ (x_i = y), where P has no occurrence of x other than i subscripted x's of the form (x_j = e_j), 0 ≤ j < i. Based on the previous discussion, an appropriate y must be chosen; choose y to be the RHS of the relation (x_i = y). The definition of spρι can then be modified to appear as

    spρι(x_{i+1} := e, Q) = ((P ∧ (x_i = y))^x_y ∧ x_{i+1} = e^x_y).    (A.6)

Consider the following example, where subscripts are used to show the effects of two consecutive assignments to the variable x. Let Q := P ∧ (x_i = a), and let the assignment statement be x := e. Application of spρι yields

    spρι(x := e, Q)
      = (P ∧ (x_i = a))^x_a ∧ (x_{i+1} = e^x_a)
      =   (textual substitution)
    P^x_a ∧ (x_i = a)^x_a ∧ (x_{i+1} = e^x_a)
      =   (textual substitution)
    P ∧ (x_i = a) ∧ (x_{i+1} = e^x_a).

A subsequent application of spρι on the statement x := f, subject to Q' := Q ∧ (x_{i+1} = e^x_a), has the following derivation:

    spρι(x := f, Q')
      = (P ∧ (x_i = a) ∧ (x_{i+1} = e^x_a))^x_{e^x_a} ∧ (x_{i+2} = f^x_{e^x_a})
      =   (textual substitution)
    P ∧ (x_i = a) ∧ (x_{i+1} = e^x_a) ∧ (x_{i+2} = f^x_{e^x_a})
      =   (definition of Q)
    Q ∧ (x_{i+1} = e^x_a) ∧ (x_{i+2} = f^x_{e^x_a})
      =   (definition of Q')
    Q' ∧ (x_{i+2} = f^x_{e^x_a}).

Therefore, each application of spρι propagates the precondition unchanged and appends a single new conjunct recording the latest value of x.
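The substitution mechanics above are straightforward to animate. The following Python sketch is ours, not the authors' tool; it assumes the sympy library, handles only the single-assignment case, and requires the caller to supply the β for which (x = β) is a conjunct of the precondition:

    import sympy as sym

    def sp_rho(var, expr, pre, beta):
        # sp_rho(var := expr, pre) = pre[var := beta] /\ (var = expr[var := beta]).
        # Substituting beta for var plays the role of the textual substitution
        # Q^x_beta above; (var = beta) is assumed to be a conjunct of pre.
        return sym.And(pre.subs(var, beta),
                       sym.Eq(var, expr.subs(var, beta)))

    x, z = sym.symbols('x z')
    Q = sym.Eq(x, z)                  # Q := (x = z), so beta is z
    print(sp_rho(x, x + 1, Q, z))     # Eq(x, z + 1); the conjunct (z = z)
                                      # is trivially true and drops out

Iterating the call, feeding each result back in as the next precondition, reproduces the historical-subscript derivation sketched above.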

References

Byrne, Eric J. A Conceptual Foundation for Software Re-engineering. In Proceedings for the Conference on Software Maintenance, pages 226-235. IEEE, 1992.

Byrne, Eric J. and Gustafson, David A. A Software Re-engineering Process Model. In COMPSAC. IEEE, 1992.

Cheng, Betty H.C. Applying formal methods in automated software development. Journal of Computer and Software Engineering, 2(2):137-164, 1994.

Cheng, Betty H.C. and Gannod, Gerald C. Abstraction of Formal Specifications from Program Code. In Proceedings for the IEEE 3rd International Conference on Tools for Artificial Intelligence, pages 125-128. IEEE, 1991.

Chikofsky, Elliot J. and Cross II, James H. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Software, 7(1):13-17, January 1990.

Dijkstra, Edsger W. A Discipline of Programming. Prentice Hall, 1976.

Dijkstra, Edsger W. and Scholten, Carel S. Predicate Calculus and Program Semantics. Springer-Verlag, 1990.

Gannod, Gerald C. and Cheng, Betty H.C. A Two Phase Approach to Reverse Engineering Using Formal Methods. Lecture Notes in Computer Science: Formal Methods in Programming and Their Applications, 735:335-348, July 1993.

Gannod, Gerald C. and Cheng, Betty H.C. Facilitating the Maintenance of Safety-Critical Systems Using Formal Methods. The International Journal of Software Engineering and Knowledge Engineering, 4(2):183-204, 1994.

Gries, David. The Science of Programming. Springer-Verlag, 1981.

Haughton, H. and Lano, K. Objects Revisited. In Proceedings for the Conference on Software Maintenance, pages 152-161. IEEE, 1991.

Hoare, C.A.R. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576-580, October 1969.

Jeng, Jun-jang and Cheng, Betty H.C. Using Automated Reasoning to Determine Software Reuse. International Journal of Software Engineering and Knowledge Engineering, 2(4):523-546, December 1992.

Katz, Shmuel and Manna, Zohar. Logical Analysis of Programs. Communications of the ACM, 19(4):188-206, April 1976.

Lano, Kevin and Breuer, Peter T. From Programs to Z Specifications. In John E. Nicholls, editor, Z User Workshop, pages 46-70. Springer-Verlag, 1989.

Leveson, Nancy G. and Turner, Clark S. An Investigation of the Therac-25 Accidents. IEEE Computer, 26(7):18-41, July 1993.

Osborne, Wilma M. and Chikofsky, Elliot J. Fitting pieces to the maintenance puzzle. IEEE Software, 7(1):11-12, January 1990.

Slind-Flor, Victoria. Ruling's Dicta Causes Uproar. The National Law Journal, July 1991.

Ward, M., Calliss, F.W., and Munro, M. The Maintainer's Assistant. In Proceedings for the Conference on Software Maintenance. IEEE, 1989.

Wing, Jeannette M. A Specifier's Introduction to Formal Methods. IEEE Computer, 23(9):8-24, September 1990.

Yourdon, Edward and Constantine, Larry L. Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design. Yourdon Press, 1978.

Automated Software Engineering, 3, 165-172 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Recent Trends and Open Issues in Reverse Engineering

LINDA M. WILLS, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250; linda.wills@ee.gatech.edu

JAMES H. CROSS II, Computer Science and Engineering, 107 Dunstan Hall, Auburn University, Auburn, AL 36849; cross@eng.auburn.edu

Abstract. This paper discusses recent trends in the field of reverse engineering, particularly those highlighted at the Second Working Conference on Reverse Engineering, held in July 1995. The trends observed include increased orientation toward tasks, grounding in complex real-world applications, analysis of non-code sources, guidance from empirical study, and increased formalization. The paper also summarizes open research issues and provides pointers to future events and sources of information in this area.

1. Introduction

Researchers in reverse engineering use a variety of metaphors to describe the role their work plays in software development and evolution. Some practice radiology, finding ways of viewing internal structures, obscured by and entangled with other parts of the software "organism": objects in procedural programs, logical data models in relational databases, architectural features, and data and control flow "circulatory and nervous systems." Others are software archeologists (Chikofsky, 1995), reconstructing models of structures buried in the accumulated deposits of software patches and fixes. They are detectives, piecing together clues incrementally discovered about a system's design and what "crimes" were committed in its evolution. They are rescuers, salvaging huge software investments left stranded by shifting hardware platforms and operating systems. They are foreign language interpreters, translating software in one language to another. They are inspectors, measuring compliance with design, coding, and documentation standards. And they are treasure hunters and miners, searching for gems to extract, polish, and save in a reuse library.

Although working from diverse points of view, reverse engineering researchers have a common goal of recovering information from existing software systems. By examining and analyzing the system, the reverse engineering process generates multiple views of the system that highlight its salient features and delineate its components and the relationships between them (Chikofsky and Cross, 1990). Comprehension of existing systems is the underlying goal of reverse engineering technology. Conceptual complexity is the software engineer's worst enemy; it directly affects costs and ultimately the reliability of the delivered system. Recovering design information from existing systems makes possible a wide array of critical software engineering activities, including those mentioned above. The prospect of being able to provide tools and methodologies to assist and automate portions of the reverse engineering process is an appealing one.

Reverse engineering is an area of tremendous economic importance to the software industry, not only in saving valuable existing assets, but also in facilitating the development of new software. From the many different metaphors used to describe the diverse roles that reverse engineering plays, it is apparent that supporting and semi-automating the process is a complex, multifarious problem. A variety of approaches and skills is required to attack this problem. To help achieve coherence and facilitate communication in this rapidly growing field, researchers and practitioners have been meeting at the Working Conference on Reverse Engineering, the first of which was held in May 1993 (Waters and Chikofsky, 1993). The Second Working Conference on Reverse Engineering (Wills et al., 1995) was held in July 1995, organized by general chair Elliot Chikofsky of Northeastern University and the DMR Group, and by program co-chairs Philip Newcomb of the Software Revolution and Linda Wills of Georgia Institute of Technology. The Working Conference provides a forum for researchers to discuss as a group current research directions and challenges to the field. The adjective "working" in the title emphasizes the conference's format of interspersing significant periods of discussion with paper presentations.

This article uses highlights and observations from the Second Working Conference on Reverse Engineering to present a recent snapshot of where we are with respect to our overall goals, what new trends are apparent in the field, and where we are heading. It also points out areas where hopefully more research attention will be drawn in the future. Finally, it provides pointers to future conferences and workshops in this area and places to find additional information.

2. Increased Task-Orientation

The diverse set of metaphors listed above indicates the variety of tasks in which reverse engineering plays a significant role. Different tasks place different demands on the reverse engineering process, with varying availability and accuracy of information about the software. The issue in reverse engineering is not only how to extract information from an existing system, but which information should be extracted and in what form should it be made accessible? There are many different types of information to extract and many different task situations. Researchers are recognizing the need to tailor reverse engineering tools toward recovering information relevant to the task at hand.

Dynamic Documentation. A topic of considerable interest is automatically generating accessible, dynamic documentation from legacy systems. The strategy is to concentrate on generating only documentation that addresses specific tasks, rather than generating all possible documentation whether it is needed or not. Lewis Johnson coined the phrase "explanation on demand" for this type of documentation technology (Johnson, 1995). Mechanisms for focused, goal-driven inquiries about a software system are actively being developed.

Two important open issues are: what formalisms are appropriate for documentation, and how well do existing formalisms match the particular tasks maintainers have to perform? These issues are relevant to documentation at all levels of abstraction. For example, a similar issue arises in program understanding: what kinds of formal design representations should be used as a target for program understanding systems?

How can multiple models of design abstractions be extracted, viewed, and integrated?

Varying the Depth of Analysis. Depending on the task, different levels of analysis power are required. For example, recent advances have been made in using analysis techniques to detect duplicate fragments of code in large software systems (Baker, 1995; Kontogiannis et al., 1995). This is useful in identifying candidates for reuse and in preventing inconsistent maintenance of conceptually related code. If a user were interested only in detecting instances of "cut-and-paste" reuse, without actually understanding the redundant pieces, it would be sufficient to find similarities based on matching syntactic features (e.g., variable usage, constant and function names, and keywords); a toy illustration of such purely syntactic matching appears at the end of this section. If, however, more complex semantic similarities are to be detected, for example, for the task of identifying families of reusable components that all embody the same mathematical equations or business rules, the depth of analysis must be increased. How is the focusing done? Who is controlling the depth of analysis and level of effort?

Interactive Tools. Related to the issue of providing flexibility in task-oriented tools is the degree of automation and interaction the tools have with people: programmers, maintainers, and domain experts (Quilici and Chin, 1995). The reverse engineering process is characterized by a search for knowledge about a design artifact with limited sources of information available. A person can often see global patterns in data or subtle connections to informal domain concepts that would be difficult for tools based on current technology to uncover. The person and the tool each bring different types of interpretive skills and information sources to the discovery process. Successful collaboration will depend on finding ways to leverage the respective abilities of the collaborators. The division of labor will be influenced by the task and environmental situation.
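The toy illustration promised above is ours and purely illustrative; it is not Baker's or Kontogiannis' system. It flags candidate "cut-and-paste" clones by hashing windows of lexical tokens in which identifiers are normalized away, so renamed copies still collide, while understanding nothing of the code's meaning:

    import re
    from collections import defaultdict

    KEYWORDS = {"if", "else", "for", "while", "return", "def"}

    def clone_candidates(source, window=6):
        # Tokenize each line crudely and replace identifiers (but not
        # keywords, literals, or punctuation) with a placeholder, giving a
        # purely syntactic "shape" for the line.
        shaped = []
        for num, line in enumerate(source.splitlines(), start=1):
            tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
            shape = tuple("ID" if (t[0].isalpha() or t[0] == "_")
                          and t not in KEYWORDS else t for t in tokens)
            if shape:
                shaped.append((num, shape))
        # Index every window of consecutive line-shapes; any shape shared
        # by two or more windows marks a group of candidate clones.
        index = defaultdict(list)
        for i in range(len(shaped) - window + 1):
            key = tuple(t for _, shape in shaped[i:i + window] for t in shape)
            index[key].append(shaped[i][0])
        return [starts for starts in index.values() if len(starts) > 1]

Industrial-strength matchers work at this same parameterized-token level but use more scalable machinery (suffix trees, metrics) to handle millions of lines; the point here is only how shallow the analysis can be while still catching verbatim and renamed copies.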

3. Attacking Industrial-Strength Problems

The types of problems that are driving reverse engineering research come from real-world systems and applications. Early work tended to focus on simplified versions of reverse engineering problems, often using data that did not always scale up to more realistic problems (Selfridge et al., 1993). This helped in initial explorations of techniques that have since matured. At the Working Conference, several researchers reported on the application of reverse engineering techniques to practical industrial problems with results of significant economic importance. The software and legacy systems to which their techniques are being applied are quite complex, large, and diverse. Examples include a public key encryption program, the X window system, industrial invoicing systems, and software for analyzing data sent back from space missions.

A good example of a large scale application was provided by Philip Newcomb (Newcomb, 1995), who presented a tool called the Legacy System Cataloging Facility. This tool supports modeling, analyzing, and transforming legacy systems on an enterprise scale by providing a mechanism for efficiently storing and managing huge models of information systems at Boeing Computer Services. Current applications such as these are pushing the limits of existing techniques in terms of scalability and feasibility. Exploring these issues and developing new techniques in the context of real-world systems and problems is critical.

4. More Empirical Studies

One of the prerequisites in addressing real-world, economically significant reverse engineering problems is understanding what the problems are and establishing requirements on what it would take to solve them. Researchers are recognizing the necessity of conducting studies that examine what practitioners are doing currently, what is needed to support them, and how well (or poorly) the existing technology is meeting their needs.

The results of one such full-scale case study were presented at the Working Conference by Piernicola Fiore (Fiore et al., 1995). The study focused on a reverse engineering project at a software factory (Basica S.p.A. in Italy) to reverse engineer banking software. Based on an analysis of productivity, the study identified the need for adaptable automated tools. Results indicated that cost is not necessarily related to the number of lines of code, and that the data and the program each need distinct econometric models.

While the value of case studies is widely recognized, relatively few have been conducted thus far. Papers describing case studies and available data sets would significantly contribute to advancing the research in this field (Selfridge et al., 1993) and are actively sought by the Working Conference. In addition to this formal, empirical investigation, some informal studies were reported at the Working Conference. During a panel discussion, Lewis Johnson described his work on dynamic, accessible documentation, which was driven by studies of inquiry episodes gathered from newsgroups. This helped to determine what types of questions software users and maintainers typically ask. Blaha and Premerlani (1995) reported on idiosyncracies they observed in relational database designs, many of which are in commercial software products!

Empirical data is useful not only in driving and guiding reverse engineering technology development, but also in estimating the effort involved in reverse engineering a given system. This can influence a software engineer's decisions about whether to reengineer a system or opt for continued maintenance or a complete redesign (Newcomb, 1995). Closely related is the critical need for publicly available data sets that embody representative reverse engineering problems (e.g., a legacy database system including all its associated documentation (Selfridge et al., 1993)). Unfortunately, it is difficult to find data sets that can be agreed upon as being representative of those found in common reverse engineering situations. They must not be proprietary, and they must be made easily accessible. Adopting such data sets as standard tests would enable researchers to quantitatively compare results and set clear milestones for measuring progress in the field.

5. Looking Beyond Code for Sources of Information

In trying to understand aspects of a software system, a reverse engineer uses all the sources of information available. In the past, most reverse engineering research focused on supporting the recovery of information solely from the source code.

Recently, the value of noncode system documents as rich sources of information has been recognized. Documents associated with the source code often contain information that is difficult to capture in the source code itself, such as design rationale, business rules, or the history of evolutionary steps that went into creating the software. For example, at the Working Conference, analysis techniques were presented that automatically derived test cases from reference manuals and structured requirements (Lutsky, 1995), business rules and a domain lexicon from structured analysis specifications (Leite and Cerqueira, 1995), and formal semantics from dataflow diagrams (Butler et al., 1995).

A crucial open issue in this area of exploration is what happens when one source of information is inaccurate or inconsistent with another source of information, as is the case in generating test cases. Who is the final arbiter? Often it is valuable simply to detect such inconsistencies.

6. Increased Formalization

When a field is just beginning to form, it is common for researchers to try many different informal techniques and experimental methodologies to get a handle on the complex problems they face. As the field matures, researchers start to formalize their methods and the underlying theory. This helps make the methods more precise and less prone to ambiguous results. The field of reverse engineering is starting to see this type of growth. A fruitful interplay is emerging between prototyping and experimenting with new techniques, which are sketched out informally, and the process of formalization, which tries to provide an underlying theoretical basis for these informal techniques.

Formal methods contribute to the validation of reverse engineering technology and to a clearer understanding of fundamental reverse engineering problems. While formal methods, with their well-defined notations, also have a tremendous potential for facilitating automation, they tend to introduce a communication barrier between the machine and the reverse engineer who is not familiar with formal methods. Making reverse engineering tools based on formal methods accessible to practicing engineers will require the support of interfaces to the formal notations, including graphical notations and domain-oriented representations. This raises issues of practicality, feasibility, and scalability; the current state-of-the-art focuses on small programs. A promising strategy is to explore how formal methods can be used in conjunction with other approaches, for example, coupling pattern matching with symbolic execution, building on connections to "human-oriented" concepts (Biggerstaff et al., 1994), or applying formal methods to component-based reuse (Lowry et al., 1994).

7. Challenges for the Future

Other issues not specifically addressed by papers presented at the Working Conference include:

• How do we validate and test reverse engineering technology? How do we measure its potential impact?

• How can we support the critical task of assessment that should precede any reverse engineering activity? This includes determining how amenable an artifact is to reverse engineering, what outcome is expected, the estimated cost of the reverse engineering project, and the anticipated cost of not reverse engineering. Most reverse engineering research assumes that reverse engineering will be performed, and thus overlooks this critical assessment task, which needs tools and methodologies to support it.

• What is the lifecycle of a reverse engineering activity and how does it relate to the forward engineering life-cycle? Can one codify the best practices of reverse engineers, and thereby improve the effectiveness of reverse engineering generally? A clearer articulation of the reverse engineering process is needed.

• What is management's role in the success of reverse engineering technology? From the perspective of management, reverse engineering is often seen as a temporary set of activities, focused on short-term transition; as such, management is reluctant to invest heavily in reverse engineering research, education, and application. In reality, reverse engineering can be used in forward engineering as well as maintenance to better control conceptual complexity across the life-cycle of evolving software.

• What can we do now to prevent the software systems we are currently creating from becoming the incomprehensible legacy systems of tomorrow? For example, what new problems does object-oriented code present? What types of programming language features, documentation, or design techniques are helpful for later comprehension and evolution of the software?

• A goal of reverse engineering research is to raise the conceptual level at which software tools interact and communicate with software engineers. This raises issues concerning how to most effectively acquire, refine, and use knowledge of the application domain. How can it be used to organize and present information extracted in terms the tool user can readily comprehend? What new presentation and visualization techniques are useful? How can domain knowledge be captured from noncode sources, domain experts, and end users?

• What new techniques are needed to reverse engineer programs written in non-traditional, domain-oriented "languages," such as spreadsheets, database queries, grammar-based specifications, and hardware description languages?

8. Conclusion and Future Events

This article has highlighted the key trends in the field of reverse engineering that we observed at the Second Working Conference. More details about the WCRE presentations and discussions are given in (Cross et al., 1995). The 1993 and 1995 WCRE proceedings are available from IEEE Computer Society Press.

Even more important than the trends and ideas discussed is the energy and enthusiasm shared by the research community. Even though the problems being attacked are complex, they are intensely interesting and highly relevant to many software-related activities. One of the hallmarks of the Working Conference is that Elliot Chikofsky manages to come up with amusing reverse engineering puzzles that allow attendees to revel in the reverse engineering process. For example, at the First Working Conference, he challenged attendees to reverse engineer jokes given only their punch-lines. This year, he created a "reverse taxonomy" of tongue-in-cheek definitions that needed to be reverse engineered into computing-related words.¹

The next Working Conference is planned for November 8-10, 1996 in Monterey, CA. It will be held in conjunction with the 1996 International Conference on Software Maintenance (ICSM). Further information on the upcoming Working Conference can be found at http://www.ee.gatech.edu/conferences/WCRE or by sending mail to wcre@computer.org. Other future events related to reverse engineering include:

• the Workshop on Program Comprehension, which was held in conjunction with the International Conference on Software Engineering in March, 1996 in Berlin, Germany;

• the Reengineering Forum, a commercially-oriented meeting which complements the Working Conference, being held June 27-28, 1996 in St. Louis, MO; and

• the International Workshop on Computer-Aided Software Engineering (CASE), which is being planned for London, England, in the Summer of 1997.

Acknowledgments

This article is based, in part, on notes taken by rapporteurs at the Second Working Conference on Reverse Engineering: Gerardo Canfora, David Eichmann, Jean-Luc Hainaut, Lewis Johnson, Julio Cesar Leite, Ettore Merlo, Michael Olsem, Alex Quilici, Howard Reubenstein, Spencer Rugaber, and Mark Wilson. We also appreciate comments from Lewis Johnson which contributed to our list of challenges.

Notes

1. Some examples of Elliot's reverse taxonomy: (A) a suggestion made to a computer; (B) the answer when asked "what is that bag the Blue Jays batter runs to after hitting the ball?"; (C) an instrument used for entering errors into a system. Answers: (A) command, (B) database, (C) keyboard.

References

Baker, B. 1995. On finding duplication and near-duplication in large software systems. In (Wills et al., 1995), pages 86-95.

Biggerstaff, T., Mitbander, B., and Webster, D. 1994. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-83, May.

Blaha, M. and Premerlani, W. 1995. Observed idiosyncracies of relational database designs. In (Wills et al., 1995), pages 116-125.

Butler, G., Grogono, P., Shinghal, R., and Tjandra, I. 1995. Retrieving information from data flow diagrams. In (Wills et al., 1995), pages 22-29.

Chikofsky, E. 1995. Message from the general chair. In (Wills et al., 1995), page ix. (Contains a particularly vivid analogy to archeology.)

Chikofsky, E. and Cross, J. 1990. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1):13-17, January.

Cross, J., Chikofsky, E., Newcomb, P., and Wills, L. 1995. Second working conference on reverse engineering summary report. ACM SIGSOFT Software Engineering Notes, 20(5):23-26, December.

Fiore, P., Lanubile, F., and Visaggio, G. 1995. Analyzing empirical data from a reverse engineering project. In (Wills et al., 1995), pages 106-114.

Johnson, L. 1995. Interactive explanation of software systems. In Proc. 10th Knowledge-Based Software Engineering Conference, Boston, MA, pages 155-164.

Kontogiannis, K., DeMori, R., Bernstein, M., Galler, M., and Merlo, E. 1995. Pattern matching for design concept localization. In (Wills et al., 1995), pages 96-103.

Leite, J. and Cerqueira, P. 1995. Recovering business rules from structured analysis specifications. In (Wills et al., 1995), pages 13-21.

Lowry, M., Philpot, A., Pressburger, T., and Underwood, I. 1994. A formal approach to domain-oriented software design environments. In Proc. 9th Knowledge-Based Software Engineering Conference, Monterey, CA, pages 48-57.

Lutsky, P. 1995. Automating testing by reverse engineering of software documentation. In (Wills et al., 1995), pages 8-12.

Newcomb, P. 1995. Legacy system cataloging facility. In (Wills et al., 1995), pages 52-60.

Quilici, A. and Chin, D. 1995. Decode: A cooperative environment for reverse-engineering legacy software. In (Wills et al., 1995), pages 156-165.

Selfridge, P., Waters, R., and Chikofsky, E. 1993. Challenges to the field of reverse engineering - a position paper. In Proc. of the First Working Conference on Reverse Engineering, Baltimore, MD, pages 144-150. IEEE Computer Society Press.

Waters, R. and Chikofsky, E., editors. 1993. Proc. of the First Working Conference on Reverse Engineering, Baltimore, MD, May. IEEE Computer Society Press.

Wills, L., Newcomb, P., and Chikofsky, E., editors. 1995. Proc. of the Second Working Conference on Reverse Engineering, Toronto, Ontario, July. IEEE Computer Society Press.

Automated Software Engineering 3, 173-178 (1996)
© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Desert Island Column

JOHN DOBSON
Centre for Software Reliability, University of Newcastle, Bedson Building, Newcastle NE1 7RU, U.K.
john.dobson@newcastle.ac.uk

When I started preparing for this article, I looked along my bookshelves to see what books I had on software engineering. There were none. It is not that software engineering has not been part of my life, but that I have not read anything on it as a subject that I wished to keep in order to read again. There were books on software, and books on engineering, and books on many a subject of interest to software engineers such as architecture and language. In fact these categories provided more than enough for me to wish to take to the desert island, so making the selection provided an enjoyable evening. I also chose to limit my quota to six (or maybe the editor did, I forget).

Since there was an element of choice involved, I made for myself some criteria: it had to be a book that I had read and enjoyed reading; it had to have (or have had) some significance for me in my career, either in terms of telling me how to do something or increasing my understanding; it had to be relevant to the kind of intellectual exercise we engage in when we are engineering software; and it had to be well-written. Of these, the last was the most important. There is a pleasure to be gained from reading a well-written book simply because it is written well. That doesn't necessarily mean easy to read; it means that there is a just and appropriate balance between what the writer has brought to the book and what the reader needs to bring in order to get the most out of it. All of my chosen books are well-written, and I hope you will read them for that reason.

First, a book on engineering: To Engineer is Human, by Petroski (1985). Actually it is not so much about engineering (understood as meaning civil engineering) as about the history of civil engineering. Perhaps that is why I have no books on software engineering: the discipline is not yet old enough to have a decent history. What is interesting about Petroski's book, though, is the way it can be used as a base text for a future book on the history of software engineering, for it shows how the civil engineering discipline (particularly the building of bridges) has developed through disaster. The importance of disasters lies in what is learnt from them, and this means that they have to be well documented. Petroski's book shows just how this has helped the development of the discipline. The major bridge disasters of civil engineering history—Tay Bridge, Tacoma Narrows—have their analogues in our famous disasters—Therac, the London Ambulance Service. The examples of software disasters that I gave have been documented, the London Ambulance Service particularly, but these are in the minority. There must be many undocumented disasters in software engineering, from which as a result nothing has been learnt. It is probably not possible, at least in the western world, to have a major disaster in civil engineering which can be completely concealed. This is yet another example of the main trouble with software being its invisibility, which is why engineering it is so hard.

What makes Petroski's book so pleasant to read is the stress he places on engineering as a human activity and on the forces that drive engineers. Engineering is something that is born in irritation with something that is not as good as it could have been; it is a matter of making bad design better. Engineering design lies in the details, the "minutely organized Particulars" as Blake calls them¹. But what about the general principles, the "generalizing Demonstrations of the Rational Power," the grand scheme of things in which the particulars have a place? In a word, the architecture—and of course the architect (the "Scoundrel, Hypocrite and Flatterer" who appeals to the "General Good"?).

It seems that the software engineer's favourite architect is Christopher Alexander. A number of colleagues have been influenced by that remarkable book A Pattern Language (Alexander et al., 1977), which is the architects' version of a library of reusable object classes. But for all its influence over software architects (its influence over real architects is, of course, much less noticeable), it is not the one I have chosen to take with me. Alexander's vision of the architectural language has come out of his vision of the architectural process, which he describes in an earlier book, The Timeless Way of Building (Alexander, 1979). The Timeless Way of Building is an exploration of this Zen-like way of doing architecture; indeed the book could have been called Zen and the Art of Architecture, but fortunately it was not. A cynical friend of mine commented, after he had read the book, "It is good to have thought like that"—the implication being that people who have been through that stage are more mature in their thinking than those who have not or who are still in it. I can see what he means, but I think he is being unfair.

There is much talk these days of empowerment. I am not sure what it means, though I am sure that a lot of people who use it do not know what it means either. When it is not being used merely as a fashionable management slogan, empowerment seems to be a recognition of the embodiment in an artifact of the Tao, the quality without a name. As applied to architecture, this quality has nothing to do with the architecture of the building or with the processes it supports and which stem from it. The architecture and architectural process should serve to release a more basic understanding which is native to us. We find that we already know how to make the building live, but that the power has been frozen in us.
Architectural empowerment is the unfreezing of this ability. Hypocrite and Flatterer" who appeals to the "General Good"?). "It is good to have thought like that"—the implication being that people who have been through that stage are more mature in their thinking than those who have not or who are still in it. 1985) and The Oregon Experiment (Alexander et al.

It would be very easy to write a book which does for software engineering what What Computers Can't Do did for artificial intelligence: raise a few deep issues. For those who have yet to read this book. 1976) (subtitled The Logic of Mathematical Discovery—making the . design and architecture. rhetoric. Now it is too easy. Looking again at the first three books I have chosen. I note that all of them deal with the human and not the technical side of software capabilities. remind us all that when we cease to think about something we start to say stupid things and make unwarranted claims. But nevertheless I think these books of Alexander's should be required reading. A bit more dialectic would not come amiss. It also shares Alexander's irritating tendency to give the uneasy impression that the project was not quite as successful as claimed. for example). since it is remarkable how shallow and uninteresting the theorems and proofs about the behaviour of programs usually are. and their limitations are provocatively explored in Hubert Dreyfus' famous book What Computers Can't Do (Dreyfus. I do find with What Computers Can't Do. Part of the promotion of any new discipline must involve a certain amount of overselling (look at that great engineer Brunei. One of the great developments in software engineering came when it was realised and accepted that the creation of software was a branch of mathematics. on occasion. 1979). 1988). have not been found agreeable to experience (as Gibbon remarked about the early Christian belief in the nearness of the end of the world). Where is it all going to end—indeed will it ever end? Is there anything that they can't do? Well of course there is. upset a lot of people. which has clearly been influenced by Alexander's view of the architectural process. The notion of proof is a particularly interesting one when it is applied to software. but it is worth remembering that some famous names in software engineering have. however they may deserve respect for their useftilness and authority. It might be harder to do it with Dreyfus' panache. It raises some searching questions about the nature and use of intelligence in our society. and perhaps a bit unfair. to tease the AI community with some of the sillier sayings of their founders. claims which. from flying aeroplanes to painting pictures. it is an enquiry into the basic philosophical presuppositions of the artificial intelligence domain. though. I do not wish to engage in that debate again here. then there is something about the architectural product that embodies the human intellect. and philosophic understanding. which is my third selection. particularly for those who like to acknowledge the influence of A Pattern Language. It sometimes seems as if computers have taken over almost every aspect of human intellectual endeavour. said things which perhaps they now wish they had not said. But the book is splendid reading.DESERT ISLAND COLUMN 175 Design of Computer Artifacts (Ehn. That would be a good next stage of development for requirements engineering to go through. with mathematical notions of logic and proof. Where are the new concepts that make for great advances in mathematical proofs? The best book I know that explores the nature of proof is Imre Lakatos' Proofs and Refutations (Lakatos. It is also a reaction against some of the more exaggerated claims of proponents of artificial intelligence. If there is something about the architectural process that somehow embodies the human spirit. 
Perhaps The Timeless Way of Building and The Production of Houses will come to have the same influence on the new breed of requirements engineers as A Pattern Language has had on software engineers. that the rhetoric gets in the way a bit.

This surely is a deathless work which so cleverly explores the nature of proof. In a way, it describes the history of mathematical proof in the way that To Engineer is Human describes the history of engineering (build it; oh dear, it's fallen down; build it again, but better this time). But the theorem discussed is just an ordinary invariant theorem (Euler's formula relating the vertices, edges and faces of a polyhedron: V − E + F = 2), and its proof is hardly a deep one. Yet Lakatos makes all sorts of deep discussion come out of this simple example: the role of formalism in the advancement of understanding, the process of concept formation, the role of counterexamples in producing new proofs by redefining concepts, the relationship between the certainty of a formal proof and the meaning of the denotational terms in the proof, and the role of formalism in convincing a mathematician. What makes Proofs and Refutations so memorable is its cleverness, its wit, its intellectual fun. To the extent that software engineering is a branch of mathematics, the discussion of the nature of mathematics (and there is no better discussion anywhere) is of relevance to software engineers.

Mathematics is not, of course, the only discipline of relevance to software engineering. Any object-oriented software engineer should, of course, be intensely interested in how people do categorise things and what the attributes are that are common to each category (since this will form the basis of the object model and schema). I find very little in my books on object-oriented requirements and design that tells me how to do this, except that many books tell me it is not easy and requires a lot of understanding of the subject domain—something which I know already but which lacks the concreteness of practical guidance. There is some debate about the objectivist stance and its relation to software (see the recent book Information Systems Development and Data Modelling by Hirschheim, Klein and Lyytinen (1995) for a fair discussion), but most software engineers seem reluctant to countenance any alternative view. I do not wish to engage in that debate again here. Instead, I would prefer to be accompanied by Women, Fire and Dangerous Things by Lakoff (1987). The subtitle of this book is What Categories Reveal about the Mind. The title comes from the fact that in the Dyirbal language of Australia, the words for women, fire, and dangerous things are all placed in one category—but not because women are considered fiery or dangerous. (I'm not going to tell you why; my aim is to get you to read this book as well.) What Lakoff's book does is to tell you what the basis of linguistic categorisation actually is. You should be aware, though, that the Lakoff book contains fundamental criticisms of the objectivist stance, which believes that meaning is a matter of truth and reference (i.e., that it concerns the relationship between symbols and things in the world) and that there is a single correct way of understanding what is and what is not true. With George Lakoff telling you about the linguistic basis for object classification, and Christopher Alexander telling you about how to go about finding out what a person or organisation's object classification is, you are beginning to get enough knowledge to design a computer system for them.

Since computer systems have to take their place in the world of people, they have to respect that social world. I have lots of books on that topic on my bookshelf, and the one that currently I like the best is Computers in Context by Dahlbom and Mathiassen (1993).
the process of concept formation. fire. it's fallen down. Mathematics is not. the role of counterexamples in producing new proofs by redefining concepts. and the role of formalism in convincing a mathematician. The subtitle of this book is What Categories Reveal about the Mind.

. and Abrams. [that's enough books. New York: Harper & Row. 1979) (which attempts to codify the rules of artistic composition) perhaps the most. I have tried to select a representative picture of engineering design. 1979. intellectual and linguistic context in which it takes place. New York: Oxford University Press.]. B.. all are relevant since they show that what is true of our discipline is true of other disciplines also. of the limitations and powers of mathematical formalisation of software.. M. A Pattern Language. 1977. H. is seen as a task not fit. to write bonkbusters. The Production of Houses. and the same meretricious research perhaps—constructing machines to invent the news in the newspapers. Alexander. The Oregon Experiment. J. of the architecture of software artifacts. C. Ehn. Stockholm: Arbetslivscentrum (ISBN 91-86158-45-7).. So although none of these books is about software engineering. it still seems very pointed. Now for my next trip to a desert island. It is the funniest novel about computers ever written. architectural. New York: Oxford University Press. too subversive. but of the historical. such as organising the official visit which the Queen is making to the Institute to open the new wing. though you have to read the whole book to get the most enjoyment out of it. (Or maybe it is just too hard. for any decently engineered software to engage in. 1993. For a book which was written more than thirty years ago.L. Dahlbom. to do good and say their prayers.) My final choice goes against my self-denying ordinance not to make fun of the artificial intelligentsia. and Mathiassen. The Timeless Way of Building. Dreyfus. not so much of the technical detail of software engineering. I know of some institutions that claim as a matter of pride to have been the original for the fictitious William Morris Institute of Automation Research (a stroke of inspiration there!). Note L Jerusalem. Alexander. MA and Oxford. Cambridge. Ishikawa. Work-Oriented Design of Computer Artifacts. 1985. Together they say something about my view. 1979. P 1988. C. I would like to take. Alexander.. UK: NCC Blackwell. plate 55. Wassily Kandinsky's book Point and Line to Plane (Kandinsky. Ishikawa.. Angel. Martinez. S. M. Computers in Context. C. So there it is. Ed. of the language software embodies and of the institutions in which software research is carried out. the technology may have been updated but the same individual types are still there. D. and therefore we can learn from them and use their paradigms as our own. What Computers Can't Do (revised edition). Part III. There are many other books from other disciplines of relevance to computing that I am particularly sorry to leave behind. and Comer. New York: Oxford University Press. References Alexander.DESERT ISLAND COLUMN 177 which is what all my chosen books so far are about. For those who appreciate such things. 1975. They still could be. to play all the world's sport and watch it—while the management gets on with more stimulating and demanding tasks. C. New York: Oxford University Press. S. and Silverstein. it also contains (in its last chapter) the best and most humorous use of self-reference ever published. Silverstein. hardly dated at all. S. L. in addition to the Kandinsky. and one of the great classics of comedy literature: The Tin Men by Frayn (1965). D.

1995). Proofs and Refutations. 1976. Information Systems Development and Data Modelling. W. Petroski. . To Engineer is Human. Lakoff. New York: St. Klein. Lakatos. Fire. Cambridge University Press. H. Zahar (Eds. H. University of Chicago Press. Rebay (Eds.. (originally published 1926.). in German). Cambridge University Press.). Worrall and E. Point and Line to Plane Trans. 1985.178 DOBSON Frayn. Hirschheim. and Dangerous Things: What Categories Reveal about the Mind. H. 1979. and Lyytinen. 1995. Women. (republished by Penguin Books. M. J. K. 1987. Dearstyne and H.K. G.. Martin's Press. The Tin Men. 1965. Kandinsky. London: Collins. New York: Dover. I. R.

7. 5. Enclose with each manuscript. Authors should submit five hard copies of their final manuscript to: Mrs. Acknowledgment of financial support may be given if appropriate. typewritten. Provide a separate double-space sheet listing all footnotes. 2. Use an informative title for the paper and include an abstract of 100 to 250 words at the head of the manuscript. The abstracts should be a carefully worded description of the problem addressed. 3. and the results. This will help the journal expedite the refereeing of the manuscript. Please include a telephone number.com 2. in the style described below. PROCESS FOR SUBMISSION 1. send an electronic mail message to <jkkluwer@world. including the title.com> at the time your manuscript is submitted. original work that has neither appeared in. Abstracts will be printed with the article. Alternatively. Please see ELECTRONIC SUBMISSION Section below. if available. other journals. Judith A. If possible.std. double or 1^ space. Typeset. Authors are strongly encouraged to use Kluwer's LSTgX journal style file. use one side of sheet only (laser printed. 4. the names of the authors. Enclose originals for the illustrations. nor is under consideration by. and good quality duplication acceptable). from three to five key words. MA 02061 Tel. the key ideas introduced. . 6. The refereeing is done by anonymous reviewers. Kemp AUTOMATED SOFTWARE ENGINEERING Editorial Office Kluwer Academic Publishers 101 Philip Drive Norwell. fax number and email address.: 617-871-6300 FAX: 617-871-6528 E-mail: jkemp@wkap. beginning with "Affiliation of author" and continuing with numbered references. Photocopies of the figures may accompany the remaining copies of the manuscript. STYLE FOR MANUSCRIPT 1. original illustrations may be submitted after the paper has been accepted. for one copy of the manuscript. Enclose a separate page giving the preferred address of the contact author for correspondence and return of proofs. 3.Automated Software Engineering An International Journal Instructions for Authors Authors are encouraged to submit high quality. on a separate page. and an abstract.

inclusive page numbers. Scaling theorems for zero crossings.L. Joint Conf Artif Intell. Technol. constants.g.. 722. e. title. A.1) WWW URL: gopher://gopher.. The style file may be accessed through a gopher site by means of the following commands: Internet: gopher g o p h e r . Massachusetts Inst. and Thurston. in the following style: Style for papers: Authors.90. Edge and curve detection for visual scene analysis. A. 1019-1021. Indicate best breaks for equations in case they will not fit on one line. 1971. Style for books: Authors. 1983. 1982).I. Type or mark mathematical copy exactly as they should appear in print. 1983. 1982. San Francisco: Freeman. Int. last names followed by first initials. the preferred format of submission is the Kluwer I^^^gX journal style file. Artif.com . All letter symbols in text discussion must be marked if they should be italic or boldface. memo) Yuille. T. €-20:562-569. Journal style for letter symbols is as follows: variables. (Journal) Rosenfeld. Karlsruhe. volume. Comput. MA.L Memo. Proc. If you do not have access to gopher or have questions. (Lab. italic type (indicated by underline). a Computational Investigation into the Human Representation & Processing of Visual Information. boldface type (indicated by wavy underline).87. n l or (IP number 192. use appropriate typeface.References should appear in a separate bibliography at the end of the paper in alphabetical order with items referred to in the text by author and date of publication in parentheses.nl Submitting and Author Instructions Submitting to a Journal Choose Journal Discipline Choose Journal Listing Submitting Camera Ready Authors are encouraged to read the ''About this menu" file. publisher and location. It will be assumed that letters in displayed equations are to be set in italic type unless you mark them otherwise. A. roman text type. pp. D. matrices and vectors. In word-processor manuscripts. Vision. A. IEEE Trans. Lab. chapter and page numbers (if desired).T.wkap. and Poggio. M.. Examples as follows: (Book) Marr. year of publication. (Conference Proceedings) Witkin. title. (Marr.. Cambridge. ELECTRONIC SUBMISSION PROCEDURE Upon acceptance of publication. M. please send e-mail to: srumsey@wkap. References should be complete. Intell. wkap. Scales space filtering. West Germany. year of publication.

Via disk 1. Via electronic mail 1. we will accept other common formats (e. WordPerfect or MicroSoft Word) as well as ASCII (text only) files. original figures in camera-ready form) should still be mailed to the appropriate Kluwer department. Recommended formats for sending files via e-mail: a.0) along with the authors' names.. and any changes made to the hard copy must be incorporated into the electronic version.compress. I^lgX or ASCII). The hard copy must match the electronic version. MA 02061 Any questions about the above procedures please send e-mail to: srumsey @ wkap. Compressing files . gunzip c. pkzip. e.eps or circlel. Also.tar 3.5 inch floppy disk with the operating system and word processing program (e. A PostScript figure file should be named after its figure number. it is also helpful to supply both the source and ASCII files of a paper. ELECTRONIC DELIVERY IMPORTANT .g. Mail disk to Kluwer Academic Publishers Desktop Department 101 Philip Drive Assinippi Park Norwell. DOSAVordPerfect5. manuscript title.The Kluwer L^T^ journal style file is the preferred format.Hard copy of the ACCEPTED paper (along with separate. 2.. Binary files . the name of the journal to which the paper has been accepted. Note. and we urge all authors to use this style for existing and future papers.g. Please e-mail ACCEPTED. we accept FrameMaker documents as "text only" files..eps. Label a 3.. original figures in camera-ready form.uuencode or binhex b.com .g. and name of journal to which the paper has been accepted. Collecting files . Please submit PostScript files for figures as well as separate.g.com 2. however. and the type of file (e. FINAL paper to KAPfiles @ wkap. figl. The e-mail message should include the author's last name.

include the illustration itself at the relevant place in the text. authors are required to sign a copyright transfer form before publication. which will be read by the reviewers. Provide a separate sheet listing all figure captions. reserving the term "figure" for material that has been drawn. In the remainder of copies. 8.. Specify the desired location of each table in the text. to simplify handling of the manuscript. in proper style for the typesetter. but place the table itself on a separate page following the text. Copyright Law. Authors must submit a signed copy of this form with their manuscript. of good contrast and gradation. and any reasonable size. 7. and of good contrast. Number each original on the back. 6.g. . 2. Each figure should be mentioned in the text and numbered consecutively using Arabic numerals. We regret that we cannot provide drafting or art service. In one of your copies. The proofread copy should be received back by the Publisher within 72 hours. Number each table consecutively using Arabic numerals." PROOFING Page proofs for articles to be included in a journal issue will be sent to the contact author for proofing. unless otherwise informed. Examples of the fault coverage of random vectors in (a) combinational and (b) sequential circuits.STYLE FOR ILLUSTRATIONS 1. To comply with the U. Line drawings should be in laser printer output or in India ink on paper. 4. specify the desired location of each figure in the text but place the original figure itself on a separate page. noise-free. 3. Photographs should be glossy prints.S. Originals for illustrations should be sharp. e. or board. 5. This form returns to authors and their employers full rights to reuse their material for their own purposes. All lettering should be large enough to permit legible reduction. which you should clearly distinguish. REPRINTS Each group of authors will be entitled to 50 free reprints of their paper. Type a brief title above each table. "Fig. Please label any material that can be typeset as a table. COPYRIGHT It is the policy of Kluwer Academic Publishers to own the copyright of all contributions it publishes. 3. Use 8 ^ by 11-inch (22 x 29 cm) size sheets if possible.

REVERSE ENGINEERING brings together in one place important contributions and up-to-date research results in this important area. REVERSE ENGINEERING serves as an excellent reference, providing insight into some of the most important research issues in the field.

ISBN 0-7923-9756-8