2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering
978-1-5090-1855-0/16 $31.00 © 2016 IEEE
DOI 10.1109/SANER.2016.33

Experience Report on Building ASTM Based Tools for Multi-Language Reverse Engineering

Günter Fleck∗, Wilhelm Kirchmayr†, Michael Moser‡, Ludwig Nocke†, Josef Pichler‡,
Rudolf Tober† and Michael Witlatschil∗

∗ Siemens AG Austria, Weiz, Austria
Email: (guenter.fleck, michael.witlatschil.ext)@siemens.com

† Voestalpine Stahl GmbH, Linz, Austria
Email: (wilhelm.kirchmayr, ludwig.nocke, rudolf.tober)@voestalpine.com

‡ Software Competence Center Hagenberg GmbH, Hagenberg, Austria
Email: (michael.moser, josef.pichler)@scch.at

Abstract—Reverse engineering tools are utilized for development,
maintenance, and modernization of software systems. The reverse engineering
community has developed a large number of reverse engineering tools for
different programming languages that support a variety of software
engineering activities. Although tools address different reverse engineering
problems and different programming languages, several issues with respect to
parsing, intermediate representations, code query, program analysis, etc. are
similar. However, reuse between tools takes place only on a moderate scale.
To facilitate reuse in building reverse engineering tools, we have used the
OMG standard ASTM as intermediate representation of source code together with
black box reuse of existing (free) language parsers. In this paper we report
on challenges, experiences, and solutions from several industrial research
projects, in which ASTM based reverse engineering tools have been developed
and used for re-documentation, re-engineering, and modernization of software
systems.

Keywords-reverse engineering tools; abstract syntax tree; ASTM standard;
static code analysis

I. INTRODUCTION

The reverse engineering community (both academic and industrial) has
developed many reverse engineering tools. Even though most of these tools
have a similar architecture, a variety of schemas (or meta-models) for
internal representation and for data exchange between tools have risen from
these initiatives [1]. Prominent examples include the Rigi Standard Format
(RSF) [2] and GXL [3], which have been adopted by a number of tools. A more
complete overview is provided by Kienle [4]. Whereas data exchange formats
are reused by many tools, the internal representation mostly remains
tool-specific. Furthermore, available formats are intended for abstract views
of source code to be reverse engineered and do not cover the level of program
statements.

To tackle this issue, the Abstract Syntax Tree Metamodel (ASTM) [5] standard
provided by the Object Management Group can be used as common intermediate
representation and as exchange format. The ASTM establishes a specification
for abstract syntax tree models. To provide for uniformity as well as a
universal framework for extension, ASTM is composed of the Generic Abstract
Syntax Tree Metamodel (GASTM) and a set of complementary, language-specific
specifications, called the Specialized Abstract Syntax Tree Metamodels
(SASTM). The usage of GASTM also promises simplification of multi-language
analysis.

While custom abstract syntax trees (AST) are widely used in reverse
engineering tools, the use of ASTM is only sporadically reported.
Nevertheless, we have built and applied several reverse engineering tools
around the ASTM standard. Especially the potential for multi-language reverse
engineering tools was appealing from the very beginning and, from where we
stand now, it has largely realized its potential. In this paper we give a
brief overview of tools we built around ASTM (Section II), describe our
experience from reusing (free) parsers to construct ASTM models (Section III)
and discuss the consequences of using ASTM as intermediate representation in
reverse engineering tools (Section IV).

II. BACKGROUND

The experience described in this paper stems from the development and
application of several reverse engineering tools in industrial context. The
Rulebook Generator (RbG) [6] supports the generation of documentation from
source code of technical software implemented in C, C++ or Fortran.
Programmers can direct the extraction and generation process by annotating
source code with custom-built tags. Metamorphosis [7] supports program
comprehension of legacy systems implemented in C, C++, Fortran or PL/SQL.
Based on the Metamorphosis platform, we developed a further tool (VaDoc) for
extraction of business logic from PL/SQL source code of software in the steel
making domain [8]. The extracted representations help to identify obsolete
and missing business cases and support an ongoing re-
engineering process. The DocIO tool analyzes file input and output behaviour
of C++ programs in order to generate documentation for written output files.
A similar tool (GlobVar) generates read and write access to global variables
of C programs from an embedded domain. In an ongoing project we build and
apply a documentation generator for COBOL programs (DoCO) in the insurance
domain. Although tools target different reverse engineering topics and
support different programming languages, all tools share commonalities
manifested in principal design decisions, implementation patterns, reuse of
analysis components, and reuse within a similar technological spectrum.

A. Principal Design Decisions

Reuse of (free) parsers, a multi-language intermediate representation, and
reuse of analysis algorithms are principal design decisions found across all
tools we consider in this paper.

Figure 1 shows a generalized architectural overview which all our tools have
in common. Source code files are parsed by reusing freely available language
parsers. This facilitates the rapid adoption of new programming languages.
Language-specific abstract syntax trees produced by language parsers are
transformed into an intermediate representation (IR), which follows ASTM.
Analysis of all reverse engineering tools targets syntax trees expressed as
GASTM or SASTM. This enables us to reuse analysis components across different
tools and programming languages. So far we have created reusable components
for call-graph analysis (CG), control-flow analysis (CFA), data-flow analysis
(DFA) and canonical simplification (CSI) of expressions. The developed
reverse engineering tools typically perform a one-step output generation,
which directly produces result documentation in the expected format (e.g.
HTML).

[Figure 1. Architectural overview. Labels in the figure: Parser Reuse (C/C++
.cpp, Fortran .f90, PL/SQL, COBOL .CBL, ...); Representation (IR: GASTM,
SASTM); Analysis (CG, CFG, DFA, ...); outputs ODF, LaTeX, HTML, ...]

B. Technology

All our tools are implemented in Java and use a GASTM implementation
generated by using the Eclipse Modeling Framework (EMF). Table I lists reused
parsers per programming language and gives the number of lines of Java code
(column SLOC) required to implement the construction of GASTM syntax trees
from language-specific parse trees. Our tools do not need full parse trees to
generate useful results. The column Coverage / Parser shows to which extent
different parser tree elements or hook methods needed to be covered to build
a GASTM representation suitable for our analysis. For instance, the Open
Fortran Parser provides 534 hook methods that are called during parsing and
that can be used to create a GASTM syntax tree. So far, 133 (about 25%) hook
methods needed to be implemented to perform our analysis and to generate
useful results for source code of industrial programs. The transformation of
parser-specific syntax trees into GASTM representations is typically
performed by model-to-model transformations. Only the Open Fortran Parser
required a hook method based transformation. The column Coverage / GASTM
shows the number of required GASTM elements to achieve the aforementioned
language coverage. The column numbers give the percentage of how many syntax
elements from a total of 165 syntax elements provided by GASTM are used. The
total number corresponds to all concrete subtypes of the type
GASTMSyntaxObject.

Table I
PARSER AND LANGUAGE COVERAGE.

Parser (Language)                                SLOC   Coverage   Coverage
                                                        Parser     GASTM
Eclipse CDT (C, C++)                             4,350  20%        60%
  https://eclipse.org/cdt
Open Fortran Parser (Fortran)                    6,450  25%        38%
  http://fortran-parser.sourceforge.net
JDeveloper, Akiban (PL/SQL)                        950  20%        21%
  http://www.oracle.com
  https://github.com/brunoribeiro/sql-parser
Koopa (COBOL)                                    3,100  14%        25%
  http://koopa.sourceforge.net

III. EXPERIENCE FROM PARSER REUSE

We discuss challenges and solutions from parser reuse for building
multi-language intermediate representations based on GASTM. Most parsing
problems stem from legacy code with deprecated language features,
dialect-specific code, and embedded code fragments written in a different
language (e.g. a SQL statement within a COBOL program). Such problems are
typical for formal parsers built on context-free grammars, which generate the
entire parse tree from source code. However, less precise methods can only be
employed if program analysis can also produce useful results from a subset of
the parse tree [9]. Semi-parsing methods are more robust (or tolerant) to the
above mentioned problems. Unfortunately, most available parsers that are
candidates for reuse are built on context-free grammars; semi-parsing
front-ends are the exception. From our experience, most problems indeed
appeared when Fortran and PL/SQL source code was parsed using full-fledged
parsers. In contrast, parsing COBOL by using an island parser was generally
unproblematic. Hence, the following discussion refers to
Fortran and PL/SQL parsers only.

We analyzed Fortran legacy code that was initially written thirty years ago.
It contains Fortran 77 language features and occasionally even Fortran IV
(i.e. Fortran 66) constructs. Whereas such constructs are tolerated by the
compiler used (Intel), the Open Fortran Parser has problems with some
constructs. This concerns syntactical issues as well as lexical issues.
Lexical issues appear in the Fortran fixed format and include a SUB character
at the end of the file, a tabulator character at column 6 (instead of a
whitespace), the dollar sign within identifiers, and comments starting in
column 73. Lexical issues appear frequently and are solved by a custom
preprocessor that operates on the input stream before the actual parser
starts.

Syntactical issues include initializers and missing parentheses similar to
the following example.

    Error: X = 0.2 * Y**-0.25
    Fixed: X = 0.2 * Y**(-0.25)

Since the effort to manually refactor such problems dramatically hinders the
adoption of a tool, we internally refactor language constructs so that tools
can deal with such code out of the box. Furthermore, we report refactoring
candidates and support actual refactoring by quick-fixes.

Some problems occur only sporadically and cannot be anticipated without
actually parsing the source code. Hence, when analyzing source code of a yet
unknown legacy system, there is a high probability that parsing problems will
occur. In practice, this is a show stopper for industrial application of
reverse engineering tools. Furthermore, sporadically appearing problems do
not justify building an automated refactoring. To mitigate this issue and to
increase the probability that the tools work out of the box, we use
try-and-ignore (TAI) parsing.

The key idea behind TAI parsing is that reverse engineering does not
necessarily require a full parse tree and, hence, parsing problems on unused
tree nodes may be ignored. For instance, the Fortran FORMAT statement, which
is predestined to cause parsing problems, can be ignored without compromising
on quality. As the actual parser is (black box) reused, it was not possible
to integrate this strategy in the parser. Instead, we realized the
try-and-ignore strategy as an upstream component. The algorithm is rather
straightforward and uses an internal parse routine that accepts a list of
problems (in particular line numbers) that should be ignored by the parser.
Technically, this is done by the preprocessing component, which replaces an
entire line by an empty line (in order to preserve line numbers). After a
first initial parsing, the algorithm iteratively tries to ignore the first
problem reported by the parser. If the problem does not appear anymore after
a subsequent parsing step or the number of problems decreased, the source
code line is permanently ignored and the problem is considered as fixed. The
try-and-ignore loop updates the list of problems after each parsing step.
Note that due to a parser's error-handling strategy the number of problems
may increase after a problem was fixed. Moreover, we observed that fixing
(ignoring) a problem in line n may even lead to new problems in lines before
line n in subsequent parsing steps.

Table II
RESULTS OF PREPROCESSING AND try-and-ignore PARSING STRATEGY.

Prog.    F1   F2   F3   F4    P1    P2   Fix  Fail  Remain
1       156   46   13    7    59    20    21    10       9
2       113   13   26    0   140    50    88     7       0
3       109   14   10    2    23    10     8     3       2
4       108   12   11    3    24    26    13     2       3
5        57   10    5    4    13     7     7     0       4
6        41   33    3    0    37     5     9     3       0
7        29   21    8    0    23    23    10     3       0
8        26   12    2    0    98     2    10     1       0
9        16    3    3    2    54     4     3     0       2
10       13    5    3    3    34    34     0     0       4
Total   668  169   84   21   455   153   169    29      24

Table II shows the improvement by preprocessing and TAI parsing for ten
Fortran programs of different size. Initially, 25% (column F2, 169/668) of
Fortran source files contain at least one parsing problem; after
preprocessing 13% (84/668) of Fortran source files (column F3) still contain
at least one parsing problem. Finally, after TAI parsing 3% (column F4,
21/668) of files have parsing problems. Note that for program 2, after the
preprocessing step we encounter more files (26) with at least one problem
than before (13). This is a consequence of preprocessing on the lexical
level, which occasionally introduces problems that are fixed (ignored) in the
next step. Fortunately, we detected this unsolicited behaviour only once and
at source code fragments that can be ignored without side effects.

The preprocessing step reduces the number of problems from a total of 455
(column P1) to 153 problems (column P2), which corresponds to the initial
number of problems for the TAI parsing. TAI eliminates 169 individual
problems but 24 problems remained not fixed. Surprisingly, TAI eliminates
more problems than initially available. This is a result of the error
handling strategy of the reused parser. Fixing a parser problem by ignoring a
source code line may increase the total number of problems because the parser
actually handles more source code lines. Of the 169 problems fixed by
ignoring the corresponding source code fragments, 29 source code fragments
are missing from the produced output. The preprocessing actions as well as
the try-and-ignore strategy were originally developed to resolve parsing
issues in the source code of program 2 and program 4. Nevertheless, the
results in Table II show an improvement for other programs as well.

IV. ASTM AS INTERMEDIATE REPRESENTATION

According to OMG documents, ASTM establishes a specification which provides
uniformity through GASTM and a universal framework for extensions through
SASTMs.
Towards multi-language reverse engineering tools, we aim to create abstract
syntax trees on GASTM only, i.e. language-independent models. This requires
mapping of different language elements to language-neutral elements.

A. Syntax Differences

When GASTM does not provide a corresponding element, the language construct
can be modeled with a different but equivalent element. For instance, GASTM
does not provide a model element for the arithmetic IF statement of Fortran;
however, every arithmetic IF can be expressed by a corresponding switch
statement, since the code IF (v) s1, s2, s3 corresponds to a switch statement
with case blocks for v < 0, v == 0 and v > 0, which provide goto statements
to the labels s1, s2 and s3. In the same way, the computed GOTO statement,
which is not directly covered by GASTM, can also be mapped to the
SwitchStatement model element. This mapping strategy is an alternative to
extending GASTM by language-specific constructs (SASTM). Even if extension is
foreseen by the ASTM standard, we explicitly tried to avoid this. The
difference between our approach (mapping to equivalent GASTM elements) and
the extension by using language-specific elements becomes apparent by
considering static analysis based on the syntax tree. Having GASTM elements
only facilitates multi-language analysis, e.g. a control-flow graph algorithm
that can handle a SwitchStatement element automatically handles arithmetic IF
and computed GOTO statements without any modification. Another example is the
usage of compound assignment operators in C++, which are mapped to ordinary
assignment operators. For instance, the C++ assignment a += 1 results in the
same syntax tree as the assignment a = a + 1.

B. Semantic Differences

When it comes to static analysis based on syntax trees, we sometimes
encounter different semantics of syntactically equivalent trees. For
instance, a SwitchStatement either has fall-through semantics (e.g. C/C++
switch) or not (e.g. COBOL EVALUATE, Fortran SELECT CASE, PL/SQL CASE). A
control-flow graph algorithm must handle such differences. The interesting
question is which analysis step should handle this difference. If the syntax
tree is created straightforwardly in the parsing step, two syntactically
equivalent syntax trees have different semantics, depending on the input
language. In this case, the subsequent analysis step (e.g. a CFG algorithm)
must be aware of this semantic difference. To facilitate language-independent
analysis we decided that GASTM trees have a single semantics only. In case of
SwitchStatement, we therefore chose to have fall-through following the C/C++
semantics. In order to describe non fall-through, we additionally insert
BreakStatement elements at the end of every case block during construction of
the GASTM tree. This may lead to dead code in cases when a break or goto
statement is already contained at the end of a case block. However, potential
dead code constructed in this way is eliminated later as part of the
control-flow analysis. Overall, in favour of real language-independent
analysis algorithms we accept this minor extra effort when building up GASTM
trees.

A second example stems from different language constructs for a counting
loop, i.e. the C/C++ for loop and the Fortran DO loop, which are both
represented by the GASTM element ForStatement. Whereas the C/C++ variant
allows an arbitrary boolean expression as loop condition, the Fortran loop
has an upper value which specifies the value contained in the loop variable
in the last iteration. Experiments to normalize the semantics of C/C++ loops
towards Fortran style and to transform the Fortran loop to the more general
form of C/C++ led to the decision to keep the language-specific difference,
because expected advantages in subsequent analysis steps did not justify the
drawback of a more complex GASTM creation.

C. Reuse Potential

Code reuse was the main reason for the design decision in favor of using
GASTM. From where we stand now, we know that GASTM facilitates code reuse in
two dimensions. Firstly, reuse is facilitated across different programming
languages targeted by a single reverse engineering tool. The presented
reverse engineering tools are generally developed in a language-independent
manner with limited language-specific adaptations. In the case of RbG, which
targets technical software implemented in C, C++ or Fortran,
language-specific handling of GASTM elements is only required for counting
loops as presented above. Secondly, GASTM facilitates reuse of analysis
components across different tools. As shown in Table III, software components
for construction and analysis of call graphs (CG) and control-flow graphs
(CFG), which were initially introduced with the RbG tool, are reused in
various other tools. The data-flow analysis (DFA) is also jointly used in two
different tools: Metamorphosis and DocIO.

Table III
REUSE POTENTIAL

Tool             Languages          CG CFG DFA CSI
RbG              Fortran, C++, C    X X X
Metamorphosis    C++, C, PL/SQL     X X
VaDoc            PL/SQL             X
DocIO            C++                X
GlobVar          C                  X
DoCO             COBOL              X X

V. RELATED WORK

The experience report presented in this paper is related to research and
experiences on the usage of the ASTM standard, and to experiences on
multi-language static program analysis. The ASTM standard is hardly used for
building tools,
which is also recognized by a systematic mapping study by Durelli et al.
[10]. The first industrial adoption was part of the EU ARTIST project, where
the MoDisco [11] tool included a reference implementation of GASTM. MoDisco
provides an extensible framework to develop model-driven tools to support
use-cases of software modernization. Further ASTM related work as well as our
own work Metamorphosis [7] is based on this reference implementation. Owens
and Anderson [12] use the GASTM as internal representation for automated
quality assurance of software models. They implemented the GASTM for a subset
of the Java programming language. Izquierdo and Molina [13] implemented and
extended GASTM for the PL/SQL programming language for building a software
modernization tool. The mentioned work used GASTM but only for a single
programming language. In our work, multi-language analysis was intended from
the very beginning and realized for several programming languages. Deltombe
et al. [14] worked on bridging GASTM and KDM within the REMICS project.
Overall, documented reuse of the OMG's ASTM standard is limited.

In contrast, industry and research provide an abundance of tools which
support multiple programming languages. Well-known examples from research
include Bauhaus [15] and Moose [16]. As with our tools, various programming
languages are represented with uniform abstract syntax trees. While Bauhaus
uses a proprietary representation called IML, Moose creates
language-independent representations which conform to the FAMIX meta-model.

VI. CONCLUSION

In this paper we discussed experiences from using the Abstract Syntax Tree
Metamodel (ASTM) as uniform intermediate representation in multi-language
reverse engineering tools. In our experience GASTM provides representations
which are sufficient for many programming languages and reverse engineering
tools. As shown in this paper, missing language elements can often be mapped
to equivalent elements covered by GASTM. Interestingly, for the purpose of
the presented reverse engineering tools not the entire GASTM implementation
needed to be covered. Coverage over all reverse engineering tools ranges from
66.3% (C/C++) to 21% (PL/SQL). We can confirm that GASTM facilitates
component reuse across different programming languages. However, in our
experience from industrial reverse engineering projects, reuse across tools
is not restricted by differences in the analyzed programming languages but by
differences in the expected results.

ACKNOWLEDGMENT

The research reported in this paper has been supported by the Austrian
Ministry for Transport, Innovation and Technology, the Federal Ministry of
Science, Research and Economy, and the Province of Upper Austria in the frame
of the COMET center SCCH.

REFERENCES

[1] H. M. Kienle and H. A. Müller, “The tools perspective on software reverse
    engineering: Requirements, construction, and evaluation,” Advances in
    Computers, vol. 79, pp. 189–290, 2010.
[2] H. M. Kienle and H. A. Müller, “Rigi-An environment for software reverse
    engineering, exploration, visualization, and redocumentation,” Sci.
    Comput. Program., vol. 75, pp. 247–263, Apr. 2010.
[3] R. Holt, A. Winter, and A. Schürr, “GXL: toward a standard exchange
    format,” in Reverse Engineering, 2000. Proceedings. Seventh Working
    Conference on, 2000, pp. 162–171.
[4] H. M. Kienle, “Exchange Format Bibliography,” SIGSOFT Softw. Eng. Notes,
    vol. 26, no. 1, pp. 56–60, Jan. 2001.
[5] OMG. (2011, January) Architecture-driven modernization: abstract syntax
    tree metamodel (ASTM), version 1.0. OMG. [Online]. Available:
    http://www.omg.org/spec/ASTM/1.0/
[6] M. Moser, J. Pichler, G. Fleck, and M. Witlatschil, “RbG: A documentation
    generator for scientific and engineering software,” in 22nd IEEE Int.
    Conf. on Software Analysis, Evolution, and Reengineering, SANER 2015,
    Montreal, QC, Canada, March 2-6, 2015, pp. 464–468.
[7] C. Klammer and J. Pichler, “Towards tool support for analyzing legacy
    systems in technical domains,” in 2014 Software Evolution Week - IEEE
    Conf. on Software Maintenance, Reengineering, and Reverse Engineering,
    CSMR-WCRE 2014, Belgium, February 3-6, 2014, pp. 371–374.
[8] M. Habringer, M. Moser, and J. Pichler, “Reverse engineering PL/SQL
    legacy code: An experience report,” in 30th IEEE Int. Conf. on Software
    Maintenance and Evolution, Victoria, BC, Canada, September 29 - October
    3, 2014, pp. 553–556.
[9] V. Zaytsev, “Formal Foundations for Semi-parsing,” in Proceedings of the
    Software Evolution Week (IEEE Conference on Software Maintenance,
    Reengineering and Reverse Engineering), Early Research Achievements Track
    (CSMR-WCRE 2014 ERA), S. Demeyer, D. Binkley, and F. Ricca, Eds. IEEE,
    Feb. 2014, pp. 313–317.
[10] R. Durelli, D. Santibanez, B. Marinho, R. Honda, M. Delamaro, N.
    Anquetil, and V. de Camargo, “A mapping study on architecture-driven
    modernization,” in Information Reuse and Integration (IRI), 2014 IEEE
    15th International Conference on, Aug 2014, pp. 577–584.
[11] H. Brunelière, J. Cabot, G. Dupé, and F. Madiot, “MoDisco: A model
    driven reverse engineering framework,” Information and Software
    Technology, vol. 56, no. 8, pp. 1012–1032, 2014.
[12] D. Owens and M. Anderson, “A generic framework for automated quality
    assurance of software models - application of an abstract syntax tree,”
    in Science and Information Conference (SAI), 2013, Oct 2013, pp. 207–211.
[13] J. Izquierdo and J. Molina, “An architecture-driven modernization tool
    for calculating metrics,” Software, IEEE, vol. 27, no. 4, pp. 37–43, July
    2010.
[14] G. Deltombe, O. L. Goaer, and F. Barbier, “Bridging KDM and ASTM for
    model-driven software modernization,” in SEKE, 2012, pp. 517–524.
[15] A. Raza, G. Vogel, and E. Plödereder, “Bauhaus: A tool suite for program
    analysis and reverse engineering,” in Proceedings of the 11th Ada-Europe
    International Conference on Reliable Software Technologies, ser.
    Ada-Europe’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 71–82.
[16] S. Ducasse, T. Gîrba, M. Lanza, and S. Demeyer, “Moose: a collaborative
    and extensible reengineering environment,” 2005.
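
As a supplementary illustration of the try-and-ignore (TAI) strategy
described in Section III, the following minimal Python sketch reproduces the
loop under stated assumptions. It is not the implementation used in the tools
reported here (which are written in Java). The names blank_lines,
try_and_ignore and toy_parse are hypothetical, a parsing problem is modelled
only by its line number, and the reused parser is replaced by a toy function.
The sketch further assumes that a line whose removal does not help is
restored and its problem is left unresolved, a detail the description in
Section III leaves open.

```python
from typing import Callable, List, Set, Tuple

# Assumption: a parser is any function that takes the source lines and
# returns the 1-based line numbers of the problems it reports.
ParseFn = Callable[[List[str]], List[int]]


def blank_lines(source: List[str], ignored: Set[int]) -> List[str]:
    """Replace ignored lines by empty lines so that line numbers are preserved."""
    return ["" if number in ignored else line
            for number, line in enumerate(source, start=1)]


def try_and_ignore(source: List[str], parse: ParseFn) -> Tuple[Set[int], List[int]]:
    """Return the permanently ignored lines and the problems that remain."""
    ignored: Set[int] = set()    # lines blanked out for good
    given_up: Set[int] = set()   # problems that ignoring could not fix
    problems = parse(blank_lines(source, ignored))  # first initial parsing
    while True:
        candidates = [p for p in problems if p not in given_up]
        if not candidates:
            break
        first = candidates[0]
        trial = ignored | {first}
        new_problems = parse(blank_lines(source, trial))
        # Accept if the problem disappeared or the problem count decreased.
        if first not in new_problems or len(new_problems) < len(problems):
            ignored = trial
            problems = new_problems  # list is updated after each parsing step
        else:
            given_up.add(first)
    return ignored, problems


def toy_parse(lines: List[str]) -> List[int]:
    """Toy stand-in for a black-box parser (hypothetical rules)."""
    problems = [n for n, line in enumerate(lines, start=1) if "**-" in line]
    if not any(line.strip() == "END" for line in lines):
        problems.append(len(lines))  # missing END reported at the last line
    return problems


SOURCE = ["X = 0.2 * Y**-0.25", "Z = 1", "W = Y**-2", "RETURN"]
IGNORED, REMAINING = try_and_ignore(SOURCE, toy_parse)
```

In this toy run, lines 1 and 3 are ignored, while the missing END reported at
line 4 cannot be fixed by blanking a line and therefore remains, which
corresponds to the Remain column of Table II.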

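
The syntax mappings of Section IV-A can be illustrated in the same spirit.
The following Python sketch maps a Fortran arithmetic IF to a switch element
with three goto cases and rewrites a compound assignment as an ordinary
assignment. All class and function names are hypothetical, simplified
stand-ins chosen for this illustration; they are not the element types of the
OMG GASTM metamodel, and only four arithmetic compound operators are covered.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical stand-ins for language-neutral tree elements (not GASTM classes).
@dataclass(frozen=True)
class BinaryExpression:
    operator: str
    left: Union[str, int, "BinaryExpression"]
    right: Union[str, int, "BinaryExpression"]


@dataclass(frozen=True)
class Assignment:
    target: str
    value: Union[str, int, BinaryExpression]


@dataclass(frozen=True)
class GotoStatement:
    label: str


@dataclass(frozen=True)
class SwitchCase:
    condition: BinaryExpression
    body: Tuple[GotoStatement, ...]


@dataclass(frozen=True)
class SwitchStatement:
    cases: Tuple[SwitchCase, ...]


def map_arithmetic_if(v: str, s1: str, s2: str, s3: str) -> SwitchStatement:
    """Map IF (v) s1, s2, s3 to cases for v < 0, v == 0 and v > 0."""
    return SwitchStatement(tuple(
        SwitchCase(BinaryExpression(op, v, 0), (GotoStatement(label),))
        for op, label in (("<", s1), ("==", s2), (">", s3))
    ))


# Assumption: only these compound operators are handled in the sketch.
COMPOUND_OPERATORS = {"+=": "+", "-=": "-", "*=": "*", "/=": "/"}


def map_compound_assignment(target: str, operator: str, operand) -> Assignment:
    """Map e.g. a += 1 to the tree of a = a + 1."""
    return Assignment(target,
                      BinaryExpression(COMPOUND_OPERATORS[operator], target, operand))


def map_plain_assignment(target: str, value) -> Assignment:
    """Ordinary assignments are represented directly."""
    return Assignment(target, value)
```

Because both assignment forms yield equal trees and the arithmetic IF becomes
an ordinary switch element, an analysis written once against these elements
needs no knowledge of the source language, which is the argument made in
Section IV-A.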

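
The normalization of non-fall-through case blocks from Section IV-B can be
sketched in a similarly reduced form. The Python fragment below appends a
break to every case block and afterwards removes statements that follow an
unconditional jump. It rests on strong simplifying assumptions: statements
are plain strings, case bodies are straight-line lists, and dead code is
removed by a simple scan rather than by the control-flow analysis used in the
tools. The function names are hypothetical.

```python
from typing import Dict, List

# Assumption: a statement is a plain string whose first word names its kind.
JUMP_KEYWORDS = ("break", "goto")


def is_jump(statement: str) -> bool:
    """True for statements that unconditionally leave the case block."""
    words = statement.split()
    return bool(words) and words[0] in JUMP_KEYWORDS


def add_breaks(cases: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Append a break to every case block to express non-fall-through."""
    return {label: body + ["break"] for label, body in cases.items()}


def drop_dead_code(body: List[str]) -> List[str]:
    """Keep statements up to and including the first unconditional jump."""
    kept: List[str] = []
    for statement in body:
        kept.append(statement)
        if is_jump(statement):
            break
    return kept


# Two case blocks as they might come from a non-fall-through language.
CASES = {"1": ["x = 1"], "2": ["goto 100"]}
NORMALIZED = add_breaks(CASES)
CLEANED = {label: drop_dead_code(body) for label, body in NORMALIZED.items()}
```

The second case shows the dead code mentioned in Section IV-B: the inserted
break directly follows a goto and is removed again in the clean-up step.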