Learning Visual Semantics to Support Program
Understanding and Search
Abstract—Understanding program semantics is of crucial importance to both programmers and end users for many software engineering tasks, such as finding an application of desired functionality, identifying similar applications, and reusing code at various levels. However, manual approaches to program understanding (e.g., code review) are tedious and unscalable, especially with large and growing code bases and search sources, whereas existing automated approaches suffer from limitations on multiple fronts that may impede their practical adoption. In this paper, we propose a novel solution based on visual semantics learning that both supports program understanding and facilitates application search across programming languages. By leveraging the latest advances in computer vision, machine learning, and natural language processing, our approach extracts the functional semantics of programs from their visual outputs. It then generates natural-language descriptions of program semantics from recognized visual objects, and answers search queries via natural-language matching. We have developed an early prototype of this approach and applied it to a set of real-world subject programs of varied functionality domains and programming languages. Our preliminary results show the promise of our solution, with 85% precision@2 and 95% recall@2 against the chosen subjects.

I. INTRODUCTION

Understanding a program is a fundamental task in software development, offering the basis for various other tasks ranging from localizing functionality defects and security vulnerabilities to identifying changes that fix them. Another common task, for both software developers and end users, is program search—finding existing programs of specific functionalities or looking for programs functionally similar to a particular one. For example, developers can learn from existing examples or even reuse existing code for their development activities. Essential for both tasks is a description of the functional semantics of a program. Such a description would immediately support understanding the program, and greatly facilitate searching for equivalent/similar programs if those other programs’ semantics descriptions are also available.

Unfortunately, obtaining such descriptions for programs of interest is difficult. Manual approaches, such as code inspection, are costly, error-prone, and generally hard to scale, especially with the ever-growing code bases and search sources. Also, such approaches are of little utility to end users (e.g., for program search), who are neither able nor supposed to understand program code directly. Automated approaches to program understanding and search, which could potentially overcome these limitations, would be much more desirable.

However, automatically deriving the functional semantics from a program is challenging. First, the (requirements/design) specification that describes the functionalities would greatly ease the task, yet such documents are usually unavailable. Second, specification mining and recovery techniques exist [1], yet their resulting descriptions/representations are too coarse or abstract to be amenable to human understanding or to reusable code identification. Third, conventional program analysis falls far short of automated semantics inference; even semantic dependencies can only be approximated (through syntactic analysis) [2]. Finally, semantic code analysis (e.g., [3], [4] for code search) is typically limited to a particular programming language, making code comprehension and search across languages a still daunting task. Existing relevant approaches also suffer from challenges of applicability (e.g., manually curated search templates [4] are hard to adopt in practice) and/or effectiveness (e.g., low precision/recall [5]), besides scalability.

In this paper, we propose the design of a novel solution that automatically understands code semantics, hence facilitating code search and reuse, by learning from the visual (pictorial) outputs of programs. Leveraging the latest advances in computer vision, machine learning, and natural language processing, our approach learns the functional semantics of programs from their visual outputs (e.g., data visualizations and GUI elements). It then generates natural-language descriptions of program semantics from recognized visual objects, and answers code-search queries via natural-language matching. Meanwhile, the automatically generated semantics description immediately supports understanding the program (without reading its code).

We refer to the functional semantics that can be demonstrated through visual outputs as visual semantics. Accordingly, programs that have visual semantics are referred to as visual programs. Given the difficulties of automatically inferring program semantics from code [6], it is promising to learn the visual semantics of programs, which complements code-analysis-based solutions. On the other hand, many (if not most) modern programs are visual—they produce visual outputs that represent at least part of their functional semantics. Thus, a code search/understanding approach based on visual semantics learning potentially has a wide application ground.

To validate the design of our approach, we have implemented an early prototype, called VISPUS (short for VIsual Semantics based Program Understanding and Search), using existing techniques for object recognition and natural language processing. We then applied VISPUS to 12 real-world subject programs from diverse sources, representing three different languages and varied application domains. Our preliminary results with VISPUS demonstrated the promising merits of our visual semantics learning approach to program understanding and search across languages: on average over the chosen subjects, VISPUS achieved 85% precision@2 and 95% recall@2 against varying search queries.

In summary, this paper makes the following contributions:
• The design of a novel solution to automated program understanding and search. By the nature of its design, our approach works across programming languages as it does not rely on code analysis.
• A tool prototype that implements our approach, and a preliminary evaluation of the prototype against real-world programs of varied languages and application domains, which shows the promising merits of our approach.
[Fig. 1: An overview of the proposed approach. Applications (executables) and test inputs crawled from search sources yield visual outputs, which flow through three components: (1) Visual Object Recognition, (2) Natural Language Generation, and (3) Natural Language Matching. Object descriptors and semantics descriptions are stored in a database; given an NL search query from the user, the system outputs a list of programs.]
II. OUR APPROACH

We first give a brief overview of our approach, followed by an elaboration of the key components of our design.

A. Overview
As shown in Figure 1, our design mainly consists of three major components (marked by numbered circles): visual object recognition (VOR), natural language generation (NLG), and natural language matching (NLM). The system takes as inputs the visual outputs of programs that are crawled from a range of search sources (e.g., GitHub). We assume that the program is executable (e.g., the search source provides working build facilities) and able to produce visual outputs that reflect its functional semantics. To that end, the test inputs that exercise visual semantics can be either crawled along with the program or generated with dedicated test input generators.
Next, the VOR takes the visual outputs, recognizes visual objects (e.g., a curve) in them, and produces a bag of words describing each recognized object (referred to as object descriptors). These descriptors are object labels resulting from the VOR, which is pre-trained via deep learning. The labels of a visual object are mostly, but not limited to, object names (e.g., “a circle”)—they could also be words describing object attributes (e.g., “red”). The descriptors are then fed to the NLG module to generate simple natural-language (NL) sentences (e.g., “the program plots a barchart”) that describe the program’s functional semantics (referred to as semantics descriptions). The resulting descriptions, along with the original program’s information (e.g., its location), are then stored in a database. As the application crawler keeps crawling more programs (and optionally their test inputs), the database keeps updating, forming the dynamics of the system.
Once a natural-language (NL) search query from a user is given (e.g., “an application that makes phone calls”), the NLM module will try to retrieve programs from the database by matching the query against the stored semantics descriptions that describe the visual semantics of those programs. For search efficiency, each record in the database maintains the mapping from a program’s meta information to its semantics descriptions; indexes could be created to speed up search by group (e.g., searching for any program having an OK button in its GUI). If any matching programs are found in the current database, their locations will be returned as search results; otherwise, the system returns None. Thus, the eventual output of the search is a list of programs (an empty list corresponds to None).
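To make this flow concrete, the following is a minimal sketch of the database record and query-matching loop described above. The names (ProgramRecord, search) and the threshold are hypothetical, and the similarity function stands in for the NLM component detailed later.

```python
from dataclasses import dataclass

@dataclass
class ProgramRecord:
    location: str        # program meta information, e.g., its repository URL
    descriptions: list   # NL semantics descriptions generated by the NLG module

def search(query: str, database: list, similarity, threshold: float = 0.5) -> list:
    """Return locations of programs whose semantics descriptions match the query."""
    hits = []
    for record in database:
        # The NLM component scores the query against each stored description.
        score = max(similarity(query, d) for d in record.descriptions)
        if score >= threshold:
            hits.append((score, record.location))
    # Best-matching programs first; an empty list corresponds to None.
    return [loc for _, loc in sorted(hits, reverse=True)]
```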
B. Visual Output Generation

Since our approach learns a program’s functional semantics from, and only from, its visual outputs, it naturally has the advantages of being language-agnostic and avoiding potentially substantial code analysis costs. Also, the understanding and search our approach supports can be at the whole-program level or at the class or even method level, as long as the corresponding levels of visual outputs are provided to the VOR followed by the two NLP components. For example, if the visual outputs are known to correspond to a method, then the generated semantics descriptions would describe the functional semantics of that method. Moreover, how well our approach can learn visual semantics from a program (i.e., how well the learned visual semantics covers the full functional semantics) depends on the quality of the visual outputs used. In sum, the visual outputs determine the (scope and granularity of the) capabilities and effectiveness of our approach.
Thus, visual output generation is a key part of our learning system. This module focuses on executing visual programs to produce visual outputs, using test inputs that effectively and maximally exercise a program’s visual semantics. For automated execution, if test inputs can be crawled from the search sources, existing build tools will be used to exercise the program against the inputs. Otherwise, our system may use automated test input generators or forced execution engines (e.g., [7]) to generate the test inputs. We exploit forced execution because this technique overcomes the potential unavailability of test inputs and the possibly undesirable coverage of available test inputs. Forced execution also comes with the capability of targeted path exploration, which can be customized to efficiently trigger visual outputs (e.g., with APIs that are known to serve visual-output purposes set as the targets).
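As a simple illustration of capturing a program’s visual output as an image, a harness for a Python plotting subject might run the program under a non-interactive backend and save the rendered figure. This is only a sketch for matplotlib-based subjects, with capture_output a hypothetical helper; GUI programs would instead require screenshots of their windows.

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen; no display needed
import runpy
import matplotlib.pyplot as plt

def capture_output(script_path: str, image_path: str) -> None:
    """Execute a plotting program and save its visual output as an image."""
    runpy.run_path(script_path)   # exercise the subject program
    plt.savefig(image_path)       # persist the visual output for the VOR
    plt.close("all")
```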
C. Visual Object Recognition (VOR)

The VOR component extracts the visual elements (objects) from the visual outputs, which are represented (and stored) as images. To that end, it segments each image of the visual outputs into regions of interest and creates bounding boxes. State-of-the-art computer vision techniques are applied for image processing and recognition of visual objects, which mainly involves three steps: (1) convert the image to grayscale and apply a threshold, (2) dilate, and (3) find contours and extract bounding boxes as separate images. The bounding boxes include the whole image itself. The segmented images are then fed to a pre-trained classifier. The classifier identifies the visual objects’ features and generates object descriptors. The descriptors essentially consist of bags of words (including object labels) that characterize the associated visual objects.
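The three segmentation steps above can be realized with standard OpenCV primitives, for example; the following is a minimal sketch under the clean-background assumption made later in the paper (the threshold and kernel values are illustrative, not tuned).

```python
import cv2

def segment(image_path: str):
    """Segment a visual output into candidate object images (bounding boxes)."""
    image = cv2.imread(image_path)
    # Step 1: grayscale with a threshold applied (inverted: dark objects on white).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    # Step 2: dilate to merge nearby fragments of the same object.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    dilated = cv2.dilate(binary, kernel, iterations=2)
    # Step 3: find contours and extract bounding boxes as separate images,
    # keeping the whole image itself as one candidate as well.
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    crops = [image]
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        crops.append(image[y:y + h, x:x + w])
    return crops
```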
At the core of the VOR component is a multi-class classifier based on deep learning (e.g., convolutional neural networks (CNNs)). This classifier determines the ability of our system to generate object descriptors, hence it immediately affects the accuracy of code search with the system. This classifier also reflects the essence of visual semantics learning in our approach, thus training it is a critical step.
Note that VOR training is a separate, pre-processing step for the proposed solution, using images of objects in typical visual outputs of programs directly as training data. Our system does not generate these training images through the visual output generation process described above. Instead, we curate them or utilize existing training datasets (e.g., using natural-scene images via transfer learning) to build the VOR classifier.
D. Natural Language Generation

To immediately support program understanding and facilitate semantic program search, our system needs to generate natural-language descriptions of visual semantics (for understanding) and later to match the descriptions against various user queries (for search). Natural language generation is generally still an open research problem, and current NLG engines are based on language models largely tailored for particular languages.
Thus, a key design consideration of our system regarding the choice of the NLG model is whether the underlying language model is appropriate for describing programs’ functional semantics in particular. For example, a language model for general-purpose NLG (e.g., a simple NLG engine) would typically not understand “application” as in “software application” correctly, but mistakenly as “the process of applying” (as we empirically validated). Thus, a customized language model is required for precise generation of semantics descriptions in our system. For instance, one possibility is to curate a template for the domain of software engineering, and then generate domain-specific natural language sentences as the semantics descriptions. More broadly, a more accurate NL description of a program’s semantics would need not only to refer to multiple visual outputs (i.e., multiple images) collectively, but also to utilize information about transitions between images captured consecutively in time, so as to capture the dynamics of the program’s (visual) behaviors.
E. Natural Language Matching

For the purpose of program understanding only, the semantics descriptions produced by the NLG component already suffice. Our approach goes further to facilitate automated, semantic program search based on the semantics descriptions of the many programs mined by the system (see Figure 1). Specifically, this is realized by matching the NL-style descriptions against user-supplied NL-style program search queries, using the dedicated NLM component. Similar to the problems that existing NLG techniques may have, as discussed above, general-purpose NLM engines (e.g., OpenNLP) also suffer from lacking sufficient domain knowledge about software engineering and programs’ functional semantics. For example, if the user searches for a “program” that draws a piechart, a general-purpose NLM engine may interpret the search target as a program as in “academic program”. Thus, a customized NLM technique should be used in our system to answer program-semantics-related queries accurately.
It is important to note that we have described the design of our approach generically. The design and engineering of each of the key modules are orthogonal to the holistic design of our system—once more advanced relevant techniques/implementations become available, we can leverage them immediately by updating/plugging them into our system. Next, we describe the specific decisions we made for building a prototype of our approach that aims to demonstrate the feasibility and merits of our design architecture as a whole.
III. VISPUS: A PROTOTYPE IMPLEMENTATION

An exact implementation of our design as described above needs extensive engineering work and further research. To initially validate the effectiveness of our design, we have built a prototype of the proposed system, VISPUS, using simplified choices. We describe key implementation details below.

For visual output generation, VISPUS currently does not include an automated execution engine. Thus, visual outputs are generated by manually exercising a few sample programs. As a result, the current database is small, holding visual semantics records for only a dozen programs. We also manually captured the visual outputs of each program as images to feed the VOR.
After extensive exploration, we have chosen CNNs to train an image classifier for the VOR component, as CNNs are the foundation of current state-of-the-art deep-learning-based computer vision. In particular, we have used the LeNet model [8], a classical convolutional neural network for image classification, assuming that the visual outputs of programs are simple, with a relatively clean background and little noise. Visual outputs with noisy backgrounds can be handled through deeper CNN architectures with larger amounts of training data. To train the VOR, we used training data collected from online resources, consisting of (1) images of different types of non-GUI objects (e.g., simple geometric shapes such as circles, triangles, and charts) and (2) cropped images of standard GUI elements (buttons, checkboxes, etc.).
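For reference, a LeNet-style classifier as used here can be sketched in a few lines. This is an assumed PyTorch rendition of the classic architecture [8], with illustrative input size (32×32 grayscale) and class count, not VISPUS’s exact training code.

```python
import torch.nn as nn

class LeNet(nn.Module):
    """LeNet-style CNN for classifying segmented visual-output images."""
    def __init__(self, num_classes: int = 10):  # e.g., circle, barchart, button, ...
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),        # object labels for descriptors
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```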
For the NLG component, we have utilized template-based NL sentence generation. We fit the object descriptors produced by the VOR component into pre-defined templates. The templates used are of the form “this program ⟨verb⟩ ⟨object⟩” (e.g., if the visual output contains a circle, the generated sentence would be “this program produces a circle”).
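Template instantiation of this kind is straightforward; the sketch below assumes a small hand-made verb table keyed by object label (the table entries are illustrative, not VISPUS’s actual templates).

```python
# Hypothetical verb table: object label -> verb, following the paper's examples.
VERBS = {"circle": "produces", "barchart": "produces", "rectangle": "generates"}

def describe(object_label: str) -> str:
    """Fill the 'this program <verb> <object>' template for one descriptor."""
    verb = VERBS.get(object_label, "produces")
    article = "an" if object_label[0] in "aeiou" else "a"
    return f"this program {verb} {article} {object_label}"

# e.g., describe("circle") -> "this program produces a circle"
```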
Finally, for NLM, we have utilized pre-trained word vectors and cosine similarity. We used Google’s pre-trained Word2Vec model [9] to identify the semantics of a user query and match it against the output of the NLG component. This pre-trained model includes word vectors for a large vocabulary of millions of words and phrases, trained on datasets of roughly 100 billion words. Word2Vec [10] is a semantic learning framework that uses a shallow neural network to learn the representations of words/phrases in a particular text. We used Gensim [11], a popular NLP package, to load the pre-trained word vectors and then compute the cosine similarity between the two sentences. Accordingly, given a query, VISPUS produces the programs with matching semantics descriptions, ranked by their similarity scores in non-ascending order.
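With Gensim, this matching step can be sketched as follows. The model file name is that of the standard Google News release and is an assumption here; out-of-vocabulary words are simply dropped for brevity.

```python
from gensim.models import KeyedVectors

# Google's pre-trained Word2Vec vectors [9] (300-dimensional, Google News).
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_similarity(query: str, description: str) -> float:
    """Cosine similarity between the mean word vectors of two sentences."""
    q = [w for w in query.lower().split() if w in kv]
    d = [w for w in description.lower().split() if w in kv]
    if not q or not d:
        return 0.0
    return float(kv.n_similarity(q, d))   # cosine over averaged vectors

# e.g., sentence_similarity("an application that draws a piechart",
#                           "this program produces a piechart")
```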
IV. EVALUATION

For a preliminary evaluation, we applied VISPUS to twelve real-world programs as study subjects, summarized in Table I. These programs were obtained from a variety of sources, including GitHub, vtk.org [12], F-Droid [13], and matplotlib [14], representing three languages: C++, Java, and Python.

TABLE I: Subject programs used in our evaluation

Subject                 Language   Functionality
MPAndroidChart [15]     Java       Produces a barchart
StackedBarGraph [16]    Python     Produces a barchart
BarchartDemo [17]       Python     Produces a barchart
Circle [18]             C++        Generates a circle
Matplotlib-circle [19]  Python     Generates a circle
Triangle [20]           C++        Produces a triangle
Streamplot [21]         Python     Produces a streamplot
SIP Caller [22]         Java       Makes a call to SIP numbers
Piechart [23]           Python     Produces a piechart
Piechart2 [24]          Python     Produces a piechart
Rectangle [25]          C++        Generates a rectangle
GeometricObjects [26]   Python     Generates a rectangle
To measure the effectiveness of VISPUS, we used the functionality description (Table I) of each subject as the search query, and computed the precision and recall of the search results for that subject against manually established ground truth. A program in the search results is a true positive as long as its functional semantics matches the query, regardless of its language. Over all 12 queries (one per subject), VISPUS achieved on average 85% precision@2 and 95% recall@2.
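For clarity, precision@k and recall@k over the top-k returned programs can be computed as below; this is the standard formulation, shown as a sketch rather than the paper’s exact evaluation script.

```python
def precision_recall_at_k(ranked: list, relevant: set, k: int = 2):
    """Precision@k and recall@k for one query's ranked search results."""
    top_k = ranked[:k]
    hits = sum(1 for p in top_k if p in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g., with one relevant program ranked first among the top 2:
# precision_recall_at_k(["Circle", "Triangle"], {"Circle"}) -> (0.5, 1.0)
```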
We also looked into failed searches and found that the main reasons lay in the limited capabilities of our current VOR component implementation due to its insufficient training. Another reason is that, for some particular functionalities (e.g., producing a streamplot), the true result set has fewer than 2 elements, which lowers precision when we consider the top 2 results. Nevertheless, these preliminary results revealed the great promise of our new approach.

Our tool prototype and experiment dataset have been made publicly available (link withheld to observe anonymity).
V. RELATED WORK

GUI prototyping. Prior works have explored generating GUI design [27] and code [28]–[30] skeletons from a given GUI image or screenshot, by deep learning on samples mined from a large corpus of existing GUI examples. Our work leverages advanced computer vision techniques, as did these prior works, for visual element recognition and classification [31], yet with a disparate focus: finding existing code that matches the functional semantics expressed by visual outputs according to user queries, rather than matching a particular kind of visual output (GUIs) itself—our visual outputs are generally defined (e.g., including data visualizations).

GUI search. This line of work aims at a more immediate step from a GUI skeleton to the code that implements the GUI, by searching an existing code repository (rather than generating the code) [4], [32]. In comparison to our work, which aims to search for code that implements certain functionalities, this prior research targeted searching for code that implements given GUIs themselves (but not what they mean/do—their semantics).

Semantic code search. Primary prior research in this direction expresses code snippets as constraints and matches code with search queries through constraint solving [3], [6]. This is disparate from our work, which analyzes visual outputs instead of program code, and overcomes the scalability and cross-language challenges (e.g., in modeling and solving complex constraints) of existing peer approaches.

VI. CONCLUSION AND FUTURE WORK

Program understanding and search are two crucial yet challenging tasks in software engineering, especially with the growing complexity and corpus of programs. To address this challenge, we propose to learn program semantics from their visual outputs. By leveraging advances in computer vision, machine learning, and natural language processing, our approach generates semantics descriptions of programs based on the visual objects recognized from their outputs, and answers search queries through natural language matching. Our preliminary results with subjects of varied functionalities and languages demonstrated the promising prospects of the proposed approach. Future work will focus on improving the system by developing specialized techniques for the NLG and NLM components and automating visual output generation.

REFERENCES

[1] S. Shoham et al., “Static specification mining using automata-based abstractions,” TSE, vol. 34, no. 5, pp. 651–666, 2008.
[2] A. Podgurski and L. A. Clarke, “A formal model of program dependences and its implications for software testing, debugging, and maintenance,” TSE, vol. 16, no. 9, pp. 965–979, 1990.
[3] K. Stolee et al., “Solving the search for source code,” TOSEM, 2014.
[4] S. P. Reiss, Y. Miao, and Q. Xin, “Seeking the user interface,” Automated Software Engineering, vol. 25, no. 1, pp. 157–193, 2018.
[5] K. Kim, D. Kim, T. F. Bissyande, E. Choi, L. Li, J. Klein, and Y. Le Traon, “FaCoY: a code-to-code search engine,” in ICSE, 2018.
[6] Y. Ke et al., “Repairing programs with semantic code search,” in ASE, 2015.
[7] Z. Tang, J. Zhai, M. Pan, Y. Aafer, S. Ma, X. Zhang, and J. Zhao, “Dual-force: Understanding WebView malware via cross-language forced execution,” in ASE, 2018, pp. 714–725.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[9] https://code.google.com/archive/p/word2vec/.
[10] T. Mikolov et al., “Computing numeric representations of words in a high-dimensional space,” US Patent 9,037,464, 2015.
[11] R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora,” in the LREC Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[12] vtk.org, “Sample VTK programs,” https://vtk.org/, 2018.
[13] f-droid.org, “F-Droid,” https://f-droid.org/en/, 2018.
[14] matplotlib.org, “Sample Python programs using matplotlib,” https://matplotlib.org/, 2018.
[15] https://matplotlib.org/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py, 2018.
[16] “Stacked Bar Graph,” https://matplotlib.org/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py, 2018.
[17] “Stacked Bar Graph,” https://matplotlib.org/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py, 2018.
[18] https://vtk.org/Wiki/VTK/Examples/Cxx/GeometricObjects/Circle, 2018.
[19] http://www.learningaboutelectronics.com/Articles/How-to-draw-a-circle-using-matplotlib-in-Python.php, 2018.
[20] https://vtk.org/Wiki/VTK/Examples/Cxx/GeometricObjects/Triangle, 2018.
[21] https://matplotlib.org/gallery/images_contours_and_fields/plot_streamplot.html, 2018.
[22] https://f-droid.org/en/packages/org.whitequark.sipcaller/, 2018.
[23] https://matplotlib.org/gallery/pie_and_polar_charts/pie_features.html#sphx-glr-gallery-pie-and-polar-charts-pie-features-py, 2018.
[24] https://matplotlib.org/gallery/pie_and_polar_charts/pie_and_donut_labels.html#sphx-glr-gallery-pie-and-polar-charts-pie-and-donut-labels-py, 2018.
[25] https://github.com/softvar/OpenGL/blob/master/primitives/rectangle.cpp, 2018.
[26] https://lorensen.github.io/VTKExamples/site/Python/GeometricObjects/GeometricObjectsDemo/, 2018.
[27] C. Chen, T. Su, G. Meng, Z. Xing, and Y. Liu, “From UI design image to GUI skeleton: A neural machine translator to bootstrap mobile GUI implementation,” in ICSE, 2018, pp. 665–676.
[28] T. A. Nguyen and C. Csallner, “Reverse engineering mobile application user interfaces with REMAUI,” in ASE, 2015, pp. 248–259.
[29] T. Beltramelli, “pix2code: Generating code from a graphical user interface screenshot,” in SIGCHI Symposium on Engineering Interactive Computing Systems, 2018, p. 3.
[30] K. Moran, C. Bernal-Cárdenas, M. Curcio, R. Bonett, and D. Poshyvanyk, “Machine learning-based prototyping of graphical user interfaces for mobile apps,” TSE, 2018.
[31] S. Hassan, M. Arya, U. Bhardwaj, and S. Kole, “Extraction and classification of user interface components from an image,” International Journal of Pure and Applied Mathematics, 2018.
[32] F. Behrang, S. P. Reiss, and A. Orso, “GUIfetch: Supporting app design and development through GUI search,” in MobileSoft, 2018, pp. 236–246.