You are on page 1of 8

V is u a l i z a t i o n C o r n e r

Editors: Claudio Silva, csilva@cs.utah.edu


Joel E. Tohline, tohline@rouge.phys.lsu.edu

Provenance for Visualizations


Reproducibility and Beyond

By Claudio T. Silva, Juliana Freire, and Steven P. Callahan

The demand for the construction of complex visualizations is growing in many disciplines of science, as users
are faced with ever increasing volumes of data to analyze. The authors present VisTrails, an open source
provenance-management system that provides infrastructure for data exploration and visualization.

C omputing has been an enor-


mous accelerator for science,
leading to an information ex-
plosion in many different fields. Fu-
ture scientific advances depend on our
Without provenance, it’s diffi-
cult (and sometimes impossible) to
reproduce and share results, solve
problems collaboratively, validate re­
sults with different input data, and
on the notion of data flows,2 and they
provide visual interfaces for produc-
ing visualizations by assembling pipe-
lines out of modules (or functions)
connected in a network. SCIRun
ability to comprehend the vast amounts un­derstand the process used to solve supports an interface that lets users
of data currently being produced and a particular problem. In addition, directly edit data flows. MayaVi and
acquired. To analyze and understand data products’ longevity becomes ParaView have a different interac-
this data, though, we must assemble limited—without precise and suffi- tion paradigm that implicitly builds
complex computational processes and cient information about how the data data flows as the user makes “task-
generate insightful visualizations, product was generated, its value di- ­oriented” choices (such as selecting
which often require combining loosely minishes significantly. an isosurface value).
coupled resources, specialized librar- The lack of adequate provenance Although these systems let users
ies, and grid and Web services. Such support in visualization systems mo- create complex visualizations, they
processes could generate yet more tivated us to build VisTrails, an open lack the ability to support data explo-
data, adding to the information over- source provenance-management sys- ration at a large scale. Notably, they
flow scientists currently deal with. tem that provides infrastructure for don’t adequately support collabora-
Today, the scientific community data exploration and visualization tive creation and exploration of mul-
uses ad hoc approaches to data ex- through workflows. VisTrails trans- tiple visualizations. Because these
ploration, but such approaches have parently records detailed provenance systems don’t distinguish between
serious limitations. In particular, of exploratory computational tasks the definition of a data flow and its
scientists and engineers must expend and leverages this information be- instances, to execute a given data
substantial effort managing data yond just the ability to reproduce and flow with different parameters (for
(such as scripts that encode computa- share results. In particular, it uses this example, different input files), users
tional tasks, raw data, data products, information to simplify the process of must manually set these parameters
images, and notes) and recording exploring data through visualization. through a GUI. Clearly, this process
provenance information (that is, all the doesn’t scale to more than a few vi-
information necessary to reproduce Visualization Systems sualizations at a time. Additionally,
a certain piece of data) so that they Visualization systems such as Maya­ modifications to parameters or to a
can answer basic questions: Who cre- Vi (http://mayavi.sourceforge.net) and data flow’s definition are destruc-
ated a data product and when? When ParaView (www.paraview.org)—which tive—the systems don’t maintain any
was it modified, and who modified it? are built on top of Kitware’s Visual- change history. This requires the
What process was used to create the ization Toolkit (VTK)1—as well as user to first construct the visualiza-
data product? Were two data products SCIRun (http://software.sci.utahedu/ tion and then remember the input
derived from the same raw data? This scirun.html) enable users to interac- data sets, parameter values, and the
process is not only time-consuming, tively create and manipulate complex exact dataflow configuration that led
but also error-prone. visualizations. Such systems are based to a particular image.

82 Copublished by the IEEE CS and the AIP 


1521-9615/07/$25.00 ©2007 IEEE Computing in Science & Engineering
Reproducibility and Sharing more actively participate in the way we do things.
This first column discusses the benefits of provenance
Data and Processes for the and makes a case that better provenance mechanisms are
Visualization Corner needed for visualization. In upcoming installments, we’ll
attempt to inform the scientific community at large about
By Claudio Silva and Joel E. Tohline the benefits and technologies related to provenance. In
particular, we want to promote the idea of reproducible

G reetings! We’re the new co-editors for the Visualiza-


tion Corner. Claudio is a computer science professor at
the University of Utah and faculty member of the Scientific
visualizations. We encourage authors of articles published
here to provide metadata for visualizations in their articles
that let readers reproduce images as well as generate
Computing and Imaging Institute, where he does research related ones (for example, using different data). Ultimately,
primarily in visualization, graphics, and applied geometry. our hope is that this trend will spread to the point that
Joel is a professor of physics and astronomy at Louisiana published articles will contain not only textual descriptions
State University and a faculty member in LSU’s Center for of the techniques, but links to data, code, and the complete
Computation and Technology, with a research focus on overall process used to generate the scientific results.
complex fluid flows in astrophysical systems. We both have As a mechanism to capture and share provenance meta-
extensive experience in high-performance computing. In data, authors can use VisTrails to produce specifications
partnership with our readers and colleagues, we hope to of the figures and plots presented in their articles. We’ll
bring you relevant and effective information about visual- archive this information at www.vistrails.org/index.php/
ization techniques that can directly affect the way our read- CiSE. The data and processes associated with this column
ers do science. We would like to use new Web technologies are already available on the Web site, so you can reproduce
(Wikis, blogs, and so on) to encourage the community to them, right now, from your desktop!

Finally, before constructing a vi- this detailed provenance of the pipe- extensible infrastructure lets users in-
sualization, users must often acquire, line evolution as a visualization trail, tegrate a wide range of libraries. This
generate, or transform a given data or vistrail. makes the system suitable for other
set—for example, to calibrate a simu- The stored provenance ensures exploratory tasks, including data min-
lation, they must obtain data from sen- that users will be able to reproduce ing and integration.
sors, generate data from a simulation, the visualizations and lets them easily
and finally construct and compare the navigate through the space of pipe- Creating an Interactive
visualizations for both data sets. Most lines created for a given exploration Visualization with VisTrails
visualization systems, however, don’t task. The VisTrails interface lets users To illustrate the issues involved in
give users adequate support for cre- query, interact with, and understand creating visualizations and how prov-
ating complex pipelines that support the visualization process’s history. In enance can aid in this process, we
multiple libraries and services. particular, they can return to previous present the following scenario, com-
versions of a pipeline and change the mon in medical data visualization.
VisTrails: Provenance specification or parameters to gener- Starting from a volumetric computed
for Visualization ate a new visualization without losing tomography (CT) data set, we generate
The VisTrails system (www.vistrails. previous changes. different visualizations by exploring
org) we developed at the Univer- Another important feature of the the data through volume rendering,
sity of Utah is a new visualization action-based provenance model is isosurfacing (extracting a contour), and
­system that provides a comprehensive that it enables a series of operations slicing. Note that with proper modi-
­provenance-­management infrastruc- that greatly simplify the exploration fications, this example also works for
ture and can be easily combined with process and could reduce the time to visualizing other types of data (for ex-
existing visualization libraries. Unlike insight. In particular, the model al- ample, tetrahedral meshes).
previous systems, VisTrails uses an lows the flexible reuse of pipelines
action-based provenance model that and provides a scalable mechanism Dataflow-Processing Networks
uniformly captures changes to both for creating and comparing numer- and Visual Programming
parameter values and pipeline defini- ous visualizations as well as their cor- A useful paradigm for building visu-
tions by unobtrusively tracking all responding pipelines. Although we alization applications is the dataflow
changes that users make to pipelines originally built VisTrails to support model. A data flow is a directed graph in
in an exploration task. We refer to exploratory visualization tasks, its which nodes represent computations,

September/October 2007 83
V is u a l i z a t i o n C o r n e r

(a) (b) (c)

Figure 1. Dataflow programming for visualization. (a) We commonly use a script to describe a pipeline from existing
libraries such as the Visualization Toolkit (VTK). (b) Visual programming interfaces, such as the one VisTrails provides,
facilitate the creation and maintenance of these dataflow pipelines. The green rectangles represent modules, and
the black lines represent connections. (c) The end result of the script or VisTrails pipeline is a set of interactive
visualizations.

and edges represent data streams: for creating visualization pipelines— ing a high-level, structured workflow
each node or module corresponds to for example, scripting in a modern description is that we can use expres-
a procedure that’s applied on the in- dynamic language, such as Python. sive languages in querying and updat-
put data and generates some output Consider Figure 1a, which defines the ing workflows.
data as a result. The flow of data in workflow via a script written in Py-
the graph determines the order in thon that uses VTK to read a volume Comparing and Exploring
which a dataflow system executes the data set from a file, extract an isosur- Multiple Visualizations
processing nodes. In visualization, we face, map the isosurface to renderable Regardless of the specific mechanism
commonly refer to a dataflow network geometry, and then finally render it in we use to define a pipeline, the visu-
as a visualization pipeline. (For this an interactive window. alization process’s end goal is to gain
article, we use the terms workflow, Visual programming interfaces for insight from the data. To obtain such
data flow, and pipeline interchange- designing data flows have become insight, users must often generate
ably.) Figure 1b shows an example of popular and several systems, such as and compare multiple visualizations.
the data flow used to derive the im- SCIRun, have adopted them. These Going back to our scenario, several
ages shown in Figure 1c. The green interfaces give users a more intuitive alternatives exist for rendering our
rectangles represent modules, and view of the pipeline. They also dy- CT data. Isosurfacing is a commonly
the black lines represent connec- namically perform type checking and used technique. Given a function f:
tions. Most of the modules in Figure guide the connection between mod- R n → R and a value a, an isosurface
1 are from VTK, and labels on each ules’ input and output ports—once consists of the set of points in a do-
module indicate the corresponding the user selects a module’s output, main that map to a—that is, Sa = {x ∈
VTK class. In this figure, we natu- connections are allowed only to the R n: f(x) = a}.
rally think of data flowing from top target module’s appropriate input. The range of a values determines all
to bottom, eventually being rendered VisTrails automatically pulls edges possible isosurfaces that the user can
and presented for display. toward the correct input port. As we generate. To identify “good” a values
We can use different mechanisms discuss later, another benefit of hav- that represent a data set’s important

84 Computing in Science & Engineering


Figure 2. VisTrails’ parameter exploration interface. The system computes the results efficiently by avoiding
redundant computation and displays them in the spreadsheet for interactive comparative visualization.

features, we can look at the range of isosurface value as four steps between dering might also show the bones and
values taken by a, and their frequen- 50 and 70, displayed horizontally in internal organs.
cy, in the form of a histogram. Using the VisTrails spreadsheet. Volume rendering and isosurfacing
VisTrails, we can straightforwardly This spreadsheet lets us compare are complementary techniques, and
extend the isosurface pipeline to visualizations in different dimensions they can generate very similar imag-
also display a histogram of the data. (row, column, sheet, and time); we ery depending on parameters. In fact,
VisTrails provides a very simple plug­ can also link the spreadsheet’s cells to distinguishing between them can be
in functionality that you can use to synchronize the interactions between difficult. The VisTrails system lets us
add packages and libraries, including visualizations. Note that VisTrails compare workflows using a visual dif-
your own. For our example, we used leverages the dataflow specifications ference interface. To demonstrate this
matplotlib’s 2D plotting functionality to identify and avoid redundant op- capability, we compute the difference
(http://matplotlib.sourceforge.net) to erations. By using the same cache between the original iso­surface-gen-
generate the histogram at the top of for different cells in the spreadsheet, eration pipeline and the new volume-
Figure 1c. This histogram helps in VisTrails lets users efficiently explore rendering pipeline. Figure 3 shows
data exploration by suggesting re- numerous related visualizations. the visual difference of the workflows
gions of interest in the volume. The that we can inspect, along with their
plot shows that the highest frequency Comparing different visualization tech- resulting visualizations. In Figure 3a,
features lie between the ranges [0,25] niques. Volume rendering is a power- we use volume rendering to create the
and [58,68]. To identify the features ful computer graphics technique for image, in which we can see the skin
that correspond to these ranges, we visualizing 3D data. Whereas many on top of the bone structure; Figure
must explore these regions directly visualization algorithms focus on 3b shows only the bone structure ren-
through visualization. creating a rendering of surfaces— dered with our standard isosurface
­although they might be surfaces of technique. This ability to (efficiently)
Scalable exploration of parameter spaces. 3D objects—volume rendering lets us compare workflows and visualizations
VisTrails provides an interface for see “inside” the volume. This tech- is one of the benefits of the VisTrails
parameter exploration that lets users nique models the volume as cloud- action-based provenance model and
specify a set of parameters to explore, like cells of semitransparent material. becomes increasingly important as a
as well as how to explore, group, and A surface rendering of the human workflow becomes more complex and
display them. As a simple 1D example, body might show only the skin, for is shared among collaborators.
Figure 2 shows an exploration of the example, but a complete volume ren- With other workflow systems, these

September/October 2007 85
V is u a l i z a t i o n C o r n e r

(a)

(b) (c)

Figure 3. A visual difference between different pipelines in VisTrails. We show the difference between pipelines that
generated (a) volume rendering and (b) isosurface visualizations. (c) The interface distinguishes shared modules in
dark gray, the modules unique to isosurfacing in blue, those unique to direct volume rendering in orange, and those
with parameter changes in light gray.

(c)

(a) (b)

Figure 4. Multiple rendering techniques. (a) VisTrails renders visualizations by combining volume rendering and
isosurfacing and updates them with user interactions. (b) The corresponding pipeline represents the data flow for
creating interactive visualizations. (c) VisTrails provides a fully browseable history of the exploration process that led
to this final set of visualizations.

comparisons are challenging because (scripting) comparisons. Although the subgraph isomorphism is NP-complete),
they require module-by-module (vi- former can be computationally intrac- the latter could lead to results that are
sual programming) or line-by-line table (the related decision problem of hard to interpret.

86 Computing in Science & Engineering


The VisTrails System • Extensibility. VisTrails provides a very simple plug-in
functionality that can help dynamically add packages

I n this article, we focus on using VisTrails as a tool for


exploratory visualization. Additional features might be
relevant for CiSE readers:
and libraries. Neither changes to the user interface nor
system recompilation are necessary. Because VisTrails is
written in Python, integrating Python-wrapped libraries
is straightforward.
• Flexible provenance architecture. VisTrails transparently • Scalable derivation of data products and parameter explora-
tracks changes made to workflows by maintaining a tion. VisTrails supports a series of operations for simulta-
detailed record of all the steps followed in the explora- neously generating multiple data products, including an
tion. The system can optionally track runtime informa- interface that lets users specify sets of values for different
tion about workflow execution (such as who executed parameters in a workflow. Users can display the results
a module, on which machine, and how much time of a parameter exploration side by side in the VisTrails
elapsed). VisTrails also provides a flexible annotation spreadsheet for easy comparison.
framework through which users can specify application- • Task creation by analogy. VisTrails supports analogies as
­specific provenance information. first-class operations to guide semiautomated changes to
• Querying and reusing history. Provenance information multiple workflows, without requiring users to directly
is stored in a structured way. Users have the choice of manipulate or edit the workflow specifications.
using a relational database (such as MySQL or IBM DB2)
or XML files in the file system. The system provides Please visit www.vistrails.org to access the VisTrails
flexible and intuitive query interfaces through which community Web site. You’ll find information including
users can explore and reuse provenance information. instructions for obtaining the software, online docu-
Users can formulate simple keyword-based and selection mentation, video tutorials, and pointers to papers and
queries (find a visualization that a given user created, for presentations.
example) as well as structured queries (find visualizations VisTrails is written in Python and uses the multiplat-
that apply simplification before an isosurface computa- form Qt library for its user interface. The system is open
tion for irregular grid data sets). source, released under the GPL 2.0 license. The pre-com-
• Support for collaborative exploration. Users can configure piled versions for Windows, Mac OS X, and Linux come
the system with a database back end that can act as a with an installer and several packages, including VTK,
shared repository. This back end also provides a syn- matplotlib, and Image Magick. Additional packages,
chronization facility that lets users collaborate asynchro- including ones users have written, are also available,
nously and in a disconnected fashion—they can check but you can easily add new packages using the VisTrails
changes in and out, akin to a version-control system plug-in infrastructure. Detailed instructions are available
(such as subversion (SVN), http://subversion.tigris.org). at our Web site.

Interacting with visualizations. The teractively updates these slices along hypotheses. Whereas for simple
images we generated so far corre- with the plane during user interac- experiments, manual approaches
spond to simple, static workflows. To tions. Figure 4 shows the resulting to provenance management might
perform a more dynamic compari- visualizations along with the com- be feasible, complex computational
son between volume rendering and plex dynamic workflow required to tasks involving large volumes of data
isosurfacing, we add a feedback loop produce them. or multiple researchers require au-
into the workflow to let users adjust tomated approaches. As these tasks’
the visualization interactively. We Provenance and complexity and scale increases, data
thus build a new workflow that uses Exploratory Visualization organization becomes a major com-
the isosurfacing and volume render- The combination of multiple visual- ponent of the process.
ing algorithms simultaneously. We ization algorithms, different librar- VisTrails manages the data ma-
add a clipping plane into the volume ies, and the interactions between nipulated and metadata created in
visualization to assign the volume them considerably complicates the the course of an exploratory task.3 As
regions used for each algorithm. In workflow specification. In addition, a user (or group of users) generates
addition, we use a point on the plane creating a set of visualizations from a series of visualizations, VisTrails
to define axis-aligned slices of the data is not always a linear process transparently tracks all the steps this
volume that we display in distinct and often involves several itera- exploration followed—that is, the
spreadsheet cells. The pipeline in- tions as a user formulates and tests modules and connections added and

September/October 2007 87
V is u a l i z a t i o n C o r n e r

Figure 5. The VisTrails query-by-example interface. (a) Users can define a set of modules and parameters in the visual
programming interface to create a query template. (b) The query results are shown in the history tree, which users
can browse for specific instances of the match (inset).

Table 1. Query examples. • seamlessly navigate over the history


Query Result tree and return to previous pipeline
volume Highlights all nodes in the history tree in which versions after reaching a dead end;
the string “volume” appears (for example, in a • undo bad changes;
module name, parameter name, annotation) • reuse pipelines and pipeline frag-
ments from previous versions;
user:juliana Highlights all nodes in the history tree created by
the user “juliana” • compare different pipelines and
their results; and
before: March 30, 2007 Highlights all nodes in the history tree created
• be reminded, intuitively, of the ac-
before “March 30, 2007”
tions that led to a particular result.

Thus, users can efficiently explore


deleted, parameter value changes, and ciated with this example available for several related visualizations.
so on. Figure 4c shows a history tree download with the VisTrails system The issue of reproducibility for vi-
of the different pipelines created in the (see “The VisTrails System” sidebar). sualization has been considered be-
course of our running example. The You can recreate each figure shown in fore,5 but we should note that whereas
nodes in this tree correspond to pipe- this article by executing the different some visualization and workflow sys-
lines; an edge between two pipelines nodes in the history tree. Note that tems provide support for provenance
corresponds to changes performed on by using the action-based provenance tracking, their focus has been on data
the parent pipeline to obtain its child. model, we obtain a very concise rep- provenance—that is, information
For readability, by default, only the resentation of the history, which uses about how the system derived a given
nodes in the tree that the user tags as substantially less space than the alter- data product, including the param-
important are displayed. native of explicitly storing multiple eter values used6 —and on interaction
By tracking all the changes made versions of a pipeline.3 provenance (such as capturing a visu-
to a workflow ensemble, VisTrails The exploration trail VisTrails cap- alization’s viewing manipulations).7
properly captures each step, leaving tures also supports various activities VisTrails is the first system to capture
a complete trail of the work. Hav- that are crucial for performing reflec- information about how workflows
ing access to the different pipelines’ tive reasoning and obtaining insights, evolve over time.
specifications lets others reproduce such as following chains of reasoning For instance, to generate the
and share the results of each step in backward and forward and comparing composite visualization in our final
the exploratory process. To demon- different results.4 The tree-based view example, we extended our pipeline
strate this, we made the vistrail asso- lets users labeled Volume Rendering to include

88 Computing in Science & Engineering


modules from the pipeline labeled to display the workflow and highlight Data Flow for Scientific Applications (SciFlow),
Isosurfacing. (Having two pipelines the modules that match the query. IEEE CS Press, 2006.

lets us further explore the visualiza- Users can then click on the individual 4. D.A. Norman, Things That Make Us Smart:
Defending Human Attributes in the Age of the
tion—by trying different isosurface modules to view the execution log re- Machine, Addison-Wesley, 1994.
values, for example [see Figure 2]). In cords associated with them. 5. G. Kindlmann, “Lack of Reproducibility Hin-
addition, we can compare the pipe- ders Visualization Science,” IEEE Visualization
lines by dragging one node on top Compendium, IEEE CS Press, 2006, p. 69.

T
of the other (see Figure 3). These he VisTrails project has focused 6. T. Jankun-Kelly, K.-L. Ma, and M. Gertz,
“A Model and Framework for Visualization
computed differences are useful for on creating an infrastructure to Exploration,” IEEE Trans. Visualization and
understanding the visualization pro- manage the provenance data of ex- Computer Graphics, vol. 13, no. 2, 2007, pp.
cess, and the user can also reuse them. ploratory tasks. With this infrastruc- 357–369.

In this case, we applied the modules ture in place, our research focus is 7. D.P. Groth and K. Streefkerk, “Provenance
and Annotation for Visual Exploration
unique to “Isosurfacing” to “Volume now on what we can do with all the Systems,” IEEE Trans. Visualization and
Rendering” to create a new pipeline provenance accumulatation. By min- Computer Graphics, vol. 12, no. 6, 2006, pp.
called “Combined Rendering,” that ing this information, we hope to learn 1500–1510.

uses a cutting plane to define regions useful patterns that can guide users 8. C. Scheidegger et al., Querying and
Creating Visualizations by Analogy,” to be
for the rendering methods. VisTrails in assembling and refining complex published in IEEE Trans. Visualization and
can automatically apply pipeline dif- computational tasks. Computer Graphics, 2007.
ferences (like a patch) to derive new
pipelines in a process we call visual- Acknowledgments
ization creation by analogy.8 This article summarizes work being Claudio T. Silva is an associate professor at
Another benefit to having a high- done in the VisTrails project. It’s only the University of Utah. His research inter-
level specification of the visualiza- possible through the work of all our ests include visualization, geometry pro-
tion process is that users can query team members: Erik Anderson, Jason cessing, graphics, and high-performance
the pipelines and their execution Callahan, David Koop, Emanuele computing. Silva has a PhD in computer
instances. Scientists can query a vis- Santos, Carlos E. Scheidegger, and science from SUNY at Stony Brook. He
trail to find anomalies in previously Huy T. Vo. The data used in this ar- is a member of the IEEE, the ACM, Eu-
generated visualizations and locate ticle is available courtesy of the Na- rographics, and Sociedade Brasileira
data products and visualizations tional Library of Medicine’s Visible de Matematica. Contact him at csilva@
based on operations applied in the Human Project. The US National Sci- cs.utah.edu.
visualization process. VisTrails sup- ence Foundation partially supported
ports simple, keyword-based queries this work under grants IIS-0513692, Juliana Freire is an assistant professor at
as well as structured queries. In addi- CCF-0401498, EIA-0323604, CNS- the University of Utah. Her research inter-
tion to providing information about 0541560, OCE-0424602, and OISE- ests include scientific data management,
the results (for example, workflow 0405402. The US Department of Web information systems, and information
identifiers and attributes), VisTrails Energy, an IBM Faculty Award, and integration. Freire has a PhD in computer
can visually display query results by a University of Utah Seed Grant also science from SUNY at Stony Brook. She is a
highlighting the workflows and mod- partially supported this work. member of the ACM and the IEEE. Contact
ules that satisfy the query. Table 1 her at juliana@cs.utah.edu.
shows an example.
References
Users might also define queries by Steven P. Callahan is a research assistant and
1. W. Schroeder, K. Martin, and B. Lorensen,
example.8 As Figure 5 illustrates, us- The Visualization Toolkit: An Object-Oriented PhD candidate at the University of Utah. His
ers can construct (or copy and paste) Approach To 3D Graphics, Kitware, 2003. research interests include scientific visual-
a pipeline fragment into the VisTrails 2. E.A. Lee and T.M. Parks, “Dataflow Process ization, visualization systems, and computer
query tab to identify in the history Networks,” Proc. IEEE, vol. 83, no. 5, 1995, graphics. Callahan has an MS in computa-
pp. 773–801.
tree all nodes that contain that frag- tional engineering and science from the
3. S. Callahan et al., “Managing the Evolution
ment. They can then browse through of Dataflows with VisTrails (extended ab- University of Utah. Contact him at stevec@
the highlighted nodes and click on one stract),” Proc. IEEE Workshop on Workflow and sci.utah.edu.

September/October 2007 89

You might also like