
A Cloud-Based Seizure Alert System, p. 56
Visualizing High-Dimensional Data, p. 98
Computers in Cars, p. 108

Vol. 18, No. 5 | September/October 2016

cise.aip.org

www.computer.org/cise/


Login to mycs.computer.org

SEARCH, ANNOTATE, UNDERLINE, VIEW VIDEOS, CHANGE TEXT SIZE, DEFINE

READ YOUR FAVORITE PUBLICATIONS YOUR WAY

Now, your IEEE Computer Society technical publications aren't just the most informative and state-of-the-art magazines and journals in the field; they're also the most exciting, interactive, and customizable to your reading preferences.

The new myCS format for all IEEE Computer Society digital publications is:

• Mobile friendly. Looks great on any device: mobile, tablet, laptop, or desktop.
• Customizable. Whatever your ereader lets you do, you can do on myCS. Change the page color, text size, or layout; even use annotations or an integrated dictionary. It's up to you.
• Adaptive. Designed specifically for digital delivery and readability.
• Personal. Save all your issues and search or retrieve them quickly on your personal myCS site.

You've Got to See It
To really appreciate the vast difference in reading enjoyment that myCS represents, you need to see a video demonstration and then try out the interactivity for yourself. Just go to www.computer.org/mycs-info


Experience the Newest and Most Advanced Thinking in Big Data Analytics

03 November 2016 | Austin, TX

Big Data: Big Hype or Big Imperative? BOTH.

Business departments know the promise of big data, and they want it. But newly minted data scientists can't yet meet expectations, and technologies remain immature. Yes, big data is transforming the way we do everything. But knowing that doesn't help you decide what steps to take tomorrow to assure your company's future. That's why May is your real-world answer. Come meet the experts who are grappling with and solving the problems you face in mining the value of big data. You literally can't afford to miss the all-new Rock Stars of Big Data.

Rock Star Speakers
Kirk Borne, Principal Data Scientist, Booz Allen Hamilton
Satyam Priyadarshy, Chief Data Scientist, Halliburton
Bill Franks, Chief Analytics Officer, Teradata

www.computer.org/bda


EDITOR IN CHIEF
George K. Thiruvathukal, Loyola Univ. Chicago, gkt@cs.luc.edu

ASSOCIATE EDITORS IN CHIEF
Jeffrey Carver, University of Alabama, carver@cs.ua.edu
Jim X. Chen, George Mason Univ., jchen@cs.gmu.edu
Judith Bayard Cushing, The Evergreen State College, judyc@evergreen.edu
Steven Gottlieb, Indiana Univ., sg@indiana.edu
Douglass E. Post, Carnegie Mellon Univ., post@ieee.org
Barry I. Schneider, NIST, bis@nist.gov

EDITORIAL BOARD MEMBERS
Joan Adler, Technion-IIT, phr76ja@tx.technion.ac.il
Francis J. Alexander, Los Alamos National Laboratory, fja@lanl.gov
Isabel Beichl, Nat'l Inst. of Standards and Technology, isabel.beichl@nist.gov
Bruce Boghosian, Tufts Univ., bruce.boghosian@tufts.edu
Hans-Joachim Bungartz, Technical University of Munich, bungartz@in.tum.de
Norman Chonacky, Yale Univ. (EIC Emeritus), norman.chonacky@yale.edu
Massimo DiPierro, DePaul Univ., mdipierro@cti.depaul.edu
Jack Dongarra, Univ. of Tennessee, dongarra@cs.utk.edu
Rudolf Eigenmann, Purdue Univ., eigenman@ecn.purdue.edu
William J. Feiereisen, Intel Corporation, bill@feiereisen.net
Geoffrey Fox, Indiana Univ., gcf@indiana.edu
K. Scott Hemmert, Sandia National Laboratories, kshemme@sandia.gov
David P. Landau, Univ. of Georgia, dlandau@hal.physast.uga.edu
Konstantin Läufer, Loyola Univ. Chicago, laufer@cs.luc.edu
Preeti Malakar, Argonne National Laboratory, pmalakar@anl.gov
James D. Myers, University of Michigan, jim.myers@computer.org
Manish Parashar, Rutgers Univ., parashar@rutgers.edu
John Rundle, Univ. of California, Davis, rundle@physics.ucdavis.edu
Robin Selinger, Kent State Univ., rselinge@kent.edu
Thomas L. Sterling, Indiana Univ., tron@indiana.edu
John West, University of Texas, Austin, john@tacc.utexas.edu

DEPARTMENT EDITORS
Books: Stephen P. Weppner, Eckerd College, weppnesp@eckerd.edu
Computing Prescriptions: Ernst Mucke, Identity Solutions, ernst.mucke@gmail.com, and Francis Sullivan, IDA/Center for Computing Sciences, fran@super.org
Computer Simulations: Barry I. Schneider, NIST, bis@nist.gov, and Gabriel A. Wainer, Carleton University, gwainer@sce.carleton.ca
Education: Rubin H. Landau, Oregon State Univ., rubin@science.oregonstate.edu, and Scott Lathrop, University of Illinois, lathrop@illinois.edu
Leadership Computing: James J. Hack, ORNL, jhack@ornl.gov, and Michael E. Papka, ANL, papka@anl.gov
Novel Architectures: Volodymyr Kindratenko, University of Illinois, kindr@ncsa.uiuc.edu, and Pedro Trancoso, Univ. of Cyprus, pedro@cs.ucy.ac.cy
Scientific Programming: Konrad Hinsen, CNRS Orléans, konrad.hinsen@cnrs.fr, and Matthew Turk, NCSA, matthewturk@gmail.com
Software Engineering Track: Jeffrey Carver, University of Alabama, carver@cs.ua.edu, and Damian Rouson, Sourcery Institute, damian@rouson.net
The Last Word: Charles Day, cday@aip.org
Visualization Corner: Joao Comba, UFRGS, comba@inf.ufrgs.br, and Daniel Weiskopf, Univ. Stuttgart, weiskopf@visus.uni-stuttgart.de
Your Homework Assignment: Nargess Memarsadeghi, NASA Goddard Space Flight Center, nargess.memarsadeghi@nasa.gov

STAFF
Editorial Product Lead: Cathy Martin, cathy.martin@computer.org
Editorial Management: Jennifer Stout
Operations Manager: Monette Velasco
Senior Advertising Coordinator: Marian Anderson
Director of Membership: Eric Berkowitz
Director, Products & Services: Evan Butterfield
Senior Manager, Editorial Services: Robin Baldwin
Manager, Editorial Services: Brian Brannon
Senior Business Development Manager: Sandra Brown

AMERICAN INSTITUTE OF PHYSICS STAFF
Marketing Director, Magazines: Jeff Bebee, jbebee@aip.org
Editorial Liaison: Charles Day, cday@aip.org

IEEE Antennas & Propagation Society Liaison: Don Wilton, Univ. of Houston, wilton@uh.edu
IEEE Signal Processing Society Liaison: Mrityunjoy Chakraborty, Indian Institute of Technology, mrityun@ece.iitkgp.ernet.in

CS MAGAZINE OPERATIONS COMMITTEE
Forrest Shull (chair), Brian Blake, Maria Ebling, Lieven Eeckhout, Miguel Encarnacao, Nathan Ensmenger, Sumi Helal, San Murugesan, Ahmad-Reza Sadeghi, Yong Rui, Diomidis Spinellis, George K. Thiruvathukal, Mazin Yousif, Daniel Zeng

CS PUBLICATIONS BOARD
David S. Ebert (VP for Publications), Alfredo Benso, Irena Bojanova, Greg Byrd, Min Chen, Robert Dupuis, Niklas Elmqvist, Davide Falessi, William Ribarsky, Forrest Shull, Melanie Tory

EDITORIAL OFFICE
Publications Coordinator: cise@computer.org
COMPUTING IN SCIENCE & ENGINEERING
c/o IEEE Computer Society
10662 Los Vaqueros Circle, Los Alamitos, CA 90720 USA
Phone +1 714 821 8380; Fax +1 714 821 4010
Websites: www.computer.org/cise or http://cise.aip.org/


September/October 2016, Vol. 18, No. 5

SCIENCE AS A SERVICE

8 Guest Editors' Introduction
Ravi Madduri and Ian Foster

10 A Case for Data Commons: Toward Data Science as a Service
Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson, and Walt Wells
Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community. An architecture for data commons is described, as well as some lessons learned from operating several large-scale data commons.

21 MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service
Susumu Mori, Dan Wu, Can Ceritoglu, Yue Li, Anthony Kolasny, Marc A. Vaillant, Andreia V. Faria, Kenichi Oishi, and Michael I. Miller
MRICloud provides a high-throughput neuroinformatics platform for automated brain MRI segmentation and analytical tools for quantification via distributed client-server remote computation and Web-based user interfaces. This cloud-based service approach improves the efficiency of software implementation, upgrades, and maintenance. The client-server model is also ideal for high-performance computing, allowing distribution of computational servers and client interactions across the world.

36 WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data
Raimond L. Winslow, Stephen Granite, Christian Jurado
The electrocardiogram (ECG) is the most commonly collected data in cardiovascular research because of the ease with which it can be measured and because changes in ECG waveforms reflect underlying aspects of heart disease. Accessed through a browser, WaveformECG is an open source platform supporting interactive analysis, visualization, and annotation of ECGs.

Cover illustration: Andrew Baker, www.debutart.com/illustration/andrew-baker

STATEMENT OF PURPOSE: Computing in Science & Engineering (CiSE) aims to support and promote the emerging discipline of computational science and engineering and to foster the use of computers and computational techniques in scientific research and education. Every issue contains broad-interest theme articles, departments, news reports, and editorial comment. Collateral materials such as source code are made available electronically over the Internet. The intended audience comprises physical scientists, engineers, mathematicians, and others who would benefit from computational methodologies. All articles and technical notes in CiSE are peer-reviewed.

COMPUTATIONAL CHEMISTRY
48 Chemical Kinetics: A CS Perspective
Dinesh P. Mehta, Anthony M. Dean, and Tina M. Kouri

CLOUD COMPUTING

56 A Cloud-Based Seizure Alert System for Epileptic Patients That Uses Higher-Order Statistics
Sanjay Sareen, Sandeep K. Sood, and Sunil Kumar Gupta

HYBRID SYSTEMS

68 The Feasibility of Amazon's Cloud Computing Platform for Parallel, GPU-Accelerated, Multiphase-Flow Simulations
Cole Freniere, Ashish Pathak, Mehdi Raessi, and Gaurav Khanna

For more information on these and other computing topics, please visit the IEEE Computer Society Digital Library at www.computer.org/csdl.


COLUMNS

4 From the Editors
Steven Gottlieb
The Future of NSF Advanced Computing Infrastructure Revisited

98 Visualization Corner
Renato R.O. da Silva, Paulo E. Rauber, and Alexandru C. Telea
Beyond the Third Dimension: Visualizing High-Dimensional Data with Projections

108 The Last Word
Charles Day
Computers in Cars

RESOURCES

46 AIP Membership Information
47 IEEE Computer Society Information

DEPARTMENTS

78 Computer Simulations
Christian D. Ott
Massive Computation for Understanding
Core-Collapse Supernova Explosions

94 Leadership Computing
Laura Wolf
Multiyear Simulation Study Provides
Breakthrough in Membrane Protein Research

Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author's or firm's opinion. Inclusion in Computing in Science & Engineering does not necessarily constitute endorsement by IEEE, the IEEE Computer Society, or the AIP. All submissions are subject to editing for style, clarity, and length. IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

Circulation: Computing in Science & Engineering (ISSN 1521-9615) is published bimonthly by the AIP and the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York, NY 10016-5997; IEEE Computer Society Publications Office, 10662 Los Vaqueros Cir., Los Alamitos, CA 90720, phone +1 714 821 8380; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, D.C., 20036; AIP Circulation and Fulfillment Department, 1NO1, 2 Huntington Quadrangle, Melville, NY, 11747-4502. Subscribe to Computing in Science & Engineering by visiting www.computer.org/cise.

Reuse Rights and Reprint Permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of IEEE-copyrighted material on their own web servers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version that has been revised by the author to incorporate review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to: http://www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141, or pubs-permissions@ieee.org. Copyright © 2016 IEEE. All rights reserved.

Abstracting and Library Use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Dr., Danvers, MA 01923.

Postmaster: Send undelivered copies and address changes to Computing in Science & Engineering, 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage paid at New York, NY, and at additional mailing offices. Canadian GST #125634188. Canada Post Corporation (Canadian distribution) publications mail agreement number 40013885. Return undeliverable Canadian addresses to PO Box 122, Niagara Falls, ON L2E 6S8 Canada. Printed in the USA.


FROM THE EDITORS

The Future of NSF Advanced Computing Infrastructure Revisited

Steven Gottlieb, Indiana University

I am in Sunriver, Oregon, having just enjoyed three days at the annual Blue Waters Symposium for Petascale Science and Beyond. It was a perfect opportunity to catch up on all the wonderful science being done on Blue Waters, the National Science Foundation's flagship supercomputer, located at the University of Illinois's National Center for Supercomputing Applications (NCSA). To be honest, you can't really catch up on all the science: most of the presentations are in parallel sessions with four simultaneous talks. There were also very interesting tutorials to help attendees make the best use of Blue Waters.
But what I’m most interested in discussing here isn’t the petascale science, but the
“beyond” issue. CiSE readers might recall that in the March/April 2015 issue, I used
this space for a column entitled “Whither the Future of NSF Advanced Computing
Infrastructure?” (vol. 17, no. 2, 2015, pp. 4–6). One focus of that piece was the in-
terim report of the Committee on Future Directions for NSF Advanced Computing
Infrastructure to Support US Science in 2017–2020. This committee was appointed
through the Computer Science and Telecommunications Board of the National Re-
search Council (NRC) and was expected to issue a final report in mid-2015 (in fact, it
was announced nearly a year later, in a 4 May 2016 NSF press release). I had a chance
to sit down with Bill Gropp (University of Illinois Urbana-Champaign), who cochaired
the committee with Robert Harrison (Stony Brook) and gave a very well-received after-
dinner talk at the symposium about the report.
Over the years, there has been a growing gap between requests for computer time
through NSF’s XSEDE (Extreme Science and Engineering Discovery Environment)
program and the availability of such time. Making matters worse, Blue Waters is sched-
uled to shut down in 2018. At the symposium, William Kramer announced that the
NCSA had requested a zero-cost extension to continue operations of Blue Waters until
sometime in 2019. Extension of Blue Waters operations would be a very positive devel-
opment. Unfortunately, the NSF hasn’t announced a plan to replace Blue Waters with
a more powerful computer, even in light of the NSF’s role in the National Strategic
Computing Initiative announced by President Obama on 29 July 2015. There could be a
very serious shortage of computer time in the next few years that would broadly impact
science and engineering research in the US.
My previous article mentioned that the Division of Advanced Cyberinfrastructure
(ACI) is now part of the NSF’s Directorate of Computer & Information Science & Engi-
neering (CISE). Previously, the Office of Cyberinfrastructure reported directly to the NSF
director. The NSF has asked for comments on the impact of this change, but the deadline
is 30 June, well before you’ll see this column. The NSF’s request for comments was a major
topic of conversation in an open meeting at the symposium held by NCSA Director Ed
Seidel. I plan to let the NSF know that I think it’s essential to go back to the previous ar-
rangement: scientific computing isn’t part of computer science, and it’s very important
that the people at the NSF planning for supercomputing be at the same level as the science
directorates in order to get direct input on each directorate’s computing needs.
The committee report I mentioned earlier has seven recommendations, most of which
contain subpoints (see the “Committee Recommendations” sidebar for more information).
The recommendations are organized into four main issues: maintaining US leadership in
science and engineering, ensuring that resources meet community needs, helping compu-
tational scientists deal with the rapid changes in high-end computers, and sustaining the


Committee Recommendations

The full report is at http://tinyurl.com/advcomp17-20; the text here is a verbatim, unedited excerpt, reprinted with permission from
“Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017-2020,” Nat’l
Academy of Sciences, 2015 (doi:10.17226/21886).
A. Position US for continued leadership in science and engineering
Recommendation 1. NSF should sustain and seek to grow its investments in advanced computing—to include hardware and
services, software and algorithms, and expertise—to ensure that the nation’s researchers can continue to work at frontiers of science
and engineering.
Recommendation 1.1. NSF should ensure that adequate advanced computing resources are focused on systems and services
that support scientific research. In the future, these requirements will be captured in its road maps.
Recommendation 1.2. Within today’s limited budget envelope, this will mean, first and foremost, ensuring that a predominant
share of advanced computing investments be focused on production capabilities and that this focus not be diluted by undertaking too
many experimental or research activities as part of NSF’s advanced computing program.
Recommendation 1.3. NSF should explore partnerships, both strategic and financial, with federal agencies that also provide
advanced computing capabilities as well as federal agencies that rely on NSF facilities to provide computing support for their
grantees.
Recommendation 2. As it supports the full range of science requirements for advanced computing in the 2017-2020 timeframe,
NSF should pay particular attention to providing support for the revolution in data driven science along with simulation. It should
ensure that it can provide unique capabilities to support large-scale simulations and/or data analytics that would otherwise be unavail-
able to researchers and continue to monitor the cost-effectiveness of commercial cloud services.
Recommendation 2.1. NSF should integrate support for the revolution in data-driven science into NSF’s strategy for advanced
computing by (a) requiring most future systems and services and all those that are intended to be general purpose to be more data-
capable in both hardware and software and (b) expanding the portfolio of facilities and services optimized for data-intensive as well as
numerically-intensive computing, and (c) carefully evaluating inclusion of facilities and services optimized for data-intensive comput-
ing in its portfolio of advanced computing services.
Recommendation 2.2. NSF should (a) provide one or more systems for applications that require a single, large, tightly coupled
parallel computer and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as
high-performance work flows to them.
Recommendation 2.3. NSF should (a) eliminate barriers to cost-effective academic use of the commercial cloud and (b) carefully
evaluate the full cost and other attributes (e.g., productivity and match to science work flows) of all services and infrastructure mod-
els to determine whether such services can supply resources that meet the science needs of segments of the community in the most
effective ways.
B. Ensure resources meet community needs
Recommendation 3. To inform decisions about capabilities planned for 2020 and beyond, NSF should collect community re-
quirements and construct and publish roadmaps to allow NSF to set priorities better and make more strategic decisions about ad-
vanced computing.
Recommendation 3.1. NSF should inform its strategy and decisions about investment trade-offs using a requirements analysis
that draws on community input, information on requirements contained in research proposals, allocation requests, and foundation-
wide information gathering.
Recommendation 3.2. NSF should construct and periodically update roadmaps for advanced computing that reflect these re-
quirements and anticipated technology trends to help NSF set priorities and make more strategic decisions about science and engi-
neering and to enable the researchers that use advanced computing to make plans and set priorities.
Recommendation 3.3. NSF should document and publish on a regular basis the amount and types of advanced computing capa-
bilities that are needed to respond to science and engineering research opportunities.
Recommendation 3.4. NSF should employ this requirements analysis and resulting roadmaps to explore whether there are more
opportunities to use shared advanced computing facilities to support individual science programs such as Major Research Equipment
and Facilities Construction projects.
Recommendation 4. NSF should adopt approaches that allow investments in advanced computing hardware acquisition, comput-
ing services, data services, expertise, algorithms, and software to be considered in an integrated manner.
Recommendation 4.1. NSF should consider requiring that all proposals contain an estimate of the advanced computing
resources required to carry out the proposed work and creating a standardized template for collection of the information as one step
of potentially many toward more efficient individual and collective use of these finite, expensive, shared resources. (This information
would also inform the requirements process.)
Recommendation 4.2. NSF should inform users and program managers of the cost of advanced computing allocation requests in
dollars to illuminate the total cost and value of proposed research activities.
C. Aid the scientific community in keeping up with the revolution in computing
Recommendation 5. NSF should support the development and maintenance of expertise, scientific software, and software tools
that are needed to make efficient use of its advanced computing resources.
Recommendation 5.1. NSF should continue to develop, sustain, and leverage expertise in all programs that supply or use
advanced computing to help researchers use today’s advanced computing more effectively and prepare for future machine
architectures.
Recommendation 5.2. NSF should explore ways to provision expertise in more effective and scalable ways to enable
researchers to make their software more efficient; for instance, by making more pervasive the XSEDE (Extreme Science and
Engineering Discovery Environment) practice that permits researchers to request an allocation of staff time along with computer
time.
Recommendation 5.3. NSF should continue to invest in and support scientific software and update the software to support
new systems and incorporate new algorithms, recognizing that this work is not primarily a research activity but rather is support of
software infrastructure.
Recommendation 6. NSF should also invest modestly to explore next-generation hardware and software technologies to explore
new ideas for delivering capabilities that can be used effectively for scientific research, tested, and transitioned into production
where successful. Not all communities will be ready to adopt radically new technologies quickly, and NSF should provision advanced
computing resources accordingly.
D. Sustain the infrastructure for advanced computing
Recommendation 7. NSF should manage advanced computing investments in a more predictable and sustainable way.
Recommendation 7.1. NSF should consider funding models for advanced computing facilities that emphasize continuity of
support.
Recommendation 7.2. NSF should explore and possibly pilot the use of a special account (such as that used for Major Research
Equipment and Facilities Construction) to support large-scale advanced computing facilities.
Recommendation 7.3. NSF should consider longer-term commitments to center-like entities that can provide advanced
computing resources and the expertise to use them effectively in the scientific community.
Recommendation 7.4. NSF should establish regular processes for rigorous review of these center-like entities and not just their
individual procurements.

infrastructure for advanced computing. When I asked Gropp about the report’s main
message, he told me that “the community needs to get involved for the NSF to imple-
ment the recommendations.” That’s because we’ll need to do a better job of describing
our needs and our scientific plans. Gropp emphasized that it’s important to distinguish
between our wants and our needs. For example, Recommendation 3 calls on the NSF
to collect information on the needs of the scientific community for advanced comput-
ing—one possibility is that all grant applications will need to supply information about
their computing needs in a standard form (see recommendation 4.1).
The report also emphasizes that data-driven science needs to be supported along
with simulation. The latter has often driven machine design, but there are many inter-
esting scientific problems for which access to large amounts of data is the bottleneck,
and there are also now many simulations that produce large volumes of data that must
be read, stored, and visualized. It will be best to purchase computers that can support
both requirements well.
“For many years, we have been blessed with rapid growth in computing power,”
Gropp stated, but in referring to stagnant clock speeds, he noted, “that period is over.”
New supercomputers are going to employ new technologies that will require new pro-
gramming techniques to deal with the massive parallelism and deep memory hierar-
chies. Gropp quoted Ken Kennedy as saying that software transformations can take
10 years to reach maturity. I note that my own community is eight years into GPU
code development and three to four years into development for Intel Xeon Phi. The ef-
fort is continuing in anticipation of the next generation of supercomputers. The report
strongly emphasizes that the NSF must help users to adapt their codes (Recommenda-
tion 5 and its subpoints).

Before my conversation with Gropp ended, I asked him about the delay from the
original mid-2015 target date for the report’s release. He mentioned the “grueling
review process” and the need to respond to every comment. However, he said there
were many thoughtful, useful comments and that responding to them made the report
much better. Finally, Gropp left me with the thought that “Writing the report is not
the end, it is the beginning.” I certainly hope that my fellow CiSE readers will take that
to heart and get involved with helping the NSF plan for our needs for advanced com-
puting. You can find the entire report at http://tinyurl.com/advcomp17-20.

Steven Gottlieb is a distinguished professor of physics at Indiana University, where he directs the PhD minor in scientific computing. He's also an associate editor in chief of CiSE. Gottlieb's research is in lattice quantum chromodynamics, and he has a PhD in physics from Princeton University. Contact him at sg@indiana.edu.

Keeping YOU at the Center of Technology
Publications your way, when you want them.
The future of publication delivery is now. Check out myCS today!

• Mobile-friendly: Looks great on any device (mobile, tablet, laptop, or desktop)
• Customizable: Whatever your ereader lets you do, you can do on myCS
• Personal Archive: Save all your issues and search or retrieve them quickly on your personal myCS site.

Stay relevant with the IEEE Computer Society
More at www.computer.org/myCS


GUEST EDITORS’ INTRODUCTION

Science as a Service

Ravi Madduri and Ian Foster | Argonne National Laboratory and the University of Chicago

Researchers are increasingly taking advantage of advances in cloud computing to make data analy-
sis available as a service. As we see from the articles in this special issue, the science-as-a-service
approach has many advantages: it accelerates the discovery process via a separation of concerns,
with computational experts creating, managing, and improving services, and researchers using
them for scientific discovery. We also see that making scientific software available as a service can lower
costs and pave the way for sustainable scientific software. In addition, science services let users share their
analyses, discover what others have done, and provide infrastructure for reproducing results, reanalyzing
data, backward tracking rare or interesting events, performing uncertainty analysis, and verifying and
validating experiments. Generally speaking, this approach lowers barriers to entry to large-scale analysis
for theorists, students, and nonexperts in high-performance computing. It permits rapid hypothesis test-
ing and exploration as well as serving as a valuable tool for teaching.
Computation and automation are vital in many scientific domains. For example, the decreased se-
quencing costs in biology have transformed the field from a data-limited to a computationally-limited dis-
cipline. Increasingly, researchers must process hundreds of sequenced genomes to determine statistical
significance of variants. When datasets were small, they could be analyzed on PCs in modest amounts
of time: a few hours or perhaps overnight. However, this approach does not scale to large, next-
generation sequencing datasets—instead, researchers require high-performance computers and parallel

algorithms if they are to analyze their data in a timely manner. By leveraging services such as the cloud-based Globus Genomics, researchers can analyze hundreds of genomes in parallel using just a browser.
In this special issue, we present three great examples of efforts in science as a service. In "A Case for Data Commons: Toward Data Science as a Service," Robert L. Grossman and his colleagues present a flexible computational infrastructure that supports various activities in the data life cycle such as discovery, storage, analysis, and long-term archiving. The authors present a vision to create a data commons and discuss challenges that result from a lack of appropriate standards.
In "MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service," Susumu Mori and colleagues present MRICloud, a science as a service for large-scale analysis of brain images. This article illustrates how researchers can make novel analysis capabilities available to the scientific community at large by outsourcing key capabilities such as high-performance computing.
Finally, in "WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data," Raimond Winslow and colleagues present a service for analyzing electrocardiogram data that lets researchers upload time-series ECG data and provides analysis capabilities to enable discovery of the underlying aspects of heart disease. WaveformECG is accessible through a browser and provides interactive analysis, visualization, and annotation of waveforms using standard medical terminology.

As adoption of public cloud computing resources for science increases, science as a service provides a great way to create sustainable, reliable services that accelerate the scientific discovery process and improve the adoption of various tools and thus increase software reuse.

Ravi Madduri is a project manager and a Senior Fellow at the Computation Institute at the Argonne National Laboratory and the University of Chicago. His research interests include high-performance computing, workflow technologies, and distributed computing. Madduri has an MS in computer science from the Illinois Institute of Technology. Contact him at rm@anl.gov.

Ian Foster is director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly Compton Distinguished Service Professor of Computer Science. His research deals with distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as climate change and biomedicine. Foster received a PhD in computer science from Imperial College, United Kingdom.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

COMPUTER ENTREPRENEUR AWARD
In 1982, on the occasion of its thirtieth anniversary, the IEEE Computer Society established the Computer Entrepreneur Award to recognize and honor the technical managers and entrepreneurial leaders who are responsible for the growth of some segment of the computer industry. The efforts must have taken place over fifteen years earlier, and the industry effects must be generally and openly visible.
All members of the profession are invited to nominate a colleague who they consider most eligible to be considered for this award. Awarded to individuals whose entrepreneurial leadership is responsible for the growth of some segment of the computer industry.
DEADLINE FOR 2017 AWARD NOMINATIONS. DUE: 15 OCTOBER 2016
AWARD SITE: https://www.computer.org/web/awards/entrepreneur
www.computer.org/awards


tory and the University of Chicago. His



SCIENCE AS A SERVICE

A Case for Data Commons: Toward Data Science as a Service

Robert L. Grossman, Allison Heath, Mark Murphy, and Maria Patterson | University of Chicago
Walt Wells | Center for Computational Science Research

Data commons collocate data, storage, and computing infrastructure with core services and com-
monly used tools and applications for managing, analyzing, and sharing data to create an interoper-
able resource for the research community. An architecture for data commons is described, as well
as some lessons learned from operating several large-scale data commons.

With the amount of available scientific data being far larger than the ability of the research com-
munity to analyze it, there’s a critical need for new algorithms, software applications, software
services, and cyberinfrastructure to support data throughout its life cycle in data science. In
this article, we make a case for the role of data commons in meeting this need. We describe the
design and architecture of several data commons that we’ve developed and operated for the research com-
munity in conjunction with the Open Science Data Cloud (OSDC), a multipetabyte science cloud that the
nonprofit Open Commons Consortium (OCC) has managed and operated since 2009.1 One of the distin-
guishing characteristics of the OSDC is that it interoperates with a data commons containing over 1 Pbyte
of public research data through a service-based architecture. This is an example of what is sometimes called
“data as a service,” which plays an important role in some science-as-a-service frameworks.
There are at least two definitions for science as a service. The first is analogous to the software-as-a-service2
model, in which instead of managing data and software locally using your own storage and computing resourc-
es, you use the storage, computing, and software services offered by a service provider, such as a cloud service
provider (CSP). With this approach, instead of setting up his or her own storage and computing infrastructure
and installing the required software, a scientist uploads data to a CSP and uses preinstalled software for data

analysis. Note that a trained scientist is still required to run the software and analyze the data. Science as a service can also refer more generally to a service model that relaxes the requirement of needing a trained scientist to process and analyze data. With this service model, specific software and analysis tools are available for specific types of scientific data, which is uploaded to the science-as-a-service provider, processed using the appropriate pipelines, and then made available to the researcher for further analysis if required. Obviously, these two definitions are closely connected in that a scientist can set up the required science-as-a-service framework, as in the first definition, so that less-trained technicians can use the service to process their research data, as in the second definition. By and large, we focus on the first definition in this article.

There are various science-as-a-service frameworks, including variants of the types of clouds formalized by the US National Institute of Standards and Technology (infrastructure as a service, platform as a service, and software as a service),2 as well as some more specialized services that are relevant for data science (data science support services and data commons):

■ data science infrastructure and platform services, in which virtual machines (VMs), containers, or platform environments containing commonly used applications, tools, services, and datasets are made available to researchers (the OSDC is an example);
■ data science software as a service, in which data is uploaded and processed by one or more applications or pipelines and results are stored in the cloud or downloaded (general-purpose platforms offering data science as a service include Agave,3 as well as more specialized services, such as those designed to process genomics data);
■ data science support services, including data storage services, data-sharing services, data transfer services, and data collaboration services (one example is Globus4); and
■ data commons, in which data, data science computing infrastructure, data science support services, and data science applications are collocated and available to researchers.

Data Commons
When we write of a "data commons," we mean cyberinfrastructure that collocates data, storage, and computing infrastructure with commonly used tools for analyzing and sharing data to create an interoperable resource for the research community.

In the discussion below, we distinguish among several stakeholders involved in data commons: the data commons service provider (DCSP), which is the entity operating the data commons; the data contributor (DC), which is the organization or individual providing the data to the DCSP; and the data user (DU), which is the organization or individual accessing the data. (Note that there's often a fourth stakeholder: the DCSP associated with the researcher accessing the data.) In general, there will be an agreement, often called the data contributors agreement (DCA), governing the terms by which the data is managed by the DCSP and the researchers accessing the data, as well as a second agreement, often called the data access agreement (DAA), governing the terms of any researcher who accesses the data.

As we describe in more detail later, we've built several data commons since 2009. Based on this experience, we've identified six main requirements that, if followed, would enable data commons to interoperate with each other, science clouds,1 and other cyberinfrastructure supporting science as a service:

■ Requirement 1, permanent digital IDs. The data commons must have a digital ID service, and datasets in the data commons must have permanent, persistent digital IDs. Associated with digital IDs are access controls specifying who can access the data and metadata specifying additional information about the data. Part of this requirement is that data can be accessed from the data commons through an API by specifying its digital ID.
■ Requirement 2, permanent metadata. There must be a metadata service that returns the associated metadata for each digital ID. Because the metadata can be indexed, this provides a basic mechanism for the data to be discoverable.
■ Requirement 3, API-based access. Data must be accessed by an API, not just by browsing through a portal. Part of this requirement is that a metadata service can be queried to return a list of digital IDs that can then be retrieved via the API. For those data commons that contain controlled access data, another component of the requirement is that there's an authentication and authorization service so that users can first be authenticated and the data commons can check whether they are authorized to have access to the data. (A brief client sketch illustrating these first three requirements follows this list.)
■ Requirement 4, data portability. The data must be portable in the sense that a dataset in a data
commons can be transported to another data commons and be hosted there. In general, if data access is through digital IDs (versus referencing the data's physical location), then software that references data shouldn't have to be changed when data is rehosted by a second data commons.
■ Requirement 5, data peering. By "data peering," we mean an agreement between two data commons service providers to transfer data at no cost so that a researcher at data commons 1 can access data commons 2. In other words, the two data commons agree to transport research data between them with no access charges, no egress charges, and no ingress charges.
■ Requirement 6, pay for compute. Because, in practice, researchers' demand for computing resources is larger than available computing resources, computing resources must be rationed, either through allocations or by charging for their use. Notice the asymmetry in how a data commons treats storage and computing infrastructure. When data is accepted into a data commons, there's a commitment to store and make it available for a certain period of time, often indefinitely. In contrast, computing over data in a data commons is rationed in an ongoing fashion, as is the working storage and the storage required for derived data products, either by providing computing and storage allocations for this purpose or by charging for them. For simplicity, we refer to this requirement as "pay for computing," even though the model is more complicated than that.
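To make the first three requirements concrete, the following is a minimal client sketch. It assumes a generic REST-style digital ID and metadata service reachable over HTTPS; the endpoint names, URL layout, and response format are illustrative placeholders, not the OSDC's actual API.

```python
import requests

# Placeholder endpoints: the article does not specify the OSDC's actual
# service URLs or response formats.
ID_SERVICE = "https://ids.example.org"       # digital ID / metadata service
DATA_SERVICE = "https://data.example.org"    # API-based data access

def fetch_dataset(digital_id, token=None):
    """Resolve a persistent digital ID to its metadata, then retrieve the data."""
    headers = {"Authorization": "Bearer " + token} if token else {}

    # Requirement 2: the metadata service returns the metadata for a digital ID.
    meta = requests.get(ID_SERVICE + "/ids/" + digital_id, headers=headers)
    meta.raise_for_status()

    # Requirements 1 and 3: the data itself is fetched through an API by
    # specifying the digital ID, subject to the access controls attached to
    # that ID (the token is needed only for controlled-access data).
    data = requests.get(DATA_SERVICE + "/objects/" + digital_id, headers=headers)
    data.raise_for_status()

    # Requirement 4: because only the digital ID is referenced (not a physical
    # location), nothing here changes if the dataset is rehosted by another
    # data commons that resolves the same IDs.
    return meta.json(), data.content
```

Because the client names only the digital ID, the same code continues to work when a dataset moves between commons, which is the motivation for the portability requirement above.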
Although very important for many applications, we view other services, such as those for providing data provenance,5 data replication,6 and data collaboration,7 as optional and not core services.

OSDC and OCC Data Commons
The OSDC is a multipetabyte science cloud that serves the research community by collocating a multidisciplinary data commons containing approximately 1 Pbyte of scientific data with cloud-based computing, high-performance data transport services, and VM images and shareable snapshots containing common data analysis pipelines and tools.

The OSDC is designed to provide a long-term persistent home for scientific data, as well as a platform for data-intensive science, allowing new types of data-intensive algorithms to be developed, tested, and used over large sets of heterogeneous scientific data. Recently, OSDC researchers have logged about two million core hours each month, which translates to more than US$800,000 worth of cloud computing services (if purchased through Amazon Web Services' public cloud). This equates to more than 12,000 core hours per user, or a 16-core machine continuously used by each researcher on average.
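As a rough back-of-the-envelope check of these figures (the number of active researchers per month is inferred from the quoted totals, not a value reported here):

```python
# Rough check of the usage figures quoted above; all values are approximate.
core_hours_per_month = 2_000_000   # "about two million core hours each month"
core_hours_per_user = 12_000       # "more than 12,000 core hours per user"
hours_per_month = 24 * 365 / 12    # roughly 730 hours in an average month

implied_users = core_hours_per_month / core_hours_per_user   # about 167 researchers
cores_per_user = core_hours_per_user / hours_per_month        # about 16.4 cores

print(round(implied_users), round(cores_per_user, 1))   # -> 167 16.4
```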
OSDC researchers used a total of more than 18 million core hours in 2015. We currently target operating OSDC computing resources at approximately 85 percent of capacity, and storage resources at 80 percent of capacity. Given these constraints, we can determine how many researchers to support and what size allocations to provide them. Because the OSDC specializes in supporting data-intensive research projects, we've chosen to target researchers who need larger-scale resources (relative to our total capacity) for data-intensive science. In other words, rather than support more researchers with smaller allocations, we support fewer researchers with larger allocations. Table 1 shows the number of times researchers exceeded the indicated number of core hours in a single month during 2015.

Table 1. Data-intensive users supported by the Open Science Data Cloud.

No. core hours per month    No. users
20,000                      120
50,000                      34
100,000                     23
200,000                     5

The OSDC Community
The OSDC is developed and operated by the Open Commons Consortium, a nonprofit that supports the scientific community by operating data commons and cloud computing infrastructure to support scientific, environmental, medical, and healthcare-related research. OCC members and partners include universities (University of Chicago, Northwestern University, University of Michigan), companies (Yahoo, Cisco, Infoblox), US government agencies and national laboratories (NASA, NOAA), and international partners (Edinburgh University, University of Amsterdam, Japan's National Institute of Advanced Industrial Science and Technology). The OSDC is a joint project with the University of Chicago, which provides the OSDC's datacenter. Much of the support for the OSDC came from the Moore Foundation and from corporate donations.
The OSDC has a wide-reaching, multicampus, multi-institutional, interdisciplinary user base and has supported more than 760 research projects since its
inception. In 2015, 470 research groups from 54 universities in 14 countries received OSDC allocations. In a typical month (November 2015), 186 of these research groups were active. The most computational-intensive group projects in 2015 included projects around biological sciences and genomics research, analysis of Earth science satellite imagery data, analysis of text data in historical and scientific literature, and a computationally intensive project in sociology.

OCC Data Commons
The OCC operates several data commons for the research community.

OSDC data commons. We introduced our first data commons in 2009. It currently holds approximately 800 Tbytes of public open access research data, including Earth science data, biological data, social science data, and digital humanities data.

Matsu data commons. The OCC has collaborated with NASA since 2009 on Project Matsu, a data commons that contains six years of Earth Observing-1 (EO-1) data, with new data added daily, as well as selected datasets from other NASA satellites, including NASA's Moderate Resolution Imaging Spectrometer (MODIS) and the Landsat Global Land Surveys.

The OCC NOAA data commons. In April 2015, NOAA announced five data alliance partnerships (with Amazon, Google, IBM, Microsoft, and the OCC) that would have broad access to its data and help make it more accessible to the public. Currently, only a small fraction of the more than 20 Pbytes of data that NOAA has available in its archives is available to the public, but NOAA data alliance partners have broader access to it. The focus of the OCC data alliance is to work with the environmental research community to build an environmental data commons. Currently, the OCC NOAA data commons contains Nexrad data, with additional datasets expected in 2016.

National Cancer Institute's (NCI's) genomic data commons (GDC). Through a contract between the NCI and the University of Chicago and in collaboration with the OCC, we've developed a data commons for cancer data; the GDC contains genomic data and associated clinical data from NCI-funded projects. Currently, the GDC contains about 2 Pbytes of data, but this is expected to grow rapidly over the next few years.

Bionimbus protected data cloud. We also operate two private cloud computing platforms that are designed to hold human genomic and other sensitive biomedical data. These two clouds contain a variety of sensitive controlled-access biomedical data that we make available to the research community following the requirements of the relevant data access committees.

Common software stack. The core software stack for the various data commons and clouds described here is open source. Many of the components are developed by third parties, but some key services are developed and maintained by the OCC and other working groups. Although there are some differences between them, we try to minimize the differences between the software stacks used by the various data commons that we operate. In practice, as we develop new versions of the basic software stack, it usually takes a year or so until the changes can percolate throughout our entire infrastructure.

OSDC Design and Architecture
Figure 1 shows the OSDC's architecture. We are currently transitioning from version 2 of the OSDC software stack1 to version 3. Both are based on OpenStack8 for infrastructure as a service. The primary change made between version 2 and version 3 is that version 2 uses GlusterFS9 for storage, while version 3 uses Ceph10 for object storage in addition to OpenStack's ephemeral storage. This is a significant user-facing change that comes with some tradeoffs. Version 2 utilized a POSIX-compliant file system for user home directory (scratch and persistent) data storage, which provides command-line utilities familiar for most OSDC users. Version 3's object storage, however, provides the advantage of an increased level of interoperability, as Ceph's object storage has an interface compatible with a large subset of Amazon's S3 RESTful API in addition to OpenStack's API.

In version 3, there's thus a clearer distinction between the way users interface with scratch data and intermediate working results on ephemeral storage, which is simple to use and persists only until VMs are terminated. This results in longer-term data on object storage, which requires the small extra effort of curating through the API interface. Although there's a learning curve required in adopting object storage, we've noticed that it's small and easily overcome with examples in documentation. It also tempers increased storage usage that could stem from unnecessary data that isn't actively removed.
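Because Ceph's object gateway exposes an S3-compatible interface, longer-term results can be moved between a VM's ephemeral scratch space and the object store with any standard S3 client. The sketch below uses boto3 pointed at a nonstandard endpoint; the endpoint URL, credentials, bucket, and object names are placeholders rather than actual OSDC values.

```python
import boto3

# Point a standard S3 client at an S3-compatible (Ceph) object gateway.
# All names below are illustrative placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Persist a derived result from ephemeral VM storage into object storage...
s3.upload_file("scratch/results.csv", "my-project", "runs/2016-09/results.csv")

# ...and retrieve it later, possibly from a different VM or after the original
# VM has been terminated.
s3.download_file("my-project", "runs/2016-09/results.csv", "results.csv")
```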
The OSDC has a portal called the Tukey portal, which provides a front-end Web portal interface for users to access, launch, and manage VMs and storage.

Figure 1. The Open Science Data Cloud (OSDC) architecture. The various data commons that we have developed and
operate share an architecture, consisting of object-based storage, virtual machines (VMs), and containers for on-demand
computing, and core services for digital IDs, metadata, data access, and access to computing resources, all of which are
available through RESTful APIs. The data access and data submission portals are applications built using these APIs.

The OSDC has a portal called the Tukey portal, which provides a front-end Web portal interface for users to access, launch, and manage VMs and storage. The Tukey portal interfaces with the Tukey middleware, which provides a secure authentication layer and interface between various software stacks. The OSDC uses federated login for authentication so that academic institutions with InCommon, CANARIE, or the UK Federation can use those credentials. We've worked with 145 academic universities and research institutions to release the appropriate attributes for authentication. We also support Gmail and Yahoo logins, but only for approved projects when other authentication options aren't available.

We instrument all the resources that we operate so that we can meter and collect the data required for accounting and billing each user. We use Salesforce.com, one of the components of the OSDC that isn't open source, to send out invoices. Even when computing resources are allocated and no payment is required, we've found that receipt of these invoices promotes responsible usage of OSDC community resources. We also operate an interactive support ticketing system that tracks user support requests and system team responses for technical questions. Collecting this data lets us track usage statistics and build a comprehensive assessment of how researchers use our services.

While adding to our resources, we've developed an infrastructure automation tool called Yates to simplify bringing up new computing, storage, and networking infrastructure. We also try to automate as much of the security required to operate the OSDC as is practical.

The core OSDC software stack is open source, enabling interested parties to set up their own science cloud or data commons. The core software stack consists of third-party, open source software, such as OpenStack and Ceph, as well as open source software developed by the OSDC community. The latter is licensed under the open source Apache license. The OSDC does use some proprietary software, such as Salesforce.com to do the accounting and billing, as mentioned earlier.


OCC Digital ID and Metadata Services
The digital ID (DID) service is accessible via an API that generates digital IDs, assigns key-value attributes to digital IDs, and returns key-value attributes associated with digital IDs. We also developed a metadata service that's accessible via an API and can assign and retrieve metadata associated with a digital ID. Users can also edit metadata associated with digital IDs if they have write access to it. Due to different release schedules, there are some differences in the digital ID and metadata services between several of the data commons that we operate, but over time, we plan to converge these services.

Persistent Identifier Strategies
Although the necessity of assigning digital IDs to data is well recognized,11,12 there isn't yet a widely accepted service for this purpose, especially for large datasets.13 This is in contrast to the generally accepted use of digital object identifiers (DOIs) or handles for referencing digital publications. An alternative to a DOI is an archival resource key (ARK), a Uniform Resource Locator (URL) that's also a multipurpose identifier for information objects of any type.14,15 In practice, DOIs and ARKs are generally used to assign IDs to datasets, with individual communities sometimes developing their own IDs. DataCite is an international consortium that manages DOIs for datasets and supports services for finding, accessing, and reusing data.16 There are also services such as EZID that support both DOIs and ARKs.17

Given the challenges the community is facing in coming to a consensus about which digital IDs to use, our approach has been to build an open source digital ID service that can support multiple digital IDs, support "suffix pass-through,"13 and scale to large datasets.

Digital IDs
From the researcher viewpoint, the need for digital IDs associated with datasets is well appreciated.18,19 Here, we discuss some of the reasons that digital IDs are important for a data commons from an operational viewpoint.

First, with digital IDs, data can be moved from one physical location or storage system within a data commons to another without the need to change any code that references the data. As the amount of data grows, moving data between zones within a data commons or between storage systems becomes more and more common, and digital IDs allow this to take place without impeding researchers.

Second, digital IDs are an important component of the data portability requirement. More specifically, datasets can be moved between data commons, and, again, researchers don't need to change their code. In practice, datasets can be migrated over time, with the digital IDs' references updated as the migration proceeds.

Signpost is the digital ID service for the OSDC. Instead of using a hard-coded URL, the primary way to access managed data via the OSDC is through a digital ID. Signpost is an implementation of this concept via JavaScript Object Notation (JSON) documents.

The Signpost digital ID service integrates a mutable ID that's assigned to the data with an immutable hash-based ID that's computed from the data. Both IDs are accessible through a REST API interface. With this approach, data contributors can make updates to the data and retain the same ID, while the data commons service provider can use the hash-based ID to facilitate data management. To prevent unauthorized editing of digital IDs, an access control list (ACL) is kept by each digital ID specifying the read/write permissions for different users and groups.

User-defined identities are flexible, can be of any format (including ARKs and DOIs), and provide a layer of human readability. They map to hashes of the identified data objects, with the bottom layer utilizing hash-based identifiers, which guarantee data immutability, allow for identification of duplicated data via hash collisions, and allow for verification upon retrieval. These map to known locations of the identified data.

Metadata Service
The OSDC metadata service, Sightseer, lets users create, modify, and access searchable JSON documents containing metadata about digital IDs. The primary data can be accessed using Signpost and the digital ID. At its core, Sightseer places no restrictions on the JSON documents it can store. However, it has the ability to specify metadata types and associate them with JSON schemas. This helps prevent unexpected errors in metadata with defined schemas. Sightseer has abilities similar to Signpost's to provide ACLs that specify which users have write/read access to a specific JSON document.
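As a rough illustration of how the digital ID and metadata services fit together, the sketch below resolves a digital ID to its hash and known locations and then attaches a small, searchable JSON metadata document to the same ID. The service URLs, resource paths, response fields, and the example ID are hypothetical placeholders; the actual Signpost and Sightseer APIs may differ in detail.

# Illustrative client for a Signpost-like digital ID service and a
# Sightseer-like metadata service. URLs, paths, and field names are
# hypothetical placeholders, not the documented OSDC endpoints.
import requests

DID_SERVICE = "https://signpost.example.org"
METADATA_SERVICE = "https://sightseer.example.org"
digital_id = "ark:/99999/example123"  # user-defined IDs can be ARKs, DOIs, or other formats

# Resolve the mutable digital ID to its immutable hash-based ID and known locations.
record = requests.get(f"{DID_SERVICE}/ids/{digital_id}")
record.raise_for_status()
print(record.json().get("hashes"), record.json().get("urls"))

# Attach a small, searchable JSON metadata document to the same digital ID.
metadata = {
    "project": "example-project",
    "data_type": "aligned reads",
    "reference_genome": "GRCh38",
}
requests.put(f"{METADATA_SERVICE}/metadata/{digital_id}", json=metadata).raise_for_status()

Because clients address data by ID rather than by hard-coded URL, the same code keeps working when the underlying objects migrate between storage systems or between peered commons.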


Figure 2. A screenshot of part of the Namibia Flood Dashboard from 14 March 2014. This image shows water catchments (outlined
and colored regions) and a one-day flood potential forecast of the area from hydrological models using data from the Tropical Rainfall
Measuring Mission (TRMM), a joint space mission between NASA and the Japan Aerospace Exploration Agency.

Case Studies
Two case studies illustrate some of the projects that can be supported with data commons.

Matsu
Project Matsu is a collaboration between NASA and the OCC that's hosted by the University of Chicago, processes the data produced each day by NASA's EO-1 satellite, and makes a variety of data products available to the research community, including flood maps. The raw data, processed data, and data products are all available through the OSDC. Project Matsu uses a framework called the OSDC Wheel to ingest raw data, process and analyze it, and deliver reports with actionable information to the community in near real time.20 Project Matsu uses the data commons architecture illustrated in Figure 1.

As part of Project Matsu, we host several focused analytic products with value-added data. Figure 2 shows a screenshot from one of these focused analytic products, the Project Matsu Namibia Flood Dashboard,20 which was developed as a tool for aggregating and rapidly presenting data and sources of information about ground conditions, rainfall, and other hydrological information to citizens and decision makers in the flood-prone areas of water basins in Namibia and the surrounding areas. The tool features a bulletin system that produces a short daily written report, a geospatial data visualization display using Google Maps/Earth and OpenStreetMap, methods for retrieving NASA images for a region of interest, and analytics for projecting flood potential using hydrological models. The Namibia Flood Dashboard is an important tool for developing better situational awareness and enabling fast decision making and is a model for the types of focused analytics products made possible by collocating related datasets with each other and with computational and analytic capabilities.

Bionimbus
The Bionimbus Protected Data Cloud21 is a petabyte-scale private cloud and data commons that has been operational since 13 March 2013. Since going online in 2013, it has supported more than 152 allocation recipients from over 35 different projects at 29 different institutions. Each month, Bionimbus provides more than 2.5 million core hours to researchers, which at standard Amazon AWS pricing would cost over $500,000. One of the largest users of Bionimbus is the Cancer Genome Atlas (TCGA)/International Cancer Genome Consortium (ICGC) PanCancer Analysis of Whole Genomes working group (PCAWG).


PCAWG is currently undertaking a large-scale analysis of most of the world's whole genome cancer data available to the cancer community through the TCGA and ICGC consortia using several clouds, including Bionimbus.

Bionimbus also uses the data commons architecture illustrated in Figure 1. More specifically, the current architecture uses OpenStack to provide virtualized infrastructure, containers to provide a platform-as-a-service capability, and object-based storage with an AWS compatible interface. Bionimbus is a National Institutes of Health (NIH) Trusted Partner22 that interoperates with both the NIH Electronic Research Administration Commons to authenticate researchers and with the NIH Database of Genotypes and Phenotypes system to authorize users' access to specific controlled-access datasets, such as the TCGA dataset.

Discussion
Three projects that are supporting infrastructures similar to the OCC data commons are described in the sidebar. With the appropriate services, data commons support three different but related functions. First, data commons can serve as a data repository or digital library for data associated with published research. Second, data commons can store data along with computational environments in VMs or containers so that computations supporting scientific discoveries can be reproducible. Third, data commons can serve as a platform, enabling future discoveries as more data, algorithms, and software applications are added to the commons.

Data commons fit well with the science-as-a-service model: although data commons allow researchers to download data, host it themselves, and analyze it locally, they also allow current data to be reanalyzed with new methods, tools, and applications using collocated computing infrastructure. New data can be uploaded for an integrated analysis, and hosted data can be made available to other resources and applications using a data-as-a-service model, in which data in a data commons is accessed through an API. A data-as-a-service model is enhanced when multiple data commons and science clouds peer so that data can be moved between them at no cost.

Related Work
Several projects share many of the goals of data commons in general, and the Open Commons Consortium (OCC) data commons in particular. Here, we discuss three of the most important: the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program, the Research Data Alliance (RDA), and the National Data Service (NDS).

The work described in the main text is most closely connected with the vision for a commons outlined by the BD2K program at the US National Institutes of Health.1 The commons described in this article can be viewed partly as an implementation of a commons that supports the principles of findability, accessibility, interoperability, and reusability,2 which are key requirements of the data-sharing component of the BD2K program.

Of the three projects mentioned, the largest and most mature is the RDA,3 the goals of which are to create concrete pieces of infrastructure that accelerate data sharing and exchange for a specific but substantive target community; adopt the infrastructure within the target community; and use the infrastructure to accelerate data-driven innovation.3

The goals of the NDS are to implement core services for discovering data; storing persistent copies of curated data and associated metadata; accessing data; linking data with other data, publications, and credit for reuse; and computing and analyzing data (www.nationaldataservice.org). Broadly speaking, the goals of the commons described here are similar to the NDS and, for this reason, they can be considered as some of the possible ways to implement the services proposed for the NDS.

The OCC and Open Science Data Cloud (OSDC) started in 2008, several years before BD2K, RDA, and NDS, and have been developing cloud-based computing and data commons services for scientific research projects ever since. Roughly speaking, the goals of these projects are similar, but the OSDC is strictly a science service provider and data commons provider, whereas the RDA is a much more general initiative. The BD2K program is focused on biomedical research, especially for NIH-funded researchers, while the NDS is a newer effort that involves the National Science Foundation supercomputing centers, their partners, and their users.

References
1. V. Bonazzi, "NIH Commons Overview, Framework & Pilots," 2015; https://datascience.nih.gov/commons.
2. M.D. Wilkinson et al., "The FAIR Guiding Principles for Scientific Data Management and Stewardship," Scientific Data, vol. 3:160018, 2016.
3. F. Berman, R. Wilkinson, and J. Wood, "Building Global Infrastructure for Data Sharing and Exchange through the Research Data Alliance," D-Lib Magazine, vol. 20, 2014; www.dlib.org/dlib/january14/01guest_editorial.html.


Challenges
Perhaps the biggest challenge for data commons, especially large-scale data commons, is developing long-term sustainability models that support operations year after year.

Over the past several years, funding agencies have required data management plans for the dissemination and sharing of research results, but, by and large, they haven't provided funding to support this requirement. What this means is that a lot of data is searching for data commons and similar infrastructure, but very little funding is available to support this type of infrastructure.

Moreover, datacenters are sometimes divided into several "pods" to facilitate their management and build out—for lack of a better name, we sometimes use the term cyberpod to refer to the scale of a pod at a datacenter. Cyberinfrastructure at this scale is also sometimes called midscale computing,23 to distinguish it from the large-scale infrastructure available to Internet companies such as Google and Amazon and the HPC clusters generally available to campus research groups. A pod might contain 50 to several hundred racks of computing infrastructure. Large-scale Internet companies have developed specialized software for mid- to large-scale (datacenter-scale) computing,24 such as MapReduce (Google)25 and Dynamo (Amazon),26 but this proprietary software isn't available to the research community. Although some software applications, such as Hadoop,23 are available to the research community and scale across multiple racks, there isn't a complete open source software stack containing all the services required to build a large-scale data commons, including the infrastructure automation and management services, security services, and so on24 required to operate a data commons at midscale.

We single out three research challenges related to building data commons at the scale of cyberpods:

■ Software stacks for midscale computing. The first research challenge is to develop a scalable open source software stack that provides the infrastructure automation and monitoring, computing, storage, security, and related services required to operate at the scale of a cyberpod.
■ Datapods. The second research challenge is to develop data management services that scale out to cyberpods. We sometimes use the term datapods for data management infrastructure at this scale—that is, data management infrastructure that scales to midscale and larger computing infrastructure.
■ AnalyticOps. The third challenge is to develop an integrated development and operations methodology to support large-scale analysis and reanalysis of data. You might think of this as the analogy of DevOps for large-scale data analysis.

An additional category of challenges is the lack of consensus within the research community for a core set of standards that would support data commons. There aren't yet widely accepted standards for indexing data, APIs for accessing data, and authentication and authorization protocols for accessing controlled-access data.

Lessons Learned
Data reanalysis is an important capability. For many research projects, large datasets are periodically reanalyzed using new algorithms or software applications, and data commons are a convenient and cost-effective way to provide this service, especially as the data grows in size and becomes more expensive to transfer.

In addition, important discoveries are made at all computing resource levels. As mentioned, computing resources are rationed in a data commons (either directly through allocations or indirectly through charge backs). Typically, there's a range of requests for computing allocations in a data commons spanning six to seven or more orders of magnitude, ranging from hundreds of core hours to tens of millions of core hours. The challenge is that important discoveries are usually made across the entire range of resource allocations, from the smallest to the largest. This is because when large datasets, especially multiple large datasets, are collocated, it's possible to make interesting discoveries even with relatively small amounts of compute.

The tragedy of the commons can be alleviated with smart defaults in implementation. In the early stages of the OSDC, the number of users was smaller, and depletion of shared computational resources wasn't an urgent concern. As the popularity of the system grew and attracted more users, we noted some user issues (for example, an increase in support tickets noting that larger VM instances wouldn't launch) as compute core utilization surpassed 85 percent. Accounting and invoicing promotes responsible usage of community resources. We also implemented a quarterly resource allocation system with a short survey to


users requiring opt-in for continued resource usage extending into the next quarter. This provides a more formal reminder every three months to users who are finishing research projects to relinquish their quotas and has been successful for tempering unnecessary core usage. Similarly, as we moved to object storage functionality, we noted more responsible usage of storage, as scratch space is in ephemeral storage and removed by default when the computing environment is terminated. The small extra effort in moving data via an API to the object storage requires more thoughtful curation and usage of resources.

Over the past several years, much of the research focus has been on designing and operating data commons and science clouds that are scalable, contain interesting datasets, and offer computing infrastructure as a service. We expect that as these types of science-as-a-service offerings become more common, there will be a variety of more interesting higher-order services, including discovery, correlation, and other analysis services that are offered within a commons or cloud and across two or more commons and clouds that interoperate.

Today, Web mashups are quite common, but analysis mashups, in which data is left in place but continuously analyzed as a distributed service, are relatively rare. As data commons and science clouds become more common, these types of services can be more easily built.

Finally, hybrid clouds will become the norm. At the scale of several dozen racks (a cyberpod), a highly utilized data commons in a well-run datacenter is less expensive than using today's public clouds.22 For this reason, hybrid clouds consisting of privately run cyberpods hosting data commons that interoperate with public clouds seem to have certain advantages.

Properly designed data commons can serve several roles in science as a service: first, they can serve as an active, accessible, citable repository for research data in general and research data associated with published research papers in particular. Second, by collocating computing resources, they can serve as a platform for reproducing research results. Third, they can support future discoveries as more data is added to the commons, as new algorithms are developed and implemented in the commons, and as new software applications and tools are integrated into the commons. Fourth, they can serve as a core component in an interoperable "web of data" as the number of data commons begins to grow, as standards for data commons and their interoperability begin to mature, and as data commons begin to peer.

Acknowledgments
This material is based in part on work supported by the US National Science Foundation under grant numbers OISE 1129076, CISE 1127316, and CISE 1251201 and by National Institutes of Health/Leidos Biomedical Research through contracts 14X050 and 13XS021/HHSN261200800001E.

References
1. R.L. Grossman et al., "The Design of a Community Science Cloud: The Open Science Data Cloud Perspective," Proc. High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1051–1057.
2. P. Mell and T. Grance, The NIST Definition of Cloud Computing (Draft): Recommendations of the National Institute of Standards and Technology, Nat'l Inst. Standards and Tech., 2011.
3. R. Dooley et al., "Software-as-a-Service: The iPlant Foundation API," Proc. 5th IEEE Workshop Many-Task Computing on Grids and Supercomputers, 2012; https://www.semanticscholar.org/paper/Software-as-a-service-the-Iplant-Foundation-Api-Dooley-Vaughn/ccde19b95773dbb55328f3269fa697a4a7d60e03/pdf.
4. I. Foster, "Globus Online: Accelerating and Democratizing Science through Cloud-Based Services," IEEE Internet Computing, vol. 3, 2011, pp. 70–73.
5. Y.L. Simmhan, B. Plale, and D. Gannon, "A Survey of Data Provenance in E-Science," ACM Sigmod Record, vol. 34, no. 3, 2005, pp. 31–36.
6. A. Chervenak et al., "Wide Area Data Replication for Scientific Collaborations," Int'l J. High Performance Computing and Networking, vol. 5, no. 3, 2008, pp. 124–134.
7. J. Alameda et al., "The Open Grid Computing Environments Collaboration: Portlets and Services for Science Gateways," Concurrency and Computation: Practice and Experience, vol. 19, no. 6, 2007, pp. 921–942.
8. K. Pepple, Deploying OpenStack, O'Reilly, 2011.
9. A. Davies and A. Orsaria, "Scale out with GlusterFS," Linux J., vol. 235, 2013, p. 1.
10. S.A. Weil et al., "Ceph: A Scalable, High-Performance Distributed File System," Proc. 7th Symp. Operating Systems Design and Implementation, 2006, pp. 307–320.
11. M.S. Mayernik, "Data Citation Initiatives and Issues," Bulletin Am. Soc. Information Science and Technology, vol. 38, no. 5, 2012, pp. 23–28.
12. R.E. Duerr et al., "On the Utility of Identification Schemes for Digital Earth Science Data: An Assessment and Recommendations," Earth Science Informatics, vol. 4, no. 3, 2011, pp. 139–160.


13. C. Lagoze et al., "CED2AR: The Comprehensive Extensible Data Documentation and Access Repository," Proc. IEEE/ACM Joint Conf. Digital Libraries, 2014, pp. 267–276.
14. J. Kunze, "Towards Electronic Persistence Using ARK Identifiers," Proc. 3rd ECDL Workshop Web Archives, 2003; https://wiki.umiacs.umd.edu/adapt/images/0/0a/Arkcdl.pdf.
15. J.R. Kunze, The ARK Identifier Scheme, US Nat'l Library Medicine, 2008.
16. T. Pollard and J. Wilkinson, "Making Datasets Visible and Accessible: DataCite's First Summer Meeting," Ariadne, vol. 64, 2010; www.ariadne.ac.uk/issue64/datacite-2010-rpt.
17. J. Starr et al., "A Collaborative Framework for Data Management Services: The Experience of the University of California," J. eScience Librarianship, vol. 1, no. 2, 2012, p. 7.
18. A. Ball and M. Duke, "How to Cite Datasets and Link to Publications," Digital Curation Centre, 2011.
19. T. Green, "We Need Publishing Standards for Datasets and Data Tables," Learned Publishing, vol. 22, no. 4, 2009, pp. 325–327.
20. D. Mandl et al., "Use of the Earth Observing One (EO-1) Satellite for the Namibia SensorWeb Flood Early Warning Pilot," IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 2, 2013, pp. 298–308.
21. A.P. Heath et al., "Bionimbus: A Cloud for Managing, Analyzing and Sharing Large Genomics Datasets," J. Am. Medical Informatics Assoc., vol. 21, no. 6, 2014, pp. 969–975.
22. D.N. Paltoo et al., "Data Use under the NIH GWAS Data Sharing Policy and Future Directions," Nature Genetics, vol. 46, no. 9, 2014, p. 934.
23. Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017–2020, Nat'l Academies Press, 2016.
24. L.A. Barroso, J. Clidaras, and U. Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines," Synthesis Lectures on Computer Architecture, vol. 8, no. 3, 2013, pp. 1–154.
25. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
26. G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store," ACM SIGOPS Operating Systems Rev., vol. 41, no. 6, 2007, pp. 205–220.

Robert L. Grossman is director of the University of Chicago's Center for Data Intensive Science, a professor in the Division of Biological Sciences at the University of Chicago, founder and chief data scientist of Open Data Group, and director of the nonprofit Open Commons Consortium. Grossman has a PhD from Princeton University from the Program in Applied and Computational Mathematics. He's a Core Faculty member and Senior Fellow at the University of Chicago's Computation Institute. Contact him at robert.grossman@uchicago.edu.

Allison Heath is director of research for the University of Chicago's Center for Data Intensive Science. Her research interests include scalable systems and algorithms tailored for data-intensive science, specifically with applications to genomics. Heath has a PhD in computer science from Rice University. Contact her at aheath@uchicago.edu.

Mark Murphy is a software engineer at the University of Chicago's Center for Data Intensive Science. His research interests include the development of software to support scientific pursuits. Murphy has a BS in computer science engineering and a BS in physics from the Ohio State University. Contact him at murphymarkw@uchicago.edu.

Maria Patterson is a research scientist at the University of Chicago's Center for Data Intensive Science. She also serves as scientific lead for the Open Science Data Cloud and works with the Open Commons Consortium on its Earth science collaborations with NASA and NOAA. Her research interests include cross-disciplinary scientific data analysis and techniques and tools for ensuring research reproducibility. Patterson has a PhD in astronomy from New Mexico State University. Contact her at mtpatter@uchicago.edu.

Walt Wells is director of operations at the Open Commons Consortium. His professional interests include using open data and data commons ecosystems to accelerate the pace of innovation and discovery. Wells received a BA in ethnomusicology/folklore from Indiana University and is pursuing an MS in data science at CUNY. Contact him at walt@occ-data.org.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


SCIENCE AS A SERVICE

MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service

Susumu Mori and Dan Wu | Johns Hopkins University School of Medicine


Can Ceritoglu | Johns Hopkins University, Whiting School of Engineering
Yue Li | AnatomyWorks
Anthony Kolasny | Johns Hopkins University, Whiting School of Engineering
Marc A. Vaillant | Animetrics
Andreia V. Faria and Kenichi Oishi | Johns Hopkins University School of Medicine
Michael I. Miller | Johns Hopkins University, Whiting School of Engineering

MRICloud provides a high-throughput neuroinformatics platform for automated brain MRI segmentation
and analytical tools for quantification via distributed client-server remote computation and Web-based
user interfaces. This cloud-based service approach improves the efficiency of software implementation,
upgrades, and maintenance. The client-server model is also ideal for high-performance computing,
allowing distribution of computational servers and client interactions across the world.

In our laboratories at Johns Hopkins University, we have more than 15 years of experience in developing im-
age analysis tools for brain magnetic resonance imaging (MRI) and in sharing the tools with research com-
munities. The effort started when we developed DtiStudio in 20001 as an executable program that could
be downloaded from our website to perform tensor calculation of diffusion tensor imaging and 3D white
matter tract reconstruction. In 2006, two more programs (RoiEditor and DiffeoMap) joined the family that
we collectively called MriStudio. These two programs were designed to perform ROI (region of interest)-based



image quantification for any type of brain MRI data. The ROI could be manually defined, but DiffeoMap introduced our first capability for automated brain segmentation. We based our work on a single-subject atlas with more than 100 defined brain regions that were automatically deformed to image data, thus transferring the predefined ROIs in the atlas to achieve automated brain segmentation of the target. We call this image analysis pipeline high-throughput neuroinformatics,2 as it offers the user the opportunity to reduce MR imagery on the order of O(10^6 to 10^7) variables to O(1,000) dimensions associated with the neuro-ontology of atlas-defined structures. These 1,000 dimensions are searchable and can be used to support diagnostic workflows.

The core atlas-to-data image analysis is based on advanced diffeomorphic image registration algorithms for positioning information in human anatomical coordinate systems.3 To position dense atlas-based image ontologies, we use image-based large deformation diffeomorphic metric mapping (LDDMM),4 which is most efficiently implemented using high-performance networked systems, especially for large-volume data such as high-resolution T1-weighted images. In 2006, the term cloud wasn't yet widely used, but we employed a concept similar to that of cloud storage to solve this problem. Specifically, we used an IBM supercomputer at Johns Hopkins University's Institute for Computational Medicine to remotely and transparently process user data. Since the introduction of DiffeoMap, approximately 50,000 whole-brain MRI datasets have been processed using this approach. The platform naturally evolved into MRICloud, which we introduced in December 2014 as a beta testing platform. This Web-based software follows a cloud-based software-as-a-service (SaaS) model.

After 15 years of software development, the number of MriStudio's registered users now approaches 10,000, and in 2015, the number of datasets processed through the new cloud system reached a record of 3,500 per month. One motivation to adopt a cloud system is to exploit publicly available supercomputing systems for CPU- and memory-intensive operations. For example, although each MR image is typically 10 to 20 Mbytes, our current image-segmentation algorithm with 16 reference atlases requires approximately 5 Gbytes of memory per dataset. In the Extreme Science and Engineering Discovery Environment (XSEDE) extreme computing environment, the pipeline can parallelize the registration process of the 16 atlases and complete the calculation in 15 to 20 minutes using 8 cores for each atlas registration (a total of 128 cores), which is equivalent to consuming 32 to 43 CPU-hour service units (SUs). Through the cloud system, users can transparently access this type of high-performance computation facility and run large cohorts of data in a short amount of time, which is certainly an advantage. However, in the transition from a conventional executable-distribution model to a cloud platform, it has become apparent that computational power is only one advantage that a cloud system can offer in terms of science as a service. It also changes the efficiency of new tool development and dissemination, enabling services that weren't previously possible. In this article, we share the experiences we have accumulated during our period of development.

Software as a Service
The core mapping service is a computationally demanding high-throughput image analysis algorithm that parcels brain MRIs into upward of 400 structures by positioning the labeled atlas ontologies into the coordinates of the brain targets. The approach assumes that there exists a structure-preserving correspondence, what we term a diffeomorphism: a one-to-one smooth mapping $\varphi: I \to I_{\text{atlas}}$ between the target $I(x)$, $x \in X$, and the atlas. Here, $I$ and $I_{\text{atlas}}$ are the target and atlas images, respectively, $x$ and $X$ denote an image's individual coordinates and spatial domain, and $\varphi$ denotes the diffeomorphic transformation between the two images. The correspondence between the individual and the atlas is termed the DiffeoMap. We interpret the morphisms $\varphi(x)$, $x \in X$, as carrying the contrast MR imagery $I(x)$, $x \in X$. The morphisms provide a GPS3 for both transferring the atlas's ontological semantic labeling and providing coordinates to statistically encode the anatomy.

Personalization of atlas coordinates to the target occurs via smooth transformation of the atlas, which minimizes the distance $\inf_{\varphi} d(I, I_{\text{atlas}} \circ \varphi_1^{-1})$ between the individual's representation $I$ and the transformed atlas $I_{\text{atlas}} \circ \varphi_1^{-1}$,5 with the transformation solving the equation $\dot{\varphi}_t = v_t(\varphi_t)$, $t \in [0, 1]$, and minimizing the integrated cost

$$\inf_{v_t,\; t \in [0,1]:\; \dot{\varphi} = v(\varphi),\; I_{\text{atlas}} \circ \varphi_0^{-1} = I_{\text{atlas}}} \int_0^1 \lVert v_t \rVert_V \, dt,$$

where $v_t$ is the time-dependent velocity vector field of the flow of deformation, $\varphi_t$ is the diffeomorphism at time $t$, $\dot{\varphi}_t$ denotes the first-order differentiation of $\varphi_t$, and $\int_0^1 \lVert v_t \rVert_V \, dt$ denotes the integration of the norm of $v_t$ over the entire velocity field, with $V$ the Hilbert space of smooth vector fields.

Figure 1. The structural-functional model of atlas-based MRI informatics. MR imagery from different modalities, such as T1- and T2-weighted structural MRI, diffusion tensor MRI, functional and resting-state functional MRI, and MR spectroscopy images, can be parcellated into predefined structures based on the presegmented MRI atlases. This allows for extraction of multicontrast features for hundreds of anatomical structures from millions of voxels, in a reduced dimension.

Figure 1 shows examples of our structure-function model, including T1- and T2-weighted structural contrast imagery, orientation vector imagery (such as diffusion tensor MRI), metabolism measured via magnetic resonance spectroscopy, and functional connectivity via resting-state functional MRI (rs-fMRI).6 Each atlas carries with it the means and variances associated with each high-dimensional feature vector.

Evolution of Software Architecture
To highlight the software architecture's evolution (see Figure 2), let's first look at the functions of three key software programs. DtiStudio is software developed on the Windows platform and written in C++ for core algorithms. It also contains components of MS-Visual C, MFC, and OpenGL. User data and the executable file are both located in users' local computers (Figure 2a).1 The executable file needs to be downloaded from our website (www.mristudio.org), but all operations are performed within users' local computers, including data I/O, calculations, and the visualization interface. The input data are raw diffusion-weighted images and associated parameters from MRI scanners, from which diffusion tensor matrices are calculated. The software also offers ROI drawing and tractography tools to define white matter tracts and perform quantifications.

DiffeoMap is an example of a model in which external computation power is incorporated based on a seamless communication scheme (Figure 2b).7



Figure 2. Schematic diagrams of the architectures of (a) DtiStudio, (b) DiffeoMap, and (c) MRICloud. DtiStudio is an example of a conventional distribution model, in which an executable file is downloaded to local computers and the entire process takes place within local computers. DiffeoMap has an internal architecture similar to that of DtiStudio, but the CPU-demanding calculations associated with large deformation diffeomorphic mapping occur on a remote Unix/Linux server. For the MRICloud system, the entire calculation occurs in the remote server, and the communication with users relies on a Web interface. The system has flexible scalability and contains a storage system for temporary storage of user data.

The software reads two images (a reference brain atlas and a user-provided image), and one image is transformed into the shape of the other, thereby anatomically registering voxel-coordinates of the two images. Basic image transformation (voxel resizing, image cropping, and linear transformation) and associated functions (file format conversion, intensity matching) are performed locally. The data I/O and visualization interfaces also remain in the local Windows platform. However, diffeomorphic image transformation, which is too CPU-intensive for local PCs, is performed by a remote server. Communication with the remote server is performed through HTTPS and FTP protocols and through notification to users via email. Once users' data are automatically sent to the remote server, the server performs diffeomorphic transformation, and the resultant transformation matrices, which are typically about 1 Gbyte, can be retrieved by DiffeoMap through 32-bit data identifiers provided in the email.

MRICloud is the latest evolution of our software, in which computationally intensive algorithms are migrated to a remote server (Figure 2c). The cloud computing model is an attractive client-server model that we adopted because of the ease of scalability, portability, accessibility, and maintenance cost, providing a "virtual" hardware environment that decouples the computer from the physical hardware. The computer is referred to as a virtual machine and behaves like a software program that can run on another computer. Abstracting the computer from hardware facilitates movement and scaling of virtual machines on the fly.

Cloud System Architecture
The main entry point to the server infrastructure is through either the MRICloud Web application or its accompanying RESTful Web API.



Figure 3. Diagram of core MriStudio and MRICloud server components. MRICloud.org or MriStudio applications generate a zip from the users' data, which are uploaded to an anonymous FTP server (ftp.mristudio.org). Another server (io19.cis.jhu.edu) monitors the incoming queue for new data. Upon arrival, this server validates the data, identifies a computation resource, and copies the data to one of the clusters (currently, http://icm.jhu.edu, www.xsede.org, or www.marcc.jhu.edu). The data are then queued using an SSH signal. The validation and allocation server also monitors job completion and updates the job status at www.mricloud.org or sends an email to MriStudio users with a URL of the data location.
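To give a feel for the validation-and-allocation loop the caption describes, the fragment below sketches one possible implementation in Python; the directory paths, host name, and remote submission command are invented for illustration and stand in for the actual FTP queue, cluster names, and SSH signaling that MRICloud uses.

# Illustrative monitor loop: watch an incoming queue for zip payloads, validate
# them, copy each one to a processing cluster, and signal job submission over SSH.
# Paths, the host name, and the remote command are placeholders, not the real system's.
import subprocess
import time
import zipfile
from pathlib import Path

INCOMING = Path("/srv/ftp/incoming")      # hypothetical anonymous-FTP landing area
CLUSTER = "cluster.example.org"           # hypothetical processing cluster

def valid_payload(zip_path):
    # A payload is usable if it is a readable zip archive containing a manifest.
    try:
        with zipfile.ZipFile(zip_path) as z:
            return "manifest.txt" in z.namelist()
    except zipfile.BadZipFile:
        return False

while True:
    for payload in INCOMING.glob("*.zip"):
        if not valid_payload(payload):
            continue
        # Copy the payload to the cluster and signal its queue over SSH.
        subprocess.run(["scp", str(payload), f"{CLUSTER}:/data/queue/"], check=True)
        subprocess.run(["ssh", CLUSTER, "submit_job", payload.name], check=True)  # placeholder command
        payload.unlink()  # remove from the incoming queue once handed off
    time.sleep(30)        # poll the queue periodically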

Data payloads can be several hundreds of Mbytes, and a special jQuery interface is used to facilitate resumable uploads because they aren't directly supported by the HTTP protocol. A successful upload returns a job identifier that references the data and its pipeline throughout the system. The job ID is used to check the status of the processing and to reference the resulting processed data to be downloaded.

Figure 3 outlines our back-end processing pipeline, which is built from standard legacy protocols on a LAMP (Linux, Apache, MySQL, PHP) stack, also including FTP, SSH, SMTP, and high-level scripting (BASH, PHP) that keeps the system lightweight, simple, robust, and easily maintainable. Once the data are uploaded, they're repacked in a zip payload structure to move through the system, first moving into a queue via FTP. Once consumed by the monitor, the payload is validated for completeness. Then, an available computational resource (www.xsede.org or www.marcc.jhu.edu) is identified and the data are submitted to the cluster's processing queue using an SSH signal. The cluster uses SMTP to signal that the job is submitted, and the monitor then polls for complete jobs. Upon job completion, the resulting data are moved to an FTP server, and the user is notified by email with the URI to retrieve the data. Alternatively, the user can check on the status of the processing at any time via the MRICloud website and retrieve the data from there if they're ready.

To facilitate a programmatic interface to the processing, the RESTful Web API provides a service that can ping the status of the processing, as well as another service for downloading the data. Therefore, a user can batch process and retrieve results, without a human in the loop, and is notified when the MRI images being processed are completed. An example protocol might be api_submit data; in a loop, api_job_status every 30 seconds until complete; and api_download result. As with any RESTful Web API, this can be done programmatically in any language that supports the HTTP protocol.
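In Python, the submit/poll/download pattern just described might look roughly like the sketch below. The endpoint names mirror the api_submit, api_job_status, and api_download protocol mentioned above, but the base URL, request parameters, and response fields are placeholders rather than the documented MRICloud API.

# Rough sketch of the batch submit/poll/download protocol described in the text.
# The base URL, endpoint spellings, and field names are illustrative placeholders.
import time
import requests

BASE = "https://mricloud.example.org/api"

# 1. Submit a zipped payload and obtain the job identifier.
with open("subject01_T1.zip", "rb") as f:
    job_id = requests.post(f"{BASE}/api_submit", files={"data": f}).json()["job_id"]

# 2. Poll the job status every 30 seconds until the pipeline reports completion.
while requests.get(f"{BASE}/api_job_status", params={"job_id": job_id}).json()["status"] != "complete":
    time.sleep(30)

# 3. Download the processed results.
result = requests.get(f"{BASE}/api_download", params={"job_id": job_id})
with open("subject01_results.zip", "wb") as out:
    out.write(result.content)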



Figure 4. The actual login window and one of the SaaS interfaces of MRICloud. (a) User registration and
authentication are essential for cloud-based services. (b) Currently, six SaaSs have been tested, which include brain
segmentation, diffusion tensor imaging (DTI) calculation, resting-state functional MRI (rs-fMRI), arterial spin labeling,
angio scan parameter calculation, and surface mapping. The figure shows the interface for the brain segmentation
based on T1-weighted images.

To secure the processing pipeline, SSH is core to data transfer and signaling commands on remote systems. The validation and cluster allocation server uses public and private keys with authorized key restrictions. The root of the server allows SSHFS to mount a restricted area of data space for processing storage. A user-level SSH public/private key with authorized_key restrictions is used to signal the cluster for job submission. The user-level account doesn't have direct shell access to the cluster but moves data through the root-level SSHFS mount point.

Step-by-Step Procedure for the Cloud Service
To provide a clear illustration of how the cloud-based SaaS functions, Figure 4 shows the actual steps involved in the image analysis services. The first step is to create an account in the login window (Figure 4a). Once logged in, users have access to several SaaSs, including T1-based brain segmentation, diffusion tensor imaging (DTI) data processing, resting-state fMRI analysis, and arterial spin labeling data processing.

If a T1-based brain segmentation SaaS is chosen, a data upload page appears (Figure 4b), in which users need to choose several options, including choice of processing servers, image orientations (sagittal or axial), and multiatlas libraries. Currently, the SaaS accepts a specific file format generated by a small program that needs to be downloaded from the MRICloud website. If users want to compare their data with the internal control data being logged within MRICloud, the demographic information must also be provided. Users have a choice of two processing servers: the Johns Hopkins University IBM Blade computer, supported by the Institute for Computational Medicine (http://icm.jhu.edu), and the University of California, San Diego, Gordon computer from the US National Science Foundation for the Computational Anatomy Gateway via XSEDE (www.xsede.org). Thus far, the services at the Computational Anatomy Gateway have been supported by the XSEDE grant program, which allows us to provide the MRI SaaS to users free of charge. Our current effort is focused on utilizing publicly available computational resources to make them available for users. Based on the SUs consumed and the computing resources used, we can compare efficiency in terms of computing consumption. SUs can be defined as SU = (wall time/60) * (total CPU number). Given a T1-segmentation pipeline using 16 atlases, the wall time on XSEDE resources would be 32 minutes. The SU and runtime increase with the number of atlases and also depend on the available number of CPUs, as illustrated in Figure 5.
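As a quick sanity check on the SU formula above, the snippet below evaluates it for the two configurations quoted in the text; the helper function is only illustrative, and the 128-core figure for the 32-minute run assumes the same 8 cores per atlas mentioned earlier.

# Service units as defined above: SU = (wall time in minutes / 60) * total CPU count.
def service_units(wall_time_min, total_cpus):
    return wall_time_min / 60.0 * total_cpus

# 16-atlas run on 128 cores, 15-20 minutes of wall time -> roughly 32 to 43 SUs.
print(service_units(15, 128), service_units(20, 128))  # 32.0  ~42.7

# 32-minute wall time, again assuming 8 cores per atlas (128 cores) -> about 68 SUs.
print(service_units(32, 128))                           # ~68.3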


On the current MRICloud platform, 45 and 30 atlases are in use for adult and pediatric target images, respectively. The results in Figure 5a highlight the importance of parallelization and the enhanced efficiency gained by employing supercomputing resources. The runtime of the pipeline decreases drastically as more CPUs become available. Figures 5b and 5c demonstrate the pipeline's scalability when a large number of cores/CPUs are available. As long as there is a sufficient number of CPUs available, an increased number of atlases does not lead to an increased runtime.

Figure 5. Computational performance of the T1 segmentation pipeline on XSEDE Stampede and Gordon clusters. (a) The system units (CPU hours) used in each cluster increase as the number of atlases and the number of CPUs increase. (b) and (c) The pipeline is scalable, with nearly constant runtime, if the available number of CPUs for each atlas is also constant.

Service status can be monitored at the "my job status" section; once completed, the results can be downloaded or viewed from the same page (Figure 6a). The "view results" option opens a new webpage that allows visual inspection of the segmentation results, which are displayed in three orthogonal views and a 3D surface rendering (Figure 6b). Users can examine the quality of the segmentation with these views and, in addition, if the age of the data is specified at the data submission, the volume of each defined structure can be compared to age-matched controls based on z-scores or within an age-versus-volume plot. These control data are stored in MongoDB; results on the Web are updated in real time as the control database evolves.

The downloaded files contain information about the volumes and intensity of segmented structures. Currently, we offer atlas version 7a, which has 289 defined structures and a five-level ontological relationship for these structures, as described in our previously published paper.8 This service converts T1-weighted images with more than 1 million voxels to standardized and quantitative matrices with [volume, intensity] × 289 structures.

This T1-based brain segmentation service is linked to other SaaSs provided by MRICloud. For example, rs-fMRI and arterial spin labeling (ASL) services incorporate the structural segmentation results and perform structure-to-structure connectivity analysis (rs-fMRI) or structure-specific quantification of blood flow (ASL). In this way, the cloud system can be a platform to link multiple SaaSs.

Advantages and Limitations of the Cloud Service
A cloud-based SaaS lowers the threshold for adoption by users as well as developers: there are numerous steps software developers take for granted that aren't obvious to application scientists. Each step of software installation, source code compilation, upgrades, and management can be a major obstacle. Technologies that can eliminate or minimize these processes can be major factors for the tools to be adopted in research communities.

The cloud-based SaaS also drastically changes the efficiency of software development. After more than 15 years of software development, we learned that writing new software is not even half the story. Manufacturers constantly change their data formats and image parameters. The versions of computer operating systems undergo upgrades every few years, and there's no guarantee that our software will run on every new system. Then, of course, new versions of our software need to be updated, distributed, and adopted. Upgraded documents need


to be distributed, and if a bug is found, we need to make sure all users receive a revised version. Failure to keep up with these efforts often leads to software becoming extinct. Laboratories that support image analysis software soon become swamped by software and user maintenance.

The cloud approach really shines in this respect. Only a Web browser on the user's machine is required, with very lightweight hardware specifications, because the heavy lifting is done at the server and communicated to the browser over the Internet. As such, the applications are readily available to anyone capable of running a browser, typically with no further component installation. At the same time, the vendor is relieved of maintaining compatibility across varying local hardware configurations and can make application upgrades effective immediately, without cooperation from the user and without the need to maintain legacy versions. When a bug is found, all affected results can be traced, users notified, and results recalculated. Specifically, if a bug is found after a new version of the software or atlas resources is deployed, we can trace all affected data based on the dates of submission, unique data identifiers, and user email addresses recorded in our log. We then send bug notices to the users with the lists of affected data and reprocess the data. This approach offloads a substantial amount of the maintenance burden and provides for a much better user experience.

Figure 6. Different interfaces. (a) The status of the submitted data can be monitored in the "my job status" window. (b) Once the job is completed, the results can be downloaded or visualized. The color coding is based on z-scores using age-matched internal control data; red is more than three standard deviations larger and green is more than three standard deviations smaller. The actual volume-age relationships of the internal control data and the submitted data can be shown by right-clicking the structure of interest. In the plot, green dots are from the internal data, which are connected to the data stored in MongoDB and updated in real time.

That said, file format and Health Insurance Portability and Accountability Act (HIPAA) issues need to be addressed: the cloud-based approach requires users' data to be transferred outside the institution, which raises the HIPAA issue of protecting personal identification information. This issue is also related to a more general question about file formats and header information. Regardless of the image source, original files are in one of the vendor-specific Digital Imaging and Communications in Medicine (DICOM) formats (the files exported from the MRI scanners). Externally or internally, at some point, we need a tool that can read DICOM files from all MRI vendors and all versions of MR operating systems. Once read, the files are usually stored in a more standardized file format, such as NIfTI or Analyze, although the standards for these third-party file formats still contain significant variability, and consistency hasn't been guaranteed. When the SaaS is provided, this poses a substantial challenge. One thing that's clear is that once the raw MRI DICOM data go through a third-party program, including the PACS (Picture Archive and Communication System), the SaaS needs to support a large number of file formats and matrix definitions, because the variability of the original DICOM formats is multiplied by that of the third-party file definitions. The only practical solution is to restrict the data format to the original DICOM formats. However, these often include personal identifiers. For research, this could raise a question about whether

use of a cloud service should be included in each project's Institutional Review Board approval. For clinical purposes, it isn't immediately clear whether hospitals would allow their data to migrate to an outside entity, even temporarily, without proper permission.

Our current approach is to develop a small executable program that can read original DICOM from most vendors; it is distributed to each local client, and file conversions are executed from DICOM to two simple files, a raw image matrix and a header file that contains only the matrix dimension information. (De-identification and file standardization are performed on local computers before the data are uploaded to the cloud service.) This executable file needs to be constantly updated and distributed as vendors change their DICOM contents. In this sense, the cloud system requires a lightweight download of a pre-processing executable program and isn't completely free from distribution burdens and local computations.

This strategy also means that all de-identification is accomplished by users prior to submission of their data to the service and, therefore, the SaaS is free from HIPAA issues. However, questions remain about the unique signatures embedded in the image. We can assume that highly processed data, such as the volumes of 100 structures, are essentially anonymized data, but we can also argue that unique identifiers associated with imaging features are the purpose of the SaaS. Certainly, at some point we need to define a line where HIPAA is or isn't applicable, although such a boundary isn't entirely clear. In addition, for science as a service, it would be more beneficial for users if the HIPAA issue were handled on the server side. Another interesting strategy, which we're testing for clinical applications, is to transplant the entire cloud service behind an institutional firewall. This hybrid approach falls between a distribution and a cloud model. It is highly viable because the cloud architecture is portable and transplantation is relatively straightforward, but it loses several advantages, such as access to public supercomputing resources, and it multiplies the effort needed to maintain the servers. These issues deserve more discussion for science as a service in the future.
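To make the shape of such a local pre-processing step concrete, the sketch below reads a DICOM file, blanks obvious patient identifiers, and writes a raw intensity matrix plus a minimal, dimensions-only header. It assumes the open source pydicom and NumPy packages, and the tag list and output file names are invented for illustration; this is not the converter actually distributed with the cloud service.

```python
# Minimal sketch: local de-identification and DICOM -> raw matrix + tiny header.
# Assumes the open source pydicom and numpy packages; the tag list and output
# names are illustrative only, not the converter distributed by the service.
import json
import numpy as np
import pydicom

IDENTIFYING_TAGS = ["PatientName", "PatientID", "PatientBirthDate",
                    "PatientAddress", "ReferringPhysicianName"]

def convert(dicom_path, out_prefix):
    ds = pydicom.dcmread(dicom_path)

    # Blank obvious identifiers in the in-memory dataset (defensive; the raw
    # matrix and header written below carry no DICOM header fields anyway).
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            setattr(ds, tag, "")

    # Raw image matrix, written as 16-bit integers.
    matrix = ds.pixel_array.astype(np.int16)
    matrix.tofile(out_prefix + ".raw")

    # Header keeps only the matrix dimensions (and voxel spacing, if present).
    header = {"shape": list(matrix.shape),
              "pixel_spacing": [float(x) for x in getattr(ds, "PixelSpacing", [])]}
    with open(out_prefix + ".hdr.json", "w") as f:
        json.dump(header, f)

# Example call with hypothetical file names:
# convert("slice001.dcm", "subject0001")
```

Only the raw matrix and the dimensions-only header would then leave the local machine, which is the property that keeps identifiable information out of the cloud service.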

The SaaS model is powerful when technologies are mature: deployment of a new SaaS is, in theory, straightforward if core programs are written without relying on platform-dependent libraries. If we already have local executable files to perform certain types of image analysis, they can be implemented in a computation server and linked to a Web interface for users. However, this process does require communication between the program developers and the cloud system's developers, who must agree on exactly what the inputs and outputs are. The Web interface is then designed to meet specific needs.

During this process, however, it is important to realize that two phases in science as a service profoundly affect service design. In the first phase, users must have access to all the service software's parameters to provide scientific freedom and to maximally explore data contents, as well as to evaluate tool efficacy. This process is highly interactive and thus requires extensive user-interaction interfaces that let users visually inspect results and store intermediate files at each step. The cloud approach's performance could degrade if each interaction requires a large amount of data transfer. The Web interface also mandates modern designs to efficiently perform the frequent interactions between local and remote computers, especially for complex visualization and graphics interfaces. In the second phase, the technology matures, tool efficacy is established, and the majority of users start to use the same parameter sets and protocols. The cloud system is more efficient here, as the information transfer occurs only twice: data upload and results download. One of the frequent questions we receive is, "Can we modify the segmentation results?" Unfortunately, in our current setup, the segmentation files must be downloaded to the local computer and modified by ROI-management software on the local PC. The cloud approach requires a balance between server- and client-side uploads and downloads. We find that our downloadable executable programs such as MriStudio provide advantages in terms of physical public network separation among data storage, visualization, and memory-based computer engines, facilitating user feedback and stepwise quality control monitoring. However, the scaling arguments, the ease of software maintenance and upgrades, and large-scale computation distribution through national computing networks give the cloud solution its own distinct advantages.

In the first phase, it's important to stress that the maturation process takes place both through users' experience in testing and parameter choices and through developers' efforts to revise the software to accommodate user requests for better or newer functionalities. In this period of dynamic updates, high-level programming languages, such as Matlab and IDL, provide an ideal environment for efficient revisions. This could also facilitate open source strategies and user participation in software development. As the software becomes mature and services are solidified, the final phase of the maturation process should be tested by processing a large amount of data. For example, a simple task, such as skull-stripping, has two modes of failure. In the early phase of tool development, the tool is improved by minimizing the leakage (or removal) of brain definitions (say, 5 percent leakage of the voxels outside the brain), but in the latter phase, as tool performance improves, our interest shifts to occasional failures (5 percent of the population) that are encountered only in a large-scale analysis. At this point, the low computational efficiency of the high-level language starts to become a major obstacle due to low throughput: every time we make a minor modification, we need to wait a week to complete 100 test datasets. At some point, recoding in C++ is inevitable, which can improve computation time by as much as 10,000 times over the original, depending on the algorithms. Thinking about the nature of the cloud-based SaaS and its position in science as a service, it makes more sense to deploy software using the lower-level language. This is especially important when we utilize national resource computation facilities, because we need to make every effort to maximize the resources. One practical limitation, however, is that it isn't always easy to secure human resources to support these types of efforts. As much as we need the expertise and knowledge of trainees and faculty in academic institutes, some crucial effort is needed to develop a sophisticated cloud, and SaaS isn't a subject for academic publications.

Impact of SaaS on Data Sharing
In recent years, data sharing has become an important National Institutes of Health policy, and there are many data available in the public domain, including the Alzheimer's Disease Neuroimaging Initiative (ADNI; www.adni-info.org) for Alzheimer's disease; Pediatric Imaging, Neurocognition, and Genetics (PING; http://pingstudy.ucsd.edu) for normal pediatric data; and the National Database for Autism Research (NDAR; https://ndar.nih.gov) for autism research. What is common to these databases is the availability of raw data with which research communities can apply their own tools to extract biologically or clinically important findings. In these types of public databases, proactive plans and coordinated efforts, as well as funding, are needed to acquire data in a uniform manner, establish a database structure, gather data, and maintain them. SaaS introduces a very different perspective to database development and sharing, because a large amount of data would be submitted by users, conforming to a relatively uniform image protocol, to perform the automated image analysis. This data collection doesn't require coordinated planning or specific funding; users have motivation to acquire images of the specified types, submit their data, and gain access to the automated segmentation services. The cloud host then has opportunities to build two types of databases: users' raw images and processed data (such as the standardized anatomical feature vectors shown in Figure 1).

This indeed could be a new approach to facilitate efficient knowledge building and sharing. However, several hurdles should also be noted. First, our current SaaS has a rule to erase users' data after 60 days of storage. To retain them, we would need not only a much larger storage space but also permission from users. Storage of anatomical feature vectors, on the other hand, could be less of an issue, as they're much smaller and highly de-identified. In either case, probably the largest limitation is the availability of the associated nonimage information. In the regular data submission, users submit their images without demographic and clinical information. The resultant database would then have only anatomical features, which wouldn't be very useful for many application studies. It's relatively straightforward to build an interface to gather demographic and clinical information as part of a SaaS (see Figure 4b), but the barrier would be the extra effort required of users to compile and input them at the time of data submission. The incentive, therefore, would be building a useful database through SaaS for future data sharing. For example, if the service includes image interpretation (potential diagnoses and their likelihood) based on detailed clinical patient data, users might be willing to make the extra effort to submit additional information associated with the images. For the actual method to distribute data, we currently use a GitHub (https://github.com) repository, which has become a de facto site for data sharing. Our rich atlas resources are available through this channel.

What New Things Can We Do with the Cloud Service?
In the previous sections, we discussed the advantages and limitations of the cloud-based SaaS, highlighting differences from classical distribution models. In this section, we focus on service concepts that are only possible in the cloud platform. The key concept is "knowledge-driven" analysis.

Data Interpretation
Usually, the roles of image analysis tools are completed when the requested quantification is achieved. For example, in our MRICloud, T1-weighted anatomical images are converted to standardized 300-element vectors (volumes of 300 defined structures; Figure 6). Users are supposed to have self-inclusive data that consist of patient groups and age-matched control groups. All these data go through the same high-throughput image analysis pipelines, the 300 volumes are defined, and the volume data are statistically compared between the groups. But what if the analyzed metadata stay in the server, just like travel industries keep their customers' travel information for mass analysis? We call these 300-element feature vectors "brainprints," just like fingerprints that describe each individual's uniqueness. If these brainprints have associated clinical and demographic information, we can provide interesting knowledge-based services. The plot in Figure 6b is an example of this idea, in which age-matched control data are provided from our internal database, as well as publicly available data, such as ADNI (http://adni.loni.usc.edu) and PING (http://pingstudy.ucsd.edu). The brainprints are merely strings of numbers.

However, if age-matched control data are available, each element of the brainprint can be compared to the normal values and, for example, converted to 300 z-score values. Namely, the availability of age-matched normal data lets us interpret brainprints. As a cloud-based service, the normal database can be centrally managed, enriched, and utilized in real time. By extending this idea, it's possible to perform pattern matching of the brainprints to identify past cases with similar anatomical features and provide reports of population statistics about the diagnosis and prognosis of the identified cases.2,9,10 This, however, is possible only if the cloud database contains a vast amount of patient data and each brainprint is associated with rich clinical information. In the past, there were efforts to establish centralized image archives for various diseases. In fact, the clinical PACS in each hospital stores tens of thousands of images and a great deal of clinical information. The aforementioned data interpretation services weren't available, not because of a lack of data but because the data couldn't be effectively utilized. These data are fragmented, and they remain in "high-dimensional" raw image formats that aren't suitable for downstream analysis, such as feature extraction and comparison. Conversion to low-dimensional representations of the images, such as brainprints, remains a bottleneck to increasing the fluidity of data usage, but cloud-based operations could be the key to opening this bottleneck.

The limitations of the data interpretation service should also be stressed. The normal or pathological data stored in the cloud database are external to the user data, meaning they're most likely acquired under different imaging protocols. It is, therefore, essential that image analysis tools are robust against a reasonable range of protocol variability that could be encountered in research and clinical data. For example, the ADNI database contains data with two magnetic fields (1.5 and 3.0 Tesla) and three manufacturers. If these data with six different protocols are processed together in our pipeline, the age-dependent anatomical changes have been shown to have a much larger effect size compared to protocol-associated bias.11 However, it's reasonable to assume that if we're interested in pathological effects that are much smaller than age effects, the users' own control data would be needed to minimize the protocol impacts and maximize the study's sensitivity.
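As a hedged sketch of what this interpretation step can look like, the code below z-scores a brainprint (a vector of structure volumes) against age-matched control statistics and flags values beyond three standard deviations, as in the color coding of Figure 6b. The structure names and all numbers are invented for illustration and are not MRICloud output.

```python
# Minimal sketch: interpret a "brainprint" by z-scoring each structure volume
# against age-matched control statistics. All names and numbers are illustrative.
import numpy as np

def brainprint_zscores(volumes, control_mean, control_std):
    """volumes, control_mean, control_std: 1D arrays, one entry per structure."""
    volumes = np.asarray(volumes, dtype=float)
    return (volumes - np.asarray(control_mean)) / np.asarray(control_std)

# Hypothetical 3-structure brainprint (mm^3), compared against controls
# from the same age bin.
structures   = ["hippocampus_L", "hippocampus_R", "lateral_ventricle_L"]
subject_vol  = [2950.0, 3050.0, 21000.0]
control_mean = [3600.0, 3650.0, 14000.0]
control_std  = [350.0, 340.0, 2500.0]

z = brainprint_zscores(subject_vol, control_mean, control_std)
for name, score in zip(structures, z):
    flag = "outside +/-3 SD" if abs(score) > 3 else "within +/-3 SD"
    print(f"{name:22s} z = {score:+.2f}  ({flag})")
```

Because the control statistics live on the server side, refreshing them as the control database grows changes the interpretation of every brainprint without any action by the user, which is the point made above.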

Multiatlas-Based Analysis
The multiatlas analysis is another example of the knowledge-based approach that benefits greatly from a cloud architecture. To highlight this point, Figure 7 shows the concept of atlas-based image segmentation. For a computer algorithm to define structures of interest, a teaching file is needed that defines each structure's location, shape, and intensity features. This teaching file is called the atlas. The simplest form is a single-subject atlas, in which various structures are defined based on one person's anatomy. This atlas can be warped to individual patient images, and the boundary definitions can be transferred to those images.12–15 This approach is, however, not accurate if the atlas-to-subject image registration is not perfect. In particular, it is difficult to perform perfect image matching for brain regions with high cross-subject variability, such as cortical areas.

In more advanced approaches, probabilistic atlases were created from population data, in which each voxel contains the location and intensity probability of the structural labels.16,17 For example, the location probability of a given voxel near the brain surface could be 33 percent white matter, 33 percent cortex, and 34 percent cerebrospinal fluid (CSF) based on the probabilistic atlas. The atlas also teaches that the average intensities in T1-weighted images (after intensity normalization) are, for example, 211 ± 19 for white matter, 142 ± 31 for gray matter, and 81 ± 11 for CSF. If a patient voxel has an intensity of 154, it will probably be assigned to the gray matter. In this way, the probabilistic atlas could teach an algorithm about the anatomical signatures (locations and intensities) of each structure label such that the best labeling accuracy can be achieved. In the multiatlas framework, the process by which a probabilistic map is created is omitted and multiple atlases are directly registered to the patient image, followed by an arbitration process.18–20 This process opens up many new possibilities for knowledge-based image analysis.

Figure 7. Evolution of atlas-based brain segmentation approaches. In a single-atlas-based approach, only one atlas is warped to the target image and, at the same time, transfers its presegmented labels to the target image. In the probabilistic atlas-based approach, multiple atlases are warped to the target image, and a probabilistic map is generated by averaging the label definitions from all atlases; image intensity information can be incorporated to determine the final segmentation. The multiatlas-based approach also warps multiple atlases to the target image, but employs arbitration algorithms (typically, weighting and fusion) to combine the multiple atlas labels to generate the final segmentation.
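Using the numbers quoted above, the small sketch below shows one naive way a probabilistic atlas's location prior can be combined with a Gaussian intensity model to label a voxel of intensity 154 as gray matter. It is a toy illustration of the idea, not the segmentation algorithm used by the service.

```python
# Minimal sketch: combine a probabilistic atlas's location prior with a Gaussian
# intensity model to label one voxel, using the example numbers from the text.
# This is a toy illustration, not the algorithm used by the segmentation service.
import math

labels = {
    # name: (location prior, mean T1 intensity, standard deviation)
    "white matter": (0.33, 211.0, 19.0),
    "gray matter":  (0.33, 142.0, 31.0),
    "CSF":          (0.34,  81.0, 11.0),
}

def gaussian(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def label_voxel(intensity):
    # Unnormalized posterior: location prior times intensity likelihood.
    scores = {name: prior * gaussian(intensity, mean, std)
              for name, (prior, mean, std) in labels.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

posterior = label_voxel(154.0)
for name, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {p:.3f}")
# Gray matter dominates, matching the assignment described in the text.
```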

For example, in the multiatlas framework, the atlas library can be enriched, altered, or revised easily without creating a population-averaged atlas. The appropriate atlases can be dynamically chosen from a library. The criteria for appropriateness could be nonimage attributes, such as age, gender, race, or diagnosis.21 Image-based attributes, such as intensity, shape, or the amount of atrophy, could be used to determine the contributions of the atlases.12,22,23 By extending this notion, the selection of images from an atlas library can evolve into context-based image retrieval (CBIR),24,25 and if the library is sufficiently large, with various pathological cases and rich clinical information, statistics about the retrieved images, such as diagnosis information, could be generated.

While multiatlas-based analysis provides interesting new research frontiers, it also poses unique challenges. First, the algorithm is CPU-intensive. For segmentation and mapping based on a single atlas or a probabilistic atlas, image registration is required only once. However, for 30 atlases, the registration has to be repeated 30 times, followed by another CPU-intensive arbitration process. If the user chooses to select a subset of "appropriate" atlases from the 300-atlas library, further calculation would be needed. This implies the issue of content management for atlas libraries. Because the libraries of data sources are dynamically evolving in quantity and quality with frequent updates, it isn't realistic to distribute the entire library to every user and provide version management. The cloud-based approach provides a high-performance computation environment and centralized management of the atlas libraries,26 therefore enabling advanced multiatlas technologies and applications.

Linkage of Services
The cloud-based SaaS provides unique platforms to link different types of service tools. Many researchers in image analysis communities often make their own programs to analyze their data or assist in the interpretation of MR scans. Many of these tools are highly valuable for these communities, and developers are willing to share them. However, for their programs to be widely adopted, they need to develop user interfaces, distribution channels, and user management systems, such as registration and communications. Based on our experience in developing the MriStudio software family, we know how time-consuming it is to develop new programs as stand-alone software. In the cloud platform, the addition of new SaaSs is straightforward as long as the tools are mature enough to follow the API, which is nothing more than defining input and output parameters. Then, the developers can enjoy the existing infrastructures for super-computation resources, processing status management, data upload/download functions, and user management, such as registration and notifications. This kind of expandability doesn't have to rely on a specific cloud platform like MRICloud, because other developers can create their own cloud platforms and access our SaaS without going through our cloud interface. This has an important implication for future extensions of medical imaging informatics.

In the past, attempts have been made to integrate results from multiple contrasts, multiple imaging modalities, and multiple medical records. These integrative analyses have, however, been hampered by the need to ensure that data from each modality have already been standardized, quantified, and dimension-reduced. If we use the analogy of building a house, a cloud platform such as MRICloud serves as one of the foundations on which to build vertical columns that correspond to each SaaS. If we come up with a new image analysis tool, it can be integrated into one of the cloud foundations as a new service column. In this context, the cloud platform's role is to provide an environment in which to readily establish new columns. The real power of the cloud strategy is then materialized when a "horizontal service" (corresponding to the roof of the house in this analogy) emerges, spanning not only multiple service columns but also multiple cloud foundations.

In the field of medical records, there are high expectations for the integration of big data associated with available medical records to create a knowledge database and provide personalized medicine through the comparison of the features of individual patients to the knowledge database. This is a typical example of the horizontal service, but if we open the electronic health records currently available in each hospital, we soon realize that the data aren't standardized, structured, quantitative, consistent, or cohesive. The integrative analysis by a horizontal service would become prohibitively difficult if, for example, one aspect of the data were a raw MR image that didn't specify where the brain is within the 8 million voxels (200 × 200 × 200 image dimension). The success of the horizontal services, therefore, hinges on the proliferation of high-quality vertical services. This is somewhat akin to integrative travel services, such as Orbitz, Expedia, and

Booking.com, which rely on reservation SaaSs for each hotel, airline, or rental car company. Vertical SaaSs, established in many different medical applications, have the potential to be linked via third-party horizontal services to perform higher-order integrative analysis, which, in the future, could realize new medical informatics that we haven't yet imagined.

The architecture of our cloud platform allows for powerful computational resources beyond traditional software packages and also facilitates the future development of image analysis functions and the incorporation of new services. We are currently working on making new services associated with arterial spin labeling and functional MRI available through MRICloud.

Acknowledgments
This publication was made possible by the following grants: P41EB015909 (MIM, MS), R01EB017638 (MIM), and R01NS084957 (MS). A potential conflict of interest is that MS and MIM own AnatomyWorks, with MS serving as its CEO. This arrangement is being managed by the Johns Hopkins University in accordance with its conflict of interest policies.

References
1. H.Y. Jiang et al., "DtiStudio: Resource Program for Diffusion Tensor Computation and Fiber Bundle Tracking," Computer Methods and Programs in Biomedicine, vol. 81, no. 2, 2006, pp. 106–116.
2. M.I. Miller et al., "High-Throughput Neuro-imaging Informatics," Front Neuroinform, vol. 7, 2013, p. 31.
3. M.I. Miller, A. Trouve, and Y. Younes, "Diffeomorphometry and Geodesic Positioning Systems for Human Anatomy," Technology, vol. 2, 2013; http://dx.doi.org/10.1142/S2339547814500010.
4. M.I. Miller et al., "Increasing the Power of Functional Maps of the Medial Temporal Lobe by Using Large Deformation Diffeomorphic Metric Mapping," Proc. Nat'l Academy of Sciences USA, vol. 102, no. 27, 2005, pp. 9685–9690.
5. M.F. Beg et al., "Computing Large Deformation Metric Mapping via Geodesic Flows of Diffeomorphisms," Int'l J. Computer Vision, vol. 61, 2005, pp. 139–157.
6. A.V. Faria et al., "Atlas-Based Analysis of Resting-State Functional Connectivity: Evaluation for Reproducibility and Multi-modal Anatomy-Function Correlation Studies," NeuroImage, vol. 61, no. 3, 2012, pp. 613–621.
7. K. Oishi et al., "Atlas-Based Whole Brain White Matter Analysis Using Large Deformation Diffeomorphic Metric Mapping: Application to Normal Elderly and Alzheimer's Disease Participants," NeuroImage, 19 Jan. 2009, pp. 486–499.
8. A. Djamanakova et al., "Tools for Multiple Granularity Analysis of Brain MRI Data for Individualized Image Analysis," NeuroImage, vol. 101, 2014, pp. 168–176.
9. A.V. Faria et al., "Content-Based Image Retrieval for Brain MRI: An Image-Searching Engine and Population-Based Analysis to Utilize Past Clinical Data for Future Diagnosis," NeuroImage: Clinical, vol. 7, 2015, pp. 367–376.
10. S. Mori et al., "Atlas-Based Neuroinformatics via MRI: Harnessing Information from Past Clinical Cases and Quantitative Image Analysis for Patient Care," Ann. Rev. Biomedical Eng., vol. 15, 2013, pp. 71–92.
11. Z. Liang et al., "Evaluation of Cross-Protocol Stability of a Fully Automated Brain Multi-atlas Parcellation Tool," PLoS One, vol. 10, no. 7, 2015, article no. e0133533.
12. T. Rohlfing et al., "Evaluation of Atlas Selection Strategies for Atlas-Based Image Segmentation with Application to Confocal Microscopy Images of Bee Brains," NeuroImage, vol. 21, no. 4, 2004, pp. 1428–1442.
13. B. Fischl et al., "Whole Brain Segmentation: Automated Labeling of Neuroanatomical Structures in the Human Brain," Neuron, vol. 33, no. 3, 2002, pp. 341–355.
14. D.L. Collins et al., "Automatic 3D Model-Based Neuroanatomical Segmentation," Human Brain Mapping, vol. 3, no. 3, 1995, pp. 190–208.
15. M.I. Miller et al., "Mathematical Textbook of Deformable Neuroanatomies," Proc. Nat'l Academy of Sciences USA, vol. 90, no. 24, 1993, pp. 11944–11948.
16. B. Fischl et al., "Automatically Parcellating the Human Cerebral Cortex," Cerebral Cortex, vol. 14, no. 1, 2004, pp. 11–22.
17. D.W. Shattuck et al., "Construction of a 3D Probabilistic Atlas of Human Cortical Structures," NeuroImage, vol. 39, no. 3, 2008, pp. 1064–1080.
18. R.A. Heckemann et al., "Automatic Anatomical Brain MRI Segmentation Combining Label Propagation and Decision Fusion," NeuroImage, vol. 33, no. 1, 2006, pp. 115–126.
19. X. Artaechevarria, A. Muñoz-Barrutia, and C. Ortiz-de-Solorzano, "Combination Strategies in Multi-atlas Image Segmentation: Application to Brain MR Data," IEEE Trans. Medical Imaging, vol. 28, no. 8, 2009, pp. 1266–1277.
20. J.M.P. Lotjonen et al., "Fast and Robust Multi-atlas Segmentation of Brain Magnetic Resonance Images," NeuroImage, vol. 49, no. 3, 2010, pp. 2352–2365.

21. P. Aljabar et al., "Multi-atlas Based Segmentation of Brain Images: Atlas Selection and Its Effect on Accuracy," NeuroImage, vol. 46, no. 3, 2009, pp. 726–738.
22. F. Maes et al., "Multimodality Image Registration by Maximization of Mutual Information," IEEE Trans. Medical Imaging, vol. 16, no. 2, 1997, pp. 187–198.
23. M. Wu et al., "Optimum Template Selection for Atlas-Based Segmentation," NeuroImage, vol. 34, no. 4, 2007, pp. 1612–1618.
24. W. Hsu et al., "Context-Based Electronic Health Record: Toward Patient Specific Healthcare," IEEE Trans. Information Technology in Biomedicine, vol. 16, no. 2, 2012, pp. 228–234.
25. H. Müller et al., "A Review of Content-Based Image Retrieval Systems in Medical Applications—Clinical Benefits and Future Directions," Int'l J. Medical Informatics, vol. 73, no. 1, 2004, pp. 1–23.
26. D. Wu et al., "Resource Atlases for Multi-Atlas Brain Segmentations with Multiple Ontology Levels Based on T1-Weighted MRI," NeuroImage, vol. 125, no. 10, 2015, pp. 120–130.

Susumu Mori is a professor in the Department of Radiology at Johns Hopkins University School of Medicine. His research interest is developing new technologies for brain MRI data acquisition and analysis. Mori has a PhD in biophysics from Johns Hopkins University School of Medicine. He's a Fellow of the International Society of Magnetic Resonance in Medicine. Contact him at smori1@jhu.edu.

Dan Wu is a research associate in the Department of Radiology at Johns Hopkins University School of Medicine. Her research interests include advanced neuroimaging and quantitative brain MRI analysis, especially atlas-based neuroinformatics for clinical data analysis. Wu has a PhD in biomedical engineering from Johns Hopkins University. She's a Junior Fellow of the International Society of Magnetic Resonance in Medicine. Contact her at dwu18@jhu.edu.

Can Ceritoglu is a research scientist and software engineer in the Center for Imaging Science at Johns Hopkins University. His research interests include medical image processing. Ceritoglu has a PhD in electrical and computer engineering from Johns Hopkins University. Contact him at can@cis.jhu.edu.

Yue Li is an engineer at AnatomyWorks. His research interests include medical image analysis, visualization, high-performance computing, cloud-based medical image solutions, and magnetic resonance imaging. Li has a PhD in biomedical engineering from Johns Hopkins University School of Medicine. Contact him at yueli.bme@gmail.com.

Anthony Kolasny is an IT architect at Johns Hopkins University's Center for Imaging Science. His research interests include high-performance computing, and he is the JHU XSEDE Campus Champion. Kolasny has an MS in computer science from Johns Hopkins University. He's a professional member of the Society for Neuroscience, Usenix, and ACM. Contact him at akolasny@cis.jhu.edu.

Marc A. Vaillant is president and CTO of Animetrics, a software company that provides facial recognition solutions to law enforcement, government, and commercial markets. His research interests include computational anatomy in the brain sciences and machine learning. Vaillant has a PhD in biomedical engineering from Johns Hopkins University. Contact him at vaillant@animetrics.com.

Andreia V. Faria is a radiologist and an assistant professor in the Department of Radiology at Johns Hopkins University School of Medicine. Her interests include the development, improvement, and application of techniques to study normal brain development and aging, as well as pathological models. Faria has a PhD in neurosciences from the State University of Campinas. Contact her at afaria1@jhmi.edu.

Kenichi Oishi is an associate professor in the Department of Radiology at Johns Hopkins University School of Medicine. His research interests include multimodal brain atlases and applied atlas-based image recognition and feature extraction methods for various neurological diseases. Oishi has an MD in medicine and a PhD in neuroscience from Kobe University School of Medicine in Japan. Contact him at koishi@mri.jhu.edu.

Michael I. Miller is the Herschel Seder Professor of Biomedical Engineering and director of the Center for Imaging Science at Johns Hopkins University. He has been influential in pioneering the field of computational anatomy, focused on the study of the shape, form, and connectivity of human anatomy at the morpheme scale. Miller has a PhD in biomedical engineering from Johns Hopkins University. Contact him at mim@cis.jhu.edu.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data

Raimond L. Winslow, Stephen Granite, and Christian Jurado | Johns Hopkins University

The electrocardiogram (ECG) is the most commonly collected data in cardiovascular research
because of the ease with which it can be measured and because changes in ECG waveforms
reflect underlying aspects of heart disease. Accessed through a browser, WaveformECG is an open
source platform supporting interactive analysis, visualization, and annotation of ECGs.

The electrocardiogram (ECG) is a measurement of time-varying changes in body surface potentials produced by the heart's underlying electrical activity. It's the most commonly collected data in heart disease research. This is because changes in ECG waveforms reflect underlying aspects of heart disease such as intraventricular conduction, depolarization, and repolarization disturbances,1,2 coronary artery disease,3 and structural remodeling.4 Many studies have investigated the use of different ECG features to predict the risk of coronary events such as arrhythmia and sudden cardiac death; however, it remains an open challenge to identify markers that are both sensitive and specific.

Many different commercial vendors have developed information systems that accept, store, and analyze ECGs acquired via local monitors. The challenge in applying these systems in clinical research is that they're closed: they don't provide APIs by which other software systems can query and access their stored digital ECG waveforms for further analyses, nor the means for adding and testing novel data-processing algorithms. They're designed for use in patient care rather than for clinical research. Despite the ubiquity of ECGs in cardiac clinical research, there are no open, noncommercial platforms for interactive management, sharing, and analysis of these data. We developed WaveformECG to address this unmet need.

WaveformECG is a Web-based tool for managing and analyzing ECG data, developed as part of the CardioVascular Research Grid (CVRG) project funded by the US National Institutes of Health's National


Heart, Lung, and Blood Institute.5 Users can browse their files and upload ECG data in a variety of vendor formats for storage. WaveformECG extracts and stores ECGs as time series; once data are uploaded, a browser can select, view, and scroll through individual digital ECG lead signals. Points and time intervals in ECG waveforms can be annotated using ontology from the Bioportal ontology server operated by the National Center for Biomedical Ontology (NCBO),6 and annotations are stored with the waveforms for later retrieval, enabling features of interest to be marked and saved for others. Users can select groups of ECGs for computational analysis via multiple algorithms, and analyses can be distributed across multiple CPUs to decrease processing time. WaveformECG has also been integrated with the Informatics for Integrating Biology and the Bedside (I2B2) clinical data warehouse system.7 This bidirectional coupling lets users define study cohorts within I2B2, analyze ECGs within WaveformECG, and then store analysis results within I2B2.

WaveformECG has been used by hundreds of investigators in several large longitudinal studies of heart disease, including the Multi-ethnic Study of Atherosclerosis (MESA),8 the Coronary Artery Disease Risk in Young Adults (CARDIA),9 and the Prospective Observational Study of Implantable Cardioverter Defibrillators (PROSE-ICD)10 studies. A public demo version is available for use through the CVRG Portal. All software developed in this effort is available on GitHub, under the CVRG project open source repository, with an Apache 2.0 license. Instructions for deployment of the WaveformECG tool are available on the CVRG wiki.

WaveformECG System Architecture
WaveformECG is accessed via a portal developed using Liferay Portal Community Edition, an open source product that lets developers build portals as an assembly of cascading style sheets (CSS), webpages, and portlets (see Figure 1).11 Liferay was extended to use Globus Nexus, a federated identity provider that's part of the Globus Online family of services,12 for authentication and authorization. Users log in to WaveformECG with their Globus credentials or credentials from other identity providers (such as InCommon or Google) linked with their Globus identity. Following authentication, users can access four separate portlet interfaces: upload, visualize, analyze, and download. We developed several back-end libraries and tools supporting these interfaces that enable storage, retrieval, and analysis of ECG time-series data, metadata, and their annotations.

Figure 1. Platform architecture. Users authenticate to WaveformECG via Globus Nexus to upload, visualize, analyze, and download ECG data, analysis results, and annotations. To accomplish this, WaveformECG makes use of Java libraries and Web services that provide access to data, metadata, analysis results, annotations, and data analysis algorithms.

The upload and visualize portlets utilize an open source distributed storage system running on Apache Hadoop and HBase known as the open time-series database (OpenTSDB),13 a database system optimized for storage of streaming time-series data. OpenTSDB sits on top of Apache HBase,14 an open source nonrelational, distributed (NoSQL) database developed as part of the Apache Software Foundation Hadoop project (https://hadoop.apache.org). Apache Zookeeper (https://zookeeper.apache.org) serves as the synchronization and naming registry for deployment. This configuration allows HBase to be deployed across multiple servers and to be scaled to accommodate high-speed, real-time read/write access of massive datasets, an important consideration because WaveformECG is being extended to accept real-time ECG data streams from patient monitors. Reported ingest rates for OpenTSDB in many different applications range from 1 to 100 million points per second. OpenTSDB defines a time series as a set of time-value pairs and assigns a unique time-series identifier (TSUID) to each time series. OpenTSDB also supports execution of aggregation functions on query results (a query result is a read of time-series data). Examples of aggregators include calculations of sums, averages, max-min values, statistics, and custom functions.

OpenTSDB provides access to its storage and retrieval mechanisms via RESTful APIs.15 With this capability, other software systems can query OpenTSDB to retrieve ECG datasets. The open source relational database system PostgreSQL16 maintains file-related information and other metadata. PostgreSQL is also used for portal content management (user identities, portal configuration, the Liferay document and media library [LDML], and so on), storage of all uploaded ECG data files in their native format, and other ECG metadata (sampling rate, length of data capture, subject identifier, and so on).

Data Upload and Management
WaveformECG can import ECG data in several different vendor formats, including Philips XML 1.03/1.04, HL7aECG, Schiller XML, GE Muse/Muse XML 7+, Norav Raw Data (RDT), and the WaveForm DataBase (WFDB) format used in the Physionet Project.17 In addition to the ECG time series, Philips, Schiller, and GE Muse XML files also contain results from execution of vendor-specific proprietary ECG analysis algorithms, metrics on signal quality, and other data. These data are also extracted and stored.

Figure 2. Upload portal. (a) The listing on the left shows that the user has created the patient006 folder, into which data will be uploaded. Datasets under "my subjects" are owned by the user. Folders group datasets by subjects, and progress bars next to the file names in the center of the screen show progress on the upload of each file to the server. The background queue on the right provides users with a real-time update of progress on dataset processing. (b) The upload processing flow consists of five parts: server upload ("wait"); storage in LDML and parsing file data ("transferred"); transfer of time-series data and analysis results to OpenTSDB ("written"); transfer of metadata to PostgreSQL ("annotated"); and completion ("done").

Within the upload interface (Figure 2a), users can browse their file system to locate folders containing ECG data. Files are selected for upload by clicking the "choose" button or by dragging and dropping files into the central display area (Figure 2a). Clicking the "upload all" button initiates transfer of data from the user's file system to WaveformECG. WaveformECG automatically determines each file's format and follows a multistep procedure for storing and retrieving data (Figure 2b). Completion of these steps is used as an indicator of progress. Progress information is displayed in the right-most portion of the upload interface, under the "background queue" tab. In the first step, the system checks to make sure that all required files have transferred from the local source to the host. While most formats have only one file per dataset, some formats split information across multiple files. Figure 2a shows this for s0027lre, a WFDB-format ECG dataset. s0027lre's data is packaged in three different files, with dataset metadata in the header (.hea) file and time-series data in the other two (.dat and .xyz).
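For readers unfamiliar with this multi-file WFDB layout, the short sketch below reads such a record with the open source wfdb Python package (an assumption of this illustration; it is not part of WaveformECG) and exposes each lead as a column of samples.

```python
# Minimal sketch: read a multi-file WFDB record (for example, s0027lre with its
# .hea, .dat, and .xyz files) using the open source wfdb Python package.
# Illustrative only; the package is an assumption here, not part of WaveformECG.
import wfdb

# rdrecord parses the .hea header and loads the signal files it references.
record = wfdb.rdrecord("s0027lre")      # path to the record, without extension

print("leads:", record.sig_name)        # lead names stored in the header
print("sampling rate (Hz):", record.fs)
print("samples x leads:", record.p_signal.shape)

# Each column of p_signal is one lead's time series, ready to be written into a
# time-series store as (timestamp, value) pairs.
first_lead = record.p_signal[:, 0]
```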

In this example, WaveformECG has fully received the .hea file, but the .dat and .xyz file transfers are still in progress. The progress bar for dataset s0027lre is empty, and the phase column of the background queue displays "wait" because these data files are still being transferred. Once each ECG file transfer to the service is complete, the files are stored in their native format in the LDML. Files at this stage of the workflow have a progress bar at 40 percent completion, with "transferred" displayed in the "background queue" area. The folder structure within the LDML corresponds with that of the folders created by the user in the upload interface. WaveformECG displays this folder structure on all screens where users interact directly with their uploaded files.

Once transfer is complete, WaveformECG spawns a separate process to extract each ECG time series for storage in OpenTSDB. A single ECG file contains signals from multiple leads, and a time series for each lead signal is extracted and labeled with a unique TSUID. Once this step is complete for all leads of an ECG waveform, the progress bar moves to 60 percent completion, with "written" displayed in the "background queue" column.

Following completion of writing, another background process is spawned to extract ECG waveform analysis results from the files. Each result is labeled with an appropriate ontology term, selected from the NCBO Bioportal Electrocardiography Ontology (http://purl.bioontology.org/ontology/ECG), by storing the ontology ID along with the result. WaveformECG bundles this information, along with the subject identifier, the format of the uploaded ECG dataset, and the start time of the time series itself, and writes the labeled analysis results into OpenTSDB. Once this is completed, the progress bar moves to 80 percent, with "annotated" displayed in the "background queue" phase column.

WaveformECG must be able to maintain a connection with the original uploaded ECG files, the stored time-series data, file metadata, analysis results, and manual annotations made to ECG waveforms. To do this, the OpenTSDB TSUID is stored in PostgreSQL. Once this is done, the progress bar moves to 100 percent, and "done" is displayed in the "background queue" phase column.

Figure 3. Analysis portal. (a) Three datasets with different formats were selected for processing by multiple algorithms. The background queue shows progress in data processing: two datasets have each been processed using eight algorithms, while the third has completed processing by seven. (b) In the analysis process, data and algorithm selection follow a step-by-step workflow.

Data Analysis
ECG analysis algorithms are made available for use in WaveformECG as Web services. The analyze interface (Figure 3a) uses libraries for Web service implementations of ECG analysis algorithms accessed through Apache Axis2.18

Analysis Web services are developed by using Axis2 for communicating with the compiled version

of the analysis algorithm. Axis2 is an open source XML-based framework that provides APIs for generating and deploying Web services. It runs on an Apache Tomcat server, rendering it operating-platform-independent. Algorithms developed using the interpreted language Matlab can be compiled using Matlab Compiler (www.mathworks.com/products/compiler/mcr/?requestedDomain=www.mathworks.com) and executed in Matlab Runtime, a stand-alone set of shared libraries that enables the execution of compiled Matlab applications or components on computers that don't have Matlab installed. An XML file is developed that defines the service, the commands it will accept, and the acceptable values to pass to it. In a separate administrative portion of the analyze interface, a tool allows administrators to easily add algorithms implemented as Web services to the system. Upon entry of the proper algorithm details and parameter information, WaveformECG can invoke an algorithm that the administrator has deployed. This approach simplifies the process of adding new algorithms to WaveformECG.

Figure 3a shows the analyze interface and Figure 3b shows the associated processing steps. Users select files or folders from the file chooser on the left; multiple files can be dragged and dropped into the central pane. Placing a file in that pane makes that file available for analysis by one or more algorithms, listed in the bottom center pane. Clicking the checkbox on an algorithm entry instructs the system to analyze the selected files with that algorithm. The checkbox at the top of the algorithm list allows a user to toggle the selection of all the available algorithms. All available algorithms have default settings; some have parameters that can be set via the "options" button, but all parameters set for an algorithm will be applied to all files to be analyzed. Upon selection of the files to be processed and the algorithms with which to process them, the user clicks the "start analysis" button, which creates a thread to handle the processing. The thread dispatches a RESTful call to OpenTSDB to retrieve all the data requested. Depending on the algorithms chosen, the thread writes the data into the necessary formats required by the algorithms (for example, algorithms from the PhysioToolkit19 require that ECG data be in the WFDB file format). The thread then invokes the requested algorithms on the requested data. As long as the analyze screen remains open, the background queue will be updated, incrementing the number of algorithms that have finished processing. Upon completion of all selected algorithms for one file, the phase will update to "done" in the background queue.
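The following sketch mimics that hand-off for a single lead: it writes the retrieved samples as a WFDB record and invokes a PhysioToolkit QRS detector on it. It assumes the open source wfdb Python package and a locally installed sqrs command; the record and lead names are invented for illustration, and this is not the WaveformECG thread implementation.

```python
# Minimal sketch of the hand-off described above for a single lead: write the
# retrieved samples as a WFDB record, then invoke a PhysioToolkit QRS detector.
# Assumes the open source wfdb Python package and a locally installed sqrs
# command; record and lead names are illustrative, not WaveformECG's own.
import subprocess
import numpy as np
import wfdb

def detect_qrs(samples_uv, fs, record_name="job0001"):
    # 1. Convert microvolts to millivolts and write a one-signal WFDB record
    #    (produces record_name.hea and record_name.dat in the working directory).
    signal = np.asarray(samples_uv, dtype=float).reshape(-1, 1) / 1000.0
    wfdb.wrsamp(record_name, fs=fs, units=["mV"], sig_name=["II"],
                p_signal=signal, fmt=["16"])

    # 2. Run the compiled detector; it writes a record_name.qrs annotation file.
    subprocess.run(["sqrs", "-r", record_name], check=True)

    # 3. Read the detected QRS annotations back as sample indices for storage.
    annotation = wfdb.rdann(record_name, "qrs")
    return annotation.sample

# Usage (with samples previously retrieved from the time-series store):
#   qrs_samples = detect_qrs(lead_ii_microvolts, fs=500)
```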

Table 1. Algorithm listing for the analyze interface.

Algorithm name | Developer | Purpose/dependencies
Rdsamp | Physionet | Converts the Waveform Database (WFDB) ECG format to human-readable format
Sigamp | Physionet | Measures signal amplitudes of a WFDB record
sqrs/sqrs2csv | Physionet | Detects onset and offset times of the QRS complex in single leads; second implementation produces output in CSV format
wqrs/wqrs2csv | Physionet | Detects onset and offset times of the QRS complex in single leads using the length transform; second implementation produces output in CSV format
ihr (sqrs & wqrs implementation) | Physionet | Computes successive RR intervals (instantaneous heart rate); requires input from sqrs or wqrs
pNNx (sqrs & wqrs implementation) | Physionet | Calculates time domain measures of heart rate variability; requires input from sqrs or wqrs
QT Screening | Yuri Chesnokov and colleagues19 | Detects successive QT intervals based upon high- or low-pass filtering of ECG waveforms; works with data in WFDB format
QRS-Score | David Strauss and colleagues20 | Produces the Strauss-Selvester QRS arrhythmia risk score based on certain criteria derived from GE MUSE analysis
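To give a flavor of how the Physionet entries in Table 1 can be driven programmatically, here's a small sketch that shells out to the WFDB sqrs detector and reads its annotations back with rdann; the record name is hypothetical, and this is not WaveformECG's own wrapper code.

# Sketch: run the WFDB 'sqrs' QRS detector on a record and dump its
# annotations with 'rdann'. Assumes the WFDB applications are installed
# and that "100" names a WFDB record in the current directory.
import subprocess

record = "100"                      # hypothetical WFDB record name

# sqrs writes an annotation file <record>.qrs alongside the record.
subprocess.run(["sqrs", "-r", record], check=True)

# rdann prints one line per annotation: time, sample number, type, ...
result = subprocess.run(["rdann", "-r", record, "-a", "qrs"],
                        check=True, capture_output=True, text=True)
for line in result.stdout.splitlines()[:5]:
    print(line)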

Onset of the Q wave corresponds to onset of depolarization of the cardiac interventricular septum. The R and S waves correspond to depolarization of the remainder of the cardiac ventricles and Purkinje fibers, respectively. Ventricular activation time is defined as the time interval between onset of the Q wave and the peak of the R wave. The T wave corresponds to repolarization of the ventricles to the resting state. The time interval between onset of the Q wave and completion of the T wave is known as the QT interval and represents the amount of time over which the heart is partially or fully electrically excited over the cardiac cycle. The time interval between successive R peaks is the instantaneous heart rate. Abnormalities of the shape, amplitude, and other features of these waves and intervals can reflect underlying heart disease, and there has been considerable effort in developing algorithms that can be used to automatically analyze ECGs to detect these peaks and intervals. Table 1 lists the algorithms available in the current release of WaveformECG.20,21

Data Visualization
The visualize interface lets users examine and interact with stored ECG data. This feature also provides a mechanism for manually annotating waveforms. When the user selects a file to view in the visualization screen, it initially displays the data as a series of graphs, one for each lead in the dataset (15 leads for the GE MUSE dataset shown in Figure 4). A calibration pulse with a 1-mV amplitude and 200-msec duration is displayed in the left-most panel. Initially, four leads are displayed, but additional leads can be viewed by grabbing and dragging the window scroll bar located on the right side of the browser display. Whenever the cursor is positioned within a display window, its x-y location is marked by a filled red dot, and time-amplitude values at that location are displayed at the bottom of the panel. Cursor display in all graphs is synchronized so that as the user navigates through one graph, the others update with it. The lead name and the number of annotations for that lead signal are displayed in each graph. File metadata, including subject ID, lead count, sampling rate, and ECG duration, are displayed above the graphs. WaveformECG supports scrolling through waveforms. Clicking on the "next" button at the top of the display steps through time in increments of 1.2 seconds. Users can jump to a particular time point by entering the time value into the panel labeled "jump to time (sec)."

By clicking on a lead graph, users can expand the view to see the data for that lead in detail, including any annotations that have been entered manually. A list of analysis results on the lead is displayed in a table at the left of the view, and the graph is displayed in the center right. In Figure 5a, WaveformECG displays part of the analysis results extracted from a Philips XML upload.
_____________ an open source JavaScript

Figure 5. Visualization. (a) For lead II in a 12-lead ECG in Philips format, the table under “Analysis Results” displays
the results of automated data processing by the Philips system used to collect this ECG. In the waveform graph,
A denotes a QT interval annotation, with the yellow bar representing the interval itself. This annotation was made
manually. The 1 denotes an R peak annotation, also made manually. All interval and point annotations are listed
below the graph. (b) In the manual annotation interface, the R peak is highlighted and the information in the center
shows the definition returned for that term selection. In addition, there are details about the ECG and the point at
which the annotation was made. To create these displays, the visualize interface initiates a RESTful call to OpenTSDB
to retrieve the first 2.5 seconds of time-series data associated with all the leads in the file. Dygraphs, an open source
JavaScript charting library, generates each of the graphs displayed.

While requesting the analysis results displayed, the visualize interface also checks the data originally returned from OpenTSDB to see if any annotations exist for the time frame displayed and, if so, Dygraphs (http://dygraphs.com), an open source JavaScript charting library, renders the annotation display on the screen. Figure 5a shows examples of the two types of supported waveform annotations: point annotations are associated with a specific ECG time-value pair—in this case, the time-value pair corresponding to the peak of the R wave, labeled with the number 1—and interval annotations are associated with a particular time interval. The user can scroll through the individual lead data using the slider control at the bottom of the display or the navigation buttons. There's also a feature to jump to a specific point in time. Zooming can be performed using the slider bar at the bottom of the screen. To restore the original view of the graph, the user can double-click on it. Manual annotations can be added by clicking in the graph screen.

Data Annotation
Users can manually annotate ECG waveforms to mark features of interest. These annotations are then stored and redisplayed along with the waveform on subsequent visualization. To create an annotation, the user selects a point on the graph or highlights an interval using the mouse. This activates a graphical interface where the user enters the specific details of the annotation (Figure 5b). The system relies on the Bioportal ontology server to select annotation terms. Figure 5b shows the end result of the R peak point annotation listed in Figure 5a, labeled with a 1 on the waveform itself. On the upper-right-hand side of the screen, users can see the time and amplitude values for the point they selected. If it's incorrect, the user can click the "chg" button. The annotation interface will then provide the user with a zoomed-in portion of the graph and a view of where the current selection is. If they so choose, users can change the point to another one and click "save." If not, they can click "cancel." The annotation interface refreshes and the onset will be updated to the chosen value.

When WaveformECG renders the annotation interface shown in Figure 5b, it dispatches a request to NCBO's Bioportal to retrieve all the root-level terms from the ECG terms view of the ECG ontology in Bioportal. As the ontology itself conforms to the standards of a basic formal ontology (BFO), the ECG terms view provides a less formal display of the terms within. A button at the end of the terms listing in the annotation screen lets users change the display to or from the BFO and terms view of the ECG ontology.

At the top of the term listing is a text box labeled "search for class," which lets a user type a search term. As typing commences, a JavaScript application developed by the NCBO provides a list of terms in the target ontology that match the typed text. The user can then select a term from that list. Upon selection of a term, the lower box in the right-center screen will update with the term and the term definition retrieved from Bioportal. The user can then enter a comment in the text field below that describes any additional information to be included with the annotation. Upon completion of term selection and comment entry, the user clicks the "save" button. In Figure 5b, this button is grayed out because this figure shows the result of clicking on an existing annotation. This lets users delve into the details of existing annotations and see any comments entered previously in the comment box for the annotation.

Integration with the I2B2 Clinical Data Warehouse
WaveformECG has been integrated with the Eureka Clinical Analytics system.22 Eureka provides Web interfaces and tools for simplifying the task of extracting, transferring, and loading clinical data into the I2B2 clinical data warehouse. The advantage of this integration is that subject cohorts identified in I2B2 can be easily sent to WaveformECG for further analysis using a newly developed Send Patient Set plug-in that communicates with WaveformECG using JavaScript Object Notation (JSON). In the example in Figure 6, we use I2B2 to query data from the MIMIC-II database,23 looking for all Asian subjects for whom there exists a measurement of cardiac output. The plug-in extracts subject IDs satisfying this particular query from I2B2 and sends them to WaveformECG as a JSON message (a hypothetical sketch of such a message appears below). WaveformECG receives and processes the JSON, creating a folder displayed in the upload interface that corresponds to the I2B2 cohort name (Monitor-Asian@14:11:45[10-29-2015]). WaveformECG creates subfolders for each of the corresponding subjects, with their subject identifiers as the folder names. The user can then upload the waveform data for each of the subjects into their respective folders (left side of Figure 6). In the example, I2B2 returned three subjects who met these criteria, but only one of the subjects had corresponding waveform data. We then uploaded that subject's data to WaveformECG and processed those waveforms with multiple analysis algorithms.

Eureka lets users define a data source and specify cohort definitions. In the case of WaveformECG, we defined OpenTSDB as the data source.

Figure 6. Integration with a clinical data warehouse. A split screen shows information sent from I2B2 (left) to WaveformECG.
In the expanded EKG annotations folder, three analysis results can be returned from WaveformECG to I2B2.

Through that definition, Eureka performed a RESTful call to OpenTSDB, searching for analysis results linked with files in the Eureka folder. Once found, analysis results along with their ECG ontology IDs are transferred to Eureka, where they are reorganized into a format acceptable for automatic loading into I2B2. A subset of those results can be seen in Figure 6 under the "EKG annotations" folder in the I2B2 "navigate terms" window.

WaveformECG Case Study
Sudden cardiac death (SCD) accounts for 200,000 to 450,000 deaths in the US annually.24 Current screening strategies fail to detect roughly 80 percent of those who die suddenly. The ideal screening method for increased risk should be simple, inexpensive, and reproducible in different settings so that it can be performed routinely in a physician's office, yet be both sensitive and specific. A recent study has shown that features computed from the 12-lead ECG known as the QRS score and QRS-T angle can be used to identify patients with fibrotic scars (determined using late-gadolinium enhancement magnetic resonance imaging) with 98 percent sensitivity and 51 percent specificity.25 Motivated by these findings, we assisted in a large-scale screening of all ECGs obtained over a six-month period at two large hospital systems. The challenges faced in this study were the large number of subjects and ECGs (~35,000) to be managed and analyzed, the use of different ECG instrumentation and thus different data formats at the two sites, and the fact that instrument vendors don't make either of the algorithms to be tested available in their systems. WaveformECG proved to be a powerful platform for supporting this study. The QRS score and QRS-T angle algorithms were implemented and deployed, making it possible for the research team to quickly select and analyze ECGs from different sites. The two ECG-based features were shown to be a useful initial method (a sensitivity of 70 percent and a specificity of 55 percent) for identifying those at risk of SCD in the population of patients having preserved left ventricular ejection fraction (LVEF > 35 percent).

Other physiological time-series data arise in many other healthcare applications. Blood pressure waveforms, peripheral capillary oxygen saturation, respiratory rate, and other physiological signals are measured from every patient in the modern hospital, particularly those in critical care settings. Currently in most hospitals, these data are "ephemeral," meaning they appear on the bedside monitor and then disappear.


These data are among the most actionable in the hospital because they reflect the patient's moment-to-moment physiological functioning. Capturing these data and understanding how they can be used along with other data from the electronic health record to more precisely inform patient interventions has the potential to significantly improve healthcare outcomes. In future work, we will extend WaveformECG to serve as a general-purpose platform for working with other types of physiological time-series data.

Acknowledgments
Development of WaveformECG was supported by the National Heart, Lung and Blood Institute through NIH R24 HL085343, NIH R01 HL103727, and as a subcontract of NIH U54HG004028 from the National Center for Biomedical Ontology.

References
1. B. Surawicz et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part III: Intraventricular Conduction Disturbances," Circulation, vol. 119, 17 Mar. 2009, pp. e235–240.
2. P.M. Rautaharju et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part IV: The ST Segment, T and U Waves, and the QT Interval," Circulation, vol. 119, 17 Mar. 2009, pp. e241–250.
3. G.S. Wagner et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part VI: Acute Ischemia/Infarction," J. Am. College Cardiology, vol. 53, 17 Mar. 2009, pp. 1003–1011.
4. E.W. Hancock et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part V: Electrocardiogram Changes Associated with Cardiac Chamber Hypertrophy," Circulation, vol. 119, 17 Mar. 2009, pp. e251–261.
5. R. Winslow et al., "The CardioVascular Research Grid (CVRG) Project," Proc. AMIA Summit on Translational Bioinformatics, 2011, pp. 77–81.
6. M.A. Musen et al., "BioPortal: Ontologies and Data Resources with the Click of a Mouse," Proc. Am. Medical Informatics Assoc. Ann. Symp., 2008, pp. 1223–1224.
7. S.N. Murphy et al., "Serving the Enterprise and Beyond with Informatics for Integrating Biology and the Bedside (I2B2)," J. Am. Medical Informatics Assoc., vol. 17, no. 2, 2010, pp. 124–130.
8. D.E. Bild et al., "Multi-ethnic Study of Atherosclerosis: Objectives and Design," Am. J. Epidemiology, vol. 156, 1 Nov. 2002, pp. 871–881.
9. E.B. Lynch et al., "Cardiovascular Disease Risk Factor Knowledge in Young Adults and 10-Year Change in Risk Factors: The Coronary Artery Risk Development in Young Adults (CARDIA) Study," Am. J. Epidemiology, vol. 164, 15 Dec. 2006, pp. 1171–1179.
10. A. Cheng et al., "Protein Biomarkers Identify Patients Unlikely to Benefit from Primary Prevention Implantable Cardioverter Defibrillators: Findings from the Prospective Observational Study of Implantable Cardioverter Defibrillators (PROSE-ICD)," Circulation: Arrhythmia and Electrophysiology, vol. 7, no. 12, 2014, pp. 1084–1091.
11. J.X. Yuan, Liferay Portal Systems Development, Packt Publishing, 2012.
12. R. Ananthakrishnan et al., "Globus Nexus: An Identity, Profile, and Group Management Platform for Science Gateways and Other Collaborative Science Applications," Proc. Int'l Conf. Cluster Computing, 2013, pp. 1–3.
13. B. Sigoure, "OpenTSDB: The Distributed, Scalable Time Series Database," Proc. Open Source Convention, 2010; http://opentsdb.net/misc/opentsdb-oscon.pdf.
14. R.C. Taylor, "An Overview of the Hadoop/MapReduce/HBase Framework and Its Current Applications in Bioinformatics," BMC Bioinformatics, vol. 11, 2010, p. S1.
15. C. Pautasso, "RESTful Web Services: Principles, Patterns, Emerging Technologies," Springer, 2014, pp. 31–51.
16. K. Douglas and S. Douglas, PostgreSQL: A Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases, SAMS Publishing, 2003.
17. G.B. Moody, R.G. Mark, and A.L. Goldberger, "PhysioNet: A Web-Based Resource for the Study of Physiologic Signals," IEEE Eng. Medicine and Biology Magazine, vol. 20, no. 3, 2001, pp. 70–75.
18. D. Jayasinghe and A. Azeez, Apache Axis2 Web Services, Packt Publishing, 2011.
19. G.B. Moody, R.G. Mark, and A.L. Goldberger, "PhysioNet: Physiologic Signals, Time Series and Related Open Source Software for Basic, Clinical, and Applied Research," Proc. Conf. IEEE Eng. Medicine and Biology Soc., vol. 2011, pp. 8327–8330.
20. Y. Chesnokov, D. Nerukh, and R. Glen, "Individually Adaptable Automatic QT Detector," Computers in Cardiology, vol. 33, 2006, pp. 337–341.
21. D.G. Strauss et al., "Screening Entire Health System ECG Databases to Identify Patients at Increased Risk of Death," Circulation: Arrhythmia and Electrophysiology, vol. 6, no. 12, 2013, pp. 1156–1162.
22. A. Post et al., "Semantic ETL into I2B2 with Eureka!," AMIA Summit Translational Science Proc., 2013, pp. 203–207.


23. M. Saeed et al., "Multiparameter Intelligent Monitoring in Intensive Care II: A Public-Access Intensive Care Unit Database," Critical Care Medicine, vol. 39, no. 5, 2011, pp. 952–960.
24. J.J. Goldberger et al., "American Heart Association/American College of Cardiology Foundation/Heart Rhythm Society Scientific Statement on Noninvasive Risk Stratification Techniques for Identifying Patients at Risk for Sudden Cardiac Death: A Scientific Statement from the American Heart Association Council on Clinical Cardiology Committee on Electrocardiography and Arrhythmias and Council on Epidemiology and Prevention," Circulation, vol. 118, 30 Sept. 2008, pp. 1497–1518.
25. D.G. Strauss et al., "ECG Quantification of Myocardial Scar in Cardiomyopathy Patients with or without Conduction Defects: Correlation with Cardiac Magnetic Resonance and Arrhythmogenesis," Circulation: Arrhythmia and Electrophysiology, vol. 1, no. 12, 2008, pp. 327–336.

Raimond L. Winslow is the Raj and Neera Singh Professor of Biomedical Engineering and director of the Institute for Computational Medicine at Johns Hopkins University. His research interests include the use of computational modeling to understand the molecular mechanisms of cardiac arrhythmias and sudden death, as well as the development of informatics technologies that provide researchers secure, seamless access to cardiovascular research study data and analysis tools. Winslow is principal investigator of the CardioVascular Research Grid Project and holds joint appointments in the departments of Electrical and Computer Engineering, Computer Science, and the Division of Health Care Information Sciences at Johns Hopkins University. He's a Fellow of the American Heart Association, the Biomedical Engineering Society, and the American Institute for Medical and Biological Engineers. Contact him at rwinslow@jhu.edu.

Stephen Granite is the director of database and software development of the Institute for Computational Medicine at Johns Hopkins University. He's also the program manager for the CardioVascular Research Grid Project. Granite has an MS in computer science with a focus in bioinformatics and an MS in business administration with a focus in competitive intelligence, both from Johns Hopkins University. Contact him at sgranite@jhu.edu.

Christian Jurado is a software engineer in the Institute for Computational Medicine at Johns Hopkins University. He's also lead developer of WaveformECG for the CardioVascular Research Grid Project. Jurado has a BS in computer science, specializing in Java Web development and Liferay. Contact him at cjurado2@jhu.edu.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


PURPOSE: The IEEE Computer Society is the world’s largest EXECUTIVE COMMITTEE
association of computing professionals and is the leading President: Roger U. Fujii
provider of technical information in the field. President-Elect: Jean-Luc Gaudiot; Past President: Thomas M. Conte;
MEMBERSHIP: Members receive the monthly magazine Secretary: Gregory T. Byrd; Treasurer: Forrest Shull; VP, Member &
Computer, discounts, and opportunities to serve (all activities Geographic Activities: Nita K. Patel; VP, Publications: David S. Ebert;
are led by volunteer members). Membership is open to all IEEE VP, Professional & Educational Activities: Andy T. Chen; VP, Standards
members, affiliate society members, and others interested in the Activities: Mark Paulk; VP, Technical & Conference Activities: Hausi A.
computer field. Müller; 2016 IEEE Director & Delegate Division VIII: John W. Walz; 2016
COMPUTER SOCIETY WEBSITE: www.computer.org IEEE Director & Delegate Division V: Harold Javid; 2017 IEEE Director-
OMBUDSMAN: Direct unresolved complaints to ombudsman@
________ Elect & Delegate Division V: Dejan S. MilojiɯiƩ
computer.org.
CHAPTERS: Regular and student chapters worldwide provide the BOARD OF GOVERNORS
opportunity to interact with colleagues, hear technical experts, Term Expriring 2016: David A. Bader, Pierre Bourque, Dennis J. Frailey,
and serve the local professional community. Jill I. Gostin, Atsuhiro Goto, Rob Reilly, Christina M. Schober
AVAILABLE INFORMATION: To check membership status, report Term Expiring 2017: David Lomet, Ming C. Lin, Gregory T. Byrd, Alfredo
an address change, or obtain more information on any of the Benso, Forrest Shull, Fabrizio Lombardi, Hausi A. Müller
following, email Customer Service at help@computer.org
____________ or call Term Expiring 2018: Ann DeMarle, Fred Douglis, Vladimir Getov, Bruce
+1 714 821 8380 (international) or our toll-free number, +1 800 M. McMillin, Cecilia Metra, Kunio Uchiyama, Stefano Zanero
272 6657 (US):
EXECUTIVE STAFF
G Membership applications Executive Director: Angela R. Burgess
G Publications catalog Director, Governance & Associate Executive Director: Anne Marie Kelly
G Draft standards and order forms Director, Finance & Accounting: Sunny Hwang
G Technical committee list Director, Information Technology & Services: Sumit Kacker
G Technical committee application Director, Membership Development: Eric Berkowitz
G Chapter start-up procedures Director, Products & Services: Evan M. Butterfield
G Student scholarship information Director, Sales & Marketing: Chris Jensen
G Volunteer leaders/staff directory
G IEEE senior member grade application (requires 10 years COMPUTER SOCIETY OFFICES
practice and significant performance in five of those 10) Washington, D.C.: 2001 L St., Ste. 700, Washington, D.C. 20036-4928
Phone:     GFax: +1 202 728 9614
PUBLICATIONS AND ACTIVITIES Email: hq.ofc@computer.org
___________
Computer: The flagship publication of the IEEE Computer Los Alamitos: 10662 Los Vaqueros Circle, Los Alamitos, CA 90720
Society, Computer, publishes peer-reviewed technical content that Phone: +1 714 821 8380
covers all aspects of computer science, computer engineering, Email: help@computer.org
__________
technology, and applications.
MEMBERSHIP & PUBLICATION ORDERS
Periodicals: The society publishes 13 magazines, 19 transactions,
Phone:    GFax:    G9-58418</;9<A@1>;>3
__________
and one letters. Refer to membership application or request
Asia/Pacific: Watanabe Building, 1-4-2 Minami-Aoyama, Minato-ku,
information as noted above.
Tokyo 107-0062, Japan
Conference Proceedings & Books: Conference Publishing
Phone:     GFax: +81 3 3408 3553
Services publishes more than 175 titles every year.
Email: tokyo.ofc@computer.org
_____________
Standards Working Groups: More than 150 groups produce IEEE
standards used throughout the world. IEEE BOARD OF DIRECTORS
Technical Committees: TCs provide professional interaction in President & CEO: Barry L. Shoop
more than 45 technical areas and directly influence computer President-Elect: Karen Bartleson
engineering conferences and publications. Past President: Howard E. Michel
Conferences/Education: The society holds about 200 conferences Secretary: Parviz Famouri
each year and sponsors many educational activities, including Treasurer: Jerry L. Hudgins
computing science accreditation. Director & President, IEEE-USA: Peter Alan Eckstein
Certifications: The society offers two software developer Director & President, Standards Association: Bruce P. Kraemer
credentials. For more information, visit www.computer.org/ Director & VP, Educational Activities: S.K. Ramesh
certification.
_______ Director & VP, Membership and Geographic Activities: Wai-Choong
(Lawrence) Wong
Director & VP, Publication Services and Products: Sheila Hemami
NEXT BOARD MEETING Director & VP, Technical Activities: Jose M.F. Moura
13–14 November 2016, New Brunswick, NJ, USA Director & Delegate Division V: Harold Javid
Director & Delegate Division VIII: John W. Walz

revised 10 June 2016

qM
qM
qM
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM
THE WORLD’S NEWSSTAND®
qM
qM
qM
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM
THE WORLD’S NEWSSTAND®

COMPUTATIONAL CHEMISTRY

Chemical Kinetics: A CS Perspective

Dinesh P. Mehta and Anthony M. Dean | Colorado School of Mines


Tina M. Kouri | Sandia National Labs

Chemical kinetics has played a critical role in understanding phenomena such as global climate change and
photochemical smog, and researchers use it to analyze chemical reactors and alternative fuels. When computing is
applied to the development of detailed chemical kinetic models, it allows scientists to predict the behavior of these
complex chemical systems.

The 1995 Nobel Prize in Chemistry was awarded to Paul J. Crutzen, Mario J. Molina, and F. Sherwood Rowland "for their work in atmospheric chemistry, particularly concerning the formation and decomposition of ozone."1 Molina and Rowland performed calculations predicting that chlorofluorocarbon (CFC) gases being released into the atmosphere would lead to the depletion of the ozone layer. Because the ozone layer absorbs ultraviolet light, its depletion would lead to an increase in ultraviolet light on the Earth's surface, resulting in an increase in skin cancer and eye damage in humans. The subsequent international treaty, the Montreal Protocol on Substances that Deplete the Ozone Layer, was universally adopted and phased out the production of CFCs; it serves as an exemplar of public policy being informed by science.

The underlying calculations used by Molina and Rowland have their basis in chemical kinetics, which concerns the rate at which chemical reactions occur. When a chemical reaction (such as the combustion of methane) takes place, the overall reaction might appear simple—such as CH4 + 2O2 = CO2 + 2H2O—but the actual chemistry is typically much more complex (for details, see the "Related Work in Chemical Kinetics" sidebar). An accurate analysis of the underlying combustion phenomenon requires consideration of all the species (molecules) and elementary reactions, which could number in the hundreds and thousands, respectively. Other applications of kinetics include controlling photochemical smog through emissions regulations on automobiles and factories and the development of alternative fuels for the internal combustion engine. The experimental testing needed for fuel certification is expensive and time-consuming, leading to the development of computational approaches to minimize the experimental space.

The purpose of this article is to acquaint computer scientists with the application of computing in chemical kinetics and to outline future challenges, such as the need to couple kinetics and transport phenomena to obtain more accurate predictions. We focus on gas-phase chemistry; that is, all the chemicals are gases. Figure 1 shows the flow of the chemical kinetic computation. Aspects of this are similar to hardware and software design flows, where the upstream steps have considerable impact on the downstream steps, both in terms of computation time and quality of solution. This article's organization mirrors the steps in the flowchart, with sections on mechanism generation, consistency and completeness analysis, and mechanism reduction.


Related Work in Chemical Kinetics

Chemical kinetics is a branch of chemistry concerned with the rate at which chemical reactions occur. This is opposed to chemical thermodynamics, which studies the enthalpy and entropy associated with chemical reactions; that is, it tells you whether a given chemical reaction might occur under certain conditions but not how fast it will occur.

For example, chemical thermodynamics suggests that the oxidation of graphite (carbon) resulting in carbon dioxide is highly favored at room temperature. However, graphite can be exposed to air indefinitely without any apparent changes because the reaction is very slow. The rate of a chemical reaction is affected by the reactants' nature, physical state (solid, liquid, or gas), and concentrations, as well as the temperature, pressure, and the presence of catalysts and inhibitors.

Chemical reaction kinetics can be described by rate laws. Chemists and chemical engineers can use the resulting mathematical models to better understand a variety of chemical reactions and design chemical systems that maximize product yield, minimize harmful effects on the environment, and so on.

Consider the reaction O + N2O → 2NO. Here, one atom of oxygen reacts with one molecule of nitrous oxide to produce two molecules of nitric oxide. Let [N2O] denote the concentration of nitrous oxide. The reaction rate can be expressed as the rate of disappearance of a reactant (say, N2O) by denoting it by the derivative of concentration with respect to time: d[N2O]/dt. Note that d[N2O]/dt is negative because the concentration of N2O (a reactant) reduces with time. The rate of a reaction can also be expressed as the rate of product formation: in this case, (1/2)(d[NO]/dt). The 1/2 accounts for the production of two molecules of nitric oxide in the reaction. More generally, consider a reaction of the form aA + bB → cC + dD, where A and B are reactants, C and D are products, and a, b, c, and d are integers that denote the relative amounts of reactants and products consumed and produced, respectively. The reaction rate can then be expressed as

-(1/a) d[A]/dt = -(1/b) d[B]/dt = (1/c) d[C]/dt = (1/d) d[D]/dt.

The field of chemical kinetics was pioneered by Cato M. Guldberg and Peter Waage, who observed in 1864 that reaction rates are related to the concentrations of the reactants. Typically,

RATE ∝ [A]^x [B]^y = k [A]^x [B]^y,

where k is the rate constant. The values of x and y, which can be determined experimentally, depend on the reaction and aren't necessarily equal to the stoichiometric coefficients a and b. The order of a reaction is defined by x + y.

It often turns out that a seemingly simple reaction actually consists of a number of even simpler steps. For example, the reaction H2 + Br2 → 2HBr consists of the following elementary steps:

Br2 → 2Br
Br + H2 → HBr + H
H + Br2 → HBr + Br
H + HBr → H2 + Br
2Br → Br2

This collection of elementary steps is called a reaction mechanism. It turns out that x = a and y = b in an elementary reaction—that is, the order is identical to the molecularity. For an elementary reaction, the rate law can therefore be written by inspection using the Guldberg–Waage mass action law.

Given a reaction mechanism, a quantitative characterization of the chemical system is achieved by developing a set of differential equations, one for each species in the mechanism (in our example, Br2, H2, Br, H, HBr). Assuming that the rate constants for the five reactions above are k1 through k5, respectively, we can write down the following set of ordinary differential equations (ODEs):

-d[Br2]/dt = k1[Br2] + k3[H][Br2] - k5[Br]^2
-d[H2]/dt = k2[Br][H2] - k4[H][HBr]
d[H]/dt = k2[Br][H2] - k3[H][Br2] - k4[H][HBr]
d[Br]/dt = 2k1[Br2] - k2[Br][H2] + k3[H][Br2] + k4[H][HBr] - 2k5[Br]^2
d[HBr]/dt = k2[Br][H2] + k3[H][Br2] - k4[H][HBr].

These are then numerically integrated to give, for each species, a description of the concentration variation with time. The resulting predictions are compared to experimental data to assess the proposed mechanism's suitability. Although our example reaction consists of five reactions and five species, a complex reaction can contain thousands of species and tens of thousands of reactions, requiring efficient integration algorithms. A description of the ODE solvers and the electronic structure calculations used to determine rate constants, while crucial to this technology, are beyond this article's scope.
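To make the sidebar's closing point concrete, here's a minimal sketch (not taken from the article) of how the five-species H2 + Br2 mechanism above could be integrated numerically with SciPy; the rate constants and initial concentrations are arbitrary placeholder values.

# Sketch: integrate the H2 + Br2 example mechanism with SciPy.
# Rate constants k1..k5 and initial concentrations are made-up placeholders.
from scipy.integrate import solve_ivp

k1, k2, k3, k4, k5 = 1e-3, 1e2, 1e2, 5e1, 1e3   # hypothetical rate constants

def rhs(t, y):
    br2, h2, h, br, hbr = y
    return [
        -(k1*br2 + k3*h*br2 - k5*br**2),                           # d[Br2]/dt
        -(k2*br*h2 - k4*h*hbr),                                    # d[H2]/dt
        k2*br*h2 - k3*h*br2 - k4*h*hbr,                            # d[H]/dt
        2*k1*br2 - k2*br*h2 + k3*h*br2 + k4*h*hbr - 2*k5*br**2,    # d[Br]/dt
        k2*br*h2 + k3*h*br2 - k4*h*hbr,                            # d[HBr]/dt
    ]

y0 = [1.0, 1.0, 0.0, 0.0, 0.0]                  # [Br2], [H2], [H], [Br], [HBr]
sol = solve_ivp(rhs, (0.0, 10.0), y0, method="LSODA")
print(sol.y[:, -1])                             # final concentrations

A stiff-capable solver such as LSODA matters here because realistic mechanisms mix very fast and very slow reactions.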

Mechanism Generation
The automated development of a reaction mechanism involves starting with a set of reactants and determining the reactions they participate in, the intermediate species that are generated, and, ultimately, the products that are obtained. This is an inherently iterative process because intermediate species might themselves participate in reactions, resulting in the formation of new species, which might in turn react with other species to generate even more species, and so on. This process can theoretically continue indefinitely, resulting in a combinatorial explosion of species and reactions. In practice, criteria are needed for two reasons: to decide when to terminate the process and to identify the most chemically important reactions and species.
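The loop just described can be summarized in a few lines. This is only a schematic sketch: apply_reaction_templates and select_significant are hypothetical stand-ins for the matrix-based reaction generation and the rate-based selection criteria covered in the following sections.

# Schematic sketch of iterative mechanism generation. The two callables are
# hypothetical placeholders for machinery described later in the article;
# each generated reaction object is assumed to carry a .products iterable.
def generate_mechanism(initial_reactants, apply_reaction_templates,
                       select_significant, max_rounds=50):
    species = set(initial_reactants)          # R0
    reactions = set()
    for _ in range(max_rounds):
        new_reactions = apply_reaction_templates(species)   # all combinations
        candidates = {p for rxn in new_reactions
                        for p in rxn.products} - species    # R1, R2, ...
        reactions |= new_reactions
        keep = select_significant(candidates, reactions)    # e.g., rate-based
        if not keep:                          # nothing significant left: stop
            break
        species |= keep
    return species, reactions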


Figure 1. Mechanism generation flow chart. The kineticist uses chemical insights gained through experiments and theory to develop rate rules, which are used by the generation algorithm to create a new reaction mechanism. The reaction mechanism is checked for consistency and completeness. This is followed by validation procedures. If the mechanism fails the validation procedures—that is, if predictions obtained by running ordinary differential equations (ODE) solvers don't match experimental data—the process must be repeated. Upon passing the validation procedures, the size of the mechanism is reduced, giving a final mechanism. This mechanism can be used along with computational fluid dynamics (CFD) computation to perform accurate system simulations.

Bond-Electron and Reaction Matrices
We now describe the use of matrices to generate products from a set of reactants and reaction types.2,3 The bond-electron (BE) matrix represents a species and is a variation of the classical adjacency matrix used to represent graphs. Specifically,

■ graph vertices are augmented with labels to denote atoms, such as C for carbon, H for hydrogen, and O for oxygen; and
■ multiple edges are permitted between vertices to account for bond order.

For example, a pair of C atoms can be joined by a single bond, a double bond, or a triple bond; these three cases must be distinguishable. Bond formation is governed by the participating atoms' valences—that is, the number of unpaired electrons in their outermost shells. The valences for C, O, and H are 4, 2, and 1, respectively. A single bond is formed by the contribution of one unpaired electron from the two participating atoms.

Element Mij in an n × n BE matrix M of a molecule with n atoms denotes the number of bonds between atoms i and j, when i ≠ j. The diagonal element Mii (typically zero in an adjacency matrix) denotes the number of free electrons of atom i that aren't used in its bonds. The sum of the elements in the ith row then gives the valence of atom i. Figure 2 illustrates the concept.

A reaction (R) matrix is used to capture the bond changes associated with a certain type of reaction. Several well-known reaction types exist—including hydrogen abstraction, β-scission, and recombination—and each has well-defined behavior with respect to the bond changes that occur in the reaction. This is illustrated using hydrogen abstraction—that is, the removal (abstraction) of an H atom from a molecule by a radical (a species with an atom with an unpaired electron in its outermost shell)—as follows:

X-H + Y* → X* + Y-H.

Here, the radical Y* abstracts the H atom from the molecule X-H, giving the molecule Y-H and the radical X*. The bonds associated with three atoms are impacted by this reaction:

■ Xa, the atom in X whose bond with the H atom is broken;
■ the H atom itself; and
■ Yb, the atom in Y with the unpaired electron, that forms a bond with the H atom.

These are reflected in the H-abstraction reaction matrix:

      Xa   H   Yb
Xa     1  -1    0
H     -1   0    1
Yb     0   1   -1

Figure 3 illustrates in detail how the products of the reaction in Figure 2 are generated (a short code sketch of this bookkeeping follows the figure captions below).

Termination and Selection
Let the initial set of reactants be R0. Assume that reaction matrices are applied to all possible combinations of reactants to generate products as described in the previous section. Let the set of new products generated be R1. We now repeat the process on R0 ∪ R1 to generate R2.

This process can be repeated indefinitely. The challenge is twofold: to determine which criteria should be used to terminate this process and to identify how to select chemically significant species and reactions, while leaving chemically insignificant ones out.
50 September/October 2016

qM
qM
qM
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM
THE WORLD’S NEWSSTAND®
qM
qM
qM
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM
THE WORLD’S NEWSSTAND®


Figure 2. Hydrogen (H) abstraction reaction example and bond-electron (BE) matrices for all participating species:
(a) and (b) the reactants methane and hydroxyl radical, and (c) and (d) the products methyl radical and water. The *
denotes an atom with an unpaired electron. Each atom has a label used later to identify it uniquely.
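The following small NumPy sketch (not the authors' code) reproduces the bookkeeping of Figures 2 and 3 for the CH4 + OH* example: build the reactants' BE matrix, add the expanded hydrogen-abstraction reaction matrix, and read off the products' BE matrix.

# Sketch of the BE-matrix bookkeeping for CH4 + OH* -> CH3* + H2O.
# Atom order: C, four H (methane), then O and H (hydroxyl radical).
import numpy as np

labels = ["C", "H", "H", "H", "H", "O", "H"]

# Block-diagonal BE matrix of the two reactants (Figures 2a and 2b).
be = np.zeros((7, 7), dtype=int)
be[0, 1:5] = be[1:5, 0] = 1      # four C-H bonds in methane
be[5, 6] = be[6, 5] = 1          # O-H bond in the hydroxyl radical
be[5, 5] = 1                     # one unpaired electron on O

# Hydrogen-abstraction reaction matrix expanded to the atoms it touches:
# Xa = C (index 0), H = the abstracted hydrogen (index 1), Yb = O (index 5).
r = np.zeros((7, 7), dtype=int)
r[0, 0] = 1; r[0, 1] = r[1, 0] = -1      # C-H bond breaks, C becomes a radical
r[1, 5] = r[5, 1] = 1; r[5, 5] = -1      # O-H bond forms, O loses its free electron

products = be + r                        # BE matrix of CH3* + H2O (Figure 3c)
for sym, row in zip(labels, products):
    print(sym, row.tolist())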

Figure 3. Reaction matrices for the reaction shown in Figure 2. (a) We combine the BE matrices of the reactants CH4 and OH and place boxes around matrix elements that will be impacted by the succeeding steps. (b) The expanded reaction matrix. (c) The result of adding the matrices from (a) and (b) with boxes around the elements that are affected by the addition. (d) Rows and columns are reordered, giving the products CH3 and H2O, by identifying connected components of the graph using, for example, breadth-first search.

The rate-based approach,4 which has found favor within the kinetics community, uses reaction rate computations during mechanism generation to determine which species are chemically significant: assume as before that the initial reactants set is R0. Once all the reactions (including rate constants) involving R0 are determined, the system of ordinary differential equations (ODEs) is solved, giving the rate at which the various products (in R1) are formed. Only the products that are formed the fastest are added to R0, and the process is repeated. The process terminates when the rates at which all the products are formed fall below a user-specified threshold. The relative rates of individual reactions depend on the temperature and pressure of the system. Consequently, the mechanism derived for the same set of reactants could vary for different temperature–pressure combinations.

Estimating Rate Constants
We now briefly describe how rate constants are estimated. Functional groups are specific groups of atoms within molecules that are responsible for the characteristic chemical behaviors of those molecules. For example, all acids (for example, HCl and H2SO4) contain the H atom and all alkalies (such as NaOH and KOH) contain the OH grouping. Reaction rate constants and other thermochemical properties that are required to set up the system of ODEs can be estimated from the functional groups that participate in a reaction. Estimates are required because the direct measurement of these quantities is often impractical.

Functional groups are represented in a rooted tree data model,5 in which the root represents a general functional group and its descendants represent more specialized groups. Figure 4 shows a portion of a functional group tree for classifying carbonyls (a carbon atom double-bonded to an oxygen atom). The more specialized the knowledge of functional groups in a reaction, the more accurate the rate constant estimate that can be obtained.

Consistency and Completeness Analysis
Mechanisms generated using these techniques must be checked for consistency and completeness. Due to the large sizes of the mechanisms, kineticists use software tools to verify accuracy and completeness.


Figure 4. Portion of a carbonyl functional group tree. "!O" denotes any atom except oxygen.

Figure 5. Mapping a simple chemical reaction OH + CH4 → H2O + CH3. (a) A bijection, or one-to-one mapping, on vertices. (b) Each bond is labeled to indicate its position in the bit string. (c) A bit pattern of bonds that will be broken or retained.

Tools that automatically classify reactions into specific reaction types are key to this process because they allow the mechanism to be sorted into manageable groups and simplify the task of checking for completeness of reactions and consistency of rate coefficient assignments.

Automated Reaction Mapping
Automated reaction mapping (ARM) is a fundamental first step in the classification of chemical reactions. The objective is to determine which bonds are broken and formed in a reaction. Figure 5a shows our earlier hydrogen-abstraction reaction consisting of two reactants and two products. The input to the automated reaction mapping problem is a balanced chemical reaction; that is, the same number and types of atoms are present on both the left-hand side (LHS) and the right-hand side (RHS) of the reaction. The output is the list of bonds that were broken or formed to transform the reactants into products.

ARM is formally defined as an optimization problem: find a bijection (one-to-one mapping) from the reactant atoms to the product atoms that minimizes the number of bonds broken or formed. This mapping must respect atom labels; that is, a reactant C atom is mapped to a product C atom. Clearly, there's only one way to map the C and O atoms in the reaction in Figure 5a. Of the 5! = 120 ways to map the five H atoms on the LHS to the five H atoms on the RHS, Figure 5a shows an optimal mapping that corresponds to breaking a C-H reactant bond and forming an O-H product bond for a total cost of two. Optimal solutions to ARM aren't guaranteed to reflect the underlying chemistry, but have been found to be accurate for combustion reactions.

The general ARM problem is known to be NP-hard.6 Unlike other optimization applications, in which a suboptimal solution is often acceptable, it's crucial to the reaction classification application that optimal solutions be found. We've developed a family of exhaustive algorithms that find optimal solutions by using an approach that systematically removes bonds from reactant and product graphs until the LHS and RHS of the reaction are identical. In the example of Figure 5a, removing a C-H reactant bond and an O-H product bond results in the LHS becoming identical to the RHS. Bonds removed from reactant graphs represent bonds broken during the reaction, while bonds removed from product graphs represent bonds formed during the reaction.

To implement this algorithm, we need a method to determine whether the LHS is identical to the RHS. This in turn boils down to the famous graph isomorphism question. Two graphs G and H are isomorphic if there's a bijection f from the vertices of G (denoted V(G)) to the vertices of H (V(H)) such that for any two vertices u and v in G, (u,v) is an edge in G if and only if (f(u), f(v)) is an edge in H. In practice, this problem is solved for chemical graphs using canonical labeling. The canonical label CL(G) of a graph G is a character string such that for any two graphs G1 and G2, G1 is isomorphic to G2 if and only if CL(G1) = CL(G2). In other words, to compare two chemical graphs, first convert them into strings using canonical labeling and then perform a simple string comparison. Canonical labeling algorithms (e.g., Nauty) exist for chemical graphs that are fast in practice, but have exponential runtimes in the worst case.
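The following sketch illustrates the exhaustive bond-removal idea on the OH + CH4 example; it is not the RCARM implementation, and it uses NetworkX's label-aware isomorphism test in place of the canonical-labeling comparison described above.

# Sketch of the exhaustive ARM idea: remove i reactant bonds and j product
# bonds, smallest i + j first, until the remaining graphs match.
from itertools import combinations
import networkx as nx
from networkx.algorithms import isomorphism

def molecular_graph(atoms, bonds):
    """atoms: sequence of element symbols; bonds: list of (i, j) index pairs."""
    g = nx.Graph()
    for idx, elem in enumerate(atoms):
        g.add_node(idx, element=elem)
    g.add_edges_from(bonds)
    return g

def same_species(g1, g2):
    nm = isomorphism.categorical_node_match("element", None)
    return nx.is_isomorphic(g1, g2, node_match=nm)

def map_reaction(reactants, products):
    """Return (bonds broken, bonds formed) of minimum total size."""
    r_bonds, p_bonds = list(reactants.edges), list(products.edges)
    for cost in range(len(r_bonds) + len(p_bonds) + 1):
        for k in range(cost + 1):
            for broken in combinations(r_bonds, k):
                for formed in combinations(p_bonds, cost - k):
                    r = reactants.copy(); r.remove_edges_from(broken)
                    p = products.copy(); p.remove_edges_from(formed)
                    if same_species(r, p):
                        return list(broken), list(formed)

# OH + CH4 -> H2O + CH3: atom indices 0=O, 1=H (of OH), 2..5=H, 6=C
lhs = molecular_graph("OHHHHHC", [(0, 1), (6, 2), (6, 3), (6, 4), (6, 5)])
rhs = molecular_graph("OHHHHHC", [(0, 1), (0, 2), (6, 3), (6, 4), (6, 5)])
print(map_reaction(lhs, rhs))   # one C-H bond broken, one O-H bond formed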


ARM algorithms use a bit string data structure in which each bit corresponds to a reactant or product bond. A bit set to 0 indicates the bond shouldn't be broken, while a bit set to 1 indicates the bond should be broken. Figure 5c shows a bit string corresponding to the reaction of Figure 5b (in which each bond is labeled to indicate position in the bit string). In Figure 5c, the bit string indicates that the reactant bond labeled 3 and the product bond labeled 5 must be broken. We then use canonical labels to see whether this break results in the LHS becoming equal to the RHS; in this case, it does, giving us a mapping of cost 2, which is optimal. Note that there are bit patterns (such as 100000000) that don't give LHS = RHS and therefore aren't valid mappings. Also, because the reaction is balanced, breaking all the bonds—that is, a bit pattern with all 1s—guarantees the existence of a mapping.

Of course, we don't know a priori which of the 2^b subsets of b bonds will result in an optimal mapping. A relatively simple, but remarkably effective, approach is to try all the C(b, i) bit patterns with i 1s, with i going from 0 to b, stopping as soon as a mapping is found. In the worst case, as mentioned earlier, this approach is guaranteed to find a mapping when i = b, that is, when the bit pattern consists of all 1s, resulting in an exponential time complexity. In practice, the minimum number of bond changes in a chemical reaction is small. For hydrogen abstraction, this quantity is two, which means that the number of bit patterns examined by the algorithm is bounded by O(b^2), a polynomial.

Reaction Classification
A classified and sorted reaction mechanism can be used to

■ check for completeness in the mechanism,
■ check the consistency of rate coefficient assignments,
■ focus on unclassified reactions when looking for problems if validation fails, and
■ compare multiple mechanisms that model the same phenomena.

Reaction classification is based on rules associated with the properties of the reaction and its species. These rules can be recorded in a rule-based system such as Jess, which allows for rule modification without requiring recompilation of the software and redeployment of the system.7 The rules for hydrogen abstraction are as follows:

(defrule habstraction
(1) (Reaction {numReactants == 2})
(2) (Reaction {numProducts == 2})
(3) (Reaction {atLeastOneRadicalReactant == TRUE})
(4) (Reaction {atLeastOneRadicalProduct == TRUE})
(5) (Reaction {sameRadicalReactantAndProduct == FALSE})
(6) (Mapping {allHydrogenBonds == TRUE})
(7) (Mapping {hydrogenGoingFromStableToRadicalReactant == TRUE})
(8) (Reaction {numBondsBroken == 1})
(9) (Reaction {numBondsFormed == 1})
=>
(add (new String HydrogenAbstraction)))

Statements (1) and (2) verify that there are two reactants and two products using methods from the Reaction class. Statements (3) and (4) verify the existence of at least one radical reactant and one radical product using methods from the Reaction class. Statement (5) verifies that the radical reactant isn't the same as the radical product using a method from the Reaction class. Statement (6) verifies that each bond broken or formed was connected to a hydrogen atom using a method from the Mapping class. Statement (7) verifies that a hydrogen atom moved from a stable to a radical reactant using a method from the Mapping class. Statements (8) and (9) verify that exactly one reactant bond was broken and one product bond was formed using methods from the Reaction class. Notice that rules (8) and (9) pertain to bonds broken and formed in the reaction obtained using the ARM techniques described earlier.

The complicated nature of gas-phase reaction systems makes it impractical to devise a set of rules that classify all reactions. Unclassified reactions are important in their own right because they allow the kineticist to focus on problems in mechanisms that have failed validation procedures. Our system was able to determine the classification for about 95 percent of the reactions in a set of benchmark combustion mechanisms.

Mechanism Reduction
After consistency and completeness analysis, the system of ODEs is solved and concentration-time profiles are generated. With improvements in computing hardware and ODE solver algorithms, these large systems are now solved routinely. The mechanisms are validated by comparing the predictions with available data.
www.computer.org/cise 53

qM
qM
qM
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM
THE WORLD’S NEWSSTAND®
qM
qM
qM
Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM
THE WORLD’S NEWSSTAND®

COMPUTATIONAL CHEMISTRY

with computational fluid dynamics (CFD) to address the coupled kinetics/transport problem. The resulting computational demand is such that the kinetics community has invested significant effort to develop mechanism reduction techniques that replace the original large mechanism with a much smaller one that closely approximates the behavior of the original. Specifically, the solution of the system of ODEs requires at least O(N^2) time, where N is the number of species. Further, these sets of equations must be solved in large numbers of cells.

A representative mechanism reduction technique is based on directed relation graphs (DRG).8 Each species in the original mechanism is represented by a vertex in the DRG. Intuitively, there's a directed edge from species X to species Y in the DRG if and only if the removal of Y would significantly impact the production rate of X. In other words, an edge from X to Y means that we must retain Y in the reduced mechanism to correctly evaluate the production rate of X. We now describe a quantitative criterion based on the underlying chemistry to formalize this idea. The total production rate of X, denoted P(X), is the sum of its individual reaction production rates, added over all the reactions in which X participates. The production rate of X that can be directly attributed to Y, denoted as P(X,Y), is a similar sum restricted to the subset of reactions in which both X and Y participate. This is normalized to obtain

r_XY = P(X,Y) / P(X).

There's an edge from X to Y in the DRG if and only if r_XY ≥ ε, where ε is a small user-defined threshold value. Given a starting user-specified set S of major species in the mechanism, the key algorithmic idea of this approach is to traverse the graph using, for example, depth-first search to identify, in linear time, all the vertices that are reachable from S. These reachable vertices are precisely the species that must be retained in the reduced mechanism. The species that aren't reachable, along with the reactions they participate in, are eliminated.
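The graph construction and the reachability pass can be sketched compactly. The following is a minimal Python sketch, assuming the per-species production rates P(X) and the pairwise contributions P(X,Y) have already been computed and stored in dictionaries; the data structures, the default threshold, and the usage line are illustrative, not part of the published method:

from collections import defaultdict, deque

def drg_reduce(species, P, PXY, major_species, eps=0.01):
    """Return the species to retain: those reachable in the DRG from the major species."""
    # Build the directed relation graph: edge X -> Y whenever r_XY = P(X,Y)/P(X) >= eps.
    graph = defaultdict(list)
    for X in species:
        for Y in species:
            if X != Y and P[X] > 0 and PXY.get((X, Y), 0.0) / P[X] >= eps:
                graph[X].append(Y)
    # Traverse from the user-specified set S of major species; any reachable vertex must be
    # kept so that the production rates of the retained species stay correct. A breadth-first
    # traversal is used here; depth-first search works equally well for reachability.
    retained, frontier = set(major_species), deque(major_species)
    while frontier:
        X = frontier.popleft()
        for Y in graph[X]:
            if Y not in retained:
                retained.add(Y)
                frontier.append(Y)
    return retained

# Hypothetical usage: species not returned here, and the reactions they appear in,
# are dropped from the reduced mechanism.
# kept = drg_reduce(all_species, P, PXY, major_species=["CH4", "O2"], eps=0.01)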


We've discussed pure chemical kinetics, in that mechanisms have been generated without regard to the spatial aspects of real systems. We've also assumed that systems are homogeneous; that is, species are uniformly distributed throughout a volume so that the likelihood of a particular reaction taking place is independent of location. This is also known as zero-dimensional kinetics. However, nature isn't necessarily homogeneous: different species occur in different regions with varying concentrations and temperatures. For example, in the context of atmospheric chemistry, a combination of factors leads to substantial stratification of species and temperature with height. Similarly, it's well known that tubular reactors have radial variations in flow properties that introduce heterogeneity in the system. Therefore, to more accurately model real systems, we need a tighter coupling of the computational chemical kinetics techniques described here with CFD. We expect that improvements in computing technology (hardware, software, and algorithms) will facilitate the integration of these two disciplines.9

References
1. M.J. Molina and F.S. Rowland, "Stratospheric Sink for Chlorofluoromethanes: Chlorine Atom-Catalysed Destruction of Ozone," Nature, vol. 249, 1974, pp. 810–812.
2. I. Ugi et al., "New Applications of Computers in Chemistry," Angewandte Chemie, Int'l Ed., vol. 18, no. 2, 1979, pp. 111–123.
3. L.J. Broadbelt, S.M. Stark, and M.T. Klein, "Computer Generated Pyrolysis Modeling: On-the-Fly Generation of Species, Reactions, and Rates," Industrial & Eng. Chemistry Research, vol. 33, no. 4, 1994, pp. 790–799.
4. R.G. Susnow et al., "Rate-Based Construction of Kinetic Models for Complex Systems," J. Physical Chemistry A, vol. 101, no. 20, 1997, pp. 3731–3740.
5. W.H. Green Jr., "Predictive Kinetics: A New Approach for the 21st Century," Chemical Eng. Kinetics, vol. 32, Academic Press, 2007, pp. 1–50.
6. J. Crabtree and D. Mehta, "Automated Reaction Mapping," J. Experimental Algorithmics, vol. 13, no. 15, 2009, article no. 15.
7. T. Kouri et al., "RCARM: Reaction Classification Using ARM," Int'l J. Chemical Kinetics, vol. 45, no. 2, 2013, pp. 125–139.
8. T. Lu and C.K. Law, "A Directed Relation Graph Method for Mechanism Reduction," Proc. Combustion Institute, vol. 30, no. 1, 2005, pp. 1333–1341.
9. S.W. Churchill, "Interaction of Chemical Reactions and Transport. 1. An Overview," Industrial & Eng. Chemistry Research, vol. 44, no. 14, 2005, pp. 5199–5212.

Dinesh P. Mehta is professor of electrical engineering and computer science at the Colorado School of Mines. His research interests include applied algorithms, VLSI design automation, and cheminformatics. Mehta received a PhD in computer and information science from the University of Florida. He's a member of IEEE and the ACM. Contact him at dmehta@mines.edu.

Anthony M. Dean is a professor of chemical engineering and vice president for research at the Colorado School of Mines. His research interests include quantitative kinetic characterization of reaction networks in a variety of systems. Dean received a PhD in physical chemistry from Harvard University. He's a member of the American Chemical Society, the American Institute of Chemical Engineers, and the Combustion Institute. Contact him at amdean@mines.edu.

Tina M. Kouri is a research and development computer scientist at Sandia National Labs. Her research interests include applied algorithms and cheminformatics. Kouri received a PhD in mathematical and computer sciences from the Colorado School of Mines. She's a member of the ACM. Contact her at tkouri@sandia.gov.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


CLOUD COMPUTING

A Cloud-Based Seizure Alert System for Epileptic Patients That Uses Higher-Order Statistics

Sanjay Sareen | Guru Nanak Dev University, Amritsar, India, and I.K. Gujral Punjab Technical University, Kapurthala, India
Sandeep K. Sood | Guru Nanak Dev University Regional Campus, Gurdaspur, India
Sunil Kumar Gupta | Beant College of Engineering and Technology, Gurdaspur, India

Automatic detection of an epileptic seizure before its occurrence could protect patients from accidents or even save
lives. A framework that automatically predicts seizures can exploit cloud-based services to collect and analyze EEG
data from a patient’s mobile phone.

Epilepsy is a disorder that affects the brain, causing seizures. During a seizure, a patient could lose consciousness, including while walking or driving a vehicle, which could result in significant injury or death. According to a recent survey, the main causes of death for epileptic patients include sudden unexpected death during epilepsy, drowning, and accidents, which account for 89 percent of total epilepsy-related deaths in Australia.1 Such patients can benefit from an alert before the start of a seizure or from emergency treatment when they have a seizure, thus improving their quality of life and safety considerably.

In a clinical study, Brian Litt and colleagues2 observed that an increase in the amount of abnormal electrical activity occurs before a seizure's onset. One of the most important steps to protect the life of an epileptic patient is the early detection of seizures, which can help patients take precautionary measures and prevent accidents. To detect a seizure during the transition from a normal state to an ictal (mid-seizure) state, the electrical activity in the patient's brain needs to be recorded continuously and efficiently around the clock. Electroencephalogram (EEG) is the most commonly used technique to measure electrical activities in the brain for the diagnosis of epileptic seizures.

In this direction, wireless sensor network (WSN) technology is emerging as one of the most promising options for real-time and continuous monitoring of chronically ill and assisted living patients remotely, thus minimizing the need for caregivers. One of the important segments of WSN is body sensor networks (BSNs), which record the vital signs of a patient such as heart rate, electrocardiogram (ECG), and EEG. These wearable sensors are placed on the patient's body, and their key benefit is mobility: they enable the patient to move freely inside or outside the home. BSNs generate a huge amount of sensor data that needs to be processed in real time to provide timely help to the patient. Cloud computing provides the ability to store and analyze this rapidly generated sensor data in real time from sensors of different patients residing in different geographic locations. The cloud computing infrastructure integrated with BSNs makes it possible to monitor and analyze the sensor data of large numbers of epileptic patients around the globe efficiently and in real time.3 The cloud service provider is bound to provide an agreed-upon quality of service based on a service-level agreement (SLA), and appropriate compensation is paid to the customer if the required service levels aren't met.4 To protect the patient from accidents when a seizure occurs, ideally, family members could continuously monitor the patient everywhere, which isn't feasible under traditional circumstances. Hence, the main objectives of our proposed system are


Related Work in Seizure Prediction

The sensor data stream generated from body sensor networks (BSNs) has drawn the attention of researchers, many of whom have taken the initiative to develop cloud-based e-healthcare applications based on BSNs. In 2003, Leon Iasemidis and colleagues1 designed an algorithm based on the short-term maximum Lyapunov exponents (STLmax) to detect a seizure prior to its occurrence. However, it's based on the assumption that the occurrence of the first seizure is known. Joel Niederhauser and colleagues2 proposed a model to detect a seizure using a periodogram of the electroencephalogram (EEG) signal and demonstrated that EEG events occur prior to the electrical onset of a seizure. However, this model is only applicable to patients with temporal lobe epilepsy. In 2008, Hasan Ocak and colleagues3 proposed a technique for the analysis of seizures using wavelet packet decomposition and a genetic algorithm. Shayan Fakhr and colleagues4 presented a review of a variety of techniques for EEG signal analysis of patients in a sleep state. They studied different preprocessing, feature extraction, and classification techniques to process the sleep EEG signals. Sriram Ramgopal and colleagues5 presented another review on seizure prediction and detection methods and studied their usage in closed-loop warning systems. In 2015, Mohamed Menshawy and colleagues6 developed a mobile-based EEG monitoring system to detect epileptic seizures. They implemented an appropriate combination of different algorithms for preprocessing and feature extraction of EEG signals. In this model, a k-means clustering algorithm is used to classify the features into different homogeneous clusters in terms of their morphology.

Recently, cloud computing in the field of healthcare has started to gain momentum. Suraj Pandey and colleagues7 proposed an architecture for online monitoring of patient health using cloud computing technologies. Abdur Forkan and colleagues8 proposed a model based on service-oriented architecture that enables real-time assisted living services. It provides a flexible middleware layer that hides the complexity in the management of sensor data from different kinds of sensors as well as contextual information. Giancarlo Fortino and colleagues9 proposed an architecture based on the combined use of BSNs and the cloud computing infrastructure. It monitors assisted living via wearable body sensors that send data to the cloud with the help of a mobile phone. Muhannad Quwaider and Yaser Jararweh10 proposed a prototype for efficient sensor data collection from wireless body area networks that contains a virtual machine and a virtualized cloudlet that integrates the cloud capabilities with the sensor devices. Recently, Ahmed Lounis and colleagues11 proposed a new secure cloud-based system using wireless sensor networks (WSNs) that enables a healthcare institution to process data captured by a WSN for patients under supervision.

References
1. L.D. Iasemidis et al., "Adaptive Epileptic Seizure Prediction System," IEEE Trans. Biomedical Eng., vol. 50, no. 5, 2003, pp. 616–627.
2. J.J. Niederhauser et al., "Detection of Seizure Precursors from Depth-EEG Using a Sign Periodogram Transform," IEEE Trans. Biomedical Eng., vol. 51, no. 4, 2003, pp. 449–458.
3. H. Ocak, "Optimal Classification of Epileptic Seizures in EEG Using Wavelet Analysis and Genetic Algorithm," Signal Processing, vol. 88, 2008, pp. 1858–1867.
4. S.M. Fakhr et al., "Signal Processing Techniques Applied to Human Sleep EEG Signals—a Review," Biomedical Signal Processing and Control, vol. 10, 2014, pp. 21–33.
5. S. Ramgopal et al., "Seizure Detection, Seizure Prediction, and Closed-Loop Warning Systems in Epilepsy," Epilepsy and Behavior, vol. 37, 2014, pp. 291–307.
6. E.M. Menshawy, A. Benharref, and M. Serhani, "An Automatic Mobile-Health Based Approach for EEG Epileptic Seizures Detection," Expert Systems with Applications, vol. 42, 2015, pp. 7157–7174.
7. S. Pandey et al., "An Autonomic Cloud Environment for Hosting ECG Data Analysis Services," Future Generation Computer Systems, vol. 28, 2012, pp. 147–154.
8. A. Forkan, I. Khalil, and Z. Tari, "CoCaMAAL: A Cloud-Oriented Context-Aware Middleware in Ambient Assisted Living," Future Generation Computer Systems, vol. 35, 2014, pp. 114–127.
9. G. Fortino et al., "BodyCloud: A SaaS Approach for Community Body Sensor Networks," Future Generation Computer Systems, vol. 35, 2014, pp. 62–79.
10. M. Quwaider and Y. Jararweh, "Cloudlet-Based Efficient Data Collection in Wireless Body Area Networks," Simulation Modelling Practice and Theory, vol. 50, 2015, pp. 57–71.
11. A. Lounis et al., "Healing on the Cloud: Secure Cloud Architecture for Medical Wireless Sensor Networks," Future Generation Computer Systems, vol. 55, 2016, pp. 266–277.

■ to detect preictal variations that occur prior to a seizure onset so that the patient can be warned in a timely manner before the start of a seizure, and
■ to alert the patient, his or her family members, and a nearby hospital for emergency assistance.

To achieve these objectives, we propose a model in which each patient is registered by entering personal information through a mobile phone, then a unique identification number (UID) is allocated to that person. The data from body sensors in digital form is collected through patients' mobile phones using the Bluetooth technology. The fast Walsh-Hadamard transform (FWHT) is used to extract features or abnormalities from the EEG signal; these features are then reduced using higher-order spectral analysis (HOSA) and classified into normal, preictal, and ictal states using a Gaussian


Figure 1. The architecture of the proposed cloud-based seizure alert system for epileptic patients. The model integrates wireless body sensor network, mobile phone, cloud computing, and Internet technology to predict the seizure in real time irrespective of the patient's geographic location.

process classification algorithm. GPS is used to track the location of patients from their respective mobile phones. Whenever the system detects the preictal state of the patient, an alert message will be generated and sent to the patient's mobile phone, a family member, and a nearby hospital, depending upon the location of the patient.

See the "Related Work in Seizure Prediction" sidebar for more information on the use of cloud computing and wireless BSNs to predict epileptic seizures.

Proposed Model
The proposed model consists of target patients, a BSN, data acquisition and transmission, data collection, seizure prediction, and GPS-based location tracking. The BSN consists of wearable EEG sensors placed on different parts of the brain for capturing EEG signals. The data acquisition and transmission component comprises a smartphone and an Android-based application that captures data from body sensors and sends it to the cloud along with the user's personal information, manually entered through the app. The data collection component is used to collect and store raw sensor data in a database and transforms it into a suitable form for further processing and analysis. It contains a cloud storage repository to store patients' personal information and their sensor data. The system assigns each user a UID at the time of registration. The seizure prediction component performs tasks such as data validation, feature extraction, and feature classification. The FWHT and higher-order statistics analyze and extract the feature set from the EEG signal. A Gaussian process classifies the feature set into normal, preictal, and ictal states of seizure. Based on the classification, the system can generate an alert message and send it to the hospital closest to the user's geographic location, a family member, and the actual user. The objective in sending an alert message to users' mobile phones is to encourage them to take precautionary measures to protect themselves from injuries. The GPS-based location-tracking component keeps track of the location of the patients with the help of their mobile phones. Figure 1 demonstrates the design of our proposed system for predicting and detecting seizures.

Data Acquisition and Transmission
The EEG sensor device contains one or more electrodes to detect the voltage of current flowing through the brain neurons from multiple segments of the brain. In our model, we use an Emotiv EPOC headset, which contains 14 sensors placed on the scalp to read signals from different areas of


the brain. The signals are transformed at 200 Hz using an analog-to-digital converter before being sent to a mobile phone. The raw data streams generated by EEG sensors are collected continuously in real time by the patient's own mobile phone using a wireless communication protocol. The mobile phone constitutes a wireless personal area network (WPAN) that receives data from the BSN. Bluetooth is used to transfer the data streams between Bluetooth-enabled devices over short distances. Several sensor devices can be connected to one Bluetooth server device (such as a mobile phone), which acts as a coordinator. An Android-based application collects digital sampled values from the body sensors. The mobile phone transmits the data to the cloud via a suitable communication protocol, such as Wi-Fi, 3G, or 4G networks.

Data Collection
Signal data streams from wearable body sensors are captured through mobile phones and sent to the cloud for storage and analysis. The data received is stored in the cloud database known as an epilepsy record (ER) database. Patients' personal information is stored in the ER database with their UID, as shown in Table 1.

Table 1. Personal attributes of a patient.
Serial number | Attribute | Data type
1 | Social Security Number | Integer
2 | Name | String
3 | Age | Integer
4 | Sex | String
5 | Address | String
6 | Mobile number | Integer
7 | Family member's name | String
8 | Family member's mobile number | Integer

Data Validation
The original EEG signals recorded by the sensors are contaminated with a variety of external influences such as noise and artifacts that originate from two sources: physiological and nonphysiological. Physiological artifacts are generated from sources within the body, such as eye movements, ECGs, and electromyography. Nonphysiological artifacts come from external sources such as electronic components, line power, and the environment. Such artifacts should be eliminated from the EEG signals by using a filtering mechanism such as a band-pass or low-pass filter.

Feature Extraction from EEG Signals
The EEG signal in its original form doesn't provide any information that can be helpful in detecting a seizure. The variation in signal pattern during different identifiable seizure states can be detected by applying an appropriate feature-extraction technique. Inadequate feature extraction might not provide good classification results, even though the classification method is highly optimized for the problem at hand. Several feature-extraction methods based on the time domain, the feature domain, and wavelet transform (WT) features are available. Because the EEG signals are nonstationary and their frequency response varies with time, conventional methods based on time and frequency domains aren't suitable for seizure prediction.

Fast Walsh-Hadamard transform. The FWHT decomposes a signal into a group of rectangular or square waveforms with binary values of +1 or −1, known as Walsh functions. This generates a unique sequency value assigned to each Walsh function and is used to estimate frequencies in the original signal. The FWHT has the ability to accurately detect signals that contain sharp discontinuities, and it takes less computation time using fewer coefficients.

The FWHT converts a signal from the time domain to the frequency domain and is effective for locating transient events, which might occur before the seizure onset, in both time and frequency domains. It's capable of extracting and highlighting discriminating features of the EEG signal, such as epileptic spikes in the frequency and spectral domains, with greater speed and accuracy. The FWHT of a signal x(t) having length N can be defined as

y_n = (1/N) ∑_{i=0}^{N−1} x_i WAL(n, i),

where i = 0, 1, …, N − 1 and WAL(n, i) are the Walsh functions.

The features extracted from the EEG signal in the form of FWHT coefficients are normalized to remove any possible errors that might occur due to inadequately extracted features. This can be done using the equation

np_i = (y_i − μ) / σ,  ∀ y_i, i = 1, 2, …, n,

where μ and σ are the mean and standard deviation, respectively, over all features.
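As a concrete illustration of this step, the transform and the normalization can be written in a few lines. This is a minimal NumPy sketch rather than the Matlab fwht routine the authors use; it produces Hadamard-ordered coefficients, which differ from the sequency-ordered WAL(n, i) functions only by a permutation, and the file name in the usage comment is hypothetical:

import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a vector whose length is a power of two."""
    y = np.array(x, dtype=float)
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b   # butterfly: sums and differences
        h *= 2
    return y / n                                 # 1/N scaling, as in the definition above

def normalize(coeffs):
    """Z-score normalization np_i = (y_i - mu) / sigma over all coefficients."""
    return (coeffs - coeffs.mean()) / coeffs.std()

# Hypothetical usage on one 4,096-point EEG epoch stored as an ASCII time series:
# epoch = np.loadtxt("S001.txt")
# features = normalize(fwht(epoch))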


Higher-order spectral analysis (HOSA). Seizure prediction of epileptic patients with a higher degree of accuracy and rapid response is a major challenge. In the preictal state, bursts of complex epileptiform discharges occur in the patient's EEG signal. These quantitative EEG changes can be detected by appropriate analysis of EEG signals. Higher-order statistics are widely used for the analysis of EEG and ECG data to diagnose faults in the human body such as tremor, epilepsy, and heart disease.5 However, EEG signals contain significant nonlinear and non-Gaussian features among signal frequency components.6 Existing techniques aren't sufficient in handling these nonlinear and non-Gaussian characteristics. HOSA is able to effectively analyze such signals to diagnose signal abnormalities.

HOSA is the spectral representation of a signal's higher-order cumulants. It's used for power spectral analysis, which is a natural extension to the signal's higher-order powers, and is represented by the average of the signal squared (that is, the second-order moment).7 The higher-order statistics are very useful in handling non-Gaussian random processes and are used to retrieve higher-order cumulants of a signal, even in the presence of artifacts. The bispectrum of a signal x is calculated using the Fourier transform evaluated at f1 and f2 and is defined by the mathematical equation

B(f1, f2) = ∑_i X_i(f1) · X_i(f2) · X_i*(f1 + f2),

where X(f1), X(f2), and X(f1 + f2) represent the power spectral components computed by the fast Fourier transform (FFT) algorithm. The value X*(f) is the conjugate of X(f).

In our study, we used bicoherence, which is a normalized bispectrum and is very useful in analyzing EEG signals. Bispectrum values contain both the amplitude of the signal and the degree of phase coupling, whereas bicoherence values directly represent the degree of phase coupling. The bispectrum is normalized using bicoherence, such that it contains a value between 0 and 1, and is defined by the equation

BIC(f1, f2) = B(f1, f2) / ∑_i P_i(f1) P_i(f2) P_i(f1 + f2),

where P(f1), P(f2), and P(f1 + f2) represent the power spectrum. We use spectral estimation to detect the distribution of the energy contained in a signal, and we use entropy parameters to characterize the irregularity (normal, preictal, and ictal) of the EEG signal. Different statistical characteristics are examined, and the entropy-based parameters listed in Equations 1 through 3 are considered to be the most important and distinctive for seizure state detection. Different entropy values of the normalized bispectrum are evaluated and can be represented mathematically as follows.

Normalized Shannon entropy (E1) is

E1 = −∑_i p_i log p_i,

where

p_i = BIC(f1, f2) / ∑_Ω BIC(f1, f2).   (1)

Log energy entropy (E2) is

E2 = −∑_i log p_i,

where

p_i = |BIC(f1, f2)|² / ∑_Ω |BIC(f1, f2)|².   (2)

The ρ-norm entropy (E3) is

E3 = ∑_i log p_i,

where

p_i = |BIC(f1, f2)|^ρ / ∑_Ω |BIC(f1, f2)|^ρ.   (3)
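To make these definitions concrete, the following NumPy sketch estimates the bicoherence with the direct FFT method and derives the three entropy parameters from it. It's an illustrative simplification of the equations above, not the HOSA toolbox routines (bispecd, bicoher) the authors use; the segment handling and the value of ρ are assumptions:

import numpy as np

def bicoherence(segments, nfft=256):
    """Direct (FFT) estimate of the bicoherence, averaged over data segments."""
    idx = np.arange(nfft)
    f1, f2 = np.meshgrid(idx, idx, indexing="ij")
    f12 = (f1 + f2) % nfft                       # wrap f1 + f2 onto the FFT grid
    B = np.zeros((nfft, nfft), dtype=complex)    # bispectrum accumulator
    D = np.zeros((nfft, nfft))                   # denominator: sum of P(f1) P(f2) P(f1+f2)
    for seg in segments:
        X = np.fft.fft(seg, nfft)
        P = np.abs(X) ** 2                       # power spectrum of this segment
        B += X[f1] * X[f2] * np.conj(X[f12])     # B(f1,f2) = sum_i X_i(f1) X_i(f2) X_i*(f1+f2)
        D += P[f1] * P[f2] * P[f12]
    return np.abs(B) / (D + 1e-12)               # BIC(f1,f2), the normalized bispectrum

def entropy_features(bic, rho=0.5):
    """Entropy parameters E1-E3 of the bicoherence, per Equations 1-3 (rho is assumed)."""
    eps = 1e-12
    p1 = bic / (bic.sum() + eps)
    E1 = -np.sum(p1 * np.log(p1 + eps))                      # normalized Shannon entropy
    p2 = bic ** 2 / (np.sum(bic ** 2) + eps)
    E2 = -np.sum(np.log(p2 + eps))                           # log energy entropy
    p3 = bic ** rho / (np.sum(bic ** rho) + eps)
    E3 = np.sum(np.log(p3 + eps))                            # rho-norm entropy
    return E1, E2, E3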


Table 2. Summary of analyzed EEG recordings. Sets A–D are normal EEG; set E is epileptic EEG.
Queries | Set A | Set B | Set C | Set D | Set E
Patient's type | Nonepileptic | Nonepileptic | Nonepileptic | Nonepileptic | Epileptic
Recording type | Surface | Surface | Surface | Surface | Intracranial
No. epochs | 100 | 100 | 100 | 100 | 100
Epoch duration | 23.6 s | 23.6 s | 23.6 s | 23.6 s | 23.6 s

Feature classification. Once different entropy parameters are extracted from the EEG signal, an automatic classification of features into different seizure states is one of the essential processes of our model. We implemented an unsupervised classification technique because live EEG data coming from patients' mobile phones needs to be analyzed in the cloud in real time, and it's not possible to first label such EEG data with the help of a physician. Thus, these techniques depend only on the information contained in the EEG data, and there's no need for prior knowledge about the data. While several unsupervised classification techniques are available for classifying EEG signals, we adopted the Gaussian process technique based on the Laplace approximation. We made this choice due to the fact that it can be applied to very large databases. In this technique, clustering is generally used prior to classifiers to prepare a training dataset for the classifiers.

The Gaussian process classifier is used to model the three states of seizure class probabilities and is given by the equation

p(Y_i | f_i) = exp(f_{i,c}) (∑_{j=1}^{c} exp(Y_i^T f_i))^{−1},

where f_i = [f_{i,1}, …, f_{i,c}]^T is a vector of the latent function values related to data point i, and Y_i = [y_{i,1}, …, y_{i,c}]^T is the corresponding target vector, which has one entry for the correct class for the observation i and zero entries otherwise.

GPS-Based Location Tracking
The objective of location tracking is to identify the patient's location to provide him or her with immediate treatment whenever a seizure occurs. The mobile phone's GPS function is used to track the patient's location, which is sent to the cloud through the Internet. An alert message is generated before the triggering of the seizure and is sent to the patient's mobile phone, as well as to family members and a nearby hospital.

Experimental Results and Performance Analysis
We conducted different experiments to analyze and classify EEG signals. Our objective was to identify the preictal state so as to provide alerts to the patient before the seizure actually occurs. The EEG recordings used in this experiment were collected from five patients at a sampling rate of 173.61 Hz by using surface electrodes placed on the skull. Each set (A–E) contains 100 files, and each file consists of 4,096 values of one EEG time series in ASCII code. The first four sets (A–D) were obtained from nonepileptic patients. The last set (E) was recorded from an epileptic patient who had seizure activity. Therefore, our experimental data set contains a total of 500 single-channel EEG epochs (windows), out of which 400 are of nonepileptic patients and 100 are of an epileptic patient. Each EEG epoch is 23.6 s long. The recordings were captured using a 128-channel amplifier and converted into digital form at a sampling rate of 173.61 Hz and 12-bit analog/digital resolution. Table 2 shows some details of the EEG recordings related to nonepileptic and epileptic patients.8

We analyzed the EEG signals using Matlab and its toolboxes. We performed our experiments on an Intel i5 CPU at 2.40 GHz with 2 Gbytes of memory running on Windows 7. Our experiment performed the following tasks:

■ EEG signal decomposition,
■ bispectral analysis,
■ feature extraction based on entropy,
■ feature classification,
■ performance analysis on Amazon Elastic Compute Cloud (EC2), and
■ performance comparison.

EEG Signal Decomposition
In the first stage, we applied the FWHT to decompose the signal. We extracted the discriminating features in terms of the frequency and spectral domains. We applied Algorithm 1 to each patient's EEG data file; each file contains 4,096 points and generates 8,192 coefficients. Figure 2 represents the original EEG signal and its FWHT coefficients for a nonepileptic and an epileptic patient.


Input: List of patient directories containing EEG data.
Output: Shannon entropy, log energy entropy, and norm entropy.

Let wh[ ] and np[ ] be one-dimensional matrices that store the FWHT and normalized FWHT coefficients for each sample;
for each patient UID do
    Locate the directory labeled with the UID of the patient registered with the system;
    if EEG data for that UID already exists then
        Replace the existing data with the new data;
    else
        Create a new patient directory with the UID of the patient and store the EEG data;
    end if
    Read EEG data from the directory;
    Compute FWHT coefficients for each EEG sample by invoking the fwht Matlab algorithm and save them in the one-dimensional matrix wh[ ];
    for each FWHT coefficient do
        Find the mean and standard deviation of each FWHT coefficient using the mean() and std() Matlab functions;
        Normalize each FWHT coefficient using the equation np[i] = (wh[i] − mean(wh[i]))/std(wh[i]);
    end for
    Apply the bispecd and bicoher Matlab algorithms to generate the bispectrum and bicoherence, respectively, using the following parameters:
        (a) data vector of FWHT coefficients,
        (b) fast Fourier transform (fft) length,
        (c) window specification for frequency-domain smoothing,
        (d) number of segments per sample, and
        (e) percentage of overlap;
    Compute the Shannon entropy, log energy entropy, and norm entropy of the bicoherence;
end for

Algorithm 1. Feature extraction and selection.
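For readers who prefer a runnable sketch over pseudocode, the loop in Algorithm 1 can be mirrored in Python by chaining the fwht(), normalize(), bicoherence(), and entropy_features() sketches shown earlier. The directory layout, the file format, and the segmentation into 16 segments of 256 points are assumptions; the authors' actual pipeline calls the Matlab fwht, bispecd, and bicoher routines instead:

import os
import numpy as np

def process_patient(uid, data_dir="er_database"):
    """Compute (E1, E2, E3) feature rows for every EEG sample stored under one patient UID."""
    patient_dir = os.path.join(data_dir, str(uid))             # directory labeled with the UID
    features = []
    for fname in sorted(os.listdir(patient_dir)):
        epoch = np.loadtxt(os.path.join(patient_dir, fname))   # one 4,096-point EEG sample
        coeffs = normalize(fwht(epoch))                         # FWHT + z-score normalization
        bic = bicoherence(coeffs.reshape(16, 256))              # direct-FFT bicoherence estimate
        features.append(entropy_features(bic))
    return np.array(features)                                   # rows of (E1, E2, E3)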

Figure 2. Original EEG signal and fast Walsh-Hadamard transform (WHT) coefficients: (a) nonepileptic patient and (b) epileptic patient. The fast WHT coefficients extract the discriminating features of the EEG signal, such as epileptic spikes.


Table 3. Summary statistics for fast WHT coefficients.
Parameter | Ictal | Preictal | Normal
Mean | 1.9187e-017 | 2.28767e-017 | 2.30355e-017
Variance | 0.999878 | 0.999878 | 0.999878
Skewness (normalized) | 22.3652 | 0.505892 | 0.187998
Kurtosis (normalized) | 1,031.16 | 15.8076 | 7.19272
Gaussianity linearity test | 66,108.9418 | 180.1864 | 41.5563
R (estimated) | 2,733.5991 | 15.6339 | 1.266
λ | 2,571.0311 | 7.1698 | 1.5579
R (theory) | 136.8142 | 7.4108 | 3.778
Maximum of bispectrum | 41.0551 | 1,226.2354 | 2,564.5996

Figure 3. Nonparametric bispectrum estimated via the direct (FFT) method: (a) nonepileptic patient and (b) epileptic patient. The bispectrum is capable of retrieving the higher-order cumulants of a signal even in the presence of artifacts (FFT = fast Fourier transform).

One of the major problems of seizure state characterization is identifying whether the process is Gaussian or linear. In our experiment, the Hinich test is applied for detecting the nonskewness and linearity of the process.9 Different statistical parameters, such as mean, variance, skewness, Gaussianity, and so on, based on the FWHT coefficients are evaluated in Table 3.

Bispectral Analysis
Bispectral analysis is a powerful tool for detecting interfrequency phase coherence, which is used for characterization of the EEG signal's different states. The bispectrum is computed for each dataset to perform in-depth analysis of features using the HOSA toolbox in Matlab.10 For this purpose, the normalized point (np) is calculated for each FWHT coefficient (y) by using the equation

np = (y − mean(y)) / std(y),

where the functions mean() and std() are used to calculate the mean and standard deviation, respectively, of each FWHT coefficient. The bispectrum is computed by applying the direct FFT method from the HOSA toolbox to each normalized point. A data vector matrix of size 256 × 256 is obtained. Figure 3 shows the bispectrum of a nonepileptic and an epileptic patient.

The bicoherence, the normalized form of the bispectrum, is estimated using the direct FFT method in the HOSA toolbox. We used bicoherence to quantify the quadratic phase coupling in EEG signals, which is very useful in detecting nonlinear coupling in the time series for the characterization of different seizure states. Figure 4 represents the bicoherence of a nonepileptic and an epileptic patient.

Feature Extraction Based on Entropy
In this stage of the experiment, we determine the best features relevant to the three seizure states (normal, preictal, and ictal) by evaluating different kinds of entropy from the bicoherence. In the seizure recognition, we considered three classes: normal, preictal, and ictal. Hence, we computed three different sets of entropy values for the recognition of the different seizure states. Table 4 shows the mean values of the different seizure states for the three selected features computed on the basis of third-order polyspectra for the five patients.


Figure 4. Bicoherence estimated via the direct (FFT) method: (a) nonepileptic patient and (b) epileptic patient. The bicoherence is used to retrieve different types of entropy values to characterize the different seizure states.

Table 4. Entropy-based coefficients for the three selected features of seizure states.
Feature | Normal | Preictal | Ictal
E1 | 6.4122e+03 | 2.3083e+03 | 476.6912
E2 | 4.6655e+05 | 5.3641e+05 | 6.9407e+05
E3 | 1.3780e+04 | 6.0346e+03 | 1.3214e+03

Feature Classification
The Gaussian process model classifies the data into three classes, where each class represents a different seizure state. Algorithm 2 is designed to continuously take entropy-based features from the feature extraction component and perform an initial classification to detect the different seizure states. After the initial classification, reclassification is performed as soon as new data is received. In our experiment, the three features selected from the 500 samples were used as input to the Gaussian classification model using Weka 3.6. Table 4 shows that the entropy parameters E1, E2, and E3 decrease from the normal state to the preictal state. The values of these parameters reduce significantly in the ictal state. Such variations in entropy parameters help label the different seizure states. Figure 5 depicts the different seizure states.

Performance Analysis Using Amazon EC2
We propose that the application for predicting a patient's seizure before its occurrence be hosted on the cloud so that we could test our model's performance in real time. For this purpose, a general-purpose compute-optimized c4.xlarge single instance, consisting of four high-frequency Intel Xeon E5-2666 v3 (Haswell) processors and 7.5 GiB of RAM with dynamic provisioning offered by Amazon EC2 (http://aws.amazon.com/ec2/instance-types), was used to host the application over the cloud. A Java-based application was designed and installed in the cloud to perform both the feature extraction and feature classification functions. EEG data for five patients isn't sufficient to evaluate our proposed model, so we used a bootstrapping technique11 to replicate the EEG data of these five patients to 50,000 patients randomly, using the coefficients' minimum (20.5920), maximum (25.5656), and mean (0.0122) as the variant. In a 60-minute experiment, the system started with 5,000 patients; then, after each 6-minute duration, the number of patients increased by 5,000.

Input: Entropy parameters and UID of a patient.
Output: Classified or reclassified feature set.

if the patient UID already exists then
    Replace the existing feature set with the new one;
    Execute the Gaussian process for the reclassification of the different seizure states;
else
    Create a new patient directory with the UID number of the patient and store the data;
    Execute the Gaussian process algorithm;
end if
Label the three sets of clusters as normal, preictal, and ictal;

Algorithm 2. EEG signal classification.
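The Gaussian process step itself was run in Weka 3.6; purely as an illustration of the same idea (not the authors' implementation), scikit-learn's GaussianProcessClassifier, which is also based on the Laplace approximation, can be trained on cluster-labeled entropy features. The training rows below reuse the mean values from Table 4, and the incoming feature vector is hypothetical:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One row of (E1, E2, E3) per labeled cluster center; 0 = normal, 1 = preictal, 2 = ictal.
X_train = np.array([[6412.2, 4.6655e5, 1.3780e4],   # normal (Table 4 means)
                    [2308.3, 5.3641e5, 6.0346e3],   # preictal
                    [476.69, 6.9407e5, 1.3214e3]])  # ictal
y_train = np.array([0, 1, 2])

# Standardize the entropy features, then fit a GP classifier (Laplace approximation).
clf = make_pipeline(StandardScaler(),
                    GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)))
clf.fit(X_train, y_train)

new_epoch = np.array([[2500.0, 5.2e5, 6.5e3]])       # hypothetical incoming feature vector
print(clf.predict(new_epoch), clf.predict_proba(new_epoch))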


Figure 5. Gaussian cluster plot based on the three entropy features, E1, E2, and E3. The value of these entropy features decreases from the normal state to the ictal state and can be used to categorize the different seizure states.

Figure 6. Performance analysis of the proposed model on Amazon EC2: (a) resource utilization of the system, (b) response time of the system, and (c) latency time of the system.

Figure 7. Comparative performance of EEG signal analysis on the Amazon EC2 cloud and a desktop computer. The time required for the analysis of EEG data on Amazon EC2 is significantly lower than on the desktop computer.

Figure 6a represents our proposed model's resource utilization for several patients; Figure 6b shows the response time. The system has a lower response time for fewer patients due to the low computational load. Figure 6c shows the model's latency time, which increases with the increase in the number of patients.

We performed a comparative evaluation on a desktop computer and Amazon EC2 with different sets of patients, starting from 5,000 and increasing up to 50,000. Figure 7 shows the execution time to process and classify the EEG data. Results show that the time required for the computation of EEG data on Amazon EC2 is reduced significantly from the time required on the desktop computer.

The accurate classification of patient data to detect different seizure states is a vital step in our proposed model. Different classification algorithms, such as the multilayer perceptron (MPP),12 linear regression (LR),13,14 and least median of squares regression (LMSR),15 were also tested in Weka 3.616 to compare their performance with our proposed Gaussian process. Table 5 shows the summary statistics of the different classification models tested in Weka 3.6.

Table 6 shows the results of the classification accuracy of the Gaussian process classifier. The classifier is able to classify the normal, preictal, and ictal states with an accuracy of 84.20 percent, 86.40 percent, and 89.00 percent, respectively.

Next, we calculated the classification accuracy of detecting the preictal state versus a non-preictal state using the three statistical measures of sensitivity, specificity, and accuracy.17,18 The accuracy of each classification algorithm was tested in Weka 3.6; Table 7 shows the sensitivity, specificity, and accuracy scores. The proposed Gaussian process classification algorithm provides a high sensitivity of 83.6 percent and a high accuracy of 85.1 percent over all other classification models.
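The three measures reduce to simple ratios over the binary (preictal versus non-preictal) confusion counts; a minimal sketch, with hypothetical counts, is:

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from binary confusion counts."""
    sensitivity = tp / (tp + fn)                    # true positive rate
    specificity = tn / (tn + fp)                    # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts for a preictal vs. non-preictal evaluation:
print(binary_metrics(tp=418, fp=105, tn=580, fn=82))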


Moreover, the Gaussian process classifier has a larger area under the receiver operating characteristic (ROC) curve than the other models. It's clear from Table 7 that the Gaussian classifier achieves the highest classification accuracy of 85.1 percent and justifies its use in our proposed model. Figure 8 illustrates the comparison of the accuracy of the different classification algorithms.

Table 5. Performance results of different classifiers tested in Weka 3.6.
Parameters | Gaussian process | Multilayer perceptron | Linear regression | Least median of squares regression
Correlation coefficient | 0.9782 | 0.9364 | 0.9402 | −0.8973
Mean absolute error | 821.1310 | 982.234 | 916.1129 | 6,485.2421
Root-mean-square error | 1,032.1451 | 1,385.2702 | 1,466.9124 | 16,664.1458
Relative absolute error | 22.4512% | 29.4412% | 27.5514% | 192.4591%
Root relative square error | 22.0897% | 28.1602% | 30.6487% | 342.4587%
Total no. instances | 50,000 | 50,000 | 50,000 | 50,000
Time taken | 21 s | 11 s | 16 s | 44 s

Table 6. Classification accuracy of the GP classifier with entropy features (E1, E2, E3).
Categories | No. instances | No. correctly classified instances | Correct classification (%)
Normal | 50,000 | 42,100 | 84.2
Preictal | 50,000 | 43,200 | 86.4
Ictal | 50,000 | 44,500 | 89.0

Table 7. Detailed accuracy of the Gaussian and other models for EEG signal classification.
Classification model | Sensitivity (%) | Specificity (%) | Accuracy (%) | Receiver operating characteristic area
Gaussian process | 83.6 | 16.3 | 85.1 | 0.984
Multilayer perceptron | 78.5 | 21.5 | 80.3 | 0.928
Linear regression | 71.7 | 28.3 | 77.4 | 0.892
Least median of squares regression | 26.6 | 73.4 | 25.2 | 0.464

Figure 8. Performance analysis of the classification algorithms. The accuracy of EEG signal classification for the different algorithms is shown in the graph and varies with the number of patients.

Performance Comparison
We compared our model with the adaptive epileptic seizure prediction system,19 as both are designed for the detection of a seizure's preictal state. The technique of the adaptive epileptic seizure prediction system19 is based on prior knowledge of the occurrence of the first seizure, which can't be used for real-time monitoring. In our proposed model, such a condition isn't required, and it can therefore be used for real-time detection and monitoring of seizures.

An algorithm proposed by Leon Iasemidis and colleagues19 is based on the detection of the critical electrode sites before a seizure. Its reliability and accuracy depend on the probability of detecting the critical electrode sites. Our proposed model takes EEG data from all the electrodes and extracts features relevant to the different seizure states; hence, it's more reliable and accurate.

Iasemidis and colleagues19 analyzed spatiotemporal dynamical characteristics of multichannel intracranial EEG signals by measuring approximations of Lyapunov exponents used to determine the stability of any steady-state behavior. Such dynamic behavior in spatiotemporal patterns of the brain occurs in patients with refractory temporal lobe epilepsy. Our model uses


higher-order statistics to detect and extract features representing the nonlinearity of the EEG signal; hence, its use isn't limited to a specific part of the brain.

Medical data needs to be shared among physicians, healthcare agencies, and other authorized users to provide better treatment and reduced costs. However, the privacy issues associated with sharing such sensitive data are a big concern. Future work will focus on incorporating new data privacy techniques to secure patients' personal and health information.

References
1. M. Bellon, R.J. Panelli, and F. Rillotta, "Epilepsy-Related Deaths: An Australian Survey of the Experiences and Needs of People Bereaved by Epilepsy," Elsevier J. Seizure, vol. 29, 2015, pp. 162–168.
2. B. Litt et al., "Epileptic Seizures May Begin Hours in Advance of Clinical Onset: A Report of Five Patients," Neuron, vol. 30, 2001, pp. 51–64.
3. G. Fortino and M. Pathan, "Integration of Cloud Computing and Body Sensor Networks," Future Generation Computer Systems, vol. 35, 2014, pp. 57–61.
4. B. Javadi, J. Abawajy, and R. Buyya, "Failure-Aware Resource Provisioning for Hybrid Cloud Infrastructure," Parallel Distributed Computing, vol. 72, no. 10, 2012, pp. 1318–1331.
5. J. Jakubowski et al., "Higher Order Statistics and Neural Network for Tremor Recognition," IEEE Trans. Biomedical Eng., vol. 49, no. 2, 2002, pp. 152–159.
6. P. Husar and G. Henning, "Bispectrum Analysis of Visually Evoked Potentials," IEEE Eng. Medicine and Biology, vol. 16, no. 1, 1997, pp. 57–63.
7. C.L. Nikias and J.M. Mendel, "Signal Processing with Higher-Order Spectra," IEEE Signal Processing, vol. 10, no. 3, 1993, pp. 10–37.
8. R.G. Andrzejak et al., "Indications of Nonlinear Deterministic and Finite Dimensional Structures in Time Series of Brain Electrical Activity: Dependence on Recording Region and Brain State," Physical Rev. E, vol. 64, no. 6, 2001, article no. 061907.
9. M.J. Hinich, "Testing for Gaussianity and Linearity of a Stationary Time Series," Time Series Analysis, vol. 3, no. 3, 1982, pp. 169–176.
10. A. Swami, C.M. Mendel, and C.L. Nikias, Higher-Order Spectral Analysis (HOSA) Toolbox, version 2.0.3, 2000; http://in.mathworks.com/matlabcentral/fileexchange/3013-hosa-higher-order-spectral-analysis-toolbox.
11. A. Bao et al., "Helping Mobile Apps Bootstrap with Fewer Users," Proc. 14th Int'l Conf. Ubiquitous Computing, 2012, pp. 1–10.
12. H. Yan et al., "A Multilayer Perceptron-Based Medical Decision Support System for Heart Disease Diagnosis," Expert Systems with Applications, vol. 30, no. 2, 2006, pp. 272–281.
13. D.M. Bates and D.G. Watts, Nonlinear Regression: Iterative Estimation and Linear Approximations, John Wiley & Sons, 1988.
14. M. Koc and A. Barkana, "Application of Linear Regression Classification to Low-Dimensional Datasets," Neurocomputing, vol. 131, 2014, pp. 331–335.
15. P.J. Rousseeuw, "Least Median of Squares Regression," Am. Statistical Assoc., vol. 79, no. 388, 1984, pp. 871–880.
16. M. Hall et al., "The WEKA Data Mining Software: An Update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, 2009, pp. 10–18.
17. A. Subasi, "EEG Signal Classification Using Wavelet Feature Extraction and a Mixture of Expert Model," Expert Systems with Applications, vol. 32, 2007, pp. 1084–1093.
18. P. Baldi et al., "Assessing the Accuracy of Prediction Algorithms for Classification: An Overview," Bioinformatics, vol. 16, no. 5, 2001, pp. 412–424.
19. L.D. Iasemidis et al., "Adaptive Epileptic Seizure Prediction System," IEEE Trans. Biomedical Eng., vol. 50, no. 5, 2003, pp. 616–627.

Sanjay Sareen is a system manager at Guru Nanak Dev University, Amritsar, Punjab, India. His research interests include cloud computing, the Internet of Things, and data security. Sareen is pursuing a PhD in computer applications at I.K. Gujral Punjab Technical University, Kapurthala, Punjab, India. Contact him at sareen.gndu@gmail.com.

Sandeep K. Sood is a professor at Guru Nanak Dev University Regional Campus, Gurdaspur, Punjab, India. His research interests include cloud computing, data security, and big data. Sood has a PhD in computer science and engineering from IIT Roorkee, India. Contact him at san1198@gmail.com.

Sunil Kumar Gupta is an associate professor at Beant College of Engineering and Technology, Gurdaspur, Punjab, India. His research interests include cloud computing, mobile computing, and distributed systems. Gupta has a PhD in computer science from Kurukshetra University, Kurukshetra. Contact him at skgbcetgsp@gmail.com.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


HYBRID SYSTEMS

The Feasibility of Amazon’s Cloud Computing


Platform for Parallel, GPU-Accelerated,
Multiphase-Flow Simulations

Cole Freniere, Ashish Pathak, Mehdi Raessi, and Gaurav Khanna | University of Massachusetts Dartmouth

Amazon's Elastic Compute Cloud (EC2) service could be an alternative computational resource for running MPI-
parallel, GPU-accelerated, multiphase-flow simulations. The EC2 service is competitive with a benchmark cluster
in a certain range of simulations, but there are some performance limitations, particularly in the GPU and cluster
network connection.

Since the 1980s, the US National Science Foundation (NSF) has funded supercomputers for use by scientific researchers and engineers, but continuing this practice today involves many challenges. An interim NSF report published in 2014 made it clear that the high cost of high-end facilities and shrinking NSF resources are compounded by the fact that the computing needs of scientists and engineers are becoming more diverse.1 For example, data analytics is a rapidly growing field that brings with it completely different computing requirements than conventional scientific and engineering simulations. For optimal application performance, a certain system structure is desired, and different disciplines tend to have different optimal systems. Some applications, for example, are shifting from conventional CPUs to heterogeneous parallelized architectures that include GPUs. Cloud computing could be a potential solution to meet these expanding computing needs. Although cloud computing services could be transformative for some fields, there's a high level of uncertainty about the cost tradeoffs, and the options must be evaluated carefully.1

A case in support of cloud computing is that if it's used as an alternative to constructing and maintaining a local high-performance computing cluster (HPCC), it would relieve institutions and companies from the drudgery and cost of building and maintaining their own local HPCCs. Instead, they can simply set up an account and run an application instantly for a fee, with no financial or installation maintenance overhead. Moreover, a cloud service can offer great flexibility: HPC users can outsource their lower-profile jobs to cloud servers and reserve the most critical ones for their local clusters. An additional benefit to using cloud computing is that various machine configurations can be expeditiously tested and explored for benchmarking purposes, which can lead to more appropriate decisions for those planning on building their own HPCC.

However, are the cloud services available today prepared to meet the needs of HPC applications? Is using the cloud a viable alternative to localized, conventional supercomputers? Amazon Web Services (AWS) is one of the most prevalent vendors in the cloud computing market,2 and its computing service, Amazon Elastic Compute Cloud (EC2), offers a variety of virtual computers (www.ec2instances.info). In recent years, several new services tailored toward HPC applications have been released; AWS seems to be an appropriate cloud computing provider to evaluate whether cloud computing is ready for HPC applications. The first work that evaluated Amazon's EC2 service for an HPC application ran coupled atmosphere-ocean climate models and performed standard benchmark tests.3 That work highlighted that the performance was significantly worse in the


cloud than at dedicated supercomputer centers and was only competitive with low-cost cluster systems. The poor performance occurred because latencies and bandwidths were inferior to dedicated centers; the authors recommended that the interconnect network be upgraded to systems such as Myrinet or InfiniBand to be desirable for HPC use. Peter Zaspel and Michael Griebel4 evaluated AWS for their heterogeneous CPU-GPU parallel two-phase flow solver, similar to the solver we present in this article. This work concluded that the cloud was well prepared for moderately sized computational fluid dynamics (CFD) problems for up to 64 cores or 8 GPUs, and that it was a viable and cost-effective alternative to mid-sized parallel computing systems. However, if the cloud cluster was increased to more than eight nodes, network interconnect problems followed. In 2012, Piyush Mehrotra and coworkers5 of the NASA Ames Research Center compared Amazon's performance to their renowned Pleiades supercomputer. For single-node tests, AWS was highly competitive with Pleiades, but for large core counts, it was significantly slower because the Ethernet connection didn't compare well with Pleiades' InfiniBand network. The authors concluded that Amazon's computers aren't suitable for tightly coupled applications, where fast communication is paramount.

Many other studies conducted standard benchmark tests on AWS to compare it to a conventional HPCC and reached similar conclusions. Zach Hill and Marty Humphrey6 concluded that AWS's ease of use and low cost make it an attractive option for HPC, but not for tightly coupled applications. Keith Jackson and coworkers7 ran their own application in addition to standard benchmark tests, and also concluded that AWS isn't suited for tightly coupled applications. Yan Zhai and coworkers8 included a variety of benchmark tests, application tests, and a highly detailed breakdown of the costs associated with the two alternatives, producing a more positive evaluation of AWS than most other studies, but with an admission that the cloud isn't ideal for codes that require many small messages between parallel processes. Aniruddha Marathe and coworkers9 ran benchmark tests and developed a pricing model to evaluate AWS as an alternative to a local cluster on a case-by-case basis but didn't use this model to present quantitative economic results. Overall, the general conclusions regarding cloud computing for HPC applications have evolved over time as the market has developed.

Our work is concerned with evaluating AWS for a GPU-accelerated, multiphase-flow solver—a 3D parallel code. Keeping in mind that Amazon's services are rapidly evolving and that new hardware options are constantly being added, the key question addressed is the following: Is outsourcing HPC workloads to the AWS cloud a viable alternative to using a local, purpose-built HPCC? This question is answered from the perspective of our own research group; broader recommendations are made for other HPC users. Not surprisingly, the answer to this question depends on many factors. We believe this work is the first to comprehensively test the g2.2xlarge GPU instance of AWS for multiphase-flow simulations; it's not limited to just standard benchmark tests.

Amazon Web Services
The user can manage cloud services with the AWS management console via a Web browser or the command line. AWS offers more than 40 different services, but the only ones necessary for our tests were the EC2 service for virtual computer rental and the Simple Storage Service (S3) for data storage. EC2 offers computers (known as instances) with a variety of hardware specifications—the most basic instance is a single-core CPU with 1 Gbyte of RAM, priced at US$0.013 per hour, and the most expensive instance consists of 32 cores with 104 Gbytes of RAM, priced at $6.82 per hour (www.ec2instances.info).

The user must select an Amazon Machine Image (AMI), which includes the operating system and software loaded onto the instance. Several default and community AMIs built by other customers are available. Because default AMIs are very bare-boned, it may be necessary for the user to install several libraries and other software on the instance to run specific applications—we spent considerable time properly configuring the instance for our application. However, once the instance is set up to the user's liking, a new AMI can be saved from that machine and can be used as a template to easily create more instances in the future. This is a critical feature when building clusters of instances.
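This template-based workflow can also be scripted directly against the EC2 API. The following is a minimal sketch using the AWS Python SDK (boto3): it saves a hand-configured instance as a custom AMI and then launches copies of it into a single placement group. The region, instance ID, key-pair, and group names are placeholders, error handling is omitted, and in practice one must wait for the new AMI to become available before launching from it. (Our own clusters were built with the tools described in the next section.)

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshot the hand-configured machine as a reusable template (AMI).
image = ec2.create_image(InstanceId="i-0123456789abcdef0",
                         Name="gpu-flow-solver-template")
# (Waiting for the AMI to reach the 'available' state is omitted here.)

# Instances must share a placement group to sit on the same network fabric.
ec2.create_placement_group(GroupName="flow-solver-cluster", Strategy="cluster")

# Launch an eight-node cluster of g2.2xlarge instances from the saved AMI.
ec2.run_instances(ImageId=image["ImageId"],
                  InstanceType="g2.2xlarge",
                  MinCount=8, MaxCount=8,
                  KeyName="my-keypair",
                  Placement={"GroupName": "flow-solver-cluster"})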


Building a Cluster in the Cloud
For instances to communicate over the same network, they must be launched into the same placement group, which ensures that the requested machines are physically located in the same computing facility. Three notable tools are available:

■ Cloud Formation Cluster (CfnCluster) is offered by AWS for cluster creation, but it's currently limited in its configuration options. Only one default AMI is available for all instance types, and because all custom AMIs must be first constructed off the default AMI, this posed limitations for us.
■ StarCluster was developed by MIT and enables easy configuration and management of clusters. In contrast to CfnCluster, many default AMIs are available for various instances. However, some newer instances aren't supported on StarCluster. We used StarCluster to set up our clusters and found it easy to use.
■ CloudFlu was developed specifically for the CFD program OpenFOAM to provide ease of use to scientists and engineers who are new to the cloud computing environment.

Hardware Specifications of the Benchmark and Amazon Clusters
The University of Massachusetts Dartmouth HPCC is the benchmark cluster for this study. It contains two Intel Xeon (quad-core) E5620 2.4 GHz processors, 24 Gbytes of DDR3 ECC 1333 MHz RAM, one Nvidia Tesla (Fermi) M2050 GPU with 3 Gbytes memory, and an InfiniBand network connection. The instance on AWS, which was the most appropriate for our applications, was the g2.2xlarge instance. It has eight high-frequency Intel Xeon E5-2670 (Sandy Bridge) 2.6 GHz processors, 15 Gbytes RAM, one Nvidia GRID K520 GPU with 1,536 CUDA cores, and 4 Gbytes memory. The Tesla GPU was purpose-built for scientific computing, while the GRID GPU is marketed for high-performance gaming. No numbers for the network connection speed for this instance type are published, but it's claimed to have high networking performance in EC2 listings (www.ec2instances.info). Other instances advertise a 10 Gbytes/s Ethernet connection, such as the g2.8xlarge instance, which is identical to the g2.2xlarge instance except that it has four times as many GPUs, cores, and RAM. However, it was introduced during the time of this study, and building a cluster from it proved to be a large obstacle that wasn't overcome because of the lack of support from StarCluster. CfnCluster successfully launched a cluster of these instances, but the default AMI was too restrictive, and configuring the necessary libraries over the default AMI proved very difficult. Another instance, called cg1.4xlarge, is a cluster GPU instance, but it was first offered in 2010 and is now considered a previous-generation instance and wasn't deemed desirable for our needs.

Multiphase-Flow Solver
The multiphase-flow solver simulates a two-fluid flow interacting with moving rigid solid bodies.10,11 The two fluids are incompressible, immiscible, and Newtonian. The two-step projection method12 is implemented to solve the flow equations. The solution procedure includes a pressure Poisson problem that is solved iteratively at each time step by using a Jacobi preconditioned conjugate gradient method. The pressure solution is the bottleneck of the overall algorithm, taking 60 to 90 percent of the total execution time. To remove this bottleneck, Stephen Codyer and coworkers13 ported the pressure solution to GPUs using MPI parallelism. The pressure solver requires communication between CPUs and GPUs, which is done through the Peripheral Component Interconnect Express (PCIe) bus, peaking at 4 Gbytes/s on the benchmark cluster. Additionally, at the end of each iteration, the pressure solution for the MPI ghost cells is transferred to the neighboring MPI subdomains, constituting the CPU-CPU communication. Consequently, the flow solver's compute time depends heavily on GPU speed, communication time across the GPU device and the CPU, and communication time across different CPUs. We evaluated both the CPU-GPU and CPU-CPU communication times. The benchmarked problem was a freely falling, rigid solid wedge that's released in air and eventually impacts a water free-surface.10

MPI Communication Benchmarks
To determine MPI communication performance for both clusters, we used the Ohio micro-benchmark suite developed by the Ohio State University (mvapich.cse.ohio-state.edu/benchmarks). We conducted point-to-point tests to study latency and bandwidth internodally between any two random nodes in the two clusters. This is representative of the ghost cell data transfer that occurs between MPI subdomains. We also conducted collective latency tests that utilize all nodes in a cluster. The results for these tests are presented in logarithmic scale in Figure 1.


Figure 1. The average (a) point-to-point latency (μs) and (b) point-to-point bandwidth (Mbytes/s) for various message sizes (bytes) between two nodes on each cluster (AWS cluster and the UMD HPCC benchmark cluster). Note that the horizontal axis is logarithmic base 2 and the vertical is logarithmic base 10.

Point-to-point latency tests. Referring to Figure 1a, it's apparent that the latencies are 10 to 40 times larger on AWS than on the benchmark cluster. For small message sizes, the latencies for the benchmark cluster and AWS are 2 and 85 μs, respectively. For applications requiring frequent communication of small messages, a factor-of-40 deficit for AWS can drastically affect performance. However, as message size increases, the disparity isn't quite as large: the latency on AWS is a factor of 10 higher than the benchmark cluster.

Point-to-point bandwidth tests. Figure 1b shows the communication bandwidth between two nodes. The maximum sustained bandwidth rates for AWS and the benchmark cluster are 984 Mbytes/s and 25.6 Gbytes/s, respectively. The bandwidth ranges from 15 to 25 times lower on Amazon, illustrating the difference between the Ethernet connection in the AWS placement group and the InfiniBand connection on the benchmark cluster. Contrary to the latency tests, the bandwidth tests show that AWS suffers more at larger message sizes.

Collective latency tests. The graphs for the collective latency tests aren't presented in this article for brevity; the results are actually similar to the point-to-point tests. For an eight-node cluster on Amazon, the collective test MPI_alltoall approaches latencies of 700 μs, while on the benchmark cluster, it's 80 μs. Such large latencies drastically slow down the flow solver when quantities across multiple processes are collected and summed.

Connection Speed over the Internet from a Local Machine to Instances
For our purposes, it was convenient to simply secure copy (scp) the data directly from Amazon's virtual machines to ours, rather than using S3. The bandwidth fluctuated between 1 and 7 Mbytes/s, which is a reasonable connection. There could be some cases in which the data must persist past the lifetime of the instance, for example, if the output data can't be copied to a local server as quickly as the application produces it.

Performance of MPI-Parallel GPU-Accelerated Code
We tested the flow solver's performance on AWS by simulating a rigid, solid wedge free-falling through air and impacting a water surface.10 The time spent in communication between devices is termed communication overhead, and in the context of weak and strong scaling, we determined it for both CPU-GPU and CPU-CPU communication for various cluster sizes. Typically, when the flow solver is running on a conventional HPCC, about 10 to 25 percent of the execution time is spent just transferring data from the CPU to GPU, and 5 to 10 percent is spent transferring data between CPUs through MPI-parallel calls. Thus, any decrease in communication speed performance on AWS can have a significant impact on overall execution time.

GPU Performance
GPU speed is of great importance and drastically affects execution time. The GPU on the g2.2xlarge instance was found to be about 25 percent slower than the benchmark cluster's GPU. This impediment plays a large role in the results for overall AWS performance.
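The overhead numbers reported below were obtained by differencing timed runs: the pressure solver is iterated 1,000 times with the relevant transfers enabled and then again with them skipped, and the difference is attributed to communication. A schematic of that measurement follows; the solver and transfer routines are placeholders standing in for the real CUDA and MPI calls, which aren't shown here.

import time

def measure_overhead(pressure_iteration, transfer_data, iterations=1000):
    """Return (total_time, communication_overhead) in seconds."""
    def run(with_comm):
        start = time.perf_counter()
        for _ in range(iterations):
            pressure_iteration()          # local pressure-solver work on each rank
            if with_comm:
                transfer_data()           # CPU<->GPU and/or ghost-cell traffic
        return time.perf_counter() - start

    t_with = run(True)
    t_without = run(False)
    return t_with, t_with - t_without

The same harness can be reused for the CPU-GPU and CPU-CPU cases by swapping in the corresponding transfer routine, which mirrors the procedure described in the following subsections.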


Strong Scaling
The simulation tested for strong scaling required nearly all the 15 Gbytes of memory offered by a single g2.2xlarge instance. As Figure 2 shows, the AWS cluster is 25 to 40 percent slower than the benchmark cluster. Note that the speedup is reported relative to one node on the benchmark cluster on a logarithmic scale. For low node counts, the AWS cluster is competitive with the benchmark cluster and is merely 25 percent slower than the benchmark. However, as node count increases, AWS doesn't fare as well: the performance is 30 percent slower than the UMD HPCC for clusters with two nodes or more. The solver's general behavior for strong scaling is as follows: increasing the number of processes for a fixed problem size means less memory per process, that is, the number of cells per process decreases. Memory transfer between processes is directly related to the number of cells per process, and communication time between processes is proportional to the memory that must be transferred. Hence, strong scaling has the advantage of reducing the workload and communication time per process, but it has the disadvantage of requiring a large network size.

Figure 2. Strong scaling speedup for the benchmark cluster at the University of Massachusetts Dartmouth (UMD) and the AWS cluster. Note that the speedup is reported relative to one UMD HPCC node. The vertical axis is logarithmic base 2.

CPU-GPU communication. A single node consists of one CPU (eight processors) and one GPU card. All the pressure field data for eight CPU processes are transferred between the CPU and GPU twice during each iteration of the pressure solver. The pressure solver was set to iterate 1,000 times, and the communication time was determined by modifying the code to either allow communication between the CPU and GPU or not at all. This isolated the communication time between the CPU and GPU. Figure 3 shows the results for communication time in logarithmic scale as the cluster is scaled up. Recognizing that scaling up the cluster decreases the number of cells per process, CPU-GPU communication time decreases accordingly. This behavior is observed for both the benchmark and AWS, implying that network performance from the CPU across the PCIe bus to the GPU is highly competitive between the two clusters.

Figure 3. Strong scaling CPU-GPU communication overhead time in seconds after 1,000 iterations of the pressure solver. Error bars indicate the maximum and minimum data points, and plotted points are averages. The vertical axis is logarithmic base 2.

CPU-CPU communication. MPI communication constitutes the CPU-CPU communication in the solver. The Ohio State MPI benchmarks presented earlier are representative of this CPU-CPU communication. Using a similar procedure as in the CPU-GPU communication evaluation, we calculated the communication overhead: the pressure solver was allowed to iterate 1,000 times, both with and without CPU-CPU communication. One layer of ghost cells is necessary for each shared boundary, so as the number of subdomains increases, the number of ghost cells increases disproportionally. This is one of the limitations of domain decomposition and leads to diminishing returns for each node added. Figure 4 shows the results for these tests. For the benchmark cluster, CPU-CPU communication starts off at 2.7 seconds and drops to less than 1 second very consistently for all subsequent cluster sizes. On the other hand, on AWS, the overhead starts off lower than the benchmark, at 1.3 seconds, but when a second node is added, it steps up dramatically to 4.6 seconds. It's interesting to note that AWS communication time increases with the addition of a second node, whereas the UMD HPCC communication time decreases. The key difference is that the addition of a second node on the AWS cluster requires the use of the Ethernet network, which negatively impacts performance. Another shortfall of AWS is that the performance of its Ethernet network is highly variable, which is visible in the error bars in Figure 4. Even though the CPU-CPU communication time on AWS is higher than the benchmark cluster, the difference isn't significant for this particular application because the time spent in communication is relatively small compared to the total execution time.
Figure 4. Strong scaling CPU-CPU communication overhead time in seconds after 1,000 iterations of the pressure solver. Error bars indicate the maximum and minimum data points, and plotted points are averages. The vertical axis is linear, not logarithmic.

Weak Scaling
Figure 5 shows the results of the weak scaling tests. Note that the scaling is presented relative to one UMD HPCC node. The AWS cluster is 25 to 45 percent slower overall than the benchmark cluster. As the number of nodes increases, AWS becomes progressively slower than the UMD HPCC. For example, AWS is 25 percent slower than the UMD HPCC for single-node test cases, but for high node counts, it becomes 45 percent slower. The 25 percent deficit for AWS for one node is because the GPU is inherently less powerful than the benchmark cluster. However, the increased deficit with large cluster size is due to the poor network communication of AWS relative to the benchmark.

Figure 5. Weak scaling performance for the benchmark cluster at UMD and the AWS cluster. Note that the performance is reported relative to one UMD HPCC node. The ordinate is t1,benchmark/tN, where t1,benchmark is the computation time for one node on the benchmark cluster, and tN is the time taken for a cluster of N nodes.

CPU-GPU communication. We used the same CPU-GPU communication tests that we used for strong scaling for weak scaling. Figure 6 shows the results for CPU-GPU communication time as a function of cluster size. Similar to strong scaling, in weak scaling, Amazon's instances are highly competitive with the benchmark cluster, although they're significantly less consistent. Note that the number of grid points per subdomain remains constant. At each iteration of the pressure solver, the pressure field information is transferred to the GPU device, and the amount of data transferred is proportional to the total number of grid points in the subdomain. These two facts imply that ideally the communication time to the GPU would remain constant because the same amount of data is being transferred for all tests. This inference is accurate for the benchmark cluster: the CPU-GPU communication overhead is consistently around 8 seconds, which is about 15 percent of the total execution time. AWS shows similar behavior, although less consistently. We have no explanation for why CPU-GPU communication increases at seven and eight nodes for AWS.

Figure 6. Weak scaling CPU-GPU communication time in seconds after 1,000 iterations of the pressure solver. Error bars indicate the maximum and minimum data points, and plotted points are averages.

CPU-CPU communication. For weak scaling, the amount of data exchanged between parallel processes is the same regardless of cluster size, so theoretically the only variable from one run to the next is the overhead from increasing the total number of processes. When comparing the two clusters (see Figure 7), drastically different behavior is observed for CPU-CPU communication overhead. For weak scaling, it's at 2 seconds or less on the benchmark cluster, and it remains relatively constant. On AWS, it increases from 1 second for the single-node case all the way up to 8 seconds for the eight-node cluster. Note the variability in AWS performance, represented by the error bars in Figure 7. As previously stated, the Ethernet network on AWS is much slower than the InfiniBand network on the local cluster. However, the time spent in communication between CPUs is still relatively small compared to the time spent in CPU-GPU communication and general computations, so the slow network connection doesn't pose as much of a problem as it presented in previous studies.

Figure 7. Weak scaling CPU-CPU communication time in seconds after 1,000 iterations of the pressure solver. Error bars indicate the maximum and minimum data points, and plotted points are averages.
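For completeness, the ordinates plotted in Figures 2 and 5 are both formed the same way from raw wall-clock times, t1,benchmark/tN; only the way the problem size is held fixed differs (total size for strong scaling, size per node for weak scaling). A trivial helper makes that explicit; the timings in the example are placeholders, not measured values.

def relative_performance(t1_benchmark, times_n):
    """Ratio t1,benchmark / t_N for each node count N (ideal weak-scaling value: 1)."""
    return {n: t1_benchmark / t_n for n, t_n in sorted(times_n.items())}

# Example with made-up timings (seconds) for 1, 2, 4, and 8 nodes.
print(relative_performance(100.0, {1: 130.0, 2: 72.0, 4: 41.0, 8: 24.0}))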


Table 1. Cost comparison between the benchmark local cluster (first two rows) and the AWS cloud HPC with on-demand, one-year, and three-year reservations. The benchmark is considered with and without electricity and maintenance costs.

Option | Total cost | Equivalent cost per node-hour | Useful life
Benchmark, without electricity or maintenance | $6,000 | $0.137 | 5 years
Benchmark, with electricity and maintenance | $10,800 | $0.247 | 5 years
AWS, on-demand | N/A | $0.65 | Hourly
AWS, 1-year reservation | $3,478 | $0.40 | 1 year
AWS, 3-year reservation | $7,410 | $0.282 | 3 years

Cost Analysis
Cost is an important factor in our evaluation of AWS as an alternative to a local, conventional HPCC. Comparing the two alternatives on an hourly or total cost basis doesn't lead to an immediately obvious conclusion because there are many variables that can affect the outcome. We used our local cluster for the cost analysis, and the results can be considered a case study. When building a local HPCC, the upfront cost is very large, but the investment is relatively long term because it can last several years. In addition to the electricity cost, researchers using a local cluster might need to support IT professionals for maintenance services on the cluster. These additional costs over the cluster's lifetime can become significant compared to the cluster's upfront cost. Therefore, we'll make the cost analysis and comparison with AWS both with and without these additional costs in the following sections.

Purchasing cloud services is a fundamentally different approach to doing business. No maintenance or installation is required, and the upfront cost can be eliminated entirely by using an "on-demand" payment method that charges the user by rounding up to the nearest hour of usage time. Customers can commit to a certain amount of reserved hours and pay an upfront cost that will reduce the total cost compared to the on-demand payment method. Pricing on AWS depends mainly on the following factors:

■ compute time, which is the most expensive factor and depends on the instance type as well as usage tier (on-demand or reserved);
■ number of nodes;
■ amount and duration of data storage in the cloud; and
■ amount of data that is transferred from AWS to the Internet.

AWS offers three usage tiers: on-demand, one-year reserved, and three-year reserved. Note that a reserved instance will use the same physical machine for the reservation period. Table 1 presents a cost comparison between the benchmark local cluster and AWS for each usage tier. The benchmark local cluster is considered with and without the additional costs associated with electricity and maintenance by IT professionals. In this analysis, we approximated that the electricity cost per node is $3,200 for a period of five years. We also approximated that 30 percent of a full-time IT professional's time is spent on local cluster maintenance, which would result in $1,600 in maintenance cost per node for a five-year period.

Integration of Performance with Cost
Next, we narrow down our price analysis to price per unit of useful computational work. In other words, how many simulations can be completed on a cost basis? This type of analysis, admittedly, could be highly variable: it depends on the cluster size and the simulation, as well as the cluster's hardware specifications. For the majority of test cases shown in Figures 2 and 5, AWS was about 40 percent slower than the benchmark cluster. Consequently, simulations require roughly 40 percent more time to complete on AWS than the benchmark cluster. To account for this, a weighting factor is applied to the results in Table 1, resulting in the "weighted cost per unit of work" shown in Table 2.

Breakdown of Total Cost
The total cost associated with running the test case simulation on AWS can be modeled by Equation 1. The cost of data storage is $0.03/(Gbyte-month) for both EC2 block storage and S3, while the cost of data transfer is $0.09/Gbyte:


Table 2. Weighted cost per unit of work summary.

Option | Weighted total cost | Weighted unit cost | Useful life
Benchmark, without electricity or maintenance | $6,000 | $0.137 | 5 years
Benchmark, with electricity and maintenance | $10,800 | $0.247 | 5 years
AWS, on-demand | N/A | $0.91 | Hourly
AWS, 1-year reservation | $4,850 | $0.64 | 1 year
AWS, 3-year reservation | $10,300 | $0.37 | 3 years

Table 3. Utilization evaluation.

Scenario | On-demand | 1-yr. reservation | 3-yr. reservation
Percent utilization without electricity or maintenance | 15 | 21 | 37
Percent utilization with electricity and maintenance | 27 | 39 | 67

Cost = (EC2) + (data storage) + (data transfer)
     = (p × t1 × n) + (4.175 × 10^-5 × t2 × x1) + (0.09 × x2),   (1)

where p is the price of the instance ($/node-hour), t1 is the compute time (hours), n is the number of nodes, 4.175 × 10^-5 is the price of data storage ($/Gbyte-hour), t2 is the duration of data storage (hours), x1 is the amount of data stored (Gbytes), 0.09 is the price of data transferred to the Internet ($/Gbyte), and x2 is the amount of data transferred (Gbytes).

Sample Calculation
The computational domain in the test case studied here consisted of 36 million grid points, which required 60 Gbytes of RAM distributed across four nodes. The simulation time was 106 hours on AWS and 71 hours on the benchmark cluster. On AWS, 16 Gbytes of data were stored and transferred from the on-demand instances, which translates into $275 for EC2, $0.08 for data storage, and $1.44 for data transfer.

Clearly, EC2 is by far the largest contributor. On the benchmark cluster, the simulation cost is $39 when the electricity cost and maintenance are neglected and $70 when included. In both cases, running the simulation on the local cluster costs less than on AWS.

Consideration of Percent Utilization
In some cases, a local cluster might not be fully utilized at all times—that is, some nodes might be idle for an extended period of time. The number of nodes actually used on a cluster can be represented by a percent utilization quantity; utilization below 100 percent means that some nodes are paid for but aren't completing useful work. This increases the "weighted cost per unit of work" quantity, which can be quantified by the percentage of a local cluster's utilization. If the local cluster's percent utilization is below a critical value, then using AWS would be more cost-effective. Table 3 presents this critical value for the various AWS pricing options. For example, with the electricity and maintenance costs included, if the local cluster is utilized 27 percent or less, then the AWS on-demand option is more cost-effective. It should be mentioned that if the utilization of a local cluster is expected to be low, users could pool the resource with other local computational research groups, effectively subsidizing the cost and raising the utilization. It's important to note that it's less likely that reserved instances would have 100 percent utilization than on-demand instances, but the calculations for reserved instances are included with 100 percent utilization for consistency.

The percent utilization of our local cluster is much higher than the percentages shown in Table 3. Therefore, AWS isn't a cost-effective option compared to our local cluster. The only AWS option that becomes relatively competitive when the costs associated with electricity and maintenance are included is the three-year reserved instance.
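Equation 1 and the break-even idea behind Table 3 are easy to script. The sketch below is a minimal Python transcription; the break-even rule assumes that the critical utilization is simply the ratio of the local cost per node-hour to the AWS weighted unit cost, an assumption that is consistent with the entries in Table 3, and the sample values mirror the four-node, 106-hour test case described above.

def aws_cost(price_per_node_hour, compute_hours, nodes,
             storage_hours=0.0, gbytes_stored=0.0, gbytes_transferred=0.0):
    """Equation 1: EC2 compute + block/S3 storage + data transfer out, in dollars."""
    ec2 = price_per_node_hour * compute_hours * nodes
    storage = 4.175e-5 * storage_hours * gbytes_stored      # $/(Gbyte-hour)
    transfer = 0.09 * gbytes_transferred                    # $/Gbyte out
    return ec2 + storage + transfer

# Sample calculation: about $275 for EC2, $1.44 for transfer, cents for storage.
total = aws_cost(0.65, compute_hours=106, nodes=4,
                 storage_hours=106, gbytes_stored=16, gbytes_transferred=16)

def breakeven_utilization(local_cost_per_node_hour, aws_weighted_unit_cost):
    """Local-cluster utilization (percent) below which AWS is cheaper."""
    return 100.0 * local_cost_per_node_hour / aws_weighted_unit_cost

print(total)                                 # roughly 277 dollars
print(breakeven_utilization(0.247, 0.91))    # about 27 percent (on-demand, with overheads)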


The performance of our in-house, 3D, MPI-parallel, GPU-accelerated, multiphase-flow solver was assessed on both Amazon's Elastic Compute Cloud service and the local HPC cluster at the University of Massachusetts Dartmouth, which is considered the benchmark. For the type of application that we tested, AWS (g2.2xlarge instance) isn't fully recommended as an alternative to a local HPCC—specifically, we found that the g2.2xlarge instance isn't optimized for GPU-accelerated simulations. In fact, the GPU offered by the Amazon instance is a gaming GPU, which exhibited slower performance than the Tesla GPU on the benchmark cluster. If the GPU offered by Amazon cloud computing was more suitable for HPC, then the results would improve. Additionally, the interconnect for the Amazon instance is an Ethernet connection with about 1 Gbyte/s bandwidth, which is about 25 times slower than the InfiniBand connection used on the benchmark cluster. The Amazon cloud clusters' performance is also highly variable, particularly in the MPI communication across the nodes, which is a serious issue for heavy HPC workloads like ours, which require a heterogeneous CPU-GPU framework and frequent communication. However, our results show that the slow cluster network connection doesn't hinder performance as much as previous studies suggest. Nevertheless, these impediments result in simulations that can take 40 percent longer than on the benchmark cluster. From a cost viewpoint, the only AWS option that comes close to our local cluster when the costs associated with electricity and maintenance are included is the three-year reserved instance. All other AWS options are significantly more expensive than the local cluster.

It should be noted that performance on cloud clusters can vary considerably depending on application and hardware requirements. Members of the HPC community are encouraged to test their own applications on cloud computing services, such as AWS. New instances that could allow HPC users to switch to more powerful instance types are frequently released on cloud computing services. An additional benefit is that cloud computing can be useful for companies or consultants who need quick access to medium-sized GPU clusters like the ones we tested, but as things currently stand, cloud computing probably wouldn't be suitable for researchers and scientists who continuously need to run large-scale simulations for long periods of time. If an HPC user's hardware needs are relatively simple, for instance, if the user doesn't require GPUs or parallel processing, the cloud becomes more appealing. Finally, for those who are planning on building a local HPCC, cloud computing services can be useful for testing various machine configurations for benchmarking purposes, which can lead to more effective decisions concerning future hardware investments.

Acknowledgments
We gratefully acknowledge support from the US National Science Foundation grants CBET-1236462, PHY-1303724, and PHY-1414440, and US Air Force support 10-RI-CRADA-09. We're also grateful to the University of Massachusetts Dartmouth Office of Undergraduate Research for funding this project.

References
1. Nat'l Research Council, Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017–2020: Interim Report, Nat'l Academies Press, 2014.
2. D. Eadline, "Moving HPC to the Cloud," Admin Magazine, 2015; www.admin-magazine.com/HPC/Articles/Moving-HPC-to-the-Cloud.
3. C. Evangelinos and C.N. Hill, "Cloud Computing for Parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2," Proc. 1st Workshop Cloud Computing and Its Applications, 2008.
4. P. Zaspel and M. Griebel, "Massively Parallel Fluid Simulations on Amazon's HPC Cloud," Proc. 1st Int'l Symp. Network Cloud Computing and Applications, 2011, pp. 73–78.
5. P. Mehrotra et al., "Performance Evaluation of Amazon EC2 for NASA HPC Applications," Proc. 3rd Workshop Scientific Cloud Computing, 2012, pp. 41–50.
6. Z. Hill and M. Humphrey, "A Quantitative Analysis of High Performance Computing with Amazon's EC2 Infrastructure: The Death of the Local Cluster?," Proc. 10th IEEE/ACM Int'l Conf. Grid Computing, 2009, pp. 26–33.
7. K. Jackson et al., "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud," Proc. IEEE 2nd Int'l Conf. Cloud Computing Technology and Science, 2010, pp. 159–168.
8. Y. Zhai et al., "Cloud versus In-House Cluster: Evaluating Amazon Cluster Compute Instances for Running MPI Applications," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis, 2011, pp. 1–10.
9. A. Marathe et al., "A Comparative Study of High-Performance Computing on the Cloud," Proc. ACM Symp. High-Performance Parallel and Distributed Computing, 2013, pp. 239–250.
10. A. Pathak and M. Raessi, "A 3D, Fully Eulerian, VOF-Based Solver to Study the Interaction between Two Fluids and Moving Rigid Bodies Using the Fictitious Domain Method," J. Computational Physics, vol. 311, 2016, pp. 87–113.


11. A. Pathak and M. Raessi, "A Three-Dimensional Volume-of-Fluid Method for Reconstructing and Advecting Three-Material Interfaces Forming Contact Lines," J. Computational Physics, vol. 307, 2016, pp. 550–573.
12. A.J. Chorin, "Numerical Solution of the Navier-Stokes Equations," Mathematics of Computation, vol. 22, 1968, pp. 745–762.
13. S. Codyer, M. Raessi, and G. Khanna, "Using Graphics Processing Units to Accelerate Numerical Simulations of Interfacial Incompressible Flows," Proc. ASME Fluid Engineering Conf., 2012, pp. 625–634.

Cole Freniere is pursuing an MS in mechanical engineering at the University of Massachusetts Dartmouth. His research interests include renewable energy, fluid dynamics, and HPC. Specifically, he's interested in the application of advanced computational simulations to aid in the design of ocean wave energy converters. Contact him at cfreniere@umassd.edu.

Ashish Pathak is a PhD candidate in the Engineering and Applied Science program at the University of Massachusetts Dartmouth. His research interests include multiphase flows and their interaction with moving rigid bodies. Contact him at apathak@umassd.edu.

Mehdi Raessi (corresponding author) is an assistant professor in the Mechanical Engineering Department at the University of Massachusetts Dartmouth. His research interests include computational simulations of multiphase flows with applications in energy systems (renewable and conventional), material processing, and microscale transport phenomena. Raessi has a PhD in mechanical engineering from the University of Toronto. Contact him at mraessi@umassd.edu.

Gaurav Khanna is an associate professor in the Physics Department at the University of Massachusetts Dartmouth. His primary research project is related to the coalescence of binary black hole systems using perturbation theory and estimation of the properties of the emitted gravitational radiation. Khanna has a PhD in physics from Penn State University. He's a member of the American Physical Society. Contact him at gkhanna@umassd.edu.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.



COMPUTER SIMULATIONS
Editors: Barry I. Schneider, bis@nist.gov | Gabriel A. Wainer, gwainer@sce.carleton.ca

Massive Computation for Understanding Core-Collapse Supernova Explosions
Christian D. Ott | Caltech

Core-collapse supernova explosions come from stars more massive than 8 to 10 times the mass of the sun. Ten core-collapse supernovae explode per second in the universe—in fact, automated astronomical surveys discover multiple events per night, and one or two explode per century in the Milky Way. Core-collapse supernovae outshine entire galaxies in photons for weeks and output more power in neutrinos than the combined light output of all other stars in the universe, for tens of seconds. These explosions pollute the interstellar medium with the ashes of thermonuclear fusion. From these elements, planets form and life is made. Supernova shock waves stir the interstellar gas, trigger or shut off the formation of new stars, and eject hot gas from galaxies. At their centers, a strongly gravitating compact remnant, a neutron star or a black hole, is formed.

As the name alludes, the explosion is preceded by the collapse of a stellar core. At the end of its life, a massive star has a core composed mostly of iron-group nuclei. The core is surrounded by an onion-skin structure of shells dominated by successively lighter elements. Nuclear fusion is still ongoing in the shells, but the iron core is inert. The electrons in the core are relativistic and degenerate. They provide the lion's share of the pressure support stabilizing the core against gravitational collapse. In this, the iron core is very similar to a white dwarf star, the end product of low-mass stellar evolution. Once the iron core exceeds its maximum mass (the so-called effective Chandrasekhar mass of approximately 1.5 to 2 solar masses [M⦿]), gravitational instability sets in. Within a few tenths of a second, the inner core collapses from a central density of approximately 10^10 g cm^-3 to a density


comparable to that in an atomic nucleus (approximately 2.7 × 10^14 g cm^-3). There, the repulsive part of the nuclear force causes a stiffening of the equation of state (EOS; the pressure–density relationship). The inner core first overshoots nuclear density, then rebounds ("bounces") into the still collapsing outer core. The inner core then stabilizes and forms the inner regions of the newborn protoneutron star. The hydrodynamic supernova shock is created at the interface of inner and outer cores. First, the shock moves outward dynamically. It then quickly loses energy by work done breaking up infalling iron-group nuclei into neutrons, protons, and alpha particles. The copious emission of neutrinos from the hot (T ≈ 10 MeV ≈ 10^11 K) gas further reduces energy and pressure behind the shock. The shock stalls and turns into an accretion shock: the ram pressure of accretion of the star's outer core balances the pressure behind the shock.

The supernova mechanism must revive the stalled shock to drive a successful core-collapse supernova explosion. Depending on the structure of the progenitor star, this must occur within one to a few seconds of core bounce. Otherwise, continuing accretion pushes the protoneutron star over its maximum mass (approximately 2 to 3 M⦿), which results in the formation of a black hole and no supernova explosion. Figure 1 provides a schematic of the core-collapse supernova phenomenon and its outcomes.

Figure 1. Schematic of core collapse and its simplest outcomes. The image shows SN 1987A, which exploded in the Large Magellanic Cloud. (Supernova image © Anglo-Australian Observatory.)

If the shock is successfully revived, it must travel through the outer core and the stellar envelope before it breaks out of the star and creates the spectacular explosive display observed by astronomers on Earth. This could take more than a day for a red supergiant star (such as Betelgeuse, a 20 M⦿ star in the constellation Orion) or just tens of seconds for a star that has been stripped of its extended hydrogen-rich envelope by a strong stellar wind or mass exchange with a companion star in a binary system.

The photons observed by astronomers are emitted extremely far from the central regions, and they carry information on the overall energetics, the explosion geometry, and the products of the explosive nuclear burning triggered by the passing shock wave. They can, however, only provide weak constraints on the inner workings of the supernova. Direct observational information on the supernova mechanism can be gained only from neutrinos and gravitational waves that are emitted directly in the supernova core. Detailed computational models are required for gaining theoretical insight and for making predictions that can be contrasted with future neutrino and gravitational-wave observations from the next core-collapse supernova in the Milky Way.

Supernova Energetics and Mechanisms
Core-collapse supernovae are "gravity bombs." The energy reservoir from which any explosion mechanism must draw is the gravitational energy released in the collapse of the iron core to a neutron star: approximately 3 × 10^53 erg (3 × 10^46 J), a mass-energy equivalent of approximately 0.15 M⦿c^2. A fraction of this tremendous energy is stored initially as heat (and rotational kinetic energy) in the protoneutron


star and the rest comes from its subsequent contraction. Astronomical observations, on the other hand, show the typical core-collapse supernova explosion energy to be in the range 10^50 to 10^51 erg. Hypernova explosions can have up to 10^52 erg, but they make up only about 1 percent of all core-collapse supernovae. A small subset of hypernovae are associated with gamma-ray bursts.

Where does all the gravitational energy that doesn't contribute to the explosion energy go? The answer is neutrinos. Antineutrinos and neutrinos of all flavors carry away ≥99 percent (≥90 percent in the hypernova case) of the available energy over O(10) s as the protoneutron star cools and contracts. This was first theorized and then later observationally confirmed with the detection of neutrinos from SN 1987A, the most recent core-collapse supernova in the Milky Way vicinity.

Because neutrinos dominate the energy transport through the supernova, they might quite naturally have something to do with the explosion mechanism. The neutrino mechanism, in its current form, was proposed by Hans Bethe and Jim Wilson.1 In this mechanism, a fraction (approximately 5 percent) of the outgoing electron neutrinos and antineutrinos is absorbed in a layer between the protoneutron star and the stalled shock. In the simplest picture, this neutrino heating increases the thermal pressure behind the stalled shock. Consequently, the dynamical pressure balance at the accretion shock is violated and a runaway explosion is launched.

The neutrino mechanism fails in spherical symmetry but is very promising in multiple dimensions (axisymmetry [2D], 3D). This is due largely to multidimensional hydrodynamic instabilities that break spherical symmetry (see Figure 2 for an example2), increase the neutrino mechanism's efficiency, and facilitate explosion. I discuss this in more detail later in this article. The neutrino mechanism is presently favored as the mechanism driving most core-collapse supernova explosions (a recent review appears elsewhere3).

Figure 2. Volume rendering of the specific entropy in the core of a neutrino-driven core-collapse supernova at the onset of explosion, based on 3D general-relativistic simulations2 and rendered by Steve Drasco (Cal Poly San Luis Obispo). Specific entropy is a preferred quantity for visualization: in the supernova's core, it typically ranges from 1 to 20 units of Boltzmann's constant kB per baryon. Shown is the large-scale asymmetric shock front and a layer of hot expanding plumes behind it. The physical scale is roughly 600 × 400 km.

Despite its overall promise, the neutrino mechanism is very inefficient. Only about 5 percent of the outgoing total luminosity is deposited behind the stalled shock at any moment, and much of this deposition is lost again as heated gas flows down, leaves the heating region, and settles onto the protoneutron star. The neutrino mechanism may (barely) be able to power ordinary core-collapse supernovae, but it cannot deliver hypernova explosion energies or account for gamma-ray bursts.

An alternative mechanism that could be part of the explanation for such extreme events is the magnetorotational mechanism.4–6 In its modern form, a very rapidly spinning core collapses to a protoneutron star with a spin period of only about 1 millisecond. Its core is expected to be spinning uniformly, but its outer regions will be extremely differentially rotating. These are ideal conditions for the magnetorotational instability (MRI7) to operate, amplify any seed magnetic field, and drive magnetohydrodynamic (MHD) turbulence. If a dynamo process is present, an ultra-strong, large-scale (globally ordered) magnetic field is built up. This makes the protoneutron star a protomagnetar. Provided this occurs, magnetic pressure gradients and hoop stresses could lead to outflows along the axis of rotation. The MRI's fastest growing mode has a small wavelength and is extremely difficult to resolve numerically.

Because of this, all simulations of the magnetorotational mechanism to date have simply made the assumption that a combination of MRI and


dynamo is operating. They then ad hoc impose a strong large-scale field as an initial condition. In 2D simulations, collimated jets develop along the axis of rotation. In 3D, the jets are unstable and a more complicated explosion geometry develops,4 as shown in Figure 3. Nevertheless, even in 3D, an energetic explosion could potentially be powered. The magnetorotational mechanism requires one special property of the progenitor star: rapid core rotation. Currently, stellar evolution theory suggests that the cores of most massive stars should be slowly spinning. However, there could be exceptions of rapidly spinning cores at just about the right occurrence rate to explain hypernovae and long gamma-ray bursts. In addition to the neutrino and magnetorotational mechanisms, several other explosion mechanisms have been proposed. A full review on explosion mechanisms appears elsewhere.3

Figure 3. Volume rendering of the specific entropy in the core of a magnetorotational core-collapse supernova. Bluish colors indicate low entropy, red colors high entropy, and green and yellow intermediate entropy. The vertical is the axis of rotation, and shown is a region of 1,600 × 800 km. The ultra-strong toroidal magnetic field surrounding the protoneutron star pushes hot plasma out along the rotation axis. The distorted, double-lobe structure is due to an MHD kink instability akin to those seen in Tokamak fusion experiments. This figure was first published elsewhere4 and is used with permission.

A Multiscale, Multiphysics, Multidimensional Computational Challenge
The core-collapse supernova problem is highly complex and inherently nonlinear, and it involves many branches of (astro)physics. Only limited progress can be made with analytic or perturbative methods. Computational simulation is a powerful means for gaining theoretical insight and for making predictions that could be tested with astronomical observations of neutrinos, gravitational waves, and electromagnetic radiation.

Core-collapse supernova simulations are time evolution simulations: starting from initial conditions, the matter, radiation, and gravitational fields are evolved in time. In the case of time-explicit evolution, the numerical time step is limited by causality, controlled by the speed of sound in Newtonian simulations and the speed of light in general-relativistic simulations. Because of this, an increase in the spatial resolution by a factor of two corresponds to a decrease in the time step by a factor of two. Hence, in a 3D simulation, the computational cost scales with the fourth power of resolution.
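To make that scaling concrete, the following back-of-the-envelope helper (a sketch in Python, with purely illustrative refinement factors) shows why explicit 3D simulations become expensive so quickly: doubling the resolution multiplies the number of cells by 2^3 and, through the time-step constraint, the number of steps by 2.

def relative_cost(refinement_factor, ndim=3):
    """Cost of a run relative to the baseline, for an explicit ndim-D code."""
    cells = refinement_factor ** ndim        # more cells per snapshot
    steps = refinement_factor                # smaller dt means more time steps
    return cells * steps

for r in (2, 4, 8):
    print(f"{r}x finer grid -> {relative_cost(r):.0f}x the compute cost")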


Multiscale
Taking the red supergiant in Figure 1 as an example, a complete core-collapse supernova simulation that follows the shock to the stellar surface would have to cover dynamics on a physical scale from approximately 10^9 km (stellar radius) down to 0.1 km (the typical scale over which the structure and thermodynamics of the protoneutron star change). These ten orders of magnitude in spatial scale are daunting. In practice, reviving the shock and tracking its propagation to the surface can be treated as (almost) independent problems. If our interest is on the shock revival mechanism, we need to include the inner 10,000 km of the star. Because information about core collapse is communicated to overlying layers with the speed of sound, stellar material at greater radii won't "know" that core collapse has occurred before it's hit by the revived expanding shock.

Even with only five decades in spatial scale, some form of grid refinement or adaptivity is called for: a 3D finite-difference grid with an extent of 10,000 km symmetric about the origin with uniform 0.1 km cell size would require 57 Pbytes of RAM to store a single double-precision variable. Many tens to hundreds of 3D variables are required. Such high uniform resolution is not only currently impossible but also unnecessary. Most of the resolution is needed near the protoneutron star and in the region behind the stalled shock. The near-free-fall collapse of the outer core can be simulated with much lower resolution.

Because of the broad range of physics involved and the limited available compute power, early core-collapse supernova simulations were spherically symmetric (1D). Such 1D simulations often employ a Lagrangian comoving mass coordinate discretization. This grid can be set up to provide just the right resolution where and when needed or can be dynamically re-zoned (an adaptive mesh refinement [AMR] technique). Other 1D codes discretize in the Eulerian frame and use a fixed grid whose cells are radially stretched using geometric progression.

In 2D simulations, Eulerian geometrically spaced fixed spherical grids are the norm, but some codes use cylindrical coordinates and AMR. Spherical grids, already in 2D, suffer from a coordinate singularity at the axis that can lead to numerical artifacts. In 3D, they become even more difficult to handle, and their focusing grid lines impose a severe time step constraint near the origin. Some 3D codes still use a spherical grid, while many others employ Cartesian AMR grids. Recent innovative approaches use so-called multiblock grids with multiple curvilinear touching or overlapping logically Cartesian "cubed-sphere" grids.8

Multiphysics
Core-collapse supernovae are very rich in physics. All fundamental forces are involved and essential to the core-collapse phenomenon. These forces are probed under conditions that are impossible (or exceedingly difficult) to create in Earthbound laboratories.

Gravity drives the collapse and provides the energy reservoir. It's so strong near the protoneutron star that general relativity becomes important and its Newtonian description doesn't suffice. The electromagnetic force describes the interaction of the dense, hot, magnetized, perfectly conducting plasma and the photons that provide thermal pressure and make the supernova light. The weak force governs the interactions of neutrinos, and the strong (nuclear) force is essential in the nuclear EOS and nuclear reactions. All this physics occurs at the microscopic, per-particle level. Fortunately, the continuum assumption holds, allowing us to describe core-collapse supernovae on a macroscopic scale by a coupled set of systems of nonlinear partial differential equations (PDEs).

(Magneto)hydrodynamics (MHD). The stellar plasma is both in local thermodynamic equilibrium, essentially perfectly conducting, and essentially inviscid (although neutrinos might provide some shear viscosity in the protoneutron star). The ideal inviscid MHD approximation is appropriate under these conditions. The MHD equations are hyperbolic and can be written in flux-conservative form with source terms that don't include derivatives of the MHD variables. They are typically solved with standard time-explicit high-resolution shock-capturing methods that exploit the characteristic structure of the equations.9,10 Special attention must be paid to preserving the divergence-free property of the magnetic field. The MHD equations require an EOS as a closure.

Unless ultra-strong (B ≥ 10^15 G), magnetic fields have little effect on the supernova dynamics and thus are frequently neglected. Because strong gravity and velocities up to a few tenths of the speed of light are involved, the MHD equations are best solved in a general-relativistic formulation. General-relativistic MHD is particularly computationally expensive because the conserved variables are not the primitive variables (density, internal energy/temperature, velocity, chemical composition). The latter are needed for the EOS and enter flux terms. After each update, they must be recovered from the conserved variables via multidimensional root finding.

decades looking for ways to solve Einstein's equations on computers.11 In general relativity, changes in the gravitational field propagate at the speed of light. Hence, time evolution equations must be solved. This is done by splitting 4D spacetime into 3D spatial slices that are evolved in the time direction. In the simplest way of writing the equations (the so-called Arnowitt-Deser-Misner [ADM] formulation), they form a system of 12 partial differential evolution equations, 4 gauge variables that must be specified (and evolved in time or recalculated on each slice), and 4 elliptic constraint equations without time derivatives. The ADM formulation has poor numerical stability properties that lead to violations of the constraint equations and numerical instabilities that make long-term evolution impossible.

It took until the late 1990s and the early 2000s for numerical relativity to find formulations of Einstein's equations and gauge choices that together lead to stable long-term evolutions. In some cases, well-posedness and strong or symmetric hyperbolicity can be proven. The equations are typically evolved time-explicitly with straightforward high-order (fourth and higher) finite-difference schemes or with multidomain pseudospectral methods.

Because numerical relativity only recently became applicable to astrophysical simulations, very few core-collapse supernova codes are fully general relativistic at this point.2,12 The fully general-relativistic approach is much more memory and FLOP intensive than solving the Newtonian Poisson equation. Its advantage in large-scale computations, however, is the hyperbolic nature of the equations, which doesn't require global matrix inversions or summations and thus is advantageous for the parallel scaling of the algorithm.

Neutrino transport and neutrino-matter interactions. Neutrinos move at the speed of light (the very small neutrino masses are neglected) and can travel macroscopic distances between interactions. Therefore, they must be treated as nonequilibrium radiation.

Radiation transport is closely related to kinetic theory's Boltzmann equation. It describes the phase-space evolution of the neutrino distribution function or, in radiation transport terminology, their specific intensity. This is a (6+1)-D problem: three spatial dimensions, neutrino energy, and two momentum space propagation angles in addition to time. The angles describe the directions from which neutrinos are coming and where they're going at a given spatial coordinate. In addition, the transport equation must be solved separately for multiple neutrino species: electron neutrinos, electron antineutrinos, and heavy-lepton (μ, τ) neutrinos and antineutrinos.

Figure 4. Map projections of the momentum-space neutrino radiation field (for νe at an energy of 16.3 MeV) going outward radially (from top to bottom) on the equator of a supernova core.9 Inside the protoneutron star (R ≲ 30 km), neutrinos and matter are in equilibrium, and the radiation field is isotropic. It becomes more forward peaked as the neutrinos decouple and become free streaming. Handling the transition from slow diffusion to free streaming correctly requires angle-dependent radiation transport, which is a (6+1)-D problem and computationally extremely challenging. (The panels correspond to radii of 30, 60, 120, 150, and 240 km.)

Figure 4 shows map projections of the momentum space angular neutrino distribution at different
radii in a supernova core. In the dense protoneutron star, neutrinos are trapped and in equilibrium with matter. Their radiation field is isotropic. They gradually diffuse out and decouple from matter at the neutrinosphere (the neutrino equivalent of the photosphere). This decoupling is gradual and marked by the transition of the angular distribution into the forward (radial) direction. In the outer decoupling region, neutrino heating is expected to occur, and the heating rates are sensitive to the angular distribution of the radiation field.9 Eventually, at radii of a few hundred kilometers, the neutrinos have fully decoupled and are free streaming. Neutrino interactions with matter (and thus the decoupling process) are very sensitive to neutrino energy, since weak-interaction cross-sections scale with the square of the neutrino energy.

This is why neutrino transport needs to be multigroup, with typically a minimum of 10 to 20 energy groups covering supernova neutrino energies of 1–O(100) MeV. Typical mean energies of electron neutrinos are around 10 to 20 MeV. Energy exchanges between matter and radiation occur via the collision terms in the Boltzmann equation. These are stiff sources/sinks that must be handled time-implicitly with (local) backward-Euler methods. The neutrino energy bins are coupled through frame-dependent energy shifts. Neutrino-matter interaction rates are usually precomputed and stored in dense multidimensional tables within which simulations interpolate.

Full (6+1)-D general-relativistic Boltzmann neutrino radiation-hydrodynamics is exceedingly challenging and so far hasn't been possible to include in core-collapse supernova simulations, but (3+1)-D (1D in space, 2D in momentum space),13 (5+1)-D (2D in space, 3D in momentum space),9 and static 6D simulations14 have been carried out.

Most (spatially) multidimensional simulations treat neutrino transport in some dimensionally reduced approximation. The most common is an expansion of the radiation field into angular moments. The nth moment of this expansion requires information about the (n + 1)th moment (and in some cases, the (n + 2)th moment as well). This necessitates a closure relation for the moment at which the expansion is truncated. Multigroup flux-limited diffusion evolves the 0th moment (the radiation energy density). The flux limiter is the closure that interpolates between diffusion and free streaming. The disadvantages of this method are its very diffusive nature (washes out spatial variations of the radiation field), its sensitivity to the choice of flux limiter, and the need for time-implicit integration (involving global matrix inversion) due to the stability properties of the parabolic diffusion equation. Two-moment transport is the next better approximation, solving equations for the radiation energy density and momentum (that is, the radiative flux) and requiring a closure that describes the radiation pressure tensor (also known as the Eddington tensor). This closure can be analytic and based on the local values of energy density and flux (the M1 approximation). Alternatively, some codes compute a global closure based on the solution of a simplified, time-independent Boltzmann equation. The major advantage of the two-moment approximation is that its advection terms are hyperbolic and can be handled with standard time-explicit finite-volume methods of computational hydrodynamics, and only the local collision terms need time-implicit updates.

There are now implementations of multigroup two-moment neutrino radiation-hydrodynamics in multiple 2D/3D core-collapse supernova simulation codes.12,15,16 This method could be sufficiently close to the full Boltzmann solution (in particular, if a global closure is used) and appears to be the way toward massively parallel long-term 3D core-collapse supernova simulations.

Neutrino oscillations. Neutrinos have mass and can oscillate between flavors. The oscillations occur in a vacuum but can also be mediated by neutrino-electron scattering (the Mikheyev-Smirnov-Wolfenstein [MSW] effect) and neutrino-neutrino scattering. Neutrino oscillations depend on neutrino mixing parameters and on the neutrino mass eigenstates (the magnitudes of the mass differences are known but not their signs). Observation of neutrinos from the next galactic core-collapse supernova could help constrain the neutrino mass hierarchy.17

MSW oscillations occur in the stellar envelope. They're important for the neutrino signal observed in detectors on Earth, but they can't influence the explosion itself. The self-induced (via neutrino-neutrino scattering) oscillations, however, occur at the extreme neutrino densities near the core. They offer a rich phenomenology that includes collective oscillation behavior of neutrinos.17 The jury's still out on their potential influence on the explosion mechanism.

Collective neutrino oscillation calculations (essentially solving coupled Schrödinger-like equations) are computationally intensive.17 They're currently performed independently of core-collapse
supernova simulations and don’t take into account still be treated as ideal Boltzmann gases (but in-
feedback on the stellar plasma. Fully understand- cluding Coulomb corrections).
ing collective oscillations and their impact on the The nuclear force becomes relevant at den-
supernova mechanism will quite likely require that sities near and above 1010 – 1011 g cm–3. It is an
neutrino oscillations, transport, and neutrino-mat- effective quantum manybody interaction of the
ter interactions are solved for together in a quan- strong force, and its detailed properties presently
tum-kinetic approach.18 aren’t known. Under supernova conditions, mat-
ter will be in NSE in the nuclear regime, and the
Equation of state and nuclear reactions. The EOS is EOS is a function of density, temperature, and Ye.
essential for the (M)HD part of the problem and for Starting from a nuclear force model, an EOS can
updating the matter thermodynamics after neutri- be obtained in multiple ways,19 including direct
no-matter interactions. Baryons (protons, neutrons, Hartree-Fock manybody calculations, mean field
alpha particles, heavy nuclei), electrons, positrons, models, or phenomenological models (such as the
and photons contribute to the EOS. Neutrino mo- liquid-drop model).
mentum transfer contributes an effective pressure Typically, the minimum of the Helmholtz free
that is taken into account separately because neu- energy is sought and all thermodynamic variables
trinos are not everywhere in local thermodynamic are obtained from derivatives of the free energy.
equilibrium with the stellar plasma. In different In most cases, EOS calculations are too time-con-
parts of the star, different EOS physics applies. suming to be performed during a simulation. As in
At low densities and temperatures below ap- the case of the electron/positron EOS, large (more
proximately 0.5 MeV, nuclear reactions are too than 200 Mbytes must be stored by each MPI pro-
slow to reach nuclear statistical equilibrium. In this cess), densely spaced nuclear EOS tables are pre-
regime, the mass fractions of the various heavy nu- computed and simulations efficiently interpolate
clei (isotopes, in the following) must be tracked ex- in (log U, log T, Ye) to obtain thermodynamic and
plicitly. As the core collapses, the gas heats up and compositional information.
nuclear burning must be tracked with a nuclear re-
action network, a stiff system of ODEs. Solving the Multidimensionality
reaction network requires the inversion of sparse Stars are, at zeroth order, gas spheres. It’s thus nat-
matrices at each grid point. Depending on the ural to start with assuming spherical symmetry in
number of isotopes tracked (ranging typically from simulations—in particular, given the very limited
O(10) to O(100)), nuclear burning can be a signifi- compute power available to the pioneers of super-
cant contributor to the overall computational cost nova simulations. After decades of work, it now ap-
of a simulation. The EOS in the burning regime pears clear that detailed spherically symmetric simu-
is simple: all isotopes can essentially be treated as lations robustly fail at producing explosions for stars
noninteracting ideal Boltzmann gases. Often, cor- that are observed to explode in nature. Spherical
rections for Coulomb interactions are included. symmetry itself could be the culprit because sym-
Photons and electrons/positrons can be treated ev- metry is clearly broken in core-collapse supernovae:
erywhere as ideal Bose and Fermi gases, respective-
ly. Because electrons will be partially or completely ■ Observations show that neutron stars receive
degenerate, computing the electron/positron EOS “birth kicks,” giving them typical velocities of
involves the FLOP-intensive solution of Fermi inte- O(100) km s–1 with respect to the center of mass
grals. Because of this, their EOS is often included of their progenitors. The most likely and straight-
in tabulated form. forward explanation for these kicks is that highly
At temperatures above 0.5 MeV, nuclear sta- asymmetric explosions lead to neutron star re-
tistical equilibrium holds. This greatly simplifies coil, owing to momentum conservation.
things, since now the electron fraction Ye (number ■ Deep observations of supernova remnants
of electrons per baryon; because of macroscopic show that the innermost supernova ejecta ex-
charge neutrality, Ye is equal to Yp, the number hibit low-mode asphericity similar to the ge-
fraction of protons) is the only compositional vari- ometry of the shock front shown in Figure 2.
able. The mass fractions of all other baryonic spe- ■ Analytic considerations as well as 1D core-col-
cies can be obtained by solving Saha-like equations lapse simulations show that the protoneutron
for compositional equilibrium. At densities below star and the region behind the stalled shock
approximately 1010 – 1011 g cm–3, the baryons can where neutrino heating takes place are both

unstable to buoyant convection, which always leads to the breaking of spherical symmetry.
■ Rotation and magnetic fields naturally break spherical symmetry. Observations of young pulsars show that some neutron stars must be born with rotation periods on the order of 10 milliseconds. Magnetars could be born with even shorter spin periods if their magnetic field is derived from rapid differential rotation.
■ Multidimensional simulations of the violent nuclear burning in the shells overlying the iron core show that large-scale deviations from sphericity develop that couple into the precollapse iron core via the excitation of nonradial pulsations.20 These create perturbations from which convection will grow after core bounce.

Given the above, multidimensional simulations are essential for studying the dynamics of the supernova engine. The rapid increase of compute power since the early 1990s has facilitated increasingly detailed 2D radiation-hydrodynamics simulations over the past two and a half decades. Three-dimensional simulations with simplified neutrino treatments have been carried out since the early 2000s. The first 3D neutrino radiation-hydrodynamics simulations have become possible only in the past few years, thanks to the compute power of large petascale systems like the US-funded Blue Waters and Titan, and the Japanese K computer.

Core-Collapse Supernova Simulation Codes

Many 1D codes exist, some are no longer in use, and one is open source and free to download (http://GR1Dcode.org). There are approximately 10 (depending on how you count them) multidimensional core-collapse supernova simulation codes in the community. Many, in particular the 3D codes, follow the design encapsulated in Figure 5. They employ a simulation framework (such as FLASH, http://flash.uchicago.edu/site/flashcode, or Cactus, http://cactuscode.org) that handles domain decomposition, message passing, memory management, AMR, coupling of different physics components, execution scheduling, and I/O.

Figure 5. Multiphysics modules of core-collapse supernova simulation codes. The simulation framework provides parallelization, I/O, execution scheduling, AMR, and memory management. [The diagram shows the simulation framework (AMR, memory, coupling, scheduling, communication, I/O) connecting the hydrodynamics/MHD, gravity, equation of state/nuclear reactions, and neutrino transport and interactions components.]

Given the tremendous memory requirement and FLOP consumption of the core-collapse supernova problem, these codes are massively parallel and employ both node-local OpenMP and internode MPI parallelization. All current codes follow a data-parallel paradigm with monolithic sequential scheduling. However, this limits scaling, can create load imbalances with AMR, and makes the use of GPU/MIC accelerators challenging because communication latencies between accelerator and CPU block execution in the current paradigm.

The Caltech Zelmani2 core-collapse simulation package is an example of a 3D core-collapse supernova code. It is based on the open source Cactus framework, uses 3D AMR Cartesian and multiblock grids, and employs many components provided by the open source Einstein Toolkit (http://einsteintoolkit.org). Zelmani has fully general-relativistic gravity and implements general-relativistic MHD. Neutrinos are included either via a rather crude energy-averaged leakage scheme that approximates the overall energetics of neutrino emission and absorption or via a general-relativistic two-moment M1 radiation-transport solver that has recently been deployed on first simulations.16

In full radiation-hydrodynamics simulations of the core-collapse supernova problem with eight levels of AMR, Zelmani exhibits good strong scaling with hybrid-OpenMP/MPI to 16,000 cores on Blue Waters. At larger core counts, load imbalances due to AMR prolongation and synchronization operations begin to dominate the execution time.

Multidimensional Dynamics and Turbulence

Even before the first detailed 2D simulations of neutrino-driven core-collapse supernovae became possible in the mid-1990s, it was clear that buoyant convection in the protoneutron star and in the neutrino-heated region just behind the stalled shock breaks spherical symmetry. Neutrino-driven convection is due to a negative radial gradient in the specific entropy, making the plasma at smaller radii "lighter" than overlying plasma. This is a simple consequence of neutrino heating being strongest at the base of the heating region.
Rayleigh-Taylor-like plumes develop from small perturbations and grow to nonlinear convection. This convection is extremely turbulent because the physical viscosity in the heating region is vanishingly small. Neutrino-driven turbulence is anisotropic on large scales (due to buoyancy), mildly compressible (the flow reaches Mach numbers of approximately 0.5), and only quasi-stationary because an explosion eventually develops. Nevertheless, it turns out that Kolmogorov's description for isotropic, stationary, incompressible turbulence works surprisingly well for neutrino-driven turbulence (see Figure 6).

There is something special about neutrino-driven convection in core-collapse supernovae: unlike convection in globally hydrostatic stars, neutrino-driven convection occurs on top of a downflow of outer core material that has accreted through the stalled shock and is headed for the protoneutron star. The consequence of this is that there is a competition between the time it takes for a small perturbation to grow to macroscopic scale to become buoyant and the time it takes for it to leave the region that is convectively unstable (the heating region) as it is dragged with the background flow toward the protoneutron star. This means that there are three parameters governing the appearance of neutrino-driven convection: the strength of neutrino heating, the initial size of perturbations entering through the shock, and the downflow rate through the heating region. Because of this, neutrino-driven convection is not a given, and simulations find that it does not develop in some stars.

But even in the absence of neutrino-driven convection, there is another instability that breaks spherical symmetry in the supernova core: the standing accretion shock instability (SASI).3 SASI was first discovered in simulations that did not include neutrino heating. It works via a feedback cycle: small perturbations enter through the shock, flow down to the protoneutron star, and get reflected as sound waves that in turn perturb the shock. The SASI is a low-mode instability that is most manifest in an up-down sloshing (l = 1 in terms of spherical harmonics) along the symmetry axis in 2D and in a spiral mode (m = 1) in 3D. Once it has reached nonlinear amplitudes, the SASI creates secondary shocks (entropy perturbations) and shear flow from which turbulence develops. SASI appears to dominate in situations in which neutrino-driven convection is weak or absent: in conditions where neutrino heating is weak, the perturbations entering the shock are small, or the downflow rate through the heating region is high.

Figure 6. Schematic view of turbulence: kinetic energy is injected into the flow at large scales and cascades through the inertial range via nonlinear interactions of turbulent eddies to small scales (high wave numbers in the spectral domain) where it dissipates into heat. The scaling of the turbulent kinetic energy with wavenumber in the inertial range is ∝ k^(-5/3) for Kolmogorov turbulence. This scaling is also found in very high-resolution simulations of neutrino-driven convection.

Independent of how spherical symmetry is broken in the heating region, all simulations agree that 2D/3D is much more favorable for explosion than 1D. Some 2D and 3D simulations yield explosions for stars where 1D simulations fail.21 Why is that?

The first reason has been long known and is seemingly trivial: the added degrees of freedom, lateral motion in 2D, and lateral and azimuthal motion in 3D all have the consequence that a gas element that enters through the shock front spends more time in the heating region before flowing down to settle onto the protoneutron star. Because it spends more time in the heating region, it can absorb more neutrino energy, increasing the neutrino mechanism's overall efficiency.

The second reason has to do with turbulence and has become apparent only in the past few years. Turbulence is often analyzed employing Reynolds decomposition, a method that separates background flow from turbulent fluctuations. Using this method, we can show that turbulent fluctuations lead to an effective dynamical ram pressure (Reynolds stress) that contributes to the overall momentum balance between behind and in front of the stalled shock. The turbulent pressure is available only in 2D/3D simulations, and it has been demonstrated22 that because of this pressure, 2D/3D core-collapse supernovae explode with less thermal pressure and, consequently, with less neutrino heating.
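Schematically, and in generic notation rather than that of any particular simulation code, the decomposition referred to here splits each flow variable into a mean and a fluctuating part, and the correlation of the fluctuations is what enters the momentum balance as the extra, ram-pressure-like term:

$$ v_i = \langle v_i \rangle + v_i', \qquad R_{ij} = \langle \rho\, v_i'\, v_j' \rangle , $$

where the angle brackets denote the background (mean) flow, the primes the turbulent fluctuations, and the radial-radial component of R acts alongside the thermal pressure behind the stalled shock.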

Now, the Reynolds stress is dominated by turbulent fluctuations at the largest physical scales: a simulation that has more kinetic energy in large-scale motions will explode more easily than a simulation that has less. This realization readily explains recent findings by multiple simulation groups, namely, that 2D simulations appear to explode more readily than 3D simulations.21,22 This is likely a consequence of the different behaviors of turbulence in 2D and 3D. In 2D, turbulence transports kinetic energy to large scales (which is unphysical), artificially increasing the turbulent pressure contribution. In 3D, turbulence cascades energy to small scales (as it should and is known experimentally), so a 3D supernova will generally have less turbulent pressure support than a 2D supernova.

Another recent finding by multiple groups is that simulations with lower spatial resolution appear to explode more readily than simulations with higher resolution. There are two possible explanations for this, and it is likely that they play hand-in-hand: one, low resolution creates a numerical bottleneck in the turbulent cascade, artificially trapping turbulent kinetic energy at large scales where it can contribute most to the explosion, and two, low resolution also increases the size of numerical perturbations that enter through the shock and from which buoyant eddies form. The larger these seed perturbations are, the stronger is the turbulent convection and the larger is the Reynolds stress.

The qualitative and quantitative behavior of turbulent flow is very sensitive to numerical resolution. This can be appreciated by looking at Figure 7, which shows the same 3D simulation of neutrino-driven convection at four different resolutions, spanning a factor of 12 from the reference resolution that is presently used in many 3D simulations and which underresolves the turbulent flow. As resolution is increased, turbulent flow breaks down to progressively smaller features. What also occurs, but cannot be appreciated from a still figure, is that the intermittency of the flow increases as the turbulence is better resolved. This means that flow features are not persistent but quickly appear and disappear through nonlinear interactions of turbulent eddies. In this way, the turbulent cascade can be temporarily reversed (this is called backscatter in turbulence jargon), creating large-scale intermittent flow features similar to what is seen at low resolution. The role of intermittency in neutrino-driven turbulence and its effect on the explosion mechanism remain to be studied.

Figure 7. Slices from four semiglobal 3D simulations of neutrino-driven convection with parameterized neutrino cooling and heating, carried out in a 45° wedge (panels: reference, 2×, 6×, and 12× resolution). The color map is the specific entropy; blue colors mark low-entropy regions, red corresponds to high entropy. Only the resolution is varied. The wedge marked "ref." is the reference resolution (Δr = 3.8 km, Δθ = Δφ = 1.8°) that corresponds to the resolution of present global 3D detailed radiation-hydrodynamics core-collapse supernova simulations. Note how low resolution favors large flow features and how the turbulence breaks down to progressively smaller features with increasing resolution. This figure includes simulations up to 12 times the reference resolution that were run on 65,536 cores of Blue Waters. Rendered by David Radice (Caltech).

A key challenge for 3D core-collapse supernova simulations is to provide sufficient resolution so that kinetic energy cascades away from the largest scales at the right rate. Resolution studies suggest that this could require between 2 and 10 times the resolution of current 3D simulations. A 10-fold increase in resolution in 3D corresponds to a 10,000 times increase in computational cost. An alternative could be to devise an efficient subgrid model that, if included, provides the correct rate of energy transfer to small scales. Work in that direction is still in its infancy in the core-collapse supernova context.

Making Magnetars: Resolving the Magnetorotational Instability

The magnetorotational mechanism relies on the presence of an ultra-strong (10^15 to 10^16 G) global, primarily toroidal, magnetic field around the protoneutron star. Such a strongly magnetized protoneutron star is called a protomagnetar. It has been theorized that the MRI7 could generate a very strong local magnetic field that could be transformed into a global field by a dynamo process. While appealing, it was not at all clear that this is what happens.

The physics is fundamentally global and 3D, and global 3D MHD simulations with sufficient resolution to capture MRI-driven field growth were impossible to perform for core-collapse supernovae.

This changed with the advent of Blue Waters–class petascale supercomputers and is a testament to how increased compute power and capability systems like Blue Waters facilitate scientific discovery. Our group at Caltech carried out full-physics 3D global general-relativistic MHD simulations of 10 milliseconds of a rapidly spinning protoneutron star's life, starting shortly after core bounce.23 We cut out a central octant (with appropriate boundary conditions) from another, lower-resolution 3D AMR simulation, and covered a 3D region of 140 × 70 × 70 km with uniform resolution. We performed four simulations to study the MHD dynamics at resolutions of 500 m (approximately 2 points per MRI wavelength), 200 m, 100 m, and 50 m (approximately 20 points per MRI wavelength). Because we employed uniform resolution and no AMR, the simulations showed excellent strong scaling. The 50 m simulation was run on 130,000 Blue Waters cores and consumed roughly 3 million Blue Waters node hours (approximately 48 million CPU hours).

Our simulations with 100 m and 50 m resolution resolve the MRI and show exponential growth of the magnetic field. This growth saturates at small scales within a few milliseconds and is consistent with what we anticipate on the basis of analytical estimates. The MRI drives the MHD turbulence that is most prominent in the layer of greatest rotational shear, just outside of the protoneutron star core at radii of 20 to 30 km. What we did not anticipate is that in the highest-resolution simulation (which resolves the turbulence best), an inverse turbulent cascade develops that transports magnetic field energy toward large scales. It acts as a large-scale dynamo that builds up a global, primarily toroidal field, just in the way needed to power a magnetorotational explosion. Figure 8 shows the final toroidal magnetic field component in our 50 m simulation after 10 ms of evolution time. Regions of strongest positive and negative magnetic field are marked by yellowish and light blue colors, respectively, and are just outside the protoneutron star core. At the time shown, the magnetic field on large scales has not yet reached its saturated state. We expect this to occur after approximately 50 ms, which could not be simulated.

Figure 8. Visualization by Robert R. Sisneros (NCSA) and Philipp Mösta (UC Berkeley) of the toroidal magnetic field built up by an inverse cascade (large-scale dynamo) from small-scale magnetoturbulence in a magnetorotational core-collapse supernova. Shown is a 140 × 70 km 3D octant region with periodic boundaries on the x-z and y-z faces. Regions of strongest positive and negative magnetic field are marked by light blue and yellowish colors. Dark blue and dark red colors mark regions of weaker negative and positive magnetic field.23

Our results suggest that the conditions necessary for the magnetorotational mechanism are a generic outcome of the collapse of rapidly rotating cores. The MRI is a weak field instability and will grow to the needed saturation field strengths from any small seed magnetic field. The next step is to find a way to simulate for longer physical time and with a larger physical domain. This will be necessary to determine the long-term dynamical impact of the generated large-scale magnetic field. Such simulations will require algorithmic changes to improve parallel scaling and facilitate the efficient use of accelerators; they could even require larger and faster machines than Blue Waters.
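As a rough consistency check, the resolution and cost figures quoted in this section hang together; the short script below simply recomputes them from the values given in the text, with one stated assumption (an MRI wavelength of about 1 km, implied by the quoted points-per-wavelength counts).

```python
# Back-of-the-envelope check of the quoted simulation figures.
# Assumption: an MRI wavelength of ~1 km, implied by "approximately
# 2 points per MRI wavelength" at 500 m and "approximately 20" at 50 m.
mri_wavelength_m = 1000.0

for dx_m in (500.0, 200.0, 100.0, 50.0):   # the four grid spacings used
    print(f"dx = {dx_m:5.0f} m -> ~{mri_wavelength_m / dx_m:4.1f} points per MRI wavelength")

cpu_hours = 48e6        # quoted CPU hours for the 50 m run
node_hours = 3e6        # quoted Blue Waters node hours
cores = 130_000         # Blue Waters cores used
print(f"cores per node implied by the two cost figures: {cpu_hours / node_hours:.0f}")
print(f"implied wall-clock time: {cpu_hours / cores:.0f} hours (~{cpu_hours / cores / 24:.0f} days)")
```

Roughly two weeks of wall-clock time on 130,000 cores for 10 ms of physical evolution is what makes extending such a run toward the estimated ~50 ms saturation time impractical without the algorithmic changes mentioned above.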

Core-collapse supernova theorists have always been among the top group of users of supercomputers. The CDCs and IBMs of the 1960s and 1970s, the vector Crays of the 1970s to 1990s, the large parallel scalar architectures of the 2000s, and the current massively parallel SIMD machines all paved the path of progress for core-collapse supernova simulations.

Today's 3D simulations are rapidly improving in their included macroscopic and microscopic physics. They are beginning to answer decades-old questions and are allowing us to formulate new ones. There is still much need for improvement, which will come at no small price in the post–Moore's law era of heterogeneous supercomputers.

One important issue that the community must address is the reproducibility of simulations and the verification of simulation codes. It still occurs more often than not that different codes starting from the same initial conditions and implementing nominally the same physics arrive at quantitatively and qualitatively different outcomes. In the mid-2000s, an extensive comparison of 1D supernova codes provided results that are still being used as benchmarks today.13 Efforts are now underway that will lead to the definition of multidimensional benchmarks. In addition to code comparisons, the increasing availability of open source simulation codes and routines for generating input physics (such as neutrino interactions) is furthering reproducibility. Importantly, these open source codes now allow new researchers to enter the field without the need to spend many years developing basic simulation technology that already exists.

Core collapse is, in essence, an initial value problem. Current simulations, even those in 3D, start from spherically symmetric precollapse conditions from 1D stellar evolution codes. However, stars rotate, and convection in the layers surrounding the inert iron core is violently aspherical. These asphericities have an impact on the explosion mechanism. For 3D core-collapse supernova simulations to provide robust and reliable results, the initial conditions must be reliable and robust, and will likely require simulating the final phases of stellar evolution in 3D,20 which is another multidimensional, multiscale, multiphysics problem.

Neutrino quantum-kinetics for including neutrino oscillations directly into simulations will be an important but exceedingly algorithmically and computationally challenging addition to the simulation physics. Formalisms for doing so are under development, and first implementations (in spatially 1D simulations) could be available in a few years. A single current top-of-the-line 3D neutrino radiation-hydrodynamics simulation can be carried out to approximately 0.5 to 1 second after core bounce at a cost of several tens of millions of CPU hours, but it still underresolves the neutrino-driven turbulence. What is needed now are many such simulations for studying sensitivity to initial conditions such as rotation and progenitor structure and input physics. These simulations should be at higher resolution and carried out for longer so that the longer-term development of the explosion (or collapse to a black hole) and, for example, neutron star birth kicks can be reliably simulated.

Many longer simulations at higher resolution will require much more compute power than is currently available. The good news is that the next generation of petascale systems and, certainly, exascale machines in the next decade will provide the necessary FLOPS. The bad news: the radical and disruptive architectural changes necessary on the route to exascale will require equally disruptive changes in supernova simulation codes. Already at petascale, the traditional data-parallel, linear/sequential execution model of all present supernova codes is the key limiting factor of code performance and scaling. A central issue is the need to communicate many boundary points between subdomains for commonly employed high-order finite-difference and finite-volume schemes. With increasing parallel process count, communication eventually dominates over computation in current supernova simulations.

Because latencies cannot be hidden, efficiently offloading data and tasks to accelerators in heterogeneous systems is difficult for current supernova codes. The upcoming generation of petascale machines such as Summit and Sierra fully embraces heterogeneity. For exascale machines, power consumption will be the driver of computing architecture. Current Blue Waters already draws approximately 10 MW of power, and there is not much upward flexibility for future machines. Unless there are unforeseen breakthroughs in semiconductor technology that provide increased single-core performance at orders of magnitude lower power footprints, exascale machines will likely be all-accelerator with hundreds of millions of slow, highly energy-efficient cores.

Accessing the compute power of upcoming petascale and exascale machines requires a radical
departure from current code design and major code development efforts. Several supernova groups are exploring new algorithms, numerical methods, and parallelization paradigms. Discontinuous Galerkin (DG) finite elements24 have emerged as a promising discretization approach that guarantees high numerical order while minimizing the amount of subdomain boundary information that needs to be communicated between processes. In addition, switching to a new, more flexible parallelization approach will likely be necessary to prepare supernova codes (and other computational astrophysics codes solving similar equations) for exascale machines. A prime contender being considered by supernova groups is task-based parallelism, which allows for fine-grained dynamical load balancing and asynchronous execution and communication. Frameworks that can become task-based backbones of future supernova codes already exist, such as Charm++ (http://charm.cs.illinois.edu/research/charm), Legion (http://legion.stanford.edu/overview), and Uintah (http://uintah.utah.edu).

Acknowledgments
I acknowledge helpful conversations with and help from Adam Burrows, Sean Couch, Steve Drasco, Roland Haas, Kenta Kiuchi, Philipp Mösta, David Radice, Luke Roberts, Erik Schnetter, Ed Seidel, and Masaru Shibata. I thank the Yukawa Institute for Theoretical Physics at Kyoto University for hospitality while writing this article. This work is supported by the US National Science Foundation (NSF) under award numbers CAREER PHY-1151197 and TCAN AST-1333520, and by the Sherman Fairchild Foundation. Computations were performed on NSF XSEDE under allocation TG-PHY100033 and on NSF/NCSA Blue Waters under NSF PRAC award number ACI-1440083. Movies of simulation results can be found on www.youtube.com/SXSCollaboration.

References
1. H.A. Bethe and J.R. Wilson, "Revival of a Stalled Supernova Shock by Neutrino Heating," Astrophysical J., vol. 295, Aug. 1985, pp. 14–23.
2. C.D. Ott et al., "General-Relativistic Simulations of Three-Dimensional Core-Collapse Supernovae," Astrophysical J., vol. 768, May 2013, article no. 115.
3. H.-T. Janka, "Explosion Mechanisms of Core-Collapse Supernovae," Ann. Rev. Nuclear and Particle Science, vol. 62, Nov. 2012, pp. 407–451.
4. P. Mösta et al., "Magnetorotational Core-Collapse Supernovae in Three Dimensions," Astrophysical J. Letters, vol. 785, Apr. 2014, article no. L29.
5. G.S. Bisnovatyi-Kogan, "The Explosion of a Rotating Star as a Supernova Mechanism," Astronomicheskii Zhurnal, vol. 47, Aug. 1970, p. 813.
6. J.M. LeBlanc and J.R. Wilson, "A Numerical Example of the Collapse of a Rotating Magnetized Star," Astrophysical J., vol. 161, Aug. 1970, pp. 541–551.
7. S.A. Balbus and J.F. Hawley, "A Powerful Local Shear Instability in Weakly Magnetized Disks. I—Linear Analysis. II—Nonlinear Evolution," Astrophysical J., vol. 376, July 1991, pp. 214–233.
8. A. Wongwathanarat, H. Janka, and E. Müller, "Hydrodynamical Neutron Star Kicks in Three Dimensions," Astrophysical J. Letters, vol. 725, Dec. 2010, pp. L106–L110.
9. C.D. Ott et al., "Two-Dimensional Multiangle, Multigroup Neutrino Radiation-Hydrodynamic Simulations of Postbounce Supernova Cores," Astrophysical J., vol. 685, Oct. 2008, pp. 1069–1088.
10. E.F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer, 1999.
11. T.W. Baumgarte and S.L. Shapiro, Numerical Relativity: Solving Einstein's Equations on the Computer, Cambridge Univ. Press, 2010.
12. T. Kuroda, T. Takiwaki, and K. Kotake, "A New Multi-Energy Neutrino Radiation-Hydrodynamics Code in Full General Relativity and Its Application to Gravitational Collapse of Massive Stars," Astrophysical J. Supplemental Series, vol. 222, Feb. 2016, article no. 20.
13. M. Liebendörfer et al., "Supernova Simulations with Boltzmann Neutrino Transport: A Comparison of Methods," Astrophysical J., vol. 620, Feb. 2005, pp. 840–860.
14. K. Sumiyoshi et al., "Multidimensional Features of Neutrino Transfer in Core-Collapse Supernovae," Astrophysical J. Supplemental Series, vol. 216, Jan. 2015, article no. 5.
15. E. O'Connor and S.M. Couch, "Two Dimensional Core-Collapse Supernova Explosions Aided by General Relativity with Multidimensional Neutrino Transport," submitted to Astrophysical J., Nov. 2015; arXiv:1511.07443.
16. L.F. Roberts et al., "General Relativistic Three-Dimensional Multi-Group Neutrino Radiation-Hydrodynamics Simulations of Core-Collapse Supernovae," submitted to Astrophysical J., Apr. 2016; arXiv:1604.07848.
17. A. Mirizzi et al., "Supernova Neutrinos: Production, Oscillations and Detection," La Rivista del Nuovo Cimento, vol. 39, Jan. 2016, pp. 1–112.
18. A. Vlasenko, G.M. Fuller, and V. Cirigliano, "Neutrino Quantum Kinetics," Physical Rev. D, vol. 89, no. 10, 2014, article no. 105004.

19. A.W. Steiner, M. Hempel, and T. Fischer, "Core-Collapse Supernova Equations of State Based on Neutron Star Observations," Astrophysical J., vol. 774, Sept. 2013, article no. 17.
20. S.M. Couch et al., "The Three-Dimensional Evolution to Core Collapse of a Massive Star," Astrophysical J. Letters, vol. 808, July 2015, article no. L21.
21. E.J. Lentz et al., "Three-Dimensional Core-Collapse Supernova Simulated Using a 15 M⊙ Progenitor," Astrophysical J. Letters, vol. 807, July 2015, article no. L31.
22. S.M. Couch and C.D. Ott, "The Role of Turbulence in Neutrino-Driven Core-Collapse Supernova Explosions," Astrophysical J., vol. 799, Jan. 2015, article no. 5.
23. P. Mösta et al., "A Large-Scale Dynamo and Magnetoturbulence in Rapidly Rotating Core-Collapse Supernovae," Nature, vol. 528, no. 7582, 2015, pp. 376–379; www.nature.com/nature/journal/v528/n7582/full/nature15755.html.
24. J.S. Hesthaven and T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, 1st ed., Springer, 2007.

Christian D. Ott is a professor of theoretical astrophysics in the Theoretical Astrophysics Including Cosmology and Relativity (TAPIR) group of the Walter Burke Institute for Theoretical Physics at Caltech. His research interests include astrophysics and computational simulations of core-collapse supernovae, neutron star mergers, and black holes. Ott received a PhD in physics from the Max Planck Institute for Gravitational Physics and Universität Potsdam. Contact him at cott@tapir.caltech.edu.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

LEADERSHIP COMPUTING
Editors: James J. Hack, jhack@ornl.gov | Michael E. Papka, papka@anl.gov

Multiyear Simulation Study Provides Breakthrough in Membrane Protein Research

Laura Wolf | Argonne National Laboratory

“M
olecular machines,” composed of protein The research team of Benoît Roux, a professor in the
components, consume energy to perform University of Chicago’s Department of Biochemistry and
specific biological functions. The concerted Molecular Biology and a senior scientist in Argonne Na-
actions of the proteins trigger many of the tional Laboratory’s Center for Nanoscale Materials, relies
critical activities that occur in living cells. However, like any on an integrative approach to discover and define the ba-
machine, the components can break (through various muta- sic mechanisms of biomolecular systems—an approach that
tions), and then the proteins fail to perform their functions relies on theory, modeling, and running large-scale simula-
correctly. tions on some of the fastest open science supercomputers in
It’s known that malfunctioning proteins can result in the world.
a host of diseases, but pinpointing when and how a mal- Computers have already changed the landscape of biol-
function occurs is a significant challenge. Very few func- ogy in considerable ways; modeling and simulation tools are
tional states of molecular machines are determined by routinely used to fill in knowledge gaps from experiments,
experimentalists working in wet laboratories. Therefore, helping design and define research studies. Petascale super-
more structure-function information is needed to develop computing provides a window into something else entirely:
an understanding of disease processes and to design novel the ability to calculate all the interactions occurring be-
therapeutic agents. tween the atoms and molecules in a biomolecular system,

such as a molecular machine, and to visualize the motion that emerges.

The Breakthrough
Roux's team recently concluded a three-year Innovative and Novel Computational Impact on Theory and Experiment (INCITE) project at the Argonne Leadership Computing Facility (ALCF), a US Department of Energy (DOE) Office of Science User Facility, to understand how P-type ATPase ion pumps—an important class of membrane transport proteins—operate. Over the past decade, Roux and his collaborators, Avisek Das, Mikolai Fajer, and Yilin Meng, have been developing new computational approaches to simulate virtual models of biomolecular systems with unprecedented accuracy.

The team exploits state-of-the-art developments in molecular dynamics (MD) and protein modeling. The MD simulation approach, frequently used in computational physics and chemistry, calculates the motions of all the atoms in a given molecular system over time—information that's impossible to access experimentally. In biology, large-scale MD simulations provide a perspective to understand how a biologically important molecular machine functions.

For several years, Roux's research has been focused on the membrane proteins that control the bidirectional flow of material and information in a cell. Now, in a major breakthrough, he and his team have described the complete transport cycle in atomic detail of a large calcium pump called sarco/endoplasmic reticulum calcium ATPase, or SERCA, which plays an important role in normal muscle contraction. This membrane protein uses the energy from ATP hydrolysis to transport calcium ions against their concentration gradient and, importantly, its malfunction causes cardiac and skeletal muscle diseases.

Roux and his team wanted to understand how SERCA functions in a membrane, so they set out to build a complete atomistic picture of the pump in action. Das, a postdoctoral research fellow in Roux's lab, did this by obtaining all the transition pathways for the entire ion transport cycle using an approach called the string method—essentially capturing a "molecular movie" of the transport process, frame by frame, of how different protein components and parts within the proteins communicated with each other (see Figure 1).

The Science
A membrane protein, like all protein molecules, consists of a long chain of amino acids. Once fully formed, it folds into a highly specific conformation that enables it to perform its biological function. Membrane proteins change shape and go through many conformational "states" to perform their functions.

"From a scientific standpoint, membrane proteins such as the calcium pump are very interesting because they undergo complex changes in their three-dimensional conformations," said Roux. "Ultimately, a better understanding may have a great impact on human health."

Experimentalists understand the structural details of proteins' stable conformational states but very little about the process by which a protein changes from one conformational state to another. "Only computer simulation can explore the interactions that occur during these structural transitions," said Roux.

Intermediate conformations along these transitions could potentially provide the essential information needed for the discovery of novel therapeutic agent design. (Drugs are essentially molecules that counteract the effect of bad mutations to help recover the normal functions of the protein.) Because membrane proteins regulate many aspects of cell physiology, they can serve as possible diagnostic tools or therapeutic targets.

Roux and his team are trying to obtain detailed knowledge about all the relevant conformational states that occur during SERCA's transport cycle. In years one and two of the study, Roux's team identified two of the conformation transition pathways needed to describe the cycle. Last year, the project shifted focus to the three remaining pathways.

The ALCF Advantage
As is the case for much of the domain science research being conducted on DOE leadership supercomputer systems today, biomolecular science relies on advances in methodology as well as in software and hardware technologies. The usefulness of Roux's simulations hinges on the accuracy of the modeling parameters and on the efficiency of the MD algorithm enabling the adequate sampling of motions.

Computational science teams can spend years refining their application code to do what they need it to do, which is often to simulate a particular physical phenomenon at the necessary space and time scales. Code advancements can push the

[Figure 1 panels: the six states E1, E1–2Ca2+–ATP, E1P–2Ca2+–ADP, E2, E2–Pi, and E2P.]

Figure 1. Interaction of cytoplasmic domains in the calcium pump of sarcoplasmic reticulum. These six states have been structurally characterized and represent important intermediates along the reaction cycle. The blue domain, shown in surface representation, is called the phosphorylation domain (P). The red and green domains, shown as Cα traces, are called actuator (A) and nucleotide binding (N) domains, respectively. The red and green patches in the P domain are interacting with residues in A and N domains, respectively. Two residues are considered to be in contact if at least one pair of non-hydrogen atoms is within 4 Å of each other. (Image: Avisek Das, University of Chicago, used with permission.)

simulation capabilities and take advantage of the machine's features, such as high processor counts or advanced chips, to evolve the system for longer and longer periods of time.

Roux and his team used a premier MD simulation code, called NAMD, that combines two advanced algorithms—the swarm-of-trajectory string method and multidimensional umbrella sampling. NAMD, which was first developed at the University of Illinois at Urbana-Champaign by Klaus Schulten and Laxmikant Kale, is a program used to carry out classical simulations of biomolecular systems. It's based on the Charm++ parallel programming system and runtime library, which provides infrastructure for implementing highly scalable parallel applications. When combined with a machine-specific communication library (such as PAMI, available on Blue Gene/Q), the string method can achieve extreme scalability on leadership-class supercomputers.

ALCF staff provided maintenance and support for NAMD software and helped coordinate and monitor the jobs running on Mira, ALCF's 10-Pflops IBM Blue Gene/Q.

ALCF computational scientist Wei Jiang has been actively collaborating with Roux's team since 2012, as part of Mira's Early Science Program. Jiang worked with IBM's system software team on early stage porting and optimization of NAMD on the Blue Gene/Q architecture. He's also one of the core developers of NAMD's multiple copy algorithm, which is the foundation for multiple INCITE projects that use NAMD.

Jiang, who has a background in computational biology, considers the recent work a significant


breakthrough. "Only in the third year of the project did we begin to see real progress," he said. "The first and second year of an INCITE project is often accumulated experience."

The computations Roux and his team ran for this breakthrough work will serve as a roadmap for simulating and visualizing the basic mechanisms of biomolecular systems going forward. By studying experimentally well-characterized systems of increasing size and complexity within a unified theoretical framework, Roux's approach offers a new route for addressing fundamental biological questions.

Acknowledgments
An award of computer time was provided by the US Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Argonne Leadership Computing Facility, which is a US DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Laura Wolf is a science writer and editor for Argonne National Laboratory. Her interests include science communication, supercomputing, and new media art. Wolf received a BA in political science from the University of Cincinnati and an MA in journalism from Columbia College Chicago. Contact her at lwolf@anl.gov.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


VISUALIZATION CORNER

Editors: Joao Comba, UFRGS, comba@inf.ufrgs.br, and Daniel Weiskopf, weiskopf@visus.uni-stuttgart.de

Beyond the Third Dimension: Visualizing High-Dimensional Data with Projections
Renato R.O. da Silva | University of São Paulo, Brazil
Paulo E. Rauber and Alexandru C. Telea | University of Groningen, The Netherlands

Many application fields produce large amounts of multidimensional data. Simply put, these are datasets where, for each measurement point (also called data point, record, sample, observation, or instance), we can measure many properties of the underlying phenomenon. The resulting measurement values for all data points are usually called variables, dimensions, or attributes. A multidimensional dataset can thus be described as an n × m data table having n rows (one per observation) and m columns (one per dimension). When m is larger than roughly 5, such data is called high-dimensional. Such datasets are common in engineering (think of manufacturing specifications, quality assurance, and simulation or process control); medical sciences and e-government (think of electronic patient dossiers [EPDs] or tax office records); and business intelligence (think of large tables in databases).

While storing multidimensional data is easy, understanding it is not. The challenge lies not so much in having a large number of observations but in having a large number of dimensions. Consider, for instance, two datasets A and B. Dataset A contains 1,000 samples of a single attribute, say, the birthdates of 1,000 patients in an EPD. Dataset B contains 100 samples of 10 attributes, say, the amounts of 10 different drugs distributed to 100 patients. The total number of measurements in the two datasets is the same (1,000). Yet, understanding dataset A is quite easy, and it typically involves displaying either a (sorted) bar chart of its single variable or a histogram showing the patients' age distribution. In contrast, understanding dataset B can be very hard—for example, it


might be necessary to examine the correlations of any pair of two dimensions of the 10 available ones.

In this article, we discuss projections, a particular type of tool that allows the efficient and effective visual analysis of multidimensional datasets. Projections have become increasingly interesting and important tools for the visual exploration of high-dimensional data. Compared to other techniques, they scale well in the number of observations and dimensions, are intuitive, and can be used with minimal effort. However, they need to be complemented by additional visual mechanisms to be of maximal added value. Also, as they've been originally developed in more formal communities, they're less known or accessible to mainstream scientists and engineers. We provide here a compact overview of how to use projections to understand high-dimensional data, present a classification of projection techniques, and discuss ways to visualize projections. We also comment on the advantages of projections as opposed to other visualization techniques for multidimensional data, and illustrate their added value in a complex visual analytics workflow for machine learning applications in medical science.

Exploring High-Dimensional Data
Before outlining solutions for exploring high-dimensional data, we need to outline typical tasks that must be performed during such exploration. These can be classified into observation-centric tasks (which address questions focusing on observations) and dimension-centric tasks (which address questions focusing on the dimensions). Observation-centric tasks include finding groups of similar observations and finding outliers (observations that are very different from the rest of the data). Dimension-centric tasks include finding sets of dimensions that are strongly correlated and dimensions that are mutually independent. There exist also tasks that combine observations and dimensions, such as finding which dimensions make a given group of observations different from the rest of the data. Several visual solutions exist to address (parts of) these tasks, as follows. More details on these and other visualization techniques for high-dimensional data appear elsewhere.1,2

Tables
Probably the simplest method is to display the entire dataset as an n × m table, as we do in a spreadsheet. Sorting rows on the values in a given column lets us find observations with minimal or maximal values for that column and then read all their dimensions horizontally in a row. Visually scanning a sorted column lets us see the distribution of values of a given dimension.

But while spreadsheet views are good for showing detailed information, they don't scale to datasets having thousands of observations and tens of dimensions or more. To address such scalability, table lenses refine the spreadsheet idea: they work much like zooming out of the drawing of a large table, thereby reducing every row to a row of pixels. Rather than showing the actual textual cell content, cell values are now drawn as horizontal pixel bars colored and scaled to reflect data values. As such, columns are effectively reduced to bar graphs. Using sorting, we can now view the variation of dimension values for much larger datasets. However, reasoning about the correlation of different dimensions isn't easy using table lenses.

Scatterplots
Another well-known visualization technique for multidimensional data is a scatterplot, which shows the distribution of all observations with respect to two chosen dimensions i and j. Finding correlations, correlation strengths, and the overall distribution of data values is now easy. To do this for m dimensions, a so-called m × m scatterplot matrix can be drawn, showing the correlation of each dimension i with each other dimension j. However, reasoning about observations is hard now—an observation is basically a set of m² points, one in each scatterplot in the matrix. Also, scatterplot matrices don't scale well for datasets having more than roughly 8 to 10 dimensions.

Parallel Coordinates
A third solution for visualizing multidimensional data is parallel coordinates. Here, each dimension is shown as a vertical axis, thus the name parallel coordinates. Each observation is shown as a fractured line that connects the m points along these axes corresponding to its values in all the m dimensions. Correlations of dimensions (shown by adjacent axes) can now be spotted as bundles of parallel line segments; inverse correlations are shown by a typical x-shaped line-crossing pattern. Yet, parallel coordinates don't scale well beyond 10 to 15 dimensions. Also, they might require careful ordering of the axes to bring dimensions that one wants to compare close to each other in the plot.

Multidimensional Projections
Projections take a very different approach to visualizing high-dimensional data. Think of the n data


[Figure 1: a row of the data table gets mapped to a 2D point; the 2D point distance reflects the nD row distance, and points are color mapped by the values of a selected column.]

Figure 1. From a multivariate data table to a projection. Projections can be thought of as reducing the unnecessary dimensionality of the data (the original m dimensions), keeping the inherent dimensionality (that which encodes distances, or similarities, between points).

points in an m-dimensional space. The dataset can then be conceptually seen as a point cloud in this space. If we could see in m dimensions, we could then (easily) find outliers as those points that are far from all other points in the cloud and find important groups of similar observations as dense and compact regions in the point cloud.

However, we can't see in more than three dimensions. Note also that a key ingredient of performing the above-mentioned tasks is reasoning in terms of distances between the points in m dimensions. Hence, if we could somehow map, or project, our point cloud from m to two or three dimensions, keeping the distances between point-pairs, we could do the same tasks by looking at a 2D or 3D scatterplot. Projections perform precisely this operation, as illustrated by Figure 1. Intuitively, they can be thought of as reducing the unnecessary dimensionality of the data (the original m dimensions), keeping the inherent dimensionality (that which encodes distances, or similarities, between points). Additionally, we can color-code the projected points by the values of one dimension, to get extra insights.

There are two main use cases for projections. The first is to reduce the number of dimensions by keeping only one dimension from a set of dimensions which are strongly correlated, or by dropping dimensions along which the data has a very low variance. Essentially, this preserves patterns in the data (clusters, outliers) but makes its usage simpler, as there are fewer dimensions to consider next. The simplified dataset can next be used instead of the original one in various processing or analysis tasks. The second use case involves reducing the number of dimensions to two or three, so that we can visually explore the reduced dataset. In contrast to the first case, this usually isn't done by dropping dimensions but by creating two or three synthetic dimensions along which the data structure is best preserved. We next focus on this latter use case.

Projection Techniques
Many different techniques exist to create a 2D or 3D projection, and they can be classified according to several criteria, as follows.

Dimension versus distance. The dimension versus distance classification looks at the type of information used to construct a projection. Distance-based methods use only the distances, or similarities, between m-dimensional observations. Typical distances here are Euclidean and cosine, thus, the projection algorithm's input is an n × n distance matrix between all observation pairs. Such methods are also known as multidimensional scaling (MDS) because they intuitively scale the m-dimensional distances to 2D distances. Technically, this is done by optimizing a function that minimizes the so-called aggregated normalized stress, or summed difference between the inter-point distances in m dimensions and 2D, respectively. The main advantage of MDS methods is


that they don’t require the original dimensions—a (small) subset of observations, called representatives,
dissimilarity matrix between observations is suf- from the initial dataset and then projecting these by
ficient and extremely useful in cases where we can using a high-accuracy method. This isn’t expensive,
measure the similarities in some data collections but as the number of representatives is small. Finally,
don’t precisely know which attributes (dimensions) the remaining observations close to each representa-
explain those similarities. The main disadvantage of tive are fit around the position of the representative’s
MDS methods is that they require storing (and ana- projection. This is cheaper, simpler, and also more
lyzing) an n × n distance matrix. For n being tens of accurate than using a global technique. Intuitively,
thousands of observations, this can be very expen- think of our Earth example as splitting the ball sur-
sive.3 Several MDS refinements have been proposed, face into several small patches and projecting these
such as ISOMAP,4 Pivot MDS,5 and Fastmap,6 to 2D. When such patches have low curvature, fit-
which can compute projections in (near) linear time ting them to a 2D surface is easier than if we were to
to the number of observations. project the entire ball at once. Good local methods
In contrast, dimension-based methods use as in- include PLMP9 and LAMP.10 Using representatives
put the actual m dimensions of all observations. For has another added value: users can arrange these as
datasets having many more observations than dimen- desired in 2D, thereby controlling the projection’s
sions (n much larger than n), this gives considerable overall shape with little effort.
savings. However, we now need to have access to the
original dimension values. Arguably the best known Distance versus neighborhood preserving. A final classi-
method in this class is principal component analysis fication looks into what a projection aims to preserve.
(PCA), whose variations are also known under the When it’s important to accurately assess the similar-
names of singular value decomposition (SVD) or ity of points, distance preservation is preferred. All
Karhunen-Loève transform (KLT).7 Intuitively put, projection techniques listed above fall into this class.
the idea of 2D PCA is to find the plane, in m dimen- However, as we’ve seen, getting a good distance pres-
sions, on which the projections of the n observations ervation for all points can be hard. When the number
have the largest spread. Visualizing these 2D projec- of dimensions is very high, the Euclidean (straight-
tions will then give us a good way of understanding line) distances between all point-pairs in a dataset
the actual variance of the data in m dimensions.8 tend to become very similar, so accurately preserving
While simple and fast, PCA-based methods work such distances has less value. In such cases, it’s often
well only if the observations are distributed close to a better to preserve neighborhoods in a projection—this
planar surface in m dimensions. To understand this, way, the projection can still be used to reason about
consider a set of observations uniformly distributed the groups and outliers existing in the high-dimen-
on the surface of the Earth (a ball in 3D). When pro- sional dataset. Actually, the depiction of groups could
jecting these, PCA will effectively squash the ball to get even clearer because the projection algorithm has
a planar disk, projecting diametrically opposed ob- more freedom to place observations in 2D, as long as
servations on the ball’s surface to the same location, the nearest neighbors of a point in 2D are the same
meaning the projection won’t preserve distances. as those of the same point in m dimensions. The best-
What we actually want is a projection that acts much known method in this class is t-stochastic neighbor
as a map construction process, where the Earth’s sur- embedding (t-SNE), which is used in many applica-
face is unfolded to a plane, with minimal distortions. tions in machine learning, pattern recognition, and
data mining, and has a readily usable implementation
Global versus local. The global versus local classifica- (https://lvdmaaten.github.io/tsne).
____________________
tion looks at the type of operation used to construct a
projection. Global methods define a single mapping, Type of data. Most projection methods handle
which is then applied for all observations. MDS and quantitative dimensions, whose values are typically
PCA methods fall in this class. The main disadvan- continuously varying over some interval. Examples
tage of global methods is that it can be very hard are temperature, time duration, speed, volume, or
to find a single function that optimally preserves financial transaction values. However, projection
distances of a complex dataset when projecting it techniques such as multiple correspondence analy-
(as in the Earth projection example). Another dis- sis (MCA) can also handle categorical data (types)
advantage is that computing such a global mapping or mixed datasets of quantitative and categorical
can be expensive (as in the case of classical MDS). data. A good description of MCA and related tech-
Local methods address both these issues, selecting a niques is given by Greenacre.11
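As a rough illustration of these classes, the sketch below builds 2D projections of one dataset with scikit-learn's PCA (dimension-based and global), metric MDS (distance-based), and t-SNE (neighborhood-preserving), and scores each with one common formulation of the aggregated normalized stress. The dataset, library calls, and stress formula are illustrative choices, not the specific tools discussed above.

# Illustrative sketch (not the article's tooling): three projection styles
# from scikit-learn, compared by aggregated normalized stress.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA          # dimension-based, global
from sklearn.manifold import MDS, TSNE         # distance-based / neighborhood-preserving

def normalized_stress(X_high, X_2d):
    """Summed squared difference between m-D and 2D inter-point distances,
    normalized by the squared m-D distances (one common formulation)."""
    D = pdist(X_high)   # pairwise distances in the original m dimensions
    d = pdist(X_2d)     # pairwise distances in the 2D projection
    return np.sum((D - d) ** 2) / np.sum(D ** 2)

X, labels = load_digits(return_X_y=True)       # n = 1,797 observations, m = 64 dimensions
X, labels = X[:500], labels[:500]              # subsample to keep the illustration quick

projections = {
    "PCA":   PCA(n_components=2).fit_transform(X),
    "MDS":   MDS(n_components=2, random_state=0).fit_transform(X),
    "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(X),
}

for name, P in projections.items():
    # t-SNE preserves neighborhoods rather than distances, so expect a high stress value.
    print(f"{name:5s}  normalized stress = {normalized_stress(X, P):.3f}")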


[Figure 2: five projection views, (a) through (e). Panel annotations include x and y axis legends with minimum and maximum values, numbered axes (0 through 8), a selected dimension used for color mapping (gender: male/female), and, in the 3D example, variables 5 through 7 (He+, He++, and H– mass abundances), a spike outlier, and an error legend.]

Figure 2. Projection visualizations with (a) thumbnails, (b) biplot axes, (c) and (d) axis legends, and (e) key local dimensions.
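One simple way to approximate per-dimension bars like the axis legends in Figures 2c and 2d is to correlate each original dimension with the projected x and y coordinates. The sketch below uses that stand-in; the data, dimension names, and the fake "projection" are hypothetical and only meant to show the computation.

# Hypothetical sketch: approximate axis-legend bars by correlating each
# original dimension with the projected x and y coordinates.
import numpy as np

def axis_legend(X, P, names):
    """X: n x m data matrix; P: n x 2 projection; names: the m dimension names.
    Returns, per dimension, |correlation| with the x and y axes of the projection."""
    contributions = {}
    for j, name in enumerate(names):
        cx = abs(np.corrcoef(X[:, j], P[:, 0])[0, 1])
        cy = abs(np.corrcoef(X[:, j], P[:, 1])[0, 1])
        contributions[name] = (cx, cy)
    return contributions

# Example with invented stand-in data (100 observations, 3 dimensions):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
P = X[:, :2] + 0.1 * rng.normal(size=(100, 2))   # a fake 2D "projection" of the first two dimensions
for name, (cx, cy) in axis_legend(X, P, ["age", "gender", "albumin"]).items():
    print(f"{name:8s}  x-bar {cx:.2f}   y-bar {cy:.2f}")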

The Projection Explorer is a very good place to start working with projections in practice.12 This tool implements a wide range of state-of-the-art projection techniques that can handle hundreds of thousands of observations with hundreds of dimensions and provides several visualizations to interactively customize and explore projections. The tool is freely downloadable from http://infoserver.lcad.icmc.usp.br/infovis2/Tools.

Visualizing Projections
The simplest and most widespread way to visualize a projection is to draw it as a scatterplot. Here, each point represents an observation, and the 2D distance between points reflects the similarities of the observations in m dimensions. Points can be also annotated with color, labels, or even thumbnails to explain several of their dimensions.

Figure 2a shows this for a dataset where observations are images. The projection shows image thumbnails, organized by similarity. We can easily see here that our image collection is split into two large groups; we can get more insight into the composition of the groups by looking at the thumbnails.

However, in many cases, there's no easy way to draw a small thumbnail-like depiction of all the m attributes of an observation. Projections will then show us groups and outliers, but how do we explain what these mean? In other words, how do we put the dimension information back into the picture? Without this, the added value of a projection is limited.
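A minimal sketch of the scatterplot view just described, using matplotlib: the projected points are drawn as a 2D scatterplot and colored by one user-chosen dimension. The projection array and the selected column are assumed to come from any of the techniques covered earlier.

# Minimal sketch: a projection drawn as a scatterplot, colored by one dimension.
import matplotlib.pyplot as plt

def show_projection(P, values, dim_name):
    """P: n x 2 projected coordinates; values: the selected dimension (length n)."""
    sc = plt.scatter(P[:, 0], P[:, 1], c=values, s=12, cmap="viridis")
    plt.colorbar(sc, label=dim_name)     # legend for the color-mapped dimension
    plt.xticks([]); plt.yticks([])       # the 2D axes carry no direct data meaning
    plt.title(f"Projection colored by '{dim_name}'")
    plt.show()

# e.g., reusing the t-SNE result and data X from the earlier sketch:
# show_projection(projections["t-SNE"], X[:, 0], "dimension 0")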


There are several ways of explaining projections. By far the simplest, and most common, is to color code the projection points by the value of a user-chosen dimension. If we next see strong color correlations with different point groups in the projection, we can explain these in terms of the selected dimension's specific values or value ranges. However, if we have tens of dimensions, using each one to color code the projection is tedious at best. Moreover, it could be that no single dimension can explain why certain observations are similar. Tooltips can be shown at user-chosen points, which does a good job explaining a few outliers one by one, but it doesn't work if we want to explain a large number of points together.

One early way to explain projections is to draw so-called biplot axes.13 For PCA projections and variants, lines indicate the directions of maximal variation in the 2D space of all m dimensions. Intuitively put, biplot axes generalize the concept of a scatterplot, where we can read the values of two dimensions along the x and y axes, to the case where we have m dimensions. Moreover, strongly correlated dimensions appear as nearly parallel axes, and independent dimensions appear as nearly orthogonal axes. Finally, the relative lengths of the axes indicate the relative variation of the respective dimensions. Biplots can also be easily constructed for any other projection, including 3D projections that generate a 3D point cloud rather than a 2D scatterplot.14 In such cases, the biplot axes need not be straight lines. Figure 2b shows an example of biplot axes for a dataset containing 2,814 abstracts of scientific papers. Each observation (abstract) has nine dimensions, indicating the frequencies of the nine most used technical terms in all abstracts. The projection, created using a force-based technique, places points close to each other if the respective abstracts are similar. Labels can be added to the axes to tell their identity and also indicate their signs (extremities associated with minimum and maximum values). The curvature of the biplot axes tells us that the projection is highly nonlinear—intuitively, we can think that the nine-dimensional space gets distorted when squashed into the resulting 3D space. This is undesirable because reading the values of the dimensions along such curved axes is hard.

Still, interpreting biplot axes can be challenging, especially when we have 10 or more variables, as we get too many lines drawn in the plot. Moreover, most users are accustomed to interpreting a point cloud as a Cartesian scatterplot—that is, they want to know what the horizontal (x) and vertical (y) axes of the plot mean. For a projection, this isn't easy because these axes don't straightforwardly map to data dimensions but, rather, to combinations of dimensions. Luckily, we can compute the contribution of each of the original m data dimensions to the spread of points along the projection's x and y axes. Next, we can visualize these contributions by standard bar charts (see Figure 2c): for each dimension, the x and y axis legends show a bar indicating how much that dimension is visible on the x and y axes. Long bars, thus, indicate dimensions that strongly contribute to the spread of points along the horizontal and vertical directions. Figure 2c shows how this works: the dataset contains 583 patient records, each having 10 dimensions describing patients' gender, age, and eight blood measurements. The projection shows two clusters placed aside each other.

How do we explain these? In the x axis legend, we see a tall orange bar, which tells us that this dimension (gender) is strongly responsible for the points' horizontal spread. If we color the points by their gender value, we see that, indeed, gender explains the clusters. Axis legends can also be used for 3D projections, as in Figure 2d, which shows a 3D projection of a 200,000-sample dataset with 10 dimensions coming from a simulation describing the formation of the early universe.14 As we rotate the 3D projection, the bars in the axis legends change lengths and are sorted from longest to shortest, indicating the best-visible dimensions from a given viewpoint (dimensions 5 and 7, in our case). A third legend (Figure 2d, top right) shows which dimensions we can't see well in the projection from the current viewpoint. These dimensions vary strongly along the viewing direction, so we shouldn't use the current viewpoint to reason about them.

Biplot axes can also be inspected to get more detail. For example, we see that the projection's saddle shape is mainly caused by variable 7 and that the spike outlier is caused by a combination of dimensions 5 and 6. This interactive viewpoint manipulation of 3D projections effectively lets us create an infinite set of 2D scatterplot-like visualizations on the fly. Both biplot axes and axis legends explain a projection globally. If well-separated groups of points are visible, we can't directly tell which variables are responsible for their appearance without visually correlating the groups' positions with annotations, which can be tedious. Local explanations address this by explicitly splitting the projection into groups of points that admit a single (simple) explanation, depicting this explanation atop


the groups. Figure 2e shows this for a dataset containing 6,773 open source software projects, each having 11 quality metrics, along with their download count.15 The projection, constructed with LAMP, shows a concave shape but no clearly separated clusters.

Let's consider next every projected point and several of its close neighbors—that is, a small circular patch of projected points. Because these points are close in the projection, they should also be similar in m dimensions. We can analyze these points to find which dimension is most likely responsible for their similarity. By doing this for all points in turn, we can rank all m dimensions by the number of points whose neighborhoods they explain. If we color code points by their best-explaining dimension, the projection naturally splits into several clusters. We can next add labels with the names of their explaining dimensions. Finally, we can tune the points' brightness to show how much of a point's similarity is explained by the single selected dimension. In Figure 2e, we see, for instance, that the lines of code metric (purple) explains two clusters of points—by interactive brushing, we can find that one contains small software projects and the other has large software projects. The bright-to-dark color gradient shows how it's increasingly hard to explain a point's similarity with its neighbors once we approach the cluster border, that is, the place where another dimension becomes key to explaining local similarity. Doing this visual partitioning of the projection into groups explained by dimensions would have been hard using global methods only, such as biplot axes or axis legends. Besides explaining groups in a projection via single dimensions, we can also use tag clouds to show the names of several dimensions.16

Interpreting Projections
As already explained, projections can be used as visual proxies of high-dimensional spaces that enable reasoning about a dataset's structure. For this to work, however, a projection should faithfully preserve those elements of the data structure that are important for the task at hand. As such, before using a projection, it's essential to check its quality.

The easiest way to do this is to compute the aggregated normalized stress. Low values of this stress tell us that the projection preserves distances well. However, if this single figure indicates low quality, we don't know what that precisely means or which observations are affected. More insight can be obtained by showing scatterplots of the original distances in m dimensions versus distances in the projection. Figure 3a illustrates this for several datasets and projection techniques.10 The ideal projection behavior is shown with red diagonal lines; figures in each scatterplot show the aggregated normalized stress, telling us that LAMP is generally better than the other two studied projections. Yet, we don't know what this means precisely. Looking at the scatterplots' deviations from the red diagonals, we get more insight into the nature of the errors: points under the diagonal tell us that original distances are underestimated in the projection, that is, that the projection compressed the data. Note that this is quite a typical phenomenon: projections have to embed points in a much lower-dimensional space, so crowding occurs very likely. For the isolet dataset, we see, for example, that small 2D distances can mean a wider range of high-dimensional distances than large 2D distances, so close points in a projection may or may not be that close in m dimensions. For the viscontest dataset, we see that LAMP has a constant spread around the diagonal, indicating a uniform error distribution for all distance ranges. In contrast, Glimmer shows a much worse error distribution.

While useful to reason about distances, such scatterplots don't tell us where in the projection we have errors. For this, we can use observation-centric error metrics.17 The aggregate error shows the normalized stress, aggregated per point rather than for all points. Figure 3b shows this for a projection created with LAMP. As we see, the projection overall is of good quality, with the exception of four small hot spots. Figure 3c shows errors created by false neighbors—that is, points close in 2D but far in m dimensions, or zones where the projection compressed the high-dimensional space. We see


[Figure 3a: scatterplots of projected distance (2 dimensions) versus original distance (m dimensions) for the wdbc, isolet, and viscontest datasets projected with LAMP, Glimmer, and PLMP, each annotated with its aggregated normalized stress and run time. Panels (b) through (e) mark per-point errors, a selected point with its missing neighbors, and a selected group with its missing members.]

Figure 3. Projection visualized with (a) distance-centric methods and (b) through (e) observation-centric methods. The ideal projection
behavior is shown with red diagonal lines. Figures in each scatterplot show the aggregated normalized stress, telling us that LAMP is
generally better than the other two studied projections.

here only three hot spots, meaning that the fourth one in Figure 3b wasn't caused by false neighbors. Figure 3d shows errors created by missing neighbors—that is, points close in m dimensions but far in 2D. The missing neighbors of the selected point of interest are connected by lines, which are bundled to simplify the image. The discrepancy between the 2D and original distances is also color coded on the points themselves. In this image, we see that the missing neighbors of the selected point are quite well localized on the other side of the projection. This typically happens when a closed surface in m dimensions is split by the projection to be embedded in 2D. Finally, Figure 3e shows for a selected group of points all the points that are closer in m dimensions to a point in the group than to any other point but closer to points outside that group in 2D. This lets us easily see if groups that appear in the projection are indeed complete or if they actually miss members.

Using Projections in Visual Analytics Workflows
So far, we've shown how we can construct projections, check their quality, and visually annotate them to explain the contained patterns. But how are projections used in complex visual analytics workflows? The most common way is to visually explore them while searching for groups, and when such groups appear, to use tools like the ones presented so far to explain them in terms of dimensions and dimension values.2 This is often done in data mining and machine learning.

We illustrate this with a visual analytics workflow for building classifiers for medical diagnosis.18 The advent of low-cost, high-accuracy imaging devices has enabled both doctors and the public to generate large collections of skin lesion images. Dermatologists want to automatically classify these into benign (moles) and potentially malignant (melanoma), so they can focus


[Figure 4: workflow diagram. Input objects undergo feature extraction (1); the training data is projected (2); good separation (3) leads to classifier design (6) and testing on validation data, ending in a production-ready classification system (7) when performance is good; bad separation (4) leads to iterative feature selection with a new feature subset (5); too-low performance (8) leads to studying problem causes (9) and feature set redesign (10), repeating the cycle with newly designed features. T1 and T2 mark the two key tasks that projections serve in the workflow.]

Figure 4. Using projections to build and refine classifiers in supervised machine learning.

on analyzing the latter. For this, image classifiers can be used: each skin image is described in terms of several dimensions, or features, such as color histograms, edge densities and orientations, texture patterns, and pigmentation. Next, dermatologists manually label a training dataset of images as benign or malignant, using it to train a classifier so it becomes able to label new images. Other applications of machine learning include algorithm optimization, designing search engines, and predicting software quality.

Designing good classifiers is a long-standing problem in machine learning and is often referred to as the "black art" of classifier design.19 The problem is multiple-fold: understanding discriminative features; understanding which observations are hard to classify and why; and selecting and designing features to improve classification accuracy. Projections can help all these tasks, via the workflow in Figure 4. Given a set of input observations, we first extract features that are typically known to capture their essence (step 1). This yields a high-dimensional data table with observations as rows and features as columns. We also construct a small training set by manual labeling. Next, we want to determine how easy the classification problem ahead of us will be. For this, we project the training set and color observations by class labels (step 2). If the classes we wish to recognize are badly separated, it makes little sense to spend energy on designing and testing a classifier, since we seem to have a poor feature choice (step 4). We can then interactively select the desired class groups in the projection and see which features discriminate them best,18 repeating the cycle with a different feature subset (step 5). If, however, classes are well separated in the projection (step 3), our features discriminate them well, so the classification task isn't too hard. We then proceed to design, train, and test the classifier (step 6). If the classifier yields a good performance, we're done: we have a production-ready system (step 7). If not, we can again use projections to see which are the badly classified observations (step 8), which features are responsible for this (step 9), and engineer new features that separate these better (step 10). In this workflow, projections serve two key tasks: predicting the ease of building a good classifier ahead of the actual construction (T1), thereby saving us from designing a classifier with unsuitable features, and showing which observations are misclassified and their feature values (T2), thereby helping us design better features in a targeted way.

Projections are the new emerging instrument for the visual exploration of large high-dimensional datasets. Complemented by suitable visual explanations, they're intuitive, visually


compact, and easy to learn for users familiar with scatterplots. Recent technical developments allow their automatic computation from large datasets in seconds, helping users avoid complex parameter settings or needing to understand the underlying technicalities. As such, they're part of the visual data scientist's kit of indispensable tools.

But as projections become increasingly more useful and usable, several new challenges have emerged. Users require new ways to manipulate a projection to improve its quality in specific areas, to obtain the best-tuned results for their datasets and problems. Developers require consolidated implementations of projections that would let them integrate them in commercial-grade applications such as Tableau. And last but not least, users and scientists require more examples of workflows showing how projections can be used in visual analytics sensemaking to solve problems in increasingly diverse application areas.

References
1. S. Liu et al., "Visualizing High-Dimensional Data: Advances in the Past Decade," Proc. EuroVis–STARs, 2015, pp. 127–147.
2. C. Sorzano, J. Vargas, and A. Pascual-Montano, "A Survey of Dimensionality Reduction Techniques," 2014; http://arxiv.org/pdf/1403.2877.
3. W.S. Torgerson, "Multidimensional Scaling of Similarity," Psychometrika, vol. 30, no. 4, 1965, pp. 379–393.
4. J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, 2000, pp. 2319–2323.
5. U. Brandes and C. Pich, "Eigensolver Methods for Progressive Multidimensional Scaling of Large Data," Proc. Graph Drawing, Springer, 2007, pp. 42–53.
6. C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets," SIGMOD Record, vol. 24, no. 2, 1995, pp. 163–174.
7. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.
8. I.T. Jolliffe, Principal Component Analysis, Springer, 2002, p. 487.
9. F.V. Paulovich, C.T. Silva, and L.G. Nonato, "Two-Phase Mapping for Projecting Massive Data Sets," IEEE Trans. Visual Computer Graphics, vol. 16, no. 6, 2010, pp. 1281–1290.
10. P. Joia et al., "Local Affine Multidimensional Projection," IEEE Trans. Visual Computer Graphics, vol. 17, no. 12, 2011, pp. 2563–2571.
11. M. Greenacre, Correspondence Analysis in Practice, 2nd ed., CRC Press, 2007.
12. P. Pagliosa et al., "Projection Inspector: Assessment and Synthesis of Multidimensional Projections," Neurocomputing, vol. 150, 2015, pp. 599–610.
13. M. Greenacre, Biplots in Practice, CRC Press, 2007.
14. D. Coimbra et al., "Explaining Three-Dimensional Dimensionality Reduction Plots," Information Visualization, vol. 15, no. 2, 2015, pp. 154–172.
15. R. da Silva et al., "Attribute-Based Visual Explanation of Multidimensional Projections," Proc. EuroVA, 2015, pp. 134–139.
16. F.V. Paulovich et al., "Semantic Wordification of Document Collections," Computer Graphics Forum, vol. 31, no. 3, 2012, pp. 1145–1153.
17. R.M. Martins et al., "Visual Analysis of Dimensionality Reduction Quality for Parameterized Projections," Computers & Graphics, vol. 41, 2014, pp. 26–42.
18. P.E. Rauber et al., "Interactive Image Feature Selection Aided by Dimensionality Reduction," Proc. EuroVA, 2015, pp. 54–61.
19. P. Domingos, "A Few Useful Things to Know about Machine Learning," Comm. ACM, vol. 55, no. 10, 2012, pp. 78–87.

Renato R.O. da Silva is a PhD student at the University of São Paulo, Brazil. His research interests include multidimensional projections, information visualization, and high-dimensional data analytics. Contact him at rros@icmc.usp.br.

Paulo E. Rauber is a PhD student at the University of Groningen, the Netherlands. His research interests include multidimensional projections, supervised classifier design, and visual analytics. Contact him at p.e.rauber@rug.nl.

Alexandru C. Telea is a full professor at the University of Groningen, the Netherlands. His research interests include multiscale visual analytics, graph visualization, and 3D shape processing. Telea received a PhD in computer science (data visualization) from the Eindhoven University of Technology, the Netherlands. Contact him at a.c.telea@rug.nl.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.


THE LAST WORD

Computers in Cars

by Charles Day

The major role that computational devices play in cars became dramatically apparent last September, when the Environmental Protection Agency announced the results of its investigation into Volkswagen. The EPA discovered that the German automaker had installed software in some of its diesel-engine cars that controlled a system for reducing the emission of environmentally hostile nitrogen oxides—but only during an emissions test. On the open road, the car belched nitrogen oxides, unbeknownst to its driver.

Computers have been stealthily controlling cars for decades. My second car, a 1993 Honda Civic hatchback, had a computational device—an engine control unit—whose microprocessor received data from sensors in and around the engine. On the basis of those data, the ECU would consult preprogrammed lookup tables and adjust actuators that controlled and opti-
mized the mix of fuel and air, valve timing, idle speed, and other factors. This combination of
ECU and direct fuel injection not only reduced emissions and boosted engine efficiency, it was
also less bulky and mechanically simpler than the device it replaced, the venerable carburetor.
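As a rough sketch of the lookup-table scheme just described, the snippet below interpolates an invented fuel map indexed by engine speed and load. The axis breakpoints and table values are made up for illustration and don't correspond to any real ECU calibration.

# Hypothetical sketch of an ECU-style fuel map: sensed speed and load index a
# preprogrammed table, and bilinear interpolation yields the actuator setting.
import numpy as np

RPM_AXIS  = np.array([1000, 2000, 3000, 4000, 5000])   # engine speed breakpoints (invented)
LOAD_AXIS = np.array([0.2, 0.4, 0.6, 0.8, 1.0])         # normalized engine load (invented)
FUEL_MAP  = np.array([                                   # injector pulse width in ms (invented)
    [2.0, 2.4, 2.9, 3.5, 4.2],
    [2.2, 2.7, 3.3, 4.0, 4.8],
    [2.5, 3.1, 3.8, 4.6, 5.5],
    [2.9, 3.6, 4.4, 5.3, 6.3],
    [3.4, 4.2, 5.1, 6.1, 7.2],
])

def fuel_pulse(rpm, load):
    """Bilinearly interpolate the fuel map at the sensed operating point."""
    i = np.clip(np.searchsorted(RPM_AXIS, rpm) - 1, 0, len(RPM_AXIS) - 2)
    j = np.clip(np.searchsorted(LOAD_AXIS, load) - 1, 0, len(LOAD_AXIS) - 2)
    tr = (rpm - RPM_AXIS[i]) / (RPM_AXIS[i + 1] - RPM_AXIS[i])
    tl = (load - LOAD_AXIS[j]) / (LOAD_AXIS[j + 1] - LOAD_AXIS[j])
    row0 = FUEL_MAP[i, j] * (1 - tl) + FUEL_MAP[i, j + 1] * tl
    row1 = FUEL_MAP[i + 1, j] * (1 - tl) + FUEL_MAP[i + 1, j + 1] * tl
    return row0 * (1 - tr) + row1 * tr

print(f"{fuel_pulse(2600, 0.55):.2f} ms")   # a mid-range operating point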
Unfortunately, however, the trend for computers in cars is toward greater complex-
ity, not simplicity. Consider another Honda, the second-generation Acura NSX, which
went on sale earlier this year. The supercar’s hybrid power train consists of a turbo-
charged V6 engine mated to three electric motors: one each for the two front wheels and
one for the two rear wheels. An array of sensors, microprocessors, and actuators ensures
that all three motors are optimally deployed during acceleration, cruising, and braking.
And talking of braking, the NSX’s brake pedal isn’t actually mechanically con-
nected to the brakes. Rather, it activates a rheostat, which controls the brakes electroni-
cally. To preserve the feel of mechanical braking, a sensor gauges how much hydraulic
pressure to push back on the driver’s foot.
In Formula One racing, the proliferation of computer control has led to an arms race among
manufacturers, which reached its apogee in 1993. Thanks in part to its computer-controlled anti-
lock brakes, traction control, and active suspension, the Williams FW15C won 10 of the season’s
16 races. The sport’s governing body responded by restricting electronic aids. By the 2008 season,
all cars were compelled to use the same standard ECU. The 23-year-old Williams FW15C retains
a strong claim to being the most technologically sophisticated Formula One car ever built.
Computers aren’t confined to supercars or racing cars. The July issue of Consumer Re-
ports ranked cars’ infotainment systems, with Cadillac’s being among the worst. Owners
reported taking months, even years, to master its user interface. “This car REALLY needs
a co-pilot with an IT degree,” one despairing owner told the magazine. And this past
May, USA Today reported that consumer complaints about vehicle software problems
filed with the National Highway Traffic Safety Administration (NHTSA) jumped 22
percent in 2015 compared with 2014. Recalls blamed on software rose 45 percent.
I’m not against computers in cars. Rather, I worry that their encroachment will be-
come so complete that consumers like me will be deprived of the choice to buy a car
that lacks such fripperies as a remote vehicle starter system, rear vision camera, head-up
display, driver seat memory, lane departure warning system, and so on. I worry, too, that
even as the NHTSA records more software problems, it’s also considering whether to
mandate computer-controlled safety features.

So although I wouldn't turn down an Acura NSX, I'd rather drive one of its ancestors, the Honda S800 roadster, circa 1968.

Charles Day is Physics Today’s editor in chief. The views in this column are his own and not nec-
essarily those of either Physics Today or its publisher, the American Institute of Physics.
