
COMMUNICATIONS OF THE ACM
cacm.acm.org   06/2010   Vol. 53, No. 06

Robin Milner: The Elegant Pragmatist
Managing Scientific Data
Privacy by Design
An Interview with Ed Feigenbaum
Beyond the Smart Grid
Straightening Out Heavy Tails

Association for Computing Machinery
ACM, Uniting the World’s Computing Professionals, Researchers, Educators, and Students

Dear Colleague,

At a time when computing is at the center of the growing demand for technology jobs worldwide, ACM is continuing its work on initiatives to help computing professionals stay competitive in the global community. ACM’s increasing involvement in activities aimed at ensuring the health of the computing discipline and profession serves to help ACM reach its full potential as a global and diverse society that continues to offer new and unique opportunities to its members.

As part of ACM’s overall mission to advance computing as a science and a profession, our invaluable member benefits are designed to help you achieve success by providing you with the resources you need to advance your career and stay at the forefront of the latest technologies.

I would also like to take this opportunity to mention ACM-W, the membership group within ACM. ACM-W’s purpose is to elevate the issue of gender diversity within the association and the broader computing community. You can join the ACM-W email distribution list at http://women.acm.org/joinlist.

ACM MEMBER BENEFITS:


• A subscription to ACM’s newly redesigned monthly magazine, Communications of the ACM
• Access to ACM’s Career & Job Center offering a host of exclusive career-enhancing benefits
• Free e-mentoring services provided by MentorNet®
• Full access to over 2,500 online courses in multiple languages, and 1,000 virtual labs
• Full access to 600 online books from Safari® Books Online, featuring leading publishers,
including O’Reilly (Professional Members only)
• Full access to 500 online books from Books24x7®
• Full access to the new acmqueue website featuring blogs, online discussions and debates,
plus multimedia content
• The option to subscribe to the complete ACM Digital Library
• The Guide to Computing Literature, with over one million searchable bibliographic citations
• The option to connect with the best thinkers in computing by joining 34 Special Interest Groups
or hundreds of local chapters
• ACM’s 40+ journals and magazines at special member-only rates
• TechNews, ACM’s tri-weekly email digest delivering stories on the latest IT news
• CareerNews, ACM’s bi-monthly email digest providing career-related topics
• MemberNet, ACM’s e-newsletter, covering ACM people and activities
• Email forwarding service & filtering service, providing members with a free acm.org email address
and Postini spam filtering
• And much, much more
ACM’s worldwide network of over 92,000 members ranges from students to seasoned professionals and includes many of the leaders in the field. ACM members gain access to this network and to the advantages that come from its collective expertise, keeping you at the forefront of the technology world.
Please take a moment to consider the value of an ACM membership for your career and your future in the dynamic
computing profession.

Sincerely,
Wendy Hall

President
Association for Computing Machinery
Advancing Computing as a Science & Profession
membership application &
digital library order form
Priority Code: ACACM10

You can join ACM in several easy ways:

Online: http://www.acm.org/join
Phone: +1-800-342-6626 (US & Canada); +1-212-626-0500 (Global)
Fax: +1-212-944-1318

Or, complete this application and return it with payment via postal mail.

Special rates for residents of developing countries: http://www.acm.org/membership/L2-3/
Special rates for members of sister societies: http://www.acm.org/membership/dues.html
Please print clearly

Name
Address
City / State/Province / Postal code/Zip
Country / E-mail address
Area code & Daytime phone / Fax / Member number, if applicable

Purposes of ACM
ACM is dedicated to:
1) advancing the art, science, engineering, and application of information technology
2) fostering the open interchange of information to serve both professionals and the public
3) promoting the highest professional and ethics standards

I agree with the Purposes of ACM:
Signature

ACM Code of Ethics: http://www.acm.org/serving/ethics.html

choose one membership option:

PROFESSIONAL MEMBERSHIP:
❏ ACM Professional Membership: $99 USD
❏ ACM Professional Membership plus the ACM Digital Library: $198 USD ($99 dues + $99 DL)
❏ ACM Digital Library: $99 USD (must be an ACM member)

STUDENT MEMBERSHIP:
❏ ACM Student Membership: $19 USD
❏ ACM Student Membership plus the ACM Digital Library: $42 USD
❏ ACM Student Membership PLUS Print CACM Magazine: $42 USD
❏ ACM Student Membership w/Digital Library PLUS Print CACM Magazine: $62 USD

All new ACM members will receive an ACM membership card.
For more information, please visit us at www.acm.org

payment:
Payment must accompany application. If paying by check or money order, make payable to ACM, Inc. in US dollars or foreign currency at current exchange rate.

Professional membership dues include $40 toward a subscription to Communications of the ACM. Member dues, subscriptions, and optional contributions are tax-deductible under certain circumstances. Please consult with your tax advisor.

❏ Visa/MasterCard   ❏ American Express   ❏ Check/money order

❏ Professional Member Dues ($99 or $198)      $ ______________________
❏ ACM Digital Library ($99)                   $ ______________________
❏ Student Member Dues ($19, $42, or $62)      $ ______________________
Total Amount Due                              $ ______________________

Card # / Expiration date
Signature

RETURN COMPLETED APPLICATION TO:
Association for Computing Machinery, Inc.
General Post Office
P.O. Box 30777
New York, NY 10087-0777

Questions? E-mail us at acmhelp@acm.org
Or call +1-800-342-6626 to speak to a live representative

Satisfaction Guaranteed!

communications of the acm
06/2010 vol. 53 no. 06

Departments

5  ACM’s Chief Operating Officer Letter
   A Tour of ACM’s HQ
   By Patricia Ryan

6  Letters to the Editor
   Workflow Tools for Distributed Teams?

8  In the Virtual Extension

10 BLOG@CACM
   The Chaos of the Internet as an External Brain; and More
   Greg Linden writes about the Internet as a peripheral resource; Ed H. Chi discusses lessons learned from the DARPA Network Challenge; and Mark Guzdial asks if there are too many IT workers or too many IT jobs.

12 CACM Online
   Interact Naturally
   By David Roman

29 Calendar

116 Careers

Last Byte

118 Puzzled
    Solutions and Sources
    By Peter Winkler

120 Future Tense
    How the Net Ensures Our Cosmic Survival
    Give adolescent wonder an evolutionary jolt.
    By David Brin

News

13 Straightening Out Heavy Tails
   A better understanding of heavy-tailed probability distributions can improve activities from Internet commerce to the design of server farms.
   By Neil Savage

16 Beyond the Smart Grid
   Sensor networks monitor residential and institutional devices, motivating energy conservation.
   By Tom Geller

18 Mine Your Business
   Researchers are developing new techniques to gauge employee productivity from information flow.
   By Leah Hoffmann

20 Robin Milner: The Elegant Pragmatist
   Remembering a rich legacy in verification, languages, and concurrency.
   By Leah Hoffmann

22 CS and Technology Leaders Honored
   By Jack Rosenberger

Viewpoints

24 Privacy and Security
   Myths and Fallacies of “Personally Identifiable Information”
   Developing effective privacy protection technologies is a critical challenge.
   By Arvind Narayanan and Vitaly Shmatikov

27 Inside Risks
   Privacy By Design: Moving from Art to Practice
   Designing privacy into systems at the beginning of the development process.
   By Stuart S. Shapiro

30 The Profession of IT
   The Resurgence of Parallelism
   Parallel computation is making a comeback after a quarter century of neglect.
   By Peter J. Denning and Jack B. Dennis

33 Kode Vicious
   Plotting Away
   Tips and tricks for visualizing large data sets.
   By George V. Neville-Neil

35 Law and Technology
   Intel’s Rebates: Above Board or Below the Belt?
   Over several years, Intel paid billions of dollars to its customers. Was it to force them to boycott products developed by its rival AMD or so they could sell its microprocessors at lower prices?
   By François Lévêque

38 Viewpoint
   Institutional Review Boards and Your Research
   A proposal for improving the review procedures for research projects that involve human subjects.
   By Simson L. Garfinkel and Lorrie Faith Cranor

41 Interview
   An Interview with Ed Feigenbaum
   By Len Shustek

About the Cover: “Dreams of perfection, for a scientist, mean avenues for exploration,” said Robin Milner in an interview shortly before his death on March 20, 2010 at age 76. Photographer Roland Eva captured Milner, a renowned CS theorist and recipient of the 1991 ACM A.M. Turing Award, in one of his favorite places—the classroom. Josie Jammet used that picture as inspiration for this month’s cover. For more on Jammet, see http://www.heartagency.com/artist/JosieJammet/

06/2010 vol. 53 no. 06

Practice

46 Securing Elasticity in the Cloud
   Elastic computing has great potential, but many security challenges remain.
   By Dustin Owens

52 Simplicity Betrayed
   Emulating a video system shows how even a simple interface can be more complex—and capable—than it appears.
   By George Phillips

59 A Tour Through the Visualization Zoo
   A survey of powerful visualization techniques, from the obvious to the obscure.
   By Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky

   Articles’ development led by queue.acm.org

Contributed Articles

68 Managing Scientific Data
   Needed are generic, rather than one-off, DBMS solutions automating storage and analysis of data from scientific collaborations.
   By Anastasia Ailamaki, Verena Kantere, and Debabrata Dash

79 Conference Paper Selectivity and Impact
   Conference acceptance rate signals future impact of published conference papers.
   By Jilin Chen and Joseph A. Konstan

Review Articles

84 Efficiently Searching for Similar Images
   New algorithms provide the ability for robust but scalable image search.
   By Kristen Grauman

Research Highlights

96 Technical Perspective: Building Confidence in Multicore Software
   By Vivek Sarkar

97 Asserting and Checking Determinism for Multithreaded Programs
   By Jacob Burnim and Koushik Sen

106 Technical Perspective: Learning To Do Program Verification
    By K. Rustan M. Leino

107 seL4: Formal Verification of an Operating-System Kernel
    By Gerwin Klein, June Andronick, Kevin Elphinstone, Gernot Heiser, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood

Virtual Extension

As with all magazines, page limitations often prevent the publication of articles that might otherwise be included in the print edition. To ensure timely publication, ACM created Communications’ Virtual Extension (VE). VE articles undergo the same rigorous review process as those in the print edition and are accepted for publication on their merit. These articles are now available to ACM members in the Digital Library.

Examining Agility in Software Development Practice
Sergio de Cesare, Mark Lycett, Robert D. Macredie, Chaitali Patel, and Ray Paul

Barriers to Systematic Model Transformation Testing
Benoit Baudry, Sudipto Ghosh, Franck Fleurey, Robert France, Yves Le Traon, and Jean-Marie Mottu

Factors that Influence Software Piracy: A View from Germany
Alexander Nill, John Schibrowsky, James W. Peltier, and Irvin L. Young

The Requisite Variety of Skills for IT Professionals
Kevin P. Gallagher, Kate M. Kaiser, Judith C. Simon, Cynthia M. Beath, and Tim Goles

Panopticon Revisited
Jan Kietzmann and Ian Angell

The Social Influence Model of Technology Adoption
Sandra A. Vannoy and Prashant Palvia

I, Myself and e-Myself
Cheul Rhee, G. Lawrence Sanders, and Natalie C. Simpson

Beyond Connection: Situated Wireless Communities
Jun Sun and Marshall Scott Poole

Illustration by Gluekit

communications of the acm
Trusted insights for computing’s leading professionals.

Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields. Communications is recognized as the most trusted and knowledgeable source of industry information for today’s computing professional. Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology, and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications, public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts, sciences, and applications of information technology.

ACM, the world’s largest educational and scientific computing society, delivers resources that advance computing as a science and profession. ACM provides the computing field’s premier Digital Library and serves its members and the computing profession with leading-edge publications, conferences, and career resources.

Executive Director and CEO: John White
Deputy Executive Director and COO: Patricia Ryan
Director, Office of Information Systems: Wayne Graves
Director, Office of Financial Services: Russell Harris
Director, Office of Membership: Lillian Israel
Director, Office of SIG Services: Donna Cappo
Director, Office of Publications: Bernard Rous
Director, Office of Group Publishing: Scott Delman

ACM Council
President: Wendy Hall
Vice-President: Alain Chesnais
Secretary/Treasurer: Barbara Ryder
Past President: Stuart I. Feldman
Chair, SGB Board: Alexander Wolf
Co-Chairs, Publications Board: Ronald Boisvert, Holly Rushmeier
Members-at-Large: Carlo Ghezzi; Anthony Joseph; Mathai Joseph; Kelly Lyons; Bruce Maggs; Mary Lou Soffa; Fei-Yue Wang
SGB Council Representatives: Joseph A. Konstan; Robert A. Walker; Jack Davidson

Publications Board
Co-Chairs: Ronald F. Boisvert and Holly Rushmeier
Board Members: Jack Davidson; Nikil Dutt; Carol Hutchins; Ee-Peng Lim; Catherine McGeoch; M. Tamer Ozsu; Vincent Shen; Mary Lou Soffa; Ricardo Baeza-Yates

ACM U.S. Public Policy Office
Cameron Wilson, Director
1828 L Street, N.W., Suite 800, Washington, DC 20036 USA
T (202) 659-9711; F (202) 667-1066

Computer Science Teachers Association
Chris Stephenson, Executive Director
2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA
T (800) 401-1799; F (541) 687-1840

Association for Computing Machinery (ACM)
2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA
T (212) 869-7440; F (212) 869-0481

STAFF
Director of Group Publishing: Scott E. Delman, publisher@cacm.acm.org
Executive Editor: Diane Crawford
Managing Editor: Thomas E. Lambert
Senior Editor: Andrew Rosenbloom
Senior Editor/News: Jack Rosenberger
Web Editor: David Roman
Editorial Assistant: Zarina Strakhan
Rights and Permissions: Deborah Cotton
Art Director: Andrij Borys
Associate Art Director: Alicia Kubista
Assistant Art Director: Mia Angelica Balaquiot
Production Manager: Lynn D’Addesio
Director of Media Sales: Jennifer Ruzicka
Marketing & Communications Manager: Brian Hebert
Public Relations Coordinator: Virginia Gold
Publications Assistant: Emily Eng

Columnists: Alok Aggarwal; Phillip G. Armour; Martin Campbell-Kelly; Michael Cusumano; Peter J. Denning; Shane Greenstein; Mark Guzdial; Peter Harsha; Leah Hoffmann; Mari Sako; Pamela Samuelson; Gene Spafford; Cameron Wilson

Contact Points
Copyright permission: permissions@cacm.acm.org
Calendar items: calendar@cacm.acm.org
Change of address: acmcoa@cacm.acm.org
Letters to the Editor: letters@cacm.acm.org

Web site: http://cacm.acm.org
Author guidelines: http://cacm.acm.org/guidelines

Advertising
ACM Advertising Department, 2 Penn Plaza, Suite 701, New York, NY 10121-0701
T (212) 869-7440; F (212) 869-0481
Director of Media Sales: Jennifer Ruzicka, jen.ruzicka@hq.acm.org
Media Kit: acmmediasales@acm.org

EDITORIAL BOARD
Editor-in-Chief: Moshe Y. Vardi, eic@cacm.acm.org

News
Co-chairs: Marc Najork and Prabhakar Raghavan
Board Members: Brian Bershad; Hsiao-Wuen Hon; Mei Kobayashi; Rajeev Rastogi; Jeannette Wing

Viewpoints
Co-chairs: Susanne E. Hambrusch; John Leslie King; J Strother Moore
Board Members: P. Anandan; William Aspray; Stefan Bechtold; Judith Bishop; Stuart I. Feldman; Peter Freeman; Seymour Goodman; Shane Greenstein; Mark Guzdial; Richard Heeks; Rachelle Hollander; Richard Ladner; Susan Landau; Carlos Jose Pereira de Lucena; Beng Chin Ooi; Loren Terveen

Practice
Chair: Stephen Bourne
Board Members: Eric Allman; Charles Beeler; David J. Brown; Bryan Cantrill; Terry Coatta; Mark Compton; Stuart Feldman; Benjamin Fried; Pat Hanrahan; Marshall Kirk McKusick; George Neville-Neil; Theo Schlossnagle; Jim Waldo
The Practice section of the CACM Editorial Board also serves as the Editorial Board of acmqueue.

Contributed Articles
Co-chairs: Al Aho and Georg Gottlob
Board Members: Yannis Bakos; Gilles Brassard; Alan Bundy; Peter Buneman; Carlo Ghezzi; Andrew Chien; Anja Feldmann; Blake Ives; James Larus; Igor Markov; Gail C. Murphy; Shree Nayar; Lionel M. Ni; Sriram Rajamani; Jennifer Rexford; Marie-Christine Rousset; Avi Rubin; Abigail Sellen; Ron Shamir; Marc Snir; Larry Snyder; Veda Storey; Manuela Veloso; Michael Vitale; Wolfgang Wahlster; Andy Chi-Chih Yao; Willy Zwaenepoel

Research Highlights
Co-chairs: David A. Patterson and Stuart J. Russell
Board Members: Martin Abadi; Stuart K. Card; Deborah Estrin; Shafi Goldwasser; Monika Henzinger; Maurice Herlihy; Norm Jouppi; Andrew B. Kahng; Gregory Morrisett; Michael Reiter; Mendel Rosenblum; Ronitt Rubinfeld; David Salesin; Lawrence K. Saul; Guy Steele, Jr.; Gerhard Weikum; Alexander L. Wolf; Margaret H. Wright

Web
Co-chairs: Marti Hearst and James Landay
Board Members: Jason I. Hong; Jeff Johnson; Greg Linden; Wendy E. MacKay

ACM Copyright Notice
Copyright © 2010 by Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

For other copying of articles that carry a code at the bottom of the first or last page or screen display, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center; www.copyright.com.

Subscriptions
An annual subscription cost is included in ACM member dues of $99 ($40 of which is allocated to a subscription to Communications); for students, cost is included in $42 dues ($20 of which is allocated to a Communications subscription). A nonmember annual subscription is $100.

ACM Media Advertising Policy
Communications of the ACM and other ACM Media publications accept advertising in both print and electronic formats. All advertising in ACM Media publications is at the discretion of ACM and is intended to provide financial support for the various activities and services for ACM members. Current advertising rates can be found by visiting http://www.acm-media.org or by contacting ACM Media Sales at (212) 626-0654.

Single Copies
Single copies of Communications of the ACM are available for purchase. Please contact acmhelp@acm.org.

Communications of the ACM (ISSN 0001-0782) is published monthly by ACM Media, 2 Penn Plaza, Suite 701, New York, NY 10121-0701. Periodicals postage paid at New York, NY 10001, and other mailing offices.

POSTMASTER: Please send address changes to Communications of the ACM, 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA.

Printed in the U.S.A.
Please recycle this magazine.

acm’s chief operating officer letter

DOI:10.1145/1743546.1743547   Patricia Ryan

A Tour of ACM’s HQ

Let me admit at the outset, I’m a bit biased when I talk about ACM headquarters (HQ) and the many ways it outshines the command centers of other professional societies. But that bias comes from nearly three decades of firsthand experience, having worked in almost every department within HQ, and having had a front-row seat to the dynamics that drive each department to create and provide products and services to meet the needs of ACM members.

The HQ staff coordinates the global activities of ACM’s chapters and various committees; acts as the liaison for all conferences carrying the ACM stamp; enhances the Digital Library (DL); produces almost four dozen publications; provides professional services and online courses; and performs organizational functions. HQ also serves as the hub for members, news media, and the public on all subjects related to computing and technologies.

Probably the greatest difference between ACM and other professional societies is just how much is accomplished by so few. With a membership fast approaching 100,000, ACM’s 170 conferences, 45 periodicals, 34 Special Interest Groups, and 644 professional and student chapters are all supported by 72 staffers. When you consider that similar professional associations with equal or fewer members and services are often supported by staffs of over 100, I hope you are as impressed as I continue to be by the devotion of the people working at ACM.

Here, I’d like to briefly highlight the departments and talented directors that spearhead ACM’s New York HQ.

ACM’s Special Interest Groups (SIGs) represent virtually every major discipline within computer science. ACM and its SIGs sponsor, co-sponsor, and cooperate with over 170 technical meetings annually, all of which are coordinated through the Office of SIG Services, with Donna Cappo at the helm. The department manages all SIGs and their conferences, including promotion, publicity, and membership. The team also helps produce over 85 proceedings and 30 newsletters yearly.

The Office of Publications produces ACM’s many research journals and Transactions and is responsible for Digital Library content. Director Bernard Rous oversees these efforts, which include 39 print journals and a DL containing over 275,000 articles and 1.5 million citations covered by the Guide to Computing Literature.

Scott Delman, director of the Office of Group Publishing, shepherds DL sales and builds business strategies to make the DL a must-have subscription for institutions worldwide, which today number over 2,800. He also serves as group publisher for ACM’s magazine series, including our flagship Communications of the ACM.

ACM’s Office of Membership, under the leadership of Lillian Israel, is one big umbrella covering everything under membership and marketing services, including membership processing, subscriptions, professional development services, and education. Of particular pride to me is the fact that with close to 100,000 members, ACM’s member services department, consisting of four customer representatives, is able to respond to any member query within 24 hours!

Russell Harris, a 36-year veteran of HQ, heads the Office of Financial Services. His team oversees all the Association’s financial matters, including accounting responsibilities, monthly and quarterly financial statements, and the annual ACM budget. His department cuts 10,000 checks per year, guides auditors through evaluations, handles treasury functions, and monitors ACM’s investment strategies daily.

The Office of Information Systems maintains constant vigilance over the Association’s “computing machinery,” with responsibilities for all the information processing needs serving staffers, volunteers, and members. Wayne Graves leads this effort, which includes maintaining HQ’s digital infrastructure, as well as spearheading the DL’s technical direction and advances.

Based in Washington, D.C., ACM’s Office of Public Policy (USACM) is very much a part of HQ, representing ACM’s interests on IT policy issues that impact the computing field. Cameron Wilson works to enlighten legislators about policies that foster computing innovations, including the critical need for CS and STEM education.

The unsung heroes who comprise the Office of Administrative Services support all the necessary functions that allow HQ to operate at the top of its game every single day. These varied services include human resources, office services and mailroom facilities, support policies and procedures, the ACM Awards program, and elections.

All of us at HQ are indebted to the tireless efforts of ACM’s many devoted volunteers worldwide who work together with us to manage ACM’s growing array of products and services. None of our efforts would succeed without them.

Patricia Ryan is Deputy Executive Director and Chief Operating Officer of ACM, New York, NY.

© 2010 ACM 0001-0782/10/0600 $10.00

letters to the editor

DOI:10.1145/1743546.1743549

Workflow Tools for Distributed Teams?

The “Profession of IT” Viewpoint “Orchestrating Coordination in Pluralistic Networks” by Peter J. Denning et al. (Mar. 2010) offered guidance for distributed development teams. As a leader of one such team, I can vouch for the issues it raised. However, my coordination problems are compounded because email (and related attachments) is today’s de facto medium for business and technical communication. The most up-to-date version of a document is an email attachment that instantly goes out of date when changes are made by any of the team members; project documents include specifications, plans, status reports, assignments, and schedules.

Software developers use distributed source-code control systems to manage changes to code. But these tools don’t translate well to all the documents handled by nondevelopers, including managers, marketers, manufacturers, and service and support people. I’d like to know what workflow-management tools Denning et al. would recommend for such an environment.

Ronnie Ward, Houston, TX

Author’s Response:
Workflow tools are not the issue. Many people simply lack a clear model of coordination. They think coordination is about exchanging messages and that related coordination breakdowns indicate poorly composed, garbled, or lost messages (as in email). Coordination is about making commitments, usually expressed as “speech acts,” or utterances that take action and make the commitments that produce the outcome the parties want. People learning the basics of coordination are well on their way toward successful coordination, even without workflow tools.
We don’t yet know enough about effective practices for pluralistic coordination to be able to design good workflow tools for this environment.

Peter J. Denning, Monterey, CA

Time to Debug

George V. Neville-Neil’s “Kode Vicious” Viewpoint “Taking Your Network’s Temperature” (Feb. 2010) was thought-provoking, but two of its conclusions—“putting printf()… throughout your code is a really annoying way to find bugs” and “limiting the files to one megabyte is a good start”—were somewhat misleading.

Timing was one reason Neville-Neil offered for his view that printf() can lead to “erroneous results.” Debuggers and printf() both have timing loads. Debug timing depends on hardware support. A watch statement functions like a printf(), and a breakpoint consumes “infinite” time. In both single-threaded and multithreaded environments, a breakpoint stops thread activity. In all cases, debugger statements perturb timing in a way that’s like printf(). We would expect such stimulus added to multithreaded applications to produce different output. Neville-Neil expressed a similar sentiment, saying “Networks are perhaps the most nondeterministic components of any complex computing system.” Both printf() and debuggers exaggerate timing differences, so the qualitative issue resolves to individual preferences, not to timing.

Choosing between a debugger and a printf() statement depends on the development stage in which each is to be used. At an early stage, a debugger might be better when timing and messaging order are less important than error detection. Along with functional integration in the program, a debugger can sometimes reach a point of diminishing returns. Programmers shift their attention to finding the first appearance of an error and the point in their programs where the error was generated. Using a debugger tends to be a trial-and-error process involving large amounts of programmer and test-bench time to find that very point. A printf() statement inserted at program creation requires no setup time and little bench time, so is, in this sense, resource-efficient. The downside is that at program creation (when the statement is inserted) programmers anticipate errors but are unaware of where and when they might occur; printf() output can be overwhelming, and the aggregate time to produce diagnostic output can impede time-critical operations. The overhead load of output and time is only partially correctable.

Limiting file size to some arbitrary maximum leads programmers to assume (incorrectly) that the search is for a single error and that localizing it is the goal. Limiting file size lets programmers focus on a manageable subset of data for analysis but misses other unrelated errors. If the point of error generation is not within some limited number of files, little insight is gained into where an error was in fact generated.

Neville-Neil’s advice that “No matter how good a tool you have, it’s going to do a much better job at finding a bug if you narrow down the search” might apply to “Dumped” (the “questioner” in his Viewpoint) but not necessarily to everyone else. An analysis tool is meant to discover errors, and programmers and users both win if errors are found. Trying to optimize tool execution time over error detection is a mistake.

Art Schwarz, Irvine, CA
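The tradeoff Schwarz describes is easy to see in miniature. The following C sketch is an editorial illustration, not code from the letter or the Viewpoint; the DBG macro and the fib example are assumptions chosen for brevity. Built with -DDEBUG_TRACE it produces the diagnostic flood (and timing load) the letter warns about; built without it, the tracing compiles away entirely.

    #include <stdio.h>
    #include <time.h>

    /* Compile-time trace switch: build with -DDEBUG_TRACE to enable the
       diagnostics; leave it off and the calls vanish from the binary. */
    #ifdef DEBUG_TRACE
    #define DBG(...) fprintf(stderr, __VA_ARGS__)
    #else
    #define DBG(...) ((void)0)
    #endif

    static long fib(int n) {
        /* One line per call: the "overwhelming output" and timing
           perturbation grow with the size of the call tree. */
        DBG("fib(%d) at tick %ld\n", n, (long)clock());
        return (n < 2) ? n : fib(n - 1) + fib(n - 2);
    }

    int main(void) {
        printf("fib(20) = %ld\n", fib(20));
        return 0;
    }

Inserted once at program creation, such instrumentation needs no later setup, which is Schwarz’s point; the price is paid in output volume and perturbed timing whenever it is switched on.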

George V. Neville-Neil’s Viewpoint (Feb. 2010) said students are rarely taught to use tools to analyze networking problems. For example, he mentioned Wireshark and tcpdump, but only in a cursory way, even though these tools are part of many contemporary university courses on networking.

Sniffers (such as Wireshark and Ethereal) for analyzing network protocols have been covered at Fairleigh Dickinson University for at least the past 10 years. Widely used tools for network analysis and vulnerability assessment (such as nmap, nessus, Snort, and ettercap) are available through Fedora and nUbuntu Linux distributions. Open source tools for wireless systems include NetStumbler and AirSnort.

Fairleigh Dickinson’s network labs run on virtual machines to limit inadvertent damage and the need for protection measures. We teach the basic network utilities available on Windows- and/or Posix-compliant systems, including ping, netstat, arp, tracert (traceroute), ipconfig (ifconfig in Linux/Unix and iwconfig for Linux wireless cards), and nslookup (dig in Linux). With the proper options, netstat displays IP addresses, protocols, and ports used by all open and listening connections, as well as protocol statistics and routing tables.

The Wireshark packet sniffer identifies control information at different protocol layers. A TCP capture specification thus provides a tree of protocols, with fields for frame header and trailer (including MAC address), IP header (including IP address), and TCP header (including port address). Students compare the MAC and IP addresses found through Wireshark with those found through netstat and ipconfig. They then change addresses and check results by sniffing new packets, analyzing the arp packets that try to resolve the altered addresses. Capture filters in Wireshark support search through protocol and name resolution; Neville-Neil stressed the importance of narrowing one’s search but failed to mention the related mechanisms. Students are also able to make connections through (unencrypted) telnet and PuTTY, comparing password fields.

My favorite Wireshark assignment involves viewing TCP handshakes via statistics/flow/TCP flow, perhaps following an nmap SYN attack. The free security scanner nmap runs alongside Wireshark, which watches the probes initiated by the chosen scan options. I always assign a Christmas-tree scan (nmap -sX) that sends packets with different combinations of flag bits. Capturing probe packets and a receiving station’s reactions enables identification of flag settings and the receiver’s response to them. Operating systems react differently to illegal flag combinations, as students observe via their screen captures.

Network courses and network maintenance are thus strongly enhanced by sniffers and other types of tools that yield information concerning network traffic and potential system vulnerabilities.

Gertrude Levine, Madison, NJ
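For readers who have not met the Christmas-tree scan Levine assigns, a small C sketch (an editorial illustration, not part of the letter) shows the TCP flag bits involved. An nmap -sX probe sets FIN, PSH, and URG simultaneously, a combination no legitimate handshake produces, and the receiver’s reaction to it is what students observe in their Wireshark captures.

    #include <stdio.h>

    /* TCP header flag bits (low six bits), as defined in RFC 793. */
    enum {
        TCP_FIN = 0x01, TCP_SYN = 0x02, TCP_RST = 0x04,
        TCP_PSH = 0x08, TCP_ACK = 0x10, TCP_URG = 0x20
    };

    int main(void) {
        /* nmap -sX ("Christmas tree") lights up FIN, PSH, and URG at once. */
        unsigned flags = TCP_FIN | TCP_PSH | TCP_URG;
        printf("xmas probe flags: 0x%02x (FIN|PSH|URG)\n", flags);
        /* Per RFC 793, a closed port answers such a probe with RST, while an
           open port stays silent; that asymmetry is what the scan exploits. */
        printf("expected reply from a closed port: RST (0x%02x)\n",
               (unsigned)TCP_RST);
        return 0;
    }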
What Jack Doesn’t Know About Software Maintenance

I agree that the community doesn’t understand software maintenance, as covered in the article “You Don’t Know Jack about Software Maintenance” by Paul Stachour and David Collier-Brown (Nov. 2009), but much more can be done to improve the general understanding of the important challenges.

The software-maintenance projects I’ve worked on have been difficult, due to the fact that maintenance work is so different from the kind of work described in the article. The community does not fully understand that maintenance involves much more than just adding capabilities and fixing bugs. For instance, maintenance teams on large projects spend almost as much time providing facility, operations, product, and sustaining-engineering support as they do changing code.1 Moreover, the work tends to be distributed differently. My colleagues and I recently found maintenance teams spending as much as 60% of their effort testing code once the related changes are implemented.

Other misconceptions include:

The primary job in maintenance is facilitating changes. We found that support consumes almost as much effort as changes and repairs;

Maintenance is aimed at addressing new requirements. Because most jobs are small, maintenance teams focus on closing high-priority trouble reports rather than making changes;

Funding maintenance is based on requirements. Most maintenance projects are funded level-of-effort; as such, maintenance managers must determine what they can do with the resources they have rather than what needs to be done;

Maintenance schedules are based on user-need dates. Maintenance schedules are written in sand, so maintenance leaders must determine what can be done within a limited time period;

Maintenance staff is junior. Average experience for maintenance personnel is 25 years, during which they tend to work on outdated equipment to fix software written in aging languages; and

Maintenance is well tooled. We found the opposite. Maintenance tools are inferior, and development tools and regression test suites unfortunately do not support the work.

Maintenance involves much more than Stachour and Collier-Brown indicated. In light of the changing nature of the work being done every day by software maintenance teams, my colleagues and I urge Communications to continue to cover the topic.

Reference
1. Reifer, D., Allen, J.-A., Fersch, B., Hitchings, B., Judy, J., and Rosa, W. Software maintenance: Debunking the myths. In Proceedings of the International Society of Parametric Analysis / Society of Cost Estimating and Analysis Annual Conference and Training Workshop (San Diego, CA, June 8–11). ISPA/SCEA, Vienna, VA, 2010.

Donald J. Reifer, Prescott, AZ

Authors’ Response:
In our slightly tongue-in-cheek description of software maintenance, we were concentrating on the “add a highway” side of the overall problem, rather than “repair the railroad bridge.” We try to avoid considering software maintenance as a separate process done by a different team. That’s a genuinely difficult problem, as Reifer points out. We’ve seen it tried a number of times, with generally disappointing results.
A better question might be the one asked by Drew Sullivan, president of the Greater Toronto Area Linux User Group, at a presentation we gave on the subject: “Why aren’t you considering maintenance as continuing development?” In fact we were, describing the earlier Multics norm of continuous maintenance without stopping any running programs. We’re pleased to see the continuous process being independently reinvented by practitioners of the various agile methods. In addition, we’re impressed by their refactoring and test-directed development. These are genuinely worthwhile improvements to the continuous approach, and we hope the techniques we re-described are valuable to that community.

Paul Stachour, Bloomington, MN
David Collier-Brown, Toronto

Communications welcomes your opinion. To submit a Letter to the Editor, please limit your comments to 500 words or less and send to letters@cacm.acm.org.

© 2010 ACM 0001-0782/10/0600 $10.00

in the virtual extension

DOI:10.1145/1743546.1743550

In the Virtual Extension


Communications’ Virtual Extension brings more quality articles to ACM
members. These articles are now available in the ACM Digital Library.

Examining Agility in Software Development Practice
Sergio de Cesare, Mark Lycett, Robert D. Macredie, Chaitali Patel, and Ray Paul
Agility is a facet of software development attracting increasing interest. The authors investigate the value of agility in practice and its effects on traditional plan-based approaches. Data collected from senior development/project managers in 62 organizations is used to investigate perceptions related to agile development. Specifically, the perceptions tested relate to the belief in agile values and principles, and the value of agile principles within current development/organization practice. These perceptions are examined in the context of current practice in order to test perceptions against behavior and understand the valued aspects of agile practice implicit in development today. The broad outcome indicates an interesting marriage between agile and plan-based approaches. This marriage seeks to allow flexibility of method while retaining control.

Barriers to Systematic Model Transformation Testing
Benoit Baudry, Sudipto Ghosh, Franck Fleurey, Robert France, Yves Le Traon, and Jean-Marie Mottu
Model Driven Engineering (MDE) techniques support extensive use of models in order to manage the increasing complexity of software systems. Automatic model transformations play a critical role in MDE since they automate complex, tedious, error-prone, and recurrent software development tasks. For example, Airbus uses automatic code synthesis from SCADE models to generate the code for embedded controllers in the Airbus A380. Model transformations that automate critical software development tasks must be validated. The authors identify characteristics of model transformation approaches that contribute to the difficulty of systematically testing transformations as well as present promising solutions and propose possible ways to overcome these barriers.

Factors that Influence Software Piracy: A View from Germany
Alexander Nill, John Schibrowsky, James W. Peltier, and Irvin L. Young
Software piracy has wide-ranging negative economic consequences for manufacturers and distributors striving to compete in a competitive global market. Indeed, software piracy is jeopardizing the future growth and development of the IT industry, which in turn disproportionately impacts countries with the highest piracy rates. This article details an exploratory study that investigated the relationship between a comprehensive set of factors and software piracy in Germany. The authors gleaned some valuable security measures from the results of the study that can be used as a starting point for industry and governments to develop programs to deter piracy.

The Requisite Variety of Skills for IT Professionals
Kevin P. Gallagher, Kate M. Kaiser, Judith C. Simon, Cynthia M. Beath, and Tim Goles
IT professionals today are beset by ongoing changes in technology and business practices. To thrive in such a dynamic environment requires competency in a broad range of skills, both technical and nontechnical. The authors contend the Law of Requisite Variety—adapting to change requires a varied enough solution set to match the complexity of the environment—can help explain the need for greater and broader skills among IT professionals. The article outlines a framework containing six skill categories critically important for the career development of IT professionals.

Panopticon Revisited
Jan Kietzmann and Ian Angell
Many claims have been made regarding the safety benefits of computer-supported surveillance technologies. However, like many technologies the advantageous door swings both ways. The authors compare how current computer and communication technologies are shaping today’s “panopticons,” pulling heavily from the 1787 prison architectural design by social theorist Jeremy Bentham that allowed prison officials and observers to keep an eye on prisoners without the imprisoned able to tell they are being watched.

The Social Influence Model of Technology Adoption
Sandra A. Vannoy and Prashant Palvia
While social computing has fast become an industry buzzword encompassing networking, human innovation, and communications technologies, few studies have investigated technology adoption targeting the individual at the level of society, community, or lifestyle experience. The authors address this gap by developing social constructs and providing a theoretically grounded model for technology adoption in the context of social computing. Their model suggests that social computing action, cooperation, consensus, and authority are antecedents to social influence. And social influence, they contend, leads to technology adoption.

I, Myself and e-Myself
Cheul Rhee, G. Lawrence Sanders, and Natalie C. Simpson
The human ego, developed from birth, is central to one’s conscious self, according to experts. This article examines the concept of the virtual ego (one that begins with the creation of an online identity and functions only online) and the notion of an online persona as overarching concepts providing a new rationale for understanding and explaining online behavior. The authors posit that an Internet user’s virtual ego is distinguishable from his or her real ego and argue that understanding the differences between the two is essential for advancing the dynamics of Web-based social networks.

Beyond Connection: Situated Wireless Communities
Jun Sun and Marshall Scott Poole
Compared to traditional Internet-based virtual communities, situated wireless communities (SWCs) go beyond just connecting people together. In fact, SWCs enable people to share their common physical and/or social context with each other. With the availability of these cues, the social interaction among members within a community is significantly enhanced. The authors detail four general types of SWCs as well as give examples of the strengths and weaknesses of each.

ACM, Intel, and Google congratulate Charles P. Thacker, recipient of the ACM A.M. Turing Award, for the pioneering design and realization of the first modern personal computer—the Alto at Xerox PARC—and seminal inventions and contributions to local area networks (including the Ethernet), multiprocessor workstations, snooping cache coherence protocols, and tablet personal computers.

By the community... From the community... For the community...

“Charles Thacker’s Alto computer embodied the key elements of today’s PCs, and is at the root of one of the world’s most innovative industries. Intel applauds Chuck’s clarity of insight, focus on simplicity, and amazing track record of designing landmark systems that have accelerated the progress of research and industry for decades.”
Andrew A. Chien
Vice President, Research
Director, Future Technologies Research
For more information, see www.intel.com/research.

“Google is pleased to honor contributions that made possible the style of computing that we enjoy today. We are proud to support the Turing Award to encourage continued research in computer science and the related technologies that depend on its continued advancement.”
Alfred Spector
Vice President, Research and Special Initiatives, Google
For more information, see http://www.google.com/corporate/index.html and http://research.google.com/.

Financial support for the ACM A.M. Turing Award is provided by Intel Corporation and Google.
The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we’ll publish selected posts or excerpts.

Follow us on Twitter at http://twitter.com/blogCACM

doi:10.1145/1743546.1743551   http://cacm.acm.org/blogs/blog-cacm

The Chaos of the Internet as an External Brain; and More

Greg Linden writes about the Internet as a peripheral resource; Ed H. Chi discusses lessons learned from the DARPA Network Challenge; and Mark Guzdial asks if there are too many IT workers or too many IT jobs.

Greg Linden’s “The Rise of the External Brain”
http://cacm.acm.org/blogs/blog-cacm/54333

From the early days of computers, people have speculated that computers would be used to supplement our intelligence. Extended stores of knowledge, memories once forgotten, computational feats, and expert advice would all be at our fingertips.

In the last decades, most of the work toward this dream has been in the form of trying to build artificial intelligence. By carefully encoding expert knowledge into a refined and well-pruned database, researchers strove to build a reliable assistant to help with tasks. Sadly, this effort was always thwarted by the complexity of the system and environment, with too many variables and uncertainty for any small team to fully anticipate.

Success now is coming from an entirely unexpected source—the chaos of the Internet. Google has become our external brain, sifting through the extended stores of knowledge offered by multitudes, helping us remember what we once found, and locating advice from people who have been where we now want to go.

For example, the other day I was trying to describe to someone how mitochondria oddly have a separate genome, but could not recall the details. A search for “mitochondria” yielded a Wikipedia page that refreshed my memory. Later, I was wondering if traveling by train or flying between Venice and Rome was a better choice; advice arrived immediately on a search for “train flying venice rome.” Recently, I had forgotten the background of a colleague, which was restored again with a quick search on her name. Hundreds of times a day, I access this external brain, supplementing what is lost or incomplete in my own.

This external brain is not programmed with knowledge, at least not in the sense we expected. There is no system of rules, no encoding of experts, no logical reasoning. There is precious little understanding of information, at least not in the search itself. There is knowledge in the many voices that make up the data on the Web, but no synthesis of those voices.

Perhaps we should have expected this. Our brains, after all, are a controlled storm of competing patterns and signals, a mishmash of evolutionary agglomeration that is barely functional and easily fooled. From this chaos can come brilliance, but also superstition, illusion, and psychosis. While early studies of the brain envisioned it as a disciplined and orderly structure, deeper investigation has proved otherwise.

And so it is fitting that the biggest progress on building an external brain also comes from chaos. Search engines pick out the gems in a democratic sea of competing signals, helping us find the brilliance that we seek. Occasionally, our external brain leads us astray, as does our internal brain, but therein lies both the risk and beauty of building a brain on disorder.

Ed H. Chi’s “The DARPA Network Challenge and the Design of Social Participation Systems”
http://cacm.acm.org/blogs/blog-cacm/60832

The DARPA Network Challenge recently made quite a splash across the Internet and the media. The task was to
identify the exact location of 10 red weather balloons around the country. The winning team from MIT succeeded in identifying the locations of the balloons in less than nine hours.

There was recently a good article in Scientific American about the winning entry and the second-place team from Georgia Tech. The article details how the teams tried to: (1) build social incentives into the system to get people to participate and to recommend their friends to participate; and (2) fight spam or noisy information from people trying to lead them astray. The MIT team, for example, required photo proofs of both the balloon and the official DARPA certificate of the balloon at each location, suggesting they realized that noisy or bad data is a real challenge in social participation systems.

But what did the challenge really teach us?

Looking back over the last decade or so, we have now gotten a taste of how mass-scale participation in social computing systems results in dramatic changes in the way science, government, health care, entertainment, and enterprises operate.

The primary issue relating to the design of social participation systems is understanding the relationship between usability, sociability, social capital, and collective intelligence, and how to elicit effective action through design.

• Usability concerns the ability for all users to contribute, regardless of their accessibility requirements and computing experience, and how to lower the interaction costs of working with social systems.

• Sociability refers to the skill or tendency of being sociable and of interacting well with others. The designer plays a huge role in facilitating and lubricating social interactions amongst users of a system.

• Social capital refers to the positions people occupy in social networks, and their ability to utilize those positions for some goal. Designers need to enable people to sort themselves into comfortable positions in the social network, including leadership and follower positions.

• Collective intelligence (or social intelligence) refers to the emergence of intelligent behavior among groups of people. Designers can create mechanisms, such as voting systems, folksonomies, and other opinion aggregators, to ensure the emergence of social intelligence over time. (Note that the definition of “social intelligence” here differs from its traditional use in social psychology.)

The principal concern for designers of systems is to ensure the participants both give and get something from the system that is beneficial to the individual as well as to the group. This may take the form of being challenged in their ideas, contributing to the overall knowledge of a domain, or contributing their experiences of using a particular product or drug.

More importantly, social participation systems should encourage users to take part in effective action. One main design principle here is that effective action arises from collective action. That is, by encouraging participants to learn from each other and to form consensus, group goals will form, and action can be taken by the entire group.

The DARPA Network Challenge is interesting in that it was designed to see how we can get groups of people to take part in effective action. In that sense, the experiment was really quite successful. But we already have quite a good example of this in Wikipedia, in which a group of people came together to learn from each other’s perspectives while sharing a common goal: to create an encyclopedia of the state of human knowledge for broader distribution. Here, collective action resulted in effective change in the way people access information.

Looking toward the next decade, the social computing research challenge is understanding how to replicate effective social actions in social participation systems, in domains such as health care, education, and open government. United, we might just solve some of the biggest problems in the world.

Mark Guzdial’s “Are There Too Many IT Jobs or Too Many IT Workers?”
http://cacm.acm.org/blogs/blog-cacm/67389

The latest U.S. Bureau of Labor Statistics (BLS) projections have been updated, as of November 2009, to reflect the Great Recession. The news is terrific for us—computing is only forecast to grow, and at an amazing rate.

Via the Computing Community Consortium blog: “‘Computer and mathematical’ occupations are projected to grow by the largest percentage between now and 2018—by 22.2%. In other words, ‘Computer and mathematical’ occupations are the fastest-growing occupational cluster within the fastest-growing major occupational group.”

DARPA is so concerned about the lack of IT workers (and the lack of diversity among those workers) that it has launched a “far-out research” project to increase the number of students going into “CS-STEM” (computer science and science, technology, engineering, and mathematics) fields. Wired just covered this effort to address the “Geek shortage.”

What makes the Wired piece so interesting is the enormous and harsh pushback in the comments section, like the following:

“I’m 43, with a degree in software engineering and enjoy what I do for a living. But I wouldn’t encourage my 12-year-old son to major in CS or similar because interesting, new project development jobs are the first to disappear in a down economy and non-cutting-edge skills are easily offshored and new hires are cheaper than retraining outdated workers.”

“Why get a four-year degree for a career with a 15-year shelf life?”

Are these complaints from a vocal small group, or do they represent a large constituency? Why is there this disconnect between claims of great need and claims of no jobs? Are old IT workers no longer what industry wants? Is BLS only counting newly created jobs and not steady-state jobs? Is the IT job market constantly churning? Is industry not training existing people and instead hiring new people? It’s a real problem to argue for the need for more IT in the face of many (vocal) unemployed IT workers.

Greg Linden is the founder of Geeky Ventures. Ed H. Chi is a research manager at Palo Alto Research Center. Mark Guzdial is a professor at the Georgia Institute of Technology.

© 2010 ACM 0001-0782/10/0600 $10.00

cacm online

DOI:10.1145/1743546.1743552  David Roman

Interact Naturally

The mouse's days are numbered. Computer interfaces that remove user-system barriers are in the works and are intuitive enough for first-time users to throw away the manual. The iPhone's multitouch interface may have ushered in a wave of easier interfaces for the mass market, but it's just the beginning. Many new and exciting replacements for the familiar point-and-click scheme are on the way.
Skinput technology (http://www.chrisharrison.net/projects/skinput/) showcased at CHI 2010 (http://cacm.acm.org/news/83935) "appropriates" the human body as an input surface, says Carnegie Mellon Ph.D. student Chris Harrison, who developed Skinput with Desney Tan and Dan Morris of Microsoft Research.
Gaze-based interfaces are being considered for data input, search, and selection (http://portal.acm.org/toc.cfm?id=1743666&type=proceeding&coll=GUIDE&dl=GUIDE&CFID=86057285&CFTOKEN=34856226), and driving vehicles (http://cacm.acm.org/news/88018-car-steered-with-drivers-eyes/fulltext). Voice controls have boarded Ford cars (http://www.fordvehicles.com/technology/sync/) and Apple smartphones.
Gesture interfaces are another hot area in the interface arena. MIT Media Lab Ph.D. candidate Pranav Mistry's Sixth Sense (http://www.pranavmistry.com/projects/sixthsense/) gesture interface (http://cacm.acm.org/news/23600) received a standing ovation when it premiered at the TED conference in February 2009 (http://www.ted.com/talks/pattie_maes_demos_the_sixth_sense.html). Using multitouch gestures like the iPhone, Sixth Sense does not require a dedicated screen, but like many advanced interfaces it does depend on specialized hardware. Microsoft's Project Natal gesture interface (http://en.wikipedia.org/wiki/Project_Natal) will give gamers hands-free control of the Xbox 360 in time for the holiday season. There are dozens of related YouTube videos at http://www.youtube.com/user/xboxprojectnatal. Its application outside gaming is not clear.

[Figure: Sixth Sense: using the palm for dialing a phone number.]

Another promising but challenging area is the brain-machine interface (http://cacm.acm.org/news/73070-building-a-brain-machine-interface/fulltext), which sounds less like fact than fiction, but was in fact the focus of DARPA's Augmented Cognition program (http://www.wired.com/dangerroom/2008/03/augcog-continue/).
All these interfaces aim to give users a simple, natural way to interact with a system. Microsoft's Chief Research and Strategy Officer Craig Mundie says natural user interfaces will appear first with gaming and entertainment systems but "will certainly find application…in the communications domain."
To read more about the interfaces of the future, check out the newly revamped ACM student magazine XRDS (formerly Crossroads). The print edition is available now; look for the magazine's new Web site coming soon.

ACM Member News

Ed Lazowska Wins ACM's Distinguished Service Award
Ed Lazowska, the Bill & Melinda Gates Chair in Computer Science & Engineering and director of the eScience Institute at the University of Washington, is the 2009 recipient of ACM's Distinguished Service Award for his wide-ranging service to the computing community and his long-standing advocacy for this community at the national level.
"Forty years ago, in 1969, there were three landmark events: Woodstock, Neil Armstrong's journey to the surface of the moon, and the first packet transmission on ARPANET," Lazowska said in an email interview. "With four decades of hindsight, which had the greatest impact? Unless you're big into Tang and Velcro (or sex and drugs), the answer is clear. Our future is every bit as bright as our past. No field is more important to the future of our nation or our world than computer science. We need to get that message out. Advances in computing are fundamental to addressing every societal grand challenge.
"I was recently party to a discussion with Anita Jones and Robin Murphy on 'Type I' and 'Type II' research in CS. Type I focuses on the technologies—compilers, operating systems, sensors. Type II focuses on the problems—engineering the new tools of scientific discovery, global development, health IT, emergency informatics, entertainment technology. I'm a Type II person. Either we expand our definition of CS to embrace these challenges, or we become insignificant.
"Forty years ago, as an undergraduate at Brown University, I was seduced into computer science by Andy van Dam. The potential to change the world is greater today than it has ever been. We need to communicate this to students. This field makes you powerful."
—Jack Rosenberger
news

Science | doi:10.1145/1743546.1743553  Neil Savage

Straightening Out Heavy Tails
A better understanding of heavy-tailed probability distributions can improve activities from Internet commerce to the design of server farms.

[Figure: Each vertical bar represents a decile of DVD popularity, with the DVDs in decile 10 being the most popular. Each bar is subdivided to demonstrate how, on average, customers who rented at least one DVD from within that decile distributed their rentals among all the deciles. Shoppers in the bottom decile, for instance, selected only 8% of their rentals from among its titles—and 34% from among top-decile titles. Source: Quickflix. Reprinted by permission of Harvard Business Review, from "Should You Invest in the Long Tail?" by Anita Elberse, July–August 2008. Copyright 2008 by the Harvard Business School Publishing Corporation; all rights reserved.]

Online commerce has affected traditional retailers, moving transactions such as book sales and movie rentals from shopping malls to cyberspace. But has it fundamentally changed consumer behavior?
Wired Editor-in-Chief Chris Anderson thinks so. In his 2006 book titled The Long Tail and more recently on his Long Tail blog, Anderson argues that online retailers carry a much wider variety of books, movies, and music than traditional stores, offering customers many more products to choose from. These customers, in turn, pick more of the niche products than the popular hits. While individual niche items may not sell much more, cumulatively they're a bigger percentage of overall business.
The book's title comes from the shape of a probability distribution graph. Many phenomena follow the normal or Gaussian distribution; most of the values cluster tightly around the median, while in the tails they drop off exponentially to very small numbers. Events far from the median are exceedingly rare, but other phenomena, including book and movie purchases, follow a different course.
Values in this distribution drop much less rapidly, and events of roughly equal probability stretch far out into the tail.
Plotted on a graph, such distributions produce a peak very near the y-axis, a cluster near the intersection of the axes, and a long tail along the x-axis. This gives rise to the name long tail, or heavy tail. While a long tail is technically just one class of a heavy tail distribution—other terms include fat tail, power law, and Pareto distribution—in popular usage and for practical purposes there's no notable difference.
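The contrast is easy to see numerically. The following Python sketch, added here purely as an illustration (its parameters are arbitrary and are not drawn from any study mentioned in this article), samples a Gaussian and a Pareto distribution and compares how often large values occur:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Gaussian sample: tail probabilities collapse exponentially.
    normal = rng.normal(loc=1.0, scale=1.0, size=n)

    # Pareto sample (tail exponent 1.5, shifted so values start at 1):
    # a textbook heavy tail.
    pareto = rng.pareto(1.5, size=n) + 1.0

    for k in (2, 5, 10, 50):
        p_normal = (normal > k).mean()
        p_pareto = (pareto > k).mean()
        print(f"P(X > {k}): normal={p_normal:.2e}  pareto={p_pareto:.2e}")

By k = 10 the Gaussian tail has effectively vanished, while the Pareto sample still contains tens of thousands of values that large; that persistence is exactly what the long tail describes.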
Because many phenomena have heavy tails—computer job sizes, for instance, and Web page links—researchers want to know how the distributions work and their effects.
Recently, for instance, business experts have tested the Long Tail theory and found that its effect on Internet commerce may not be as straightforward as some thought. Serguei Netessine, associate professor of operations and information management at the University of Pennsylvania, looked at data from Netflix, the movie rental service, containing 100 million online ratings of 17,770 movies, from 2000 to 2005, by 480,000 users. Netessine used the ratings as a proxy for rentals in his study, but says he has since looked at actual rental data.
Netessine's conclusion is that, when looked at in percentage terms, the demand for hits actually grew, while the demand for niche products dropped. "There is no increased demand for bottom products that we are seeing in this data," Netessine says.
Anita Elberse, associate professor of business administration at Harvard Business School, earlier looked at data from Quickflix, an Australian movie rental service similar to Netflix, and from Nielsen VideoScan and Nielsen SoundScan, which monitor video and music sales, respectively, and reached the same conclusion. "There might be an increasingly long tail, but that tail is getting thinner," Elberse says.
Anderson disputes their conclusions, saying the researchers define the Long Tail differently. Anderson defines hits in absolute terms, the top 10 or top 100 titles, while the academics look at the top 1% or 10% of products. "I never do percentage analysis, since I think it's meaningless," Anderson wrote in an email interview. "You can't say 'I choose to define Long Tail as X and X is wrong, therefore Chris Anderson is wrong.' If you're going to critique the theory, you've got to actually get the theory right."
Anderson points to the same Netflix data used by Netessine and notes that the top 500 titles dropped from more than 70% of demand in 2000 to under 50% in 2005. In 2005, Anderson notes, 15% of demand came from below the top 3,000, about the point where brick-and-mortar stores run out of inventory. "In that sense, Anderson was right, but to me that was more or less a trivial finding, because each year we have more and more movies available," Netessine says.

Influencing Consumers' Choices
An improved understanding of where in the distribution consumers land could lead to new methods of swaying their choices, in the form of better-designed recommendation engines. Researchers postulate that one reason consumers fail to choose niche products is that they have no way to find them. "If nobody is buying these items, how do they get recommended in the first place?" asks Kartik Hosanagar, associate professor of operations and information management at the University of Pennsylvania.
Hosanagar says collaborative filtering, based on user recommendations, can't find undiscovered items. It can, however, find items that are somewhat popular, and bring them to the attention of customers who might have missed them. He found that consumers appreciated recommendation engines that suggest more niche items, perhaps because they were already aware of the blockbusters.

Obituary

PC Pioneer Ed Roberts, 1941–2010

Henry Edward "Ed" Roberts, who created the first inexpensive personal computer in the mid-1970s, died on April 1 at the age of 68. Roberts is often credited as being "the father of the personal computer."
Roberts and a colleague founded Micro Instrumentation and Telemetry Systems (MITS) in 1970 to sell electronics kits to model-rocket hobbyists. In the mid-1970s, MITS developed the Altair 8800, a programmable computer, which sold for a starting price of $397. The Altair 8800 was featured on the January 1975 cover of Popular Electronics, and MITS shipped an impressive 5,000 units within a year.
The Popular Electronics cover story caught the attention of Paul Allen, a Honeywell employee, and Bill Gates, a student at Harvard University. They approached Roberts and were soon working at MITS, located in Albuquerque, NM, where they created a version of the Basic programming language for the Altair 8800. It was a move that would ultimately lead to the founding of Microsoft Corp.
MITS was sold to Pertec Computer Corporation in 1977, and Roberts received $2 million. He retired to Georgia and first worked as a farmer, then studied medicine and became a physician.
Opinions differ about who invented the first personal computer, with credit being variously given to Roberts, John Blankenbaker of Kenbak Corporation, Xerox Palo Alto Research Center, Apple, and IBM. Roberts' impact on computing, though short in duration, is immeasurable. "He was a seed of this thought that computers would be affordable," Apple cofounder Steve Wozniak has said. The breakthrough insight for Roberts might have occurred during the late 1960s while working on a room-size IBM computer as an electrical engineering major at Oklahoma State University. As he later said in an interview, "I began thinking, What if you gave everyone a computer?"
—Jack Rosenberger

He says retailers might boost their sales with improved recommendation engines. Content analysis, which identifies characteristics of a product—say, the director or genre of a movie—and suggests it to buyers of products with similar characteristics, could increase the diversity of recommendations. Hosanagar says researchers looking at Internet commerce shouldn't assume sales mechanisms and buyers' behavior are unchangeable. "A lot of social scientists are looking at the system as a given and trying to look at its impact, but what we are saying is the system is not a given and there are a lot of design factors involved," he says.
That matches the thinking of Michael Mitzenmacher, a professor of computer science at Harvard, who says researchers need a better understanding of how heavy tails come to be, so that they can turn that knowledge to practical use. "For a long time people didn't realize power law distributions come up in a variety of ways, so there are a variety of explanations," Mitzenmacher says. "If we understand the model of how that power law comes to be, maybe we can figure out how to push people's behavior, or computers' behavior, in such a way as to improve the system."
For instance, file size follows a power law distribution. An improved understanding of that could lead to more efficiently designed and economical file storage systems. Or if hyperlinks are similar to movie rentals—that is, if the most popular Web pages retain their popularity while the pages out in the tail remain obscure—it might make sense to take that into account when designing search engines. And if search engines have already changed that dynamic, it could be valuable to understand how.
One area where heavy tails affect computer systems is the demand that UNIX jobs place on central processing units (CPUs). Mor Harchol-Balter, associate department head of graduate education in the computer science department at Carnegie Mellon University, says the biggest 1% of jobs make up half the total load on CPUs. While most UNIX jobs may require a second or less of processing time, some will need several hours.
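Harchol-Balter's observation that the biggest 1% of jobs can make up half the total load is easy to reproduce in a toy simulation. In this sketch, added for illustration, the Pareto exponent is an assumption chosen for effect rather than a figure from her work:

    import numpy as np

    rng = np.random.default_rng(1)

    # One million synthetic job sizes from a Pareto distribution; tail
    # exponents just above 1 give the extreme skew associated with
    # UNIX job sizes (the value 1.2 is an illustrative assumption).
    jobs = rng.pareto(1.2, size=1_000_000) + 1.0

    jobs.sort()  # ascending, so the biggest jobs sit at the end
    biggest_one_percent = jobs[int(0.99 * len(jobs)):]

    share = biggest_one_percent.sum() / jobs.sum()
    print(f"load carried by the biggest 1% of jobs: {share:.0%}")

With these settings the biggest 1% of jobs typically carry a share of the total work in the neighborhood of one half, even though 99% of the jobs are individually tiny.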
If there were low variability among job sizes, as people used to believe, it would make sense to have just a few, very fast servers. But because the job size distribution is heavy tailed, it's more efficient to use more, slower machines. Designing server farms with this understanding, Harchol-Balter says, could cut electricity demand by 50%.
"In computer science, we really concentrate on understanding the distribution of the sizes of the requirements, and making our rules based on this understanding," she notes. "We are having to invent whole new mathematical fields to deal with these kinds of distributions."
Harchol-Balter finds the origin of heavy tail distributions a fascinating question, one she sometimes asks her classes to speculate about. "Why are files at Web sites distributed according to a Pareto distribution? Why on Earth? Nobody set out to make this kind of distribution," she says. "Somehow most people write short programs and some people write long programs and it fits this distribution very well."
But the "why" isn't her main concern. "I don't need to know why it happens," Harchol-Balter says. "I just need to know if it's there what I'm going to do with it."

Further Reading

Tan, T.F. and Netessine, S.
Is Tom Cruise threatened? Using Netflix Prize data to examine the long tail of electronic commerce. Working paper, University of Pennsylvania, Wharton Business School, July 2009.

Elberse, A.
Should you invest in the long tail? Harvard Business Review, July–August 2008.

Mitzenmacher, M.
Editorial: the future of power law research. Internet Mathematics 2, 4, 2005.

Harchol-Balter, M.
The effect of heavy-tailed job size distributions on computer system design. Proceedings of ASA-IMS Conference on Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics, Washington, DC, June 1999.

Fleder, D.M. and Hosanagar, K.
Blockbuster culture's next rise or fall: the impact of recommender systems on sales diversity. Management Science 55, 5, May 2009.

Newman, M.E.J.
Power laws, Pareto distributions, and Zipf's law. Contemporary Physics 46, 323–351, 2005.

Neil Savage is a science and technology writer based in Lowell, MA. Prabhakar Raghavan, Yahoo! Research, contributed to the development of this article.

© 2010 ACM 0001-0782/10/0600 $10.00

Networking
Quantum Milestone
Researchers at Toshiba Research Europe, in Cambridge, U.K., have attained a major breakthrough in quantum encryption, with their recent continuous operation of quantum key distribution (QKD) with a secure bit rate of more than 1 megabit per second over 50km of fiber optic cable. The researchers' bit rate, averaged over a 24-hour period, is 100–1,000 times higher than that of any previous QKD system over a 50km link. The breakthrough could enable the widespread usage of one-time pad encryption, a method that is theoretically 100% secure.
First reported in Applied Physics Letters, the QKD milestone was achieved with a pair of innovations: a unique light detector for high bit rates and a feedback system that maintains a high bit rate and, unlike previous systems, does not depend on manual set-up or adjustments.
"Although the feasibility of QKD with megabits per second has been shown in the lab, these experiments lasted only minutes or even seconds at a time and required manual adjustments," says Andrew Shields, assistant managing director at the Cambridge lab. "To the best of our knowledge this is the first time that continuous operation has been demonstrated at high bit rates. Although much development work remains, this advance could allow unconditionally secure communication with significant bandwidths."
The QKD breakthrough will allow the real-time encryption of video with a one-time pad. Previously, researchers could encrypt continuous voice data, but not video.
Toshiba plans to install a QKD demonstrator at the National Institute of Information and Communications Technology in Tokyo. "The next challenge would be to put this level of technology into metropolitan network operation," says Masahide Sasaki, coordinator of the Tokyo QKD network. "Our Japan-EU collaboration is going to do this within the next few years."
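The one-time pad mentioned in the sidebar takes only a few lines to demonstrate. The sketch below is an illustration added here and has no connection to Toshiba's system: it XORs each byte of a message with a fresh random key byte, which is information-theoretically secure provided the key is truly random, kept secret, and never reused.

    import secrets

    def otp_encrypt(message: bytes, key: bytes) -> bytes:
        # XOR each message byte with the corresponding key byte.
        assert len(key) == len(message), "the pad must match the message length"
        return bytes(m ^ k for m, k in zip(message, key))

    # Decryption is the same XOR operation applied a second time.
    otp_decrypt = otp_encrypt

    plaintext = b"attack at dawn"
    key = secrets.token_bytes(len(plaintext))  # one fresh random byte per message byte

    ciphertext = otp_encrypt(plaintext, key)
    assert otp_decrypt(ciphertext, key) == plaintext
    print(ciphertext.hex())

The practical obstacle is that the pad must be as long as all the traffic it protects, which is why a continuous megabit-per-second key rate matters.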
Technology | doi:10.1145/1743546.1743554 Tom Geller

Beyond the Smart Grid


Sensor networks monitor residential and institutional devices,
motivating energy conservation.

As President-elect, Barack Obama used the term "smart grid" in his first major speech of 2009, and few phrases have enjoyed as much currency recently. The electrical grid isn't the only utility acquiring intelligence, however, as water and gas meters throughout the U.S. gain radio communication capabilities and other innovations.
But those grids (and their attendant smartness) stop at the residential meter, so consumers never know which household devices are the biggest energy users. Once monitored, these devices would need to communicate—to turn on the ceiling fan and adjust the air conditioner when electricity prices peak, for example. The final ingredient for such a system to be useful to consumers is an easy-to-understand interface for monitoring and controlling the devices.
The solution, like the problem, has three parts. First, monitor each device separately; second, network them together for coordination; third, present the resulting data in an easy-to-use format. As it happens, this solution goes beyond energy conservation to suggest new ways of integrating home automation, safety, security, and entertainment applications with smart grid data.

Your Home's Five Senses
The first key part is the sensors themselves. For utility monitoring, installation has been a major barrier to consumer adoption. Measuring water flow to a specific faucet, for example, required removing a section of pipe. To give a residential consumer the whole picture, this process would have to be repeated for every faucet in the house.
But now work done by Shwetak Patel, an assistant professor in the department of computer science and engineering at University of Washington, and colleagues can extrapolate electrical, water, and gas use of individual devices by measuring the "shock waves" created when consumers turn on the devices that use those utilities.

[Figure: HydroSense can be installed at any accessible location in a home's water infrastructure, with typical installations at an exterior hose bib, a utility sink spigot, or a water heater drain valve. By continuously sensing water pressure at a single installation point, HydroSense can identify individual fixtures where water is being used and estimate their water usage. Figure from "HydroSense: Infrastructure-Mediated Single-Point Sensing of Whole-Home Water Activity" by Jon Froehlich, Eric Larson, Tim Campbell, Conor Haggerty, James Fogarty, and Shwetak N. Patel.]

Patel's HydroSense approach is to attach a single sensor to a spigot and examine differences in pressure created by the water-hammer phenomenon when individual devices are turned on and off. After building a profile of devices in the house, he says, the single sensor can accurately tell each device from the others within a 5% margin of error. The same model works for gas lines as well; for electricity, the single plug-in sensor looks for characteristic noise patterns produced by individual devices over the home's electrical lines.
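A crude way to picture the fixture identification Patel describes is nearest-neighbor matching of pressure-transient features against stored per-fixture profiles. The sketch below is purely illustrative: the features and every number in it are invented, and it is not the published HydroSense classifier.

    import numpy as np

    # Invented training profiles: two features of a pressure transient
    # (peak drop in psi, dominant oscillation frequency in Hz).
    profiles = {
        "toilet":          np.array([9.0, 28.0]),
        "dishwasher":      np.array([4.5, 12.0]),
        "washing machine": np.array([6.0, 18.0]),
    }

    def identify(event: np.ndarray) -> str:
        # Label a new event with the closest stored fixture profile.
        return min(profiles, key=lambda name: np.linalg.norm(profiles[name] - event))

    # A new pressure event observed at the single sensing point.
    print(identify(np.array([8.4, 26.5])))  # -> toilet

A real system would extract richer features from the raw signal and calibrate profiles per household, but this is the flavor of the one-sensor, match-against-profiles idea.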
Patel points out the educational value of this information. "Often people's mental model of [utility] consumption is really inaccurate," he says. "Over 30 years of study in environmental psychology has shown that giving people itemized feedback can reduce overall energy use by 15%–20%. But adoption of sensors that will give them that feedback drastically drops off as the installation burden increases. So the question is, How can we build a single sensor that gives them disaggregated information, but doesn't need a professional electrician or plumber to install? If we can build cheap sensors that give consumers effective feedback, they can start to reduce overall consumption in their home."
Even with single-point sensors installed, there's still a place for individual sensors to measure environmental factors. For example, a sensor that measures the oil level in a furnace could switch on electric heating when the oil is running out, but only during times of low electricity demand. Or a combustible-gas sensor could prevent an explosion, when a gas leak is detected, by preventing a gas furnace's
ignition from sparking on. Such concepts require a development platform that hosts both sensors and communications hardware.
One such platform is SquidBee and its successor, Waspmote. Both originated at the Spain-based company Libelium, which also produces three types of Waspmote sensor boards. One determines the presence of gases, such as carbon dioxide and methane; the second senses environmental changes, such as vibration, pressure, and moisture; and the third is a prototype board that will host any other sensor types a developer might have. For applications that don't require immediate communication or in situations where immediate communication is impossible, the Waspmote board contains two gigabytes of internal memory for later transmission.

Making Sensors Talk
Both Patel's and Libelium's devices require a way to communicate their findings to the outside world. Waspmote uses a variety of methods, including USB, GPS, 802.15.4, and a range of radio frequencies. Patel is agnostic about the communication methods his still-in-development devices will use. "We're innovating on the hardware, aggregation, and signal processing," he says, "but not on the network."
One specification that both plan to use is ZigBee, an extension of the 802.15.4 standard promoted by the nonprofit, industry-based ZigBee Alliance. According to ZigBee Alliance Chairman Bob Heile, ZigBee was designed specifically "to create open, global standards for wireless sensor networks." As such, it prioritizes power consumption and transmission integrity so that the devices—which might be used in difficult-to-access areas—can operate trouble-free for a long period of time. "We're achieving devices that go for five to 10 years on an alkaline battery or 10 to 20 years on lithium-ion," says Heile.
The ZigBee Alliance also prioritized scalability well beyond residential needs. Heile says the ARIA Resort & Casino in the new CityCenter development in Las Vegas has more than 90,000 ZigBee-compliant devices to control both common-area and guest-room environments. On the other hand, ZigBee largely ignores issues of bandwidth and quality of service, as would be needed for a telephony or video application.
The ZigBee specifications cover all seven layers of the Open Systems Interconnection model, in three parts. The bottom two—the physical and data-link layers—are the 802.15.4 standard, with no changes. Layers three to six comprise the "ZigBee stack," including algorithms for organization among nodes, error routing, and AES-128 security. (As a wireless technology, the security portion is especially important to prevent outside tampering that could cause unpredictable device behavior.) When layers one through six are implemented according to the ZigBee specification, a product qualifies for ZigBee platform compliance certification. ZigBee-certified products also implement layer seven, which is a ZigBee public profile such as smart energy, home automation, or health care.

Acting on Data
Once the data is collected, it needs to be presented in ways that are understandable to humans and to other devices. "We don't want to overwhelm the consumer with a bunch of data," says Patel. "We could provide them with a 'Top Ten Energy Consumers in Your Home' list to give them something to work on. Or if we see that the compressor in their refrigerator is degrading in performance over time, we could give them targeted advice on blowing out the coils."
One example of how such data is being used is found in Oberlin College's campus resource monitoring system. The environmental studies program monitors electricity use in each of the college's dorms, in some cases with multiple sensor points per dorm. Administrators make adjustments to discount nondiscretionary expenditures, such as a kitchen in those dorms with cafeterias, then take a baseline reading to determine typical usage. Data on the dorms' current energy use is displayed in three ways: on the Web at oberlin.edu/dormenergy; as building dashboard video displays throughout campus; and as color-changing orbs placed in several campus locations, including the dorms themselves.
Finally, Oberlin College runs an annual dorm energy competition and gives prizes to the dorm with the greatest reduction from baseline use. Henry Bent, a sustainable technology research fellow partly responsible for maintaining the Oberlin system, is especially enthusiastic about the orbs. "Numbers and dials and graphs are fantastic, but you want something that you can see very quickly at a glance," Bent says. "I just know when I'm on my way to the bathroom, 'Oh, look, that orb is red, I should turn something off.'"

Further Reading

Patel, S.N., Robertson, T., Kientz, J.A., Reynolds, M.S., and Abowd, G.D.
At the flick of a switch: detecting and classifying unique electrical events on the residential power line. Proceedings of Ubicomp 2007, Innsbruck, Austria, September 16–19, 2007.

Patel, S.N., Reynolds, M.S., and Abowd, G.D.
Detecting human movement by differential air pressure sensing in HVAC system ductwork: an exploration in infrastructure mediated sensing. Proceedings of Pervasive 2008, Sydney, Australia, May 19–22, 2008.

Petersen, J.E., Shunturov, V., Janda, K., Platt, G., and Weinberger, K.
Dormitory residents reduce electricity consumption when exposed to real-time visual feedback and incentives. International Journal of Sustainability in Higher Education 8, 1, 2007.

Fischer, C.
Feedback on household electricity consumption: a tool for saving energy? Energy Efficiency 1, 1, Feb. 2008.

Wireless Sensor Networks Research Group
http://www.sensor-networks.org.

Tom Geller is an Oberlin, Ohio-based science, technology, and business writer.

© 2010 ACM 0001-0782/10/0600 $10.00
Society | doi:10.1145/1743546.1743555  Leah Hoffmann

Mine Your Business
Researchers are developing new techniques to gauge employee productivity from information flow.

What's the best way to measure employee productivity in the digital age? From lines of code to number of sales, each industry has its own imperfect standards. Yet according to a new line of thought, the answer may, in fact, lie in your email inbox. And your chat log. And the comments you added to a shared document—in the sum, that is, of your electronic activity. Researchers at IBM and Massachusetts Institute of Technology, for example, analyzed the electronic data of 2,600 business consultants and compared their communication patterns with their billable hours. The conclusion: the average email contact is worth $948 in annual revenue. Of course, it also makes a difference who your contacts are. Consultants with strong ties to executives and project managers generated an average of $7,056 in additional annual revenue, compared with the norm.

[Figure: IBM's SmallBlue technology analyzes employees' electronic data and creates a networked map of who they're connected to and what their expertise is.]

According to Redwood City, CA-based Cataphora, one of the companies at the forefront of the movement, the objective is to build patterns of activity—then flag and analyze exceptions.
"We're interested in modeling behavior," says Keith Schon, a senior software engineer at Cataphora. "What do people do normally? How do they deviate when they're not acting normally?" Using data mining techniques that encompass network analysis, sentiment analysis, and clustering, Schon and colleagues analyze the flow of electronic data across an organization. "We're trying to figure out relationships," he explains.
Cataphora got its start in the electronic discovery field, where understanding what people know and how that knowledge spreads is critical to legal liability. The company thus works to uncover so-called "shadow networks" of employees who know each other through non-business channels like colleges or churches, or who share a native language, and could collude with one another. Its engineers search for unusual linguistic patterns, and set actual communication networks against official organization charts to determine when people interact with those to whom they have no ostensible work connection.
Yet Cataphora and others are also developing tools to analyze such patterns of behavior in non-investigative settings in the hope of understanding—and enhancing—employee productivity. Microsoft examines internal communications to identify so-called "super connectors," who communicate frequently with other employees and share information and ideas. Eventually, researchers say, that data could help business leaders make strategic decisions about a project team's composition, effectiveness, and future growth. Likewise, Google is testing an algorithm that uses employee review data, promotions, and pay histories to identify its workers who feel underused, and therefore are most likely to leave the company. Though Google is reluctant to share details, human resources director Laszlo Bock has said the idea is to get inside people's heads before they even think about leaving—and to work harder to keep them engaged.
"We have access to unprecedented amounts of data about human activity," says Sinan Aral, a professor of management sciences at New York University's Stern School of Business who studies information flow. Of course, not every benefit an individual brings to a company can be captured electronically, "but we can explain a lot," Aral says. Researchers hasten to add they're not seeking to punish people for using Facebook at work or making personal phone calls. "The social stuff may be important, and we don't count that against a person," says Cataphora's Schon. In most cases, in fact, personal communications are filtered out and ignored.

Measuring Electronic Productivity
Some measures of electronic productivity are relatively straightforward. Cataphora, for example, seeks to identify blocks of text that are reused, such as a technical explanation or a document template, reasoning that the employees who produce them are making a comparatively greater impact on the company by doing work that others deem valuable. Such text blocks can be difficult to identify, for their language often evolves as they spread through a corporate network. Cataphora has developed a fuzzy search algorithm to detect them, but Schon admits the task is complex. Creating an algorithm that organizes sentences into text blocks, for example, often forces researchers to make inflexible choices about boundaries, using punctuation, length limits, paragraph breaks, or some other scheme. That, in turn, could cause a program to overlook a document whose author formats things differently, such as not breaking the text into paragraphs very frequently or using unconventional punctuation.
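Cataphora's detector is proprietary, but format-tolerant reuse detection of this general kind is often approximated with word-level shingling and Jaccard similarity, which ignores punctuation and paragraph boundaries entirely. The sketch below illustrates that generic technique and is not the company's algorithm:

    import re

    def shingles(text: str, k: int = 5) -> set:
        # Normalize away case, punctuation, and line breaks, then take
        # every run of k consecutive words as a "shingle."
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def similarity(a: str, b: str) -> float:
        # Jaccard similarity: shared shingles over all distinct shingles.
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    doc1 = "The warranty covers defects in materials and workmanship for one year."
    doc2 = "the warranty covers defects\nin materials and workmanship,\nfor one YEAR"

    print(f"similarity: {similarity(doc1, doc2):.2f}")  # 1.00 despite the reformatting

Because the comparison happens on normalized word sequences, an author's idiosyncratic paragraphing or punctuation no longer hides the reuse.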
Cataphora has also developed a proprietary set of ontologies that cover human resources-related topics, marketing issues, product development, and more to examine various subject-specific communications. One way in which they are useful, Schon explains, is for studying the relationships between people and topics. If an executive is central to communications about product development, marketing, and finance, but marginal to those about sales, it's likely that she or he is out of the loop when it comes to the newest sales tactics. Ontologies can also identify communications related to particular tasks, such as hiring and performance reviews. From there, engineers can statistically determine what the "normal" procedure is, and see when it is and isn't followed. Thanks to the training corpus Cataphora has built over time through its clients, these ontologies perform quite well. Yet to detect communication that is specific to a particular industry, location, or research group and whose names can be idiosyncratic, "we may need to examine the workflow and develop more specific ontologies," says Schon.
Further analysis helps identify how employees influence each other at work. Aral, for example, correlates his electronically derived network topologies with traditional accounting and project data, such as revenues and completion rates, to try to understand which factors enhance or diminish certain outcomes. "The old paradigm was that each employee had a set of characteristics, like skills or education, which he or she brought to a firm," Aral explains. "Our perspective is that employees are all connected, and that companies build a network of individuals who help each other." In a study of five years of data from an executive recruiting firm, Aral found that employees who were more central to the firm's information flow—who communicated more frequently and with a broader number of people—tended to be more productive. It makes a certain amount of sense. "They received more novel information and could make matches and placements more quickly," Aral notes. In fact, the value of novel information turned out to be quite high. Workers who encountered just 10 novel words more than the average worker were associated with an additional $70 in monthly revenue.
Yet Aral's conclusions also point to one of the more challenging aspects of this type of research. If a position in the corporate network is associated with increased productivity, is it because of the nature of that position or because certain kinds of people naturally gravitate toward it? "You always have to question your assumptions," admits Aral. New statistical techniques are needed, he says, to more accurately distinguish correlation from causation.
Large-scale data mining presents another challenge. IBM's SmallBlue, which grew out of research at its Watson Business Center, analyzes employees' electronic data and creates a networked map of who they're connected to and where their expertise lies. Employees can then search for people with expertise on certain subjects and find the shortest "social path" it would take to connect them. SmallBlue is an invaluable tool for large, international firms, and IBM has used it to connect its 410,000 employees since 2007. Yet indexing the 20-plus million emails and instant messages those employees write is not a trivial task—not to mention the 2 million blog and database entries and 10 million pieces of data that come from knowledge sharing and learning activities. It is the largest publicly known social network dataset in existence, and the project's founder, Ching-Yung Lin, says IBM worked hard to design a database that would hold different types of data and dynamically index the graphs that are generated.
Proponents of electronic productivity analysis say the markers are best used to augment, rather than replace, traditional metrics and peer evaluations. "It's a sanity check," asserts Schon. In the future, predicts Aral, who is helping IBM refine SmallBlue, the software could provide real-time, expertise-based recommendations: automatically suggesting connections while employees work on a particular task, for example, or helping managers assemble compatible project teams.

Further Reading

Aral, S., Brynjolfsson, E., and Van Alstyne, M.
Information, technology and information worker productivity. International Conference on Information Systems, Milwaukee, WI, 2006.

Manning, C.D., Raghavan, P., and Schütze, H.
Introduction to Information Retrieval. Cambridge University Press, New York, 2008.

Mikawa, S., Cunnington, S., and Gaskis, S.
Removing barriers to trust in distributed teams: understanding cultural differences and strengthening social ties. International Workshop on Intercultural Collaboration, Palo Alto, CA, 2009.

Wasserman, S. and Faust, K.
Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.

Wu, L., Lin, C.-Y., Aral, S., and Brynjolfsson, E.
Value of social network: a large-scale analysis on network structure impact to financial revenue of information technology consultants. Winter Information Systems Conference, Salt Lake City, UT, 2009.

Leah Hoffmann is a Brooklyn, NY-based technology writer.

© 2010 ACM 0001-0782/10/0600 $10.00
In Memoriam | doi:10.1145/1743546.1743556 Leah Hoffmann

Robin Milner:
The Elegant Pragmatist
Remembering a rich legacy in verification,
languages, and concurrency.

It came as a surprise even to those who knew him well: the death of Arthur John Robin Gorell Milner, known to friends simply as Robin, of a heart attack on March 20. Milner's wife, Lucy, had died three weeks before. Yet the pioneering computer scientist remained active, and seemingly in good health, until the end, maintaining close connections with his colleagues and even co-authoring a paper with a postdoctoral student he supervised at the IT University of Copenhagen.
A man of modest background and quiet brilliance, Milner made groundbreaking contributions to the fields of verification, programming languages, and concurrency. He was born in 1934 near Plymouth, England, and won scholarships to Eton—where he developed an enduring love of math as well as a prodigious work ethic—and King's College, Cambridge. It was during his time at Cambridge that Milner was introduced to programming, though the subject didn't interest him initially. "I regarded programming as really rather inelegant," he recalled in an interview in 2001 with Martin Berger, a professor at the University of Sussex. "So I resolved that I would never go near a computer in my life." Several years later, in 1960, Milner broke that resolution with a programming job at Ferranti, a British computer company; from there, it wasn't long before he moved to academia with positions at City University London, Swansea University, Stanford University, and the Universities of Edinburgh and Cambridge.
Inspired by Dana Scott's famous formulation of domain theory, Milner began working on an automatic theorem prover, hoping to find a way to mechanize a logic for reasoning about programs. The work culminated in the development of Logic for Computable Functions (LCF), an interactive system that helps researchers formulate proofs. "It was a novel idea," says Robert Harper, a professor of computer science at Carnegie Mellon University. "The approach before then was that computers would search for your proof. Robin recognized that the computer is a tool to help you find the proof." During that research, Milner also laid the foundations of ML, a metalanguage whose original intent was to enable researchers to deploy proof tactics on LCF. The innovative concepts it introduced, however—such as polymorphic type inference and type-safe exception handling—were soon recognized. Ultimately, ML evolved into a powerful general programming language (with Milner, Harper, and others working to specify and standardize it) and led to languages like F# and Caml.
"Robin initiated a revolution in what computing languages are and could be," asserts Harper.
Although he was closely involved in ML's development throughout the 1980s and 1990s, Milner also began working on the problem of concurrency, looking for a mathematical treatment that could rival theories of sequential computation. The Calculus of Communicating Systems (CCS) was the first solution he devised: a process calculus for a network that was programmed to cooperate on a single task. CCS was succeeded by a more general theory of concurrency called pi calculus, which incorporated dynamic generation, and bigraphs, a theory of ubiquitous computing. For his work on LCF, ML, and CCS, Milner received the ACM A.M. Turing Award in 1991.
"Robin had this ability to think large and translate that all the way down to new theorems," says Mads Tofte, vice chancellor of the IT University of Copenhagen and a former graduate student. The sentiment is echoed by many of Milner's collaborators, who cite his unfailing ability to shift from a grand vision of computing to subtle mathematical particulars. Coupled with his rigorous attention to detail, the trait gave Milner's work a firm grounding in practical applications. "He was very concerned with making things work," says Gordon Plotkin, a former colleague at the University of Edinburgh.
Milner also labored on behalf of institutions and initiatives that shaped the future of the field. He helped establish the Laboratory for Foundations of Computer Science at the University of Edinburgh and served as the first chair of the University of Cambridge's Computer Laboratory. In 2002, Milner and Tony Hoare, a senior researcher at Microsoft Research in Cambridge (and a fellow Turing Award winner), began working on a grand challenge initiative to identify research topics that would drive science in the 21st century.
He was a natural leader, according to Tofte. "I'm not sure he was always aware of it," says Tofte, "but he was so good at getting the best out of people, which is exactly what a good leader does."

Leah Hoffmann is a Brooklyn, NY-based technology writer.

© 2010 ACM 0001-0782/10/0600 $10.00
Milestones | doi:10.1145/1743546.1743557  Jack Rosenberger

CS and Technology Leaders Honored

Awards were recently announced by ACM and the American Academy of Arts and Sciences honoring leaders in the fields of computer science and technology.

ACM Awards
VMware Workstation 1.0, which was developed by Stanford University professor Mendel Rosenblum and his colleagues Edouard Bugnion, Scott Devine, Jeremy Sugerman, and Edward Wang, was awarded the Software System Award for bringing virtualization technology to modern computing environments, spurring a shift to virtual-machine architectures, and allowing users to efficiently run multiple operating systems on their desktops.
Michael I. Jordan, a professor at the University of California, Berkeley, is the recipient of the ACM/AAAI Allen Newell Award for fundamental advances in statistical machine learning—a field that develops computational methods for inference and decision-making based on data.

[Figure: Michael I. Jordan, University of California, Berkeley.]

Tim Roughgarden, an assistant professor at Stanford University, received the Grace Murray Hopper Award for introducing novel techniques that quantify the efficiency lost to the uncoordinated behavior of network users who act in their own self-interest.
Matthias Felleisen, a Trustee Professor at Northeastern University, was awarded the Karl V. Karlstrom Outstanding Educator Award for his visionary and long-standing contributions to K–12 outreach programs.
Gregory D. Abowd, a professor at Georgia Institute of Technology, is the recipient of the Eugene L. Lawler Award for Humanitarian Contributions within Computer Science and Informatics for promoting a vision of health care and education that incorporates the use of advanced information technologies to address difficult challenges relating to the diagnosis and treatment of behavioral disorders, as well as the assessment of behavioral change within complex social environments.
Edward Lazowska, the Bill & Melinda Gates Chair in computer science and engineering at the University of Washington, received the Distinguished Service Award for his wide-ranging service to the computing community and his long-standing advocacy for this community at the national level. (An interview with Lazowska appears in the ACM Member News column on p. 12.)
Moshe Y. Vardi, the Karen Ostrum George Professor in Computational Engineering and Director of the Ken Kennedy Institute for Information Technology at Rice University, is the recipient of the Outstanding Contribution to ACM Award for his leadership in restructuring ACM's flagship publication, Communications of the ACM, into a more effective communications vehicle for the global computing discipline, and for organizing an influential, systematic analysis of offshoring, Globalization and Offshoring of Software, which helped reinforce the case that computing plays a fundamental role in defining success in a competitive global economy.
Elaine Weyuker and Mathai Joseph were named recipients of the ACM Presidential Award. Weyuker, an AT&T Fellow at Bell Labs, was honored for her efforts in reshaping and enhancing the growth of ACM-W to become a thriving network that cultivates and celebrates women seeking careers in computing. Joseph, an advisor to Tata Consultancy Services, was honored for his commitment to establishing an ACM presence in India. His efforts were instrumental in the formation of the ACM India Council, which was launched last January.

American Academy Fellows
Nine computing scientists and technology leaders were among the 229 newly elected Fellows and Foreign Honorary Members of the American Academy of Arts and Sciences. They are: Randal E. Bryant, Carnegie Mellon University; Nancy A. Lynch, Massachusetts Institute of Technology; Ray Ozzie, Microsoft; Samuel J. Palmisano, IBM; Burton Jordan Smith, Microsoft; Michael Stonebraker, Vertica Systems; Madhu Sudan, Massachusetts Institute of Technology; Moshe Y. Vardi, Rice University; and Jeannette M. Wing, Carnegie Mellon University/National Science Foundation.

Jack Rosenberger is Communications' senior editor, news.

© 2010 ACM 0001-0782/10/0600 $10.00
ACM Europe Council
Serving the European Computing Science Community

ACM announces the ACM Europe Council: 16 leading European computer scientists from academia and industry to spearhead expansion of ACM's high-quality technical activities, conferences, and services in Europe.

"Our goal is to share ACM's vast array of valued resources and services on a global scale. We want to discover the work and welcome the talent from all corners of the computing arena so that we are better positioned to appreciate the key issues and challenges within Europe's academic, research, and professional computing communities, and respond accordingly," says Professor Dame Wendy Hall, U. of Southampton (UK), ACM President.

Supporting the European Computing Community
ACM Europe aims to strengthen the European computing community at large, through its members, chapters, and sponsored conferences and symposia. Together with other scientific societies, it helps make the public and decision makers aware of technical, educational, and social issues related to computing.

A European Perspective Within ACM
The ACM Europe Council brings a unique European perspective inside ACM and helps increase the visibility of ACM across Europe, through:
• Participation of Europeans throughout ACM
• Representation of European work in ACM Awards and Advanced Membership Grades
• Holding high-quality ACM conferences in Europe
• Expanding ACM chapters
• Strong co-operation with other European scientific societies in computing

ACM – World's Largest Educational and Scientific Computing Society
For more than 50 years, ACM has strengthened the computing profession's collective voice through strong leadership, promotion of the highest standards, and recognition of technical excellence. ACM supports the professional growth of its members by providing opportunities for life-long learning, career development, and professional networking.

"By strengthening ACM's ties in the region and raising awareness of its many benefits and resources with the public and European decision-makers, we can play an active role in the critical technical, educational, and social issues that surround the computing community," says Fabrizio Gagliardi, ACM Europe Chair and Director of External Research Programs, Microsoft Research Europe.

ACM is present in Europe with 15,000 members and 41 chapters. 23 ACM Turing Awards and other major ACM awards have gone to individuals in Europe, and 215 Europeans have received an Advanced Membership Grade since 2004.
The ACM Europe Council represents European computer scientists. Contact us at acmeurope@acm.org or http://europe.acm.org/
viewpoints

doi:10.1145/1743546.1743558  Arvind Narayanan and Vitaly Shmatikov

Privacy and Security
Myths and Fallacies of "Personally Identifiable Information"
Developing effective privacy protection technologies is a critical challenge for security and privacy research as the amount and variety of data collected about individuals increase exponentially.

The digital economy relies on the collection of personal data on an ever-increasing scale. Information about our searches, browsing history, social relationships, medical history, and so forth is collected and shared with advertisers, researchers, and government agencies. This raises a number of interesting privacy issues. In today's data protection practices, both in the U.S. and internationally, "personally identifiable information" (PII)—or, as the U.S. Health Insurance Portability and Accountability Act (HIPAA) refers to it, "individually identifiable" information—has become the lapis philosophorum of privacy. Just as medieval alchemists were convinced a (mythical) philosopher's stone can transmute lead into gold, today's privacy practitioners believe that records containing sensitive individual data can be "de-identified" by removing or modifying PII.

What is PII?
For a concept that is so pervasive in both legal and technological discourse on data privacy, PII is surprisingly difficult to define. One legal context is provided by breach-notification laws. California Senate Bill 1386 is a representative example: its definition of personal information includes Social Security numbers, driver's license numbers, financial accounts, but not, for example, email addresses or telephone numbers. These laws were enacted in response to security breaches involving customer data that could enable identity theft. Therefore, they focus solely on the types of data that are commonly used for authenticating an individual, as opposed to those that violate privacy, that is, reveal some sensitive information about an individual. This crucial distinction is often overlooked by designers of privacy protection technologies.
The second legal context in which the term "personally identifiable information" appears is privacy law. In the U.S., the Privacy Act of 1974 regulates the collection of personal information by government agencies. There is no overarching federal law regulating private entities, but some states have their own laws, such as California's Online Privacy Protection Act of 2003. Generic privacy laws in other countries include Canada's Personal Information Protection and Electronic Documents Act (PIPEDA) and Directive 95/46/EC of the European Parliament, commonly known as the Data Protection Directive.
Privacy laws define PII in a much broader way. They account for the possibility of deductive disclosure and—unlike breach-notification laws—do not lay down a list of informational
viewpoints

For example, the Data Protection Directive defines personal data as: "any information relating to an […] natural person […] who can be identified, directly or indirectly, in particular by reference […] to one or more factors specific to his physical, physiological, mental, economic, cultural, or social identity." The Directive goes on to say that "account should be taken of all the means likely reasonably to be used either by the controllera or by any other person to identify the said person."

Similarly, the HIPAA Privacy Rule defines individually identifiable health information as information "1) That identifies the individual; or 2) With respect to which there is a reasonable basis to believe the information can be used to identify the individual." What is "reasonable"? This is left open to interpretation by case law. We are not aware of any court decisions that define identifiability in the context of HIPAA.b The "safe harbor" provision of the Privacy Rule enumerates 18 specific identifiers that must be removed prior to data release, but the list is not intended to be comprehensive.

a The individual or organization responsible for the safekeeping of personal information.
b When the Supreme Court of Iceland struck down an act authorizing a centralized database of "non-personally identifiable" health data, its ruling included factors such as education, profession, and specification of a particular medical condition as part of "identifiability."

PII and Privacy Protection Technologies
Many companies that collect personal information, including social networks, retailers, and service providers, assure customers that their information will be released only in a "non-personally identifiable" form. The underlying assumption is that "personally identifiable information" is a fixed set of attributes such as names and contact information. Once data records have been "de-identified," they magically become safe to release, with no way of linking them back to individuals.

The natural approach to privacy protection is to consider both the data and its proposed use(s) and to ask: What risk does an individual face if her data is used in a particular way? Unfortunately, existing privacy technologies such as k-anonymity6 focus instead on the data alone. Motivated by an attack in which hospital discharge records were re-identified by joiningc them via common demographic attributes with a public voter database,5 these methods aim to make joins with external datasets harder by anonymizing the identifying attributes. They fundamentally rely on the fallacious distinction between "identifying" and "non-identifying" attributes. This distinction might have made sense in the context of the original attack, but is increasingly meaningless as the amount and variety of publicly available information about individuals grows exponentially.

c In the sense of SQL join.
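The attack pattern that motivated k-anonymity is worth seeing concretely. Below is a minimal sketch, not the original study's code: made-up "de-identified" discharge records are joined with a made-up voter list on shared demographic attributes using pandas.

```python
# Hypothetical illustration of a linkage ("join") attack: "de-identified"
# hospital discharge records are joined with a public voter list on
# demographic quasi-identifiers. All data here is invented.
import pandas as pd

discharges = pd.DataFrame([
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M", "diagnosis": "diabetes"},
])
voters = pd.DataFrame([
    {"name": "J. Doe", "zip": "02138", "birth_date": "1945-07-31", "sex": "F"},
    {"name": "R. Roe", "zip": "02139", "birth_date": "1962-01-15", "sex": "M"},
])

# An inner join on the shared attributes re-attaches names to diagnoses.
reidentified = discharges.merge(voters, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```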

To apply k-anonymity or its variants such as l-diversity, the set of the so-called quasi-identifier attributes must be fixed in advance and assumed to be the same for all users. It typically includes ZIP code, birth date, gender, and/or other demographics. The rest of the attributes are assumed to be non-identifying. De-identification involves modifying the quasi-identifiers to satisfy various syntactic properties, such as "every combination of quasi-identifier values occurring in the dataset must occur at least k times."6 The trouble is that even though joining two datasets on common attributes can lead to re-identification, anonymizing a predefined subset of attributes is not sufficient to prevent it.
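That syntactic property is mechanical to check. A minimal sketch using pandas, where the quasi-identifier list must be supplied by hand (which is precisely the assumption the authors go on to criticize):

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    return bool((df.groupby(quasi_identifiers).size() >= k).all())

# Invented records with generalized ("suppressed") quasi-identifiers.
records = pd.DataFrame({
    "zip":       ["021**", "021**", "021**", "940**"],
    "birth":     ["1945",  "1945",  "1945",  "1962"],
    "sex":       ["F",     "F",     "F",     "M"],
    "diagnosis": ["flu",   "asthma", "flu",  "diabetes"],
})
print(is_k_anonymous(records, ["zip", "birth", "sex"], k=2))  # False: one row is unique
```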

Re-identification without PII
Any information that distinguishes one person from another can be used for re-identifying anonymous data. Examples include the AOL fiasco, in which the content of search queries was used to re-identify a user; our own work, which demonstrated the feasibility of large-scale re-identification using movie viewing histories (or, in general, any behavioral or transactional profile2) and the local structure of social networks;3 and re-identification based on location information and stylometry (for example, the latter was used to infer the authorship of the 12 disputed Federalist Papers).

Re-identification algorithms are agnostic to the semantics of the data elements. It turns out there is a wide spectrum of human characteristics that enable re-identification: consumption preferences, commercial transactions, Web browsing, search histories, and so forth. Their two key properties are that (1) they are reasonably stable across time and contexts, and (2) the corresponding data attributes are sufficiently numerous and fine-grained that no two people are similar, except with a small probability.

The versatility and power of re-identification algorithms imply that terms such as "personally identifiable" and "quasi-identifier" simply have no technical meaning. While some attributes may be uniquely identifying on their own, any attribute can be identifying in combination with others. Consider, for example, the books a person has read or even the clothes in her wardrobe: while no single element is a (quasi-)identifier, any sufficiently large subset uniquely identifies the individual.

Re-identification algorithms based on behavioral attributes must tolerate a certain "fuzziness" or imprecision in attribute values. They are thus more computationally expensive and more difficult to implement than re-identification based on demographic quasi-identifiers. This is not a significant deterrent, however, because re-identification is a one-time effort and its cost can be amortized over thousands or even millions of individuals. Further, as Paul Ohm argues, re-identification is "accretive": the more information about a person is revealed as a consequence of re-identification, the easier it is to identify that person in the future.4
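A behavioral re-identification scorer can be sketched in a few lines. The following toy is loosely in the spirit of the authors' algorithm2 but is not that algorithm: it tolerates imprecise dates and weights rare items more heavily, using entirely invented data.

```python
# Toy similarity scorer for behavioral re-identification: match an
# anonymous profile of (item, day) pairs against candidate public
# profiles, tolerating imprecision in the dates. Data is invented.
from math import log

POPULARITY = {"obscure-documentary": 5, "blockbuster": 50000}

def score(anon, public, day_slack=3):
    s = 0.0
    for item, day in anon:
        for p_item, p_day in public:
            if item == p_item and abs(day - p_day) <= day_slack:
                s += 1.0 / log(2 + POPULARITY[item])  # rare items count more
                break
    return s

anon = [("obscure-documentary", 100), ("blockbuster", 200)]
candidates = {
    "user-a": [("obscure-documentary", 101), ("blockbuster", 199)],
    "user-b": [("blockbuster", 200)],
}
print(max(candidates, key=lambda u: score(anon, candidates[u])))  # user-a
```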
Lessons for Privacy Practitioners
The emergence of powerful re-identification algorithms demonstrates not just a flaw in specific anonymization techniques, but the fundamental inadequacy of the entire privacy protection paradigm based on "de-identifying" the data. De-identification provides only a weak form of privacy. It may prevent "peeping" by insiders and keep honest people honest. Unfortunately, advances in the art and science of re-identification, increasing economic incentives for potential attackers, and the ready availability of personal information about millions of people (for example, in online social networks) are rapidly rendering it obsolete.

The PII fallacy has important implications for health-care and biomedical datasets. The "safe harbor" provision of the HIPAA Privacy Rule enumerates 18 attributes whose removal and/or modification is sufficient for the data to be considered properly de-identified, with the implication that such data can be released without liability. This appears to contradict our argument that PII is meaningless. The "safe harbor" provision, however, applies only if the releasing entity has "no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information." As actual experience has shown, any remaining attributes can be used for re-identification, as long as they differ from individual to individual. Therefore, PII has no meaning even in the context of the HIPAA Privacy Rule.

Beyond De-identification
Developing effective privacy protection technologies is a critical challenge for security and privacy research. While much work remains to be done, some broad trends are becoming clear, as long as we avoid the temptation to find a silver bullet. Differential privacy is a major step in the right direction.1 Instead of the unattainable goal of "de-identifying" the data, it formally defines what it means for a computation to be privacy-preserving. Crucially, it makes no assumptions about the external information available to the adversary. Differential privacy, however, does not offer a universal methodology for data release or collaborative, privacy-preserving computation. This limitation is inevitable: privacy protection has to be built and reasoned about on a case-by-case basis.
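To make the contrast concrete, here is a minimal sketch of the simplest differentially private primitive, a noisy count via the Laplace mechanism; the epsilon value and data are illustrative.

```python
# Minimal sketch of a differentially private count: a counting query has
# sensitivity 1, so adding Laplace(1/epsilon) noise yields
# epsilon-differential privacy for this single query.
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 29, 41, 62, 55]
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))  # noisy answer near 3
```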
Another lesson is that an interactive, query-based approach is generally superior from the privacy perspective to the "release-and-forget" approach. This can be a hard pill to swallow, because the former requires designing a programming interface for queries, budgeting for server resources, performing regular audits, and so forth.

Finally, any system for privacy-preserving computation on sensitive data must be accompanied by strong access control mechanisms and non-technological protection methods such as informed consent and contracts specifying acceptable uses of data.

References
1. Dwork, C. A firm foundation for private data analysis. Commun. ACM (to appear).
2. Narayanan, A. and Shmatikov, V. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.
3. Narayanan, A. and Shmatikov, V. De-anonymizing social networks. In Proceedings of the 2009 IEEE Symposium on Security and Privacy.
4. Ohm, P. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review 57 (2010) (to appear).
5. Sweeney, L. Weaving technology and policy together to maintain confidentiality. J. of Law, Medicine, and Ethics 25 (1997).
6. Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-Based Systems 10 (2002).

Arvind Narayanan (arvindn@cs.utexas.edu) is a postdoctoral fellow at Stanford University. Vitaly Shmatikov (shmat@cs.utexas.edu) is an associate professor of computer science at the University of Texas at Austin. Their paper on de-anonymization of large sparse datasets2 received the 2008 PET Award for Outstanding Research in Privacy Enhancing Technologies.

Copyright held by author.

doi:10.1145/1743546.1743559 Stuart S. Shapiro

Inside Risks
Privacy By Design:
Moving from Art to Practice
Designing privacy into systems at the beginning of the development process necessitates the effective
translation of privacy principles, models, and mechanisms into system requirements.

Most people involved with system development are well aware of the adage that you are better off designing in security and privacy (and pretty much any other "nonfunctional" requirements) from the start, rather than trying to add them later. Yet, if this is the conventional wisdom, why is the conventional outcome so frequently systems with major flaws in these areas?

Part of the problem is that while people know how to talk about functionality, they are typically a lot less fluent in security and privacy. They may sincerely want security and privacy, but they seldom know how to specify what they seek. Specifying functionality, on the other hand, is a little more straightforward, and thus the system that previously could make only regular coffee in addition to doing word processing will now make espresso too. (Whether this functionality actually meets user needs is another matter.)

[Photo caption: Members of staff are seen demonstrating a new whole-body security scanner at Manchester Airport, Manchester, England, in January 2010. Airline passengers bound for the United States faced a hodgepodge of security measures across Europe and airports did not appear to be following a U.S. request for increased screening of passengers from 14 countries. Photograph by Jon Super/AP Photo.]

Security and Privacy
The fact that it is often not apparent what security and privacy should look like is indicative of some deeper issues. Security and privacy tend to be articulated at a level of abstraction that often makes their specific manifestations less than obvious, to either customers or system developers.

This is not to say the emperor has no clothes; far from it. There are substantial bodies of knowledge for some nonfunctional areas, including security, but figuring out how to translate the abstract principles, models, and mechanisms into comprehensive specific requirements for specific systems operating within specific contexts is seldom straightforward. That translation process is crucial to designing these properties into systems, but it also tends to be the most problematic activity and the activity for which the least guidance is provided. The sheer complexity of most modern systems compounds the problem.

Security, though, is better positioned than privacy. Privacy—or informational privacy at least—certainly has commonly understood and accepted principles in the form of Fair Information Practices. It presently doesn't have much else.

Models and mechanisms that support privacy are scarce, not generally known, and rarely understood by either customers or developers.

As more things become digitized, informational privacy increasingly covers areas for which Fair Information Practices were never envisioned. Biometrics, physical surveillance, genetics, and behavioral profiling are just a few of the areas that are straining Fair Information Practices to the breaking point. More sophisticated models are emerging for thinking about privacy risk, as represented by the work of scholars such as Helen Nissenbaum and Daniel Solove. However, if not associated with privacy protection mechanisms and supported by translation guidance, the impact of such models is likely to be much less than they deserve.

A recent example is the development and deployment of whole-body imaging (WBI) machines at airports for physical screening of passengers. In their original incarnation, these machines perform what has been dubbed a "virtual strip search" due to the body image that is presented. These machines are currently being deployed at U.S. airports in a way that is arguably compliant with Fair Information Practices. Yet they typically operate in a way that many people find offensive.

The intended purpose certainly is not to collect, use, disclose, and retain naked images of people; it is to detect potentially dangerous items they may be carrying on their persons when screened. Fair Information Practices include minimization of personal information collected, used, disclosed, and retained, consistent with the intended purpose.

This has profound implications for how image data is processed, presented, and stored. It should be processed so at no point does there ever exist an exposed body image that can be viewed or stored. It should be presented in a nonexposed form (for example, a chalk outline or a fully clothed person) with indicators where things have been detected. None of it should be retained beyond the immediate encounter.
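Those three requirements translate directly into code structure. Here is a deliberately hypothetical sketch (none of these functions correspond to a real scanner API): the raw image is confined to a single function, only a nonexposed rendering escapes it, and nothing is persisted.

```python
# Hypothetical sketch of the WBI requirements above: the exposed image
# exists only inside screen_passenger(), only a nonexposed rendering is
# returned, and nothing is ever written to storage. All functions are
# invented stand-ins, not a real scanner API.
def capture_raw_scan():
    return [[0.1, 0.9], [0.4, 0.2]]           # stand-in for the exposed image

def detect_anomalies(raw):
    # Stand-in detector: flag any unusually strong return.
    return [(x, y) for x, row in enumerate(raw)
            for y, v in enumerate(row) if v > 0.8]

def screen_passenger():
    raw = capture_raw_scan()                   # exposed image: confined here
    markers = detect_anomalies(raw)
    del raw                                    # never presented, never retained
    return {"figure": "generic outline", "markers": markers}

print(screen_passenger())                      # only the abstraction escapes
```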


That almost none of these design elements were originally specified illustrates what too often happens in the absence of applicable models and mechanisms and their requisite translation, along with principles, into effective requirements.

In this instance, Solove's concept of exposure provides the necessary (partial) model. Exposure is a privacy violation that induces feelings of vulnerability and distress in the individual by revealing things we customarily conceal. The potential harm from exposure is not restricted to modesty or dignity. A friend is convinced that her pubescent daughter, who is currently extremely self-conscious about her body, would be quite literally traumatized if forced to undergo WBI. If physical strip searches would raise concern, why not WBI? Real damage—physical as well as psychological—can occur in the context of body image neuroses.

If one recognizes from the outset the range of privacy risks represented by exposure, and the relevance of exposure for WBI, one then stands a chance of effectively moving from principles to requirements. Even then, though, the translation process is not necessarily obvious.

Supposedly, the WBI machines being used by the U.S. Transportation Security Administration are not capable of retaining images when in normal operating mode. (They have this capability when in testing mode, though, so significant residual risk may exist.) Other necessary mechanisms were not originally specified. Some models of WBI are being retrofitted to present a nonexposed image, but the issue of intermediate processing remains. Some models developed after the initial wave apparently implement all the necessary control mechanisms; privacy really was designed in. Why wasn't it designed in from the beginning and across the board? The poor state of practice of privacy by design offers a partial explanation. The state of the art, though, is advancing.

The importance of meaningfully designing privacy into systems at the beginning of the development process, rather than bolting it on at the end (or overlooking it entirely), is being increasingly recognized in some quarters. A number of initiatives and activities are using the rubric of privacy by design. In Canada, the Ontario Information and Privacy Commissioner's Office has published a number of studies and statements on how privacy can be designed into specific kinds of systems.

One example is electronic (RFID-enabled) driver's licenses, for which the inclusion of a built-in on/off switch is advocated, thereby providing individuals with direct, immediate, and dynamic control over whether the personal information embedded in the license can be remotely read or not. Such a mechanism would support several Fair Information Practices, most notably collecting personal information only with the knowledge and consent of the individual. This approach is clearly applicable as well to other kinds of RFID-enabled cards and documents carrying personal information.

Similar efforts have been sponsored by the U.K. Information Commissioner's Office. This work has taken a somewhat more systemic perspective, looking less at the application of privacy by design to specific types of technology and more at how to effectively integrate privacy into the system development life cycle through measures such as privacy impact assessments and "practical" privacy standards. It also emphasizes the potential role of privacy-enhancing technologies (PETs) that can be integrated with or into other systems. While some of these are oriented toward empowering individuals, others—which might more appropriately be labeled Enterprise PETs—are oriented toward supporting organizational stewardship of personal information.

However, state of the art is state of the art. Supporting the translation of abstract principles, models, and mechanisms into implementable requirements, turning this into a repeatable process, and embedding that process in the system development life cycle is no small matter. Security has been at it a lot longer than privacy, and it is still running into problems. But at least security has a significant repertoire of principles, models, and mechanisms; privacy has not really reached this stage yet.

Conclusion
So, if privacy by design is still a ways off, and security by design still leaves something to be desired, how do we get there from here? There's little doubt that appropriately trained engineers (including security engineers) are key to supporting the effective translation of principles, models, and mechanisms into system requirements. There doesn't yet appear to be such a thing as a privacy engineer; given the relative paucity of models and mechanisms, that's not too surprising. Until we build up the latter, we won't have a sufficient basis for the former. For privacy by design to extend beyond a small circle of advocates and experts and become the state of practice, we'll need both.

This will require recognition that there is a distinct and necessary technical discipline of privacy, just as there is a distinct and necessary technical discipline of security—even if neither is fully formed. If that can be accomplished, it will create a home and an incentive for the models and mechanisms privacy by design so badly needs.

This is not to minimize the difficulty of more effectively and consistently translating security's body of knowledge (which is still incomplete) into implementable and robust requirements. Both security and privacy need to receive more explicit and directed attention than they often do as areas of research and education.

Security by design and privacy by design can be achieved only by design. We need a firmer grasp of the obvious.

Stuart S. Shapiro (s_shapiro@acm.org) is Principal Information Privacy and Security Engineer at The MITRE Corporation, Bedford, MA.

Copyright held by author.

Calendar of Events
June 15: MobileCloud Workshop (co-located with MobiSys 2010), San Francisco, CA. Contact: Li Erran Li, Email: erranlli@research.bell-labs.com
June 15–18: International Conference on Informatics in Control, Automation and Robotics, Funchal, Portugal. Contact: Joaquim Filipe, Email: jfilipe@insticc.org
June 15–18: Computers, Freedom, and Privacy, San Jose, CA. Contact: Jon Pincus, Email: jon@achangeiscoming.net
June 15–18: Annual NASA/ESA Adaptive Hardware and Systems Conference, Anaheim, CA. Contact: Tughrul Arslan, Email: t.arslan@ed.ac.uk
June 16–18: Conference on the Future of the Internet 2010, Seoul, Republic of Korea. Contact: Dongman Lee, Email: dlee@cs.kaist.ac.kr
June 17–18: Third International Workshop on Future Multimedia Networking, Krakow, Poland. Contact: Andreas Mauthe, Email: a.mauthe@lancaster.ac.uk
June 19–23: ACM SIGCHI Symposium on Engineering Interactive Computing Systems, Berlin, Germany. Contact: Jean Vanderdonckt, Email: jean.vanderdonckt@uclouvain.be
June 19–23: The 37th Annual International Symposium on Computer Architecture, Saint-Malo, France. Contact: Andre Seznec, Email: seznec@irisa.fr

doi:10.1145/1743546.1743560 Peter J. Denning and Jack B. Dennis

The Profession of IT
The Resurgence
of Parallelism
Parallel computation is making a comeback after a quarter century
of neglect. Past research can be put to quick use today.

"Multi-core chips are a new paradigm!" "We are entering the age of parallelism!" These are today's faddish rallying cries for new lines of research and commercial development. Is this really the first time when computing professionals seriously engaged with parallel computation? Is parallelism new? Is parallelism a new paradigm?

Déjà Vu All Over Again
Parallel computation has always been a means to satisfy our never-ending hunger for ever-faster and ever-cheaper computation.4 In the 1960s and 1970s, parallel computation was extensively researched as a means to high-performance computing. But the commercial world stuck with a quest for faster CPUs and, assisted by Moore's Law, made it to the 2000s without having to seriously engage with parallel computation except for supercomputers. The parallel architecture research of the 1960s and 1970s solved many problems that are being encountered today. Our objective in this column is to recall the most important of these results and urge their resurrection.

Shared Memory Multiprocessing
The very first multiprocessor architecture was the Burroughs B5000, designed beginning in 1961 by a team led by Robert Barton. It was followed by the B5500 and B6700, along with a defense version, the D850. The architecture survives today in the reverse polish notation HP calculators and in the UniSys ClearPath MCP machines.

Those machines used shared memory multiprocessors in which a crossbar switch connected groups of four processors and memory boxes. The operating system, known as Automatic Scheduling and Operating Program (ASOP), included many innovations. Its working storage was organized as a stack machine. All its code was "reentrant," meaning that multiple processors could execute the same code simultaneously while computing on separate stacks. The instruction set, which was attuned to the Algol language, was very simple and efficient even by today's RISC standards. A newly spawned process's stack was linked to its parent's stack, giving rise to a runtime structure called "cactus stack." The data memory outside of the stacks was laid out in segments; a segment was a contiguous sequence of locations with base and bound defined by a descriptor. Segments were moved automatically up and down the memory hierarchy, an early form of virtual memory not based on paging. Elliot Organick's masterful descriptions of these machines make for refreshing and worthwhile reading today.9,12

The Burroughs systems were an important influence on research seeking efficient and reliable parallel program structures. A group of researchers at Brown University and General Electric Research Laboratories produced a set of reports on a "contour model" of nested multitask computations in 1971.12 Those reports give a remarkably clear picture of a parallel programming runtime environment that would suit today's languages well and would resolve many contemporary problems considered as "research challenges." It is a tragedy these ideas have disappeared from the curriculum.

The Burroughs machines disappeared not because of any defect in their architecture, but because of IBM's massive success in marketing the 360 series systems. Moreover, in a process reminiscent of Clayton Christensen's Innovator's Dilemma, the low-end assembler-language minicomputers, originally designed to run laboratory instruments, grew up into the minicomputer and then microcomputer market, untainted by any notions of parallel programming.

With the introduction of RISC architectures in the early 1980s, much of the research for high-performance computers was rechanneled toward exploiting RISC for fast chips. It looked at the time that sophisticated compilers could make up for missing functions in the chips.

With two notable exceptions, most of the projects exploring alternatives to the "von Neumann architecture" expired and were not replaced with new projects or other initiatives. One exception was Arvind's Monsoon Project at MIT,10 which demonstrated that massive parallelism is readily identified in the functional programming language Haskell, and then readily mapped to a shared memory multiprocessor. (Functional languages generate all their values by evaluating functions without side effects.)

The other project involved a group at the Lawrence Livermore National Laboratory studying scientific codes in the functional language Sisal, a derivative of MIT's Val language; Sisal programs were as efficient as Fortran programs and could be readily compiled to massively parallel shared memory supercomputers.1,2,11

The current generations of supercomputers (and data warehouses) are based on thousands of CPU chips running in parallel. Unlike the innovative designs of the Burroughs systems, their hardware architectures conform to the conventional von Neumann machine. Their operating systems are little more than simple schedulers and message passing protocols, with complex functions relegated to applications running on separate host machines.

The point is clear: ideas for arranging multiple processors to work together in an integrated system have been with us for 50 years. What's new?

Determinate Computation
One of the holy grails of research in parallel computation in the 1960s and 1970s was called "determinacy."7 Determinacy requires that a network of parallel tasks in shared memory always produces the same output for given input regardless of the speeds of the tasks. It should not be confused with a similar word, "deterministic," which would require that the tasks be ordered in the same sequence every time the system runs.

A major result of this research was the "determinacy theorem." A task is a basic computation that implements a function from its inputs to outputs. Two tasks are said to be in conflict if either of them writes into memory cells used by the other. In a system of concurrent tasks, race conditions may be present that make the final output depend on the relative speeds or orders of task execution. Determinacy is ensured if the system is constrained so that every pair of conflicting tasks is performed in the same order in every run of the system. Then no data races are possible. Note that atomicity and mutual exclusion are not sufficient for determinacy: they ensure only that conflicting tasks are not concurrent, not that they always execute in the same order.

A corollary of the determinacy theorem is that the entire sequence of values written into each and every memory cell during any run of the system is the same for the given input. This corollary also tells us that any system of blocking tasks that communicates by messages using FIFO queues (instead of shared memory) is automatically determinate because the message queues always present the data items in the same order to the tasks receiving them.

Another corollary is that an implementation of a functional programming language using concurrent tasks is determinate because the functions provide their data privately to their successors when they fire. There is no interference among the memory cells used to transmit data between functions.

Determinacy is really important in parallel computation. It tells us we can unleash the full parallelism of a computational method without worrying whether any timing errors or race conditions will negatively affect the results.
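The determinacy theorem is easy to see in miniature. The following toy sketch (standard library only; not from the column) runs two conflicting tasks over a shared cell, first unordered and then with one fixed order enforced:

```python
# Tasks t1 and t2 conflict on cell "x". Left to the scheduler, the final
# value can vary between runs; enforcing one fixed order (t1 before t2)
# makes every run produce the same value.
import threading

def run(ordered):
    mem = {"x": 0}
    first_done = threading.Event()

    def t1():
        mem["x"] = mem["x"] + 1
        first_done.set()

    def t2():
        if ordered:
            first_done.wait()   # forces the conflicting pair into one order
        mem["x"] = mem["x"] * 10

    threads = [threading.Thread(target=t2), threading.Thread(target=t1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem["x"]

print({run(False) for _ in range(200)})  # several outcomes are possible
print({run(True) for _ in range(200)})   # always {10}
```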
Functional Programming and Composability
Another holy grail for parallel systems has been modular composability. This would mean that any parallel program can be used, without change, as a component of a larger parallel program.

Three principles are needed to enable parallel program composability. David Parnas wrote about two: information hiding and context independence. Information hiding means a task's internal memory cannot be read or written by any other task. Context independence means no part of a task can depend on values outside the task's internal memory or input-output memory. The third principle is argument noninterference; it says that a data object presented as input to two concurrent modules cannot be modified by either.

Functional programming languages automatically satisfy these three principles; their modules are thus composable.
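The FIFO-queue corollary above can also be demonstrated in a few lines. A minimal sketch, again using only the standard library: two stages share no memory and communicate only through queues, so every run produces the same output regardless of scheduling.

```python
# Two-stage pipeline communicating only through FIFO queues: no shared
# cells, so the output is determinate regardless of task speeds.
import queue
import threading

def stage(fn, inq, outq):
    while True:
        item = inq.get()
        if item is None:          # end-of-stream marker
            outq.put(None)
            return
        outq.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)).start()
threading.Thread(target=stage, args=(lambda x: x * 10, q2, q3)).start()

for i in range(5):
    q1.put(i)
q1.put(None)

results = []
while (item := q3.get()) is not None:
    results.append(item)
print(results)  # always [10, 20, 30, 40, 50]
```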
It is an open question how to structure composable parallel program modules from different frameworks when the modules implement nondeterminate behavior. Transaction systems are an extreme case. Because their parallel tasks may interfere in the records they access, they use locking protocols to guarantee mutual exclusion. Transaction tasks cannot be ordered by a fixed order—their nondeterminacy is integral to their function. For example, an airplane seat goes to whichever task requested it first. The problem is to find a way to reap the benefits of composability for systems that are necessarily nondeterminate.
Virtual Memory
There are obvious advantages if the shared memory of a parallel multiprocessor could be a virtual memory. Parameters can be passed as pointers (virtual addresses) without copying (potentially large) objects. Compilation and programming are greatly simplified because neither compilers nor programmers need to manage the placement of shared objects in the memory hierarchy of the system; that is done automatically by the virtual memory system.

Virtual memory is essential for modular composability when modules can share objects. Any module that manages the placement of a shared object in memory violates the information hiding principle because other modules must consult it before using a shared object. By hiding object locations from modules, virtual memory enables composability of parallel program modules.

There are two concerns about large virtual memory. One is that the virtual addresses must be large so that they encompass the entire address space in which the large computations proceed. The Multics system demonstrated that a very large virtual address space—capable of encompassing the entire file system—could be implemented efficiently.8 Capability-based addressing5,6 can be used to implement a very large address space.

The other concern about large virtual memory pertains to performance. The locality principle assures us that each task accesses a limited but dynamically evolving working set of data objects.3 The working set is easily detected—it is the objects used in a recent backward-looking window—and loaded into a processor's cache. There is no reason to be concerned about performance loss due to an inability to load every task's working set into its cache.

What about cache consistency? A copy of a shared object will be present in each sharing task's cache. How do changes made by one get transmitted to the other? It would seem that this problem is exacerbated in a highly parallel system because of the large number of processors and caches.

Here again, the research that was conducted during the 1970s provides an answer. We can completely avoid the cache consistency problem by never writing to shared data. That can be accomplished by building the memory as a write-once memory: when a process writes into a shared object, the system automatically creates a copy and tags it as the current version. These value sequences are unique in a determinate system. Determinate systems, therefore, give a means to completely avoid the cache consistency problem and successfully run a very large virtual memory.
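A toy sketch of the write-once idea (an invented class, not a description of any real system): writes never mutate in place; each one appends a new tagged version, so a cached or previously read copy can never become stale.

```python
# Write-once ("single assignment") store: an update creates a new tagged
# version instead of mutating in place, so cached copies never go stale.
class WriteOnceStore:
    def __init__(self):
        self._versions = {}          # name -> list of immutable values

    def write(self, name, value):
        self._versions.setdefault(name, []).append(value)
        return len(self._versions[name]) - 1   # tag of the new version

    def read(self, name, tag=-1):
        return self._versions[name][tag]       # default: current version

store = WriteOnceStore()
t0 = store.write("config", ("threads", 4))
t1 = store.write("config", ("threads", 8))     # new version; old one intact
print(store.read("config", t0), store.read("config", t1))
```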
Research Challenges
Functional programming languages (such as Haskell and Sisal) currently support the expression of large classes of application codes. They guarantee determinacy and support composability. Extending these languages to include stream data types would bring hazard-free expression to computations involving inter-module pipelines and signal processing. We badly need a further extension to support programming in the popular object-oriented style while guaranteeing determinacy and composability.

We have means to express nondeterminate computation in self-contained environments such as interactive editors, version control systems, and transaction systems. We sorely need approaches that can combine determinate and nondeterminate components into well-structured larger modules.

The full benefits of functional programming and composability cannot be fully realized unless memory management and thread scheduling are freely managed at runtime. In the long run, this will require merging computational memory and file systems into a single, global virtual memory.

Conclusion
We can now answer our original questions. Parallelism is not new; the realization that it is essential for continued progress in high-performance computing is. Parallelism is not yet a paradigm, but may become so if enough people adopt it as the standard practice and standard way of thinking about computation.

The new era of research in parallel processing can benefit from the results of the extensive research in the 1960s and 1970s, avoiding rediscovery of ideas already documented in the literature: shared memory multiprocessing, determinacy, functional programming, and virtual memory.

References
1. Cann, D. Retire Fortran? A debate rekindled. Commun. ACM 35, 8 (Aug. 1992), 81–89.
2. Cann, D. and Feo, J. SISAL versus FORTRAN: A comparison using the Livermore loops. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 1990.
3. Denning, P. Virtual memory. ACM Computing Surveys 2, 3 (Sept. 1970), 153–189.
4. Denning, P. and Tichy, W. Highly parallel computation. Science 250 (Nov. 30, 1990), 1217–1222.
5. Dennis, J. and Van Horn, E.C. Programming semantics for multi-programmed computations. Commun. ACM 9, 3 (Mar. 1966), 143–155.
6. Fabry, R. Capability-based addressing. Commun. ACM 17, 7 (July 1974), 403–412.
7. Karp, R.M. and Miller, R.E. Properties of a model for parallel computations: Determinacy, termination, queueing. SIAM Journal of Applied Mathematics 14, 6 (Nov. 1966), 1390–1411.
8. Organick, E.I. The Multics System: An Examination of Its Structure. MIT Press, 1972.
9. Organick, E.I. Computer systems organization: The B5700/B6700. ACM Monograph Series, 1973. LCN: 72-88334.
10. Papadopoulos, G.M. and Culler, D.E. Monsoon: An explicit token-store architecture. In Proceedings of the 17th Annual International Symposium on Computer Architecture (1990), 82–91.
11. Sarkar, V. and Cann, D. POSC—A partitioning and optimizing SISAL compiler. In Proceedings of the 4th International Conference on Supercomputing. IEEE Computer Society Press, 1990.
12. Tou, J. and Wegner, P., Eds. Data structures in programming languages. ACM SIGPLAN Notices 6 (Feb. 1971), 171–190. See especially papers by Wegner, Johnston, Berry, and Organick.

Peter J. Denning (pjd@nps.edu) is the director of the Cebrowski Institute for Innovation and Information Superiority at the Naval Postgraduate School in Monterey, CA, and is a past president of ACM.

Jack B. Dennis (dennis@csail.mit.edu) is Professor Emeritus of Computer Science and Engineering at MIT and a Principal Investigator in the Computer Science and Artificial Intelligence Laboratory. He is an Eckert-Mauchly Award recipient and a member of the NAE.

Copyright held by author.

doi:10.1145/1743546.1743561 George V. Neville-Neil


Article development led by
queue.acm.org

Kode Vicious
Plotting Away
Tips and tricks for visualizing large data sets.

Dear KV,
I've been working with some code that generates massive data sets, and while I'm perfectly happy looking at the raw textual output to find patterns, I'm finding that more and more often I have to explain my data to people who are either unwilling to or incapable of understanding the data in a raw format. I'm now being required to generate summaries, reports, and graphs for these people, and, as you can imagine, they are the ones with control over the money in the company, so I also have to be nice to them. I know this isn't exactly a coding question, but what do you do when you have to take the bits you understand quite well and summarize them for people like this?
If It Ain't Text...

Dear Text,
Since I often come across as some sort of hard-core, low-level, bits-and-bytes kind of guy, I gather that you're assuming my answer will be to tell management—and from your description these people must be management—to take their fancy graphs and, well, do something that would give them paper cuts in hard-to-reach places. Much as I like to give just that kind of advice, the fact is, it's just as important to be able to transform large data sets from columns and lines of numbers into something that is a bit more compact and still as descriptive. For the polite and well-written version of this type of advice, please see the classic work by Edward Tufte, The Visual Display of Quantitative Information. Now for the Kode Vicious Kwik Kourse on Visualization, please read on.

While I agree it is cool to be able to look at some incredibly confusing output in text and be able to pick out the needle you're looking for, and while I'm sure this impresses many of your coder friends, this is just not a skill that's going to take you very far. I also find that programmers who cannot understand the value in a single-page graph of their results are the same kinds of programmers who should not be allowed to code on their own.

One should approach any such problem as a science experiment, and scientists know how to represent their results in many ways, including plotting them on paper. At some point in your career you're going to have to figure out how to get easy-to-read results that you can look at and compare side by side. A plot of your data can, when done well, give you a lot of information and tell you a lot about what might be happening with your system. Note the all-important phrase, when done well, in that previous sentence. As is the case with many tools, the plotting of data can mislead you as easily as it can lead you somewhere.
There are plenty of tools with which to plot your data, and I usually shy away from advocating particular tools in these responses, but I can say that if you were trying to plot a lot of data, where a lot is more than 32,767 elements, you would be wise to use something like gnuplot. Every time I've seen people try to use a certain vendor's spreadsheet to plot data sets larger than 32,767, things have gone awry—I might even say that they were brought up "short" by that particular program. The advantage of gnuplot is that as long as you have a lot of memory (and memory is inexpensive now), you can plot very large data sets. KV recently outfitted a machine with 24GB of RAM just to plot some important data. I'm a big believer in big memory for data, but not for programs, but let's just stop that digression here.

Let's now walk through the important points to remember when plotting data. The first is that if you intend to compare several plots, your measurement axis—the one on which you're showing the magnitude of a value—absolutely must remain constant or be easily comparable among the total set of graphs that you generate. A plot with a y-axis that goes from 0 to 10 and another with a y-axis from 0 to 25 may look the same, but their meaning is completely different. If the data you're plotting runs from 0 to 25, then all of your graphs should run from, for example, 0 to 30. Why would you waste those last five ticks? Because when you're generating data from a large data set, you might have missed something, perhaps a crazy outlier that goes to 60, but only on every 1,000th sample. If you set the limits of your axes too tightly initially, then you might never find those outliers, and you would have done an awful lot of work to convince yourself—and whoever else sees your pretty little plot—that there really isn't a problem, when in fact it was right under your nose, or more correctly, right above the limit of your graph.

Since you mention you are plotting large data sets, I'll assume you mean more than 100,000 points. I have routinely plotted data that runs into the millions of individual points. When you plot the data the first time, it's important not only to get the y-axis limits correct, but also to plot as much data as absolutely possible, given the limits of the system on which you're plotting the data. Some problems or effects are not easily seen if you reduce the data too much. Reduce the data set by 90% (look at every 10th sample), and you might miss something subtle but important. If your data won't all fit into main memory in one go, then break it down by chunks along the x-axis. If you have one million samples, graph them 100,000 at a time, print out the graphs, and tape them together. Yes, it's kind of a quick-and-dirty solution but it works, trust me.
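gnuplot is the column's tool of choice; for readers working in Python, the same two rules (hold the measurement axis fixed across graphs, and chunk rather than downsample) look like this in matplotlib. The data and file names are invented.

```python
# Sketch of the two rules above in matplotlib: identical y-limits on
# every graph, and chunked plotting instead of downsampling.
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.default_rng(0).normal(12, 3, 1_000_000)
samples[::1000] = 60                      # the outlier you must not crop away

CHUNK, YLIM = 100_000, (0, 70)            # same axis limits on every page
for i, start in enumerate(range(0, len(samples), CHUNK)):
    fig, ax = plt.subplots()
    ax.plot(samples[start:start + CHUNK], linewidth=0.3)
    ax.set_ylim(*YLIM)                    # constant measurement axis
    fig.savefig(f"chunk_{i:02d}.png")     # print and tape together
    plt.close(fig)
```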
Another problem occurs when you want to compare two data sets directly on the same plot. Perhaps you have data from several days and you want to see how Wednesday and Thursday compare, but you don't have enough memory to plot both days at once, only enough for one day at a time. You could beg your IT department for more memory or, if you have a screwdriver, "borrow" some memory from a coworker, but such measures are unnecessary if you have a window. Print both data sets, making sure both axes line up, and then hold the pages up to the window. Oh, when I said "window," I meant one that allows light from that bright yellow ball in the sky to enter your office, not one that is generated by your computer.

Thus far I have not mentioned the x-axis, but let's remedy that now. If you're plotting data that changes over time, then your x-axis is actually a time axis. The programmers who label this "samples," and then do all kinds of internal mental transformations, are legion—and completely misguided. While you might know that your samples were taken at 1KHz and therefore that every 1,000 samples is one second, and 3,600,000 samples is an hour, most of the people who see your plots are not going to know this, even if you cleverly label your x-axis "1KHz." If you're plotting something against time, then your x-axis really should be time.

This recommendation is even more important when graphing long-running data—for example, a full working day. It turns out that computers are slaves to people and while many people have predicted that the work done by computers would be far more consistent over a 24-hour day than work done by humans, all of those people have been, and continue to be, dead wrong. If you're plotting data over a day, then it is highly likely that you will see changes when people wake up, when they go to work, take meals, go home, and sleep. It might be vitally important for you to notice that something happens every day at 4 p.m. Perhaps your systems in England are recording when people take tea, rather than an odd slowdown in the system. The system you're watching might be underutilized because the tea trolley just arrived! If your plot has time, then use time as an axis.
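Converting sample indices into a real time axis is cheap insurance against exactly this confusion. A matplotlib sketch, assuming a 1KHz sampling rate and an invented start time:

```python
# Sketch: label the x-axis with real time, not sample counts.
# Assumes samples were taken at 1KHz starting at a known instant.
import numpy as np
import matplotlib.pyplot as plt

RATE_HZ = 1000
start = np.datetime64("2010-06-01T09:00:00")
values = np.random.default_rng(1).normal(0, 1, 3 * RATE_HZ)
ns_per_sample = 10**9 // RATE_HZ
times = start + (np.arange(len(values)) * ns_per_sample).astype("timedelta64[ns]")

fig, ax = plt.subplots()
ax.plot(times, values, linewidth=0.3)
ax.set_xlabel("time of day")              # not "samples", not "1KHz"
fig.autofmt_xdate()
fig.savefig("over_time.png")
```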
George V. Neville-Neil (kv@acm.org) is the proprietor of
Since you mention you are plotting important when graphing long-run- Neville-Neil Consulting and a member of the ACM Queue
large data sets, I’ll assume you mean ning data—for example, a full working editorial board. He works on networking and operating
systems code for fun and profit, teaches courses on
more than 100,000 points. I have rou- day. It turns out that computers are various programming-related subjects, and encourages
tinely plotted data that runs into the slaves to people and while many people your comments, quips, and code snips pertaining to his
Communications column.
millions of individual points. When have predicted that the work done by
you plot the data the first time, it’s im- computers would be far more consis- Copyright held by author.

34 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6

doi:10.1145/1743546.1743562 François Lévêque

Law and Technology


Intel’s Rebates: Above
Board or Below the Belt?
Over several years, Intel paid billions of dollars to its customers.
Was it to force them to boycott products developed by its rival AMD or
so they could sell its microprocessors at lower prices?

Over a five-year period, Dell allegedly received payments from Intel averaging $300 million per quarter.a The Attorney General of the State of New York, Andrew M. Cuomo, accuses the Santa Clara-based chip firm of making these payments to force its OEM customer not to use AMD's x86 CPUs in its computers, in violation of antitrust law. Intel is alleged to have infringed Section 2 of the Sherman Act, which opposes behavior by firms aimed at creating or preserving a monopoly other than by merit. In December 2009, the Federal Trade Commission also filed suit against Intel.b The FTC accuses the chip maker of numerous anticompetitive unfair practices, including various payments to its customers in the form of lump sums or discounts.

In Europe, the case is closed, or almost. The billions of dollars that Intel paid to Dell, HP, and several other firms were deemed anticompetitive behavior. The European Commission found that the payments amounted to a strategy to exclude AMD from the microprocessor market. They were considered akin to rebates and restrictions imposed on buyers, which are incompatible with European antitrust law. The Commission found against Intel in May 2009 and fined the firm almost $2 billion.c So, instead of going to its customers, Intel's money replenished the coffers of the European Union! Intel immediately appealed to the EU Court of First Instance in Luxembourg. It also signed a $1.25 billion settlement agreement with AMD to put an end to its antitrust and patent allegations.

Intel considers the payments it made to customers were, on the contrary, a reflection of vigorous competition and beneficial to consumers.

Who's right? Who's wrong? The parties offer diametrically opposed versions of the story.

[Photo caption: A netbook equipped with an Intel Atom processor is demonstrated at the International Consumer Electronics Show in Las Vegas in 2009. Photograph by Jae C. Hong/AP Photo.]

a Complaint by Attorney General of the State of New York, Andrew M. Cuomo, against Intel Corporation before the United States District Court for the District of Delaware, November 4, 2009.
b Administrative complaint of the U.S. FTC against Intel Corporation, docket No. 9341, December 14, 2009.
c European Commission's decision against Intel Corp., case COMP/37.990, May 13, 2009.

Plaintiff Perspective
The story told by the plaintiff State of New York and by the European Commission can be summed up as follows. Intel and AMD are practically the only manufacturers of x86 CPUs, the microprocessors inside most computers.

Although four times the size of AMD, Intel was outpaced by the smaller firm in innovation. In particular, Intel is having more trouble negotiating the transition from 32-bit architecture to the 64-bit architecture that makes computers more powerful. According to New York Attorney General Cuomo, Intel has a "big competitive hole in its product development roadmap." In 2003, AMD was the first to introduce a new-generation processor for the high-end, high-margin corporate server market. Intel feared its competitor would erode its profits on this segment, since business users would be eager to purchase AMD-based desktops and notebooks.

To prevent that market shift, Intel paid Dell and HP to purchase Intel microprocessors almost exclusively, and paid Acer and Lenovo to delay the launch of their AMD-based notebooks. In other words, Intel paid its customers to protect a segment of its market. Dell was by far the biggest beneficiary of these practices. Between February 2002 and January 2007, Dell received more than $6 billion in return for maintaining an exclusive procurement agreement with Intel. Without these payments, Dell would have reported a loss in some quarters. According to the State of New York, the Federal Trade Commission, and the European Commission, the money that Intel paid its customers was conditional on their boycotting AMD's products. In technical terms, the retroactive rebates given to some OEM customers are loyalty rebates, and the restrictions imposed on OEMs' sales policies are naked restrictions. In Europe, both are generally prohibited because they are perceived to be anticompetitive: they tend to exclude competitors and reduce consumer choice.

Defendant Perspective
Intel's version is nothing like the previous story.d Since 2000, the Santa Clara-based chip maker has faced aggressive price competition from AMD and it has responded by defending itself fairly. AMD's failure to succeed in some market segments is due to its own shortcomings, especially insufficient production capacity, not to any action by Intel. Between 2002 and 2007, the price of microprocessors fell by 36% on average per year and AMD's market share among the main computer manufacturers has risen from 8% to 22%. These figures contradict the claims that Intel has behaved like a monopolist and AMD has been squeezed out of the market.

d See "Why the European Commission's Intel decision is Wrong" and "Intel's Response to the EC's Provisional Non-Confidential Version of the Commission Decision of 13 May 2009," September 21, 2009; http://www.intel.com/pressroom/legal/news.htm
Computer manufacturers know how to play off their two suppliers to their advantage. Intel claims that the payments to customers were not tied to exclusive or near-exclusive purchasing commitments, but were volume-based discounts enabled by economies of scale in production. Thanks to Intel's discount policy, consumers benefit from lower prices. The prices were always higher than Intel's costs. Therefore Intel cannot be accused of selling below cost to drive out AMD.

Whose story should we believe? How can we tell who's right?

In order to decide between the two versions, the first thing to do is of course to look at the facts. This is not easy for an outside observer (including this writer) because the evidence is off limits. Only the public statements and official decisions are available. In the U.S. lawsuits, the State of New York's complaint and the FTC's statements total less than 100 pages. In the EU case, the material is more abundant. The European Commission's decision against Intel runs to more than 500 pages and cites numerous statements by the parties. For example, a Lenovo purchasing manager wrote in an email message dated December 11, 2006, "Last week Lenovo cut a lucrative deal with Intel. As a result of this, we will not be introducing AMD-based products in 2007 for our notebook products." Thousands of figures are also reported. Unfortunately, in order to respect business secrecy, almost all the figures have been deleted from the public version of the decision. Thus, there is no information about the amount of the discounts granted to Dell.

A factual approach is also hampered by the absence of formal contracts. What Intel requested in exchange for the payments to its customers is not mentioned in written documents. Most of the agreements were oral and sealed with a handshake. The few written agreements raise no antitrust concerns. The State of New York and the European Commission accuse Intel of having deliberately removed from the written documents the litigious clauses with respect to antitrust law. If the allegations were proved true, the antitrust agencies would be dealing with a situation akin to a cartel. Since the agreements were secret, evidence is scant and often only indirect.

For want of being able to decide on the basis of the facts, an outside observer can call theory to the rescue. The first principle to recall is that antitrust law seeks to protect consumers, not competitors. It does not oppose the elimination of less-efficient competitors; it prohibits behavior of firms that results in higher prices or lower-quality products. While clearly detrimental to AMD, did Intel's actions harm consumers?

In the case of the naked restrictions, the damage to consumers is not in doubt. Let's take the example of Intel's lump sum payments to HP, Lenovo, and Acer in exchange for delaying the launch of their AMD-based computers. That practice (if proven) did hurt consumers: some had to wait before buying the product they preferred, while others, in more of a hurry, had to buy hardware that was not their first choice. Moreover, consumers who were not interested in buying AMD-based desktops and notebooks did not gain anything. The money paid by Intel did not affect the OEMs' marginal cost and, consequently, the price of their computers. Intel and the firms it paid off were the only beneficiaries of these transactions.

The case of the rebates is a more complicated situation. When rebates are linked to volumes purchased, they are good for consumers. They enable manufacturers to pass on some of the savings from economies of scale in production and distribution. In other words, they bring prices down for the consumer. But retroactive rebates tied to market share targets (for example, the buyer receives a rebate if it covers X% of its requirements with the same supplier) are a different story. If a competitor wants to obtain a significant share of the customer's purchases, it must compensate for the loss of the rebates. For example, if Intel offers $100,000 on the condition that HP fulfills 95% of its requirements with Intel, AMD will be forced to offer the same amount if it wants HP to buy more than 5% of its chips from AMD. That threshold effect can have surprising consequences. It would explain, for example, why HP refused AMD's offer of a million Athlon XP processors free of charge. If the gift is worth less than the rebate forfeited by not purchasing 95% of its requirements from Intel, it is rational for HP to refuse it.
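The arithmetic behind that refusal fits in a few lines. A sketch with invented numbers (the actual amounts are not public):

```python
# Back-of-the-envelope version of the threshold effect, with invented
# numbers: a buyer refuses free chips worth less than the rebate it
# would forfeit by dropping below the 95% purchase threshold.
rebate = 100_000_000     # hypothetical retroactive rebate from the incumbent
free_units = 1_000_000   # rival's gift of processors
unit_value = 75          # hypothetical value of each free chip, in dollars

gift_value = free_units * unit_value
# Accepting the gift pushes purchases below the threshold and forfeits
# the whole rebate, so the rational buyer compares the two amounts.
print("accept" if gift_value > rebate else "refuse")   # refuse: $75M < $100M
```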
Conclusion
Economic literaturee shows this threshold effect can lead to the exclusion of competitors that are at least as efficient as the firm offering the rebates. And consumers lose out on two counts. First, there are no more competitors to push down the price set by the dominant firm. So consumers pay higher prices. Second, there is no competitive pressure driving the firm to innovate. So products are manufactured at higher cost and their quality stagnates.

The European Commission sought to show that Intel's loyalty rebates indeed had a foreclosure effect. According to the Commission, a rival with the same costs as Intel would have been excluded. Intel contests this conclusion by finding fault with the Commission's calculations. But once again, the problem of access to the data and evidence makes it impossible to verify the validity of the different viewpoints. Theory without the facts is unfortunately of little use for vindicating either the defendant Intel or the plaintiffs.

e See, for example, Nicolas Economides, "Loyalty/Requirement Rebates and the Antitrust Modernization Commission: What is the Appropriate Liability Standard?" Antitrust Bulletin 54, 2 (Summer 2009), 259–279.

François Lévêque (francois.leveque@mines-paristech.fr) is a professor of law and economics at Mines-ParisTech in Paris, France.

Copyright held by author.
V
viewpoints

doi:10.1145/1743546.1743563 Simson L. Garfinkel and Lorrie Faith Cranor

Viewpoint
Institutional Review Boards
and Your Research
A proposal for improving the review procedures for research projects that
involve human subjects and their associated identifiable private information.

R
esearchers in computer
science departments
throughout the U.S. are
violating federal law and
their own organization’s
regulations regarding human sub-

t
jects research—and in most cases
they don’t even know it. The violations
e n
pm
are generally minor, but the lack of
review leaves many universities open

ve l o
De
to significant sanctions, up to and
including the loss of all federal re-

r t i n
A
search dollars. The lack of review also
means that potentially hazardous re-
search has been performed without
adequate review by those trained in
human subject protection.
We argue that much computer sci-
ence research performed with the In-
ternet today involves human subject
data and, as such, must be reviewed
by Institutional Review Boards—in-
cluding nearly all research projects
involving network monitoring, email,
Facebook, other social networking
sites and many Web sites with user-
generated content. Failure to address
this issue now may cause significant which together articulate U.S. policy from poor rural black men. Another
problems for computer science in the on the Protection of Human Subjects. was the 1971 Stanford Prison Experi-
near future. This policy was created following a ment, funded by the U.S. Office of
series of highly publicized ethical Naval Research, in which students
Prisons and Syphilis lapses on the part of U.S. scientists playing the role of prisoners were
At issue are the National Research Act performing federally funded re- brutalized by other students playing
Illustrati on by Da niel Zalk us

(NRA) of 1974a and the Common Rule,b search. The most objectionable cases the roles of guards.
involved human medical experimen- The NRA requires any institution
tation—specifically the Tuskegee receiving federal funds for scientific
a PL 93-348, see http://history.nih.gov/research/
downloads/PL93-348.pdf
Syphilis Experiment, a 40-year long research to set up an Institutional Re-
b 45 CFR 46, see http://www.hhs.gov/ohrp/hu- U.S. government project that delib- view Board (IRB) to approve any use
mansubjects/guidance/45cfr46.htm erately withheld syphilis treatment of humans before the research takes

38 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
viewpoints

place. The regulation that governs Complicating matters is the fact that
these boards is the Common Rule— the Common Rule allows organiza-
“Common” because the same rule was Much computer tions to add additional requirements.
passed in 1991 by each of the 17 federal science research Indeed, many U.S. universities require
agencies that fund most scientific re- IRB approval for any research involving
search in the U.S. performed with human subjects, regardless of funding
Computer scientists working in the the Internet today source. Most universities also prohibit
field of Human-Computer Interaction researchers from determining if their
(HCI) have long been familiar with involves human own research is exempt. Instead, U.S.
the Common Rule: any research that subject data and, universities typically require that all
involves recruiting volunteers, bring- research involving human beings be
ing them into a lab and running them as such, must submitted to the school’s IRB.
through an experiment obviously in- be reviewed by This means a broad swath of “ex-
volves human subjects. NSF grant ap- empt” research involving publicly
plications specifically ask if human Institutional available information nevertheless re-
subjects will be involved in the research Review Boards. quires IRB approval. Performing social
and require that applicants indicate the network analysis of Wikipedia pages
date IRB approval was obtained. may fall under IRB purview: Wikipedia
But a growing amount of research tracks which users edited which pages,
in other areas of computer science and when those edits were made. Us-
also involves human subjects. This ing Flickr pages as a source of JPEGs
research doesn’t involve live human computer scientists the relevant ex- for analysis may require IRB approval,
beings in the lab, but instead involves emptions are “research to be conduct- because Flickr pages frequently have
network traffic monitoring, email, on- ed on educational practices or with ed- photos of people (identifiable informa-
line surveys, digital information creat- ucational tests” (§46.101(b)(1&2)); and tion), and because the EXIF “tags” that
ed by humans, photographs of humans research involving “existing data, doc- many cameras store in JPEG images
that have been posted on the Internet, uments, [and] records…” provided that may contain serial numbers that can
and human behavior observed via so- the data set is either “publicly avail- be personally identifiable. Analysis of
cial networking sites. able” or that the subjects “cannot be Facebook poses additional problems
The Common Rule creates a four- identified, directly or through identi- and may not even qualify as exempt:
part test that determines whether or fiers linked to the subjects’’(§46.101(b) not only is the information person-
not proposed activity must be reviewed (4)). Surveys, interviews, and observa- ally identifiable, but it is frequently not
by an IRB: tions of people in public are generally public. Instead, Facebook information
1. The activity must constitute sci- exempt, provided that identifiable in- is typically only available to those who
entific “research,” a term that the Rule formation is not collected, and pro- sign up for the service and get invited
broadly defines as “a systematic inves- vided that the information collected, into the specific user’s network.
tigation, including research develop- if disclosed, could not “place the sub- We have spoken with quite a few
ment, testing and evaluation, designed jects at risk of criminal or civil liabil- researchers who believe the IRB regu-
to develop or contribute to generaliz- ity or be damaging to the subjects’ lations do not apply to them because
able knowledge.”c financial standing, employability, or they are working with “anonymized”
2. The research must be federally reputation’’(§46.101(b)(2)(i&ii)). data. Ironically, the reverse is probably
funded.d IRBs exist to review proposed re- true: IRB approval is required to be
3. The research must involve human search and protect the interests of sure the data collection is ethical, that
subjects, defined as “a living individual the human subjects. People can par- the data is adequately protected prior
about whom an investigator (whether ticipate in dangerous research, but it’s to anonymization, and that the ano-
professional or student) conduct- important that people are informed, nymization is sufficient. Most schools
ing research obtains (1) data through if possible, of the potential risks and do not allow the experimenters to an-
intervention or interaction with the benefits—both to themselves and to swer these questions for themselves,
individual, or (2) identifiable private society at large. because doing so creates an inherent
information.”e What this means to computer sci- conflict of interest. Many of these re-
4. The research must not be “ex- entists is that any federally funded searchers were in violation of their
empt” under the regulations.f research involving data generated by school’s regulations; some were in vio-
The exemptions are a kind of safety people that is “identifiable” and not lation of federal regulations.
valve to prevent IRB regulations from public probably requires approval in
becoming utterly unworkable. For advance by your organization’s IRB. How to Stop Worrying
This includes obvious data sources and Love the IRB
like network traffic, but it also in- Many IRBs are not well equipped to
c §46.102 (d)
d §46.103 (a)
cludes not so obvious sources like handle the fast-paced and highly tech-
e §46.102 (f) software that collects usage statistics nical nature of computer-related re-
f §46.101 (b) and “phones home.” search. Basic questions arise, such as,

j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t he acm 39
viewpoints

Are Internet Protocol addresses per- nymized search engine data could be
sonally identifiable information? What re-identified to reveal what particular
is “public” and what is not? Is encrypt- It is becoming individuals are searching for? Can net-
ed data secure? Can anonymized data increasingly easy work traffic data collected for research
be re-identified? Researchers we have purposes be used to identify copyright
spoken with are occasionally rebuffed to collect human violators? Can posts to LiveJournal and
by their IRBs—the IRBs insist that no subjects data over Facebook be correlated to learn the
humans are involved in the research— identities of children who are frequent-
ignoring that regulations also apply to the Internet that ly left home alone by their parents?
“identifiable private information.” needs to be properly In order to facilitate more rapid IRB
Another mismatch between com- review, we recommend the develop-
puter science research and IRBs is protected to avoid ment of a new, streamlined IRB appli-
timescale. CS research progresses at a harming subjects. cation process. Experimenters would
much faster pace than research in the visit a Web site that would serve as a
biomedical and behavioral fields. In self-serve “IRB kiosk.” This site would
one case we are aware of, an IRB took ask experimenters a series of questions
more than a year to make a decision to determine whether their research
about a CS application. But even two qualifies as exempt. These questions
or three months to make a decision— figure out which data will be collected, would also serve to guide experiment-
typical of many IRBs—is too slow for a and then to collect the results in a man- ers in thinking through whether their
student in a computer science course ner that protects the privacy of the data research plan adequately protects hu-
who wants to perform a social network subjects. (Arguably, computer scien- man subjects. Qualifying experiment-
analysis as a final project. tists would benefit from better train- ers would receive preliminary approval
For example, one of our studies, ing on experimental design, but that from the kiosk and would be permitted
which involved observing how mem- is a different issue.) We have observed to begin their experiments. IRB repre-
bers of our university community re- that many IRB applications are delayed sentatives would periodically review
sponded to simulated phishing attacks because of a failure on the part of CS these self-serve applications and grant
over a period of several weeks, had researchers to make these points clear. final approval if everything was in order.
to be shortened after being delayed Finally, many computer scientists Such a kiosk is actually permissible
two months by an understaffed IRB. are unfamiliar with the IRB process under current regulations, provided
With the delayed start date, part of and how it applies to them, and may that the research is exempt. A kiosk
the study would have taken place over be reluctant to engage with their IRB could even be used for research that is
winter break, when few people are on after having heard nothing but com- “expedited” under the Common Rule,
campus. Another study we worked on plaints from colleagues who have since expedited research can be ap-
was delayed three months after an had their studies delayed by a slow proved by the IRB Chair or by one or
IRB asked university lawyers to review IRB approval process. While the more “experienced reviewers.”h In the
a protocol to determine whether it studies that CS researchers perform case of non-exempt expedited research,
would violate state wiretap laws. are often exempt or extremely low the results of the Kiosk would be re-
In another case, researchers at In- risk, it is becoming increasingly easy viewed by such a reviewer prior to per-
diana University worked with their to collect human subjects data over mission being given to the researcher.
IRB and the school’s network secu- the Internet that needs to be prop- Institutional Review Board chairs
rity group to send out phishing attacks erly protected to avoid harming sub- from many institutions have told us
based on data gleaned from Facebook.g jects. Likewise, the growing amount informally that they are looking to
Because of the delays associated with of research involving honeypots, bot- computer scientists to come up with
the approval process, the phishing nets, and the behavior of anonymity a workable solution to the difficulty
messages were sent out at the end of systems would seem to require IRBs, of applying the Common Rule to com-
the semester, just before exams, rather since the research involves not just puter science. It is also quite clear that
than at the beginning of the semes- software, but humans—both crimi- if we do not come up with a solution,
ter. Many recipients of the email com- nals and victims. they will be forced to do so.
plained vociferously about the timing. The risks to human subjects from
Another reason computer scientists computer science research are not al- h §46.110 (b)
have problems with IRBs is the level ways obvious, and the IRB can play an
of detail the typical IRB application important role in helping computer sci- Simson L. Garfinkel (slgarfin@nps.edu) is an associate
requires. Computer scientists, for the entists identify these risks and insure professor at the U.S. Naval Postgraduate School in
Monterey, CA.
most part, are not trained to carefully that human subjects are adequately
plan out an experiment in advance, to protected. Is there a risk that data col- Lorrie Faith Cranor (lorrie+@cs.cmu.edu) is an associate
professor of computer science and engineering and public
lected on computer security incidents policy and the director of the CyLab Usable Privacy and
g T. Jagatic, N. Johnson, M. Jakobsson, and F.
could be used by employers to identify Security Laboratory at Carnegie Mellon University in
Pittsburgh, PA.
Menczer. Social phishing. Commun. ACM 50, underperforming computer security
10 (Oct. 2007), 94–100. administrators? Is there a risk that ano- Copyright held by author.

40 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
V
viewpoints

doi:10.1145/1743546.1743564 Len Shustek

Interview
An Interview with
Ed Feigenbaum
ACM Fellow and A.M. Turing Award recipient Edward A. Feigenbaum,
a pioneer in the field of expert systems, reflects on his career.

T
h e C ompute r Hi story Mu-
seum has an active program
to gather videotaped histo-
ries from people who have
done pioneering work in
this first century of the information
age. These tapes are a rich aggregation
of stories that are preserved in the col-
lection, transcribed, and made avail-
able on the Web to researchers, stu-
dents, and anyone curious about how
invention happens. The oral histories
are conversations about people’s lives.
We want to know about their upbring-
ing, their families, their education,
and their jobs. But above all, we want
to know how they came to the passion
and creativity that leads to innovation.
Presented here are excerptsa from
four interviews with Edward A. Feigen-
baum, the Kumagai Professor of Com-
puter Science, Emeritus, at Stanford
University and a pioneering researcher
in artificial intelligence. The interviews the book, and so there’s a tremendous culators and learned to use them with
were conducted in 2007 separately by focus on learning, and books, and great facility. That was one of my great
Donald Knuth and Nils Nilsson, both reading. I learned to read very early. skills—in contrast to other friends of
professors of computer science at Stan- mine whose great skills were things
ford University.  —Len Shustek What got you interested in like being on the tennis team.
science and engineering? I was a science kid. I would read
What was your family background? My stepfather was the only one in the Scientific American every month—if
I was born in New Jersey in 1936 to a family who had any college education. I could get it at the library. One book
Photograph by H a ns H enrik H. Hemi ng

culturally Jewish family. That Jewish Once a month he would take me to the that really sucked me into science was
culture thinks of itself as the people of Hayden Planetarium of the American Microbe Hunters. We need more books
Museum of Natural History. I got really like Microbe Hunters to bring a lot
a Oral histories are not scripted, and a tran- interested in science, mostly through more young people into science now.
script of casual speech is very different from astronomy, at about 10 years old.
what one would write. I have taken the liberty
of editing liberally and reordering freely for
My stepfather worked as an accoun- Why did you study
presentation. For the original transcripts, see tant and had a Monroe calculator. I electrical engineering?
http://archive.computerhistory.org/search/oh/ was absolutely fascinated by these cal- I got As in everything, but I really en-

j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t he acm 41
viewpoints

joyed most the math and physics and he handed us an IBM 701 manual, an who had come downstairs to talk to
chemistry. So why electrical engineer- early IBM vacuum tube computer. That us. IT’s introduction actually preceded
ing, as opposed to going into physics? was a born-again experience! Taking Fortran’s by about nine months.
Around my family, no one had ever that manual home, reading it all night
heard of a thing called a physicist. In long—by the dawn, I was hooked on What was it like to use
this middle-class to lower-middle- computers. I knew what I was going a computer then?
class culture people were focused on to do: stay with Simon and do more of There was no staff between you and the
getting a job that would make money, this. But Carnegie Tech did not have computer. You could book time on the
and engineers could get jobs and make any computers at that time, so I got a computer, then you went and did your
money. job at IBM for the summer of 1956 in thing. A personal computer! I loved it.
I happened to see an advertise- New York. I loved the lights, I loved pressing the
ment for scholarships being offered switches. This idea has been very im-
by an engineering school in Pittsburgh What did you learn at IBM? portant for my career—the hands on,
called Carnegie Institute of Technol- First, plug board programming, which experimental approach to computer
ogy. I got a scholarship, so that’s what was a phenomenally interesting thing science as opposed to the theoretical
I did. Life is an interesting set of choic- for a geeky kid. Second, the IBM 650, approach. Experiment turns out to be
es, and the decision to go to Carnegie because by that time it became known absolutely vital.
Tech (now Carnegie-Mellon Univer- that Carnegie Tech would be getting a I was able to write a rather compli-
sity) was a fantastically good decision. 650. Third, the IBM 704, which was a cated—for that time—simulation of
successor machine to the 701. two companies engaged in a duopolis-
Something else there got you excited. When I got back to Carnegie Tech tic decision-making duel about pric-
I had a nagging feeling that there was in September 1956 and began my ing of tin cans in the can industry, the
something missing in my courses. graduate work, there was Alan Perlis, a second such simulation of economic
There’s got to be more to a university wonderful computer genius, and later behavior ever written. It led to my first
education! In the catalog I found a re- the first Turing Award winner. Perlis conference paper, in December 1958,
ally interesting listing called “Ideas was finishing up an amazing program at the American Economics Associa-
and Social Change,” taught by a young called a compiler. That was “IT,” Inter- tion annual meeting.
new instructor, James March. The first nal Translator, and it occupied 1,998
thing he did was to expose us to Von words of the 2,000-word IBM 650 drum. What did you do for your dissertation?
Neumann’s and Morgenstern’s “Theo- I had known about the idea of alge- A model called EPAM, Elementary Per-
ry of Games and Economic Behavior.” braic languages because in the summer ceiver and Memorizer, a computer sim-
Wow! This is mind-blowing! My first at IBM someone had come down from ulation model of human learning and
published paper was with March in so- the fourth floor to talk to the graduate memory of nonsense syllables.
cial psychology, on decision-making in students and tell them about a new I invented a data structure called a
small groups. thing that had just hit the scene. You Discrimination Net—a memory struc-
March introduced me to a more se- didn’t have to write “CLA” for “clear ture that started out as nothing when
nior and famous professor, Herbert and add,” and you didn’t have to write the learner starts. List structures had
Simon. That led to my taking a course “005” for “add.” You could write a for- just been invented, but no one had
from Simon called “Mathematical mula, and a program would translate tried to grow trees. I had to, because
Models in the Social Sciences.” I got to that formula into machine language. I would start with two nonsense syl-
know Herb, and got to realize that this FOR-TRAN. The guy was John Backus, lables in the Net, and then the next
was a totally extraordinary person. pair would come in and they’d have to
In January 1956 Herb walked into “grow into” the net somewhere. These
our seminar of six people and said This idea has been were the first adaptively growing trees.
these famous words: “Over Christmas Now here’s an amazing and kind of stu-
Allen Newell and I invented a think- very important pid thing that shows what it means to
ing machine.” Well, that just blew our for my career— focus your attention on x rather than
minds. He and Newell had formulated y. We were focused on psychology. We
the Logic Theorist on December 15th, the experimental were not focused on what is now called
1955. They put together a paper pro- approach to computer science. So we never pub-
gram that got implemented in the lan- lished anything about those adaptively
guage called IPL-1, which was not a lan- computer science growing trees, except as they related
guage that ran on any computer. It was as opposed to the to the psychological model. But other
the first list processing language, but it people did see trees as a thing to write
ran in their heads. theoretical approach. papers about in the IT literature. So I
missed that one!
That led to your first
exposure to computers. Where was your first academic job?
When we asked Herb in that class, I had wanted to come to the West
“What do you mean by a machine?” Coast, and the University of California

42 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
viewpoints

at Berkeley was excited about getting pill, and also one of the world’s most
me. There I taught two things: organi- respected mass spectrometrists. Carl
zation theory à la March and Simon, AI is not much of a and his postdocs were world-class ex-
and the new discipline called Artificial theoretical discipline. perts in mass spectrometry. We began
Intelligence. to add in their knowledge, inventing
There were no books on the subject It needs to work knowledge engineering as we were go-
of AI, but there were some excellent in specific task ing along. These experiments amount-
papers that Julian Feldman and I pho- ed to titrating into DENDRAL more and
tocopied. We decided that we needed environments. more knowledge. The more you did
to do an edited collection, so we took that, the smarter the program became.
the papers we had collected, plus a few We had very good results.
more that we asked people to write, The generalization was: in the
and put together an anthology called knowledge lies the power. That was the
Computers and Thought that was pub- big idea. In my career that is the huge,
lished in 1963. solving or theorem proving, but induc- “Ah ha!,” and it wasn’t the way AI was
The two sections mirrored two tive hypothesis formation and theory being done previously. Sounds simple,
groups of researchers. There were formation. but it’s probably AI’s most powerful
people who were behaving like psy- I had written some paragraphs at generalization.
chologists and thinking of their work the end of the introduction to Comput- Meta-DENDRAL was the culmina-
as computer models of cognitive pro- ers and Thought about induction and tion of my dream of the early to mid-
cesses, using simulation as a tech- why I thought that was the way forward 1960s having to do with theory for-
nique. And there were other people into the future. That’s a good strate- mation. The conception was that you
who were interested in the problem of gic plan, but it wasn’t a tactical plan. I had a problem solver like DENDRAL
making smart machines, whether or needed a “task environment”—a sand- that took some inputs and produced
not the processes were like what peo- box in which to specifically work out an output. In doing so, it used layers
ple were doing. ideas in detail. of knowledge to steer and prune the
I think it’s very important to em- search. That knowledge got in there
How did choosing one of those phasize, to this generation and every because we interviewed people. But
lead you to Stanford? generation of AI researchers, how im- how did the people get the knowledge?
The choice was: do I want to be a psy- portant experimental AI is. AI is not By looking at thousands of spectra. So
chologist for the rest of my life, or do I much of a theoretical discipline. It we wanted a program that would look
want to be a computer scientist? I looked needs to work in specific task environ- at thousands of spectra and infer the
inside myself, and I knew that I was a ments. I’m much better at discovering knowledge of mass spectrometry that
techno-geek. I loved computers, I loved than inventing. If you’re in an experi- DENDRAL could use to solve individual
gadgets, and I loved programming. The mental environment, you put yourself hypothesis formation problems.
dominant thread for me was not going in the situation where you can discover We did it. We were even able to
to be what humans do, it was going to be things about AI, and you don’t have to publish new knowledge of mass spec-
what can I make computers do. create them. trometry in the Journal of the American
I had tenure at Berkeley, but the busi- Chemical Society, giving credit only in
ness school faculty couldn’t figure out Talk about DENDRAL. a footnote that a program, Meta-DEN-
what to make of a guy who is publishing One of the people at Stanford interest- DRAL, actually did it. We were able to
papers in computer journals, artificial ed in computer-based models of mind do something that had been a dream: to
intelligence, and psychology. That was was Joshua Lederberg, the 1958 Nobel have a computer program come up with
the push away from Berkeley. The pull Prize winner in genetics. When I told a new and publishable piece of science.
to Stanford was John McCarthy. him I wanted an induction “sandbox”,
he said, “I have just the one for you.” What then?
How did you decide on your His lab was doing mass spectrometry We needed to play in other playpens. I
research program? of amino acids. The question was: believe that AI is mostly a qualitative sci-
Looking back in time, for reasons that how do you go from looking at a spec- ence, not a quantitative science. You are
are not totally clear to me, I really, real- trum of an amino acid to the chemical looking for places where heuristics and
ly wanted smart machines. Or I should structure of the amino acid? That’s inexact knowledge can come into play.
put the “really” in another place: I re- how we started the DENDRAL Project: The term I coined for my lab was “Heu-
ally wanted really smart machines. I was good at heuristic search meth- ristic Programming Project” because
I wasn’t going to get there by walk- ods, and he had an algorithm which heuristic programming is what we did.
ing down the EPAM road, which mod- was good at generating the chemical For example, MYCIN was the Ph.D.
els verbal learning, or working on puz- problem space. thesis project of Ted Shortliffe, which
zle-solving deductive tasks. I wanted We did not have a grandiose vision. turned out to be a very powerful knowl-
to model the thinking processes of We worked bottom up. Our chem- edge-based system for diagnosing
scientists. I was interested in problems ist was Carl Djerassi, inventor of the blood infections and recommending
of induction. Not problems of puzzle chemical behind the birth control their antibiotic therapies. Lab mem-

j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t he acm 43
viewpoints

bers extracted from Mycin the core of it they wanted to do it in the “I-am-not-
and called it E-Mycin for Essential My- LISP style,” because the Japanese had
cin, or Empty Mycin. That rule-based In my view the been faulted in the past for being imi-
software shell was widely distributed. science that we call tators. So they chose Prolog and tried
What is the meaning of all those formal methods. And they included
experiments that we did from 1965 to AI, maybe better parallel computing in their initiative.
1968? The Knowledge-Is-Power Hy- called computational They made a big mistake in their
pothesis, later called the Knowledge project of not paying enough atten-
Principle, which was tested with doz- intelligence, is the tion to the application space at the
ens of projects. We came to the conclu- manifest destiny of beginning. They didn’t really know
sion that for the “reasoning engine” of what applications they were aiming
a problem solving program, we didn’t computer science. at until halfway through; they were fly-
need much more than what Aristo- ing blind for five years. Then they tried
tle knew. You didn’t need a big logic to catch up and do it all in five more
machine. You need modus ponens, years, and didn’t succeed. [See the
backward and forward chaining, and book, The Fifth Generation,” written
not much else in the way of inference. with Pamela McCorduck].
Knowing a lot is what counts. So we lished from a military classified proj-
changed the name of our laboratory to ect? The Navy didn’t care about the How did you come to work
the “Knowledge System Lab,” where we blackboard framework; that was com- for the U.S. government?
did experiments in many fields. puter science. So we published the In 1994 an amazing thing happened.
ideas in a paper on a kind of hypothet- The phone rings and it is Professor
What other AI models did you use? ical: “how to find a koala in eucalyp- Sheila Widnall of the Department of
AI people use a variety of underlying tus trees,” which was a non-cassified Aeronautics and Astronautics of MIT.
problem-solving frameworks, and problem drawn from my personal ex- She said, “Do you know anyone who
combine a lot of knowledge about the perience in an Australian forest! wants to be Chief Scientist of the Air
domain with one of these frameworks. Force? And by the way, if you are inter-
These can either be forward-chain- Talk about being an entrepreneur ested let me know.” She had been cho-
ing—sometimes called generate and as well as an academic. sen to be Secretary of the Air Force, and
test—or they could be backward-chain- There was a very large demand for the she was looking for her Chief Scientist.
ing, which say, for example, “here’s the software generalization of the MY- I thought about it briefly, told her yes,
theorem I want to prove, and here’s CIN medical diagnosis expert system and stayed for three years.
how I have to break it down into pieces “shell,” called EMYCIN. So a software My job was to be a window on sci-
in order to prove it.” company was born called Teknowledge, ence for the Chief of Staff of the Air
I began classified research on de- whose goal was to migrate EMYCIN Force. I was the first person to be asked
tecting quiet submarines in the ocean into the commercial domain, make it to be Chief Scientist who was not an
by their sound spectrum. The problem industrial strength, sell it, and apply it. Aero-Astro person, a weapons person,
was that the enemy submarines were Teknowledge is still in existence. or from the physical sciences. There
very quiet, and the ocean is a very noisy Our Stanford MOLGEN project was had not been any computer scientists
place. I tried the same hypothesis for- the first project in which computer before me.
mation framework that had worked science methods were applied to what I did two big things. One was con-
for DENDRAL, and it didn’t even come is now called computational molecu- sciousness-raising in the Air Force about
close to working on this problem. lar biology. Some MOLGEN software software. The one big report I wrote, at
Fortunately Carnegie Mellon turned out to have a very broad ap- the end of my term, was a report called,
people—Reddy, Erman, Lesser and plicability and so was the basis of the It’s a Software-First World. The Air Force
Hayes-Roth—had invented another very first company in computational had not realized that. They probably
framework they were using for un- molecular biology, called Intellige- still do not think that. They think it is an
derstanding speech, the Blackboard netics, later Intellicorp. They had lots airframe-based world.
Framework. It did not work well for of very sophisticated applications. The other was on software devel-
them, but I picked it up and adapted During the dot-com bust they went opment. The military up to that point
it for our project. It worked beauti- bust, but they lasted, roughly speak- believed in, and could only imagine,
fully. It used a great deal of knowledge ing, 20 years. a structured-programming top-down
at different “levels of abstraction.” It world. You set up requirements, you
allowed flexible combination of top- In the 1980s you studied the Japanese get a contractor to break down the re-
down and bottom-up reasoning from government’s major effort in AI. quirements into blocks, another con-
data to be merged at those different The Japanese plan was very ambitious. tractor breaks them down into mini-
levels. In Defense Department tests, They organized a project to essentially blocks, and down at the bottom there
the program did better than people. do knowledge-based AI, but in a style are some people writing the code. It
But that research was classified different from the style we were accus- takes years to do. When it all comes
as “secret.” How could ideas be pub- tomed to in this country. For one thing, back up to the top, (a) it’s not right,

44 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
viewpoints

and (b) it’s not what you want anymore. really good at something. Stick with it. chemistry. Or read books on physics
They just didn’t know how to contract Focus. Persistence, not just on prob- and learn physics. Or biology. Or what-
for cyclical development. Well, I think lems but on a whole research track, is ever. We just don’t do that today. Our AI
we were able to help them figure out really worth it. Switching in the middle, programs are handcrafted and knowl-
how to do that. flitting around from problem to prob- edge engineered. We will be forever do-
lem, isn’t such a great idea. ing that unless we can find out how to
What happened after your “tour ˲˲ Life includes of a lot of stuff you build programs that read text, under-
of duty” in Washington? have to do that isn’t all that much fun, stand text, and learn from text.
It was a rather unsettling experience to but you just have to do it. Reading from text in general is a
come back to Stanford. After playing a ˲˲ You have to have a global vision hard problem, because it involves all of
role on a big stage, all of a sudden you of where you’re going and what you’re common sense knowledge. But read-
come back and your colleagues ask, doing, so that life doesn’t appear to be ing from text in structured domans
“What are you going to teach next year? just Brownian motion where you are I don’t think is as hard. It is a critical
Intro to AI?” being bumped around from one little problem that needs to be solved.
So at the beginning of 2000, I re- thing to another thing.
tired. Since then I have been leading a Why is AI important?
wonderful life doing whatever I please. How far have we come in your quest There are certain major mysteries that
Now that I have a lot more time than to have computers think inductively? are magnificent open questions of the
I had before, I’m getting geekier and Our group, the Heuristic Programming greatest import. Some of the things
geekier. It feels like I’m 10 years old Project, did path-breaking work in the computer scientists study are not. If
again, getting back involved with de- large, unexplored wilderness of all the you’re studying the structure of data-
tails of computing. great scientific theories we could pos- bases—well, sorry to say, that’s not one
The great thing about being retired sibly have. But most of that beautiful of the big magnificent questions.
is not that you work less hard, but that wilderness today remains largely un- I’m talking about mysteries like
what you do is inner-directed. The explored. Am I am happy with where the initiation and development of life.
world has so many things you want to we have gotten in induction research? Equally mysterious is the emergence
know before you’re out of here that you Absolutely not, although I am proud of of intelligence. Stephen Hawking once
have a lot to do. the few key steps we took that people asked, “Why does the universe even
will remember. bother to exist?” You can ask the same
Why is history important? question about intelligence. Why does
When I was younger, I was too busy for Is general pattern recognition intelligence even bother to exist?
history and not cognizant of the impor- the answer? We should keep our “eye on the
tance of it. As I got older and began to I don’t believe there is a general pattern prize.” Actually, two related prizes.
see my own career unfolding, I began recognition problem. I believe that pat- One is that when we finish our job,
to realize the impact of the ideas of tern recognition, like most of human whether it is 100 years from now or
others on my ideas. I became more and reasoning, is domain specific. Cogni- 200 years from now, we will have in-
more of a history buff. tive acts are surrounded by knowledge vented the ultra-intelligent computer.
That convinced me to get very seri- of the domain, and that includes acts The other is that we will have a very
ous about archives, including my own. of inductive behavior. So I don’t really complete model of how the human
If you’re interested in discoveries and put much hope in “general anything” mind works. I don’t mean the human
the history of ideas, and how to manu- for AI. In that sense I have been very brain, I mean the mind: the symbolic
facture ideas by computer, you’ve got much aligned with Marvin Minsky’s processing system.
to treat this historical material as fun- view of a “society of mind.” I’m very In my view the science that we call
damental data. How did people think? much oriented toward a knowledge- AI, maybe better called computational
What alternatives were being consid- based model of mind. intelligence, is the manifest destiny of
ered? Why was the movement from computer science.
one idea to another preposterous at How should we give For the people who will be out there
one time and then accepted? computers knowledge? years from now, the question will be:
I think the only way is the way human will we have fully explicated the theory
You are a big fan of using heuristics culture has gotten there. We transmit of thinking in your lifetime? It would
not only for AI, but also for life. What our knowledge via cultural artifacts be very interesting to see what you peo-
are some of your life heuristics? called texts. It used to be manuscripts, ple of a hundred years from now know
˲˲ Pay a lot of attention to empirical then it was printed text, now it’s elec- about all of this.
data, because in empirical data one can tronic text. We put our young people
discover regularities about the world. through a lot of reading to absorb the It will indeed. Stay tuned.
˲˲ Meet a wonderful collaborator— knowledge of our culture. You don’t
for me it was Joshua Lederberg—and go out and experience chemistry, you
Len Shustek (shustek@computerhistory.org) is the
work with that collaborator on mean- study chemistry. chairman of the Computer History Museum.
ingful problems We need to have a way for computers
˲˲ It takes a while to become really, to read books on chemistry and learn Copyright held by author.

j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t he acm 45
practice
doi:10.1145/1743546.1743565
about it, to say the least. This concept,
Article development led by
queue.acm.org
which has generated the most industry
hype in years—and which has execu-
tives clamoring for availability because
Elastic computing has great potential, of promises of substantial IT cost sav-
but many security challenges remain. ings and innovation possibilities—has
finally won me over.
by Dustin Owens So, what did I discover about cloud
computing that has made a convert

Securing
out of someone who was so adamantly
convinced that it was nothing more
than the latest industry topic du jour?
First let me explain that it was no

Elasticity in
small feat. It took a lot of work to sort
through the amazing amount of confu-
sion concerning the definition of cloud
computing, let alone find a nugget of
real potential. Definitions abound, and

the Cloud
with my curmudgeon hat still solidly
in place I was beginning to see a lot of
hair-splitting and “me too” definitions
that just seemed to exacerbate the
problem. I finally settled on the defini-
tion provided by the National Institute
of Standards and Technology (NIST)
because of the simplicity the frame-
work provides (see the accompanying
sidebar). Still, it wasn’t until a good
As som ew hat o f a technology-hype curmudgeon, friend who had already discovered the
I was until very recently in the camp that believed true potential hidden in all this mad-
ness provided me with some real-world
cloud computing was not much more than the use cases for elasticity that the light be-
latest marketing-driven hysteria for an idea that has gan shining very brightly.
Elasticity, in my very humble opin-
been around for years. Outsourced IT infrastructure ion, is the true golden nugget of cloud
services, aka Infrastructure as a Service (IaaS), has computing and what makes the entire
been around since at least the 1980s, delivered by concept extraordinarily evolutionary, if
not revolutionary. NIST’s definition of
the telecommunication companies and major IT elasticity (http://csrc.nist.gov/groups/
outsourcers. Hosted applications, aka Platform as a SNS/cloud-computing/) is as follows:
“Capabilities can be rapidly and elasti-
Service (PaaS) and Software as a Service (SaaS), were in cally provisioned, in some cases auto-
vogue in the 1990s in the form of application service matically, to quickly scale out and rap-
providers (ASPs). idly released to quickly scale in. To the
consumer, the capabilities available
Looking at cloud computing through this for provisioning often appear to be un-
perspective had me predicting how many more limited and can be purchased in any
quantity at any time.” When elasticity
months it would be before the industry came up with is combined with on-demand self-ser-
Photograph by Kate Kerr

another “exciting” technology with which to generate vice capabilities it could truly become
mass confusion and buzz. However, I have recently a game-changing force for IT.
Advanced outsourced IT infrastruc-
been enlightened as to the true potential of cloud ture and software services, once avail-
computing and have become very excited able only to organizations with large

46 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
credit t k

j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t he acm 47
practice

budgets available to develop, build, service to form the full extent of cloud
and support ongoing use of these re- computing—those elements that in
sources, can now be provided to small my opinion make a particular service
to medium organizations. In addition, something more than a just an out-
these resources can be added, changed,
or removed much more rapidly, po- Elasticity, in my sourced service with a prettier market-
ing face.
tentially allowing for exponential ad- very humble
Elasticity Security Challenges
vances in operational efficiency. These
sorts of changes to major IT services opinion, is the true Enabling elasticity in the cloud strong-
environments that previously (and for
the most part currently) took months if
golden nugget of ly implies the use of virtualization.
Though the inherent security chal-
not years to plan and execute might be cloud computing lenges in virtualization are certainly
done in a matter of minutes or hours
if elasticity holds up to its promise. In
and what makes not new, how it is likely to be used by
cloud-computing providers to achieve
other words, elasticity could bring to the entire concept elastic IT environments on a grand
the IT infrastructure what Henry Ford
brought to the automotive industry
extraordinarily scale poses some interesting security
challenges worth exploring in more
with assembly lines and mass produc- evolutionary, detail. In addition, as virtualization
tion: affordability and substantial im-
provements on time to market. if not revolutionary. technology continues to evolve and
gain popularity, so does the discov-
Enlightening as this realization has
been, it has also become clear that sev-
Elasticity could ery of new vulnerabilities; witness
the recently announced vulnerabil-
eral monumental security challenges bring to the IT ity (http://web.nvd.nist.gov/view/vuln/
(not to mention many monumental
nonsecurity-related challenges, not
infrastructure what detail?vulnId=CVE-2009-3733) where-
by one is able to traverse from one vir-
least of which are full functionality Henry Ford brought tual machine (VM) client environment
availability and how well an organi-
zation’s environment is prepared to
to the automotive to other client environments being
managed by the same hypervisor.
operate in a distributed model) now industry with These new vulnerabilities could
come into play and will need to be ad-
dressed in order for the elasticity ele- assembly lines and have significantly greater impacts
in the cloud-computing arena than
ment of cloud computing to reach its mass production: within an organization’s corporate en-

affordability
full potential. Most of the dialogue vironment, especially if not dealt with
I am engaged in with customers to- expeditiously. Case in point: imagine
day and that I see in publicized form,
however, is simplistically centered
and substantial that many customers are being man-
aged by a single hypervisor within
on security challenges with IT out- improvements on a cloud provider. The vulnerability
sourcing in general. These are chal-
lenges have existed for some time in
time to market. shared above might allow a customer
to access the virtual instances of oth-
the predecessor models mentioned er customers’ applications if not ad-
earlier: who within an outsourcer is dressed. Consider the impact if your
able to access a customer’s data, pe- bank or particularly sensitive federal
rimeter security considerations when government or national defense infor-
outsourcing, DOS/DDOS (denial of mation happen to be managed in this
service/distributed denial of service), sort of environment, and the cloud
resource starvation, and compliance provider does not immediately deal
challenges with where data is stored with, or even know about, a vulner-
or backed up. These are all challenges ability of this nature.
that I have provided counsel on for With this bit of background, it is
many years and are nothing new or in- clear that providing adequate admin-
surmountable. Don’t misunderstand istrative separation between virtual
me. These challenges are indeed very customer environments will be a sig-
real and still need to be addressed, but nificant security challenge with elas-
I strongly believe most should be fairly ticity. Cloud providers will need to be
well known by now and can be read- prepared to account for and show how
ily met through existing procedural or their particular services are able to
technological solutions. control vulnerabilities such as the ear-
The challenges I am more con- lier example and keep similar yet-to-be
cerned about are those introduced by discovered vulnerabilities from having
adding elasticity and on-demand self- devastating impacts on their custom-

48 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
practice

ers. Perhaps more importantly, critical across the entirety of a virtual customer change management. In this scenario,
infrastructure (see http://en.wikipedia. environment. The service models to the application could be extremely
org/wiki/Critical_infrastructure for which this might apply most directly vulnerable to attack or even inadver-
definition) could be subject to insur- are those that provide IaaS and PaaS tently cause a production application
mountable risk and/or loss of sensi- functionality such as dynamic mul- to cease operating properly. The ability
tive information if providers lack the tilevel security services or multitier to implement and enforce access con-
necessary controls. As services offered from the cloud continue to mature and expand, the threat posed is not limited to unauthorized information access but may include any cloud-provided computing systems (such as virtual servers, virtual desktops, and so on). We hope the U.S. government recognizes and addresses this challenge as federal agencies move rapidly toward adoption of cloud-based services (http://www.federalnewsradio.com/?sid=1836091&nid=35), because the potential consequences are particularly unsettling.

Addressing this challenge may be no small feat. For one, in order for cloud providers to minimize their management costs and obtain profitability, they are expected to have to use shared administrative management systems (that is, hypervisors) across multiple virtual customer environments. I can envision certain service models where this theory may not hold true: for example, if each customer were given sole hypervisor (or hypervisor-like) management access that connected only to that customer's virtual environment, such as within a virtual private cloud offering. Use of a separate management system for every customer in every service model is probably not realistic simply because of cost containment.

In researching several cloud providers' capabilities in this regard, I could not clearly see how their solutions could effectively address the entirety of the provided traversal vulnerability example when multiple customers are using the same hypervisor, at least at the time of writing this article. Although some provide detail of built-in software functionality within their hypervisors meant to curtail one customer from gaining access to another's environment, I suspect these capabilities would not fully address the vulnerability in question and are certainly worthy of further detailed review.

Another interesting challenge with elasticity in the cloud will be in the ability to provide fine-grained access and predefined security controls within application environments. To understand the challenge better, it is probably useful to provide some context for how these types of services are built and administered in today's corporate infrastructure, such as with a multitier application. One example of a typical scenario is where the application development group needs to work closely with the network and hopefully IT security groups to establish proper communication paths among the various tiers, including limiting which network protocols are allowed to interface with each of the tiers. This would be done to ensure proper routing of information and to limit the attack surface available to hackers or malware once the system is put into production.

In addition, when dealing with certain types of data such as financial or credit cards, certain regulations and industry standards have a requirement for separation of duties to aid in protection from certain scenarios—for example, an application developer inserting code into software that would allow skimming of financial data and not having an audit trail available as the developer elected not to enable one for obvious reasons. Although various cloud providers do provide some detail on how their solutions address this concern, proper implementation by the user organization, as well as performing due diligence review of actual capabilities within a desired delivery model, will be critical to ensuring this challenge can be adequately addressed.

Fast forward to the cloud scenario in which a developer now has access to a self-service portal where in a few mouse clicks he or she would be able to build out a new multitier virtual application environment. Without fine-grained access controls available through the self-service portal it will be extremely difficult to enforce separation of duties to keep this developer from accessing sensitive data he or she shouldn't have access to, or promoting new code to production without having gone through proper security review or testing. Extending access controls to a granular level, defining who has the authority to perform which actions within these environments, will be absolutely necessary.

Having the ability to predefine security control templates may also aid in this sort of scenario. This means the organization's IT security group is able to define a set of controls that must be applied to a given application depending on the type of data it will be processing or how the application will be used. For example, as the developer builds out the new virtual environment that processes credit-card information, the self-service portal might identify the type of data to be processed and apply predefined security controls to the database, application, and Web front end, as well as predefined firewall rule sets limiting network access to the various tiers. It is unlikely that this capability exists today, anywhere, and we are probably years away from ubiquitous availability.
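To make the idea concrete, here is a minimal sketch in C of what such a predefined control template might look like as data. Every name and field here is illustrative only, not any vendor's actual interface:

    /* Hypothetical sketch of a security-control template a self-service
       portal could apply; all names are illustrative, not a real API. */
    enum data_class { DATA_PUBLIC, DATA_INTERNAL, DATA_CARDHOLDER };

    struct firewall_rule {
        const char *from_tier;          /* e.g., "web"          */
        const char *to_tier;            /* e.g., "app"          */
        int         port;               /* the only port opened */
    };

    struct control_template {
        enum data_class applies_to;     /* data the app will process  */
        int encrypt_at_rest;            /* 1 = require encryption     */
        int audit_trail;                /* 1 = force audit logging on */
        struct firewall_rule rules[4];  /* allowed tier-to-tier paths */
    };

    /* Controls the portal would attach to a credit-card-processing app:
       each tier may talk only to the next, on a single port. */
    static const struct control_template cardholder_template = {
        DATA_CARDHOLDER, 1, 1,
        { { "web", "app", 8080 }, { "app", "db", 5432 } }
    };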
Another security challenge that develops out of this scenario and in the same vein is how to enforce proper configuration and change management in this more dynamic and elastic model. Even where a portal is capable of granular-access controls that control which actions a given user is able to perform, it also needs to enforce when and under what circumstances a user is allowed to perform certain actions. Without this ability, untested code or system changes could result in business-impacting (or even devastating) results. Even something as "slight" as rolling a new system into production without ensuring that proper server and application patches have been applied could result in significant damage to an organization. Therefore, a mechanism within self-service portals for enforcing an organization's change policies becomes a worthy and necessary capability.

These are but a few of the challenges that come to mind within a truly elastic PaaS and/or IaaS service model, without even delving into the separate challenges of SaaS.


The NIST Definition of Cloud Computing

By Peter Mell and Tim Grance

Cloud computing is still an evolving paradigm. Its definitions, use cases, underlying technologies, issues, risks, and benefits will be refined in a spirited debate by the public and private sectors. These definitions, attributes, and characteristics will evolve and change over time. The cloud-computing industry represents a large ecosystem of many models, vendors, and market niches. The following definition attempts to encompass all of the various cloud approaches.

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (for example, networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service-provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.

Essential Characteristics
On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service's provider.
Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (for example, mobile phones, laptops, and PDAs).
Resource pooling. The provider's computing resources are pooled to serve multiple consumers using a multitenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (for example, country, state, or data center). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
Measured service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (for example, storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models
Cloud SaaS (Software as a Service). The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a Web browser (for example, Web-based email). The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Cloud PaaS (Platform as a Service). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application-hosting environment configurations.
Cloud IaaS (Infrastructure as a Service). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (for example, host firewalls).

Deployment Models
Private cloud. The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on or off premise.
Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (for example, mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on or off premise.
Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (for example, cloud bursting for load balancing between clouds).

Note: Cloud software takes full advantage of the cloud paradigm by being service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.

Peter Mell and Tim Grance are with the National Institute of Standards and Technology, Information Technology Laboratory, Gaithersburg, MD.

Other challenges include the ability to provide audit trails across these environments for regulatory compliance and digital forensic purposes; enforcement of, and awareness of, differing zones among development, test, and production environments to protect the integrity of services deployed in the higher-level environments; as well as controlling who is authorized to expand or contract a service within one of these environments. This last challenge could pose particular financial issues in the elastic "pay by the drink" service model if, for example, users are able to add services at will and an organization gets a bill at the end of the month for excessive service additions.

Changing tack slightly, however, it is worth mentioning the challenges in providing adequate levels of security services within nonsecurity-related environments. One of these challenges is with traditionally nonsecurity-minded providers needing to supply service options for common security capabilities such as intrusion detection, firewalls, content filtering, and vulnerability testing. In predecessor service models, such as an ASP, these services could be offered through partnerships with security vendors and manually designed and provisioned into the outsourced environment. In the new model, however, it will be interesting to see how providers are able to integrate these services more tightly without losing full elasticity. It may require creating optional service hooks from a provider's self-service portal to security service products or perhaps developing interesting but complex multiservice cloud models provided by multiple specialty service providers. Either way, this challenge is probably worthy of a discussion in and of itself because of the perceived number of additional issues it brings to mind. Note that some vendors do offer these capabilities today, particularly within virtual private cloud models, but of the vendors researched, none fully addresses them for every model it offers.

Though the inherent security challenges in virtualization are not new, how it is likely to be used by cloud-computing providers to achieve elastic IT environments on a grand scale poses some interesting security challenges.

Encryption capabilities for data-at-rest may be an interesting challenge as well. For example, given the previous environment traversal example, use of file-based encryption within a virtual environment would be essentially worthless in offering protection from remote access. If one can readily gain access to another's environment, this would also provide access to any front-end encryption mechanism used for file-based encryption within the virtual environment. Disk-based encryption becomes particularly challenging because of the nature of virtual storage and potential lack of user organizational control over where data may be physically stored (which disk does one encrypt for a given customer, and other constraints in sharing of physical disks among multiple customers). It will certainly be necessary to explore a prospective provider's capabilities for encrypting data-at-rest and how well they address the shared concerns, especially for those organizations with regulatory requirements dictating the use of file- and/or disk-based encryption.

It should be apparent by now that cloud computing is fraught with a number of security challenges. While some of the concepts and scenarios discussed here are focused on more advanced service models, the intent is to create a bit more awareness of what the industry will be faced with in moving toward these new models that offer greater levels of "true" cloud computing. Depending on the type of service model being discussed and various use cases, exploring all of the challenges is all but impossible, especially not in a single discussion. In addition, some of the security challenges discussed appear to be recognized by certain cloud providers but are primarily being addressed through the use of private cloud models (Amazon and OpSource are two such vendors offering answers within a virtual private cloud offering), suggesting perhaps higher costs versus a public cloud offering and/or limited availability in addressing within other cloud-delivery models.

The promise of what an elastic cloud-computing model could do for the IT world, however, is extremely invigorating and certainly worth pursuing. It can only be hoped that organizations already taking this path or seriously considering doing so will take the time to fully appreciate the security challenges facing them and whether or not adoption at this point fits into their risk appetite. Certainly, keeping these and other security challenges in mind while assessing how a prospective cloud provider can address these concerns (and at what cost and with what deployment constraints) should be a critical business objective.

Related articles on queue.acm.org
Cybercrime 2.0: When the Cloud Turns Dark
Niels Provos, Moheeb Abu Rajab, Panayiotis Mavrommatis
http://queue.acm.org/detail.cfm?id=1517412
Meet the Virts
Tom Killalea
http://queue.acm.org/detail.cfm?id=1348589
CTO Roundtable: Cloud Computing
Mache Creeger
http://queue.acm.org/detail.cfm?id=1536633

Dustin Owens (dustin.owens@bt.com) is a senior principal consultant with BT Americas' Business Innovation Group. He provides consulting services centered on operational risk and security management for multinational customers, specializing in applying these concepts to various areas of strategic business innovation. He has more than 14 years of practical experience in addressing information security within distributed computing environments.

© 2010 ACM 0001-0782/10/0600 $10.00

doi:10.1145/1743546.1743566
Article development led by queue.acm.org

Emulating a video system shows how even a simple interface can be more complex—and capable—than it appears.

by George Phillips

Simplicity Betrayed

An emulator is a program that runs programs built for different computer architectures from the host platform that supports the emulator. Approaches differ, but most emulators simulate the original hardware in some way. At a minimum the emulator interprets the original CPU instructions and provides simulated hardware-level devices for input and output. For example, keyboard input is taken from the host platform and translated into the original hardware format, resulting in the emulated program "seeing" the same sequence of keystrokes. Conversely, the emulator will translate the original hardware screen format into an equivalent form on the host machine.

The emulator is similar to a program that implements the Java Virtual Machine (JVM). The difference is merely one of degree. JVM is designed to enable efficient and tractable implementations, whereas an emulator's machine is defined by real hardware that generally imposes undesirable constraints on the emulator. Most significantly, the original hardware may be fully described only in terms of how existing software uses it. JVM tends to be forward-looking with the expectation that new code will be written and run under JVM to increase its portability. Emulators tend to be backward-looking, expecting only to make old code more portable.

In either case the emulator and JVM give the programs that run under them a suite of services needed to interact with the host system. JVM presents those services as a series of API calls; the emulator presents them as simulated hardware. Nonetheless, the simulated hardware is an API—just not in the form most programmers expect.

TRS-80 Example
As an example hardware API, consider the TRS-80 video system, which displays text and graphics on a modified television set. It has 16 lines of characters with 64 columns each. It supports the normal ASCII character set, with an additional 64 graphics characters allowing every possible combination of a 2-pixel by 3-pixel sub-block. Judicious use of the graphics characters provides an effective 128-pixel by 48-pixel resolution, albeit with pixels the size of watermelon seeds. A TRS-80 program displays a character by writing the character value to the memory location associated with the desired position. In effect, the API has only one call:

    void ChangeCharacter(location /* 0 - 1024 */, character)

Emulating such a simple graphics format is trivial. It can be rendered quite adequately on a 512-by-192 image allotting each character an 8-by-12 rectangle. Each graphics pixel is a 4-by-4 rectangle, while the characters themselves occupy the upper two-thirds, or an 8-by-8 area. While you could get away with any old font for the characters, a little more work will get something that looks dot-for-dot identical to the original. (To get the same aspect ratio as the original, the image should be doubled in height. We'll keep the numbers as is to simplify exposition.)
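As a rough illustration, here is a full-screen redraw of that one-call API in C. The font table and the bit layout of the graphics sub-blocks are assumptions made for the sake of the example, not a dump of the real hardware:

    /* Sketch of a naive full-screen redraw. screen[] holds the 1,024
       character cells; image[][] is the host's 512-by-192 framebuffer.
       Graphics characters (high bit set) encode a 2x3 block in their low
       six bits, assumed here to run from bit 0 = top-left to bit 5 =
       bottom-right. */
    static unsigned char screen[1024];     /* 64 x 16 character cells */
    static unsigned char image[192][512];  /* host-side framebuffer   */
    static unsigned char font[128][12];    /* hypothetical 8x12 font  */

    void ChangeCharacter(int location /* 0 - 1023 */, unsigned char ch)
    {
        screen[location] = ch;
    }

    void draw_screen(void)
    {
        for (int loc = 0; loc < 1024; loc++) {
            int cx = (loc % 64) * 8, cy = (loc / 64) * 12;
            unsigned char ch = screen[loc];
            for (int row = 0; row < 12; row++)
                for (int col = 0; col < 8; col++) {
                    int on;
                    if (ch & 0x80)  /* graphics: pick the 4x4 sub-block */
                        on = (ch >> ((row / 4) * 2 + col / 4)) & 1;
                    else            /* text: look up the font row       */
                        on = (font[ch][row] >> (7 - col)) & 1;
                    image[cy + row][cx + col] = on ? 255 : 0;
                }
        }
    }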
Figure 1 shows how an emulator converts the hardware-level character values into an actual image representation. While the figure serves as a blueprint for the emulator, it also shows what a TRS-80 program must do in order to display something. One detail is missing: changes to screen memory are not displayed instantaneously. The hardware redraws the display every 1/60th of a second. This means the emulator must assemble the static image as described and display it every 1/60th of a second.

Figure 1. Translating TRS-80 screen memory into a displayed image.

The resulting emulation will look very much like the original. The majority of programs that run under the emulator will appear exactly the same as when run on original hardware, but if you pay very close attention you will see some differences. Watch as the screen is cleared from all black to all white. In both the original and the emulation you get a tear, because filling the screen takes long enough that it goes over more than one video frame. Just before the fill starts, the frame will be black. The fill is partially done on the next frame, and the display shows white at the top and black at the bottom. That happens for just an instant; by the next frame the filling is done and the frame is filled with white.

Even though this is at the edge of perception, the tear exhibited by the two is quite different. You will need a video camera to see it in the original, but the emulator can either dump frames for offline analysis or be told to step one frame at a time. Figure 2 shows the difference between the tear on the original hardware and that seen in the emulator.

Figure 2. Difference in tears between the emulation and the original hardware. (Left: approximate tear; right: correct tear.)

Although apparently minor, this difference is puzzling. The emulator implemented the specification exactly as written, and no bugs were found in the code. Moreover, the specification was obviously correct and complete. Except for the evidence at hand, the situation is impossible. Of course, the problem lies in the specification, which only appears complete. The assumption that a particular character is either there or not is incorrect. The hardware does not draw a character at a time; it draws one line of a character at a time. That character has the opportunity to change after the drawing has already started. The tearing on the original results from a character being blank on the initial passes and subsequently filled in. Put another way, the character is not atomic but made up of 12 pieces stacked on top of one another; each piece is 8-by-1. Incidentally, those 8-by-1 pieces are atomic—they are displayed entirely or not at all. The graphics hardware ends up reading each displayed character 12 times.

Refining the emulator to correct this difference is straightforward. Instead of waiting an entire 1/60th of a second before drawing the screen, it will be drawn a line at a time. With 192 lines the emulation loop looks something like this:

    for (i = 0; i < 192; i++) {
        emulate CPU for 86.8 microseconds
        draw line i of the video display
    }
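The "draw line i" step might look something like the following C sketch, reusing the assumed screen[], image[], and font[] from before. The essential change is that each scan line re-reads screen memory, so a cell can change between its 12 row fetches:

    /* Draw a single scan line, the way the hardware does. */
    void draw_line(int i)
    {
        int row = i % 12;              /* row within the character cell */
        int base = (i / 12) * 64;      /* first cell on this text line  */
        for (int col = 0; col < 64; col++) {
            unsigned char ch = screen[base + col];  /* fetched afresh */
            unsigned char bits;
            if (ch & 0x80)             /* graphics: left/right half bits */
                bits = (unsigned char)(
                    (((ch >> ((row / 4) * 2)) & 1) ? 0xF0 : 0) |
                    (((ch >> ((row / 4) * 2 + 1)) & 1) ? 0x0F : 0));
            else
                bits = font[ch][row];
            for (int x = 0; x < 8; x++)
                image[i][col * 8 + x] = ((bits >> (7 - x)) & 1) ? 255 : 0;
        }
    }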

Now the tearing on the emulator is the same as the hardware. You may be tempted to declare the specification and the emulator complete

because of the major increase in output fidelity. As a conscientious developer, however, your reaction must be exactly the opposite. A rather small test case required a considerable change in the program. Now is the time to investigate further and look for additional bugs. In all likelihood the specification needs more refinement. At the very least, a better test case for the new functionality is needed. After a bit of thought it becomes clear that displaying a one-line-high pixel (1-by-4) would make such a test case.

This can be done in three simple steps:
1. Write an ordinary 4-by-4 pixel on screen.
2. Wait until the first line has been drawn by the graphics hardware.
3. Quickly erase the pixel.

All that will be visible on screen is the 1-by-4 part of the pixel that was drawn before you pulled the rug out from under the 4-by-4 pixel. Many pixels can be combined to create something seemingly impossible on a stock TRS-80: a high-resolution diagonal line.

The only thing missing is some way of knowing which line the hardware is drawing at any one point. Fortunately, the graphics hardware generates an interrupt when it draws a frame. When that interrupt happens you know exactly where the graphics hardware is. A few difficulties of construction remain, but they come down to trivial matters such as putting in delays between the memory accesses to ensure you turn pixels on and off in step with each line being drawn.
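In C-flavored pseudocode the demo's inner loop might be structured as follows. The interrupt flag is hypothetical glue, and 0x81 and 0x80 are assumed here to be the graphics characters for "top-left block lit" and "all blocks off"; the real program is Z-80 assembly, and as becomes clear shortly, getting the delay exactly right is the hard part:

    static volatile int frame_done;  /* set by the frame-interrupt handler */

    void thin_pixel_demo(void)
    {
        for (;;) {
            while (!frame_done)      /* sync: hardware is at a known line */
                ;
            frame_done = 0;
            ChangeCharacter(0, 0x81);  /* step 1: draw one 4x4 pixel      */
            /* step 2: busy-wait roughly one scan line's worth of time    */
            ChangeCharacter(0, 0x80);  /* step 3: erase it; only the 1x4  */
                                       /* sliver already drawn remains    */
        }
    }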
Here the emulator is a boon. Making such a carefully timed procedure work on real hardware is very difficult. Any mistake in timing will result in either no display because a pixel was erased too quickly or a blocky line caused by erasing pixels too slowly. Not only does the debugger not care about time, it eschews it entirely. Single-stepping through the code is useless. To be fair, the debugger cannot single-step the graphics hardware. Even if it did, the phosphor would fade from sight before you could see what was happening.

The emulator can single-step the processor and the graphics. It can show exactly what is being drawn and point out when screen memory writes happen at the incorrect times. In no time at all a demonstration program is written that shows a blocky line in a simple emulator and a diagonal line in a more accurate emulator (see Figure 3).

The Real Machine
The program is impressive as it must read/write to the display with microsecond-level timing. The real excitement is running the program on the original machine. After all, the output of the emulator on a PC is theoretically compelling but it is actually producing graphics that pale in comparison to anything else on the platform. On the real machine it will produce something never before seen.

Sadly, the program utterly fails to work on the real machine. Most of the time the display is blank. It occasionally flashes the ordinary block line for a frame, and very rarely one of the small pixels shows up as if by fluke.

Once again, the accurate emulation is not so accurate. The original tearing effect proves that the fundamental approach is valid. What must be wrong is the timing itself. For those strong in software a number of experimental programs can tease out the discrepancies. Hardware types will go straight to the schematic diagrams that document the graphics hardware in detail. Either way, several characteristics will become evident:
- Each line takes 64 microseconds, not 86.8.
- There are 264 lines per frame; 192 visible and 72 hidden.
- A frame is 16,896 microseconds or 59.185 frames per second, not 60.

What's most remarkable is how the emulator appeared to be very accurate in simulating a tear when it was, in fact, quite wrong. So much has been written about the brittleness of computer systems that it is easy to forget how flexible and forgiving they can be at times. The numbers bring some relief to the emulator code itself. What appeared to be floating-point values for timing are in fact just multiples of the system clock. Simple, integer relationships exist between the speed of the CPU and the graphics hardware.

A Z-80 Cycle Waster


The Z-80 code is in the comments alongside equivalent C code. The C program is self-contained and
runs an exhaustive test verifying that waitHL() always uses H * 256 + L + 100 cycles. Observe that the JR
conditional branch instructions take extra time if the branch is taken. Those time differences along with
looping are used to expand the subroutine’s running time in proportion to the requested number of cycles.

    // Z-80 registers are akin to C global variables.
    unsigned char A, H, L;
    unsigned char C, Z;

    int cycles;

    void t(int n)
    {
        cycles += n;
    }

    void waitA();

    /* Cycle cost   C code                   Z-80 code */

    void wait256()
    {
    wHL256:
        t(4);  H--; Z = H != 0;              // DEC H
        t(7);  A = 127;                      // LD A,127
        t(17); waitA();                      // CALL wA
    }

    void waitHL()
    {
    wHL:
        t(4);  H++; Z = H == 0;              // INC H
        t(4);  H--; Z = H == 0;              // DEC H
        t(7);  if (!Z) {                     // JR NZ,wHL256
        t(5);      wait256();
                   goto wHL;
        }
        t(4);  A = L;                        // LD A,L
        t(0);  waitA();
    }

    void waitA()
    {
    wA:
        t(4);  C = A & 1;                    // RRCA
               A = (A << 7) | (A >> 1);
        t(7);  if (C) {                      // JR C,have1
        t(5);      goto have1;
        }
        t(4);                                // NOP
    have1:
        t(4);  C = A & 1;                    // RRCA
               A = (A << 7) | (A >> 1);
        t(7);  if (!C) {                     // JR NC,no2
        t(5);      goto no2;
        }
        t(7);  if (!C) {                     // JR NC,no2
        t(5);      goto no2;
        }
    no2:
        t(4);  C = A & 1;                    // RRCA
               A = (A << 7) | (A >> 1);
        t(7);  if (!C) {                     // JR NC,no4
        t(5);      goto no4;
        }
        t(5);  if (!C) {                     // RET NC
        t(6);      return;
        }
        t(4);                                // NOP
    no4:
        t(4);  C = A & 1;                    // RRCA
               A = (A << 7) | (A >> 1);
        t(7);  if (!C) {                     // JR NC,no8
        t(5);      goto no8;
        }
        t(13); /* *0 = A; */                 // LD (0),A
    no8:
        t(7);  A &= 15;                      // AND A,15
               Z = A == 0;
        t(5);  if (Z) {                      // RET Z
        t(6);      return;
        }
    wait16:
        t(4);  A--; Z = A == 0;              // DEC A
        t(7);  if (!Z) {                     // JR NZ,wait16
        t(5);      goto wait16;
        }
        t(5);  if (Z) {                      // RET Z
        t(6);      return;
        }
    }

    //-----------------------

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        for (int hl = 0; hl < 65536; hl++)
        {
            H = hl / 256;
            L = hl & 255;
            cycles = 0;
            waitHL();

            if (cycles != hl + 100)
            {
                printf("Blew it on %d (got %d instead of %d)\n",
                       hl, cycles, hl + 100);
            }
        }

        return 0;
    }
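As a usage sketch, with illustrative names: since waitHL() always uses H * 256 + L + 100 cycles, a caller wanting to burn exactly W cycles would load HL with W - 100 first, remembering, as with the 10-cycle jump nit in the main text, that the setup and call themselves cost cycles that a real Z-80 caller must also take out of the budget:

    /* Burn roughly W cycles of padding (W >= 100); the cost of this
       setup itself is ignored here but matters on the real machine. */
    void burn(int W)
    {
        H = (unsigned char)((W - 100) >> 8);
        L = (unsigned char)((W - 100) & 255);
        waitHL();
    }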


We can restate timings from the CPU's perspective:
- Each line takes 128 cycles.
- The hidden 72 lines go for 9,216 cycles.
- Each frame is 33,792 cycles (264 * 128).

The number of frames per second is still a floating-point number, but the emulator core can return to integers as you might expect for a digital system.

With the new timings in place, the emulator exhibits the same problems as the real hardware. With a bit of (tedious) fiddling with the timing, the program almost works on the real hardware.

There's just one problem left. Remember that interrupt that gave a synchronization point between the CPU and the graphics hardware? Turns out it only happens every second frame. The program works but flashes between a perfect diagonal line and a chunky one. There's no hardware facility to help out here, but there is an obvious, if distasteful, software solution.

Once the diagonal line has been drawn, you know exactly when it must be drawn again: 33,792 cycles from when you started drawing it the first time. If it takes T cycles to draw the line, then you just write a cycle-wasting loop that runs for 33,792-T cycles and jump back to the line-drawing routine. Since that jump takes 10 cycles, however, you better make that 33,792-T-10 cycles. This seems like a fine nit to pick, but even being a single cycle off in the count will lose synchronization. In two seconds the sync is off by almost an entire line. Losing sync has an effect similar to the vertical roll that afflicted old televisions.

An ad hoc solution will work just fine. The proof-of-concept demonstration program will be complete. The possibilities for even more impressive graphics are clear. Hand-timing everything, however, is tedious, slow, and error prone. You're stuck with writing in assembly, but the timing effort takes you back to the days when code was hand-assembled. Much of the burden can be lifted by taking the instruction timing table from the emulator and putting it into the assembler. Assemblers have always been able to measure the size of their output, generally to fill in buffer sizes and the like. Here's that facility in use to define length as the number of bytes in a message, which will vary if the message is changed:

    message: ascii 'Hello, world.'
    length   byte  *-message

This works because the special "*" variable keeps track of the memory location into which data and code are assembled. To automate timing simply add a time() function that says how many cycles are used by the program up to that point. It can't account for loops and branches but will give accurate results for straight-line code. At a high level the diagonal slash demo will be:

    start:  ...some code to draw
            ...the diagonal line
    waste   equ 33792 - (time(*) - time(start)) - 10
            ...code to use up
            ..."waste" cycles
            goto start

Straightforward, but what about the code to waste the cycles? The assembler could be extended to supply that code automatically. Instead, keeping with the principle of minimal design, the task can be left to an ordinary subroutine. Writing a subroutine that runs for a given number of cycles is a different requirement from what you are accustomed to, but it is possible. (See the accompanying sidebar for one such cycle-wasting subroutine.)

Figure 3. A diagonal line in a simple emulator and a more accurate emulator. (Left: incorrect; right: correct.)

Figure 4. The letter "A" displayed with square pixels and on the original hardware.
As programmers we can see the potential of the diagonal-line demonstration program. Although it has only one pixel per line, there is a clear path to more complex and compelling images, to say nothing of animations and other effects. One final bump in the road awaits. Every time the CPU accesses screen memory, it denies access to the graphics hardware. This results in a blank line that is two- or three-characters wide. The more pixels you change on a per-line basis, the more blanked-out portions there will be. Once again you will find that although the graphics may look fine on the emulator, they will be riddled with "holes" on the real machine because of the blanking side effect.

Moreover, as you try to do more work per line, the exact positions of the blank spots will matter a great deal. Their exact positions will be a measure of emulator accuracy and can be used to maximize the graphics displayed per line. Several discoveries await and will be the result of a feedback loop of emulator refinement, test program development, and measurement of the original system leading to further emulator refinement, and so on. Along the way you will discover the following:
- The visible portion of a line takes 102.4 cycles; the rest of the time (25.6 cycles) is used for setting up drawing the next line.
- Blank spots do not cover the entire time an instruction takes but only the sub-portion of the instruction that accesses video memory.
- The emulator must be extended to report exactly when memory is accessed on a sub-instructional basis.
- Our method of synchronization is crude and can be depended upon to be accurate only to within a few characters.
- Finer synchronization can be accomplished, but the emulator must be upgraded so programs using the technique can still be tested.
- Video blanking can be put to good use sculpting graphics that cannot be constructed in other ways.

In other words, we're a long way from where we started. Instead of drawing an entire screen at once or even a line at a time, the emulator is down to drawing 1/12th of a character at a time and interweaving the CPU and the graphics hardware at the level of CPU cycles. The graphics emulation has become extremely accurate. Not only will side effects such as a tear be seen, but they will be exactly the same as they manifest on the original hardware. The results are not purely academic, either. Test programs demonstrate the fidelity of the emulator while still achieving the same output on the original hardware. The result is not tiny differences only of interest to experts but extremely visible differences in program behavior between precise and sloppy emulators.

Can there be anything else? Having tripped over so many emulator shortcomings, can the answer be anything but yes? In fact, there is a double-wide mode where the characters are doubled in size for a 32-by-16 display. Based on what we've seen up to this point, it's not surprising to learn that it brings in many more complications than might be expected. Even leaving that morass aside, there's one more obvious limitation of the emulator. The original display was a CRT. Each pixel on it looks entirely different from what is seen on a modern LCD flat panel. The pixels there are unrelentingly square, whereas the CRT produced soft-edged ovals of phosphorescence. Figure 4 compares two close-ups of the letter A.

Hard-edged pixels result in an image that is functionally identical to the original but has a completely different feel. The difference between the two is unmistakable. Observe also that the real pixels are neither distinct nor independent. Pixels in adjacent rows overlap. Pixels in adjacent columns not only overlap but also display differently if there is a single one versus several in a row. The first pixel in a row of lit pixels is larger. All these subtle differences combine to create a substantially different picture.

The problem itself is much simpler than the functional issues because there is no feedback to the rest of the implementation. There is no need to change the CPU timing or how the CPU interacts with the graphics system. It is merely a matter of drawing each dot as an alpha-blended patch rather than a hard-edged off/on setting of one or two pixels. What is troublesome is the increased effort required by the host CPU to pull this off. The work involved is many times greater than before. Only through the aid of a graphics coprocessor or moderately optimized rendering code can the screen be drawn in this fashion in real time. It is difficult to believe that drawing a 30-year-old computer's display takes up so much of a modern system. This is one reason why accurate emulation takes so long to perfect. We can decide to make a better display, but today's platforms may not have the horsepower to accomplish it.
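A sketch of the per-dot work in C, assuming a precomputed coverage patch slightly larger than the hard-edged pixel so neighboring dots overlap the way the CRT's phosphor ovals do; the patch contents and sizes are illustrative:

    /* Blend one soft "phosphor" dot into an 8-bit grayscale framebuffer.
       patch[][] holds 0..255 coverage values for a 6x6 area. */
    void blend_dot(unsigned char *fb, int stride, int x, int y,
                   const unsigned char patch[6][6])
    {
        for (int dy = 0; dy < 6; dy++)
            for (int dx = 0; dx < 6; dx++) {
                unsigned char *p = &fb[(y + dy) * stride + (x + dx)];
                int a = patch[dy][dx];
                /* saturating blend toward white, like light adding up */
                *p = (unsigned char)(*p + (255 - *p) * a / 255);
            }
    }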
the next line. Hard-edged pixels result in an im- tor improvement has led to discover-
˲˲ Blank spots do not cover the en- age that is functionally identical to the ies that could be exploited for good
tire time an instruction takes but only original but has a completely different use once the necessary support tools
the sub-portion of the instruction that feel. The difference between the two were built. Above all, however, beware
accesses video memory. is unmistakable. Observe also that of perfection. No system is perfect,
˲˲ The emulator must be extended the real pixels are neither distinct nor and the cost of pursuing perfection
to report exactly when memory is ac- independent. Pixels in adjacent rows can be much greater than mere time
cessed on a sub-instructional basis. overlap. Pixels in adjacent columns and money invested.
˲˲ Our method of synchronization is not only overlap but also display dif-
crude and can be depended upon to ferently if there is a single one versus
be accurate only to within a few char- several in a row. The first pixel in a row Related articles
acters. of lit pixels is larger. All these subtle on queue.acm.org
˲˲ Finer synchronization can be ac- differences combine to create a sub- No Source Code? No Problem!
complished, but the emulator must stantially different picture. Peter Phillips and George Phillips
be upgraded so programs using the The problem itself is much simpler http://queue.acm.org/detail.cfm?id=945155
technique can still be tested. than the functional issues because Enhanced Debugging with Traces
˲˲ Video blanking can be put to good there is no feedback to the rest of the Peter Phillips
use sculpting graphics that cannot be implementation. There is no need to http://queue.acm.org/detail.cfm?id=1753170
constructed in other ways. change the CPU timing or how the
In other words, we’re a long way CPU interacts with the graphics sys- George Phillips (gp2000@shaw.ca) works on video
from where we started. Instead of tem. It is merely a matter of drawing games both emulated and otherwise. In a simpler time he
worked on the crawling side of an early Web search engine
drawing an entire screen at once or each dot as an alpha-blended patch that was chastised for selling advertisements.
even a line at a time, the emulator is rather than a hard-edged off/on set-
down to drawing 1/12th of a character ting of one or two pixels. What is © 2010 ACM 0001-0782/10/0600 $10.00

doi:10.1145/1743546.1743567
Article development led by queue.acm.org

A survey of powerful visualization techniques, from the obvious to the obscure.

by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky

A Tour Through the Visualization Zoo

Thanks to advances in sensing, networking, and data management, our society is producing digital information at an astonishing rate. According to one estimate, in 2010 alone we will generate 1,200 exabytes—60 million times the content of the Library of Congress. Within this deluge of data lies a wealth of valuable information on how we conduct our businesses, governments, and personal lives. To put the information to good use, we must find ways to explore, relate, and communicate the data meaningfully.

The goal of visualization is to aid our understanding of data by leveraging the human visual system's highly tuned ability to see patterns, spot trends, and identify outliers. Well-designed visual representations can replace cognitive calculations with simple perceptual inferences and improve comprehension, memory, and decision making. By making data more accessible and appealing, visual representations may also help engage more diverse audiences in exploration and analysis. The challenge is to create effective and engaging visualizations that are appropriate to the data.

Creating a visualization requires a number of nuanced judgments. One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color. The challenge is that for any given data set the number of visual encodings—and thus the space of possible visualization designs—is extremely large. To guide this process,
computer scientists, psychologists, and statisticians have studied how well different encodings facilitate the comprehension of data types such as numbers, categories, and networks. For example, graphical perception experiments find that spatial position (as in a scatter plot or bar chart) leads to the most accurate decoding of numerical data and is generally preferable to visual variables such as angle, one-dimensional length, two-dimensional area, three-dimensional volume, and color saturation. Thus, it should be no surprise that the most common data graphics, including bar charts, line charts, and scatter plots, use position encodings. Our understanding of graphical perception remains incomplete, however, and must appropriately be balanced with interaction design and aesthetics.

This article provides a brief tour through the "visualization zoo," showcasing techniques for visualizing and interacting with diverse data sets. In many situations, simple data graphics will not only suffice, they may also be preferable. Here we focus on a few of the more sophisticated and unusual techniques that deal with complex data sets. After all, you don't go to the zoo to see chihuahuas and raccoons; you go to admire the majestic polar bear, the graceful zebra, and the terrifying Sumatran tiger. Analogously, we cover some of the more exotic (but practically useful) forms of visual data representation, starting with one of the most common, time-series data; continuing on to statistical data and maps; and then completing the tour with hierarchies and networks. Along the way, bear in mind that all visualizations share a common "DNA"—a set of mappings between data properties and visual attributes such as position, size, shape, and color—and that customized species of visualization might always be constructed by varying these encodings.

Time-Series Data: Figure 1a. Index chart of selected technology stocks, 2000–2010. Source: Yahoo! Finance; http://hci.stanford.edu/jheer/files/zoo/ex/time/index-chart.html
Time-Series Data: Figure 1b. Stacked graph of unemployed U.S. workers by industry, 2000–2010. Source: U.S. Bureau of Labor Statistics; http://hci.stanford.edu/jheer/files/zoo/ex/time/stack.html
Time-Series Data: Figure 1c. Small multiples of unemployed U.S. workers, normalized by industry, 2000–2010. Source: U.S. Bureau of Labor Statistics; http://hci.stanford.edu/jheer/files/zoo/ex/time/multiples.html
Time-Series Data: Figure 1d. Horizon graphs of U.S. unemployment rate, 2000–2010. Source: U.S. Bureau of Labor Statistics; http://hci.stanford.edu/jheer/files/zoo/ex/time/horizon.html

Each visualization shown here is accompanied by an online interactive example that can be viewed at the URL displayed beneath it. The live examples were created using Protovis, an open source language for Web-based data visualization. To learn more about how a visualization was made (or to copy and paste it for your own use), see the online version of this article available on the ACM Queue site at
http://queue.acm.org/detail.cfm?id=1780401/. All example source code is released into the public domain and has no restrictions on reuse or modification. Note, however, that these examples will work only on a modern, standards-compliant browser supporting scalable vector graphics (SVG). Supported browsers include recent versions of Firefox, Safari, Chrome, and Opera. Unfortunately, Internet Explorer 8 and earlier versions do not support SVG and so cannot be used to view the interactive examples.

Time-Series Data
Sets of values changing over time—or, time-series data—are one of the most common forms of recorded data. Time-varying phenomena are central to many domains such as finance (stock prices, exchange rates), science (temperatures, pollution levels, electric potentials), and public policy (crime rates). One often needs to compare a large number of time series simultaneously and can choose from a number of visualizations to do so.

Index Charts. With some forms of time-series data, raw values are less important than relative changes. Consider investors who are more interested in a stock's growth rate than its specific price. Multiple stocks may have dramatically different baseline prices but may be meaningfully compared when normalized. An index chart is an interactive line chart that shows percentage changes for a collection of time-series data based on a selected index point. For example, the image in Figure 1a shows the percentage change of selected stock prices if purchased in January 2005: one can see the rocky rise enjoyed by those who invested in Amazon, Apple, or Google at that time.
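The underlying transform is simple. A minimal C sketch (array names are illustrative) re-expresses each series relative to the value at the chosen index point, matching the gain/loss factors on the chart's axis:

    /* Index-chart normalization: indexed[i] is the gain/loss factor of
       price[i] relative to the selected index point price[idx]. */
    void index_normalize(const double *price, double *indexed,
                         int n, int idx)
    {
        for (int i = 0; i < n; i++)
            indexed[i] = price[i] / price[idx] - 1.0;  /* 0.0 = unchanged */
    }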
Stacked Graphs. Other forms of time-series data may be better seen in aggregate. By stacking area charts on top of each other, we arrive at a visual summation of time-series values—a stacked graph. This type of graph (sometimes called a stream graph) depicts aggregate patterns and often supports drill-down into a subset of individual series. The chart in Figure 1b shows the number of unemployed workers in the U.S. over the past decade, subdivided by industry. While such charts have proven popular in recent years, they do have some notable limitations. A stacked graph does not support negative numbers and is meaningless for data that should not be summed (temperatures, for example). Moreover, stacking may make it difficult to accurately interpret trends that lie atop other curves. Interactive search and filtering is often used to compensate for this problem.

Small Multiples. In lieu of stacking, multiple time series can be plotted within the same axes, as in the index chart. Placing multiple series in the same space may produce overlapping curves that reduce legibility, however. An alternative approach is to use small multiples: showing each series in its own chart. In Figure 1c we again see the number of unemployed workers, but normalized within each industry category. We can now more accurately see both overall trends and seasonal patterns in each sector. While we are considering time-series data, note that small multiples can be constructed for just about any type of visualization: bar charts, pie charts, maps, among others. This often produces a more effective visualization than trying to coerce all the data into a single plot.

Horizon Graphs. What happens when you want to compare even more time series at once? The horizon graph is a technique for increasing the data density of a time-series view while preserving resolution. Consider the five graphs shown in Figure 1d. The first one is a standard area chart, with positive values colored blue and negative values colored red. The second graph "mirrors" negative values into the same region as positive values, doubling the data density of the area chart. The third chart—a horizon graph—doubles the data density yet again by dividing the graph into bands and layering them to create a nested form. The result is a chart that preserves data resolution but uses only a quarter of the space. Although the horizon graph takes some time to learn, it has been found to be more effective than the standard plot when the chart sizes get quite small.

Statistical Distributions
Other visualizations have been designed to reveal how a set of numbers is distributed and thus help an analyst better understand the statistical properties of the data. Analysts often want to fit their data to statistical models, either to test hypotheses or predict future values, but an improper choice of model can lead to faulty predictions. Thus, one important use of visualizations is exploratory data analysis: gaining insight into how data is distributed to inform data transformation and modeling decisions. Common techniques include the histogram, which shows the prevalence of values grouped into bins, and the box-and-whisker plot, which can convey statistical features such as the mean, median, quartile boundaries, or extreme outliers. In addition, a number of other techniques exist for assessing a distribution and examining interactions between multiple dimensions.

Stem-and-Leaf Plots. For assessing a collection of numbers, one alternative to the histogram is the stem-and-leaf plot. It typically bins numbers according to the first significant digit, and then stacks the values within each bin by the second significant digit. This minimalistic representation uses the data itself to paint a frequency distribution, replacing the "information-empty" bars of a traditional histogram bar chart and allowing one to assess both the overall distribution and the contents of each bin. In Figure 2a, the stem-and-leaf plot shows the distribution of completion rates of workers completing crowdsourced tasks on Amazon's Mechanical Turk. Note the multiple clusters: one group clusters around high levels of completion (99%–100%); at the other extreme is a cluster of Turkers who complete only a few tasks (~10%) in a group.

Q-Q Plots. Though the histogram and the stem-and-leaf plot are common tools for assessing a frequency distribution, the Q-Q (quantile-quantile) plot is a more powerful tool. The Q-Q plot compares two probability distributions by graphing their quantiles against each other. If the two are similar, the plotted values will lie roughly along the central diagonal. If the two are linearly related, values will again lie along a line, though with varying slope and intercept.
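For equal-size samples the construction reduces to sorting, as in this C sketch; after the two sorts, the pair (x[i], y[i]) is the i-th quantile pair to be plotted:

    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Sort both samples in place; plotting y[i] against x[i] then gives
       the Q-Q plot: points near the diagonal mean similar distributions,
       any other straight line a linear relationship. */
    void qq_pairs(double *x, double *y, int n)
    {
        qsort(x, n, sizeof *x, cmp_double);
        qsort(y, n, sizeof *y, cmp_double);
    }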
Figure 2b shows the same Mechanical Turk participation data compared with three statistical distributions. Note how the data forms three distinct components when compared with uniform and normal (Gaussian) distributions: this suggests that a statistical model with three components might

be more appropriate, and indeed we see in the final plot that a fitted mixture of three normal distributions provides a better fit. Though powerful, the Q-Q plot has one obvious limitation in that its effective use requires that viewers possess some statistical knowledge.

SPLOM (Scatter Plot Matrix). Other visualization techniques attempt to represent the relationships among multiple variables. Multivariate data occurs frequently and is notoriously hard to represent, in part because of the difficulty of mentally picturing data in more than three dimensions. One technique to overcome this problem is to use small multiples of scatter plots showing a set of pairwise relations among variables, thus creating the SPLOM (scatter plot matrix). A SPLOM enables visual inspection of correlations between any pair of variables.

In Figure 2c a scatter plot matrix is used to visualize the attributes of a database of automobiles, showing the relationships among horsepower, weight, acceleration, and displacement. Additionally, interaction techniques such as brushing-and-linking—in which a selection of points on one graph highlights the same points on all the other graphs—can be used to explore patterns within the data.

Parallel Coordinates. As shown in Figure 2d, parallel coordinates (||-coord) take a different approach to visualizing multivariate data. Instead of graphing every pair of variables in two dimensions, we repeatedly plot the data on parallel axes and then connect the corresponding points with lines. Each poly-line represents a single row in the database, and line crossings between dimensions often indicate inverse correlation. Reordering dimensions can aid pattern-finding, as can interactive querying to filter along one or more dimensions. Another advantage of parallel coordinates is that they are relatively compact, so many variables can be shown simultaneously.

Statistical Distributions: Figure 2a. Stem-and-leaf plot of Mechanical Turk participation rates. Source: Stanford Visualization Group; http://hci.stanford.edu/jheer/files/zoo/ex/stats/stem-and-leaf.html
Statistical Distributions: Figure 2b. Q-Q plots of Mechanical Turk participation rates. Source: Stanford Visualization Group; http://hci.stanford.edu/jheer/files/zoo/ex/stats/qqplot.html
Statistical Distributions: Figure 2c. Scatter plot matrix of automobile data. Source: GGobi; http://hci.stanford.edu/jheer/files/zoo/ex/stats/splom.html
Statistical Distributions: Figure 2d. Parallel coordinates of automobile data. Source: GGobi; http://hci.stanford.edu/jheer/files/zoo/ex/stats/parallel.html

Maps
Although a map may seem a natural way to visualize geographical data, it has a long and rich history of design. Many maps are based upon a cartographic projection: a mathematical function that maps the 3D geometry of the Earth to a 2D image. Other maps

knowingly distort or abstract geographic features to tell a richer story or highlight specific data.

Flow Maps. By placing stroked lines on top of a geographic map, a flow map can depict the movement of a quantity in space and (implicitly) in time. Flow lines typically encode a large amount of multivariate information: path points, direction, line thickness, and color can all be used to present dimensions of information to the viewer. Figure 3a is a modern interpretation of Charles Minard's depiction of Napoleon's ill-fated march on Moscow. Many of the greatest flow maps also involve subtle uses of distortion, as geography is bent to accommodate or highlight flows.

Maps: Figure 3a. Flow map of Napoleon's March on Moscow, based on the work of Charles Minard. http://hci.stanford.edu/jheer/files/zoo/ex/maps/napoleon.html
Maps: Figure 3b. Choropleth map of obesity in the U.S., 2008. Source: National Center for Chronic Disease Prevention and Health Promotion; http://hci.stanford.edu/jheer/files/zoo/ex/maps/choropleth.html

Choropleth Maps. Data is often collected and aggregated by geographical areas such as states. A standard approach to communicating this data is to use a color encoding of the geographic area, resulting in a choropleth map. Figure 3b uses a color encoding to communicate the prevalence of obesity in each state in the U.S. Though this is a widely used visualization technique, it requires some care. One common error is to encode raw data values (such as population) rather than using normalized values to produce a density map. Another issue is that one's perception of the shaded value can also be affected by the underlying area of the geographic region.
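The normalization the text warns about is a one-line fix. This C sketch encodes a rate rather than a raw count; the thresholds mirror the figure's 3%-wide legend bins, and the bin indices are illustrative:

    /* Map a state's obesity rate to one of the seven 3%-wide color bins
       used in Figure 3b (14-17% ... 32-35%). */
    int obesity_color_bin(double obese_count, double population)
    {
        double rate = obese_count / population;   /* normalize first */
        int bin = (int)((rate - 0.14) / 0.03);
        if (bin < 0) bin = 0;                     /* lightest shade  */
        if (bin > 6) bin = 6;                     /* darkest shade   */
        return bin;
    }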
Graduated Symbol Maps. An alternative to the choropleth map, the graduated symbol map places symbols over an underlying map. This approach avoids confounding geographic area with data values and allows for more dimensions to be visualized (for example, symbol size, shape, and color). In addition to simple shapes such as circles, graduated symbol maps may use more complicated glyphs such as pie charts. In Figure 3c, total circle size represents a state's population, and each slice indicates the proportion of people with a specific BMI rating.

Maps: Figure 3c. Graduated symbol map of obesity in the U.S., 2008 (pie slices: normal, overweight, obese). Source: National Center for Chronic Disease Prevention and Health Promotion; http://hci.stanford.edu/jheer/files/zoo/ex/maps/symbol.html
Cartograms. A cartogram distorts the shape of geographic regions so that the area directly encodes a data variable. A common example is to redraw every country in the world sizing it proportionally to population or gross domestic product. Many types of cartograms have been created; in Figure 3d we use the Dorling cartogram, which represents each geographic region with a sized circle, placed so as to resemble the true geographic configuration. In this example, circular area encodes the total number of obese people per state, and color encodes the percentage of the total population that is obese.

Maps: Figure 3d. Dorling cartogram of obesity in the U.S., 2008; circle areas range from 100K to 10M people. Source: National Center for Chronic Disease Prevention and Health Promotion; http://hci.stanford.edu/jheer/files/zoo/ex/maps/cartogram.html
Hierarchies: Figure 4a. Cartesian node-link diagram of the Flare package hierarchy. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/tree.html)
Hierarchies
While some data is simply a flat collection of numbers, most can be organized into natural hierarchies. Consider: spatial entities, such as counties, states, and countries; command structures for businesses and governments; software packages and phylogenetic trees. Even for data with no apparent hierarchy, statistical methods (for example, k-means clustering) may be applied to organize data empirically. Special visualization techniques exist to leverage hierarchical structure, allowing rapid multiscale inferences: micro-observations of individual elements and macro-observations of large groups.
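As a concrete illustration of imposing a hierarchy on flat data, here is a small hedged sketch (ours, assuming scikit-learn is available) that applies k-means recursively: each cluster is split again until it is small, yielding a tree that the hierarchical visualizations below can consume. The data and parameters are invented.

import numpy as np
from sklearn.cluster import KMeans

def cluster_tree(points, k=2, leaf_size=5):
    """Recursively split points with k-means to impose a hierarchy."""
    if len(points) <= leaf_size:
        return {"leaf": points.tolist()}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)
    return {"children": [cluster_tree(points[labels == i], k, leaf_size)
                         for i in range(k)]}

rng = np.random.default_rng(1)
flat_data = rng.normal(size=(40, 2))   # 40 flat, unorganized 2D records
tree = cluster_tree(flat_data)         # now a nested hierarchy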
Node-link diagrams. The word tree is used interchangeably with hierarchy, as the fractal branches of an oak might mirror the nesting of data. If we take a two-dimensional blueprint of a tree, we have a popular choice for visualizing hierarchies: a node-link diagram. Many different tree-layout algorithms have been designed; the Reingold-Tilford algorithm, used in Figure 4a on a package hierarchy of software classes, produces a tidy result with minimal wasted space.
An alternative visualization scheme is the dendrogram (or cluster) algorithm, which places leaf nodes of the tree at the same level. Thus, in the diagram in Figure 4b, the classes (orange leaf nodes) are on the diameter of the circle, with the packages (blue internal nodes) inside. Using polar rather than Cartesian coordinates has a pleasing aesthetic, while using space more efficiently.

Hierarchies: Figure 4b. Radial node-link diagram of the Flare package hierarchy. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/cluster-radial.html)
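The cluster layout is simple enough to sketch. This hedged example (standard library only; the tiny hierarchy and names are hypothetical) spaces leaves evenly along one axis and centers each parent over its children, with all leaves at the same level. Mapping (x, level) to (angle, radius) instead yields the radial variant of Figure 4b.

def cluster_layout(tree):
    """Cluster (dendrogram) layout: leaves evenly spaced on one axis and
    all at level 0; each parent is centered over its children, one level up."""
    pos, next_x = {}, 0

    def place(node):
        nonlocal next_x
        children = node.get("children", [])
        if not children:                      # leaf: next x slot, level 0
            pos[node["name"]] = (next_x, 0)
            next_x += 1
        else:                                 # parent: centered, one level up
            for child in children:
                place(child)
            spots = [pos[c["name"]] for c in children]
            x = sum(s[0] for s in spots) / len(spots)
            y = max(s[1] for s in spots) + 1
            pos[node["name"]] = (x, y)

    place(tree)
    return pos

# A tiny hierarchy shaped like a package tree (names are hypothetical).
flare = {"name": "flare", "children": [
    {"name": "animate", "children": [{"name": "Tween"}, {"name": "Easing"}]},
    {"name": "query",   "children": [{"name": "Sum"}, {"name": "Count"}]},
]}
print(cluster_layout(flare))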
We would be remiss to overlook the indented tree, used ubiquitously by operating systems to represent file directories, among other applications (see Figure 4c). Although the indented tree requires excessive vertical space and does not facilitate multiscale inferences, it does allow efficient interactive exploration of the tree to find a specific node. In addition, it allows rapid scanning of node labels, and multivariate data such as file size can be displayed adjacent to the hierarchy.

Hierarchies: Figure 4c. Indented tree layout of the Flare package hierarchy, with the size of each package and class displayed alongside its label. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/indent.html)
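An indented tree with adjacent multivariate data takes only a few lines; this hedged sketch (standard library only) prints a directory tree with file sizes alongside labels, much as a file manager does.

import os

def indented_tree(path, depth=0):
    """Print an indented tree of a directory, with sizes alongside labels."""
    name = os.path.basename(path) or path
    size = os.path.getsize(path) if os.path.isfile(path) else 0
    print("    " * depth + name + (f"  {size/1024:.0f}KB" if size else ""))
    if os.path.isdir(path):
        for entry in sorted(os.listdir(path)):
            indented_tree(os.path.join(path, entry), depth + 1)

indented_tree(".")   # e.g., run at the root of a source tree such as flare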


Adjacency Diagrams. The adjacency diagram is a space-filling variant of the node-link diagram; rather than drawing a link between parent and child in the hierarchy, nodes are drawn as solid areas (either arcs or bars), and their placement relative to adjacent nodes reveals their position in the hierarchy. The icicle layout in Figure 4d is similar to the first node-link diagram in that the root node appears at the top, with child nodes underneath. Because the nodes are now space-filling, however, we can use a length encoding for the size of software classes and packages. This reveals an additional dimension that would be difficult to show in a node-link diagram.

Hierarchies: Figure 4d. Icicle tree layout of the Flare package hierarchy. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/icicle.html)

The sunburst layout, shown in Figure 4e, is equivalent to the icicle layout, but in polar coordinates. Both are implemented using a partition layout, which can also generate a node-link diagram. Similarly, the previous cluster layout can be used to generate a space-filling adjacency diagram in either Cartesian or polar coordinates.

Hierarchies: Figure 4e. Sunburst (radial space-filling) layout of the Flare package hierarchy. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/sunburst.html)
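A partition layout is little more than a recursive split of an interval in proportion to subtree sizes; a hedged sketch follows (the weight() helper is ours, and the package names and sizes simply echo the labels in Figure 4c). Interpreting (start, extent, depth) as x, width, and row gives the icicle; mapping the same numbers to angle, sweep, and radius gives the sunburst.

def weight(node):
    """Subtree weight, e.g., total size in KB of a package or class."""
    kids = node.get("children", [])
    return node.get("size", 1) if not kids else sum(weight(c) for c in kids)

def partition(node, start=0.0, extent=1.0, depth=0, out=None):
    """Partition layout: each node gets an interval sized by its subtree."""
    if out is None:
        out = []
    out.append((node["name"], start, extent, depth))
    children = node.get("children", [])
    total = sum(weight(c) for c in children)
    for child in children:
        share = extent * weight(child) / total
        partition(child, start, share, depth + 1, out)
        start += share
    return out

# Icicle: x=start, width=extent, y=depth.
# Sunburst: angle=2*pi*start, sweep=2*pi*extent, radius=depth.
pkg = {"name": "flare", "children": [
    {"name": "vis", "size": 422}, {"name": "util", "size": 161},
    {"name": "data", "size": 107},
]}
for name, s, e, d in partition(pkg):
    print(f"{name:6s} start={s:.2f} extent={e:.2f} depth={d}")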


Enclosure Diagrams. The enclosure diagram is also space filling, using containment rather than adjacency to represent the hierarchy. Introduced by Ben Shneiderman in 1991, a treemap recursively subdivides area into rectangles. As with adjacency diagrams, the size of any node in the tree is quickly revealed. The example shown in Figure 4f uses padding (in blue) to emphasize enclosure; an alternative saturation encoding is sometimes used. Squarified treemaps use approximately square rectangles, which offer better readability and size estimation than a naive "slice-and-dice" subdivision. Fancier algorithms such as Voronoi and jigsaw treemaps also exist but are less common.

Hierarchies: Figure 4f. Treemap layout of the Flare package hierarchy. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/treemap.html)
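The naive slice-and-dice subdivision mentioned above fits in a dozen lines and is what squarified treemaps improve upon. A hedged sketch, repeating the weight() helper and pkg hierarchy from the partition example so it stands alone:

def weight(node):
    kids = node.get("children", [])
    return node.get("size", 1) if not kids else sum(weight(c) for c in kids)

def slice_and_dice(node, x, y, w, h, depth=0, rects=None):
    """Naive treemap: alternate horizontal/vertical cuts by subtree weight."""
    if rects is None:
        rects = []
    rects.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(weight(c) for c in children)
    for child in children:
        frac = weight(child) / total
        if depth % 2 == 0:                      # even depth: slice along x
            cw = w * frac
            slice_and_dice(child, x, y, cw, h, depth + 1, rects)
            x += cw
        else:                                   # odd depth: slice along y
            ch = h * frac
            slice_and_dice(child, x, y, w, ch, depth + 1, rects)
            y += ch
    return rects

pkg = {"name": "flare", "children": [
    {"name": "vis", "size": 422}, {"name": "util", "size": 161},
    {"name": "data", "size": 107},
]}
rects = slice_and_dice(pkg, 0, 0, 100, 100)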
By packing circles instead of subdividing rectangles, we can produce a different sort of enclosure diagram that has an almost organic appearance. Although it does not use space as efficiently as a treemap, the "wasted space" of the circle-packing layout, shown in Figure 4g, effectively reveals the hierarchy. At the same time, node sizes can be rapidly compared using area judgments.

Hierarchies: Figure 4g. Nested circles layout of the Flare package hierarchy. (http://hci.stanford.edu/jheer/files/zoo/ex/hierarchies/pack.html; source: The Flare Toolkit, http://flare.prefuse.org)

Networks
In addition to organization, one aspect of data that we may wish to explore through visualization is relationship.
For example, given a social network, who is friends with whom? Who are the central players? What cliques exist? Who, if anyone, serves as a bridge between disparate groups? Abstractly, a hierarchy is a specialized form of network: each node has exactly one link to its parent, while the root node has no links. Thus node-link diagrams are also used to visualize networks, but the loss of hierarchy means a different algorithm is required to position nodes.

Mathematicians use the formal term graph to describe a network. A central challenge in graph visualization is computing an effective layout. Layout techniques typically seek to position closely related nodes (in terms of graph distance, such as the number of links between nodes, or other metrics) close in the drawing; critically, unrelated nodes must also be placed far enough apart to differentiate relationships. Some techniques may seek to optimize other visual features—for example, by minimizing the number of edge crossings.
Force-directed Layouts. A common and intuitive approach to network layout is to model the graph as a physical system: nodes are charged particles that repel each other, and links are dampened springs that pull related nodes together. A physical simulation of these forces then determines the node positions; approximation techniques that avoid computing all pairwise forces enable the layout of large numbers of nodes. In addition, interactivity allows the user to direct the layout and jiggle nodes to disambiguate links. Such a force-directed layout is a good starting point for understanding the structure of a general undirected graph. In Figure 5a we use a force-directed layout to view the network of character co-occurrence in the chapters of Victor Hugo's classic novel, Les Misérables. Node colors depict cluster memberships computed by a community-detection algorithm.

Networks: Figure 5a. Force-directed layout of Les Misérables character co-occurrences. (http://hci.stanford.edu/jheer/files/zoo/ex/networks/force.html)
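The physical metaphor translates directly into code. This is a minimal hedged sketch of one simulation step, not the algorithm behind Figure 5a: inverse-square repulsion between every pair of nodes (the quadratic cost the approximation techniques above avoid) plus spring forces along links. The edges and constants are made up.

import math, random

def step(pos, links, repulsion=0.05, spring=0.02, rest=0.3):
    """One iteration of a naive force-directed layout on 2D positions."""
    force = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):            # charged particles repel
        for v in nodes[i + 1:]:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = repulsion / (d * d)
            force[u][0] += f * dx / d; force[u][1] += f * dy / d
            force[v][0] -= f * dx / d; force[v][1] -= f * dy / d
    for u, v in links:                       # springs pull linked nodes
        dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
        d = math.hypot(dx, dy) or 1e-9
        f = spring * (d - rest)
        force[u][0] += f * dx / d; force[u][1] += f * dy / d
        force[v][0] -= f * dx / d; force[v][1] -= f * dy / d
    for v in pos:                            # position update
        pos[v] = (pos[v][0] + force[v][0], pos[v][1] + force[v][1])

# Hypothetical co-occurrence edges in the spirit of Figure 5a.
links = [("Valjean", "Javert"), ("Valjean", "Cosette"), ("Cosette", "Marius")]
pos = {v: (random.random(), random.random()) for e in links for v in e}
for _ in range(200):
    step(pos, links)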
Arc Diagrams. An arc diagram, shown in Figure 5b, uses a one-dimensional layout of nodes, with circular arcs to represent links. Though an arc diagram may not convey the overall structure of the graph as effectively as a two-dimensional layout, with a good ordering of nodes it is easy to identify cliques and bridges. Further, as with the indented-tree layout, multivariate data can easily be displayed alongside nodes. The problem of sorting the nodes in a manner that reveals underlying cluster structure is formally called seriation and has diverse applications in visualization, statistics, and even archaeology.

Networks: Figure 5b. Arc diagram of Les Misérables character co-occurrences. (http://hci.stanford.edu/jheer/files/zoo/ex/networks/arc.html)

All visualizations share a common "DNA"—a set of mappings between data properties and visual attributes such as position, size, shape, and color—and customized species of visualization might always be constructed by varying these encodings.

Matrix Views. Mathematicians and computer scientists often think of a graph in terms of its adjacency matrix: each value in row i and column j in the matrix corresponds to the link from node i to node j. Given this representation, an obvious visualization then is: just show the matrix! Using color or saturation instead of text allows values associated with the links to be perceived more rapidly.

The seriation problem applies just as much to the matrix view, shown in Figure 5c, as to the arc diagram, so the order of rows and columns is important: here we use the groupings generated by a community-detection algorithm to order the display. While path-following is more difficult in a matrix view than in a node-link diagram, matrices have a number of compensating advantages. As networks get large and highly connected, node-link diagrams often devolve into giant hairballs of line crossings. In matrix views, however, line crossings are impossible, and with an effective sorting one quickly can spot clusters and bridges. Allowing interactive grouping and reordering of the matrix facilitates even deeper exploration of network structure.

Networks: Figure 5c. Matrix view of Les Misérables character co-occurrences. (http://hci.stanford.edu/jheer/files/zoo/ex/networks/matrix.html; data source: http://www-personal.umich.edu/~mejn/netdata)
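A matrix view is easy to derive from an edge list. This hedged sketch builds the adjacency matrix and reorders rows and columns by a grouping (a made-up cluster assignment standing in for a community-detection result), which is the seriation step described above.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical co-occurrence edges and cluster labels.
edges = [("Valjean", "Javert"), ("Valjean", "Cosette"),
         ("Cosette", "Marius"), ("Marius", "Enjolras")]
cluster = {"Valjean": 0, "Javert": 0, "Cosette": 1,
           "Marius": 1, "Enjolras": 1}

# Seriation: order nodes by cluster so blocks appear on the diagonal.
nodes = sorted(cluster, key=lambda v: cluster[v])
index = {v: i for i, v in enumerate(nodes)}

A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:                      # undirected: fill both triangles
    A[index[u], index[v]] = A[index[v], index[u]] = 1

# "Just show the matrix": shade cells instead of printing numbers.
plt.imshow(A, cmap="Greys")
plt.xticks(range(len(nodes)), nodes, rotation=90)
plt.yticks(range(len(nodes)), nodes)
plt.show()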
Conclusion
We have arrived at the end of our tour and hope the reader has found the examples both intriguing and practical. Though we have visited a number of visual encoding and interaction techniques, many more species of visualization exist in the wild, and others await discovery. Emerging domains such as bioinformatics and text visualization are driving researchers and designers to continually formulate new and creative representations or find more powerful ways to apply the classics. In either case, the DNA underlying all visualizations remains the same: the principled mapping of data variables to visual features such as position, size, shape, and color. As you leave the zoo and head back into the wild, try deconstructing the various visualizations crossing your path. Perhaps you can design a more effective display?

Additional Resources
Few, S. Now I See It: Simple Visualization Techniques for Quantitative Analysis. Analytics Press, 2009.
Tufte, E. The Visual Display of Quantitative Information. Graphics Press, 1983.
Tufte, E. Envisioning Information. Graphics Press, 1990.
Ware, C. Visual Thinking for Design. Morgan Kaufmann, 2008.
Wilkinson, L. The Grammar of Graphics. Springer, 1999.

Visualization Development Tools
Prefuse: Java API for information visualization.
Prefuse Flare: ActionScript 3 library for data visualization in the Adobe Flash Player.
Processing: Popular language and IDE for graphics and interaction.
Protovis: JavaScript tool for Web-based visualization.
The Visualization Toolkit: Library for 3D and scientific visualization.

Related articles on queue.acm.org
A Conversation with Jeff Heer, Martin Wattenberg, and Fernanda Viégas
http://queue.acm.org/detail.cfm?id=1744741
Unifying Biological Image Formats with HDF5
Matthew T. Dougherty, Michael J. Folk, Erez Zadok, Herbert J. Bernstein, Frances C. Bernstein, Kevin W. Eliceiri, Werner Benger, Christoph Best
http://queue.acm.org/detail.cfm?id=1628215

Jeffrey Heer is an assistant professor of computer science at Stanford University, where he works on human-computer interaction, visualization, and social computing. He led the design of the Prefuse, Flare, and Protovis visualization toolkits.

Michael Bostock is currently a Ph.D. student in the Department of Computer Science at Stanford University. Before attending Stanford, he was a staff engineer at Google, where he developed search quality evaluation methodologies.

Vadim Ogievetsky is a master's student at Stanford University specializing in human-computer interaction. He is a core contributor to Protovis, an open-source Web-based visualization toolkit.

© 2010 ACM 0001-0782/10/0600 $10.00
contributed articles
doi:10.1145/1743546.1743568

Needed are generic, rather than one-off, DBMS solutions automating storage and analysis of data from scientific collaborations.

By Anastasia Ailamaki, Verena Kantere, and Debabrata Dash

Managing Scientific Data

Data-oriented scientific processes depend on fast, accurate analysis of experimental data generated through empirical observation and simulation. However, scientists are increasingly overwhelmed by the volume of data produced by their own experiments. With improving instrument precision and the complexity of the simulated models, data overload promises to only get worse. The inefficiency of existing database management systems (DBMSs) for addressing the requirements of scientists has led to many application-specific systems. Unlike their general-purpose counterparts, these systems require more resources, hindering reuse of knowledge. Still, the data-management community aspires to general-purpose scientific data management. Here, we explore the most important requirements of such systems and the techniques being used to address them.

Observation and simulation of phenomena are keys for proving scientific theories and discovering facts of nature the human brain could otherwise never imagine. Scientists must be able to manage data derived from observations and simulations. Constant improvement of observational instruments and simulation tools gives modern science effective options for abundant information capture, reflecting the rich diversity of complex life forms and cosmic phenomena. Moreover, the need for in-depth analysis of huge amounts of data relentlessly drives demand for additional computational support.

Microsoft researcher and ACM Turing Award laureate Jim Gray once said, "A fourth data-intensive science is emerging. The goal is to have a world in which all of the science literature is online, all the science data is online, and they interoperate with each other."9 Unfortunately, today's commercial data-management tools are incapable of supporting the unprecedented scale, rate, and complexity of scientific data collection and processing.

Despite its variety, scientific data does share some common features:
• Scale usually dwarfing the scale of transactional data sets;
• Generated through complex and interdependent workflows;
• Typically multidimensional;
• Embedded physical models;
• Important metadata about experiments and their provenance;
• Floating-point heavy; and
• Low update rates, with most updates append-only.

key insights
• Managing the enormous amount of scientific data being collected is the key to scientific progress.
• Though technology allows for the extreme collection rates of scientific data, processing is still performed with stale techniques developed for small data sets; efficient processing is necessary to be able to exploit the value of huge scientific data collections.
• Proposed solutions also promise to achieve efficient management for almost any other kind of data.
Result of seven-trillion-electronvolt collisions (March 30, 2010) in the ATLAS particle detector on the Large Hadron Collider at CERN, hunting
for dark matter, new forces, new dimensions, the Higgs boson, and ultimately a grand theory to explain all physical phenomena.

Persistent common requirements for scientific data management include:
• Automation of data and metadata processing;
• Parallel data processing;
• Online processing;
• Integration of multifarious data and metadata; and
• Efficient manipulation of data/metadata residing in files.

Lack of complete solutions using commercial DBMSs has led scientists in all fields to develop or adopt application-specific solutions, though some have been added on top of commercial DBMSs; for example, the Sloan Digital Sky Survey (SDSS-1 and SDSS-2; http://www.sdss.org/) uses SQL Server as its backend. Moreover, the resulting software is typically tightly bound to the application and difficult to adapt to changes in the scientific landscape. Szalay and Blakeley9 wrote, "Scientists and scientific institutions need a template and best practices that lead to balanced hardware architectures and corresponding software to deal with these volumes of data."

Despite the challenges, the data-management research community continues to envision a general-purpose scientific data-management system adapting current innovations: parallelism in data querying, sophisticated tools for data definition and analysis (such as clustering in SDSS-1), optimization of data organization, data caching, and replication techniques. Promising results involve automated data organization, provenance, annotation, online processing of streaming data, embedded complex data types, support for declarative data, process definition, and incorporation of files into DBMSs.

Scientific databases cover a wide scope, with notable demand for high performance and data quality. Scientific data ranges from medical and biological to community science and from large-scale institutional to local laboratories. Here, we focus on the big amounts of data collected or produced by instruments archived in databases and managed by DBMSs. The database community has expertise that can be applied to solve the problems in existing scientific databases.

Observation and Simulation
Scientific data originates through observation and/or simulation.16 Observational data is collected through detectors; input is digitized, and output is raw observational data. Simulation data is produced through simulators that take as input the values of simulation parameters. Both types of data are often necessary for scientific research on the same topic; for instance, observational data is compared with simulation data produced under the same experimental setup. Consider three examples, one each for observational, simulation, and combined:

Observational scientific data. The SDSS, located at Ohio State University and Johns Hopkins University, is a long-running astronomy project. Since 2000, it has generated a detailed 3D map of about 25% of the sky (as seen from Earth) containing millions of galaxies and quasars. One reason for its success is its use of the SQL Server DBMS. The SDSS uses the telescope at Apache Point Observatory, NM, to scan the sky at regular intervals to collect raw data. Online processing is done on the data to detect the stars and galaxies in the region. This online processing also helps detect and fine-tune the telescope's alignment. The image data is then recorded onto tape for archival purposes.

Figure 1. Workflow of SDSS data (components include sky pipelines, a tape archive, local archives, and astronomy Web/virtual observatories).

The tapes are physically mailed to a processing center at Fermilab in Batavia, IL, to be processed through automated software pipelines to identify celestial objects. Many astronomical-processing-software pipelines process the data in parallel. The output of this processing, along with the image data, is stored in the local archives at Fermilab. The metadata generated from the processing pipelines is converted to relational format and stored in a MS-SQL Server database. Astronomers closely associated with the SDSS access the data from the local archives (see Figure 1).

The data is also published (publicly) once every two years through virtual observatories (http://www.sdss.org/dr7, the final release of the SDSS-II project) by running SQL on the database at the observatories or downloading the entire database over the Internet while running the queries locally. The SDSS project began providing public data sets in 2002 with three such observatories located at the Space Telescope Science Institute in the U.S., the National Astronomical Observatory of Japan, and the Max Planck Institute for Astrophysics in Germany.

Figure 2. Workflow of earthquake-simulation data: a ground model feeds mesh generation; the resulting mesh drives the simulation, whose time-varying output supports analysis and visualization.

The schema of SDSS data includes more than 70 tables, though most user queries focus on only a few of them, referring, as needed, to spectra and images. The queries aim to spot objects with specific characteristics, similarities, and correlations. Patterns of query expression are also limited, featuring conjunctions of range and user-defined functions in both the predicate and the join clause.
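A hedged illustration of that query pattern in Python follows; the table, columns, and function are hypothetical stand-ins, not the real SDSS schema, but the shape—range conjunctions plus a user-defined function in the predicate—is the one just described.

import sqlite3  # stands in for the production SQL Server backend

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photo_obj (obj_id, ra, dec, r_mag, g_mag)")

# A user-defined function usable inside predicates, as in the SDSS pattern.
conn.create_function("color_index", 2, lambda g, r: g - r)

rows = conn.execute("""
    SELECT obj_id
    FROM photo_obj
    WHERE ra  BETWEEN 180.0 AND 190.0        -- range conjunction
      AND dec BETWEEN -5.0  AND 5.0
      AND color_index(g_mag, r_mag) > 0.4    -- user-defined function
""").fetchall()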
Simulation scientific data. Earth science employs simulation models to help predict the motion of the ground during earthquakes. Ground motion is modeled with an octree-based hexahedral mesh19 produced by a mesh generator, using soil density as input (see Figure 2). A "solver" tool simulates the propagation of seismic waves through the Earth by approximating the solution to the wave equation at each mesh node. During each time step, the solver computes an estimate of each node velocity in the spatial directions, writing the results to the disk. The result is a 4D spatio-temporal earthquake data set describing the ground's velocity response. Various types of analysis can be performed on the data set, employing both time-varying and space-varying queries. For example, a user might describe a feature in the ground-mesh, and the DBMS finds the approximate location of the feature in the simulation data set through multidimensional indexes.
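The time-stepping loop at the heart of such a solver is simple in outline. This hedged one-dimensional sketch (the real code works on a 3D octree-based mesh; grid size and constants are made up) applies an explicit finite-difference update of the wave equation and appends each step's velocities to disk, producing a small space-time data set.

import numpy as np

n, steps = 100, 500
c, dx, dt = 1.0, 1.0, 0.5          # wave speed and grid spacing (made up)
r2 = (c * dt / dx) ** 2            # stability requires c*dt/dx <= 1

u_prev = np.zeros(n)
u = np.zeros(n)
u[n // 2] = 1.0                    # initial displacement pulse ("source")

with open("velocity.bin", "wb") as out:
    for t in range(steps):
        u_next = np.zeros(n)
        # Explicit update of the wave equation at each interior node.
        u_next[1:-1] = (2 * u[1:-1] - u_prev[1:-1]
                        + r2 * (u[2:] - 2 * u[1:-1] + u[:-2]))
        velocity = (u_next - u_prev) / (2 * dt)   # centered time derivative
        velocity.tofile(out)        # write this time step to disk
        u_prev, u = u, u_next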
Combined simulation and observational data. The ATLAS experiment (http://atlas.ch/), a particle-physics experiment in the Large Hadron Collider (http://lhc.web.cern.ch/lhc/) beneath the Swiss-French border near Geneva, is an example of scientific data processing that combines both simulated and observed data. ATLAS intends to search for new discoveries in the head-on collision of two highly energized proton beams. The entire workflow of the experiment involves petabytes of data and thousands of users from organizations the world over (see Figure 3).

We first describe some of the major ATLAS data types: The raw data is the direct observational data of the particle collisions. The detector's output rate is about 200Hz, and raw data, or electrical signals, is generated at about 320MB/sec, then reconstructed using various algorithms to produce event summary data (ESD). ESD has an object-oriented representation of the reconstructed events (collisions), with content intended to make access to raw data unnecessary for most physics applications. ESD is further processed to create analysis object data (AOD), a reduced event representation suitable for user analysis. Data volume decreases gradually from raw to ESD to AOD. Another important data type is tag data, or event-level metadata, stored in relational databases, designed to support efficient identification and selection of events of interest to a given analysis.

Due to the complexity of the experiment and the project's worldwide scope, participating sites are divided into multiple layers. The Tier-0 layer is a single site—CERN itself—where the detector is located and the raw data is collected. The first reconstruction of the observed electrical signals into physics events is also done at CERN, producing ESD, AOD, and tag data. Tier-1 sites are typically large national computing centers that receive replicated data from the Tier-0 site. Tier-1 sites are also responsible for reprocessing older data, as well as for storing the final results from Monte Carlo simulations at Tier-2 sites. Tier-2 sites are mostly institutes and universities providing computing resources for Monte Carlo simulations and end-user analysis. All sites have pledged computing resources, though the vast majority is not dedicated to ATLAS or to high-energy physics experiments. The Tier-0 site is both computation- and storage-intensive, since it stores the raw data and performs the initial event reconstruction. It also serves data to the Tier-1 sites, with aggregate sustained transfer rates for raw, ESD,

Figure 3. Workflow of the ATLAS experiment: data taking at Tier-0 (raw data stored; raw/ESD/AOD produced from observed data only), reprocessing at Tier-1 and Monte Carlo simulation and analysis at Tier-2 (both observed and simulated data), with improving algorithms feeding back toward physics discovery, all coordinated by a data-management layer.

Disk with silicon sensors as an endcap of the ATLAS silicon strip detector in its testbox at the Nationaal Instituut voor Subatomaire Fysica,
Amsterdam, The Netherlands.

and AOD in excess of 1GB/sec over the WAN or dedicated fiber. Tier-1 sites are also computation- and storage-intensive, since they store new data samples while permanently running reconstruction of older data samples with newer algorithms. Tier-2 sites are primarily CPU-intensive, since they generally run complex Monte Carlo simulations and user analyses while only transiently storing data, with archival copies of interesting data kept at Tier-1 sites.

The ATLAS experimental workflow involves a combination of observed and simulated data, as outlined in Figure 3. The "Data Taking" component consumes the raw data and produces ESD, AOD, and tags that are replicated to a subset of the Tier-1 sites and, in the case of AOD, to all Tier-1s and Tier-2s, where each Tier-2 site receives AOD only from its parent Tier-1 site. The "Reprocessing" component at each Tier-1 site reads older data and produces new versions of ESD and AOD data, which is sent to a subset of other Tier-1 sites and, in the case of AOD, to all sites. The primary difference between the first reconstruction at the Tier-0 site and later reconstructions at the Tier-1 sites is due to better understanding of the detector's behavior. Simulated data is used for this reconstruction.

Simulated data, using Monte Carlo techniques, is required to understand the behavior of the detector and help validate physics algorithms. The ATLAS machine is physically very large and complex, at 45 meters × 25 meters and weighing more than 7,000 tons, including more than 100 million electronic channels and 3,000 kilometers of cable. A precise understanding of the machine's behavior is required to fine-tune the algorithms that process its data and reconstruct simulation data from the observed electrical signals. This is the role of the Monte Carlo simulations, using large statistics and enough data to compensate for and understand the machine's bias. These simulations are run at Tier-2 sites by the "MC Simulation" component in Figure 3, with results sent to a Tier-1 site.

Finally, the "User Analysis" component aims to answer specific physics questions, using a model built with specific analysis algorithms. This model is then validated and improved against Monte Carlo data to compensate for the machine's bias, including background noise, and validated against real, observed data, eventually testing the user's hypothesis.

Beyond observational and simulation data and its hybrids, researchers discuss special cases of "information-intensive" data.16 Sociology, biology, psychology, and other sciences apply research on heterogeneous data derived from both observation and simulation under various conditions. For example, in biology, research is conducted on data collected by biologists under experimental conditions


intended for a variety of purposes. Scientists involved in combustion chemistry, nanoscience, and the environment must perform research on data related to a variety of phenomena concerning objects of interest ranging from particles to devices and from living organisms to inorganic substances. Information-intensive data is characterized by heterogeneity, in both representation and the way scientists use it in the same experiment. Management of such data emphasizes logical organization and description, as well as integration.

Independent of the data type characterizing particular scientific data, its management is essentially divided into coarse phases: workflow management, management of metadata, data integration, data archiving, and data processing:

Workflow management. In a simple scenario, a scientific experiment is performed according to a workflow that dictates the sequence of tasks to be executed from the beginning until the end of the experiment. The tasks coarsely define the manner and means for implementing the other phases of data management: data acquisition, metadata management, data archiving, and data processing. Task deployment may be serial or parallel and sequential or looping. Broad and long-lasting experiments encompass sets and hierarchies of partial experimental studies performed according to complex workflows. In general, many experiments could share a workflow, and a workflow could use results from many experiments.

Scientific workflow management systems have been a topic for much research over the past two decades, involving modeling and enactment15 and data preservation.12 Numerous related scientific products are used in the sciences.

Metadata management. Raw data is organized in a logically meaningful way and enriched with respective metadata to support the diagnosis of unexpected results and the (independent) reinvestigation or reinterpretation of results, ideally automatically. Metadata includes information on data acquisition (such as parameters of the detectors for observational data and simulation for simulation data), as well as administrative data about the experiments, data model, and measurement units. Other kinds of metadata are extracted from raw data, possibly through ontologies. Metadata may also denote data relationships and quality. Annotating the data, all metadata (accumulated or extracted) is critical to deducing experimental conclusions. The metadata and workflows often complement one another, necessitating combined management.7

Data and process integration. Data and its respective metadata may be integrated such that they can be manipulated as a unit. Moreover, newly collected data may be integrated with historic data representing different versions or aspects of the same experiment or belonging to different experiments in the same research track. The quality of the data, as well as its semantic interpretation, is crucial for data integration. Data quality can be achieved through data cleansing; semantic integration can be achieved through ontologies. Beyond data integration, process integration is often needed to simplify the overall flow of the experiment and unify partial results. For example, integration may be necessary or desirable for data mining algorithms that facilitate feature extraction and provenance. Process integration might necessitate creation or customization of middleware allowing for interoperation among different procedures and technologies. Automatic integration of data and processes is highly desirable for ease of use but also because querying information as a unit allows parallel and online processing of partial results. Moreover, scientists want transparent access to all data. Automatic integration assumes customizable tools implementing generic integration solutions appropriate for scientific data. A notably challenging task is how to identify commonalities in scientific experiments and data in order to create integration tools.

Data archiving. After they've met the expected standards of data content, scientists archive the data so other scientists are able to access and use it in their own research. The data must first be stored using robust, reliable storage technology, a mandatory task, especially for raw observational data derived from experiments that cannot be replayed or replayed only at prohibitively high cost. The complete data set is usually archived on tape, with selected parts stored on disks. Portions of the data might also have to be cached temporarily during migration between tape and disk or between various computers.

Beyond archiving master copies, data replication on multiple sites may be necessary to accommodate geographically dispersed scientists. All the challenges of distributing and replicating data management come into play when coordinating the movement of large data volumes. Efforts are under way to manage these replication tasks automatically.4

Archiving scientific data is usually performed by storing all past data versions, as well as the respective metadata (such as documentation or even human communication like email). Nevertheless, the problem of organizing archives relates to the general research problem of data versioning, so solutions to the versioning problem can be adapted to archiving scientific data. Representative versioning solutions (such as the concurrent versions system) compute differences between sequential versions and use the differences for version reconstruction. Recent proposals targeting scientific data3 exploit the data's hierarchical structure in order to summarize and merge versions.
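The difference-based versioning idea is easy to demonstrate with the standard library. A hedged sketch with hypothetical metadata files: archive only the diffs between sequential versions and rebuild any version from them.

import difflib

versions = [
    ["run=1", "detector=ATLAS", "events=100"],
    ["run=1", "detector=ATLAS", "events=250"],
    ["run=2", "detector=ATLAS", "events=250", "reprocessed=yes"],
]  # hypothetical metadata files, one list of lines per version

# Archive only the differences between sequential versions.
deltas = [list(difflib.ndiff(versions[i], versions[i + 1]))
          for i in range(len(versions) - 1)]

def reconstruct(n):
    """Rebuild version n from the archived deltas (2 = the 'after' side)."""
    return versions[0] if n == 0 else list(difflib.restore(deltas[n - 1], 2))

assert reconstruct(2) == versions[2]
# Real systems store compact deltas (as CVS does) rather than ndiff output,
# but the reconstruct-by-difference principle is the same.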
Data processing. Data is analyzed to extract evidence supporting scientific conclusions, ultimately yielding research progress. Toward this end, the data must undergo a series of procedures specified by scientists in light of the goal of their respective experiments. These procedures usually involve data clustering, mining, and lineage, leading to the inference of association rules and abnormalities, as well as to computation for feature identification and tracking.

Data analysis is often tightly correlated with data visualization, especially when it comes to simulation data. Scientists want a visual representation of the data to help them recognize coarse associations and abnormalities. Interleaving the steps involved in visualization and analysis yields the


right data regions for testing hypotheses and drawing conclusions. The key to efficient data processing is a carefully designed database, which is why automated physical database design is the subject of recent research (discussed in the following section). In addition, there is an imminent need for online data processing (discussed in the section after that).

Automation
Errors and inefficiencies due to human-handled physical database design are common in both metadata management and data processing. Much recent research has focused on automating procedures for these two phases of scientific data management.

Metadata management. Metadata processing involves determining the data model, annotations, experimental setup, and provenance. The data model can be generated automatically by finding dependencies between different attributes of data.10 However, experimenters typically determine the model since this is a one-time process, and dependencies (such as A = πr2) are easily identified at the attribute level.

Annotations are meta-information about the raw scientific data and especially important if the data is not numeric. For example, annotations are used in biology and astronomy image data. Given the vast scale of scientific data, automatically generating these annotations is essential. Current automated techniques for gathering annotations from documents involve machine-learning algorithms, learning the annotations through a set of pre-annotated documents.14 Similar techniques are applied to images and other scientific data but must be scaled to terabyte or petabyte scale. Once annotations are built, they can be managed through a DBMS.

Experimental setups are generally recorded in notebooks, both paper and electronic, then converted to query-able digital records. The quality of such metadata is typically enforced through policies that must be as automated as possible. For example, when data is collected from instruments, instrument parameters can be recorded automatically in a database. For manually generated data, the policies must be enforced automatically.

For the ATLAS experiment, the parameters of the detectors, as well as the influence of external magnetic devices and collider configurations, are stored automatically as metadata. Some policies can be enforced automatically through a knowledge base of logical statements; the rest can be verified through questionnaires. Many commercial tools are available for validating policies in the enterprise scenario, and the scientific community can borrow technology from them to automate the process (http://www.compliancehome.com/).

Provenance data includes experimental parameters and task history associated with the data. Provenance can be maintained for each data entry or for each data set. Since workload management tracks all tasks applied to the data, it can automatically tag it with task information. Hence, automating provenance is the most straightforward of the metadata-processing automation tasks. The enormous volume of automatically collected metadata easily complicates the effort to identify the relevant subset of metadata for the processing task at hand. Some research systems are capable of automatically managing a DBMS's provenance information.2
annotations is essential. Current au- termining how to store and access it.
tomated techniques for gathering an- Since scientific data might come in
notations from documents involve petabyte-scale quantities and many
machine-learning algorithms, learn- scientists work on the same data si-
ing the annotations through a set of multaneously, the requirements for
pre-annotated documents.14 Similar efficient data organization and re-
techniques are applied to images trieval are demanding. Furthermore,
and other scientific data but must be the data might be distributed or rep-
scaled to terabyte or petabyte scale. licated in multiple geographically
Once annotations are built, they can dispersed systems; hence, network
be managed through a DBMS. resources play an important role in
Experimental setups are gener- facilitating data access. Possibly hun-
ally recorded in notebooks, both pa- dreds or thousands of scientists could
per and electronic, then converted to simultaneously query a petabyte-scale
query-able digital records. The quality database over the network, requiring
of such metadata is typically enforced more than 1GB/sec bandwidth.
through policies that must be as au- To speed data access, the database
tomated as possible. For example, administrator might have to tune sev-
when data is collected from instru- eral parameters, changing the data’s
ments, instrument parameters can be logical design by normalizing the
recorded automatically in a database. data or its physical design. The logi-
For manually generated data, the poli- cal design is determined by the data
cies must be enforced automatically. model in the metadata-processing


phase of a science experiment. The physical design determines optimal data organization and location, caching techniques, indexes, and other performance-enhancing techniques. All depend on the data-access pattern, which is dynamic; hence, it changes much more frequently than the logical design, and physical design automation is therefore critical for efficient data processing.

Considering the number of parameters involved in the physical design of a scientific database, requiring the database administrator to specify and optimize the parameters for all these techniques is unreasonable. Data storage and organization must be automated.

All DBMSs today provide techniques for tuning databases. Though the provision of these techniques is a step in the right direction, existing tools are insufficient for four main reasons:

Precision. They require the query workload to be static and precise;
Relational databases. They consider only auxiliary structures to be built on relational databases and do not consider other types of data organization;
Static database. They assume a static database, so the statistics in the database are similar to the statistics at the time the tool is run; and
Query optimizer. They depend on the query optimizer to direct their search algorithms, making them slow for large workloads.

Recent database research has addressed these inherent DBMS limitations. For example, some techniques do not require prespecifying the workload,1 and others make the cost model more efficient, enabling more thorough search in the design space.18 However, they also fall short in several areas; for example, they are not robust to changes in database statistics and do not consider data organization other than relational data. Likewise, data-organization methods for distributed data and network caches are nascent today. Automatically utilizing multiple processing units tuned for data-intensive workloads to scale the computation is a promising research direction, and systems (such as GrayWulf24) apply this technique to achieve scalability.

Physical and logical design-automation tools must consider all parameters and suggest optimal organization. The tools must be robust to small variations in data and query changes, dynamically suggesting changes in the data organization when the query or data changes significantly.

Online Processing
Most data-management techniques in the scientific community are offline today; that is, they provide the full result of the computation only after processing an entire data set. However, the ever-growing scale of scientific data volume necessitates that even simple processes, such as one-time data movement, checksum computation, and verification of data integrity, might have to run for days before completion.

Simple errors can take hours to be noticed by scientists, and restarting the process consumes even more time. Therefore, it is important that all processing of scientific data be performed online. Converting the processes from offline to online provides the following benefits:

Efficiency. Many operations can be applied in a pipeline manner as data is generated or moved around. The operations are performed on the data when it is already in memory, which is much closer to the CPU than to a disk or tape. Not having to read from the disk and write computation results back saves hours to days of scientific work, giving scientists more time to investigate the data.
dressed these inherent DBMS limita- Feedback. Giving feedback to the with online visualization, can simul-
tions. For example, some techniques operations performed on the scientif- taneously help debug such programs
do not require prespecifying the ic data is important, because it allows and parameters. The system’s opera-
workload,1 and others make the cost scientists to plan their analysis accord- tions should allow an observer to gen-
model more efficient, enabling more ing to the progress of the operation. erate snapshots of the simulations or
thorough search in the data space.18 Modern DBMSs typically lack a prog- operations and, if possible, control
However, they also fall short in several ress indicator for queries, hence sci- the simulation to remove potential
areas; for example, they are not robust entists running queries or other pro- problems. Manual intervention in
enough to change database statistics cesses on DBMSs are typically blind to an otherwise automatic process is
and do not consider data organization the completion time of their queries. called “computational steering”; for
other than relational data. Likewise, This blindness may lead to canceling example, in parallel programs, ob-
data-organization methods for dis- the query and issuing a different one servers could decide which deadlocks
tributed data and network caches are or abandoning the DBMS altogether. to break or when simulations should
nascent today. Automatically utilizing DBMSs usually allow a query issuer change a parameter on the fly.
multiple processing units tuned for to compute the “cost” of a query in a Software for computational steer-
data-intensive workloads to scale the unit specific to the DBMS. This cost is ing includes the scientific program-
computation is a promising research not very useful to scientists, since it ming environment SciRun (http://
direction, and systems (such as Gray- doesn’t correspond to actual running www.sci.utah.edu/cibc/software/106-
Wulf24) apply this technique to achieve time or account for the complete set of scirun.html). Nevertheless, software
scalability. resources (such as memory size, band- must support simulations with pet-

j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t he acm 75
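The gap between the two modes is visible even for the simplest of these processes. As a minimal sketch (hypothetical code, not drawn from any system named in this article), the following computes a checksum online, in a pipeline over chunks that are already in memory, while emitting the kind of progress feedback discussed above:

```python
import hashlib
import os

def online_checksum(path, chunk_size=64 * 1024 * 1024, report_every=16):
    """Compute a SHA-256 checksum incrementally, reporting progress.

    Unlike an offline implementation that returns only after reading
    the entire file, this version does its work while each chunk is in
    memory and emits feedback a scientist can act on (or cancel early).
    """
    total = os.path.getsize(path)
    digest = hashlib.sha256()
    done = 0
    with open(path, "rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(chunk_size), b"")):
            digest.update(chunk)  # pipeline step: data is already in memory
            done += len(chunk)
            if i % report_every == 0:
                print(f"progress: {100.0 * done / total:.1f}%")
    return digest.hexdigest()
```

An offline variant would differ only in staying silent until the final chunk, which is exactly the blindness that leads scientists to cancel long-running operations.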
Data and Process Integration

Large-scale experiments organized by scientists collect and process huge amounts of raw data. Even if the original data is reorganized and filtered in a way that keeps only the interesting parts for processing, these interesting parts are still big. The reorganized data is augmented with large volumes of metadata, and the augmented, reorganized data must be stored and analyzed.

Scientists must collaborate with computer engineers to develop custom solutions supporting data storage and analysis for each experiment. In spite of the effort involved in such collaborations, the experience and knowledge gained this way is not generally disseminated to the wider scientific community and does not benefit next-generation experimental setups. Computer engineers must therefore develop generic solutions for storage and analysis of scientific data that can be extended and customized, reducing the overhead of time-consuming collaborations. Developing generic solutions is feasible, since many low-level commonalities exist in the representation and analysis of experimental data.

Management of generic physical models. Experimental data tends to have common low-level features not only across experiments of the same science, but across all sciences. For example, reorganized raw data enhanced with metadata usually involves complex structures that fit the object-oriented model. Scientific data representation benefits from inheritance and encapsulation, two fundamental innovations of the object-oriented data model.

Beyond its complexity in terms of representation, scientific data is characterized by complex interdependencies, leading to complex queries during data processing and analysis. Even though the object-oriented model is suitable for the representation of scientific data, it cannot efficiently optimize and support complex queries. Nevertheless, most scientific data is represented by objects with strong commonalities with respect to their structural elements. DBMSs must be extended so they manage the common structural elements of scientific data representations as generic database objects, and database support for these objects must include efficient ways to store and index data.

Experimental data derived from simulations is frequently represented as meshes, ranging from structured to unstructured and consisting of tetrahedra, hexahedra, or n-facet cells. For example, an earthquake-simulation data set may be represented as an unstructured hexahedral mesh. A typical volume of earth, say, 100km × 100km × 30km, is represented by a mesh consisting of roughly one billion nodes and one billion elements requiring about 50GB of storage; such a mesh is capable of resolving seismic waves up to 2Hz. Scientific data management would benefit greatly if DBMSs offered storage and indexing methods for meshes. Initial efforts toward supporting meshes in DBMSs were presented in research19 and commercial products.8

Multidimensional data also needs storage and indexing methods. Most scientific data is represented as multidimensional arrays, but support for multidimensional arrays in RDBMSs is poor. Computer engineers must produce custom solutions for manipulating multidimensional data, leading to many domain-specific data formats, including netCDF (http://www.unidata.ucar.edu/software/netcdf/) and HDF (http://www.hdfgroup.org/) for climate data; FITS (http://heasarc.gsfc.nasa.gov/docs/heasarc/fits.html) for astronomical data; and ROOT (http://root.cern.ch/drupal/) for high-energy physics data.

An experimental study5 showed that, even when using array primitives in RDBMSs, native file formats outperformed the relational implementation by a factor of 20 to as much as 80. Proposed scientific DBMSs6,23 provide multidimensional arrays as first-class types, aiming to bridge the gap between DBMSs and native files in the process. Multidimensional arrays are also present in the multidimensional online analytical processing (MOLAP) implementations of mainstream DBMSs, which allow fast exploratory analysis of the data by pre-computing aggregations on multiple dimensions. However, MOLAP needs a significant amount of offline processing and an enormous amount of disk space to store the pre-computed aggregations, making the approach unsuitable for the enormous scale of scientific data. Attempts to support exploratory ad hoc OLAP queries on large data sets, including wavelet-based ones, promise to enable fast, powerful analysis of scientific data.21

A frequently used type of scientific data is time-and-space-based observations, meaning interesting data sets are trajectories in space and time. As trajectories are not inherently supported by databases, data points are usually stored individually and processed for line-fitting during scientific analysis. Line-fitting is a resource-consuming task and could be avoided if a DBMS inherently supported trajectories. Inherent support for trajectories is related to multidimensional-array support, since trajectories are actually polylines, and each line can be approximated (fitted) by functions. Promising research results have been reported for managing trajectories.22

DBMSs should inherently support new data types while accounting for the specialized use of the new types for representing scientific data. For example, the Hierarchical Triangular Mesh method11 subdivides spherical surfaces so objects localized on a sphere can be indexed and queried efficiently. Scientific data is usually persistent, meaning it is rarely (if ever) changed or involved in complex associations. While developing support mechanisms (such as indexes) for new data types, priority must go to search rather than to update efficiency. The persistence of scientific data alleviates a major requirement, making it possible to develop efficient indexes for the new types.

Management of generic data processing. Scientific data processing differs from experiment to experiment and discipline to discipline. No matter how wide the scope of processing and how great the overall heterogeneity, processing frequently encompasses generic procedures.
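To make the MOLAP trade-off described earlier in this section concrete, here is an illustrative sketch (the array, dimension names, and shapes are all hypothetical) that pre-computes one aggregate per subset of dimensions; exploratory queries become lookups, but the number of stored aggregates (and the offline work to build them) grows exponentially with dimensionality:

```python
from itertools import combinations
import numpy as np

# A tiny hypothetical "cube": measurements indexed by (time, lat, lon).
cube = np.random.rand(24, 180, 360)
dims = ("time", "lat", "lon")

# Pre-compute one aggregate per nonempty subset of dimensions
# (2^d - 1 aggregates; this is the offline cost and space blow-up).
aggregates = {}
for k in range(1, len(dims) + 1):
    for axes in combinations(range(len(dims)), k):
        key = tuple(dims[i] for i in axes)
        aggregates[key] = cube.sum(axis=axes)  # collapse those dimensions

# Exploratory queries are now lookups instead of scans:
per_cell_total = aggregates[("time",)]            # 180 x 360 map over time
grand_total = aggregates[("time", "lat", "lon")]  # single overall total
```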
It is most common that scientific data is searched in order to find interesting regions with respect to prespecified characteristics, data regions, and abnormalities. Moreover, metadata processing consists of data annotation as well as feature extraction. Data tagged with parameter values refers to the conditions of the experiment and is mined to deduce common characteristics and behavior rules.

Generic processes that produce metadata (such as those just mentioned) must be supported inherently by the DBMS for all generic physical models in a parameterized manner, thus bringing processing close to the data and leading to reduced data movement and reorganization, along with efficient execution of processing inside the DBMS. Each generic physical model must include templates supporting the generic procedures for the model in a customizable manner. For example, the particles in the ATLAS experiment are tracked using spatiotemporal attributes. Even though the data sets are enormous, only a small amount of space and time is populated by particles. The data sets would therefore benefit from a generic, customizable DBMS procedure supporting compression.

Scientists would benefit even more if the definition and customization of the templates could be performed using a declarative language. Such a language would give users intuitive guidance in specifying the customization procedure, as well as in combining and pipelining multiple procedures. In this way, the processing burden would be shifted to the DBMS, and scientists would not have to function as computer engineers.

File Management

The vast majority of scientific data is stored in files and manipulated through file systems, meaning all processing, from search to computation, is performed on the content of the files. Sophisticated frameworks have been proposed to manage the files over a large number of disks, including storage resource management technology (https://sdm.lbl.gov/srm-wg/index.html).

Existing persistent scientific data in files is huge and will not be moved to databases, even if databases come to support efficient scientific experimentation. Moreover, the tradition in applications that manipulate scientific data files is long, and implementing the same functionality in modules that are pluggable into DBMSs needs further effort. This long tradition and the need for pluggable capabilities mean that a full-fledged querying mechanism for files, similar to that of DBMSs, is needed. Such a querying mechanism can be constructed in either of two ways: enhance current DBMSs so they uniformly manage both structured data and unstructured data in files; or create a management layer on top of both the DBMS and the file system to enable transparent querying of structured and unstructured data. Each approach has advantages and disadvantages.
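The second option, a management layer that routes queries, can be made concrete with a toy routing core. Everything below is hypothetical (the article prescribes no implementation; the backends and their execute interface are invented for illustration), but it shows the central decision such middleware must make, transcribing and directing each declarative query to the DBMS or to a file-query engine, a challenge discussed further below:

```python
class QueryRouter:
    """Toy management layer over a DBMS and a file store.

    Queries over registered relational tables go to the DBMS; queries
    over known file collections go to a file-query engine. Both backends
    are assumed (hypothetically) to expose an execute(query) method.
    """

    def __init__(self, dbms, file_engine, tables, file_sets):
        self.dbms = dbms
        self.file_engine = file_engine
        self.tables = set(tables)        # structured data, DBMS-managed
        self.file_sets = set(file_sets)  # unstructured data left in files

    def execute(self, source, query):
        if source in self.tables:
            return self.dbms.execute(query)        # planned and optimized
        if source in self.file_sets:
            return self.file_engine.execute(query)  # search-only mechanism
        raise KeyError(f"unknown data source: {source}")
```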
Enhancing a DBMS to manage files and data means that all mechanisms in the system should be extended for files. Querying over files is assumed to be efficient, since it would benefit from sophisticated database structures (such as indexes, autonomic database organization, and database query planning). Moreover, querying structured and unstructured data together creates an opportunity for tight interaction among queries and query results, and for refined optimization in intermediate querying steps.

Extending DBMSs to manage files is a challenge for the data-management community, since it entails reconsideration of many database protocols and a total rebuild of all database procedures with new enhancements for unstructured data. Yet it is inevitable that such a breakthrough in the functionality of DBMSs will involve substantial redundancy, since a big part of database management (such as transaction management) is useless in the manipulation of scientific data. Recent research efforts seeking to integrate scientific files into DBMSs include Netlobs, a netCDF cartridge for Oracle (http://datafedwiki.wustl.edu/images/f/ff/Dews_poster_2006.ppt), and work by Barrodale Computing Services, Ltd., on Postgres (http://www.barrodale.com/), but these are not generic enough for other file formats (such as HDF and ROOT). Aiming at file manipulation, the Data Format Description Language Work Group project (http://forge.gridforum.org/projects/dfdl-wg) is developing an XML-based language for describing the metadata of files. However, these approaches do not provide complete support for DBMS features on the files. MapReduce-based techniques for processing data stored in files17,20 do not replicate DBMS functionality and are mainly used for batch processing of files.

A management level on top of the database and file system that manages structured and unstructured data separately but transparently seems more feasible in the near future. The challenge in developing this approach is twofold. First, it is necessary to construct middleware through which users define their queries in a declarative, high-level manner; the middleware must include a mechanism that transcribes queries as input for the DBMS or the file system and routes them appropriately. Second, a query mechanism dedicated to the file system must be developed; the benefit of a separate file-querying mechanism is that it includes only procedures targeted at querying, thereby avoiding the complications of database mechanisms (insert, delete, update) that serve data modification. However, the procedures involved in the querying mechanism must be designed and implemented from scratch, an approach that precludes uniform querying of both structured and unstructured data; it also means limited uniform query optimization and limited query-execution efficiency. Future grid middleware promises to support such integration (http://www.ogsadai.org.uk/), though the related research is still only nascent.

Conclusion

Scientific data management suffers from storage and processing limitations that must be overcome for scientific research to take on demanding experimentation involving data collection and processing. Future solutions promise to integrate automation, online processing, integration, and file management, but data manipulation must still address the diversity of experimentation tasks across the sciences, the complexity of scientific data representation and processing, and the volume of collected data and metadata. Nevertheless, data-management research in all these areas suggests the inherent management problems of scientific data will indeed be addressed and solved.

Acknowledgments

We would like to express our gratitude to Miguel Branco of CERN for contributing the dataflow of the ATLAS experiment, allowing us to demonstrate a scientific application requiring both observational and simulation data. We would also like to acknowledge the European Young Investigator Award by the European Science Foundation (http://www.esf.org/).

References
1. Bruno, N. and Chaudhuri, S. Online autoadmin: Physical design tuning. In Proceedings of the ACM International Conference on Management of Data (Beijing, June 11–14). ACM Press, New York, 1067–1069.
2. Buneman, P., Chapman, A., and Cheney, J. Provenance management in curated databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Chicago, June 27–29). ACM Press, New York, 2006, 539–550.
3. Buneman, P., Khanna, S., Tajima, K., and Tan, W. Archiving scientific data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Madison, WI, June 3–6). ACM Press, New York, 2002, 1–12.
4. Chervenak, A.L., Schuler, R., Ripeanu, M., Amer, M.A., Bharathi, S., Foster, I., Iamnitchi, A., and Kesselman, C. The Globus Replica Location Service: Design and experience. IEEE Transactions on Parallel and Distributed Systems 20, 9 (Sept. 2009), 1260–1272.
5. Cohen, S., Hurley, P., Schulz, K.W., Barth, W.L., and Benton, B. Scientific formats for object-relational database systems: A study of suitability and performance. SIGMOD Record 35, 2 (June 2006), 10–15.
6. Cudre-Mauroux, P., Kimura, H., Lim, K., Rogers, J., Simakov, R., Soroush, E., Velikhov, P., Wang, D.L., Balazinska, M., Becla, J., DeWitt, D., Heath, B., Maier, D., Madden, S., Patel, J., Stonebraker, M., and Zdonik, S. A demonstration of SciDB: A science-oriented DBMS. Proceedings of the VLDB Endowment 2, 2 (Aug. 2009), 1534–1537.
7. Davidson, S.B. and Freire, J. Provenance and scientific workflows: Challenges and opportunities. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Vancouver, B.C., June 9–12). ACM Press, New York, 1345–1350.
8. Gray, J. and Thomson, D. Supporting Finite-Element Analysis with a Relational Database Backend, Parts I–III. MSR-TR-2005-49, MSR-TR-2006-21, MSR-TR-2005-151. Microsoft Research, Redmond, WA, 2005.
9. Hey, T., Tansley, S., and Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft, Redmond, WA, Oct. 2009.
10. Ilyas, I.F., Markl, V., Haas, P., Brown, P., and Aboulnaga, A. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Paris, June 13–18). ACM Press, New York, 2004, 647–658.
11. Kunszt, P.Z., Szalay, A.S., and Thakar, A.R. The hierarchical triangular mesh. In Proceedings of the MPA/ESO/MPE Workshop (Garching, Germany, July 31–Aug. 4). Springer, Berlin, 2000, 631–637.
12. Liu, D.T., Franklin, M.J., Abdulla, G.M., Garlick, J., and Miller, M. Data-preservation in scientific workflow middleware. In Proceedings of the 18th International Conference on Scientific and Statistical Database Management (July 3–5). IEEE Computer Society, Washington, DC, 2006, 49–58.
13. Mishra, C. and Koudas, N. A lightweight online framework for query progress indicators. In Proceedings of the IEEE International Conference on Data Engineering (Istanbul, Apr. 15–20). IEEE Press, 1292–1296.
14. Müller, A. and Sternberg, M. Structural annotation of the human genome. In Proceedings of the German Conference on Bioinformatics (Braunschweig, Germany, Oct. 7–10). German Research Center for Biotechnology, Braunschweig, 2001, 211–212.
15. Ngu, A.H., Bowers, S., Haasch, N., McPhillips, T., and Critchlow, T. Flexible scientific workflow modeling using frames, templates, and dynamic embedding. In Proceedings of the 20th International Conference on Scientific and Statistical Database Management (Hong Kong, July 9–11). Springer-Verlag, Berlin, Heidelberg, 2008, 566–572.
16. Office of Data Management Challenge. Report from the DOE Office of Science Data Management Workshops, Mar.–May 2004; http://www.er.doe.gov/ascr/ProgramDocuments/Docs/Final-report-v26.pdf
17. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Vancouver, B.C., June 9–12). ACM Press, New York, 2008, 1099–1110.
18. Papadomanolakis, S., Dash, D., and Ailamaki, A. Efficient use of the query optimizer for automated physical design. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, Sept. 23–27). VLDB Endowment, 2007, 1093–1104.
19. Papadomanolakis, S., Ailamaki, A., Lopez, J.C., Tu, T., O'Hallaron, D.R., and Heber, G. Efficient query processing on unstructured tetrahedral meshes. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Chicago, June 27–29). ACM Press, New York, 2006, 551–562.
20. Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming 13, 4 (Oct. 2005), 277–298.
21. Shahabi, C., Jahangiri, M., and Banaei-Kashani, F. ProDA: An end-to-end wavelet-based OLAP system for massive datasets. Computer 41, 4 (Apr. 2008), 69–77.
22. Spaccapietra, S., Parent, C., Damiani, M.L., de Macedo, J.A., Porto, F., and Vangenot, C. A conceptual view on trajectories. Data & Knowledge Engineering 65, 1 (Apr. 2008), 126–146.
23. Stonebraker, M., Bear, C., Çetintemel, U., Cherniack, M., Ge, T., Hachem, N., Harizopoulos, S., Lifter, J., Rogers, J., and Zdonik, S.B. One size fits all? Part 2: Benchmarking studies. In Proceedings of the Conference on Innovative Data Systems Research (Asilomar, Jan. 7–10, 2007), 173–184.
24. Szalay, A., Bell, G., VandenBerg, J., et al. GrayWulf: Scalable clustered architecture for data-intensive computing. In Proceedings of the Hawaii International Conference on System Sciences (Waikoloa, Jan. 5–8). IEEE Computer Society Press, 2009, 1–10.

Anastasia Ailamaki (natassa@epfl.ch) is director of the Data-Intensive Applications and Systems Laboratory and a professor at Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, and an adjunct professor at Carnegie Mellon University, Pittsburgh, PA.

Verena Kantere (verena.kantere@epfl.ch) is a postdoctoral researcher in the Data-Intensive Applications and Systems Laboratory at Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

Debabrata Dash (debabrata.dash@epfl.ch) is a Ph.D. student in the Data-Intensive Applications and Systems Laboratory at Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

© 2010 ACM 0001-0782/10/0600 $10.00
doi:10.1145/1743546.1743569

Conference acceptance rate signals future impact of published conference papers.

By Jilin Chen and Joseph A. Konstan

Conference Paper Selectivity and Impact
Studying the metadata of the ACM Digital Library (http://www.acm.org/dl), we found that papers in low-acceptance-rate conferences have higher impact than those in high-acceptance-rate conferences within ACM, where impact is measured by the number of citations received. We also found that highly selective conferences—those that accept 30% or less of submissions—are cited at a rate comparable to or greater than ACM Transactions and journals.

key insights

- Papers published in highly selective CS conferences are cited more often on average than papers published in Transactions and journals.
- Conference selectivity serves two purposes: pick the best submitted papers and signal prospective authors and readers about conference quality.
- Below a certain acceptance rate, selectivity can backfire; conferences rejecting 85% or more of their submissions risk discouraging overall submissions and inadvertently filtering out high-impact research.

In addition, the higher impact of selective conferences cannot be explained solely by a more strict filtering process; selectivity signals authors and/or readers of the quality of a venue and thus invites higher-quality submissions from authors and/or more citations from other authors.

Low-acceptance-rate conferences with selective peer-review processes distinguish computer science from other academic fields, where only journal publication carries real weight. The focus on conferences challenges the
field in two ways: how to assess the importance of conference publication, particularly compared to journal publication, and how to manage conferences to maximize the impact of the papers they publish. "Impact factor" (average citation rate) is the commonly used measure of the influence of a journal on its field. While nearly all computer scientists have strong intuitions about the link between conference acceptance rate and a paper's impact, we are aware of no systematic studies examining that link or comparing conference and journal papers in terms of impact.

This article addresses three main questions: How does a conference's acceptance rate correlate with the impact of its papers? How much impact do conference papers have compared to journal papers? And to what extent does the impact of a highly selective conference derive from filtering (the selectivity of the review process) vs. signaling (the message the conference sends to both authors and readers by being selective)? Our results offer guidance to conference organizers, since acceptance rate is one of the few parameters they can control to maximize the impact of their conferences. In addition, our results inform the process of evaluating researchers, since we know that computer scientists often defend the primary publication of results in conferences, particularly when being evaluated by those outside the field (such as in tenure evaluations).2 Finally, we hope these results will help guide individual researchers in understanding the expected impact of publishing their papers in the various venues.

Data and Methodology
We based our study on ACM Digital Library metadata for all ACM conference and journal papers as of May 2007, as well as on selected other papers in the ACM Guide to Computing Literature for which metadata was available. Since there is no established metric for measuring the scientific influence of published papers, we chose to estimate a paper's influence as the number of times it was cited in the two years following publication, referred to as citation count or simply as impact. We excluded from this count "self-citations" in subsequent papers by the authors of the original work. Using citations as a measure of scientific influence has a long tradition, including the journal impact factor.1 We chose two years as a compromise between measuring long-term impact and the practical importance of measuring the impact of more recent work.1 Less than two years might be too short for the field to recognize the worth of a paper and cite it. More than two years would have excluded more recently published papers from our analysis due to insufficient time after publication, and so would not have allowed us to include the current era of widespread use of the Digital Library (note a).

For conferences, we counted only full papers, since they represent authors' attempts to publish high-impact work, rather than posters and other less-rigorously reviewed material that might also appear in conference proceedings. Conferences for which acceptance rates were not available were excluded as well. For journals, we included only titled ACM Transactions and journals; only these categories are generally viewed as archival research venues of lasting value.

Finally, since our data source was limited to the metadata in the ACM Guide, our analysis considered only citations from within that collection and ignored all citations from conferences and journals outside of it; this was a pragmatic constraint because, in part, other indexing services do not comprehensively index conference proceedings. While it means that all our numbers were underestimates and that the nature of the underestimates varied by field (we expected to significantly underestimate artificial-intelligence and numerical-computation papers due to the large number of papers published by SIAM and AAAI outside our collection), such underestimates were not biased toward any particular acceptance rate in our data set (note b). Therefore, this limitation did not invalidate our results.

Note a: To ensure that the two-year citation count was reasonable, we repeated this analysis using four- and eight-year citation counts; the distributions and graphs were similar, and the conclusions were unchanged.

Note b: We hand-checked 50 randomly selected conference papers receiving at least one citation in our data set, comparing citation count in the data set against citation count according to Google Scholar (http://scholar.google.com/). When trying to predict Google Scholar citation count from ACM citation count in a linear regression, we found an adjusted R-square of 0.852, showing that overall ACM citation count is proportional to Google Scholar citation count with small variation. When added as an additional parameter to the regression, acceptance rate had a nonsignificant coefficient, showing that acceptance rate does not have a significant effect on the difference between ACM citation count and Google Scholar citation count. We also hand-checked 50 randomly selected conference papers receiving no citations in our data set, finding no correlation between acceptance rate and Google Scholar citation count.
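Stated as code, the metric is a simple filter over citation metadata. The sketch below is hypothetical (the data model is invented for illustration and is not the ACM Guide's schema), computing the two-year count while excluding self-citations as described above:

```python
def two_year_citation_count(paper_id, papers, citations):
    """Two-year citation count excluding self-citations.

    papers: paper id -> {"year": int, "authors": set of author names}
    citations: paper id -> set of ids of papers that cite it
    """
    cited = papers[paper_id]
    count = 0
    for citing_id in citations.get(paper_id, ()):
        citing = papers[citing_id]
        in_window = cited["year"] <= citing["year"] <= cited["year"] + 2
        self_cite = bool(cited["authors"] & citing["authors"])
        if in_window and not self_cite:
            count += 1
    return count
```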
Figure 1. Citation count distribution within two years of publication. (Histogram; x-axis: Number of Citations, 0–20; y-axis: Fraction of Papers; series: Conference Papers, Journal Papers.)

Figure 2. Average citation count vs. acceptance rate for ACM conferences. (Scatterplot with fitted curves; x-axis: Acceptance Rate; y-axis: Citation Count.)

Our analysis included 600 conferences comprising 14,017 full papers and 1,508 issues of journals comprising 10,277 articles published from 1970 to 2005. Their citation counts were based on our full data set of 4,119,899 listed references from 790,726 paper records, of which 1,536,923 references were resolved within the data set itself and could be counted toward citation counts. Overall, the conference papers had an average two-year citation count of 2.15 and the journal papers an average two-year citation count of 1.53. These counts follow a highly skewed distribution (see Figure 1), with over 70% of papers receiving no more than two citations. Note that while the average two-year citation count for conferences was higher than that for journals, the average four-year citation count for articles published before 2003 was 3.16 for conferences vs. 4.03 for journals; that is, on average, journals come out a little ahead of conference proceedings over the longer term.

Results
We addressed the first question—how a conference's acceptance rate correlates with the impact of its papers—by correlating citation count with acceptance rate; Figure 2 shows a scatterplot of average citation counts of ACM conferences (y-axis) by their acceptance rates (x-axis). Citation count varies substantially across the spectrum of acceptance rates, with a clear trend toward more citations for low acceptance rates; we observed a statistically significant correlation between the two values (each paper treated as a sample, F[1, 14015] = 970.5, p<.001; see note c) and computed both a linear regression line (each conference weighted by its size; adjusted R-square: 0.258, weighted residual sum of squares: 35,311) and a nonlinear regression curve of the form y = a + bx^(−c) (each conference weighted by its size; pseudo R-square: 0.325, weighted residual sum of squares: 32,222), as shown in Figure 2.

Note c: This F-statistic shows how well a linear relationship between acceptance rate and citation count explains the variance within citation count. The notation F[1, 14015] = 970.5, p<.001 signifies one degree of freedom for the model (from using only acceptance rate to explain citation counts), 14,015 degrees of freedom for error (from the more than 14,000 conference papers in our analysis), an F-statistic of 970.5, and a probability of less than 0.001 that the correlation between acceptance rate and citation count is the result of random chance.
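For readers who wish to reproduce the shape of this analysis, the nonlinear fit can be expressed with standard tools. The sketch below uses invented data (the study's actual per-conference values are not published here) and approximates the per-conference size weighting through the sigma argument:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    # The form reported in the article: y = a + b * x^(-c)
    return a + b * np.power(x, -c)

# Hypothetical per-conference data: acceptance rate, mean two-year
# citations, and number of papers (used as the regression weight).
rate = np.array([0.12, 0.18, 0.25, 0.33, 0.45, 0.58])
mean_cites = np.array([4.1, 3.6, 2.4, 1.7, 0.9, 0.4])
n_papers = np.array([120, 90, 150, 200, 80, 60])

# Smaller sigma means larger weight, so weight by conference size.
params, _ = curve_fit(model, rate, mean_cites, p0=(0.5, 0.1, 1.0),
                      sigma=1.0 / np.sqrt(n_papers))
a, b, c = params
print(f"fit: y = {a:.2f} + {b:.2f} * x^(-{c:.2f})")
```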
Figure 3 is an aggregate view of the data, in which we grouped conferences into bins according to acceptance rate and computed the average citation counts of each bin (note d). Citation counts for journal articles are shown as a dashed line for comparison. Conferences with rates less than 20% enjoyed an average citation count as high as 3.5. Less-selective conferences yielded fewer citations per paper, with the least-selective conferences (>55% acceptance rate) averaging less than half a citation per paper.

Note d: We excluded conferences with an acceptance rate under 10% or over 60%, as there were too few conferences in these categories for meaningful analysis.

Figure 3. Average citation count by acceptance rate within two years of publication. (Bar chart; x-axis: Acceptance Rate, binned 10%–15% through 55%–60%; y-axis: Citation Count; series: average citations for conferences and for journals.)

Figure 4. Citation count distribution by acceptance rate within two years of publication. (Stacked bands for citation thresholds 1+, 6+, 11+, 16+, 21+; x-axis: Acceptance Rate/Venue Type; y-axis: Fraction of Papers.)

Figure 4 shows the percentage of papers within each group whose citation count was above a certain threshold. The bottom bands (reflecting papers cited more than 10, 15, or 20 times in the following two years) show that high-acceptance-rate conferences have few papers with high impact. Also notable is the fact that about 75% of papers published in >55%-acceptance-rate conferences were not cited at all in the following two years.

Addressing the second question—how much impact conference papers have compared to journal papers—we found in Figures 3 and 4 that, overall, journals did not outperform conferences in terms of citation count; they were, in fact, similar to conferences with acceptance rates around 30%, far behind conferences with acceptance rates below 25% (T-test, T[7603] = 24.8, p<.001). Similarly, journals published as many papers receiving no citations in the next two years as conferences accepting 35%–40% of submissions, a much higher low-impact percentage than for highly selective conferences.

The same analyses over four- and eight-year periods yielded results consistent with the two-year period; journal papers received significantly fewer citations than papers from conferences where the acceptance rate was below 25%. Low-acceptance-rate conferences in computer science thus have a greater impact than the average ACM journal. The fact that some journal papers are expanded versions of already-published (and cited) conference papers is a confounding factor here. We do not have data that effectively tracks research contributions through multiple publications to assess the cumulative impact of ideas published more than once.

Pondering why citation count correlates with acceptance rate brings us to the third question—the extent to which the impact of a highly selective conference derives from filtering vs. signaling—as this correlation can be attributed to two mechanisms:

Filtering. A selective review process filters out low-quality papers from the submission pool, lowering the acceptance rate and increasing the average impact of published papers; and

Signaling. A low acceptance rate signals high quality, thus attracting better submissions and more future citations, because researchers simply prefer submitting papers to, reading the proceedings of, and citing publications from better conferences.

While filtering is commonly viewed as the whole point of a review process and thus likely explains the correlation to some extent, it is unclear whether signaling is also a factor. To address the third question, we therefore sought to establish the existence of signaling by separating its potential effect from filtering.

We performed this separation by normalizing the selectivity of filtering to the same level for different conferences. For example, for a conference accepting 90 papers at a 30% acceptance rate, the best potential average citation count the conference could have achieved by lowering the acceptance rate to, say, 10% for the same submission pool would be the average citation count of the top 30 most-cited papers of the 90 accepted (presumably the 30 best papers of the original 300 submitted). We treated these 30 papers as the top 10% of submissions in the pool; other submissions were either filtered out during the actual review or later received fewer citations. Their citation count was thus an upper-bound estimate of what might be achieved through stricter filtering, assuming conference program committees were able to pick exactly the submissions that would ultimately be the most highly cited. Using this normalization, we compared the same top portions of the submission pools of all conferences and evaluated the effect of signaling without the influence of filtering. We normalized all ACM conferences in Figure 3 to a 10% acceptance rate and compared the citation counts of their top 10% of submissions; Figure 5 (same format as Figure 3) outlines the results. We excluded Transactions and journals, as we were unable to get actual acceptance-rate data, which might also be less meaningful, given the multi-review cycle common in journals.
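The normalization itself reduces to a few lines. The following sketch (hypothetical data structures, not the authors' code) keeps, for each conference, only the most-cited papers that a 10% acceptance rate would have admitted from the same submission pool:

```python
def normalized_top_citations(accepted_citations, acceptance_rate,
                             target_rate=0.10):
    """Average citation count of the papers a stricter committee could
    have accepted at target_rate from the same submission pool.

    accepted_citations: citation counts of the papers actually accepted.
    acceptance_rate: the conference's actual acceptance rate (e.g., 0.30).
    """
    n_submitted = len(accepted_citations) / acceptance_rate
    n_keep = max(1, round(target_rate * n_submitted))
    top = sorted(accepted_citations, reverse=True)[:n_keep]
    return sum(top) / len(top)

# The article's example: 90 papers accepted at a 30% rate implies 300
# submissions, so a 10% rate keeps the 30 most-cited accepted papers.
cites = [12, 9, 7] * 30  # hypothetical counts for 90 accepted papers
print(normalized_top_citations(cites, acceptance_rate=0.30))
```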
Figure 5. Average citation count vs. acceptance rate within two years of publication, top 10% of submissions. (x-axis: Acceptance Rate/Venue Type; y-axis: Number of Citations; series: average citations for conferences and for journals.)

Figure 5 suggests that the citation count for the top 10% of submitted papers follows a trend similar to that of the full proceedings (F[1, 5165] = 149.5, p<.001), with generally higher counts for low acceptance rates. This correlation indicates that filtering alone does not fully explain the correlation between citation count and acceptance rate; other factors (such as signaling) play a role.

Discussion
Combining the results in Figures 3 and 5 provides further insight into the relationship between acceptance rate and citation count. For conferences with acceptance rates over 20%, the citation numbers in the figures drop almost consistently as the acceptance rate increases, suggesting that in this range a higher acceptance rate makes a conference lose out on citations not only for the proceedings as a whole but also for its best submitted papers. Either higher-quality papers are not submitted to higher-acceptance-rate conferences as frequently, or those submitted are not cited because readers do not explore these conferences as often as they explore lower-acceptance-rate conferences to find them.

The case of conferences with acceptance rates below 20% is more intriguing. Note that the lower impact of the 10%–15% group compared with the 15%–20% group in Figure 5 is statistically significant (T[1198] = 3.21, p<.002). That is, the top-cited papers from 15%–20%-acceptance-rate conferences are cited more often than those from 10%–15% conferences. We hypothesize that an extremely selective but imperfect (as review processes always are) review process has filtered out submissions that would have delivered impact if published. This hypothesis matches the common speculation, including from former ACM President David Patterson, that highly selective conferences too often choose incremental work at the expense of innovative breakthrough work.3

Alternatively, extremely low acceptance rates might discourage submissions by authors who dislike and avoid competition, or the perception that there is a "lottery" among good papers for a few coveted publication slots. A third explanation suggests that extremely low acceptance rates give a conference proceedings such limited focus that other researchers stop checking it regularly and thus never cite it. We consider all three to be plausible explanations; intuitively, all would hurt the impact of lower-acceptance-rate conferences more than they would hurt higher-acceptance-rate conferences.

Conclusion
Our results have several implications. First and foremost, computing researchers are right to view conferences as an important archival venue and to use acceptance rate as an indicator of future impact. Papers in highly selective conferences—acceptance rates of 30% or less—should continue to be treated as first-class research contributions with impact comparable to, or better than, journal papers.

Second, we hope to bring to the attention of conference organizers and program committees the insight that conference selectivity has a signaling value beyond simply separating good work from bad. Adopting the right selectivity level helps attract better submissions and more citations. Acceptance rates of 15%–20% seem optimal for generating the highest number of future citations for both the proceedings as a whole and the top papers submitted, though we caution that this guideline is based on ACM-wide data, and individual conferences should consider their goals and the norms of their subdisciplines in setting target acceptance rates. Furthermore, many conferences have goals separate from generating citations, and many high-acceptance-rate conferences might do a better job of getting feedback to early ideas, supporting networking among attendees, and bringing together different specialties.

Given the link between acceptance rate and future impact, further research is warranted into the degree to which a conference's reputation interacts over time with changes in its acceptance rate. Though a number of highly selective conferences have become more or less selective over time, we still lack enough data to clarify the effect of such changes. We hope that understanding them will yield new insight for conference organizers tuning their selectivity in the future.

Acknowledgments
This work was supported by National Science Foundation grant IIS-0534939. We thank our colleague John Riedl of the University of Minnesota for his valuable insights and suggestions.

References
1. Garfield, E. Citation analysis as a tool in journal evaluation. Science 178, 60 (Nov. 1972), 471–479.
2. National Research Council. Academic Careers for Experimental Computer Scientists and Engineers. U.S. National Academy of Sciences Report, Washington, D.C., 1994; http://www.nap.edu/catalog.php?record_id=2236
3. Patterson, D.A. The health of research conferences and the dearth of big idea papers. Commun. ACM 47, 12 (Dec. 2004), 23–24.

Jilin Chen (jilin@cs.umn.edu) is a doctoral student in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities.

Joseph A. Konstan (konstan@cs.umn.edu) is Distinguished McKnight Professor and Distinguished University Teaching Professor in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities.

© 2010 ACM 0001-0782/10/0600 $10.00
review articles

doi:10.1145/1743546.1743570

New algorithms provide the ability for robust but scalable image search.

By Kristen Grauman

Efficiently Searching for Similar Images

If a tree falls in the forest and no one is there to hear it, does it make a sound? In the realm of content-based image retrieval, the question is: if an image is captured and recorded but no one is there to annotate it, does it ever again make an appearance? Over the last decade we have witnessed an explosion in the number and throughput of imaging devices. At the same time, advances in computer hardware and communications have made it increasingly possible to capture, store, and transmit image data at low cost. Billions of images and videos are hosted publicly on the Web; cameras embedded in mobile devices are commonplace. Climatologists compile large volumes of satellite imagery in search of long-term trends that might elucidate glacial activity and its impact on water supplies. Centralized medical image databases archive terabytes of X-ray, CAT-scan, and ultrasound images, which may assist in new diagnoses.

key insights

- As it becomes increasingly viable to capture, store, and share large amounts of image and video data, automatic image analysis is crucial to managing visual information.
- Often the most effective metrics for image comparisons do not mesh well with existing efficient search methods; scalable image recognition and search techniques, however, aim to provide direct access to visual content.
- The capability to perform fast visual search is critical for many applications and, more generally, lays the groundwork for data-driven approaches in computer vision.

Image and video data are certainly rich with meaning, memories, or entertainment, and in some cases they can facilitate communication or scientific discovery. However, without efficient vision algorithms to automatically analyze and index visual data, their full value will remain latent—the ratio of data to human attention is simply too large.

Most image search tools in operation today rely heavily on keyword meta-tags, where an image or video is annotated with a limited number of words that are either provided manually or else taken from whatever text occurs nearby in its containing document. While such a scheme simplifies the image-indexing task to one that well-known information-retrieval techniques can handle, it has serious shortcomings. At the surface, accurate manual tags are clearly too expensive to obtain on a large scale, and keywords in proximity to an image are often irrelevant.

Even more problematic, however, is the semantic disconnect between words and visual content: a word is a human construct with a precise intent, whereas a natural image can convey a multitude of concepts within its (say) million pixels, and any one may be more or less significant depending on the context. For example, imagine querying a database for all text documents containing the word "forest." Now imagine
conjuring a text query that would find you all images relevant to the one on the left in Figure 1; while you immediately have a visual concept, it may be difficult to pinpoint words to capture it, especially if the objects within the image are unfamiliar. Thus, even if we were to somehow record keywords for all the images in the world, visual data would still not be sufficiently accessible.

Figure 1. Visual data is complex and often holds valuable information. Image-based search algorithms automatically analyze and organize visual content, with the goal of allowing efficient retrieval from large image or video collections.

Content-based image search streamlines the process by sorting images directly based on their visual information and allowing images themselves to serve as queries. While early work in the area focused on correlating low-level cues such as color and texture,8,28 more recently the image-search problem has become intertwined with the fundamental problem of recognition, in which algorithms must capture higher-level notions of visual object and scene categories.

The technical challenges are considerable. Instances of the same object category can generate drastically different images, depending on confounding variables such as illumination conditions, object pose, camera viewpoint, partial occlusions, and unrelated background "clutter" (see Figure 2). In general, the quality of image search relies significantly on the chosen image representation and the distance metric used to compare examples. Meanwhile, the complexity of useful image representations, combined with the sheer magnitude of the search task, immediately raises the practical issue of scalability.

Figure 2. The same object type can generate dramatically different images due to a variety of nuisance parameters (top), but local descriptions can offer substantial robustness (bottom). (Panels: occlusion, self-occlusion, scale, deformation, clutter, illumination, camera viewpoint, object pose; bottom: local feature matches.)

This article overviews our work on constructing robust measures of image similarity that can be deployed efficiently, even for complex feature spaces and massive image databases. We pose three essential technical questions: (1) What is an effective distance measure between images that can withstand the naturally occurring variability among related examples? (2) When external cues beyond observable image content are available, how can they improve our comparisons? And (3) what kind of search strategy will support fast queries with such image-driven metrics, particularly when our database is so large that a linear scan is infeasible? The following sections address each of these issues in turn and highlight some of our results to demonstrate the impact with real image data.
Our approach enables rapid, scalable search with meaningful metrics that were previously restricted to artificially modest input sizes or databases. Additionally, we show how minimal annotations can be exploited to learn how to compare images more reliably. Both contributions support the ultimate goal of harnessing the potential of very large repositories and providing direct access to visual content.

Comparing Images with Local Feature Matches
Earlier work in content-based image retrieval focused on global representations that describe each image with a single vector of attributes, such as a color histogram or an ordered list of intensity values or filter responses. While vector representations permit the direct application of standard distance functions and indexing structures, they are known to be prohibitively sensitive to realistic image conditions. For example, consider stacking the images in Figure 2 one on top of the other and then checking the intensity at any given pixel for each example—it is quite likely that few of them would agree, even though each image contains a koala as its most prominent object.

Much recent work shows that decomposing an image into its component parts (or so-called "local features") grants resilience to image transformations and variations in object appearance.23,30 Typically, one either takes a dense sample of regions at multiple scales or else uses an interest operator to identify the most salient regions in an image. Possible salient points include pixels marking high contrast (edges), or points selected for a region's repeatability at multiple scales (see Tuytelaars30 for a survey). Then, for each detected region, a feature descriptor vector is formed. Descriptors may be lists of pixel values within a patch, or histograms of oriented contrast within the regions,23 for example. The result is one set of local appearance or shape description vectors per image, often numbering on the order of 2,000 or more features per image.

The idea behind such representations is to detect strong similarity between local portions of related images, even when the images appear quite different at the global level. Local features are more reliable for several reasons:

- Isolate occlusions: An object may be partially occluded by another object. A global representation will suffer proportionally, but with local representations, any parts that are still visible will have their descriptions remain intact.

- Isolate clutter and the background: Similarly, while a global description may be overwhelmed by large amounts of background or clutter, small parts of an image containing an actual object of interest can emerge if we describe them independently by regions. Recognition can proceed without prior segmentation.

- Accommodate partial appearance variation: When instances of a category can vary widely in some aspects of their appearance, their commonality may be best captured by a part-wise description that includes the shared, reoccurring pieces of the object class.

- Invariant local descriptors: Researchers have developed local descriptors designed explicitly to offer invariance to common transformations, such as illumination changes, rotations, translations, scaling, or all affine transformations.

This appealing representation—a set of vectors—does not fit the mold of many traditional distances and learning algorithms. Conventional methods assume vector inputs, but with local representations, each image produces a variable number of features, and there is no ordering among the features in a single set. In this situation, computing a correspondence, or matching, between two images' features can reveal their overall resemblance: if many parts in image A can be associated with similar-looking parts in image B, then the two images are likely to display similar content (see Figure 2, bottom).

Current strategies for recognition and image matching exploit this notion in some form, often by building spatial constellations of a category's reoccurring local features, summarizing images with a histogram of discretized local patches, or explicitly computing the least-cost correspondences (for a survey, see Pinz27 and references therein). However, a real practical challenge is the computational cost of evaluating the optimal matching, which is cubic in the number of features extracted per image. Compounding that cost is substantial empirical evidence showing that recognition accuracy improves when larger and denser feature sets are used.

The Pyramid Match Algorithm
To address this challenge, we developed the pyramid match—a linear-time matching function over unordered feature sets—and showed how it allows local features to be used efficiently within the context of multiple image search and learning problems.12 The pyramid match approximates the similarity measured by the optimal partial matching between feature sets of variable cardinalities. Because the matching is partial, some features may be ignored without penalty to the overall set similarity. This tolerance makes the measure robust in situations where superfluous or "outlier" features may appear. Note that our work focuses on the image-matching and indexing aspects of the problem and is flexible to the representation choice, that is, to which particular image feature detectors and descriptors are used as input.

We consider a feature space V of d-dimensional vectors for which the values have a maximal range D. The point sets we match come from the input space S, which contains sets of feature vectors drawn from V: S = {X | X = {x1, …, xm}}, where each feature xi ∈ V ⊆ ℝ^d and m = |X|. We can think of each xi as a descriptor for one of the elliptical image regions on the koalas in Figure 2. Note that the point dimension d is fixed for all features in V, but the value of m may vary across instances in S.

Given point sets X, Y ∈ S, with |X| ≤ |Y|, the optimal partial matching π* pairs each point in X to some unique point in Y such that the total distance between matched points is minimized: π* = argmin_π Σ_{xi ∈ X} ||xi − y_{πi}||1, where πi specifies which point y_{πi} is matched to xi, and || · ||1 denotes the L1 norm. For sets with m features, the Hungarian algorithm computes the optimal match in O(m³) time, which severely limits the practicality of large input sizes.
review articles

Figure 3. An example pyramid match.

Here, two 1-D feature sets are used to y z


form two histogram pyramids. Each row H0(y) H0(z) min(H0(y),H0(z))
corresponds to a pyramid level. In (a), set Y
is on the left, and set Z is on the right; points
are distributed along the vertical axis.
Light lines are bin boundaries, bold dashed I0=2
lines indicate a new pair matched at this
level, and bold solid lines indicate a match y z min(H1(y),H1(z))
already formed at a finer resolution level. H1(y) H1(z)
In (b) multiresolution histograms are
shown; (c) shows their intersections.
The pyramid match function P uses
these intersection counts to measure how I1=4
many new matches occurred at each level.
Here, Ii = I(Hi(Y), Hi(Z) ) = 2, 4, 5 across
y z
levels, so the number of new matches H2(y) H2(z)
counted are 2, 2, 1. The weighted sum
over these counts gives the pyramid
match similarity.
I2=5
The figure is reprinted from Grauman and
Darrell12 with permission, ©2005 IEEE.
(a) Point sets (b) Histogram pyramids (c) Intersections

severely limits the practicality of histogram vector over points in X. The Thus, for each pyramid level, we
large input sizes. In contrast, the pyr- bins continually increase in size from want to count the number of “new”
amid match approximation requires the finest-level histogram H0 until the matches—the number of feature pairs
only O(mL) time, where L = log D, and coarsest-level histogram HL−1. For low- that were not in correspondence at
L << m. In practice, this translates to dimensional feature spaces, the bound- any finer resolution level. For exam-
speedups of several orders of mag- aries of the bins are computed simply ple, in Figure 3, there are two points
nitude relative to the optimal match with a uniform partitioning along all matched at the finest scale, two new
for sets with thousands of features. feature dimensions, with the length of matches at the medium scale, and
We use a multidimensional, multi- each bin side doubling at each level. For one at the coarsest scale.
resolution histogram pyramid to parti- high-dimensional feature spaces (for To calculate the match count, we use
tion the feature space into increasingly example, d > 15), we use hierarchical histogram intersection, which mea­
larger regions. At the finest resolution clustering to concentrate the bin  par- sures the “overlap” between the mass in
r
level in the pyramid, the partitions titions where feature points tend to two histograms: I(P, Q) = S j=1 min(Pj, Qj),
(bins) are very small; at successive levels cluster for typical point sets.13 In either where P and Q are histograms with r
they continue to grow in size until a sin- case, we maintain a sparse representa- bins, and Pj denotes the count of the
gle bin encompasses the entire feature tion per point set that maps each point j-th bin. The intersection value effec-
space. At some level along this gradation to its bin indices. Even though there is tively counts the number of points in
in bin sizes, any two particular points an exponential growth in the number two sets that match at a given quanti-
from two given point sets will begin to of possible histogram bins relative to zation level. To calculate the number
share a bin in the pyramid, and when the feature dimension (for uniform of newly matched pairs induced at
they do, they are considered matched. bins) or histogram levels (for nonuni- level i, we only need to  compute the
The key is that the pyramid allows us form bins), any given set of features can difference between successive levels’
to  extract a matching score without occupy only a small number of them. intersections. By using the change in
computing distances between any of An image with m features results in a intersection values at each level, we
the points in the input sets—the size of pyramid description with no more than count matches without ever explicitly
the bin that two points share indicates mL nonzero entries to store. searching for similar points or  com-
the farthest distance they could be from Two histogram pyramids implicitly puting inter-feature distances.
one another. We show that a weighted encode the correspondences between The pyramid match similarity score
intersection of two pyramids defines an their point sets, if we consider two P between two input sets Y and Z is
implicit partial correspondence based points matched once they fall into then defined as the weighted sum of
on the smallest histogram cell where a the same histogram bin, starting at the number of new matches per level:
matched pair of points first appears. the finest resolution level. The match-
Let a histogram pyramid for input ing is a hierarchical process: vectors
example X Î S be defined as: Ψ (X) = not found to correspond at a fine
[H0(X), …, HL−1(X)], where L specifies the resolution have the opportunity to
number of pyramid levels, and Hi(X) is a be matched at coarser resolutions.

88 communications of th e ac m | j u n e 2 0 1 0 | vo l . 5 3 | n o. 6
review articles

The number of new matches induced at level i is weighted by wi to reflect the (worst-case) similarity of points matched at that level. This weighting reflects a geometric bound on the maximal distance between any two points that share a particular bin. Intuitively, similarity between vectors at a finer resolution—where features are more distinct—is rewarded more heavily than similarity between vectors at a coarser level.

We combine the scores resulting from multiple pyramids with randomly shifted bins in order to alleviate quantization effects, and to enable formal error bounds. The approximation error for the pyramid match cost is bounded in the expectation by a factor of C·d log D, for a constant C.15 We have also proven that the pyramid match kernel (PMK) naturally forms a Mercer kernel, which essentially means that it satisfies the necessary technical requirements to permit its use as a similarity function within a number of existing kernel-based machine learning methods.

Previous approximation methods have also considered a hierarchical decomposition of the feature space to reduce complexity1,2,5,17; the method by Indyk and Thaper17 particularly inspired our approach. However, earlier matching approximations assume equally sized input sets, and cannot compute partial matches. In addition, while previous techniques suffer from distortion factors that are linear in the feature dimension, we have shown how to alleviate this decline in accuracy by tuning the hierarchical decomposition according to the particular structure of the data.13 Finally, our approximation is unique in that it forms a valid Mercer kernel, and is useful in the context of various learning applications.

In short, the pyramid match gives us an efficient way to measure the similarity between two images based on the matching between their (potentially many) local features. Now, given a query image such as the one on the left of Figure 1, we can first extract descriptors for its local regions using any standard feature extractor,23,30 and then find its relevant "neighbors" in the collection on the right by computing and sorting their pyramid match scores. In this way, we are able to search the image collection based on content alone.
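As a usage sketch that builds on the hypothetical PyramidMatch class above, such content-based retrieval then reduces to scoring the query against each database example and sorting:

    import java.util.Arrays;
    import java.util.Comparator;

    // Illustrative retrieval loop: rank a database by pyramid match similarity.
    public class Retrieval {

      // Returns database indices ordered from most to least similar to the query.
      public static Integer[] rank(int[] query, int[][] database, int levels) {
        double[] scores = new double[database.length];
        for (int i = 0; i < database.length; i++) {
          scores[i] = PyramidMatch.score(query, database[i], levels);
        }
        Integer[] order = new Integer[database.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order,
            Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return order;
      }
    }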
Figure 4 shows some illustrative results using two well-known publicly available benchmark datasets, the ETH-80 and Caltech-101. Both datasets are used to measure image categorization accuracy. The ETH collection is comprised of 80 object instances from eight different categories positioned on simple backgrounds; it is among the first benchmarks established for the categorization task, and since several categories are visually rather similar (for example, horse and cow, apple and tomato), it is a good test for detailed discrimination. The Caltech collection, first introduced in 2003, contains 101 categories. It is challenging due to the magnitude of the multiclass problem it poses, and for many categories it offers noticeable intraclass appearance variation. It has received much attention in the research community and today stands as a key point of comparison for existing methods. For all the following results, we use the SIFT descriptor,23 which is insensitive to shifts and rotations in the image yet provides a distinctive summary of a local patch.

Figure 4. The image rankings produced by the linear-time pyramid match. (a) Retrieval quality relative to the optimal match: the rankings produced by the linear-time pyramid match are closely aligned with those produced by the cubic-time optimal matching. This plot shows how closely rankings computed with our approximate measure (with uniform bins and with data-dependent bins) correlate with the optimal result, for features of increasing dimensionality; the vertical axis measures the Spearman rank correlation with the optimal match, and perfect ranking agreement would yield a score of 1. (b) Recognition on ETH-80, plotting accuracy (%) against training time (s) for the explicit match and the pyramid match: more features per image lead to more reliable matches, but explicit matching techniques scale poorly with the representation size. The pyramid match makes large feature sets easily affordable. (c) The four category pairs in the Caltech-101 database that our method confused most.

The leftmost plot of Figure 4 demonstrates that when the pyramid match is used to sort the images from the ETH-80 in a retrieval task, its complete ranking of the database examples is highly correlated to that of the optimal matching. The vertical axis measures how well results from two variants of the PMK agree with the optimal cubic-time results, and the horizontal axis shows the relative impact of the feature dimension d. While for low-dimensional features either a uniform or data-dependent partitioning of the feature space is adequate for good results, due to the curse of dimensionality, a data-dependent pyramid bin structure is much more effective for high-dimensional features.
The center plot shows accuracy as a function of computation time when the eight categories of the same dataset are learned using local feature matches between images. The plot compares the performance of the pyramid match to an exact matching function that averages the cost between the closest features in one set to the other. The horizontal axis measures the total training time, which is directly affected by the size of the feature sets. To vary the size of a typical set, we tune the saliency parameter controlling how many regions are detected per image. For both methods, more features lead to striking accuracy improvements; this behavior is expected since introducing more features assures better coverage of all potentially relevant image regions. However, the linear-time pyramid match offers a key advantage in terms of computational cost, reaching peak performance for significantly less computation time.

On the Caltech-101 benchmark, we have shown that classifiers employing the PMK with a variety of features currently yield one of the most accurate results in the field,20 with 74% accuracy on the 101-way decision problem when training with just 15 exemplars per class. Figure 4c shows example images from four pairs of categories in the Caltech-101 dataset that cause the most confusion for the pyramid match: schooner and ketch, lotus and water lily, gerenuk and kangaroo, and nautilus and brain. In each row, the two images on the left have local features that match quite well to the two on the right, as compared to images from any of the other 100 classes in the dataset. Some of these confused category pairs have rather subtle distinctions in appearance. However, the case of the gerenuk and kangaroo reveals a limitation of the completely local description, as by definition it fails to capture the significance of the global contour shapes of the two objects.

Overall, approaches based on the pyramid match consistently show accuracy that is competitive with (or better than) the state of the art while requiring dramatically less computation time. This complexity advantage frees us to consider much richer representations than were previously practical. Methods that compute explicit correspondences require about one minute to match a novel example; in the time that these methods recognize a single object, the pyramid match can recognize several hundred.15 Due to its flexibility and efficiency, the pyramid match has been adapted and extended for use within a number of tasks, such as scene recognition,22 video indexing,6 human action recognition,24 and robot localization.25

Learning Image Metrics
Thus far, we have considered how to robustly measure image similarity in situations where we have no background knowledge; that is, where the system only has access to the image content itself. However, in many cases the system could also receive external side-information that might benefit its comparisons. For example, if provided with partially annotated image examples, or if a user wants to enforce similarity between certain types of images, then we ought to use those constraints to adapt the similarity measure.

A good distance metric between images accurately reflects the true underlying relationships. It should report small distances for examples that are similar in the parameter space of interest (or that share a class label), and large distances for unrelated examples. Recent advances in metric learning make it possible to learn distance functions that are more effective for a given problem, provided some partially labeled data or constraints are available (see Yang32 and references within). By taking advantage of the prior information, these techniques offer improved accuracy. Typically, the strategy is to optimize any parameters to the metric so as to best satisfy the desired constraints.

Figure 5a depicts how metric learning can influence image comparisons: the similarity (solid line) and dissimilarity (dotted lines) constraints essentially warp the feature space to preserve the specified relationships, and generalize to affect distances between other examples like them.

Figure 5. The learned metric. (a) By constraining some examples to be similar (green solid line), and others to be dissimilar (red dotted lines), the learned metric refines the original distance function so that examples are close only when they share the relevant features. (b) Retrieval accuracy is improved by replacing two matching-based metrics (PMK and CORR) with their learned counterparts (ML + PMK and ML + CORR); the plot shows Caltech 101 gains over the original (nonlearned) kernels, as mean accuracy per class (%) versus the number of constrained examples per class. Plot reprinted from Jain et al.19 with permission, ©2008 IEEE.
In this illustrative example, even though we may measure high similarity between the greenery portions of both the cockatoo and koala images, the dissimilarity constraint serves to refocus the metric on the other (distinct) parts of the images.

In the case of the pyramid match, the weights associated with matches at different pyramid levels can be treated as learnable parameters. While fixing the weights according to the bin diameters gives the most accurate approximation of true inter-feature distances in a geometric sense, when we have some annotated images available, we can directly learn the weights that will best map same-class images close together.18

The idea is that the best matching function is specific to the class of images, and is inherently defined by the variability a given class exhibits. For example, to distinguish one skyscraper from another, we might expect same-class examples to contain some very tight local feature correspondences, whereas to distinguish all skyscrapers from koalas, we expect feature matches to occur at greater distances even among same-class examples. While the same type of image feature may be equally relevant in both situations, what is unique is the distance at which similarity is significant for that feature. Therefore, by learning the reward (weight) associated with each matching level in the pyramid, we can automatically determine how close feature matches must be in order to be considered significant for a given object class.

To realize this intuition, we observe that the PMK can be written as a weighted sum of base kernels, where each base kernel is the similarity computed at a given bin resolution. We thus can compute the weights using a form of kernel alignment,3 where we find the optimal combination of kernel matrices that most closely mimics the "ideal" kernel on the training data, that is, the one that gives maximal similarity values for in-class examples and minimal values for out-of-class examples (see Jain et al.18 for details).

We have also shown how image retrieval can benefit from learning the Mahalanobis parameterization for several distinct base metrics, including matching functions.19 Given points {x1, …, xn}, with xi ∈ ℝd, a positive-definite d × d matrix A parameterizes the squared Mahalanobis distance:

    dA(xi, xj) = (xi − xj)T A(xi − xj).  (1)

A generalized inner product measures the pairwise similarity associated with that distance: SA(xi, xj) = xiT Axj. Thus for a kernel K(xi, xj) = φ(xi)Tφ(xj), the parameters transform the inner product in the implicit feature space as φ(xi)T Aφ(xj). Given a set of inter-example distance constraints, one can directly learn a matrix A to yield a measure that is more accurate for a given problem. We use the efficient method of Davis et al.7 because it is kernelizable. This method optimizes the parameters of A so as to minimize how much that matrix diverges from an initial user-provided "base" parameterization, while satisfying constraints that require small distances between examples specified as similar, and large distances between pairs specified as dissimilar.
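As a concrete reading of Equation 1, here is a small illustrative sketch (our own, assuming A is small enough to store explicitly) that computes both the squared Mahalanobis distance and the associated generalized inner product. The dense matrix products shown here are exactly the step that becomes infeasible when d is very large, as discussed later.

    // Illustrative Mahalanobis parameterization (Equation 1), for a
    // positive-definite matrix A that fits in memory explicitly.
    public class MahalanobisMetric {
      private final double[][] A;   // d x d parameter matrix

      public MahalanobisMetric(double[][] A) {
        this.A = A;
      }

      // d_A(xi, xj) = (xi - xj)^T A (xi - xj)
      public double squaredDistance(double[] xi, double[] xj) {
        double[] diff = new double[xi.length];
        for (int k = 0; k < xi.length; k++) {
          diff[k] = xi[k] - xj[k];
        }
        return quadraticForm(diff, diff);
      }

      // S_A(xi, xj) = xi^T A xj, the generalized inner product
      public double similarity(double[] xi, double[] xj) {
        return quadraticForm(xi, xj);
      }

      private double quadraticForm(double[] u, double[] v) {
        double total = 0.0;
        for (int row = 0; row < u.length; row++) {
          for (int col = 0; col < v.length; col++) {
            total += u[row] * A[row][col] * v[col];
          }
        }
        return total;
      }
    }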
Figure 5b shows the significant retrieval accuracy gains achieved when we learn image metrics using two matching-based kernels as the base metrics. The first kernel is the PMK, the approximate matching measure defined here. The second kernel is defined in Zhang et al.,33 and uses exhaustive comparisons between features to compute a one-to-many match based on both descriptor and positional agreement; we refer to it as CORR for "correspondence." For this dataset of 101 object types, note that chance performance would yield an accuracy rate of only 1%. Both base metrics do the most they can by matching the local image features; the learned parameters adapt those metrics to better reflect the side-information specifying a handful of images from each class that ought to be near to (or far from) the others.

Figure 6. Query hashes. If hash functions guarantee a high probability of collision for similar images under a metric of interest, one can search a large database in sublinear time via LSH techniques. The query hashes directly to bins with similar examples, and only that subset needs to be searched. We have designed LSH functions for matching, learned distances, and arbitrary kernel functions.

Searching Image Collections in Sublinear Time
Now that we have designed effective similarity measures, how will image search scale? We must be able to use these metrics to query a very large image database—potentially on the order of millions of examples or more. Certainly, a naive linear scan that compares the query against every database image is not feasible, even if the metric itself is efficient.
Unfortunately, traditional methods for fast search cannot guarantee low query-time performance for arbitrary specialized metrics.ᵃ This section overviews our work designing hash functions that enable approximate similarity search for both types of metrics introduced above: a matching between sets, and learned Mahalanobis kernels.

The main idea of our approach is to construct a new family of hash functions that will satisfy the locality sensitivity requirement that is central to existing randomized algorithms5,16 for approximate nearest neighbor search. Locality sensitive hashing (LSH) has been formulated in two related contexts—one in which the likelihood of collision is guaranteed relative to a threshold on the radius surrounding a query point,16 and another where collision probabilities are equated with a similarity function score.5 We use the latter definition here.

A family of LSH functions F is a distribution of functions where for any two objects xi and xj,

    Pr[h(xi) = h(xj)] = sim(xi, xj),  (2)

where sim(xi, xj) ∈ [0, 1] is some similarity function, and h(x) is a hash function drawn from F that returns a single bit.5 Concatenating a series of b hash functions drawn from F yields b-dimensional hash keys. When h(xi) = h(xj), xi and xj collide in the hash table. Because the probability that two inputs collide is equal to the similarity between them, highly similar objects are indexed together in the hash table with high probability. On the other hand, if two objects are very dissimilar, they are unlikely to share a hash key (see Figure 6). Given valid LSH functions, the query time for retrieving (1 + ε)-near neighbors is bounded by O(N^(1/(1+ε))) for the Hamming distance and a database of size N.10 One can therefore trade off the accuracy of the search with the query time required.

Note that Equation 2 is essentially a gateway to LSH: if one can provide a distribution of hash functions guaranteed to preserve this equality for the similarity function of interest, then approximate nearest neighbor search may be performed in sublinear time. Existing LSH functions can accommodate the Hamming distance, Lp norms, and inner products, and such functions have been explored previously in the vision community. In the following we show how to enable sublinear time search with LSH for metrics that are particularly useful for image search.

Matching-sensitive hashing. Even though the pyramid match makes each individual matching scalable relative to the number of features per image, once we want to search a large database of images according to the correspondence-based distance, we still cannot afford a naive linear scan. To guarantee locality sensitivity for a matching, we form an embedding function that maps our histogram pyramids into a vector space in such a way that the inner product between vectors in that space exactly yields the PMK similarity value.14

This remapping is motivated by the fact that randomized hash functions exist for similarity search with the inner product.5 Specifically, Goemans and Williamson11 show that the probability that a hyperplane r drawn uniformly at random separates two vectors xi and xj is directly proportional to the angle between them: Pr[sign(rT xi) ≠ sign(rT xj)] = (1/π) cos⁻¹(xiT xj / (||xi|| ||xj||)).

An LSH function that exploits this relationship is given by Charikar.5 The hash function hr accepts a vector x ∈ ℝd, and outputs a bit depending on the sign of its product with r:

    hr(x) = 1 if rT x ≥ 0, and 0 otherwise.  (3)

Since Pr[hr(xi) = hr(xj)] = 1 − (1/π) cos⁻¹(xiT xj) for unit-norm inputs, the probability of collision is high whenever the examples' inner product is high.5
To embed the pyramid match as an inner product, we exploit the relationship between a dot product and the min operator used in the PMK's intersections. Taking the minimum of two values is equivalent to computing the dot product of a unary-style encoding in which a value u is written as a list of u ones, followed by a zero padding large enough to allot space for the maximal value that will occur. For example, with a maximal value of 5, u = 3 encodes as [1, 1, 1, 0, 0] and u = 5 as [1, 1, 1, 1, 1], and their dot product is min(3, 5) = 3. So, since a weighted intersection value is equal to the intersection of weighted values, we can compute the embedding by stacking up the histograms from a single pyramid, and weighting the entries associated with each pyramid level appropriately. Our embedding enables LSH for normalized partial match similarity with local features, and we have shown that it can achieve results very close to a naive linear scan when searching only a small fraction of an image database (1%–2%) (see Grauman and Darrell14 for more details).

Semi-supervised hashing. To provide suitable hash functions for learned Mahalanobis metrics, we propose altering the distribution from which the randomized hyperplanes are drawn. Rather than drawing the vector r uniformly at random, we want to bias the selection so that similarity constraints provided for the metric learning process are also respected by the hash functions. In other words, we still want similar examples to collide, but now that similarity cannot be purely based on the image measurements xi and xj; it must also reflect the constraints that yield the improved (learned) metric (see Figure 7a). We refer to this as "semi-supervised" hashing, since the hash functions will be influenced by any available partial annotations, much as the learned metrics were in the previous section.

In Jain et al.,19 we present a straightforward solution for the case of relatively low-dimensional input vector spaces, and further derive a solution to accommodate very high-dimensional data for which explicit input space computations are infeasible. The former contribution makes fast indexing accessible for numerous existing metric learning methods, while the latter is of particular interest for commonly used high-dimensional image representations, such as bags-of-words or multiresolution histograms.

ᵃ Data structures based on spatial partitioning and recursive decomposition have been developed for faster search, e.g., k-d trees9 and metric trees.31 While their expected query time requirement may be logarithmic in the database size, selecting useful partitions can be expensive and requires good heuristics; worse, in high-dimensional spaces all exact search methods are known to provide little query time improvement over a naive linear scan.16 The expected query time for a k-d tree contains terms that are exponential in the dimension of the features,9 making them especially unsuitable for the pyramid representation where the dimension can be on the order of millions.
Figure 7. Semi-supervised hash functions. (a) Intuition: a generic hash function would choose the orientation of the hyperplane r uniformly at random, causing collisions only for examples that have small angles between their features (xi and xj). In contrast, the distribution of our randomized semi-supervised hash functions is such that examples like those constrained to be similar are more likely to collide (left), while pairs like those constrained to be dissimilar are less likely to collide (right). Here the hourglass shapes denote the regions from which our randomized hash functions will most likely be drawn. (b) Results with semi-supervised hashing: semi-supervised hash functions encode the learned metric, and allow guaranteed sublinear time queries that are similar in accuracy to a naive linear scan. The left plot shows Caltech-101 classification error with hashing (k-nearest neighbors error rate versus epsilon (ε), comparing ML + PMK and PMK under hashing and under a linear scan); the right plot shows PhotoTourism recall versus the number of nearest neighbors (k), comparing ML and L2 under hashing and under a linear scan. Plots are reprinted from Jain et al.19 with permission, ©2008 IEEE.

Given the matrix A for a metric learned as above, such that A = GTG, we generate the following randomized hash functions hr,A:

    hr,A(x) = 1 if rT Gx ≥ 0, and 0 otherwise,  (4)

where the vector r is chosen at random from a d-dimensional Gaussian distribution with zero mean and unit variance.
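To ground Equation 4, here is a small illustrative sketch for the case where the factor G of the learned metric A = GTG can be stored explicitly; the only change from the generic random-hyperplane hash above is that the input is first mapped through G.

    // Illustrative hash for a learned metric A = G^T G (Equation 4): identical
    // to a generic random-hyperplane hash except that x is first mapped by G.
    public class LearnedMetricLSH {
      private final double[] r;      // random Gaussian direction
      private final double[][] G;    // factor of the learned Mahalanobis matrix

      public LearnedMetricLSH(double[] r, double[][] G) {
        this.r = r;
        this.G = G;
      }

      // h_{r,A}(x) = 1 if r^T (G x) >= 0, and 0 otherwise.
      public int hash(double[] x) {
        double dot = 0.0;
        for (int row = 0; row < G.length; row++) {
          double gx = 0.0;                       // component (G x)_row
          for (int col = 0; col < x.length; col++) {
            gx += G[row][col] * x[col];
          }
          dot += r[row] * gx;
        }
        return dot >= 0 ? 1 : 0;
      }
    }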
By parameterizing the hash functions by both r and G, we enforce the following probability of collision:

    Pr[hr,A(xi) = hr,A(xj)] = 1 − (1/π) cos⁻¹(xiT Axj / sqrt((xiT Axi)(xjT Axj))),

which sustains the LSH requirement for a learned Mahalanobis metric. Essentially we have shifted the random hyperplane r according to A, and by factoring it by G we allow the random hash function itself to "carry" the information about the learned metric. The denominator in the cosine term normalizes the kernel values.

For low-dimensional data, we could equivalently transform all the data according to A prior to hashing. However, the matrix A has d² entries, and thus for very high-dimensional input spaces it cannot be represented explicitly, and we must work in the implicit kernel space. For example, for features like the histogram pyramids above, we have d = 10⁶. The examples are sparse and representable; however, the matrix A is dense and is not. This complicates the computation of hash functions, as they can no longer be computed directly as in Equation 4. To handle this, we derived an algorithm that simultaneously makes implicit updates to both the hash functions and the metric being learned. We show it is possible to compute the value of rTG indirectly, based on comparisons between the points involved in similarity constraints and the new example x that we want to hash. See Jain et al.19 for details.

Figure 7b shows results using our semi-supervised hash functions. In the left-hand plot, we see the learned metric (denoted "ML") significantly improves the base metric in the image retrieval task for the Caltech data. Additionally, we now can offer sublinear time search even once the metric has been altered by the input similarity constraints. Note how accuracy varies as a function of ε, the parameter controlling how many examples we have to search per query; the more examples we can afford to search, the stronger our guarantee of approximating an exhaustive linear scan.

The right-hand plot shows results using another database, 300K patches from the PhotoTourism project.28 Here L2 is the base metric; the recall rate is substantially improved once we learn a metric on top of it. Negligible accuracy is sacrificed when searching with our semi-supervised hash functions (as seen by the closeness of the top two curves), yet our hashing strategy requires touching only 0.8% of the patches in the database. In our MATLAB implementation, we observe speedup factors of about 400 relative to a linear scan for databases containing half a million examples. Due to the query-time guarantees our hash functions enable, that factor grows rapidly with the size of the database.

Most recently, we have derived hash functions to enable fast search with arbitrary kernel functions.21 Relative to traditional exact search data structures, the approximate hashing approach is critical to performance when inputs are high-dimensional. Modifications to classic tree structures have also been explored to improve search time with high-dimensional image features;4,26 however, such approaches do not provide query-time guarantees, and are not applicable to searching with learned metrics. By hashing to buckets containing a collection of examples with a high probability of being very similar to the query, we are able to sort out the most relevant list of near neighbors. This is important for content-based retrieval, where we do not expect the single nearest exemplar to answer the query, but rather that the pool of nearby content will give the user and/or downstream processes access to relevant candidates.
Conclusion
As the world's store of digital images continues to grow exponentially, and as novel data-rich approaches to computer vision begin to emerge, fast techniques capable of accurately searching very large image collections are critical. The algorithms we have developed aim to provide robust but scalable image search, and results show the practical impact. While motivated by vision problems, these methods are fairly general, and may be applicable in other domains where rich features and massive data collections abound, such as computational biology or text processing.

Looking forward, an important challenge in this research area is to develop representations that will scale in terms of their distinctiveness; once the space of images is even more densely populated, relative differences are subtle. At the same time, flexibility is still a key to handling intra-category variation. While our search methods can guarantee query-time performance, it is not yet possible to guarantee a level of discrimination power for the features chosen. In addition, a practical issue for evaluating algorithms in this space is the difficulty of quantifying accuracy for truly massive databases; the data itself is easy to come by, but without ground truth annotations, it is unclear how to rigorously evaluate performance.

An interesting aspect of the image search problem is the subjectivity related to a real user's perception of the quality of a retrieval. We can objectively quantify accuracy in terms of the categories contained in a retrieved image, which is helpful to systematically validate progress. Moreover, example-based search often serves as one useful stage in a larger pipeline with further processing downstream. Nonetheless, when end users are in the loop, the perception of quality may vary. On the evaluation side, this uncertainty could be addressed by collecting user appraisals of similarity, as is more standard in natural language processing. In terms of the algorithms themselves, however, one can also exploit classic feedback and query-refinement devices to tailor retrieval toward the current user. For example, we could construct learned image metrics with constraints that target the preferences of a given user or group of users.

We are currently exploring online extensions to our algorithms that allow similarity constraints to be processed in an incremental fashion, while still allowing intermittent queries. We are also pursuing active learning methods that allow the system to identify which image annotations seem most promising to request, and thereby most effectively use minimal manual input.

Acknowledgments
I am fortunate to have worked with a number of terrific collaborators throughout the various stages of the projects overviewed in this article—in particular, Trevor Darrell, Prateek Jain, and Brian Kulis. This research was supported in part by NSF CAREER IIS-0747356, Microsoft Research, and the Henry Luce Foundation. I would like to thank Kathryn McKinley and Yong Jae Lee for feedback on previous drafts, as well as the anonymous reviewers for their helpful comments. Thanks to the following Flickr users for sharing their photos under the Creative Commons license: belgian-chocolate, c.j.b., edwinn.11, piston9, staminaplus100, rick-yrhodes, Rick Smit, Krikit, Vanessa Pike-Russell, Will Ellis, Yvonne in Willowick Ohio, robertpaulyoung, lin padgham, tkcrash123, jennifrog, Zemzina, Irene2005, and CmdrGravy.

References
1. Agarwal, P., Varadarajan, K.R. A near-linear algorithm for Euclidean bipartite matching. In Symposium on Computational Geometry (2004).
2. Avis, D. A survey of heuristics for the weighted matching problem. Networks 13 (1983), 475–493.
3. Bach, F., Lanckriet, G., Jordan, M. Multiple kernel learning, conic duality, and the SMO algorithm. In International Conference on Machine Learning (2004).
4. Beis, J., Lowe, D. Shape indexing using approximate nearest-neighbour search in high dimensional spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (1997).
5. Charikar, M. Similarity estimation techniques from rounding algorithms. In ACM Symposium on Theory of Computing (2002).
6. Choi, J., Jeon, W., Lee, S.-C. Spatio-temporal pyramid matching for sports videos. In Proceedings of the ACM Conference on Multimedia Information Retrieval (2008).
7. Davis, J., Kulis, B., Jain, P., Sra, S., Dhillon, I. Information-theoretic metric learning. In International Conference on Machine Learning (2007).
8. Flickner, M. et al. Query by image and video content: The QBIC system. IEEE Comput. 28, 9 (1995), 23–32.
9. Friedman, J., Bentley, J., Finkel, R. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3, 3 (1977), 209–226.
10. Gionis, A., Indyk, P., Motwani, R. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Data Bases (1999).
11. Goemans, M., Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach. 42, 6 (1995), 1115–1145.
12. Grauman, K., Darrell, T. The pyramid match kernel: discriminative classification with sets of image features. In Proceedings of the IEEE International Conference on Computer Vision (2005).
13. Grauman, K., Darrell, T. Approximate correspondences in high dimensions. In Advances in Neural Information Processing Systems (2006).
14. Grauman, K., Darrell, T. Pyramid match hashing: sub-linear time indexing over partial correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007).
15. Grauman, K., Darrell, T. The pyramid match kernel: efficient learning with sets of features. J. Mach. Learn. Res. 8 (2007), 725–760.
16. Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Symposium on Theory of Computing (1998).
17. Indyk, P., Thaper, N. Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision (2003).
18. Jain, P., Huynh, T., Grauman, K. Learning Discriminative Matching Functions for Local Image Features. Technical Report, UT-Austin, April 2007.
19. Jain, P., Kulis, B., Grauman, K. Fast image search for learned metrics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2008).
20. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T. Gaussian processes for object categorization. Int. J. Comput. Vis. 88, 2 (2009), 169–188.
21. Kulis, B., Grauman, K. Kernelized locality-sensitive hashing for scalable image search. In Proceedings of the IEEE International Conference on Computer Vision (2009).
22. Lazebnik, S., Schmid, C., Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006).
23. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 2 (2004).
24. Lv, F., Nevatia, R. Single view human action recognition using key pose matching and Viterbi path searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007).
25. Murillo, A. et al. From omnidirectional images to hierarchical localization. Robotics Auton. Syst. 55, 5 (May 2007), 372–382.
26. Nister, D., Stewenius, H. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006).
27. Pinz, A. Object categorization. Found. Trend. Comput. Graph. Vis. 1, 4 (2006), 255–353.
28. Snavely, N., Seitz, S., Szeliski, R. PhotoTourism: Exploring photos in 3D. In SIGGRAPH (2006).
29. Swain, M., Ballard, D. Color indexing. Int. J. Comput. Vis. 7, 1 (1991), 11–32.
30. Tuytelaars, T., Mikolajczyk, K. Local invariant feature detectors: A survey. Found. Trend. Comput. Graph. Vis. 3, 3 (2008), 177–280.
31. Uhlmann, J. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40 (1991), 175–179.
32. Yang, L. Distance Metric Learning: A Comprehensive Survey. Technical Report, Michigan State University, 2006.
33. Zhang, H., Berg, A., Maire, M., Malik, J. SVM-KNN: discriminative nearest neighbor classification for visual category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006).

Kristen Grauman (grauman@cs.utexas.edu) is an assistant professor in the Department of Computer Science at the University of Texas at Austin.

© 2010 ACM 0001-0782/10/0600 $10.00
research highlights

p. 96  Technical Perspective: Building Confidence in Multicore Software
By Vivek Sarkar

p. 97  Asserting and Checking Determinism for Multithreaded Programs
By Jacob Burnim and Koushik Sen

p. 106  Technical Perspective: Learning To Do Program Verification
By K. Rustan M. Leino

p. 107  seL4: Formal Verification of an Operating-System Kernel
By Gerwin Klein, June Andronick, Kevin Elphinstone, Gernot Heiser, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood
doi:10.1145/1743546.1743571

Technical Perspective
Building Confidence in Multicore Software
By Vivek Sarkar

Surprises may be fun in real life, but not so in software. One approach to avoiding surprises in software is to establish its functional correctness, either by construction or by verification, but this is feasible in only a limited set of domains. Instead, the predominant method in single-threaded software development has been an iterative approach of design, coding, testing, and bug fixes. The cornerstone of this practice is determinism; that is, the expectation a program will exhibit the same behavior each time it is executed with the same inputs. Even if a program has a bug, it is comforting to know the buggy behavior can be reproduced by others, and workarounds can be shared until the developer provides a fix.

In this context, multithreaded programs prove to be more challenging to handle than single-threaded programs. For decades multithreaded programs were executed on single-core processors and users became accustomed to deterministic sets of behaviors exhibited by standard thread schedulers. However, the move to multicore hardware has completely changed this landscape, which is why I recommend you read the following paper.

Jacob Burnim and Koushik Sen propose an assertion framework for specifying regions of multithreaded software that are expected to behave deterministically, and describe a runtime library for checking these assertions that is guaranteed to be sound. Though the proposed runtime checking is incomplete in general, preliminary evaluations suggest that this approach can be effective in identifying many common sources of nondeterminism. Runtime assertion-checking is used in many situations nowadays, including null-pointer checks, array-bounds checks, and type checks in managed runtimes. So, it is reasonable to view the checking of assertions related to determinism as a natural future extension to the runtime checks performed in modern software.

A key challenge in multicore programs is that determinism lies in the eye of the beholder. If two executions of the same program with the same input produce outputs that are non-identical but semantically equivalent, should the program be considered deterministic or not? For example, consider a program that produces two different floating-point values in two executions with the same input. From the viewpoint of a strict definition of determinism, the program is unquestionably nondeterministic. From the viewpoint of a more relaxed definition in which all values within a certain error threshold are permissible as outputs, the program may well be considered to be deterministic. A similar situation arises for programs that produce (say) ordered linked lists as data structure representations of unordered sets. Two non-identical outputs may still be considered equivalent if they contain the same set of elements, albeit in different orders.

Given this range of interpretations for determinism, it isn't obvious how assertions for determinism should be formulated. The approach taken in this paper is to extend the concepts of preconditions ("assume" clauses) and postconditions ("assert" clauses) by using bridge predicates. A bridge predicate relates values arising from two different executions of the same program, thereby providing the foundation for asserting semantic determinism at any desired level of user-specified granularity. One of the examples discussed is parallel matrix multiply, where the bridge predicate in the precondition assumes the input matrices from two executions differ entry-by-entry by no more than an error threshold, and the bridge predicate in the postcondition asserts that a similar property holds for the output matrices. Note that these assertions are focused on determinism and not on functional correctness. For example, a functionally incorrect implementation of parallel matrix multiply that returns the identity matrix for all inputs will always pass determinism checking.

At this point I hope I've raised a number of questions in your mind. Can the determinism assertions be generated automatically? What is the relationship between checking assertions for determinism and detection of data races? Are there any assumptions made about the underlying system software and hardware, such as the memory consistency model? Can the programming constructs advocated by transactional memory researchers help address this problem? Are there applications of determinism assertions to single-threaded programs? If you're interested in any of these questions, you need to read the following paper to better understand the ramifications of parallel hardware on determinism guarantees in multithreaded software!

Vivek Sarkar (vsarkar@rice.edu) is an ACM Fellow and a professor of computer science and of electrical and computer engineering at Rice University, where he holds the E.D. Butcher Chair in Engineering.

© 2010 ACM 0001-0782/10/0600 $10.00
doi:10.1145/1743546.1743572

Asserting and Checking Determinism for Multithreaded Programs
By Jacob Burnim and Koushik Sen

Abstract
The trend towards processors with more and more parallel cores is increasing the need for software that can take advantage of parallelism. The most widespread method for writing parallel software is to use explicit threads. Writing correct multithreaded programs, however, has proven to be quite challenging in practice. The key difficulty is nondeterminism. The threads of a parallel application may be interleaved nondeterministically during execution. In a buggy program, nondeterministic scheduling can lead to nondeterministic results—where some interleavings produce the correct result while others do not.

We propose an assertion framework for specifying that regions of a parallel program behave deterministically despite nondeterministic thread interleaving. Our framework allows programmers to write assertions involving pairs of program states arising from different parallel schedules. We describe an implementation of our deterministic assertions as a library for Java, and evaluate the utility of our specifications on a number of parallel Java benchmarks. We found specifying deterministic behavior to be quite simple using our assertions. Further, in experiments with our assertions, we were able to identify two races as true parallelism errors that lead to incorrect nondeterministic behavior. These races were distinguished from a number of benign races in the benchmarks.

The original version of this paper was published in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, August 2009.

1. INTRODUCTION
The semiconductor industry has hit the power wall—performance of general-purpose single-core microprocessors can no longer be increased due to power constraints. Therefore, to continue to increase performance, the microprocessor industry is instead increasing the number of processing cores per die. The new "Moore's Law" is that the number of cores will double every generation, with individual cores going no faster.2 This new trend of increasingly parallel chips means that we will have to write parallel software in order to take advantage of future hardware advances. Unfortunately, parallel software is more difficult to write and debug than its sequential counterpart. A key reason for this difficulty is nondeterminism—i.e., that in two runs of a parallel program on the exact same input, the parallel threads of execution can interleave differently, producing different output. Such nondeterministic thread interleaving is an essential part of harnessing the power of parallel chips, but it is a major departure from sequential programming, where we typically expect programs to behave identically in every execution on the same input. We share a widespread belief that helping programmers manage nondeterminism in parallel software is critical in making parallel programming widely accessible.

For more than 20 years, many researchers have attacked the problem of nondeterminism by attempting to detect or predict sources of nondeterminism in parallel programs. The most notorious of such sources is the data race. A data race occurs when two threads in a program concurrently access the same memory location and at least one of those accesses is a write. That is, the two threads "race" to perform their conflicting memory accesses, so the order in which the two accesses occur can change from run to run, potentially yielding nondeterministic program output. Many algorithms and tools have been developed to detect and eliminate data races in parallel programs. (See Burnim and Sen5 for further discussion and references.) Although the work on data race detection has significantly helped in finding determinism bugs in parallel programs, it has been observed that the absence of data races is not sufficient to ensure determinism.1,8,9 Thus researchers have also developed techniques to find high-level races,1,16,21 likely atomicity violations,8,9,14 and other potential sources of nondeterminism. Further, such sources of nondeterminism are not always bugs—they may not lead to nondeterministic program behavior, or the nondeterminism may be intended. In fact, race conditions may be useful in gaining performance while still ensuring high-level deterministic behavior.3

More recently, a number of ongoing research efforts aim to make parallel programs deterministic by construction. These efforts include the design of new parallel programming paradigms10,12,13,19 and the design of new type systems, annotations, and checking or enforcement mechanisms that could retrofit existing parallel languages.4,15 But such efforts face two key challenges. First, new languages see slow adoption and often remain specific to limited domains. Second, new paradigms often include restrictions that can hinder general-purpose programming. For example, a new type system may require complex type annotations and may forbid reasonable programs whose determinism cannot be expressed in the type system.

We argue that programmers should be provided with a framework that will allow them to express deterministic behaviors of parallel programs directly and easily.
research highlights

behaviors of parallel programs directly and easily. Specifically, in the correctness of its use of parallelism independently
we should provide an assertion framework where program- of whatever method we use to gain confidence in the
mers can directly and precisely express intended deterministic ­program’s functional correctness.
behavior. Further, the framework should be flexible enough so that deterministic behaviors can be expressed more easily than with a traditional assertion framework. For example, when expressing the deterministic behavior of a parallel edge detection algorithm for images, we should not have to rephrase the problem as race detection; nor should we have to write a state assertion that relates the output to the input, which would be complex and time-consuming. Rather, we should simply be able to say that, if the program is executed on the same input image, then the output image remains the same regardless of how the program's parallel threads are scheduled.

In this paper, we propose such a framework for asserting that blocks of parallel code behave deterministically. Formally, our framework allows a programmer to give a specification for a block P of parallel code as:

    deterministic assume(Pre(s0, s0′)) {
        P
    } assert(Post(s, s′));

This specification asserts the following: Suppose P is executed twice with potentially different schedules, once from initial state s0 and once from s0′, yielding final states s and s′, respectively. Then, if the user-specified precondition Pre holds over s0 and s0′, then s and s′ must satisfy the user-specified postcondition Post.

For example, we could specify the deterministic behavior of a parallel matrix multiply with:

    deterministic assume(|A − A′| < 10^−9 and |B − B′| < 10^−9) {
        C = parallel_matrix_multiply_float(A, B);
    } assert(|C − C′| < 10^−6);

Note the use of primed variables A′, B′, and C′ in the above example. These variables represent the state of the matrices A, B, and C from a different execution. Thus, the predicates that we write inside assume and assert are different from state predicates written in a traditional assertion framework—our predicates relate a pair of states from different executions. We call such predicates bridge predicates, and assertions using bridge predicates bridge assertions. A key contribution of this paper is the introduction of these bridge predicates and bridge assertions.

Our deterministic assertions provide a way to specify the correctness of the parallelism in a program independently of the program's traditional functional correctness. By checking whether different program schedules can nondeterministically lead to semantically different answers, we can find bugs in a program's use of parallelism even when unable to directly specify or check functional correctness—i.e., that the program's output is correct given its input. Inversely, by checking that a parallel program behaves deterministically, we can gain confidence in the correctness of its parallelism.

We have implemented our deterministic assertions as a library for the Java programming language. We evaluated the utility of these assertions by manually adding deterministic specifications to a number of parallel Java benchmarks. We used an existing tool to find executions exhibiting data and higher-level races in these benchmarks and used our deterministic assertions to distinguish between harmful and benign races. We found it to be fairly easy to specify the correct deterministic behavior of the benchmark programs using our assertions, despite being unable in most cases to write traditional invariants or functional correctness assertions. Further, our deterministic assertions successfully distinguished the two races known to lead to undesired nondeterminism from the benign races in the benchmarks.

2. DETERMINISTIC SPECIFICATION
In this section, we motivate and define our proposal for assertions for specifying determinism.

Strictly speaking, a block of parallel code is said to be deterministic if, given any particular initial state, all executions of the code from the initial state produce the exact same final state. In our specification framework, the programmer can specify that they expect a block of parallel code, say P, to be deterministic with the following construct:

    deterministic {
        P
    }

This assertion specifies that if s and s′ are both program states resulting from executing P under different thread schedules from some initial state s0, then s and s′ must be equal. For example, the specification:

    deterministic {
        C = parallel_matrix_multiply_int(A, B);
    }

asserts that for the parallel implementation of matrix multiplication in function parallel_matrix_multiply_int, any two executions from the same program state must reach the same program state—i.e., with identical entries in matrix C—no matter how the parallel threads are scheduled.

A key implication of knowing that a block of parallel code is deterministic is that we may be able to treat the block as sequential in other contexts. That is, although the block may have internal parallelism, a programmer (or perhaps a tool) can hopefully ignore this parallelism when considering the larger program using the code block. For example, perhaps a deterministic block of parallel code in a function can be treated as if it were a sequential implementation when reasoning about the correctness of code calling the function.

Semantic Determinism: The above deterministic specification is often too conservative. For example, consider a similar example, but where A, B, and C are floating-point matrices:

    deterministic {
        C = parallel_matrix_multiply_float(A, B);
    }

Limited-precision floating-point addition and multiplication are not associative due to rounding error. Thus, depending on the implementation, it may be unavoidable that the entries of matrix C will differ slightly depending on the thread schedule.

In order to tolerate such differences, we must relax the deterministic specification:

    deterministic {
        C = parallel_matrix_multiply_float(A, B);
    } assert(|C − C′| < 10^−6);

This assertion specifies that, for any two matrices C and C′ resulting from the execution of the matrix multiply from the same initial state, the entries of C and C′ must differ by only a small quantity (i.e., 10^−6).

Note that the above specification contains a predicate over two states—each from a different parallel execution of the deterministic block. We call such a predicate a bridge predicate, and an assertion using a bridge predicate a bridge assertion. Bridge assertions are different from traditional assertions in that they allow one to write a property over two program states coming from different executions, whereas traditional assertions only allow us to write a property over a single program state.

Note also that such predicates need not be equivalence relations on pairs of states. In particular, the approximate equality used above is not an equivalence relation.

This relaxed notion of determinism is useful in many contexts. Consider the following example, which adds in parallel two items to a synchronized set:

    Set set = new SynchronizedTreeSet();
    deterministic {
        set.add(3);    ||    set.add(5);
    } assert(set.equals(set′));

If set is represented internally as a red–black tree, then a strict deterministic assertion would be too conservative. The structure of the resulting tree, and its layout in memory, will likely differ depending on which element is inserted first, and thus different parallel executions can yield different program states.

But we can use a bridge predicate to assert that, no matter what schedule is taken, the resulting set is semantically the same. That is, for objects set and set′ computed by two different schedules, the equals method must return true because the sets must logically contain the same elements. We call this semantic determinism.

Preconditions for Determinism: So far we have described the following construct:

    deterministic {
        P
    } assert(Post);

where Post is a predicate over two program states from different executions with different thread schedules. That is, if s and s′ are two states resulting from any two executions of P from the same initial state, then Post(s, s′) holds.

The above construct could be rewritten:

    deterministic assume(s0 = s0′) {
        P
    } assert(Post);

That is, if any two executions of P start from initial states s0 and s0′, respectively, and if s and s′ are the resulting final states, then s0 = s0′ implies that Post(s, s′) holds. The above rewritten specification suggests that we can relax the requirement of s0 = s0′ by replacing it with a bridge predicate Pre(s0, s0′). For example:

    deterministic assume(set.equals(set′)) {
        set.add(3);    ||    set.add(5);
    } assert(set.equals(set′));

The above specification states that if any two executions start from sets containing the same elements, then after the execution of the code, the resulting sets must also contain the same elements.
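
To see what such a specification looks like in executable form, the synchronized-set example above can be rephrased against the Java assertion library described in Section 4. The following is a sketch, not code from the paper: assertDet stands in for the library's assert operation (assert is a reserved word in Java, so the real method name may differ), while Deterministic, open, close, assume, and the Equals predicate are the library API of Figure 1.

    import java.util.Collections;
    import java.util.Set;
    import java.util.TreeSet;

    public class SetExample {
        public static void main(String[] args) throws InterruptedException {
            // A synchronized set shared by both threads; TreeSet and its
            // synchronized wrapper are Serializable, as the library requires.
            final Set<Integer> set =
                Collections.synchronizedSet(new TreeSet<Integer>());

            Deterministic.open();
            // Pre-predicate: any two executions start from semantically
            // equal sets (here, both empty).
            Deterministic.assume(set, new Equals());

            Thread t1 = new Thread(new Runnable() {
                public void run() { set.add(3); }
            });
            Thread t2 = new Thread(new Runnable() {
                public void run() { set.add(5); }
            });
            t1.start(); t2.start();
            t1.join(); t2.join();

            // Post-predicate: whichever add() ran first, the final sets
            // from any two executions must be semantically equal.
            Deterministic.assertDet(set, new Equals());
            Deterministic.close();
        }
    }
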
Comparison to Traditional Assertions: In summary, we propose the following construct for the specification of deterministic behavior:

    deterministic assume(Pre) {
        P
    } assert(Post);

Formally, it states that for any two program states s0 and s0′, if (1) Pre(s0, s0′) holds, (2) an execution of P from s0 terminates and results in state s, and (3) an execution of P from s0′ terminates and results in state s′, then Post(s, s′) must hold.
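
Written as a single formula (a restatement in added notation, where (P, s0) ⇓ s means that some execution of P from initial state s0 terminates in state s):

    Pre(s0, s0′) ∧ (P, s0) ⇓ s ∧ (P, s0′) ⇓ s′  ⟹  Post(s, s′)
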
Note that the use of bridge predicates Pre and Post has the same flavor as pre- and postconditions used for functions in program verification. However, unlike traditional pre- and postconditions, the proposed Pre and Post predicates relate pairs of states from two different executions. In traditional verification, a precondition is usually written as a predicate over
a single program state, and a postcondition is usually written over two states—the states at the beginning and end of the function. For example:

    parallel_matrix_multiply_int(A, B) {
        assume(A.cols == B.rows);
        ...
        assert(C == A × B);
        return C;
    }

The key difference between a postcondition and a Post predicate is that a postcondition relates two states at different times along the same execution—e.g., here relating inputs A and B to output C—whereas a Post predicate relates two program states from different executions.

Advantages of Deterministic Assertions: Our deterministic specifications are a middle ground between the implicit specification used in race detection—that programs should be free of data races—and the full specification of functional correctness. It is a great feature of data race detectors that typically no programmer specification is needed. However, manually determining which reported races are benign and which are bugs can be time-consuming and difficult. We believe our deterministic assertions, while requiring little effort to write, can greatly aid in distinguishing harmful from benign data races (or higher-level races).

One could argue that a deterministic specification framework is unnecessary given that we can write the functional correctness of a block of code using traditional pre- and postconditions. For example, one could write the following to specify the correct behavior of a parallel matrix multiply:

    C = parallel_matrix_multiply_float(A, B);
    assert(|C − A × B| < 10^−6);

We agree that if one can write a functional specification of a block of code, then there is no need to write a deterministic specification, as functional correctness implies deterministic behavior.

The advantage of our deterministic assertions is that they provide a way to specify the correctness of just the use of parallelism in a program, independent of the program's full functional correctness. In many situations, writing a full specification of functional correctness is difficult and time-consuming. A simple deterministic specification, however, enables us to use automated techniques to check for parallelism bugs, such as harmful data races causing semantically nondeterministic behavior.

Consider a function parallel_edge_detection that takes an image as input and returns an image where detected edges have been marked. Relating the output to the input image with traditional pre- and postconditions would likely be quite challenging. However, it is simple to specify that the routine does not have any parallelism bugs causing a correct image to be returned for some thread schedules and an incorrect image for others:

    deterministic assume(img.equals(img′)) {
        result = parallel_edge_detection(img);
    } assert(result.equals(result′));

where img.equals(img′) returns true if the two images are pixel-by-pixel equal.

For this example, a programmer could gain some confidence in the correctness of the routine by writing unit tests or manually examining the output for a handful of images. He or she could then use automated testing or model checking to separately check that the parallel routine behaves deterministically on a variety of inputs, gaining confidence that the code is free from concurrency bugs.

We believe that it is often difficult to come up with effective functional correctness assertions. However, it is often quite easy to use bridge assertions to specify deterministic behavior, enabling a programmer to check for harmful concurrency bugs. In Section 5, we provide several case studies to support this argument.

3. CHECKING DETERMINISM
There may be many potential approaches to checking or verifying a deterministic specification, from testing to model checking to automated theorem proving. In this section, we propose a simple, sound, and incomplete method for checking deterministic specifications at run-time.

The key idea of the method is that, whenever a deterministic block is encountered at run-time, we can record the program states spre and spost at the beginning and end of the block. Then, given a collection of (spre, spost) pairs for a particular deterministic block in some program, we can check a deterministic specification by comparing pairwise the pairs of initial and final states for the block. That is, for a deterministic block:

    deterministic assume(Pre) {
        P
    } assert(Post);

with pre- and postpredicates Pre and Post, we check for every recorded pair of pairs (spre, spost) and (s′pre, s′post) that:

    Pre(spre, s′pre) ⇒ Post(spost, s′post)

If this condition does not hold for some pair, then we report a determinism violation.
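
A direct realization of this check is a pairwise loop over all recorded executions of a block. The sketch below is an illustration only (the Execution record and PairwiseChecker class are not part of the paper's library); the Predicate interface comes from the library API shown in Figure 1 of Section 4.

    import java.util.List;

    /** One recorded execution of a deterministic block: the state
     *  projections captured at the start and end of the block. */
    class Execution {
        Object pre;   // projection of the initial state
        Object post;  // projection of the final state
    }

    class PairwiseChecker {
        /** Checks Pre(spre, s'pre) => Post(spost, s'post) for every
         *  pair of recorded executions of the same deterministic block.
         *  Returns false on the first determinism violation found. */
        static boolean check(List<Execution> runs,
                             Deterministic.Predicate pre,
                             Deterministic.Predicate post) {
            for (int i = 0; i < runs.size(); i++) {
                for (int j = i + 1; j < runs.size(); j++) {
                    Execution a = runs.get(i), b = runs.get(j);
                    // Only pairs whose initial states satisfy the
                    // pre-predicate are required to agree afterwards.
                    if (pre.apply(a.pre, b.pre) && !post.apply(a.post, b.post)) {
                        return false;  // determinism violation
                    }
                }
            }
            return true;
        }
    }
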
To increase the effectiveness of this checking, we must record pairs of initial and final states for deterministic blocks executed under a wide variety of possible thread interleavings and inputs. Thus, in practice we likely want to combine our deterministic assertion checking with existing techniques and tools for exploring parallel schedules of a program, such as noise making,7,18 active random scheduling,16 or model checking.20

In practice, the cost of recording and storing entire program states could be prohibitive. However, real determinism
predicates often depend on just a small portion of the whole program state. Thus, we need only to record and store small projections of program states. For example, for a deterministic specification with pre- and postpredicate set.equals(set′) we need only to save object set and its elements (possibly also the memory reachable from these objects), rather than the entire program memory. This storage cost sometimes can be further reduced by storing and comparing checksums or approximate hashes.
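
For example, when a predicate is plain equality of the serialized projection, one can store a cryptographic digest instead of the projection itself. The helper below is a sketch, not part of the paper's library; note that it is only sound when equal serialized bytes coincide with the predicate's notion of equality, so it cannot replace an approximate-equality bound.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    class ProjectionDigest {
        /** Serializes a state projection and returns a SHA-256 digest
         *  of the bytes; executions can then be compared by digest. */
        static byte[] digest(Serializable projection)
                throws IOException, NoSuchAlgorithmException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(projection);
            out.close();
            return MessageDigest.getInstance("SHA-256")
                                .digest(bytes.toByteArray());
        }
    }
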
4. DETERMINISM CHECKING LIBRARY
In this section, we describe the design and implementation of an assertion library for specifying and checking determinism of Java programs. Note that, while it might be preferable to introduce a new syntactic construct for specifying determinism, we provide the functionality as a library to simplify the implementation.

4.1. Overview
Figure 1 shows the core API for our deterministic assertion library. Functions open and close specify the beginning and end of a deterministic block. Deterministic blocks may be nested, and each block may contain multiple calls to functions assume and assert, which are used to specify the pre- and postpredicates of deterministic behavior.

Each call assume(o, pre) in a deterministic block specifies part of the prepredicate by giving some projection o of the program state and a predicate pre. That is, it specifies that one condition for any execution of the block to compute an equivalent, deterministic result is that pre.apply(o, o′) return true for object o′ from the other execution.

Similarly, a call assert(o, post) in a deterministic block specifies that, for any execution satisfying every assume, predicate post.apply(o, o′) must return true for object o′ from the other execution.

At run-time, our library records every object (i.e., state projection) passed to each assert and assume in each deterministic block, saving them to a central, persistent store. We require that all objects passed as state projections implement the Serializable interface to facilitate this recording. (In practice, this does not seem to be a heavy burden. Most core objects in the Java standard library are serializable, including numbers, strings, arrays, lists, sets, and maps/hashtables.)

Figure 1. Core deterministic specification API.

    public class Deterministic {
        static void open() {...}
        static void close() {...}
        static void assume(Object o, Predicate p) {...}
        static void assert(Object o, Predicate p) {...}

        interface Predicate {
            boolean apply(Object a, Object b);
        }
    }

Then, also at run-time, a call to assert(o, post) checks post on o and all o′ saved from previous, matching executions of the same deterministic block. If the postpredicate does not hold for any of these executions, a determinism violation is immediately reported. Deterministic blocks can contain many assert calls so that determinism bugs can be caught as early as possible and can be more easily localized.

For flexibility, programmers are free to write state projections and predicates using the full Java language. However, it is a programmer's responsibility to ensure that these predicates contain no observable side effects, as there are no guarantees as to how many times such a predicate may be evaluated in any particular run.

Built-in Predicates: For programmer convenience, we provide two built-in predicates that are often sufficient for specifying pre- and postpredicates for determinism. The first, Equals, returns true if the given objects are equal using their built-in equals method—i.e., if o.equals(o′). For many Java objects, this method checks semantic equality—e.g., for integers, floating-point numbers, strings, lists, sets, etc. Further, for single- or multidimensional arrays (which do not implement such an equals method), the Equals predicate compares corresponding elements using their equals methods. Figure 2 gives an example with assert and assume using this Equals predicate.

The second predicate, ApproxEquals, checks if two floating-point numbers, or the corresponding elements of two floating-point arrays, are within a given margin of each other. We found this predicate useful in specifying the deterministic behavior of numerical applications, where it is often unavoidable that the low-order bits may vary with different thread interleavings.
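
As a rough illustration of how these two predicates might be implemented against the Predicate interface of Figure 1, consider the following sketch. It is not the library's actual code, and for brevity it handles only object arrays (which covers nested arrays such as int[][] via deepEquals) plus double values and arrays.

    import java.util.Arrays;

    /** Sketch of Equals: semantic equality via equals(), with
     *  element-wise comparison for arrays. */
    class Equals implements Deterministic.Predicate {
        public boolean apply(Object a, Object b) {
            if (a instanceof Object[] && b instanceof Object[]) {
                // deepEquals descends into nested and primitive arrays.
                return Arrays.deepEquals((Object[]) a, (Object[]) b);
            }
            return a == null ? b == null : a.equals(b);
        }
    }

    /** Sketch of ApproxEquals: values (or corresponding array
     *  elements) must agree to within a fixed margin. */
    class ApproxEquals implements Deterministic.Predicate {
        private final double margin;
        ApproxEquals(double margin) { this.margin = margin; }

        public boolean apply(Object a, Object b) {
            if (a instanceof double[] && b instanceof double[]) {
                double[] x = (double[]) a, y = (double[]) b;
                if (x.length != y.length) return false;
                for (int i = 0; i < x.length; i++) {
                    if (Math.abs(x[i] - y[i]) > margin) return false;
                }
                return true;
            }
            return Math.abs(((Number) a).doubleValue()
                          - ((Number) b).doubleValue()) <= margin;
        }
    }
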
Figure 2. Deterministic assertions for a Mandelbrot Set implementation from the Parallel Java (PJ) Library.11

    main(String args[]) {
        // Read parameters from command-line.
        ...
        // Pre-predicate: equal parameters.
        Predicate equals = new Equals();
        Deterministic.open();
        Deterministic.assume(width, equals);
        Deterministic.assume(height, equals);
        ...
        Deterministic.assume(gamma, equals);
        // spawn threads to compute fractal
        int matrix[][] = ...;
        ...
        // join threads
        ...
        Deterministic.assert(matrix, equals);
        Deterministic.close();
        // write fractal image to file
        ...
    }

Real-World Floating-Point Predicates: In practice, floating-point computations often have input-dependent error bounds. For example, we may expect any two runs of a parallel algorithm for summing inputs x1, …, xn to return answers
equal to within 2Nε ∑i |xi|, where ε is the machine epsilon. We can assert:

    sum = parallel_sum(x);
    bound = 2 * x.length * ε * sum_of_abs(x);
    Predicate apx = new ApproxEquals(bound);
    Deterministic.assert(sum, apx);

As another example, different runs of a molecular dynamics simulation may be expected to produce particle positions equal to within something like ε multiplied by the sum of the absolute values of all initial positions. We can similarly compute this value at the beginning of the computation, and use an ApproxEquals predicate with the appropriate bound to compare particle positions.
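
In the same fragment style as above, that molecular dynamics check might look as follows; initialPositions, positions, and run_molecular_dynamics are hypothetical names introduced here for illustration, and Math.ulp(1.0) stands in for the machine epsilon:

    // Input-dependent bound: eps times the total initial magnitude.
    double eps = Math.ulp(1.0);
    double sumAbs = 0.0;
    for (double p : initialPositions) sumAbs += Math.abs(p);
    Predicate apx = new ApproxEquals(eps * sumAbs);

    double[] positions = run_molecular_dynamics(initialPositions);
    // Positions from any two schedules must agree to within the bound.
    Deterministic.assert(positions, apx);
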
combined with tools for exploring multiple thread
4.2. Concrete example: Mandelbrot schedules, deterministic assertions catch real parallel-
Figure 2 shows the deterministic assertions we added to one ism bugs that lead to semantic nondeterminism.
of our benchmarks, a program for rendering images of the Further, for traditional concurrency issues such as data
Mandelbrot Set fractal from the Parallel Java (PJ) Library.11 races, these assertions provide some ability to distin-
The benchmark first reads a number of integer and guish between benign cases and true bugs.
floating-point parameters from the command-line. It then
spawns several worker threads that each compute the hues To evaluate these claims, we used a number of bench-
for different segments of the final image and store the hues mark programs from the Java Grande Forum (JGF) bench-
in shared array matrix. After waiting for all of the worker mark suite,17 the Parallel Java (PJ) Library,11 and elsewhere.
threads to finish, the program encodes and writes the image The names and sizes of these benchmarks are given in
to a file given as a command-line argument. Table 1. We describe the benchmarks in greater detail in
To add determinism annotations to this program, we Burnim and Sen.5 Note that the benchmarks range from
simply opened a deterministic block just before the worker a  few hundred to a few thousand lines of code, with the
threads are spawned and closed it just after they are joined. PJ  benchmarks relying on an additional 10–20,000 lines
At the beginning of this block, we added an assume call for of library code from the PJ Library (for threading, synchro-
each of the seven fractal parameters, such as the image size nization, and other functionality).
and color palette. At the end of the block, we assert that the
resulting array matrix should be deterministic, however 5.1. Ease of use
the worker threads are interleaved. We evaluate the ease of use of our deterministic specification
Note that it would be quite difficult to add assertions by manually adding assertions to our benchmark programs.
for the functional correctness of this benchmark, as each One deterministic block was added to each benchmark.

Table 1. Summary of experimental evaluation of deterministic assertions. A single deterministic block specification was added to each
benchmark. Each specification was checked on executions with races found by the CalFuzzer14, 16 tool.

Data Races High-Level Races


Approximate Lines of
Lines of Code Specification Determinism Determinism
Benchmark (App + Library) (+ Predicates) Threads Found Violations Found Violations
JGF sor 300 6 10 2 0 0 0
sparsematmult 700 7 10 0 0 0 0
series 800 4 10 0 0 0 0
crypt 1,100 5 10 0 0 0 0
moldyn 1,300 6 10 2 0 0 0
lufact 1,500 9 10 1 0 0 0
raytracer 1,900 4 10 3 1 0 0
montecarlo 3,600 4 + 34 10 1 0 2 0
PJ pi 150 + 15,000 5 4 9 0 1+ 1
keysearch3 200 + 15,000 6 4 3 0 0+ 0
mandelbrot 250 + 15,000 10 4 9 0 0+ 0
phylogeny 4,400 + 15,000 8 4 4 0 0+ 0
tsp 700 4 5 6 0 2 0

The third column of Table 1 records the number of lines of specification (and lines of custom predicate code) added to each benchmark. Overall, the specification burden is quite small. Indeed, for the majority of the programs, an author was able to add deterministic assertions in only 5 to 10 minutes per benchmark, despite being unfamiliar with the code. In particular, it was typically not difficult to both identify regions of code performing parallel computation and to determine from documentation, comments, or source code which results were intended to be deterministic. Figure 2 shows the assertions added to the mandelbrot benchmark.

The added assertions were correct on the first attempt for all but two benchmarks. For phylogeny, the resulting phylogenetic tree was erroneously specified as deterministic, when, in fact, there are many correct optimal trees. The specification was modified to assert only that the optimal score must be deterministic. For sparsematmult, we incorrectly identified the variable to which the output was written. This error was identified during later work on automatically inferring deterministic specifications.6

The two predicates provided by our assertion library were sufficient for all but one of the benchmarks. For the JGF montecarlo benchmark, the authors had to write a custom equals and hashCode method for two classes—34 total lines of code—in order to assume and assert that two sets, one of initial tasks and one of results, are equivalent across executions.

Discussion: More experience, or possibly user studies, would be needed to conclude decisively that our assertions are easier to use than existing techniques for specifying that parallel code is correctly deterministic. However, we believe our experience is quite promising. In particular, writing assertions for the full functional correctness of the parallel regions of these programs seemed to be quite difficult, perhaps requiring implementing a sequential version of the code and asserting that it produces the same result. Further, there seemed to be no obvious simpler, traditional assertions that would aid in catching nondeterministic parallelism.

Despite these difficulties, we found that specifying the natural deterministic behavior of the benchmarks with our bridge assertions required little effort.

5.2. Effectiveness
To evaluate the utility of our deterministic specifications in finding true parallelism bugs, we used a modified version of the CalFuzzer14,16 tool to find real races in the benchmark programs, both data races and higher-level races (such as races to acquire a lock). For each such race, we ran 10 trials using CalFuzzer to create real executions with these races and to randomly resolve the races (i.e., randomly pick a thread to "win"). We turned on run-time checking of our deterministic assertions for these trials, and recorded all found violations.

Table 1 summarizes the results of these experiments. For each benchmark, we indicate the number of real data races and higher-level races we observed. Further, we indicate how many of these races led to determinism violations in any execution.

In these experiments, the primary computational cost is from CalFuzzer, which typically has an overhead in the range of 2x–20x for these kinds of compute-bound applications. We have not carefully measured the computational cost of our deterministic assertion library. For most benchmarks, however, the cost of serializing and comparing a computation's inputs and outputs is dwarfed by the cost of the computation itself—e.g., consider the cost of checking that two fractal images are identical versus the cost of computing each fractal in the first place.

Determinism Violations: We found two cases of nondeterministic behavior. First, a known data race in the raytracer benchmark, due to the use of the wrong lock to protect a shared sum, can yield an incorrect final answer.

Second, the pi benchmark can yield a nondeterministic answer given the same random seed because of insufficient synchronization of a shared random number generator. In each Monte Carlo sample, two successive calls to java.util.Random.nextDouble() are made. A context switch between these calls changes the set of samples generated. Similarly, nextDouble() itself makes two calls to java.util.Random.next(), which atomically generates up to 32 pseudorandom bits. A context switch between these two calls changes the generated sequence of pseudorandom doubles. Thus, although java.util.Random.nextDouble() is thread-safe and free of data races, scheduling nondeterminism can still lead to a nondeterministic result. (This behavior is known—the PJ Library provides several versions of this benchmark, one of which does guarantee a deterministic result for any given random seed.)
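
The pattern is easy to reproduce in isolation; the sketch below is an illustration, not the benchmark's code. Each call to nextDouble() is itself atomic, but the pair of calls is not, so a context switch between them changes which two values a given sample receives. Holding the generator's lock across both calls restores a schedule-independent sequence of samples for a fixed seed.

    import java.util.Random;

    class Samples {
        static final Random rng = new Random(42);  // shared, fixed seed

        // Racy: another thread may draw from rng between these two
        // calls, so the set of samples depends on the schedule.
        static double[] racySample() {
            return new double[] { rng.nextDouble(), rng.nextDouble() };
        }

        // Schedule-independent: the pair of draws is made atomic
        // with respect to other samplers.
        static double[] atomicSample() {
            synchronized (rng) {
                return new double[] { rng.nextDouble(), rng.nextDouble() };
            }
        }
    }
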
Benign Races: The high number of real data races for these benchmarks is largely due to benign races on volatile variables used for synchronization—e.g., to implement a tournament barrier or a custom lock. Although CalFuzzer does not understand these sophisticated synchronization schemes, our deterministic assertions automatically provide some confidence that these races are benign because, over the course of many experimental runs, they did not lead to nondeterministic final results.

Note that it can be quite challenging to verify by hand that these races are benign. On inspecting the benchmark code and these data races, an author several times believed he had found a synchronization bug. But on deeper inspection, the code was found to be correct in all such cases.

The number of high-level races is low for the JGF benchmarks because all the benchmarks except montecarlo exclusively use volatile variables (and thread joins) for synchronization. Thus, all observable scheduling nondeterminism is due to data races.

The number of high-level races is low for the PJ benchmarks because they primarily use a combination of volatile variables and atomic compare-and-set operations for synchronization. Currently, the only kind of high-level race our modified CalFuzzer recognizes is a lock race. Thus, while we believe there are many (benign) races in the ordering of these compare-and-set operations, CalFuzzer does not report them. The one high-level race for pi, indicated in the table and described above, was confirmed by hand.

Discussion: Although our checking of deterministic assertions is sound—an assertion failure always indicates that two executions with equivalent initial states can yield nonequivalent final states—it is incomplete. Parallelism bugs leading to nondeterminism may still exist even when testing fails to find any determinism violations.

However, in our experiments we successfully distinguished the races known to cause undesired nondeterminism from the benign races in only a small number of trials. Thus, we believe our deterministic assertions can help catch harmful nondeterminism due to parallelism, as well as save programmer effort in determining whether real races and other potential parallelism bugs can lead to incorrect program behavior.

6. DISCUSSION
In this section, we compare the concepts of atomicity and determinism. Further, we discuss several other possible uses for bridge predicates and bridge assertions.

6.1. Atomicity versus determinism
A concept complementary to determinism in parallel programs is atomicity. A block of sequential code in a multithreaded program is said to be atomic9 if for every possible interleaved execution of the program there exists an equivalent execution with the same overall behavior in which the atomic block is executed serially (i.e., the execution of the atomic block is not interleaved with actions of other threads). Therefore, if a code block is atomic, the programmer can assume that the execution of the code block by a thread cannot be interfered with by any other thread. This enables programmers to reason about atomic code blocks sequentially. This seemingly similar concept has the following subtle differences from determinism:

1. Atomicity is a property of a sequential block of code—i.e., the block of code for which we assert atomicity has a single thread of execution and does not spawn other threads. Note that a sequential block is by default deterministic if it is not interfered with by other threads. Determinism is a property of a parallel block of code. In determinism, we assume that the parallel block of code's execution is not influenced by the external world.
2. In atomicity, we say that the execution of a sequential block of code results in the same state no matter how it is scheduled with other external threads—i.e., atomicity ensures that external nondeterminism does not interfere with the execution of an atomic block of code. In determinism, we say that the execution of a parallel block of code gives the same semantic state no matter how the threads inside the block are scheduled—i.e., determinism ensures that internal nondeterminism does not result in different outputs.

In summary, atomicity and determinism are orthogonal concepts. Atomicity reasons about a single thread under external nondeterminism, whereas determinism reasons about multiple threads under internal nondeterminism.

Here we focus on atomicity and determinism as program specifications to be checked. There is much work on atomicity as a language mechanism, in which an atomic specification is instead enforced by some combination of library, run-time, compiler, or hardware support. One could similarly imagine enforcing deterministic specifications through, e.g., compiler and run-time mechanisms.4

6.2. Other uses of bridge predicates
We have already argued that bridge predicates simplify the task of directly and precisely specifying deterministic behavior of parallel programs. We also believe that bridge predicates could provide a simple but powerful tool to express correctness properties in many other situations. For example, if we have two versions of a program, P1 and P2, that we expect to produce the same output on the same input, then we can easily assert this using our framework as follows:

    deterministic assume(Pre) {
        if (nonDeterministicBoolean()) {
            P1
        } else {
            P2
        }
    } assert(Post);

where Pre requires that the inputs are the same and Post specifies that the outputs will be the same.

In particular, if a programmer has written both a sequential and a parallel version of a piece of code, he or she can specify that the two versions are semantically equivalent with an assertion like:

    deterministic assume(A==A′ and B==B′) {
        if (nonDeterministicBoolean()) {
            C = par_matrix_multiply_int(A, B);
        } else {
            C = seq_matrix_multiply_int(A, B);
        }
    } assert(C==C′);

where nonDeterministicBoolean() returns true or false nondeterministically.
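
During testing, the nondeterministic choice has to be realized somehow; the paper does not prescribe a mechanism. One simple realization (a sketch, not part of the library) is a coin flip, so that repeated runs exercise both versions under the same bridge predicates:

    import java.util.Random;

    class NonDet {
        private static final Random rnd = new Random();

        // Returns true or false arbitrarily; over many test runs, both
        // branches of the deterministic block get exercised and compared.
        static boolean nonDeterministicBoolean() {
            return rnd.nextBoolean();
        }
    }
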
Similarly, a programmer can specify that the old and new versions of a piece of code are semantically equivalent:

    deterministic assume(A==A′ and B==B′) {
        if (nonDeterministicBoolean()) {
            C = old_matrix_multiply_int(A, B);
        } else {
            C = new_matrix_multiply_int(A, B);
        }
    } assert(C==C′);



Checking this specification is a kind of regression testing. In particular, if the code change has introduced a regression—i.e., a bug that causes the new code to produce a semantically different output than the old code for some input—then the above specification does not hold.

Further, we believe there is a wider class of program properties that are easy to write in bridge assertions but would be quite difficult to write otherwise. For example, consider the specification:

    deterministic assume(set.size() == set′.size()) {
        P
    } assert(set.size() == set′.size());

This specification requires that sequential or parallel program block P transforms set so that its final size is some function of its initial size, independent of its elements. The specification is easy to write even in cases where the exact relationship between the initial and final size might be quite complex and difficult to write. It is not entirely clear, however, when such properties are important or useful to specify.
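
Such a size-only bridge predicate is nevertheless easy to express against the library's Predicate interface; the class below is an illustration, not a built-in:

    import java.util.Set;

    /** Two sets "match" when they have equal size, whatever
     *  their elements. */
    class EqualSize implements Deterministic.Predicate {
        public boolean apply(Object a, Object b) {
            return ((Set<?>) a).size() == ((Set<?>) b).size();
        }
    }
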
Conference on Object-Oriented determinism checker for
Programming Systems, Languages multithreaded programs. In
and Applications (OOPSLA) (2009). 18th European Symposium on
5. Burnim, J., Sen, K. Asserting Programming (ESOP) (2009).
7. CONCLUSION and checking determinism for 16. Sen, K. Race directed random
multithreaded programs. In 7th testing of concurrent programs.
We have proposed bridge predicates and bridge assertions Joint Meeting of the European In ACM SIGPLAN Conference on
for specifying the user-intended semantic deterministic Software Engineering Conference Programming Language Design
and the ACM SIGSOFT Symposium and Implementation (PLDI'08)
behavior of parallel programs. We argue that our specifica- on the Foundations of Software (2008).
tions are much simpler for programmers to write than tradi- Engineering (ESEC/FSE) (2009). 17. Smith, L.A., Bull, J.M., Obdrzálek,
6. Burnim, J. Sen, K. DETERMIN: J. A parallel java grande benchmark
tional specifications of functional correctness, because they Inferring likely deterministic suite. In ACM/IEEE Conference
enable programmers to compare pairs of program states specifications of multithreaded on Supercomputing (SC) (2001).
programs. In 32nd ACM/IEEE 18. Stoller, S.D. Testing concurrent
across different executions rather than relating program International Conference on Java programs using randomized
outputs directly to program inputs. Thus, bridge predicates Software Engineering (ICSE) (2010). scheduling. In 2nd Workshop on
7. Edelstein, O., Farchi, E., Nir, Y., Runtime Verification (RV) (2002).
and bridge assertions can be thought of as a lightweight Ratsaby, G., Ur, S. Multithreaded 19. Thies, W., Karczmarek, M.,
mechanism for specifying the correctness of just the paral- Java program test generation. IBM Amarasinghe, S. StreamIt:
Syst. J. 41, 1 (2002), 111–125. A language for streaming
lelism in a program, independently of the program’s func- 8. Flanagan, C., Freund, S.N. Atomizer: applications. In 11th International
tional correctness. A dynamic atomicity checker for Conference
multithreaded programs. In 31st on Compiler Construction (CC)
We have shown experimental evidence that we can effec- ACM SIGPLAN-SIGACT Symposium (2002).
on Principles of Programming 20. Visser, W., Havelund, K., Brat, G.,
tively check our deterministic specifications. In particular, Languages (POPL) (2004). Park, S., Lerda, F. Model checking
we can use existing techniques for testing parallel software 9. Flanagan, C., Qadeer, S. A type and programs. Autom. Softw. Eng. 10, 2
effect system for atomicity. In ACM (2003), 203–232.
to generate executions exhibiting data and higher-level SIGPLAN 2003 Conference on 21. von Praun, C., Gross, T.R. Object
races. Then our deterministic specifications allow us to Programming Language Design and race detection. In 16th ACM
Implementation (PLDI) (2003). SIGPLAN Conference on Object-
distinguish from the benign races the parallel nondeter- 10. Johnston, W.M., Hanna, J.R.P., Oriented Programming, Systems,
minism bugs that lead to unintended nondeterministic Millar, R.J. Advances in dataflow Languages, and Applications
programming languages. ACM (OOPSLA) (2001).
program behavior. Thus, we argue that it is worthwhile for Comput. Surv. 36, 1 (2004),
programmers to write such lightweight deterministic speci-
fications. In fact, later work6 has suggested that, given the Jacob Burnim (jburnim@cs.berkeley.edu), Koushik Sen (ksen@cs.berkeley.edu),
simple form of our specifications, it may often be possible EECS Department, UC Berkeley, CA. EECS Department, UC Berkeley, CA.
to automatically infer likely deterministic specifications for
parallel programs.

Acknowledgments
We would like to thank Nicholas Jalbert, Mayur Naik,
Chang-Seo Park, and our anonymous reviewers for their
valuable comments on previous drafts of this paper. This
work supported in part by Microsoft (Award #024263) and © 2010 ACM 0001-0782/10/0600 $10.00

doi:10.1145/1743546.1743573

Technical Perspective
Learning To Do Program Verification
By K. Rustan M. Leino

When you decide to use a piece of software, how do you know it will do what you need it to do or what it claims to do? Will it even be safe to run? Will it interfere with other software you already have?

As a software engineer developing that piece of software, what can you do to ensure the software will be ready in time and will meet your quality standards?

A long-standing dream of ideal software development calls for the use of program verification as part of the answers to these questions. Program verification is the process by which one develops a mathematical proof that shows the software satisfies its functional specification. This dream had grown roots already by the 1970s.

Since then, and especially in the last decade, the technology behind program verification has become an important part of software development practice. At Microsoft alone, a number of symbolic execution techniques, including abstract interpretation, counterexample-guided predicate abstraction, and satisfiability-modulo-theories solving, are used routinely to enhance the testing of software, in some cases even mathematically verifying the absence of certain kinds of software defects.

But what we use today pales in comparison to the grand dreams of the 1970s. Why? Is it not possible to provide rigorous proofs of modern software? Is it too difficult to capture in functional specifications the intended behavior of software? Is full program verification only for small algorithms and toy examples? Are hardcore programmers and verification experts not able to find common goals to work toward? Is it not cost-effective to insist that every detail of a piece of software is correct?

The following work by Gerwin Klein et al. is a landmark in the further development of applying functional-correctness verification in practice. The authors have produced a machine-checkable proof of correctness for the microkernel of an operating system. For a user of the operating system, this verification provides a number of significant benefits; for example, the assurance that no hostile application running under this operating system can subvert the integrity of the kernel through a buffer-overrun attack. For the software engineers involved, the verification brings the benefit of a code analysis that far surpasses that achieved by testing alone.

The seL4 verification connects three descriptions of the system: an abstract specification of the kernel that describes what its functional behavior is, a high-level prototype implementation of the system that introduces further details of how the kernel performs its operations, and a low-level, hand-optimized implementation that deals with the nitty-gritty of the kernel's operation. A novel aspect of this project is how the middle layer was used as a stepping-stone not just for the verification, but also for the actual design and implementation of the system itself. This has the great advantage that the verification process gets the chance to influence the design. As the paper reports, this led to a large number of changes in the top two layers, with the effect of boosting the productivity of both the design team and the verification team.

The authors are careful to list not only positive implications of the verification, but also the assumptions upon which the verification rests. This is important, for all verification is performed at some level of abstraction. For example, verifying a program at the level of abstraction provided by a programming language does not say anything about hardware failures. Verification is not an absolute; what it seeks to do is offer detailed aid for programmers at the level of abstraction at which they are working.

A project like the seL4 verification is not easy to pull off. It takes a strong vision, as well as a significant amount of work, expertise, and persistence. In short, it is a big bet. I applaud the team for undertaking it, and congratulate them on delivering. Any doubts as to the technical feasibility of such a project should by now have been removed.

The question about cost-effectiveness, however, remains. Some may argue that the end result of the verification—a level of assurance that could not have been obtained using more traditional methods—has already made up for the effort expended. Others may balk at the 20 person-years to complete the proof or at the ratio of 200,000 lines of proof script to 6,000 lines of eventual C code. I would like to offer a different perspective on these numbers. First, they provide a benchmark against which to compare future work. I would expect that in another decade, a similar project will take less effort and will involve a larger degree of automation. Second, the effort has resulted not just in an impressive engineering achievement, but also in an appreciable amount of scientific learning. It is through pioneering and repeated efforts like this one that we will learn how to apply full program verification on a more regular basis.

K. Rustan M. Leino (leino@microsoft.com) is a Principal Researcher at Microsoft Research, Redmond, WA.

© 2010 ACM 0001-0782/10/0600 $10.00

doi:10.1145/1743546.1743574

seL4: Formal Verification of an Operating-System Kernel
By Gerwin Klein, June Andronick, Kevin Elphinstone, Gernot Heiser, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood

Abstract
We report on the formal, machine-checked verification of the seL4 microkernel from an abstract specification down to its C implementation. We assume correctness of compiler, assembly code, hardware, and boot code.

seL4 is a third-generation microkernel of L4 provenance, comprising 8700 lines of C and 600 lines of assembler. Its performance is comparable to other high-performance L4 kernels.

We prove that the implementation always strictly follows our high-level abstract specification of kernel behavior. This encompasses traditional design and implementation safety properties such as that the kernel will never crash, and it will never perform an unsafe operation. It also implies much more: we can predict precisely how the kernel will behave in every possible situation.

1. INTRODUCTION
Almost every paper on formal verification starts with the observation that software complexity is increasing, that this leads to errors, and that this is a problem for mission and safety critical software. We agree, as do most.

Here, we report on the full formal verification of a critical system from a high-level model down to very low-level C code. We do not pretend that this solves all of the software complexity or error problems. We do think that our approach will work for similar systems. The main message we wish to convey is that a formally verified commercial-grade, general-purpose microkernel now exists, and that formal verification is possible and feasible on code sizes of about 10,000 lines of C. It is not cheap; we spent significant effort on the verification, but it appears cost-effective and more affordable than other methods that achieve lower degrees of trustworthiness.

To build a truly trustworthy system, one needs to start at the operating system (OS), and the most critical part of the OS is its kernel. The kernel is defined as the software that executes in the privileged mode of the hardware, meaning that there can be no protection from faults occurring in the kernel, and every single bug can potentially cause arbitrary damage. The kernel is a mandatory part of a system's trusted computing base (TCB)—the part of the system that can bypass security.10 Minimizing this TCB is the core concept behind microkernels, an idea that goes back 40 years.

A microkernel, as opposed to the more traditional monolithic design of contemporary mainstream OS kernels, is reduced to just the bare minimum of code wrapping hardware mechanisms and needing to run in privileged mode. All OS services are then implemented as normal programs, running entirely in (unprivileged) user mode, and therefore can potentially be excluded from the TCB. Previous implementations of microkernels resulted in communication overheads that made them unattractive compared to monolithic kernels. Modern design and implementation techniques have managed to reduce this overhead to very competitive limits.

A microkernel makes the trustworthiness problem more tractable. A well-designed high-performance microkernel, such as the various representatives of the L4 microkernel family, consists of the order of 10,000 lines of code (10 kloc). This radical reduction to a bare minimum comes with a price in complexity. It results in a high degree of interdependency between different parts of the kernel, as indicated in Figure 1. Despite this increased complexity in low-level code, we have demonstrated that with modern techniques and careful design, an OS microkernel is entirely within the realm of full formal verification.

Figure 1. Call graph of the seL4 microkernel. Vertices represent functions, and edges invocations.

The original version of this paper was published in the Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles, Oct. 2009.
Formal verification of software refers to the application of mathematical proof techniques to establish properties about programs. Formal verification can cover not just all lines of code or all decisions in a program, but all possible behaviors for all possible inputs. For example, the very simple fragment of C code if (x < y) z = x/y; else z = y/x; for x, y, and z being int, tested with x = 4, y = 2 and x = 8, y = 16, results in full code coverage: every line is executed at least once, and every branch of every condition is taken at least once. Yet, there are still two potential bugs remaining. Of course, any human tester will find inputs such as x = 0, y = −1 and x = −1, y = 0 that expose the bugs (each triggers a division by zero), but for bigger programs it is infeasible to be sure of completeness. This is what formal verification can achieve.

The approach we use is interactive, machine-assisted, and machine-checked proof. Specifically, we use the theorem prover Isabelle/HOL.8 Interactive theorem proving requires human intervention and creativity to construct and guide the proof. It has the advantage that it is not constrained to specific properties or finite, feasible state spaces. We have proved the functional correctness of the seL4 microkernel, a secure embedded microkernel of the L4 family.6 This means we have proved mathematically that the implementation of seL4 always strictly follows our high-level abstract specification of kernel behavior. This property is stronger and more precise than what automated techniques like model checking, static analysis, or kernel implementations in type-safe languages can achieve. We not only analyze specific aspects of the kernel, such as safe execution, but also provide a full specification and proof for the kernel's precise behavior.

In the following, we describe what the implications of the proof are, how the kernel was designed for verification, what the verification itself entailed and what its assumptions are, and finally what effort it cost us.

2. IMPLICATIONS
In a sense, functional correctness is one of the strongest properties to prove about a system. Once we have proved functional correctness with respect to a model, we can use this model to establish further properties instead of having to reason directly about the code. For instance, we prove that every system call terminates by looking at the model instead of the code. However, there are some security-relevant properties, such as transmission of information via covert channels, for which the model may not be precise enough.

So our proof does not mean that seL4 is secure for all purposes. We proved that seL4 is functionally correct. Secure would first need a formal definition and depends on the application. Taken seriously, security is a whole-system question, including the system's human components.

Even without proving specific security properties on top, a functional correctness proof already has interesting implications for security. If the assumptions listed in Section 4.5 are true, then in seL4 there will be:

No code injection attacks: If we always know precisely what the system does and if the spec does not explicitly allow it, then we can never have any foreign code executing as part of seL4.

No buffer overflows: This is mainly a classic vector for code injection, but buffer overflows may also inject unwanted data and influence kernel behavior that way. We prove that all array accesses are within bounds and we prove that all pointer accesses are well typed, even if they go via casts to void or address arithmetic.

No NULL pointer access: NULL pointer bugs can allow local privilege escalation and execution of arbitrary code in kernel mode.9 Absence of NULL pointer dereference is a direct proof obligation for us for every pointer access.

No ill-typed pointer access: Even though the kernel code deliberately breaks C type safety for efficiency at some points, in order to predict that the system behaves according to specification, we prove that circumventing the type system is safe at all these points.

No memory leaks, and no memory freed that is still in use: This is not purely a consequence of the proof itself. Much of the design of seL4 was focused on explicit memory management. Users may run out of memory, but the kernel never will.

No nontermination: We have proved that all kernel calls terminate. This means the kernel will never suddenly freeze and not return from a system call. This does not mean that the whole system will never freeze. It is still possible to write bad device drivers and bad applications, but set up correctly, a supervisor process can always stay in control of the rest of the system.

No arithmetic or other exceptions: The C standard defines a long list of things that can go wrong and that should be avoided: shifting machine words by a too-large amount, dividing by zero, etc. We proved explicitly that none of these occur, including the absence of errors due to overflows in integer arithmetic.

No unchecked user arguments: All user input is checked and validated. If the kernel receives garbage or malicious arguments it will respond with the specified error messages, not with crashes. Of course, the kernel will allow a thread to kill itself if that thread has sufficient capabilities. It will never allow anything to crash the kernel, though.

Many of these are general security traits that are good to have for any kind of system. We have also proved a large number of properties that are specific to seL4. We have proved them about the kernel design and specification. With functional correctness, we know they are true about the code as well. Some examples are:

Aligned objects: Two simple low-level invariants of the kernel are: all objects are aligned to their size, and no two objects overlap in memory. This makes comparing memory regions for objects very simple and efficient.

Well-formed data structures: Lists, doubly linked, singly linked, with and without additional information, are a pet topic of formal verification. These data structures also occur in seL4 and we proved the usual properties: lists are not circular when they should not be, back pointers point to the right nodes, insertion, deletion, etc., work as expected.

Algorithmic invariants: Many optimizations rely on certain properties being always true, so specific checks can be left out or can be replaced by other, more efficient checks. A simple example is that the distinguished idle thread is always in thread state idle and therefore can never be blocked or
otherwise waiting for I/O. This can be used to remove checks in the code paths that deal with the idle thread.

Correct book-keeping: The seL4 kernel has an explicit user-visible concept of keeping track of memory: who has access to it, who access was delegated to, and what needs to be done if a privileged process wants to revoke access from delegates. It is the central mechanism for reusing memory in seL4. The data structure that backs this concept is correspondingly complex, and its implications reach into almost all aspects of the kernel. For instance, we proved that if a live object exists anywhere in memory, then there exists an explicit capability node in this data structure that covers the object. And if such a capability exists, then it exists in the proper place in the data structure and has the right relationship towards parents, siblings, and descendants within it. If an object is live (may be mentioned in other objects anywhere in the system), then the object itself together with that capability must have recorded enough information to reach all objects that refer to it (directly or indirectly). Together with a whole host of further invariants, these properties allow the kernel code to reduce the complex, system-global test of whether a region of memory is mentioned anywhere else in the system to a quick, local pointer comparison.

We have proved about 80 such invariants on the executable specification such that they directly transfer to the data structures used in the C program.
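As an illustration of that final step, here is a minimal sketch of how such invariants can collapse a system-global question into a local test; the structure is hypothetical and heavily simplified, not the seL4 capability derivation tree itself:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node in a capability derivation tree. */
typedef struct cdt_node {
    struct cdt_node *parent;
    struct cdt_node *next_sibling;
    struct cdt_node *first_child;
} cdt_node_t;

/* If the invariants guarantee that every reference to an object is
 * reachable from its covering capability, then "is this memory still
 * mentioned anywhere in the system?" needs no global scan: a look at
 * the covering node's local links suffices. */
static bool last_reference(const cdt_node_t *cap) {
    return cap->first_child == NULL && cap->next_sibling == NULL;
}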
A verification like this is not an absolute guarantee. The key condition in all this is if the assumptions are true. To attack any of these properties, this is where one would have to look. What the proof really does is take 7,500 lines of C code out of the equation. It reduces possible attacks, and the human analysis necessary to guard against them, to the assumptions and specification. It also is the basis for any formal analysis of systems running on top of the kernel, or for further high-level analysis of the kernel itself.
3. KERNEL DESIGN FOR VERIFICATION
The challenge in designing a verifiable and usable kernel lies in reducing complexity to make verification easier while maintaining high performance.

To achieve these two objectives, we designed and implemented a microkernel from scratch. This kernel, called seL4, is a third-generation microkernel, based on L4 and influenced by EROS.11 It is designed for practical deployment in embedded systems with high trustworthiness requirements. One of its innovations is completely explicit memory management subject to policies defined at user level, even for kernel memory. All authority in seL4 is mediated by capabilities,2 tokens identifying objects and conveying access rights.

We first briefly present the approach we used for a kernel/proof codesign process. Then we highlight the main design decisions we made to simplify the verification work.

3.1. Kernel/proof codesign process
One key idea in this project was bridging the gap between verifiability and performance by using an iterative approach to kernel design, based around an intermediate target that is readily accessible to both OS developers and formal methods practitioners. We used the functional language Haskell to provide a programming language for OS developers, while at the same time providing an artifact that can readily be reasoned about in the theorem proving tool: the design team wrote increasingly complete prototypes of the kernel in Haskell, exporting the system call interface via a hardware simulator to user-level binary code. The formal methods team imported this prototype into the theorem prover and used it as an intermediate executable specification. The approach aims at quickly iterating through design, prototype implementation, and formal model until convergence.

Despite its ability to run real user code, the Haskell kernel remains a prototype, as it does not satisfy our high-performance requirement. Furthermore, Haskell requires a significant run-time environment (much bigger than our kernel), and thus violates our requirement of a small TCB. We therefore translated the Haskell implementation manually into high-performance C code. An automatic translation (without proof) would have been possible, but we would have lost most opportunities to micro-optimize the kernel in order to meet our performance targets. We do not need to trust the translations into C and from Haskell into Isabelle—we formally verify the C code as it is seen by the compiler, gaining an end-to-end theorem between formal specification and the C semantics.

3.2. Design decisions
Global Variables and Side Effects: Use of global variables and functions with side effects is common in operating systems—mirroring properties of contemporary computer hardware and OS abstractions. Our verification techniques can deal routinely with side effects, but implicit state updates and complex use of the same global variable for different purposes make verification more difficult. This is not surprising: the higher the conceptual complexity, the higher the verification effort.

The deeper reason is that global variables usually require stating and proving invariant properties. For example, scheduler queues are global data structures frequently implemented as doubly linked lists. The corresponding invariant might state that all back links in the list point to the appropriate nodes, that all elements point to thread control blocks, and that all active threads are in one of the scheduler queues.

Invariants are expensive because they need to be proved not only locally for the functions that directly manipulate the scheduler queue, but for the whole kernel—we have to show that no other pointer manipulation in the kernel destroys the list or its properties. This proof can be easy or hard, depending on how modularly the global variable is used.

Dealing with global variables was simplified by deriving the kernel implementation from Haskell, where side effects are explicit and drawn to the design team's attention.
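To make the scheduler-queue example concrete, the following minimal sketch (hypothetical names, not the seL4 sources) shows the shape of the back-link invariant that every pointer manipulation in the kernel must preserve:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical doubly linked queue of thread control blocks. */
typedef struct tcb {
    struct tcb *next;
    struct tcb *prev;
} tcb_t;

/* The invariant: each node's back link points at the node whose
 * forward link points to it. Proving this locally for the queue
 * functions is not enough; no other kernel code may break it. */
static bool queue_well_formed(const tcb_t *head) {
    const tcb_t *prev = NULL;
    for (const tcb_t *t = head; t != NULL; t = t->next) {
        if (t->prev != prev)
            return false;
        prev = t;
    }
    return true;
}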
Kernel Memory Management: The seL4 kernel uses a model of memory allocation that exports control of the in-kernel allocation to appropriately authorized applications. While this model is mostly motivated by the need for precise guarantees of memory consumption, it also benefits verification. The model pushes the policy for allocation outside the kernel, which means we only need to prove that the mechanism

works, not that the user-level policy makes sense. The mechanism works if it keeps kernel code and data structures safe from user access, if the virtual memory (VM) subsystem is fully controlled by the kernel interface via capabilities, and if it provides the necessary functionality for user level to manage its own VM policies.

Obviously, moving policy into user land does not change the fact that memory allocation is part of the TCB. It does mean, however, that memory allocation can be verified separately, and can rely on verified kernel properties.

The memory-management model gives free memory to the user-level manager in the form of regions tagged as untyped. The memory manager can split untyped regions and retype them into one of several kernel object types (one of them, frame, is for user-accessible memory); such operations create new capabilities. Object destruction converts a region back to untyped (and invalidates derived capabilities).
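A minimal sketch of the retype step just described, with hypothetical types and no claim to match the actual seL4 API:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical object types; FRAME stands for user-accessible memory. */
typedef enum { UNTYPED, FRAME, TCB } obj_type_t;

typedef struct {
    uintptr_t  base;
    unsigned   size_bits;  /* region covers 1 << size_bits bytes */
    obj_type_t type;
    uintptr_t  watermark;  /* offset of next free byte (UNTYPED only) */
} mem_region_t;

/* Carve one object of the requested size out of an untyped region.
 * The kernel mechanism only checks that the request fits; deciding
 * what to allocate, and when, is user-level policy. */
static bool untyped_retype(mem_region_t *ut, obj_type_t type,
                           unsigned size_bits, mem_region_t *out) {
    if (ut->type != UNTYPED || size_bits > ut->size_bits)
        return false;
    uintptr_t sz    = (uintptr_t)1 << size_bits;
    uintptr_t start = (ut->base + ut->watermark + sz - 1) & ~(sz - 1);
    if (start + sz > ut->base + ((uintptr_t)1 << ut->size_bits))
        return false;                      /* region exhausted */
    out->base = start;  out->size_bits = size_bits;
    out->type = type;   out->watermark = 0;
    ut->watermark = (start + sz) - ut->base;
    return true;
}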
Before reusing a block of memory, all references to this memory must be invalidated. This involves either finding all outstanding capabilities to the object, or returning the object to the memory pool only when the last capability is deleted. Our kernel uses both approaches. In the first approach, a so-called capability derivation tree is used to find and invalidate all capabilities referring to a memory region. In the second approach, the capability derivation tree is used to ensure, with a check that is local in scope, that there are no system-wide dangling references. This is possible because all other kernel objects have further invariants on their own internal references that relate back to the existence of capabilities in this derivation tree.

Similar book-keeping would be necessary for a traditional malloc/free model in the kernel. The difference is that the complicated free case in our model is concentrated in one place, whereas otherwise it would be repeated numerous times over the code.

Concurrency and Nondeterminism: Concurrency is the execution of computation in parallel (in the case of multiple hardware processors), or by nondeterministic interleaving via a concurrency abstraction like threads. Reasoning about concurrent programs is hard, much harder than reasoning about sequential programs. For the time being, we limited the verification to a single-processor version of seL4.

In a uniprocessor kernel, concurrency can result from three sources: yielding of the processor from one thread to another, (synchronous) exceptions and (asynchronous) interrupts. Yielding can be synchronous, by an explicit handover, such as when blocking on a lock, or asynchronous, by preemption (but in a uniprocessor kernel, the latter can only happen as the result of an interrupt).
We limit the effect of all three by a kernel design which explicitly minimizes concurrency.

Exceptions are completely avoided, by ensuring that they never occur. For instance, we avoid virtual-memory exceptions by allocating all kernel data structures in a region of VM which is always guaranteed to be mapped to physical memory. System-call arguments are either passed in registers or through preregistered physical memory frames.

The complexity of synchronous yield we avoid by using an event-based kernel execution model, with a single kernel stack, and a mostly atomic application programming interface. This is aided by the traditional L4 model of system calls, which are primitive and mostly short-running.

We minimize the effect of interrupts (and hence preemptions) by disabling interrupts during kernel execution. Again, this is aided by the L4 model of short system calls. However, not all kernel operations can be guaranteed to be short; object destruction especially can require almost arbitrary execution time, so not allowing any interrupt processing during a system call would rule out the use of the kernel for real-time applications, undermining the goal of real-world deployability.

We ensure bounded interrupt latencies by the standard approach of introducing a few, carefully placed, interrupt points. On detection of a pending interrupt, the kernel explicitly returns through the function call stack to the kernel/user boundary and responds to the interrupt. It then restarts the original operation, including reestablishing all the preconditions for execution. As a result, we completely avoid concurrent execution in the kernel.
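A minimal sketch of what such an interrupt point might look like inside a long-running operation; the helper functions are hypothetical, and this illustrates the idea rather than the seL4 implementation:

#include <stdbool.h>

/* Hypothetical hooks: poll the hardware for a pending interrupt, and
 * unwind through the call stack to the kernel/user boundary. */
extern bool irq_pending(void);
extern void return_to_kernel_exit(void);

/* Long-running destruction loop with a placed interrupt point. Each
 * iteration re-establishes the kernel invariants, so the operation
 * can be abandoned at the check and simply restarted later. */
void destroy_objects(int n, void (*destroy_one)(int)) {
    for (int i = 0; i < n; i++) {
        destroy_one(i);
        if (irq_pending())
            return_to_kernel_exit();  /* handle IRQ, then restart */
    }
}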
I/O: Interrupts are used by device drivers to effect I/O. L4 kernels traditionally implement device drivers as user-level programs, and seL4 is no different. Device interrupts are converted into messages to the user-level driver.

This approach removes a large amount of complexity from the kernel implementation (and the proof). The only exception is an in-kernel timer driver which generates timer ticks for scheduling, which is straightforward to deal with.

4. VERIFICATION OF seL4
This section gives an overview of the formal verification of seL4 in the theorem prover Isabelle/HOL.8 The property we are proving is functional correctness. Formally, we are showing refinement: a refinement proof establishes a correspondence between a high-level (abstract) and a low-level (concrete, or refined) representation of a system.

The correspondence established by the refinement proof ensures that all Hoare logic properties of the abstract model also hold for the refined model. This means that if a security property is proved in Hoare logic about the abstract model (not all security properties can be), our refinement guarantees that the same property holds for the kernel source code. In this paper, we concentrate on the general functional correctness property. We have also modelled and proved the security of seL4's access-control system in Isabelle/HOL on a high level.3

Figure 2 shows the specification layers used in the verification of seL4; they are related by formal proof. In the following sections, we explain each layer in turn.

4.1. Abstract specification
The abstract level describes what the system does without saying how it is done. For all user-visible kernel operations, it describes the functional behavior that is expected from the system. All implementations that refine this specification will be binary compatible.

We precisely describe argument formats, encodings, and error reporting, so, for instance, some of the C-level size restrictions become visible on this level. We model finite machine words, memory, and typed pointers explicitly.

Figure 2. The refinement layers in the verification of seL4. [Within Isabelle/HOL: the abstract specification is related by a refinement proof to the executable specification, which is produced by automatic translation from the Haskell prototype; a second refinement proof relates the executable specification to the high-performance C implementation.]

Otherwise, the data structures used in this abstract specification are high level—essentially sets, lists, trees, functions, and records. We make use of nondeterminism in order to leave implementation choices to lower levels: if there are multiple correct results for an operation, this abstract layer would return all of them and make clear that there is a choice. The implementation is free to pick any one of them.

An example of this is scheduling. No scheduling policy is defined at the abstract level. Instead, the scheduler is modelled as a function picking any runnable thread that is active in the system, or the idle thread. The Isabelle/HOL code for this is shown in Figure 3. The function all_active_tcbs returns the abstract set of all runnable threads in the system. Its implementation (not shown) is an abstract logical predicate over the whole system. The select statement picks any element of the set. The OR makes a nondeterministic choice between the first block and switch_to_idle_thread. The executable specification makes this choice more specific.

Figure 3. Isabelle/HOL code for scheduler at abstract level.

schedule ≡ do
    threads ← all_active_tcbs;
    thread ← select threads;
    switch_to_thread thread
od OR switch_to_idle_thread

4.2. Executable specification
The purpose of the executable specification is to fill in the details left open at the abstract level and to specify how the kernel works (as opposed to what it does). While trying to avoid the messy specifics of how data structures and code are optimized in C, we reflect the fundamental restrictions in size and code structure that we expect from the hardware and the C implementation. For instance, we take care not to use more than 64 bits to represent capabilities, exploiting known alignment of pointers. We do not specify in which way this limited information is laid out in C.

The executable specification is deterministic; the only nondeterminism left is that of the underlying machine. All data structures are now explicit data types, records, and lists with straightforward, efficient implementations in C. For example, the capability derivation tree of seL4, modelled as a tree on the abstract level, is now modelled as a doubly linked list with limited level information. It is manipulated explicitly with pointer-update operations.

Figure 4 shows part of the scheduler specification at this level. The additional complexity becomes apparent in the chooseThread function, which is no longer merely a simple predicate, but rather an explicit search backed by data structures for priority queues. The specification fixes the behavior of the scheduler to a simple priority-based round-robin algorithm. It mentions that threads have time slices, and it clarifies when the idle thread will be scheduled. Note that priority queues duplicate information that is already available (in the form of thread states), in order to make it available efficiently. They make it easy to find a runnable thread of high priority. The optimization will require us to prove that the duplicated information is consistent.

Figure 4. Haskell code for schedule.

schedule = do
    action <- getSchedulerAction
    case action of
        ChooseNewThread -> do
            chooseThread
            setSchedulerAction ResumeCurrentThread
        ...

chooseThread = do
    r <- findM chooseThread' (reverse [minBound .. maxBound])
    when (r == Nothing) $ switchToIdleThread

chooseThread' prio = do
    q <- getQueue prio
    liftM isJust $ findM chooseThread'' q

chooseThread'' thread = do
    runnable <- isRunnable thread
    if not runnable then do
        tcbSchedDequeue thread
        return False
    else do
        switchToThread thread
        return True
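The duplication noted above gives rise to a consistency invariant. A minimal sketch of one direction of it, with hypothetical names rather than the actual data structures:

#include <stdbool.h>
#include <stddef.h>

enum thread_state { RUNNING, RUNNABLE, BLOCKED, IDLE };

typedef struct tcb {
    enum thread_state state;
    struct tcb *next;       /* link within a priority queue */
} tcb_t;

#define NUM_PRIO 256
static tcb_t *ready_queues[NUM_PRIO];

/* Everything in a ready queue must really be runnable; the full
 * invariant also requires the converse, that every runnable thread
 * appears in the queue for its priority. */
static bool queues_consistent(void) {
    for (int p = 0; p < NUM_PRIO; p++)
        for (const tcb_t *t = ready_queues[p]; t != NULL; t = t->next)
            if (t->state != RUNNABLE)
                return false;
    return true;
}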
We have proved that the executable specification correctly implements the abstract specification. Because of its extreme level of detail, this proof alone already provides stronger design assurance than has been shown for any other general-purpose OS kernel.

4.3. C implementation
The most detailed layer in our verification is the C implementation. The translation from C into Isabelle is correctness-critical, and we take great care to model the semantics of our C subset precisely and foundationally. Precisely means that we treat C semantics, types, and memory model as the C99 standard4 prescribes, for instance, with architecture-dependent word size, padding of structs, type-unsafe casting of pointers, and arithmetic on addresses. As kernel programmers do, we make assumptions about the compiler (GCC) that go beyond the standard, and about the architecture


used (ARMv6). These are explicit in the model, and we can therefore detect violations. Foundationally means that we do not just axiomatize the behavior of C on a high level, but we derive it from first principles as far as possible. For example, in our model of C, memory is a primitive function from addresses to bytes without type information or restrictions. On top of that, we specify how types like unsigned int are encoded, how structures are laid out, and how implicit and explicit type casts behave. We managed to lift this low-level memory model to a high-level calculus that allows efficient, abstract reasoning on the type-safe fragment of the kernel. We generate proof obligations assuring the safety of each pointer access and write. They state that the pointer in question must be non-null and of the correct alignment. They are typically easy to discharge. We generate similar obligations for all restrictions the C99 standard demands.

We treat a very large, pragmatic subset of C99 in the verification. It is a compromise between verification convenience and the hoops the kernel programmers were willing to jump through in writing their source. The following paragraphs describe what is not in this subset.

We do not allow the address-of operator & on local variables, because, for better automation, we make the assumption that local variables are separate from the heap. This could be violated if their address was available to pass on. It is the most far-reaching restriction we implement, because it is common in C to use local variable references for return parameters to avoid returning large types on the stack. We achieved compliance with this requirement by avoiding reference parameters as much as possible, and where they were needed, used pointers to global variables (which are not restricted).

One feature of C that is problematic for verification (and programmers) is the unspecified order of evaluation in expressions with side effects. To deal with this feature soundly, we limit how side effects can occur in expressions. If more than one function call occurs within an expression, or the expression otherwise accesses global state, a proof obligation is generated to show that these functions are side-effect free. This proof obligation is discharged automatically.
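An illustrative fragment (not from seL4) of the pattern this restriction targets; C leaves the relative order of the two calls unspecified, so the framework would demand a proof that both are side-effect free:

/* Illustration only. If f() and g() both updated global state, the
 * value of x could depend on the unspecified evaluation order, so a
 * proof obligation is emitted that both calls are side-effect free. */
extern int f(void);
extern int g(void);

int example(void) {
    int x = f() + g();   /* evaluation order of f() and g() unspecified */
    return x;
}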
We do not allow function calls through function pointers. (We do allow handing the address of a function to assembler code, e.g., for installing exception vector tables.) We also do not allow goto statements, or switch statements with fall-through cases. We support C99 compound literals, making it convenient to return structs from functions, and reducing the need for reference parameters. We do not allow compound literals to be lvalues. Some of these restrictions could be lifted easily, but the features were not required in seL4.

We did not use unions directly in seL4 and therefore do not support them in the verification (although that would be possible). Since the C implementation was derived from a functional program, all unions in seL4 are tagged, and many structs are packed bitfields. Like other kernel implementors, we do not trust GCC to compile and optimize bitfields predictably for kernel code. Instead, we wrote a small tool that takes a specification and generates C code with the necessary shifting and masking for such bitfields. The tool helps us to easily map structures to page table entries or other hardware-defined memory layouts. The generated code can be inlined and, after compilation on ARM, the result is more compact and faster than GCC's native bitfields. The tool not only generates the C code, it also automatically generates Isabelle/HOL specifications and proofs of correctness.

Figure 5 shows part of the implementation of the scheduling functionality described in the previous sections. It is standard C99 code with pointers, arrays, and structs. The thread_state functions used in Figure 5 are examples of generated bitfield accessors.

Figure 5. C code for part of the scheduler.

void setPriority(tcb_t *tptr, prio_t prio) {
    prio_t oldprio;
    if (thread_state_get_tcbQueued(tptr->tcbState)) {
        oldprio = tptr->tcbPriority;
        ksReadyQueues[oldprio] =
            tcbSchedDequeue(tptr, ksReadyQueues[oldprio]);
        if (isRunnable(tptr)) {
            ksReadyQueues[prio] =
                tcbSchedEnqueue(tptr, ksReadyQueues[prio]);
        }
        else {
            thread_state_ptr_set_tcbQueued(&tptr->tcbState,
                                           false);
        }
    }
    tptr->tcbPriority = prio;
}
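The generated accessors are plain shift-and-mask code. The following hand-written approximation conveys their style; the field layout is invented for illustration and is not the tool's actual output:

#include <stdint.h>

/* Hypothetical packed thread state: a one-bit 'queued' flag at bit 4
 * of a single 32-bit word. */
typedef struct { uint32_t words[1]; } thread_state_t;

static inline uint32_t thread_state_get_tcbQueued(thread_state_t s) {
    return (s.words[0] >> 4) & 1u;
}

static inline thread_state_t
thread_state_set_tcbQueued(thread_state_t s, uint32_t queued) {
    s.words[0] &= ~(1u << 4);          /* clear the flag bit       */
    s.words[0] |= (queued & 1u) << 4;  /* set it from the argument */
    return s;
}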

4.4. The proof
This section describes the main theorem we have shown and how its proof was constructed.

As mentioned, the main property we are interested in is functional correctness, which we prove by showing formal refinement. We have formalized this property for general state machines in Isabelle/HOL, and we instantiate each of the specifications in the previous sections into this state-machine framework.

We have also proved the well-known reduction of refinement to forward simulation, illustrated in Figure 6, where the solid arrows mean universal quantification and the dashed arrows existential: To show that a concrete state machine M2 refines an abstract one M1, it is sufficient to show that for each transition in M2 that may lead from an initial state s to a set of states s′, there exists a corresponding transition on the abstract side from an abstract state σ to a set σ′ (they are sets because the machines may be nondeterministic). The transitions correspond if there exists a relation R between the states σ and s such that for each concrete state in s′ there is an abstract one in σ′ that makes R hold between them again. This has to be shown for each transition with the same overall relation R. For externally visible state, we require R to be equality. For each refinement layer in Figure 2, we have strengthened and varied this proof technique slightly, but the general idea remains the same.

Figure 6. Forward simulation. [The figure shows an abstract operation of M1 from state σ to σ′ (top), a concrete operation of M2 from state s to s′ (bottom), and the state relation connecting abstract and concrete states on both sides.]
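In symbols, the single-step obligation of forward simulation can be sketched as follows (our notation, matching the description above; σ ranges over abstract and s over concrete states):

\forall \sigma\, s\, s'.\;
  (\sigma, s) \in R \;\wedge\; s \xrightarrow{M_2} s'
  \;\Longrightarrow\;
  \exists \sigma'.\; \sigma \xrightarrow{M_1} \sigma' \;\wedge\; (\sigma', s') \in R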
We now describe the instantiation of this framework to the seL4 kernel. We have the following types of transition in our state machines: kernel transitions, user transitions, user events, idle transitions, and idle events. Kernel transitions are those that are described by each of the specification layers in increasing amount of detail. User transitions are specified as nondeterministically changing arbitrary user-accessible parts of the state space. User events model kernel entry (trap instructions, faults, interrupts). Idle transitions model the behavior of the idle thread. Finally, idle events are interrupts occurring during idle time; other interrupts that occur during kernel execution are modelled explicitly and separately in each layer of Figure 2.

The model of the machine and the model of user programs remain the same across all refinement layers; only the details of kernel behavior and kernel data structures change. The fully nondeterministic model of the user means that our proof includes all possible user behaviors, be they benign, buggy, or malicious.

Let machine MA denote the system framework instantiated with the abstract specification of Section 4.1, let machine ME represent the framework instantiated with the executable specification of Section 4.2, and let machine MC stand for the framework instantiated with the C program read into the theorem prover. Then we prove the following two, very simple-looking theorems:

Theorem 1. ME refines MA.

Theorem 2. MC refines ME.

Therefore, because refinement is transitive, we have

Theorem 3. MC refines MA.

4.5. Assumptions
Formal verification can never be absolute; it always must make fundamental assumptions. The assumptions we make are correctness of the C compiler, the assembly code, the hardware, and kernel initialization. We explain each of them in more detail below.

The initialization code takes up about 1.2 kloc of the kernel. The theorems in Section 4.4 only state correspondence between entry and exit points in each specification layer for a running kernel.

Assuming correctness of the C compiler means that we assume GCC correctly translates the seL4 source code in our C subset according to the ISO/IEC C99 standard,4 that the formal model of our C subset accurately reflects this standard, and that the model makes the correct architecture-specific assumptions for the ARMv6 architecture on the Freescale i.MX31 platform.

The assumptions on hardware and assembly mean that we do not prove correctness of the register save/restore and the potential context switch on kernel exit. Cache consistency, cache coloring, and TLB flushing requirements are part of the assembly-implemented machine interface. These machine interface functions are called from C, and we assume they do not have any effect on the memory state of the C program. This is only true if they are used correctly.

The VM subsystem of seL4 is not assumed correct, but is treated differently from other parts of the proof. For our C semantics, we assume a traditional, flat view of in-kernel memory that is kept consistent by the kernel's VM subsystem. We make this consistency argument only informally; our model does not oblige us to prove it. We do, however, substantiate the model and informal argument by manually stated, machine-checked properties and invariants. This means we explicitly treat in-kernel VM in the proof, but this treatment is different from the high standards in the rest of our proof, where we reason from first principles and the proof forces us to be complete.

This is the set of assumptions we picked. If they are too strong for a particular purpose, many of them can be eliminated when combined with other research. For instance, we have verified the executable design of the boot code in an earlier design version. For context switching, Ni et al.7 report verification success, and the Verisoft project1 shows how to verify assembly code and hardware interaction. Leroy verified an optimizing C compiler5 for the PowerPC and ARM architectures.

An often-raised concern is the question: What if there is a mistake in the proof? The proof is machine-checked by Isabelle/HOL. So what if there is a bug in Isabelle/HOL? The proof-checking component of Isabelle is small and can be isolated from the rest of the prover. It is extremely unlikely that there is a bug in this part of the system that applies in a correctness-critical way to our proof. If there was reason for concern, a completely independent proof checker could be written in a few hundred lines of code. Provers like Isabelle/HOL can achieve a degree of proof trustworthiness that far surpasses the confidence levels we rely on in engineering or mathematics for our daily survival.

5. EXPERIENCE AND LESSONS LEARNED
5.1. Verification effort
The project was conducted in three phases. First, an initial kernel with limited functionality (no interrupts, single address space, and generic linear page table) was designed and implemented in Haskell, while the verification team mostly worked on the verification framework and generic proof libraries. In a second phase, the verification team developed the abstract spec and performed the


first refinement, while the development team completed the design, Haskell prototype, and C implementation. The third phase consisted of extending the first refinement step to the full kernel and performing the second refinement. The overall size of the proof, including framework, libraries, and generated proofs (not shown in the table), is 200,000 lines of Isabelle script.

Table 1 gives a breakdown of the effort and size of each of the layers and proofs. About 30 person months (pm) went into the abstract specification, Haskell prototype, and C implementation (over all project phases), including design, documentation, coding, and testing.

Table 1. Code and proof statistics. (The proof figures give effort and
size of the refinement proof between adjacent layers.)

                Haskell/C         Isabelle
                pm      kloc      kloc     Invariants
  abst.          4       —         4.9       ~75
  exec.         24       5.7      13         ~80
  impl.          2       8.7      15           0

  First refinement (abst. to exec.):   8 py, 110 klop
  Second refinement (exec. to impl.):  3 py,  55 klop

This compares well with other efforts for developing a new microkernel from scratch: the Karlsruhe team reports that, on the back of their experience from building the earlier Hazelnut kernel, the development of the Pistachio kernel cost about 6 person years (py). SLOCCount with the "embedded" profile estimates the total cost of seL4 at 4 py. Hence, there is strong evidence that the detour via Haskell did not increase the cost, but was in fact a significant net cost saver.

The cost of the proof is higher, in total about 20 py. This includes significant research and about 9 py invested in formal language frameworks, proof tools, proof automation, theorem prover extensions, and libraries. The total effort for the seL4-specific proof was 11 py.

We expect that redoing a similar verification for a new kernel, using the same overall methodology, would reduce this figure to 6 py, for a total (kernel plus proof) of 8 py. This is only twice the SLOCCount estimate for a traditionally engineered system with no assurance.

The breakdown in Table 1 of effort between the two refinement stages is illuminating: almost 3:1. This is a reflection of the low-level nature of our Haskell prototype, which captures most of the properties of the final product. This is also reflected in the proof size—the first proof step contained most of the deep semantic content. 80% of the effort in the first refinement went into establishing invariants, only 20% into the actual correspondence proof. We consider this asymmetry a significant benefit, as the executable spec is more convenient and efficient to reason about than C.

The first refinement step led to some 300 changes in the abstract spec and 200 in the executable spec. About 50% of these changes relate to bugs in the associated algorithms or design. Examples are missing checks on user-supplied input, subtle side effects in the middle of an operation breaking global invariants, or over-strong assumptions about what is true during execution. The rest of the changes were introduced for verification convenience. The ability to change and rearrange code in discussion with the design team was an important factor in the verification team's productivity and was essential to complete the verification on time.

The second refinement stage, from executable spec to C, uncovered 160 bugs, 16 of which were also found during testing, early application, and static analysis. The bugs discovered in this stage were mainly typos, misreading the specification, or failing to update all relevant code parts for specification changes. Even though their cause was often simple, understandable human error, their effect in many cases was sufficient to crash the kernel or create security vulnerabilities. There were no deeper, algorithmic bugs in the C level, because the C code was written according to a very precise, low-level specification.

5.2. The cost of change
One issue of verification is the cost of proof maintenance: how much does it cost to reverify after changes are made to the kernel? This obviously depends on the nature of the change. We are not able to precisely quantify such costs, but our iterative verification approach has provided us with some relevant experience.

The best case is a local, low-level code change, typically an optimization that does not affect the observable behavior. We made such changes repeatedly, and found that the effort for reverification was always low and roughly proportional to the size of the change.

Adding new, independent features, which do not interact in a complex way with existing features, usually has a moderate impact. For example, adding a new system call to the seL4 API that atomically batches a specific, short sequence of existing system calls took one day to design and implement. Adjusting the proof took less than one person week.

Adding new, large, cross-cutting features, such as adding a complex new data structure to the kernel supporting new API calls that interact with other parts of the kernel, is significantly more expensive. We experienced such a case when progressing from the first to the final implementation, adding interrupts, ARM page tables, and address spaces. This change cost several person months to design and implement, and resulted in 1.5–2 py to reverify. It modified about 12% of existing Haskell code, added another 37%, and reverification cost about 32% of the time previously invested in verification. The new features required only minor adjustments of existing invariants, but led to a considerable number of new invariants for the new code. These invariants had to be preserved over the whole kernel, not just the new features.

Unsurprisingly, fundamental changes to existing features are bad news. We experienced one such change when we added reply capabilities for efficient RPC as an API optimization after the first refinement was completed. Even though the code size of this change was small (less than 5% of the total code base), it violated key invariants about the way capabilities were used in the system until then, and the amount of conceptual cross-cutting was huge. It took about 1 py, or 17% of the original proof effort, to reverify.

There is one class of otherwise frequent code changes

that does not occur after the kernel has been verified: implementation bug fixes.

6. CONCLUSION
We have presented our experience in formally verifying seL4. We have shown that full, rigorous, formal verification is practically achievable for OS microkernels.

The requirements of verification force the designers to think of the simplest and cleanest way of achieving their goals. We found repeatedly that this leads to overall better design, for instance, in the decisions aimed at simplifying concurrency-related verification issues.

Our future research agenda includes verification of the assembly parts of the kernel, a multi-core version of the kernel, as well as formal verification of overall system security and safety properties, including application components. The latter now becomes much more meaningful than previously possible: application proofs can rely on the abstract, formal kernel specification that seL4 is proven to implement.

Acknowledgments
We would like to acknowledge the contribution of the former team members on this verification project: Timothy Bourke, Jeremy Dawson, Jia Meng, Catherine Menon, and David Tsai.

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References
1. Alkassar, E., Schirmer, N., Starostin, A. Formal pervasive verification of a paging mechanism. TACAS, C.R. Ramakrishnan and J. Rehof, eds. Volume 4963 of LNCS (2008), Springer, 109–123.
2. Dennis, J.B., Van Horn, E.C. Programming semantics for multiprogrammed computations. CACM 9 (1966), 143–155.
3. Elkaduwe, D., Klein, G., Elphinstone, K. Verified protection model of the seL4 microkernel. VSTTE 2008—Verified Software: Theories, Tools & Experiments, J. Woodcock and N. Shankar, eds. Volume 5295 of LNCS (Toronto, Canada, Oct 2008), Springer, 99–114.
4. ISO/IEC. Programming languages—C. Technical Report 9899:TC2, ISO/IEC JTC1/SC22/WG14, May 2005.
5. Leroy, X. Formal certification of a compiler back-end, or: Programming a compiler with a proof assistant. 33rd POPL, J.G. Morrisett and S.L.P. Jones, eds. (New York, NY, USA, 2006), ACM, 42–54.
6. Liedtke, J. Towards real microkernels. CACM 39, 9 (Sept 1996), 70–77.
7. Ni, Z., Yu, D., Shao, Z. Using XCAP to certify realistic system code: Machine context management. 20th TPHOLs, volume 4732 of LNCS (Kaiserslautern, Germany, Sept 2007), Springer, 189–206.
8. Nipkow, T., Paulson, L., Wenzel, M. Isabelle/HOL—A Proof Assistant for Higher-Order Logic. Volume 2283 of LNCS (2002), Springer.
9. Ormandy, T., Tinnes, J. Linux null pointer dereference due to incorrect proto_ops initializations. http://www.cr0.org/misc/CVE-2009-2692.txt, 2009.
10. Saltzer, J.H., Schroeder, M.D. The protection of information in computer systems. Proc. IEEE 63 (1975), 1278–1308.
11. Shapiro, J.S., Smith, J.M., Farber, D.J. EROS: A fast capability system. 17th SOSP (Charleston, SC, USA, Dec 1999), 170–185.

Gerwin Klein, National ICT Australia (NICTA), University of New South Wales (UNSW), Sydney.
June Andronick, NICTA, UNSW.
Kevin Elphinstone, NICTA, UNSW.
Gernot Heiser, NICTA, UNSW, Open Kernel Labs.
Dhammika Elkaduwe, NICTA, UNSW, now at University of Peradeniya, Sri Lanka.
Kai Engelhardt, NICTA, UNSW.
Rafal Kolanski, NICTA, UNSW.
Harvey Tuch, NICTA, UNSW, now at VMware.
Simon Winwood, NICTA, UNSW.
David Cock, NICTA.
Philip Derrin, NICTA, now at Open Kernel Labs.
Michael Norrish, NICTA, Australian National University, Canberra.
Thomas Sewell, NICTA.

© 2010 ACM 0001-0782/10/0600 $10.00

DOI:10.1145/1743546.1743575 Peter Winkler

Puzzled
Solutions and Sources

Last month (May 2010, p. 120) we posted a trio of brainteasers, including one as yet unsolved, concerning variations on the Ham Sandwich Theorem. The Intermediate Value Theorem says that if you go continuously from one real number to another, you must pass through all the real numbers in between. You can use it to prove the Ham Sandwich Theorem; here's how it can be used to solve Puzzles 1 and 2:

1. Hiking the Cascade Range.
Solution. Puzzle 1 asked us to prove that the programmers who spent Saturday climbing and Sunday descending Mt. Baker were, at some time of day, at exactly the same altitude on both days.

It's easily done. For any time t, let f(t) be the programmers' altitude on Sunday minus their altitude on Saturday; f(t) starts off positive in the morning and ends up negative at night, so at some point it must be 0.

An equivalent, and perhaps more intuitive, way to see this is to imagine that the programmers have twins who were instructed to climb the mountain on Sunday exactly as the programmers climbed it the day before. Then, even if their paths up and down were different, there is some point at which the programmers and their twins must pass one another in altitude.
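In symbols (the same argument, with a_Sat and a_Sun denoting the altitude functions for the two days):

f(t) = a_{\mathrm{Sun}}(t) - a_{\mathrm{Sat}}(t), \qquad
f(\mathrm{dawn}) > 0, \quad f(\mathrm{dusk}) < 0,

and f is continuous, so by the Intermediate Value Theorem there is some t* with f(t*) = 0, that is, a_{\mathrm{Sun}}(t*) = a_{\mathrm{Sat}}(t*).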
2. Inscribing a Lake in a Square.
Solution. Puzzle 2 asked us to show that, given any closed curve in the plane, there is a square containing the curve, all four sides of which touch the curve. The idea of the proof is both simple and elegant. Start with a vertical line drawn somewhere west of the curve. Gradually shift the line eastward until it just touches the curve. Repeat with a second line, drawn east of the curve and moving gradually west, so we now have another vertical line touching the curve on its east side. Now bring a horizontal line down from the north until it touches the curve and another from the south, thus inscribing the curve in a rectangle.

But what we want is not merely a rectangle but a square. Suppose the rectangle is taller than it is wide (as it would be in, say, Lake Champlain). Now slowly rotate the four lines together clockwise, keeping all four outside but still touching the curve. After 90 degrees of rotation, the picture is exactly the same as before, only now the previously long vertical lines of the rectangle are the short horizontal sides.

At some point in the rotation process, the original vertical lines and horizontal lines were all the same length—and, at exactly that point, the curve was inscribed in a square.

3. Curves Containing the Corners of a Square.
Solution. The third puzzle was (as usual) unsolved, frustrating geometers for more than a century. For a discussion, see http://www.ics.uci.edu/~eppstein/junkyard/jordan-square.html, including a reference to an article by mathematician Walter Stromquist ("Inscribed Squares and Square-like Quadrilaterals in Closed Curves," Mathematika 36, 2 (1989), 187–197) in which he proved the conjecture for smooth curves. See also Stan Wagon's and Victor Klee's book Old and New Unsolved Problems in Plane Geometry and Number Theory (Mathematical Association of America, 1991).

All readers are encouraged to submit prospective puzzles for future columns to puzzled@cacm.acm.org.

Peter Winkler (puzzled@cacm.acm.org) is Professor of Mathematics and of Computer Science and Albert Bradley Third Century Professor in the Sciences at Dartmouth College, Hanover, NH.
[continued from p. 120] …all potential competitors. Crude versions of the same pattern are seen among chimps, wolves, and many other species. The logic of Darwin and Malthus may be pervasive on other worlds, too. Could this help explain the daunting sky-silence?

One important exception was by far the most successful Earth-based civilization—the Scientific Enlightenment—which broke from the ancient feudal pattern, fostering instead what Robert Wright, author of Nonzero: The Logic of Human Destiny, called the "positive sum game," encouraging individualism and copious self-criticism. (It's a theme typified by self-reproachful messages like James Cameron's movie Avatar.) What better way to detect, reveal, and resolve myriad potential pitfalls than by unleashing millions of diverse, highly educated, technologically empowered citizens to swarm (like T-cells) on every apparent failure mode, real or imagined?

This noisy process was supercharged in the late 1980s when the U.S. government did something that still seems historically anomalous: releasing the Internet it had invented from near-total control, simply handing it over to the world. Ponder, in the light of the past 4,000 years of recorded history, the likelihood of such a decision. Was there ever anything comparable, under the most beneficent kings?

I want to defend this rambunctious culture of freedom for all the usual reasons (such as "freedom is good!"), but I wonder. Such an experiment was rare here on Earth and seems unlikely to have been tried very often, out there across the cosmos. In fact (and here's the point of this digression), our latest, tech-amplified version of the Enlightenment could become a fiercely effective problem-solving system helping us become the exception… a sapient race that survives. That is, if it is allowed to. And if it ever matures.

But getting the most from our potential also requires better tools. In his book Smart Mobs (2002), social scholar Howard Rheingold envisioned a future when the savvy, liberated populace forms resilient, ad hoc, problem-solving networks that pounce on errors and dangers, adapting much quicker than stuffy, traditional, hierarchical institutions. Recall that citizen-power was the only thing that worked well on 9/11. We'll surely need such agility and initiative in times to come.

Still, no one has yet disproved the hoary adage that "The intelligence of a crowd is that of its dumbest member, divided by the number in the crowd." Could any blog, social-networking site, or twit-mesh be described as a "problem-solving discourse"? Not unless you have very low standards of "discourse." Today's communications platforms seem obstinately, even proudly, primitive, encouraging the dumb-down groupthink that Jaron Lanier called "digital Maoism" in his Future Tense essay "Confusions of the Hive Mind" (Sept. 2009), not the vigorous new-citizenship I forecast in my 1989 eco-thriller Earth.

Connectivity scholar and Google Vice President Marissa Mayer says the Internet is in its "adolescence." Indeed, many of the traits tech-zealot Clay Shirky (http://www.shirky.com/) adores and that Web critic Nicholas Carr (http://www.roughtype.com/) abhors are qualities we associate with our own teenage years. Take the rampant flightiness of scattered attention spans, simplistic online tribalism, and tsunamis of irate opinion; picture 10 million electronic Nuremberg rallies. These punk attributes blend and contrast with positive adolescent qualities like unprecedented vividness, creativity, quickness, alert compassion, and spontaneity. No generation ever read or wrote so much… albeit, never was such a high fraction of the writing such drivel.

There's nothing wrong with self-expression. Not everyone is required to engage in erudite discourse. But must the medium conspire to make discourse next to impossible, leaving each decade's version of "conversation" more terse and lobotomized? Must the interface assume that superficiality is the chief desideratum and self-fulfilling expectation?

If this is an "adolescent phase," we may yet see what Wikipedia co-founder Larry Sanger calls "sophrosyne," or polemical shouting transforming into fair disputation and negotiation, a trick many teens eventually learn.

Imagine today's Internet augmented by a shopping list of now-missing tools to enhance attention allocation, empowering users to do more in parallel while rediscovering the art of concentration. Today's fetish for "gisting," or grabbing the summarized essence of any fact or opinion, might yet be more useful and accurate when coupled with utilities for source-reputation weighting, paraphrasing, correlation, data analysis, and what Howard Rheingold called general "crap-detection." Collaborationware might yet evolve from its present stodginess, helping ad hoc teams self-organize, divide tasks, delegate expertise, and achieve quick wonders. Such tools could start by bringing online some of the amazing mental methods we take for granted in the real world (such as the way we sift for meaning from multiple conversations at once, as at a cocktail party). Many have never been implemented online, in any way.

Destiny, not only on Earth but across the Galaxy, may depend on how we choose to cross this danger zone. Success could arise less out of stodgy prescriptions than from those "adolescent" traits that make us hunger for adventure, surprise, even fun. Only… perhaps empowered by new skills that help us function as thoughtful adolescents, more like precocious 19-year-olds than scatterbrained 13-year-olds. Perhaps even like people with grownup goals… and the patience to achieve them.

David Brin (http://www.davidbrin.com) is a scientist, technology speaker, and author whose stories and novels have won Hugo and Nebula awards. He is also the author of the nonfiction book The Transparent Society: Will Technology Make Us Choose Between Freedom and Privacy? (Perseus Books Group, 1998).

© 2010 ACM 0001-0782/10/0600 $10.00
j u n e 2 0 1 0 | vo l . 5 3 | n o. 6 | c o m m u n i c at i o n s o f t h e acm 119
last byte

Future Tense, one of the revolving features on this page, presents stories and
essays from the intersection of computational science and technological speculation,
their boundaries limited only by our ability to imagine what will and could be.

DOI:10.1145/1743546.1743576 David Brin

Future Tense
How the Net Ensures
Our Cosmic Survival
Give adolescent wonder an evolutionary jolt.
The Internet has changed the way I think, though, ironically, less than I expected. As both a freelance scientist and a science fiction author, I already telecommuted back in 1980, kept flexible hours, digitally collaborated with colleagues around the world, conducted digital literature searches, and was an early adopter of text editing. All these trends have since accelerated. Yet, compared to my colleagues’ utopian visions for 2010, today’s Net and Web remain, a bit, well, stodgy.

Oh, I’m grateful to live in such times. For one thing, without the Internet, civilization would likely have fallen into the Specialization Trap that tech students pondered, pessimistically, in the 1960s. At the time, it seemed inevitable—as the weight of accumulated knowledge piled higher and higher—that researchers would have to learn more and more about less and less, in narrowing subfields, staggering forward ever more slowly under the growing burden of human progress. Specialty boundaries would grow more rigid and uncrossable. This unpleasant fate seemed unavoidable, back when “information” had a heavy, almost solid quality…

…till the Internet Era transformed knowledge into something more like a gas—or sparkling plasma—infinitely malleable, duplicable, accessible, mobile. At which point the old worries about death-by-overspecialization vanished so thoroughly that few recall how gloomy the prospect seemed, only a few decades ago.

Today, some fear the opposite failure mode, veering from narrow-minded overspecialization to scatterbrained shallowmindedness. Flitting about info-space, we snatch one-sentence summaries of anything that’s known by the vast, collective mind. Whatever the topic, each of us is able to preen with presumptuous “expertise.” This trend colors even modern politics; a core tenet of the Culture War holds that specialists are no more qualified than opinionated amateurs to judge truth.

Worrisome trends have always seemed to threaten civilization. From Plato, Gibbon, and Spengler to Toynbee, Kennedy, and Diamond, many have diagnosed why cultures succeed or fail. Theories vary, but the implications go far beyond the fate of mere humanity. In his Future Tense essay “Radical Evolution” (Mar. 2009), Joel Garreau tied technology’s march to the question of why we’ve seen no signs of intelligent life beyond planet Earth, not even radio blips on a SETI screen. Does this Great Silence suggest every sapient race out there ultimately repeats the same technology-driven mistakes, driving their own civilizations to ruin?

We know next to nothing about aliens but can impute something about them from the self-perpetuating instinctive drives that propel both human societies and nearly all animal species on Earth, spurred by the “zero sum,” or I-win-by-making-you-lose, logic of reproductive success. Hence, 99% of human cultures that ever achieved farming and metals also wound up ruled by feudal oligarchies that squelched […] what Robert Wright, in Nonzero: The Logic of Human Destiny, called the “positive sum game,” encouraging individualism and copious self-criticism. (It’s a theme typified by self-reproachful messages like James Cameron’s movie Avatar.) What better way to detect, reveal, and resolve myriad potential pitfalls than by unleashing millions of diverse, highly educated, technologically empowered citizens to swarm (like T-cells) on every apparent failure mode, real or imagined?

This noisy process was supercharged in the late 1980s when the U.S. government did something that still seems historically anomalous: releasing the Internet it had invented from near-total control, simply handing it over to the world. Ponder, in the light of the past 4,000 years of recorded history, the likelihood of such a decision. Was there ever anything comparable, under the most beneficent kings?

I want to defend this rambunctious culture of freedom for all the usual reasons (such as “freedom is good!”), but I wonder. Such an experiment was rare here on Earth and seems unlikely to have been tried very often, out there across the cosmos. In fact (and here’s the point of this digression), our latest, tech-amplified version of the Enlightenment could become a fiercely effective problem-solving system helping us become the exception… a sapient race that survives. That is, if it is allowed to. And if it ever matures.

But getting the most from our potential also requires better tools. In his book Smart Mobs (2002), social scholar Howard Rheingold envisioned a future when the savvy, liberated populace forms resilient, ad hoc, problem-solving networks that pounce on errors and dangers, adapting much quicker than stuffy, traditional, hierarchical institutions. Recall that citizen-power was the only thing that worked well on 9/11. We’ll surely need such agility and initiative in times to come.

Still, no one has yet disproved the hoary adage that “The intelligence of a crowd is that of its dumbest member, divided by the number in the crowd.” Could any blog, social-networking site, or twit-mesh be described as a “problem-solving discourse”? Not unless you have very low standards of “discourse.” Today’s communications platforms seem obstinately, even proudly, primitive, encouraging dumb-down groupthink that Jaron Lanier called “digital Maoism” in his Future Tense essay “Confusions of the Hive Mind” (Sept. 2009), not the vigorous new citizenship I forecast in my 1989 eco-thriller Earth.

Connectivity scholar and Google Vice President Marissa Mayer says the Internet is in its “adolescence.” Indeed, many of the traits tech-zealot Clay Shirky (http://www.shirky.com/) adores and that Web critic Nicholas Carr (http://www.roughtype.com/) abhors are qualities we associate with our own teenage years. Take the rampant flightiness of scattered attention spans, simplistic online tribalism, and tsunamis of irate opinion; picture 10 million electronic Nuremberg rallies. These punk attributes blend and contrast with positive adolescent qualities like unprecedented vividness, creativity, quickness, alert compassion, and spontaneity. No generation ever read or wrote so much… albeit, never was such a high fraction of the writing such drivel.

There’s nothing wrong with self-[…] fair disputation and negotiation, a trick many teens eventually learn.

Imagine today’s Internet augmented by a shopping list of now-missing tools to enhance attention allocation, empowering users to do more in parallel while rediscovering the art of concentration. Today’s fetish for “gisting,” or grabbing the summarized essence of any fact or opinion, might yet be more useful and accurate, when coupled with utilities for source-reputation weighting, paraphrasing, correlation, data analysis, and what Howard Rheingold called general “crap-detection.” Collaborationware might yet evolve from its present stodginess, helping ad hoc teams self-organize, divide tasks, delegate expertise, and achieve quick wonders. Such tools could start by bringing online some of the amazing mental methods we take for granted in the real world (such as the way we sift for meaning from multiple conversations at once, as at a cocktail party). Many have never been implemented online, in any way.

Destiny, not only on Earth but across the Galaxy, may depend on how we choose to cross this danger zone. Success could arise less out of stodgy prescriptions than from those “adolescent” traits that make us hunger for adventure, surprise, even fun. Only… perhaps empowered by new skills that help us function as thoughtful adolescents, more like precocious 19-year-olds than scatterbrained 13-year-olds. Perhaps even like people with grownup goals… and the patience to achieve them.

David Brin (http://www.davidbrin.com) is a scientist, technology speaker, and author whose stories and novels have won Hugo and Nebula awards. He is also the author of the nonfiction book The Transparent Society: Will Technology Make Us Choose Between Freedom and Privacy? (Perseus Books Group, 1998).

© 2010 ACM 0001-0782/10/0600 $10.00
