
COMMUNICATIONS OF THE ACM
cacm.acm.org 08/2012 VOL.55 NO.8

Quantum Money

IT and Outsourcing in North Korea
Cosmic Simulations
The Loss of Location Privacy in the Cellular Age
OpenFlow: A Radical New Idea in Networking

Association for Computing Machinery
Call for Nominations
The ACM Doctoral Dissertation Competition

Rules of the Competition
ACM established the Doctoral Dissertation Award program to recognize and encourage superior research and writing by doctoral candidates in computer science and engineering. These awards are presented annually at the ACM Awards Banquet.

Submissions
Nominations are limited to one per university or college, from any country, unless more than 10 Ph.D.’s are granted in one year, in which case two may be nominated.

Eligibility
Please see our website for exact eligibility rules. Only English language versions will be accepted. Please send a copy of the thesis in PDF format to emily.eng@acm.org.

Sponsorship
Each nomination shall be forwarded by the thesis advisor and must include the endorsement of the department head. A one-page summary of the significance of the dissertation written by the advisor must accompany the transmittal.

Deadline
Submissions must be received by October 31, 2012 to qualify for consideration.

Publication Rights
Each nomination must be accompanied by an assignment to ACM by the author of exclusive publication rights. (Copyright reverts to author if not selected for publication.)

Publication
Winning dissertations will be published by ACM in the ACM Digital Library.

Selection Procedure
Dissertations will be reviewed for technical depth and significance of the research contribution, potential impact on theory and practice, and quality of presentation. A committee of individuals serving staggered five-year terms performs an initial screening to generate a short list, followed by an in-depth evaluation to determine the winning dissertation.
The selection committee will select the winning dissertation in early 2013.

Award
The Doctoral Dissertation Award is accompanied by a prize of $20,000 and the Honorable Mention Award is accompanied by a prize of $10,000. Financial sponsorship of the award is provided by Google.

For Submission Procedure
See http://awards.acm.org/html/dda.cfm
communications of the acm

Departments

5 Letter from the ICPC Executive Director
Giving Students the Competitive Edge
By Bill Poucher

7 Letters to the Editor
Composable Trees for Configurable Behavior

10 BLOG@CACM
Machine Learning and Algorithms; Agile Development
John Langford poses questions about the direction of research for machine learning and algorithms. Ruben Ortega shares lessons about agile development practices like Scrum.

31 Calendar

117 Careers

Last Byte

120 Puzzled
Find the Magic Set
By Peter Winkler

News

13 Cosmic Simulations
With the help of supercomputers, scientists are now able to create models of large-scale astronomical events.
By Jeff Kanipe

16 DARPA Shredder Challenge Solved
The eight-person winning team used original computer algorithms to narrow the search space and then relied on human observation to move the pieces into their final positions.
By Tom Geller

18 Advertising Gets Personal
Online behavioral advertising and sophisticated data aggregation have changed the face of advertising and put privacy in the crosshairs.
By Samuel Greengard

21 Broader Horizons
ACM’s Committee for Women in Computing (ACM-W) is widening its reach to involve women in industry as well as academia, including community college faculty and students.
By Karen A. Frenkel

Viewpoints

22 Emerging Markets
Inside the Hermit Kingdom: IT and Outsourcing in North Korea
A unique perspective on an evolving technology sector.
By Paul Tjia

26 Education
Will Massive Open Online Courses Change How We Teach?
Sharing recent experiences with an online course.
By Fred G. Martin

29 Privacy and Security
The Politics of “Real Names”
Power, context, and control in networked publics.
By danah boyd

32 Kode Vicious
A System Is Not a Product
Stopping to smell the code before wasting time reentering configuration data.
By George V. Neville-Neil

34 Economic and Business Dimensions
The Internet Is Everywhere, but the Payoff Is Not
Examining the uneven patterns of Internet economics.
By Chris Forman, Avi Goldfarb, and Shane Greenstein

36 Viewpoint
Internet Elections: Unsafe in Any Home?
Experiences with electronic voting suggest elections should not be conducted via the Internet.
By Kai A. Olsen and Hans Fredrik Nordhaug

39 Viewpoint
The Ethics of Software Engineering Should be an Ethics for the Client
Viewing software engineering as a communicative art in which client engagement is essential.
By Neil McBride

Image courtesy of MultiDark Resimulation project (Gustavo Yepes, Splotch)

2 communications of the acm | august 2012 | vol. 55 | no. 8



Practice

42 OpenFlow: A Radical New Idea in Networking
An open standard that enables software-defined networking.
By Thomas A. Limoncelli

48 Extending the Semantics of Scheduling Priorities
Increasing parallelism demands new paradigms.
By Rafael Vanoni Polanczyk

53 Multitier Programming in Hop
A first step toward programming 21st-century applications.
By Manuel Serrano and Gérard Berry

Articles’ development led by queue.acm.org

Contributed Articles

60 The Loss of Location Privacy in the Cellular Age
How to have the best of location-based services while avoiding the growing threat to personal privacy.
By Stephen B. Wicker

69 To Be or Not To Be Cited in Computer Science
Traditional bias toward journals in citation databases diminishes the perceived value of conference papers and their authors.
By Bjorn De Sutter and Aäron van den Oord

76 Process Mining
Using real event data to X-ray business processes helps ensure conformance between design and reality.
By Wil van der Aalst

Review Articles

84 Quantum Money
Imagine money you can carry and spend without a trace.
By Scott Aaronson, Edward Farhi, David Gosset, Avinatan Hassidim, Jonathan Kelner, and Andrew Lutomirski

Research Highlights

96 Technical Perspective
Example-Driven Program Synthesis for End-User Programming
By Martin C. Rinard

97 Spreadsheet Data Manipulation Using Examples
By Sumit Gulwani, William R. Harris, and Rishabh Singh

106 Technical Perspective
Proving Programs Continuous
By Andreas Zeller

107 Continuity and Robustness of Programs
By Swarat Chaudhuri, Sumit Gulwani, and Roberto Lublinerman

About the Cover:
Imagine money you can spend without a trace and without the worry of loss or counterfeit. Enter quantum information as the basis for a better kind of money. This month’s cover story (p. 84) explores what it would take to realize quantum money. Cover illustration by Spooky Pooka at Début Art.

Illustrations by Jason Cook, Brian Greenberg, Spooky Pooka at Début Art



communications of the acm
Trusted insights for computing’s leading professionals.

Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields.
Communications is recognized as the most trusted and knowledgeable source of industry information for today’s computing professional.
Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology,
and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications,
public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM
enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts,
sciences, and applications of information technology.

ACM, the world’s largest educational Staff Editorial Board


and scientific computing society, delivers  
resources that advance computing as a Director of G roup Publis h i ng E ditor -i n-c hief
science and profession. ACM provides the Scott E. Delman Moshe Y. Vardi ACM Copyright Notice
computing field’s premier Digital Library publisher@cacm.acm.org eic@cacm.acm.org Copyright © 2012 by Association for
and serves its members and the computing Executive Editor News Computing Machinery, Inc. (ACM).
profession with leading-edge publications, Diane Crawford Co-Chairs Permission to make digital or hard copies
conferences, and career resources. Managing Editor Marc Najork and Prabhakar Raghavan of part or all of this work for personal
Thomas E. Lambert Board Members or classroom use is granted without
Executive Director and CEO Senior Editor Hsiao-Wuen Hon; Mei Kobayashi; fee provided that copies are not made
John White Andrew Rosenbloom William Pulleyblank; Rajeev Rastogi; or distributed for profit or commercial
Deputy Executive Director and COO Senior Editor/News Jeannette Wing advantage and that copies bear this
Patricia Ryan Jack Rosenberger notice and full citation on the first
Director, Office of Information Systems Web Editor Viewpoints page. Copyright for components of this
Wayne Graves David Roman Co-Chairs work owned by others than ACM must
Director, Office of Financial Services Editorial Assistant Susanne E. Hambrusch; John Leslie King; be honored. Abstracting with credit is
Russell Harris Zarina Strakhan J Strother Moore permitted. To copy otherwise, to republish,
Director, Office of SIG Services Rights and Permissions Board Members to post on servers, or to redistribute to
Donna Cappo Deborah Cotton P. Anandan; William Aspray; lists, requires prior specific permission
Director, Office of Publications Stefan Bechtold; Judith Bishop; and/or fee. Request permission to publish
Bernard Rous Art Director Stuart I. Feldman; Peter Freeman; from permissions@acm.org or fax
Director, Office of Group Publishing Andrij Borys Seymour Goodman; Shane Greenstein; (212) 869-0481.
Scott E. Delman Associate Art Directors Mark Guzdial; Richard Heeks;
Margaret Gray Rachelle Hollander; For other copying of articles that carry a
Alicia Kubista Richard Ladner; Susan Landau; code at the bottom of the first or last page
ACM Council
Assistant Art Directors Carlos Jose Pereira de Lucena; or screen display, copying is permitted
President
Mia Angelica Balaquiot Beng Chin Ooi; Loren Terveen provided that the per-copy fee indicated
Vinton G. Cerf
Brian Greenberg in the code is paid through the Copyright
Vice-President
Production Manager Practice Clearance Center; www.copyright.com.
Alexander L. Wolf
Lynn D’Addesio Chair
Secretary/Treasurer
Director of Media Sales Stephen Bourne Subscriptions
Vicki L. Hanson
Jennifer Ruzicka Board Members An annual subscription cost is included
Past President
Public Relations Coordinator Eric Allman; Charles Beeler; Bryan Cantrill; in ACM member dues of $99 ($40 of
Alain Chesnais
Virgina Gold Terry Coatta; Stuart Feldman; Benjamin Fried; which is allocated to a subscription to
Chair, SGB Board
Publications Assistant Pat Hanrahan; Marshall Kirk McKusick; Communications); for students, cost
Erik Altman
Emily Williams Erik Meijer; George Neville-Neil; is included in $42 dues ($20 of which
Co-Chairs, Publications Board
Theo Schlossnagle; Jim Waldo is allocated to a Communications
Ronald Boisvert and Jack Davidson Columnists subscription). A nonmember annual
Members-at-Large Alok Aggarwal; Phillip G. Armour; The Practice section of the CACM subscription is $100.
Eric Allman; Ricardo Baeza-Yates; Martin Campbell-Kelly; Editorial Board also serves as
Radia Perlman; Mary Lou Soffa; Michael Cusumano; Peter J. Denning; the Editorial Board of . ACM Media Advertising Policy
Eugene Spafford Shane Greenstein; Mark Guzdial; Communications of the ACM and other
SGB Council Representatives Peter Harsha; Leah Hoffmann; Contributed Articles
ACM Media publications accept advertising
Brent Hailpern; Joseph Konstan; Mari Sako; Pamela Samuelson; Co-Chairs in both print and electronic formats. All
Andrew Sears Gene Spafford; Cameron Wilson Al Aho and Georg Gottlob advertising in ACM Media publications is
Board Members at the discretion of ACM and is intended
Board Chairs Contact Points Robert Austin; Yannis Bakos; Elisa Bertino; to provide financial support for the various
Education Board Copyright permission Gilles Brassard; Kim Bruce; Alan Bundy; activities and services for ACM members.
Andrew McGettrick permissions@cacm.acm.org Peter Buneman; Erran Carmel; Current Advertising Rates can be found
Practitioners Board Calendar items Andrew Chien; Peter Druschel; Blake Ives; by visiting http://www.acm-media.org or
Stephen Bourne calendar@cacm.acm.org James Larus; Igor Markov; Gail C. Murphy; by contacting ACM Media Sales at
Change of address Shree Nayar; Bernhard Nebel; Lionel M. Ni; (212) 626-0686.
acmhelp@acm.org Sriram Rajamani; Marie-Christine Rousset;
Regional Council Chairs
Letters to the Editor Avi Rubin; Krishan Sabnani; Single Copies
ACM Europe Council
letters@cacm.acm.org Fred B. Schneider; Abigail Sellen; Single copies of Communications of the
Fabrizio Gagliardi
Ron Shamir; Yoav Shoham; Marc Snir; ACM are available for purchase. Please
ACM India Council
Web Site Larry Snyder; Manuela Veloso; contact acmhelp@acm.org.
Anand S. Deshpande, PJ Narayanan
http://cacm.acm.org Michael Vitale; Wolfgang Wahlster;
ACM China Council
Hannes Werthner; Andy Chi-Chih Yao
Jiaguang Sun Communications of the ACM
Author Guidelines (ISSN 0001-0782) is published monthly
http://cacm.acm.org/guidelines Research Highlights
Co-Chairs by ACM Media, 2 Penn Plaza, Suite 701,
Publications Board
Stuart J. Russell and Gregory Morrisett New York, NY 10121-0701. Periodicals
Co-Chairs ACM Advertising Department
Board Members postage paid at New York, NY 10001,
Ronald F. Boisvert; Jack Davidson 2 Penn Plaza, Suite 701, New York, NY
Martin Abadi; Sanjeev Arora; Dan Boneh; and other mailing offices.
Board Members 10121-0701
Marie-Paule Cani; Nikil Dutt; Carol Hutchins; T (212) 626-0686 Andrei Broder; Stuart K. Card; Jon Crowcroft;
Alon Halevy; Monika Henzinger; POSTMASTER
Joseph A. Konstan; Ee-Peng Lim; F (212) 869-0481
Maurice Herlihy; Norm Jouppi; Please send address changes to
Catherine McGeoch; M. Tamer Ozsu;
Director of Media Sales Andrew B. Kahng; Xavier Leroy; Communications of the ACM
Vincent Shen; Mary Lou Soffa
Jennifer Ruzicka Mendel Rosenblum; Ronitt Rubinfeld; 2 Penn Plaza, Suite 701
ACM U.S. Public Policy Office jen.ruzicka@hq.acm.org David Salesin; Guy Steele, Jr.; David Wagner; New York, NY 10121-0701 USA
Cameron Wilson, Director Alexander L. Wolf; Margaret H. Wright
1828 L Street, N.W., Suite 800 Media Kit acmmediasales@acm.org
Washington, DC 20036 USA Web
T (202) 659-9711; F (202) 667-1066 Association for Computing Machinery Chair
(ACM) James Landay
Computer Science Teachers Association 2 Penn Plaza, Suite 701 Board Members
Chris Stephenson, New York, NY 10121-0701 USA Gene Golovchinsky; Marti Hearst;
Executive Director T (212) 869-7440; F (212) 869-0481 Jason I. Hong; Jeff Johnson; Wendy E. MacKay Printed in the U.S.A.


letter from the icpc executive director

DOI:10.1145/2240236.2240237 Bill Poucher

Giving Students the Competitive Edge

The annual ACM International Collegiate Programming Contest (ICPC) shines the spotlight on the next generation of problem solvers during their university years, engaging them in a competition that develops teamwork, programming skills, and algorithmic mastery. Doors are opened for students to measure their prowess among their peers as they push human problem-solving performance beyond accepted norms. ICPC is a competition of global proportions; participation is open to every student at every university on the planet.

This year over 25,000 students from over 2,200 universities competed in regional contests that spanned the globe. The top 112 teams of three competed in the 36th Annual ACM-ICPC World Finals sponsored by IBM and hosted by the University of Warsaw. For full coverage of this event, which took place May 14–18, I encourage you to visit ICPC Digital at http://icpc.baylor.edu/digital/.

Allow me to recap the highlights here:

Officials from the University of Warsaw and the City of Warsaw, along with leading lights from Poland’s science and economics communities, worked together to roll out the red carpet, giving the event national exposure with full TV coverage. Leading the opening events for ACM-ICPC World Finals Week was the President of the Republic of Poland, Bronislaw Komorowski.

IBM, completing 15 years of a 20-year sponsor commitment, kicked off the week by introducing their latest cognitive computing and big data research in Warsaw’s newly completed Copernicus Science Center, one of the most advanced science museums in Europe, hosting over 450 interactive exhibits.

The opening ceremony featured an extraordinary celebration of 5D Arts in the Palace of Culture and Science. We are indebted to the University of Warsaw, alumni, and supporters (specifically Finals Director Jan Madey and Rector Chałasińska-Macukow) for transforming the university into an ICPC Village with extraordinary hospitality.

The 2012 World Finals, held on May 17, 2012 at the University of Warsaw School of Management, was indeed action packed. With only seconds to go, the top two teams were neck-and-neck. Both teams solved nine of the 12 problems posed, with St. Petersburg State University of IT, Mechanics and Optics edging out the University of Warsaw to earn the coveted title of ACM-ICPC World Champions. Congratulations to Eugeny Kapun, Mikhail Kever, Niyaz Nigmatullin, and coach Andrey Stankevich!

Throughout the event there was spectacular TV coverage within Poland featuring highlights and interviews with the hometown team from the University of Warsaw: Jakub Pachocki, Tomasz Kulczyński, and Wojciech Śmietanka.

And the coverage did not stop there. On May 22, 2012, Russia’s President Vladimir Putin invited the 2012 ICPC World Champions to the annual meeting of the Russian Academy of Science. In a speech unfolding the national focus on science in the Russian Federation over the next five years, Mr. Putin remarked:

“Incidentally, present here at this meeting today are team members of the St. Petersburg State University of Information Technologies, Mechanics and Optics, which won the 2012 ACM International Collegiate Programming Contest. So our victories are not limited to hockey but extend to such academic disciplines as well. I congratulate them on this achievement.

The brilliant success of our student team is a prime example of effective integration of science and education, quality training of creative and intelligent young people who will doubtlessly be in demand in all areas of life in our country such as Russian science, including fundamental [research].”

The 2013 World Finals will be held in St. Petersburg, Russia, in St. Isaac’s Square, courtesy of St. Petersburg State University of Information Technologies, Mechanics and Optics, the Russian Duma, and IBM.

Now for the editorial.
Poland gets it.
Russia gets it, too.

Bill Poucher (poucherw@acmicpc.org) is Executive Director of ICPC and a professor of computer science at Baylor University, Waco, TX.

© 2012 ACM 0001-0782/12/08 $15.00



http://www.acm.org/dl
letters to the editor

DOI:10.1145/2240236.2240238

Composable Trees for Configurable Behavior

I concur wholeheartedly with the composability benefits Brian Beckman outlined in his article “Why LINQ Matters: Cloud Composability Guaranteed” (Apr. 2012) due to my experience using composability principles to design and implement the message-dissemination mechanism for a mobile ad hoc router in a proprietary network. In it, the message-dissemination functionality of the router emerges from the aggregation of approximately 1,500 nodes in a composable tree that resembles a large version of the lambda-tree diagrams in the article. However, instead of being LINQ-based, each node represents a control element (such as if/else, for-loop, and Boolean operations nodes), as well as nodes that directly access message attributes. Each incoming message traverses the composable tree, with control nodes directing it through pertinent branches based on message attributes (such as message type, timestamp, and sender’s location) until the message reaches processing nodes that complete the dissemination.

Since assembling and maintaining a 1,500-node tree within the code base would be daunting, a parser assembles the tree from a 1,300-line routing-rule specification based on a domain-specific language (DSL).[a] Defining the routing rules through this DSL-assembled composable tree also provides these additional benefits:

Nodes verified independently. Verifying the if/else, message-timestamp, and other nodes can be done in isolation;

Routing rules modified for unit testing. As the routing rules mature, their execution requires a full lab- or field-configuration environment, making it difficult to test new features; a quick simplification of a local copy of the DSL specification defines routing rules that bypass irrelevant lab/field constraints while focusing on the feature being tested on the developer’s desktop;

Scalable and robust. New routing rules can be added to DSL specification; new routing concepts can be added through the definition of new node types; and new techniques can be added to the overall design; and

Each message traversal recorded by the composable tree. Each node in the composable tree logs a brief one-line statement describing what it was doing and why the message chose a particular traversal path; the aggregation of these statements provides an itinerary describing the journey of each message traversal through the composable tree for confirmation or debugging.

My experience with composable trees defined through a DSL has been so positive I would definitely consider using the technique again to solve problems that are limited in scope but unlimited in variation.
Jim Humelsine, Neptune, NJ

Model Dependence in Sample-Size Calculation

We wish to clarify and expand on several points raised by Martin Schmettow in his article “Sample Size in Usability Studies” (Apr. 2012) regarding sample-size calculation in usability engineering, emphasizing the challenges of calculating sample size for binomial-type studies and identifying promising methodologies for future investigation.

Schmettow interpreted “overdispersion” as an indication of the variability of the parameter p; that is, when n Bernoulli trials are correlated (dependent), the variance can be shown as np(1–p)(1+C), where C is the correlation parameter, and when C>0 the result is overdispersion. When the Bernoulli trials are negatively correlated, or C<0, the result is “underdispersion.” If the trials are independent, then C=0, corresponding to the binomial model. Bernoulli trials may thus result in overdispersion or underdispersion; in practice, overdispersion is more common due to the heterogeneity of populations/samples.

A widely used approach for modeling an overdispersed binomial model is to consider p as a random variable to account for all uncertainty. A common model for p is the beta distribution that leads to a well-known prototype model for overdispersion, the “beta-binomial distribution.” Note this model assumes a particular parametric distribution of the random variable p. However, sample-size calculations based on this paradigm also involve computational challenges; M’Lan et al.[1] concluded that choosing the criterion for sample-size determination from the many criteria in the literature is ultimately based on personal taste. Note, too, that Schmettow’s “zero-truncated logit-normal binomial model” follows this scheme. To the best of our knowledge, the Bernstein–Dirichlet process is a promising family for such a modeling framework; a nice feature of the related distribution of p is that any density in (0, 1] can be approximated by the Bernstein polynomial.

A more common approach is to fix p through widely used methods involving fixed p based on the confidence-interval formulas derived from normal approximations to the binomial distribution requiring an estimate of p as input into the sample-size formula. However, the normal-based-interval approximation is well known for being erratic for small sample sizes, and even for large samples when p is near the boundaries 0 or 1.

Current sample-size-calculation procedures are thus highly model-dependent, so results will differ. This means a universal procedure that works for a particular binomial-type process is as yet nonexistent, and more studies are needed. We hope Schmettow’s article and our discussion here inspire more researchers to take on the subject of sample-size calculation for usability studies.
Dexter Cahoy and Vir Phoha, Ruston, LA

Reference
1. M’Lan, C.E., Joseph, L., and Wolfson, D.B. Bayesian sample size determination for binomial proportions. Bayesian Analysis 3, 2 (Feb. 2008), 269–296.

Communications welcomes your opinion. To submit a Letter to the Editor, please limit yourself to 500 words or less, and send to letters@cacm.acm.org.

[a] http://en.wikipedia.org/wiki/Domain-specific_language

© 2012 ACM 0001-0782/12/08 $15.00
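The composable-tree design in Jim Humelsine's letter (control nodes that route on message attributes, processing nodes that finish the job, and a one-line log entry per node) can be sketched in a few lines. This is only an illustration under stated assumptions: the class names `IfElse` and `Process`, the `disseminate` helper, the message fields, and the three routing rules are all hypothetical, and the tree is built by hand here, whereas the letter describes a parser assembling roughly 1,500 nodes from a DSL specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, object]


@dataclass
class Process:
    """Leaf node: completes dissemination and logs what it did."""
    name: str

    def route(self, msg: Message, log: List[str]) -> str:
        log.append(f"process: {self.name}")
        return self.name


@dataclass
class IfElse:
    """Control node: picks a branch from a predicate on message attributes."""
    label: str
    test: Callable[[Message], bool]
    then: object
    otherwise: object

    def route(self, msg: Message, log: List[str]) -> str:
        taken = self.test(msg)
        log.append(f"if {self.label}: {'then' if taken else 'else'}")
        branch = self.then if taken else self.otherwise
        return branch.route(msg, log)


# Hypothetical routing rules, hand-built for illustration only.
tree = IfElse(
    "type == 'alert'",
    lambda m: m["type"] == "alert",
    Process("broadcast"),
    IfElse(
        "timestamp >= 100",
        lambda m: m["timestamp"] >= 100,
        Process("forward"),
        Process("drop"),
    ),
)


def disseminate(msg: Message):
    """Traverse the tree and return (action, itinerary of log lines)."""
    log: List[str] = []
    action = tree.route(msg, log)
    return action, log
```

Because each node only knows its own predicate and children, a node such as `IfElse` can be checked in isolation, and the returned itinerary is the per-message trace the letter describes for confirmation or debugging.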

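The two quantities discussed in the letter by Cahoy and Phoha can be made concrete with a short sketch. The first function evaluates the variance np(1–p)(1+C) quoted in the letter. The second is an assumption on our part: the letter does not give its sample-size formula, so the sketch uses the standard textbook form derived from the normal (Wald) interval, n = z²p(1–p)/d² for a target half-width d. The function names and the `margin` and `confidence` parameters are illustrative only.

```python
import math
from statistics import NormalDist


def dispersed_variance(n: int, p: float, c: float) -> float:
    """Variance n*p*(1-p)*(1+C) of a sum of n correlated Bernoulli trials.

    c > 0 gives overdispersion, c < 0 underdispersion, c == 0 the binomial case.
    """
    return n * p * (1 - p) * (1 + c)


def wald_sample_size(p: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n whose normal-approximation interval half-width is <= margin.

    Assumes a fixed estimate of p; this is the textbook Wald-based formula,
    not a formula taken from the letter.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)
```

For example, `wald_sample_size(0.5, 0.05)` asks how many trials are needed for a 95% interval of half-width 0.05 at p = 0.5. As the letter cautions, this approximation is erratic for small samples and when p is near 0 or 1, so the result is a rough guide at best.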


Association for Computing Machinery

Global Reach for Global Opportunities in Computing

Dear Colleague,

Today’s computing professionals are at the forefront of the technologies that drive innovation across
diverse disciplines and international boundaries with increasing speed. In this environment, ACM offers
advantages to computing researchers, practitioners, educators and students who are committed to self-
improvement and success in their chosen fields.

ACM members benefit from a broad spectrum of state-of-the-art resources. From Special Interest Group
conferences to world-class publications and peer-reviewed journals, from online lifelong learning resources
to mentoring opportunities, from recognition programs to leadership opportunities, ACM helps computing
professionals stay connected with academic research, emerging trends, and the technology trailblazers
who are leading the way. These benefits include:

Timely access to relevant information

• Communications of the ACM magazine


• ACM Queue website for practitioners
• Option to subscribe to the ACM Digital Library
• ACM’s 50+ journals and magazines at member-only rates
• TechNews, tri-weekly email digest
• ACM SIG conference proceedings and discounts

Resources to enhance your career

• ACM Tech Packs, exclusive annotated reading lists compiled by experts


• Learning Center books, courses, podcasts and resources for lifelong learning
• Option to join 37 Special Interest Groups (SIGs) and hundreds of local chapters
• ACM Career & Job Center for career-enhancing benefits
• CareerNews, email digest
• Recognition of achievement through Fellows and Distinguished Member Programs

As an ACM member, you gain access to ACM’s worldwide network of more than 100,000 members from
nearly 200 countries. ACM’s global reach includes councils in Europe, India, and China to expand high-
quality member activities and initiatives. By participating in ACM’s multi-faceted global resources, you
have the opportunity to develop friendships and relationships with colleagues and mentors that can
advance your knowledge and skills in unforeseen ways.

ACM welcomes computing professionals and students from all backgrounds, interests, and pursuits.
Please take a moment to consider the value of an ACM membership for your career and for your future
in the dynamic computing profession.

Sincerely,

Vint Cerf
President
Advancing Computing as a Science & Profession
Association for Computing Machinery
membership application & digital library order form
Priority Code: AD13
You can join ACM in several easy ways:
Online: http://www.acm.org/join
Phone: +1-800-342-6626 (US & Canada); +1-212-626-0500 (Global)
Fax: +1-212-944-1318
Or, complete this application and return with payment via postal mail
Special rates for residents of developing countries: http://www.acm.org/membership/L2-3/
Special rates for members of sister societies: http://www.acm.org/membership/dues.html

Please print clearly

Name
Address
City State/Province Postal code/Zip
Country E-mail address
Area code & Daytime phone Fax Member number, if applicable

Purposes of ACM
ACM is dedicated to:
1) advancing the art, science, engineering, and application of information technology
2) fostering the open interchange of information to serve both professionals and the public
3) promoting the highest professional and ethics standards
I agree with the Purposes of ACM:
Signature

ACM Code of Ethics:
http://www.acm.org/about/code-of-ethics

choose one membership option:

PROFESSIONAL MEMBERSHIP:
[ ] ACM Professional Membership: $99 USD
[ ] ACM Professional Membership plus the ACM Digital Library: $198 USD ($99 dues + $99 DL)
[ ] ACM Digital Library: $99 USD (must be an ACM member)

STUDENT MEMBERSHIP:
[ ] ACM Student Membership: $19 USD
[ ] ACM Student Membership plus the ACM Digital Library: $42 USD
[ ] ACM Student Membership PLUS Print CACM Magazine: $42 USD
[ ] ACM Student Membership w/Digital Library PLUS Print CACM Magazine: $62 USD

All new ACM members will receive an ACM membership card.
For more information, please visit us at www.acm.org

Professional membership dues include $40 toward a subscription to Communications of the ACM. Student membership dues include $15 toward a subscription to XRDS. Member dues, subscriptions, and optional contributions are tax-deductible under certain circumstances. Please consult with your tax advisor.

payment:
Payment must accompany application. If paying by check or money order, make payable to ACM, Inc. in US dollars or foreign currency at current exchange rate.

[ ] Visa/MasterCard [ ] American Express [ ] Check/money order

[ ] Professional Member Dues ($99 or $198) $ ______________________
[ ] ACM Digital Library ($99) $ ______________________
[ ] Student Member Dues ($19, $42, or $62) $ ______________________
Total Amount Due $ ______________________

Card # Expiration date
Signature

RETURN COMPLETED APPLICATION TO:
Association for Computing Machinery, Inc.
General Post Office
P.O. Box 30777
New York, NY 10087-0777

Questions? E-mail us at acmhelp@acm.org
Or call +1-800-342-6626 to speak to a live representative

Satisfaction Guaranteed!

The Communications Web site, http://cacm.acm.org,
features more than a dozen bloggers in the BLOG@CACM
community. In each issue of Communications, we’ll publish
selected posts or excerpts.

Follow us on Twitter at http://twitter.com/blogCACM

doi:10.1145/2240236.2240239 http://cacm.acm.org/blogs/blog-cacm

Machine Learning bigger than you might otherwise ex-


pect. A third is that real organizations

and Algorithms;
have people coming and going, and
any project that is by just one person
withers when that person leaves. This

Agile Development observation means the development


of systems with clean abstractions can
be extraordinarily helpful, as it allows
John Langford poses questions about the direction of research for people to work independently. This
machine learning and algorithms. Ruben Ortega shares lessons observation also means simple wide-
ly applicable tricks (for example, the
about agile development practices like Scrum. hashing trick) can be broadly helpful.
A good way to phrase research direc-
tions is with questions. Here are a few
John Langford research, because people often work of my natural questions.
“Research Directions with superlinear time algorithms 1. How do we efficiently learn
for Machine Learning and languages. Two very common ex- in settings where exploration is re-
and Algorithms” amples of this are graphical models quired? These are settings where the
http://cacm.acm.org/ where inference is often a superlin- choice of action you take influences
blogs/blog-cacm/108385 ear operation: think about the n² de- the observed reward; ad display and
May 16, 2011 pendence on the number of states in medical testing are two good scenari-
S. Muthu Muthukrishnan invited me a Hidden Markov Model and Kernel- os. This is deeply critical to many ap-
to the National Science Foundation’s ized Support Vector Machines where plications because the learning with
Workshop on Algorithms in the Field optimization is typically quadratic or exploration setting is inherently more
with the goal of providing a sense of worse. There are two basic reasons natural than the standard supervised
where near-term research should go. for this. The most obvious is that lin- learning setting. The tutorial we did
When the time came, though, I instead ear time allows you to deal with large detailed much of the state of the art
bargained for a post, which provides a datasets. A less obvious but critical here, but very significant questions
chance for other people to comment. point is that a superlinear time algo- remain. How can we do effective of-
There are several things I did not rithm is inherently buggy; it has an fline evaluation of algorithms? How
fully understand when I went to Yahoo! unreliable running time that can eas- can we be both efficient in sample
about five years ago. I would like to re- ily explode if you accidentally give it complexity and computational com-
peat them as people in academia may too much input. plexity? Several groups are interested
not yet understand them intuitively. 2. Almost anything worth doing in sampling from a Bayesian poste-
1. Almost all the big-impact algo- requires many people working to- rior to solve these sorts of problems.
rithms operate in pseudo-linear or gether. This happens for many rea- Where and when can this be proved
better time. Think about caching, sons. One is the time-critical aspect to work? (There is essentially no anal-
hashing, sorting, filtering, etc. and you of development—in many places it re- ysis.) What is a maximally distributed
have a sense of what some of the most ally is worthwhile to pay more to have and incentive-compatible algorithm
heavily used algorithms are. This mat- something developed faster. Another that remains efficient? The last ques-
ters quite a bit to machine learning is that real projects are simply much tion is very natural for marketplace

10 communications of the acm | august 2012 | vol. 55 | no. 8


blog@cacm

design. How can we best construct topic of the Coarse-to-Fine Learning ˲˲ Cross functionality and specializa-
reward functions operating on dif- and Inference Workshop. These are tion: The team is responsible for de-
ferent time scales? What is the rela- inherently related as coarse-to-fine is ciding when to distribute work across
tionship between the realizable and a pruned breadth first search. Restat- team members or have each focus on a
agnostic versions of this setting, and ed, it is not enough to have a language certain part of the project.
how can we construct an algorithm for specifying your prior structural be- ˲˲ Continuous learning and itera-
that smoothly interpolates between liefs; instead we must have a language tion pressure: The team is respon-
the two? that results in computationally trac- sible for delivering on its own sched-
2. How can we learn from lots of table solutions. ule and the retrospectives to improve
data? We will be presenting a KDD 5. The deep learning problem re- each sprint.
survey/tutorial about what has been mains interesting. How do you effec- The advantage of giving this bal-
done. Some of the larger-scale learn- tively learn complex nonlinearities ancing act to the team is that it takes
ing problems have been addressed ef- capable of better performance than ownership of the solution with the
fectively using MapReduce. The best a basic linear predictor? An effective full understanding of all the trade-offs
example I know is Ozgur Cetin’s algo- solution avoids feature engineering. that will need to occur each sprint. By
rithm at Yahoo! It is preconditioned Right now, this is almost entirely dealt distributing the work to the team, it
conjugate gradient with a Newton with empirically, but theory could eas- also makes team members account-
stepsize using two passes over exam- ily have a role to play in phrasing ap- able to one another for making sure
ples per step. (A non-Hadoop version propriate optimization algorithms, the goals are achieved.
is implemented in Vowpal Wabbit for example. 2. Self-organizing teams have their
for reference.) But linear predictors Good solutions to each of these re- members assume some well-defined
are not enough; we would like learn- search directions would result in revo- roles spontaneously, informally, and
ing algorithms that can, for example, lutions in their area, and every one of transiently to help make their projects
learn from all the images in the world. them would plausibly see wide appli- successful:
Doing this well plausibly requires a cability. ˲˲ Mentor: Guides the team in the use
new approach and new learning algo- What’s missing? of agile methods.
rithms. A key observation here is that ˲˲ Coordinator: Manages customer
the bandwidth required by the learn- Ruben Ortega expectations and collaboration with
ing algorithm cannot be too great. “Research in Agile the team.
3. How can we learn to index effi- Development ˲˲ Translator: Translates customer
ciently? The standard solution in in- Practices” business requirements to technical re-
formation retrieval is to evaluate (or http://cacm.acm.org/ quirements and back.
approximately evaluate) all objects blogs/blog-cacm/109811 ˲˲ Champion: Advocates agile team
in a database returning the elements June 20, 2011 approach with senior management.
with the largest score according to I am an enthusiastic advocate of agile ˲˲ Promoter: Works with customers
some learned or constructed scoring software development practices like to explain agile development and how
function. This is an inherently O(n) Scrum. Its ability to allow teams to to collaborate best with the team.
operation, which is frustrating when focus on delivering product and com- ˲˲ Terminator: Removes team mem-
it’s plausible that an exponentially municate status has made it one of the bers that hinders the team’s successful
faster O(log(n)) solution exists. A good easiest and best software development functioning.
solution involves both theory and em- techniques I have seen in a career that These roles are an emergent prop-
pirical work here as we need to think has used ad hoc, Waterfall, and every- erty that comes from using agile de-
about how to think about how to solve thing in between. velopment methods. They are not pre-
the problem, and of course we need to Recent research from New Zealand scribed explicitly as part of any of the
solve it. has furthered the cause by performing agile development philosophies, but
4. What is a flexible, inherently ef- a study that involved 58 practitioners they arise as part of successful use of
ficient language for architecting rep- in 23 organizations over four years. In the methodology.
resentations for learning algorithms? reading a Victoria University of Wel- I am eager to see more research
Right now, graphical models often lington article on “Smarter Software emerge as to where agile software
get (mis)used for this purpose. It is Development” and then looking at development practices succeed and
easy and natural to pose a computa- Rashina Hoda’s thesis “Self-Organiz- where they need improvement. There
tionally intractable graphical model, ing Agile Teams: A Grounded Theory,” is a large body of evidence that shows it
implying many real applications in- there are two interesting takeaways: to be a successful strategy, and having
volve approximations. A better solu- 1. Self-organizing scrum teams natu- the research to support it would help
tion would be to use a different rep- rally perform a balancing act between: encourage its adoption.
resentation language that was always ˲˲ Freedom and responsibility: The
computationally tractable yet flexible team is responsible for collective de- John Langford is a senior researcher at Microsoft
cision making, assignment, commit- Research New York. Ruben Ortega is an engineering
enough to solve real-world problems. director at Google.
One starting point for this is Searn. ment, and measurement, and must
Another general approach was the choose to do them. © 2012 ACM 0001-0782/12/08 $15.00
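
Langford's post mentions the hashing trick as a simple, widely applicable technique. The following is a minimal sketch of the idea, not the Vowpal Wabbit implementation: the function name `hash_features`, the bucket count, and the use of CRC32 as the hash are all assumptions chosen for illustration. Each string feature is hashed to a slot in a fixed-length vector, so memory stays bounded no matter how many distinct features appear, and the work is linear in the number of features.

```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map a bag of string features to a fixed-length vector by hashing.

    Hypothetical helper for illustration; CRC32 is used only because it
    is deterministic across runs (Python's built-in hash() is salted).
    """
    vec = [0.0] * num_buckets
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))  # unsigned 32-bit integer
        index = h % num_buckets              # which slot the feature lands in
        # Use the top bit as a sign so colliding features tend to cancel
        # rather than always add up.
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vec[index] += sign
    return vec


features = ["user=alice", "ad=shoes", "hour=14"]
vector = hash_features(features)
print(len(vector), vector)
```

The vector length is fixed up front, so a linear predictor trained on it never needs a feature dictionary; the cost is that unrelated features can collide in the same slot.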



ACM
ACM’s Member
Career & Job Center News
Michael Stonebraker
on Big Data and
Looking for your next IT job? launching Start-ups
How big is big
Need Career Advice?

data? Humans
are creating
at least 1.8
zettabytes of
Visit ACM’s Career & Job Center at:
it a year,
according

to IDC’s annual Digital

http://jobs.acm.org
Universe Study. That is enough
to fill 57.5 billion 32GB iPads.
And we are only beginning
to see the kind of technological
breakthroughs that will allow
Offering a host of career-enhancing benefits: us to effectively manage it,
says big-data pioneer Michael
Stonebraker, adjunct professor
➜ A highly targeted focus on job opportunities in
at the Massachusetts Institute
of Technology.
the computing industry “Everything of any
commercial significance will
soon be geopositioned by a
➜ Access to hundreds of corporate job postings sensor,” he says. “All auto
insurers will place one in cars
so they can track safety among
➜ Resume posting keeping you connected to the their drivers. Libraries will tag
employment market while letting you maintain all books so they can be quickly
found if misplaced on a shelf.
full control over your confidential information This is causing an ultimate
tsunami of data. There will
be an unbelievable amount of
➜ An advanced Job Alert system notifies you of advancements over the next 10
years to address this.”
new opportunities matching your criteria Expect Stonebraker to
remain on the forefront.
➜ Career coaching and guidance from trained
His now-legendary work on
relational database systems
experts dedicated to your success has led to him launching
seven start-ups, beginning
with Ingres in the 1970s. He
➜ A content library of the best career articles continues to work on methods
to improve the organization
compiled from hundreds of sources, and much of data for analytics, as well
as adapt to the great increases
more! in velocity through which
information is now delivered.
It is the excitement
of coming up with an
The ACM Career & Job Center is the perfect place to innovation—and then
overseeing its development
begin searching for your next employment opportunity! via the start-up process—that
keeps Stonebraker engaged.
“I like to see my ideas make
http://jobs.acm.org the light of day,” he says. “If
you just publish papers at
a university, it’s likely that
no technology transfer will
happen. If you approach a large
company with your idea, the
chances of it getting picked up
aren’t high. The best way to see
your idea through is a start-up.
I specialize in disruptive tech.
Start-ups are great for that.”
—Dennis McCafferty



news

Science | doi:10.1145/2240236.2240241 Jeff Kanipe

Cosmic Simulations
With the help of supercomputers, scientists are now
able to create models of large-scale astronomical events.

If you are going to build a synthetic
universe in a computer
and watch it evolve over billions
of years, you are going to need a
mighty powerful computer, one
that will literally go where no computer
has gone before. Last fall, astronomers
with the University of California High-
Performance AstroComputing Center
announced they had successfully com-
pleted such a model, which they called
the Bolshoi simulation. Bolshoi, which
is Russian for “great” or “grand,” is an
apt word choice. The simulation used
six million CPU hours on the Pleiades
supercomputer at the U.S. National
Aeronautics and Space Administration’s
(NASA’s) Ames Research Center,
which, as of June 2012, was rated as the
world’s 11th most powerful computer
Courtesy of MultiDark Resimulation project (Gustavo Yepes, Splotch)

on the TOP500 list.


Today, there is hardly a field of sci- A visualization from the Bolshoi simulation depicting the evolution of gas density in the
ence that has not been propelled into resimulated 007 cluster.
new territory by supercomputers. You
find them in studies as diverse as ge- of scientific data and run simulations its applications obviously do not stop
nome analysis, climate modeling, and based on that data. there. To produce the Bolshoi simu-
seismic wave propagation. In the last The Pleiades is a marvel of comput- lation, the Adaptive Refinement Tree
five years, supercomputer technology er technology. The system architecture (ART) algorithm was run on Pleiades.
has advanced so significantly in speed consists of 112,896 cores housed in 185 Rather than compute interactions
and processing capacity that many re- refrigerator-sized racks. It can run at a between pairs of particles, the ART
searchers no longer refer to these pow- theoretical peak performance of 1.34 algorithm subdivides space into cubi-
erful mainframes as supercomputers, petaflops and has a total memory of cal cells and calculates interactions
but rather HPCs, or high-performance 191 terabytes. NASA uses the Pleiades between cells. This allows for more
computers. These rarefied machines for its most demanding modeling and efficient calculations of gravitational
are designed to process vast amounts simulation projects in aeronautics, but interactions among billions of mass




particles, making it the perfect tool for mates indicated,” says Joel Primack,
recreating an unfolding cosmos. director of the University of California
Bolshoi’s purpose was to do that and The 180 “snapshots” High-Performance AstroComputing
more. It would model not just how the taken during the Center and coauthor of the paper an-
visible universe of stars, gas, and dust nouncing the results of the Bolshoi
evolved, but also how the vast majority Bolshoi simulation simulation. The previous estimates
of the invisible universe, which is com- will allow predicted the number of halos in the
posed of dark matter, evolved. Dark early universe should be 10 times
matter is a crucial component of the astrophysicists greater than is seen in the Bolshoi
simulation because, although it cannot to analyze how simulation. The difference, says Pri-
yet be directly detected (it can only be mack, is significant. “It remains to be
inferred from its gravitational effects on dark matter halos, seen whether this is just observational
normal matter), galaxies are thought to galaxies, and clusters incompleteness or a potentially seri-
have formed within huge “cocoons” of ous problem for the standard Lambda
dark matter, called dark matter halos. of galaxies coalesced Cold Dark Matter theory.”
Astronomers actually ran two Bol- and evolved.
shoi simulations on Pleiades: the Smaller Simulations
Bolshoi and the BigBolshoi. The Bol- Two other recent computer simula-
shoi computed the evolution of a vol- tions also have broken new astrophysi-
ume of space measuring about one cal ground, but at smaller scales. One
billion light-years on a side contain- is the first realistic simulation, at ga-
ing more than one million galaxies. craft, revealed that the background lactic scales, of how the Milky Way was
The simulation begins about 24 mil- radiation is not perfectly uniform, but formed. The simulation, named Eris,
lion years after the big bang, which exhibits tiny variations, regions that shows the origin of the Milky Way be-
occurred 13.7 billion years ago, and are slightly more or less dense by one ginning one million years after the big
follows the evolution of 8.6 billion part in 100,000. These fluctuations bang and traces its evolution to pres-
dark-matter particles, each with an as- correspond to non-uniformities in the ent time. It was produced by a research
signed mass of 200 million times that otherwise uniform distribution of mat- group run by Lucio Mayer, an astro-
of the sun, to the present day. Logisti- ter in the very early universe, and are physicist at the University of Zurich,
cally, the simulation required 13,824 essentially the seedlings from which and Piero Madau, an astronomer at the
cores and a cumulative 13 terabytes of emerged all of the galaxies observed in University of California, Santa Cruz.
RAM. In all, 600,000 files were saved, the universe today. Just exactly how spiral galaxies like
filling 100 terabytes of disk space. Just how the universe went from a ours form has been the subject of con-
During the Bolshoi simulation’s run, nearly smooth initial state to one so tentious debate for decades (hence
180 “snapshots” were made showing full of complex structure has long puz- christening the simulation “Eris,”
the evolutionary process at different zled cosmologists, but most agree the the Greek goddess of strife and dis-
times. These visualizations will allow best explanation is the Lambda Cold cord). Previous simulations resulted
astrophysicists to further analyze how Dark Matter theory, or ΛCDM. Once re- in galaxies that were either too small
dark matter halos, galaxies, and clus- ferred to as just the Cold Dark Matter or dense, did not have an extended
ters of galaxies coalesced and evolved. theory, it has since been augmented to disk of gas and stars, or exhibited too
The BigBolshoi simulation was of a include dark energy (Λ), a mysterious many stars in the central region. Eris,
lower resolution but covered a volume counteractive force to gravity that has however, achieved the proper balance.
of four billion light-years on a side, 64 times larg- been invoked to explain the accelerat- Once again, the Pleiades computer
er than the Bolshoi model. Its purpose ing expansion of the universe. The the- was brought to bear, entailing 1.4
was to predict the properties and distri- ory makes specific predictions for how million processor hours. Supporting
bution of galaxy clusters and superclu- structure in the universe grows hierar- simulations were performed using
sters throughout this volume of space. chically as smaller objects merge into the Cray XT5 Monte Rosa computer at
Any simulation that strives to re- bigger ones. Because ΛCDM explains Zurich’s Swiss National Supercomput-
flect real processes in nature requires much of what is observed, the Bolshoi ing Center at the Swiss Federal Insti-
a significant amount of observational simulation drew upon this model as its tute of Technology. (The Cray XT5 has
input data and a well-founded theory theoretical framework. since been upgraded to a 400-teraflop
that explains what is observed. In the The results of both simulations Cray XE6.) The simulation modeled
former case, the Bolshoi simulation largely confirm cosmologists’ assump- the formation of a galaxy with 790 bil-
is based on precise measurements tions about the formation of large- lion solar masses comprised of 18.6
of the vestigial all-sky afterglow of scale structure, but one discrepancy million particles, from which dark
the big bang, which astronomers call needs to be addressed. matter, gas, and stars form. The re-
cosmic microwave background radia- “Our analysis of the original Bolshoi sults are another confirmation of the
tion. The measurements, made over simulation showed that dark matter Cold Dark Matter theory.
several years using NASA’s Wilkinson halos that could host early galaxies are “In this theory, galaxies are the out-
Microwave Anisotropy Probe space- much less abundant than earlier esti- come of the gravitational growth of tiny




quantum density fluctuations present The simulation was run on the Cray Blue Waters project is now online at
shortly after the big bang,” says Madau. XT5 Kraken supercomputer, which is the National Center for Supercomput-
“The ordinary matter that forms stars the 21st most-powerful computer in the er Applications in Urbana-Champaign,
and planets has fallen into the gravita- world, and is housed at Oak Ridge Na- IL. Titan should become operational
tional wells created by large clumps of tional Laboratory. at the Oak Ridge National Laboratory
dark matter, giving rise to galaxies in None of these studies would have later this year, and Stampede at The
the centers of dark matter halos.” been possible without HPCs. A regu- University of Texas at Austin is expect-
Even smaller, on stellar scales, as- lar personal computer, for instance, ed to be up and running in January
tronomers have employed HPCs to would have required 570 years to 2013. These and other HPCs promise
render a clearer picture of why a class make the same calculations to repro- to reveal new and transformative in-
of compact, dense objects called neu- duce the Eris simulation. Still, even sights into the world and the universe
tron stars often move through space as computer performance improves, from the smallest scales to the largest.
at very high velocities, in some cases Primack sees new challenges ahead. Their simulations will probe levels
1,000 kilometers or more per second. For example, if the observational and of complexity we can only imagine,
(Most stars have typical space veloci- computational datasets expand ex- taking us where no one has gone—or
ties of a few tens of kilometers per ponentially while the speed of data could possibly go—before.
second.) A clue may be found in how transmission expands arithmetically,
neutron stars are created. These ob- methodologies for analysis will need
Further Reading
jects, in which protons and electrons to change. “Instead of bringing data to
have been gravitationally compressed the desktop, we will increasingly have Guedes, J., Callegari, S., Madau, P.,
and Mayer, L.
into neutrons, form from the collaps- to bring our algorithms to the data,”
Forming realistic late-type spirals in a
ing cores of massive stars just before he says. ΛCDM universe: The Eris simulation, The
they explode as supernovae. Most as- Another challenge, says Primack, is Astrophysical Journal 742, 2, Dec. 2011.
tronomers have long been convinced the energy cost of computation. “The Nordhaus, J., Brandt, T. D.,
the explosion itself somehow gives the Department of Energy, which currently Burrows, A., Almgren, A.
neutron star its high-velocity “kick.” has the fastest U.S. supercomputers on The hydrodynamic origin of neutron star
To explore this possibility, Adam Bur- the TOP500 list, is not willing to have kicks, http://arxiv.org/abs/1112.3342.
rows, an astrophysicist at Princeton its supercomputer centers use much Prada, F., Klypin, A., Cuesta, A.,
University, created a three-dimension- more than the 10 megawatts that they Betancort-Rijo, J., and Primack, J.
al animation of conditions through- currently do. It will be a huge challenge Halo concentrations in the standard ΛCDM
cosmology, http://arxiv.org/pdf/1104.5130.pdf.
out the star during the explosion. He to go from 10 petaflops to one exaflop
found that the star does not explode without increasing the energy con- Rantsiou, E., Burrows, A.,
Nordhaus, J., and Almgren, A.
symmetrically, but that the explosion sumption by more than a small factor.
Induced rotation in 3D simulations of Ccre
rips through the star asymmetrically. We don’t know how to do this yet.” collapse supernovae: Implications for
The hydrodynamic recoil from such Despite such challenges, a new pulsar spins, The Astrophysical Journal
an explosion is more than sufficient to era of HPC is dawning. Sequoia, an 732, 1, May 2011.
hurl the neutron star off into space. IBM BlueGene/Q system at Lawrence University of California High-Performance
“This is a straightforward conse- Livermore National Laboratory, has AstroComputing Center
quence of momentum conservation attained 16 petaflops, and four other Bolshoi videos, http://hipacc.ucsc.edu/
Bolshoi/Movies.html.
when things aren’t spherical,” says HPCs in the U.S. are expected to be
Burrows. “A lot of exotic mechanisms in the 10-petaflops range soon. IBM’s
Jeff Kanipe is an astronomy writer based in Boulder, CO.
have been proposed, but this simplest Mira at the Argonne National Labora-
of origins seems quite natural.” tory just attained 10 petaflops, and the © 2012 ACM 0001-0782/12/08 $15.00
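
The article says the ART algorithm avoids computing interactions between every pair of particles by subdividing space into cubical cells and working with the cells instead. The sketch below illustrates only that general idea and is not the ART code: it uses a fixed, uniform grid (ART refines its cells adaptively), and the function names, the particle count, and the cell size are assumptions made for the example. It groups particles into cells, keeps each cell's total mass and center of mass, and compares the number of particle pairs with the number of cell pairs.

```python
import random
from collections import defaultdict


def bin_particles(particles, cell_size):
    """Group (x, y, z, mass) particles into cubic cells.

    Returns {cell_index: (total_mass, center_of_mass)}.
    Hypothetical helper; a uniform grid stands in for adaptive refinement.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])  # mass, m*x, m*y, m*z
    for x, y, z, m in particles:
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        s = sums[key]
        s[0] += m
        s[1] += m * x
        s[2] += m * y
        s[3] += m * z
    return {
        key: (s[0], (s[1] / s[0], s[2] / s[0], s[3] / s[0]))
        for key, s in sums.items()
    }


def pair_count(n):
    """Number of unordered pairs among n objects."""
    return n * (n - 1) // 2


random.seed(0)
particles = [
    (random.random(), random.random(), random.random(), 1.0)
    for _ in range(10_000)
]
cells = bin_particles(particles, cell_size=0.25)  # at most 4 x 4 x 4 cells

print("particle pairs:", pair_count(len(particles)))
print("cell pairs:    ", pair_count(len(cells)))
```

With 10,000 particles there are 49,995,000 particle pairs, while a 4 x 4 x 4 grid has at most 2,016 cell pairs. The trade-off is resolution: treating a cell as a single mass is only a good approximation at a distance, which is one reason a production code refines cells where matter is dense.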

Milestones

Computer Science Awards


ACM, the European Association Computing. Herlihy, a professor Transactional Memory.” The behalf of the Austrian Ministry
for Theoretical Computer Science of computer science at Brown Edsger W. Dijkstra Prize is for Science presented the
(EATCS), and the Austrian University, and Moss, a professor awarded to “outstanding Wittgenstein Award, Austria’s
government recently honored five of computer science at the papers on the principles of leading science prize, to
leading computer scientists. University of Massachusetts, distributed computing, whose Thomas Henzinger, president
were honored for their 1993 significance and impact on the of the Institute of Science and
Edsger W. Dijkstra Prize “Transactional Memory” paper. theory or practice of distributed Technology. The Wittgenstein
ACM and EATCS selected Maurice Shavit, a professor of computer computing have been evident for Award honors and supports
Herlihy, J. Eliot B. Moss, Nir science at Tel Aviv University, and at least 10 years.” “scientists working in Austria who
Shavit, and Dan Touitou to Touitou, chief technology officer have accomplished outstanding
receive the 2012 Edsger W. at Toga Networks, were honored Wittgenstein Award scientific achievements.”
Dijkstra Prize in Distributed for their 1995 paper, “Software The Austrian Science Fund on —Jack Rosenberger




Technology | doi:10.1145/2240236.2240242 Tom Geller

DARPA Shredder the problem. As Don Engel of the two-


person, second-place “Schroddon”

Challenge Solved team says, “If we had taken a com-


pletely manual approach, we’d lose
because teams can be of any size. If
The eight-person winning team used original computer algorithms we’d taken only a computer-driven ap-
to narrow the search space and then relied on human observation proach, we’d also lose because there
to move the pieces into their final positions. are people out there who are better at
computer vision.” In fact, a team from

University of California, San Diego attempted
You accidentally shredded to crowdsource the challenge,
an important document. setting it up as a sort of game and in-
Could you put it back to- viting the public to help them via the
gether? Would it be worth Internet. Saboteurs foiled the team’s
the effort? What if it would strategy, accusing it of “cheating” and
stop a terrorist plot, or release an inno- moving shreds to incorrect positions.
cent person from jail? The team responded by implementing
Intelligence agencies around the security measures, which also slowed
world face questions like these, most its progress, knocking it from third to
famously when the East German secret sixth place.
police abandoned tons of torn pages The winning All Your Shreds team
after the fall of the Berlin Wall in late took advantage of the fact the shred-
1989. Those documents are still being ded documents were apparently pho-
reconstructed, and the creation of a tocopied before being placed face-up
computer program called the e-Puzzler on a solid background and scanned,
in 2010 increased hopes of finishing resulting in a phenomenon known as
the job by 2013. But that program was printer steganography. “High-quality
written to reconstruct those particular printers and photocopiers, in the Unit-
documents. It is optimized to handle ed States at least, have a watermark on
hand-torn pieces as the East German them so the Secret Service can track
secret police were forced to manually phony money,” explains team member
DARPA Shredder Challenge puzzle #1
destroy many of the documents. consisted of more than 200 chads of torn, Otávio Good, founder of Quest Visual.
To seek “unexpected advances” yellow lined paper with handwritten text. “It’s a repeating pattern of yellow dots.
that could be applied to such situ- If you can find it, you can use that to
ations, the U.S. Defense Advanced applied for the challenge, but only 69 your advantage. It’s essentially like a
Research Projects Agency (DARPA) is- of them answered one or more ques- map of how the pieces go together.”
sued its Shredder Challenge in Octo- tions correctly. However, Good pointed out they were
ber 2011, daring the public to unshred The winning team was the only one “yellow dots on yellow paper, and the
five documents and extract their hid- to answer all of the questions in all chads were small. It helped us tremen-
den messages. The puzzles were de- five puzzles, taking home the $50,000 dously, but we were pretty much in the
signed to be progressively difficult prize while ending the contest on Dec. lead even before we found the dots. In
on multiple axes, ranging from about 2, 2011, three days ahead of schedule. my opinion, we were still on track to
200 chads to more than 6,000 each. Like other contest leaders, the eight- win... it was part of the game.”
Text files accompanying the scanned person “All Your Shreds Are Belong “The challenge and its component
Courtesy of the Defense Advanced Research Projects Agency

chads provided questions, with points To U.S.” team employed original com- puzzles were designed to resemble the
reflecting each question’s difficulty. puter algorithms to narrow the search problem facing an intelligence ana-
For example, puzzle #3 asked “What space, and then relied on human ob- lyst,” says Norman Whitaker, deputy
is the indicated location?” while the servation to move the pieces into their director of DARPA’s Information In-
reassembled document showed a final positions. In the process the All novation Office. “That analyst is often
set of geographic coordinates and a Your Shreds team discovered unex- less interested in putting the pieces to-
drawing of Cuba. (Naming the coun- pected difficulties—and also a pecu- gether than in getting crucial informa-
try was worth two points; the city of liarity in the contest that gave it an un- tion off the page in the shortest time
Cienfuegos was worth an additional expected edge. possible.” In explaining why the chal-
six.) Solvers needed to both answer lenge could only be a small represen-
the questions and show how their re- Play to Win tation of what is found in the field, he
construction of the document led to The contest’s parameters had a strong says, “[It’s] narrow in scope to make it
that answer. More than 9,000 teams effect on how competitors approached tractable. It concerned English, hand-


written documents shredded with “When we clicked on a recommenda-


a commercial shredder. Additional tion, we wanted that to happen in real-
research and development would be The problem time—maybe one-tenth of a second
required to apply this work to actual of document on a fast computer. We set up a node.js
confiscated documents.” Even then, server to allow team members to place
he defined the problem of document reconstruction puzzle pieces over the Internet, even
reconstruction as “just one facet of in- is an often-overlooked while the programmers worked on al-
formation assurance, an aspect that is gorithms. There were about eight bot-
often overlooked.” aspect of information tlenecks in the program, but because
Other contest features played unex- assurance, we wrote the client in C#, we were able
pected roles in defining the winning to use C#’s parallel libraries to get
solutions. The Schroddon team found says DARPA’s past them by parallelizing across all
the shredders sometimes sheared the Norman Whitaker. our processors.”
paper in its depth. As team member Indeed, the subject of document
Marianne Engel notes, “The computer shredding is one that measures human
looked at [the edges of such chads] as value against human trouble. We shred
though they were blank areas, but the documents when we believe their value
human eye could see that the chads is greater than the reconstruction cost.
were just torn.” In addition, the five As long as that cost involves human in-
puzzles varied in the type of shredder those for reconstruction, but believes tervention—a point the Shredder Chal-
used; the number and size of pages; the pairing of these technologies is in lenge did not disprove—that equation
the documents’ content; and whether its infancy. still holds. As Don Engel says, “If you
the pages were on lined, unlined, or Specialized knowledge also came look at the last 6,000-piece puzzle,
graph paper. As a result, says Good, “we into play, for example, as Don Engel that’s 6,000 times 6,000 possible pairs,
essentially hard-coded our program to of the second-place Schroddon team which is just intractable for a human.
work for each individual puzzle. We found he was able to apply his mas- So we needed a computer to filter for
didn’t create a general ‘unscramble the ter’s work in statistical natural lan- pairs that were plausible. But then, we
puzzle’ codebase.” guage processing to the task. “One could use our human brains to deter-
One method that helped Good’s mind-set I took from my studies was mine which ones made sense.”
team was the use of a scoring system that of conditional probability: When
that improved its algorithms. “As we you have a few characters, what are
Further Reading
adjusted our algorithms, we could the chances that the next one will be
see whether they were working,” says a specific character?” says Don Engel. Justino, E., Oliveira, L.S., and Freitas, C.
Good. “At that point the efficiency of “So as we were assembling shreds, I Reconstructing shredded documents
through feature matching, Forensic Science
our recommendation algorithm went grepped the dictionary file on my com- International 160, 2, July 13, 2006.
up by a factor of 10 within a few hours. puter to find words with those charac-
Prandtstetter, M. and Günther, R.R.
So although each puzzle started out ters in sequence. Also, for puzzle #5
Combining forces to reconstruct strip
difficult, it got easier as we went.” the question was, ‘What are the three shredded text documents, Proceedings of
translated messages?’ That implied the 5th International Workshop on Hybrid
The Human Element that they were either in some foreign Metaheuristics, Malaga, Spain, Oct. 8–9, 2008.
One fact emerged early on in the con- language or encoded. I saw a very large Schauer, S., Prandtstetter, M., and Günther, R.R.
test: The puzzles would not be solved number of the letter ‘d’ and thought, A memetic algorithm for reconstructing
by computers alone. Matthias Prandt- ‘That’s way more than you’d see in cross-cut shredded text documents,
Proceedings of the 7th International
stetter, a researcher at the Austria English.’ The same was true with cer-
Workshop on Hybrid Metaheuristics, Vienna,
Institute of Technology who investi- tain pairs of letters, such as ‘di’ and Austria, Oct. 1–2, 2010.
gated document reconstruction while ‘ah’ and ‘it.’ As it turned out, the an-
Skeoch, A.
at the Vienna University of Technol- swer to puzzle #5 was in Morse code, An investigation into automated shredded
ogy, acknowledged that much of the written as ‘dit dit dah.’ ” document reconstruction using heuristic
scholarly literature on the subject Part of the All Your Shreds solu- search algorithms, Ph.D. thesis, University
stops where human interaction takes tion involved team members remotely of Bath, 2006.
over, thereby failing to address cer- reviewing the computer’s piece-pair Ukovich, A., Ramponi, G., Doulaverakis, H.,
tain real-world issues. For example, recommendations via the Internet, Kompatsiaris, Y., and Strintzis, M.
“[computer algorithms] might be able then manually making selections and Shredded document reconstruction using
MPEG-7 standard descriptors, Proceedings
to reconstruct a document that’s in placing pieces. Knowing they would of the Fourth IEEE International Symposium
multiple columns or a table, but not ultimately spend hundreds of hours in on Signal Processing and Information
have them in the right order,” he says, this pursuit, its programmers created Technology, Rome, Italy, Dec. 18–24, 2004.
while a human could finish the job by a custom client to make the interface
understanding each portion in con- as fast as possible. “We wanted things Tom Geller is an Oberlin, OH-based science, technology,
and business writer.
text. Prandtstetter suggests language- to be very user-friendly because the
recognition algorithms could extend human was a big element,” says Good. © 2012 ACM 0001-0782/12/08 $15.00
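
Don Engel's arithmetic above (6,000 times 6,000 possible pairs, or about 36 million) suggests why a computer has to shortlist candidates before a person looks at them. The sketch below is only an illustration of that division of labor, not either team's actual algorithm. It assumes each chad has been reduced to two hypothetical edge profiles (lists of pixel intensities along its left and right edges), scores every ordered pair by how poorly the edges match, and keeps the few best candidates per chad for human review. The names `edge_mismatch` and `shortlist_pairs`, the sum-of-absolute-differences score, and the toy data are all assumptions made for the example.

```python
from itertools import permutations


def edge_mismatch(right_edge, left_edge):
    """Sum of absolute differences between two edge profiles (lower is better)."""
    return sum(abs(a - b) for a, b in zip(right_edge, left_edge))


def shortlist_pairs(chads, keep=3):
    """For each chad, keep the `keep` most plausible right-hand neighbors.

    `chads` maps a chad id to a (left_edge, right_edge) tuple of hypothetical
    pixel-intensity samples. Returns {chad_id: [neighbor_id, ...]}.
    """
    scores = {cid: [] for cid in chads}
    # Every ordered pair of distinct chads: n * (n - 1) comparisons.
    for a, b in permutations(chads, 2):
        scores[a].append((edge_mismatch(chads[a][1], chads[b][0]), b))
    # Sort candidates by mismatch and keep only the best few for a human to judge.
    return {cid: [b for _, b in sorted(cands)[:keep]]
            for cid, cands in scores.items()}


# Toy data (assumed): four chads with three intensity samples per edge.
chads = {
    "A": ([0, 0, 0], [9, 1, 9]),
    "B": ([9, 1, 8], [5, 5, 5]),
    "C": ([5, 5, 4], [0, 0, 0]),
    "D": ([2, 7, 2], [3, 3, 3]),
}
shortlist = shortlist_pairs(chads, keep=1)
print(shortlist)
```

With 6,000 chads the loop would score roughly 36 million ordered pairs, which is routine work for a machine; the short list it leaves behind is what a human reviewer would then accept or reject.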


Society | doi:10.1145/2240236.2240243 Samuel Greengard

Advertising Gets Personal


Online behavioral advertising and sophisticated data aggregation
have changed the face of advertising and put privacy in the crosshairs.

Not long ago, a man walked
into the local Target store in
Minneapolis and demanded
to speak to the manager.
He wanted to know why his
then-high school daughter was receiv-
ing coupons and promotions for ma-
ternity clothing, cribs, and other items
that would indicate she was pregnant.
“Are you trying to encourage her to get
pregnant?” he asked. The store man-
ager examined the stack of coupons
and promptly apologized. He said he
did not have any idea why the girl had
received the coupons. The man then
left for home.
This could have been the end of
the story. But after talking to his
daughter the man discovered she was
pregnant. Target had used sophisti-
cated predictive analytics to deter-
mine that her previous buying pat-
terns and behavior had indicated a Collusion, a Firefox add-on, lets a person see all the third-party entities tracking his or her
high probability of expecting a baby. movements across the Web.
In fact, Target and other stores have
become so good at gauging custom- ence, engineering, and public policy However, the last few years have
ers’ buying patterns they now dis- at Carnegie Mellon University. “There brought about a revolution in data
guise customer-specific promotions are clearly advantages to receiving rel- mining particularly as online and con-
by including coupons that are com- evant ads, but the Internet, combined ventional methods have allowed ad-
pletely irrelevant to the recipient. with today’s data-collection technol- vertisers to assemble and reassemble
Welcome to the new world of adver- ogy, poses serious privacy concerns. data in new ways. A growing number
tising. As statisticians, software devel- Unfortunately, most consumers feel of retailers, including Target, assign
opers, and advertising experts mine as though they have little control over each customer a unique ID number
and mix growing volumes of online what happens to their data and how it or guest code. It is associated with
and offline data and develop increas- is used by advertisers.” a credit or debit card, and an indi-
ingly complex algorithms, they are vidual’s purchase history is stored for
building new and remarkably sophis- By the Numbers analysis. This is separate from a loyal-
ticated advertising models designed Over the years, advertisers have strug- ty program, and the only way to avoid
to maximize results—and revenues. gled to better understand the whims of tracking is to pay with cash and avoid
“Technology is enabling new—and in the marketplace and target consumers giving a phone number or any other
some cases hyper-local and person- more effectively. Identifying market personal data. But the process does not
Screenshot courtesy of Mozilla.org/Collusion

alized—forms of advertising,” states niches and customer segments has stop there. Increasingly, retailers and
John Nicholson, counsel for the law been a daunting task and there has others plug in information from third-
firm Pillsbury Winthrop Shaw Pitt- been no easy way to deliver relevant party sources that track the same indi-
man. However, “there’s a fine line be- ads. The result? Most ads target broad vidual. This might include the person’s
tween what’s acceptable and what con- demographic segments through tele- Web browsing patterns, credit history,
stitutes an invasion of privacy.” vision, radio, newspapers, magazines, what magazines they read, and even
“Online behavioral advertising kiosks, billboards, shopping carts, and conversations they have had at social
methods are advancing at an incredi- other media. In many cases, advertis- media sites.
bly fast pace,” says Lorrie Faith Cranor, ers simply hope for positive results and The result is a fairly comprehensive
an associate professor of computer sci- learn by trial and error. picture of an individual’s buying hab-


its and consumption patterns. This Cloud Computing

New RaaS
profile—which could include anything
from the type of tea or liquor a person Analytics software
likes to consume to medical condi- has become
tions and sexual orientation—allows
marketers to customize ads, but it also so sophisticated Pricing
offers deep insights into life events and
changes. For example, when a woman
it is now possible to
estimate a pregnant
Model
begins buying vitamin supplements,
larger quantities of skin lotion, hand customer’s delivery The resources behind cloud
computing services will soon be
sanitizers, and a larger purse or bag
there is an extremely high likelihood window within sold in increments of seconds,
according to researchers from
she is pregnant. In addition, analytics a few weeks. Technion-Israel Institute of
Technology.
software has become so sophisticated
Providers of cloud
it is possible to estimate the delivery computing have moved from
window within a few weeks. renting servers on a monthly
Of course, using data to predict life basis to renting virtual “server
equivalents” for as little as an
events could have far-reaching conse- hour at a time. But even that is
quences, particularly if family, friends, and advertising networks sell this data inefficient, say the researchers,
or a prospective employer become to other companies, including data who presented a paper, “The
aware of a sensitive lifestyle or medi- aggregators. Google, meanwhile, col- Resource-as-a-Service (RaaS)
Cloud,” at the recent USENIX
cal issue, such as an affinity for nude lects data from searches and through Hotcloud ‘12 conference in
beaches or a diagnosis of HIV. Worse, keywords in Gmail and YouTube while Boston. Providers are moving
the data may contain errors and pres- Facebook has unlimited access to the toward pricing individual
resources, such as memory,
ent an inaccurate picture that could mother lode of information and mes- within a virtual machine, and
lead to an employer refusing to hire the sages that appear on its site. Finally, changing prices in intervals
person or the loss of a job. As a result, Twitter recently sold its multibillion of seconds, based on shifting
advertisers are attempting to get smart- tweet archive to a U.K. firm that report- demand. That trend from
infrastructure-as-a-service to
er—some would say sneakier—in the edly has more than 1,000 companies resource-as-a-service can save
way they deliver ads. Increasingly, they lined up for the data. buyers money, earn more for
are including coupons and ads that Today’s data-collection system is providers, and make efficient
use of hardware and energy.
are completely random or irrelevant in largely based on an opt-out model that
“Clients don’t need to buy
order to appear as though they are not is nearly impossible to understand things they don’t need, hosts
spying over a person’s shoulder. or manage, many privacy advocates don’t need to sell them things
Joseph Turian, president of consult- contend. Consumers face the daunt- they don’t need, and hosts can
accommodate more clients on a
ing firm MetaOptimize, says that as or- ing task of trying to decipher lengthy server,” says Orna Agmon Ben-
ganizations learn to use analytics and and convoluted privacy policies that in Yehuda, a doctoral student and
cultivate big data, insights that would some cases do not match actual prac- co-author of the paper.
have been unimaginable only a few tices, Cranor says. What is more, data Clients would use an
“economic agent” that makes
years ago are moving into the main- collection firms often rely on loopholes split-second decisions on
stream. There are clear advantages for and devious methods to circumvent how much to spend on which
consumers—particularly those look- cookie-blocking tools built into Web resources, and hosts would
allocate resources based on
ing for discounts and deals—but ad- browsers and privacy tools such as how much a client was willing
vertisers need to avoid stepping over Ghostery. In the end, users’ attempts to pay. Cloud service providers’
the line. “People like the idea of per- to control tracking and personal data software could also incorporate
sonalized searches and advertising,” often end up resembling a game of economic agents to represent
their own business interests.
says Turian. “Many already provide Whac-A-Mole, Nicholson says. Coauthor and doctoral
data willingly for discounts through re- In fact, half of all Internet users student Muli Ben-Yehuda says
wards programs. But they want to be in recall the ads they view but only 12% the trend demands a lot of
cloud computing researchers.
control of their destiny.” correctly remember the disclosure
Software, for instance, will have
tag-lines attached to ads, Cranor re- to adapt to use an ever-shifting
Cookies, Tweets, and Dollars ports. When she studied usage pat- set of resources, and workloads
What makes the emerging field of data terns she found that the majority of will need to be carefully
balanced. The challenge, he
aggregation and analytics possible participants mistakenly believe that says, is how to turn computing
is a spate of online data-collection ads pop up if they click on disclosure into a commodity. “How do you
techniques that revolve around IP ad- icons and taglines. AdChoices, the ta- make computing something like
dresses, third-party cookies, and Web gline most commonly used by online electricity? It’s there whenever
you want it, you can have as
tools that track consumers as they click advertisers (it discloses sites’ advertis- much as you need, and the price
through Web sites and interact online. ing methods and allows consumers to is set by the market.”
Internet service providers, Web sites, click a button and opt out), was par- —Neil Savage


ticularly ineffective at communicating of data collected, how it could be used, in a currency we don’t understand be-
notice and choice. Nearly half of the and how long it could be stored. cause most people don’t recognize the
participants who saw AdChoices be- Nicholson, who writes privacy poli- actual value of personal information,”
lieved it was intended to sell advertis- cies for businesses, believes a funda- Nicholson explains.
ing space, while a mere 27% believed mental problem lies in overly complex Cranor says that, in the end, a bal-
it was a means to stop tailored ads. “A and incomprehensible privacy policies, ance must exist between today’s rapidly
majority of participants mistakenly as well as the way data is collected and advancing data-aggregation methods
believed that opting out would stop all used in the U.S. compared to Europe and and increasingly elusive privacy. Al-
online tracking, not just tailored ads,” other parts of the world. “The U.S. has though a strict opt-in model would al-
she notes. treated personal information as more most certainly prove too unwieldy and
Critics believe the inability to con- of a sales transaction and said that busi- annoying for consumers, “the entire
trol what software and tracking mech- nesses can do what they want with it,” process must be more transparent,”
anisms are placed on a person’s com- he explains. “Europeans use more of a says Cranor. “People must understand
puter is nothing less than a violation. licensing model that focuses on the per- what is happening with their data and
Many Web sites contain a half-dozen to son owning their data and a business what choices they are actually making.
a dozen or more tracking tools or third- renting it for a specific use.” In fact, the Only then can we have a system that
party cookies. It is akin to a company European Data Protection Directive and works well for everyone.”
installing video cameras and micro- a newer Data Protection Regulation set
phones in a home and recording ev- tight controls over how data can be col-
Further Reading
erything that occurs in the household. lected, stored, and used. It also includes
“When people find out what is really provisions for notifying consumers and Leon, P.G., et al.
happening, the typical response is ‘Are obtaining their consent. What do behavioral advertising disclosures
communicate to users? Carnegie Mellon
you kidding!’” says Marcella Wilson, an Nicholson says some privacy advo- University CyLab, April 2, 2012.
adjunct professor of computer science cates now support a system modeled
Goldfarb, A. and Tucker, C.
at the University of Maryland, Balti- after Canada’s Privacy by Design ini-
Privacy regulation and online advertising,
more County. tiative, which aims to embed privacy Social Science Research Network, August
protection into new technology and 4, 2010.
Privacy Matters business processes by default. The CNN
In February, U.S. President Obama un- underlying goal is for consumers to Joel Stein talks data mining and personal
veiled a Consumer Privacy Bill of Rights choose the data they make available privacy on CNN American Morning, http://
as part of a comprehensive blueprint to to companies. With Privacy by Design, www.youtube.com/watch?v=rYiGT4EuE9w,
March 10, 2011.
expand privacy protections while con- data aggregators would cull only the
tinuing to make the Internet a hub of data they need and have permission to Tucker, C.E.
innovation and economic growth. The use, keep it for only as long as it is im- The economics of advertising and privacy,
International Journal of Industrial
measure attempts to force companies mediately valuable, and then purge the Organization 30, 3, May 2012.
such as Google, Microsoft, and Yahoo! data, he explains.
Turow, J.
to stop monitoring when a person Others have floated the idea of cre- The Daily You: How the New Advertising
clicks a Do Not Track button on their ating information exchanges that pay Industry Is Defining Your Identity and Your
Web browser. Do Not Track is intended consumers for the use of their data. Es- Worth, Yale University Press, New Haven,
to supersede a decade-old voluntary sentially, an individual would manage CT, 2012.
industry initiative called P3P, which his or her profile and decide who can
has produced tepid results and proved purchase the data, what purposes they Samuel Greengard is an author and journalist based in
West Linn, OR.
unenforceable. It was designed to offer can use it for, and for how long. “The
consumers some control over the type reality is that we’re currently paying © 2012 ACM 0001-0782/12/08 $15.00

Security

Target Cybercriminals, Urges Report


Governments should spend less and a team of six researchers the U.K. Ministry of Defense, the actual losses. Anderson
money on defensive measures, from Germany, the Netherlands, the report provided estimates of noted the example of a botnet
such as antivirus software and the U.K., and the U.S. the direct costs, indirect costs, that was responsible for a third
firewalls, and more money on The paper, which the and defense costs of different of the world’s spam in 2010.
“hunting down cybercriminals researchers believe is the first types of cybercrime in the U.K. The botnet is estimated to
and throwing them in jail,” systematic report on the costs and the world. It found that have earned its owners about
according to “Measuring the of cybercrime, was presented for Internet-based crimes like U.S. $2.7 million whereas
Cost of Cybercrime,” a research at the recent 11th Annual phishing, spam, online scams, the worldwide costs of spam
paper by Ross Anderson, a Workshop on the Economics of hacking, and denial-of-service prevention probably exceeded
professor of security engineering Information Security. attacks, the costs of defense U.S. $1 billion.
at the University of Cambridge, Developed at the request of are many times higher than —Jack Rosenberger


Membership | doi:10.1145/2240236.2240244 Karen A. Frenkel

Broader Horizons
ACM’s Committee for Women in Computing (ACM-W) is widening
its reach to involve women in industry as well as academia, including
community college faculty and students.

ACM-W, which has tra- a priority because they do not count
ditionally been dedicated toward tenure. Therefore, ACM-W will
to supporting ACM’s wom- consider how to support networking
en members, now has new opportunities for women who attend
leadership and a broader technical and professional confer-
mission, with Valerie Barr, chair of ences, possibly by arranging birds-of-a-
the computer science department feather meetings.
at Union College, succeeding Elaine In terms of conference scholar-
Weyuker, an AT&T Fellow at Bell Labs. ships, ACM-W has received $20,000
The pair previously served as co-chairs in funding from both Wipro Tech-
for six months. nologies and Microsoft Research. The
“ACM-W has focused on support- Wipro scholarships will provide each
ing women in computing, and has had awardee $600 and $1,200 for intra- and
particular focus on and responsibil- intercontinental conferences, respec-
ity to women members of ACM,” says tively, for registration and attendance
Barr. “While it is important to make fees. ACM-W plans to give at least 10
sure that women of achievement are intra- and 10 intercontinental awards
recognized within the women in com- each year for the next five years. The
puting community, we want to be sure Microsoft Research scholarships will
that women are supported, celebrated, enable European women to attend
and advocated for within the general conferences, and recipients will re-
computing community. But now we are ceive the same amounts for intra- and
ready to also take on responsibilities to “We want to intercontinental conference expenses
ensure all ACM members are aware of be sure that women as the Wipro awardees. ACM-W expects
the role of women in computing, and to to award about 15 Microsoft Research
help ACM achieve organizational goals are supported, scholarships this year.
for growth and international reach celebrated, and “We see the conference scholarship
through our work with and for women.” program very much as a pipeline pro-
ACM-W needs to reach the 12% of advocated for gram for young women who make a
ACM’s 104,000 members who are fe- within the general credible case that going to a conference
male and make sure they know about will help them make decisions about
ACM programs, like professional de- computing what they’re doing in CS,” says Barr.
velopment, and non-ACM programs, community,” “We’re trying to encourage B.A.s to go on
like the Computer Research Associa- to graduate school, M.S.s to get Ph.Ds.,
tion-Women’s mentoring workshops, says Valerie Barr. and early Ph.D. candidates to focus on a
and special programs for women at research topic and stick with it.”
conferences like the Grace Hopper Cel- Barr wants ACM-W’s activities to
ebration of Women in Computing. be more prominent because it will in-
To attract women who are not ACM crease the visibility of women in com-
members, ACM-W will reach out to fac- puting. Barr also wants the rest of the
ulty and students at community colleg- In addition, ACM-W aims to create ACM community to take better notice
es. Women in computing at commu- new programs that promote informal of the contributions women make to
nity colleges are a large and untapped interaction between women at confer- the field. “We’ve done a good job tell-
market for ACM, says Barr, who notes ences so experienced computer scien- ing women about women,” she says,
that community colleges are a less ex- tists can meet and advise newcomers. “but now we need to tell everybody else
pensive way for women to return to the Unfortunately, few opportunities exist about women.”
work force. for women at conferences to talk infor-
ACM-W is also reaching out to stu- mally, says Barr. Moreover, young fac- Based in Manhattan, Karen A. Frenkel is a freelance
writer and editor specializing in science and technology.
dents and faculty by awarding scholar- ulty do not make attending non-tech-
ships so they can attend conferences. nical, non-peer reviewed conferences © 2012 ACM 0001-0782/12/08 $15.00

viewpoints

doi:10.1145/2240236.2240245 Paul Tjia

Emerging Markets
Inside the Hermit Kingdom:
IT and Outsourcing in North Korea
A unique perspective on an evolving technology sector.

The outside world’s view of emerged, which fall into three main its export work, KCC’s main focus has
North Korea ranges from the types, all state owned in whole or part. been the local market and it develops
fear of nuclear demagoguery, There are a number of large special- various products, such as Red Star (the
through tales of economic ist IT service providers who work both North Korean version of Linux), e-learn-
difficulties, to the fun pok- for local and overseas clients. The big- ing products, and translation software.
ing of the film Team America. Behind gest of these is the Korea Computer KCC also produces games; their version
these and many other—almost univer- Center (KCC). Established in 1990, it of Go, a popular Asian chess-like game,
sally negative—projected images of the has more than 1,000 employees. Like has won the Go computer games world
country, there is another side. Some- almost all IT firms, it is headquartered championship for several years. Simi-
what unexpectedly, the Democratic in the capital, Pyongyang, but also has lar, but smaller corporations such as
People’s Republic of Korea—to use its regional branches throughout the the Pyongyang Informatics Centre and
official title—has a sizeable IT sector. country and offices overseas that en- Korea Pioneer Technology are employ-
Some 10,000 professionals work in the able it to work for clients in Europe, ing hundreds of staff.
field, and many more have IT degrees. China, South Korea, and Japan. Despite Some IT firms have developed
They are already engaged in outsourc- from the internal IT departments of
ing contracts for other countries, and large commercial enterprises, such
keen to expand further. Having access to as Unha Corporation or Korea Roksan
General Trading Company. As the
North Korea’s IT Sector a pool of highly country’s IT sector has become more
The origins of the local IT sector can technically skilled dynamic, some of these are being
be traced back to the 1980s, with the spun off as separate ventures, allow-
establishment of various IT research labor is a key rationale ing the IT firms to take on a broader
organizations and the creation of IT behind the growth scope of clients and IT service ac-
faculties in higher education institu- tivities. Third, there are a number of
tions such as Kim Il Sung University of IT outsourcing joint venture IT firms. These include
and Kim Chaek University of Technol- in North Korea. Nosotek (set up with a German en-
ogy (the latter having an international trepreneur) and Hana Electronics
collaborative research program with (which involves U.K. investment),
Syracuse University). Over time, sev- plus several joint ventures with Chi-
eral hundred IT “corporations” have nese business partners.


An IT company in Pyongyang, North Korea. Pyongyang has become a base for IT offshoring.

Photograph by Paul Tjia.

Historically, North Korea’s IT infrastructure has been behind the curve but in recent years the country has been investing in order to catch up. In a typical IT firm, one can therefore see staff using state-of-the-art technology, mostly imported (for example, hardware from Dell, running the latest versions of Windows or Unix).

Internet access is still heavily regulated but a recently laid optical fiber backbone connecting all cities and counties provides for a high-speed network, “Kwangmyong,” utilizing a fairly sophisticated architecture that forms what can be seen as a “national intranet.” A nationwide 3G mobile telecommunications network was completed in 2009. Mobile operator Koryolink (a joint venture with Orascom Telecom from Egypt) has more than one million subscribers.

Of course, the core of any IT sector is its human resources. As noted earlier, some 10,000 professionals already have IT careers, and thousands of North Korean youngsters graduate with IT degrees each year, thus creating a large pool of technically qualified staff. In the corporations I have visited, there are always significant proportions of M.S.- and Ph.D.-qualified software engineers, and a surprising number have participated in training courses abroad: most often in China and India, but also in Europe. These organizations are already well versed in quality assurance methods including ISO9001 and CMMI.

An example of the typical skills profile is shown in the accompanying table, which gives the expertise list for Daeyong Corporation, a relatively small start-up based in Pyongyang that is involved in Web and mobile applications development (including Android and BlackBerry) for foreign clients.

IT Outsourcing to North Korea
Having access to a pool of highly technically skilled labor is a key rationale behind the growth of IT outsourcing to North Korea. But such labor is available in many parts of Asia. A unique selling point for North Korea is cost. Tariffs asked can be less than U.S. $10 per hour, enabling clients to employ experienced software engineers for just a few hundred dollars per month. This means North Korea undercuts not just India but also later-emerging software locations like China or the Philippines. The country’s highly regulated economy is also an advantage in reducing attrition rates: the type of “job-hopping” that can bedevil Indian contracts is pretty much unknown in North Korea.

Some of the outsourcing to Pyongyang-based firms is quite general. At the fairly low-skill end is basic digitization. For example, Dakor, a Swiss joint venture, is conducting data entry work for European research firms and international organizations like the United Nations and the Red Cross. Work involving more skill includes producing Web sites for U.S. and European customers (though in most cases, the end client is unaware of North Korean involvement since this is subcontracted by an intermediary that deals directly with the client). Examples of projects involving more advanced skills include North Korean firms providing programming inputs for enterprise resource planning systems, business process management systems, and e-business applications. One corporation is building a bank management system (based on the tenets of Islamic banking) for a client located in the Middle East.


But there are also emerging specialisms within North Korea’s IT export sector. Perhaps not surprisingly, one of these is IT security. North Korea might have an image (warranted or not) of encouraging cyber attacks, but it has invested a lot in technology and expertise to thwart such attacks on its own systems and, more generally, in security. Fingerprint identification products used for access control (and time attendance) have already been exported, and there are other products developed in areas of car license plate identification, and voice/face recognition.

IT security, such as facial recognition, is an important export commodity for North Korea.

On the lighter side, film has been one of the main forms of state-supported entertainment in the country since the formal division of the Korean peninsula into two states in the 1950s. From this foundation has developed an export production capacity for high-quality cartoons and animation. The specialized state corporation SEK Studio, established in 1957, has more than 1,600 employees, and works for several European film production studios. Other firms have worked on 2D and 3D animation contracts, and this is starting to expand into areas of related capability such as computer graphics and games exports for Wii, iPhone, BlackBerry and other platforms.

Developers working on a graphics design project.

North Korean firms produce and export high-quality cartoons and animation work.

Photographs by Paul Tjia.

Finally, North Korean IT corporations have developed a set of language skills. English is quite widely used but specialisms have developed in Chinese and Japanese alongside (of course) Korean. North Korea is therefore being used as a base for those who wish to have software products or systems translated into East Asian languages. At present, much of this work is nearshoring; that is, coming from clients based in China, Japan, or South Korea. But the potential is there for a wide range of customers who are targeting East Asian markets, which are considered especially relevant to small- and medium-sized clients.

The Challenges and Future of IT Outsourcing
North Korea’s IT sector faces a number of challenges, despite its government’s eagerness to promote international collaboration. First, and perhaps largest, is the challenge of perceptions. North Korea confronts a difficult mix


of ignorance and suspicion. It is not well known yet as an outsourcing destination, particularly outside neighboring countries. All the local IT firms have a Web site, but in most cases it is only visible on the national intranet, not externally. This is particularly difficult when trying to compete for attention with firms in other emerging Asian IT locations like Vietnam or Bangladesh, which often have a strong Web presence.

Suspicion of all things North Korean and a “guilt by association” that fails to differentiate the political from the commercial creates a serious barrier for some clients. Perhaps as a result, Europe (where historical relations with the two Koreas are different) may be a more feasible source of partners than the U.S. Certainly the overhanging geopolitical aspects must be acknowledged: there are political tensions, the Korean War has never officially ended, and there are U.S.-imposed sanctions.

Alongside these factors are more mundane barriers: potential clients in the West lack contact lists or even basic information about North Korea. In practice, the country is quite easy to visit, but it is time consuming to obtain a visa, and Pyongyang has few direct international flights and, as noted, limited Internet connectivity. To help circumvent these difficulties, some North Korean IT firms operate offices abroad, such as in China (there are examples in Beijing, Dandong, and Dalian, all staffed by Korean software engineers), through which outsourcing contracts can be undertaken.

Because of the challenges, those who have set up outsourcing arrangements typically do so by working gradually up a “trust and knowledge curve.” This begins with a visit by the client to Pyongyang, holding discussions with potential partner organizations until a suitable one is identified. The next step may be a set of staff exchange visits, a “getting to know you” type of familiarization activity that allows Western clients to understand what North Korea has to offer, and allows North Korean managers to understand what the client needs. Then some small pilot projects are contracted. These may well begin with North Korean staff being brought over to the client site to complete some tightly specified and non-mission-critical IT tasks. Only once this onsite experience has been worked through will the clients move to a more offshore mode of working with their North Korean partners.

Does some of this sound familiar? It should. This is where India was approximately 30 years ago: a seemingly obscure location for IT work, accompanied by images of animal-drawn carts and corrupt bureaucrats. A few brave U.S. firms stuck a toe in the water with some small onsite “body shopping” projects and built up slowly from there. Today, India’s software industry is a multibillion-dollar juggernaut on which Fortune 500 and other firms rely for portions of their IT services.

North Korea is not going to follow the same meteoric IT trajectory as India: the challenges it faces are different and more serious. But nor is it a hypothetical or unthinkable destination. Pyongyang already is a base for IT offshoring, the country does have some unique selling points, and is serious about its aspiration to grow this area. You cannot escape the broader geopolitics when dealing with North Korea but, for those who believe nations are reformed by engagement rather than conflict, building IT relationships could be part of a broader path to more general rapprochement that also helps access a low-cost, high-skill talent pool.

Example of a North Korean IT corporation skill set.

Major Skills
    ANSI C/C++, VC++6, VB6
    VC++.NET/CLR, C#, VB.NET, ASP.NET, .NET
    DotNetNuke, Silverlight, Telerik Sitefinity
    Java, J2EE, Spring, Struts, Hibernate, Swing, JUNIT, ANT
    JBoss, Apache Tomcat
    PHP4/5, Joomla, Wordpress, Magento, Drupal
    Zend Framework, CodeIgniter, CakePHP, Yii, Smarty
    MySQL, MSSQL, MS Access, PostgreSQL, Oracle
    Shell script, JavaScript, VBScript
    jQuery, extjs, dhtmlx suite
    HTML4.01, HTML5, XHTML, CSS2.0, CSS3.0
    XML, XSLT, Web Services
    Flash, Flash AS3, Flex

Experienced Techniques
    phpNuke, phpBB, Symfony
    Axis2, Velocity, POI, JSF
    Microsoft Speech API
    Google Android Phone, iPhone
    J2ME, Brew, Perl

Specialized IT Areas
    Web Applications, E-Commerce, Blogs, CRM, Social Networking
    General Windows utilities, Windows/Linux Networking
    Reverse Engineering
    Security and Cryptography

Languages
    Korean, English, Japanese, Chinese

Further Reading

Carmel, E. and Tjia, P. Offshoring Information Technology. Cambridge University Press, 2009.

Hayes, P., Bruce, S., and Mardon, D. North Korea’s Digital Transformation, Nautilus Institute, 2011; http://nautilus.org/napsnet/napsnet-policy-forum/north-koreas-digital-transformation-implications-for-north-korea-policy/.

Korea Computer Center; http://naenara.com.kp/en/kcc/.

Mansourov, A.Y. North Korea on the Cusp of Digital Transformation, Nautilus Institute, 2011; http://www.nautilus.org/wp-content/uploads/2011/12/DPRK_Digital_Transformation.pdf.

North Korea Tech Web portal; http://www.northkoreatech.org/.

Park, S.-J. Software and animation in North Korea. In Europe–North Korea. Between Humanitarianism and Business?, M. Park, B. Seliger, S.-J. Park, Eds., LIT Verlag, 2010.

Paul Tjia (paul@gpic.nl) is the founder of GPI Consultancy, an independent Dutch consultancy firm in the field of global IT sourcing that organizes study tours to countries such as North Korea and offers support on offshoring strategy, country and partner selection, offshore transition, and cultural training.

Copyright held by author.


doi:10.1145/2240236.2240246 Fred G. Martin

Education
Will Massive Open Online Courses
Change How We Teach?
Sharing recent experiences with an online course.

In the last decade, the Creative Commons philosophy of freely sharing information and the pervasiveness of the Internet have created many new opportunities for teaching and learning. MIT OpenCourseWare spearheaded the sharing of high-quality, university-level courses. While these materials were not originally designed for individuals engaged in self-study, approximately half of OCW’s traffic is from these users [6]. Recently the use of learning management systems (LMSs), such as the proprietary Blackboard or open-source Moodle software, has become ubiquitous.

In typical use, LMSs support the structure of conventional courses in an online setting. Lectures and reading material are assigned, homework is scheduled, and discussions are facilitated at regular intervals. As in conventional coursework, classes are usually closed communities, with students registering (and paying) for credit-bearing coursework.

One of the first initiatives to bring together open course philosophy and LMSs was David Cormier’s “Massive Open Online Course,” or MOOC. In his vision, “Although it may share in some of the conventions of an ordinary course, such as a predefined timeline and weekly topics for consideration, a MOOC generally carries no fees, no prerequisites other than Internet access and interest, no predefined expectations for participation, and no formal accreditation.” [9]

This idea, albeit in a more conventional course structure, exploded into the public consciousness with the massive open artificial intelligence (AI) course developed and conducted by Stanford faculty Sebastian Thrun and Peter Norvig last fall. Announced in the summer of 2011, the course received wide publicity, and attracted about 160,000 registered students by its launch in October 2011. Approximately 23,000 students completed the 10-week course [8]. I was one of the 23,000, along with a cohort of 16 students, both graduate and undergraduate, from my home institution (the computer science department at the University of Massachusetts Lowell).

Since the fall 2011 AI course, there has been much activity in this space. Thrun has set up a for-profit company, Udacity, to extend his initial work; Stanford and others are running courses using Coursera; MIT created MITx and is partnering with Harvard on edX; and there are other initiatives [4]. The remainder of this column describes my experiences taking the fall 2011 course alongside my students and facilitating their learning. This is followed by some reflections. It seems likely this new breed of MOOCs will have an impact on education at the university level, particularly for technical majors such as computer science.

The Fall 2011 Stanford AI-Class.com
The Stanford course consisted of weekly lectures: two or three 45-minute topics that were broken up into 15 or 20 short videos. Most of the individual videos had embedded questions (multiple-choice or fill-in-the-value). At the end of each mini-lesson, the video image transformed into a Web form where students filled in answers. Because students were already logged in, the class server graded their answers immediately. After submitting, students were shown an explanation video.

The lectures themselves were inspired by Khan Academy’s casual, teacher-sitting-by-your-side approach. Occasionally, Thrun and Norvig trained the camera at themselves, but the core content was conducted with the camera aimed at a sheet of paper, with Thrun or Norvig talking and writing in real time. I found this format relaxing and engaging; I liked seeing equations written out with verbal narration in sync.

There were weekly problem sets with the same format. These homeworks tracked the lecture material closely. The course included a midterm and a final with the same format as the homework. If you worked through and understood the problems embedded in the lectures, the homework assignments were straightforward. The homework, midterm, and final each had hard deadlines. The server backend kept track of students’ scores on the homework assignments, the midterm, and the final, which all counted toward the student’s “grade,” a ranking within the active student cohort.


Frames from David Cormier’s YouTube video “What is a MOOC?”

On the Course
Thrun and Norvig were strong teachers. They thought through excellent ways of explaining the ideas and quizzing the in-lecture comprehension checks. They often brought fun props or showed research projects in the video recordings.

Thrun and Norvig were only a week or so ahead of the course delivery, and they paid close attention to students’ progress. There was a lot of activity on the Web forums. They recorded several “office hours,” where students submitted questions and voted on their favorite ones, and then they picked questions and answered them on camera. In this way, the course was like a typical class; it was not “canned.” Thrun’s and Norvig’s enthusiasm was infectious. Collectively, the real-time nature of the experience made it a lot like a well-taught conventional course.

My Role at UMass Lowell
Students registered for my department’s regular AI course, which requires a project. They knew when signing up that I expected them to complete both the Stanford course and a directed project. As mentioned earlier, I had 16 students. We met once weekly for a 75-minute session in a roundtable format. We talked about the Stanford material after each week’s assignment was already due. Because of this, I did not have to present the course material in a lecture format. When we met, most of my students had worked through the lectures and the homework. So I did not have to explain things to students for the first time. Instead, we used in-class time for conversations about material that people found confusing or disagreed upon. We had some great discussions over the course of the semester.

A similar approach was developed by Day and Foley in their HCI course at Georgia Tech [2]. They recorded Web lectures, and then used classroom time for hands-on learning activities. Daphne Koller, a colleague of Thrun’s at Stanford (and co-founder of Coursera), has called this “the flipped classroom.” She reported higher-than-usual attendance in her Stanford courses taught this way: “We can focus precious classroom time on more interactive problem-solving activities that achieve deeper understanding, and foster creativity.” [7]

What Does it Mean?
The success of the fall AI course and the bloom of new ones this spring and summer puts real pressure on conventional, lecture-and-test university instruction. Thrun is quoted in several reports as noting that attendance at his face-to-face AI course at Stanford in the fall dropped precipitously. Of the 200 registered, only 30 continued to attend after a few weeks [8]. But this really speaks to the failure to have the in-person time deliver anything different from a lecture. In many ways, the carefully crafted online lectures, peppered with probing questions that are autograded for correctness and then explained further, are indeed an improvement over a conventional lecture.

There are many initiatives to improve the quality of face-to-face time in lectures. When used creatively, clickers can be a valuable modification. But more fundamentally, active learning approaches hold much more promise.

Robert Beichner has developed a classroom approach called “SCALE-UP” for active learning in the classroom. His work started in physics education, but years of development and collaboration broadened it to many fields, including the sciences, engineering, and the humanities [5]. In SCALE-UP, faculty engage students in structured activities and problem-solving during classroom time. Students work in teams of three, and faculty mingle with them, engaging them in discussions. (The SCALE-UP acronym has had several meanings, including “Student-Centered Active Learning Environment for Undergraduate Programs.”)

Eric Mazur, also from the physics education community, has developed a related approach that he calls “peer instruction,” in which students work in small groups to answer questions posed in lectures. Like Beichner, Mazur is active in disseminating this method.

However, dissemination and adoption are big challenges. These approaches require substantial new development of the problems and activities with which students are to be engaged in the classroom, and teachers must give up their carefully crafted lecture presentations.

Also, teachers need to be protected from low student evaluation scores. Mazur and others have reported that students give lower evaluations in courses with active learning, even when the evidence shows they have learned more [1,3]. Students have grown up with conventional lecture teaching, and just like anyone else, they are resistant to change.

Beyond this, faculty must participate in these active learning approaches as learners, so they understand how to facilitate them as instructors. In my case, in my graduate training I learned how discussion-oriented seminar courses are conducted, so it was natural for me to facilitate the same with my small group of “flipped classroom” AI students.

Conclusion
When Thrun was promoting the fall 2011 online AI course, his Twitter feed included some bold claims:

@aiclass: “Advanced students will complete the same homework and exams as Stanford students. So the courses will be equal in rigor.” (September 28, 2011)

The fall 2011 course for matriculated Stanford students included programming assignments, and the online one did not. This was a clear shortcoming. But the new Udacity courses include programming. Most of my students got a lot out of the fall Stanford course, and our weekly discussion sections made a difference. But the weaker students struggled, and a few strong students were bored.

This makes me wonder about the large-scale applicability of the MOOC format. We need to be able to support students who are still learning how to learn, and also challenge our best students. The MOOC concept does not even attempt to address the role of a small, research-oriented project-based course. When we individually mentor each student on his or her own ideas, we are doing something that can never be performed by an autograder.

Part of the excitement around MOOCs is about their potential to change education’s cost equation: put a great course online once, and run it unattended many times. But part of the fun of the fall AI course was that Thrun and Norvig were right there with us, and that we were a large cohort of students there with them.

Thrun also asserted:

@aiclass: “Amazing we can probably offer a Master’s degree of Stanford quality for FREE. HOW COOL IS THAT?” (September 23, 2011)

As we know, the modern university is a much larger ecosystem than its collection of courses. Among many other things, students derive great value from being in close contact with their peers, participating in leadership opportunities across campus, and being part of our research labs. It may well be that this new breed of MOOC is a decent replacement for an average, large-sized lecture course. But this is a low bar.

In our drive for efficiency, we must aspire to higher than this. At least, we can use MOOCs to create a successful flipped classroom. We can use our “precious classroom time” for meaningful conversations. As Mazur and Beichner have demonstrated, this can be done even in large lectures by having students work in small groups.

At best, we can really add value, by being teachers. We can individually debug students’ thinking, mentor them in project work, and honestly be enthusiastic when they excel.

References
1. Crouch, C.H. and Mazur, E. Peer instruction: Ten years of experience and results. Am. J. Phys. 69 (Sept. 2001).
2. Day, J. and Foley, J. Evaluating a Web lecture intervention in a human-computer interaction course. IEEE Transactions on Education 49, 4 (Nov. 2006).
3. Fagen, A.P., Crouch, C.H., and Mazur, E. Peer instruction: Results from a range of classrooms. The Physics Teacher 40 (2002).
4. Fox, A. and Patterson, D. Crossing the software education chasm. Commun. ACM 55, 5 (May 2012), 44–49.
5. Gaffney, J.D.H. et al. Scaling up education reform. Journal of College Science Teaching 37, 5 (May/June 2008), 48–53.
6. Giving Knowledge for Free: The Emergence of Open Educational Resources, Organisation for Economic Co-operation and Development, 2007.
7. Koller, D. Death knell for the lecture: Technology as a passport to personalized education. New York Times (Dec. 5, 2011).
8. Lewin, T. Instruction for masses knocks down campus walls. New York Times (Mar. 4, 2012).
9. McAuley, A., Stewart, B., Siemens, G., and Cormier, D. The MOOC Model for Digital Practice. 2010; http://www.elearnspace.org/Articles/MOOC_Final.pdf.

Fred G. Martin (fredm@cs.uml.edu) is an associate professor of computer science and associate dean in the College of Sciences at the University of Massachusetts Lowell.

Copyright held by author.


doi:10.1145/2240236.2240247 danah boyd

Privacy and Security


The Politics of
“Real Names”
Power, context, and control in networked publics.

Illustration by Brian Greenberg/Andrij Borys Associates.

Facebook’s terms of service explicitly require its users to provide their “real names and information.” Indeed, the norm among many Facebook users is to provide a first and last name that appears to be genuine. Thus, when Google+ launched in the summer of 2011, it tried to emulate Facebook by requiring that new users provide similar credentials. Many early adopters responded by providing commonly used nicknames, pseudonyms, and stage names. Google, determined to ensure compliance, began expelling people who did not abide by the “real names” requirements. They ejected high-profile geeks, including Limor Fried and Blake Ross, for failing to use their real names; they threatened to eject Violet Blue, a well-known sex educator and columnist.

The digerati responded with outrage, angry with Google for its totalitarian approach. The “nymwars,” as they were called, triggered a passionate debate among bloggers and journalists about the very essence of anonymity and pseudonymity [2,3,6]. Under pressure, Google relented, restoring users’ accounts and trying to calm the storm without apologizing. Meanwhile, Google’s chairman Eric Schmidt publicly explained the “real names” policy is important because Google Plus is intended to be an identity service [1].

While the furor over “real names” has subsided (and Google now supports pseudonymity), key questions about the role of identity, privacy, and control remain. Why did people respond with outrage over Google while accepting Facebook? Do “real names” policies actually encourage the social dynamics that people assume they engender? Why did people talk about “real names” policies as a privacy issue? This column explores these issues, highlighting the challenges involved in designing sociotechnical systems.

Facebook vs. Google: Norms, Values, and Enforcement
At Harvard, Facebook’s launch signaled a safe, intimate alternative to the popular social network sites. People provided their names because they saw the site as an extension of campus life. As the site’s popularity grew, new users adopted the norms and practices of early adopters. The Facebook norms were seen as operating in stark contrast to MySpace, where people commonly used pseudonyms to address concerns about safety. Unlike MySpace, Facebook appeared secure and private.

As Facebook spread beyond college campuses, not all new users embraced the “real names” norm. During the course of my research, I found that late-teen adopters were far less likely to use their given name. Yet, although Facebook required compliance, it tended not to actively (or at least, publicly) enforce its policy.

Today, part of Facebook’s astronomical value stems from the quality and quantity of information it has about its users. While it is unlikely that Mark Zuckerberg designed Facebook to be an identity service, that is what it has become. Google’s competitive move is explicitly an identity play. Yet, rather than creating a value proposition in which users would naturally share their real names, they made it a requirement.

Larry Lessig argues that four forces regulate social systems: the market, the law, social norms, and technology or architecture [4]. Social norms drove the “real names” culture of Facebook, but Google’s approach was purely driven by the market and reinforced by corporate policies and technology. Their failure to create the conditions in which newcomers felt comfortable sharing their names, and their choice to restrict commonly used pseudonyms, resulted in a backlash. Rather than designing an ecosystem in which social norms worked in their favor, their choice to punish dissidents undermined any goodwill that early adopters had toward the service.

Implicit Assumptions about “Real Names”
Although some companies implement “real names” policies for business reasons, many designers believe such policies are necessary to encourage healthy interactions in online communities. Implicit is the notion that in “real life,” people have to use their “real names” so why shouldn’t they be required to do so online? Yet, how people use names in unmediated interactions is by no means similar to what happens online [5].

When someone walks into a cafe, they do reveal certain aspects of themselves while obfuscating other aspects of their identity. Through their bodies, they disclose information about their gender, age, and race. Through fashion and body language, people convey information about their sexuality, socioeconomic status, religion, ethnicity, and tastes. This information is not always precise and, throughout history, people have gone a long way to obscure what is revealed. While people often possess documents in their wallets that convey their names, this is not how most people initiate interactions.

The practice of sharing one’s name is embedded in rituals of relationship building. People do not share their names with every person they encounter. Rather, names are offered as an introductory gesture in specific situations to signal politeness and openness. While the revelation of a person’s name may link them to their family or signal information about their socioeconomic position, most names in a Western context provide little additional information beyond what is already conveyed through the presence of the individual. As such, they simply serve as an identifier for people to use when addressing one another. Online, the stakes are different.

Online, there are no bodies. By default, people are identified through IP addresses. Thus, it is common to lead with a textual identifier. In the days of Usenet and IRC, that identifier was typically a nickname or a handle, a username, or an email address. With Facebook and Google Plus, people are expected to use their names.

The power of search also shifts the dynamics. Although it is possible for wizards in Hogwarts to scream the equivalent of “grep” into the ether and uncover others’ location, background information, and relationships, this is not something mere mortals can do in everyday life. Until the Internet arrived. Today, information about people can be easily accessed with just a few keystrokes. Through search, the curious can gain access to a plethora of information, often taken out of context. Without the Internet, inquiring about someone takes effort and provokes questions. Asking around often requires addressing a common response, “Why do you want to know?” Yet, search engines empower the curious to obtain (and misinterpret) information without any social consequences. That shifts how people relate online.

Privacy and Names
Accountability is commonly raised as one of the reasons people should provide identifiable information in online settings. When people prefer not to share their names, they are assumed to have something to hide. Many people claim people are better behaved and more “honest” when their identifying information is available. While there is no data that convincingly supports or refutes this, it is important to note that both Facebook and face-to-face settings continue to be rife with meanness and cruelty.

Even if we collectively value accountability, accountability is more than an avenue for punishment; accountability is about creating the social context in which people can negotiate the social conditions of appropriate behavior. Most social norms are regulated through incentive mechanisms, not punishment. Punishment (and, thus, the need to identify someone outside of the mediated context) is really a last-resort mechanism. The levers for accountability change by social context, but accountability is best when it is rooted in the exchange.

There are people who abuse others’ trust, violate social norms, or purposefully obscure themselves in order to engage in misdeeds. This is not just a problem online. But most people who engage in lightweight obfuscation are not trying to deceive. Instead, they are trying to achieve privacy in public environments.

Wanting privacy is not akin to wanting to be a hermit. Just because someone wants to share information does not mean they want to give up on privacy. When people seek privacy, they are looking to have some form of control over a social situation. To achieve that control, people must have agency and they must have enough information to properly assess the social context. Privacy is not about restricting information; it is about revealing appropriate information in a given context.

People feel as though their privacy has been violated when their agency has

30 communications of the acm | august 2012 | vol. 55 | no. 8


viewpoints

been undermined or when information about a particular social context has been obscured in ways that subvert people’s ability to make an informed decision about what to reveal. This is why people feel so disempowered by technological moves where they feel as though they cannot properly manage the social situation.

In unmediated contexts, choosing when or how to reveal one’s name allows people to meaningfully control a social situation. For example, when a barista asks a customer for her name, it is common for the customer to provide only her first name. There are also customers who provide a nickname or a fake name when asked for such information, particularly if their name is obscure, hard to pronounce, or overly identifiable. The customs involved in sharing one’s name differ around the world and across different social contexts. In some settings, it is common to only provide one’s last name (for example, “Mr. Smith”). In other settings, people identify themselves solely in relationship to another person (for example, “Bobby’s father”). People interpret a social situation and share their name based on how comfortable they are and what they think is appropriate.

When people are expected to lead with their names, their power to control a social situation is undermined. Power shifts. The observer, armed with a search engine and identifiable information, has greater control over the social situation than the person presenting information about themselves. The loss of control is precisely why such situations feel so public. Yet, ironically, the sites that promise privacy and control are often those that demand that users reveal their names.

Who Is in Control?

Battles over identity, anonymity, pseudonymity, and “real names” are not over. New systems are regularly developed and users continue to struggle with how to navigate information disclosure. What is at stake is not simply a matter of technology; identity in online spaces is a complex interplay of design, business, and social issues. There is also no way to simply graft what people are doing online onto what they might do offline; networked technology shifts how people engage with one another. Thus, it is important to move away from offline metaphors in order to address the complexity of people’s mediated interactions.

The “real names” debate goes beyond identification technologies and economic interests. Regardless of the business implications, the issue of whether or not to mandate “real names” is fundamentally one of power and control. To what degree do designers want to hold power over their users versus empower them to develop social norms? To what degree do companies want to maintain control over their systems versus enable users to have control over their self-presentation and actions?

These are complex socio-technical questions with no clear technical or policy solution. Furthermore, even though design plays a significant role in shaping how people engage with new technologies, it is the interplay between a system and its users that determines how it will play out in the wild. Design decisions can inform social practices, but they cannot determine them. As with all complex systems, control is not in the hands of any individual actor (designer, user, engineer, or policymaker) but rather the product of the socio-technical ecosystem. Those who lack control within this ecosystem resist attempts by others to assert control. Thus, finding a stabilized solution requires engaging with the ecosystem as a whole.

References
1. Banks, E. Eric Schmidt: If you don’t want to use your real name, don’t use Google+. Mashable (Aug. 28, 2011); http://mashable.com/2011/08/28/google-plus-identity-service/.
2. Dash, Anil. If your Website’s full of a-------, it’s your fault. A Blog About Making Culture (July 20, 2011); http://bitly.com/JAyktS.
3. Fake, C. Anonymity and pseudonyms in social software. Caterina.net (July 26, 2011); http://caterina.net/wp-archives/88.
4. Lessig, L. Code: And Other Laws of Cyberspace. Basic Books, New York, 1999.
5. Madrigal, A. Why Facebook and Google’s concept of ‘real names’ is revolutionary. Atlantic Monthly (Aug. 5, 2011); http://www.theatlantic.com/technology/archive/2011/08/why-facebook-and-googles-concept-of-real-names-is-revolutionary/243171/.
6. Skud. Preliminary results of my survey of suspended Google+ accounts. InfoTropism (July 25, 2011); http://infotrope.net/2011/07/25/preliminary-results-of-my-survey-of-suspended-google-accounts/.

danah boyd (danah@danah.org) is a senior researcher at Microsoft Research, a research assistant professor in Media, Culture, and Communication at New York University, a visiting researcher at Harvard Law School, a fellow at Harvard’s Berkman Center, and an adjunct associate professor at the University of New South Wales.

Copyright held by author.

Calendar of Events

August 17–19
International Conference on Security of Internet of Things,
Amritapuri, India,
Sponsored: SIGSAC,
Contact: Rangan Venkat,
Email: venkat@amrita.edu

August 21–24
Information Interaction in Context: 2012,
Nijmegen, Netherlands,
Contact: Kraaij Wessel,
Email: wessel.kraaij@tno.nl

August 26–29
Advances in Social Networks Analysis and Mining 2012,
Istanbul, Turkey,
Sponsored: SIGMOD,
Contact: Can Fazli,
Email: canf@cs.bilkent.edu.tr

August 27–29
12th International Conference on Quality Software,
Xi’an, China,
Contact: Zhou Xingshe,
Email: zhouxs@nwpu.edu.cn

August 28–31
Asia Pacific Conference on Computer Human Interaction,
Matsue City, Shimane, Japan,
Sponsored: SIGCHI,
Contact: Kentaro Go,
Email: go@yamanashi.ac.jp

September 3–6
8th International ICST Conference on Security and Privacy in Communication Networks,
Padua, Italy,
Contact: Mauro Conti,
Email: conti@di.uniroma1.it

September 4–6
24th International Teletraffic Congress,
Krakow, Poland,
Contact: Papir Zdazisaaw,
Email: papir@kt.agh.edu.pl

September 4–7
ACM Symposium on Document Engineering,
Paris, France,
Sponsored: SIGWEB,
Contact: Cyril Concolato,
Email: cyril.concolato@enst.fr

September 6–8
The International Workshop on Internationalisation of Products and Systems,
Bangalore, Karnataka, India,
Contact: Apala Lahiri Chavan,
Email: apala@humanfactors.com



viewpoints

doi:10.1145/2240236.2240248 George V. Neville-Neil


Article development led by
queue.acm.org

Kode Vicious
A System Is
Not a Product
Stopping to smell the code before wasting time
reentering configuration data.

Every once in a while, I come across a piece of good code and like to take a moment to recognize this fact, if only to keep my blood pressure low before my yearly medical checkup. The first such piece of code to catch my eye was clocksource.h in Linux. Linux interfaces with hardware clocks, such as the crystal on a motherboard, through a set of structures that are put together like a set of Russian dolls.

At the very center is the cyclecounter, a very simple abstraction that returns the current counter from an underlying piece of hardware. A cyclecounter knows nothing about the current time, time zone, or anything else; it knows only what the register in a piece of hardware is when asked about it. The cyclecounter has two pieces of state that help translate cycles into nanoseconds but nothing else. The next doll out is the timecounter. A timecounter contains a cyclecounter and raises the level of abstraction to the level of monotonically increasing time, measured in nanoseconds. On top of these structures are others that eventually give the system enough abstraction to know what the time of day is, and so forth.

So, what is so great about this code? Well, two things: first, it is well structured, in that it is built from small components that can cooperate without a layering violation; second, it is written and documented in a style that is clear and clean enough that I was able on first reading to understand how it worked. The comments and structure for the cyclecounter shown in the figure here give you a flavor of what makes me so happy to read this code.

An example of good code.

    /**
     * struct cyclecounter - hardware abstraction for a free running counter
     *      Provides completely state-free accessors to the underlying hardware.
     *      Depending on which hardware it reads, the cycle counter may wrap
     *      around quickly. Locking rules (if necessary) have to be defined
     *      by the implementor and user of specific instances of this API.
     *
     * @read:   returns the current cycle value
     * @mask:   bitmask for two's complement
     *          subtraction of non 64 bit counters,
     *          see CLOCKSOURCE_MASK() helper macro
     * @mult:   cycle to nanosecond multiplier
     * @shift:  cycle to nanosecond divisor (power of two)
     */
    struct cyclecounter {
        cycle_t (*read)(const struct cyclecounter *cc);
        cycle_t mask;
        u32 mult;
        u32 shift;
    };

I think you can see why I like this code, but just in case you can’t, let me be more specific. The code is well laid out, nicely indented, and has variables that are short, yet readable. There are no Bouncy Caps or_very_long_names_that_read_like_sentences. The comment is long enough to describe not just what the structure is, but how it is used, and it even mentions what has to be done if multiple threads need to access one of these structures simultaneously. If only all code were this well documented! You can read more of this code online at http://lxr.free-electrons.com/source/include/linux/clocksource.h.

My other example of good code requires more explanation, so I am going to reserve it for a future column. After all, I don’t want to waste the calming effect all in the same issue.




Dear KV,
After spending the last year buying and deploying a set of monitoring systems in my company’s production network, we hit our first bug in the manufacturer’s system software. After we reported the bug, the manufacturer asked us to upgrade the systems to the latest release. It turns out the upgrade process requires us to reset the devices to their factory defaults, losing all our configuration information on each system and requiring a person to reenter all of the configuration data after the upgrade.

At first, we thought this might be something required only for this particular upgrade, which crosses a major revision, but we have since learned we must reenter our configuration data each and every time we upgrade the software on these systems. I suppose we should have asked this question before we bought the systems, but it just never occurred to us that anyone would sell a box purporting to act as an appliance that could not be easily upgraded in the field. One of the other members of my group suggested we just return the boxes and ask for our money back, but, depressingly enough, these are the best systems we have found for network monitoring.
Down on Upgrades

Dear Down,
It seems that what you thought was a product (that is, a set of components thoughtfully put together by people who care about their customers) was actually just a collection of parts that under normal circumstances worked well enough to get by. Unfortunately, the number of companies that think about what will happen after version 1.0 of their systems is quite small.

I have been fortunate (or perhaps I should say the sales critters I have dealt with in the past 10 or so years have been fortunate) not to come across too many systems that act in the way you describe. The idea that the manufacturer would require a user or systems administrator to reenter already-saved data after an upgrade is stupid, ludicrous, and a bunch of other words that my editors just are not going to allow in this publication.

Even in the worst-designed products (and I have used enough of those) there is usually some bit of perl that takes the old configuration data and turns it into something that is mostly valid for the latest revision.

As a matter of fact, many years ago I worked on a networking switch project and the first thing the systems team (which was responsible for getting a reasonable operating system and applications onto the box) did was to come up with a way to field-upgrade the system. Anyone who has ever configured a switch or router knows you do not just toss out the configuration on an upgrade.

The sad part is that doing this correctly just is not that difficult. Most embedded systems now use stripped-down Unix systems such as Linux or the BSDs, all of which store their configurations in well-known files. Granted, this is not the nicest way to store configuration data, because it tends to be a bit scattered, but it is not that difficult to write a script that can handle differences between the versions and reconcile them. On FreeBSD there is etcupdate, and Linux has etc-update and dispatch-conf. In a properly designed system the configuration would likely be stored in a simple database or an XML file, both of which are field-upgradable with fairly simple scripts.

These sorts of issues are what differentiate an appliance that can be deployed and maintained with little human interference from a system that has to be diddled constantly to keep it happy. It is a sad fact that many programmers and engineers do not think much of appliances and are more likely to think their users should “man up” and spend their time making up for the original designer’s lack of forethought.

I still remember one of the first all-digital stereo systems I ever saw. It was designed around a Sun workstation and cost on the order of $15,000. It was obvious from the moment you looked at the controls that little or no thought had been put into what people really wanted out of a sound system. What most people want out of a stereo is good-quality sound with a minimum of button pushing to get what they want. What the system I saw presented was a lot of button pushing for about the same quality of sound I could get out of an amplifier and a CD player. If the user interface was horrific, it was nothing compared with the system performance. The box crashed three times before I finally walked out of the store. The one thing I got from that experience was an understanding that systems and products are not the same thing.

A system is a collection of components put together to do a job. A product is a system that has been designed and built to make the work of carrying out the job smooth and natural to the user. I can certainly cobble together software to rip, store, and play my CDs on a computer; that is a system, whereas an iPhone is a product. When I upgrade my phone, I do not reenter my data. If I did, the product would have died at the first revision. More likely, Steve Jobs would have taken someone’s head off if he had been told the upgrade path required reentering data. Given how complex modern appliances are (and let’s face it, your TV probably has an Ethernet jack) it is clear that people have thought about and solved this problem.

The real issue is with the people who design these devices. Somehow the fact an appliance is going to be hanging out with a bunch of computers (for example, in a colocation) makes it acceptable for the implementers to make their box look more like an open source desktop, which will be diddled by an experienced IT person or the users themselves. It is a practice that really needs to stop, if only because I don’t want everyone to wind up bald, like me. My hair didn’t fall out, I pulled it out!
KV

Related articles on queue.acm.org

The Seven Deadly Sins of Linux Security
Bob Toxen
http://queue.acm.org/detail.cfm?id=1255423

A Conversation with Chris DiBona
http://queue.acm.org/detail.cfm?id=945130

Closed Source Fights Back
Greg Lehey
http://queue.acm.org/detail.cfm?id=945126

George V. Neville-Neil (kv@acm.org) is the proprietor of Neville-Neil Consulting and a member of the ACM Queue editorial board. He works on networking and operating systems code for fun and profit, teaches courses on various programming-related subjects, and encourages your comments, quips, and code snips pertaining to his Communications column.

Copyright held by author.



viewpoints

doi:10.1145/2240236.2240249 Chris Forman, Avi Goldfarb, and Shane Greenstein

Economic and
Business Dimensions
The Internet Is Everywhere,
but the Payoff Is Not
Examining the uneven patterns of Internet economics.

Amid the rapid diffusion of the Internet in the 1990s, many media commentators, investment analysts, and policymakers hoped the Internet would erase geographic and socioeconomic boundaries and improve the comparative economic performance in less prosperous regions.

The optimism had some merit. The economic tide was rising during this time, with everyone’s fortunes rising. The best-sellers The Death of Distance1 and The World Is Flat4 were just two of the books that espoused that rosy view. The “productivity paradox” (the question of whether ubiquitous use of computers would ever produce economic growth) disappeared among professional economists. Between 1995 and 2000 the economic tide was rising, as wages increased by 20% on average across the U.S.

Note the words “on average.” What effect did the growth of the commercial Internet have on the economic performance of different locations in the U.S. economy? Our analysis published in the American Economic Review suggests the Internet is widespread, but its economic payoffs are not even.3

Misplaced Optimism
We conducted a multiyear study to document how businesses adopted, deployed, and improved the Internet from 1995 to 2000, a period of rapid initial investment by business. In prior research,2 we learned to look carefully at the specific type of investment. We distinguished between rapidly diffusing email and Web browsing and more advanced applications. Such advanced Internet applications were enabling productivity increases by lowering costs of communication between suppliers and customers over long distances, and these applications required skilled labor to implement and operate. Diffusion of these applications shaped productivity for many enterprises.

We were nevertheless struck by the persistence of an earlier perception: the Internet was almost everywhere but the pattern was uneven. Most large business establishments were using email and browsing by the late 1990s. However, the more we looked, the more we could see that advanced commercial uses of the Internet such as inventory management systems and online database sharing were found more often in larger cities than in smaller ones. They were also found more in data-heavy industries such as finance and wholesale distribution than they were in manufacturing, mining, and social services. At one level this makes sense: advanced Internet applications are more likely to be found in electronic parts supply and sophisticated financial operations than in landscaping firms and nursing homes. Nevertheless, there was a disparity: more prosperous cities were home to sophisticated companies that knew how to best take advantage of the new technology, and those companies (and cities) received more benefits from Internet use than other cities. There might well be payoffs from the Internet, but they were not evenly shared.

The Payoff Puzzle
Our more recent study combined information on business Internet usage with U.S. national-level wage data. We studied the period from 1995 to 2000 using several large datasets, including the Quarterly Census of Employment and Wages that gives county-level information on average weekly wages and employment, and the Harte Hanks Market Intelligence Computer Intelligence Technology Database of survey information about how firms use the Internet. In total, we included relevant data from over 85,000 private establishments with more than 100 employees each.

We found that business adoption of Internet technologies correlated with wage and employment growth in only 163 of the over 3,000 counties in the U.S. All of these counties had populations above 150,000 and were in the top quarter of income and educational achievement before 1995. Between 1995 and 2000, they showed a 28% average increase in wages, compared with a 20% increase in other counties, where the Internet appears to not have had much impact on economic performance (see the accompanying figure).

[Figure: Advanced Internet investment and wage growth by county type. Series: top counties in income, education, population, and IT-intensity in 1990; other counties. Horizontal axis: fraction of firms with advanced Internet (0 to .3). Vertical axis: wage growth 1995 to 2000 (0 to .5).]

In short, the Internet appears to have exacerbated regional wage inequality, explaining over half the difference in wage growth between the 6% of counties that were already well-off and all other counties.

Why did the Internet make such big waves in these few areas? Three explanations are possible:
■ Big cities have the needed communications infrastructure and are home to firms capable of making capital investments that allow the Internet to enhance what such sophisticated firms are already doing.
■ A phenomenon called “skill-biased technical change” might be under way, in which new technologies raise the wages only of skilled workers, while unskilled workers cannot use the new technology and do not receive the necessary training to benefit from its deployment through wage gains.
■ Cities have denser labor markets, and better communication between supply and demand. Programmers with skills in the language of mainframes, such as COBOL, and the language of the Web, such as HTML and Perl, lived in (or moved to) big cities, where more firms would be looking for such workers. Similarly, a range of other inputs that would make Internet adoption more valuable is more available in cities.

Each explanation is plausible, and probably explains a piece of the story. However, existing data does not allow us to distinguish between the explanations, and so the payoff puzzle remains.

Policy
The Internet has been widely diffused and used: in this, our results echo the popular optimism. Also, the benefits are not limited to the locations that dominate supply of equipment and software, such as Boston, Seattle, New York, and Silicon Valley.

However, in contradiction to popular optimism, our data does not show any evidence of improvement in the comparative economic performance of isolated locations or less dense locations. The Internet may allow firms in rural Iowa to reach new customers on Wall Street but it also allows Wall Street banks to reach investors in rural Iowa. Our study shows that benefits from Internet adoption such as increased wages are, on balance, more likely to show up in New York City than in rural Iowa. The predicted “leveling” effects of the Internet have not materialized, and it is doubtful that short-run investment in Internet infrastructure (such as extending access to broadband) will generate immediate economic benefits for areas that have otherwise low capital investment, inadequate labor skills, and market conditions.

Conclusion
These results have important implications for public policy. Many lawmakers have argued for government to subsidize the expansion of the Internet into poor and isolated regions. For example, the 2009 economic stimulus package allocated over $7 billion for broadband expansion in underserved areas, under the assumption that creating such infrastructure will raise wages and income. Our results suggest such Internet infrastructure investments by themselves are unlikely to raise wages in poor and isolated regions, at least in the short run. Such investments might be justified to strengthen education, increase civic engagement, or promote health and safety. And, over 20 to 30 years expanding Internet access might lead to more widespread economic gains. However, there is little evidence to suggest a short-term payoff in the form of increased local wages.

References
1. Cairncross, F. The Death of Distance. Harvard University Press, Cambridge, MA, 1997.
2. Forman, C., Goldfarb, A., and Greenstein, S. How did location affect adoption of the commercial Internet? Global village vs. urban leadership. Journal of Urban Economics 58, 3 (Mar. 2005), 389–420.
3. Forman, C., Goldfarb, A., and Greenstein, S. The Internet and local wages: A puzzle. American Economic Review 102, 1 (Jan. 2012), 556–575.
4. Friedman, T.L. The World is Flat: A Brief History of the Twenty-First Century. Farrar, Straus, and Giroux, New York, 2005.

Chris Forman (chris.forman@mgt.gatech.edu) is the Robert and Stevie Schmidt associate professor of IT management in the College of Management at Georgia Institute of Technology, Atlanta, GA.

Avi Goldfarb (agoldfarb@rotman.utoronto.ca) is an associate professor of marketing at the Joseph L. Rotman School of Management at the University of Toronto.

Shane Greenstein (greenstein@kellogg.northwestern.edu) is the Kellogg Chair of Information Technology and Professor in the Kellogg School of Management at Northwestern University, Evanston, IL.

Copyright held by author.



viewpoints

doi:10.1145/2240236.2240251 Kai A. Olsen and Hans Fredrik Nordhaug

Viewpoint
Internet Elections:
Unsafe in Any Home?
Experiences with electronic voting suggest elections
should not be conducted via the Internet.

Several countries have experimented with Internet elections. While voters in Estonia could use the Internet in their 2007 national election for parliament,3,7 other countries have offered this option for local elections only. The argument in favor of allowing voting over the Internet is that this may increase voter participation, which has dropped in many countries in recent years. Another key argument is that Internet elections can increase access for people with disabilities. Furthermore, some people feel that Internet elections are an inevitable case of allowing democracy to enter the Internet world along with shopping, banking, and many other applications.

One argument against voting on the Internet is that it cannot guarantee the anonymity that a voting booth can provide. This can lead to coercion and the buying and selling of votes. Even more serious is the risk that someone may tamper with the voting or the results. This risk is aggravated by the fact that voting could be performed from a computer that has malware installed.

Illustration by Richard Borge

In order to overcome such obstacles the Ministry of Local Government in Norway launched a $40 million project (“e-valg 2011”) in 2009 to design an electronic voting system to be used in the 2011 local elections. The system is based on experience from other countries, and is comparable to the Estonian system. Experts from academia, research institutions, and industry were engaged in the development of a secure system, and to check the proposed solutions. To prevent pressure for voting against one’s wish, the system allowed repeated voting. The Internet option was closed one day prior to the election, voters always had the option to cast their vote at a polling station, and the voter’s last vote would always override any previous votes. To counter malware and other risks, an advanced cryptographic system was employed.2 An important part of this is a coding system designed to prevent vote tampering.

This system uses codes to identify each party participating in the election.4 These codes are presented on the back of a “voter’s card,” which is mailed to everyone who has the right to vote. After casting the ballot on the Internet using a home or office computer, voters receive a confirmation text message. This message includes a code the voter can compare to the card in order to check if the vote was regis-




tered for the correct party. Since these only, could go unnoticed. Even if there
codes only exist on the paper card and is disclosure, what should be done?
are particular to each voter, malware We have Should voters be asked to vote again
would have to break into the central demonstrated or should the election be considered
server in order to generate the codes. invalid? Creating such chaos may even
Tests also showed that many voters do that voters are be the main intent of the villain. Even
check that the code is correct. Thus, actually victims of in such a case, there may not be an easy
the ministry determined that it would “undo.” As well as changing your vote,
require a “large-scale conspiration and unreasonable amounts of money” to break the system.

the user interface.

Malware Exploiting Voters
As we will describe, however, there are many ways to violate such a system. Malware can discard votes or change votes at the same time as sending users the correct confirmation code. In order to demonstrate these possibilities, we designed a malware system of our own. Although we have assumed the server part of the e-voting system is secure, our malware exploits the most vulnerable part of the voting system: the voter. After the voter has chosen a certain party, the official voting system presents a page with the name of the party and the user is asked to confirm by clicking a send button. Our version is similar, but it also asks for the (secret) party code: “Please confirm your selection by typing in the code for this party. You will find the code on the back of your voter’s card.” In a test simulating voting on paper, we tested this on 158 college students, including 25 IT students. None of the students found any faults with our system. In addition, over 400 high school students tested an online version. All of these students typed in the party code when required by the system. This was despite the fact that all participants, both college and high school students, were shown an animation made by the ministry that explains the e-voting system and stresses the correct use of the party code as a final manual check.

However, in all other situations PIN codes are supposed to be typed into some computer system. Even the e-voting system has a part where a PIN received by a text message is to be typed in during the identification process. Thus, while our malware uses the code as users anticipate, it is the official system that applies the code in an unfamiliar manner.

Once it has the code, the malware can easily send the correct confirmation to the voter and then discard or change the vote. In the latter case, the voter will receive another text message, this time from the ministry, with a different code (for the party that the malware has chosen). With malware also present on the smartphone, this second text message can be discarded. Or, more simply, the user can be warned that additional messages may be forthcoming due to some communication problems, and to “please ignore these.” A ministry document on security objectives stresses the risk of such malware: “The insecurity of browsers and operating systems on the client platform will invariably make it possible to subversively install malicious software.”1

However, those with malicious intent may opt for more straightforward solutions. Introducing a fake URL is one possibility. Many users will find the link through other Web pages, such as community pages. These sites have been hacked before and can be hacked again. A false Web site will also receive identification data from the voter, and will, at least in the Norwegian system, be able to change the phone number for the confirmation message. Even more simply, a villain could send an email message to voters before the election with a “vote here” URL. By targeting the recipients, such as groups the villain feels vote unsatisfactorily, it is extremely simple to make an e-voting system that mimics the original system, asks for the party code, sends a confirmation message to the voter, and then discards the vote. It took us just a few hours to create such a system.

While someone would certainly notice a large-scale attack, a small-scale attack, perhaps in one community

malware may also know how you voted. This information, which most people consider private, could be used for blackmail or embarrassment.

In theory, the Norwegian e-voting system is safe. According to the developers, the risks involved can be expressed mathematically. However, this is based on the condition that voters do what they are supposed to do. We have demonstrated that voters are actually victims of the user interface. We argue that voting systems are particularly vulnerable. A conscientious voter who participates in all elections will in many countries vote once every two years. This is not frequent enough to get any practice or routine with an e-voting system. In any case, there may also be modifications in the interface from one election to another. So, while methods such as those presented here may, in principle, be used to obtain secret codes from Internet bank users, in practice, most of us would become suspicious if the system broke its expected pattern; for example, if it asked for additional codes. Furthermore, if someone breaks into your bank account, you would at least notice money has disappeared. A changed or discarded vote, however, may never be discovered.

Estonia requires citizens to insert their nationwide ID card into a card reader connected to the home computer to vote on the Internet, and then offer PIN codes as a further proof of identification. This may increase security on the home computer, but the risk of malware or a false election site is still the same.

Postal voting is another option offered in many countries, and some states have this as the only alternative.5 It offers the same possibility of voting from one’s home as the Internet. Theoretically it can be as vulnerable as Internet voting, but it would take more resources to mount an attack. Also, while the Internet offers anonymity to the villains, this may not be the case


for those who try to interfere with a postal voting system.

Postal voting, as any system that allows a voter to cast a vote outside a voting booth, still has the disadvantages that voters can be coerced or paid to vote in a certain way. The possibility of repeated voting could reduce this problem. By going to the polling place after giving the Internet or postal vote, one has the opportunity to vote again. However, a patriarch of a closely controlled family could easily restrict his daughter’s movements on the final day of the election, just as he could control their Internet voting. For those buying votes it is just a small calculated risk that the seller of a vote will turn up on Election Day.

Repeated voting on the Internet may not offer any solution. Votes can still be bought, not by requiring how people vote, but by taking control over their ID codes. This allows the buyer to vote on their behalf. However, the Estonian solution with an ID card would make it more difficult to hand this over to others. This is especially the case when the card is also used for other purposes.

In Isaac Asimov’s science fiction story Franchise (1955), the all-encompassing supercomputer Multivac chose Norman Muller as the “Voter of the Year.” In this electronic democracy, a single person was selected to represent all voters. Based on the answers to a set of questions to Norman, Multivac determined the results of the election. Norman was proud that the U.S. citizens had, through him, “exercised once again their free, untrammeled franchise.” This is not exactly Internet voting, but the two systems do have something in common: it is impossible for non-experts to verify they work correctly. The old system with paper ballots may be inefficient, but it does allow any voter to understand how it works. This is the case also for postal voting. Trust in such a system is more direct than with any e-voting application.

Conclusion
In an October 2011 Communications Inside Risks column, Carsten Schürmann argued for modernizing the Danish democratic process.6 He stressed the importance of listening to the voices of scientists and other specialists when designing new systems, but, as we have seen, this did not work in Norway. While he praised the European e-voting initiatives, he was skeptical regarding Internet voting, “for which there are still more open problems than solved ones.” Perhaps voting is one task that should not be moved to the Internet? Trustworthy design of an electoral system is critical for democracy; this is a place where no risks, neither practical nor theoretical, can be tolerated. The advantage of running a computer system that is to be used sparingly is also dubious and, as we have seen, creates additional risks since users have no routine. It is also reasonable to believe that electoral participation does not depend on the voting system alone. Perhaps it has something to do with politics?

References
1. e-Vote 2011 Security Objectives. Ministry of Local Government; http://www.regjeringen.no/upload/KRD/Kampanjer/valgportal/e-valg/tekniskdok/Security_Objectives_v2.pdf.
2. Gjøsteen, K. Analysis of an Internet voting protocol, 2011; http://eprint.iacr.org/2010/380.
3. Meagher, S. When personal computers are transformed into ballot boxes: How Internet elections in Estonia comply with the United Nations international covenant on civil and political rights. American University International Law Review 23, 2 (Feb. 2009), 349–386.
4. Nestås, L.H. and Hole, K.J. Building and maintaining trust in Internet voting. IEEE Computer 45, 5 (May 2012).
5. Qvortrup, M. First past the postman: Voting by mail in comparative perspective. Political Quarterly 76, 3 (Fall 2005), 414–419.
6. Schürmann, C. Modernizing the Danish democratic process. Commun. ACM 54, 10 (Oct. 2011), 27–29.
7. Trechsel, A. et al. Internet Voting in Estonia. A Comparative Analysis of Four Elections since 2005. Report for the Council of Europe, European University Institute, Robert Schuman Centre for Advanced Studies, 2010.

Kai A. Olsen (kai.olsen@himolde.no) is a professor of informatics at Molde University College and at the University of Bergen, Norway.

Hans Fredrik Nordhaug (hans.f.nordhaug@himolde.no) is an associate professor and study counselor in the department of informatics at Molde University College, Norway.

Copyright held by author.




doi:10.1145/2240236.2240250 Neil McBride

Viewpoint
The Ethics of Software Engineering Should Be an Ethics for the Client
Viewing software engineering as a communicative art in which client engagement is essential.

Illustration by Gluekit

Advances in parallel computing mean Internet traffic can be monitored in real time and illicit activity automatically detected. Single-chip autonomous motes with wireless communications and environmental sensors can be scattered in a natural or medical environment. Routine sequencing of individuals’ genetic code will soon be commercially viable and will require new software applications to support it. How are software engineers to understand the social and ethical challenges of such advances?

In the last 15 years, the environment and practice of software engineering have changed radically. Commercial applications are mostly Web-based, often globally accessible and subject to emergent and unexpected effects arising from their use on a global scale. Not only are the products different, but the practice is different. Agile methods dominate. Rapid systems delivery and continuous contact with the client are expected. Software engineers have to engage with increasingly complex social and ethical environments.

And yet, currently, the only support we can offer the software engineer is a code of ethics, which is 15 years old, parochial in its scope, and often obvious in its content. It recommends behavior such as having good documentation, rejecting bribery, and refraining from the use of illegal software. Its orientation is primarily negative and deals with injunctions that most practitioners would subscribe to anyhow.1

The developing nature of software engineering requires not a revision of an ailing code but a revolution in ethical thinking that acknowledges the purpose and practice of software engineering. Computer systems are designed and implemented to support human purposeful activity. Whether the software is concerned with student enrollment, customer relationship management, or hospital administration, its success lies in the extent to which it enables others to engage in activities directed toward a goal. The software engineer is a service provider,


supporting the processes and activities that result in human flourishing.

Hence the moral behavior underpinning the practice of software engineering requires, firstly, an ethics of service, focusing on the relationship with the client, the client’s employees and customers. Secondly, the software engineer must be concerned about the ethics of the client. The closer relationship of IT with the client and the user demands a more active involvement with the ethical environment within which the software operates.

A service-oriented, client-oriented software engineering ethics requires outward-looking engagement with complex social and moral climates. How is this to be achieved when the comfort and certainty of codes cannot be relied upon? What is the alternative and how can software engineers become reflective practitioners?

Service Ethics
A service-oriented ethics will focus on relationships and the complex network of social interactions that is the social context of the software. How does the outsourcer relate to the client? Who are the ultimate users of the software? How do those users relate to others involved in the purposeful activity that is supported by the software? A customer relationship management system, for example, serves the purposeful activity of the managers, service staff, and customers of a business. Ethical issues may arise from the complex social environment, the culture, the business practice, and the customer expectations. This social complexity will defeat any code of conduct. We cannot engage in ethics at arm’s length, nor conduct relationships by remembering reams of rules.

A service-oriented ethics is a matter of character not codes. We develop character through practice that enables us to reflect on complex development environments and sensitively identify and tackle most ethical issues that arise. Character is expressed in virtues such as courage, patience, honesty, and empathy. For example, patience is developed through practice in a situation where communication with the client is slow and time is needed to understand what the client wants. Courage is developed through standing up to unreasonable demands, and raising uncomfortable issues about how software will affect users. The virtue of honesty must be practiced so that we do not overpromise and are transparent about the limitations of the systems we build. The characteristic of humility will help the software engineer to listen sensitively to the client but not assume the client knows better.

But whether we, as software engineers, are prepared to engage in character development rather than rule adherence will depend on our own motivation. We must reflect on our own goals. What is driving us? What meaning do we apply to the work we do? Is our work just driven by a focus on the technology we use? Are we just concerned with earning money?

If we are driven by external goods in terms of financial rewards or technical advancement, there may be little motivation to engage with ethical concerns. But a shift toward internal rewards may encourage us to reflect on our goals in our professional life. An emphasis on internal rewards will come from looking outward and considering our connection with community, local and global, and finding fulfillment in our contribution to human flourishing.

Domain Ethics
At the heart of a service ethics is the development of empathy: the virtue of putting ourselves in others’ shoes, of learning about their environment and concerns. This requires immersion in the client’s world, and seeing ourselves not as technicians, but as servants of human purposeful activity developing an ethics-in-use. To do this we must understand the moral climate of the client: the ethical issues involved in the practice of medicine, of nursing, of computer gaming, of accountancy, of energy provision. We cannot assume our own professional codes of conduct are adequate or even relevant to the areas we are serving. The software needs to support the moral environment within which the client works, not the moral climate of the software engineer. The value and benefits of the software are obtained when it is used in the client’s environment. That is not to say we do not question the ethical framework of the client, or offer suggestions for improvement. But how can we comment on something we have not engaged with or understood? Although some of the ethical situations faced by our clients will be foreign to us, this does not mean we should not pursue an understanding of our client’s predicaments that will enable us to deliver more ethically sensitive and appropriate software.

Having understood the issues in the domain of practice, does the software engineer learn another code of conduct or retreat to his own better-rehearsed software engineering code


of conduct? Attempts to apply the client’s domain’s code of ethics may require an immersion in a culture and practice that is impracticable for the software engineer.

However, both software engineers and their clients develop characteristic virtues through practice that influence and inform their decision making when faced with ethical choices. Fundamental virtues such as patience, kindness, compassion, and empathy are common to all areas of human practice. That is not to say that new virtues, relevant to the client’s practice, will not be recognized and applied as we learn about the client.

So the development of our own moral characteristics through the learning and practice of virtues will grow closer connections with the client and their activity. Combined with a knowledge of the ethical difficulties encountered by the client, this will help us to sympathize with the client and at the least avoid writing crass moral assumptions into our software that are culturally foreign and morally jarring.

The Way Forward
So how are we to do this? The danger is we replace a catalogue of rules with a catalogue of virtues. An understanding of the idea of virtues, the importance of character and the relation to actions is important. Studying the virtue of courage will help us identify when courage is needed, when we need to overcome fear and selfish concerns to stand up for a client, to point out an ethical malfunction, or to whistle-blow. It will also enable us to understand the boundaries, to tell the difference between considered courage and rash foolishness. A skills-based approach such as that advocated by Huff et al.2,3 is helpful, but we do not want to descend into debates about hierarchies of virtues or taxonomic exercises. We must look inward, outward, and all around.

Inward reflection on who we are, what drives us, what we consider important, will help us become aware of the social and ethical assumptions we make and take for granted. What we consider good and bad behavior will affect how we respond to social situations and ethical dilemmas.

Outward engagement with the world in which the software acts will help the software engineer develop empathy and a feeling for the ethical climate in which the software runs. Familiarity with the technical issues associated with the field of application is extended to the ethical issues. Outward engagement will involve searching for examples of good practice, involvement with mentors and others who model virtues, and creating narratives about the service we engage in and the purposeful activity we support.

All-around vision will involve learning to connect with issues that matter to people. The software engineer who spends all his time buried in Python manuals and playing Call of Duty 4 is hardly likely to develop wisdom and ethical understanding. Software engineers should be encouraged to connect with the wider world. Reading the literature of the great religions, modern and ancient authors, engaging in political debate, reading about advances in science and global problems will help develop all-around vision.

A set of static rules should be replaced by a dynamic process. Starting by learning about the ethical problems and difficulties the client faces, we move on to consider what character traits, what virtues would be important in this domain. We can then look for examples and stories of expressions of those virtues in that domain or others. An emphasis on narratives will lead us to ask what would a virtuous person do?

We can then reflect on the implications for software engineering and improve the design of the product and the delivery of the service accordingly. Such a process enables us to practice, learn from that practice, and grow in wisdom.

Conclusion
The software engineer who considers himself only a technician, who rejects social and ethical engagement, claiming “it’s a matter for lawyers and doctors” is a destructive force and a burden on the client. We should consider software engineering as a communicative art where engagement with the client is at its heart and an understanding of the tasks, goals, and moral concerns of the client leads to the development of software that is appropriate to the client’s needs.

But we cannot expect a written code to be the prime motivator of professional moral character any more than a written constitution guarantees good government. Indeed the written code of ethics may lead to complacency, ignorance, and the neglecting of moral reflection. Codes of ethics can provide a smokescreen for deserting personal responsibility.

Hence we must understand the nature of the domains we engage with and the facts concerning ethical problems associated with them. And we must develop virtues that will help us to use our understanding of the application domain in making ethical decisions. The nature of the domains we engage in will change, but the human characteristics we need such as empathy, patience, honesty, compassion will not change.

Rules can be learned and forgotten quickly; becoming a virtuous software engineer takes time and practice. Wisdom develops over a lifetime.

References
1. Harris, C.E. The good engineer: Giving virtue its due in engineering ethics. Science and Engineering Ethics 14 (2008), 155–164.
2. Huff, C.W., Barnard, L., and Frey, W. Good computing: A pedagogically focused model of virtue in the practice of computing (part 1). Journal of Information, Communication & Ethics in Society 6, 3 (2008), 246–278.
3. Huff, C.W., Barnard, L., and Frey, W. Good computing: A pedagogically focused model of virtue in the practice of computing (part 2). Journal of Information, Communication & Ethics in Society 6, 4 (2008), 284–316.

Neil McBride (nkm@dmu.ac.uk) is a Reader in Information Technology Management in the Centre for Computing and Social Responsibility at De Montfort University, Leicester, U.K.

Copyright held by author.



practice
doi:10.1145/2240236.2240254

Article development led by queue.acm.org

OpenFlow: A Radical New Idea in Networking
An open standard that enables software-defined networking.
by Thomas A. Limoncelli
Illustration by Jason Cook

Computer networks have historically evolved box by box, with individual network elements occupying specific ecological niches as routers, switches, load balancers, network address translators (NATs), or firewalls. Software-defined networking proposes to overturn that ecology, turning the network as a whole into a platform and the individual network elements into programmable entities. The apps running on the network platform can optimize traffic flows to take the shortest path, just as the current distributed protocols do, but they can also optimize the network to maximize link utilization, create different reachability domains for different users, or make device mobility seamless.

OpenFlow, an open standard that enables software-defined networking in IP networks, is a new network
technology that will enable many new applications and new ways of managing networks.

‘Real’ Examples
Here are three real, though somewhat fictionalized, applications:

Example 1: Bandwidth Management. A typical wide area network has 30%–60% utilization; it must “reserve” bandwidth for “burst” times. Using OpenFlow, however, a system was developed in which internal application systems (consumers) that need bulk data transfer could use the spare bandwidth. Typical uses include daily replication of datasets, database backups, and the bulk transmission of logs. Consumers register the source, destination, and quantity of data to be transferred with a central service. The service does various calculations and sends the results to the routers so they know how to forward this bulk data when links are otherwise unused. Communication between the applications and the central service is bidirectional: applications inform the service of their needs, the service replies when the bandwidth is available, and the application informs the service when it is done. Meanwhile, the routers feed real-time usage information to the central service.

As a result, the network has 90%–95% utilization. This is unheard of in the industry. The CIO is excited, but the CFO is the one who is really impressed. The competition is paying almost 50% more in capital expenditure (CAPEX) and operating expenditure (OPEX) for the same network capacity.

This example is based on what Google is doing today, as described by Urs Hoelzle, a Google Fellow and senior vice president of technical infrastructure, in his keynote address at the 2012 Open Networking Summit.1

Example 2: Tenantized Networking. A large virtual-machine hosting company has a network management problem. Its service is “tenantized,” meaning each customer is isolated from the others like tenants in a building. As a virtual machine migrates from one physical host to another, the virtual LAN (VLAN) segments must be reconfigured. Inside the physical machine is a software-emulated network that connects the virtual machines. There must be careful coordination between the physical VLANs and the software-emulated world. The problem is that connections can be misconfigured by human error, and the boundaries between tenants are… tenuous.

OpenFlow allows new features to be added to the VM management system so that it communicates with the network infrastructure as virtual and physical machines are added and changed. This application ensures each tenant is isolated from the others no matter how the VMs migrate or where the physical machines are located. This separation is programmed end-to-end and verifiable.

This example is based on presentations such as the one delivered at the 2012 Open Networking Summit by Geng Lin, CTO, Networking Business, Dell Inc.3

Example 3: Game Server. Some college students set up a Quake server running on a spare laptop as part of a closed competition. Latency to this server is important not only to gameplay, but also to fairness. Because these are poor graduate students, the server runs on a spare laptop. During the competition someone picks up the laptop and unplugs it from the wired Ethernet. It joins the Wi-Fi as expected. The person then walks with the laptop all the way to the cafeteria and returns an hour later. During this path the laptop changes IP address four times, and bandwidth varies from 100Mbps on wired Ethernet, to 11Mbps on an 802.11b connection in the computer science building, to 54Mbps on the 802.11g connection in the cafeteria. Nevertheless, the game competition continues without interruption.

Using OpenFlow, the graduate students wrote an application so that no matter what subnetwork the laptop is moved to, the IP address will not change. Also, in the event of network congestion, traffic to and from the game server will receive higher priority: the routers will try to drop other traffic before the game-related traffic, thus ensuring smooth gameplay for everyone. The software on the laptop is unmodified; all the “smarts” are in the network. The competition is a big success.

This example is fictional but loosely based on the recipients of the Best Demo award at SIGCOMM (Special Interest Group on Data Communications) 2008.4 More-serious examples of wireless uses of OpenFlow were detailed by Sachin Katti, assistant professor of electrical engineering and computer science at Stanford University, in a presentation at the 2012 Open Networking Summit.2

A Network Routing System That Lets You Write Apps
OpenFlow is either the most innovative new idea in networking, or it is a radical step backward: a devolution. It is a new idea because it changes the fundamental way we think about network-routing architecture and paves the way for new opportunities and applications. It is a devolution because it is a radical simplification, the kind that results in previously unheard of new possibilities.

A comparison can be made to the history of smartphones. Before the iPhone, there was no app store where nearly anyone could publish an app. If you had a phone that could run applications at all, then each app was individually negotiated with each carrier. Vetting was intense. What if an app were to send the wrong bit out the antenna and take down the entire phone system? (Why do we have a phone system where one wrong bit could take it down?) The process was expensive, slow, and highly political.

Some fictional (but not that fictional) examples:
• “We’re sorry but your tic-tac-toe game doesn’t fit the ‘c’est la mode’ that we feel our company represents.”
• “Yes, we would love to have your calorie tracker on our phone, but we require a few critical changes. Most importantly the colors must match our corporate logo.”
• “We regret to inform you that we are not interested in having your game run on our smartphone. First, we can’t imagine why people would want to play a game on their phones. Our phones are for serious people. Second, it would drain the battery if people used their phones for anything other than phone calls. Third, we find the idea of tossing birds at pigs to be rather distasteful. Our customers would never be interested.”

Network equipment vendors do not currently permit apps. An app that fulfills a niche market is a distraction from the vendor’s roadmap. It could ruin the product’s “c’est la mode.” Routers are too critical. Router protocols are difficult to design, implementing them requires highly specialized programming skills, and verification is expensive. These excuses sound all too familiar.

But What If It Became Easy?
To understand why writing routing protocols is so difficult requires understanding how routing works in today’s networks. Networks are made of endpoints (your PC and the server it is talking to) and the intermediate devices that connect them. Between your PC and www.acm.org there may be 20 to 100 routers (technically, routers and switches but for the sake of simplicity all packet-passing devices are called routers in this article). Each router has many network-interface jacks, or ports. A packet arrives at one port, the router examines it to determine where it is trying to go, and sends it out the port that will bring it one hop closer to its destination.

When your PC sends a packet, it does not know how to get to the destination. It simply marks the packet with the desired destination, gives it to the next hop, and trusts that this router, and all the following routers, will eventually get the packet to the destination. The fact that this happens trillions of times a second around the world and nearly every packet reaches its destination is awe-inspiring.

No “Internet map” exists for planning the route a packet must take to reach its destination. Your PC does not know ahead of time the entire path to the destination. Neither does the first router, nor the next. Every hop trusts that the next hop will be one step closer and eventually the packet will reach the destination.

How does each router know what the next hop should be? Every router works independently to figure it out for itself.

mation and uses it to deduce the design of the entire network. If a router could think out loud, this is what you might hear: “My neighbor on port 37 told me she is 10 hops away from network 172.11.11.0, but the neighbor on port 20 told me he is four hops away. Aha! I’ll make a note that if I receive a packet for network 172.11.11.0, I should send it out port 20. Four hops is better than 10!”

Although they share topological information, routers do the route calculations independently. Even if the network topology means two nearby routers will calculate similar results, they do not share the results of the overlapping calculations. Since every CPU cycle uses a certain amount of power, this duplication of effort is not energy efficient.

Fortunately, routers need to be concerned with only their particular organization’s network, not the entire Internet. Generally, a router stores the complete route table for only its part of the Internet: the company, university, or ISP it is part of (its autonomous domain). Packets destined for outside that domain are sent to gateway routers, which interconnect with other organizations. Another algorithm determines how the first set of routers knows where the gateway routers are. The combination of these two algorithms requires less RAM and CPU horsepower than if every router had to know the entire Internet. If it were not for this optimization, then every router would need enough RAM and CPU horsepower to store an inventory of the entire Internet. The cost would be staggering.

The routing algorithms are complicated by the fact that routers are given clues and must infer what they need to know. For example, the conclusion that “port 20 is the best place to send packets destined for 172.11.11.0” is inferred from clues given by other routers. A router has no way of sending a question to another router.
How does each router know what Imagine if automobiles had no win-
the next hop should be? Every router dows but you could talk to any driver
works independently to figure it out within 25 feet. Rather than knowing
for itself. The routers cooperate in an what is ahead, you would have to infer
algorithm that works like this: each a worldview based on what everyone
one periodically tells its neighbors else is saying. If everyone were using
which networks it connects to, and the same vocabulary and the same
each neighbor accumulates this infor- system of cooperative logical reason-

au g u st 2 0 1 2 | vol. 55 | no. 8 | c omm u n icat ion s o f t he acm 45


practice

ing, then every car could get to where be several hops away. Also, the com-
it is going without a collision. That’s munication channel between entities
the Internet: no maps; close your eyes needs to be secure. Lastly, we need a
and drive. much better name than RCITS.
Every router is trying to infer an en-
tire worldview based on what it hears. The basic idea of The OpenFlow standard addresses
these issues. It uses the term control-
Driving these complicated algorithms OpenFlow is that ler or controller platform (yawn) rather
than RCITS. The controller takes in
things would be
requires a lot of CPU horsepower. Each
router is an expensive mainframe do- configuration, network, and other
ing the same calculation as everyone
else just to get the slightly different re-
better if routers information and outputs a different
blob of data for each router, which in-
sult it needs. Larger networks require were dumb and terprets the blob as a decision table:
more computation. As an enterprise’s
network grows, every router needs
downloaded their packets are selected by port, MAC ad-
dress, IP address, and other means.
to be upgraded to handle the addi- routing information Once the packets are selected, the
tional computation. The number and
types of ports on the router haven’t from a big “route table indicates what to do with them:
drop, forward, or mutate then for-
changed, but the engine no longer has compiler in the sky.” ward. (This is a gross oversimplifica-
enough capacity to run its algorithms. tion; the full details are in the speci-
Sometimes this means adding RAM, fication, technical white papers, and
but often it means swapping in a new presentations at http://www.Open-
and more expensive CPU. It is a good Flow.org/wp/learnmore.)
business model for network vendors: Traditional networks can choose a
as you buy more routers, you need route based on something other than
to buy upgrades for the routers you the shortest path—perhaps because
have already purchased. These are not it is cheaper, has better latency, or
standard PCs that benefit from the has spare capacity. Even with systems
constant Dell-vs.-HP-vs.-everyone-else designed for such traffic engineering,
price war; these are specialized, ex- such as Multiprotocol Label Switch-
pensive CPUs that customers cannot ing Traffic Engineering (MPLS TE),
get anywhere else. it is difficult to manage effectively
with any real granularity. OpenFlow,
The Radical Simplification in contrast, enables you to program
of OpenFlow the network for fundamentally differ-
The basic idea of OpenFlow is that ent optimizations on a per-flow basis.
things would be better if routers were That means latency-sensitive traffic
dumb and downloaded their routing can take the fastest path, while bulk
information from a big “route com- flows can take the cheapest. Rather
piler in the sky,” or RCITS. The CPU than being based on particular end-
horsepower needed by a router would points, OpenFlow is granular down to
be a function of the speed and quan- the type of traffic coming from each
tity of its ports. As the network grew, endpoint.
the RCITS would need more horse- OpenFlow does not stop at traffic
power but not the routers. The RCITS engineering, however, because it is
would not have to be vendor specific, capable of doing more than populat-
and vendors would compete to make ing a forwarding table. Because an
the best RCITS software. You could OpenFlow-capable device can also
change software vendors if a better rewrite the packets, it can act as a
one came along. The computer run- NAT or AFTR (address family transi-
ning the RCITS software could be off- tion router); and because it can drop
the-shelf commodity hardware that packets, it can act as a firewall. It can
takes advantage of Moore’s Law, price also implement ECMP (equal-cost
wars, and all that good stuff. multi-path) or other load-balancing
Obviously, it is not that simple. You algorithms. They do not have to have
have to consider other issues such as the same role for every flow going
redundancy, which requires multiple through them; they can load-balance
RCITS and a failover mechanism. An some, firewall others, and rewrite a
individual router needs to be smart few. Individual network elements are
enough to know how to route data thus significantly more flexible in an
between it and the RCITS, which may OpenFlow environment, even as they


shed some responsibilities to the centralized controller.

OpenFlow is designed to be used entirely within a single organization. All the routers in an OpenFlow domain act as one entity. The controller has godlike power over all the routers it controls. You would not let someone outside your sphere of trust have such access. An ISP may use OpenFlow to control its routers, but it would not extend that control to the networks of customers. An enterprise might use OpenFlow to manage the network inside a large data center and have a different OpenFlow domain to manage its WAN. ISPs may use OpenFlow to run their patch of the Internet, but it is not intended to take over the Internet. It is not a replacement for inter-ISP systems such as Border Gateway Protocol (BGP).

The switch to OpenFlow is not expected to happen suddenly and, as with all new technologies, may not be adopted at all.

Centralizing route planning has many benefits:

■ Takes Advantage of Moore’s Law. Not only are general-purpose computers cheaper and faster, but there is more variety. In the Google presentation at the 2012 Open Networking Summit, Urs Hölzle said Google does its route computations using the Google compute infrastructure.

■ Offers Deeper Integration. End-to-end communication can occur directly from the applications all the way to the controller. Imagine if every Web-based service in your enterprise could forward bandwidth requirements to the controller, which could then respond with whether or not the request could be satisfied. This would be a radical change over the “send and pray” architectures in use today.

■ Turns Network Hardware into a Commodity. The CPU and RAM horsepower required by the device is a function of the speeds and number of ports on the device as shipped. Therefore, it can be calculated during design, eliminating the need to factor in slack. Also, designing and manufacturing a device with a fixed configuration (not upgradable) is less expensive.

■ Makes Algorithms Simpler. Rather than making decisions based on inference and relying on cooperating algorithms, more-direct algorithms can be used. A dictatorship is the most efficient form of government. Suppose the VoIP phones of the campus EMS team should always get the bandwidth they need. It is easier to direct each router on campus to give EMS phones priority than it is to develop an algorithm whereby each router infers which devices are EMS, verifies a trust model, confirms that trust, and allocates the bandwidth, and hopes that the other routers are doing the same thing.

■ Enables Apps. The controller can have APIs that can be used by applications. This democratizes router features. Anyone (with proper authorization and access) can create network features. No open source ecosystem exists for applications that run within the Cisco IOS operating system. Creating one will be much easier in the OpenFlow world. Apps will not control individual routers (this is already possible) but the entire network as a single entity.

■ Allows Global Optimization and Planning. Current routing protocols require each router to optimize independently, often resulting in a routing plan that is optimal locally but not globally. The OpenFlow controller can plan based on a global, mostly omniscient, understanding of the network.

■ Provides Centralized Control. It is not that today’s routers lack APIs, but applications would need to communicate with the API of every router to get anything done, and I do not know any network engineer who is open to that. With OpenFlow, apps can talk to the controller, where authentication, authorization, and vetting can happen in one place rather than on every router.

Conclusion

In the past, smartphone vendors carefully controlled which applications ran on their phones and had perfectly valid reasons for doing so: quality of apps, preventing instability in the phone network, and protection of their revenue streams. We can all see now that the paradigm shift of permitting the mass creation of apps did not result in a “wild west” of chaos and lost revenue. Instead, it inspired entirely new categories of applications and new revenue streams that exceed the value of the ones they replaced. Who would have imagined Angry Birds or apps that monitor our sleep patterns to calculate the optimal time to wake us up? Apps are crazy, weird, and disruptive, and I cannot imagine a world without them.

OpenFlow has the potential to be similarly disruptive. The democratization of network-based apps is a crazy idea and will result in weird applications, but someday we will talk about the old days and it will be difficult to imagine what it was once like.

Related articles on queue.acm.org

Revisiting Network I/O APIs: The netmap Framework
Luigi Rizzo
http://queue.acm.org/detail.cfm?id=2103536

SoC: Software, Hardware, Nightmare, Bliss
George Neville-Neil, Telle Whitney
http://queue.acm.org/detail.cfm?id=644265

TCP Offload to the Rescue
Andy Currid
http://queue.acm.org/detail.cfm?id=1005069

References
1. Hoelzle, U. Keynote speech at the Open Networking Summit (2012); http://www.youtube.com/watch?v=VLHJUfgxEO4.
2. Katti, S. OpenRadio: Virtualizing cellular wireless infrastructure. Presented at the Open Networking Summit (2012); http://opennetsummit.org/talks/ONS2012/katti-wed-openradio.pdf.
3. Lin, G. Industry perspectives of SDN: Technical challenges and business use cases. Presentation at the 2012 Open Networking Summit; http://opennetsummit.org/talks/ONS2012/lin-tue-usecases.pdf (slides 6-7).
4. SIGCOMM Demo; http://www.openflow.org/wp/2008/10/video-of-sigcomm-demo/.

Thomas A. Limoncelli is an author, speaker, and system administrator. His best-known books include Time Management for System Administrators (O’Reilly, 2005) and The Practice of System and Network Administration, 2nd edition (Addison-Wesley, 2007). He works at Google in New York City on the Ganeti project (http://code.google.com/p/ganeti). See his blog at http://EverythingSysadmin.com.

© 2012 ACM 0001-0782/12/08 $15.00
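The hop-count reasoning this article attributes to a router thinking out loud (10 hops via port 37, four hops via port 20) can be sketched in a few lines of Python. This is only an illustration of the inference step, not any real routing protocol: the names `RouteTable` and `hear_neighbor` are hypothetical, and the sketch compares the neighbors' reported hop counts directly, as the article's example does.

```python
from typing import Dict, Tuple

# Hypothetical structure: network -> (outgoing port, hop count the neighbor reported)
RouteTable = Dict[str, Tuple[int, int]]


def hear_neighbor(table: RouteTable, port: int, advertised: Dict[str, int]) -> None:
    """Fold one neighbor's advertisement into the local route table.

    The router never asks questions; it only keeps the best clue heard so far.
    """
    for network, hops in advertised.items():
        current = table.get(network)
        if current is None or hops < current[1]:
            table[network] = (port, hops)


routes: RouteTable = {}
hear_neighbor(routes, 37, {"172.11.11.0": 10})  # "she is 10 hops away"
hear_neighbor(routes, 20, {"172.11.11.0": 4})   # "he is four hops away"
print(routes["172.11.11.0"])  # (20, 4): send packets for this network out port 20
```

Every router in the network repeats this same bookkeeping for itself, which is the duplicated effort the article argues a central controller could remove.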

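The decision table described in this article (select packets by port, MAC address, IP address, and so on, then drop, forward, or mutate then forward) might look roughly like the following Python sketch. It is a deliberately simplified model, not the OpenFlow wire format or any controller's API: the `Rule` class, the field names such as `in_port` and `ip_dst`, and the choice to drop unmatched packets are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Packet = Dict[str, object]


@dataclass
class Rule:
    match: Packet                      # header field -> required value; absent fields are wildcards
    action: str                        # "drop", "forward", or "rewrite_forward"
    out_port: Optional[int] = None
    rewrite: Packet = field(default_factory=dict)


def apply_table(table: List[Rule], packet: Packet) -> Optional[Tuple[Optional[int], Packet]]:
    """Return (out_port, packet) for the first matching rule, or None if the packet is dropped."""
    for rule in table:
        if all(packet.get(k) == v for k, v in rule.match.items()):
            if rule.action == "drop":
                return None            # firewall-like behavior
            out = dict(packet)
            if rule.action == "rewrite_forward":
                out.update(rule.rewrite)   # NAT-like behavior: mutate, then forward
            return rule.out_port, out
    return None                        # assumption of this sketch: no matching rule means drop


# A table the controller might download to one router (hypothetical values).
table = [
    Rule({"ip_dst": "10.0.0.9"}, "drop"),
    Rule({"in_port": 1, "ip_dst": "172.11.11.5"}, "rewrite_forward", 20, {"ip_dst": "192.168.1.5"}),
    Rule({"in_port": 1}, "forward", 20),
]
```

The same device firewalls one flow, rewrites another, and plainly forwards the rest, which is the per-flow flexibility the article describes; the router itself only interprets the table.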


practice
doi:10.1145/2240236.2240252

Article development led by


queue.acm.org

Increasing parallelism
demands new paradigms.
By Rafael Vanoni Polanczyk

Extending the
Semantics
of Scheduling
Priorities
Application performance is directly affected by the hardware resources required, the degree to which such resources are available, and how the operating system addresses the requirements with regard to the other processes in the system. Ideally, an application would have access to all the resources it could use and be allowed to complete its work without competing with any other activity in the system. In a world of highly shared hardware resources and general-purpose, timeshare-based operating systems, however, no guarantees can be made as to how well resourced an application will be.

What can be done to improve both the way in which applications are developed and how the underlying layers of the software stack operate, in order to gain better overall utilization of shared hardware resources? Extending some of the semantics of scheduling priorities to include priority over shared resources could allow the performance-critical components of applications to execute with less-contended access to the resources they require.

The past decade has seen the emergence of the multicore processor and its subsequent rapid commoditization. Software developers, whether at the application level or in the broader design of systems, simply cannot ignore this dramatic change in the landscape. While not all problems require a parallel implementation as a solution,1 this opportunity must be considered more seriously than ever before. Both academia and the industry have followed this trend by educating and advocating concurrent software development, while also seeking new



techniques to exploit hardware parallelism. Unfortunately, many details about developing concurrent applications on modern processors have mostly slipped through the cracks, only to be found in white papers, blogs, and similar watering holes of the performance community.

Illustration by Mikael Hvidtfeldt Christensen

What developers are finding more often than not is that sizing parallel applications is not as straightforward as it once seemed. Until quite recently the one or two processors available within a single processor chip did not cause more contention for shared resources than perhaps two software threads fighting over shared cache space. Nowadays, not only are there several levels of sharing between logical CPUs (shared execution pipeline, floating-point units, different cache levels, among others), but also these many more CPUs make that sharing a much more complicated problem.

If this growing multiprocessing scale and its associated microarchitectural complexity were not enough, modern processors are also dynamically adapting their processing capacity based on the current utilization in an attempt to provide applications with the resourcing they need. Intel’s Turbo Boost feature increases the processor’s operating frequency as fewer cores are active and thermal conditions allow. The SPARC T4 processor, in contrast, dynamically allocates core resources to its active hardware threads, incrementally benefiting the few active threads by having more inactive ones. Both features are in essence enabling heterogeneous applications by improving single-threaded performance.

This new landscape poses new questions for you as a developer and system administrator: How many threads should your workload create? What resources do they require? Are all threads equally important? How should you size shared data structures? What should you tell the operating system (or more generically, the underlying layer) about your application?

Provisioning Threads in Multithreaded Applications

Although the simple classic recipe of one software thread for each logical CPU may still be valid in some cases, it can no longer be applied indiscriminately.2 Parallel applications must know which portions of their program may require resources that are not widely available in the system. With that knowledge and some understanding of the possible deployment platforms, applications may be able to size themselves by matching their hardware requirements to what the underlying layer has to offer. Failure to do this properly leads to either contention over shared resources (as too many threads compete for them) or the underutilization of resources that the system otherwise had available.

For homogeneous multithreaded applications, those in which all threads perform similar tasks (and therefore have similar requirements), you could simply partition the available resources into n slices according to how much of each resource a single thread will require. A scientific application that makes heavy use of floating point might create one thread per available floating-point unit (FPU) in the system (or two or three, if they can all take turns while executing their floating-point sections). For heterogeneous workloads, on the other hand, it may be advantageous to set aside more resources for specific threads. For example, in a producer/consumer architecture with a single producer and various consumers, giving the producer as many resources as it can take advantage of would likely be beneficial, trying to keep the consumers as busy as possible. This dependency relationship between producer and consumers is the primary point of contention in the application, making the producer its most critical component.

You may also want to exploit the knowledge of sharing relationships to take advantage of the dynamic resourcing features in recent processors. In other words, you can manually create the conditions that allow them to come into play. In this case the goal is not simply to prevent performance degradation by reducing the number of threads competing for some necessary component; you want the application to take advantage of all the performance you can get from the processor. In the previous producer/consumer example, the throughput of the application would likely increase if you placed the producer on a dedicated core, granting it exclusive access to all the hardware resources within that component.

Consequences of Virtualization

Virtualization mechanisms are a confounding factor, as they will often hide the details of the underlying architecture and the current utilization levels of its various components (such as obscuring direct access to the physical performance counters on the processor). This may prevent an application from determining the available physical resources, and therefore make it unable to size itself correctly by distributing its requirements across the available resources. It may also prevent the application from monitoring the consequences of its own behavior and adapting to changes in system utilization. Thankfully, these limitations can be circumvented if the application uses its own mechanisms to evaluate performance and capacity. For example, the application can run a micro-benchmark during startup, and/or periodically as it runs, to evaluate how the system is performing according to its own metrics. It can then use that information to adapt to the current conditions.

In the end, even a correctly sized parallel application still relies on the operating system (or on the underlying runtime environment) for mechanisms to provision threads, as well as for appropriate scheduling decisions for correct thread placement. Unfortunately, operating systems have traditionally offered only very simple mechanisms to provision specific threads. For example, the use of processor sets in Solaris to provide an entire core (and its otherwise shared components) to certain threads in an application is a reasonably well-known tuning method used by field engineers and specialized customers. Process binding is also used when manually placing processes and threads to ensure a desired behavior. These mechanisms are too static, however. They require manual intervention and are usually too expensive for this purpose. A preferable solution would be to provision threads more accurately with the resources they require as they become runnable, without user intervention, leveraging the knowledge that developers have of their applications and reducing the amount of work (or interference) required by the operating system or the system administrator.

Priority Over Shared Resources

The current implementation and semantics of scheduling priorities date from a period when single-processor systems were the norm. Resources were very limited and had to be correctly divided among threads in the system by allowing them to run for determined periods of time according to their priority. The recent emergence of

Example topologies for SPARC T4 and Intel Xeon processors.
[Figure: processor group hierarchies. SPARC T4: one mpipe group (CPUs 0-63); eight FPU groups and eight ipipe groups, each covering CPUs 0-7, 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, and 56-63. Intel Xeon: one active power group and one cache group (CPUs 0-7); four ipipe groups and four idle pwr groups, each covering CPUs 0,1; 2,3; 4,5; and 6,7.]


systems with a large number of processors has fundamentally changed the scenario. Given the large number of resources available, threads no longer compete just for processor time, but also for shared hardware resources.

This scheduling model fails to recognize the sharing aspects of today’s processors, allowing for some performance anomalies that are sometimes difficult to address. Consider, for example, a high-priority thread competing over a specific resource with a set of “hungry” lower-priority threads. In this case, it would be desirable to extend the implementation of priorities to include priority over shared resources. The operating system could then choose to move the lower-priority threads away from where the higher-priority one is running or to find a more appropriate place for it to execute with less-contended resources.

This extension presents a new method through which developers and system administrators can specify which components of an application should be more or less provisioned. It is a dynamic, unobtrusive mechanism that provides the necessary information for the operating system to provision threads more effectively, reducing contention over shared resources and taking advantage of the new hardware features discussed previously. Furthermore, the new behavior is likely to benefit users who already identify threads in their applications with different levels of importance (an important aspect of this work, for practical reasons).

Additionally, several other aspects of priorities play to our advantage. Since the proposed “spatial” semantics will determine how many resources will be assigned to threads, it is critical that this mechanism is restricted to users with the appropriate privileges, already a standard aspect of priorities in all Unix operating systems. Priorities can also be applied at different levels: at the process, thread, or function level, allowing optimizations at a very fine granularity.

Load Balancing and Priorities

Load balancing is perhaps one of the most classic concepts in scheduling. Modulo implementation details, the basic idea is to equalize work across execution units in an attempt to have an even distribution of utilization across the system. This basic assumption is correct, but the traditional implementation of load balancing does not perform well in heterogeneous scenarios unless the scheduler is capable of identifying the different requirements of each thread and the importance of each thread within the application.

A few years ago the Solaris scheduler was extended to implement load balancing across shared hardware components in an effort to reduce resource contention. We had discovered that simply spreading the load across all logical CPUs was not enough: it was also necessary to load-balance across groups of processors that share performance-relevant components. To implement this policy, Solaris established the Processor Group abstraction. It identifies and represents shared resources in a hierarchical fashion, with groups that represent the most-shared components (pipe to memory, for example) at the top and groups that represent the least-shared ones at the bottom (such as execution pipeline). The accompanying figure illustrates the processor group topology for two different processors: the SPARC T4 and Intel Xeon processors, with each hardware component and the CPUs they contain.

Each processor group maintains a measure of its capacity and utilization, defined as the number of CPUs and running threads in a particular group. This information is then incorporated by the scheduler and used when deciding where to place a software thread, favoring groups where the utilization:capacity ratio would allow it to make the most progress.

Performance-Critical Threads

The processor group abstraction and the associated load-balancing mechanism for multicore, multithreaded processors successfully reduced contention at each level of the topology by spreading the load equally among the various components in the system. That alone, however, did not account for the different characteristics and resource requirements of each thread in a heterogeneous application or workload. To address this issue Solaris recently extended its load-balancing mechanism so that a thread’s notion of utilization (or required resources) is proportional to its scheduling priority. This allows the scheduler to load-balance lower-priority threads away from where high-priority threads are running, automatically reducing contention for resources. With some simple heuristics, you can safely assume that if a high-priority thread has enough CPU utilization to take advantage of the existing hardware resources, then it should be granted as much access to them as possible.

Automatically identifying which threads or portions of an application should be assigned a higher priority is not a simple task. There is no single characteristic that could allow us to make such assumptions for a wide variety of workloads; one could point out several cases where threads of widely different resource requirements are considered critical in the context of their applications. Most critical threads or components, however, are at the top of a dependency relationship in heterogeneous environments. From the startup components of applications to producer/consumer scenarios, any component upon which other parts of the application depend can be considered critical, and should be assigned a suitably high priority. Such dependency relationships, however, are not easily observable without some previous knowledge of the architecture of the application.

In Solaris 11, once a performance-critical component or thread is identified, the developer or system administrator has only to place it in the fixed-priority scheduling class at priority 60 or at any real-time priority. The scheduler will then artificially inflate its load according to the underlying platform, attempting to improve its performance by allowing it to execute with more exclusive access to hardware resources. It is important to note that this optimization was implemented to take advantage of the available resources in the system. In other words, if all of the system’s logical CPUs are required, no single thread will be forced to wait for the benefit of a single high-priority thread.

This implementation also optimizes differently according to the underlying hardware architecture. On sun4v systems, the scheduler will attempt to provision a performance-critical thread with all of the CPUs sharing an execution pipeline, and a quarter of the CPUs sharing a physical chip on x86 systems. These policies are optimized for both known sources of contention and dynamic resourcing features in their respective platforms. A simple example of the desired behavior would be to have an entire core devoted to a single high-priority thread on a SPARC T4 system, while all the other lower-priority threads share the remaining resources (again, as long as enough idle resources are available in the system).

Conclusion

The contemporary landscape of increasing parallelism requires new paradigms. These will affect developers and system administrators at a number of levels in developing new applications and systems. Some are occupied with the considerations of mechanisms at the level of the hardware, virtualization, and operating system. Application developers must have suitable means to designate critical elements of their applications and to interact with the underlying system software to ensure those elements are given the special resourcing that they require.

Acknowledgments

I am indebted to Eric Saxe, Jonathan Chew, and Steve Sistare who, among a broader number of colleagues in the Solaris Kernel Group, were particularly helpful in the development of the ideas presented in this article.

Related articles on queue.acm.org

Real-World Concurrency
Bryan Cantrill and Jeff Bonwick
http://queue.acm.org/detail.cfm?id=1454462

Performance Anti-Patterns
Bart Smaalders
http://queue.acm.org/detail.cfm?id=1117403

References
1. Cantrill, B. and Bonwick, J. Real-world concurrency. ACM Queue 6, 5 (2008).
2. Smaalders, B. Performance anti-patterns. ACM Queue 4, 1 (2006).

Rafael Vanoni Polanczyk is a software developer in the Solaris Kernel Group at Oracle, where he spends most of his time working on the scheduler/dispatcher subsystem. Rafael lives in San Francisco and is originally from Porto Alegre in southern Brazil, where he received a B.Sc. in computer science from UFRGS (Universidade Federal do Rio Grande do Sul).

© 2012 ACM 0001-0782/12/08 $15.00
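The sizing advice in this article for homogeneous, floating-point-heavy applications (one thread per available FPU, or two or three if they can take turns) amounts to a small calculation. The Python sketch below is only illustrative: `size_worker_pool` and its parameters are hypothetical, and the application is assumed to already know how many logical CPUs share each FPU, which a virtualized environment may hide.

```python
def size_worker_pool(logical_cpus: int, cpus_per_fpu: int, threads_per_fpu: int = 1) -> int:
    """Pick a thread count for floating-point-bound workers.

    logical_cpus:    logical CPUs visible to the application
    cpus_per_fpu:    how many logical CPUs share one FPU (platform knowledge, assumed known)
    threads_per_fpu: 1, or 2-3 if threads can take turns in their floating-point sections
    """
    fpus = max(1, logical_cpus // cpus_per_fpu)
    # Never create more threads than there are logical CPUs to run them.
    return min(logical_cpus, fpus * threads_per_fpu)


# Using the SPARC T4 topology from the figure: 64 logical CPUs, 8 per FPU group.
print(size_worker_pool(64, 8))      # 8
print(size_worker_pool(64, 8, 2))   # 16
```

A heterogeneous workload would not use a single number like this; as the article notes, it would instead set aside extra resources for its critical threads.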

52 Communications of the ACM | August 2012 | Vol. 55 | No. 8


doi:10.1145/2240236.2240253

Article development led by queue.acm.org

Multitier Programming in Hop

A first step toward programming 21st-century applications.

By Manuel Serrano and Gérard Berry

The Web is becoming the richest platform on which to create computer applications. Its power comes from three elements: modern Web browsers enable highly sophisticated graphical user interfaces (GUIs) with 3D, multimedia, and fancy typesetting, among others; calling existing services through Web APIs makes it possible to develop sophisticated applications from independently available components; and open-data availability allows applications to access a wide set of information that was previously unreachable or simply did not exist. The combination of these three elements has already given birth to revolutionary applications such as Google Maps, radio podcasts, and social networks.

The next step is likely to be incorporating the physical environment into the Web. Recent electronic devices are equipped with various sensors (GPS, cameras, microphones, metal detectors, speech commands, thermometers, motion detection, and so on) and communication means (IP stack, telephony, SMS, Bluetooth), which enable applications to interact with the real world. Web browsers integrate these features one after the other, making the Web runtime environment richer every day. The future is appealing, but one difficulty remains: current programming methods and languages are not ideally suited for implementing rich Web applications. This is not surprising, as most were invented in the 20th century, before the Web became what it is now.

Traditional programming languages have trouble dealing with the asymmetric client-server architecture of Web applications. Ensuring the semantic coherence of distributed client-server execution is challenging, and traditional languages have no transparent support for physical distribution. Thus, programmers need to master complex gymnastics for handling distributed applications, most often using different languages for clients and servers. JavaScript is the dominant Web language but was



practice

conceived as a browser-only client language. Servers are usually programmed with quite different languages such as Java, PHP, and Ruby. Recent experiments such as Node.js propose using JavaScript on the server, which makes the development more coherent; however, harmonious composition of independent components is still not ensured.

In 2006, three different projects, namely Google Web Toolkit (GWT), Links from the University of Edinburgh [1], and Hop from INRIA (http://www.inria.fr) [5], offered alternative methods for programming Web applications. They all proposed that a Web application should be programmed as a single code for the server and client, written in a single unified language. This principle is known as multitier programming.

Links is an experimental language in which the server holds no state, and functions can be symmetrically called from both sides, allowing them to be declared on either the server or the client. These features are definitely interesting for exploring new programming ideas, but they are difficult to implement efficiently, making the platform difficult to use for realistic applications.

GWT is more pragmatic. It maps traditional Java programming into the Web. A GWT program looks like a traditional Java/Swing program compiled to Java bytecode for the server side and to JavaScript for the client side. Java cannot be considered the unique language of GWT, however. Calling external APIs relies on JavaScript inclusion in Java extensions. GUIs are based on static components declared in external HTML files and on dynamic parts generated by the client-side execution. Thus, at least Java, JavaScript, and HTML are directly involved.

The Hop language takes another path, relying on a different idea: incorporating all the required Web-related features into a single language with a single homogeneous development and execution platform, thus uniformly covering all the aspects of a Web application: client side, server side, communication, and access to third-party resources. Hop embodies and generalizes both HTML and JavaScript functionalities in a Scheme-based [3] platform that also provides the user with a fully general algorithmic language. Web services and APIs can be used as easily as standard library functions, whether on the server side or client side.

Multitier Programming and HTML Abstraction
This article presents the Hop approach to Web application programming, starting with the basic example of a Web file viewer. In the first version, files and directories are listed on a bare page. Directories are clickable to enable dynamic browsing, and plain file names are displayed. This simple problem illustrates the most central aspect of realistic Web applications: the tight collaboration and synchronization of servers and clients. Here, the server owns the files while the browser visualizes them. When the user clicks on a directory, the client requests new information from the server. Upon receipt, the client updates the page accordingly. The file viewer is easy to implement if each request delivers a new complete page. By adding the requirement that the file viewer should be a widget embedded inside a complex Web page, the problem becomes slightly more difficult, since updates now concern only incremental subparts of the documents. This demands a new kind of collaboration between the server and the client.

Figure 1. A file browser Web widget.

 1: (define-tag <BROWSER> ((id (gensym))           ;; HTML identifier
 2:                        (class #f)              ;; HTML class
 3:                        (path::bstring "/tmp")) ;; default initial path
 4:   (define (<dir-entry> path)
 5:     ;; function to build HTML display for directories
 6:     (<DIV> :data-hss-tag "dir-entry"
 7:       :onclick ~(with-hop ($(service () (<directory> path)))
 8:                   (lambda (h)
 9:                     (innerHTML-set! $id h)))
10:       name))
11:   (define (<file-entry> path)
12:     ;; function to build HTML display for files
13:     (<DIV> :data-hss-tag "file-entry"
14:       (basename path)))
15:   (define (<directory> path)
16:     ;; main display function
17:     (cons
18:       ;; HTML representation of the parent directory
19:       (<dir-entry> (make-file-name path ".."))
20:       ;; HTML representation of files
21:       (map (lambda (np)
22:              (if (directory? np)
23:                  (<dir-entry> np)
24:                  (<file-entry> np)))
25:            (sort string<? (directory->path-list path)))))
26:   ;; main HTML element
27:   (<DIV> :data-hss-tag "browser"
28:     :id id :class class
29:     (<directory> path)))

Figure 2. Hop compiled into HTML+JavaScript.

<div data-hss-tag='dir-entry'
     onclick='HOP.with_hop( "/hop/anonymous314",
                function( h ) {
                  var el3 = document.getElementById( "g1592" );
                  el3.innerHTML = HOP.eval( h );
                } )'>
  /tmp
</div>
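
To make the difference between a full-page reload and an incremental widget update concrete, here is a minimal JavaScript sketch. It is not Hop code and not the output of the Hop compiler: the names renderDirectory, refreshWidget, and fakeServer are hypothetical, the "server" is an in-memory object, and the "DOM element" is a plain object with an innerHTML field, so that the sketch runs without a browser.

```javascript
// Hypothetical sketch: none of these names belong to Hop.

// Server-side stand-in: turn a directory path into an HTML fragment.
// Entries are sorted, as in the article's file browser. No HTML escaping
// is done here; a real implementation would need it.
function renderDirectory(listing, path) {
  const entries = (listing[path] || []).slice().sort();
  return entries.map((name) => `<div>${name}</div>`).join("");
}

// Client-side stand-in: request a fragment, then patch one element only.
// requestFragment(path, reply) is an assumed transport; a browser would
// use an asynchronous HTTP request here.
function refreshWidget(element, path, requestFragment) {
  requestFragment(path, (html) => {
    element.innerHTML = html; // only the widget subtree changes
  });
}

// Usage: the page header is untouched while the widget is refreshed.
const listing = { "/tmp": ["b.txt", "a.txt"] };
const page = {
  header: "<h1>Files</h1>",
  widget: { id: "g1592", innerHTML: "" },
};
const fakeServer = (path, reply) => reply(renderDirectory(listing, path));

refreshWidget(page.widget, "/tmp", fakeServer);
console.log(page.widget.innerHTML);
```

The point of the sketch is the shape of the collaboration: the server produces a fragment rather than a page, and the client decides which element receives it.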


Implementing such a Web client-server file viewer with a single Hop code is almost as straightforward as implementing it for a single computer, because HTML is built in and execution partitioning between servers and clients is automatic. Figure 1 shows a complete Hop solution to this problem.

Hop identifiers can contain special characters such as “<” and “?”. In Hop, directory?, string=?, and <SPAN> are legal identifiers. In the example, <BROWSER> is the name of a variable bound to a function that creates an HTML div element.

HTML objects are created by Hop functions with the same names, (<DIV> …), (<SPAN> …), among others; no HTML closing tag is needed because the code deals with parenthetical functional expressions instead of texts. The full power of Scheme is used to build DIVs and SPANs. The <dir-entry> and <file-entry> auxiliary functions are used to build fragments of the final HTML document. The map operation in the main <directory> function is used to build the global HTML page out of these HTML fragments, with help from the Hop sort function to sort the printed output alphabetically. Figure 2 shows how values produced by the <dir-entry> function are compiled into HTML and JavaScript.

The ~ and $ constructs appearing at line 7 of Figure 1 are the multitier programming operators. Let us explain how they work. Expressions prefixed by ~ are client-side expressions evaluated by the browser; unannotated expressions are evaluated on the server. Compiling is done on the server. A client code is seen as a value, computed or elaborated by the server and automatically shipped to the client on demand. The $ construct is used to inject a server value within the client code at elaboration time. For example, ~(alert $(hostname)) prints the server name on the client screen. In the example of Figure 1, line 7 provokes the injection of an automatically generated service named anonymous314 in the HTML excerpt of Figure 2. The ~-prefixed expression at line 7 specifies the action to be executed by the client when the user clicks on a directory. The action invokes the anonymous service, called by the client and run on the server.

Runtime communication between servers and clients involves services that extend the notion of function. A service is the binding of a function to a URL. The URL is used to invoke the function remotely, using the special form (with-hop (<service> <arguments>) <callback>). The arguments are marshaled for network transmission and the remote procedure call is initiated; on remote-call completion, the callback is invoked on the caller side with the unmarshaled remote-call result. Extra options control how errors and timeouts are handled (these are not presented in this article).

The example here contains one anonymous service defined in the <dir-entry> auxiliary function. This service is invoked each time a user clicks on a directory. It calls the <directory> function that traverses the server file system. Because services are statically scoped, just like functions, the <directory> identifier in the service code refers to the <directory> server function defined later in the same mutually recursive define clause. The path parameter is bound by the definition of the <dir-entry> function and used in the service call from within the client code. Lexical binding is the same for the server and client code.

Once the <BROWSER> function is defined, it can be freely used as a new HTML tag in any regular HTML element within Hop. For example, the code shown in Figure 3 instantiates two file browsers in a table.

Making variations on <BROWSER> that impact both the server and the client is very easy. Figure 4 shows a one-line modification of the <file-entry> function body that displays the file name and size.

Hop relies on a mixed dynamic/static type discipline. When a variable or a function is annotated with a type, the compiler uses this type to statically check type correctness and improve the generated code. In detecting errors at compile time, Hop does not compete with statically typed languages such as ML or Haskell but trades static guarantees for expressiveness: for example, it supports programming constructs that are out of reach of statically typed languages, such as a CLOS-like object-oriented programming style. Also note that the children of HTML nodes can be of any type (a list, string, number, another node, and others) without requiring type annotations, as specified by the W3C recommendation of XML Schema for HTML. In the example, only the path variable is type-annotated.

A Hop widget may declare its own CSS (Cascading Style Sheets) rules to abstract away its implementation details. The declaration in Figure 5 creates a new CSS type associated with the <BROWSER> tag. Figure 6 shows how one can use the browser CSS type to add extra icons before directory and leaf nodes.

Figure 3. A Web file browser.

(<HTML>
  (<TABLE>
    (<TR>
      (<TD> (<BROWSER> :class "usr" :path "/usr"))
      (<TD> (<BROWSER> :class "local" :path "/usr/local")))))

Figure 4. Modified file browser for file size.

(define (<file-entry> path)
  (<DIV> :data-hss-tag "file-entry"
    (basename path)
    (<SPAN> (round (file-size path) 1024)))) ;; <-- added line for size


Figure 5. Extending CSS with a new type and new attributes.

$(define-hss-type browser "div[data-hss-tag=browser]"
   (define-hss-property (directory-image v)
      (format "div[data-hss-tag=dir-entry] {
         background-image: ~l;
         background-position: left;
         background-repeat: no-repeat;
         padding-left: 2em;
      }" v))
   (define-hss-property (file-image v)
      (format "div[data-hss-tag=file-entry] {
         background-image: ~l;
         background-position: left;
         background-repeat: no-repeat;
         padding-left: 2em;
      }" v)))

Figure 6. CSS user rules.

browser {
  directory-image: url( http://nowhere.org/dir.png );
  leaf-image: url( http://nowhere.org/leaf.png );
}

browser.blue {
  directory-image: url( http://nowhere.org/blue/dir.png );
  color: blue;
  font-weight: bold;
}

In Hop, HTML objects on the server are members of a full-fledged data structure that implements the abstract syntax tree of the client HTML document; this contrasts with most Web environments in which they are reduced to bare strings of characters. A Hop server computation elaborates a DOM (document object model) representation of HTML similar to the one used by the browser, compiles it on the fly to actual HTML, and ships it on browser demand.

This multistage process eliminates several drawbacks of traditional Web frameworks. First, the Hop runtime environment guarantees several properties of the generated HTML, such as syntactic correctness; for example, HTML tag misspellings are reported as unbound-variable errors. Second, a single document can be automatically compiled into different HTML versions on demand. For example, the same document can be optionally compiled into a mix of HTML4 and Flash for old browsers, into HTML5 for more capable browsers, or even into XHTML+RDFa for semantic annotations.

Most current client-side Web libraries abstract over HTML by providing a set of JavaScript functions that accept regular HTML nodes as parameters; typically, DIV and SPAN are used as new widget containers. This approach has several drawbacks. First, HTML extensions do not look like regular tags. The implementation of a GUI thus requires assembling different formalisms. Second, extension initializations are complex to schedule. In general, the application must resort to window onload JavaScript events to ensure that the API constructors are not called before the DOM tree is fully built. Third, to configure the graphical rendering of the extension, implementation details must be unveiled to let programs designate the HTML elements to be configured. This jeopardizes maintainability.

Code Reuse
The example file-viewer application presented here involves the main ingredients of a Web application (a distributed architecture, HTML abstraction, and CSS configurations), but it is still too simple to be realistic. To develop richer Web applications with comparatively little effort, it is now common practice to rely on the wide set of publicly available JavaScript APIs. All Web frameworks offer one way or another of using them, often through backdoors that let the programmer insert alien JavaScript calls into the natively supported language.

Figure 7. A PDF viewer in a Web browser.

Hop follows a different approach by integrating these APIs into its language. Calling a JavaScript function or creating a JavaScript object from Hop is as easy as manipulating a standard Hop entity. Furthermore, once compiled, client-side Hop-generated JavaScript can be distinguished from handwritten JavaScript only by the name


mangling used to map Hop identifiers to JavaScript identifiers. Everything else is identical: Hop data, variables, and functions are directly mapped to their JavaScript counterparts.

The direct integration of JavaScript APIs within Hop makes Web-platform development by component combination easy. To illustrate API integration, we show how to write a PDF viewer that displays the contents of files with the nice visual effect of turning the pages of an actual book. Figure 7 shows a snapshot of this application, and Figure 8 displays the actual source code. It is based on jQuery, a popular JavaScript library; a jQuery plug-in called turn.js (http://www.turnjs.com), which implements the page-turning effect; and the JavaScript PDF previewer implemented by the Mozilla Foundation (https://github.com/andreasgal/pdf.js). It is assumed that the PDF previewer is packaged in the same way as the <BROWSER> first presented. In the code, the method turn initializes the turn.js plug-in. It prepares an HTML element for use as a book container. This JavaScript method is directly used as a regular Hop function.

Figure 8. A PDF viewer invoking two JavaScript APIs.

(<HTML>
  (<HEAD>
    ;; Third-party JavaScript libraries
    :jscript "http://code.jquery.com/jquery-1.7.1.min.js"
    :jscript "http://www.turnjs.com/turn.min.js"
    ;; Hop bindings for external JavaScript
    :include "pdfjs-api-1.0.0.hz")
  (<DIV>
    ~(add-event-listener! window "load"
       (lambda ()
         ;; initialize the turnjs plugin when the client document is ready
         (let ((bk (jQuery "#book"))) ;; direct use of a jQuery function
           (bk.turn))))               ;; direct use of a turnjs method
    (<DIV> :id "book"
      (map (lambda (i)
             ;; create one HTML element per PDF page
             (<PDF> :src (make-file-path REPOSITORY "queue12.pdf") :page i))
           (iota 10 1)))))

The turn.js API automatically invokes JavaScript listeners before and after page turns. This feature can generate page content on demand on the server. Figure 9 presents a complete example where the client requests page contents from the server right before turning a page. This example uses the turn.js method bind to bind an anonymous Hop function to user events, in this case associating a user function with page-turn events. This shows that Hop functions can be invoked directly by JavaScript as regular JavaScript functions.

Figure 9. A dynamically generated book using a JavaScript API.

(define-service (filebook count)

  (define cache
    ;; a local cache to avoid computing the same page several times
    (make-vector count #f))

  (define getpage
    ;; the service that returns the page content
    (service (i)
      (unless (vector-ref cache i)
        ;; not in the cache, fill the page with a random quotation
        (vector-set! cache i (system->string "fortune")))
      ;; return the cache entry
      (<BLOCKQUOTE> (vector-ref cache i))))

  (<HTML>
    (<HEAD>
      :jscript "http://code.jquery.com/jquery-1.7.1.min.js"
      :jscript "http://www.turnjs.com/turn.min.js"
      :css "filebook.hss")

    ~(define (turnpage i)
       ;; function called for each visualized page
       (with-hop ($getpage i)
         (lambda (h)
           ;; insert result of the getpage call in the i-th div
           (innerHTML-set! (format "page-~a" i) h))))

    ~(add-event-listener! window "load"
       ;; initialize the turn plugin when the client document is ready
       (lambda ()
         (let ((el (jQuery "#dir")))
           (el.turn)
           (el.bind "turning"
             (lambda (page)
               ;; function invoked by the turnjs plugin when
               ;; the user clicks on a next/prev page button
               (turnpage page)
               (turnpage (+ 1 page)))))))

    (<DIV> :id "dir"
      ;; create one empty div per document page; these divs are
      ;; filled by the client-side turnpage function
      (map (lambda (i) (<DIV> :id (format "page-~a" i) ""))
           (iota count)))))

No platform other than the Web has ever let anyone write sophisticated GUIs so simply, by combining APIs and codes provided by different parties.

Open Data
Governments and public organizations have recently understood that the data they generate is a valuable asset. New companies such as Data Publica were created with the sole purpose of collecting, organizing, and redistributing open data.

Traditionally, open data was exogenous to the Web. For example, the
enous to the Web. For example, the


French government agency INSEE (National Institute of Statistics and Economic Studies; http://www.insee.fr), created in the middle of the 20th century, is in charge of conducting all sorts of statistical analyses of the population and economy. INSEE has become an important open data provider.

Surprisingly, the data might be endogenous, too. Google’s flu trends analysis is a remarkable example of this kind. Google researchers have observed that the number and locales of search requests about flu are strongly related to the propagation of the illness [2]. Google makes this data available country by country, making it easy for anyone to implement applications using these statistics.

Here, we demonstrate how to create a video showing the propagation of the flu over time. This is mostly a tool-combination problem since all the hard work can be delegated to external parties. Thus, the work shown here consists only of collecting, combining, and reusing all the tools and data at our disposal. The actual Hop implementation, shown in Figure 10, counts no more than 15 lines of code.

Figure 10. Visualizing the flu propagation worldwide.

(module flutrend
   (import (gchart "gchart-api-1.0.*.hz")))

(<HTML>
   ~(define sources
       $(map (lambda (data) (google-chart :type 'map :data data))
             (with-input-from-file "http://www.google.org/flutrends/data.txt"
                (lambda (p)
                   ;; skip legal Google notice
                   (read-lines p 11)
                   ;; parse the actual flu trends values
                   (read-csv-records p)))))
   (<IMG> :src (google-chart :type 'map :area 'europe)
          :onload ~(when (pair? sources)
                      (after 300
                         (lambda ()
                            (set! this.src (car sources))
                            (set! sources (cdr sources)))))))

The application code relies on a binding that makes the Google Chart API directly accessible from Hop. This Web API creates all kinds of charts and geographical maps, generating images according to the arguments specified on appropriate URLs. This API is combined with the extraction of flu statistics. First, image URLs are generated out of statistical values parsed using a Hop spreadsheet-elements library. Then images generated on the fly by Google Chart are displayed one after the other, every 300 milliseconds.

This example shows that new and deep knowledge might emerge when the ability to collect, compare, and analyze vast sets of open data is easily accessible. Hop is specifically geared toward this objective, with composition and reuse as the two central features.

Bigger than the Web
The human Web can be made even bigger by binding it to sensors and actuators in the physical environment, such as those provided by smartphones, multimedia equipment, home or automotive automation equipment, and the countless other electronic devices that surround us. However, only the most popular of these sensors will be natively supported by the next Web browsers. For example, the upcoming versions of the Firefox and Chrome browsers have announced support for video and audio capture; accessing other remote equipment through browsers or programmed applications will still require third-party support.

Figure 11. Reacting to SMS.

(module androidemo
   (library mail phone hopdroid))

(define android (instantiate::androidphone))
(define tts (instantiate::androidtts (phone android)))

(define-service (phone/sms)

   (define speak #t)

   (add-event-listener! android "sms-received"
      (lambda (e::event)
         ;; function invoked on the server (i.e., the phone)
         ;; upon sms-reception
         (let ((val e.value))
            ;; broadcast to all clients the reception of the sms
            (hop-event-broadcast! "sms" val)
            ;; speak out the sms content on the phone
            (when speak (tts-speak tts (cadr val))))))

   ;; create two empty HTML elements that will be filled
   ;; upon SMS reception
   (let ((sms-num (<TD>))
         (sms-body (<TD>)))
      (<HTML>
         ~(add-event-listener! server "sms"
             ;; function invoked on the client (i.e., a web browser)
             ;; upon reception of the server "sms" event
             (lambda (e)
                ;; insert the calling number in the HTML page
                (innerHTML-set! $sms-num (car e.value))
                ;; insert the SMS content in the HTML page
                (innerHTML-set! $sms-body (cadr e.value))))
         (<TABLE> (<TR> sms-num sms-body))
         (<BUTTON> :onclick ~(with-hop ($(service () (set! speak (not speak)))))
            "mute/speak"))))

The Hop philosophy is to migrate the remote interface facilities to the server, give the user full Hop programming power on the associated data, and deliver results to clients in the standard way when using browser-based interfaces. For example, Figure 11 shows how to implement a Hop application that integrates features provided by a mo-


bile phone inside a Web application. The contents of incoming SMSs are displayed in a browser window, letting users choose whether or not an SMS should be spoken aloud.

This application uses three separate tiers: a Web browser, a Web server, and a phone server. The phone server is responsible for receiving the SMSs and speaking them on demand. The Web server plays the role of orchestrator; when the phone notifies it that a new SMS has been received, it alerts the interested clients and, depending on the user configuration, speaks out the short message.

The application takes advantage of another useful feature of Hop: allowing a server to be a new source of program-generated Web events for clients or other servers. In the phone example, the server starts by establishing a bridge with the phone (using the android variable), then waits for the notification of an SMS and accesses the phone text-to-speech facility (using the tts variable).

The server code registers a function invoked each time a new SMS is received to forward a software Web event to interested Web clients.

Implementation
Hop is an open source project whose source code can be downloaded from the Web site http://hop.inria.fr. All the actual source code shown in this article can also be downloaded from this Web site. The development kit contains the compilers, interactive documentation, various tools for creating and installing applications, and the runtime environment, which is a dedicated Web server that runs on all Un*x platforms (Linux, Mac OS X, and Android). It is operational on mainstream modern architectures such as x86/32, x86/64, PowerPC, and the ARM family.

The Hop runtime environment requires only 3MB–4MB of RAM. This small memory footprint makes it suitable for smartphones or even smaller machines, such as the new ARM-equipped computers (Phidget SBC2, Raspberry Pi, among others) that offer only a few megabytes to applications. As far as speed is concerned, the Hop Web server delivers dynamic Web pages significantly faster than traditional servers such as Apache or Lighttpd. The performance gain is a result of the Hop-in-Hop implementation, which enables the server to respond to requests involving dynamic computations without costly execution context switching [4].

A New Playground
Hop is the conjunction of a multitier programming language and a runtime environment for the Web. Its main goals are to unify the linguistic features needed for Web applications, to automate the physical distribution of code in a fully transparent way, and to help programmers in reusing and combining external resources. All of these are keys to program simplicity and conciseness.

This article has presented several complete Web application examples written in Hop. The first example illustrates how to raise the HTML abstraction level by using an algorithmic programming language approach. The second example shows how to reuse third-party code. The third example shows that combining Web technologies with open data opens the path for new application ideas. The last example presents a simple diffuse application on the Web. These examples have all been chosen for their simplicity; the same technique would be equally effective when dealing with much bigger applications in domains such as home automation or multimedia.

None of the applications presented here exceeds 30 lines of Hop source code. This is not accidental; it is the consequence of a perception shift where the Web is no longer regarded as a mere platform for sharing documents but as a revolutionary runtime environment. Note, however, that Hop is a realistic programming language, which implies a lot of additional bells and whistles. This article has ignored most of them, concentrating on multitier programming and the connection to official Web technologies.

Finally, note that the Hop core language is a strict functional programming language, with runtime type checking. This design choice comes mostly from a personal bias of the first author. Arguably, it fits well with the current programming trends of the Web, but other flavors should be possible, yielding other programming languages. We hope to see such new languages appearing, because the multiplicity of approaches will foster creativity and possibly open a new era in language and tool design.

Related articles on queue.acm.org

There’s Still Some Life Left in Ada
Alexander Wolfe
http://queue.acm.org/detail.cfm?id=1035608

Extensible Programming for the 21st Century
Gregory V. Wilson
http://queue.acm.org/detail.cfm?id=1039534

Purpose-Built Languages
Mike Shapiro
http://queue.acm.org/detail.cfm?id=1508217

References
1. Cooper, E., Lindley, S., Wadler, P. and Yallop, J. Links: Web programming without tiers. Presented at the 5th International Symposium on Formal Methods for Components and Objects (2006).
2. Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M. and Brilliant, L. Detecting influenza epidemics using search engine query data. Nature 457 (Feb. 19, 2009), 1012–1014.
3. Kelsey, R., Clinger, W. and Rees, J. The revised (5) report on the algorithmic language Scheme. Higher-Order and Symbolic Computation 11, 1 (1998); http://www.sop.inria.fr/indes/fp/Bigloo/doc/r5rs.html/
4. Serrano, M. Hop, a fast server for the diffuse Web. In Proceedings of the 11th International Conference on Coordination Models and Languages (Lisbon, Portugal, 2009).
5. Serrano, M., Gallesio, E. and Loitsch, F. Hop, a language for programming the Web 2.0. In Proceedings of the First Dynamic Languages Symposium (Portland, OR, 2006).

Manuel Serrano is a senior scientist at INRIA, leading the INDES (Informatique Diffuse et Sécurisée) team in Sophia-Antipolis. After completing his Ph.D. at the Pierre and Marie Curie University (UPMC) on the compilation of functional languages, he moved to Nice and created the Bigloo development environment for Scheme. He joined INRIA in 2001 and has focused on development environments for the diffuse web since 2005.

Gérard Berry, director of research at INRIA, works on programming languages, their mathematical semantics, and program verification technologies. His focus has been on the lambda-calculus and its models, reactive and real-time programming and verification, and high-level digital circuits specification, synthesis, and verification. Prior to joining INRIA he was chief scientist of Esterel Technologies and was the main author of the Esterel language, which has been used in academia and industry for applications ranging from complex circuit synthesis to airplane control.

© 2012 ACM 0001-0782/12/08 $15.00



contributed articles
doi:10.1145/2240236.2240255

The Loss of Location Privacy in the Cellular Age

How to have the best of location-based services while avoiding the growing threat to personal privacy.

By Stephen B. Wicker

“Our view of reality is conditioned by our position in space and time, not by our personalities as we like to think. Thus every interpretation of reality is based upon a unique position. Two paces east or west and the whole picture is changed.”
– Lawrence Durrell, Balthazar¹⁰

“…to be human is to be ‘in place’.”
– Tim Cresswell, Place: A Short Introduction⁷

key insights
■ The precision of cellular-location estimates means service providers are able to obtain location estimates with address-level precision, creating a serious privacy problem, as the estimates can be highly revealing of user behavior, preferences, and beliefs.
■ Supposedly anonymous location traces can be de-anonymized through correlation with publicly available databases.
■ Privacy-aware design makes it possible to retain the full benefit of LBS while preventing accumulation of address-level location traces for a given individual and reducing the potential for de-anonymization.

On April 20, 2011, U.K. researchers Alasdair Allan and Peter Warden caused a media frenzy by announcing their discovery of an iPhone file, consolidated.db,ᵃ that contained time-stamped user-location data.⁴ A FAQ published by Apple³ and congressional testimony by Apple’s vice president for software technology²⁶ subsequently revealed that at least some of the initial concerns were groundless. Assuming Apple’s anonymity-preservation techniques are adequate, Apple does not compile location traces for individual users, instead enlisting those users as data collectors in a worldwide exercise in crowdsourcing. Apple is creating a highly precise map of cell sites and access points in an effort to improve the speed and accuracy of its user-location estimates, thus providing more-refined location-based services. However, despite Apple’s quick and thorough response, long-term issues remain.

This article explores the evolution of location-based services (LBS), culminating in Apple’s and Google’s use of crowdsourced data to create a system for obtaining location fixes potentially faster and more accurate than the global positioning system (GPS). This article also develops an intuitive sense of the potentially revelatory power of fine-grain location data, then addresses the question of potential harm. The most obvious concern is the stalker, while others involve manipulation and threats to autonomy. Also provided is a brief review of the philosophy of place, focusing on the ability of location-based advertising (LBA) to disrupt individuals’ relationships with their surroundings. It then turns to the potential for anonymous LBS, with the aim of saving the benefit while avoiding potential harm. Finally,

ᵃ The file had already been identified in a 2010 text on iOS forensics by Sean Morrissey²⁰ but was largely ignored at the time.



it asks how much location data a marketer must acquire before a correlation attack can de-anonymize the data, answering through the Shannon-theoretic concept of “unicity” distance and recommending ground rules for development of truly anonymous LBS.

Technology of Place
Cellular telephony has always been a surveillance technology. As discussed by the author,²⁷ cellular networks are designed to track a phone’s location so incoming calls are routed to the most appropriate cell tower, usually the one closest to the user. As most users are aware, recent generations of cellphones are capable of much more fine-grain location resolution. The first step toward adding this capability came with E911, the Federal Communications Commission’s 1996 effort to enhance location resolution for cellular 911 calls.ᵇ E911 established a requirement that cellular service providers send location information to the Public Safety Answering Point when subscribers make 911 calls with their cellphones. The intuition underlying E911 was clear: It would be desirable for emergency services to be able to locate a victim without searching the entire coverage area of a cell site. However, the technological and sociological impact has far outstripped this intuition over the past 16+ years.

One of the more immediate consequences of E911 is that many cellular handsets now have some form of GPS capability, whether standalone or network-assisted.²⁹ With it, service providers increasingly recognize that a much broader (and more lucrative) range of location-based services could be provided. However, it should be understood that GPS was not designed with cellphones in mind.ᶜ GPS

Illustration by Brian Greenberg/Andrij Borys Associates

ᵇ Notice of Proposed Rulemaking, Docket 94-102, adopted as an official report and order, June 1996;¹² the order and all its subsequent incarnations are referred to as E911 in this article.
ᶜ It was designed with guided missiles and bombers in mind.¹⁶


was intended for outdoor use; the weak signals transmitted from the 24 space vehicles (SVs) that constitute the GPS space segment are difficult to detect indoors and are blocked by tall buildings.¹⁶ GPS is also designed to work with autonomous receivers; GPS signals are modulated to provide the receiving unit with the locations and orbits of the SVs, information needed to compute the receiver’s location.ᵈ The locations and orbits are provided on the same carrier used for (civilian) distance estimation. In order to avoid interference, the data rate for these transmissions is slow (only 50bps), so a receiver takes up to 12.5 minutes to obtain all the information it needs to perform a location fix. Networks often assist cellphones by providing this information over much-faster cellular links,⁹ but cellphone manufacturers are apparently looking to other means for quick, accurate location fixes for their subscribers.

This brings us to the April 2011 kerfuffle over Apple’s and Google’s use of cellphones to identify Wi-Fi and cell-tower locations. In testimony before the U.S. Congress’s Judiciary Committee’s Subcommittee on Privacy, Technology and the Law, Guy Tribble, Apple’s vice president for software technology, confirmed what analysts of the consolidated.db file had already determined: Apple iPhones record the MAC address and signal strengthᵉ for detected access points, then time-stamp and geo-tag that data. The geo-tag consists of a GPS/cell-tower-derived location estimate of the iPhone that has detected the access point. For detected cell sites, the cell-tower ID and signal strength are combined with the detecting iPhone’s location estimate.

Tribble provided little technical detail but did suggest that by obtaining such data from a large number of iPhones (crowdsourcing), highly accurate estimates of the location of sites and access points could be determined. With a map of these locations, precise location estimates can be generated for phones that report receiving signals from the cell sites and access points.

A simple analysis makes the point. Consider a data set of n records for a single access point, with each record consisting of the location of a different receiving unit and the strength with which that unit receives the signal from the access point. The location of the access point can be computed by determining the weighted centroid of the measurements.⁵ Following creation of a map of the locations of cell sites and access points, a position fix for a cellphone can be computed through trilateration using received signal-strength measurements.

Trilateration is similar to what is performed by GPS receivers, with the added benefit that the distances are much shorter and the access points and cell towers are not moving. Overall, one would expect the resulting location estimates to be at least as good as a GPS fix in urban and residential areas and to be of sufficiently fine granularity to resolve an individual address.

The presence of consolidated.db in iPhones (a database of time-stamped GPS fixes for the cellphone) gives the appearance that Apple is tracking iPhone users, but Tribble said the “data is extracted from the database, encrypted, and transmitted – anonymously – to Apple over a Wi-Fi connection every 12 hours (or later if the device does not have Wi-Fi access at that time).”

The extent to which the data is anonymous is questionable without further detail. The author generated the figure here using the consolidated.db database on his iPhone and the iPhone Tracker application developed by Pete Warden.ᶠ His well-traveled path from Ithaca, NY, to Washington, D.C. (National Science Foundation and Defense Advanced Research Projects Agency) and onward to his parents’ house in Virginia Beach, VA, is apparent for all to see. It would take little effort to associate this trace with the author. As the Netflix example covered later suggests, there is more to anonymization than stripping a location trace of its associated phone number and user-account ID.

Personality of Place
The iPhone location trace says a lot about the author, including his predilection for visiting Washington, D.C., New York, and his parents. What would a more fine-grain set of tracking data, like that potentially being

ᵈ Detailed SV orbital information is called “ephemeris”; each SV transmits its own ephemeris, along with an almanac providing less-detailed information for all active SVs.
ᵉ Signal strength is converted into a “horizontal accuracy number”; Apple does not collect the user-assigned name for the network.
ᶠ In an FAQ at http://petewarden.github.com/iPhoneTracker/#5, Warden noted that the data is actually more accurate than the maps generated by the tool; Warden inserted the intentional dithering to reduce the privacy risk created by the tool while still making apparent the problem with consolidated.db.

Figure: A cellphone’s travels; data from consolidated.db in the author’s iPhone.
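The two positioning steps described on this page (a weighted centroid to place an access point from crowdsourced reports, then trilateration to place a handset from mapped transmitters) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Apple's or Google's method: the function names are hypothetical, positions are taken to be planar coordinates in metres, the centroid weights are assumed to be received power in linear units, and the ranges fed to trilateration are assumed to have already been derived from signal strength.

```python
import math


def weighted_centroid(records):
    """Estimate an access point's position from crowdsourced records.

    Each record is (x, y, rssi_dbm): the reporting handset's position in a
    local planar frame (metres) and the signal strength it measured.
    Assumption: the weight is received power in linear units, so handsets
    that hear the access point strongly pull the estimate toward themselves.
    """
    weights = [10 ** (rssi / 10.0) for _, _, rssi in records]
    total = sum(weights)
    x = sum(w * rec[0] for w, rec in zip(weights, records)) / total
    y = sum(w * rec[1] for w, rec in zip(weights, records)) / total
    return x, y


def trilaterate(anchors, distances):
    """Solve for a handset position from three fixed anchors and ranges.

    Subtracting the first circle equation from the other two leaves a
    2x2 linear system, solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


# Two handsets report the same access point; the louder report dominates.
ap_estimate = weighted_centroid([(0.0, 0.0, -40.0), (10.0, 0.0, -50.0)])

# A handset at (3, 4) measures ranges to three mapped transmitters.
towers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
ranges = [math.dist((3.0, 4.0), tower) for tower in towers]
handset_fix = trilaterate(towers, ranges)
```

With real measurements the ranges are noisy and more than three transmitters are usually visible, so a least-squares fit would replace the exact three-anchor solve; the sketch only shows why fixed, nearby transmitters make the geometry simple.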


made available to LBS by emerging smartphone-location technology, have to say about an individual? Consider the following information, which can be derived through the correlation of fine-grain location data with publicly available information:

Location of your home. What kind of neighborhood do you live in? What is your address? Mortgage balances and tax levies are often available once an address is known. Your socioeconomic status can be deduced;

Location of your friends’ homes. What sort of homes do they have? Do you ever spend the night? How often?;

Location of any building you frequent with a religious affiliation. Or do you never frequent such buildings? In either case, beliefs can be deduced;

Locations of the stores you frequent. Your shopping patterns reflect your preferences and in some cases your beliefs or vices;

Locations of doctors and hospitals you visit. Do you visit frequently? How long do you stay? The fact that you have a serious illness is readily determined, as is, in some cases, even the type of illness through your visits to specialty clinics. Such information is of interest to insurance providers and the marketers of pharmaceuticals, among others; and

Locations of your entertainment venues. Do you attend the local symphony? Do your tastes run to grunge rock? Do you frequent bars? What type? One can draw multiple conclusions from the frequency of visits and types of venue.

One could go on with this list. The fact is that fine-grain location information can be used to determine a great deal about an individual’s beliefs, preferences, and behavior. Databases containing such information pose a threat to individual security and privacy, as they can be a focus for hackers with criminal intent. On a less-malevolent note, such data is immensely valuable to direct-marketing firms. Entire businesses are built around the compilation of lists of such information acquired through other means.

Is this a problem? Isn’t LBS data collection simply additional, perhaps redundant, data collection that feeds the ordinary and tedious process of direct marketing? The following sections show this is not the case. LBS supports location-based advertising (LBA), which has the potential for exerting substantially more power over individual behavior than previous modes of advertising.

The work of advertising. As of early 2012, InfoUSA maintained a list of 210 million U.S. consumers for sorting into various categories, including area code, ZIP code, home value/home ownership, housing type, mortgage, personal finance, hobbies and interests, children/grandparents/veterans, ethnicity, religion, and voter information.ᵍ As seen from the earlier thought experiment, much of it can now be derived by correlating location data with address databases. But data collection through location-based services takes the process of collection to a new level of invasiveness while adding an additional control variable to the process of advertising.

To begin, LBS data collection is able to substantially refine the personal information available from other sources; for example, one may claim to practice yoga, but marketers may now know how frequently one takes classes, pointing to a specific level of enthusiasm previously known only to individuals and their fellow yogis and yoginis.ʰ LBS also enables an approach to consumer targeting that goes well beyond previous marketing strategies by collecting information about beliefs, preferences, and behavior while one performs the illustrative practice. A mailing list may indicate one is generally a lover of Italian food, while an LBS may have the additional knowledge that one is currently in an Italian restaurant. To understand why this is important, first consider the “work” that advertising has us do on its behalf.

In her 1978 book on the psychology of advertising, Judith Williamson described advertising as shifting meaning from one semantic network to another.²⁸ As the book was originally written in the 1970s, Williamson focused on print advertising, with an occasional reference to broadcast television. In

[Pull quote] The continued accumulation of location data may reach a point where a marketer can uniquely match an anonymous location trace to a named record in a separate database.

ᵍ http://www.infousa.com/
ʰ You may substitute political parties, sporting events, dog shows, or any other personal interest to imagine a more personally relevant example.


a canonical example she pointed to an advertisement that seems simple, a photograph of the iconic French actress Catherine Deneuve juxtaposed with a bottle of perfume (Chanel No. 5), encouraging us to bind to the perfume the association of class and beauty people of a certain age might associate with Catherine Deneuve. Meaning has been shifted from one semantic network (the realm of actresses) to another (a brand of perfume).

LBA has the potential to perform a similar sleight-of-mind, causing us to exchange the meaning we associate with a place for one suggested by an ad. Moreover, this location-based semantic shift is taking place through ads delivered to a device that can track the individual. This raises two new privacy issues: The first is that LBA has the potential to be a feedback system with dynamic control. The advertiser can present an ad when one is near a target location, then track that person to determine whether the ad has had the desired response. In the language of Gilles Deleuze,⁸ the advertiser can observe the response to the information stream presented to the individual, then “modulate,” or refine, that stream over time, driving the individual to a desired state of behavior; in this case, movement to and consumption at the target location. Primitive examples of modulation fueled by click-tracking can be seen by an aware observer of the Web. If one fails to produce the desired response to a pop-up window, other windows offer alternatives on behalf of the advertiser. Second, unlike click-tracking, LBA exploits consumers’ physical location, attempting to manipulate their relationship to their physical surroundings. The following highlights the potential for a more insidious form of manipulation at an entirely new level of psychological conditioning.

Philosophy of place. Many people view geography as the study of locations and facts; for example, “Jackson is the capital of Mississippi” is the stuff of geography, as is the shape and size of the Arabian Peninsula. However, in the 1970s, humanistic geographers began to move the field toward a consideration of “place” as more than a space or location, beyond latitude, longitude, and spatial extent.⁷ In an oft-quoted definition, the geographer and political philosopher John Agnew defined place as consisting of three things:¹

Location. “Where,” as defined by, say, latitude and longitude;

Locale, or the shape of the space. Shape may include defining boundaries (such as walls, fences, and prominent geographical features like rivers and trees); and

Sense. One’s personal and emotional connections established through location and locale.

Place is thus a location to which one ascribes meaning. The process by which meaning attaches to place, and the importance of this process to the individual and to society, have become a prime focus for humanistic geographers. One aspect of it builds on the work of the phenomenologists. Phenomenology, generally associated with the German philosopher Franz Brentano and the Austrian philosopher Edmund Husserl, studies the structures of consciousness. Phenomenology proceeds by first bracketing-out our assumptions of an outside world, then focusing on our experience of the world through our perception. Phenomenologists study consciousness by focusing on human perception of phenomena, hence the name.ⁱ

Brentano is credited with one of the key results of the phenomenologist approach. In his 1874 book Psychology from an Empirical Standpoint,ʲ he said one of the main differences between mental and physical phenomena is that the former has intentionality; that is, it is about, or directed at, an object; put another way, one cannot be conscious without being conscious of something. In the latter part of the 20th century, humanistic geographers took this philosophy a step further; in his 1976 book Place and Placelessness, Edward Relph asserted that consciousness could only be about something in its place, making place “profound centers of human existence.”²³

Another thread in the philosophy of

[Pull quote] The more that can be done within the handset and kept within the handset, the greater the preservation of anonymity.

ⁱ For a quick look at the field see http://plato.stanford.edu/entries/phenomenology/ and for more detail Sokolowski, R. Introduction to Phenomenology, Cambridge University Press, Cambridge, U.K., 1999.
ʲ Psychologie vom empirischen Standpunkt; http://www.archive.org/details/psychologievome00brengoog


place originates with 20th-century German philosopher Martin Heidegger, who described human existence in terms of dasein, a German word that can be translated as “human existence,” or perhaps more helpfully as “being there.” The important thing for us here is to understand that dasein is always in the world.ᵏ As humans we enter a preexisting world of things and other people and develop our sense of self by (and only by) interacting with them. According to Heidegger, an inauthentic existence is one in which the individual fails to distinguish him or herself from the surrounding crowd and its priorities.

Humanistic geographers have taken up the concept of dasein, using it to explore the role of place in human existence. In his 2007 book Place and Experience: A Philosophical Topography, Jeff Malpas invoked dasein and related concepts of spatiality and agency to show that place is primary to the construction of meaning and society.ˡ,¹⁹

Using these concepts, this article now aims to characterize the potential impact of LBA, the objective of which is to alter the ever-present, ongoing human process of interaction with the immediate surroundings. LBA attempts to shift intentionality, diverting consciousness from an experience of the immediate surroundings to the consumption of advertised goods. In Heideggerian terms, LBA interferes directly with the individual’s project of crafting an authentic existence.

Consider the following situation, developed in two stages: A family is seated at their dining room table enjoying dinner together, but there is an exception: the father, a relentless worker, is reading texts and email messages instead of joining the conversation. One could say he is no longer present. He has left the place. Or to turn it around, as far as the father is concerned, the dinner table is no longer a “place” with familial meaning but merely a location for eating. Now, to complete the example, assume that someone who wants to communicate with the father from afar knows when he is at the table and chooses that time to send texts. The texter now has the ability to disrupt the father’s relationship with the family dinner, a relationship often filled with a strong, even defining, sense of meaning.

The dinner table is a natural example for the author,ᵐ but one might consider a walk through one’s hometown, visiting an old high school, or attending a play. LBA has the potential to detract from the experience of these familiar and meaning-filled environs. One’s surroundings may thus lose their “placeness” through LBA, including their meaning, and become merely a path to be traversed. As places become locations, meaning is lost to the individual. That is, we lose some of ourselves, as well as one of the critical processes through which we become a self.

Location Anonymity
Having established the importance of location privacy, is it necessary to forgo the benefits of LBS and LBA? Fortunately the answer is no, but it needs to be clear to data collectors that it is not sufficient to simply scrub names and phone numbers from location traces. As AOL¹⁵ and Netflix²¹ have learned, supposedly anonymous datasets are often susceptible to correlation attacks in which datasets are associated with individuals through comparison of the datasets to previously collected data. Netflix is particularly instructive; in 2006 it issued a public challenge to develop a better movie-recommendation system.²² As part of the challenge, it released training data consisting “of more than 100 million ratings from over 480,000 randomly chosen, anonymous customers on nearly 18 thousand movie titles.” Within weeks, computer scientists Arvind Narayanan and Vitaly Shmatikov had shown the data was not as anonymous as Netflix might have thought. Narayanan and Shmatikov devised an elegant algorithm that correlated the Netflix data with other publicly available data and thus identified a number of users in the Netflix training data.²¹ Along the way, they developed rules of thumb for such correlation attacks, noting such attacks work well when they emphasize rare attributes and that the winning match should have a much higher score than the second-place match. The first can be understood intuitively; a marketer would learn more from the knowledge that someone has purchased the author’s most recent text on error-control coding than from finding that someone has purchased a Harry Potter book. The second rule is equally intuitive, as it is intended to avoid false positives.

Here, these rules are useful for developing a Shannon-theoretic model for correlation attacks on supposedly anonymized location traces. In his 1949 paper “Communication Theory of Secrecy Systems,”²⁴ Claude Shannon defined unicity distance as the minimum amount of ciphertext needed before uncertainty about a piece of plaintext could be reduced to zero. The translation to the de-anonymization of location traces is clear; the continued accumulation of location data may reach a point where a marketer can uniquely match an anonymous location trace to a named record in a separate database.

The goal in this article is not a specific number as a cutoff for data accumulation or an all-encompassing framework into which all de-anonymizing attacks have a place. Rather, it develops an example model and evaluates its dynamics (how the structure of the model changes as the amount of location data increases) in order to craft design rules for anonymous LBS.

A Shannon-theoretic approach to location anonymity. Let a marketing database $S$ consist of a collection of binary preference vectors $\{X_i\}$ of length $n$, where the index $i$ indicates a specific user. The individual vectors have the form

$$X_i = (x_{i,0}, x_{i,1}, x_{i,2}, \ldots, x_{i,n-1}); \quad x_{i,j} \in \{0, 1, e\}$$

Each coordinate $x_{i,j}$ is a binary indicator representing user $i$’s preference with regard to some specific item, belief, or behavior; for example, $x_{i,0}$ might indicate whether the user likes cats (yes or no), and $x_{i,1}$ might

ᵏ This concept has had profound influence on the field of artificial intelligence; for example, Philip Agre explicitly applied Heideggerian thought in moving the practice of computational psychology away from cognition and toward action in the world.²
ˡ For a more-focused exploration of place in the thought of Martin Heidegger see Malpas, J. Heidegger’s Topology: Being, Place, World. MIT Press, Cambridge, MA, 2008.
ᵐ He would never be allowed to behave like the father in the example.


indicate feelings about dogs. Some follows from the fact that as data collec- ˲˲ As the length m of a location trace
preferences (such as the identity of tors obtain more location information, Lm increases, the number of non-erased
the user’s favorite rugby team) might they typically increase their knowledge coordinates of a preference vector P in-
cover several coordinates, depending about the associated individual. creases; the reverse is also the case;
on the number of teams that can be Now consider those vectors in the ˲˲ As the number of non-erased coor-
represented in the database. If a given marketing database for which indi- dinates of P increases, the length of the
preference for a particular user is un- vidual preferences on these t coor- vectors in C′ increases, while the cardi-
known, the associated coordinate is dinates are known. Within S there nality of C′ decreases; fewer vectors in
given the value “e” for erasure.n The will be some Nm vectors with support C will have the requisite support as the
marketer’s knowledge concerning a for all t non-erased coordinates of P. number of coordinates requiring sup-
user’s beliefs, preferences, and behav- These Nm vectors form a subset C ⊂ S. port increases. The overall effect is an
ior are thus coded into binary vectors For each vector in C, delete all but the increase in minimum distance and a
of a fixed length (n) with a consistent t coordinates of interest (those cor- corresponding increase in the efficacy
semantic attribution to each coordi- responding to the non-erased coordi- of correlation attacks;
nate or block of coordinates. nates of P). We now have a set C′ of Nm ˲˲ As the number of non-erased coor-
Now let Lm be a trace of length m, a vectors of length t. The problem of de- dinates of P decreases, the length of the
sequence of m location fixes generated anonymization now looks like an er- vectors in C′ decreases while the cardi-
by a single subscriber ror-control coding problem, so which nality of C′ increases; more vectors in C
vector in C′ provides the closest match have the requisite support, as less sup-
Lm = (l0,l1,l2,…lm–1) to the non-erased coordinates of the port is required. The overall effect is a
preference vector P? The ability of the decrease in minimum distance and in
As discussed earlier, marketers can marketing database to distinguish the efficacy of correlation attacks.
associate locations with beliefs and between users can now be expressed It follows both intuitively and ana-
preferences, but the amount of infor- (using coding-theoretic terminology) lytically that the number of non-erased
mation derived clearly varies depend- as the minimum distance between the coordinates in P should be kept as
ing on the type of location in the trace. vectors in C′. The minimum distance small as possible and can be done in
Now consider a preference mapping F is the minimum number of coordi- either of two ways:
that maps location traces to preference nates in which any pair of vectors dif- Reduce the length of location traces. If
vectors while acknowledging such fer. In more compact form, this can be the preference map has less informa-
mapping may not be one to one and is expressed as tion on which to operate, it generates
situation-dependent. The preference a preference vector with more erased
vectors have the same syntactic and se- dmin = minx´i,x´j∈C´,i≠j | {k|xi,k ≠ xj,k, k ∈ (1,t)}| coordinates;
mantic structure as the vectors in the Reduce the ability of the preference
marketer’s database The greater the value of dmin, the greater mapping to resolve a location trace into
the ability of a correlation attack to as- specific coordinate values in a preference
F : {Lm} → {P} sociate a location trace with a single vector. This can be done by reducing or
record, and thus a single individual. eliminating the extent each trace loca-
P = (p0,p1,p2,…,pn–1); pj ∈ {0,1,e} When dmin is large, the individuals rep- tion provides preference-vector infor-
resented by the vectors in C′ are read- mation.
Narayanan and Shmatikov21 discussed ily distinguished from one another. On System designers can exploit these
several ways to identify the Xi ∈ S that is the other hand, if dmin is small or zero results to design anonymity-preserving
the best match for a given P, thereby (as happens when two or more identi- location-based services.
(potentially) de-anonymizing P. This cal vectors are in C′), then the problem Anonymous LBS. Consider a ba-
article takes a somewhat different ap- of de-anonymization becomes difficult sic location-based service; call it “The
proach, attempting to characterize the or even impossible. Marketers are un- Doppio Detector,” giving users direc-
dynamics of the de-anonymization able to distinguish between the indi- tions from their current location to the
problem as the length of the location viduals so are unable to determine with nearest espresso shop. For it to work,
trace grows. which individual to associate a given two basic types of information must
Suppose a location trace of length location trace. be brought together: the subscriber’s
m is mapped into a preference vector Privacy-aware system designers location at an appropriate level of
P of length n. P will have some t non- can now develop rules of thumb for granularity and a geographic database
erased coordinates and n − t erased co- preserving anonymity in the face of containing the locations of all nearby
ordinates. Assume that as m increases, correlation attacks by exploring the espresso shops. With it, the server or
t increases or remains the same.o This dynamics of the relationship between the user’s handset can superimpose
location traces Lm, preference vectors the user’s location onto a geographic
n Narayanan and Shmatikov21 said most “auxilia- P, and the minimum distance dmin of database, then generate directions
ry” databases are extremely sparse and would the corresponding set of vectors C′: through a routing algorithm.
thus contain a large number of erasures.
o While useful for mathematical clarity, this assumption is not needed to support the results, so long as there is a general tendency for t to increase with m, which is the case as long as the marketer’s database is not highly corrupted.

66 communications of the acm | august 2012 | vol. 55 | no. 8

contributed articles

The structure of an LBS can thus be generalized as performing two basic functions:

- Determine subscriber location to the desired level of granularity; and
- Use a database to map the location to the desired information (such as directions to an espresso shop).

Separating these functions clarifies the anonymity problem while opening up the range of available anonymity-preserving techniques. We begin by determining subscriber location. The best means for preserving anonymity is to do an independent GPS fix on a cellphone. The handset may thus acquire an accurate location estimate without releasing any information to the outside world. This is a general theme; the more that can be done within the handset and kept within the handset, the greater the preservation of anonymity.

However, this approach can be slow. If the handset is to download all necessary SV location information from the SVs themselves, the user may have to wait as long as 12.5 minutes, a potentially excruciating delay when one needs caffeine. If the process is to be sped up through provision of constellation information by the cellular service provider, some location information must be leaked to that provider. However, such data can be coarse; the network must know only the cell site that is serving the user to provide the data for SVs that are potentially visible to the handset. Such coarse location information provides relatively little information about the user’s beliefs and preferences. Or to use the language of this article’s unicity distance analysis, the preference mapping F operating on cell-site information will produce a preference vector with a large number of erased coordinates.

Khoshgozaran and Shahabi17 suggested another approach to determining location anonymously: use the network to determine the location fix while preventing the network from knowing the subscriber’s actual location. The mobile device biases the data used for the location fix by applying a randomly selected transform to the mobile’s measurements. When the mobile receives the resulting location fix from the network, it removes the effects of the bias by adjusting the fix accordingly.

It follows from these options that obtaining a location fix of the desired granularity on the handset need not reduce the user’s location privacy. However, the second piece of LBS, the mapping function, creates two significant obstacles to maintaining privacy, with the second posing a potential personal security concern:

Consistent input granularity. The mapping function requires input granularity consistent with the inherent granularity of the query; a user who wants directions to the nearest espresso shop needs directions, beginning with a position with street-level resolution; and

Known location. Many if not most LBS queries involve objects of known, fixed location; for example, a bookstore has a known location and is generally not in motion. A request for directions indicates the requesting cellphone user will probably be at the location sometime soon.

The following paragraphs consider general means for accomplishing the mapping function while retaining a measure of anonymity:

A release of data is said to provide k-anonymity protection “…if the information for each person contained in the release cannot be distinguished from at least k–1 individuals whose information also appears in the release.”25 It seems logical that such protection can be obtained for the LBS mapping function by stripping identifying information from k LBS requests, bundling and submitting them all at once. The LBS server then provides a combined response from which individual users are able to extract information responsive to their specific requests.

But who or what bundles the original k requests? Gruteser and Grunwald14 suggested a trusted server that bundles and forwards requests on behalf of users, while Ghinita et al.13 suggested a tamper-proof device on the frontend of an untrusted server that combines queries based on location. However, such approaches fall short of k-anonymity in that there may be side information (such as home location or a known place of business) that would allow the server to disaggregate one or more users from the bundled request. For example, I benefit little from a bundled request if the request
includes my home as a starting point; my request is too easily disaggregated from the bundle.

Bear in mind that system designers need not completely eliminate the transfer of location information; it would be sufficient to reduce the precision of the location information to where the preference mapping gives the attacker or marketer little with which to work.p Given the decreasing cost of memory and bandwidth, it is both efficacious and inexpensive to simply blur the location estimate provided with the request for mapping functionality.q An LBS user may, for example, submit a request to the Doppio Detector that includes his or her location as “somewhere in downtown Ithaca,” rather than a specific address. The server will respond with a map that indicates the locations of all the espresso shops in downtown Ithaca. The user’s handset can then use its more precise knowledge of his or her location to determine the nearest espresso shop and generate directions accordingly.

Anonymity can also be preserved by limiting the length m of each location trace. This limitation is accomplished by preventing the LBS from determining which requests, if any, come from a given user.r As described in Wicker,27 public-key infrastructure and encrypted authorization messages can be used to authenticate users of a service without providing their actual identities. Random tags can be used to route responses back to anonymous users. Anonymity for frequent users of an LBS may thus be protected by associating each request with a different random tag. All users of the LBS thus enjoy a form of k-anonymity. Coupled with coarse location estimates or random location offsets, this approach shows great promise for preserving user anonymity while allowing users to enjoy the benefits of location-based services.

p Privacy-preserving data mining techniques (such as those developed by Evfimievski et al.11) may also provide solutions.
q Zang and Bolot30 used the Shannon-theoretic concept of entropy to show the role of both temporally and spatially coarse data in preserving anonymity, conclusions I corroborate with this analysis.
r This follows Kifer and Machanavajjhala,18 who said the privacy of an individual is preserved when it is possible to limit the inference of an attacker as to the participation of the individual in the data-generating process.

Conclusion
The increasing precision of cellular-location estimates is at a critical threshold; using access-point and cell-site location information, service providers are able to obtain location estimates with address-level precision. Compilation of these estimates creates a serious privacy problem, as it can be highly revealing of user behavior, preferences, and beliefs. The subsequent danger to user safety and autonomy is substantial.

To determine the extent to which location data can be anonymized, this article has explored the Shannon-theoretic concept of unicity distance to reveal the dynamics of correlation attacks through which existing data records are used to attribute individual identities to allegedly anonymous information. With this model in mind, it has also laid out rules of thumb for designing anonymous location-based services. Critical to them is maintenance of a coarse level of granularity for any location estimate available to service providers and the disassociation of repeated requests for location-based services to prevent construction of long-term location traces.

Acknowledgments
This work is funded in part by the National Science Foundation TRUST Science and Technology Center and the NSF Trustworthy Computing Program. I gratefully acknowledge the technical and editorial assistance of Sarah Wicker, Jeff Pool, Nathan Karst, Bhaskar Krishnamachari, Kaveri Chaudhry, and Surbhi Chaudhry.

References
1. Agnew, J.A. Place and Politics: The Geographical Mediation of State and Society. Unwin Hyman, London, 1987.
2. Agre, P.E. Computation and Human Experience. Cambridge University Press, Cambridge, U.K., 1997.
3. Apple. Q&A on Location Data; http://www.apple.com/pr/library/2011/04/27Apple-Q-A-on-Location-Data.html
4. Bilton, N. 3G Apple iOS devices are storing users location data. The New York Times (Apr. 20, 2011).
5. Blumenthal, J., Reichenbach, F., and Timmermann, D. Position estimation in ad hoc wireless sensor networks with low complexity. In Proceedings of the Second Joint Workshop on Positioning, Navigation, and Communication and First Ultra-Wideband Expert Talk (Hannover, Germany, Mar. 2005), 41–49.
6. Clarke, R.A. Information technology and dataveillance. Commun. ACM 31, 5 (May 1988), 498–512.
7. Cresswell, T. Place: A Short Introduction. Wiley-Blackwell, Malden, MA, 2004.
8. Deleuze, G. Postscript on the societies of control. October 59 (Winter 1992), 3–7.
9. Djuknic, G.M., and Richton, R.E. Geolocation and assisted GPS. Computer 34 (Feb. 2001), 123–125.
10. Durrell, L. Balthazar. Faber & Faber, London, 1960.
11. Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy-preserving mining of association rules. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, July 23–26). ACM Press, New York, 2002, 217–228.
12. Federal Communications Commission. Notice of Proposed Rulemaking Docket 94-102. Washington, D.C., 1994.
13. Ghinita, G., Kalnis, P., Khoshgozaran, A., Shahabi, C., and Tan, K.-L. Private queries in location-based services: Anonymizers are not necessary. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Vancouver, B.C., June 9–12). ACM Press, New York, 2008, 121–132.
14. Gruteser, M. and Grunwald, D. Anonymous usage of location-based services through spatial and temporal cloaking. In Proceedings of the First International Conference on Mobile Systems, Applications, and Services (San Francisco, May 5–8). ACM Press, New York, 2003, 31–42.
15. Hansell, S. AOL removes search data on vast group of Web users. The New York Times (Aug. 8, 2006).
16. Kaplan, E.D. Understanding GPS Principles and Applications. Artech House Publishers, Boston, 1996.
17. Khoshgozaran, A. and Shahabi, C. Blind evaluation of nearest-neighbor queries using space transformation to preserve location privacy. In Proceedings of the 10th International Symposium on Spatial and Temporal Databases (Boston, July 16–18). Springer-Verlag, Berlin, 2007, 239–257.
18. Kifer, D. and Machanavajjhala, A. No free lunch in data privacy. In Proceedings of the SIGMOD 2011 International Conference on Management of Data (Athens, June 12–16). ACM Press, New York, 2011, 193–204.
19. Malpas, J. Place and Experience: A Philosophical Topography. Cambridge University Press, Cambridge, U.K., 2007.
20. Morrissey, S. iOS Forensic Analysis for iPhone, iPad, and iPod Touch. Apress, New York, 2010.
21. Narayanan, A. and Shmatikov, V. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (Oakland, CA, May 18–21). IEEE Computer Society Press, Washington, D.C., 2008, 111–125.
22. Netflix Prize Rules; http://www.netflixprize.com//rules
23. Relph, E. Place and Placelessness. Routledge Kegan & Paul, London, 1976.
24. Shannon, C. Communication theory of secrecy systems. Bell System Technical Journal 28, 4 (Oct. 1949), 656–715.
25. Sweeney, L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 5 (Oct. 2002), 557–570.
26. Tribble, G.B. Testimony of Dr. Guy “Bud” Tribble, Vice President for Software Technology, Apple Inc.; http://judiciary.senate.gov/pdf/11-5-10%20Tribble%20Testimony.pdf
27. Wicker, S.B. Cellular telephony and the question of privacy. Commun. ACM 54, 7 (July 2011), 88–98.
28. Williamson, J. Decoding Advertisements: Ideology and Meaning in Advertising. Marion Boyars Publishers Ltd., London, 1978.
29. Yoshida, J. Enhanced 911 service spurs integration of GPS into cell phones. EE Times (Aug. 16, 1999); http://www.eetimes.com/electronics-news/4038635/Enhanced-911-service-spurs-integration-of-GPS-into-cell-phones
30. Zang, H. and Bolot, J. C. Anonymization of location data does not work: A large-scale measurement study. In Proceedings of the 17th Annual International Conference on Mobile Computing and Networking (Las Vegas, Sept. 19–23). ACM Press, New York, 2011, 145–156.

Stephen B. Wicker (wicker@ece.cornell.edu) is a professor in the School of Electrical and Computer Engineering of Cornell University, Ithaca, NY, and member of the graduate fields of information science and computer science.

© 2012 ACM 0001-0782/12/08 $15.00



doi:10.1145/2240236.2240256

Traditional bias toward journals in citation databases diminishes the perceived value of conference papers and their authors.

By Bjorn De Sutter and Aäron van den Oord

To Be or Not To Be Cited in Computer Science
The sustainability and nonconformism of conferences
as premier publication venues in computer science is the
subject of intense debate.3,4,17,18 Evaluating scientists for
promotion and budget allocation involves metrics like
journal impact factors14 and h-indexes7 based on citation
counts retrieved from Scopus and the Web of Science (WoS).
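For readers unfamiliar with the h-index mentioned above, the following is a minimal sketch of its standard definition: the largest h such that h of an author's papers have at least h citations each. The function name and the sample counts are illustrative assumptions, not data or code from this article.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical per-paper citation counts (made up for illustration).
sample_counts = [10, 8, 5, 4, 3]
print(h_index(sample_counts))  # 4
```

Because the metric depends only on per-paper citation counts, any citations a database fails to record can lower the h-index it reports.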
key insights
- The supposedly reliable Scopus and WoS include incomplete citation records for indexed CS journal and conference papers.
- Despite the difficulty of automatically retrieving Google Scholar records, these records can still be used to correct the missing records in Scopus and WoS.
- Corrected citation counts allow for much fairer evaluation of CS researchers and related conferences.

Many computer scientists view the historical focus of these databases on journals as a professional disadvantage, even though many conferences have been included in Scopus since 2004 and WoS since September 2008, including older ones entered later.

Inclusion of proceedings and journals in Scopus and WoS is often viewed as a stamp of approval and relevance. By contrast, databases like CiteSeerX and Google Scholar (GS) also cover books, technical reports, and other less-important manuscripts. Moreover, whereas Scopus and WoS are generally viewed as providing correct information, GS is known to include erroneous records.11

Higher citation counts than those in Scopus and WoS can be obtained by extending the coverage of a citation count by, say, including citations of non-indexed publications11 and by combining databases.10 Despite the need to manually cleanse GS records of erroneous and irrelevant records,2,5,10,11 GS is useful for extending
coverage as well.1,4,10,11,18

Meho and Rogers10 concluded in 2008 that choosing WoS or Scopus did not have a significant effect on the citation-based ranking of human-computer interaction researchers. However, in a case study involving library and information researchers, Meho and Yang11 observed the opposite, finding that conclusions drawn for one scientific domain cannot be generalized to other domains.

Complementing coverage studies, this article explores the inaccuracy of citation records, along with their effect on the perceived impact of CS conferences and on author ranking. Figure 1 outlines the difference between coverage and accuracy for an author of a book B and a journal article J1, with their citations visualized at the top of the figure. B is cited by another article J2 and by a technical report TR. J1 is cited by a conference paper C. TR is a preliminary version of C, with the same title and authors. The list of references in C is shorter than that of TR, so TR cites B, but C does not cite B.

Figure 1. Publications (vertices) and citations (edges) recorded by databases. Panels, top to bottom: publication reality, Google Scholar, Web of Science.

The middle segment of the figure reflects GS citation records, with GS mistakenly attributing the citations by TR to C. GS also covers publications of lesser importance (such as books). Due to its erroneous and, for certain policies, irrelevant records, the GS citation count in the example is not reliable.

WoS records are visualized in the bottom segment of the figure. The first observation is that WoS does not index less-important manuscripts (such as TR and B). However, WoS sometimes does keep track of their citations by indexed papers (such as the citation of B by J2). With the manual-count method of Meho and Rogers,10 these citations can still be counted, should citation-analysis policy demand it. Second, the citations of B by C and of J1 by J2 were never added to WoS. For policies that neglect citations by papers not indexed in WoS, the missing citation of B does not matter, but the missing citation of J1 always matters. Both the citing and cited papers have the WoS stamp of approval, so the citation should be counted. But when for some reason the database lacks a correct record of the citation, as in this example, it is not counted, and the author suffers professionally from undercitation. The study by Meho and Yang11 on library and information researchers said 0.5%, 4.4%, and 12% of relevant citations were, at the time of the study, missing from GS, Scopus, and WoS, respectively, due to database errors.

Here, we evaluate undercitation resulting from such an error. Complementing the studies mentioned earlier, we recently uncovered a significant undercitation bias in Scopus and WoS against covered CS conferences, demonstrating how it weakens the CS community’s effort to win greater appreciation for conference papers. We also found how variations in undercitation of individual authors make the ACM Digital Library (DL), Scopus, and WoS unreliable information sources for citation-based metrics. We also present an automated method that combines the coverage of GS with the quality assurance of Scopus and WoS to detect undercitation resulting from missing citations.

We do not question Scopus or WoS coverage. The analyses we perform for any such database involve only publications indexed in that database. Hence all undercitation results presented here are independent of database coverage. Moreover, we do not take a position for or against citation-based metrics, though their usefulness has been questioned,12 and many refinements have been proposed.13,14 Our results demonstrate only that unless a corrective method is used, as we do here, to correct raw counts obtained from Scopus and WoS, their inaccuracy makes them unsuitable for CS research evaluation.

Relative Relevant Undercitation
To study the accuracy of database citation records, we measure the records’ relative relevant undercitation (RRU); the RRU of a database query is the fraction of all (cited, citing) paper pairs for which both cited and citing papers are indexed in the database but for which the database has no record of the citing paper in the cited-by list of the cited paper. This fraction equals the underestimation of the citation count reported by the database within its own coverage; in Figure 1, the citation of J1 by J2 is missing in WoS, but the citation by C is present, resulting in an RRU of 50%.

To compute RRUs, we developed a Python tool for querying six online databases (the ACM DL, CiteSeerX, DBLP, GS, Scopus, and WoS) by mimicking a researcher manually browsing a database, sending similar HTTP requests and parsing the retrieved (HTML) data. Given a reference list of an author’s papers, the tool first queries the databases by title; for papers not found by title, it tries searching by cited author. The search is limited to the papers in the reference list to prevent counting publications by other authors with the same name or initials.10

For each paper found in a database, the tool retrieves its cited-by list. In its extended mode of operation, it downloads the BibTeX or EndNote descriptions provided by the database for all entries in that list. In its fast mode the tool instead parses the HTML pages to identify the citing papers. As those pages display information in a less-uniform way than EndNote or BibTeX, the fast mode can produce less-accurate results. However, this mode is considerably faster for most databases than its extended-search mode, as fewer HTTP queries are needed. Most databases try to detect and block seemingly automated querying. To work around this filter, the tool is designed to sleep a random amount of time, say, 25 to 35 seconds, between consecutive queries to GS.

The result of this first search phase is a list of (cited, citing) paper pairs of citations, each recorded by at least one database. This list constitutes the tool’s estimate of the genuine citation count of an author’s publications.

In a second phase, the tool searches all databases for all papers occurring in the list. This search by title automates a search comparable to manual searches in other studies.11 When both the cited paper and the citing paper of a citation are found in a database, the tool considers that citation relevant for that database. For each relevant citation, the tool searches the cited-by list of the cited paper. When the citing paper is in it, the tool labels the citation as found in the database. In such cases the citation is also included in automated citation counts provided by that database. When the citing paper is not found in the cited-by list, the tool labels the citation as relevant but missing. A database’s RRU for a reference list of papers is the number of missing relevant citations divided by the number of relevant citations.

As some citations may not be recorded in any searched database, our tool can underestimate RRUs. The risk of underestimation can be avoided with the manual-count method of Meho and Rogers,10 though their experience suggests their labor-intensive method will not identify a significant number of additional relevant missing citations.

Due to erroneous database records, our tool can also unintentionally overestimate the number of relevant but missing citations in a database, thereby overestimating its RRU; we quantify this potential overestimation later.

Experiments
We used the tool in 2010 and 2011 to perform three complementary experiments: First, we set it to search all aforementioned databases for three authors, using its extended-search mode. Though we focus here on citation accuracy, the experiment also enabled us to compare a database’s coverage on the basis of what the three authors would consider their own relevant output. Due to the tool’s long running times (several weeks), we were able to study only three authors in the experiment. We next searched GS and WoS for 14 editors-in-chief of various CS transactions published by ACM and the IEEE Computer Society. Using the tool’s fast mode, we thus limited searches to publication lists we obtained from DBLP. The experiment was less accurate and covered fewer databases than the first experiment but included many more authors and publications, enabling us to validate the trends we observed in the first experiment. Finally, we performed a similar experiment for GS and WoS for eight ACM and IEEE transactions published from 2000 to 2002 to study the influence of RRU on journal impact factors.

Experiment one. Three colleagues at Ghent University who began publishing around 1990 assembled a reference list of their own peer-reviewed conference and journal publications; Table 1 lists the number of publications in each database. In the Computer Systems Lab at Ghent University, we have permanent access to the ACM DL and WoS, as well as to various free databases. This experiment was carried out from September 18 to October 13, 2010, when we also had temporary access to Scopus.

Table 1. Number of journal/conference papers in authors’ reference lists and in online publication databases.

Author (domain)                                        Reference  WoS     Scopus  ACM    Google  CiteSeerX
Koen De Bosschere (compilers, computer architecture)   66/143     57/89   50/54   40/51  57/112  7/38
Bart Dhoedt (distributed computing, networks)          53/187     51/113  44/100  23/47  43/143  3/17
Wilfried Philips (image and video processing)          64/285     56/130  58/106  13/17  58/214  6/26

GS, Scopus, and WoS provide excellent coverage of journal papers. In line with the findings of others,4–6,10,15
GS covers more conferences than other academic databases. Unlike some other studies,10 we did not find more extended conference coverage in Scopus compared to WoS. This might have been due to increased coverage in WoS in the studies.

Table 2 lists the citation counts in each database; in line with previous studies, the Scopus and WoS citation counts were only a small fraction of those in GS. Our tool thus relied heavily on the unreliable GS to compute RRUs.

Table 2. Total number of citations per author and citations in each database.

Author    Total  WoS        Scopus     ACM        Google      CiteSeerX
Koen      2280   384 (17%)  531 (23%)  217 (10%)  2156 (95%)  138 (6%)
Bart      1056   278 (26%)  313 (30%)  62 (6%)    842 (80%)   20 (2%)
Wilfried  1937   745 (38%)  975 (50%)  10 (1%)    1370 (71%)  46 (2%)

Table 3 contrasts the numbers of relevant citations with the numbers found, partitioned into four categories (J2J, C2J, J2C, and C2C) to distinguish whether citing and cited publications are journal (J) or conference (C) papers. The last column on the right combines Scopus and WoS, where a citation is considered relevant/found as soon as it is relevant/found in at least one of the two and missing if found in neither. Table 4 lists the h-indexes computed by the tool for the databases; the found h-indexes were based on found citations and the corrected h-indexes on relevant citations. The corrected h-indexes correspond to the h-indexes we would obtain if a database fixed all its missing citations. As the tool gives us a list of missing relevant citations, we requested corrections of citation records through the WoS correction-request form. Most of our requested corrections were applied within weeks. We used the tool to collect the numbers presented here before that correction. As the h-indexes are based on coverage, the corrected h-index in one database may be smaller than the counted h-index in another database.

The large RRUs in ACM, Scopus, and WoS indicate that missing relevant citations are an important cause of undercitation. We also observed a large variation in the RRUs of individual authors, affecting their ranking based on WoS and WoS+Scopus h-indexes. Unlike the ACM DL, which should be able to handle conferences with the same completeness as journals, Scopus and WoS reflect much more undercitation for conferences than for journals. So independent of their coverage, Scopus and WoS put conference-oriented authors at a disadvantage. Worse is their J2C undercitation. A large number of J2C citations is one of the strongest arguments for convincing scholars and researchers from non-CS disciplines to value CS conference papers, though it is precisely those citations that are most underestimated. For example, Koen (listed in the tables) might try to convince a promotion committee that conferences should be valued like journals in his domain by pointing to his high #J2C/(#J2J+#J2C) ratio of 43% in WoS. However, this ratio is not nearly as convincing as the 60% he achieved with corrected WoS records.

We see three potential causes for many of the missing citations: First is overcitation in other databases or inclusion of nonexisting citations; the next experiment demonstrates these possibilities occur to a limited degree. The remaining causes are the incorrect parsing of correct references and the occurrence of incorrect and incomplete references in papers, or so-called miscitations. Some papers have been miscited in more than 165 different ways,16 with more miscitations among non-English names8 and in papers with more authors.9 For example, our own work is often miscited because the “De,” “van,” and “den” are incorrectly treated as middle names or because they are capitalized incorrectly. The RRUs we found in WoS and Scopus are more than an order of magnitude higher than the 0.5% and 4.4% found by Meho and Yang.11 But that difference is not surprising; of all the scientific disciplines, librarians and information scientists probably produce the most accurate citations. Note, however, that our experiments
Table 3. Number of citations per author, category, and database and corresponding RRUs.

                    WoS                  Scopus               ACM                  WoS + Scopus
Author    Count     J2J  C2J  J2C  C2C   J2J  C2J  J2C  C2C   J2J  C2J  J2C  C2C   J2J  C2J  J2C  C2C
Koen      relevant  209  205  312  278   201  249  190  226   73   110  142  147   246  288  352  339
          found     101  114  77   92    151  192  75   113   42   65   56   54    180  232  132  179
          RRU       52%  44%  75%  67%   25%  23%  61%  50%   42%  41%  61%  63%   27%  19%  63%  47%
Bart      relevant  194  141  127  104   156  122  112  107   24   34   25   26    211  156  149  132
          found     140  88   19   31    131  96   37   49    15   12   14   21    188  144  45   70
          RRU       28%  38%  85%  70%   16%  21%  67%  54%   38%  65%  44%  19%   11%  8%   70%  47%
Wilfried  relevant  411  240  211  101   507  323  209  141   6    4    10   9     531  385  275  204
          found     350  239  82   74    453  304  124  94    3    0    4    3     496  369  177  153
          RRU       15%  0%   61%  27%   11%  6%   41%  33%   50%  100% 60%  67%   7%   4%   36%  25%
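The RRU rows of Table 3 can be reproduced from the relevant and found counts. The sketch below is illustrative only and is not the authors' tool: it takes RRU to be missing relevant citations divided by relevant citations, as defined in the text, and applies it to Koen's WoS counts from the table. The function and variable names are assumptions made for this example. It also recomputes the #J2C/(#J2J+#J2C) ratio quoted in the text from found counts and from relevant (corrected) counts.

```python
# Counts copied from Table 3 (author Koen, database WoS).
CATEGORIES = ("J2J", "C2J", "J2C", "C2C")
relevant = {"J2J": 209, "C2J": 205, "J2C": 312, "C2C": 278}
found = {"J2J": 101, "C2J": 114, "J2C": 77, "C2C": 92}


def rru(relevant_count, found_count):
    """Missing relevant citations divided by relevant citations."""
    if relevant_count == 0:
        raise ValueError("RRU is undefined without relevant citations")
    return (relevant_count - found_count) / relevant_count


def j2c_ratio(counts):
    """The #J2C / (#J2J + #J2C) ratio used in the text."""
    return counts["J2C"] / (counts["J2J"] + counts["J2C"])


rru_percent = {c: round(100 * rru(relevant[c], found[c])) for c in CATEGORIES}
found_ratio = round(100 * j2c_ratio(found))          # based on citations WoS records
corrected_ratio = round(100 * j2c_ratio(relevant))   # based on all relevant citations

print(rru_percent)                   # {'J2J': 52, 'C2J': 44, 'J2C': 75, 'C2C': 67}
print(found_ratio, corrected_ratio)  # 43 60
```

The computed values match the 52%, 44%, 75%, and 67% RRUs in the table and the 43% and 60% ratios cited in the text.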


consider only citations recorded by at least one database. So even if the occurrence of incorrect or incomplete references inflates a database’s RRU, it apparently did not stop GS or other databases from recording those citations. Undercitation in some databases therefore seems to be caused mostly by their use of inferior parsing technology.

Experiment two. We ran a second experiment in April 2011 for seven editors-in-chief of ACM transactions and seven editors-in-chief of IEEE Computer Society transactions, aiming to validate the previously obtained RRUs on a larger sample set and assess our tool’s accuracy in the presence of erroneous GS records. Due to the large amount of data, we ran the tool in its fast mode. Based on 14 reference publication lists obtained from DBLP, the tool collected 36,931 citations for 1,778 papers in GS and WoS, labeling 18,342 citations as relevant to WoS, of which 9,669 were in WoS, and the remaining 8,673 as missing.

Among the 8,673 missing citations, 1,678 cited conference papers were not cited according to WoS. For such papers, the possibility must be considered that conference data was entered into WoS long after the conferences took place and hence after the citing papers were entered. To estimate the likelihood of this potential cause of RRU, we performed an additional check, building on the assumption that if the late entering of conference data is a major cause of RRU, the result would likely be WoS reporting all papers of the covered conference editions as having zero citations. For each of the 14 editors-in-chief, we selected one conference paper with no citations according to WoS and with the most citations according to GS. For these papers, which were published from 1985 to 2009 and covered 445 of the 1,678 suspect citations, we manually verified that at least one paper of the same conference edition was cited at least once according to WoS. The result was positive in terms of finding citations for all 14 conferences, indicating that late entering of conference data is not likely a major cause of RRU in WoS. Even if late entering of data was a significant cause of undercitation in WoS, similar late entering apparently did not prevent GS from being more complete.

To estimate how much erroneous GS records inflate the RRU in WoS, we manually checked 15 randomly selected citations per author the tool had labeled as missing from WoS. From these 14×15=210 supposedly missing citations, 19 had been labeled incorrectly as such; the corresponding 95%-confidence interval based on the normal approximation is 9.5±3.9%. To compensate for this overestimation, we corrected the number of missing relevant citations as reported by the tool with 9.5% for computing the RRUs reported later in this article.

Most incorrect labels resulted from confusing multiple manuscripts with the same title and authors. Whereas the tool’s fast mode was responsible for the confusion, the more extended search mode as used in the first experiment would have prevented the error. However, in most cases of incorrect labeling, GS simply provided incorrect citation information. In the majority of such cases, GS provided a link to the citing document on CiteSeerX. We inspected the .pdf documents cached on CiteSeerX, discovering they indeed cited the cited paper. We also discovered these .pdfs are not from the published conference or journal papers CiteSeerX claimed them to be but rather from technical reports and Ph.D. theses with the same title and authors but with longer reference lists due to lack of a page limit on the documents. While Google provides little public documentation on its information sources, the correlation between the errors in CiteSeerX and GS points in the direction of CiteSeerX as

Table 4. Found/corrected h-indexes for the various databases.

Author     WoS     Scopus   WoS + Scopus   ACM    Google   CiteSeerX
Koen       12/16   13/17    15/19          8/14   24/25    7/7
Bart       9/12    8/11     11/12          4/6    13/15    3/4
Wilfried   13/15   15/17    16/18          2/4    20/20    4/5

Figure 2. WoS RRUs for 14 editors-in-chief.
[Bar chart: RRU (0% to 90%) per citation category (J2J, C2J, J2C, C2C) for Albers (ACM TALG), Allender (ACM TOCT), Conte (ACM TACO), Hoppe (ACM TOG), Long (ACM TOS), Özsoyoğlu (ACM TODS), Renals (ACM TSLP), Lin (IEEE TVCG), Nejdl (IEEE TLT), Nuseibeh (IEEE SE), Ooi (IEEE TKDE), Stojmenovic (IEEE TKDE), Zabih (IEEE TPAMI), Zomaya (IEEE TC), ACM total, and IEEE total.]


the culprit for a considerable fraction of overcitation in GS. Whatever the cause, however, this overcitation of 9.5% is much smaller than the undercitation we found for other databases.

The conclusions of our first experiment remain valid, as confirmed in Figure 2; also, for the editors-in-chief, there was significantly varying RRU for all types of citations, and the RRU was particularly large for J2C and C2C citations. Figure 3 visualizes the h-indexes and h-cores of the 14 editors-in-chief. An h-core consists of an author’s x papers each cited x or more times, with x being the author’s h-index.7 The bars in Figure 3a represent the h-indexes computed on found citations in WoS, and the bars in Figure 3b represent corrected h-indexes based on relevant citations, confirming h-indexes based on found citations suffer significantly from undercitation. Some authors suffer more than others, to the point their ranking is altered significantly; for example, Ooi was in next-to-last place according to uncorrected WoS citation counts but in third place after correction.

Figure 3. WoS h-indexes and h-cores for 14 editors-in-chief.
[Two bar charts with one row per editor-in-chief (Albers, Allender, Conte, Hoppe, Long, Özsoyoğlu, Renals, Lin, Nejdl, Nuseibeh, Ooi, Stojmenovic, Zabih, Zomaya); each box in a bar marks a journal (j) or conference (c) paper in the h-core. (a) h-index computed on WoS citation counts (axis 5 to 20). (b) h-index computed on corrected WoS citation counts (axis 5 to 25).]

Figures 3a and 3b also show the contribution of conference papers to h-indexes. Each blue/orange box indicates a journal/conference paper in the h-core, ordered left to right from most cited to least cited. For example, Albers has an h-index of 11 in WoS; of the 11 papers in her h-core, the first, fifth, sixth, and 10th most-cited are journal papers. Based on WoS counts, 43% of all papers in the h-cores are conference papers, and based on the corrected counts, 60% are conference papers. Of the five most-cited papers per author, WoS reports 36% are conference papers, whereas the corrected citation counts report 56% are conference papers. These numbers are much higher than those presented by Bar-Ilan1 as obtained from WoS in late 2008/early 2009. This higher count might result from increased coverage in WoS from 2009 to 2011, though we could not verify this conclusion. These results again confirm that using uncorrected WoS citation counts to estimate the importance of conferences for an author can lead to significant underestimation for that author. Moreover, such underestimation varies significantly from author to author; for Zomaya in Figure 2 and Figure 3, WoS attributes 2/12 h-core papers to conferences, which is close to the corrected number of 2/13. However, for Ooi in Figure 2 and Figure 3, WoS attributes 4/11 to conferences, which does not even approximate the corrected number 15/20. We conclude that GS should be used as a complementary source of information to obtain accurate citation counts, even when policy stipulates the citation analysis is limited to WoS coverage. Though this second experiment did not include the ACM DL or Scopus, this conclusion can be extended to those databases, of which the first experiment revealed comparable levels of undercitation.

Experiment three. We applied a similar search for the articles published in eight ACM and IEEE transactions from 2000 to 2002, selecting these years to allow collection of citations over a significant period of time and to include four ACM and four IEEE transactions from related, largely overlapping domains. We excluded editorials and republished proceedings to ensure a fair comparison. For the 135 ACM articles and 770 IEEE articles in the considered volumes, our tool’s fast mode collected 42,658 citations in GS and WoS in April 2011. Before applying a correction of 9.5%, our tool labeled 19,215 citations as found and relevant to WoS, and 5,193 as relevant but missing.

Figure 4 outlines the resulting, corrected RRUs, which are comparable to those of the other experiments, but the underestimation of C2J citations is over 10% less than what we observed in the first two experiments. The RRUs also reflect considerably less variation, suggesting C2J citations of the selected ACM and IEEE transactions are recorded more accurately in WoS than are the citations of other journals in which the sampled authors were published.

Figure 4. WoS RRUs for articles in eight transactions volumes, 2000–2002.
[Bar chart: J2J and C2J RRUs (0% to 30%) for ACM TOCS, ACM TODS, ACM TOG, ACM TOSEM, IEEE TC, IEEE TKDE, IEEE TVCG, IEEE TSE, ACM total, IEEE total, and total.]

For J2J citations our tool confirmed much less variation between the ACM and the IEEE transactions than we observed for individual authors in the first two experiments. However, based on a T-test, the difference in average J2J RRU between ACM and IEEE is statistically significant; IEEE transactions papers are undercited less than ACM transactions papers.

To determine whether this difference might have resulted from differences in citation formats, policies, or culture between ACM and IEEE, we analyzed the sources of the citations. While this analysis indicates ACM and IEEE papers favor citing within their own organization, the numbers were inconclusive with respect to the cause of the different RRUs.

Finally, we studied the effect of undercitation on journal impact factors by computing their average underestimation for 2002 and 2003. These impact factors are based on citations of papers published from 2000 to 2002, the years for which our tool crawled the databases. We used two methods: one in which only J2J citations of full articles (excluding letters, editorials, and republished proceedings) were counted, and one in which J2J and C2J were counted. For each method, we computed the impact factors based on citations the tool found in WoS and on corrected WoS counts. Including the C2J counts resulted in impact factors between 2.37x and 4.35x higher, averaging 3.63x higher for ACM transactions and 3.39x for IEEE transactions. Figure 5 outlines the underestimation of the impact factors resulting from missing citations. In line with the previous results, the underestimation was significant (15%–21%) when only J2J citations were counted, and higher (17%–25%) when C2J citations were included. Due to their higher RRU in WoS, the tool reported the impact factors for ACM transactions are underestimated considerably more than those for IEEE transactions.

Figure 5. Underestimation of average journal impact factors, 2002–2003.
[Bar chart: underestimation (0% to 30%) for "J2J only" and "J2J + C2J" impact factors of ACM TOCS, ACM TODS, ACM TOG, ACM TOSEM, IEEE TC, IEEE TKDE, IEEE TVCG, IEEE TSE, ACM average, and IEEE average.]

These results indicate it is in the interest of ACM and IEEE to include C2J citations to compute journal impact factors and ensure they are recorded more accurately.

Conclusion
ACM, Scopus, and WoS must develop better reference-parsing technology to fix the significant undercitation in their databases. Due to the variations in undercitation, the ACM DL, Scopus, and WoS, and even combinations thereof, are unreliable information sources for the most commonly used citation-based metrics in CS. Moreover, Scopus and WoS databases reflect a significant bias against covered conference proceedings, resulting in underestimation of their impact.

Supposedly unreliable, broad databases like GS can be used to identify and correct undercitation problems in Scopus and WoS without undermining the virtues of their selective inclusion of high-impact conferences and journals. However, due to the inherently slow access to databases like GS, even an automated tool like ours is slow, to the point of being not generally applicable.

Finally, we found a correlation between transactions publishers and their transactions’ undercitation, to the disadvantage of ACM.

References
1. Bar-Ilan, J. Web of Science with the Conference Proceedings Citation Indexes: The case of computer science. Scientometrics 83, 3 (June 2010), 809–824.
2. Bar-Ilan, J. Which h-index? A comparison of WoS, Scopus, and Google Scholar. Scientometrics 74, 2 (Feb. 2008), 257–271.
3. Birman, K. and Schneider, F. Program committee overload in systems. Commun. ACM 52, 5 (May 2009), 34–37.
4. Freyne, J., Coyle, L., Smyth, B., and Cunningham, P. Relative status of journal and conference publications in computer science. Commun. ACM 53, 11 (Nov. 2010), 124–132.
5. García-Pérez, M. Accuracy and completeness of publication and citation records in the Web of Science, PsycINFO, and Google Scholar: A case study for the computation of h-indices in psychology. Journal of the American Society for Information Science and Technology 61, 10 (Oct. 2010), 2070–2085.
6. Harzing, A.-W. and van der Wal, R. Google Scholar as a new source for citation analysis. Ethics in Science and Environmental Politics 8, 1 (June 2008), 61–73.
7. Hirsch, J. An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America 102, 46 (Nov. 2005), 16569–16572.
8. Kotiaho, J. Papers vanish in mis-citation black hole. Nature 398, 6722 (Mar. 1999), 19–19.
9. Kotiaho, J., Tomkins, J., and Simmons, L. Unfamiliar citations breed mistakes. Nature 400, 6742 (July 1999), 307–307.
10. Meho, L. and Rogers, Y. Citation counting, citation ranking, and h-index of human-computer interaction researchers: A comparison of Scopus and Web of Science. Journal of the American Society for Information Science and Technology 59, 11 (Sept. 2008), 13–28.
11. Meho, L. and Yang, K. A new era in citation and bibliometric analyses: Web of Science, Scopus, and Google Scholar. Journal of the American Society for Information Science and Technology 58, 13 (Nov. 2007), 2105–2125.
12. Meyer, B., Choppy, C., Staunstrup, J., and Van Leeuwen, J. Research evaluation for computer science. Commun. ACM 52, 4 (Apr. 2009), 31–34.
13. Moed, H. and Van Leeuwen, T. Improving the accuracy of Institute for Scientific Information’s journal impact factors. Journal of the American Society for Information Science 46, 6 (July 1995), 461–467.
14. Moed, H., De Bruin, R., and Van Leeuwen, T. New bibliometric tools for the assessment of national research performance: Database description, overview of indicators, and first applications. Scientometrics 33, 3 (July 1995), 381–422.
15. Noruzi, A. Google Scholar: The new generation of citation indexes. Libri 55, 4 (Dec. 2005), 170–180.
16. Price, N. What’s in a name (or a number or a date)? Nature 395, 6702 (Oct. 1998), 538–538.
17. Vardi, M. Conferences vs. journals in computing research. Commun. ACM 52, 5 (May 2009), 5–5.
18. Wainer, J., Goldenstein, S., and Billa, C. Invisible work in standard bibliometric evaluation of computer science. Commun. ACM 54, 5 (May 2011), 141–146.

Bjorn De Sutter (bjorn.desutter@elis.ugent.be) is a professor in the Computer Systems Lab of the Department of Electronics and Information Systems at Ghent University, Ghent, Belgium.

Aäron van den Oord (aaron.vandenoord@elis.ugent.be) is a Ph.D. student in the Computer Systems Lab of the Department of Electronics and Information Systems at Ghent University, Ghent, Belgium.

© 2012 ACM 0001-0782/12/08 $15.00
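
The h-index and h-core used in Figure 3 and Table 4 follow directly from the definition given in the article: an author's h-core is the x papers each cited x or more times, with x the h-index. The Python sketch below is only an illustration of that definition; the function names and the citation counts are hypothetical and are not data from the study.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_core(papers):
    """papers maps a paper title to its citation count; returns the h most-cited titles."""
    ranked = sorted(papers.items(), key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked[:h_index(papers.values())]]


# Hypothetical counts for one author: citations found in a database
# versus corrected counts that add back missing citations.
found = {"p1": 6, "p2": 5, "p3": 3, "p4": 1, "p5": 0}
corrected = {"p1": 9, "p2": 7, "p3": 5, "p4": 4, "p5": 4}

print(h_index(found.values()), h_index(corrected.values()))
print(h_core(corrected))
```

With these made-up numbers, adding back missing citations raises the h-index and enlarges the h-core, which is the kind of shift between found and corrected values the article reports.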

doi:10.1145/2240236.2240257

Process Mining

Using real event data to X-ray business processes helps ensure conformance between design and reality.

By Wil van der Aalst

Key insights

Although large organizations appreciate the value of big data, they rarely connect event data to process models; process mining represents the missing link between analysis of big data and business process management.

Aligning event data and process models is essential for conformance checking and performance analysis; additionally, relating events to process models helps breathe life into otherwise static diagrams.

The exponential growth of event data and the need for smarter, leaner processes motivates use of process mining.

Recent breakthroughs in process mining research make it possible to discover, analyze, and improve business processes based on event data. Activities executed by people, machines, and software leave trails in so-called event logs. What events (such as entering a customer order into SAP, a passenger checking in for a flight, a doctor changing a patient’s dosage, or a planning agency rejecting a building permit) have in common is that all are recorded by information systems. Data volume and storage capacity have grown spectacularly over the past decade, while the digital universe and the physical universe are increasingly aligned. Business processes thus ought to be managed, supported, and improved based on event data rather than on subjective opinions or obsolete experience. Application of process mining in hundreds of organizations worldwide shows that managers and users alike tend to overestimate their knowledge of their own processes. Process mining results can thus be viewed as X-rays revealing what really goes on inside processes and can be used to diagnose problems and suggest proper treatment. The practical relevance of process mining and related interesting scientific challenges make process mining a hot topic in business process management (BPM). This article offers an introduction to process mining by discussing the core concepts and applications of the emerging technology.

Process mining aims to discover, monitor, and improve real processes by extracting knowledge from event logs readily available in today’s information systems.1,2 Although event data is everywhere, management decisions tend to be based on PowerPoint charts, local politics, or management dashboards rather than on careful analysis of event data. The knowledge hidden in event logs cannot be turned into actionable information. Advances in data mining made it possible to find valuable patterns in large datasets and support complex decisions based on the data. However, classical data mining problems (such as classification, clustering, regression, association rule learning, and sequence/episode mining) are not process-centric. Therefore, BPM approaches tend to resort to handmade models, and process mining research aims to bridge the gap between data mining and BPM. Metaphorically, process



mining can be seen as taking X-rays to help diagnose/predict problems and recommend treatment.

An important driver for process mining is the incredible growth of event data4,5 in any context (sector, economy, organization, and home) and system that logs events. For less than $600, one can buy, say, a disk drive with the capacity to store all of the world’s music.5 A 2011 study by Hilbert and Lopez4 found that storage space worldwide grew from 2.6 optimally compressed exabytes (2.6 × 10^18 B) in 1986 to 295 compressed exabytes in 2007. In 2007, 94% of all information storage capacity on Earth was digital, with the other 6% in the form of books, magazines, and other non-digital formats; in 1986, only 0.8% of all information-storage capacity was digital. These numbers reflect the continuing exponential growth of data.

The further adoption of technologies (such as radio frequency identification, location-based services, cloud computing, and sensor networks) will accelerate the growth of event data. However, organizations have problems using it effectively, with most still diagnosing problems based on fiction (such as PowerPoint slides and Visio diagrams) rather than on facts (such as event data). This is illustrated by the poor quality of process models in practice; for example, over 20% of the 604 process diagrams in SAP’s reference model have obvious errors and their relation to actual business processes supported by SAP is unclear.6 It is thus vital to turn the world’s massive amount of event data into relevant knowledge and reliable insights, and this is where process mining can help.

The growing maturity of process mining is illustrated by the Process Mining Manifesto9 released earlier this year by the IEEE Task Force on Process Mining (http://www.win.tue.nl/ieeetfpm/) supported by 53 organizations and based on contributions from 77 process-mining experts. The active contributions from end users, tool vendors, consultants, analysts, and researchers highlight the significance of process mining as a bridge between data mining and business process modeling.

The starting point for process mining is an event log in which each event refers to an activity, or well-defined step in some process, and is related to a particular case, or process instance. The events belonging to a case are ordered and can be viewed as one “run” of the process. Event logs may also store additional information about events; when possible, process mining techniques use extra information (such as the resource, person, or device executing or initiating the activity), the timestamp of the event, and data elements recorded with the event (such as the size of an order).

Event logs can be used to conduct three types of process mining (see Figure 1).1 The first and most prominent is discovery; a discovery technique takes an event log and produces a model without using a priori information. For many organizations it is surprising that existing techniques are able to discover real processes based only on example behaviors recorded in event logs. The second type is conformance, where an existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa. The third type is enhancement, where the idea is to extend or improve an existing process model using information about the actual process recorded in an event log. Whereas conformance checking measures alignment between model and reality, this third type of process mining aims to change or extend the a priori model; for instance, using timestamps in the event log, one can extend the model to show bottlenecks, service levels, throughput times, and frequencies.

Figure 1. The three basic types of process mining in terms of input and output.
  discovery:            event log → model
  conformance checking: event log + model → diagnostics
  enhancement:          event log + model → new model

Process Discovery
The goal of process discovery is to learn a model based on an event log. Events can have all kinds of attributes (such as timestamps, transactional information, and resource usage) that can be used for process discovery. However, for simplicity, we often represent events by activity names only. That way, a case, or process instance, can be represented by a trace describing a sequence of activities. Consider, for example, the event log in Figure 2 (from van der Aalst1), which contains 1,391 cases, or instances of some reimbursement process. There are 455 process instances following trace acdeh, with each activity represented by a single character: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide,

f = reinitiate request, g = pay compensation, and h = reject request. Hence, trace acdeh models a reimbursement request that was rejected after a registration, examination, check, and decision step; 455 cases followed this path, which consists of five steps, so the first line in the table corresponds to 455 × 5 = 2,275 events. The whole log consists of 7,539 events.

Process-discovery techniques produce process models based on event logs (such as the one in Figure 2); for example, the classical α-algorithm produces model M1 for this log. This process model is represented as a Petri net consisting of places and transitions. The state of a Petri net, or “marking,” is defined by the distribution of tokens over places. A transition is enabled if each of its input places contains a token; for example, a is enabled in the initial marking of M1, because the only input place of a contains a token (black dot). Transition e in M1 is enabled only if both input places contain a token. An enabled transition may fire, thereby consuming a token from each of its input places and producing a token for each of its output places. Firing a in the initial marking corresponds to removing one token from start and producing two tokens, one for each output place. After firing a, three transitions (b, c, and d) are enabled. Firing b disables c because the token is removed from the shared input place (and vice versa). Transition d is concurrent with b and c; that is, it can fire without disabling another transition. Transition e becomes enabled after d and b or c have occurred. By executing e, three transitions (f, g, and h) become enabled; these transitions are competing for the same token, thus modeling a choice. When g or h is fired, the process ends with a token in place end. If f is fired, the process returns to the state just after executing a. Note that transition d is concurrent with b and c. Process mining techniques must be able to discover such advanced process patterns and should not be restricted to simple sequential processes.

Checking that all traces in the event log can be reproduced by M1 is easy. The same does not hold for the second process model in Figure 2, as M2 is able to reproduce only the most frequent trace acdeh. The model does not fit the log well because observed traces (such as abdeg) are not possible according to M2. The third model is able to reproduce the entire event log, but M3 also allows for traces (such as ah and adddddddg). M3 is therefore considered “underfitting”; too much behavior is allowed because M3 clearly overgeneralizes the observed behavior. Model M4 is also able to reproduce the event log, though the model simply encodes the example traces in the log; we call such a model “overfitting,” as the model does not generalize behavior beyond the observed examples.

In recent years, powerful process mining techniques have been developed to automatically construct a suitable process model, given an event log. The goal is to construct a simple model able to explain most observed behavior without overfitting or underfitting the log.

Conformance Checking
Process mining is not limited to process discovery; the discovered process is just the starting point for deeper analysis. Conformance checking and enhancement relate model and log, as in Figure 1. The model may have been made by hand or discovered through process discovery. In conformance checking, the modeled behavior and the observed behavior, or event log, are compared. When checking the conformance of M2 with respect to the log in Figure 2, only the 455 cases following acdeh can be replayed from beginning to end. If the model would try to replay trace acdeg, it would get stuck after executing acde because g is not enabled. If it would try to replay trace adceh, it would get stuck after executing the first step because d is not (yet) enabled.

Among the approaches to diagnosing and quantifying conformance is one that looks to find an optimal alignment between each trace in the log and the most similar behavior in the model. Consider, for example, process model M1, a fitting trace σ1 = adceg, a non-fitting trace σ2 = abefdeg, and the following three alignments:

γ1 =  log:    a  d  c  e  g
      model:  a  d  c  e  g

γ2 =  log:    a  b  ≫  e  f  d  ≫  e  g
      model:  a  b  d  e  f  d  b  e  g

γ3 =  log:    a  b  e  f  d  e  g
      model:  a  b  ≫  ≫  d  e  g

γ1 shows perfect alignment between σ1 and M1; all moves of the trace in the event log (top part of alignment) can be followed by moves of the model (bottom part of alignment). γ2 shows an optimal alignment for trace σ2 in the event log and model M1; the first two moves of the trace in the event log can be followed by the model. However, e is not enabled after executing only a and b. In the third position of alignment γ2, a d move of the model is not synchronized with a move in the event log. This move in just the model is denoted as (≫,d), signaling a conformance problem. In the next three moves model and log agree. The seventh position of alignment γ2 involves a move in the model that is not also in the log: (≫,b). γ3 shows another optimal alignment for trace σ2. In γ3 there are two situations where log and model do not move together: (e,≫) and (f,≫). Alignments γ2 and γ3 are both optimal if the penalties for “move in log” and “move in model” are the same. Both alignments have two ≫ steps, and no alignments are possible with fewer than two ≫ steps.

Conformance may be viewed from two angles: either the model does not capture real behavior (the model is wrong) or reality deviates from the desired model (the event log is wrong). The first is taken when the model is supposed to be descriptive, or captures or predicts reality; the second is taken when the model is normative, or used to influence or control reality.

Various types of conformance are available, and creating an alignment between log and model is just the starting point for conformance checking.1 For example, various fitness (the ability to replay) metrics are available for determining the conformance of a business process model; a model has fitness 1 if all traces can be replayed from begin to end, and a model has fitness 0 if model and event log “disagree” on all events. In Figure 2, process models M1, M3, and M4 have a fitness of 1, or perfect fitness, with respect to the event log. Model M2 has a fitness 0.8 for the event log consisting of 1,391 cases. Intuitively, this means 80% of the events in the log are explained by the model. Our experience with conformance checking in dozens of organizations shows real-life processes often deviate from the simplified Visio or PowerPoint representations traditionally used by process analysts.

Model Enhancement
A process model can be extended or improved through alignment between event log and model, and a non-fitting process model can be corrected through the diagnostics provided by the alignment. If the alignment contains many (e,≫) moves, it might make sense to allow for skipping activity e in the model. Moreover, event logs may contain information about resources,

Figure 2. An event log and four potential process models (M1, M2, M3, and M4) aiming to describe observed behavior.

Activities: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request.

Event log (number of cases per trace):
   #  trace
 455  acdeh
 191  abdeg
 177  adceh
 144  abdeh
 111  acdeg
  82  adceg
  56  adbeh
  47  acdefdbeh
  38  adbeg
  33  acdefbdeh
  14  acdefbdeg
  11  acdefdbeg
   9  adcefcdeh
   8  adcefdbeh
   5  adcefbdeg
   3  acdefbdefdbeg
   2  adcefdbeg
   2  adcefbdefbdeg
   1  adcefdbefbdeh
   1  adbefbdefdbeg
   1  adcefdbefcdefdbeg
1391  (all 21 variants seen in the log)

Models (diagrams not reproduced): M1 and M3 are models over activities a to h between start and end; M2 is the sequence start, a, c, d, e, h, end; M4 contains one sequential path for each of the 21 variants seen in the log.
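
The fitness value quoted for M2 can be approximated with a short sketch. The Python below is an illustration only, not ProM's implementation: it assumes M2 allows exactly the sequence acdeh, charges a unit cost for every event that appears only in the log or only in the model, and normalizes by the worst case in which nothing matches. The names `EVENT_LOG`, `M2_PATH`, `alignment_cost`, and `fitness` are hypothetical.

```python
from functools import lru_cache

# Event log from Figure 2: trace variant -> number of cases.
EVENT_LOG = {
    "acdeh": 455, "abdeg": 191, "adceh": 177, "abdeh": 144,
    "acdeg": 111, "adceg": 82, "adbeh": 56, "acdefdbeh": 47,
    "adbeg": 38, "acdefbdeh": 33, "acdefbdeg": 14, "acdefdbeg": 11,
    "adcefcdeh": 9, "adcefdbeh": 8, "adcefbdeg": 5,
    "acdefbdefdbeg": 3, "adcefdbeg": 2, "adcefbdefbdeg": 2,
    "adcefdbefbdeh": 1, "adbefbdefdbeg": 1, "adcefdbefcdefdbeg": 1,
}

# Assumption: M2 permits only this single sequence of activities.
M2_PATH = "acdeh"


def alignment_cost(trace, path):
    """Minimum number of log-only plus model-only moves (unit cost each)."""

    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Length of the longest common subsequence of trace[i:] and path[j:].
        if i == len(trace) or j == len(path):
            return 0
        if trace[i] == path[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))

    # Every event outside the common subsequence needs one extra move.
    return len(trace) + len(path) - 2 * lcs(0, 0)


def fitness(log, path):
    """1 minus the total alignment cost relative to the worst-case cost."""
    cost = sum(n * alignment_cost(t, path) for t, n in log.items())
    worst = sum(n * (len(t) + len(path)) for t, n in log.items())
    return 1 - cost / worst


M2_FITNESS = fitness(EVENT_LOG, M2_PATH)
print(f"cases: {sum(EVENT_LOG.values())}, fitness of M2: {M2_FITNESS:.3f}")
```

Under these assumptions the result comes out close to the 0.8 reported for M2; a different cost function or a model with choices and loops would require a full alignment search rather than this sequence comparison.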

August 2012 | Vol. 55 | No. 8 | Communications of the ACM 79



Figure 3. The process model can be extended using event attributes (such as timestamps, resource information, and case data); the model also shows frequencies, as in, say, 1,537 times a decision was made and 930 cases were rejected.

Annotations in the figure (diagram not reproduced):
Resource information in the event log can be used for social network analysis, role discovery, and performance analysis.
Timestamps in the event log can be used to analyze waiting times between activities.
Attributes in the event log can be used for decision-point analysis; the example rule shown is check = “OK” and report = “Approved”.
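
The resource annotation in Figure 3 can be made concrete with a small sketch. The Python below counts how often work passes from one person to another within a case, which is one simple way to derive a social network from the flow of work. The events are invented for illustration (only the first names are borrowed from the figure), and `handover_of_work` is a hypothetical helper, not a ProM function.

```python
from collections import Counter

# Hypothetical events (case id, activity, resource), in order of occurrence.
EVENTS = [
    ("case1", "register request", "Pete"),
    ("case2", "register request", "Pete"),
    ("case3", "register request", "Pete"),
    ("case1", "examine casually", "Sue"),
    ("case2", "examine thoroughly", "Sue"),
    ("case3", "check ticket", "Mike"),
    ("case1", "check ticket", "Mike"),
    ("case2", "check ticket", "Mike"),
    ("case3", "examine casually", "Sue"),
    ("case1", "decide", "Mary"),
    ("case2", "decide", "Mary"),
    ("case3", "decide", "Mary"),
    ("case1", "pay compensation", "Mary"),
]


def handover_of_work(events):
    """Count direct handovers (from_resource, to_resource) within each case."""
    last_resource = {}          # case id -> resource of the previous event
    handovers = Counter()
    for case, _activity, resource in events:
        previous = last_resource.get(case)
        # A handover happens when consecutive events in a case change hands.
        if previous is not None and previous != resource:
            handovers[(previous, resource)] += 1
        last_resource[case] = resource
    return handovers


HANDOVERS = handover_of_work(EVENTS)
for (source, target), count in HANDOVERS.most_common():
    print(f"{source} -> {target}: {count}")
```

The resulting counts can serve as edge weights of a directed graph; grouping people who execute similar activities would be a separate clustering step.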

timestamps, and case data; for example, an event referring to activity “register request” and case “992564” may also have attributes describing the person registering the request (such as “John”), time of the event (such as “30-11-2011:14.55”), age of the customer (such as “45”), and claimed amount (such as “650 euro”). After aligning model and log the event log can be replayed on the model; while replaying, one can analyze these additional attributes; Figure 3 shows, for example, it is possible to analyze wait times between activities. Measuring the time difference between causally related events and computing basic statistics (such as averages, variances, and confidence intervals) makes it possible to identify the main bottlenecks.

Information about resources can help discover roles, or groups of people executing related activities frequently, through standard clustering techniques. Social networks can be constructed based on the flow of work, and resource performance (such as the relation between workload and service times) can be analyzed. Standard classification techniques can be used to analyze the decision points in the process model; for example, activity e (“decide”) has three possible outcomes: “pay,” “reject,” and “redo.” Using data known about the case prior to the decision, a decision tree can be constructed explaining the observed behavior.

Figure 3 outlines why process mining is not limited to control-flow discovery. Moreover, process mining is not limited to offline analysis and can be used for predictions and recommendations at runtime; for example, the completion time of a partially handled customer order can be predicted through a discovered process model with timing information.

Practical Value
Here, I focus on the practical value of process mining. As mentioned earlier, process mining is driven by the continuing exponential growth of event-data volume; for example, according to McKinsey Global Institute in 2010 enterprises stored more than seven exabytes of new data on disk drives, while consumers stored more than six exabytes of new data on such devices as PCs and notebooks.5

The remainder of the article explores how process mining provides value, referring to case studies that used our open-source software package ProM (http://www.processmining.org)1 created and maintained by the process-mining group at Eindhoven University of Technology, though other research groups have contributed, including the University of Padua, Universitat Politècnica de Catalunya, University of Calabria, Humboldt-Universität zu Berlin, Queensland University of Technology, Technical University of Lisbon, Vienna University of Economics and Business, Ulsan National Institute of Science and Technology, K.U. Leuven, Tsinghua University, and University of Innsbruck. Besides ProM, approximately 10 commercial software vendors worldwide develop and distribute process-mining software, often embedded in larger tools


(such as Pallas Athena, Software AG, Futura Process Intelligence, Fluxicon, Businesscape, Iontas/Verint, Fujitsu, and Stereologic).

Provide insight. For the past 10 years we have used ProM in more than 100 organizations, including municipalities (such as Alkmaar, Harderwijk, and Heusden), government agencies (such as Centraal Justitieel Incasso Bureau, the Dutch Justice Department, and Rijkswaterstaat), insurance-related agencies (such as UWV), banks (such as ING Bank), hospitals (such as AMC Hospital in Amsterdam and Catharina Hospital in Eindhoven), multinationals (such as Deloitte and DSM), high-tech system manufacturers and their customers (such as ASML, Philips Healthcare, Ricoh, and Thales), and media companies (such as Winkwaves). For each, we discovered some of their processes based on the event data they provided, with discovered processes often surprising even the stakeholders. The variability of processes is typically much greater than expected. Such insight represents tremendous value, as unexpected differences often point to sources of waste and mismanagement.

Improve performance. Event logs can be replayed on discovered or handmade process models to support conformance checking and model enhancement. Since most event logs contain timestamps, replay can be used to extend the model with performance information.

Figure 4 includes some performance-related diagnostics that can be obtained through process mining. The model was discovered based on 745 objections raised by citizens against the so-called Waardering Onroerende Zaken, or WOZ, valuation in a Dutch municipality. Dutch municipalities are required by law to estimate the value of houses and apartments within their borders. They use the WOZ value as a basis for determining real-estate property tax. The higher the WOZ value, the more tax an owner must pay. Many citizens appeal against the WOZ valuation, asserting it is too high.

Each of the 745 objections corresponds to a process instance. Together, these instances generated 9,583 events, all with timestamps; Figure 4 outlines how frequently the different paths are used in the model. The different stages, or “places” in Petri net jargon, of the model include color to highlight where, on average, most process time is spent; the purple stages of the process take the most time, the blue stages the least. It is also possible to select two activities and measure the time that passes between them. On average, 202.73 days pass from completion of activity “OZ02 Voorbereiden” (preparation) to completion of “OZ16 Uitspraak” (final judgment); this is longer than the average overall flow time of approximately 178 days. Approximately 416, or 56%, of the objections follow this route; the other cases follow the branch “OZ15 Zelf uitspraak” that takes, on average, less time.

Diagnostics, as in Figure 4, can be used to improve processes by removing bottlenecks and rerouting cases. Since the model is connected to event data, it is possible to drill down immediately and investigate groups of cases that take notably more time

Figure 4. Performance analysis based on 745 appeals against the WOZ valuation.

Annotations in the figure (diagram not reproduced):
Purple places indicate bottlenecks in the process.
Blue places indicate phases in the process that take little time.
Average time required to move from one selected activity to another.
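
The measurement named in the last annotation, the time between two selected activities, can be sketched as follows. This Python is an illustration with invented timestamps, not the WOZ data; the activity codes are reused only as labels, and the confidence interval uses a normal approximation (z = 1.96), which is an assumption that becomes reasonable only for larger samples. The helper names are hypothetical.

```python
from datetime import datetime, timedelta
from math import sqrt
from statistics import mean, stdev

START = datetime(2011, 1, 1)

# Hypothetical events (case id, activity, completion time); not real WOZ data.
EVENTS = [
    ("case1", "OZ02", START),
    ("case1", "OZ16", START + timedelta(days=190)),
    ("case2", "OZ02", START + timedelta(days=3)),
    ("case2", "OZ16", START + timedelta(days=203)),
    ("case3", "OZ02", START + timedelta(days=5)),
    ("case3", "OZ16", START + timedelta(days=215)),
    ("case4", "OZ02", START + timedelta(days=7)),
    ("case4", "OZ15", START + timedelta(days=60)),  # follows the other branch
]


def days_between(events, first, second):
    """Per case: days from the first `first` event to the next `second` event."""
    started = {}
    durations = {}
    for case, activity, timestamp in sorted(events, key=lambda e: e[2]):
        if activity == first and case not in started:
            started[case] = timestamp
        elif activity == second and case in started and case not in durations:
            elapsed = timestamp - started[case]
            durations[case] = elapsed.total_seconds() / 86400
    return durations


def mean_with_interval(values, z=1.96):
    """Sample mean with an approximate 95% confidence interval."""
    center = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return center, center - half_width, center + half_width


DURATIONS = days_between(EVENTS, "OZ02", "OZ16")
MEAN, LOW, HIGH = mean_with_interval(list(DURATIONS.values()))
print(f"{len(DURATIONS)} cases, mean {MEAN:.1f} days, interval {LOW:.1f} to {HIGH:.1f}")
```

Cases that never reach the second activity, such as case4 here, are simply left out of the average, which mirrors the idea that only some of the objections follow a given route.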


Figure 5. Conformance analysis showing deviations between event log and process model.

Annotations in the figure (diagram not reproduced):
Activity OZ15 was not executed for two cases, although it was required according to the model.
Activity OZ12 occurred 23 times while the activity was not enabled according to the model.
Activity OZ16 occurred 475 times according to the event log, OZ16 was executed once while not enabled according to the model, and for two cases, OZ16 was not executed, although it was required according to the model.

Figure 6. Process model discovered for a group of 627 gynecological oncology patients. (Diagram not reproduced; its nodes are hospital activities with occurrence counts, connected by arcs annotated with dependency values and frequencies.)

than others.1

Ensure conformance. Replay can also be used to check conformance (see Figure 5). Based on 745 appeals against the WOZ valuation, ProM was used to compare the normative model and the observed behavior, finding that 628 of the 745 cases can be replayed without encountering any problems. The fitness of the model and log is 0.98876214, indicating the model explains almost all recorded events. Despite the good fit, ProM identified all deviations; for example, “OZ12 Hertaxeren” (reevaluate property) occurred 23 times despite not being allowed according to the normative model, as indicated by the “-23” in Figure 5. With ProM the analyst can drill down to see what these cases have in common.

The conformance of the appeal process is high; approximately 99% of events are possible according to the model. Many processes have a much lower conformance; for example, it is not uncommon to find processes where only 40% of events are possible according to the model; for example, process mining revealed ASML’s modeled test process strongly deviated from the real process.8
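
The kind of counts discussed here (cases replayed without problems, activities that occurred while not enabled, and the share of events possible according to the model) can be illustrated with a simplified replay. The sketch below is not ProM's algorithm: it assumes the normative model is a strict sequence (M2 from Figure 2), uses three trace variants from the Figure 2 log instead of the WOZ data, and forces a move whenever an activity occurs out of order, loosely in the spirit of token replay. The names `MODEL`, `LOG`, and `replay` are hypothetical.

```python
from collections import Counter

# Assumed normative model: the strict sequence a, c, d, e, h (M2 in Figure 2).
MODEL = "acdeh"

# Three trace variants taken from the Figure 2 event log (a subset, for brevity).
LOG = {"acdeh": 455, "abdeh": 144, "acdeg": 111}


def replay(log, model):
    """Replay each variant on the sequence and collect deviation counts."""
    position = {activity: i for i, activity in enumerate(model)}
    not_enabled = Counter()   # activity -> times it occurred while not enabled
    clean_cases = 0           # cases replayed without any problem
    incomplete_cases = 0      # cases that never reached the end of the model
    total_events = 0
    enabled_events = 0
    for trace, count in log.items():
        state = 0             # index of the activity the model expects next
        deviations = 0
        for activity in trace:
            if state < len(model) and model[state] == activity:
                state += 1
            else:
                deviations += 1
                not_enabled[activity] += count
                if activity in position:
                    # Force the move and continue from that point in the model.
                    state = position[activity] + 1
        finished = state == len(model)
        if not finished:
            incomplete_cases += count
        if deviations == 0 and finished:
            clean_cases += count
        total_events += count * len(trace)
        enabled_events += count * (len(trace) - deviations)
    return not_enabled, clean_cases, incomplete_cases, enabled_events / total_events


NOT_ENABLED, CLEAN, INCOMPLETE, EVENT_RATIO = replay(LOG, MODEL)
print(dict(NOT_ENABLED), CLEAN, INCOMPLETE, round(EVENT_RATIO, 3))
```

Per-activity counters of this kind correspond to the negative numbers attached to activities in a conformance diagram, while the case-level counter corresponds to statements such as "628 of the 745 cases can be replayed"; a Petri net with choices and concurrency would need proper token replay or alignments instead.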


The increasing importance of corporate governance, risk, compliance management, and legislation (such as the Sarbanes-Oxley Act and the Basel II Accord) highlights the practical relevance of conformance checking. Process mining can help auditors check whether processes execute within certain boundaries set by managers, governments, and other stakeholders.3 Violations discovered through process mining might indicate fraud, malpractice, risks, or inefficiency; for example, in the municipality for which the WOZ appeal process was analyzed, ProM revealed misconfigurations of its eiStream workflow-management system. Municipal employees frequently bypassed the system because system administrators could manually change the status of cases (such as to skip activities or roll back the process).7

Show variability. Handmade process models tend to provide an idealized view of the business process being modeled. However, such “PowerPoint reality” often has little in common with real processes, which have much more variability. To improve conformance and performance, process analysts should not naively abstract away this variability.

Process mining often involves spaghetti-like models; the one in Figure 6 was discovered based on an event log containing 24,331 events referring to 376 different activities describing the diagnosis and treatment of 627 gynecological oncology patients in the AMC Hospital in Amsterdam. The spaghetti-like structures are not caused by the discovery algorithm but by the variability of the process.

Although stakeholders should see reality in all its detail (see Figure 6), spaghetti-like models can be simplified. As with electronic maps, it is possible to seamlessly zoom in and out.1 Zooming out, insignificant things are either left out or dynamically clustered into aggregate shapes, in the same way streets and suburbs amalgamate into cities in Google Maps. The significance level of an activity or connection may be based on frequency, costs, or time.

Improve reliability. Process mining can also help improve the reliability of systems and processes; for example, since 2007, we have used process mining to analyze the event logs of X-ray machines from Philips Healthcare1 that record massive amounts of events describing actual use. Regulations in different countries require proof systems were tested under realistic circumstances; for this reason, process discovery was used to construct realistic test profiles. Philips Healthcare also used process mining for fault diagnosis to identify potential failures within its X-ray systems. By learning from earlier system failure, fault diagnosis was able to find the root cause for new emergent problems. For example, we used ProM to analyze the circumstances under which particular components are replaced, resulting in a set of “signatures,” or historical fault patterns; when a malfunctioning X-ray machine exhibits a particular signature behavior, the service engineer knows what component to replace.

Enable prediction. Combining historic event data with real-time event data can also help predict problems before they occur; for instance, Philips Healthcare can anticipate that an X-ray tube in the field is about to fail by discovering signature fault patterns in the machine’s event logs, so the tube can be replaced before the machine begins to malfunction. Many data sources today are updated in (near) real time, and sufficient computing power is available to analyze events as they occur. Process mining is not restricted to offline analysis and is useful for online operational support. Predictions can even be made for a running process instance (such as expected remaining flow time).1

Conclusion
Process-mining techniques enable organizations to X-ray their business processes, diagnose problems, and identify promising solutions for treatment. Process discovery often provides surprising insight that can be used to redesign processes or improve management, and conformance checking can be used to identify where processes deviate. This is relevant where organizations are required to emphasize corporate governance, risk, and compliance. Process-mining techniques are a means to more rigorously check compliance while improving performance.

This article introduced the basic concepts and showed that process mining can provide value in several ways. For more on process mining see van der Aalst,1 the first book on the subject, and the Process Mining Manifesto9 available in 13 languages; for sample logs, videos, slides, articles, and software see http://www.processmining.org.

Acknowledgments
I thank the members of the IEEE Task Force on Process Mining and all who contributed to the Process Mining Manifesto9 and the ProM framework.

References
1. Aalst, W. van der. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, Berlin, 2011.
2. Aalst, W. van der. Using process mining to bridge the gap between BI and BPM. IEEE Computer 44, 12 (Dec. 2011), 77–80.
3. Aalst, W. van der, Hee, K. van, Werf, J.M. van der, and Verdonk, M. Auditing 2.0: Using process mining to support tomorrow’s auditor. IEEE Computer 43, 3 (Mar. 2010), 90–93.
4. Hilbert, M. and Lopez, P. The world’s technological capacity to store, communicate, and compute information. Science 332, 6025 (Feb. 2011), 60–65.
5. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Report by McKinsey Global Institute, June 2011; http://www.mckinsey.com/mgi
6. Mendling, J., Neumann, G., and Aalst, W. van der. Understanding the occurrence of errors in process models based on metrics. In Proceedings of the OTM Conference on Cooperative Information Systems (Vilamoura, Algarve, Portugal, Nov. 25–30), F. Curbera, F. Leymann, and M. Weske, Eds. Lecture Notes in Computer Science Series, Vol. 4803. Springer-Verlag, Berlin, 2007, 113–130.
7. Rozinat, A. and Aalst, W. van der. Conformance checking of processes based on monitoring real behavior. Information Systems 33, 1 (Mar. 2008), 64–95.
8. Rozinat, A., de Jong, I., Günther, C., and Aalst, W. van der. Process mining applied to the test process of wafer scanners in ASML. IEEE Transactions on Systems, Man and Cybernetics, Part C 39, 4 (July 2009), 474–479.
9. Task Force on Process Mining. Process Mining Manifesto. In Proceedings of Business Process Management Workshops, F. Daniel, K. Barkaoui, and S. Dustdar, Eds. Lecture Notes in Business Information Processing Series 99. Springer-Verlag, Berlin, 2012, 169–194.

Wil van der Aalst (w.m.p.v.d.aalst@tue.nl) is a professor in the Department of Mathematics & Computer Science of the Technische Universiteit Eindhoven, the Netherlands, where he is chair of the Architecture of Information Systems group.

© 2012 ACM 0001-0782/12/08 $15.00

August 2012 | Vol. 55 | No. 8 | Communications of the ACM 83


review articles
doi:10.1145/2240236.2240258

Quantum Money

Imagine money you can carry and spend without a trace.

By Scott Aaronson, Edward Farhi, David Gosset, Avinatan Hassidim, Jonathan Kelner, and Andrew Lutomirski

key insights
■ Any digital good can be perfectly copied. This is a major headache for software companies (and for the entertainment industry), and is the reason that digital cash does not exist.
■ The quantum mechanical “no-cloning” theorem means that in principle it is possible to design quantum systems that cannot be copied. Several recent works propose to use such systems for digital money.
■ Further research may lead to a new form of digital rights management.

Everybody likes money. It is very easy to spend. With cash and credit cards, you can buy what you want when you want it. So why are quantum computing theorists trying to rethink money?

There are a few things we all take for granted about money. We trust credit card companies to keep our transactions private and to send the right amount of money to the right place quickly. When we use paper money, we are used to the fact that we have to carry it physically with us, and we accept the risk of occasional counterfeiting.

Today, we use two basic kinds of money. First, there is the kind we carry around: coins, bank notes, poker chips, and precious metals. These are objects that are made by a mint or dug out of the ground. It is easy to verify that money is valid. You can look for the security features on paper money, you can feel coins in your hand, and, if you really know what you are doing, you can assay precious metals. All of these kinds of physical money can be counterfeited, though: if you have the right equipment, you can print paper money, stamp out your own coins, or make alloys or platings that look a lot like solid precious metals. Some subway passes and copy cards are also examples of physical money; they contain small computer chips or magnetic strips that are actually worth a certain number of subway rides or copies. But these tend to be even easier to counterfeit.²² In theory, any physical money can be counterfeited by just using the same production process as the one used to make the original.

The other kind of money is the kind that you entrust to someone else. Think bank accounts and credit lines. You can carry these around with you in the form of checks, credit cards, and debit cards, portable devices that let you instruct your bank to move money on your behalf. Unlike physical money, there is no point in copying your own credit card (it would not double the amount of money in your bank). With a credit card, you can carry as much value as you want without weighing down your pockets and you can send money across the globe nearly instantaneously. But credit cards have disadvantages: every time you pay someone, you need to tell your bank whom to send money to. This leaves a paper trail and does not work if your connection to your bank is down.

Neither of these kinds of money is ideal. For example, imagine that you are going to Las Vegas on a business trip and you want to play some high-stakes games at night. You might feel conspicuous carrying a fat wad of cash around the strip. If you use a credit



card, your significant other (not to mention anyone else who gets access to your bank statements) will know exactly how much money you gambled. What you really want is some kind of money that you can spend without leaving a trace and that you can carry as much of as you want without weighing down your pockets.

This kind of money would be digital: you could transmit it and fit as much of it as you want on some small handheld computer. It would be self-contained, so you could pay someone without any third party being involved. And it would be cryptographically secure: attackers could never produce a counterfeit bill that passes as real money even with extraordinary resources at their disposal.

The reason we do not have this kind of money today is not for lack of trying. Any digital piece of information that can be sent over a communication channel can be copied. This makes digital money seem impossible: if you had one hundred dollars on your computer, you could back up your computer, spend the money, restore your computer from the backup, and spend your money again.

Enter quantum mechanics. Physicists have known for years that if you possess a single quantum object and know nothing about it, then it is fundamentally impossible to copy that object perfectly. This is called the no-cloning theorem, and it gives us hope that quantum information could be used as the basis of a better kind of money.

So can we make the idealized money out of quantum mechanical objects rather than classical ones? In the rest of this article, we will survey recent work that has tackled this question. We will introduce the idea of quantum information, and we will talk about a few proposals for quantum money and some of the open problems in the field.

Illustration by Spooky Pooka at Début Art

Quantum Mechanics
If you look closely enough, everything is made out of subatomic particles, and these particles obey the laws of quantum mechanics. Quantum mechanical systems store information in a way that is dramatically different from classical (that is, non-quantum) systems.

One of the simplest examples of a quantum system is a single electron. Electrons spin, and their spin can be characterized by a three-dimensional vector.ᵃ This vector, like any three-dimensional vector, has three components, Sx, Sy, and Sz. It is possible to do an experiment to measure the vertical component Sz of an electron’s spin, but if you do the experiment, you will discover something strange: Sz can only take on two values, +1 and −1.

ᵃ The vector represents the angular momentum of the electron, but its physical interpretation is not important for this discussion.


Once you have measured Sz, you can measure it again and you will get the same answer as you got the first time around. Sx and Sy work the same way. So far, you might have thought that each component of the electron’s spin stores one bit of information.

But if you try to measure more than one of the components, again something strange happens. Take an electron, measure Sz, and suppose that the outcome is +1. Now measure Sx (obtaining either +1 or −1) and then measure Sz again. You would expect to get Sz = +1 as before, but if you do this experiment you will get +1 half the time and −1 half the time. Measuring the electron’s spin therefore changes the spin state of the electron. Physicists have come to realize that this is not a limitation of their experiments but rather that the universe fundamentally operates this way.

No matter what encoding you use or how perfect an apparatus you can build, you can only ever reliably encode one bit worth of recoverable classical information in the spin of an electron.²⁰ Nonetheless, an electron behaves very differently than a classical bit. If we use electron spins instead of classical bits to store information, we can perform tasks that are completely impossible with ordinary computers.

Qubits
An electron’s spin is an example of a mathematical object called a qubit. A classical bit can take either of the two values 0 or 1. But a qubit is described mathematically by a normalized state in a two-dimensional complex vector space. We will use notation from physics to denote vectors that represent quantum states, writing a vector named v as |v⟩. We can write any one-qubit state as

$$|q\rangle = a|0\rangle + b|1\rangle$$

where the states |0⟩ and |1⟩ form a basis for the 2D vector space and where a and b are complex numbers that satisfy |a|² + |b|² = 1. If neither a nor b is zero, then we call the state |q⟩ a superposition of |0⟩ and |1⟩ because the qubit |q⟩ is, in a sense, in both states at once.

Just as one qubit can be in the state |0⟩ or |1⟩ or some superposition (linear combination) of both, n qubits can be in any superposition of the states

$$|0\ldots00\rangle,\ |0\ldots01\rangle,\ |0\ldots10\rangle,\ |0\ldots11\rangle,\ \ldots,\ |1\ldots11\rangle$$

So, an n-qubit state is a vector in a 2ⁿ-dimensional space.

Figure 1. The no-cloning theorem.
Imagine that someone prepares a single qubit in the state |ψ⟩ = a|0⟩ + b|1⟩ and gives it to you without telling you what a and b are. Your goal is to copy that qubit. We will call whatever algorithm you use (the supposed “quantum copy machine”) C. You feed C the (unknown) qubit |ψ⟩ and an output qubit that starts in the state |0⟩, and your machine needs to output the original qubit and transform |0⟩ into a copy of |ψ⟩.
You do not know in advance what a and b are, so your copy machine has to work for any values. In particular, your machine needs to work if a = 1 and b = 0, which means

$$C(|0\rangle|0\rangle) = |0\rangle|0\rangle.$$

Similarly, your copy machine needs to work if a = 0 and b = 1, which means

$$C(|1\rangle|0\rangle) = |1\rangle|1\rangle.$$

But quantum mechanics is linear, so any copy machine you could possibly build has to be linear as well. This means that the operator C is linear, so we can do some linear algebra:

$$\begin{aligned}
C(|\psi\rangle|0\rangle) &= C\big((a|0\rangle + b|1\rangle)|0\rangle\big) \\
&= a\,C(|0\rangle|0\rangle) + b\,C(|1\rangle|0\rangle) \\
&= a|0\rangle|0\rangle + b|1\rangle|1\rangle.
\end{aligned} \tag{1}$$

Your copy machine was supposed to copy any state, so the output should have been

$$(a|0\rangle + b|1\rangle)(a|0\rangle + b|1\rangle) = a^2|0\rangle|0\rangle + ab|0\rangle|1\rangle + ab|1\rangle|0\rangle + b^2|1\rangle|1\rangle.$$

If both a and b are nonzero, then the output (1) of your machine is not correct. This means that your copy machine C cannot possibly work.

The simplest kind of measurement one can perform on a single qubit is one that answers this question: is the qubit in a given state |r⟩ = α|0⟩ + β|1⟩? Let us say our qubit is prepared in the state |q⟩ as above and we make this measurement. Then there are two possible outcomes. We might get the answer yes, in which case the state of the system would change instantaneously from |q⟩ to |r⟩. The probability that this happens is given by the squared magnitude of the complex inner product of the two states in question

$$\Pr[\text{yes}] = |\alpha^* a + \beta^* b|^2.$$

If on the other hand we obtain the measurement outcome no, then the state of the system would instantaneously change from |q⟩ to the state |r⊥⟩ = β*|0⟩ − α*|1⟩ that is perpendicular to |r⟩. This happens with probability

$$\Pr[\text{no}] = |\beta a - \alpha b|^2 = 1 - \Pr[\text{yes}].$$

We can use this mathematical framework to explain the measurement statistics of electron spin. We define the states

$$|S_z = +1\rangle = |0\rangle, \qquad |S_z = -1\rangle = |1\rangle;$$

these two states form a basis for a one-qubit vector space. Then

$$|S_x = +1\rangle = \tfrac{1}{\sqrt{2}}\,(|0\rangle + |1\rangle), \qquad |S_x = -1\rangle = \tfrac{1}{\sqrt{2}}\,(|0\rangle - |1\rangle)$$

and

$$|S_y = +1\rangle = \tfrac{1}{\sqrt{2}}\,(|0\rangle + i|1\rangle), \qquad |S_y = -1\rangle = \tfrac{1}{\sqrt{2}}\,(|0\rangle - i|1\rangle).$$

Measuring the spin component Sx is the same as measuring whether the state being tested is |Sx = +1⟩; the outcome yes means Sx = +1 and the outcome no means Sx = −1. If the spin started in the |Sz = +1⟩ state then, upon measuring Sx, we will obtain +1 or −1 with equal probability and the state after the measurement would be either |Sx = +1⟩ or |Sx = −1⟩. If we then measure Sz again, we obtain +1 or −1 with equal probability.

Physicists are trying to build devices that can manipulate electrons or other qubits in a manner analogous to the way ordinary computers manipulate bits in their memories. Such a device, if it worked reliably and could store many qubits, would be a functioning quantum computer.
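The measurement rule and the Sz, Sx, Sz experiment described above can be tried out numerically. The following is a minimal sketch, not code from the article: it stores a one-qubit state as a pair of complex amplitudes, and the names `prob_yes`, `measure`, and `sz_sx_sz_trial` are made up for illustration. It assumes the convention |Sz = +1⟩ = |0⟩ and |Sz = −1⟩ = |1⟩.

```python
import math
import random

# One-qubit states as (amplitude of |0>, amplitude of |1>), using the
# convention |Sz=+1> = |0> and |Sz=-1> = |1>.
S = 1 / math.sqrt(2)
SZ_UP = (1 + 0j, 0j)
SX_UP = (S + 0j, S + 0j)

def orthogonal(r):
    """Return the state perpendicular to r = (alpha, beta): (beta*, -alpha*)."""
    alpha, beta = r
    return (beta.conjugate(), -alpha.conjugate())

def prob_yes(q, r):
    """Probability that state q passes the test 'are you in state r?'."""
    inner = r[0].conjugate() * q[0] + r[1].conjugate() * q[1]
    return abs(inner) ** 2

def measure(q, r, rng):
    """Test q against r. Returns (+1, r) on 'yes' and (-1, r_perp) on 'no'."""
    if rng.random() < prob_yes(q, r):
        return +1, r
    return -1, orthogonal(r)

def sz_sx_sz_trial(rng):
    """Start in |Sz=+1>, measure Sz, then Sx, then Sz again."""
    state = SZ_UP
    first, state = measure(state, SZ_UP, rng)
    _, state = measure(state, SX_UP, rng)
    second, state = measure(state, SZ_UP, rng)
    return first, second

if __name__ == "__main__":
    rng = random.Random(0)
    trials = [sz_sx_sz_trial(rng) for _ in range(10_000)]
    flipped = sum(1 for first, second in trials if first != second)
    print("Pr[Sx=+1 | Sz=+1] =", prob_yes(SZ_UP, SX_UP))
    print("fraction of runs where the second Sz differs:", flipped / len(trials))
```

In this sketch the first Sz measurement always returns +1, while the second Sz measurement, taken after the intervening Sx measurement, disagrees with the first in roughly half of the runs.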

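The linearity argument of Figure 1 can also be checked with a few lines of arithmetic. The sketch below is illustrative only and its function names are hypothetical. It fixes a linear map C by its action on the two basis inputs, extends it linearly to a superposition, and compares the result with the product state a true copier would have to produce.

```python
import math

def tensor(u, v):
    """Two-qubit product state in the basis |00>, |01>, |10>, |11>."""
    return [u[0] * v[0], u[0] * v[1], u[1] * v[0], u[1] * v[1]]

def linear_copier(a, b):
    """Output of a linear C on (a|0> + b|1>)|0>.

    C is fixed on the basis inputs by C|0>|0> = |0>|0> and C|1>|0> = |1>|1>;
    linearity then forces the result a|00> + b|11>, as in equation (1).
    """
    c00 = tensor([1, 0], [1, 0])  # C(|0>|0>)
    c10 = tensor([0, 1], [0, 1])  # C(|1>|0>)
    return [a * x + b * y for x, y in zip(c00, c10)]

def ideal_copy(a, b):
    """What a true copier would have to output: (a|0>+b|1>)(a|0>+b|1>)."""
    psi = [a, b]
    return tensor(psi, psi)

def fidelity(u, v):
    """|<u|v>|^2 for normalized vectors."""
    inner = sum(complex(x).conjugate() * y for x, y in zip(u, v))
    return abs(inner) ** 2

if __name__ == "__main__":
    s = 1 / math.sqrt(2)
    print(fidelity(linear_copier(1, 0), ideal_copy(1, 0)))  # basis state
    print(fidelity(linear_copier(s, s), ideal_copy(s, s)))  # equal superposition
```

For the two basis states the linear map agrees with the ideal copy, but for the equal superposition the overlap between the two outputs is only one half, which is the mismatch Figure 1 points out.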



Many computer scientists and physicists believe that if we could build a quantum computer, we could use it to calculate things that would be intractable with classical computers.²

Qubits have another strange property: unlike classical bits, they cannot be copied. This is the content of the quantum no-cloning theorem, which says there is no such thing as a perfect quantum copy machine that can copy any quantum state you feed it. (See Figure 1 for the proof in the single qubit case.) There are also limits on how closely you can approximate a quantum copy machine.⁷ The no-cloning theorem allows for cryptographic protocols that go beyond the abilities of classical computers. The best known example is quantum key distribution,⁴ which allows two parties to communicate privately, using a quantum channel and an authenticated (public) classical channel.

Quantum Money
The no-cloning theorem means we should not think of qubits the same way we think about bits. One might imagine using a handful of qubits as a form of money. A mint could produce some qubits using some process known only to it, and anyone else, given just those qubits, could not copy them by any means without further information. The no-cloning theorem does not immediately imply secure quantum money is possible; it only says that machines that can copy all quantum states are impossible, and a counterfeiter would be content with a machine that only copied quantum states that represented valid quantum money. A counterfeiter could also try to obtain additional information about the quantum money state by using the algorithm that a merchant would use to verify quantum money. By examining concrete schemes for quantum money, we will see how these kinds of attacks can be avoided.

We distinguish two broad categories of quantum money.

In the simpler version, a mint would produce a quantum bill consisting of some number of qubits. Anyone could store the quantum bill and move it around, maybe even trading it for something else. Whenever someone wants to verify the quantum bill is valid (for example, a merchant who is offered a quantum bill as payment), he or she sends the qubits to the mint and the mint checks that they are still in the correct state using some secret process. In this type of scheme, no one other than the mint knows how to verify the money. We call this private-key quantum money because the key (that is, the information needed to verify the money) is private to the mint.

The other type of quantum money is public-key quantum money. As before, a mint can produce a quantum state and anyone can move it or spend it. Anyone should be able to verify the money themselves without communicating with the mint. Public-key money, if it could be realized, would be the ideal money we discussed earlier.

In the first quantum cryptography paper ever written,²⁶ Stephen Wiesner described a way to implement private-key quantum money in a provably secure manner. (He wrote the paper in 1969, but it was not published until 1983.) In Wiesner’s scheme, each quantum bill is a unique random quantum state,ᵇ which the mint labels with a serial number. The mint keeps track of the state that corresponds to the serial number of each quantum bill and it can use its knowledge of the state to verify the money.

ᵇ In fact, this is the random state later used in the BB84 protocol⁴ for quantum key distribution.

In 1982, Bennett et al. made the first attempt at public-key quantum money.⁵ Their scheme only allowed a piece of money to be spent once (they called their quantum states subway tokens, not bills). In hindsight, their scheme is insecure for two different reasons: first, it is based on an insecure protocol for 1-2 oblivious transfer, and second, it can be broken by anyone who can run Shor’s algorithm²³,²⁴ to factor large numbers. (In the early days of quantum cryptography, there was no reason to suspect either of these weaknesses. Shor’s algorithm²³ and the general attack¹⁴ on oblivious transfer were not known until more than a decade later.)

Surprisingly, the next paper about quantum money did not appear until 2003 when Tokunaga et al.²⁵ attempted to modify Wiesner’s scheme to prevent the mint from


tracking each individual bill as it is used. They achieved this by requiring that the owner of a bill modify the bill before allowing the bank to verify it. The modification is done in a special way so that valid bills remain valid but are otherwise randomized so that the bank cannot tell them apart. This scheme has the significant disadvantage that upon discovering a single counterfeit bill, the bank is required to immediately invalidate every bill it has ever issued. In our opinion this scheme therefore has limited practical applicability.

The idea of public-key quantum money gained traction in the years that followed. Aaronson proved a “complexity-theoretic no-cloning theorem,”¹ which showed that even with access to a verifier, a counterfeiter with limited computational resources cannot copy an arbitrary state. Mosca and Stebila proposed¹⁸ the idea of a quantum coin as distinct from a quantum bill: each quantum coin of a given denomination would be identical. Using the complexity-theoretic no-cloning theorem they argued it might be possible to implement a quantum coin protocol but they did not give a concrete implementation. Aaronson¹ proposed the first concrete scheme for public-key quantum money; however, this scheme was shown to be insecure in Lutomirski et al.¹⁶ In the latter paper, the authors suggested the idea of collision-free quantum money. Unlike quantum coins, each collision-free quantum bill has a serial number and nobody, not even the mint, can produce two bills with the same serial number. This feature can be useful to prevent the mint from printing more money than it says it is printing. The mint posts a list of all serial numbers of every quantum bill ever produced, and we can be sure the mint produced at most one bill for each serial number on the list. In a subsequent paper, Farhi et al. proposed a concrete scheme they believed was both collision free and secure against counterfeiting.¹¹

Here, we tell you how some of these proposals work.

Wiesner’s Quantum Money
Wiesner’s original quantum money scheme²⁶ works as follows. To produce a quantum bill using n qubits, the mint first chooses n one-qubit states randomly drawn from the set {|Sz = +1⟩, |Sz = −1⟩, |Sx = +1⟩, |Sx = −1⟩}. The mint then assigns that state a classical serial number. A piece of quantum money consists of the n-qubit state and its serial number. The mint keeps a list of all serial numbers issued as well as a description of which state corresponds to which serial number. When you pay for something with a quantum bill, the merchant sends the quantum state and its serial number back to the mint for verification. The mint looks up the serial number and retrieves the description of the corresponding quantum state. Then the mint verifies the given state is the state that goes with the attached serial number. This kind of money cannot be forged by someone outside the mint. Since a would-be forger has no knowledge of the basis that each qubit was prepared in, the quantum no-cloning theorem says he or she cannot reliably copy the n-qubit quantum state (Figure 2).

Figure 2. Wiesner’s quantum money. Source: Science, Aug. 7, 1992.

The main weakness in Wiesner’s scheme is that the merchant must communicate with the bank to verify each transaction. So this scheme, although theoretically inspiring and provably secure, would not be much more powerful than credit cards. Wiesner’s scheme is a private-key quantum money scheme because the mint must keep a private secret (the complete description of the state) to use for verification.

Challenges in Designing Public-Key Quantum Money
The resurgence of interest in quantum money is centered around the idea of public-key quantum money. As we have discussed, a public-key quantum money scheme would have the following properties.¹⁶
1. The mint can mint it. That is, there is an efficient algorithm to produce the quantum money state.
2. Anyone can verify it without communicating with the mint. That is, there is an efficient measurement anyone can perform that accepts money produced by the mint with high probability and minimal damage.
3. No one (except possibly the mint) can copy it. That is, no one other than the mint can efficiently produce states that are accepted by the verifier with better than exponentially small probability.

Why is public-key quantum money so hard to design? The difficulty of developing public-key quantum money arises from the fact that the verification algorithm, which is known to everyone in a public-key scheme, can be used by a would-be forger in an attempt to counterfeit the money.

Wiesner’s scheme is provably secure on information-theoretic grounds if it is used properly. In Wiesner’s scheme, only the bank has the additional information required to verify a given quantum bill is legitimate and therefore only the bank can copy the money.

It turns out that, if the mint is careless, then even the mere act of verifying bills can allow someone to create counterfeit bills.¹⁵ Recall that in Wiesner’s scheme, in every transaction the bill is sent by the merchant back to the mint for verification. If the money is confirmed to be valid, the mint sends back the valid bill to the merchant. What happens if the money is determined by the mint to be counterfeit? If the mint sends back the invalid bill, then a counterfeiter can successfully forge the money.

Let us see how this works. A counterfeiter can start with one good quantum bill, which in Wiesner’s scheme is n one-qubit states

$$|\psi\rangle = |\psi_1\rangle|\psi_2\rangle \ldots |\psi_n\rangle$$

along with a serial number the bank uses to verify the state. The counterfeiter can produce a random one-qubit state |φ₁⟩, and, setting aside the first qubit |ψ₁⟩ of the original bill, he or she then sends the mint the state

$$|\psi'\rangle = |\phi_1\rangle|\psi_2\rangle \ldots |\psi_n\rangle.$$

If the bill |ψ′⟩ turns out to be valid (this happens with probability ½), the mint returns the bill, and in this case the mint’s measurement will have changed the state to |ψ⟩. So now the counterfeiter possesses both |ψ⟩ and the original qubit |ψ₁⟩ that was set aside, and so he or she has succeeded in copying the first qubit |ψ₁⟩. On the other hand, if the mint determines the bill |ψ′⟩ is not valid, then the state of the bill after the mint’s measurement will be

$$|\psi_1^{\perp}\rangle|\psi_2\rangle \ldots |\psi_n\rangle$$

where |ψ₁⊥⟩ is the one-qubit state orthogonal to |ψ₁⟩. Note that the states of qubits 2 through n have not been changed by this process. So the counterfeiter can then throw away |ψ₁⊥⟩, replace it with a random state, and try again. After an average of two tries, the counterfeiter will have copied the first qubit of the quantum bill. Then the counterfeiter can repeat this whole procedure to copy the second qubit, the third qubit, and so on until all n qubits have been copied.ᶜ

ᶜ The attack in Lutomirski¹⁵ is different: it is deterministic and twice as fast, but it is less intuitive.

So if the bank sends back quantum states it deems to be invalid quantum money, the whole scheme is unusable. This tells us how not to implement Wiesner’s scheme in practice. But it also highlights the fact that having access to a verifier that returns the state and a verdict on the validity of the quantum money is in itself a powerful tool a forger can try to exploit, even if the forger cannot look inside the machine that verifies money. This type of attack is particularly applicable to public-key quantum money schemes, in which the verification algorithm is publicly known.

This attack was particularly simple against Wiesner’s money because each bill consists of independent (that is, unentangled) qubits. A more general algorithm called quantum state restoration¹⁰ works on entangled states as well: starting with a single valid quantum bill, a counterfeiter can make a sequence of measurements on the state and use the algorithm that verifies the bill to undo the damage caused by measuring the state. So any public-key quantum money scheme must be designed so that the attacker cannot gain enough information to copy the quantum money state by making a reasonable number of measurements on one copy of it. Can we hope to design a public-key quantum money scheme which has this property, or is the access to a verification algorithm already enough information to allow cloning of an arbitrary state? Aaronson answered this question in 2009 with a “complexity-theoretic no-cloning theorem.”

The Complexity-Theoretic No-Cloning Theorem
As we mentioned earlier, the standard no-cloning theorem is not good enough to prove secure public-key quantum money is possible, since it does not take into account the counterfeiter can check whether a given state is valid quantum money or not. In fact, if a counterfeiter has unlimited time, then it is straightforward to counterfeit public-key quantum money: simply generate a random state and check if that state is valid money. If not, try again. In a secure money scheme, the probability that any attempt succeeds is exponentially small.
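As a rough illustration of why blind guessing fails, consider the simplest case, assumed here for concreteness and not taken from any particular scheme: the verifier accepts exactly one n-qubit state |ψ⟩, and the counterfeiter submits a state |φ⟩ drawn uniformly at random from all n-qubit states. Because the average of |φ⟩⟨φ| over such random states is the identity divided by the dimension 2ⁿ, the average acceptance probability is

```latex
\mathbb{E}_{\phi}\!\left[\Pr[\text{accept}]\right]
  = \mathbb{E}_{\phi}\,\bigl|\langle \psi | \phi \rangle\bigr|^{2}
  = \langle \psi |\, \mathbb{E}_{\phi}\bigl[\,|\phi\rangle\langle\phi|\,\bigr] \,| \psi \rangle
  = \langle \psi |\, \frac{I}{2^{n}} \,| \psi \rangle
  = 2^{-n}.
```

Under these assumptions, about 2ⁿ independent guesses are needed on average before one is accepted, which is already hopeless for a bill of a few hundred qubits.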

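Wiesner’s scheme and the attack on a careless mint, both described above, can be simulated classically because each bill is a product of independent one-qubit states. The sketch below is illustrative only. `CarelessMint`, `forge`, and the serial number used in the usage note are hypothetical names, and the random guesses are assumed to be drawn from the same four states the mint uses. In a simulation the states are ordinary data, so the point is only that `forge` never inspects an amplitude or the mint’s ledger; it just moves qubits around and reads the mint’s verdicts.

```python
import math
import random

S = 1 / math.sqrt(2)
# The four one-qubit states used by the mint: |Sz=+1>, |Sz=-1>, |Sx=+1>, |Sx=-1>.
STATES = [(1.0, 0.0), (0.0, 1.0), (S, S), (S, -S)]

def overlap_sq(u, v):
    """|<u|v>|^2 for real two-component states."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def orthogonal(u):
    """The one-qubit state perpendicular to a real state u."""
    return (u[1], -u[0])

class CarelessMint:
    """Hypothetical mint that hands back the measured bill even when it is invalid."""

    def __init__(self, rng):
        self.rng = rng
        self.ledger = {}  # serial number -> secret list of one-qubit states

    def issue(self, serial, n):
        secret = [self.rng.choice(STATES) for _ in range(n)]
        self.ledger[serial] = secret
        return list(secret)  # the "quantum" part of the bill

    def verify(self, serial, qubits):
        """Measure each qubit against the ledger; return (valid, measured qubits)."""
        valid, after = True, []
        for expected, q in zip(self.ledger[serial], qubits):
            if self.rng.random() < overlap_sq(expected, q):
                after.append(expected)              # collapsed onto the right state
            else:
                after.append(orthogonal(expected))  # collapsed onto the wrong one
                valid = False
        return valid, after

def forge(mint, serial, bill, rng):
    """Copy a bill qubit by qubit using only the mint's verdicts and returned bills."""
    bill, copy, submissions = list(bill), [], 0
    for i in range(len(bill)):
        set_aside = bill[i]                   # keep the genuine qubit i
        while True:
            bill[i] = rng.choice(STATES)      # put a random guess in slot i
            valid, bill = mint.verify(serial, bill)
            submissions += 1
            if valid:                         # slot i now holds the genuine state
                break
        copy.append(set_aside)
    return bill, copy, submissions
```

Calling `forge(mint, "A-001", bill, rng)` on a freshly issued bill returns two bills that both pass `mint.verify`, after roughly two submissions per qubit on average, matching the estimate in the text.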



The complexity-theoretic no-cloning theorem¹ says there is no generic attack much better than random guessing. What do we mean by a generic attack? Suppose there is a verification machine that checks whether or not a given state |φ⟩ is equal to a good quantum money state |ψ⟩. The machine takes as input any quantum state |φ⟩; it outputs 0 if |φ⟩ = |ψ⟩ and 1 if |φ⟩ is orthogonal to |ψ⟩. In either case, it also outputs the quantum state that is left over after the measurement. Aaronson showed that, as long as that machine is a black box, it can fall into the hands of a counterfeiter without compromising the quantum money scheme. In other words, a counterfeiter with access to some quantum money as well as the verification machine would either need to take the machine apart to figure out how it worked or else use the machine an exponentially large number of times in order to make any more quantum money than he or she started with.

This theorem does not guarantee any particular scheme is secure. For every quantum money scheme that has been proposed, the states |ψ⟩ that are “good” quantum money states are not completely unknown since they come from a restricted set of states generated by the mint’s algorithm. If this set of states is small enough then having a “black box” verifier may allow a forger to copy a money state; we have already seen an example of this with Wiesner’s scheme. And it might also be possible to design attacks on public-key quantum money that do not use the verifier as a black box. So in order to evaluate any public-key quantum money scheme, we will have to look at the details of the verifier and the set of valid quantum money states that are minted by the bank.

Figure 3. Quantum money from knots.
Figure 4. A knot.
Figure 5. Reidemeister moves.

Quantum Coins
One of the first applications of the complexity-theoretic no-cloning theorem was given by Mosca and Stebila.¹⁸ They showed it might be possible to have a public-key quantum money scheme in which every piece of quantum money is identical: they called these quantum coins.¹⁸,¹⁹

Quantum coins, like ordinary coins, are all the same with no marks distinguishing each coin. One advantage of quantum coins is they are anonymous: no one can tell one coin from another, so it is difficult to keep track of where and when a particular coin was spent.

Mosca and Stebila had two results about quantum coins. They extended the complexity-theoretic no-cloning theorem to quantum coins. If a would-be counterfeiter has access to a machine that verifies quantum coins but cannot look inside that machine, then there is no way to make more coins than he or she started with in any reasonable amount of time. This result gives some hope a public-key quantum coin protocol could be discovered.

Their second result is based on blind quantum computation (introduced by Childs⁹ and studied by Broadbent et al.⁶). Blind quantum computation is a protocol whereby a quantum computer with very limited resources (sometimes called a quantum calculator) runs a polynomial size quantum circuit with the help of a server, where the server does not learn anything about the circuit performed (except an upper bound on its size). In the protocol introduced by Mosca and Stebila, the merchant runs an obfuscated verification algorithm from which he or she learns nothing except the final answer: that it is or is not a valid coin. However, this requires (quantum) communication with the bank, and so this quantum coin scheme is a private-key protocol.

To date there is no published concrete proposal for public-key quantum coins.
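The black-box verification machine from the discussion of the complexity-theoretic no-cloning theorem, which returns a verdict together with the state left over after the measurement, can be sketched as follows. This is a minimal illustration under the assumption that exactly one state |ψ⟩ is valid; the class `BlackBoxVerifier` and its methods are hypothetical and do not come from any of the cited schemes.

```python
import math
import random

def inner(u, v):
    """Complex inner product <u|v>."""
    return sum(complex(x).conjugate() * y for x, y in zip(u, v))

def normalize(v):
    """Rescale a nonzero vector to unit length."""
    norm = math.sqrt(sum(abs(z) ** 2 for z in v))
    return [z / norm for z in v]

class BlackBoxVerifier:
    """Hypothetical verifier for one secret money state psi.

    Callers are meant to use only check(); psi itself is never handed out.
    """

    def __init__(self, psi, rng):
        self._psi = normalize(psi)
        self._rng = rng

    def accept_probability(self, phi):
        # Exposed only so the example can be tested; a real box would not offer it.
        return abs(inner(self._psi, normalize(phi))) ** 2

    def check(self, phi):
        """Return (0, leftover) if phi passes and (1, leftover) if it fails."""
        phi = normalize(phi)
        amp = inner(self._psi, phi)
        if self._rng.random() < abs(amp) ** 2:
            return 0, list(self._psi)
        # A failed check projects phi onto the subspace orthogonal to psi.
        rest = [y - amp * x for x, y in zip(self._psi, phi)]
        return 1, normalize(rest)
```

A counterfeiter holding such a box learns one bit per call and receives a state that is either the valid one or orthogonal to it, which is the only kind of access the generic attacks discussed above are allowed to use.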


"One advantage of quantum coins is that they are anonymous: no one can tell one coin from another, so it is difficult to keep track of where and when a particular coin was spent."

Public-Key Quantum Money without Secrets

In all of the schemes we have discussed so far, the mint first generates some classical secret and then uses that secret to produce the quantum money. In any scheme like this, if anyone can figure out the secret, then they can use this secret to produce valid quantum money states with the same algorithm that the mint uses. A would-be forger can try to use the (publicly known) verification algorithm along with techniques such as quantum state restoration10 to try to reverse-engineer the secret.

Lutomirski et al.16 suggested a different approach to designing quantum money. Imagine a physical process (or a quantum measurement) that can simultaneously generate a random serial number y (drawn from an enormous set of possible serial numbers) and a corresponding quantum state |$_y⟩. For any given serial number y, a second algorithm would be able to verify that some quantum state was indeed |$_y⟩. A key feature of this scheme is collision-freedom: no one can generate more than one copy of |$_y⟩ for any value of y. (Anyone can generate quantum states corresponding to different serial numbers.)

To use these states as money, the mint simply generates a pile of quantum states and corresponding classical serial numbers. The mint then publishes the list of all the serial numbers and the verification algorithm that can be used by anyone to check the validity of a given quantum money state. A quantum state matching a published serial number is valid money; any other state is not. If the mint published an actual list, then anyone could also verify the mint produced no more quantum money than it said it did; as a practical matter, though, the mint would probably use digital signatures instead of a list.

Lutomirski et al. also suggested a way such an algorithm might be designed. Consider a large set S and a function f that assigns each element of S a label. Suppose there is an exponential number of possible labels and an exponential number of elements of S that share each label. Each label corresponds to a serial number, and the state corresponding to the serial number y is a uniform superposition of all of the elements of S that have the label y. Mathematically,

$$ |\$_y\rangle = \sum_{x \in S \ \text{s.t.}\ f(x) = y} |x\rangle . $$

To produce a quantum money state, the mint first prepares a uniform superposition over all elements of S and measures the label that corresponds to the state. This results in a random label and, like all measurements, changes the state so the new state will always get the same measurement outcome. This means the superposition collapses to exactly those elements of S that have the measured label.

The verification procedure presented by Lutomirski et al. requires that anyone who knows some x with f(x) = y be able to find another random x′ with the same label y, and therefore f must be chosen so this is possible. A merchant who wants to verify a quantum bill first measures the label and confirms it matches the serial number of the bill, and then performs a more complicated quantum measurement to check the state is invariant under the operation that randomizes the elements that share the same label.

Lutomirski et al. conjecture that if f and S are appropriately chosen, then the resulting quantum money will be secure. In that paper, however, they did not describe an appropriate f and S.

Quantum Money from Knots

The only published scheme11 for public-key quantum money that has not been shown to be insecure is an implementation of collision-free quantum money. In this scheme, the set S is a set of drawings of knots. We will have to take a quick detour into knot theory in order to describe this quantum money (Figure 3).

Most of us have some experience in our day-to-day lives with the basic properties of knots. Mathematicians who study knot theory have formalized these basic properties. For our purposes, a good place to start will be with some definitions. A knot is a mapping of the circle S1 (like a loop of string) into three-dimensional space. For example, Figure 4 shows a knot.

Usually when we draw a knot, we use a two-dimensional diagram like the one in Figure 4. If we take a knot and then fiddle with it a bit (without




cutting the string it is made out of) and then draw it, we might end up with a different diagram. But we would still like to call it the same knot. So the question arises: which pictures represent the same knot? The three modifications to a knot diagram shown in Figure 5 are called the Reidemeister moves. It can easily be seen that by applying these moves you only move between topologically equivalent knots. It is also true (but more difficult to see) that any two diagrams representing the same knot can be mapped into one another using these moves.

There is no known good algorithm to determine whether two knot diagrams represent the same knot; it has only recently been discovered that knot equivalence is decidable.13 But sometimes there is a way to tell that two diagrams do not represent the same knot: by using a knot invariant. A knot invariant is a property of a knot that is the same for all diagrams representing the same knot. If you can find a knot invariant that takes different values for the two diagrams in question, then you can be sure they represent different knots. (The converse of this is not generally true: there can be two different knots that share the same value for a particular knot invariant.) One of the first knot invariants to be discovered is called the Alexander polynomial. Any knot has an associated Alexander polynomial, and its coefficients are integers that can be efficiently calculated from any diagram of that knot.

To make quantum money from knots, the set S in the general collision-free scheme is taken to be the set of knot diagrams, and the label f associated with each diagram is its Alexander polynomial. Applying a sequence of random Reidemeister moves randomizes among the diagrams of the same knot, allowing the measurement that verifies the quantum money states. So the mint prepares the superposition over all diagrams and measures the Alexander polynomial's coefficients to make a quantum bill, and a merchant measures the coefficients and verifies the superposition is invariant under the Reidemeister moves.

(The actual scheme is somewhat more complicated because knot diagrams are inconvenient to work with; see Farhi et al.11 for the technical details.)

While no one has proven that knot money is secure, attempts to break it seem to run into knot theory problems that have no known practical solutions.

What Does the Future Hold for Quantum Money?

Public-key quantum money is one of the few quantum protocols that do something that is truly impossible classically, even under cryptographic assumptions. QKD can be used to encrypt information between two parties that did not coordinate keys in advance, but under reasonable security assumptions, lattice-based cryptography can perform the same feat.4,21 Assuming SHA1 is a pseudorandom function, one can use it to implement strong coin flipping,8,17 and encrypted communication channels enable fast Byzantine agreement.3,12 However, no cryptographic assumption enables a digital analog of cash, as any string of bits that would represent a bill can always be copied.

The idea of some kind of quantum object that only one special entity can produce may have applications beyond being used as money. For example, software companies would like to be able to produce software programs that anyone can use but that no one can copy. Whether this is possible for any useful type of software remains to be seen.

Will a future government replace its currency with quantum money? Maybe. You could use it online to purchase things without transaction fees and without oversight from any third party. You could download your quantum money onto your qPhone (not yet trademarked) and use it to buy things from quantum vending machines. With the advent of quantum money, we hope everybody will like spending money a little bit more.

References
1. Aaronson, S. Quantum copy-protection and quantum money. In Annual IEEE Conference on Computational Complexity (2009), 229–242.
2. Bacon, D. and van Dam, W. Recent progress in quantum algorithms. Commun. ACM 53 (Feb. 2010), 84–93.
3. Ben-Or, M. and Hassidim, A. Fast quantum byzantine agreement. In STOC (2005), 481–485.
4. Bennett, C.H. and Brassard, G. Quantum cryptography: Public key distribution and coin tossing. In Proceedings of IEEE International Conference on Computers Systems and Signal Processing (Bangalore, India, 1984), 175–179.
5. Bennett, C.H., Brassard, G., Breidbart, S. and Wiesner, S. Quantum cryptography, or unforgeable subway tokens. In Advances in Cryptology—Proceedings of Crypto (1983), volume 82, 267–275.
6. Broadbent, A., Fitzsimons, J. and Kashefi, E. Universal blind quantum computation. In FOCS (2009), 517–526.
7. Bužek, V. and Hillery, M. Quantum copying: Beyond the no-cloning theorem. Phys. Rev. A 54, 3 (1996), 1844–1852.
8. Chailloux, A. and Kerenidis, I. Optimal quantum strong coin flipping. In 50th Annual IEEE Symposium on Foundations of Computer Science, 2009 (FOCS'09) (2009), IEEE, 527–533.
9. Childs, A.M. Secure assisted quantum computation. Quant. Inform. Comput. 5, 6 (2005), 456–466.
10. Farhi, E., Gosset, D., Hassidim, A., Lutomirski, A., Nagaj, D. and Shor, P. Quantum state restoration and single-copy tomography for ground states of hamiltonians. Phys. Rev. Lett. 105, 19 (Nov. 2010), 190503.
11. Farhi, E., Gosset, D., Hassidim, A., Lutomirski, A. and Shor, P. Quantum money from knots. In Innovations in Theoretical Computer Science (2012).
12. Feldman, P. and Micali, S. An optimal probabilistic protocol for synchronous byzantine agreement. SIAM J. Comput. 26, 4 (1997), 873–933.
13. Hass, J. Algorithms for recognizing knots and 3-manifolds. Chaos, Solitons & Fractals 9(4–5) (1998), 569–581.
14. Lo, H.-K. Insecurity of quantum secure computations. Phys. Rev. A 56 (Aug. 1997), 1154–1162.
15. Lutomirski, A. An online attack against Wiesner's quantum money. arXiv:1010.0256, 2010.
16. Lutomirski, A., Aaronson, S., Farhi, E., Gosset, D., Hassidim, A., Kelner, J. and Shor, P. Breaking and making quantum money: Toward a new quantum cryptographic protocol. In Innovations in Computer Science (2010).
17. Mochon, C. Quantum weak coin flipping with arbitrarily small bias. Arxiv preprint arXiv:0711.4114 (2007).
18. Mosca, M. and Stebila, D. A framework for quantum money. Poster at Quantum Information Processing (QIP) (Brisbane, Australia, 2007).
19. Mosca, M. and Stebila, D. Quantum coins. In Error-Correcting Codes, Finite Geometries, and Cryptography. Contemporary Mathematics, Aiden A. Bruen and David L. Wehlau, eds. volume 523. American Mathematical Society, 2010, 35–47.
20. Nielsen, M.A. and Chuang, I.L. Quantum Information and Computation, Cambridge University Press, Cambridge, UK, 2000.
21. Regev, O. On lattices, learning with errors, random linear codes, and cryptography. In Proceedings of the 37th annual ACM symposium on Theory of computing (2005), ACM, 84–93.
22. Ryan, R., Anderson, Z. and Chiesa, A. Anatomy of a subway hack. http://tech.mit.edu/V128/N30/subway/Defcon_Presentation.pdf (August 2008).
23. Shor, P.W. Algorithms for quantum computation: Discrete logarithms and factoring. In FOCS (1994), IEEE Computer Society, 124–134.
24. Shor, P.W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. 41, 2 (1999), 303–332.
25. Tokunaga, Y., Okamoto, T. and Imoto, N. Anonymous quantum cash. In EQIS (Aug. 2003).
26. Wiesner, S. Conjugate coding. SIGACT News 15, 1 (1983), 78–88.

Scott Aaronson (aaronson@csail.mit.edu) is an associate professor, CSAIL, MIT, Cambridge, MA.

Edward Farhi (farhi@mit.edu) is the Cecil and Ida Green Professor of Physics, and director of the Center for Theoretical Physics, MIT, Cambridge, MA.

David Gosset (dgosset@mit.edu) is a postdoctoral fellow in the Department of Combinatorics and Optimization and the Institute for Quantum Computing, University of Waterloo, Canada.

Avinatan Hassidim (avinatanh@gmail.com) is an assistant professor in the Department of Computer Science, Bar Ilan University, and works at Google Israel. His research is supported by a grant from the German-Israeli Foundation (GIF) for Scientific Research and Development under contract number 1-2322-407.7/2011.

Jonathan Kelner (kelner@mit.edu) is an assistant professor of applied mathematics, and a member of CSAIL, MIT, Cambridge, MA.

Andrew Lutomirski (andy@luto.us) is co-founder of AMA Capital Management LLC, Palo Alto, CA.

© 2012 ACM 0001-0782/12/08 $15.00
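The label-measurement step of the collision-free scheme described in "Public-Key Quantum Money without Secrets" can be mimicked classically on a toy instance. The sketch below is an illustration only, under loud assumptions: S is a tiny explicit set, the label function `f` (x mod 4) is invented for the example, and a dictionary of amplitudes stands in for a quantum state. A real scheme needs an exponentially large S and actual quantum hardware, and nothing here is unforgeable.

```python
import math
import random
from collections import defaultdict

def uniform_superposition(elements):
    """Give every element of S the same amplitude 1/sqrt(|S|)."""
    amp = 1.0 / math.sqrt(len(elements))
    return {x: amp for x in elements}

def measure_label(state, f, rng):
    """Simulate measuring f(x): choose a label with probability equal to
    its total squared amplitude, then collapse onto that label's elements."""
    weight = defaultdict(float)
    for x, a in state.items():
        weight[f(x)] += a * a
    labels = sorted(weight)
    y = rng.choices(labels, weights=[weight[l] for l in labels])[0]
    norm = math.sqrt(weight[y])
    collapsed = {x: a / norm for x, a in state.items() if f(x) == y}
    return y, collapsed

S = range(16)                 # toy set; the real S is exponentially large
f = lambda x: x % 4           # hypothetical label function (assumption)
rng = random.Random(0)

# Mint: prepare the uniform superposition and measure the label.
serial, bill = measure_label(uniform_superposition(S), f, rng)
# Measuring the label of the collapsed state again gives the same serial.
serial_again, _ = measure_label(bill, f, rng)
```

The second measurement returning the same label mirrors the statement that the collapsed state "will always get the same measurement outcome"; the invariance check a merchant would perform has no classical counterpart in this sketch.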





ACM A.M. TURING AWARD
NOMINATIONS SOLICITED

Nominations are invited for the 2012 ACM A.M. Turing Award. ACM's oldest and most prestigious award is presented for contributions of a technical nature to the computing community. Although the long-term influences of the nominee's work are taken into consideration, there should be a particular outstanding and trendsetting technical achievement that constitutes the principal claim to the award. The award carries a prize of $250,000 and the recipient is expected to present an address that will be published in an ACM journal. Financial support of the Turing Award is provided by the Intel Corporation and Google Inc.

Nominations should include:

1) A curriculum vitae, listing publications, patents, honors, other awards, etc.
2) A letter from the principal nominator, which describes the work of the nominee, and draws particular attention to the contribution which is seen as meriting the award.
3) Supporting letters from at least three endorsers. The letters should not all be from colleagues or co-workers who are closely associated with the nominee, and preferably should come from individuals at more than one organization. Successful Turing Award nominations usually include substantive letters of support from a group of prominent individuals broadly representative of the candidate's field.

For additional information on ACM's award program please visit: www.acm.org/awards/

Additional information on the past recipients of the A.M. Turing Award is available on: http://amturing.acm.org/byyear.cfm .

Nominations should be sent electronically by November 30, 2012 to: Ravi Sethi, Avaya Labs c/o mcguinness@acm.org

Previous A.M. Turing Award Recipients

1966 A.J. Perlis
1967 Maurice Wilkes
1968 R.W. Hamming
1969 Marvin Minsky
1970 J.H. Wilkinson
1971 John McCarthy
1972 E.W. Dijkstra
1973 Charles Bachman
1974 Donald Knuth
1975 Allen Newell
1975 Herbert Simon
1976 Michael Rabin
1976 Dana Scott
1977 John Backus
1978 Robert Floyd
1979 Kenneth Iverson
1980 C.A.R. Hoare
1981 Edgar Codd
1982 Stephen Cook
1983 Ken Thompson
1983 Dennis Ritchie
1984 Niklaus Wirth
1985 Richard Karp
1986 John Hopcroft
1986 Robert Tarjan
1987 John Cocke
1988 Ivan Sutherland
1989 William Kahan
1990 Fernando Corbató
1991 Robin Milner
1992 Butler Lampson
1993 Juris Hartmanis
1993 Richard Stearns
1994 Edward Feigenbaum
1994 Raj Reddy
1995 Manuel Blum
1996 Amir Pnueli
1997 Douglas Engelbart
1998 James Gray
1999 Frederick Brooks
2000 Andrew Yao
2001 Ole-Johan Dahl
2001 Kristen Nygaard
2002 Leonard Adleman
2002 Ronald Rivest
2002 Adi Shamir
2003 Alan Kay
2004 Vinton Cerf
2004 Robert Kahn
2005 Peter Naur
2006 Frances E. Allen
2007 Edmund Clarke
2007 E. Allen Emerson
2007 Joseph Sifakis
2008 Barbara Liskov
2009 Charles P. Thacker
2010 Leslie G. Valiant
2011 Judea Pearl

research highlights
p. 96   Technical Perspective: Example-Driven Program Synthesis for End-User Programming
        By Martin C. Rinard

p. 97   Spreadsheet Data Manipulation Using Examples
        By Sumit Gulwani, William R. Harris, and Rishabh Singh

p. 106  Technical Perspective: Proving Programs Continuous
        By Andreas Zeller

p. 107  Continuity and Robustness of Programs
        By Swarat Chaudhuri, Sumit Gulwani, and Roberto Lublinerman



doi:10.1145/2240236.2240259

Technical Perspective
Example-Driven Program Synthesis for End-User Programming
By Martin C. Rinard

Understanding how to get a computer to perform a given task is a central question in computer science. For many years the standard answer has been to use a programming language to write a program the computer will then execute to accomplish the task. An intriguing alternative, however, is to provide the computer with examples of inputs and corresponding outputs, then have the computer automatically generalize the examples to produce a program that performs the desired task for all inputs. Researchers have worked on this approach for decades, first in the LISP community,4 then later in the inductive logic programming community,1–3 to name two prominent examples. Given the relatively modest size of the programs the resulting techniques are able to produce, the field has evolved to focus largely on data mining, concept learning, knowledge discovery, and other applications (as opposed to mainstream software development).

The following paper focuses on an important emerging area: end-user programming. As information technology has come to permeate our society, broader classes of users have developed the need for more sophisticated data manipulation and processing. While users in the past were satisfied with relatively simple interactive models of computation such as spreadsheets and other business applications, current users are now looking to automate custom data manipulations such as reformatting, reorganizing, simple calculations, or data cleaning. While such users may have a good command of the interactive functionality of their application, they often lack the expertise, time, or inclination to develop software specifically for their task.

The authors illustrate how to apply example-driven program synthesis to automate spreadsheet computations. This work, therefore, is of interest to the millions of people worldwide who use spreadsheets. The methodology consists of four basic steps:

■ Domain-specific language: Develop a domain-specific language capable of representing the desired set of computations.
■ Data structure: Develop a data structure that can efficiently represent the large set of programs that are consistent with a given input/output example.
■ Learn and intersect: Generate data structures for representing the programs consistent with each individual input/output example, then intersect the data structures to obtain a representation of the programs consistent with all examples.
■ Rank the resulting set of programs, preferring more general programs over less general programs.

Users can then view the results of the ranked programs on different inputs to guide the program selection process.

This approach effectively addresses many of the issues that complicate example-driven approaches. Domain-specific languages help focus the synthesis process by avoiding the generality of standard programming languages (that can produce very large program search spaces that are intractable to manipulate efficiently). Compact program representations make it possible to manipulate large numbers of programs efficiently. An effective ranking algorithm helps users quickly identify a desirable program (out of the potentially unbounded number of programs that are consistent with the provided examples). And the interactive program evaluation mechanisms (automatic identification of inputs on which candidate programs produce different outputs) help users navigate the space of synthesized programs.

The authors have successfully applied this approach to three classes of spreadsheet programs: syntactic string manipulations (such as converting telephone numbers into a standard format), semantic string manipulations that make use of some background system knowledge stored as relational tables (such as transforming strings in inventory tables or transforming dates), and table layout computations (which reorganize data stored in tables). All of these systems have been successfully integrated with Microsoft Excel and have been tested on multiple examples from Excel help forums.

In the future, the need for users to obtain ever more sophisticated custom behavior will only increase. Given the significant obstacles that traditional programming approaches pose for the vast majority of users, we can expect to see a proliferation of solutions that help non-programmers accomplish software development tasks that have traditionally been the exclusive domain of software professionals. Given the difficulty of specifying and implementing large software systems, these solutions will (at least initially) focus on the automatic generation of relatively small but still useful solutions to everyday problems. This paper provides an outstanding example of the kinds of useful solutions that non-programmers can now automatically obtain and demonstrates the kind of sophisticated implementation techniques that will make such automatic program synthesis systems feasible for a variety of problem domains.

References
1. Lavrac, N. and Dzeroski, S. Inductive Logic Programming—Techniques and Applications. Ellis Horwood, NY, 1994.
2. Muggleton, S. and De Raedt, L. Inductive logic programming: Theory and methods. Journal of Logic Programming 19, 20 (1994), 629–679.
3. Muggleton, S., De Raedt, L., Poole, D., Bratko, I., Flach, P.A., Inoue, K. and Srinivasan, A. ILP turns 20—Biography and future challenges. Machine Learning 86, 1 (2012), 3–23.
4. Smith, D.R. The synthesis of LISP programs from examples: A survey. Automatic Program Construction Techniques, A.W. Biermann, G. Guiho, and Y. Kodratoff, Eds. Macmillan, 1984, 307–324.

Martin C. Rinard (rinard@csail.mit.edu) is a professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, Cambridge, MA.

© 2012 ACM 0001-0782/12/08 $15.00
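The "learn and intersect" and "rank" steps of the four-step methodology can be sketched on a deliberately tiny language. Everything below is an assumption made for illustration: the toy language has only constant strings and fixed-position slices, candidate programs are held in plain Python sets rather than a compact data structure, and the ranking rule (slices before constants) is a stand-in for "prefer more general programs". It is not the authors' algorithm.

```python
from itertools import product

def run(program, text):
    """Evaluate a toy program on an input string."""
    kind, *args = program
    if kind == "const":
        return args[0]
    start, end = args            # kind == "slice"
    return text[start:end]

def generate(inp, out):
    """Learn: every toy program consistent with a single example."""
    programs = {("const", out)}
    positions = list(range(-len(inp), len(inp) + 1)) + [None]
    for start, end in product(positions, positions):
        if inp[start:end] == out:
            programs.add(("slice", start, end))
    return programs

def rank_key(program):
    """Rank: slices (input-dependent) before constants; repr breaks ties."""
    return (0 if program[0] == "slice" else 1, repr(program))

def synthesize(examples):
    """Intersect the per-example sets, then rank what survives."""
    candidate_sets = [generate(inp, out) for inp, out in examples]
    return sorted(set.intersection(*candidate_sets), key=rank_key)

examples = [("425-706-7709", "425"), ("510-220-5586", "510")]
programs = synthesize(examples)
area_codes = {run(p, "323-708-7700") for p in programs}
```

With one example the constant program `("const", "425")` is still a candidate; the second example eliminates it, which is the effect intersection is meant to have. Explicit sets only work because this language is tiny, which is why the methodology calls for a compact representation.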



doi:10.1145/2240236.2240260

Spreadsheet Data Manipulation Using Examples
By Sumit Gulwani, William R. Harris, and Rishabh Singh

Abstract
Millions of computer end users need to perform tasks over large spreadsheet data, yet lack the programming knowledge to do such tasks automatically. We present a programming by example methodology that allows end users to automate such repetitive tasks. Our methodology involves designing a domain-specific language and developing a synthesis algorithm that can learn programs in that language from user-provided examples. We present instantiations of this methodology for particular domains of tasks: (a) syntactic transformations of strings using restricted forms of regular expressions, conditionals, and loops, (b) semantic transformations of strings involving lookup in relational tables, and (c) layout transformations on spreadsheet tables. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help forums.

This paper is based on "Automating String Processing in Spreadsheets using Input-Output," S. Gulwani, published in POPL (2011); "Learning Semantic String Transformations from Examples," R. Singh and S. Gulwani, PVLDB 5 (2012), in press; and "Spreadsheet Table Transformations from Examples," W.R. Harris and S. Gulwani, published in PLDI (2011).

Harris's work was done during an internship at Microsoft Research. Singh's work was done during two internships at Microsoft Research.

1. INTRODUCTION
The IT revolution over the past few decades has resulted in two significant advances: the digitization of massive amounts of data and widespread access to computational devices. It is thus not surprising that more than 500 million people worldwide use spreadsheets for storing and manipulating data. These business end users have myriad diverse backgrounds and include commodity traders, graphic designers, chemists, human resource managers, finance professionals, marketing managers, underwriters, compliance officers, and even mailroom clerks. They are not professional programmers, but they need to create small, often one-off, applications to perform business tasks.4

Unfortunately, the state of the art of interfacing with spreadsheets is far from satisfactory. Spreadsheet systems, like Microsoft Excel, come with a maze of features, but end users struggle to find the correct features to accomplish their tasks.12 More significantly, programming is still required to perform tedious and repetitive tasks such as transforming names or phone numbers or dates from one format to another, cleaning data, or extracting data from several text files or Web pages into a single document. Excel allows users to write macros using a rich inbuilt library of string and numerical functions, or to write arbitrary scripts in Visual Basic or .Net programming languages. However, since end users are not proficient in programming, they find it too difficult to write desired macros or scripts. Moreover, even skilled programmers might hesitate to write a script for a one-off repetitive task.

We performed an extensive case study of spreadsheet help forums and observed that string and table processing is a very common class of programming problems that end users struggle with. This is not surprising given that various languages such as Perl, Awk, and Python were designed to support string processing, and that new languages such as Java and C# provide rich support for string processing. During our study, we also observed how novice users specified their desired programs to expert users: most specifications consisted solely of one or more input–output examples. Since input–output examples may underspecify a program, the interaction between a novice and an expert often involved multiple rounds of communication over multiple days. Inspired by this observation, we developed a programming by example (PBE), or inductive synthesis, methodology15 that has produced synthesizers that can automatically generate a wide range of string/table manipulating programs in spreadsheets from input–output examples. Each synthesizer takes the role of the forum expert, removing a human from the interaction loop and enabling users to solve their problems in a few seconds instead of a few days.

This paper is organized as follows. We start with a brief overview of our PBE methodology (Section 2). We then describe an application of this methodology to perform syntactic string manipulation tasks (Section 3).6 This is followed by an extension that automates more sophisticated semantic string manipulations requiring background knowledge, which can often be encoded as relational tables (Section 4).18 We also describe an application of this methodology to perform layout transformations on tables (Section 5).8 In Section 6, we discuss related work, and in Section 7, we conclude and discuss future work.

2. OVERVIEW
In this section, we outline a general methodology that we have used for developing inductive synthesizers for end-user programming tasks, along with how a user can interact with the synthesizers. In the first step of our methodology, we identify a domain of useful tasks that end users struggle with




and can clearly describe with examples, by studying help forums or performing user studies (this paper presents two domains: string manipulation and table manipulation). We then develop the following.

Domain-specific language: We design a domain-specific language L that is expressive enough to capture several real-world tasks in the domain, but also restricted enough to enable efficient learning from examples.

Data structure for representing consistent programs: The number of programs in L that are consistent with a given set of input–output examples can be huge. We define a data structure D based on a version-space algebra14 to succinctly represent a large set of such programs.

Algorithm for synthesizing consistent programs: Our synthesis algorithm for language L applies two key procedures: (i) Generate learns the set of all programs, represented using data structure D, that are consistent with a given single example. (ii) Intersect intersects these sets (each corresponding to a different example).

Ranking: We develop a scheme that ranks programs, preferring programs that are more general. Each ranking scheme is inspired by Occam's razor, which states that a smaller and simpler explanation is usually the correct one. We define a partial order relationship between programs to compare them. Any partial order can be used that efficiently orders programs represented in the version-space algebra used by the data structure D. Such an order can be applied to efficiently select the top-ranked programs from among a set represented using D. The ranking scheme can also take into account any test inputs provided by the user (i.e., new additional inputs on which the user may execute a synthesized program). A program that is undefined on any test input or generates an output whose characteristics are different from those of the training outputs can be ranked lower.

2.1. Interaction models
A user provides to the synthesizer a small number of examples, and then can interact with the synthesizer according to multiple models. In one model, the user runs the top-ranked synthesized program on other inputs in the spreadsheet and checks the outputs produced by the program. If any output is incorrect, the user can fix it and reapply the synthesizer, using the fix as an additional example. However, requiring the user to check the results of the synthesized program, especially on a large spreadsheet, can be cumbersome. To enable easier interaction, the synthesizer can run all synthesized programs on each new input to generate a set of corresponding outputs for that input. The synthesizer can highlight for the user the inputs that cause multiple distinct outputs. Our prototypes, implemented as Excel add-ins, support this interaction model.

A second model accommodates a user who requires a reusable program. In this model, the synthesizer presents the set of consistent programs to the user. The synthesizer can show the top k programs or walk the user through the data structure that succinctly represents all consistent programs and let the user select a program. The programs can be shown using programming-language syntax, or they can be described in a natural language. The differences between different programs can be explained by synthesizing a distinguishing input on which the programs behave differently.10 The user can reapply the synthesizer with the distinguishing input and its desired output as an additional example.

3. SYNTACTIC TRANSFORMATIONS
Spreadsheet users often struggle with reformatting or cleaning data in spreadsheet columns. For example, consider the following task.

Example 1 (Phone Numbers). An Excel user wants to uniformly format the phone numbers in the input column, adding a default area code of "425" if the area code is missing.

Input v1          Output
323-708-7700      323-708-7700
(425)-706-7709    425-706-7709
510.220.5586      510-220-5586
235 7654          425-235-7654
745-8139          425-745-8139

Such tasks can be automated by applying a program that performs syntactic string transformations. We now present an expressive domain-specific language of string-processing programs that supports limited conditionals and loops, syntactic string operations such as substring and concatenate, and matching based on regular expressions.6

3.1. Domain-specific language
Our domain-specific programming language for performing syntactic string transformations is given in Figure 1(a). A string program P is an expression that maps an input state σ, which holds values for m string variables v1, … , vm (denoting the multiple input columns in a spreadsheet), to an output string s. The top-level string expression P is a Switch constructor whose arguments are pairs of Boolean expressions b and trace expressions e. The set of Boolean expressions in a Switch construct must be disjoint, that is, for any input state, at most one of the Boolean expressions can be true. The value of P in a given input state σ is the value of the trace expression that corresponds to the Boolean expression satisfied by σ. A Boolean expression b is a propositional formula in Disjunctive Normal Form (DNF). A predicate Match(vi, r, k) is satisfied if and only if vi contains at least k nonoverlapping matches of regular expression r. (In general, any finite set of predicates can be used.)

A trace expression Concatenate(f1, … , fn) is the concatenation of strings represented by atomic expressions f1, … , fn. An atomic expression f is either a constant-string expression ConstStr, a substring expression constructed from SubStr, or a loop expression constructed from Loop.

The substring expression SubStr(vi, p1, p2) is defined partly by two position expressions p1 and p2, each of which implicitly refers to the (subject) string vi and must evaluate to a position within the string vi. (A string with ℓ characters has ℓ + 1 positions, numbered from 0 to ℓ starting from left.) SubStr(vi, p1, p2) is the substring of string vi in between positions p1 and p2. For a nonnegative



Figure 1. (a) Syntax of syntactic string-processing programs. (b) Data structure for representing a set of such programs.

(a)
String program      P := Switch((b1, e1), …, (bn, en)) | e
Boolean condition   b := d1 ∨ … ∨ dn
Conjunction         d := π1 ∧ … ∧ πn
Predicate           π := Match(vi, r, k) | ¬Match(vi, r, k)
Trace expr          e := Concatenate(f1, …, fn) | f
Atomic expr         f := ConstStr(s) | SubStr(vi, p1, p2) | Loop(λw : e)
Position            p := CPos(k) | Pos(r1, r2, c)
Integer expr        c := k | k1w + k2
Regular expr        r := TokenSeq(T1, …, Tn) | T | ε

(b)
P̃ := Switch((b1, ẽ1), …, (bn, ẽn))
ẽ := Dag(η̃, ηs, ηt, ξ̃, W), where W : ξ̃ → 2^f̃
f̃ := ConstStr(s) | SubStr(vi, {p̃j}j, {p̃k}k) | Loop(λw : ẽ)
p̃ := CPos(k) | Pos(r̃1, r̃2, c̃)

constant k, CPos(k) denotes the kth position in the subject string. For a negative constant k, CPos(k) denotes the (ℓ + 1 + k)th position in the subject string, where ℓ = Length(s). Pos(r1, r2, c) is another position expression, where r1 and r2 are regular expressions and integer expression c evaluates to a nonzero integer. Pos(r1, r2, c) evaluates to a position t in the subject string s such that r1 matches some suffix of s[0 : t], and r2 matches some prefix of s[t : ℓ], where ℓ = Length(s). Furthermore, if c is positive (negative), then t is the |c|th such match starting from the left side (right side). We use the expression s[t1 : t2] to denote the substring of s between positions t1 and t2.

The substring construct is quite expressive. For example, in the expression SubStr(vi, Pos(r1, r2, c), Pos(r3, r4, c)), r2 and r3 describe the characteristics of the substring in vi to be extracted, while r1 and r4 describe the characteristics of the surrounding delimiters. We use the expression SubStr2(vi, r, c) as an abbreviation to denote the cth occurrence of regular expression r in vi, that is, SubStr(vi, Pos(ε, r, c), Pos(r, ε, c)).

A regular expression r is ε (which matches the empty string, and therefore can match at any position of any string), a token T, or a token sequence TokenSeq(T1, …, Tn). This restricted choice of regular expressions enables efficient enumeration of regular expressions that match certain parts of a string. We use the following finite (but easily extended) set of tokens: (a) StartTok, which matches the beginning of a string, (b) EndTok, which matches the end of a string, (c) a token for each special character, such as hyphen, dot, semicolon, comma, slash, or left/right parenthesis/bracket, and (d) two tokens for each character class C, one that matches a sequence of one or more characters in C, and another that matches a sequence of one or more characters that are not in C. Examples of a character class C include numeric digits (0–9), alphabetic characters (a–zA–Z), lowercase alphabetic characters (a–z), uppercase alphabetic characters (A–Z), alphanumeric characters, and whitespace characters. UpperTok, NumTok, and AlphNumTok match a nonempty sequence of uppercase alphabetic characters, numeric digits, and alphanumeric characters, respectively. DecNumTok matches a nonempty sequence of numeric digits and/or decimal point. HyphenTok and SlashTok match the hyphen character and the slash character, respectively.

The task described in Example 1 can be expressed in our domain-specific language as

Switch((b1, e1), (b2, e2)), where
b1 ≡ Match(v1, NumTok, 3)
b2 ≡ ¬Match(v1, NumTok, 3)
e1 ≡ Concatenate(SubStr2(v1, NumTok, 1), ConstStr(“-”), SubStr2(v1, NumTok, 2), ConstStr(“-”), SubStr2(v1, NumTok, 3))
e2 ≡ Concatenate(ConstStr(“425-”), SubStr2(v1, NumTok, 1), ConstStr(“-”), SubStr2(v1, NumTok, 2))

The atomic expression Loop(λw : e) is the concatenation of e1, e2, …, en, where ei is obtained from e by replacing all occurrences of integer w by i, and n is the smallest integer such that evaluation of en+1 is undefined. (It is also possible to define more interesting termination conditions, e.g., based on position expressions or predicates.) A trace expression e is undefined when (i) a constituent CPos(k) expression refers to a position not within its subject string, (ii) a constituent Pos(r1, r2, c) expression is such that the subject string does not contain c occurrences of a match bounded by r1 and r2, or (iii) a constituent SubStr(vi, p1, p2) expression has position expressions that are both defined but the first refers to a position that occurs later in the subject string than the position indicated by the second. The following example illustrates the utility of the loop construct.

Example 2 (Generate Abbreviation). The following task was presented originally as an Advanced Text Formula.²³

Input v1                               Output
Association of Computing Machinery     ACM
Principles Of Programming Languages    POPL
Foundations of Software Engineering    FSE

This task can be expressed in our language as
Loop(λw : Concatenate(SubStr2(v1, UpperTok, w))).

Our tool synthesizes this program from the first example row and uses it to produce the entries in the second and third rows (shown here in boldface for emphasis) of the output column.
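The semantics of Switch, Concatenate, SubStr2, and Loop described above can be made concrete with a small interpreter. The following Python sketch is illustrative only and is not the authors' implementation. The function names, the regular expressions chosen for each token, and the input “706-7709” are assumptions of this sketch; it also assumes that tokens match maximal runs of characters, so that the cth match of a token is well defined.

```python
import re

# Token names follow the article; the regexes are this sketch's assumption.
TOKENS = {"NumTok": r"[0-9]+", "UpperTok": r"[A-Z]+", "AlphNumTok": r"[A-Za-z0-9]+"}


class Undefined(Exception):
    """Signals that an expression has no value on the given input."""


def const_str(text):
    """ConstStr(s): always yields the constant string."""
    return lambda v1, w=None: text


def substr2(token, c):
    """SubStr2(v1, token, c): the c-th maximal match; negative c counts from the right.

    c is an integer, or a function of the loop variable w.
    """
    def run(v1, w=None):
        k = c(w) if callable(c) else c
        matches = re.findall(TOKENS[token], v1)
        if k == 0 or abs(k) > len(matches):
            raise Undefined(f"no match number {k} of {token}")
        return matches[k - 1] if k > 0 else matches[k]
    return run


def concatenate(*parts):
    """Concatenate(f1, ..., fn): join the values of the atomic expressions."""
    return lambda v1, w=None: "".join(part(v1, w) for part in parts)


def loop(body):
    """Loop(λw : e): concatenate e[w:=1], e[w:=2], ... until e is undefined."""
    def run(v1, w=None):
        pieces = []
        for i in range(1, len(v1) + 2):  # bound guards against bodies that never fail
            try:
                pieces.append(body(v1, i))
            except Undefined:
                break
        return "".join(pieces)
    return run


def match(token, k):
    """Match(v1, token, k): v1 has at least k nonoverlapping matches of token."""
    return lambda v1: len(re.findall(TOKENS[token], v1)) >= k


def switch(*cases):
    """Switch((b1, e1), ...): evaluate the expression whose condition holds."""
    def run(v1):
        for condition, expr in cases:
            if condition(v1):
                return expr(v1)
        raise Undefined("no condition holds")
    return run


# Example 2: Loop(λw : Concatenate(SubStr2(v1, UpperTok, w)))
abbreviate = loop(concatenate(substr2("UpperTok", lambda w: w)))

# Example 1: Switch((b1, e1), (b2, e2))
dash = const_str("-")
b1 = match("NumTok", 3)
e1 = concatenate(substr2("NumTok", 1), dash, substr2("NumTok", 2), dash, substr2("NumTok", 3))
e2 = concatenate(const_str("425-"), substr2("NumTok", 1), dash, substr2("NumTok", 2))
phone = switch((b1, e1), (lambda v1: not b1(v1), e2))

print(abbreviate("Principles Of Programming Languages"))  # POPL
print(phone("(425)-706-7709"))                            # 425-706-7709
```

Representing each expression as a closure keeps the sketch close to the grammar in Figure 1(a): every constructor returns a function of the input string and, inside a loop, of the loop variable w.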


3.2. Synthesis algorithm
The synthesis algorithm first computes, for each input–output example (σ, s), the set of all trace expressions that map input state σ to output s (using procedure Generate). It then intersects these sets for similar examples and learns conditionals to handle different cases (using procedure Intersect). The size of such sets can be huge; therefore, we must develop a data structure that allows us to succinctly represent and efficiently manipulate huge sets of program expressions.

Data structure: Figure 1(b) describes our data structure for succinctly representing sets of programs from our domain-specific language. P̃, ẽ, f̃, and p̃ denote representations of, respectively, a set of string programs, a set of trace expressions, a set of atomic expressions, and a set of position expressions. r̃ and c̃ represent a set of regular expressions and a set of integer expressions; these sets are represented explicitly.

The Concatenate constructor used in our string language is generalized to the Dag constructor Dag(η̃, ηs, ηt, ξ̃, W), where η̃ is a set of nodes containing two distinctly marked source and target nodes ηs and ηt, ξ̃ is a set of edges over nodes in η̃ that defines a Directed Acyclic Graph (DAG), and W maps each ξ ∈ ξ̃ to a set of atomic expressions. The set of all Concatenate expressions represented by a Dag(η̃, ηs, ηt, ξ̃, W) constructor includes exactly those whose ordered arguments belong to the corresponding edges on any path from ηs to ηt. The Switch, Loop, SubStr, and Pos constructors are all overloaded to construct sets of the corresponding program expressions that are shown in Figure 1(a). The ConstStr and CPos constructors can be regarded as producing singleton sets.

The data structure supports efficient implementation of various useful operations including intersection, enumeration of programs, and their simultaneous execution on a given input. The most interesting of these is the intersection operation, which is similar to regular automata intersection. The additional challenge is to intersect edge labels: in the case of automata, the labels are simply sets of characters, while in our case, the labels are sets of string expressions.

Procedure Generate: The number of trace expressions that can generate a given output string from a given input state can be huge. For example, consider the second input–output pair in Example 1, where the input state consists of one string “(425)-706-7709” and the output string is “425-706-7709”. Figure 2 shows a small sampling of different ways of generating parts of the output string from the input string using SubStr and ConstStr constructors. Each substring extraction task itself can be expressed with a huge number of expressions, as explained later. The following are three of the trace expressions represented in the figure, of which only the second one, shown in the figure in bold, expresses the program expected by the user:

1. Extract substring “425”. Extract substring “-706-7709”.
2. Extract substring “425”. Print constant “-”. Extract substring “706”. Print constant “-”. Extract substring “7709”.
3. Extract substring “425”. Extract substring “-706”. Print constant “-”. Extract substring “7709”.

Figure 2. Small sampling of different ways of generating parts of an output string from the input string. (Input: (425)-706-7709; Output: 425-706-7709; several parts of the output are labeled Constant.)

We apply two crucial observations to succinctly generate and represent all such trace expressions. First, the logic for generating some substring of an output string is completely decoupled from the logic for generating another disjoint substring of the output string. Second, the total number of different substrings/parts of a string is quadratic (and not exponential) in the size of that string.

The Generate procedure creates a Directed Acyclic Graph (DAG) Dag(η̃, ηs, ηt, ξ̃, W) that represents the trace set, that is, the set of all trace expressions that generate a given output string from a given input state. Generate constructs a node corresponding to each position within the output string and constructs an edge from a node corresponding to any position to a node corresponding to any later position. Each edge corresponds to some substring of the output and is annotated with the set of all atomic expressions that generate that substring. We describe below how to generate the set of all such SubStr expressions. Any Loop expressions are generated by first generating candidate expressions (by unifying the sets of trace expressions associated with the substrings s[k1 : k2] and s[k2 : k3], where k1, k2, and k3 are the boundaries of the first two loop iterations, identified by considering all possibilities), and then validating them.

The number of substring expressions that can extract a given substring from a given string can be huge. For example, following is a small sample of various expressions that extract “706” from the string “425-706-7709” (call it v1).

• Second number: SubStr2(v1, NumTok, 2).
• 2nd last alphanumeric token: SubStr2(v1, AlphNumTok, −2).
• Substring between the first hyphen and the last hyphen: SubStr(v1, Pos(HyphenTok, ε, 1), Pos(ε, HyphenTok, −1)).
• First number that occurs between hyphen on both ends: SubStr(v1, Pos(HyphenTok, TokenSeq(NumTok, HyphenTok), 1), Pos(TokenSeq(HyphenTok, NumTok), HyphenTok, 1)).
• First number preceded by a number–hyphen sequence: SubStr(v1, Pos(TokenSeq(NumTok, HyphenTok), NumTok, 1), Pos(TokenSeq(NumTok, HyphenTok, NumTok), ε, 1)).
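The DAG built by Generate, and the automata-style intersection of two such DAGs, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the article's algorithm: the function names are hypothetical, SubStr expressions are identified only by constant start and end offsets (the article instead learns sets of Pos and CPos expressions for each end), Loop expressions are omitted, and the second example pair is invented for the demonstration.

```python
def generate_dag(inp, out):
    """Map each edge (i, j), i < j, to the atomic expressions that produce out[i:j]."""
    W = {}
    for i in range(len(out)):
        for j in range(i + 1, len(out) + 1):
            piece = out[i:j]
            exprs = [("ConstStr", piece)]
            start = inp.find(piece)
            while start != -1:
                # Simplification: a substring is named by constant offsets only.
                exprs.append(("SubStr", start, start + len(piece)))
                start = inp.find(piece, start + 1)
            W[(i, j)] = exprs
    return W


def intersect(W1, W2):
    """Product construction: keep an edge pair only if its labels share an expression."""
    W12 = {}
    for (i1, j1), exprs1 in W1.items():
        for (i2, j2), exprs2 in W2.items():
            common = set(exprs1) & set(exprs2)
            if common:
                W12[((i1, i2), (j1, j2))] = common
    return W12


def count_programs(W, source, target):
    """Count Concatenate expressions, i.e. labelled paths from source to target."""
    nodes = sorted({u for u, _ in W} | {v for _, v in W})
    ways = dict.fromkeys(nodes, 0)
    ways[source] = 1
    for v in nodes:  # sorted order is a topological order for these DAGs
        for (u, head), exprs in W.items():
            if head == v:
                ways[v] += ways[u] * len(exprs)
    return ways.get(target, 0)


out1 = "425-706-7709"
W1 = generate_dag("(425)-706-7709", out1)
print(len(W1))                                # 78 edges for 12 output characters
print(count_programs(W1, 0, len(out1)))       # many thousands of expressions

# Hypothetical second example, used only to exercise the intersection.
out2 = "206-555-0123"
W2 = generate_dag("(206)-555-0123", out2)
W12 = intersect(W1, W2)
print(count_programs(W12, (0, 0), (len(out1), len(out2))))
```

The edge count grows quadratically with the length of the output, while the number of represented expressions grows exponentially, which is the point of the two observations above. After intersection, constant strings that differ between the examples drop out and only shared extraction logic survives.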



The substring-extraction problem can be decomposed into two position-identification problems that can be solved independently. The solutions to the substring-extraction problem can also be maintained succinctly by independently representing the solutions to the two position-identification problems. Note the representation of the SubStr constructor in Eq. 1 in Figure 1(b).

Procedure Intersect: Given a trace set for each input–output example, the Intersect procedure generates the top-level Switch constructor. Intersect first partitions the examples, so that inputs in the same partition are handled by the same conditional in the top-level Switch expression, and then intersects the trace sets for inputs in the same partition. If a set of inputs are in the same partition, then the intersection of trace sets is nonempty. Intersect uses a greedy heuristic to minimize the number of partitions by starting with singleton partitions and then iteratively merging partitions that have the highest compatibility score, which is a function of the size of the resulting partition and its potential to be merged with other partitions.

Intersect then constructs a classifier for each of the resultant partitions, which is a Boolean expression that is satisfied by exactly the inputs in the partition. The classifier for each partition and the intersection of trace sets for the inputs in the partition serve as the Boolean condition and corresponding trace expression in the constructed Switch expression, respectively.

Ranking: We prefer Concatenate and TokenSeq expressions that have fewer arguments. We prefer SubStr expressions to both ConstStr expressions (it is less likely for constant parts of an output string to also occur in the input) and Concatenate expressions (if there is a long substring match between the input and output, it is more likely that the corresponding part of the output was produced by a single substring extraction). We prefer a Pos expression to a CPos expression (giving less preference to extraction expressions based on constant offsets). StartTok and EndTok are our most preferred tokens; otherwise, we prefer tokens corresponding to a larger character class (favoring generality).

The implementation of the synthesis algorithm is less than 5,000 lines of C# code, and takes less than 0.1 s on average for a benchmark suite of more than 100 tasks obtained from online help forums and the Excel product team.

4. SEMANTIC TRANSFORMATIONS
Some string transformation tasks also involve manipulating strings that need to be interpreted as more than a sequence of characters, for example, as a column entry from some relational table, or as some standard data type such as date, time, currency, or phone number. For example, consider the following task from an Excel help forum.

Example 3. A shopkeeper wants to compute the selling price of an item (Output) from its name (Input v1) and selling date (Input v2). The inventory database of the shop consists of two tables: (i) MarkupRec table that stores id, name and markup percentage of items, and (ii) CostRec table that stores id, purchase date (in month/year format), and purchase price of items. The selling price of an item is computed by adding its purchase price (for the corresponding month) to its markup charges, which in turn is calculated by multiplying the markup percentage by the purchase price.

Input v1     Input v2      Output
Stroller     10/12/2010    $145.67 + 0.30*145.67
Bib          23/12/2010    $3.56 + 0.45*3.56
Diapers      21/1/2011     $21.45 + 0.35*21.45
Wipes        2/4/2009      $5.12 + 0.40*5.12
Aspirator    23/2/2010     $2.56 + 0.30*2.56

MarkupRec
Id     Name         Markup
S33    Stroller     30%
B56    Bib          45%
D32    Diapers      35%
W98    Wipes        40%
A46    Aspirator    30%
...    ...          ...

CostRec
Id     Date       Price
S33    12/2010    $145.67
S33    11/2010    $142.38
B56    12/2010    $3.56
D32    1/2011     $21.45
W98    4/2009     $5.12
A46    2/2010     $2.56
...    ...        ...

To perform the above task, the user must perform a join of the two tables on the common item Id column to look up the item Price from its Name (v1) and selling Date (substring of v2). We present an extension to the trace expression (from Section 3.1) that can also manipulate strings present in such relational tables.¹⁸

4.1. Domain-specific language
We extend the trace expression (from Section 3.1), as shown in Figure 3(a), to obtain the semantic string transformation language that can also perform table lookup operations. The atomic expression f is modified to represent a constant string, a select expression, or a substring of a select expression. A select expression et is either an input string variable vi or a lookup expression denoted by Select(Col, Tab, g), where Tab is a relational table identifier and Col is a column identifier of the table. The Boolean condition g is an ordered conjunction of column predicates h1 ∧ … ∧ hn, where a column predicate h is an equality comparison between the content of some column of the table and a constant or a trace expression e. We require columns present in these conditions to together constitute a primary key of the table to ensure that the select queries produce a single string as opposed to a set of strings.

The task in Example 3 can be represented in the language as

Concatenate(f1, ConstStr(“+0.”), f2, ConstStr(“*”), f3)
where f1 ≡ Select(Price, CostRec, Id = f4 ∧ Date = f5),
f4 ≡ Select(Id, MarkupRec, Name = v1),
f5 ≡ SubStr(v2, Pos(SlashTok, ε, 1), Pos(ε, EndTok, 1)),
f2 ≡ SubStr2(f6, NumTok, 1), f3 ≡ SubStr2(f1, DecNumTok, 1),
f6 ≡ Select(Markup, MarkupRec, Name = v1).
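The lookups f1 to f6 above can be traced with ordinary Python dictionaries. The sketch below is a hand-written evaluation of this one program, not a synthesizer, and the helper names `select` and `selling_price` are hypothetical. The table rows are copied from Example 3; the regular expressions standing in for NumTok and DecNumTok are assumptions of this sketch.

```python
import re

# Rows copied from the MarkupRec and CostRec tables of Example 3.
TABLES = {
    "MarkupRec": [
        {"Id": "S33", "Name": "Stroller", "Markup": "30%"},
        {"Id": "B56", "Name": "Bib", "Markup": "45%"},
        {"Id": "D32", "Name": "Diapers", "Markup": "35%"},
        {"Id": "W98", "Name": "Wipes", "Markup": "40%"},
        {"Id": "A46", "Name": "Aspirator", "Markup": "30%"},
    ],
    "CostRec": [
        {"Id": "S33", "Date": "12/2010", "Price": "$145.67"},
        {"Id": "S33", "Date": "11/2010", "Price": "$142.38"},
        {"Id": "B56", "Date": "12/2010", "Price": "$3.56"},
        {"Id": "D32", "Date": "1/2011", "Price": "$21.45"},
        {"Id": "W98", "Date": "4/2009", "Price": "$5.12"},
        {"Id": "A46", "Date": "2/2010", "Price": "$2.56"},
    ],
}


def select(col, tab, **key):
    """Select(Col, Tab, g): g is a conjunction of equalities that must pick one row."""
    rows = [row for row in TABLES[tab] if all(row[c] == v for c, v in key.items())]
    if len(rows) != 1:
        raise LookupError(f"{tab}: condition {key} matched {len(rows)} rows")
    return rows[0][col]


def selling_price(v1, v2):
    """Evaluate Concatenate(f1, ConstStr("+0."), f2, ConstStr("*"), f3)."""
    f4 = select("Id", "MarkupRec", Name=v1)
    f5 = v2[v2.index("/") + 1:]                  # after the first slash, to the end
    f1 = select("Price", "CostRec", Id=f4, Date=f5)
    f6 = select("Markup", "MarkupRec", Name=v1)
    f2 = re.search(r"[0-9]+", f6).group()        # first NumTok match: drops the % sign
    f3 = re.search(r"[0-9.]+", f1).group()       # first DecNumTok match: drops the $ sign
    return f1 + "+0." + f2 + "*" + f3


print(selling_price("Stroller", "10/12/2010"))   # $145.67+0.30*145.67
```

The error raised when a condition matches zero or several rows mirrors the requirement that the columns in a condition form a primary key. Note that the program as written emits no spaces around “+”, unlike the Output column as typeset in the example table.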


Figure 3. Extensions to the syntax and data structure in Figure 1 for semantic processing.

(a)
Atomic expr         f := SubStr(et, p1, p2) | ConstStr(s) | et
Select expr         et := vi | Select(Col, Tab, g)
Boolean condition   g := h1 ∧ … ∧ hn
Predicate           h := Col = s | Col = e

(b)
ẽt := (η̃, ηt, Progs) where Progs : η̃ → 2^f̃
f̃ := vi | Select(Col, Tab, B)
B := {g̃i}i
g̃ := h̃1 ∧ … ∧ h̃n
h̃ := Col = s | Col = η | Col = {s, η}

The expression f4 looks up the Id of the item (present in v1) from the MarkupRec table and f5 generates a substring of the date present in v2, which are then used together to look up the Price of the item from the CostRec table (f1). The expression f6 looks up the Markup percentage of the item from the MarkupRec table and f2 generates a substring of this lookup value by extracting the first numeric token (thus removing the % sign). Similarly, the expression f3 generates a substring of f1, removing the $ symbol. Finally, the top-level expression concatenates the strings obtained from expressions f1, f2, and f3 with constant strings “+0.” and “*”.

This extended language also enables manipulation of strings that represent standard data types, whose semantic meaning can be encoded as a database of relational tables. For example, consider the following date manipulation task.

Example 4 (Date Manipulation). An Excel user wanted to convert dates from one format to another, and the fixed set of hard-coded date formats supported by Excel 2010 does not match the input and output formats. Thus, the user solicited help on a forum.

Input v1     Output
6-3-2008     Jun 3rd, 2008
3-26-2010    Mar 26th, 2010
8-1-2009     Aug 1st, 2009
9-24-2007    Sep 24th, 2007

We can encode the required background knowledge for the date data type in two tables, namely a Month table with 12 entries: (1, January), …, (12, December) and a DateOrd table with 31 entries (1, st), (2, nd), …, (31, st). The desired transformation is represented in our language as

Concatenate(SubStr(Select(MW, Month, MN = e1), Pos(StartTok, ε, 1), CPos(3)), ConstStr(“ ”), e2, Select(Ord, DateOrd, Num = e2), ConstStr(“, ”), e3)

where e1 = SubStr2(v1, NumTok, 1), e2 = SubStr2(v1, NumTok, 2), e3 = SubStr2(v1, NumTok, 3). (MW, MN) and (Num, Ord) denote the columns of the Month and DateOrd tables, respectively.

4.2. Synthesis algorithm
We now describe the key extensions to the synthesis algorithm for syntactic transformations (Section 3.2) to obtain the synthesis algorithm for semantic transformations.

Data structure: Figure 3(b) describes the data structure that succinctly represents the large set of programs in the semantic transformation language that are consistent with a given input–output example. The data structure consists of a generalized expression ẽt, generalized Boolean condition g̃, and generalized predicate h̃ (which, respectively, denote a set of select expressions, a set of Boolean conditions g, and a set of predicates h). The generalized expression ẽt is represented using a tuple (η̃, ηt, Progs) where η̃ denotes a set of nodes containing a distinct target node ηt (representing the output string), and Progs : η̃ → 2^f̃ maps each node η ∈ η̃ to a set consisting of input variables vi or generalized select expressions Select(Col, Tab, B). A generalized Boolean condition g̃i corresponds to some primary key of table Tab and is a conjunction of generalized predicates h̃j, where each h̃j is an equality comparison of the jth column of the corresponding primary key with a constant string s or some node η or both. The two key aspects of this data structure are (i) the use of intermediate nodes for sharing sub-expressions to represent an exponential number of expressions in polynomial space and (ii) the use of Conjunctive Normal Form (CNF) for Boolean conditions to represent an exponential number of conditionals g̃ in polynomial space.

Procedure Generate: We first consider the simpler case where there are no syntactic manipulations on the table lookups and the lookups are performed using exact string matches, that is, the predicate h is Col = et instead of Col = e. The Generate procedure operates by iteratively computing a set of nodes (η̃), where each node η ∈ η̃ represents a string val(η) that corresponds to some table entry or an input string. Generate performs an iterative forward reachability analysis of the string values that can be generated in a single step (i.e., using a single Select expression) from the string values computed in the previous step based on string equality and assigns the Select expression to the Progs map of the corresponding node. The base case of the procedure creates a node for each of the input string variables. After performing the analysis for a bounded number of iterations, the procedure returns the set of select expressions of the node that corresponds to the output string s, that is, Progs[val⁻¹(s)].

The Generate procedure for the general case, which also includes syntactic manipulations on table lookups, requires a relaxation of the above-mentioned reachability criterion of strings that is based on string equality. We now



define a table entry to be reachable from a set of previously reachable strings if the entry can be generated from the set of reachable strings using the Generate procedure of Section 3.2. The rest of the reachability algorithm operates just as before.

Procedure Intersect: A basic ingredient of the Intersect procedure for syntactic transformations is a method to intersect two Dag constructs, each representing a set of trace expressions. We replace this by a method to intersect two tuples (η̃1, η1t, Progs1) and (η̃2, η2t, Progs2), each representing a set of extended trace expressions. The tuple after intersection is (η̃1 × η̃2, (η1t, η2t), Progs12), where Progs12((η1, η2)), for nodes η1 ∈ η̃1 and η2 ∈ η̃2, is given by the intersection of Progs1(η1) and Progs2(η2).

Ranking: We prefer expressions of smaller depth (fewer nested chains of Select expressions) and ones that match longer strings in table entries for indexing. We prefer lookup expressions that use distinct tables (for join queries) as opposed to using the same table twice. We prefer conditionals with fewer predicates. We prefer predicates that compare columns with other table entries or input variables (as opposed to comparing columns with constant strings).

We implemented our algorithm as an extension to the Excel add-in (Section 3.2) and evaluated it successfully over more than 50 benchmark problems obtained from various help forums and the Excel product team. For each benchmark, our implementation learned the desired transformation in <10 s (88% of them taking <1 s each) requiring at most three input–output examples (with 70% of them requiring only one example). The data structure had size between 100 and 2,000 (measured as the number of terminal symbols in the data-structure syntax), with an average size of 600, and typically represented 10^20 expressions.

5. TABLE LAYOUT TRANSFORMATIONS
End users often transform a spreadsheet table not by changing the data stored in the cells of a table, but instead by changing how the cells are grouped or arranged. In other words, users often transform the layout of a table.⁸

Example 5. The following example input table and subsequent example output table were provided by a novice on an Excel user help thread to specify a layout transformation:

          Qual 1        Qual 2        Qual 3
Andrew    01.02.2003    27.06.2008    06.04.2007
Ben       31.08.2001                  05.07.2004
Carl                    18.04.2003    09.12.2009

Andrew    Qual 1    01.02.2003
Andrew    Qual 2    27.06.2008
Andrew    Qual 3    06.04.2007
Ben       Qual 1    31.08.2001
Ben       Qual 3    05.07.2004
Carl      Qual 2    18.04.2003
Carl      Qual 3    09.12.2009

The example input contains a set of dates on which tests were given, where each date is in a row corresponding to the name of the test taker, and in a column corresponding to the name of the test. For every date, the user needs to produce a row in the output table containing the name of the test taker, the name of the test, and the date on which the test was taken. If a date cell in the input is empty, then no corresponding row should be produced in the output.

5.1. Domain-specific language
We may view every program P that transforms the layout of a table as constructing a map mP from the cells of an input table to the coordinates of the output table. For a cell c in an input table, if mP(c) = (row, col), P fills the cell in the output table at the coordinate (row, col) with the data in c. A program in our language of layout transformations is defined syntactically as a finite collection of component programs, each of which builds a map from input cells to output coordinates (Figure 4: table program). We designed our language on the principle that most layout transformations can be implemented by a set of component programs that construct their map using one of the two complementary procedures: filtering and associating.

Figure 4. Syntax of layout transformations.

Table program          P := TabProg({Ki}i)
Component program      K := F | A
Filter program         F := Filter(φ, SEQi,j,k)
Associative program    A := Assoc(F, S1, S2)
Spatial function       S := RelColi | RelRowj

When a component program filters, it scans the cells of the input table in row-major order, selects a subset of the cells, and maps them in order to a subrange of the coordinates in the output table. A filter program Filter(φ, SEQi,j,k) (Figure 4: filter program) consists of a mapping condition φ, which is a function whose body is a conjunction of predicates over input cells drawn from a fixed set, and an output sequencer SEQi,j,k, where i, j, and k are nonnegative integers. For a mapping condition φ and sequencer SEQi,j,k, the filter program Filter(φ, SEQi,j,k) scans an input table and maps each cell that satisfies φ to the coordinates in the output table between columns i and j, starting at row k, in row-major order.

For the tables in Example 5, the filter program F1 = Filter(λc.(c.data ≠ “” ∧ c.col ≠ 1 ∧ c.row ≠ 1), SEQ3,3,1) maps each date, that is, each nonempty cell not in column 1 and not in row 1, to its corresponding cell in column 3 of the output table, starting at row 1. Call this map mF1.

A table program can also construct a map using spatial relationships between cells in the input table and spatial relationships between coordinates in the output table; we call this construction association. When a table program associates, it takes a cell c in the input table mapped by some filter program F, picks a cell c1 in the input table whose coordinate is related to c, finds the coordinate mF(c) that c maps to under mF, picks a coordinate d1 that is related to mF(c), and maps c1 to d1.

An associative program A = Assoc(F, s0, s1) (Figure 4: Associative program) is constructed from a filter program F and two spatial functions s0 and s1, each of which may be of


the form RelColi or RelRowj. The spatial function RelColi checks if the resulting associative program is consis-
takes a cell c as input, and returns the cell in the same row tent. If so, then Generate adds the associative program
as c and in column i. The spatial function RelRowj takes a to the set of consistent component programs. The table
cell c as input, and returns the cell in row j and in the same program TabProg(K*) stores all table programs that are
column as c. For each cell c in the domain of mK, the map of consistent with the example if any exist, and is thus called
A contains an entry mA(s0(c) ) = s1(mF(c) ). the complete table program of the example.
For the example tables in Example 5, and the filter Procedure Intersect: Given two sets of table programs
­program F1 introduced above that maps to column 3 of represented as table programs stored in TabProg(K0) and
the example output table, the associative program A1 = TabProg(K1), Intersect efficiently constructs the inter-
Assoc(F1, RelCol1, RelCol1) constructs a map to every section of the sets as all consistent table programs stored in
cell in column 1 of the output table. To construct its map, TabProg(K0 ∩ K1).
A1 takes each cell c in the input table mapped by F1, finds the The synthesis algorithm applies Generate to con-
cell RelCol1(c) in the same row as c and in column 1, finds struct the complete table program for each example,
the coordinate mF (c) that F1 maps c to, finds the coordinate applies Intersect to the set of complete table programs,
1
RelCol1(mF (c)), and maps RelCol1(c) to RelCol1(mF (c)): that and checks if the resulting table program TabProg(KI)
1 1
is, A1 sets mA (RelCol1(c) ) = RelCol1(mF (c)). Similarly, is consistent with each of the examples. If so, it returns
1 1
the associative program A2 = Assoc(F1, RelRow1, RelCol2) TabProg(K′I  ) for some subset K′I   of KI that covers each of
constructs a map to every cell in column 2 of the example the examples. The exact choice of K′I   depends on the rank-
output table. The table program TabProg({F1, A1, A2}) takes ing criteria.
the input table in Example 5 and maps to every cell in the Ranking: We prefer table programs that are constructed
output table. from smaller sets of component programs, as such table
It is possible that the ranges of constituent component programs are intuitively simpler. The subset order over sets
programs of a table program may overlap on a given input of component programs thus serves as a partial order for
table. In such a case, if two cells with different values are ranking. Also, suppose that a table program P0 uses a filter
mapped to the same output coordinate, then we say that the program F0, while another table program P1 uses a filter pro-
table program is undefined on the input table. gram F1 that builds the same map as F0, but whose condition
is a conjunction of fewer predicates than the condition of F0.
5.2. Synthesis algorithm Then, we prefer P1, as the condition used by F1 is intuitively
The synthesis algorithm generates the set of all table pro- more general.
grams that are consistent with each example, intersects the To evaluate our synthesis algorithm, we implemented it
sets, and picks a table program from the intersection that is as a plug-in for the Excel spreadsheet program and applied it
consistent with all of the examples. to input–output tasks taken directly from over 50 real-world
Data structure for sets of table programs: To compactly Excel user help threads. The tool automatically inferred pro-
represent sets of table programs, our synthesis algorithm grams for each task. The tool usually ran in under 10 s and
uses a table program itself. Let a component program K be inferred a program that behaved as we expected, given the
consistent with an input–output example if when K is applied user’s description in English of their required transforma-
to the example input and K maps an input cell c, then the tion. If the highest-ranking program inferred by the tool
cell in the output table at coordinate mK(c) has the same behaved in an unexpected way on an input, it inferred a pro-
data as cell c in the input table. Let a set of component pro- gram that behaved as expected after taking at most two addi-
grams K cover an example if, for each cell coordinate d in the tional examples.
example output, there is some component K ∈ K and cell c
in the input table such that d = mK(c). Let a table program 6. RELATED WORK
TabProg(K) be consistent with an example if K is consis- The Human Computer Interaction (HCI) community
tent with the example and K covers the example. For a fixed has developed programming by demonstration (PBD) sys-
input–output example, TabProg(K) stores TabProg(K′) if tems1 for data cleaning, which require the user to specify
K′ ⊆ K covers the example. a complete demonstration or trace visually on the data
Procedure Generate: From a single input–output instead of writing code: SMARTedit14 for text manipula-
example, Generate constructs a table program that tion, Simultaneous Editing16 for string manipulation,
stores the set of all table programs that are consistent and Wrangler11 for table transformations. Our system is
with the example by constructing the set K* of all consis- based on programming by example (as opposed to dem-
tent component programs, in three steps. First, from the onstration); it requires the user to provide only the initial
example input and output, Generate defines a set of spa- and final states, as opposed to also providing the inter-
tial functions and map predicates. Second, from the set of mediate states. This renders our system more usable,13
map predicates, Generate collects the set of all consis- but at the expense of complicating the learning problem.
tent filter programs. Third, from the set of consistent fil- Furthermore, we synthesize programs over a more sophis-
ter programs, Generate constructs the set of consistent ticated language consisting of conditionals and loops.
associative programs. To generate associative programs, The database community has studied the view syn-
Generate combines each consistent filter program with thesis problem,2, 22 which aims to find a succinct query
pairs of spatial functions defined in the first step and for a given relational view instance. Our semantic string



transformation synthesis also infers similar queries, but infers from very few examples and over a richer language that combines lookup operations with syntactic manipulations. Our table layout synthesis addresses the challenges of dealing with spreadsheet tables, which, unlike database relations, have ordering relationships between rows and carry meta-information in some cells.

The PADS project in the programming languages community has simplified ad hoc data-processing tasks for programmers by developing domain-specific languages for describing data formats, and learning algorithms for inferring such formats using annotations provided by the user.3 The learned format can then be used by programmers to implement custom data-analysis tools. In contrast, we focus on automating the entire end-to-end process for relatively simpler tasks, which include not only learning the text structure from the inputs, but also learning the desired transformation from the outputs, without any user annotations.

The area of program synthesis is gaining renewed interest. However, it has traditionally focused on synthesizing nontrivial algorithms20 (e.g., graph algorithms9 and program inverses19) and discovering intricate code-snippets (e.g., bit-vector tricks,7 switching logic in hybrid systems21). In this paper, we apply program synthesis to discover simpler programs required by the much larger class of spreadsheet end-users. Various classes of techniques have been explored for program synthesis: exhaustive search, logical reasoning, probabilistic inference, and version-space algebras (for a recent survey, see Gulwani5). Because the data manipulations required by end users are usually relatively simple, we can apply version-space algebras, which allow real-time performance, unlike other techniques. Version-space algebras were pioneered by Mitchell for refinement-based learning of Boolean functions,17 while Lau et al. extended the concept to learning more complex functions in a PBD setting.14 Our synthesis algorithms lift the concepts of version-space algebra to the PBE setting, for a fairly expressive string expression language involving conditionals and loops.

7. CONCLUSION AND FUTURE WORK
General-purpose computational devices, such as smartphones and computers, are becoming accessible to people at large at an impressive rate. In the future, even robots will become household commodities. Unfortunately, programming such general-purpose platforms has never been easy, because we are still mostly stuck with the model of providing step-by-step, detailed, and syntactically correct instructions on how to accomplish a certain task, instead of simply describing what the task is. Program synthesis has the potential to revolutionize this landscape, when targeted for the right set of problems and using the right interaction model.

This paper reports our initial experience with designing domain-specific languages and inductive synthesizers for string and table transformations. Our choice of domains was motivated by our study of help-forum problems that end users struggled with. A next step is to develop a general framework that can allow synthesizer writers to easily develop domain-specific synthesizers of the kind described in this paper, similar to how declarative parsing frameworks allow a compiler writer to easily write a parser.

Acknowledgments
We thank Guy L. Steele Jr. for providing insightful and detailed feedback on multiple versions of this draft.

References
1. Cypher, A., ed. Watch What I Do: Programming by Demonstration, MIT Press, 1993.
2. Das Sarma, A., Parameswaran, A., Garcia-Molina, H., Widom, J. Synthesizing view definitions from data. In ICDT (2010).
3. Fisher, K., Walker, D. The PADS project: an overview. In ICDT (2011).
4. Gualtieri, M. Deputize end-user developers to deliver business agility and reduce costs. In Forrester Report for Application Development and Program Management Professionals (Apr. 2009).
5. Gulwani, S. Dimensions in program synthesis. In PPDP (2010).
6. Gulwani, S. Automating string processing in spreadsheets using input-output examples. In POPL (2011).
7. Gulwani, S., Jha, S., Tiwari, A., Venkatesan, R. Synthesis of loop-free programs. In PLDI (2011).
8. Harris, W.R., Gulwani, S. Spreadsheet table transformations from examples. In PLDI (2011).
9. Itzhaky, S., Gulwani, S., Immerman, N., Sagiv, M. A simple inductive synthesis methodology and its applications. In OOPSLA (2010).
10. Jha, S., Gulwani, S., Seshia, S., Tiwari, A. Oracle-guided component-based program synthesis. In ICSE (2010).
11. Kandel, S., Paepcke, A., Hellerstein, J., Heer, J. Wrangler: Interactive visual specification of data transformation scripts. In CHI (2011).
12. Ko, A.J., Myers, B.A., Aung, H.H. Six learning barriers in end-user programming systems. In VL/HCC (2004).
13. Lau, T. Why PBD systems fail: Lessons learned for usable AI. In CHI 2008 Workshop on Usable AI (2008).
14. Lau, T., Wolfman, S., Domingos, P., Weld, D. Programming by demonstration using version space algebra. Mach. Learn. 53(1–2) (2003).
15. Lieberman, H. Your Wish is My Command: Programming by Example, Morgan Kaufmann, 2001.
16. Miller, R.C., Myers, B.A. Interactive simultaneous editing of multiple text regions. In USENIX Annual Technical Conference (2001).
17. Mitchell, T.M. Generalization as search. Artif. Intell. 18, 2 (1982).
18. Singh, R., Gulwani, S. Learning semantic string transformations from examples. PVLDB 5 (2012), in press.
19. Srivastava, S., Gulwani, S., Chaudhuri, S., Foster, J.S. Path-based inductive synthesis for program inversion. In PLDI (2011).
20. Srivastava, S., Gulwani, S., Foster, J. From program verification to program synthesis. In POPL (2010).
21. Taly, A., Gulwani, S., Tiwari, A. Synthesizing switching logic using constraint solving. In VMCAI (2009).
22. Tran, Q.T., Chan, C.Y., Parthasarathy, S. Query by output. In SIGMOD (2009).
23. Walkenbach, J. Excel 2010 Formulas, John Wiley and Sons, 2010.

Sumit Gulwani (sumitg@microsoft.com), Microsoft Research, Redmond, WA.
Rishabh Singh (rishabh@csail.mit.edu), MIT CSAIL, Cambridge, MA.
William R. Harris (wrharris@cs.wisc.edu), Univ. of Wisconsin, Madison, WI.

© 2012 ACM 0001-0782/12/08 $15.00



research highlights
doi:10.1145/2240236.2240261

Technical Perspective
Proving Programs Continuous
By Andreas Zeller

Proving a program’s correctness is usually an all-or-nothing game. Either a program is correct with respect to its specification or it is not. If our proof succeeds, we have 100% correctness; if our proof does not succeed, we have nothing. Formal correctness proofs are difficult, because one must symbolically cover the entire range of possible inputs, and the slightest gap in the input leaves us with a gap in the proof.

But what if it turned out the risk posed by leaving a gap is actually small? This, of course, is the assumption of testing: If I tested some function with a sample of values, and it works correctly for this sample, I have reasons to assume it will also work for similar values. This is something I can assume if the behavior of my function is continuous: if it computes the correct square root for 10, 100, and 1,000, it should also do the right thing for any value in between.

One may think this is a dangerous assumption: Simply because my program has worked well for these three values, why should it work for any other? A program is free to do anything; since it need not obey mathematical or physical laws, it can behave in an entirely different way for any new value. This logical view is true in principle. In real life, however, programmers prefer abstractions that are easy to understand and to reason about. The reason testing works in practice is that programmers naturally strive toward continuity.

While the intuitive idea is easy to grasp, the concept of continuity so far has widely evaded formal treatment; in particular, it was not possible to automatically reason about continuity in the presence of loops. This is where the work of Swarat Chaudhuri, Sumit Gulwani, and Roberto Lublinerman comes into play. Their framework can formally verify programs for continuity, proving that small changes to the input only cause small changes to the output. They show that several programs such as sorting or shortest path are continuous, or even Lipschitz continuous, implying that perturbations to a function’s input cause only proportional changes to its output. Such a function would also be declared robust, meaning it will behave predictably even when inputs are uncertain or erroneous.

Being able to formally reason about continuity and robustness lets us see programs as driven not only by logic, but also analytical calculus; and this view can be very helpful for understanding why programs generally tend to work well even if only coarsely tested. This work also bridges the gap between programs and control theory, allowing for ample cross-fertilization between the fields; indeed, one can think of mathematical optimizations of program code just as the adoption of programming concepts by control theory. So, should we treat programs as driven by logic, by calculus, or both? I encourage you to read the following paper to see the manifold connections between logic and calculus in computer programs.

Andreas Zeller (zeller@cs.uni-saarland.de) is a professor of software engineering at Saarland University in Saarbrücken, Germany.

© 2012 ACM 0001-0782/12/08 $15.00



doi:10.1145/2240236.2240262

Continuity and Robustness of Programs
By Swarat Chaudhuri, Sumit Gulwani, and Roberto Lublinerman

Abstract
Computer scientists have long believed that software is different from physical systems in one fundamental way: while the latter have continuous dynamics, the former do not. In this paper, we argue that notions of continuity from mathematical analysis are relevant and interesting even for software. First, we demonstrate that many everyday programs are continuous (i.e., arbitrarily small changes to their inputs only cause arbitrarily small changes to their outputs) or Lipschitz continuous (i.e., when their inputs change, their outputs change at most proportionally). Second, we give a mostly automatic framework for verifying that a program is continuous or Lipschitz, showing that traditional, discrete approaches to proving programs correct can be extended to reason about these properties. An immediate application of our analysis is in reasoning about the robustness of programs that execute on uncertain inputs. In the longer run, it raises hopes for a toolkit for reasoning about programs that freely combines logical and analytical mathematics.

This paper is based on two previous works: “Continuity Analysis of Programs,” by S. Chaudhuri, S. Gulwani, and R. Lublinerman, published in POPL (2010), 57–70, and “Proving Programs Robust,” by S. Chaudhuri, S. Gulwani, R. Lublinerman, and S. Navidpour, published in FSE (2011), 102–112.

1. INTRODUCTION
It is accepted wisdom in computer science that the dynamics of software systems are inherently discontinuous, and that this fact makes them fundamentally different from physical systems. More than 25 years ago, Parnas15 attributed the difficulty of engineering reliable software to the fact that “the mathematical functions that describe the behavior of [software] systems are not continuous.” This meant that the traditional analytical calculus (the mathematics of choice when one is analyzing the dynamics of, say, fluids) did not fit the needs of software engineering too well. Logic, which can reason about discontinuous systems, was better suited to being the mathematics of programs.

In this paper, we argue that this wisdom is to be taken with a grain of salt. First, many everyday programs are continuous in the same sense as in analysis; that is, arbitrarily small changes to their inputs lead to arbitrarily small changes to their outputs. Some of them are even Lipschitz continuous; that is, perturbations to the program’s inputs lead to at most proportional changes to its outputs. Second, we show that analytic properties of programs are not at odds with the classical, logical methods for program verification, giving a logic-based, mostly automated method for formally verifying that a program is continuous or Lipschitz continuous. Among the immediate applications of this analysis is reasoning about the robustness of programs that execute under uncertainty. In the longer run, it raises hopes for a unified theory of program analysis that marries logical and analytical approaches.

Now we elaborate on the above remarks. Perhaps the most basic reason why software systems can violate continuity is conditional branching, that is, constructs of the form “if B then P1 else P2.” Continuous dynamical systems arising in the physical sciences do not typically contain such constructs, but most nontrivial programs do. If a program has a branch, then even the minutest perturbation to its inputs may cause it to evaluate one branch in place of the other. Thus, we could perhaps conclude that any program containing a branch is ipso facto discontinuous.

To see that this conclusion is incorrect, consider the problem of sorting an array of numbers, one of the most basic tasks in computing. Every classic algorithm for sorting contains conditional branches. But let us examine the specification of a sorting algorithm: a mathematical function Sort that maps arrays to their sorted permutations. This specification is not only continuous but Lipschitz continuous: change any item of an input array A by ±ε, and each item of Sort(A) changes at most by ±ε. For example, suppose A and A′ are two input arrays as below, with A′ obtained by perturbing each item of A at most by ±1. Then Sort(A′) can be obtained by perturbing each item of Sort(A) at most by ±1.

[Figure: example arrays A and A′ alongside Sort(A) and Sort(A′).]

Similar observations hold for many of the classic computations in computer science, for example, shortest path and minimum spanning tree algorithms. Our program analysis extends and automates methods from the traditional analytical calculus to prove the continuity or Lipschitz continuity of such computations. For instance, to verify that a conditional statement within a program is continuous, we generalize the sort of insight that a high-school student uses to prove the continuity of a piecewise function like

\[ \mathit{abs}(x) = \begin{cases} x & \text{if } x \ge 0, \\ -x & \text{otherwise.} \end{cases} \]
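As a numeric companion to the piecewise function abs, the following sketch (in Python, which the paper itself does not use) writes abs as a conditional and measures how far its output can move when the input moves by at most eps around the branch point x = 0. A step function is included as a contrasting conditional whose two branches disagree at the boundary. The names `abs_branching`, `step`, and `jump_at` are hypothetical, and sampling a few points only illustrates the idea; it is not the paper's analysis and does not prove continuity.

```python
def abs_branching(x: float) -> float:
    """abs written as 'if B then P1 else P2' with guard B: x >= 0."""
    if x >= 0:
        return x    # piece P1
    return -x       # piece P2


def step(x: float) -> float:
    """A conditional whose two pieces do not agree where the guard flips."""
    if x >= 0:
        return 1.0
    return 0.0


def jump_at(f, point: float, eps: float) -> float:
    """Largest output change observed when the input moves by eps around `point`."""
    center = f(point)
    return max(abs(f(point + eps) - center), abs(f(point - eps) - center))


if __name__ == "__main__":
    for eps in (1e-1, 1e-3, 1e-6):
        # For abs the observed jump shrinks with eps; for step it stays at 1.
        print(eps, jump_at(abs_branching, 0.0, eps), jump_at(step, 0.0, eps))
```

The observed jump for `abs_branching` shrinks together with eps because both pieces take the same value at the point where the guard can flip, while the jump for `step` stays fixed however small eps becomes.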




Intuitively, abs(x) is continuous because its two “pieces” x and −x are continuous, and because x and −x agree on values in the neighborhood of x = 0, the point where a small perturbation can cause abs(x) to switch from evaluating one piece to evaluating the other. Our analysis uses the same idea to prove that “if B then P1 else P2” is continuous: it inductively verifies that P1 and P2 are continuous, then checks, often automatically, that P1 and P2 become semantically equivalent at states where the value of B can flip on a small perturbation.

When operating on a program with loops, our analysis searches for an inductive proof of continuity. To prove that a continuous program is Lipschitz continuous, we inductively compute a collection of Lipschitz matrices that contain numerical bounds on the slopes of functions computed along different control paths of the program.

Of course, complete software systems are rarely continuous. However, a verification technique like ours allows us to identify modules of a program that satisfy continuity properties. A benefit of this is that such modules are amenable to analysis by continuous methods. In the longer run, we can imagine a reasoning toolkit for programs that combines continuous analysis techniques, for example, numerical optimization or symbolic integration, and logical methods for analyzing code. Such a framework would expand the classical calculus to functions encoded as programs, a representation worthy of first-class treatment in an era where much of applied mathematics is computational.

A more immediate application of our framework is in the analysis of programs that execute on uncertain inputs, for example noisy sensor data or inexact scientific measurements. Unfortunately, traditional notions of functional correctness do not guarantee predictable program execution on uncertain inputs: a program may produce the correct output on each individual input, but even small amounts of noise in the input could change its output radically. Under uncertainty, traditional correctness properties must be supplemented by the property of robustness, which says that small perturbations to a program’s inputs do not have much effect on the program’s output. Continuity and Lipschitz continuity can both serve as definitions of robustness, and our analysis can be used to prove that a program is robust.

The rest of the paper is organized as follows. In Section 2, we formally define continuity and Lipschitz continuity of programs and give a few examples of computations that satisfy these properties. In Section 3, we give a method for verifying a program’s continuity, and then extend it to an analysis for Lipschitz continuity. Related work is presented in Section 4; Section 5 concludes the paper with some discussion.

2. CONTINUITY, LIPSCHITZ CONTINUITY, AND ROBUSTNESS
In this section, we define continuity2 and Lipschitz continuity3 of programs and show how they can be used to define robustness. First, however, we fix the programming language Imp whose programs we reason about.

Imp is a “core” language of imperative programs, meaning that it supports only the most central features of imperative programming: assignments, branches, and loops. The language has two discrete data types (integers and arrays of integers) and two continuous data types (reals and arrays of reals). Usual arithmetic and comparisons on these types are supported. In conformance with the model of computation under which algorithms over reals are typically designed, our reals are infinite-precision, and elementary operations on them are assumed to be given by unit-time oracles.

Each data type in Imp is associated with a metric.a This metric is our notion of distance between values of a given type. For concreteness, we fix, for the rest of the paper, the following metrics for the Imp types:

•  The integer and real types are associated with the Euclidean metric d(x, y) = |x − y|.
•  The metric over arrays (of reals or integers) of the same length is the L∞-norm: d(A1, A2) = max_i {|A1[i] − A2[i]|}. Intuitively, an array changes by ε when its size is kept fixed, but one or more of its items change by ε. We define d(A1, A2) = ∞ if A1 and A2 have different sizes.

a Recall that a metric over a set S is a function d: S × S → ℝ ∪ {∞} such that for all x, y, z, we have (1) d(x, y) ≥ 0, with d(x, y) = 0 iff x = y; (2) d(x, y) = d(y, x); and (3) d(x, y) + d(y, z) ≥ d(x, z).

The syntax of arithmetic expressions E, Boolean expressions B, and programs Prog is as follows:

[Grammar display for E, B, and Prog not recovered in this copy.]

Here x is a typed variable, c is a typed constant, A is an array variable, i an integer variable or constant, + and · respectively represent addition and multiplication over scalars (reals or integers), and the Boolean operators are as usual. We assume an orthogonal type system that ensures that all expressions and assignments in our programs are well-typed. The set of variables appearing in P is denoted by Var(P) = {x_1, …, x_n}.

As for semantics, for simplicity, let us restrict our focus to programs that terminate on all inputs. Let Val be a universe of typed values. A state of P is a vector s ∈ Val^n. Intuitively, for all 1 ≤ i ≤ n, s(i) is the value of the variable x_i at state s. The set of all states of P is denoted by Σ(P).

The semantics of the program P, an arithmetic expression e occurring in P, and a Boolean expression b in P are now respectively given by maps P: Σ(P) → Σ(P), e: Σ(P) → Val, and b: Σ(P) → {true, false}. Intuitively, for each state s of P, e(s) and b(s) are respectively the values of e and b at s, and P(s) is the state at which P terminates after starting execution from s. We omit the inductive definitions of these maps as they are standard.

Our definition of continuity of programs is an adaptation of the traditional ε-δ definition of continuous functions. As a program can have multiple inputs and outputs, we define continuity with respect to an input variable x_i and an output
variable x_j. Intuitively, if P is continuous with respect to input x_i and output x_j, then an arbitrarily small change to the initial value of any x_i, while keeping the remaining variables fixed, must only cause an arbitrarily small change to the final value of x_j. Variables other than x_j are allowed to change arbitrarily.

Formally, consider states s, s′ of P and any ε > 0. Let x_i be a variable of type τ, and let d_τ denote the metric over type τ. We say that s and s′ are ε-close with respect to x_i, and write $s \approx_{\varepsilon,i} s'$, if $d_\tau(s(i), s'(i)) < \varepsilon$. We call s′ an ε-perturbation of s with respect to x_i, and write $s \equiv_{\varepsilon,i} s'$, if s and s′ are ε-close with respect to x_i, and further, for all j ≠ i, we have s(j) = s′(j). Now we define:

Definition 1 (Continuity). A program P is continuous at a state s with respect to an input variable x_i and an output variable x_j if for all ε > 0, there exists a δ > 0 such that for all s′ ∈ Σ(P),

\[ s \equiv_{\delta,i} s' \;\Rightarrow\; P(s) \approx_{\varepsilon,j} P(s') \]

An issue with continuity is that it only certifies program behavior under arbitrarily small perturbations to the inputs. However, we may often want a definition of continuity that establishes a quantitative relationship between changes to a program’s inputs and outputs. Of the many properties in function analysis that accomplish this, Lipschitz continuity is perhaps the most well known. Let K > 0. A function f : ℝ → ℝ is K-Lipschitz continuous, or simply K-Lipschitz, if a ±ε-change to x can change f(x) by at most ±Kε. The constant K is known as the Lipschitz constant of f. It is easy to see that if f is K-Lipschitz for some K, then f is continuous at every input x.

We generalize this definition in two ways while adapting it to programs. First, as for continuity, we define Lipschitz continuity with respect to an input variable x_i and an output variable x_j. Second, we allow Lipschitz constants that depend on the size of the input. For example, suppose x_i is an array of length N, and consider ε-changes to it that do not change its size. We consider P to be N-Lipschitz with respect to x_i if on such a change, the output x_j can change at most by N · ε. In general, a Lipschitz constant of P is a function K : ℕ → ℝ≥0 that takes the size of x_i as an input.

Formally, to each value v, let us associate a size ‖v‖ > 0. If v is an integer or real, then ‖v‖ = 1; if v is an array of length N, then ‖v‖ = N. We have:

Definition 2 (Lipschitz continuity). Let K : ℕ → ℝ≥0. The program P is K-Lipschitz with respect to an input x_i and an output x_j if for all s, s′ ∈ Σ(P) and ε > 0,

\[ s \equiv_{\varepsilon,i} s' \;\wedge\; (\|s(i)\| = \|s'(i)\|) \;\Rightarrow\; P(s) \approx_{m,j} P(s') \]

where m = K(‖s(i)‖) · ε.

More generally, we could define Lipschitz continuity of P within a subset Σ′ of its state space. In such a definition, s and s′ in Definition 2 are constrained to be in Σ′, and no assertion is made about the effect of perturbations on states outside Σ′. Such a definition is useful because many realistic programs are Lipschitz only within certain regions of their input space, but for brevity, we do not consider it here.

Now we note that many of the classic computations in computer science are in fact continuous or Lipschitz.

Example 1 (Sorting). Let Sort be a sorting program that takes in an array A of reals and returns a sorted permutation of A. As discussed in Section 1, Sort is 1-Lipschitz in input A and output A, because any ε-change to the initial value of A (defined using our metric on arrays) can produce at most an ε-change to its final value. Note that this means that Sort is continuous in input A and output A at every program state.

What if A was an array of integers? Continuity still holds in this case, but for more trivial reasons. Since A is of a discrete type, the only arbitrarily small perturbation that can be made to A is no perturbation at all. Obviously, the program output does not change in this case. However, reasoning about continuity turns out to be important even under these terms. This is apparent when we try to prove that Sort is 1-Lipschitz when A is an array of integers. The easiest way to do this is to “cast” the input type of the program into an array of reals and prove that the program is 1-Lipschitz even after this modification, and this demands a proof of continuity.

Example 2 (Shortest paths). Let SP be a correct implementation of a shortest path algorithm. We view the graph G on which SP operates as an array of reals such that G[i] is the weight of the i-th edge. An ε-change to G thus amounts to a maximum change of ±ε to any edge weight of G, while keeping the node and edge structure intact.

The output of SP is the array d of shortest path distances in G; that is, d[i] is the length of the shortest path from the source node to the i-th node u_i of G. We note that SP is N-Lipschitz in input G and output d (N is the number of edges in G). This is because if each edge weight in G changes by an amount ε, a shortest path weight can change at most by N · ε.

On the other hand, suppose SP had a second output: an array p whose i-th element is a sequence of nodes forming a minimal-weight path between src and u_i. An ε-change to G may add or subtract elements from p, that is, perturb p by the amount ∞. Therefore, SP is not K-Lipschitz with respect to the output p for any K.

Example 3 (Minimum spanning trees). Consider any algorithm MST for computing a minimum-cost spanning tree in a graph G with real edge weights. Suppose MST has a variable c that holds, on termination, the cost of the minimum spanning tree. MST is N-Lipschitz (hence continuous everywhere) in the input G and the output c.

Example 4 (Knapsack). Consider the Knapsack problem from combinatorial optimization. We have a set of items {1,…, N}, each item i associated with a real cost c[i] and a real value v[i]. We also have a nonnegative, real-valued budget. The goal is to identify a subset Used ⊆ {1,…, N} such that the constraint $\sum_{i \in \mathit{Used}} c[i] \le \mathit{Budget}$ is satisfied, and the value of $\mathit{totv} = \sum_{i \in \mathit{Used}} v[i]$ is maximized.

Let our output variable be totv. As a perturbation can turn a previously feasible solution infeasible, a program Knap solving
this problem is discontinuous with respect to the input c and also with respect to the input Budget. At the same time, Knap is N-Lipschitz with respect to the input v: if the value of each item changes by ±ε, the value of totv can change by ±Nε.

Therefore, to make the problem more interesting, we assume that the input to Bubble Sort is an array of reals. As before, we model graphs by arrays of reals where each item represents the weight of an edge.
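The behavior claimed for Knap in Example 4 can be observed on a tiny instance. The sketch below is a brute-force solver written for illustration only (the function name `knap` and the two-item instance are assumptions, not taken from the paper): a perturbation of 1e-9 to Budget or to the costs c makes the best solution infeasible and moves totv by 10, whereas the same perturbation to the values v moves totv by about N times the perturbation.

```python
from itertools import combinations


def knap(costs, values, budget):
    """Brute-force Knapsack: best total value over subsets that fit the budget."""
    n = len(costs)
    best = 0.0
    for r in range(n + 1):
        for used in combinations(range(n), r):
            if sum(costs[i] for i in used) <= budget:
                best = max(best, sum(values[i] for i in used))
    return best


costs = [1.0, 1.0]
values = [10.0, 10.0]
budget = 2.0
eps = 1e-9

# Unperturbed instance: both items fit exactly.
base = knap(costs, values, budget)
# Tiny change to Budget: the two-item solution becomes infeasible.
tight = knap(costs, values, budget - eps)
# Tiny change to every cost: same effect.
bumped = knap([c + eps for c in costs], values, budget)
# Tiny change to every value: totv moves by roughly N * eps.
shifted = knap(costs, [v + eps for v in values], budget)

if __name__ == "__main__":
    print(base, tight, bumped, shifted - base)
```

The exhaustive search is exponential in the number of items and is used here only because it makes the feasibility cliff easy to see.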
Continuity and Lipschitz continuity can be used as definitions of robustness, a property that ensures that a program behaves predictably on uncertain inputs. Uncertainty, in this context, can be modeled by either nondeterminism (the value of x has a margin of error ±ε) or a probability distribution. For example, a robust embedded controller does not change its control decisions abruptly because of uncertainty in its sensor-derived inputs. The statistics computed by a robust scientific program are not much affected by measurement errors in the input dataset.

Given a program P, our task is to derive a syntactic continuity judgment for P, defined as a term b ⊢ Cont(P, In, Out), where b is a Boolean formula over Var, and In and Out are sets of variables of P. Such a judgment is read as “For each x_i ∈ In and x_j ∈ Out and each state s where b is true, P is continuous in input x_i and output x_j at s.” We break down this task into the task of deriving judgments b ⊢ Cont(P′, In, Out) for programs P′ that are syntactic substructures of P. For example, if P is of the form “if b then P1 else P2,” then we recursively derive continuity judgments for P1 and P2.
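Returning to robustness as a bounded response to bounded input uncertainty, the Lipschitz claims of Examples 1 and 2 can be probed empirically. The following sketch perturbs every input item by at most eps (nondeterministic uncertainty of ±eps) and measures the output change in the L∞ array metric of Section 2. It assumes an undirected graph with positive weights and uses Dijkstra's algorithm as one possible SP; the helper names `linf`, `perturb`, and `shortest_paths` and the sample graph are hypothetical. A random probe of this kind can only fail to find a violation; it is not a proof of the bound.

```python
import heapq
import random


def linf(a, b):
    """L-infinity distance between two arrays of the same length."""
    return max(abs(x - y) for x, y in zip(a, b))


def perturb(xs, eps, rng):
    """Move every item by at most eps, keeping the array size fixed."""
    return [x + rng.uniform(-eps, eps) for x in xs]


def shortest_paths(n_nodes, edges, weights, src=0):
    """Dijkstra on an undirected graph; weights[i] is the weight of edges[i]."""
    adj = [[] for _ in range(n_nodes)]
    for (u, v), w in zip(edges, weights):
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = [float("inf")] * n_nodes
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


rng = random.Random(0)
eps = 0.01

# Example 1: Sort should move each output item by at most eps.
A = [rng.uniform(0, 1) for _ in range(50)]
sort_change = linf(sorted(A), sorted(perturb(A, eps, rng)))

# Example 2: distances should move by at most N * eps, N = number of edges.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
G = [1.0, 1.0, 1.0, 5.0, 2.5]
sp_change = linf(shortest_paths(4, edges, G),
                 shortest_paths(4, edges, perturb(G, eps, rng)))

if __name__ == "__main__":
    print(sort_change, "<=", eps)
    print(sp_change, "<=", len(G) * eps)
```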
Robustness has benefits even in contexts where a pro- Continuity judgments are derived using a set of syntactic
gram’s relationship with uncertainty is not adversarial. proof rules—the rules can be converted in a standard way
The input space of a robust program does not have iso- into an automated program analysis that iteratively assigns
lated “peaks”—that is, points where the program output is continuity judgments to subprograms. Figure 1 shows the
very different from outputs on close-by inputs. Therefore, most important of our rules; for the full set, see the original
we are more likely to cover a program’s behaviors using reference.2 To understand the syntax of the rules, consider
random tests if the program is robust. Also, robust com- the rule Base. This rule derives a conclusion b  Cont(P, In,
putations are more amenable to randomized and approxi- Out), where b, In, and Out are arbitrary, from the premise that
mate program transformations20 that explore trade-offs P is either “skip” or an assignment.
between a program’s quality of results and resource con- The rule Sequence addresses sequential composition
sumption. Transformations of this sort can be seen to of programs, generalizing the fact that the composition of
deliberately introduce uncertainty into a program’s opera- two continuous functions is continuous. One of the prem-
tional semantics. If the program is robust, then this extra ises of this rule is a Hoare triple of the form {b1}P{b2}. This
uncertainty does not significantly affect its observable is to be read as “For any state s that satisfies b1, P(s) satis-
behavior, hence such a transformation is “safer” to per- fies b2. (A standard program verifier can be used to verify this
form. More details are available in our conference paper premise.) The rule In-Out allows us to restrict or generalize
on robustness analysis.3 the set of input and output variables with respect to which a
Now, if a program is Lipschitz, then we can give quantita- continuity judgment is made.
tive upper bounds on the change to its behavior due to uncer- The next rule—Ite—handles conditional statements,
tainty in its inputs, and further, this bound is small if the and is perhaps the most interesting of our rules. In a con-
inputs are only slightly uncertain. Consequently, Lipschitz ditional branch, a small perturbation to the input variables
continuity is a rather strong robustness property. Continuity can cause control to flow along a different branch, leading
is a weaker definition of robustness—a program computing to a syntactically divergent behavior. For instance, this hap-
ex is continuous, even though it hugely amplifies errors in its pens in Lines 3–4 in the Bubble Sort algorithm in Figure 2—
inputs. Nonetheless, it captures freedom from a common perturbations to items in A can lead to either behaviors
class of robustness violations: those where uncertainty in
the inputs alters a program’s control flow, and this change Figure 1. Key rules in continuity analysis.
leads to a significant change in the program’s output.

3. PROGRAM VERIFICATION
Now we present our automated framework for proving a
program continuous2 or Lipschitz.3 Our analysis is sound—
that is, a program proven continuous or Lipschitz by our
analysis is indeed continuous. However, as the analysis
targets a Turing-complete language, it is incomplete—for
example, a program may be continuous even if the analysis
does not deem it so.

3.1. Verifying continuity


First we show how to verify the continuity of an Imp program.
We use as running examples three of the most well-known
algorithms of computing: Bubble Sort, the Bellman–Ford
shortest path algorithm, and Dijkstra’s shortest path algo-
rithm (Figure 2). As mentioned in Example 1, a program
is always continuous in input variables of discrete types.



of either “swapping A[i] and A[i + 1]” or “leaving A unchanged.”

Figure 2. Bubble sort, the Bellman-Ford algorithm, and Dijkstra’s algorithm.

  BubbleSort (A : array of reals)
  1  for j := 1 to (|A| − 1);
  2    do for i := 1 to (|A| − 1);
  3      do if (A[i] > A[i + 1])
  4        then t := A[i]; A[i] := A[i + 1]; A[i + 1] := t;

  BellmanFord(G : array of reals, src : int)
  1  …
  2  for i := 1 to (|G| − 1)
  3    do for each edge (v, w) of G
  4      do if d[v] + G(v, w) < d[w]
  5        then d[w] := d[v] + G(v, w)

  Dijkstra(G : array of reals, src : int)
  1  …
  2  while W ≠ ∅
  3    do choose edge (v, w) ∈ W such that d[w] is minimal;
  4      remove (v, w) from W;
  5      if d[w] + G[w, v] < d[v]
  6        then d[v] := d[w] + G[w, v]

The core idea behind the rule Ite is to show that such a divergence does not really matter, because at the program states where arbitrarily small perturbations to the program variables can “flip” the value of the guard b of the conditional statement (let us call the set of such states the boundary of b), the branches of the conditional are arbitrarily close in behavior.

Precisely, let us construct from b the following formula:

[displayed formula defining B(b) missing from the source]

Note that B(b) represents an overapproximation of the boundary of b. Also, for a set of output variables Out and a Boolean formula c, let us call programs P1 and P2 Out-equivalent under c, and write c ⊢ (P1 ≡Out P2), if for each state s that satisfies c, the states P1(s) and P2(s) agree on the values of all variables in Out. We assume an oracle that can determine if c ⊢ (P1 ≡Out P2) for given c, P1, P2, and Out. In practice, such equivalence questions can often be solved fully automatically using modern automatic theorem provers.¹⁹ Now, to derive a continuity judgment for a program “if b then P1 else P2” with respect to the outputs Out, Ite shows that P1 and P2 become Out-equivalent under the condition B(b).

The rule Loop derives continuity judgments for while-loops. The idea here is to prove the body R of the loop continuous, then inductively argue that the entire loop is continuous too. In more detail, the rule derives a continuity judgment c ⊢ Cont(R, X, X), where c is a loop invariant (a property that is always true at the loop header) and X is a set of variables. Now consider any state s satisfying c. An arbitrarily small perturbation to this state leads to an arbitrarily small change in the value of each variable in X at the end of the first iteration, which only leads to an arbitrarily small change in the value of each variable in X at the end of the second iteration, and so on. Continuity follows.

Some subtleties need to be considered, however. An execution from a slightly perturbed state may terminate earlier or later than it would in the original execution. Even if the loop body is continuous, the extra iterations in either the modified or the original execution may cause the states at the loop exit to be very different. We rule out such scenarios by asserting a premise called synchronized termination. A loop “while b do R” fulfills this property with respect to a loop invariant c and a set of variables X, if B(b) ∧ c ⊢ (R ≡X skip). Under this property, even if the loop reaches a state where a small perturbation can cause the loop to terminate earlier (similarly, later), the extra iterations in the original execution have no effect on the program state. We can ignore these iterations in our proof.

Second, even if the loop body is continuous in input xi and output xj for every xi, xj ∈ X, an iteration may drastically change the value of program variables not in X. If there is a data flow from these variables to some variable in X, continuity will not hold. We rule out this scenario through an extra condition. Consider executions of P whose initial states satisfy a condition c. We call a set of variables X of P separable under c if the value of each z ∈ X on termination of any such execution is independent of the initial values of variables not in X. We denote the fact that X is separable in this way by c ⊢ Sep(P, X).

To verify that P is continuous in input xi and output xj at state s, we derive a judgment b ⊢ Cont(P, {xi}, {xj}), where b is true at s. The correctness of the method follows from the following soundness theorem:

Theorem 1. If the rules in Figure 1 can derive the judgment b ⊢ Cont(P, In, Out), then for all xi ∈ In, xj ∈ Out, and s such that b = true, P is continuous in input xi and output xj at s.

Example 5 (Warmup). Consider the program “if (x > 2) then x := x/2 else x := −5x + 11.” B(x > 2) equals (x = 2) and (x = 2) ⊢ (x := x/2) ≡{x} (x := −5x + 11). By Ite, the program is continuous in input x and output x.

Let us now use our rules on the algorithms in Figure 2.

Example 6 (Bubble Sort). Consider the implementation of Bubble Sort in Figure 2. (We assume it to be rewritten as a while-program in the obvious way.) Our goal is to derive the judgment true ⊢ Cont(BubbleSort, {A}, {A}).

Let X = {A, i, j}, and let us write R〈p,q〉 to denote the code fragment from line p to line q (both inclusive). Also, let us write c ⊢ Term(while b do R, X) as an abbreviation for B(b) ∧ c ⊢ (R ≡X skip).
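As a concrete illustration of the equivalence premise that rule Ite needs for the conditional in lines 3–4 of Bubble Sort, the following Python sketch runs both branches on one state at the boundary (A[i] = A[i + 1]) and one state off it, and compares the results on X = {A, i, j}. The dictionary encoding of program states, the helper names, and the 0-based indexing are assumptions made for this sketch; they are not part of the analysis described in the article.

```python
def swap_branch(state):
    """'then' branch: t := A[i]; A[i] := A[i+1]; A[i+1] := t (0-based index here)."""
    i = state["i"]
    A = list(state["A"])
    t = A[i]
    A[i], A[i + 1] = A[i + 1], t
    return {**state, "A": A, "t": t}


def skip_branch(state):
    """'else' branch: the state is left unchanged."""
    return dict(state)


def agree_on(s1, s2, variables):
    """True if the two states hold equal values for every listed variable."""
    return all(s1[v] == s2[v] for v in variables)


X = ("A", "i", "j")  # t is deliberately left out of X

# Boundary state: A[i] == A[i+1], so the guard A[i] > A[i+1] can flip
# under an arbitrarily small perturbation.
boundary = {"A": [3.0, 1.5, 1.5, 0.2], "i": 1, "j": 0, "t": 9.0}
# Off-boundary state: A[i] > A[i+1] holds with a margin.
off_boundary = {"A": [3.0, 1.5, 1.0, 0.2], "i": 1, "j": 0, "t": 9.0}

same_on_boundary = agree_on(swap_branch(boundary), skip_branch(boundary), X)
same_off_boundary = agree_on(swap_branch(off_boundary), skip_branch(off_boundary), X)
```

On the boundary state the two branches differ only in t, which is outside X, so they agree on X; away from the boundary they disagree on A, which is harmless because a small perturbation cannot flip the guard there.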




It is easy to show that true ⊢ Sep(BubbleSort, X) and true ⊢ Sep(R〈2,4〉, X). Each loop guard only involves discrete variables, hence we derive true ⊢ Term(BubbleSort, X) and true ⊢ Term(R〈2,4〉, X).

Now consider R〈3,4〉. As B(A[i] > A[i + 1]) equals (A[i] = A[i + 1]) and (A[i] = A[i + 1]) ⊢ (skip ≡X R〈4,4〉), we have true ⊢ Cont(R〈3,4〉, X, X), then true ⊢ Cont(R〈2,4〉, X, X), and then true ⊢ Cont(BubbleSort, X, X). Now the In-Out rule derives the judgment we are after.

Example 7 (Bellman–Ford). Take the Bellman–Ford algorithm. On termination, d[u] contains the shortest path distance from the source node src to the node u. We want to prove that true ⊢ Cont(BellmanFord, {G}, {d}).

We use the symbols R〈p,q〉 and Term as before. Clearly, we have true ⊢ Sep(R〈3,5〉, X) and true ⊢ Term(R〈3,5〉, X), where X = {G, v, w}. The two branches of the conditional in Line 4 are X-equivalent at B(d[v] + G(v, w) < d[w]), hence we have true ⊢ Cont(R〈4,5〉, X, X), and from this judgment, true ⊢ Cont(R〈3,5〉, X, X). Similar arguments can be made about the outer loop. Now we can derive the fact true ⊢ Cont(BellmanFord, X, X); weakening, we get the judgment we seek.

Unfortunately, the rule Loop is not always enough for continuity proofs. Consider states s and s′ of a continuous program P, where s′ is obtained by slightly perturbing s. For Loop to apply, executions from s and s′ must converge to close-by states at the end of each loop iteration. However, this need not be so. For example, think of Dijkstra’s algorithm. As a shortest path computation, this program is continuous in the input graph G and the output d, the array of shortest path distances. But let us look at its main loop in Figure 2.

Note that in any iteration, there may be several items w for which d[w] is minimal. But then, a slightly perturbed initial value of d may cause a loop iteration to choose a different w, leading to a drastic change in the value of d at the end of the iteration. Thus, individual iterations of this loop are not continuous, and we cannot apply Loop.

In prior work,² we gave a more powerful rule, called epoch induction, for proving the continuity of programs like the one above. The key insight here is that if we group some loop iterations together, then continuity becomes an inductive property of the groupings. For example, in Dijkstra’s algorithm, a “grouping” is a maximal set S of successive loop iterations that are tied on the initial value of d[w]. Let s0 be the program state before the first iteration in S is executed. Owing to arbitrarily small perturbations to s0, we may execute iterations in S in a very different order. However, an iteration that ran after the iterations in S in the original execution will still run after the iterations in S. Moreover, for a fixed s0, the program state, once all iterations in S have been executed, is the same, no matter what order these iterations were executed in. Thus, a small perturbation cannot significantly change the state at the end of S, and the set of iterations S forms a continuous computation.

We have implemented the rules in Figure 1, as well as the epoch induction rule, in the form of a mostly-automatic program analysis. Given a program, the analysis iterates through its control points, assigning continuity judgments to subprograms until convergence is reached. Auxiliary tasks such as checking the equivalence of two straight-line program fragments (needed by rule Ite) are performed automatically using the Z3⁸ SMT-solver. Human intervention is expected in two forms. First, in applications of the epoch induction rule, we sometimes expect the programmer to write annotations that define appropriate “groupings” of iterations. Second, in case a complex loop invariant is needed for the proof (e.g., when one of the programs in a program equivalence query is a nested loop), the programmer is expected to supply it. There are heuristics and auxiliary tools that can be used to automate these steps, but our current system does not employ them.

Given the incompleteness of our proof rules, a natural empirical question for us was whether our system can verify the continuity of the continuous computing tasks described in Section 2. To answer this question, we chose 13 continuous algorithms over real and real array data types. Our system was able to verify the continuity of 11 of these algorithms, including the shortest path algorithms of Bellman-Ford, Dijkstra, and Floyd-Warshall; Merge Sort and Selection Sort in addition to Bubble Sort; and the minimum spanning tree algorithms of Prim and Kruskal. Among the algorithms we could not verify was Quick Sort. Please see Chaudhuri et al.² for more details.

3.2. Verifying Lipschitz continuity
Now we extend the above verification framework to one for Lipschitz continuity. Let us fix variables xi and xj of the program P respectively as the input and the output variable. To start with, we assume that xi and xj are of continuous data types: reals or arrays of reals.

Let us define a control flow path of a program P as the sequence of assignment or skip-statements that P executes on some input (we omit a more formal definition). We note that since our arithmetic expressions are built from additions and multiplications, each control flow path of P encodes a continuous (in fact differentiable) function of the inputs. Now suppose we can show that each control flow path of P is a K-Lipschitz computation, for some K, in input xi and output xj. This does not mean that P is K-Lipschitz in this input and output: a perturbation to the initial value of xi can cause P to execute along a different control flow path, leading to a drastically different final state. However, if P is continuous and the above condition holds, then P is K-Lipschitz in input xi and output xj.

Our analysis exploits the above observation. To prove that P is K-Lipschitz in input xi and output xj, we establish that (1) P is continuous at all states in input xi and output xj and (2) each control flow path of P is K-Lipschitz in input xi and output xj. Of these, the first task is accomplished using the analysis from Section 3.1. To accomplish the second task, we compute a data structure, a set of Lipschitz matrices, that contains upper bounds on the slopes of any computation that can be carried out in a control flow path of P.
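The observation that per-path Lipschitz bounds carry over to the whole program only when the program is continuous can be checked numerically on the program from Example 5. The Python sketch below is an empirical illustration under stated assumptions, not the analysis itself: the variant q (with constant 12 instead of 11) is a hypothetical program introduced here to create a jump at x = 2, and the sample grid and helper max_slope are choices made for this sketch.

```python
def p(x: float) -> float:
    """Program from Example 5; both branches yield 1 at the boundary x = 2."""
    return x / 2 if x > 2 else -5 * x + 11


def q(x: float) -> float:
    """Hypothetical variant with a jump at x = 2 (branches yield 1 and 2 there)."""
    return x / 2 if x > 2 else -5 * x + 12


def max_slope(f, points):
    """Largest |f(a) - f(b)| / |a - b| over all pairs of distinct sample points."""
    best = 0.0
    for a in points:
        for b in points:
            if a != b:
                best = max(best, abs(f(a) - f(b)) / abs(a - b))
    return best


# Sample points clustered around the branch boundary x = 2.
samples = [2 + k / 1000 for k in range(-200, 201)]

# Each control flow path of p and q has slope 1/2 or 5, so K = 5 bounds every path.
slope_p = max_slope(p, samples)  # stays within the per-path bound of 5
slope_q = max_slope(q, samples)  # far exceeds 5 because of the jump at x = 2
```

Both programs have the same per-path bounds; only the continuous one inherits K = 5 as a bound for the whole program, which is why the analysis pairs the path-wise computation with the continuity proof of Section 3.1.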



Figure 3. Rules for deriving Lipschitz matrices.

More precisely, let P have n variables named x1,…, xn, as before. A Lipschitz matrix J of P is an n × n matrix, each of whose elements is a function K : N → R≥0. Elements of J are represented either as numeric constants or as symbolic expressions (for example, N + 5), and the element in the i-th row and j-th column of J is denoted by J(i, j). Our analysis associates P with sets 𝒥 of such matrices via judgments P : 𝒥. Such a judgment is read as follows: “For each control flow path C in P and each xi, xj, there is a J ∈ 𝒥 such that C is J(j, i)-Lipschitz in input xi and output xj.”

The Lipschitz matrix data structure can be seen as a generalization of the Jacobian from vector calculus. Recall that the Jacobian of a function f : R^n → R^n with inputs x1,…, xn ∈ R and outputs x1′,…, xn′ ∈ R is the matrix whose (i, j)-th entry is ∂xi′/∂xj. If f is differentiable, then for each xi′ and xj, f is K-Lipschitz with respect to input xj and output xi′, where K is any upper bound on |∂xi′/∂xj|. In our setting, each control flow path represents a differentiable function, and we can verify the Lipschitz continuity of this function by propagating a Jacobian along the path. On the other hand, owing to branches, the program P may not represent a differentiable, or even continuous, function.
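Propagating a Jacobian-style bound along a single control flow path can be sketched in a few lines of Python. This is a minimal illustration under assumptions chosen here: matrices are plain nested lists with numeric entries only (no symbolic bounds), row r and column c hold a bound on how much the final value of variable r changes per unit change in the initial value of variable c, and the helper names identity, assign, and then are hypothetical rather than taken from the authors' system.

```python
from typing import List

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    """Matrix for 'skip': each variable depends only on itself, with slope 1."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def assign(n: int, target: int, grad: List[float]) -> Matrix:
    """Matrix for 'x_target := e', where grad[c] bounds |de/dx_c|.

    Variables other than x_target keep their identity rows.
    """
    m = identity(n)
    m[target] = [abs(g) for g in grad]
    return m


def then(first: Matrix, second: Matrix) -> Matrix:
    """Matrix for running `first` and then `second` (chain rule: second * first)."""
    n = len(first)
    return [
        [sum(second[r][k] * first[k][c] for k in range(n)) for c in range(n)]
        for r in range(n)
    ]


# Single-variable paths from the warmup program: x := x/2 and x := -5x + 11.
half = assign(1, target=0, grad=[0.5])
neg = assign(1, target=0, grad=[-5.0])

# Hypothetical two-variable path: x1 := x0 + x1, executed three times in a row.
step = assign(2, target=1, grad=[1.0, 1.0])
path = identity(2)
for _ in range(3):
    path = then(path, step)
# path[1][0] bounds the change in the final x1 per unit change in the initial x0.
```

Because every entry is a nonnegative upper bound, multiplying the matrices can only overestimate the true slope of the path, which is the direction of error a sound analysis can tolerate.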
However, note that it is correct to associate a conditional statement “if b then P1 else P2” with the set of matrices (𝒥1 ∪ 𝒥2), where the judgments P1 : 𝒥1 and P2 : 𝒥2 have been made inductively. Of course, this increases the number of matrices that we have to track for a subprogram. But the proliferation of such matrices can be curtailed using an approximation that merges two or more of them.

This merge operation ⊔ is defined as (J1 ⊔ J2)(i, j) = max(J1(i, j), J2(i, j)) for all J1, J2, i, j. Suppose we can correctly derive the judgment P : 𝒥. Then for any J1, J2 ∈ 𝒥, it is also correct to derive the judgment P : (𝒥 \ {J1, J2} ∪ {J1 ⊔ J2}). Note that this overapproximation may overestimate the Lipschitz constants for some of the control flow paths in P, but this is acceptable as we are not seeking the most precise Lipschitz constant for P anyway.

Figure 3 shows our rules for associating a set 𝒥 of Lipschitz matrices with a program P. In the first rule Skip, I is the identity matrix. The rule is correct because skip is 1-Lipschitz in input xi and output xi for all i, and 0-Lipschitz in input xi and output xj, where i ≠ j.

To obtain Lipschitz constants for assignments to variables (rule Assign), we must quantify the way the value of an arithmetic expression e changes when its inputs are changed. This is done by computing a vector ∇e whose i-th element is an upper bound on |∂e/∂xi|. In more detail, we have

[displayed definition of ∇e missing from the source]

Assignments xi[m] := e to array locations are slightly trickier. The location xi[m] is affected by changes to variables appearing in e; if the variable xi does not appear in e, then perturbations to the initial value of xi have no effect on xi[m]. However, the remaining locations in xi are affected by, and only by, changes to the initial value of xi. Thus, we can view xi as being split into two “regions” (one consisting of xi[m] and the other of every other location) with possibly different Lipschitz constants. We track these constants using two different Lipschitz matrices J and J′. Here J is as in the rule Assign, while J′ is identical to the Lipschitz matrix for a hypothetical assignment xi := xi.

Sequential composition is handled by matrix multiplication (rule Sequence); the insight here is essentially the chain rule of differentiation. As mentioned earlier, the rule for conditional statements merges the Lipschitz matrices computed along the two branches. The Weaken rule allows us to overestimate a Lipschitz constant at any point.

The rule While derives Lipschitz matrices for while-loops. Here Bound⁺(P, M) is a premise that states that the symbolic or numeric constant M is an upper bound on the number of iterations of P; it is assumed to be inferred via an auxiliary checker.¹¹ Finally, 𝒥^M is shorthand for the singleton set of matrix products {J1 … JM : Ji ∈ 𝒥}. In cases where M is a symbolic bound, we will not be able to compute this product explicitly. However, in many practical cases, one can reduce it to a simpler manageable form using algebraic identities.

The While rule combines the rules for if-statements and sequential composition. Consider a loop P whose body R has Lipschitz matrix J. If the loop terminates in exactly M iterations, J^M is a correct Lipschitz matrix for it. However, if the loop may terminate after M′ < M iterations, we require an extra property for J^M to be a correct Lipschitz matrix: J^i ≤ J^(i+1) for all i < M. This property is ensured by the condition




∀i, j: J(i, j) = 0 ∨ J(i, j) ≥ 1. Note that in the course of a proof, we can weaken any Lipschitz matrix for the loop body to a matrix J of this form.

We can prove the following soundness theorem:

Theorem 2. Let P be continuous in input xi and output xj. If the rules in Figure 3 derive the judgment P : {J}, then P is J(j, i)-Lipschitz in input xi and output xj.

Example 8 (Warmup). Recall the program “if (x > 2) then x := x/2 else x := −5x + 11” from Example 5 (x is a real). Our rules can associate the left branch with a single Lipschitz matrix containing a single entry 1/2, and the right branch with a single matrix containing a single entry 5. Given the continuity of the program, we conclude that the program is 5-Lipschitz in input x and output x.

Example 9 (Bubble Sort). Consider the Bubble Sort algorithm (Figure 2) once again, and as before, let R〈p,q〉 denote the code fragment from line p to line q. Let us set x0 to be A and x1 to be t.

Now, let J = [matrix missing from the source]. From the rules in Figure 3, we can derive (t := A[i]) : {J}, (A[i] := A[i + 1]) : {I}, and (A[i + 1] := t) : {J, I}. Carrying out the requisite matrix multiplications, we get R〈4,4〉 : {J}. Using the rule Ite, we have R〈3,4〉 : {I, J}. Now, it is easy to show that R〈3,4〉 gets executed N times, where N is the size of A. From this we have R〈2,4〉 : {I, J}^N. Given that J² = IJ = JI = J, this is equivalent to the judgment R〈2,4〉 : {I, J}. From this, we derive BubbleSort : {J, I}. Given the proof of continuity carried out in Example 6, Bubble Sort is 1-Lipschitz in input A and output A.

Intuitively, the reason why Bubble Sort is so robust is that here, (1) there is no data flow from program points where arithmetic operations are carried out to points where values are assigned to the output variable and (2) continuity holds everywhere. In fact, one can formally prove that any program that meets the above two criteria is 1-Lipschitz. However, we do not develop this argument here.

Example 10 (Bellman–Ford; Dijkstra). Let us consider the Bellman–Ford algorithm (Figure 2) once again, and let x0 be G and x1 be d. Consider line 5 (i.e., the program R〈5,5〉); our rules can assign to this program the Lipschitz matrix J, where J = [matrix missing from the source]. With a few more derivations, we obtain R〈4,5〉 : {J}. Using the rule for loops, we have R〈3,5〉 : {J^N}, where N is the number of edges in G, and then BellmanFord : {J^(N²)}. But note that

[displayed equation missing from the source]

Combining the above with the continuity proof in Example 7, we decide that the Bellman–Ford algorithm is N²-Lipschitz in input G and output d.

Note that the Lipschitz constant obtained in the above proof is not the optimal one; that would be N. This is an instance of the gap between truth and provability that is the norm in program analysis. Interestingly, our rules can derive the optimal Lipschitz constant for Dijkstra’s algorithm. Using the same reasoning as above, we assign to the main loop of the algorithm the single Lipschitz matrix J. Applying the Loop rule, we derive

[displayed judgment missing from the source]

Given that the algorithm is continuous in input G and output d, it is N-Lipschitz in input G and output d.

Let us now briefly consider the case when the input and output variables in our program are of discrete type. As a program is trivially continuous in every discrete input, continuity is not a meaningful notion in such a setting. Therefore, we focus on the problem of verifying Lipschitz continuity, for example, showing that the Bubble Sort algorithm is 1-Lipschitz even when the input array A is an array of integers.

An easy solution to this problem is to cast the array A into an array A* of reals, and then to prove 1-Lipschitz continuity of the resultant program in input A* and output A*. As any integer is also a real, the above implies that the original algorithm is 1-Lipschitz in input A and output A. Thus, reals are used here as an abstraction of integers, just as (unbounded) integers are often used in program verification as abstractions of bounded-length machine numbers.

Unsurprisingly, this strategy does not always work. Consider the program “if (x > 0) then x := x + 1 else skip,” where x is an integer. This program is 2-Lipschitz. Its “slope” is the highest around initial states where x = 0: if the initial value of x changes from 0 to 1, the final value of x changes from 0 to 2. At the same time, if we cast x into a real, the resultant program is discontinuous and thus not K-Lipschitz for any K.

It is possible to give an analysis of Lipschitz continuity that does not suffer from the above issue. This analysis casts the integers into reals as mentioned above, then calculates a Lipschitz matrix of the resultant program; however, it checks a property that is slightly weaker than continuity. For lack of space, we do not go into the details of the analysis here.

We have extended our implementation of continuity analysis with the verification method for Lipschitz continuity presented above, and applied the resulting system to the suite of 13 algorithms mentioned at the end of Section 3.1. All these algorithms were either 1-Lipschitz or N-Lipschitz. Our system was able to compute the optimal Lipschitz constant for 9 of the 11 algorithms where continuity could be verified. In one case (Bellman-Ford), it certified an N-Lipschitz computation as N²-Lipschitz. The one example on which it fared poorly was the Floyd-Warshall shortest path algorithm, where the best Lipschitz constant that it could compute was exponential in N³.

4. RELATED WORK
So far as we know, we were the first² to propose a framework for continuity analysis of programs. Before us, Hamlet¹² advocated notions of continuity of software; however, he concluded that “it is not possible in practice to mechanically test for continuity” in the presence of



loops. Soon after our first paper on this topic (and before our subsequent work on Lipschitz continuity of programs), Reed and Pierce¹⁸ gave a type system that can verify the Lipschitz continuity of functional programs. This system can seamlessly handle functional data structures such as lists and maps; however, unlike our method, it cannot reason about discontinuous control flow, and would consider any program with a conditional branch to have a Lipschitz constant of ∞.

More recently, Jha and Raskhodnikova¹³ have taken a property testing approach to estimating the Lipschitz constant of a program. Given a program, this method determines, with a level of probabilistic certainty, whether it is either 1-Lipschitz or ε-far (defined in a suitable way) from being 1-Lipschitz. While the class of programs allowed by the method is significantly more restricted than what is investigated here or by Reed and Pierce, the appeal of the method lies in its crisp completeness guarantees, and also in that it only requires blackbox access to the program.

Robustness is a standard correctness property in control theory,¹⁶,¹⁷ and there is an entire subfield of control studying the design and analysis of robust controllers. However, the systems studied by this literature are abstractly defined using differential equations and hybrid automata rather than programs. The systematic modeling and analysis of robustness of programs was first proposed by us in the context of general software, and by Majumdar and Saha¹⁴ in the context of control software.

In addition, there are many efforts in the abstract interpretation literature that, while not verifying continuity or robustness explicitly, reason about the uncertainty in a program’s behavior due to floating-point rounding and sensor errors.⁶,⁷,¹⁰ Other related literature includes work on automatic differentiation (AD),¹ where the goal is to transform a program P into a program that returns the derivative of P where it exists. Unlike the work described here, AD does not attempt verification: no attempt is made to certify a program as differentiable or Lipschitz.

5. CONCLUSION
In this paper, we have argued for the adoption of analytical properties like continuity and Lipschitz continuity as correctness properties of programs. These properties are relevant as they can serve as useful definitions of robustness of programs to uncertainty. Also, they raise some fascinating technical issues. Perhaps counterintuitively, some of the classic algorithms of computer science satisfy continuity or Lipschitz continuity, and the problem of systematic reasoning about these properties demands a nontrivial combination of analytical and logical insights.

We believe that the work described here is a first step toward an extension of the classical calculus to a symbolic mathematics where programs form a first-class representation of functions and dynamical systems. From a practical perspective, this is important as physical systems are increasingly controlled by software, and as even applied mathematicians increasingly reason about functions that are not written in the mathematical notation of textbooks, but as code. Speaking more philosophically, the classical calculus focuses on the computational aspects of real analysis, and the notation of calculus texts has evolved primarily to facilitate symbolic computation by hand. However, in our era, most mathematical computations are carried out by computers, and a calculus for our age should not ignore the notation that computers can process most easily: programs. This statement has serious implications: it opens the door not only to the study of continuity or derivatives but also to, say, Fourier transforms, differential equations, and mathematical optimization of code. Some efforts in these directions⁴,⁵,⁹ are already under way; others will no doubt appear in the years to come.

Acknowledgments
This research was supported by NSF CAREER Award #1156059 (“Robustness Analysis of Uncertain Programs: Theory, Algorithms, and Tools”).

References
1. Bucker, M., Corliss, G., Hovland, P., Naumann, U., Norris, B. Automatic Differentiation: Applications, Theory and Implementations, Birkhauser, 2006.
2. Chaudhuri, S., Gulwani, S., Lublinerman, R. Continuity analysis of programs. In POPL (2010), 57–70.
3. Chaudhuri, S., Gulwani, S., Lublinerman, R., Navidpour, S. Proving programs robust. In FSE (2011), 102–112.
4. Chaudhuri, S., Solar-Lezama, A. Smooth interpretation. In PLDI (2010), 279–291.
5. Chaudhuri, S., Solar-Lezama, A. Smoothing a program soundly and robustly. In CAV (2011), 277–292.
6. Chen, L., Miné, A., Wang, J., Cousot, P. Interval polyhedra: An abstract domain to infer interval linear relationships. In SAS (2009), 309–325.
7. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X. The ASTRÉE analyzer. In ESOP (2005), 21–30.
8. de Moura, L.M., Bjørner, N. Z3: An efficient SMT solver. In TACAS (2008), 337–340.
9. Girard, A., Pappas, G. Approximate bisimulation: A bridge between computer science and control theory. Eur. J. Contr. 17, 5 (2011), 568.
10. Goubault, E. Static analyses of the precision of floating-point operations. In SAS (2001).
11. Gulwani, S., Zuleger, F. The reachability-bound problem. In PLDI (2010), 292–304.
12. Hamlet, D. Continuity in software systems. In ISSTA (2002).
13. Jha, M., Raskhodnikova, S. Testing and reconstruction of Lipschitz functions with applications to data privacy. In FOCS (2011), 433–442.
14. Majumdar, R., Saha, I. Symbolic robustness analysis. In RTSS (2009), 355–363.
15. Parnas, D. Software aspects of strategic defense systems. Commun. ACM 28, 12 (1985), 1326–1335.
16. Pettersson, S., Lennartson, B. Stability and robustness for hybrid systems. In Decision and Control (Dec 1996), 1202–1207.
17. Podelski, A., Wagner, S. Model checking of hybrid systems: From reachability towards stability. In HSCC (2006), 507–521.
18. Reed, J., Pierce, B. Distance makes the types grow stronger: A calculus for differential privacy. In ICFP (2010).
19. Strichman, O. Regression verification: Proving the equivalence of similar programs. In CAV (2009).
20. Zhu, Z., Misailovic, S., Kelner, J., Rinard, M. Randomized accuracy-aware program transformations for efficient approximate computations. In POPL (2012).

Swarat Chaudhuri (swarat@rice.edu), Department of Computer Science, Rice University, Houston, TX.
Sumit Gulwani (sumitg@microsoft.com), Microsoft Research, Redmond, WA.
Roberto Lublinerman (rluble@psu.edu), Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA.

© 2012 ACM 0001-0782/12/08 $15.00



ACM TechNews Goes Mobile
iPhone & iPad Apps Now Available in the iTunes Store
ACM TechNews, ACM’s popular thrice-weekly news briefing service, is now
available as easy-to-use mobile apps downloadable from the Apple iTunes Store.
These new apps allow nearly 100,000 ACM members to keep current with
news, trends, and timely information impacting the global IT and computing
communities each day.

TechNews mobile app users will enjoy:


• Latest News: Concise summaries of the most
relevant news impacting the computing world
• Original Sources: Links to the full-length
articles published in over 3,000 news sources
• Archive access: Access to the complete
archive of TechNews issues dating back to
the first issue published in December 1999
• Article Sharing: The ability to share news
with friends and colleagues via email, text
messaging, and popular social networking sites
• Touch Screen Navigation: Find news
articles quickly and easily with a
streamlined, fingertip scroll bar
• Search: Simple search of the entire TechNews
archive by keyword, author, or title
• Save: One-click saving of latest news or archived
summaries in a personal binder for easy access
• Automatic Updates: By entering and saving
your ACM Web Account login information,
the apps will automatically update with
the latest issues of TechNews published
every Monday, Wednesday, and Friday

The Apps are freely available to download from the Apple iTunes Store, but users must be registered
individual members of ACM with valid Web Accounts to receive regularly updated content.
http://www.apple.com/iphone/apps-for-iphone/  http://www.apple.com/ipad/apps-for-ipad/

careers

Princeton University
Computer Science
Lecturer

The Department of Computer Science seeks applications from outstanding teachers to assist the faculty in teaching our introductory course sequence or some of our upper-level courses starting September 1, 2012. Depending on the qualifications and interests of the applicant, the job responsibilities will include such activities as teaching recitation sections and supervising graduate-student teaching assistants; grading problem sets and programming assignments, and supervising students in the grading of problem sets and programming assignments; developing and maintaining online curricular material, classroom demonstrations, and laboratory exercises; and supervising undergraduate research projects. An advanced degree in computer science, or related field, is required (PhD preferred). The position is renewable for 1-year terms, up to six years, depending upon departmental need.
Princeton University is an equal opportunity employer and complies with applicable EEO and affirmative action regulations. You may apply online, by submitting a letter of application, resume and names of three references at http://jobs.cs.princeton.edu/lecturer. Requisition number: 1200313

U.S. Naval Academy
Computer Science Department
Assistant Professor

The U.S. Naval Academy’s Computer Science Department invites applications for one or more tenure track positions. Appointments at all ranks will be considered, but preference is for junior faculty at the rank of Assistant Professor. These positions may begin as early as the Fall of 2012. A Ph.D. in Computer Science or closely related field is required.
Applicants with backgrounds, experience and research interests in cyber security, and information assurance are especially encouraged to apply, however all backgrounds of computer science will be considered. Applicants must have a dedication to teaching, an ability to teach a broad range of computer science courses, and the ability to initiate and maintain a strong research program.
The Computer Science Department offers majors in Computer Science and Information Technology, and is developing a new major in Cyber Security. We currently have 85 CS majors, 100 IT majors and a faculty of 21. The department is housed in a state of the art building overlooking the scenic Severn River, and discussions have begun regarding a new academic building to support the department, including the new cyber security program. Our spaces provide outstanding office, laboratory, and research facilities for both students and faculty, including specialized labs for information assurance, networking, and robotics, as well as three micro-computing labs and two high performance computing labs. In addition to computer science and information technology courses for the majors, we also teach a required course on cyber security to the entire freshman class.
The Naval Academy is an undergraduate institution located in historic downtown Annapolis, Maryland on the Chesapeake Bay. Roughly half the faculty are tenured or tenure track civilian professors with Ph.D.s who balance teaching excellence with internationally recognized research programs. The remaining faculty are active duty military officers with Masters or Doctoral degrees. Each year the academy graduates roughly 1000 undergraduate students with majors in the sciences, engineering, and humanities. More information about the department and the Academy can be found at http://www.usna.edu/cs/ and http://www.usna.edu/.
Applicants should send a cover letter, teaching and research statements, curriculum vitae, and arrange for three letters of recommendation that address both teaching and research abilities to be sent to cssearch@usna.edu.
Review of applications will begin immediately, continuing until the positions are filled.
The United States Naval Academy is an Equal Opportunity Employer. This agency provides reasonable accommodations to applicants with disabilities. This position is subject to the availability of funds.

The Hong Kong Polytechnic University is the largest government-funded tertiary institution in Hong Kong in terms of student number. It offers programmes at Doctorate, Master’s, Bachelor’s degrees and Higher Diploma levels. It has a full-time academic staff strength of around 1,200. The total consolidated expenditure budget of the University is in excess of HK$5 billion per year.

DEPARTMENT OF COMPUTING
The Department of Computing is an academic department in the Faculty of Engineering, and it offers a comprehensive range of programmes at undergraduate and postgraduate levels including PhD in computer science and information technology. It is renowned for research in areas of Graphics, Multimedia and Virtual Reality, Human Technology and Knowledge Discovery, Mobile and Network Computing, Pattern Analysis and Machine Intelligence, and Software Engineering and Systems. The department is positioned strategically in conducting world-class application-driven research and providing high-quality education. Please visit the website http://www.comp.polyu.edu.hk for more information about the department.
The department is now inviting high calibre candidates for the post of Research Assistant Professor to take part in a number of research and teaching activities in the areas of Software Engineering / Cloud Computing / Social Computing / E-business / Business Intelligence / Knowledge Engineering.

Research Assistant Professor in Software Engineering / Cloud Computing / Social Computing / E-business / Business Intelligence / Knowledge Engineering (three posts)
The appointees will be required to (a) conduct innovative research projects that lead to publications in top-tier refereed journals and awards of external research grants; (b) engage in research collaborations both internally and internationally; (c) teach at both undergraduate and postgraduate levels, and supervise research students; (d) participate in professional services to the academic community and in promotional activities; and (e) contribute to departmental activities.
Applicants should have a PhD degree plus an excellent track record of research in one of the above areas relevant to the strategic development of the department. Applicants should also show strong evidence of commitment to research that leads to top quality publications with high impact.

Remuneration and Conditions of Service
The remuneration package for the Research Assistant Professor post is the same as an Assistant Professor post. A highly competitive remuneration package will be offered. Appointments for Research Assistant Professor will be on a fixed-term gratuity-bearing contract for up to three years. Re-engagement thereafter is subject to mutual agreement. Applicants should state their current and expected salary in the application.

Application
Please submit application form via email to hrstaff@polyu.edu.hk; by fax at (852) 2364 2166; or by mail to Human Resources Office, 13/F, Li Ka Shing Tower, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong. If you would like to provide a separate curriculum vitae, please still complete the application form which will help speed up the recruitment process. Application forms can be obtained via the above channels or downloaded from http://www.polyu.edu.hk/hro/job.htm. Recruitment will continue until the positions are filled. Details of the University’s Personal Information Collection Statement for recruitment can be found at http://www.polyu.edu.hk/hro/jobpics.htm.


University of Calgary
Department of Computer Science
Assistant Professor Position

The Department of Computer Science at the University of Calgary seeks outstanding candidates for a tenure-track position at the Assistant Professor level. Applicants from the area of Database Management are of primary interest. Details for this position appear at: http://www.cpsc.ucalgary.ca/. Applicants must possess a doctorate in Computer Science at the time of appointment, and have a strong potential to develop an excellent research record.
The Department is one of Canada’s leaders as evidenced by our commitment to excellence in research and teaching. It has large undergraduate and graduate programs and extensive state-of-the-art computing facilities. Calgary is a multicultural city that is the fastest growing city in Canada. Calgary enjoys a moderate climate located beside the natural beauty of the Rocky Mountains.
Further information about the Department is available at http://www.cpsc.ucalgary.ca/.
Interested applicants should send a CV, a concise description of their research area and program, a statement of teaching philosophy, and arrange to have at least three reference letters sent to: Dr. Carey Williamson, Head, Department of Computer Science, University of Calgary, Calgary, Alberta, Canada, T2N 1N4 or via email to: search@cpsc.ucalgary.ca.
Completed applications received by October 15, 2012 will receive full consideration, though the review process will continue until the position is filled. Hiring decisions will be finalized as soon as possible, with the successful candidate joining the U of C on either January 1, 2013 or July 1, 2013.
All qualified candidates are encouraged to apply; however, Canadians and permanent residents will be given priority.

University of Oregon
Department of Computer and Information Science
Professor/Department Head

The Computer and Information Science (CIS) Department at the University of Oregon invites applications for Professor/Department Head. We seek an outstanding scholar who will be excited to head a computer science department in a strong public research university. A number of senior faculty have recently or will soon retire, creating an opportunity for regeneration through multiple hires over the next several years. The department is currently developing and implementing a strategic plan for increased research prominence in the context of a college-wide plan for excellence.
We seek an individual with strategic vision and leadership abilities. The ideal candidate will be a prominent scholar with a sustained record of publication and research funding in an area of software or intelligent systems that complements existing research strengths of the department. We are looking for an innovative thinker who is eager to advance interdisciplinary scholarship, build bridges between academia and industry, mentor faculty to achieve excellence, teach at the undergraduate and graduate levels, and communicate the intellectual excitement of computer science and its broad, evolving role in contemporary society. A PhD in Computer Science or a closely related field is required.
The University of Oregon is an AAU comprehensive research university with many nationally and internationally renowned programs. It is located in Eugene, two hours south of Portland, and within one hour’s drive of both the Pacific Ocean and the snow-capped Cascade Mountains. The CIS Department is part of the College of Arts and Sciences and is housed within the integrated Lorry Lokey Science Complex. The department offers B.S., M.S. and Ph.D. degrees. More information about the department, its programs and faculty can be found at http://www.cs.uoregon.edu, or by contacting the search committee at faculty.search@cs.uoregon.edu.
Applications will be accepted electronically through the department’s web site (only). Application information can be found at http://www.cs.uoregon.edu/Employment/. Review of applications has been extended through October 1, 2012 and will continue until the position is filled. Please address questions to faculty.search@cs.uoregon.edu.
The University of Oregon is an equal opportunity/affirmative action institution committed to cultural diversity and is compliant with the Americans with Disabilities Act. We are committed to creating a more inclusive and diverse institution and seek candidates with demonstrated potential to contribute positively to its diverse community.

BLENDED LEARNING M.S. in Computer Science

Discover your full potential with a master’s degree in computer science from LIU Brooklyn.
• Convenient blended format fuses online learning with traditional classroom studies, reducing the amount of time you’ll spend on campus and maximizing interaction with faculty and fellow students.
• Emphasis on design and development of large software systems.
• Core curriculum based on Association for Computing Machinery (ACM) recommendations.
• 36-credit program includes small implementation projects and/or programming exercises that provide practical training.
• Complete a large software development project or a thesis.

For more information, call 718-488-1011 or visit liu.edu/brooklyn/mscs.

Advertising in Career Opportunities

How to Submit a Classified Line Ad: Send an e-mail to acmmediasales@acm.org. Please include text, and indicate the issue/or issues where the ad will appear, and a contact name and number.

Estimates: An insertion order will then be e-mailed back to you. The ad will be typeset according to CACM guidelines. NO PROOFS can be sent. Classified line ads are NOT commissionable.

Rates: $325.00 for six lines of text, 40 characters per line. $32.50 for each additional line after the first six. The MINIMUM is six lines.

Deadlines: 20th of the month/2 months prior to issue date. For latest deadline info, please contact: acmmediasales@acm.org

Career Opportunities Online: Classified and recruitment display ads receive a free duplicate listing on our website at: http://jobs.acm.org

Ads are listed for a period of 30 days.

For More Information Contact: ACM Media Sales at 212-626-0686 or acmmediasales@acm.org



THE ACM A. M. TURING AWARD
by the community ◆ from the community ◆ for the community

ACM, Intel, and Google congratulate

JUDEA PEARL
for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning.

“Dr. Pearl’s work provided the original paradigm case for how to do statistical AI. By placing structured knowledge representations at the heart of his work, and emphasizing how these representations enabled efficient inference and learning, he showed the field of AI how to build statistical reasoning systems that were actually telling us something about intelligence, not just statistics.”

Limor Fix
Director, University Collaborative Research Group
Intel Labs
For more information see www.intel.com/research.

“Judea Pearl is the most prominent advocate for probabilistic models in artificial intelligence. He developed mathematical tools to tackle complex problems that handle uncertainty. Before Pearl, AI systems had more success in black and white domains like chess. But robotics, self-driving cars, and speech recognition deal with uncertainty. Pearl enabled these applications to flourish, and convinced the field to adopt these techniques.”

Alfred Spector
Vice President, Research and Special Initiatives
Google Inc.
For more information, see http://www.google.com/corporate/index.html and http://research.google.com/.

Financial support for the ACM A. M. Turing Award is provided by Intel Corporation and Google Inc.
last byte

DOI:10.1145/2240236.2240263 Peter Winkler

Puzzled
Find the Magic Set
Welcome to three new puzzles. Each involves a collection of items, and your job is to find
a subset of them that is characterized by a particular property. Since solving the puzzles
is not easy, here are a couple of hints: For the first, think about averages; for the other two,
try constructing your sets sequentially, bearing in mind that if two partial sums are equal,
the terms between them must add up to zero.
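
The second hint can be made concrete with a short sketch. It is only an illustration of the partial-sum observation, not a solution to any of the puzzles; the function name `zero_sum_run` and the sample list are hypothetical and chosen for this example.

```python
def zero_sum_run(terms):
    """Return (i, j) with i < j and sum(terms[i:j]) == 0, or None.

    Walks through the running (partial) sums. If the same partial sum
    shows up twice, the terms between the two occurrences add up to zero.
    """
    # Maps each partial sum to the number of terms consumed when it first appeared.
    first_seen = {0: 0}
    total = 0
    for count, term in enumerate(terms, start=1):
        total += term
        if total in first_seen:
            return first_seen[total], count
        first_seen[total] = count
    return None


# Hypothetical sample data: the partial sums are 3, 2, 6, 4, 2, 7,
# so the value 2 repeats after the second and fifth terms.
terms = [3, -1, 4, -2, -2, 5]
run = zero_sum_run(terms)
print(run, terms[run[0]:run[1]])
```

Here the repeated partial sum pins down the slice `terms[2:5]`, whose entries cancel. The same bookkeeping works for any quantity that can be added up term by term, which is the sense in which the hint suggests building sets sequentially.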

1. A balance scale sits on a teacher’s desk, currently tipped to the right. A set of weights is on the scales, and on each weight is the name of at least one student. Class is about to begin, and on entering the classroom, each student moves each weight carrying his or her name to the opposite side of the scale. Now show there is a set of pupils that you, the teacher, can let in the classroom that will tip the scales to the left.

2. You have two sets (blue and red) of n n-sided dice, with each die labeled with the numbers 1 to n. You roll all 2n dice simultaneously. Now find a nonempty subset of the red dice and a nonempty subset of the blue dice with the same sum.

3. You begin with a list of all 1,024 possible vectors of length 10 with entries +1 or –1. Your crayon-wielding three-year-old child has got hold of the list and unfortunately changed some of the entries in some of the vectors to zeroes. Now find a non-empty subset of the altered vectors that add up to the all-zero vector (0,0,0,0,0,0,0,0,0,0).

Readers are encouraged to submit prospective puzzles for future columns to puzzled@cacm.acm.org.

Peter Winkler (puzzled@cacm.acm.org) is William Morrill Professor of Mathematics and Computer Science
at Dartmouth College, Hanover, NH.
