
contributed articles

doi:10.1145/1516046.1516062

The Claremont Report on Database Research

Database research is expanding, with major efforts in system architecture, new languages, cloud services, mobile and virtual worlds, and interplay between structure and text.

Rakesh Agrawal, Anastasia Ailamaki, Philip A. Bernstein, Eric A. Brewer, Michael J. Carey, Surajit Chaudhuri, AnHai Doan, Daniela Florescu, Michael J. Franklin, Hector Garcia-Molina, Johannes Gehrke, Le Gruenwald, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Hank F. Korth, Donald Kossmann, Samuel Madden, Roger Magoulas, Beng Chin Ooi, Tim O’Reilly, Raghu Ramakrishnan, Sunita Sarawagi, Michael Stonebraker, Alexander S. Szalay, Gerhard Weikum

Communications of the ACM | June 2009 | Vol. 52 | No. 6

Illustration by Gluekit

A group of database researchers, architects, users, and pundits met in May 2008 at the Claremont Resort in Berkeley, CA, to discuss the state of database research and its effects on practice. This was the seventh meeting of this sort over the past 20 years and was distinguished by a broad consensus that the database community is at a turning point in its history, due to both an explosion of data and usage scenarios and major shifts in computing hardware and platforms. Here, we explore the conclusions of this self-assessment. It is by definition somewhat inward-focused but may be of interest to the broader computing community as both a window into upcoming directions in database research and a description of some of the community issues and initiatives that surfaced. We describe the group’s consensus view of new focus areas for research, including database engine architectures, declarative programming languages, interplay of structured data and free text, cloud data services, and mobile and virtual worlds. We also report on discussions of the database community’s growth and processes that may be of interest to other research areas facing similar challenges.

Over the past 20 years, small groups of database researchers have periodically gathered to assess the state of the field and propose directions for future research [1, 3–7]. Reports of the meetings served to foster debate within the database research community, explain research directions to external organizations, and help focus community efforts on timely challenges.

The theme of the Claremont meeting was that database research and the data-management industry are at a turning point, with unusually rich opportunities for technical advances, intellectual achievement, entrepreneurship, and benefits for science and society. Given the large number of opportunities, it is important for the database research community to address issues that maximize relevance within the field, across computing, and in external fields as well.

The sense of change that emerged in the meeting was a function of several factors:

Excitement over “big data.” In recent years, the number of communities working with large volumes of data has grown considerably to include not only traditional enterprise applications and Web search but also e-science efforts (in astronomy, biology, earth science, and more), digital entertainment, natural-language processing, and social-network analysis. While the user base for traditional database management systems (DBMSs) is growing quickly, there is also a groundswell of effort to design new custom data-management solutions from simpler components. The ubiquity of big data is expanding the base of users and developers of data-management technologies and will undoubtedly shake up the database research field.

Data analysis as profit center. In traditional enterprise settings, the barriers between IT departments and business units are coming down, and there are many examples of companies where data is indeed the business itself. As a consequence, data capture, integration, and analysis are no longer viewed as a business cost but as the keys to efficiency and profit. The value of software to support data analytics has been growing as a result. In 2007, corporate acquisitions of business-intelligence vendors alone totaled $15 billion [2], and that is only the “front end” of the data-analytics tool chain. Market pressure for better analytics also brings new users to the technology with new demands. Statistically sophisticated analysts are being hired in a growing number of industries, with increasing interest in running their formulae on the raw data. At the same time, a growing number of nontechnical decision makers want to “get their hands on the numbers” as well in simple and intuitive ways.

Ubiquity of structured and unstructured data. There is an explosion of structured data on the Web and on enterprise intranets. This data is from a variety of sources beyond traditional databases, including large-scale efforts to extract structured information from text, software logs and sensors, and crawls of deep-Web sites. There is also an explosion of text-focused semistructured data in the public domain in the form of blogs, Web 2.0 communities, and instant messaging. New incentive structures and Web sites have emerged for publishing and curating structured data in a shared fashion as well. Text-centric approaches to managing the data are easy to use but ignore latent structure in the data that might add significant value. The race is on to develop techniques that extract useful data from mostly noisy text and structured corpora, enable deeper exploration into individual data sets, and connect data sets together to wring out as much value as possible.

Expanded developer demands. Programmer adoption of relational DBMSs and query languages has grown significantly in recent years, accelerated by the maturation of open source systems (such as MySQL and PostgreSQL) and the growing popularity of object-relational mapping packages (such as Ruby on Rails). However, the expanded user base brings new expectations for programmability and usability from a larger, broader, less-specialized community of programmers. Some of them are unhappy or unwilling to “drop into” SQL, viewing DBMSs
as unnecessarily complicated and daunting to learn and manage relative to other open source components. As the ecosystem for database management evolves beyond the typical DBMS user base, opportunities are emerging for new programming models and new system components for data management and manipulation.

Architectural shifts in computing. While the variety of user scenarios is increasing, the computing substrates for data management are shifting dramatically as well. At the macro scale, the rise of cloud computing services suggests fundamental changes in software architecture. It democratizes access to parallel clusters of computers; every programmer has the opportunity and motivation to design systems and services that scale out incrementally to arbitrary degrees of parallelism. At a micro scale, computer architectures have shifted the focus of Moore’s Law from increasing clock speed per chip to increasing the number of processor cores and threads per chip. In storage technologies, major changes are under way in the memory hierarchy due to the availability of more and larger on-chip caches, large inexpensive RAM, and flash memory. Power consumption has become an increasingly important aspect of the price/performance metric of large systems. These hardware trends alone motivate a wholesale reconsideration of data-management software architecture.

These factors together signal an urgent, widespread need for new data-management technologies. There is an opportunity for making a positive difference. Traditionally, the database community is known for the practical relevance of its research; relational databases are emblematic of technology transfer. But in recent years, the externally visible contribution of the database research community has not been as pronounced, and there is a mismatch between the notable expansion of the community’s portfolio and its contribution to other fields of research and practice. In today’s increasingly rich technical climate, the database community must recommit itself to impact and breadth. Impact is evaluated by external measures, so success involves helping new classes of users, powering new computing platforms, and making conceptual breakthroughs across computing. These should be the motivating goals for the next round of database research.

To achieve these goals, discussion at the 2008 Claremont Resort meeting revolved around two broad agendas we call reformation and synthesis. The reformation agenda involves deconstructing traditional data-centric ideas and systems and reforming them for new applications and architectural realities. One part of this entails focusing outside the traditional RDBMS stack and its existing interfaces, emphasizing new data-management systems for growth areas (such as e-science). Another part of the reformation agenda involves taking data-centric ideas like declarative programming and query optimization outside their original context in storage and retrieval to attack new areas of computing where a data-centric mindset promises to yield significant benefit. The synthesis agenda is intended to leverage research ideas in areas that have yet to develop identifiable, agreed-upon system architectures, including data integration, information extraction, and data privacy. Many of these subcommunities of database research seem ready to move out of the conceptual and algorithmic phase to work together on comprehensive artifacts (such as systems, languages, and services) that combine multiple techniques to solve complex user problems. Efforts toward synthesis can serve as rallying points for research, likely leading to new challenges and breakthroughs, and promise to increase the overall visibility of the work.

Research Opportunities

After two days of intense discussion at the 2008 Claremont meeting, it was surprisingly easy for the group to reach consensus on a set of research topics for investigation in coming years. Before exploring them, we stress a few points regarding what is not on the list. First, while we tried to focus on new opportunities, we do not propose they be pursued at the expense of existing good work. Several areas we deemed critical were left off because they are already focus topics in the database community. Many were mentioned in previous reports [1, 3–7] and are the subject of significant efforts that require continued investigation and funding. Second, we kept the list short, favoring focus over coverage. Though most of us have other promising research topics we would have liked to discuss at greater length here, we focus on topics that
attracted the broadest interest within the group.

In addition to the listed topics, the main issues raised during the meeting included management of uncertain information, data privacy and security, e-science and other scholarly applications, human-centric interaction with data, social networks and Web 2.0, personalization and contextualization of query- and search-related tasks, streaming and networked data, self-tuning and adaptive systems, and the challenges raised by new hardware technologies and energy constraints. Most are captured in the following discussion, with many cutting across multiple topics.

Revisiting database engines. System R and Ingres pioneered the architecture and algorithms of relational databases; current commercial databases are still based on their designs. But many of the changes in applications and technology demand a reformation of the entire system stack for data management. Current big-market relational database systems have well-known limitations. While they provide a range of features, they have only narrow regimes in which they provide peak performance; online transaction processing (OLTP) systems are tuned for lots of small, concurrent transactional debit/credit workloads, while decision-support systems are tuned for a few read-mostly, large-join-and-aggregation workloads. Meanwhile, for many popular data-intensive tasks developed over the past decade, relational databases provide poor price/performance and have been rejected; critical scenarios include text indexing, serving Web pages, and media delivery. New workloads are emerging in the sciences, Web 2.0-style applications, and other environments where database-engine technology could prove useful but is not bundled in current database systems.

Even within traditional application domains, the database marketplace today suggests there is room for significant innovation. For example, in the analytics markets for business and science, customers can buy petabytes of storage and thousands of processors, but the dominant commercial database systems typically cannot scale that far for many workloads. Even when they can, the cost of software and management relative to hardware is exorbitant. In the OLTP market, business imperatives like regulatory compliance and rapid response to changing business conditions raise the need to address data life-cycle issues (such as data provenance, schema evolution, and versioning).

Given these requirements, the commercial database market is wide open to new ideas and systems, as reflected in the recent funding climate for entrepreneurs. It is difficult to recall when there were so many start-up companies developing database engines, and the challenging economy has not trimmed the field much. The market will undoubtedly consolidate over time, but things are changing fast, and it remains a good time to try radical ideas.

Some research projects have begun taking revolutionary steps in database system architecture. There are two distinct directions: broadening the useful range of applicability for multipurpose database systems (for example, to incorporate streams, text search, XML, and information integration) and radically improving performance by designing special-purpose database systems for specific domains (for example, read-mostly analytics, streams, and XML). Both directions have merit, and the overlap in their stated targets suggests they may be more synergistic than not. Special-purpose techniques (such as new storage and compression formats) may be reusable in more general-purpose systems, and general-purpose architectural components (such as extensible query optimizer frameworks) may help speed prototyping of new special-purpose systems.

Important research topics in the core database engine area include:
• Designing systems for clusters of many-core processors that exhibit limited and nonuniform access to off-chip memory;
• Exploiting remote RAM and Flash as persistent media, rather than relying solely on magnetic disk;
• Treating query optimization and physical data layout as a unified, adaptive, self-tuning task to be carried out continuously;
• Compressing and encrypting data at the storage layer, integrated with data layout and query optimization;
• Designing systems that embrace nonrelational data models, rather than shoehorning them into tables;
• Trading off consistency and availability for better performance and scale-out to thousands of machines; and
• Designing power-aware DBMSs that limit energy costs without sacrificing scalability.

This list is not exhaustive. One industrial participant at the Claremont meeting noted that this is a time of opportunity for academic researchers; the landscape has shifted enough that access to industrial legacy code provides little advantage, and large-scale clustered hardware is rentable in the cloud at low cost. Moreover, industrial players and investors are aggressively looking for bold new ideas. This opportunity for academics to lead in system design is a major change in the research environment.

Declarative programming for emerging platforms. Programmer productivity is a key long-acknowledged challenge in computing, with its most notable mention in the database context in Jim Gray’s 1998 Turing lecture. Today, the urgency of the challenge is increasing exponentially as programmers target ever more complex environments, including many-core chips, distributed services, and cloud computing platforms.

Nonexpert programmers must be able to write robust code that scales out across processors in both loosely and tightly coupled architectures. Although developing new programming paradigms is not a database problem per se, ideas of data independence, declarative programming, and cost-based optimization provide a promising angle of attack. There is significant evidence that data-centric approaches will have significant influence on programming in the near term.

The recent popularity of the MapReduce programming framework for manipulating big data sets is an example of this potential. MapReduce is attractively simple, building on language and data-parallelism techniques that have been known for decades. For database researchers, the significance of MapReduce is in demonstrating the benefits of data-parallel programming to new classes of developers. This opens opportunities for the database community to extend its contribution to the broader community, developing more powerful and efficient languages and runtime mechanisms that help these developers address more complex problems.

As another example of declarative programming, in the past five years a variety of new declarative languages, often grounded in Datalog, have been developed for domain-specific systems in fields as diverse as networking and distributed systems, computer games, machine learning and robotics, compilers, security protocols, and information extraction. In many of these scenarios, the use of a declarative language has reduced code size by orders of magnitude while also enabling distributed or parallel execution. Surprisingly, the groups behind these efforts have coordinated very little with one another; the move to revive declarative languages in these new contexts has grown up organically.

A third example arises in enterprise-application programming. Recent language extensions (such as Ruby on Rails and LINQ) encourage query-like logic in programmer design patterns. But these packages have yet to address the challenge of enterprise-style programming across multiple machines; the closest effort here is DryadLINQ, focusing on parallel analytics rather than on distributed application development. For enterprise applications, a key distributed design decision is the partitioning of logic and data across multiple “tiers,” including Web clients, Web servers, application servers, and a backend DBMS. Data independence is particularly valuable here, allowing programs to be specified without making a priori permanent decisions about physical deployment across tiers. Automatic optimization processes could make these decisions and move data and code as needed to achieve efficiency and correctness. XQuery has been proposed as an existing language that would facilitate this kind of declarative programming, in part because XML is often used in cross-tier protocols.

It is unusual to see this much energy surrounding new data-centric programming techniques, but the opportunity brings challenges as
well. The research challenges include language design, efficient compilers and runtimes, and techniques to optimize code automatically across both the horizontal distribution of parallel processors and the vertical distribution of tiers. It seems natural that the techniques behind parallel and distributed databases (partitioned dataflow and cost-based query optimization) should extend to new environments. However, to succeed, these languages must be fairly expressive, going beyond simple MapReduce and select-project-join-aggregate dataflows. This agenda will require “synthesis” work to harvest useful techniques from the literature on database and logic programming languages and optimization, as well as to realize and extend them in new programming environments.

To genuinely improve programmer productivity, these new approaches also need to pay attention to the softer issues that capture the hearts and minds of programmers (such as attractive syntax, typing and modularity, development tools, and smooth interaction with the rest of the computing ecosystem, including networks, files, user interfaces, Web services, and other languages). This work also needs to consider the perspective of programmers who want to use their favorite programming languages and data services as primitives in those languages. Example code and practical tutorials are also critical.

To execute successfully, database research must look beyond its traditional boundaries and find allies throughout computing. This is a unique opportunity for a fundamental “reformation” of the notion of data management, not as a single system but as a set of services that can be embedded as needed in many computing contexts.

Interplay of structured and unstructured data. A growing number of data-management scenarios involve both structured and unstructured data. Within enterprises, we see large heterogeneous collections of structured data linked with unstructured data (such as document and email repositories). On the Web, we also see a growing amount of structured data primarily from three sources: millions of databases hidden behind forms (the deep Web); hundreds of millions of high-quality data items in HTML tables on Web pages and a growing number of mashups providing dynamic views on structured data; and data contributed by Web 2.0 services (such as photo and video sites, collaborative annotation services, and online structured-data repositories).

A significant long-term goal for the database community is to transition from managing traditional databases consisting of well-defined schemata for structured business data to the much more challenging task of managing a rich collection of structured, semistructured, and unstructured data spread over many repositories in the enterprise and on the Web, sometimes referred to as the challenge of managing dataspaces.

In principle, this challenge is closely related to the general problem of data integration, a longstanding area for database research. The recent advances in this area and the new issues due to Web 2.0 resulted in significant discussion at the Claremont meeting. On the Web, the database community has contributed primarily in two ways: First, it developed technology that enables the generation of domain-specific (“vertical”) search engines with relatively little effort; and second, it developed domain-independent technology for crawling through forms (that is, automatically submitting well-formed queries to forms) and surfacing the resulting HTML pages in a search-engine index. Within the enterprise, the database research community recently contributed to enterprise search and the discovery of relationships between structured and unstructured data.

The first challenge database researchers face is how to extract structure and meaning from unstructured and semistructured data. Information-extraction technology can now pull structured entities and relationships out of unstructured text, even in unsupervised Web-scale contexts. We expect in coming years that hundreds of extractors will be applied to a given data source. Hence developers and analysts need techniques for applying and managing predictions from large numbers of independently developed extractors. They also need algorithms that can introspect about the correctness of extractions and therefore combine multiple pieces of extraction evidence in a principled fashion. The database community is not alone in these efforts; to contribute in this area, database researchers should continue
to strengthen ties with researchers in information retrieval and machine learning.

Context is a significant aspect of the semantics of the data, taking multiple forms (such as the text and hyperlinks that surround a table on a Web page, the name of the directory in which data is stored, accompanying annotations or discussions, and relationships to physically or temporally proximate data items). Context helps analysts interpret the meaning of data in such applications because the data is often less precise than in traditional database applications, as it is extracted from unstructured text, extremely heterogeneous, or sensitive to the conditions under which it was captured. Better database technology is needed to manage data in context. In particular, there is a need for techniques to discover data sources, enhance the data by discovering implicit relationships, determine the weight of an object’s context when assigning it semantics, and maintain the provenance of data through these steps of storage and computation.

The second challenge is to develop methods for querying and deriving insight from the resulting sea of heterogeneous data. A specific problem is to develop methods to answer keyword queries over large collections of heterogeneous data sources. We must be able to break down the query to extract its intended semantics and route the query to the relevant source(s) in the collection. Keyword queries are just one entry point into data exploration, and there is a need for techniques that lead users into the most appropriate querying mechanism. Unlike previous work on information integration, the challenges here are that we cannot assume we have semantic mappings for the data sources and we cannot assume that the domain of the query or the data sources is known. We need to develop algorithms for providing best-effort services on loosely integrated data. The system should provide meaningful answers to queries with no need for manual integration and improve over time in a pay-as-you-go fashion as semantic relationships are discovered and refined. Developing index structures to support querying hybrid data is also a significant challenge. More generally, we need to develop new notions of correctness and consistency in order to provide metrics and enable users or system designers to make cost/quality trade-offs. We also need to develop the appropriate systems concepts around which these functionalities are tied.

In addition to managing existing data collections, there is an opportunity to innovate in the creation of data collections. The emergence of Web 2.0 creates the potential for new kinds of data-management scenarios in which users join ad hoc communities to create, collaborate, curate, and discuss data online. As an example, consider creating a database of access to clean water in different places around the world. Since such communities rarely agree on schemata ahead of time, the schemata must be inferred from the data; however, the resulting schemata are still used to guide users to consensus. Systems in this context must incorporate visualizations that drive exploration and analysis. Most important, these systems must be extremely easy to use and so will probably require compromising on some typical database functionality and providing more semiautomatic “hints” mined from the data. There is an important opportunity for a feedback loop here; as more data is created with such tools, information extraction and querying could become easier. Commercial and academic prototypes are beginning to appear, but there is plenty of room for additional innovation and contributions.

Cloud data services. Economic and technological factors have motivated a resurgence of shared computing infrastructure, providing software and computing facilities as a service, an approach known as cloud services or cloud computing. Cloud services provide efficiencies for application providers by limiting up-front capital expenses and by reducing the cost of ownership over time. Such services are typically hosted in a data center using shared commodity hardware for computation and storage. A varied set of cloud services is available today, including application services (salesforce.com), storage services (Amazon S3), compute services (Amazon EC2, Google App Engine, and Microsoft Azure), and data services (Amazon SimpleDB, Microsoft SQL Data Services, and Google’s Datastore). They represent a major reformation of data-management architectures, with more on the horizon. We anticipate many future data-centric applications leveraging data services in the cloud.

A cross-cutting theme in cloud services is the trade-off providers face between functionality and operational costs. Today’s early cloud data services offer an API that is much more restricted than that of traditional database systems, with a minimalist query language, limited consistency guarantees, and in some cases explicit constraints on resource utilization. This limited functionality pushes more programming burden on developers but allows cloud providers to build more predictable services and offer service-level agreements that would be difficult to provide for a full-function SQL data service. More work and experience are needed on several fronts to fully understand the continuum between today’s early cloud data services and more full-function but possibly less-predictable alternatives.

Manageability is particularly important in cloud environments. Relative to traditional systems, it is complicated by three factors: limited human intervention, high-variance workloads, and a variety of shared infrastructures. In the majority of cloud-computing settings, there will be no database administrators or system administrators to assist developers with their cloud-based applications; the platform must do much of that work automatically. Mixed workloads have always been difficult to tune but may be unavoidable in this context.

Even a single customer’s workload can vary widely over time; the elastic provisioning of cloud services makes it economical for a user to occasionally harness orders-of-magnitude more resources than usual for short bursts of work. Meanwhile, service tuning depends heavily on the way the shared infrastructure is “virtualized.” For example, Amazon EC2 uses hardware-level virtual machines as its programming interface. On the opposite end of the spectrum, salesforce.com implements “multi-tenant” hosting of many independent schemas in a single managed DBMS. Many other virtualization solutions are possible, each with different views into the workloads above and platforms below and different abilities to control each. These

management across layers.

The need for manageability adds urgency to the development of self-managing database technologies that have been explored over the past decade. Adaptive, online techniques will be required to make these systems viable, while new architectures and APIs, including the flexibility to depart from traditional SQL and transactional semantics when prudent, reduce requirements for backward compatibility and increase the motivation for aggressive redesign.

The sheer scale of cloud computing involves its own challenges. Today’s SQL databases were designed in an era of relatively reliable hardware and intensive human administration; as a result, they do not scale effectively to thousands of nodes being deployed in a massively shared infrastructure. On the storage front, it is unclear whether these limitations should be addressed with different transactional implementation techniques, different storage semantics, or both simultaneously. The database literature is rich in proposals on these issues. Cloud services have begun to explore simple pragmatic approaches, but more work is needed to synthesize ideas from the literature in modern cloud computing regimes. In terms of query processing and optimization, it will not be feasible to exhaustively search a domain that considers thousands of processing sites, so some limitations on either the domain or the search will be required. Finally, it is unclear how programmers will express their programs in the cloud, as discussed earlier.

The sharing of physical resources in a cloud infrastructure puts a premium on data security and privacy that cannot be guaranteed by physical boundaries of machines or networks. Hence cloud services are fertile ground for efforts to synthesize and accelerate the work the database community has done in these areas. The key to success is to specifically target usage scenarios in the cloud, seated in practical economic incentives for service providers and customers.

As cloud data services become popular, new scenarios will emerge with their own challenges. For example, we
variations require revisiting traditional anticipate specialized services that are
roles and responsibilities for resource pre-loaded with large data sets (such as


stock prices, weather history, and Web crawls). The ability to “mash up” interesting data from private and public domains will be increasingly attractive and provide further motivation for the challenges discussed earlier concerning the interplay of structured and unstructured data. The desire to mash up data also points to the inevitability of services reaching out across clouds, an issue already prevalent in scientific data “grids” that typically have large shared data servers at multiple sites, even within a single discipline. It also echoes, in the large, the standard proliferation of data sources in most enterprises. Federated cloud architectures will only add to these challenges.

Mobile applications and virtual worlds. This new class of applications, exemplified by mobile services and virtual worlds, is characterized by the need to manage massive amounts of diverse user-created data, synthesize it intelligently, and provide real-time services. The database community is beginning to understand the challenges faced by these applications, but much more work is needed. Accordingly, the discussion of these topics at the meeting was more speculative than the discussion of the earlier topics, but the topics still deserve attention.

Two important trends are changing the nature of the field. First, the platforms on which mobile applications are built (hardware, software, and network) have attracted large user bases and ubiquitously support powerful interactions “on the go.” Second, mobile search and social networks suggest an exciting new set of mobile applications that can deliver timely information (and advertisements) to mobile users depending on location, personal preferences, social circles, and extraneous factors (such as weather), as well as the context in which they operate. Providing these services requires synthesizing user input and behavior from multiple sources to determine user location and intent.

The popularity of virtual worlds like Second Life has grown quickly and in many ways mirrors the themes of mobile applications. While they began as interactive simulations for multiple users, they increasingly blur the distinctions with the real world and suggest the potential for a more data-rich mix. The term “co-space” is sometimes used to refer to a coexisting space for both virtual and physical worlds. In it, locations and events in the physical world are captured by a large number of sensors and mobile devices and materialized within a virtual world. Correspondingly, certain actions or events within the virtual world affect the physical world (such as shopping, product promotion, and experiential computer gaming). Applications of co-space include rich social networking, massive multi-player games, military training, edutainment, and knowledge sharing.

In both areas, large amounts of data flow from users and get synthesized and used to affect the virtual and/or real world. These applications raise new challenges, including how to process heterogeneous data streams in order to materialize real-world events, how to balance privacy against the collective benefit of sharing personal real-time information, and how to apply more intelligent processing to send interesting events in the co-space to someone in the physical world.

The programming of virtual actors in games and virtual worlds requires large-scale parallel programming; declarative methods have been proposed as a solution in this environment, as discussed earlier. These applications also require development of efficient systems, as suggested earlier in the context of database engines, including appropriate storage and retrieval methods, data-processing engines, parallel and distributed architectures, and power-sensitive software techniques for managing the events and communications across large numbers of concurrent users.

Moving Forward

The 2008 Claremont meeting also involved discussions on the database research community’s processes, including organization of publication procedures, research agendas, attraction and mentorship of new talent, and efforts to ensure a benefit from the research on practice and toward furthering our understanding of the field. Some of the trends seen in database research are echoed in other areas of computer science. Whether or not they are, the discussion may be of broader interest in the field.


Prior to the meeting, a team led by one of the participants performed a bit of ad hoc data analysis over database conference bibliographies from the DBLP repository (dblp.uni-trier.de). While the effort was not scientific, the results indicated that the database research community has doubled in size over the past decade, as suggested by several metrics: number of published papers, number of distinct authors, number of distinct institutions to which these authors belong, and number of session topics at conferences, loosely defined. This served as a backdrop to the discussion that followed. An open question is whether this phenomenon is emerging at larger scales, in computer science and in science in general. If so, it may be useful to discuss the management of growth at those larger scales.

The growth of the database community puts pressure on the content and processes of database research publications. In terms of content, the increasingly technical scope of the community makes it difficult for individual researchers to keep track of the field. As a result, survey articles and tutorials are increasingly important to the community. These efforts should be encouraged informally within the community, as well as via professional incentive structures (such as academic tenure and promotion in industrial labs). In terms of processes, the reviewing load for papers is increasingly burdensome, and there was a perception at the Claremont meeting that the quality of reviews had been decreasing. It was suggested at the meeting that the lack of face-to-face program-committee meetings in recent years has exacerbated the problem of poor reviews and removed opportunities for risky or speculative papers to be championed effectively over well-executed but more pedestrian work.

There was some discussion at the meeting about recent efforts, notably by ACM-SIGMOD and VLDB, to enhance the professionalism of papers and the reviewing process via such mechanisms as double-blind reviewing and techniques to encourage experimental repeatability. Many participants were skeptical that the efforts to date have contributed to long-term research quality, as measured in intellectual and practical relevance. At the same time, it was acknowledged that the database community’s growth increases the need for clear and clearly enforced processes for scientific publication. The challenge going forward is to find policies that simultaneously reward big ideas and risk-taking while providing clear and fair rules for achieving these rewards. The publication venues would do well to focus as much energy on processes to encourage relevance and innovation as they do on processes to encourage rigor and discipline.

In addition to tuning the mainstream publication venues, there is an opportunity to take advantage of other channels of communication. For example, the database research community has had little presence in the relatively active market for technical books. Given the growing population of developers working with big data sets, there is a need for accessible books on scalable data-management algorithms and techniques that programmers can use to build software. The current crop of college textbooks is not targeted at this market. There is also an opportunity to present database research contributions as big ideas in their own right, targeted at intellectually curious readers outside the specialty. In addition to books, electronic media (such as blogs and wikis) can complement technical papers by opening up different stages of the research life cycle to discussion, including status reports on ongoing projects, concise presentation of big ideas, vision statements, and speculation. Online fora can also spur debate and discussion if appropriately provocative. Electronic media underscore the modern reality that it is easy to be widely published but much more difficult to be widely read. This point should be reflected in the mainstream publication context, as well as by authors and reviewers. In the end, the consumers of an idea define its value.

Given the growth in the database research community, the time is ripe for ambitious projects to stimulate collaboration and cross-fertilization of ideas. One proposal is to foster more data-driven research by building a globally shared collection of structured data, accepting contributions from all parties. Unlike previous efforts in this vein, the collection should not be designed for any particular benchmark; in fact, it is likely that most of the interesting problems suggested by this data are as yet unidentified.

There was also discussion at the meeting of the role of open source software development in the database community. Despite a tradition of open source software, academic database researchers have only rarely reused or shared software. Given the current climate, it might be useful to move more aggressively toward sharing software and collaborating on software projects across institutions. Information integration was mentioned as an area in which such an effort is emerging.

Finally, interest was expressed in technical competitions akin to the Netflix Prize (www.netflixprize.com) and KDD Cup (www.sigkdd.org/kddcup/index.php) competitions. To kick off this effort in the database domain, meeting participants identified two promising areas for competitions: system components for cloud computing (likely measured in terms of efficiency) and large-scale information extraction (likely measured in terms of accuracy and efficiency). While it was noted that each of these proposals requires a great deal of time and care to realize, several participants volunteered to initiate efforts. That work has begun with the 2009 SIGMOD Programming Contest (db.csail.mit.edu/sigmod09contest).

References

1. Abiteboul, S. et al. The Lowell database research self assessment. Commun. ACM 48, 5 (May 2005), 111–118.
2. Austin, I. I.B.M. acquires Cognos, maker of business software, for $4.9 billion. New York Times (Nov. 11, 2007).
3. Bernstein, P.A. et al. The Asilomar report on database research. SIGMOD Record 27, 4 (Dec. 1998), 74–80.
4. Bernstein, P.A. et al. Future directions in DBMS research: The Laguna Beach participants. SIGMOD Record 18, 1 (Mar. 1989), 17–26.
5. Silberschatz, A. and Zdonik, S. Strategic directions in database systems: Breaking out of the box. ACM Computing Surveys 28, 4 (Dec. 1996), 764–778.
6. Silberschatz, A., Stonebraker, M., and Ullman, J.D. Database research: Achievements and opportunities into the 21st century. SIGMOD Record 25, 1 (Mar. 1996), 52–63.
7. Silberschatz, A., Stonebraker, M., and Ullman, J.D. Database systems: Achievements and opportunities. Commun. ACM 34, 10 (Oct. 1991), 110–120.

Correspondence regarding this article should be addressed to Joseph M. Hellerstein (hellerstein@cs.berkeley.edu).

© 2009 ACM 0001-0782/09/0600 $10.00
