

Avi Silberschatz, Michael Stonebraker, and Jeff Ullman, Editors



The history of database system research is one of exceptional productivity and startling economic impact. Barely 20 years old as a basic science research field, database research has fueled an information services industry estimated at $10 billion per year in the U.S. alone. Achievements in database research underpin fundamental advances in communications systems, transportation and logistics, financial management, knowledge-based systems, accessibility to scientific literature, and a host of other civilian and defense applications. They also serve as the foundation for considerable progress in basic science in various fields ranging from computing to biology.

As impressive as the accomplishments of basic database research have been, there is a growing awareness and concern that only the surface has been scratched in developing an understanding of the database principles and techniques required to support the advanced information management applications that are expected to revolutionize industrialized economies early in the next century. Rapid advances in areas such as manufacturing science, robotics, scientific visualization, optical storage, and high-speed communications already threaten to overwhelm the existing substrate of database theory and practice.

In February 1990, the National Science Foundation convened a workshop in Palo Alto, California for the purpose of identifying the technology pull factors that will serve as forcing functions for advanced database technology and the corresponding basic research needed to enable that technology.[1] The participants included representatives from the academic and industrial sides of the database research community. The primary conclusions of the workshop participants can be summarized as follows:

1. Next-generation database applications will have little in common with today's business data processing databases. They will involve much more data, require new capabilities including type extensions, multimedia support, complex objects, rule processing, and archival storage, and will necessitate rethinking the algorithms for almost all DBMS operations.

2. The cooperation among different organizations on common scientific, engineering, and commercial problems will require large-scale, heterogeneous, distributed databases. Very difficult problems lie ahead in the areas of inconsistent databases, security, and massive scale-up of distributed DBMS technology.

3. A substantial number of the advanced technologies that will underpin industrialized economies in the early 21st century depend on radically new database technologies that are currently not well understood, and that require intensive and sustained basic research.

This article provides further information about these topics, as well as a brief description of some of the important achievements of the database research community.

Background and Scope

The database research community has been in existence since the late 1960s. Starting with modest representation, mostly in industrial research laboratories, it has expanded dramatically over the last two decades to include substantial efforts at major universities, government laboratories, and research consortia. Initially, database research centered on the management of data in business applications such as automated banking, record keeping, financial management, and reservation systems. These applications have four requirements that characterize database systems:

• Efficiency in the access to and modification of very large amounts of data;
• Resilience, or the ability of the data to survive hardware crashes and software errors, without sustaining loss or becoming inconsistent;
• Access control, including simultaneous access of data by multiple users in a consistent manner and assuring only authorized access to information; and
• Persistence, the maintenance of data over long periods of time, independent of any programs that access the data.

Database systems research has centered around methods for designing systems with these characteristics, and around the languages and conceptual tools that help users to access, manipulate, and design databases.

[1] The workshop was attended by Michael Brodie, Peter Buneman, Mike Carey, Ashok Chandra, Ron Fagin, Hector Garcia-Molina, Jim Gray, Dave Lomet, Dave Maier, Marie Ann Niemat, Avi Silberschatz, Michael Stonebraker, Irv Traiger, Jeff Ullman, Gio Wiederhold, Carlo Zaniolo, and Maria Zemankova.

Database management systems (DBMSs) are now used in almost every computing environment to organize, create and maintain important collections of information. The technology that makes these systems possible is the direct result of a successful program of database research. This article will highlight some important achievements of the database research community over the past two decades, including the scope and significance of the technological transfer of database research results to industry. Today, we stand at the threshold of applying database technology in a variety of new and important directions, including scientific databases, design databases, and universal access to information. Later in this article we pinpoint two key areas in which research will make a significant impact in the next few years: next-generation database applications and heterogeneous, distributed databases.

Accomplishments of the Last Two Decades

From among the various directions that the database research community has explored, the following three have perhaps had the most impact: relational database systems, transaction management, and distributed database systems. Each has fundamentally affected users of database systems, offering either radical simplifications in dealing with data, or great enhancement of their capability to manage information. We focus here on the major accomplishments of relational databases, transaction management, and distributed databases.

Relational Databases

In 1970 there were two popular approaches used to construct database management systems. The first approach, exemplified by IBM's Information Management System (IMS), has a data model (a mathematical abstraction of data and operations on data) that requires all data records to be assembled into a collection of trees; that is, some records are root records and all others have unique parent records. A navigational query language was defined, by which an application programmer could navigate from root records to the records of interest, accessing one record at a time. The second approach was typified by the standards of the Conference on Data Systems Languages (CODASYL), which suggested that the collection of DBMS records be arranged into a directed graph, with a navigational language by which an application program could move from a specific entry point record to desired information.

Both the tree-based (called hierarchical) and graph-based (network) approaches to data management have several fundamental disadvantages. Consider the following examples:

1. To answer a specific database request, an application programmer, skilled in performing disk-oriented optimization, must write a complex program to navigate through the database.

2. When the structure of the database changes, as it will whenever new kinds of information are added, application programs usually need to be rewritten.

Consequently, the database systems of 1970 were costly to use because of the low-level interface between the application program and the DBMS, and because the dynamic nature of user data mandates continued program maintenance. For example, the company president cannot, at short notice, receive a response to the query "How many employees in the Widget department will retire in the next three years?" unless a program exists to count departmental retirees.

The relational data model, pioneered by E. F. Codd in a series of papers in 1970-72, offered a fundamentally different approach to data storage. Codd suggested that conceptually all data be represented by simple tabular data structures (relations), and that users access data through a high-level, nonprocedural (or declarative) query language. Instead of writing an algorithm to obtain desired records one at a time, the application programmer is only required to specify a predicate that identifies the desired records or combination of records. A query optimizer in the DBMS translates the predicate specification into an algorithm to perform database access to solve the query. These nonprocedural languages are dramatically easier to use than the navigation languages of IMS and CODASYL; they lead to higher programmer productivity and facilitate direct database access by end users.
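
To make the contrast concrete, the following minimal sketch (in Python, with data layouts and a stand-in predicate invented purely for illustration) contrasts record-at-a-time navigation with a declarative specification whose access path is left entirely to the system:

```python
from datetime import date

# Flat relational view of the data, used by the declarative example below.
employees = [
    {"name": "Ada",  "dept": "Widget", "retire_on": date(1993, 6, 1)},
    {"name": "Bert", "dept": "Widget", "retire_on": date(2001, 1, 1)},
]

# Hierarchical layout: a department root record owns employee child records.
departments = {"Widget": {"employees": employees}}

def retiring_soon_navigational(cutoff):
    """IMS/CODASYL flavor: walk the tree one record at a time."""
    count = 0
    dept = departments["Widget"]          # enter at a root record
    for emp in dept["employees"]:         # follow parent-to-child links
        if emp["retire_on"] <= cutoff:
            count += 1
    return count

# Relational flavor: state *what* is wanted; an optimizer (here just a scan)
# is free to choose *how* -- index lookup, sequential scan, and so on.
predicate = lambda e: e["dept"] == "Widget" and e["retire_on"] <= date(1994, 10, 1)
declarative_answer = sum(1 for e in employees if predicate(e))

print(retiring_soon_navigational(date(1994, 10, 1)), declarative_answer)  # 1 1
```

The equivalent declarative query in a relational language simply names the relation and the predicate; nothing in that form commits the system to a particular access path.
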

During the 1970s the database research community extensively investigated the relational DBMS concept. They:

• Invented high-level relational query languages to ease the use of the DBMS by both end users and application programmers.
• Developed the theory and algorithms necessary to optimize queries, that is, to translate queries in the high-level relational query languages into plans that are as efficient as what a skilled programmer would have written using one of the earlier DBMSs for accessing the data. This technology probably represents the most successful experiment in optimization of very high-level languages among all varieties of computer systems.
• Formulated a theory of normalization to help with database design by eliminating redundancy and certain logical anomalies from the data.
• Constructed indexing techniques to provide fast associative access to random single records and/or sets of records specified by values or value ranges for one or more attributes.
• Constructed algorithms to allocate tuples of relations to pages (blocks of records) in files on secondary storage, to minimize the average cost of accessing those tuples.
• Constructed buffer management algorithms to exploit knowledge of access patterns for moving pages back and forth between disk and a main memory buffer pool.
• Implemented prototype relational DBMSs that formed the nucleus for many of the present commercial relational DBMSs.

As a result of this research, numerous commercial products based on the relational concept appeared in the 1980s. Today, commercial relational database systems are available on virtually any hardware platform from personal computer to mainframe, and are standard software on all new computers in the 1990s. Not only were the ideas identified by the research community picked up and used by the vendors, but also, several of the commercial developments were led by implementors of the earlier research prototypes.

There is a moral to be learned from the success of relational database systems. When the relational data model was first proposed, it was regarded as an elegant theoretical construct but implementable only as a toy. It was only with considerable research, much of it focused on basic principles of relational databases, that large-scale implementations were made possible. The next generation of databases calls for continued research into the foundations of database systems, in the expectation that other such useful "toys" will emerge.

Transaction Management

During the last two decades, database researchers have also pioneered the transaction concept. A transaction is a sequence of operations that must appear "atomic" when executed: a transaction must execute in its entirety or not at all. For example, when a bank customer moves $100 from account A to account B, the database system must ensure that either both of the operations, Debit A and Credit B, happen or that neither happens (and the customer is informed). If only the first occurs, then the customer has lost $100, and an inconsistent database state results. This guarantee must be kept even when a system crash loses the contents of main memory.

To guarantee that a transaction transforms the database from one consistent state to another requires that:

1. Concurrent transactions in the system must be synchronized correctly in order to guarantee that consistency is preserved. The concurrent execution of transactions must be such that each transaction appears to execute in isolation. For instance, while we are moving $100 from A to B, a simultaneous movement of $300 from account B to account C should result in a net deduction of $200 from B. Concurrency control is the technique used to provide this assurance.

2. System failures, either of hardware or software, must not result in inconsistent database states. We must guarantee that either all the effects of a transaction are installed in the database, or that none of them are. Recovery is the technique used to provide this assurance.

During the 1970s and early 1980s the DBMS research community worked extensively on the transaction model. The view of correct synchronization of transactions is that they must be serializable; that is, the effect on the database of any number of transactions executing in parallel must be the same as if they were executed one after another, in some order. The theory of serializability was worked out in detail, and precise definitions of the correctness of schedulers (algorithms for deciding when transactions could execute) were produced. Numerous concurrency control algorithms were invented that ensure serializability. These included algorithms based on:

• Locking data items to prohibit conflicting accesses. Especially important is a technique called two-phase locking, which guarantees serializability by requiring a transaction to obtain all the locks it will ever need before releasing any locks.
• Timestamping accesses so the system could prevent violations of serializability.
• Keeping multiple versions of data objects available.

The various algorithms were subjected to rigorous experimental studies and theoretical analysis to determine the conditions under which each was preferred.

Recovery is the other essential component of transaction management. During the late 1970s and early 1980s, two major approaches to this service were investigated:

1. Write-ahead logging. A summary of the effects of a transaction is stored in a sequential file, called a log, before the changes are installed in the database itself. The log is kept on disk or tape where it can survive system crashes and power failures. When a transaction completes, the logged changes are then posted to the database; if a transaction fails to complete, the log is used to restore the prior database state.

2. Shadow-file techniques. New copies of entire data items, usually disk pages, are created to reflect the effects of a transaction and are written to the disk in entirely new locations. When the transaction completes, a single atomic action remaps the data pages, so as to substitute the new versions for the old. If a transaction fails, the new versions are discarded.

Recovery techniques have been extended to cope with the failure of the stable medium as well. A backup copy of the data is stored on an entirely separate device; then, with logging, the log can be used to roll forward the backup copy to the current state.
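
As a rough illustration of the atomicity guarantee and of the undo side of write-ahead logging described above, the following Python sketch (all names are invented for this example; a real recovery manager forces log records to stable storage before updating the database and also records redo information) logs the old value before applying each change and rolls back if any step fails:

```python
# Minimal sketch of atomic execution with an undo log (write-ahead-logging flavor).
# Purely illustrative: a real DBMS writes the log to stable storage *before*
# installing changes and coordinates with concurrency control.

accounts = {"A": 500, "B": 200}

class TransactionAborted(Exception):
    pass

def transfer(src, dst, amount):
    log = []                               # undo records: (account, old_value)
    try:
        log.append((src, accounts[src]))   # log the old value first
        if accounts[src] < amount:
            raise TransactionAborted("insufficient funds")
        accounts[src] -= amount            # Debit src

        log.append((dst, accounts[dst]))
        accounts[dst] += amount            # Credit dst
        # Commit point: a real system forces a commit record to the log here.
    except Exception:
        for account, old_value in reversed(log):
            accounts[account] = old_value  # undo, restoring the prior state
        raise

transfer("A", "B", 100)
print(accounts)   # {'A': 400, 'B': 300} -- both operations happened, or neither
```
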

Distributed Databases

A third area in which the DBMS research community played a vital role is distributed databases. In the late 1970s there was a realization that organizations are fundamentally decentralized and require databases at multiple sites. For example, information about the California customers of a company might be stored on a machine in Los Angeles, while data about the New England customers could exist on a machine in Boston. Such data distribution moves the data closer to the people who are responsible for it and reduces remote communication costs. Moreover, the decentralized system is more likely to be available when crashes occur: if a single, central site goes down, all data is unavailable, whereas if one of several regional sites goes down, only part of the total database is inaccessible. Furthermore, if the company chooses to pay the cost of multiple copies of important data, then a single site failure need not cause data inaccessibility.

In a multidatabase environment we strive to provide location transparency. That is, all data should appear to the user as if they are located at his or her particular site, and the user should be able to execute normal transactions against such data. In the early 1980s the research community rose to this challenge. Providing location transparency required the DBMS research community to investigate new algorithms for distributed query optimization, concurrency control, crash recovery, and support of multiple copies of data objects for higher performance and availability. Technology for optimizing distributed queries was developed, along with new algorithms to perform the basic operations on data in a distributed environment. Distributed concurrency control algorithms were designed, implemented and tested, and simulation studies and analysis compared the candidates to see which algorithms were dominant. Various algorithms for the update of multiple copies of a data item were invented; these ensure that all copies of each item are consistent. The fundamental notion of a two-phase commit to ensure the possibility of crash recovery in a distributed database was discovered, algorithms were designed to recover from processor and communication failures, and data patch schemes were put forward to rejoin distributed databases that had been forced to operate independently after a network failure.

Again we see the same pattern discussed earlier for relational databases and transactions: rapid transfer of research results into products. All the major DBMS vendors are presently commercializing distributed DBMS technology.

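
The two-phase commit protocol mentioned above can be sketched in a few lines. The sketch below is schematic Python (the Participant class and site names are invented for this illustration; a real protocol also forces each decision to a log on stable storage so that sites can recover their state after a crash):

```python
# Schematic two-phase commit coordinator. Everything here is hypothetical;
# real implementations log PREPARE/COMMIT/ABORT records durably at each step.

class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.state = "active"

    def prepare(self):
        # Phase 1: the site votes; after voting "yes" it must be able to
        # commit even if it crashes and restarts.
        self.state = "prepared" if self.will_succeed else "aborted"
        return self.will_succeed

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes from every site.
    if all(p.prepare() for p in participants):
        # Phase 2: every site voted yes, so the global decision is COMMIT.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote (or a timeout, not modeled here) forces a global ABORT.
    for p in participants:
        p.abort()
    return "aborted"

sites = [Participant("Los Angeles"), Participant("Boston", will_succeed=False)]
print(two_phase_commit(sites), [p.state for p in sites])
# -> aborted ['aborted', 'aborted']
```
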
The Next Challenges

Some might argue that database systems are a mature technology and it is therefore time to refocus research onto other topics. Certainly relational DBMSs and transaction management, both centralized and distributed, are well studied, and commercialization is well along. Object management ideas, following the philosophy of object-oriented programming, have been extensively investigated over the last few years and should allow more general kinds of data elements to be placed in databases than the numbers and character strings supported in traditional systems. Moreover, the relentless pace of advances in hardware technology makes CPUs, memory and disks drastically cheaper each year, so current databases will become progressively cheaper to deploy as the 1990s unfold. Perhaps the DBMS area should be declared solved, and energy and research efforts allocated elsewhere.

We argue strongly here that such a turn of events would be a serious mistake. Rather, we claim that solutions to the important database problems of the year 2000 and beyond are not known, and hardware advances of the next decade will not make brute force solutions economical, because the scale of the prospective applications is simply too great. In addition to being important intellectual challenges in their own right, their solutions offer products and technology of great social and economic importance, including improved delivery of medical care, greater per capita productivity through increased personal access to information, enhanced tools for scientists, advanced design and manufacturing systems, and new military applications.

In this section we highlight two key areas in which we feel important research contributions are required in order to make future DBMS applications viable: next-generation database applications and heterogeneous, distributed databases.

The Research Agenda for Next-Generation DBMS Applications

To motivate the discussion of research problems that follows, in this section we present several examples of the kinds of database applications that we expect will be built during the next decade.

1. For many years, NASA scientists have been collecting vast amounts of information from space. They estimate that they require storage for 10^16 bytes of data (about 10,000 optical disk jukeboxes) just to maintain a few years' worth of satellite image data they will collect in the 1990s. Moreover, they are very reluctant to throw anything away, lest it be exactly the data set needed by a future scientist to test some hypothesis. This application taxes the capacity of available disk systems, and it is unclear how this database can be stored and searched for relevant images using current or soon-to-be available technology.

2. The National Institutes of Health (NIH) and the U.S. Department of Energy (DOE) have embarked on a joint national initiative to construct the DNA sequence corresponding to the human genome. The gene sequence is several billion elements long, and its many subsequences define complex and variable objects. The matching of individuals' medical problems to differences in genetic makeup is a staggering problem and will require new technologies of data representation and search.

3. Several large department stores already record every product-code-scanning action of every cashier in every store in their chain. Buyers run ad-hoc queries on this historical database in an attempt to discover buying patterns and make stocking decisions. Moreover, as the cost of disk space declines, the retail chain will keep a larger and larger history to track buying habits more accurately. This process of "mining" data for hidden patterns is not limited to commercial applications; we foresee similar applications, often with even larger databases, in science, medicine, intelligence gathering, and many other areas.

4. Databases serve as the backbone of computer-aided design systems. For example, civil engineers envision a facilities-engineering design system that manages all information about a project, such as a skyscraper. This database must maintain and integrate information about the project from the viewpoints of hundreds of subcontractors. For example, when an electrician puts a hole in a beam to let a wire through, the load-bearing soundness of the structure could be compromised. The design system should, ideally, recalculate the stresses, or at the least, warn the cognizant engineer that a problem may exist.

5. Most insurance firms have a substantial on-line database that records the policy coverage of the firm's customers. These databases will soon be enhanced with multimedia data such as photographs of property damaged, digitized images of handwritten claim forms, audio transcripts of appraisers' evaluations, images of specially insured objects, and so on. Future systems may well store video walk-throughs of houses in conjunction with a homeowners policy, further enlarging the size of this class of databases.

These applications have in common the property that they will push the limits of available technology for the foreseeable future. As computing resources become cheaper, these problems are all likely to expand at the same or at a faster rate; hence, they cannot be overcome simply by waiting for the technology to bring computing costs down to an acceptable level. Moreover, applications of this type are not limited to commercial enterprises. These applications not only introduce problems of size; they also introduce problems with respect to all conventional aspects of DBMS technology (e.g., transactions, concurrency control, and data representation), and they pose fundamentally new requirements for access patterns. We now turn to the research problems that must be solved to make such next-generation applications work. Next-generation applications require new services in several different areas in order to succeed.

New Kinds of Data

Many next-generation applications entail storing large and internally complex objects. The insurance example (5) requires storage of images; a database for software engineering might store program statements, and a chemical database might store protein structures. Scientific and design databases often deal with very large arrays or sequences of data elements, such as the gene sequence described earlier. Since such data is exceedingly large, extended storage structures and indexing techniques will be needed to support efficient processing, and a new generation of query languages will be required to support querying of array and sequence data, as will mechanisms for easily manipulating disk and archive representations of such objects.

We need solutions to two classes of problems: data access and data type management. Current databases are optimized for delivering small records to an application program. When fields in a record become very large, this paradigm breaks down. Protocols must be designed to chunk large objects into manageable size pieces for the application to process, and the DBMS should read a large object only once and place it directly at its final destination.
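
As a rough sketch of what such a chunking protocol might look like, the fragment below (Python; the stream interface, chunk size, and paths are invented for illustration) delivers a large stored object to the application in fixed-size pieces and writes each piece directly to its final destination rather than materializing the whole object in memory:

```python
# Hypothetical chunked-transfer protocol for large objects. The stored object
# is modeled as a local file; a real DBMS would fetch pages from disk or from
# a tertiary-store archive.

CHUNK_SIZE = 64 * 1024  # 64 KB pieces; the right size is system-dependent

def read_large_object(stored_object_path, chunk_size=CHUNK_SIZE):
    """Yield a large object one manageable piece at a time."""
    with open(stored_object_path, "rb") as obj:
        while True:
            piece = obj.read(chunk_size)
            if not piece:
                break
            yield piece

def copy_to_destination(stored_object_path, destination_path):
    """Stream an object to its final destination without buffering it whole."""
    with open(destination_path, "wb") as dst:
        for piece in read_large_object(stored_object_path):
            dst.write(piece)   # each piece lands directly where it is needed

# Example usage (paths are placeholders):
# copy_to_destination("/dbms/blobs/claim_form_0001.img", "/app/work/claim.img")
```
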

A second class of problems concerns type management. There must be a way for the programmer to construct the types appropriate for his or her application, and the need for more flexible type systems has been one of the major forces in the development of object-oriented databases. One of the drawbacks of the systems developed so far is that type-checking is largely dynamic, performed during execution and not during compilation, which lays open the possibility that programming errors tend to show up at run time. In order to provide the database application designer with the same safety nets that are provided by modern high-level programming languages, we must determine how we can combine static type disciplines with persistent data and evolution of the database structure over time.

Rule Processing

Next-generation applications will frequently involve a large number of rules, which take declarative ("if A is true, then B is true") and imperative ("if A is true, then do C") forms. Such rules may include elaborate constraints that the designer wants enforced, triggered actions that require processing when specific events take place, and complex deductions that should be made automatically within the system. It is common to call such systems knowledge-base systems, although we prefer to view them as a natural, although difficult, extension of DBMS technology. For example, a design database should notify the proper designer if a modification by a second designer may have affected the subsystem that is the responsibility of the first designer.

Rules have received considerable attention as the mechanism for triggering, data mining (as discussed in the department store example), and other forms of reasoning about data. Declarative rules are advantageous because they provide a logical declaration of what the user wants rather than a detailed specification of how the results are to be obtained. The value of declarativeness in relational query languages like SQL (the most common such language) has been amply demonstrated, and an extension of the idea to the next generation of query languages is desirable. Similarly, rules allow for a declarative specification of the conditions under which a certain action is to be taken.

Traditionally, rule processing has been performed by separate subsystems, usually called expert system shells. However, applications such as the notification example cannot be done efficiently by a separate subsystem; such rule processing must be performed directly by the DBMS. Research is needed on how to specify the rules and on how to process a large rule base efficiently. Although considerable effort has been directed at these topics by the artificial intelligence (AI) community, the focus has been on approaches, such as RETE networks, that assume all relevant data structures are in main memory. Next-generation applications are far too big to be amenable to such techniques.

We also need tools that will allow us to validate and debug very large collections of rules. In a large system, the addition of a single rule can easily introduce an inconsistency in the knowledge base or cause chaotic and unexpected effects, and can even end up repeatedly firing itself. We need techniques to decompose sets of rules into manageable components and prevent (or control in a useful way) such inconsistencies and repeated rule firing.
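
A minimal sketch of the kind of triggered action discussed above is given below (Python; the tiny rule engine, table names, and designers are all invented for this illustration; a production DBMS would evaluate such rules inside the engine, close to the data):

```python
# Toy rule engine: imperative rules of the form "if <condition> then <action>",
# fired whenever a change is installed. All names are hypothetical.

subsystem_owner = {"electrical": "alice", "structural": "bob"}
notifications = []

def notify(designer, message):
    notifications.append((designer, message))

# Rule: a change by one designer triggers notification of the designer
# responsible for every other subsystem the changed component touches.
rules = [
    {
        "condition": lambda change: bool(change["affects"]),
        "action": lambda change: [
            notify(subsystem_owner[s],
                   f"{change['author']} modified {change['component']}")
            for s in change["affects"]
            if subsystem_owner[s] != change["author"]
        ],
    },
]

def apply_change(change):
    """Install an update, then fire every rule whose condition it satisfies."""
    for rule in rules:
        if rule["condition"](change):
            rule["action"](change)

apply_change({"author": "alice", "component": "beam-17 wiring hole",
              "affects": ["electrical", "structural"]})
print(notifications)   # [('bob', 'alice modified beam-17 wiring hole')]
```
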
New Concepts in Data Models

Many of the new applications will involve primitive concepts not found in most current applications. Issues range from efficiency of implementation to the fundamental theory underlying important primitives, and there is a need to build these primitives cleanly into specialized or extended query languages. For example, we need to consider:

Spatial Data. Many scientific databases have two- or three-dimensional points, lines, and polygons as data elements. A typical search is to find the 10 closest neighbors to some given data element. Solving such queries will require sophisticated, new multidimensional access methods. There has been substantial research in this area, but most has been oriented toward main memory data structures, such as quad trees and segment trees; the disk-oriented structures, including K-D-B trees and R-trees, do not perform particularly well when given real-world data. (A brute-force version of such a nearest-neighbor search is sketched at the end of this subsection.)

Time. Engineers, shopkeepers, and physicists all require different notions of time. In many exploratory applications, one might wish to retrieve and explore the database state as of some point in the past, or to retrieve the time history of a particular data value. No support for an algebra over time exists in any current commercial DBMS, and although research prototypes and special-purpose systems have been built, there is not even an agreement across systems on what a "time interval" is: is it discrete or continuous, open-ended or closed?

Uncertainty. There are applications, such as identification of features from satellite photographs, for which we need to attach a likelihood that data represents a certain phenomenon. Reasoning under uncertainty, especially when a conclusion must be derived from several interrelated partial or alternative results, is a problem that the AI community has addressed for many years, with only modest success. The challenges posed by next-generation database systems will force computer scientists to confront these issues, as we must learn not only to cope with data of limited reliability, but to do so efficiently, with massive amounts of data.
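
The following brute-force sketch (Python; the point set and distance metric are invented for illustration) shows the semantics of a 10-nearest-neighbors query; the research problem described above is to answer it without examining every element, using disk-oriented multidimensional indexes rather than the linear scan shown here:

```python
# Brute-force k-nearest-neighbor search over 2-D points. A spatial DBMS would
# answer this with a multidimensional index (e.g., an R-tree variant) instead
# of scanning every element.

import heapq
import random

def k_nearest(points, query, k=10):
    """Return the k points closest to `query` under squared Euclidean distance."""
    def dist2(p):
        return (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    return heapq.nsmallest(k, points, key=dist2)   # O(n log k) linear scan

random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10_000)]
print(k_nearest(points, query=(50.0, 50.0), k=10))
```
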

Scaling Up

It will be necessary to scale all DBMS algorithms to operate effectively on databases of the size contemplated by next-generation applications, often several orders of magnitude bigger than the largest databases found today. Databases larger than a terabyte (10^12 bytes) will not be unusual, and the current architecture of DBMSs will not scale to such magnitudes.

For example, current DBMSs build a new index on a relation by locking it, building the index, and then releasing the lock. Building an index for a 1-terabyte table may require several days of computing; hence, it is imperative that algorithms be designed to construct indexes incrementally, without making the table being indexed inaccessible and without taking the database off line. Similarly, a new approach to backup and recovery in very large databases must be found. In the event that a database is corrupted because of a head crash on a disk or for some other reason, the traditional algorithm is to restore the most recent dump from tape and then to roll the database forward to the present time using the database log. However, making a dump on tape of a 1-terabyte database will take days, and reading a 1-terabyte dump will take days as well, leading to unacceptably long recovery times. Dumping must obviously be done incrementally, while performing a single scan of the data.

Parallelism

Ad-hoc queries over the large databases contemplated by next-generation application designers will take a long time to process. If there are no indexes to speed the search, a sequential scan may be necessary, and a scan of a 1-terabyte table may take days; it is clearly unreasonable for a user to have to submit a query on Monday morning and then go home until Thursday when his answer will appear. The goal is to process a user's query with nearly linear speedup; that is, with a large number of CPUs available, the query is processed in time inversely proportional to the number of processors and disks allocated. To obtain linear speedup, the DBMS architecture must avoid bottlenecks, and the storage system must ensure that relevant data is spread over all disk drives.

Parallelizing a user command will allow it to be executed faster, but it will also use a larger fraction of the available computing resources, thereby penalizing the response time of other concurrent users and possibly causing the system to thrash. On the other hand, as many queries compete for limited resources, a different form of parallelism may be required, in which case the DBMS should evaluate as many queries as possible in parallel. Research on multiuser aspects of parallelism such as this one is in its infancy. Moreover, it remains a challenge to develop a realistic theory for data movement throughout the memory hierarchy of parallel computers. Further research is essential.
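
The intuition behind intra-query parallelism and near-linear speedup can be sketched as follows (Python; the partitioning scheme and the stand-in predicate are invented for illustration, and a real system would place each partition on its own disk and processor):

```python
# Sketch of a partitioned parallel scan: the relation is divided across
# "disks" and each partition is scanned by its own worker, so the elapsed
# time ideally shrinks in proportion to the number of workers.

from concurrent.futures import ProcessPoolExecutor

def scan_partition(partition):
    """Scan one partition and return a partial aggregate (here, a count)."""
    return sum(1 for row in partition if row % 7 == 0)   # stand-in predicate

def parallel_count(partitions):
    with ProcessPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = pool.map(scan_partition, partitions)
    return sum(partial_counts)     # combine the partial results

if __name__ == "__main__":
    relation = list(range(1_000_000))
    n_partitions = 8               # e.g., one partition per disk drive
    partitions = [relation[i::n_partitions] for i in range(n_partitions)]
    print(parallel_count(partitions))   # same answer as a sequential scan
```
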
For example, imagine a 1-terabyte database stored on a collection of disks, with a large number of CPUs available; the storage system must spread the data so that no single device becomes the bottleneck.

Tertiary Storage and Long-Duration Transactions

For the foreseeable future, ultra-large databases will require both secondary (disk) storage and the integration of an archive or tertiary store into the DBMS. All current commercial DBMSs require data to be either disk or main-memory resident. In a three-level system, disk storage can be used as a read or write cache for archive objects, and new algorithms will be needed to manage intelligently the buffering in such a system. Current archive devices have a very long latency period; hence, query optimizers must choose strategies that avoid frequent movement of data between storage media, particularly if the table in question is resident on an archive. Future systems will have to deal with the more complex issue of optimizing queries when a portion of the data to be accessed is in an archive. Moreover, the DBMS must also optimize the placement of data records on the archive to minimize subsequent retrieval times.

The traditional transaction model discussed earlier assumes that transactions are short, perhaps a fraction of a second. The next-generation applications often aim to facilitate collaborative and interactive access to a database, and transactions can take hours or days. For instance, in a facilities engineering database, a designer may lock a file for a day, during which it is redesigned. We need entirely new approaches to maintaining the integrity of data, sharing data, and recovering data when transactions are this long.

Versions and Configurations

Some next-generation applications need versions of objects to represent alternative or successive states of a single conceptual entity. For example, in a facilities engineering database, numerous revisions of the electrical plans will occur during the design, construction and maintenance of the building, and it may be necessary to keep all the revisions for accounting or legal reasons. Moreover, it is necessary to maintain consistent configurations, consisting of versions of related objects, such as the electrical plan, the heating plan, and the general and detailed architectural drawings. While there has been much discussion and many proposals for proper version and configuration models in different domains, little has been implemented. Much remains to be done in the creation of space-efficient algorithms for version management and techniques for ensuring the consistency of configurations.

Heterogeneous, Distributed Databases

There is now effectively one worldwide telephone system and one worldwide computer network, and visionaries in the field of computer networks speak of a single worldwide file system. Likewise, we should now begin to contemplate the existence of a single, worldwide database system from which users can obtain information on any topic covered by data made available by purveyors, and on which business can be transacted in a uniform way. While such an accomplishment is a generation away, we can and must begin now to develop the underlying technology, in collaboration with other nations. Indeed, there are a number of applications that are now becoming feasible and that will help drive the technology needed for worldwide interconnection of information:

• Collaborative efforts are under way in many physical science disciplines, entailing multiproject databases that cross organizational boundaries. The human genome project is one example of this phenomenon: the project has a database composed of portions assembled by each researcher, and a collaborative database results.

• An automobile company wishes to allow suppliers access to new car designs under consideration, so that suppliers can give early feedback on the cost of components. Such feedback will allow the most cost-effective car to be designed and manufactured. Obviously, this goal requires a database that spans multiple organizations, an intercompany database.

• A typical defense contractor has a collection of subcontractors assisting with portions of the contractor's project. The contractor wants to have a single project database that spans the portions of the project database administered by the contractor and each subcontractor.

These examples all concern the necessity of logically integrating databases from multiple organizations, often across company boundaries, into what appears to be a single database. The problem of making heterogeneous, distributed databases behave as if they formed part of a single database is often called interoperability. The databases involved are heterogeneous, in the sense that they do not normally share a complete set of common assumptions about the information with which they deal, and they are distributed, meaning that individual databases are under local control and are connected by relatively low-bandwidth links. Since the individual participant databases were never designed with the objective of interoperating with other databases, there is no single global schema to which all individual databases conform.

We now use two very simple examples to illustrate the problems that arise in this environment. First, consider a science program manager who wishes to find the total number of computer science Ph.D. students in the U.S. There are over 100 institutions that grant a Ph.D. degree in computer science, and we believe that all have an on-line student database that allows queries to be asked of its contents. To answer the user's query, the NSF program manager can, in theory, discover how to access all of these databases and then ask the correct local query at each site. Unfortunately, the sum of the responses to these 100+ local queries will not necessarily be the answer to his overall query. The basic problem is that these 100+ databases are semantically inconsistent. Some institutions record only full-time students; others record full- and part-time students. Some distinguish Ph.D. students from Masters candidates, and some do not. Some may erroneously omit certain classes of students, such as foreign students; some may mistakenly include students, such as electrical engineering candidates in an EECS department. To answer the overall query, it is necessary to treat each value obtained as evidence, and then to provide fusion of this evidence to form a best answer to the user's query.

A second problem is equally illustrative. Consider the possibility of an electronic version of a travel assistant, such as the Michelin Guide, which lists prices and quality ratings for restaurants and hotels. Most people traveling on vacation consult two or more such travel guides. One might want to ask the price of a room at a specific hotel, and each guide is likely to give a different answer: one might quote last year's price, while another might indicate the price with tax, and a third might quote the price including meals.

Incompleteness and Inconsistency

The Ph.D. student and travel advisor examples indicate the problems with semantic inconsistency and with data fusion. When multiple databases are combined, there are individual differences that must be addressed. These include differences in units: one database might give starting salaries for graduates in dollars per month while another records annual salaries. Similarly, the definition of a part-time student may be different in the different databases. Such differences result in composite answers that are semantically inconsistent. Worse still is the case in which a local database omits information, such as data on foreign students, and is therefore simply wrong. More seriously, it is highly unlikely that any automatic agent could do a trustworthy job of resolving all such discrepancies. However, we cannot simply give up. If an inconsistency is detected, or if missing information appears to invalidate a query, there must be some system for explaining to the user how the data arrived in that state and, in particular, from what databases it was derived. In other cases, it may be possible to filter out the offending data elements and still arrive at a meaningful query.

We must extend substantially the data model that is used by a DBMS to include much more semantic information about the meaning of the data in each database. With this information, it is possible to apply a conversion to obtain composite consistent answers; without it, composite answers remain semantically inconsistent. Research on extended data models is required to discover the form that this information should take. Future interoperability of databases will require dramatic progress to be made on these semantic issues. Moreover, some agreement regarding language and model standards is essential, and we need to do extensive experiments before standardization can be addressed.

Mediators

As the problems of fusion and semantic inconsistency are so severe, there is need for a class of information sources that stand between the user and the heterogeneous databases. In the Ph.D. student example there are 100+ disparate databases, each containing student information. It would make sense to create a "CS Ph.D. mediator" that could be queried as if it were a consistent, unified database containing the information that actually sits in the 100+ local databases of the CS departments. The user has a uniform query language that can be applied to any one of the individual databases or to some merged view of the collection of databases. Similarly, a travel adviser that provided the information obtained by fusing the various databases of travel guides could, if there were sufficient demand, be commercially viable. Perhaps most valuable of all would be a mediator that provided the information available in the world's libraries, or at least that portion of the libraries that is stored electronically.
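
To make the flavor of such a mediator concrete, here is a small sketch (Python; the source descriptions, units, and conversion rules are invented for illustration) that normalizes semantically inconsistent answers, here monthly versus annual salary figures, before fusing them into a single composite answer:

```python
# Toy mediator: each source reports starting salaries in its own units; the
# mediator applies per-source conversions recorded in its model and then fuses
# the evidence into one composite answer. All sources and figures are hypothetical.

sources = [
    {"name": "dept_A", "salary": 3500,  "units": "usd_per_month"},
    {"name": "dept_B", "salary": 45000, "units": "usd_per_year"},
    {"name": "dept_C", "salary": 41000, "units": "usd_per_year"},
]

to_annual = {
    "usd_per_year":  lambda x: x,
    "usd_per_month": lambda x: 12 * x,   # conversion held in the mediator's model
}

def fused_starting_salary(reports):
    """Convert every report to a common unit, then fuse (here, a simple mean)."""
    annual = [to_annual[r["units"]](r["salary"]) for r in reports]
    return sum(annual) / len(annual)

print(fused_starting_salary(sources))   # ~42666.67, in dollars per year
```
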
Browsing

Let us suppose that the problems of access have been solved in any one of the scenarios mentioned earlier. Mediators must be accessible by people who have not had a chance to study the details of their query language and data model. Thus, to properly support heterogeneous, distributed databases, we need to support browsing: the ability to interrogate the structure of the database and, in particular, to interrogate the nature of the process that merges data. Self-description of data is another important research problem that must be addressed if access to unfamiliar data is to become a reality. Further, a scientist working in an interdisciplinary problem domain must be able to discover the existence of relevant data sets collected in other disciplines.

Name Services

The NSF program manager must be able to consult a national name service to discover the location and name of the databases or mediators of interest. The mechanism by which items enter and leave such name servers and the organization of such systems is an open issue.

Security

Security is a major problem (failing) in current DBMSs. A corporation may want to make parts of its database accessible to certain parties, as did the automobile company mentioned earlier, which offered preliminary design information to potential suppliers; but it does not want any outsider accessing its salary data, and the automobile company certainly does not want the same designs accessed by its competitors. Database systems must be resistant to compromise by remote systems masquerading as authorized users, in order that distributed, heterogeneous database systems do not increase user uncertainty about the security and integrity of one's data. We foresee a need for mandatory security and research into the analysis of covert channels.

Authentication is the reliable identification of subjects making database accesses. A widely distributed system may have thousands or millions of users, and a given user may be identified differently on different systems; a heterogeneous, distributed database system will need to cope with a world of multiple authenticators of variable trustworthiness. Moreover, access permission

might be based on role (e.g., current company Treasurer) or access site, and whether an access is permitted may well be influenced by who is acting on a user's behalf. In addition, sites can act as intermediate agents for users, and data may pass through and be manipulated by these intervening sites; these intermediaries must be factored into the access decision. Current authorization systems will surely require substantial extensions to deal with these problems.

Site Scale-up

The security issue is just one element of scale-up, which must be addressed in a large distributed DBMS. Current distributed DBMS algorithms for query processing, concurrency control, and support of multiple copies were designed to function with a few sites, and they must all be rethought for 1,000 or 10,000 sites. For example, some query-processing algorithms expect to find the location of an object by searching all sites for it, and other algorithms expect all sites in the network to be operational. This approach is clearly impossible in a large network: with a very large number of sites, several sites will be down at any given time. Similarly, certain query-processing algorithms expect to optimize a join by considering all possible sites and choosing the one with the cheapest overall cost. In a 1,000- or 10,000-site network, a query optimizer that loops over all sites in this fashion is likely to spend more time trying to optimize the query than it would have spent in simply executing the query in a naive and expensive way.

Powerful desktop computers are cheap and frequently underutilized, and using them will frequently be the most responsive and least expensive way to execute a query. Local caching of remote data at the desktop will therefore become increasingly important, and efficient cache maintenance is an open problem. Finally, ensuring good user response time becomes increasingly difficult as the number of sites and the distances between them increase.

Transaction Management

Transaction management in a heterogeneous, distributed database system is a difficult issue. The main problem is that each of the local database management systems may be using a different type of concurrency control scheme. Integrating these is a challenging problem, made worse if we wish to preserve the local autonomy of each of the local databases and allow local and global transactions to execute in parallel. One simple solution is to restrict global transactions to retrieve-only access. However, the issue of reliable transaction management in the general case, where global and local transactions are allowed to both read and write data, is still open.

Conclusion

Database management systems are now used in almost every computing environment to organize, create and maintain important collections of information. The technology that makes these systems possible is the direct result of a successful program of database research. To date, the database industry has shown remarkable success in transforming scientific ideas into major products. We have highlighted some important achievements of the database research community over the past two decades, focusing on the major accomplishments of relational databases, transaction management, and distributed databases.

We have argued that next-generation database applications will little resemble current business data processing databases. They will have much larger data sets; they will require new capabilities such as type extensions, multimedia support, complex objects, rule processing, archival storage, and new query optimization techniques; and they will entail rethinking algorithms for almost all DBMS operations. Moreover, the cooperation between different organizations on common problems will require heterogeneous, distributed databases, which bring very difficult problems in the areas of querying semantically inconsistent databases, security, and scale-up of distributed DBMS technology to large numbers of sites. Thus, database systems research offers:

• A host of new intellectual challenges for computer scientists, and
• Resulting technology that will enable a broad spectrum of new applications in business, science, medicine, defense, and other areas.

It is crucial that advanced research be encouraged as the database community tackles the challenges ahead.

Acknowledgments

Avi Silberschatz and Jeff Ullman initiated the workshop and coordinated and edited the report. Mike Stonebraker provided the initial draft and much of the content for the final report. Postworkshop contributors to this report include Phil Bernstein, Won Kim, Hank Korth, and Andre van Tilborg. Hewlett-Packard Laboratories hosted the workshop, which was supported by NSF Grant IRI-89-19556. Any opinions, findings, conclusions, or recommendations expressed in this report are those of the panel and do not necessarily reflect the views of the National Science Foundation.