Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Standard view
Full view
of .
Save to My Library
Look up keyword
Like this
0 of .
Results for:
No results containing your search query
P. 1
A Relational Model of Data for Large Shared Data Banks

A Relational Model of Data for Large Shared Data Banks



|Views: 1,670 |Likes:
Published by Chris Nash
E. F. Codd's seminal paper that went from the mathematics of relational theory, to the design of a means to store data, and thus propelled relational databases as a storage mechanism.
E. F. Codd's seminal paper that went from the mathematics of relational theory, to the design of a means to store data, and thus propelled relational databases as a storage mechanism.

More info:

Categories:Types, Research
Published by: Chris Nash on Apr 22, 2009
Copyright:Attribution Non-commercial


Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less





Information Retrieval
A Relational Model of Data forLarge Shared Data Banks
E. F. CODDIBM Research Laboratory, San Jose, California
Future users of large data banks must be protected fromhaving to know how the data is organized in the machine (theinternal representation). A prompting service which suppliessuch information is not a satisfactory solution. Activities of usersat terminals and most application programs should remainunaffected when the internal representation of data is changedand even when some aspects of the external representationare changed. Changes in data representation will often beneeded as a result of changes in query, update, and reporttraffic and natural growth in the types of stored information.Existing noninferential, formatted data systems provide userswith tree-structured files or slightly more general networkmodels of the data. In Section 1, inadequacies of these modelsare discussed. A model based on n-ary relations, a normalform for data base relations, and the concept of a universaldata sublanguage are introduced. In Section 2, certain opera-tions on relations (other than logical inference) are discussedand applied to the problems of redundancy and consistencyin the user’s model.
KEY WORDS AND PHRASES: data bank, data base,
structure, dataorganization, hierarchies of data, networks of data, relations, derivability,redundancy, consistency, composition, join, retrieval language, predicatecalculus, security, data integrityCR CATEGORIES:3.70, 3.73, 3.75, 4.20, 4.22, 4.29
1. Relational Model and Normal Form
1 I. INTR~xJ~TI~NThis paper is concerned with the application of ele-mentary relation theory to systems which provide sharedaccess o large banks of formatted data. Except for a paperby Childs [l], the principal application of relations to datasystems has been to deductive question-answering systems.Levein and Maron [2] provide numerous references o workin this area.In contrast, the problems treated here are those of dataindependence-the independence of application programsand terminal activities from growth in data types andchanges n data representation-and certain kinds of datainconsistency which are expected to become troublesomeeven in nondeductive systems.
Volume 13 / Number 6 / June, 1970
The relational view (or model) of data described inSection 1 appears to be superior in several respects to thegraph or network model [3,4] presently in vogue for non-inferential systems. It provides a means of describing datawith its natural structure only-that is, without superim-posing any additional structure for machine representationpurposes. Accordingly, it provides a basis for a high leveldata language which will yield maximal independence be-tween programs on the one hand and machine representa-tion and organization of data on the other.A further advantage of the relational view is that itforms a sound basis for treating derivability, redundancy,and consistency of relations-these are discussed n Section2. The network model, on the other hand, has spawned anumber of confusions, not the least of which is mistakingthe derivation of connections for the derivation of rela-tions (see emarks in Section 2 on the “connection trap”).Finally, the relational view permits a clearer evaluationof the scope and logical limitations of present formatteddata systems, and also the relative merits (from a logicalstandpoint) of competing representations of data within asingle system. Examples of this clearer perspective arecited in various parts of this paper. Implementations ofsystems to support the relational model are not discussed.1.2.DATA DEPENDENCIESN PRESENTSYSTEMSThe provision of data description tables in recently de-veloped information systems represents a major advancetoward the goal of data independence [5,6,7]. Such tablesfacilitate changing certain characteristics of the data repre-sentation stored in a data bank. However, the variety ofdata representation characteristics which can be changedwithout logically impairing some application programs isstill quite limited. Further, the model of data with whichusers nteract is still cluttered with representational prop-erties, particularly in regard to the representation of col-lections of data (as opposed to individual items). Three ofthe principal kinds of data dependencies which still needto be removed are: ordering dependence, indexing depend-ence, and accesspath dependence. In some systems thesedependencies are not clearly separable from one another.1.2.1. Ordering Dependence. Elements of data in adata bank may be stored in a variety of ways, some nvolv-ing no concern for ordering, some permitting each elementto participate in one ordering only, others permitting eachelement to participate in several orderings. Let us considerthose existing systems which either require or permit dataelements to be stored in at least one total ordering which isclosely associated with the hardware-determined orderingof addresses.For example, the records of a file concerningparts might be stored in ascending order by part serialnumber. Such systems normally permit application pro-grams to assume hat the order of presentation of recordsfrom such a file is identical to (or is a subordering of) the
Communications of the ACM
stored ordering. Those application programs which takeadvantage of the stored ordering of a file are likely to failto operate correctly if for some reason it becomes necessaryto replace that ordering by a different one. Similar remarkshold for a stored ordering implemented by means ofpointers.It is unnecessary to single out any system as an example,because all the well-known information systems that aremarketed today fail to make a clear distinction betweenorder of presentation on the one hand and stored orderingon the other. Significant implementation problems must besolved to provide this kind of independence.1.2.2. Indexing Dependence. In the context of for-matted data, an index is usually thought of as a purelyperformance-oriented component of the data representa-tion. It tends to improve response to queries and updatesand, at the same time, slow down response to insertionsand deletions. From an informational standpoint, an indexis a redundant component of the data representation. If asystem uses indices at all and if it is to perform well in anenvironment with changing patterns of activity on the databank, an ability to create and destroy indices from time totime will probably be necessary. The question then arises:Can application programs and terminal activities remaininvariant as indices come and go?Present formatted data systems take widely differentapproaches to indexing. TDMS [7] unconditionally pro-vides indexing on all attributes. The presently releasedversion of IMS [5] provides the user with a choice for eachfile: a choice between no indexing at all (the hierarchic se-quential organization) or indexing on the primary keyonly (the hierarchic indexed sequent,ial organization). Inneither case is the user’s application logic dependent on theexistence of the unconditionally provided indices. IDS[8], however, permits the fle designers to select attributesto be indexed and to incorporate indices into the file struc-ture by means of additional chains. Application programstaking advantage of the performance benefit of these in-dexing chains must refer to those chains by name. Such pro-grams do not operate correctly if these chains are laterremoved.1.2.3. Access Path Dependence. Many of the existingformatted data systems provide users with tree-structuredfiles or slightly more general network models of the data.Application programs developed to work with these sys-tems tend to be logically impaired if the trees or networksare changed in structure. A simple example follows.Suppose the data bank contains information about partsand projects. For each part, the part number, part name,part description, quantity-on-hand, and quantity-on-orderare recorded. For each project, the project number, projectname, project description are recorded. Whenever a projectmakes use of a certain part, the quantity of that part com-mitted to the given project is also recorded. Suppose thatthe system requires the user or file designer to declare ordefine the data in terms of tree structures. Then, any oneof the hierarchical structures may be adopted for the infor-mation mentioned above (see Structures l-5).378
Communications of the ACM
Structure 1. Projects
Subordinate to PartsFileSegment FieldsF
PARTpart #
part namepart descriptionquantity-on-handquantity-on-order
project #project nameproject descriptionquantity committedStructure 2. Parts Subordinate to ProjectsFileSqmeutFieldsF
project #project nameproject descriptionPARTpart #part namepart descriptionquantity-on-handquantity-on-orderquantity committedStructure 3. Parts and Projects as PeersCommitment Relationship Subordinate to ProjectsFile Segment FieldsF PARTpart #part namepart descriptionquantity-on-handquantity-on-orderG PROJECT project #project nameproject descriptionPARTpart #quantity committedStructure 4. Parts and Projects as PeersCommitment Relationship Subordinate to PartsFile
Segnren1 Fields
PARTpart #part descriptionquantity-on-handquantity-on-orderPROJECT project #quantity committed
PROJECT project #project nameproject descriptionStructure 5. Parts, Projects, andCommitment Relationship as Peers
FCZC .%&-,,ZC,,tFiclds
PARTpart #part namepart descriptionquantity-on-handquantity-on-orderG PROJECT project #project nameproject descriptionH
COMMITpart #
project #quantity
Volume 13 / Number 6 / June, 1970
Now, consider the problem of printing out the partnumber, part name, and quantity committed for every partused in the project whose project name is “alpha.” Thefollowing observations may be made regardless of whichavailable tree-oriented information system is selected totackle this problem. If a program
is developed for thisproblem assuming one of the five structures above-thatis,
makes no test to determine which structure is in ef-fect-then
will fail on at least three of the remainingstructures. More specifically, if
succeeds with structure 5,it will fail with all the others; if
succeeds with structure 3or 4, it will fail with at least 1,2, and 5; if
succeeds with1 or 2, it will fail with at least 3, 4, and 5. The reason issimple in each case. In the absence of a test to determinewhich structure is in effect,
fails because an attempt ismade to exceute a reference to a nonexistent file (availablesystems treat this as an error) or no attempt is made toexecute a reference to a file containing needed information.The reader who is not convinced should develop sampleprograms for this simple problem.Since, in general, it is not practical to develop applica-tion programs which test for all tree structurings permittedby the system, these programs fail when a change in&ructure becomes necessary.Systems which provide users with a network model ofthe data run into similar difficulties. In both the tree andnetwork cases, the user (or his program) is required toexploit a collection of user access paths to the data. It doesnot matter whether these paths are in close correspondencewith pointer-defined paths in the stored representation-inIDS the correspondence is extremely simple, in TDMS it isjust the opposite. The consequence, regardless of the storedrepresentation, is that terminal activities and programs be-come dependent on the continued existence of the useraccess paths.One solution to this is to adopt the policy that once auser access path is defined it will not be made obsolete un-til all application programs using that path have becomeobsolete. Such a policy is not practical, because the numberof access paths in the total model for the community ofusers of a data bank would eventually become excessivelylarge.1.3. A
The term relation is used here in its accepted mathe-matical sense.Given sets X1 , S, , . . . , S, (not necessarilydistinct),
is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, itssecond element from Sz , and so on.’ We shall refer to Si asthe jth domain of
As defined above,
is said to havedegreen. Relations of degree 1 are often called unary, de-gree 2 binary, degree 3 ternary, and degree n n-ary.For expository reasons, we shall frequently make use ofan array representation of relations, but it must be re-membered that this particular representation is not an es-sential part of the relational view being expounded. An ar-
1 More concisely,
is a subset of the Cartesian product 81 Xsz x *.* x 87%.
ray which represents an n-ary relation
has the followingproperties :(1) Each row represents an n-tuple of
(2) The ordering of rows is immaterial.(3) All rows are distinct.(4) The ordering of columns is significant-it corre-sponds to the ordering S1, Sz , . . . , S, of the do-mains on which
is defined (see, however, remarksbelow on domain-ordered and domain-unorderedrelations ) .(5) The significance of each column is partially con-veyed by labeling it with the name of the corre-sponding domain.The example in Figure 1 illustrates a relation of degree4, called supply, which reflects the shipments-in-progressof parts from specified suppliers to specified projects inspecified quantities.
supply (supplier
25 17
3 5 232 3 7 92 7 544
1 1 12FIG.
relation of degree 4
One might ask: If the columns are labeled by the nameof corresponding domains, why should the ordering of col-umns matter? As the example in Figure 2 shows, two col-umns may have identical headings (indicating identicaldomains) but possess istinct meanings with respect to therelation. The relation depicted is called component. It is aternary relation, whose first two domains are called partand third domain is called quantity. The meaning of com-ponent (2, y, z) is that part x is an immediate component(or subassembly) of part y, and z units of part 5 are neededto assembleone unit of part y. It is a relation which playsa critical role in the parts explosion problem.
5 92573 522 6 123 6 347
6 7
relation with-two identical domains
It is a remarkable fact that several existing informationsystems (chiefly those based on tree-structured files) failto provide data representations for relations which havetwo or more identical domains. The present version ofIMS/360 [5] is an example of such a system.The totality of data in a data bank may be viewed as acollection of time-varying relations. These relations are ofassorted degrees. As time progresses,each n-ary relationmay be subject to insertion of additional n-tuples, deletionof existing ones, and alteration of components of any of itsexisting n-tuples.
Volume 13 / Number 6 / June,
Communications of the ACM

Activity (18)

You've already reviewed this. Edit your review.
1 hundred reads
1 thousand reads
kornykyano liked this
Muslim Bhutto liked this
Zulu Mansela liked this
teamgreen liked this
zeusfaber liked this
mikewege liked this
uest37 liked this
budu1 liked this

You're Reading a Free Preview

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->