
Object Identity

Setrag N. Khoshafian and George P. Copeland

Microelectronics And Computer Technology Corporation


9430 Research Blvd.
Austin, Texas 78759

Abstract

Identity is that property of an object which distinguishes each object from all others. Identity has been investigated almost independently in general-purpose programming languages and database languages. Its importance is growing as these two environments evolve and merge.

We describe a continuum between weak and strong support of identity, and argue for the incorporation of the strong notion of identity at the conceptual level in languages for general purpose programming, database systems and their hybrids. We define a data model that can directly describe complex objects, and show that identity can easily be incorporated in it. Finally, we compare different implementation schemes for identity and argue that a surrogate-based implementation scheme is needed to support the strong notion of identity.

1 Introduction.

With the advent of increased efficiency of computer systems, the sophistication and demand of the users of these systems has been increasing. We have seen some significant changes in both general-purpose programming and database languages. In general-purpose programming, we went from assemblers to high-level languages and, more recently, to logic, functional and object-oriented languages. In database languages, we went from navigational to more declarative relational models. In more novel applications, such as CAD/CAM, document retrieval, expert systems and decision support systems, we are realizing that there are a host of powerful data modeling concepts which need to be introduced in both programming languages and database models. One of these concepts is the need to model arbitrarily complex and dynamic objects with versions. A more specific need in this representation is the ability to distinguish objects from one another regardless of their content, location or addressability, and to be able to share objects. In this paper we concentrate on a powerful concept called object identity which enables us to realize this goal.

Every language must have some way to tell one object from another. Identity is that property of an object which distinguishes it from all other objects. Most programming and database languages use variable names to distinguish temporary objects, mixing addressability and identity. Most database systems use identifier keys (i.e., attributes which uniquely identify a tuple) to distinguish persistent objects, mixing data value and identity. Both of these approaches compromise identity. Object-oriented languages employ separate mechanisms for these concepts, so that each object maintains a separate and consistent notion of identity regardless of how it is accessed or how it is modeled with descriptive data. This paper focuses on the identity aspects of languages that model both temporary and persistent data.

Section 2 describes the importance of identity in both programming and database environments, as well as the need to have a consistent notion of identity in environments which attempt to merge temporary and persistent data. A continuum between weak and strong support of identity is described, and an argument is made for the importance of incorporating the strong notion of identity at the conceptual level. Section 3 defines an object model that can directly describe complex objects, and shows that identity can easily be incorporated in it. Section 4 compares the various implementation techniques for identity based on how well they preserve identity when objects are modified and physically moved, and argues that a surrogate-based implementation scheme is needed to support the strong notion of identity. Implementation of the object model using surrogates is discussed. Section 5 provides a summary.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1986 ACM 0-89791-204-7/86/0900-0406 75¢

406 OOPSLA '86 Proceedings September 1986

2 The Importance Of Object Identity

Identity has been investigated almost independently in general-purpose programming languages and database languages. Its importance is growing as these two environments evolve and merge.


This section describes a continuum between weak and strong support of identity. We argue for the incorporation of the strong notion of identity at the conceptual level in languages for general-purpose programming, database systems and their hybrids.

2.1 Degrees Of Support Of Identity

There are at least two dimensions involved in the support of identity, the representation dimension and the temporal dimension. Figure 1 illustrates this identity space, populated with some example languages. Note that when a language includes a stronger support of identity in either dimension, it is not repeated at the weaker levels.

[Figure 1: example languages placed in the identity space. Vertical axis (REPRESENTATION): data value, user-supplied name, built-in. Horizontal axis (TEMPORAL): within a program or transaction, between transactions, between structural reorganizations; the axis is also annotated "temporary data" and "persistent data". Built-in: Smalltalk-80, RM/T, GEM, OPAL. User-supplied name: Pascal, Prolog, SQL, QBE, UNIX shell. Data value: SQL, QBE.]

FIGURE 1: Languages In The Identity Space

The representation dimension distinguishes languages based on whether they represent the identity of an object by its value (e.g., identifying employees by social security number), by a user-defined name (e.g., variable names, user-defined file names, etc.), or built into the language (e.g., Smalltalk-80 [Goldberg and Robson 1983]). Going upward in this dimension indicates stronger support of identity. A language providing a stronger notion of identity in this dimension must maintain its representation of identity during updates, use identity in the semantics of its operators, and provide operators to manipulate identity.

The temporal dimension distinguishes languages based on whether they preserve their representation of identity within a single program or transaction, between transactions, or between structural reorganizations. An example of structural reorganization is schema reorganization in databases [Sockut 1985]. Going to the right in this dimension indicates stronger support of identity. A language providing stronger identity in the temporal dimension must employ more robust implementation techniques to preserve its representation of identity.

2.2 Identity In Programming Languages

Most general-purpose programming languages are designed without the notion of persistent data in mind. For this reason, they provide weak support of identity in the temporal dimension. As far as the language is concerned, data lives only during the execution of a program. A file system, which is not part of the language, is used for any persistent data. Input and output is usually designed for user interaction, that is, to a screen or printer, or from a keyboard or mouse. Then this same input and output model is applied, as an afterthought, to file system transfers. The structures supported in the virtual address space of the program are usually not supported in the file system.

Programming languages vary in their support of identity in the representation dimension. Most programming languages and file systems employ user-defined names (i.e., variables in languages and file names in file systems) to represent identity. The actual binding of an object to its name could be either dynamic (i.e., at run-time) or static. Figure 1 includes Pascal [Wirth 1971] and Prolog [Colmerauer 1975] as representatives of these languages. This approach mixes addressability and identity, although the concepts are quite different. Addressability is external to an object. Its purpose is to provide a way to access an object within a particular environment and is therefore environment dependent. Identity is internal to an object. Its purpose is to provide a way to represent the individuality of an object independently of how it is accessed. An address-based identity mechanism compromises identity. Object-oriented languages, such as Smalltalk-80, provide separate mechanisms for these concepts so that neither is compromised.

There are practical limitations to the use of variable names without some built-in representation of identity and operators to test and manipulate this representation at an abstract level. One problem is that a single object may be accessed in different ways and bound to different variables without having a way to find out if they refer to the same object [Saltzer 1978]. For example, an employee object may be accessed as the vice president of sales and bound to the variable X. The same employee object may be accessed as an employee with a salary > $60,000 who is stationed in Austin and bound to the variable Y. Smalltalk-80 provides a simple identity test with the expression X==Y, which is different from the equality test X=Y. Unix [Ritchie and Thompson 1974] has a built-in representation for file identity to support links between files, but provides no way to test directly whether two files arrived at by different paths are the same.

Given that such built-in support for identity is provided, adequate operators to manipulate identity are needed. For example, two objects with separate identity may later be discovered to be the same (the murderer is the butler!) and therefore need to be merged. Codd [1979] has argued for a "coalescing" operator in RM/T which merges identity. Different copy operators are also needed to indicate the degree of copying vs. sharing. Smalltalk-80 provides a "shallow copy" operator and a "deep copy" operator in addition to simple assignment. For example, suppose the value of Y is a set. Assigning Y to X causes X to share the same set object as Y. Assigning a shallow copy of Y to X causes X to be a new set object with its own identity, whose elements are shared with those in Y. Assigning a deep copy of Y to X causes X to be a new set object with its own identity, whose elements are new objects with their own identity, but which have the same values as those of Y.

2.3 Identity In Database Languages

Database languages are designed to support large and persistent data that models large and persistent real-world systems. These characteristics require strong support of identity in both the representation and temporal dimensions.

Every real-world object is an individual. That is, there is something unique about everything. We have often heard this pointed out about humans, as well as such simple and plentiful objects as leaves, grains of sand, blades of grass and snow flakes. When we model real-world objects with some particular purpose in mind, however, we only include some subset of that object's description in the model. This subset may not be complete enough to capture the object's uniqueness. In some cases uniqueness is external (e.g., an object is unique if it has some local attribute values and belongs to a different set, or is related to a different object). This problem arises in databases, as well as in many other computer systems, because they attempt to model real-world systems. Smith and Smith [1978] have argued for the principle of individual preservation for databases, which states that "every user-invokeable update operation must preserve the integrity of individuals". If the concept of identity is built into a language, then an object's uniqueness is modeled even though its description is not unique.

Codd [1970] introduced the notion of user-defined identifier keys to represent the identity of an object. An identifier key is some subset of the attributes of an object which is unique for all objects in the relation. This representation of identity is supported in many existing database systems. For example, Figure 1 includes SQL [Chamberlin and Boyce 1974] and QBE [Zloof 1975] as representatives of these systems. Both languages use identifier keys to identify persistent objects. SQL uses tuple variables and QBE uses domain variables, as does Prolog, to identify temporary objects. There are several problems with identifier keys which are due to the fact that the concepts of data value and identity are mixed.

One problem is that identifier keys cannot be allowed to change, even though they are user-defined descriptive data. For example, a department's name may be used as the identifier key for that department and replicated in employee objects to indicate where the employee works. But the department name may need to change under a company reorganization, causing a discontinuity in identity for the department as well as update problems in all objects which refer to it.

A second problem is that identifier keys cannot provide identity for every object in the relational model. Each attribute or meaningful subset of attributes cannot have identity. For example, an employee object may have an attribute describing the employee's spouse by his or her first name. Later, the spouse also becomes an employee, causing a discontinuity in identity for the spouse.

A third problem is that the choice of which attribute(s) to use for an identifier key may need to change. For example, RCA may use employee numbers to identify employees, while General Electric may use social security numbers for the same purpose. A merger of these two companies would require one of these to change, causing a discontinuity in identity for the employees of one of the companies.

A fourth problem is that the use of identifier keys causes joins to be used in retrievals instead of path expressions, which are simpler, as in GEM [Zaniolo 1983] and OPAL [Copeland and Maier 1984]. For example, suppose we have an employee relation employee[name, SS#, birthdate, assignment] and a department relation department[name, budget, location], and the assignment attribute establishes a relationship between an employee and a department. Using identifier keys, assignment would have as its value the identifier key of the department, say name. A retrieval involving both tuples would require a join between the two tuples. Using tuple variables, a retrieval might be

    SELECT E.name, D.location WHERE E.assignment=D.name
    & E IN employee & D IN department,

where E and D are tuple variables. Using domain variables, the retrieval would be

    [X, Z] <== employee[X, -, -, Y], department[Y, -, Z],

where X, Y and Z are domain variables. Using built-in identity, assignment would have as its value a department tuple. Unlike strict hierarchical database systems, the department tuple could also be the value of other objects without either being owned by any object or being replicated, forming a directed graph structure. Using tuple variables, the retrieval would be

    SELECT E.name, E.assignment.location WHERE E IN employee.

Using domain variables, the retrieval would be

    [X, Z] <== employee[X, -, -, department[-, -, Z]].

The identifier key approach requires explicitly introducing the additional tuple variable D or domain variable Y for the join. The built-in identity approach has some of the advantages of the universal relation approach (i.e., no joins for entity relationships) but without the disadvantages of requiring unique attribute names (because nested names are used) and occasionally having ambiguous paths (since paths are specified) [Kent 1981, Maier et al. 1984]. Note that with built-in identity, hierarchical structures are possible without the undesirable insertion and deletion anomalies described by Codd [1971].
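
The contrast between the join and the path expression, and the first problem with identifier keys described above, can be illustrated with a small Python sketch. It is not part of the paper: the class names, the function names and the sample data are hypothetical, the employee relation is reduced to two attributes, and ordinary Python object references stand in for built-in identity.

```python
from dataclasses import dataclass


@dataclass
class Department:
    name: str
    budget: int
    location: str


@dataclass
class EmployeeByKey:
    name: str
    assignment: str          # identifier key: the department's name (a data value)


@dataclass
class EmployeeByRef:
    name: str
    assignment: Department   # built-in identity: a reference to the department object


def locations_by_join(employees, departments):
    # Join: the extra variable d plays the role of tuple variable D in the text.
    return [(e.name, d.location)
            for e in employees
            for d in departments
            if e.assignment == d.name]


def locations_by_path(employees):
    # Path expression: E.assignment.location, with no join variable.
    return [(e.name, e.assignment.location) for e in employees]


# Hypothetical sample data.
sales = Department("Sales", 100, "Austin")
by_key = [EmployeeByKey("Ann", "Sales")]
by_ref = [EmployeeByRef("Ann", sales)]

print(locations_by_join(by_key, [sales]))
print(locations_by_path(by_ref))

# Renaming the department changes its key value but not its identity.
sales.name = "Marketing"
print(locations_by_join(by_key, [sales]))   # the key-based reference no longer matches
print(locations_by_path(by_ref))            # the reference still reaches the same object
```

In this sketch the rename silently disconnects the key-based employee from its department, while the reference-based employee is unaffected, which is the discontinuity the first problem describes.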

Kent [1978] describes many other problems with using descriptive data for identity. The solution calls for built-in support for identity in the language which is independent of its external descriptive data, so that the system can provide a strong notion of identity in both the representation and temporal dimensions. Strong support is provided in the representation dimension because identity is built-in. Strong support is provided in the temporal dimension because identity is preserved between transactions, regardless of changes in data or structure. RM/T and GEM provide built-in identity for some persistent objects. These languages are not fully object-oriented because they lack a uniform treatment of all objects, providing identity only for persistent tuples. OPAL [Copeland and Maier 1984] provides built-in identity for all temporary and persistent objects.

Several researchers have argued for a temporal data model [e.g., Copeland 1980 and 1982, Ben-Zvi 1982, Clifford and Warren 1983, Katz and Lehman 1984, Copeland and Maier 1984]. The reason is that most real-world organizations deal with histories of objects, but they have little support from existing systems to help them in modeling and retrieving historical data. Strong support of identity in the temporal dimension is even more important for temporal data models, because a single retrieval may involve multiple historical versions of a single object. Such support requires the database system to provide a continuous and consistent notion of identity throughout the life of each object, independently of any descriptive data or structure which is user modifiable. This identity is the common thread that ties together these historical versions of an object.

Database systems must provide efficient access to large data. To deal with this, database languages usually provide the capability to map the user's conceptual schema onto an internal schema, which describes the way that data is actually stored [ANSI/X3/SPARC 1975]. An internal schema may have multiple copies of the conceptual schema and may further partition the attributes of an object of the conceptual schema. Some way of relating these multiple copies and attribute partitions to the same conceptual object is needed. The object's identity provides a convenient way of doing this.

2.4 The Hybrid Environment

Programming with persistent data has always been difficult because programming language environments and database systems are each designed within different cultures. They are usually built on different concepts for typing, computation and identity. Typing systems in programming languages typically include arrays, lists and atomic types, while typing systems in database languages typically include sets, records and atomic types. Computational models in programming languages are typically rich in manipulation capability, while computational models in database languages typically include only search and simple update capability. The notion of identity in programming languages is typically weaker than that of database systems.

The interfaces between programming and database languages are usually crude because of these differing concepts and because they are usually designed as an afterthought. This causes what some have called an "impedance mismatch" [Copeland and Maier 1984], because much of the meta information (e.g., structures and operations) in either system is reflected back at the interface rather than passing through it. This meta information must be defined redundantly in both languages. Also, transformations must be defined whenever data or operations need to pass through the interface.

There is a growing trend to merge programming and database languages into a hybrid environment which includes a language with a unified typing and computation. Some researchers have approached the problem by making programming language data types persistent. Some examples of this approach are PS-algol [Atkinson et al. 1983], Amber [Cardelli 1984], Poly [Matthews 1985] and Galileo [Albano et al. 1985]. These languages extend the file system to support the same types as in the language and provide type checking when file objects are imported into a program. Others have approached the problem by combining programming and database language data types and database transactions. Some examples of this are PASCAL/R [Schmidt 1977] which combines PASCAL with relational data types, PLAIN [Wasserman 1979] and RIGEL [Rowe and Shoens 1979] each of which combines a new programming language with relational data types, and OPAL [Copeland and Maier 1984] which combines Smalltalk-80 and a set data type with predicate calculus.

Regardless of how one approaches this merging of programming and database capability, the end result should be a language with a uniform treatment of types, computation and identity. Data instances of any type should be capable of being either temporary or persistent. Any computation should apply uniformly to either temporary or permanent data, although computations which cause state changes of shared persistent data should be enveloped by a transaction. All types should employ the same notion of identity.

3 An Object Model

This section provides an object model which incorporates object identity. Much of this model is similar to the Smalltalk-80 [Goldberg and Robson 1983] and FAD [Bancilhon et al. 1985] languages. Our purpose is not to present a complete language, but rather to demonstrate how the strong notion of identity in the representation dimension (i.e., built-in) can be incorporated into an object model. We concentrate on the definition of object structures and those operators which manipulate identity. Although this model includes only atomic and two structured types, its generalization to other structured types is straightforward.

3.1 Object Structure

The object structure is very similar to the object structure of FAD [Bancilhon et al. 1985]. We assume we are given a set of attribute names A, a set of identifiers I, and a collection of base atomic types.

An object O is a triple (identifier, type, value) where

a) The identifier is in I. The identifier of an object O is denoted O.identity.
b) The type is in {atom, set, tuple}. We provide only single-level typing for simplicity. A more complete typing system would be desirable for a full language (for example, a set of tuples whose attribute values are typed).
c) The value is one of the following:
   1) If the object is of the type atom, then the value is an element of a user-defined domain of atoms, each of which has no subparts.
   2) If the object is of the type set, then the value is a set of distinct identifiers from I.
   3) If the object is of the type tuple, then the value is of the form [A1:I1, A2:I2, ..., An:In], where the Ai's are distinct attribute names, and the Ii's are distinct identifiers from I. Ii is the value taken by the object O on attribute Ai and is denoted O.Ai.

Note that this model allows us to have objects of arbitrary nestings and graphical structure. For example, we can have nested relations [Jaeschke and Schek 1982] and nested tuples [Zaniolo 1985].

Objects can be represented graphically. We represent an atomic object by a node labeled by its value, a tuple object O = [A1:O1, ..., An:On] by a node labeled by its identifier such that there is an arc labeled Ai which goes from O to Oi, and a set O = {O1, ..., On} by a node labeled by its identifier such that there is an unlabeled arc going from O to every Oi. As an example, suppose we have a database which consists of employees and students, where each has a name, consisting of first name and last name, and an age. Then an instance of the database could be represented as in Figure 2.

[Figure 2: graph of an example database instance with employee and student objects; the atomic nodes that survive extraction include the name values "John", "Mark" and "Smith".]

FIGURE 2: Object Model Example

An Object System is a set of objects. An object system is consistent if

(a) No two distinct objects have the same identifiers (unique identifier assumption). In other words, the identifier functionally determines the type and the value of the object.
(b) For each identifier present in the system there is an object with this identifier (no dangling identifier assumption).

All object systems will be assumed consistent throughout this paper. This definition of objects allows a directed-graph structure. An object can belong to multiple objects through set membership or attribute assignment without being replicated and without being owned by any object.

3.2 Operators With Object Identity

In the previous section we defined object structure with identity. In this section, we define several operators which compare or manipulate objects with identity, and give an informal semantics for each operator.

Definition: identity predicate (identical)

Given two objects, O1 and O2, the predicate identical(O1, O2) will return true if O1 and O2 are the same. Therefore, identical(O1, O2) is true if O1.identity = O2.identity.

Definition: shallow equality predicate (shallow-equal)

Two objects are shallow-equal if their values are identical. Note that this definition is not recursive, i.e., two set objects whose elements have pairwise equal values are not necessarily shallow-equal, and two tuples whose attributes have pairwise equal values are not necessarily shallow-equal.

Definition: deep equality predicate (deep-equal)

Object deep equality is defined as follows:

(a) Two atomic objects are deep-equal if their values are the same (note that deep-equal and shallow-equal are the same for atomic objects).
(b) Two set objects are deep-equal if they have the same cardinality and the elements in their values are pairwise deep-equal.
(c) Two tuple objects are deep-equal if the values they take on the same attributes are deep-equal.

This shows that we have three flavors of "equality":

(1) Identical, which checks if the two objects are the same object.
(2) Shallow, which goes 1 level deep, comparing values of the components of the object.
(3) Deep, which recursively traverses the objects, comparing equality of corresponding components.

Each is an equivalence relation on objects: (1) refines (2) and (2) refines (3). Therefore, two identical objects are always shallow-equal and deep-equal. Furthermore, two shallow-equal objects are always deep-equal.
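
The object structure of Section 3.1 and the three predicates can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's implementation: the names Obj, ObjectSystem, new_atom, new_set and new_tuple are hypothetical, identifiers are drawn from a simple counter, objects are referred to by their identifiers, and deep_equal assumes the object graph is acyclic.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Obj:
    identity: int   # element of I (assumption: integers from a counter)
    type: str       # "atom", "set" or "tuple"
    value: object   # atom: a scalar; set: set of identifiers; tuple: dict name -> identifier


class ObjectSystem:
    def __init__(self):
        self.objects = {}        # identifier -> Obj (unique identifier assumption)
        self._ids = count(1)

    def _new(self, type_, value):
        oid = next(self._ids)
        self.objects[oid] = Obj(oid, type_, value)
        return oid

    def new_atom(self, v):
        return self._new("atom", v)

    def new_set(self, ids):
        return self._new("set", set(ids))

    def new_tuple(self, **attrs):
        return self._new("tuple", dict(attrs))

    def identical(self, a, b):
        # Same identifier means same object.
        return a == b

    def shallow_equal(self, a, b):
        # Values are compared one level deep: same atom value, or same identifiers.
        x, y = self.objects[a], self.objects[b]
        return x.type == y.type and x.value == y.value

    def deep_equal(self, a, b):
        # Recursive comparison; assumes no cycles in the object graph.
        if a == b:
            return True
        x, y = self.objects[a], self.objects[b]
        if x.type != y.type:
            return False
        if x.type == "atom":
            return x.value == y.value
        if x.type == "tuple":
            return (x.value.keys() == y.value.keys() and
                    all(self.deep_equal(x.value[k], y.value[k]) for k in x.value))
        if len(x.value) != len(y.value):
            return False
        unmatched = list(y.value)
        for e in x.value:
            # Pair each element with a not yet used deep-equal partner.
            match = next((f for f in unmatched if self.deep_equal(e, f)), None)
            if match is None:
                return False
            unmatched.remove(match)
        return True


system = ObjectSystem()
smith_a = system.new_atom("Smith")
smith_b = system.new_atom("Smith")
emp = system.new_tuple(last=smith_a)
twin = system.new_tuple(last=smith_a)        # shares the same name object
lookalike = system.new_tuple(last=smith_b)   # equal value, different name object

print(system.identical(emp, twin))           # False
print(system.shallow_equal(emp, twin))       # True
print(system.shallow_equal(emp, lookalike))  # False
print(system.deep_equal(emp, lookalike))     # True
```

The greedy pairing used for sets is sufficient here because deep equality is an equivalence relation, so any deep-equal partner can be chosen without affecting the outcome.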

Next, we discuss two kinds of operators which are used to update tuple and set objects. Atomic objects are not updatable.

Definition: assigning an object to an attribute of a tuple (assign)

To assign an object to an attribute of a tuple we use:

    assign(<tuple object>, <attribute name>, <object>).

The effect of this operator is to assign the <object> to <tuple object>.<attribute name>. For example, if X = [name:O1, age:O2], then assign(X, salary, O3) will yield X = [name:O1, age:O2, salary:O3]. The identities of X, O1, O2 and O3 are not changed. If X.salary was defined before the assignment, O3 will replace the old X.salary.

Definition: adding an element to a set (add-element)

The second operator adds an element to a set. Therefore,

    add-element(<set object>, <object>)

will add <object> as an element of <set object>. The identities of the set and of its existing elements are not affected. Note that if the object already exists in the set, the effect of the add-element update operator is null. However, if the set already contains an object which is either shallow-equal or deep-equal to the object to be added (but not identical to it), the object will still be added to the set. Below, we shall introduce an operator which eliminates duplicates based on values.

Definition: removing an element from a set (remove-element)

The third operator removes an element from a set. Therefore,

    remove-element(<set object>, <object>)

will remove <object> from <set object>, if the object is there.

Definition: value elimination from a set (value-eliminate)

We already alluded to the problem of the existence of value based duplicates in a set. This problem becomes important if we want to generate a final result showing only the content (i.e., the values) of the set object without any duplicates. If an operation is interested in only manipulating unique values (e.g., count the different colors of parts), without value elimination we might have redundant information accessed and manipulated. Therefore, the fourth operator performs value based duplicate elimination on a set:

    value-eliminate(<set object>).

If O is a set object, value-eliminate(O) will create a new set object whose elements are new objects and whose values are the same as in O but without duplicate values. That is, there will be no two distinct elements O1 and O2 in value-eliminate(O) that are deep-equal. The existence of identity and value elimination allows the option of either objects or values to be manipulated.

Definition: merging two objects into one (merge)

Another operator which is useful for systems that support object identity merges two objects and makes them a single object. In other words, if we realize that two similarly structured objects are really the same we could have

    merge(O1, O2),

which will merge the two objects and henceforth they will be one and the same object. Note that this is an updating operation. The semantics and support of this operation could be tricky and expensive. The simplest approach is to require the two objects to have the same type and be deep-equal. Then all that we need is to ensure all the references to the old objects and their sub-components now refer to the merged object and its sub-components. However, it is possible to make merging more sophisticated and provide support for merging differently structured objects. This is a very useful concept in statistical databases called record-linking [Wrigley 1973, Howe and Lindsay 1981], where an attempt is made to merge information which was gathered by different sources and which contain different sorts of information about the same objects.

Finally, similar to the different flavors of equality, there are two flavors of object copying, called shallow-copy and deep-copy.

Definition: shallow copying (shallow-copy)

The shallow-copy operator will copy its first argument into the second, such that the resulting object will be shallow-equal to the first and the two objects will have different identity. Therefore,

    shallow-copy(O1, O2)

will generate a new object O2 such that O2.identity is not equal to O1.identity but shallow-equal(O2, O1) is true.

Definition: deep copying (deep-copy)

The deep-copy operator will copy its first argument into the second, such that the resulting object will be deep-equal to the first and the two objects will have different identity. Therefore,

    deep-copy(O1, O2)

will generate a new object O2, with all new sub-parts, such that, if O1 is a set or a tuple, O2 is not shallow-equal to O1, but O2 is deep-equal to O1. Deep-copy should also preserve co-referencing.

4 Implementation Techniques

There have been several techniques for implementing object identity both in databases and in programming languages. In this section, we first provide a taxonomy of identity implementation techniques. We draw our examples from programming languages, distributed file systems and database management systems. Then, we describe how the object system in Section 3 can be implemented using surrogates, which is the most powerful of these techniques.
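
Before turning to the individual techniques, it may help to see the update and copy operators of Section 3.2 sketched over a table keyed by system-generated identifiers, which is roughly the kind of representation a surrogate scheme suggests. The Python below is a minimal sketch under assumptions, not the paper's implementation: the Store class and its method names are hypothetical, identifiers are integers from a counter, the copy operators return the new identifier instead of taking a second argument, and merge only redirects references to the dropped object without checking deep equality or merging sub-components.

```python
from itertools import count


class Store:
    def __init__(self):
        self.objs = {}           # identifier -> (type, value)
        self._ids = count(1)     # assumption: identifiers are never reused

    def new(self, type_, value):
        oid = next(self._ids)
        self.objs[oid] = (type_, value)
        return oid

    def assign(self, tup, attr, obj):
        # Replaces or adds the attribute; no identity changes.
        self.objs[tup][1][attr] = obj

    def add_element(self, s, obj):
        # Adding an identifier that is already present has no effect.
        self.objs[s][1].add(obj)

    def remove_element(self, s, obj):
        self.objs[s][1].discard(obj)

    def shallow_copy(self, oid):
        # New identity, components shared with the original.
        type_, value = self.objs[oid]
        return self.new(type_, value if type_ == "atom" else value.copy())

    def deep_copy(self, oid, memo=None):
        # New identity for every sub-part; memo preserves co-referencing.
        memo = {} if memo is None else memo
        if oid in memo:
            return memo[oid]
        type_, value = self.objs[oid]
        if type_ == "atom":
            new = self.new("atom", value)
            memo[oid] = new
        elif type_ == "set":
            new = self.new("set", set())
            memo[oid] = new
            self.objs[new][1].update(self.deep_copy(e, memo) for e in value)
        else:
            new = self.new("tuple", {})
            memo[oid] = new
            for attr, ref in value.items():
                self.objs[new][1][attr] = self.deep_copy(ref, memo)
        return new

    def merge(self, keep, drop):
        # Simplified: every reference to drop is redirected to keep.
        del self.objs[drop]
        for type_, value in self.objs.values():
            if type_ == "set" and drop in value:
                value.discard(drop)
                value.add(keep)
            elif type_ == "tuple":
                for attr in list(value):
                    if value[attr] == drop:
                        value[attr] = keep


st = Store()
x = st.new("atom", 5)
t = st.new("tuple", {"a": x, "b": x})      # both attributes share one object
copy = st.deep_copy(t)
copied = st.objs[copy][1]
print(copied["a"] == copied["b"], copied["a"] != x)   # True True
```

The memo table in deep_copy is one way to keep two attributes that shared an object in the original sharing a single new object in the copy, which is the co-referencing property the text asks for.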

4.1 Implementation Taxonomy

The power of each implementation technique can be measured by the
degree of value, structure and location independence it provides.
Data independence (value and structure independence taken together)
means that identity is preserved through changes in either data
values or structure. Location independence means that identity is
preserved through movement of objects among physical locations or
address spaces. Both of these are important when implementing objects
with the strong notion of identity in both the representation and
temporal dimensions as described in Section 2.

Figure 3 describes where these implementation techniques lie in the
two dimensional space provided by this taxonomy. Some of the details
of these techniques are described below.

Technique                   Data independence                Location independence
real physical address       fully independent                cannot move
virtual address             fully independent                move page within one virtual address space
indirect physical address   fully independent                move object within one physical address space
indirect virtual address    fully independent                move object within one virtual address space
structured identifier       fully independent                move object within one disk or server
surrogate for each object   fully independent                fully independent
surrogate for each tuple    value independent,               fully independent
                            somewhat structure dependent
tuple identifier            value independent,               fully independent
                            structure dependent
identifier key              value and structure dependent    fully independent

FIGURE 3: Implementation Taxonomy

Identity Through Physical Address

Perhaps the simplest implementation of the identity of an object is
the physical address of the object. This physical address could be
the real or the virtual address of the object (if the object system
is operating in a virtual memory environment). For example, in PASCAL
the "identity" of a record (i.e., the pointer to the record) is
implemented through a virtual heap address. Real physical address
implementation does not permit an object to be moved, so that there
is no location independence. Virtual address implementation allows
only whole pages of objects, not individual objects, to be moved
within one virtual address space, providing minimal location
independence. However, because objects cannot be moved between
address spaces, object sharing among multiple programs is limited.
Both real physical address and virtual address implementations
provide data independence, unless a modification to an object's value
or structure changes its size and forces the object to be moved
within the address space.

Identity Through Indirection

In Smalltalk-80 [Goldberg and Robson 1983], an oop (object-oriented
pointer) is used to implement identity. An oop designates an entry in
an object table. Therefore, identities are implemented through a
level of indirection. In LOOM [Kaehler and Krasner 1983], it is shown
that this scheme could be used to support secondary-storage resident
objects, providing support for a much larger number of objects.
Indirect physical or virtual address implementations allow individual
objects to be moved within one address space, providing stronger
location independence than direct address implementation but not
allowing sharing of objects among multiple programs. Indirect address
implementations provide full data independence.
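The indirection scheme can be illustrated with a minimal sketch. It is not the Smalltalk-80 or LOOM layout: the class name ObjectTable, its methods, and the dictionary standing in for an address space are all hypothetical, chosen only to show that a holder keeps the same oop while the object behind it is relocated or modified.

```python
class ObjectTable:
    """Hypothetical object table: an oop is an index into `_slots`,
    and each slot holds the object's current address."""

    def __init__(self):
        self._slots = []        # oop -> current address
        self._memory = {}       # address -> object state (stand-in for a heap)
        self._next_address = 0  # next unused address in the stand-in heap

    def allocate(self, state):
        """Store `state` at a fresh address and return its oop."""
        address = self._next_address
        self._next_address += 1
        self._memory[address] = state
        self._slots.append(address)
        return len(self._slots) - 1

    def address_of(self, oop):
        """Return the address the oop currently resolves to."""
        return self._slots[oop]

    def deref(self, oop):
        """Follow the indirection: oop -> address -> object state."""
        return self._memory[self._slots[oop]]

    def move(self, oop):
        """Relocate the object; only its table entry is updated,
        so every holder of the oop is unaffected."""
        old = self._slots[oop]
        new = self._next_address
        self._next_address += 1
        self._memory[new] = self._memory.pop(old)
        self._slots[oop] = new


table = ObjectTable()
part = table.allocate({"color": "red"})
holder = {"part": part}                    # another object refers to the oop
table.move(part)                           # location changes, oop does not
table.deref(part)["color"] = "blue"        # value changes, oop does not
print(table.deref(holder["part"]))
```

Because the table lives inside one address space in this sketch, it shows location independence only within that space, which matches the limitation on sharing among multiple programs noted above.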



Identity Through Structured Identifier

In some distributed systems, such as the Cambridge File Server [Dion
1980] and the LOCUS system [Popek et al. 1981], the identifiers of
files (the objects of the systems) are structured, where part of the
structure captures an aspect of the location of the object, such as a
disk or server. For example, in LOCUS part of the identifier of a
file identifies a (logical) volume where the file is located.
Structured identifiers provide full data independence. Structured
identifiers allow individual objects to be moved only within one disk
or server, although movement among multiple address spaces is
possible, so that objects can be shared among multiple programs.

One reason why movement of objects is desirable is load balancing in
a distributed system. In other words, if a server contains many "hot"
objects, the overall performance of the distributed system will
improve if some of the hot objects are moved to another location
(e.g., volume or site).

Identity Through Identifier Keys

The main approach for supporting identity in database management
systems is by direct implementation of user-supplied identifier keys.
The tuples are ordered (in most cases sorted) on the identifier key
and an auxiliary structure (e.g., a B-tree) is constructed on top of
the set of tuples to provide fast access to objects retrieved through
their identifier keys. Identifier key implementations provide full
location independence. They do not provide value independence because
they consist of values. They do not provide structure independence
because they are unique only within a single relation (e.g., a
relation may be restructured into two relations) and they are applied
only to tuples and not to attributes (e.g., an attribute object may
be expanded into a tuple on its own).

Identity Through Tuple Identifiers

In some systems, such as System R [Astrahan et al. 1976], INGRES
[Stonebraker et al. 1976] and WiSS [Chou et al. 1985], internal tuple
identifiers are introduced in the internal layer to simplify the
interfaces of the DBMS layering scheme and to implement the
concurrency control/recovery module of the DBMS. These tuple
identifiers should not be confused with the implementation of
identity, since they do not directly correspond to any conceptual
notion of identity. Tuple identifiers can, however, be used to
implement identity. They are system-generated identifiers which are
unique for all tuples within a single relation and have no
relationship to physical location. Tuple identifiers provide full
location independence. They also provide full value independence.
They do not provide full structure independence since they are unique
only within a single relation and they are applied only to tuples and
not to attributes.

Identity Through Surrogates

The most powerful technique for supporting identity is through
surrogates [Abrial 1974, Hall et al. 1976, Kent 1978, Codd 1979].
Surrogates are system-generated, globally unique identifiers,
completely independent of any physical location. Leach et al. [1982]
argue for the use of surrogates in distributed systems. The arguments
for location independence presented therein are also valid for
distributed database management systems. Surrogates provide full
location independence. If surrogates are associated only with some
objects, such as tuples in RM/T [Codd 1979] and GEM [Tsur and Zaniolo
1984], then they provide value independence but not full structure
independence. If surrogates are associated with every object as in
OPAL [Maier et al. 1985], then they provide full data independence.

4.2 Implementing The Object Model With Surrogates

This section provides a more detailed description of how to realize
the object model in Section 3 using surrogates.

Each object of any type is associated with a globally unique
surrogate at the instant it is instantiated. This surrogate is used
to internally represent the identity of its object throughout the
lifetime of the object. Leach et al. [1982] discuss several
implementation issues involved in the non-trivial task of generating
globally unique surrogates in a distributed environment.

There are several reasons why the physical description of a
conceptual object may not be stored in a single location. One reason
is that some of the parts of an object may be shared by other objects
due to the graph structure of the object model. An object referenced
by multiple objects cannot be physically stored with each of its
referencing objects without uncontrolled replication. A second reason
is that controlled replication may be used to facilitate data
recovery. These replicates must be physically stored on separate
media for maximum recoverability. A third reason is that the parts of
an object may be physically partitioned based on frequency of use
together to improve performance for disk-resident data [Hoffer 1975].
A fourth reason pertaining to support of a temporal data model is
that the current version may be kept separately from past versions of
objects, so that the speed of access to current data is not reduced.
For all of these, an object's surrogate provides a convenient way to
relate these separately stored replicates or parts of the single
conceptual object.

Testing for identity (i.e., O1.identity = O2.identity) is required to
support identity and shallow-equal predicates. This is accomplished
by testing surrogate equality (i.e., O1.surrogate = O2.surrogate).

The remove-element operator requires checking for dangling
identifiers to ensure consistency of the object system. That is,
there should not be any references to an object that does not exist.
This could be implemented by searching for dangling surrogates each
time such an update is made. This is similar to the technique used in
databases which allow foreign identifier key declaration with
referential integrity enforcement.

The assign operator may cause the last reference to an object to
disappear, so that it is no longer accessible and its storage can be
freed. This could be implemented
by keeping a reference count for each object (i.e., the
number of references to an object), which is updated each time a
reference is added or removed. When the reference count of an object
goes to zero, garbage collection is invoked. This is important for
temporary data and is used in the Smalltalk-80 system [Goldberg and
Robson 1983].

The merge operator causes two objects to become one. This could be
implemented by maintaining an equivalence relationship between the
two surrogates.

To support systems that form a hybrid of programming languages and
database systems, it is important to maintain a consistent and
continuous notion of identity throughout the lifetime of an object.
Let us consider an object that is first created as a temporary and
later made persistent. Although the strong notion of identity in both
the representation and temporal dimensions is most important to
persistent objects, this ability to change status means that the same
implementation representation of identity should apply to both
temporary and persistent objects.

5 Summary

Most programming and database languages mix the concepts of
addressability and identity, using variable names as the only way to
distinguish temporary objects. Most database languages mix the
concepts of data value and identity, using identifier keys as the
only way to distinguish persistent objects. Both of these approaches
compromise identity. Object-oriented languages distinguish these
concepts, providing a stronger notion of identity which is
independent of how an object is accessed or described with data
values.

We discussed the importance of identity in both programming and
database languages as well as languages which combine the power of
both these disciplines. Strong identity involves two dimensions,
representation and temporal. The representation dimension
distinguishes languages based on whether they represent the identity
of an object by its value, by a user-defined name, or by a mechanism
built into the language. The temporal dimension distinguishes
languages based on whether they preserve their representation of
identity within a single program or transaction, between
transactions, or between structural reorganizations. Strong identity
in the representation dimension is important for both temporary and
persistent objects. Strong identity in the temporal dimension is
important for persistent objects. For hybrid languages which merge
programming and database functionality, a strong identity in both
dimensions is important due to the need for a uniform treatment of
all objects, because their status may change between temporary and
persistent.

We defined a data model, including structures and operations, which
supports complex objects with strong identity. Although structures
included only atomic types and two structured types (i.e., set and
tuple), its generalization to other structured types is
straightforward. The operators serve as the building blocks of a data
manipulation language based on this object model with identity.

We compared different implementation techniques for identity using a
taxonomy which is based on data and location independence. Value and
structure independence means that identity is preserved through
changes in either data values or structure. Location independence
means that identity is preserved through movement of objects among
physical locations or address spaces. The most robust of these
techniques is the use of surrogates, which provides full independence
in both dimensions. We described how the object model could be
implemented using surrogates.

Acknowledgments

Thanks to David Maier of OGC for his many suggestions concerning both
the content and presentation of this paper.

References

J.R. Abrial, "Data Semantics," in Data Base Management, J.W. Klimbie
and K.L. Koffeman, eds., North-Holland Publishing Co., New York
(1974).

A. Albano, G. Ghelli and R. Orsini, "The Implementation Of Galileo's
Values Persistence," Proceedings Of The Appin Workshop on Persistence
And Data Types, University Of Glasgow (August 1985).

M.M. Astrahan, M.W. Blasgen, D.D. Chamberlin, K.P. Eswaran, J.N.
Gray, P.P. Griffiths, W.F. King, R.A. Lorie, P.R. McJones, J.W. Mehl,
G.R. Putzolu, I.L. Traiger, B.W. Wade and V. Watson, "System R:
Relational Approach To Database Management," Transactions On Database
Systems, ACM, Vol. 1, No. 2 (June 1976).

ANSI/X3/SPARC, Study Group On Data Base Management Systems Interim
Report 75-02-08, FDT Bulletin, Vol. 7, No. 2 (February 1975).

M.P. Atkinson, P.J. Bailey, W.P. Cockshott, K.J. Chisholm and R.
Morrison, "An Approach To Persistent Programming," Computer Journal,
Vol. 26, No. 4 (1983).

F. Bancilhon, S. Khoshafian and P. Valduriez, "FAD, A Database
Machine Language, Formal Description," personal communications
(1985).

J. Ben-Zvi, "The Time Relational Model," Ph.D. Dissertation, UCLA
(1982).

L. Cardelli, "Amber," AT&T Bell Labs Technical Memorandum
11271-840924-10TM (1984).

D.D. Chamberlin and R.F. Boyce, "SEQUEL: A Structured English Query
Language," Proceedings of the SIGMOD Workshop On Data Description,
Access And Control, ACM, Ann Arbor (May 1974).



H.T. Chou, D.J. DeWitt, R. Katz, and A. Klug, "Design And
Implementation of The Wisconsin Storage System," Software Practice
And Experience, Vol. 15, No. 10 (October 1985).

J. Clifford and D.S. Warren, "Formal Semantics For Time In
Databases," Transactions On Database Systems, ACM, Vol. 8, No. 2
(June 1983).

E. F. Codd, "A Relational Model Of Data For Large Shared Data Banks,"
Communications of the ACM, Vol. 13, No. 6 (June 1970).

E. F. Codd, "Further Normalization Of The Data Base Relational
Model," in Data Base Systems, Courant Institute Computer Science
Symposia 6, R. Rustin (ed.), Prentice-Hall, Inc., Englewood Cliffs,
New Jersey (May 1971).

E. F. Codd, "Extending The Database Relational Model To Capture More
Meaning," Transactions On Database Systems, ACM, Vol. 4, No. 4
(December 1979).

A. Colmerauer, "Les Grammaires De Metamorphose," Groupe
d'Intelligence Artificielle, Marseille-Luminy (November 1975).

G.P. Copeland, "What If Mass Storage Were Free?," Proceedings Of The
Fifth Workshop On Computer Architecture For Non-Numeric Processing,
ACM, Pacific Grove, California (March 1980); a revised version
appears in Computer, IEEE Computer Society, Vol. 15, No. 7 (July
1982).

G.P. Copeland and D. Maier, "Making Smalltalk A Database System,"
Proceedings of the SIGMOD Conference, ACM, Boston (June 1984).

J. Dion, "The Cambridge File Server," Operating Systems Review, ACM
SIGOPS, Vol. 14, No. 4 (October 1980).

A. Goldberg and D. Robson, Smalltalk-80: The Language And Its
Implementation, Addison-Wesley Publishing Co., Reading, Massachusetts
(1983).

P.A.V. Hall, J. Owlett and S.J.P. Todd, "Relations And Entities," in
Modeling In Data Base Management Systems, G.M. Nijssen, ed.,
North-Holland Publishing Co., New York (1976).

J.A. Hoffer, "A Clustering Approach To The Generation Of Subfiles For
The Design Of A Computer Database," Ph.D. dissertation, Cornell
University (January 1975).

G.R. Howe and J. Lindsay, "A Generalized Iterative Record Linkage
Computer System For Use In Medical Follow-up Studies," Computers And
Biomedical Research, Vol. 14 (1981).

G. Jaeschke and H. Schek, "Remarks On The Algebra Of Non First Normal
Form Relations," Proceedings of the Symposium On Principles Of
Database Systems, ACM SIGACT-SIGMOD, Los Angeles (March 1982).

T. Kaehler and G. Krasner, "LOOM--Large Object-Oriented Memory For
Smalltalk-80 Systems," in Smalltalk-80: Bits Of History, Words Of
Advice, Addison-Wesley Publishing Co., Reading, Mass. (1983).

R.H. Katz and T.J. Lehman, "Database Support For Versions And
Alternatives Of Large Design Files," Transactions On Software
Engineering, IEEE, Vol. SE-10, No. 2 (March 1984).

W. Kent, Data And Reality, North-Holland Publishing Co., New York
(1978).

W. Kent, "Consequences Of Assuming A Universal Relation,"
Transactions On Database Systems, ACM, Vol. 6, No. 4 (December 1981).

P.J. Leach, B.L. Stumpf, J.A. Hamilton and P.H. Levine, "UIDS As
Internal Names In A Distributed File System," Proceedings of the
First Symposium On Principles Of Distributed Computing, ACM, Ottawa
(August 1982).

D. Maier, J.D. Ullman and M.Y. Vardi, "On The Foundations Of The
Universal Relation Model," Transactions On Database Systems, ACM,
Vol. 9, No. 2 (June 1984).

D.C.J. Matthews, "An Overview Of The Poly Programming Language,"
Proceedings Of The Appin Workshop on Persistence And Data Types,
University Of Glasgow (August 1985).

D. Maier, A. Otis and A. Purdy, "Object-Oriented Database Development
At Servio Logic," Database Engineering, IEEE, Vol. 8, No. 4 (December
1985).

G. Popek, B. Walker, J. Chow, D. Edwards, C. Kline, G. Rudisin and G.
Thiel, "LOCUS: A Network Transparent, High Reliability Distributed
System," Proceedings of the Eighth Symposium On Operating Systems
Principles (December 1981).

D.M. Ritchie and K. Thompson, "The Unix Time-Sharing System,"
Communications Of The ACM, Vol. 17, No. 7 (July 1974).

L. Rowe and K. Shoens, "Data Abstraction, Views And Updates In
RIGEL," Proceedings of the SIGMOD Conference, ACM, Boston (May 1979).

J.H. Saltzer, "Naming And Binding Of Objects," in Lecture Notes In
Computer Science, Goos and Hartmanis, eds., Springer-Verlag (1978).



J.W. Schmidt, "Some High Level Language Constructs For Data Type
Relation," Transactions On Database Systems, ACM, Vol. 2, No. 3
(September 1977).

J.M. Smith and D.C.P. Smith, "Principles Of Database Conceptual
Design," Proceedings Of The NYU Symposium On Database Design, New
York (May 1978).

G.H. Sockut, "A Framework For Logical-Level Changes Within Database
Systems," Computer, IEEE Computer Society, Vol. 18, No. 5 (May 1985).

M. Stonebraker, E. Wong and P. Kreps, "The Design And Implementation
Of INGRES," Transactions On Database Systems, ACM, Vol. 1, No. 3
(September 1976).

S. Tsur and C. Zaniolo, "An Implementation Of GEM--Supporting A
Semantic Data Model On A Relational Back-End," Proceedings of the
SIGMOD Conference, ACM, Boston (June 1984).

A.I. Wasserman, "The Data Management Facilities Of PLAIN,"
Proceedings of the SIGMOD Conference, ACM, Boston (May 1979).

N. Wirth, "The Programming Language PASCAL," Acta Informatica 1, Vol.
1 (May 1971).

E.A. Wrigley (ed.), Identifying People In The Past, Edward Arnold,
London (1973).

C. Zaniolo, "The Database Language GEM," Proceedings of the SIGMOD
Conference, ACM, San Jose (May 1983).

C. Zaniolo, "The Representation And Deductive Retrieval Of Complex
Objects," Proceedings of the International Conference on Very Large
Data Bases, Stockholm (August 1985).

M.M. Zloof, "Query By Example," Proceedings of the NCC, AFIPS Press,
Montvale, N.J. (May 1975).
