Professional Documents
Culture Documents
Concurrency Control in Distributed Database Systems: Computer Corporation of America, Cambridge, Massachusetts 02139
Concurrency Control in Distributed Database Systems: Computer Corporation of America, Cambridge, Massachusetts 02139
In this paper we survey, consolidate, and present the state of the art in distributed
database concurrency control. The heart of our analysts is a decomposition of the
concurrency control problem into two major subproblems: read-write and write-write
synchronization. We describe a series of synchromzation techniques for solving each
subproblem and show how to combine these techniques into algorithms for solving the
entire concurrency control problem. Such algorithms are called "concurrency control
methods." We describe 48 principal methods, including all practical algorithms that have
appeared m the literature plus several new ones. We concentrate on the structure and
correctness of concurrency control algorithms. Issues of performance are given only
secondary treatment.
Keywords and Phrases: concurrency control, deadlock, dtstnbuted database management
systems, locking, senahzability, synchromzation, tunestamp ordering, timestamps, two-
phase commit, two-phase locking
CR Categories: 4.33, 4.35
Permission to copy without fee allor part of thismaterialm granted provided that the coplesare not made or
distributedfor directcommercial advantage, the A C M copyrightnotice and the titleof the publicationand its
date appear, and noticeis given that copying isby perrmssion of the Associationfor Computing Machinery. To
copy otherwise,or to republish,requiresa fee and/or specificpermission.
1981 ACM 0010-4892/81/0600-0185 $00.75
WRITE result
bock to dotobose bock to dotobose
Figure 2. I n c o n s i s t e n t retrieval a n o m a l y .
ROTH77]. Each site is a computer running logical data item is called a stored data
one or both of the following software mod- item. (When no confusion is possible, we
ules: a transaction manager (TM) or a data use the term data item for stored data
manager (DM). TMs supervise interactions item.) The stored copies of logical data item
between users and the DDBMS while DMs X are denoted xl . . . . . Xm. We typically use
manage the actual database. A network is x to denote an arbitrary stored data item.
a computer-to-computer communication A stored database state is an assignment
system. The network is assumed to be per- of values to the stored data items in a
fectly reliable: if site A sends a message to database.
site B, site B is guaranteed to receive the Users interact with the DDBMS by exe-
message without error. In addition, we as- cuting transactions. Transactions may be
sume that between any pair of sites the on-line queries expressed in a self-contained
network delivers messages in the order they query language, or application programs
were sent. written in a general-purpose programming
From a user's perspective, a database language. The concurrency control algo-
consists of a collection of logical data rithms we study pay no attention to the
items, denoted X, Y, Z. We leave the gran- computations performed by transactions.
ularity of logical data items unspecified; in Instead, these algorithms make all of their
practice, they may be files, records, etc. A decisions on the basis of the data items a
logical database state is an assignment of transaction reads and writes, and so details
values to the logical data items composing of the form of transactions are unimportant
a database. Each logical data item may be in our analysis. However we do assume that
stored at any DM in the system or redun- transactions represent complete and cor-
dantly at several DMs. A stored copy of a rect computations; each transaction, if ex-
transoctlon
tronsactlon
tronsoctlon
/X\
tronsoctlon
tronsoctlon
ecuted alone on an initially consistent da- DMs manage the data. (TMs do not com-
tabase, would terminate, produce correct municate with other TMs, nor do DMs
results, and leave the database consistent. communicate with other DMs.)
The logical readset (correspondingly, TMs supervise transactions. Each trans-
writeset) of a transaction is the set of logical action executed in the DDBMS is super-
data items the transaction reads (or writes). vised by a single TM, meaning that the
Similarly, stored readsets and stored transaction issues all of its database oper-
writesets are the stored data items that a ations to that TM. Any distributed com-
transaction reads and writes. putation that is needed to execute the
The correctness of a concurrency control transaction is managed by the TM.
algorithm is defined relative to users' ex- Four operations are defined at the trans-
pectations regarding transaction execution. action-TM interface. READ(X) returns
There are two correctness criteria: (1) users the value of X (a logical data item) in the
expect that each transaction submitted to current logical database state. WRITE(X,
the system will eventually be executed; (2) new-value) creates a new logical database
users expect the computation performed by state in which X has the specified new
each transaction to be the same whether it value. Since transactions are assumed to
executes alone in a dedicated system or in represent complete computations, we use
parallel with other transactions in a multi- BEGIN and END operations to bracket
programmed system. Realizing this expec- transaction executions.
tation is the principal issue in concurrency DMs manage the stored database, func-
control. tioning as backend database processors. In
A DDBMS contains four components response to commands from transactions,
(see Figure 3): transactions, TMs, DMs, TMs issue commands to DMs specifying
and data. Transactions communicate with stored data items to be read or written. The
TMs, TMs communicate with DMs, and details of the T M - D M interface constitute
T1 BEGIN; i----n ~
READ (X); WRITE(Y); END
T2 BEGIN;
READ(Y), WRITE(Z); END
T3 . BEGIN,
READ(Z), WRITE(X), END
The execution modeled in Figure 4 is serial.Each the total order, then all of T,'s operations
log is itselfserial;that is, there is no interleaving of precede all of Tfs operations in every log
operations from differenttransactions.At D M A, Ti where both appear (see Figure 5). Intui-
precedes T~ precedes T3; at D M B, % precedes T~;
tively, this says that transactions execute
and at D M C, T2 precedes T3. Therefore, TI, T2, T3
is a total order satisfyingthe definitionof serial.
serially and in the same order at all DMs.
The following execution is not serial.The logs them- Two operations conflict if they operate
selves are not serial. on the same data item and one of the op-
erations is a dm-write. The order in which
DM A: rl[xl]r2[YllW3[Xl]Wl[yl]
DM B: w2[z2]wl[y2]
operations execute is computationally sig-
DM C: w2[z3lr3[z3] nificant if and only if the operations con-
flict. To illustrate the notion of conflict,
The following execution is also not serial Although
consider a data item x and transactions T,
each log is serial, there is no total order consistent
with all logs. and Tj. If T, issues dm-read (x) and T~
issues dm-write(x), the value read by T, will
DM A: rl[x~]wl[yl]re[yl]w3[x~] (in general) differ depending on whether
DM B: w2[z2]wl[y2] the dm-read precedes or follows the dm-
DM C: w2[z3]r3[z3]
write. Similarly, if both transactions issue
Figure 5. Serial and nonserial loops. dm-write(x) operations, the final value of x
depends on which dm-write happens last.
goal of database concurrency control is to Those conflict situations are called read-
ensure that all executions are serializable. write (rw) conflicts and write-write (ww)
The only operations that access the conflicts, respectively.
stored database are din-read and din-write. The notion of conflict helps characterize
Hence it is sufficient to model an execution the equivalence of executions. Two execu-
of transactions by the execution of din- tions are computationally equivalent if (1)
reads and din-writes at the various DMs of each dm-read operation reads data item
the DDBMS. In this spirit we formally values that were produced by the same dm-
model an execution of transactions by a set writes in both executions; and (2) the final
of logs, each of which indicates the order in dm-write on each data item is the same in
which dm-reads and din-writes are proc- both executions [PAPA77, PAPA79]. Condi-
essed at one DM (see Figure 4). An execu- tion (1) ensures that each transaction reads
tion is serial if there is a total order of the same input in both executions (and
transactions such that if T, precedes Tj in therefore performs the same computation).
Combined with (2), it ensures that both (3) T, --,ww Tj if in some log of E, T, writes
executions leave the database in the same into some data item into which T~ sub-
final state. sequently writes;
From this we can characterize serializa- (4) T, --,~w~Tj if T, - * ~ T~ or T, --*w~Tj;
ble executions precisely. (5) T~ --* Tj if Tj --*~ T~ or T~ --*ww
%
Theorem 1 [PAPA77, PAPA79, STEA76]
Intuitively, -* (with any subscript)
Let T ffi (T1, ..., Tin} be a set of transac- means "in any serialization must precede."
tions and let E be an execution of these For example, T, --*~w Tj means "T, in any
transactions modeled by logs (Lb . . . . serialization must precede Tj." This inter-
Lm}. E is serializable if there exists a totalpretation follows from Theorem 1: If T,
ordering of T such that for each pair of reads x before Tj writes into x, then the
conflicting operations O~ and Oj from dis- hypothetical serialization in Theorem 1
tinct transactions T, and Tj (respectively), must have T, preceding T~.
O~precedes Oj in any log L~. . . . . Lm if and Every conflict between operations in E is
only if T~precedes T~ in the total ordering. represented by an --, relationship. There-
The total order hypothesized in Theorem fore, we can restate Theorem 1 in terms of
1 is called a serialization order. If the --,. According to Theorem 1, E is serializa-
transactions had executed serially in the ble if there is a total order of transactions
serialization order, the computation per- that is consistent with -*. This latter con-
formed by the transactions would have dition holds if and only if --, is acyclic. (A
been identical to the computation repre- relation, --*, is acyclic if there is no sequence
sented by E. T1 -* T2, T2 --* Ta. . . . . Tn-1 --* Tn such that
To attain serializability, the DDBMS T1 ffi T~.) Let us decompose --, into its
must guarantee that all executions satisfy components, --*rwrand--* ww,and restate the
the condition of Theorem 1, namely, that theorem using them.
conflicting dm-reads and dm-writes be
processed in certain relative orders. Con-
currency control is the activity of control- Theorem 2 [BERNSOa]
ling the relative order of conflicting opera-
tions; an algorithm to perform such control Let "-'>rwr and ---,ww be associated with exe-
is called a synchronization technique. To cution E. E is serializable if (a) -'*rwr and
be correct, a DDBMS must incorporate "-'>w~ are acyclic, and (b) there is a total
synchronization techniques that guarantee ordering of the transactions consistent
the conditions of Theorem 1. with all - - ~ and all ---~w relationships.
T2 BEGIN;
READ(Y); WRITE(Z); END
T3 , BEGIN,
READ(Z), WRITE(X), END
Figure 6. Deadlock.
2PL, since d m - r e a d s and prewrites usually T w o general techniques are available for
c a n n o t implicitly r e q u e s t locks. deadlock resolution: deadlock prevention
a n d deadlock detection.
3.5 D e a d l o c k Detection and Prevention
3.5.1 Deadlock Prevention
T h e preceding i m p l e m e n t a t i o n s of 2PL
force t r a n s a c t i o n s to wait for unavailable D e a d l o c k p r e v e n t i o n is a "cautious"
locks. I f this waiting is uncontrolled, dead- s c h e m e in which a t r a n s a c t i o n is r e s t a r t e d
locks can arise (see Figure 6). w h e n the s y s t e m is "afraid" t h a t deadlock
D e a d l o c k situations can be characterized m i g h t occur. T o i m p l e m e n t deadlock pre-
b y waits-for graphs [HOLT72, KING74], di- vention, 2PL schedulers are modified as
rected graphs t h a t indicate which transac- follows. W h e n a lock r e q u e s t is denied, the
tions are waiting for which o t h e r transac- scheduler tests the requesting t r a n s a c t i o n
tions. N o d e s of the g r a p h r e p r e s e n t trans- (say T,) and the t r a n s a c t i o n t h a t currently
actions, and edges r e p r e s e n t the "waiting- owns the lock (say T~). I f T, a n d Tj pass the
for" relationship: an edge is d r a w n f r o m test, T, is p e r m i t t e d to wait for T~ as usual.
t r a n s a c t i o n T, to t r a n s a c t i o n Tj if T, is Otherwise, one of the two is aborted. I f T,
waiting for a lock currently owned b y T~. is restarted, t h e deadlock p r e v e n t i o n algo-
T h e r e is a deadlock in the s y s t e m if and r i t h m is called nonpreemptive; if T~ is re-
only if the waits-for g r a p h contains a cycle started, the algorithm is called preemptive.
(see Figure 7). T h e test applied b y the scheduler m u s t
guarantee that if T, waits for Tj, then dead- Two timestamp-based deadlock preven-
lock cannot result. One simple approach is tion schemes are proposed in Rasp,78.
never to let T~ wait for Tj. This trivially Wait-Die is the nonpreemptive technique.
prevents deadlock but forces many restarts. Suppose transaction T, tries to wait for T~.
A better approach is to assign priorities If T, has lower priority than T~ (i.e., T, is
to transactions and to test priorities to de- younger than T~), then T, is permitted to
cide whether T, can wait for Tj. For exam- wait. Otherwise, it is aborted ("dies") and
ple, we could let T, wait for Tj if T, has forced to restart. It is important that T, not
lower priority than Tj (if T~ and Tj have be assigned a new timestamp when it re-
equal priorities, T, cannot wait for Tj, or starts. Wound.Wait is the preemptive
vice versa). This test prevents deadlock counterpart to Wait-Die. If T, has higher
because, for every edge ( T , Tj) in the waits- priority than Tj, then T, waits; otherwise Tj
for graph, T, has lower priority than Tj. is aborted.
Since a cycle is a path from a node to itself Both Wait-Die and Wound-Wait avoid
and since T, cannot have lower priority cyclic restart. However, in Wound-Wait an
thCan itself, no cycle can exist. old transaction may be restarted many
One problem with the preceding ap- times, while in Wait-Die old transactions
proach is that cyclic restart is possible-- never restart. It is suggested in RosE78 that
some unfortunate transaction could be con- Wound-Wait induces fewer restarts in total.
tinually restarted without ever finishing. To Care must be exercised in using preemp-
avoid this problem, Rosenkrantz et al. tive deadlock prevention with two-phase
[RosE78] propose using "timestamps" as commit: a transaction must not be aborted
priorities. Intuitively, a transaction's time- once the second phase of two-phase commit
stamp is the time at which it begins execut- has begun. If a preemptive technique
ing, so old transactions have higher priority wishes to abort Tj, it checks with Tfs TM
than young ones. and cancels the abort if Tj has entered the
The technique of Ros~.78 requires that second phase. No deadlock can result be-
each transaction be assigned a unique cause if Tj is in the second phase, it cannot
timestamp by its TM. When a transaction be waiting for any transactions.
begins, the TM reads the local clock time Preordering of resources is a deadlock
and appends a unique TM identifier to the avoidance technique that avoids restarts
low-order bits [THOM79]. The resulting altogether. This technique requires prede-
number is the desired timestamp. The TM claration of locks (each transaction obtains
also agrees not to assign another timestamp all its locks before execution). Data items
until the next clock tick. Thus timestamps are numbered and each transaction re-
assigned by different TMs differ in their quests locks one at a time in numeric order.
low-order bits (since different TMs have The priority of a transaction is the number
different identifiers), while timestamps as- of the highest numbered lock it owns. Since
signed by the same TM differ in their high- a transaction can only wait for transactions
order bits (since the TM does not use the with higher priority, no deadlocks can oc-
same clock time twice). Hence timestamps cur. In addition to requiring predeclaration,
are unique throughout the system. Note a principal disadvantage of this technique
that this algorithm does not require clocks is that it forces locks to be obtained sequen-
at different sites to be precisely synchro- tially, which tends to increase response
nized. time.
DM A DM B DM C
,(9 @ ,@
Figure 8. Multisite deadlock.
I Bufferit,I
es
!
I
i
Output all ready R's. (This may increase |
min-R-ts(x) and make some W's ready.)
I
I
Figure 9. Buffer emptying for basic T / O rw synchromzation.
I Bufferlt ]
es
I
Figure 10. Buffer emptying for basic T/O ww synchronization.
the transaction is guaranteed to execute (W-ts, value) pairs, called versions. The R-
without danger of restart. Another varia- ts's of x record the timestamps of all exe-
tion is to delay the processing of operations cuted dm-read(x) operations, and the ver-
to wait for operations with smaller time- sions record the timestamps and values of
stamps. The extreme version of this heuris- all executed dm-write(x) operations. (In
tic is conservative T/O, described in Sec- practice one cannot store R-ts's and ver-
tion 4.4. sions forever; techniques for deleting old
versions and timestamps are described in
4.2 The Thomas Write Rule Sections 4.5 and 5.2.2.)
Multiversion T / O accomplishes rw syn-
For ww synchronization the basic T / O chronization as follows (ignoring two-phase
scheduler can be optimized using an obser- commit). Let R be a dm-read(x). R is proc-
vation of THOM79. Let W be a dm-write(x),
essed by reading the version of x with larg-
and suppose ts(W) < W-ts(x). Instead of
est timestamp less than ts(R) and adding
rejecting W we can simply ignore it. We ts(R) to x's set of R-ts's; see Figure l l a . R
call this the Thomas Write Rule (TWR). is never rejected. Let W be a dm-write(x),
Intuitively, TWR applies to a dm-write that and let interval(W) be the interval from
tries to place obsolete information into the ts(W) to the smallest W-ts(x) > ts(W); 4
database. The rule guarantees that the ef-
see Figure l l b . If any R-ts(x) lies in
fect of applying a set of dm-writes to x is
interval(W), W is rejected; otherwise W is
identical to what would have happened had
output and creates a new version of x with
the dm-writes been applied in timestamp
timestamp ts(W).
order. To prove the correctness of multiversion
If TWR is used, there is no need to in- T/O, we must show that every execution is
corporate two-phase commit into the ww
equivalent to a serial execution in time-
synchronization algorithm; the ww sched- stamp order [BERNS0b]. Let R be a dm-
uler always accepts prewrites and never read(x) that is processed "out of order";
buffers dm-writes.
that is, suppose R is executed after a dm-
write(x) whose timestamp exceeds ts(R).
4.3 Multiversion T / O
Since R ignores all versions with time-
For rw synchronization the basic T/O
scheduler can be improved using multiver-
sion data items [REED78]. For each data 4Interval(W) ffi (ts(W),oo) if no W-ts(x) > ts (W)
item x there is a set of R-ts's and a set of exists.
R-timestamps 1 I J I I
5 7 15 92 95
Values Vl V2 V3 .. Vn.1 V Vn
W-timestamps 5I I
10 20I 912 913 1100
However, this new version "invalidates" the din-read of part (a),
because if the din-read had arrived after the din-write, it would
have read value V instead of Vn-1.Therefore, we must reject the
din-write.
stamps greater than ts(R), the value read > ts(P). Rw synchronization is performed
by R is identical to the value it would have as follows:
read had it been processed in timestamp
order. Now let W be a dm-write(x) that is
processed "out of order"; that is, suppose it 1. Let R be a dm-read(x). R is never rejected.
is executed after a dm-read(x) whose time- If ts(R) lies in interval(prewrite(x))
for some buffered prewrite(x), then R is
stamp exceeds ts(W). Since W was not
rejected, there exists a version of x with buffered. Else R is output.
timestamp TS such that ts(W) < TS < 2. Let P be a prewrite(x). If some R-ts(x)
ts(dm-read). Again the effect is identical to lies in interval(P), P is rejected. Else P
a timestamp-ordered execution. is buffered.
For ww synchronization, multiversion 3. Let W be a din-write(x). W is always
T / O is essentially an embellished version output immediately.
of TWR. A dm-write(x) always creates a 4. When W is output, its prewrite is debuf-
new version of x with timestamp ts(dm- fered, and the buffered din-reads are re-
write) and is never rejected. tested to see if they can now be output.
Integrating two-phase commit requires See Figure 12.
that dm-reads and prewrites (but not dm-
writes) be buffered as in basic T/O. Let P Two-phase commit is not an issue for ww
be a buffered prewrite(x): interval(P) is the synchronization, since dm-writes are never
interval from ts(P) to the smallest W-ts(x) rejected for ww synchronization.
1
I o t ut ready 's'l
2. Let W b e a dm-write(x). Ifts(W) :> min-
R-ts(TM) for any TM, W is buffered.
Else W is output.
3. When R or W is output or buffered, this
may increase min-R-ts(TM) or min-W-
ts(TM); buffered operations are retested
to see if they can now be output.
C1 writeset = {yl, y2} C2 writeset = {yl, y2, z2, za} Ca writeset ffi {xl, z2, za}
C~ there is a vertical edge between r~ and cally. The analysis produces a table indi-
w~; (2) for each pair of classes C, and Cj cating which horizontal and vertical edges
(with i ~ j ) there is a horizontal edge require synchronization and which do not.
between w~ and wj if and only if writeset(C~) This table, like class definitions, is distrib-
intersects writeset(C~); (3) for each pair of uted in advance to all schedulers that
classes C, and C~ (with i # j ) there is a need it.
diagonal edge between r~ and w~if and only
if readset(C~) intersects writeset(C~). 4.5 Timestamp Management
Intuitively, a horizontal edge indicates A common criticism of T / O schedulers is
that a scheduler S may be forced to delay that too much memory is needed to store
dm-writes for purposes of ww synchroniza- timestamps. This problem can be overcome
tion. Suppose classes C~ and C~ are con- by "forgetting" old timestamps.
nected by a horizontal edge (w,, wj), indi- Timestamps are used in basic T / O to
cating that their class writesets intersect. If reject operations that "arrive late," for ex-
S receives a dm-write from C,, it must delay ample, to reject a dm-read(x) with time-
the dm-write until it receives all dm-writes stamp TS1 that arrives after a dm-write(x)
with smaller timestamps from Cj. Similarly, with timestamp TS2, where TS1 <: TS2. In
a diagonal edge indicates that S may need principle, TS1 and TS2 can differ by an
to delay operations for rw synchronization. arbitrary amount. However, in practice it is
Conflict graph analysis improves the sit- unlikely that these timestamps will differ
uation by identifying interclass conflicts by more than a few minutes. Consequently,
that cannot cause nonserializable behavior. timestamps can be stored in small tables
This corresponds to identifying horizontal which are periodically purged.
and diagonal edges that do not require syn- R-ts's are stored in the R-table with en-
chronization. In particular, schedulers need tries of the form (x, R-ts); for any data
only synchronize dm-writes from C, and Cj item x, there is at most one entry. In addi-
if either (1) the horizontal edge between w~ tion, a variable, R-min, tells the maximum
and wj is embedded in a cycle of the conflict value of any timestamp that has been
graph; or (2) portions of the intersection of purged from the table. To find R-ts(x), a
C~'s writeset and C/s writeset are stored at scheduler searches the R-table for an (x,
two or more DMs [BERN80C]. That is, if TS) entry. If such an entry is found, R-
conditions (1) and (2) do not hold, schedu- ts(x) = TS; otherwise, R-ts(x) _ R-rain. To
ler S need not process dm-writes from C~ err on the side of safety, the scheduler
and Cj in timestamp order. Similarly, dm- assumes R-ts(x) ffi R-rain. To update R-
reads from C, and dm-writes from Cj need ts(x), the scheduler modifies the (x, TS)
only be processed in timestamp order if entry, if one exists. Otherwise, a new entry
either (1) the diagonal edge between r, and is created and added to the table. When the
wj is embedded in a cycle of the conflict R-table is full, the scheduler selects an ap-
graph; or (2) portions of the intersection of propriate value for R-rain and deletes all
C,'s readset and Cj's writeset are stored at entries from the table with smaller time-
two or more DMs. stamp. W-ts's are managed similarly, and
Since classes are defined statically, con- analogous techniques can be devised for
flict graph analysis is also performed stati- multiversion databases.
Maintaining timestamps for conservative have useful properties that cannot be at-
T / O is even cheaper, since conservative tained by pure 2PL or T / O methods.
T / O requires only timestamped operations,
not timestamped data. If conservative T / O 5.1 Pure 2PL Methods
is used for rw synchronization, the R-ts's of
data items may be discarded. If conserva- T h e 2PL synchronization techniques of
tive T / O is used for both rw and ww syn- Section 3 can be integrated to form 12
chronization, W-ts's can be eliminated also. principal 2P L methods:
Method rw t e c h m q u e ww technique
5. INTEGRATED CONCURRENCY CONTROL 1 Basic2PL Basic 2PL
METHODS 2 Basic2PL Primary copy 2PL
3 Basic2PL Voting 2PL
An integrated concurrency control method 4 Basic2PL Centralized 2PL
consists of two components--an rw and a 5 Primarycopy 2PL Basic2PL
ww synchronization technique--plus an in- 6 Primarycopy 2PL Primarycopy 2PL
terface between the components that at- 7 Primarycopy 2PL Voting2PL
8 Primarycopy 2PL Centralized2PL
tains condition (b) of T h e o r e m 2: a total 9 Centralized2PL Basic 2PL
ordering of the transactions consistent with 10 Centralized2PL Primarycopy 2PL
all ---*~w~ and --*ww relationships. In this 11 Centralized2PL Voting2PL
section we list 48 concurrency control meth- 12 Centralized2PL Centralized2PL
ods t h a t can be constructed using the tech- Each method can be further refined by the
niques of Sections 3 and 4. choice of deadlock resolution technique
Approximately 20 concurrency control (see Section 3.5).
methods have been described in the litera- T h e interface between each 2PL rw tech-
ture. Virtually all of them use a single nique and each 2PL ww technique is
synchronization technique {either 2PL or straightforward. It need only guarantee
T / O ) for both rw and ww synchronization. t ha t "two-phasedness" is preserved, mean-
Indeed, most methods use the same varia- ing t h at all locks needed for both the rw
tion of a single technique for both kinds of and ww technique must be obtained before
synchronization. However, such homoge- any lock is released by either technique.
neity is neither necessary nor especially
desirable.
5. 1.1 Methods Using Basic 2PL for rw
For example, the analysis of Section 3.2
Synchronization
suggests th at using basic 2PL for rw syn-
chronization and primary copy 2PL for ww Methods 1-4 use basic 2PL for rw synchro-
synchronization might be superior to using nization. Consider a logical data item X
basic 2PL (or primary copy 2PL) for both. with copies xl, . . . , Xm. TO read X, a trans-
More outlandish combinations may be even action sends a dm-read to any DM t hat
better. For example, one can combine basic stores a copy of X. This dm-read implicitly
2PL with TWR. In this method ww con- requests a readlock on the copy of X at t hat
flicts never cause transactions to be delayed DM. To write X, a transaction sends pre-
or restarted; multiple transactions can write writes to every DM t hat stores a copy of X.
into the same data items concurrently (see These prewrites implicitly request write-
Section 5.3). locks on the corresponding copies of X. For
In Sections 5.1 and 5.2 we describe meth- all four methods, these writelocks conflict
ods that use 2PL and T / O techniques for with readlocks on the same copy, and may
both rw and ww synchronization. T he con- also conflict with other writelocks on the
currency control methods in these sections same copy, depending on the specific ww
are easy to describe given the material of synchronization technique used by the
Sections 3 and 4; the description of each method.
method is little more than a description of Since locking conflict rules for writelocks
each component technique. In Section 5.3 will vary from copy to copy, we distinguish
we list 24 concurrency control methods that three types. An rw writelock only conflicts
combine 2PL and T / O techniques. As we with readlocks on the same data item. A
show in Section 5.3, methods of this type ww writelock only conflicts with ww write-
Computing Surveys, Vol 13, No. 2, June 1981
Concurrency Control in Database Systems 207
locks on the s a m e d a t a item. A n d an r w w explicit lock request to Xl'S DM; w h e n t h e
writelock conflicts with readlocks, ww lock is granted t h e t r a n s a c t i o n can r e a d any
writelocks, and rww writelocks. T h u s , using copy of X.
basic 2PL for rw synchronization, every T o write into X, a t r a n s a c t i o n sends pre-
prewrite sets rw writelocks, and m a y set writes to every D M t h a t stores a copy of X.
stronger locks depending on the ww tech- A prewrite(xl) implicitly requests a n rw
nique. writelock. Prewrites on o t h e r copies of X
m a y also request writelocks depending on
Method 1: Basic 2PL for w w synchroni-
the ww technique.
zation. All writelocks are rww writelocks;
t h a t is, for i ffi 1, . . . , m, a writelock on x,
Method 5: Basic 2PL for w w synchroni-
conflicts with either a readlock or a write- zation. For i ffi 2 . . . . , m, prewrite(x,) re-
lock on x , T h i s is the " s t a n d a r d " distrib- quests a ww writelock. Since the writelock
u t e d i m p l e m e n t a t i o n of 2PL. on xz m u s t also conflict with readlocks on
Method 2: Primary copy 2PL for w w xl, prewrite(xl) requests a n rww writelock.
synchronization. Writelocks only conflict Method 6: Primary copy 2PL for w w
on the p r i m a r y copy. An rww writelock is synchronization. Prewrite(xl) requests an
used on the p r i m a r y copy, while rw write- rww writelock on xl. Prewrites on o t h e r
locks are used on the others.
copies do not request a n y locks. T h i s
Method 3: Voting 2PL for ww synchro- m e t h o d was originally p r o p o s e d b y SToN79
nization. A D M responds to a prewrite(x,) and is used in D i s t r ~ u t e d I N G R E S
b y attempting to set an rww writelock on [STOs77].
x , However, if a n o t h e r transaction already Method 7: Voting 2PL for w w synchro-
owns an rww writelock on x,, the D M only nization. W h e n a scheduler receives a pre-
sets an rw writelock and leaves a request write(x,) for i ~ 1, it tries to set a ww
for an rww writelock pending. A transaction writelock on x,. W h e n it receives a pre-
can write into a n y copy of X after it obtains write(x1), it tries to set an rww writelock on
rww writelocks on a m a j o r i t y of copies. T h i s x~; if it cannot, t h e n it sets a n rw writelock
is similar to the m e t h o d p r o p o s e d in on xz (if possible) before waiting for the
GIFF79. rww writelock. A t r a n s a c t i o n can write into
Method 4: Centralized 2PL for w w syn- every copy of X after it obtains a ww (or
chronization. T o write into X, a t r a n s a c t i o n rww) writelock on a m a j o r i t y of copies
m u s t first explicitly request a ww writelock of X.
on X f r o m a centralized 2PL scheduler. T h e Method 8: Centralized 2PL for ww syn-
rw writelocks set b y prewrites never conflict chronization. T r a n s a c t i o n s obtain ww
with each other. writelocks f r o m a centralized 2 P L schedu-
In all four methods, readlocks are explic- ler. T h u s a prewrite(xl) requests an rw
itly released b y lock releases while write- writelock on x~; for i ffi 2 , . . . , m, prewrite(x,)
locks are implicitly released b y dm-writes. does not request a n y lock.
Lock releases m a y be t r a n s m i t t e d in paral-
Lock releases for M e t h o d s 5-8 are han-
lel with dm-writes. In M e t h o d 4, after all
dled as in Section 5.1.1.
din-writes h a v e b e e n executed, additional
lock releases m u s t be sent to the centralized
scheduler to release writelocks held there. 5 1.3 Methods Using Centrahzed 2PL for rw
Synchroniza tion
5.1.2 Methods Using Primary Copy 2PL for rw
T h e remaining 2 P L m e t h o d s use central-
Synchromzahon
ized 2PL for rw synchronization. Before
M e t h o d s 5-8 use p r i m a r y copy 2PL for rw reading (or writing) a n y copy of logical d a t a
synchronization. Consider a logical d a t a i t e m X, a t r a n s a c t i o n m u s t obtain a read-
i t e m X with copies xl . . . . . xm, and assume lock (or rw writelock) on X f r o m a central-
xz is the p r i m a r y copy. T o r e a d X, a trans- ized 2PL scheduler. Before writing X, the
action m u s t obtain a readlock on x~. I t m a y t r a n s a c t i o n m u s t also send prewrites to ev-
obtain this lock b y issuing a dm-read(Xl). ery D M t h a t stores a c o p y of X. S o m e
Alternatively, the t r a n s a c t i o n can send an of these prewrites implicitly r e q u e s t ww
Method 5: Basic T / O for ww synchron- Like Method 5, this method only requires
ization. Condition (A) is ts(P) < max- that the maximum R-ts(x) be stored, and it
W-ts(x) and condition (B) is ts(W) > supports systematic "forgetting" of old ver-
min-P-ts(x). Condition (A) implies that sions described above.
interval(P) = (ts(P), ~); some R-ts(x) lies 5.2.3 Methods Using Conservative T/O for rw
in that interval if and only if ts(P) < max- Synchronization
imum R-ts(x). Thus step 2 simplifies to
The remaining T / O methods use conserv-
2. If ts(P) < max W-ts(x) or ts(P) < max ative T / O for rw synchronization. Methods
ComputingSurveys, Vol. 13, No. 2, June 1981
210 P. A. Bernstein a n d N. Goodman
Consider data items x and y with the foUowmg
versions:
Values 0 100
I I v
W-tlmestamps 0 lo0
Values 0
y I r
W-timestamps 0
Values 0 lo0
x I I v
W-tmaestamps 0 100
Values 0 5O
y J I
W-timestamps 0 5O
9 and 10 require W-ts's for each data item, This method is essentially the SDD-1
and Method 11 requires a set of versions for concurrency control [BERN80d], although
each data item. Method 12 needs no data in SDD-1 the method is refined in several
item timestamps at all. Define R, P, W and ways. SDD-1 uses classes and conflict graph
min-P-ts as in Section 5.2.1; let min-R- analysis to reduce communication and
ts(TM) (or min-W-ts(TM)) be the mini- increase the level of concurrency. Also,
mum timestamp of any buffered dm-read SDD-1 requires predeclaration of read-sets
(or dm-write) from TM. and only enforces the conservative sched-
uling on dm-reads. By doing so, it forces
1. If ts(R) > min-W-ts(TM) for any TM, R dm-reads to wait for dm-writes, but does
is buffered; else it is output. not insist that dm-writes wait for all dm-
2. If condition (A) holds, P is rejected. Else reads with smaller timestamps. Hence dm-
P is buffered. reads can be rejected in SDD-1.
3. I f t s ( W ) > min-R-ts(TM) for any TM or
condition (B) holds, W is buffered. Else Method 11: Multiversion T / O for w w
W is output. synchronization. Conditions (A) and (B)
4. When W is output, its prewrite is debuf- are null. When W is output, it creates a
fered. When R or W is output or new version of x with timestamp ts(W).
buffered, buffered dm-reads and dm- When R is output it reads the version with
writes are retested to see if any can now largest timestamp less than ts(R).
be output. This method can be optimized by noting
the multiversion T / O "automatically" pre-
Method 9: Basic T / O for w w synchroni- vents dm-reads from being rejected, and
zation. Condition (A) is ts(P) < W-ts(x), makes it unnecessary to buffer dm-writes.
and condition (B) is ts(W) > min-P-ts(x). Thus step 3 can be simplified to
Method 10: T W R for w w synchroniza- 3. W is output immediately.
tion. Conditions (A) and (B) are null. How-
ever, if ts(W) < W-ts(x), W has no effect Method 12: Conservative T / O for w w
on the database. synchronization. Condition (A) is null; con-
dition (B) is ts(W) > min-W-ts(TM) for precedes T,+l's locked point, and (b) T,
some TM. The effect is to output W if the released a lock on some data item x before
scheduler has received all operations with T,+I obtained a lock on x. Let L be the L-
timestamps less than ts(W) that it will ever ts(x) retrieved by TI+I. Then ts(T,) < L <
receive. Method 12 has been proposed in ts(T,+~), and by induction ts(Ta) < ts(Tn).
CI~EN80, KANE79, and SHAP77a.
5.3.2 Mixed Methods Using 2PL for rw Syn-
chrontzation
5.3 Mixed 2PL and T / O Methods
There are 12 principal methods in which
The major difficulty in constructing meth- 2PL is used for rw synchronization and
ods that combine 2PL and T/O lies in de- T / O is used for ww synchronization:
veloping the interface between the two
techniques. Each technique guarantees an Method rw technique ww technique
acyclic --*~ (or ---~) relation when used
1 Basic 2PL Basic T / O
for rw (or ww) synchronization. The inter- 2 Basic 2PL TWR
face between a 2PL and a T/O technique 3 Basic 2PL Multiversion T / O
must guarantee that the combined --* rela- 4 Basic 2PL Conservative T / O
tion (i.e., --*~ U --->v,~)remains acyclic. That 5 Primary copy 2PL Basic T / O
is, the interface must ensure that the seri- 6 Prnnary copy 2PL TWR
7 Primary copy 2PL Multiversion T / O
alization order induced by the rw technique 8 Primary copy 2PL Conservative T / O
is consistent with that induced by the ww 9 Centrahzed 2PL Basic T / O
technique. In Section 5.3.1 we describe an 10 Centralized 2PL TWR
interface that makes this guarantee. Given 11 Centrahzed 2PL Multiversion T / O
such an interface, any 2PL technique can 12 Centralized 2PL Conservatwe T / O
be integrated with any T/O technique. Sec-
tions 5.3.2 and 5.3.3 describe such methods. Method 2 best exemplifies this class of
methods, and it is the only one we describe
in detail. Method 2 requires that every
5.3. 1 The Interface
stored data item have an L-ts and a W-ts.
The serialization order induced by any 2PL (One timestamp can serve both roles, but
technique is determined by the locked we do not consider this optimization here.)
points of the transactions that have been Let X be a logical data item with copies
synchronized (see Section 3). The seriali- xl . . . . , xm. To read X, transaction T issues
zation order induced by any T / O technique a dm-read on any copy of X, say x,. This
is determined by the timestamps of the dm-read implicitly requests a readlock on
synchronized transactions. So to interface x, and when the readlock is granted, L-
2PL and T / O we use locked points to in- ts(x,) is returned to T. To write into X, T
duce timestamps [BERN80b]. issues prewrites on every copy of X. These
Associated with each data item is a lock prewrites implicitly request rw writelocks
timestamp, L-ts(x). When a transaction T on the corresponding copies, and as each
sets a lock of x, it simultaneously retrieves writelock is granted, the corresponding L-ts
L-ts(x). When T reaches its locked point it is returned to T. When T has obtained all
is assigned a timestamp, ts(T), greater than of its locks, ts(T) is calculated as in Section
any L-ts it retrieved. When T releases 5.3.1. T attaches ts(T) to its dm-writes,
its lock on x, it updates L-ts(x) to be which are then sent.
max(L-ts(x), ts(T)). Dm-writes are processed using TWR. Let
Timestamps generated in this way are W be dm-write(xj). If ts(W) > W-ts(xj),
consistent with the serialization order in- the dm-write is processed as usual (xj is
duced by 2PL. That is, ts(Tj) < ts(Tk) if Tj updated). If, however, ts(W) < W-ts(xj), W
must precede Tk in any serialization in- is ignored.
duced by 2PL. To see this, let T1 and Tn be The interesting property of this method
a pair of transactions such that T~ must is that writelocks never conflict with write-
precede T , in any serialization. Thus there locks. The writelocks obtained by prewrites
exist transactions T1, T2 .... T,q, T , such are only used for rw synchronization, and
that for i = 1. . . . , n-1 (a) T,'s locked point only conflict with readlocks. This permits
Computing Sm~zeys,Vol. 13, No. 2, June 1981
212 . P. A. B e r n s t e i n a n d N. G o o d m a n
action T issues its END, the DBMS decides when they terminate. The main problem in
whether or not to certify, and thereby com- the summarization algorithm is avoiding
mit, T. the need to store information about al-
To understand how this decision is made, ready-certified transactions. The main
we must distinguish between "total" and problem in the certification algorithm is
"committed" executions. A total execution obtaining a consistent copy of the summary
of transactions includes the execution of all information. To do so the certification al-
operations processed by the system up to a gorithm often must perform some synchro-
particular moment. The committed execu- nization of its own, the cost of which must
tion is the portion of the total execution be included in the cost of the entire method.
that only includes din-reads and din-writes
processed on behalf of committed transac- A1.2 Certificatton Using the--~ Relatton
tions. That is, the committed execution is
the total execution that would result from One certification method is to construct the
aborting all active transactions (and not ---) relation as dm-reads and prewrites are
restarting them). processed. To certify a transaction, the sys-
When T issues its END, the system tests tem checks that ---> is acyclic [BADA79,
BAYE80, CASA79]. s
whether the committed execution aug-
mented by T's execution is serializable, that To construct --% each site remembers the
is, whether after committing T the resulting most recent transaction that read or wrote
committed execution would still be serial- each data item. Suppose transactions T,
izable. If so, T is committed; otherwise T is and T~ were the last transactions to (re-
restarted. spectively) read and write data item x. If
There are two properties of certification transaction Tk now issues a din-read(x),
that distinguish it from other approaches. Tj --* Tk is added to the summary infor-
First, synchronization is accomplished en- mation for the site and Tk replaces T, as
tirely by restarts, never by blocking. And the last transaction to have read x. Thus
second, the decision to restart or not is pieces of--* are distributed among the sites,
made after the transaction has finished ex- reflecting run-time conflicts at each site.
ecuting. No concurrency control method To certify a transaction, the system must
discussed in Sections 3-5 satisifies both check that the transaction does not lie on
these properties. a cycle in --* (see Theorem 2, Section 2).
The rationale for certification is based on Guaranteeing acyclicity is sufficient to
an optimistic assumption regarding run- guarantee serializability.
time conflicts: if very few run-time conflicts There are two problems with this ap-
are expected, assume that most executions proach. First, it is in general not correct to
are serializable. By processing din-reads delete a certified transaction from --), even
and prewrites without synchronization, the if all of its updates have been committed.
concurrency control method never delays a For example, if T, --) Tj and T, is active but
transaction while it is being processed. Only Tj is committed, it is still possible for Tj ---)
a (fast, it is hoped) certification test when T, to develop; deleting Tj would then cause
the transaction terminates is required. the cycle T, --~ Tj ---) T, to go unnoticed
Given optimistic transaction behavior, the when T, is certified. However, it is ob-
test will usually result in committing the viously not feasible to allow ---) to grow
transaction, so there are very few restarts. indefinitely. This problem is solved by Ca-
Therefore certification simultaneously sanova [CASA79] by a method of encoding
avoids blocking and restarts in optimistic information about committed transactions
situations. in space proportional to the number of ac-
A certification concurrency control tive transactions.
method must include a summarization al- A second problem is that all sites must
gorithm for storing information about dm- be checked to certify any transaction. Even
reads and prewrites when they are proc-
essed and a certification algorithm for us- 8 In BAYE80 certification is only used for rw synchro-
ing that information to certify transactions nization whereas 2PL is used for ww synchronization.
sites at which the transaction never ac- action terminates, the system augments the
cessed data must participate in the cycle update listwith the.listof data items read
checking of--*. For example, suppose we and their timestamps at the time they were
want to certify transaction T. T might be read. In addition, the timestamp of the
involved in a cycle T --. T1 --) T2 --) . . . --* transaction itselfis added to the update list.
Tn-1 --> Tn ---> T, where each conflict Tk --) This completes the firstphase of execution.
Tk+l occurred at a different site. Possibly T In the second phase the update list is
only accessed data at one site; yet the --) sent to every site.Each site (including the
relation must be examined at n sites to site that produced the update list)votes on
certify T. This problem is currently un- the update list.Intuitively speaking, a site
solved, as far as we know. T h a t is, any votes yes on an update listif it can certify
correct certifier based on this approach of the transaction that produced it. After a
checking cycles in --) must access the --~ site votes yes, the update listis said to be
relation at all sites to certify each and every pending at that site.To cast the vote, the
transaction. Until this problem is solved, site sends a message to the transaction's
we judge the certification approach to be home site,which, when it receives a major-
impractical in a distributed environment. ity of yes or no votes, informs all sites of
the outcome. If a majority voted yes, then
all sitesare required to commit the update,
A2. Thomas' Majority Consensus Algorithm which is then installed using T W R . If a
A2.1 The Algorithm majority voted no, all sites are told to dis-
card the update, and the transaction is re-
One of the first published algorithms for started.
distributed concurrency control is a certifi- The rule that determines when a sitem a y
cation m e t h o d described in THOM79. vote "yes" on a transaction is pivotal to the
T h o m a s introduced several i m p o r t a n t syn- correctness of the algorithm. To vote on an
chronization techniques in t h a t algorithm, update list U, a site compares the time-
including the T h o m a s Write Rule (Section stamp of each data item in the readset of U
3.2.3), majority voting (Section 3.1.1), and with the timestamp of that same data item
certification (Appendix A1). Although in the site'slocal database. If any data item
these techniques are valuable when consid- has a timestamp in the database different
ered in isolation, we argue t h a t the overall from that in U, the sitevotes no. Otherwise,
T h o m a s algorithm is not suitable for dis- the site compares the readset and writeset
tributed databases. We first describe the of U with the readset and writeset of each
algorithm and t h e n c o m m e n t on its appli- pending update listat that site,and ifthere
cation to distributed databases. is no rw conflict between U and any of the
T h o m a s ' algorithm assumes a fully re- pending update lists,it votes yes. If there is
dundant database, with every logical data an rw conflict between U and one of those
item stored at every site. Each copy carries pending requests, the site votes pass (ab-
the timestamp of the last transaction t h a t stain) if U's timestamp is larger than that
wrote into it. of all pending update lists with which it
Transactions execute in two phases. In conflicts. If there is an rw conflict but U's
the first phase each transaction executes timestamp is smaller than that of the con-
locally at one site called the transaction's flicting pending update list,then it sets U
home site. Since the database is fully re- aside on a wait queue and triesagain when
dundant, any site can serve as the h o m e the conflictingrequest has either been com-
site for any transaction. T h e transaction is mitted or aborted at that site.
assigned a unique timestamp when it begins The voting rule is essentially a certifica-
executing. During execution it keeps a rec- tion procedure. By making the timestamp
ord of the timestamp of each data item it comparison, a siteis checking that the read-
reads and, when its executes a write on a set was not written into since the transac-
data item, processes the write by recording tion read it. If the comparisons are satisfied,
the new value in an update list. N o t e t h a t the situation is as if the transaction h a d
each transaction must read a copy of a data locked its readset a t t h a t site and held the
item before it writes into it. W h e n the trans- locks until it voted. T h e voting rule is
Computing Stttw~ys, Vgl. 13, N% 2, June 1981
216 P. A. B e r n s t e i n a n d N. G o o d m a n
thereby guaranteeing rw synchronization The second part of the voting rule guaran-
with a certification rule approximating rw tees that for every pair of conflicting trans-
2PL. (This fact is proved precisely in actions that received a majority of yes
BEm~79b.) votes, all sites that voted yes on both trans-
The second part of the voting rule, in actions voted on the two transactions in the
which U is checked for rw conflicts against same order. This makes the certification
pending update lists, guarantees that con- step behave just as it would if it were cen-
flicting requests are not certified concur- tralized, thereby avoiding the problem ex-
rently. An example illustrates the problem. emplified in the previous paragraph.
Suppose T1 reads X and Y, and writes Y,
while T2 reads X and Y, and writes X. A2.3 Partially Redundant Databases
Suppose T1 and T2 execute at sites A and
B, respectively, and X and Y have time- For the majority consensus algorithm to be
stamps of 0 at both sites. Assume that T1 useful in a distributed database environ-
and T~ execute concurrently and produce ment, it must be generalized to operate
update lists ready for voting at about the correctly when the database is only par-
same time. Either T~ or T2 must be re- tially redundant. There is reason to doubt
started, since neither read the other's out- that such a generalization can be accom-
put; if they were both committed, the result plished without either serious degradation
would be nonserializable. However both of performance or a complete change in the
Tl's and T2's update lists will (concurrently) set of techniques that are used.
satisfy the timestamp comparison at both First, the majority consensus decision
A and B. What stops them from both ob- rule apparently must be dropped, since the
taining unanimous yes votes is the second voting algorithm depends on the fact that
part of the voting rule. After a site votes on all sites perform exactly the same certifi-
one of the transactions, it is prevented from cation test. In a partially redundant data-
voting on the other transaction until the base, each site would only be comparing
first is no longer pending. Thus it is not the timestamps of the data items stored at
possible to certify conflicting transactions that site, and the significance of the major-
concurrently. (We note that this problem ity vote would vanish.
of concurrent certification exists in the al- If majority voting cannot be used to syn-
gorithms of Section A1.2, too. This is yet chronize concurrent certification tests, ap-
another technical difficulty with the certi- parently some kind of mutual exclusion
fication approach in a distributed environ- mechanism must be used instead. Its pur-
ment.) pose would be to prevent the concurrent,
With the second part of the voting rule, and therefore potentially incorrect, certifi-
the algorithm behaves as if the certification cation of two conflicting transactions, and
step were atomically executed at a primary would amount to locking. The use of locks
site. If certification were centralized at a for synchronizing the certification step is
primary site, the certification step at the not in the spirit of Thomas' algorithm, since
primary site would serve the same role as a main goal of the algorithm was to avoid
the majority decision in the voting case. locking. However, it is worth examining
such a locking mechanism to see how cer-
tification can be correctly accomplished in
A2.2 Correctness
a partially redundant database.
To process a transaction T, a site pro-
No simple proof of the serializability of duces an update list as usual. However,
Thomas' algorithm has ever been demon- since the database is partially redundant, it
strated, although Thomas provided a de- may be necessary to read portions of T's
tailed "plausibility" argument in THOM79. readset from other sites. After T termi-
The first part of the voting rule can cor- nates, its update list is sent to every site
rectly be used in a centralized concurrency that contains part of T's readset or writeset.
control method since it implies 2PL To certify an update list, a site first sets
[BERN79b], and a centralized method based local locks on the readset and writeset, and
on this approach was proposed in KUNG81. then (as in the fully redundant case) it
Computing Surveys, Vol. 13, No 2, June 1981
Concurrency Control in Database Systems * 217
compares the update list's timestamps with correctly running, the algorithm runs
the database's timestamps. If they are iden- smoothly. Thus, handling a site failure is
tical, it votes yes; otherwise it votes no. A free, insofar as the voting procedure is con-
unanimous vote of yes is needed to commit cerned.
the updates. Local locks cannot be released However, from current knowledge, this
until the voting decision is completed. justification is not compelling for several
While this version of Thomas' algorithm reasons. First, although there is no cost
for partially redundant data works cor- when a site fails, substantial effort may be
rectly, its performance is inferior to stand- required when a site recovers. A centralized
ard 2PL. This algorithm requires that the algorithm using backup sites, as in
same locks be set as in 2PL, and the same ALSB76a, lacks the symmetry of Thomas'
deadlocks can arise. Yet the probability of algorithm, but may well be more efficient
restart is higher than in 2PL, because even due to the simplicity of site recovery. In
after all locks are obtained the certification addition, the majority consensus algorithm
step can still vote no (which cannot happen does not consider the problem of atomic
in 2PL). commitment and it is unclear how one
One can improve this algorithm by des- would integrate two-phase commit into the
ignating a primary copy of each data item algorithm.
and only performing the timestamp com- Overall, the reliability threats that are
parison against the primary copy, making handled by the majority consensus algo-
it analogous to primary copy 2PL. How- rithm have not been explicitly listed, and
ever, for the same reasons as above, we alternative solutions have not been ana-
would expect primary copy 2PL to outper- lyzed. While voting is certainly a possible
form this version of Thomas' algorithm too. technique for obtaining a measure of relia-
We therefore must leave open the prob- bility, the circumstances under which it is
lem of producing an efficient version of cost-effective are unknown.
Thomas' algorithm for a partially redun-
dant database. A3. Ellis' Ring Algorithm
A2.4 Performance
Another early solution to the problem of
distributed database concurrency control is
Even in the fully redundant case, the per- the ring algorithm [ELLI77]. Ellis was prin-
formance of the majority consensus algo- cipally interested in a proof technique,
rithm is not very good. First, repeating the called L systems, for proving the correct-
certification and conflict detection at each ness of concurrent algorithms. He devel-
site is more than is needed to obtain seri- oped his concurrency control method pri-
alizability: a centralized certifier would marily as an example to illustrate L-system
work just as well and would only require proofs, and never made claims about its
that certification be performed at one site. performance. Because the algorithm was
Second, the algorithm is quite prone to only intended to illustrate mathematical
restarts when there are run-time conflicts, techniques, Ellis imposed a number of re-
since restarts are the only tactic available strictions on the algorithm for mathemati-
for synchronizing transactions, and so will cal convenience, which make it infeasible in
only perform well under the most optimistic practice. Nonetheless, the algorithm has
circumstances. Finally, even in optimistic received considerable attention in the lit-
situations, the analysis in GARC79a indi- erature, and in the interest of completeness,
cates that centralized 2PL outperforms the we briefly discuss it.
majority consensus algorithm. Ellis' algorithm solves the distributed
concurrency control problem with the fol-
A2.5 Rehablhty lowing restrictions:
Despite the performance problems of the (1) The database must be fully redundant.
majority consensus algorithm, one can try (2) The communication medium must be
to justify the algorithms on reliability a ring, so each site can only communi-
grounds. As long as a majority of sites are cate with its successor on the ring.
control mechanisms in a system for dis- trollers," in Proe. 4th Berkeley Workshop
tributed databases (SDD-1)," in ACM D~stnbuted Databases and Computer
Trans. Database Syst. 5, 1 (March 1980), Networks, Aug. 1979.
52-68. GARC79C GARCIA-MOLINA, H. "Centrahzed con-
BERN80d BERNSTEIN, P, SHIPMAN, D. W., AND trol update algorithms for fully redundant
ROTHNIE, J.B. "Concurrency control m distributed databases," in Proe. 1st Int.
a system for distributed databases (SDD- Conf. D~stributed Computing Systems
1)," in ACM Trans. Database Syst 5, 1 (IEEE), New York, Oct. 1979, pp. 699-
(March 1980), 18-51. 705.
BERN81 BERNSTEIN, P. A, GOODMAN,N., WONG, GARD77 GARDARIN,G., ANDLEBAUX,P. "Sched-
E, REEVE, C. L., AND ROTHNIE, J. uling algorithms for avoiding inconsis-
B. "Query processing m SDD-I," ACM tency in large databases," in Proc. 1977
Trans. Database Syst. 6, 2, to appear. Int. Conf. Very Large Data Bases
BREI79 BREITWIESER, H., AND KERSTEN, (IEEE), New York, pp. 501-516.
U. "Transaction and catalog manage- GELE78 GELEMBE,E., ANDSEVCIE,K. "Analysis
merit of the distributed file management of update synchronization for multiple
system DISCO," in Proc. Very Large copy databases," in Proc. 3rd Berkeley
Data Bases, Rm de Janerio, 1979 Workshop Distributed Databases and
BRIN73 BRINCH-HANSEN, P. Operating system Computer Networks, Aug. 1978.
pnnc~ples, Prentice-Hall, Englewood GIFF79 GIFFORD, D. K. "Weighted voting for
Cliffs, N. J., 1973. rephcated data," in Proc. 7th Syrup. Op-
CASA79 CASANOVA, M. A. "The concurrency erating Systems Principles, Dec. 1979.
control problem for database systems," GRAY75 GRAY, J. N., LORIE, R. A., PUTZULO,G.
Ph.D. dissertation, Harvard Univ., Tech. R., AND TRAIGER,I.L. "Granularity of
Rep. TR-17-79, Center for Research in locks and degrees of consistency in a
Computmg Technology, 1979. shared database," IBM Res. Rep. RJ1654,
CHAM74 CHAMBERLIN, D. D., BOYCE, R. F., AND Sept. 1975.
TRAIGER,I.L. "A deadlock-free scheme GRAY78 GRAY, J . N . "Notes on database oper-
for resource allocation in a database en- ating systems," in Operating Systems: An
wronment," Info. Proc. 74, North-Hol- Advanced Course, vol. 60, Lecture Notes
land, Amsterdam, 1974. in Computer Science, Springer-Verlag,
CHEN80 CHENG, W. K., AND BELFORD, G. New York, 1978, pp. 393-481.
C. "Update Synchromzation in Distrib- HAMM80 HAMMER, M. M., AND SHIPMAN, D.
uted Databases," in Proc. 6th Int. Conf. W. "Reliability mechanisms for SDD-I:
Very Large Data Bases, Oct. 1980. A system for distributed databases," ACM
DEPP76 DEPPE, M. E., AND FRY, J. Trans. Database Syst. 5, 4 (Dec. 1980),
P. "Distributed databases' A summary 431-466.
of research," in Computer networks, vol. HEWI74 HEWITT,C.E. "Protection and synchro-
1, no. 2, North-Holland, Amsterdam, nizationin actor systems," Working Paper
Sept. 1976. No. 83, M.I.T. Artificial Intelligence Lab.,
DIJK71 DIJKSTRA,E.W. "Hmrarchical ordering Cambridge, Mass., Nov. 1974.
of sequential processes," Acta Inf. 1, 2 HOAR74 HOARE, C. A.R. "Monitors. An operat-
(1971), 115-138. ing system structuring concept," Com-
ELLI77 ELLIS, C.A. "A robust algorithm for up- mun. ACM 17, 10 (Oct. 1974), 549-557.
dating duphcate databases," in Proe 2nd HOLT72 HOLT, R.C. "Some deadlock propemes
Berkeley Workshop D~str~buted Data- of computer systems," Comput. Surv. 4, 3
bases and Computer Networks, May (Dec. 1972) 179-195.
1977. KANE79 KANEKO,A., NISHIHARA,Y., TSURUOKA,
ESWA76 ESWARAN,K. P., GRAY,J. N., LORIE, R. K., AND HATTORI, M. "Logical clock
A., AND TRAIGER,I.L. "The notions of synchronization method for duplicated
consistency and predicate locks in a da- database control," in Proe. 1st Int. Conf.
tabase system." Commun. ACM 19, 11 D~stributed Computing Systems (IEEE),
(Nov. 1976), 624-633. New York, Oct. 1979, pp. 601-611.
GARC78 GARCIA-MOLINA, H "Performance KAWA79 KAWAZU,S, MINAMI,ITOH,S., ANDTER-
comparisons of two update algorithms for ANAKA,K. "Two-phase deadlock detec-
distributed databases," in Proc. 3rd tion algorithm in distributed databases,"
Berkeley Workshop D~stmbuted Data- in Proc. 1979 Int. Conf. Very Large Data
bases and Computer Networks, Aug. Bases (IEEE), New York.
1978. KING74 KING, P. P., AND COLLMEYER, A
GARC79a GARCIA-MOLINA, H. "Performance of J. "Database sharing--an efficient
update algorithms for replicated data in a method for supporting concurrent pro-
distributed database," Ph.D. dlssertatmn, cesses," in Proc. 1974 Nat. Computer
Computer Science Dept., Stanford Umv., Conf., vol. 42, AFIPS Press, Arlington,
Stanford, Calif., June 1979. Va., 1974.
GARc79b GARCIA-MOLINA, H. "A concurrency KUNG79 KUNG, H. T , AND PAPADIMITRIOU, C.
control mechanism for distributed data H. "An optimality theory of concur-
bases winch use centralized locking con- rency control for databases," in Proe. 1979
ComputingSurveys,VoL 13, No. 2, June 1981
220 P.A. B e r n s t e i n a n d N. G o o d m a n