
Low Cost Comparisons of File Copies*

Thomas Schwarz
Robert W. Bowdidge
Walter A. Burkhard
Computer Science and Engineering
University of California at San Diego

*This work was supported by the University of California MICRO program as well as the NCR Corporation, Dayton, OH.

Abstract

The problem of maintaining consistency of replicas of large files has been addressed by Barbara, Feijoo and Garcia-Molina [1], Barbara and Lipton [2], Fuchs, Wu and Abraham [5] and Metzner [6]. Our model is essentially identical to that previously assumed. We present a scheme that provides all capabilities previously obtained as well as being able to detect and identify missing as well as extraneous pages. Our solution is an adaptation of Metzner's approach. We assume the existence of a signature function for individual pages. A supersignature is obtained as a power series in a primitive root within the Galois field with $2^n$ elements. The coefficients are the signatures of the individual pages. This scheme enables us to detect errors such as missing, altered or incorrectly placed pages with very high probability and to find an error diagnosis if discrepancies have been detected.

The problem solved in this paper is typified by the following situation. Consider a large database, replicas of which are situated at several sites. Each site keeps its own log file of updates and in order to insure consistency, these log files are compared periodically. The physical organization of the log file consists of pages.

Our model is the one usually adopted for this variety of problem. A file is organized as a sequence of pages and complete copies of the file are maintained at distinct sites throughout the network. We assume that few discrepancies occur and note that our scheme will not accommodate catastrophic failures. The cost of sending information through the network is high compared to the cost of calculations at an individual site. The order of pages is important and there might be missing pages besides faulty ones.

As usual, our solution avoids the costly and tedious task of bitwise comparison between the pages stored at different sites. Rather, we compare "signatures" of the individual pages by means of a supersignature calculated from the individual page signatures. This allows conclusions that two copies are identical only with a previously chosen arbitrarily high probability. The cost of higher accuracy will be additional bits within the signature.

Our scheme is based on the calculation of the supersignature. We interpret page signatures as elements of the Galois field $GF(2^n)$ of $2^n$ elements, but we disallow 0 as a value for a page signature. (We later propose variants that use more common data structures.) Let $g$ be a fixed primitive root of $GF(2^n)$, i.e. a generator of the multiplicative group of $GF(2^n)$. The supersignature is then calculated as the weighted sum of the page signatures with the powers of $g$.
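The weighted sum of page signatures with the powers of $g$ can be sketched in a few lines. This is a minimal illustration only, assuming $n = 8$, the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ and the primitive element $g = \mathtt{0x03}$; these parameters and all function names are choices made for the example and are not taken from our implementation.

```python
# Sketch: supersignature over GF(2^8). Field parameters are assumptions
# made for illustration (reduction polynomial 0x11B, primitive element 0x03).
MOD_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1
G = 0x03          # generates the multiplicative group of this field


def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^n) is XOR
        a <<= 1
        if a & 0x100:
            a ^= MOD_POLY        # reduce back to 8 bits
        b >>= 1
    return result


def gf_pow(base, exp):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exp):
        result = gf_mul(result, base)
    return result


def supersignature(page_sigs, first_page=1):
    """Sum of g^v * signature_v, where v runs over the page numbers."""
    h = 0
    power = gf_pow(G, first_page)
    for sig in page_sigs:
        h ^= gf_mul(power, sig)
        power = gf_mul(power, G)
    return h


sigs = [0x1F, 0x2A, 0x07, 0xC4]      # hypothetical nonzero page signatures
altered = [0x1F, 0x2B, 0x07, 0xC4]   # second page altered
swapped = [0x2A, 0x1F, 0x07, 0xC4]   # first two pages exchanged
```

Because the weight of a page depends on its page number, both the altered and the exchanged copy yield a supersignature different from that of `sigs`, and moving the same pages two positions later multiplies the supersignature by $g^2$.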

CH2878-7/90/0000/0196$01.00 © 1990 IEEE


Definition 1 Let $p_1, \ldots, p_n$ denote the page signatures. We define a supersignature $H$ by

$$H(p_1, \ldots, p_n) = \sum_{\nu=1}^{n} g^{\nu} h(p_{\nu}).$$

More generally, we define the supersignature of $p_k, \ldots, p_m$ to be

$$H(p_k, \ldots, p_m) = \sum_{\nu=k}^{m} g^{\nu} h(p_{\nu}).$$

Note that the supersignature of a (contiguous) set of pages in a file does not only depend on the pages but also on the page number, that is, the placement of the pages in the file.

We have to show this supersignature is capable of detecting the existence of altered, missing, extraneous and misplaced pages and we present an algorithm that, given a small number of discrepancies of the replicas of the file, finds a proper error diagnosis.

Effectiveness

There are several considerations regarding the efficiency of a signature. The computation of the signature must be feasible. Error detection is another measure of efficiency. We assume that the supersignature is to be a good hash function on the set of all pages, that is, the probability of attaining any particular value is identical for all possible values. In this sense it can be shown that $H$ is good if the page signature is good.

We can also give probabilities that one error of a particular kind goes undetected. The probability that one altered page is not detected is equal to the probability that a page signature does not detect an error, which is $1/2^n$. A single missing or extraneous page will be detected because the total number of pages does not agree. Even without this information the supersignatures will not agree unless the page signature of the missing page is zero (which can happen only with probability $1/2^n$). An inversion (that is, a case where two pages have exchanged places in the file) will go unnoticed with probability $\approx 1/(2^n - 1)$. A jump, where one page shows up later in one copy than in the other, will go undetected with the same probability. (In all this we made the silent assumption that page signatures are independently distributed.)

Sometimes, page signatures are selected because of their error detecting probabilities. This however tends to generate good hash functions. A simple counting argument shows that guaranteed error detection is impossible for supersignatures (at least if the number of pages in a file is not limited).

Some Properties

Subsupersignatures, that is, supersignatures of a contiguous subset of pages in a file, are dependent on the placement in the file. That is, exactly the same set of pages will yield different supersignatures according to the page number the first page of this subset has. However, one will be the $g^k$ multiple of the other, where $k$ is the difference of the respective page numbers.

We also have additivity. If we know the supersignature of a (contiguous) set of pages and the supersignature of the first (or second) half of this set then a simple subtraction will yield the supersignature of the other half. This property will cut the amount of bits transmitted during the diagnosis of a discrepancy by half.

A File Comparison Algorithm

We can now use a variant of Metzner's strategy to compare two files.

In overview, Site I (the initiator site) requests the number of pages in Site II's copy of the file as well as the supersignature and then calculates its own data from its local copy. If the data do not agree, the file is split in halves and each copy's data are compared. Not only are these parts of the file compared, but it is also tested whether the lacking consistency is due to misalignment caused by lacking pages in either copy of the file. This is done by conceptually shifting the parts of Site I's copy to the left and to the right.

We will work with pairs $(H, n)$ consisting of a supersignature of a contiguous set of pages and the number of pages. Two such pairs are the same exactly when
both components agree. In the course of error detection our increased knowledge of both file copies, in terms of a breaking up of the supersignatures into those covering smaller numbers of pages, will be represented by a sequence of those pairs:

$$(H_1^{(i)}, n_1^{(i)}), (H_2^{(i)}, n_2^{(i)}), \ldots$$

the meaning of which is that the first $n_1$ pages of the copy at site $i$ have supersignature $H_1^{(i)}$, and so on. We will have to define two operations on pairs: Leftshift and Rightshift. The first one will correspond to conceptually destroying a page to the left of the area of pages in question, thus simulating a missing page in the other site's copy or a superfluous one in our copy, whereas the other operation corresponds to inserting a dummy page to the left, thus simulating a missing page in our copy or a superfluous one in the other copy. These operations are not always defined if the set of pages covered does include the first or the last page.

Assume that $(H', n')$ covers the pages $\{p_m, p_{m+1}, \ldots, p_{m+n'}\}$. Conceptually destroying page $p_l$ now results in having to replace $H'$ by the supersignature of the pages $\{p_{m+1}, \ldots, p_{m+n'}, p_{m+n'+1}\}$. This can be calculated by

$$H_{\mathrm{new}} := g^{-1}\left(H' - g^{m} h(p_m) + g^{m+n'+1} h(p_{m+n'+1})\right).$$

A similar formula can easily be derived for the Rightshift.

The scheme proper now starts out by Site I, which is the calculation intensive site, requesting the number of pages and the supersignature of site II's copy of the file and comparing it with its own. Also, the number of errors allowed is set to $F$, a previously agreed upon number.

If in general we have reached a state in which we want to compare two sequences of pairs

$$(H_1^{(1)}, n_1^{(1)}), (H_2^{(1)}, n_2^{(1)}), \ldots \quad \text{and} \quad (H_1^{(2)}, n_1^{(2)}), (H_2^{(2)}, n_2^{(2)}), \ldots$$

we compare the corresponding pairs sequentially in the manner described below. If during a comparison of a pair $(H^{(1)}, n^{(1)})$ with $(H^{(2)}, n^{(2)})$ it turns out that $n^{(j)} = 0$, the corresponding set of pages at the other side is labelled superfluous. If $n^{(1)} = 1 = n^{(2)}$ and the signatures are not the same, these pages are labelled as different. Note that we actually are not diagnosing inversions as such; this is left to the error recovery routines. In these cases, the pairs are eliminated from further consideration. In all other cases, inequality of pairs is diagnosed. But this is first tried to be resolved by Left- and Rightshifting of the pair corresponding to site I by up to $F$ (the number of errors allowed) shifts in either direction. When through this maneuver or straight from the beginning equality of pairs has been found, the part of both sequences that is to the left of the pair in question is separated from the rest and the error allowance $F$ decreased by the number of shifts necessary. In the same move the equal pairs are eliminated from further consideration as they are deemed to be equal. If equality cannot be achieved, the set of pages in question is divided in half in such a way that the highest power of 2 strictly smaller than $n^{(2)}$ is the number of pages in the first half. Site I has to request the corresponding supersignature from site II and initiate the appropriate calculation for its part. The supersignature of the other half can be calculated from the old and the just obtained one. For every splitting of a pair, the error allowance $F$ is decreased by one, as we just proved the existence of one more mistake. After each comparison, we end up with a set of sequences with a total of at most $F$ pairs that are dealt with recursively. The algorithm stops when $F$ becomes negative, i.e. in failure, or when there is nothing left to be considered.

Example

Assume that site I contains the file with page signatures

Site I: l, m, n, o, p, q, r, s, t, u, v, w, x, y, z

and that site II contains the file with page signatures

Site II: l, m, o, p, q, r, s, t, u, v, w, x, y, z

Note that the page with signature n at site II is missing. In the initial step, site I notices that the number of pages differ. It nevertheless requests [...] and calculates $H_2$ and compares [...]
because only values of $H$ of the same number of arguments should be compared. At this point site I knows that there is at least one error in

I: l, m, n, o, p, q, r, s, t    II: l, m, o, p, q, r, s, t

(which incidentally also follows from the fact that the number of pages do not agree). In the next round site I compares

$$H_1(l, m, n, o) \neq H_2(l, m, o, p)$$

and calculates $H_1(q, r, s, t)$ which is compared to $H_2(p, q, r, s)$ and $H_2(q, r, s, t)$. Site I finds that file (q, r, s, t) at I is identical to (q, r, s, t) at site II and hence site I only has to compare files

I: l, m, n, o, p and II: l, m, o, p

After comparing

$$H_1(l, m) = H_2(l, m);$$
$$H_1(n, o) \neq H_2(o, p);$$
$$H_1(o, p) = H_2(o, p)$$

site I can now conclude that its third page is not to be found in the second site's record.

Variations of the Algorithm

The algorithm is biased towards the assumption that errors are due to missing pages. For a small number of actual errors this will not pose a problem; the algorithm will give the correct error diagnosis with high probability. If however one tries to use this algorithm under more bizarre circumstances, obvious improvements can be made.

In rare cases the algorithm might fail to find the right diagnosis because supersignatures of subsets of pages coincide even though the subsets do not and the algorithm is led astray. It will never make a wrong diagnosis unless several page signatures of different pages agree. If several explanations exist, it might however not give the most probable one.

To guarantee a diagnosis the following modifications are made. We impose an upper error bound $E$ for the number of altered pages and another, $F$, for the number of missing or superfluous pages. In this context, it should be noted that a jump or an inversion counts as one missing and one superfluous page. The number of shift operations is then restricted to $F$. If the algorithm fails, rerunning it with increasing values of $F = 1, 2, \ldots$ will lead to a correct answer.

In general, the algorithm will be able to detect a greater number of missing pages than the amount of $F$, even if these missing pages are contiguous.

Alternative Signature Spaces

In the preceding discussion we equipped the signature space, i.e. the space of all possible signatures, with the algebraic structure of a Galois field. (This is the only algebraic structure satisfying our needs which can be defined on a set of this size.) To make our scheme computationally accessible, we have to multiply powers of $g$. This task has been solved by the dual base algorithm due to Elwyn Berlekamp (v. [3,7]).

In practice, other implementations of arithmetic on a slightly altered signature space might be preferable. Though the space of all bit strings of a given length is the best signature space for generation purposes, the implementation of Berlekamp's algorithm or any other implementation of multiplication in a Galois field might be considered tedious.

Our results remain valid if the signature space is the set of integers modulo a given prime (usually the largest prime smaller than the largest unsigned integer directly representable on the machine) or a cartesian product of the integers modulo different primes. The arithmetic operations in the latter case are defined componentwise. The generating element is a vector consisting of primitive elements modulo the $i$th prime in the $i$th coordinate. Presumably, page signatures are generated as a string of bits. If we interpret these strings as unsigned integers and take the remainder modulo these primes we can represent $\prod_i p_i$ different signatures (where $p_i = 2^w - a_i$ are the primes and $w$ denotes the word length). This scheme is known as Chinese Remaindering. The values with $i$th component smaller than $a_i$ will be attained with double the probability of the larger ones. This is not a big disadvantage, as these values will only be attained with a very small probability and as to achieve the same guaranteed bounds of the original scheme we
have to make signatures N bits longer. Alternatively, we can alter the page signature generation scheme. The big advantage of this alteration lies in the use of commonly used operations that are optimized in any machine.

Implementation

A version of this algorithm has been coded and tested ([4]). The signature space consisted of vectors of dimension five, the coordinates of which were remainders modulo the five largest primes smaller than the maximal representable unsigned integer. To test the algorithm a signature file was randomly generated and transmission of the original file to another site simulated by capturing transmission errors by their effects on the signature file. Four types of errors were considered: altered pages ([...]), lost pages ([...]), jumped pages ([...]) and doubled pages ([...]), where the numbers are the probabilities for this kind of error to occur. To measure the costs the two resulting signature files were subjected to our comparison algorithm and the number of multiplications and remote requests recorded with the total number of errors as a parameter. (See figures 1 and 2.)

We also compare our algorithm with the naive one, which transfers the file and compares it bit by bit, measuring the time required by each one. We present both experimental and analytical results for the two comparison schemes. These are typical of computation environments including Sun 3/50 sites connected with 1 Mb/sec networking. Our simulation times were obtained using Sun 3/50.

Assuming a file size of 10000 pages of 1K byte each, the naive approach causes 81920000 bits to be sent. Our algorithm will cause $\log_2(10^4)$ remote requests for supersignatures in the presence of one error and hence at most $1 + E \log_2(10^4)$ for $E$ errors, resulting in network traffic of 161392 bits assuming 100 errors and an integer size of 2 bytes, or around 500 times less traffic.

The performance comparison is represented in the table below. The entries in the comparison row measure the time required to do the necessary computational comparisons. For the naive algorithm, the result depends on the file size and is otherwise constant. For our algorithm the result is the average for 20 experiments each consisting of 200 samples where the number of errors was fixed in each experiment. The average comparison time increased from 21 to 41 seconds as we varied the number of errors from 1 to 20. The entries in the transmission row measure the time required to transmit the necessary information. In the naive algorithm the file itself is transmitted while in our algorithm an inconsequential amount of data is transmitted.

                our algorithm    naive algorithm
comparisons     29 sec           54 sec
transmission    ≈ 0 sec          80 sec

These times assume that it is not necessary to compute the signatures as they can be maintained incrementally. In any case, the time to compute the supersignature would be approximately the time required by the byte-by-byte comparison.

Appendix: Mathematical Results

In this section, we assume that the signatures are represented as elements of $GF(2^n)$, but that the value 0 is not a permissible signature.

Lemma 1 Assume that $p_\nu = p_{\nu+c}$ where $p_k, \ldots, p_l, p_{k+c}, \ldots, p_{l+c} \in P$ (the set of page signatures) and $c$ is a constant. Then

$$\sum_{\nu=k+c}^{l+c} g^{\nu} \cdot h(p_{\nu}) = g^{c} \sum_{\nu=k}^{l} g^{\nu} \cdot h(p_{\nu}).$$
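Lemma 1 can be checked numerically. The sketch below is an illustration only: it assumes the prime-modulus signature space described under Alternative Signature Spaces rather than $GF(2^n)$, the prime 65521 and the sample values are arbitrary choices for this example, and each list entry stands for a value $h(p_\nu)$.

```python
# Sketch: numerical check of Lemma 1 in the integers modulo a prime.
# P, the sample signatures, k and c are assumptions made for illustration.
P = 65521  # largest prime below 2**16


def primitive_root(p, prime_factors):
    """Smallest generator of the multiplicative group modulo the prime p.

    prime_factors must list the distinct prime factors of p - 1.
    """
    for g in range(2, p):
        if all(pow(g, (p - 1) // q, p) != 1 for q in prime_factors):
            return g
    raise ValueError("no primitive root found")


# 65520 = 2^4 * 3^2 * 5 * 7 * 13
G = primitive_root(P, [2, 3, 5, 7, 13])


def supersignature(sigs, first_page):
    """Sum of g^v * h(p_v) for v = first_page, first_page + 1, ..."""
    return sum(pow(G, first_page + i, P) * s for i, s in enumerate(sigs)) % P


sigs = [1201, 77, 40000, 5, 31999]  # hypothetical nonzero page signatures
k, c = 4, 3                         # same pages placed at k and at k + c

lhs = supersignature(sigs, k + c)
rhs = pow(G, c, P) * supersignature(sigs, k) % P
```

Here `lhs` and `rhs` agree: placing the same pages $c$ positions later multiplies their supersignature by $g^c$, which is what the Leftshift and Rightshift operations rely on.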

The following observation shows that we can calculate the supersignature of the first (resp. last) half of a set of contiguous pages from the other half's and the whole set's supersignatures.

Lemma 2 For $k \le l < m$,

$$\sum_{\nu=k}^{m} g^{\nu} h(p_{\nu}) = \sum_{\nu=k}^{l} g^{\nu} h(p_{\nu}) + \sum_{\nu=l+1}^{m} g^{\nu} h(p_{\nu}).$$

Lemma 3 An inversion will go undetected with probability approximately $2^{-n}$.

Proof: We calculate the probability that an inversion (i.e. an exchange) of two pages will not be noticed by $H$. Let the first page have signature $s_1$ and page number $i$ at the first site and let the second page have signature $s_2$ and page number $j$ at the first site and assume that these page numbers are reversed at the second site. Then $H$ will not detect an inversion iff

$$g^{i} s_1 + g^{j} s_2 = g^{j} s_1 + g^{i} s_2.$$

This is equivalent to

$$(g^{i} - g^{j}) s_1 = (g^{i} - g^{j}) s_2,$$

i.e., iff

$$s_1 = s_2 \quad \text{or} \quad g^{i} = g^{j}.$$

The latter condition is equivalent to $i$ being congruent modulo $2^n - 1$ to $j$. Therefore the probability that $H$ will not detect an inversion is only slightly higher than $2^{-n}$.

Lemma 4 A jump will go undetected with probability $2^{-n}$.

Proof: Consider:

Site I: [...]

and

Site II: [...]

$$[\ldots] = (g - 1)\left(\sum_{\mu=k+1} g^{\mu-1} h(p_{\mu})\right) + (g^{k} - g^{j})(h(p_k) + h(p_j))$$

The probability that $H_{I}$ equals $H_{II}$ is therefore $2^{-n}$.

References

[1] D. Barbara, B. Feijoo and H. Garcia-Molina: Exploiting Symmetries for Low-Cost Comparison of File Copies; Proc. International Conference on Distributed Computing Systems, San Jose, June 1988
[2] Daniel Barbara and Richard J. Lipton: A Class of Randomized Strategies for Low-Cost Comparison of File Copies; Technical Report Princeton CS-TR 176-88; 9/88
[3] Elwyn R. Berlekamp: Algebraic Coding Theory; McGraw-Hill Book Company, New York 1968
[4] R. W. Bowdidge, W. A. Burkhard, T. J. E. Schwarz: Technical Report UCSD, in statu nascendi
[5] W. K. Fuchs, K. Wu and J. Abraham: Low-Cost Comparison and Diagnosis of Large Remotely Located Files; Fifth Symposium on Reliability in Distributed Software and Database Systems, January 1986, 67-73
[6] J. Metzner: A Parity Structure for Large Remotely Located Data Files; IEEE Transactions on Computers Vol C-32, No. 8, 1983
[7] Robert J. McEliece: Finite Fields for Computer Scientists and Engineers; Kluwer Academic Publishers; Boston, Dordrecht, Lancaster 1987

[Plot: average, minimum and maximum number of multiplications (vertical axis) against the number of errors per sample, 0 to 20 (horizontal axis). Averages are over 200 samples of 10,000 page files.]
Figure 1: Number of multiplications of signature objects done by both sites

[Plot: average, minimum and maximum number of requests (vertical axis) against the number of errors per sample, 0 to 20 (horizontal axis). Averages are over 200 samples of 10,000 page files.]
Figure 2: Number of requests to the remote site for supersignatures
