
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 10, OCTOBER 2009

Distributed View Divergence Control of Data Freshness in Replicated Database Systems
Takao Yamashita, Member, IEEE
Abstract—In this paper, we propose a distributed method to control the view divergence of data freshness for clients in replicated database systems whose facilitating or administrative roles are equal. Our method provides data with statistically defined freshness to clients when updates are initially accepted by any of the replicas, and then, asynchronously propagated among the replicas that are connected in a tree structure. To provide data with freshness specified by clients, our method selects multiple replicas using a distributed algorithm so that they statistically receive all updates issued up to a specified time before the present time. We evaluated by simulation the distributed algorithm to select replicas for the view divergence control in terms of controlled data freshness, time, message, and computation complexity. The simulation showed that our method achieves more than 36.9 percent improvement in data freshness compared with epidemic-style update propagation.

Index Terms—Data replication, weak consistency, freshness, delay, asynchronous update.

1 INTRODUCTION
With the progress of network computing technologies, data processing occupies an important role in various applications such as electronic commerce, decision-support systems, and information dissemination. Data replication methods [1], [2], [3], [4] are effective for achieving the required scalability and reliability for data processing. They are categorized into two types according to the timing of their data updates: eager and lazy [5]. In eager replication, data on replicas are simultaneously updated using read-one-write-all or quorum consensus [1], [2]. In lazy replication, an update is initially processed by one replica, and then, gradually propagated to the other replicas [6], [7], [8]. Data replication methods can also be categorized into two types, master and group, according to the number of replicas that are candidates for initially accepting update requests from clients [5]. In group replication, multiple replicas can initially process updates, while a single replica can initially accept updates for a data object in master replication. Lazy-group replication is most advantageous for achieving high scalability and availability, but it cannot offer strict data consistency. Many applications do not require strict data consistency but allow weak data consistency. In addition, scalability of data processing is essential in large-scale systems. Hence, lazy-group replication is suitable under such conditions. Lazy-group replication has a server-based peer-to-peer architecture because the facilitating or administrative roles of replicas can be equal.

. The author is with NTT Information Sharing Platform Laboratories, NTT Corporation, 3-9-11, Midori-Cho Musashino-Shi, Tokyo 180-8585, Japan. E-mail: yamashita.takao@lab.ntt.co.jp. Manuscript received 29 Feb. 2008; revised 29 Sept. 2008; accepted 12 Nov. 2008; published online 21 Nov. 2008. Recommended for acceptance by S. Wang. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-02-0116. Digital Object Identifier no. 10.1109/TKDE.2008.230.

Recently, cross-enterprise business-to-business collaboration, such as multiorganization supply chain management and virtual Web malls, has become a key industry trend of information technology [9]. Therefore, large-scale resource sharing using Grid technologies is needed not only for science and technical computing but also for industrial areas [9]. As a result, information processing infrastructures for enterprises are distributed. To provide services to customers according to their requests, enterprises have to process a large number of transactions from clients. In lazy-group replication, to increase the processing performance of update transactions, the transactions are aggregated into one when they are propagated among replicas. Therefore, deferring update propagation for a particular period to improve the processing performance of update transactions is necessary. In addition, the greater the number of replicated database systems, the longer the update propagation delay. A number of applications, such as a decision-support system, have to retrieve distributed data reflecting such updates from multiple sites depending on user requirements. In decision-support systems, freshness is one of the most important attributes of data [10], [11]. Hence, in lazy-group replication, data freshness obtained by clients should be controlled so that it satisfies client requirements. In this paper, we propose a distributed method to control the view divergence of data freshness for clients in the replicated database systems that have server-based peer-to-peer architecture. Our method enhances the centralized way for view divergence control of data freshness by improving the computation complexity of view divergence control [12], [13]. To control data freshness, which is statistically defined, our method selects multiple replicas, called a read replica set, using a distributed algorithm so that they statistically receive all updates issued up to a specified time before the present time with a particular probability. Our method then provides the data that reflect all updates received by selected replicas.

The remainder of this paper is organized as follows: We describe the motivation of our research in Section 2. Section 3 briefly introduces the centralized view divergence control method [12], [13]. This includes a centralized algorithm for determining a read replica set to control the data freshness. Section 4 describes the distributed view divergence control method that includes the distributed algorithm used to determine a read replica set. We then evaluate the time, message, and computation complexity of the proposed distributed algorithm. In Section 5, we evaluate by simulation data freshness that can be controlled by the proposed method and compare it with related work. In Section 6, we compare our method with related work. Section 7 concludes this paper with a summary of the main points.

2 MOTIVATION

Lazy-group replication is most suitable for large-scale data processing systems. We are motivated to control the view divergence of data freshness in lazy-group replication by two types of applications for distributed computing infrastructure and enterprise systems.

The first type of application is data warehousing. In data warehousing, data are first extracted from operational databases such as online transaction processing (OLTP) systems, which handle a large number of short query, insert, and update transactions [14]. Such data are then hierarchically gathered for analysis and reporting in a data warehouse [14]. A data warehouse is separately constructed from operational databases in order to process complex queries. The users of data warehouses make decisions using data satisfying their requirements of data quality. This is because data quality is essential for data warehouses. One measure of data quality is timeliness [15]. For example, decision makers sometimes need the entire sales of a product for every hour, day, month, quarter, and year [15]. If they need the entire sales of periods 1, 2, 3, and 4 at a particular time in period 5, as shown in Fig. 1, decision makers then have to retrieve data reflecting updates issued by the end of period 4. When we adopt lazy-group replication to achieve the high scalability of OLTP systems and data sharing among multiple organizations, inserts and updates are asynchronously propagated among replicas. Data extracted from replicas can vary depending on source replicas due to the asynchronous arrival of inserts and updates; that is, data freshness depends on replicas. To improve data freshness for data warehousing, we have to gather such extracted data from distributed systems. Therefore, control of data freshness is needed in decision making using data warehouses because when and how new data are needed depends on business, and propagating all updates in a short time so that the data freshness of local data can meet any business request is very costly.

[Fig. 1. Need for data freshness in data warehousing.]

The second type of application is infrastructure services in distributed systems such as directory services [9], [16]. A new entry should not always be distributed quickly for the scalability of data processing because a large majority of users might not always look up a new entry. When an application looks up a new entry stored in those services and fails, it may not be up-to-date. In such a case, an application and/or its users can estimate possible reasons for the looking-up failure and should try to search for newer data. Another example of this type is mobile host tracking because a host that tries to communicate with a mobile host can estimate if the obtained location of the mobile host is already old data. A host needs the new location data of a mobile host right after it moves.

There are two options for a system architecture controlling data freshness in enterprises: centralized and decentralized architecture. Both are available for controlling data freshness when they are used in only one enterprise. For cross-enterprise business-to-business collaboration, however, the available system architecture depends on the relation among the enterprises. Cross-enterprise business-to-business collaboration is a key industry trend of information technology [17]. If there is no center organization that administrates the business activities of all enterprises or the data shared among them, adopting a centralized architecture for controlling data freshness is difficult. Therefore, a decentralized architecture is advantageous and flexible because it is available in a variety of relations among enterprises. In addition, a system architecture for enterprise use must be simple for easy administration and operation. Cross-enterprise systems in particular require simple interfaces between enterprises for troubleshooting because when update propagation troubles occur in complicated systems, tracing where and how they occurred is difficult. A simple system architecture also leads to simple estimation of update delays for controlling data freshness.

3 CENTRALIZED VIEW DIVERGENCE CONTROL

3.1 System Architecture

The system architecture used in our method is shown in Fig. 2, as described in [12], [13]. As described in Section 1, our system is designed for enterprise and cross-enterprise systems. There are three types of nodes: client, front-end, and replica nodes. Replicas are connected by logical links. Each replica node is composed of more than one database server for fault tolerance. This system processes two types of transactions: refer and update.

[Fig. 2. System architecture used in view divergence control for lazy-group replication.]

From the viewpoint of fault tolerance, the system architecture should have redundancy of processing nodes and update propagation paths. In our system, we use a tree structure for update propagation among replicas, where there is only one path between replicas. To solve the disadvantage of a tree-topology network, we can assume that network failures are recovered by routing protocols [18]. In addition, every replica is composed of multiple database servers for node failures in our system. A replica may join and leave a tree-topology network. When a replica joins a tree-topology network, the replica's administrator determines how it should be connected based on the distances between replicas.

A client retrieves and changes the values of data objects stored in replicated database systems through a front-end node. The functions of a client are to send refer and update requests to a front-end node to retrieve and change data, respectively, and to receive in reply processing results from a front-end node. A refer request includes the degree of freshness that is required by a client. The degree of freshness, called read data freshness (RDF), is statistically defined in this system. The RDF is formally described in Section 3.3. We call the difference in RDF among clients view divergence. When clients request the same RDF, our method restricts the differences in RDF accordingly.

A front-end node is a proxy for clients to refer and update data in replicas. When a front-end node receives an update from a client, it simply transmits the update to a replica. In Fig. 2, when an update from client c1 is received by front-end f1, it is forwarded to replica r3. Updates are then asynchronously propagated among replicas. When a front-end node receives a refer request from a client, it calculates data values satisfying the degree of freshness required by the client after it retrieved data values and transaction processing logs of one or more replicas. We call this set of replicas accessed by a front-end node a read replica set. For example, when front-end f3 receives a refer request from client c3, it retrieves data values and transaction processing logs from replicas r3, r9, and r12.

To process refer and update transactions, a replica has five functions: 1) calculating read replica sets for multiple degrees of freshness by cooperating with other replicas, 2) processing refer and update requests to its local database, 3) propagating update and refresh transactions to other replicas, 4) processing asynchronously propagated updates in refresh transactions, and 5) informing all front-end nodes about whether or not it is a read replica for a degree of freshness.

Read replica sets are not calculated every time a refer request arrives at a front-end node in our system because calculating a read replica set for every refer request requires high overhead of transaction processing and a comparatively long time to complete processing a refer request. In our system, the calculation of read replica sets for multiple degrees of freshness is periodically performed using samples of update delay for a past period whose statistical characteristics are determined to be the same as those of the current period using nonparametric statistical methods, such as the Mann-Whitney U or the median test [19], [20], [21]. Nonparametric statistical methods are distribution-free statistical methods. The Mann-Whitney U and the median test verify similarity between two sets of samples from the viewpoints of rank sum and median, respectively. The periods whose statistical characteristics are compared are determined using the knowledge on application domains so that statistical characteristics do not change much in each of those periods. Samples of update delay used in nonparametric statistical methods are acquired by inserting time stamps of the starting time of update propagation in a refresh transaction.

Let T(i) be the degrees of freshness for which read replica sets are calculated, where i is an integer and T(i) < T(i+1) for any i. In general, the smaller the T(i), the newer the retrieved data. T(i) should be selected so that application requirements are satisfied because when and how new data are needed depends on applications and the timing of data acquisition, as described in Section 2. When a client requests a degree of freshness Tp, the read replica set for T(i) that satisfies condition T(i) <= Tp < T(i+1) is selected to provide data with the degree of freshness required by the client [12].

When a front-end node provides the values of data objects that satisfy the degree of freshness required by a client, it behaves like a replica to process refresh transactions after receiving data values and transaction processing logs from read replicas. Here, a front-end node first selects one of the read replicas. Then, to distinguish transactions that have not been received by the selected read replica yet but have already been received by at least one of the other read replicas, a front-end node compares the transaction logs received from the read replicas. As a result, it generates refresh transactions and record deletions that do not reach the selected read replica yet and applies them to the data values received from the selected replica.

3.2 Available Transactions

In lazy-group replication, any node can initially process an update. An update to change data in replicated database systems is propagated through the links of a tree as a refresh transaction to which some updates are aggregated. How to aggregate updates to one refresh transaction depends on the application. For decision-support systems to handle the sales of a product in an area, a refresh transaction includes the update of totaled sales included in all updates. For example, in mobile host tracking, old data are overwritten by the newest data, and old updates are omitted.

Because the orders of transactions processed by replicas can vary, updates may conflict with each other. Conflicts between updates cause inconsistent states of data objects. To eliminate inconsistency, various ways are used to process updates and record deletions in lazy-group replication. For update processing, attributes associated with update requests and data objects such as time stamps and version numbers are used [5]. For eliminating inconsistent states caused by the difference in the order of updates and record deletions among replicas, a method called death certificate or tombstone is used. It offers data consistency among replicas by keeping the time stamps of deleted objects for a particular period [22].
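To make the selection rule T(i) <= Tp < T(i+1) concrete, the following is a minimal illustrative sketch, not code from the paper; the function name, the list of precomputed (T(i), read replica set) pairs, and the replica identifiers are hypothetical.

    import bisect

    def select_read_replica_set(precomputed, tp):
        """precomputed: list of (T_i, read_replica_set) pairs sorted by ascending T_i."""
        degrees = [t for t, _ in precomputed]
        idx = bisect.bisect_right(degrees, tp) - 1   # largest i with T(i) <= Tp
        if idx < 0:
            raise ValueError("requested RDF is smaller than the smallest T(i)")
        return precomputed[idx][1]

    # Example: read replica sets precomputed for T(1)=10 s, T(2)=60 s, T(3)=300 s.
    sets = [(10, {"r3", "r9", "r12"}), (60, {"r3", "r9"}), (300, {"r3"})]
    print(select_read_replica_set(sets, 45))   # -> the set computed for T(1) = 10

A client asking for a fresher view (smaller Tp) is thus mapped to the read replica set computed for the largest T(i) that still satisfies its requirement.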

As a result, we can obtain data whose RDF value is the same as or better than the RDF Tp requested by a client. For some applications, for example, in the location tracking of a mobile host, the value of a data object is replaced by a replica with a newer value. For obtaining data, a front-end node then selects the newest value of a data object among those in the read replicas, and the role of a front-end node is simpler than the above described. Because a front-end node performs the same process as a replica does when it provides data with the degree of freshness required by clients, our method does not further restrict transactions available in lazy-group replication. In addition, each replica needs to keep update logs so that if it fails, any updates are recovered and pending updates are reliably sent to other replicas [26].

3.3 Read Data Freshness

As a measure of data freshness, there are two options: the probabilistic degree and the worst boundary of data freshness. When we use the probabilistic degree of freshness, updates issued before a particular time are reflected in obtained data with some probability. On the other hand, data reflect all updates issued before a particular time when data freshness is defined by its worst boundary, which is the probabilistic degree of data freshness with probability 1. When we use the update propagation delay caused by the maximum load on databases for view divergence control, the probabilistic degree of data freshness can represent its worst boundary. There are a number of applications whose users are tolerant to some degree of errors inherent in the data. For some decision-support processes, knowledge workers can tolerate some degree of error and omission if they are aware of the degree and nature of the error [15]. Therefore, we use the probabilistic degree of data freshness as its measure, referring to this measure as RDF.

The formal definition of RDF, or Tp, is Tp = tc - te, where time te is such that all updates issued before te in the network are reflected in the data acquired by a client with probability p, and tc is the present time. If the last update reflected in the acquired data was at tl (< te), there is no update request between tl and te with probability p. The degree of RDF represents a client's requirements in terms of the view divergence.

3.4 Centralized Algorithm

3.4.1 Assumptions

To determine a read replica set so that the client's required RDF is satisfied, we use the delay in update propagation between replicas and assume three conditions:

1. The node clocks are synchronized.
2. The delay in update propagation can be statistically estimated.
3. The time it takes a client to obtain data, or Tr, is less than the degree of RDF required by the client, or Tp.

Condition 1 means that the clocks on a client, front-end node, and replica nodes are synchronized with each other. This is needed to measure the delay times of message delivery for a read, an update, or a refresh transaction. The synchronization can be achieved by using various clock synchronization methods [23], [24].

Condition 2 means that we can estimate the upper confidence limit of the delay between replicas with some probability by using samples of the measured delay time. This means that the probability distribution of the delay does not change much in the time it takes to obtain enough samples to estimate the delay. When a replica transmits updates to another replica, a communication delay occurs, and updates are deferred for some period in order to aggregate several updates in one refresh transaction. In addition, when an update is propagated, some processes are necessary, and the delays caused by these processes occur at each replica. The change in update delay times therefore contains those in network delay and the processing time of transactions, where the former is typically much lower than the latter. The processing time of transactions leads to the response time of a database system, and the processing time of each transaction depends on the load of a database system, which is caused by the number of incoming transactions in the applications described in Section 2. Therefore, the difference of processing time of each transaction might largely affect that of update delay. Transactions used in distributed database systems need to be relatively uniform because we cannot estimate delay time for processing transactions that have never been processed before. In OLTP, directory service, and mobile host tracking systems, transactions are determined in their system design. The performance criteria of OLTP systems are throughput and response time [25], as are those of directory services and mobile host tracking. When such applications are developed, these criteria are optimized under a variety of expected conditions so that application requirements are satisfied. Hence, this condition is acceptable for such applications. To estimate update delay times with some probability as described above, at least one of the following conditions needs to be satisfied: 1) the load on databases slowly changes or 2) the update propagation delay caused by the maximum load on databases is available. Hence, we might be able to statistically estimate present update delay using that in past periods, including recent periods whose loads on replicated database systems are close to those of the present time. When a replica statistically estimates the delay in update propagation, it uses samples of update propagation delay in a past period whose statistical characteristics are determined to be the same as those in the present time using the statistical test described in Section 3.1. The upper confidence limit can be calculated by using various statistical estimation methods [19], [20], [21]. Appendix C, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230, describes an example way to calculate the upper confidence limit using a nonparametric estimation method.

Condition 3 reflects that, to obtain data, we need the time for communication between a client and a front-end node and for communication between the front-end node and replica nodes. Hence, we consider this condition to be acceptable. In addition, when a client requires data with a degree of RDF, it can request only data that satisfy condition 3.
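The paper's own estimator for the upper confidence limit is given in its Appendix C and is not reproduced here. As a hedged illustration only, the sketch below shows one generic distribution-free alternative: taking an order statistic of i.i.d. delay samples as an upper confidence bound for the p-quantile of the delay, using the binomial distribution to pick the order. The function names and the sample data are hypothetical.

    from math import comb

    def binom_cdf(k, n, p):
        # P[Binomial(n, p) <= k]
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def upper_confidence_limit(samples, p=0.95, conf=0.95):
        """Smallest k such that P[Binomial(n, p) <= k - 1] >= conf; the k-th
        smallest sample then exceeds the true p-quantile of the delay with
        probability at least conf (assuming i.i.d. samples)."""
        xs = sorted(samples)
        n = len(xs)
        for k in range(1, n + 1):
            if binom_cdf(k - 1, n, p) >= conf:
                return xs[k - 1]
        return None  # not enough samples for the requested p and conf

    delays = [0.8, 1.1, 0.9, 1.4, 2.2, 1.0, 1.3, 0.7, 1.2, 1.5] * 10  # seconds
    print(upper_confidence_limit(delays, p=0.95, conf=0.95))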

[TABLE 1. Formal definition of terminology.]

3.4.2 Terminology

We define four terms for explaining our algorithm: a range originator, a read replica (set), a classified replica, and a mandatory replica. Table 1 describes their formal definitions. A range originator for replica r is a replica from which a refresh transaction can reach replica r within time Tp with probability p, where Tp is the degree of RDF required by a client. A range originator can be determined by statistical estimation methods [19], [20], [21]. For example, a range originator set can be calculated using samples of update delay and the nonparametric estimation method described in Appendix C, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230. A read replica is a replica to which a front-end node sends a refer transaction to obtain the value of a data object. A read replica set is the set of read replicas for a degree of RDF required by a client. A classified replica is a replica whose set of range originators is not a subset of those for any other replicas; a classified replica is a candidate for an element of a read replica set. A mandatory replica is a classified replica that has one or more range originators not included in the range originator set of any other classified replica. A mandatory replica means an indispensable element of the minimum read replica set. In our method, a mandatory replica becomes a read replica.

Let Oi be the set of range originators for replica i. In our method, we say that replica i is superior to, inferior to, or equivalent to replica j in range-originator-set (ROS) capability when condition Oi ⊃ Oj, Oi ⊂ Oj, or Oi = Oj is satisfied, respectively. In addition, we say that replica j is covered or noncovered by replica i if condition j ∈ Oi or j ∉ Oi is satisfied, respectively. Let dxy be the upper confidence limit, with probability p, of the delay time from replica x to y.

3.4.3 Calculating a Minimum Read Replica Set

In our method, we have to choose a read replica set such that for any replica r, at least one element of the set can receive updates from replica r within Tp with probability p. This means that the sum of the range originator sets of all read replicas is the set of all replicas. If a replica could have any set of replicas for its range originator set, the problem to calculate a minimum read replica set would be an example of the set-covering problem represented as follows. An instance (X, F) of the set-covering problem consists of a finite set X and a family F of subsets of X such that every element of X belongs to at least one subset in F (i.e., X = ∪_{S ∈ F} S). The problem is to find a minimum-size subset C ⊆ F that covers X [27]. Here, the sets of replicas and range originator sets correspond to X and F, respectively. The set-covering problem, which has some variations including the hitting-set problem, is NP-hard [27]. However, the problem to calculate a minimum read replica set in our method is not the set-covering problem because a range originator set cannot have arbitrary elements. There is a certain relationship among the range originator sets of replicas depending on the location of replicas in a network for update propagation because we use a tree-topology network for update propagation and the range originator set of a replica has to follow the properties of probabilistic delay described below.

The probabilistic delay has properties different from those of the distance in a weighted graph. For example, when there is a replica j on the path from i to k, the distance from replica i to k is the sum of the distances from replica i to j and from j to k; however, dik is not the sum of dij and djk. For another example, when there is replica k along the paths from replica i to l and from j to l and when condition dik ≤ djk is satisfied, dil is not always equal to or less than djl. Our algorithm uses the following two properties that arise from the statistical delay distribution of update propagation [12], which are apparent from Theorem 1 and Lemma 1 described in Sections 4.3.1 and 4.3.2. The first property is that if condition dik ≤ Tp is satisfied, then conditions dij ≤ Tp and djk ≤ Tp are satisfied for any replica j that is along the path from replica i to k. The second property is that if condition dij > Tp is satisfied, then condition dik > Tp is satisfied for any replica k such that replica j is along the path from replica i to k.

Our algorithm calculates a minimum read replica set. The flow of the centralized algorithm to calculate a minimum read replica set is shown in Fig. 3a, and its pseudocode is in Fig. 4. The main part of the algorithm is composed of three processes: classified-replica, mandatory-replica, and minimum-subtree determination. The three processes are iterated until one replica covers all replicas in a tree. Range originator sets are first calculated in step 2) of Fig. 4 before the three processes are performed; in each iteration, the range originator set of replica i is calculated by removing replicas covered by mandatory replicas from its original range originator set. In classified-replica determination, replica i is excluded from candidates for read replicas when Oi is a subset of the range originator set of another replica because a replica whose range originator set is a superset of Oi is more capable of decreasing the number of read replicas. If multiple replicas whose range originator sets are equal remain, all but one of them are excluded as candidates for read replicas because they have the same capability to cover replicas.
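The following is a minimal illustrative sketch (not the paper's code) of how the terminology above can be computed from data that are assumed given: a table d[x][y] of upper confidence limits of propagation delay and the requested Tp. The function names are hypothetical.

    def range_originator_set(i, replicas, d, tp):
        # Range originators of replica i: replicas whose refresh transactions
        # reach i within Tp with probability p (d already encodes probability p).
        return {r for r in replicas if d[r][i] <= tp}

    def ros_relation(o_i, o_j):
        """'superior', 'inferior', 'equivalent', or None (incomparable)."""
        if o_i == o_j:
            return "equivalent"
        if o_i > o_j:       # strict superset
            return "superior"
        if o_i < o_j:       # strict subset
            return "inferior"
        return None

    def covered(j, o_i):
        # Replica j is covered by replica i when j is in i's range originator set.
        return j in o_i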

[Fig. 3. Flow of centralized and distributed algorithms to calculate a minimum read replica set.]

[Fig. 4. Centralized algorithm to calculate a minimum read replica set.]

After classified-replica determination, which corresponds to steps 3) and 4) in Fig. 4, the remaining replicas are classified replicas, which are candidates for read replicas. In mandatory-replica determination, which corresponds to steps 5) and 6) in Fig. 4, classified replica j is selected as a mandatory replica when it has one or more elements that are not included in Ok for any classified replica k, where j is not equal to k. This is because we cannot replace classified replica j with any other classified replica to construct a minimum read replica set. All calculated mandatory replicas are added to the read replica set. In minimum-subtree determination, which corresponds to steps 7), 8), and 9) in Fig. 4, we calculate a minimum subtree defined as the subtree that includes replicas not covered by mandatory replicas and has the minimum number of edges. We iterate classified-replica, mandatory-replica, and minimum-subtree determination to decrease the size of a tree until one replica l covers all remaining replicas in a tree, as described in step 1) of Fig. 4. Finally, we add replica l to the read replica set in step 10) of Fig. 4.

The computation complexity of this centralized algorithm is O(n^3), as proved in [12], where n is the number of replicas. This computation complexity is determined by the number of range-originator-set comparisons in each iteration and the iterations of the three processes described above, which are O(n^2) and O(n), respectively. In classified-replica and mandatory-replica determination, any pair of range originator sets of replicas and classified replicas need to be compared in every iteration. The key point of low computation complexity is that there is at least one mandatory replica, which corresponds to a read replica, in every iteration. This is caused by the use of a tree-topology network for update propagation and the two properties of probabilistic delay time described in the second paragraph of this section. On the other hand, the centralized algorithm cannot apparently solve the set-covering problem with low computation complexity because it cannot always find one or more mandatory replicas in every iteration. The computation complexity of the centralized algorithm is effective but should be improved for a large number of replicas. Therefore, we decrease the computation complexity of the algorithm to calculate a minimum read replica set in the distributed view divergence control of data freshness described in Section 4.

4 DISTRIBUTED VIEW DIVERGENCE CONTROL

4.1 Assumptions for Distributed View Divergence Control

In addition to the assumptions for the centralized view divergence control described in Section 3, we assume the following conditions:

1. A replica can directly communicate with only its adjacent replicas.
2. A replica knows the adjacent replica to which it should send a message destined for a particular replica. This can be achieved by various routing protocols [18].
3. A front-end node knows the set of all replicas.
4. Every replica knows the set of all front-end nodes.

On termination of the distributed algorithm, all replicas inform all front-end nodes about whether or not they are read replicas. This can be accomplished using diffusing computation and the termination detection for it [28]. A front-end node can determine whether it learns the set of all read replicas because it knows the set of all replicas and every replica informs front-end nodes whether or not it is a read replica.
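As a hedged illustration of the iteration described above, the sketch below follows the narrative of this section rather than the paper's Fig. 4 pseudocode, which is not reproduced here. It assumes two inputs that the text treats as given: originator_sets (each replica's range originator set) and tree_adj (the adjacency list of the update-propagation tree); all names are hypothetical.

    def classified(cands, O):
        result = []
        for i in sorted(cands):
            dominated = any(O[i] < O[j] for j in cands if j != i)   # proper subset
            duplicate = any(O[i] == O[j] for j in result)           # keep one of equals
            if O[i] and not dominated and not duplicate:
                result.append(i)
        return result

    def mandatory(cls, O):
        return [i for i in cls
                if O[i] - set().union(*(O[j] for j in cls if j != i))]

    def steiner_nodes(adj, keep):
        """Nodes of the minimal subtree of tree `adj` spanning `keep`:
        repeatedly prune leaves that are not in `keep`."""
        nodes = set(adj)
        while True:
            prunable = [u for u in nodes
                        if u not in keep
                        and len([v for v in adj[u] if v in nodes]) <= 1]
            if not prunable:
                return nodes
            nodes -= set(prunable)

    def min_read_replica_set(replicas, originator_sets, tree_adj):
        O = {r: set(originator_sets[r]) for r in replicas}
        nodes = set(replicas)
        read_set = set()
        while nodes:
            # Step 10): one replica covers every remaining replica.
            full = next((r for r in sorted(nodes) if nodes <= O[r]), None)
            if full is not None:
                read_set.add(full)
                break
            cls = classified(nodes, O)                        # steps 3)-4)
            mand = mandatory(cls, O)                          # steps 5)-6)
            if not mand:                                      # defensive guard only;
                read_set |= set(cls)                          # the paper argues this
                break                                         # cannot occur in a tree
            read_set |= set(mand)
            covered = set().union(*(O[m] for m in mand))
            nodes = steiner_nodes({u: [v for v in tree_adj[u] if v in nodes]
                                   for u in nodes},
                                  nodes - covered)            # steps 7)-9)
            for r in nodes:
                O[r] -= covered                               # step 2), next round
        return read_set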

4.2 Overview

The distributed view divergence control method described in this section is based on the centralized view divergence control method described in the previous section, in which only the way for calculating a minimum read replica set requires centralized operation. Therefore, a distributed view divergence control method needs a distributed algorithm to calculate a minimum read replica set. Fig. 3b shows the distributed algorithm executed by every replica to calculate a minimum read replica set. In the distributed algorithm, classified-replica, mandatory-replica, and minimum-subtree determination are iterated a number of times, and the iteration ends when the size of the subtree for the next iteration is zero, which means the termination of the distributed algorithm. Therefore, each replica needs to detect the termination of a current process to determine when it should start the next process. The steps of the centralized algorithm that are not included in any of classified-replica, mandatory-replica, and minimum-subtree determination are steps 1), 2), and 10) in Fig. 4. For efficiency, we incorporate steps 1) and 2) with minimum-subtree determination. Step 10) performs the final selection of a mandatory replica when there is at least one replica that covers all remaining replicas. This can be achieved by the execution of classified-replica and mandatory-replica determination. Therefore, classified-replica and mandatory-replica determination are executed one more time in the distributed algorithm than in the centralized algorithm.

A distributed algorithm is evaluated in terms of time and message complexity [29]. Time complexity is the time an algorithm takes to terminate. Message complexity is the total number of messages exchanged among nodes in the worst case. In this paper, we measure time complexity in the same way as that described in [29], where it is measured using the upper bound of time for each task of each process (denoted by bp) and for the single task of each channel (denoted by bc). Here, bp includes time for protocol processing, transaction processing, and logging for recovery, and bc includes time for packet forwarding and message transmission. The time complexity of the distributed algorithm to determine a minimum read replica set is an important measure because our algorithm needs to calculate read replica sets for multiple degrees of freshness periodically; in particular, this is important when new data are required. Therefore, we design the distributed algorithm to calculate a minimum read replica set so that its time complexity is as small as possible. In addition, we decrease the computation complexity of the algorithm to calculate a minimum read replica set from that of the centralized view divergence control, which is O(n^3). We accomplish low time and computation complexity using the relation among the range originator sets of replicas caused by the topology of update propagation and probabilistic delay. As a result, distributed view divergence control achieves low time and computation complexity of the algorithm to calculate a minimum read replica set.

Here, let Li be the set of replicas with which replica i has to interact in process p. We call the distance between replica i and the furthest replica in Li from it the maximum distance in process p by replica i. When the hop count distance between replicas i and j is h, it takes time h(bp + bc) + bp for replica i to interact with replica j in one way. When the centralized algorithm to calculate a minimum read replica set is executed in a distributed manner, if a replica interacts with all the other replicas in distributed classified-replica determination, it takes time D(bp + bc) + bp in the worst case that a replica receives information from the furthest replica, where D is the diameter of a tree for update propagation. Therefore, for low time complexity, we should decrease the maximum distance in classified-replica, mandatory-replica, and minimum-subtree determination by replica i for any replica i as much as possible. In addition, the decrease of the number of range-originator-set comparisons leads to the decrease of computation complexity.

4.3 Classified-Replica Determination

In the classified-replica determination of the centralized algorithm, the range originator set of every replica is compared with those of all the other replicas. However, the range originator set of a replica tends to include only replicas close to it. Therefore, the range originator sets of replicas far from each other tend to have no intersection. This means that the range originator set of a replica does not need to be compared with that of a replica far from it for classified-replica determination. Therefore, the time complexity of classified-replica determination in the distributed algorithm can be improved by eliminating comparisons between the range originator sets of replicas far from each other. To decrease the number of range-originator-set comparisons, we divide distributed classified-replica determination into two phases. In the first phase, a replica compares its range originator set with those of its adjacent replicas to determine whether or not it is a classified replica. In the second phase, which is only performed as necessary, replicas with the same range originator set coordinate with each other to determine whether or not they are classified replicas.

4.3.1 First Phase

For the decrease of range-originator-set comparison, in the first phase of classified-replica determination, a replica compares its own range originator set with only those of its adjacent replicas. We divide the relationship between replicas i and j into three cases according to which of the following three conditions is satisfied:

    ∀j: Oi ⊈ Oj,                                                   (1)
    ∃j: Oi ⊂ Oj,                                                   (2)
    ∃j: Oi = Oj ∧ ∀k ∈ {k | k ∈ N(i) ∧ k ≠ j}: Oi ⊄ Ok,            (3)

where Oi is the range originator set of replica i, N(i) is the set of adjacent replicas for replica i, and replicas j are the adjacent replicas of replica i. Condition (3) is the condition that neither condition (1) nor (2) is satisfied.
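A minimal sketch of the local first-phase test, under the assumption that the replica already holds its own range originator set and those received from its adjacent replicas; this is an illustration, not the paper's Appendix B pseudocode, and the names are hypothetical.

    def first_phase_case(o_self, o_adjacent):
        """o_adjacent maps each adjacent replica to the set in its RON message."""
        if any(o_self < o_adj for o_adj in o_adjacent.values()):
            return "condition (2): not a classified replica"
        if any(o_self == o_adj for o_adj in o_adjacent.values()):
            return "condition (3): undecided, second phase needed"
        return "condition (1): classified replica"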

In the first phase of classified-replica determination, every replica informs all its adjacent replicas about its range originator set. We call a message used in the first phase a range originator notification (RON) message. An RON message sent by replica i includes Oi and i and is transmitted on each direction of each link. The termination condition of this phase is that a replica receives RON messages from all its adjacent replicas.

Theorem 1. When condition (1) or (2) is satisfied, a replica can determine whether or not it is a classified replica by comparing its range originator set with only those of all its adjacent replicas.

When condition (1) is satisfied, replica i is a classified replica according to the definition of classified replicas. When condition (2) is satisfied, there are one or more replicas that are superior to replica i in ROS capability; hence, replica i is, by definition, not a classified replica. When condition (3) is satisfied, replica i cannot determine whether or not it is a classified replica in the first phase. Therefore, replica i has to cooperate with one or more replicas that are not its adjacents in the second phase of classified-replica determination. In addition, when condition (2) is satisfied, a replica also performs part of the second phase of classified-replica determination to help other replicas that have the same range originator set to identify themselves as nonclassified replicas.

It takes time 2bp + bc for an RON message to be transmitted on a link and processed by the replica that receives it. Therefore, the time and message complexity of the first phase are O(2bp + bc) and O(Nv), respectively, where Nv is the number of replicas. The first phase of classified-replica determination comprises Na-times RON message transmission, receipt, and comparison of range originator sets, where Na is the number of adjacent replicas. Therefore, the computation complexity of the first phase of classified-replica determination in one iteration is O(Na). The pseudocode of the first phase of classified-replica determination is described in Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

4.3.2 Second Phase

The situation where condition (3) is satisfied by one or more replicas can be divided into two cases by using Theorem 2, proved by Lemma 1 below: 1) the minimum subtree that includes replicas that have the same range originator set or supersets thereof consists of replicas with the same range originator sets or supersets thereof, and 2) the minimum subtree that includes replicas with the same range originator set or supersets thereof consists solely of replicas with the same range originator set. We call cases 1) and 2) inequivalent-replica and equivalent-replica cases, respectively. Figs. 5a and 5b show an inequivalent-replica case and an equivalent-replica case, respectively. In Fig. 5a, replicas 1, 4, 5, and 6 have the same range originator set, while that for replica 2 is a superset of this; in this example, replicas 5 and 6 satisfy condition (3). In Fig. 5b, replicas 1, 2, and 6 have the same range originator sets, so all of them satisfy condition (3). Different means are used to determine whether a replica is a classified replica according to whether an inequivalent-replica or an equivalent-replica case applies, as described below.

[Fig. 5. Two cases in second phase of classified-replica determination. (a) Inequivalent-replica case. (b) Equivalent-replica case.]

Theorem 2. When replica i satisfies condition (3) in an inequivalent-replica case, there exists a replica k that satisfies the following conditions: 1) replica k is equivalent to replica i in ROS capability, 2) replica i can reach replica k along the path consisting of only replicas equivalent to replica i in ROS capability, and 3) replica k satisfies condition (2).

Lemma 1. If replica k is a range originator for replicas i and j, then replica k is a range originator for any replica l along the path between replicas i and j.

The theorems and lemma in the remainder of this paper are proved in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

When replica i satisfies condition (3) in an inequivalent-replica case, replicas with the same range originator set as Oi need to interact with either of two types of replicas to determine that they are not classified replicas: 1) a replica superior to replica i in ROS capability or 2) a replica that is equivalent to replica i in ROS capability and satisfies condition (2). By interacting with replicas of types 1) and 2), replica i can determine that it is not a classified replica as follows. When replica i interacts with a replica of type 1), that is, when replica j is superior to replica i in ROS capability, replica i can learn the existence of a replica that is superior to itself in ROS capability. When replica i interacts with a replica of type 2), replica i can learn that a replica with the same range originator set as Oi determined itself to be a nonclassified replica. Hence, replica i is not a classified replica. Theorem 3 described below indicates that replica i should interact with a replica of type 2).

Theorem 3. Suppose that replicas i and j are adjacent and replica k is covered by replica i but not by replica j. For any replica m such that the path from replica i to m includes j, replica m does not cover replica k.

Theorem 3 means that replica i should interact with a replica of type 2) instead of type 1) in inequivalent-replica cases to decrease the maximum distance in the second phase by replica i, as described in Section 4.2.
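To make the set of type-2) replicas promised by Theorem 2 concrete, the sketch below walks the update-propagation tree from replica i through replicas with the same range originator set and collects those that satisfy condition (2). It assumes global knowledge of the tree adjacency and of every replica's range originator set, which a replica does not hold in the real protocol; this is an illustration only and the names are hypothetical.

    from collections import deque

    def type2_replicas(i, adj, O):
        """B_i: replicas reachable from i along a path of replicas equivalent to i
        in ROS capability that themselves have an adjacent superior replica."""
        same, b_i = {i}, set()
        queue = deque([i])
        while queue:
            u = queue.popleft()
            if any(O[u] < O[v] for v in adj[u]):
                b_i.add(u)                       # u satisfies condition (2)
            for v in adj[u]:
                if v not in same and O[v] == O[i]:
                    same.add(v)
                    queue.append(v)
        return b_i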

Let Bi be the set of replicas of type 2) for replica i in an inequivalent-replica case. From Theorem 3, in an inequivalent-replica case, there are one or more replicas of type 2) because there can be one or more replicas superior to replica i in ROS capability. When the size of Bi is more than one, replica i should interact with a replica in Bi so that it can determine as early as possible whether it is not a classified replica.

We call a message used in an inequivalent-replica case a nonclassified-replica notification (NCRN) message, which is initially sent by any replica in Bi and includes its range originator set. This message informs the receiver that the replica with a range originator set in an NCRN message is not a classified replica. An NCRN message sent by any replica in Bi has the same information. In general, there are one or more subtrees Tl consisting of only replicas with the same range originator set, and NCRN messages should be propagated among replicas in Tl. The propagation of NCRN messages in Tl can be achieved by broadcasting. Because there is only one path between any pair of replicas in a tree, a node operation for the broadcasting of an NCRN message in a tree is to simply send messages to all but one of its adjacent nodes from which it receives an original message. In the second phase of classified-replica determination, any replica p in Tl forwards an NCRN message to adjacent replicas from which it does not receive NCRN messages. One or two NCRN messages are transmitted along a link because a replica may send an NCRN message to its adjacent replica from which it has not received an NCRN message but which has already sent an NCRN message. This is because when replica p receives an NCRN message from an adjacent replica q, an NCRN message originated by another element in Bi has been propagated among replicas that replica p can reach through replica q. Replica i receives one or more NCRN messages because every replica in Bi originates an NCRN message. When replica i first receives one of the NCRN messages with the same range originator set as its own, it can determine that it is not a classified replica. In the example of Fig. 5a, replicas 1 and 4 are replicas of type 2), and replica 4 sends NCRN messages to replicas 5 and 6 because, through the first phase of classified-replica determination, replica 4 learns the existence of a replica that is superior to itself in ROS capability. The termination condition of the second phase in an inequivalent-replica case is that a replica receives/sends NCRN messages from/to at least one adjacent replica in each Tl. Therefore, the time and message complexity of the second phase in an inequivalent-replica case are O(Dc(bp + bc)) and O(Nc), respectively, where Dc and Nc are the maximum value among the diameters of subtrees Tl and the sum of the numbers of replicas of each Tl, respectively.

Similar to an inequivalent-replica case, let Tl be the set of replicas with the same range originator set in an equivalent-replica case. Unlike in an inequivalent-replica case, in an equivalent-replica case, any replica i in Tl cannot determine whether or not it is a classified replica by comparing its range originator set with those of others because there is no replica that is superior to replica i in ROS capability in a tree for update propagation. As is described in the centralized algorithm, only one replica is designated as a classified replica among replicas whose range originator sets are equal and are not inferior to any other replicas in ROS capability. This can be achieved by the leader election algorithm [29]. This algorithm in a tree is as follows [29]. Each leaf node is initially enabled to send an elect message to its unique adjacent node, where an elect message includes a set of node identifiers that the sender of an elect message has learned. Any node that receives elect messages from all but one of its adjacent nodes is enabled to send an elect message to the remaining adjacent node. As a result, one or two nodes can determine which node is the leader with the maximum identifier. Then, such a node broadcasts the result in a tree. The time and message complexity of the leader election algorithm described above are O(D(bp + bc)) and O(n), respectively, where D and n are the diameter of a tree and the number of nodes, respectively. In the second phase of classified-replica determination, we call a message corresponding to an elect message a classified-replica probe (CRP) message. In the example of Fig. 5b, replicas 1, 2, and 6 exchange CRP messages to designate replica 6 as a classified replica. The termination condition of the second phase in an equivalent-replica case is that a replica sends CRP messages to all its adjacent replicas in each Tl. The time and message complexity of the second phase in an equivalent-replica case are O(Dc(bp + bc)) and O(Nc), respectively, where Dc and Nc are the maximum value among the diameters of subtrees Tl and the sum of the numbers of replicas of each Tl, respectively.

The second phase of classified-replica determination comprises Nb-times range-originator-set comparison and NCRN message and CRP message transmission/receipt, where Nb is the number of adjacent replicas with the same range originator set. Therefore, the computation complexity of the second phase of classified-replica determination is O(Nb). The pseudocode of the second phase of classified-replica determination is described in Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

4.4 Mandatory-Replica Determination

As is described in the centralized algorithm to calculate a minimum read replica set, a classified replica is selected as a mandatory replica when its range originator set has one or more elements that are not included in the range originator set of any other classified replica. In mandatory-replica determination, we can also decrease its time and computation complexity for the same reason as in classified-replica determination because the range originator sets of replicas far from each other tend to have no intersection. Before describing a theorem to decrease the number of range-originator-set comparisons in mandatory-replica determination, we here define a relationship between replica i and classified replica j. We say that replica i is neighboring to classified replica j when there is no classified replica along the path between replica i and classified replica j. We also use the same phrase for a mandatory replica in Section 4.5.
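The leader election in a tree that the equivalent-replica case builds on can be illustrated with the small simulation below. It is a hedged sketch, not the paper's protocol: it processes enabled "elect" sends sequentially rather than asynchronously, and the tree and names are hypothetical.

    def elect_leader(adj):
        ids_known = {u: {u} for u in adj}
        sent = set()                                # directed edges already used
        pending = {u: set(adj[u]) for u in adj}     # neighbors not yet heard from
        while True:
            progressed = False
            for u in list(adj):
                if len(pending[u]) == 1:
                    v = next(iter(pending[u]))
                    if (u, v) not in sent:          # send elect message u -> v
                        sent.add((u, v))
                        ids_known[v] |= ids_known[u]
                        pending[v].discard(u)
                        progressed = True
            if not progressed:
                break
        deciders = [u for u in adj if not pending[u]]   # heard from all neighbors
        return max(ids_known[deciders[0]])              # then broadcast in the tree

    tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
    print(elect_leader(tree))    # -> 5

In the second phase, the same message pattern is reused with CRP messages so that exactly one of the replicas with equal range originator sets designates itself as the classified replica.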

which is equal to that of gossiping. 5g.2008.5 Minimum-Subtree Determination In the centralized algorithm. In gossiping.3. Therefore. The sum of ni for all subtrees is at most OðNv Þ. and ff1. the pieces of information in replicas are shared among replicas. the gossiping algorithm is executed in every subtree defined above.1412 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. In the example of Fig. ff1. which is equal to that of gossiping. 7gg. 6. as shown in Fig. Then. The termination condition of mandatory-replica determination is the same as that of the second phase of classifiedreplica determination in an equivalent-replica case because the distributed mandatory-replica determination is based on the gossiping algorithm. In mandatory-replica determination. there is at least one replica that satisfies condition 1) because there is at least one mandatory replica.2. 6gg are shared. is OðnÞ. Therefore. 3. In a tree for update propagation. pieces of information in nodes are first gathered. NO. the message complexity of mandatory-replica determination is OðNv Þ. ff1. the computation complexity of mandatory-replica determination is OðNa Þ. The mandatory-replica determination comprises Na -times MRP message transmission/receipt and one-time set comparison at most. 6. OCTOBER 2009 Fig. The pseudocode of mandatory-replica determination is described in Appendix B.org/10. A piece of information of a mandatory or a nonmandatory replica is its range originator set or the empty set. We can use Lemma 1 to decrease the time and computation complexity for determining whether a replica satisfies condition 1). Example of mandatory-replica determination.ieeecomputersociety. We also use the same phrase for a mandatory replica in Section 4. Let ni be the number of nodes in subtree i.230. 3. 10. where Dm is the maximum value among the diameters of subtrees defined above. the path between replica i and classified replica j. 3. 3g. a piece of information in replica i is Oi or the empty set when replica i is a classified replica or a nonclassified replica. When condition 1) is not satisfied. where Nv is the number of replicas in a tree. 5. classified replicas in a subtree defined above can share their range originator sets. As is described in Section 4. Then. Theorem 4. 21. A replica can verify the satisfaction of condition 1) by comparing its identifier with all elements of the range originator sets of all neighboring mandatory replicas. 5g. 2. . 7gg. which can be found on the Computer Society Digital Library at http://doi. Therefore. the gossiping of the range originator sets of mandatory replicas for the verification of condition 2) enables a replica to verify the satisfaction of condition 1). a replica has to learn all replicas that do not satisfy condition 1). we call a message corresponding to an elect message used in the leader election algorithm a mandatory-replica probe (MRP) message. 6. f1. which needs gossiping of such replicas among all replicas. The leader election algorithm can be solved by using gossiping [29]. the selected leader is notified to all nodes. In mandatory-replica determination. 4. In mandatory-replica determination. f1. there are four subtrees in a tree: the subtrees that consist of f3. We call a message used for gossiping on the range originator sets of mandatory replicas a minimum-subtree probe (MSP) message. 3. and f1. the gossiping algorithm in a tree is similar to the leader election algorithm described in Section 4. f1. VOL.3. From Theorem 4. 
4.5 Minimum-Subtree Determination
In the centralized algorithm, minimum-subtree determination calculates the minimum subtree for the next iteration, that is, the subtree including all replicas that are not included in the range originator set of any mandatory replica. Minimum-subtree determination in the distributed algorithm consists of three functions: 1) calculating the minimum subtree for the next iteration as the centralized algorithm does, 2) removing replicas covered by mandatory replicas from the range originator sets of replicas in the current subtree, and 3) detecting the termination of the distributed algorithm that calculates a minimum read replica set. Functions 2) and 3) correspond to steps 2) and 1) of the centralized algorithm, respectively. A sketch of these functions is given after this section's discussion.

A replica in the minimum subtree for the next iteration satisfies at least one of the following two conditions: 1) it is not included in the range originator set of any mandatory replica or 2) it is along the path between replicas satisfying condition 1). If at least one mandatory replica has a range originator set including the identifier of a replica, that replica does not satisfy condition 1). Therefore, to achieve function 1), we need the gossiping of the range originator sets of mandatory replicas: a replica can verify the satisfaction of condition 1) by comparing its identifier with all elements of the range originator sets of the neighboring mandatory replicas, and we can use Lemma 1 to decrease the time and computation complexity for determining whether a replica satisfies condition 1). To verify condition 2), a replica has to learn all replicas that do not satisfy condition 1), which needs gossiping of such replicas among all replicas. The gossiping of the range originator sets of mandatory replicas for the verification of condition 2) therefore also enables a replica to verify the satisfaction of condition 1). We call a message used for gossiping on the range originator sets of mandatory replicas a minimum-subtree probe (MSP) message. In this gossiping, a piece of information of a mandatory or a nonmandatory replica is its range originator set or the empty set, respectively.

By using gossiping, all replicas in the current subtree share V, the set of replicas covered by mandatory replicas. For function 2), a replica then removes all elements in V from its range originator set; hence, a replica can easily achieve function 2) using V.

The minimum-subtree determination comprises Na-times MSP message transmission/receipt and set comparison, so the computation complexity of minimum-subtree determination is O(Na). The time and message complexity of minimum-subtree determination are equal to those of gossiping, that is, O(Dg(bp + bc)) and O(Nv), respectively, where Dg and Nv are the diameter of the tree for update propagation and the number of replicas, respectively. The termination condition of minimum-subtree determination is that of function 1), which is the same as that of the second phase of classified-replica determination in an equivalent-replica case.
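The following Python code is an illustrative sketch of functions 1) and 2) above (again, not the Appendix B pseudocode). It assumes every replica already knows V, the union of the mandatory replicas' range originator sets obtained by MSP-message gossiping, and it computes the minimum subtree for the next iteration by repeatedly pruning, in the tree, leaves that do not satisfy condition 1).

def next_iteration_subtree(adjacency, range_originator_sets, covered_v):
    """Return (nodes of the minimum subtree for the next iteration,
               range originator sets with covered replicas removed)."""
    # Condition 1): replicas not covered by any mandatory replica.
    uncovered = {v for v in adjacency if v not in covered_v}
    # Condition 2): replicas on a path between condition-1) replicas.  In a tree this
    # is obtained by repeatedly pruning leaves that are not in `uncovered`.
    members = set(adjacency)
    degree = {v: len(adjacency[v]) for v in adjacency}
    leaves = [v for v in members if degree[v] <= 1 and v not in uncovered]
    while leaves:
        v = leaves.pop()
        members.discard(v)
        for u in adjacency[v]:
            if u in members:
                degree[u] -= 1
                if degree[u] <= 1 and u not in uncovered:
                    leaves.append(u)
    # Function 2): remove replicas covered by mandatory replicas from each ROS.
    pruned_ros = {v: ros - covered_v for v, ros in range_originator_sets.items()}
    # Function 3): the algorithm terminates when no replica remains for the next iteration.
    return members, pruned_ros

# Example: replicas 3 and 6 are uncovered, so the next subtree is the path 3-2-4-6.
tree = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2, 5, 6}, 5: {4}, 6: {4}}
ros = {v: {v} for v in tree}                    # dummy ROS values for illustration
subtree, ros = next_iteration_subtree(tree, ros, covered_v={1, 2, 4, 5})
print(sorted(subtree))                          # -> [2, 3, 4, 6]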
For function 3), the termination condition of the distributed algorithm is changed from that of the centralized algorithm. A replica can determine that its role in the distributed algorithm is completed when it is not in the minimum subtree for the next iteration, because a replica will never be included in a subtree after it is once excluded from the subtree in an iteration. When a replica completes its role in the distributed algorithm, it informs all front-end nodes about whether or not it is a read replica. Here, a replica is a read replica if it has become a mandatory replica at least once; otherwise, it is not a read replica. The pseudocode of minimum-subtree determination is described in Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

4.6 Time, Message, and Computation Complexity
We discussed the time, message, and computation complexity of classified-replica, mandatory-replica, and minimum-subtree determination in Sections 4.3, 4.4, and 4.5, respectively. In each iteration, the distributed algorithm executes classified-replica, mandatory-replica, and minimum-subtree determination until there is no replica in the subtree for the next iteration. In every iteration, there is at least one mandatory replica, and at least one replica is removed [12]. This means that the number of replicas in a tree is decreased by at least one in an iteration.

From the above discussion, the time and message complexity of our distributed algorithm to calculate a minimum read replica set are the sums of those of the above processes. Let ct and cm be the time and message complexity of the distributed algorithm in an iteration, respectively. Then, ct and cm are the sums of O(Dc bs), O(Dm bs), and O(Dg bs) and of O(Nv + Nc), O(Nv), and O(Nv), respectively, where bs is the sum of bp and bc. The time and message complexity of our distributed algorithm to calculate a minimum read replica set are consequently ct and cm multiplied by Nv. Here, the conditions Dc <= Dg, Dm <= Dg, and Nc <= Nv are satisfied. Therefore, the time and message complexity of our algorithm are at most O(Nv Dg bs) and O(Nv^2), respectively. In each of those inequalities, the right and left sides are almost equal when old data are required by clients. On the other hand, Dc and Dm are much less than Dg when new data are required by clients. As a result, the time complexity for obtaining new data is much lower than that for obtaining old data, though the order of the time complexity for new data is equal to that for old data.

The computation complexity of the first and second phase of classified-replica determination is O(Na) and O(Nb), respectively, and Na is equal to or greater than Nb. Those of mandatory-replica and minimum-subtree determination are O(Na). As a result, the computation complexity of our distributed algorithm executed by all replicas is O(Na Nv^2), because there are Nv replicas and the number of iterations is at most Nv. The computation complexity of our distributed algorithm to calculate a minimum read replica set is decreased from that of the centralized algorithm by using the information of the topology that connects all replicas and the properties of probabilistic delay. We demonstrate our distributed algorithm in Appendix D, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.
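For reference, the per-iteration and overall bounds above can be written compactly as follows. This is only a restatement of the sums discussed in this section; T and M are shorthand introduced here for the total time and message complexity and are not symbols used elsewhere in the paper.

\begin{align*}
c_t &= O(D_c b_s) + O(D_m b_s) + O(D_g b_s) = O(D_g b_s),\\
c_m &= O(N_v + N_c) + O(N_v) + O(N_v) = O(N_v),\\
T &\le N_v \cdot c_t = O(N_v D_g b_s), \qquad M \le N_v \cdot c_m = O(N_v^2),
\end{align*}

with $b_s = b_p + b_c$, $D_c, D_m \le D_g$, and $N_c \le N_v$.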
4.7 Dynamic Addition and Deletion of Replicas
In our system, replicas dynamically join and leave replicated database systems. Dynamic addition and deletion of replicas cause a change in the paths along which refresh transactions are propagated among replicas, which leads to the recalculation of range originator sets in replicas. In addition, a replica cannot immediately estimate the update propagation delay from others because it must collect samples of update propagation delay along the new paths for a particular time period.

When a new replica joins replicated database systems, we add it as a leaf replica node to minimize the change in update propagation paths in our system. Until the delay samples are collected, our distributed algorithm is performed among all replicas except for the newly joining replica, and a front-end node sends refer transactions to the newly joining replica and all read replicas calculated for the original update propagation tree. When replicas collect samples of update propagation delay from the newly joining replica, our distributed algorithm is performed by all replicas in the whole tree-topology network including the newly joining replica.

We divide replicas deleted from replicated database systems into two types: leaf and nonleaf replicas. When a leaf replica leaves replicated database systems, our distributed algorithm works by deleting it from the range originator sets of all replicas. When a nonleaf replica leaves replicated database systems, the tree-topology network for update propagation in our system is divided into two or more subtrees. Our system designates one replica among the adjacent replicas of the deleted replica as their hub and connects the designated replica to them. Because the update delay between replicas included in different subtrees changes after the deletion of a replica, our distributed algorithm is separately performed in every subtree, and a front-end node sends refer transactions to all members of the union of the read replica sets calculated in each subtree. When replicas collect samples of update propagation delay from all replicas in other subtrees after the deletion of a replica, our distributed algorithm is performed by all replicas in the newly constructed tree-topology network.
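As a small illustration of the reconnection rules above, the following Python sketch maintains the adjacency of the update propagation tree. The hub selection policy (smallest identifier) is an assumption made only for the example; the description above does not prescribe a particular choice.

def add_replica(adjacency, new_replica, attach_to):
    # A joining replica is always attached as a leaf.
    adjacency[new_replica] = {attach_to}
    adjacency[attach_to].add(new_replica)

def delete_replica(adjacency, replica):
    neighbors = adjacency.pop(replica)
    for v in neighbors:
        adjacency[v].discard(replica)
    if len(neighbors) > 1:                  # nonleaf deletion: reconnect through a hub
        hub = min(neighbors)                # illustrative hub choice (assumption)
        for v in neighbors - {hub}:
            adjacency[hub].add(v)
            adjacency[v].add(hub)

tree = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2, 5, 6}, 5: {4}, 6: {4}}
add_replica(tree, 7, attach_to=3)           # join as a leaf under replica 3
delete_replica(tree, 4)                     # nonleaf deletion: 5 and 6 reconnect via hub 2
print(tree[2])                              # -> {1, 3, 5, 6}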
5 EVALUATION

5.1 Controlled Data Freshness
Data freshness controlled by our algorithm depends on the delay of update propagation and the topology that connects replicas. When our method is practically used, the topology that connects replicas should have a minimum diameter in order to improve data freshness, because the number of hop counts for message propagation has the minimum value in a tree with a minimum diameter. Therefore, we used topology with a minimum diameter for the evaluation of data freshness controlled by our algorithm, though our algorithm also works in arbitrary tree-topology networks. In addition to a tree with a minimum diameter, we also evaluated controlled data freshness in five randomly generated trees for comparison. In these trees, there were 1,000 replica nodes whose maximum degrees were 6 and 11, respectively. This means that the maximum fan outs for message propagation are 5 and 10 when the maximum degrees of nodes are 6 and 11, respectively.

We compared the view divergence of data freshness controlled by our method with that of related works, which are epidemic-style [31], [32], [33] and chain-style [34] update propagation. The process of direct update propagation from one replica to another in epidemic-style and chain-style methods for lazy-group replication is similar to that in our method. Epidemic-style update propagation is a message delivery method using an infect-die model [32]. In the infect-die model, a node distributes a message to randomly selected nodes when it first receives the message; a node never distributes a message to any node when it receives an already received message. In chain-style update propagation, nodes are connected in a chain [34]. When a node receives a message, the message is propagated to a constant number of the closest nodes, where only the most distant node has the role of message propagation to other nodes.

The delay of update propagation is caused by complicated processes such as message delivery time by network systems, waiting time in the message queue of an operating system, and transaction processing time by database management systems. In addition, long and short delays may occur with low probability. Therefore, modeling the statistical delay time distribution of update propagation is, in general, difficult. We used the well-known Gamma distribution for the delay time because our objectives are to achieve a relative comparison of our method with related work and to evaluate the rough efficiency of our method. This type of distribution, which includes the Gamma and Weibull distributions, is generally represented by

(x - c)^(r-1) exp(-lambda(x - c)).   (4)

The delay assigned to each direction of each link includes the delays that occur in the channel of the assigned link and in both its end nodes, for the period from the beginning of update propagation in the initial end node to the completion of update processing in the terminal end node. We used two probability density functions, each with different Gamma function parameters, and one of the two functions was randomly assigned to each direction of each link. The parameters were 1) c = 10 (s), lambda = 2, and r = 2 and 2) c = 20 (s), lambda = 2, and r = 1. For epidemic-style and chain-style update propagation, one of the two probability density functions used for the evaluation of our method was randomly assigned to each path along which an update is directly propagated from one replica to another. In this evaluation, the probability p used in the definition of RDF is 0.95.
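For concreteness, the per-link delay model of (4) can be sampled as follows. This is an illustrative sketch rather than the simulator used in the paper; the rate parameter written as lam below restores a symbol that is garbled in the source, so its value and its pairing with the two settings are partly assumptions.

import random

def sample_link_delay(c, r, lam):
    """One update-propagation delay sample: c + Gamma(shape=r, rate=lam)."""
    return c + random.gammavariate(r, 1.0 / lam)   # gammavariate takes scale = 1/rate

PARAMS = [
    {"c": 10.0, "r": 2, "lam": 2.0},   # setting 1)
    {"c": 20.0, "r": 1, "lam": 2.0},   # setting 2)
]

def assign_link_delays(directed_links):
    # Each direction of each link is randomly assigned one of the two settings.
    return {link: random.choice(PARAMS) for link in directed_links}

random.seed(0)
links = [(1, 2), (2, 1), (2, 3), (3, 2)]
assignment = assign_link_delays(links)
print(round(sample_link_delay(**assignment[(1, 2)]), 2))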
Fig. 7. View divergence of data freshness achieved by the proposed method and related work.

The view divergence of data freshness controlled by our method and related works is shown in Fig. 7. The horizontal axis is the RDF normalized by the maximum delay time in the tree-topology network with a minimum diameter, where the maximum delay times are 2.03 x 10^2 and 1.36 x 10^2 when the maximum fan outs are 5 and 10, respectively. The vertical axis is the percentage of read replicas in all replicas.

From Fig. 7, the proposed method achieves normalized RDF values of 3.40 x 10^-1 and 4.24 x 10^-1 by searching less than 1 percent of all replicas when the maximum fan outs of the tree-topology network are 5 and 10, respectively. In contrast, the normalized RDF values of the epidemic-style (chain-style) methods are 6.16 x 10^-1 and 6.72 x 10^-1 (1.09 x 10 and 8.16) when the maximum fan outs are 5 and 10, respectively. As a result, our method achieves more than 36.9 and 94.8 percent improvement in RDF compared with the epidemic-style and chain-style methods, respectively. In addition, the proposed method can retrieve data with normalized RDF values of 5.06 x 10^-1 and 5.07 x 10^-1 in a tree with the minimum diameter by searching only one replica node when the maximum fan outs of the tree-topology networks are 5 and 10, respectively.

When the proposed method is used for the randomly generated trees, the normalized RDF obtained by searching only one replica node is roughly twice that obtained in the tree with the minimum diameter. The increase of the RDF values in the randomly generated trees is caused by the increase of the maximum update propagation delay.
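For comparison purposes, the infect-die baseline described above can be reproduced with a short event-driven simulation. The following Python code is an illustrative sketch, not the authors' simulator; the fan out, the delay function, and the network size are assumptions made only for the example.

import heapq
import random

def epidemic_delivery_times(nodes, source, fan_out, delay_fn):
    """Return the time at which each node first receives the update."""
    first_rx = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > first_rx.get(u, float("inf")):
            continue                        # already infected earlier; infect-die: stay silent
        # On first receipt, forward to fan_out randomly selected nodes.
        targets = random.sample([v for v in nodes if v != u], min(fan_out, len(nodes) - 1))
        for v in targets:
            arrival = t + delay_fn()
            if arrival < first_rx.get(v, float("inf")):
                first_rx[v] = arrival
                heapq.heappush(heap, (arrival, v))
    return first_rx

random.seed(1)
nodes = list(range(100))
times = epidemic_delivery_times(nodes, source=0, fan_out=5,
                                delay_fn=lambda: 10.0 + random.gammavariate(2, 0.5))
print(max(times.values()))                  # worst first-receipt time in this run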
Fig. 7 also shows the RDF values that are not normalized. Our method can control the RDF over the regions from 6.90 x 10 to 2.03 x 10^2 and from 5.77 x 10 to 1.36 x 10^2 when the maximum fan outs are 5 and 10, respectively. In contrast, the RDF values of the epidemic-style (chain-style) methods, which are not normalized, are 1.25 x 10^2 and 9.14 x 10 (2.21 x 10^3 and 1.11 x 10^3) when the maximum fan outs are 5 and 10, respectively. Therefore, our method can achieve better RDF than the epidemic-style methods do by accessing more than 0.74 percent of all replicas on average. When we use epidemic-style and chain-style methods, clients have to wait for more than the time corresponding to the normalized RDF value of 2.48 x 10^-1 in order to acquire the same data as in our method, or they have to use old data corresponding to more than that time. If epidemic-style or chain-style methods are used, we have to increase the fan out for update propagation to improve RDF, which increases the load on replicas even when fresh data are rarely needed. However, the RDF values achieved by the related work (epidemic-style and chain-style methods) are worse than those of our method even if we double the fan out of the related work. Hence, in a case where when and how new data are needed depends on applications and the timing of data acquisition, as described in Section 2, our method is much more advantageous than epidemic-style and chain-style methods.
5.2 Efficiency of Distributed Algorithm
We evaluated the efficiency of our distributed algorithm in terms of time, message, and computation complexity. Our distributed algorithm iterates three types of processes: classified-replica, mandatory-replica, and minimum-subtree determination. As described in Section 4, the computation complexity of the distributed algorithm to calculate a minimum read replica set is O(Na) for one replica in one iteration, where the computation complexity of message transmission/receipt and set comparison are both O(Na).

For evaluation, we used 50 randomly generated tree-topology networks. They satisfied only the condition that the degrees of nodes are in the range from 1 to 5. The numbers of replicas in the evaluation were 100, 500, and 1,000. The mean diameters of networks with 100, 500, and 1,000 replicas were 1.50 x 10, 2.46 x 10, and 2.93 x 10, respectively, and the maximum delay times for update propagation in these networks were 2.71 x 10^2, 4.40 x 10^2, and 5.20 x 10^2, respectively.

Fig. 8. Number of iterations and time complexity. (a) Number of iterations. (b) Time complexity of our distributed algorithm normalized by time complexity for propagating information along mean diameter of networks. For comparison, time complexity of calculating a minimum read replica set using gossiping in classified-replica and mandatory-replica determination is also shown in (b).

Fig. 8a shows the mean numbers of iterations performed by replicas. The mean number of iterations is small (at most about three) and almost independent of the number of replicas; that is, our algorithm terminates in a small number of iterations, though the number of iterations is theoretically O(Nv) in the worst case [12]. In addition, the mean numbers of replicas participating in the second or later iterations in tree-topology networks with 100, 500, and 1,000 replicas are at most 7, 4.51 x 10, and 9.36 x 10, respectively. The time, message, and computation complexity of our algorithm are therefore dominated largely by the first iteration.

Fig. 8b shows the time complexity of the distributed algorithm normalized by the time complexity for propagating information along the mean diameter of the networks. For comparison, the time complexity to calculate a minimum read replica set using the centralized algorithm and gossiping is also plotted in Fig. 8b. Although the normalized time complexity of our algorithm is theoretically O(Nv), the values obtained by simulation stay small and are almost independent of the number of replicas, because the complexity is caused by the number of iterations and is dominated largely by the first iteration of our algorithm.

Fig. 9a shows the mean number of messages normalized by the number of replicas. Because the message complexity of our algorithm is at most O(Nv^2), as described in Section 4, the number of messages normalized by the number of replicas is O(Nv). However, the number of messages calculated by simulation is almost independent of the number of replicas, again because the number of messages is dominated largely by the first iteration of our algorithm.
Fig. 9. Message and computation complexity. (a) Number of messages exchanged among replicas. (b) Number of comparisons of range originator sets. Numbers of messages and comparisons are normalized by the number of replicas.

Fig. 9b shows the mean number of comparisons of range originator sets normalized by the number of replicas. Here, we compare the computation complexity of the centralized and distributed algorithms in terms of the number of set comparisons; the computation complexity of the centralized algorithm in one iteration is determined by the number of set comparisons. Although the number of comparisons of our distributed algorithm normalized by the number of replicas is theoretically O(Na Nv), the number of comparisons in the figure is almost independent of the number of replicas because it is dominated largely by the first iteration of our algorithm, as described in this section. In addition, the number of comparisons in the distributed algorithm is much smaller than that in the centralized one. From the above discussion, our distributed algorithm achieves lower computation complexity compared with the centralized one and effective load balancing of computation complexity among replicas.
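As a compact restatement of the theoretical quantities plotted in Figs. 8b, 9a, and 9b, the normalized measures are bounded as follows; T, M, and C are shorthand introduced here for the total time, the total number of messages, and the total number of set comparisons, and are not symbols used elsewhere in the paper.

$$\frac{T}{D_g b_s} = O(N_v), \qquad \frac{M}{N_v} = O(N_v), \qquad \frac{C}{N_v} = O(N_a N_v).$$

In the simulations, all three normalized measures stay nearly constant as the number of replicas grows because the first iteration dominates.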
6 RELATED WORK
Data freshness is one of the most important attributes of data quality in a number of applications, such as data warehouses, Web servers, and data integration [10], [11], [26]. To guarantee and improve data freshness, various methods have been proposed [26], [35], [36]. They achieve the guarantee or improvement of data freshness under the condition that only one source database exists in a system, as in lazy-master replication [36], [35]. However, in lazy-group replication, as described in Section 3, a replica with the most up-to-date data does not always exist as a source database. Therefore, the aforementioned methods are not available in lazy-group replication. In our method, a refer request issued by a client is sent to multiple replicas in what we call a read replica set, and the client obtains the data that reflect all updates received by the replicas in that set. Our method calculates a minimum read replica set using a distributed algorithm so that the data freshness requested by a client is satisfied.

Recently, data replication and caching methods have been studied for distributed systems with unreliable components, such as peer-to-peer systems and ad hoc networks [37], [38]. Such systems are usually composed of individually administrated hosts that dynamically join and leave the systems. These methods can probabilistically provide high availability when operational replicas satisfy particular conditions. In such systems, the freshness of data that a node can retrieve depends on system conditions, such as the frequency of updates, the workload of replicas, and network delay. On the other hand, our method can provide data with various degrees of freshness to clients by adaptively changing read replicas according to such system conditions. For replicated peer-to-peer systems, update propagation methods have also been studied [31], [33], [34]. These methods achieve effective update propagation in peer-to-peer systems.

7 CONCLUSION
We have proposed a distributed method to control the view divergence of data freshness for clients in replicated database systems. Our method enables clients to retrieve data with various degrees of freshness from replicas in lazy-group replication architectures according to the degree of freshness required by clients. We evaluated by simulation the distributed algorithm to calculate a minimum read replica set in terms of controlled data freshness and time, message, and computation complexity. As a result, our method achieves more than 36.9 and 94.8 percent improvement in data freshness compared with epidemic-style and chain-style update propagation methods, respectively. In addition, the evaluation results in terms of time, message, and computation complexity of our distributed algorithm showed that our method can control the view divergence in networks with 100 to 1,000 replicas with high scalability while enabling effective load balancing of view divergence control.

REFERENCES
[1] P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[33] I. Gupta, A.-M. Kermarrec, and A.J. Ganesh, "Efficient and Adaptive Epidemic-Style Protocols for Reliable and Scalable Multicast," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 7, pp. 593-605, July 2006.
[34] Z. Wang, S.K. Das, M. Kumar, and H. Shen, "Update Propagation through Replica Chain in Decentralized and Unstructured P2P Systems," Proc. IEEE Int'l Conf. Peer-to-Peer Computing (P2P '04), 2004.
[37] F.M. Cuenca-Acuna, R.P. Martin, and T.D. Nguyen, "Autonomous Replication for High Availability in Unstructured P2P Systems," Proc. 22nd IEEE Int'l Symp. Reliable Distributed Systems, 2003.
[38] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher, "Adaptive Replication in Peer-to-Peer Systems," Proc. 24th IEEE Int'l Conf. Distributed Computing Systems, 2004.
M. Bouzeghoub and V. Peralta, "A Framework for Analysis of Data Freshness," Proc. Int'l Workshop Information Quality in Information Systems, 2004.
T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, second ed. MIT Press, 2001.
L.P. Cox and B.D. Noble, "Fast Reconciliations in Fluid Replication," Proc. Int'l Conf. Distributed Computing Systems, 2001.
A. Datta, M. Hauswirth, and K. Aberer, "Updates in Highly Unreliable, Replicated Peer-to-Peer Systems," Proc. 23rd IEEE Int'l Conf. Distributed Computing Systems, 2003.
A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, "Epidemic Algorithms for Replicated Database Maintenance," Proc. Sixth Ann. ACM Symp. Principles of Distributed Computing, pp. 1-12, 1987.
E.W. Dijkstra and C.S. Scholten, "Termination Detection for Diffusing Computations," Information Processing Letters, vol. 11, no. 1, pp. 1-4, 1980.
D.-Z. Du and D.F. Hsu, eds., Combinatorial Network Theory. Kluwer Academic Publishers, 1996.
L.P. English, Improving Data Warehouse and Business Information Quality. John Wiley & Sons, 1999.
P.T. Eugster, R. Guerraoui, A.-M. Kermarrec, and L. Massoulie, "Epidemic Information Dissemination in Distributed Systems," Computer, vol. 37, no. 5, pp. 60-67, May 2004.
M.J. Fischer and A. Michael, "Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network," Proc. ACM Symp. Principles of Database Systems, May 1982.
I. Foster and C. Kesselman, eds., The Grid 2: Blueprint for a New Computing Infrastructure, second ed. Morgan Kaufmann, 2003.
I. Foster, C. Kesselman, J.M. Nick, and S. Tuecke, "Grid Services for Distributed System Integration," Computer, vol. 35, no. 6, pp. 37-46, June 2002.
J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
J. Gray, P. Helland, P. O'Neil, and D. Shasha, "The Dangers of Replication and a Solution," Proc. ACM SIGMOD '96, pp. 173-182, June 1996.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, second ed. Morgan Kaufmann, 2006.
A. Helal, A. Heddaya, and B. Bhargava, Replication Techniques in Distributed Systems. Kluwer Academic Publishers, 1996.
M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods, second ed. John Wiley & Sons, 1999.
R. Hull and G. Zhou, "A Framework for Supporting Data Integration Using the Materialized and Virtual Approaches," Proc. ACM SIGMOD '96, pp. 481-492, June 1996.
C. Huitema, Routing in the Internet. Prentice-Hall, 1995.
A. Labrinidis and N. Roussopoulos, "Exploring the Tradeoff between Performance and Data Freshness in Database-Driven Web Servers," The VLDB J., vol. 13, no. 3, 2004.
R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat, "Providing High Availability Using Lazy Replication," ACM Trans. Computer Systems, vol. 10, no. 4, pp. 360-391, 1992.
L. Leemis, Reliability: Probabilistic Models and Statistical Methods. Prentice-Hall, 1995.
J. Levine, "An Algorithm to Synchronize the Time of a Computer to Universal Time," IEEE/ACM Trans. Networking, vol. 3, no. 1, 1995.
N. Lynch, Distributed Algorithms. Morgan Kaufmann, 1996.
D.L. Mills, "Precision Synchronization of Computer Network Clocks," ACM SIGCOMM Computer Comm. Rev., vol. 24, no. 2, pp. 28-43, 1994.
S. Mullender, ed., Distributed Systems, second ed. ACM Press/Addison-Wesley, 1993.
E. Pacitti, E. Simon, and R. Melo, "Improving Data Freshness in Lazy Master Schemes," Proc. 18th IEEE Int'l Conf. Distributed Computing Systems, May 1998.
C. Pu and A. Leff, "Replica Control in Distributed Systems: An Asynchronous Approach," Proc. ACM SIGMOD '91, May 1991.
S. Shiba and H. Watanabe, Statistical Methods II: Estimation. Shinyosha (in Japanese).
B. Shin, "An Exploratory Investigation of System Success Factors in Data Warehousing," J. Assoc. for Information Systems, vol. 4, 2003.
T. Yamashita and S. Ono, "Controlling View Divergence of Data Freshness in a Replicated Database System Using Statistical Update Delay Estimation," IEICE Trans. Information and Systems, vol. E88-D, 2005.
T. Yamashita and S. Ono, "View Divergence Control of Replicated Data Using Update Delay Estimation," Proc. 18th IEEE Symp. Reliable Distributed Systems, 1999.

Takao Yamashita received the BS and MS degrees in electrical engineering from Kyoto University in 1990 and 1992, respectively, and the PhD degree in informatics from Kyoto University in 2006. In 1992, he joined Nippon Telegraph and Telephone Corporation. His current research interests encompass loosely coupled distributed systems, network security, and distributed algorithms. He is a member of the IEEE, the IEEE Computer Society, and the APS.