
Deadlocks – Recovery – Distributed Query Processing


Module III
DEADLOCK HANDLING IN DISTRIBUTED DATABASE

➢ A deadlock is a situation in which two or more transactions are waiting for data items that are held by, and yet to be released by, the other transactions.

➢ Like centralized database systems, distributed database systems are also prone to deadlocks.

➢ In Distributed Database systems, we need to handle transactions differently.

➢ Every site that is part of the distributed database has its own transaction-specific components - a transaction coordinator, a transaction manager, lock managers, etc.

➢ Moreover, data might be owned by many different sites or replicated at many sites.

➢ For these reasons, deadlock handling is a harder job in a distributed database.

➢ Deadlocks can be handled in two ways;

1. Deadlock prevention – deals with preventing a deadlock before it occurs. It is hard even in a centralized database system, as it involves a larger number of rollbacks and slows down transactions. In a distributed database it causes even more problems, because it rolls back transactions that are running at many sites (not on a single server but possibly on many servers).

2. Deadlock detection – deals with detecting a deadlock after it has occurred. In centralized database systems, detection is easier than prevention, and we handled it using wait-for graphs. In the case of a distributed database, the main problem is where and how to maintain the wait-for graphs.

Deadlock detection technique in distributed databases

➢ We handled deadlock detection in a centralized database system using a wait-for graph. The same technique can be used in a distributed database.

➢ That is, we can maintain a local wait-for graph at every site. If the local wait-for graph of any site forms a cycle, then we would say that a deadlock has occurred.

➢ On the other hand, the absence of a cycle in every local wait-for graph does not mean that no deadlock has occurred.

➢ Let us discuss this point with local wait-for graph examples as shown below;

➢ Figure 1 shows the lock request status for transactions T1, T2, T3 and T4 in a distributed database system.

➢ In the local wait-for graph of SITE 1, transaction T2 is waiting for transactions T1 and T3 to finish. In SITE 2, transaction T3 is waiting for T4, and T4 is waiting for T2.

➢ From the SITE 1 and SITE 2 local wait-for graphs, it is clear that transactions T2 and T3 are involved at both sites.
➢ You can observe from the local wait-for graphs of SITE 1 and SITE 2 that neither graph contains a cycle. If we merge these two local wait-for graphs into a single wait-for graph, we get the graph shown in Figure 2 below. From Figure 2, it is clear that the union of the two local wait-for graphs forms a cycle, which means a deadlock has occurred. This merged wait-for graph is called the Global wait-for graph.
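To make the idea concrete, here is a minimal sketch (in Python, with the transaction names and wait-for edges taken from Figure 1; the graph representation and the depth-first cycle check are illustrative, not part of any particular DBMS): each site reports its local wait-for edges, the coordinator takes their union, and a cycle in the union signals a deadlock.

    # Sketch: merging local wait-for graphs and detecting a cycle in the union.
    from collections import defaultdict

    # An edge (Ti, Tj) means "Ti waits for Tj".
    site1_edges = [("T2", "T1"), ("T2", "T3")]   # SITE 1: T2 waits for T1 and T3
    site2_edges = [("T3", "T4"), ("T4", "T2")]   # SITE 2: T3 waits for T4, T4 waits for T2

    def build_graph(edge_lists):
        graph = defaultdict(set)
        for edges in edge_lists:
            for waiter, holder in edges:
                graph[waiter].add(holder)
        return graph

    def has_cycle(graph):
        WHITE, GREY, BLACK = 0, 1, 2
        colour = defaultdict(int)            # every node starts WHITE

        def visit(node):
            colour[node] = GREY
            for succ in graph[node]:
                if colour[succ] == GREY:     # back edge -> cycle found
                    return True
                if colour[succ] == WHITE and visit(succ):
                    return True
            colour[node] = BLACK
            return False

        return any(colour[n] == WHITE and visit(n) for n in list(graph))

    print(has_cycle(build_graph([site1_edges])))               # False: SITE 1 alone has no cycle
    print(has_cycle(build_graph([site2_edges])))               # False: SITE 2 alone has no cycle
    print(has_cycle(build_graph([site1_edges, site2_edges])))  # True: T2 -> T3 -> T4 -> T2 in the union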
Centralized deadlock detection approach

➢ This is a technique used in distributed database systems to handle deadlock detection.

➢ According to this approach, the system maintains one Global wait-for graph at a single chosen site, which is called the deadlock-detection coordinator.

➢ The Global wait-for graph is updated during the following conditions;

▪ Whenever a new edge is inserted into or removed from one of the local wait-for graphs.

▪ Periodically, when a number of changes have occurred in a local wait-for graph.

▪ Whenever the coordinator needs to invoke the cycle-detection algorithm.


➢ How does it work?
▪ When the deadlock-detection coordinator starts the deadlock-detection algorithm, it searches the Global wait-for graph for cycles.

▪ If the coordinator finds a cycle, then the following happens;

❖ The coordinator selects a victim transaction that needs to be rolled back.
❖ The coordinator informs all the sites in the distributed database about the victim transaction.
❖ The sites roll back the victim transaction.
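The coordinator's reaction to a detected cycle can be sketched as follows (Python; the victim-selection rule used here, "roll back the youngest transaction", and the notify_sites/rollback helpers are assumptions for illustration only, since the source does not prescribe a particular policy):

    # Sketch of the coordinator's side: pick a victim and tell every site to roll it back.

    def choose_victim(cycle, start_timestamps):
        # Assumed policy: the youngest transaction (largest start timestamp) is the victim,
        # so that older transactions keep the work they have already done.
        return max(cycle, key=lambda txn: start_timestamps[txn])

    class Site:
        def __init__(self, name):
            self.name = name
        def rollback(self, txn):
            print(f"{self.name}: rolling back {txn}")

    def notify_sites(victim, sites):
        # Hypothetical helper: inform every site about the chosen victim.
        for site in sites:
            site.rollback(victim)

    cycle = ["T2", "T3", "T4"]                         # cycle found in the Global wait-for graph
    start_timestamps = {"T2": 10, "T3": 25, "T4": 40}  # assumed start times
    sites = [Site("SITE 1"), Site("SITE 2")]

    victim = choose_victim(cycle, start_timestamps)    # "T4" under the assumed policy
    notify_sites(victim, sites)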

➢ This approach (the centralized detection approach) may lead to unnecessary rollbacks due to one of the following (the main cause is communication delay):

1. False cycles – a cycle that appears in the Global wait-for graph only because update messages reach the coordinator out of order; see the example below.

2. Individual transaction rollback while a victim is being chosen – for example, assume that a deadlock has occurred in a distributed database. The coordinator chooses one victim transaction and informs the sites to roll it back. At the same time, for some other reason, a transaction Ti involved in the deadlock rolls itself back, which already breaks the deadlock. The rollback of the chosen victim is then an unnecessary rollback.
Example of a false cycle: Suppose that T2 releases the resource it is holding at site S1, resulting in the deletion of the edge T1 → T2 at S1. Transaction T2 then requests a resource held by T3 at site S2, resulting in the addition of the edge T2 → T3 at S2. If the "insert T2 → T3" message from S2 arrives at the coordinator before the "remove T1 → T2" message from S1, the coordinator may discover the false cycle T1 → T2 → T3 → T1 (assuming an edge T3 → T1 already exists in the global graph) after the insert but before the remove. Deadlock recovery may then be initiated, although no deadlock has actually occurred.
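A short sketch of this scenario (Python; the pre-existing edge T3 → T1 and the message format are assumptions used only to make the cycle visible):

    # Sketch: the coordinator applies updates in arrival order, not in the order they happened.
    from collections import defaultdict

    global_wfg = defaultdict(set)
    global_wfg["T1"].add("T2")    # edge T1 -> T2 previously reported by S1
    global_wfg["T3"].add("T1")    # assumed pre-existing edge that closes the false cycle

    # Real order: S1 removes T1 -> T2, then S2 inserts T2 -> T3.
    # Because of communication delay, the messages arrive in the opposite order.
    arrival_order = [
        ("insert", "T2", "T3"),   # from S2, arrives first
        ("remove", "T1", "T2"),   # from S1, arrives second
    ]

    def apply(message):
        kind, waiter, holder = message
        if kind == "insert":
            global_wfg[waiter].add(holder)
        else:
            global_wfg[waiter].discard(holder)

    apply(arrival_order[0])
    # At this instant the coordinator sees T1 -> T2 -> T3 -> T1: a false cycle.
    print({t: sorted(w) for t, w in global_wfg.items()})

    apply(arrival_order[1])
    # Once the delayed remove arrives, the cycle disappears; no deadlock ever existed.
    print({t: sorted(w) for t, w in global_wfg.items()})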
DISTRIBUTED QUERY PROCESSING

➢ Query processing in a distributed database management system requires the transmission of data between the computers in a network.

➢ A distribution strategy for a query is the ordering of data transmissions and local data processing
in a database system.

➢ Generally, a query in a distributed DBMS requires data from multiple sites, and this need for data from different sites leads to the transmission of data, which incurs communication costs.

➢ Query processing in a distributed DBMS differs from query processing in a centralized DBMS because of this communication cost of transferring data over the network.

➢ The transmission cost is low when sites are connected through high-speed Networks and is quite
significant in other networks.
1. Costs (transfer of data) of distributed query processing:

➢ In distributed query processing, the data transfer cost means the cost of transferring intermediate files to other sites for processing, together with the cost of transferring the final result files to the site where the result is required.

➢ Let’s say that a user sends a query to site S1, which requires data from S1 itself and also from another site S2.

➢ Now, there are three strategies to process this query, which are given below:

▪ We can transfer the data from S2 to S1 and then process the query at S1.

▪ We can transfer the data from S1 to S2 and then process the query at S2.

▪ We can transfer the data from both S1 and S2 to a third site S3 and then process the query there.

➢ So the choice depends on various factors such as the size of the relations and of the result, the communication cost between the different sites, and the site at which the result will be used.
➢ Commonly, the data transfer cost is calculated in terms of the size of the messages.

➢ The data transfer cost can be calculated using the formula below:

Data transfer cost = C * Size

where C is the cost per byte of transferring data and Size is the number of bytes transmitted.

Example: Consider the following tables EMPLOYEE and DEPARTMENT.

Site 1: EMPLOYEE (EID, NAME, SALARY, DID)
EID - 10 bytes
NAME - 20 bytes
SALARY - 20 bytes
DID - 10 bytes
Record size - 60 bytes
Total records - 1000

Site 2: DEPARTMENT (DID, DNAME)
DID - 10 bytes
DNAME - 20 bytes
Record size - 30 bytes
Total records - 50
Example: Find the names of employees and their department names. Also, find the amount of data transferred to execute this query when the query is submitted at Site 3.

The query is submitted at Site 3, and neither of the two relations, EMPLOYEE and DEPARTMENT, is available at Site 3. So, to execute this query, we have three strategies:

▪ Transfer both tables, EMPLOYEE and DEPARTMENT, to SITE 3 and join them there.
The total cost is 1000 * 60 + 50 * 30 = 60,000 + 1,500 = 61,500 bytes.

▪ Transfer the table EMPLOYEE to SITE 2, join the tables at SITE 2 and then transfer the result to SITE 3.
The total cost is 60 * 1000 + 40 * 1000 = 100,000 bytes, since we must transfer 1000 result tuples containing NAME and DNAME (40 bytes each) from SITE 2 to SITE 3.

▪ Transfer the table DEPARTMENT to SITE 1, join the tables at SITE 1 and then transfer the result to SITE 3.
The total cost is 30 * 50 + 40 * 1000 = 41,500 bytes, since we must transfer 1000 result tuples containing NAME and DNAME (40 bytes each) from SITE 1 to SITE 3.
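The three figures can be reproduced with a short sketch (Python; the record counts and attribute sizes are taken from the example above, and the 40-byte result record is simply NAME + DNAME):

    # Sketch: data transfer cost, in bytes, of the three query-processing strategies.

    EMP_RECORDS, EMP_SIZE = 1000, 60     # EMPLOYEE at Site 1
    DEPT_RECORDS, DEPT_SIZE = 50, 30     # DEPARTMENT at Site 2
    RESULT_RECORDS = EMP_RECORDS         # every employee appears once in the result
    RESULT_SIZE = 20 + 20                # NAME (20 bytes) + DNAME (20 bytes)

    # Strategy 1: ship both relations to Site 3 and join there.
    strategy1 = EMP_RECORDS * EMP_SIZE + DEPT_RECORDS * DEPT_SIZE

    # Strategy 2: ship EMPLOYEE to Site 2, join there, ship the result to Site 3.
    strategy2 = EMP_RECORDS * EMP_SIZE + RESULT_RECORDS * RESULT_SIZE

    # Strategy 3: ship DEPARTMENT to Site 1, join there, ship the result to Site 3.
    strategy3 = DEPT_RECORDS * DEPT_SIZE + RESULT_RECORDS * RESULT_SIZE

    print(strategy1, strategy2, strategy3)   # 61500 100000 41500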
2. Using semi-join in distributed query processing:

➢ The semi-join operation is used in distributed query processing to reduce the number of tuples in a
table before transmitting it to another site.

➢ This reduction in the number of tuples reduces the number and the total size of the transmissions, which ultimately reduces the total cost of data transfer.

➢ Let’s say that we have two tables R1 and R2, located at sites S1 and S2 respectively.

➢ Now, we forward the joining column of one table, say R1, to the site where the other table, say R2, is located.

➢ This column is joined with R2 at that site. The decision whether to reduce R1 or R2 can only be made after comparing the advantage of reducing R1 with that of reducing R2.

➢ Thus, the semi-join is an effective way to reduce the transfer of data in distributed query processing, as the sketch below illustrates.
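A toy illustration of the reduction (Python; the relation contents, attribute names and sites are made up for demonstration):

    # Sketch of a semi-join reduction with small in-memory relations.

    R1 = [  # at site S1: (eid, name, did)
        (1, "Asha", 10), (2, "Ravi", 20), (3, "Maya", 10),
    ]
    R2 = [  # at site S2: (did, dname)
        (10, "Sales"), (20, "HR"), (30, "Finance"), (40, "IT"),
    ]

    # Step 1 (at S1): project only the joining column of R1 and ship that to S2.
    join_values_of_R1 = {did for (_eid, _name, did) in R1}        # {10, 20}: far smaller than R1 itself

    # Step 2 (at S2): keep only the R2 tuples that will actually participate in the join.
    reduced_R2 = [row for row in R2 if row[0] in join_values_of_R1]

    # Step 3: only the reduced R2 (2 tuples instead of 4) needs to travel to the final join site.
    print(reduced_R2)    # [(10, 'Sales'), (20, 'HR')]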
Example: Find the amount of data transferred to execute the same query given in the above example using the semi-join operation.

▪ Project the required attributes of the EMPLOYEE table at Site 1 and then transfer them to Site 3. For this, we transfer NAME and DID of EMPLOYEE, and the size is 30 * 1000 = 30,000 bytes (NAME 20 bytes + DID 10 bytes per tuple).

▪ Transfer the table DEPARTMENT to Site 3 and join the projected attributes of EMPLOYEE with it. The size of the DEPARTMENT table is 30 * 50 = 1,500 bytes.

Applying the above scheme, the amount of data transferred to execute the query will be 30000 + 1500 =
31500 bytes.
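The bytes shipped under this plan can be tallied as follows (Python; the attribute sizes come from the example, and the 41,500-byte figure is the cheapest full-transfer strategy computed earlier):

    # Sketch: data shipped by the projection/semi-join based plan.

    EMP_RECORDS = 1000
    DEPT_RECORDS, DEPT_SIZE = 50, 30

    # Step 1: project EMPLOYEE down to (NAME, DID) at Site 1 and ship it to Site 3.
    projected_emp_size = 20 + 10                 # NAME (20 bytes) + DID (10 bytes)
    step1 = EMP_RECORDS * projected_emp_size     # 30,000 bytes

    # Step 2: ship DEPARTMENT to Site 3 and join it there with the projected EMPLOYEE.
    step2 = DEPT_RECORDS * DEPT_SIZE             # 1,500 bytes

    print(step1 + step2)                         # 31500 bytes, versus 41500 for the best plan without reduction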
