Professional Documents
Culture Documents
Motivation
Database Computer
Technology Networks
integration distribution
Distributed
Database
Systems
integration
integration ≠ centralization
Distributed Computing
Distributed DBMS.
Software system that permits the management
of the distributed database and makes the
distribution transparent to users.
Concepts
• Collection of logically-related shared data.
• Data split into fragments.
• Fragments may be replicated.
• Fragments/replicas allocated to sites.
• Sites linked by a communications network.
• Data at each site is under control of a DBMS.
• DBMSs handle local applications autonomously.
• Each DBMS participates in at least one global application.
What is not a DDBS?
Site 1
Site 2
Site 5
Communication
Network
Site 4 Site 3
Implicit Assumptions
Data stored at a number of sites each site
logically consists of a single processor.
Processors at different sites are interconnected
by a computer network no multiprocessors
◦ parallel database systems
Distributed database is a database, not a
collection of files data logically related as
exhibited in the users’ access patterns
◦ relational data model
D-DBMS is a full-fledged DBMS
◦ not remote file system, not a TP system
Advantages of DDBMSs
• Organizational Structure
• Shareability and Local Autonomy
• Improved Availability
• Improved Reliability
• Improved Performance
• Economics
• Modular Growth
Advantages
Distributed Database System
– Improved performance:
• A distributed DBMS fragments the database to keep
data closer to where it is needed most.
• This reduces data management (access and
modification) time significantly.
– Easier expansion (scalability):
• Allows new nodes (computers) to be added anytime
without chaining the entire configuration.
Advantages
Distributed Database System
– Increased reliability and availability:
• Reliability refers to system live time, that is, system is
running efficiently most of the time. Availability is the
probability that the system is continuously available
(usable or accessible) during a time interval.
• A distributed database system has multiple nodes
(computers) and if one fails then others are available to
do the job.
Distributed Database System
Improved performance
Distributed Database
Distributed DBMS - Reality
User
Query
DBMS
Software
User
Application
DBMS
Software
DBMS Communication
Software Subsystem
User
DBMS User Application
Software Query DBMS
Software
User
Query
Additional Functions of DDBS
• Keep track of data
• Distributed query processing
• Replicated data management
• Additional security requirement
• Global transaction management
Distributed DBMS Issues
Distributed Database Design
◦ how to distribute the database
◦ replicated & non-replicated database distribution
◦ a related problem in directory management
Query Processing
◦ convert user transactions to data manipulation
instructions
◦ optimization problem
◦ min{cost = data transmission + local processing}
Design of DDBS
• Data fragmentation
• Data Replication
• Data Allocation
Data Replication
• A relation or fragment of a relation is replicated if it is stored redundantly in
two or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the
entire database.
Data Replication (Cont.)
• Advantages of Replication
– Availability: failure of site containing relation r does not result in unavailability
of r is replicas exist.
– Parallelism: queries on r may be processed by several nodes in parallel.
– Reduced data transfer: relation r is available locally at each site containing a
replica of r.
• Disadvantages of Replication
– Increased cost of updates: each replica of relation r must be updated.
– Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
• One solution: choose one copy as primary copy and apply concurrency
control operations on primary copy
Data Fragmentation
• Data Fragmentation
– Split a relation into logically related and correct
parts. A relation can be fragmented in two ways:
• Horizontal Fragmentation
• Vertical Fragmentation
Slide 25- 27
Data Fragmentation, Replication and Allocation
• Horizontal fragmentation
– It is a horizontal subset of a relation which contain those of
tuples which satisfy selection conditions.
– Consider the Employee relation with selection condition (DNO =
5). All tuples satisfy this condition will create a subset which will
be a horizontal fragment of Employee relation.
– A selection condition may be composed of several conditions
connected by AND or OR.
– Derived horizontal fragmentation: It is the partitioning of a
primary relation to other secondary relations which are related
with Foreign keys.
Slide 25- 28
Data Fragmentation, Replication and Allocation
• Vertical fragmentation
– It is a subset of a relation which is created by a subset of
columns. Thus a vertical fragment of a relation will contain
values of selected columns. There is no selection condition used
in vertical fragmentation.
– Consider the Employee relation. A vertical fragment of can be
created by keeping the values of Name, Bdate, Sex, and
Address.
– Because there is no condition for creating a vertical fragment,
each fragment must include the primary key attribute of the
parent relation Employee. In this way all vertical fragments of a
relation are connected.
Slide 25- 29
Horizontal Fragmentation of account Relation
Hillside Lowman 1
Hillside Camp 2
Valleyview Camp 3
Valleyview Kahn 4
Hillside Kahn 5
Valleyview Kahn 6
Valleyview Green 7
deposit1 = branch_name, customer_name, tuple_id (employee_info )
account_number balance tuple_id
A-305 500 1
A-226 336 2
A-177 205 3
A-402 10000 4
A-155 62 5
A-408 1123 6
A-639 750 7
deposit2 = account_number, balance, tuple_id (employee_info )
Advantages of Fragmentation
• Horizontal:
– allows parallel processing on fragments of a relation
– allows a relation to be split so that tuples are located where they are most
frequently accessed
• Vertical:
– allows tuples to be split so that each part of the tuple is stored where it is
most frequently accessed
– tuple-id attribute allows efficient joining of vertical fragments
• Vertical and horizontal fragmentation can be mixed.
– Fragments may be successively fragmented to an arbitrary depth.
• Replication and fragmentation can be combined
– Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.
Data Allocation
• Each fragment of database must be assigned
to a particular site in the distributed system.
This process is called data allocation. The
choice of site and degree of replication
depend on the performance and availablity
goals of the system.
Types of Distributed Database Systems
• Homogeneous
– All sites of the database
system have identical setup, Window
i.e., same database system Site 5 Unix
software. Oracle Site 1
– The underlying operating Oracle
Window
system may be different. Site 4 Communications
• For example, all sites run network
Oracle or DB2, or Sybase or
some other database Oracle
system.
Site 3 Site 2
– The underlying operating Linux Oracle Linux Oracle
systems can be a mixture of
Linux, Window, Unix, etc.
Slide 25- 34
Types of Distributed Database Systems
• Heterogeneous
– Federated: Each site may run different database system but the data
access is managed through a single conceptual schema.
• This implies that the degree of local autonomy is minimum. Each site must
adhere to a centralized access policy. There may be a global schema.
– Multidatabase: There is no one conceptual global schema. For data
access a schema is constructed dynamically as needed by the
application software.
Object Unix Relational
Oriented Site 5 Unix
Site 1
Hierarchical
Window
Site 4 Communications
network
Network
Object DBMS
Oriented Site 3 Site 2 Relational
Linux Linux Slide 25- 35
Types of Distributed Database Systems
Slide 25- 36
Query Processing in Distributed Databases
• Issues
– Cost of transferring data (files and results) over the network.
• This cost is usually high so some optimization is necessary.
• Example relations: Employee at site 1 and Department at Site 2
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size = 10 6 bytes.
• Department at Site 2. 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
• Q: For each employee, retrieve employee name and department name Where the employee works.
• Q: Fname,Lname,Dname (Employee Dno = Dnumber Department)
Fname Minit Lname SSN Bdate Address Sex Salary Superssn Dno
Slide 25- 37
Query Processing in Distributed
Databases
• Result
– The result of this query will have 10,000 tuples,
assuming that every employee is related to a
department.
– Suppose each result tuple is 40 bytes long. The
query is submitted at site 3 and the result is sent
to this site.
– Problem: Employee and Department relations are
not present at site 3.
Slide 25- 38
Query Processing in Distributed
Databases
• Strategies:
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3.
• Query result size = 40 * 10,000 = 400,000 bytes. Total transfer
size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
• Optimization criteria: minimizing data transfer.
Slide 25- 39
Query Processing in Distributed
Databases
• Strategies:
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3.
• Query result size = 40 * 10,000 = 400,000 bytes. Total transfer
size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
• Optimization criteria: minimizing data transfer.
– Preferred approach: strategy 3.
Slide 25- 40
Query Processing in Distributed
Databases
• Consider the query
– Q’: For each department, retrieve the
department name and the name of the
department manager
• Relational Algebra expression:
– Fname,Lname,Dname (Employee Mgrssn = SSN
Department)
Slide 25- 41
Query Processing in Distributed
Databases
• The result of this query will have 100 tuples, assuming that
every department has a manager, the execution strategies
are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3. Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes.
Slide 25- 42
Query Processing in Distributed
Databases
• The result of this query will have 100 tuples, assuming that
every department has a manager, the execution strategies
are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3. Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes.
• Preferred strategy: Choose strategy 3.
Slide 25- 43
Query Processing in Distributed
Databases
• Now suppose the result site is 2. Possible
strategies :
1. Transfer Employee relation to site 2, execute the
query and present the result to the user at site 2.
• Total transfer size = 1,000,000 bytes for both queries
Q and Q’.
2. Transfer Department relation to site 1, execute
join at site 1 and send the result back to site 2.
• Total transfer size for Q = 400,000 + 3500 = 403,500
bytes and for Q’ = 4000 + 3500 = 7500 bytes.
Slide 25- 44
Concurrency Control and Recovery
• Distributed Databases encounter a number of
concurrency control and recovery problems
which are not present in centralized
databases. Some of them are listed below.
– Dealing with multiple copies of data items
– Failure of individual sites
– Communication link failure
– Distributed commit
– Distributed deadlock
Slide 25- 45
Concurrency Control and Recovery
• Details
– Dealing with multiple copies of data items:
• The concurrency control must maintain global
consistency. Likewise the recovery mechanism must
recover all copies and maintain consistency after
recovery.
– Failure of individual sites:
• Database availability must not be affected due to the
failure of one or two sites and the recovery scheme
must recover them before they are available for use.
Slide 25- 46