You are on page 1of 46

Distributed databases

Motivation

Database Computer
Technology Networks
integration distribution

Distributed
Database
Systems
integration

integration ≠ centralization
Distributed Computing

 A number of autonomous processing elements


(not necessarily homogeneous) that are
interconnected by a computer network and that
cooperate in performing their assigned tasks.
What is a Distributed Database System?

A distributed database (DDB) is a collection of


multiple, logically interrelated databases distributed
over a computer network.

A distributed database management system (D–


DBMS) is the software that manages the DDB and
provides an access mechanism that makes this
distribution transparent to the users.

Distributed database system (DDBS) = DDB + D–DBMS


Concepts
Distributed Database.
A logically interrelated collection of shared data
(and a description of this data), physically
distributed over a computer network.

Distributed DBMS.
Software system that permits the management
of the distributed database and makes the
distribution transparent to users.
Concepts
• Collection of logically-related shared data.
• Data split into fragments.
• Fragments may be replicated.
• Fragments/replicas allocated to sites.
• Sites linked by a communications network.
• Data at each site is under control of a DBMS.
• DBMSs handle local applications autonomously.
• Each DBMS participates in at least one global application.
What is not a DDBS?

 A timesharing computer system


 A loosely or tightly coupled multiprocessor
system
 A database system which resides at one of the
nodes of a network of computers - this is a
centralized database on a network node
Distributed DBMS Environment

Site 1
Site 2

Site 5
Communication
Network

Site 4 Site 3
Implicit Assumptions
Data stored at a number of sites  each site
logically consists of a single processor.
Processors at different sites are interconnected
by a computer network  no multiprocessors
◦ parallel database systems
Distributed database is a database, not a
collection of files  data logically related as
exhibited in the users’ access patterns
◦ relational data model
D-DBMS is a full-fledged DBMS
◦ not remote file system, not a TP system
Advantages of DDBMSs
• Organizational Structure
• Shareability and Local Autonomy
• Improved Availability
• Improved Reliability
• Improved Performance
• Economics
• Modular Growth
Advantages
Distributed Database System
– Improved performance:
• A distributed DBMS fragments the database to keep
data closer to where it is needed most.
• This reduces data management (access and
modification) time significantly.
– Easier expansion (scalability):
• Allows new nodes (computers) to be added anytime
without chaining the entire configuration.
Advantages
Distributed Database System
– Increased reliability and availability:
• Reliability refers to system live time, that is, system is
running efficiently most of the time. Availability is the
probability that the system is continuously available
(usable or accessible) during a time interval.
• A distributed database system has multiple nodes
(computers) and if one fails then others are available to
do the job.
Distributed Database System

– Management of distributed data with different


levels of transparency:
• This refers to the physical placement of data (files,
relations, etc.) which is not known to the user
(distribution transparency).
Distributed Database System
– Distribution and Network transparency:
• Users do not have to worry about operational details of
the network.
– There is Location transparency, which refers to freedom of
issuing command from any location without affecting its
working.
– Then there is Naming transparency, which allows access to
any names object (files, relations, etc.) from any location.
Distributed Database System
– Replication transparency:
• It allows to store copies of a data at multiple sites as
shown in the above diagram.
• This is done to minimize access time to the required
data.
– Fragmentation transparency:
• Allows to fragment a relation horizontally (create a
subset of tuples of a relation) or vertically (create a
subset of columns of a relation).
Data Transparency
• Data transparency: Degree to which system user may remain unaware of the
details of how and where the data items are stored in a distributed system
• Consider transparency issues in relation to:
– Fragmentation transparency
– Replication transparency
– Location transparency
• Naming of data items: criteria
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items transparently.
4. Each site should be able to create new data items autonomously.
Disadvantages of DDBMSs
• Complexity
• Cost
• Security
• Integrity Control More Difficult
• Lack of Standards
• Lack of Experience
• Database Design More Complex
Distributed DBMS Promises

Transparent management of distributed,


fragmented, and replicated data

 Improved reliability/availability through


distributed transactions

Improved performance

Easier and more economical system


expansion
Distributed Database - User View

Distributed Database
Distributed DBMS - Reality
User
Query
DBMS
Software
User
Application
DBMS
Software

DBMS Communication
Software Subsystem

User
DBMS User Application
Software Query DBMS
Software

User
Query
Additional Functions of DDBS
• Keep track of data
• Distributed query processing
• Replicated data management
• Additional security requirement
• Global transaction management
Distributed DBMS Issues
 Distributed Database Design
◦ how to distribute the database
◦ replicated & non-replicated database distribution
◦ a related problem in directory management
 Query Processing
◦ convert user transactions to data manipulation
instructions
◦ optimization problem
◦ min{cost = data transmission + local processing}
Design of DDBS
• Data fragmentation
• Data Replication
• Data Allocation
Data Replication
• A relation or fragment of a relation is replicated if it is stored redundantly in
two or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the
entire database.
Data Replication (Cont.)

• Advantages of Replication
– Availability: failure of site containing relation r does not result in unavailability
of r is replicas exist.
– Parallelism: queries on r may be processed by several nodes in parallel.
– Reduced data transfer: relation r is available locally at each site containing a
replica of r.
• Disadvantages of Replication
– Increased cost of updates: each replica of relation r must be updated.
– Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
• One solution: choose one copy as primary copy and apply concurrency
control operations on primary copy
Data Fragmentation

• Division of relation r into fragments r1, r2, …, rn which contain sufficient


information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to one or more
fragments
• Vertical fragmentation: the schema for relation r is split into several smaller
schemas
– All schemas must contain a common candidate key (or superkey) to
ensure lossless join property.
– A special attribute, the tuple-id attribute may be added to each schema to
serve as a candidate key.
• Example : relation account with following schema
• Account = (account_number, branch_name , balance )
Data Fragmentation, Replication and Allocation

• Data Fragmentation
– Split a relation into logically related and correct
parts. A relation can be fragmented in two ways:
• Horizontal Fragmentation
• Vertical Fragmentation

Slide 25- 27
Data Fragmentation, Replication and Allocation

• Horizontal fragmentation
– It is a horizontal subset of a relation which contain those of
tuples which satisfy selection conditions.
– Consider the Employee relation with selection condition (DNO =
5). All tuples satisfy this condition will create a subset which will
be a horizontal fragment of Employee relation.
– A selection condition may be composed of several conditions
connected by AND or OR.
– Derived horizontal fragmentation: It is the partitioning of a
primary relation to other secondary relations which are related
with Foreign keys.

Slide 25- 28
Data Fragmentation, Replication and Allocation

• Vertical fragmentation
– It is a subset of a relation which is created by a subset of
columns. Thus a vertical fragment of a relation will contain
values of selected columns. There is no selection condition used
in vertical fragmentation.
– Consider the Employee relation. A vertical fragment of can be
created by keeping the values of Name, Bdate, Sex, and
Address.
– Because there is no condition for creating a vertical fragment,
each fragment must include the primary key attribute of the
parent relation Employee. In this way all vertical fragments of a
relation are connected.

Slide 25- 29
Horizontal Fragmentation of account Relation

account_number branch_name balance

A-305 Hillside 500


A-226 Hillside 336
A-155 Hillside 62

account1 = branch_name=“Hillside” (account )

account_number branch_name balance

A-177 Valleyview 205


A-402 Valleyview 10000
A-408 Valleyview 1123
A-639 Valleyview 750

account2 = branch_name=“Valleyview” (account )


Vertical Fragmentation of employee_info Relation

branch_name customer_name tuple_id

Hillside Lowman 1
Hillside Camp 2
Valleyview Camp 3
Valleyview Kahn 4
Hillside Kahn 5
Valleyview Kahn 6
Valleyview Green 7
deposit1 = branch_name, customer_name, tuple_id (employee_info )
account_number balance tuple_id

A-305 500 1
A-226 336 2
A-177 205 3
A-402 10000 4
A-155 62 5
A-408 1123 6
A-639 750 7
deposit2 = account_number, balance, tuple_id (employee_info )
Advantages of Fragmentation
• Horizontal:
– allows parallel processing on fragments of a relation
– allows a relation to be split so that tuples are located where they are most
frequently accessed
• Vertical:
– allows tuples to be split so that each part of the tuple is stored where it is
most frequently accessed
– tuple-id attribute allows efficient joining of vertical fragments
• Vertical and horizontal fragmentation can be mixed.
– Fragments may be successively fragmented to an arbitrary depth.
• Replication and fragmentation can be combined
– Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.
Data Allocation
• Each fragment of database must be assigned
to a particular site in the distributed system.
This process is called data allocation. The
choice of site and degree of replication
depend on the performance and availablity
goals of the system.
Types of Distributed Database Systems

• Homogeneous
– All sites of the database
system have identical setup, Window
i.e., same database system Site 5 Unix
software. Oracle Site 1
– The underlying operating Oracle
Window
system may be different. Site 4 Communications
• For example, all sites run network
Oracle or DB2, or Sybase or
some other database Oracle
system.
Site 3 Site 2
– The underlying operating Linux Oracle Linux Oracle
systems can be a mixture of
Linux, Window, Unix, etc.

Slide 25- 34
Types of Distributed Database Systems

• Heterogeneous
– Federated: Each site may run different database system but the data
access is managed through a single conceptual schema.
• This implies that the degree of local autonomy is minimum. Each site must
adhere to a centralized access policy. There may be a global schema.
– Multidatabase: There is no one conceptual global schema. For data
access a schema is constructed dynamically as needed by the
application software.
Object Unix Relational
Oriented Site 5 Unix
Site 1
Hierarchical
Window
Site 4 Communications
network

Network
Object DBMS
Oriented Site 3 Site 2 Relational
Linux Linux Slide 25- 35
Types of Distributed Database Systems

• Federated Database Management Systems


Issues
– Differences in data models:
• Relational, Objected oriented, hierarchical, network,
etc.
– Differences in constraints:
• Each site may have their own data accessing and
processing constraints.
– Differences in query language:
• Some site may use SQL, some may use SQL-89, some
may use SQL-92, and so on.

Slide 25- 36
Query Processing in Distributed Databases
• Issues
– Cost of transferring data (files and results) over the network.
• This cost is usually high so some optimization is necessary.
• Example relations: Employee at site 1 and Department at Site 2
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size = 10 6 bytes.

• Department at Site 2. 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
• Q: For each employee, retrieve employee name and department name Where the employee works.
• Q: Fname,Lname,Dname (Employee Dno = Dnumber Department)

Fname Minit Lname SSN Bdate Address Sex Salary Superssn Dno

Dname Dnumber Mgrssn Mgrstartdate

Slide 25- 37
Query Processing in Distributed
Databases
• Result
– The result of this query will have 10,000 tuples,
assuming that every employee is related to a
department.
– Suppose each result tuple is 40 bytes long. The
query is submitted at site 3 and the result is sent
to this site.
– Problem: Employee and Department relations are
not present at site 3.

Slide 25- 38
Query Processing in Distributed
Databases
• Strategies:
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3.
• Query result size = 40 * 10,000 = 400,000 bytes. Total transfer
size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
• Optimization criteria: minimizing data transfer.

Slide 25- 39
Query Processing in Distributed
Databases
• Strategies:
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3.
• Query result size = 40 * 10,000 = 400,000 bytes. Total transfer
size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
• Optimization criteria: minimizing data transfer.
– Preferred approach: strategy 3.

Slide 25- 40
Query Processing in Distributed
Databases
• Consider the query
– Q’: For each department, retrieve the
department name and the name of the
department manager
• Relational Algebra expression:
– Fname,Lname,Dname (Employee Mgrssn = SSN
Department)

Slide 25- 41
Query Processing in Distributed
Databases
• The result of this query will have 100 tuples, assuming that
every department has a manager, the execution strategies
are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3. Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes.

Slide 25- 42
Query Processing in Distributed
Databases
• The result of this query will have 100 tuples, assuming that
every department has a manager, the execution strategies
are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3. Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes.
• Preferred strategy: Choose strategy 3.

Slide 25- 43
Query Processing in Distributed
Databases
• Now suppose the result site is 2. Possible
strategies :
1. Transfer Employee relation to site 2, execute the
query and present the result to the user at site 2.
• Total transfer size = 1,000,000 bytes for both queries
Q and Q’.
2. Transfer Department relation to site 1, execute
join at site 1 and send the result back to site 2.
• Total transfer size for Q = 400,000 + 3500 = 403,500
bytes and for Q’ = 4000 + 3500 = 7500 bytes.

Slide 25- 44
Concurrency Control and Recovery
• Distributed Databases encounter a number of
concurrency control and recovery problems
which are not present in centralized
databases. Some of them are listed below.
– Dealing with multiple copies of data items
– Failure of individual sites
– Communication link failure
– Distributed commit
– Distributed deadlock

Slide 25- 45
Concurrency Control and Recovery
• Details
– Dealing with multiple copies of data items:
• The concurrency control must maintain global
consistency. Likewise the recovery mechanism must
recover all copies and maintain consistency after
recovery.
– Failure of individual sites:
• Database availability must not be affected due to the
failure of one or two sites and the recovery scheme
must recover them before they are available for use.

Slide 25- 46

You might also like