
Section 2

Distributed Databases
Section Content

• 2.1 Concepts
• 2.2 Advantages
• 2.3 Classification of Distributed Systems
• 2.4 Database Design
• 2.5 Distributed Query Processing



2.1 Concepts

• A Distributed Database (DDB) is a collection of nodes, connected via a communication network.

• Each site is autonomous, but a partnership exists among a set of independent but co-operating centralised systems.

• A Distributed Database Management System (DDBMS) is the software that permits the management of the DDB and makes distribution transparent to users.

• There are three basic architectures: networked with a single centralised database, shared memory, and shared nothing.



Centralised in a Networked Architecture

[Diagram: a single DBMS, behind a DBMS interface on the network, accessed by several client interfaces.]



Centralised in a Networked Architecture

• Storage exists at a single site (with a shared disk architecture).

• The architecture resembles a typical client-server architecture, although DDB transparencies exist.

• This architecture is suited to a (conceptually) fully replicated environment: each client site sees the same data as all other sites.

• It also suits a (conceptually) fully fragmented scheme, where each client sees a different view of the overall schema.



Shared Memory Architecture
[Diagram: three NT2000 O/S workstations, each running SQL Server behind a DBMS interface, connected by a network.]



Shared Memory Architecture

• Each node on the network operates in an autonomous fashion, with its own selected hardware and operating system setup.

• However, each system runs (for example) distributed Oracle, where each system shares a common memory space in which transactions are processed.

• Each site may have copies of data which ‘belong’ to other sites: this will require synchronisation of updates.



Shared Nothing
[Diagram: a UNIX cluster, an NT2000 O/S node and a VMS mainframe, each running Oracle behind its own DBMS interface, connected only by the network.]



Shared Nothing Architecture

• Each processor has its own autonomous processing and storage capabilities.

• Each node is homogeneous with respect to operating system, database management system protocols and storage.

• Communication is (typically) through a high-speed interconnection network.



Sections Covered

 2.1 Concepts
• 2.2 Advantages
• 2.3 Classification of Distributed Systems
• 2.4 Database Design
• 2.5 Distributed Query Processing



2.2 Advantages

• Management of distributed data with different levels of transparency. Transparencies:
+ Distribution: location transparency ensures that the user need not worry about the location or local name of data objects.
+ Replication: the user is unaware of data copies. These copies provide better availability, performance and reliability.
+ Fragmentation: horizontal and vertical fragmentation details are hidden from the user.
(A small sketch of a global catalog providing these transparencies follows below.)

• Increased reliability and availability.
+ Reliability is improved by a decrease in downtime, which is largely due to replication.
+ Availability is the probability that the DDB is up and running for a predetermined time interval.
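To make the transparency idea concrete, here is a minimal Python sketch (not from the slides; all names such as EMP_DUBLIN and site1 are made up) of a global catalog that hides fragment names, replica copies and site locations from the user:

from typing import List, Tuple

# Hypothetical global catalog: a global relation name maps to its fragments,
# and each fragment lists the sites holding a replica of it.
GLOBAL_CATALOG = {
    "EMPLOYEE": [                                  # horizontally fragmented and replicated
        {"fragment": "EMP_DUBLIN", "sites": ["site1", "site3"]},
        {"fragment": "EMP_CORK",   "sites": ["site2"]},
    ],
    "DEPARTMENT": [                                # stored whole at a single site
        {"fragment": "DEPARTMENT", "sites": ["site2"]},
    ],
}

def locate(relation: str) -> List[Tuple[str, str]]:
    """Resolve a global relation name to (fragment, chosen site) pairs.
    The user only ever supplies the global name."""
    return [(f["fragment"], f["sites"][0]) for f in GLOBAL_CATALOG[relation]]

print(locate("EMPLOYEE"))    # [('EMP_DUBLIN', 'site1'), ('EMP_CORK', 'site2')]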



Advantages (ii)

• Improved Performance
+ A distributed DBMS fragments data so that it is stored at the site where it is needed most.
+ Fragmentation also implies that the database is smaller: instead of a single CPU processing one large database, multiple CPUs process many smaller databases.
+ Inter-query and intra-query parallelism can be achieved, as multiple queries (or parts of a single query) can be run in parallel at separate sites (see the sketch below).

• Easier Expansion
+ Expansion is easier as it may involve simply adding a new site.
+ Expansion can be planned to suit the current distribution scheme.
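As a rough illustration of inter-query parallelism, the sketch below (the run_at_site stub and the fragment names EMP_FRAG1/EMP_FRAG2 are assumptions; a real DDBMS client library would replace the stub) issues two fragment queries to two sites concurrently:

from concurrent.futures import ThreadPoolExecutor

def run_at_site(site: str, query: str) -> str:
    # Stand-in for a remote call to the local DBMS at `site`.
    return f"{site}: executed {query!r}"

queries = [
    ("site1", "SELECT * FROM EMP_FRAG1 WHERE Salary > 50000"),
    ("site2", "SELECT * FROM EMP_FRAG2 WHERE Salary > 50000"),
]

# Both queries run in parallel, one per site, instead of serially on one CPU.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda sq: run_at_site(*sq), queries))

print(results)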



System Overheads (i)

• Controlling Data. It is necessary to monitor data distribution, fragmentation and replication by expanding the system catalog.

• Distributed Query Processing. It is necessary to access multiple sites during the execution of global queries.

• Optimisation. It is necessary to devise execution strategies based on factors such as the movement of data between sites and the speed of network connections between those sites.

• Replicated Data Management. It is necessary to propagate changes from one site to all copies. This requires an ability to decide which copy is the master, and to maintain consistency among replicated sites (a minimal propagation sketch follows below).
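A minimal sketch of one possible approach (primary-copy replication; the site names and the apply_update stub are hypothetical) showing an update applied at the master copy and then propagated to the other copies:

REPLICAS = {"EMPLOYEE": {"master": "site1", "copies": ["site2", "site4"]}}

def apply_update(site: str, relation: str, update: dict) -> None:
    # Stand-in for performing the write at the local DBMS of `site`.
    print(f"{site}: applied {update} to {relation}")

def write(relation: str, update: dict) -> None:
    meta = REPLICAS[relation]
    apply_update(meta["master"], relation, update)   # master copy first
    for copy_site in meta["copies"]:                 # then propagate to every replica
        apply_update(copy_site, relation, update)

write("EMPLOYEE", {"RSI": "1234567T", "Salary": 52000})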



System Overheads (ii)

• Distributed Database Recovery. A requirement to handle new types of failure (based on communication), and to recover from individual site crashes.

• Security. Global transactions require the negotiation of different security systems. Authorisation and access privileges must be maintained.

• Distributed Catalog Management. The catalog holds metadata for the entire DDBMS. A decision must be made at design time as to the fragmentation or replication (or both) of the system catalog.



Sections Covered

 2.1 Concepts
 2.2 Advantages
• 2.3 Classification of Distributed Systems
• 2.4 Database Design
• 2.5 Distributed Query Processing



2.3 Classification of Distributed Systems

• Distributed databases have design alternatives along three dimensions:
+ Autonomy,
+ Distribution,
+ Heterogeneity.

• Autonomy refers to the distribution of control, and indicates the degree to which individual DBMSs can operate independently.

• The distribution dimension deals with data. There are only two possibilities: data is distributed across multiple sites, or is stored at a single site.

• Heterogeneity can occur in various forms: hardware, networking protocols, variations in database managers. The important ones relate to data models, query languages, and transaction management protocols.



[Diagram: classification of DBMSs along the three axes Autonomy, Distribution and Heterogeneity. Points include single-site homogeneous DBMSs, logically integrated (multiple) homogeneous DBMSs, distributed homogeneous DBMSs, homogeneous federated DBMSs, distributed “federated” DBMSs, heterogeneous integrated DBMSs and distributed heterogeneous DBMSs.]



Sections Covered

 2.1 Concepts
 2.2 Advantages
 2.3 Classification of Distributed Systems
• 2.4 Database Design
• 2.5 Distributed Query Processing



2.4 Database Design

• Early research into DDBSs suggests the organisation of distributed systems along three orthogonal dimensions:
+ level of sharing;
+ behaviour of access patterns;
+ level of knowledge on access pattern behaviour.

• The first property looks at how data is shared between users; the second looks at issues such as static and dynamic access patterns; and the third looks at how much information is available regarding access patterns.



Top-down design

• Top-down design is suited to a “green-field” type of application, whereas bottom-up design is generally employed where systems already exist.

• Requirements Analysis → Objectives
• Conceptual Design → the Global Conceptual Schema
• View Design → Access Information and External Schema Definitions
• Distributed Design → Local Conceptual Schemas
• Physical Design → Physical Schema
• Observation & Monitoring → Feedback



Issues

• Why fragment?
• How should fragmentation be performed? (horizontal vs. vertical; see the sketch below)
• How much should be fragmented? An important issue, as it affects the performance of query execution; aim to find a balance between large and small units.
• Can we test the correctness of decomposition? (Observe rules)
• How is allocation performed? (choose sites; is replication required?)
• What is the necessary information for fragmentation and allocation? (database information, application information, communication network information and computer system information).
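To make the horizontal/vertical distinction concrete, here is a toy Python sketch (the tuples and values are made up, loosely modelled on the EMPLOYEE relation used later in this section):

# A toy relation held as a list of dicts.
employees = [
    {"RSI": "111", "Lname": "Byrne",  "Salary": 40000, "DeptNo": 4},
    {"RSI": "222", "Lname": "Murphy", "Salary": 55000, "DeptNo": 5},
]

# Horizontal fragmentation: whole tuples, selected by a predicate (per department).
emp_dept4 = [t for t in employees if t["DeptNo"] == 4]
emp_dept5 = [t for t in employees if t["DeptNo"] == 5]

# Vertical fragmentation: subsets of attributes; the key (RSI) is repeated
# in every fragment so the relation can be reconstructed by a join.
emp_pay  = [{k: t[k] for k in ("RSI", "Salary")} for t in employees]
emp_info = [{k: t[k] for k in ("RSI", "Lname", "DeptNo")} for t in employees]

print(emp_dept4, emp_dept5, emp_pay, emp_info, sep="\n")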



Correctness Rules of Fragmentation

• The following three rules should be enforced during fragmentation; together they ensure that the database does not undergo semantic change during fragmentation.

• Completeness. If a relation instance R is decomposed into fragments R1, R2, …, Rn, each data item that can be found in R can also be found in one or more of the fragments Ri. This property is identical to the lossless decomposition property of normalisation.

• Reconstruction. If a relation R is decomposed into fragments R1, R2, …, Rn, it should be possible to define a relational operator ∇ such that
R = ∇ Ri, ∀ Ri ∈ FR
The operator ∇ will be different for different fragmentations, but the operation must be identified.



Rules

• Disjointness. If a relation instance R is decomposed into fragments R1, R2, …, Rn, and data item di resides in Rj, it cannot reside in any other fragment Rk (k ≠ j). This criterion ensures that the horizontal fragments are disjoint. Note that the primary key is often repeated in all fragments for vertical partitioning; thus, disjointness is defined only on the non-primary-key attributes of a relation. (A small check of these rules on horizontal fragments follows below.)
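A minimal Python sketch, assuming a horizontal fragmentation whose reconstruction operator is union, that checks the three rules on a toy relation (all values are made up):

# Toy EMPLOYEE-like relation and a horizontal fragmentation on the second field.
relation  = {("111", 4), ("222", 5), ("333", 4)}
fragments = [{("111", 4), ("333", 4)}, {("222", 5)}]

complete      = relation <= set().union(*fragments)   # every tuple appears in some fragment
reconstructed = relation == set().union(*fragments)   # union rebuilds R exactly
disjoint      = all(a.isdisjoint(b)
                    for i, a in enumerate(fragments)
                    for b in fragments[i + 1:])        # no tuple appears in two fragments

print(complete, reconstructed, disjoint)               # True True True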



Sections Covered

 2.1 Concepts
 2.2 Advantages
 2.3 Classification of Distributed Systems
 2.4 Database Design
• 2.5 Distributed Query Processing



2.5 Query Processing

• The main function of a relational query processor is to transform a high-level query into an equivalent lower-level query.

• The low-level query contains the information required to implement the execution strategy for the query.

• The transformation must achieve correctness and efficiency. The well-defined mapping between relational calculus and relational algebra makes the correctness issue easy.

• However, producing an execution strategy that is efficient is more complex. A relational calculus query may have many equivalent transformations in relational algebra. The issue is to select the execution strategy that minimises resource consumption.

• In a distributed system, relational algebra is not enough to express execution strategies. It must be supplemented with operations for exchanging data between sites. For example, the distributed query processor must select the best sites to process data.



Sample DB

• Site 1 (containing a table called EMPLOYEE)
{Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo}
10,000 tuples (each 100 bytes in length)
RSI is 9 bytes; DeptNo is 4 bytes; Fname is 15 bytes; Lname is 15 bytes

• Site 2 (containing a table called DEPARTMENT)
{Dname, Dnumber, MgrRSI, MGRStartdate}
100 tuples (each 35 bytes in length)
Dnumber is 4 bytes; Dname is 10 bytes; MgrRSI is 9 bytes

• Properties
Size of EMPLOYEE is 10,000 * 100 = 1,000,000 bytes
Size of DEPARTMENT is 100 * 35 = 3,500 bytes
EMPLOYEE.DeptNo = DEPARTMENT.Dnumber
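For reference, a small Python sketch that reproduces the arithmetic above and derives the 40-byte result-tuple width used by the sample queries that follow (Fname 15 + Lname 15 + Dname 10):

EMPLOYEE   = {"tuples": 10_000, "width": 100}   # width in bytes per tuple
DEPARTMENT = {"tuples": 100,    "width": 35}

emp_bytes  = EMPLOYEE["tuples"] * EMPLOYEE["width"]       # 1,000,000 bytes
dept_bytes = DEPARTMENT["tuples"] * DEPARTMENT["width"]   # 3,500 bytes
result_tuple_width = 15 + 15 + 10                         # {Fname, Lname, Dname} = 40 bytes

print(emp_bytes, dept_bytes, result_tuple_width)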



Sample Query 1

For each employee, retrieve the employee name and the department
in which that employee works.

• The result of the query will include 10,000 tuples (assuming that every employee has a valid department). Each result tuple requires 40 bytes (Fname 15 + Lname 15 + Dname 10).

• The query is executed at Site 3 (the result site). Three strategies exist for execution of the distributed query.

• If minimising the amount of data transfer is the optimisation criterion, which strategy is selected?



Strategy 1

• Transfer both the EMPLOYEE and DEPARTMENT relations to the result site, and perform the join there.

[Diagram: E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} is shipped from Site 1, and D = {Dname, Dnumber, MgrRSI, MGRStartdate} from Site 2, to Site 3.]

Transfer amount = 1,000,000 + 3,500 = 1,003,500 bytes



Strategy 2

• Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.

[Diagram: E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} is shipped from Site 1 to Site 2; the result R = {Fname, Lname, Dname} is shipped to Site 3.]

Transfer 1,000,000 bytes to Site 2;
Query result size = 40 * 10,000 = 400,000 bytes;
Transfer amount = 1,000,000 + 400,000 = 1,400,000 bytes.

Strategy 3

• Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and transfer the result to site 3.

[Diagram: D = {Dname, Dnumber, MgrRSI, MGRStartdate} is shipped from Site 2 to Site 1; the result R = {Fname, Lname, Dname} is shipped to Site 3.]

Transfer 3,500 bytes to Site 1;
Query result size = 40 * 10,000 = 400,000 bytes;
Transfer amount = 3,500 + 400,000 = 403,500 bytes.
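A minimal Python sketch (not part of the slides) that reproduces the three transfer amounts for Query 1 and picks the cheapest; setting result_tuples to 100 gives the corresponding figures for the query on the following slides:

emp_size      = 10_000 * 100         # EMPLOYEE: 10,000 tuples * 100 bytes
dept_size     = 100 * 35             # DEPARTMENT: 100 tuples * 35 bytes
result_tuples = 10_000               # one result tuple per employee (Query 1)
result_size   = result_tuples * 40   # 40 bytes per result tuple

strategies = {
    "1: ship EMPLOYEE and DEPARTMENT to Site 3": emp_size + dept_size,
    "2: ship EMPLOYEE to Site 2, result to Site 3": emp_size + result_size,
    "3: ship DEPARTMENT to Site 1, result to Site 3": dept_size + result_size,
}

for name, cost in strategies.items():
    print(f"Strategy {name}: {cost:,} bytes")
print("Cheapest:", min(strategies, key=strategies.get))   # Strategy 3: 403,500 bytes
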
Sample Query 2

• For each department, retrieve the department name, and the name
of the department manager.

• Assume the query is again submitted at site 3, and that the result
contains 100 tuples (of 40 bytes).



Strategy 1

• Transfer both EMPLOYEE and DEPARTMENT to site 3, and perform the join there.

[Diagram: E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} is shipped from Site 1, and D = {Dname, Dnumber, MgrRSI, MGRStartdate} from Site 2, to Site 3.]

Transfer amount = 1,000,000 + 3,500 = 1,003,500 bytes


Strategy 2

• Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.

[Diagram: E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} is shipped from Site 1 to Site 2; the result R = {Fname, Lname, Dname} is shipped to Site 3.]

Transfer 1,000,000 bytes to Site 2;
Query result size = 40 * 100 = 4,000 bytes;
Transfer amount = 1,000,000 + 4,000 = 1,004,000 bytes.

Strategy 3

• Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and transfer the result to site 3.

[Diagram: D = {Dname, Dnumber, MgrRSI, MGRStartdate} is shipped from Site 2 to Site 1; the result R = {Fname, Lname, Dname} is shipped to Site 3.]

Transfer 3,500 bytes to Site 1;
Query result size = 40 * 100 = 4,000 bytes;
Transfer amount = 3,500 + 4,000 = 7,500 bytes.

Exercises

• Determine what the result would be if the projection of each table was executed before it left its site (e.g. π<Dname, Dnumber>(Department) and π<DeptNo, Fname, Lname>(Employee) for query 1).

• Determine the best strategy if the query is executed at site 2.



Processing Layers
[Diagram: the four query-processing layers (query decomposition, data localisation, global query optimisation, local query optimisation); the output of each step is the input used for processing in the next.]



Query Decomposition

• The first layer decomposes the distributed calculus query into an algebraic query.

• Query decomposition can be viewed as four successive steps (a worked example follows below):
+ rewrite the calculus query in a normalised form (suitable for subsequent manipulations);
+ analyse the normalised query to detect incorrect queries (reject them early);
+ simplify the correct query (e.g. eliminate redundant predicates);
+ transform the calculus query into an algebraic query.
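A rough worked example (not from the slides, using the EMPLOYEE relation from the sample database): consider the query SELECT Fname, Lname FROM EMPLOYEE WHERE Salary > 50000 AND Salary > 40000.
+ Normalisation leaves the conjunctive predicate Salary > 50000 ∧ Salary > 40000.
+ Analysis confirms that Fname, Lname and Salary are attributes of EMPLOYEE, so the query is not rejected.
+ Simplification removes the redundant conjunct, since Salary > 50000 implies Salary > 40000.
+ Translation yields the algebraic query π Fname, Lname (σ Salary > 50000 (EMPLOYEE)).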



Data Localisation

• The input to this layer is the algebraic transformation of the query.

• The main role of this layer is to localise the query’s data using data distribution information: determine which fragments are involved in the query and transform the distributed query into fragment queries.

• There are two steps (illustrated below):
+ The distributed query is mapped into a fragment query by substituting each distributed relation by its materialisation program.
+ The fragment query is simplified and restructured to produce another correct query.
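A rough worked example (the fragmentation is hypothetical, not from the slides): suppose EMPLOYEE is horizontally fragmented into EMP1 = σ DeptNo ≤ 10 (EMPLOYEE) and EMP2 = σ DeptNo > 10 (EMPLOYEE), so its materialisation program is EMPLOYEE = EMP1 ∪ EMP2. The distributed query σ DeptNo = 4 (EMPLOYEE) is first mapped to the fragment query σ DeptNo = 4 (EMP1 ∪ EMP2); simplification then notices that the predicate contradicts the qualification of EMP2, so the query reduces to σ DeptNo = 4 (EMP1) and only the site holding EMP1 needs to be involved.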



Global Query Optimisation

• The input to this layer is an algebraic fragment query.

• The goal of the query optimiser is to locate an execution strategy for the query that is close to optimal.

• This consists of finding the best ordering of operations in the fragment query.

• An important aspect of query optimisation is join ordering, since permutations of joins within the query may lead to improvements of orders of magnitude (see the sketch below).
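A toy Python sketch of why join order matters. The relations, cardinalities and selectivity factors are all made up, and the cost model is deliberately crude: it simply sums the estimated intermediate-result cardinalities of a left-deep plan.

from itertools import permutations

card = {"EMP": 10_000, "DEPT": 100, "PROJ": 500}   # illustrative tuple counts
sel = {                                            # illustrative join selectivity factors
    frozenset(("EMP", "DEPT")): 1 / 100,
    frozenset(("EMP", "PROJ")): 1 / 500,
    frozenset(("DEPT", "PROJ")): 1 / 50,
}

def cost(order):
    """Sum of estimated intermediate-result sizes for a left-deep join order."""
    intermediate = card[order[0]]
    joined, total = {order[0]}, 0
    for rel in order[1:]:
        factor = min(sel[frozenset((j, rel))] for j in joined)
        intermediate = intermediate * card[rel] * factor
        joined.add(rel)
        total += intermediate
    return total

for order in permutations(card):
    print(" ⋈ ".join(order), f"{cost(order):,.0f}")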



Local Query Optimisation

• The final layer is performed by all sites having fragments involved in the query.

• Each sub-query executing at a local site is optimised using the local schema of that site.

