A distributed database management system ('DDBMS') is a software system that permits the management of a distributed database and makes

the distribution transparent to the users. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. Sometimes "distributed database system" is used to refer jointly to the distributed database and the distributed DBMS. Distributed database management systems (DDBMS) are amongst the most important and successful software developments in this decade. They are enabling he computing power and data to be placed within the user environment close to the point of user activities. The performance efficiency of DDBMS is deeply related to the query processing strategies involving data transmission over different nodes through the network. This thesis is to study the optimization of query processing strategies in a distributed databases environment. With the objective of minimum communication cost, we have developed a mathematical model to find a join-semijoin program for processing a given equi-join query in distributed homogeneous relational databases. Rules for estimating the size of the derived relation is proposed. The distributed query processing problem is formulated as dynamic network problem. We also extend this model to consider both communication cost and local processing cost. We extend this model to query processing in a distributed heterogeneous databases environment. A heterogeneous database communication system is proposed to integrate heterogeneous database management systems to combine and share information. The use of a database communication system for heterogeneous DBMSs makes the overall system transparent to users from an operational point of view. Problems of schema translation and query translation of the query processing in this environment are studied. Distributed Databases
What is a DDBMS?

• • • • • • • • •

A Distributed DB is a related collection of data just like a normal DB, but physically distributed over a network. The data is split into “fragments” on separate machines, each running a local DBMS. A Distributed DBMS is the software that manages distribution of data and processing in a fashion that is invisible to users. Local DBMSs can access local data autonomously, or access remote data through the DDBMS and other local DBMSs. Apps that only use local data are called “Local Applications” Apps that access remote data are called “Global Applications” To be a DDBMS, every local DBMS must participate in at least one Global App. Homogeneous DDBMSs use the same DBMS on the same platform on each site Heterogeneous DDBMSs use different DBMSs/Platforms and require gateways or other middle-ware to

convert queries/data models between sites Advantages of a DDBMS:

• • • • •

Realistically reflects decentralised organisational structures If well designed can strike an optimal balance between local speed and global access More failure-proof than a single DBMS Shared processing and I/O leads to better performance Scalability

Disadvantages of a DDBMS:

• • • •
Note:

Increased complexity and higher cost Networks compromises sceurity Validity, consistency and integrity control becomes complicated It’s a new technology with few standards/best practices

• •

A DDBMS is not the same as “distributed processing” which is centralised DB accessible through a network (rather than the data itself being distributed) A DDBMS is not always the same as a “parallel DBMS” which is a single DBMS using multiple processors/multiple disks

DDBMS Components:

• •

Global External Schema User Views. Provides logical data independence Global Conceptual Schema Logical description of entire DB, including entities and relationships, constraints, domains, security etc. Provides physical data independence Fragmentation and Allocation Schema Describes how the data is partitioned and where it is stored 3 Tier Schemas for each Local DBMS As normal, but instead of external schema, the top level is a mapping schema designed to be used to communicate with the frag/alloc schema above. Local DBMS (LDBMS) – normal DBMS controlling local data Data Communications (DC) – network software Global System Catalogue – same as a normal systems catalogue, plus frag/alloc information

• •

• • •

Distributed DBMS (DDBMS) – main functional unit – transaction management, backup/recovery etc.

When designing a DDBMS we seek to maximise:

• • •

Locality of reference (letting local apps hitting local data) Reliability and Accessibility (by strategically replicating the data) System Performance (by avoiding over/under utilisation of resources)

When designing a DDBMS we seek to minimise:

• •

Cost vs. Storage (affects replication strategy) Communication Costs (taking into account consistency maintenance)

Design Strategies with which to approach the problem:

• •

Centralised Data Storage – this is not a DDB at all. No replication so no additional storage costs. Fragmented Data Storage – no replication but data is split up and distributed. If done correctly, locality of reference will be very high as is performance and storage and communications costs are low. Reliability and accessibility are only ok.

Design Strategies with which to approach the problem:

• •

Fully Replicated Storage – every site hosts a full copy of the DB. Very high storage costs, improved performance for read trans/low comm costs. Write trans low performance/high comms cost. Partially Replicated Storage – combination of above methods. It makes the most sense – if done right. You have high locality, high reliability/access, good all-round performance, ok storage costs and low comm. costs.

Fragmentation:

There are a few reasons why it makes sense to fragment data:

• • • • • •

By keeping data not required by local apps separate, we improve security By maximising locality of reference, we improve efficiency Since most apps only use subsets of relations, it makes sense to break the relations into subsets for storage across the network. Done right, we open the door for parallel processing

There are drawbacks, like increased integrity administration and performance hits for poorly fragmented data sets For a fragmentation effort to be viable: it must be complete (all items in a relations must appear in at least one fragment of the relation), functional dependencies must be preserved, and (other than primary keys) fragments should be disjoint.

Types of Fragmentation:

Horizontal – break up relation into subsets of the tuples in that relation, based on a restriction on one or more attributes. E.g. – we could break up a table with student info into one subset for undergrads and one subset for postgrads. Vertical – breaking up a relation into subsets of attributes. E.g. – breaking up a hypothetical student table into grade/course related columns and contact/personal related columns. Mixed – fragments the data multiple times, in different ways. We could do our postgrad/undergrad split and then our grades/course split to each of the fragments Derived – fragmenting a relation to correspond with the fragmentation of another relation upon which the 1st relation is dependent in some way.

• • •
Transparency:

Distribution Transparency allows users to ignore the physical fragmentation of data, to varying degrees:

Frag. Transparency is high level transparency where a user could write “SELECT * FROM Student WHERE year = 2” without needing to specify what fragment of the Student relation contains the data, nor where that fragment is stored. Location Transparency – mid level transparency where a user would need to write “SELECT * FROM S14 WHERE year = 2” where S14 is the relevant fragment of the Student relation, but still wouldn’t need to say where the fragment is stored. Local Mapping Transparency – low level transparency where a user would need to write “SELECT * FROM S14 AT SITE 7 WHERE year = 2” where S14 is the relevant fragment of the Student relation and SITE 7 is where the fragment is physically located. Distribution transparency is supported by a database name server which aliases unique database object identifiers with user friendly names. Transaction transparency ensures integrity and consistency vis-à-vis multi-site transactions, concurrent users and DB failure. Local transactions and remote single site transactions are handled without additional difficulty, but multi-site transactions must be broken into subtransactions (for each site) and the independence, atomicity and durability of a centralised DBMS. Performance Transparency simply means DDBMS perform at the same level as a normal DBMS. This puts a lot of burden on the distributed query processor, which decides what fragment to hit, which copy (if replicated), and which location to use, as well as calculating I/O time, CPU time and communication costs. DBMS Transparancy means that a heterogeneous DDBMS will behave like a homogenous DDBMS.

• • •

Replication:

• •

Advantages – better access and reliability, improved performance Disadvantages – storage and consistency

Replication Options:

• •

Synchronous Updates – all copy updates are part of one transactions commit phase - a lot of admin overhad, comm. Costs and opportunity for failure but often necessary Asynchronous Updates – periodic updates of all copies based on one mater copy – violates the idea of data independence but can be useful in situations where the cost of synch updates are unwarranted.

What we expect from replication management:

• • • • •

Copy data, either synch or asynch from one db to another Scalability and mappability (in heterogeneous environments) Replication of procedures, indexes, schema etc. tools for DBAs to manage replication

Replicated data ownership models:

• •

Master/slave: publish and subscribe model – authoritive changes are made only to the master site and published to the slaves asnchronously Workflow: like M/S, but Master status moves from site to site depending on the task at hand Update-Anywhere: Symmetric, shared write authority for all replicas. Synchronous: We can synch up our replicas using regularly scheduled “snapshots” of the master data, or database triggers (when X happens, do Y)

Back to Lecture Guide

Sign up to vote on this title
UsefulNot useful