You are on page 1of 14

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

i

Table of Contents
Evolution of DDBMS.................................................................................................................1 DDBMS Advantages...........................................................................................................1 DDBMS Disadvantages......................................................................................................1 Distributed Database vs. Distributed Processing........................................................................2 Functions of DDBMS.................................................................................................................2 Centralized Database...................................................................................................2 DDBMS Components.................................................................................................................3 Levels of Data and Process Distribution.....................................................................................3 Single-Site Processing, Single-Site Data (SPSD)...............................................................4 Multiple-Site Processing, Single-Site Data (MPSD)..........................................................4 Multiple-Site Processing, Multiple-Site Data (MPMD).....................................................4 Distributed DB Transparency Features.......................................................................................5 Distribution Transparency...........................................................................................................5 Transaction Transparency...........................................................................................................7 Management mechanisms...................................................................................................7 Distributed Concurrency Control........................................................................................7 Two-Phase Commit Protocol..............................................................................................8 Phases..........................................................................................................................8 Performance Transparency and Query Optimization..................................................................9 Data Fragmentation...................................................................................................................10 Horizontal Fragmentation.................................................................................................10 Vertical Fragmentation......................................................................................................10 Mixed Fragmentation........................................................................................................11 Data Replication........................................................................................................................11 Data Allocation.........................................................................................................................12 Client/Server vs. DDBMS.........................................................................................................12 Date’s 12 Commandments for Distributed Databases..............................................................13

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

1

Evolution of DDBMS  Decentralized database management systems (DDBMS)  Interconnected computer systems  Data/processing functions reside on multiple sites  1970’s: Centralized DBMS  1980’s: Social and Technical Changes  Ad hoc capability required  Decentralized management structure common  1990’s: New forces  Internet and the World Wide Web used for data access and distribution  Data analysis through data mining and data warehousing  The decentralized database is especially desirable because centralized database management is subject to problem such as:  Performance degradation – remote locations over greater distances.  High costs – maintaining and operating large central (mainframe) database systems.  Reliability problems – by dependence on a central site. DDBMS Advantages  Data located near site with greatest demand  Faster data access  Faster data processing  Growth facilitation  Improved communications  Reduced operating costs  User-friendly interface  Less danger of single-point failure  Processor independence DDBMS Disadvantages  Complexity of management and control  Security  Lack of standards  Increased storage requirements  Greater difficulty in managing data environment  Increased training costs

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

2

Distributed Database vs. Distributed Processing  Distributed processing (Figure 1) – Stores logically related database over physically independent sites  Does not require distributed database but shares the processing chores among several sites.  May be based on a single database on single computer  Copies or parts of database processing functions must be distributed to all data storage sites  Distributed database (Figure 2) – Stores logically related database over two or more physically independent sites  Requires distributed processing  A database is a composed of several parts known as database fragments.  Both  Require a network to connect components Functions of DDBMS             Application/end user interface Validation to analyze data requests Transformation to determine request components Query optimization to find the best access strategy Mapping to determine the data location I/O interface to read or write data Formatting to prepare the data for presentation Security to provide data privacy Backup and recovery DB Administration Concurrency Control Transaction Management

Centralized Database  Refer Figure 3: all activities illustrated below are transparent to the end user.  Receives an application’s (or an end user’s) request.  Validates, analyzes, and decomposes the request.  Maps the request’s logical-to-physical data components.  Decomposes the request into several disk I/O operations.  Searches for, locates, reads, and validates the data.  Ensures database consistency, security, and integrity.

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

3

 Validates the data for the conditions, if any, specified by the request.  Presents the selected data in the required format.  A fully Distributed Database Management System (See Figure 4) must perform all the functions of a centralized DBMS.  It must handle all necessary functions imposed by the distribution of data and processing.  It must perform these additional functions transparently to the end users.  Users see only one logical database and do not need to know the names of the fragments; i.e., they need not even know that the database is divided into separate fragment, nor do they need to know where the fragments are located. DDBMS Components  The DDBMS must include (at least) the following components:  Computer workstations  Network hardware and software components  Communications media  Transaction processor (TP) – receives and processes the application’s data requests.  Also called application manager (AP) or transaction manager (TM)  Data processor (DP) – stores and retrieves data located at the site.  Also called data manager (DM)  May even be a centralized DBMS.  The Distributed Database Components are illustrated in Figure 5.  Note: each TP can access data on any DP, and each DP handles all requests for local data from any TP.  DDBMS Protocols  Interface with network to transport data and commands between DPs and TPs  Synchronize data received from DPs and route to appropriate TPs  Ensure common database functions  Security  Concurrency control  Backup and recovery Levels of Data and Process Distribution Database systems can be classified based on process distribution and data distribution

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

4

Single-Site Processing, Single-Site Data (SPSD)  Refer Figure 6  All processing on single CPU or host computer  All data are stored on host computer disk  DBMS located on the host computer  DBMS accessed by dumb terminals  Typical of mainframe and minicomputer DBMSs  Typical of 1st generation of single-user microcomputer database Multiple-Site Processing, Single-Site Data (MPSD)  Refer Figure 7  Requires network file server  Applications accessed through LAN  Variation known as client/server architecture Multiple-Site Processing, Multiple-Site Data (MPMD)  Fully distributed DDBMS with support for multiple DPs and TPs at multiple sites  Homogeneous  Integrate one type of centralized DBMS over the network  The same DBMS will be running on different mainframes, minicomputers, and microcomputers.  Heterogeneous (Figure 8)  Integrate different types of centralized DBMSs over a network  The fully heterogeneous DDBMS will support different DBMSs that may even support different data models (relational, hierarchical, or network) running under different computer systems listed above.  Some DDBMS implementations support several platforms, operating systems, and networks, and allow remote data access to another DBMS. However, such DDBMSs still are subject to certain restrictions:  Remote access is provided on a read-only basis and does not support write privileges.  Restrictions are placed on the number of remote tables that may be accessed in a single transaction.

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

5

 Restrictions are placed on the number of distinct databases that may be accessed.  Restrictions are placed on the database model that may be accessed with relational databases but not network or hierarchical databases. Distributed DB Transparency Features  Functional characteristics and properties:  Be grouped and described as transparency features.  Allows end users to feel like only database user.  Hides complexities of distributed database.  Transparency features  Distribution Transparency – allows a distributed database to be treated as a single logical database, the user doesn’t need to know  The data are partitioned.  The data can be replicated at several sites.  The data location.  Transaction Transparency – allows a transaction to update data at several network sites.  Failure Transparency – ensures that the system will continue to operate in the event of a node failure.  Performance Transparency – allows the system to perform as if it were a centralized DBMS.  Heterogeneity Transparency – allows the integration of several different local DBMSs (relational, network, and hierarchical) under a common, or global, schema. Distribution Transparency Allows management of a physically dispersed database as though it were centralized Three Levels:  Fragmentation transparency – the highest level of transparency; the end user or programmer does not need to know that a database is partitioned. Therefore, neither fragment names nor fragment locations are specified prior to data access.  Location transparency – when the end user or programmer must specify the database fragment names but does not need to specify where these fragments are located.  Local mapping transparency – when the end user or programmer must specify both the fragment names and their locations.

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

6

 The use of various transparency levels – refers Figure 9, in which, assume 1. The EMPLOYEE table is fragmented and each fragment is unique. 2. The end user wants to list all the employees with a date of birth prior to January 1, 1940. 3. No partition of the database is replicated at any other site on the network.  Case 1: The Database Supports Fragmentation Transparency – It does not specify fragment names or locations. SELECT * FROM EMPLOYEE WHERE EMP_DOB < ’01-JAN-1940’;  Case 2: The Database Supports Location Transparency – Fragment names must be specified in the query, but fragment location is not specified. SELECT FROM E1 WHERE EMP_DOB < ’01-JAN-1940’; UNION SELECT * FROM E2 WHERE EMP_DOB < ’01-JAN-1940’;  Case 3: The Database Supports Local Mapping Transparency – Both the fragment name and location must be specified in the query. SELECT * FROM E1 NODE NY WHERE EMP_DOB < ’01-JAN-1940’; UNION SELECT * FROM E2 NODE ATL WHERE EMP_DOB < ’01-JAN-1940’; UNION SELECT * FROM E3 NODE MIA WHERE EMP_DOB < ’01-JAN-1940’;  Distribution transparency is supported by a distributed data dictionary (DDD), or a distributed data catalog (DDC).  DDC contains the description of the entire database as seen by the database administrator.  The database description, known as the distributed global schema, is the common

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

7

database schema used by local TPs to translate user requests into subqueries (remote requests) that will be processed by different DPs.  Some of the current DDBMS implementations impose limitations on the level of transparency support; e.g., you might be able to distribute a database, but not a table, across multiple sites. Such a condition indicates that the DDBMS supports location transparency but not fragmentation transparency. Transaction Transparency Ensures transactions maintain integrity and consistency Completed only if all involved database sites complete their part of the transaction

Management mechanisms  Remote request (refer Figure 10) – access data to be processed by a single remote database processor; i.e., the SQL statement (or request) can reference data at only one remote site.  The transaction updates the CUSTOMER and INVOICE tables.  Both tables are located at Site B.  The transaction can reference only one remote DP.  Each SQL statement (or request) can reference only one (the same) remote DP at a time, and the entire transaction can reference and is executed at only one remote DP.  Remote transaction (refer Figure 11) – composed of several request, may access data at only a single site.  Distributed transaction (refer Figure 12) – allows a transaction to reference several different (local or remote) DP sites.  The transaction references two remote sites (B and C).  The first request (SELECT statement) is processed by the DP at the remote Site B, and the next requests (UPDATE and INSERT) are processed by the DP at the remote Site C.  Each request can access only one remote site at a tome.  Distributed request (refer Figure 13 & 14) – reference data from several remote DP sites.  Partition a database table into several fragments.  Reference one or more of those fragments with only one request; i.e., fragmentation transparency is performed. Distributed Concurrency Control  Multi-site, multiple-process operations more likely to create data inconsistencies and deadlocked transactions

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

8

 Problems (Figure 15)  Transaction committed by local DP  One DP could not commit transaction’s result  Yields inconsistent database  Solution: two-phase commit protocol! Two-Phase Commit Protocol  If a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state.  The protocol requires that each individual DP’s transaction log entry, which is maintained by unique DP, be written before the dataset fragment is actually updated.  DO-UNDO-REDO protocol and write-ahead protocol are necessary.  DO-UNDO-REDO protocol – rolls back and/or rolls forward transactions with the help of the system’s transaction log entries. Three types below:  DO – performs the operation and records the “before” and “after” values in the transaction log.  UNDO – reverses an operation, using the log entries written by the DO portion of the sequence.  REDO – redoes an operation, using the log entries written by the DO portion of the sequence.  Write-ahead protocol – forces the log entry to be written to permanent storage before he actual operation takes place.  The protocol defines the operations between two kinds of nodes  Coordinator  Subordinates (or cohorts) Phases  Phase 1: Preparation  Coordinator sends message to all subordinates  Subordinates receive the message, write the transaction log using the write-ahead protocol, and send an acknowledgement (YES/PREPARED TO COMMIT or NO/NOT PREPARED) message to the coordinator.  Confirms all are ready to commit or abort  Phase 2: Final Commit – Ensures all subordinates have committed or aborted  The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.  Each subordinate receives the COMMIT message, then updates the database using the DO protocol.

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

9

 The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator. Performance Transparency and Query Optimization  Objective: Minimize total cost associated with execution since the data may be replicated in several different sites and a database may be partitioned into several fragments.  Main costs  Access time – involved in accessing the physical data stored on disk.  Communication – associated with the transmission of data among nodes in DDBS.  CPU time – associated with the processing overhead of managing distributed transactions.  Basis for query optimization algorithms  Classified by communication or processing costs.  To evaluate query optimization, the TP must receive data from the DP, synchronize it, assemble the answer, and present it to the end user or an application.  Replica Transparency – as important as distribution transparency, it hides the existence of multiple copies of data from the user.  The selection of the optimum execution order  The selection of sites accessed to minimize communication costs  Evaluated on the basis of operation mode of classification  Automatic query optimization – the DDBMS finds the most cost-effective access path without user intervention. (More desirable from the end user’s view, but high cost!)  Manual query optimization – the optimization be selected and scheduled by the end user or programmer.  Timing of classification according to when the optimization is down.  Static query optimization – takes place at compilation time; i.e., the best optimization strategy is selected when the query is compiled by DBMS. (e.g., SQL statements are embedded in procedural programming language, such as COBOL or Pascal.)  Dynamic query optimization – takes place at execution time; i.e., database access strategy is determined by the DBMS at the run time, using the most up-to-date information about the database. (Efficient but expensive!)  Classified according to the type of information that is used to optimize the query:  Statistically based optimization algorithms – provide information about such database characteristics as size, number of records, average access time, number of request serviced, number of users with access rights, and so on.  Dynamic statistical generation mode – the DBMS automatically evaluates and

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

10

updates the statistics after each access.  Manual statistical generation mode – the statistics must be updated periodically through a user-selected utility, such as IBM’s RUNSTAT command used by OS/2 Database Manager  Rule-based query optimization algorithms – determine the best query strategy based on a set of user-defined rules. The rules are entered by the end user or database administrator, and they typically are very general in nature. Data Fragmentation Partition database into fragments, the data fragmentation allows us to break a single object into two or more segments or fragments. Data fragmentation information is stored in the distributed data catalog (DDC), from which it is accessed by the TP to process user requests. Three types of data fragmentation strategies: horizontal, vertical and mixed fragmentation. Figure 16 is selected as a sample CUSTOMER table.

Horizontal Fragmentation  Refers to the division of a relation into subsets (fragments) of tuples (rows).  Each fragment is stored at a different node, and each fragment has unique rows while all unique rows the same attributes (columns).  It is equivalent to SELECT statement with WHERE clause on a single attribute.  Refer the sample table below and Figure 17.

Vertical Fragmentation  Refers to the division of a relation into attribute (column) subsets.  Each subset (fragment) is stored at a different node, and each fragment has unique columns – with the exception of the key column, which is common to all fragments.  It is equivalent to PROJECT statement.  Refer the sample table below and Figure 18.

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

11

Mixed Fragmentation  Refers to a combination of horizontal and vertical strategies.  A table may be divided into several horizontal subsets (rows), each one having a subset of the attributes (columns).  Refer the sample table below and Figure 19.

Data Replication  Data replication refers to the storage of data copies at multiple sites served by a computer network where the fragment copies can be stored at several sites to serve specific information requirements.  Mutual consistency rule – all copies of data fragments are identical; DDBMS must ensure that a database update is performed at all sites where replicas exist.  If the database is fragmented, the DBMS must decompose a query into subqueries to access the appropriate fragments.  If the database is replicated, the DDBMS must decide which copy to access.  A READ operation selects the nearest copy.  A WRITE operation requires all copies be selected and updated.  The TP sends a data request to each selected DP for execution.  The DP receives and executes each request and sends the data back to the TP.  The TP assembles the DP responses.

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

12

 Replication scenarios:  Fully replicated databases – store multiple copies of each database fragment at multiple sites; i.e., all database fragments are replicated.  Partially replicated databases – store multiple copies of some database fragments at multiple sites. (available for most DDBMS)  Unreplicated replicated databases – store each database fragment at a single site; i.e., no duplicate database fragment.  Factors influence the decision to use data replication:  Database size  Usage frequency  Costs – performance, software overhead, and management – associated with synchronizing transactions and their components vs. fault-tolerance benefits that are associated with replicated data. Data Allocation  Data allocation describes the process of deciding where to locate data.  The strategies of data allocation follows:  Centralized data allocation – the entire database is stored at one site.  Partitioned data allocation – the database is divided into several disjoint parts (fragments) and stored at several sites.  Replicated data allocation – copied of one or more database fragments are stored at several sites.  Study issue: which data to locate where. The algorithm considers:  Performance and data availability goals.  Size, number of rows, and the number of relations that an entity maintains with other entities.  Types of transactions to be applied to the database, the attributes accessed by each of those transactions, and so on. Client/Server vs. DDBMS  Client/Server Advantages Over DDBMS  Client/server less expensive  Client/server solutions allow use of microcomputer’s GUI  More people with PC skills than mainframe skills  PC is well established in workplace  Numerous data analysis and query tools exist  Considerable cost advantages to off-loading application development

DDBMS - Lecture 3

Overview of Distributed Database Management Systems

13

 Client/Server Disadvantages  Creates more complex environment with different platforms  Increased number of users and sites creates security problems  Training issues become more complex and expensive Date’s 12 Commandments for Distributed Databases  No discussion of distributed databases is complete unless it includes C.J. Date’s distributed database commandments with twelve rules below: 1. Local Site Independence – Each local site can act as an independent, autonomous, centralized DBMS with responsibility on security, concurrency control, backup, and recovery. 2. Central Site Independence – No site in the network relies on a central site or any other site, but all have the same capability. 3. Failure Independence – The system is not affected by node failures. 4. Location Transparency – The user does not need to know the location of the data in order to retrieve those data. 5. Fragmentation Transparency – The user sees only one single logical database because data fragmentation is transparent to the user. 6. Replication Transparency – The user sees only one single logical database because the DDBMS transparently selects the database fragment to access. 7. Distributed Query Processing – A distributed query may be executed at several different DP sites with transparent performance of query optimization. 8. Distributed Transaction Processing – A transaction may update data at several different sites. 9. Hardware Independence – The system must run on any hardware platform. 10. Operating System Independence – The system must run on any operating system software platform. 11. Network Independence – The system must run on any network platform. 12. Database Independence – The system must support any vendor’s database product.