[Figures: Shared Memory Parallel Database Architecture; Shared Disk Parallel Database Architecture; Shared Nothing Parallel Database Architecture (CPU = processor, M = memory)]
MAINFRAME DATABASE SYSTEM
[Figure: dumb terminals linked by a specialised network connection to a mainframe computer, which holds the presentation logic, business logic and data logic]
DISTRIBUTED DATABASE SYSTEM
DISTRIBUTED DATABASES WHAT IS A DISTRIBUTED DATABASE?
A distributed database system is a collection of logically related databases that co-operate in a transparent manner.
Transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database.
There should be location independence, i.e. because the user is unaware of where the data is located, it is possible to move the data from one physical location to another without affecting the user.
[Figure: four clients on a LAN connected to a DBMS]
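Location independence, as described above, can be illustrated with a small sketch. Everything here is hypothetical (the `GlobalCatalog` class, its methods and the sample rows are not from the slides): a catalog records which site holds each table, so a table can be moved between sites without changing the call the user makes.

```python
class GlobalCatalog:
    """Hypothetical catalog mapping each logical table to the site that stores it."""

    def __init__(self):
        self.sites = {}      # site name -> {table name -> list of rows}
        self.location = {}   # table name -> site name

    def store(self, site, table, rows):
        """Place a table at a site and record its location."""
        self.sites.setdefault(site, {})[table] = list(rows)
        self.location[table] = site

    def move(self, table, new_site):
        """Relocate a table; callers of query() are unaffected."""
        old_site = self.location[table]
        rows = self.sites[old_site].pop(table)
        self.store(new_site, table, rows)

    def query(self, table):
        """The user names only the table, never the site."""
        return list(self.sites[self.location[table]][table])


catalog = GlobalCatalog()
catalog.store("Stratford", "staff", [("S1", "Ann"), ("S2", "Bob")])
before = catalog.query("staff")
catalog.move("staff", "Barking")   # physical relocation
after = catalog.query("staff")     # same call, same answer
```

A real DDBMS would hold this mapping in its global system catalog and ship the query over the network; the dictionary lookup only stands in for that step.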
[Figure: DISTRIBUTED DATABASE ARCHITECTURE: LAN sites at Leytonstone, Stratford, Barking and Leyton, with local clients and DBMSs]
[Figure: M:N CLIENT/SERVER DBMS ARCHITECTURE: CLIENT#1, CLIENT#2 and CLIENT#3 connected to D/BASE SERVER #1 and D/BASE SERVER #2. NOT TRANSPARENT!]
[Figure: COMPONENTS OF A DDBMS: Site 1 and Site 2 connected by a computer network; each site has GSC, DDBMS and DC components, and one site also has an LDBMS with its DB]
LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS

DISTRIBUTED DATABASES ADVANTAGES
Reduced Communication Overhead
Most data access is local, less expensive and performs better.
Improved Processing Power
Instead of one server handling the full database, we now have a collection of machines handling the same database.
Removal of Reliance on a Central Site
If a server fails, then the only part of the system that is affected is the relevant local site. The rest of the system remains functional and available.
Expandability
It is easier to accommodate increasing the size of the global (logical) database.
Local autonomy
The database is brought nearer to its users. This can effect a cultural change as it allows potentially greater control over local data.
DISTRIBUTED DATABASES DATE'S TWELVE RULES FOR A DDBMS
A distributed system looks exactly like a non-distributed system to the user!
1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence
[Figure: clients on several LANs, including Leyton and Stratford, connected to a DBMS]
DISTRIBUTED DATABASES ISSUES
Distributed Queries (see chapter 20)
DISTRIBUTED DATABASES DATA ALLOCATION METRICS
1. Locality of reference: Is the data near to the sites that need it?
2. Reliability and availability: Does the strategy improve fault tolerance and accessibility?
3. Performance: Does the strategy result in bottlenecks or under-utilisation of resources?
4. Storage costs: How does the strategy affect the availability and cost of data storage?
5. Communication costs: How much network traffic will result from the strategy?
DISTRIBUTED DATABASES DATA ALLOCATION STRATEGIES

CENTRALISED
Locality of Reference: Lowest
Reliability/Availability: Lowest
Storage Costs: Lowest
Performance: Unsatisfactory
Communication Costs: Highest

PARTITIONED/FRAGMENTED
Locality of Reference: High
Reliability/Availability: Low (item), High (system)
Storage Costs: Lowest
Performance: Satisfactory
Communication Costs: Low

COMPLETE REPLICATION
Locality of Reference: Highest
Reliability/Availability: Highest
Storage Costs: Highest
Performance: High
Communication Costs: High (update), Low (read)

SELECTIVE REPLICATION
Locality of Reference: High
Reliability/Availability: Low (item), High (system)
Storage Costs: Average
Performance: Satisfactory
Communication Costs: Low
DISTRIBUTED DATABASES WHY FRAGMENT DATA?
Usage: Applications are usually interested in views, not whole relations.
Efficiency: It is more efficient if data is stored close to where it is most frequently used.
Parallelism: It is possible to run several sub-queries in parallel.
Security: Data not required by local applications is not stored at the local site.
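The reasons for fragmenting data can be made concrete with a minimal sketch of horizontal fragmentation. The `staff` relation, its `branch` column and both helper functions are assumptions made for illustration, not part of the slides. Each site receives only the rows it uses, and the global relation is recovered as the union of the fragments.

```python
# Hypothetical global relation; each row records the branch that uses it.
staff = [
    {"id": 1, "name": "Ann", "branch": "Leyton"},
    {"id": 2, "name": "Bob", "branch": "Stratford"},
    {"id": 3, "name": "Cy", "branch": "Leyton"},
]


def fragment_by(rows, column):
    """Horizontal fragmentation: one fragment per distinct value of column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def reconstruct(fragments):
    """Rebuild the global relation as the union of its fragments."""
    rows = [row for fragment in fragments.values() for row in fragment]
    return sorted(rows, key=lambda row: row["id"])


fragments = fragment_by(staff, "branch")
leyton_rows = fragments["Leyton"]   # the only rows stored at the Leyton site
rebuilt = reconstruct(fragments)    # the global (logical) relation
```

Because each fragment can be scanned independently, a query over the whole relation could be split into one sub-query per fragment, which is the parallelism point made above.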
CLIENT/SERVER DATABASE SYSTEM
CLIENT/SERVER DBMS
CLIENT PROCESS
Manages user interface
Accepts user data
Processes application/business logic
Generates database requests (SQL)
Transmits database requests to server
Receives results from server
Formats results according to application logic
Presents results to the user
CLIENT/SERVER DBMS
SERVER PROCESS
Accepts database requests
Processes database requests
Performs integrity checks
Handles concurrent access
Optimises queries
Performs security checks
Enacts recovery routines
Transmits result of database request to client
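The division of work between the client process and the server process can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the `staff` table, the three function names and the dictionary-shaped response are hypothetical, and a plain function call stands in for the network transmission between client and server.

```python
import sqlite3


def make_server_db():
    """Server side: owns the data (an in-memory SQLite database here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO staff (id, name) VALUES (?, ?)",
                     [(1, "Ann"), (2, "Bob")])
    conn.commit()
    return conn


def server_process(conn, sql, params):
    """Accept a request, process it, and return the result to the client.

    Integrity violations are reported in the response instead of being raised.
    """
    try:
        with conn:  # commits on success, rolls back on error
            rows = conn.execute(sql, params).fetchall()
        return {"ok": True, "rows": rows}
    except sqlite3.IntegrityError as exc:
        return {"ok": False, "error": str(exc)}


def client_process(conn, name_prefix):
    """Generate the SQL request, send it, and format the response for display."""
    sql = "SELECT id, name FROM staff WHERE name LIKE ? ORDER BY id"
    response = server_process(conn, sql, (name_prefix + "%",))
    return [f"{row_id}: {name}" for row_id, name in response["rows"]]


db = make_server_db()
lines = client_process(db, "A")
# A duplicate primary key is rejected by the server's integrity check.
duplicate = server_process(db, "INSERT INTO staff (id, name) VALUES (?, ?)", (1, "Eve"))
```

Concurrency handling, query optimisation, security checks and recovery are left to SQLite here and are not shown.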
[Figure: CLIENT/SERVER DBMS ARCHITECTURE (FAT CLIENT): CLIENT#1, CLIENT#2 and CLIENT#3 exchange data requests and data responses with the D/BASE SERVER; the figure shows where the presentation logic, business logic and data logic reside]
[Figure: CLIENT/SERVER DBMS ARCHITECTURE (THIN CLIENT): CLIENT#1, CLIENT#2 and CLIENT#3 exchange data requests and data responses with the D/BASE SERVER; the figure shows where the presentation logic, business logic and data logic reside]
[Figure: DISTRIBUTED PROCESSING ARCHITECTURE: clients on LANs at Leyton, Stratford, Barking and Leytonstone connected to one DBMS]

Middleware Systems: Overview and Introduction

Middleware Systems
Middleware systems comprise abstractions and services that facilitate the design, development, integration and deployment of distributed applications in heterogeneous networking environments:
  remote communication mechanisms (Web services, CORBA, Java RMI, DCOM, i.e. request brokers)
  event notification and messaging services (COSS Notifications, Java Message Service etc.)
  transaction services
  naming services (COSS Naming, LDAP)

Definition by Example
The following constitute middleware systems or middleware platforms:
  CORBA, DCE, RMI, J2EE (?), Web Services, DCOM, COM+, .Net Remoting, application servers, ...
  some of these are collections and aggregations of many different services
  some are marketing terms

What & Where is Middleware?
[Figure: Middleware Systems shown among Distributed Systems, Programming Languages, Databases, Operating Systems and Networking]
  middleware is dispersed among many disciplines

What & Where is Middleware?
  Distributed Systems: ACM PODC, ICDE
  Middleware: ACM/IFIP/IEEE Middleware Conference, DEBS, DOA, EDOC
  Programming Languages
  Databases: SIGMOD, VLDB, ICDE
  Operating Systems: SIGOPS
  Networking: SIGCOMM, INFOCOM
  mobile computing, software engineering, ...

Middleware Research
  dispersed among different fields with different research methodologies
  different standards, points of view, and approaches
  a middleware research community is starting to crystallize around conferences such as Middleware, DEBS, DOA, EDOC et al.
  many other conferences have middleware tracks
  many existing fields/communities are broadening their scope
  middleware is still somewhat of a trendy or marketing term, but I think it is crystallizing into a separate field: middleware systems
  in the long term we are trying to identify concepts and build a body of knowledge that identifies middleware systems, much like OS, PL, DS ...
Middleware Systems I
In a nutshell:
  Middleware is about supporting the development of distributed applications in networked environments
  This also includes the integration of systems
  It is about making this task easier, more efficient and less error-prone
  It is about providing the infrastructure software that enables this task

Middleware Systems II
  software technologies that help manage the complexity and heterogeneity inherent in the development of distributed systems, distributed applications, and information systems
  a layer of software above the operating system and the network substrate, but below the application
  a higher-level programming abstraction for developing the distributed application
    higher than lower-level abstractions, such as sockets provided by the operating system
    a socket is a communication end-point from which data can be read or onto which data can be written

Middleware Systems III
  aims at reducing the developer's burden of developing distributed applications
  informally called plumbing, i.e., like pipes that connect entities for communication
  often called glue code, i.e., it glues independent systems together and makes them work together
  it masks the heterogeneity that programmers of distributed applications have to deal with:
    network & hardware
    operating system & programming language
    different middleware platforms
    location, access, failure, concurrency, mobility, ...
  these are often also referred to as transparencies, e.g., network transparency, location transparency

Middleware Systems IV
  an operating system is the software that makes the hardware usable
  similarly, a middleware system makes the distributed system programmable and manageable
  a bare computer without an OS could still be programmed; likewise, a distributed application could be developed without middleware
  programs could be written in assembly, but higher-level languages are far more productive for this purpose
  however, sometimes the assembly variant is chosen. WHY?
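The difference between a raw socket and a higher-level middleware abstraction can be sketched as follows. This is a toy illustration, not any real middleware platform: the JSON-lines wire format, `send_msg`, `recv_msg`, `serve_one_request` and the `RemoteCalculator` stub are all hypothetical. The caller of `add` sees an ordinary method call, while the marshalling and the socket reads and writes stay hidden inside the stub.

```python
import json
import socket
import threading


def send_msg(sock, obj):
    """Marshal obj as one JSON line and write it to the socket."""
    sock.sendall(json.dumps(obj).encode("utf-8") + b"\n")


def recv_msg(sock):
    """Read one newline-terminated JSON message, a byte at a time."""
    data = b""
    while not data.endswith(b"\n"):
        byte = sock.recv(1)
        if not byte:
            raise ConnectionError("peer closed the connection")
        data += byte
    return json.loads(data)


def serve_one_request(sock):
    """Server-side skeleton: unmarshal, dispatch, marshal the reply."""
    request = recv_msg(sock)
    handlers = {"add": lambda a, b: a + b}
    result = handlers[request["method"]](*request["args"])
    send_msg(sock, {"result": result})


class RemoteCalculator:
    """Client-side stub: looks like a local object and hides the socket."""

    def __init__(self, sock):
        self._sock = sock

    def add(self, a, b):
        send_msg(self._sock, {"method": "add", "args": [a, b]})
        return recv_msg(self._sock)["result"]


# A connected socket pair stands in for a network connection between two hosts.
client_sock, server_sock = socket.socketpair()
server = threading.Thread(target=serve_one_request, args=(server_sock,))
server.start()
total = RemoteCalculator(client_sock).add(2, 3)
server.join()
client_sock.close()
server_sock.close()
```

Writing directly against `send_msg` and `recv_msg` corresponds to the "assembly variant": more control over what goes on the wire, at the cost of doing the plumbing by hand.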
The Questions
  What are the right programming abstractions for middleware systems?
  What protocols do these abstractions require to work as promised?
  What, if any, of the underlying systems (networks, hardware, distribution) should be exposed to the application developer?
    Views range from full distribution transparency, through full control and visibility of the underlying system, to a smaller number of hybrid approaches that aim to achieve both
    Each has vast implications for the programming abstractions offered

Middleware Metaphorically
[Figure: Host 1 and Host 2 each run a stack of distributed application, middleware and operating system, connected by the network]

Categories of Middleware
  remote invocation mechanisms, e.g., DCOM, CORBA, DCE, Sun RPC, Java RMI, Web Services, ...
  naming and directory services, e.g., JNDI, LDAP, COSS Naming, DNS, COSS trader, ...
  message-oriented middleware, e.g., JMS, MQSI, MQSeries, ...
  publish/subscribe systems, e.g., JMS, various proprietary systems, COSS Notification

Categories II
  (distributed) tuple spaces
    (databases): I do not consider a DBMS a middleware system
    Linda, initially an abstraction for developing parallel programs
    inspired InfoSpaces, later JavaSpaces, later JINI
  transaction processing systems (TP monitors)
    implement transactional applications, e.g., the ATM example
  adapters, wrappers, mediators

Categories III
  choreography and orchestration
    workflow and business process tools (BPEL et al.)
    a.k.a. Web service composition
  fault tolerance, load balancing, etc.
  real-time, embedded, high-performance, safety critical

Middleware Curriculum
  A middleware curriculum needs to capture and present the invariants that define the above categories
  A middleware curriculum needs to capture the essence of, and the lessons learned from, specifying and building these types of systems over and over again
  We have witnessed the re-invention of many of these abstractions without any functional changes over the past 25 years (see later in the course)
  Due to lack of time and the invited guest lectures, we will only look at a few of these categories

Concurrency Control

Lock-Based Protocols
  A lock is a mechanism to control concurrent access to a data item
  Data items can be locked in two modes:
    1. exclusive (X) mode. The data item can be both read and written. An X-lock is requested using the lock-X instruction.
    2. shared (S) mode. The data item can only be read. An S-lock is requested using the lock-S instruction.
  Lock requests are made to the concurrency-control manager. A transaction can proceed only after its request is granted.

Lock-Based Protocols (Cont.)
  Lock-compatibility matrix
            S       X
      S   true    false
      X   false   false
  A transaction may be granted a lock on an item if the requested lock is compatible with the locks already held on the item by other transactions
  Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item
  If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released. The lock is then granted.

Lock-Based Protocols (Cont.)
  Example of a transaction performing locking:
    T2: lock-S(A);
        read(A);
        unlock(A);
        lock-S(B);
        read(B);
        unlock(B);
        display(A+B)
  Locking as above is not sufficient to guarantee serializability: if A and B are updated between the read of A and the read of B, the displayed sum would be wrong.
  A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.

Pitfalls of Lock-Based Protocols
  Consider the partial schedule
  [Figure: partial schedule of T3 and T4]
  Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
  Such a situation is called a deadlock.
    To handle a deadlock, one of T3 or T4 must be rolled back and its locks released.

Pitfalls of Lock-Based Protocols (Cont.)
  The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
  Starvation is also possible if the concurrency-control manager is badly designed. For example:
    A transaction may be waiting for an X-lock on an item while a sequence of other transactions request and are granted an S-lock on the same item.
    The same transaction is repeatedly rolled back due to deadlocks.
  The concurrency-control manager can be designed to prevent starvation.

The Two-Phase Locking Protocol
  This is a protocol which ensures conflict-serializable schedules.
  Phase 1: Growing Phase
    transaction may obtain locks
    transaction may not release locks
  Phase 2: Shrinking Phase
    transaction may release locks
    transaction may not obtain locks
  The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).

The Two-Phase Locking Protocol (Cont.)
  Two-phase locking does not ensure freedom from deadlocks.
  Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks until it commits or aborts.
  Rigorous two-phase locking is even stricter: here all locks are held until commit or abort. In this protocol transactions can be serialized in the order in which they commit.

The Two-Phase Locking Protocol (Cont.)
  There can be conflict-serializable schedules that cannot be obtained if two-phase locking is used.
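The growing-phase and shrinking-phase rule can be checked mechanically. The sketch below is an illustration only: the `is_two_phase` helper and the `(action, item)` encoding of a transaction are assumptions, not part of the protocol's definition. It applies the rule to the earlier T2 example, which unlocks A before locking B.

```python
def is_two_phase(ops):
    """Return True if no lock is acquired after the first unlock.

    ops is a list of (action, item) pairs; only "lock-S", "lock-X" and
    "unlock" matter here, all other actions are ignored.
    """
    released = False
    for action, _item in ops:
        if action == "unlock":
            released = True            # the shrinking phase has begun
        elif action in ("lock-S", "lock-X") and released:
            return False               # growing again after shrinking
    return True


# T2 as written in the earlier example: A is unlocked before B is locked.
t2 = [("lock-S", "A"), ("read", "A"), ("unlock", "A"),
      ("lock-S", "B"), ("read", "B"), ("unlock", "B")]

# A hypothetical two-phase rewrite: acquire both locks, then release both.
t2_two_phase = [("lock-S", "A"), ("read", "A"),
                ("lock-S", "B"), ("read", "B"),
                ("unlock", "A"), ("unlock", "B")]

t2_ok = is_two_phase(t2)
t2_rewrite_ok = is_two_phase(t2_two_phase)
```

In the rewritten transaction the lock point is the `lock-S(B)` step, the last lock acquired before any release.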
  However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:
    Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.

Lock Conversions
  Two-phase locking with lock conversions:
  First Phase:
    can acquire a lock-S on item
    can acquire a lock-X on item
    can convert a lock-S to a lock-X (upgrade)
  Second Phase:
    can release a lock-S
    can release a lock-X
    can convert a lock-X to a lock-S (downgrade)
  This protocol assures serializability, but it still relies on the programmer to insert the various locking instructions.

Automatic Acquisition of Locks
  A transaction Ti issues the standard read/write instruction, without explicit locking calls.
  The operation read(D) is processed as:
    if Ti has a lock on D
      then
        read(D)
      else
        begin
          if necessary wait until no other transaction has a lock-X on D
          grant Ti a lock-S on D;
          read(D)
        end

Automatic Acquisition of Locks (Cont.)
  write(D) is processed as:
    if Ti has a lock-X on D
      then
        write(D)
      else
        begin
          if necessary wait until no other transaction has any lock on D,
          if Ti has a lock-S on D
            then
              upgrade lock on D to lock-X
            else
              grant Ti a lock-X on D
          write(D)
        end;
  All locks are released after commit or abort.

Implementation of Locking
  A lock manager can be implemented as a separate process to which transactions send lock and unlock requests
  The lock manager replies to a lock request by sending a lock grant message (or a message asking the transaction to roll back, in case of a deadlock)
  The requesting transaction waits until its request is answered
  The lock manager maintains a data structure called a lock table to record granted locks and pending requests
  The lock table is usually implemented as an in-memory hash table indexed on the name of the data item being locked
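A lock table of the kind described above can be sketched with ordinary dictionaries, which are in-memory hash tables. This is a simplified, single-threaded illustration: the `LockManager` class, its method names and the first-come-first-served queueing policy are assumptions, and deadlock detection and the removal of a rolled-back transaction's queued requests are left out.

```python
from collections import defaultdict, deque

# (held mode, requested mode) -> may both be held at once?
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}


class LockManager:
    """Toy lock table: data item name -> granted locks plus a FIFO wait queue."""

    def __init__(self):
        self.granted = defaultdict(dict)    # item -> {transaction: mode}
        self.waiting = defaultdict(deque)   # item -> deque of (transaction, mode)

    def _compatible(self, item, txn, mode):
        """True if mode conflicts with no lock held on item by another transaction."""
        return all(COMPATIBLE[(held, mode)]
                   for other, held in self.granted[item].items() if other != txn)

    def lock(self, txn, item, mode):
        """Return True if the lock is granted now, False if the request is queued."""
        held = self.granted[item].get(txn)
        if held == "X" or held == mode:
            return True                     # already holds a sufficient lock
        # Granting only when nobody is queued ahead prevents starvation of waiters.
        if not self.waiting[item] and self._compatible(item, txn, mode):
            self.granted[item][txn] = mode  # also covers an S to X upgrade
            return True
        self.waiting[item].append((txn, mode))
        return False

    def release_all(self, txn):
        """Release every lock of txn (at commit or abort) and wake eligible waiters."""
        for item in list(self.granted):
            self.granted[item].pop(txn, None)
            queue = self.waiting[item]
            while queue and self._compatible(item, queue[0][0], queue[0][1]):
                waiter, mode = queue.popleft()
                self.granted[item][waiter] = mode


lm = LockManager()
first = [lm.lock("T1", "A", "S"), lm.lock("T2", "A", "S"),
         lm.lock("T3", "A", "X"), lm.lock("T4", "A", "S")]
lm.release_all("T1")
lm.release_all("T2")
holders_after_t2 = dict(lm.granted["A"])
lm.release_all("T3")
holders_after_t3 = dict(lm.granted["A"])
```

In this run T4's shared request is queued behind T3's exclusive request rather than being granted immediately, which is one simple way to avoid the starvation case mentioned earlier.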