
Parallel and Distributed Databases

Parallel DBMS - What and Why?



What is a Client/Server DBMS?

Why do we need Distributed DBMSs?

Date's rules for a Distributed DBMS

Benefits of a Distributed DBMS

Issues associated with a Distributed DBMS

Disadvantages of a Distributed DBMS

PARALLEL DATABASE SYSTEM

PARALLEL DBMSs
WHY DO WE NEED THEM?
More and More Data!

We have databases that hold a high amount of
data, in the order of 10^12 bytes:

10,000,000,000,000 bytes!

Faster and Faster Access!

We have data applications that need to process
data at very high speeds:

10,000s of transactions per second!

SINGLE-PROCESSOR DBMSs AREN'T UP TO THE JOB!

INTERQUERY PARALLELISM

It is possible to process a number of transactions in
parallel with each other.

Improves Throughput.

INTRAQUERY PARALLELISM

It is possible to process sub-tasks of a transaction in
parallel with each other.

Improves Response Time.

PARALLEL DBMSs
BENEFITS OF A PARALLEL DBMS

Speed-Up.

As you multiply resources by a certain factor, the time taken
to execute a transaction should be reduced by the same factor:

10 seconds to scan a DB of 10,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
PARALLEL DBMSs
HOW TO MEASURE THE BENEFITS

Scale-up.

As you multiply resources the size of a task that can be executed
in a given time should be increased by the same factor.

1 second to scan a DB of 1,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
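The two measures can be computed directly from timings. A minimal sketch in Python (the function names are invented; the sample timings simply restate the figures above):

```python
def speedup(time_on_small_system, time_on_large_system):
    """Speed-up: same task, more resources. Ideal value = resource factor."""
    return time_on_small_system / time_on_large_system

def scaleup(time_small_task_small_system, time_large_task_large_system):
    """Scale-up: task grows with resources. Ideal value = 1.0."""
    return time_small_task_small_system / time_large_task_large_system

# 10 seconds on 1 CPU vs 1 second on 10 CPUs
print(speedup(10.0, 1.0))   # 10.0 -> linear speed-up
# 1 second for 1,000 records on 1 CPU vs 1 second for 10,000 on 10 CPUs
print(scaleup(1.0, 1.0))    # 1.0 -> linear scale-up
```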

PARALLEL DBMSs
SPEED-UP
[Figure: throughput (transactions/second) against number of CPUs. The ideal, linear speed-up line doubles throughput as CPUs double (1000/sec at 5 CPUs, 2000/sec at 10 CPUs); the sub-linear speed-up curve falls away (only 1600/sec at 16 CPUs).]
PARALLEL DBMSs
SCALE-UP
[Figure: throughput (transactions/second) against number of CPUs and database size grown together. Ideal, linear scale-up keeps throughput constant (1000/sec with 5 CPUs on a 1 GB database); sub-linear scale-up falls away (900/sec with 10 CPUs on a 2 GB database).]

[Diagram: Shared Memory parallel database architecture - several CPUs connected to a single common memory.]

[Diagram: Shared Disk parallel database architecture - each CPU has its own memory (M); all CPUs share the disks.]

[Diagram: Shared Nothing parallel database architecture - each CPU has its own memory (M) and its own disk.]

MAINFRAME DATABASE SYSTEM
[Diagram: dumb terminals connected over a specialised network connection to a mainframe computer, which hosts all three layers - presentation logic, business logic, and data logic.]

DISTRIBUTED DATABASE SYSTEM


A distributed database system is a collection of
logically related databases that co-operate in a
transparent manner.

Transparent implies that each user within the
system may access all of the data within all of the
databases as if they were a single database.

There should be location independence, i.e. as
the user is unaware of where the data is located, it
is possible to move the data from one physical
location to another without affecting the user.
DISTRIBUTED DATABASES
WHAT IS A DISTRIBUTED DATABASE?
DISTRIBUTED DATABASE ARCHITECTURE
[Diagram: four sites - Leytonstone, Stratford, Barking, and Leyton - each with its own LAN, its own clients, and a local DBMS; the sites are interconnected to form one distributed database.]
M:N CLIENT/SERVER DBMS ARCHITECTURE - NOT TRANSPARENT!
[Diagram: clients #1-#3 each connect directly to database servers #1 and #2; every client must know which server holds the data it needs.]

COMPONENTS OF A DDBMS
[Diagram: two sites joined by a computer network. Site 1 runs a DDBMS with a GSC and a DC component on top of an LDBMS and its local database (DB); Site 2 runs a DDBMS with a GSC and DC.]

LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS
Reduced Communication Overhead

Most data access is local, which is less expensive and
performs better.

Improved Processing Power

Instead of one server handling the full database, we now
have a collection of machines handling the same
database.

Removal of Reliance on a Central Site

If a server fails, then the only part of the system that is
affected is the relevant local site. The rest of the system
remains functional and available.

DISTRIBUTED DATABASES
ADVANTAGES
Expandability

It is easier to accommodate increasing the size of the
global (logical) database.

Local autonomy

The database is brought nearer to its users. This can effect
a cultural change, as it allows potentially greater control
over local data.

DISTRIBUTED DATABASES
ADVANTAGES

A distributed system looks exactly like
a non-distributed system to the user!

1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence
DISTRIBUTED DATABASES
DATE'S TWELVE RULES FOR A DDBMS
DISTRIBUTED PROCESSING ARCHITECTURE
[Diagram: sites at Leyton, Stratford, Barking, and Leytonstone, each with its own LAN and clients, but all sharing a single central DBMS - the processing is distributed, the data is not.]

Data Allocation

Data Fragmentation

Distributed Catalogue Management

Distributed Transactions

Distributed Queries (see chapter 20)
DISTRIBUTED DATABASES
ISSUES

1. Locality of reference
Is the data near to the sites that need it?

2. Reliability and availability
Does the strategy improve fault tolerance and accessibility?

3. Performance
Does the strategy result in bottlenecks or under-utilisation of
resources?

4. Storage costs
How does the strategy affect the availability and cost of data storage?

5. Communication costs
How much network traffic will result from the strategy?
DISTRIBUTED DATABASES
DATA ALLOCATION METRICS

CENTRALISED

DISTRIBUTED DATABASES
DATA ALLOCATION STRATEGIES
Locality of Reference: Lowest
Reliability/Availability: Lowest
Storage Costs: Lowest
Performance: Unsatisfactory
Communication Costs: Highest

PARTITIONED/FRAGMENTED

DISTRIBUTED DATABASES
DATA ALLOCATION STRATEGIES
Locality of Reference: High
Reliability/Availability: Low (item), High (system)
Storage Costs: Lowest
Performance: Satisfactory
Communication Costs: Low

COMPLETE REPLICATION

DISTRIBUTED DATABASES
DATA ALLOCATION STRATEGIES
Locality of Reference: Highest
Reliability/Availability: Highest
Storage Costs: Highest
Performance: High
Communication Costs: High (update), Low (read)

SELECTIVE REPLICATION

DISTRIBUTED DATABASES
DATA ALLOCATION STRATEGIES
Locality of Reference: High
Reliability/Availability: Low (item), High (system)
Storage Costs: Average
Performance: Satisfactory
Communication Costs: Low

Usage
Applications are usually interested in views, not whole relations.

Efficiency
It's more efficient if data is close to where it is frequently used.

Parallelism
It is possible to run several sub-queries in tandem.

Security
Data not required by local applications is not stored at the local
site.

DISTRIBUTED DATABASES
WHY FRAGMENT DATA?
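Horizontal fragmentation, which the reasons above motivate, can be sketched as filtering a relation's rows by a site predicate. A toy illustration (the schema and account data are invented for the example; the branch names echo the sites in the diagrams):

```python
# Each row of an Accounts relation carries the branch that owns it.
accounts = [
    {"acc_no": 1, "branch": "Stratford", "balance": 500},
    {"acc_no": 2, "branch": "Barking",   "balance": 250},
    {"acc_no": 3, "branch": "Stratford", "balance": 900},
]

def horizontal_fragment(relation, branch):
    """Rows relevant to one site; the union of all fragments
    reconstructs the original relation."""
    return [row for row in relation if row["branch"] == branch]

stratford = horizontal_fragment(accounts, "Stratford")
print(len(stratford))  # 2 -- stored at the Stratford site, near its users
```

Each fragment is stored at the site whose applications use it most, which is exactly the locality-of-reference benefit listed above.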

CLIENT/SERVER DATABASE SYSTEM

CLIENT/SERVER DBMS



Manages user interface

Accepts user data

Processes application/business logic

Generates database requests (SQL)

Transmits database requests to server

Receives results from server

Formats results according to application logic

Presents results to the user
CLIENT PROCESS
CLIENT/SERVER DBMS



Accepts database requests

Processes database requests

Performs integrity checks

Handles concurrent access

Optimises queries

Performs security checks

Enacts recovery routines

Transmits result of database request to client
SERVER PROCESS


CLIENT/SERVER DBMS ARCHITECTURE (FAT CLIENT)
[Diagram: clients #1-#3 each hold the presentation logic and business logic; they send data requests to the database servers, which hold the data logic and return data responses.]


CLIENT/SERVER DBMS ARCHITECTURE (THIN CLIENT)
[Diagram: clients #1-#3 hold only the presentation logic; the business logic and data logic sit with the database server, which receives data requests and returns data responses.]
Middleware Systems Overview
and Introduction
Middleware Systems
Middleware systems comprise abstractions and
services that facilitate the design, development, integration,
and deployment of distributed applications in heterogeneous
networking environments.
remote communication mechanisms (Web
services, CORBA, Java RMI, DCOM - i.e. request
brokers)
event notification and messaging services (COSS
Notifications, Java Messaging Service etc.)
transaction services
naming services (COSS Naming, LDAP)
Definition by Example
The following constitute middleware systems
or middleware platforms
CORBA, DCE, RMI, J2EE (?), Web Services, DCOM,
COM+, .Net Remoting, application servers,
some of these are collections and aggregations of
many different services
some are marketing terms
What & Where is Middleware?
[Diagram: middleware systems sit at the intersection of distributed systems, programming languages, databases, operating systems, and networking.]
middleware is dispersed among many disciplines
What & Where is Middleware?
[Diagram: the same disciplines annotated with their research venues - distributed systems (ACM PODC, ICDE), middleware (ACM/IFIP/IEEE Middleware Conference, DEBS, DOA, EDOC), databases (SIGMOD, VLDB, ICDE), operating systems (SIGOPS), networking (SIGCOMM, INFOCOM) - plus mobile computing, software engineering, ...]
Middleware Research
dispersed among different fields
with different research methodologies
different standards, points of views, and approaches
a Middleware research community is starting to crystallize around
conferences such as Middleware, DEBS, DOA, EDOC et al.
Many other conferences have middleware tracks
many existing fields/communities are broadening their scope
middleware is still somewhat of a trendy or marketing term, but I
think it is crystallizing into a separate field - middleware systems.
in the long term we are trying to identify concepts and build a body
of knowledge that identifies middleware systems - much like OS - PL
- DS ...
Middleware Systems I
In a nutshell:
Middleware is about supporting the development
of distributed applications in networked
environments
This also includes the integration of systems
About making this task easier, more efficient,
less error prone
About enabling the infrastructure software for
this task
Middleware Systems II
software technologies to help manage complexity and
heterogeneity inherent to the development of distributed
systems, distributed applications, and information systems
layer of software above the operating system and the
network substrate, but below the application
Higher-level programming abstraction for developing the
distributed application
higher than lower level abstractions, such as sockets
provided by the operating system
a socket is a communication end-point from which data can be read or
onto which data can be written
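To make the gap concrete: at the socket level the programmer ships raw bytes between endpoints and must invent naming, framing, and data representation, everything middleware would otherwise provide. A minimal loopback sketch using Python's standard socket module (the request/reply strings are invented for the illustration):

```python
import socket

# A pair of connected endpoints on the local machine.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()

client.sendall(b"balance?42")          # the "protocol" is just bytes;
request = conn.recv(1024)              # marshalling, naming, retries are
conn.sendall(b"OK:500")                # all left to the application
reply = client.recv(1024)
print(reply.decode())                  # OK:500

for s in (client, conn, server):
    s.close()
```

A request broker such as CORBA or Java RMI hides all of this behind what looks like an ordinary method call.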
Middleware Systems III
aims at reducing the burden of developing distributed
applications for the developer
informally called plumbing, i.e., like pipes that connect
entities for communication
often called glue code, i.e., it glues independent systems
together and makes them work together
it masks the heterogeneity programmers of distributed
applications have to deal with
network & hardware
operating system & programming language
different middleware platforms
location, access, failure, concurrency, mobility, ...
often also referred to as transparencies, i.e., network
transparency, location transparency
Middleware Systems IV
an operating system is the software that makes the
hardware usable
similarly, a middleware system makes the distributed system
programmable and manageable
bare computer without OS could be programmed, so could
the distributed application be developed without middleware
programs could be written in assembly, but higher-level
languages are far more productive for this purpose
however, sometimes the assembly-variant is chosen - WHY?
The Questions
What are the right programming abstractions for middleware
systems?
What protocols do these abstractions require to work as
promised?
What, if any, of the underlying systems (networks, hardware,
distribution) should be exposed to the application developer?
Views range from
full distribution transparency to
full control and visibility of underlying system to
fewer hybrid approaches achieving both
With each having vast implications on the programming abstractions
offered
Middleware Metaphorically
[Diagram: on each host the stack is, top to bottom, distributed application, middleware, operating system, network; the middleware layers on Host 1 and Host 2 talk to each other across the network.]
Categories of Middleware
remote invocation mechanisms
e.g., DCOM, CORBA, DCE, Sun RPC, Java RMI, Web Services ...
naming and directory services
e.g., JNDI, LDAP, COSS Naming, DNS, COSS trader, ...
message oriented middleware
e.g., JMS, MQSI, MQSeries, ...
publish/subscribe systems
e.g., JMS, various proprietary systems, COSS Notification
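The decoupling that publish/subscribe adds can be sketched with a toy in-process broker (the class and topic names are invented; real systems such as JMS add durability, filtering, and distribution):

```python
class Broker:
    """Toy publish/subscribe: publishers and subscribers never
    see each other, only the topic name."""
    def __init__(self):
        self.subs = {}                         # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subs.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for cb in self.subs.get(topic, []):    # fan out to all subscribers
            cb(message)

received = []
b = Broker()
b.subscribe("stock/IBM", received.append)
b.subscribe("stock/IBM", lambda m: received.append(m.upper()))
b.publish("stock/IBM", "price=120")            # sender knows no receivers
print(received)   # ['price=120', 'PRICE=120']
```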
Categories II
(distributed) tuple spaces
(databases) - I do not consider a DBMS a middleware system
Linda, initially an abstraction for developing parallel programs,
inspired InfoSpaces, later JavaSpaces, later JINI
transaction processing systems (TP monitors)
implement transactional applications, e.g., ATM transactions
adapters, wrappers, mediators
Categories III
choreography and orchestration
Workflow and business process tools (BPEL et al.)
a.k.a. Web service composition
fault tolerance, load balancing, etc.

real-time, embedded, high-performance,
safety critical
Middleware Curriculum
A middleware curriculum needs to capture the invariants
defining the above categories and presenting them
A middleware curriculum needs to capture the essence and
the lessons learned from specifying and building these types
of systems over and over again
We have witnessed the re-invention of many of these
abstractions without any functional changes over the past 25
years (see later in the course.)
Due to lack of time and the invited guest lectures, we will only
look at a few of these categories
Concurrency Control
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data
item
Data items can be locked in two modes :
1. exclusive (X) mode. Data item can be both read as well as
written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is
requested using lock-S instruction.
Lock requests are made to the concurrency-control manager.
A transaction can proceed only after its request is granted.
Lock-Based Protocols (Cont.)
Lock-compatibility matrix (true = the requested lock, row, is
compatible with a lock already held, column):

          S       X
    S   true    false
    X   false   false

A transaction may be granted a lock on an item if the requested lock is
compatible with locks already held on the item by other transactions.
Any number of transactions can hold shared locks on an item,
but if any transaction holds an exclusive lock on the item, no other
transaction may hold any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait
till all incompatible locks held by other transactions have been
released. The lock is then granted.
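The compatibility rule fits in a few lines. A sketch of the test a concurrency-control manager might apply (the table and helper names are illustrative, not from a specific DBMS):

```python
# True = the requested lock (first mode) is compatible with a held lock.
COMPATIBLE = {
    ("S", "S"): True,  ("S", "X"): False,
    ("X", "S"): False, ("X", "X"): False,
}

def can_grant(requested, held_by_others):
    """Grant only if the requested mode is compatible with every lock
    held by OTHER transactions on the same item."""
    return all(COMPATIBLE[(requested, h)] for h in held_by_others)

print(can_grant("S", ["S", "S"]))  # True  -- any number of shared locks
print(can_grant("X", ["S"]))       # False -- exclusive conflicts with shared
print(can_grant("S", ["X"]))       # False -- nothing coexists with exclusive
```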
Lock-Based Protocols (Cont.)
Example of a transaction performing locking:

T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)

Locking as above is not sufficient to guarantee serializability: if A and B
get updated in between the read of A and the read of B, the displayed sum
would be wrong.
A locking protocol is a set of rules followed by all transactions while
requesting and releasing locks. Locking protocols restrict the set of
possible schedules.
Pitfalls of Lock-Based Protocols
Consider the partial schedule:

T3: lock-X(B); read(B); write(B)
T4: lock-S(A); read(A)
T4: lock-S(B)    (blocked)
T3: lock-X(A)    (blocked)

Neither T3 nor T4 can make progress: executing lock-S(B) causes T4
to wait for T3 to release its lock on B, while executing lock-X(A) causes
T3 to wait for T4 to release its lock on A.
Such a situation is called a deadlock.
To handle a deadlock, one of T3 or T4 must be rolled back
and its locks released.
Pitfalls of Lock-Based Protocols (Cont.)
The potential for deadlock exists in most locking
protocols. Deadlocks are a necessary evil.
Starvation is also possible if the concurrency-control
manager is badly designed. For example:
A transaction may be waiting for an X-lock on an item,
while a sequence of other transactions request and
are granted an S-lock on the same item.
The same transaction is repeatedly rolled back due to
deadlocks.
Concurrency control manager can be designed to prevent
starvation.
The Two-Phase Locking Protocol
This is a protocol which ensures conflict-serializable
schedules.
Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks
Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks
The protocol assures serializability. It can be proved that
the transactions can be serialized in the order of their
lock points (i.e. the point where a transaction acquired
its final lock).
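The growing/shrinking discipline can be enforced by the transaction itself: once any lock has been released, further acquisitions are refused. A minimal sketch (the class and method names are invented for the illustration):

```python
class TwoPhaseTransaction:
    """Tracks the two phases; raises if a lock is requested after the
    first release, which would violate two-phase locking."""
    def __init__(self):
        self.held = set()
        self.shrinking = False   # flips at the lock point

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot lock after an unlock")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True    # entering the shrinking phase
        self.held.discard(item)

t = TwoPhaseTransaction()
t.lock("A"); t.lock("B")     # growing phase
t.unlock("A")                # lock point passed; shrinking phase begins
try:
    t.lock("C")              # illegal under 2PL
except RuntimeError as e:
    print(e)
```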
The Two-Phase Locking Protocol (Cont.)
Two-phase locking does not ensure freedom from deadlocks
Cascading roll-back is possible under two-phase locking. To
avoid this, follow a modified protocol called strict two-phase
locking. Here a transaction must hold all its exclusive locks till
it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are
held till commit/abort. In this protocol transactions can be
serialized in the order in which they commit.
The Two-Phase Locking Protocol (Cont.)
There can be conflict serializable schedules that
cannot be obtained if two-phase locking is used.
However, in the absence of extra information
(e.g., ordering of access to data), two-phase
locking is needed for conflict serializability in the
following sense:
Given a transaction Ti that does not follow two-phase
locking, we can find a transaction Tj that uses
two-phase locking, and a schedule for Ti and Tj
that is not conflict serializable.
Lock Conversions
Two-phase locking with lock conversions:
First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the
programmer to insert the various locking instructions.
Automatic Acquisition of Locks
A transaction Ti issues the standard read/write instructions,
without explicit locking calls.
The operation read(D) is processed as:

    if Ti has a lock on D
        then read(D)
    else begin
        if necessary, wait until no other
            transaction has a lock-X on D;
        grant Ti a lock-S on D;
        read(D)
    end
Automatic Acquisition of Locks (Cont.)
write(D) is processed as:

    if Ti has a lock-X on D
        then write(D)
    else begin
        if necessary, wait until no other transaction has any lock on D;
        if Ti has a lock-S on D
            then upgrade lock on D to lock-X
            else grant Ti a lock-X on D;
        write(D)
    end

All locks are released after commit or abort
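The two procedures can be folded into a small wrapper that acquires locks implicitly, including the S-to-X upgrade on write. A single-threaded sketch (it only records lock modes; the waiting and deadlock handling of the real protocol are omitted):

```python
class AutoLocking:
    """Implicit locking: read takes lock-S, write takes or upgrades to
    lock-X. Locks are held until commit (strict behaviour)."""
    def __init__(self, data):
        self.data = data
        self.locks = {}                    # item -> "S" or "X"

    def read(self, item):
        self.locks.setdefault(item, "S")   # lock-S unless already locked
        return self.data[item]

    def write(self, item, value):
        self.locks[item] = "X"             # grant lock-X, or upgrade S -> X
        self.data[item] = value

    def commit(self):
        self.locks.clear()                 # release everything at the end

t = AutoLocking({"A": 100})
t.read("A")                  # acquires lock-S on A
t.write("A", 150)            # upgrades to lock-X
print(t.locks["A"])          # X
t.commit()
```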
Implementation of Locking
A lock manager can be implemented as a separate
process to which transactions send lock and unlock
requests
The lock manager replies to a lock request by sending a
lock-grant message (or a message asking the transaction
to roll back, in case of a deadlock)
The requesting transaction waits until its request is
answered
The lock manager maintains a data-structure called a lock
table to record granted locks and pending requests
The lock table is usually implemented as an in-memory
hash table indexed on the name of the data item being
locked
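Such a lock table can be sketched as a hash table mapping each item to its granted locks plus a FIFO queue of waiting requests (simplified: single process, no deadlock detection, and no wake-up of waiters on release):

```python
from collections import deque

class LockTable:
    """item -> (granted: {txn: mode}, waiting: deque of (txn, mode))."""
    def __init__(self):
        self.table = {}

    def request(self, txn, item, mode):
        granted, waiting = self.table.setdefault(item, ({}, deque()))
        others = [m for t, m in granted.items() if t != txn]
        # S is compatible only with S; X is compatible with nothing.
        compatible = not others or (mode == "S" and all(m == "S" for m in others))
        if not waiting and compatible:
            granted[txn] = mode
            return "granted"
        waiting.append((txn, mode))        # incompatible or behind others: queue
        return "waiting"

lt = LockTable()
print(lt.request("T1", "A", "S"))  # granted
print(lt.request("T2", "A", "S"))  # granted -- shared locks coexist
print(lt.request("T3", "A", "X"))  # waiting -- X conflicts with held S locks
```

Queueing new requests behind existing waiters, as above, is one way to avoid the starvation scenario described earlier.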
