You are on page 1of 58

Distributed Database Systems

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 1
Outline
 Introduction
 Distributed and Parallel Database Design
 Distributed Data Control
 Distributed Query Processing
 Distributed Transaction Processing
 Data Replication
 Database Integration – Multidatabase Systems
 Parallel Database Systems

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 2
Outline
 Introduction
 What is a distributed DBMS
 History
 Distributed DBMS promises/benefits
 Design issues/challenges/problems
 Distributed DBMS architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 3
Distributed Computing

 A number of autonomous processing elements (not


necessarily homogeneous) that are interconnected by a
computer network and that cooperate in performing their
assigned tasks.
 What is being distributed?
 Processing logic
 Function
 Data
 Control

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 4
Current Distribution – Geographically
Distributed Data Centers

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 5
What is a Distributed Database System?

A distributed database is a collection of multiple, logically


interrelated databases distributed over a computer network

A distributed database management system (Distributed


DBMS) is the software that manages the DDB and provides
an access mechanism that makes this distribution
transparent to the users

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 6
What is not a DDBS?

 A timesharing computer system


 A loosely or tightly coupled multiprocessor system
 A database system which resides at one of the nodes of
a network of computers - this is a centralized database
on a network node

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 7
Distributed DBMS Environment

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 8
Implicit Assumptions

 Data stored at a number of sites → each site logically


consists of a single processor
 Processors at different sites are interconnected by a
computer network → not a multiprocessor system
 Parallel database systems
 Distributed database is a database, not a collection of
files → data logically related as exhibited in the users’
access patterns
 Relational data model
 Distributed DBMS is a full-fledged DBMS
 Not remote file system, not a TP system

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 9
Important Point

Logically integrated
but
Physically distributed

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 10
Outline
 Introduction
 What is a distributed DBMS
 History
 Distributed DBMS promises/benefits
 Design issues/challenges/problems
 Distributed DBMS architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 11
History – File Systems

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 12
History – Database Management

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 13
History – Early Distribution
Peer-to-Peer (P2P)

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 14
History – Client/Server

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 15
History – Data Integration

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 16
History – Cloud Computing

On-demand, reliable services provided over the Internet in


a cost-efficient manner
 Cost savings: no need to maintain dedicated compute
power
 Elasticity: better adaptivity to changing workload

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 17
Data Delivery Alternatives

 Delivery modes
 Pull-only
 Push-only
 Hybrid
 Frequency
 Periodic
 Conditional
 Ad-hoc or irregular
 Communication Methods
 Unicast
 One-to-many
 Note: not all combinations make sense

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 18
Outline
 Introduction
 What is a distributed DBMS
 History
 Distributed DBMS promises/benefits
 Design issues/challenges/problems
 Distributed DBMS architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 19
Distributed DBMS promises/benefits

 Transparent management of distributed, fragmented, and


replicated data

 Improved reliability/availability through distributed


transactions

 Improved performance

 Easier and more economical system expansion

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez
Transparency

 Transparency is the separation of the higher-level


semantics of a system from the lower level
implementation issues.
 Fundamental issue is to provide data independence
in the distributed environment
 Network (distribution) transparency
 Replication transparency
 Fragmentation transparency
 horizontal fragmentation: selection
 vertical fragmentation: projection
 hybrid

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez
Example

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 22
Transparent Access

Tokyo

SELECT ENAME,SAL
Boston Paris
FROM EMP,ASG,PAY
WHERE DUR > 12 Paris projects
Paris employees
AND EMP.ENO = ASG.ENO Communication Paris assignments
Network Boston employees
AND PAY.TITLE =
EMP.TITLE Boston projects
Boston employees
Boston assignments
Montreal
New
Montreal projects
York Paris projects
Boston projects New York projects
New York employees with budget > 200000
New York projects Montreal employees
New York assignments Montreal assignments

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 23
Distributed Database - User View

Distributed Database

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 24
Distributed DBMS - Reality
User
Query

User
DBMS
Application
Software
DBMS
Software

DBMS Communication
Software Subsystem

User
DBMS User Application
Software Query
DBMS
Software

User
Query

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 25
Types of Transparency

 Data independence
 Network transparency (or distribution transparency)
 Location transparency
 Fragmentation transparency
 Fragmentation transparency
 Replication transparency

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 26
Reliability Through Transactions

 Replicated components and data should make distributed


DBMS more reliable.
 Distributed transactions provide
 Concurrency transparency
 Failure atomicity

• Distributed transaction support requires implementation of


 Distributed concurrency control protocols
 Commit protocols

 Data replication
 Great for read-intensive workloads, problematic for updates
 Replication protocols

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 27
Potentially Improved Performance

 Proximity of data to its points of use


 Requires some support for fragmentation and replication

 Parallelism in execution
 Inter-query parallelism

 Intra-query parallelism

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 28
Scalability

 Issue is database scaling and workload scaling

 Adding processing and storage power

 Scale-out: add more servers


 Scale-up: increase the capacity of one server → has limits

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 29
Geographic Distribution

 Distributed databases support geographic distribution of


data, allowing organizations to store data closer to users
or locations where it is needed.

 This reduces latency and improves responsiveness for


users accessing data from different geographic regions.
Issue is database scaling and workload scaling

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 30
Flexibility and Adaptability

 Distributed databases offer greater flexibility and


adaptability to changing business requirements and
environments.

 Organizations can easily add or remove nodes, adjust


data distribution strategies, and scale resources as
needed to meet evolving demands.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 31
Data Security and Privacy

 Distributed databases incorporate robust security


mechanisms to protect data confidentiality, integrity, and
availability.

 Encryption, access controls, authentication, and auditing


features help ensure compliance with data privacy
regulations and mitigate security risks.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 32
Cost efficiency

 Distributed databases can offer cost efficiencies


compared to centralized databases, particularly in terms
of hardware and infrastructure costs.

 By leveraging commodity hardware and cloud computing


resources, organizations can achieve cost-effective
scalability and resource utilization.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 33
Resilience to Network Failures

 Distributed databases are designed to withstand network


failures and disruptions.

 Replication, data synchronization, and distributed


consensus algorithms ensure data consistency and
integrity even in the face of network partitions or
communication failures.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 34
Outline
 Introduction
 What is a distributed DBMS
 History
 Distributed DBMS promises/benefits
 Design issues/challenges/problems
 Distributed DBMS architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 35
Distributed DBMS
Issues/challenges/problems
 Distributed database design
 How to distribute the database
 Replicated & non-replicated database distribution
 Distributed query processing
 Convert user transactions to data manipulation instructions
 Optimization problem
 min{cost = data transmission + local processing}
 General formulation is NP-hard

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 36
Distributed DBMS
Issues/challenges/problems
 Distributed concurrency control
 Synchronization of concurrent accesses
 Consistency and isolation of transactions' effects
 Deadlock management
 Reliability
 How to make the system resilient to failures: consider when a
system in inoperable or inaccessible due to network failure for
instance
 How to ensure database is up to date after recover?
 Consistency, atomicity and durability

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 37
Distributed DBMS
Issues/challenges/problems
 Distributed directory management
 A directory may be global to the entire DDBS or local to each
site; it can be centralized at one site or distributed over several
sites; there can be a single copy or multiple copies
 Contains information (such as descriptions and locations) about
data items in database
 How to distribute the database
 Replicated & non-replicated database distribution

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 38
Distributed DBMS
Issues/challenges/problems
 Distributed deadlock management
 Consider the competition among users for access to a set of
resources (data, in this case); this can result in a deadlock if the
synchronization mechanism is based on locking.
 How to handle deadlock in DDBSs: prevention, avoidance, and
detection/recovery.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 39
Distributed DBMS
Issues/challenges/problems
 Replication
 Mutual consistency
 Freshness of copies
 Eager vs lazy
 Centralized vs distributed
 Parallel DBMS
 Objectives: high scalability and performance
 Not geo-distributed
 Cluster computing

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 40
Related Issues

 Alternative distribution approaches


 Modern P2P
 World Wide Web (WWW or Web)
 Big data processing
 4V: volume, variety, velocity, veracity
 MapReduce & Spark
 Stream data
 Graph analytics
 NoSQL
 NewSQL
 Polystores

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 41
Outline
 Introduction
 What is a distributed DBMS
 History
 Distributed DBMS promises/benefits
 Design issues/challenges/problems
 Distributed DBMS architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 42
ANSI/SPARC Architecture
 The ANSI/SPARC
architecture has three
views of data:
 the external view,
which is that of the
end user, who might
be a programmer;
 the internal view, that
of the system or
machine: physical
definition and
organization of data.
 the conceptual view,
that of the enterprise

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 43
ANSI/SPARC Architecture
 The ANSI/SPARC
architecture has three
views of data:
 the external view,
which is that of the
end user, who might
be a programmer;
 the internal view, that
of the system or
machine: physical
definition and
organization of data.
 the conceptual view,
that of the enterprise

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 44
DBMS Implementation Alternatives
Architectural Models for Distributed DBMSs

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 45
DBMS Implementation Alternatives
Architectural Models for Distributed DBMSs: Dimensions of the Problem
 Distribution
 Whether the components of the system are located on the same machine or not
 Heterogeneity
 Various levels (hardware, communications, operating system)
 DBMS important one
 data model, query language, transaction management algorithms
 Autonomy
 Refers to distribution of control
 Various versions
 Design autonomy: Ability of a component DBMS to decide on issues related to its own
design.
 Communication autonomy: Ability of a component DBMS to decide whether and how
to communicate with other DBMSs.
 Execution autonomy: Ability of a component DBMS to execute local operations in any
manner it wants to.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 46
Client/Server Architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 47
Advantages of Client-Server
Architectures
 More efficient division of labor
 Horizontal and vertical scaling of resources
 Better price/performance on client machines
 Ability to use familiar tools on client machines
 Client access to remote data (via standards)
 Full DBMS functionality provided to client workstations
 Overall better system price/performance

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 48
Database Server

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 49
Distributed Database Servers

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 50
Peer-to-Peer Component Architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 51
MDBS Components & Execution
A multidatabase system, also known as a federated database system,
is a type of database management system (DBMS) that integrates
multiple autonomous databases into a single logical database.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 52
Mediator/Wrapper Architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 53
Cloud Computing

On-demand, reliable services provided over the Internet in


a cost-efficient manner
 IaaS – Infrastructure-as-a-Service

 PaaS – Platform-as-a-Service

 SaaS – Software-as-a-Service

 DaaS – Database-as-a-Service

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 54
Simplified Cloud Architecture

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 55
Distributed computing Individual exercise
a. Define the concept of distributed database management
systems (DDBMS) and explain its significance in
modern information systems.
b. Discuss the key differences between a centralized
DBMS and a distributed DBMS.
c. a. Define a distributed database management system
(DDBMS) and explain how it differs from a traditional
centralized database management system.
d. Discuss the benefits and challenges associated with
implementing a distributed DBMS in organizations.
e. Identify and discuss the key design issues and
challenges involved in building a distributed database
management system.
Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 56
Distributed computing Individual exercise
f. Using books and the Internet resources, explain how
design issues such as data distribution, concurrency
control, data replication, query optimization, and
security are addressed in distributed DBMS
architectures.
g. Describe the architecture of a distributed database
management system, including its components, layers,
and functionalities. Compare it with the multidatabase
systems integration
h. Discuss the role of distributed data storage, query
processing, concurrency control mechanisms,
transaction management, and data replication in
distributed DBMS architectures.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 57
Distributed computing Individual exercise
i. Using books and the Internet resources, expound on the
following characteristics of the distributed database
systems. NB: characteristics define the inherent
qualities or attributes of distributed database systems

data distribution, autonomy, heterogeneity,


transparency, concurrency control, data replication,
scalability, fault tolerance, consistency and consensus,
and security.

Lecture slides as adapted and customized from © 2020, M.T. Özsu & P. Valduriez 58

You might also like