
Reference Book

Principles of Distributed Database Systems

Chapters
Chapter 12: Distributed DBMS Reliability
Chapter 14: Distributed Object Database
Management Systems
Chapter 16: Current Issues
Preethi Vishwanath

Week 2: 12th September 2006 – 24th September 2006


Reliability concepts - definitions
System refers to a mechanism that consists of a collection of
components and interacts with its environment with a recognizable
pattern of behavior.
Each component of a system is itself a system, commonly called a
subsystem.
The way components of a system are put together is called the design
of the system.
An external state of a system can be defined as the response that a
system gives to an external stimulus.
The behavior of the system in providing response to all the possible
stimuli from the environment needs to be laid out in an authoritative
specification of its behavior.
Any deviation of a system from the behavior described in the
specification is considered a failure.
Some internal states could lead to a system failure; such internal states are
called erroneous states.
Any error in the internal states of the components of a system or in the
design of a system is called a fault in the system.
A permanent fault, also called a hard fault, is one that reflects an
irreversible change in the behavior of the system.
Reliability
– Reliability refers to the probability that the system under consideration does not experience any failures in a given time interval.
– R(t) = Pr{0 failures in time [0,t] | no failures at t = 0}
where R(t) is the reliability of the system.
Mean Time Between Failures (MTBF)
– The expected time between subsequent failures in a system with repair.
– Can be calculated either from empirical data or from the reliability function.
– Is related to the failure rate.
– MTBF = ∫₀^∞ R(t) dt
Availability
– Refers to the probability that the system is operational according to its specification at a given point in time t.
– A = µ/(λ + µ)
where λ is the failure rate and µ is the repair rate.
Mean Time To Repair (MTTR)
– The expected time to repair a failed system.
– Is related to the repair rate.
The steady-state availability of a system with exponential failure and repair rates can be specified as
A = MTBF/(MTBF + MTTR)
(A small numerical sketch follows below.)
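As a numerical illustration of these formulas (my own sketch, not part of the original slides; the failure and repair rates are invented), the following Python fragment computes MTBF for an exponential reliability function and checks that the two expressions for steady-state availability agree:

# Hypothetical rates, per hour (assumed values for illustration only).
failure_rate = 1.0 / 1000.0   # lambda: on average one failure every 1000 hours
repair_rate = 1.0 / 2.0       # mu: on average a repair takes 2 hours

# With an exponential failure law, R(t) = exp(-lambda * t), so
# MTBF = integral from 0 to infinity of R(t) dt = 1 / lambda.
mtbf = 1.0 / failure_rate
mttr = 1.0 / repair_rate

a_from_rates = repair_rate / (failure_rate + repair_rate)   # A = mu / (lambda + mu)
a_from_times = mtbf / (mtbf + mttr)                         # A = MTBF / (MTBF + MTTR)

print(f"MTBF = {mtbf:.0f} h, MTTR = {mttr:.0f} h")
print(f"A = {a_from_rates:.6f} (from rates) = {a_from_times:.6f} (from MTBF/MTTR)")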
Reasons for Failure
[Figure: breakdown of failure causes from a study conducted at the Stanford Linear Accelerator (categories: unknown, environment, hardware, operations, software) and from Tandem data (categories: environment, hardware, software, maintenance, operations).]
Fault tolerance
– Refers to a system design approach which recognizes that faults will occur
Fault prevention/Fault intolerance
– Aim at ensuring that the implemented system will not contain any faults
– Two aspects
Fault avoidance
– Refers to the techniques used to make sure that faults are not introduced into the system
– Involves detailed design methodologies such as design walkthroughs, design inspections, etc.
Fault removal
– Refers to the techniques that are employed to detect any faults that might have remained in the system despite the
application of fault avoidance, and to remove these faults.
Fault detection
– Issue a warning when a failure occurs but do not provide any means of tolerating the failure.
Latent Failure
– One that is detected some time after its occurrence
Mean time to detect
– Average error latency time over a number of identical systems.
Fail-stop modules
– Constantly monitors itself and when it detects a fault, shuts itself down automatically
Fail-fast
– Implemented in software by defensive programming, where each software module checks
its own state during state transitions (a small sketch follows below).
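To make the fail-stop/fail-fast idea concrete, here is a minimal Python sketch (my own illustration; the module and its invariant are invented) of defensive programming: the module checks its own state after every transition and halts itself as soon as it detects an erroneous state.

class FailFastModule:
    """Toy module that validates its own state after every transition."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.stopped = False

    def _check_invariants(self):
        # Defensive programming: on any violation, stop immediately
        # instead of continuing to run in an erroneous state.
        if not (0 <= self.used <= self.capacity):
            self.stopped = True
            raise RuntimeError("invariant violated: module halted (fail-stop)")

    def allocate(self, n):
        if self.stopped:
            raise RuntimeError("module is stopped")
        self.used += n
        self._check_invariants()   # self-check on every state transition

m = FailFastModule(capacity=10)
m.allocate(4)        # fine
try:
    m.allocate(20)   # drives the module into an erroneous state -> it halts
except RuntimeError as err:
    print(err)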

Different ways of implementing process pairs


– Lock-step
– Automatic check pointing
– State check pointing
– Data check pointing
– Persistent process pairs
Failure in Distributed DBMS
Site (System) Failures
– Always assumed to result in the loss of main memory contents.
– Total failure refers to the simultaneous failure of all sites in the distributed system.
– Partial failure indicates the failure of only some sites while the others remain operational.
Media Failures
– Refers to the failures of the secondary storage devices that store the database.
– Duplexing of disk storage and maintaining archival copies of the database are common techniques that deal with this sort of catastrophic problem.
Transaction Failures
– Incorrect input data
– Detection of a present or potential deadlock
– The usual approach to take in cases of transaction failure is to abort the transaction.
Communication Failures
– Unique to the distributed case.
– The most common ones are errors in the messages, improperly ordered messages, lost messages, and line failures.
– The term for the failure of the communication network to deliver messages and their confirmations within a specified period is performance failure.
Interface between the local recovery manager and the
buffer manager

[Figure: the Local Recovery Manager interacts with the Database Buffer Manager, which manages the database buffers (the volatile database) in main memory and the stable database on secondary storage.]
Recovery Information
In-Place Update Recovery Information
– It is necessary to store information about database state changes in order to be able to recover.
– This information is recorded in the database log.
– REDO action
The log needs to include sufficient data to permit the redo, i.e., taking the old database state and recovering the new state.
– UNDO action
The log needs to include sufficient data to permit the undo, i.e., taking the new database state and recovering the old state.
Out-of-Place Update Recovery Information
– Typical techniques
Shadowing
Every time an update is made, the old stable storage page, called the shadow page, is left intact and a new page with the updated data item values is written into the stable database.
Differential files
Network Partitioning
– Simple partitioning
The network is divided into only two components.
– Multiple partitioning
The network is divided into more than two components.
Centralized Protocols
– Primary site
It makes sense to permit the operation of the partition that contains the primary site, since it manages the locks.
– Primary copy
More than one partition may be operational for different queries.
Voting-Based Protocols
– Transactions are executed if a majority of the sites vote to execute them.
– Quorum-based voting can be used as a replica control method, as well as a commit method to ensure transaction atomicity in the presence of network partitioning (the quorum conditions are sketched below).
– In the case of non-replicated databases, this involves the integration of the voting principle with commit protocols.
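The quorum rules behind voting-based protocols fit in a few lines. The following Python sketch is my own illustration (the vote assignment is invented): it checks the standard conditions that make read and write quorums overlap, so that no two partitions can both perform conflicting operations.

def quorums_are_valid(total_votes, read_quorum, write_quorum):
    """Standard conditions for quorum-based voting:
    a read quorum and a write quorum must overlap (Vr + Vw > V),
    and two write quorums must overlap (2 * Vw > V)."""
    return (read_quorum + write_quorum > total_votes) and (2 * write_quorum > total_votes)

def partition_can_operate(votes_in_partition, quorum):
    # A partition may execute an operation only if it can collect the quorum.
    return votes_in_partition >= quorum

V, Vr, Vw = 5, 3, 3                        # hypothetical vote assignment
print(quorums_are_valid(V, Vr, Vw))        # True
print(partition_can_operate(3, Vw))        # True: the majority partition may write
print(partition_can_operate(2, Vw))        # False: the minority partition is blocked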
2 Phase Commit Protocol
The two phase commit protocol is a distributed algorithm which lets all
sites in a distributed system agree to commit a transaction.
The protocol results in either all nodes committing the transaction or
aborting, even in the case of site failures and message losses.
Basic Algorithm

– Commit-request phase

1. The coordinator sends a query to commit message to all cohorts.


2. The cohorts execute the transaction up to the point where they will be
asked to commit. They each write an entry to their undo log and an entry
to their redo log.
3. Each cohort replies with an agreement message if the transaction
succeeded, or an abort message if the transaction failed.
4. The coordinator waits until it has a message from each cohort
– Commit phase
– Success
If the coordinator received an agreement message from all cohorts
during the commit-request phase:
1. The coordinator writes a commit record into its log.
2. The coordinator sends a commit message to all the cohorts.
3. Each cohort completes the operation, and releases all the locks and
resources held during the transaction.
4. Each cohort sends an acknowledgement to the coordinator.
5. The coordinator completes the transaction when acknowledgements
have been received.
– Failure
If any cohort sent an abort message during the commit-request phase:
1. The coordinator sends a rollback message to all the cohorts.
2. Each cohort undoes the transaction using the undo log, and
releases the resources and locks held during the transaction.
3. Each cohort sends an acknowledgement to the coordinator.
4. The coordinator completes the transaction when
acknowledgements have been received.
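The following is a minimal, single-process Python sketch of the protocol described above. It is an illustration only: the Cohort class, its prepare/commit/abort methods, and the message passing via direct calls are simplifications I am assuming, not an actual DBMS interface; real implementations also log the decision and handle timeouts and recovery.

class Cohort:
    """Toy participant: votes in phase 1 and applies the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def prepare(self):
        # Commit-request phase: write undo/redo entries, then vote.
        self.log.append("prepared")
        return self.can_commit

    def commit(self):
        self.log.append("committed")

    def abort(self):
        self.log.append("aborted")   # undo using the undo log

def two_phase_commit(cohorts):
    # Phase 1: the coordinator collects a vote from every cohort.
    votes = [c.prepare() for c in cohorts]
    # Phase 2: commit only if all cohorts agreed, otherwise roll back.
    decision = "commit" if all(votes) else "abort"
    for c in cohorts:
        if decision == "commit":
            c.commit()
        else:
            c.abort()
    return decision

print(two_phase_commit([Cohort("site1"), Cohort("site2", can_commit=False)]))  # -> abort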
3 Phase Commit
Nonblocking when failures are restricted to site failures.
A commit protocol that is synchronous within one state
transition is nonblocking if and only if its state transition
diagram contains neither of the following:
– a state that is "adjacent" to both a commit and an abort state;
– a noncommittal state that is "adjacent" to a commit state.
Replication and Replica Control Protocols

Having replicas of data items improves system availability.

Advantages
– With careful design, it is possible to ensure that single points of
failure are eliminated
– Overall system availability is maintained even when one or more
sites fail.

Disadvantages
– Whenever updates are introduced, the complexity of keeping the
replicas consistent arises; this is the topic of replication
protocols.
Concepts
Object
– Represents a real entity in the system.
– Represented as a pair (object identity, state).
– Enables referential object sharing.
State
– Either an atomic value or a constructed value.
Value
– An element of D is a value, called an atomic value.
– [a1:v1, …, an:vn], in which ai is an element of A and vi is either a value or an element of I, is called a tuple value.
– {v1, …, vn}, in which vi is either a value or an element of I, is called a set value.
Class
– Grouping of common objects.
– Template for all common objects.
Inheritance
– Declaring a type to be a subtype of another.
Abstract Data Types
– Template for all objects of that type.
– Describes a type of data by providing a domain of data with the same structure, as well as the operations applicable to the objects of that domain.
– This abstraction capability is commonly referred to as encapsulation.
Composition (Aggregation)
– Restriction on composite objects results in complex objects.
– The composite object relationship between types can be represented by a composition graph.
Collection
– User-defined grouping of objects.
– Similar to a class in that it groups objects.
Subtyping
– Based on the specialization relationship among types.
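A short Python sketch tying a few of these concepts together (the Employee/Manager classes and the OID counter are invented for illustration): a class acts as a template, state is encapsulated behind operations, object identity is kept separate from state, and subtyping is expressed through inheritance.

import itertools

_oid_counter = itertools.count(1)        # system-generated object identifiers

class Employee:
    """Template (abstract data type) for all Employee objects."""

    def __init__(self, name, salary):
        self._oid = next(_oid_counter)                    # identity, independent of state
        self._state = {"name": name, "salary": salary}    # encapsulated state

    def oid(self):
        return self._oid

    def raise_salary(self, pct):
        # Operation defined with the type; users never touch the state directly.
        self._state["salary"] *= 1 + pct / 100

class Manager(Employee):
    """Subtype of Employee: specialization through inheritance."""

    def __init__(self, name, salary, dept):
        super().__init__(name, salary)
        self._state["dept"] = dept

e = Employee("Ann", 50000)
m = Manager("Bob", 80000, "Sales")
m.raise_salary(10)                  # inherited operation
print(e.oid(), m.oid())             # distinct identities, even if the states were equal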
Object Distribution Design
Path partitioning
– A concept describing the clustering of all the objects forming a composite object into a partition.
– Can be represented as a hierarchy of nodes forming a structural index.
– The index contains the references to all the component objects of a composite object, eliminating the need to traverse the class composition hierarchy.
Class Partitioning Algorithms
– The main issue is to improve the performance of user queries and applications by reducing irrelevant data access.
– Affinity-based approach
Affinity among instance variables and methods, and affinity among multiple methods, can be used for horizontal and vertical class partitioning.
– Cost-driven approach
Allocation
– Local behavior–local object
The behavior, the object to which it is applied, and the arguments are all co-located.
No special mechanism is needed to handle this case.
– Local behavior–remote object
The behavior, the object to which it is applied, and the arguments are not all co-located.
Two ways to deal with this case:
– Move the remote object to the site where the behavior is located.
– Ship the behavior implementation to the site where the object is located.
Client-Server Architecture

[Figure: client-server architecture, with client applications connected to object database servers.]
Cache Consistency
A problem in any data-shipping system that moves data to the clients.
Cache consistency algorithms
– Avoidance-based synchronous algorithms
Clients retain read locks across transactions, but they relinquish write locks at the end
of the transaction.
Clients send lock requests to the server and block until the server responds.
If a client requests a write lock on a page that is cached at other clients, the server must contact those clients before the lock can be granted.
– Avoidance-based asynchronous algorithms
Do not have the message blocking overhead present in synchronous algorithms.
Clients send lock escalation messages to the server and continue application
processing
– Avoidance-based deferred algorithms
Clients batch their lock escalation requests and send them to the server at commit time.
The server blocks the updating client if other clients are reading the updated objects.
– Detection-based synchronous algorithms
Clients contact the server whenever they access a page in their cache to ensure that
the page is not stale or being written to by other clients.
– Detection-based asynchronous algorithms
Clients send lock escalation requests to the server, but optimistically assume that their
requests will be successful.
After a client transaction commits, the server propagates the updated pages to all the
other clients that have also cached the affected pages.
– Detection-based deferred algorithms
Can outperform callback locking algorithms even while encountering a higher abort rate
if the client transaction state completely fits into the client cache, and all application
processing is strictly performed at the clients.
Object Identifier Management
Object Identifiers are system generated
Used to uniquely identify every object
Transient object identity can be implemented more
efficiently
Two common solutions
– Physical Identifier approach (POID)
Equates the OID with the physical address of the corresponding
object
Advantage: the object can be obtained directly from the OID.
Drawback: all the parent objects and indexes must be updated
whenever an object is moved to a different page.
– Logical Identifier approach (LOID)
Consists of allocating a system wide unique OID.
Since OIDs are invariant, there is no overhead due to object
movement.
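A minimal Python sketch (the table layout is invented) contrasting the two approaches: with a physical OID the identifier is essentially the page address, so moving the object invalidates every stored reference; with a logical OID an indirection table maps the invariant OID to the current physical address, so only one table entry changes when the object moves.

# Logical OID approach: a system-wide table maps an invariant OID to the
# current physical location of the object.
oid_table = {
    1001: ("page_7", 0),     # LOID -> (page, slot)
    1002: ("page_7", 1),
}

def dereference(loid):
    # One extra lookup compared with a physical OID, but stable under moves.
    return oid_table[loid]

def move_object(loid, new_page, new_slot):
    # Only the indirection entry changes; other objects that store the LOID
    # as a reference remain valid.
    oid_table[loid] = (new_page, new_slot)

move_object(1001, "page_12", 3)
print(dereference(1001))     # ('page_12', 3)

# With a physical OID, the identifier *is* ("page_7", 0): after the move,
# every parent object and index storing that POID would have to be updated.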
Object Migration
Three alternatives can be considered for the migration of classes (types):
– The source code is moved and recompiled at the destination.
– The compiled version of a class is migrated just like any other object, or
– The source code of the class definition is moved, but not its compiled operations, for which a lazy migration strategy is used.
Objects can be in one of four states:
– Ready
Ready objects are not currently invoked and have not received a message, but are ready to be invoked or to receive a message.
– Active
Active objects are currently involved in an activity in response to an invocation or a message.
– Waiting
Waiting objects have invoked another object and are waiting for a response.
– Suspended
Suspended objects are temporarily unavailable for invocation.
Migration involves two steps:
– Shipping the object from the source to the destination, and
– Creating a proxy at the source, replacing the original object.
Object Clustering

– Difficult for two reasons
Not orthogonal to the object identity implementation: logical OIDs incur more overhead, but enable vertical partitioning of classes.
Clustering of complex objects along the composition relationship is more involved because of object sharing.
– Given a class graph, there are three basic storage models for object clustering
The decomposition storage model partitions each object class into binary relations.
The normalized storage model stores each class as a separate relation.
The direct storage model enables multi-class clustering of complex objects based on the composition relationship.
Distributed Garbage Collection

– As programs modify objects and remove references, a persistent object may become unreachable from the persistent roots of the system when there is no more reference to it.
– Basic garbage collection algorithms can be categorized as reference counting or tracing-based.
Reference counting
– Each object has an associated count of the references to it.
– Each time a program creates an additional reference that points to an object, the object's count is incremented.
– When a reference to an object is destroyed, the corresponding count is decremented.
Tracing-based
– Mark and sweep algorithms
Two-phase algorithms: the first phase, the mark phase, starts from the roots and marks every reachable object.
Once all live objects are marked, the memory is examined and unmarked objects are reclaimed (a small sketch follows below).
– Copy-based algorithms
Divide memory into two disjoint areas:
From-space: programs manipulate objects in this space.
To-space: left empty.
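As an illustration of the tracing-based category, here is a small mark-and-sweep sketch in Python (the heap representation is invented for the example; a real distributed collector also has to deal with remote references).

def mark_and_sweep(heap, roots):
    """heap: dict mapping oid -> list of referenced oids; roots: iterable of root oids.
    Removes unreachable objects from heap and returns the set of reclaimed oids."""
    # Mark phase: start from the roots and mark every reachable object.
    marked = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in marked:
            continue
        marked.add(oid)
        stack.extend(heap.get(oid, []))
    # Sweep phase: examine the whole heap and reclaim unmarked objects.
    garbage = set(heap) - marked
    for oid in garbage:
        del heap[oid]
    return garbage

heap = {"r": ["a"], "a": ["b"], "b": [], "x": ["y"], "y": ["x"]}   # x and y form an unreachable cycle
print(mark_and_sweep(heap, roots=["r"]))    # {'x', 'y'} reclaimed; reference counting alone would miss this cycle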
Object Query Processing
Object Query Processor Architectures
– Open OODB project
Separation between the user query language parsing structures and the operator graph on which the optimizer operates.
– EPOQ project
An approach to query optimization extensibility, where the search space is divided into regions.
– TIGUKAT project
Uses an object approach to query processing extensibility.
An extensible uniform behavioral model characterized by a purely behavioral semantics and a uniform approach to objects.
Query Processing Issues
– Search space and transformation rules
– Search algorithm
Important Issues
– Cost function
Can be defined recursively based on the algebraic processing tree.
– Parameterization
– Path expressions
– Rewriting and algebraic optimization
– Path indexes
Query Execution
– Path indexes: algorithms
1. Create an index on each class traversed.
2. Define indexes on objects across their type inheritance.
3. Access support relations: a data structure that stores selected path expressions (see the sketch below).
– Set matching: algorithms
1. Centralized algorithms
2. Join execution algorithms
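To make the path-index idea concrete, the sketch below (class names and data invented) materializes a path expression such as Employee.dept.city into a lookup structure, in the spirit of an access support relation, so that a query on the end of the path does not have to traverse the class composition hierarchy object by object.

from collections import defaultdict

# Tiny object base: oid -> attribute dictionary (invented data).
employees = {"e1": {"name": "Ann", "dept": "d1"}, "e2": {"name": "Bob", "dept": "d2"}}
departments = {"d1": {"city": "Paris"}, "d2": {"city": "Waterloo"}}

# Build a path index for the path expression Employee.dept.city:
# it maps the value at the end of the path to the set of Employee oids.
path_index = defaultdict(set)
for eid, emp in employees.items():
    city = departments[emp["dept"]]["city"]
    path_index[city].add(eid)

# "Employees whose department is located in Paris" becomes a single lookup
# instead of a traversal of the composition hierarchy.
print(path_index["Paris"])    # {'e1'}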
Data Delivery Alternatives
Pull-only
– Transfer of data from servers to clients is initiated by a client pull.
– Arrivals of new data items or updates to existing data items are carried out at a server without modification to clients unless clients explicitly poll the server.
Push-only
– Transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients.
Hybrid
– Combines the client-pull and server-push mechanisms.
Architecture of a Data Warehouse
[Figure: source databases are integrated into a target database described by a metadata repository; query/analysis, reporting, and data mining tools issue queries against the target database.]
Semi-structured Data
– Free and commercial databases on product information, etc.; the interface to such sources is typically a collection of fill-out forms.
– Typically modeled as a labeled graph.
– Such labeled graphs are self-describing and have no schema.
– The Object Exchange Model (OEM) is used to represent such a labeled graph; each object has:
A label, which is the name of the object class
A type, which is either atomic (integer, string, etc.) or set
A value, which is either atomic or a set of objects
An optional object identifier
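A small sketch of what an Object Exchange Model object might look like in Python (the field values are invented): each node of the labeled graph carries a label, a type, a value that is either atomic or a set of sub-objects, and an optional object identifier.

from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class OEMObject:
    label: str                                    # name of the object class
    type: str                                     # "integer", "string", ... or "set"
    value: Union[int, str, List["OEMObject"]]     # atomic value or set of objects
    oid: Optional[str] = None                     # optional object identifier

# A self-describing product entry from a hypothetical source:
product = OEMObject("product", "set", [
    OEMObject("name", "string", "laptop"),
    OEMObject("price", "integer", 999),
], oid="&p1")

print([child.label for child in product.value])   # ['name', 'price']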

[Figure: a web data server with a global data dictionary accesses multiple data sources, each through a wrapper.]
Problems with the Pull-Based Approach
– Users need to know a priori where and when to look for data.
– Mismatch between the asymmetric nature of some applications and the symmetric communications infrastructure of applications such as the Internet.
– Types of asymmetry
Network asymmetry: the network bandwidth from client to server differs from that from server to client.
Distributed information systems: due to the imbalance between the number of clients and the number of servers.
Data: the amount of data being transferred between client and server.
Data volatility.
Why Push-Based Technologies?
– A response to some of the problems inherent in pull-based systems.
Algorithm – Push-Based Approach
1. Order the data items from hottest to coldest.
2. Partition the data items into ranges of items, such that the items in each range have similar application access profiles. The number of ranges is denoted by num_ranges.
3. Choose the relative broadcast frequency for each range as an integer (rel_freq_i, where i is the range).
4. Divide each range into smaller elements, called chunks (C_ij is the j-th chunk of range i). Determine the number of chunks into which range i is divided as num_chunks_i = max_chunks/rel_freq_i, where max_chunks is the least common multiple of the rel_freq_i, for all i.
5. Create the broadcast schedule by interleaving the chunks of each range using the following procedure (a runnable sketch follows below):
for i from 0 to max_chunks - 1 do
    for j from 1 to num_ranges do
        Broadcast chunk C_j,(i mod num_chunks_j)
    end-for
end-for
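The following Python sketch implements the schedule-construction steps above under simplifying assumptions of my own (ranges are given as plain lists already ordered by access frequency, and chunks are just equal slices of each range):

from math import lcm   # Python 3.9+

def broadcast_schedule(ranges, rel_freq):
    """ranges[i]: list of data items in range i (hottest range first);
    rel_freq[i]: relative broadcast frequency chosen for range i.
    Returns one period of the broadcast schedule as a list of chunks."""
    num_ranges = len(ranges)
    max_chunks = lcm(*rel_freq)            # least common multiple of the frequencies

    # Step 4: split range i into num_chunks_i = max_chunks / rel_freq[i] chunks.
    chunks = []
    for i, items in enumerate(ranges):
        num_chunks_i = max_chunks // rel_freq[i]
        size = -(-len(items) // num_chunks_i)      # ceiling division
        chunks.append([items[k * size:(k + 1) * size] for k in range(num_chunks_i)])

    # Step 5: interleave the chunks of the ranges.
    schedule = []
    for i in range(max_chunks):
        for j in range(num_ranges):
            schedule.append(chunks[j][i % len(chunks[j])])
    return schedule

# Range 0 ("a") is broadcast twice as often as range 1 ("b", "c").
print(broadcast_schedule([["a"], ["b", "c"]], [2, 1]))
# -> [['a'], ['b'], ['a'], ['c']]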
Difference between pull-based and push-based systems
– Cache replacement policies
– Prefetching mechanism

An idealized algorithm for page replacement is one which replaces the page with the smallest ratio between its probability of access and its frequency of broadcast.
The PIX algorithm calculates the "cost" of replacing a page and replaces the least costly one.

The operation of the algorithm is as follows:


1. When a page Pi is brought into the cache, it is inserted into a chain and the following values are initialized:
Pr_i = 0, LT_i = CurrentTime
2. When Pi is accessed again, it is moved to the top of its own chain and the following
calculations are made:
Pr_i = HF / (CurrentTime − LT_i) + (1 − HF) * Pr_i
LT_i = CurrentTime
3. If a new page needs to be flushed out to open up space, a lix value is calculated
for the pages at the bottom of each chain and the page with the lowest lix value is
flushed out. The lix value is calculated as follows:
lix_i = Pr_i / rel_freq_i
where rel_freq_i is the relative broadcast frequency of the range (disk) to which
page Pi belongs.
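The sketch below (my own illustration; the HF constant and the page parameters are invented) shows just the bookkeeping from the steps above: updating the running access estimate on each access and computing the lix value used to pick a replacement victim.

HF = 0.25    # weight given to the most recent inter-access gap (assumed constant)

class CachedPage:
    def __init__(self, rel_freq, now):
        self.pr = 0.0             # running estimate of the access probability (step 1)
        self.lt = now             # time of the last access
        self.rel_freq = rel_freq  # relative broadcast frequency of the page's range

    def on_access(self, now):
        # Step 2: exponentially weighted update of the access estimate.
        self.pr = HF / (now - self.lt) + (1 - HF) * self.pr
        self.lt = now

    def lix(self):
        # Step 3: a lower lix value makes the page a better eviction candidate.
        return self.pr / self.rel_freq

def choose_victim(bottom_of_chains):
    """Among the pages at the bottom of the chains, pick the one with the lowest lix."""
    return min(bottom_of_chains, key=lambda p: p.lix())

hot = CachedPage(rel_freq=3, now=0.0); hot.on_access(1.0)     # accessed after 1 time unit
cold = CachedPage(rel_freq=1, now=0.0); cold.on_access(10.0)  # accessed after 10 time units
print(choose_victim([hot, cold]) is cold)   # True: the rarely accessed page is flushed first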
