
UNIT-III

Distributed Database
A distributed DBMS manages the distributed database so that it appears as a single
database to users.
A distributed database is a collection of multiple interconnected databases,
which are spread physically across various locations that communicate via a
computer network.

Features
 Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
 Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not synonymous with
a transaction processing system.

Distributed Database Management System


A distributed database management system (DDBMS) is a centralized software system that
manages a distributed database in a manner as if it were all stored in a single location.

Features

 It is used to create, retrieve, update and delete distributed databases.


 It synchronizes the database periodically and provides access mechanisms by virtue
of which the distribution becomes transparent to the users.
 It ensures that the data modified at any site is universally updated.
 It is used in application areas where large volumes of data are processed and accessed by
numerous users simultaneously.
 It is designed for heterogeneous database platforms.
 It maintains confidentiality and data integrity of the databases.

Factors Encouraging DDBMS


The following factors encourage moving over to DDBMS −
 Distributed Nature of Organizational Units − Most organizations in the current times
are subdivided into multiple units that are physically distributed over the globe. Each
unit requires its own set of local data. Thus, the overall database of the organization
becomes distributed.
 Need for Sharing of Data − The multiple organizational units often need to
communicate with each other and share their data and resources. This demands common
databases or replicated databases that should be used in a synchronized manner.
 Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and
Online Analytical Processing (OLAP) work upon diversified systems that may have
common data. Distributed database systems aid both kinds of processing by providing
synchronized data.
 Database Recovery − One of the common techniques used in DDBMS is replication of
data across different sites. Replication of data automatically helps in data recovery if
database in any site is damaged. Users can access data from other sites while the
damaged site is being reconstructed. Thus, database failure may become almost
inconspicuous to users.
 Support for Multiple Application Software − Most organizations use a variety of
application software each with its specific database support. DDBMS provides a
uniform functionality for using the same data among different platforms.

Advantages of Distributed Databases


Following are the advantages of distributed databases over centralized databases.
Modular Development − If the system needs to be expanded to new locations or new units, in
centralized database systems, the action requires substantial efforts and disruption in the
existing functioning. However, in distributed databases, the work simply requires adding new
computers and local data to the new site and finally connecting them to the distributed system,
with no interruption in current functions.
More Reliable − In case of database failure, a centralized database system comes to a
complete halt. However, in distributed systems, when a component fails, the system
continues to function, though possibly at reduced performance. Hence a DDBMS is more reliable.
Better Response − If data is distributed in an efficient manner, then user requests can be met
from local data itself, thus providing faster response. On the other hand, in centralized systems,
all queries have to pass through the central computer for processing, which increases the
response time.
Lower Communication Cost − In distributed database systems, if data is located locally where
it is mostly used, then the communication costs for data manipulation can be minimized. This is
not feasible in centralized systems.

Adversities of Distributed Databases


Following are some of the adversities associated with distributed databases.
 Need for complex and expensive software − DDBMS demands complex and often
expensive software to provide data transparency and co-ordination across the several
sites.
 Processing overhead − Even simple operations may require a large number of
communications and additional calculations to provide uniformity in data across the
sites.
 Data integrity − The need for updating data at multiple sites poses problems of data
integrity.
 Overheads for improper data distribution − Responsiveness of queries is largely
dependent upon proper data distribution. Improper data distribution often leads to very
slow response to user requests.

Types of Distributed Databases


Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments, each with further sub-divisions, as shown in the following
illustration.

Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and operating systems.
Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user
requests.
 The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −


 Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central or
master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS
products and data models. Its properties are −
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network, hierarchical or object
oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases

 Federated − The heterogeneous database systems are independent in nature and


integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.

Distributed DBMS Architectures


DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.

Architectural Models
Some of the common architectural models are −
 Client - Server Architecture for DDBMS
 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture

Client - Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided between servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. Client functions mainly provide the user interface, but clients may also
perform some functions such as consistency checking and transaction management.
The two client-server architectures are −

 Single Server Multiple Client


 Multiple Server Multiple Client (shown in the following diagram)

Peer- to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for providing database services.
The peers share their resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.

Multi - DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous


database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views comprising subsets of the
integrated distributed database.
 Multi-database Conceptual Level − Depicts an integrated multi-database that comprises
global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −
 Model with multi-database conceptual level.
 Model without multi-database conceptual level.

Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −

 Non-replicated and non-fragmented


 Fully replicated
 Partially replicated
 Fragmented
 Mixed

Non-replicated & Non-fragmented

In this design alternative, different tables are placed at different sites. Data is placed in
close proximity to the site where it is used most. It is most suitable for database systems
where the percentage of queries needed to join information in tables placed at different sites is
low. If an appropriate distribution strategy is adopted, then this design alternative helps to
reduce the communication cost during data processing.

Fully Replicated

In this design alternative, a copy of all the database tables is stored at each site. Since each
site has its own copy of the entire database, queries are very fast and require negligible
communication cost. On the other hand, the massive redundancy in data incurs a huge cost during
update operations. Hence, this is suitable for systems where a large number of queries must be
handled while the number of database updates is low.

Partially Replicated

Copies of tables or portions of tables are stored at different sites. The distribution of the tables is
done in accordance with the frequency of access, taking into consideration the fact that the
frequency of accessing the tables varies considerably from site to site. The number of copies of
a table (or portion) depends on how frequently the access queries execute and the sites that
generate those queries.

Fragmented

In this design, a table is divided into two or more pieces referred to as fragments or partitions,
and each fragment can be stored at different sites. This considers the fact that it seldom happens
that all data stored in a table is required at a given site. Moreover, fragmentation increases
parallelism and provides better disaster recovery. Here, there is only one copy of each fragment
in the system, i.e. no redundant data.
The three fragmentation techniques are −

 Vertical fragmentation
 Horizontal fragmentation
 Hybrid fragmentation
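
To make these techniques concrete, here is a minimal Python sketch (using a hypothetical EMPLOYEE relation; the attribute names and split predicates are illustrative assumptions, not taken from the text) showing how horizontal and vertical fragments can be derived and recombined.

# Minimal sketch of horizontal and vertical fragmentation (hypothetical data).
employee = [
    {"emp_id": 1, "name": "Asha",  "dept": "Sales", "salary": 40000},
    {"emp_id": 2, "name": "Ravi",  "dept": "HR",    "salary": 35000},
    {"emp_id": 3, "name": "Meena", "dept": "Sales", "salary": 42000},
]

# Horizontal fragmentation: split the rows by a predicate (here, department),
# so each fragment can be stored at the site that uses those rows most.
frag_site1 = [r for r in employee if r["dept"] == "Sales"]   # stored at site 1
frag_site2 = [r for r in employee if r["dept"] != "Sales"]   # stored at site 2
assert sorted(frag_site1 + frag_site2, key=lambda r: r["emp_id"]) == employee

# Vertical fragmentation: split the attributes; each fragment keeps the key
# emp_id so the original relation can be rebuilt by a join on that key.
frag_personal = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employee]
frag_payroll = [{"emp_id": r["emp_id"], "dept": r["dept"], "salary": r["salary"]}
                for r in employee]
rebuilt = [{**p, **next(q for q in frag_payroll if q["emp_id"] == p["emp_id"])}
           for p in frag_personal]
assert rebuilt == employee

Hybrid fragmentation simply applies one technique to the output of the other, for example vertically splitting each horizontal fragment.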

Mixed Distribution

This is a combination of fragmentation and partial replication. Here, the tables are initially
fragmented in any form (horizontal or vertical), and then these fragments are partially replicated
across the different sites according to the frequency of accessing the fragments.

Concurrency Control in Distributed Database:

Concurrency, or concurrent execution of transactions, is about executing multiple
transactions simultaneously. This is done by executing a few instructions of one
transaction, then a few of the next, and so on. This increases the transaction
throughput (the number of transactions that can be executed in a unit of time) of the
system.
When you execute multiple transactions simultaneously, extra care should be taken
to avoid inconsistency in the results that the transactions produce. This care is
mandatory especially when two or more transactions are working (reading or
writing) on the same database items (data objects).
For example, one transaction is transferring money from account A to account B
while another transaction is withdrawing money from account A. These two
transactions should not be permitted to execute in an interleaved fashion the way
transactions working on different data items can. We need to execute such
transactions serially (one after the other).
If we do not take care of concurrent transactions that deal with the same data
items, we end up with the following problems (a sketch of the first one follows this list):

 Lost Update Problem


 Reading Uncommitted Data / Temporary Update / Dirty Read Problem
 Incorrect Summary / Inconsistent Analysis problem
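
As an illustration of the first problem, the following Python sketch simulates a deliberately interleaved schedule on a hypothetical account balance; no real threads are used, the interleaving is simply written out step by step.

# Sketch of the Lost Update problem with an explicitly interleaved schedule.
# T1 deposits 100 into account A, T2 withdraws 50 from account A (hypothetical values).
account_a = 500

t1_local = account_a      # T1: read(A)  -> 500
t2_local = account_a      # T2: read(A)  -> 500 (reads the same old balance)

t1_local += 100           # T1 computes A = 600 from its stale copy
t2_local -= 50            # T2 computes A = 450 from its stale copy

account_a = t1_local      # T1: write(A) -> 600
account_a = t2_local      # T2: write(A) -> 450, silently overwriting T1's update

print(account_a)          # prints 450, whereas any serial execution would give 550

Because T2's write is based on a value read before T1's write, T1's update is lost; a serial (or serializable) execution would leave the balance at 550.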

Concurrency Control in Distributed Database

Concurrency control schemes deal with handling data that is accessed by concurrent transactions.
Various locking protocols are used for handling concurrent transactions in centralized database
systems. There are no major differences between the schemes in centralized and distributed
databases; the only major difference is the way the lock manager deals with replicated data.
The following topics discuss several possible schemes that are applicable to an environment
where data can be replicated at several sites. We shall assume the existence of the shared and
exclusive lock modes.

Locking protocols
1. Single lock manager approach
2. Distributed lock manager approach
a) Primary Copy protocol
b) Majority protocol
c) Biased protocol
d) Quorum Consensus protocol
Concurrency Control in Distributed Database - Single Lock
Manager Approach
In this approach, the distributed database system, which consists of several sites, maintains a
single lock manager at a chosen site, as shown in Figure 1. Observe Figure 1 for distributed sites
S1, S2, …, S6 with site S3 chosen as the lock-manager site.

Distributed Lock-Manager Approach


In this approach, the function of lock-manager is distributed over several sites.

[Every DBMS server (site) has all the components like Transaction Manager, Lock-
Manager, Storage Manager, etc.]

In Distributed Lock-Manager, every site owns the data which is stored locally.

 This is true for a table that is fragmented into n fragments and stored at n sites. In this
case, every fragment is distinct from every other fragment and is completely owned by the site at
which it is stored. For those fragments, the local Lock-Manager is responsible for handling lock
and unlock requests generated by the same site or by other sites.

 If the data stored at a site is replicated at other sites, then a site cannot own the data
completely. In this case, we cannot handle a lock request for a data item the way we do for
fragmented data; doing so would lead to inconsistency problems, as there are multiple copies
stored at several sites. This case is handled using several protocols that are specifically designed
for handling lock requests on replicated data. The protocols are,

o Primary Copy Protocol


o Majority Based Protocol
o Biased Protocol
o Quorum Consensus Protocol

Advantages:
 Implementation is simple for data that is fragmented; such data can be handled as in the
Single Lock-Manager approach.
 For replicated data, again the work can be distributed over several sites using one of the
above listed protocols.
 Lock-Manager site is not the bottleneck as the work of lock-manager is distributed over
several sites.
Disadvantages:
 Handling of deadlock is difficult, because a transaction T1 that acquired a lock on a
data item Q at site S1 may be waiting for a lock on another data item R at site S2. Whether this
wait is genuine or a deadlock has occurred is not easily identifiable.
Primary Copy Distributed Lock Manager Approach / Primary
Copy based Distributed Concurrency Control Approach
Primary Copy Protocol:
Assume that we have a data item Q which is replicated at several sites, and we choose one of
the replicas of Q as the Primary Copy (only one replica). The site that stores the
Primary Copy is designated as the Primary Site. Any lock request generated for data item Q at
any site must be routed to the Primary Site. The Primary Site's lock-manager is responsible for
handling lock requests, even though other sites hold the same data item and have local lock-
managers. We can choose different sites as primary (lock-manager) sites for different data items.

How does Primary Copy protocol work?


Figure 1 shows the Primary Copy protocol implementation.
In the figure,

 Q, R, and S are different data items that are replicated.


 Q is replicated in sites S1, S2, S3 and S5 (represented in blue colored text). Site S3 is
designated as Primary site for Q (represented in purple colored text).
 R is replicated in sites S1, S2, S3, and S4. Site S6 is designated as Primary site for R.
 S is replicated at sites S1, S2, S4, S5, and S6. Site S1 is designated as Primary site for S.
Then, the concurrency control mechanism works this way;

Step 1: Transaction T1 is initiated at site S5 and requests a lock on data item Q. Even though the data
item is available locally at site S5, the lock-manager of S5 cannot grant the lock, because, in
our example, site S3 is designated as the primary site for Q. Hence, the request must be routed to
site S3 by the transaction manager of S5.
Step 2: S5 requests S3 for lock on Q. S5 sends lock request message to S3.

Step 3: If the lock on Q can be granted, S3 grants the lock and sends a message to S5.

On receiving the lock grant, S5 executes the transaction T1 (the transaction operates on the copy of Q
available locally; if there is no local copy, the transaction has to be executed at another site where Q is available).

Step 4: On successful completion of Transaction, S5 sends unlock message to the Primary Site
S3.

Note: If transaction T1 writes the data item Q, then the changes must be forwarded to all the
sites where Q is replicated. If the transaction only reads Q, no such propagation is needed.

Advantages:
 Concurrency control on replicated data is handled just like unreplicated data, so the
implementation is simple.
 Only 3 messages are needed to handle lock and unlock requests (1 lock request, 1 grant, and 1
unlock message) for both read and write.
Disadvantages:
 Possible single point-of-failure: if the Primary Site of a data item, say Q, fails, then Q
becomes inaccessible even though other sites hold copies of it.
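
The routing logic of the Primary Copy protocol can be sketched in a few lines of Python. The class below is an illustrative assumption, not a prescribed implementation; the site and item names follow the example above, and the message counter simply mirrors the 3-messages-per-cycle observation.

# Sketch of Primary Copy lock routing (illustrative only).
class PrimaryCopyLocking:
    def __init__(self, primary_site):
        self.primary_site = primary_site   # data item -> site holding its primary copy
        self.locks = {}                    # data item -> site currently holding the lock
        self.messages = 0

    def lock(self, item, requesting_site):
        primary = self.primary_site[item]  # the request is routed to this primary site
        self.messages += 1                 # 1. lock request message to the primary site
        if item in self.locks:
            return False                   # held in an incompatible mode: not granted
        self.locks[item] = requesting_site
        self.messages += 1                 # 2. lock grant message back to the requester
        return True

    def unlock(self, item, requesting_site):
        self.messages += 1                 # 3. unlock message to the primary site
        if self.locks.get(item) == requesting_site:
            del self.locks[item]

# Example from the text: S3 is the primary site for Q, S6 for R, S1 for S.
mgr = PrimaryCopyLocking({"Q": "S3", "R": "S6", "S": "S1"})
assert mgr.lock("Q", "S5")                 # T1 at S5 acquires the lock via S3
mgr.unlock("Q", "S5")
print(mgr.messages)                        # 3 messages for one lock/unlock cycle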

Figure 1 - Single Lock Manager Approach


The technique works as follows;

When a transaction requests locks on some data items, the request must be forwarded to the chosen
lock-manager site. This is done by the transaction manager of the site where the request is initiated.

The lock manager at the chosen lock-manager site decides whether the lock request can be granted
immediately, based on the usual procedure. [That is, if a lock is already held on the requested data item by some other
transaction in an incompatible mode, the lock cannot be granted. If the data item is free, or is locked in a compatible mode, the
lock manager grants the lock.]

If the lock request is granted, the transaction can read from any site where a replica is available.

On successful completion of the transaction, the transaction manager of the initiating site releases the lock
by sending an unlock request to the lock-manager site.

Example – Transaction handling in Single Lock-Manager Approach:

Figure 2 - Steps of Transaction handling in Single Lock Manager approach

Let us assume that the Transaction T1 is initiated at Site S5 as shown in Figure 2 (Step 1). Also,
assume that the requested data item D is replicated in Sites S1, S2, and S6. The steps are
numbered in the Figure 2. According to the discussion above, the technique works as follows;

 Step 2 - The initiator site S5’s Transaction manager sends the lock request to lock data
item D to the lock-manager site S3.
The Lock-manager at site S3 will look for the availability of the data item D.
 Step 3 - If the requested item is not locked by any other transactions, the lock-manager
site responds with lock grant message to the initiator site S5.
 Step 4 - As the next step, the initiator site S5 can use the data item D from any of the sites
S1, S2, and S6 for completing the Transaction T1.
 Step 5 - After successful completion of the Transaction T1, the Transaction manager of
S5 releases the lock by sending the unlock request to the lock-manager site S3.
Advantages:
 Locking can be handled easily. We need two messages for a lock (one for the request, the other
for the grant) and one message for an unlock request. This method is also simple, as it resembles a
centralized database.
 Deadlocks can be handled easily, because a single lock manager is responsible for handling
all lock requests.
Disadvantages:

 The lock-manager site becomes a bottleneck, as it is the only site handling all the lock
requests generated at all the sites in the system.
 Highly vulnerable to a single point of failure: if the lock-manager site fails, we lose
concurrency control.
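
As a small illustration of the message flow in Figure 2, the following Python sketch (hypothetical code; the site names and data item D follow the example above) walks through the five steps with a single lock-manager site.

# Sketch of the single lock-manager flow (illustrative only).
LOCK_MANAGER_SITE = "S3"                   # the single chosen lock-manager site
replicas = {"D": ["S1", "S2", "S6"]}       # data item D is replicated at these sites
lock_table = {}                            # item -> transaction currently holding it

def request_lock(item, txn, initiator_site):
    # Step 2: the initiator's transaction manager sends the request to the lock-manager site.
    print(f"{initiator_site} -> {LOCK_MANAGER_SITE}: lock request for {item}")
    if item in lock_table:                 # incompatible lock already held: no grant
        return False
    lock_table[item] = txn
    # Step 3: the lock-manager site responds with a lock grant.
    print(f"{LOCK_MANAGER_SITE} -> {initiator_site}: lock granted on {item}")
    return True

def release_lock(item, txn, initiator_site):
    # Step 5: the initiator releases the lock through an unlock request.
    print(f"{initiator_site} -> {LOCK_MANAGER_SITE}: unlock {item}")
    if lock_table.get(item) == txn:
        del lock_table[item]

# Step 1: T1 is initiated at S5; Step 4: the read is served from any replica site.
if request_lock("D", "T1", "S5"):
    print("S5 reads D from", replicas["D"][0])
    release_lock("D", "T1", "S5")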
Majority Based Protocol for Concurrency Control / Variants of
Distributed Lock Manager based Concurrency Control
mechanism
Majority Based Protocol:
Assume that we have a data item Q which is replicated at several sites. The Majority Based
protocol works as follows;

 A transaction that needs to lock data item Q has to request and acquire the lock on Q at
half+one of the sites at which Q is replicated (i.e., a majority of the sites holding Q).
 The lock-managers of all the sites at which Q is replicated handle lock and unlock
requests locally and individually.
 Irrespective of the lock type (read or write, i.e., shared or exclusive), we need to lock
half+one sites.
Example:
Let us assume that Q is replicated in 6 sites. Then, we need to lock Q in 4 sites (half+one = 6/2 +
1 = 4). When transaction T1 sends the lock request message to those 4 sites, the lock-managers
of those sites have to grant the locks based on the usual lock procedure.

How does Majority Based protocol work?


Implementation:
Figure 1 shows the implementation of the Majority Based protocol.
In the figure,

 Q, R, and S are the different data items.


 Q is replicated in sites S1, S2, S3 and S6.
 R is replicated in sites S1, S2, S3, and S4.
 S is replicated at sites S1, S2, S4, S5, and S6.

Figure 1 - Majority Based Protocol Implementation

Let us assume that Transaction T1 needs data item Q to be locked (either read or write mode).

Step 1: Transaction T1 is initiated at site S5 and requests a lock on data item Q. Q is available at S1,
S2, S3 and S6. According to the protocol, T1 has to lock Q at half+one sites, i.e., in our
example, we need to lock any 3 out of the 4 sites. Assume that we have chosen sites S1, S2, and S3.

Step 2: S5 requests S1, S2 and S3 for lock on Q. The lock request is represented in purple color.

Step 3: If the lock on Q can be granted, S1, S2, and S3 grant lock and send a message to S5.

On receiving the lock grants, S5 executes the transaction T1 (executed on the copy of Q
taken from any one of the locked sites). The lock grant is represented in green color.
Step 4: On successful completion of Transaction, S5 sends unlock message to all the sites S1, S2,
and S3.
The unlock message is represented in blue color.

Note: If transaction T1 writes the data item Q, then the changes must be forwarded to all the
sites where Q is replicated. If the transaction only reads Q, no such propagation is needed.

Advantages:

 Replicated data is handled in a decentralized manner. Hence, there is no single
point-of-failure problem.
Disadvantages:
 Implementation is complex. We need to send (n/2 + 1) lock request messages, (n/2 + 1)
lock grant messages, and (n/2 + 1) unlock messages, irrespective of the lock requested (read or
write).
 Both read and write involve the same level of complexity.
 Deadlocks can occur. (To handle deadlocks, we may need to impose an order in
which the locks are requested.)
Points to note:
-A transaction can execute only after successful acquisition of locks on a majority of the replicas.

-Needs to send more messages, 2(n/2+1) lock messages and (n/2+1) unlock messages, compared
to the Primary Copy protocol (3 messages).

-Local lock-managers are responsible for granting or denying the locks on requested items.

-Not suitable for applications where read operations are frequent.


-When writing the data item, a transaction performs writes on all replicas.

-For unreplicated data, both read and write requests can be handled by the single site at
which the data item is stored.
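
The lock-acquisition rule of the Majority Based protocol can be sketched as follows in Python (illustrative only; the per-site lock tables and site names are assumptions based on the example above).

# Sketch of majority locking: acquire the lock at more than half of the replica sites.
replicas = {"Q": ["S1", "S2", "S3", "S6"]}            # Q is replicated at 4 sites
site_locks = {"S1": set(), "S2": set(), "S3": set(),
              "S4": set(), "S5": set(), "S6": set()}  # items locked at each site

def majority_lock(item):
    holders = replicas[item]
    needed = len(holders) // 2 + 1                    # half + one, e.g. 4 // 2 + 1 = 3
    granted = []
    for site in holders:                              # each local lock-manager decides
        if item not in site_locks[site]:
            site_locks[site].add(item)
            granted.append(site)
        if len(granted) == needed:
            return granted                            # majority reached: proceed
    for site in granted:                              # majority not reached: back off
        site_locks[site].remove(item)
    return None

held_at = majority_lock("Q")
print(held_at)      # e.g. ['S1', 'S2', 'S3']; unlock messages later go to these sites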

Biased Protocol - Distributed Lock Manager Concurrency


Control
Biased Protocol Concurrency Control / Variants of Distributed
Lock Manager based Concurrency Control mechanism

Biased Protocol
The Biased protocol is one of the many protocols that handle concurrency control in a distributed
database system with replicated data. A few of the others are the Primary Copy, Majority
Based, and Quorum Consensus protocols.
Protocol:

If a data item Q is replicated over n sites, then a read lock (shared lock) request message must
be sent to any one of the sites at which Q is replicated, and a write lock (exclusive lock)
request message must be sent to all the sites at which Q is replicated. The lock-managers of all
the sites at which Q is replicated handle lock and unlock requests locally and individually.

Example:
In the figures,

 Q, R, and S are the different data items.


 Q is replicated in sites S1, S2, S3 and S6.
 R is replicated in sites S1, S2, S3, and S4.

 S is replicated at sites S1, S2, S4, S5, and S6.


How does Biased protocol handle Shared and Exclusive locks?
Shared Lock (Read Lock)
Figure 1 shows Biased protocol implementation for handling read request (Shared lock).

Figure 1 - Biased Protocol for Read Lock


Let us assume that Transaction T1 needs data item Q.

Step 1: Transaction T1 is initiated at site S5 and requests a lock on data item Q. Q is available at
S1, S2, S3 and S6. According to the protocol, T1 has to lock Q at any one site at which Q is
replicated, i.e., in our example, we need to lock any 1 out of the 4 sites where Q is replicated.
Assume that we have chosen site S3.

Step 2: S5 requests S3 for shared lock on Q. The lock request is represented in purple color.

Step 3: If the lock on Q can be granted, S3 grants the lock and sends a message to S5.

On receiving the lock grant, S5 executes the transaction T1 (reading can be done at the locked site,
in our case S3).

Step 4: On successful completion of Transaction, S5 sends unlock message to the site S3.

Exclusive Lock (Write Lock)


Figure 2 shows Biased protocol implementation for handling write request (Exclusive lock).

Figure 2 - Biased Protocol for Write Lock


Let us assume that Transaction T1 needs data item Q. Q is available at S1, S2, S3 and S6.
Sites S4 and S5 do not have Q and are represented in red color.

Step 1: Transaction T1 is initiated at site S5 and requests a lock on data item Q. According to the
protocol, T1 has to lock Q at all the sites at which Q is replicated, i.e., in our example, we need to
lock all 4 sites where Q is replicated.

Step 2: S5 requests S1, S2, S3, and S6 for an exclusive lock on Q. The lock request is represented in
purple color.

Step 3: If the lock on Q can be granted at every site, all the sites respond with a lock-grant
message to S5. (If any one or more sites cannot grant the lock, T1 cannot continue.)

On receiving the lock grants, S5 executes the transaction T1 (when writing the data item, the transaction
performs writes on all replicas).

Step 4: On successful completion of the transaction, S5 sends unlock messages to all the sites S1, S2, S3, and S6.

Advantages:
 Read operations can be handled faster compared to the Majority Based protocol. If read
operations are performed frequently in an application, the biased approach can be suggested.
Disadvantages:
 Additional overhead on write operations.
 Implementation is complex. For a write operation on an item replicated at n sites, we need to
send n lock request messages, n lock grant messages, and n unlock messages.
 Deadlocks can occur, as in the Majority Based protocol.
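
The asymmetry of the Biased protocol, one replica for a shared lock versus all replicas for an exclusive lock, can be sketched in Python as below (illustrative assumptions only; site names follow the example above).

# Sketch of the Biased protocol: S-lock at any one replica, X-lock at every replica.
replicas = {"Q": ["S1", "S2", "S3", "S6"]}
shared_locks = {s: set() for s in ("S1", "S2", "S3", "S4", "S5", "S6")}
exclusive_locks = {s: set() for s in ("S1", "S2", "S3", "S4", "S5", "S6")}

def read_lock(item):
    for site in replicas[item]:                       # any single replica site suffices
        if item not in exclusive_locks[site]:
            shared_locks[site].add(item)
            return site                               # the read is served from this site
    return None

def write_lock(item):
    sites = replicas[item]
    if any(item in shared_locks[s] or item in exclusive_locks[s] for s in sites):
        return None                                   # a replica is locked: deny for now
    for site in sites:                                # must lock the item at every replica
        exclusive_locks[site].add(item)
    return sites

print(read_lock("Q"))     # e.g. 'S1': one site locked, so reads are cheap
print(write_lock("Q"))    # None here, because the shared lock on Q is still held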
Quorum Consensus Protocol
This is one of the distributed lock-manager based concurrency control protocols in distributed
database systems. It works as follows:

1. The protocol assigns a weight to each site that has a replica.

2. For any data item, the protocol assigns a read quorum Qr and a write quorum Qw. Here, Qr and
Qw are two integers (each a sum of weights of some sites), chosen so that both of the following
conditions hold:

 Qr + Qw > S – this rule avoids read-write conflicts (i.e., two transactions cannot read and
write the item concurrently).
 2 * Qw > S – this rule avoids write-write conflicts (i.e., two transactions cannot write
the item concurrently).
Here, S is the total weight of all sites at which the data item is replicated.
How do we perform read and write on replicas?
A transaction that needs a data item for reading has to lock enough sites, that is, sites whose
weights sum to >= Qr. The read quorum must always intersect with the write quorum.

A transaction that needs a data item for writing has to lock enough sites, that is, sites whose
weights sum to >= Qw.

Example:
(How does Quorum Consensus Protocol work?)
Let us assume a fully replicated distributed database with four sites S1, S2, S3, and S4.

1. According to the protocol, we need to assign a weight to every site. (This weight can be
chosen on many factors like the availability of the site, latency etc.). For simplicity, let us assume
the weight as 1 for all sites.

2. Let us choose the values Qr = 2 and Qw = 3. Our total weight S is 4. According to
the conditions, these values are valid:

Qr + Qw > S => 2 + 3 > 4 True

2 * Qw > S => 2 * 3 > 4 True

3. Now, a transaction which needs a read lock on a data item has to lock 2 sites. A transaction
which needs a write lock on data item has to lock 3 sites.

CASE 1

Read Quorum Qr = 2, Write Quorum Qw = 3, Site’s weight = 1, Total weight of sites S = 4

Read Lock: a read request has to lock at least two replicas (2 sites in our example); any two sites can be locked.

Write Lock: a write request has to lock at least three replicas (3 sites in our example).
Note that the read quorum intersects with the write quorum. That is, out of the 4 available sites in our
example, 3 sites must be locked for a write and 2 sites for a read. This ensures that no two
transactions can read and write the same item at the same time.

CASE 2

Read Quorum Qr = 1, Write Quorum Qw = 4, Site’s weight = 1, Total weight of sites S = 4

Read Lock: a read lock requires any one site.

Write Lock: a write lock requires all 4 sites.

Note that a read requires any one site and a write requires all the sites, which is exactly the
behaviour of the Biased protocol. Likewise, if we give the read and write quorums the same
value 3, the scheme resembles the Majority Based protocol. That is why the Quorum Consensus
protocol is described as a generalization of the techniques above.

Points to note:

 The quorums must be chosen very carefully. For example, if read operations are frequent,
we would choose a small read quorum value so as to make reads faster on the available replicas,
and so on.

 The chosen read quorum must intersect with the write quorum (Qr + Qw > S) to avoid read-write
conflicts.
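
The quorum conditions and the way a transaction gathers enough weight can be sketched in Python (the weights and quorum values below are the hypothetical ones from the example: four sites of weight 1, Qr = 2, Qw = 3).

# Sketch of Quorum Consensus: validate the quorums, then assemble enough weight.
weights = {"S1": 1, "S2": 1, "S3": 1, "S4": 1}
S = sum(weights.values())            # total weight of the sites holding the replica
Qr, Qw = 2, 3

assert Qr + Qw > S                   # avoids read-write conflicts
assert 2 * Qw > S                    # avoids write-write conflicts

def assemble_quorum(quorum, available_sites):
    # Pick sites until the sum of their weights reaches the required quorum.
    chosen, total = [], 0
    for site in available_sites:
        chosen.append(site)
        total += weights[site]
        if total >= quorum:
            return chosen
    return None                      # not enough weight reachable: the request waits

print(assemble_quorum(Qr, ["S1", "S2", "S3", "S4"]))   # read: e.g. ['S1', 'S2']
print(assemble_quorum(Qw, ["S1", "S2", "S3", "S4"]))   # write: e.g. ['S1', 'S2', 'S3']

Setting Qr = 1 and Qw = 4 reproduces the Biased protocol, while Qr = Qw = 3 behaves like the Majority Based protocol, which is why Quorum Consensus is viewed as a generalization of both.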
Database Design
Database systems are designed to manage large bodies of information. These
large bodies of information do not exist in isolation. They are part of the operation of some
enterprise whose end product may be information from the database or may be some device or
service for which the database plays only a supporting role. Database design mainly involves the
design of the database schema. The design of a complete database application environment that
meets the needs of the enterprise being modeled requires attention to a broader set of issues. In
this text, we focus initially on the writing of database queries and the design of database
schemas. Chapter 9 discusses the overall process of application design.

Design Process

A high-level data model provides the database designer with a conceptual
framework in which to specify the data requirements of the database users, and how the database
will be structured to fulfill these requirements. The initial phase of database design, then, is to
characterize fully the data needs of the prospective database users. The database designer needs
to interact extensively with domain experts and users to carry out this task. The outcome of this
phase is a specification of user requirements. Next, the designer chooses a data model, and by
applying the concepts of the chosen data model, translates these requirements into a conceptual
schema of the database. The schema developed at this conceptual-design phase provides a
detailed overview of the enterprise. The designer reviews the schema to confirm that all data
requirements are indeed satisfied and are not in conflict with one another. The designer can also
examine the design to remove any redundant features. The focus at this point is on describing the
data and their relationships, rather than on specifying physical storage details. In terms of the
relational model, the conceptual-design process involves decisions on what attributes we want to
capture in the database and how to group these attributes to form the various tables. The “what”
part is basically a business decision, and we shall not discuss it further in this text. The “how”
part is mainly a computer-science problem. There are principally two ways to tackle the problem.
The first one is to use the entity-relationship model the other is to employ a set of algorithms
(collectively known as normalization) that takes as input the set of all attributes and generates a
set of tables. A fully developed conceptual schema indicates the functional requirements of the
enterprise. In a specification of functional requirements, users describe the kinds of operations
(or transactions) that will be performed on the data. Example operations include modifying or
updating data, searching for and retrieving specific data, and deleting data. At this stage of
conceptual design, the designer can review the schema to ensure it meets functional
requirements. The process of moving from an abstract data model to the implementation of the
database proceeds in two final design phases. In the logical-design phase, the designer maps the
high-level conceptual schema onto the implementation data model of the database system that
will be used. The designer uses the resulting system-specific database schema in the subsequent
physical-design phase, in which the physical features of the database are specified. These
features include the form of file organization and the internal storage structures;

Database Design for a University Organization

To illustrate the design process, let us examine
how a database for a university could be designed. The initial specification of user requirements
may be based on interviews with the database users, and on the designer’s own analysis of the
organization. The description that arises from this design phase serves as the basis for specifying
the conceptual structure of the database. Here are the major characteristics of the university.

• The university is organized into departments. Each department is identified by a unique name
(dept name), is located in a particular building, and has a budget.
• Each department has a list of courses it offers. Each course has associated with it a course id,
title, dept name, and credits, and may also have associated prerequisites.

• Instructors are identified by their unique ID. Each instructor has a name, an associated department
(dept name), and a salary.

• Students are identified by their unique ID. Each student has a name, an associated major department
(dept name), and tot cred (total credit hours the student has earned thus far).

• The university maintains a list of classrooms, specifying the name of the building, room number,
and room capacity.

• The university maintains a list of all classes (sections) taught. Each section is identified by a
course id, sec id, year, and semester, and has associated with it a semester, year, building, room
number, and time slot id (the time slot when the class meets).

• The department has a list of teaching assignments specifying, for each instructor, the sections
the instructor is teaching.

• The university has a list of all student course registrations, specifying, for each student, the
courses and the associated sections that the student has taken (registered for). A real university
database would be much more complex than the preceding design. However we use this
simplified model to help you understand conceptual ideas without getting lost in details of a
complex design.

The Entity-Relationship Model

The entity-relationship (E-R) data model uses a collection of
basic objects, called entities, and relationships among these objects. An entity is a “thing” or
“object” in the real world that is distinguishable from other objects. For example, each person is
an entity, and bank accounts can be considered as entities.

Entities are described in a database by a set of attributes. For example, the attributes dept name, building,
and budget may describe one particular department in a university, and they form attributes of the
department entity set. Similarly, attributes ID, name, and salary may describe an instructor entity.

The extra attribute ID is used to identify an instructor uniquely (since it may be possible to have two
instructors with the same name and the same salary). A unique instructor identifier must be assigned to
each instructor. In the United States, many organizations use the social-security number of a person (a
unique number the U.S. government assigns to every person in the United States) as a unique identifier.

A relationship is an association among several entities. For example, a member relationship associates an
instructor with her department. The set of all entities of the same type and the set of all relationships of
the same type are termed an entity set and relationship set, respectively.

The overall logical structure (schema) of a database can be expressed graphically by an entity-relationship
(E-R) diagram. There are several ways in which to draw these diagrams. One of the most popular is to use
the Unified Modeling Language (UML). In the notation we use, which is based on UML, an E-R diagram
is represented as follows:

Figure 1.3 A sample E-R diagram (entity sets instructor and department, with attributes ID, name, salary and dept_name, building, budget, connected by the member relationship).

• Entity sets are represented by a rectangular box with the entity set name in the header and the attributes
listed below it.

• Relationship sets are represented by a diamond connecting a pair of related entity sets. The name of the
relationship is placed inside the diamond. As an illustration, consider part of a university database
consisting of instructors and the departments with which they are associated. Figure 1.3 shows the
corresponding E-R diagram. The E-R diagram indicates that there are two entity sets, instructor and
department, with attributes as outlined earlier. The diagram also shows a relationship member between
instructor and department. In addition to entities and relationships, the E-R model represents certain
constraints to which the contents of a database must conform. One important constraint is mapping
cardinalities, which express the number of entities to which another entity can be associated via a
relationship set. For example, if each instructor must be associated with only a single department, the E-R
model can express that constraint.

The entity-relationship model is widely used in database design, and Chapter 7 explores it in detail.

1.6.4 Normalization

Another method for designing a relational database is to use a process commonly known as
normalization. The goal is to generate a set of relation schemas that allows us to store information without
unnecessary redundancy, yet also allows us to retrieve information easily. The approach is to design
schemas that are in an appropriate normal form. To determine whether a relation schema is in one of the
desirable normal forms, we need additional information about the real-world enterprise that we are
modeling with the database. The most common approach is to use functional dependencies, which we
cover in Section 8.4. To understand the need for normalization, let us look at what can go wrong in a bad
database design. Among the undesirable properties that a bad design may have are:

• Repetition of information

• Inability to represent certain information

Distributed Database Design

Data Fragmentation

Data Replication

Data Allocation

Client/Server Architecture
In this section, you will be introduced to distributed database design issues. These include data
fragmentation, data replication and data allocation. We will also look briefly at how distributed
database capabilities are implemented within a client/server architecture.

Data Fragmentation

Data fragmentation is a technique used to break up objects. In designing a distributed database,
you must decide which portion of the database is to be stored where. One technique is to break
the database up into logical units called fragments. Fragmentation information is stored in
a distributed data catalogue which the processing computer uses to process a user's request.

As a point of discussion, we can look at data fragmentation in terms of relations or tables. The
following matrix describes the different types of fragmentation that can be used.

Horizontal fragmentation This type of fragmentation refers to the division of a relation into
fragments of rows. Each fragment is stored at a different
computer or node, and each fragment contains unique rows.
Each horizontal fragment may have a different number of rows,
but every fragment must have the same attributes.

Vertical fragmentation This type of fragmentation refers to the division of a relation
into fragments that each comprise a collection of attributes.
Each vertical fragment must have the same number of rows; the
fragments contain different subsets of attributes, but each
typically retains the key so that the original relation can be reconstructed.

Mixed fragmentation This type of fragmentation is a two-step process. First,


horizontal fragmentation is done to obtain the necessary rows,
then vertical fragmentation is done to divide the attributes
among the rows.

Data Replication

Data replication is the storage of data copies at multiple sites on the network. Fragment copies
can be stored at several sites, thus enhancing data availability and response time. Replicated data
is subject to a mutual consistency rule, which requires that all copies of a data fragment be
identical, so as to ensure data consistency among all of the replicas.

Although data replication is beneficial in terms of availability and response times, the
maintenance of the replications can become complex. For example, if data is replicated over
multiple sites, the DDBMS must decide which copy to access. For a query operation, the nearest
copy is all that is required to satisfy a transaction. However, if the operation is an update, then all
copies must be selected and updated to satisfy the mutual consistency rule.

A database can be either fully replicated, partially replicated or unreplicated.


Full replication Stores multiple copies of each database fragment at multiple
sites. Fully replicated databases can be impractical because of
the amount of overhead imposed on the system.

Partial replication Stores multiple copies of some database fragments at multiple


sites. Most DDBMS can handle this type of replication very
well.

No replication Stores each database fragment at a single site. No duplication


occurs.

Data replication is particularly useful if usage frequency of remote data is high and the database
is fairly large. Another benefit of data replication is the possibility of restoring lost data at a
particular site.
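
The mutual consistency rule described above can be sketched in Python: reads go to the nearest copy, while updates must be applied to every copy (the fragment name, site names, and distances below are purely hypothetical).

# Sketch of reading from the nearest replica and updating all replicas.
replica_distance = {"CUSTOMER_frag1": {"S1": 5, "S2": 20, "S3": 40}}  # site -> distance
storage = {"S1": {}, "S2": {}, "S3": {}}

def read_fragment(fragment, key):
    # A query can be satisfied by the nearest copy alone.
    sites = replica_distance[fragment]
    nearest = min(sites, key=sites.get)
    return storage[nearest].get(key)

def update_fragment(fragment, key, value):
    # An update must reach every copy so all replicas stay identical.
    for site in replica_distance[fragment]:
        storage[site][key] = value

update_fragment("CUSTOMER_frag1", 42, {"name": "Asha"})
print(read_fragment("CUSTOMER_frag1", 42))                       # served from S1
assert all(storage[s][42] == {"name": "Asha"} for s in storage)  # mutual consistency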

Data Allocation

Data allocation is the process of deciding where to store the data, that is, which data is stored
at which location. Data allocation can be centralised,
partitioned or replicated.

Centralised The entire database is stored at one site. No distribution


occurs.

Partitioned The database is divided into several fragments that are stored
at several sites.

Replicated Copies of one or more database fragments are stored at several


sites.

Client Server Architecture

Implementation of a distributed database system must be carefully managed within a client


server architecture. Typically, the server provides the resources for the client to use. The client
receives the request from the user and passes it to the server. The server receives,
schedules and executes the requests, selecting and returning only what the client requires, and
data is sent only when the client requests it.

There are advantages to implementing a distributed database system using client server
architecture:
Cost Client server systems are less expensive than mainframes.
There are also considerable cost savings in off-loading
applications development from the mainframe to PCs.

PC Functionality and Use Many users are more familiar and more skilled with PC
technology than they are with mainframe technology. The use
of PC technology is more widespread in the workplace.

Data Analysis and Query Tools These tools are readily available in the marketplace and can
be used with many database systems.

There are some disadvantages however, in that the client server environment is more complex
and thus requires more management resources. Security is also another issue because of the
number of users and sites.

Distributed Database Architectures

A distributed database is a database that consists of two or more files located at different sites,
either on the same network or on entirely different networks. Portions of the database are stored
in multiple physical locations and processing is distributed among multiple database nodes. A
centralized distributed database management system (DDBMS) integrates data logically so it can
be managed as if it were all stored in the same location. The DDBMS synchronizes all the data
periodically and ensures that data updates and deletes performed at one location are automatically
reflected in the data stored elsewhere. By contrast, a centralized database consists of a single
database file located at one site using a single network.

A distributed DBMS is the software system that permits the management of the distributed database
and makes the distribution transparent to users.
A Distributed Database Management System (DDBMS) consists of a single logical database that
is split into a number of fragments. Each fragment is stored on one or more computers under the
control of a separate DBMS, with the computers connected by a communications network. Each
site is capable of independently processing user requests that require access to local data (that is,
each site has some degree of local autonomy) and is also capable of processing data stored on
other computers in the network.
Users access the distributed database via applications. Applications are classified as those that do
not require data from other sites (local Applications) and those that do require data from other
sites (global applications). We require a DDBMS to have at least one global application.

Characteristics of Distributed Database Management System


A DDBMS has the following characteristics:
 A collection of logically related shared data;

 The data is split into a number of fragments;


 Fragments may be replicated;
 Fragments/replicas are allocated to sites;
 The sites are linked by a communications network;
 The data at each site is under the control of a DBMS;
 The DBMS at each site can handle local applications, autonomously;
 Each DBMS participates in at least one global application

It is not necessary for every site in the system to have its own local database, as illustrated by the
topology of the DDBMS shown in figure.
Banking Example
Using distributed database technology, a bank may implement their database system on a number
of separate computer systems rather than a single, centralized mainframe. The computer systems
may be located at each local branch office: for example, Amritsar, Jalandhar, and Qadian. A
network linking the computers will enable the branches to communicate with each other, and the
DDBMS will enable them to access data stored at another branch office. Thus a client living in
Amritsar can also check his/her account during the stay in Jalandhar or Qadian.
From the definition of the DDBMS, the system is expected to make the distribution transparent
(invisible) to the user. Thus, the fact that a distributed database is split into fragments that can be
stored on different computers and perhaps replicated should be hidden from the user. The
objective of transparency is to make the distributed system appear like a centralized system. This
is sometimes referred to as the fundamental principle of distributed DBMSs. This requirement
provides significant functionality for the end-user but, unfortunately, creates many additional
problems that have to be handled by the DDBMS.
Distributed Processing
It is important to make a distinction between a distributed DBMS and distributed processing.
Distributed processing refers to a centralized database that can be accessed over a computer network.
The key point in the definition of a distributed DBMS is that the system consists of data that is
physically distributed across a number of sites in the network. If the data is centralized, then even
though other users may be accessing the data over the network, we do not consider this to be a
distributed DBMS, but simply distributed processing. We illustrate the topology of distributed
processing in the figure. Compare this figure, which has a central database at site 2, with the figure
which shows several sites each with their own database (DB).

Parallel DBMSs
A parallel DBMS is a DBMS running across multiple processors and disks that is designed to
execute operations in parallel, whenever possible, in order to improve performance. Parallel
DBMSs are based on the idea that single-processor systems can no longer meet the
growing requirements for cost-effective scalability, reliability, and performance. A powerful and
financially attractive alternative to a single-processor-driven DBMS is a parallel DBMS driven by
multiple processors. Parallel DBMSs link multiple, smaller machines to achieve the same
throughput as a single, larger machine, often with greater scalability and reliability than single-
processor DBMSs.
To provide multiple processors with common access to a single database, a parallel DBMS must
provide for shared resource management. Which resources are shared, and how those shared
resources are implemented, directly affects the performance and scalability of the system, which,
in turn, determines its appropriateness for a given application environment.
The three main architectures for parallel DBMS, as illustrated in figure are:
• Shared Memory
• Shared Disk
• Shared Nothing
Shared Memory: Shared memory is a tightly coupled architecture in which multiple processors
within a single system share system memory as shown in figure. Known as symmetric
multiprocessing (SMP), this approach has become popular on platforms ranging from personal
workstations that support a few microprocessors in parallel, to large RISC (Reduced Instruction
Set Computer) based machines, all the way up to the largest mainframes. This architecture
provides high-speed data access for a limited number of processors, but it is not scalable beyond
about 64 processors when the interconnection network becomes a bottleneck.
Shared Disk: Shared disk is a loosely coupled architecture optimized for applications that are
inherently centralized and require high availability and performance. Each processor can access
all disks directly, but each has its own private memory as shown in figure. Shared disk systems
are sometimes referred to as a cluster.
Shared Nothing: Shared nothing, often known as Massively Parallel Processing (MPP), is a
multiple processor architecture in which each processor is part of a complete system, with its
own memory and disk storage as shown in fig. The database is partitioned among all the disks on
each system associated with the database, and data is transparently available to users on all
systems. This architecture is more scalable than shared memory and can easily support a large
number of processors. However, performance is optimal only when requested data is stored
locally.
While the shared nothing definition sometimes includes distributed DBMSs, the distribution of
data in a parallel DBMS is based solely on performance considerations. Further, the nodes of a
DDBMS are typically geographically distributed, separately administered, and have a slower
interconnection network, whereas the nodes of a parallel DBMS are typically within the same
computer or within the same site.
Parallel technology is typically used for very large databases, possibly of the order of terabytes
(10^12 bytes), or for systems that have to process thousands of transactions per second. These systems
need access to large volumes of data and must provide timely responses to queries. A parallel
DBMS can use the underlying architecture to improve the performance of complex query
execution using parallel scan, join and sort techniques that allow multiple processor nodes
automatically to share the processing workload.


Features of distributed databases

When in a collection, distributed databases are logically interrelated with each other, and they
often represent a single logical database. With distributed databases, data is physically stored
across multiple sites and managed independently at each site. The processors at the sites are connected by a
network; they do not have any multiprocessor configuration.

Parallel versus Distributed Architectures


There are two main types of multiprocessor system architectures that are commonplace:
Shared memory (tightly coupled) architecture. Multiple processors share secondary
(disk) storage and also share primary memory.
Shared disk (loosely coupled) architecture. Multiple processors share secondary (disk)
storage, but each has its own primary memory.
These architectures enable processors to communicate without the overhead of exchanging
messages over a network. Database management systems developed using the above types of
architectures are termed parallel database management systems rather than DDBMSs, since
they utilize parallel processor technology. Another type of multiprocessor architecture is
called shared nothing architecture. In this architecture, every processor has its own primary
and secondary (disk) memory, no common memory exists, and the processors communicate over
a high-speed interconnection network (bus or switch). Although the shared nothing architecture
resembles a distributed database computing environment, major differences exist in the mode of
operation. In shared nothing multiprocessor systems, there is symmetry and homogeneity of
nodes; this is not true of the distributed database environment where heterogeneity of hardware
and operating system at each node is very common. Shared nothing architecture is also
considered as an environment for parallel databases. Figure 25.3a illustrates a parallel database
(shared nothing), whereas Figure 25.3b illustrates a centralized database with distributed access
and Figure 25.3c shows a pure distributed database. We will not expand on parallel architectures
and related data management issues here.
2. General Architecture of Pure Distributed Databases

In this section we discuss both the logical and component architectural models of a DDB. In
Figure 25.4, which describes the generic schema architecture of a DDB, the enterprise is
presented with a consistent, unified view showing the logical structure of underlying data across
all nodes. This view is represented by the global conceptual schema (GCS), which provides
network transparency (see Section 25.1.2). To accommodate potential heterogeneity in the DDB,
each node is shown as having its own local internal schema (LIS) based on physical organization
details at that particular site. The logical organization of data at each site is specified by the local
conceptual schema (LCS). The GCS, LCS, and their underlying mappings provide the
fragmentation and replication transparency discussed in Section 25.1.2. Figure 25.5 shows the
component architecture of a DDB. It is an extension of its centralized counterpart (Figure 2.3) in
Chapter 2. For the sake of simplicity, common elements are not shown here. The global query
compiler references the global conceptual schema from the global system catalog to verify and
impose defined constraints. The global query optimizer references both global and local
conceptual schemas and generates optimized local queries from global queries. It evaluates all
candidate strategies using a cost function that estimates cost based on response time (CPU, I/O,
and network latencies) and estimated sizes of intermediate results. The latter is particularly
important in queries involving joins. Having computed the cost for each candidate, the optimizer
selects the candidate with the minimum cost for execution. Each local DBMS would have its own
local query optimizer, transaction manager, and execution engine, as well as the local system
catalog, which houses the local schemas. The global transaction manager is responsible for
coordinating the execution across multiple sites in conjunction with the local transaction
manager at those sites.
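The optimizer's choice among candidate strategies can be pictured with a small sketch. The snippet below is purely illustrative: the cost model (a simple sum of CPU, I/O, and network terms, with the network term weighted by the size of intermediate results) and the candidate plans are invented for the example, not taken from any particular DDBMS.

```python
# Illustrative cost-based plan selection by a global query optimizer.
# The cost model and the candidate plans are invented for this sketch.

def plan_cost(plan):
    # Estimated response time: CPU + I/O + network transfer, where the
    # network term grows with the size of intermediate results.
    return (plan["cpu_secs"]
            + plan["io_secs"]
            + plan["secs_per_mb"] * plan["intermediate_mb"])

candidate_plans = [
    {"name": "ship whole relation", "cpu_secs": 2.0, "io_secs": 5.0,
     "secs_per_mb": 0.4, "intermediate_mb": 120},
    {"name": "semijoin reduction",  "cpu_secs": 3.5, "io_secs": 6.0,
     "secs_per_mb": 0.4, "intermediate_mb": 15},
]

best = min(candidate_plans, key=plan_cost)
print(best["name"], round(plan_cost(best), 1))  # semijoin reduction 15.5
```

Here the plan that produces the smaller intermediate result wins even though its local CPU and I/O costs are higher, which is exactly why intermediate-result size matters so much in distributed joins.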
3. Federated Database Schema Architecture
A typical five-level schema architecture to support global applications in the FDBS environment is
shown in Figure 25.6. In this architecture, the local schema is the conceptual schema (full
database definition) of a component database, and the component schema is derived by
translating the local schema into a canonical data model or common data model (CDM) for the
FDBS. Schema translation from the local schema to the component schema is accompanied by
generating mappings to transform commands on a component schema into commands on the
corresponding local schema. The export schema represents the subset of a component schema
that is available to the FDBS. The federated schema is the global schema or view, which is the
result of integrating all the shareable export schemas. The external schemas define the schema
for a user group or an application, as in the three-level schema architecture.

All the problems related to query processing, transaction processing, and directory and metadata
management and recovery apply to FDBSs with additional considerations. It is not within our
scope to discuss them in detail here.

4. An Overview of Three-Tier Client-Server Architecture

As we pointed out in the chapter introduction, full-scale DDBMSs have not been developed to
support all the types of functionalities that we have discussed so far. Instead, distributed database
applications are being developed in the context of the client-server architectures. We introduced
the two-tier client-server architecture in Section 2.5. It is now more common to use a three-tier
architecture, particularly in Web applications. This architecture is illustrated in Figure 25.7.
In the three-tier client-server architecture, the following three layers exist:
1. Presentation layer (client). This provides the user interface and interacts with the user.
The programs at this layer present Web interfaces or forms to the client in order to interface with
the application. Web browsers are often utilized, and the languages and specifications used
include HTML, XHTML, CSS, Flash, MathML, Scalable Vector Graphics (SVG), Java,
JavaScript, Adobe Flex, and others. This layer handles user input, output, and navigation by
accepting user commands and displaying the needed information, usually in the form of static or
dynamic Web pages. The latter are employed when the interaction involves database access.
When a Web interface is used, this layer typically communicates with the application layer via
the HTTP protocol.

2. Application layer (business logic). This layer implements the application logic. For example,
queries can be formulated based on user input from the client, or query results can be formatted
and sent to the client for presentation. Additional application functionality can be handled at this
layer, such

as security checks, identity verification, and other functions. The application layer can interact
with one or more databases or data sources as needed by connecting to the database using
ODBC, JDBC, SQL/CLI, or other database access techniques.

3. Database server. This layer handles query and update requests from the application layer,
processes the requests, and sends the results. Usually SQL is used to access the database if it is
relational or object-relational and stored database procedures may also be invoked. Query results
(and queries) may be formatted into XML (see Chapter 12) when transmitted between the
application server and the database server.
Exactly how to divide the DBMS functionality between the client, application server, and
database server may vary. The common approach is to include the functionality of a centralized
DBMS at the database server level. A number of relational DBMS products have taken this
approach, where an SQL server is provided. The application server must then formulate the
appropriate SQL queries and connect to the database server when needed. The client provides the
processing for user interface interactions. Since SQL is a relational standard, various SQL
servers, possibly provided by different vendors, can accept SQL commands through standards
such as ODBC, JDBC, and SQL/CLI.
In this architecture, the application server may also refer to a data dictionary that includes
information on the distribution of data among the various SQL servers, as well as modules for
decomposing a global query into a number of local queries that can be executed at the various
sites. Interaction between an application server and database server might proceed as follows
during the processing of an SQL query:
1. The application server formulates a user query based on input from the client layer and
decomposes it into a number of independent site queries. Each site query is sent to the
appropriate database server site.
2. Each database server processes the local query and sends the results to the application
server site. Increasingly, XML is being touted as the standard for data exchange so the database
server may format the query result into XML before sending it to the application server.
3. The application server combines the results of the subqueries to produce the result of the
originally required query, formats it into HTML or some other form accepted by the client, and
sends it to the client site for display.
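The three-step interaction above can be sketched in code. The snippet below is a minimal illustration rather than any product's API: the site names, the decomposition rule, and the combine step are hypothetical stand-ins for the application server's data dictionary and decomposition modules.

```python
# Illustrative sketch of the application-server side of the three-tier flow.
# Site names, the decomposition rule, and combine() are hypothetical.

def decompose(global_query):
    # Split a global query into independent site queries using a
    # (hypothetical) data-distribution dictionary.
    return {"site_A": "SELECT ... FROM orders_fragment_A",
            "site_B": "SELECT ... FROM orders_fragment_B"}

def send_to_site(site, query):
    # Stand-in for an ODBC/JDBC call to the database server at `site`.
    return [("row-from", site)]

def combine(partial_results):
    # Merge subquery results into the final answer (e.g., union of fragments).
    merged = []
    for rows in partial_results.values():
        merged.extend(rows)
    return merged

def run_global_query(global_query):
    site_queries = decompose(global_query)
    partials = {site: send_to_site(site, q) for site, q in site_queries.items()}
    return combine(partials)  # formatted into HTML/XML before returning to the client

print(run_global_query("SELECT * FROM orders"))
```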
The application server is responsible for generating a distributed execution plan for a multisite
query or transaction and for supervising distributed execution by sending commands to servers.
These commands include local queries and transactions to be executed, as well as commands to
transmit data to other clients or servers. Another function controlled by the application server (or
coordinator) is that of ensuring consistency of replicated copies of a data item by employing
distributed (or global) concurrency control techniques. The application server must also ensure
the atomicity of global transactions by performing global recovery when certain sites fail.
If the DDBMS has the capability to hide the details of data distribution from the application
server, then it enables the application server to execute global queries and transactions as though
the database were centralized, without having to specify the sites at which the data referenced in
the query or transaction resides. This property is called distribution transparency. Some
DDBMSs do not provide distribution transparency, instead requiring that applications be aware
of the details of data distribution.
UNIT-IV

Introduction
Signal processing is a discipline in electrical engineering and mathematics that deals with the
analysis and processing of analog and digital signals, including the storing, filtering, and other
operations performed on them. These signals include transmission signals, sound or voice signals,
image signals, and so on.
Among all these signals, the field that deals with signals for which the input is an image and the
output is also an image is image processing. As its name suggests, it deals with the processing of
images.
It can be further divided into analog image processing and digital image processing.

Analog image processing


Analog image processing is done on analog signals. It includes processing on two-dimensional
analog signals. In this type of processing, the images are manipulated by electrical means by
varying the electrical signal. A common example is the television image.
Digital image processing has dominated over analog image processing with the passage of time
due to its wider range of applications.

Digital image processing


Digital image processing deals with developing a digital system that performs operations on
a digital image.

What is an Image
An image is nothing more than a two-dimensional signal. It is defined by the mathematical
function f(x,y), where x and y are the two coordinates, horizontal and vertical.
The value of f(x,y) at any point gives the pixel value at that point of the image.
The figure above is an example of a digital image like the one you are now viewing on your computer
screen. In reality, this image is nothing but a two-dimensional array of numbers ranging
between 0 and 255.

128 30 123

232 123 231

123 77 89

80 255 255

Each number represents the value of the function f(x,y) at a particular point. In this case, the values 128,
30, and 123 in the first row each represent an individual pixel value. The dimensions of the picture are the
dimensions of this two-dimensional array.
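For readers who prefer code, the sample array above can be written directly as a small two-dimensional array, and indexing it returns the pixel value f(x,y). This is only an illustration using plain Python lists; the values mirror the sample table.

```python
# A grayscale digital image is just a 2-D array of intensity values f(x, y).
# The 4x3 array below mirrors the sample values given above (0-255 range).
image = [
    [128,  30, 123],
    [232, 123, 231],
    [123,  77,  89],
    [ 80, 255, 255],
]

height, width = len(image), len(image[0])
print(f"dimensions: {width} x {height}")
print("f(1, 0) =", image[0][1])   # pixel value 30 at column x=1, row y=0
```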

Relationship between a digital image and a signal


If the image is a two-dimensional array, then what does it have to do with a signal?

Signal

In the physical world, any quantity measurable through time, over space, or in any higher dimension
can be taken as a signal. A signal is a mathematical function, and it conveys some information.
A signal can be one-dimensional, two-dimensional, or of higher dimension. A one-dimensional
signal is a signal that is measured over time. The common example is a voice
signal.
Two-dimensional signals are those that are measured over some other physical quantities.
An example of a two-dimensional signal is a digital image. We will look in more detail, in the
next tutorial, at how one-dimensional, two-dimensional, and higher-dimensional signals are formed
and interpreted.

Relationship

Anything that conveys information or broadcasts a message between two observers in the physical
world is a signal. That includes speech (the human voice) or an image. When we speak, our voice is
converted into a sound wave/signal and transmitted over time to the person we are speaking to.
Similarly, in the way a digital camera works, acquiring an image involves the transfer of a signal
from one part of the system to another.
How a digital image is formed
Since capturing an image from a camera is a physical process. The sunlight is used as a source
of energy. A sensor array is used for the acquisition of the image. So when the sunlight falls
upon the object, then the amount of light reflected by that object is sensed by the sensors, and a
continuous voltage signal is generated by the amount of sensed data. In order to create a digital
image , we need to convert this data into a digital form. This involves sampling and
quantization. (They are discussed later on). The result of sampling and quantization results in an
two dimensional array or matrix of numbers which are nothing but a digital image.
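A rough sketch of that pipeline in code: a made-up continuous sensor signal is sampled at discrete positions, and each sample is quantized to one of 256 levels, yielding one row of pixel values. Everything here (the signal shape, the sample count) is illustrative.

```python
# Minimal sketch of how sampling and quantization turn a continuous
# sensor signal into digital pixel values (the signal here is synthetic).
import math

def sensor_voltage(t):
    # Hypothetical continuous voltage produced by one row of the sensor array.
    return 0.5 + 0.5 * math.sin(2 * math.pi * t)

num_samples = 8            # sampling: measure the signal at discrete positions
levels = 256               # quantization: map each sample to 0..255

samples = [sensor_voltage(i / num_samples) for i in range(num_samples)]
pixels = [min(levels - 1, int(s * levels)) for s in samples]
print(pixels)              # one row of a digital image
```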

Overlapping fields
Machine/Computer vision

Machine vision or computer vision deals with developing a system in which the input is an
image and the output is some information. For example: developing a system that scans a human
face and opens any kind of lock. Such a system would look something like this.

Computer graphics

Computer graphics deals with the formation of images from object models, rather than from
images captured by some device. For example: object rendering, i.e., generating an image from an
object model. Such a system would look something like this.
Artificial intelligence

Artificial intelligence is more or less the study of putting human intelligence into machines.
Artificial intelligence has many applications in image processing. For example: developing
computer-aided diagnosis systems that help doctors in interpreting X-ray, MRI, and similar images,
and then highlighting conspicuous sections to be examined by the doctor.

Signal processing

Signal processing is an umbrella field, and image processing lies under it. The light
reflected by an object in the physical (3D) world passes through the lens of the camera
and becomes a 2D signal, which results in image formation. This image is then digitized
using methods of signal processing, and the resulting digital image is manipulated in digital image
processing.

Data compression
Data compression is a function of the presentation layer in the OSI reference model. Compression is
often used to maximize the use of bandwidth across a network or to optimize disk space when
saving data.
There are two general types of compression algorithms:

1. Lossless compression
2. Lossy compression

Lossless Compression
Lossless compression compresses the data in such a way that when the data is decompressed it is
exactly the same as it was before compression, i.e., there is no loss of data.
Lossless compression is used to compress file data such as executable code, text files, and
numeric data, because programs that process such file data cannot tolerate mistakes in the data.
Lossless compression will typically not compress a file as much as lossy compression techniques
and may take more processing power to accomplish the compression.

Lossless Compression Algorithms


The various algorithms used to implement lossless data compression are :

1. Run length encoding


2. Differential pulse code modulation
3. Dictionary based encoding
1. Run length encoding
• This method replaces the consecutive occurrences of a given symbol with only one copy of the
symbol along with a count of how many times that symbol occurs. Hence the name 'run-length'.
• For example, the string AAABBCDDDD would be encoded as 3A2B1C4D.
• A real-life example where run-length encoding is quite effective is the fax machine. Most faxes
are white sheets with the occasional black text. So, a run-length encoding scheme can take each
line and transmit a code for white and then the number of pixels, then the code for black and the
number of pixels, and so on.
• This method of compression must be used carefully. If there is not a lot of repetition in the data
then it is possible the run length encoding scheme would actually increase the size of a file.
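A minimal encoder for the scheme just described might look as follows; it reproduces the AAABBCDDDD → 3A2B1C4D example and, as a simplification, emits a count even for runs of length one.

```python
# A minimal run-length encoder matching the example in the text
# ("AAABBCDDDD" -> "3A2B1C4D"). Counts are emitted even for single runs.
def run_length_encode(data: str) -> str:
    if not data:
        return ""
    out = []
    run_char, run_len = data[0], 1
    for ch in data[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(f"{run_len}{run_char}")
            run_char, run_len = ch, 1
    out.append(f"{run_len}{run_char}")
    return "".join(out)

print(run_length_encode("AAABBCDDDD"))  # 3A2B1C4D
```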
2. Differential pulse code modulation
• In this method first a reference symbol is placed. Then for each symbol in the data, we place
the difference between that symbol and the reference symbol used.
• For example, using symbol A as the reference symbol, the string AAABBCDDDD would be
encoded as A0001123333, since A is the same as the reference symbol, B has a difference of 1
from the reference symbol, and so on.
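A small sketch of this differential scheme, using the first symbol of the string as the reference, exactly as in the example above:

```python
# Minimal sketch of the differential scheme described above: place a
# reference symbol first, then the difference of every symbol from it
# ("AAABBCDDDD" with reference 'A' -> "A0001123333").
def encode_differences(data: str) -> str:
    reference = data[0]
    diffs = "".join(str(ord(ch) - ord(reference)) for ch in data)
    return reference + diffs

print(encode_differences("AAABBCDDDD"))  # A0001123333
```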
3. Dictionary based encoding
• One of the best known dictionary based encoding algorithms is Lempel-Ziv (LZ) compression
algorithm.
• This method is also known as substitution coder.
• In this method, a dictionary (table) of variable length strings (common phrases) is built.
• This dictionary contains almost every string that is expected to occur in data.
• When any of these strings occur in the data, then they are replaced with the corresponding
index to the dictionary.
• In this method, instead of working with individual characters in text data, we treat each word as
a string and output the index in the dictionary for that word.
• For example, let us say that the word "compression" has the index 4978 in one particular
dictionary; it is the 4978th word in /usr/share/dict/words. To compress a body of text, each time
the string "compression" appears, it would be replaced by 4978.
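A toy illustration of the idea follows (not the actual LZ algorithm, which builds its dictionary adaptively): each word is replaced by its index in a shared word list. The word list here is invented for the example.

```python
# Toy dictionary coder in the spirit of the description above: each word is
# replaced by its index in a shared word list. The word list is made up;
# a real coder (e.g., LZW) builds its dictionary adaptively.
dictionary = ["the", "quick", "data", "compression", "of"]
index_of = {word: i for i, word in enumerate(dictionary)}

def encode(text: str):
    return [index_of[w] for w in text.split()]

def decode(indexes):
    return " ".join(dictionary[i] for i in indexes)

codes = encode("the compression of data")
print(codes)           # [0, 3, 4, 2]
print(decode(codes))   # the compression of data
```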

Lossy Compression
Lossy compression is the kind that does not promise that the data received is exactly the same as
the data sent, i.e., some data may be lost.
This is because a lossy algorithm removes information that it cannot later restore.
Lossy algorithms are used to compress still images, video and audio.
Lossy algorithms typically achieve much better compression ratios than the lossless algorithms.
Audio Compression
• Audio compression is used for speech or music.
• For speech, we need to compress a 64-kbps digitized signal; for music, we need to compress a
1.411-Mbps signal.

• Two types of techniques are used for audio compression:

1. Predictive encoding
2. Perceptual encoding

Predictive encoding
• In predictive encoding, the differences between the samples are encoded instead of encoding all
the sampled values.
• This type of compression is normally used for speech.
• Several standards have been defined, such as GSM (13 kbps), G.729 (8 kbps), and G.723.1 (6.3
or 5.3 kbps).
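A minimal sketch of the idea, on made-up sample values: the first sample is sent as-is and every later sample is represented by its difference from the previous one, which tends to produce small numbers that need fewer bits.

```python
# Minimal sketch of predictive (differential) encoding of audio samples:
# transmit the first sample, then only the difference from the previous one.
samples = [100, 102, 105, 105, 103, 98]   # made-up sample values

encoded = [samples[0]] + [samples[i] - samples[i - 1] for i in range(1, len(samples))]
print(encoded)             # [100, 2, 3, 0, -2, -5] -- small numbers need fewer bits

decoded = [encoded[0]]
for d in encoded[1:]:
    decoded.append(decoded[-1] + d)
print(decoded == samples)  # True: the original samples are recovered exactly
```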
Perceptual encoding
• Perceptual encoding is used to create CD-quality audio, which requires a transmission
bandwidth of 1.411 Mbps.
• MP3 (MPEG audio layer 3), a part of MPEG standard uses this perceptual encoding.
• Perceptual encoding is based on the science of psychoacoustics, a study of how people perceive
sound.
• The perceptual encoding exploits certain flaws in the human auditory system to encode a signal
in such a way that it sounds the same to a human listener, even if it looks quite different on an
oscilloscope.
• The key property of perceptual coding is that some sounds can mask other sound. For example,
imagine that you are broadcasting a live flute concert and all of a sudden someone starts striking
a hammer on a metal sheet. You will not be able to hear the flute any more. Its sound has been
masked by the hammer.
• Such a technique explained above is called frequency masking-the ability of a loud sound in
one frequency band to hide a softer sound in another frequency band that would have been
audible in the absence of the loud sound.
• Masking can also be done on the basis of time. For example, even after the hammer stops striking
the metal sheet, the flute will be inaudible for a short period of time, because the ear turns down
its gain when a loud sound starts and takes a finite time to turn it back up.
• Thus, a loud sound can numb our ears for a short time even after the sound has stopped. This
effect is called temporal masking.

MP3
• MP3 uses these two phenomena, i.e. frequency masking and temporal masking to compress
audio signals.
• In such a system, the technique analyzes and divides the spectrum into several groups. Zero bits
are allocated to the frequency ranges that are totally masked.
• A small number of bits are allocated to the frequency ranges that are partially masked.
• A larger number of bits are allocated to the frequency ranges that are not masked.
• Based on the range of frequencies in the original analog audio, MP3 produces three data rates:
96 kbps, 128 kbps, and 160 kbps.
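The bit-allocation idea can be sketched as follows. The band boundaries, masking labels, and bit counts below are invented for illustration; a real MP3 encoder derives them from a psychoacoustic model.

```python
# Illustrative sketch of masking-driven bit allocation as described above.
# Band boundaries, masking levels, and the allocation rule are made up.
bands = [
    {"range_khz": (0, 2), "masking": "none"},
    {"range_khz": (2, 4), "masking": "partial"},
    {"range_khz": (4, 8), "masking": "total"},
]

def bits_for(band):
    # Totally masked bands get 0 bits, partially masked bands get a few,
    # unmasked bands get the most.
    return {"total": 0, "partial": 4, "none": 12}[band["masking"]]

for band in bands:
    print(band["range_khz"], bits_for(band), "bits per sample")
```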

FILE FORMATS
Some popular file formats for information exchange are described below. One of the most
important is the 8 - bit GIF format, because of its historical connection to the WWW and HTML
markup language as the first image type recognized by net browsers. However, currently the
most important common file format is JPEG.

GIF
GIF is palette-based: the colors used in an image (a frame) in the file have their RGB values
defined in a palette table that can hold up to 256 entries, and the data for the image refer to the
colors by their indexes (0 – 255) in the palette table. The color definitions in the palette can be
drawn from a color space of millions of shades (2^24 shades, 8 bits for each primary), but the
maximum number of colors a frame can use is 256.
people could afford the hardware to display more colors simultaneously. Simple graphics, line
drawings, cartoons, and grey - scale photographs typically need fewer than 256 colors.

As a further refinement, each frame can designate one index as a "transparent background color":
any pixel assigned this index takes on the color of the pixel in the same position from the
background, which may have been determined by a previous frame of animation.

Many techniques, collectively called dithering, have been developed to approximate a wider
range of colors with a small color palette by using pixels of two or more colors to approximate in
between colors. These techniques sacrifice spatial resolution to approximate deeper color
resolution. While not part of the GIF specification, dithering can of course be used in images
subsequently encoded as GIF images. This is often not an ideal solution for GIF images, both
because the loss of spatial resolution typically makes an image look fuzzy on the screen, and
because the dithering patterns often interfere with the compressibility of the image data, working
against GIF's main purpose.

In the early days of graphical web browsers, graphics cards with 8 - bit buffers (allowing only
256 colors) were common and it was fairly common to make GIF images using the websafe
palette. This ensured predictable display, but severely limited the choice of colors. Now that 32 -
bit graphics cards, which support 24 - bit color, are the norm, palettes can be populated with the
optimum colors for individual images.

A small color table may suffice for small images, and keeping the color table small allows the
file to be downloaded faster. Both the 87a and 89a specifications allow color tables of 2^n colors
for any n from 1 through 8. Most graphics applications will read and display GIF images with
any of these table sizes; but some do not support all sizes when creating images. Tables of 2, 16,
and 256 colors are widely supported.
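As an illustration of the palette-based storage described above, the sketch below builds a small indexed GIF, assuming the third-party Pillow library is installed; the file name, image size, and 16-color table are arbitrary choices for the example.

```python
# Sketch of creating a palette-based (indexed) GIF with Pillow.
# File name, size, fill color, and palette size are illustrative.
from PIL import Image

rgb = Image.new("RGB", (64, 64), (200, 30, 30))                 # truecolor source
indexed = rgb.convert("P", palette=Image.ADAPTIVE, colors=16)   # build a <=256-entry palette
indexed.save("swatch.gif")                                      # GIF stores palette + pixel indexes
print(indexed.mode)  # 'P' -> each pixel is an index into the palette table
```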
(Figures: GIF file format; GIF screen descriptor; GIF color map; GIF image descriptor; GIF four-pass interlace display row order.)

JPEG
The JPEG compression algorithm is at its best on photographs and paintings of realistic scenes
with smooth variations of tone and color. For web usage, where the amount of data used for an
image is important, JPEG is very popular. JPEG / Exif is also the most common format saved by
digital cameras.

On the other hand, JPEG may not be as well suited for line drawings and other textual or iconic
graphics, where the sharp contrasts between adjacent pixels can cause noticeable artifacts. Such
images may be better saved in a lossless graphics format such as TIFF, GIF, PNG, or a raw
image format. The JPEG standard actually includes a lossless coding mode, but that mode is not
supported in most products.

As the typical use of JPEG is a lossy compression method, which somewhat reduces the image
fidelity, it should not be used in scenarios where the exact reproduction of the data is required
(such as some scientific and medical imaging applications and certain technical image processing
work).

JPEG is also not well suited to files that will undergo multiple edits, as some image quality will
usually be lost each time the image is decompressed and recompressed, particularly if the image
is cropped or shifted, or if encoding parameters are changed – see digital generation loss for
details. To avoid this, an image that is being modified or may be modified in the future can be
saved in a lossless format, with a copy exported as JPEG for distribution.

The compression method is usually lossy, meaning that some original image information is lost
and cannot be restored, possibly affecting image quality. There is an optional lossless mode
defined in the JPEG standard; however, this mode is not widely supported in products.

There is also an interlaced "Progressive JPEG" format, in which data is compressed in multiple
passes of progressively higher detail. This is ideal for large images that will be displayed while
downloading over a slow connection, allowing a reasonable preview after receiving only a
portion of the data. However, progressive JPEGs are not as widely supported, and even some
software which does support them (such as versions of Internet Explorer before Windows 7) only
displays the image after it has been completely downloaded.

There are also many medical imaging and traffic systems that create and process 12 - bit JPEG
images, normally grayscale images. The 12 - bit JPEG format has been part of the JPEG
specification for some time, but again, this format is not as widely supported.

A number of alterations to a JPEG image can be performed losslessly (that is, without
recompression and the associated quality loss) as long as the image size is a multiple of 1 MCU
block (Minimum Coded Unit) (usually 16 pixels in both directions, for 4:2:0 chroma
subsampling). Utilities that implement this include jpegtran, with user interface Jpegcrop, and
the JPG_TRANSFORM plugin to IrfanView. Blocks can be rotated in 90-degree increments,
flipped in the horizontal, vertical and diagonal axes and moved about in the image. Not all blocks
from the original image need to be used in the modified one.
The top and left edges of a JPEG image must lie on an 8 × 8 pixel block boundary, but the bottom
and right edge need not do so. This limits the possible lossless crop operations, and also prevents
flips and rotations of an image whose bottom or right edge does not lie on a block boundary for
all channels (because the edge would end up on top or left, where, as aforementioned, a block
boundary is obligatory). When using lossless cropping, if the bottom or right side of the crop
region is not on a block boundary then the rest of the data from the partially used blocks will still
be present in the cropped file and can be recovered.
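The alignment rule can be stated as a tiny check; the coordinates below are illustrative, and the constant follows the 8 × 8 block size mentioned above.

```python
# Sketch of the alignment rule described above: for a lossless crop the top
# and left edges must fall on an 8x8 block boundary (a 16x16 MCU for 4:2:0);
# the bottom and right edges need not. Purely illustrative arithmetic.
BLOCK = 8

def crop_top_left_is_aligned(x, y):
    return x % BLOCK == 0 and y % BLOCK == 0

print(crop_top_left_is_aligned(16, 32))  # True  -> crop can be lossless
print(crop_top_left_is_aligned(10, 32))  # False -> recompression would be needed
```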

It is also possible to transform between baseline and progressive formats without any loss of
quality, since the only difference is the order in which the coefficients are placed in the file.

Furthermore, several JPEG images can be losslessly joined together, as long as the edges
coincide with block boundaries.

PNG
One interesting development stemming from the popularity of the Internet is efforts toward more
system - independent image formats. One such format is Portable Network Graphics (PNG).
This standard is meant to supersede the GIF standard and extends it in important ways. The
motivation for a new standard was in part the patent held by UNISYS and CompuServe on the
LZW compression method. (Interestingly, the patent covers only compression, not
decompression: this is why the UNIX gunzip utility can decompress LZW - compressed files.)
Special features of PNG files include support for up to 48 bits of color information — a large
increase. Files may also contain gamma - correction information for correct display of color
images and alpha - channel information for such uses as control of transparency. Instead of a
progressive display based on widely separated rows, as in GIF images, the display progressively
displays pixels in a two - dimensional fashion a few at a time over seven passes through each 8 x
8 block of an image.

Comparison to Graphics Interchange Format (GIF)


1. On small images, GIF can achieve greater compression than PNG.
2. On most images, except for the above cases, GIF will be bigger than indexed PNG.
3. PNG gives a much wider range of transparency options than GIF, including alpha channel
transparency.
4. Whereas GIF is limited to 8 - bit indexed color, PNG gives a much wider range of color
depths, including 24 - bit (8 bits per channel) and 48 - bit (16 bits per channel) truecolor,
allowing for greater color precision, smoother fades, etc. When an alpha channel is added, up
to 64 bits per pixel (before compression) are possible.
5. When converting an image from the PNG format to GIF, the image quality may suffer
if the PNG image has more than 256 colors, since the colors must be reduced to a 256-entry palette.
6. GIF intrinsically supports animated images. PNG supports animation only via unofficial
extensions (see the section on animation, above).
PNG images are less widely supported by older browsers. In particular, IE6 has limited support
for PNG. As users adopt newer browsers, this becomes less of an issue.

Comparison to JPEG
JPEG (Joint Photographic Experts Group) format can produce a smaller file than PNG for
photographic (and photo - like) images, since JPEG uses a lossy encoding method specifically
designed for photographic image data, which is typically dominated by soft, low - contrast
transitions, and an amount of noise or similar irregular structures. Using PNG instead of a high -
quality JPEG for such images would result in a large increase in file size with negligible gain in
quality.

By contrast, when storing images that contain text, line art, or graphics – images with sharp
transitions and large areas of solid color – the PNG format can compress image data more than
JPEG can, and without the noticeable visual artifacts which JPEG produces around high -
contrast areas. Where an image contains both sharp transitions and photographic parts a choice
must be made between the two effects. JPEG does not support transparency.

Because JPEG uses lossy compression, it suffers from generation loss, where repeatedly
encoding and decoding an image progressively loses information and degrades the image.
Because PNG is lossless, it is a suitable format for storing images to be edited. While PNG is
reasonably efficient when compressing photographic images, there are lossless compression
formats designed specifically for photographic images, lossless JPEG 2000 and Adobe DNG
(Digital negative) for example. However these formats are either not widely supported or
proprietary. An image can be saved into JPEG format for distribution so that only a single pass of
JPEG encoding is applied, which will not noticeably degrade the image.
The PNG specification does not include a standard for embedded Exif image data from sources
such as digital cameras. TIFF, JPEG 2000, and DNG support EXIF data.

Early web browsers did not support PNG images; JPEG and GIF were the main image formats.
JPEG was commonly used when exporting images containing gradients for web pages, because
of GIF's limited color depth. However, JPEG compression causes a gradient to blur slightly. A
PNG file will reproduce a gradient as accurately as possible for a given bit depth, while keeping
the file size small. PNG became the optimal choice for small gradient images as web browser
support for the format improved.

Comparison to TIFF
Tagged Image File Format (TIFF) is a format that incorporates an extremely wide range of
options. While this makes TIFF useful as a generic format for interchange between professional
image editing applications, it makes adding support for it to applications a much bigger task and
so it has little support in applications not concerned with image manipulation (such as web
browsers). It also means that many applications can read only a subset of TIFF types, creating
more potential user confusion.

The most common general - purpose, lossless compression algorithm used with TIFF is Lempel
– Ziv – Welch (LZW). This compression technique, also used in GIF, was covered by patents
until 2003. There is a TIFF variant that uses the same compression algorithm as PNG uses, but it
is not supported by many proprietary programs. TIFF also offers special - purpose lossless
compression algorithms like CCITT Group IV, which can compress bilevel images (e.g., faxes or
black - and - white text) better than PNG's compression algorithm.

TIFF
TIFF is a flexible, adaptable file format for handling images and data within a single file, by
including the header tags (size, definition, image - data arrangement, applied image
compression) defining the image's geometry. For example, a TIFF file can be a container holding
compressed (lossy) JPEG and (lossless) Pack Bits compressed images. A TIFF file also can
include a vector - based clipping path (outlines, croppings, image frames). The ability to store
image data in a lossless format makes a TIFF file a useful image archive, because, unlike
standard JPEG files, a TIFF file using lossless compression (or none) may be edited and resaved
without losing image quality. This is not the case when using the TIFF as a container holding
compressed JPEG. Other TIFF options are layers and pages.

TIFF offers the option of using LZW compression, a lossless data compression technique for
reducing a file's size. Until 2004, use of this option was limited because the LZW technique was
under several patents. However, these patents have expired.

EXIF
Exchangeable Image File (EXIF) is an image format for digital cameras. Initially developed in 1995,
its current version was published in 2002 by the Japan Electronics and Information Technology
Industries Association (JEITA). Compressed EXIF files use the baseline JPEG format. A variety
of tags (many more than in TIFF) is available to facilitate higher quality printing, since
information about the camera and picture - taking conditions (flash, exposure, light source, white
balance, type of scene) can be stored and used by printers for possible color - correction
algorithms. The EXIF standard also includes a specification of the file format for audio that
accompanies digital images. It also supports tags for information needed for conversion to
FlashPix (initially developed by Kodak).
Graphics Animation Files
A few dominant formats are aimed at storing graphics animations (i.e., series of drawings or
graphic illustrations) as opposed to video (i.e., series of images). The difference is that
animations are considerably less demanding of resources than video files. However, animation
file formats can be used to store video information and indeed are sometimes used for such.

FLC is an important animation or moving picture file format; it was originally created by
Animator Pro. Another format, FLI, is similar to FLC. GL produces somewhat better quality
moving pictures. GL animations can also usually handle larger file sizes. Many older formats are
used for animation, such as DL and Amiga IFF, as well as alternates such as Apple Quicktime.
And, of course, there are also animated GIF89 files.
PS and PDF
PostScript is an important language for typesetting, and many high - end printers have a
PostScript interpreter built into them. PostScript is a vector - based, rather than pixel - based,
picture language: page elements are essentially defined in terms of vectors. With fonts defined
this way, PostScript includes text as well as vector / structured graphics; bit - mapped images can
also be included in output files. Encapsulated PostScript files add some information for including
PostScript files in another document.

Several popular graphics programs, such as Illustrator and FreeHand, use PostScript. However,
the PostScript page description language itself does not provide compression; in fact, PostScript
files are just stored as ASCII. Therefore files are often large, and in academic settings, it is
common for such files to be made available only after compression by some UNIX utility, such
as compress or gzip.

Therefore, another text + figures language has begun to supersede PostScript: Adobe Systems
Inc. includes LZW (see Chapter) compression in its Portable Document Format (PDF) file
format. As a consequence, PDF files that do not include images have about the same
compression ratio, 2:1 or 3:1, as do files compressed with other LZW - based compression tools,
such as UNIX compress or gzip, or the PC-based WinZip (a variety of PKZIP).
For files containing images, PDF may achieve higher compression ratios by using separate JPEG
compression for the image content (depending on the tools used to create original and
compressed versions). The Adobe Acrobat PDF reader can also be configured to read documents
structured as linked elements, with clickable content and handy summary tree-structured link
diagrams provided.

Windows WMF
Windows MetaFile (WMF) is the native vector file format for the Microsoft Windows operating
environment. WMF files actually consist of a collection of Graphics Device Interface (GDI)
function calls, also native to the Windows environment. When a WMF file is "played" (typically
using the Windows PlayMetaFile() function) the described graphic is rendered. WMF files are
ostensibly device independent and unlimited in size.
Windows BMP
A bitmap or pixmap is a type of memory organization or image file format used to store digital
images. The term bitmap comes from computer programming terminology, meaning just
a map of bits, a spatially mapped array of bits. Now, along with pixmap, it commonly refers to
the similar concept of a spatially mapped array of pixels. Raster images in general may be
referred to as bitmaps or pixmaps, whether synthetic or photographic, in files or memory.
In some contexts, the term bitmap implies one bit per pixel, while pixmap is used for images
with multiple bits per pixel. Many graphical user interfaces use bitmaps in their built-in graphics
subsystems; for example, the Microsoft Windows and OS/2 platforms' GDI subsystem, where
the specific format used is the Windows and OS/2 bitmap file format, usually named with the file
extension of .BMP (or .DIB for device independent bitmap). Besides BMP, other file formats
that store literal bitmaps include InterLeaved Bitmap (ILBM), Portable Bitmap (PBM), X
Bitmap (XBM), and Wireless Application Protocol Bitmap (WBMP). Similarly, most other
image file formats, such as JPEG, TIFF, PNG, and GIF, also store bitmap images (as opposed to
vector graphics), but they are not usually referred to as bitmaps, since they use compressed
formats internally.
Macintosh PAINT and PICT
PAINT was originally used in the MacPaint program, initially only for 1 - bit monochrome images.

PICT is used in MacDraw (a vector - based drawing program) for storing structured graphics.

X Windows PPM
This is the graphics format for the X Windows System. Portable PixMap (PPM) supports 24-bit
color bitmaps and can be manipulated using many public-domain graphics editors. It is
used in the X Windows System for storing icons, pixmaps, backdrops, and so on.
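As a small illustration of how simple the format is, the sketch below writes a 2 × 2 plain-text PPM (P3) file: a header with the width, height, and maximum color value, followed by one RGB triplet per pixel. The file name and pixel colors are arbitrary.

```python
# Minimal sketch of writing a tiny plain-text PPM (P3) image: the header gives
# width, height, and the maximum color value, followed by RGB triplets.
width, height, maxval = 2, 2, 255
pixels = [
    (255, 0, 0), (0, 255, 0),      # top row: red, green
    (0, 0, 255), (255, 255, 255),  # bottom row: blue, white
]

with open("tiny.ppm", "w") as f:
    f.write(f"P3\n{width} {height}\n{maxval}\n")
    for r, g, b in pixels:
        f.write(f"{r} {g} {b}\n")
```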
