• Assuming that the database is relational, we can store this information in two relations:
EMP(ENO, ENAME, TITLE) and PROJ(PNO, PNAME, BUDGET).
• We also introduce a third relation to store salary information: SAL(TITLE, AMT) and a
fourth relation ASG which indicates which employees have been assigned to which
projects for what duration with what responsibility: ASG(ENO, PNO, RESP, DUR).
• If all of this data were stored in a centralized DBMS, and we wanted to find out the
names and salaries of employees who worked on a project for more than 12 months, we
would specify this using the following SQL query:

SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 12
AND EMP.ENO = ASG.ENO
AND SAL.TITLE = EMP.TITLE
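A minimal runnable sketch of this query using SQLite; the sample tuples are invented for illustration:

```python
import sqlite3

# Build the EMP/SAL/ASG schema from the notes and run the query above.
# The inserted rows are made up purely to demonstrate the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMP(ENO TEXT, ENAME TEXT, TITLE TEXT);
CREATE TABLE SAL(TITLE TEXT, AMT INTEGER);
CREATE TABLE ASG(ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
INSERT INTO EMP VALUES ('E1', 'J. Doe',   'Elect. Eng.'),
                       ('E2', 'M. Smith', 'Analyst');
INSERT INTO SAL VALUES ('Elect. Eng.', 40000), ('Analyst', 34000);
INSERT INTO ASG VALUES ('E1', 'P1', 'Manager', 12),
                       ('E2', 'P1', 'Analyst', 24);
""")
rows = conn.execute("""
    SELECT ENAME, AMT
    FROM EMP, ASG, SAL
    WHERE ASG.DUR > 12
      AND EMP.ENO = ASG.ENO
      AND SAL.TITLE = EMP.TITLE
""").fetchall()
print(rows)  # only M. Smith worked more than 12 months
```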
1.Data Independence
• Data independence is the type of data transparency that matters for a
centralized DBMS. When a user application is written, it should not be concerned with
the details of physical data organization. There are two types of data independence:
physical and logical data independence.
• Logical data independence refers to the immunity of user applications to changes in the
logical structure (i.e., schema) of the database
• Physical data independence deals with hiding the details of the storage structure from
user applications.
2.Network Transparency
• Users should be protected from the operational details of the network, possibly even
hiding the existence of the network itself. This type of transparency is referred to as network
transparency or distribution transparency.
3.Replication Transparency
• For performance, reliability, and availability reasons, it may be desirable to replicate data
across the machines of a network. Furthermore, if one of the machines fails, a copy of the
data is still available on another machine on the network. In fact, the decision as to whether
to replicate or not, and how many copies of any database object to have, depends to a
considerable degree on user applications.
4.Fragmentation Transparency
• There are two general fragmentation alternatives.
• In horizontal fragmentation, a relation is partitioned into a set of sub-relations, each of
which contains a subset of the tuples (rows) of the original relation. In vertical
fragmentation, each sub-relation is defined on a subset of the attributes (columns)
of the original relation.
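The two alternatives can be sketched on the EMP relation; the tuples below are invented:

```python
# A small illustration of horizontal vs. vertical fragmentation of the
# relation EMP(ENO, ENAME, TITLE); the rows are invented sample data.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe",   "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Analyst"},
    {"ENO": "E3", "ENAME": "A. Lee",   "TITLE": "Elect. Eng."},
]

# Horizontal fragmentation: each fragment holds a subset of the tuples,
# selected by a predicate (here: the employee's title).
EMP1 = [t for t in EMP if t["TITLE"] == "Elect. Eng."]
EMP2 = [t for t in EMP if t["TITLE"] != "Elect. Eng."]

# Vertical fragmentation: each fragment holds a subset of the attributes;
# the key ENO is kept in both so the relation can be reconstructed by join.
EMP_V1 = [{"ENO": t["ENO"], "ENAME": t["ENAME"]} for t in EMP]
EMP_V2 = [{"ENO": t["ENO"], "TITLE": t["TITLE"]} for t in EMP]

# Reconstruction: union for horizontal fragments, join on ENO for vertical.
assert sorted(EMP1 + EMP2, key=lambda t: t["ENO"]) == EMP
rejoined = [{**a, **b} for a in EMP_V1 for b in EMP_V2 if a["ENO"] == b["ENO"]]
assert rejoined == EMP
```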
5.Who Should Provide Transparency?
• To provide easy and efficient access by users to the services of the DBMS, one would
want to have full transparency.
The responsibility for providing transparent access can be placed at the access (language) layer,
at the operating system level, or within the DBMS itself.
Reliability Through Distributed Transactions
• The failure of a single site, or the failure of a communication link which makes one or
more sites unreachable, is not sufficient to bring down the entire system.
• In the case of a distributed database, this means that some of the data may be
unreachable, but with proper care, users may be permitted to access other parts of the
distributed database. The proper care comes in the form of support for distributed
transactions and application protocols.
Begin transaction SALARY UPDATE
begin
EXEC SQL UPDATE PAY
SET SAL = SAL*1.1
end.
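A minimal runnable sketch of the same update using SQLite; the PAY rows are invented, and the embedded SQL is replaced by Python's sqlite3 API:

```python
import sqlite3

# Sketch of the SALARY UPDATE transaction above: the 10% raise either
# commits as a single unit or is rolled back entirely.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE PAY(TITLE TEXT, SAL REAL)")
conn.execute("INSERT INTO PAY VALUES ('Analyst', 1000.0), ('Programmer', 2000.0)")

try:
    conn.execute("BEGIN")
    conn.execute("UPDATE PAY SET SAL = SAL * 1.1")  # the raise
    conn.execute("COMMIT")                          # all rows updated, or...
except sqlite3.Error:
    conn.execute("ROLLBACK")                        # ...none of them

salaries = [row[0] for row in conn.execute("SELECT SAL FROM PAY ORDER BY SAL")]
print([round(s, 2) for s in salaries])  # [1100.0, 2200.0]
```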
Improved Performance
The case for the improved performance of distributed DBMSs is typically made based on two
points.
• First, a distributed DBMS fragments the conceptual database, enabling data to be stored
in close proximity to its points of use (also called data localization).
• This has two potential advantages: firstly, since each site handles only a portion of the
database, contention for CPU and I/O services is not as severe as for centralized
databases and secondly, localization reduces remote access delays that are usually
involved in wide area networks.
Easier System Expansion
• In a distributed environment, it is much easier to accommodate increasing database
sizes.
• Major system overhauls are seldom necessary; expansion can usually be handled by
adding processing and storage power to the network.
Obviously, it may not be possible to obtain a linear increase in power, since this also depends on
the overhead of distribution. However, significant improvements are still possible.
3. Draw the Architectural Models for Distributed DBMS.
Distributed DBMS Architecture
• The architecture of a system defines its structure. This means that the components of the
system are identified, the function of each component is specified, and the
interrelationships and interactions among these components are defined.
• The specification of the architecture of a system requires identification of the various
modules, with their interfaces and interrelationships, in terms of the data and control flow
through the system.
Client - Server Architecture for DDBMS
• General idea: Divide the functionality into two classes:
• server functions
∗ mainly data management, including query processing, optimization, transaction management,
etc.
• client functions
∗ might also include some data management functions (consistency checking, transaction
management, etc.), not just the user interface
• Provides a two-level architecture
• More efficient division of work
• Different types of client/server architecture
• Multiple client/single server
• Multiple client/multiple server
• Peer-to-Peer Architecture for DDBMS
• Peer-to-peer distribution: in these systems, each peer acts both as a client and as a server
for providing database services.
• Local internal schema (LIS)
• Describes the local physical data organization (which might be different on each
machine)
• Local conceptual schema (LCS)
• Describes logical data organization at each site
• Required since the data are fragmented and replicated
• Global conceptual schema (GCS)
• Describes the global logical view of the data
• Union of the LCSs
• External schema (ES)
Describes the user/application view on the data
• Multi - DBMS Architectures
• This is an integrated database system formed by a collection of two or more autonomous
database systems.
• The local DBMSs present to the multi-database layer the part of their local DB
they are willing to share.
4.What are the Distribution Design issues?
Design Issues of DDBS
Example relation: STUDENT(Regd_No, Name, Course, Address, Semester)
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is
used. This is the most flexible fragmentation technique; however, reconstruction of the original
table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
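Both alternatives can be sketched on the invented STUDENT relation above; this sketch takes the second route (horizontal first, then vertical):

```python
# Hybrid fragmentation sketch on invented STUDENT rows:
# step 1 fragments horizontally by Course, step 2 fragments one of those
# fragments vertically (the key Regd_No is kept in both vertical pieces).
STUDENT = [
    {"Regd_No": 1, "Name": "Asha", "Course": "CS", "Address": "Pune",  "Semester": 3},
    {"Regd_No": 2, "Name": "Ravi", "Course": "CS", "Address": "Delhi", "Semester": 5},
    {"Regd_No": 3, "Name": "Mary", "Course": "EE", "Address": "Goa",   "Semester": 3},
]

# Step 1: horizontal fragments by Course.
CS = [t for t in STUDENT if t["Course"] == "CS"]
EE = [t for t in STUDENT if t["Course"] == "EE"]

# Step 2: vertical fragments of the CS fragment.
CS_ID   = [{"Regd_No": t["Regd_No"], "Name": t["Name"], "Address": t["Address"]} for t in CS]
CS_ACAD = [{"Regd_No": t["Regd_No"], "Course": t["Course"], "Semester": t["Semester"]} for t in CS]

# Reconstruction is the expensive part: join the vertical fragments back
# together on the key, then union with the other horizontal fragments.
CS_rejoined = [{**a, **b} for a in CS_ID for b in CS_ACAD if a["Regd_No"] == b["Regd_No"]]
original = sorted(CS_rejoined + EE, key=lambda t: t["Regd_No"])
assert original == STUDENT
```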
Reasons for Fragmentation
• Fragmentation at the most basic level is about how data is fragmented on a storage device.
• Fragmentation is a result of continuous application or file system storage, where
different parts of a given application or file are stored in a sequential set of storage
blocks on a storage device.
• An operating system will typically store application and file storage blocks in the next
available location on a storage device.
Allocation Problem
Minimal cost: the main objective is to generate a fragment allocation schema
which minimizes the total data transmission cost during the execution of database queries.
Performance: the allocation strategy is designed to maintain a performance metric. Two well-
known ones are to minimize the response time and to maximize the system throughput at each
site.
• The first layer decomposes the calculus query (high-level query) into an algebraic
query with global relations. It also checks the query for syntax errors (violations of the
language grammar) and semantic errors (errors in meaning).
Languages
Types of Optimization
Optimization Timing
Statistics
Decision Sites
Use of Semijoins
Languages
• The input language can be relational algebra or calculus and the output language is relational
algebra (annotated with communication primitives).
• The query processor must efficiently map the input language to the output language
Types of Optimization
• The output language specification represents the execution strategy. There can be many
such strategies, the best one can be selected through exhaustive search, or by applying
heuristic (minimize size of intermediate relations).
• Static: done before executing the query
• Dynamic: done at run time
• Hybrid: mixes static and dynamic approaches
Statistics (analysis)
• fragment cardinality and size
• size and number of distinct values for each attribute.
• detailed histograms of attribute values for better selectivity estimation.
Decision Sites
• one site or several sites participate in selection of strategy
1.normalization
2.analysis
3.elimination of redundancy
4.rewriting
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema describing the
global relations. However, the information about data distribution is not used here but in the next
layer. Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps. First, the calculus query is
rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a
query generally involves the manipulation of the query quantifiers and of the query qualification by
applying logical operator priority.
Second, the normalized query is analyzed semantically so that incorrect queries are detected and
rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of
relational calculus. Typically, they use some sort of graph that captures the semantics of the query
Third, the correct query (still expressed in relational calculus) is simplified. One way to simplify a
query is to eliminate redundant predicates. Note that redundant queries are likely to arise when a
query is the result of system transformations applied to the user query. Such transformations are
used for performing semantic data control (views, protection, and semantic integrity control).
Fourth, the calculus query is restructured as an algebraic query. Note that several algebraic queries
can be derived from the same calculus query, and that some algebraic queries are “better” than
others. The quality of an algebraic query is defined in terms of expected performance. The
traditional way to do this transformation toward a “better” algebraic specification is to start with an
initial algebraic query and transform it in order to find a “good” one. The initial algebraic query is
derived immediately from the calculus query by translating the predicates and the target statement
into relational operators as they appear in the query. This directly translated algebra query is then
restructured through transformation rules. The algebraic query generated by this layer is good in
the sense that the worse executions are typically avoided. For instance, a relation will be accessed
only once, even if there are several select predicates. However, this query is generally far from
providing an optimal execution, since information about data distribution and fragment allocation is
not used at this layer.
6.Write short notes on centralized query optimization.
Centralized Query Optimization:
• In a centralized system, query processing is performed using optimization techniques whose goals include:
• Minimization of response time of query (time taken to produce the results to user’s
query).
• Maximize system throughput (the number of requests that are processed in a given
amount of time).
• Reduce the amount of memory and storage required for processing.
• Increase parallelism.
• In DDBMS joins of fragments that are stored at different sites may increase the
communication time.
Two approaches exist:
Join: a binary operation in relational algebra that combines records from two or
more tables in a database, using values common to each.
Semijoin: a join whose result contains only the columns from one of the joined tables.
It is useful in distributed databases, since we don't have to send as much data over the
network, and it can dramatically speed up certain classes of queries.
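The data-transfer saving can be sketched with two invented fragments R (at site 1) and S (at site 2); the attribute names and sites are illustrative:

```python
# Semijoin R ⋉ S as a size reducer: instead of shipping all of R from
# site 1 to site 2, site 2 first sends the join-attribute values of S,
# and only the matching R tuples travel back.  Data below is invented.
R = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee"), ("E4", "B. Roy")]  # at site 1
S = [("E2", "P1"), ("E3", "P2")]                                                # at site 2

# Step 1: site 2 -> site 1: project S on the join attribute (a small message).
s_keys = {eno for (eno, _) in S}

# Step 2: site 1 computes the semijoin R ⋉ S and ships only those tuples.
R_reduced = [t for t in R if t[0] in s_keys]

# Step 3: site 2 joins the reduced R with S.
result = [(eno, name, pno) for (eno, name) in R_reduced for (e2, pno) in S if eno == e2]
print(result)  # only 2 of the 4 R tuples ever crossed the network
```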
Two approaches exist:
• Optimize the ordering of joins directly
∗ INGRES and distributed INGRES
∗ System R and System R∗
• Replace joins by combinations of semijoins (e.g., SDD-1)
• Direct join ordering of two relation/fragments located at different sites
• Move the smaller relation to the other site
• We have to estimate the size of R and S
• Direct join ordering of queries involving more than two relations is substantially more
complex
• Example: Consider the following query and the respective join graph, where we make
also assumptions about the locations of the three relations/fragments
• PROJ ⋊⋉_PNO ASG ⋊⋉_ENO EMP
• Plan 1: ASG→Site 1
Site 1: ASG’=EMP⋊⋉ASG
ASG’ → Site 3
Site 3: ASG’ ⋊⋉PROJ
• Plan 2: EMP→Site 2
Site 2: EMP’=EMP⋊⋉ASG
EMP’→Site 3
Site 3: EMP’⋊⋉PROJ
• Plan 3: ASG→Site 3
Site 3: ASG’=ASG⋊⋉PROJ
ASG’→Site 1
Site 1: ASG’⋊⋉EMP
• Plan 4: PROJ→Site 2
Site 2: PROJ’=PROJ⋊⋉ASG
PROJ’ → Site 1
Site 1: PROJ’⋊⋉EMP
• Plan 5: EMP→Site 2
PROJ→Site 2
Site 2: EMP⋊⋉PROJ⋊⋉ASG
• The join approach is better if almost all tuples of R participate in the join
Semijoins
• can be used to efficiently implement joins
• The semijoin acts as a size reducer (similar as to a selection) such that smaller
relations need to be transferred
• Consider the join of two relations R and S over attribute A, stored at sites 1 and 2.
• Solution with semijoins: replace one or both operand relations/fragments by a
semijoin (⋉ denotes the semijoin operator), using the following rules:
R ⋊⋉_A S ⇐⇒ (R ⋉_A S) ⋊⋉_A S
⇐⇒ R ⋊⋉_A (S ⋉_A R)
⇐⇒ (R ⋉_A S) ⋊⋉_A (S ⋉_A R)
• A typical semijoin execution: site 2 sends Π_A(S) to site 1; site 1 computes
R′ = R ⋉_A S and sends R′ to site 2; site 2 computes R′ ⋊⋉_A S
• The semijoin approach is better if the semijoin acts as a sufficient reducer (i.e., a few
tuples of R participate in the join)
• Hill Climbing and SDD-1
7.Briefly explain Query optimization
Distributed query optimization refers to the process of producing a plan for the processing of a query to
a distributed database system. The plan is called a query execution plan. In a distributed database
system, schemas and queries refer to logical units of data.
In a relational distributed database system, for instance, logical units of data are relations. These
units may be fragmented at the underlying physical level. The fragments, which can be redundant and
replicated, are allocated to different database servers in the distributed system.
Distributed query optimization requires the evaluation of a large number of query trees, each of which
produces the required results of the query. This is primarily due to the presence of large amounts of
replicated and fragmented data. Hence, the target is to find a near-optimal solution instead of
exhaustively searching for the best solution.
Query trading.
When the operand table is made available, the internal node is executed. The result table replaces the
node and the process is continued until the result table replaces the root node.
The information about the intermediate tables that are required to be passed from one operator to
another is provided by a query plan. The information about the usage of temporary tables, combining of
the operations is provided by the query plan.
• For the decomposition two basic techniques are used: detachment and
substitution
• There’s a processor that can efficiently process mono-relation queries
• ∗ Optimizes each query independently for the access to a single relation
Example: Consider query q1: “Names of employees working on the CAD/CAM project”
q1: SELECT EMP.ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PROJ.PNAME = ”CAD/CAM”
• Detachment decomposes q1 into q11 → q1′, where q11 is a mono-relation query whose
result is used by the remaining query q1′; fragmentation is handled afterwards.
• Optimization with respect to a combination of communication cost and response
time
System R Algorithm
• The System R(centralized) query optimization algorithm
• Performs static query optimization based on “exhaustive search”(complete) of the
solution space and a cost function (IO cost + CPU cost)
∗ Input: relational algebra tree
∗ Output: optimal relational algebra tree
• Dynamic programming technique is applied to reduce the number of alternative plans
• The optimization algorithm consists of two steps
1. Predict the best access method to each individual relation (mono-relation query)
∗ Consider using index, file scan, etc.
2. For each relation R, estimate the best join ordering
∗ R is first accessed using its best single-relation access method
∗ Efficient access to inner relation is crucial
• Considers two different join strategies
∗ (Indexed-) nested loop join , Sort-merge join
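The two join strategies can be sketched as follows, on tiny invented EMP/ASG fragments; a real optimizer would choose between them by estimated cost:

```python
# Minimal sketches of the two join strategies System R considers:
# nested-loop join and sort-merge join, here over the attribute ENO.
EMP = [("E3", "A. Lee"), ("E1", "J. Doe"), ("E2", "M. Smith")]
ASG = [("E2", "P1"), ("E1", "P2"), ("E2", "P3")]

def nested_loop_join(outer, inner):
    # For each outer tuple, scan the whole inner relation.
    return [(e, name, p) for (e, name) in outer for (a, p) in inner if e == a]

def sort_merge_join(r, s):
    # Sort both inputs on the join attribute, then merge with two cursors.
    r, s = sorted(r), sorted(s)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            # Emit all pairings of s tuples sharing this key value.
            k = j
            while k < len(s) and s[k][0] == r[i][0]:
                out.append((r[i][0], r[i][1], s[k][1]))
                k += 1
            i += 1
    return out

a = sorted(nested_loop_join(EMP, ASG))
b = sorted(sort_merge_join(EMP, ASG))
assert a == b  # both strategies compute the same join
print(a)
```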
• Example: Consider query q1: “Names of employees working on the CAD/CAM
project”
• PROJ ⋊⋉_PNO ASG ⋊⋉_ENO EMP
• The best total join order is ((PROJ ⋊⋉ ASG) ⋊⋉ EMP), since it uses the indexes best;
the other orderings are pruned (cut off)
• ∗ Select PROJ using index on PNAME
• ∗ Join with ASG using index on PNO
• ∗ Join with EMP using index on ENO
Distributed Query Optimization Algorithms.
Hill-Climbing Algorithm
• Hill-Climbing query optimization algorithm
• Refinements of an initial feasible solution are recursively computed until no more
cost improvements can be made
• Data replication and fragmentation are not considered
• Devised for wide area point-to-point networks
• It was the first distributed query processing algorithm, and it is called the hill-climbing algorithm
1. Select an initial feasible execution strategy ES0
2. Split ES0 into two strategies: ES1 followed by ES2
• ES1: send one of the relations involved in the join to the other relation’s site
• ES2: send the join result to the final result site
3. Replace ES0 with the split schedule if
cost(ES1) + cost(ES2) + cost(local join) < cost(ES0)
4. Recursively apply steps 2 and 3 on ES1 and ES2 until no more benefit can be gained
5. Check for redundant transmissions in the final plan and eliminate them
Example: What are the salaries of engineers who work on the CAD/CAM project?
• Π_SAL(PAY ⋊⋉_TITLE (EMP ⋊⋉_ENO (ASG ⋊⋉_PNO σ_PNAME=“CAD/CAM”(PROJ))))
• Schemas: EMP(ENO, ENAME, TITLE), ASG(ENO, PNO, RESP, DUR),
PROJ(PNO, PNAME, BUDGET, LOC), PAY(TITLE, SAL)
∗ Size of relations is defined as their cardinality
∗ Minimize total cost
∗ Transmission cost between two sites is 1
∗ Ignore local processing cost
Relation Size Site
EMP 8 1
PAY 4 2
PROJ 1 3
ASG 10 4
• Alternative 1: Resulting site is site 1
• Total cost = cost(PAY → Site1) + cost(ASG → Site1) + cost(PROJ → Site1)
= 4 + 10 + 1 = 15
• Alternative 2: Resulting site is site 2
• Total cost = 8 + 10 + 1 = 19
• Alternative 3: Resulting site is site 3
• Total cost = 8 + 4 + 10 = 22
• Alternative 4: Resulting site is site 4
• Total cost = 8 + 4 + 1 = 13
• Therefore ES0 = EMP→Site4; PAY → Site4; PROJ → Site4
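The four alternatives can be recomputed mechanically; the sizes and sites come from the table above, and, as stated, transmission cost is taken to be the cardinality of the shipped relation while local processing is ignored:

```python
# Recompute the initial-strategy costs: each relation sits at one site,
# and every relation not already at the candidate result site must be
# shipped there at a cost equal to its cardinality.
size = {"EMP": 8, "PAY": 4, "PROJ": 1, "ASG": 10}
site = {"EMP": 1, "PAY": 2, "PROJ": 3, "ASG": 4}

def total_cost(result_site):
    # Sum the sizes of all relations that must be transmitted.
    return sum(size[r] for r in size if site[r] != result_site)

costs = {s: total_cost(s) for s in (1, 2, 3, 4)}
print(costs)  # {1: 15, 2: 19, 3: 22, 4: 13} -> site 4 is cheapest
best = min(costs, key=costs.get)
assert best == 4  # matching ES0 above: ship EMP, PAY, PROJ to site 4
```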
System for Distributed Databases (SDD-1 )
The SDD-1 query optimization algorithm improves the Hill-Climbing algorithm in a number of
directions:
• Semijoins are considered
• More elaborate statistics
• Initial plan is selected better
• Post-optimization step is introduced
UNIT - III
1.What are the properties of transactions?
Properties of Transactions
• The ACID properties
• Atomicity
∗ A transaction is treated as a single/atomic unit of operation and is either executed completely
or not at all
• Consistency
∗ A transaction preserves DB consistency, i.e., does not violate any integrity constraints
• Isolation
∗ A transaction is executed as if it would be the only one.
• Durability
∗ The updates of a committed transaction are permanent in the DB
Atomicity
• Atomicity refers to the fact that a transaction is treated as a unit of operation.
• Therefore, either all the transaction’s actions are completed, or none of them are. This is
also known as the “all-or-nothing property.”
• There are two possible courses of action for an interrupted transaction: it can either be
terminated by completing the remaining actions, or it can be terminated by undoing all
the actions that have already been executed. A failed transaction is then restarted.
• Two types of recovery address two types of failures: transaction recovery is the activity of
restoring atomicity after failures such as input errors, system overloads, and deadlocks
• Crash recovery is the activity of ensuring atomicity in the presence of system crashes
ex: ATM transaction
Consistency
• The consistency of a transaction is simply its correctness: a transaction maps a consistent
DB into a consistent DB (the database must be consistent both before and after the
transaction executes).
• Transactions are correct programs and do not violate database integrity constraints
• Dirty data is data that has been updated by a transaction that has not yet committed.
• This classification groups databases into four levels of consistency, defined in terms of
which transactions may see dirty data.
• Different levels of DB consistency
∗ Degree 0
Transaction T does not overwrite dirty data of other transactions
∗ Degree 1
Degree 0, and T does not commit any writes before EOT (end of transaction)
∗ Degree 2
Degree 1, and T does not read dirty data from other transactions
∗ Degree 3
Degree 2, and other transactions do not dirty any data read by T before T completes
Isolation
• Isolation is the property of transactions which requires each transaction to see a
consistent DB at all times.
• If two concurrent transactions access a data item that is being updated by one of them
(i.e., performs a write operation), it is not possible to guarantee that the second will read
the correct value
• Therefore, if several transactions are executed concurrently, the result must be the same
as if they were executed serially in some order (→ serializability)
• Consider the following two concurrent transactions (T1 and T2), both incrementing x:
T1: Read(x)          T2: Read(x)
    x+1→x                x+1→x
    Write(x)             Write(x)
    Commit               Commit
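The interleaving above can produce a lost update. A minimal sketch, with a hypothetical counter x starting at 0, that simulates the two schedules:

```python
# Simulate the T1/T2 increment example: each transaction reads x into a
# private copy, adds 1, and writes it back.  Serially the result is x+2;
# with the bad interleaving one increment is lost.
def run(schedule, x=0):
    local = {}                       # each transaction's private copy of x
    for (txn, op) in schedule:
        if op == "read":
            local[txn] = x
        elif op == "write":
            x = local[txn] + 1       # x+1 -> x
    return x

serial = [("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")]
interleaved = [("T1", "read"), ("T2", "read"), ("T1", "write"), ("T2", "write")]

assert run(serial) == 2       # correct: both updates survive
assert run(interleaved) == 1  # T2 overwrites T1's update: one is lost
```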
Durability
• Durability refers to that property of transactions which ensures that once a transaction
commits, its results are permanent and cannot be erased from the database.
• The system must guarantee that the results of its operations will never be lost, in spite
of subsequent failures
• Changes that have been committed to the database should remain even in the case of
software and hardware failure.
• For instance, if Joy’s account contains 1 lakh, this information should not disappear upon
hardware or software failure.
• Database recovery is used to achieve the task.
Serializability
• Serializability is a concurrency scheme in which the effect of executing simultaneous
transactions is equivalent to executing them serially in some order. A schedule is a list of
transactions.
• In a serial schedule, each transaction is executed consecutively without any
interference from other transactions: T1🡪T2 OR T2🡪T1
• In a non-serial schedule, the operations from a group of simultaneous transactions
are interleaved.
• In a non-serial schedule, if the schedule is not proper, then problems can arise such as
lost (multiple) updates, uncommitted dependency, and incorrect analysis.
• The main objective of serializability is to find non-serial schedules that allow
transactions to execute concurrently without interference, producing a database state
equivalent to that of some serial execution.
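One standard way to sketch this test is conflict serializability: build a precedence graph with an edge Ti → Tj for each pair of conflicting operations, and reject schedules whose graph has a cycle. The two-transaction schedules below are invented examples:

```python
# Conflict-serializability check via a precedence graph and cycle test.
def serializable(schedule):
    # schedule: list of (txn, op, item), op in {"r", "w"}.
    # Two operations conflict if they are from different transactions,
    # touch the same item, and at least one is a write.
    edges = set()
    for i, (t1, o1, x1) in enumerate(schedule):
        for (t2, o2, x2) in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (o1, o2):
                edges.add((t1, t2))
    # Depth-first reachability: is there a path a -> ... -> b?
    def reachable(a, b, seen=()):
        return any(u == a and (v == b or (v not in seen and reachable(v, b, seen + (v,))))
                   for (u, v) in edges)
    txns = {t for (t, _, _) in schedule}
    return not any(reachable(t, t) for t in txns)   # no cycle -> serializable

good = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "x")]
bad  = [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]
assert serializable(good)     # equivalent to the serial order T1 -> T2
assert not serializable(bad)  # edges T1 -> T2 and T2 -> T1 form a cycle
```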
Batch transactions, on the other hand, take longer to execute (response time being
measured in minutes, hours, or even days) and access a larger portion of the database.
Flat transaction (Online transactions)
• Flat transactions have a single start point (Begin transaction) and a single
termination point (End transaction).
Begin transaction Reservation
...
end.
∗ Nested transaction (Batch transactions)
• An alternative transaction model is to permit a transaction to include other
transactions with their own begin and commit points. Such transactions are called
nested transactions. These transactions that are embedded in another one are usually
called sub transactions.
• The operations of a transaction may themselves be transactions.
Begin transaction Reservation
...
Begin transaction Airline
...
end.
Begin transaction Hotel
...
end. end.
4.Discuss in detail concurrency control mechanisms & algorithms.
Concurrency control mechanisms & algorithms:
Concurrency controlling techniques ensure that multiple transactions are executed simultaneously
while maintaining the ACID properties of the transactions and serializability in the schedules.
Locking Based Concurrency Control Protocols
Locking-based concurrency control protocols use the concept of locking data items. A lock is a
variable associated with a data item that determines whether read/write operations can be
performed on that data item. Generally, a lock compatibility matrix is used which states whether a
data item can be locked by two transactions at the same time.
Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.
One-phase Locking Protocol
In this method, each transaction locks an item before use and releases the lock as soon as it has
finished using it. This locking method provides for maximum concurrency but does not always
enforce serializability.
Two-phase Locking Protocol
In this method, all locking operations precede the first lock-release or unlock operation. The
transaction comprises two phases. In the first phase, the transaction acquires all the locks it
needs and does not release any lock. This is called the expanding or growing phase. In the
second phase, the transaction releases its locks and cannot request any new locks. This is called
the shrinking phase.
Every transaction that follows two-phase locking protocol is guaranteed to be serializable.
However, this approach provides low parallelism between two conflicting transactions.
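The two phases can be sketched as a toy lock manager; the class name, lock modes, and compatibility handling below are illustrative, not any particular DBMS's API:

```python
# A toy (strict) two-phase locking sketch: locks are acquired during the
# growing phase and released only when the transaction commits.
class TwoPhaseLocking:
    def __init__(self):
        self.locks = {}              # item -> (mode, set of holder txns)

    def acquire(self, txn, item, mode):
        # mode: "S" (shared/read) or "X" (exclusive/write).
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if held_mode == "S" and mode == "S":
            holders.add(txn)         # shared locks are compatible
            return True
        if holders == {txn}:         # sole holder may upgrade S -> X
            self.locks[item] = ("X" if mode == "X" else held_mode, holders)
            return True
        return False                 # conflict: caller must wait or abort

    def commit(self, txn):
        # Shrinking phase: release every lock the transaction holds.
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lm = TwoPhaseLocking()
assert lm.acquire("T1", "x", "X")
assert not lm.acquire("T2", "x", "S")  # T2 blocks on T1's write lock
lm.commit("T1")
assert lm.acquire("T2", "x", "S")      # granted after T1 releases
```

A real lock manager would queue blocked requests and detect deadlocks rather than simply returning False.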
5.Briefly explain deadlock Management.
https://www.geeksforgeeks.org/deadlock-in-dbms/
6.Explain in detail time - stamped & optimistic concurrency
control Algorithms.
Q) fault tolerance
ans) https://www.slideshare.net/sumitjain2013/fault-tolerance-in-distributed-systems