• Assuming that the database is relational, we can store this information in two relations:
EMP(ENO, ENAME, TITLE) and PROJ(PNO, PNAME, BUDGET).
• We also introduce a third relation to store salary information: SAL(TITLE, AMT) and a
fourth relation ASG which indicates which employees have been assigned to which
projects for what duration with what responsibility: ASG(ENO, PNO, RESP, DUR).
• If all of this data were stored in a centralized DBMS, and we wanted to find out the
names and salaries of employees who worked on a project for more than 12 months, we
would specify this using the following SQL query:

SELECT ENAME, AMT
FROM EMP, ASG, SAL
WHERE ASG.DUR > 12
AND EMP.ENO = ASG.ENO
AND SAL.TITLE = EMP.TITLE
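A minimal runnable sketch of this query using SQLite; the sample tuples are invented for illustration:

```python
import sqlite3

# Build the EMP/SAL/ASG schema from the notes and run the query above.
# The inserted rows are made up purely to demonstrate the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMP(ENO TEXT, ENAME TEXT, TITLE TEXT);
CREATE TABLE SAL(TITLE TEXT, AMT INTEGER);
CREATE TABLE ASG(ENO TEXT, PNO TEXT, RESP TEXT, DUR INTEGER);
INSERT INTO EMP VALUES ('E1', 'J. Doe',   'Elect. Eng.'),
                       ('E2', 'M. Smith', 'Analyst');
INSERT INTO SAL VALUES ('Elect. Eng.', 40000), ('Analyst', 34000);
INSERT INTO ASG VALUES ('E1', 'P1', 'Manager', 12),
                       ('E2', 'P1', 'Analyst', 24);
""")
rows = conn.execute("""
    SELECT ENAME, AMT
    FROM EMP, ASG, SAL
    WHERE ASG.DUR > 12
      AND EMP.ENO = ASG.ENO
      AND SAL.TITLE = EMP.TITLE
""").fetchall()
print(rows)  # only M. Smith worked more than 12 months
```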
1.Data Independence
• Data independence is the type of data transparency that matters for a
centralized DBMS. When a user application is written, it should not be concerned with
the details of physical data organization. There are two types of data independence:
physical and logical data independence.
• Logical data independence refers to the immunity of user applications to changes in the
logical structure (i.e., schema) of the database
• Physical data independence deals with hiding the details of the storage structure from
user applications.
2.Network Transparency
• Users should be protected from the operational details of the network, possibly even
hiding the existence of the network itself. This type of transparency is referred to as network
transparency or distribution transparency.
3.Replication Transparency
• For performance, reliability, and availability reasons, it may be desirable to replicate data
across the machines of a network. Furthermore, if one of the machines fails, a copy of the
data is still available on another machine on the network. In fact, the decision as to whether
to replicate or not, and how many copies of any database object to have, depends to a
considerable degree on user applications.
4.Fragmentation Transparency
• There are two general fragmentation alternatives.
• In horizontal fragmentation, a relation is partitioned into a set of sub-relations, each of
which contains a subset of the tuples (rows) of the original relation. In vertical
fragmentation, each sub-relation is defined on a subset of the attributes (columns)
of the original relation.
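The two alternatives can be sketched on the EMP relation; the tuples below are invented:

```python
# A small illustration of horizontal vs. vertical fragmentation of the
# relation EMP(ENO, ENAME, TITLE); the rows are invented sample data.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe",   "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Analyst"},
    {"ENO": "E3", "ENAME": "A. Lee",   "TITLE": "Elect. Eng."},
]

# Horizontal fragmentation: each fragment holds a subset of the tuples,
# selected by a predicate (here: the employee's title).
EMP1 = [t for t in EMP if t["TITLE"] == "Elect. Eng."]
EMP2 = [t for t in EMP if t["TITLE"] != "Elect. Eng."]

# Vertical fragmentation: each fragment holds a subset of the attributes;
# the key ENO is kept in both so the relation can be reconstructed by join.
EMP_V1 = [{"ENO": t["ENO"], "ENAME": t["ENAME"]} for t in EMP]
EMP_V2 = [{"ENO": t["ENO"], "TITLE": t["TITLE"]} for t in EMP]

# Reconstruction: union for horizontal fragments, join on ENO for vertical.
assert sorted(EMP1 + EMP2, key=lambda t: t["ENO"]) == EMP
rejoined = [{**a, **b} for a in EMP_V1 for b in EMP_V2 if a["ENO"] == b["ENO"]]
assert rejoined == EMP
```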
5.Who Should Provide Transparency?
• To provide easy and efficient access by users to the services of the DBMS, one would
want to have full transparency.
The responsibility for providing transparent access can be placed at the access (language) layer,
at the operating system level, or within the DBMS itself.
Reliability Through Distributed Transactions
• The failure of a single site, or the failure of a communication link which makes one or
more sites unreachable, is not sufficient to bring down the entire system.
• In the case of a distributed database, this means that some of the data may be
unreachable, but with proper care, users may be permitted to access other parts of the
distributed database. The proper care comes in the form of support for distributed
transactions and application protocols.
Begin transaction SALARY UPDATE
begin
EXEC SQL UPDATE PAY
SET SAL = SAL*1.1
end.
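A minimal runnable sketch of the same update using SQLite; the PAY rows are invented, and the embedded SQL is replaced by Python's sqlite3 API:

```python
import sqlite3

# Sketch of the SALARY UPDATE transaction above: the 10% raise either
# commits as a single unit or is rolled back entirely.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE PAY(TITLE TEXT, SAL REAL)")
conn.execute("INSERT INTO PAY VALUES ('Analyst', 1000.0), ('Programmer', 2000.0)")

try:
    conn.execute("BEGIN")
    conn.execute("UPDATE PAY SET SAL = SAL * 1.1")  # the raise
    conn.execute("COMMIT")                          # all rows updated, or...
except sqlite3.Error:
    conn.execute("ROLLBACK")                        # ...none of them

salaries = [row[0] for row in conn.execute("SELECT SAL FROM PAY ORDER BY SAL")]
print([round(s, 2) for s in salaries])  # [1100.0, 2200.0]
```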
Improved Performance
The case for the improved performance of distributed DBMSs is typically made based on two
points.
• First, a distributed DBMS fragments the conceptual database, enabling data to be stored
in close proximity to its points of use (also called data localization).
• This has two potential advantages: firstly, since each site handles only a portion of the
database, contention for CPU and I/O services is not as severe as for centralized
databases and secondly, localization reduces remote access delays that are usually
involved in wide area networks.
Easier System Expansion
• In a distributed environment, it is much easier to accommodate increasing database
sizes.
• Major system overhauls are seldom necessary; expansion can usually be handled by
adding processing and storage power to the network.
Obviously, it may not be possible to obtain a linear increase in power, since this also depends on
the overhead of distribution. However, significant improvements are still possible.
3. Draw the Architectural Models for Distributed DBMS.
Distributed DBMS Architecture
• The architecture of a system defines its structure. This means that the components of the
system are identified, the function of each component is specified, and the
interrelationships and interactions among these components are defined.
• The specification of the architecture of a system requires identification of the various
modules, with their interfaces and interrelationships, in terms of the data and control flow
through the system.
Client - Server Architecture for DDBMS
• General idea: Divide the functionality into two classes:
• server functions
∗ mainly data management, including query processing, optimization, transaction management,
etc.
• client functions
∗ might also include some data management functions (consistency checking, transaction
management, etc.), not just the user interface
• Provides a two-level architecture
• More efficient division of work
• Different types of client/server architecture
• Multiple client/single server
• Multiple client/multiple server
• Peer-to-Peer Architecture for DDBMS
• Peer-to-peer distribution: in these systems, each peer acts both as a client and as a server
for providing database services.
• Local internal schema (LIS)
• Describes the local physical data organization (which might be different on each
machine)
• Local conceptual schema (LCS)
• Describes logical data organization at each site
• Required since the data are fragmented and replicated
• Global conceptual schema (GCS)
• Describes the global logical view of the data
• Union of the LCSs
• External schema (ES)
Describes the user/application view on the data
• Multi - DBMS Architectures
• This is an integrated database system formed by a collection of two or more autonomous
database systems.
• The local DBMSs present to the multi-database layer the part of their local DB
they are willing to share.
4.What are the Distribution Design issues?
Design Issues of DDBS
Example relation: STUDENT(Regd_No, Name, Course, Address, Semester)
Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is
used. This is the most flexible fragmentation technique; however, reconstruction of the original
table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
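Both alternatives can be sketched on the invented STUDENT relation above; this sketch takes the second route (horizontal first, then vertical):

```python
# Hybrid fragmentation sketch on invented STUDENT rows:
# step 1 fragments horizontally by Course, step 2 fragments one of those
# fragments vertically (the key Regd_No is kept in both vertical pieces).
STUDENT = [
    {"Regd_No": 1, "Name": "Asha", "Course": "CS", "Address": "Pune",  "Semester": 3},
    {"Regd_No": 2, "Name": "Ravi", "Course": "CS", "Address": "Delhi", "Semester": 5},
    {"Regd_No": 3, "Name": "Mary", "Course": "EE", "Address": "Goa",   "Semester": 3},
]

# Step 1: horizontal fragments by Course.
CS = [t for t in STUDENT if t["Course"] == "CS"]
EE = [t for t in STUDENT if t["Course"] == "EE"]

# Step 2: vertical fragments of the CS fragment.
CS_ID   = [{"Regd_No": t["Regd_No"], "Name": t["Name"], "Address": t["Address"]} for t in CS]
CS_ACAD = [{"Regd_No": t["Regd_No"], "Course": t["Course"], "Semester": t["Semester"]} for t in CS]

# Reconstruction is the expensive part: join the vertical fragments back
# together on the key, then union with the other horizontal fragments.
CS_rejoined = [{**a, **b} for a in CS_ID for b in CS_ACAD if a["Regd_No"] == b["Regd_No"]]
original = sorted(CS_rejoined + EE, key=lambda t: t["Regd_No"])
assert original == STUDENT
```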
Reasons for Fragmentation
• Fragmentation at the most basic level is about how data is fragmented on a storage device.
• Fragmentation is a result of continuous application or file system storage, where
different parts of a given application or file are stored in a sequential set of storage
blocks on a storage device.
• An operating system will typically store application and file storage blocks in the next
available location on a storage device.
Allocation Problem
Minimal cost: the main objective is to generate a fragment allocation schema
which minimizes the total data transmission cost during the execution of database queries.
Performance: the allocation strategy is designed to maintain a performance metric. Two well-
known ones are to minimize the response time and to maximize the system throughput at each
site.
• The first layer decomposes the calculus query (high-level query) into an algebraic
query with global relations. It also checks the query for syntax errors (violations of the
language grammar) and semantic errors (errors in meaning).
Languages
Types of Optimization
Optimization Timing
Statistics
Decision Sites
Use of Semijoins
Languages
• The input language can be relational algebra or calculus and the output language is relational
algebra (annotated with communication primitives).
• The query processor must efficiently map the input language to the output language
Types of Optimization
• The output language specification represents the execution strategy. There can be many
such strategies, the best one can be selected through exhaustive search, or by applying
heuristic (minimize size of intermediate relations).
• Static: done before executing the query
• Dynamic: done at run time
• Hybrid: mixes static and dynamic approaches
Statistics (analysis)
• fragment cardinality and size
• size and number of distinct values for each attribute.
• detailed histograms of attribute values for better selectivity estimation.
Decision Sites
• one site or several sites participate in selection of strategy
1.normalization
2.analysis
3.elimination of redundancy
4.rewriting
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema describing the
global relations. However, the information about data distribution is not used here but in the next
layer. Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps. First, the calculus query is
rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a
query generally involves the manipulation of the query quantifiers and of the query qualification by
applying logical operator priority.
Second, the normalized query is analyzed semantically so that incorrect queries are detected and
rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of
relational calculus. Typically, they use some sort of graph that captures the semantics of the query
Third, the correct query (still expressed in relational calculus) is simplified. One way to simplify a
query is to eliminate redundant predicates. Note that redundant queries are likely to arise when a
query is the result of system transformations applied to the user query. Such transformations are
used for performing semantic data control (views, protection, and semantic integrity control).
Fourth, the calculus query is restructured as an algebraic query. Note that several algebraic queries
can be derived from the same calculus query, and that some algebraic queries are “better” than
others. The quality of an algebraic query is defined in terms of expected performance. The
traditional way to do this transformation toward a “better” algebraic specification is to start with an
initial algebraic query and transform it in order to find a “good” one. The initial algebraic query is
derived immediately from the calculus query by translating the predicates and the target statement
into relational operators as they appear in the query. This directly translated algebra query is then
restructured through transformation rules. The algebraic query generated by this layer is good in
the sense that the worse executions are typically avoided. For instance, a relation will be accessed
only once, even if there are several select predicates. However, this query is generally far from
providing an optimal execution, since information about data distribution and fragment allocation is
not used at this layer.
6.Write short notes on centralized query optimization.
Centralized Query Optimization:
• In a centralized system, query processing is performed using optimization techniques whose goals include:
• Minimization of response time of query (time taken to produce the results to user’s
query).
• Maximize system throughput (the number of requests that are processed in a given
amount of time).
• Reduce the amount of memory and storage required for processing.
• Increase parallelism.
• In DDBMS joins of fragments that are stored at different sites may increase the
communication time.
Two approaches exist:
Join: a binary operation in relational algebra that combines records from two or
more tables in a database, using values common to each.
Semijoin: a join whose result contains only the columns from one of the joined tables.
It is useful in distributed databases, since we don't have to send as much data over the
network, and it can dramatically speed up certain classes of queries.
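The data-transfer saving can be sketched with two invented fragments R (at site 1) and S (at site 2); the attribute names and sites are illustrative:

```python
# Semijoin R ⋉ S as a size reducer: instead of shipping all of R from
# site 1 to site 2, site 2 first sends the join-attribute values of S,
# and only the matching R tuples travel back.  Data below is invented.
R = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee"), ("E4", "B. Roy")]  # at site 1
S = [("E2", "P1"), ("E3", "P2")]                                                # at site 2

# Step 1: site 2 -> site 1: project S on the join attribute (a small message).
s_keys = {eno for (eno, _) in S}

# Step 2: site 1 computes the semijoin R ⋉ S and ships only those tuples.
R_reduced = [t for t in R if t[0] in s_keys]

# Step 3: site 2 joins the reduced R with S.
result = [(eno, name, pno) for (eno, name) in R_reduced for (e2, pno) in S if eno == e2]
print(result)  # only 2 of the 4 R tuples ever crossed the network
```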
Two approaches exist:
• Optimize the ordering of joins directly
∗ INGRES and distributed INGRES
∗ System R and System R∗
• Replace joins by combinations of semijoins (e.g., SDD-1)
• Direct join ordering of two relation/fragments located at different sites
• Move the smaller relation to the other site
• We have to estimate the size of R and S
• Direct join ordering of queries involving more than two relations is substantially more
complex
• Example: Consider the following query and the respective join graph, where we make
also assumptions about the locations of the three relations/fragments
• PROJ ⋊⋉_PNO ASG ⋊⋉_ENO EMP
• Plan 1: ASG→Site 1
Site 1: ASG’=EMP⋊⋉ASG
ASG’ → Site 3
Site 3: ASG’ ⋊⋉PROJ
• Plan 2: EMP→Site 2
Site 2: EMP’=EMP⋊⋉ASG
EMP’→Site 3
Site 3: EMP’⋊⋉PROJ
• Plan 3: ASG→Site 3
Site 3: ASG’=ASG⋊⋉PROJ
ASG’→Site 1
Site 1: ASG’⋊⋉EMP
• Plan 4: PROJ→Site 2
Site 2: PROJ’=PROJ⋊⋉ASG
PROJ’ → Site 1
Site 1: PROJ’⋊⋉EMP
• Plan 5: EMP→Site 2
PROJ→Site 2
Site 2: EMP⋊⋉PROJ⋊⋉ASG
• The join approach is better if almost all tuples of R participate in the join
Semijoins
• can be used to efficiently implement joins
• The semijoin acts as a size reducer (similar as to a selection) such that smaller
relations need to be transferred
• Consider the join of two relations R and S over attribute A, stored at sites 1 and 2.
• Solution with semijoins: replace one or both operand relations/fragments by a
semijoin (⋉ denotes the semijoin operator), using the following rules:
R ⋊⋉_A S ⇐⇒ (R ⋉_A S) ⋊⋉_A S
⇐⇒ R ⋊⋉_A (S ⋉_A R)
⇐⇒ (R ⋉_A S) ⋊⋉_A (S ⋉_A R)
• A typical semijoin execution: site 2 sends Π_A(S) to site 1; site 1 computes
R′ = R ⋉_A S and sends R′ to site 2; site 2 computes R′ ⋊⋉_A S
• The semijoin approach is better if the semijoin acts as a sufficient reducer (i.e., a few
tuples of R participate in the join)
• Hill Climbing and SDD-1
7.Briefly explain Query optimization
Distributed query optimization refers to the process of producing a plan for the processing of a query to
a distributed database system. The plan is called a query execution plan. In a distributed database
system, schemas and queries refer to logical units of data.
In a relational distributed database system, for instance, logical units of data are relations. These
units may be fragmented at the underlying physical level. The fragments, which can be redundant and
replicated, are allocated to different database servers in the distributed system.
Distributed query optimization requires the evaluation of a large number of query trees, each of which
produces the required results of the query. This is primarily due to the presence of large amounts of
replicated and fragmented data. Hence, the target is to find a near-optimal solution instead of
exhaustively searching for the best solution.
Query trading.
When the operand table is made available, the internal node is executed. The result table replaces the
node and the process is continued until the result table replaces the root node.
The information about the intermediate tables that are required to be passed from one operator to
another is provided by a query plan. The information about the usage of temporary tables, combining of
the operations is provided by the query plan.
• For the decomposition two basic techniques are used: detachment and
substitution
• There’s a processor that can efficiently process mono-relation queries
• ∗ Optimizes each query independently for the access to a single relation
Example: Consider query q1: “Names of employees working on the CAD/CAM project”
q1: SELECT EMP.ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PROJ.PNAME = ”CAD/CAM”
• Detachment decomposes q1 into q11 → q1′, where q11 is a mono-relation query whose
result is used by the remaining query q1′; fragmentation is handled afterwards.
• Optimization with respect to a combination of communication cost and response
time
System R Algorithm
• The System R(centralized) query optimization algorithm
• Performs static query optimization based on “exhaustive search”(complete) of the
solution space and a cost function (IO cost + CPU cost)
∗ Input: relational algebra tree
∗ Output: optimal relational algebra tree
• Dynamic programming technique is applied to reduce the number of alternative plans
• The optimization algorithm consists of two steps
1. Predict the best access method to each individual relation (mono-relation query)
∗ Consider using index, file scan, etc.
2. For each relation R, estimate the best join ordering
∗ R is first accessed using its best single-relation access method
∗ Efficient access to inner relation is crucial
• Considers two different join strategies
∗ (Indexed-) nested loop join , Sort-merge join
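The two join strategies can be sketched as follows, on tiny invented EMP/ASG fragments; a real optimizer would choose between them by estimated cost:

```python
# Minimal sketches of the two join strategies System R considers:
# nested-loop join and sort-merge join, here over the attribute ENO.
EMP = [("E3", "A. Lee"), ("E1", "J. Doe"), ("E2", "M. Smith")]
ASG = [("E2", "P1"), ("E1", "P2"), ("E2", "P3")]

def nested_loop_join(outer, inner):
    # For each outer tuple, scan the whole inner relation.
    return [(e, name, p) for (e, name) in outer for (a, p) in inner if e == a]

def sort_merge_join(r, s):
    # Sort both inputs on the join attribute, then merge with two cursors.
    r, s = sorted(r), sorted(s)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            # Emit all pairings of s tuples sharing this key value.
            k = j
            while k < len(s) and s[k][0] == r[i][0]:
                out.append((r[i][0], r[i][1], s[k][1]))
                k += 1
            i += 1
    return out

a = sorted(nested_loop_join(EMP, ASG))
b = sorted(sort_merge_join(EMP, ASG))
assert a == b  # both strategies compute the same join
print(a)
```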
• Example: Consider query q1: “Names of employees working on the CAD/CAM
project”
• PROJ ⋊⋉_PNO ASG ⋊⋉_ENO EMP
• The best total join order is ((PROJ ⋊⋉ ASG) ⋊⋉ EMP), since it uses the indexes best;
the other orderings are pruned (cut off)
• ∗ Select PROJ using index on PNAME
• ∗ Join with ASG using index on PNO
• ∗ Join with EMP using index on ENO
Distributed Query Optimization Algorithms.
Hill-Climbing Algorithm
• Hill-Climbing query optimization algorithm
• Refinements of an initial feasible solution are recursively computed until no more
cost improvements can be made
• Data replication and fragmentation are not considered
• Devised for wide area point-to-point networks
• It was the first distributed query processing algorithm, and it is called the hill-climbing algorithm
1. Select an initial feasible execution strategy ES0
2. Split ES0 into two strategies: ES1 followed by ES2
• ES1: send one of the relations involved in the join to the other relation’s site
• ES2: send the join result to the final result site
3. Replace ES0 with the split schedule if
cost(ES1) + cost(ES2) + cost(local join) < cost(ES0)
4. Recursively apply steps 2 and 3 on ES1 and ES2 until no more benefit can be gained
5. Check for redundant transmissions in the final plan and eliminate them
Example: What are the salaries of engineers who work on the CAD/CAM project?
• Π_SAL(PAY ⋊⋉_TITLE (EMP ⋊⋉_ENO (ASG ⋊⋉_PNO σ_PNAME=“CAD/CAM”(PROJ))))
• Schemas: EMP(ENO, ENAME, TITLE), ASG(ENO, PNO, RESP, DUR),
PROJ(PNO, PNAME, BUDGET, LOC), PAY(TITLE, SAL)
∗ Size of relations is defined as their cardinality
∗ Minimize total cost
∗ Transmission cost between two sites is 1
∗ Ignore local processing cost
Relation Size Site
EMP 8 1
PAY 4 2
PROJ 1 3
ASG 10 4
• Alternative 1: Resulting site is site 1
• Total cost = cost(PAY → Site1) + cost(ASG → Site1) + cost(PROJ → Site1)
= 4 + 10 + 1 = 15
• Alternative 2: Resulting site is site 2
• Total cost = 8 + 10 + 1 = 19
• Alternative 3: Resulting site is site 3
• Total cost = 8 + 4 + 10 = 22
• Alternative 4: Resulting site is site 4
• Total cost = 8 + 4 + 1 = 13
• Therefore ES0 = EMP→Site4; PAY → Site4; PROJ → Site4
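The four alternatives can be recomputed mechanically; the sizes and sites come from the table above, and, as stated, transmission cost is taken to be the cardinality of the shipped relation while local processing is ignored:

```python
# Recompute the initial-strategy costs: each relation sits at one site,
# and every relation not already at the candidate result site must be
# shipped there at a cost equal to its cardinality.
size = {"EMP": 8, "PAY": 4, "PROJ": 1, "ASG": 10}
site = {"EMP": 1, "PAY": 2, "PROJ": 3, "ASG": 4}

def total_cost(result_site):
    # Sum the sizes of all relations that must be transmitted.
    return sum(size[r] for r in size if site[r] != result_site)

costs = {s: total_cost(s) for s in (1, 2, 3, 4)}
print(costs)  # {1: 15, 2: 19, 3: 22, 4: 13} -> site 4 is cheapest
best = min(costs, key=costs.get)
assert best == 4  # matching ES0 above: ship EMP, PAY, PROJ to site 4
```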
System for Distributed Databases (SDD-1 )
The SDD-1 query optimization algorithm improves the Hill-Climbing algorithm in a number of
directions:
• Semijoins are considered
• More elaborate statistics
• Initial plan is selected better
• Post-optimization step is introduced
UNIT - III
1.What are the properties of transactions?
Properties of Transactions
• The ACID properties
• Atomicity
∗ A transaction is treated as a single/atomic unit of operation and is either executed completely
or not at all
• Consistency
∗ A transaction preserves DB consistency, i.e., does not violate any integrity constraints
• Isolation
∗ A transaction is executed as if it would be the only one.
• Durability
∗ The updates of a committed transaction are permanent in the DB
Atomicity
• Atomicity refers to the fact that a transaction is treated as a unit of operation.
• Therefore, either all the transaction’s actions are completed, or none of them are. This is
also known as the “all-or-nothing property.”
• There are two possible courses of action for an interrupted transaction: it can either be
terminated by completing the remaining actions, or it can be terminated by undoing all
the actions that have already been executed. A failed transaction is then restarted.
• Two types of recovery address two types of failures: transaction recovery is the activity of
restoring atomicity after failures such as input errors, system overloads, and deadlocks
• Crash recovery is the activity of ensuring atomicity in the presence of system crashes
ex: ATM transaction
Consistency
• The consistency of a transaction is simply its correctness: a transaction maps a consistent
DB into a consistent DB (the database must be consistent both before and after the
transaction executes).
• Transactions are correct programs and do not violate database integrity constraints
• Dirty data is data that has been updated by a transaction that has not yet committed.
• This classification groups databases into four levels of consistency, defined in terms of
which transactions may see dirty data.
• Different levels of DB consistency
∗ Degree 0
Transaction T does not overwrite dirty data of other transactions
∗ Degree 1
Degree 0, and T does not commit any writes before EOT (end of transaction)
∗ Degree 2
Degree 1, and T does not read dirty data from other transactions
∗ Degree 3
Degree 2, and other transactions do not dirty any data read by T before T completes
Isolation
• Isolation is the property of transactions which requires each transaction to see a
consistent DB at all times.
• If two concurrent transactions access a data item that is being updated by one of them
(i.e., performs a write operation), it is not possible to guarantee that the second will read
the correct value
• Therefore, if several transactions are executed concurrently, the result must be the same
as if they were executed serially in some order (→ serializability)
• Consider the following two concurrent transactions (T1 and T2), both incrementing x:
T1: Read(x)          T2: Read(x)
    x+1→x                x+1→x
    Write(x)             Write(x)
    Commit               Commit
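The interleaving above can produce a lost update. A minimal sketch, with a hypothetical counter x starting at 0, that simulates the two schedules:

```python
# Simulate the T1/T2 increment example: each transaction reads x into a
# private copy, adds 1, and writes it back.  Serially the result is x+2;
# with the bad interleaving one increment is lost.
def run(schedule, x=0):
    local = {}                       # each transaction's private copy of x
    for (txn, op) in schedule:
        if op == "read":
            local[txn] = x
        elif op == "write":
            x = local[txn] + 1       # x+1 -> x
    return x

serial = [("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")]
interleaved = [("T1", "read"), ("T2", "read"), ("T1", "write"), ("T2", "write")]

assert run(serial) == 2       # correct: both updates survive
assert run(interleaved) == 1  # T2 overwrites T1's update: one is lost
```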
Durability
• Durability refers to that property of transactions which ensures that once a transaction
commits, its results are permanent and cannot be erased from the database.
• The system must guarantee that the results of its operations will never be lost, in spite
of subsequent failures
• Changes that have been committed to the database should remain even in the case of
software and hardware failure.
• For instance, if Joy’s account contains 1 lakh, this information should not disappear upon
hardware or software failure.
• Database recovery is used to achieve the task.
Serializability
• Serializability is a concurrency scheme in which the effect of executing simultaneous
transactions is equivalent to executing them serially in some order. A schedule is a list of
transactions.
• In a serial schedule, each transaction is executed consecutively without any
interference from other transactions: T1🡪T2 OR T2🡪T1
• In a non-serial schedule, the operations from a group of simultaneous transactions
are interleaved.
• In a non-serial schedule, if the schedule is not proper, then problems can arise such as
lost (multiple) updates, uncommitted dependency, and incorrect analysis.
• The main objective of serializability is to find non-serial schedules that allow
transactions to execute concurrently without interference, producing a database state
equivalent to that of some serial execution.
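One standard way to sketch this test is conflict serializability: build a precedence graph with an edge Ti → Tj for each pair of conflicting operations, and reject schedules whose graph has a cycle. The two-transaction schedules below are invented examples:

```python
# Conflict-serializability check via a precedence graph and cycle test.
def serializable(schedule):
    # schedule: list of (txn, op, item), op in {"r", "w"}.
    # Two operations conflict if they are from different transactions,
    # touch the same item, and at least one is a write.
    edges = set()
    for i, (t1, o1, x1) in enumerate(schedule):
        for (t2, o2, x2) in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (o1, o2):
                edges.add((t1, t2))
    # Depth-first reachability: is there a path a -> ... -> b?
    def reachable(a, b, seen=()):
        return any(u == a and (v == b or (v not in seen and reachable(v, b, seen + (v,))))
                   for (u, v) in edges)
    txns = {t for (t, _, _) in schedule}
    return not any(reachable(t, t) for t in txns)   # no cycle -> serializable

good = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "x")]
bad  = [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]
assert serializable(good)     # equivalent to the serial order T1 -> T2
assert not serializable(bad)  # edges T1 -> T2 and T2 -> T1 form a cycle
```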
Batch transactions, on the other hand, take longer to execute (response time being
measured in minutes, hours, or even days) and access a larger portion of the database.
Flat transaction (Online transactions)
• Flat transactions have a single start point (Begin transaction) and a single
termination point (End transaction).
Begin transaction Reservation
...
end.
∗ Nested transaction (Batch transactions)
• An alternative transaction model is to permit a transaction to include other
transactions with their own begin and commit points. Such transactions are called
nested transactions. These transactions that are embedded in another one are usually
called sub transactions.
• The operations of a transaction may themselves be transactions.
Begin transaction Reservation
...
Begin transaction Airline
...
end.
Begin transaction Hotel
...
end. end.
4.Discuss in detail concurrency control mechanisms & algorithms.
Concurrency control mechanisms & algorithms:
Concurrency controlling techniques ensure that multiple transactions are executed simultaneously
while maintaining the ACID properties of the transactions and serializability in the schedules.
Locking Based Concurrency Control Protocols
Locking-based concurrency control protocols use the concept of locking data items. A lock is a
variable associated with a data item that determines whether read/write operations can be
performed on that data item. Generally, a lock compatibility matrix is used which states whether a
data item can be locked by two transactions at the same time.
Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.
One-phase Locking Protocol
In this method, each transaction locks an item before use and releases the lock as soon as it has
finished using it. This locking method provides for maximum concurrency but does not always
enforce serializability.
Two-phase Locking Protocol
In this method, all locking operations precede the first lock-release or unlock operation. The
transaction comprises two phases. In the first phase, the transaction acquires all the locks it
needs and does not release any lock. This is called the expanding or growing phase. In the
second phase, the transaction releases its locks and cannot request any new locks. This is called
the shrinking phase.
Every transaction that follows two-phase locking protocol is guaranteed to be serializable.
However, this approach provides low parallelism between two conflicting transactions.
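The two phases can be sketched as a toy lock manager; the class name, lock modes, and compatibility handling below are illustrative, not any particular DBMS's API:

```python
# A toy (strict) two-phase locking sketch: locks are acquired during the
# growing phase and released only when the transaction commits.
class TwoPhaseLocking:
    def __init__(self):
        self.locks = {}              # item -> (mode, set of holder txns)

    def acquire(self, txn, item, mode):
        # mode: "S" (shared/read) or "X" (exclusive/write).
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if held_mode == "S" and mode == "S":
            holders.add(txn)         # shared locks are compatible
            return True
        if holders == {txn}:         # sole holder may upgrade S -> X
            self.locks[item] = ("X" if mode == "X" else held_mode, holders)
            return True
        return False                 # conflict: caller must wait or abort

    def commit(self, txn):
        # Shrinking phase: release every lock the transaction holds.
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lm = TwoPhaseLocking()
assert lm.acquire("T1", "x", "X")
assert not lm.acquire("T2", "x", "S")  # T2 blocks on T1's write lock
lm.commit("T1")
assert lm.acquire("T2", "x", "S")      # granted after T1 releases
```

A real lock manager would queue blocked requests and detect deadlocks rather than simply returning False.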
5.Briefly explain deadlock Management.
https://www.geeksforgeeks.org/deadlock-in-dbms/
6.Explain in detail time - stamped & optimistic concurrency
control Algorithms.
Q) fault tolerance
ans) https://www.slideshare.net/sumitjain2013/fault-tolerance-in-distributed-systems