Relational Algebra:
Relational algebra is a procedural query language: a query specifies the order in which the
operations are to be performed. Each query is written as a sequence of operations that take
relations as input and produce a relation as output.
The basic operations included in relational algebra are:
1. Select (σ)
2. Project (Π)
3. Union (∪)
4. Set Difference (−)
5. Cartesian product (×)
6. Rename (ρ)
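The six operations listed above can be illustrated with a minimal Python sketch. It assumes a relation is modelled as a list of dicts with hashable values, and that the two inputs of the Cartesian product have disjoint attribute names. The function names and the sample rows are made up for illustration and are not part of any DBMS.

```python
from itertools import product

def _dedupe(rows):
    """Drop duplicate tuples so the result behaves like a set."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def select(rel, pred):        # σ: keep the tuples that satisfy pred
    return [r for r in rel if pred(r)]

def project(rel, attrs):      # Π: keep only the named attributes
    return _dedupe([{a: r[a] for a in attrs} for r in rel])

def union(r1, r2):            # ∪: tuples in either relation
    return _dedupe(r1 + r2)

def difference(r1, r2):       # −: tuples in r1 that are not in r2
    keys = {tuple(sorted(r.items())) for r in r2}
    return [r for r in r1 if tuple(sorted(r.items())) not in keys]

def cartesian(r1, r2):        # ×: every pairing (attribute names must not clash)
    return [{**a, **b} for a, b in product(r1, r2)]

def rename(rel, mapping):     # ρ: rename attributes according to mapping
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rel]

# Made-up sample data.
student = [
    {"sid": 101, "sname": "Asha"},
    {"sid": 102, "sname": "Ben"},
]

# Π sname (σ sid=101 (student)): the order of the operations is explicit.
result = project(select(student, lambda r: r["sid"] == 101), ["sname"])
print(result)  # [{'sname': 'Asha'}]
```

Note how the nesting of the calls fixes the evaluation order, which is what makes the language procedural.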
Relational Calculus:
Relational calculus is a formal, non-procedural query language, also described as declarative.
A query does not specify the order in which the operations are to be performed; it states only
what result is to be obtained.
Relational Calculus has two variations:
1. Tuple Relational Calculus (TRC)
2. Domain Relational Calculus (DRC)
A relational calculus query (shown here in its tuple form) is written as:
{ t | P(t) } where
t: a tuple variable that ranges over the tuples of a relation
P(t): the condition (predicate) that must be true for t to appear in the result.
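A Python comprehension is a close analogue of the { t | P(t) } notation, and may help make the declarative idea concrete. The sketch below is illustrative only: the student rows and the predicate are made up.

```python
# Made-up sample relation; each tuple t is a dict.
student = [
    {"sid": 101, "sname": "Asha", "year": 2},
    {"sid": 102, "sname": "Ben", "year": 1},
    {"sid": 103, "sname": "Chen", "year": 2},
]

def P(t):
    """Predicate: true for the tuples that belong in the result."""
    return t["year"] == 2

# { t | P(t) }: we state the condition, not the steps used to evaluate it.
result = [t for t in student if P(t)]
names = [t["sname"] for t in result]
print(names)  # ['Asha', 'Chen']
```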
Difference between Relational Algebra and Relational Calculus:
S.NO Relational Algebra Relational Calculus
Query processing
Query processing covers the activities involved in parsing, validating, optimizing, and
executing a query.
Example:
Query Processing
#1. SQL Query: SELECT sname FROM student WHERE sid = 101;
#2. SQL Query: SELECT regno, mark1 FROM stud_marks WHERE mark1 >= 50;
#3. SQL Query: SELECT SUM(mark1) FROM marks WHERE year = 2 AND dept = 'CS';
#4. SQL Query: SELECT sname, totalMarks FROM student S, marks M WHERE S.regno = M.regno;
#5. SQL Query: SELECT empName FROM Emp E, Project P, Works_for W WHERE E.idno = W.idno
AND W.pno = P.pno AND ebdate > '31-Jan-1980' AND hours > 20;
Query Tree
Query: SELECT patient_Name, age FROM patient WHERE status = 'ICU' AND gender = 'F';
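One way to picture the query tree for this query is Π patient_Name, age (σ status='ICU' ∧ gender='F' (patient)): the base relation is the leaf, the Selection sits above it, and the Projection is the root. The Python sketch below encodes that tree as nested tuples and evaluates it bottom-up. The tuple encoding, the evaluate function, and the patient rows are all assumptions made for illustration.

```python
# Made-up sample data for the patient relation.
db = {
    "patient": [
        {"patient_Name": "Mara", "age": 54, "status": "ICU", "gender": "F"},
        {"patient_Name": "Otis", "age": 61, "status": "ICU", "gender": "M"},
        {"patient_Name": "Lena", "age": 33, "status": "Ward", "gender": "F"},
    ]
}

# Root = Projection, child = Selection, leaf = base relation.
tree = (
    "project", ["patient_Name", "age"],
    ("select", lambda t: t["status"] == "ICU" and t["gender"] == "F",
     ("relation", "patient")),
)

def evaluate(node, database):
    """Evaluate a query tree bottom-up: leaves first, root last."""
    op = node[0]
    if op == "relation":
        return database[node[1]]
    if op == "select":
        _, pred, child = node
        return [t for t in evaluate(child, database) if pred(t)]
    if op == "project":
        _, attrs, child = node
        return [{a: t[a] for a in attrs} for t in evaluate(child, database)]
    raise ValueError(f"unknown operation: {op}")

result = evaluate(tree, db)
print(result)  # [{'patient_Name': 'Mara', 'age': 54}]
```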
Semantic Analysis
The objective of semantic analysis is to reject normalized queries that are incorrectly
formulated or contradictory. A query is incorrectly formulated if components do not
contribute to the generation of the result, which may happen if some join specifications
are missing. A query is contradictory if its predicate cannot be satisfied by any tuple. For
example, the predicate (position = ‘Manager’ ∧ position = ‘Assistant’) on the Staff relation is
contradictory, as a member of staff cannot be both a Manager and an Assistant simultaneously.
Algorithms to determine correctness exist only for the subset of queries that do
not contain disjunction and negation. For these queries, we could apply the following
checks:
(1) Construct a relation connection graph (Wong and Youssefi, 1976). If the graph is not
connected, the query is incorrectly formulated. To construct a relation connection
graph, we create a node for each relation and a node for the result. We then create
edges between two nodes that represent a join, and edges between nodes that represent the source
of Projection operations.
(2) Construct a normalized attribute connection graph (Rosenkrantz and Hunt, 1980).
If the graph has a cycle for which the valuation sum is negative, the query is contradictory. To
construct a normalized attribute connection graph, we create a node for each reference to an
attribute, or constant 0. We then create a directed edge between nodes that represent a join, and a
directed edge between an attribute node and a constant 0 node that represents a Selection
operation. Next, we weight the edges a → b with the value c, if it represents the inequality
condition (a ≤ b + c), and weight the edges 0 → a with the value -c, if it represents the inequality
condition (a ≥ c).
The normalized attribute connection graph for this query shown in Figure 21.3(b) has a
cycle between the nodes c.maxRent and 0 with a negative valuation sum, which indicates
that the query is contradictory. Clearly, we cannot have a client with a maximum rent that
is both greater than £500 and less than £200.
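The negative-cycle check can be sketched with the Bellman-Ford algorithm, which is a standard way to detect negative cycles and is swapped in here as an illustration; the section does not say which algorithm a DBMS would use. The sketch treats the two maxRent conditions as the non-strict inequalities maxRent ≥ 500 and maxRent ≤ 200, which is a simplifying assumption, and applies the weighting rules given above.

```python
def has_negative_cycle(nodes, edges):
    """Bellman-Ford negative-cycle test.

    edges is a list of (source, target, weight) triples. Starting every
    distance at 0 is equivalent to adding a virtual source that reaches
    every node, so cycles anywhere in the graph are found.
    """
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # If an edge can still be relaxed, a negative cycle exists.
    return any(dist[u] + w < dist[v] for u, v, w in edges)

nodes = ["0", "c.maxRent"]

# maxRent >= 500  ->  edge 0 -> maxRent with weight -500
# maxRent <= 200  ->  maxRent <= 0 + 200, edge maxRent -> 0 with weight 200
contradictory = [("0", "c.maxRent", -500), ("c.maxRent", "0", 200)]

# For comparison: maxRent >= 200 and maxRent <= 500 can be satisfied.
satisfiable = [("0", "c.maxRent", -200), ("c.maxRent", "0", 500)]

print(has_negative_cycle(nodes, contradictory))  # True
print(has_negative_cycle(nodes, satisfiable))    # False
```

The contradictory cycle sums to -500 + 200 = -300, which is the negative valuation sum the text refers to.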
Example 2:
SELECT Ename, Resp FROM Emp, Works, Project WHERE Emp.Eno = Works.Eno AND
Works.Pno = Project.Pno AND Pname = 'CAD/CAM' AND Dur > 36 AND Title = 'Programmer';
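The relation connection graph check described in step (1) can be sketched for this query as a simple connectivity test. The sketch assumes that Ename belongs to Emp and Resp to Works, since the section does not state which relation owns each attribute; the node name RESULT and the helper is_connected are also made up for illustration.

```python
from collections import defaultdict, deque

def is_connected(nodes, edges):
    """Breadth-first search over an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = nodes[0]
    seen = {start}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for m in adj[n] - seen:
            seen.add(m)
            queue.append(m)
    return seen == set(nodes)

nodes = ["Emp", "Works", "Project", "RESULT"]

# One edge per join predicate in the WHERE clause.
join_edges = [("Emp", "Works"), ("Works", "Project")]
# One edge from each relation that supplies a projected attribute (assumed ownership).
projection_edges = [("Emp", "RESULT"), ("Works", "RESULT")]

ok = is_connected(nodes, join_edges + projection_edges)

# Dropping the join Works.Pno = Project.Pno leaves Project isolated.
broken = is_connected(nodes, [("Emp", "Works")] + projection_edges)

print(ok, broken)  # True False
```

With the second join removed, Project contributes nothing to the result, which is the situation the text calls an incorrectly formulated query.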
1. Database Statistics
The success of estimating the size and cost of intermediate relational algebra operations depends
on the amount and currency of the statistical information that the DBMS holds. Typically, we
would expect a DBMS to hold the following types of information in its system catalog:
For each base relation R
nTuples(R) – the number of tuples (records) in relation R (that is, its cardinality).
bFactor(R) – the blocking factor of R (that is, the number of tuples of R that fit into one
block).
nBlocks(R) – the number of blocks required to store R. If the tuples of R are stored
physically together, then:
nBlocks(R) = [nTuples(R)/bFactor(R)]
We use [x] to indicate that the result of the calculation is rounded to the smallest integer
that is greater than or equal to x.
For each attribute A of base relation R
nDistinctA(R) – the number of distinct values that appear for attribute A in relation R.
minA(R),maxA(R) – the minimum and maximum possible values for the attribute A in
relation R.
SCA(R) – the selection cardinality of attribute A in relation R. This is the average number
of tuples that satisfy an equality condition on attribute A. If we assume that the values of
A are uniformly distributed in R, and that there is at least one value that satisfies the
condition, then:
SCA(R) = 1, if A is a key attribute of R
SCA(R) = [nTuples(R)/nDistinctA(R)], otherwise
For each multilevel index I on attribute set A
nLevelsA(I) – the number of levels in I.
nLfBlocksA(I) – the number of leaf blocks in I.
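The derived statistics can be computed directly from the stored ones. The sketch below is illustrative: the function names are made up, and the Staff figures (3000 tuples, blocking factor 30, 500 distinct branchNo values, 10 distinct position values) are the ones used for the Staff relation elsewhere in these notes. It assumes the uniform-distribution estimate for selection cardinality.

```python
import math

def n_blocks(n_tuples, b_factor):
    """nBlocks(R) = [nTuples(R)/bFactor(R)], rounded up."""
    return math.ceil(n_tuples / b_factor)

def selection_cardinality(n_tuples, n_distinct, is_key=False):
    """SCA(R): 1 for a key attribute, else [nTuples(R)/nDistinctA(R)]."""
    return 1 if is_key else math.ceil(n_tuples / n_distinct)

staff_blocks = n_blocks(3000, 30)
sc_position = selection_cardinality(3000, 10)
sc_branch_no = selection_cardinality(3000, 500)
sc_staff_no = selection_cardinality(3000, 3000, is_key=True)

print(staff_blocks, sc_position, sc_branch_no, sc_staff_no)  # 100 300 6 1
```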
2. Selection Operation (S = σp(R))
The Selection operation in the relational algebra works on a single relation R, say, and defines a
relation S containing only those tuples of R that satisfy the specified predicate. The predicate
may be simple, involving the comparison of an attribute of R with either a constant value or
another attribute value.
Linear Search
Retrieve every record in the file and test whether its attribute value satisfies the selection
condition. The estimated cost in blocks is:
[nBlocks(R)/2], if the equality condition is on a key attribute.
[nBlocks(R)], otherwise.
Binary Search
[log2(nBlocks(R))], if the equality condition is on a key attribute, because SCA(R) = 1 in this case.
[log2(nBlocks(R))] + [SCA(R)/bFactor(R)] – 1, otherwise.
Example 1:
Linear search = nBlocks(Employee) = nTuples(Employee)/bFactor(Employee)
= 10,000/10
= 1000
Binary search = [log2(nBlocks(Employee))] + [SCdeptno(Employee)/bFactor(Employee)] – 1
= [log2(1000)] + 200/10 – 1
= 10 + 20 – 1 = 29
If the selection condition involves an equality comparison on a key attribute on which the file
is ordered, binary search is more efficient than linear search.
Example 2: Relation: R (A, B, C)
nR = 10000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
Query: select * from R where A = a1 and B = b1
Relational Algebra: σA=a1 ∧ B=b1 (R)
The cost of selection using:
Linear search = bR
= nR/bfR
= 10000/20
= 500
Binary search = [log2(bR)] + [SC(A, R)/bfR] – 1, where SC(A, R) = nR/dist(A) = 10000/50 = 200
= [log2(500)] + 200/20 – 1
= 9 + 10 – 1
= 18
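The two worked examples can be checked with a short Python sketch of the linear and binary search cost formulas. The function names are made up for illustration, and [x] is implemented as math.ceil, following the rounding convention stated earlier in these notes.

```python
import math

def linear_search_cost(n_blocks, on_key=False):
    """[nBlocks/2] for an equality condition on a key, else nBlocks."""
    return math.ceil(n_blocks / 2) if on_key else n_blocks

def binary_search_cost(n_blocks, sc=1, b_factor=1, on_key=False):
    """[log2(nBlocks)] on a key, else [log2(nBlocks)] + [SC/bFactor] - 1."""
    base = math.ceil(math.log2(n_blocks))
    if on_key:
        return base
    return base + math.ceil(sc / b_factor) - 1

# Example 1: Employee, 10,000 tuples, 10 tuples per block, SC(deptno) = 200.
emp_blocks = math.ceil(10_000 / 10)
ex1_linear = linear_search_cost(emp_blocks)
ex1_binary = binary_search_cost(emp_blocks, sc=200, b_factor=10)

# Example 2: R, 10,000 tuples, 20 tuples per page, dist(A) = 50.
r_blocks = math.ceil(10_000 / 20)
sc_a = math.ceil(10_000 / 50)
ex2_linear = linear_search_cost(r_blocks)
ex2_binary = binary_search_cost(r_blocks, sc=sc_a, b_factor=20)

print(ex1_linear, ex1_binary)  # 1000 29
print(ex2_linear, ex2_binary)  # 500 18
```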
Exercise: Assume the Staff relation has the following statistics:
nTuples(Staff) = 3000
bFactor(Staff) = 30
nDistinctbranchNo(Staff) = 500
nDistinctposition(Staff) = 10
Calculate the cost of selection using:
a) Linear search
b) Binary search
The predicate may also be composite, involving more than one condition, with conditions
combined using the logical connectives ∧ (AND), ∨ (OR), and ~ (NOT). There are a number of
different implementations for the Selection operation, depending on the structure of the file in
which the relation is stored, and on whether the attribute(s) involved in the predicate have been
indexed/hashed. The main strategies that we consider are:
S1: This Selection operation contains an equality condition on the primary key. Therefore,
as the attribute staffNo is hashed we can use strategy 3 defined above to estimate the
cost as 1 block. The estimated cardinality of the result relation is SCstaffNo(Staff) = 1.
S2: The attribute in the predicate is a non-key, non-indexed attribute, so we cannot
improve on the linear search method, giving an estimated cost of 100 blocks. The
estimated cardinality of the result relation is SCposition(Staff) = 300.
S3: The attribute in the predicate is a foreign key with a clustering index, so we can use
Strategy 6 to estimate the cost as 2 + [6/30] = 3 blocks. The estimated cardinality of
the result relation is SCbranchNo(Staff) = 6.
S4: The predicate here involves a range search on the salary attribute, which has a B+-tree
index, so we can use strategy 7 to estimate the cost as: 2 + [50/2] + [3000/2] = 1527
blocks. However, this is significantly worse than the linear search strategy, so in this
case we would use the linear search method. The estimated cardinality of the result
relation is SC salary(Staff) = [3000*(50000–20000)/(50000–10000)] = 2250.
S5: In the last example, we have a composite predicate but the second condition can be
implemented using the clustering index on branchNo (S3 above), which we know has
an estimated cost of 3 blocks. While we are retrieving each tuple using the clustering
index, we can check whether it satisfies the first condition (position = ‘Manager’).
We know that the estimated cardinality of the second condition is SCbranchNo(Staff) = 6.
If we call this intermediate relation T, then we can estimate the number of distinct
values of position in T, nDistinctposition(T), as: [(6 + 10)/3] = 6. Applying the first condition
to T, the estimated cardinality of the result relation is SCposition(T) = 6/6 = 1, which would be
correct if there is one manager for each branch.
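The arithmetic in S3 to S5 can be reproduced with a few lines of Python. This is only a check of the numbers quoted above: the constants are copied from the text, the variable names are made up, and the strategy definitions themselves are not restated here.

```python
import math

n_tuples = 3000   # nTuples(Staff), as used in S4
b_factor = 30     # bFactor(Staff), as used in S3

# S3: clustering index on branchNo, 2 + [6/30].
s3_cost = 2 + math.ceil(6 / b_factor)

# S4: B+-tree range search, 2 + [50/2] + [3000/2], compared with a linear search.
s4_index_cost = 2 + math.ceil(50 / 2) + math.ceil(n_tuples / 2)
s4_linear_cost = math.ceil(n_tuples / b_factor)
s4_cost = min(s4_index_cost, s4_linear_cost)
s4_cardinality = math.ceil(n_tuples * (50000 - 20000) / (50000 - 10000))

# S5: distinct position values in T, then the cardinality of the result.
n_distinct_position_t = math.ceil((6 + 10) / 3)
s5_cardinality = 6 // n_distinct_position_t

print(s3_cost)                                 # 3
print(s4_index_cost, s4_linear_cost, s4_cost)  # 1527 100 100
print(s4_cardinality)                          # 2250
print(n_distinct_position_t, s5_cardinality)   # 6 1
```

The S4 comparison shows why the linear search is preferred there: the index-based estimate is far larger than reading all 100 blocks of the file.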
EQUI Join
An EQUI JOIN is a Theta join whose condition uses only the equality comparison. The join is
among the most difficult operations to implement efficiently in an RDBMS, and is one reason
why RDBMSs can have serious performance problems.
For example: A ⋈ A.column 2 = B.column 2 (B)
Column 1 Column 2
1 1
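A nested-loop implementation is the simplest way to see what an equijoin computes, although it is not how a DBMS would necessarily execute it. In the sketch below the function name, the prefixes, and the sample rows of A and B are all made up for illustration.

```python
def equi_join(r, s, r_attr, s_attr, r_name="A", s_name="B"):
    """Nested-loop equijoin: pair every tuple of r with every tuple of s
    and keep the pairs whose join attributes are equal. Attribute names
    are prefixed with the relation name to avoid clashes."""
    out = []
    for a in r:
        for b in s:
            if a[r_attr] == b[s_attr]:
                row = {f"{r_name}.{k}": v for k, v in a.items()}
                row.update({f"{s_name}.{k}": v for k, v in b.items()})
                out.append(row)
    return out

# Made-up sample relations.
A = [{"col1": 1, "col2": 1}, {"col1": 1, "col2": 2}]
B = [{"col1": 1, "col2": 1}, {"col1": 1, "col2": 3}]

joined = equi_join(A, B, "col2", "col2")
print(joined)
```

The double loop compares every pair of tuples, which is why the cost grows with the product of the two relation sizes when no index or ordering can be used.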
Outer Join
An OUTER JOIN doesn't require each record in the two joined tables to have a matching record.
In this type of join, the table retains each record even if no other matching record exists.
The three types of outer join are the left outer join, the right outer join, and the full
outer join.
In a FULL OUTER JOIN, all tuples from both relations are included in the result, irrespective of
the matching condition.
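A full outer join can be sketched in Python by keeping every matched pair and then padding the unmatched tuples from each side with None (standing in for NULL). The function name, its parameters, and the sample rows are assumptions made for illustration, and the sketch joins on a single attribute with the same name in both relations.

```python
def full_outer_join(r, s, key, r_cols, s_cols):
    """Full outer join on one shared attribute, padding with None."""
    out = []
    matched_s = set()
    for a in r:
        hit = False
        for j, b in enumerate(s):
            if a[key] == b[key]:
                out.append({**a, **b})
                matched_s.add(j)
                hit = True
        if not hit:
            # Unmatched tuple of r: attributes of s are padded with None.
            out.append({**{c: None for c in s_cols}, **a})
    for j, b in enumerate(s):
        if j not in matched_s:
            # Unmatched tuple of s: attributes of r are padded with None.
            out.append({**{c: None for c in r_cols}, **b})
    return out

# Made-up sample relations.
emp = [{"eno": 1, "name": "Ana"}, {"eno": 2, "name": "Raj"}]
works = [{"eno": 1, "pno": 10}, {"eno": 3, "pno": 30}]

rows = full_outer_join(emp, works, "eno", ["eno", "name"], ["eno", "pno"])
for row in rows:
    print(row)
```

Keeping only the first loop gives a left outer join, and keeping only the matched pairs plus the second loop gives a right outer join.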
Example:
Estimating the cardinality of the Join operation
The cardinality of the Cartesian product of R and S is simply:
nTuples(R) * nTuples(S)
Unfortunately, it is much more difficult to estimate the cardinality of any join as it depends
on the distribution of values in the joining attributes. In the worst case, we know that the
cardinality of the join cannot be any greater than the cardinality of the Cartesian product, so:
nTuples(T) ≤ nTuples(R) * nTuples(S)
Some systems use this upper bound, but this estimate is generally too pessimistic. If we
again assume a uniform distribution of values in both relations, we can improve on this
estimate for Equijoins with a predicate (R.A = S.B) as follows:
(1) If A is a key attribute of R, then a tuple of S can only join with one tuple of R. Therefore,
the cardinality of the Equijoin cannot be any greater than the cardinality of S:
nTuples(T) ≤ nTuples(S)
(2) Similarly, if B is a key of S, then:
nTuples(T) ≤ nTuples(R)
(3) If neither A nor B is a key, then we could estimate the cardinality of the join as:
nTuples(T) = SCA(R)*nTuples(S) or
nTuples(T) = SCB(S)*nTuples(R)
Table 2 Summary of estimated I/O cost of strategies for Join operation.
Example 1: Cost estimation for Join operation
For the purposes of this example, we make the following assumptions:
There are separate hash indexes with no overflow on the primary key attributes staffNo
of Staff and branchNo of Branch.
There are 100 database buffer blocks.
The system catalog holds the following statistics:
nTuples(Staff) = 6000
bFactor(Staff) = 30 ⇒ nBlocks(Staff) = 200
nTuples(Branch) = 500
bFactor(Branch) = 50 ⇒ nBlocks(Branch) = 10
nTuples(PropertyForRent) = 100,000
bFactor(PropertyForRent) = 50 ⇒ nBlocks(PropertyForRent) = 2000
A comparison of the above four strategies for the following two joins is shown in
Table 2:
J1: Staff ⨝ staffNo PropertyForRent
J2: Branch ⨝ branchNo PropertyForRent
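The cardinality rules for Equijoins given earlier in this section can be applied to J1 and J2 with a short sketch. The function name and its parameters are made up for illustration; the key information comes from the assumption above that staffNo and branchNo are the primary keys of Staff and Branch, and the returned values are upper bounds or estimates, not exact counts.

```python
def estimate_equijoin_cardinality(n_r, n_s, a_is_key=False, b_is_key=False, sc_a=None):
    """Estimate nTuples(T) for T = R joined with S on R.A = S.B.

    a_is_key: A is a key of R, so each tuple of S joins with at most one tuple of R.
    b_is_key: B is a key of S, so each tuple of R joins with at most one tuple of S.
    sc_a:     SCA(R), used when neither attribute is a key.
    """
    if a_is_key and b_is_key:
        return min(n_r, n_s)
    if a_is_key:
        return n_s
    if b_is_key:
        return n_r
    if sc_a is not None:
        return sc_a * n_s
    return n_r * n_s  # fall back to the Cartesian product upper bound

# J1: Staff joined with PropertyForRent on staffNo (key of Staff).
j1 = estimate_equijoin_cardinality(6000, 100_000, a_is_key=True)

# J2: Branch joined with PropertyForRent on branchNo (key of Branch).
j2 = estimate_equijoin_cardinality(500, 100_000, a_is_key=True)

print(j1, j2)  # 100000 100000
```

In both joins the bound is nTuples(PropertyForRent), far below the Cartesian product bounds of 600,000,000 and 50,000,000 tuples.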
Calculate the cost of the two strategies discussed above, if the Staff relation has 10 000 tuples,
Branch has 500 tuples, there are 500 Managers (one for each branch), and there are 10 London
branches.
Using the Hotel schema, assume the following indexes exist:
a hash index with no overflow on the primary key attributes, roomNo/hotelNo in Room;
a clustering index on the foreign key attribute hotelNo in Room;
a B+-tree index on the price attribute in Room;
a secondary index on the attribute type in Room.
nTuples(Room) = 10,000 bFactor(Room) = 200
nTuples(Hotel) = 50 bFactor(Hotel) = 40
nTuples(Booking) = 100,000 bFactor(Booking) = 60
nDistincthotelNo(Room) = 50
nDistincttype(Room) = 10
nDistinctprice(Room) = 500
minprice(Room) = 50 maxprice(Room) = 200
nLevelshotelNo(I) = 2
nLevelsprice(I) = 2 nLfBlocksprice(I) = 50
(a) Calculate the cardinality and minimum cost for each of the following Selection operations:
S1: σroomNo=1 ∧ hotelNo=‘H001’(Room)
S2: σtype=‘D’(Room)
S3: σhotelNo=‘H002’(Room)
S4: σprice>100(Room)
S5: σtype=‘S’ ∧ hotelNo=‘H003’(Room)
S6: σtype=‘S’ ∨ price < 100(Room)
(b) Calculate the cardinality and minimum cost for each of the following Join operations:
J1: Hotel ⋈ hotelNo Room
J2: Hotel ⋈ hotelNo Booking
J3: Room ⋈ roomNo Booking
J4: Room ⋈ hotelNo Hotel
J5: Booking ⋈ hotelNo Hotel
J6: Booking ⋈ roomNo Room