
Advanced database systems

Chapter one Query Processing & Optimization

Relational Algebra and Relational Calculus

Relational Algebra:
Relational algebra is a procedural query language: a query states which operations are to be
performed and the order in which to perform them. Each operation takes one or two relations as
input and produces a new relation, so operations can be nested to build larger queries.
The basic operations included in relational algebra are: 
1. Select (σ)
2. Project (Π)
3. Union (∪)
4. Set Difference (-)
5. Cartesian product (×)
6. Rename (ρ)
Relational Calculus: 
Relational calculus is a formal, non-procedural query language, also known as a declarative
language. A query states what result is to be obtained; the order in which operations are
performed to obtain it is not specified.
Relational Calculus has two variations: 
1. Tuple Relational Calculus (TRC)
2. Domain Relational Calculus (DRC)
A tuple relational calculus query is written as:
{ t | P(t) } where,
t: a tuple variable
P(t): the condition (predicate) on t. The result is the set of all tuples t for which P(t) is true.
Difference between Relational Algebra and Relational Calculus: 
S.No | Relational Algebra | Relational Calculus
1. | It is a procedural language. | It is a declarative language.
2. | States how to obtain the result. | States what result is to be obtained.
3. | The order in which the operations are performed is specified. | The order is not specified.
4. | Independent of the domain. | Can be domain-dependent.
5. | Closer to a programming language. | Not as close to a programming language.
6. | SQL includes only some features from relational algebra. | SQL is based to a greater extent on the tuple relational calculus.
7. | Relational algebra is relationally complete: it can express every query that can be expressed in relational calculus. | A database language is relationally complete if every query expressible in relational calculus can be expressed in it.

Query processing
The activities involved in parsing, validating, optimizing and executing a query.

Query processing includes:

Scan – identifies the keywords, symbols, attribute names and table names in the query.

- Line by line / word by word checking.
- Example: Select Sname From Student;

Parsing – checks the validity, syntax and order/structure of the query.

Example:

1. Select student from sname; // error raised (relation and attribute names are swapped)
2. Select empname from student; // error raised (empname is not an attribute of student)

Query Processing

#1. SQL Query: Select sname from student where sid = 101;

R.A: Π sname (σ sid=101 (student))

#2. SQL Query: select regno, mark1 from stud_marks where mark1 >= 50;

R.A: Π regno, mark1 (σ mark1>=50 (stud_marks))

#3. SQL Query: select sum(mark1) from marks where year = 2 and dept = 'CS';

R.A: ℱ sum(mark1) (σ year=2 AND dept='CS' (marks))
(ℱ is the aggregate function operator; Π alone cannot compute a sum.)

#4. SQL Query: Select sname, totalMarks from student S, marks M where S.regno = M.regno;

R.A: Π sname, totalMarks (σ S.regno=M.regno (S × M))

or, equivalently, using a join:

R.A: Π sname, totalMarks (S ⨝ S.regno=M.regno M)

#5. SQL Query: Select empName from Emp E, Project P, Works_for W where E.idno = W.idno And W.pno
= P.pno And ebdate > '31-Jan-1980' And hours > 20;

R.A: Π empName (σ ebdate>'31-Jan-1980' AND hours>20 (σ P.pno=W.pno (P × (σ E.idno=W.idno (E × W)))))
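The two forms given for #4 (a selection over a Cartesian product, and a join) can be compared on a small example. The following is a minimal Python sketch, not how a DBMS evaluates queries: relations are lists of dictionaries, the helper names select, project, product and join are illustrative, and the sample rows are made up.

```python
from itertools import product as cartesian


def select(rel, pred):
    """Selection (sigma): keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]


def project(rel, attrs):
    """Projection (pi): keep the named attributes and drop duplicate tuples."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def product(r, s, r_name, s_name):
    """Cartesian product: pair every tuple of r with every tuple of s."""
    return [
        {**{f"{r_name}.{k}": v for k, v in a.items()},
         **{f"{s_name}.{k}": v for k, v in b.items()}}
        for a, b in cartesian(r, s)
    ]


def join(r, s, r_name, s_name, pred):
    """Join: a selection applied to the Cartesian product."""
    return select(product(r, s, r_name, s_name), pred)


# Hypothetical sample data (not from the notes).
student = [{"regno": 1, "sname": "Abel"}, {"regno": 2, "sname": "Sara"}]
marks = [{"regno": 1, "totalMarks": 80}, {"regno": 2, "totalMarks": 65}]


def same_regno(t):
    return t["S.regno"] == t["M.regno"]


attrs = ["S.sname", "M.totalMarks"]
via_product = project(select(product(student, marks, "S", "M"), same_regno), attrs)
via_join = project(join(student, marks, "S", "M", same_regno), attrs)
print(via_product == via_join)  # True
```

Both expressions return the same tuples; what differs in a real system is the cost of evaluating them, which is the subject of query optimization.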

Query Tree

Query: select sname from student;

Query : select sname from student where sid=101;

Query: select patient_Name, age from patient where status = 'ICU' AND gender = 'F';

Query: select sname, mark1 from student s, marks m where s.regno = m.regno;

Query: Select ename, salary, dept_head from Emp E, Dept D where E.dno = D.dno
AND dname = 'CS';

Query: select ename from Emp E, Dept D, Course C where E.Idno = D.dno
AND E.specialization = C.course AND E.experience > 10
AND D.dname = 'CS' AND C.credit > 3;

Assignment 1- Find the staff_name of the staff who handle Database and are from the
Computing faculty, provided the staff member is involved with the project Data Mining with
more than 20 working hours.
- Find the SQL query, R.A and query tree for the above case.

Semantic Analysis

The objective of semantic analysis is to reject normalized queries that are incorrectly
formulated or contradictory. A query is incorrectly formulated if components do not
contribute to the generation of the result, which may happen if some join specifications
are missing. A query is contradictory if its predicate cannot be satisfied by any tuple. For
example, the predicate (position = ‘Manager’ ∧ position = ‘Assistant’) on the Staff relation is
contradictory, as a member of staff cannot be both a Manager and an Assistant simultaneously.

Algorithms to determine correctness exist only for the subset of queries that do
not contain disjunction and negation. For these queries, we could apply the following
checks:
(1) Construct a relation connection graph (Wong and Youssefi, 1976). If the graph is not
connected, the query is incorrectly formulated. To construct a relation connection
graph, we create a node for each relation and a node for the result. We then create
edges between two nodes that represent a join, and edges between nodes that represent the source
of Projection operations.
(2) Construct a normalized attribute connection graph (Rosenkrantz and Hunt, 1980).
If the graph has a cycle for which the valuation sum is negative, the query is contradictory. To
construct a normalized attribute connection graph, we create a node for each reference to an
attribute, or constant 0. We then create a directed edge between nodes that represent a join, and a
directed edge between an attribute node and a constant 0 node that represents a Selection
operation. Next, we weight the edges a → b with the value c, if it represents the inequality
condition (a ≤ b + c), and weight the edges 0 → a with the value -c, if it represents the inequality
condition (a ≥ c).

Consider the following SQL query:

SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.clientNo = v.clientNo AND c.maxRent >= 500 AND c.prefType = 'Flat' AND p.ownerNo = 'CO93';

The relation connection graph for this query is not fully connected, implying that the
query is not correctly formulated. In this case, we have omitted the join condition
(v.propertyNo = p.propertyNo) from the predicate.
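The connectivity check on the relation connection graph can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is_connected and the way the graph is encoded (node names plus a list of edges) are assumptions, not part of any DBMS.

```python
def is_connected(nodes, edges):
    """Return True if every node is reachable from the first node (undirected)."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = nodes[0]
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


# One node per relation plus one node for the result.
nodes = ["Client", "Viewing", "PropertyForRent", "result"]

# Edges of the query as written: one join edge and one projection edge.
edges_as_written = [
    ("Client", "Viewing"),          # c.clientNo = v.clientNo
    ("PropertyForRent", "result"),  # p.propertyNo and p.street are projected
]

# The same query with the omitted join condition v.propertyNo = p.propertyNo.
edges_repaired = edges_as_written + [("Viewing", "PropertyForRent")]

print(is_connected(nodes, edges_as_written))  # False
print(is_connected(nodes, edges_repaired))    # True
```

With the join edge missing, Client and Viewing form one component and PropertyForRent and the result form another, so the query is flagged as incorrectly formulated.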

Now consider the query:

SELECT p.propertyNo, p.street
FROM Client c, Viewing v, PropertyForRent p
WHERE c.maxRent > 500 AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo
AND c.prefType = 'Flat' AND c.maxRent < 200;

The normalized attribute connection graph for this query shown in Figure 21.3(b) has a
cycle between the nodes c.maxRent and 0 with a negative valuation sum, which indicates
that the query is contradictory. Clearly, we cannot have a client with a maximum rent that
is both greater than £500 and less than £200.
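The negative-cycle test can be sketched with a Bellman-Ford style relaxation. This is a minimal Python sketch under stated assumptions: the function name has_negative_cycle is illustrative, only the two conditions on c.maxRent are encoded (the join conditions and the prefType condition do not touch that cycle), and the strict inequalities > and < are treated as >= and <= for simplicity.

```python
def has_negative_cycle(nodes, edges):
    """Return True if some cycle in the directed graph has a negative weight sum."""
    # Starting every node at distance 0 acts like a virtual source connected
    # to all nodes, so a negative cycle anywhere in the graph is detected.
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes) - 1):
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
    # If any edge can still be relaxed, a negative cycle exists.
    return any(dist[a] + w < dist[b] for a, b, w in edges)


nodes = ["0", "c.maxRent"]

# c.maxRent >= 500  gives edge 0 -> c.maxRent with weight -500
# c.maxRent <= 200  (c.maxRent <= 0 + 200) gives edge c.maxRent -> 0 with weight 200
contradictory = [("0", "c.maxRent", -500), ("c.maxRent", "0", 200)]

# For comparison, a satisfiable range: 100 <= c.maxRent <= 200.
satisfiable = [("0", "c.maxRent", -100), ("c.maxRent", "0", 200)]

print(has_negative_cycle(nodes, contradictory))  # True
print(has_negative_cycle(nodes, satisfiable))    # False
```

The cycle for the contradictory query sums to -500 + 200 = -300, which is negative, so no value of c.maxRent can satisfy both conditions.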
Example 2:
SELECT Ename, Resp
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND Pname = 'CAD/CAM' AND Dur > 36
AND Title = 'Programmer';

Heuristic based Processing Strategies


What are the names of customers living on Elm Street who have checked out "Terminator"?
SQL query:
SELECT Name
FROM Customer CU, CheckedOut CH, Film F
WHERE Title = 'Terminator' AND F.FilmId = CH.FilmID AND CU.CustomerID = CH.CustomerID
AND CU.Street = 'Elm';

Assignment 2- Using heuristic algorithms, optimize the following SQL query.

SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN AND BDATE > '1957-12-31';

Cost Estimation for the Relational Algebra Operations
A DBMS may have many different ways of implementing the relational algebra operations.
The aim of query optimization is to choose the most efficient one. To do this, it uses formulae
that estimate the costs for a number of options and selects the one with the lowest
cost. In this section we examine the different options available for implementing the main
relational algebra operations.

1. Database Statistics

The success of estimating the size and cost of intermediate relational algebra operations depends
on the amount and currency of the statistical information that the DBMS holds. Typically, we
would expect a DBMS to hold the following types of information in its system catalog:
For each base relation R

- nTuples(R) – the number of tuples (records) in relation R (that is, its cardinality).
- bFactor(R) – the blocking factor of R (that is, the number of tuples of R that fit into one
block).
- nBlocks(R) – the number of blocks required to store R. If the tuples of R are stored
physically together, then:
nBlocks(R) = [nTuples(R)/bFactor(R)]
We use [x] to indicate that the result of the calculation is rounded to the smallest integer
that is greater than or equal to x.
For each attribute A of base relation R
- nDistinctA(R) – the number of distinct values that appear for attribute A in relation R.
- minA(R), maxA(R) – the minimum and maximum possible values for the attribute A in
relation R.
- SCA(R) – the selection cardinality of attribute A in relation R. This is the average number
of tuples that satisfy an equality condition on attribute A. If we assume that the values of
A are uniformly distributed in R, and that there is at least one value that satisfies the
condition, then:
SCA(R) = [nTuples(R)/nDistinctA(R)]
For each multilevel index I on attribute set A
- nLevelsA(I) – the number of levels in I.
- nLfBlocksA(I) – the number of leaf blocks in I.
2. Selection Operation (S = σp(R))
The Selection operation in the relational algebra works on a single relation R, say, and defines a
relation S containing only those tuples of R that satisfy the specified predicate. The predicate
may be simple, involving the comparison of an attribute of R with either a constant value or
another attribute value.
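Linear Search
Retrieve every record in the file and test whether its attribute values satisfy the selection
condition. The estimated cost, in blocks, is:
- [nBlocks(R)/2], if the equality condition is on a key attribute (on average, half the file is
read before the matching tuple is found).
- [nBlocks(R)], otherwise.
Binary Search
If the file is ordered on the attribute in the condition, the estimated cost, in blocks, is:
- [log2(nBlocks(R))], if the equality condition is on a key attribute, because SCA(R) = 1 in this
case.
- [log2(nBlocks(R))] + [SCA(R)/bFactor(R)] – 1, otherwise.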
Example 1:
Linear search = nBlocks(Employee) = nTuples(Employee)/bFactor(Employee)
= 10,000/10
= 1000 blocks
Binary search = [log2(nBlocks(Employee))] + [SCdeptno(Employee)/bFactor(Employee)] – 1
= [log2 1000] + 200/10 – 1
= 10 + 20 – 1 = 29 blocks
If the selection condition involves an equality comparison on a key attribute on which the file is
ordered, binary search is more efficient than linear search.
Example 2: Relation: R (A, B, C)
nR = 10000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
Query: select * from R where A = a1 and B = b1
Relational Algebra: σ A=a1 ∧ B=b1 (R)
- The cost of selection using:
Linear search = bR
= nR/bfR
= 10000/20
= 500
Binary search = [log2 bR] + [SC(A, R)/bfR] – 1, where SC(A, R) = nR/dist(A) = 10000/50 = 200
= [log2 500] + 200/20 – 1
= 9 + 10 – 1
= 18
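The arithmetic in the two examples can be checked with a short Python sketch. The function names below are illustrative, and the formulas are the linear and binary search estimates given above, with the selection cardinality computed as nTuples/nDistinct as in Example 2.

```python
from math import ceil, log2


def n_blocks(n_tuples, b_factor):
    """nBlocks(R) = [nTuples(R) / bFactor(R)]."""
    return ceil(n_tuples / b_factor)


def selection_cardinality(n_tuples, n_distinct):
    """SC_A(R) = [nTuples(R) / nDistinct_A(R)], assuming uniformly distributed values."""
    return ceil(n_tuples / n_distinct)


def linear_search_cost(blocks, equality_on_key=False):
    """[nBlocks/2] for an equality condition on a key attribute, [nBlocks] otherwise."""
    return ceil(blocks / 2) if equality_on_key else blocks


def binary_search_cost(blocks, sc=1, b_factor=1):
    """[log2(nBlocks)] + [SC/bFactor] - 1; with SC = 1 this reduces to [log2(nBlocks)]."""
    return ceil(log2(blocks)) + ceil(sc / b_factor) - 1


# Example 2: nR = 10000, bfR = 20, dist(A) = 50.
blocks = n_blocks(10_000, 20)
sc_a = selection_cardinality(10_000, 50)
print(linear_search_cost(blocks))            # 500
print(binary_search_cost(blocks, sc_a, 20))  # 18
```

The same functions reproduce Example 1: with 1000 blocks, a selection cardinality of 200 and a blocking factor of 10, the binary search estimate is 29 blocks.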
Exercise- Assume the staff relation has the following statistics:
- n(staff) = 3000
- bf(staff) = 30
- ndistinct branchNo(staff) = 500
- ndistinct position(staff) = 10
Calculate the cost of selection using
a) Linear search
b) Binary search
The predicate may also be composite, involving more than one condition, with conditions
combined using the logical connectives ∧ (AND), ∨ (OR), and ~ (NOT). There are a number of
different implementations for the Selection operation, depending on the structure of the file in
which the relation is stored, and on whether the attribute(s) involved in the predicate have been
indexed/hashed. The main strategies that we consider are:

Example 1.0 Cost estimation for Selection operation

For the purposes of this example, we make the following assumptions about the Staff
relation:
- There is a hash index with no overflow on the primary key attribute staffNo.
- There is a clustering index on the foreign key attribute branchNo.
- There is a B+-tree index on the salary attribute.
- The Staff relation has the following statistics stored in the system catalog:
nTuples(Staff) = 3000
bFactor(Staff) = 30 ⇒ nBlocks(Staff) = 100
nDistinctbranchNo(Staff) = 500 ⇒ SCbranchNo(Staff) = 6
nDistinctposition(Staff) = 10 ⇒ SCposition(Staff) = 300
nDistinctsalary(Staff) = 500 ⇒ SCsalary(Staff) = 6
minsalary(Staff) = 10,000 maxsalary(Staff) = 50,000
nLevelsbranchNo(I) = 2
nLevelssalary(I) = 2 nLfBlockssalary(I) = 50
The estimated cost of a linear search on the key attribute staffNo is 50 blocks, and the cost
of a linear search on a non-key attribute is 100 blocks. Now we consider the following
Selection operations, and use the above strategies to improve on these two costs:
S1: σstaffNo=‘SG5’(Staff)
S2: σposition=‘Manager’(Staff)
S3: σbranchNo=‘B003’(Staff)
S4: σsalary>20000(Staff)
S5: σposition=‘Manager’ ∧ branchNo=‘B003’(Staff)

S1: This Selection operation contains an equality condition on the primary key. Therefore,
as the attribute staffNo is hashed we can use strategy 3 defined above to estimate the
cost as 1 block. The estimated cardinality of the result relation is SCstaffNo(Staff) = 1.
S2: The attribute in the predicate is a non-key, non-indexed attribute, so we cannot
improve on the linear search method, giving an estimated cost of 100 blocks. The
estimated cardinality of the result relation is SCposition(Staff) = 300.
S3: The attribute in the predicate is a foreign key with a clustering index, so we can use
Strategy 6 to estimate the cost as 2 + [6/30] = 3 blocks. The estimated cardinality of
the result relation is SCbranchNo(Staff) = 6.
S4: The predicate here involves a range search on the salary attribute, which has a B+-tree
index, so we can use strategy 7 to estimate the cost as: 2 + [50/2] + [3000/2] = 1527
blocks. However, this is significantly worse than the linear search strategy, so in this
case we would use the linear search method. The estimated cardinality of the result
relation is SC salary(Staff) = [3000*(50000–20000)/(50000–10000)] = 2250.
S5: In the last example, we have a composite predicate but the second condition can be
implemented using the clustering index on branchNo (S3 above), which we know has
an estimated cost of 3 blocks. While we are retrieving each tuple using the clustering
index, we can check whether it satisfies the first condition (position = ‘Manager’).
We know that the estimated cardinality of the second condition is SCbranchNo(Staff) = 6.
If we call this intermediate relation T, then we can estimate the number of distinct
values of position in T, nDistinct position(T), as: [(6 + 10)/3] = 6. Applying the second condition
now, the estimated cardinality of the result relation is SC position(T) = 6/6 = 1, which would be
correct if there is one manager for each branch.
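The figures quoted for S3, S4 and S5 can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and each formula is read off the worked numbers in the example rather than taken from a general list of strategies.

```python
from math import ceil

# Catalog statistics for Staff, copied from the example.
n_tuples, b_factor = 3000, 30
n_blocks = ceil(n_tuples / b_factor)
sc_branch_no = ceil(n_tuples / 500)
n_distinct_position = 10
min_salary, max_salary = 10_000, 50_000
n_levels_branch_no = 2
n_levels_salary, n_lf_blocks_salary = 2, 50

# S3: equality condition using the clustering index on branchNo.
cost_s3 = n_levels_branch_no + ceil(sc_branch_no / b_factor)

# S4: range condition salary > 20000 using the B+-tree index on salary.
cost_s4_index = n_levels_salary + ceil(n_lf_blocks_salary / 2) + ceil(n_tuples / 2)
cost_s4 = min(cost_s4_index, n_blocks)  # linear search is used when it is cheaper
card_s4 = ceil(n_tuples * (max_salary - 20_000) / (max_salary - min_salary))

# S5: reuse the S3 access path, then filter the intermediate relation T on position.
n_distinct_position_t = ceil((sc_branch_no + n_distinct_position) / 3)
card_s5 = ceil(sc_branch_no / n_distinct_position_t)

print(cost_s3, cost_s4_index, cost_s4, card_s4, card_s5)  # 3 1527 100 2250 1
```

The comparison in S4 shows why an index is not always the best access path: the index-based estimate of 1527 blocks is far above the 100 blocks of a full scan.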

3. Join Operation (T = (R ⨝F S))


What are Joins?
JOINS in SQL are commands which are used to combine rows from two or more tables,
based on a related column between those tables. They are predominantly used when a user is
trying to extract data from tables which have one-to-many or many-to-many relationships
between them.
Types of Join
There are mainly two types of joins in DBMS:
1. Inner Joins: Theta Join, Natural Join, EQUI Join
2. Outer Join: Left Outer Join, Right Outer Join, Full Outer Join
Inner Join
INNER JOIN is used to return rows from both tables which satisfy the given condition. It is
the most widely used join operation and can be considered the default join type. An inner
join whose join predicate uses only equality comparisons is called an equijoin. If the
predicate uses other comparison operators such as ">", it cannot be called an equijoin.
Inner Join is further divided into three subtypes:
Theta join
Natural join
EQUI join
Theta Join
THETA JOIN allows you to merge two tables based on the condition represented by
theta. Theta joins work for all comparison operators. It is denoted by symbol θ. The
general case of JOIN operation is called a Theta join.
Syntax:
A ⋈θ B

Theta join can use any condition in the selection criteria.

Consider the following tables.

Table A                 Table B
Column 1  Column 2      Column 1  Column 2
1         2             1         2
1         1             1         1
1         2             1         3

Example: A ⋈ A.column 2 > B.column 2 (B)

column 1  column 2
1         2
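A theta join can be sketched in Python as a filter over all pairs of rows. This is an illustrative sketch: the function name theta_join is an assumption, rows are plain tuples copied from the two tables as listed, and duplicates are removed because relations are sets.

```python
import operator


def theta_join(a, b, a_col, b_col, op):
    """Pair each row of a with each row of b and keep the pairs where op holds."""
    return {ra + rb for ra in a for rb in b if op(ra[a_col], rb[b_col])}


# Rows are (column 1, column 2) tuples, as listed in the tables.
table_a = [(1, 2), (1, 1), (1, 2)]
table_b = [(1, 2), (1, 1), (1, 3)]

# A join B on A.column 2 > B.column 2; index 1 is "column 2".
joined = theta_join(table_a, table_b, 1, 1, operator.gt)

# Keep only A's columns, as the result table in the example does.
a_columns = {row[:2] for row in joined}
print(a_columns)  # {(1, 2)}
```

Passing a different comparison, for example operator.eq, turns the same function into an equijoin, which is why the theta join is described as the general case.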

EQUI Join
EQUI JOIN is a Theta join that uses only the equality condition. The equijoin is among the
most difficult operations to implement efficiently in an RDBMS, and one reason why
RDBMSs can have serious performance problems.
For example: A ⋈ A.column 2 = B.column 2 (B)
Column 1  Column 2
1         1

Natural Join (⋈)


NATURAL JOIN does not utilize any of the comparison operators. In this type
of join, the attributes should have the same name and domain. In Natural Join,
there should be at least one common attribute between two relations.
It performs selection forming equality on those attributes which appear in both
relations and eliminates the duplicate attributes.
Example:
Consider the following two tables

Outer Join
An OUTER JOIN doesn't require each record in the two joined tables to have a matching
record. In this type of join, the table retains each record even if no other matching record
exists.
Three types of Outer Joins are:

- Left Outer Join
- Right Outer Join
- Full Outer Join

Left Outer Join (A ⟕B)


LEFT JOIN returns all the rows from the table on the left even if no matching rows have been
found in the table on the right. When no matching record is found in the table on the right, NULL
is returned.
Right Outer Join (A ⟖ B)
RIGHT JOIN returns all the rows from the table on the right even if no matching rows have
been found in the table on the left. Where no matches have been found in the table on the left,
NULL is returned. RIGHT OUTER JOIN is the opposite of LEFT JOIN. For example, assume
that you need to get the names of members and the movies rented by them, and that there is a
new member who has not rented any movie yet.

Full Outer Join (A ⟗ B)

In a FULL OUTER JOIN, all tuples from both relations are included in the result, irrespective of
the matching condition.
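The three outer joins can be sketched in Python, with None standing in for NULL. This is a minimal sketch under stated assumptions: the function names are illustrative, rows are dictionaries joined on a single shared key, and the members and rentals data are made up to mirror the member-with-no-rental situation described above.

```python
def left_outer_join(left, right, key, right_cols):
    """Every left row appears; unmatched rows get None for the right columns."""
    result = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[key] == l_row[key]]
        if matches:
            result.extend({**l_row, **r_row} for r_row in matches)
        else:
            result.append({**l_row, **{c: None for c in right_cols}})
    return result


def right_outer_join(left, right, key, left_cols):
    """Mirror image of the left outer join."""
    return left_outer_join(right, left, key, left_cols)


def full_outer_join(left, right, key, left_cols, right_cols):
    """Left outer join plus the right rows that matched nothing."""
    result = left_outer_join(left, right, key, right_cols)
    left_keys = {l_row[key] for l_row in left}
    for r_row in right:
        if r_row[key] not in left_keys:
            result.append({**{c: None for c in left_cols}, **r_row})
    return result


# Hypothetical data: member 3 has rented nothing; rental 9 has no member row.
members = [{"id": 1, "name": "Ann"}, {"id": 3, "name": "Bob"}]
rentals = [{"id": 1, "movie": "Alien"}, {"id": 9, "movie": "Heat"}]

left_rows = left_outer_join(members, rentals, "id", ["movie"])
right_rows = right_outer_join(members, rentals, "id", ["name"])
full_rows = full_outer_join(members, rentals, "id", ["name"], ["movie"])
```

Here the left outer join keeps Bob with a None movie, the right outer join keeps the rental with id 9 with a None name, and the full outer join keeps both.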

Example:
Estimating the cardinality of the Join operation

The cardinality of the Cartesian product of R and S, R × S, is simply:

nTuples(R) * nTuples(S)

Unfortunately, it is much more difficult to estimate the cardinality of any join as it depends
on the distribution of values in the joining attributes. In the worst case, we know that the
cardinality of the join cannot be any greater than the cardinality of the Cartesian product, so:
nTuples(T) ≤ nTuples(R) * nTuples(S)
Some systems use this upper bound, but this estimate is generally too pessimistic. If we
again assume a uniform distribution of values in both relations, we can improve on this
estimate for Equijoins with a predicate (R.A = S.B) as follows:
(1) If A is a key attribute of R, then a tuple of S can only join with one tuple of R. Therefore,
the cardinality of the Equijoin cannot be any greater than the cardinality of S:
nTuples(T) ≤ nTuples(S)
(2) Similarly, if B is a key of S, then:
nTuples(T) ≤ nTuples(R)
If neither A nor B is a key, then we could estimate the cardinality of the join as:
nTuples(T) = SCA(R)*nTuples(S) or
nTuples(T) = SCB(S)*nTuples(R)
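These rules can be collected into one small Python function. This is a sketch, not a DBMS routine: the function name and parameters are illustrative, and taking the smaller of the two selection-cardinality estimates when both are available is an assumption made here, since the text offers either one.

```python
def estimate_equijoin_cardinality(n_r, n_s, a_is_key=False, b_is_key=False,
                                  sc_a=None, sc_b=None):
    """Estimate nTuples(T) for an equijoin of R and S on R.A = S.B."""
    candidates = [n_r * n_s]        # never more than the Cartesian product
    if a_is_key:
        candidates.append(n_s)      # each S tuple joins with at most one R tuple
    if b_is_key:
        candidates.append(n_r)      # each R tuple joins with at most one S tuple
    if not a_is_key and not b_is_key:
        if sc_a is not None:
            candidates.append(sc_a * n_s)
        if sc_b is not None:
            candidates.append(sc_b * n_r)
    return min(candidates)


# A is a key of R: bounded by nTuples(S).
print(estimate_equijoin_cardinality(6000, 100_000, a_is_key=True))  # 100000
# Neither attribute is a key: SC_A(R) * nTuples(S).
print(estimate_equijoin_cardinality(3000, 1000, sc_a=6))            # 6000
# No statistics at all: fall back to the Cartesian product bound.
print(estimate_equijoin_cardinality(10, 20))                        # 200
```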
Table 2 Summary of estimated I/O cost of strategies for Join operation.
Example 1: Cost estimation for Join operation
For the purposes of this example, we make the following assumptions:

- There are separate hash indexes with no overflow on the primary key attributes staffNo
of Staff and branchNo of Branch.
- There are 100 database buffer blocks.
- The system catalog holds the following statistics:
nTuples(Staff) = 6000
bFactor(Staff) = 30 ⇒ nBlocks(Staff) = 200
nTuples(Branch) = 500
bFactor(Branch) = 50 ⇒ nBlocks(Branch) = 10
nTuples(PropertyForRent) = 100,000
bFactor(PropertyForRent) = 50 ⇒ nBlocks(PropertyForRent) = 2000
A comparison of the above four strategies for the following two joins is shown in
Table 2:
J1: Staff ⨝ staffNo PropertyForRent
J2: Branch ⨝ branchNo PropertyForRent

Calculate the cost of the two strategies discussed above, if the Staff relation has 10 000 tuples,
Branch has 500 tuples, there are 500 Managers (one for each branch), and there are 10 London
branches.
Using the Hotel schema, assume the following indexes exist:
- a hash index with no overflow on the primary key attributes, roomNo/hotelNo in Room;
- a clustering index on the foreign key attribute hotelNo in Room;
- a B+-tree index on the price attribute in Room;
- a secondary index on the attribute type in Room.
nTuples(Room) = 10,000 bFactor(Room) = 200
nTuples(Hotel) = 50 bFactor(Hotel) = 40
nTuples(Booking) = 100,000 bFactor(Booking) = 60
nDistincthotelNo(Room) = 50
nDistincttype(Room) = 10
nDistinctprice(Room) = 500
minprice(Room) = 50 maxprice(Room) = 200
nLevelshotelNo(I) = 2
nLevelsprice(I) = 2 nLfBlocksprice(I) = 50
(a) Calculate the cardinality and minimum cost for each of the following Selection operations:
S1: σroomNo=1 ∧ hotelNo=‘H001’(Room)
S2: σtype=‘D’(Room)
S3: σhotelNo=‘H002’(Room)
S4: σprice>100(Room)
S5: σtype=‘S’ ∧ hotelNo=‘H003’(Room)
S6: σtype=‘S’ ∨ price < 100(Room)
(b) Calculate the cardinality and minimum cost for each of the following Join operations:
J1: Hotel ⋈ hotelNo Room
J2: Hotel ⋈ hotelNo Booking
J3: Room ⋈ roomNo Booking
J4: Room ⋈ hotelNo Hotel
J5: Booking ⋈ hotelNo Hotel
J6: Booking ⋈ roomNo Room
