You are on page 1of 3

CS 533 Database Management Systems

Homework Assignment 1 - Due September 17, 2010 Friday

1. Consider the join R ⊲⊳R.a=S.b S, given the following information about the relations to be joined. The cost
metric is the no. of page I/Os unless otherwise noted, and the cost of writing out the result is uniformly
ignored.

• Relation R contains 200,000 tuples and has 20 tuples per page


• Relation S contains 4,000,000 tuples and also has 20 tuples per page
• Attribute a of relation R is the primary key for R
• Each tuple of R joins with exactly 20 tuples of S
• 1002 buffer pages are available

Answer the following questions. If some question is inapplicable, explain why.

(a) What is the cost of joining R and S using a page-oriented simple nested loops join? What is the no. of
buffer pages required for this cost to remain unchanged?

Cost = 10,000 + (10,000 * 200,000) = 2,000,010,000. Min. no. of buffer pages = 3.

(b) What is the cost of joining R and S using a block nested loops join? What is the no. of buffer pages
required for this cost to remain unchanged?

10,000
Cost = 10,000 + 200,000*⌈ 1002−2 ⌉ = 2,010,000. Min. no. of buffer pages = 1002.

(c) How many tuples does the join of R and S produce at most, and how many pages are required to store
the result of the join back to disk?

Since R.a is the primary key, any tuple in S can match at most one tuple in R. So, the max. no. of tuples
in the result is equal to the number of tuples in S which is 4,000,000.
The size of a tuple in the result will be approx. double the size of R. Thus, only 10 tuples can be stored
in a page. Storing 4,000,000 tuples at 10 per page would require 400,000 pages in the result.

(d) Would any of your answers change if you were told that R.a is a foreign key that refers to S.b? Explain.

It would contradict the statement that each R tuple joins with exactly 20 S tuples.

2. Consider the following relation:

• Employees(eid: integer, ename: string, sal: integer, title: string, age: integers)

1
Suppose the data entries (leaf nodes) of the indexes are of the form < search key, record id >. We have the
following indexes: a hash index on eid, a B+ tree index on sal, a hash index on age, and a clustered B+ tree
index on < age, sal >. Each Employees record is 100 bytes long, and you can assume that each index data
entry is 20 bytes long. The Employees relation contains 10,000 pages. Each data page contains 20 records per
page.

(a) Consider each of the following selection conditions and assuming that the reduction factor for each
term that matches an index is 0.1, compute the cost of the most selective access path for retrieving all
Employees tuples that satisfy the condition:
i. sal > 100
Filescan is best with cost 10,000.

ii. age = 25
Clustered B+ tree. Cost = 2(lookup) + 10000*0.1(selectivity) + 10,000*0.2*0.1 = 1202.
iii. eid = 1000
Cost = 1.2 (lookup) + 1 = 2 or 3.

iv. sal > 200 ∧ age = 20


Clustered B+ index on < age, sal >. Cost = 2 (lookup) + 10,000*0.1*0.1 + 10,000*0.4*0.1*0.1 =
142.

v. sal > 200 ∧ title = ‘CFO′


Filescan. Cost = 10,000.
(b) Suppose that for each preceding selection conditions, you want to retrieve the average salary of quali-
fying tuples. For each selection condition, describe the least expensive evaluation method and state its
cost.
i. sal > 100
Index only scan using the unclustered B+ tree index on sal. Cost = 2 (lookup) + 10,000*0.1*0.2 =
202.

ii. age = 25
Clustered index on < age, sal >. Cost = 2 (lookup) + 10,000*0.1*0.4 = 402.

iii. eid = 1000


Cost = 1.2 (lookup) + 1 = 2 or 3.

iv. sal > 200 ∧ age = 20


Clustered B+ tree index for an index only scan. Cost = 2 (lookup) + 10,000 * 0.1 * 0.1 * 0.4 = 42.

v. sal > 200 ∧ title = ‘CFO′


Filescan. Cost = 10,000.
(c) Suppose that for each preceding selection conditions, you want to compute the average salary for each
age group. For each selection condition, describe the least expensive evaluation method and state its
cost.
i. sal > 100
Cost = 2 (lookup) + 10,000*0.4 = 4002.

2
ii. age = 25
Clustered B+ index in an index only scan. Cost = 2 (lookup) + 10,000 *0.1 *0.4 = 2 + 400 = 402.

iii. eid = 1000


Hash index. Cost = 1.2 (lookup) + 1 = 2 or 3.

iv. sal > 200 ∧ age = 20


Clustered B+ index over < age, sal > for index only scan. Cost = 2 + 10,000*0.4*0.1*0.1 = 42.

v. sal > 200 ∧ title = ‘CFO′


Scan of < age, sal > and then getting each relation from the relation page. Cost = 2 + 10,000 * 0.4
+ 10,000*20(no. of tuples per page) * 0.1 = 24002.

3. You are given the following information.

• Executives has attributes ename, title, dname, and address; all are string fields of the same length.
• The ename attribute is a candidate key.
• The relation contains 10,000 pages.
• There are 10 buffer pages.

Consider the following query:

SELECT E.title, E.ename FROM Executives E WHERE E.title = ‘CFO’

Assume that only 10% of Executives tuples meet the selection condition.

(a) Suppose that a clustered B+ tree index on title is the only index available. What is the cost of the best
plan?
Cost = 2 (finding the first title) + 10,000*0.1 (getting 10% of data pages) + 2,500 * 0.1 (scanning the
index) = 1252.

(b) Suppose that a unclustered B+ tree index on title is the only index available. What is the cost of the best
plan?
Filescan. Cost = 10,000.

(c) Suppose that a clustered B+ tree index on ename is the only index available. What is the cost of the best
plan?

Filescan. Cost = 10,000.

(d) Suppose that a clustered B+ tree index on address is the only index available. What is the cost of the
best plan?
Filescan. Cost = 10,000.

(e) Suppose that a clustered B+ tree index on < ename,title > is the only index available. What is the cost
of the best plan?
Scanning the entire index = 10,000 * 0.5 = 5,000.

You might also like