
CSCI 4707 – Written Submission 3 solutions

Chapter                                  Question #   Sections   Max Score   Details
Schema Refinement and Normal Forms       19.6         2          40          20 each
Schema Refinement and Normal Forms       19.10        5          100         20 each
Data Warehousing and Decision Support    25.2         3          60          20 each
Data Warehousing and Decision Support    25.10        3          60          20 each
Data Mining                              26.2         4          120         30 each
Data Mining                              26.8         2          40          20 each
Data Mining                              26.10        1          30
IR and XML Data                          27.2         8          210
IR and XML Data                          27.4         7          140         20 each
Total                                                            800

Score breakdown for Q 27.2 (210 points): part 1: 20, part 2: 20, part 3: 50, part 5(a): 20, part 5(b): 20, part 5(c): 80.

19.6 Suppose that we have the following three tuples in a legal instance of a relation
schema S with three attributes ABC (listed in order): (1,2,3), (4,2,3), and (5,3,3).
1. Which of the following dependencies can you infer does not hold over schema S?
a. A → B
b. BC → A
c. B → C
BC → A does not hold over S: the tuples (1,2,3) and (4,2,3) agree on B and C but differ on A.

2. Can you identify any dependencies that hold over S?


No. Given just an instance of S, we can say that certain dependencies (e.g., A → B and B → C) are not violated by this instance, but we cannot say that these dependencies hold with respect to S. To say that an FD holds w.r.t. a relation is to make a statement about all allowable instances of that relation!
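The point above can be checked mechanically: an instance can refute an FD, but never prove it. A minimal sketch (the helper name `fd_violated` is ours, not from the book):

```python
def fd_violated(rows, lhs, rhs):
    """Return True if some pair of rows agrees on the lhs attribute
    positions but disagrees on the rhs positions, i.e., the instance
    refutes the FD lhs -> rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return True
        seen[key] = val
    return False

# The instance of S from the exercise; attributes A=0, B=1, C=2.
s = [(1, 2, 3), (4, 2, 3), (5, 3, 3)]

fd_violated(s, (1, 2), (0,))   # BC -> A: refuted by the first two tuples
fd_violated(s, (0,), (1,))     # A -> B: not violated (but not proven either)
```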

19.10 Suppose you are given a relation R(A,B,C,D). For each of the following sets of FDs,
assuming they are the only dependencies that hold for R, do the following: (a) Identify the
candidate key(s) for R. (b) State whether or not the proposed decomposition of R into
smaller relations is a good decomposition and briefly explain why or why not.
1. B → C, D → A; decompose into BC and AD
a. (BD)
b. The proposed decomposition is not good because neither decomposed relation
contains an attribute by which to reference the other. More specifically, it violates
the lossless-join condition that X ∩ Y → X or X ∩ Y → Y, because X ∩ Y is the
empty set. The join of BC and AD is their Cartesian product, which could be much
bigger than the actual relation ABCD.
2. AB → C, C → A, C → D; decompose into ACD and BC
a. (AB), (BC)
b. The proposed decomposition into ACD and BC is not good. It is lossless
(ACD ∩ BC = C, and C → ACD by combining the second and third FDs with the
trivial C → C), and in fact it is a BCNF decomposition, but it is not
dependency-preserving since AB → C is not preserved.

3. A → BC, C → AD; decompose into ABC and AD


a. (A), (C)
b. The proposed decomposition into ABC and AD is not good. Since A and C are
both candidate keys for R, it is already in BCNF. So from a normalization
standpoint, it makes no sense to decompose R. Furthermore, the decomposition
is not dependency-preserving since C → AD can no longer be enforced.

4. A → B, B → C, C → D; decompose into AB and ACD


a. (A)
b. The proposed decomposition into AB and ACD is not good. ACD is not even in
3NF, since C → D holds but C is not a superkey and D is not part of any key. This is
a lossless decomposition (AB ∩ ACD = A, and A → AB by combining the first FD
with the trivial A → A), but it is not dependency-preserving since B → C is not
preserved.

5. A → B, B → C, C → D; decompose into AB, AD and CD


a. (A)
b. It is a lossless BCNF decomposition. It is, however, not dependency-preserving
(B → C is not preserved). It is also not the best decomposition: AB, BC, CD is better.
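The binary lossless-join test used throughout this exercise (the common attributes X ∩ Y must functionally determine one of the two sides) reduces to an attribute-closure computation. A sketch, with hypothetical helper names:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, where fds is a list of
    (lhs, rhs) pairs of attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless(x, y, fds):
    """Binary lossless-join test: X ∩ Y -> X or X ∩ Y -> Y."""
    common = set(x) & set(y)
    c = closure(common, fds)
    return set(x) <= c or set(y) <= c

# Case 1: B -> C, D -> A; decomposition into BC and AD.
fds1 = [("B", "C"), ("D", "A")]
ok1 = lossless("BC", "AD", fds1)   # False: the common attribute set is empty

# Case 2: AB -> C, C -> A, C -> D; decomposition into ACD and BC.
fds2 = [("AB", "C"), ("C", "A"), ("C", "D")]
ok2 = lossless("ACD", "BC", fds2)  # True: {C}+ = {A, C, D} covers ACD
```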

Exercise 25.2 Consider the instance of the Sales relation shown in Figure 25.2.
1. Show the result of pivoting the relation on pid and timeid.

timeid \ pid     11    12    13   Total
1                60    56    28   144
2                30    65    50   145
3                25    70    15   110
Total           115   191    93   399

2. Write a collection of SQL queries to obtain the same result as in the previous part.
SELECT S.pid, S.timeid, SUM(S.sales)
FROM Sales S
GROUP BY S.pid, S.timeid

SELECT S.pid, SUM(S.sales)


FROM Sales S
GROUP BY S.pid
SELECT S.timeid, SUM(S.sales)
FROM Sales S
GROUP BY S.timeid

SELECT SUM(S.sales)
FROM Sales S

3. Show the result of pivoting the relation on pid and locid.

locid \ pid      11    12    13   Total
1                48   100    28   176
2                67    91    65   223
Total           115   191    93   399

Exercise 25.10
1. To decide whether to materialize a view, what factors do we need to consider?
The following factors need to be considered when deciding whether to materialize a view:
• Whether it would be possible to exploit the materialized view to answer a query on
a view.
• What indexes should be built on the materialized view.
• How the materialized view will be synchronized with changes to the underlying
tables.
• How much disk space the materialized view requires.
• The cost of refreshing the view when the underlying tables change.

2. Assume that we have defined the following materialized view:


SELECT L.state, S.sales
FROM Locations L, Sales S
WHERE S.locid = L.locid
a. Describe what auxiliary information the algorithm for incremental view
maintenance from Section 25.10.1 maintains and how this data helps in
maintaining the view incrementally
The auxiliary information maintained by the incremental view-maintenance algorithm
is the number of times each row is derived in the view. This count determines whether
a row should be part of the materialized view: if the count drops to zero, the row must
be deleted from the view. The main idea behind incremental maintenance algorithms is
to efficiently compute only the changes to the view, whether new rows, deleted rows,
or changes to the count associated with a row.
b. Discuss the pros and cons of materializing this view.
Pros: queries on the view are fast, since the join is precomputed and the result can be indexed.
Cons: maintenance effort, disk space usage, the view's data can become stale, and updates to the underlying tables become more costly.
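The counting idea described in part (a) can be sketched in a few lines; this is an illustration of the approach, not the book's algorithm verbatim:

```python
from collections import Counter

class CountingView:
    """Count-based incremental view maintenance: each view row carries
    the number of derivations; a row leaves the view only when its
    count drops to zero."""
    def __init__(self):
        self.counts = Counter()

    def apply(self, inserted=(), deleted=()):
        # Apply a batch of derived-row insertions and deletions.
        for row in inserted:
            self.counts[row] += 1
        for row in deleted:
            self.counts[row] -= 1
            if self.counts[row] == 0:
                del self.counts[row]

    def rows(self):
        return set(self.counts)

# ('CA', 25) is derived twice, e.g. via two matching Sales rows.
v = CountingView()
v.apply(inserted=[("CA", 25), ("CA", 25), ("WI", 30)])
v.apply(deleted=[("CA", 25)])   # one derivation gone, the row stays
```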
26.2 Consider the Purchases table shown in Figure 26.1.
1. Simulate the algorithm for finding frequent itemsets on the table in Figure 26.1 with
minsup=90%, and then find association rules with minconf=90%.
First find the support of item sets with one item
• {pen} = 100%
• {ink} = 75%
• {milk} = 75%
• {juice} = 50%
• {water} = 25%
Only the {pen} itemset meets minsup = 90%. Since no other one-element itemset is
frequent, no two-element (or larger) itemset can be frequent either.

There are no association rules with minconf = 90%, because a rule X → Y requires a
frequent itemset containing the items of both X and Y, and the only frequent itemset
here has a single item. (A set trivially "associates" with itself with confidence 100%,
but such trivial rules are not considered in the book.)

2. Can you modify the table so that the same frequent itemsets are obtained with
minsup=90% as with minsup=70% on the table shown in Figure 26.1?
One possibility is shown below: the support of {ink} and {milk} is reduced below 70%,
so that {pen} is the only frequent itemset at minsup = 90% as well as at minsup = 70%.

Transid   CustId   Date      Item    Qty
111       201      5/1/99    pen     2
111       201      5/1/99    milk    3
111       201      5/1/99    juice   6
112       105      6/3/99    pen     1
112       105      6/3/99    ink     1
113       106      5/10/99   pen     1
113       106      5/10/99   milk    1
114       201      6/1/99    pen     2
114       201      6/1/99    ink     2
114       201      6/1/99    juice   4
114       201      6/1/99    water   1

3. Simulate the algorithm for finding frequent itemsets on the table in Figure 26.1
with minsup=10% and then find association rules with minconf=90%.
First find the support of the one-element itemsets (all of them meet minsup = 10%):
• {pen} = 100%
• {ink} = 75%
• {milk} = 75%
• {juice} = 50%
• {water} = 25%

The supports of the two-element itemsets are:


• {pen, ink} = 75%
• {pen, milk} = 75%
• {pen, juice} = 50%
• {pen, water} = 25%
• {ink, milk} = 50%
• {ink, juice} = 50%
• {ink, water} = 25%
• {milk, juice} = 25%
• {milk, water} = 0%
• {juice, water} = 25%

All two-element itemsets except {milk, water} meet minsup = 10%. The supports of the
three-element itemsets are:
• {pen, ink, milk} = 50%
• {pen, ink, juice} = 50%
• {pen, ink, water} = 25%
• {pen, milk, juice} = 25%
• {pen, juice, water} = 25%
• {ink, milk, juice} = 25%
• {ink, juice, water} = 25%
Note: Since {milk, water} has support of 0% we will not consider its supersets since
they will have 0% support as well

Using only the three-element itemsets that meet minsup = 10%, we get the supports
of the four-element itemsets:
• {pen, ink, milk, juice} = 25%
• {pen, ink, juice, water} = 25%

Since {milk, water} has 0% support, there cannot be a 5-itemset having support > 0%

So the final answer contains the following itemsets and their support:
• {pen} = 100%
• {ink} = 75%
• {milk} = 75%
• {juice} = 50%
• {water} = 25%
• {pen, ink} = 75%
• {pen, milk} = 75%
• {pen, juice} = 50%
• {pen, water} = 25%
• {ink, milk} = 50%
• {ink, juice} = 50%
• {ink, water} = 25%
• {milk, juice} = 25%
• {juice, water} = 25%
• {pen, ink, milk} = 50%
• {pen, ink, juice} = 50%
• {pen, ink, water} = 25%
• {pen, milk, juice} = 25%
• {pen, juice, water} = 25%
• {ink, milk, juice} = 25%
• {ink, juice, water} = 25%
• {pen, ink, milk, juice} = 25%
• {pen, ink, juice, water} = 25%

Association rules with confidence >= 90%: all the rules listed have confidence = 100%
• {ink} → {pen}
• {milk} → {pen}
• {juice} → {pen}
• {juice} → {ink}
• {juice} → {pen, ink}
• {water} → {pen}
• {water} → {ink}
• {water} → {juice}
• {water} → {pen, ink}
• {water} → {pen, juice}
• {water} → {ink, juice}
• {water} → {pen, ink, juice}
• {ink, milk} → {pen}
• {ink, juice} → {pen}
• {ink, water} → {pen}
• {ink, water} → {juice}
• {juice, water} → {ink}
• {juice, water} → {pen}
• {milk, juice} → {pen}
• {milk, juice} → {ink}
• {pen, juice} → {ink}
• {pen, water} → {ink}
• {pen, water} → {juice}
• {milk, juice} → {pen, ink}
• {pen, water} → {ink, juice}
• {ink, water} → {pen, juice}
• {juice, water} → {pen, ink}
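The level-wise enumeration above can be simulated directly. Figure 26.1 is not reproduced here, so the baskets below are reconstructed from the supports listed above; treat them as an assumption:

```python
from itertools import combinations

# Baskets consistent with the supports computed above (a reconstruction).
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice", "water"},
]

def frequent_itemsets(baskets, minsup):
    """Naive a-priori-style enumeration: keep every itemset whose
    fraction of supporting baskets is at least minsup; stop at the
    first level with no frequent itemset."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            sup = sum(set(combo) <= b for b in baskets) / n
            if sup >= minsup:
                result[frozenset(combo)] = sup
                found = True
        if not found:
            break
    return result

freq = frequent_itemsets(baskets, 0.10)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs over the frequent itemsets."""
    return freq[frozenset(lhs | rhs)] / freq[frozenset(lhs)]
```

With these baskets, the enumeration returns the 23 frequent itemsets listed above, and, e.g., confidence({juice} → {pen}) = 1.0.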

4. Can you modify the table so that the same frequent itemsets are obtained with
minsup=10 percent as with minsup=70 percent on the table shown in Figure 26.1?
In Figure 26.1, the frequent itemsets at minsup = 70% and their supports are:
• {pen} - 4/4
• {ink} - 3/4
• {milk} - 3/4
• {pen, ink} - 3/4
• {pen, milk} - 3/4
• {ink, milk} - 2/4

To keep both {pen, ink} and {pen, milk} frequent while making every other itemset
infrequent, we must add transactions containing both pairs, not just one of them.
One example table:

Transid   CustId   Date      Item   Qty
111       201      5/1/99    pen    2
112       105      5/1/99    pen    2
113       106      5/10/99   pen    1
113       106      5/10/99   ink    1
114       201      6/1/99    milk   2
114       201      6/1/99    pen    2

Here the frequent itemsets at minsup = 10% are {pen}, {ink}, {milk}, {pen, ink}, and
{pen, milk}, the same as at minsup = 70% on the original table.
Exercise 26.8. Consider the SubscriberInfo relation shown in Figure 26.17. It contains
information about the marketing campaign of the DB Aficionado magazine. The first two
columns show the age and salary of a potential customer, and the subscription column
shows whether the person subscribes to the magazine. We want to use this data to
construct a decision tree that helps predict whether a person will subscribe to the
magazine.
1. Construct the AVC-group of the root node of the tree.

(Yes/No = Subscription)

Age   Yes   No
32    0     1
35    1     0
37    0     1
39    1     0
40    0     1
43    1     0
52    1     0
55    1     0
56    1     0

Salary   Yes   No
43k      1     0
45k      0     1
50k      1     0
54k      0     1
58k      0     1
68k      1     0
70k      1     0
85k      1     0
90k      1     0
2. Assume that the splitting predicate at the root node is age <= 50. Construct the AVC
groups of the two children nodes of the root node.
Child Node 1 (age <= 50):

Age   Yes   No
32    0     1
35    1     0
37    0     1
39    1     0
40    0     1
43    1     0

Salary   Yes   No
45k      0     1
54k      0     1
58k      0     1
68k      1     0
70k      1     0
90k      1     0

Child Node 2 (age > 50):

Age   Yes   No
52    1     0
55    1     0
56    1     0

Salary   Yes   No
43k      1     0
50k      1     0
85k      1     0
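AVC-group construction can be sketched as follows. Since Figure 26.17 is not shown here, the pairing of each age with a salary below is a guess that is merely consistent with the three tables above:

```python
from collections import Counter

# (age, salary, subscribes) records; the age/salary pairing is assumed,
# chosen only to be consistent with the AVC tables above.
records = [
    (32, "45k", "no"),  (37, "54k", "no"),  (40, "58k", "no"),
    (35, "68k", "yes"), (39, "70k", "yes"), (43, "90k", "yes"),
    (52, "43k", "yes"), (55, "50k", "yes"), (56, "85k", "yes"),
]

def avc_group(records):
    """AVC-group of a node: for each predictor attribute, counts of
    (attribute value, class label) pairs over the node's records."""
    avc = {"age": Counter(), "salary": Counter()}
    for age, salary, cls in records:
        avc["age"][(age, cls)] += 1
        avc["salary"][(salary, cls)] += 1
    return avc

root = avc_group(records)
left = avc_group([r for r in records if r[0] <= 50])   # split: age <= 50
right = avc_group([r for r in records if r[0] > 50])
```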

Exercise 26.10. Assume you are given the three sequences (1, 3, 4), (2, 3, 2), (3, 3, 7).
Compute the Euclidean norm of the difference (i.e., the Euclidean distance) between all
pairs of sequences.

a. ||(1, 3, 4) − (2, 3, 2)|| = √((1−2)² + (3−3)² + (4−2)²) = √5 ≈ 2.236
b. ||(1, 3, 4) − (3, 3, 7)|| = √((1−3)² + (3−3)² + (4−7)²) = √13 ≈ 3.606
c. ||(3, 3, 7) − (2, 3, 2)|| = √((3−2)² + (3−3)² + (7−2)²) = √26 ≈ 5.099
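The three distances can be verified in a few lines:

```python
import math

def euclidean(x, y):
    """Euclidean distance (the norm of the difference) between two
    equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d_ab = euclidean((1, 3, 4), (2, 3, 2))   # sqrt(5)  ≈ 2.236
d_ac = euclidean((1, 3, 4), (3, 3, 7))   # sqrt(13) ≈ 3.606
d_bc = euclidean((3, 3, 7), (2, 3, 2))   # sqrt(26) ≈ 5.099
```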
Exercise 27.2. Assume you are given a document database that contains six documents.
Answer the following questions.
1. Show the result of creating an inverted file on the documents.

Term (Word)    Number of Documents   Documents
Auto           2                     {1, 2}
Beetle         1                     {6}
Car            2                     {1, 6}
Computer       3                     {2, 4, 5}
Honda          2                     {1, 3}
IBM            2                     {4, 5}
Manufacturer   2                     {1, 4}
Navigation     2                     {2, 3}
Personal       1                     {5}
VW             1                     {6}

2. Show the result of creating a signature file with a width of 5 bits. Construct your
own hashing function that maps terms to bit positions.

Hashing function:

Term (Word)    Hashed Value
Auto           10000
Beetle         10000
Car            01000
Computer       01000
Honda          00100
IBM            00100
Manufacturer   00010
Navigation     00010
Personal       00001
VW             00001

Document signatures (OR of the term hashes):

Document   Terms                            Hash
1          car, manufacturer, Honda, auto   11110
2          auto, computer, navigation       11010
3          Honda, navigation                00110
4          manufacturer, computer, IBM      01110
5          IBM, personal, computer          01101
6          car, Beetle, VW                  11001
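The signature-file mechanics (OR the term signatures together, then test bit containment) can be sketched as follows, using the hashing function above:

```python
# Term-to-signature map from the hashing function above (5-bit values).
term_sig = {
    "auto": 0b10000, "beetle": 0b10000, "car": 0b01000, "computer": 0b01000,
    "honda": 0b00100, "ibm": 0b00100, "manufacturer": 0b00010,
    "navigation": 0b00010, "personal": 0b00001, "vw": 0b00001,
}

docs = {
    1: {"car", "manufacturer", "honda", "auto"},
    2: {"auto", "computer", "navigation"},
    3: {"honda", "navigation"},
    4: {"manufacturer", "computer", "ibm"},
    5: {"ibm", "personal", "computer"},
    6: {"car", "beetle", "vw"},
}

def signature(terms):
    """OR together the signatures of all terms."""
    sig = 0
    for t in terms:
        sig |= term_sig[t]
    return sig

def candidates(query_terms):
    """Docs whose signature contains every query bit; may include false
    matches, which must be removed by checking the documents."""
    q = signature(query_terms)
    return {d for d, terms in docs.items() if signature(terms) & q == q}
```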
3. Evaluate the following boolean queries using the inverted file and the signature file
that you created: 'car', 'IBM' AND 'computer', 'IBM' AND 'car', 'IBM' OR 'auto', and
'IBM' AND 'computer' AND 'manufacturer'.
Using inverted file
o car -> returns {1,6}
o IBM AND computer -> returns {4,5}
‘IBM’ appears in documents - Doc 4 and Doc 5
‘computer’ appears in documents - Doc 2, Doc 4 and Doc 5
So their intersection - Doc 4 and Doc 5
o IBM AND car -> returns {}
‘IBM’ appears in documents - Doc 4 and Doc 5
‘car’ appears in documents - Doc 1, and Doc 6
So their intersection - None
o IBM OR auto -> returns {1, 2, 4, 5}
‘IBM’ appears in documents - Doc 4 and Doc 5
‘auto’ appears in documents - Doc 1, and Doc 2
So their union - Doc 1, Doc 2, Doc 4 and Doc 5
o IBM AND computer AND manufacturer -> returns {4}
‘IBM’ appears in documents - Doc 4 and Doc 5
‘computer’ appears in documents - Doc 2, Doc 4 and Doc 5
‘manufacturer’ appears in documents - Doc 1, and Doc 4
So their intersection - Doc 4

Using signature files


o car -> 01000 -> returns {1,6}
The signature 01000 matches Docs 1, 2, 4, 5, and 6; Docs 2, 4, and 5 are
false matches because 'computer' hashes to the same bit as 'car'.
So the result is Doc 1 and Doc 6
o IBM AND computer -> 01100 -> returns {4, 5}
This signature matches Docs 1, 4, and 5.
Of these, Doc 1 is a false match: 'car' hashes to the same bit as
'computer', and 'Honda' to the same bit as 'IBM'.
So the result is Doc 4 and Doc 5
o IBM AND car -> 01100 -> returns {}
This signature matches Docs 1, 4, and 5, but all three are false matches:
in Doc 1, 'Honda' supplies the 'IBM' bit; in Docs 4 and 5, 'computer'
supplies the 'car' bit.
So the result is none
o IBM OR auto -> returns {1, 2, 4, 5}
The hash for 'IBM' is 00100; the hash for 'auto' is 10000.
'IBM' according to the hash appears in Docs 1, 3, 4, 5; removing the
false positives (caused by 'Honda') leaves Doc 4 and Doc 5.
'auto' according to the hash appears in Docs 1, 2, 6; removing the
false positive (caused by 'Beetle' in Doc 6) leaves Doc 1 and Doc 2.
So the union is Doc 1, 2, 4 and 5
o IBM AND computer AND manufacturer -> 01110 -> returns {4}
The combined hash for 'IBM', 'computer', and 'manufacturer' is 01110,
which matches Doc 1 and Doc 4.
Doc 1 is a false positive, so the final answer is Doc 4

5. Consider the following ranked queries: 'car', 'IBM computer', 'IBM car', 'IBM auto',
'IBM computer manufacturer'.
(a) Calculate the IDF for every term in the database.
Term (Word)    IDF
Auto           log(6/2) = log(3) = 0.477
Beetle         log(6/1) = log(6) = 0.778
Car            log(6/2) = log(3) = 0.477
Computer       log(6/3) = log(2) = 0.301
Honda          log(6/2) = log(3) = 0.477
IBM            log(6/2) = log(3) = 0.477
Manufacturer   log(6/2) = log(3) = 0.477
Navigation     log(6/2) = log(3) = 0.477
Personal       log(6/1) = log(6) = 0.778
VW             log(6/1) = log(6) = 0.778
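These IDF values can be recomputed from the inverted file (base-10 logarithm, N = 6 documents):

```python
import math

# Document frequencies taken from the inverted file above.
df = {"auto": 2, "beetle": 1, "car": 2, "computer": 3, "honda": 2,
      "ibm": 2, "manufacturer": 2, "navigation": 2, "personal": 1, "vw": 1}

# IDF(t) = log10(N / df(t)) with N = 6.
idf = {t: math.log10(6 / n) for t, n in df.items()}
```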

(b) For each document, show its document vector.


         Auto  Beetle  Car  Computer  Honda  IBM  Manufacturer  Navigation  Personal  VW
Doc 1    1     0       1    0         1      0    1             0           0         0
Doc 2    1     0       0    1         0      0    0             1           0         0
Doc 3    0     0       0    0         1      0    0             1           0         0
Doc 4    0     0       0    1         0      1    1             0           0         0
Doc 5    0     0       0    1         0      1    0             0           1         0
Doc 6    0     1       1    0         0      0    0             0           0         1
(c) For each query, calculate the relevance of each document in the database, with
and without the length normalization step.
Document vectors (TF-IDF) are as follows:

         Auto   Beetle  Car    Computer  Honda  IBM    Manufacturer  Navigation  Personal  VW
Doc 1    0.477  0       0.477  0         0.477  0      0.477         0           0         0
Doc 2    0.477  0       0      0.301     0      0      0             0.477       0         0
Doc 3    0      0       0      0         0.477  0      0             0.477       0         0
Doc 4    0      0       0      0.301     0      0.477  0.477         0           0         0
Doc 5    0      0       0      0.301     0      0.477  0             0           0.778     0
Doc 6    0      0.778   0.477  0         0      0      0             0           0         0.778

Query vectors (TF-IDF) are as follows:

           Auto   Beetle  Car    Computer  Honda  IBM    Manufacturer  Navigation  Personal  VW
Query 1    0      0       0.477  0         0      0      0             0           0         0
Query 2    0      0       0      0.301     0      0.477  0             0           0         0
Query 3    0      0       0.477  0         0      0.477  0             0           0         0
Query 4    0.477  0       0      0         0      0.477  0             0           0         0
Query 5    0      0       0      0.301     0      0.477  0.477         0           0         0

Lengths used for normalization, with Length = √(Σ_{k=1}^{t} w_k²):

Document lengths          Query lengths
Doc 1    0.954            Q 1    0.477
Doc 2    0.739            Q 2    0.564
Doc 3    0.675            Q 3    0.675
Doc 4    0.739            Q 4    0.675
Doc 5    0.961            Q 5    0.739
Doc 6    1.199
Relevance without length normalization:
Sim(Q,D)   Q1     Q2     Q3     Q4     Q5
Doc 1      0.228  0      0.228  0.228  0.228
Doc 2      0      0.091  0      0.228  0.091
Doc 3      0      0      0      0      0
Doc 4      0      0.318  0.228  0.228  0.546
Doc 5      0      0.318  0.228  0.228  0.318
Doc 6      0.228  0      0.228  0      0

Relevance with length normalization:


Sim(Q,D)   Q1     Q2     Q3     Q4     Q5
Doc 1      0.5    0      0.354  0.354  0.323
Doc 2      0      0.217  0      0.457  0.166
Doc 3      0      0      0      0      0
Doc 4      0      0.764  0.457  0.457  1
Doc 5      0      0.587  0.351  0.351  0.448
Doc 6      0.398  0      0.281  0      0
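The similarity computation behind both tables (a dot product of TF-IDF vectors, optionally divided by the two vector lengths) can be sketched as follows:

```python
import math

# TF-IDF weights (binary term frequency times IDF), as tabulated above.
w = {"auto": 0.477, "beetle": 0.778, "car": 0.477, "computer": 0.301,
     "honda": 0.477, "ibm": 0.477, "manufacturer": 0.477,
     "navigation": 0.477, "personal": 0.778, "vw": 0.778}

docs = {1: {"auto", "car", "honda", "manufacturer"},
        2: {"auto", "computer", "navigation"},
        3: {"honda", "navigation"},
        4: {"computer", "ibm", "manufacturer"},
        5: {"computer", "ibm", "personal"},
        6: {"beetle", "car", "vw"}}

def sim(query, doc, normalize=False):
    """Dot product of the TF-IDF vectors of a query and a document;
    since both sides use the same weights, each shared term contributes
    w[t]**2. With normalize=True, divide by both vector lengths."""
    s = sum(w[t] ** 2 for t in query & doc)
    if normalize:
        qlen = math.sqrt(sum(w[t] ** 2 for t in query))
        dlen = math.sqrt(sum(w[t] ** 2 for t in doc))
        s /= qlen * dlen
    return s
```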

Exercise 27.4. You are in charge of the Genghis ('We execute fast') search engine. You are
designing your server cluster to handle 500 Million hits a day and 10 billion pages of
indexed data. Each machine costs $1000, and can store 10 million pages and respond to
200 queries per second (against these pages).

1. If you were given a budget of $500,000 dollars for purchasing machines, and were
required to index all 10 billion pages, could you do it?
$500,000 buys 500,000 / 1,000 = 500 machines, which can store 500 × 10 million =
5 × 10⁹ = 5 billion pages. Since the requirement is to index 10 billion pages, we cannot
do it.
2. What is the minimum budget to index all pages? If you assume that each query can
be answered by looking at data in just one (10 million page) partition, and that
queries are uniformly distributed across partitions, what peak load (in number of
queries per second) can such a cluster handle?
The minimum budget is $1,000,000: 10 billion pages in 10-million-page partitions
require 1,000 machines, at $1,000 each. Such a cluster can handle a peak load of
1,000 × 200 = 200,000 queries per second.
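The arithmetic of parts 1 and 2 can be captured in a small capacity calculator (the names below are ours):

```python
# Cluster parameters from the exercise.
PAGE_CAPACITY = 10_000_000    # pages stored per machine
QPS_PER_MACHINE = 200         # queries per second per machine
MACHINE_COST = 1_000          # dollars per machine

def machines_for_pages(pages):
    """Machines needed just to store the index (ceiling division)."""
    return -(-pages // PAGE_CAPACITY)

machines = machines_for_pages(10_000_000_000)  # 1,000 machines
budget = machines * MACHINE_COST               # $1,000,000
peak_qps = machines * QPS_PER_MACHINE          # 200,000 queries/second
```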

3. How would your answer to the previous question change if each query, on average,
accessed two partitions?
The answer would then be about 100,000 queries per second: each query occupies a
machine in two partitions, so two machines together can answer only 200 queries per
second.

4. What is the running budget required to handle the desired load of 500 million hits
per day if all queries are on a single partition? Assume that queries are uniformly
distributed with respect to time of day.
Assumption: the facility has already stored the 10 billion pages.
500 million hits per day ≈ 5,788 queries per second. With each query touching one
partition, we need 5,788 / 200 ≈ 29 machines, so the budget is 29 × $1,000 = $29,000.
If we do account for storing the 10 billion pages, the budget is the $1 million from the
previous part: the 1,000 machines needed for storage can easily handle this query load.

5. Would your answer to the previous question change if the number of queries per
day went up to 5 billion hits per day? How would it change if the number of pages
went up to 100 billion?
(a) Yes, the answer would change as follows:
5 billion hits per day ≈ 57,870 queries per second. With one partition per query, we
need 57,870 / 200 ≈ 290 machines, so the budget is 290 × $1,000 = $290,000.
(b) If the number of pages went up to 100 billion, we would need 100 billion / 10
million = 10,000 machines just to store the index, so the budget would be at least
10,000 × $1,000 = $10,000,000.
6. Assume that each query accesses just one partition, that queries are uniformly
distributed across partitions, but that at any given time, the peak load on a
partition is upto 10 times the average load. What is the minimum budget for
purchasing machines in this scenario?
Assumption: all 10 billion pages have already been indexed.
• With 500 million hits per day, the average load is about 5,788 queries per second.
• The peak load is 10× that: about 57,880 queries per second.
• So 57,880 / 200 ≈ 290 machines are needed.
• The budget would be 290 × $1,000 = $290,000.

7. Take the cost for machines from the previous question and multiply it by 10 to
reflect the costs of maintenance, administration, network bandwidth, etc. This
amount is your annual cost of operation. Assume that you charge advertisers 2
cents per page. What fraction of your inventory (i.e., the total number of pages that
you serve over the course of a year) do you have to sell in order to make a profit?
The annual cost of operation is 10 × $290,000 = $2,900,000.
• Pages that must be sold per year = $2,900,000 / $0.02 = 145,000,000 = 1.45 × 10⁸.
• Taking the inventory to be the 10¹⁰ indexed pages, the fraction needed to make a
profit is (1.45 × 10⁸) / 10¹⁰ = 1.45%.
