CPSC 304: Solution to Practice Questions
in Data Warehousing & OLAP and Data Mining
Laks V.S. Lakshmanan

Winter 2015, Term 2

(1) What are the advantages of the cube operator compared to the operators
rollup, drilldown, and pivot?

Cube generalizes rollup, drilldown, and pivot and thus subsumes them. The benefit
of this is that if we precompute and materialize the cube, that will facilitate a whole
class of data explorations (and queries). Exploration and visualization of the cube or
subsets of group-bys from the cube can in turn enable the detection of patterns that
are otherwise hard to find.
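
To make the "subsumes" claim concrete for rollup, the following minimal sketch lists the group-bys each operator aggregates over. The function names and the three dimension names are illustrative assumptions, and drilldown and pivot are not modelled here.

```python
from itertools import combinations

def cube_groupbys(dims):
    """All 2^n subsets of the dimensions: what a cube aggregates over."""
    return [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

def rollup_groupbys(dims):
    """Only the prefixes of the dimension list: what a rollup aggregates over."""
    return [tuple(dims[:r]) for r in range(len(dims), -1, -1)]

dims = ["Product", "Time", "Geography"]  # illustrative dimension names
cube = cube_groupbys(dims)
rollup = rollup_groupbys(dims)

print(len(cube), len(rollup))      # 8 4
print(set(rollup) <= set(cube))    # True: every rollup group-by is in the cube
```

Every group-by produced by the rollup already appears in the cube, which is why a materialized cube can answer rollup-style queries without further aggregation passes.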

(2) Consider a data warehouse with the dimensions D1, D2, D3, D4, D5 which
have 999, 99, 24, 4, and 19 values respectively. Suppose the sparsity factor of
the cube is 10%. What is the estimated number of tuples in the sparse cube?

The size of the full cube is (999+1)(99+1)(24+1)(4+1)(19+1) = 250,000,000 tuples.
With a sparsity factor of 10%, the size of the sparse cube is about 250,000,000 x 0.1
= 25,000,000 tuples.
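
The arithmetic above can be checked with a short script. This is only a sketch: the variable names are illustrative, and the "+1" per dimension stands for the extra ALL value, as in the formula above.

```python
from math import prod

# Distinct values per dimension D1..D5, as given in the question.
cardinalities = [999, 99, 24, 4, 19]
sparsity = 0.10  # sparsity factor from the question

# Each dimension contributes one extra value (ALL) for the aggregated level.
full_cube = prod(c + 1 for c in cardinalities)
sparse_cube = round(full_cube * sparsity)

print(full_cube)    # 250000000
print(sparse_cube)  # 25000000
```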

(3) Consider computing the full cube over the dimensions {Product, Time,
Geography}, using sorting to speed up the computation of the group-bys that
make up the cube. Explain with a diagram of the cube lattice how you'd order
the dimensions of each group-by in order to minimize the number of sort
operations required.

P=product; T=time; G=geography.

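            PGT

      PG    GT    TP

      P     G     T

            {}

Each red path is a sorted pass. Reading the paths off the attribute orders in
the lattice, they are PGT -> PG -> P -> {}, GT -> G, and TP -> T: within each
path, every group-by's attribute order is a prefix of the one before it, so no
re-sorting is needed along the path. By ordering the sort attributes as shown
in the cube lattice above, we can compute the entire cube in 3 sort passes of
pipesort.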
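
The prefix idea behind the 3 passes can be sketched in a few lines. This is a simplified illustration, not the full pipesort algorithm (which chooses the orders using cost estimates on the lattice); the function name `pipelines` is hypothetical, and the attribute orders are the ones shown in the lattice above.

```python
# Attribute order chosen for each group-by; "" stands for the empty group-by {}.
orders = ["PGT", "PG", "GT", "TP", "P", "G", "T", ""]

def pipelines(orders):
    """Chain group-bys so that each one is a prefix of the previous one.

    Every chain needs a single sort (on its first, longest group-by).
    """
    remaining = sorted(orders, key=len, reverse=True)
    result = []
    while remaining:
        chain = [remaining.pop(0)]  # start a new sorted pass
        while True:
            # Look for an unused group-by that is one attribute shorter
            # and is a prefix of the last group-by in the chain.
            nxt = next((o for o in remaining
                        if len(o) == len(chain[-1]) - 1
                        and chain[-1].startswith(o)), None)
            if nxt is None:
                break
            remaining.remove(nxt)
            chain.append(nxt)
        result.append(chain)
    return result

passes = pipelines(orders)
for chain in passes:
    print(" -> ".join(g or "{}" for g in chain))
# PGT -> PG -> P -> {}
# GT -> G
# TP -> T
```

Had the two-attribute group-bys been written as PG, TG, PT instead, both P and T prefixes would compete for the same parents and G would need a pass of its own from a differently sorted input, which is why the order of attributes within each group-by matters.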

(4) Consider the following cube lattice, with numbers showing the estimated
sizes of the group-bys, where M indicates a million tuples. If you are allowed
to materialize 3 group-bys, which are the best three you'd materialize in
order to optimize the evaluation of queries corresponding to the various
group-bys in the cube lattice?

                 (a,b,c) 10M

   (a,b) 8M      (b,c) 6M      (a,c) 4M

   (a) 2M        (b) 4M        (c) 5M [revised: 3.5M]

                 {} 1

First, notice that there is an inconsistency in the size estimates provided: (c) has a
size of 5M, whereas (a,c) has a size of 4M. This is impossible, since (c) can be
computed from (a,c) and the size of group-bys cannot grow as you go down the
lattice. Let's work with a revised estimate for (c) of 3.5M. This is the size we
will use below.

Of the three group-bys we are allowed to materialize, one is taken: we must always
materialize the top element of the cube lattice, for it cannot be derived from any
other group-by. So, that leaves two more group-bys to choose. The following table
tracks the marginal gain of each group-by, given what has been materialized in
previous rounds. Initially, only abc is materialized. The group-by that is the winner
of the greedy choice in each round is highlighted in red.

Remember, marginal saving for a single group-by = sum of marginal savings for all
group-bys that can be derived from it, compared with the current cheapest way of
computing each of those group-bys. E.g., the group-bys that can be derived from ab
are ab (yes, you include itself), a, b, and {}. Initially, the cheapest way to compute
each of them is to use the top group-by abc, which has a cost of 10 M. On
materializing ab, these four group-bys can be computed at a cheaper cost of 8 M. So,
the marginal gain of ab = 4 group-bys x savings on each = 4 x 2 M. Keep in mind,
sometimes the savings on different group-bys derivable from the same group-by can
be different.
Note: For simplicity, we ignore subtraction of small numbers from millions: e.g., we
write 10 M - 1 as just 10 M.






Group-by    Round 1        Round 2
ab          4 x 2 M        2 x 2 M
bc          4 x 4 M        2 x 4 M *
ac          4 x 6 M *      -
a           2 x 8 M        2 x 2 M
b           2 x 6 M        1 x 6 M
c           2 x 6.5 M      2 x 0.5 M
{}          10 M           4 M

* = winner of the greedy choice in that round (highlighted in red in the original).


Thus, the two additional group-bys we should materialize according to the greedy
algorithm are ac and bc.
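
The two greedy rounds can be replayed with the following sketch. It assumes the revised size of 3.5M for (c), represents each group-by by the string of its attributes (with "" for {}), and uses hypothetical helper names (`derivable`, `greedy`); sizes are in millions of tuples.

```python
# Estimated sizes in millions of tuples; (c) uses the revised 3.5M, {} has 1 tuple.
size = {"abc": 10, "ab": 8, "bc": 6, "ac": 4,
        "a": 2, "b": 4, "c": 3.5, "": 1e-6}

def derivable(child, parent):
    """A group-by can be computed from any group-by that contains its attributes."""
    return set(child) <= set(parent)

def greedy(size, top, k):
    """Materialize `top`, then greedily add k group-bys with the largest marginal gain."""
    chosen = [top]
    # Current cheapest cost of computing each group-by: initially all use `top`.
    cost = {g: size[top] for g in size}
    for _ in range(k):
        def benefit(v):
            # Sum of savings over every group-by derivable from v (including v itself).
            return sum(max(cost[w] - size[v], 0)
                       for w in size if derivable(w, v))
        best = max((v for v in size if v not in chosen), key=benefit)
        chosen.append(best)
        for w in size:
            if derivable(w, best):
                cost[w] = min(cost[w], size[best])
    return chosen

selection = greedy(size, "abc", 2)
print(selection)  # ['abc', 'ac', 'bc']
```

Because the per-group-by saving is clamped at zero, a candidate earns nothing for group-bys that are already cheaper to compute from an earlier choice, which is what makes the savings differ across group-bys in Round 2.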

(5) Consider the following transaction database.

Transaction_id    Basket of items
t1                {a,c,d,f}
t2                {a,b,d,e,g}
t3                {b,c,d,e}
t4                {a,b,c,d}

Suppose minSup = 3 and minConf = 2/3.

(a) Using the Apriori algorithm, find all itemsets that are frequent, i.e., have a
support ≥ minSup.

Round 1: sup(a) = 3; sup(b) = 3; sup(c) = 3; sup(d) = 4; sup(e) = 2; sup(f) = 1;
sup(g) = 1. Discard e,f,g and their supersets as their support is < minSup = 3.
Candidates for round 2 = ab, ac, ad, bc, bd, cd.

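Round 2: sup(ab) = 2; sup(ac) = 2; sup(ad) = 3; sup(bc) = 2; sup(bd) = 3; sup(cd) =
3. Discard ab, ac, bc. The 3-itemsets left to consider for round 3 = abd, acd, bcd.

Round 3: sup(abd) = 2; sup(acd) = 2; sup(bcd) = 2.
All discarded. (Apriori's pruning step would in fact eliminate all three without
counting, since each contains one of the infrequent pairs ab, ac, bc.)
No candidates for round 4, so stop!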

The frequent itemsets are a, b, c, d, ad, bd, cd.
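
The rounds above can be reproduced with a small level-wise sketch. The function and variable names are illustrative, and the code is written for clarity rather than efficiency. Note that with the standard subset-pruning step, the three 3-itemsets are eliminated before their supports are ever counted.

```python
from itertools import combinations

transactions = {
    "t1": {"a", "c", "d", "f"},
    "t2": {"a", "b", "d", "e", "g"},
    "t3": {"b", "c", "d", "e"},
    "t4": {"a", "b", "c", "d"},
}
MIN_SUP = 3

def support(itemset):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in transactions.values() if itemset <= basket)

def apriori(min_sup):
    items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_sup]
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_sup]
    return frequent

frequent = apriori(MIN_SUP)
names = sorted(("".join(sorted(s)) for s in frequent),
               key=lambda n: (len(n), n))
print(names)  # ['a', 'b', 'c', 'd', 'ad', 'bd', 'cd']
```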

(b) Based on (a), find all strong association rules, i.e., association rules whose
confidence ≥ minConf.

We use all frequent itemsets found in (a) above to form ARs.
Singleton itemsets never contribute to non-trivial ARs.
That only leaves ad, bd, cd.
conf(a → d) = sup(ad)/sup(a) = 3/3 = 1.
conf(d → a) = sup(ad)/sup(d) = 3/4.
conf(b → d) = sup(bd)/sup(b) = 3/3 = 1.
conf(d → b) = sup(bd)/sup(d) = 3/4.
conf(c → d) = sup(cd)/sup(c) = 3/3 = 1.
conf(d → c) = sup(cd)/sup(d) = 3/4.

All ARs above have a confidence > minConf = 2/3. If minConf were instead, e.g., 0.8, all
ARs above with d on the LHS would be disqualified, as their confidence (3/4) would be
below minConf.
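
The confidence check, including the effect of raising minConf to 0.8, can be sketched as follows. The supports are copied from part (a); the function name `strong_rules` and the "x -> y" string encoding of a rule are illustrative assumptions.

```python
from fractions import Fraction

# Supports of the frequent itemsets found in part (a).
sup = {"a": 3, "b": 3, "c": 3, "d": 4, "ad": 3, "bd": 3, "cd": 3}
MIN_CONF = Fraction(2, 3)

def strong_rules(min_conf):
    """Return {rule: confidence} for rules built from the frequent 2-itemsets."""
    rules = {}
    for pair in ("ad", "bd", "cd"):
        x, y = pair
        for lhs, rhs in ((x, y), (y, x)):
            # conf(lhs -> rhs) = sup(lhs and rhs together) / sup(lhs)
            conf = Fraction(sup[pair], sup[lhs])
            if conf >= min_conf:
                rules[f"{lhs} -> {rhs}"] = conf
    return rules

print(len(strong_rules(MIN_CONF)))           # 6
print(sorted(strong_rules(Fraction(4, 5))))  # ['a -> d', 'b -> d', 'c -> d']
```

Exact fractions are used instead of floats so that comparisons against thresholds such as 2/3 are not affected by rounding.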