You are on page 1of 18

11/19/2015

CSI 4352, Introduction to Data Mining

Lecture 4, Data Cube Computation

Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University

A Roadmap for Data Cube Computation

 Full Cube
 Full materialization
 Materializing all the cells of all of the cuboids for a given data cube
 Issues in time and space

 Iceberg Cube
 Partial materialization
 Materializing the cells of only interesting cuboids
 Materializing only the cells in a cuboid whose measure value is above
the minimum threshold

 Closed Cube
 Materializing only closed cells

1
11/19/2015

Typical Computation Process

 Example all

time item location supplier

time,item time,location item,location location,supplier

time,supplier item,supplier

time,item,location time,location,supplier

time,item,supplier item,location,supplier

time, item, location, supplier

Computation Techniques

 Aggregating
 Aggregating from the smallest child cuboid

 Caching
 Caching the result of a cuboid for the computation of other cuboids to
reduce disk I/O

 Sorting, Hashing and Grouping


 Sorting, hashing, and grouping operations are applied to a dimension
in order to reorder and cluster

 Pruning
 A priori pruning the cells with lower support than minimum threshold

2
11/19/2015

CSI 4352, Introduction to Data Mining

Lecture 4, Data Cube Computation

 Multi-way Array Aggregation

 BUC: Iceberg Cube Computation

 Star-Cubing

 Shell Fragment Cube Computation

Multi-Way Array Aggregation

 Features
 Array-based “bottom-up” approach
 Uses multi-dimensional chunks all
 No direct tuple comparisons
 Simultaneous aggregation
A B C
on multiple dimensions
 Intermediate aggregate values
are re-used for computing ancestor
AB AC BC
cuboids
 Full materialization (No iceberg
optimization, No Apriori pruning) ABC

3
11/19/2015

Aggregation Strategy

 Partitions array into chunks


 Chunk: a small sub-cube which fits in memory

 Data addressing
 Uses chunk id and offset
C c2
c3 61 62 63 64
45 46 47 48
 Multi-way Aggregation c1 29 30 31 32
c0
60
 Computes aggregates in b3 B13 14 15 16 44
28 56
multi-way
b2 9 10 11 12 40
 Visits chunks in the order B 24 52
b1 5 6 7 8 36
• to minimize memory 20
access b0 1 2 3 4
• to minimize memory space a0 a1 a2 a3
A

Aggregation Process

 Example

c3 61 62 63 64
c2 45 46 47 48
c1 29 30 31 32
c0
60
b3 B13 14 15 16 44
28 56
b2 9 10 11 12 40
24 52
b1 5 6 7 8 36
20
b0 1 2 3 4
a0 a1 a2 a3

4
11/19/2015

Aggregation Process

 Example - Continued

c3 61 62 63 64
c2 45 46 47 48
c1 29 30 31 32
c0
60
b3 B13 14 15 16 44
28 56
b2 9 10 11 12 40
24 52
b1 5 6 7 8 36
20
b0 1 2 3 4
a0 a1 a2 a3

Optimization of Memory Usage

 What is the best traversing order?


 Depends on the memory space required

 Example
 Suppose the data size on
each dimension A, B and C
C c2
c3 61 62 63 64
45 46 47 48
c1 29 30 31 32
is 40, 400 and 4000, c0
60
respectively. b3 B13 14 15 16 44
28 56
 Minimum memory required b2 9 10 11 12 40
B 24 52
when traversing the order,
b1 5 6 7 8 36
1,2,3,4,5,…, 64 20
b0 1 2 3 4
 Total memory required is
100×1000 + 40×1000 + a0 a1 a2 a3
40×400
A

5
11/19/2015

Summary of Multi-Way

 Method
 Cuboids should be sorted and computed according to the data size
on each dimension
 Keeps the smallest plane in the main memory, fetches and computes
only one chunk at a time for the largest plane

 Limitations
 Full materialization
 Computes well only for a small number of dimensions
( high dimensional data → partial materialization )

Reference
 Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm
for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD
(1997)

CSI 4352, Introduction to Data Mining

Lecture 4, Data Cube Computation

 Multi-way Array Aggregation

 BUC: Iceberg Cube Computation

 Star-Cubing

 Shell Fragment Cube Computation

6
11/19/2015

Bottom-Up Computation (BUC)

 Characteristics
 “Top-down” approach
 Partial materialization (iceberg cube computation)
 Divides dimensions into partitions
1 all
and facilitates iceberg pruning
 No simultaneous aggregation
2A 10 B 14 C 16 D

3 AB 7 AC 9 AD 11 BC 13 BD 15 CD

4 ABC 6 ABD 8 ACD 12 BCD

5 ABCD

Iceberg Pruning Process

 Partitioning
 Sorts data values
 Partitions into blocks that fit in memory

 Apriori Pruning
 For each partition
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup,
materialization and a recursive call including the next dimension

7
11/19/2015

Apriori Pruning Example

 Description in SQL
compute cube iceberg_cube as
 Iceberg cube computation select A, B, C, D, count(*)
from R
cube by A, B, C, D
having count(*) ≥ min_sup
 Example

( , , ,  ) 0-D cuboid

( a1, , ,  ) 1-D cuboid

( a1, b1, ,  ) 2-D cuboid

( a1, b1, c1,  ) 3-D cuboid

( a1, b1, c1, d1 ) pruning

( a1, b1, c2,  ) 3-D cuboid

( a1, b2, ,  ) pruning

Summary of BUC

 Method
 Computation of sparse data cubes

 Limitations
 Sensitive to the order of dimensions
→ The most discriminating dimension should be used first
→ Dimensions should be in the order of decreasing cardinality
→ Dimensions should be in the order of increasing maximum number
of duplicates

Reference
 Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and
Iceberg CUBEs”, In Proceedings of SIGMOD (1999)

8
11/19/2015

CSI 4352, Introduction to Data Mining

Lecture 4, Data Cube Computation

 Multi-way Array Aggregation

 BUC: Iceberg Cube Computation

 Star-Cubing

 Shell Fragment Cube Computation

Characteristics of Star-Cubing

 Characteristics
 Integrated method of “top-down” and “bottom-up” cube computation
 Explores both multidimensional aggregation (as Multi-way) and
apriori pruning (as BUC)
all
 Explores shared dimensions
e.g., dimension A is a shared
A/A B/B C/C D
dimension of ACD and AD
e.g., ABD/AB means ABD has
AB/AB AC/AC AD/A BC/BC BD/B CD
the shared dimension AB

ABC/ABC ABD/AB ACD/A BCD

ABCD

9
11/19/2015

Iceberg Pruning Strategy

 Iceberg Pruning in a Shared Dimension


 If AB is a shared dimension of ABD, then the cuboid AB is computed
simultaneously with ABD

 If the aggregate on a shared dimension does not satisfy the iceberg


condition (min_sup), then all the cells extended from this shared
dimension cannot satisfy the condition either

 If we can compute the shared dimensions before the actual cuboid,


we can apply Apriori pruning
→ Pruning while aggregating simultaneously on multiple dimensions

Cuboid Trees

 Cuboid Trees
 Tree structure to represent cuboids
 Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …
 Each level represents a dimension
 Each node represents an attribute
 Each node includes the attribute value, root: 100

aggregate value, descendant(s)


a1: 30 a2: 20 a3: 20 a4: 20
 The path from the root to a leaf
represents a tuple b1: 10 b2: 10 b3: 10

 Example
c1: 5 c2: 5
• count(a1,b1,*,*) = 10
• count(a1,b1,c1,*) = 5 d1: 2 d2: 3

10
11/19/2015

Star Nodes

 Star Nodes
 If the single dimensional aggregate does not satisfy min_sup,
no need to consider the node in the iceberg cube computation
 The nodes are replaced by *
A B C D count
 Example (min_sup = 2) a1 b1 * * 1
a1 b1 * * 1
A B C D count a1 * * * 1
a1 b1 c1 d1 1 a2 * c3 d4 1
a1 b1 c4 d3 1 a2 * c3 d4 1
a1 b2 c2 d2 1
a2 b3 c3 d4 1
a2 b4 c3 d4 1 A B C D count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2

Star Tree Structure

 Star Tree
 A cuboid tree that is compressed using star nodes

 Star Tree Construction


 Uses the compressed table
root:5 Star Table
 Keeps the star table for lookup b2 *
of star nodes a1:3 a2:2 b3 *
 Lossless compression from b4 *
b*:1 b1:2 b*:2 c1 *
the original cuboid tree
c2 *
c*:1 c*:2 c3:2 d1 *
...
d*:1 d*:2 d4:2

11
11/19/2015

Multi-Way Star-Tree Aggregation

 Aggregation
 DFS (depth-first-search) from the root of a star tree
 Creates star trees for the cuboids on the next level

root:5 1 BCD:5 1 a1CD/a1:3 2 a1b*D/a1b*:1 3 a1b*c*/a1b*c*:1 4

a1:32 a2:2 b*:1 3 c*:1 4 d*:15

b*:13 b1:2 b*:2 c*:1 4 d*:1 5

c*:14 c*:2 c3:2 d*:1 5

d*:15 d*:2 d4:2


Base−Tree BCD−Tree ACD/A−Tree ABD/AB−Tree ABC/ABC−Tree

 Pruning
 Prunes if the aggregates do not satisfy min_sup
 Prunes if all the nodes in the generated tree are star nodes

Implementation of Star Cubing

 Method
 Multi-way aggregation & iceberg pruning
all

BCD :3
A/A C/C D/D
b*:3 b1:2

pruning pruning c*:1 c3:2 c*:2


AC/AC AD/A BC/BC BD/B CD
d*:1 d4:2 d*:2

pruning pruning
root:5
ABC/ABC ABD/AB ACD/A BCD
a1:3 a2:2

b*:1 b1:2 b*:2


ABCD
c*:1 c*:2 c3:2

d*:1 d*:2 d4:2

12
11/19/2015

Summary of Star Cubing

 Limitations
 Sensitive to the order of dimensions
→ The order of decreasing cardinality

Reference
 Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB
(2003)

CSI 4352, Introduction to Data Mining

Lecture 4, Data Cube Computation

 Multi-way Array Aggregation

 BUC: Iceberg Cube Computation

 Star-Cubing

 Shell Fragment Cube Computation

13
11/19/2015

Shell Fragment Cube Computation

 Motivations
 The computation and storage of iceberg cube are still costly for high
dimensional data ( “the curse of dimensionality” )
 Hard to determine an appropriate iceberg threshold
 No update in an iceberg cube

 Features
 Reduces a high dimensional cube into a set of lower dimensional cubes
 Lossless reduction
 Online re-construction of high-dimensional data cube

Fragmentation Strategy

 Observation
 OLAP occurs only on a small subset of dimensions at a time

 Fragmentation
 Partitions the set of dimensions into shell fragments
 Computes data cubes for each shell fragment
( 20 3-D data cube computation is much better than 1 60-D data cube.)
 Retains inverted indices or value-list indices

 Semi-Online Computation
 Given the pre-computed fragment cubes, dynamically compute cube cells
of the high-dimensional cube online

14
11/19/2015

Inverted Index

 Example
TID A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
Value TID List Size
5 a2 b1 c1 d1 e3
a1 1 2 3 3

 Divides the 5 dimensions into a2 4 5 2


b1 1 4 5 3
2 shell fragments, (A, B, C) and (D, E)
b2 2 3 2
c1 1 2 3 4 5 5
 Builds the inverted index d1 1 3 4 5 4
(1-D inverted index) d2 2 1
e1 1 2 2
e2 3 4 2
e3 5 1

Cube Computation

 Process
 Generalizes the 1-D inverted indices to multi-dimensional inverted indices
 Computes cuboids using the inverted indices

 Example
 Shell fragment cube ABC contains 7 cuboids, A, B, C, AB, AC, BC, ABC
 The cuboid AB is computed using the 2-D inverted index below

Cell Intersection TID List Size


(a1,b1) {1,2,3}∩{1,4,5} {1} 1
(a1,b2) {1,2,3}∩{2,3} {2,3} 2
(a2,b1) {4,5}∩{1,4,5} {4,5} 2
(a2,b2) {4,5}∩{2,3} {} 0

15
11/19/2015

Online Query Computation

 Process
 Given shell fragment cubes
 Divides the query into shell fragments
 Fetches the corresponding TID list for each fragment from the cubes
 Intersects the TID lists from each fragment to construct instantiated
base table
 Computes the online cube using the base table with any data cube
computation algorithm
 Computes the query

Shell Fragment Cube Process

Dimensions D Cuboid
EF Cuboid
A B C D E F … DE Cuboid
Cell T-ID List
d1 e1 {1, 3, 8, 9}
d1 e2 {2, 4, 6, 7}
d2 e1 {5, 10}
… …
ABC DEF
Cube Cube

16
11/19/2015

Online Query Process

A B C D E F GH I J K L MN …

Instantiated Online
Base Table Cube

Summary of Shell Fragment Cube

 Advantages
 Significant increase of the speed of data cube computation
 Significant decrease of the storage usage
 Various applications in very high dimensional data

 Limitations
 Tradeoffs between the preprocessing time to construct a data cube and
the online computation time for queries
 Tradeoffs between the storage for a data cube and the memory for
online computation

 Reference
 Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal
Cubing Approach”, In Proceedings of VLDB (2004)

17
11/19/2015

Questions?

 Lecture Slides are found on the Course Website,


www.ecs.baylor.edu/faculty/cho/4352

18

You might also like