Lecture 4, Data Cube Computation: CSI 4352, Introduction To Data Mining

11/19/2015
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
A Roadmap for Data Cube Computation
 Full Cube
 Full materialization
 Materializing all the cells of all of the cuboids for a given data cube
 Issues in time and space
 Iceberg Cube
 Partial materialization
 Materializing the cells of only interesting cuboids
 Materializing only the cells in a cuboid whose measure value is above
the minimum threshold
 Closed Cube
 Materializing only closed cells
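Typical Computation Process
 Example: cuboid lattice for the dimensions time, item, location, supplier
 0-D (apex) cuboid: all
 1-D cuboids: time; item; location; supplier
 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
 4-D (base) cuboid: (time, item, location, supplier)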
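The lattice above can be enumerated mechanically: a cube over n dimensions has one cuboid per subset of the dimensions, so 2^n cuboids in total. The following is a minimal sketch; the function name `cuboid_lattice` is an assumption made for illustration and is not part of the lecture.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {level: [cuboids]}, where each cuboid is a tuple of dimension names."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for level, cuboids in lattice.items():
    print(f"{level}-D:", cuboids)

# One cuboid per subset of the 4 dimensions: 2**4 = 16 in total.
total = sum(len(cuboids) for cuboids in lattice.values())
```

Level 0 holds the apex cuboid (the empty tuple, i.e. "all") and level 4 holds the base cuboid.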
Computation Techniques
 Aggregating
 Computing a cuboid by aggregating from its smallest previously computed
child cuboid
 Caching
 Caching the result of a cuboid for the computation of other cuboids to
reduce disk I/O
 Sorting, Hashing and Grouping
 Sorting, hashing, and grouping operations are applied to a dimension
in order to reorder and cluster related tuples
 Pruning
 Apriori pruning: cells whose support is below the minimum threshold are
pruned, together with their descendant cells
 Multi-way Array Aggregation
 BUC: Iceberg Cube Computation
 Star-Cubing
 Shell Fragment Cube Computation
Multi-Way Array Aggregation
 Features
 Array-based “bottom-up” approach
 Uses multi-dimensional chunks
 No direct tuple comparisons
 Simultaneous aggregation on multiple dimensions
 Intermediate aggregate values are re-used for computing ancestor cuboids
 Full materialization (no iceberg optimization, no Apriori pruning)
[Figure: cuboid lattice for dimensions A, B, C: all; A, B, C; AB, AC, BC; ABC]
Aggregation Strategy
 Partitions array into chunks
 Chunk: a small sub-cube which fits in memory
 Data addressing
 Uses chunk id and offset
 Multi-way Aggregation
 Computes aggregates on multiple dimensions simultaneously
 Visits chunks in an order chosen
• to minimize memory access
• to minimize memory space
[Figure: a 3-D array over A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64; A varies fastest, then B, then C]
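As an illustration of chunk id and offset addressing, here is a minimal sketch. It assumes the numbering suggested by the figure (chunks numbered from 1, with A varying fastest, then B, then C) and the same ordering for cells inside a chunk; the function name `locate` and the example sizes are assumptions for illustration only.

```python
def locate(cell, dim_sizes, chunk_sizes):
    """Map a cell index (a, b, c) to (chunk_id, offset).

    chunk_id is 1-based, as in the figure; offset is the 0-based position
    of the cell inside its chunk. The first dimension varies fastest.
    """
    chunk_id, offset = 0, 0
    chunk_stride, cell_stride = 1, 1
    for x, size, csize in zip(cell, dim_sizes, chunk_sizes):
        chunks_on_dim = -(-size // csize)        # ceiling division
        chunk_id += (x // csize) * chunk_stride  # which chunk along this dimension
        offset += (x % csize) * cell_stride      # position inside that chunk
        chunk_stride *= chunks_on_dim
        cell_stride *= csize
    return chunk_id + 1, offset

# Hypothetical sizes: 40 x 400 x 4000 cells, 4 partitions per dimension.
dims, chunks = (40, 400, 4000), (10, 100, 1000)
print(locate((0, 0, 0), dims, chunks))        # first cell of chunk 1
print(locate((0, 300, 1000), dims, chunks))   # first cell of chunk 29 (a0, b3, c1)
```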
Aggregation Process
 Example
[Figure: the 3-D array over A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
Aggregation Process
 Example - Continued
[Figure: the same 3-D array of 64 chunks over A, B, and C]
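The simultaneous aggregation in the example can be sketched as follows: each chunk is read once, in the order 1, 2, ..., 64, and every cell it contains updates the AB, AC, and BC planes at the same time. This is a simplified sketch, not the algorithm as published: it keeps all three 2-D planes fully in memory, so it shows the single scan but not the memory savings. The function name and the sample data are assumptions for illustration.

```python
from itertools import product

def multiway_aggregate(array, n, chunk):
    """Sum a sparse n*n*n array {(a, b, c): value} into the AB, AC, BC planes.

    Chunks (cubes with edge length `chunk`) are visited with A varying
    fastest, then B, then C, and each chunk is read exactly once.
    """
    ab = [[0] * n for _ in range(n)]
    ac = [[0] * n for _ in range(n)]
    bc = [[0] * n for _ in range(n)]
    starts = range(0, n, chunk)
    for c0, b0, a0 in product(starts, repeat=3):          # a0 changes fastest
        cells = product(range(c0, c0 + chunk),
                        range(b0, b0 + chunk),
                        range(a0, a0 + chunk))
        for c, b, a in cells:
            v = array.get((a, b, c), 0)
            ab[a][b] += v   # aggregate over C
            ac[a][c] += v   # aggregate over B
            bc[b][c] += v   # aggregate over A
    return ab, ac, bc

# Hypothetical 4 x 4 x 4 array with three non-empty cells, chunk edge 2.
data = {(0, 0, 0): 5, (0, 0, 3): 2, (1, 2, 3): 7}
ab, ac, bc = multiway_aggregate(data, n=4, chunk=2)
```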
Optimization of Memory Usage
 What is the best traversing order?
 Depends on the memory space required
 Example
 Suppose the data size on each dimension A, B and C is 40, 400 and 4000,
respectively, and each dimension is split into 4 partitions, so a chunk
is 10×100×1000.
 Minimum memory required when traversing the chunks in the order
1, 2, 3, 4, 5, …, 64:
 Total memory required is 100×1000 + 40×1000 + 40×400 = 156,000
(one chunk of the BC plane + one row of the AC plane + the whole AB plane)
[Figure: the 3-D array over A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
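The memory figure in this example can be checked for every traversal order. The sketch below assumes the accounting used on the slide: for a scan in which dimension d1 varies fastest, then d2, then d3, the d1-d2 plane must be held whole, the d1-d3 plane needs one row of chunks, and the d2-d3 plane needs a single chunk. The function name `min_memory` is an assumption for illustration.

```python
from itertools import permutations

def min_memory(sizes, parts, order):
    """Memory units needed to compute all three 2-D planes in one scan.

    sizes: {dimension: size}; parts: partitions per dimension;
    order: dimensions from fastest-varying to slowest-varying.
    """
    d1, d2, d3 = order
    chunk = {d: sizes[d] // parts for d in sizes}
    return (sizes[d1] * sizes[d2]      # whole d1-d2 plane
            + sizes[d1] * chunk[d3]    # one row of the d1-d3 plane
            + chunk[d2] * chunk[d3])   # one chunk of the d2-d3 plane

sizes = {"A": 40, "B": 400, "C": 4000}
for order in permutations(sizes):
    print("".join(order), min_memory(sizes, 4, order))

best = min(permutations(sizes), key=lambda o: min_memory(sizes, 4, o))
```

Scanning the smallest dimension fastest (order A, B, C) needs the least memory here, while the reverse order needs more than ten times as much.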
Summary of Multi-Way
 Method
 Cuboids should be sorted and computed in ascending order of the data
size on each dimension
 Keeps the smallest plane in the main memory, fetches and computes
only one chunk at a time for the largest plane
 Limitations
 Full materialization
 Computes well only for a small number of dimensions
( high dimensional data → partial materialization )
Reference
 Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm
for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD
(1997)
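Bottom-Up Computation (BUC)
 Characteristics
 “Top-down” approach in the lattice view: starts at the apex cuboid (all)
and works down toward the base cuboid
 Partial materialization (iceberg cube computation)
 Divides dimensions into partitions and facilitates iceberg pruning
 No simultaneous aggregation
[Figure: processing order of the cuboids for dimensions A, B, C, D: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D]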
Iceberg Pruning Process
 Partitioning
 Sorts data values
 Partitions into blocks that fit in memory
 Apriori Pruning
 For each partition
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup, the cell is materialized and BUC is called
recursively on the partition with the next dimension
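The partition-and-prune recursion can be sketched as below. This is a minimal in-memory sketch with count as the measure, not the published BUC implementation: it partitions with a dictionary instead of sorting, and it ignores blocks that do not fit in memory. The function name `buc` and the five sample tuples are assumptions for illustration.

```python
from collections import defaultdict

def buc(rows, dims, min_sup, cell=None, start=0, out=None):
    """Return {cell: count} for all cells with count >= min_sup.

    rows: list of tuples with `dims` values each; '*' marks an
    aggregated dimension in a cell.
    """
    if cell is None:
        cell, out = ("*",) * dims, {}
    if len(rows) < min_sup:
        return out                      # Apriori pruning: skip all descendants
    out[cell] = len(rows)               # materialize this cell
    for d in range(start, dims):        # expand on each remaining dimension
        partitions = defaultdict(list)
        for r in rows:
            partitions[r[d]].append(r)
        for value, part in sorted(partitions.items()):
            child = cell[:d] + (value,) + cell[d + 1:]
            buc(part, dims, min_sup, child, d + 1, out)
    return out

# Hypothetical relation R(A, B, C, D) with five tuples.
R = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
     ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
     ("a2", "b4", "c3", "d4")]
cube = buc(R, dims=4, min_sup=2)
for c, n in cube.items():
    print(c, n)
```

With min_sup = 2, the partition for (a1, b2, *, *) holds a single tuple, so neither that cell nor any of its descendants is ever visited.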
Apriori Pruning Example
 Description in SQL
 Iceberg cube computation
compute cube iceberg_cube as
select A, B, C, D, count(*)
from R
cube by A, B, C, D
having count(*) >= min_sup
 Example
( *, *, *, * ) 0-D cuboid
( a1, *, *, * ) 1-D cuboid
( a1, b1, *, * ) 2-D cuboid
( a1, b1, c1, * ) 3-D cuboid
( a1, b1, c1, d1 ) pruning
( a1, b1, c2, * ) 3-D cuboid
( a1, b2, *, * ) pruning
Summary of BUC
 Method
 Computation of sparse data cubes
 Limitations
 Sensitive to the order of dimensions
→ The most discriminating dimension should be used first
→ Dimensions should be in the order of decreasing cardinality
→ Dimensions should be in the order of increasing maximum number
of duplicates
Reference
 Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and
Iceberg CUBEs”, In Proceedings of SIGMOD (1999)
Characteristics of Star-Cubing
 Characteristics
 Integrated method of “top-down” and “bottom-up” cube computation
 Explores both multidimensional aggregation (as Multi-way) and
Apriori pruning (as BUC)
 Explores shared dimensions
e.g., dimension A is a shared dimension of ACD and AD
e.g., ABD/AB means ABD has the shared dimension AB
[Figure: lattice annotated with shared dimensions: all; A/A, B/B, C/C, D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD]
Iceberg Pruning Strategy
 Iceberg Pruning in a Shared Dimension
 If AB is a shared dimension of ABD, then the cuboid AB is computed
simultaneously with ABD
 If the aggregate on a shared dimension does not satisfy the iceberg
condition (min_sup), then all the cells extended from this shared
dimension cannot satisfy the condition either
 If we can compute the shared dimensions before the actual cuboid,
we can apply Apriori pruning
→ Pruning while aggregating simultaneously on multiple dimensions
Cuboid Trees
 Cuboid Trees
 Tree structure to represent cuboids
 Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …
 Each level represents a dimension
 Each node represents an attribute
 Each node includes the attribute value, aggregate value, descendant(s)
 The path from the root to a leaf represents a tuple
 Example
• count(a1,b1,*,*) = 10
• count(a1,b1,c1,*) = 5
[Figure: base cuboid tree: root: 100; children a1: 30, a2: 20, a3: 20, a4: 20; under a1: b1: 10, b2: 10, b3: 10; under b1: c1: 5, c2: 5; under c1: d1: 2, d2: 3]
Star Nodes
 Star Nodes
 If the single dimensional aggregate does not satisfy min_sup,
no need to consider the node in the iceberg cube computation
 The nodes are replaced by *
 Example (min_sup = 2)
Original table:
A   B   C   D   count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
a2  b3  c3  d4  1
a2  b4  c3  d4  1
After replacing star nodes by *:
A   B   C   D   count
a1  b1  *   *   1
a1  b1  *   *   1
a1  *   *   *   1
a2  *   c3  d4  1
a2  *   c3  d4  1
Compressed table:
A   B   C   D   count
a1  b1  *   *   2
a1  *   *   *   1
a2  *   c3  d4  2
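The star-node reduction in this example can be reproduced with a short sketch: count each value per dimension, mark values whose count is below min_sup as star nodes, replace them by *, and merge identical rows. The function name `compress_with_star_nodes` is an assumption for illustration.

```python
from collections import Counter

def compress_with_star_nodes(rows, min_sup):
    """Return (star_table, compressed), where star_table[d] is the set of
    star values of dimension d and compressed maps each reduced row to its count."""
    n = len(rows[0])
    counts = [Counter(r[d] for r in rows) for d in range(n)]   # 1-D aggregates
    star_table = [{v for v, c in counts[d].items() if c < min_sup}
                  for d in range(n)]
    reduced = [tuple("*" if r[d] in star_table[d] else r[d] for d in range(n))
               for r in rows]
    return star_table, Counter(reduced)                        # merge equal rows

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"),
        ("a1", "b2", "c2", "d2"), ("a2", "b3", "c3", "d4"),
        ("a2", "b4", "c3", "d4")]
star_table, compressed = compress_with_star_nodes(rows, min_sup=2)
for row, count in compressed.items():
    print(row, count)
```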
Star Tree Structure
 Star Tree
 A cuboid tree that is compressed using star nodes
 Star Tree Construction
 Uses the compressed table
 Keeps the star table for lookup of star nodes
 Lossless compression from the original cuboid tree
[Figure: star tree: root:5; a1:3 with children b*:1 (then c*:1, d*:1) and b1:2 (then c*:2, d*:2); a2:2 with child b*:2 (then c3:2, d4:2)]
Star Table: b2 → *, b3 → *, b4 → *, c1 → *, c2 → *, d1 → *, ...
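A star tree like the one in the figure can be built by inserting each row of the compressed table as a path from the root and adding its count to every node on the path. The sketch below uses nested dictionaries; this representation and the name `build_star_tree` are assumptions for illustration. Star nodes are stored under the key "*" at each level, where the figure writes b*, c*, d*.

```python
def build_star_tree(compressed):
    """Build a star tree from {row: count}; each node is
    {'count': int, 'children': {value: node}}."""
    root = {"count": 0, "children": {}}
    for row, count in compressed.items():
        root["count"] += count
        node = root
        for value in row:               # one tree level per dimension
            node = node["children"].setdefault(
                value, {"count": 0, "children": {}})
            node["count"] += count
    return root

# Compressed table from the Star Nodes example (min_sup = 2).
compressed = {("a1", "b1", "*", "*"): 2,
              ("a1", "*", "*", "*"): 1,
              ("a2", "*", "c3", "d4"): 2}
tree = build_star_tree(compressed)

a1 = tree["children"]["a1"]
print(tree["count"], a1["count"], a1["children"]["b1"]["count"])
```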
Multi-Way Star-Tree Aggregation
 Aggregation
 DFS (depth-first-search) from the root of a star tree
 Creates star trees for the cuboids on the next level
[Figure: DFS over the base star tree (root:5), with the child trees created along the way; panels: Base-Tree, BCD-Tree, ACD/A-Tree, ABD/AB-Tree, ABC/ABC-Tree]
 Pruning
 Prunes if the aggregates do not satisfy min_sup
 Prunes if all the nodes in the generated tree are star nodes
Implementation of Star Cubing
 Method
 Multi-way aggregation & iceberg pruning
[Figure: the shared-dimension lattice (all; A/A, C/C, D/D; AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD) with some cuboids marked as pruned, shown next to the base star tree (root:5) and the BCD tree]
Summary of Star Cubing
 Limitations
 Sensitive to the order of dimensions
→ Dimensions should be in the order of decreasing cardinality
Reference
 Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB
(2003)
Shell Fragment Cube Computation
 Motivations
 The computation and storage of an iceberg cube are still costly for high
dimensional data ( “the curse of dimensionality” )
 Hard to determine an appropriate iceberg threshold
 An iceberg cube cannot be updated incrementally
 Features
 Reduces a high dimensional cube into a set of lower dimensional cubes
 Lossless reduction
 Online re-construction of the high-dimensional data cube
Fragmentation Strategy
 Observation
 OLAP occurs only on a small subset of dimensions at a time
 Fragmentation
 Partitions the set of dimensions into shell fragments
 Computes data cubes for each shell fragment
( Computing twenty 3-D data cubes is much cheaper than computing one 60-D data cube. )
 Retains inverted indices or value-list indices
 Semi-Online Computation
 Given the pre-computed fragment cubes, dynamically compute cube cells
of the high-dimensional cube online
Inverted Index
 Example
TID  A   B   C   D   E
1    a1  b1  c1  d1  e1
2    a1  b2  c1  d2  e1
3    a1  b2  c1  d1  e2
4    a2  b1  c1  d1  e2
5    a2  b1  c1  d1  e3
 Divides the 5 dimensions into 2 shell fragments, (A, B, C) and (D, E)
 Builds the inverted index (1-D inverted index)
Value  TID List    Size
a1     1 2 3       3
a2     4 5         2
b1     1 4 5       3
b2     2 3         2
c1     1 2 3 4 5   5
d1     1 3 4 5     4
d2     2           1
e1     1 2         2
e2     3 4         2
e3     5           1
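The 1-D inverted index in this example can be built in a single pass over the base table. The sketch below relies on the fact that attribute values here are unique across dimensions (a1, b1, ...), so a value alone can serve as the key; the function name `build_inverted_index` is an assumption for illustration.

```python
def build_inverted_index(table):
    """table: {tid: tuple of attribute values}. Returns {value: sorted TID list}."""
    index = {}
    for tid in sorted(table):
        for value in table[tid]:
            index.setdefault(value, []).append(tid)
    return index

table = {1: ("a1", "b1", "c1", "d1", "e1"),
         2: ("a1", "b2", "c1", "d2", "e1"),
         3: ("a1", "b2", "c1", "d1", "e2"),
         4: ("a2", "b1", "c1", "d1", "e2"),
         5: ("a2", "b1", "c1", "d1", "e3")}
index = build_inverted_index(table)
for value in sorted(index):
    print(value, index[value], len(index[value]))
```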
Cube Computation
 Process
 Generalizes the 1-D inverted indices to multi-dimensional inverted indices
 Computes cuboids using the inverted indices
 Example
 Shell fragment cube ABC contains 7 cuboids, A, B, C, AB, AC, BC, ABC
 The cuboid AB is computed using the 2-D inverted index below
Cell     Intersection     TID List  Size
(a1,b1)  {1,2,3}∩{1,4,5}  {1}       1
(a1,b2)  {1,2,3}∩{2,3}    {2,3}     2
(a2,b1)  {4,5}∩{1,4,5}    {4,5}     2
(a2,b2)  {4,5}∩{2,3}      {}        0
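The 2-D inverted index for cuboid AB can be derived from the 1-D TID lists by set intersection, as the table shows. A minimal sketch follows; the function name `cuboid_from_index` is an assumption for illustration, and the same call works for AC, BC, or ABC when given the corresponding value lists.

```python
from itertools import product

# 1-D TID lists for dimensions A and B, taken from the inverted index example.
index_1d = {"a1": {1, 2, 3}, "a2": {4, 5},
            "b1": {1, 4, 5}, "b2": {2, 3}}

def cuboid_from_index(index, *dim_values):
    """Return {cell: TID set} for every combination of the given values,
    one list of values per dimension of the cuboid."""
    cells = {}
    for cell in product(*dim_values):
        cells[cell] = set.intersection(*(index[v] for v in cell))
    return cells

ab = cuboid_from_index(index_1d, ["a1", "a2"], ["b1", "b2"])
for cell, tids in ab.items():
    print(cell, sorted(tids), len(tids))
```

The size of each TID set is the count measure of the cell, so (a2, b2) is an empty cell.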
Online Query Computation
 Process
 Given shell fragment cubes
 Divides the query into shell fragments
 Fetches the corresponding TID list for each fragment from the cubes
 Intersects the TID lists from each fragment to construct the instantiated
base table
 Computes the online cube using the base table with any data cube
computation algorithm
 Computes the query on the online cube
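The query steps above can be sketched on the five-tuple table from the inverted index example, with fragments (A, B, C) and (D, E). This is a simplified sketch: 1-D TID lists stand in for the TID lists that would be fetched from the pre-computed fragment cubes, and the online cube is computed by brute-force group-by with count as the measure. The function name `online_query` and the sample query are assumptions for illustration.

```python
from collections import Counter
from itertools import product

table = {1: ("a1", "b1", "c1", "d1", "e1"),
         2: ("a1", "b2", "c1", "d2", "e1"),
         3: ("a1", "b2", "c1", "d1", "e2"),
         4: ("a2", "b1", "c1", "d1", "e2"),
         5: ("a2", "b1", "c1", "d1", "e3")}
dims = "ABCDE"

# Stand-in for the TID lists stored in the fragment cubes.
tid_lists = {}
for tid, row in table.items():
    for value in row:
        tid_lists.setdefault(value, set()).add(tid)

def online_query(instantiated, inquired):
    """instantiated: fixed values (one TID list each); inquired: dimension names."""
    tids = set(table)
    for value in instantiated:          # intersect the TID lists of the fragments
        tids &= tid_lists[value]
    cols = [dims.index(d) for d in inquired]
    base = [tuple(table[t][c] for c in cols) for t in sorted(tids)]
    cube = Counter()                    # online cube over the inquired dimensions
    for row in base:
        for mask in product([False, True], repeat=len(row)):
            cube[tuple("*" if m else v for v, m in zip(row, mask))] += 1
    return base, cube

# Query: A = a2 and D = d1 are fixed; B and E are inquired.
base, cube = online_query(["a2", "d1"], ["B", "E"])
print(base)
print(dict(cube))
```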
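Shell Fragment Cube Process
[Figure: the dimensions A B C D E F … are partitioned into fragments, and an ABC cube and a DEF cube are computed; the DEF cube contains the D cuboid, the EF cuboid, the DE cuboid, …; the DE cuboid stores one T-ID list per cell]
Cell   T-ID List
d1 e1  {1, 3, 8, 9}
d1 e2  {2, 4, 6, 7}
d2 e1  {5, 10}
…      …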
Online Query Process
[Figure: from the dimensions A B C D E F G H I J K L M N …, the instantiated base table is constructed, and the online cube is computed from it]
Summary of Shell Fragment Cube
 Advantages
 Significant increase of the speed of data cube computation
 Significant decrease of the storage usage
 Various applications in very high dimensional data
 Limitations
 Tradeoffs between the preprocessing time to construct a data cube and
the online computation time for queries
 Tradeoffs between the storage for a data cube and the memory for
online computation
 Reference
 Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal
Cubing Approach”, In Proceedings of VLDB (2004)
Questions?
 Lecture Slides are found on the Course Website,
www.ecs.baylor.edu/faculty/cho/4352