
Modeling Multidimensional Databases

Rakesh Agrawal Ashish Gupta* Sunita Sarawagi

IBM Almaden Research Center


650 Harry Road, San Jose, CA 95120

*Current affiliation: Junglee Corporation, Sunnyvale, CA 94086.

1063-6382/97 $10.00 © 1997 IEEE

Abstract

We propose a data model and a few algebraic operations that provide semantic foundation to multidimensional databases. The distinguishing feature of the proposed model is the symmetric treatment not only of all dimensions but also measures. The model provides support for multiple hierarchies along each dimension and support for adhoc aggregates. The proposed operators are composable, reorderable, and closed in application. These operators are also minimal in the sense that none can be expressed in terms of others nor can any one be dropped without sacrificing functionality. They make possible the declarative specification and optimization of multidimensional database queries that are currently specified operationally. The operators have been designed to be translated to SQL and can be implemented either on top of a relational database system or within a special purpose multidimensional database engine. In effect, they provide an algebraic application programming interface (API) that allows the separation of the frontend from the backend. Finally, the proposed model provides a framework in which to study multidimensional databases and opens several new research problems.

1 Introduction

Codd [CCS93] coined the phrase On-Line Analytical Processing (OLAP) to characterize the requirements for summarizing, consolidating, viewing, applying formulae to, and synthesizing data according to multiple dimensions. OLAP software enables analysts, managers, and executives to gain insight into the performance of an enterprise through fast access to a wide variety of views of data organized to reflect the multidimensional nature of the enterprise data [Col95]. It has been said that current relational database systems have been designed and tuned for On-Line Transaction Processing (OLTP) and are inadequate for OLAP applications [Cod93] [Fin95] [KS94]. In response, several multidimensional database products have appeared on the market (see [Rad95] for a survey).

The database research community, however, has so far not played a major role in this market phenomenon. Gray et al. [GBLP96] recently proposed an extension to SQL with a Data Cube operator that generalizes the groupby construct to support some of the multidimensional analysis. Since then, techniques have been developed for computing the data cube [AAD+96], for deciding what subset of a data cube to pre-compute [HRU96] [GHRU97], for estimating the size of multidimensional aggregates [SDNR96], and for indexing pre-computed summaries [SR96] [JS96]. The research in multidimensional indexing structures (see, for example, [Gut94] for an overview) is relevant as well. Lastly, research in statistical databases (see, for example, [Sho82] for an overview) also addressed some of the same concerns.

This paper presents a framework for research in multidimensional databases. We first review concepts and terminologies in vogue in multidimensional database products in Section 2. We also point out some of the deficiencies in the current products. We then propose in Section 3 a data model to provide semantic backing to the techniques used by current multidimensional database products. The salient features of our model are:

• Our data model is a multidimensional cube with a set of basic operations designed to unify the divergent styles in use today and to extend the current functionality.

• The proposed model provides symmetric treatment to not only all dimensions but also to measures. The model also is very flexible in providing support for multiple hierarchies along each dimension and support for adhoc aggregates.

• Each of our operators are defined on the cube and produce as output a new cube. Thus the operators are closed and can be freely reordered. This free composition allows a user to form larger queries, thereby replacing the relatively inefficient one-operation-at-a-time approach of many existing products. The algebraic nature of the cube also provides an opportunity for optimizing multidimensional queries.

• The proposed operators are minimal. None can be expressed in terms of others nor can any one be dropped without sacrificing functionality.

• Our modeling framework provides the logical separation of the frontend graphical user interface (GUI) used by a business analyst from the backend storage system used by the corporation. The operators thus provide an algebraic application programming interface (API) that allows the interchange of frontends and backends.

We discuss in Section 4 some of our design choices and show how currently popular multidimensional database operations can be expressed in terms of the proposed operators. These operators have been designed to be translated into SQL, albeit with some minor extensions. We refer the reader to [AGS96] for these translations. Thus, our data model can be implemented on either a general-purpose relational system or a specialized engine. We conclude with a summary in Section 5.

2 Current State of the Art

We begin with a brief overview of the current state of art in multidimensional databases.

Example 2.1 Consider a database that contains point of sale data about the sales price of products, the date of sale, and the supplier who made the sale. The sales value is functionally determined by the other three attributes. Intuitively, each of the other three attributes can vary and accordingly determine the sales value. Figure 1 illustrates this multidimensional view.

Figure 1. Example data cube

□

2.1 Terminology

Determining attributes like product, date, supplier are referred to as dimensions while the determined attributes like sales are referred to as measures. (Dimensions are called categorical attributes and measures are called numerical or summary attributes in the statistical database literature [Sho82]). There is no formal way of deciding which attributes should be made dimensions and which attributes should be made measures. It is left as a database design decision.

Dimensions usually have associated with them hierarchies that specify aggregation levels and hence granularity of viewing data. Thus, day → month → quarter → year is a hierarchy on date that specifies various aggregation levels. Similarly, product name → type → category is a hierarchy on the product dimension.

An analyst might want to see only a subset of the data and thus might view some attributes and within each selected attribute might restrict the values of interest. In multidimensional database parlance, the operations are called pivoting (rotate the cube to show a particular face) and slicing-dicing (select some subset of the cube). The multidimensional view allows hierarchies associated with each dimension also to be viewed in a logical manner. Aggregating the product dimension from product name to product type is expressed as a roll-up operation. The converse of roll-up is drill-down that displays detail information for each aggregated point. Thus, drilling-down the product dimension from product category to product type gets the sales for each product type corresponding to each product category. Further drill down will get sales for individual products. Drill-down is essential because often users want to see only aggregated data first and selectively see more detailed data.

Example 2.2 We give below some queries to provide a flavor of multidimensional queries. These queries use the database from Example 2.1 and other necessary hierarchies on product and time dimensions.

• Give the total sales for each product in each quarter of 1995. (Note that quarter is a function of date).

• For supplier Ace and for each product, give the fractional increase in the sales in January 1995 relative to the sales in January 1994.

• For each product give its market share in its category today minus its market share in its category in October 1994.

• Select top 5 suppliers for each product category for last year, based on total sales.

• For each product category, select total sales this month of the product that had highest sales in that category last month.

• Select suppliers that currently sell the highest selling product of last month.

• Select suppliers for which the total sale of every product increased in each of last 5 years.

• Select suppliers for which the total sale of every product category increased in each of last 5 years.

□

2.2 Implementation Architectures

There are two main approaches currently used to build multidimensional databases. One approach maintains the data as a k-dimensional matrix based on a non-relational specialized storage structure. The database designer specifies all the aggregations they consider useful. While building the storage structure, these aggregations associated with all possible roll-ups are precomputed and stored. Thus, roll-ups and drill-downs are answered in interactive time. Many products have adopted this approach - for instance, Arbor Essbase [Arb] and IRI Express [IRI].

Another approach uses a relational backend wherein operations on the data cube are translated to relational queries (posed in a possibly enhanced dialect of SQL). Indexes built on materialized views are heavily used in such systems. This approach also manifests itself in many products - for instance, Redbrick [Eri95] and Microstrategy [Mic].

2.3 Additional Desired Functionality

We believe that multidimensional database systems must provide the following additional functionality, which is either missing or poorly supported in current products:

• Symmetric treatment not only of all dimensions but also of measures. That is, selections and aggregations should be allowed on all dimensions and measures. For example, consider a query that finds the total sales for each product for ranges of sales price like 0-999, 1000-9999 and so on. Here the sales price of a product, besides being treated as a measure, is also the grouping attribute. Such queries that require categorizing on a measure are quite frequent. Non-uniform treatment of dimensions and measures makes such queries hard in current products. The proposed OLAP council API [OLA96] provides for all the measures to be put on one dimension of the cube. However, this proposal maintains a sharp distinction between measures and dimensions and does not solve the problem of being able to categorize on a measure.

• Support for multiple hierarchies along each dimension. For instance, Example 2.1 illustrates the type-category hierarchy on products (of interest to a consumer analyst). An alternative hierarchy is one based on which company manufactures the product and who owns that company, namely, product → manufacturer → parent company (of interest to a stock market analyst). Roll-ups/drill-downs can be on either of the hierarchies.

• Support for computing ad-hoc aggregates. That is, aggregates other than those originally prespecified should be computable. For instance, for each product both the total sales and the average sales are interesting numbers.

• Support for a query model in place of one-operation-at-a-time computation model. Currently, a user operates on a cube once and obtains the resulting cube. Then the user makes the next operation. However, not all the intermediate cubes are of interest to the user. A set of basic operators that have well defined semantics enable this computation to be replaced by a query model. Thus, having tools to compose operators allows complex multidimensional queries to be built and executed faster than having the user specify each step. This approach is also more declarative and less operational.

2.4 Related Research

Data models developed in the context of temporal, spatial and statistical databases also incorporate dimensionality and hence have similarities with our work.

In temporal databases [TCG+93], rows and columns of a relational table are viewed as two dimensions and time adds a third dimension forming what is called the time cube. This cube is different from a cube in our model where dimensions correspond to arbitrary attributes and all dimensions are treated uniformly without attaching any fixed connotation with any one of them.

The modeling efforts in spatial databases [Gut94] mostly concentrate on representing arbitrary geometric objects (points, lines, polygons, regions etc.) in multidimensional space. By viewing OLAP data as points in the multidimensional space of attributes, one could draw analogies between the two models. But the operations central to spatial databases (overlap, containment, etc.) are quite different from the common OLAP operations (roll-up, drill-down, joins etc.). However, the multi-dimensional indexing structures developed for spatial databases (see [Gut94]) may be useful in developing efficient implementations of OLAP databases.

Statistical databases also address some of the same concerns as OLAP databases. However, models in the statistical database literature [Mic92] [CL94] have been primarily concerned with extending existing data models (mostly relational) for representing summaries and supporting operations for statistical data analysis. In contrast, our objective has been to develop a model and a set of basic operations that abstract the analysts view of enterprise data. In statistical databases, category (dimensions) and summaries (measures) are treated quite differently, whereas we have strived to treat dimensions and measures uniformly. For instance, [MERS92] describes an S-algebra with operations like S-union, S-selections, S-aggregation that are similar to some of the operators that we define. However, since we treat measures and dimensions symmetrically, we have additional operators that are unique to our model. Irrespective of these differences, OLAP databases will benefit from implementation techniques developed in statistical databases, particularly related to aggregation views (see [Sho82] [STL89]).

Concurrent to our work, [GLS96] proposed a model for tabular data that embeds both the relational and the multidimensional data model. Their model also allows measures and dimensions to be interchanged like our model. However, their model does not state how to handle aggregations, an operation that is fundamental to OLAP databases.

3 Data Model

We now outline our proposed multidimensional data model and operations that capture the functionality currently provided by multidimensional database products and the additional desired functionality listed above. Our design was driven by the following key considerations:

• Treat dimensions and measures symmetrically.

• Strive for flexibility (multiple hierarchies, adhoc aggregates).

• Keep the number of operators small.

• Stay as close to relational algebra as possible. Make operators translatable to SQL.

In our logical model, data is organized in one or more multidimensional cubes. A cube has the following components:

• $k$ dimensions, and for each dimension a name $D_i$, a domain $dom_i$ from which values are taken.

• Elements defined as a mapping $E(C)$ from $dom_1 \times \ldots \times dom_k$ to either an n-tuple, 0, or 1. Thus, $E(C)(d_1,\ldots,d_k)$ refers to the element at position $d_1,\ldots,d_k$ of cube $C$. Note, the $d_i$'s refer to values not positions per se. Therefore, our model does not require the dimensions to have a ranked, discrete domain.

• An n-tuple of names that describes the n-tuple element of the cube.

The elements of a cube can be either 0, 1, or an n-tuple $\langle X_1,\ldots,X_n \rangle$. If the element corresponding to $E(C)(d_1,\ldots,d_k)$ is 0 then that combination of dimension values does not exist in the database. A 1 indicates the existence of that particular combination. Finally, an n-tuple indicates that additional information is available for that combination of dimension values. If any of the elements of a cube is a 1 then none of the elements can be a n-tuple and vice-versa. We represent only those values along a dimension of a cube for which at least one of the elements of the cube is not 0. If all the elements of a cube are 0 then the cube is empty. Additionally, if domain $dom_i$ of dimension $D_i$ has no values then too the cube is considered to be empty.

In our model, no distinction is made between measures and dimensions. Thus, in Example 2.1, sales is just another dimension. Note that this is a logical model and does not force any storage mechanism. Thus, a cube in our data model may have more logical dimensions than the number of dimensions used to physically store the cube in a multidimensional storage system.

3.1 Operators

We now discuss our multidimensional operators. We illustrate the operators using a 2-D subset of the cube introduced in Example 2.1. We omit the supplier dimension and display in Figure 2 only the product, date, and sales dimensions. Note, sales is not a measure but another dimension, albeit only logical, in the model.

Figure 2. Logical cube wherein sales is a dimension (omitting the 1/0s)

To operate on the logical cube, the sales dimension may have to be folded into the cube such that sales values seem determined by the product and date dimensions. We describe later how this is achieved. For now, we will use the cube with sales values as the sole member of the elements of the cube. Thus, the value <15> for date = mar 4 and product = p1 in Figure 3 indicates that in the logical cube of Figure 2 the element corresponding to date = mar 4, product = p1, and sales = 15 is 1. We show the metadata description of the elements as an annotation in the cube. Thus, <sales> in Figure 3 indicates that each element in the cube is a sales value.

Notation We define the operators using a cube $C$ with $k$ dimensions. We refer to the dimensions as $D_1,\ldots,D_k$. We use $D_i$ to refer also to the domain of dimension $D_i$ if the context makes the usage clear; otherwise we refer to the domain of dimension $D_i$ as $dom_i(C)$. We use lower
case letters like $a$, $b$, $c$ to refer to constants.

Dimension values in our data model functionally determine elements of the cube. As a result of an application of an operation, more than one element value may be mapped to the same element (i.e. the same combination of values of dimension attributes) of the answer cube. These element values are combined into one value to maintain functional dependency by specifying what we call an element combining function, denoted as $f_{elem}$.

We also sometime merge values along a dimension. We call functions used for this purpose dimension merging functions, denoted as $f_{merge}$.

Push

The push operation (see Figure 3) is used to convert dimensions into elements that can then be manipulated using function $f_{elem}$. This operator is needed to allow dimensions and measures to be treated uniformly.

Input: $C$, $D_i$.
Output: $C$ with each non-0 element extended by an additional member, the value of dimension $D_i$ for that element.
Mathematically: $push(C, D_i) = C_a$.
$E(C_a)(d_1,\ldots,d_k) = g \oplus d_i$ where $g = E(C)(d_1,\ldots,d_k)$.
The operator $\oplus$ is defined to be 0 if $g = 0$, it is $\langle d_i \rangle$ if $g = 1$, and in all other cases it concatenates $g$ and $\langle d_i \rangle$.

Figure 3. The push operation on dimension product

Pull

This operation is the converse of the push operator. Pull creates a new dimension for a specified member of the elements. The operator is useful to convert an element into a dimension so that the element can be used for merging or joining. This operator too is needed for the symmetric treatment of dimensions and measures.

Input: $C$, new dimension name $D$, integer $i$.
Output: $C_a$ with an additional dimension $D$ that is obtained by pulling out the $i$th element of each element of the matrix.
Constraint: all non-0 elements of $C$ are n-tuples because each non-0 element need at least one member to enable the creation of a new dimension.
Mathematically: $pull(C, D, i) = C_a$, $1 \le i \le n$.
$D$ becomes the $(k+1)$st dimension of the cube.
$dom_{k+1}(C_a) = \{e \mid e \text{ is the } i\text{th member of some } E(C)(d_1,\ldots,d_k)\}$.
$E(C_a)(d_1,\ldots,d_k,e_i) = \langle e_1,\ldots,e_{i-1},e_{i+1},\ldots,e_n \rangle$ if $E(C)(d_1,\ldots,d_k) = \langle e_1,\ldots,e_i,\ldots,e_n \rangle$,
$E(C_a)(d_1,\ldots,d_k,e_i) = 0$, otherwise.
Note, if $n = 1$ then elements of $C_a$ are '1's or '0's.

Figure 4. Pull first member of each element as dimension sales
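The cube model and the push and pull definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `Cube`, its field names, and the choice to store 0 elements as missing dictionary keys are all hypothetical, and the sample data is made up apart from the <15> element at (p1, mar 4) mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class Cube:
    dims: list     # dimension names D_1..D_k, e.g. ["product", "date"]
    members: list  # names describing the n-tuple elements, e.g. ["sales"]
    elems: dict    # {(d_1, ..., d_k): tuple or 1}; a missing key is element 0


def push(cube, dim):
    """Extend every non-0 element with the value of dimension `dim`."""
    i = cube.dims.index(dim)
    out = {}
    for coords, g in cube.elems.items():
        # g == 1 becomes <d_i>; an n-tuple g is concatenated with <d_i>.
        out[coords] = (coords[i],) if g == 1 else g + (coords[i],)
    return Cube(list(cube.dims), cube.members + [dim], out)


def pull(cube, new_dim, i):
    """Make the i-th member (1-based) of each element a new last dimension."""
    out = {}
    for coords, g in cube.elems.items():
        if g == 1:
            raise ValueError("pull requires all non-0 elements to be n-tuples")
        rest = g[:i - 1] + g[i:]
        # With n = 1 nothing is left in the element, so it becomes 1.
        out[coords + (g[i - 1],)] = rest if rest else 1
    members = cube.members[:i - 1] + cube.members[i:]
    return Cube(cube.dims + [new_dim], members, out)


# Illustrative data: sales folded into a (product, date) cube.
c = Cube(["product", "date"], ["sales"],
         {("p1", "mar 4"): (15,), ("p2", "feb 2"): (10,)})
pushed = push(c, "product")   # elements become <sales, product>
pulled = pull(c, "sales", 1)  # sales becomes a third dimension
```

In this sketch `pull` followed by `push` on the new dimension restores the original element values, which is one way to read the statement that the two operators are converses.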
Figure 5. The restriction operation (predicate shown in the figure: product in {p1, p3})
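The restriction of Figure 5, which keeps only products p1 and p3, can be sketched as follows. This is a minimal illustration under assumptions: the cube is stored as a plain dictionary from (product, date) to an element tuple, missing keys stand for 0 elements, and the function name `restrict` and the sample values are hypothetical. The predicate is applied to the whole set of values of the dimension and returns the values to keep, so that set-level conditions fit the same shape as per-value conditions.

```python
def restrict(elems, axis, predicate):
    """Keep the elements whose value at position `axis` survives `predicate`.

    `predicate` takes the set of values present on that dimension and
    returns the subset to keep.
    """
    domain = {coords[axis] for coords in elems}
    keep = predicate(domain)
    return {coords: e for coords, e in elems.items() if coords[axis] in keep}


# Illustrative (product, date) -> <sales> cube; the values are made up.
cube = {
    ("p1", "mar 4"): (15,),
    ("p2", "feb 2"): (10,),
    ("p3", "feb 3"): (15,),
    ("p4", "jan 1"): (25,),
}

# Per-value predicate: product in {p1, p3}.
in_p1_p3 = lambda dom: dom & {"p1", "p3"}
# Set-level predicate: the two last product names in sorted order.
last_two = lambda dom: set(sorted(dom)[-2:])

restricted = restrict(cube, 0, in_p1_p3)
tail = restrict(cube, 0, last_two)
```

If the predicate keeps no value, the result is an empty dictionary, which corresponds to an empty cube in this representation.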

Destroy Dimension

Often the dimensionality of a cube needs to be reduced. This operator removes a dimension $D$ that has in its domain a single value. The presence of a single value implies that for the remaining $k - 1$ dimensions, there is a unique $k - 1$ dimensional cube. Thus, if dimension $D$ is eliminated then the resulting $k - 1$ dimensional space is occupied by this unique cube.

Input: $C$, dimension name $D_i$.
Output: $C_a$ with dimension $D_i$ absent.
Constraint: $D_i$ has only one value, say $v$.
Mathematically: $destroy(C, D_i) = C_a$.
$C_a$ has $k - 1$ dimensions, $D_1,\ldots,D_{i-1},D_{i+1},\ldots,D_k$,
$E(C_a)(d_1,\ldots,d_{i-1},d_{i+1},\ldots,d_k) = E(C)(d_1,\ldots,d_k)$.

A dimension that has multiple values cannot be directly destroyed because then elements would no longer be functionally determined by dimension values. A multi-valued dimension is destroyed by first applying a merge operation (described later) and then applying the above operation. Note that, if $k = 1$ we will get a zero dimensional cube or a scalar as a result of the destroy operation.

Restriction

The restrict operator operates on a dimension of a cube and removes the cube values of the dimension that do not satisfy a stated condition. Figure 5 illustrates an application of restriction. Note that this operator realizes slicing/dicing of a cube in multidimensional database terminology.

Input: Cube $C$ and predicate $P$ defined on $D_i$.
Output: New cube $C_a$ obtained by removing from $C$ those values of dimension $D_i$ that do not satisfy the predicate $P$. We have a more general notion of predicate $P$ that can be evaluated on a set of values and not on just a single value. Thus, $P$ can be either of the form "greater than 5" that is evaluated on single values at-a-time or be of the form "top 5 values" that is evaluated on the entire domain $D_i$ and outputs a set of values. If no element of dimension $D_i$ satisfies $P$ then $C_a$ is an empty cube.
Mathematically: $restrict(C, D_i, P) = C_a$.
$dom_j(C_a) = dom_j(C)$ if $1 \le j \le k$ and $j \ne i$,
$dom_j(C_a) = P(dom_j(C))$, otherwise.
$E(C_a)(d_1,\ldots,d_k) = E(C)(d_1,\ldots,d_k)$.

Join

The join operation is used to relate information in two cubes. The result of joining a $m$-dimensional cube $C$ with an $n$-dimensional cube $C_1$ on $k$ dimensions, called joining dimensions, is cube $C_a$ with $m + n - k$ dimensions. Each joining dimension $D_i$ of $C$ combines with exactly one dimension $D_{x_i}$ of $C_1$ to get resulting dimension $D_i$ of $C_a$ as follows: For each joined dimension, two mapping functions are used for mapping the corresponding joining dimension of $C$ and $C_1$ to the resulting dimension of $C_a$. The elements of $C_a$ are then formed via a function $f_{elem}$ that combines all elements of $C$ and $C_1$ that get mapped to the same element of $C_a$.

Figure 6 illustrates cube $C$ joining with cube $C_1$ on dimension $D_1$ (mapping function is identity). Dimension $D_1$ of the resulting cube has only two values. The function $f_{elem}$ divides the element value from cube $C$ by the element value from $C_1$; if either element is 0 then the resulting element is also 0. Values of result dimension that have only 0 elements corresponding to them are eliminated from $C_a$ (like values 0 and 3 for dimension $D_1$).

Figure 6. Joining two cubes

Input: $C$ with dimensions $D_1 \ldots D_m$ and $C_1$ with dimensions $D_{m-k+1} \ldots D_n$. $D_{m-k+1},\ldots,D_m$ are the join dimensions (without loss of generality). $k$ mapping functions, $f_{m-k+1},\ldots,f_m$ defined over dimensions $D_{m-k+1},\ldots,D_m$ of $C$ and $k$ mapping function $f'_{m-k+1},\ldots,f'_m$ defined over dimensions $D_{m-k+1},\ldots,D_m$ of $C_1$. Mapping $f_i$ applied to value $v \in dom_i(C)$ produces values for dimension $D_i$ in $C_a$. Similarly $f'_i$ applied to $v \in dom_i(C_1)$ produces values for dimension $D_i$ in $C_a$. Also needed is a function $f_{elem}$ that combines sets of elements from $C$ and $C_1$ to output elements of $C_a$.
Output: $C_a$ with $n$ dimensions, $D_1 \ldots D_n$. Multiple elements in $C$ and $C_1$ could get mapped to the same element in $C_a$. All elements of $C$ and $C_1$ that get mapped to the same point in $C_a$ are combined by function $f_{elem}$ to produce the output element of $C_a$. If for some value $v$ of dimension $D_i$, the elements $E(C_a)(x_1,\ldots,v,x_{i+1},\ldots,x_n)$ is 0 for all values of the other dimensions, then $v$ is not included in dimension $D_i$ of $C_a$.
Mathematically: $join(C, C_1, [f_{m-k+1},\ldots,f_m, f'_{m-k+1},\ldots,f'_m], f_{elem}) = C_a$.
$dom_i(C_a) = dom_i(C)$ if $1 \le i \le m - k$,
$dom_i(C_a) = dom_i(C_1)$ if $m + 1 \le i \le n$,
$dom_i(C_a) = \{d_a \mid d_a \in f_i(d), d \in dom_i(C) \text{ OR } d_a \in f'_i(d), d \in dom_i(C_1)\}$ if $m - k + 1 \le i \le m$.
$E(C_a)(d_1,\ldots,d_{m-k},\ldots,d_m,\ldots,d_n) = f_{elem}(\{t_1\},\{t_2\})$ such that
$t_1 = E(C)(d_1,\ldots,d_{m-k},d'_{m-k+1},\ldots,d'_m)$,
$t_2 = E(C_1)(d''_{m-k+1},\ldots,d''_m,d_{m+1},\ldots,d_n)$,
$d_i \in f_i(d'_i)$ OR $d_i \in f'_i(d''_i)$.

We defined above a general notion of the join operator that covers several important special cases. Notable amongst these are: Cartesian product, natural join, union, merge and associate. In the case of Cartesian product, the two cubes have no common joining dimension. In the case of natural join the mapping function is identity and the $f_{elem}$ function returns 0 whenever one of the elements is 0. The sub-cases union and merge are discussed later.

Associate is an especially useful sub case in OLAP applications for computations like "express each month's sale as a percentage of the quarterly sale". Associate is asymmetric and requires that each dimension of $C_1$ be joined with some dimension of $C$. Figure 7 illustrates associating cube $C_1$ with $C$ where month dimension of $C_1$ and date dimension of $C$ are joined by mapping them to the date dimension of $C_a$. Similarly, category and product are joined by mapping them to product of $C_a$. For dimension month, each month is mapped to all the dates in that month. For dimension category, value cat1 is mapped to products p1 and p2, and cat2 is mapped to p3 and p4. For dimensions date and product the identity mapping is used. Function $f_{elem}$ divides the element value from cube $C$ by the element value from $C_1$; if either element is 0 then the resulting element is also 0. Note, value mar1 is eliminated from $C_a$ because all its corresponding elements are 0.

Figure 7. Associating two cubes

Merge

The merge operation is an aggregation operation. We illustrate it in Figure 8. The figure shows how hierarchies in a multidimensional database are implemented using the merge operator. Intuitively, a dimension merging function is used to map multiple product names into one or more categories and another function is used to map individual dates into their corresponding month. Thus, multiple elements on each dimension are merged to produce a dimension with a possibly smaller domain. As a result of merging each dimension, multiple elements in the original cube get mapped to the same element in the new cube. An element combining function is used to specify how these multiple elements are aggregated into one. In the example in Figure 8, the aggregation function $f_{elem}$ is sum.

Figure 8. Merging dimensions date and product using $f_{elem}$ = sum

In general, the dimension merging function might be a one-to-many mapping that takes an element in the lower level into multiple elements in the higher level of hierarchies. For instance, a 1 → n mapping can be used to merge a product belonging to $n$ categories to handle multiple hierarchies.

Input: $C$, function $f_{elem}$ for merging elements and $m$ (dimension, function) pairs. Without loss of generality, assume that the $m$ pairs are $[D_1, f_1],\ldots,[D_m, f_m]$.
Output: Cube $C_a$ of same dimensionality as $C$. Dimension $D_i$ is merged as per function $f_i$. An element corresponding to the merged elements is aggregated as per $f_{elem}$.
Mathematically: $merge(C, \{[D_1, f_1],\ldots,[D_m, f_m]\}, f_{elem}) = C_a$.
$dom_i(C_a) = \{f_i(e) \mid e \in dom_i(C)\}$ if $1 \le i \le m$,
$dom_i(C_a) = dom_i(C)$, otherwise.
$E(C_a)(d_1,\ldots,d_k) = f_{elem}(\{t \mid t = E(C)(d'_1,\ldots,d'_k) \text{ where } f_i(d'_i) = d_i \text{ if } 1 \le i \le m \text{ else } d'_i = d_i\})$.

A special case of the merge operator is when all the
merging functions are identity. In this case, the merge operator can be used to apply a function f_elem to each element of a cube.

Remark: The merge operator is, strictly speaking, not part of our basic set of operators. It can be expressed as a special case of the self-join of a cube using f_merge transformation functions on the dimensions being merged and identity transformation functions for the other dimensions. We chose to define merge separately because it is a unary operator, unlike the binary join, and also for performance reasons.

4 Discussion

The reader may have noticed similarities between the proposed operators and relational algebra [Cod70]. This is by design. One of our goals was to explore how much of the functionality of current multidimensional products can be abstracted in terms of relational algebra. By developing operators that can be translated into SQL (see [AGS96] for translations), our hope was to create a fast path for providing OLAP capability on top of relational database systems. We must hasten to add that we are not arguing against specialized OLAP engines; we believe the design and prototyping of such engines is a fruitful research direction. We are also not suggesting that simply translating these operators into SQL would be sufficient for providing OLAP capabilities in relational database systems. However, it does point to directions in which optimization techniques and storage structures in relational database systems need to evolve.

The goal of treating dimensions and measures symmetrically had a permeating influence on our design. It is a functionality either not present or poorly supported in current multidimensional database products. Its absence causes expensive schema redesign when an unanticipated need arises for aggregation along an attribute that was initially thought to be a measure. In hindsight, the push and pull operations may appear trivial. However, their introduction was the key that made the symmetric treatment of dimensions and measures possible.

The reader may argue with the way we have chosen to incorporate order-based information into our algebra. We rely on functions for this purpose, which implies that the system may not be able to use this information in optimizing queries. We debated allowing a native order to be specified with each dimension and providing ordering operators. We decided against it because of the large number of such operators and because the semantics gets quite complex when there are multiple hierarchies along a dimension. In a practical implementation of our model, it will be worthwhile to allow a default order to be specified with each dimension and to make the system aware of some built-in ordering functions such as first n. The same holds for providing the knowledge of some built-in aggregate functions.

The reader may also notice the absence of direct analogs of relational projection, union, intersection, and difference. These operations can be expressed in terms of our proposed operators as follows:

Projection: The projection of a cube is computed by merging each dimension not included in the projection and then destroying the dimension. An f_elem function specifying how multiple elements are combined is needed as part of the specification of the projection.

Union: Two cubes are union-compatible if (i) they have the same number of dimensions; and (ii) for each dimension D_i in C, dimension D_i in C1 is of the same domain type. Union is computed by joining the two cubes using the identity transformation functions for each dimension of each cube and by choosing a function f_elem that produces a non-empty element for element e in C_ans whenever an element from either of the two cubes is mapped into e. Dimension D_i in the resulting cube has as its values the union of the values in dom_i(C) and in dom_i(C1).

Intersect: The intersection of two union-compatible cubes is computed by joining the cubes through the identity mapping, which effectively retains only those dimension values that are present in both cubes. Thus, function f_elem makes the element for point p in C_ans non-0 only if elements from both cubes are mapped into p.

Difference: The difference of two union-compatible cubes C1 and C2 is expressed as an intersection of C1 and C2 followed by a union of the result with C1. The f_elem function for combining two elements in the intersection step discards the value of the element for C1 and retains C2's element. The f_elem function for combining two elements in the union step saves the value of C1's element if the two elements are different and makes the result 0 if they are identical. [*]

4.1 Expressive Power

Our algebra can be seen to be at least as powerful as relational algebra [Cod70]. A precise circumscription of the expressive power of the proposed data model is an open problem. A related interesting open question pertains to defining a formal notion of completeness for multidimensional database queries and evaluating how complete our algebra is with respect to that metric. We take an empirical approach and discuss below how the current high-level multidimensional operations can be built using our proposed operators.

[*] This implementation of difference corresponds to the following semantics for C1 - C2: E(C_ans)(d_1, ..., d_k) equals 0 if E(C2)(d_1, ..., d_k) = E(C1)(d_1, ..., d_k); it is E(C1)(d_1, ..., d_k) otherwise. An alternative semantics could be that E(C_ans)(d_1, ..., d_k) equals 0 if E(C2)(d_1, ..., d_k) ≠ 0, and E(C1)(d_1, ..., d_k) otherwise. This semantics can be implemented by a small change in the f_elem function used in the union step.
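As a rough illustration of how union, intersection, and the two difference semantics reduce to a join with identity mappings and a suitable f_elem, consider the following sketch. It rests on assumptions that are not part of the paper: a cube is modeled as a Python dict from coordinate tuples to elements, the 0 element is modeled as an absent key (None), the helper name `join_identity` is hypothetical, and the union's preference for the first cube's element when both cubes supply one is an arbitrary choice.

```python
from typing import Callable, Dict, Optional, Tuple

Coord = Tuple[str, ...]
Cube = Dict[Coord, int]  # assumption: a missing coordinate stands for the 0 element

def join_identity(c1: Cube, c2: Cube,
                  f_elem: Callable[[Optional[int], Optional[int]], Optional[int]]) -> Cube:
    """Join two union-compatible cubes using identity mappings on every dimension.

    f_elem receives the element of each cube at a point (None when the
    element is 0) and returns the result element, or None for 0.
    """
    result: Cube = {}
    for coord in set(c1) | set(c2):
        e = f_elem(c1.get(coord), c2.get(coord))
        if e is not None:
            result[coord] = e
    return result

def union(c1: Cube, c2: Cube) -> Cube:
    # Non-empty whenever either cube maps an element into the point.
    return join_identity(c1, c2, lambda a, b: a if a is not None else b)

def intersection(c1: Cube, c2: Cube) -> Cube:
    # Non-0 only if both cubes map an element into the point.
    return join_identity(c1, c2,
                         lambda a, b: a if a is not None and b is not None else None)

def difference(c1: Cube, c2: Cube) -> Cube:
    # Intersection step: discard C1's element and retain C2's element.
    both = join_identity(c1, c2,
                         lambda a, b: b if a is not None and b is not None else None)
    # Union step with C1: keep C1's element unless the two elements are identical.
    return join_identity(c1, both, lambda a, b: None if a == b else a)

def difference_alt(c1: Cube, c2: Cube) -> Cube:
    # Alternative semantics: 0 wherever C2 has a non-0 element.
    both = join_identity(c1, c2,
                         lambda a, b: b if a is not None and b is not None else None)
    return join_identity(c1, both, lambda a, b: None if b is not None else a)

c1 = {("p1", "jan"): 10, ("p2", "jan"): 20, ("p3", "jan"): 30}
c2 = {("p2", "jan"): 20, ("p3", "jan"): 5, ("p4", "jan"): 7}
```

With these two sample cubes, `difference(c1, c2)` keeps p1 and p3 (p2 is dropped because both cubes hold 20 there), while `difference_alt(c1, c2)` keeps only p1. Only the f_elem of the union step differs between the two variants.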

Roll-up: Roll-up is a merge operation that needs one dimension merging function and one element combining function. If a hierarchy is specified on a dimension, then the dimension merging function is defined implicitly by the hierarchy. The elements corresponding to merged values on the dimension are combined using a user-specified element combining function such as SUM.

Drill-down: This operator is actually a binary operation even though most current multidimensional database products make it seem like a unary operation. Consider computing the sum X of 10 values. Drill-down from X to the underlying 10 values is possible in infinitely many ways. Thus, the underlying 10 values have to be known. That is, the aggregate cube has to be joined (actually associated) with the cube that has detailed information. Continuing with our analogy, to drill down from X to its constituents the database has to keep track of how X was obtained and then associate X with these values. Thus, if users merge cubes along stored paths and there is a unique path down the merging tree, then drill-down is uniquely specified. By storing hierarchy information and by restricting single element merging functions to be used along each hierarchy, drill-down can be provided as a high-level operation on top of associate.

Star Join: In a star join [Eri95], a large detail mother table M is joined with several small daughter tables that describe join keys in the mother table. A star join denormalizes the mother table by joining it with its daughter tables after applying selection conditions to their descriptive attributes. We describe how our operators capture a star join when M has one daughter table F1 that describes the join key field D of M. Table F1 can be viewed as a one-dimensional cube C1 with the join key field D as the dimension and all the description fields pulled in as elements. A restriction on a description attribute A of table F1 corresponds to a function application to the elements of C1. Restrictions on the join key attribute translate to restrictions on dimension D of C1. The join between M and F1 is achieved by associating the mother cube with the daughter cube on the key dimension D using the identity mapping function. The description of each key value is pulled in from the daughter cube into the mother cube via the f_elem function.

Expressing a dimension as a function of other dimensions: This functionality is basic in spreadsheets. We can create a new dimension D' expressed as a function f of another dimension D by first pushing D into the cube elements, then modifying the cube elements by applying function f, and finally pulling out the corresponding member of the cube element as the new dimension D'.

4.2 Example queries

This section illustrates how to express some of the queries of Example 2.2 using our operators. Assume we have a cube C with dimensions product, month, supplier and element sales.

For supplier Ace and for each product, give the fractional increase in the sales in January 1995 relative to the sales in January 1994.
Restrict supplier to Ace and dates to January 1994 or January 1995. Merge the date dimension using an f_elem that combines sales as (B - A)/A, where A is the sale in Jan 1994 and B is the sale in Jan 1995.

For each product give its market share in its category this month minus its market share in its category in October 1994.
Restrict date to October 1994 or the current month. Merge supplier to a single point using sum of sales as the f_elem function to get C1. Merge the product dimension to category using sum as the f_elem function to get in C2 the total sale for the two months of interest. Associate C1 and C2, mapping a category in C2 to each of its products in C1. The identity mapping is used for the month dimension. Function f_elem divides the element from C1 by the element from C2 to get the market share. For the resulting cube, merge dimension month to a single point using an f_elem function (A - B), where A is the market share for this month and B is the market share in October 1994.

For each product category, select total sales this month of the product that had highest sales in that category last month.
Restrict dimension month to last month. Merge supplier to a single point using sum of sales as the f_elem function. Push the product dimension, resulting in 2-tuple elements <sale, product>. Merge product to category using an f_elem function that retains an element if it has the maximum sales. Pull product into the category dimension (overriding the category dimension; this can be easily done using our basic operators). Let the resulting cube be C1. This cube has the highest sales value for each element for last month. Restrict C on dimension date to this month, merge supplier to a single point using sum of sales as the f_elem function, and associate it with C1 on the product dimension using an f_elem function that only outputs the element of C when it is the same as the corresponding element from C1 (otherwise it returns 0).

Select suppliers for which the total sale of every product increased in each of last 5 years.
Restrict to months of the last 6 years. Merge month to year. Merge years to a single point using an f_elem function that maps the six sales values to 1 if the sales values are increasing, 0 otherwise. Merge product to a point where the f_elem function is 1 if and only if all its arguments are 1.
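The first example query (fractional increase in sales for supplier Ace) and a roll-up from month to year can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the cube is a dict keyed by (product, month, supplier) tuples, months are encoded as "YYYY-MM" strings, the simplified `restrict` and `merge` helpers are hypothetical, and f_elem is assumed to receive the merged elements ordered by their original coordinates so that the January 1994 value precedes the January 1995 value.

```python
from collections import defaultdict

# Positions of the dimensions inside a coordinate tuple (illustrative layout).
PRODUCT, MONTH, SUPPLIER = 0, 1, 2

def restrict(cube, dim, pred):
    """Keep only the cells whose value on dimension `dim` satisfies `pred`."""
    return {coord: e for coord, e in cube.items() if pred(coord[dim])}

def merge(cube, f_merge, f_elem):
    """Merge dimensions and combine the elements that fall on the same point.

    f_merge maps a dimension position to its merging function; dimensions
    not listed use the identity. f_elem receives the list of elements mapped
    to one point, ordered by original coordinate (an assumption made here).
    A result of None is treated as the 0 element and dropped.
    """
    groups = defaultdict(list)
    for coord in sorted(cube):
        merged = tuple(f_merge.get(d, lambda v: v)(v) for d, v in enumerate(coord))
        groups[merged].append(cube[coord])
    result = {}
    for coord, elems in groups.items():
        e = f_elem(elems)
        if e is not None:
            result[coord] = e
    return result

def fractional_increase(elems):
    """(B - A) / A for A = earlier sale, B = later sale; 0 element otherwise."""
    if len(elems) != 2 or elems[0] == 0:
        return None
    a, b = elems
    return (b - a) / a

cube = {
    ("soap", "1994-01", "Ace"): 100,
    ("soap", "1994-06", "Ace"): 70,
    ("soap", "1995-01", "Ace"): 150,
    ("soap", "1995-01", "Best"): 999,
    ("pen", "1994-01", "Ace"): 40,
    ("pen", "1995-01", "Ace"): 30,
}

# Query 1: restrict supplier and date, then merge the date dimension to a point.
ace = restrict(cube, SUPPLIER, lambda s: s == "Ace")
jan = restrict(ace, MONTH, lambda m: m in {"1994-01", "1995-01"})
answer = merge(jan, {MONTH: lambda m: "*"}, fractional_increase)

# Roll-up: merge month to year, combining elements with SUM.
by_year = merge(cube, {MONTH: lambda m: m[:4]}, sum)
```

Here `answer` holds one cell per product, and `by_year` holds yearly totals per product and supplier. The same `merge` helper serves both cases; only the dimension merging function and f_elem change, which mirrors how roll-up is described as a special use of merge.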

5 Conclusions and Future Work

This paper introduced a data model and a set of algebraic operations that unify and extend the functionality provided by current multidimensional database products. As illustrated in Section 4.1, the proposed operators can be composed to build OLAP operators like roll-up, drill-down, star-join and many others. In addition, the model provides symmetric treatment to dimensions and measures. The model also provides support for multiple hierarchies along each dimension and support for adhoc aggregates. Absence of these features in current products results in expensive schema redesign when an unanticipated need arises for a new aggregation or aggregation along an attribute that was initially thought to be a measure.

The proposed operators have several desirable properties. They have well-defined semantics. They are minimal in the sense that none can be expressed in terms of others nor can any one be dropped without sacrificing functionality. Every operator is defined on cubes and produces as output a cube. That is, the operators are closed and can be freely composed and reordered. This allows the inefficient one-operation-at-a-time approach currently in vogue to be replaced by a query model and makes multidimensional queries amenable to optimization.

The proposed operators enable the logical separation of the frontend user interface from the backend that stores and executes queries. They thus provide an algebraic API that allows the interchange of frontends and backends. The operators are designed to be translated into SQL. Thus, they can be implemented on either a relational system or a specialized engine.

For future, on the modeling side, work is needed to incorporate duplicates and NULL values in our model. We believe that the duplicates can be handled by treating elements of the cube as pairs consisting of an arity and a tuple of values. The arity gives the number of occurrences of the corresponding combination of dimensional values. NULLs can be represented by allowing for a NULL value for each dimension. Details of these extensions and other possible alternatives require further investigation.

On the implementation side, there are interesting research problems for implementing our model on top of a relational system as well as within a specialized engine. Although each of the proposed operators can be translated into a SQL query, simply executing this translated SQL on a relational engine is likely to be quite inefficient. Corresponding to a multidimensional query composed of several of these operators, we will get a sequence of SQL queries that offers opportunity for multi-query optimization. It needs to be investigated whether the known techniques (e.g. [SG90]) will suffice or do we need to develop new techniques. Similarly, there is opportunity for research in storage and access structures and materialized views.

Acknowledgments: We wish to thank Mike Carey, Piyush Goel, Bala Iyer, and Eugene Shekita for stimulating discussions and probing questions.

References

[AAD+96] S. Agarwal, R. Agrawal, P.M. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. of the 22nd Int'l Conference on Very Large Databases, pages 506-521, Mumbai (Bombay), India, September 1996.

[AGS96] Rakesh Agrawal, Ashish Gupta, and Sunita Sarawagi. Modeling multidimensional databases. Research Report, IBM Almaden Research Center, San Jose, California, 1996. Available from http://www.almaden.ibm.com/cs/quest.

[Arb] Arbor Software Corporation, Sunnyvale, CA. Multidimensional Analysis: Converting Corporate Data into Strategic Information. http://www.arborsoft.com.

[CCS93] E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computerworld, 27(30), July 1993.

[CL94] R. Cicchetti and L. Lakhal. Matrix relation for statistical database management. In Proc. of the Fourth Int'l Conference on Extending Database Technology (EDBT), March 1994.

[Cod70] E. F. Codd. A relational model for large shared data banks. Comm. ACM, 13(6):377-387, 1970.

[Cod93] E. F. Codd. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. Technical report, E. F. Codd and Associates, 1993.

[Col95] George Colliat. OLAP, relational, and multidimensional database systems. Technical report, Arbor Software Corporation, Sunnyvale, CA, 1995.

[Eri95] Christopher G. Erickson. Multidimensionalism and the data warehouse. In The Data Warehousing Conference, Orlando, Florida, February 1995.

[Fin95] Richard Finkelstein. MDD: Database reaches the next dimension. Database Programming and Design, pages 27-38, April 1995.

[GBLP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tabs and sub-totals. In Proc. of the 12th Int'l Conference on Data Engineering, pages 152-159, 1996.

[GHRU97] Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. Index selection for OLAP. In Proc. of the 13th Int'l Conference on Data Engineering, Birmingham, U.K., April 1997.

[GLS96] M. Gyssens, L.V.S. Lakshmanan, and I.N. Subramanian. Tables as a paradigm for querying and restructuring. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 1996.

[Gut94] R. H. Güting. An introduction to spatial database systems. VLDB Journal, 3(4):357-399, 1994.

[HRU96] V. Harinarayan, A. Rajaraman, and J.D. Ullman. Implementing data cubes efficiently. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1996.

[IRI] IRI Software, Information Resources Inc., Waltham, MA. OLAP: Turning Corporate Data into Business Intelligence. http://www.infores.com.

[JS96] T. Johnson and D. Shasha. Hierarchically split cube forests for decision support: description and tuned design, 1996. Working Paper.

[KS94] R. Kimball and K. Strehlo. What's wrong with SQL. Datamation, June 1994.

[MERS92] L. Meo-Evoli, F.L. Ricci, and A. Shoshani. On the semantic completeness of macro-data operators for statistical aggregations. In Proceedings of the Sixth International Working Conference on Scientific and Statistical Database Management, 1992.

[Mic] Microstrategy Inc., Vienna, VA 22182. True Relational OLAP. http://www.microstrategy.com.

[Mic92] Z. Michalewicz. Statistical and Scientific Databases. Ellis Horwood, 1992.

[OLA96] The OLAP Council. MD-API the OLAP Application Program Interface Version 0.5 Specification, September 1996.

[Rad95] Neil Raden. Data, data everywhere. Information Week, pages 60-65, October 30 1995.

[SDNR96] A. Shukla, P.M. Deshpande, J.F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. of the 22nd Int'l Conference on Very Large Databases, pages 522-531, Mumbai (Bombay), India, September 1996.

[SG90] T. Sellis and S. Ghosh. On the multiple-query optimization problem. IEEE Transactions on Knowledge and Data Engineering, 2(2):262-266, 1990.

[Sho82] A. Shoshani. Statistical databases: Characteristics, problems and some solutions. In Proceedings of the Eighth International Conference on Very Large Databases (VLDB), pages 208-213, Mexico City, Mexico, September 1982.

[SR96] B. Salzberg and A. Reuter. Indexing for aggregation, 1996. Working Paper.

[STL89] J. Srivastava, J.S.E. Tan, and V.Y. Lum. TBSAM: An access method for efficient processing of statistical queries. IEEE Transactions on Knowledge and Data Engineering, 1(4), 1989.

[TCG+93] A.U. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R. Snodgrass. Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings, 1993.