Understanding the Difference Between Column-Stores and OLAP Data Cubes

by smadden on July 7th, 2008

in big data

Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP). These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions. For example, in a sales database, one might wish to find the sales of technology products by month and store. The SQL query to do this would look like:
SELECT month, store, COUNT(*)
FROM sales, products
WHERE productType = 'technology'
AND products.id = sales.productID
GROUP BY month, store
In this post, we study how column-stores and data cubes would evaluate this query on a sample database.
Column Store Analysis
In column-stores, this query would be answered by scanning the productType column of the products table to find the ids that have type technology. These ids would then be used to filter the productID column of the sales table to find positions of records with the appropriate product type. Finally, these positions would be used to select data from the month and store columns for input into the GROUP BY operator. Unlike a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.
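The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the "grocery" product type, and the function name are invented here, and a real column-store would operate on compressed on-disk columns rather than in-memory lists.

```python
from collections import Counter

# Toy data invented for illustration; each table is stored as one list per column.
products = {
    "id": [1, 2, 3, 4, 5],
    "productType": ["grocery", "grocery", "technology", "technology", "technology"],
}
sales = {
    "productID": [1, 1, 2, 3, 3, 4, 5],
    "month": ["Jan", "Mar", "Apr", "Feb", "Jun", "Feb", "Oct"],
    "store": [1, 1, 1, 2, 2, 3, 3],
    # A real sales table would carry many more columns that this query never reads.
}


def technology_sales_by_month_and_store(products, sales):
    """Hypothetical helper mirroring the scan / filter / group steps."""
    # Step 1: scan only the productType column to collect matching product ids.
    tech_ids = {
        pid
        for pid, ptype in zip(products["id"], products["productType"])
        if ptype == "technology"
    }
    # Step 2: scan only sales.productID and keep the positions that qualify.
    positions = [i for i, pid in enumerate(sales["productID"]) if pid in tech_ids]
    # Step 3: fetch month and store at those positions and count per group.
    return Counter((sales["month"][i], sales["store"][i]) for i in positions)


result = technology_sales_by_month_and_store(products, sales)
for (month, store), count in sorted(result.items()):
    print(month, store, count)
```

Only three columns of the sales table are ever touched, which is the point of the comparison with a row-store.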
Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples. For example, notice that the sales table is sorted on productID, then month, then storeID. Here, all of the records for a given productID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinct productID. Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space. Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID, month) pair, storeID is sorted. For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the productID column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in 1,000 x 12 x 10 = 120,000 records. This compression means that the amount of data read from disk is less than 5% of its uncompressed size.
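As a rough sketch of the two ideas in this paragraph, the Python below run-length encodes a sorted column, uses binary search to find the range of rows for one productID, and then redoes the arithmetic for the 1,000,000-row example. The tiny column and the function names are invented for illustration, and the ratio assumes each run counts as one stored record compared against three uncompressed columns of 1,000,000 values each.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


def run_length_encode(column):
    """Collapse a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rows_for_product(sorted_product_ids, product_id):
    """Return the half-open [start, end) row range for one productID."""
    start = bisect_left(sorted_product_ids, product_id)
    end = bisect_right(sorted_product_ids, product_id)
    return start, end


# Tiny sorted column, invented for illustration.
product_id_column = [1, 1, 1, 2, 2, 3, 3, 3, 3]
encoded = run_length_encode(product_id_column)
print(encoded)                                  # [(1, 3), (2, 2), (3, 4)]
print(rows_for_product(product_id_column, 2))   # (3, 5)

# Run counts for the example in the text (uniform distribution assumed).
products, months, stores, rows = 1_000, 12, 10, 1_000_000
runs = {
    "productID": products,
    "month": products * months,
    "storeID": products * months * stores,
}
total_runs = sum(runs.values())      # 133,000 runs
uncompressed_values = 3 * rows       # 3,000,000 values
ratio = total_runs / uncompressed_values
print(runs, f"{ratio:.1%}")
```

Under those assumptions the three columns shrink to 133,000 runs, roughly 4.4% of the 3,000,000 original values, which is consistent with the "less than 5%" figure.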
Data Cube Analysis
Data cube-based solutions (sometimes referred to as MOLAP systems, for multidimensional online analytical processing) are represented by commercial products such as EssBase. They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data. A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above. If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, each cell contains the count of the number of records with a given (productID, month, storeID) value. For example, there is one record with storeID=1, productID=2, and month=April. Furthermore, the cube includes roll-up cells that summarize the values of the cells in the same row, column, or stack (x,y position). The "sum" fields indicate the values of the COUNT rolled up on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months. Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products; this is indicated by their dark shading in the above figure). Then, the system reads the pre-aggregated values from sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in February and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.
The advantages of a data cube should be clear: it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries. It can be used to efficiently compute a hierarchy of aggregates; for example, the sum columns in the above cube make it very fast to compute the number of sales in a given month across all stores, or the number of sales of a particular product across the entire year in a given store. Because the data is stored in an array structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:
Sparsity: Looking at the above cube, most of the cells are empty. This is not simply an artifact of our sample data set being small; the number of cells in a cube is the product of the cardinalities of the dimensions in the cube. Our 3D cube with 10 stores and 1,000 products would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion! Such high dimensionality cubes cannot be stored without compression. Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store them as a table with the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!
Inflexible, limited ad-hoc query support: Data cubes work great when a cube aggregated on the dimensions of interest and using the desired aggregation functions is available. Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on customerID in addition to the other attributes. If no cube is available, the user has no choice but to fall back to queries on an underlying relational system. Furthermore, if the user wants to drill down into the underlying data, asking, for example, who was the customer who bought a technology product at store 2 in February, the cube cannot be used. (One could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions.) To deal with these limitations, some cube systems support what is called HOLAP, or hybrid online analytical processing, where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run only as fast as whatever relational system executes them.
Long load times: Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database). Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.
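To make the roll-up cells and the sparsity arithmetic concrete, here is a minimal Python sketch. The rows are invented for illustration (chosen to agree with the counts quoted in the text, but not taken from the original figure), the "sum" marker and function names are assumptions, and the cube is held in a dictionary of non-empty cells rather than the dense array a MOLAP product would use.

```python
from collections import defaultdict
from math import prod

ALL = "sum"  # marker for a rolled-up dimension (the name is an assumption)


def build_cube(rows):
    """Count rows per (productID, month, storeID) cell and per roll-up cell."""
    cube = defaultdict(int)
    for product, month, store in rows:
        for p in (product, ALL):
            for m in (month, ALL):
                for s in (store, ALL):
                    cube[(p, m, s)] += 1
    return cube


def technology_counts(cube, tech_products, months, stores):
    """Answer the example query by reading cells for the technology products."""
    result = {}
    for m in months:
        for s in stores:
            n = sum(cube.get((p, m, s), 0) for p in tech_products)
            if n:
                result[(m, s)] = n
    return result


# Toy rows (productID, month, storeID), invented for illustration.
rows = [
    (1, "Jan", 1), (1, "Mar", 1), (2, "Apr", 1),
    (3, "Feb", 2), (3, "Jun", 2), (4, "Feb", 3), (5, "Oct", 3),
]
cube = build_cube(rows)
print(cube[(2, "Apr", 1)])  # 1: one record for productID=2, April, storeID=1
print(cube[(1, ALL, 1)])    # 2: productID 1 in storeID 1 across all months

months = ["Jan", "Feb", "Mar", "Apr", "Jun", "Oct"]
answer = technology_counts(cube, tech_products=[3, 4, 5], months=months, stores=[1, 2, 3])
print(answer)

# Dense cell counts from the sparsity discussion.
cells_3d = prod([1_000, 12, 10])   # products x months x stores
cells_4d = cells_3d * 10_000       # adding customerID with 10,000 values
print(cells_3d, cells_4d)
```

With only seven rows, a dense 1,000 x 12 x 10 array would be almost entirely empty, which is why the sketch falls back to storing just the non-empty cells, the same trade-off the sparsity item describes.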

Summary and Discussion
Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed. They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.
In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!). However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column-store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column-store.
Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system. That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world. Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.