OLAP - SQL

Much of the OLAP reporting capability embedded in Oracle SQL goes unused. People turn to expensive OLAP reporting tools on the market, even for simple reporting needs. This article outlines some common OLAP reporting needs and shows how to meet them by using the enhanced aggregation features of Oracle SQL. The article is divided into two sections. The first introduces the GROUP BY extensions of SQL, and the second uses them to generate some typical reports. A section at the end introduces the common OLAP terminology.

The enhanced SQL aggregation features are available across all flavors of Oracle, including Oracle Standard Edition One. It is worth mentioning here that Oracle OLAP, the special OLAP package of Oracle, is not available with Oracle Standard Edition or Standard Edition One. The enhanced aggregation features discussed here have been tested on Oracle 9i and Oracle 10g.

Advanced Aggregation Extensions of GROUP BY

GROUPING SETS clause, GROUPING function and GROUPING_ID function

The fundamental concept behind the enhanced aggregation features of Oracle is that of GROUPING SETS. All other aggregation features can be expressed in terms of it. Along with the GROUPING SETS clause come the functions GROUPING, GROUPING_ID and GROUP_ID.

The GROUPING SETS clause in GROUP BY allows us to specify more than one grouping in the same result set. Every GROUPING SETS query can be logically expressed as several GROUP BY queries combined with UNION ALL. Table-1 shows several such equivalent statements, which help in forming an idea of the GROUPING SETS clause. An empty set ( ) in the GROUPING SETS clause calculates the overall aggregate.

Table 1 - GROUPING SETS queries and the equivalent GROUP BY queries
Set A - Aggregate query with GROUPING SETS
Set B - Equivalent aggregate query with GROUP BY

A1. SELECT a, b, SUM(c) FROM tab1 GROUP BY GROUPING SETS ( (a, b) )

B1. SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b

A2. SELECT a, b, SUM(c) FROM tab1 GROUP BY GROUPING SETS ( (a, b), a )

B2. SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
    UNION ALL
    SELECT a, null, SUM(c) FROM tab1 GROUP BY a

A3. SELECT a, b, SUM(c) FROM tab1 GROUP BY GROUPING SETS ( a, b )

B3. SELECT a, null, SUM(c) FROM tab1 GROUP BY a
    UNION ALL
    SELECT null, b, SUM(c) FROM tab1 GROUP BY b

A4. SELECT a, b, SUM(c) FROM tab1 GROUP BY GROUPING SETS ( (a, b), a, b, ( ) )

B4. SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
    UNION ALL
    SELECT a, null, SUM(c) FROM tab1 GROUP BY a
    UNION ALL
    SELECT null, b, SUM(c) FROM tab1 GROUP BY b
    UNION ALL
    SELECT null, null, SUM(c) FROM tab1

Example A4 of Table-1 is a superset of all the cases above and also includes the overall aggregate through the use of ( ). We will see later that this result is the same as that of CUBE (a, b). The first 3 columns of Table-2 show the result of a query of this type.

The GROUPING SETS clause can compute all the required aggregates in a single scan, so it generally performs better than its logical equivalent of several GROUP BY queries combined with UNION ALL.

The general syntax of a SQL statement with GROUPING SETS is

SELECT <grouping_columns>, <aggregate_functions>
FROM <table_list>
WHERE <where_condition>
GROUP BY GROUPING SETS (<column_set_1>, ... , <column_set_N>)

Each "column set" can contain none, one or more of the "grouping columns" from the SELECT list. However, every grouping column in the SELECT list must be present in at least one of the column sets. In mathematical terms, <column_set_1> UNION ... UNION <column_set_N> should be equal to <grouping_columns>.

So the following two queries will return an error:

(1) SELECT a, b, c, SUM(d) FROM tab1 GROUP BY GROUPING SETS ( (a, b), b )
    --- Reason: (a, b) U (b) is not equal to (a, b, c)

(2) SELECT a, b, SUM(c) FROM tab1 GROUP BY GROUPING SETS ( a, ( ) )
    --- Reason: (a) U ( ) is not equal to (a, b)

Table 2 - A GROUPING SETS query with GROUPING and GROUPING_ID functions on EMP

SELECT deptno, job, SUM(sal),
       GROUPING(deptno) GDNO,
       GROUPING(job) GJNO,
       GROUPING_ID(deptno, job) GID_DJ,
       GROUPING_ID(job, deptno) GID_JD
FROM EMP
GROUP BY GROUPING SETS ( (deptno, job), deptno, job, ( ) )

    DEPTNO JOB         SUM(SAL)       GDNO       GJNO     GID_DJ     GID_JD
---------- --------- ---------- ---------- ---------- ---------- ----------
        10 CLERK           1300          0          0          0          0
        10 MANAGER         2450          0          0          0          0
        10 PRESIDENT       5000          0          0          0          0
        20 CLERK           1900          0          0          0          0
        20 ANALYST         6000          0          0          0          0
        20 MANAGER         2975          0          0          0          0
        30 CLERK            950          0          0          0          0
        30 MANAGER         2850          0          0          0          0
        30 SALESMAN        5600          0          0          0          0
        10                 8750          0          1          1          2
        20                10875          0          1          1          2
        30                 9400          0          1          1          2
           ANALYST         6000          1          0          2          1
           CLERK           4150          1          0          2          1
           MANAGER         8275          1          0          2          1
           PRESIDENT       5000          1          0          2          1
           SALESMAN        5600          1          0          2          1
                          29025          1          1          3          3

18 rows selected.

GROUPING Function and GROUPING_ID Function

From Table-2 we see that when a row holds an aggregate over a column, the value of that column is NULL. This is ambiguous if the column itself contains NULL values. There needs to be a way to distinguish a NULL that stands for an aggregate from a NULL that is an actual column value. The GROUPING function is the solution. It returns the flag "1" for a row in the result set if the column has been aggregated in that row; otherwise it returns "0". The GROUPING function can be used to replace the NULL that appears in a column at the aggregation level with something meaningful, like Total.

The GROUPING function has the general syntax GROUPING(<column_expression>). It takes only a single column expression as its argument, and that expression must also appear in the GROUP BY clause. It is typically used in the SELECT list, but it can also appear in the HAVING and ORDER BY clauses.

GROUPING_ID takes a list of columns. It applies the GROUPING function to each column in its argument list and composes a bit vector from the resulting "0" and "1" values, with the first column as the most significant bit. It returns the decimal equivalent of the bit vector. The columns GID_DJ and GID_JD in Table-2 show the use of the GROUPING_ID function and also show how interchanging the order of the columns inside GROUPING_ID changes the result.

CUBE

This is the most general aggregation clause. The general syntax is CUBE(<column_list>). It is used only within GROUP BY. CUBE creates subtotals for all possible combinations of the columns in its argument. Once we compute a CUBE on a set of dimensions, we can answer all possible aggregation questions on those dimensions. Table-3 shows how a cube is built.
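Returning to the GROUPING function: the idea of substituting a label for the NULL in aggregate rows can be sketched as follows. This assumes the standard EMP demo table queried in Table-2; the labels 'All depts' and 'All jobs' and the column aliases are hypothetical names chosen for illustration.

```sql
-- Replace the NULLs that mark aggregate rows with readable labels.
-- GROUPING(col) = 1 means the row is aggregated over col, so the NULL
-- is a subtotal marker and not a stored value.
SELECT DECODE(GROUPING(deptno), 1, 'All depts', TO_CHAR(deptno)) AS dept_label,
       DECODE(GROUPING(job),    1, 'All jobs',  job)             AS job_label,
       SUM(sal)                                                  AS total_sal
FROM   emp
GROUP  BY GROUPING SETS ( (deptno, job), deptno, job, ( ) )
-- Detail rows first, then the subtotals, then the grand total.
ORDER  BY GROUPING_ID(deptno, job), deptno, job;
```

Because the label is driven by GROUPING and not by a test for NULL, a genuine NULL in deptno or job would still be shown as a NULL value and not as a total.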

It might be also worth mentioning here that GROUP BY CUBE( a, b, c) is equivalent to GROUP BY GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).
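The equivalence can be checked on data. The sketch below compares CUBE with the explicit GROUPING SETS list on two columns of the EMP demo table (an assumption; any table with two grouping columns would do). If the two forms are equivalent, the query returns no rows.

```sql
-- Rows produced by CUBE but not by the explicit GROUPING SETS list ...
(SELECT deptno, job, GROUPING_ID(deptno, job) AS gid, SUM(sal) AS total_sal
 FROM   emp
 GROUP  BY CUBE (deptno, job)
 MINUS
 SELECT deptno, job, GROUPING_ID(deptno, job), SUM(sal)
 FROM   emp
 GROUP  BY GROUPING SETS ( (deptno, job), (deptno), (job), ( ) ))
UNION ALL
-- ... followed by rows produced by GROUPING SETS but not by CUBE.
(SELECT deptno, job, GROUPING_ID(deptno, job), SUM(sal)
 FROM   emp
 GROUP  BY GROUPING SETS ( (deptno, job), (deptno), (job), ( ) )
 MINUS
 SELECT deptno, job, GROUPING_ID(deptno, job), SUM(sal)
 FROM   emp
 GROUP  BY CUBE (deptno, job));
```

Including GROUPING_ID in the comparison keeps an aggregate row from being confused with a detail row that happens to hold a NULL.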

ROLLUP
The ROLLUP clause is used with GROUP BY to compute aggregates at the hierarchy levels of a dimension. ROLLUP(a, b, c) assumes that the hierarchy is "a" drilling down to "b" drilling down to "c". ROLLUP(a, b, c) is equivalent to GROUPING SETS ( (a, b, c), (a, b), (a), ( ) ). The general syntax of ROLLUP is ROLLUP(<column_list>).

Composite Columns

A composite column is a collection of columns that is treated as a single unit when the aggregates are computed. Composite columns can be used in CUBE or ROLLUP. The following examples show composite columns in CUBE and ROLLUP and the equivalent GROUPING SETS.

CUBE( (a, b), c ) is equivalent to GROUPING SETS ( (a, b, c), (a, b), c, ( ) )
ROLLUP( a, (b, c) ) is equivalent to GROUPING SETS ( (a, b, c), (a), ( ) )

Partial GROUPING SETS, CUBE or ROLLUP

If a column appears in GROUP BY but outside the aggregation clauses discussed above, it can be thought of as the first column of every grouping set in the resulting GROUPING SETS equivalent. The following examples make this clear.

GROUP BY a, CUBE(b, c) is equivalent to GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a, c), (a) )
GROUP BY a, ROLLUP(b, c) is equivalent to GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a) )

OLAP Reporting using enhanced aggregation features

While the queries on the EMP table illustrate GROUPING SETS, they would be poor examples for the next sections, because the EMP tables are not in a star schema format. Please run the script (Script A) to get a simple star schema. The tables of the sample schema are:

Product(prdid, prd_name, prd_family)
TimeByDay(datekey, td_month, td_quarter, td_year)
Location(loc_id, city, state, country)
Customer(cust_id, cust_name, cust_type)
Sales(sales_id, cust_id, loc_id, prdid, sales_date, amount)
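For readers without Script A at hand, the table list above could be realized roughly as follows. This is only a sketch: the data types, lengths and constraints are assumptions and are not taken from the author's script.

```sql
-- Hypothetical DDL for the sample star schema (types are assumed).
-- Dimension attributes are declared NOT NULL so that a NULL in an
-- aggregated result can only mean "subtotal".
CREATE TABLE product (
  prdid       NUMBER        PRIMARY KEY,
  prd_name    VARCHAR2(30)  NOT NULL,
  prd_family  VARCHAR2(30)  NOT NULL
);

CREATE TABLE timebyday (
  datekey     DATE          PRIMARY KEY,
  td_month    VARCHAR2(10)  NOT NULL,
  td_quarter  VARCHAR2(2)   NOT NULL,   -- assumed to hold 'Q1' .. 'Q4'
  td_year     NUMBER(4)     NOT NULL
);

CREATE TABLE location (
  loc_id      NUMBER        PRIMARY KEY,
  city        VARCHAR2(30)  NOT NULL,
  state       VARCHAR2(30)  NOT NULL,
  country     VARCHAR2(30)  NOT NULL
);

CREATE TABLE customer (
  cust_id     NUMBER        PRIMARY KEY,
  cust_name   VARCHAR2(30)  NOT NULL,
  cust_type   VARCHAR2(30)  NOT NULL
);

-- The fact table references each dimension by its key.
CREATE TABLE sales (
  sales_id    NUMBER        PRIMARY KEY,
  cust_id     NUMBER        NOT NULL REFERENCES customer (cust_id),
  loc_id      NUMBER        NOT NULL REFERENCES location (loc_id),
  prdid       NUMBER        NOT NULL REFERENCES product (prdid),
  sales_date  DATE          NOT NULL REFERENCES timebyday (datekey),
  amount      NUMBER(12,2)  NOT NULL
);
```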

The schema is about a fictitious art trader that supplies remakes of statues of famous historical figures (like ALEXANDER, BUDDHA, etc.) or landscape paintings of places (like SIKKIM, etc.). It sells to museums, resellers or individuals. The dimensions are Product, TimeByDay, Location and Customer. The fact is Sales. The hierarchies are

(1) Product_Name (prd_name) -> Product Family (prd_family)
(2) Date (datekey) -> Month (td_month) -> Quarter (td_quarter) -> Year (td_year)
(3) City -> State -> Country
(4) Customer_Name (cust_name) -> Customer_Type (cust_type)

The two approaches used for generating OLAP reports are as follows:

(1) Build the most general possible CUBE from the dimensions in advance, or
(2) Use on-the-fly aggregation queries to get a real-time report.

Using generalized pre-built CUBE for CUBE, ROLLUP, Drill Down and Slicing Queries

This approach consists of building a table or a materialized view with the CUBE of the dimensions. Table-3 shows the SQL to build such a cube. The generalized CUBE keeps all meaningful aggregations pre-computed. We only need to query some of the rows of the CUBE to get the desired values.

Since the CUBE stores all possible combinations of the dimensions, the number of records in the cube itself might be large. Intelligent use of composite columns can help a great deal here. Note the use of the composite column (city, state) in the CUBE. This is because each state has only one city with an office of our demo organization.

The GROUPING_ID function helps to achieve the effect of a ROLLUP within the CUBE. For example, take the combination (cust_name, cust_type). It is meaningless to aggregate over customer type while still grouping by customer name, which is the bit vector (0,1). So we keep only the bit vectors (1,1), (1,0) and (0,0), that is, GROUPING_ID values of 3, 2 and 0 on the customer dimension.
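To see where the values 0, 2 and 3 come from, the customer hierarchy can be rolled up on its own. A minimal sketch against the sample schema; the alias gid_cust is illustrative.

```sql
-- ROLLUP over the customer hierarchy yields exactly the three levels
-- that the cube keeps for this dimension:
--   gid_cust = 0  (cust_type, cust_name)  detail
--   gid_cust = 2  (cust_type)             cust_name aggregated
--   gid_cust = 3  ( )                     both aggregated
SELECT c.cust_type,
       c.cust_name,
       GROUPING_ID(c.cust_name, c.cust_type) AS gid_cust,
       SUM(s.amount)                         AS amount
FROM   sales s, customer c
WHERE  s.cust_id = c.cust_id
GROUP  BY ROLLUP (c.cust_type, c.cust_name)
ORDER  BY gid_cust, c.cust_type, c.cust_name;
```

The value 1, bit vector (0,1), never appears because ROLLUP does not aggregate cust_type while still grouping by cust_name.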
Table 3 - Building a cube

CREATE TABLE sales_cube AS
SELECT prd_name, prd_family,
       datekey, td_month, td_quarter, td_year,
       cust_name, cust_type,
       city, state, country,
       GROUPING_ID (prd_name, prd_family) GID_product,
       GROUPING_ID (datekey, td_month, td_quarter, td_year) GID_DATE,
       GROUPING_ID (cust_name, cust_type) GID_CUST,
       GROUPING_ID (city, state, country) GID_LOC,
       sum(amount) amount
FROM sales, product, timebyday, location, customer
WHERE sales.cust_id = customer.cust_id
  and sales.loc_id = location.loc_id
  and sales.sales_date = timebyday.datekey
  and sales.prdid = product.prdid
GROUP BY CUBE ( prd_name, prd_family,
                datekey, td_month, td_quarter, td_year,
                cust_name, cust_type,
                (city, state), country )
HAVING GROUPING_ID (prd_name, prd_family) IN (0, 2, 3)
   and GROUPING_ID (datekey, td_month, td_quarter, td_year) IN (0, 8, 12, 14, 15)
   and GROUPING_ID (cust_name, cust_type) IN (0, 2, 3)
   and GROUPING_ID (city, state, country) IN (0, 6, 7);

The next table (Table-4) shows a typical crosstab query of sales for Product and Customer. It shows the query and also how to generate a crosstab report out of it by using the function CROSSTAB (Script-B). The later examples show only the query and the crosstab report and skip the PL/SQL portion.

The WHERE condition is determined by the bit vectors. We need the following four combinations:

o Details of product and details of customer - Both the Product and Customer dimensions are at full detail, so GID_Product = bit vector (0,0) = 0. The same holds for GID_Cust.
o Summary of product and details of customer - Product is summarized fully, so GID_Product = bit vector (1,1) = 3.
o Details of product and summary of customer - Customer is summarized fully, so GID_Cust = bit vector (1,1) = 3.
o Summary of product and summary of customer - Both Customer and Product are summarized fully.

Along with any of the above 4 conditions we need the full summary of the rest of the dimensions. So GID_date = bit vector (1,1,1,1) = 15 and GID_Loc = bit vector (1,1,1) = 7.
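The bit-vector arithmetic used above can be checked with Oracle's BIN_TO_NUM function, which converts a list of 0/1 values to a number. A small sketch; the column aliases are illustrative only.

```sql
-- Each argument list is a GROUPING bit vector, most significant bit first.
SELECT BIN_TO_NUM(0, 0)       AS product_detail,   -- 0
       BIN_TO_NUM(1, 1)       AS product_total,    -- 3
       BIN_TO_NUM(1, 1, 1, 1) AS date_total,       -- 15
       BIN_TO_NUM(1, 1, 1)    AS loc_total         -- 7
FROM   dual;
```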

Table-4 Crosstab Query on Product and Customer (Query, Generation Routine and Result)

/*********** The Query ***********/
SELECT prd_name, cust_name, amount
FROM sales_cube
WHERE ((GID_Product = 0 and GID_Cust = 0) or
       (GID_Product = 0 and GID_Cust = 3) or
       (GID_Product = 3 and GID_Cust = 0) or
       (GID_Product = 3 and GID_Cust = 3))
  and GID_date = 15
  and GID_LOC = 7;

/*********** Generating the crosstab report ********/
set serveroutput on
set lines 120
var tempstr varchar2(500)

exec :tempstr := ''||'SELECT cust_name, prd_name, amount'||chr(10)||-
'FROM sales_cube'||chr(10)||-
'WHERE ((GID_Product = 0 and GID_Cust = 0) or'||chr(10)||-
'       (GID_Product = 0 and GID_Cust = 3) or'||chr(10)||-
'       (GID_Product = 3 and GID_Cust = 0) or'||chr(10)||-
'       (GID_Product = 3 and GID_Cust = 3)) and'||chr(10)||-
'      GID_date = 15 and'||chr(10)||-
'      GID_LOC = 7';

exec crosstab(:tempstr);

*Customers *  *---------------------- Products -----------------------------*
************  ALEXANDER  BUDDHA  CHANDRAGUPTA  PURI BEACH  SIKKIM  --Total--
ART HOUSE             0       0             0         500     750       1250
BARKER             5100       0             0           0       0       5100
JONES                 0       0             0        2050    3500       5550
MAHAJATI              0       0             0        1000       0       1000
RATAN                 0    5000             0           0    4000       9000
SMITH              9500    9000           900           0       0      19400
STONEWORK           850     800          6000           0       0       7650
--Total--         15450   14800          6900        3550    8250      48950
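Where the CROSSTAB routine of Script-B is not available, a similar layout can be produced in plain SQL by pivoting the cube rows with DECODE. This is a sketch of an alternative technique, not the author's CROSSTAB function; the product names are hard-coded from the report above and the column aliases are illustrative.

```sql
-- One output row per customer plus a total row; one column per product.
SELECT DECODE(GID_Cust, 3, '--Total--', cust_name)      AS customer,
       SUM(DECODE(prd_name, 'ALEXANDER',    amount, 0)) AS alexander,
       SUM(DECODE(prd_name, 'BUDDHA',       amount, 0)) AS buddha,
       SUM(DECODE(prd_name, 'CHANDRAGUPTA', amount, 0)) AS chandragupta,
       SUM(DECODE(prd_name, 'PURI BEACH',   amount, 0)) AS puri_beach,
       SUM(DECODE(prd_name, 'SIKKIM',       amount, 0)) AS sikkim,
       -- Cube rows with GID_Product = 3 already hold the per-customer total.
       SUM(DECODE(GID_Product, 3, amount, 0))           AS row_total
FROM   sales_cube
WHERE  GID_Product IN (0, 3)
AND    GID_Cust    IN (0, 3)
AND    GID_date = 15
AND    GID_LOC  = 7
GROUP  BY GID_Cust, cust_name
ORDER  BY GID_Cust, cust_name;
```

The drawback compared with a generic routine is that the column list must be edited whenever a product is added.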

Table-5, Table-6 and Table-7 show a time-sales report and the drill-down to the quarters of year 2003. Slicing is achieved by including a WHERE condition in the query with the desired values of the dimensions. Drill-down is achieved by selecting the proper values of the GID_ type columns, that is, by deciding the proper GROUPING value of every dimension at the particular level of drill-down. Dicing is achieved by merely interchanging the first two columns of the SELECT.

Table-5 Year-Product Sales Report: Main (Query and Result)

SELECT prd_name, td_year, amount
FROM Sales_cube
WHERE ((GID_Product = 0 and GID_date = 14) or
       (GID_Product = 0 and GID_date = 15) or
       (GID_Product = 3 and GID_date = 14) or
       (GID_Product = 3 and GID_date = 15))
  and GID_Cust = 3
  and GID_Loc = 7;

************       2002       2003  --Total--
ALEXANDER          5100      10350      15450
BUDDHA             6800       8000      14800
CHANDRAGUPTA          0       6900       6900
PURI BEACH         3550          0       3550
SIKKIM                0       8250       8250
--Total--         15450      33500      48950

Table-6 Year-Product Sales Report: Drill Down to Quarters and Dicing Product and Time Dimensions (Query and Result)

SELECT td_year||td_quarter, prd_name, amount
FROM Sales_cube
WHERE ((GID_Product = 0 and GID_date = 12) or
       (GID_Product = 0 and GID_date = 15) or
       (GID_Product = 3 and GID_date = 12) or
       (GID_Product = 3 and GID_date = 15))
  and GID_Cust = 3
  and GID_Loc = 7;

************  ALEXANDER  BUDDHA  CHANDRAGUPTA  PURI BEACH  SIKKIM  --Total--
2002Q1                0    1000             0        2050       0       3050
2002Q2             5100       0             0         500       0       5600
2002Q4                0    5800             0        1000       0       6800
2003Q1            10350       0          6000           0       0      16350
2003Q2                0       0             0           0     750        750
2003Q3                0    8000             0           0    3500      11500
2003Q4                0       0           900           0    4000       4900
--Total--         15450   14800          6900        3550    8250      48950

Table-7 Year-Product Sales Report: Slice of year 2003, Quarter level drill down (Query and Result)

SELECT td_year||td_quarter, prd_name, amount
FROM Sales_cube
WHERE ((GID_Product = 0 and GID_date = 12) or
       (GID_Product = 0 and GID_date = 14) or
       (GID_Product = 3 and GID_date = 12) or
       (GID_Product = 3 and GID_date = 14))
  and GID_Cust = 3
  and GID_Loc = 7
  and td_year = 2003;

************  ALEXANDER  BUDDHA  CHANDRAGUPTA  SIKKIM  --Total--
2003              10350    8000          6900    8250      33500
2003Q1            10350       0          6000       0      16350
2003Q2                0       0             0     750        750
2003Q3                0    8000             0    3500      11500
2003Q4                0       0           900    4000       4900
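For comparison, the same 2003 slice could also be computed directly from the base tables instead of from sales_cube. The following is a minimal sketch, assuming the sample schema listed earlier and assuming that td_quarter holds values such as 'Q1'; the alias period is illustrative.

```sql
-- td_year is a plain GROUP BY column; the two ROLLUPs are concatenated, so
-- the groupings are (year, quarter, product), (year, quarter),
-- (year, product) and (year). No other aggregation level is computed.
SELECT t.td_year || t.td_quarter AS period,
       p.prd_name,
       SUM(s.amount)             AS amount
FROM   sales s, product p, timebyday t
WHERE  s.prdid      = p.prdid
AND    s.sales_date = t.datekey
AND    t.td_year    = 2003
GROUP  BY t.td_year, ROLLUP (t.td_quarter), ROLLUP (p.prd_name)
-- Year-level rows first, then the quarters; product totals last in each.
ORDER  BY GROUPING(t.td_quarter) DESC, t.td_quarter,
          GROUPING(p.prd_name), p.prd_name;
```

Rows where prd_name is NULL correspond to the --Total-- column of the report, and rows where td_quarter is NULL correspond to the year-level row.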

Using on-the-fly aggregation queries for CUBE, ROLLUP, Drill Down and Slicing

When on-the-fly aggregation queries are used, the cube is not pre-computed and we get a real-time summary. However, these queries are slower than queries on precomputed cubes. The features discussed here (CUBE, ROLLUP, composite columns) can be used to generate the required aggregation levels. An important thing to ensure when executing on-the-fly queries is that the query does not perform any useless aggregation. Proper use of the GROUPING functions is important.

Conclusion

Most OLAP tools provide several additional features beyond reporting. There are user-friendly drag-and-drop interfaces that make drill-down, rollup, slicing and dicing happen on a mouse click. Report generation and formatting are easier for someone who is not familiar with SQL. There are security features that restrict specific users from drilling down into specific sections of a cube or from viewing specific cubes. When the requirements are just a few canned OLAP reports, or when a simple custom GUI can be built to mask the SQL, the enhanced aggregation features can be really effective. A large portion of requirements do fall into this second category.

ROLLUP
The ROLLUP operation in the simple_grouping_clause groups the selected rows based on the values of the first n, n-1, n-2, ... 0 expressions in the GROUP BY specification, and returns a single row of summary for each group. You can use the ROLLUP operation to produce subtotal values by using it with the SUM function. When used with SUM, ROLLUP generates subtotals from the most detailed level to the grand total. Aggregate functions such as COUNT can be used to produce other kinds of superaggregates.

For example, given three expressions (n=3) in the ROLLUP clause of the simple_grouping_clause, the operation results in n+1 = 3+1 = 4 groupings. Rows grouped on the values of the first 'n' expressions are called regular rows, and the others are called superaggregate rows.

The following query uses the ROLLUP operation to show the sales amount product-wise and year-wise. To see the structure of the sales table, refer to the appendices.

SELECT prod, year, SUM(amt) FROM sales GROUP BY ROLLUP(prod, year);
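To tell the regular rows from the superaggregate rows in the result of that query, a GROUPING_ID column can be added. A small sketch on the same sales(prod, year, amt) table; the aliases total_amt and gid are illustrative.

```sql
-- With two expressions, ROLLUP produces n+1 = 3 groupings:
--   gid = 0  regular rows (prod, year)
--   gid = 1  subtotal per product (year aggregated)
--   gid = 3  grand total
SELECT prod,
       year,
       SUM(amt)                AS total_amt,
       GROUPING_ID(prod, year) AS gid
FROM   sales
GROUP  BY ROLLUP (prod, year)
ORDER  BY prod, year;
```

Since Oracle sorts NULLs last in ascending order by default, each product subtotal follows its detail rows and the grand total comes at the end.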

CUBE
The CUBE operation in the simple_grouping_clause groups the selected rows based on the values of all possible combinations of expressions in the specification, and returns a single row of summary information for each group. You can use the CUBE operation to produce cross-tabulation values.

For example, given three expressions (n=3) in the CUBE clause of the simple_grouping_clause, the operation results in 2^n = 2^3 = 8 groupings. Rows grouped on the values of 'n' expressions are called regular rows, and the rest are called superaggregate rows.

The following query uses the CUBE operation to show the sales amount product-wise and year-wise. To see the structure of the sales table, refer to the appendices.

SELECT prod, year, SUM(amt) FROM sales GROUP BY CUBE(prod, year);
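The 2^n count can be observed directly by counting the rows in each grouping. A minimal sketch on the same sales(prod, year, amt) table, where n = 2 and so up to 2^2 = 4 groupings are expected; the aliases gid and row_count are illustrative.

```sql
-- gid = 0  regular rows (prod, year)
-- gid = 1  subtotal per product (year aggregated)
-- gid = 2  subtotal per year (prod aggregated)
-- gid = 3  grand total
SELECT gid,
       COUNT(*) AS row_count
FROM  (SELECT GROUPING_ID(prod, year) AS gid
       FROM   sales
       GROUP  BY CUBE (prod, year))
GROUP  BY gid
ORDER  BY gid;
```

Replacing CUBE with ROLLUP in the inner query drops the gid = 2 line, which is the difference between the two operations for two expressions.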