
Data Warehouse Basics

WEEK 2
Review of Last Class
Roles of Information Technology
◦ 1. Automate clerical work
◦ 2. Decision support

OLAP vs. OLTP


◦ Different query characteristics
◦ Different performance requirements
◦ Different data modeling requirements
◦ OLAP combines data from many sources
High-level course outline
◦ Logical Database Design
◦ Query Processing
◦ Physical Database Design
◦ Data Mining
Outline of Today’s Class
Data integration
Basic OLAP queries
◦ Data cubes
◦ Slice and dice, drill down, roll up
◦ MOLAP vs. ROLAP
◦ SQL OLAP Extensions: ROLLUP, CUBE
Loading the Data Warehouse
Source Systems (OLTP) → Data Staging Area → Data Warehouse
◦ Data is periodically extracted
◦ Data is cleansed and transformed
◦ Users query the data warehouse
Terminology: ETL
ETL = Extraction, Transformation, & Load
Extraction: Get the data out of the source systems
Transformation: Convert the data into a useful
format for analysis
Load: Get the data into the data warehouse
(…and build indexes, materialized views, etc.)
We will return to this topic in a couple weeks.
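The three ETL steps can be sketched in a few lines. This is only an illustration under stated assumptions: the source rows, the `sales` table layout, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might come out of an OLTP source system.
source_rows = [
    {"state": "ca", "month": "July", "color": "red", "quantity": "2"},
    {"state": "OR", "month": "Jul", "color": "Red", "quantity": "1"},
    {"state": "wa", "month": "aug", "color": "BLUE", "quantity": "3"},
]

def extract(rows):
    """Extraction: get the data out of the source system (here, a list)."""
    return list(rows)

def transform(rows):
    """Transformation: convert every value into one consistent format."""
    cleaned = []
    for row in rows:
        cleaned.append((
            row["state"].upper(),              # "ca" -> "CA"
            row["month"][:3].capitalize(),     # "July" -> "Jul"
            row["color"].capitalize(),         # "BLUE" -> "Blue"
            int(row["quantity"]),              # "2" -> 2
        ))
    return cleaned

def load(conn, rows):
    """Load: insert into the warehouse table, then build an index."""
    conn.execute(
        "CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_sales_state_month ON sales (state, month)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(source_rows)))
loaded = conn.execute(
    "SELECT state, month, color, quantity FROM sales ORDER BY state"
).fetchall()
print(loaded)
```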
Data Integration is Hard
Data warehouses combine data from multiple sources
Data must be translated into a consistent format
Data integration represents ~80% of effort for a typical data
warehouse project!
Some reasons why it’s hard:
◦ Metadata is often poor or non-existent
◦ Data quality is often bad
◦ Missing or default values
◦ Multiple spellings of the same thing
(Cal vs. UC Berkeley vs. University of California)
◦ Inconsistent semantics
◦ What is an airline passenger?
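Two of the problems above (multiple spellings, missing or default values) can be illustrated with a small cleaning function. The lookup table, the list of "missing" markers, and the choice of canonical name are all assumptions made for this sketch.

```python
# Hypothetical lookup table: known spellings -> one canonical name.
CANONICAL = {
    "cal": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california": "UC Berkeley",
}

# Hypothetical placeholder strings that really mean "no value".
MISSING_MARKERS = {"", "n/a", "unknown"}

def clean_school(raw):
    """Return the canonical name, or None when the value is missing."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if key in MISSING_MARKERS:
        return None
    # Unknown spellings pass through unchanged (only whitespace is trimmed).
    return CANONICAL.get(key, raw.strip())

raw_values = ["Cal", "UC Berkeley", " University of California ", "N/A", "Stanford"]
cleaned = [clean_school(value) for value in raw_values]
print(cleaned)
```

Inconsistent semantics (what counts as an airline passenger?) cannot be fixed by a lookup table like this; it needs an agreed definition first.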
Federated Databases
An alternative to data warehouses
Data warehouse
◦ Create a copy of all the data
◦ Execute queries against the copy

Federated database
◦ Pull data from source systems as needed to answer queries

“lazy” vs. “eager” data integration

[Figure: Data Warehouse: Source Systems → Extraction → Warehouse; a Query is answered from the Warehouse. Federated Database: a Query goes to the Mediator, which sends Rewritten Queries to the Source Systems and returns the Answer.]
Warehouses vs. Federation
Advantages of federated databases:
◦ No redundant copying of data
◦ Queries see “real-time” view of evolving data
◦ More flexible security policy
Disadvantages of federated databases:
◦ Analysis queries place extra load on transactional systems
◦ Query optimization is hard to do well
◦ Historical data may not be available
◦ Complex “wrappers” needed to mediate between analysis server and
source systems
Data warehouses are much more common in practice
◦ Better performance
◦ Lower complexity
◦ Slightly out-of-date data is acceptable
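The "real-time view" versus "slightly out-of-date" trade-off can be seen in a toy model. The classes and data below are hypothetical and ignore everything that makes real federation hard (wrappers, query optimization, load on the sources).

```python
class Warehouse:
    """Eager integration: copy the data at load time, query the copy."""

    def __init__(self, sources):
        self.sources = sources
        self.copy = []

    def refresh(self):
        # Periodic load: re-copy every row from every source system.
        self.copy = [row for source in self.sources for row in source]

    def total_quantity(self):
        return sum(quantity for _, quantity in self.copy)


class Federation:
    """Lazy integration: pull from the source systems at query time."""

    def __init__(self, sources):
        self.sources = sources

    def total_quantity(self):
        return sum(quantity for source in self.sources for _, quantity in source)


# Two hypothetical source systems holding (state, quantity) rows.
source_a = [("CA", 45)]
source_b = [("OR", 33)]

warehouse = Warehouse([source_a, source_b])
warehouse.refresh()
federation = Federation([source_a, source_b])

# A new transaction arrives after the last warehouse load.
source_b.append(("WA", 30))

stale_total = warehouse.total_quantity()   # answered from the copy
live_total = federation.total_quantity()   # answered from the sources
print(stale_total, live_total)
```

The warehouse answer catches up only after the next `refresh()`, but its queries never touch the transactional systems.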
Two Approaches to Data Warehousing
Data mart: like a data warehouse, but smaller and more focused
Top-down approach
◦ First build single unified data warehouse with all enterprise data
◦ Then create data marts containing specialized subsets of the data from
the warehouse
Bottom-up approach
◦ First build a data mart to solve the most pressing problem
◦ Then build another data mart, then another
◦ Data warehouse = union of all data marts

In practice, not much difference between the two


Our book advocates the bottom-up approach
HW 1
Submission: 10-4-2021
Email or hard copy before next lecture

What is Data Warehousing?


Why Do We Need a Data Mart?
Data Cube
Axes of the cube represent attributes of the data records
◦ Generally discrete-valued / categorical
◦ e.g. color, month, state
◦ Called dimensions
Cells hold aggregated measurements
◦ e.g. total $ sales, number of autos sold
◦ Called facts
Real data cubes have >> 3 dimensions

[Figure: “Auto Sales” cube with axes Color (Red, Blue, Gray), State (CA, OR, WA), and Month (Jul, Aug, Sep)]
Slicing and Dicing

[Figure: the Auto Sales cube (Red/Blue/Gray × CA/OR/WA × Jul/Aug/Sep) being sliced and diced, e.g. the Blue slice shown by State and Month, and Blue totals by Month]
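A small sketch of a cube held as a dictionary, following this lecture's usage: slicing filters the cube, dicing chooses how finely to group it. The sales records are made up for illustration and do not correspond to the numbers in the slides.

```python
from collections import defaultdict

DIMENSIONS = ("color", "month", "state")

# Hypothetical sales records: (color, month, state, quantity).
records = [
    ("Red", "Jul", "CA", 5), ("Red", "Jul", "CA", 3),
    ("Blue", "Jul", "CA", 4), ("Blue", "Aug", "OR", 2),
    ("Blue", "Sep", "WA", 6), ("Gray", "Aug", "WA", 1),
]

# Build the cube: one cell per (color, month, state), holding a summed fact.
cube = defaultdict(int)
for color, month, state, quantity in records:
    cube[(color, month, state)] += quantity

def slice_cube(cube, **fixed):
    """Slice: keep only the cells whose dimensions match the fixed values."""
    return {
        key: quantity
        for key, quantity in cube.items()
        if all(key[DIMENSIONS.index(dim)] == value for dim, value in fixed.items())
    }

def dice_cube(cube, group_by):
    """Dice: group cells by the chosen dimensions, summing over the rest."""
    positions = [DIMENSIONS.index(dim) for dim in group_by]
    grouped = defaultdict(int)
    for key, quantity in cube.items():
        grouped[tuple(key[i] for i in positions)] += quantity
    return dict(grouped)

blue = slice_cube(cube, color="Blue")
blue_by_month = dice_cube(blue, ["month"])
by_color = dice_cube(cube, ["color"])
print(blue_by_month, by_color)
```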
Querying the Data Cube
Cross-tabulation
◦ “Cross-tab” for short
◦ Report data grouped by 2 dimensions
◦ Aggregate across other dimensions
◦ Include subtotals
Operations on a cross-tab
◦ Roll up (further aggregation)
◦ Drill down (less aggregation)

Number of Autos Sold
        CA    OR    WA    Total
Jul     45    33    30    108
Aug     50    36    42    128
Sep     38    31    40    109
Total   133   100   112   345
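The cross-tab with its subtotals can be computed directly from the nine inner cells. The cell values below are the ones from the “Number of Autos Sold” table; the function name and layout are just one possible sketch.

```python
MONTHS = ["Jul", "Aug", "Sep"]
STATES = ["CA", "OR", "WA"]

# Inner cells of the "Number of Autos Sold" cross-tab: (month, state) -> count.
cells = {
    ("Jul", "CA"): 45, ("Jul", "OR"): 33, ("Jul", "WA"): 30,
    ("Aug", "CA"): 50, ("Aug", "OR"): 36, ("Aug", "WA"): 42,
    ("Sep", "CA"): 38, ("Sep", "OR"): 31, ("Sep", "WA"): 40,
}

def cross_tab(cells, rows, cols):
    """Return the table as a list of rows, with row, column, and grand totals."""
    table = []
    for r in rows:
        values = [cells[(r, c)] for c in cols]
        table.append([r] + values + [sum(values)])
    col_totals = [sum(cells[(r, c)] for r in rows) for c in cols]
    table.append(["Total"] + col_totals + [sum(col_totals)])
    return table

table = cross_tab(cells, MONTHS, STATES)
for row in table:
    print("\t".join(str(value) for value in row))
```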
Roll Up and Drill Down
Number of Autos Sold
        CA    OR    WA    Total
Jul     45    33    30    108
Aug     50    36    42    128
Sep     38    31    40    109
Total   133   100   112   345

Roll up by Month → Number of Autos Sold
        CA    OR    WA    Total
        133   100   112   345

Drill down by Color → Number of Autos Sold
        CA    OR    WA    Total
Red     40    29    40    109
Blue    45    31    37    113
Gray    48    40    35    123
Total   133   100   112   345
“Standard” Data Cube Query
Measurements
◦ Which fact(s) should be reported?
Filters
◦ What slice(s) of the cube should be used?
Grouping attributes
◦ How finely should the cube be diced?
◦ Each dimension is either:
◦ (a) A grouping attribute
◦ (b) Aggregated over (“Rolled up” into a single total)
◦ n dimensions → 2^n sets of grouping attributes
◦ Aggregation = projection to a lower-dimensional subspace
Full Data Cube with Subtotals
Pre-computation of aggregates → fast answers to OLAP queries
Ideally, pre-compute all 2^n types of subtotals
Otherwise, perform aggregation as needed
Coarser-grained totals can be computed from finer-grained totals
◦ But not the other way around
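A short sketch of computing coarser totals from finer ones, assuming an additive measure such as SUM. The (month, state) totals are the ones from the cross-tab on the earlier slide; `roll_up` is a hypothetical helper.

```python
from collections import defaultdict

# (month, state) totals copied from the "Number of Autos Sold" cross-tab.
by_month_state = {
    ("Jul", "CA"): 45, ("Jul", "OR"): 33, ("Jul", "WA"): 30,
    ("Aug", "CA"): 50, ("Aug", "OR"): 36, ("Aug", "WA"): 42,
    ("Sep", "CA"): 38, ("Sep", "OR"): 31, ("Sep", "WA"): 40,
}

def roll_up(totals, keep):
    """Sum fine-grained totals down to the key positions listed in `keep`."""
    coarse = defaultdict(int)
    for key, value in totals.items():
        coarse[tuple(key[i] for i in keep)] += value
    return dict(coarse)

by_state = roll_up(by_month_state, keep=[1])   # aggregate over month
grand_total = roll_up(by_state, keep=[])       # then aggregate over state
print(by_state, grand_total)
```

Going the other way is impossible: knowing that CA sold 133 autos says nothing about how those sales split across Jul, Aug, and Sep.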
Data Cube Lattice
(State, Month, Color)
(State, Month)   (State, Color)   (Month, Color)
(State)   (Month)   (Color)
(Total)
Roll up: move toward Total
Drill down: move toward (State, Month, Color)
MOLAP vs. ROLAP
MOLAP = Multidimensional OLAP
Store data cube as multidimensional array
(Usually) pre-compute all aggregates
Advantages:
◦ Very efficient data access → fast answers

Disadvantages:
◦ Doesn’t scale to large numbers of dimensions
◦ Requires special-purpose data store
Sparsity
Imagine a data warehouse for Safeway.
Suppose dimensions are: Customer, Product, Store, Day
If there are 100,000 customers, 10,000 products, 1,000 stores, and
1,000 days…
…data cube has 1,000,000,000,000,000 cells!
Fortunately, most cells are empty.
A given store doesn’t sell every product on every day.
A given customer has never visited most of the stores.
A given customer has never purchased most products.
Multi-dimensional arrays are not an efficient way to store sparse data.
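The arithmetic, plus a sketch of the alternative: store only the cells that actually contain data. The dictionary layout and the sample sales are assumptions for illustration.

```python
# Dimension sizes from the Safeway example.
customers, products, stores, days = 100_000, 10_000, 1_000, 1_000

# A dense multidimensional array needs one cell per combination.
dense_cells = customers * products * stores * days

# Sparse alternative: keep only non-empty cells, keyed by their coordinates.
sparse_cube = {}

def record_sale(customer, product, store, day, quantity):
    """Add a sale to its cell, creating the cell on first use."""
    key = (customer, product, store, day)
    sparse_cube[key] = sparse_cube.get(key, 0) + quantity

# Hypothetical sales: two fall into the same cell, one into another.
record_sale(17, 4021, 3, 250, 2)
record_sale(17, 4021, 3, 250, 1)
record_sale(98, 12, 3, 251, 5)

fill_fraction = len(sparse_cube) / dense_cells
print(f"{dense_cells:,} dense cells, {len(sparse_cube)} stored cells")
```

Storing one row per non-empty cell is essentially what a relational fact table does.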
MOLAP vs. ROLAP
ROLAP = Relational OLAP
Store data cube in relational database
Express queries in SQL
Advantages:
◦ Scales well to high dimensionality
◦ Scales well to large data sets
◦ Sparsity is not a problem
◦ Uses well-known, mature technology
Disadvantages:
◦ Query performance is slower than MOLAP
◦ Need to construct explicit indexes
Creating a Cross-tab with SQL

SELECT state, month, SUM(quantity)   -- grouping attributes, measurement
FROM sales
WHERE color = 'Red'                  -- filter
GROUP BY state, month                -- grouping attributes
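To try the query, one option is Python's built-in `sqlite3` module. Note that the WHERE clause must come before GROUP BY in SQL. The `sales` table layout and the rows are hypothetical; ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)

# Hypothetical fact rows, one per batch of autos sold.
rows = [
    ("CA", "Jul", "Red", 20), ("CA", "Jul", "Red", 25),
    ("CA", "Aug", "Red", 50), ("OR", "Jul", "Red", 33),
    ("OR", "Jul", "Blue", 7),   # removed by the filter
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

query = """
SELECT state, month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY state, month
ORDER BY state, month
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```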
What about the totals?
SQL aggregation query with GROUP BY does not produce subtotals, totals
Our cross-tab report is incomplete.

State   Month   SUM
CA      Jul     45
CA      Aug     50
CA      Sep     38
OR      Jul     33
OR      Aug     36
OR      Sep     31
WA      Jul     30
WA      Aug     42
WA      Sep     40

Number of Autos Sold
        CA    OR    WA    Total
Jul     45    33    30    ?
Aug     50    36    42    ?
Sep     38    31    40    ?
Total   ?     ?     ?     ?
One solution: a big UNION ALL

-- Original query
SELECT state, month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY state, month
UNION ALL
-- State subtotals
SELECT state, 'ALL', SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY state
UNION ALL
-- Month subtotals
SELECT 'ALL', month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY month
UNION ALL
-- Overall total
SELECT 'ALL', 'ALL', SUM(quantity)
FROM sales
WHERE color = 'Red'
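The four-part union can be run with Python's built-in `sqlite3` module. As an assumption for this sketch, the nine cross-tab cells are loaded as one Red row each, so the subtotals should match the “Number of Autos Sold” table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)

# Assumed data: one Red row per cell of the cross-tab.
cells = [
    ("CA", "Jul", 45), ("CA", "Aug", 50), ("CA", "Sep", 38),
    ("OR", "Jul", 33), ("OR", "Aug", 36), ("OR", "Sep", 31),
    ("WA", "Jul", 30), ("WA", "Aug", 42), ("WA", "Sep", 40),
]
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, 'Red', ?)", cells
)

query = """
SELECT state, month, SUM(quantity) FROM sales
WHERE color = 'Red' GROUP BY state, month
UNION ALL
SELECT state, 'ALL', SUM(quantity) FROM sales
WHERE color = 'Red' GROUP BY state
UNION ALL
SELECT 'ALL', month, SUM(quantity) FROM sales
WHERE color = 'Red' GROUP BY month
UNION ALL
SELECT 'ALL', 'ALL', SUM(quantity) FROM sales
WHERE color = 'Red'
"""

# Key each result row by (state, month) so subtotals are easy to look up.
result = {(state, month): total for state, month, total in conn.execute(query)}
print(len(result), result[("ALL", "ALL")])
```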
A better solution
“UNION ALL” solution gets cumbersome with more than 2 grouping attributes
n grouping attributes → 2^n parts in the union
OLAP extensions added to SQL:1999 are more convenient
◦ CUBE, ROLLUP

SELECT state, month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY CUBE(state, month)
Results of the CUBE query
Notice the use of NULL for totals
Subtotals at all levels

State   Month   SUM(quantity)
CA      Jul     45
CA      Aug     50
CA      Sep     38
CA      NULL    133
OR      Jul     33
OR      Aug     36
OR      Sep     31
OR      NULL    100
WA      Jul     30
WA      Aug     42
WA      Sep     40
WA      NULL    112
NULL    Jul     108
NULL    Aug     128
NULL    Sep     109
NULL    NULL    345
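What CUBE computes can be imitated in plain Python, which may help when no engine with CUBE support is at hand. This is a sketch of the semantics, not of how a database implements it; `None` plays the role of NULL, and the input rows are the nine cells from the cross-tab.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`; rolled-up dims become None."""
    result = defaultdict(int)
    for size in range(len(dims) + 1):
        for kept in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] if d in kept else None for d in dims)
                result[key] += row[measure]
    return dict(result)

# The nine (state, month) cells from the cross-tab, as row dictionaries.
rows = [
    {"state": s, "month": m, "quantity": q}
    for s, m, q in [
        ("CA", "Jul", 45), ("CA", "Aug", 50), ("CA", "Sep", 38),
        ("OR", "Jul", 33), ("OR", "Aug", 36), ("OR", "Sep", 31),
        ("WA", "Jul", 30), ("WA", "Aug", 42), ("WA", "Sep", 40),
    ]
]

result = cube(rows, ["state", "month"], "quantity")
print(len(result), result[("CA", None)], result[(None, None)])
```

As in SQL, a genuine missing value in the data would be indistinguishable from a "total" marker in this sketch.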
ROLLUP vs. CUBE
CUBE computes entire lattice
ROLLUP computes one path through lattice
◦ Order of GROUP BY list matters
◦ Groups by all prefixes of the GROUP BY list

GROUP BY ROLLUP(A,B,C)
• A,B,C
• (A,B) subtotals
• (A) subtotals
• Total

GROUP BY CUBE(A,B,C)
• A,B,C
• Subtotals for the following: (A,B), (A,C), (B,C), (A), (B), (C)
• Total
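The difference comes down to prefixes versus subsets of the GROUP BY list. The two helper functions below are hypothetical names for a sketch that enumerates the grouping sets each operator produces.

```python
from itertools import combinations

def rollup_groupings(columns):
    """ROLLUP: every prefix of the list, from the full list down to ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_groupings(columns):
    """CUBE: every subset of the list, from the full list down to ()."""
    return [
        combo
        for n in range(len(columns), -1, -1)
        for combo in combinations(columns, n)
    ]

rollup_sets = rollup_groupings(["A", "B", "C"])
cube_sets = cube_groupings(["A", "B", "C"])

print(rollup_sets)   # n + 1 grouping sets
print(cube_sets)     # 2^n grouping sets
```

Reordering the ROLLUP list changes which subtotals appear; reordering the CUBE list does not change the set of groupings.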
ROLLUP example
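SELECT color, month, state, SUM(quantity)
FROM sales
GROUP BY ROLLUP(color,month,state)

Groups by the prefixes of the list: (color, month, state), (color, month), (color), and the overall total

[Figure: data cube lattice, from (State, Month, Color) through the pairs and single attributes down to Total]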
