Data Warehouses and OLAP: Example of a Data Warehouse
DATA WAREHOUSES AND OLAP

eid  name  birthdate  did  dname
...  ...   ...        ...  ...

Transaction Details

tid  type  date
1    sale  4/11/1999
2    sale  5/2/1999
3    buy   5/17/1999
...  ...   ...

tid  pid  qty
1    21   2
2    13   1
3    41   3
...  ...  ...

HK-Database

Supplier, Country

sid  name  birthdate  cid  cname
...  ...   ...        ...  ...

Sales

sid  date  time  qty  pid
1  15:4:1999  8:30  2  11
2  15:4:1999  9:30  2  11
3  ???  3  56
4  19:5:1999  4  22  3
...  ...

Fact table

timeid  pid  sales
1       1    2
2       1    4
2       2    1
3       3    2
...     ...  ...

dimension 1: time

timeid  day  month  year
1       11   4      1999
2       15   4      1999
3       2    5      1999
...     ...  ...

dimension 2: product

pid  name   type
1    chair  office
2    table  office
3    desk   office
...  ...
Example of a Data Warehouse (3)
• Data Cleaning
  • Tuples which are incomplete or logically inconsistent are cleaned
• Data Summarization
  • Values are summarized according to the desired level of analysis
  • For example, the HK database records the date and time at which a sales transaction takes place, but the most detailed time unit we are interested in for analysis is the day.

What is Data Warehouse?
• Defined in many different ways, but not rigorously.
  • A decision support database that is maintained separately from the organization’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
  • The process of constructing and using data warehouses
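The Data Summarization step named in "Example of a Data Warehouse (3)" can be sketched as follows. This is only an illustration: the record layout and the sample values are hypothetical and loosely follow the Sales table, and `summarize_by_day` is a name invented here.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical operational records: (timestamp of the sale, product id, quantity).
sales = [
    (datetime(1999, 4, 15, 8, 30), 11, 2),
    (datetime(1999, 4, 15, 9, 30), 11, 2),
    (datetime(1999, 5, 19, 14, 0), 22, 3),
]

def summarize_by_day(records):
    """Sum quantities per (day, product), dropping the time of day."""
    totals = defaultdict(int)
    for timestamp, pid, qty in records:
        totals[(timestamp.date(), pid)] += qty
    return dict(totals)

daily = summarize_by_day(sales)
```

The two sales of product 11 on the same day collapse into one summarized row, which is the level of detail the warehouse keeps.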
Data cube (partial)
desks     56   84    9   35  184
shelves   19   20    0   71  110
boards     5   16   11   15   47
ALL      115  187  109  187  598

[...] issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  • relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., Hotel price: currency, tax, breakfast covered, etc.
  • When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain a “time element” (the time elements could be extracted from log files of transactions)

Data Warehouse: Non-Volatile
• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  • Does not require transaction processing, recovery, and concurrency control mechanisms
  • Requires only two operations in data accessing:
    • initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
• Data warehouse: update-driven, high performance
  • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
• Example of a Heterogeneous DBMS
  [Figure: a user query is passed to a mediator/wrapper, which uses metadata for query transformation, sends queries Q1, Q2, Q3 to the heterogeneous databases, and receives results R1, R2, R3]
• The results from the various sources are integrated and returned to the user

Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  • User and system orientation: customer vs. market
  • Data contents: current, detailed vs. historical, consolidated
  • Database design: ER + application vs. star + subject
  • View: current, local vs. evolutionary, integrated
  • Access patterns: update vs. read-only but complex queries
Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• Further development of data cube technology

[...] which views data in the form of a data cube
• Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

From Tables and Spreadsheets to Data Cubes
• A dimension is a perspective with respect to which we analyze the data
• A multidimensional data model is usually organized around a central theme (e.g., sales). Numerical measures on this theme are called facts, and they are used to analyze the relationships between the dimensions
• Example:
  • Central theme: sales

[...] dimensions and provides summarizations for all subsets of them

Data cube (partial)
tables    10   30    0   45   85
desks     56   84    9   35  184
shelves   19   20    0   71  110
boards     5   16   11   15   47
ALL      115  187  109  187  598
What is a data cube?
• In data warehousing literature, the most detailed part of the cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Data cube (product by year)
                        year
product    1999  2000  2001  2002   ALL
chairs       25    37    89    21   172
tables       10    30     0    45    85
desks        56    84     9    35   184
shelves      19    20     0    71   110
boards        5    16    11    15    47
ALL         115   187   109   187   598

base cuboid: the product-by-year cells; apex cuboid: the ALL/ALL cell (598)

Conceptual Modeling of Data Warehouses
• The ER model is used for relational database design. For data warehouse design we need a concise, subject-oriented schema that facilitates data analysis
• Modeling data warehouses: dimensions & measures
  • Star schema: A fact table in the middle connected to a set of dimension tables
  • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
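Returning to the product-by-year table under "What is a data cube?", the ALL row, the ALL column, and the apex value can be recomputed from the base cuboid. The sketch below uses the numbers from that table; the variable names are chosen here for illustration.

```python
# Base cuboid from the product-by-year table: one list of yearly values per product.
years = [1999, 2000, 2001, 2002]
base = {
    "chairs":  [25, 37, 89, 21],
    "tables":  [10, 30, 0, 45],
    "desks":   [56, 84, 9, 35],
    "shelves": [19, 20, 0, 71],
    "boards":  [5, 16, 11, 15],
}

# 1-D cuboid over product: aggregate the year dimension away (the ALL column).
by_product = {product: sum(row) for product, row in base.items()}

# 1-D cuboid over year: aggregate the product dimension away (the ALL row).
by_year = {year: sum(row[i] for row in base.values()) for i, year in enumerate(years)}

# Apex (0-D) cuboid: a single grand total.
apex = sum(by_product.values())
```

Each coarser cuboid is derived from the base cuboid alone, which is why the cuboids form a lattice with the base cuboid at the bottom and the apex at the top.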
Example of Snowflake Schema

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
Sales Fact Table: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Normalization: supplier is split out of item, and city is split out of location.

A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
  • First time as “cube definition”
  • define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
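As a rough illustration of the snowflake example, the normalized item/supplier branch can be sketched with Python's standard `sqlite3` module. This is an assumption-laden sketch, not the slides' own notation: only three of the tables are created, and all rows (keys, names, amounts) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER, location_key INTEGER,
    units_sold INTEGER, dollars_sold REAL
);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [(1, "wholesale"), (2, "retail")])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)",
                 [(10, "chair", "BrandA", "office", 1),
                  (20, "desk", "BrandB", "office", 2)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 10, 1, 1, 2, 100.0),
                  (2, 10, 1, 1, 1, 50.0),
                  (2, 20, 1, 1, 3, 600.0)])

# Reaching supplier_type takes two joins because supplier is normalized out of item.
rows = conn.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales AS f
    JOIN item AS i ON f.item_key = i.item_key
    JOIN supplier AS s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
```

In a star schema, supplier_type would sit directly in the item dimension and the second join would disappear; the snowflake form trades that extra join for less redundancy in the dimension tables.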
Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Aggregate Functions on Measures: Three Categories
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  • E.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  • E.g., avg(), min_N(), standard_deviation().
• holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.
  • E.g., median(), mode(), rank().

• Target: compute an aggregate on quantity
• distributive:
  • To compute sum(quantity) we can first compute sum(quantity) for each item and then add these numbers.
• algebraic:
  • To compute avg(quantity) we can first compute sum(quantity) and count(quantity) and then divide these numbers.
• holistic:
  • To compute median(quantity) we can use neither median(quantity) for each item nor any combination of distributive functions.
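The three categories can be checked on a small example. The sketch below assumes hypothetical quantities partitioned by item; the partition names and values are invented for illustration.

```python
from statistics import median

# Hypothetical quantities, partitioned by item.
partitions = {"item_a": [1, 1, 1], "item_b": [2, 8], "item_c": [11]}
all_values = [q for qs in partitions.values() for q in qs]

# Distributive: the sum of the per-partition sums equals the overall sum.
partial_sums = [sum(qs) for qs in partitions.values()]
total = sum(partial_sums)

# Algebraic: avg is derived from two distributive sub-aggregates (sum and count).
partial_counts = [len(qs) for qs in partitions.values()]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not give the overall median here.
median_of_medians = median(median(qs) for qs in partitions.values())
true_median = median(all_values)
```

For these values the per-partition medians are 1, 5 and 11, so their median is 5, while the median of all six quantities is 1.5. Computing the true median needs the full set of values, not a bounded summary per partition.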
Concept Hierarchies
A concept hierarchy is a hierarchy of conceptual relationships for a specific dimension, mapping low-level concepts to high-level concepts
Typically, a multidimensional view of the summarized data has one concept from the hierarchy for each selected dimension
Example:
General concept: Analyze the total sales with respect to item, location, and time
View 1: <itemid, city, month>
View 2: <item_type, country, week>
View 3: <item_color, state, year>

Multidimensional Data
• Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
  Product: Industry, Category, Product
  Location: Region, Country, City, Office
  Time: Year, Quarter, Month / Week, Day

[Figure residue: hierarchy labels "total order" and "partial order (lattice)"; cube labels "sum" and country values Canada, Germany, ..., Spain, Mexico]
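A concept hierarchy can be represented as a mapping from a lower level to a higher level, and climbing it re-aggregates the facts. The following sketch assumes a hypothetical city-to-country mapping and hypothetical sales facts; `roll_up_location` is a name invented here.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> country.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}

# Hypothetical base-level facts: (item, city, sales).
facts = [
    ("chair", "Vancouver", 5),
    ("chair", "Toronto", 7),
    ("desk", "Toronto", 2),
    ("chair", "Frankfurt", 4),
]

def roll_up_location(rows, hierarchy):
    """Replace each city by its country and sum the sales per (item, country)."""
    totals = defaultdict(int)
    for item, city, sales in rows:
        totals[(item, hierarchy[city])] += sales
    return dict(totals)

by_country = roll_up_location(facts, city_to_country)
```

Choosing a different level per dimension, as in View 1 to View 3, amounts to applying a different mapping of this kind to each dimension before aggregating.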
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country

DataCubes
‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE

C/S     S   M   L  TOT
Red    20   3   5   28
Blue    3   3   8   14
Gray    0   0   5    5
TOT    23   6  18   47

Group-by lattice: φ; size; color; color, size
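The cuboids of a cube are the subsets of its dimensions, so n dimensions without hierarchies give 2^n cuboids. A minimal sketch of enumerating them follows; the function name `cuboids` is chosen here for illustration.

```python
from itertools import combinations

dimensions = ("product", "date", "country")

def cuboids(dims):
    """Return all group-by subsets, ordered from the apex (0-D) to the base (n-D)."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]

lattice = cuboids(dimensions)
```

For the three dimensions above this yields eight cuboids, from the empty group-by (apex) to (product, date, country). Calling `cuboids(("color", "size"))` gives the four group-bys of the color/size example.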
Typical OLAP Operations
• Browsing between cuboids
• Roll up (drill-up): summarize data
  • by climbing up hierarchy or by reducing a dimension
• Drill down (roll down): reverse of roll-up
  • from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Pivot (rotate):
  • reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
  • drill across: involving (across) more than one fact table
  • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
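Two of these operations, roll-up by reducing a dimension and pivot, can be sketched on the color/size cube used in this deck. The dictionary representation and the function names `roll_up` and `pivot` are assumptions made for this illustration.

```python
from collections import defaultdict

# Base cuboid of the color/size example: count per (color, size).
cube = {
    ("Red", "S"): 20, ("Red", "M"): 3, ("Red", "L"): 5,
    ("Blue", "S"): 3, ("Blue", "M"): 3, ("Blue", "L"): 8,
    ("Gray", "S"): 0, ("Gray", "M"): 0, ("Gray", "L"): 5,
}

def roll_up(cells, keep):
    """Roll up by dropping dimensions: keep only the key positions listed in `keep`."""
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)

def pivot(cells):
    """Rotate a 2-D cube by swapping its two dimensions."""
    return {(b, a): value for (a, b), value in cells.items()}

by_color = roll_up(cube, keep=(0,))    # drop size: the TOT column
by_size = roll_up(cube, keep=(1,))     # drop color: the TOT row
grand_total = roll_up(cube, keep=())   # apex cuboid
size_by_color = pivot(cube)            # same cells, keyed by (size, color)
```

Drill-down is the reverse direction: going from `by_color` back to `cube` requires the more detailed data, since it cannot be recovered from the totals.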
OLAP
• Online Analytical Processing (OLAP)
• Given a data table D with 4 attributes W, X, Y and Z
• An OLAP cube can be characterized as a function
  • f : (X, Y, Z) → W
• An example of a slice is a function g : (Y, Z) → W

Example of operations on a Datacube

C/S     S   M   L  TOT
Red    20   3   5   28
Blue    3   3   8   14
Gray    0   0   5    5
TOT    23   6  18   47

Group-by lattice: φ; size; color; color, size
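The function view of a cube, f : (X, Y, Z) → W, and a slice g : (Y, Z) → W can be sketched with dictionaries. The rows of D below are hypothetical, and `slice_x` is a name invented for this illustration.

```python
# Hypothetical table D with attributes (W, X, Y, Z); W is the measure.
D = [
    (10, "x1", "y1", "z1"),
    (20, "x1", "y2", "z1"),
    (30, "x2", "y1", "z1"),
    (40, "x2", "y1", "z2"),
]

# f : (X, Y, Z) -> W, represented as a dictionary lookup.
f = {(x, y, z): w for w, x, y, z in D}

def slice_x(cube, x_value):
    """Fix X = x_value and return g : (Y, Z) -> W."""
    return {(y, z): w for (x, y, z), w in cube.items() if x == x_value}

g = slice_x(f, "x2")
```

The slice has one dimension fewer than the cube because X is held at a single value rather than aggregated away.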
Example of operations on a Datacube
Roll-up:
• In this example we reduce one dimension
• It is possible to climb up one hierarchy
• Example: (product, city) → (product, country)

Example of operations on a Datacube
Slice: Perform a selection on one dimension
Design of a Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
  • Top-down view
    • allows selection of the relevant information necessary for the data warehouse
  • Data source view
    • exposes the information being captured, stored, and managed by operational systems
  • Data warehouse view
    • consists of fact tables and dimension tables
  • Business query view
    • sees the perspectives of data in the warehouse from the view of end-user

Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
  • Top-down: Starts with overall design and planning
  • Bottom-up: Starts with experiments and prototypes (rapid).

Building and managing a data warehouse
[Figure: only the label "Prioritized EDW" is recoverable]
Multi-Tiered Architecture
[Figure: Operational DBs and other sources feed Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata) into the Data Warehouse and Data Marts, which Serve the OLAP Server; front-end tools: Analysis, Query, Reports, Data mining]

Three Data Warehouse Models
• Enterprise warehouse
  • collects all of the information about subjects spanning the entire organization
• Data Mart
  • a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
    • Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
  • A set of views over operational databases
  • Only some of the possible summary views may be materialized