
DATA WAREHOUSES AND OLAP

Example of a Data Warehouse (1)

US-Database

Employee(eid, name, birthdate)    Department(did, dname)

Transaction
tid  type  date
1    sale  4/11/1999
2    sale  5/2/1999
3    buy   5/17/1999
...  ...   ...

Details
tid  pid  qty
1    21   2
2    13   1
3    41   3
...  ...  ...

HK-Database

Supplier(sid, name, birthdate)    Country(cid, cname)

Sales
sid  date       time  qty  pid
1    15:4:1999  8:30  2    11
2    15:4:1999  9:30  2    11
3    ???        3 56
4    19:5:1999  4 22 3
...  ...

Data Warehouse

FACT table
timeid  pid  sales
1       1    2
2       1    4
2       2    1
3       3    2
...     ...  ...

dimension 1: time
timeid  day  month  year
1       11   4      1999
2       15   4      1999
3       2    5      1999
...     ...  ...

dimension 2: product
pid  name   type
1    chair  office
2    table  office
3    desk   office
...  ...

Why Data Warehousing?

• Data warehousing can be considered as an important preprocessing step for data mining
[Figure: Heterogeneous Databases feed the Data Warehouse through data selection, data cleaning, data integration and data summarization.]
• A data warehouse also provides on-line analytical processing (OLAP) tools for interactive multidimensional data analysis.

Example of a Data Warehouse (2)

• Data Selection
  • Only data which are important for analysis are selected (e.g., information about employees, departments, etc. is not stored in the warehouse)
  • Therefore the data warehouse is subject-oriented
• Data Integration
  • Consistency of attribute names
  • Consistency of attribute data types (e.g., dates are converted to a consistent format)
  • Consistency of values (e.g., product-ids are converted to correspond to the same products from both sources)
  • Integration of data (e.g., data from both sources are integrated into the warehouse)
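
As a minimal sketch of the "consistency of attribute data types" step, the two date formats seen in Example (1) can be mapped to one date type. It assumes that the US-Database writes month/day/year and the HK-Database writes day:month:year, which is what the time dimension of Example (1) suggests; the function names are hypothetical.

```python
from datetime import date, datetime

def parse_us_date(text: str) -> date:
    """Parse a US-Database date such as '4/11/1999' (assumed month/day/year)."""
    return datetime.strptime(text, "%m/%d/%Y").date()

def parse_hk_date(text: str) -> date:
    """Parse an HK-Database date such as '15:4:1999' (assumed day:month:year)."""
    return datetime.strptime(text, "%d:%m:%Y").date()

# Date strings as they appear in the two source databases of Example (1)
us_dates = ["4/11/1999", "5/2/1999", "5/17/1999"]
hk_dates = ["15:4:1999", "19:5:1999"]

# After conversion, dates from both sources are comparable and sortable
integrated = sorted(
    [parse_us_date(d) for d in us_dates] + [parse_hk_date(d) for d in hk_dates]
)
for d in integrated:
    print(d.isoformat())
```

Once both sources share one representation, a single time dimension (day, month, year) can be derived from the converted values.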

Example of a Data Warehouse (3)
• Data Cleaning
  • Tuples which are incomplete or logically inconsistent are cleaned
• Data Summarization
  • Values are summarized according to the desired level of analysis
  • For example, the HK database records the time of day at which a sales transaction takes place, but the most detailed time unit we are interested in for analysis is the day.

What is Data Warehouse?
• Defined in many different ways, but not rigorously.
  • A decision support database that is maintained separately from the organization’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
  • The process of constructing and using data warehouses
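
The data summarization step of Example (3) can be sketched as follows: the two HK sales of 15:4:1999 differ only in their time of day, so at day granularity they collapse into one total. This is a minimal sketch; the tuple layout is an assumption made for illustration.

```python
from collections import defaultdict

# (date, time, qty, pid) rows taken from the HK Sales table in Example (1)
hk_sales = [
    ("15:4:1999", "8:30", 2, 11),
    ("15:4:1999", "9:30", 2, 11),
]

# Summarize to the day level: the time of day is dropped, quantities are added
daily_totals = defaultdict(int)
for day, _time_of_day, qty, pid in hk_sales:
    daily_totals[(day, pid)] += qty

print(dict(daily_totals))
```

The combined quantity of 4 for that day matches the sales value 4 stored in the fact table of Example (1).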

Example of a Data Warehouse (4)

• Example of an OLAP query (collects counts)
  • Summarize all company sales according to product and year, and further aggregate on each of these dimensions.

Data cube (rows: product, columns: year)
           1999  2000  2001  2002   ALL
chairs       25    37    89    21   172
tables       10    30     0    45    85
desks        56    84     9    35   184
shelves      19    20     0    71   110
boards        5    16    11    15    47
ALL         115   187   109   187   598

Data Warehouse: Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
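
The OLAP query of Example (4), which aggregates sales by product and year and then further on each dimension, can be reproduced with a few lines of Python. This is only a sketch of the idea (every cell feeds four aggregates), not how an OLAP server computes a cube; the variable names are illustrative.

```python
from collections import defaultdict

years = [1999, 2000, 2001, 2002]
# Counts per product and year, copied from the data cube table above
sales = {
    "chairs":  [25, 37, 89, 21],
    "tables":  [10, 30, 0, 45],
    "desks":   [56, 84, 9, 35],
    "shelves": [19, 20, 0, 71],
    "boards":  [5, 16, 11, 15],
}

cube = defaultdict(int)
for product, counts in sales.items():
    for year, n in zip(years, counts):
        cube[(product, year)] += n      # detailed cell
        cube[(product, "ALL")] += n     # aggregate over all years
        cube[("ALL", year)] += n        # aggregate over all products
        cube[("ALL", "ALL")] += n       # grand total

print(cube[("chairs", "ALL")], cube[("ALL", 2000)], cube[("ALL", "ALL")])
```

The three printed values correspond to the ALL column for chairs, the ALL row for 2000 and the bottom-right cell of the table.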

Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  • relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., Hotel price: currency, tax, breakfast covered, etc.
  • When data is moved to the warehouse, it is converted.

Data Warehouse: Non-Volatile
• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  • Does not require transaction processing, recovery, and concurrency control mechanisms
  • Requires only two operations in data accessing:
    • initial loading of data and access of data.

Data Warehouse: Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
  • Operational database: current value data.
  • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  • Contains an element of time, explicitly or implicitly
  • But the key of operational data may or may not contain a “time element” (the time elements could be extracted from log files of transactions)

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration:
  • Build wrappers/mediators on top of heterogeneous databases
  • Query driven approach
    • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    • Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance
  • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Heterogeneous DBMS
• Example of a Heterogeneous DBMS
[Figure: the user query reaches the mediator/wrapper, which uses meta-data for query transformation into the queries Q1, Q2, Q3 sent to the heterogeneous databases; these return the results R1, R2, R3.]
• The results from the various sources are integrated and returned to the user

Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  • User and system orientation: customer vs. market
  • Data contents: current, detailed vs. historical, consolidated
  • Database design: ER + application vs. star + subject
  • View: current, local vs. evolutionary, integrated
  • Access patterns: update vs. read-only but complex queries
Data Warehouse vs. Heterogeneous DBMS


• Advantages of a Data Warehouse:
• The information is integrated in advance, therefore there is no overhead
for (i) querying the sources and (ii) combining the results
• There is no interference with the processing at local sources (a local source
may go offline)
• Some information is already summarized in the warehouse, so query effort
is reduced.
• When should mediators be used?
• When queries apply on current data and the information is highly dynamic
(changes are very frequent).
• When the local sources are not collaborative.


OLTP vs. OLAP

                    OLTP                          OLAP
users               clerk, IT professional        manager
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional, integrated,
                    isolated                      consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB (even PB)
metric              transaction throughput        query throughput, response

Why Separate Data Warehouse?

• High performance for both systems
  • DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
  • Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
  • missing data: Decision support requires historical data which operational DBs do not typically maintain
  • data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• Further development of data cube technology
• From data warehousing to data mining

From Tables and Spreadsheets to Data Cubes
• A dimension is a perspective with respect to which we analyze the data
• A multidimensional data model is usually organized around a central theme (e.g., sales). Numerical measures on this theme are called facts, and they are used to analyze the relationships between the dimensions
• Example:
  • Central theme: sales
  • Dimensions: item, customer, time, location, supplier, etc.
From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  • Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

What is a data cube?

• The data cube summarizes the measure with respect to a set of n dimensions and provides summarizations for all subsets of them

Data cube (rows: product, columns: year)
           1999  2000  2001  2002   ALL
chairs       25    37    89    21   172
tables       10    30     0    45    85
desks        56    84     9    35   184
shelves      19    20     0    71   110
boards        5    16    11    15    47
ALL         115   187   109   187   598

What is a data cube?
• In data warehousing literature, the most detailed part of the cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Data cube (rows: product, columns: year)
           1999  2000  2001  2002   ALL
chairs       25    37    89    21   172
tables       10    30     0    45    85
desks        56    84     9    35   184
shelves      19    20     0    71   110
boards        5    16    11    15    47
ALL         115   187   109   187   598
base cuboid: the product-by-year cells; apex cuboid: the ALL/ALL cell (598)

Conceptual Modeling of Data Warehouses
• The ER model is used for relational database design. For data warehouse design we need a concise, subject-oriented schema that facilitates data analysis.
• Modeling data warehouses: dimensions & measures
  • Star schema: A fact table in the middle connected to a set of dimension tables
  • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema

Sales Fact Table
  foreign keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time(time_key, day, day_of_the_week, month, quarter, year)
  item(item_key, item_name, brand, type, supplier_type)
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city, province_or_state, country)

Cube: A Lattice of Cuboids

0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid:  time, item, location, supplier
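
The lattice of cuboids can be enumerated mechanically: every subset of the dimension set is one cuboid, so n dimensions give 2^n cuboids when concept hierarchies are ignored. A minimal sketch for the four dimensions of the lattice slide (variable names are illustrative):

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group the cuboids by their number of dimensions (0-D apex up to 4-D base)
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print(total_cuboids)
```

The empty subset is the apex cuboid ("all") and the full set is the base cuboid.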

Example of Snowflake Schema

Sales Fact Table
  foreign keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time(time_key, day, day_of_the_week, month, quarter, year)
  item(item_key, item_name, brand, type, supplier_key)
  supplier(supplier_key, supplier_type)
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city_key)
  city(city_key, city, province_or_state, country)

normalization: supplier is split off from item, and city is split off from location

A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
  • First time as “cube definition”
  • define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Example of Fact Constellation

Sales Fact Table
  foreign keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  foreign keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped

Dimension tables
  time(time_key, day, day_of_the_week, month, quarter, year)
  item(item_key, item_name, brand, type, supplier_type)
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city, province_or_state, country)
  shipper(shipper_key, shipper_name, location_key, shipper_type)

Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
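
For readers who want to try a star schema without a DMQL engine, the same structure can be approximated in SQL. The sketch below uses Python's built-in sqlite3 module with only two of the four dimensions; all rows are made-up sample data, and the SQL tables are an illustrative translation of the DMQL definitions, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw measure
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    sales_in_dollars REAL);

-- Hypothetical sample rows
INSERT INTO item VALUES (1, 'chair', 'A', 'office', 'wholesale'),
                        (2, 'desk',  'B', 'office', 'retail');
INSERT INTO location VALUES (1, 'Main St', 'Vancouver', 'BC', 'Canada'),
                            (2, 'King St', 'Toronto',   'ON', 'Canada');
INSERT INTO sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 200.0), (1, 1, 30.0);
""")

# dollars_sold = sum(sales_in_dollars), units_sold = count(*), per (item, city)
rows = conn.execute("""
    SELECT i.item_name, l.city,
           SUM(s.sales_in_dollars) AS dollars_sold,
           COUNT(*) AS units_sold
    FROM sales s
    JOIN item i ON s.item_key = i.item_key
    JOIN location l ON s.location_key = l.location_key
    GROUP BY i.item_name, l.city
    ORDER BY i.item_name, l.city
""").fetchall()

for row in rows:
    print(row)
```

Each query joins the central fact table to the dimension tables it needs, which is the access pattern the star shape is designed for.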

Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Aggregate Functions on Measures: Three Categories
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  • E.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  • E.g., avg(), min_N(), standard_deviation().
• holistic: if there is no constant bound on the storage size needed to describe a sub-aggregate.
  • E.g., median(), mode(), rank().
Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Aggregate Functions on Measures: Three Categories (Examples)
• Table: Sales(itemid, timeid, quantity)
• Target: compute an aggregate on quantity
• distributive:
  • To compute sum(quantity) we can first compute sum(quantity) for each item and then add these numbers.
• algebraic:
  • To compute avg(quantity) we can first compute sum(quantity) and count(quantity) and then divide these numbers.
• holistic:
  • To compute median(quantity) we can use neither median(quantity) for each item nor any combination of distributive functions.
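
The three categories can be checked on a small example. The sketch below partitions the quantity values of Sales(itemid, timeid, quantity) by item; the numbers are made up for illustration.

```python
from statistics import median

# quantity values grouped by item (hypothetical data)
quantities_by_item = {
    "item_1": [1, 2, 3],
    "item_2": [10, 20],
    "item_3": [30],
}
all_quantities = [q for qs in quantities_by_item.values() for q in qs]

# distributive: the sum of the per-item sums equals the sum over all data
partial_sums = {item: sum(qs) for item, qs in quantities_by_item.items()}
total = sum(partial_sums.values())

# algebraic: avg is derived from two distributive aggregates, sum and count
partial_counts = {item: len(qs) for item, qs in quantities_by_item.items()}
average = total / sum(partial_counts.values())

# holistic: combining per-item medians does not give the overall median
median_of_medians = median([median(qs) for qs in quantities_by_item.values()])
true_median = median(all_quantities)

print(total, average, median_of_medians, true_median)
```

For median, the sub-aggregates would have to carry the values themselves, which is why no constant-size summary per partition is enough.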

Concept Hierarchies
• A concept hierarchy is a hierarchy of conceptual relationships for a specific dimension, mapping low-level concepts to high-level concepts
• Typically, a multidimensional view of the summarized data has one concept from the hierarchy for each selected dimension
• Example:
  • General concept: Analyze the total sales with respect to item, location, and time
  • View 1: <itemid, city, month>
  • View 2: <item_type, country, week>
  • View 3: <item_color, state, year>
  • ....

Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths (from general to specific)
  • Product: Industry, Category, Product (total order)
  • Location: Region, Country, City, Office (total order)
  • Time: Year, Quarter, Month, Day, with Week as an alternative level above Day (partial order, lattice)
A Concept Hierarchy: Dimension (location)

all:      all
region:   Europe ... North_America
country:  Germany ... Spain    Canada ... Mexico
city:     Frankfurt ...    Vancouver ... Toronto
office:   L. Chan ... M. Wind

A Sample Data Cube

[Figure: a 3-D cube with the dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
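
A concept hierarchy such as the one for location can be represented as child-to-parent mappings, and summarizing along it is then a matter of re-keying and adding. This is a minimal sketch using the cities, countries and regions named above; the sales figures and the helper name roll_up are hypothetical.

```python
from collections import Counter

# Child -> parent mappings for two levels of the location hierarchy
city_to_country = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
country_to_region = {"Germany": "Europe", "Canada": "North_America"}

# Hypothetical total sales per city
sales_by_city = {"Frankfurt": 40, "Vancouver": 25, "Toronto": 35}

def roll_up(totals, parent_of):
    """Aggregate totals one level up the hierarchy described by parent_of."""
    rolled = Counter()
    for child, amount in totals.items():
        rolled[parent_of[child]] += amount
    return dict(rolled)

sales_by_country = roll_up(sales_by_city, city_to_country)
sales_by_region = roll_up(sales_by_country, country_to_region)

print(sales_by_country)
print(sales_by_region)
```

Because the location hierarchy is a total order, each level has exactly one parent level, so a chain of such mappings is sufficient.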

Cuboids Corresponding to the Cube

0-D (apex) cuboid:  all
1-D cuboids:        product; date; country
2-D cuboids:        product,date; product,country; date,country
3-D (base) cuboid:  product, date, country

The cuboids are also called multidimensional views

DataCubes

‘color’, ‘size’: DIMENSIONS
‘count’: MEASURE

C/S     S    M    L  TOT
Red    20    3    5   28
Blue    3    3    8   14
Gray    0    0    5    5
TOT    23    6   18   47

Cuboid lattice: φ; size; color; color, size

Browsing a Data Cube

• Visualization
• OLAP capabilities
• Interactive manipulation

Typical OLAP Operations
• Browsing between cuboids
• Roll up (drill-up): summarize data
  • by climbing up a hierarchy or by reducing a dimension
• Drill down (roll down): reverse of roll-up
  • from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
  • project and select
• Pivot (rotate):
  • reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
  • drill across: involving (across) more than one fact table
  • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

OLAP
• Online Analytical Processing (OLAP)
• Given a data table D with 4 attributes W, X, Y and Z
• An OLAP cube can be characterized as a function
  • f : (X, Y, Z) → W
• An example of a slice is a function g : (Y, Z) → W

Example of operations on a Datacube

C/S     S    M    L  TOT
Red    20    3    5   28
Blue    3    3    8   14
Gray    0    0    5    5
TOT    23    6   18   47

Cuboid lattice: φ; size; color; color, size
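
The view of a cube as a function f : (X, Y, Z) → W translates directly into a dictionary keyed by coordinate tuples, and a slice is the restriction obtained by fixing one coordinate. A minimal sketch, in which the attribute values, the measure values and the helper slice_on_x are all made up for illustration:

```python
from typing import Dict, Tuple

Key = Tuple[str, str, int]  # one point (X, Y, Z) of the cube

# f : (X, Y, Z) -> W stored as a dictionary (hypothetical values)
f: Dict[Key, int] = {
    ("chair", "Vancouver", 1999): 5,
    ("chair", "Toronto", 1999): 7,
    ("desk", "Vancouver", 1999): 2,
    ("desk", "Vancouver", 2000): 4,
}

def slice_on_x(cube: Dict[Key, int], x: str) -> Dict[Tuple[str, int], int]:
    """Fix X = x and return the slice g : (Y, Z) -> W."""
    return {(y, z): w for (cx, y, z), w in cube.items() if cx == x}

g = slice_on_x(f, "desk")
print(g)
```

The result has one dimension fewer than the original cube, which is what distinguishes a slice from a dice.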

Example of operations on a Datacube
Roll-up:
• In this example we reduce one dimension
• It is possible to climb up one hierarchy
• Example (product, city) → (product, country)

C/S     S    M    L  TOT
Red    20    3    5   28
Blue    3    3    8   14
Gray    0    0    5    5
TOT    23    6   18   47

Example of operations on a Datacube
Drill-down:
• In this example we add one dimension
• It is possible to climb down one hierarchy
• Example (product, year) → (product, month)

Example of operations on a Datacube
Slice: Perform a selection on one dimension

Example of operations on a Datacube
Dice: Perform a selection on two or more dimensions
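
Roll-up, slice and dice can be tried out on the color/size count cube used in these slides. The sketch below stores the base cuboid as a dictionary; the particular selections (Red; Red or Blue with S or M) are arbitrary choices for illustration.

```python
# Base cuboid (color, size) -> count, copied from the color/size table
counts = {
    ("Red", "S"): 20, ("Red", "M"): 3, ("Red", "L"): 5,
    ("Blue", "S"): 3, ("Blue", "M"): 3, ("Blue", "L"): 8,
    ("Gray", "S"): 0, ("Gray", "M"): 0, ("Gray", "L"): 5,
}

# Roll-up by reducing a dimension: drop size, keep one count per color
by_color = {}
for (color, _size), n in counts.items():
    by_color[color] = by_color.get(color, 0) + n

# Slice: a selection on one dimension (color = Red)
red_slice = {size: n for (color, size), n in counts.items() if color == "Red"}

# Dice: a selection on two dimensions at once
dice = {
    key: n for key, n in counts.items()
    if key[0] in {"Red", "Blue"} and key[1] in {"S", "M"}
}

print(by_color)
print(red_slice)
print(dice)
```

The rolled-up counts per color reproduce the TOT column of the table; drill-down is the reverse step, going from by_color back to the detailed counts.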

Design of a Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
  • Top-down view
    • allows selection of the relevant information necessary for the data warehouse
  • Data source view
    • exposes the information being captured, stored, and managed by operational systems
  • Data warehouse view
    • consists of fact tables and dimension tables
  • Business query view
    • sees the perspectives of data in the warehouse from the view of the end-user

Building and managing a data warehouse

Prioritized EDW

Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
  • Top-down: Starts with overall design and planning
  • Bottom-up: Starts with experiments and prototypes (rapid).
• From software engineering point of view
  • Waterfall: structured and systematic analysis at each step before proceeding to the next.
  • Spiral: rapid generation of increasingly functional systems, short turnaround time.
• Typical data warehouse design process
  • Choose a business process to model, e.g., orders, invoices, etc.
  • Choose the grain (atomic level of data) of the business process
  • Choose the dimensions that will apply to each fact table record
  • Choose the measure that will populate each fact table record

Multi-Tiered Architecture
[Figure: four tiers. Data Sources: operational DBs and other sources. Data Storage: data are extracted, transformed, loaded and refreshed into the data warehouse and the data marts, together with a monitor & integrator and metadata. OLAP Engine: the OLAP server serves the front end. Front-End Tools: analysis, query, reports, data mining.]

Three Data Warehouse Models
• Enterprise warehouse
  • collects all of the information about subjects spanning the entire organization
• Data Mart
  • a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  • Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
  • A set of views over operational databases
  • Only some of the possible summary views may be materialized
