
UNIVERSITY OF JYVÄSKYLÄ

DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY

TIES443

Lecture 3

Data Warehousing
Mykola Pechenizkiy
Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443
November 3, 2006
Department of Mathematical Information Technology
University of Jyväskylä
TIES443: Introduction to DM

Lecture 3: Data Warehousing


Topics for today


What is a data warehouse?
Data warehouse architectures
Conceptual DW Modelling
Physical DW Modelling

A multi-dimensional data model


Data Cubes

OLAP
Codd's 12 rules for OLAP
Main OLAP operations
New buzzwords

Data warehouse implementation and maintenance


Data Warehouse
A decision support DB that is maintained separately
from the organization's operational databases.
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP
access methods, indexing, concurrency control, recovery

Warehouse tuned for OLAP


complex OLAP queries, multidimensional view, consolidation.

Different functions and different data


Missing data: Decision support requires historical data which
operational DBs do not typically maintain
Data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled

Three-Tier Architecture
[Figure: three-tier data warehouse architecture]
Data sources: operational DBs and other sources, passed through Extract, Transform, Load and Refresh (with a Monitor & Integrator and Metadata)
Data storage: data warehouse and data marts, which serve the OLAP servers
OLAP engine: OLAP server, ROLAP server
Front-end tools: analysis, query/reporting, data mining


Three-Tier Architecture
Warehouse database server

Almost always a relational DBMS, rarely flat files


Schema design
Specialized scan, indexing and join techniques
Handling of aggregate views (querying and materialization)
Supporting query language extensions beyond SQL
Complex query processing and optimization
Data partitioning and parallelism

OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on
multidimensional data to standard relational operators
Multidimensional OLAP (MOLAP): special-purpose server that directly
implements multidimensional data and operations
Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers: specialized support for SQL queries over star/snowflake
schemas

Clients
Query and reporting tools
Analysis tools
Data mining tools

Warehouse Physical Architectures


[Figure: three physical architectures connecting sources to clients]
Centralized architecture: sources feed a central data warehouse that serves the clients
Federated architecture: sources are accessed through a logical data warehouse
Tiered architecture: sources feed a physical data warehouse


Three Data Warehouse Models


Enterprise warehouse: collects all information about subjects
(customers, products, sales, assets, personnel) that span
the entire organization
Requires extensive business modeling (may take years to design
and build)

Data Marts: Departmental subsets that focus on selected


subjects
Marketing data mart: customer, product, sales
Faster roll out, but complex integration in the long run

Virtual warehouse: views over operational DBs


Materialize selective summary views for efficient query processing
Easy to build but requires excess capacity on operational DB servers


Data Warehouse
A data warehouse is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in organizational
decision making


Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer,
product, sales
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process


Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous
data sources
relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted
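
The hotel price example can be made concrete with a small sketch. It assumes that every source reports a price together with its currency and with flags for tax and breakfast, and that the warehouse convention is "EUR, tax and breakfast included". The class `HotelPrice`, the function `to_warehouse_price` and the exchange rates are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to EUR; real rates would come from a reference source.
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.8, "GBP": 1.5}


@dataclass
class HotelPrice:
    amount: float             # price as recorded by the source system
    currency: str             # ISO code, must be a key of RATES_TO_EUR
    tax_included: bool
    tax_rate: float           # e.g. 0.10 for 10 %
    breakfast_included: bool
    breakfast_fee: float      # in the source currency, before tax


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to the assumed warehouse convention:
    EUR, with tax and breakfast included."""
    amount = p.amount
    if not p.breakfast_included:
        amount += p.breakfast_fee
    if not p.tax_included:
        amount *= 1.0 + p.tax_rate
    return round(amount * RATES_TO_EUR[p.currency], 2)


us_offer = HotelPrice(100.0, "USD", False, 0.10, True, 0.0)
eu_offer = HotelPrice(80.0, "EUR", True, 0.10, False, 10.0)
converted = [to_warehouse_price(us_offer), to_warehouse_price(eu_offer)]
```

After conversion the two offers are directly comparable, which is the point of integrating at load time rather than at query time.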


Data Warehouse: Time Variant


The time horizon for the data warehouse is significantly
longer than that of operational systems
Operational database: current value data
Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)

Every key structure in the data warehouse
contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain a time
element


Data Warehouse: Non-Volatile
A physically separate store of data transformed from
the operational environment
Operational update of data does not occur in the data
warehouse environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data


Data Warehouse vs. Heterogeneous DBMS


Traditional heterogeneous DB integration
Build wrappers/mediators on top of heterogeneous databases
Query driven approach
When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for individual
heterogeneous sites involved, and the results are integrated into a
global answer set
Complex information filtering, compete for resources

Data warehouse
update-driven, high performance
Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis


Data Warehouse vs. Operational DBMS


OLTP (On-Line Transaction Processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing)


Major task of data warehouse system
Data analysis and decision making

Distinct features (OLTP vs. OLAP):


User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries

Conceptual Modeling of Data Warehouses

ER design techniques are not appropriate: the design should
reflect a multidimensional view

Star Schema
A fact table in the middle connected to a set of dimension tables

Snowflake Schema
A refinement of the star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape
similar to a snowflake

Fact Constellation Schema
Multiple fact tables share dimension tables, viewed as a collection
of stars, therefore called galaxy schema or fact constellation


Example of a Star Schema


Fact Table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice

Dimension tables
Order: OrderNo, OrderDate
Customer: CustomerNo, CustomerName, CustomerAddress, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNo, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
Date: DateKey, Date
City: CityName, State, Country


Star Schema
A single fact table and a single table for each dimension
Every fact points to one tuple in each of the dimensions
and has additional attributes
Does not capture hierarchies directly
Generated keys are used for performance and
maintenance reasons
Fact constellation: Multiple Fact tables that share many
dimension tables
Example: Projected expense and the actual expense may share
dimensional tables
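
A minimal sketch of a star schema using Python's built-in `sqlite3` module. The table and column names (`product`, `customer`, `sales_fact`, the `*_key` columns) and the sample rows are illustrative assumptions rather than the schema from the slides. The fact table holds generated keys pointing to one tuple in each dimension plus the measures, and a typical query joins the fact table to its dimensions and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables, each with a generated (surrogate) key.
CREATE TABLE product  (product_key  INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);

-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product(product_key),
    customer_key INTEGER REFERENCES customer(customer_key),
    quantity     INTEGER,
    total_price  REAL
);

INSERT INTO product  VALUES (1, 'Laptop', 'Computers'), (2, 'Mouse', 'Accessories');
INSERT INTO customer VALUES (1, 'Alice', 'Dublin'), (2, 'Bob', 'Toronto');
INSERT INTO sales_fact VALUES (1, 1, 1, 900.0), (2, 1, 2, 40.0), (2, 2, 1, 20.0);
""")

# Star join: fact table joined to each dimension, then grouped by dimension attributes.
rows = conn.execute("""
    SELECT p.category, c.city, SUM(f.total_price)
    FROM sales_fact f
    JOIN product  p ON p.product_key  = f.product_key
    JOIN customer c ON c.customer_key = f.customer_key
    GROUP BY p.category, c.city
    ORDER BY p.category, c.city
""").fetchall()
```

Note that `category` and `city` are plain attributes of their dimension tables here, so any hierarchy among them is not captured by the schema itself.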


Some Terms
The relation that relates the dimensions to the
measure of interest is called the fact table (e.g. sale)
Information about dimensions can be represented as
a collection of relations called the dimension tables
(product, customer, store)
Each dimension can have a set of associated
attributes
For each dimension, the set of associated attributes
can be structured as a hierarchy


A Concept Hierarchy: Dimension (location)


[Figure: concept hierarchy tree for the location dimension]
Levels, from most general to most detailed: all > region > country > city > office
region: North_America, Europe, ...
country: Canada, Mexico, ..., Ireland, France, ...
city: Toronto, ..., Dublin, Belfast, ...
office: Belfield, Blackrock, ...


Example of a Snowflake Schema


Fact Table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice

Dimension tables (indented tables are the normalized sub-dimensions)
Order: OrderNo, OrderDate
Customer: CustomerNo, CustomerName, CustomerAddress, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNo, ProdName, ProdDescr, Category, UnitPrice
    Category: CategoryName, CategoryDescr
Date: DateKey, Date, Month
    Month: Month, Year
    Year: Year
City: CityName, State, Country
    State: StateName, Country


Example of Fact Constellation


Multiple fact tables share dimension tables
Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, euros_sold, avg_sales

Shipping Fact Table
Keys: time_key, item_key, shipper_key, from_location, to_location
Measures: euros_sold, units_shipped

Dimension tables
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_key
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city, province_or_state, country
Shipper: shipper_key, shipper_name, location_key, shipper_type


Multidimensional Data Model


Fact relation sale (two dimensions)

Product  Client  Amt
p1       c1      12
p2       c1      11
p1       c3      50
p2       c2      8

Two-dimensional cube

        c1   c2   c3
p1      12   -    50
p2      11   8    -

Fact relation sale (three dimensions)

Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

3-dimensional cube

day 1   c1   c2   c3
p1      12   -    50
p2      11   8    -

day 2   c1   c2   c3
p1      44   4    -
p2      -    -    -


Multidimensional Data
Sales volume as a function of product, month, and
region
Dimensions: Product, Location, Time
[Figure: sales cube with axes Product, Region and Month]

Hierarchical summarization paths
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day


From Tables to Data Cubes


A data warehouse is based on a
multidimensional data model which views data in the form of a data
cube

A data cube, such as sales, allows data to be modeled and viewed in multiple
dimensions
Dimension tables, such as item (item_name, brand, type), or time (day,
week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables

Definitions
an n-Dimensional base cube is called a base cuboid
The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid
The lattice of cuboids forms a data cube
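
As a small illustration of these definitions, the sketch below enumerates every cuboid of a cube as a subset of its dimensions, grouped by the number of dimensions kept. The function name `cuboids` is an assumption made for this example. With n dimensions and no concept hierarchies there are 2^n cuboids, from the 0-D apex cuboid (the empty subset) to the n-D base cuboid.

```python
from itertools import combinations


def cuboids(dimensions):
    """Map k -> list of k-D cuboids, each given as a tuple of dimension names."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboids(["time", "item", "location", "supplier"])
total = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    print(f"{k}-D cuboids:", level)
```

For the four dimensions time, item, location and supplier this gives 1 + 4 + 6 + 4 + 1 cuboids.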

Cube: A Lattice of Cuboids


0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier

A Sample Data Cube

[Figure: a 3-D sales cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (Ireland, France, Germany, sum). One highlighted cell holds the total annual sales of TV in Ireland.]


Typical OLAP Operations


Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up


from higher level summary to lower level summary or detailed data, or
introducing new dimensions

Slice and dice


project and select

Pivot (rotate)
reorient the cube, visualization, 3D to series of 2D planes.

Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)
rankings
time functions: e.g. time avg.
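
Roll-up by dimension reduction and slicing can be sketched on the six-row sale relation (product, client, date, amount) used as the running example in this lecture. The helper names `roll_up` and `slice_` are illustrative assumptions and not part of any OLAP product.

```python
from collections import defaultdict

DIMS = ("product", "client", "date")
# (product, client, date, amt): the sale fact relation from the lecture example.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def roll_up(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_(facts, dim, value):
    """Select the sub-cube where one dimension is fixed to a single value."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


by_product_client = roll_up(SALE, ("product", "client"))    # roll up over date
grand_total = roll_up(SALE, ())                             # apex cuboid
day1_by_client = roll_up(slice_(SALE, "date", 1), ("client",))
```

Drill-down is the reverse direction: moving from `grand_total` back to `by_product_client` and then to the base facts, which is only possible because the detailed data is still stored.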

Typical OLAP Operations: Drill & Roll


Drill down goes from the top to the bottom of each list; drill up goes from the bottom to the top.

Path 1
Total Sales
Total Sales per city
Total Sales per city per store
Total Sales per city per store per month

Path 2
Total Sales
Total Sales per city
Total Sales per city by category

Drill across: moving between the two paths


OLAP Queries: Rollup & Drill-Down

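Each rollup step below is reversed by a drill-down.

3-D cube (product x client x date)

day 1   c1   c2   c3
p1      12   -    50
p2      11   8    -

day 2   c1   c2   c3
p1      44   4    -
p2      -    -    -

Rollup over date

        c1   c2   c3
p1      56   4    50
p2      11   8    -

Rollup over product (sum per client)

c1   67
c2   12
c3   50

Rollup over client (sum per product)

p1   110
p2   19

Rollup over all dimensions: 129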


Typical OLAP Operations: Pivoting


Pivoting can be combined with aggregation
Fact relation sale

prodId  clientid  date  amt
p1      c1        1     12
p2      c1        1     11
p1      c3        1     50
p2      c2        1     8
p1      c1        2     44
p1      c2        2     4

Pivot on date and client

date    c1   c2   c3   Sum
1       23   8    50   81
2       44   4    -    48
Sum     67   12   50   129

Pivot on product and client

        c1   c2   c3   Sum
p1      56   4    50   110
p2      11   8    -    19
Sum     67   12   50   129

The result of pivoting is called a cross-tabulation
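
A cross-tabulation with row and column sums can be computed in one pass over the fact relation. The sketch below is only an illustration: the function `crosstab` and its accessor arguments are hypothetical names, and the data is the six-row sale relation of the running example.

```python
from collections import defaultdict

# (prodId, clientid, date, amt): the sale fact relation from the lecture example.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def crosstab(facts, row_of, col_of, value_of):
    """Aggregate values into a row x column table and add 'Sum' margins."""
    table = defaultdict(lambda: defaultdict(int))
    for fact in facts:
        r, c, v = row_of(fact), col_of(fact), value_of(fact)
        table[r][c] += v            # cell
        table[r]["Sum"] += v        # row total
        table["Sum"][c] += v        # column total
        table["Sum"]["Sum"] += v    # grand total
    return {r: dict(cols) for r, cols in table.items()}


by_date = crosstab(SALE, lambda f: f[2], lambda f: f[1], lambda f: f[3])
by_product = crosstab(SALE, lambda f: f[0], lambda f: f[1], lambda f: f[3])
```

Changing the two accessor functions is the "rotation": the same facts are viewed by date and client or by product and client without touching the underlying data.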



Browsing a Data Cube


Visualization
OLAP capabilities
Interactive
manipulation


Codd's 12 Rules for OLAP

1. Multi-Dimensional Concept View
The user should be able to see the data as being multidimensional insofar as it
should be easy to 'pivot' or 'slice and dice'. (See later.)

2. Transparency
The OLAP functionality should be provided behind the user's existing
software without adversely affecting the functionality of the 'host', i.e. the
OLAP server should shield the user from the complexity of the data and
application

3. Accessibility
OLAP should allow the user to access diverse data stores (relational,
nonrelational and legacy systems) but see the data within a common
'schema' provided by the OLAP tool, i.e. users shouldn't have to know
the location, type or layout of the data to access it.
The OLAP server should automate the mapping of the logical schema to the
physical data

4. Consistent Reporting Performance


There should not be significant degradation in performance with large
numbers of dimensions or large quantities of data.

Codd's 12 Rules for OLAP


5. Client-Server Architecture
Since much of the data is on mainframes, and the users work on
PCs, the OLAP tool must be able to bring the two together
Different clients can be used
Data sources must be transparently supported by the OLAP server

6. Generic Dimensionality
Data dimensions must all be treated equally. Functions available
for one dimension must be available for others.

7. Dynamic Sparse Matrix Handling


The OLAP tool should be able to work out for itself the most
efficient way to store sparse matrix data.

8. Multi User Support


access, integrity, security
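
Rule 7 (dynamic sparse matrix handling) can be illustrated with a toy cube that stores only non-empty cells in a dictionary keyed by coordinates. This is a sketch under simplifying assumptions: the class `SparseCube` is hypothetical, and real MOLAP servers use more elaborate schemes such as chunking and compression.

```python
from math import prod


class SparseCube:
    """Toy multidimensional array that stores only the cells that hold a value."""

    def __init__(self, shape):
        self.shape = tuple(shape)   # number of members per dimension
        self.cells = {}             # coordinates -> measure

    def set(self, coords, value):
        if len(coords) != len(self.shape) or any(
            not 0 <= c < n for c, n in zip(coords, self.shape)
        ):
            raise IndexError(f"coordinates {coords} outside cube {self.shape}")
        self.cells[tuple(coords)] = value

    def get(self, coords):
        # Empty cells are treated as 0 rather than being stored.
        return self.cells.get(tuple(coords), 0)

    def density(self):
        """Fraction of the logical cells that are actually populated."""
        return len(self.cells) / prod(self.shape)


# 1000 products x 1000 clients x 365 days: 365 million logical cells.
cube = SparseCube((1000, 1000, 365))
cube.set((0, 0, 0), 12)
cube.set((1, 0, 0), 11)
stored = len(cube.cells)
```

A dense array of the same shape would allocate every logical cell, whereas the dictionary grows only with the number of recorded facts.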


Codd's 12 Rules for OLAP


9. Unrestricted Cross-Dimensional Operations
e.g., individual office overheads are allocated according to total corporate
overheads divided in proportion to individual office sales.
Non-additive formulas cause problems
Contribution = Revenue - Total Costs
Margin Percentage = Margin / Revenue

10. Intuitive Data Manipulation


Navigation should be done by operations on individual cells rather than
menus.
Dimensions defined should allow automatic reorientation, drill-down,
zoom-out, etc
Interface must be intuitive

11. Flexible Reporting


Row and column headings must be capable of more than one dimension
each, and of displaying subsets of any dimension.

12. Unlimited Dimensions and Aggregation Levels


15 - 20 dimensions are required in modelling, and within each there may
be many hierarchical levels, i.e. unlimited aggregation
Rare in reporting to go beyond 12 dimensions, 6-7 is usual
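
The remark under rule 9 that non-additive formulas cause problems can be checked with a short calculation. The figures below are made up, and the sketch assumes that margin means revenue minus total costs. Averaging the per-office margin percentages gives a different answer from recomputing the ratio on the rolled-up totals, and only the latter is the margin percentage of the whole company.

```python
# Hypothetical per-office figures, used only to illustrate the effect.
offices = [
    {"office": "Dublin", "revenue": 200.0, "costs": 150.0},
    {"office": "Toronto", "revenue": 1000.0, "costs": 900.0},
]


def margin_pct(revenue, costs):
    """Margin percentage = (revenue - costs) / revenue (assumed definition)."""
    return (revenue - costs) / revenue


per_office = [margin_pct(o["revenue"], o["costs"]) for o in offices]

# Wrong: treating the ratio as if it could be aggregated like a sum.
naive = sum(per_office) / len(per_office)

# Right: roll up the additive measures first, then apply the formula.
total_revenue = sum(o["revenue"] for o in offices)
total_costs = sum(o["costs"] for o in offices)
correct = margin_pct(total_revenue, total_costs)
```

Here the naive average is 17.5 % while the company-wide margin is 12.5 %, so an OLAP tool has to know at which level a formula is to be evaluated.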

DW Back-End Tools and Utilities


Data extraction:
get data from multiple, heterogeneous, and external sources

Data cleaning:
detect errors in the data and rectify them when possible

Data transformation:
convert data from legacy or host format to warehouse format:
different data formats, languages, etc.

Load:
sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions

Refresh
propagate the updates from the data sources to the warehouse
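
The steps above can be sketched as a chain of small functions. Everything here is a simplified assumption for illustration: the sources are in-memory lists, "cleaning" only drops rows with a missing or negative amount, and "loading" builds a per-customer summary in a dictionary.

```python
def extract(sources):
    """Data extraction: read rows from several sources and tag their origin."""
    for name, rows in sources.items():
        for row in rows:
            yield {**row, "source": name}


def clean(rows):
    """Data cleaning: drop rows whose amount is missing or negative."""
    for row in rows:
        if row.get("amount") is not None and row["amount"] >= 0:
            yield row


def transform(rows):
    """Data transformation: convert to one warehouse format."""
    for row in rows:
        yield {
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
            "source": row["source"],
        }


def load(rows):
    """Load: consolidate and summarize into a per-customer total."""
    summary = {}
    for row in rows:
        summary[row["customer"]] = summary.get(row["customer"], 0.0) + row["amount"]
    return summary


sources = {
    "shop_a": [{"customer": " alice ", "amount": 10}, {"customer": "BOB", "amount": None}],
    "shop_b": [{"customer": "Alice", "amount": 5.5}, {"customer": "bob", "amount": -3}],
}
warehouse = load(transform(clean(extract(sources))))
```

Because each stage is a generator, rows stream through the pipeline one at a time, which matters once the inputs no longer fit in memory.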

DW Information Flows
INFLOW - Processes associated with the extraction,
cleansing, and loading of the data from the source systems
into the data warehouse.
UPFLOW - Processes associated with adding value to the
data in the warehouse through summarizing, packaging,
and distribution of the data.
DOWNFLOW - Processes associated with archiving and
backing-up/recovery of data in the warehouse.
OUTFLOW - Processes associated with making the data
available to the end-users.
METAFLOW - Processes associated with the management
of the metadata.

Data Cleaning
why?
Data warehouse contains data that is analyzed for business
decisions
More data and multiple sources could mean more errors in the
data and harder to trace such errors
Results in incorrect analysis

finding and resolving inconsistency in the source data


detecting data anomalies and rectifying them early has
huge payoffs
Important to identify tools that work together well
Long Term Solution
Change business practices and data entry tools
Repository for meta-data

Data Cleaning Techniques


Transformation Rules
Example: translate gender to sex

Uses domain-specific knowledge to do scrubbing


Parsing and fuzzy matching
Multiple data sources (can designate a preferred
source)

Discover facts that flag unusual patterns (auditing)
Some dealer has never received a single complaint
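
Two of these techniques are easy to sketch with the Python standard library: a transformation rule as a lookup table, and fuzzy matching with `difflib.SequenceMatcher`. The mapping values, the default code "U" and the 0.85 threshold are assumptions chosen for this example, not recommended settings.

```python
from difflib import SequenceMatcher

# Transformation rule: map the spellings used by different sources to one code.
GENDER_TO_SEX = {"male": "M", "m": "M", "female": "F", "f": "F"}


def normalise_sex(value):
    """Return 'M' or 'F', or 'U' (unknown) when no rule applies."""
    return GENDER_TO_SEX.get(value.strip().lower(), "U")


def similar(a, b, threshold=0.85):
    """Fuzzy matching: treat two strings as the same entity above a ratio threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


codes = [normalise_sex(v) for v in ["Male", " f ", "n/a"]]
same_person = similar("Jon Smith", "John Smith")
different_person = similar("Jon Smith", "Mary Jones")
```

Pairs that match only fuzzily are candidates for merging. Which record wins can then follow the "preferred source" rule mentioned above.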


Load
Issues:
huge volumes of data to be loaded
small time window (usually at night) when the warehouse can be
taken off-line
When to build indexes and summary tables
allow system administrator to monitor status, cancel, suspend,
resume load, or change load rate
restart after failure with no loss of data integrity

Techniques:
batch load utility: sort input records on clustering key and use
sequential I/O; build indexes and derived tables
sequential loads still too long (~100 days for TB)
use parallelism and incremental techniques


Refresh
when to refresh
on every update: too expensive, only necessary if OLAP queries
need current data (e.g., up-to-the-minute stock quotes)
periodically (e.g., every 24 hours, every week) or after
significant events
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources

how to refresh
Full extract from base tables
read entire source table or database: expensive

Incremental techniques
detect & propagate changes on base tables: replication servers
logical correctness
transactional correctness: incremental load
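
One simple incremental technique is a timestamp watermark: remember the time of the last refresh and propagate only rows changed since then. The sketch below assumes that every source row carries an `updated_at` value, and the field names and the function `incremental_refresh` are hypothetical. Deleted source rows are not detected by this approach.

```python
def incremental_refresh(warehouse, source_rows, last_refresh):
    """Apply rows changed after `last_refresh`; return the new watermark."""
    newest = last_refresh
    for row in source_rows:
        if row["updated_at"] > last_refresh:
            warehouse[row["id"]] = row["value"]   # insert or overwrite
            newest = max(newest, row["updated_at"])
    return newest


warehouse = {1: "a"}
source = [
    {"id": 1, "value": "a", "updated_at": 5},    # unchanged since last refresh
    {"id": 2, "value": "b", "updated_at": 12},   # new row
    {"id": 1, "value": "a2", "updated_at": 15},  # later update of row 1
]
watermark = incremental_refresh(warehouse, source, last_refresh=10)
```

In this toy version the whole source is still scanned. A real system would read a change log or an index on the timestamp so that the cost depends on the number of changes rather than on the table size.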


Metadata Repository
Administrative metadata

source databases and their contents


warehouse schema, view & derived data definitions
dimensions, hierarchies
pre-defined queries and reports
data mart locations and contents
data partitions
data extraction, cleansing, transformation rules, defaults
data refresh and purging rules
user profiles, user groups
security: user authorization, access control

Business data
business terms and definitions
ownership of data
charging policies

Operational metadata
data lineage: history of migrated data and sequence of transformations applied
currency of data: active, archived, purged
monitoring information: warehouse usage statistics, error reports, audit
trails.

Design of a DW: A Business Analysis Framework


Four views regarding the design of a data warehouse
Top-down view: allows selection of the relevant
information necessary for the data warehouse (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
Data source view
exposes the information being captured, stored, and managed
by operational systems

Data warehouse view


consists of fact tables and dimension tables

Business query view


sees the perspectives of data in the warehouse from the view
of end-user


Data Warehouse Design Process


From software engineering point of view
Waterfall: structured and systematic analysis at each step before
proceeding to the next
Spiral: rapid generation of increasingly functional systems,
short turnaround time

Typical data warehouse design process

Choose a business process to model, e.g., orders, invoices, etc.


Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record


DW Design: Issues to Consider

What data is needed?


Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?


DW Data Management: Issues to Consider

Meta-data
Data sourcing
Data quality
Data security
Granularity
History: how long and how much?
Performance


Common DW Problems

Underestimation of resources for data loading


Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration


Research Issues
Data cleaning
focus on data inconsistencies, not schema differences
data mining techniques

Physical Design
design of summary tables, partitions, indexes
tradeoffs in use of different indexes

Query processing
selecting appropriate summary tables
dynamic optimization with feedback
query optimization: cost estimation, use of transformations, search
strategies
partitioning query processing between OLAP server and backend server.

Warehouse Management

incremental refresh techniques


computing summary tables during load
failure recovery during load and refresh
process management: scheduling queries, load and refresh
use of workflow technology for process management


OLAP Mining: An Integration of DM and DW


Coupling of data mining systems, DBMS, and data warehouse systems
No coupling, loose-coupling, semi-tight-coupling, tight-coupling

On-line analytical mining data


integration of mining and OLAP technologies

Interactive mining multi-level knowledge


Necessity of mining knowledge and patterns at different levels of
abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

Integration of multiple mining functions


Characterized classification, first clustering and then association

Summary
What is a data warehouse
Data warehouse architectures
Conceptual DW Modelling
Physical DW Modelling

A multi-dimensional data model


Data Cubes
Main OLAP operations

Data warehouse implementation and maintenance

What else did you get from this lecture?



Additional Slides


Some Criticisms of Data Cubes


Jargon for IT professionals
Metaphor for end-users
Not useful beyond introduction
Users expect to see one
Rubik's cube
Easy to manipulate but difficult to solve

Alternatives are better


Cognos ways, Thomsen's multi-dimensional domain diagrams,
Bulos's ADAPT diagrams or simply well designed interactive
reports


DW Development: A Recommended Approach


[Figure: recommended development approach, read from bottom to top]
Define a high-level corporate data model
-> Data marts and enterprise data warehouse (with model refinement at each step)
-> Distributed data marts
-> Multi-tier data warehouse



DMQL: Language Primitives


Nice presentation on Data Mining Query Languages can be found here:
http://www.cs.wisc.edu/EDAM/slides/Data%20Mining%20Query%20Languages.ppt

Cube Definition (Fact Table)


define cube <cube_name> [<dimension_list>]:
<measure_list>

Dimension Definition (Dimension Table)


define dimension <dimension_name> as
(<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)


First time as cube definition
define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>

Defining a Star Schema in DMQL


define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)


Defining a Snowflake Schema in DMQL


define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))


Defining a Fact Constellation in DMQL


define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars),
    units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars),
    unit_shipped = count(*)

define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

A Star-Net Query Model


[Figure: star-net query model. Radial lines from a central point represent dimensions; the circles on each line are abstraction levels.]
  Customer Orders: CONTRACTS, ORDER
  Shipping Method: AIR-EXPRESS, TRUCK
  Time: ANNUALLY, QTRLY, DAILY
  Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
  Location: CITY, COUNTRY, REGION
  Organization: SALES PERSON, DISTRICT, DIVISION
  Customer
  Promotion

Each circle is called a footprint


Index Structures
This and the following slides on indexing for DW are adapted with minor
modifications from: http://infolab.stanford.edu/~hector/cs245/Notes12.ppt

Indexing principle:
  mapping key values to records for associative direct access

Most popular indexing technique in relational databases: B+-trees
For multi-dimensional data, a large number of indexing techniques have been developed, e.g. R-trees
Index structures applied in warehouses:
  inverted lists
  bit map indexes
  join indexes
  text indexes


Inverted Lists
age index: 18, 19, 20, 21, 22, 23, 25, 26

inverted lists (two shown):
  20: r4, r18, r34, r35
  23: r5, r19, r37, r40

data records:
  rId   name    age
  r4    joe     20
  r18   fred    20
  r19   sally   21
  r34   nancy   20
  r35   tom     20
  r36   pat     25
  r5    dave    21
  r41   jeff    26
  ...

Query:
  Get people with age = 20 and name = fred
  List for age = 20: r4, r18, r34, r35
  List for name = fred: r18, r52
  Answer is intersection: r18
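
The query evaluation above can be sketched in a few lines of Python. This is an illustrative sketch, not a DBMS implementation: the helper names build_inverted_index and intersect are assumptions, and only the eight records shown on the slide are loaded, so the list for name = fred contains just r18 (r52 belongs to the elided part of the table).

```python
from collections import defaultdict

# Data records from the slide: rId -> (name, age)
records = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26),
}

def build_inverted_index(records, field):
    """Map each value of the given field position to the list of rIds holding it."""
    index = defaultdict(list)
    for rid, row in records.items():
        index[row[field]].append(rid)
    return index

def intersect(list_a, list_b):
    """Intersect two rId lists, keeping the order of the first list."""
    members = set(list_b)
    return [rid for rid in list_a if rid in members]

name_index = build_inverted_index(records, 0)
age_index = build_inverted_index(records, 1)

# Query: age = 20 and name = fred
answer = intersect(age_index[20], name_index["fred"])
print(age_index[20])   # ['r4', 'r18', 'r34', 'r35']
print(answer)          # ['r18']
```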

Bitmap Indexes
Bitmap index: An indexing technique that has
attracted attention in multi-dimensional database
implementation
table Customer:
  Customer  City     Car
  c1        Detroit  Ford
  c2        Chicago  Honda
  c3        Detroit  Honda
  c4        Poznan   Ford
  c5        Paris    BMW
  c6        Paris    Nissan


Bitmap Indexes
The index consists of bitmaps:

Index on City:
  row  Chicago  Detroit  Paris  Poznan
  1    0        1        0      0
  2    1        0        0      0
  3    0        1        0      0
  4    0        0        0      1
  5    0        0        1      0
  6    0        0        1      0

Index on Car:
  row  BMW  Ford  Honda  Nissan
  1    0    1     0      0
  2    0    0     1      0
  3    0    0     1      0
  4    0    1     0      0
  5    1    0     0      0
  6    0    0     0      1

Index on a particular column
  The index consists of a number of bit vectors (bitmaps)
  Each value in the indexed column has its own bit vector
  The length of each bit vector is the number of records in the base table
  The i-th bit is set if the i-th row of the base table has that value in the indexed column
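
As a small sketch of the construction just described, the following Python builds one bit vector per distinct value of a column, using the City and Car columns of the Customer table. The function name build_bitmap_index and the use of strings of '0'/'1' characters (rather than packed machine words) are assumptions chosen for readability.

```python
def build_bitmap_index(column):
    """Return {value: bit string}; bit i is '1' if row i holds that value."""
    n = len(column)
    bitmaps = {}
    for i, value in enumerate(column):
        # Create an all-zero vector the first time a value is seen.
        bits = bitmaps.setdefault(value, ["0"] * n)
        bits[i] = "1"
    return {value: "".join(bits) for value, bits in bitmaps.items()}

# Columns of the Customer table, rows c1..c6
city = ["Detroit", "Chicago", "Detroit", "Poznan", "Paris", "Paris"]
car = ["Ford", "Honda", "Honda", "Ford", "BMW", "Nissan"]

city_index = build_bitmap_index(city)
car_index = build_bitmap_index(car)

print(city_index["Detroit"])  # 101000
print(car_index["BMW"])       # 000010
```

Every row has exactly one bit set across the bitmaps of a column, so the number of bitmaps equals the number of distinct values while each bitmap has one bit per row.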

Bitmap Index

age index: 18, 19, 20, 21, 22, 23, 25, 26

bit maps (two shown):
  20: 1101100000
  23: 0010001011

data records:
  id  name   age
  1   joe    20
  2   fred   20
  3   sally  21
  4   nancy  20
  5   tom    20
  6   pat    25
  7   dave   21
  8   jeff   26
  ...

Query:
  Get people with age = 20 and name = fred
  List for age = 20:       1101100000
  List for name = fred:    0100000001
  Answer is intersection:  0100000000

Well suited to domains with small cardinality
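
The intersection in this query is a single bitwise AND. A minimal Python sketch, assuming the two bitmaps are given as the bit strings from the slide (the helper names bitmap_and and matching_rows are hypothetical):

```python
def bitmap_and(a, b):
    """Bitwise AND of two equal-length bit strings, returned as a bit string."""
    assert len(a) == len(b)
    width = len(a)
    # Convert to integers, AND them, and pad back to the original width.
    return format(int(a, 2) & int(b, 2), f"0{width}b")

def matching_rows(bits):
    """1-based row ids whose bit is set."""
    return [i + 1 for i, bit in enumerate(bits) if bit == "1"]

age_20 = "1101100000"     # bitmap for age = 20
name_fred = "0100000001"  # bitmap for name = fred

answer = bitmap_and(age_20, name_fred)
print(answer)                 # 0100000000
print(matching_rows(answer))  # [2]
```

Real systems store bitmaps in machine words, so one AND instruction combines many rows at once; the string form here is only for readability.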

Bitmap Index Summary


With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), a bitmap index offers better access methods for certain queries
  e.g., selection on two attributes
Some commercial products have implemented bitmap indexes
Works poorly for high-cardinality domains, since the number of bitmaps increases
Difficult to maintain: needs reorganization when relation sizes change (new bitmaps)


Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

sale:
  prodId  storeId  date  amt
  p1      c1       1     12
  p2      c1       1     11
  p1      c3       1     50
  p2      c2       1     8
  p1      c1       2     44
  p1      c2       2     4

product:
  id  name  price
  p1  bolt  10
  p2  nut   5

joinTb:
  prodId  name  price  storeId  date  amt
  p1      bolt  10     c1       1     12
  p2      nut   5      c1       1     11
  p1      bolt  10     c3       1     50
  p2      nut   5      c2       1     8
  p1      bolt  10     c1       2     44
  p1      bolt  10     c2       2     4
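
The joinTb result pairs each sale row with the product whose id equals its prodId. A hedged sketch with Python's sqlite3 module that reproduces it from the slide's data (the function name join_sale_product is an assumption, and the join condition is written out explicitly):

```python
import sqlite3

def join_sale_product():
    """Load the slide's sale and product tables and return their equi-join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
    con.execute("CREATE TABLE product (id TEXT, name TEXT, price INTEGER)")
    con.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ])
    con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                    [("p1", "bolt", 10), ("p2", "nut", 5)])
    # Equi-join on prodId = id; ORDER BY rowid keeps the sale insertion order.
    return con.execute("""
        SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
        FROM sale s JOIN product p ON s.prodId = p.id
        ORDER BY s.rowid
    """).fetchall()

if __name__ == "__main__":
    for row in join_sale_product():
        print(row)
```

Without the join condition the query would return the cross product, 6 x 2 = 12 rows, instead of the 6 rows of joinTb.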


Join Indexes
join index

product:
  id  name  price  jIndex
  p1  bolt  10     r1,r3,r5,r6
  p2  nut   5      r2,r4

sale:
  rId  prodId  storeId  date  amt
  r1   p1      c1       1     12
  r2   p2      c1       1     11
  r3   p1      c3       1     50
  r4   p2      c2       1     8
  r5   p1      c1       2     44
  r6   p1      c2       2     4


Join Indexes
Traditional indexes map a value to a list of record ids. Join indexes map the tuples in the join result of two relations back to the source tables.
In a data warehouse, join indexes relate the values of the dimensions of a star schema to rows in the fact table.
  For a warehouse with a Sales fact table and a dimension city, a join index on city maintains, for each distinct city, a list of RIDs of the tuples recording the sales in that city

Join indexes can span multiple dimensions
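
A join index over the sale and product tables from the earlier slides can be sketched as a dictionary from product id to the rIds of matching sale rows. This is an illustrative in-memory sketch; the names build_join_index and total_amt are assumptions, not part of any particular system.

```python
# sale fact rows from the slide: rId -> (prodId, storeId, date, amt)
sale = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}

# product dimension rows: id -> (name, price)
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}

def build_join_index(sale):
    """For each product id, precompute the list of sale rIds that join with it."""
    jindex = {}
    for rid, row in sale.items():
        jindex.setdefault(row[0], []).append(rid)
    return jindex

jindex = build_join_index(sale)

def total_amt(prod_id):
    """Sum amt over the sales of one product, touching only the indexed rows."""
    return sum(sale[rid][3] for rid in jindex.get(prod_id, []))

print(jindex["p1"])     # ['r1', 'r3', 'r5', 'r6']
print(total_amt("p1"))  # 110
```

Because the matching rIds are precomputed, a query on one dimension value reads only the relevant fact rows instead of scanning and joining the whole fact table.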
