
DATA WAREHOUSE DESIGN

ICDE 2001 Tutorial

Stefano Rizzi, Matteo Golfarelli


DEIS - University of Bologna, Italy

Motivation
Building a data warehouse for an enterprise is a huge and complex task, which requires accurate planning aimed at devising satisfactory answers to organizational and architectural questions. Despite the pressing demand for working solutions coming from enterprises and the wide range of advanced technologies offered by vendors, few attempts at devising a specific methodology for data warehouse design have been made. On the other hand, the statistical reports on DW project failures state that a major cause lies in the absence of a global view of the design process: in other words, in the absence of a design methodology.

Summary

Introduction to Data Warehousing
Conceptual design of Data Warehouses
Workload-based logical design for ROLAP
Indexes for physical design

Introduction to Data Warehousing


Stefano Rizzi

Information Systems: profile and role

Information systems are rooted in the relationship between information, decision and control. An IS should collect and classify information, by means of integrated and suitable procedures, in order to produce, in time and at the right levels, the synthesis used to support the decision process, as well as to administer and globally control the enterprise activity.

Information as a resource

Information is a resource of increasing value, required by managers to schedule and monitor the enterprise activities effectively. Information is the raw material transformed by information systems, just as unfinished products are transformed by manufacturing systems.

[Figure: a manufacturing system produces the finished product; an information system produces information]

Value of information

Information is an enterprise resource like capital, raw materials, plants and people; thus, it has a cost. Hence, understanding the value of information is important.

[Figure: value of information versus amount of information. Levels shown: primary information sources, selected information, reports, strategic directions.]

Different kinds of information systems


[Figure: kinds of information systems (ESS, MIS, DSS, OAS, KWS, TPS) arranged by level (strategic, management, knowledge, operational), by user group (senior managers, middle managers, knowledge and data workers, operational managers) and by functional area (sales and marketing, manufacturing, finance, accounting, human resources)]

The Data Warehouse phenomenon

Usual complaints:

We have tons of data but we cannot access them!
How can people playing the same role produce substantially different results?
We want to slice and dice data in any possible way!
Show me only what is important!
Everyone knows some data are incorrect...
(R. Kimball, The Data Warehouse Toolkit)

Data Warehousing

A collection of technologies and tools supporting the knowledge worker (executive, manager, analyst) in analysing data aimed at decision making and at improving the knowledge assets of the enterprise.

Data Warehouse
At the core of the architecture of modern information systems, it is a data repository:
Oriented to subjects
Integrated and consistent
Representing temporal evolution
Non-volatile
The data warehouse is regularly refreshed, permanently growing, logically centralised and easily accessed by users, essentially read-only.

Data Warehouse
[Figure: data warehouse architecture. Operational data (relational, legacy) and external data are fed by ETL tools into the warehouse, which also holds summary data. Access is through what-if analysis, analysis tools (OLAP), reporting tools and data mining.]

Data Marts
[Figure: data marts fed from the data warehouse by replication and broadcasting. Examples of data marts: marketing, finance, geographical regions, client management, supplier management.]

Subject vs Process
[Figure: emphasis on applications versus emphasis on subjects. Labels: region, charge, consumption, patient, reservations, medical reports, admissions.]

Integration and consistency


[Figure: integration and consistency. Sources: external data, DB, text files. Components: wrappers, mediators, loaders. Steps: schema integration, extraction, transformation, cleaning, validation, filtering, loading. Target: DW.]

Temporal evolution
OLTP: current values
Restricted historical content
Often time is not included in keys
Data are updated

DW: snapshots
Rich historical content
Time is included in keys
Snapshots cannot be updated

Non-volatility
[Figure: OLTP data are changed by insert, update and delete operations; DW data are only loaded and accessed]

Huge data volumes: from 20 GB to several TB in a few years.

In a DW, no advanced techniques for transaction management are required (unlike in OLTP systems). The key issues are query throughput and resilience.

DW                                                      OLTP
90% ad hoc queries                                      90% predefined transactions
Mostly read access                                      Read/write access
Hundreds of users                                       Thousands of users
Denormalised                                            Normalised
Supports historical versions                            Does not support historical versions
Optimised for accesses involving most of the database   Optimised for accesses involving a small fraction of the database
Based on summary data                                   Based on elemental data

ROLAP (Relational OLAP)

Intermediate level server between a relational back-end server and the front-end client
Specialised middleware
Generation of SQL multi-statements for the back-end server
Query scheduling

MOLAP (Multidimensional OLAP)

Direct support of multi-dimensional views
Special data structures (e.g., multi-dimensional arrays)
Compression techniques
Intelligent disk/memory caching
Pre-computation
Complex analysis

The technological progress


[Figure: technological progress from data to knowledge, 1970 to 2000. Stages shown: statistics & reporting, data warehousing, OLAP, refinement, data mining, pattern warehousing. Source: Information Discovery]

The Data Warehouse Market


[Figure: the data warehouse market, 1998 to 2002, by segment: data marts, ETL, data quality, metadata, RDBMS, OLAP. Source: Shilakes, Tylman, Enterprise Information Portals]

The DW life-cycle
Objective definition and planning

Clearly determine the scope, define the boundaries, estimate dimensions, choose the approach to design, evaluate the benefits

Infrastructure design

Choose the technologies and the tools, analyse the architectural solutions, solve the management problems

Design and implementation of applications

Iteratively add new data marts and applications to the warehouse

Bibliography

R. Barquin, S. Edelstein. Planning and Designing the Data Warehouse. Prentice Hall (1996).
S. Chaudhuri, U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record 26, 1 (1997).
G. Colliat. OLAP, relational and multidimensional database systems. SIGMOD Record 25, 3 (1996).
M. Demarest. The politics of data warehousing. Http://www.hevanet.com/demarest/marc/dwpol.html
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. Data mining and knowledge discovery in databases: an overview. Comm. of the ACM 39, 11 (1996).
W.H. Inmon. Building the Data Warehouse. John Wiley & Sons (1996).
S. Kelly. Data Warehousing in Action. John Wiley & Sons (1997).
R. Kimball. The Data Warehouse Toolkit. John Wiley & Sons (1996).
R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons (1998).
C. Shilakes, J. Tylman. Enterprise Information Portals. Http://www.sagemaker.com/company/downloads/eip/indepth.pdf
P. Vassiliadis. Gulliver in the land of data warehousing: practical experiences and observations of a researcher. Proc. DMDW2000 (2000).
J. Widom. Research Problems in Data Warehousing. Proc. CIKM (1995).

Conceptual modelling for Data Warehousing


Stefano Rizzi


Why a new conceptual model?

While it is universally recognised that a DW leans on a multidimensional model, there is no agreement on the approach to conceptual modelling. On the other hand, an accurate conceptual design is the necessary foundation for building a good information system. The Entity/Relationship model is widespread in enterprises, but:

"Entity relation data models [...] cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses." (Kimball, 96)

The multidimensional data model

[Figure: the Sales cube with dimensions Store, Time and Product. Highlighted examples: the number of Coke cans sold at BIGSTORES in London on 10/10/99; the number of Pepsi cans sold at all BIGSTORES on 10/10/99; the number of Fanta cans globally sold.]

Basic terminology

Fact (cube, target). It is a focus of interest for the decision-making process; typically, it models an event occurring in the enterprise world (sales, shipments, purchases). It is essential for a fact to have some dynamic aspects, i.e., to evolve somehow across time.

Measures (attributes, variables, metrics, properties). They are continuously valued (typically numerical) attributes which describe a fact from different points of view. For instance, each sale is measured by its revenue.

Dimensions. They are discrete attributes which determine the minimum granularity adopted to represent facts. Typical dimensions for the sale fact are product, store and date.

Hierarchies (dimensions). They contain dimension attributes (levels, parameters) connected in a tree-like structure by many-to-one relationships (functional dependencies).
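The terminology can be made concrete with a minimal Python sketch. It assumes a sale fact with dimensions product, store and date and a single measure (quantity sold); the rows, the store names and the store-to-city mapping are all made up for illustration, and `roll_up` is a hypothetical helper, not part of any tool mentioned in this tutorial.

```python
from collections import defaultdict

# Hypothetical sale events: (product, store, date) are the dimensions,
# the last field (quantity sold) is a measure. All values are invented.
SALES = [
    ("Coke",  "BIGSTORES London", "1999-10-10", 120),
    ("Pepsi", "BIGSTORES London", "1999-10-10", 80),
    ("Pepsi", "BIGSTORES Paris",  "1999-10-10", 50),
    ("Fanta", "BIGSTORES Paris",  "1999-10-11", 30),
]

# A many-to-one relationship (functional dependency) in the store
# hierarchy: each store belongs to exactly one city.
STORE_TO_CITY = {"BIGSTORES London": "London", "BIGSTORES Paris": "Paris"}


def roll_up(events, key):
    """Sum the measure over all events that map to the same key."""
    totals = defaultdict(int)
    for product, store, date, qty in events:
        totals[key(product, store, date)] += qty
    return dict(totals)


# Aggregate over all stores and dates: one total per product.
by_product = roll_up(SALES, lambda p, s, d: p)

# Climb the store hierarchy from store to city, keeping the product.
by_product_city = roll_up(SALES, lambda p, s, d: (p, STORE_TO_CITY[s]))
```

Choosing a coarser key corresponds to moving up a hierarchy: the dimensions fix the finest grain at which events are stored, and every aggregate is computed from that grain.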

DW modelling in the literature


Golfarelli et al. 98
Gyssens, Lakshmanan 97
Hüsemann et al. 00
Vassiliadis 98
Sapia et al. 98
Datta, Thomas 97
Tryfona et al. 99
Franconi, Sattler 99
Li, Wang 96
Cabibbo, Torlone 98
Agrawal et al. 95

[Figures: four further views of the same approaches, grouping them in turn as conceptual or logical, as formal or graphical, by whether they provide an algebra, and by whether they address design]

Conceptual models

Sapia, Blaschka, Höfling, Dinter (1998)

[Figure: example scheme annotated with the constructs dimension level, roll-up relationship, attribute and fact relationship]

Conceptual models (2)

Franconi, Sattler (1999)


[Figure: example scheme annotated with the constructs dimension, target, property, level and aggregated entity]

Conceptual models (3)

Hüsemann, Lechtenbörger, Vossen (2000)

[Figure: example scheme annotated with the constructs fact, measure, dimension, optional dimension, dimension level, property attribute, optional property attribute and aggregation path]

The Dimensional Fact Model


The Dimensional Fact Model (DFM) is a graphical conceptual model for DWs, aimed at:
Effectively supporting conceptual design;
Providing an environment where user queries can be formulated intuitively;
Enabling communication between the designer and the final user in order to refine requirement specification;
Supplying a stable platform for logical design;
Providing expressive and unambiguous documentation.

The DFM is independent of the target logical model (multidimensional or relational).

The Dimensional Fact Model (2)

Three levels of conceptual documentation are provided:


Fact scheme: represents a fact of interest and the associated measures, dimensions and hierarchies.
Data mart scheme: summarizes the fact schemes which constitute each data mart and emphasizes the feasible connections between them.
Data warehouse scheme: shows the different data marts, emphasizing their overlaps, the different profiles of the users accessing them, and the operational sources which feed them.

Each documentation level is complemented by glossaries which explain the names adopted within the schemes, define a connection between the DW data and the operational sources, and express data volumes. Data mart schemes are associated with the workload specification.

Fact schemes
[Figure: the SALE fact scheme, annotated with the constructs fact, measure, dimension, dimension attribute and hierarchy. Measures: qty sold, revenue, unit price, no. of customers. Attributes shown: product, type, category, department, marketing group, brand, brand city; store, store city, county, state, sale district, sales manager; date, day of week, holiday, week, month, quarter, year.]

A fact expresses a many-to-many relationship between its dimensions.

Fact schemes (2)


A non-dimension attribute contains additional information about a dimension attribute, and is typically connected to it by a one-to-one relationship. It cannot be used for aggregation.

Some links between attributes can be optional.

[Figure: the SALE fact scheme extended with non-dimension attributes (such as the address and phone of a store) and with optional links. Additional labels: manager, diet, promotion, begin date, end date, cost, price reduction, ad type.]

Fact schemes (3)


Convergence
Cross-dimension attributes
Additivity, non-additivity, non-aggregability
Overlap

[Figure: the SALE fact scheme showing a convergence and a cross-dimension attribute (V.A.T.). Additional labels: fiscal week, fiscal month, fiscal quarter, fiscal year, store city, store state.]

The SHIPMENTS fact scheme


[Figure: FACT SCHEME: SHIPMENT TO STORES. Measures: qty shipped, shipping cost. Attributes shown: product, type, category, department, marketing group, brand, brand city; date, day of week, week, month, quarter, year, fiscal week, fiscal month, fiscal quarter, fiscal year; warehouse, warehouse city, warehouse state; store, store city, store state; mode, type, carrier.]

The INVENTORY fact scheme


[Figure: FACT SCHEME: INVENTORY. Measure: level, annotated with the operators AVG and MIN. Attributes shown: product, type, category, department, marketing group, brand, brand city, units per pallet, package type, package size, weight; date, day of week, week, month, quarter, year, fiscal week, fiscal month, fiscal quarter, fiscal year; warehouse, warehouse city, warehouse nation.]

The supply chain


[Figure: the fact schemes of the supply chain with their dimensions.
PRODUCTION OF COMPONENTS: component, date, factory
COMPONENT DELIVERY: component, date, from factory, to factory
COMPONENT INVENTORY: component, date, factory
MANUFACTURING: product, date, factory
PACKAGING: product, date, package type, factory
SHIPMENT TO WAREHOUSE: product, date, warehouse, factory, mode
WAREHOUSE INVENTORY: product, date, warehouse
SHIPMENT TO STORES: product, date, warehouse, store, mode
SALES: product, date, promotion, store]

Glossaries
ATTRIBUTE GLOSSARY: SHIPMENT TO STORES

name             description                     domain          card.
product                                          products        5000
brand                                            brands          800
brand city       Where brands are manufactured   cities          50
type             (pasta, soft drink, ...)        pr. types       200
category         (food, clothing, music, ...)    pr. categories  10
department       Deps. managing categories       deps.           5
marketing group  Responsible for product types   groups          20
store                                            stores          100
store city                                       cities          80
store state                                      states          5
...

query (product attributes):
select prodName, brandName, cityName, ...
from PRODUCTS P, BRANDS B, CITIES C, ...
where P.brandId = B.brandId and B.cityId = C.cityId and ...

query (store attributes):
select storeName, cityName, stateName
from STORES S, CITIES C
where S.cityId = C.cityId ...

MEASURE GLOSSARY: SHIPMENT TO STORES (sparsity = 0.01)

name           description                              type
qty shipped    Quantity of each product being shipped   INTEGER
shipping cost  Cost of the shipment                     MONEY

query (qty shipped):
select SUM(PS.qty)
from PRODUCTS P, SHIP S, PRODSHIP PS, ...
where P.prodId = PS.prodId and PS.shipId = S.shipId and ...
group by P.prodId, S.date, ...

refresh frequency: 1 per week; refresh technique: periodic complete

Data mart schemes

The data mart scheme is used to summarize the fact schemes which constitute the data mart and to show drill-across connections between them. It is a graph whose nodes are elemental and overlapped fact schemes; the arcs are directed to each overlapped scheme from its component schemes, which in turn may be overlapped.
[Figure: DATA MART SCHEME: SUPPLY CHAIN. Elemental fact schemes: PRODUCTION OF COMPONENTS, COMPONENT DELIVERY, COMPONENT INVENTORY, MANUFACTURING, PACKAGING, SHIPMENT TO WAREHOUSE, WAREHOUSE INVENTORY, SHIPMENT TO STORES, SALE. Overlapped fact schemes: PRODUCTION AND DELIVERY, DELIVERY AND INVENTORY, MANUFACTURING AND PACKAGING, DISTRIBUTION CYCLE, PRODUCT CYCLE, SHIPMENT AND SALE.]
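As a rough illustration of the data mart scheme as a graph, the Python sketch below stores, for each overlapped scheme, the component schemes whose arcs point to it, and resolves an overlapped scheme down to its elemental fact schemes. The scheme names come from the SUPPLY CHAIN example, but which components feed which overlapped scheme is an assumption made here from the names alone; `COMPONENTS` and `elemental_schemes` are hypothetical names.

```python
# Arcs are directed from component schemes to the overlapped scheme.
# ASSUMPTION: the compositions below are guessed from the scheme names;
# they are not taken from the original figure.
COMPONENTS = {
    "MANUFACTURING AND PACKAGING": ["MANUFACTURING", "PACKAGING"],
    "SHIPMENT AND SALE": ["SHIPMENT TO STORES", "SALE"],
    "DISTRIBUTION CYCLE": [
        "SHIPMENT TO WAREHOUSE",
        "WAREHOUSE INVENTORY",
        "SHIPMENT AND SALE",  # a component may itself be overlapped
    ],
}


def elemental_schemes(scheme, components=COMPONENTS):
    """List the elemental fact schemes an overlapped scheme is built on."""
    if scheme not in components:
        return [scheme]  # no incoming arcs: the scheme is elemental
    result = []
    for part in components[scheme]:
        for elemental in elemental_schemes(part, components):
            if elemental not in result:
                result.append(elemental)
    return result


distribution = elemental_schemes("DISTRIBUTION CYCLE")
```

The recursion mirrors the remark that component schemes may in turn be overlapped.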

The workload

In principle, the workload for a data mart is dynamic and unpredictable. In some commercial tools, the actual workload is monitored while the DW is operating and the logical and physical schemes are dynamically tuned.

We claim that a core workload can, and should, be determined a priori:


The user typically knows in advance which kind of data analysis (s)he will carry out more often for decisional or statistical purposes;
A substantial amount of queries are aimed at extracting summary data to fill standard reports.

The workload (2)


[Figure: FACT SCHEME: SHIPMENT TO STORES, with measures qty shipped and shipping cost and the product, date, warehouse, store and mode hierarchies]

Data warehouse schemes

At the highest abstraction level, the data warehouse scheme shows the different data marts emphasizing the fact schemes duplicated on two or more of them, the different profiles of the users accessing them, and the operational sources which feed them.
[Figure: an example data warehouse scheme. Legend: data mart, fact scheme, user, operational db, file transfer, manual input. Data marts: SALES, PERSONNEL, SUPPLY CHAIN, RENOVATION, DEMAND CHAIN. Users: personnel manager, administrative manager, buyer, sale executive. Sources: personnel database, product database. Other labels: incentives, purchases, restoration works, orders, claims.]

Conceptual design of Data Warehouses


Stefano Rizzi


Designing the DW

Within a successful approach to DW design, top-down and bottom-up strategies should be mixed.
When planning a DW, a bottom-up approach should be followed. One data mart at a time is identified and prototyped. Each data mart is designed in a top-down fashion by building a conceptual scheme for each fact of interest.


Data Mart prototyping


Prototype first the data mart which:

plays the most strategic role for the enterprise;
can convince the final users of the potential benefits;
leans on available and consistent data sources.

[Figure: data marts DM1 to DM5 fed by Source 1, Source 2 and Source 3]

Reference architecture

[Figure: reference architecture. Heterogeneous operational dbs feed the reconciled data, which in turn feed the DW.]

Problem of designing the reconciled data (integration of heterogeneous sources)

Methodological framework
Phases:
analysis of the operational db
requirement specification
conceptual design
workload refinement
logical design
physical design

Actors: db administrator, designer, final user

DWs are based on a pre-existing information system

Methodological framework (2)

[Figure: starting from the E/R scheme or the relational scheme of the operational database, CONCEPTUAL DESIGN produces the conceptual scheme, LOGICAL DESIGN the logical scheme and PHYSICAL DESIGN the physical scheme. Inputs: facts and preliminary workload (conceptual design); workload and target logical model (logical design); workload and target DBMS (physical design).]

Conceptual design of the data mart

Design is based on the documentation of the underlying operational information system:


E/R schemes
Relational schemes
(Golfarelli, Maio, Rizzi 98; Cabibbo, Torlone 98; Moody, Kortink 00; Hüsemann, Lechtenbörger, Vossen 00)

Steps:
Find facts
For each fact:
  Navigate functional dependencies
  Drop useless attributes
  Define dimensions and measures

Finding facts
Within an E/R scheme, a fact is represented by either an entity F or an n-ary relationship between entities E1...En. Within a relational scheme, a fact is represented by a relation F.

The entities and relationships representing frequently updated archives are good candidates to define facts; those representing nearly-static archives are not.

Navigating functional dependencies

Build a tree in which each vertex corresponds to an attribute of the scheme;
The root corresponds to the identifier (key) of F;
For each vertex v, the corresponding attribute functionally determines all the attributes corresponding to the descendants of v.
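The construction of the attribute tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the functional dependencies in `DETERMINES` are a simplified, hypothetical subset loosely inspired by the sale example, the tree is represented as nested dictionaries, and `build_attribute_tree` and `descendants` are names invented for the sketch.

```python
# Hypothetical functional dependencies: each attribute maps to the
# attributes it directly determines (many-to-one relationships).
DETERMINES = {
    "sale": ["product", "store", "date", "qty", "unit price"],
    "product": ["type", "brand"],
    "type": ["category"],
    "category": ["department"],
    "brand": ["brand city"],
    "store": ["store city"],
    "store city": ["county"],
    "county": ["state"],
}


def build_attribute_tree(vertex, determines, path=()):
    """Return the subtree below `vertex` as nested dicts {child: subtree}.

    The dependencies are followed depth-first; attributes already on the
    current path are skipped so that cyclic dependencies terminate.
    """
    path = path + (vertex,)
    return {
        child: build_attribute_tree(child, determines, path)
        for child in determines.get(vertex, [])
        if child not in path
    }


def descendants(subtree):
    """List every attribute in a subtree, parents before children."""
    result = []
    for child, grandchildren in subtree.items():
        result.append(child)
        result.extend(descendants(grandchildren))
    return result


# The root is the identifier of the fact; here it is simply called "sale".
tree = {"sale": build_attribute_tree("sale", DETERMINES)}
store_descendants = descendants(tree["sale"]["store"])
```

By construction, every attribute below a vertex is reached through a chain of functional dependencies starting at that vertex, which is the property required of the tree.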

Example (from the E/R scheme):


[Figure: E/R scheme of the operational sales database. Entities: MARKETING GROUP, TYPE, CATEGORY, DEPARTM., PRODUCT, BRAND, CITY, COUNTY, STATE, SALE DISTRICT, STORE, PURCHASE TICKET, WAREHOUSE. The sale relationship carries qty and unit price. Other attributes shown include marketing manager, department manager, sales manager, diet, size, weight, ticket number, date, address, phone and district no.]

Example (from the E/R scheme):

[Figure: the attribute tree rooted in sale, built from the E/R scheme. Attributes shown: qty, unit price, product, size, weight, diet, brand, city, county, state, type, mark. grp., manager, category, dept., ticket number, date, store, address, phone, sales manager, district no+state.]

Dropping useless attributes

Some attributes in the tree may be uninteresting for the DW. In order to drop useless levels of detail, it is possible to apply the following operators:
Pruning: delete a vertex and its subtree.
Grafting: delete a vertex and move its subtree, attaching its children to the parent of the deleted vertex. It is useful when an attribute is not interesting but the attributes it determines must be preserved.

[Figure: examples of pruning and grafting on a fragment of the attribute tree (ticket number, store, date, sales manager, address, city, state)]
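The two operators can be sketched on a tree stored as nested Python dictionaries, where each key is an attribute and its value is the subtree of the attributes it determines. The representation, the function names `prune` and `graft`, and the small example tree are assumptions made for illustration; the sketch also assumes attribute names are unique within the tree.

```python
def prune(tree, vertex):
    """Return a copy of the tree without `vertex` and its whole subtree."""
    return {
        name: prune(children, vertex)
        for name, children in tree.items()
        if name != vertex
    }


def graft(tree, vertex):
    """Return a copy of the tree without `vertex`, keeping its subtree.

    The children of the deleted vertex are attached to its parent.
    """
    result = {}
    for name, children in tree.items():
        if name == vertex:
            result.update(graft(children, vertex))
        else:
            result[name] = graft(children, vertex)
    return result


# Hypothetical fragment of an attribute tree rooted in "sale".
TREE = {
    "sale": {
        "ticket number": {
            "store": {"address": {}, "sales manager": {}},
            "date": {},
        },
        "qty": {},
    }
}

# The ticket number is not interesting, but store and date must survive.
grafted = graft(TREE, "ticket number")
# The store address is not needed at all.
pruned = prune(grafted, "address")
```

Both functions return new trees and leave the input untouched, so alternative simplifications of the same attribute tree can be compared side by side.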

Defining dimensions

The choice of dimensions determines the fact granularity. Dimensions must be chosen among the children of the root in the attribute tree. Time should always be a dimension.

[Figure: the attribute tree of the sale example after pruning and grafting]

Defining measures

Measures must be chosen among the children of the root. Typically, measures are computed either by counting the number of instances of F, or by summing (averaging, ...) expressions which involve numerical attributes. An attribute cannot be both a measure and a dimension. A fact may have no measures.

[Figure: the attribute tree of the sale example after pruning and grafting]
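A minimal Python sketch of how measures could be derived once the dimensions are fixed: operational rows are grouped by the chosen dimensions (product, store, date), and the measures are obtained by counting instances and by summing expressions over numerical attributes. The rows, the field layout and the measure names are hypothetical.

```python
from collections import defaultdict

# Hypothetical operational rows: (product, store, date, qty, unit_price).
PURCHASES = [
    ("Coke",  "S1", "1999-10-10", 2, 1.50),
    ("Coke",  "S1", "1999-10-10", 1, 1.50),
    ("Fanta", "S1", "1999-10-10", 4, 1.00),
    ("Coke",  "S2", "1999-10-11", 3, 1.50),
]


def aggregate(rows):
    """Group rows by (product, store, date) and compute the measures."""
    facts = defaultdict(
        lambda: {"qty_sold": 0, "revenue": 0.0, "no_of_sales": 0}
    )
    for product, store, date, qty, unit_price in rows:
        cell = facts[(product, store, date)]
        cell["qty_sold"] += qty               # sum of a numerical attribute
        cell["revenue"] += qty * unit_price   # sum of an expression
        cell["no_of_sales"] += 1              # count of instances
    return dict(facts)


facts = aggregate(PURCHASES)
```

Each key of `facts` is one combination of dimension values, so the number of keys is the number of events stored at the chosen granularity.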

Granularity

Defining the granularity of data is a primary issue in determining performance. Granularity depends on the queries users are interested in, and represents a trade-off between query response time and detail of information to be stored.
It may be worth adopting a finer granularity than that required by users, provided that this does not slow down the system too much. The choice is constrained by the maximum time frame available for loading.

Choosing granularity includes defining the refresh interval.

Issues to be considered:
Availability of operational data
Workload characteristics
The total time period to be analysed

WAND: a CASE tool for data warehouse design

A design methodology is almost useless if no CASE tool to support it is provided.

Acquire the relational db scheme via ODBC
Carry out conceptual design
Define the workload
Calculate data volume
Carry out logical design
Create the documentation (including loading/feeding queries)

Bibliography (1)

K. Aberer, K. Hemm. A methodology for building a data warehouse in a scientific environment. Proc. 1st Int. Conf. on Cooperative Inf. Systems, Brussels (1996).
R. Agrawal, A. Gupta, S. Sarawagi. Modeling multidimensional databases. IBM Research Report, IBM Almaden Research Center (1995).
M. Blaschka et al. Finding your way through multidimensional data models. Proc. DEXA98 (1998).
L. Cabibbo, R. Torlone. A logical approach to multidimensional databases. EDBT 98 (1998).
A. Datta, H. Thomas. A conceptual model and algebra for on-line analytical processing in data warehouses. Proc. WITS97 (1997).
E. Franconi, U. Sattler. A data warehouse conceptual model for multidimensional aggregation. Proc. DMDW99 (1999).
M. Golfarelli, D. Maio, S. Rizzi. The Dimensional Fact Model: a conceptual model for data warehouses. Int. Jour. of Cooperative Inf. Systems 7, 2&3 (1998).
M. Golfarelli, S. Rizzi. Designing the data warehouse: key steps and crucial issues. Jour. of Computer Science and Information Management 2, 3 (1999).

Bibliography (2)

M. Gyssens, L.V.S. Lakshmanan. A foundation for multi-dimensional databases. Proc. 23rd VLDB, Athens, Greece (1997).
B. Hüsemann, J. Lechtenbörger, G. Vossen. Conceptual data warehouse design. Proc. DMDW00 (2000).
R. Kimball. The Data Warehouse Toolkit. John Wiley & Sons (1996).
D. Moody, M. Kortink. From enterprise models to dimensional models: a methodology for data warehouse and data mart design. Proc. DMDW00 (2000).
T. Bach Pedersen, C. Jensen. Multidimensional data modelling for complex data. Proc. 15th ICDE, Sydney (1999).
C. Sapia et al. Extending the E/R model for the multidimensional paradigm. Proc. ER98 (1998).
N. Tryfona, F. Busborg, J. Christiansen. starER: A Conceptual Model for Data Warehouse Design. Proc. DOLAP99 (1999).
P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. Proc. 10th SSDBM Conf., Capri, Italy (1998).