DWtutorial PDF
Motivation
Building a data warehouse for an enterprise is a huge and complex
task that requires careful planning to find satisfactory answers to
organizational and architectural questions. Despite the pressing
demand for working solutions from enterprises and the wide range of
advanced technologies offered by vendors, few attempts have been made
to devise a specific methodology for data warehouse design. Yet
statistical reports on DW project failures state that a major cause
lies in the absence of a global view of the design process: in other
words, in the absence of a design methodology.
Summary
• Introduction to Data Warehousing
• Conceptual design of Data Warehouses
• Workload-based logical design for ROLAP
• Indexes for physical design
Introduction to Data Warehousing
Stefano Rizzi
Information as a resource
• Information is a resource of increasing value, required by
managers to schedule and monitor enterprise activities effectively.
• Information is the raw material transformed by information
systems, just as unfinished products are transformed by
manufacturing systems.
[Figure: manufacturing system → finished product; information system → information]
Value of information
• Information is an enterprise resource like capital, raw
materials, plants and people; thus, it has a cost.
• Hence, understanding the value of information is important.
[Figure: primary information sources, selected information, reports; axis label: Amount]
Different kinds of information systems
[Figure: pyramid of information systems by organizational level]
• Strategic level: ESS (senior managers)
• Management level: MIS, DSS (middle managers)
• Knowledge level: OAS, KWS (knowledge and data workers)
• Operational level: TPS (operational managers)
Functional areas: sales and marketing, manufacturing, finance, accounting, human resources
Data Warehousing
• A collection of technologies and tools supporting the
knowledge worker (executive, manager, analyst) in
analysing data aimed at decision making and at
improving the knowledge assets of the enterprise.
Data Warehouse
At the core of the architecture of modern information systems,
it is a data repository:
- Subject-oriented
- Integrated and consistent
- Representing temporal evolution
- Non-volatile
Data Warehouse
[Figure: Data Warehouse architecture. Operational data (relational, legacy) and external data pass through ETL tools into the warehouse, which also holds summary data. Access tools: reporting tools, analysis tools (OLAP), what-if analysis, data mining.]
Data Marts
[Figure: the Data Warehouse and its data marts: marketing, finance, geographical regions, client management, supplier management]
[Figure: emphasis on applications versus emphasis on subjects; visible labels: reservations, admissions, medical reports, charge consumption]
Integration and consistency
[Figure: sources (DB, text files, external data) are reached through wrappers, mediators and loaders and feed the DW. Stages shown: schema integration, extraction, transformation, cleaning, validation, filtering, loading.]
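The stages named in the figure (cleaning, transformation, validation, loading) can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record fields (`store`, `qty`, `date`), the rules applied at each stage, and all function names are hypothetical and do not come from the tutorial.

```python
from datetime import date

# Hypothetical raw rows, as heterogeneous sources might deliver them.
RAW_ROWS = [
    {"store": " BigStore 1 ", "qty": "3", "date": "1999-10-10"},
    {"store": "bigstore 1", "qty": "-2", "date": "1999-10-10"},   # fails validation
    {"store": "BigStore 2", "qty": "5", "date": "1999-10-11"},
    {"store": "", "qty": "1", "date": "1999-10-11"},              # fails cleaning
]


def clean(row):
    """Cleaning: trim and normalise the store name; drop rows without one."""
    name = " ".join(row["store"].split()).title()
    return {**row, "store": name} if name else None


def transform(row):
    """Transformation: convert text fields to typed values."""
    return {
        "store": row["store"],
        "qty": int(row["qty"]),
        "date": date.fromisoformat(row["date"]),
    }


def is_valid(row):
    """Validation (assumed rule): a sold quantity must be positive."""
    return row["qty"] > 0


def etl(raw_rows):
    """Run cleaning, transformation and validation, then load into a list."""
    warehouse = []  # stands in for the DW table being loaded
    for raw in raw_rows:
        cleaned = clean(raw)
        if cleaned is None:
            continue
        typed = transform(cleaned)
        if is_valid(typed):
            warehouse.append(typed)
    return warehouse


loaded = etl(RAW_ROWS)
```

Only two of the four raw rows reach the warehouse list. Keeping each stage as a separate function makes it easy to see which rule rejected a row.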
Temporal evolution
[Figure: temporal evolution of data, OLTP versus DW]
Non-volatility
[Figure: OLTP data are subject to updates; DW data are only loaded and accessed]
DW vs. OLTP
DW                                                     | OLTP
90% ad hoc queries                                     | 90% predefined transactions
Mostly read access                                     | Read/write access
Hundreds of users                                      | Thousands of users
Denormalised                                           | Normalised
Supports historical versions                           | Does not support historical versions
Optimised for accesses involving most of the database  | Optimised for accesses involving a small fraction of the database
Based on summary data                                  | Based on elemental data
ROLAP (Relational OLAP)
• Intermediate level server between a relational back-end server
and the front-end client
• Specialised middleware
• Generation of SQL multi-statements for the back-end server
• Query scheduling
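To make the idea of middleware that generates SQL for the back-end server more concrete, the sketch below turns a multidimensional request (measures plus grouping levels) into one SQL statement. The function name `build_rollup_sql`, the table `SALES` and its columns are hypothetical; a real ROLAP server would also handle joins, multi-statement plans and scheduling.

```python
def build_rollup_sql(fact_table, measures, group_by, filters=None):
    """Build a single aggregate SELECT for a multidimensional request.

    measures: mapping of output name -> (aggregate operator, column)
    group_by: list of columns (dimension levels) to aggregate on
    filters:  optional list of SQL predicates given as plain strings
    """
    select_items = list(group_by)
    for name, (operator, column) in measures.items():
        select_items.append(f"{operator}({column}) AS {name}")

    sql = f"SELECT {', '.join(select_items)} FROM {fact_table}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


# Hypothetical request: total quantity sold per state and year, for 1999.
query = build_rollup_sql(
    "SALES",
    {"total_qty": ("SUM", "qty_sold")},
    ["store_state", "sale_year"],
    filters=["sale_year = 1999"],
)
```

The grouping columns appear both in the SELECT list and in the GROUP BY clause, which is what lets the client change the aggregation level by changing only the `group_by` argument.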
[Figure: refinement of information discovery: statistics and data reporting, data warehousing, OLAP, data mining]
The Data Warehouse Market
[Figure: Data Warehouse market by segment, 1998 to 2002: RDBMS, OLAP, data marts, ETL, data quality, metadata. Source: Shilakes, Tylman, Enterprise Information Portals]
The DW life-cycle
Bibliography
• R. Barquin, S. Edelstein. Planning and Designing the Data Warehouse. Prentice Hall (1996).
• S. Chaudhuri, U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record 26, 1 (1997).
• G. Colliat. OLAP, relational and multidimensional database systems. SIGMOD Record 25, 3 (1996).
• M. Demarest. The politics of data warehousing. Http://www.hevanet.com/demarest/marc/dwpol.html
• U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. Data mining and knowledge discovery in databases: an overview. Comm. of the ACM 39, 11 (1996).
• W.H. Inmon. Building the data warehouse. John Wiley & Sons (1996).
• S. Kelly. Data Warehousing in Action. John Wiley & Sons (1997).
• R. Kimball. The data warehouse toolkit. John Wiley & Sons (1996).
• R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons (1998).
• C. Shilakes, J. Tylman. Enterprise Information Portals. Http://www.sagemaker.com/company/downloads/eip/indepth.pdf
• P. Vassiliadis. Gulliver in the land of data warehousing: practical experiences and observations of a researcher. Proc. DMDW’2000 (2000).
• J. Widom. Research Problems in Data Warehousing. Proc. CIKM (1995).
Conceptual modelling for Data Warehousing
Why a new conceptual model?
• While it is universally recognised that a DW relies on a
multidimensional model, there is no agreement on the
approach to conceptual modelling.
• On the other hand, an accurate conceptual design is
the necessary foundation for building a “good”
information system.
• The Entity/Relationship model is widespread in
enterprises, but…
[Figure: a multidimensional cube with axes including Time and Product. Example cells: the number of Pepsi cans sold at all BIGSTORES on 10/10/99; the number of Fanta cans globally sold.]
Basic terminology
• Fact (cube, target). It is a focus of interest for the decision-making process; typically, it models an event occurring in the enterprise world (sales, shipments, purchases). It is essential for a fact to have some dynamic aspects, i.e., to evolve somehow across time.
• Measures (attributes, variables, metrics, properties). They are continuous-valued (typically numerical) attributes which describe a fact from different points of view. For instance, each sale is measured by its revenue.
• Dimensions. They are discrete attributes which determine the minimum granularity adopted to represent facts. Typical dimensions for the sale fact are product, store and date.
• Hierarchies (dimensions). They contain dimension attributes (levels, parameters) connected in a tree-like structure by many-to-one relationships (functional dependencies).
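The terms above can be illustrated with a few lines of code: fact instances identified by their dimension values, one measure, and a hierarchy expressed as a many-to-one mapping that supports roll-up. This is a minimal sketch; the store codes, city names, revenue values and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# A hierarchy is a chain of many-to-one mappings between dimension attributes.
# Hypothetical data: each store belongs to exactly one city.
STORE_TO_CITY = {"S1": "Bologna", "S2": "Bologna", "S3": "Rome"}

# Each fact instance is identified by its dimension values (product, store,
# date) and carries one measure (revenue).
SALES = [
    {"product": "cola", "store": "S1", "date": "1999-10-10", "revenue": 12.0},
    {"product": "cola", "store": "S2", "date": "1999-10-10", "revenue": 8.0},
    {"product": "cola", "store": "S3", "date": "1999-10-10", "revenue": 5.0},
]


def roll_up(facts, dimension, mapping, measure):
    """Sum a measure after replacing one dimension by its parent level."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(
            mapping[value] if name == dimension else value
            for name, value in fact.items()
            if name != measure
        )
        totals[key] += fact[measure]
    return dict(totals)


revenue_by_city = roll_up(SALES, "store", STORE_TO_CITY, "revenue")
```

Because the store-to-city mapping is many-to-one, every sale lands in exactly one city group, which is what makes the aggregated totals well defined.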
DW modelling in the literature
[Figure, repeated across slides: the approaches listed below are classified as conceptual or logical, as formal or graphical, and by whether they provide an algebra]
• Golfarelli et al. 98
• Gyssens, Lakshmanan 97
• Hüsemann et al. 00
• Vassiliadis 98
• Agrawal et al. 95
• Sapia et al. 98
• Cabibbo, Torlone 98
• Datta, Thomas 97
• Tryfona et al. 99
• Franconi, Sattler 99
• Li, Wang 96
Conceptual models
• Sapia, Blaschka, Höfling, Dinter (1998)
[Figure: example scheme; notation labels: dimension level, roll-up relationship, attribute, fact relationship]
[Figure: example scheme; notation label: aggregated entity]
Conceptual models (3)
• Hüsemann, Lechtenbörger, Vossen (2000)
[Figure: example scheme; notation labels: fact, dimension, dimension level, optional, property attribute, measure]
The Dimensional Fact Model (2)
• Three levels of conceptual documentation are provided:
  - Fact scheme: represents a fact of interest and the associated measures, dimensions and hierarchies.
  - Data Mart scheme: summarizes the fact schemes which constitute each data mart and emphasizes the feasible connections between them.
  - Data Warehouse scheme: shows the different data marts, emphasizing their overlaps, the different profiles of the users accessing them, and the operational sources which feed them.
• Each documentation level is complemented by glossaries which explain the names adopted within the schemes, define a connection between the DW data and the operational sources, and express data volumes.
Fact schemes
[Figure: the SALE fact scheme, with callouts for fact, measure, dimension, dimension attribute and hierarchy. Measures: qty sold, revenue, unit price, no. of customers. Attributes around product: type, category, department, marketing group, brand, brand city. Attributes around store: store city, county, state, sales manager, sale district. Attributes around date: week, month, quarter, year, day of week, holiday.]
Fact schemes (2)
- A non-dimension attribute contains additional information about a dimension attribute, and is typically connected to it by a one-to-one relationship. It cannot be used for aggregation.
- Some links between attributes can be optional.
[Figure: the SALE fact scheme extended with the attributes manager, diet, address and phone, and with a promotion hierarchy (begin date, end date, price reduction, ad type, cost); callouts mark a non-dimension attribute and an optionality]
[Figure: fact scheme detail with a convergence callout; visible attributes: promotion, ad type, begin date, end date, price reduction]
The SHIPMENTS fact scheme
[Figure: the SHIPMENT TO STORES fact scheme. Measures: qty shipped, shipping cost. Visible attributes: date, week, month, quarter, year, fiscal week, fiscal month, fiscal quarter, fiscal year, day of week; warehouse, warehouse city, warehouse state; store, store city, store state; category, marketing group; mode, type, carrier.]
[Figure: the INVENTORY fact scheme. Measure: level (operators AVG, MIN). Visible attributes: product, type, category, marketing group, brand, brand city, units per pallet, package type, package size, weight; warehouse (with city and nation); date, week, month, quarter, year, fiscal week, fiscal month, fiscal quarter, fiscal year, day of week.]
The “supply chain”
[Figure: supply-chain fact schemes; visible labels: component, from factory, mode]
Glossaries
ATTRIBUTE GLOSSARY: SHIPMENT TO STORES
name             | description                    | domain         | card.
product          |                                | products       | 5000
brand            |                                | brands         | 800
brand city       | Where brands are manufactured  | cities         | 50
type             | (pasta, soft drink, …)         | pr. types      | 200
category         | (food, clothing, music,…)      | pr. categories | 10
department       | Deps. managing categories      | deps.          | 5
marketing group  | Responsible for product types  | groups         | 20
stores           |                                | stores         | 100
store city       |                                | cities         | 80
store state      |                                | states         | 5
....             | ....                           | ....           | ....

query (rows product to marketing group):
    select prodName,brandName,cityName,…
    from PRODUCTS P,BRANDS B,CITIES C,…
    where P.brandId = B.brandId
      and B.cityId = C.cityId
      and . . .
query (rows stores to store state):
    select storeName,cityName,stateName
    from STORES S,CITIES C
    where S.cityId = C.cityId . . .
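The glossary ties each attribute to a query on the operational source. The sketch below runs a query of the same shape against an in-memory SQLite database. Table and column names follow the glossary where they are visible (PRODUCTS, BRANDS, CITIES, prodName, brandName, cityName, brandId, cityId); the `prodId` column, the sample rows, and the reading of "card." as cardinality (number of distinct values) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE CITIES   (cityId INTEGER PRIMARY KEY, cityName TEXT);
    CREATE TABLE BRANDS   (brandId INTEGER PRIMARY KEY, brandName TEXT,
                           cityId INTEGER REFERENCES CITIES(cityId));
    CREATE TABLE PRODUCTS (prodId INTEGER PRIMARY KEY, prodName TEXT,
                           brandId INTEGER REFERENCES BRANDS(brandId));

    INSERT INTO CITIES   VALUES (1, 'CityA'), (2, 'CityB');
    INSERT INTO BRANDS   VALUES (10, 'BrandX', 1), (11, 'BrandY', 2);
    INSERT INTO PRODUCTS VALUES (100, 'Prod1', 10), (101, 'Prod2', 10),
                                (102, 'Prod3', 11);
    """
)

# Same join shape as the glossary query, limited to the three visible tables.
rows = conn.execute(
    """
    SELECT prodName, brandName, cityName
    FROM PRODUCTS P, BRANDS B, CITIES C
    WHERE P.brandId = B.brandId AND B.cityId = C.cityId
    ORDER BY prodName
    """
).fetchall()

# Assumed meaning of the 'card.' column: the number of distinct values.
brand_cardinality = conn.execute(
    "SELECT COUNT(DISTINCT brandName) FROM BRANDS"
).fetchone()[0]
```

Each result row carries a product together with the values of its parent attributes (brand, brand city), which is the information needed to populate the product hierarchy.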
Data mart schemes
• The data mart scheme is used to summarize the fact schemes which constitute the data mart and to show drill-across connections between them.
• It is a graph whose nodes are elemental and overlapped fact schemes; the arcs are directed to each overlapped scheme from its component schemes, which in turn may be overlapped.
[Figure: DATA MART SCHEME: SUPPLY CHAIN, with nodes MANUFACTURING, PACKAGING, and MANUFACTURING AND PACKAGING]
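The graph described above can be held in a small dictionary that maps each overlapped scheme to its components. The sketch below uses the node names from the SUPPLY CHAIN figure; the dictionary layout and the helper `elemental_schemes` are assumptions made for illustration.

```python
# Arcs go from component schemes to the overlapped scheme built from them.
# Here each overlapped scheme is mapped to the list of its components.
COMPONENTS = {
    "MANUFACTURING AND PACKAGING": ["MANUFACTURING", "PACKAGING"],
}


def elemental_schemes(scheme, components=COMPONENTS):
    """Return the set of elemental fact schemes underlying a scheme."""
    parts = components.get(scheme)
    if not parts:            # no components: the scheme is elemental
        return {scheme}
    result = set()
    for part in parts:       # components may themselves be overlapped
        result |= elemental_schemes(part, components)
    return result
```

For example, `elemental_schemes("MANUFACTURING AND PACKAGING")` returns the set containing `"MANUFACTURING"` and `"PACKAGING"`, while an elemental scheme simply returns itself.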
The workload
• In principle, the workload for a data mart is dynamic and unpredictable.
• In some commercial tools, the actual workload is monitored while the DW is operating and the logical and physical schemes are dynamically tuned.
The workload (2)
[Figure: the SHIPMENT TO STORES fact scheme, as shown earlier]
Conceptual design of Data Warehouses
Designing the DW
• A successful approach to DW design should mix top-down
and bottom-up strategies.
Data Mart prototyping
First prototype the data mart that:
• plays the most strategic role for the enterprise;
• can convince the final users of the potential benefits;
• relies on available and consistent data sources.
[Figure: data marts DM1 to DM5]
Reference architecture
[Figure: reference architecture in which reconciled data (integration of heterogeneous sources) feed the DW. Callout: the problem of designing the reconciled data.]
Methodological framework
[Figure: phases of the methodology and the actors involved]
Phases, in the order shown:
• Analysis of the operational db
• Requirement specification
• Conceptual design
• Workload refinement
• Logical design
• Physical design
Actors: db administrator, designer, final user.
DWs are based on a pre-existing information system.
Conceptual design of the data mart
• Design is based on the documentation of the underlying operational information system:
  - E/R schemes
  - Relational schemes
• Steps:
  - Find facts
  - For each fact:
    • Navigate functional dependencies
    • Drop useless attributes
    • Define dimensions and measures
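The per-fact steps listed above can be sketched as a small procedure: follow functional dependencies from the fact to build an attribute tree, prune the attributes judged useless, then pick dimensions and measures among the children of the root. This is a simplified illustration, not the tutorial's algorithm: the dependency table is hypothetical, and the sketch visits each attribute only once, so an attribute reachable along two paths appears on the first path only.

```python
def build_attribute_tree(root, dependencies, dropped=()):
    """Build a nested dict by following functional dependencies from root.

    dependencies maps an attribute to the attributes it determines.
    Attributes listed in `dropped` are pruned together with their subtrees.
    Each attribute is visited once, so cycles cannot cause endless recursion.
    """
    visited = {root}

    def expand(attribute):
        children = {}
        for child in dependencies.get(attribute, []):
            if child in dropped or child in visited:
                continue
            visited.add(child)
            children[child] = expand(child)
        return children

    return {root: expand(root)}


# Hypothetical dependencies, loosely following the sale example.
FDS = {
    "sale": ["qty", "unit price", "date", "product", "store"],
    "product": ["type", "brand", "weight"],
    "type": ["category"],
    "store": ["city", "address"],
    "city": ["state"],
}

tree = build_attribute_tree("sale", FDS, dropped={"weight", "address"})
dimensions = ["date", "product", "store"]   # chosen among the root children
measures = ["qty", "unit price"]            # also chosen among the root children
```

The nested dictionary makes the hierarchies visible directly: `tree["sale"]["store"]` holds the chain from store to city to state.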
Finding facts
Navigating functional dependencies
Example (from the E/R scheme):
[Figure: attribute tree rooted in sale, derived from the E/R scheme. Visible attributes: qty, unit price, ticket number, date, product, type, category, dept., manager, mark. grp., brand, city, size, diet, weight, store, address, phone, sales manager, district no, district no+state, county, state.]
Defining dimensions
• The choice of dimensions determines the fact granularity.
• Dimensions must be chosen among the children of the root in the attribute tree.
• Time should always be a dimension.
[Figure: attribute tree for the sale example; visible attributes: qty, unit price, date, product, store, type, category, dept., manager, mark. grp., brand, city, diet, weight, address, phone, sales manager, county, state, district no+state]
Defining measures
• Measures must be chosen among the children of the root.
• Typically, measures are computed either by counting the number of instances of F (the entity or relation chosen as the fact), or by summing (averaging, …) expressions which involve numerical attributes.
• An attribute cannot be both a measure and a dimension.
• A fact may have no measures.
[Figure: attribute tree for the sale example; visible attributes: qty, unit price, date, product, store, type, category, dept., manager, mark. grp., brand, city, diet, weight, address, phone, sales manager, county, state, district no+state]
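As a sketch of how such measures might be computed, the code below groups hypothetical ticket lines by the chosen dimensions (date, product, store) and derives three measures: one by summing an attribute, one by summing an expression, and one by counting instances. The row layout and the names are assumptions; counting lines as customers further assumes that a ticket contains at most one line per product.

```python
from collections import defaultdict

# Hypothetical operational rows: one row per product line on a sales ticket.
TICKET_LINES = [
    {"ticket": 1, "date": "1999-10-10", "product": "P1", "store": "S1",
     "qty": 2, "unit_price": 1.5},
    {"ticket": 2, "date": "1999-10-10", "product": "P1", "store": "S1",
     "qty": 1, "unit_price": 1.5},
    {"ticket": 3, "date": "1999-10-10", "product": "P2", "store": "S1",
     "qty": 4, "unit_price": 0.5},
]


def build_sale_facts(lines):
    """Group lines by (date, product, store) and compute three measures."""
    facts = defaultdict(
        lambda: {"qty_sold": 0, "revenue": 0.0, "no_of_customers": 0}
    )
    for line in lines:
        fact = facts[(line["date"], line["product"], line["store"])]
        fact["qty_sold"] += line["qty"]                       # sum of qty
        fact["revenue"] += line["qty"] * line["unit_price"]   # sum of qty * unit price
        fact["no_of_customers"] += 1                          # count of lines
    return dict(facts)


sale_facts = build_sale_facts(TICKET_LINES)
```

The two lines for product P1 collapse into a single fact instance, which shows how the choice of dimensions fixes the level at which the operational data are summarised.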
Granularity
• Defining the granularity of data is a primary issue in determining performance. Granularity depends on the queries users are interested in, and represents a trade-off between query response time and the level of detail of the stored information.
  - It may be worth adopting a finer granularity than users require, provided that this does not slow down the system too much.
  - Granularity is also constrained by the maximum time frame available for loading.
• Choosing granularity includes defining the refresh interval.
  - Issues to be considered:
    • Availability of operational data
    • Workload characteristics
    • The total time period to be analysed
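A rough way to see the trade-off is to compare upper bounds on the number of fact instances at two granularities, taking one instance per combination of dimension values. The product, type, store and store-city cardinalities below are the ones listed in the attribute glossary; the 365 dates and 12 months (one year of data) are assumptions, and real fact tables are usually far sparser than this bound.

```python
from math import prod


def max_fact_count(cardinalities):
    """Upper bound on fact instances: one per combination of dimension values."""
    return prod(cardinalities.values())


# Fine granularity: product by store by date.
fine = {"product": 5000, "store": 100, "date": 365}
# Coarser granularity: product type by store city by month.
coarse = {"type": 200, "store city": 80, "month": 12}

fine_rows = max_fact_count(fine)       # 5000 * 100 * 365
coarse_rows = max_fact_count(coarse)   # 200 * 80 * 12
```

Under these assumptions the bound drops from 182,500,000 to 192,000 instances, which gives a feel for how much storage and response time a coarser granularity can save at the price of lost detail.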
WAND
a CASE tool for data warehouse design
Bibliography (1)
• K. Aberer, K. Hemm. A methodology for building a data warehouse in a scientific environment. Proc. 1st Int. Conf. on Cooperative Inf. Systems, Brussels (1996).
• R. Agrawal, A. Gupta, S. Sarawagi. Modeling multidimensional databases. IBM Research Report, IBM Almaden Research Center (1995).
• M. Blaschka et al. Finding your way through multidimensional data models. Proc. DEXA’98 (1998).
• L. Cabibbo, R. Torlone. A logical approach to multidimensional databases. EDBT 98 (1998).
• A. Datta, H. Thomas. A conceptual model and algebra for on-line analytical processing in data warehouses. Proc. WITS’97 (1997).
• E. Franconi, U. Sattler. A data warehouse conceptual model for multidimensional aggregation. Proc. DMDW’99 (1999).
• M. Golfarelli, D. Maio, S. Rizzi. The Dimensional Fact Model: a conceptual model for data warehouses. Int. Jour. of Cooperative Inf. Systems 7, 2&3 (1998).
• M. Golfarelli, S. Rizzi. Designing the data warehouse: key steps and crucial issues. Jour. of Computer Science and Information Management 2, 3 (1999).
Bibliography (2)
• M. Gyssens, L.V.S. Lakshmanan. A foundation for multi-dimensional databases. Proc. 23rd VLDB, Athens, Greece (1997).
• B. Hüsemann, J. Lechtenbörger, G. Vossen. Conceptual data warehouse design. Proc. DMDW’00 (2000).
• R. Kimball. The data warehouse toolkit. John Wiley & Sons (1996).
• D. Moody, M. Kortink. From enterprise models to dimensional models: a methodology for data warehouse and data mart design. Proc. DMDW’00 (2000).
• T. Bach Pedersen, C. Jensen. Multidimensional data modelling for complex data. Proc. 15th ICDE, Sydney (1999).
• C. Sapia et al. Extending the E/R model for the multidimensional paradigm. Proc. ER’98 (1998).
• N. Tryfona, F. Busborg, J. Christiansen. starER: A Conceptual Model for Data Warehouse Design. Proc. DOLAP’99 (1999).
• P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. Proc. 10th SSDBM Conf., Capri, Italy (1998).