Motivation
Building a data warehouse for an enterprise is a huge and complex task that requires careful planning aimed at devising satisfactory answers to organizational and architectural questions. Despite the pressing demand for working solutions from enterprises and the wide range of advanced technologies offered by vendors, few attempts have been made to devise a specific methodology for data warehouse design. Yet statistical reports on DW project failures state that a major cause lies in the absence of a global view of the design process: in other words, in the absence of a design methodology.
Summary
Introduction to Data Warehousing
Conceptual design of Data Warehouses
Workload-based logical design for ROLAP
Indexes for physical design
Information systems are rooted in the relationship between information, decision and control. An IS should collect and classify information, by means of integrated and suitable procedures, in order to produce, in time and at the right levels, the synthesis needed to support the decision-making process, as well as to administer and globally control the enterprise activity.
Information as a resource
Information is a resource of increasing value, required by managers to schedule and monitor enterprise activities effectively. Information is the raw material that information systems transform, just as manufacturing systems transform unfinished products.
Manufacturing system Finished product
Value of information
Information is an enterprise resource like capital, raw materials, plants and people; thus, it has a cost. Hence, understanding the value of information is important.
[Figure: Value vs. Amount of information; labels: Reports, Strategic directions]
Senior managers
Finance
Accounting
Human resources
Usual complaints:
We have tons of data but we cannot access them!
How can people playing the same role produce substantially different results?
We want to slice and dice data in any possible way!
Show me only what is important!
Everyone knows some data are incorrect...
(R. Kimball, The Data Warehouse Toolkit)
Data Warehousing
A collection of technologies and tools supporting the knowledge worker (executive, manager, analyst) in analysing data aimed at decision making and at improving the knowledge assets of the enterprise.
Data Warehouse
At the core of the architecture of modern information systems, it is a data repository:
Oriented to subjects
Integrated and consistent
Representing temporal evolution
Non-volatile
The data warehouse is regularly refreshed, permanently growing, logically centralised and easily accessed by users, essentially read-only
Data Warehouse
Operational data (relational, legacy) External data
ETL tools
Warehouse
Data Marts
Data Warehouse Replication and broadcasting
Data mart
Marketing Finance Geographical regions Client management Supplier management
Subject vs Process
region charge consumption
patient
reservations
Medical reports
admissions
Emphasis on applications
Emphasis on subjects
DW
DB
wrappers
mediators loaders
Text files
Temporal evolution
OLTP: current values
Restricted historical content; time is often not included in keys; data are updated.
DW: snapshots
Rich historical content; time is included in keys; snapshots cannot be updated.
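The contrast between current values and snapshots can be sketched in a few lines. This is only an illustration under assumed names: the account key, the dates and the `load_snapshot` helper are hypothetical and do not come from the slides.

```python
# OLTP style: one row per key, updated in place (only the current value survives).
oltp_balance = {"acct-1": 100}
oltp_balance["acct-1"] = 130  # the previous value 100 is lost

# DW style: time is part of the key, and snapshots are only appended.
dw_balance = {}

def load_snapshot(dw, key, date, value):
    """Append a snapshot; refuse to overwrite an existing one."""
    if (key, date) in dw:
        raise ValueError("snapshots cannot be updated")
    dw[(key, date)] = value

load_snapshot(dw_balance, "acct-1", "1999-10-09", 100)
load_snapshot(dw_balance, "acct-1", "1999-10-10", 130)

# The full history of the account can be read back from the DW.
history = sorted((d, v) for (k, d), v in dw_balance.items() if k == "acct-1")
```

Because the date belongs to the key, loading the next day adds a row instead of replacing one, which is what makes the historical content "rich".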
Non-volatility
OLTP
update
DW
load access
insert
delete
In a DW, no advanced techniques for transaction management are required (unlike in OLTP systems). The key issues are query throughput and resilience.
DW
DW                                                      OLTP
90% ad hoc queries                                      90% predefined transactions
Mostly read access                                      Read/write access
Hundreds of users                                       Thousands of users
Denormalised                                            Normalised
Supports historical versions                            Does not support historical versions
Optimised for accesses involving most of the database   Optimised for accesses involving a small fraction of the database
Based on summary data                                   Based on elemental data
Intermediate-level server between a relational back-end server and the front-end client
Specialised middleware
Generation of SQL multi-statements for the back-end server
Query scheduling

Direct support of multi-dimensional views
Special data structures (e.g., multi-dimensional arrays)
Compression techniques
Intelligent disk/memory caching
Pre-computation
Complex analysis
The DW life-cycle
Objective definition and planning
Clearly determine the scope, define the boundaries, estimate the size, choose the design approach, evaluate the benefits
Infrastructure design
Choose the technologies and the tools, analyse the architectural solutions, solve the management problems
Bibliography
R. Barquin, S. Edelstein. Planning and Designing the Data Warehouse. Prentice Hall (1996).
S. Chaudhuri, U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record 26, 1 (1997).
G. Colliat. OLAP, relational and multidimensional database systems. SIGMOD Record 25, 3 (1996).
M. Demarest. The politics of data warehousing. Http://www.hevanet.com/demarest/marc/dwpol.html
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. Data mining and knowledge discovery in databases: an overview. Comm. of the ACM 39, 11 (1996).
W.H. Inmon. Building the data warehouse. John Wiley & Sons (1996).
S. Kelly. Data Warehousing in Action. John Wiley & Sons (1997).
R. Kimball. The data warehouse toolkit. John Wiley & Sons (1996).
R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons (1998).
C. Shilakes, J. Tylman. Enterprise Information Portals. Http://www.sagemaker.com/company/downloads/eip/indepth.pdf
P. Vassiliadis. Gulliver in the land of data warehousing: practical experiences and observations of a researcher. Proc. DMDW2000 (2000).
J. Widom. Research Problems in Data Warehousing. Proc. CIKM (1995).
While it is universally recognised that a DW leans on a multidimensional model, there is no agreement on the approach to conceptual modelling. On the other hand, an accurate conceptual design is the necessary foundation for building a good information system. The Entity/Relationship model is widespread in enterprises, but...
"Entity relation data models [...] cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses." (Kimball, 96)
Sales
Store
Time
Product
Number of Pepsi cans sold at all BIGSTORES on 10/10/99
Number of Fanta cans globally sold
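The two example queries can be read as aggregations over a cube with the dimensions Store, Time and Product. The sketch below assumes a tiny in-memory cube; the store names, the second date and all quantities are invented for illustration, and `total` is a hypothetical helper.

```python
# Hypothetical cube cells: (store, date, product) -> number of cans sold.
cube = {
    ("BIGSTORE Rome", "10/10/99", "Pepsi"): 120,
    ("BIGSTORE Milan", "10/10/99", "Pepsi"): 80,
    ("BIGSTORE Rome", "11/10/99", "Pepsi"): 95,
    ("BIGSTORE Rome", "10/10/99", "Fanta"): 60,
    ("BIGSTORE Milan", "11/10/99", "Fanta"): 40,
}

def total(cube, store=None, date=None, product=None):
    """Sum the cells matching the given coordinates; None means 'all values'."""
    return sum(
        qty
        for (s, d, p), qty in cube.items()
        if store in (None, s) and date in (None, d) and product in (None, p)
    )

# Pepsi cans sold at all stores on 10/10/99: fix Time and Product, aggregate over Store.
pepsi_on_day = total(cube, date="10/10/99", product="Pepsi")

# Fanta cans sold overall: fix Product, aggregate over Store and Time.
fanta_overall = total(cube, product="Fanta")
```

Fixing a coordinate selects a slice of the cube, while leaving it unspecified aggregates along that dimension.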
Basic terminology
Fact (cube, target). It is a focus of interest for the decision-making process; typically, it models an event occurring in the enterprise world (sales, shipments, purchases). It is essential for a fact to have some dynamic aspects, i.e., to evolve somehow across time.
Hierarchies (dimensions). They contain dimension attributes (levels, parameters) connected in a tree-like structure by many-to-one relationships (functional dependencies).
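A hierarchy can be pictured as a chain of many-to-one mappings, along which values are aggregated (rolled up). The following is a minimal sketch, assuming a store, city, state hierarchy; the store identifiers, place names, figures and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Each many-to-one relationship is a mapping: child value -> parent value.
store_to_city = {"S1": "Bologna", "S2": "Bologna", "S3": "Paris"}
city_to_state = {"Bologna": "Italy", "Paris": "France"}

# Hypothetical measure values at the finest level (store).
sales_by_store = {"S1": 10, "S2": 5, "S3": 7}

def roll_up(values, mapping):
    """Aggregate values from one level to its parent level by summing."""
    out = defaultdict(int)
    for child, v in values.items():
        out[mapping[child]] += v
    return dict(out)

sales_by_city = roll_up(sales_by_store, store_to_city)
sales_by_state = roll_up(sales_by_city, city_to_state)
```

Since every child value maps to exactly one parent, each sale is counted once at every level, so totals are preserved when rolling up.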
[Figures: classification of the multidimensional models proposed in the literature along the axes CONCEPTUAL vs. LOGICAL, FORMAL vs. GRAPHICAL, ALGEBRA and DESIGN. Models compared: Hüsemann et al. 00; Vassiliadis 98; Sapia et al. 98; Datta, Thomas 97; Tryfona et al. 99; Franconi, Sattler 99; Gyssens, Lakshmanan 97; Li, Wang 96; Cabibbo, Torlone 98; Agrawal et al. 95]
Conceptual models
fact relationship
property level
aggregated entity
dimension
measure
dimension level
property attribute
aggregation path
Each documentation level is complemented by glossaries which explain the names adopted within the schemes, define the connection between the DW data and the operational sources, and express data volumes. Data mart schemes are associated with the workload specification.
Fact schemes
[Figure: fact scheme SALE. Annotated elements: fact, dimension, hierarchy, dimension attribute, non-dimension attribute, cross-dimension attribute, measure, optionality, convergence. Measures: qty sold, revenue, unit price, no. of customers. Attributes shown: product, type, category, department, marketing group, brand, brand city, diet; store, store city, store county, store state, sale district, sales manager, address, phone; date, day of week, week, month, quarter, year; promotion, ad type, price reduction, begin date, end date.]
warehouse
INVENTORY
warehouse city
AVG, MIN
product date
warehouse factory
warehouse store
promotion store
Glossaries
ATTRIBUTE GLOSSARY: SHIPMENT TO STORES
Columns: name, description, domain, card., query

name             description                                   card.
product          products                                      5000
brand            brands                                        800
brand city       cities where brands are manufactured          50
type             pr. types (pasta, soft drink, ...)            200
category         pr. categories (food, clothing, music, ...)   10
department       deps. managing categories                     5
marketing group  groups responsible for product types          20
stores           stores                                        100
store city       cities                                        80
store state      states                                        5
....................

query
    select prodName, brandName, cityName, . . .
    from PRODUCTS P, BRANDS B, CITIES C, . . .
    where P.brandId = B.brandId and B.cityId = C.cityId and . . .

    select storeName, cityName, stateName
    from STORES S, CITIES C
    where S.cityId = C.cityId . . .
Columns: description, type, query
Surviving entry: shipping cost (type MONEY)

query
    select SUM(PS.qty)
    from PRODUCTS P, SHIP S, PRODSHIP PS, . . .
    where P.prodId = PS.prodId and PS.shipId = S.shipId and . . .
    group by P.prodId, S.date, . . .
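To show how a glossary query ties a measure to the operational source, the aggregate query above can be run against a toy database. This is only a sketch: the table layouts contain just the columns visible in the fragment, the elided conditions and grouping attributes are left out, and all rows are invented. Python's standard `sqlite3` module stands in for the operational DBMS.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Assumed minimal layouts; the real operational schema is not given.
    CREATE TABLE PRODUCTS (prodId INTEGER PRIMARY KEY, prodName TEXT);
    CREATE TABLE SHIP     (shipId INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE PRODSHIP (prodId INTEGER, shipId INTEGER, qty INTEGER);
    INSERT INTO PRODUCTS VALUES (1, 'pasta'), (2, 'cola');
    INSERT INTO SHIP     VALUES (10, '1999-10-10'), (11, '1999-10-10'), (12, '1999-10-11');
    INSERT INTO PRODSHIP VALUES (1, 10, 5), (1, 11, 3), (2, 10, 4), (1, 12, 2);
""")

# Total quantity shipped per product and date, following the glossary query.
rows = con.execute("""
    SELECT P.prodId, S.date, SUM(PS.qty)
    FROM PRODUCTS P, SHIP S, PRODSHIP PS
    WHERE P.prodId = PS.prodId AND PS.shipId = S.shipId
    GROUP BY P.prodId, S.date
    ORDER BY P.prodId, S.date
""").fetchall()
```

The `GROUP BY` attributes play the role of the dimensions of the fact, and the aggregated column is the measure.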
The data mart scheme is used to summarize the fact schemes which constitute the data mart and to show drill-across connections between them. It is a graph whose nodes are elemental and overlapped fact schemes; the arcs are directed to each overlapped scheme from its component schemes, which in turn may be overlapped.
DATA MART SCHEME: SUPPLY CHAIN
PRODUCTION OF COMPONENTS PRODUCTION AND DELIVERY COMPONENT DELIVERY DELIVERY AND INVENTORY COMPONENT INVENTORY
MANUFACTURING
PACKAGING
WAREHOUSE INVENTORY
DISTRIBUTION CYCLE
SHIPMENT TO WAREHOUSE
PRODUCT CYCLE
SHIPMENT TO STORES
SALE
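The data mart scheme is a directed graph, so it can be stored as a mapping from each overlapped scheme to its component schemes. The sketch below uses some scheme names from the figure, but the arcs themselves are not recoverable from the text: the ones shown here, and the scheme name COMPONENT CYCLE, are assumptions made purely for illustration.

```python
# Hypothetical arcs: overlapped scheme -> its component schemes.
# Schemes that never appear as keys are elemental.
components = {
    "PRODUCTION AND DELIVERY": ["PRODUCTION OF COMPONENTS", "COMPONENT DELIVERY"],
    "DELIVERY AND INVENTORY": ["COMPONENT DELIVERY", "COMPONENT INVENTORY"],
    # An overlapped scheme may itself be built from overlapped schemes.
    "COMPONENT CYCLE": ["PRODUCTION AND DELIVERY", "DELIVERY AND INVENTORY"],
}

def elemental_schemes(scheme, components):
    """Return the set of elemental fact schemes that a scheme is built from."""
    if scheme not in components:
        return {scheme}
    result = set()
    for part in components[scheme]:
        result |= elemental_schemes(part, components)
    return result

base = elemental_schemes("COMPONENT CYCLE", components)
```

Walking the arcs backwards in this way tells which elemental fact schemes a drill-across query on an overlapped scheme ultimately touches.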
The workload
In principle, the workload for a data mart is dynamic and unpredictable. In some commercial tools, the actual workload is monitored while the DW is operating and the logical and physical schemes are dynamically tuned.
At the highest abstraction level, the data warehouse scheme shows the different data marts emphasizing the fact schemes duplicated on two or more of them, the different profiles of the users accessing them, and the operational sources which feed them.
personnel manager personnel database
SALES
data mart
incentives PERSONNEL administrative manager
operational db
DEMAND CHAIN
SALES
file transfer
purchases restoration works
manual input
Designing the DW
Within a successful approach to DW design, top-down and bottom-up strategies should be mixed.
When planning a DW, a bottom-up approach should be followed. One data mart at a time is identified and prototyped. Each data mart is designed in a top-down fashion by building a conceptual scheme for each fact of interest.
The first data mart to be prototyped should be one that:
plays the most strategic role for the enterprise;
can convince the final users of the potential benefits;
leans on available and consistent data sources.
DM2
DM4
DM1
DM5
DM3
Source 3
Source 1
Source 2
Reference architecture
DW
Reconciled data
Methodological framework
analysis of the operational db requirement specification conceptual design db administrator
designer
Conceptual Scheme
Logical Scheme
Physical Scheme
CONCEPTUAL DESIGN
LOGICAL DESIGN
PHYSICAL DESIGN
Facts Preliminary workload Workload Target logical model Workload Target DBMS
Steps:
Find facts
For each fact:
    Navigate functional dependencies
    Drop useless attributes
    Define dimensions and measures
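The "navigate functional dependencies" step can be sketched as a recursive walk that starts from the fact and follows the many-to-one links of the source scheme, producing an attribute tree. The dependencies listed below are hypothetical, loosely inspired by the sale example, and `build_tree` is an illustrative helper that assumes the dependencies contain no cycles.

```python
# Hypothetical many-to-one links (functional dependencies) of a source scheme:
# each attribute determines the attributes listed for it.
fds = {
    "sale": ["product", "store", "date", "qty"],
    "product": ["type", "brand"],
    "type": ["category"],
    "store": ["store city"],
    "store city": ["state"],
}

def build_tree(root, fds):
    """Return the subtree below `root` as nested dicts (assumes acyclic FDs)."""
    return {child: build_tree(child, fds) for child in fds.get(root, [])}

# The top-level keys are the children of the root, i.e. of the fact itself.
tree = build_tree("sale", fds)
```

The later steps then operate on this tree: useless attributes are dropped from it, and dimensions and measures are picked among the children of the root.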
Finding facts
Within an E/R scheme, a fact is represented by either an entity F or an n-ary relationship between entities E1...En.
Within a relational scheme, a fact is represented by a relation F.
The entities and relationships representing frequently updated archives are good candidates to define facts; those representing nearly-static archives are not.
(1,1)
(1,N)
STATE (1,N)
county of (1,N) of (1,1) sales manager (1,1) COUNTY (1,N) of (1,1) CITY
PRODUCT
weight warehouse
WAREHOUSE
produced in
brand
city qty size sale sales date manager address phone city county state
type product
mark. grp.
Some attributes in the tree may be uninteresting for the DW. In order to drop useless levels of detail, it is possible to apply the following operators:
Pruning: delete a vertex and its subtree.
Grafting: delete a vertex and move its subtree up to the vertex's parent. It is useful when an attribute is not interesting but the attributes it determines must be preserved.
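The two operators can be sketched on an attribute tree stored as nested dictionaries, where the top-level keys are the children of the root. The tree below is a hypothetical fragment, the functions are illustrative rather than a reference implementation, and attribute names are assumed to be unique within the tree.

```python
def prune(tree, vertex):
    """Return a copy of the tree without `vertex` and its whole subtree."""
    return {k: prune(sub, vertex) for k, sub in tree.items() if k != vertex}

def graft(tree, vertex):
    """Return a copy without `vertex`; its children move up to its parent."""
    out = {}
    for k, sub in tree.items():
        if k == vertex:
            out.update(graft(sub, vertex))  # children take the vertex's place
        else:
            out[k] = graft(sub, vertex)
    return out

# Hypothetical fragment of an attribute tree (children of the root).
tree = {
    "store": {"sale district": {"sales manager": {}}, "address": {}},
    "date": {},
}

pruned = prune(tree, "sale district")   # sales manager disappears too
grafted = graft(tree, "sale district")  # sales manager is kept under store
```

Both functions return new trees and leave the input untouched, so alternative designs can be compared side by side.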
sales date manager address
sales date manager
city
state
address
store date
Defining dimensions
The choice of dimensions determines the fact granularity. Dimensions must be chosen among the children of the root in the attribute tree. Time should always be a dimension.
city brand diet weight category type product unit price sales qty manager sale store date district no+state address phone city county state
mark. grp.
Defining measures
Measures must be chosen among the children of the root. Typically, measures are computed either by counting the number of instances of F, or by summing (averaging, ...) expressions which involve numerical attributes. An attribute cannot be both a measure and a dimension. A fact may have no measures.
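The two ways of computing measures, counting instances and summing expressions, can be sketched as an aggregation of operational rows into fact cells identified by the chosen dimensions. The rows, the dimension choice (product, store, date) and the measure names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical operational rows: (product, store, date, qty, unit_price).
rows = [
    ("pasta", "S1", "1999-10-10", 2, 1.5),
    ("pasta", "S1", "1999-10-10", 1, 1.5),
    ("cola",  "S1", "1999-10-10", 3, 1.0),
    ("pasta", "S2", "1999-10-11", 4, 1.5),
]

fact = defaultdict(lambda: {"qty_sold": 0, "revenue": 0.0, "n_sales": 0})
for product, store, date, qty, price in rows:
    cell = fact[(product, store, date)]  # the dimensions identify the cell
    cell["qty_sold"] += qty              # measure obtained by summing an attribute
    cell["revenue"] += qty * price       # measure obtained by summing an expression
    cell["n_sales"] += 1                 # measure obtained by counting instances
```

Note that `qty` is used only as a measure and `product`, `store` and `date` only as dimensions, in line with the rule that an attribute cannot be both.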
city brand diet weight category type product unit price sales qty manager sale store date district no+state address phone city county state
mark. grp.
Granularity
Defining the granularity of data is a primary issue in determining performance. Granularity depends on the queries users are interested in, and represents a trade-off between query response time and detail of information to be stored.
It may be worth adopting a finer granularity than that required by users, provided that this does not slow down the system too much. The choice is also constrained by the maximum time frame available for loading.
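The trade-off can be made concrete with a rough size estimate: the product of the dimension cardinalities gives an upper bound on the number of fact instances. The sketch below borrows the product and store cardinalities (5000 and 100) from the attribute glossary and assumes one year of data; `max_cells` is a hypothetical helper, and real facts are usually much sparser than this bound.

```python
from math import prod

def max_cells(cardinalities):
    """Upper bound on the number of fact instances at a given granularity."""
    return prod(cardinalities.values())

# One year of data at daily vs. monthly granularity (assumed time span).
fine = {"product": 5000, "store": 100, "date": 365}
coarse = {"product": 5000, "store": 100, "month": 12}

fine_cells = max_cells(fine)
coarse_cells = max_cells(coarse)
ratio = fine_cells / coarse_cells
```

Under these assumptions the daily granularity may require about thirty times as many rows as the monthly one, which affects both query response time and loading time.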
Bibliography (1)
K. Aberer, K. Hemm. A methodology for building a data warehouse in a scientific environment. Proc. 1st Int. Conf. on Cooperative Inf. Systems, Brussels (1996).
R. Agrawal, A. Gupta, S. Sarawagi. Modeling multidimensional databases. IBM Research Report, IBM Almaden Research Center (1995).
M. Blaschka et al. Finding your way through multidimensional data models. Proc. DEXA98 (1998).
L. Cabibbo, R. Torlone. A logical approach to multidimensional databases. EDBT 98 (1998).
A. Datta, H. Thomas. A conceptual model and algebra for on-line analytical processing in data warehouses. Proc. WITS97 (1997).
E. Franconi, U. Sattler. A data warehouse conceptual model for multidimensional aggregation. Proc. DMDW99 (1999).
M. Golfarelli, D. Maio, S. Rizzi. The Dimensional Fact Model: a conceptual model for data warehouses. Int. Jour. of Cooperative Inf. Systems 7, 2&3 (1998).
M. Golfarelli, S. Rizzi. Designing the data warehouse: key steps and crucial issues. Jour. of Computer Science and Information Management 2, 3 (1999).
Bibliography (2)
M. Gyssens, L.V.S. Lakshmanan. A foundation for multi-dimensional databases. Proc. 23rd VLDB, Athens, Greece (1997).
B. Hüsemann, J. Lechtenbörger, G. Vossen. Conceptual data warehouse design. Proc. DMDW00 (2000).
R. Kimball. The data warehouse toolkit. John Wiley & Sons (1996).
D. Moody, M. Kortink. From enterprise models to dimensional models: a methodology for data warehouse and data mart design. Proc. DMDW00 (2000).
T. Bach Pedersen, C. Jensen. Multidimensional data modelling for complex data. Proc. 15th ICDE, Sydney (1999).
C. Sapia et al. Extending the E/R model for the multidimensional paradigm. Proc. ER98 (1998).
N. Tryfona, F. Busborg, J. Christiansen. starER: A Conceptual Model for Data Warehouse Design. Proc. DOLAP99 (1999).
P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. Proc. 10th SSDBM Conf., Capri, Italy (1998).