• Introduction
• Data Warehousing Concepts
• OLAP
• Dimension Modeling
• Conceptual Modeling
• Indexing
• Conclusion
Introduction
The Evolution
• 1960 - DSS processing using Fortran or COBOL
• 1970 - DBMS systems and the advent of DASD
• 1975 - OLTP systems facilitating faster access to data
• 1980 - PC/4GL technology and the advent of MIS
• 1985 - OLAP systems and separation of analytical
processing from transactional processing
• 1994 - Architected environments with integrated
OLAP engines and tools
What is a Data Warehouse?
• A copy of transaction data specifically structured for query and
analysis (Ralph Kimball, 1996)
• A collection of integrated, subject oriented databases designed
to support the DSS function where each unit of data is relevant
at some moment of time (Bill Inmon, 1991)
• The data characteristics of a Data Warehouse are:
• Subject-oriented
• Time-variant
• Non-volatile
• Integrated
What is a Data Warehouse? (cont’d)
• A single, complete and consistent store of data obtained from a
variety of different sources and made available to end users in
a way they can understand and use in a business context (Barry
Devlin 1992)
• A process of transforming data into information and making it
available to users in a timely enough manner to make a difference
(Forrester Research 1996)
Data Warehouse Goals/Characteristics
• It must make an organization’s information easily accessible
(slicing and dicing)
• It must present the organization’s information consistently
• It must be adaptive and resilient to change
• It must be a secure bastion that protects our information
assets
• It must serve as the foundation for improved decision making
• The business community must accept the DW if it is to be
deemed successful
Data Warehouse Applications
• Retail Industry
• Forecasting, Market research, Merchandising etc.
• Manufacturing and distribution
• Sales history/trends, Market demand projections etc.
• Banks
• Spot market trends, Marketing, Credit cards etc.
• Insurance Companies
• Property and casualty fraud etc.
• Health Care Providers
• Fraud detection, Patient matching etc.
Data Warehouse Applications (cont’d)
• Government Agencies
• Auditing tax records, information sharing across
different agencies etc.
• Internet Companies
• Analyzing shopping behavior, CRM etc.
• Telecommunications
• Telemarketing, Product development etc.
• Sports
• Analyzing strategies, Winning player combinations etc.
Data Warehouse Sizes
• Terabyte (10^12) - Walmart (24 TB)
• Petabyte (10^15) - Geographic Information
Systems
• Exabyte (10^18) - National Medical Association
• Zettabyte (10^21) - Weather Images
• Yottabyte (10^24) - Intelligence Agency (Video)
Why Data?
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Dimensional Modeling
What is Dimensional Modeling?
• Logical design technique that seeks to present the data
in a standard, intuitive framework that allows for high-
performance access.
• Adheres to a discipline that uses the relational
model with some important restrictions.
• Composed of one table with a multi-part key,
called the fact table, and a set of smaller tables
called dimension tables.
DM vs. ER Models
• DM: used to design databases for Online Analytical Processing (OLAP);
de-normalized
• ER: used to design databases for Online Transaction Processing (OLTP);
normalized
Fact Tables
• Primary table in the DM
• Each row corresponds to a measurement
• Facts in the fact table are numeric and additive
• Narrow rows with a few columns
• Large number of rows (billions)
• Express many-to-many relationships between
dimensions
Dimension Tables
• Define business in terms already familiar to users
• Implement the user interface to the DW
• Wide rows with lots of descriptive text
• Small tables (about a million rows)
• Joined to fact table by a foreign key
• Heavily indexed
• E.g. of typical dimensions
• time periods, geographic region (markets, cities),
products, customers, salesperson, etc.
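The fact and dimension tables described above can be sketched as a small star schema. This is a minimal illustration using Python's built-in sqlite3 module; the table names (dim_product, dim_time, fact_sales), the columns, and the sample rows are all hypothetical and not taken from any system named in these slides.

```python
import sqlite3


def build_star_schema():
    """Create a tiny in-memory star schema and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables: descriptive text, one row per business entity.
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name TEXT,
            type TEXT
        );
        CREATE TABLE dim_time (
            time_key INTEGER PRIMARY KEY,
            week TEXT,
            month TEXT
        );
        -- Fact table: numeric, additive measures with a multi-part key
        -- made of foreign keys to the dimension tables.
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            time_key INTEGER REFERENCES dim_time(time_key),
            units_sold INTEGER,
            revenue REAL,
            PRIMARY KEY (product_key, time_key)
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                     [(1, "Cola", "Beverage"), (2, "Chips", "Snack")])
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                     [(1, "2024-W01", "2024-01"), (2, "2024-W02", "2024-01")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                     [(1, 1, 10, 15.0), (1, 2, 20, 30.0), (2, 1, 5, 10.0)])
    return conn


def revenue_by_type_and_month(conn):
    """Join facts to both dimensions and sum the additive revenue measure."""
    return conn.execute("""
        SELECT p.type, t.month, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_time t ON t.time_key = f.time_key
        GROUP BY p.type, t.month
        ORDER BY p.type, t.month
    """).fetchall()
```

Calling revenue_by_type_and_month(build_star_schema()) rolls the weekly facts up to one row per product type and month, which is the kind of query the dimension tables are meant to make easy to express.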
Four Step Dimensional Design Process
• Step 1 - Select the business process to model
• The first step in converting an ER diagram to a set of DM diagrams
is to separate the ER diagram into its discrete business
processes and to model each one separately.
Technologies:
• Databases and source applications: Oracle, PeopleSoft, SQL Server, SAP,
Teradata, Siebel, DB2, Oracle Applications, Custom Systems
• ETL tools: Informatica PowerMart, Ab Initio, Data Stage, Oracle Warehouse
Builder, Custom programs, SQL scripts
• Reporting and analysis tools: HTML Reports, Cognos, Business Objects,
MicroStrategy, Oracle Discoverer, Brio, Data Mining Tools, Portals,
Custom Tools
Data Warehouse Structure
Levels, from information (top) to data (bottom):
• Individually structured - highly summarized
• Departmentally structured - lightly summarized
• Organizationally structured - atomic/detailed data warehouse
Data Warehouse Architecture Drivers
The requirements that drive the DW architecture are:
• Granularity of data
• Data retention and timeliness
• Reporting capability
• Availability
• Scalability
Data Mart Centric
Data Sources -> Data Marts -> Data Warehouse
Data Warehouse Centric
Data Sources -> Data Warehouse -> Data Marts
OLAP
OLAP: 3 Tier DSS
Data Warehouse -> OLAP Engine -> Decision Support Client
Fact Scheme
$f = (M, A, N, R, O, S)$
• $M$ is a set of measures
• $A$ is a set of dimension attributes
• $N$ is a set of non-dimension attributes
• $R$ is a set of ordered couples of the form $(a_i, a_j)$, indicating
the ‘edges’ of the scheme, where $a_i \in A \cup \{a_0\}$,
$a_j \in A \cup N$, and $a_i \neq a_j$
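The edge constraints of a fact scheme can be checked mechanically. The sketch below is only an illustration: the class name FactScheme, the string "a0" for the root, and the sample attributes are hypothetical, and O and S are carried as opaque sets because these slides do not define them.

```python
from dataclasses import dataclass

ROOT = "a0"  # assumed label for the root node a_0 of the scheme


@dataclass(frozen=True)
class FactScheme:
    measures: frozenset                     # M
    dim_attrs: frozenset                    # A
    non_dim_attrs: frozenset                # N
    edges: frozenset                        # R, a set of (a_i, a_j) pairs
    optional: frozenset = frozenset()       # O, not interpreted here
    aggregations: frozenset = frozenset()   # S, not interpreted here

    def invalid_edges(self):
        """Return edges that violate: a_i in A or a0, a_j in A or N, a_i != a_j."""
        sources = self.dim_attrs | {ROOT}
        targets = self.dim_attrs | self.non_dim_attrs
        return sorted(
            (ai, aj) for ai, aj in self.edges
            if ai not in sources or aj not in targets or ai == aj
        )


# Hypothetical sales scheme: week rolls up to month, product to type.
sales = FactScheme(
    measures=frozenset({"qty_sold"}),
    dim_attrs=frozenset({"week", "month", "product", "type"}),
    non_dim_attrs=frozenset({"weight"}),
    edges=frozenset({
        ("a0", "week"), ("week", "month"),
        ("a0", "product"), ("product", "type"), ("product", "weight"),
    }),
)
```

Here sales.invalid_edges() returns an empty list, while an edge that starts from a non-dimension attribute such as ("weight", "type") would be reported.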
[Figure: example aggregation levels (week, product), (month, product) and (month, type), aggregated with SUM]
Indexing
Cost Model
• Cost of answering a query is the number of rows processed
• Subcubes
• Powerset of the dimensions
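To make "powerset of the dimensions" concrete, the sketch below enumerates every subcube for three dimensions. The letters p, s and c follow the ps and psc subcubes used in the example later in these slides; the function name and the label "none" for the empty grouping are assumptions made for illustration.

```python
from itertools import combinations


def subcubes(dimensions):
    """Return every subset of the dimensions as a label, largest first."""
    dims = list(dimensions)
    labels = []
    for size in range(len(dims), -1, -1):
        for combo in combinations(dims, size):
            # The empty subset (a single grand-total row) is labelled "none".
            labels.append("".join(combo) or "none")
    return labels
```

For three dimensions this yields 2^3 = 8 subcubes: psc, ps, pc, sc, p, s, c and none. Under the cost model above, a query answered from a given subcube costs as many rows as that subcube holds.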
Indexes
• B-tree indexes to speed up query processing
• E.g. for cube ps, we can construct the following indexes
• $I_{ps}$
• $I_{sp}$
Example
• Consider query $Q_1$ over p and s
• Using subcube ps: 0.8M rows
• Using subcube psc: 6M rows
• Using an index on subcube ps: 80 rows
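One plausible reading of the 80-row figure above: scanning subcube ps touches all 0.8M rows, while an index on ps that leads with s only touches the rows for a single value of s. The sketch assumes values are uniformly distributed and that s has 10,000 distinct values. That count is an assumption chosen for illustration and is not stated in these slides.

```python
def scan_cost(subcube_rows):
    """Without an index, every row of the subcube is processed."""
    return subcube_rows


def index_cost(subcube_rows, distinct_leading_values):
    """With an index, only rows matching one leading-attribute value are read.

    Assumes rows are spread evenly over the leading attribute's values.
    """
    return subcube_rows // distinct_leading_values


PS_ROWS = 800_000          # 0.8M rows in subcube ps (from the slide)
ASSUMED_S_VALUES = 10_000  # hypothetical number of distinct s values
```

With these inputs, scan_cost(PS_ROWS) is 800,000 rows and index_cost(PS_ROWS, ASSUMED_S_VALUES) is 80 rows.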
Indexes
• Ideal situation
• All subcubes
• All indexes
Algorithms
• Balance the available space between subcubes and indexes
• Greedy algorithm
• Given a set of queries
• At every step, select the index or subcube with the
highest benefit
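The greedy step can be sketched as follows. This is a simplified illustration, not the algorithm from the cited paper: benefit is taken to be the total number of rows saved across the given queries, a space budget limits what can be selected, and an index may only be picked once the subcube it requires has been chosen. The function name, the dictionary layout, and all sizes and costs in the sample data are hypothetical.

```python
def greedy_select(candidates, queries, base_cost, space_budget):
    """Repeatedly pick the subcube or index with the highest total benefit.

    candidates: {name: {"space": rows, "costs": {query: rows}, "requires": name}}
    base_cost:  rows processed per query before anything is selected
    Returns (names in selection order, final cost per query).
    """
    current = {q: base_cost for q in queries}
    chosen, remaining = [], space_budget
    while True:
        best_name, best_benefit = None, 0
        for name, cand in candidates.items():
            if name in chosen or cand["space"] > remaining:
                continue
            needs = cand.get("requires")
            if needs is not None and needs not in chosen:
                continue  # an index is useless without its subcube
            # Benefit: rows saved over all queries if this candidate is added.
            benefit = sum(
                max(0, current[q] - cand["costs"].get(q, current[q]))
                for q in queries
            )
            if benefit > best_benefit:
                best_name, best_benefit = name, benefit
        if best_name is None:
            return chosen, current
        chosen.append(best_name)
        remaining -= candidates[best_name]["space"]
        for q in queries:
            cost = candidates[best_name]["costs"].get(q, current[q])
            current[q] = min(current[q], cost)


# Hypothetical workload: Q1 can use ps or its index, Q2 can use p.
CANDIDATES = {
    "ps": {"space": 800_000, "costs": {"Q1": 800_000}},
    "I_sp": {"space": 800_000, "costs": {"Q1": 80}, "requires": "ps"},
    "p": {"space": 200_000, "costs": {"Q2": 200_000}},
}
```

With a base cost of 6M rows per query and a budget of 2M rows, greedy_select(CANDIDATES, ["Q1", "Q2"], 6_000_000, 2_000_000) picks p first, then ps, then I_sp, because each is the candidate with the largest saving at the moment it is chosen.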
References
• Text books
• Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996
• W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons,
1996
• Barry Devlin, Data Warehouse from Architecture to Implementation, Addison
Wesley Longman, Inc 1997
• Research Papers/Whitepapers
• M. Golfarelli, D. Maio, S. Rizzi, The Dimensional Fact Model: a Conceptual
Model for Data Warehouses, International Journal of Cooperative Information
Systems, Vol. 7 (issue 2/3), pages 215-247, 1998.
• H. Gupta, V. Harinarayan, A. Rajaraman, J.D. Ullman, Index Selection for
OLAP, Proceedings of the Thirteenth International Conference on Data
Engineering, April 07 - 11, pages 208-219, 1997.
• S. Luján-Mora, J. Trujillo, A comprehensive method for data warehouse
design, Proc. DMDW, 2003.
References (cont’d)
• Luján-Mora, S., Trujillo, J., and Song, I. Extending the UML for
Multidimensional Modeling. Lecture Notes In Computer Science, Vol. 2460, pages
290-304., 2002.
• Husemann, B., Lechtenborger, J., Vossen, G.: Conceptual Data Warehouse Design.
In: Proc. of the 2nd. Intl. Workshop on Design and Management of Data
Warehouses (DMDW'2000), Stockholm, pages 3-9, 2000.
• Lehner, W., Albrecht, J., and Wedekind, H. 1998. Normal Forms for
Multidimensional Databases. In Proceedings of the 10th International Conference
on Scientific and Statistical Database Management (July 01 – 03), pages 63-72,
1998.
• Web Articles
• http://en.wikipedia.org/wiki/Data_warehouse
• http://en.wikipedia.org/wiki/Online_analytical_processing
• http://en.wikipedia.org/wiki/OLTP
References (cont’d)
• http://www.sidadelman.com/data_warehouse_applications.htm
• http://infolab.stanford.edu/infoseminar/Archive/FallY97/slides/ncr
• www.cdd.go.th/it/file/DataWarehousing_and_DataMining.pdf
• http://www.ciobriefings.com/whitepapers/StarSchema.asp