You are on page 1of 90

Inmon

• Father of the data warehouse


• Co-creator of the Corporate
Information Factory.
• He has 35 years of
experience in database
technology management
and data warehouse design.

1
Inmon-Cont’d
• Bill has written about a variety
of topics on the building, usage,
& maintenance of the data warehouse
& the Corporate Information Factory.

• He has written more than 650


articles (Datamation, ComputerWorld,
and Byte Magazine).

• Inmon has published 45 books.


– Many of books has been translated to Chinese, Dutch,
French, German, Japanese, Korean, Portuguese, Russian,
and Spanish.
2
Introduction
• What is Data Warehouse?
A data warehouse is a collection of integrated
databases designed to support a DSS.

• According to Inmon’s (father of data warehousing)


definition
– It is a collection of integrated, subject-oriented
databases designed to support the DSS function,
where each unit of data is non-volatile and
relevant to some moment in time.

3
Characteristics of Data Warehouse

• Subject oriented. Data are organized based on how


the users refer to them.
• Integrated. All inconsistencies regarding naming
convention and value representations are removed.
• Nonvolatile. Data are stored in read-only format and
do not change over time.
• Time variant. Data are not current but normally time
series.

4
A Data Warehouse is Subject Oriented

5
Subject Orientation
Application Data warehouse
Environment Environment

Design activities must be DW world is primarily


equally focused on both void of process design
process and database and tends to focus
design exclusively on issues of
data modeling and
database design

6
Data Integrated
• Integration –consistency naming conventions and
measurement attributers, accuracy, and common
aggregation.

• Establishment of a common unit of measure for all


synonymous data elements from dissimilar database.

• The data must be stored in the DW in an integrated,


globally acceptable manner

7
Data Integrated

8
Time Variant
• Every piece of data contained within the warehouse
must be associated with a particular point in time if
any useful analysis is to be conducted with it.

• Another aspect of time variance in DW data is that,


once recorded, data within the warehouse cannot be
updated or changed.

9
Nonvolatility
• Typical activities such as deletes, inserts, and changes
that are performed in an operational application
environment are completely nonexistent in a DW
environment.

• Only two data operations are ever performed in the


DW: data loading and data access.

10
Why Do We Need Data Warehouses?
• Consolidation of information resources

• Improved query performance

• Separate research and decision support functions


from the operational systems

• Foundation for data mining, data visualization,


advanced reporting and OLAP tools

11
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools
12
Data Warehouses, Data Marts, and
Operational Data Stores
• Data Warehouse – The queryable source of data in the
enterprise. It is comprised of the union of all of its
constituent data marts.
• Data Mart – A logical subset of the complete data
warehouse. Often viewed as a restriction of the data
warehouse to a single business process or to a group
of related business processes targeted toward a
particular business group.
• Operational Data Store (ODS) – A point of integration
for operational systems that developed independent of
each other. Since an ODS supports day to day
operations, it needs to be continually updated.

1
How Do Data Warehouses Differ From
Operational Systems?

• Goals
• Structure
• Size
• Performance optimization
• Technologies used

2
Design Differences
Operational System Data Warehouse

ER Diagram Star Schema

3
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries 4
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

5
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
– Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
– Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube. 6
Dimension and Fact tables
TIMES
(Dimension) timeid date week month quarter year holiday_flag

pid timeid locid sales SALES (Fact table)


PRODUCTS LOCATIONS
pid pname category price locid city state country
(Dimension table) (Dimension table)

The main relation, which relates dimensions to a measure, is


called the fact table. Each dimension can have additional
attributes and an associated dimension table.

E.g., Products(pid, pname, category, price)


Fact tables are much larger than dimensional tables.
Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures


Star schema: A fact table in the middle connected to a set of
dimension tables
Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy
schema or fact constellation

1
Terms
• Fact table sale
orderId
• Dimension tables product date customer
custId custId
• Measures prodId
name prodId name
storeId address
price
qty city
amt

store
storeId
city

2
Star
product prodId name price store storeId city
p1 bolt 10 c1 nyc
p2 nut 5 c2 sfo
c3 la

sale oderId date custId prodId storeId qty amt


o100 1/7/97 53 p1 c1 1 12
o102 2/7/97 53 p2 c1 2 11
105 3/8/97 111 p1 c3 5 50

customer custId name address city


53 joe 10 main sfo
81 fred 12 main sfo
111 sally 80 willow la

3
Star Schema

sale
orderId
date customer
product
custId custId
prodId
prodId name
name
storeId address
price
qty city
amt

store
storeId
city

4
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
5
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key

branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
6
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location

branch location_key location to_location


branch_key location_key dollars_cost
branch_name
units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
7
shipper_type
Why Multidimensional Data Model
Three dimensional model

Region

Time
Multidimensional Data Model Fact Table

timeid

sales
locid
pid
• Collection of numeric measures, which 11 1 1 25
depend on a set of dimensions. 11 2 1 8
– E.g., measure Sales, dimensions 11 3 1 15
Product (key: pid), Location (locid),
12 1 1 30
and Time (timeid).
12 2 1 20
12 3 1 50
11 12 13

8 10 10
13 1 1 8
pid

Slice locid=1 30 20 50 13 2 1 10
locid
is shown: 25 8 15 13 3 1 10
1 2 3 11 1 2 35
timeid
Dimension Hierarchies
For each dimension, the set of values can be
organized in a hierarchy
TIME
LOCATION
PRODUCT year

quarter country

category week month state

pane date city


A Concept Hierarchy: Dimension (location)

all all

region Europe ... North_America

country Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind

4
Representing Multi-Dimensional Data
• Example of two-dimensional query.
– What is the total revenue generated by property sales in
each city, in each quarter of 1997?’

• Choice of representation is based on types


of queries end-user may ask.

• Compare representation - three-field


relational table versus two-dimensional
matrix.
5
Multi-Dimensional Data as Three-Field Table versus
Two-Dimensional Matrix

6
Representing Multi-Dimensional
Data
• Example of three-dimensional query.
– ‘What is the total revenue generated by property sales
for each type of property (Flat or House) in each city,
in each quarter of 1997?’

• Compare representation - four-field


relational table versus three-dimensional
cube.

7
Multi-Dimensional Data as Four-Field
Table versus Three-Dimensional Cube

8
Representing Multi-Dimensional Data

• Cube represents data as cells in an array.

• Relational table only represents multi-


dimensional data in two dimensions.

9
Cuboids Corresponding to the Cube

all
0-D(apex) cuboid

product date country


1-D cuboids

product,date product,country date, country


2-D cuboids

3-D(base) cuboid
product, date, country

10
Cube: A Lattice of Cuboids
all
0-D(apex) cuboid

time item location supplier


1-D cuboids

time,location item,location location,supplier


time,item 2-D cuboids
time,supplier item,supplier

time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier

4-D(base) cuboid
time, item, location, supplier
11
Lattice of Cuboids
129
all

c1 c2 c3
p1 67 12 50
city product date

city, product city, date product, date


c1 c2 c3
p1 56 4 50
p2 11 8

day 2
p1
c1
44
c2
4
c3
city, product, date
p2 c1 c2 c3
day 1
p1 12 50
p2 11 8

12
OLAP
 OLAP: Online Analytic Processing

 OLAP queries are complex queries that


Touch large amounts of data
Discover patterns and trends in the data
Typically expensive queries that take long time
Also called decision-support queries

 In contrast to OLAP:
OLTP: Online Transaction Processing
OLTP queries are simple queries, e.g., over banking or airline
OLTP queries touch small amount of data for fast transactions systems

1
What is OLAP?
• OLAP is an analytical technique that combines
data access tools with an analytical database
engine. In contrast to the simple rows and
columns structure of relational databases,
OLAP uses a multi-dimensional view of data.
OLAP uses calculations and transformations to
perform its analytical tasks.
Introducing OLAP
• Enables users to gain a deeper
understanding and knowledge about various
aspects of their corporate data through fast,
consistent, interactive access to a wide
variety of possible views of the data.

• Allows users to view corporate data in such


a way that it is a better model of the true
dimensionality of the enterprise.
3
OLTP vs. OLAP
On-Line Transaction Processing (OLTP):
– technology used to perform updates on operational or
transactional systems (e.g., point of sale systems)
On-Line Analytical Processing (OLAP):
– technology used to perform complex analysis of the data
in a data warehouse
OLAP is a category of software technology that enables
analysts, managers, and executives to gain insight into data
through fast, consistent, interactive access to a wide variety
of possible views of information that has been transformed
from raw data to reflect the dimensionality of the enterprise
as understood by the user.
[source: OLAP Council: www.olapcouncil.org] 4
EXAMPLE OLAP APPLICATIONS
Market Analysis

 Find which items are frequently sold over the summer but
not over winter?

Credit Card Companies

 Given a new applicant, does (s)he a credit-worthy?


 Need to check other similar applicants (age,gender,income,etc…)
and observe how they perform, then do prediction for new
applicant

OLAP queries are also called “decision support” queries

5
RELATIONAL OLAP: ROLAP
• Data are stored in relational model (tables)
• Special schema called Star Schema
• One relation is the fact table, all the others are dimension
tables

6
MOLAP
Unlike ROLAP, in MOLAP data are stored in special structure
called “Data Cubes” (Array-bases storage)

Data cubes pre-compute and aggregate the data


Possibly several data cubes with different granularities
Data cubes are aggregated materialized views over the data

As long as the data does not change frequently, the overhead of


data cubes is manageable

7
MOLAP vs ROLAP

• In Multidimensional OLAP ( MOLAP ), data is


stored in a special OLAP database server, after being
extracted from various sources, in pre-aggregated
cubic format. In contrast to this approach, Relational
OLAP ( ROLAP ) does not use an intermediate server
because it can work directly against the relational
database.
ROLAP Server
• Relational OLAP Server sale prodId
p1
date
1
sum
62
p2 1 19
p1 2 48

tools

ROLAP Special indices, tuning;


utilities Schema is “denormalized”
server

relational
DBMS

9
MOLAP Server
• Multi-Dimensional OLAP Server
Sales
B
A
milk

Product
soda
M.D. tools eggs
soap

1 2 3 4
Date

utilities
multi- could also
dimensional sit on
relational
server DBMS

10
MOLAP

C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B b2 28 56
9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A

11
MOLAP vs ROLAP
Relational OLAP (ROLAP)

• Fastest growing style of OLAP technology.

• Supports RDBMS products using a metadata layer


- avoids need to create a static multi-dimensional
data structure - facilitates the creation of multiple
multi-dimensional views of the two-dimensional
relation.

13
Relational OLAP (ROLAP)

• To improve performance, some products


use SQL engines to support complexity of
multi-dimensional analysis, while others
recommend, or require, the use of highly
denormalized database designs such as the
star schema.

14
Typical Architecture for ROLAP
Tools

15
Multi-Dimensional OLAP Servers

• Use multi-dimensional structures to store data and


relationships between data.

• Multi-dimensional structures are best visualized


as cubes of data, and cubes within cubes of data.
Each side of cube is a dimension.

• A cube can be expanded to include other


dimensions.

16
Multi-Dimensional OLAP Servers

• In summary, pre-aggregation, dimensional


hierarchy, and sparse data management can
significantly reduce the size of the cube and the
need to calculate values ‘on-the-fly’.

• Removes need for multi-table joins and provides


quick and direct access to arrays of data, thus
significantly speeding up execution of multi-
dimensional queries.
17
Typical Architecture for MOLAP Tools

18
OLAP
 OLAP: Online Analytic Processing

 OLAP queries are complex queries that


Touch large amounts of data
Discover patterns and trends in the data
Typically expensive queries that take long time
Also called decision-support queries

 In contrast to OLAP:
OLTP: Online Transaction Processing
OLTP queries are simple queries, e.g., over banking or airline
OLTP queries touch small amount of data for fast transactions systems

1
What is OLAP?
• OLAP is an analytical technique that combines
data access tools with an analytical database
engine. In contrast to the simple rows and
columns structure of relational databases,
OLAP uses a multi-dimensional view of data.
OLAP uses calculations and transformations to
perform its analytical tasks.
Introducing OLAP
• Enables users to gain a deeper
understanding and knowledge about various
aspects of their corporate data through fast,
consistent, interactive access to a wide
variety of possible views of the data.

• Allows users to view corporate data in such


a way that it is a better model of the true
dimensionality of the enterprise.
3
OLTP vs. OLAP
On-Line Transaction Processing (OLTP):
– technology used to perform updates on operational or
transactional systems (e.g., point of sale systems)
On-Line Analytical Processing (OLAP):
– technology used to perform complex analysis of the data
in a data warehouse
OLAP is a category of software technology that enables
analysts, managers, and executives to gain insight into data
through fast, consistent, interactive access to a wide variety
of possible views of information that has been transformed
from raw data to reflect the dimensionality of the enterprise
as understood by the user.
[source: OLAP Council: www.olapcouncil.org] 4
EXAMPLE OLAP APPLICATIONS
Market Analysis

 Find which items are frequently sold over the summer but
not over winter?

Credit Card Companies

 Given a new applicant, does (s)he a credit-worthy?


 Need to check other similar applicants (age,gender,income,etc…)
and observe how they perform, then do prediction for new
applicant

OLAP queries are also called “decision support” queries

5
RELATIONAL OLAP: ROLAP
• Data are stored in relational model (tables)
• Special schema called Star Schema
• One relation is the fact table, all the others are dimension
tables

6
MOLAP
Unlike ROLAP, in MOLAP data are stored in special structure
called “Data Cubes” (Array-bases storage)

Data cubes pre-compute and aggregate the data


Possibly several data cubes with different granularities
Data cubes are aggregated materialized views over the data

As long as the data does not change frequently, the overhead of


data cubes is manageable

7
MOLAP vs ROLAP

• In Multidimensional OLAP ( MOLAP ), data is


stored in a special OLAP database server, after being
extracted from various sources, in pre-aggregated
cubic format. In contrast to this approach, Relational
OLAP ( ROLAP ) does not use an intermediate server
because it can work directly against the relational
database.
ROLAP Server
• Relational OLAP Server sale prodId
p1
date
1
sum
62
p2 1 19
p1 2 48

tools

ROLAP Special indices, tuning;


utilities Schema is “denormalized”
server

relational
DBMS

9
MOLAP Server
• Multi-Dimensional OLAP Server
Sales
B
A
milk

Product
soda
M.D. tools eggs
soap

1 2 3 4
Date

utilities
multi- could also
dimensional sit on
relational
server DBMS

10
MOLAP

C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B b2 28 56
9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A

11
MOLAP vs ROLAP
Relational OLAP (ROLAP)

• Fastest growing style of OLAP technology.

• Supports RDBMS products using a metadata layer


- avoids need to create a static multi-dimensional
data structure - facilitates the creation of multiple
multi-dimensional views of the two-dimensional
relation.

13
Relational OLAP (ROLAP)

• To improve performance, some products


use SQL engines to support complexity of
multi-dimensional analysis, while others
recommend, or require, the use of highly
denormalized database designs such as the
star schema.

14
Typical Architecture for ROLAP
Tools

15
Multi-Dimensional OLAP Servers

• Use multi-dimensional structures to store data and


relationships between data.

• Multi-dimensional structures are best visualized


as cubes of data, and cubes within cubes of data.
Each side of cube is a dimension.

• A cube can be expanded to include other


dimensions.

16
Multi-Dimensional OLAP Servers

• In summary, pre-aggregation, dimensional


hierarchy, and sparse data management can
significantly reduce the size of the cube and the
need to calculate values ‘on-the-fly’.

• Removes need for multi-table joins and provides


quick and direct access to arrays of data, thus
significantly speeding up execution of multi-
dimensional queries.
17
Typical Architecture for MOLAP Tools

18
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or
detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
1
Fig. 3.10 Typical OLAP
Operations

2
Examples of OLAP Applications in
Various Functional Areas

3
DATA MINING vs. OLAP

OLAP – Online Analytical


Processing
Provides you with a very
good view of what is
happening, but can not
predict what will happen
in the future or why it is
happening

Data Mining is a combination of discovering


techniques + prediction techniques
4
Design of Data Warehouse: A Business Analysis
Framework
• Four views regarding the design of a data warehouse
– Top-down view
• allows selection of the relevant information necessary for the data
warehouse
– Data source view
• exposes the information being captured, stored, and managed by
operational systems
– Data warehouse view
• consists of fact tables and dimension tables
– Business query view
• sees the perspectives of data in the warehouse from the view of end-
user

1
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
– Top-down: Starts with overall design and planning (mature)
– Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
– Waterfall: structured and systematic analysis at each step before
proceeding to the next
– Spiral: rapid generation of increasingly functional systems, short turn
around time, quick turn around
• Typical data warehouse design process
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measure that will populate each fact table record
2
Data Warehouse: A Multi-Tiered Architecture

Monitor
& OLAP Server
Other Metadata
sources Integrator

Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools


3
Three Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning the
entire organization
• Data Mart
– a subset of corporate-wide data that is of value to a specific
groups of users. Its scope is confined to specific, selected
groups, such as marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
4
Building a Data Warehouse
Data Warehouse Lifecycle

– Analysis
– Design
– Import data
– Install front-end tools
– Test and deploy

5
Stage 1: Analysis
Analysis
– Design
• Identify: – Import data
– Target Questions – Install front-end tools
– Test and deploy
– Data needs
– Timeliness of data
– Granularity
• Create an enterprise-level data dictionary
• Dimensional analysis
– Identify facts and dimensions

6
Stage 2: Design
• Star schema – Analysis
• Data Transformation Design
– Import data
• Aggregates – Install front-end tools
– Test and deploy
• Pre-calculated
Values Dimensional Modeling

• HW/SW
Architecture

7
Dimensional Modeling
• Fact Table – The primary table in a
dimensional model that is meant to contain
measurements of the business.
• Dimension Table – One of a set of
companion tables to a fact table.
• Most dimension tables contain many
textual attributes that are the basis for
constraining and grouping within data
warehouse queries.

8
Stage 3: Import Data
• Identify data sources
– Analysis
• Extract the needed data from – Design
existing systems to a data Import data
staging area – Install front-end tools
• Transform and Clean the data – Test and deploy
– Resolve data type conflicts
– Resolve naming and key
conflicts
– Remove, correct, or flag bad data
– Conform Dimensions
• Load the data into the
warehouse

9
Importing Data Into the Warehouse

OLTP 1

Data Staging Area Data


OLTP 2
Warehouse

OLTP 3

Operational Systems
(source systems)

10
Stage 4: Install Front-end Tools
– Analysis
– Design
• Reporting tools – Import data
Install front-end tools
• Data mining tools – Test and deploy

• GIS
• Etc.

11
Stage 5: Test and Deploy
– Analysis
– Design
• Usability tests – Import data
– Install front-end tools
• Software installation Test and deploy

• User training
• Performance tweaking based on usage

12

You might also like