Professional Documents
Culture Documents
1
Inmon-Cont’d
• Bill has written about a variety
of topics on the building, usage,
& maintenance of the data warehouse
& the Corporate Information Factory.
3
Characteristics of Data Warehouse
4
A Data Warehouse is Subject Oriented
5
Subject Orientation
Application Data warehouse
Environment Environment
6
Data Integrated
• Integration –consistency naming conventions and
measurement attributers, accuracy, and common
aggregation.
7
Data Integrated
8
Time Variant
• Every piece of data contained within the warehouse
must be associated with a particular point in time if
any useful analysis is to be conducted with it.
9
Nonvolatility
• Typical activities such as deletes, inserts, and changes
that are performed in an operational application
environment are completely nonexistent in a DW
environment.
10
Why Do We Need Data Warehouses?
• Consolidation of information resources
11
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools
12
Data Warehouses, Data Marts, and
Operational Data Stores
• Data Warehouse – The queryable source of data in the
enterprise. It is comprised of the union of all of its
constituent data marts.
• Data Mart – A logical subset of the complete data
warehouse. Often viewed as a restriction of the data
warehouse to a single business process or to a group
of related business processes targeted toward a
particular business group.
• Operational Data Store (ODS) – A point of integration
for operational systems that developed independent of
each other. Since an ODS supports day to day
operations, it needs to be continually updated.
1
How Do Data Warehouses Differ From
Operational Systems?
• Goals
• Structure
• Size
• Performance optimization
• Technologies used
2
Design Differences
Operational System Data Warehouse
3
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries 4
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
5
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
– Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
– Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube. 6
Dimension and Fact tables
TIMES
(Dimension) timeid date week month quarter year holiday_flag
1
Terms
• Fact table sale
orderId
• Dimension tables product date customer
custId custId
• Measures prodId
name prodId name
storeId address
price
qty city
amt
store
storeId
city
2
Star
product prodId name price store storeId city
p1 bolt 10 c1 nyc
p2 nut 5 c2 sfo
c3 la
3
Star Schema
sale
orderId
date customer
product
custId custId
prodId
prodId name
name
storeId address
price
qty city
amt
store
storeId
city
4
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
5
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
6
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location
Region
Time
Multidimensional Data Model Fact Table
timeid
sales
locid
pid
• Collection of numeric measures, which 11 1 1 25
depend on a set of dimensions. 11 2 1 8
– E.g., measure Sales, dimensions 11 3 1 15
Product (key: pid), Location (locid),
12 1 1 30
and Time (timeid).
12 2 1 20
12 3 1 50
11 12 13
8 10 10
13 1 1 8
pid
Slice locid=1 30 20 50 13 2 1 10
locid
is shown: 25 8 15 13 3 1 10
1 2 3 11 1 2 35
timeid
Dimension Hierarchies
For each dimension, the set of values can be
organized in a hierarchy
TIME
LOCATION
PRODUCT year
quarter country
all all
4
Representing Multi-Dimensional Data
• Example of two-dimensional query.
– What is the total revenue generated by property sales in
each city, in each quarter of 1997?’
6
Representing Multi-Dimensional
Data
• Example of three-dimensional query.
– ‘What is the total revenue generated by property sales
for each type of property (Flat or House) in each city,
in each quarter of 1997?’
7
Multi-Dimensional Data as Four-Field
Table versus Three-Dimensional Cube
8
Representing Multi-Dimensional Data
9
Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
3-D(base) cuboid
product, date, country
10
Cube: A Lattice of Cuboids
all
0-D(apex) cuboid
time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier
4-D(base) cuboid
time, item, location, supplier
11
Lattice of Cuboids
129
all
c1 c2 c3
p1 67 12 50
city product date
day 2
p1
c1
44
c2
4
c3
city, product, date
p2 c1 c2 c3
day 1
p1 12 50
p2 11 8
12
OLAP
OLAP: Online Analytic Processing
In contrast to OLAP:
OLTP: Online Transaction Processing
OLTP queries are simple queries, e.g., over banking or airline
OLTP queries touch small amount of data for fast transactions systems
1
What is OLAP?
• OLAP is an analytical technique that combines
data access tools with an analytical database
engine. In contrast to the simple rows and
columns structure of relational databases,
OLAP uses a multi-dimensional view of data.
OLAP uses calculations and transformations to
perform its analytical tasks.
Introducing OLAP
• Enables users to gain a deeper
understanding and knowledge about various
aspects of their corporate data through fast,
consistent, interactive access to a wide
variety of possible views of the data.
Find which items are frequently sold over the summer but
not over winter?
5
RELATIONAL OLAP: ROLAP
• Data are stored in relational model (tables)
• Special schema called Star Schema
• One relation is the fact table, all the others are dimension
tables
6
MOLAP
Unlike ROLAP, in MOLAP data are stored in special structure
called “Data Cubes” (Array-bases storage)
7
MOLAP vs ROLAP
tools
relational
DBMS
9
MOLAP Server
• Multi-Dimensional OLAP Server
Sales
B
A
milk
Product
soda
M.D. tools eggs
soap
1 2 3 4
Date
utilities
multi- could also
dimensional sit on
relational
server DBMS
10
MOLAP
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B b2 28 56
9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
11
MOLAP vs ROLAP
Relational OLAP (ROLAP)
13
Relational OLAP (ROLAP)
14
Typical Architecture for ROLAP
Tools
15
Multi-Dimensional OLAP Servers
16
Multi-Dimensional OLAP Servers
18
OLAP
OLAP: Online Analytic Processing
In contrast to OLAP:
OLTP: Online Transaction Processing
OLTP queries are simple queries, e.g., over banking or airline
OLTP queries touch small amount of data for fast transactions systems
1
What is OLAP?
• OLAP is an analytical technique that combines
data access tools with an analytical database
engine. In contrast to the simple rows and
columns structure of relational databases,
OLAP uses a multi-dimensional view of data.
OLAP uses calculations and transformations to
perform its analytical tasks.
Introducing OLAP
• Enables users to gain a deeper
understanding and knowledge about various
aspects of their corporate data through fast,
consistent, interactive access to a wide
variety of possible views of the data.
Find which items are frequently sold over the summer but
not over winter?
5
RELATIONAL OLAP: ROLAP
• Data are stored in relational model (tables)
• Special schema called Star Schema
• One relation is the fact table, all the others are dimension
tables
6
MOLAP
Unlike ROLAP, in MOLAP data are stored in special structure
called “Data Cubes” (Array-bases storage)
7
MOLAP vs ROLAP
tools
relational
DBMS
9
MOLAP Server
• Multi-Dimensional OLAP Server
Sales
B
A
milk
Product
soda
M.D. tools eggs
soap
1 2 3 4
Date
utilities
multi- could also
dimensional sit on
relational
server DBMS
10
MOLAP
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B b2 28 56
9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
11
MOLAP vs ROLAP
Relational OLAP (ROLAP)
13
Relational OLAP (ROLAP)
14
Typical Architecture for ROLAP
Tools
15
Multi-Dimensional OLAP Servers
16
Multi-Dimensional OLAP Servers
18
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or
detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
1
Fig. 3.10 Typical OLAP
Operations
2
Examples of OLAP Applications in
Various Functional Areas
3
DATA MINING vs. OLAP
1
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
– Top-down: Starts with overall design and planning (mature)
– Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
– Waterfall: structured and systematic analysis at each step before
proceeding to the next
– Spiral: rapid generation of increasingly functional systems, short turn
around time, quick turn around
• Typical data warehouse design process
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measure that will populate each fact table record
2
Data Warehouse: A Multi-Tiered Architecture
Monitor
& OLAP Server
Other Metadata
sources Integrator
Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining
Data Marts
– Analysis
– Design
– Import data
– Install front-end tools
– Test and deploy
5
Stage 1: Analysis
Analysis
– Design
• Identify: – Import data
– Target Questions – Install front-end tools
– Test and deploy
– Data needs
– Timeliness of data
– Granularity
• Create an enterprise-level data dictionary
• Dimensional analysis
– Identify facts and dimensions
6
Stage 2: Design
• Star schema – Analysis
• Data Transformation Design
– Import data
• Aggregates – Install front-end tools
– Test and deploy
• Pre-calculated
Values Dimensional Modeling
• HW/SW
Architecture
7
Dimensional Modeling
• Fact Table – The primary table in a
dimensional model that is meant to contain
measurements of the business.
• Dimension Table – One of a set of
companion tables to a fact table.
• Most dimension tables contain many
textual attributes that are the basis for
constraining and grouping within data
warehouse queries.
8
Stage 3: Import Data
• Identify data sources
– Analysis
• Extract the needed data from – Design
existing systems to a data Import data
staging area – Install front-end tools
• Transform and Clean the data – Test and deploy
– Resolve data type conflicts
– Resolve naming and key
conflicts
– Remove, correct, or flag bad data
– Conform Dimensions
• Load the data into the
warehouse
9
Importing Data Into the Warehouse
OLTP 1
OLTP 3
Operational Systems
(source systems)
10
Stage 4: Install Front-end Tools
– Analysis
– Design
• Reporting tools – Import data
Install front-end tools
• Data mining tools – Test and deploy
• GIS
• Etc.
11
Stage 5: Test and Deploy
– Analysis
– Design
• Usability tests – Import data
– Install front-end tools
• Software installation Test and deploy
• User training
• Performance tweaking based on usage
12