Data Warehouse Design Practices and Methodologies

Relational Database Concepts for Multidimensional Data
Objectives
• Discuss motivation for relational database
representation of multidimensional data
• Explain importance of grain determination
• Provide examples of types of fact tables

Motivation for Table Design
• Relational model dominance
- Most of the DBMS market is controlled by relational DBMS and open source
vendors
- SQL standard (now SQL:2011): 4,000 pages
• Lack of scalability for data cube engines
- Major problem reported in the early years of data warehouse deployment
- Not an active area of research and development
• Large amounts of research and development on relational
database performance
- Development of optimizing compilers
- Development of physical design software to select storage structures and
monitor performance
• New features
- SQL standard: query operators
- Proprietary: materialized views and query rewriting
Multidimensional Data Representations
[Figure: data cubes mapped to a fact table related to several dimension tables]

• Map business analyst representation to relational model
- Data cubes with dimensions and measures
- Relational design with tables and 1-M relationships (FKs)
- Dimensions to dimension tables
- Measures to fact tables
- Group fact and dimension tables

Grain: most detailed measure values stored


• Dimension table:
– Stores values of one dimension
– Often not normalized
• Fact table:
– Stores measure values
– Often contains multiple foreign keys (to dimension
tables)
Grain
• Most detailed stored value (Finest level) in fact tables
• Determined by the finest level of each dimension,
such as individual customer
• Determines the size of the data warehouse:
(product of dimension cardinalities * (1 – sparsity))
• Tradeoff
– Flexibility and size
• Grain too small: large data warehouse & increased computation time
• Grain too large: cannot answer queries on more detailed dimension
values
– Trend towards finer grains
Grain Example
• Sales fact table grain
– Coarse: customer postal codes (1,000), product type (100), store (200),
week (52)
– Fine: individual customer (200,000), individual product (2,000), store
(200), day (365)
– Numbers in parentheses indicate the number of values in each dimension
– Sparsity: coarse (5%), fine (75%)
• Sparsity = 1 – (number of cells with values / total number of cells)

• Impact
– Higher storage requirements for fine grain
• Storage = product of dimension sizes * (1 – sparsity)
– Storage requirements of the finer grain are about 7,000 times larger than
the coarser grain after reducing for sparsity.
– More reporting flexibility for fine grain
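The grain comparison can be checked with a short calculation. This is a minimal sketch that assumes sparsity is the fraction of empty cells, so the stored size is the product of the dimension cardinalities times (1 – sparsity). The function name `fact_table_cells` is illustrative only; the cardinalities and sparsity values are the ones from the grain example.

```python
from math import prod

def fact_table_cells(cardinalities, sparsity_pct):
    """Estimate stored cells: product of cardinalities reduced by sparsity.

    sparsity_pct is the percentage of empty cells (integer), so the
    fraction of cells that hold values is (100 - sparsity_pct) / 100.
    """
    total_cells = prod(cardinalities)
    return total_cells * (100 - sparsity_pct) // 100

# Coarse grain: postal code, product type, store, week
coarse = fact_table_cells([1_000, 100, 200, 52], sparsity_pct=5)
# Fine grain: individual customer, individual product, store, day
fine = fact_table_cells([200_000, 2_000, 200, 365], sparsity_pct=75)

ratio = fine / coarse
print(f"coarse={coarse:,} fine={fine:,} ratio={ratio:,.0f}")
```

The ratio comes out a little above 7,000, which agrees with the "about 7,000 times" figure quoted for the example.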


Measure Aggregation Properties
• “Aggregate Property” indicates allowable summary operations
for measures
• Additive
– Summarized by addition across all dimensions such as sales, profit
– Sales can be summed across product, time, customer, …
• Semi-Additive
– Summarized by addition in some but not all dimensions such as time
– Periodic measurements such as account balances and inventory levels
– Account balance can be summed across customer branch
– Account balance cannot be summed across time because balance is just
a point in time measurement
• Non-Additive
– Cannot be summarized by addition through any dimension
– Historical facts such as the unit price for a sale: the sum of unit prices by
customer zip code is not meaningful
– Unit price converted to extended price (price * quantity) is additive
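The difference between additive and semi-additive measures can be illustrated with a small sketch. The account balances below are hypothetical month-end snapshots, and the helper names are illustrative only. Summing balances across accounts for one month is meaningful; across time, the sketch takes the latest snapshot per account instead of summing.

```python
# Hypothetical month-end snapshots: (account, month, balance)
balances = [
    ("A", "2013-01", 100), ("A", "2013-02", 120),
    ("B", "2013-01", 50), ("B", "2013-02", 40),
]

def total_by_month(rows):
    """Sum balances across accounts within each month (allowed)."""
    totals = {}
    for _, month, balance in rows:
        totals[month] = totals.get(month, 0) + balance
    return totals

def closing_balance(rows):
    """Across time, keep the latest snapshot per account instead of summing."""
    latest = {}
    for account, _, balance in sorted(rows, key=lambda r: r[1]):
        latest[account] = balance
    return latest

print(total_by_month(balances))   # totals per month across accounts
print(closing_balance(balances))  # last known balance per account
```

Adding the January and February balances of account A would give 220, a number that corresponds to no real amount of money, which is why the measure is only semi-additive.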
Types (classification) of Fact Tables (FT)
• Fact tables are classified based on the types of measure
stored.
– Transaction FT
• Most common
• Usually additive measures, e.g. sales, web activity hits, purchases
– Snapshot (inventory level) FT
• Periodic or accumulating view of asset level
• Usually semi-additive measures, e.g. Inventory levels, accounts receivable
balances, accounts payable balances
– Factless
• Records event occurrence, e.g. attendance, room reservations and hiring
• No measures, just FKs
• This classification is somewhat fluid, as a fact table may be a
combination of these types.


Dimensions vs Measures
• Dimensions contain qualitative values (such as
names, dates, or geographical data). You can
use dimensions to categorize, segment, and
reveal the details in your data. Dimensions affect
the level of detail in the view. Measures contain
numeric, quantitative values that you can
measure.

Fact Table Examples

Transaction (Sales FT)   Periodic (Account FT)      Factless (Enrollment FT)
----------------------   ------------------------   ------------------------
Store                    Account                    Student
Product                  Account Type               Semester
Customer                 Balance Date               Course
Date                     Dividend Date              Faculty
Quantity                 Balance                    Date
Extended Price           Transaction Count Period
                         Dividend Cumulative
                         Dividend Current Year

Sales FT
- Quantity and Extended Price are additive
Account FT
- All measures are semi-additive across account
- Balance (sum)
- Transaction Count (cumulative)
- Dividend cumulative & current year (across account)
Enrollment FT
- For a university, some measures could be added, like TIME SPENT ONLINE or COURSE PAGE VISITS
- Figure annotation: non-additive, averagable
Table Design Patterns
Objectives
• Two separate but related topics
- Principles, schema patterns, and schema design problems
- Large scale DW development and example data warehouses

• Objectives:
- Understand the motivation for relational database
representation of multidimensional data
- Understand basic ideas of fact and dimension tables
- Recognize data modeling patterns for data warehouse
schemas
- Explain three alternatives for historical integrity
Star Schema Example
Traditional schema pattern for a DW; represents one data cube

Fact table
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost

Dimension tables (relationship to Sales in parentheses)
- Item (ItemSales): ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store (StoreSales): StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Customer (CustSales): CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim (TimeSales): TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
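A star schema of this kind can be sketched as relational tables using Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table and column names follow the star schema example, but the foreign key columns in `Sales`, the reduced column lists, the omitted Customer dimension, and all sample rows are assumptions made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (column lists shortened for the sketch)
CREATE TABLE Item (ItemId INTEGER PRIMARY KEY, ItemName TEXT, ItemCategory TEXT);
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreCity TEXT, StoreState TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeMonth INTEGER, TimeYear INTEGER);

-- Fact table: one foreign key per dimension plus the measures
CREATE TABLE Sales (
    SalesNo INTEGER PRIMARY KEY,
    ItemId INTEGER REFERENCES Item(ItemId),
    StoreId INTEGER REFERENCES Store(StoreId),
    TimeNo INTEGER REFERENCES TimeDim(TimeNo),
    SalesUnits INTEGER,
    SalesDollar REAL
);

-- Hypothetical sample rows
INSERT INTO Item VALUES (1, 'Cola', 'Drink'), (2, 'Bread', 'Food');
INSERT INTO Store VALUES (1, 'Denver', 'CO');
INSERT INTO TimeDim VALUES (1, 1, 2013), (2, 2, 2013);
INSERT INTO Sales VALUES (1, 1, 1, 1, 3, 6.0), (2, 2, 1, 1, 2, 5.0), (3, 1, 1, 2, 4, 8.0);
""")

# Typical star join: constrain on a dimension, group by a dimension column
rows = conn.execute("""
    SELECT i.ItemCategory, SUM(s.SalesUnits), SUM(s.SalesDollar)
    FROM Sales s
    JOIN Item i ON s.ItemId = i.ItemId
    JOIN TimeDim t ON s.TimeNo = t.TimeNo
    WHERE t.TimeYear = 2013
    GROUP BY i.ItemCategory
    ORDER BY i.ItemCategory
""").fetchall()
print(rows)
```

The query shape (fact table joined to each needed dimension, then grouped by dimension columns) is what the 1-M relationships between dimension and fact tables are meant to support.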
Constellation Schema Example
Multiple fact tables; fact tables share dimension tables

Fact tables
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
- Inventory: InvNo, InvQOH, InvCost, InvReturns

Dimension tables (relationships to fact tables in parentheses)
- Item (ItemSales, ItemInv): ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store (StoreSales, StoreInv): StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- TimeDim (TimeSales, TimeInv): TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
- Customer (CustSales): CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- Supplier (SuppInv): SuppId, SuppName, SuppCity, SuppState, SuppZip, SuppNation
Snowflake Schema Example
- Multiple levels of dimension tables

Fact table
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost

Dimension tables (relationships in parentheses)
- Item (ItemSales): ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store (StoreSales): StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation
- Division (DivStore, related to Store): DivId, DivName, DivManager
- Customer (CustSales): CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim (TimeSales): TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear
Time Representation for Fact Tables
• Time representation is crucial for data warehouses
– Most data warehouse queries use time in conditions.
– The principal use of time is to record the occurrence of facts.
• The simplest representation is a timestamp data type column in
a fact table.
• Alternatives:
– Many data warehouses use a foreign key to a time dimension table.
– A time dimension table supports convenient representation of an organization's specific calendar features,
such as holidays, fiscal years, and week numbers.
– The granularity of the time dimension table is usually in days.
– If time of day is also required for a fact table, it can be added as a column in the
fact table.
• A variation identified by Kimball in 2003 is the accumulating fact table.
– Records the status of multiple events, rather than one event.
– For example, order date, shipment date, delivery date, payment date, and so on.
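A day-grain time dimension table can be generated rather than entered by hand. The sketch below is one possible way to do it, using the column names of the TimeDim table from the schema examples. Several details are assumptions: TimeDay is treated as the day of the month, TimeDayOfWeek uses ISO numbering (Monday = 1), and the fiscal year is assumed to start in July and to be named after the calendar year in which it ends.

```python
from datetime import date, timedelta

def build_time_dim(start, days, fiscal_year_start_month=7):
    """Return one row (dict) per day, starting at `start`."""
    rows = []
    for n in range(days):
        d = start + timedelta(days=n)
        # Assumed convention: fiscal year named after the year in which it ends
        fiscal_year = d.year + 1 if d.month >= fiscal_year_start_month else d.year
        rows.append({
            "TimeNo": n + 1,                     # generated surrogate key
            "TimeDay": d.day,
            "TimeMonth": d.month,
            "TimeQuarter": (d.month - 1) // 3 + 1,
            "TimeYear": d.year,
            "TimeDayOfWeek": d.isoweekday(),     # Monday = 1 ... Sunday = 7
            "TimeFiscalYear": fiscal_year,
        })
    return rows

time_dim = build_time_dim(date(2013, 1, 1), 365)
print(time_dim[0])
print(time_dim[-1])
```

Organization-specific columns such as holiday flags would be added in the same loop, typically from a lookup table maintained by the business.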
Historical Integrity for Dimensions
• Primarily an issue when dimension data changes
• Fact rows are no longer historically accurate after a dimension
update
– For example, if the city column of a customer row changes, the related sales
rows are no longer historically accurate.
– The shipping address for a company may change
• Determine the importance of history preservation for dimension
columns
– Time representation is only important for selected columns that change
quickly, such as the credit score/rating of a customer
– History is typically not important for most columns, which are relatively
stable (or change slowly)
• Some inaccuracy is tolerated in summary query results
Three alternatives for Historical Integrity
I: Does not support historical integrity
II: New row / versioning
III: New attribute: limited history is represented: previous, past (2 versions)
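The three alternatives can be sketched on a single customer row whose city changes. This is an illustration only: the column names `CustKey`, `BeginDate`, `EndDate`, and `CustCityPrev` are hypothetical, as are the function names and the sample values.

```python
from datetime import date

customer = {"CustKey": 1, "CustId": "C1", "CustCity": "Denver",
            "BeginDate": date(2012, 1, 1), "EndDate": None}

def type1_overwrite(row, new_city):
    """I: overwrite in place; the old city is lost."""
    return {**row, "CustCity": new_city}

def type2_new_row(rows, cust_id, new_city, change_date, next_key):
    """II: close the current version and append a new row (new surrogate key)."""
    out = []
    for r in rows:
        if r["CustId"] == cust_id and r["EndDate"] is None:
            r = {**r, "EndDate": change_date}
        out.append(r)
    out.append({"CustKey": next_key, "CustId": cust_id, "CustCity": new_city,
                "BeginDate": change_date, "EndDate": None})
    return out

def type3_new_attribute(row, new_city):
    """III: keep limited history in an extra column (current and previous)."""
    return {**row, "CustCityPrev": row["CustCity"], "CustCity": new_city}

v1 = type1_overwrite(customer, "Boulder")
v2 = type2_new_row([customer], "C1", "Boulder", date(2013, 6, 1), next_key=2)
v3 = type3_new_attribute(customer, "Boulder")
```

With alternative II, existing fact rows keep pointing at the old surrogate key, so sales made while the customer lived in Denver still roll up to Denver.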


Summarizability Patterns
for Dimension Tables
Lesson Objectives
• Recognize data patterns with dimension summarizability
problems
• Recognize cardinalities in schema designs for dimension
summarizability problems
• Explain ways to resolve dimension summarizability
problems

Summarizability Motivation
• Summary computations in navigation and join operations
– Operations on hierarchical dimensions: drill down and rollup
– Join operations combining fact and dimension tables
• Violations of summarizability
– Incompleteness in rollup and drill down operations
– Double counting problem
– Incorrect results
– Erroneous decision making and user confusion
– Inability to use performance optimizations
• Relationships among
– dimension levels and
– dimension and fact tables
• Summarizability conditions:
– Intra dimension
– Inter dimension: fact/dimension table relationships


Drill Down Incompleteness Example
College     Enrollment
Business    1,250
CLAS        555
Eng         1,070
Total       2,875

Drill down to department:

Department        Enrollment
Civil Eng.        150
Comp. Sc.         650
Economics         330
Electrical Eng.   270
Math              225
Total             1,625

Drill down incomplete:
•Business has no departments
•Drill down does not show same total as rollup.
•Users may perceive difference as inconsistency.
Roll-up Incompleteness Example
Product   Sales
Beer      5
Bread     10
Milk      10
Napkin    20
Tuna      15
Total     60

Rollup to category:

Category   Sales
Drink      15
Food       25
Total      40

Roll-up incomplete:
•Napkin does not have a category.
•Parent level (rollup) shows a smaller total than child level.
•Users may perceive difference as inconsistency.
•Should have “other” category.
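The roll-up problem and its fix can be reproduced in a few lines. The sales figures are the ones from the example; the product-to-category mapping is inferred from the category totals (Drink = Beer + Milk, Food = Bread + Tuna), and the function name `rollup` is illustrative only.

```python
product_sales = {"Beer": 5, "Bread": 10, "Milk": 10, "Napkin": 20, "Tuna": 15}
# Napkin is missing on purpose: it has no category
category_of = {"Beer": "Drink", "Milk": "Drink", "Bread": "Food", "Tuna": "Food"}

def rollup(sales, parent_of, default_parent=None):
    """Sum child values by parent; children without a parent go to
    default_parent, or are dropped when no default is given."""
    totals = {}
    for child, amount in sales.items():
        parent = parent_of.get(child, default_parent)
        if parent is None:
            continue  # child without a parent is lost in the roll-up
        totals[parent] = totals.get(parent, 0) + amount
    return totals

incomplete = rollup(product_sales, category_of)
complete = rollup(product_sales, category_of, default_parent="Other")
print(sum(incomplete.values()), sum(complete.values()))
```

Without the default parent the category level totals 40; with an "Other" category it totals 60, matching the product level.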
Non Strict Example
Week     Sales
1-2013   5
2-2013   10
3-2013   10
4-2013   10
5-2013   20
6-2013   10
7-2013   10
8-2013   10
9-2013   10
Total    95

Rollup to month:

Month      Sales
Jan-2013   37
Feb-2013   53
Total      90

Non strict:
•Some weeks are split between months.
•M-N relationship between dimension levels
•Fine and coarse levels show different totals.
•Users may perceive difference as inconsistency.


Dimension Non Summarizability Patterns
[Figure: three parent-child relationship patterns between dimension levels]
(a) Drill-down incomplete: Parent and Child
(b) Roll-up incomplete: Parent and Child
(c) Non strict: Parent and Child
Dimension Non Summarizability Examples
• Drill-down incomplete: College (parent) and Department (child)
• Roll-up incomplete: Category (parent) and Product (child)
• Non strict: Month (parent) and WeekofYear (child)
Resolving Dimension Problems
• Drill-down and roll-up problems are due to exceptions
• Incomplete drill down: unallocated parent members
should be connected to a default or duplicate child member.
– E.g., colleges without departments must be represented by a new child
member, such as business college enrollments.
• Incomplete rollup: a new default parent member should
be used for a child member without a parent.
– E.g., a new non-food, non-beverage category should be added to
the parent level to relate the unassociated child member, i.e., napkin.
• Add a connection to the unallocated parent
– Use a default or duplicate parent entity and connection
Resolving Dimension Problems
• Non strict relationship (M-N) among dimension levels
– Design error
– Use levels in separate hierarchies
– E.g., WeekOfYear and Month levels should be placed in
different time hierarchies
– Eliminate the M-N relationship by placing a level in another
hierarchy: calendar week not in the same hierarchy as
month, OR
• go to a lower granularity, for example days, which exist
at the intersection
– Use a major or primary parent: for products having
multiple categories, use the major category
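Going down to day granularity removes the non strict relationship, because each day belongs to exactly one week and exactly one month. The sketch below uses hypothetical daily sales for ISO week 5 of 2013 (28 January to 3 February), a week that is split between two months, and rolls the same days up along two separate hierarchies. All names and amounts are illustrative.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales for a week that spans January and February
daily_sales = {
    date(2013, 1, 28): 3, date(2013, 1, 29): 3, date(2013, 1, 30): 3,
    date(2013, 1, 31): 3, date(2013, 2, 1): 3, date(2013, 2, 2): 3,
    date(2013, 2, 3): 2,
}

def week_of_year(d):
    iso = d.isocalendar()       # (ISO year, ISO week number, ISO weekday)
    return (iso[0], iso[1])

def month_of_year(d):
    return (d.year, d.month)

def rollup_days(sales, level):
    """Roll daily values up to the level computed by `level(day)`."""
    totals = defaultdict(int)
    for day, amount in sales.items():
        totals[level(day)] += amount
    return dict(totals)

by_week = rollup_days(daily_sales, week_of_year)
by_month = rollup_days(daily_sales, month_of_year)
print(by_week, by_month)
```

Both roll-ups start from the same days, so the week hierarchy and the month hierarchy produce the same grand total even though week 5 contributes to two months.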
Summarizability Patterns
for Dimension-Fact Relationships
Incomplete Dimension-Fact
Relationship
Customer-Month Sales
Customer   Month      Sales
Cust-1     Jan-2012   10
Cust-2     Jan-2012   5
Cust-3     Feb-2012   15
Total                 30

Rollup to month:

Month Sales
Month      Sales
Jan-2012   25
Feb-2012   15
Total      40

Incomplete:
•Inconsistent totals
•Some sales for anonymous customers: 10 in January 2012
•January sales larger than shown by known customers
•Caused by some facts not being related to known customers
Non Strict Dimension-Fact Relationship
(a) Unit sales by salesperson
Salesperson   Date          UnitSales
SP1           10-Feb-2013   10
SP2           10-Feb-2013   10
SP3           11-Feb-2013   15
SP4           12-Feb-2013   20
Total                       55

(b) Shared unit sales by salesperson
Salesperson   Date          UnitSales
SP1, SP2      10-Feb-2013   10
SP3           11-Feb-2013   15
SP4           12-Feb-2013   20
Total                       45

Non strict problem:
•Double counting sales with multiple sales people
•SP1 and SP2 shared a sale on February 10, 2013
•May not be a clear method to allocate sales amount to
individual sales person


Non Summarizability Schema Patterns
[Figure: two relationship patterns between a dimension and a fact table]
• Incomplete dimensioning: Dimension and Fact
• Non strict dimensioning: Dimension and Fact
Examples of Non Summarizability Schema Patterns
• Incomplete dimensioning: Customer and Sales
• Non strict dimensioning: Salesperson and Sales
Resolving Incomplete Dimension-Fact
Relationships
• Conceptually simple
– although the resolution may complicate the data
integration process.
• Data integration process changes
• Use default dimension entities
– For example, anonymous sales should be connected
to a default anonymous customer in the customer
entity type.
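The default dimension entity can be applied during loading. The sketch below uses the customer-month sales figures from the incomplete dimension-fact example, with the anonymous January sale of 10 represented by a missing customer. The key value `Cust-Anon` and the function names are hypothetical.

```python
ANONYMOUS = "Cust-Anon"  # hypothetical key of the default customer row

# Source rows: (customer, month, sales); None marks an anonymous sale
source_sales = [
    ("Cust-1", "Jan-2012", 10),
    ("Cust-2", "Jan-2012", 5),
    (None, "Jan-2012", 10),
    ("Cust-3", "Feb-2012", 15),
]

def load_fact(rows, default_customer=None):
    """Build fact rows; sales without a customer are attached to
    default_customer, or dropped when no default is given."""
    fact = []
    for customer, month, sales in rows:
        if customer is None:
            if default_customer is None:
                continue  # fact cannot be related to the customer dimension
            customer = default_customer
        fact.append((customer, month, sales))
    return fact

def total_sales(fact):
    return sum(sales for _, _, sales in fact)

without_default = total_sales(load_fact(source_sales))
with_default = total_sales(load_fact(source_sales, default_customer=ANONYMOUS))
print(without_default, with_default)
```

With the default customer in place, the customer-level total matches the month-level total of 40 instead of stopping at 30.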

Resolving Non Strict Dimension-Fact
Relationships
• Source data may have M-N relationships, not 1-M
relationships
• Adjust fact or dimension tables for a fixed number of
exceptions
– Multiple columns can be added to the fact or the dimension table
to allow for more than one customer.
– For example, the Customer table can have an additional column
SecondCustId to identify an optional second customer on the
invoice
• More complex solutions to support M-N relationships with
a variable number of connections
Resolution with Limited Related Entities

Resolution with Unlimited Related Entities
- Two fact tables and identifying relationship

Fact tables
- Sales (ItemSales, StoreSales, TimeSales): SalesNo, SalesUnits, SalesDollar, SalesCost
- SalesRole (relationships CustOf and CustSales): RoleNo, Weight

Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear

- Sales fact for item, store, and time
- SalesRole for customer; a customer plays at most one role in a sale.
Possibly use weight.
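The effect of a weight in a role table can be shown with the shared-sale data from the non strict salesperson example. This sketch applies the same idea to salespeople rather than customers: each role row links one sale to one participant. The equal 0.5 weights for the shared sale are an assumption, since the source notes there may be no clear allocation method, and all names are illustrative.

```python
# SalesNo -> UnitSales (sale 1 is shared by SP1 and SP2)
sales = {1: 10, 2: 15, 3: 20}

# Role rows: (SalesNo, Salesperson, Weight); weights per sale sum to 1
sales_role = [
    (1, "SP1", 0.5), (1, "SP2", 0.5),
    (2, "SP3", 1.0),
    (3, "SP4", 1.0),
]

def units_by_salesperson(weighted):
    """Join role rows to sales; apply the weight only when asked."""
    totals = {}
    for sales_no, salesperson, weight in sales_role:
        units = sales[sales_no] * (weight if weighted else 1)
        totals[salesperson] = totals.get(salesperson, 0) + units
    return totals

naive = units_by_salesperson(weighted=False)
allocated = units_by_salesperson(weighted=True)
print(sum(naive.values()), sum(allocated.values()))
```

The unweighted join counts the shared sale twice (total 55), while the weighted version preserves the true total of 45.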


Data Warehouse Design Methodologies
Lesson Objectives
• Gain insights about issues involved with
enterprise data warehouse development
• Compare and contrast THREE methodologies
for data warehouse design
• Understand the importance of grain on data
warehouse flexibility and capacity

Design Methodology
• Elements
– Phases to create design
artifacts and working system
– Human and automated
processes
– Project management skills
required to monitor
• Artifacts include:
dimensional models,
schema design,
data marts, and
data integration procedures
Design methodologies for data warehouses differ by
emphasis on the following three issues:
• Supply of data sources (internal and external sources, quality of data)
• Demand for BI (reporting and analysis requirements)
• Level of automation
Demand-Driven Data warehouse
Design Methodology
(Requirements driven approach, Kimball’98)
• Data mart: a collection of related facts important for a group of data warehouse users.
• Emphasizes the identification of data marts to capture the
intended usage of a data warehouse
Demand-Driven Methodology
Details
• Identify data marts
• Identify dimensions for data marts
– Matrix relating data marts and dimensions
– Standardize (conform) dimensions
• Design fact tables
– Define grain
– Determine details of dimensions (i.e. hierarchies)
– Define measures (including measure properties, i.e.,
aggregation & derivability)
Supply-Driven Data warehouse
Design Methodology (Moody & Kortink’00)

Emphasizes the analysis of existing data sources


Supply-Driven Methodology Details
• Classify entity types
– Transactional entity types: events (will become fact table in
star schema)
– Component entity types: related to events in 1-M
relationships (will become dimensions in Star schema)
• Refine dimensions
– Classification entity types: related to component entity types
in 1-M relationship
– Dimension hierarchies for component/classification entity
types
• Refine dimension model
– Collapse (denormalize to reduce snowflaking)
– Aggregate (make the grain coarser in fact entity types)


Hybrid Data warehouse
Design Methodology (Bonifati’01)
[Figure: process flow. Determine goals (using GQM forms and guidelines) and analyze ERDs (using fact and dimension table guidelines), then integrate models (using terminology analysis).]
Hybrid Methodology Details
• Collect user requirements:
– Use Goal/Question/Metric approach
– Develop dimensions and measures (demand driven)
• Analyze existing ER diagrams
– Identify entity types representing facts and dimensions
– Create star schemas (supply driven)
• Integrate star schemas
– Convert schemas to common terminology (using terminology
analysis)
– Match demand and supply models

Comparison
• Consider each methodology
– If you have the opportunity to lead a DWH design
project
• Overall, Hybrid approach is most appealing
– Developed to overcome the shortcomings of
both demand and supply driven approaches
– Has some structure for GQM in the analysis of
existing ERDs
• Major appeal of demand-driven
– Emphasis on grain determination
Case for Data Warehouse Design
Case on Data Warehouse Design
• Apply and integrate skills learned so far
– Schema patterns
– Summarizability problems and resolution
– Grain determination and size estimation
• Acquire new skills
– Integration: apply skills to a mini case study
• Data source specifications, business needs, and
sample data

Design Requirements
Steps, in sequence:
• Specify dimensions and measures
• Determine grain
• Identify summarizability problems and suggest resolutions
• Create table design
• Map data sources and populate tables
Data Source (1) for a fitness firm
Sales Database

Entity types
- Franchise: FranchId, FranchRegion, FranchPostalCode, FranchModelType
- MemberType: MemTypeId, MemTypeName, MemTypePrice
- Member: MmbrId, MmbrName, MmbrZip, MmbrEmail, MmbrDate
- ServiceCategory: ServCatId, ServCatName, ServCatPrice
- ServPurchase: ServPurchId, ServPurchDate
- Sale: SaleId, SaleDate
- Merchandise: MerchId, MerchName, MerchPrice, MerchType

Relationships
- FranchiseOf, MemTypeOf, SoldTo, ServMember, ServCatOf, Contains (Qty)

Note: Merchandising is any practice which contributes to the sale of
products to a retail consumer.
Sample Data of Data Source (1)
Data Source (2)
• Franchises also sell special events to corporate
and other organizations
– These sales are not standard; spreadsheets are used
to track special events.
– The sales database was never extended to
accommodate special event sales.
– Most franchises use a similar spreadsheet

Sample Spreadsheet for Data Source (2)
Business Intelligence Needs
• Support analysis of merchandise sales and service
purchases by
– franchise, merchandise or service type, and customer over
time
• They need detail by individual customer, product or
service, and franchise, and date
• For typical reporting applications, they need detail by
customer location, franchise location, and product or
service type, and week

Important Design Decisions
• Grain determination and relative size calculations
– Flexibility versus size
– Flexibility seems to have more priority
– Higher costs for accommodating more detailed grains
• Simplification
– Fact table choice: OLTP transactions with multiple levels
(i.e., ServPurchase and MerchSale tables) ==> fact table at a
single level
– Collapse 2 levels (operational database) into 1 level (DW)
• Mappings from source data to populate data
warehouse tables
– Insight about data integration requirements
– Discover summarizability problems


Grain Size Calculations

Mappings from Source Data

Associations
• Source column matching
• Conversions (units of measure, data types)

Additions
• Generated PK values
• Default values (missing values)
• Derived values
Grain Size Determination
• Determine sparsity
– Given dimension cardinalities and source table
cardinality
– Associate fact table to tables of data source
– 1 minus source table cardinality divided by product of
dimension cardinalities
• Determine fact table size
– Given dimension cardinalities and sparsity estimate
– Product of dimension cardinalities
– Reduce by sparsity
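The two steps above can be written as a pair of small functions. This is a minimal sketch: the function names are illustrative, and the dimension cardinalities and source table cardinality in the usage lines are hypothetical numbers, not values from the case.

```python
from math import prod

def estimate_sparsity(dimension_cardinalities, source_rows):
    """Sparsity = 1 - (source table cardinality / product of dimension cardinalities)."""
    return 1 - source_rows / prod(dimension_cardinalities)

def estimate_fact_rows(dimension_cardinalities, sparsity):
    """Fact table size = product of dimension cardinalities, reduced by sparsity."""
    return prod(dimension_cardinalities) * (1 - sparsity)

# Hypothetical example: four dimensions and a source table of 1,825,000 rows
dims = [1_000, 50, 10, 365]
sparsity = estimate_sparsity(dims, source_rows=1_825_000)
fact_rows = estimate_fact_rows(dims, sparsity)
print(f"sparsity={sparsity:.2%} fact rows={fact_rows:,.0f}")
```

Because the source table is associated with the fact table at the same grain, the estimated fact table size comes back to the source cardinality; the sparsity estimate becomes useful when it is reused for a grain whose source cardinality is not directly known.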