and Methodologies
Motivation for Table Design
• Relational model dominance
- Most of the DBMS market is controlled by relational DBMS and open source
vendors
- SQL standard (now SQL:2011): 4,000 pages
• Lack of scalability for data cube engines
- Major problem reported in the early years of data warehouse deployment
- Not an active area of research and development
• Large amounts of research and development of relational
database performance
- Development of optimizing compilers
- Development of physical design software to select storage structures and
monitor performance
• New features
- SQL standard: query operators
- Proprietary: materialized views and query rewriting
Multidimensional Data Representations
Data cubes
Dimension Table ... Dimension Table
Fact Table
Grain
• Most detailed stored value (Finest level) in fact tables
• Determined by the finest level of each dimension,
such as individual customer
• Determines the size of the data warehouse:
(product of dimension cardinalities * (1 – sparsity))
• Tradeoff
– Flexibility and size
• Grain too small: large data warehouse & increased computation time
• Grain too large: cannot answer queries on more detailed dimension
values
– Trend towards finer grains
Grain Example
• Sales fact table grain
– Coarse: customer postal codes (1,000), product type (100), store (200),
week (52)
– Fine: individual customer (200,000), individual product (2,000), store
(200), day (365)
– Numbers in parentheses indicate the number of values of each dimension
– Sparsity: coarse (5%), fine (75%)
• Sparsity = 1 – (number of cells with values / total number of cells)
• Impact
– Higher storage requirements for fine grain
• Storage = product of dimension sizes * (1 – sparsity)
– Storage requirements of the finer grain are about 7,000 times larger than
the coarser grain after reducing for sparsity.
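The grain comparison can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, assuming sparsity is the fraction of empty cells (as defined above), so the number of stored cells is the product of the dimension sizes times (1 – sparsity). The function name `fact_rows` is hypothetical.

```python
from math import prod

def fact_rows(dimension_sizes, sparsity):
    """Estimated stored cells: product of dimension sizes times (1 - sparsity)."""
    return prod(dimension_sizes) * (1 - sparsity)

# Dimension sizes and sparsity values taken from the grain example
coarse = fact_rows([1_000, 100, 200, 52], sparsity=0.05)       # postal code, product type, store, week
fine = fact_rows([200_000, 2_000, 200, 365], sparsity=0.75)    # customer, product, store, day

ratio = fine / coarse
print(f"coarse grain: {coarse:,.0f} cells")
print(f"fine grain:   {fine:,.0f} cells")
print(f"fine / coarse: {ratio:,.0f}x")
```

With these inputs the ratio comes out at roughly 7,400, which agrees with the "about 7,000 times" figure on the slide.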
Fact Table Examples
Transaction (Sales FT)
- Dimensions: Store, Product, Customer, Date
- Measures: Quantity, Extended Price
Periodic (Account FT)
- Dimensions: Account, Account Type, Balance Date, Dividend Date
- Measures: Balance, Transaction Count, Dividend Cumulative, Dividend Current Year
- All measures are semi-additive across account
- Balance (sum)
- Transaction Count (cumulative)
- Dividend cumulative & current year (across account)
Factless (Enrollment FT)
- Dimensions: Student, Semester, Course, Faculty, Date, Period
- For a university, some measures like TIME SPENT ONLINE or COURSE PAGE VISITS
Measure types noted on the slide: additive; non-additive (averagable)
Table Design Patterns
Objectives
• Two separate but related topics
- Principles, schema patterns, and schema design problems
- Large scale DW development and example data warehouses
• Objectives:
- Understand the motivation for relational database
representation of multidimensional data
- Understand basic ideas of fact and dimension tables
- Recognize data modeling patterns for data warehouse
schemas
- Explain three alternatives for historical integrity
Star Schema Example
Traditional schema pattern for a DW; represents one data cube
Fact table
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
Relationships (dimension to fact): ItemSales, StoreSales, CustSales, TimeSales
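A star schema of this shape can be sketched in a relational DBMS as one fact table with a foreign key to each dimension table. The example below is a minimal sketch using Python's built-in `sqlite3` module. It keeps only a few of the columns named on the slide and omits the Customer dimension for brevity. The foreign key columns in Sales and all row values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item (ItemId INTEGER PRIMARY KEY, ItemName TEXT, ItemCategory TEXT);
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreCity TEXT, DivName TEXT);
CREATE TABLE TimeDim (TimeNo INTEGER PRIMARY KEY, TimeMonth INTEGER, TimeYear INTEGER);
-- Fact table: one row per sale, with an assumed foreign key per dimension
CREATE TABLE Sales (
    SalesNo INTEGER PRIMARY KEY,
    ItemId INTEGER REFERENCES Item(ItemId),
    StoreId INTEGER REFERENCES Store(StoreId),
    TimeNo INTEGER REFERENCES TimeDim(TimeNo),
    SalesUnits INTEGER,
    SalesDollar REAL
);
-- Hypothetical sample rows
INSERT INTO Item VALUES (1, 'Cola', 'Beverage'), (2, 'Chips', 'Food');
INSERT INTO Store VALUES (10, 'Denver', 'West');
INSERT INTO TimeDim VALUES (100, 1, 2013), (101, 2, 2013);
INSERT INTO Sales VALUES
    (1, 1, 10, 100, 5, 10.0),
    (2, 2, 10, 100, 3, 9.0),
    (3, 1, 10, 101, 4, 8.0);
""")

# Roll up sales dollars by item category and month: join fact to dimensions, then group
rollup = conn.execute("""
    SELECT i.ItemCategory, t.TimeMonth, SUM(s.SalesDollar)
    FROM Sales s
    JOIN Item i ON i.ItemId = s.ItemId
    JOIN TimeDim t ON t.TimeNo = s.TimeNo
    GROUP BY i.ItemCategory, t.TimeMonth
    ORDER BY i.ItemCategory, t.TimeMonth
""").fetchall()
print(rollup)
```

The query shows the typical star join: the fact table is joined to each dimension of interest and the measure is aggregated by dimension attributes.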
Constellation Schema Example
Multiple fact tables; fact tables share dimension tables
Fact tables
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
- Inventory: InvNo, InvQOH, InvCost, InvReturns
Dimension tables
- Supplier: SuppId, SuppName, SuppCity, SuppState, SuppZip, SuppNation
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
Relationships to Sales: ItemSales, StoreSales, CustSales, TimeSales
Relationships to Inventory: SuppInv, StoreInv, ItemInv, TimeInv
Snowflake Schema Example
Fact table
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation
- Division: DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
Relationships (dimension to fact): ItemSales, StoreSales, CustSales, TimeSales
Relationship between dimension levels: DivStore (Division to Store)
Summarizability Motivation
• Summary computations in navigation and join operations
– Operations on hierarchical dimensions: drill down and rollup
– Join operations combining fact and dimension tables
• Violations of summarizability
– Incompleteness in rollup and drill down operations
– Double counting problem
– Incorrect results
– Erroneous decision making and user confusion
– Inability to use performance optimizations
• Relationships among
– dimension levels and
– dimension and fact tables
• Summarizability conditions:
– Intra dimension
Roll-up incomplete:
• Napkin does not have a category.
• Parent level (rollup) shows a smaller total than child level.
• Users may perceive the difference as an inconsistency.
• An “other” category should be added.
Non Strict Example
Week      Sales
1-2013    5
2-2013    10
3-2013    10
4-2013    10
5-2013    20
6-2013    10
7-2013    10
8-2013    10
9-2013    10
Total     95

Rollup
Month     Sales
Jan-2013  37
Feb-2013  53
Total     90

Non strict:
• Some weeks are split between months.
• M-N relationship between dimension levels
• Fine and coarse levels show different totals.
Dimension Non Summarizability Examples
[Figure: parent-child dimension level diagrams; details not recoverable]
Resolving Dimension Problems
• Drill-down and roll-up problems are due to exceptions
• Incomplete drill down: unallocated parent members
should be connected to a default or duplicate child member.
– E.g., colleges without departments must be represented by a new child
member, such as business college enrollments.
• Incomplete rollup: a new default parent member should
be used for a child member without a parent.
– E.g., a new non-food, non-beverage category should be added to
the parent level to relate unassociated child members, i.e., napkin.
• Add a connection to the unallocated parent
– Use a default or duplicate parent entity and connection
Resolving Dimension Problems
• Non strict relationship (M-N) among dimension levels
– Design error
– Use levels in separate hierarchies
– E.g., WeekOfYear and month levels should be placed in
different time hierarchies
– Eliminate the M-N relationship by placing a level in another
hierarchy: calendar week not in the same hierarchy as
month, OR
• go to a lower granularity, for example days, which exist
at the intersection of weeks and months
– Use a major or primary parent: for products having
multiple categories, use the major category
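The lower-granularity option can be illustrated with a small sketch. It assumes hypothetical daily sales of one unit per day for a single ISO week that spans January and February 2013. Because each day belongs to exactly one week and exactly one month, rolling up from days gives the same grand total along both hierarchies.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical facts at day grain: 1 unit per day, Mon 28-Jan-2013 to Sun 3-Feb-2013
start = date(2013, 1, 28)
daily = {start + timedelta(days=i): 1 for i in range(7)}

by_week = defaultdict(int)   # week hierarchy: day -> ISO week
by_month = defaultdict(int)  # month hierarchy: day -> calendar month
for day, units in daily.items():
    iso_year, iso_week, _ = day.isocalendar()
    by_week[(iso_year, iso_week)] += units
    by_month[(day.year, day.month)] += units

print(dict(by_week))
print(dict(by_month))
print(sum(by_week.values()) == sum(by_month.values()))
```

The week is never rolled up into a month here. Weeks and months are two separate parents of the day level, so the split week causes no inconsistency.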
Summarizability Patterns
for Dimension-Fact Relationships
Incomplete Dimension-Fact Relationship
Customer-Month Sales
Customer  Month     Sales
Cust-1    Jan-2012  10
Cust-2    Jan-2012  5
Cust-3    Feb-2012  15
Total               30

Rollup: Month Sales
Month     Sales
Jan-2012  25
Feb-2012  15
Total     40

Incomplete:
• Inconsistent totals
• Some sales for anonymous customers: 10 in January 2012
• January sales larger than shown by known customers
• Caused by some facts not being related to known customers
Non Strict Dimension-Fact Relationship
(a) Unit sales by salesperson
Salesperson  Date         UnitSales
SP1          10-Feb-2013  10
SP2          10-Feb-2013  10
SP3          11-Feb-2013  15
SP4          12-Feb-2013  20
Total                     55
[Figure: generic dimension-fact relationship patterns]
Examples of Non Summarizability
Schema Patterns
[Figure: Customer-Sales and Salesperson-Sales schema patterns]
Resolving Incomplete Dimension-Fact
Relationships
• Conceptually simple
– although the resolution may complicate the data
integration process.
• Data integration process changes
• Use default dimension entities
– For example, anonymous sales should be connected
to a default anonymous customer in the customer
entity type.
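The effect of a default dimension entity can be sketched with the customer-month figures from the incomplete relationship example. This is a minimal illustration: the member name `Cust-Anonymous` and the helper functions are hypothetical, and a sale with no known customer is represented by `None`.

```python
ANONYMOUS = "Cust-Anonymous"  # hypothetical default member of the customer dimension

customers = {"Cust-1", "Cust-2", "Cust-3"}
# (customer, month, sales); None marks a sale with no known customer
facts = [
    ("Cust-1", "Jan-2012", 10),
    ("Cust-2", "Jan-2012", 5),
    (None, "Jan-2012", 10),
    ("Cust-3", "Feb-2012", 15),
]

def total_by_customer(rows, dimension):
    """Sum sales over facts that join to a member of the customer dimension."""
    return sum(sales for cust, _, sales in rows if cust in dimension)

def total_by_month(rows):
    """Sum sales over all facts, regardless of customer."""
    return sum(sales for _, _, sales in rows)

before = total_by_customer(facts, customers)  # anonymous sales drop out of the join

# Resolution: during data integration, map missing customers to the default member
resolved = [(cust or ANONYMOUS, month, sales) for cust, month, sales in facts]
after = total_by_customer(resolved, customers | {ANONYMOUS})

print(before, total_by_month(facts), after)
```

Before the change the customer-level total is 30 against a month-level total of 40. Once the anonymous sale is tied to the default member, both totals are 40.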
Resolving Non Strict Dimension-Fact
Relationships
• Source data may have M-N relationships, not 1-M
relationships
• Adjust fact or dimension tables for a fixed number of
exceptions
– Multiple columns can be added to the fact or the dimension table
to allow for more than one customer.
– For example, the Customer table can have an additional column
SecondCustId to identify an optional second customer on the
invoice
• More complex solutions to support M-N relationships with
a variable number of connections
Resolution with Limited Related Entities
Resolution with Unlimited Related Entities
- Two fact tables and identifying relationship
- Sales fact for item, store, and time
Fact tables
- Sales: SalesNo, SalesUnits, SalesDollar, SalesCost
- SalesRole: RoleNo, Weight
Dimension tables
- Item: ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory
- Store: StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip,
StoreNation, DivId, DivName, DivManager
- Customer: CustId, CustName, CustPhone, CustStreet, CustCity, CustState,
CustZip, CustNation
- TimeDim: TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek,
TimeFiscalYear
Relationships: ItemSales, StoreSales, TimeSales, CustOf, CustSales
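One way to read the SalesRole table is as a bridge between Sales and Customer, in which Weight allocates each sale across its customers. The sketch below assumes that reading, which the slide does not state explicitly, and uses hypothetical rows. It also assumes that the weights for each sale sum to 1.

```python
# Sales fact: SalesNo -> SalesDollar (hypothetical values)
sales = {1: 100.0, 2: 60.0}

# SalesRole rows: (SalesNo, CustId, RoleNo, Weight); weights per sale sum to 1 (assumed)
sales_role = [
    (1, "C1", 1, 0.5),
    (1, "C2", 2, 0.5),
    (2, "C1", 1, 1.0),
]

# Weighted allocation: each customer receives its share of each sale
by_customer = {}
for sales_no, cust_id, _role_no, weight in sales_role:
    by_customer[cust_id] = by_customer.get(cust_id, 0.0) + sales[sales_no] * weight

# Naive join without weights counts sale 1 once per related customer
naive_total = sum(sales[sales_no] for sales_no, *_ in sales_role)

print(by_customer)
print(sum(by_customer.values()), sum(sales.values()), naive_total)
```

With weights, the customer-level totals add up to the total of the Sales fact table. Without them, the sale with two customers is double counted.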
Design Methodology
• Elements
– Phases to create design
artifacts and working system
– Human and automated
processes
– Project management skills
required to monitor
• Artifacts include:
dimensional models,
schema design,
data marts, and
data integration procedures
Design methodologies for a DWH differ by
emphasis on the following three issues:
• Supply of data sources (internal & external sources, quality of data)
• Demand for BI (reporting and analysis requirements)
• Level of automation
Demand-Driven Data warehouse
Design Methodology
(Requirements driven approach, Kimball’98)
Data mart: a collection of related facts important for a group of data warehouse users.
Hybrid Methodology Details
• Collect user requirements:
– Use Goal/Question/Metric approach
– Develop dimensions and measures (demand driven)
• Analyze existing ER diagrams
– Identify entity types representing facts and dimensions
– Create star schemas (supply driven)
• Integrate star schemas
– Convert schemas to common terminology (using terminology
analysis)
– Match demand and supply models
Comparison
• Consider each methodology
– If you have the opportunity to lead a DWH design
project
• Overall, Hybrid approach is most appealing
– Developed to overcome the shortcomings of
both demand and supply driven approaches
– Has some structure for GQM in the analysis of
existing ERDs
• Major appeal of demand-driven
– Emphasis on grain determination
Case for Data Warehouse Design
Case on Data Warehouse Design
• Apply and integrate skills learned so far
– Schema patterns
– Summarizability problems and resolution
– Grain determination and size estimation
• Acquire new skills
– Integration: apply skills to a mini case study
• Data source specifications, business needs, and
sample data
Design Requirements
1. Identify dimensions and measures
2. Determine grain
3. Specify summarizability problems and suggest resolutions
4. Create table design
5. Map data sources and populate tables
Data Source (1) for a fitness firm
Sales Database
- Franchise: FranchId, FranchRegion, FranchPostalCode, FranchModelType
- MemberType: MemTypeId, MemTypeName, MemTypePrice
- ServPurchase: ServPurchId, ServPurchDate
- Merchandise: MerchId, MerchName, MerchPrice, MerchType
- Relationships: ServMember, ServCatOf
Sample Data of Data Source (1)
Data Source (2)
• Franchises also sell special events to corporate
and other organizations
– These sales are not standard, so spreadsheets are used
to track special events.
– The sales database was never extended to
accommodate special event sales.
– Most franchises use a similar spreadsheet
Sample Spread Sheet for
Data Source (2)
Business Intelligence Needs
• Support analysis of merchandise sales and service
purchases by
– franchise, merchandise or service type, and customer over
time
• They need detail by individual customer, product or
service, and franchise, and date
• For typical reporting applications, they need detail by
customer location, franchise location, and product or
service type, and week
Important Design Decisions
• Grain determination and relative size calculations
– Flexibility versus size
– Flexibility seems to have more priority
– Higher costs for accommodating more detailed grains
• Simplification
– Fact table choice: OLTP transactions with multiple levels
(i.e., ServPurchase and MerchSale tables) ==> fact table at a
single level
– Collapse 2 levels (operational database) into 1 level (DW)
• Mappings from source data to populate data
warehouse tables
– Insight about data integration requirements
Mappings from Source Data
• Associations
– Source column matching
– Conversions (units of measure, data types)
• Additions
– Generated PK values
– Default values (missing values)
– Derived values
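These mapping types can be sketched as a single row transformation. The example below is illustrative only: `MerchId` and `MerchPrice` come from the source database, while `Qty`, `SaleDate`, `CustId`, the warehouse column names, the date format, and the `UNKNOWN` default are assumptions.

```python
from datetime import datetime
from itertools import count

surrogate_keys = count(1)  # generated PK values for the fact table

def map_row(src):
    """Map one source row (a dict of strings) to a warehouse fact row."""
    qty = int(src["Qty"])              # conversion: text to integer
    price = float(src["MerchPrice"])   # conversion: text to number
    return {
        "SalesNo": next(surrogate_keys),                                    # generated PK
        "MerchId": src["MerchId"],                                          # column matching
        "SaleDate": datetime.strptime(src["SaleDate"], "%m/%d/%Y").date(),  # data type conversion
        "CustId": src.get("CustId") or "UNKNOWN",                           # default for missing value
        "SalesUnits": qty,
        "SalesDollar": qty * price,                                         # derived value
    }

row = map_row({"MerchId": "M1", "MerchPrice": "19.50", "Qty": "2", "SaleDate": "02/10/2013"})
print(row)
```

Associations appear as the column matching and conversions. Additions appear as the generated key, the default customer, and the derived dollar amount.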
Grain Size Determination
• Determine sparsity
– Given dimension cardinalities and source table
cardinality
– Associate fact table to tables of data source
– 1 minus source table cardinality divided by product of
dimension cardinalities
• Determine fact table size
– Given dimension cardinalities and sparsity estimate
– Product of dimension cardinalities
– Reduce by sparsity
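The two steps can be written as small helper functions. This is a sketch under stated assumptions: the function names, the dimension cardinalities, the source row count, and the 50% sparsity used for the weekly grain are all hypothetical values chosen for illustration.

```python
from math import prod

def estimate_sparsity(dimension_cardinalities, source_rows):
    """Sparsity = 1 - source table cardinality / product of dimension cardinalities."""
    return 1 - source_rows / prod(dimension_cardinalities)

def fact_table_rows(dimension_cardinalities, sparsity):
    """Product of dimension cardinalities, reduced by sparsity."""
    return prod(dimension_cardinalities) * (1 - sparsity)

# Hypothetical detailed grain: customers, products or services, franchises, days
dims = [1_000, 50, 10, 365]
source_rows = 2_000_000  # hypothetical cardinality of the associated source table

sparsity = estimate_sparsity(dims, source_rows)
rows = fact_table_rows(dims, sparsity)

# Hypothetical coarser grain: weeks instead of days, with an assumed sparsity of 50%
weekly_rows = fact_table_rows([1_000, 50, 10, 52], sparsity=0.5)

print(f"sparsity at day grain: {sparsity:.4f}")
print(f"rows at day grain:     {rows:,.0f}")
print(f"rows at week grain:    {weekly_rows:,.0f}")
```

At the grain of the source table the size estimate simply recovers the source cardinality. The estimate is more useful for coarser or finer grains, where the sparsity has to be assumed or measured on sample data.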