
Data Warehousing

Data Warehousing Design
• Since the 1980s, data warehouses have evolved their own
design techniques, distinct from transaction-processing
systems
• Dimensional design techniques have emerged as the
dominant approach for most data warehouse databases
• Designing a data warehouse database is highly complex
• To begin a data warehouse project, we need answers for
questions such as:
– which user requirements are most important
– which data should be considered first
• For many enterprises the solution is data marts
• Few designers are willing to commit to an enterprise-wide
design that must meet all user requirements at one time
Dimensional Modeling
• The database component of a data warehouse is
described using a technique called
dimensionality modeling.
• Dimensionality modeling: A logical design
technique that aims to present the data in a
standard, intuitive form that allows for high-
performance access.
• Dimensionality modeling uses the concepts of
Entity–Relationship (ER) modeling with some
important restrictions
• Dimensional modeling provides a set of
methods and concepts that are used in data
warehouse (DWH) design.
• Dimensional modeling is a design technique
for databases intended to support end-user
queries in a data warehouse.
• Every dimensional model (DM) is composed
of one table with a composite primary key,
called the fact table, and a set of smaller
tables called dimension tables.
Fact Tables
• Facts are numerical values that can be
aggregated and analyzed
• A fact table has two types of columns: facts
and foreign keys to dimension tables.
• Contains two or more foreign keys
• Tend to have huge numbers of records
• Useful facts tend to be numeric and additive
Example
• The example fact table contains foreign keys to the time dimension,
product dimension, and customer dimension, plus the measurement
value units sold.
• Suppose a company sells products to customers. Every sale is a
fact that happens within the company, and the fact table is used
to record these facts.
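The sales example can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema from the slide: the table names, column names, and sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension tables (names and columns are illustrative).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, sale_date TEXT);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);

-- Fact table: one foreign key per dimension plus one numeric, additive measure.
CREATE TABLE sales_fact (
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
    customer_key INTEGER NOT NULL REFERENCES customer_dim(customer_key),
    units_sold   INTEGER NOT NULL,
    PRIMARY KEY (time_key, product_key, customer_key)
);
""")

conn.executemany("INSERT INTO time_dim VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO customer_dim VALUES (?, ?)", [(1, "Ann")])

# Every sale is recorded as one row of the fact table.
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 3), (1, 2, 1, 5), (2, 1, 1, 2)])

# Because the measure is additive, it can be summed along any dimension.
total_units = conn.execute("SELECT SUM(units_sold) FROM sales_fact").fetchone()[0]
units_by_product = dict(conn.execute(
    "SELECT product_key, SUM(units_sold) FROM sales_fact GROUP BY product_key"))
print(total_units, units_by_product)
```

The composite primary key here is made up of the three foreign keys, which is one common way to key a fact table.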
Dimension Tables
• Dimensions define hierarchies over, and descriptions of, the
fact values.
• Contain text and descriptive information
• The "1" side of a 1-M relationship with the fact table
• Generally the source of interesting constraints
• Typically contain the attributes for the SQL answer set.
• Dimension table has a primary key that uniquely
identifies each dimension row.
– This key is used to associate the Dimension table to a Fact
table.
• Dimension tables are normally de-normalized
Example
The customer dimension table normally includes the customer id, name,
address, gender, income group, education level, etc.

The Multi-Dimensional Model

[Figure: a fact table whose key columns (Prod Code, Time Code, Store Code) join it to the dimension tables (Product Info, Time Info, Store Info, ...) and whose numerical measure is Sales Qty.]
• Each dimension table has a primary key that
corresponds exactly to one of the components
of the composite key in the fact table.
– In other words, the primary key of the fact table is
made up of two or more foreign keys.
• Another important feature of a DM is that all
natural keys are replaced with surrogate
keys.
– This means that every join between fact and
dimension tables is based on surrogate keys, not
natural keys
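A minimal sketch of surrogate keys, assuming a hypothetical customer dimension loaded from a source system that identifies customers by codes such as "CUST-0042". The helper `surrogate_key_for`, the table layout, and the sample data are illustrative assumptions, not part of the original material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, assigned by the warehouse
    customer_id  TEXT NOT NULL,        -- natural key from the source system
    name         TEXT
);
CREATE TABLE sales_fact (
    customer_key INTEGER NOT NULL REFERENCES customer_dim(customer_key),
    units_sold   INTEGER NOT NULL
);
""")

def surrogate_key_for(customer_id, name):
    """Return the surrogate key for a natural key, adding a dimension row if needed."""
    row = conn.execute(
        "SELECT customer_key FROM customer_dim WHERE customer_id = ?",
        (customer_id,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO customer_dim (customer_id, name) VALUES (?, ?)",
        (customer_id, name))
    return cur.lastrowid

# Hypothetical source rows: (natural key, name, units sold).
source_sales = [("CUST-0042", "Ann", 3), ("CUST-0007", "Bob", 1), ("CUST-0042", "Ann", 2)]
for customer_id, name, units in source_sales:
    key = surrogate_key_for(customer_id, name)
    conn.execute("INSERT INTO sales_fact VALUES (?, ?)", (key, units))

# The join between fact and dimension uses only the surrogate key.
units_by_customer = dict(conn.execute("""
    SELECT d.customer_id, SUM(f.units_sold)
    FROM sales_fact f
    JOIN customer_dim d ON d.customer_key = f.customer_key
    GROUP BY d.customer_id
"""))
print(units_by_customer)
```

The natural key stays in the dimension table as an ordinary attribute, so it can still be reported on, but the fact table never stores it.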

Dimensional Modeling
• Dimensions are organized into hierarchies
– E.g., Time dimension: days → weeks → quarters
– E.g., Product dimension: product → product line → brand
• Dimensions have attributes
– Time: Date, Month, Year
– Store: StoreID, City, State, Country, Region
CSE601
Dimension Hierarchies
• Store dimension: Total → Region → District → Stores
• Product dimension: Total → Manufacturer → Brand → Products
• Analysts tend to look at the data through a dimension at a particular “level” in the
hierarchy
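Looking at data at a particular level of a hierarchy amounts to grouping the facts by that level's attribute. A small sketch for the store hierarchy, assuming hypothetical store, district, and region names and a hypothetical `roll_up` helper:

```python
from collections import defaultdict

# Hypothetical store dimension rows: store -> (district, region).
store_dim = {
    "S1": ("D1", "East"),
    "S2": ("D1", "East"),
    "S3": ("D2", "East"),
    "S4": ("D3", "West"),
}

# Hypothetical facts: (store, sales quantity).
sales = [("S1", 10), ("S2", 5), ("S3", 7), ("S4", 4)]

def roll_up(level):
    """Sum sales at one level of the store hierarchy."""
    totals = defaultdict(int)
    for store, qty in sales:
        district, region = store_dim[store]
        key = {"store": store, "district": district,
               "region": region, "total": "Total"}[level]
        totals[key] += qty
    return dict(totals)

print(roll_up("district"))
print(roll_up("region"))
print(roll_up("total"))
```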
Schema Design
• Schema is a logical description of the entire
database.
• It includes the name and description of records of
all record types including all associated data-items
and aggregates.
• Much like a database, a data warehouse also
needs to maintain a schema.
• An operational database uses the relational model, while a data
warehouse uses a star, snowflake, or fact
constellation schema.
• Most data warehouses use a star schema to
represent the multi-dimensional model.

Star Schema
• The links between the fact table in the center and the
dimension tables in the extremities form a shape like a star.
• Because the bulk of data in a data warehouse is represented as facts,
the fact tables can be extremely large relative to the
dimension tables.
• It is important to treat fact data as read-only reference data
that will not change over time.
• The most useful fact tables contain one or more numerical
measures, or ‘facts’, that occur for each record
• Dimension tables, by contrast, generally contain
descriptive textual information.
• Dimension attributes are used as the constraints in data
warehouse queries
• Star schema: a logical structure that has a
fact table containing factual data in the
center, surrounded by dimension tables
containing reference data
• Each dimension in a star schema is
represented with only one dimension table.
– This dimension table contains the set of
attributes.
– There is a fact table at the center. It contains the
keys to each of the dimensions.
– The fact table also contains the measures,
for example dollars sold and units sold.
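A typical star-schema query joins the central fact table to each dimension, filters on dimension attributes, and sums the measures. The sketch below assumes three hypothetical dimensions (time, item, store) and made-up sample rows; only the measures dollars sold and units sold come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One denormalized table per dimension (hypothetical columns).
CREATE TABLE time_dim  (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item_dim  (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table at the center: keys to every dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim  VALUES (1, 2023, 'Q4'), (2, 2024, 'Q1');
INSERT INTO item_dim  VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Globex');
INSERT INTO store_dim VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0, 10),
    (2, 1, 1,  50.0,  5),
    (2, 1, 2,  30.0,  3),
    (2, 2, 1,  80.0,  4);
""")

# Dimension attributes supply the constraints (WHERE) and the grouping (GROUP BY).
rows = conn.execute("""
    SELECT s.city, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim  t ON t.time_key  = f.time_key
    JOIN item_dim  i ON i.item_key  = f.item_key
    JOIN store_dim s ON s.store_key = f.store_key
    WHERE t.year = 2024 AND i.brand = 'Acme'
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(rows)
```

Each dimension is exactly one join away from the fact table, which is what keeps star queries simple.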
Star Schema
[Figure: star schema diagram]
Star Schema Example
[Figures: star schema examples (two slides)]
Star Schema with Sample Data
[Figure: star schema with sample data]
Need for Aggregates
• Sizes of typical tables:
– Time dimension: 5 years × 365 days = 1,825 days
– Store dimension: 300 stores reporting daily sales
– Product dimension: 40,000 products in each store
(about 4,000 sell in each store daily)
– Maximum number of base fact table records: about 2 billion
(lowest level of detail)
• A query involving 1 brand, all stores, 1 year:
retrieve/summarize over 7 million fact table rows.
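The size estimate above can be checked with a few lines of arithmetic. The last figure is a back-calculation and an assumption on our part: the slide does not say how many products a brand has, so we only derive how many of the brand's products would have to sell per store per day for the query to touch about 7 million rows.

```python
days = 5 * 365                          # 1,825 days in the time dimension
stores = 300
products_sold_per_store_per_day = 4_000

# One base fact row per (day, store, product actually sold).
max_base_rows = days * stores * products_sold_per_store_per_day

# Query: 1 brand, all stores, 1 year, reported as roughly 7 million rows.
query_rows = 7_000_000
store_days_in_year = stores * 365
implied_brand_products = query_rows / store_days_in_year

print(f"{max_base_rows:,}")
print(round(implied_brand_products))
```

The product gives 2,190,000,000 rows, which matches the "2 billion" order of magnitude; the query figure corresponds to roughly 64 products of the brand selling in each store each day.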

Aggregating Fact Tables
• Aggregate fact tables are summaries of the
most granular data at higher levels along the
dimension hierarchies.

[Figure: hierarchy levels in each dimension of a sales star schema]
– Product dimension: Product key, Product, Category, Department
– Store dimension: Store key, Store name, Territory, Region
– Time dimension: Time key, Date, Month, Quarter, Year
– Fact table: Product key, Time key, Store key, Unit sales, Sale dollars
– Multi-way aggregates: Territory – Category – Month (data values at higher level)
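A Territory – Category – Month aggregate can be sketched as a summary table built from the base fact table with a GROUP BY over the higher-level attributes. The aggregate table name and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, territory TEXT);
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE sales_fact  (product_key INTEGER, time_key INTEGER, store_key INTEGER,
                          unit_sales INTEGER, sale_dollars REAL);

INSERT INTO product_dim VALUES
    (1, 'Cola', 'Drinks'), (2, 'Lemonade', 'Drinks'), (3, 'Chips', 'Snacks');
INSERT INTO store_dim VALUES (1, 'Store A', 'North'), (2, 'Store B', 'North');
INSERT INTO time_dim  VALUES (1, '2024-01-05', '2024-01'), (2, '2024-01-20', '2024-01');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 2,  4.0),
    (2, 1, 2, 1,  3.0),
    (1, 2, 2, 3,  6.0),
    (3, 2, 1, 5, 10.0);

-- Aggregate fact table: the same measures, summarized at a higher level
-- of each dimension hierarchy (territory, category, month).
CREATE TABLE sales_agg_territory_category_month AS
SELECT s.territory, p.category, t.month,
       SUM(f.unit_sales)   AS unit_sales,
       SUM(f.sale_dollars) AS sale_dollars
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
JOIN store_dim   s ON s.store_key   = f.store_key
JOIN time_dim    t ON t.time_key    = f.time_key
GROUP BY s.territory, p.category, t.month;
""")

agg_rows = conn.execute(
    "SELECT * FROM sales_agg_territory_category_month ORDER BY category").fetchall()
print(agg_rows)
```

Queries at the territory, category, or month level can then read the much smaller aggregate table instead of scanning the base fact table.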
Families of Stars

[Figure: three fact tables and eight dimension tables arranged as a family of stars]
Snowflake Schema
• A variant of the star schema where dimension
tables do not contain denormalized data.
– The normalization splits up the data into additional
tables.
– Unlike the star schema, the dimension tables in a
snowflake schema are normalized.
– Because normalization reduces redundancy, the
snowflake schema is easier to maintain and saves
storage space

Example: the item dimension table of the star schema is normalized and split into two
dimension tables, namely the item table and the supplier table.
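The item/supplier split can be sketched as follows. The column names (for example `supplier_type`) and the sample rows are hypothetical; the point is only that the supplier attributes move out of the item table and are linked back through a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star version: supplier attributes are repeated on every item row.
CREATE TABLE item_star (item_key INTEGER PRIMARY KEY, item_name TEXT,
                        supplier_name TEXT, supplier_type TEXT);
INSERT INTO item_star VALUES
    (1, 'Pen',     'Acme',   'Wholesale'),
    (2, 'Pencil',  'Acme',   'Wholesale'),
    (3, 'Stapler', 'Globex', 'Retail');

-- Snowflake version: supplier attributes live in their own table.
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY,
                       supplier_name TEXT, supplier_type TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                   supplier_key INTEGER REFERENCES supplier(supplier_key));

INSERT INTO supplier (supplier_name, supplier_type)
    SELECT DISTINCT supplier_name, supplier_type FROM item_star ORDER BY supplier_name;
INSERT INTO item
    SELECT i.item_key, i.item_name, s.supplier_key
    FROM item_star i JOIN supplier s ON s.supplier_name = i.supplier_name;
""")

# Each supplier is now stored once instead of once per item.
supplier_count = conn.execute("SELECT COUNT(*) FROM supplier").fetchone()[0]

# Reading supplier attributes for an item now costs one extra join.
snowflake_rows = conn.execute("""
    SELECT i.item_name, s.supplier_name
    FROM item i JOIN supplier s ON s.supplier_key = i.supplier_key
    ORDER BY i.item_key
""").fetchall()
print(supplier_count, snowflake_rows)
```

The redundancy is gone, but every query that needs supplier attributes pays for the additional join.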
Snowflake Schema
• Snowflake schema is a type of star schema
but a more complex model.
• “Snowflaking” is a method of normalizing
the dimension tables in a star schema.
• The normalization eliminates redundancy.
• The result is more complex queries and
reduced query performance.

Sales: Snowflake Schema
[Figure: sales snowflake schema]
– Sales fact: Product key, Time key, Customer key, ….
– Product: Product key, Product name, Product code, Brand key
– Brand: Brand key, Brand name, Category key
– Category: Category key, Product category
– Salesrep: Salesrep key, Salesperson name, Territory key
– Territory: Territory key, Territory name, Region key
– Region: Region key, Region name
Snowflaking
• The attributes with low cardinality in each
original dimension table are removed to
form separate tables. These new tables are
linked back to the original dimension table
through artificial keys.

[Figure: the product dimension after snowflaking]
– Product: Product key, Product name, Product code, Brand key
– Brand: Brand key, Brand name, Category key
– Category: Category key, Product category
Snowflake Schema
• Advantages:
– Normalized structures are easier to update and
maintain
• Disadvantages:
– Browsing through the contents is more difficult
– Query performance degrades because of additional joins

Fact Constellation Schema
• A fact constellation has multiple fact tables.
It is also known as galaxy schema.
• The following diagram shows two fact
tables, namely sales and shipping.

The “Fact Constellation” Schema
[Figure: fact constellation with three fact tables and three dimensions]
– Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
– Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence
– Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
– Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
– District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
– Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
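A fact constellation can be sketched as several fact tables that share dimension tables. The example below keeps a store-level fact table and a district-level fact table that both reference the same product and period dimensions. It is a simplified assumption: the Price column and most dimension attributes are omitted, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables (reduced to a few hypothetical columns).
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT, district_id INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE period_dim  (period_key INTEGER PRIMARY KEY, year INTEGER);

-- First fact table: store-level detail.
CREATE TABLE store_fact (
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES period_dim(period_key),
    dollars REAL, units INTEGER
);

-- Second fact table: district level, sharing the product and period dimensions.
CREATE TABLE district_fact (
    district_id INTEGER,
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES period_dim(period_key),
    dollars REAL, units INTEGER
);

INSERT INTO store_dim   VALUES (1, 'Pune', 10), (2, 'Mumbai', 10), (3, 'Delhi', 20);
INSERT INTO product_dim VALUES (1, 'Widget');
INSERT INTO period_dim  VALUES (1, 2024);
INSERT INTO store_fact  VALUES (1, 1, 1, 100.0, 10), (2, 1, 1, 50.0, 5), (3, 1, 1, 70.0, 7);

-- Populate the district-level fact table from the store-level one.
INSERT INTO district_fact
    SELECT s.district_id, f.product_key, f.period_key, SUM(f.dollars), SUM(f.units)
    FROM store_fact f JOIN store_dim s ON s.store_key = f.store_key
    GROUP BY s.district_id, f.product_key, f.period_key;
""")

# Both fact tables can be joined to the same product dimension.
district_rows = conn.execute("""
    SELECT d.district_id, p.product_desc, d.dollars, d.units
    FROM district_fact d JOIN product_dim p ON p.product_key = d.product_key
    ORDER BY d.district_id
""").fetchall()
print(district_rows)
```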
What is the Best Design?
• Performance benchmarking can be used to
determine what is the best design.
• Snowflake schema: easier to maintain dimension
tables when dimension tables are very large
(reduces overall space). It is not generally
recommended in a data warehouse environment.
• Star schema: more effective for data cube
browsing (fewer joins), which can improve query performance.

Starflake Schema
• A hybrid structure that contains a mixture
of star and snowflake.
– The most appropriate database schemas use a
mixture of de-normalized star and normalized
snowflake schemas.
