
Dimensional Modeling Techniques

illustrated by Ralph Kimball 


Ralph Kimball founded the Kimball Group. Since the mid-1980s, he was the DW/BI industry’s thought leader
on the dimensional approach and trained more than 20,000 students. He co-authored all the books in
the Toolkit series. Ralph co-taught Kimball University’s dimensional modeling classes with Margy and ETL
architecture classes with Bob. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented
the first commercially-available workstation with a graphical user interface at Xerox’s Palo Alto Research Center
(PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University.
What is Dimensional Modeling
• Dimensional modeling (DM) is a methodology including a set of
techniques and concepts developed by Ralph Kimball for use in 
data warehouse design.[1] It is considered to be different from 
entity-relationship modeling (ER). Dimensional modeling does not
necessarily involve a relational database.
• Dimensional modeling always uses the concepts of facts (measures)
and dimensions (context). Facts are typically (but not always) numeric
values that can be aggregated, and dimensions are groups of
hierarchies and descriptors that define the facts.
BI Data Cycle

Basic Elements of the Data Warehouse

Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, 2nd Edition, 2002
Time_Key Item_Key Branch_Key Location_Key Dollars_Sold Units_Sold
20180828 123456 7 8 1000 10
20180828 123457 7 8 1010 8
20180828 123456 6 8 2000 6
20180828 123457 6 9 1000 15
20180828 123457 8 9 3000 25

Sale_Key  Item_Key  SalePrice  Discount  Delivery_Charges
9         1         90         0         2
9         2         10         10        2
Total               100        10        4

Time_Key  Item_Key  Location_Key  Dollars_Sold  Units_Sold
20180828  123456    8             1000          10
20180828  123457    8             1010          8
20180828  123456    8             2000          6
20180828  123457    9             4000          40
Four Key Decisions of Dimensional Modeling
1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
Declaring the grain
• Declaring the grain is the pivotal step in a dimensional design
• The grain establishes exactly what a single fact table row represents
• The grain must be declared before choosing dimensions or facts
because every candidate dimension or fact must be consistent with
the grain.
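
A declared grain can be checked mechanically: no two fact rows should share the same combination of grain columns. The sketch below is illustrative only. It assumes the grain of the sample sales rows is one row per Time_Key, Item_Key, Branch_Key and Location_Key; the function name `grain_violations` is hypothetical.

```python
from collections import Counter

# Sample fact rows:
# (Time_Key, Item_Key, Branch_Key, Location_Key, Dollars_Sold, Units_Sold)
fact_rows = [
    (20180828, 123456, 7, 8, 1000, 10),
    (20180828, 123457, 7, 8, 1010, 8),
    (20180828, 123456, 6, 8, 2000, 6),
    (20180828, 123457, 6, 9, 1000, 15),
    (20180828, 123457, 8, 9, 3000, 25),
]


def grain_violations(rows, key_indexes):
    """Return the grain-key combinations that appear in more than one row."""
    counts = Counter(tuple(row[i] for i in key_indexes) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Assumed grain: time, item, branch and location (columns 0 to 3).
declared_grain = grain_violations(fact_rows, [0, 1, 2, 3])

# Coarser candidate grain without Branch_Key (columns 0, 1 and 3).
coarser_grain = grain_violations(fact_rows, [0, 1, 3])

print(declared_grain)  # no duplicates at the declared grain
print(coarser_grain)   # combinations that would need to be aggregated
```

If the coarser grain were chosen, the rows listed in `coarser_grain` would have to be summed before loading so that each combination appears once.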
Identify the dimensions
Dimensions provide the “who, what, where, when, why, and how” 

Every dimension table has a single primary key column

This primary key is embedded as a foreign key in any associated fact table

Dimension tables are usually wide, flat, denormalized tables with many low-cardinality text
attributes

These dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1
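
A minimal sketch of sequential surrogate key assignment, assuming the natural keys arrive as a simple list from a source system. The product codes and the function name `assign_surrogate_keys` are hypothetical.

```python
from itertools import count


def assign_surrogate_keys(natural_keys, start=1):
    """Map each distinct natural key to a sequential integer surrogate key."""
    next_key = count(start)
    mapping = {}
    for natural_key in natural_keys:
        # Only the first sighting of a natural key consumes a new integer.
        if natural_key not in mapping:
            mapping[natural_key] = next(next_key)
    return mapping


# Hypothetical product codes from a source system, with one repeat.
source_codes = ["SKU-900", "SKU-104", "SKU-900", "SKU-377"]
product_keys = assign_surrogate_keys(source_codes)
print(product_keys)
```

The surrogate keys carry no business meaning; they only identify dimension rows and are what the fact table stores as foreign keys.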

Sale_Key  Item_Key  SalePrice  Discount  Delivery_Charges
9         1         90         0         2
9         2         10         10        2
Total               100        10        4
Dimension Normalization
• Normalization makes the data structure more complex
• Performance can be slower because of the many joins between tables
• The space savings are minimal
• Bitmap indexes can't be used
• Query performance: 3NF databases suffer from performance
problems when aggregating or retrieving the many dimensional values
that an analysis may require. If you are only going to produce
operational reports, you may be able to get by with 3NF because
operational users look for very fine-grained data.
Identify the facts. 
• Facts are the measurements that result from a business process event
and are almost always numeric.
• At the lowest grain a fact table always contains foreign keys for each
of its associated dimensions, as well as optional degenerate
dimension keys and date/time stamps.
• Null-valued measurements behave gracefully in fact tables. The
aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do the
“right thing” with null facts. However, nulls must be avoided in the
fact table’s foreign keys because these nulls would automatically
cause a referential integrity violation.
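
The behavior of aggregate functions with null facts can be seen with an in-memory SQLite database from Python's standard library. This is a small sketch; the table and column names (`sales_fact`, `item_key`, `dollars_sold`) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The foreign key column is declared NOT NULL; the measure may be null.
conn.execute(
    "CREATE TABLE sales_fact (item_key INTEGER NOT NULL, dollars_sold REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 30.0)],
)

row = conn.execute(
    "SELECT SUM(dollars_sold), COUNT(dollars_sold), COUNT(*), "
    "MIN(dollars_sold), MAX(dollars_sold), AVG(dollars_sold) "
    "FROM sales_fact"
).fetchone()
total, non_null_count, row_count, lowest, highest, average = row
conn.close()

# The null measurement is skipped by every aggregate except COUNT(*).
print(total, non_null_count, row_count, lowest, highest, average)
```

Note that AVG divides by the number of non-null measurements (2), not by the number of rows (3).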
Additive, Semi-Additive, and Non-Additive Facts

Additive measures: fact measures that can be aggregated across all dimensions and their
hierarchies.
An example of a fully additive measure is sales (purchases from a store). You can add hourly
sales to get the sales for a day, week, month, quarter, or year. You can add sales across stores
or regions.

Semi-additive measures: fact measures that can be aggregated across some dimensions and
their hierarchies but not across others, most commonly the time dimension.
Example: a daily balances fact can be summed across the customer dimension but not
across the time dimension.

Non-additive measures: fact measures that cannot be aggregated across any dimension or
hierarchy.
Facts that hold calculated percentages or ratios fall into this class.
A good example of a non-additive measure is 'Discount On Price' on a sales record.
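
The semi-additive case can be illustrated with a few daily balance rows. The sketch below uses hypothetical customers and balances; it sums across customers within a day and uses a closing or average balance across days.

```python
# Hypothetical daily balance snapshot rows: (date_key, customer, balance)
balances = [
    (20180827, "A", 100),
    (20180827, "B", 50),
    (20180828, "A", 120),
    (20180828, "B", 50),
]


def total_by_day(rows):
    """Sum balances across customers within each day (a valid aggregation)."""
    totals = {}
    for date_key, _customer, balance in rows:
        totals[date_key] = totals.get(date_key, 0) + balance
    return totals


daily_totals = total_by_day(balances)

# Adding the daily totals together does not produce a real balance.
misleading_sum = sum(daily_totals.values())

# Across time, use the closing balance or an average instead.
closing_balance = daily_totals[max(daily_totals)]
average_balance = sum(daily_totals.values()) / len(daily_totals)

print(daily_totals, misleading_sum, closing_balance, average_balance)
```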
Data Pipeline

Alan Marazzi, Building a Data Pipeline from Scratch

Data Engineering

Darshan Joshi, Data Processing Pipeline Patterns

Non Additive
Item ID Price Discount Payable Measure
1 100 10 90 10%
2 100 5 95 5%
3 200 25 175 13%
4 300 30 270 10%
SUM 700 70 630 10.0%
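
Using the four rows in the table above, the sketch below shows why a percentage is non-additive: the overall discount rate must be recomputed from the summed Price and Discount columns rather than derived from the per-row percentages. Variable names are illustrative.

```python
# (item_id, price, discount) taken from the table above
rows = [(1, 100, 10), (2, 100, 5), (3, 200, 25), (4, 300, 30)]

# Per-row discount rates: 10%, 5%, 12.5% and 10%.
row_ratios = [discount / price for _item, price, discount in rows]

# Averaging the row percentages weights every item equally.
average_of_ratios = sum(row_ratios) / len(row_ratios)

# The correct overall rate is the ratio of the additive sums.
total_price = sum(price for _item, price, _discount in rows)
total_discount = sum(discount for _item, _price, discount in rows)
ratio_of_sums = total_discount / total_price

print(average_of_ratios, ratio_of_sums)
```

The ratio of sums reproduces the 10.0% in the SUM row, while the simple average of the row percentages gives a different figure.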
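Discount tiers by Units_Sold:
Units_Sold  Discount
>=10        2%
>=20        3%
>=30        4%
>=50        5%

Dollars_Sold  Units_Sold  Discount  Discount_Amount  Payable
1000          10          2%        20               980
1010          8           0                          1010
2000          6           0                          2000
3000          25          3%        30               2970
3000          25          3%        30               2970
Totals: Dollars_Sold 10010, Units_Sold 74, Discount 8%, Payable 9930
Payable / Dollars_Sold: 99.2%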

Time_Key Saled_ID_Key Item_Key Branch_Key Location_Key Dollars_Sold Units_Sold Discount
20180828 123 123456 1 2 1000 10 2%
20180828 123 123457 1 2 1010 8 0
20180828 123 123458 1 2 2000 6 0
20180828 123 123459 1 2 3000 25 3%
20180828 123 123460 1 2 3000 25 3%
The 10 Essential Rules of Dimensional Modeling
• Rule #1: Load detailed atomic data into dimensional structures.
• Rule #2: Structure dimensional models around business processes.
• Rule #3: Ensure that every fact table has an associated date dimension table.
• Rule #4: Ensure that all facts in a single fact table are at the same grain or level of
detail.
• Rule #5: Resolve many-to-many relationships in fact tables.
• Rule #6: Resolve many-to-one relationships in dimension tables.
• Rule #7: Store report labels and filter domain values in dimension tables.
• Rule #8: Make certain that dimension tables use a surrogate key.
• Rule #9: Create conformed dimensions to integrate data across the enterprise.
• Rule #10: Continuously balance requirements and realities to deliver a DW/BI
solution that’s accepted by business users and that supports their decision-making.
Slowly Changing Dimension
Three commonly used types of slowly changing dimensions are:
• Type 1 Slowly Changing Dimension: This method overwrites the
existing value with the new value and does not retain history.
• Type 2 Slowly Changing Dimension: This method adds a new row for
the new value and maintains the existing row for historical and
reporting purposes.
• Type 3 Slowly Changing Dimension: This method creates a new
current value column in the existing record but also retains the original
column.
Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words,
no history is kept.
In our example, recall we originally have the following table:
Customer Key Name State
1001 Christina Illinois
After Christina moved from Illinois to California, the new information replaces the original record, and we have the
following table:
Customer Key Name State
1001 Christina California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the
old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the
company would not be able to know that Christina lived in Illinois before.
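
A minimal sketch of a Type 1 change, holding the dimension as a Python dictionary keyed by customer key. The function name `apply_type1` is hypothetical.

```python
# Dimension rows keyed by customer key, as in the example above.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, attribute, new_value):
    """Overwrite the attribute in place; the previous value is not kept."""
    dim[customer_key][attribute] = new_value


apply_type1(customer_dim, 1001, "state", "California")
print(customer_dim)
```

After the update there is still one row for key 1001, and nothing in the table records that the state was ever Illinois.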
Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both
the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key Name State
1001 Christina Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key Name State EFF_DT END_DT flag
1001 Christina Illinois 2020-01-01 2020-03-20 0
1005 Christina California 2020-03-21 9999-12-31 1
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- The same customer now has more than one customer key (1001 and 1005 in the example above).
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with,
storage and performance can become a concern.
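
The Type 2 change above can be sketched as follows. The `customer_id` natural key ("C-17") is an assumption added so that both rows can be tied to the same customer; it does not appear in the original table. The function name `apply_type2` is also hypothetical.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended END_DT for the current row

customer_dim = [
    {"customer_key": 1001, "customer_id": "C-17", "name": "Christina",
     "state": "Illinois", "eff_dt": date(2020, 1, 1),
     "end_dt": HIGH_DATE, "flag": 1},
]


def apply_type2(dim, customer_id, new_state, change_date, new_key):
    """Expire the current row and append a new current row."""
    current = next(
        r for r in dim if r["customer_id"] == customer_id and r["flag"] == 1
    )
    # Close the old row the day before the change takes effect.
    current["end_dt"] = change_date - timedelta(days=1)
    current["flag"] = 0
    # Copy the old row, then override the key, changed value and dates.
    new_row = dict(current, customer_key=new_key, state=new_state,
                   eff_dt=change_date, end_dt=HIGH_DATE, flag=1)
    dim.append(new_row)
    return new_row


apply_type2(customer_dim, "C-17", "California", date(2020, 3, 21), new_key=1005)
for row in customer_dim:
    print(row)
```

Fact rows loaded before the change keep pointing at key 1001, and rows loaded afterwards point at key 1005, which is how history is preserved.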
Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the
original value, and one indicating the current value. There will also be a column that indicates when the current value becomes
active.
In our example, recall we originally have the following table:
Customer Key Name State
1001 Christina Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key, Name, Original State, Current State, Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the
effective date of change is January 15, 2003):
Customer Key Name Original State Current State Effective Date
1001 Christina Illinois California 15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to
Texas on December 15, 2003, the California information will be lost.
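
A small sketch of the Type 3 behavior described above, including the loss of the California value on the second move. The dictionary layout and the function name `apply_type3` are illustrative.

```python
customer = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def apply_type3(row, new_state, effective_date):
    """Overwrite only the current-value column; the original column stays."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date


apply_type3(customer, "California", "15-JAN-2003")
after_first_move = dict(customer)  # snapshot for comparison

apply_type3(customer, "Texas", "15-DEC-2003")
print(after_first_move)
print(customer)
```

After the second call the row holds Illinois and Texas only; California no longer appears anywhere.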
Dimensional Modeling in Hadoop / Big Data

Benefits of dimensional models on Hadoop and similar big data frameworks.

The Hadoop file system (HDFS) is immutable: we can only add data, not update it.

As a result, keeping history as in Slowly Changing Dimensions becomes the default behavior on Hadoop.

To get the latest and most up-to-date record in a dimension table, we have three options:

• First, we can create a view that retrieves the latest record using windowing functions.
• Second, we can have a compaction service running in the background that recreates the latest state.
• Third, we can store our dimension tables in mutable storage (storage that allows information to be
overwritten at any time), e.g. HBase, and federate queries across the two types of storage.
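
The first option is commonly written in SQL as ROW_NUMBER() OVER (PARTITION BY the business key ORDER BY the load timestamp DESC), keeping only row number 1. The sketch below emulates that logic in plain Python rather than on a Hadoop engine; the record layout and the `load_ts` column are assumptions.

```python
# Hypothetical append-only dimension records: (customer_id, state, load_ts)
appended_rows = [
    ("C-17", "Illinois", 1),
    ("C-42", "Texas", 1),
    ("C-17", "California", 2),
]


def latest_records(rows):
    """Keep, for each customer_id, the row with the highest load_ts."""
    latest = {}
    for row in rows:
        customer_id, _state, load_ts = row
        if customer_id not in latest or load_ts > latest[customer_id][2]:
            latest[customer_id] = row
    return sorted(latest.values())


current_view = latest_records(appended_rows)
print(current_view)
```

The older Illinois record stays in storage; it is only filtered out of the current view.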

The way data is distributed across HDFS makes it expensive to join data.

HDFS tables are split into big chunks and distributed across the nodes of the cluster.
Designing a Dimensional Model in Oracle OLAP

• Identifying Dimensions
• Identifying Levels
We can identify the levels of summarization within each dimension.
The levels of summarization will be (highest to lowest): Total, Region, Warehouse, and Ship To.
For market segmentation, the levels of summarization will be (highest to lowest): Total, Market
Segment, Account, and Ship To.
The Product dimension will have four levels (highest to lowest): Total, Class, Family, and Item.
The Time dimension will have four levels (highest to lowest): Total, Year, Quarter, and Month.

• Identifying Hierarchies
We will group the levels in the correct order of summarization and in a way that supports the
identified types of analysis.

• Identifying Stored Measures


Overview of the Dimensional Data Model
Dimensional objects are an integral part of OLAP - ORACLE
Degenerate dimension (DD)
A dimension that is defined with no content except for its primary key
A dimension that has only a single attribute

customer_id, product_id, bill_no



A degenerate dimension (DD) acts as a dimension key in the fact table but does not join to
a corresponding dimension table, because all its interesting attributes have already been placed
in other analytic dimensions.
Example 1: When an invoice has multiple line items, the line item fact rows inherit all the
descriptive dimension foreign keys of the invoice, and the invoice is left with no unique
content. But the invoice number remains a valid dimension key for fact tables at the line item
level.

Ticket Number
Order Number
Tracking_Id
Bill of lading number
Policy number
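
A brief sketch of how a degenerate dimension is used: the invoice number sits directly in the line-item fact rows and groups them, without any invoice dimension table to join to. All row values and names below are hypothetical.

```python
# Hypothetical line-item fact rows:
# (invoice_number, product_key, customer_key, amount)
line_item_facts = [
    ("INV-1001", 11, 501, 40.0),
    ("INV-1001", 12, 501, 15.5),
    ("INV-1002", 11, 502, 40.0),
]


def totals_by_invoice(rows):
    """Group line items by the degenerate invoice number and sum amounts."""
    totals = {}
    for invoice_number, _product_key, _customer_key, amount in rows:
        totals[invoice_number] = totals.get(invoice_number, 0.0) + amount
    return totals


invoice_totals = totals_by_invoice(line_item_facts)
print(invoice_totals)
```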

 
Dimension Types and Slowly Changing Dimension Methods

Ralph introduced the concept of slowly changing dimension (SCD) attributes in 1996.
Core SCD approaches: types 0, 1, 2, and 3.

Original record:
Supplier_Key Supplier_Code Supplier_Name Supplier_State
123 ABC Acme Supply Co CA

Type 1 (overwrite):
Supplier_Key Supplier_Code Supplier_Name Supplier_State
123 ABC Acme Supply Co IL

Type 2 (new row, tracked by version number):
Supplier_Key Supplier_Code Supplier_Name Supplier_State Version
123 ABC Acme Supply Co CA 0
124 ABC Acme Supply Co IL 1

Type 2 (new row, tracked by effective date and current flag):
Supplier_Key Supplier_Code Supplier_Name Supplier_State Effective_Date Current_Flag
123 ABC Acme Supply Co CA 01-Jan-2000 N
124 ABC Acme Supply Co IL 22-Dec-2004 Y
• Type 4: Add Mini-Dimension
• The type 4 technique is used when a group of dimension
attributes is split off into a separate mini-dimension.
• Type 5: Add Mini-Dimension and Type 1 Outrigger
• The type 5 technique builds on the type 4 mini-dimension by
embedding a “current profile” mini-dimension key in the base
dimension that’s overwritten as a type 1 attribute.
Type 6: Add Type 1 Attributes to Type 2 Dimension
Type 7: Dual Type 1 and Type 2 Dimensions

With type 7, the fact table contains dual foreign keys for a given dimension: a surrogate key linked to
the dimension table where type 2 attributes are tracked, plus the dimension’s durable supernatural
key linked to the current row in the type 2 dimension to present current attribute values.
Role-Playing Dimensions
• A single physical dimension can be referenced multiple times in a fact
table, with each reference linking to a logically distinct role for the
dimension. For instance, a fact table can have several dates, each of
which is represented by a foreign key to the date dimension.
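
As an illustration of role playing, the sketch below keeps one physical date dimension and resolves two date roles from a single fact row. The column names `order_date_key` and `ship_date_key` and all values are hypothetical.

```python
# One physical date dimension, keyed by date key.
date_dim = {
    20180828: {"full_date": "2018-08-28", "month": "August"},
    20180902: {"full_date": "2018-09-02", "month": "September"},
}

# Hypothetical fact row with two foreign keys to the same date dimension.
order_fact = {"order_date_key": 20180828, "ship_date_key": 20180902, "amount": 250}


def resolve_role(fact, role_column, dimension):
    """Look up the dimension row for one role-specific foreign key."""
    return dimension[fact[role_column]]


order_date = resolve_role(order_fact, "order_date_key", date_dim)
ship_date = resolve_role(order_fact, "ship_date_key", date_dim)
print(order_date["month"], ship_date["month"])
```

In a relational database the same effect is usually achieved by joining the date dimension twice under different aliases or views, one per role.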
Conformed dimension

• A conformed dimension is a dimension that has the same
meaning to every fact with which it relates.
Junk dimension
• Transactional business processes typically produce a number of
miscellaneous, low-cardinality flags and indicators. Rather than
making separate dimensions for each flag and attribute, you can
create a single junk dimension combining them together.
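
One way to build a junk dimension is to enumerate the combinations of the flag values and give each combination a surrogate key. The flags and their values below are hypothetical, and the sketch generates every combination up front; storing only the combinations that actually occur in the source data is another option.

```python
from itertools import product

# Hypothetical low-cardinality flags and their possible values.
flag_domains = {
    "payment_type": ["cash", "card"],
    "is_gift": ["N", "Y"],
    "order_channel": ["store", "web"],
}


def build_junk_dimension(domains):
    """Create one row per combination of flag values, keyed from 1."""
    names = list(domains)
    rows = {}
    for junk_key, combo in enumerate(product(*domains.values()), start=1):
        rows[junk_key] = dict(zip(names, combo))
    return rows


junk_dim = build_junk_dimension(flag_domains)
print(len(junk_dim))
print(junk_dim[1])
```

The fact table then carries a single junk dimension key instead of three separate flag columns or three tiny dimensions.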
Outrigger Dimensions
• A dimension can contain a reference to another dimension table. For
instance, a bank account dimension can reference a separate
dimension representing the date the account was opened. These
secondary dimension references are called outrigger dimensions.

Snowflaked Dimensions
When a hierarchical relationship in a dimension table is normalized, low-cardinality attributes appear
as secondary tables connected to the base dimension table by an attribute key. When this process is
repeated with all the dimension table’s hierarchies, a characteristic multilevel structure is created that
is called a snowflake. Although the snowflake represents hierarchical data accurately, you should
avoid snowflakes because it is difficult for business users to understand and navigate snowflakes.
They can also negatively impact query performance. A flattened denormalized dimension table
contains exactly the same information as a snowflaked dimension.
Types of Fact Tables
Conformed facts
Transaction fact tables
Periodic snapshot fact tables
Accumulating snapshot fact tables
Factless fact tables
Aggregated fact tables or cubes
Consolidated fact tables
Conformed facts
If the same measurement appears in separate fact tables, care must be
taken to make sure the technical definitions of the facts are identical if
they are to be compared or computed together. If the separate fact
definitions are consistent, the conformed facts should be identically
named; but if they are incompatible, they should be differently named
to alert the business users and BI applications.
Transaction fact tables
A row in a transaction fact table corresponds to a measurement event at a point in
space and time. Atomic transaction grain fact tables are the most dimensional and
expressive fact tables; this robust dimensionality enables the maximum slicing
and dicing of transaction data. Transaction fact tables may be dense or sparse
because rows exist only if measurements take place. These fact tables always contain a
foreign key for each associated dimension, and optionally contain precise
time stamps and degenerate dimension keys. The measured numeric facts must be
consistent with the transaction grain.
Periodic snapshot fact tables
A row in a periodic snapshot fact table summarizes many measurement
events occurring over a standard period, such as a day, a week, or a
month.
The grain is the period, not the individual transaction. Periodic
snapshot fact tables often contain many facts because any
measurement event consistent with the fact table grain is permissible.
These fact tables are uniformly dense in their foreign keys because
even if no activity takes place during the period, a row is typically
inserted in the fact table containing a zero or null for each fact.
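
The density of a periodic snapshot can be sketched by creating one row per period and account first, then adding the transactions into it. The accounts, dates and amounts are hypothetical.

```python
# Hypothetical transactions: (date_key, account, amount)
transactions = [
    (20180827, "A", 100),
    (20180827, "A", 25),
    (20180828, "B", 40),
]
periods = [20180827, 20180828]
accounts = ["A", "B"]


def build_daily_snapshot(rows, periods, accounts):
    """Return one snapshot row per (period, account), zero when inactive."""
    snapshot = {(p, a): 0 for p in periods for a in accounts}
    for date_key, account, amount in rows:
        snapshot[(date_key, account)] += amount
    return snapshot


daily_snapshot = build_daily_snapshot(transactions, periods, accounts)
print(daily_snapshot)
```

Three transactions become four snapshot rows, including zero rows for the account and day combinations with no activity.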
Accumulating snapshot fact tables

• A row in an accumulating snapshot fact table summarizes the
measurement events occurring at predictable steps between the
beginning and the end of a process.
• Pipeline or workflow processes, such as order fulfillment or claim
processing, that have a defined start point, standard intermediate
steps, and defined end point can be modeled with this type of fact
table.
Factless fact tables
Although most measurement events capture numerical results, it is possible that the event
merely records a set of dimensional entities coming together at a moment in time.
For example, an event of a student attending a class on a given day may not have a
recorded numeric fact, but a fact row with foreign keys for calendar day, student, teacher,
location, and class is well-defined.
Likewise, customer communications are events, but there may be no associated metrics.
Factless fact tables can also be used to analyze what didn’t happen.
These queries always have two parts: a factless coverage table that contains all the
possibilities of events that might happen and an activity table that contains the events
that did happen.
When the activity is subtracted from the coverage, the result is the set of events that did
not happen.
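
The subtraction of activity from coverage maps directly onto a set difference. In this sketch each tuple stands for one factless row of foreign keys; the promotion scenario and key values are hypothetical.

```python
# Coverage: (date_key, product_key, store_key) combinations on promotion.
coverage = {
    (20180828, 1, 7),
    (20180828, 2, 7),
    (20180828, 3, 7),
}

# Activity: combinations that actually recorded a sale.
activity = {
    (20180828, 1, 7),
    (20180828, 3, 7),
}

# Events that could have happened but did not.
did_not_happen = coverage - activity
print(did_not_happen)
```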
Aggregated fact tables or cubes

• Aggregate fact tables are simple numeric rollups of atomic fact table
data built solely to accelerate query performance. These aggregate
fact tables should be available to the BI layer at the same time as the
atomic fact tables so that BI tools smoothly choose the appropriate
aggregate level at query time.

• This process, known as aggregate navigation, must be open so that
every report writer, query tool, and BI application harvests the same
performance benefits.
Consolidated fact tables

• It is often convenient to combine facts from multiple processes
together into a single consolidated fact table if they can be expressed
at the same grain.
• For example, sales actuals can be consolidated with sales forecasts in
a single fact table to make the task of analyzing actuals versus
forecasts simple and fast, as compared to assembling a drill-across
application using separate fact tables.
• Consolidated fact tables add burden to the ETL processing, but ease
the analytic burden on the BI applications. They should be considered
for cross-process metrics that are frequently analyzed together.
