Basic Elements of the Data Warehouse
Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, 2nd Edition, 2002
Time_Key Item_Key Branch_Key Location_Key Dollars_Sold Units_Sold
20180828 123456 7 8 1000 10
20180828 123457 7 8 1010 8
20180828 123456 6 8 2000 6
20180828 123457 6 9 1000 15
20180828 123457 8 9 3000 25
Sale_Key  Item_Key  SalePrice  Discount  Delivery_Charges
9         1         90         0         2
9         2         10         10        2
Total               100        10        4

Time_Key  Item_Key  Location_Key  Dollars_Sold  Units_Sold
20180828  123456    8             1000          10
20180828  123457    8             1010          8
20180828  123456    8             2000          6
20180828  123457    9             4000          40
Four Key Decisions of Dimensional Modeling
1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
Declaring the grain
• Declaring the grain is the pivotal step in a dimensional design
• The grain establishes exactly what a single fact table row represents
• The grain must be declared before choosing dimensions or facts
because every candidate dimension or fact must be consistent with
the grain.
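As a minimal sketch of what declaring the grain implies, the check below treats the grain as the set of key columns that must identify exactly one fact row. The rows are copied from the sample sales fact table earlier in these slides; the helper name `grain_violations` is an assumption made for illustration.

```python
from collections import Counter

# Rows from the sample sales fact table:
# (Time_Key, Item_Key, Branch_Key, Location_Key, Dollars_Sold, Units_Sold)
fact_rows = [
    (20180828, 123456, 7, 8, 1000, 10),
    (20180828, 123457, 7, 8, 1010, 8),
    (20180828, 123456, 6, 8, 2000, 6),
    (20180828, 123457, 6, 9, 1000, 15),
    (20180828, 123457, 8, 9, 3000, 25),
]


def grain_violations(rows, key_positions):
    """Return the key combinations that appear on more than one row."""
    counts = Counter(tuple(row[i] for i in key_positions) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Declared grain: one row per time, item, branch, and location.
print(grain_violations(fact_rows, (0, 1, 2, 3)))  # []

# With Branch_Key left out, some key combinations cover two rows,
# so those columns alone do not describe a single fact row.
print(grain_violations(fact_rows, (0, 1, 3)))
```

If a candidate grain produces violations, either a dimension is missing from the declaration or the rows need to be summarized to that coarser level.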
Identify the dimensions
Dimensions provide the “who, what, where, when, why, and how” context surrounding a business process event.
Each dimension table has a single primary key. This primary key is embedded as a foreign key in any associated fact table.
Dimension tables are usually wide, flat, denormalized tables with many low-cardinality text attributes.
Dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1.
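The sequential assignment of surrogate keys can be sketched as follows. This is only an illustration: the natural keys (`"SKU-A"` and so on) and the function name are hypothetical, and a real warehouse would normally persist the mapping in the dimension table itself.

```python
from itertools import count


def assign_surrogate_keys(natural_keys):
    """Map each distinct natural key to the next integer, starting at 1."""
    next_key = count(start=1)
    mapping = {}
    for natural_key in natural_keys:
        if natural_key not in mapping:
            mapping[natural_key] = next(next_key)
    return mapping


# Hypothetical natural keys arriving from a source system.
item_keys = assign_surrogate_keys(["SKU-A", "SKU-B", "SKU-A", "SKU-C"])
print(item_keys)  # {'SKU-A': 1, 'SKU-B': 2, 'SKU-C': 3}
```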
Additive measures: fact measures that can be aggregated across all dimensions and their hierarchies.
An example of a fully additive measure is sales (purchases from a store). You can add hourly sales to get the sales for a day, week, month, quarter, or year. You can also add sales across stores or regions.
Semi-additive measures: fact measures that can be aggregated across all dimensions and their hierarchies except the time dimension.
Example: a daily balances fact can be summed across the customer dimension but not across the time dimension.
Non-additive measures: fact measures that cannot be aggregated across any dimension or hierarchy. Facts calculated as percentages or ratios fall into this class.
A good example of a non-additive measure is ‘Discount On Price’ on a sales record.
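The semi-additive case can be illustrated with a small sketch. The balance rows and customer names are made up for the example; the point is only that a sum across customers is meaningful for one day, while across time an average (or the latest value) is used instead of a sum.

```python
# Hypothetical daily balance facts: (day, customer, balance)
balances = [
    ("2018-08-27", "alice", 100),
    ("2018-08-27", "bob", 50),
    ("2018-08-28", "alice", 120),
    ("2018-08-28", "bob", 50),
]


def total_balance_on(day):
    """Summing across customers for a single day is meaningful."""
    return sum(b for d, _, b in balances if d == day)


def average_balance_for(customer):
    """Across time, average the balances rather than summing them."""
    values = [b for _, c, b in balances if c == customer]
    return sum(values) / len(values)


print(total_balance_on("2018-08-28"))   # 170
print(average_balance_for("alice"))     # 110.0
```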
Data Pipeline
Data Engineering
Item ID  Price  Discount  Payable  Non-Additive Measure
1        100    10        90       10%
2        100    5         95       5%
3        200    25        175      13%
4        300    30        270      10%
SUM      700    70        630      10.0%
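The table can be reproduced with a short sketch that shows why the percentage column is non-additive: the overall rate comes from the ratio of the additive sums, not from adding or averaging the per-row percentages. Variable names are chosen for illustration.

```python
# (Item ID, Price, Discount) taken from the table above
items = [(1, 100, 10), (2, 100, 5), (3, 200, 25), (4, 300, 30)]

total_price = sum(price for _, price, _ in items)
total_discount = sum(discount for _, _, discount in items)

# Overall rate: ratio of the two additive totals (70 / 700).
overall_rate = total_discount / total_price

# Misleading alternatives: summing or averaging the per-row rates.
row_rates = [discount / price for _, price, discount in items]
summed_rates = sum(row_rates)
mean_of_rates = summed_rates / len(row_rates)

print(f"overall: {overall_rate:.1%}")
print(f"sum of row rates: {summed_rates:.1%}")
print(f"mean of row rates: {mean_of_rates:.2%}")
```

The safe pattern is to store the additive components (price and discount amount) in the fact table and compute the ratio after aggregation.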
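Units_Sold  Discount %
>=10        2%
>=20        3%
>=30        4%
>=50        5%

Dollars_Sold  Units_Sold  Discount %  Discount  Payable
1000          10          2%          20        980
1010          8           -           0         1010
2000          6           -           0         2000
3000          25          3%          30        2970
3000          25          3%          30        2970
10010         74          8%          -         9930
Payable / Dollars_Sold: 9930 / 10010 = 99.2%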
The Hadoop file system (HDFS) is immutable: we can only add data, not update it.
To get the latest, most up-to-date record in a dimension table, we have three options:
• First, we can create a view that retrieves the latest record using windowing functions.
• Second, we can have a compaction service running in the background that recreates the latest state.
• Third, we can store our dimension tables in mutable storage (which allows information to be overwritten at any time), e.g. HBase, and federate queries across the two types of storage.
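The first option amounts to keeping only the newest row per key, which a SQL view would typically express with a window function such as ROW_NUMBER() partitioned by the key and ordered by load time. The sketch below shows the same logic in plain Python; the rows, column layout, and function name are hypothetical.

```python
# Hypothetical append-only dimension rows: (customer_id, city, loaded_at)
appended_rows = [
    (1, "Pune", "2018-08-01"),
    (2, "Delhi", "2018-08-01"),
    (1, "Mumbai", "2018-08-20"),
]


def latest_per_key(rows):
    """Keep, for each key, the row with the greatest load timestamp."""
    latest = {}
    for row in rows:
        key, _, loaded_at = row
        # ISO-formatted dates compare correctly as strings.
        if key not in latest or loaded_at > latest[key][2]:
            latest[key] = row
    return latest


current_view = latest_per_key(appended_rows)
print(current_view[1])  # (1, 'Mumbai', '2018-08-20')
```

A background compaction job (the second option) could run the same logic periodically and write the result out as a new file.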
The way data is distributed across HDFS makes it expensive to join data.
HDFS tables are split into large chunks and distributed across the nodes of the cluster.
Designing a Dimensional Model in Oracle OLAP
• Identifying Dimensions
• Identifying Levels
We can identify the levels of summarization within each dimension.
The levels of summarization will be (highest to lowest): Total, Region, Warehouse, and Ship To.
For market segmentation, the levels of summarization will be (highest to lowest): Total, Market Segment, Account, and Ship To.
The Product dimension will have four levels (highest to lowest): Total, Class, Family, and Item.
The Time dimension will have four levels (highest to lowest): Total, Year, Quarter, and Month.
• Identifying Hierarchies
We will group the levels in the correct order of summarization and in a way that supports the identified types of analysis.
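One simple way to picture a hierarchy is as an ordered list of levels. The sketch below uses the Time and Product levels named above; the helper functions are assumptions made for illustration and are not part of Oracle OLAP.

```python
# Levels listed from highest to lowest, as in the text.
time_levels = ["Total", "Year", "Quarter", "Month"]
product_levels = ["Total", "Class", "Family", "Item"]


def parent_level(levels, level):
    """Return the next level up in the hierarchy, or None at the top."""
    position = levels.index(level)
    return levels[position - 1] if position > 0 else None


def rollup_path(levels, level):
    """List the levels a value passes through when summarized upward."""
    position = levels.index(level)
    return list(reversed(levels[: position + 1]))


print(rollup_path(time_levels, "Month"))       # ['Month', 'Quarter', 'Year', 'Total']
print(parent_level(product_levels, "Family"))  # Class
```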
A degenerate dimension (DD) acts as a dimension key in the fact table but does not join to a corresponding dimension table, because all its interesting attributes have already been placed in other analytic dimensions.
Example 1: when an invoice has multiple line items, the line item fact rows inherit all the descriptive dimension foreign keys of the invoice, and the invoice is left with no unique content. But the invoice number remains a valid dimension key for fact tables at the line item level.
Ticket Number
Order Number
Tracking_Id
Bill of lading number
Policy number
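A degenerate dimension is still useful for grouping, even though there is nothing to join it to. The sketch below groups hypothetical line-item fact rows by invoice number; the row layout and values are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical line-item facts: (invoice_number, product_key, amount).
# invoice_number is a degenerate dimension: no invoice table exists to join.
line_items = [
    ("INV-001", 10, 25.0),
    ("INV-001", 11, 40.0),
    ("INV-002", 10, 25.0),
]


def totals_by_invoice(rows):
    """Sum line-item amounts per invoice number."""
    totals = defaultdict(float)
    for invoice_number, _, amount in rows:
        totals[invoice_number] += amount
    return dict(totals)


print(totals_by_invoice(line_items))  # {'INV-001': 65.0, 'INV-002': 25.0}
```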
Dimension Types and Slowly Changing Dimension Methods
Ralph Kimball introduced the concept of slowly changing dimension (SCD) attributes in 1996.
Core SCD approaches: types 0, 1, 2, and 3.
With type 7, the fact table contains dual foreign keys for a given dimension: a surrogate key linked to
the dimension table where type 2 attributes are tracked, plus the dimension’s durable supernatural
key linked to the current row in the type 2 dimension to present current attribute values.
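The dual-key idea behind type 7 can be sketched as two lookups from the same fact row. Everything here is hypothetical (the customer rows, the `is_current` flag, the function names); it only illustrates how one key reaches the historical row and the other reaches the current row.

```python
# Hypothetical type 2 customer dimension:
# surrogate key -> (durable_key, city, is_current)
customer_dim = {
    1: ("C-100", "Pune", False),
    2: ("C-100", "Mumbai", True),
}

# The fact row carries both keys: (customer_sk, customer_durable_key, amount)
fact = (1, "C-100", 500)


def city_as_of_fact(fact_row):
    """Historical view: follow the surrogate key to the type 2 row."""
    return customer_dim[fact_row[0]][1]


def city_current(fact_row):
    """Current view: follow the durable key to the row flagged current."""
    for durable_key, city, is_current in customer_dim.values():
        if durable_key == fact_row[1] and is_current:
            return city
    return None


print(city_as_of_fact(fact))  # Pune
print(city_current(fact))     # Mumbai
```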
Role-Playing Dimensions
• A single physical dimension can be referenced multiple times in a fact
table, with each reference linking to a logically distinct role for the
dimension. For instance, a fact table can have several dates, each of
which is represented by a foreign key to the date dimension.
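As an illustration, the sketch below resolves two date roles on one fact row against a single date dimension. The column names (`order_date_key`, `ship_date_key`) and the rows are assumptions for the example.

```python
# One physical date dimension: date_key -> attributes (hypothetical rows).
date_dim = {
    20180828: {"month": "August", "weekday": "Tuesday"},
    20180831: {"month": "August", "weekday": "Friday"},
}

# A fact row that references the same dimension in two roles.
order_fact = {"order_date_key": 20180828, "ship_date_key": 20180831, "units": 10}


def date_role(fact_row, role):
    """Resolve one role ('order' or 'ship') against the shared dimension."""
    return date_dim[fact_row[f"{role}_date_key"]]


print(date_role(order_fact, "order")["weekday"])  # Tuesday
print(date_role(order_fact, "ship")["weekday"])   # Friday
```

In SQL the same effect is usually achieved by exposing the date dimension through one view or alias per role.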
Conformed dimension
Snowflaked Dimensions
When a hierarchical relationship in a dimension table is normalized, low-cardinality attributes appear
as secondary tables connected to the base dimension table by an attribute key. When this process is
repeated with all the dimension table’s hierarchies, a characteristic multilevel structure is created that
is called a snowflake. Although the snowflake represents hierarchical data accurately, you should
avoid snowflakes because it is difficult for business users to understand and navigate snowflakes.
They can also negatively impact query performance. A flattened denormalized dimension table
contains exactly the same information as a snowflaked dimension.
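The claim that a flattened dimension holds the same information can be shown with a small sketch that folds a secondary table back into its base table. The product and category rows are hypothetical.

```python
# Hypothetical snowflaked product dimension: base table plus a category table.
product = {1: {"name": "Widget", "category_key": 10}}
category = {10: {"category_name": "Hardware"}}


def flatten(product_rows, category_rows):
    """Fold the secondary table's attributes into each base row."""
    flat = {}
    for key, row in product_rows.items():
        merged = {k: v for k, v in row.items() if k != "category_key"}
        merged.update(category_rows[row["category_key"]])
        flat[key] = merged
    return flat


print(flatten(product, category))  # {1: {'name': 'Widget', 'category_name': 'Hardware'}}
```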
Types of Fact Tables
Conformed facts
Transaction fact tables
Periodic snapshot fact tables
Accumulating snapshot fact tables
Factless fact tables
Aggregated fact tables or cubes
Consolidated fact tables
Conformed facts
If the same measurement appears in separate fact tables, care must be
taken to make sure the technical definitions of the facts are identical if
they are to be compared or computed together. If the separate fact
definitions are consistent, the conformed facts should be identically
named; but if they are incompatible, they should be differently named
to alert the business users and BI applications.
Transaction fact tables
A row in a transaction fact table corresponds to a measurement event at a point in
space and time. Atomic transaction grain fact tables are the most dimensional and
expressive fact tables; this robust dimensionality enables the maximum slicing
and dicing of transaction data. Transaction fact tables may be dense or sparse
because rows exist only if measurements take place. These fact tables always contain a
foreign key for each associated dimension, and optionally contain precise
time stamps and degenerate dimension keys. The measured numeric facts must be
consistent with the transaction grain.
Periodic snapshot fact tables
A row in a periodic snapshot fact table summarizes many measurement
events occurring over a standard period, such as a day, a week, or a
month.
The grain is the period, not the individual transaction. Periodic
snapshot fact tables often contain many facts because any
measurement event consistent with the fact table grain is permissible.
These fact tables are uniformly dense in their foreign keys because
even if no activity takes place during the period, a row is typically
inserted in the fact table containing a zero or null for each fact.
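The density of a periodic snapshot can be sketched by building one row per period and account, including a zero where nothing happened. The transactions, account names, and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: (day, account, amount)
transactions = [
    ("2018-08-27", "A", 100),
    ("2018-08-27", "A", 50),
    ("2018-08-28", "B", 30),
]
days = ["2018-08-27", "2018-08-28"]
accounts = ["A", "B"]


def daily_snapshot(rows, days, accounts):
    """Return one row per (day, account), with 0 where no activity occurred."""
    totals = defaultdict(int)
    for day, account, amount in rows:
        totals[(day, account)] += amount
    return {(d, a): totals[(d, a)] for d in days for a in accounts}


snapshot = daily_snapshot(transactions, days, accounts)
print(snapshot[("2018-08-27", "A")])  # 150
print(snapshot[("2018-08-27", "B")])  # 0
```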
Accumulating snapshot fact tables
Aggregated fact tables or cubes
• Aggregate fact tables are simple numeric rollups of atomic fact table data built solely to accelerate query performance. These aggregate fact tables should be available to the BI layer at the same time as the atomic fact tables so that BI tools smoothly choose the appropriate aggregate level at query time.
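A rollup of this kind can be sketched with the sample sales fact rows from the start of these slides, summing the additive facts over every branch. The function name is an assumption; real aggregates would normally be built by the ETL process or the database.

```python
from collections import defaultdict

# Atomic rows:
# (Time_Key, Item_Key, Branch_Key, Location_Key, Dollars_Sold, Units_Sold)
atomic = [
    (20180828, 123456, 7, 8, 1000, 10),
    (20180828, 123457, 7, 8, 1010, 8),
    (20180828, 123456, 6, 8, 2000, 6),
    (20180828, 123457, 6, 9, 1000, 15),
    (20180828, 123457, 8, 9, 3000, 25),
]


def aggregate_without_branch(rows):
    """Sum Dollars_Sold and Units_Sold per (time, item, location)."""
    totals = defaultdict(lambda: [0, 0])
    for time_key, item_key, _branch, location_key, dollars, units in rows:
        bucket = totals[(time_key, item_key, location_key)]
        bucket[0] += dollars
        bucket[1] += units
    return {key: tuple(values) for key, values in totals.items()}


rollup = aggregate_without_branch(atomic)
print(rollup[(20180828, 123457, 9)])  # (4000, 40)
```

Because both facts are fully additive, the rollup loses branch-level detail but stays consistent with the atomic rows.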