
Data Warehousing with Oracle

Dani Schnider's Blog

Loading Dimensions from a Data Vault Model


Posted on 12. November 2015

Loading data into a Data Vault Model is easy. But how complex is it to extract the data from a Data Vault into a dimensional Data Mart? A Point in Time (PIT) table helps to load Dimensions
from a Hub with multiple Satellites.

One often-heard concern about Data Vault Modeling is the complexity of the ETL processes. While the rules to load Hubs, Links and Satellites are simple and straightforward, it seems to be
more complex to load Dimensions and Facts out of a Data Vault. In my very first blog post about Data Vault (see Data Vault Modeling – My first attempt to walk) I wrote: "But even if it would be
easier to load data into a Data Vault, it is more complex and expensive to load the Data Marts from a Data Vault because the queries to determine the correct version of each Satellite are not
trivial."

In the meantime, I realized that loading dimensions is indeed not trivial, but much simpler than I expected. The magic behind it is called Point in Time (PIT) Table.

Different Types of Dimensions

The complexity of an ETL process to load a dimension table depends on the type of Slowly Changing Dimension and on the number of Data Vault tables that are used to derive the
information of the dimension. Let’s start with the simple cases:

Slowly Changing Dimension Type 1: Loading an SCD1 dimension is not a big deal. We have to select the current version of each Satellite and join them with the corresponding Hub.
This works for any number of Satellites, and even if multiple Hubs and Links are used to select all dimensional attributes, the extraction from the Data Vault is simple. The only challenge is to
find the current version of each Satellite. The easiest way is to add an additional attribute to each Satellite table, either an expiry date or an active flag. This increases the complexity of the
ETL jobs that load the Satellites (the expiry date or active flag has to be updated when a new version is loaded), but it makes it much easier to retrieve the current versions.
Slowly Changing Dimension Type 2 with only one Satellite: If a dimension can be loaded from one Hub and one Satellite, we can just join these two Data Vault tables to retrieve all
versions for the SCD2 dimension. This even works for multiple Hubs, but only if each of them has exactly one Satellite. In a typical Data Vault model, this is usually not the case.
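The first case can be sketched with a short query. This is a hedged sketch, not code from the original post: the table names anticipate the customer example used later, and the IS_CURRENT active flag is an assumption (a column maintained by the Satellite loads, as described above):

```sql
-- Hedged SCD1 sketch: current state of a customer from a Hub and two of
-- its Satellites. Assumes each Satellite carries an IS_CURRENT flag
-- ('Y' for the newest version), maintained by the Satellite load.
SELECT HUB.CUSTOMER_NUMBER,
       S1.LAST_NAME,
       S2.CITY
  FROM H_CUSTOMER HUB
  LEFT JOIN S_CUSTOMER_MAIN S1
         ON (S1.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID AND S1.IS_CURRENT = 'Y')
  LEFT JOIN S_CUSTOMER_ADDRESS S2
         ON (S2.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID AND S2.IS_CURRENT = 'Y');
```

With an expiry date instead of an active flag, the join condition would test for an empty expiry date (e.g. S1.LOAD_END_DATE IS NULL) instead of the flag.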

But what happens if we want to load an SCD2 dimension from a Hub with multiple Satellites, or from more than one Hub, each of them containing several Satellites? Here we will have more
complex cases:

Slowly Changing Dimension Type 2 with more than one Satellite: Because the history tracking is independent for each Satellite, we have to combine the validities of all Satellites to
calculate all validity ranges for the dimension table. This can be done with a Point in Time Table, as described below.
Slowly Changing Dimension Type 2 with multiple Hubs and Links: When a dimension consists of information from different business entities (e.g. one for each hierarchy level),
multiple Hubs have to be joined through the connecting Links. Each of the Hubs (and even the Links) can contain one or more Satellites.

Let’s have a detailed look at the case of one Hub with multiple Satellites. In the example below, we see the timeline for a Hub with four Satellites. Each Satellite contains two to three versions
for a particular business key. The combination to be loaded into the SCD2 dimension table will contain seven different versions.

Point in Time Table

To calculate the validity range for each version, a Point in Time Table (PIT table) is created in the Data Vault model. It contains all combinations of load dates and is used to derive all valid
versions that have to be loaded into an SCD2 dimension table.
A PIT table is connected to a Hub or a Link and is therefore structured similarly to a Satellite. But unlike a normal Satellite, it does not contain descriptive attributes, but the load
dates/timestamps of all other Satellites of the same Hub or Link. Let’s explain this with an example: For a Hub H_CUSTOMER, there are four Satellites defined in the Data Vault model. Each
of them contains its own history. The address of a customer can change several times, but usually not on the same date as an e-mail address or a phone number. Each of the Satellites may
contain multiple versions of data for a particular customer, with different load dates.
The PIT table contains all distinct load dates of all Satellites. For each load date, the newest load date of every Satellite is stored in an additional attribute. This helps to identify which version is
valid on a particular date. In the following example, the valid versions for customer 2234 on 22 March 2015 are:

16.02.2015 for Satellite S_CUSTOMER_MAIN (S1)
22.03.2015 for Satellite S_CUSTOMER_ADDRESS (S2)
26.02.2015 for Satellite S_CUSTOMER_ONLINE (S3)
12.03.2015 for Satellite S_CUSTOMER_PHONE (S4)
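A possible structure for such a PIT table is sketched below. This is a hedged sketch: the data types and the primary key constraint are assumptions, while the column names follow the load-date attributes used in this example:

```sql
-- Illustrative structure of the PIT table for H_CUSTOMER (a sketch,
-- not the author's definitive DDL):
CREATE TABLE S_CUSTOMER_PIT (
  H_CUSTOMER_SID  NUMBER NOT NULL,  -- surrogate key of the Hub
  LOAD_DATE       DATE   NOT NULL,  -- start of validity of this combination
  LOAD_END_DATE   DATE,             -- load date of the next combination, NULL for the current one
  S1_LOAD_DATE    DATE,             -- newest version of S_CUSTOMER_MAIN at LOAD_DATE
  S2_LOAD_DATE    DATE,             -- newest version of S_CUSTOMER_ADDRESS
  S3_LOAD_DATE    DATE,             -- newest version of S_CUSTOMER_ONLINE
  S4_LOAD_DATE    DATE,             -- newest version of S_CUSTOMER_PHONE
  CONSTRAINT S_CUSTOMER_PIT_PK PRIMARY KEY (H_CUSTOMER_SID, LOAD_DATE)
);
```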

There are two approaches to load the PIT table. One is to select the newest load date of each Satellite after each load. If at least one of them changed since the last load, a new version is
inserted into the PIT table. This approach is similar to the incremental load of Satellites.
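A hedged sketch of this incremental approach could look as follows. For brevity it uses the combination of newest Satellite load dates as the change indicator and ignores NULL handling for keys that are missing in one of the Satellites:

```sql
-- Hedged sketch of the incremental PIT load: insert a new PIT row only
-- if the combination of newest Satellite load dates changed since the
-- last load (illustrative; NULLs for missing keys are not handled).
INSERT INTO S_CUSTOMER_PIT
      (H_CUSTOMER_SID, LOAD_DATE,
       S1_LOAD_DATE, S2_LOAD_DATE, S3_LOAD_DATE, S4_LOAD_DATE)
SELECT C.H_CUSTOMER_SID,
       GREATEST(C.S1_LOAD_DATE, C.S2_LOAD_DATE,
                C.S3_LOAD_DATE, C.S4_LOAD_DATE),
       C.S1_LOAD_DATE, C.S2_LOAD_DATE, C.S3_LOAD_DATE, C.S4_LOAD_DATE
  FROM (SELECT HUB.H_CUSTOMER_SID,
               (SELECT MAX(LOAD_DATE) FROM S_CUSTOMER_MAIN S1
                 WHERE S1.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID) S1_LOAD_DATE,
               (SELECT MAX(LOAD_DATE) FROM S_CUSTOMER_ADDRESS S2
                 WHERE S2.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID) S2_LOAD_DATE,
               (SELECT MAX(LOAD_DATE) FROM S_CUSTOMER_ONLINE S3
                 WHERE S3.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID) S3_LOAD_DATE,
               (SELECT MAX(LOAD_DATE) FROM S_CUSTOMER_PHONE S4
                 WHERE S4.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID) S4_LOAD_DATE
          FROM H_CUSTOMER HUB) C
 WHERE NOT EXISTS
       (SELECT 1 FROM S_CUSTOMER_PIT P
         WHERE P.H_CUSTOMER_SID = C.H_CUSTOMER_SID
           AND P.S1_LOAD_DATE = C.S1_LOAD_DATE
           AND P.S2_LOAD_DATE = C.S2_LOAD_DATE
           AND P.S3_LOAD_DATE = C.S3_LOAD_DATE
           AND P.S4_LOAD_DATE = C.S4_LOAD_DATE);
```

If the PIT table contains a LOAD_END_DATE, the previous row per key would additionally have to be updated, which is exactly the bookkeeping the full-reload approach avoids.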

The other approach is to reload the PIT table from scratch after each load. This sounds more complex than it is. The following SQL statements show how this can easily be implemented:

TRUNCATE TABLE S_CUSTOMER_PIT;

INSERT INTO S_CUSTOMER_PIT
(H_CUSTOMER_SID,
LOAD_DATE,
LOAD_END_DATE,
S1_LOAD_DATE,
S2_LOAD_DATE,
S3_LOAD_DATE,
S4_LOAD_DATE)
WITH LOAD_DATES AS (
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_MAIN
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ADDRESS
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ONLINE
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_PHONE
)
SELECT
LD.H_CUSTOMER_SID,
LD.LOAD_DATE,
LEAD(LD.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) LOAD_END_DATE,
MAX(S1.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S1_LOAD_DATE,
MAX(S2.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S2_LOAD_DATE,
MAX(S3.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S3_LOAD_DATE,
MAX(S4.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S4_LOAD_DATE
FROM LOAD_DATES LD
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S1.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S2.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S3.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S4.LOAD_DATE = LD.LOAD_DATE);
COMMIT;

The subquery factoring clause LOAD_DATES selects all distinct load dates for each key (H_CUSTOMER_SID). With this information, a left outer join on each Satellite is performed. For each
Satellite, the newest load date per key is calculated with an analytic function. This query was implemented on an Oracle 12c database, but it can be used in any database system that
supports ANSI SQL window functions.

The attribute LOAD_END_DATE is optional, but it is very convenient for retrieving the current versions, as we will see in the Current View in the next section.

History View Layer

For each business entity (i.e. a Hub with all Satellites), we can generate two views:

Current View: This view selects all relevant attributes from the current version of each Satellite. It is used to retrieve the current state of the data and as the source to load SCD1 dimensions.
History View: In contrast to the Current View, this view selects all versions of the data for each business key, including a start and end date for the validity of each version. This view is
used for initial loads of SCD2 dimensions.

The Current View reads the PIT table loaded before and joins it with the Hub and all connected Satellites. Outer joins are used for the Satellites because not every Satellite necessarily
contains data for every timestamp in the past; this avoids dropping a customer just because one of its Satellites is empty. Because we are interested in
the current versions only, only the PIT rows with no LOAD_END_DATE are selected.

CREATE OR REPLACE VIEW V_CUSTOMER_CURR AS
SELECT
HUB.H_CUSTOMER_SID,
HUB.CUSTOMER_NUMBER,
S1.FIRST_NAME,
S1.LAST_NAME,
S1.DATE_OF_BIRTH,
S1.GENDER,
S2.STREET,
S2.ZIP_CODE,
S2.CITY,
S2.STATE,
S2.COUNTRY,
S3.EMAIL_HOME,
S3.EMAIL_WORK,
S4.PHONE_HOME,
S4.PHONE_WORK,
S4.PHONE_MOBILE,
S4.FAX_HOME,
S4.FAX_WORK
FROM S_CUSTOMER_PIT PIT
JOIN H_CUSTOMER HUB ON (HUB.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID)
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S2.LOAD_DATE = PIT.S2_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S3.LOAD_DATE = PIT.S3_LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S4.LOAD_DATE = PIT.S4_LOAD_DATE)
WHERE PIT.LOAD_END_DATE IS NULL;

The History View looks almost the same as the Current View, except that all rows of the PIT table are selected, not only the current ones. Additionally, the two columns VALID_FROM and
VALID_TO are provided in the view. They can be used to retrieve all versions that are valid on a particular snapshot date. To make it easier to select the data, the end date of the last version
(which is NULL) is replaced by the maximum date 31 December 9999.

CREATE OR REPLACE VIEW V_CUSTOMER_HIST AS
SELECT
HUB.H_CUSTOMER_SID,
HUB.CUSTOMER_NUMBER,
PIT.LOAD_DATE VALID_FROM,
NVL(PIT.LOAD_END_DATE, TO_DATE('31.12.9999', 'DD.MM.YYYY')) VALID_TO,
S1.FIRST_NAME,
S1.LAST_NAME,
S1.DATE_OF_BIRTH,
S1.GENDER,
S2.STREET,
S2.ZIP_CODE,
S2.CITY,
S2.STATE,
S2.COUNTRY,
S3.EMAIL_HOME,
S3.EMAIL_WORK,
S4.PHONE_HOME,
S4.PHONE_WORK,
S4.PHONE_MOBILE,
S4.FAX_HOME,
S4.FAX_WORK
FROM S_CUSTOMER_PIT PIT
JOIN H_CUSTOMER HUB ON (HUB.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID)
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S2.LOAD_DATE = PIT.S2_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S3.LOAD_DATE = PIT.S3_LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S4.LOAD_DATE = PIT.S4_LOAD_DATE);
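To show how VALID_FROM and VALID_TO can be used, here is a hedged example of a snapshot query against the History View (the date and the column subset are taken from the example above; VALID_TO is treated as exclusive because it equals the VALID_FROM of the next version):

```sql
-- All customer versions that were valid on 22 March 2015 (illustrative):
SELECT CUSTOMER_NUMBER, LAST_NAME, CITY
  FROM V_CUSTOMER_HIST
 WHERE DATE '2015-03-22' >= VALID_FROM
   AND DATE '2015-03-22' <  VALID_TO;
```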

Loading Dimensions

Finally, we come back to our original question: How complex is it to load Dimensions from a Hub with multiple Satellites? With the PIT table and the History View Layer that we built before,
most of the work is already done. To load the dimension tables, we can distinguish the following situations:

Initial Load of SCD1 Dimension: This is the simplest case. We truncate the dimension table and load all rows from the Current View into the empty table.
Initial Load of SCD2 Dimension: Here again, we truncate the dimension table. But for an SCD2 dimension, we load all rows from the History View into the empty table.
Incremental Load of SCD1 Dimension: The Current View is used to do a MERGE into the dimension table. All new rows are inserted, and changed rows are updated.
Incremental Load of SCD2 Dimension: The Current View is used to do a MERGE into the dimension table. All new and changed rows are inserted, and the previous versions are
updated.
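As an illustration of the incremental SCD1 case, the following hedged sketch merges the Current View into a dimension table. The dimension name DIM_CUSTOMER and its column subset are assumptions chosen for the example:

```sql
-- Incremental SCD1 load (sketch): insert new customers, update changed
-- ones. DIM_CUSTOMER and its columns are illustrative.
MERGE INTO DIM_CUSTOMER D
USING V_CUSTOMER_CURR V
   ON (D.CUSTOMER_NUMBER = V.CUSTOMER_NUMBER)
 WHEN MATCHED THEN UPDATE SET
      D.LAST_NAME = V.LAST_NAME,
      D.CITY      = V.CITY
 WHEN NOT MATCHED THEN INSERT
      (CUSTOMER_NUMBER, LAST_NAME, CITY)
      VALUES (V.CUSTOMER_NUMBER, V.LAST_NAME, V.CITY);
```

For the incremental SCD2 case, the load additionally has to end-date the previous version and insert the new one, which is typically split into an UPDATE and an INSERT statement.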

As we can see from these four load scenarios, the History View is only required for the initial load of an SCD2 dimension. In all other situations, the Current View is sufficient to load
the dimensions. But in an agile DWH environment, it should always be possible to rebuild the Data Marts from scratch. So, it’s a good idea to have a History View for all Hubs and Satellites. With a
PIT table, this can be built easily.


This entry was posted in Data Vault Modeling, Dimensional Modeling, ETL, SQL by Dani Schnider. Bookmark the permalink [https://danischnider.wordpress.com/2015/11/12/loading-dimensions-from-a-data-vault-model/].

22 THOUGHTS ON “LOADING DIMENSIONS FROM A DATA VAULT MODEL”

Eckhard
on 5. May 2017 at 17:50 said:

Thanks for the PIT query. The fastest I found and tested.

David Leblanc
on 23. August 2018 at 19:14 said:

In the PIT query, shouldn’t it be "S1.LOAD_DATE <= LD.LOAD_DATE" for the 4 satellites?

Dani Schnider
on 27. August 2018 at 11:39 said:

Hi David
No, with the condition S1.LOAD_DATE <= LD.LOAD_DATE you will get duplicates. The query selects all distinct
load dates from all satellites in the subquery factoring clause LOAD_DATES. After this step, equi-joins are possible
to join the satellites.
Cheers, Dani.

David Leblanc
on 24. October 2018 at 13:45 said:

Hi Dani,

Why do you use MAX(S1.LOAD_DATE) then ?

The query should be:

INSERT INTO S_CUSTOMER_PIT
(H_CUSTOMER_SID,
LOAD_DATE,
S1_LOAD_DATE,
S2_LOAD_DATE,
S3_LOAD_DATE,
S4_LOAD_DATE)
WITH LOAD_DATES AS (
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_MAIN
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ADDRESS
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ONLINE
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_PHONE
)
SELECT
LD.H_CUSTOMER_SID,
LD.LOAD_DATE,
MAX(S1.LOAD_DATE) S1_LOAD_DATE,
MAX(S2.LOAD_DATE) S2_LOAD_DATE,
MAX(S3.LOAD_DATE) S3_LOAD_DATE,
MAX(S4.LOAD_DATE) S4_LOAD_DATE
FROM LOAD_DATES LD
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S1.LOAD_DATE <= LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S2.LOAD_DATE <= LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S3.LOAD_DATE <= LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S4.LOAD_DATE <= LD.LOAD_DATE)
GROUP BY LD.H_CUSTOMER_SID, LD.LOAD_DATE;

If you don't use the "<=", the query will not return all the information.
The GROUP BY takes care of the duplicates and the MAX function returns the most recent date from the
satellites.

Cheers,
David

Dani Schnider
on 24. October 2018 at 17:23 said:

Hi David

Thanks for your comment. Your proposed query returns the same result, but the WINDOW (SORT) operation
in the execution plan (for the analytic functions) is not required. Instead, a GROUP BY is performed. All
other parts of the execution plan are identical. So, your query is even a bit faster than mine.

Cheers, Dani.

K-Meleon rashford
on 24. January 2019 at 11:53 said:

Thanks for this example. I am still waiting for your future post about “Slowly Changing Dimension Type 2 with multiple Hubs and
Links”. I have implemented a solution via a bridge table, but it is not fast and very complex.

Cheers,
Rashford

Andy White
on 11. February 2019 at 16:49 said:

Hi Dani,
Thanks for posting this example. This was of great help to our project in populating our own Point in Time tables. I’d found
numerous posts that talked around the subject but yours is the first that included actual useful code. In fact it was
straightforward for me to then write a code generator in our ETL tool to generate the SQL for any PIT table driven by a few
simple parameters.
Cheers,
Andy

John Campos
on 26. March 2019 at 3:27 said:

Mr. Schnider,

I’ve found the business vault to be the most challenging part of this journey. This is one of the few posts that I’ve found that
simplifies how to virtualize dimensions using PIT tables. I have yet to find a simple demonstration of how to pull together
FACT data using BRIDGE tables. I am really looking forward to your post on this topic. Thank you for your contribution to
this small DV community!

John

jason
on 5. June 2019 at 3:21 said:

Is there a post yet for Slowly Changing Dimension Type 2 with multiple Hubs and Links? I have tried bringing the satellite
load dates into the bridge and then creating a PIT from there but that is causing duplicates.

Dani Schnider
on 5. June 2019 at 8:39 said:

Hi Jason

A bridge table contains all the keys from a set of Hubs and Links, but not from the Satellites. Typically, a dimension with
multiple hierarchy levels uses one bridge table. This table is joined with the PIT tables of each level.

Because you are not the first reader asking for an example with a bridge table, I think I have to plan a blog post
about this topic soon. Hope to find some time this summer.

Cheers, Dani.

John Campos
on 24. July 2019 at 15:06 said:

Jason,

I eagerly await your next blog post with an example of how to design and, more importantly, populate a
bridge table. In the meantime, if you can point us to any resources that may be helpful in this regard, it would
be greatly appreciated.

Dani Schnider
on 25. July 2019 at 8:23 said:

Hi John

In addition to the two Data Vault books I recommended in https://danischnider.wordpress.com/2016/04/13/data-vault-book-recommendations/, here is a list of useful blogs about Data Vault:

http://danlinstedt.com
https://hanshultgren.wordpress.com
https://tedamoh.com/blog
https://learndatavault.com/blog/
http://roelantvos.com/blog/
I’m sure the list is not complete, but these are the links I regularly use.

Cheers, Dani (not Jason)

Jan Skorupiński
on 17. March 2020 at 12:25 said:

Hi,
It seems to me that the V_CUSTOMER_HIST.VALID_TO column should be calculated using
S_CUSTOMER_PIT.LOAD_END_DATE instead of S_CUSTOMER_PIT.LOAD_DATE, as now you have the same dates in the
VALID_TO field.

What do you think about it?

Best Regards,
Jan Skorupiński

Dani Schnider
on 17. March 2020 at 12:46 said:

Hi Jan

Of course, you are right! That was a silly mistake I made when I updated the blog post a few months ago.
I changed the SQL code; it should be correct now. Thanks for your comment.

Cheers, Dani.

Ajay Tripathy
on 18. May 2020 at 9:14 said:

So does it mean the PIT table has a column called LOAD_END_DATE? If so, what is the logic to load that column?

Also, regarding the Type 2 query where we join the PIT with the Satellites on H_CUSTOMER_SID and
LOAD_DATE: first the PIT is joined with the Hub on H_CUSTOMER_SID. Taking your example, this join will produce 8 records
with different PIT LOAD_DATE values. Now we make a left join with S_CUSTOMER_MAIN on S1.H_CUSTOMER_SID =
PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE. This will also produce 8 records, but only the first
record will have FIRST_NAME, LAST_NAME, DATE_OF_BIRTH and GENDER; the rest of the records will have this information
as null (because no matching record is found in S_CUSTOMER_MAIN on that load date). So how is this represented as Type
2? I believe all 8 instances should have the S_CUSTOMER_MAIN data populated, as this satellite data was available from
day 1.

Please add your suggestion.

Dani Schnider
on 18. May 2020 at 9:49 said:

Hi Ajay

Thanks for your comment. The LOAD_END_DATE in the PIT table is not mandatory, but it makes it easier to find all
records that are valid on a particular day. It contains the same value as the LOAD_DATE of the next record and is
calculated with the analytic function LEAD, as described in the blog post.

About your concerns with the NULL records: If you go through the example step by step, you will see how it works.
For your case, the PIT table will contain 8 records with different load dates, but all of them contain the same load
date for the specific satellite (e.g. S1_LOAD_DATE). So, the joins in the views afterwards would return the same
values for FIRST_NAME, LAST_NAME, etc. for all 8 versions.

Cheers, Dani.
saray
on 2. August 2020 at 10:42 said:

Hi Dani,
Thank you for posting this article; it has helped my understanding. Can you also please give an example of building a view
for a fact with both SCD1 and SCD2?

Dani Schnider
on 3. August 2020 at 21:17 said:

Hi Saray

You can find some examples of the different views and star schema queries in the presentation “Craft Beer in the
Cloud: A Data Warehouse Case Study”. The link to the slides is here:
https://danischnider.wordpress.com/presentations/

Cheers, Dani.

Saray
on 23. August 2020 at 6:56 said:

Thanks Dani.
I have one more question regarding the PIT tables: the query in your example will not give a load date for
a day on which there is no update in any of the satellites. Because of that, we have to write a subquery to fetch
the max load date for a hash key.
We do not have end dates in our PIT tables. Can we handle this without an end date?
The only option I see is to generate hash keys for all the missing days by joining a date table with
LOAD_DATES (the CTE in your query).

Dani Schnider
on 22. September 2020 at 20:42 said:

Hi Saray

If your PIT table contains no end date, I know (or have seen) the following options:
1. The PIT table contains a row per key for each day. Because the PIT table will become very big, it is a common approach
to use partitioning and drop old partitions after some months.
2. Calculate the end date in the query. In this case, I would recommend using analytic functions instead of subqueries.

To avoid big PIT tables and slow query performance, I prefer to use an end date in the PIT table (but not in the satellites).
This works fine as long as the PIT table can be rebuilt after every load. Otherwise, I would prefer option 1.

Regards, Dani.

Alex
on 3. February 2021 at 17:02 said:

What about joining multiple Hubs (with satellites) and Links (with satellites) together for PIT historical data? Do you create
one PIT table for all the objects that are needed to create a given dimension?

Dani Schnider
on 3. February 2021 at 17:22 said:

Hi Alex

I create a PIT table for each hub (plus all its satellites). This has worked well so far, even with virtual data marts. For a
(public) use case, look at the examples in this presentation:
https://danischnider.files.wordpress.com/2020/08/craft_beer_in_the_cloud.pdf

Cheers, Dani.
