
Slowly Changing Dimensions

White Paper
Copyright 2011 Lunexa, LLC

Intellectual Property Disclaimer


The names of actual companies and products mentioned herein may be the trademarks of their respective owners. The information contained in this document represents the current view of Lunexa, LLC, on the issues discussed as of the date of publication. Because Lunexa must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Lunexa, and Lunexa cannot guarantee the accuracy of any information presented after the date of publication. This White Paper is for informational purposes only. LUNEXA MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Lunexa, LLC. Lunexa may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Lunexa, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. 301 Howard Street, Suite 1410 * San Francisco, CA 94105 U.S.A. * Phone 415.325.5902 * Fax 415.358.4626 * info@lunexa.com


Contents
I. Executive Summary
II. Overview
III. Background
IV. Informatica PowerCenter
    A. Details for Type 2 Slowly Changing Dimension
V. Talend
    A. Talend-Provided SCD Component
    B. Self-Designed SCD Type 1 Update Process
    C. Factors Affecting the Performance of Updates
VI. Ab Initio
VII. Conclusion
VIII. About Lunexa


I. Executive Summary
Maintaining Slowly Changing Dimensions (SCDs) often poses a difficult challenge for extract, transform, and load (ETL) processes. To capture the changes to a dimension table, the transform process must first perform a lookup on the dimension table and then either update an existing entry or insert a new entry (often called an upsert). Such updates are generally one of two types:

Type 1: Fields of a matching row are overwritten, and the history for these updated fields is permanently lost.
Type 2: The matching row is retained as a history row, and a new record with the new field value is added to the dimension table and made the current record. This option requires the creation of a surrogate key to maintain referential integrity with the fact table.

This paper examines how three ETL tools (Informatica PowerCenter, Talend, and Ab Initio) handle the processing of SCDs. All three load the source and lookup data into files or memory, and then perform the transformations against the loaded data. Performing the transformations against files provides a considerable performance advantage over performing these actions against the target database via ODBC or JDBC calls.

For the purposes of this discussion, the paper analyzes how the different tools process the same SCD data. The demonstration scenario used for all three platforms has a fact table containing the sales data for various snack and beverage products, and a dimension table containing the product descriptions. Over time, one of the product descriptions, as recorded in the transaction log, changes. The requirement in this example is that all product description changes be captured, so an SCD model is needed.

II. Overview
In its simplest form, a dimensional data warehouse consists of a central fact table, containing the aggregatable data, and the accompanying descriptive dimension tables. If the dimension data remained static, the extract, transform, and load (ETL) processes that populate the tables would be very simple: the new records would just need to be inserted. In reality, however, many of these dimensions change over time, resulting in what are known as Slowly Changing Dimensions (SCDs), and capturing these changes presents one of the largest challenges in data warehousing.

To capture these changes, the transform process performs a lookup into a dimension table and updates an existing entry or inserts a new entry (i.e., an upsert), depending on the requirements. These updates are generally one of two types:

Type 1: Fields of a matching row are overwritten, and the history for these updated fields is permanently lost.
Type 2: The matching row is retained as a history row, and a new record with the new field value is added to the dimension table and made the current record. This option requires the creation of a surrogate key to maintain referential integrity with the fact table.

Three other types of SCDs exist but are rarely used:

Type 0: The changes in the data are simply ignored.
Type 3: The changes are reflected in a second column in the dimension table. The dimension table contains one column with the original value, and another with the current value.
Type 4: The historical records are stored in a separate archive table.
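The difference between a Type 1 and a Type 2 update can be sketched in a few lines of Python. This is a minimal illustration only, not how any of the three tools implements the logic: the ProductRow class, the is_current flag, the two function names, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProductRow:
    prod_id: int          # surrogate key (hypothetical column name)
    sku: str              # natural (business) key
    product_desc: str
    is_current: bool = True


def apply_type1(dim: Dict[str, ProductRow], sku: str, desc: str) -> None:
    """Type 1: overwrite the field in place; the old value is lost."""
    row = dim.get(sku)
    if row is None:
        dim[sku] = ProductRow(prod_id=len(dim) + 1, sku=sku, product_desc=desc)
    else:
        row.product_desc = desc


def apply_type2(dim: List[ProductRow], sku: str, desc: str) -> int:
    """Type 2: keep the old row as history and add a new current row.

    Returns the surrogate key that the fact row should reference.
    """
    current = next((r for r in dim if r.sku == sku and r.is_current), None)
    if current is not None and current.product_desc == desc:
        return current.prod_id            # nothing changed
    if current is not None:
        current.is_current = False        # retire the old version
    new_id = max((r.prod_id for r in dim), default=0) + 1
    dim.append(ProductRow(new_id, sku, desc))
    return new_id


type1_dim: Dict[str, ProductRow] = {}
type2_dim: List[ProductRow] = []
for sku, desc in [("A", "Brad's Drink"), ("A", "Pepsi")]:
    apply_type1(type1_dim, sku, desc)
    apply_type2(type2_dim, sku, desc)
```

After both changes are applied, the Type 1 dimension holds a single row for SKU A with the latest description, while the Type 2 dimension holds two rows for SKU A, only the second of which is current.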

III. Background
Depending on their architectures, ETL tools handle the processing of SCDs in different ways. The three ETL tools discussed in this paper (Informatica PowerCenter, Talend, and Ab Initio) move the data in a similar manner: all of them load the source and the lookup data into files or memory, and then perform the transformations against the loaded data. Performing the transformations against files provides a considerable performance advantage over performing these actions against the target database via ODBC or JDBC calls. The differences between the tools emerge in how they handle the SCD data once the lookup data is loaded.

For the purposes of this discussion, we will analyze how the different tools process the same SCD data, and the options available for optimizing the handling of SCDs for each tool. For illustrative purposes, special attention will be given to showing multiple ways of handling SCDs using Talend; similar approaches may be adapted for use with Ab Initio.

For the demonstration scenario used for all three platforms, we have a fact table, FACT_TRANS, containing the sales data for various snack and beverage products, and a dimension table, LU_PRODUCT, containing the product descriptions. Over time, however, one of these product descriptions, as recorded in the transaction log, TRANS_LOG.TXT, changed, resulting in an SCD that must be processed.


IV. Informatica PowerCenter
To ensure that the dimension has the latest values, the input data should be sorted by TRANS_DATE. If it is not, you can add a sorter transformation within PowerCenter. When PowerCenter reads data from a source, it stores the data in a buffer cache and processes that entire set of data at once.

No matter which type of Slowly Changing Dimension is being loaded, use a lookup transformation to determine whether the data is already present in the dimension. Once the lookup is completed, one path of data will load the fact table, and one path of data will load the dimension table.

Unless you are loading a Type 2 Slowly Changing Dimension, add an aggregator transformation to the path that loads the dimension, using the SKU as your aggregation key. This ensures that each SKU is inserted or updated only once and, because the data is sorted, the default behavior in Informatica will be to use the latest description. Finally, before loading the target, use an update strategy to flag the row as an insert, an update, or a reject.


Assume that the lookup transformation returns the ports lkp_SKU, lkp_DESC, and lkp_ORIG_DESC (the last for Type 3 only). A null lkp_SKU means that the dimension table does not yet contain the record. Depending on the type of Slowly Changing Dimension, the update strategy is defined differently:

1) Type 0: iif(isnull(lkp_SKU), DD_INSERT, DD_REJECT)
2) Type 1: iif(isnull(lkp_SKU), DD_INSERT, DD_UPDATE)
3) Type 2: See below.
4) Type 3: Same as Type 1, with an additional transformation before the target that determines the value for ORIG_DESC: iif(isnull(lkp_SKU), PROD_DESC, lkp_ORIG_DESC)
5) Type 4: Same as Type 1, with an additional target that adds all data to the archive table.
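The same decisions can be written as ordinary functions, which makes them easy to check against sample rows. The sketch below is Python, not PowerCenter expression language: the function names are hypothetical, and the DD_* values are plain strings standing in for the PowerCenter constants of the same names.

```python
from typing import Optional

# String stand-ins for the PowerCenter update strategy constants (assumption).
DD_INSERT, DD_UPDATE, DD_REJECT = "DD_INSERT", "DD_UPDATE", "DD_REJECT"


def update_strategy(scd_type: int, lkp_sku: Optional[str]) -> str:
    """Return the row flag for SCD Types 0, 1, 3, and 4.

    lkp_sku is None when the lookup found no matching dimension row.
    """
    not_in_dimension = lkp_sku is None
    if scd_type == 0:
        # Type 0: insert new rows, ignore changes to existing rows.
        return DD_INSERT if not_in_dimension else DD_REJECT
    if scd_type in (1, 3, 4):
        # Types 3 and 4 flag rows the same way as Type 1.
        return DD_INSERT if not_in_dimension else DD_UPDATE
    raise ValueError("Type 2 is handled by the dynamic lookup cache")


def orig_desc(lkp_sku: Optional[str], prod_desc: str,
              lkp_orig_desc: Optional[str]) -> Optional[str]:
    """Type 3 only: keep the original description once it has been stored."""
    return prod_desc if lkp_sku is None else lkp_orig_desc
```

For example, update_strategy(1, None) flags a new SKU as an insert, while update_strategy(0, "A") rejects a row whose SKU is already in the dimension.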

A. Details for Type 2 Slowly Changing Dimension

The lookup transformation has a special option built to handle Type 2 Slowly Changing Dimensions: the dynamic lookup cache. When this option is selected, the lookup cache behaves differently in the following ways:

1) The lookup transformation updates the data cache as it reads data from the source.
2) The lookup transformation adds a new indicator stating whether the incoming record was an insert (1), an update (2), or no change (0). The "no change" indicator means the transformation found the record, but nothing was changed, so it did nothing to the data cache.
3) The lookup transformation generates a sequence-ID for inserted or updated cache data, which can be used as the Surrogate Key (SK).


[Figure: lookup transformation ports, with callouts marking the new indicator, the port used for the surrogate key, and the port used for comparison]

Using the example below, let's assume that the dimension has never been loaded. The indicators shown will be set, allowing us to determine how to route the data. After the lookup transformation, simply connect the Sequence-ID port to both the fact table and the update strategy. Change the update strategy logic to:

iif(NewLookupRow = 0, DD_REJECT, DD_INSERT)

Input                             Lookup                                  Update Strategy
SKU  Description   Date           New Lookup Row  PROD_ID  Lkp Desc       Out Desc       Action
A    Brad's Drink  3/12/2005      1               1        NULL           Brad's Drink   DD_INSERT
B    Dr. Pepper    3/12/2005      1               2        NULL           Dr. Pepper     DD_INSERT
C    Oranges       3/12/2005      1               3        NULL           Oranges        DD_INSERT
B    Dr. Pepper    3/12/2005      0               2        Dr. Pepper     Dr. Pepper     DD_REJECT
A    Pepsi         3/13/2005      2               4        Brad's Drink   Pepsi          DD_INSERT

For the first three rows, the data is simply inserted because the target does not have any data to begin with. For the fourth row, the dynamic lookup returns a 0 for the indicator because the combination of SKU B and description Dr. Pepper has already come through. For that reason, the row can also reuse PROD_ID = 2. The last record has a matching SKU but a different description, so the cache is updated, a new PROD_ID is assigned, and the record is inserted into the dimension table.
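The routing described above can be reproduced with a small simulation. The following Python sketch models only the behavior described in this section (an in-memory cache keyed by SKU, an indicator of 0, 1, or 2, and a sequence-generated PROD_ID); it is not PowerCenter's implementation, and the function name and tuple layout are assumptions for the example.

```python
from typing import Dict, List, Optional, Tuple

# (SKU, NewLookupRow, PROD_ID, Lkp Desc, Action)
ResultRow = Tuple[str, int, int, Optional[str], str]


def run_dynamic_lookup(rows: List[Tuple[str, str]]) -> List[ResultRow]:
    cache: Dict[str, Tuple[int, str]] = {}   # SKU -> (PROD_ID, description)
    next_id = 1                               # dimension assumed empty at start
    results: List[ResultRow] = []
    for sku, desc in rows:
        cached = cache.get(sku)
        lkp_desc = cached[1] if cached is not None else None
        if cached is not None and cached[1] == desc:
            # Found and unchanged: leave the cache alone, reuse the key.
            new_lookup_row, prod_id = 0, cached[0]
        else:
            # 1 = insert (SKU not cached), 2 = update (description changed).
            new_lookup_row = 1 if cached is None else 2
            prod_id = next_id
            next_id += 1
            cache[sku] = (prod_id, desc)
        # Equivalent of iif(NewLookupRow = 0, DD_REJECT, DD_INSERT).
        action = "DD_REJECT" if new_lookup_row == 0 else "DD_INSERT"
        results.append((sku, new_lookup_row, prod_id, lkp_desc, action))
    return results


source_rows = [
    ("A", "Brad's Drink"),
    ("B", "Dr. Pepper"),
    ("C", "Oranges"),
    ("B", "Dr. Pepper"),
    ("A", "Pepsi"),
]
routed = run_dynamic_lookup(source_rows)
```

Run against the five sample rows, the simulation yields indicators 1, 1, 1, 0, 2 and PROD_ID values 1, 2, 3, 2, 4, matching the example table.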


V. Talend

A. Talend-Provided SCD Component

Talend provides a Slowly Changing Dimension component for each database with which it interacts. For the examples that follow, MySQL is used as the RDBMS. In the example shown in the following illustration, a tMysqlSCD component is used to update the LU_PRODUCT table, based on the data imported from the TRANS_LOG.TXT file and stored in the metadata. Upon successful completion of the operation, a tMysqlCommit operation is performed to commit any remaining transactions not already committed in the MySQL database connection.

The SCD component may be used for SCD Type 0, 1, 2, or 3 updates on the specified fields. In the example SCD component editor window shown in the illustration that follows for a Type 2 update to PRODUCT_DESC, the field SK1 is added as a surrogate key, with the value for newly inserted values set to Table max + 1, that is, one greater than the maximum value currently stored in the table for SK1. Four additional fields are added for SCD versioning for this Type 2 update: the start and end dates for the time the given version is the active version, the version number, and the current active or inactive status.
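To make the versioning fields concrete, the following Python sketch applies a Type 2 change by hand. It illustrates the resulting table contents only and is not the code that the Talend component generates: sqlite3 stands in for MySQL, and the column names start_date, end_date, version, and active, as well as the function name, are hypothetical.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")   # in-memory stand-in for the MySQL database
conn.execute(
    """CREATE TABLE LU_PRODUCT (
           SK1 INTEGER PRIMARY KEY,  -- surrogate key
           SKU TEXT,
           PRODUCT_DESC TEXT,
           start_date TEXT,          -- when this version became active
           end_date TEXT,            -- NULL while the version is active
           version INTEGER,
           active INTEGER            -- 1 = current, 0 = history
       )"""
)


def apply_type2(conn: sqlite3.Connection, sku: str, desc: str, as_of: date) -> int:
    """Insert a new version if the description changed; return its SK1."""
    current = conn.execute(
        "SELECT SK1, PRODUCT_DESC, version FROM LU_PRODUCT "
        "WHERE SKU = ? AND active = 1",
        (sku,),
    ).fetchone()
    if current is not None and current[1] == desc:
        return current[0]                      # unchanged: keep current version
    version = 1
    if current is not None:
        # Close out the old version so it becomes a history row.
        conn.execute(
            "UPDATE LU_PRODUCT SET end_date = ?, active = 0 WHERE SK1 = ?",
            (as_of.isoformat(), current[0]),
        )
        version = current[2] + 1
    # "Table max + 1": one greater than the largest SK1 currently stored.
    new_sk = conn.execute(
        "SELECT COALESCE(MAX(SK1), 0) + 1 FROM LU_PRODUCT"
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO LU_PRODUCT VALUES (?, ?, ?, ?, NULL, ?, 1)",
        (new_sk, sku, desc, as_of.isoformat(), version),
    )
    return new_sk


apply_type2(conn, "A", "Brad's Drink", date(2005, 3, 12))
apply_type2(conn, "A", "Pepsi", date(2005, 3, 13))
conn.commit()
history = conn.execute(
    "SELECT SK1, PRODUCT_DESC, start_date, end_date, version, active "
    "FROM LU_PRODUCT ORDER BY SK1"
).fetchall()
```

After the two calls, the table holds an inactive version 1 row for the old description with an end date, and an active version 2 row for the new description with no end date.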

B. Self-Designed SCD Type 1 Update Process

Rather than caching the inserts and updates and then performing them at the end of the process using a single ODBC connection, as Informatica PowerCenter does, the code that Talend generates creates a JDBC call for each insert and update. This may result in substantially slower performance of the SCD processing. Performance may be improved by creating self-designed SCD-handling components. The following illustration shows a simplified Type 1 update.


The first mapping in this example, check_for_new_products, compares new records from the TRANS_LOG_update file against existing records in the LU_PRODUCT dimension table. Where matching SKU values are found, these records from TRANS_LOG_update are then compared against LU_PRODUCT a second time, looking for records where the PRODUCT_DESC value has changed for a given SKU value. LU_PRODUCT is then updated to use the new PRODUCT_DESC values for these SKUs. The PROD_ID values are then read from LU_PRODUCT by the get_max_prod_id aggregation, which determines the maximum value of PROD_ID within LU_PRODUCT. The prepare_new_products mapping then uses this to start a sequence, incrementing by 1, for a new PROD_ID value for each new product, coming from the unmatched records from the check_for_new_products mapping. The prepare_new_products mapping prepares these records, which are then inserted with a bulk insert, bulk_insert_new_products. Once both the bulk insert and updates successfully complete, a tMysqlCommit operation is performed to commit any remaining transactions not already committed in the MySQL database connection.
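The flow of this job can be sketched outside Talend as follows. This is a minimal Python approximation under stated assumptions, not the job itself: sqlite3 stands in for MySQL, the incoming rows stand in for TRANS_LOG_update and are assumed to contain one row per SKU, and the variable names only loosely follow the mapping names above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory stand-in for the MySQL database
conn.execute(
    "CREATE TABLE LU_PRODUCT ("
    "PROD_ID INTEGER PRIMARY KEY, SKU TEXT UNIQUE, PRODUCT_DESC TEXT)"
)
conn.executemany(
    "INSERT INTO LU_PRODUCT VALUES (?, ?, ?)",
    [(1, "A", "Brad's Drink"), (2, "B", "Dr. Pepper")],
)

# Stand-in for the TRANS_LOG_update records (SKU, description).
incoming = [("A", "Pepsi"), ("B", "Dr. Pepper"), ("C", "Oranges")]

# check_for_new_products: split incoming rows on whether the SKU exists.
existing = dict(conn.execute("SELECT SKU, PRODUCT_DESC FROM LU_PRODUCT"))
matched = [(sku, desc) for sku, desc in incoming if sku in existing]
unmatched = [(sku, desc) for sku, desc in incoming if sku not in existing]

# Second comparison: keep only matched rows whose description changed,
# then overwrite the description (Type 1).
changed = [(desc, sku) for sku, desc in matched if existing[sku] != desc]
conn.executemany(
    "UPDATE LU_PRODUCT SET PRODUCT_DESC = ? WHERE SKU = ?", changed
)

# get_max_prod_id: find the current maximum PROD_ID.
max_id = conn.execute(
    "SELECT COALESCE(MAX(PROD_ID), 0) FROM LU_PRODUCT"
).fetchone()[0]

# prepare_new_products: number the new rows starting at max_id + 1.
new_rows = [
    (max_id + offset, sku, desc)
    for offset, (sku, desc) in enumerate(unmatched, start=1)
]

# bulk_insert_new_products: insert all new rows in one batched call.
conn.executemany("INSERT INTO LU_PRODUCT VALUES (?, ?, ?)", new_rows)
conn.commit()

final = conn.execute(
    "SELECT PROD_ID, SKU, PRODUCT_DESC FROM LU_PRODUCT ORDER BY PROD_ID"
).fetchall()
```

In this run, SKU A is updated in place, SKU B is left untouched because its description did not change, and SKU C is inserted with PROD_ID 3.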

C. Factors Affecting the Performance of Updates

Many factors influence the performance of Type 1 and Type 2 updates; they are listed at the end of the following section on Ab Initio. In addition, an SCD process can run out of memory on large lookups if the lookup data is cached in memory. Talend allows lookup data to be sent to disk instead, although a lookup cached to disk is slower. To help minimize performance issues with large lookups, pull only the columns needed for the lookup from the lookup source. More memory can also be allocated to a job by selecting Window -> Preferences -> Talend -> Run/Debug and changing the Job Run VM arguments. Where the situation allows it, limiting the lookup source selection by date may also help. The relative performance of the Talend-provided SCD component and the bulk insert approach of a self-designed SCD-handling process, as outlined in this paper, must be tested to find the optimal approach for a given environment.

VI. Ab Initio
Similar to Talend, Ab Initio can make changes to the database either one operation at a time, using ODBC calls in API mode, or in batch mode. The API mode is a component developed by Ab Initio and the database vendor that allows the use of some of the database functionality. The database and network limit how fast Ab Initio can update and insert records, and Ab Initio can be tuned to reach those limits.

The three Ab Initio database components for modifying databases are "output table", which allows inserts, and "update table" and "join with database", which both allow inserts, updates, and deletions. The available modes for each component are determined by the database. For example, Oracle allows both API and batch mode for output table, but only API mode for update table and join with database. In contrast, Teradata allows both API and batch modes for all three of these database components.

Generally, the API mode is much slower than the batch mode, but parallelism may be adjusted to improve the performance. In some cases, it may be possible to make the API mode run as fast as the batch (utility) mode, at the cost of substantial database resources and within the limits set by the database and table design.

The join with database component tends to be very resource intensive. If used with a large table, better performance will be obtained by performing the manipulations on the database. In contrast, if used on a small table, the table should be dumped to a flat file, which is then used as a lookup file inside of the Ab Initio graph.

Many factors affect the performance of a Type 1 or Type 2 update, such as the table size, the update volume, the insert volume, the ratio of updates to inserts, and the presence of an index. For some situations, the best performance may be obtained by truncating and reloading the dimension table. Ab Initio gives the user the flexibility to maximize performance, and a good approach for one scenario might not be suitable for another. Testing multiple approaches is the best way to determine how to optimize the handling of SCDs for a given situation.
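The flat-file lookup suggested above for small tables can be illustrated generically. The sketch below is plain Python rather than an Ab Initio graph: the function names are hypothetical, and an in-memory text buffer stands in for the flat file so that the example is self-contained.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, TextIO, Tuple


def dump_dimension(rows: Iterable[Tuple[int, str, str]], out: TextIO) -> None:
    """Write (PROD_ID, SKU, PRODUCT_DESC) rows to a delimited flat file."""
    csv.writer(out).writerows(rows)


def load_lookup(src: TextIO) -> Dict[str, int]:
    """Read the flat file once and index it by SKU for repeated lookups."""
    return {sku: int(prod_id) for prod_id, sku, _desc in csv.reader(src)}


# Dump a small dimension table once, outside the row-by-row processing.
flat_file = io.StringIO()            # stand-in for a file on disk
dump_dimension([(1, "A", "Pepsi"), (2, "B", "Dr. Pepper")], flat_file)
flat_file.seek(0)

# Resolve surrogate keys for incoming SKUs without touching the database.
lookup = load_lookup(flat_file)
fact_keys: List[Optional[int]] = [lookup.get(sku) for sku in ["B", "A", "Z"]]
```

Each incoming SKU is resolved against the in-memory index instead of a database call; SKUs that are missing from the file come back as None and would need to be handled as new dimension rows.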

VII. Conclusion

Dealing with Slowly Changing Dimensions is often very challenging. The three ETL tools discussed in this paper address Type 1 and Type 2 updates to dimension tables in different ways. This paper described how each of the three platforms handles these updates, covering both bulk and row-level methods. Rigorous testing should always be performed to determine the optimal approach for your environment.

VIII. About Lunexa

Lunexa, LLC is a technology consulting firm specializing in data warehousing, business intelligence, and enterprise data integration. Lunexa offers a broad portfolio of advisory and implementation services to help our clients maximize their data assets. Our consultants have in-depth experience with a wide range of database technologies and platforms. Lunexa is a privately held firm based in the San Francisco Bay Area. For more information, visit www.lunexa.com.
