You are on page 1of 32

Manage dimension tables in InfoSphere Information Server DataStage

How to use the Slowly Changing Dimension stage


Skill Level: Intermediate Brian Caufield (bcaufiel@us.ibm.com) Software Architect IBM

12 Mar 2009 Information Server DataStage Version 8.0 introduced the Slowly Changing Dimension (SCD) stage. This tutorial provides step-by-step instructions on how to use the SCD stage for processing dimension table changes. It also shows you how to use the output of the stage to update an associated fact table. The tutorial includes a fully operational download.

Section 1. Before you start


The Slowly Changing Dimension stage was added in the 8.0 release of InfoSphere Information Server DataStage. It is designed specifically to support the types of activities required to populate and maintain records in star schema data models, specifically dimension table data. The Slowly Changing Dimension stage encapsulates all of the dimension maintenance logic finding existing records, generating surrogate keys, checking for changes, and what action to take when changes occur. In addition, you can associate dimension record surrogate key values with source records, which eliminates the need for additional lookups in later processing.

About this tutorial

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 1 of 32

developerWorks

ibm.com/developerWorks

This tutorial is designed to introduce you to using the Slowly Changing Dimension stage on the Information Server DataStage parallel canvas. The tutorial uses a simplified example scenario that focuses on Slowly Changing Dimension functionality. Actual business scenarios may require different approaches to the job design used in this tutorial's example. The volume of data processed in the tutorial is intentionally small to make it easier to understand the processing that is taking place. The material in the SCD_Tutorial.zip file in the Download section is built to run on a Windows platform with a DB2 database. You can modify the material to run on a different platform or to use a different database.

Objectives
In this tutorial, you will learn how to design a job that uses the Slowly Changing Dimension stage to perform updating and loading of dimension and fact tables. After completion, you will be able to configure the SCD stage for history-tracking changes and in-place changes, and use the output of the stage to update an associated fact table.

Prerequisites
This tutorial is written for DataStage developers who are familiar with the DataStage Parallel Edition design canvas. You will also benefit if you already have a knowledge of star schema design concepts (including fact and dimension tables), the use of surrogate keys, and the usual methodology for updating dimension tables.

System requirements
To create the job in this tutorial, you need an Information Server DataStage 8.x installation that is licensed to use the parallel engine. You also need a DataStage Designer client and access to a DataStage project where you can create, import, compile, and run DataStage jobs. To use the sample scripts in the SCD_Tutorial.zip download, your Information Server must be installed on a Windows OS with access to a DB2 database. However, you can also modify the scripts to work on other operating systems and with a different database.

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 2 of 32

ibm.com/developerWorks

developerWorks

Section 2. Star schemas and Slowly Changing Dimensions


Star schemas are a method of data modeling in which the data that is being measured, called the facts, are stored in one table, called the Fact table. Business Objects are the entities that are involved in the events being measured. Business Objects consist of identifying information and attributes that describe the object. These objects are stored in tables called dimension tables. The facts in the fact table are linked to the business objects in the associated dimension tables using foreign keys. Figure 1. Example Star Schema

Because fact tables record the measurements generated from business events, they tend to grow rapidly. Dimension tables, on the other hand, tend to grow or change less frequently. In the example used in this tutorial, the fact table records information about sales transactions. Every transaction results in a new row in the fact table. The product dimension in the example only grows when a new product is introduced, or if information about an existing product is changed. You typically handle changes to attribute information in one of two ways:

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 3 of 32

developerWorks

ibm.com/developerWorks

Overwrite The existing row in the dimension table is updated to contain the new attribute values; the old values are no longer available. This is commonly referred to as a Type1 change. Tracking History The existing row in the dimension table is modified to indicate that it is no longer current (that is, it has been expired), and a new row is inserted with the current attribute values. This is commonly referred to as a Type2 change. Surrogate Keys Surrogate Keys are values that are generated specifically for the purpose of uniquely identifying dimension table rows. The primary reasons you would use a surrogate key rather than the usual business key of the object in the dimension table are: When tracking history in the dimension table, there will be multiple rows in the dimension table for the same business key. Therefore, it is not possible to use the business key as the primary key. Typical fields that are used as business keys generally don't change, but situations can arise where they do change. For example, US citizens can be assigned a new social security number, or account numbers may be reassigned after a merger. Surrogate keys provide a way for the dimension table to have a reliable, unique, and never-changing primary key.

Section 3. Tutorial scenario


The scenario used for this tutorial has one fact table and two dimension tables that will be updated. The source file contains sales transaction records. The information in the source file is used to update the fact and dimension tables. Figure 2. Scenario schemas

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 4 of 32

ibm.com/developerWorks

developerWorks

Source data The source data file is named SaleDetail.dat and is contained in the SCD_Tutorial.zip download. It contains five records that, when processed, apply changes to the fact and dimension tables. Table 1 shows the contents of the file. Table 1. Source data
StoreId StoreName StoreMgr rodSKU P ProdBrand ProdDescr SaleAmtSaleUnits A1111 A1112 A1113 A1114 A1115 Stuff Washington 1111111111 Bob's Red box 00436.14 13 00456.56 14

MoreStuff Adams 2222222222 SqueakyBlue Chair

Stuffy's Jefferson3333333333 Sunshine Yellow 00203.38 7 Duckie McStuff Madison 4444444444 AAAAA fork Stuff Monroe 5555555555 Best lawn 00308.87 2 00024.40 11

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 5 of 32

developerWorks

ibm.com/developerWorks

Jr.

mower

Product dimension The product dimension is a table in the target database. Initially this table contains records for three products. When the source data is processed, the table is updated to contain new product records, and to track the history of changed product information. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and populates this table with the data shown in Table 2. Table 2. Initial product dimension data
ProdSK SKU 1 2 10 Brand Descr Curr Y Y Y EffDate ExpDate 3333333333 Sunshine Yellow Duckie 4444444444 AAAAA 5555555555 AAAAA spoon grass cutter 2004-01-01099-12-31 2 2004-01-01099-12-31 2 2004-01-01099-12-31 2

Store dimension The store dimension is a table in the target database. Initially this table contains records for three stores. When the source data is processed, the table is updated to contain new store records, and to overwrite changed store information. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and populates this table with the data shown in Table 3. Table 3. Initial store dimension data
StoreSK 1 2 5 ID A1113 A1114 A1115 Name Stuffy's McStuff Lil Stuff Mgr Jefferson Adams Monroe

Fact table The fact dimension is a table in the target database. Initially this table contains no records. When the source data is processed, the table is updated with the sales facts and references to the corresponding dimension records. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates the table as shown in Table 4. Table 4. Initial Fact table data
ProdSK StoreSK SaleAmt SaleUnits

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 6 of 32

ibm.com/developerWorks

developerWorks

Section 4. Setting up the tutorial


To set up the tutorial, save the SCD_Tutorial.zip file from the Download section to your local file system and follow these steps: 1. 2. Check if you already have the following directory structure: C:\IBM\Demo\DataStage. If not, create it. Extract the contents of SCD_Tutorial.zip into C:\IBM\Demo\DataStage. Be sure to select the option in your extraction program that indicates you want to use the folder or directory names when extracting. You should end up with the directory C:\IBM\Demo\DataStage\SCD, which contains several files and an empty sub-directory named SKG. Run C:\IBM\Demo\DataStage\SCD\setup.bat. In the DataStage Administrator client, set the environment variable APT_DB2INSTANCE_HOME to the location where the db2nodes.cfg file exists. Typically this is C:\IBM\SQLLIB\DB2. This configures the project to access DB2 as a source or target for the DB2 Enterprise Stage. Using the DataStage Designer client, import C:\IBM\Demo\DataStage\SCD\SCD_Tutorial.dsx into your DataStage project.

3. 4.

5.

Verify the state of the database Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory. This displays the contents of the product and store dimensions as well as the fact table. Review the output to verify that the tables have been initialized properly. Resetting the tutorial Once the tutorial has been run the first time, the contents of the database will have changed. Therefore, subsequent runs would see different behavior. If you want to reset the database tables back to their initial state, run the zReset executable shortcut in the C:\IBM\Demo\DataStage\SCD directory.

Initializing the surrogate keys

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 7 of 32

developerWorks

ibm.com/developerWorks

The tutorial uses surrogate key generators that use state files to record the key values that have been used. This ensures that unique values are always generated. Because the dimension tables are created with data in them, you need to make the surrogate key generators aware of what values have already been used. Compile and run the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job to initialize the state files. The job reads the product dimension table and the store dimension table, then creates and updates the respective surrogate key generator state files.

Building the Slowly Changing Dimensions job


In this step you build a job that reads the SalesDetail.dat source file, updates the product and store dimensions, and inserts records into the fact table. For reference, a completed version of the job named Demo\DataStage\Slowly Changing Dimensions\SCD_All is included in the download. Draw the job design as illustrated below in Figure 3. Figure 3. Job design

The primary flow of records is from left to right in the job design. The source records are read from SaleDetail, passed to the first SCD stage to process the Product
Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved. Trademarks Page 8 of 32

ibm.com/developerWorks

developerWorks

dimension, then passed to the next SCD stage to process the store dimension, and finally to the fact table. No records are added or removed on this flow of data. Every record read from the source is inserted into the fact table. As part of the processing in the SCD stages, the surrogate key values that are associated with the source records are obtained from the dimension table and added to the data being passed to the fact table. Looking at the job design from top to bottom, the product and store dimension tables are reference sources to the SCD stages. These tables are used to initialize the lookup cache. Only records that are considered current are stored in the lookup cache. Any historical records in the dimension tables are automatically filtered out during initial processing. The SCD stage uses the data values from the primary input link to lookup into the cache and check for changes. If any changes are required to the dimension table, they are written to the secondary output link of the SCD stage, which is called the dimension update link. Target database stages are connected to the dimension update link to apply the changes to the actual dimension table in the database. Each record on the primary input link of the SCD stage will go out on the primary output link, and may produce zero, one, or two records on the dimension update link. The number of records produced depends on what, if any, action needs to be taken on the dimension table. Zero records Unchanged records require no action to the dimension table, so no records are written on the dimension update link. One record New records and overwriting updates (Type1) require a one row change to the dimension table. The change is either an insert or an update. One record is written on the dimension update link to reflect these types of changes. Two records Changed records that are tracking history (Type2) require a two row change to the dimension table. The existing record must be updated to reflect that it is no longer current, and a new record must be inserted for the new set of values. Two records are written to the dimension update link to reflect these changes.

Configuring the stages


Now that you have built the high level job design, you are ready to perform the next
Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved. Trademarks Page 9 of 32

developerWorks

ibm.com/developerWorks

set of steps in which you: Configure the individual stages to access the source data. Process the dimension tables. Update the fact table.

Configure the primary source stage


The source stage must be configured to read the SaleDetail.dat file. Complete the following steps to configure the SaleDetail sequential file stage: 1. 2. 3. 4. On the Output|Properties tab, set the File property to C:\IBM\Demo\DataStage\SCD\SaleDetail.dat. On the Output|Format tab, add the Record delimiter string property and set it to DOS Format. On the Output|Format tab, remove the Final delimiter property. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SaleDetail table definition onto the output link.

Figure 4. Source stage

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 10 of 32

ibm.com/developerWorks

developerWorks

The source stage should now be configured to read the SaleDetail.dat file. Use View Data to confirm that the data is being read from the database properly.

Configure the stages to process the Product dimension


Three stages are used to process the Product dimension. Reading the job design from top to bottom: The first stage specifies how to read the data from the dimension table. The SCD stage determines what changes need to be made to the dimension table and those changes are written to the dimension update link. The dimension update link is connected to the dimension update target stage, which specifies how to update the actual database table with the data produced by the SCD stage. Configure the Product dimension source stage Complete the following steps to configure the Product dimension DB2 Enterprise stage:

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 11 of 32

developerWorks

ibm.com/developerWorks

1. 2. 3. 4. 5. 6.

On the Output|Properties tab, set the Read method property to Table. On the Output|Properties tab, set the Table property to SCD.ProdDim. On the Output|Properties tab, set the Use Default Database and Use Default Server properties to False. On the Output|Properties tab, set the Database property to SCDDemo. On the Output|Properties tab, set the Server property to DB2. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SCD.ProdDim table definition onto the output link.

Figure 5. Product dimension source

The stage should now be configured to read the SCD.ProdDim table. Use View Data to confirm that the data is being read from the database properly. Configure the Product dimension SCD stage The Fast Path control of the SCD stage editor lets you navigate directly to the tabs that require input in order to complete the stage configuration. The control is in the lower left corner of the editor. Use the arrow buttons to move forward or backward through the tabs. Open the product dimension SCD stage editor and use the Fast Path control to set the properties as shown: Fast Path control

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 12 of 32

ibm.com/developerWorks

developerWorks

The SCD stage has two input links and two output links. This results in a high number of property link-tab combinations. Use the Fast Path control to move directly to the tabs that are required to configure the stage.

Fast Path page 1: Setting the output link By default, the first output link connected to the stage is used as the primary output link. Look at the link name that is displayed in the Select output link property. Use the drop down list to select the output link that is leading to the next SCD stage. This is the primary output of the stage. The other link automatically becomes the dimension update link. Figure 6. Product dimension SCD stage, Fast Path page 1

Fast Path page 2: Define the lookup condition and purpose codes The first task on this page is to define what the various columns of the dimension table are used for. This information is used in a number of ways in the SCD processing. The choices for purpose codes are:

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 13 of 32

developerWorks

ibm.com/developerWorks

Surrogate Key This column is the primary key of the dimension table and is populated with a surrogate key value. Business Key This column is the identifier of the business objects that the dimension table is representing, but is not the primary key of the dimension table. This column is typically used as a lookup column and corresponds to a key or some other field of the source data that identifies the associated business object. The lookup is used to find the dimension table row that corresponds to a source data row. Type 2 Check this column for a change in value. If the value has changed, perform a history tracking change to the dimension table. Type 1 Check this column for a change in value. If the value has changed, perform an overwriting change to the dimension table. Current Indicator This column is used as a flag to indicate whether it is the most current record for a particular business key. Effective Date This column is used to specify when a record first became the most current record, that is, when it became the active record. Expiration Date This column is used to specify the ending date of when a record was the active record. For currently active records, this value is typically a future date or NULL. SK Chain This column is used to store the surrogate key of the previous or next record in the history for a particular business key. (blank) This column is not used for anything with respect to SCD processing. Data for this field is inserted into the table when a new row is inserted, but this column will not be checked for changes against the source data. Set purpose codes for the columns as shown below in Figure 7. Because this dimension table is tracking history, it contains columns to track whether a row is current and the date range for when it was current. Click on the ProdSKU source field and drag it to the SKU dimension column to create the lookup condition. Figure 7. Product dimension SCD stage, Fast Path page 2

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 14 of 32

ibm.com/developerWorks

developerWorks

Although this tab looks similar to a mapping tab, it is actually defining the lookup keys from the source record to the dimension record. Any source column can be associated with any one dimension column. This creates an equality lookup condition between those columns. If more than one source column is associated with a dimension column, then those equality conditions are AND'ed together. In this manner, multi-column lookup keys can be used. Fast Path page 3: Configuring the surrogate key generator Surrogate key generation capabilities are integrated into the SCD stage. This tab specifies how surrogate keys are generated for this stage. Surrogate key generation can use DataStage's file based surrogate generation, or use DB2 or Oracle database sequence object based generation. This tutorial uses the file based method. Set the Source name property to C:\IBM\Demo\DataStage\SCD\SKG\ProdDim as shown in Figure 8. This is the surrogate key state file you created by running the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key
Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved. Trademarks Page 15 of 32

developerWorks

ibm.com/developerWorks

Generation\CreateAndUpdate_File job. Leave the defaults for the other properties unchanged. Figure 8. Product dimension SCD stage, Fast Path page 3

Fast Path page 4: Defining the slowly changing dimension behavior and derivations The DimUpdate tab is used to define several critical elements of SCD processing. The Derivation column is used to specify how to map elements of a source row to elements of the dimension table. The Expire column is used to specify what values need to change if an existing record needs to be expired. Expire expressions are only enabled when there are Type2 columns specified, and are only available for Current Indicator and Expiration Date columns. If no matching record is found when the lookup is performed, the derivation expressions are applied and a record is written on the dimension update link to indicate a new record needs to be added to the dimension table. If a matching record is found, the derivation expressions are applied to the source columns, and then the results are compared to

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 16 of 32

ibm.com/developerWorks

developerWorks

the corresponding columns of the dimension table. Columns specified as Type2 are compared first. If there is a change, two records are written on the dimension update link. The first record is an update record, to expire the matched row. The Expire expressions are used to calculate the values for the update row. The second record is a new record that contains all of the new values for all columns. If no Type2 columns have changed, the Type1 columns are compared. If there are any changes, one record is written on the dimension update link that indicates an update to the dimension table. The derivation expressions are used to calculate the values for the update record. Set the Derivation expressions and the Expire expressions as shown below in Figure 9. Figure 9. Product dimension SCD stage, Fast Path page 4

Note that you are specifying these properties on the dimension update link. The output columns for this link were automatically propagated with their purpose codes from the dimension input link. The SCD stage only does this when the set of columns on the dimension update link is empty. It is possible to load a set of columns directly on the dimension update link, however, they must exactly match those specified on the dimension input link. Fast Path page 5: Selecting the columns for Output Link The Output Map tab is used to define what columns will leave this stage on the primary output link. This tab operates much like the Mapping tab of

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 17 of 32

developerWorks

ibm.com/developerWorks

other stages. The only difference is that you can select columns from the primary input link and columns from the reference link to output. The columns coming from the primary source have the same values they entered the stage with. The columns coming from the reference link represent the values from the dimension table that correspond to the source row. Note that because the SCD processing has been done by the stage, every record from the primary source data will have a corresponding record in the dimension. Select the columns for output as shown below in Figure 10. The output link is initially empty. Create and map the output columns by dragging and dropping from the source to the target. Because the product dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the primary key associated with the source row is appended because that is the value that is required to be inserted into the fact table. Figure 10. Product dimension SCD stage, Fast Path page 5

The stage is now configured to perform the dimension maintenance on the Product

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 18 of 32

ibm.com/developerWorks

developerWorks

dimension table. Configure the Product dimension target stage This stage processes the dimension update link records produced by the product dimension SCD stage to update the actual dimension table in the database. Because incoming records represent both inserts and updates to the table, a Upsert write method must be used. Auto-generated update and insert statements take the purpose codes specified in the SCD stage into account to generate the correct update statement for this usage. Complete the following steps to configure the Product dimension update DB2 Enterprise stage: 1. 2. 3. 4. 5. 6. On the Input|Properties tab, set the Write Method property to Upsert. On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert. On the Input|Properties tab, set the Table property to SCD.ProdDim. On the Input|Properties tab, set the Use Default Database and Use Default Server to False. On the Input|Properties tab, set the Database property to SCDDemo. On the Input|Properties tab, set the Server property to DB2.

Figure 11. Product dimension target

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 19 of 32

developerWorks

ibm.com/developerWorks

The stage is now configured to write to the SCD.ProdDim dimension table.

Configure the stages to process the Store dimension


Configure the Store dimension source stage Complete the following steps to configure the Store dimension DB2 Enterprise stage: 1. 2. 3. 4. 5. 6. On the Output|Properties tab, set the Read Method property to Table. On the Output|Properties tab, set the Table property to SCD.StoreDim. On the Output|Properties tab, set the Use Default Database and Use Default Server to False. On the Output|Properties tab, set the Database property to SCDDemo. On the Output|Properties tab, set the Server property to DB2. Load the Demo\DataStage\Slowly Changing
Trademarks Page 20 of 32

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

ibm.com/developerWorks

developerWorks

Dimensions\TableDefs\SCD.StoreDim table definition onto the output link. Figure 12. Store dimension source stage

The stage should now be configured to read the SCD.StoreDim table. Use View Data to confirm that the data is being read from the database properly. Configure the Store dimension SCD stage Open the store dimension SCD stage editor and use the Fast Path control to set the properties as shown: Fast Path page 1: Setting the Output Link Use the Select output link drop down list to select the link leading to the fact table. This is the primary output of the stage. The other link automatically becomes the dimension update link. Figure 13. Store dimension SCD stage, Fast Path page 1

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 21 of 32

developerWorks

ibm.com/developerWorks

Fast Path page 2: Define the lookup condition and purpose codes Set purpose codes for the columns as shown below in Figure 14. Because this dimension table is not tracking history, it does not contain columns to track whether a row is current or not. The Name column has a blank purpose code, which indicates that this column will not be checked for changes. Click on the StoreId source field and drag it to the dimension column Id to create the lookup condition. Figure 14. Store dimension SCD stage, Fast Path page 2

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 22 of 32

ibm.com/developerWorks

developerWorks

Fast Path page 3: Configuring the surrogate key generator Set the file path property to C:\IBM\Demo\DataStage\SCD\SKG\StoreDim as shown in Figure 15. This is the surrogate key state file you created by running the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job. Leave the defaults for the other properties. Figure 15. Store dimension SCD stage, Fast Path page 3

Fast Path page 4: Defining the slowly changing dimension behavior and derivations Set the Derivation expressions as shown below in Figure 16. Because the Name column has no purpose code, the SCD stage does not check this

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 23 of 32

developerWorks

ibm.com/developerWorks

column for changes when a matching dimension record is found on the lookup. Because there are no Type2 columns in this dimension table, the Expire expression is not enabled for any column. Figure 16. Store dimension SCD stage, Fast Path page 4

Fast Path page 5: Selecting the columns for Output Link Select the columns for output as shown below in Figure 17. Because the store dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the surrogate key associated with the source row is appended because that is the value that is required to be inserted into the fact table. Figure 17. Store dimension SCD stage, Fast Path page 5

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 24 of 32

ibm.com/developerWorks

developerWorks

The stage is now configured to perform the dimension maintenance on the store dimension table. Configure the Store dimension target stage This stage processes the dimension update records produced by the store dimension SCD stage to update the actual dimension table in the database. Complete the following steps to configure the Store dimension target DB2 Enterprise stage: 1. 2. 3. 4. 5. On the Input|Properties tab, set the Write method property to Upsert. On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert. On the Input|Properties tab, set the Table property to SCD.StoreDim. On the Input|Properties tab, set the Use Default Database and Use Default Server to False. On the Input|Properties tab, set the Database property to SCDDemo.

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 25 of 32

developerWorks

ibm.com/developerWorks

6.

On the Input|Properties tab, set the Server property to DB2.

Figure 18. Store dimension target stage

The stage is now configured to write to the SCD.StoreDim dimension table.

Configure the Fact table target stage


This stage processes the source records that have been passed through the primary output links to update the actual fact table in the database. At this point, the original input source records have been processed so that the only columns on this link are the measurements (SaleAmt and SaleUnits) and the surrogate key values for the associated Product and Store. Complete the following steps to configure the Fact table target DB2 Enterprise stage: 1. 2. 3. On the Input|Properties tab, set the Write Method property to Write. On the Input|Properties tab, set the Write Mode property to Append. On the Input|Properties tab, set the Table property to SCD.Facttbl.

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 26 of 32

ibm.com/developerWorks

developerWorks

4. 5. 6.

On the Input|Properties tab, set the Use Default Database and Use Default Server to False. On the Input|Properties tab, set the Database property to SCDDemo. On the Input|Properties tab, set the Server property to DB2.

Figure 19. Fact table target stage

The stage is now configured to write to the SCD.Facttbl dimension table.

Final steps
You have now completed the job design and are ready to compile. Click the Compile button to start the compile. Note that the SCD stage processing makes use of the transform operator. So for the job to compile successfully, the C++ compiler settings for the project must be correct. The Resources page contains a link to an article in the information center for IBM Information Server with details on configuring your environment correctly for your C++ compiler. See the Information Server Configuration Guide for details on how to configure the environment correctly for your C++ compiler. If any compile errors occur, check your job and stages against the settings specified in the tutorial and make any necessary changes.
Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved. Trademarks Page 27 of 32

developerWorks

ibm.com/developerWorks

Section 5. Running the tutorial


At this point, you are now ready to compile and run the job. Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory to see the initial contents of the database tables. The Results shortcut displays the contents of the product dimension, the store dimensions, and the fact table. Run the job by clicking the Run button in the DataStage Designer. After the job finishes successfully, run the Results shortcut again to see the changes that were made to the database tables.

Summary of changes to database tables


The contents of the database tables should now appear as follows: The product dimension has two update records, and four new records. Two of the new records are new objects to the dimension table, and two existing records had Type2 changes, resulting in the two updates and two of the new records.
Change ProdSK SKU No 1 Change Expired 2 (Type2) Expired 10 (Type2) New 3 Record New 4 Record New 5 Record (Type2) Brand Descr Curr EffDate ExpDate 2004-01-01 2099-12-31 2004-01-01 {Today's Date} 2004-01-01 {Today's Date} {Today's2099-12-31 Date} {Today's2099-12-31 Date} {Today's2099-12-31 Date} {Today's2099-12-31 Date} 3333333333 Sunshine Yellow Y Duckie 4444444444 AAAAA spoon 5555555555 AAAAA grass cutter 1111111111 Bob's Red Box N N Y Y Y

2222222222 SqueakyBlue Chair 4444444444 AAAAA fork

New 6 5555555555 Best Record(Type2)

lawn Y mower

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 28 of 32

ibm.com/developerWorks

developerWorks

The store dimension has one updated record, and two new records. The updated record had a Type1 change and the two new records are new objects to the dimension table.
Change No Change Update No Change New Record New Record StoreSK 1 2 5 3 4 ID A1113 A1114 A1115 A1111 A1112 Name Stuffy's McStuff Lil Stuff Stuff MoreStuff Mgr Jefferson Madison Monroe Washington Adams

The fact table has five new records, one for each source record processed. The surrogate key values in this table correspond to the current records in the dimension tables.
ProdSK 3 4 1 5 6 StoreSK 3 4 1 2 5 SaleAmt 436.14 456.56 203.38 308.87 24.40 SaleUnits 13 14 7 2 11

The contents of the dimension tables have now changed. If you were to run the job again, what results would you expect to see? Hint: The dimension tables and the source file are now in-sync. This completes the Slowly Changing Dimensions tutorial. To reset the database tables to their original state, run the zReset executable shortcut .

Conclusion
You can use the Slowly Changing Dimension stage to greatly reduce the time you spend creating jobs for processing star schemas. In this tutorial you have learned how to configure the Slowly Changing Dimension stage to process history-tracking changes and in-place changes to dimension tables. You have also seen how you can reduce fact table processing by augmenting the source data with associated dimension table surrogate keys that eliminate the need for an additional lookup.

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 29 of 32

developerWorks

ibm.com/developerWorks

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 30 of 32

ibm.com/developerWorks

developerWorks

Downloads
Description Name Size
16KB

Download method
HTTP

Supporting scripts and DS jobs for this SCD_Tutorial.zip tutorial Information about download methods

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 31 of 32

developerWorks

ibm.com/developerWorks

Resources
Learn In the InfoSphere area on developerWorks, get the resources you need to advance your InfoSphere product skills. C++ compiler for job development topic in the information center for IBM Information Server. Browse the technology bookstore for books on these and other technical topics. Get products and technologies Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2, Lotus, Rational, Tivoli, and WebSphere. Discuss Participate in the discussion forum for this content. Check out developerWorks blogs and get involved in the developerWorks community.

About the author


Brian Caufield Brian Caufield is a software architect in IBM Silicon Valley Lab. Brian has been working in the DataStage development organization for 10 years and was involved in the design of the Slowly Changing Dimension Stage.

Manage dimension tables in InfoSphere Information Server DataStage Copyright IBM Corporation 2009. All rights reserved.

Trademarks Page 32 of 32

You might also like