
Surrogate Key Generation in DataStage - An elegant way

Joshy George (Consulting Employee), posted 2/14/2008

An elegant and fast way to generate surrogate keys in a parallel job! This is a hot topic, discussed and attempted by most ETL architects, designers and developers. This article looks at an elegant way to generate surrogate keys in a DataStage parallel job, without the overhead of creating multiple jobs or maintaining a state file. It leans slightly towards the advanced user, as it involves creating a parallel routine with the DataStage Development Kit (Job Control Interfaces). But the strategy is simple and elegant: you can do it in one job and maintain the surrogate key in a centralised, editable location, an environment variable defined in Administrator. That gives you wings to use it across the project in different jobs as well.

Plan of action

1) The Starting Key Value / Last Key Used is an environment variable defined in Administrator.
2) Increment the surrogate key in the Surrogate Key stage, the Transformer or the Column Generator approach, passing the starting value as the defined environment variable (see the sketch after this list).
3) To capture the last key generated (finally), use a Tail stage with Number of Rows (Per Partition) = 1 and All Partitions = True.
4) Capture the last record in a Transformer stage using @PARTITIONNUM + 1 = @NUMPARTITIONS in the constraint / filter.
5) Call a parallel routine which uses the C/C++ API, specifically DSSetEnvVar, to update the environment variable with the last surrogate key value. Ex: SetEnvParam(DSProjectName, DSJobName, '', Input_Link.LAST_KEY+1).

Advantages:

* No extra jobs or effort outside the job are required to retrieve and pass in the last key used.
* No need to store the last key used in a state file or dataset. As it is stored in an environment variable, it can be retrieved anytime from Administrator (no worries about encryption, still secured with a password).
* The last key value can be changed / edited anytime in Administrator. Easy manageability and maintenance.
* It can be used across the project in different jobs (duplicate key avoidance is explained later in this article).
* If the job fails and rolls back, the environment variable update will not happen, because that routine call is made last, and the Starting Key Value / Last Key Used stays intact.
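As a rough illustration of step 2, a Transformer derivation can space the generated keys across partitions so that no two partitions ever produce the same value. The C sketch below is illustrative only; the function and variable names are mine, and in the actual job this arithmetic would be written directly in the derivation using @INROWNUM, @NUMPARTITIONS and @PARTITIONNUM:

    #include <stdio.h>

    /* Illustrative only: how a derivation such as
       start + (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM
       spreads unique keys across partitions without any
       coordination between them. */
    long next_key(long start_value, long in_row_num,
                  int partition_num, int num_partitions)
    {
        return start_value
             + (in_row_num - 1) * (long)num_partitions
             + partition_num;
    }

    int main(void)
    {
        /* 2 partitions; starting value 100 comes from the
           environment variable defined in Administrator. */
        for (int p = 0; p < 2; p++)
            for (long r = 1; r <= 3; r++)
                printf("partition %d row %ld -> key %ld\n",
                       p, r, next_key(100, r, p, 2));
        return 0;
    }

Partition 0 produces 100, 102, 104 and partition 1 produces 101, 103, 105: interleaved but never colliding.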

Do it in a single job

Passing the starting value as the defined Environment Variable

To capture the last key generated (Finally), use a tail stage

Capture the last record in a Transformer & Call a parallel routine to update the Environment Variable

Parallel routine using API to update environment variable

Ref. Parallel Job Advanced Developer's Guide, Chapter 7: DataStage Development Kit (Job Control Interfaces). This is the API call which sets an environment variable:

int DSSetEnvVar(DSPROJECT hProject, char *EnvVarName, char *Value);

Remember to add appropriate Info/Warning messages to the log from this routine. Here is a sample of the log: [screenshot: log messages written by the routine]
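Below is a minimal sketch of what such a routine could look like. It assumes the DSOpenProject and DSCloseProject calls and the DSJE_NOERROR status code from the same Development Kit chapter; the simplified signature (project name, variable name, value) differs from the four-argument SetEnvParam call shown earlier, and error handling is abbreviated. Check dsapi.h and the guide for the exact signatures in your release.

    #include <stdio.h>

    #include "dsapi.h"  /* DataStage Development Kit header */

    /* Sketch only: write the last surrogate key back to a
       project-level environment variable via DSSetEnvVar. */
    int SetEnvParam(char *projectName, char *envVarName, long lastKey)
    {
        char value[32];
        int rc;
        DSPROJECT hProject;

        hProject = DSOpenProject(projectName);
        if (!hProject)
        {
            fprintf(stderr, "SetEnvParam: cannot open project %s\n",
                    projectName);
            return -1;
        }

        sprintf(value, "%ld", lastKey);

        rc = DSSetEnvVar(hProject, envVarName, value);
        if (rc == DSJE_NOERROR)
            printf("SetEnvParam: %s updated to %s\n",          /* Info */
                   envVarName, value);
        else
            fprintf(stderr,
                    "SetEnvParam: DSSetEnvVar failed, rc = %d\n", /* Warning */
                    rc);

        DSCloseProject(hProject);
        return rc;
    }

Depending on your environment you may also need to set the server connection parameters before opening the project; see the same chapter of the guide.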

Avoiding Duplicate Numbers

You can consider different strategies for this. Here is one from Vincent McBurney's blog: It's quite possible that you have several jobs trying to generate surrogate keys for the same target table. If you are sure they never run at the same time, you don't have to do anything; they will always have unique keys. If you cannot guarantee this, you can still have unique keys. Give each job a unique number as a text field of two characters, e.g. 10 for JobA and 11 for JobB. Append the job identifier to the surrogate key generated in a Transformer stage. This turns the sequence 1, 2, 3 into 110, 210, 310 or 111, 211, 311. Starting the identifier at 10 or 100 or 1000 instead of 01 or 001 or 0001 prevents leading zeros from being lost in the process. In a Transformer, use the concatenation operator ":" to join the two fields. All rows from JobA will always end in 10, and all rows from JobB will end in 11, so they will never have conflicting surrogate key values. A two-digit code gives you up to 99 jobs writing to the same target table; a three-digit identifier gives you 999. If you want unique job codes across all jobs, you could choose five characters, allowing up to 99999 jobs that will never have conflicting surrogate key values.

The most important part of this strategy, when jobs run at the same time generating surrogate keys for the same target table, is how each job updates the environment variable with its last surrogate key value (illustrated in the sketch below):

* Trim the last part (i.e. the unique job number) from the last surrogate key value and use what remains to update the environment variable.
* Before updating the environment variable, check that its current value is not greater than the value you are about to write; if it is greater, do not update. Write down a small example scenario and you will understand why!

Want to generate the above job number automatically as well, rather than making it job specific? Here is the trick. Along with the surrogate key environment variable for a target table, define a running job number variable (an environment variable) with an initial value of 10. Whenever a new job starts loading the target table, it picks up the surrogate key environment variable along with the running job number + 1, and updates the running job number variable with this incremented value. In the Transformer, as noted above, concatenate "environment variable : running job number variable" to get the surrogate key. When the job finishes and you update the surrogate key environment variable, decrement the corresponding running job number by 1.

I want the seed from the target table max value

In this case you need to pick the max value from the target table and use the same strategy specified above: pass it through a Transformer and update the environment variable before the main job starts, applying the same conditions.
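To make the append-and-trim arithmetic concrete, here is a small sketch assuming a two-digit job code. In the job itself the append step would be the ":" concatenation in a Transformer; the function names below are mine:

    #include <stdio.h>

    /* Append a two-digit job code to a generated key: 1, 2, 3 with
       code 10 become 110, 210, 310 (the Transformer equivalent is
       key : jobcode). */
    long append_job_code(long key, int job_code)
    {
        return key * 100 + job_code;  /* 100 because the code has two digits */
    }

    /* Strip the job code back off before updating the environment
       variable, as described in the first bullet above. */
    long trim_job_code(long composite_key)
    {
        return composite_key / 100;
    }

    int main(void)
    {
        long last_composite = append_job_code(3, 11);       /* 311 */
        long last_key       = trim_job_code(last_composite); /* 3 */

        /* Second bullet: only move the environment variable forward,
           never backward, in case another concurrent job has already
           recorded a higher key. */
        long current_env = 5;  /* pretend value read from the env variable */
        if (last_key + 1 > current_env)
            current_env = last_key + 1;

        printf("composite=%ld stripped=%ld env=%ld\n",
               last_composite, last_key, current_env);
        return 0;
    }

Here the guard leaves the variable at 5 rather than overwriting it with 4, which is exactly the backward-move scenario the second bullet warns about.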
