
DataStage Best Practices

55000783.doc Page 1 of 41
CONTENTS

1. INTRODUCTION................................................................................................................................. 6
1.1 OBJECTIVE...................................................................................................................................... 6
1.2 REFERENCES.................................................................................................................................. 6
1.3 AUDIENCE....................................................................................................................................... 6
1.4 DOCUMENT USAGE.......................................................................................................................... 6

2. DATASTAGE OVERVIEW.................................................................................................................. 7

3. DATASTAGE DEVELOPMENT WORKFLOW...................................................................................7


3.1 BUILDING AND TESTING JOBS........................................................................................................... 7
3.1.1 Dummy_Dev Project............................................................................................................ 8
3.2 OTHER DATASTAGE PROJECTS........................................................................................................ 8

4. DATASTAGE JOB DESIGN CONSIDERATIONS..............................................................................8


4.1 JOB TYPES...................................................................................................................................... 8
4.1.1 Import Jobs.......................................................................................................................... 9
4.1.2 Transform Jobs.................................................................................................................... 9
4.1.3 Unload Jobs....................................................................................................................... 10
5. USE OF STAGES.............................................................................................................................. 10
5.1 COMBINING DATA.......................................................................................................................... 10
5.1.1 Join, Lookup and Merge Stages........................................................................................ 10
5.1.2 Aggregate Stage................................................................................................................ 11
5.1.3 The Funnel Stage.............................................................................................................. 11
5.2 SORTING....................................................................................................................................... 11
5.3 DATA MANIPULATION..................................................................................................................... 12
5.3.1 Transformer Usage Guidelines.......................................................................................... 12
5.3.2 Modify Stage...................................................................................................................... 14
5.4 TRANSITIONING DATA..................................................................................................................... 14
5.4.1 External Data..................................................................................................................... 14
5.4.2 Parallel Dataset................................................................................................................. 15
5.5 UNIT TEST..................................................................................................................................... 15
5.5.1 Copy Stage........................................................................................................................ 15
5.5.2 Peek Stage........................................................................................................................ 15
5.5.3 Row Generator.................................................................................................................. 15
5.5.4 Column Generator............................................................................................................. 15
5.5.5 Manual XLS Generation.................................................................................................... 15
6. GUI STANDARDS............................................................................................................................. 16

7. DATASTAGE NAMING STANDARDS............................................................................................. 16

8. RUNTIME COLUMN PROPAGATION (RCP)...................................................................................18

9. STANDARDISED REJECT HANDLING........................................................................................... 18


9.1 REJECT COMPONENTS................................................................................................................... 18
9.2 CUSTOMISED REJECT MESSAGES................................................................................................... 20
9.3 REJECT LIMIT................................................................................................................................ 21
9.4 BEFORE ROUTINE.......................................................................................................................... 21
9.5 NOTIFICATIONS.............................................................................................................................. 21

9.5.1 In-line Notification of Rejects............................................................................................. 21
9.5.2 Cross Functional Notification of Rejects............................................................................22
10. ENVIRONMENT............................................................................................................................ 22
10.1 DEFAULT ENVIRONMENT VARIABLES STANDARDS............................................................................22
10.2 JOB PARAMETER FILE STANDARDS................................................................................................. 22
10.3 DIRECTORY PATH PARAMETERS..................................................................................................... 22
10.4 DEFAULT DIRECTORY PATH PARAMETERS......................................................................................22
10.5 DIRECTORY & DATASET NAMING STANDARDS..................................................................................23
10.5.1Functional Area Input Files................................................................................................ 23
10.5.2Functional Area Output Tables.......................................................................................... 23
10.5.3Functional Area Staging Tables......................................................................................... 23
10.5.4Internal Module Tables...................................................................................................... 23
10.5.5Datasets Produced from Import Processing......................................................................23
11. METADATA MANAGEMENT........................................................................................................ 23
11.1 SOURCE AND TARGET METADATA................................................................................................... 24
11.2 INTERNAL METADATA..................................................................................................................... 24

12. STANDARD COMMON COMPONENTS......................................................................................24


12.1 JOB TEMPLATES............................................................................................................................ 24
12.1.1Import Jobs........................................................................................................................ 25
12.1.2Transform Jobs.................................................................................................................. 25
12.1.3Unload Jobs....................................................................................................................... 26
12.2 CONTAINERS................................................................................................................................. 26

13. DEBUGGING A JOB..................................................................................................................... 27

14. COMMON ISSUES AND TIPS...................................................................................................... 27


14.1 1-WAY / N-WAY.............................................................................................................................. 27
14.2 DUPLICATE KEYS........................................................................................................................... 28
14.3 RESOURCE USAGE VS PERFORMANCE............................................................................................ 29
14.4 GENERAL TIPS............................................................................................................................... 30

15. REPOSITORY STRUCTURE........................................................................................................ 31


15.1 JOB CATEGORIES.......................................................................................................................... 31
15.2 TABLE DEFINITION CATEGORIES..................................................................................................... 31
15.3 ROUTINES..................................................................................................................................... 31
15.4 SHARED CONTAINERS.................................................................................................................... 31

16. COMMON COMPONENTS USED IN DUMMY..............................................................................31


16.1 JBT_SC_JOIN................................................................................................................................. 31
16.2 JBT_SC_SRT_CD_LKP.................................................................................................................... 32
16.3 JBT_ENV_VAR................................................................................................................................ 33
16.4 JBT_ANNOTATION........................................................................................................................... 33
16.5 JOB LOG SNAPSHOT...................................................................................................................... 34
16.6 RECONCILIATION REPORT.............................................................................................................. 36
16.7 SCRIPT TEMPLATE.......................................................................................................................... 38
16.8 SPLIT FILE..................................................................................................................................... 38
16.9 MAKE FILE.................................................................................................................................... 38
16.10 JBT_IMPORT........................................................................................................................... 39
16.11 JST_IMPORT........................................................................................................................... 41
16.12 JBT_UNLOAD.......................................................................................................................... 41

16.13 JST_UNLOAD.......................................................................................................................... 43
16.14 jbt_abort_threshold.............................................................................................................. 43

1. INTRODUCTION

1.1 Objective
This document will serve as a source of standards for use of the DataStage software as
employed by the Dummy Transformation project.

The standards set out below will be followed by all developers. It is understood that this
document, while setting the standards, may not cover every development scenario. In such
cases, the developer must contact the appropriate authority to seek clarification and ensure
that the missing items are subsequently added to this document.

It will therefore be an evolving document which will be updated to continually reflect the
changing needs and thoughts of the development team and hence continue to represent best
practices as the project progresses.

Initial review and sign-off process will therefore be followed within this context.

1.2 Document Usage


This document describes the DataStage best practices to be applied to the Dummy
Transformation project. It is intended to channel the general knowledge of DataStage
developers towards the specific things they need to know about the Dummy project and the
specific way jobs will be developed.

It will be referenced by developers initially for familiarisation and as required during the course
of the project. Use of the document will therefore reduce over time as developers become
familiar with the practices described. The Offshore Build Manager will maintain the document
(in collaboration with the development team, through weekly developer meetings) and will be
responsible for distributing the document to developers and explaining its content, both
initially and after updates have been applied, ensuring that the standards it describes are
communicated and understood. Such communication will highlight the areas of change.

The best practices will also form the basis for QA and peer testing within the development
environment.

2. DATASTAGE OVERVIEW
DataStage is a powerful Extraction, Transformation, and Loading tool. DataStage has the
following features to aid design and processing:
• Uses graphical design tools. With simple point-and-click techniques you can draw a
  scheme to represent your processing requirements
• Extracts data from any number or type of database
• Handles all the metadata definitions required to define your data warehouse or
  migration. You can view and modify the table definitions at any point during the
  design of your application
• Aggregates data. You can modify SQL SELECT statements used to extract data
• Transforms data. DataStage has a set of predefined transforms and functions you
  can use to convert your data. You can easily extend the functionality by defining
  your own transforms to use

3. DATASTAGE DEVELOPMENT WORKFLOW

3.1 Building and Testing Jobs


This section provides an overview of the DataStage job development process for the Dummy
Transformation project. As detailed in the diagram below, there will be three environments:
development, test and production.

Within DataStage, a project is the entity in which all related material to a development is stored
and organised.

The development environment will have three projects through which code moves: Dummy_Dev,
Version and Dummy_Promo. Developers will develop code in the Dummy_Dev project and, after
unit testing, promote it to the Version project, where version control will be managed. After
the code has been baselined, the DataStage administrator will collate it in the Dummy_Promo
project, from where the DMCoE will move it to the Test server for unit and end-to-end testing.
Finally, the DMCoE will move the code to production. Please refer to the Dummy Transform Code
Migration Strategy document for further details.

[Figure: DataStage environments and code promotion. Development Server: DS projects Ranch_Dev,
Ranch_Dev\FDyy, Version and Ranch_Promo, with Developer, Build Manager (Review / Defect Fix /
QA / Sign Off) and Administrator roles, covering the Build / Unit Test process and the Deploy /
Promote process. Test Server: DS project Ranch_Test (Onshore E2E Test Activity, DMCoE).
Production Server: DS project Ranch_Prod (Onshore Activity, DMCoE).]

Each DataStage project is defined below:

3.1.1 Dummy_Dev Project


The Dummy_Dev project will be used by developers for building and unit testing DataStage jobs.
It will be mapped to a working directory on the UNIX DataStage server. Changes and defects
found during unit testing will be documented and fixed before the job is promoted to
“Dummy_Promo” for integration testing.

3.2 Other DataStage Projects


Several further DataStage projects will be employed across the Development, Test and
Production environments. Please refer to the Dummy Transform Code Migration Strategy
document for further details.

4. DATASTAGE JOB DESIGN CONSIDERATIONS

4.1 Job Types


As shown in the diagram below, there will be three types of job within Transform: Import,
Transform and Unload. Source data with a complex file layout will be processed by these jobs in
sequence to produce a target file in the format required by the Load team.

[Figure: Job types in the DataStage environment, run under a scheduler process. Hogan and
non-Hogan extract data -> Import jobs -> staging data -> Transform jobs (with staging and
lookup data) -> staging data -> Unload jobs -> load data.]

Import: source data is provided as complex flat files. Source data is loaded (as is) into
DataStage parallel datasets.

Transform: transform jobs apply business rules. These jobs read from and write to parallel
datasets to enable the movement of data between jobs and to facilitate the application of
business rules through lookups.

Unload: transformed data is unloaded from the DataStage environment. Target data is provided
as flat files.

4.1.1 Import Jobs


Import jobs will be the starting point for transformation. Sanity checks on the file and
validation of external properties (e.g. size) will be done here. The source file will be read as
per the source record layout. If there are any unwanted or bad records, the job will fail and
the file needs to be corrected before the job is restarted. Source data will then be filtered to
select the records to be processed, and unprocessed data will be maintained in a dataset for
future reference. Finally, one or more datasets will be created as input to the actual transform
process. See section 9 for further details of the action to be taken on failure or reject.

[Figure: Import job. Inputs: Hogan and non-Hogan extract data. Steps: check for a zero-byte
file and validate header and trailer details; read the file in the specific format; create
output datasets of records to be processed. On error: write error details in the Stats File and
stop processing.]
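
The import checks described in this section can be illustrated with a minimal Python sketch,
written outside DataStage purely to show the order of the checks. The file layout assumed here
(a header line starting with `H`, and a trailer line starting with `T` followed by the detail
record count) is hypothetical; the real layout comes from the source record specification.

```python
import os


def check_import_file(path):
    """Return the detail records of a file, or raise ValueError on a failed check.

    Assumed (hypothetical) layout: first line 'H...', last line 'T<count>',
    everything in between is a detail record.
    """
    # Check 1: reject a zero-byte file before trying to read it.
    if os.path.getsize(path) == 0:
        raise ValueError("zero-byte file")

    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()

    # Check 2: header and trailer must both be present.
    if len(lines) < 2 or not lines[0].startswith("H") or not lines[-1].startswith("T"):
        raise ValueError("missing header or trailer")

    details = lines[1:-1]

    # Check 3: the count declared in the trailer must match the detail records.
    declared = lines[-1][1:].strip()
    if not declared.isdigit() or int(declared) != len(details):
        raise ValueError("trailer count does not match detail records")

    return details
```

In a real import job, a failed check would write the error details to the Stats File and stop
processing rather than raise a Python exception.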

4.1.2 Transform Jobs

Datasets created by import jobs will be processed by transform jobs. A transform job will join
two or more datasets and look up data as per the functional design specification. Finally, the
records will be split as per the destination file design and a destination dataset will be
created. All data errors will be captured in an exception log for future reference.

Transform jobs process data flows and will have a number of inputs in the form of datasets.
These datasets will present data from completed jobs during a normal batch run, or may
represent unprocessed data in the event of the batch being restarted following job failure.

On completion, transform jobs will normally produce a dataset. This dataset will pass data to
dependent jobs within the batch. Breaks between jobs serve as restart checkpoints in the event
of a downstream failure.

Exceptions resulting from data inconsistencies are captured during job execution.

[Figure: Transform job. Inputs: driving data flow (records to be processed) and lookup dataset.
Steps: Join, Lookup, Transform and Split. Outputs: data held for future jobs, and an exception
log of records from the driving data flow failing to join or failing lookup.]

4.1.3 Unload Jobs


Unload jobs will take transform datasets as a source and create the final files required by the
Load team in the given format.
[Figure: Unload job. Inputs: data held for future jobs in temporary datasets. Step: unload data
into the output file as per layout. Output: load data; target data is provided as flat files.]

5. USE OF STAGES

5.1 Combining Data

5.1.1 Join, Lookup and Merge Stages


The Join, Lookup and Merge stages combine two or more input links according to the values of
key columns. They differ mainly in memory usage, treatment of rows with unmatched key values,
and input requirements (i.e. whether inputs must be sorted and de-duplicated).

A brief description as to when to use these stages is provided in the following table:

                                  Join             Lookup                          Merge
Type                              SQL-like         In RAM Lookup Table             Master / Update
Memory                            Light            Heavy                           Light
Number of Inputs                  1 Left, 1 Right  1 Source, n Lookup Tables       1 Master, n Updates
Sort on Input                     All              None                            All
Duplicates on Primary Input       OK               OK                              Warning
Duplicates on Secondary Input(s)  OK               Warning                         OK (when n=1)
Options on Unmatched Primary      None             Fail, Continue, Drop or Reject  Keep or Drop
Options on Unmatched Secondary    None             None                            Capture as Reject
Number of Output Links            1                1 Out, 1 Reject                 1 Out, n Rejects
Captured on Reject                Nothing          Unmatched Primary Rows          Unmatched Secondary Rows

The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is
small enough to fit into available physical memory. Each lookup reference requires a contiguous
block of physical memory. If the datasets are larger than available resources, the Join or
Merge stage should be used.
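
The memory trade-off can be illustrated with a small Python sketch. It is not DataStage code;
the function names and row shapes are assumptions made for illustration. The lookup variant
holds the whole reference table in memory and needs no sort, while the join variant walks two
sorted inputs in step (the sort is done in memory here only to keep the sketch short).

```python
def lookup(source_rows, reference_rows, key):
    """Lookup-style: the whole reference table is held in memory; no sort needed."""
    table = {ref[key]: ref for ref in reference_rows}  # memory grows with reference size
    matched, rejected = [], []
    for row in source_rows:
        ref = table.get(row[key])
        if ref is None:
            rejected.append(row)  # unmatched primary row goes to the reject list
        else:
            matched.append({**row, **ref})
    return matched, rejected


def sorted_join(left_rows, right_rows, key):
    """Join-style inner join over inputs sorted on the key.

    Simplified: assumes key values are unique on the right input.
    """
    left = sorted(left_rows, key=lambda r: r[key])
    right = sorted(right_rows, key=lambda r: r[key])
    out, j = [], 0
    for row in left:
        # Advance the right-hand cursor until it reaches or passes the left key.
        while j < len(right) and right[j][key] < row[key]:
            j += 1
        if j < len(right) and right[j][key] == row[key]:
            out.append({**row, **right[j]})
    return out
```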

5.1.2 Aggregate Stage


The purpose of the Aggregator stage is to perform data aggregations. In order to do this, it is
necessary to understand the key columns that define the aggregation groups, the columns to be
aggregated and the kind of aggregation. Common aggregation functions include:

• Count
• Sum
• Mean
• Min / Max

Several others are available to process business logic; however, it is most likely that
aggregations will be used as part of a calculation to determine the number of rows in an output
table for inclusion in header and footer records for unload files.
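
As an illustration only, the grouping functions listed above and the row count for a trailer
record can be sketched in Python. The trailer layout used here (`T` followed by a nine-digit
zero-padded count) is a hypothetical example, not the project's unload format.

```python
from collections import defaultdict


def aggregate(rows, key, value):
    """Group rows by `key` and compute count, sum, mean, min and max of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"count": len(v), "sum": sum(v), "mean": sum(v) / len(v),
            "min": min(v), "max": max(v)}
        for k, v in groups.items()
    }


def trailer_record(rows):
    """Hypothetical trailer layout: 'T' followed by a 9-digit zero-padded row count."""
    return f"T{len(rows):09d}"
```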

5.1.3 The Funnel Stage


The Funnel stage requires all input links to have identical schemas (column names, types and
attributes, including nullability). The single output link matches the input schema.
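
The schema requirement can be pictured with a short Python sketch. Representing a schema as a
tuple of (column name, type name, nullable) triples is an assumption made for illustration.

```python
def funnel(*links):
    """Combine several input links into one output link.

    Each link is a (schema, rows) pair; a schema is a tuple of
    (column_name, type_name, nullable) triples. All schemas must be identical.
    """
    if not links:
        raise ValueError("at least one input link is required")
    first_schema = links[0][0]
    for schema, _rows in links[1:]:
        if schema != first_schema:
            raise ValueError("input links have different schemas")
    # The output carries the shared schema and every row from every input.
    combined = [row for _schema, rows in links for row in rows]
    return first_schema, combined
```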

5.2 Sorting
There are two options for sorting data within a job: either on the input properties page of many
stages (a simple sort), or using the explicit Sort stage. The explicit Sort stage has additional
properties, such as the ability to generate a key change column and to specify the memory usage
of the stage.
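
What a key change column contains can be shown with a minimal Python sketch: after sorting, the
flag is 1 on the first row of each key group and 0 otherwise. The column name `key_change` is
chosen for illustration.

```python
def sort_with_key_change(rows, key):
    """Sort rows on `key` and add a `key_change` flag (1 on the first row of each group)."""
    ordered = sorted(rows, key=lambda r: r[key])
    out = []
    previous = object()  # sentinel that never equals a real key value
    for row in ordered:
        out.append({**row, "key_change": 1 if row[key] != previous else 0})
        previous = row[key]
    return out
```

Downstream logic can then detect group boundaries by testing the flag rather than comparing
each row with its predecessor.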

5.3 Data Manipulation

5.3.1 Transformer Usage Guidelines

5.3.1.1 Choosing Appropriate Stages


The parallel Transformer stage always generates "C" code which is then compiled to a parallel
component. For this reason, it is important to minimize the number of transformers, and to use
other stages (Copy, Filter, Switch, Modify etc) when derivations are not needed.
Optimize the overall job flow design to combine derivations from multiple Transformers into a
single Transformer stage when possible.

5.3.1.2 Transformer NULL Handling and Reject Link


When evaluating expressions for output derivations or link constraints, the Transformer will
reject (through the reject link indicated by a dashed line) any row that has a NULL value used in
the expression. To create a Transformer reject link in DataStage Designer, right-click on an
output link and choose "Convert to Reject".

The Transformer rejects NULL derivation results because the rules for arithmetic and string
handling of NULL values are by definition undefined. For this reason, always test for null values
before using a column in an expression, for example:

If ISNULL(link.col) Then… Else…
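
The effect of testing for nulls first can be illustrated in Python, with `None` standing in for
NULL. This is a sketch of the behaviour described above, not of the Transformer itself; the
column names `qty` and `price` and the default value are assumptions.

```python
def derive_total(rows):
    """Unguarded derivation: a row with a null input is sent to the reject list."""
    output, rejects = [], []
    for row in rows:
        if row["qty"] is None or row["price"] is None:
            rejects.append(row)  # a null used in the expression rejects the whole row
        else:
            output.append({**row, "total": row["qty"] * row["price"]})
    return output, rejects


def derive_total_guarded(rows, default=0):
    """Guarded derivation: test for null first and substitute a default value."""
    output = []
    for row in rows:
        qty = default if row["qty"] is None else row["qty"]
        price = default if row["price"] is None else row["price"]
        output.append({**row, "total": qty * price})
    return output
```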

5.3.1.3 Transformer Derivation Evaluation


Output derivations are evaluated BEFORE any type conversions on the assignment. For
example, the PadString function uses the length of the source type, not the target. Therefore, it
is important to make sure the type conversion is done before a row reaches the Transformer.

For example, TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the
incoming column must be type VarChar before it is evaluated in the Transformer.

5.3.1.4 Optimizing Transformer Expressions and Stage Variables


In order to write efficient Transformer stage derivations, it is useful to understand what items are
evaluated and when. The evaluation sequence is as follows:

• Evaluate each stage variable initial value
• For each input row to process:
    o Evaluate each stage variable derivation value, unless the derivation is empty
    o For each output link:
        1. Evaluate each column derivation value
        2. Write the output record
    o Next output link
• Next input row

The stage variables and the columns within a link are evaluated in the order in which they are
displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in
which they are displayed.
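
The evaluation sequence can be mirrored in a toy Python model. The function and its argument
shapes are assumptions made for illustration; only the ordering reflects the description above
(Python dictionaries preserve insertion order, which stands in for display order in the editor).

```python
def run_transformer(rows, initial_values, stage_var_derivations, output_links):
    """Evaluate a toy Transformer in the order described above.

    initial_values:        {var_name: zero-argument callable}, evaluated once
    stage_var_derivations: ordered {var_name: callable(row, vars), or None if empty}
    output_links:          ordered {link_name: ordered {column: callable(row, vars)}}
    """
    # Stage variable initial values are evaluated once, before any row is read.
    stage_vars = {name: init() for name, init in initial_values.items()}
    written = {link: [] for link in output_links}
    for row in rows:
        # Stage variable derivations run once per input row, in order.
        for name, derivation in stage_var_derivations.items():
            if derivation is not None:  # an empty derivation keeps the current value
                stage_vars[name] = derivation(row, stage_vars)
        # Column derivations run once per output link, then the record is written.
        for link, columns in output_links.items():
            record = {col: fn(row, stage_vars) for col, fn in columns.items()}
            written[link].append(record)
    return written
```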

From this sequence, it can be seen that there are certain constructs that will be inefficient to
include in output column derivations, as they will be evaluated once for every output column that
uses them. Such constructs are:

Where the same part of an expression is used in multiple column derivations

For example, suppose multiple columns in output links want to use the same substring of an
input column, then the following test may appear in a number of output column derivations:

IF (DSLINK1.col[1,3] = "001") THEN ...

In this case, the substring DSLINK1.col[1,3] is evaluated for each column that uses it.

This can be made more efficient by moving the substring calculation into a stage variable. By
doing this, the substring is evaluated just once for every input row. In this case, the stage
variable definition will be:

DSLINK1.col[1,3]

and each column derivation will start with:

IF (StageVar1 = "001") THEN ...

This example could be improved further by also moving the string comparison into the stage
variable. The stage variable will be:

IF (DSLINK1.col[1,3] = "001") THEN 1 ELSE 0

and each column derivation will start with:

IF (StageVar1) THEN ...

This reduces both the number of substring functions evaluated and string comparisons made in
the Transformer.
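
The saving can be made concrete by counting evaluations in a small Python sketch. The counter
and function name are illustrative; `row["col"][:3]` plays the role of the three-character
substring in the example above.

```python
def count_prefix_evaluations(rows, n_columns, use_stage_variable):
    """Count how often the 3-character prefix is computed for `n_columns` output columns."""
    evaluations = 0

    def prefix(row):
        nonlocal evaluations
        evaluations += 1
        return row["col"][:3]

    for row in rows:
        if use_stage_variable:
            # Substring and comparison are done once per input row.
            is_001 = prefix(row) == "001"
            record = ["Y" if is_001 else "N" for _ in range(n_columns)]
        else:
            # Substring and comparison are repeated for every output column.
            record = ["Y" if prefix(row) == "001" else "N" for _ in range(n_columns)]
    return evaluations
```

With 4 input rows and 5 output columns, the unoptimised form computes the substring 20 times,
while the stage variable form computes it 4 times.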

Where an expression includes calculated constant values

For example, a column definition may include a function call that returns a constant value, such
as:

Str(" ",20)

This returns a string of 20 spaces. In this case, the function will be evaluated every time the
column derivation is evaluated. It will be more efficient to calculate the constant value just once
for the whole Transformer.

This can be achieved using stage variables. This function could be moved into a stage variable
derivation. However in this case, the function will still be evaluated once for every input row.
The solution here is to move the function evaluation into the initial value of a stage variable.

A stage variable can be assigned an initial value from the Stage Properties dialog/Variables tab
in the Transformer stage editor. In this case, the variable will have its initial value set to:

Str(" ",20)

You will then leave the derivation of the stage variable on the main Transformer page empty.
Any expression that previously used this function will be changed to use the stage variable
instead.
The initial value of the stage variable is evaluated just once, before any input rows are
processed. Then, because the derivation expression of the stage variable is empty, it is not
re-evaluated for each input row. Therefore, its value remains unchanged from the initial value
for the whole of the Transformer processing.

In addition to a function call returning a constant value, another example would be part of an
expression such as:

"abc" : "def"

As with the function call example, this concatenation is evaluated every time the column
derivation is evaluated. Since the subpart of the expression is actually constant, this constant
part of the expression could again be moved into a stage variable, using the initial value setting
to perform the concatenation just once.

Where an expression requiring a type conversion is used as a constant, or it is used in multiple places

For example, an expression may include something like this:

DSLink1.col1+"1"

In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it
must be converted from a string to an integer each time the expression is evaluated. The
solution in this case is just to change the constant from a string to an integer:

DSLink1.col1+1

In this example, if DSLink1.col1 were a string field, then a conversion will be required every
time the expression is evaluated. If this just appeared once in one output column expression,
this will be fine. However, if an input column is used in more than one expression, where it
requires the same type conversion in each expression, it will be more efficient to use a stage
variable to perform the conversion once. In this case, you will create, for example, an integer
stage variable, specify its derivation to be DSLink1.col1, and then use the stage variable in
place of DSLink1.col1 wherever that conversion would have been required.

It should be noted that when using stage variables to evaluate parts of expressions, the data
type of the stage variable should be set correctly for that context, otherwise needless
conversions are required wherever that variable is used.

5.3.2 Modify Stage


The Modify stage is the most efficient stage available. Transformations that touch a single field,
such as keep/drop, type conversions, some string manipulations, and null handling, are the
primary operations which should be implemented using Modify instead of a Transformer stage.

5.4 Transitioning Data

5.4.1 External Data


The External Source stage is a file stage which is used to read data that is output from one or
more source programs. The stage calls the program and passes appropriate arguments. The
stage can have a single output link, and a single rejects link. It can be configured to execute in
parallel or sequential mode. This stage will be typically used in the ‘Import’ jobs to import the
External Data to parallel datasets to be processed by further ‘Transformation’ jobs.
5.4.2 Parallel Dataset
The Data Set stage is used to read data from or write data to a data set. The stage can have a
single input link or a single output link, and can be configured to execute in parallel or
sequential mode. DataStage Parallel Extender jobs use data sets to manage data within a job.
The Data Set stage can store data being operated on in a persistent form, which can then be
used by other DataStage jobs. Data sets are operating system files, each referred to by a
control file, which by convention has the suffix .ds. Using data sets wisely can be key to good
performance in a set of linked jobs.

In this project, parallel datasets will be created from the external data by the ‘Import’ jobs,
and whenever intermediate data needs to be passed on to one or more further jobs for
processing. Because datasets are written and read in parallel, the risk of bottlenecks during
dataset creation is greatly reduced.

5.5 Unit Test

5.5.1 Copy Stage


The Copy stage copies a single input data set to a number of output data sets. Each record of
the input data set is copied to every output data set. Records can be copied without modification,
or columns can be dropped or changed (for more extensive modification, for example changing
column data types, use the Modify stage). This stage is commonly used for debugging/testing
purposes, where a copy of the data flowing from a particular stage can be isolated from the flow
and analysed.

5.5.2 Peek Stage


The Peek stage can print record column values either to the job log or to a separate output link
as the stage copies records from its input data set to one or more output data sets. This stage is
used during unit testing when only specific columns need to be inspected to validate whether
the preceding transformation logic is working as desired.

5.5.3 Row Generator


The Row Generator stage is a Development/Debug stage that has no input links, and a single
output link. The Row Generator stage produces a set of mock data fitting the specified
metadata. This is useful where you want to test your job but have no real data available, for
example because the source file, or a dataset produced by another job that is still under
development, does not yet exist.
Further details can be specified for each data type, if required, to shape the data being
generated. For example:
 Type ‘Cycle’, specifying the ‘Increment’ value required
 Type ‘Random’, specifying the percentage of invalid/zero data required.
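
The ‘Cycle’ and ‘Random’ options described above can be illustrated outside DataStage with a small sketch. This is not DataStage code: the function names, the column names (account_id, balance) and the wrap-around limit are hypothetical and chosen only to show how an increment and a zero-percentage shape the generated data.

```python
import random


def cycle_column(initial, increment, limit):
    """Yield initial, initial + increment, ... and wrap back to initial once limit is passed."""
    value = initial
    while True:
        yield value
        value += increment
        if value > limit:
            value = initial


def random_column(low, high, percent_zero, rng):
    """Yield random integers in [low, high]; roughly percent_zero % of the values are forced to 0."""
    while True:
        if rng.random() * 100 < percent_zero:
            yield 0
        else:
            yield rng.randint(low, high)


def generate_rows(n, seed=0):
    """Build n mock rows: a cycling key column and a random amount column."""
    rng = random.Random(seed)  # fixed seed so test data is repeatable
    ids = cycle_column(initial=1, increment=2, limit=9)
    balances = random_column(100, 999, percent_zero=20, rng=rng)
    return [{"account_id": next(ids), "balance": next(balances)} for _ in range(n)]
```

Calling generate_rows(7) gives account_id values that step by the increment and restart after the limit, while balance is either zero or a value in the requested range.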

5.5.4 Column Generator


The Column Generator stage is a Development/Debug stage that can have a single input link
and a single output link. The Column Generator stage adds columns to incoming data and
generates mock data for these columns for each data row processed. The new data set is then
output. This is used where not all the columns’ real data is available for testing. Those columns
need to be inserted with mock data fitting the specified metadata.

5.5.5 Manual XLS Generation


In addition to the ‘Row Generator’ and ‘Column Generator’ methods DataStage provides, mock data can
also be created manually in an XLS file and then saved as a CSV file to be given as input to the
DataStage job where this test data is required. These methods of data generation will be used extensively
during Unit testing.

6. GUI STANDARDS
Job Description Fields – the description annotation is mandatory for each job. Note that the
description annotation updates the job short description.

The full description should include the job version number, developer name, date and a brief
reference to the design document including the version number the job has been coded up to,
plus the main job annotation and any modifications to the job. Where the job has not yet entered
Version Control, the initial version should be referred to as 0.1.

When using DataStage Version Control, the Full Description field in job properties is also used
by DS Version control to append revision history. This is packaged and maintained with the job
and will be visible when the jobs are deployed to test, promo and production. It does not stop
developers from using Full Description as a method of maintaining the relevant documentation,
but information maintained by the developer will get appended to by the Version Control tool.

Naming conventions must be enforced on links, transforms and source and target files.
Annotations are also used to further describe the functionality of jobs and stages.

Two types of annotation, a blue job description (description annotation) and a yellow operator
specific description (standard annotation) are used. The detailed description is also updated
automatically by the DataStage Version Control process following the first initialization into
Version Control.

Entries put in the detailed description by Version Control must not be modified manually.

Standard description annotations should be used on every non-trivial stage.

7. DATASTAGE NAMING STANDARDS


Object Type                  Syntax

Category                     Import/transform/unload
Job                          jb_fdXX_<im/tr/ul>_<JobName>
                             Where XX is 01, 02 … 13, indicating the FD name.
                             <im> indicates Import Job
                             <tr> indicates Transform Job
                             <ul> indicates Unload Job
Job Sequence                 js_<fdXX>_<im/tr/ul>_<file/detail>
                             Where XX is 01, 02 … 13, indicating the FD name.
                             <im> indicates Import Job Sequence
                             <tr> indicates Transform Job Sequence
                             <ul> indicates Unload Job Sequence
Source Definition Category   source
Target Definition Category   target
Link*                        lnk_<StageName>_<rej/njn/jn>
                             lnk_<StageName>
                             <StageName> is the name of the stage from which the link is coming out.
                             <rej/njn/jn> indicates the type of link: rej=reject, njn=non join, jn=join.
                             If not applicable then this will be dropped.

Parallel Job FILE Stages
Data Set                     ds_<Dataset Name>
Sequential File              sq_<Sequential file name>
File Set                     fs_<File Set name>
Lookup File Set              lfs_<Lookup file set name>
External Source              esrc_<External Source name>
External Target              etrg_<External Target name>
Complex Flat File            cff_<Complex Flat File name>

Parallel Job Processing Stages
Transformer                  tr_<Purpose>
BASIC Transformer            btr_<Purpose>
Aggregator                   agg_<Purpose>
Join                         jn_<Purpose>
Merge                        mrg_<Purpose>
Lookup                       lkp_<Purpose>
Sort                         srt_<Purpose>
Funnel                       fnl_<Purpose>
Remove Duplicates            rdup_<Purpose>
Compress                     cps_<Purpose>
Expand                       exp_<Purpose>
Copy                         cp_<Purpose>
Modify                       md_<Purpose>
Filter                       flt_<Purpose>
External Filter              sflt_<Purpose>
Change Capture               ccap_<Purpose>
Change Apply                 capp_<Purpose>
Difference                   diff_<Purpose>
Compare                      cmp_<Purpose>
Encode                       enc_<Purpose>
Decode                       dec_<Purpose>
Switch                       cwt_<Purpose>
Generic                      gen_<Purpose>
Surrogate Key                sur_<Target Column Name>

Parallel Job RESTRUCTURE Stages
Column Import                ci_<Purpose>
Column Export                ce_<Purpose>
Make Subrecord               msub_<Purpose>
Split Subrecord              ssub_<Purpose>
Combine Records              crec_<Purpose>
Promote Subrecord            prec_<Purpose>
Make Vector                  mkv_<Purpose>
Split Vector                 splv_<Purpose>

Containers
Local Container              lc_<functionality>
Shared Container             sc_<functionality>

Others
Stage Variable               s_<StageVariableName>
Sequence Generator           seq_<Target Column Name>

Job Sequences Stages
Job Activity                 ja_<job name without jb and fd#>
Execute Command              ex_<Script function>_<file/detail>
Sequencer                    sq_<Purpose>
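
The job and job sequence patterns in the table above lend themselves to an automated check. The following is a minimal sketch only: the helper names are hypothetical, and the assumption that <JobName> and <file/detail> consist of letters, digits and underscores is not stated in the standard.

```python
import re

# fdXX where XX is 01..13; job type is im, tr or ul.
# Assumption: the trailing name part uses letters, digits and underscores only.
_SUFFIX = r"fd(0[1-9]|1[0-3])_(im|tr|ul)_[A-Za-z0-9_]+"
JOB_NAME = re.compile(rf"^jb_{_SUFFIX}$")
SEQUENCE_NAME = re.compile(rf"^js_{_SUFFIX}$")

JOB_TYPES = {"im": "Import", "tr": "Transform", "ul": "Unload"}


def is_valid_job_name(name):
    """True if name follows jb_fdXX_<im/tr/ul>_<JobName>."""
    return JOB_NAME.match(name) is not None


def is_valid_sequence_name(name):
    """True if name follows js_<fdXX>_<im/tr/ul>_<file/detail>."""
    return SEQUENCE_NAME.match(name) is not None


def job_type(name):
    """Return 'Import', 'Transform' or 'Unload' for a valid job or sequence name, else None."""
    match = JOB_NAME.match(name) or SEQUENCE_NAME.match(name)
    return JOB_TYPES[match.group(2)] if match else None
```

A check of this kind could be run over an exported list of job names before promotion; it does not replace review of stage and link names.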

8. RUNTIME COLUMN PROPAGATION (RCP)
One of the aims and benefits of RCP is to enable jobs whose metadata is variable and is
determined at run time. An example would be a generic job that reads a flat file and stores the
data in a Dataset, where the file name itself is a job parameter. In this case it is not possible to
determine the column definitions at build time.
Conversely, RCP sometimes confuses developers: in jobs where RCP is not wanted but the
feature is switched on, additional columns that the developer thought had been dropped can
appear in the output dataset.
For these reasons developers must turn off RCP within each job unless the feature is explicitly
required in the job, as in the above example. In any event, RCP should be enabled within the
Project Properties (providing the flexibility to use RCP at job level) so that, where RCP is
required, it can be turned on at job / stage level. An annotation on the job should make this
clear.

9. STANDARDISED REJECT HANDLING

9.1 Reject Components


There is a requirement to set up a standard approach to reject handling.

The standardisation of reject capture allows operational support to easily:

 locate the rejection message and understand the format of the message
 locate and diagnose the reason for rejections
 set tolerances to the numbers of rejects permitted
 re-process rejected rows.

Reject processing is not provided as standard across the majority of stages within DataStage
Enterprise (Parallel). There is a reject link on the Lookup stage. However, a standard approach
must be introduced for the remaining stages and adopted across all stages.

This will be achieved by the introduction of a bespoke element (in the form of example stages
within template jobs) and through the use of a standardised reject component made available to
all developers via a DataStage wrapper.

These components are shown in the following diagram:

[Figure: Standardised Reject Component. Records to be processed and a lookup dataset enter the
example job stages (Join with Reject Handling, Lookup, Transform etc.) along the driving data
flow, producing data held for a future job. Rejected rows pass to the bespoke reject component,
which holds the data for a future job and passes a key (Address Key, Card Key, Customer Key or
Account Key) and a standard reject message to the standardised reject component.]

All stages (where a row might be rejected) must include a reject link. Three such stages are
shown in the diagram (i.e. Join, Lookup and Transform). In the example above, the Lookup
stage is shown with a reject link, though this is just as applicable to Join, Transform and other
stages. For instance, data flowing down the reject link from a Lookup or Join stage might result
from an inability to match keys and from a Transform stage from the validation of data items, for
instance an unexpected value or null might be encountered. In each of these cases, the
rejected row is passed down a reject link to a bespoke component that:

1. Passes the row to a dataset in order to facilitate the re-processing of the rejected rows
2. Identifies the key of the rejected row and passes this down the relevant link (depending
on the key type) to the standardised reject handling component. Where there is no key,
i.e. a file is empty or there is a mismatch between the number of rows read and the
information provided on the footer record, zeros are passed down all links intended for
key information
3. Compiles and passes a standard message (see table below) describing the rejection to
the standardised reject handling component.

This approach assumes that a key uniquely identifying each failing row is present on driving
flows.

The standardised reject component takes two inputs (over a possible five input links) and
creates a surrogate key, uniquely defining each reject and writes the message along with the
two keys to a dataset. This reject dataset therefore holds the key from the rejected row (that
can be used to cross reference to the dataset of rejected rows) and a message that will help
identify the reason for the rejection.
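
The behaviour of the standardised reject component can be sketched as follows. This is an illustration under stated assumptions rather than the component itself: the class and field names are hypothetical, the in-memory list stands in for the reject dataset, and the surrogate key is assumed to be a simple running number.

```python
from dataclasses import dataclass
from itertools import count

# The four key links named in the diagram; the fifth input link carries the message.
KEY_TYPES = ("address", "card", "customer", "account")


@dataclass(frozen=True)
class RejectRecord:
    surrogate_key: int  # uniquely identifies the reject
    key_type: str       # which key link the row key arrived on
    row_key: int        # key of the rejected row; 0 where there is no key
    message: str        # standard reject message


class RejectComponent:
    """Collects rejects; 'records' stands in for the reject dataset."""

    def __init__(self):
        self._next_key = count(1)
        self.records = []

    def reject(self, key_type, row_key, message):
        if key_type not in KEY_TYPES:
            raise ValueError(f"unknown key type: {key_type}")
        record = RejectRecord(next(self._next_key), key_type, row_key, message)
        self.records.append(record)
        return record
```

Each stored record carries both keys, so the row key can be used to cross reference the dataset of rejected rows while the surrogate key identifies the reject itself.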

Reject datasets are written automatically to date-stamped paths within a common reject and log
directory. Reject datasets are uniquely named and created each time the module runs (see
below).

The reject component will be used with every stage which can fail due to data discrepancies
(e.g. join, and lookup). The Join stage requires further processing whilst the error link from the
lookup stage can be linked directly to the custom error component.

In order to facilitate reject handling within the Join stage, further processing is required. This
processing requirement is shown in the following diagram:

Join with Reject Handling

[Figure: Join with Reject Handling. The secondary data source passes through a Column
Generator stage and a Modify stage into a Left Outer Join with the primary data source (the
driving data flow). A Transform stage then directs rows to the main output link or to the reject
handling link.]

 The Column Generator stage adds an additional 'row-exists' column to the output link. This
will be a single character column to which a default value will be assigned by the following
stage.
 The Modify stage assigns a standard value of 'Y' to the 'row-exists' column just created. All
rows flowing through this stage will be assigned the same value.
 The join stage performs the required join as normal.
 The transform stage tests for a successful join. If row-exists='Y' then the row is directed
down the main output link. Otherwise (where 'row-exists' is NULL) the row is directed down
the reject handling link.

Standard Reject Handling

This component will be made available to all developers for use in reject handling as a job
template.
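
The 'row-exists' technique for Join reject handling can be sketched in a few lines. This is an illustration only, not DataStage code: the function and column names are hypothetical, rows are plain dictionaries, and the secondary source is assumed to hold at most one row per key.

```python
def join_with_reject(primary_rows, secondary_rows, key):
    """Left outer join that routes unmatched primary rows to a reject list."""
    # Column Generator + Modify: tag every secondary row with row_exists = 'Y'.
    tagged = {row[key]: {**row, "row_exists": "Y"} for row in secondary_rows}

    main, rejects = [], []
    for row in primary_rows:
        match = tagged.get(row[key])
        # Left outer join: an unmatched primary row gets row_exists = None (NULL).
        joined = {**row, **(match if match is not None else {"row_exists": None})}
        # Transform: route on the flag.
        if joined["row_exists"] == "Y":
            main.append(joined)
        else:
            rejects.append(joined)
    return main, rejects
```

Because the flag is set only on the secondary side, a NULL after the join can only mean that no secondary row matched the driving row.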

9.2 Customised Reject Messages


The creation of a list of standard error conditions limits the number of exceptions an operator
will see, allowing errors to be quickly identified and resolved. Developers are limited to using the
messages specified below, thus preventing the creation of random error messages.

The following reject messages / conditions will be used:

Reject Message          Description

Lookup / Join Failure   For all referential integrity checking and any other critical Lookups /
                        Joins. Keys have not been matched between input links on a Join or
                        Lookup stage. The job and stage name must be included in the message.

Row Count Mismatch      The number of records processed does not match the number of records
                        described in the footer record. The job and stage name must be included
                        in the message. This message is particular to import jobs where the
                        input file is validated against the footer record.

Empty File              The input file is empty. This message is particular to import jobs where
                        the input file is validated against the footer record.

Invalid Field           A field has been identified as containing invalid data. The job, field and
                        stage name must be included in the message.

Null Field              A Not Null field has been identified as containing null values. The job,
                        field and stage name must be included in the message.

Developers must intercept rejects in the code they generate and generate a standard reject
message that contains accurate data and relevant information from the record. The job, field
and stage names must be inserted into the message.
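
One way to keep developers to the fixed list of messages is to build every message through a single helper. The sketch below is hypothetical: the five message texts come from the table above, but the helper name and the layout of the final string (message; job=…; stage=…; field=…) are assumptions, since the standard does not define the exact format.

```python
from enum import Enum


class RejectMessage(Enum):
    LOOKUP_JOIN_FAILURE = "Lookup / Join Failure"
    ROW_COUNT_MISMATCH = "Row Count Mismatch"
    EMPTY_FILE = "Empty File"
    INVALID_FIELD = "Invalid Field"
    NULL_FIELD = "Null Field"


# Messages for which the field name must also be included.
FIELD_REQUIRED = {RejectMessage.INVALID_FIELD, RejectMessage.NULL_FIELD}


def build_reject_message(kind, job, stage, field=None):
    """Return a standard reject message carrying the job, stage and (where needed) field name."""
    if kind in FIELD_REQUIRED and not field:
        raise ValueError(f"{kind.value} requires a field name")
    parts = [kind.value, f"job={job}", f"stage={stage}"]
    if field:
        parts.append(f"field={field}")
    return "; ".join(parts)
```

For example, build_reject_message(RejectMessage.NULL_FIELD, "jb_fd01_tr_Accounts", "tr_Validate", "ACC_NUM") produces a message that names the job, the stage and the offending field.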

A description of rejects and messages should be made available to operational support to help
diagnose problems encountered when running the batch.

9.3 Reject Limit


A Reject Limit parameter is included in all jobs. This is used by the standardised reject
processing wrapper to test against the total number of errors for a module. On meeting the
reject limit, the job and hence the processing for any given module is terminated.

The reject limit can be set between 0 and 99. A reject limit of 0 (zero) will ABORT ON FIRST
REJECT, whilst a reject limit of 99 will NEVER ABORT (on reject).
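
The two special values can be captured in a small decision function. This is a sketch under an explicit assumption: for limits between 1 and 98 the text does not say whether the job stops when the count equals or exceeds the limit, so the version below assumes termination once the number of rejects exceeds the limit, which is consistent with a limit of 0 aborting on the first reject. The function name is hypothetical.

```python
NEVER_ABORT = 99  # reject limit value meaning "never abort on reject"


def should_abort(reject_count, reject_limit):
    """Decide whether processing for a module should be terminated."""
    if not 0 <= reject_limit <= 99:
        raise ValueError("reject limit must be between 0 and 99")
    if reject_limit == NEVER_ABORT:
        return False
    # Assumption: abort once the count exceeds the limit (0 => abort on first reject).
    return reject_count > reject_limit
```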

This allows central control of the level of rejects allowed across all modules and jobs used in
the Dummy batch.

9.4 Before Routine


The before routine for the first job in a sequence of jobs that implement a module (or for a single
job where there is only one job in a module) will be used to interrogate and increment the number
that uniquely identifies the datasets created from the processing of rejects for a particular
module.

9.5 Notifications
A notification is the method by which:
 operations are informed of a reject i.e. in-line notifications
 rejects are communicated between functional streams and / or retained to support the
re-running of modules i.e. cross functional notifications.
These are described in the following sections.

9.5.1 In-line Notification of Rejects


In-line notifications are those resulting from rejects within a functional processing stream. The
last activity within a module will be to email notification of rejects within a module to operations.
This will be achieved by using the Notification stage. A template job will be provided that
includes the Notification stage and job parameters that can be tailored such that the names and
paths of the reject datasets can be interrogated and the relevant notifications made.

9.5.2 Cross Functional Notification of Rejects
This type of notification is the means by which rejects are communicated between functional
streams. This ‘communication’ is built around the feedback from the load process, prompting a
rerun and between migration steps (i.e. T14 to T) and an understanding of the dependencies
between functional areas i.e. transactions being dependent on accounts etc. In this instance,
rejected accounts will be incorporated into the transaction processing process, therefore limiting
those transactions processed to those where an account had also been successfully processed.

10. ENVIRONMENT

10.1 Default Environment Variables Standards


DataStage Enterprise Edition allows project / job tuning by means of Environment variables.
These include the settings of the default node configuration file, and error log activities.

The following DataStage Environment variables must exist in all jobs. (Note that DataStage
Environment Variables are different to Standard Parameters)

The template job already has these parameters defined:

 $APT_CONFIGX_FILE = /DataStage/Product/Ascential/DataStage/Configx1.apt (default value
used in every job)
/DataStage/Product/Ascential/DataStage/Configx4node.apt (value overwritten for testing on
extra nodes in individual jobs)
 $APT_DUMP_SCORE = false

10.2 Job Parameter File Standards


A generic parameter file, which stores all the default job parameter values including user names
and login details, will be used in conjunction with the before job routine “SetDSParamsFromFile”.
This will allow project wide settings to be changed once, and avoid unnecessary parameter
duplication. The path to this parameter file will be /DataStage/Parameters/<project name> and
its name will be parameters.lst

10.3 Directory Path Parameters


The following parameters must exist in all jobs.

The template job, held under the Users/Template DataStage will have these parameters
defined.

 pDSPATH = /XX/XX/Dummy (DataStage Datasets top level development directory – there will
be equivalents for testing and Live).
o XX – Base Dummy directory as set by DMCoE.
 pITERATION = n (where n is the migration iteration i.e. from 1 to 9)
 pRUNNUMBER = n (where n is the run number within the iteration starting from 1)

10.4 Default Directory Path Parameters


The following parameters must exist in all jobs (the template job has these parameters defined):

 pDSPATH = /DataStage/Datasets/DummyDev (DataStage Datasets top level directory)
 pITERATION = 1
 pRUNNUMBER = 1

10.5 Directory & Dataset naming standards
UNIX directory paths are set using the following convention, based on the parameters defined
above. Note the final subdirectories (i.e. “Deliver” and “Internal”) are hard coded in the jobs.

This is fine because if the developer mistypes the value the job will fail immediately as the
mistyped directory will not exist.
10.5.1 Functional Area Input Files
Source files will be pushed by the Extract system to a holding area ‘Hold’ on the ETL server via
Connect:Direct software.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Hold/<source_file_name>
10.5.2 Functional Area Output Tables
Datasets that are defined in the Detailed Design as output tables for a functional area are stored
in a “Product” directory. This is the directory that downstream Functional Areas (including the
Unload process) will go to find input tables from previous areas.

#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Product/<datasetname>.ds
10.5.3 Functional Area Staging Tables
Datasets that are defined in the Detailed Design as staging tables within an area are stored in a
“Staging” directory. This is the directory that other modules within the same Functional Area will
go to find staging tables from previous modules.

#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Staging/<datasetname>.ds
10.5.4 Internal Module Tables
Datasets produced within a module and used only internally within that module will be stored in
an “Internal” directory. Datasets in this directory are only used within jobs.

#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Internal/<datasetname>.ds
10.5.5 Datasets Produced from Import Processing
Datasets that are produced by Pre-Processing are stored in a “Source” directory. This is the
directory that Functional Areas will go to find input tables from the source.

#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Source/<datasetname>.ds

Reference Datasets that are produced by Import Processing are stored in a “Reference”
directory. Reference data is not split into iterations.

#pDSPATH#/Reference/<datasetname>.ds
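
The directory conventions in sections 10.5.1 to 10.5.5 can be summarised in a single path-building sketch. This is illustrative only: in the jobs themselves the paths are assembled from the job parameters and hard-coded subdirectories, and the helper name below is hypothetical.

```python
from pathlib import PurePosixPath

# Areas that sit under #pDSPATH#/#pITERATION#/#pRUNNUMBER#/
RUN_SCOPED_AREAS = {"Hold", "Product", "Staging", "Internal", "Source"}


def dataset_path(ds_path, iteration, run_number, area, name):
    """Build the UNIX path for a source file or dataset following section 10.5."""
    if area == "Reference":
        # Reference data is not split into iterations.
        base = PurePosixPath(ds_path) / "Reference"
    elif area in RUN_SCOPED_AREAS:
        base = PurePosixPath(ds_path) / str(iteration) / str(run_number) / area
    else:
        raise ValueError(f"unknown area: {area}")
    # Hold contains the source files as delivered; every other area holds .ds datasets.
    filename = name if area == "Hold" else f"{name}.ds"
    return str(base / filename)
```

With the default parameter values from section 10.4, a Product dataset called accounts resolves to /DataStage/Datasets/DummyDev/1/1/Product/accounts.ds.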

11. METADATA MANAGEMENT


Metadata consists of record formats for all external files (flat files) and internal files (datasets)
processed by DataStage which are stored in the DataStage Repository (a Metadata repository).
Metadata is either created manually within stages (i.e. Flat File, Complex Flat File and Dataset)
or imported from sources such as COBOL copybooks.

There are two types of Metadata, described below:

11.1 Source and Target Metadata
Record formats will have been pre-defined within the DataStage Repository describing the
record formats of files that form inputs to import jobs and outputs from unload jobs. This
metadata will therefore only be used by import and unload jobs.

These record formats are for the convenience of developers (they are described in the FDs and
are therefore fixed) and help maintain consistency in the way data is interpreted across all jobs
(define once, use many times), with a positive impact on quality.

This metadata must not be changed by developers.

Should a change be required to this metadata, its potential impact on the jobs that use the
metadata should first be assessed, and the change should then be processed through standard
change control.

11.2 Internal Metadata


Developers will also create metadata describing the datasets that:

 pass data between jobs within a functional area
 pass data between jobs in different functional areas.

This metadata will define the outputs of import jobs, be used by all transform jobs and
define the inputs to unload jobs. It must be stored in the repository with a name that matches
the name of the dataset it describes.

Should it be necessary or more efficient to process data in a different way from the way it is
presented within the pre-defined metadata, developers may create a job specific version of the
metadata which must be clearly identified as a variant on the original and saved within the
repository.

12. STANDARD COMMON COMPONENTS


The use of Standard Components in developing DataStage jobs will:

 Increase quality of the code, since the most optimal method will be used for a function
which is to be achieved in multiple jobs
 Promote reuse, productivity is increased and developers can spend more time on tasks
which are specific to individual jobs
 Reduce the complexity of common tasks.

12.1 Job Templates


DataStage provides Intelligent Assistants which guide the developer through basic DataStage
tasks. The Intelligent Assistants are listed below:

 Create a template for a server or parallel job. This can be subsequently used to create
new jobs. New jobs will be copies of the original job
 Create a new job from a previously created template
 Create a simple parallel data migration job. This extracts data from a source and writes it
to a target

Not only will the use of templates help with standardisation, but templates also form reusable
components which need not be coded again. Certain elements are common to many jobs,
namely parameters, annotations, reject handling, etc., and these can be implemented through
the use of templates.

The Dummy project will have templates, each of which is a job whose stages follow the naming
standards. These template jobs will help developers to build new jobs in line with the standards
described in this document.
12.1.1 Import Jobs
Each source file will be read into persistent datasets by separate jobs called import jobs. These
jobs will perform sanity checks on the received file, e.g. that the data file is not empty and that
header and trailer details are consistent with the file properties. In the Dummy project the same
files are used repeatedly by different functionalities, so each file will be read only once to create
DataStage datasets. These datasets will then be used by the respective functionalities. Since the
logic for importing and validating files will be the same, one such job will be built and tested and
its architecture reused for the rest.

The files that are used in multiple instances are described below:

Common Source Files                        Functionalities in which the file is used

Import Account Selection File              FD01, FD09
Import Customer Selection File             FD01, FD05, FD06, FD12
Import ETL Customer Data File              FD01, FD03, FD05, FD09
Import ETL Address Data File               FD01, FD05
Import ETL Customer Pointer File           FD01, FD03, FD05, FD09
Import ETL DDA Account Data                FD01, FD02, FD03, FD09, FD11
Import ETL TDA Account Data                FD01, FD02, FD11
Import TAX Certification File              FD02, FD05
Import ETL Re-directions table load file   FD01, FD02, FD03, FD04, FD07, FD08, FD09, FD10, FD11, FD13

12.1.2 Transform Jobs


Dummy Transform jobs repeatedly perform joins on similar driver files with other data files.
Since this functionality is common, these processes will be developed once and will be copied in
respective occurrences. The table below identifies such occurrences:

Common Process                                 Functionality

Sort Code Lookup & Split data based on         FD02, FD03, FD04, FD07, FD08, FD10, FD11, FD13
processing centre

ETL Redirections Table Load file performs      FD01, FD02, FD03, FD04, FD07, FD08, FD09, FD10,
the same join with many different files,       FD11, FD13
i.e. Join based on S/C & Acc Num

ETL Customer Data File performs the same       FD01, FD03, FD05, FD09
join with many different files, i.e. Join
based on Customer Num

ETL Customer Pointer File performs the same    FD01, FD03, FD05, FD09
join with many different files, i.e. Join
based on Customer Num (to get details of
associated Account Numbers for each
customer)

Customer Selection File performs the same      FD01, FD06, FD12
join with many different files, i.e. Join
based on Customer Num

Account Selection File performs the same       FD01, FD09
join with many different files, i.e. Join
based on S/C & Acc Num

ETL Re-directions Table Load File JOIN WITH    FD02, FD11
ETL DDA Account Data

ETL Re-directions Table Load File JOIN WITH    FD02, FD11
ETL TDA Account Data

12.1.3 Unload Jobs


Dummy Unload jobs are tasked with creating output files in the format required by the load team.
These files will be mainly in mainframe format. Apart from creating files from persistent datasets,
these jobs will create header and trailer details within the file.

12.2 Containers
A container is a group of stages and links. Containers simplify and modularize server job
designs by replacing complex areas of the diagram with a single container stage. DataStage
provides two types of container:

 Local containers. These are created within a job and are only accessible by that job. A
local container is edited in a tabbed page of the job’s Diagram window. Local containers
can be used in server jobs or parallel jobs.
 Shared containers. These are created separately and are stored in the Repository in the
same way as other jobs. There are two types of shared container:
o Server shared containers are used in server jobs. They can also be used in
parallel jobs, though this can cause bottlenecks in processing as they are serial
only and should be avoided if possible
o Parallel shared container is used in parallel jobs.

You can also include server shared containers in parallel jobs as a way of incorporating server
job functionality into a parallel stage (for example, you could use one to make a server plug-in
stage available to a parallel job).

Containers are the means by which standard DataStage processes are captured and made
available to many users. They are used just as a developer would use a standard stage. Some
work needs to be done to identify opportunities for reuse within the overall design. However,
once identified, reusable components will be identified and delivered into the DataStage
repository as shared components.

Identified containers in Dummy transform project are described in the table below:

Container                  Functionalities Definition

                           Will act on the joins, lookups and active transformations to check
                           records eliminated in process and log them in a separate file. The
                           functionality needed is discussed in Reject Handling section 9.

Statistics Report logger   This component will log messages in a mentioned file. It will take
                           input as filename and message to be written.

13. DEBUGGING A JOB
Debugging essentially involves viewing the data in order to isolate the fault. The following
techniques will assist when debugging a job:

 Adding a peek stage will output certain rows to the job log
 Adding a filter to the start of the job to filter out all rows except the ones with the
attributes that the developer may wish to test or debug the behaviour on
 Adding an additional output to a transformer with the relevant constraints and storing the
data into a sequential file to be used as part of the investigation. The use of the copy
stage would also be an option
 A variant of the above would be to add a parameter pDEBUG with a value of 1 or 0 that
will be used as part of the constraint. The resulting debug sequential file would only
contain data when pDEBUG=1.

All changes to code made for debugging (including peeks, extra stages and extra parameters)
must be removed prior to final unit test. Final unit testing must occur on the exact version of
code that is to be promoted to Integration Test.

In processing hotspots (parts of a job which could potentially be an area of concern) it is
advisable to replace peeks with COPY stages before promoting the jobs to Integration Test,
instead of removing the stage completely. Removing and re-inserting peeks can become quite a
tedious task. The COPY stage is a no-op (no operation) stage, which means that there is no
processing cost to having a copy stage in a job design. While the job may appear overly
complex, this will not impact its processing times.

14. COMMON ISSUES AND TIPS


Common issues faced on the project during development and testing are described in this
section. Finally there is a tips section to assist developers while coding.

14.1 1-way / n-way


Scaling from 1-way to n-way processing is the method employed within DataStage to take
advantage of parallelism. This improves performance and should not affect or change the
function of the code.

In order to ensure trouble free scaling, jobs are built 1-way and unit tested 1-way and n-way.
This ensures that there has been no functional impact in making the switch to parallel
processing. Jobs will run n-way when live in order to achieve the benefits of parallel processing
provided by DataStage Enterprise.

Problems to do with scaling usually become evident when comparing record counts between
1-way and n-way runs. Clearly, these counts and the physical records involved should be the
same. If there is a difference, the reasons for this must be examined and corrected.

There are many possible reasons for variations in record counts, for instance:

 One of the most common causes is the use of the Join, Lookup and Merge stages (and
others). In these situations care must be taken to ensure that incoming data streams are
not only sorted but also partitioned the same way. If not, join conditions may not be met
because records with keys that would otherwise match 1-way are in different partitions
and therefore go unmatched. In these situations, records may be unnecessarily rejected
(either sent down a reject link or omitted altogether) and will therefore not flow down the
main output link to subsequent stages or into an output dataset, hence causing a variation
between the actual rows processed and the anticipated number
 a dataset used as input on the lookup link to a Lookup stage must be partitioned as
Entire, so that the entire dataset is available for lookup across all partitions of the main
input link to the stage; otherwise the lookup may fail simply because the dataset was
partitioned incorrectly for the lookup
 an incoming dataset may have been created by another job or module which may also
have been written by another developer. In this case it might contain the required data,
but may not be correctly partitioned for the needs of your job. Therefore good practice,
unless you can be absolutely sure that the datasets you are using are partitioned
correctly for your needs, is to repartition at the start of a job. This might be less efficient,
but more effective in terms of retaining control over your jobs and the quality of the
output data flows. Where possible, partitioning will be considered within the overall
solution design, therefore minimising the need for repartitioning.
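
The effect of mismatched partitioning on a join can be reproduced outside DataStage. The
following Python sketch is an illustration only: the function names and sample data are
hypothetical, and a simple modulus on an integer key stands in for key-based partitioning. Each
partition is joined independently, as each node would do.

```python
from collections import defaultdict


def key_partition(rows, key, nodes):
    """Place each row in the partition given by its integer key modulo the node count."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key] % nodes].append(row)
    return parts


def round_robin_partition(rows, nodes):
    """Deal rows out to partitions in turn, ignoring the key."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def partitioned_inner_join(left_parts, right_parts, key, nodes):
    """Join each partition on its own; rows in different partitions never meet."""
    joined = []
    for p in range(nodes):
        right_by_key = {r[key]: r for r in right_parts[p]}
        for left in left_parts[p]:
            if left[key] in right_by_key:
                joined.append({**left, **right_by_key[left[key]]})
    return joined


NODES = 4
accounts = [{"acct": n, "name": f"cust{n}"} for n in range(1, 9)]
balances = [{"acct": n, "bal": n * 100} for n in range(8, 0, -1)]

# 1-way run: everything is in a single partition.
one_way = partitioned_inner_join(
    key_partition(accounts, "acct", 1), key_partition(balances, "acct", 1), "acct", 1)
# n-way run, both inputs partitioned on the join key.
same_keys = partitioned_inner_join(
    key_partition(accounts, "acct", NODES), key_partition(balances, "acct", NODES),
    "acct", NODES)
# n-way run, second input partitioned without regard to the key.
mismatched = partitioned_inner_join(
    key_partition(accounts, "acct", NODES), round_robin_partition(balances, NODES),
    "acct", NODES)

print(len(one_way), len(same_keys), len(mismatched))
```

With this sample data the 1-way run and the consistently partitioned 4-way run both join all 8
rows, while the mismatched run joins only 4, which is the kind of record count difference
described above.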

Configuration files are provided for 1-way and 4-way running on the Development server, with 1-
way being the default. 4-way processing is specified at job level as an override. The developer
must ensure that overrides are removed from their jobs prior to promotion to the Test server.

14.2 Duplicate Keys


Often an output table, flat file or internal dataset will contain duplicate keys.

Duplicates will often be identified when the output data (from a DataStage job), perhaps in the
form of a flat file, is loaded into a target database table. This load process will most likely fail if
there are duplicate keys in the data, particularly if the target table is uniquely keyed.

Another sign that there may be duplicates in the data is when the output of a job or stage (within
a job) has more rows in the output stream than would have been thought possible from the
inputs.

For these reasons, care must be taken at the unit test stage and it is always a good idea to have
a general understanding of the anticipated throughput of a job before starting the build.

The key to solving problems related to duplicates is to understand how duplicates can be
generated. Here are some examples:

• An incoming data stream, i.e. a data source or internal dataset (for instance the source
system itself, or the output from another job or module), may already contain
duplicates. In this case the problem is inherited and a more extensive search may be
required in order to find it. If the problem lies with the source system, then this may
need to be raised as a data quality issue and corrected at source.
• A 1-way/n-way issue. Scaling from 1-way to n-way processing will often cause
problems. Essentially this is because when running with a single node, all data flows
through a single partition (where processing rules apply to all the data), usually giving
correct results. Running with multiple nodes means that partitioning comes into play and
therefore issues arise from applying processing rules across multiple partitions. This
effect may be desirable; however, in many cases it can also lead to incorrect results.
For instance, if a job is generating a unique key column, the same key may be
generated in every partition and therefore duplicated when the data is collected for
output. A sign that this is the case is if the final record count is a multiple of the number
of nodes compared to the single node record count. To avoid this kind of issue, a stage
can be forced to run sequentially (though this may become a bottleneck) or alternatively,
particularly when defining keys, the partition number can be built into the algorithm for
generating the key, therefore ensuring uniqueness across partitions.
• A Cartesian join.
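
As an illustration of building the partition number into key generation, the Python sketch
below compares a naive per-partition counter with a partition-aware formula. The function names
and the formula (row index within the partition multiplied by the number of partitions, plus the
partition number) are assumptions chosen for the example, not a description of any particular
job.

```python
def naive_keys(partitions):
    """Each partition numbers its own rows from 1, so keys repeat across partitions."""
    return [i for rows in partitions for i, _ in enumerate(rows, start=1)]


def partition_aware_keys(partitions):
    """Key = row index within the partition * partition count + partition number."""
    nodes = len(partitions)
    return [
        i * nodes + p
        for p, rows in enumerate(partitions)
        for i, _ in enumerate(rows)
    ]


# Four partitions holding nine rows in total (contents are irrelevant here).
partitions = [["a", "b", "c"], ["d", "e", "f"], ["g", "h"], ["i"]]

naive = naive_keys(partitions)
aware = partition_aware_keys(partitions)

print("naive:", len(naive), "keys,", len(set(naive)), "distinct")
print("aware:", len(aware), "keys,", len(set(aware)), "distinct")
```

The partition-aware keys are unique but not contiguous when partitions hold different numbers of
rows. If gap-free keys are required, running the key-generating stage sequentially remains the
simpler option.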

14.3 Resource usage Vs Performance


This section concentrates on issues found not only during development but also during wider
Integration, E2E and Performance test stages, particularly discussing the balance that must be
achieved between the resources available on the server where the DataStage jobs run and the
performance of those jobs.

Since DataStage Enterprise (DataStage) starts one Unix process per node (nodes are defined
in the configuration file and can be thought of as a logical processor) per stage, the effective use
of available processors and to an extent the total memory usage is determined by the operating
system rather than DataStage, though generally the more resource (processors and memory)
the better.

Clearly, this can lead to an explosion of running processes, and eventually the operating
system spends more time managing processes than executing code, with a detrimental effect on
performance. The key is to run a number of performance tests to determine the optimum number
of nodes. A starting point will usually be around 50% of actual CPUs.
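
The arithmetic behind this can be sketched as follows. This is a rough estimate only, using the
one-process-per-node-per-stage rule stated above and ignoring anything the engine may do to
reduce the process count at run time; the function names are hypothetical.

```python
def starting_node_count(cpus):
    """Rule-of-thumb starting point: about 50% of the actual CPUs, at least 1."""
    return max(1, cpus // 2)


def estimated_processes(stages, nodes):
    """One process per node per stage."""
    return stages * nodes


cpus = 8
nodes = starting_node_count(cpus)
for stages in (5, 10, 20):
    print(f"{stages} stages on {nodes} nodes: about "
          f"{estimated_processes(stages, nodes)} processes")
```

The figures are a starting point for performance tests, not a substitute for them.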

Within DataStage, the optimum use of parallel (partitioned and piped) data streams is clearly
essential, as is the appropriate use of stages within jobs and the elimination of unnecessary
repartitioning and sorting.

As a general rule of thumb, incoming data streams should be partitioned and sorted as far
upstream as possible and that partitioning and sort order maintained for as long as possible.
Partitioning and sorting take considerable amounts of time during job execution, so where
possible these activities should be minimised. The sort order of the data within a partition in
a data stream will be maintained throughout a job, even when included as an input link to sort
dependent stages such as Dedupe and Join. It is always tempting to sort on the input links of
these stages; however, this is completely unnecessary (providing the data is in the correct
order already) and time consuming. Similarly, it is also tempting to repartition on the input
links of stages when specifying Same will suffice (again, providing the data is correctly
partitioned already).

Within DataStage, the Transform stage was inherited from the DataStage Server product and is
less efficient than other native Parallel stages. The jury is out as far as the use of Transform is
concerned, with arguments for and against. For users of DataStage Server it will be familiar
and easy to use, read and maintain. The native Modify stage is a good alternative but is not
consistent with the user interface implemented for other stages, though Transform also differs
slightly. Common sense is the key: too many Transforms will slow your jobs down, and in this
situation Modify should be considered for simple type conversions. Using several Transforms in
sequence is also undesirable. Quite often they will ‘look’ good but could be combined, therefore
reducing the overhead.

Finally, the Lookup stage. This stage differs from Merge and Join in that it requires the whole
of the lookup dataset to be held in memory. The upper limit is large, though this needs to be
considered in the context of the total memory available and what else will be running at the time.
Total memory usage will be hard to estimate and is best left until the runtime batch has been
designed and run. Be prepared to increase memory and split jobs if the usage is too great.
Likewise, to improve runtimes, be prepared to add further processors to facilitate scaling.

14.4 General Tips


General tips to follow while developing code are listed below:
• Common information such as home directory, system date, username and password
should be initialised in a global variable, and that variable should then be referenced
everywhere.
• Stage variables allow you to hold data from a previous record while the next record
is processed, allowing you to compare previous and current records. Stage
variables also allow you to return multiple errors for a single record. By evaluating
all data in a record rather than stopping at the first exception found, the clean-up
of data is more efficient and requires fewer iterations.
• Nulls are a curse when it comes to using functions/routines or normal equality
type expressions. For example, NULL = NULL does not work; neither does
concatenation when one of the fields is null. Changing nulls to 0 or “” before
performing operations is recommended to avoid erroneous outcomes.
• Ensure that a job does not become too complex. If a job has many stages (more
than 10), divide it into two or more jobs on a functional basis.
• Use containers where stages in a job can be grouped together.
• Use Annotations to describe the steps performed at each stage. Use a Description
Annotation as the job title, as the Description Annotation also appears in Job
properties>Short Job Description and in the Job Report when generated.
• When using string functions on a decimal, always apply the Trim function first,
because string functions otherwise pick up the extra space reserved for the sign in
a decimal.
• When you need to get a substring (e.g. the first 2 characters from the left) of a
character field:
Use <Field Name>[1,2]
Similarly for a decimal field:
Use Trim(<Field Name>)[1,2]
• Always use Hash partitioning in Join and Aggregator stages. The hash key should
be the same as the key used to join/aggregate.
If Join/Aggregator stages do not produce the desired results, try running in
sequential mode (verify the results; if they are still incorrect, the problem is with
the data or logic) and then run in parallel using Hash partitioning.
• Use the Column Generator stage to create sequence numbers or to add columns
with hard-coded values.
• In Job Sequences, always use the “Reset if required, then run” option in Job
Activity stages. (Note: this is not the default option.)
• When mapping a decimal field to a char field or vice versa, it is always better to
convert the value using the ‘Type Conversion’ functions “DecimalToString” or
“StringToDecimal” as applicable while mapping.
• The “Clean-up on failure” property of sequential files must be enabled (it is
enabled by default).
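
The stage variable tip (hold the previous record, and report every error on a record rather than
only the first) can be illustrated in Python. This is a sketch of the idea only; the field names
and error messages are hypothetical, and a local variable plays the role of the stage variable.

```python
def validate(records):
    """Return (record, errors) pairs, keeping every error found on each record."""
    results = []
    previous = None  # stands in for a stage variable holding the prior record
    for rec in records:
        errors = []
        if rec.get("acct") is None:
            errors.append("acct is null")
        if rec.get("bal") is None:
            errors.append("bal is null")
        # Compare with the previous record, which is only possible because it was kept.
        if (previous is not None and rec.get("acct") is not None
                and rec.get("acct") == previous.get("acct")):
            errors.append("duplicate of previous acct")
        results.append((rec, errors))
        previous = rec
    return results


rows = [
    {"acct": 1, "bal": 10},
    {"acct": 1, "bal": None},
    {"acct": None, "bal": None},
]
for rec, errors in validate(rows):
    print(rec, errors)
```

The second and third sample rows each report two errors in a single pass, so the data can be
cleaned up in one iteration instead of two.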

15. REPOSITORY STRUCTURE
The DataStage repository is the resource available to developers that helps organise the
components they are developing or using within their development. This consists of metadata
i.e. table definitions, the jobs themselves and specific routines and shared containers.

The anticipated repository structure is described in the following sections. However, the
structure may change during development, usually evolving to a structure that is in its most
usable form.

15.1 Job Categories


The jobs can be categorised by developer and by FD. The following jobs will be created:

• Import Jobs: Import jobs will be the starting point for transformation. Sanity checks on
the file and validation of external properties, e.g. size, will be done here. The source file
will be read into in-memory datasets as per the source record layout. An exception log
will be created with records that do not follow the file layout. Source data will then be
filtered to select the records to process, and unprocessed data will be maintained in a
dataset for future reference. Finally, one or more datasets will be created which will be
input to the actual transform process.
• Transform Jobs: Datasets created by import jobs will be processed by the actual
transform job. The transform will join two or more datasets and look up data as per the
given functionality. Finally, the records will be split as per destination file and a
destination dataset will be created. All data errors will be captured in an exception log
for future reference.
• Unload Jobs: Unload jobs will take transform datasets as a source and create the final
files required by the load team in the given format.

15.2 Table Definition Categories


The files are categorised into:

• Source/Target Flat-files: The source and target files will be included in this category.
These files will be converted into datasets by DataStage jobs and then, after the
Transformation process is complete, converted back to Target flat files.
• Datasets: Datasets are used as intermediate storage for the various processes. A
Dataset can store data being operated on in a persistent form, which can then be used
by other DataStage jobs. Datasets can either be Sequential or Parallel. These Datasets
will be created from the external data by the ‘Import’ job, and whenever intermediate
datasets are needed for one or more further jobs to process.

15.3 Routines
Before and after routines (should they be needed) will be described here.

15.4 Shared Containers


Shared containers (as described above) will be described here. It is anticipated that there will
be a small number of these and therefore no further categorisation is anticipated.

16. COMMON COMPONENTS USED IN DUMMY

16.1 jbt_sc_join
jbt_sc_join is a common component built to meet a specific requirement in the Dummy project:
capturing 3 types of records from a Join stage, whereas DataStage offers only 2 outputs from a
Join stage.
For example, take file A (master) and file B (child).
The DataStage Join stage will give 2 outputs in this case:
• A + B (Join records)
• A not in B (Reject records)
The common component jbt_sc_join will give 3 outputs in this case:
• A + B (Join records)
• A not in B (Reject records)
• B not in A (Non Join records)
This functionality is illustrated in the flow diagram below:

                                          --lnk_A_B_rej--> A not in B
File ‘A’ (Master) --lnk_A-->
                              jn_A_B      --lnk_A_B_jn---> A+B
File ‘B’ (Child)  --lnk_B-->
                                          --lnk_A_B_njn--> B not in A
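
The three outputs can be expressed in a few lines of Python. This is a sketch of the logic
only, not of how the shared container is built; the function name is hypothetical and the child
file is assumed to hold at most one record per key.

```python
def three_output_join(master, child, key):
    """Return (joined, rejects, non_joins) for master file A and child file B."""
    child_by_key = {row[key]: row for row in child}  # assumes unique child keys
    master_keys = {row[key] for row in master}

    joined = [{**m, **child_by_key[m[key]]} for m in master if m[key] in child_by_key]
    rejects = [m for m in master if m[key] not in child_by_key]   # A not in B
    non_joins = [c for c in child if c[key] not in master_keys]   # B not in A
    return joined, rejects, non_joins


file_a = [{"acct": 1, "name": "x"}, {"acct": 2, "name": "y"}, {"acct": 3, "name": "z"}]
file_b = [{"acct": 2, "bal": 20}, {"acct": 3, "bal": 30}, {"acct": 4, "bal": 40}]

joined, rejects, non_joins = three_output_join(file_a, file_b, "acct")
print(len(joined), "joined;", len(rejects), "reject;", len(non_joins), "non join")
```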

16.2 jbt_sc_srt_cd_lkp
Sort Code lookup is a functionality which is required in many places (in various FDs in
Dummy), so a common component with this functionality has been built.

This will take a file as input and divide it into 2 files, one for north and one for south.

                                   --lnk_A_north--> ‘A’ - North File
File ‘A’ --lnk_A--> sc_srt_cd_lkp
                                   --lnk_A_south--> ‘A’ - South File

16.3 jbt_env_var
This is a template job with the commonly used environment variables already imported. It can be
used as the starting point for all jobs that need this set of common environment variables, rather
than importing them again and again.
The environment variables are listed below:

$ADTFILEDIR: This would contain the Audit file and reconciliation reports.
$BASEDIR: This folder is the base directory.
$DSEESCHEMADIR: DSEE Schemas that are used by EE jobs using RCP/schema files.
$ITERATION: Current Iteration number
$JOBLOGDIR: This would contain all the Error log files generated in DataStage jobs.
$PARMFILEDIR: This folder will contain parameter files that would be looked up by
jobs/routines that would be triggered from a common parameter file. These parameters values
will be set as per development environment.
$REJFILEDIR: This would contain all the reject files generated in DataStage jobs.
$SCRIPTDIR: This will contain routine UNIX scripts used for processing files, copying, taking
file backup etc.
$SRCDATASET: All the input files will be partitioned and imported into DataStage datasets.
This folder will store all the input datasets.
$SRCFILEDIR: This folder will contain all the input files from the Extract team. All files will be
manually copied into this folder.
$SRCFORMATDIR: This folder will contain the copybook formats for input source files. These
copybook formats are as per functional specifications.
$TMPDATASET: This folder will be used to store all the intermediate files created during
transform job.
$TRGDATASET: This folder will be used for storing output DataStage datasets files.
$TRGFILEDIR: This folder will contain all the transformed output files which can be loaded
to Bank B’s mainframe.
$TRGFORMATDIR: This folder will contain the copybook formats for output files.

16.4 jbt_annotation
This is a template job in which annotations are used to describe the steps performed at each
stage. A Description Annotation is also used as the job title, as the Description Annotation
appears in Job properties>Short Job Description and in the Job Report when generated.

16.5 Job Log Snapshot
JobLogSnapShot.ksh is a script which will create a log file (as seen in DataStage Director) of
a job's latest run.
The following parameters need to be hard coded in the script as per the environment:
DSHOME=/wload/dqad/app/Ascential/DataStage/DSEngine
PROJDIR=/wload/dqad/app/Ascential/DataStage/Projects/Dummy_dev
LOGDIR=/wload/dqad/app/data/Dummy_dev/itr01/errfile

DSHOME is the DataStage home path.
PROJDIR is the project directory in which the job exists.
LOGDIR is a common directory where the log file will be created.

The script will be called from the after job subroutine of a job.

ksh /wload/dqad/app/data/Dummy_dev/com/script/JobLogSnapShot.ksh $1

$1 is the input parameter: the name of the job whose latest job log is required.

The job log file will be created in:

/wload/dqad/app/data/Dummy_dev/itr01/errfile/<Job name>_log_<time stamp>.txt

Sample Job log: [screenshot not reproduced]

16.6 Reconciliation Report
Reconcilation.ksh is a script which will create the Reconciliation Report of the respective
functional area (FD).
The script will be called from an Execute Command stage of a Job Sequence.

ksh /wload/dqad/app/data/Dummy_dev/com/script/Reconcilation.ksh $1 $2

$1 is the 1st input parameter: FD##
$2 is the 2nd input parameter: .ini file name (not path)

Specifications of the .ini file:

Path: /wload/dqad/app/data/Dummy_dev/com/parmfile
Each line of the .ini file will contain the following fields, separated by the | sign:
• The type of the file, i.e. Input, Output, Reject or Non-Join, given as INP, OUT, REJ or
NJN. Note: the entries should be in sorted order. The input files will be datasets, the
output files will be EBCDIC files, and the reject and non-join files will be in ASCII format.
• The name of the file whose report is to be prepared.
• The description of the file whose report is to be prepared.
• The record length of the file (this is needed only for the output EBCDIC file).
Sample .ini file:


INP|fd01_customer_pointer_file|Customer Pointer dataset created from source file
INP|fd01_customer_data_file|Customer Data dataset created from source file
OUT|fd01_redirection_file|Output redirection file|117
REJ|fd01_duplicates_file|Reject file containing duplicated account numbers
NJN|fd01_account_nonjoin|Nonjoin files from the join stage in job1
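
The .ini layout above is simple enough to check mechanically before the script is run. The
Python sketch below parses the sample lines according to the field description given here. It
is not part of Reconcilation.ksh; the names ReconEntry and parse_ini_line are hypothetical, and
treating a missing record length on an OUT line as an error is an assumption.

```python
from typing import NamedTuple, Optional

FILE_TYPES = ("INP", "OUT", "REJ", "NJN")


class ReconEntry(NamedTuple):
    file_type: str
    name: str
    description: str
    record_length: Optional[int]  # only present for output (EBCDIC) files


def parse_ini_line(line: str) -> ReconEntry:
    """Split one |-separated .ini line into its fields and check them."""
    fields = line.strip().split("|")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    file_type, name, description = fields[:3]
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type {file_type!r}")
    record_length = int(fields[3]) if len(fields) == 4 else None
    if file_type == "OUT" and record_length is None:
        raise ValueError(f"output file {name!r} needs a record length")
    return ReconEntry(file_type, name, description, record_length)


sample = """\
INP|fd01_customer_pointer_file|Customer Pointer dataset created from source file
INP|fd01_customer_data_file|Customer Data dataset created from source file
OUT|fd01_redirection_file|Output redirection file|117
REJ|fd01_duplicates_file|Reject file containing duplicated account numbers
NJN|fd01_account_nonjoin|Nonjoin files from the join stage in job1
"""

entries = [parse_ini_line(line) for line in sample.splitlines() if line.strip()]
for entry in entries:
    print(entry.file_type, entry.name, entry.record_length)
```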

The Reconciliation report will be created in:


/wload/dqad/app/data/Dummy_dev/itr01/adtfile/<FD##>_recon_<time stamp>.txt

Sample Reconciliation report: [screenshot not reproduced]

16.7 Script template
All scripts are written from this template script, which includes a script description and a
section for maintaining the modification history of the script.
The template script is /wload/dqad/app/data/Dummy_dev/com/script/ScriptTemplate.ksh

16.8 Split File


SplitFile.ksh is a script which will split the input file into header, detail and trailer files.
The script will be called from an Execute Command stage of a Job Sequence (Import
sequence).

ksh /wload/dqad/app/data/Dummy_dev/com/script/SplitFile.ksh $1 $2

$1 is the 1st input parameter: <Input file name without extension>
$2 is the 2nd input parameter: <Record length>

The script requires the input file to have a .dat extension. The header, detail and trailer files
created will be $1_hdr.dat, $1_det.dat and $1_trl.dat respectively.
The input file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat
All these files ($1_hdr.dat, $1_det.dat and $1_trl.dat) will be output in
/wload/dqad/app/data/Dummy_dev/itr01/opfile/.
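
The splitting logic can be sketched in Python as follows. This is not the content of
SplitFile.ksh, which is not shown in this document. The sketch assumes fixed-length records with
no delimiters, with the first record being the header and the last the trailer; the function
name is hypothetical.

```python
def split_records(data: bytes, record_length: int):
    """Split fixed-length records into (header, detail records, trailer)."""
    if record_length <= 0:
        raise ValueError("record length must be positive")
    if len(data) % record_length != 0:
        raise ValueError("file size is not a multiple of the record length")
    records = [data[i:i + record_length] for i in range(0, len(data), record_length)]
    if len(records) < 2:
        raise ValueError("file must contain at least a header and a trailer")
    return records[0], records[1:-1], records[-1]


# Four 4-byte records: header, two detail records, trailer.
data = b"HDR1" + b"AAAA" + b"BBBB" + b"TRL1"
header, detail, trailer = split_records(data, 4)
print(header, detail, trailer)
```

Checking that the file size is a whole multiple of the record length is a cheap way to catch a
wrong record length parameter before any output is written.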

16.9 Make File


Make_File.ksh is a script which will merge the header, detail and trailer records to create the
target file.
The script will be called from an Execute Command stage of a Job Sequence (Unload
sequence).

ksh /wload/dqad/app/data/Dummy_dev/com/script/Make_File.ksh $1

$1 is the 1st input parameter: <Target file name without extension>

The script requires the header, detail and trailer files to be named $1_hdr.dat, $1_dtl.dat and
$1_trl.dat respectively.
All these files ($1_hdr.dat, $1_dtl.dat and $1_trl.dat) must be present in
/wload/dqad/app/data/Dummy_dev/itr01/opfile/.
The output file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat

16.10 jbt_import
This template job processes the Header, Detail and Trailer records created by the SplitFile.ksh
script described in 16.8.
The header and trailer data are validated.
The validations done on the header are:
• The file header identifier must contain the value ‘HDR-TDAACCT’
• The file header date must equal the T-14 migration date
The validations done on the trailer are:
• The file trailer identifier must contain the value ‘TRL-TDAACCT’
• The file trailer creation date must equal the file header creation date
• The file trailer record count must equal the total number of records on the input file
including the header and trailer records.
• The file trailer record amount must equal the sum of the Closing Balance field from every
record on the input file excluding the header and trailer records. The accumulation of the
Closing Balance field must be performed using an integer data format, allowing for
overflow.
If any of the above checks fail, then processing should be immediately aborted with a relevant
fatal error message. This is implemented using the subroutine AbortOnCall.
Note: These header/trailer validations are for FD01. They will vary slightly for other FDs, but
the common approach shown in the template can still be followed.
The detail records are written to a dataset to be processed in the transform job.
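
The checks listed above can be summarised in a short Python sketch. It illustrates the rules
only and is not the template job or the AbortOnCall subroutine; the dictionary field names are
hypothetical, and closing balances are assumed to be held as integers. Unlike the job, which
aborts on the first failed check, the sketch returns every failure so that the caller can
decide how to abort.

```python
def validate_header_trailer(header, trailer, closing_balances, migration_date):
    """Return a list of failed checks; an empty list means the file is valid."""
    errors = []
    if header["identifier"] != "HDR-TDAACCT":
        errors.append("header identifier is not HDR-TDAACCT")
    if header["date"] != migration_date:
        errors.append("header date does not equal the migration date")
    if trailer["identifier"] != "TRL-TDAACCT":
        errors.append("trailer identifier is not TRL-TDAACCT")
    if trailer["date"] != header["date"]:
        errors.append("trailer date does not equal header date")
    # The record count includes the header and trailer records themselves.
    if trailer["record_count"] != len(closing_balances) + 2:
        errors.append("trailer record count does not match the file")
    # Python integers do not overflow, so a plain sum is safe here.
    if trailer["amount"] != sum(closing_balances):
        errors.append("trailer amount does not match the sum of closing balances")
    return errors


header = {"identifier": "HDR-TDAACCT", "date": "2005-01-01"}
trailer = {"identifier": "TRL-TDAACCT", "date": "2005-01-01",
           "record_count": 5, "amount": 600}
balances = [100, 200, 300]

print(validate_header_trailer(header, trailer, balances, "2005-01-01"))
```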

16.11 jst_import
This template job sequence calls the following components:
• SplitFile.ksh as described in 16.8
• jbt_import as described in 16.10

This sequence template will split the source file into 3 different files (Header, Detail and
Trailer) and call the import job, which will do the necessary validation and create a detail
dataset.

16.12 jbt_unload
This template job illustrates the creation of header and trailer records. The trailer consists
of a record count and a hash count.
This template mainly implements the following logic:

• Total number of records on file (excluding header & trailer)
• Hash of account numbers from all detail records on file

16.13 jst_unload
This template job sequence calls the following components:
• jbt_unload as described in 16.12
• Make_File.ksh as described in 16.9
• Reconciliation report as described in 16.6

This sequence template will create 3 different files (Header, Detail and Trailer) and call the
script which will combine these 3 files to create the target file. The Reconciliation report is
also created.

16.14 jbt_abort_threshold
The Abort Threshold template will abort a job based on a threshold value passed as a job
parameter. It uses a common routine called “AbortOnThreshold”. This routine has to be called
from a BASIC Transformer:
AbortOnThreshold (@INROWNUM, <Threshold Value>, DSJ.ME)
Here <Threshold Value> is the job parameter. For example, if you give a Threshold Value of 5,
the job will abort after 4 records pass through the BASIC Transformer.
This is used in places where a job needs to be aborted on a particular number of reject records.
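
The behaviour described (a threshold of 5 lets 4 records through, then aborts) can be sketched
in Python. The internals of the AbortOnThreshold routine are not shown in this document, so the
comparison used here is an assumption that reproduces the stated example; JobAborted and
process_rejects are hypothetical names.

```python
class JobAborted(Exception):
    """Raised in place of a real job abort."""


def abort_on_threshold(row_number, threshold, job_name):
    """Abort once the row number reaches the threshold (assumed comparison)."""
    if row_number >= threshold:
        raise JobAborted(f"{job_name}: threshold of {threshold} reached")


def process_rejects(rows, threshold, job_name="hypothetical_job"):
    """Pass reject rows through, numbering them from 1 as @INROWNUM does."""
    passed = 0
    for row_number, _ in enumerate(rows, start=1):
        abort_on_threshold(row_number, threshold, job_name)
        passed += 1
    return passed


print(process_rejects(["r"] * 4, threshold=5))  # four rows pass, no abort
try:
    process_rejects(["r"] * 5, threshold=5)
except JobAborted as exc:
    print("aborted:", exc)
```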
