DataStage Best Practices

55000783.doc

Page 1 of 41

CONTENTS
1. INTRODUCTION
   1.1 OBJECTIVE
   1.2 DOCUMENT USAGE
2. DATASTAGE OVERVIEW
3. DATASTAGE DEVELOPMENT WORKFLOW
   3.1 BUILDING AND TESTING JOBS
       3.1.1 Dummy_Dev Project
   3.2 OTHER DATASTAGE PROJECTS
4. DATASTAGE JOB DESIGN CONSIDERATIONS
   4.1 JOB TYPES
       4.1.1 Import Jobs
       4.1.2 Transform Jobs
       4.1.3 Unload Jobs
5. USE OF STAGES
   5.1 COMBINING DATA
       5.1.1 Join, Lookup and Merge Stages
       5.1.2 Aggregate Stage
       5.1.3 The Funnel Stage
   5.2 SORTING
   5.3 DATA MANIPULATION
       5.3.1 Transformer Usage Guidelines
       5.3.2 Modify Stage
   5.4 TRANSITIONING DATA
       5.4.1 External Data
       5.4.2 Parallel Dataset
   5.5 UNIT TEST
       5.5.1 Copy Stage
       5.5.2 Peek Stage
       5.5.3 Row Generator
       5.5.4 Column Generator
       5.5.5 Manual XLS Generation
6. GUI STANDARDS
7. DATASTAGE NAMING STANDARDS
8. RUNTIME COLUMN PROPAGATION (RCP)
9. STANDARDISED REJECT HANDLING
   9.1 REJECT COMPONENTS
   9.2 CUSTOMISED REJECT MESSAGES
   9.3 REJECT LIMIT
   9.4 BEFORE ROUTINE
   9.5 NOTIFICATIONS
       9.5.1 In-line Notification of Rejects
       9.5.2 Cross Functional Notification of Rejects
10. ENVIRONMENT
   10.1 DEFAULT ENVIRONMENT VARIABLES STANDARDS
   10.2 JOB PARAMETER FILE STANDARDS
   10.3 DIRECTORY PATH PARAMETERS
   10.4 DEFAULT DIRECTORY PATH PARAMETERS
   10.5 DIRECTORY & DATASET NAMING STANDARDS
       10.5.1 Functional Area Input Files
       10.5.2 Functional Area Output Tables
       10.5.3 Functional Area Staging Tables
       10.5.4 Internal Module Tables
       10.5.5 Datasets Produced from Import Processing
11. METADATA MANAGEMENT
   11.1 SOURCE AND TARGET METADATA
   11.2 INTERNAL METADATA
12. STANDARD COMMON COMPONENTS
   12.1 JOB TEMPLATES
       12.1.1 Import Jobs
       12.1.2 Transform Jobs
       12.1.3 Unload Jobs
   12.2 CONTAINERS
13. DEBUGGING A JOB
14. COMMON ISSUES AND TIPS
   14.1 1-WAY / N-WAY
   14.2 DUPLICATE KEYS
   14.3 RESOURCE USAGE VS PERFORMANCE
   14.4 GENERAL TIPS
15. REPOSITORY STRUCTURE
   15.1 JOB CATEGORIES
   15.2 TABLE DEFINITION CATEGORIES
   15.3 ROUTINES
   15.4 SHARED CONTAINERS
16. COMMON COMPONENTS USED IN DUMMY
   16.1 JBT_SC_JOIN
   16.2 JBT_SC_SRT_CD_LKP
   16.3 JBT_ENV_VAR
   16.4 JBT_ANNOTATION
   16.5 JOB LOG SNAPSHOT
   16.6 RECONCILIATION REPORT
   16.7 SCRIPT TEMPLATE
   16.8 SPLIT FILE
   16.9 MAKE FILE
   16.10 JBT_IMPORT
   16.11 JST_IMPORT
   16.12 JBT_UNLOAD
   16.13 JST_UNLOAD
   16.14 JBT_ABORT_THRESHOLD


1. INTRODUCTION

1.1 Objective
This document will serve as a source of standards for the use of the DataStage software as employed by the Dummy Transformation project. It is intended to channel the general knowledge of DataStage developers towards the specific things they need to know about the Dummy project and the specific way jobs will be developed. It will therefore be an evolving document, updated continually to reflect the changing needs and thoughts of the development team, and hence continue to represent best practice as the project progresses. The Offshore Build Manager will maintain the document (in collaboration with the development team, through weekly developer meetings) and will be responsible for distributing it to developers (and explaining its content) initially and after updates have been applied, ensuring that the standards it describes are communicated and understood. Such communication will highlight the areas of change.

1.2 Document Usage
This document describes the DataStage best practices to be applied to the Dummy Transformation project. It will be referenced by developers initially for familiarisation, and as required during the course of the project. Use of the document will therefore reduce over time as developers become familiar with the practices described. The best practices will also form the basis for QA and peer testing within the development environment. The standards set out below will be followed by all developers. It is understood that this document, while setting the standards, might not be able to cover every development scenario. In such cases, the developer must contact the appropriate authority to seek clarification and ensure that such missing items are subsequently added to this document. An initial review and sign-off process will therefore be followed within this context.

2. DATASTAGE OVERVIEW

DataStage is a powerful Extraction, Transformation and Loading tool. DataStage has the following features to aid design and processing:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number or type of database. You can modify the SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data, and you can easily extend the functionality by defining your own transforms to use.
• Handles all the metadata definitions required to define your data warehouse or migration. You can view and modify the table definitions at any point during the design of your application.
• Aggregates data.

3. DATASTAGE DEVELOPMENT WORKFLOW

3.1 Building and Testing Jobs
This section provides an overview of the DataStage job development process for the Dummy transformation project. As detailed in the diagram below, there will be three environments: development, test and production. Development will have three projects through which each piece of code will move: Dummy_Dev, Version and Dummy_Promo. Developers will develop code in the Dummy_Dev project and, after unit testing, promote it to the Version project, where version control will be managed. After base-lining the code, the DataStage administrator will collate all code in the Dummy_Promo project, from where the DMCoE will move it for unit and end-to-end testing on the Test server. Finally, the code will be moved by the DMCoE to production. Please refer to the Dummy Transform Code Migration Strategy document for further details.

[Diagram: development workflow showing the Ranch_Dev, Version, Ranch_Promo, Ranch_Test and Ranch_Prod DataStage projects across the Development, Test and Production servers, with the Developer, Build Manager (review / defect fix / QA / sign-off), Administrator and Onshore/DMCoE roles in the build/unit test and deploy/promote processes.]

Each DataStage project is defined below.

3.1.1 Dummy_Dev Project
The Dummy_Dev project will be used by developers for building DataStage jobs and for unit testing. It will be mapped to a working directory on the UNIX DataStage server. Within DataStage, a project is the entity in which all material related to a development is stored and organised. Changes and defect fixes will be documented and applied before promoting a job to Dummy_Promo for integration testing.

3.2 Other DataStage Projects
Several further DataStage projects will be employed across the Development, Test and Production environments. Please refer to the Dummy Transform Code Migration Strategy document for further details.

4. DATASTAGE JOB DESIGN CONSIDERATIONS

4.1 Job Types
As per the diagram below, there will be three types of job within Transform: Import, Transform and Unload jobs. Source data with a complex file layout will be processed by these jobs in sequence to give a target file in the format required by the Load team. Target data is provided as flat files.

4.1.1 Import Jobs
Import jobs will be the starting point for transformation. The source file will be read as per the source record layout. Sanity checks on the file and validation of external properties (e.g. size) will be done here: check for a zero-byte file, validate header and trailer details, read the file in the specified format and create output datasets. If there are any unwanted or bad records, the job will fail, error details will be written to a stats file, and the file will need to be corrected before restarting the job. Source data will then be filtered: records to be processed continue, and unprocessed data will be maintained in a dataset for future reference. Finally, one or more datasets will be created which will be input to the actual transform process. See section 9 for further details of the action to be taken on failure or reject.

[Diagram: Import job reading Hogan and non-Hogan extract data, performing the checks above, writing error details to a stats file and stopping processing on failure, and producing datasets of records to be processed and records held for future jobs.]

4.1.2 Transform Jobs
Datasets created by import jobs will be processed by transform jobs. A transform job will join two or more datasets, look up data as per the functional design specification, and sort and de-dup the data. All data errors will be captured in an exception log for future reference. Transform jobs hold their working data in a number of temporary datasets; data may also be held for future jobs. Finally, the records will be split as per the destination file design and a destination dataset will be created.

4.1.3 Unload Jobs
Unload jobs will take transform datasets as a source and create the final files required by the Load team in the given format.

[Diagram: Unload job reading data held in temporary datasets and unloading it to an output file as per the layout for the Load team.]

5. USE OF STAGES

5.1 Combining Data

5.1.1 Join, Lookup and Merge Stages
The Join, Lookup and Merge stages combine two or more input links according to the values of key columns. They differ mainly in memory usage, treatment of rows with unmatched key values, and input requirements (i.e. sorted and de-duplicated). A brief description of when to use these stages is provided in the following table:

                                 Join               Lookup                          Merge
  Type                           SQL-like           In-RAM Lookup Table             Master / Update
  Memory                         Light              Heavy                           Light
  Number of inputs               1 Left, 1 Right    1 Source, n Lookup Tables       1 Master, n Updates
  Sort on input                  All                None                            All
  Duplicates on primary input    OK                 OK                              Warning
  Duplicates on secondary input  OK                 Warning                         OK (when n=1)
  Options on unmatched primary   None               Fail, Continue, Drop or Reject  Keep or Drop
  Options on unmatched secondary None               None                            Capture as Reject
  Number of output links         1                  1 Out, 1 Reject                 1 Out, n Rejects
  Captured on reject             Nothing            Unmatched Primary Rows          Unmatched Secondary Rows

The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. If the datasets are larger than available resources, the Join or Merge stage should be used.

5.1.2 Aggregate Stage
The purpose of the Aggregator stage is to perform data aggregations. In order to do this, it is necessary to understand the key columns that define the aggregation groups, the columns to be aggregated and the kind of aggregation. Common aggregation functions include:

• Count
• Sum
• Mean
• Min / Max

Several others are available to process business logic; however, it is most likely that aggregations will be used as part of a calculation to determine the number of rows in an output table, for inclusion in header and footer records for unload files.

5.1.3 The Funnel Stage
The Funnel stage requires all input links to have identical schemas (column names, types, attributes including nullability). The single output link matches the input schema.

5.2 Sorting
There are two options for sorting data within a job: on the input properties page of many stages (a simple sort), or using the explicit Sort stage. The explicit Sort stage has additional properties, such as the ability to generate a key change column and to specify the memory usage of the stage.
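The memory trade-off between Lookup and Join/Merge in the table above can be illustrated outside DataStage. The following Python sketch (an illustration only, not DataStage code; all names are hypothetical) contrasts the Lookup pattern, which must hold the whole reference table in memory, with a merge-join over two pre-sorted inputs, which only streams them. It assumes unique keys on both inputs for simplicity:

```python
# Illustrative only: contrasts the Lookup pattern (reference data held in RAM)
# with the Join/Merge pattern (both inputs pre-sorted, streamed row by row).

def lookup(source_rows, reference_rows, key):
    """Lookup stage pattern: build an in-memory table from the reference link."""
    table = {row[key]: row for row in reference_rows}  # whole reference in RAM
    for row in source_rows:
        match = table.get(row[key])
        if match is not None:
            yield {**row, **match}  # matched rows go to the output link
        # unmatched rows could Fail / Continue / Drop / Reject here

def merge_join(left_rows, right_rows, key):
    """Join/Merge pattern: both inputs sorted on the key, streamed in O(1) memory."""
    left, right = iter(left_rows), iter(right_rows)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[key] == r[key]:
            yield {**l, **r}
            l = next(left, None)
        elif l[key] < r[key]:
            l = next(left, None)  # unmatched primary row dropped (inner join)
        else:
            r = next(right, None)

src = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
ref = [{"id": 1, "cat": "A"}, {"id": 3, "cat": "B"}]
print(list(lookup(src, ref, "id")) == list(merge_join(src, ref, "id")))  # True
```

Both produce the same matches; the difference is that `lookup` needs memory proportional to the reference data, which is exactly why the table above recommends Join or Merge when the reference data exceeds available resources.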

5. Filter.3.3. Therefore.col) Then… Else… 5. Thus.2 Transformer NULL Handling and Reject Link When evaluating expressions for output derivations or link constraints. 5.3. for example: If ISNULL(link. For this reason.3.3. always test for null values before using a column in an expression.3 5.4 Optimizing Transformer Expressions and Stage Variables In order to write efficient Transformer stage derivations. Switch. Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible. For example. unless the derivation is empty o For each output link: 1. the output links are also evaluated in the order in which they are displayed. From this sequence. 5.3 Transformer Derivation Evaluation Output derivations are evaluated BEFORE any type conversions on the assignment. Evaluate each column derivation value 2. as they will be evaluated once for every output column that uses them. the PadString function uses the length of the source type. Similarly. Modify etc) when derivations are not needed.1 Data Manipulation Transformer Usage Guidelines 5.doc Page 10 of 41 . Such constructs are: 55000783. the Transformer will reject (through the reject link indicated by a dashed line) any row that has a NULL value used in the expression. it is important to minimize the number of transformers. Write the output record o Next output link Next input row • The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. it can be seen that there are certain constructs that will be inefficient to include in output column derivations. right-click on an output link and choose "Convert to Reject". The evaluation sequence is as follows: • • Evaluate each stage variable initial value For each input row to process: o Evaluate each stage variable derivation value. TrimLeadingTrailing(string) works only if string is a VarChar field.1.1. 
For example.1 Choosing Appropriate Stages The parallel Transformer stage always generates "C" code which is then compiled to a parallel component. The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. it is useful to understand what items are evaluated and when.1. To create a Transformer reject link in DataStage Designer. and to use other stages (Copy.1. the incoming column must be type VarChar before it is evaluated in the Transformer. it is important to make sure the type conversion is done before a row reaches the Transformer. For this reason. not the target.
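The NULL-handling rule above can be sketched in plain Python (an analogue only, not DataStage code): a row whose derivation would consume a NULL is diverted to a reject collection instead of producing an undefined result.

```python
# Illustrative Python analogue of the Transformer NULL-handling rule:
# rows whose derivation would use a NULL value go to a reject link.

def transform(rows):
    output, rejects = [], []
    for row in rows:
        # Equivalent of "If ISNULL(link.col) Then ... Else ...": test for NULL
        # before the column is used in an expression.
        if row.get("col") is None:
            rejects.append(row)  # the dashed reject link
        else:
            output.append({"derived": row["col"].upper()})  # safe derivation
    return output, rejects

out, rej = transform([{"col": "abc"}, {"col": None}])
print(out)  # [{'derived': 'ABC'}]
print(rej)  # [{'col': None}]
```

The point mirrors the guideline: make the null test explicit before the expression, rather than letting an undefined NULL operation decide the row's fate.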

• Where the same part of an expression is used in multiple column derivations

For example, suppose multiple columns in output links want to use the same substring of an input column. Then the following test may appear in a number of output column derivations:

IF (DSLINK1.col1[1,3] = "001") THEN ...

In this case, the substring DSLINK1.col1[1,3] is evaluated for each column that uses it. This can be made more efficient by moving the substring calculation into a stage variable. By doing this, the substring is evaluated just once for every input row. The stage variable definition will be:

DSLINK1.col1[1,3]

and each column derivation will start with:

IF (StageVar1 = "001") THEN ...

This example could be improved further by also moving the string comparison into the stage variable. The stage variable will be:

IF (DSLINK1.col1[1,3] = "001") THEN 1 ELSE 0

and each column derivation will start with:

IF (StageVar1) THEN ...

This reduces both the number of substring functions evaluated and the number of string comparisons made in the Transformer. However, in this case, the stage variable will still be evaluated once for every input row.

• Where an expression includes calculated constant values

For example, a column definition may include a function call that returns a constant value, such as:

Str(" ", 20)

This returns a string of 20 spaces. In this case, the function will be evaluated every time the column derivation is evaluated. It will be more efficient to calculate the constant value just once for the whole Transformer. This can be achieved using stage variables. The solution is to move the function evaluation into the initial value of a stage variable. A stage variable can be assigned an initial value from the Stage Properties dialog / Variables tab in the Transformer stage editor. In this case, the variable will have its initial value set to:

Str(" ", 20)

You will then leave the derivation of the stage variable on the main Transformer page empty, and change any expression that previously used this function to use the stage variable instead. The initial value of the stage variable is evaluated just once, before any input rows are processed. Then, because the derivation expression of the stage variable is empty, it is not re-evaluated for each input row, so its value for the whole of the Transformer processing is unchanged from the initial value.

In addition to a function call returning a constant value, another example would be part of an expression such as:

"abc" : "def"

As with the function call example, this concatenation is evaluated every time the column derivation is evaluated. Since this subpart of the expression is actually constant, it could again be moved into a stage variable, using the initial value setting to perform the concatenation just once.

• Where an expression requiring a type conversion is used as a constant, or is used in multiple places

For example, an expression may include something like this:

DSLink1.col1 + "1"

In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it must be converted from a string to an integer each time the expression is evaluated. The solution in this case is just to change the constant from a string to an integer:

DSLink1.col1 + 1

Similarly, if DSLink1.col1 were a string field, then a conversion would be required every time the expression is evaluated. If this just appeared once in one output column expression, this would be fine. However, if an input column is used in more than one expression, where it requires the same type conversion in each expression, it will be more efficient to use a stage variable to perform the conversion once. In this case, you will create an integer stage variable, specify its derivation to be DSLink1.col1, and then use the stage variable in place of DSLink1.col1 wherever that conversion would have been required.

It should be noted that when using stage variables to evaluate parts of expressions, the data type of the stage variable should be set correctly for that context; otherwise, needless conversions are required wherever that variable is used.

5.3.2 Modify Stage
The Modify stage is the most efficient stage available. Transformations that touch a single field, such as keep/drop, type conversions, some string manipulations, and null handling, are the primary operations which should be implemented using Modify instead of the Transformer.
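The stage-variable optimisations above amount to hoisting work out of the innermost (per-column) loop. The Python sketch below (an analogue only, not DataStage code; all names are hypothetical) shows the two levels: a constant computed once before any rows, like a stage variable initial value, and a common subexpression computed once per row, like a stage variable derivation, instead of once per output column.

```python
# Illustrative Python analogue of the stage-variable optimisations:
# hoist constant and per-row work out of per-column derivations.

PAD = " " * 20  # "initial value" pattern: computed once, before any rows

def transform(rows):
    out = []
    for row in rows:
        # "stage variable" pattern: the common subexpression is computed once
        # per row, instead of once per output column that uses it.
        is_001 = row["col"][0:3] == "001"
        out.append({
            "flag": "Y" if is_001 else "N",
            "desc": ("matched" if is_001 else "other") + PAD,
        })
    return out

print(transform([{"col": "001XYZ"}, {"col": "002XYZ"}]))
```

Had `is_001` been written inline in both the `flag` and `desc` derivations, the substring and comparison would run twice per row, which is exactly the inefficiency the guideline addresses.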

5.4 Transitioning Data

5.4.1 External Data
The External Source stage is a file stage which is used to read data that is output from one or more source programs. The stage calls the program and passes appropriate arguments. It can be configured to execute in parallel or sequential mode. The stage can have a single output link and a single rejects link. This stage will typically be used in the 'Import' jobs to import the external data into parallel datasets to be processed by further 'Transformation' jobs.

5.4.2 Parallel Dataset
The Data Set stage is used to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. DataStage parallel extender jobs use data sets to manage data within a job. The Data Set stage can store data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. These parallel datasets will be created from the external data by the 'Import' job, and will be created whenever intermediate datasets are needed for further single or multiple jobs to process. Due to the parallel nature of processing, the danger of bottlenecks during dataset creation is eliminated.

5.5 Unit Test

5.5.1 Copy Stage
The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or columns can be dropped or changed (to copy with more modification – for example, changing column data types). This stage is commonly used for debugging and testing, where a copy of the data flowing from a particular stage can be isolated from the flow and analysed.

5.5.2 Peek Stage
The Peek stage can print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. This stage is used when a specific column's data is to be analysed during unit testing, to validate whether the preceding transformation logic is working as desired.

5.5.3 Row Generator
The Row Generator stage is a Development/Debug stage that has no input links and a single output link. It produces a set of mock data fitting the specified metadata. This is useful where you want to test your job but have no real data available – which may be a source file, or a dataset produced by some other job whose development is also underway. More details can be specified about each data type if required, to shape the data being generated – for example, Type as 'Cycle' specifying what 'Increment' value is required, or Type as 'Random' specifying what percentage of invalid/zero data is required.

5.5.4 Column Generator
The Column Generator stage is a Development/Debug stage that can have a single input link and a single output link. The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then
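The Row Generator's 'Cycle' and 'Random' shaping options described above can be sketched in Python (an illustration of the idea only, not the DataStage stage; function and parameter names are hypothetical). The generator is seeded so that unit-test runs are repeatable:

```python
import random

# Sketch of Row Generator-style mock data: a 'Cycle' column with a fixed
# increment, and a 'Random' column with a given percentage of zero values.

def generate_rows(n, increment=10, pct_zero=20, seed=42):
    rng = random.Random(seed)  # fixed seed -> repeatable unit-test data
    rows = []
    for i in range(n):
        rows.append({
            "id": i * increment,  # 'Cycle' type with the given 'Increment'
            # 'Random' type: roughly pct_zero percent of rows get a zero value
            "amt": 0 if rng.randrange(100) < pct_zero else rng.randrange(1, 1000),
        })
    return rows

rows = generate_rows(5)
print([r["id"] for r in rows])  # [0, 10, 20, 30, 40]
```

As with the stage itself, the value of this pattern is that the shape of the data (increments, percentage of invalid values) is controlled, so edge cases can be provoked deliberately during unit testing.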

5.5.5 Manual XLS Generation
In addition to the 'Row Generator' and 'Column Generator' methods that DataStage provides, mock data can also be created manually in an XLS file and then saved as a CSV file to be given as input to the DataStage job where the test data is required. These methods of data generation will be used extensively during Unit Testing.

6. GUI STANDARDS
Job Description Fields – the description annotation is mandatory for each job. It is packaged and maintained with the job and will be visible when the jobs are deployed to test, promo and production. The full description should include the job version number, developer name, date and a brief reference to the design document, including the version number the job has been coded up to. Note that the description annotation updates the job short description.

When using DataStage Version Control, the Full Description field in the job properties is also used by DS Version Control to append revision history. The detailed description is updated automatically by the DataStage Version Control process following the first initialisation into Version Control. Entries put in the detailed description by Version Control must not be modified manually; information maintained by the developer will be appended to by the Version Control tool. This does not stop developers from using the Full Description as a method of maintaining the relevant documentation. Where the job has not yet entered Version Control, the initial version should be referred to as 0.1.

Annotations are also used to further describe the functionality of jobs and stages. Two types of annotation are used: a blue job description (description annotation) and a yellow operator-specific description (standard annotation). A standard description annotation should be used on every non-trivial stage, in addition to the main job annotation, and kept up to date with any modifications to the job.

7. DATASTAGE NAMING STANDARDS
Naming conventions must be enforced on links, transforms and source and target files.

Object Type                  Syntax
Category                     Import/transform/unload
Job                          jb_fdXX_<im/tr/ul>_<JobName>
                             Where XX is 01, 02…13, indicating the FD name.
                             <im> indicates Import Job
                             <tr> indicates Transform Job
                             <ul> indicates Unload Job
Job Sequence                 js_<fdXX>_<im/tr/ul>_<file/detail>
                             Where XX is 01, 02…13, indicating the FD name.
                             <im> indicates Import Job Sequence
                             <tr> indicates Transform Job Sequence
                             <ul> indicates Unload Job Sequence
Source Definition Category   source

Target Definition Category   target
Link*                        lnk_<StageName>_<rej/njn/jn> or lnk_<StageName>
                             <StageName> is the name of the stage from which the link
                             is coming out. <rej/njn/jn> indicates the type of link:
                             rej=reject, njn=non join, jn=join. If not applicable,
                             this suffix is dropped.

Parallel Job FILE Stages
Data Set                     ds_<Dataset Name>
Sequential File              sq_<Sequential file name>
File Set                     fs_<File Set name>
Lookup File Set              lfs_<Lookup file set name>
External Source              esrc_<External Source name>
External Target              etrg_<External Target name>
Complex Flat File            cff_<Complex Flat File name>

Parallel Job PROCESSING Stages
Transformer                  tr_<Purpose>
BASIC Transformer            btr_<Purpose>
Aggregator                   agg_<Purpose>
Join                         jn_<Purpose>
Merge                        mrg_<Purpose>
Lookup                       lkp_<Purpose>
Sort                         srt_<Purpose>
Funnel                       fnl_<Purpose>
Remove Duplicates            rdup_<Purpose>
Compress                     cps_<Purpose>
Expand                       exp_<Purpose>
Copy                         cp_<Purpose>
Modify                       md_<Purpose>
Filter                       flt_<Purpose>
External Filter              sflt_<Purpose>
Change Capture               ccap_<Purpose>
Change Apply                 capp_<Purpose>
Difference                   diff_<Purpose>
Compare                      cmp_<Purpose>
Encode                       enc_<Purpose>
Decode                       dec_<Purpose>
Switch                       cwt_<Purpose>
Generic                      gen_<Purpose>
Surrogate Key                sur_<Target Column Name>

Parallel Job RESTRUCTURE Stages
Column Import                ci_<Purpose>
Column Export                ce_<Purpose>
Make Subrecord               msub_<Purpose>
Split Subrecord              ssub_<Purpose>
Combine Records              crec_<Purpose>
Promote Subrecord            prec_<Purpose>
Make Vector                  mkv_<Purpose>
Split Vector                 splv_<Purpose>

Containers
Local Container              lc_<functionality>
Shared Container             sc_<functionality>

Others
Stage Variable               s_<StageVariableName>
Sequence Generator           seq_<Target Column Name>

Job Sequences Stages
Job Activity                 ja_<job name without jb and fd#>
Execute Command              ex_<Script function>_<file/detail>
Sequencer                    sq_<Purpose>

8. RUNTIME COLUMN PROPAGATION (RCP)
One of the aims/benefits of RCP is to enable jobs that have variable metadata that is determined at run time. An example would be a generic job that reads a flat file and stores the data into a dataset, where the file name itself is a job parameter. In this case it is not possible to determine the column definitions during the build, and an annotation should make this clear on the job.

However, one of the features that sometimes confuses developers is that, in jobs where RCP is not desired by the developer but the feature is switched on, additional columns can appear in the output dataset that the developer may have thought were dropped. For these reasons developers must turn off RCP within each job unless the feature is explicitly required in the job, as in the above example. RCP should be enabled within the Project Properties (providing the flexibility to use RCP at job level) and, in the event that RCP is required, it can then be turned on at job / stage level.

9. STANDARDISED REJECT HANDLING

9.1 Reject Components
There is a requirement to set up a standard approach to reject handling. Reject processing is not provided as standard within DataStage Enterprise (Parallel) across the majority of stages. There is a reject link on the Lookup stage, but for the remaining stages a standard approach must be introduced and adopted across all stages. This will be achieved by the introduction of a bespoke element (in the form of example stages within template jobs) and through the use of a standardised reject component made available to all developers via a DataStage wrapper.

The standardisation of reject capture allows operational support to easily:
• locate the rejection message and understand the format of the message
• locate and diagnose the reason for rejections
• set tolerances to the numbers of rejects permitted
• allow for the re-processing of rejected rows.

These components are shown in the following diagram:

[Diagram: standardised reject handling components – Driving Data Flow, Records to be Processed, Secondary Data Source]

All stages where a row might be rejected must include a reject link. Three such stages are shown in the diagram (i.e. Join, Lookup and Transform). In the example above, the Lookup stage is shown with a reject link, though this is just as applicable to Join, Transform and other stages. Data flowing down the reject link from a Lookup or Join stage might result from an inability to match keys; from a Transform stage it might result from the validation of data items, for instance where an unexpected value or null is encountered.

In each of these cases, the rejected row is passed down a reject link to a bespoke component that:
1. Passes the row to a dataset in order to facilitate the re-processing of the rejected rows
2. Compiles and passes a standard message (see table below) describing the rejection to the standardised reject handling component
3. Identifies the key of the rejected row and passes this down the relevant link (depending on the key type) to the standardised reject handling component. Where there is no key, zeros are passed down all links intended for key information.

This approach assumes that a key uniquely identifying each failing row is present on driving flows. The standardised reject component takes two inputs (over a possible five input links), creates a surrogate key uniquely defining each reject, and writes the message along with the two keys to a dataset. This reject dataset therefore holds the key from the rejected row (which can be used to cross-reference to the dataset of rejected rows) and a message that will help identify the reason for the rejection.

The reject component will be used with every stage that can fail due to data discrepancies (e.g. join and lookup). The error link from the Lookup stage can be linked directly to the custom error component; in order to facilitate reject handling within the Join stage, further processing is required. This processing requirement is shown in the following diagram. This component will be made available to all developers for use in reject handling as a job template. Reject datasets are uniquely named and created each time the module runs (see below), and the paths where reject datasets are automatically written are date-stamped within a common reject and log directory.

9.2 Customised Reject Messages
The creation of a list of standard error conditions limits the number of exceptions an operator will see, allowing errors to be quickly identified and resolved. Developers are limited to using the messages specified below, thus preventing the creation of random error messages. The following reject messages / conditions will be used:

Reject Message          Description
Lookup / Join Failure   For all referential integrity checking and any other critical
                        Lookups / Joins: keys have not been matched between input
                        links on a Join or Lookup stage. The job and stage name must
                        be included in the message.
Row Count Mismatch      The number of records processed does not match the number of
                        records described in the footer record. This message is
                        particular to import jobs where the input file is validated
                        against the footer record, i.e. there is a mismatch between
                        the number of rows read and the information provided on the
                        footer record. The job and stage name must be included in the
                        message.
Empty File              The input file is empty. This message is particular to import
                        jobs where the input file is validated against the footer
                        record. The job and stage name must be included in the
                        message.
Invalid Field           A field has been identified as containing invalid data. The
                        job, field and stage name must be included in the message.
Null Field              A Not Null field has been identified as containing null
                        values. The job, field and stage name must be included in the
                        message.
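The behaviour of the standardised reject component described above (a surrogate key per reject, zeros where no row key exists, the standard message stored alongside the keys) can be sketched as follows. All names are illustrative assumptions; the real component is a DataStage wrapper, not code of this form.

```python
import itertools

# Illustrative sketch of the standardised reject handling component.
# Each reject receives a surrogate key uniquely defining it; the original
# row key is recorded alongside the standard message, with zeros used
# where no key is available. Field names are invented for illustration.
class RejectHandler:
    def __init__(self):
        self._next_key = itertools.count(1)
        self.reject_dataset = []   # holds surrogate key, row key and message

    def handle(self, row_key, message):
        entry = {
            "surrogate_key": next(self._next_key),       # unique per reject
            "row_key": row_key if row_key is not None else 0,  # zeros when keyless
            "message": message,
        }
        self.reject_dataset.append(entry)
        return entry

handler = RejectHandler()
handler.handle(1001, "Lookup / Join Failure; job=jb_fd01_tr_x; stage=lkp_acct")
handler.handle(None, "Empty File; job=jb_fd02_im_y; stage=sq_input")
print([e["surrogate_key"] for e in handler.reject_dataset])  # [1, 2]
```

The surrogate key is what lets operations cross-reference the message dataset back to the dataset of rejected rows, as described above.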

9.3 Reject Limit
A Reject Limit parameter is included in all jobs. This is used by the standardised reject processing wrapper to test against the total number of errors for a module. On meeting the reject limit, the job, and hence the processing for any given module, is terminated. The reject limit is variable between 0 and 99: a reject limit of 0 (zero) will ABORT ON FIRST REJECT, whilst a reject limit of 99 will NEVER ABORT (on reject). This allows central control of the level of rejects allowed across all modules and jobs used in the Dummy batch.

9.4 Before Routine
The before routine for the first job in a sequence of jobs that implement a module (or for a single job, where there is only one job in a module) will be used to interrogate and increment the number that uniquely identifies the datasets created from the processing of rejects for a particular module.

9.5 Notifications
A notification is the method by which:
• operations are informed of a reject, i.e. in-line notifications
• rejects are communicated between functional streams and / or retained to support the rerunning of modules, i.e. cross functional notifications.
These are described in the following sections.

9.5.1 In-line Notification of Rejects
In-line notifications are those resulting from rejects within a functional processing stream. This will be achieved by using the Notification stage. The last activity within a module will be to email notification of the rejects within that module to operations. A template job will be provided that includes the Notification stage and job parameters that can be tailored such that the names and paths of the reject datasets can be interrogated and the relevant notifications made.

Developers must intercept rejects in the code they generate and produce a standard reject message that contains accurate data and relevant information from the record. A description of the rejects and messages should be made available to operational support, to help diagnose problems encountered when running the batch.
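The Reject Limit semantics can be expressed as a small decision function, assuming the 0/99 special values described above. This is a sketch of the rule, not the wrapper's actual implementation.

```python
# Sketch of the Reject Limit rule: 0 aborts on the first reject, 99 never
# aborts, and values in between abort once the reject count reaches the
# limit. Illustrative only; the real check lives in the DataStage wrapper.
def should_abort(reject_count, reject_limit):
    if reject_limit == 99:      # NEVER ABORT (on reject)
        return False
    if reject_limit == 0:       # ABORT ON FIRST REJECT
        return reject_count >= 1
    return reject_count >= reject_limit

print(should_abort(1, 0))    # True: zero tolerance
print(should_abort(5, 10))   # False: limit not yet met
print(should_abort(500, 99)) # False: 99 never aborts
```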

9.5.2 Cross Functional Notification of Rejects
This type of notification is the means by which rejects are communicated between functional streams, prompting a rerun, and between migration steps (i.e. T14 to T). This 'communication' is built around the feedback from the load process and error log activities, and an understanding of the dependencies between functional areas, i.e. transactions being dependent on accounts etc. In this instance, rejected accounts will be incorporated into the transaction processing process, therefore limiting the transactions processed to those where an account has also been successfully processed.

10. ENVIRONMENT

10.1 Default Environment Variables Standards
DataStage Enterprise Edition allows project / job tuning by means of Environment variables. (Note that DataStage Environment Variables are different to Standard Parameters.) The following DataStage Environment variables must exist in all jobs; the template job, held under Users/Template, already has these parameters defined. These include the setting of the default node configuration file:
• $APT_CONFIGX_FILE = /DataStage/Product/Ascential/DataStage/Configx1.apt (default value used in every job)
  /DataStage/Product/Ascential/DataStage/Configx4node.apt (value overwritten for testing on extra nodes in individual jobs)
• $APT_DUMP_SCORE = false

10.2 Job Parameter File Standards
A generic parameter file, which stores all the default job parameter values including user names and login details, will be used in conjunction with the before-job routine "SetDSParamsFromFile". This will allow project-wide settings to be changed once and avoid unnecessary parameter duplication. The path to this parameter file will be /DataStage/Parameters/<project name> and its name will be parameters.lst.

10.3 Directory Path Parameters
The following parameters must exist in all jobs:
• pDSPATH = /XX/XX/Dummy (DataStage Datasets top-level development directory; there will be equivalents for Testing and Live)
  o XX – base Dummy directory as set by DMCoE
• pITERATION = n (where n is the migration iteration, i.e. from 1 to 9)
• pRUNNUMBER = n (where n is the run number within the iteration, starting from 1)

10.4 Default Directory Path Parameters
The following parameters must exist in all jobs (the template job has these parameters defined):
• pDSPATH = /DataStage/Datasets/DummyDev (DataStage Datasets top-level directory)
• pITERATION = 1
• pRUNNUMBER = 1
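A hedged sketch of how a parameter file of project-wide defaults might be read and applied, in the spirit of the "SetDSParamsFromFile" routine above. The key=value file format and the #param# substitution shown are illustrative assumptions, not the routine's actual implementation.

```python
# Parse a simple key=value parameter file into a dict (assumed format),
# then substitute #pNAME# tokens in a path template with the values.
def load_params(text):
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        key, sep, value = line.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params

def substitute(template, params):
    # Replace each #pNAME# token with its parameter value.
    for key, value in params.items():
        template = template.replace("#" + key + "#", str(value))
    return template

sample = "pDSPATH = /DataStage/Datasets/DummyDev\npITERATION = 1\npRUNNUMBER = 1"
params = load_params(sample)
path = substitute("#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Staging/ds_example.ds", params)
print(path)  # /DataStage/Datasets/DummyDev/1/1/Staging/ds_example.ds
```

Centralising defaults like this is what lets a project-wide setting be changed once rather than per job.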

10.5 Directory & Dataset Naming Standards
UNIX directory paths are set using the following conventions, based on the parameters defined above. Note that the final subdirectories (i.e. "Deliver" and "Internal") are hard coded in the jobs. This is fine because, if the developer mistypes the value, the job will fail immediately as the mistyped directory will not exist.

10.5.1 Functional Area Input Files
Source files will be pushed by the Extract system to the ETL server into a holding area 'Hold' via connect direct software.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Hold/<source_file_name>

10.5.2 Functional Area Output Tables
Datasets that are defined in the Detailed Design as output tables for a functional area are stored in a "Product" directory. This is the directory where downstream Functional Areas (including the Unload process) will go to find input tables from previous areas.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Product/<datasetname>.ds

10.5.3 Functional Area Staging Tables
Datasets that are defined in the Detailed Design as staging tables within an area are stored in a "Staging" directory. This is the directory where other modules within the same Functional Area will go to find staging tables from previous modules.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Staging/<datasetname>.ds

10.5.4 Internal Module Tables
Datasets produced within a module and used only internally within that module are stored in an "Internal" directory. Datasets in this directory are only used within jobs.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Internal/<datasetname>.ds

10.5.5 Datasets Produced from Import Processing
Datasets that are produced by Pre-Processing are stored in a "Source" directory. This is the directory where Functional Areas will go to find input tables from the source.
#pDSPATH#/#pITERATION#/#pRUNNUMBER#/Source/<datasetname>.ds

Reference datasets that are produced by Import Processing are stored in a "Reference" directory. Reference data is not split into iterations.
#pDSPATH#/Reference/<datasetname>.ds

11. METADATA MANAGEMENT
Metadata consists of the record formats for all external files (flat files) and internal files (datasets) processed by DataStage, which are stored in the DataStage Repository (a Metadata repository). Metadata is either created manually within stages (i.e. Flat File, Complex Flat File and Dataset) or imported from sources such as COBOL copybooks. There are two types of Metadata, described below:

11.1 Source and Target Metadata
Record formats will have been pre-defined within the DataStage Repository, describing the record formats of the files that form inputs to import jobs and outputs from unload jobs. This metadata will therefore only be used by import and unload jobs. These record formats are for the convenience of developers (they are described in the FDs and are therefore fixed) and help maintain consistency in the way data is interpreted across all jobs (define once, use many times). This metadata must not be changed by developers. Should a change be required to this metadata, it should first be impact-assessed, to establish the potential effect of the change on the jobs that use the metadata, and processed through standard change control.

11.2 Internal Metadata
Developers will also create metadata describing the datasets that:
• pass data between jobs within a functional area
• pass data between jobs in different functional areas.
This metadata will define the outputs of import jobs, be used by all transform jobs, and define the inputs to unload jobs, and it must be stored in the repository with a name that matches the name of the dataset it describes. Should it be necessary or more efficient to process data in a different way from the way it is presented within the pre-defined metadata, developers may create a job-specific version of the metadata, which must be clearly identified as a variant on the original and saved within the repository.

12. STANDARD COMMON COMPONENTS
The use of Standard Components in developing DataStage jobs will:
• Increase the quality of the code, since the most optimal method will be used for a function that is to be achieved in multiple jobs
• Reduce the complexity of common tasks: productivity is increased and developers can spend more time on tasks which are specific to individual jobs
• Promote reuse, therefore having a positive impact in terms of quality. Certain elements will be common to many jobs and need not be coded yet again.

12.1 Job Templates
DataStage provides intelligent assistance which guides developers through basic DataStage tasks. The Intelligent Assistants are listed below:
• Create a template for a server or parallel job. This can subsequently be used to create new jobs; new jobs will be copies of the original job
• Create a new job from a previously created template
• Create a simple parallel data migration job. This extracts data from a source and writes it to a target.
Not only will the use of templates help in standardisation, but they will also form reusable components. The Dummy project will have templates, each being a job with stages that follow the standards which can be implemented by the use of templates, namely: naming standards, parameters, annotations and reject handling. These jobs, acting as templates, will assist developers in developing new jobs to the above standards.
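The rule in section 11.2, that internal dataset metadata is stored under a name matching the dataset it describes and that job-specific versions are clearly identified as variants, can be sketched as a naming helper. The variant naming scheme shown here is an invented example, not a project standard.

```python
# Hypothetical helper illustrating the metadata naming rule: the repository
# entry name matches the dataset file name (minus the .ds suffix), and a
# job-specific variant is clearly flagged as such in its name.
def repository_name_for(dataset_file, variant_job=None):
    if not dataset_file.endswith(".ds"):
        raise ValueError("parallel datasets use the .ds suffix by convention")
    base = dataset_file[: -len(".ds")]
    if variant_job:
        # Assumed scheme: mark the variant with the owning job's name
        return base + "_variant_" + variant_job
    return base

print(repository_name_for("ds_customer.ds"))                  # ds_customer
print(repository_name_for("ds_customer.ds", "jb_fd03_tr_x"))  # ds_customer_variant_jb_fd03_tr_x
```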

FD07.jobs. FD04. FD02. FD12 FD01. FD02. FD09 FD01. FD03. annotations and reject handling.1. FD07. data file is not empty. FD10.e. FD02. FD09. Since this functionality is common. Join based on S/C & Acc Num ETL Customer Data File performs the same join with many different files i. FD05. these processes will be developed once and will be copied in respective occurrences. The table below identifies such occurrences: Common Process Sort Code Lookup & Split data based on processing centre ETL Redirections Table Load file performs the same join with many different files i. FD11. FD04.g. FD09 FD01. FD09 FD01. namely: parameters. Dummy project will have templates which will be a job with stages following naming standards. FD10. These jobs acting as a template will assist developer to develop new jobs as per mentioned standards. FD13 FD01. 12. These jobs will have functionality of doing sanity checks on received file e. The files that are used in multiple instances are described below: Common Source Files Import Account Selection File Import Customer Selection File Import ETL Customer Data File Import ETL Address Data File Import ETL Customer Pointer File Import ETL DDA Account Data Import ETL TDA Account Data Import TAX Certification File Import ETL Re-directions table load file Functionalities the file is used FD01. In Dummy project the files are repeatedly used in different functionality. FD03. FD11 FD01. FD03. FD09 FD01.doc Functionality FD02. FD03. we will read file only once and create a DataStage datasets. Join based on Customer Num (to get details of associated Account Numbers for each customer) Customer Selection File performs the same join with many different files. FD04. FD07. FD09. FD10. FD06. FD06.e. FD13 12.e.1 Import Jobs Each source file will be read in persistence datasets by separate jobs called import jobs.e. FD03. FD09. FD05. FD02. FD08. FD03. Join based on Customer Num. 55000783. FD05 FD01. 
Join based on Customer Num ETL Customer Pointer File performs the same join with many different files i. i. These datasets will then be used in respective functionalities. FD05. we will build and test one such job and use this architecture in rest. FD11 FD02.2 Transform Jobs Dummy Transform jobs repeatedly perform joins on similar driver files with other data files. FD11. etc. FD05. header and trailer details are consistent with file properties. FD03. Since associated logic for importing and validating files will be same. FD11. FD13 FD01. which can be implemented by the use of templates. FD08. FD12 Page 23 of 41 . FD09 FD01. FD05 FD01. FD08.1. FD05. FD03.
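The Import job sanity checks described above (the received data file is not empty, and the header and trailer details are consistent with the file contents) can be sketched as follows. The header/trailer layout and the "|" delimiter are assumptions for illustration; the real file properties come from the FDs.

```python
# Hypothetical import-file validation: reject an empty file, and reject a
# file whose trailer record count does not match the data records read.
# Assumed layout: one header line, data records, one trailer "TRL|<count>".
def validate_file(lines):
    if len(lines) <= 2:                      # header and trailer only: no data
        return "Empty File"
    data, trailer = lines[1:-1], lines[-1]
    expected = int(trailer.split("|")[1])    # count carried on the footer
    if expected != len(data):
        return "Row Count Mismatch"
    return "OK"

good = ["HDR|20240101", "rec1", "rec2", "rec3", "TRL|3"]
bad  = ["HDR|20240101", "rec1", "TRL|3"]
print(validate_file(good), validate_file(bad))  # OK Row Count Mismatch
```

Returning the standard message names ties these checks back to the customised reject messages defined in section 9.2.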

Further common processes identified are:

Common Process                                              Functionality
Account Selection File performs the same join with many
different files, i.e. Join based on S/C & Acc Num           FD01, FD11
ETL Re-directions Table Load File JOIN WITH
ETL DDA Account Data                                        FD02, FD09
ETL Re-directions Table Load File JOIN WITH
ETL TDA Account Data                                        FD02, FD11

12.1.3 Unload Jobs
Dummy Unload jobs are tasked with creating output files in the format required by the load team. These files will mainly be in mainframe format. Apart from creating files from persistent datasets, these jobs will create the header and trailer details within each file.

12.2 Containers
A container is a group of stages and links. Containers are the means by which standard DataStage processes are captured and made available to many users; they simplify and modularise job designs by replacing complex areas of the diagram with a single container stage. DataStage provides two types of container:
• Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window. Local containers can be used in server jobs or parallel jobs.
• Shared containers. These are created separately and are stored in the Repository in the same way as other jobs. There are two types of shared container:
  o Server shared containers are used in server jobs. They can also be used in parallel jobs, though this can cause bottlenecks in processing, as they are serial only, and should be avoided if possible.
  o Parallel shared containers are used in parallel jobs.
You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job). Shared containers are used just as a developer would use a standard stage.

Some work needs to be done to identify opportunities for reuse within the overall design. Once identified, reusable components will be delivered into the DataStage repository as shared components. The containers identified in the Dummy transform project are described in the table below:

Container         Definition
Reject Handling   The functionality needed is discussed in section 7.
Statistics        Will act on the joins, lookups and active transformations to
                  check records eliminated in process and log them in a separate
                  file.
Report logger     This component will log messages in the specified file. It will
                  take as input the file name and the message to be written.
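The 'Report logger' behaviour in the table above (take a file name and a message, and append the message to that file) can be illustrated with a plain function standing in for the shared container.

```python
import os
import tempfile

# Minimal stand-in for the Report logger container: append one message per
# call to the named log file. The path and messages are illustrative.
def report_log(path, message):
    with open(path, "a") as f:
        f.write(message + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "fd01_rejects.log")
report_log(log_path, "join eliminated 12 records")
report_log(log_path, "lookup eliminated 3 records")
with open(log_path) as f:
    print(f.read().splitlines())  # both messages, in call order
```

Appending rather than overwriting is what lets several joins and lookups in one module share a single elimination log.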

13. DEBUGGING A JOB
The following techniques / options will assist when debugging a job. Debugging essentially involves viewing the data in order to isolate the fault. There are a number of techniques, including:
• Adding a Peek stage, which will output certain rows to the job log
• Adding a filter to the start of the job, to filter out all rows except the ones with the attributes whose behaviour the developer wishes to test or debug
• Adding an additional output to a Transformer with the relevant constraints, and storing the data in a sequential file to be used as part of the investigation. The use of the Copy stage would also be an option
• A variant of the above: adding a parameter pDEBUG, with a value of 1 or 0, that is used as part of the constraint. The resulting debug sequential file will only contain data when pDEBUG=1.

Removing and re-inserting Peeks can often become quite a tedious task. In processing hotspots (parts of a job which could potentially be an area of concern) it is advisable that Peeks be replaced by Copy stages before promoting the jobs to Integration Test, instead of completely removing the stage. The Copy stage is a no-op (non-operator) stage, which means there is no processing cost to having a Copy stage in a job design: this will not impact the processing times of the job. While the job may appear to look overly complex, this improves performance and does not affect or change the function of the code.

All changes to code made for debugging (including Peeks, extra stages and extra parameters) must otherwise be removed prior to the final unit test. Final unit testing must occur on the exact version of the code that is to be promoted to Integration Test.

14. COMMON ISSUES AND TIPS
Common issues faced in the project during development and testing are described in this section. Finally, there is a tips section to assist developers while coding.

14.1 1-way / n-way
Scaling from 1-way to n-way processing is the method employed within DataStage to take advantage of parallelism. Jobs will run n-way when live, in order to achieve the benefits of parallel processing provided by DataStage Enterprise. In order to ensure trouble-free scaling, jobs are built 1-way and unit tested both 1-way and n-way. This ensures that there has been no functional impact in making the switch to parallel processing.

Problems to do with scaling usually become evident when comparing record counts between 1-way and n-way runs: these counts, and the physical records involved, should be the same. If there is a difference, the reasons for this must be examined and corrected. There are many possible reasons for variations in record counts, for instance:
• One of the most common arises where the Join, Lookup and Merge stages (and others) are used. In these situations, care must be taken to ensure that incoming data streams are not only sorted but partitioned the same way. If not, join conditions may not be met because records (with keys that would otherwise match 1-way) are in different partitions and therefore go unmatched. In these situations, records may be unnecessarily rejected (either down a reject link or omitted altogether) and will therefore not flow down the main output link to subsequent stages or into an output dataset.
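The pDEBUG technique above can be pictured as a constraint that only lets rows through to the debug file when pDEBUG=1. The column names and the predicate here are invented for illustration; in DataStage this would be a Transformer constraint, not Python.

```python
# Sketch of a debug output gated by a job parameter, mirroring a
# transformer constraint such as: pDEBUG = 1 AND <condition of interest>.
def debug_filter(rows, p_debug, predicate):
    if p_debug != 1:
        return []                 # debug file stays empty in normal runs
    return [r for r in rows if predicate(r)]

rows = [{"acct": 1, "bal": -5}, {"acct": 2, "bal": 10}]
print(debug_filter(rows, 1, lambda r: r["bal"] < 0))  # [{'acct': 1, 'bal': -5}]
print(debug_filter(rows, 0, lambda r: r["bal"] < 0))  # []
```

Because the gate is a parameter, the debug branch can stay in the job design and be switched off for promoted runs, rather than being physically removed and re-added.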

• An incoming data stream, i.e. a data source or internal dataset (for instance the source system itself, or the output from another job or module), may have been created by another job or module, which may also have been written by another developer. In this case it might contain the required data, but may not be correctly partitioned for the needs of your job. If the problem lies with the source system, this may need to be raised as a data quality issue and corrected at source; if the problem is inherited from an upstream job, a more extensive search may be required in order to find it. Good practice, therefore, unless you can be absolutely sure that the datasets you are using are partitioned correctly for your needs, is to repartition at the start of a job. This might be less efficient, but it is more effective in terms of retaining control over your jobs and the quality of the output data flows.
• A 1-way/n-way issue. Running with multiple nodes means that partitioning comes into play, and issues therefore arise from applying processing rules across multiple partitions. Essentially this is because, when running with a single node, all data flows through a single partition (where the processing rules apply to all the data), usually giving correct results; in many cases running n-way without considering partitioning can lead to incorrect results. In particular, it should be ensured that a dataset used as input on the lookup link to a Lookup stage is partitioned as Entire, to ensure that the entire dataset is available for lookup across all partitions of the main input link; otherwise the lookup may fail simply because the dataset was partitioned incorrectly for the lookup, causing a variation between the actual rows processed and the anticipated number. Alternatively, a stage can be forced to run sequentially, though this may become a bottleneck.
• Duplicate key generation. For instance, if a job is generating a unique key column, the same key may be generated across all partitions and therefore duplicated when the data is collected for output. A sign that this is the case is if the final record count is a multiple of the single-node record count by the number of nodes.

Where possible, partitioning will be considered within the overall solution design, therefore minimising the need for repartitioning. Configuration files are provided for 1-way and 4-way running on the Development server, with 1-way being the default; 4-way processing is specified at job level as an override. The developer must ensure that overrides are removed from their jobs prior to promotion to the Test server. In any event, care must be taken at the unit test stage, and it is always a good idea to have a general understanding of the anticipated throughput of a job before starting the build.

14.2 Duplicate Keys
Often an output table, flat file or internal dataset will contain duplicate keys. Duplicates will often be identified when the output data from a DataStage job, perhaps in the form of a flat file, is loaded into a target database table: the load process will most likely fail if there are duplicate keys in the data, particularly if the target table is uniquely keyed. Another sign that there may be duplicates in the data is when the output of a job or stage (within a job) has more rows in the output stream than would have been thought possible from the inputs.

The key to solving problems related to duplicates is to understand how the duplicates are being generated. This effect may be desirable; where it is not, the cause must be found, for instance a Cartesian join. Where a job generates a unique key column when running n-way, and particularly when defining keys, the partition number can be built into the algorithm for generating the key, therefore ensuring uniqueness across partitions.

14.3 Resource Usage vs Performance
This section concentrates on issues found not only during development but also during the wider Integration, E2E and Performance test stages, particularly discussing the balance that must be achieved between the resources available on the server where the DataStage jobs run and the performance of those jobs.
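Building the partition number into the key-generation algorithm, as suggested above, can be sketched as follows. The exact numbering scheme is an assumption; the point is that no two partitions can produce the same value.

```python
# Sketch of partition-aware key generation for n-way running: each
# partition takes every num_partitions-th value, offset by its own
# partition number, so generated keys cannot collide across partitions.
def generate_keys(rows_per_partition, partition, num_partitions, start=1):
    return [start + partition + i * num_partitions
            for i in range(rows_per_partition)]

p0 = generate_keys(3, partition=0, num_partitions=4)
p1 = generate_keys(3, partition=1, num_partitions=4)
print(p0, p1)  # [1, 5, 9] [2, 6, 10]
```

Run 1-way (num_partitions=1, partition=0) the same formula degenerates to a plain sequence, which is why the duplication only shows up in n-way runs.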

14.3 Resource Usage vs Performance

This section concentrates on issues found not only during development but also during the wider Integration, E2E and Performance test stages, particularly the balance that must be achieved between the resources available on the server where the DataStage jobs run and the performance of those jobs.

DataStage Enterprise (DataStage) starts one Unix process per node, per stage (nodes are defined in the configuration file and can be thought of as logical processors). The effective use of the available processors, and to an extent the total memory usage, is therefore determined by the operating system rather than by DataStage. Clearly, this can lead to an explosion of processes, with the operating system eventually spending more time managing processes than executing code, having a detrimental effect on performance. The key is to run a number of performance tests to determine the optimum number of nodes. A starting point will usually be around 50% of the actual CPUs, though generally the more resource (processors and memory) the better. Total memory usage will be hard to estimate and is best left until the runtime batch has been designed and run; be prepared to increase memory and split jobs if the usage is too great. Finally, be prepared to add further processors to facilitate scaling and improve runtimes.

Within DataStage, the optimum use of parallel (partitioned and piped) data streams is clearly essential, as is the appropriate use of stages within jobs and the elimination of unnecessary repartitioning and sorting.

Partitioning and sorting will take considerable amounts of time during job execution, so where possible these activities should be minimised. Incoming data streams should be partitioned and sorted as far upstream as possible, and the partitioning and sort order maintained for as long as possible. The sort order of the data within a partition will be maintained throughout a job, even when included as an input link to sort-dependent stages such as Dedupe and Join. It is always tempting to sort on the input links of these stages; however, this is completely unnecessary (providing the data is in the correct order already) and time consuming. Similarly, it is tempting to repartition on the input links of stages when specifying Same will suffice (again, providing the data is correctly partitioned already).

The Lookup stage differs from Merge and Join in that it requires the whole of the lookup dataset to be held in memory (Transform also differs slightly). The upper limit is large, though it needs to be considered in the context of the total memory available and what else will be running at the time.

The jury is out as far as the use of the Transform stage is concerned, with arguments for and against. The Transform stage was inherited from the DataStage Server product and is less efficient than other native Parallel stages; for users of DataStage Server it will be familiar and easy to use. As a general rule of thumb, too many Transforms will slow your jobs down, and in this situation Modify for simple type conversions should be considered, reducing the overhead. The native Modify stage is a good alternative but is not consistent with the user interface implemented for other stages. Using several Transforms in sequence is also undesirable; quite often they will 'look' good but could be combined. Common sense is the key: jobs should be efficient but also easy to read and maintain.
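For reference, a parallel configuration file for 1-way running typically has the shape sketched below. The hostname and resource paths here are purely illustrative, not the actual Development server values:

```
{
  node "node1" {
    fastname "devserver"
    pools ""
    resource disk "/data/ds/dataset" { pools "" }
    resource scratchdisk "/data/ds/scratch" { pools "" }
  }
}
```

A 4-way file defines four such node entries, and the file in use is selected through the APT_CONFIG_FILE environment variable.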

14.4 General Tips

General tips used while developing code are listed below:

• Common information like the home directory, system date, username and password should be initialised in a global variable, and the variable then referred to everywhere.

• In Job Sequences, always use the "Reset if required, then run" option in Job Activity stages. (Note: this is not a default option.)

• When mapping a decimal field to a char field or vice versa, it is always better to convert the value in the field using the 'Type Conversion' functions "DecimalToString" or "StringToDecimal" as applicable while mapping.

• Nulls are a curse when it comes to using functions/routines or normal equality-type expressions. NULL = NULL doesn't work; neither does concatenation when one of the fields is null. Changing the nulls to 0 or "" before performing operations is recommended to avoid erroneous outcomes.

• When you need to get a substring (e.g. the first 2 characters from the left) of a character field, use <Field Name>[1,2]. Similarly, for a decimal field use Trim(<Field Name>)[1,2]. When using string functions on a decimal, always use the Trim function, as string functions interpret an extra space used for the sign in a decimal.

• Always use Hash partitioning in Join and Aggregator stages. The hash key should be the same as the key used to join/aggregate. If Join/Aggregator stages do not produce the desired results, try running in sequential mode (verify the results; if they are still incorrect, the problem is with the data/logic) and then run in parallel using Hash partitioning.

• Use the Column Generator stage to create sequence numbers or to add columns with hard-coded values.

• Ensure that the job does not look complex. If there are many stages (more than 10) in a job, divide it into two or more jobs on a functional basis.

• Use containers where stages in the jobs can be grouped together.

• Use Annotations for describing the steps done at stages. Use a Description Annotation as the job title, as the Description Annotation also appears in Job Properties > Short Job Description and in the Job Report when generated.

• Stage Variables allow you to hold data from a previous record while the next record is processed, e.g. allowing you to compare between previous and current records. Stage variables also allow you to return multiple errors for a record of information. By being able to evaluate all the data in a record, rather than erroring on the first exception found, the cleanup of data is more efficient and requires less iteration.
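As a concrete illustration of the null-handling tip, Transformer derivations along the following lines are typical. IsNull and NullToValue are standard Transformer functions; the link and column names are purely illustrative:

```
If IsNull(lnk_in.Balance) Then 0 Else lnk_in.Balance
NullToValue(lnk_in.Forename, '') : ' ' : NullToValue(lnk_in.Surname, '')
```

The second derivation avoids the whole concatenation becoming null when either input field is null.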

• The "Clean-up on failure" property in Sequential File stages must be enabled (it is enabled by default).

15. REPOSITORY STRUCTURE

The DataStage repository is the resource available to developers that helps organise the components they are developing or using within their development. This consists of metadata, i.e. table definitions, the jobs themselves, and specific routines and shared containers. The anticipated repository structure is described in the following sections; however, the structure may change during development, usually evolving to the form in which it is most usable.

15.1 Job Categories

The jobs can be categorised by developer and by FD. The following jobs will be created:

• Import Jobs: Import jobs will be the starting point for transformation. The source file will be read into memory datasets as per the source record layout. Sanity checks on the file and validation of external properties, e.g. size, will be done here. An exception log will be created with records that do not follow the file layout. Source data will then be filtered to process records, and unprocessed data will be maintained in a dataset for future reference. Finally, one or more datasets will be created which will be input to the actual transform process.

• Transform Jobs: Datasets created by import jobs will be processed by the actual transform job. Transform will join two or more datasets and look up data as per the given functionality. All data errors will be captured in an exception log for future reference. Finally, the records will be split as per the destination file and a destination dataset will be created.

• Unload Jobs: Unload jobs will take transform datasets as a source and create the final files required by the load team in the given format.

15.2 Table Definition Categories

The files are categorised into:

• Source/Target Flat-files: The source and target files will be included in this category. These files will be converted into datasets by DataStage jobs and then, after the Transformation process is complete, converted back to target flat files.

• Datasets: Datasets are used as intermediate storage for the various processes. A Dataset can store data being operated on in a persistent form, which can then be used by other DataStage jobs. Datasets can be either Sequential or Parallel. These datasets will be created from the external data by the 'Import' job, and whenever intermediate datasets need to be created for further single/multiple jobs to process.

15.3 Routines

Before and after routines (should they be needed) will be described here. It is anticipated that there will be a small number of these, and therefore no further categorisation is anticipated.

15.4 Shared Containers

Shared containers (as described above) will be described here.

16. COMMON COMPONENTS USED IN DUMMY

16.1 jbt_sc_join

jbt_sc_join is a common component built to meet a specific requirement in the Dummy project: to capture 3 types of records from a Join stage, whereas DataStage offers just 2 outputs from a Join stage. For example, take file A (master) and file B (child). The Join stage of DataStage will give 2 outputs in this case:

• A + B (join records)
• A not in B (reject records)

The common component jbt_sc_join will give 3 outputs:

• A + B (join records)
• A not in B (reject records)
• B not in A (non-join records)

This functionality is illustrated in the flow diagram below:

[Flow diagram: File 'A' (Master) and File 'B' (Child) feed the join jn_A_B; output links lnk_A_B_jn (A + B), lnk_A_B_rej (A not in B) and lnk_A_B_njn (B not in A).]

16.2 jbt_sc_srt_cd_lkp

Sort Code lookup is functionality which is required in many places (in various FDs in Dummy). This component will take a file as input and divide it into 2 files, for north and south separately.
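The same three-way split can be sketched outside DataStage with the Unix join utility, which is sometimes handy for unit-testing expected outputs. This is an illustrative analogue only (join works on sorted, whitespace-delimited text with the key in field 1), not the implementation of jbt_sc_join:

```shell
# emulate the three outputs of jbt_sc_join with the Unix join command;
# both inputs are sorted on the key first, as the component requires
three_way_join() {
  a=$1 ; b=$2 ; out=$3              # out: directory for the output files
  sort "$a" > "$out/A.srt"
  sort "$b" > "$out/B.srt"
  join     "$out/A.srt" "$out/B.srt" > "$out/join.dat"     # A + B
  join -v1 "$out/A.srt" "$out/B.srt" > "$out/reject.dat"   # A not in B
  join -v2 "$out/A.srt" "$out/B.srt" > "$out/nonjoin.dat"  # B not in A
}
```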

[Flow diagram: File 'A' feeds sc_srt_cd_lkp; output links lnk_A_north ('A' - North File) and lnk_A_south ('A' - South File).]

16.3 jbt_env_var

This is a template job with commonly used environment variables imported. It can be used for all the jobs being developed with this set of common environment variables, rather than importing them again and again. The parameter values will be set as per the development environment. The environment variables are shown below:

$ADTFILEDIR: This will contain the audit file and reconciliation reports.
$BASEDIR: This folder is the base directory.
$DSEESCHEMADIR: DSEE schemas that are used by EE jobs using RCP/schema files.
$ITERATION: Current iteration number.
$JOBLOGDIR: This will contain all the error log files generated by DataStage jobs.
$PARMFILEDIR: This folder will contain parameter files that will be looked up by jobs/routines triggered from a common parameter file.
$REJFILEDIR: This will contain all the reject files generated by DataStage jobs.
$SCRIPTDIR: This will contain routine UNIX scripts used for processing files, copying, taking file backups etc.
$SRCDATASET: All the input files will be partitioned and imported into DataStage datasets; this folder will store all the input datasets.
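As an illustration of how these parameters hang together, they could be grouped in a single environment file. The directory names below follow the paths quoted elsewhere in this document, but the exact layout is environment specific and the file name is hypothetical:

```shell
# common_env.ksh (sketch): one place to define the shared directories,
# derived from the base directory and current iteration
export BASEDIR=/wload/dqad/app/data/Dummy_dev
export ITERATION=itr01
export ADTFILEDIR=$BASEDIR/$ITERATION/adtfile
export JOBLOGDIR=$BASEDIR/$ITERATION/errfile
export PARMFILEDIR=$BASEDIR/com/parmfile
export SCRIPTDIR=$BASEDIR/com/script
```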

$SRCFILEDIR: This folder will contain all the input files from the Extract team. All files will be manually copied into this folder.
$SRCFORMATDIR: This folder will contain the copybook formats for input source files. These copybook formats are as per the functional specifications.
$TMPDATASET: This folder will be used to store all the intermediate files created during transform jobs.
$TRGDATASET: This folder will be used for storing output DataStage dataset files.
$TRGFILEDIR: This folder will contain all the transformed output files, which can be loaded to Bank B's mainframe.
$TRGFORMATDIR: This folder will contain the copybook formats for output files.

16.4 jbt_annotation

This is a template job where annotations are used for describing the steps done at stages. A Description Annotation is used as the job title, as the Description Annotation also appears in Job Properties > Short Job Description and in the Job Report when generated.

16.5 Job Log Snapshot

JobLogSnapShot.ksh is a script which will create the log file (as seen in DataStage Director) of a job's latest run. The following parameters need to be hard-coded in the script as per the environment:

DSHOME=/wload/dqad/app/Ascential/DataStage/DSEngine
PROJDIR=/wload/dqad/app/Ascential/DataStage/Projects/Dummy_dev
LOGDIR=/wload/dqad/app/data/Dummy_dev/itr01/errfile

DSHOME is the DataStage home path. PROJDIR is the project directory in which the job exists. LOGDIR is a common directory where the log file will be created.

The script will be called from the after-job subroutine of a job:

ksh /wload/dqad/app/data/Dummy_dev/com/script/JobLogSnapShot.ksh $1

$1 is the input parameter: the name of the job whose latest job log is required.

The job log file will be created in:

/wload/dqad/app/data/Dummy_dev/itr01/errfile/<Job name>_log_<time stamp>.txt

Sample job log:

[sample job log output omitted]
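The core of such a script might look like the sketch below. This is a hypothetical reconstruction: it assumes the standard dsjob command line (dsjob -logsum prints a summary of a job's latest run log), and the function wrapper plus the DSJOB override variable are additions for illustration and testability, not part of the documented script.

```shell
# JobLogSnapShot.ksh core logic (sketch); the three directories are
# hard-coded per environment, as described above
DSHOME=${DSHOME:-/wload/dqad/app/Ascential/DataStage/DSEngine}
PROJDIR=${PROJDIR:-/wload/dqad/app/Ascential/DataStage/Projects/Dummy_dev}
LOGDIR=${LOGDIR:-/wload/dqad/app/data/Dummy_dev/itr01/errfile}

job_log_snapshot() {
  jobname=$1
  stamp=$(date +%Y%m%d%H%M%S)
  project=$(basename "$PROJDIR")
  # dsjob -logsum dumps a summary of the latest run's log for the job;
  # DSJOB can be overridden for testing outside a DataStage environment
  "${DSJOB:-$DSHOME/bin/dsjob}" -logsum "$project" "$jobname" \
    > "$LOGDIR/${jobname}_log_${stamp}.txt"
}
```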

16.6 Reconciliation Report

Reconcilation.ksh is a script which will create the reconciliation report of the respective functional area (FD). The script will be called from an Execute Command stage of a Job Sequence:

ksh /wload/dqad/app/data/Dummy_dev/com/script/Reconcilation.ksh $1 $2

$1 is the 1st input parameter: FD##
$2 is the 2nd input parameter: the .ini file name (not path)

Specifications of the .ini file:

Path: /wload/dqad/app/data/Dummy_dev/com/parmfile

The .ini file will contain the following fields separated by a | sign:

• The type of the file, i.e. Input, Output, Reject or Non-Join. Example: INP, OUT, REJ or NJN. Note: the entries should be in sorted order. Also, the input files will be datasets, the output files will be EBCDIC files, and the reject and non-join files will be in ASCII format.
• The name of the file whose report is to be prepared.
• The description of the file whose report is to be prepared.
• The record length of the file (this is needed only for the output EBCDIC file).

Sample .ini file:

INP|fd01_customer_pointer_file|Customer Pointer dataset created from source file
INP|fd01_customer_data_file|Customer Data dataset created from source file
OUT|fd01_redirection_file|Output redirection file|117
REJ|fd01_duplicates_file|Reject file containing duplicated account numbers
NJN|fd01_account_nonjoin|Nonjoin files from the join stage in job1

The reconciliation report will be created in:

/wload/dqad/app/data/Dummy_dev/itr01/adtfile/<FD##>_recon_<time stamp>.txt
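A sketch of how a script might pull apart one line of this pipe-separated specification (the output field names here are illustrative, not the report format):

```shell
# split one .ini line into its fields: TYPE|file|description[|record length]
parse_recon_line() {
  echo "$1" | awk -F'|' '{ printf "type=%s file=%s len=%s\n", $1, $2, $4 }'
}
```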

Sample reconciliation report:

[sample reconciliation report omitted]

16.7 Script Template

All scripts are made according to this template script. It has a script description and also a section for maintaining the modification history of the script. The script name is:

/wload/dqad/app/data/Dummy_dev/com/script/ScriptTemplate.ksh

16.8 Split File

SplitFile.ksh is a script which will split the input file into header, detail and trailer files. The script will be called from an Execute Command stage of a Job Sequence (Import sequence):

ksh /wload/dqad/app/data/Dummy_dev/com/script/SplitFile.ksh $1 $2

$1 is the 1st input parameter: <Input file name without extension>
$2 is the 2nd input parameter: <Record length>

This requires the file name to have a .dat extension. The input file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat. The header, detail and trailer files created will be $1_hdr.dat, $1_dtl.dat and $1_trl.dat respectively, all output in /wload/dqad/app/data/Dummy_dev/itr01/opfile/.

16.9 Make File

Make_File.ksh is a script which will merge the header, detail and trailer records to create the target file. The script will be called from an Execute Command stage of a Job Sequence (Unload sequence):

ksh /wload/dqad/app/data/Dummy_dev/com/script/Make_File.ksh $1

$1 is the 1st input parameter: <Target file name without extension>

This requires the header, detail and trailer files to be named $1_hdr.dat, $1_dtl.dat and $1_trl.dat respectively.
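The core logic of the two scripts can be sketched as follows. This sketch assumes newline-delimited records with exactly one header record and one trailer record; the real scripts also take the record length parameter into account, which is not modelled here:

```shell
# SplitFile sketch: first record -> header, last -> trailer, rest -> detail
split_file() {                    # $1 = file name without .dat, $2 = directory
  head -n 1 "$2/$1.dat"   > "$2/$1_hdr.dat"
  sed '1d;$d' "$2/$1.dat" > "$2/$1_dtl.dat"
  tail -n 1 "$2/$1.dat"   > "$2/$1_trl.dat"
}

# Make_File sketch: concatenate the three parts back into the target file
make_file() {                     # $1 = file name without .dat, $2 = directory
  cat "$2/$1_hdr.dat" "$2/$1_dtl.dat" "$2/$1_trl.dat" > "$2/$1.dat"
}
```

Splitting and then re-making a file should reproduce the original, which makes the pair easy to verify in isolation.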

All these files ($1_hdr.dat, $1_dtl.dat and $1_trl.dat) will have to be present in /wload/dqad/app/data/Dummy_dev/itr01/opfile/. The output file will be /wload/dqad/app/data/Dummy_dev/itr01/opfile/$1.dat.

16.10 jbt_import

This template job processes the header, detail and trailer records created by the SplitFile.ksh script described in 16.8. The header and trailer data is validated, and the detail records are written to a dataset to be processed in the transform job.

The validations done on the header are:

• The file header identifier must contain the value 'HDR-TDAACCT'
• The file header date must equal the T-14 migration date

The validations done on the trailer are:

• The file trailer identifier must contain the value 'TRL-TDAACCT'
• The file trailer creation date must equal the file header creation date
• The file trailer record count must equal the total number of records on the input file, including the header and trailer records
• The file trailer record amount must equal the sum of the Closing Balance field from every record on the input file, excluding the header and trailer records. The accumulation of the Closing Balance field must be performed using an integer data format, allowing for overflow.

If any of the above checks fail, then processing should be immediately aborted with a relevant fatal error message. This is implemented using the subroutine AbortOnCall.

Note: These header/trailer validations are for FD01. They will vary (slightly, though) for other FDs, but the common approach shown in the template can be taken.
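Two of these checks, the trailer record count and the amount accumulation, can be sketched in shell terms as below. The Closing Balance is assumed here to be the last pipe-delimited field of each detail record, which is an illustrative layout rather than the FD01 copybook:

```shell
# verify the trailer record count (including header and trailer) and the
# sum of the balance field across the detail records
check_trailer() {
  in=$1 ; expect_count=$2 ; expect_amount=$3
  count=$(wc -l < "$in")
  amount=$(sed '1d;$d' "$in" | awk -F'|' '{ s += $NF } END { print s+0 }')
  test "$count" -eq "$expect_count" && test "$amount" -eq "$expect_amount"
}
```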


16.11 jst_import

This template job sequence calls the following components:

• SplitFile.ksh as described in 16.8
• jbt_import as described in 16.10

This sequence template will split the source file into 3 different files (header, detail and trailer) and call the import job, which will do the necessary validation and create a detail dataset.

16.12 jbt_unload

This template job illustrates the creation of header and trailer records. The trailer consists of a record count and a hash count. This template mainly implements the following logic:

• Total number of records on the file (excluding header and trailer)
• Hash of account numbers from all detail records on the file
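The two trailer values can be sketched as a single awk pass. The account number is assumed here to be the first pipe-delimited field of each detail record, and the 'hash' is taken as a simple additive hash total; both are assumptions for illustration:

```shell
# print "<record count> <hash total>" for the detail records of a file
# whose first and last lines are the header and trailer
trailer_totals() {
  sed '1d;$d' "$1" | awk -F'|' '{ n++; hash += $1 } END { printf "%d %d\n", n+0, hash+0 }'
}
```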


16.13 jst_unload

This template job sequence calls the following components:

• jbt_unload as described in 16.12
• MakeFile.ksh as described in 16.9
• Reconciliation report as described in 16.6

This sequence template will create 3 different files (header, detail and trailer) and call the script which will combine these 3 files to create the target file. The reconciliation report is also created.

16.14 jbt_abort_threshold

The Abort Threshold template will abort a job based on a threshold value passed as a job parameter. It uses a common routine called "AbortOnThreshold". This is used in places where a job needs to be aborted on a particular number of reject records. The routine has to be called from a BASIC Transformer:

AbortOnThreshold (@INROWNUM, <Threshold Value>, DSJ.ME)

Here <Threshold Value> is the job parameter. For example, if you give a Threshold Value of 5, the job will abort after 4 records pass through the BASIC Transformer.
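The routine's logic amounts to a threshold comparison per row, which can be sketched in shell terms. The function name is illustrative, and the boundary behaviour follows the example above: a threshold of 5 lets 4 rows through and aborts on the 5th.

```shell
# signal an abort once the row number reaches the threshold
abort_on_threshold() {
  rownum=$1 ; threshold=$2
  if [ "$rownum" -ge "$threshold" ]; then
    echo "abort: threshold $threshold reached at row $rownum" >&2
    return 1
  fi
  return 0
}
```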