IBM Global Business Services

Business Intelligence (BI) Development Toolkit for DataStage

Duration of course: X hours

© Copyright IBM Corporation 2006

IBM Global Business Services

Course Objective

At the completion of this course you should be able to understand:
- An overview of the processes followed in a standard development project.
- The various phases, and the related work products, associated with the development process.
- The importance of generating the various work products.
- Standards, best practices, and tips & tricks specific to the tool.
- Insight into different types of projects.
- Different types of testing.

IBM Global Business Services

Course Content

Module 1: DataStage Low Level Design
Module 2: DataStage Coding Standards
Module 3: DataStage Best Practices - Tips & Tricks
Module 4: Version Control

IBM Global Business Services

BI Development Toolkit for DataStage
Module 1: DataStage Low Level Design

IBM Global Business Services

Module Objectives

At the completion of this chapter you should be able to:
- Understand the concept of the Low Level Design process.
- Know what a Low Level Design document looks like.

IBM Global Business Services

Low Level Design: Agenda

Key points described in the Low Level Design:
- Topic 1: Introduction
- Topic 2: Objectives/Purpose
- Topic 3: Scope
- Topic 4: Core Aspects of Design
- Topic 5: Low Level Technical Overview
- Topic 6: Low Level Technical Design

IBM Global Business Services

DW/BI Development Process Flow

[Process flow diagram: Solution Outline -> Design (Macro & Micro) -> Build & Unit Test -> Deployment. The flow shows the functional specification and estimation being signed off, the detailed technical design going through peer review and QA checkpoints, coding and unit testing by the developer with rework until the peer review passes, offshore knowledge-transfer workshops, TPR/SCR logging, and onsite acceptance testing (UAT/System Test/Integration Test) with issue resolution. The technical specification work is split onsite/offshore; "development complete" is reached offshore.]

IBM Global Business Services

What is a Low Level Design?

The Low Level Design details all the technical aspects involved in the DataStage ETL process with respect to the following:
- Source/Target Names and Locations: the name of the source/target table or file, plus schema details for tables or server details for files.
- Source/Target Structures (i.e. table structure or file structure): the field names in a table along with their datatypes, or, for a flat file, whether it is delimited or fixed width.
- Source To Target Mapping: explains how data flows from source to target.

IBM Global Business Services

What is a Low Level Design? (continued)

- QA checks to find any data quality issues.
- Jobs/Sequences/Master Sequencer details: the names of the jobs, sequences and master sequencers, along with the transformation details.
- Partitioning information, if any.
- Scheduling information, etc.

IBM Global Business Services

Sample Low Level Design

[Embedded sample document: LLD_Template]

IBM Global Business Services

Key Points

Step overview: this shows the key elements, e.g. the inputs, outputs, and key activities involved, along with the artifacts.
- Inputs: High Level Design.
- Key activities: analysis of the High Level Design; identifying the key elements to be included in the Low Level Design; understanding the entire flow from source to target, along with the mapping rules.
- Outputs: Technical Specification.
- Roles: Developer.
- Templates and sample artifacts: sample artifact.

IBM Global Business Services

Questions and Answers

IBM Global Business Services

BI Development Toolkit for DataStage
Module 2: DataStage Coding Standards

IBM Global Business Services

Module Objectives

At the completion of this chapter you should be able to:
- Know the job-level naming conventions used in DataStage.
- Know the parameter naming conventions used in DataStage.
- Identify the key coding standard principles.
- Know the proper use of environmental/generic parameters as a standard practice.
- Know proper documentation standards and commenting within the job.

IBM Global Business Services

DataStage Coding Standards: Agenda

Topic 1: Coding Standards
- DataStage repository structure
- ETL coding standard guidelines

Topic 2: Naming Conventions
- Job naming conventions
- Stage naming conventions
- Link naming conventions
- Container naming conventions
- Parameter naming conventions


IBM Global Business Services

Coding Standards

What is a coding standard? A set of rules or guidelines that tells developers how they must write their code. Instead of each developer coding in their own preferred style, everyone writes code aligned to the ETL standards, ensuring consistency of the designed ETL application throughout the project.

Benefits:
- Reduces development time.
- Provides a template to follow.
- Enables new members of the team to pick up development quickly.
- Enables multiple teams/team members to work on multiple phases.
- Allows flexibility in exchanging team members between the Data Conversion and the Data Warehouse/Reporting teams.
- Serves as a basis (after completion of the pilot project) for the development of jobs for all other countries.
- Improves maintainability.
- Makes use of the GUI and the self-documenting nature of the tool.

IBM Global Business Services

Coding Standards

Repository structure: the repository is the central storing place for 'build'-related components, and a key part of the software when developing jobs in DataStage Designer.
- Data Elements: a specification that describes the type of data in a column and how the data is converted (server jobs only).
- Jobs: a folder for jobs that are built, compiled and run.
- Routines: the BASIC language can be used to write custom routines that can be called within server jobs. Routines can be re-used by several server jobs.
- Shared Containers: a shared container is a re-usable item stored in the repository and available to any job in the project.
- Stage Types: any stage used in a project; this can be a data source, a data transformation, or a data warehouse stage.
- Table Definitions: a definition describing the data you want, including information about the data table and the columns associated with it. Also referred to as metadata.
- Transforms: similar to routines, these take one value and compute another value from it.

IBM Global Business Services

Coding Standards

ETL coding standard guidelines:
- By using a simple repository structure, it is easier to navigate and find the components that are needed to build a job.
- It is a good idea to set up a folder structure based on a common feature, notably the architectural area. These groups in turn can be divided into subgroups (and thus subfolders), and, if a number of complicated schedules are used, the subfolders can also show the flow of jobs.
- For each of these groups a Jobs and a Sequences folder is created; thus, for each group, two separate folders are created under the Jobs folder.
- If multiple versions of a source system are supported, it is a good idea to reflect the version number in the folder name, so that it is clear which version the corresponding jobs, sequences and templates were written for.
- Templates are stored in a separate Templates folder directly under the Jobs folder. A small number of templates is expected to suffice for creating jobs at all levels, so there is no need to create specific template folders at every level.
- Thoughtful naming of jobs and categories will help the developer understand the structure.

IBM Global Business Services

Coding Standards

Job Templates: each project should contain job templates, in order to ensure that jobs are created with the proper set of job parameters and the correct job parameter names. These job templates are stored in a separate Templates folder directly under the Jobs folder.

Jobs and Sequences: jobs can be grouped into folders based on a common feature, notably the architectural area they belong to. These groups in turn can be divided into subgroups (and thus subfolders). Thus, for each group, a separate folder is created under the Jobs folder.

Table Definitions: the Table Definitions section contains metadata which can be imported from a number of sources, e.g. Oracle tables or flat files. The folders that this metadata is stored in must represent the physical origin or destination of a table or file. The recommended naming standard (and the default for ODBC) is:
- 1st subfolder: database type (ODBC, Universe, DSDB2, ORAOCI9)
- 2nd subfolder: database name

IBM Global Business Services

Coding Standards

Hash Files: hash files can be stored either in Universe, or in the file system of the operating system.

Sequential Files: a DataStage project will potentially use source, target, and intermediate files. These can be placed in separate directories. This will:
- Simplify maintenance.
- Allow for closer monitoring of the file system.
- Aid housekeeping processes.
- Allow data volumes to be spread evenly across multiple disks.
- Allow for closer monitoring of data flow.

IBM Global Business Services

Naming Conventions

What is a naming convention? An industry-accepted way to name various objects. A variety of factors are considered when assessing the success of a project, and naming standards are an important, but often overlooked, component. An appropriate naming convention establishes consistency in the repository.

Benefits:
- Facilitates smooth migrations and improves readability for anyone reviewing or carrying out maintenance on the repository objects.
- Provides a developer-friendly environment.
- Helps in understanding the processes being affected, thereby saving significant time.

IBM Global Business Services

Naming Conventions

The following pages suggest naming conventions for various repository components. Whatever convention is chosen, it is important to make the selection very early in the development cycle and to communicate the convention to the project staff working on the repository. The policy can be enforced by peer review, and at test phases, by adding convention checks both to test plans and to test execution documents.

IBM Global Business Services

Project Naming Conventions

Component/Parameter: Project Name
Suggested naming convention: typically a project contains a set of sequences, jobs, routines, table definitions, etc. This may be a particular release or version, and is very much dependent on the project circumstances. The project name cannot contain spaces or punctuation. A distinction is made according to the project stage (Development, Test, Acceptance, Production), which is appended to the project name in an abbreviated (three-character) format.

IBM Global Business Services

Job Naming Conventions

The job names used are very much dependent on the project, and job names have to be unique across all folders. Usually job names contain a subject area (the target table) and possibly a job function (load, update, transform, clear, etc.). For projects, the standard chosen is:

<job function>_<target_table>
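For example, with hypothetical target table names, the convention yields job names such as:

   load_CUSTOMER_MASTER    (loads the CUSTOMER_MASTER target table)
   update_ORDER_DETAIL     (updates the ORDER_DETAIL target table)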

IBM Global Business Services

Stage Naming Conventions

Passive stages: a passive stage indicates a data component, such as a sequential file, an Oracle table, or an ODBC source. (In active stages, by contrast, some kind of processing occurs, such as sorting, transforming, or aggregating.)

Generic convention: <data source type>_<data source name>, where data_source_type is a two- to four-character (preferably three-character) abbreviation that is as clear and unambiguous as possible.

Sequential File: Seq_<data source name>
Complex Flat File: Cff_<data source name>
Hash File: Hsh_<data source name>

IBM Global Business Services

Stage Naming Conventions

XML file: Xml_<data source name>
Oracle database: Ora_<data source name>
DB2 database: DB2_<data source name>

IBM Global Business Services

Stage Naming Conventions

ODBC source: Odbc_<data source name>
File transferred via FTP: Ftp_<data source name>
Siebel DA: Sbl_<data source name>
Dataset: Ds_<data source name>

IBM Global Business Services

Stage Naming Conventions

Active stages: in active stages some kind of processing occurs, such as sorting, transforming, or aggregating.

Generic convention: <stage_type>_<functional_name>. In the case of a transformation, the functional_name typically consists of a verb (indicating the action that is performed) and a noun (the object of the action).

Command: Cmd_<functional_name>
Aggregator: Agg_<functional_name>
Folder: Fld_<functional_name>

IBM Global Business Services

Stage Naming Conventions

Filter: Fltr_<functional_name>
Inter Process: Ipc_<functional_name>
Link Partitioner: Lpr_<functional_name>
Lookup: Lkp_<functional_name>

IBM Global Business Services

Stage Naming Conventions

Merge: Mrg_<functional_name>
Sort: Srt_<functional_name>
Transformer: Xfm_<functional_name>

IBM Global Business Services

Stage Naming Conventions

Change Data Capture: Cdc_<functional_name>
Funnel: Fnl_<functional_name> (or Club_<functional_name>)
Join: Join_<functional_name>

IBM Global Business Services

Stage Naming Conventions

Surrogate Key Generator: SKey_<functional_name>
Remove Duplicates: Ddup_<functional_name>
Copy: Cpy_<functional_name>

IBM Global Business Services

Link Naming Conventions

Links must have a descriptive name. Unlike the stages, they start with a lowercase letter and omit the stage type. If possible, let the name resemble the preceding stage name, using the past participle of the verb used in that stage name.

Examples:
- enrichedCustomer
- sortedOrders

IBM Global Business Services

Container Naming Conventions

Shared Containers: the names of shared containers start with Scn_, followed by a meaningful name describing their function.

Local Containers: the names of local containers start with Lcn_, followed by a meaningful name describing their function.

Stage Variables: a stage variable is an intermediate processing variable that retains its value during a read but does not pass its value to a target column. Stage variable names start with stg_ and reflect their usage. A standard must be set so that common stage variables are named consistently.

IBM Global Business Services

Parameter Naming Conventions

Parameters: a parameter name should clearly reflect its usage.

General: the general naming convention is P_<name>.

Database parameters:
Data Source Name: P_DB_<logical db name>_DSN
User identification: P_DB_<logical db name>_USERID
User authentication password: P_DB_<logical db name>_PASSWORD
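As an illustration, for a hypothetical logical database name STG1, the convention yields:

   P_DB_STG1_DSN
   P_DB_STG1_USERID
   P_DB_STG1_PASSWORD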

IBM Global Business Services

Parameter Naming Conventions

For directory (path) parameters the convention is P_DIR_<usage>. The following directory parameters have been identified:

Source data for the job: P_DIR_INPUT
Destination directory: P_DIR_OUTPUT
Directory for temporary DS files: P_DIR_TEMP
Directory for error-reporting files: P_DIR_ERRORS
Directory where csv and other reference data is held: P_DIR_REF

IBM Global Business Services

DataStage Coding Principles and Standards

Suggested methods of working:
- Before editing a job, verify that the job in development is identical to the one in production. If not, request a copy from the production system, so that there are no misunderstandings as to which is the correct job.
- Create a backup copy of the job you are going to edit, so that you are able to return it to its original state if needed.
- After development has finished, clean up any backup copies of jobs you have created.

IBM Global Business Services

Documentation Practices in a Job

Incorporating comments: one challenge of internal software documentation is ensuring that the comments are maintained and updated in parallel with the source code. Although properly commenting source code serves no purpose at run time, it is invaluable to a developer who must maintain a particularly intricate or cumbersome piece of software.

Job commenting: document all jobs in their Job Properties.
- Provide a short description containing a brief, meaningful description.
- Provide a long description containing a history of versions: date, changes made, and by whom.
- Include a reference to the design, including its version.
- Document any special file references.
- When modifying jobs, always keep the short and long descriptions in the Job Properties up to date.

IBM Global Business Services

Documentation Practices in a Job

Routines and functions:
- Routines and functions are documented in the short and long description fields (as jobs are), and in the code via comments.
- The comments in the short and long description fields (on the General tab) follow the same rules as job comments: provide a short, meaningful description; provide a long description containing a history of versions (date, changes made, and by whom); include a reference to the design, including its version; document any special file references; and always keep the descriptions up to date when modifying the routine.

IBM Global Business Services

Suggested Coding Principles

- Avoid clutter comments, such as an entire line of asterisks; instead, use white space to separate comments from code. Avoid surrounding a block comment with a typographical frame: it may look attractive, but it is difficult to maintain.
- Comment as you code, because you will not likely have time to do it later. Also, should you get a chance to revisit code you have written, that which is obvious today probably will not be obvious six weeks from now.
- Use complete sentences when writing comments. Comments should clarify the code, not add ambiguity.
- Comment anything that is not readily obvious in the code.
- To prevent recurring problems, always use comments on bug fixes and workaround code, especially in a team environment.
- Use comments on code that consists of loops and logic branches. These are key areas that will assist source code readers.
- Establish a standard size for an indent, such as three spaces, and use it consistently. Align sections of code using the prescribed indentation.

IBM Global Business Services

Use of Parameters

Definition: job parameters allow you to design flexible, reusable jobs, making a job independent from its source and target environments. Instead of entering constants as part of the job design, you can set up parameters which represent processing variables.

If, for example, we want to process data using a certain userid and password, we can include these settings as part of the job design; however, when we then want to use the job again in a different environment, we would most likely have to edit the design and recompile the job. Parameters avoid this.

." button.. Step 4->Click on the "User Defined" folder to see the list of job specific environment variables. Step 2 ->Choose the project and click the "Properties" button. Step 5->Type in all the required job parameters that are going to be shared between jobs 43 | 1/29/2011 © Copyright IBM Corporation 2006 . Step 3-> On the General tab click the "Environment.IBM Global Business Services Use of parameters Creating Project Specific Environment Variables : Here are the steps to standard steps to follow: Step 1 -> Start up DataStage Administrator.

" button (which doesn't add an environment variable but rather brings an existing environment variable into your job as a job parameter). Step 4-> Add these job parameters just like normal parameters to stages in your job enclosed by the # symbol. csv 44 | 1/29/2011 © Copyright IBM Corporation 2006 ..IBM Global Business Services Use of parameters Using Environment Variables as Job Parameters : Step 1->Open up a job. for example: ± Database=#$DW_DB_NAME# ± Password=#$DW_DB_PASSWORD# ± File=#$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#. Step 2->Go to Job Properties and move to the parameters tab. Step 3-> Click on the "Add Environment Variables..

IBM Global Business Services

Use of Parameters

Points to note:
- When the job parameter is first created, it has a default value equal to the Value entered in the Administrator. We set the default value of the new parameter to "$PROJDEF" to ensure it is set dynamically each time the job is run: by changing the value to $PROJDEF you instruct DataStage to retrieve the latest value for this variable at job run time.
- Set the value of encrypted job parameters to $PROJDEF as well; it needs to be typed in twice in the password entry box.
- The "View Data" button will not work in server or parallel jobs that use environment variables set to $PROJDEF or $ENV. This is a defect in DataStage.
- It may be preferable to use environment variables in sequence jobs and pass them to child jobs as normal job parameters, e.g. in a sequence job, $DW_DB_PASSWORD is passed to a parallel job via the parameter DW_DB_PASSWORD.


IBM Global Business Services

Application Examples

Environment:
- Database name, username, password: database names and access details can vary between environments, or change over time. By parameterising these at project level, any change can be applied quickly without updating or recompiling all jobs.
- File names and locations: all file names and locations were specific to each run; the filenames themselves were hard-coded, but the file batch and run reference, and the related locations, were parameterised.

IBM Global Business Services

Application Examples

Process flow: parameters can be entered manually at runtime; however, to avoid data-entry errors and to speed up turnaround, parameter files were pre-generated and loaded within DataStage with minimal manual input.

Generic parameters: it is often seen that a number of parameters apply across the whole project. These relate either to the environment or to specific business rules within the mappings. For example:
- MIGRATIONDATE: "set to the date the extract was taken"
- TARGETSYSTEM: "set to the test environment name due to be loaded with data from this run"

IBM Global Business Services

Questions and Answers

IBM Global Business Services

BI Development Toolkit for DataStage
Module 3: DataStage Best Practices / Tips and Tricks

IBM Global Business Services

Module Objectives

At the completion of this chapter you should be able to:
- Describe DataStage best practices and tips.
- Define DataStage best practices and tips.
- Demonstrate DataStage best practices and tips.

IBM Global Business Services

DataStage Best Practices / Tips and Tricks: Agenda

1. Getting Started
2. Prerequisites
3. Overview of the Data Migration Environment Implemented
4. Estimating a Conversion
5. Preparing the DS Environment: Creating Project Level Parameters
6. Designing Jobs
   - General Design Guidelines
   - Sample Job Template
   - Extracting Data
   - Transforming the Extracted Data / Using Transformer: performing lookups and a lookup stage problem; Transformer compared to dedicated stages; tips on sorting and removing duplicates; null handling; when to configure nodes and partitioning; ensuring restartability

IBM Global Business Services

DataStage Best Practices / Tips and Tricks: Agenda (continued)

6. Designing Jobs (continued)
   - Capturing Rejects
   - Loading Valid Data
   - Sequencing the Jobs
   - Job Sequence vs Batch Scripts
   - Tips: Releasing Locked Jobs
   - Mapping Multiple Stand-alone Jobs in One Single Job
   - Dataset Management
   - Ensuring Restartability
   - Unit Testing of the Modules
7. Troubleshooting
   - Some Debugging Techniques
   - Oracle Error Codes in DataStage
   - Common Errors and Resolution
   - Tips: Message Handler
   - Local Runtime Message Handling in Director
   - Tips: Job Level and Project Level Message Handling
   - Using Job Level Message Handler


IBM Global Business Services

DataStage Best Practices / Tips and Tricks: Agenda (continued)

8. Preparing UTP: Guidelines
9. Maintenance Activity
   - Backup and Version Control Activity (including Version Control in ClearCase)
   - DS Auditing Activity (including Retrieving Job Statistics)
   - Performance Tuning of DS Jobs
   - Assuring Naming Conventions of Components, Jobs and Categories

IBM Global Business Services

DataStage Best Practices / Tips and Tricks: Agenda (continued)

9.1 Backup and Version Control Activities
   - Taking whole project backup
   - Taking job-level export
   - Taking folder-level export
   - Version control in ClearCase
9.2 DS Auditing Activity
   - Tracking the list of modified jobs during a period
   - Retrieving job statistics: getting the row counts of different jobs

IBM Global Business Services

DataStage Best Practices / Tips and Tricks: Agenda (continued)

9.3 Performance Tuning of DS Jobs
   - Analyzing a flow
   - Measuring performance
   - Designing for good performance
   - Improving performance
9.4 Assuring Naming Conventions of Components, Jobs and Categories
9.5 Scheduled Maintenance

IBM Global Business Services

1. Getting Started

In a typical Data Migration environment, we have defined the roadmap to implement the design using WebSphere DataStage, along with some tips and tricks acquired through experience:
- Designing the architecture
- Preparing the DS environment
- Job Development Phase: creating the estimation model
- Job Development Phase: designing the job template
- Job Development Phase: delivering code modules
- Job Enhancement Phase: version control
- DataStage auditing activity
- DataStage maintenance activity

IBM Global Business Services

2. Prerequisites

The following documents should be in place before we jump into job development:
1. DataStage Estimation Model
2. DataStage Naming Convention Standards to be followed
3. Job Design Templates
4. Approach towards Backup and Version Control Activity
5. Issue Checklist template
6. Job Review Checklist template
7. Unit Testing Template

IBM Global Business Services

3. Overview of the Data Migration Environment

DataStage requirement: cleansed data is populated into staging area 0 from the stage "Legacy" (which holds the cleansed records from the legacy systems). Client-specific business rules have to be validated primarily during the stage 0 to stage 1 load.

- Staging area 0: holds tables for loading master records, transactional records and configuration data.
- Staging area 1: holds the same tables as stage 0, but the data model can have small differences. The remaining validations can be applied here.
- Staging area 2: similar to the Oracle ERP tables, and loaded with stage 1 records. Staging 2 is the final target of the DataStage load; stage 2 records can be used by other applications to load finally into the target ERP.

Apart from that, there are tables for storing error records and the status of each run; we call them CNV_LOG and CNV_RUN respectively. The job repository tables (discussed in the auditing section) are also stored here.

IBM Global Business Services

4. Estimating a Conversion

1. An overview of the load job designs needs to be chalked out. The design of the lookup jobs should be explored (is there scope for a Join stage, or can the lookup be performed using custom SQL in the source Oracle stage?).
2. The number of lookups to be performed in the load job, and the complexity of the transformer in the load job, need to be determined. In the case of multiple lookups or a large number of validations, the complexity should be rated high and the contingency factor in the estimation model can be increased.
3. The existence of mandatory fields (which must be loaded into the target) should be examined. Records failing these can be rejected at the first opportunity (after the source DB stage) and sent to the log without any further validation. For non-mandatory fields, the records cannot be rejected, and all the validations on the other columns need to be performed.

IBM Global Business Services

5. Preparing the DS Environment

- The DataStage installation should be in place, along with the other database installations.
- Project-level environment variables have to be created to hold the connectivity values of the staging databases, and the file locations for input, output and temporary storage.

IBM Global Business Services

6. Designing Jobs

6.1 General Design Guidelines
6.2 Ensuring Reusability
6.3 Sample Job Template
6.4 Extracting Data
6.5 Transforming the Extracted Data
6.6 Capturing Rejects
6.7 Loading Valid Data
6.8 Sequencing the Jobs
6.9 Job Sequence vs Batch Scripts
6.10 Tips: Releasing Locked Jobs
6.11 Mapping Multiple Stand-alone Jobs in One Single Job
6.12 Dataset Management
6.13 Ensuring Restartability

IBM Global Business Services

6.1 General Design Guidelines

- Templates have to be created to enhance reusability and enforce coding standards. The template should contain the standard job flow, along with proper naming conventions for components, proper job-level annotation, and short/long descriptions.
- Jobs should be created using the templates. Don't copy the job design only; copy using the "save as" or "create copy" option at job level.
- A change-record section should be kept in the long description to keep track of changes.
- The DataStage connection should be logged off after completion of work, to avoid locked jobs.

IBM Global Business Services

6.2 Ensuring Reusability

- Creation of common look-up jobs: some extraction jobs can be created to produce reference datasets; the datasets can then be used in different conversion modules.
- Creation of common track jobs.

IBM Global Business Services 6. it will populate two flat files with the information about the failed records. 66 © Copyright IBM Corporation 2006 . Apart from loading valid data into target table.2 Sample job Template Below is a sample Job: It contains annotation at the top. The stages have been named as per defined standard.

IBM Global Business Services

6.4 Extracting Data

1. Always use the Orchdb utility to import metadata. Pull the metadata into the appropriate staging folders in Table Definitions > Oracle. The utility imports the description part as well, which helps to keep track of the original metadata in case it is modified in the job flow.
2. Use the table method for selecting records from the source. Provide a select list and a where clause for better performance.
3. Native API stages always perform better than the ODBC stage, so the Oracle stage should be used.
4. Avoid using the table name in the form of a parameter in Oracle stages.
5. In the case of some access-restricted apps tables, the open command section of the Oracle stage should be used with the relevant query to access the data.

IBM Global Business Services

6.5 Transforming the Extracted Data

6.5.1 Performing Lookups / Lookup Stage Problem
6.5.2 Using Transformer
6.5.3 Transformer Compared to Dedicated Stages
6.5.4 Tips: Sorting
6.5.5 Tips: Removing Duplicates
6.5.6 Null Handling
6.5.7 When to Configure Nodes and Partitioning

IBM Global Business Services

6.5.1 Performing Lookups

Using a Lookup stage:
1. The number of datasets referenced in one Lookup stage should be limited, depending on the reference table data volume.
2. To capture the failed records and store them in a definite format in an error table, the lookup failure condition and the "condition not met" option are set to CONTINUE; hence the metadata of all the concerned columns in the output of the Lookup stage should be made NULLABLE. The lookup performs a left outer join in this case (the source is assumed to be the left link), as sketched below.
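Since CONTINUE leaves the reference columns null when a lookup fails, a downstream Transformer can route those rows to the error link. A minimal sketch, assuming hypothetical link and column names (srcIn, lkpCust, CUST_ID, CUST_NO):

   Constraint on the error link:        IsNull(lkpCust.CUST_ID)
   Derivation of an ERROR_MSG column:   'Lookup failed for customer: ' : srcIn.CUST_NO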

IBM Global Business Services

6.5.1 Performing Lookups: A Lookup Stage Problem

[Flow diagram: TX -> lkp1 -> lkp2 -> TX]

While connecting a new Lookup stage into an existing flow, as in the figure, if we detach one of the existing links, connect it to the new stage and configure the rest of the settings, we will not be able to provide a condition based on the input link columns, as that tab will be disabled. The reason may be that the earlier link fails to recognize the new stage. The way out is to remove one of the connecting links and connect two fresh links to the stage.

IBM Global Business Services

6.5.2 Using Transformer

Using parameters in a Transformer:
- While passing job parameters to a target column in a Transformer stage, project-defaulted parameters cannot be mapped directly to a target column. A job-level parameter will not cause any problem. Possible solutions are:
  1. Create a job-level parameter and map it to the actual project-level parameter at sequence level.
  2. Use GetEnvironment(%envvar%), e.g. GetEnvironment('$P_OPCO').
- A parameter cannot be used directly inside a stage variable in a Transformer (it will give a compilation error). The alternative strategy is to use a Transformer or Column Generator stage prior to the validation Transformer, and insert the parameter value into a dummy field of the output dataset of that first stage. Further calculations can then be carried out using that dummy column, as sketched below.
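A minimal sketch of the dummy-column workaround, with hypothetical names (job parameter P_OPCO, dummy column OPCO_VAL, input link inLink):

   In the first Transformer, derivation of output column OPCO_VAL:   P_OPCO
   In the validation Transformer, stage variable svOpco:             inLink.OPCO_VAL

The parameter is referenced once in an ordinary column derivation (which is allowed) and thereafter travels with the data, so the stage variables never need to reference it directly.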

IBM Global Business Services

6.5.3 Transformer Compared to Dedicated Stages

A PX Transformer is compiled separately into a C++ component, and thus slows down performance. It is a kind of all-rounder stage, and dedicated stages are available for many of its tasks:
- Transformer constraints can be implemented using a Filter stage.
- For metadata conversion, we have the Modify stage.
- For dropping columns, or to get multiple outputs, we can use a Copy stage.
- Counters can be implemented using a Surrogate Key stage.

These specialized stages are faster, as they do not carry much overhead, and should be used when no derivations are present. But the dedicated stages have problems too: in the Filter and Modify stages no syntax check is provided, so there is no easy way to ensure correct code unless we compile and analyze the error message. So, in many cases, using a Transformer enhances the maintainability of the code later on, and is suggested if performance is not an issue.

IBM Global Business Services

6.5.4 Tips: Sorting

Using the Sort stage in a multi-node environment: if more than one logical or physical node is defined, the Sort stage might give weird results, since DataStage arbitrarily partitions the incoming dataset, sorts the partitions separately, and writes them to a single dataset. The resolutions are:
1. Run the Sort stage in sequential mode; this is the safest and easiest way to solve the problem. It can be done by selecting the 'Sequential' option in the Advanced tab of the Stage page.
2. Partition the dataset using hash key partitioning, selecting the hash key to be the same as the sort key. This can be done in the Partitioning tab of the Inputs page of the Sort stage.
3. Collect the data with the sort/merge collection method.

IBM Global Business Services

6.5.5 Tips: Removing Duplicates

A Sort stage or a Remove Duplicates stage can be used for this. To remove the duplicates as well as capture the duplicated rows, the Remove Duplicates stage has to be used.

Capturing rows having duplicate key values: to select the distinct values from the input dataset and also catch the duplicates in a separate file, a combination of a Sort stage and a Transformer can be used. In the Properties page of the Sort stage, the CreateKeyChange option is set to True. This creates an extra column in the result dataset which contains '1' for distinct values of the sort key and '0' for the duplicate values. This column can then be used in the Transformer to separate the distinct and the duplicate values, as sketched below.

IBM Global Business Services

6.5.6 Null Handling

- Functions such as NullToZero, NullToValue, and NullToEmpty should be used instead of IsNull if the latter causes a problem.
- The approach is different for mandatory fields than for non-mandatory fields. Source records containing NULL in mandatory fields can be rejected at the first opportunity by using a Filter stage, whereas in the case of optional fields the records will be loaded into the target.
- For decimal fields, on a failure to look up, the field is populated with zero. Care should be taken if the source column can legitimately contain zero as well, and the validation logic should be framed accordingly. The approach can be to check for null using the IsNull function, or to check for zero length after trimming the column, and then explicitly set it to null using the SetNull function, as sketched below.
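A minimal derivation sketch of that check, with a hypothetical input link and column (inLink.ACCT_NO):

   If IsNull(inLink.ACCT_NO) Or Len(Trim(inLink.ACCT_NO)) = 0 Then SetNull() Else inLink.ACCT_NO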

IBM Global Business Services

6.5.6 Null Handling While Concatenating Error Messages

Suppose we are generating a key message from more than one field coming from the source. We need to be very careful here: when we concatenate a field into the key message and that field contains a null, the record may get dropped, especially if more fields are concatenated after it. Suppose this is our code to generate a key message, where the field BANK_NUM is nullable:

If Len(VarFndBnkNum) <> 0 Then 'Customer ID: ' : validateCustSiteUses.ID : ', BANK_NUM: ' : validateCustSiteUses.BANK_NUM : ', BANK_ACCOUNT_NUM: ' : validateCustSiteUses.BANK_ACCOUNT_NUM : ', ORG_ID: ' : validateCustSiteUses.ORG_ID_LK Else ''

IBM Global Business Services

6.5.6 Null Handling While Concatenating Error Messages (continued)

In this case, a record containing BANK_NUM = NULL will get dropped. But if we use a NullToEmpty conversion for the field, the code works correctly:

If Len(VarFndBnkNum) <> 0 Then 'Customer ID: ' : validateCustSiteUses.ID : ', BANK_NUM: ' : NullToEmpty(validateCustSiteUses.BANK_NUM) : ', BANK_ACCOUNT_NUM: ' : validateCustSiteUses.BANK_ACCOUNT_NUM : ', ORG_ID: ' : validateCustSiteUses.ORG_ID_LK Else ''

IBM Global Business Services

6.5.7 When to Configure Nodes and Partition Methods

- In most cases the task of node configuration and partitioning is left to DataStage (the default, Auto), and it partitions the input dataset based on the number of nodes (two in our case, so two partitions).
- Customization is required when a join is performed (pre-sort the data before the join) or when a Sort stage is used (the typical cases found to date).
- In some cases a stage may need to be restricted to one node, so that it creates only one process working on the entire dataset: e.g. if we need to know the number of rows and write a stage variable such as svRowCount = svRowCount + 1. If the stage runs on two nodes, it creates two processes running on two partitions, and the final count in each would cover only half of the entire dataset.
- The same applies to the logic of vertical pivoting in a Transformer using stage variables.

IBM Global Business Services

6.6 Capturing Rejects

- Capturing rejected rows: the records failing validation, or getting rejected by the database, can be captured in flat files with a definite format (which should contain the field for which the record failed).
- Both files can be concatenated and loaded into a database table in a different job. That job can be called after running the load job.
- The entries in the log table should refer to the job run entry in the run table.

IBM Global Business Services

6.7 Loading Valid Data

1. Always use the Orchdb utility to import metadata. Pull the metadata into the proper staging folder in Table Definitions > Oracle.
2. Avoid using the table name in the form of a parameter in Oracle stages.
3. Use the upsert method for the target Oracle stage, with a user-defined query.
4. For insert-only records, make the update SQL always meet a false condition, like (1 = 2); see the sketch below.
5. Journal fields which are not of any business interest can be populated either in DataStage or using Oracle defaults.
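A minimal sketch of such a user-defined upsert, with a hypothetical table and columns; the ORCHESTRATE.<column> placeholders stand for the incoming fields:

   INSERT INTO TGT_CUSTOMER (CUST_ID, CUST_NAME)
   VALUES (ORCHESTRATE.CUST_ID, ORCHESTRATE.CUST_NAME)

   UPDATE TGT_CUSTOMER SET CUST_NAME = ORCHESTRATE.CUST_NAME WHERE 1 = 2

Because 1 = 2 can never be true, the update arm never matches, and the stage effectively performs inserts only.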

IBM Global Business Services

6.8 Sequencing the Jobs

Job Activity stage best practices:
- Avoid putting $PROJDEF in Job Activity stage mappings. Many developers do this as a time-saving approach, but if all the project-level parameters are mapped as project default in the Job Activity stage, parameter values will not flow from the upper-level sequence to the individual job (the job will retrieve the values directly at run time), and hence the user can never override any parameter value during testing.
- The priority of parameter values is top-down, i.e. if a job parameter has been defined in a parallel job with some default value and has been mapped to a sequence-level parameter, then the sequence-level default value takes precedence at runtime.
- Provide the execution action as "Reset if required, then run", so that the sequence can reset any aborted subordinate jobs before running them.

IBM Global Business Services

6.8 Sequencing the Jobs (continued)

How to avoid manually mapping similar job parameters inside Job Activity stages (a developer shortcut): if a job name is changed, all the parameter mappings get wiped out, so for a complete conversion development we would need to map the same parameters for each Job Activity stage manually. To avoid this, the following steps can be followed:
1. Create a sample sequence job, and create one Job Activity stage with the complete mapping.
2. Copy and paste the stage as many times as the number of Job Activity stages needed.

IBM Global Business Services

6.8 Sequencing the Jobs (continued)

3. Save the job and export it.
4. Open the .dsx file in Notepad and find the job name. Starting from the bottom of the file, replace the job names with the actual job names up to the second-last Job Activity stage (the first one already has the proper name).
5. Save the .dsx and import it into the project.
6. Copy those stages from the sample job into the actual sequence jobs.

IBM Global Business Services

6.9 Sequences vs Batch Scripts

- Sequences have the obvious advantage of a GUI, and thus can be developed and maintained very quickly.
- Batches were the available functionality before sequences were introduced, so in many applications batches are the way things are run. They can be put to better use where custom restartability has to be ensured.

IBM Global Business Services

6.10 Releasing Locked Jobs

Using DataStage Director:
1. Go to Director.
2. Go to Job > Cleanup Resources.
3. In the Locks window, click Show All (and do the same in the Processes window).
4. Make a note of the PID of the locked job from the bottom window.
5. Select that PID in the Processes window and click Logout.
6. Refresh, then check the job from Designer.

Using a UNIX command: the kill command can be used to terminate the locking process, as sketched below.
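A minimal sketch, assuming the PID was identified from Director's Cleanup Resources window:

   kill <PID>       (request a normal termination first)
   kill -9 <PID>    (force-kill only if the process does not exit)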

IBM Global Business Services

6.11 Mapping Multiple Stand-alone Jobs in One Single Job

The flows are executed in parallel.
- Advantage: minimised development time compared to the sequence-job approach; useful when a good number of datasets need to be generated for later use as lookups.
- Disadvantage: the job will abort if any one of the flows aborts. Also, if the execution time of one flow is higher than the others, the finished flows will be kept waiting until all the flows finish.

IBM Global Business Services

6.12 Dataset Management

- We usually use datasets as references for performing lookups, or during the debugging phase of a job, by placing a dataset on the output link of a stage.
- Points to note: the dataset should be named with a .ds suffix. The .ds file is the control file, which stores the data file names and the metadata.
- During debugging we usually create many temporary datasets. We can remove the unwanted datasets using the dataset management tool in Director, or via PuTTY directly on the AIX server where the DS server is installed (see the sketch below).
- Best practice tip: the default location of dataset data files, as set in default.apt (the default configuration file), is the resource disk "C:/Ascential/DataStage/Datasets". It is a preferred best practice to create a custom configuration file for each project, with a separate location provided as the resource disk.
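A hedged command-line sketch for removing a temporary dataset with the orchadmin utility on the engine host (the path and dataset name are hypothetical, and orchadmin needs APT_CONFIG_FILE to point at the project's configuration file):

   orchadmin rm /data/proj1/temp/lkp_customer.ds

Unlike a plain file delete, this removes both the .ds control file and the data files it points to.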

IBM Global Business Services

6.13 Ensuring Restartability

- The easiest way is to enable the "Automatically handle activities that fail" option in the Job Properties tab of a sequence job. This allows DataStage to send an abort request to a calling sequence if a subordinate job aborts.
- DataStage also provides job control stages, e.g. the Terminator Activity stage, to further customize the restartability of your job.

IBM Global Business Services

7. Troubleshooting

7.1 Troubleshooting: Some Debugging Techniques
7.2 Oracle Error Codes in DataStage
7.3 Common Errors and Resolution
7.4 Tips: Message Handler
7.5 Local Runtime Message Handling in Director
7.6 Tips: Job Level and Project Level Message Handling
7.7 Using Job Level Message Handler

IBM Global Business Services

7.1 Troubleshooting: Debugging Techniques

Using the APT_DUMP_SCORE parameter: this environment variable is available in the DataStage Administrator under the Parallel > Reporting branch. It configures DataStage to print a report showing the operators, processes, and data sets in a running job.

Using the APT_DISABLE_COMBINATION parameter: this environment variable is available in the DataStage Administrator under the Parallel branch. Setting it globally disables operator combining (the default behavior is that two or more operators within a step are combined into one process where possible). Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory.

IBM Global Business Services

7.1 Troubleshooting: Debugging Techniques (continued)

Disabling operator combining helps to determine the exact stage where an error is generated, e.g. a record drop due to a null in a function without null handling (otherwise it would be reported as an APT_CombinedOperatorController error).

Using OSH_ECHO: this environment variable is available in the DataStage Administrator under the Parallel > Reporting branch. If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.

IBM Global Business Services

7.1 Troubleshooting: Debugging Techniques (continued)

Enable the following environment variables in DataStage Administrator:
- APT_PM_PLAYER_TIMING: shows how much CPU time each stage uses.
- APT_PM_SHOW_PIDS: shows the process ID of each stage.
- APT_RECORD_COUNTS: shows record counts in the log.
- APT_CONFIG_FILE: switches the configuration file (one node, multiple nodes).
- OSH_DUMP: shows the OSH code for your job, revealing any unexpected settings made by the GUI.

Other techniques:
- Use a Copy stage to dump out data to intermediate Peek stages or sequential debug files. Copy stages are removed at compile time, so they do not add overhead.
- Use a Row Generator stage to generate sample data.
- Look at the phantom files for additional error messages: c:\datastage\project_folder\&PH&

IBM Global Business Services

7.2 Oracle Error Codes in DataStage

Some common error codes have been listed for ready reference, along with possible remedies, to help resolve issues faster.

[Embedded document: ORACLE ERROR CODES IN DS]

IBM Global Business Services

7.3 Common Errors and Resolution

1) APT_CombinedOperatorController: NULL found in input dataset; record dropped.
Resolution: this is generated when a function inside a Transformer receives a null value without null handling being performed (e.g. concatenating a string with a nullable field). The error also occurs if a nullable column is written to a sequential file without null-handling properties set.

2) ORCHESTRATE step execution terminating due to SIGINT.
Resolution: SIGINT is the signal thrown (here by the UNIX OS) when a user wishes to interrupt a process. It is most likely due to a shortfall in the availability of resources. The following techniques worked, on a trial-and-error basis, in a number of situations:
- Increase the warning limit from the sequence (extreme resource consumption often corresponds to the warning limit being hit).
- Where Varchar(2000) fields are present in the target, decreasing the column size can resolve the problem.

IBM Global Business Services

7.3 Common Errors and Resolution (continued)

3) When checking operator: operator of type "APT_LUTCreateOp": will partition despite the preserve-partitioning flag on the data set on input port 0.
Resolution: this tells us that the job will repartition the data even though the code is telling the job to preserve the partitioning from upstream. Where this is happening, open up the stage and set the input link properties to 'Clear partitioning'.

4) When binding input interface field "FIELD1" to field "FIELD2": converting a nullable source to a non-nullable result; a fatal runtime error could occur.
Resolution: as the lookup failure condition is set to CONTINUE, the metadata of all the concerned columns in the output of the Lookup stage should be made NULLABLE. Alternatively, use a Modify operator to specify the value to which the null should be converted.

IBM Global Business Services

7.4 Message Handler

Local message handler: to suppress unwanted warnings, the following method can be followed: right-click the warning which you want to handle > click "Add to message handler" > click "Add Rule". In the next run, the messages will be handled and a consolidated message will be shown.

Local runtime message handlers (Local.msh) are stored in the RC_SCnnnn folder under the specific project folder (the path can be found in the Project Pathname in Administrator), where nnnn is the job number taken from DS_JOBS.

While taking exports, the executables must also be promoted for these handlers to be used.

IBM Global Business Services

7.5 Local Runtime Message Handling in Director (1 of 4)

[Screenshot]

IBM Global Business Services

7.5 Local Runtime Message Handling in Director (2 of 4)

[Screenshot]

IBM Global Business Services

7.5 Local Runtime Message Handling in Director (3 of 4)

[Screenshot]

IBM Global Business Services

7.5 Local Runtime Message Handling in Director (4 of 4)

[Screenshot]

IBM Global Business Services

7.6 Tips: Job Level and Project Level Message Handling

Job-level message handler:
- Allows messages to be handled for a single job exclusively, and allows a job-source-only promotion of code.
- There is a MsgHandler folder in the DataStage directory. When a new message handler is saved, a new .msh file is created there.
- These message handlers cannot be exported directly along with the .dsx file. To take a project from the DEV server to another environment, the relevant .msh files need to be copied to the same MsgHandler folder there. The exported job will then compile, and the message handler works fine.

Project-level message handler:
- Can be defined from Administrator; it puts the message handling in a central location and applies to all the jobs in that project.
- APT_ERROR_CONFIGURATION is a parameter that can be configured to customize the error log.

IBM Global Business Services 7.7 Using Job Level Message Handler - 1 (screenshot) 102 © Copyright IBM Corporation 2006

IBM Global Business Services 7.7 Using Job Level Message Handler - 2 (screenshot) 103 © Copyright IBM Corporation 2006

IBM Global Business Services 7.7 Using Job Level Message Handler - 3 (screenshot) 104 © Copyright IBM Corporation 2006

IBM Global Business Services 8. Preparing UTP - Guidelines
 – One standard template should be followed for Data Artifacts.
 – Only one consolidated UTP should be kept in Ascendant. In case of enhancements, the addendum UTP should be added, creating a new section above the open and closed issues section.
 – Test Artifacts should be attached in two spreadsheets for each sequence job. The first should comprise all lookup reference datasets; the second should comprise source, target, cnv_run, cnv_log and one analysis tab.
 – The main sequence log can be attached as a bmp file in the Appendix.
105 © Copyright IBM Corporation 2006

IBM Global Business Services 9. Maintenance Activities
9.1 Backup and version control activity
 – Taking whole project backup
 – Taking Job level Export
 – Taking folder level Export
 – Version Control in ClearCase
9.2 DS Auditing activity
 – Tracking the list of modified jobs during a period
 – Retrieving Job Statistics
 – Getting the row counts of different jobs
9.3 Performance Tuning of DS Jobs
 – Analysing a flow
 – Measuring Performance
 – Designing for good performance
 – Improving performance
9.4 Assuring Naming Conventions of components, jobs and categories
9.5 Scheduled Maintenance
106 © Copyright IBM Corporation 2006

IBM Global Business Services

9.1 Back Up and Recovery activity
• INTRODUCTION TO THE PROCESS:
• During the fresh development phase, each newly built module is backed up after being delivered.
• During the test phase, the jobs enhanced each week are identified at the weekend and backed up as a part of the version control activity.
• During the dev phase, a whole project backup can be performed weekly or every fortnight; during the test phase, a whole project backup is performed monthly.

• Features of the Tool:
• Taking a whole project backup from the command line automatically.
• Taking job level and category level exports from the command line automatically.
• Identifying the jobs changed during a specified period and taking a backup of those jobs as a part of the version control activity.

107 © Copyright IBM Corporation 2006
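As an illustration of the command-line backup feature, the whole-project export can be put on a weekly schedule with the dscmdexport client utility. A minimal sketch; the host, credentials and backup path are placeholders:

    #!/bin/sh
    # Sketch: scheduled whole-project backup via dscmdexport.
    # Host, user, password and target path are placeholders.
    STAMP=`date +%Y%m%d`

    # dscmdexport writes the entire project into a single .dsx file.
    dscmdexport /H=ds_server /U=backup_user /P=secret \
        DEVPROJECT /backup/DEVPROJECT_$STAMP.dsx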

IBM Global Business Services

Back up activity 
Taking Job level Export: A Job Repository table has been created in stage 1. A sequence job runs to refresh this repository: it calls a routine which extracts the job names and the associated category paths into a sequential file, and a subsequent load job loads that data into the repository. If some specific categories/jobs have to be exported, the relevant sql file has to be modified with the required query in the where clause to select the jobs to be exported. If the requirement is version control, the repository of modified jobs has to be refreshed and then the main batch can be run directly to perform the export. It will create job level dsx files, and one report file will be generated. If a job is locked by any user, the utility will cease to proceed further unless the option to skip/abort is provided by the user; so it is better to restart the server before the export is started. The job level dsx files will be created with the same folder structure as on the server.

108

© Copyright IBM Corporation 2006
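The per-job export that the main batch performs can be approximated with the dsexport client, driven by the job list extracted from the repository. A minimal sketch, assuming one job name per line in jobs.txt (file name, host and credentials are placeholders):

    #!/bin/sh
    # Sketch: job-level export loop driven by the repository extract.
    # jobs.txt, host, credentials and output path are placeholders.
    while read JOB; do
        # /JOB limits the export to a single job: one .dsx per job.
        dsexport /H=ds_server /U=backup_user /P=secret \
            /JOB="$JOB" DEVPROJECT "/backup/jobs/$JOB.dsx"
    done < jobs.txt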

IBM Global Business Services

Back up activity 
Taking folder level Export: Once the job level backup is complete, those files can be concatenated to create folder level dsx files. If some specific categories have to be exported, the relevant sql file has to be modified with the required query in the where clause to select the jobs to be exported. If the requirement is version control, the repository of modified jobs has to be refreshed and then the main batch can be run directly to perform the export. It will concatenate the job level dsx files created earlier to create folder-wise dsx files. If a log file already exists, the batch will abort; unlock the job on the server and run the export batch again to take the export of that job. If the export program was successful, folder level dsx files will be generated along with a report file.

109

© Copyright IBM Corporation 2006

IBM Global Business Services Version Control
– To upload the dsx into the respective folder in CC:
 – Connect to the ClearCase web client and go to the proper path.
 – Create the activity indicating the reason for the change (defect number).
 – Check out the respective folder (folder > basic > check out).
 – Put the .dsx file into the CCRC path on your local machine.
 – Add the .dsx file to source control (right click on the file in the right hand pane > basic > add to source control; a blue background will come up). Uncheck the option for checking out after adding to source control.
 – Check in the folder and click Tools > update resources with the selected activity.
 – Right click on the file in the right hand pane > Tools > show version tree; the version tree will be shown.
– To further apply any change to the code:
 – Import the .dsx file to the local machine and make modifications as per requirement.
 – Compile and run the job and upload the new dsx as discussed.
110 © Copyright IBM Corporation 2006
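Teams that prefer the command line can perform the same round trip with cleartool instead of the CCRC GUI. A minimal sketch under UCM; the view path, activity name, file name and defect number are placeholders:

    #!/bin/sh
    # Sketch: the CCRC steps above, expressed with cleartool (UCM).
    # View path, activity id, file name and comments are placeholders.
    cd /views/myview/dsx_vob/datastage

    cleartool mkactivity -headline "DEFECT-1234 job fix" defect1234_act
    cleartool setactivity defect1234_act

    cleartool checkout -c "DEFECT-1234" .            # check out the folder
    cp /tmp/jb_load_customer.dsx .                   # drop in the new export
    cleartool mkelem -c "DEFECT-1234" jb_load_customer.dsx  # add to source control
    cleartool checkin -c "DEFECT-1234" jb_load_customer.dsx .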

IBM Global Business Services 9.2 DS Auditing activities
 – Tracking the list of modified jobs during a period
 – Assuring naming conventions of components, jobs and categories
 – Retrieving Job Statistics
111 © Copyright IBM Corporation 2006

IBM Global Business Services Assuring naming conventions of components and jobs
 – A pl/sql procedure can be used to ensure the naming conventions of jobs, categories, stages and links. It can generate a report of the components not matching the specified convention.
 – If MetaStage can be used to export the DataStage system tables to an RDBMS (e.g. Oracle) via a metabroker, then the procedure can be run on those tables to validate the standards.
112 © Copyright IBM Corporation 2006
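The shape of such a check might look like the sketch below, run against the exported metadata. The connection string, table and column names are purely illustrative (they depend on the MetaStage export), and the pattern enforces a hypothetical Job_* prefix convention:

    #!/bin/sh
    # Sketch: naming-convention report over metadata exported to Oracle.
    # Connection, table and column names are illustrative only.
    printf "%s\n" \
        "SELECT object_name, object_type" \
        "FROM ds_metadata_export" \
        "WHERE object_type = 'JOB'" \
        "AND object_name NOT LIKE 'Job\_%' ESCAPE '\';" \
        | sqlplus -s audit_user/secret@DWDB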

IBM Global Business Services Retrieving Job Statistics
A very important aspect of the auditing activity in the case of a data migration. This is ensured in two phases:
• First is to retrieve the record counts for source records, records inserted or updated into the target table, records that failed business rule validation, and records rejected by Oracle. This is done using a routine written in DS BASIC which retrieves record counts by searching for links with some specific keywords; these keywords refer to the links from source to target, or the failure links in the load job. This information is stored in the CNV_RUN table.
• A second approach retrieves those job names for which the number of source records does not match the combined value of inserted records and failed records (hence some records have been dropped somewhere in the flow).
113 © Copyright IBM Corporation 2006
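A per-link count in the spirit of that routine can also be pulled from outside the job with the dsjob command-line interface. A small sketch; the project, job, stage and link names are placeholders for your own keyword conventions:

    #!/bin/sh
    # Sketch: pull row counts for the audited links of one job via dsjob.
    # Project, job, stage and link names are placeholders.
    PROJECT=DEVPROJECT
    JOB=SeqLoadCustomer

    # -linkinfo reports, among other details, the row count of one link.
    dsjob -linkinfo $PROJECT $JOB xfmValidate lnk_to_target
    dsjob -linkinfo $PROJECT $JOB xfmValidate lnk_reject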

IBM Global Business Services 9.3 Performance tuning of DS Jobs
 – Analysing a flow
 – Measuring Performance
 – Designing for good performance
 – Improving performance
114 © Copyright IBM Corporation 2006

IBM Global Business Services 9.3 Performance tuning of DS Jobs : Purpose
• This section describes the process of analysing a job flow and measuring its performance against certain project benchmarks; it then suggests steps to improve the performance of the identified jobs.
• It is important to mention that performance tuning is not a subject that too much time should be spent on during the initial design, unless it is clear that performance will be an issue. It may well be that the performance is adequate without carrying out any of these tuning options, and you will therefore save yourself time by not having to implement these changes.
115 © Copyright IBM Corporation 2006

IBM Global Business Services Performance tuning of DS Jobs : Analysing the flow
1. A score dump of the job helps to understand the flow. We can produce one by setting the APT_DUMP_SCORE environment variable to true and running the job (APT_DUMP_SCORE can be set in the Administrator client, under the Parallel > Reporting branch). This causes a report to be produced which shows the operators, processes and data sets in the job. The report includes information about:
 – Where and how data is repartitioned.
 – Whether DataStage has inserted extra operators in the flow.
 – The degree of parallelism each operator runs with, and on which nodes.
 – Where data is buffered.
116 © Copyright IBM Corporation 2006
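If $APT_DUMP_SCORE has been added to the job as an environment-variable parameter, the score dump can also be toggled per run from the command line. A hedged sketch; project and job names are placeholders:

    #!/bin/sh
    # Sketch: one run with the score dump enabled, then read the log.
    # Assumes $APT_DUMP_SCORE was added to the job as an env-var parameter.
    PROJECT=DEVPROJECT
    JOB=jLoadCustomer

    dsjob -run -param '$APT_DUMP_SCORE'=True -jobstatus $PROJECT $JOB

    # The score is written to the job log; summarise it with dsjob.
    dsjob -logsum $PROJECT $JOB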

IBM Global Business Services Performance tuning of DS Jobs : Analysing the flow
 – The score dump is particularly useful in showing you where DataStage is inserting additional components in the job flow. In particular, DataStage will add partition and sort operators where the logic of the job demands it. Sorts in particular can be detrimental to performance, and a score dump can help you detect superfluous operators and amend the job design to remove them.
2. Runtime Information: When you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow, written to the job log when the job is run. It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than the others, it may mean the data is partitioned in an unbalanced way; repartitioning, or choosing different partitioning keys, might be a useful strategy.
3. Setting the environment variable APT_DISABLE_COMBINATION may be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so this should be done with care.
117 © Copyright IBM Corporation 2006
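The same parameter trick works for the timing and combination variables, after which the per-operator lines can be fished out of the log summary. A sketch, again assuming both variables are exposed as job parameters:

    #!/bin/sh
    # Sketch: an uncombined, timed run to see per-operator CPU usage.
    # Assumes both APT_* variables are exposed as job parameters.
    PROJECT=DEVPROJECT
    JOB=jLoadCustomer

    dsjob -run -jobstatus \
        -param '$APT_PM_PLAYER_TIMING'=True \
        -param '$APT_DISABLE_COMBINATION'=True \
        $PROJECT $JOB

    # Pick the player timing lines out of the log summary.
    dsjob -logsum $PROJECT $JOB | grep -i cpu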

IBM Global Business Services Performance tuning of DS Jobs : Measuring Performance
 – We measure performance in the following ways:
 – If the source is a database (e.g. Oracle in our case), the query should be run using hints/partition/index. This gives an insight into whether the source query is a bottleneck.
 – If the target is a database (e.g. Oracle in our case), replace the database stage with a sequential file and see whether the job takes the same time. This tells us whether the database connection to the target (as it is a remote connection) is slow, or whether the volume of data is so huge that it simply takes time.
 – In the transformations section, invalidate all transformations to default values. This helps us know whether the job is running slow because of the transformations.
118 © Copyright IBM Corporation 2006

IBM Global Business Services Performance tuning of DS Jobs : Measuring Performance
 – Check for any Aggregator stage in your jobs. This is part of the transformation bottleneck but needs special attention: an Aggregator stage in the middle of a big job makes the entire job slow, since all the records need to pass through the aggregator (they cannot be processed in parallel).
 – To catch partitioning problems, run your job with a single node configuration file and compare the output with your multi-node run. You can just look at the file size, or sort the data for a more detailed comparison.
119 © Copyright IBM Corporation 2006
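The single-node versus multi-node comparison is easy to script: sorting first removes the row-order differences that partitioning introduces. A small sketch with placeholder file names:

    #!/bin/sh
    # Sketch: compare single-node and multi-node outputs of one job.
    # The two file names are placeholders for the exported outputs.
    ls -l out_1node.txt out_4node.txt    # quick check: sizes should match

    # Partitioning changes row order, so sort both sides before diffing.
    sort out_1node.txt > /tmp/a
    sort out_4node.txt > /tmp/b
    diff /tmp/a /tmp/b && echo "outputs match"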

IBM Global Business Services Performance tuning of DS Jobs : Improving Performance
 – Basic steps:
 – Remove unwanted columns at the first opportunity.
 – Reduce the number of rows processed as early as possible. This can be done by placing the Transformer constraint or filter where-clause in the source Oracle stage.
 – Eliminate Transformers with Modify stages where the transformations are simple. Transformations that touch a single column (for example null handling, keep/drop, type conversions, some string manipulations) should be implemented in a Modify stage rather than a Transformer. Modify is a particularly efficient operator due to internal implementation details; any transformation which can be implemented in the Modify stage will be more efficient than implementing the same operation in a Transformer stage.
120 © Copyright IBM Corporation 2006

IBM Global Business Services Performance tuning of DS Jobs : Improving Performance
 – Consider using the Oracle bulk loader instead of the upsert method wherever applicable.
 – Instead of creating multiple standalone flows in a single job, creating separate jobs and calling them in parallel using a sequencer can improve the performance.
 – If data is going to be read back in, in parallel, it should never be written as a sequential file. A Data Set or File Set stage is a much more appropriate format.
121 © Copyright IBM Corporation 2006

IBM Global Business Services Performance tuning of DS Jobs : Improving Performance
 – Advanced steps:
 – Run the jobs which handle a small volume of data on a single node instead of multiple nodes. This will limit spawning multiple processes and partitions when there is no need, and can be done by adding the environment variable $APT_CONFIG_FILE and setting it to use a single node configuration.
 – When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). Ensure that the data is partitioned, and that the partitions and sort order are retained at every stage. Avoid format conversion or serial I/O.
122 © Copyright IBM Corporation 2006
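For reference, a one-node configuration file has the shape sketched below; the hostname and resource paths are placeholders that must match your server, and the run assumes $APT_CONFIG_FILE is exposed as a job parameter:

    #!/bin/sh
    # Sketch: point a small-volume job at a one-node configuration.
    # Contents of /tmp/onenode.apt (hostname and paths are placeholders):
    #
    #   {
    #       node "node1"
    #       {
    #           fastname "ds_server"
    #           pools ""
    #           resource disk "/data/ds/dataset" {pools ""}
    #           resource scratchdisk "/data/ds/scratch" {pools ""}
    #       }
    #   }

    # Run the job against it (assumes $APT_CONFIG_FILE is a job parameter).
    dsjob -run -jobstatus \
        -param '$APT_CONFIG_FILE'=/tmp/onenode.apt \
        DEVPROJECT jSmallLookupLoad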

IBM Global Business Services 9.4 Scheduled Maintenance
 – Regular cleanup of log files.
 – Periodic cleanup of the &PH& folder. &PH& is a project level folder. If the time between when a job says it is finishing and when it actually ends increases, this may be a symptom of a too-full &PH& folder. One way to clean it is in DataStage Administrator: select the Projects tab, click your project, press the Command button, enter the command CLEAR.FILE &PH&, and press the Execute button. Another way is to create a job with the command EXECUTE "CLEAR.FILE &PH&" on the job control tab of the job properties window; &PH& is per project, so this job should be created and scheduled in each project. It may be scheduled to run weekly, but at a point in your production cycle where it will not delete data critical to debugging a problem.
 – Cleaning up persistent datasets periodically. Datasets should not be used for long term storage, thus the temporary datasets can be cleaned up. A script can be scheduled to automate the process.
123 © Copyright IBM Corporation 2006
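On the server side the same command can be handed to the scheduler. It must run from the project directory, because &PH& resolves per project. A minimal sketch assuming a 7.x-style engine layout; DSHOME and the project path are placeholders:

    #!/bin/sh
    # Sketch: scheduled &PH& cleanup for one project, run on the server.
    # DSHOME and the project path are placeholders for your install.
    DSHOME=/opt/Ascential/DataStage/DSEngine
    export DSHOME
    . $DSHOME/dsenv                  # load the engine environment

    # &PH& is resolved per project, so run from the project directory.
    cd /opt/Ascential/DataStage/Projects/DEVPROJECT
    echo "CLEAR.FILE &PH&" | $DSHOME/bin/uvsh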

IBM Global Business Services 9.5 Customised Code
 Options:
 – Create a BASIC routine and use it as a before/after job subroutine or via a Routine Activity stage.
 – Create a C++ routine and use it inside a PX Transformer.
 – Create custom operators and use them as a stage: this allows knowledgeable Orchestrate users to specify an Orchestrate operator as a DataStage stage, which is then available to use in DataStage Parallel jobs.
124 © Copyright IBM Corporation 2006

IBM Global Business Services Questions and Answers 125 Presentation Title | IBM Internal Use | Document ID | 1/29/2011 © Copyright IBM Corporation 2006 .

Course Title
IBM Global Business Services

BI Development Toolkit for Datastage
Module 4 : Version Control
(Optional client logo can be placed here)

Disclaimer (Optional location for any required disclaimer copy. To set disclaimer, or delete, go to View | Master | Slide Master)

© Copyright IBM Corporation 2006

IBM Global Business Services

Module Objectives 
At the completion of this chapter you should be able to:
 – Manage and track all DataStage component code changes and releases.
 – Maintain an audit trail of changes made to DataStage project components, and record a history of when and where changes were made.
 – Store different versions of DataStage jobs.
 – Run different versions of the same job.
 – Revert to a previous version of a job.
 – Store all changes in one centralized place.
127 Presentation Title | IBM Internal Use | Document ID | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services

Version Control : Agenda 
Topic 1: Versioning Methodology
 – Discipline
 – Basic Principle/Approach
 – Different Projects

Topic 2: Initializing Components
 – Version Control Numbering
 – Filtering Components

Topic 3: Promoting Components
 – Component Selection for Promotion
 – Different Methods

Topic 4: Best Practices
 – Using a Custom Folder in Version Control
 – Starting Version Control from DS-Designer

128 Presentation Title | IBM Internal Use | Document ID | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Versioning Methodology
 In a typical enterprise environment, there may be many developers working on jobs all at different stages of their development cycle. Without version control, effective management of these jobs could become very time consuming, and they could be difficult to maintain. This module gives an overview of the methodology used in Version Control and highlights some of its benefits. It is not intended as a comprehensive guide to version control management theory.
 Benefits:
 – Version tracking: archiving and versioning (i.e. release-level tracking) of DataStage related components, which can be retrieved for bug tracking and other purposes.
 – Team coordination: components are marked as read-only as they are processed through Version Control, ensuring that they cannot be modified in any way after being released.
 – DataStage integration: Version Control can be opened directly from within any DataStage client; alternatively, components can be opened directly in DataStage from Version Control.
 – Central code repository: all coding changes are contained in one central managed repository. Components are stored within the 'VERSION' project, regardless of project or server locations.
129 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Versioning Methodology
Discipline:
 To gain the maximum benefit from using Version Control we must exercise a disciplined approach. Always ensure that we pass components through Version Control before sending them to their next stage of development. This will make the project development far easier to track, especially if we have complex projects containing a large number of jobs. If we build in that discipline from the start, we will quickly realize the benefits as the project grows.
Basic Principle/Approach:
 Most DataStage job developers adopt a three stage approach to developing their DataStage jobs, which has become the de facto standard. These stages are:
 – The Development stage
 – The Test stage
 – The Production stage
130 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Versioning Methodology
Basic Principle/Approach:
 Scenario without Version control
 In this model, jobs are coded in the development environment, sent for test, redeveloped until testing is completed, and then passed to production. There is no central management system to control the flow between the development, test and production environments.
 Adopting a staged approach to project development, we need to think of Version Control as a central hub that all DataStage projects pass through: projects pass from one stage into Version Control before being passed to the next stage.
131 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Versioning Methodology
Basic Principle/Approach:
 Scenario with Version control
 – Whilst in Version Control, projects will have the appropriate versioning information added. This information will include version number, history, and notes.
 – Consistency of the code across the different environments is maintained.
132 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Versioning Methodology
Different Projects:
 The Version Control Project - Version Control uses a special DataStage project as a repository to store all projects and their associated components. This project is usually called 'VERSION', although we may create a project with any name; whatever name we choose for the version project, the principle remains the same. The Version Control repository contains the archive of all components initialized into it, and therefore stores every level of each code release for each component.
 Other Projects - If we adopt the three stage approach, we would typically have three other projects:
 – Development - where DataStage jobs and associated components are developed.
133 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Contd…
 – Test - where developed jobs and components are tested. Once a development cycle is complete, components are initialized from the Development project into the Version Control repository; from there they are promoted to the Test project.
 – Production - the final destination, from where the finished jobs are actually run. When testing is complete (which may include more development-test cycles), components are promoted from the Version Control repository to the Production project.
 These projects can reside on different DataStage servers if required.
134 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components
Initialization is the process of selecting components from a source project and moving them into Version Control for processing and promotion to a target project. When initializing components, the source project is the development project.

After they have been initialized and processed in Version Control, components are promoted to a test or production project.  Initializing components gives them a new release version number.

135

| 1/29/2011

© Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components
Version Control Numbering: The full version number of a DataStage component is broken down as Release Number.Minor Number, where:
 – The Release Number is allocated when we initialize components in Version Control. If required, we can specify a release number in the Initialize Options dialog box. By default, Version Control sets this to the highest release number currently used by objects in its repository.
 – The Minor Number is allocated automatically by Version Control when we initialize a component. It will increment by one each time we initialize a particular component, until we increase the release number. For example, version 2.3 denotes release 2, third minor version of the component.

136

| 1/29/2011

© Copyright IBM Corporation 2006

IBM Global Business Services

Initializing Components 
Filtering Components: We can filter a long list of components to show only those that we are interested in for promotion. For example, we may want to select components associated with 'Sales' or 'Accounting'. Rather than search through the entire list, we can filter the list and select the subset for promotion.

137

| 1/29/2011

© Copyright IBM Corporation 2006

IBM Global Business Services Initializing Components
To filter components:
1. Click the Filter button in the Display toolbar so that a text entry field appears.
2. In the text entry field, type in the text we want to filter by. We can type letters or whole words, and separating letters or words with a comma will result in an 'OR' operation. For example, typing in 'accounting, sales' will result in a list showing components that have 'accounting' or 'sales' in their name.
3. Click the arrow next to the Filter button to specify whether the filter is case sensitive or not.
4. When we are happy with our filter text, press return, or click the Filter execute button.
To return to the default view, click the Filter button again, or click in the tree view of the Version Control window.
138 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Promoting Components
 We can promote components after they have been initialized into Version Control. In a typical environment, components are initialized from a development project and promoted to a test or production project.
 Component selection for promotion: We can select components for promotion in the following ways:
 – By individual selection
 – By batch
 – By user
 – By server
 – By project
 – By release
 – By date
139 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Promoting Components
The different ways of selecting components for promotion are as follows:
 – By individual selection: We can select components for promotion in the tree view from any view mode. Individual component selection is suitable when we are promoting a small number of components. The more usual scenario is to use Release/Batch selection.
 – By batch: When we initialize a group of components into Version Control, the selected group is known as a 'batch'. Version Control allows us to select components for promotion by initialization batch, promote batch, or named batch. By default batches are identified by the date and time they were initialized, but we always prefer to specify a name for a batch. Selecting components by batch automatically highlights all the components of that batch and so selects them for promotion.
 – By date: We can select components that were initiated on a particular date. All the components that were initialized on that date are selected ready for promotion.
140 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Promoting Components
 – By user: We can select components that have been initialized by a particular user. Select the required user from the menu. All the components that have been initialized by that user are selected ready for promotion.
 – By server: Select the required server from the menu. All the components that have been initialized from that server are selected ready for promotion.
 – By project: We can select components that have been initialized from a particular project. Select the required project from the menu. All the components that have been initialized from that project are selected ready for promotion.
 – By release: We can select components that belong to a particular release. All the components that belong to that release are selected ready for promotion.
141 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Best Practice
Using a Custom Folder in Version Control: Many development projects which use DataStage for extraction, transformation and loading (ETL) also incorporate other project related files which are not part of the DataStage repository. These files may contain DDL scripts or other resource data. Version Control can process these ASCII files in the same way as it processes DataStage components, either for initialization or for promotion. The only requirement for using custom folders in Version Control is that the components must be stored within a folder in the project itself. If we choose to add Custom folders, they are created automatically by Version Control; there is no need to create them manually. Every time Version Control subsequently connects to a project, it checks to see if the custom folder exists; if it does not exist, Version Control will create it. After Version Control has created a custom folder, it can then be populated with the relevant items.
142 | 1/29/2011 © Copyright IBM Corporation 2006

IBM Global Business Services Best Practice
Starting Version Control from DS-Designer: We can run Version Control directly from within DataStage Designer, Director or Manager by adding a link to the DataStage client tools menu. We can also add options which will allow Version Control to start without displaying the login dialog. If we want Version Control to start with the login details already filled in, we can enter appropriate command line arguments. These are entered in the Arguments field and have the following syntax: /H=hostname /U=username /P=password, where hostname is the DataStage server hosting the project, username is the DataStage username, and password is the DataStage password. For example, if we have a hostname of 'ds_server', a username of 'vc_user', and a password of 'control', then we would type in: /H=ds_server /U=vc_user /P=control. Version Control can now be started from the DataStage client.
143 | 1/29/2011 © Copyright IBM Corporation 2006
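Put together, the tools-menu entry amounts to a command line like the one below. The executable path and name are assumptions for a typical client install (check your own machine); the arguments are the documented ones above:

    # Sketch: Version Control started with pre-filled login details.
    # The install path and executable name are assumed, not documented here.
    "C:/Program Files/Ascential/DataStage/VERSION.EXE" /H=ds_server /U=vc_user /P=control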

IBM Global Business Services Questions and Answers 144 Presentation Title | IBM Internal Use | Document ID | 1/29/2011 © Copyright IBM Corporation 2006 .
