WebSphere DataStage®
Version 8

Parallel Job Tutorial

SC18-9889-00


Note: Before using this information and the product that it supports, be sure to read the general information under “Notices and trademarks” on page 63.

Contents

Chapter 1. Introduction . . . . . . . . . . . 1

Chapter 2. Tutorial project goals . . . . . . . 3

Chapter 3. Module 1: Opening and running the sample job . . . 5
  Lesson 1.1: Opening the sample job
  Lesson 1.2: Viewing and compiling the sample job
  Lesson 1.3: Running the sample job
  Module 1: Summary

Chapter 4. Module 2: Designing your first job . . . 13
  Lesson 2.1: Creating a job
  Lesson 2.2: Adding stages and links to the job
  Lesson 2.3: Importing metadata
  Lesson 2.4: Adding job parameters
  Lesson 2.5: Creating parameter sets
  Module 2 Summary

Chapter 5. Module 3: Designing a transformation job . . . 25
  Lesson 3.1: Designing the transformation job
  Lesson 3.2: Combining data in a job
  Lesson 3.3: Capturing rejected data
  Module 3 Summary

Chapter 6. Module 4: Loading a data target . . . 39
  Lesson 4.1: Creating a data connection object
  Lesson 4.2: Importing column metadata from a database table
  Lesson 4.3: Writing to a database
  Lesson 4.4: Performing multiple transformations in a single job
  Module 4 summary

Chapter 7. Module 5: Processing in parallel . . . 47
  Lesson 5.1: Exploring the configuration file
  Lesson 5.2: Partitioning data
  Lesson 5.3: Changing the configuration file
  Module 5 summary

Chapter 8. Tutorial summary . . . . . . . . 55

Appendix. Installing and setting up the tutorial . . . 57

Accessing information about IBM . . . . . . 61

Notices and trademarks . . . . . . . . . . 63

Index . . . . . . . . . . . . . . . . . 67

Chapter 1. Introduction

In this tutorial, you will learn the basic skills that you need to design and run WebSphere® DataStage® parallel jobs.

Learning objectives

By completing this tutorial, you will achieve the following learning objectives:
v Learn how to design parallel jobs that extract, transform, and load data.
v Learn how to run the jobs that you have designed, and how to view the results.
v Learn how to create reusable objects that can be included in other job designs.


Chapter 2. Tutorial project goals

This tutorial uses a simple business scenario to introduce you to the basics of job design for IBM® WebSphere DataStage. In this tutorial, you have the following scenario: The company GlobalCo is merging with WorldCo. Their customer base is worldwide and, because their businesses are similar, the two companies have some customers in common. The new merged company wants to build a data warehouse for the delivery and billing information.

Your part of the project is to work on the GlobalCo data that records billing details for customers. You must read this data from a comma-separated file, and then cleanse and transform the data in preparation for it to be merged with the equivalent data from WorldCo. This data ultimately forms the bill_to dimension table in the finished data warehouse. The exercises in this tutorial focus on a small portion of the work that needs to be done to accomplish this goal.

Audience

This tutorial is intended for WebSphere DataStage designers who want to learn how to create parallel jobs.

Skill level

You can do this tutorial with only a beginning level of understanding of WebSphere DataStage concepts.

Learning objectives

As you work through the job scenario, you will learn how to do the following tasks:
v Design parallel jobs that extract, transform, and load data
v Run the jobs that you design and view the results
v Create reusable objects that can be included in other job designs

This tutorial should take approximately four hours to finish. If you explore other concepts related to this tutorial, it can take longer to complete.

System requirements

The tutorial requires the following hardware and software:
v WebSphere DataStage clients installed on a Windows® XP platform.
v Connection to a WebSphere DataStage server on a Windows or UNIX® platform (Windows servers can be on the same computer as the clients).
v To run the parallel processing module (module 5), the WebSphere DataStage server must be installed on a multi-processor system (SMP or MPP).

Prerequisites

You need to complete the following tasks before starting the tutorial:
v Get DataStage developer privileges from the WebSphere DataStage administrator
v Check that the WebSphere DataStage administrator has installed and set up the tutorial by following the procedures described in Appendix A

v Obtain the name of the tutorial folder on the WebSphere DataStage client computer and the tutorial project folder or directory on the WebSphere DataStage server computer from the WebSphere DataStage administrator.

Chapter 3. Module 1: Opening and running the sample job

In this module you will view, compile, and run a sample job that is provided with this tutorial. The sample job extracts data from a comma-separated file and writes the data to a staging area. The data that the job writes is used by later modules in the tutorial.

Learning objectives

After you complete the lessons in this module, you will understand how to do the following tasks:
v Start the WebSphere DataStage and QualityStage Designer (Designer client) and attach a project.
v Open an existing job.
v Compile a job so that it is ready to run.
v Open the Director client and run a job.
v View the results of the job.

This module should take approximately 30 minutes to complete.

Lesson 1.1: Opening the sample job

The first step in learning to design jobs is to become familiar with the structure of jobs and with the Designer client. This lesson shows you how to start the Designer client and open the sample job that is supplied with the tutorial.

Prerequisites

Ensure that you have DataStage user authority.

The Designer client

The Designer client gives you the tools that you need to create jobs that extract, transform, load, and check the quality of data. The Designer client is your workbench and your toolbox for building jobs: like a blank canvas, the design area is where you work with tools and objects to create your job designs.

The Designer client uses a repository where you can store the objects that you are creating as part of the design process. The sample job is an object in the repository that is included with the tutorial. The sample job uses a table definition, which is also an object in the repository. These objects can be reused by other job designers.

The Designer client has a palette that contains the tools that form the basic building blocks of a job:
v Stages connect to data sources to read or write files and to process data.
v Links connect the stages along which your data flows.
v Annotations provide information about the jobs that you create.

The sample job for the tutorial

The sample job reads data from a flat file and writes it to a data set. The data that you use in this job is the bill-to information from GlobalCo. This data becomes the bill_to dimension for the star schema.

Parallel jobs use data sets to store data as the data is worked on. These data sets can be transient and invisible to you, the designer, or you can choose to create persistent data sets. The sample job writes data to a persistent data set. The data set provides an internal staging area where the data is held until it is written to its ultimate destination in a later module. When designing jobs, you do not have to create a staging area for your data; this is simply how this tutorial was constructed.

Starting the Designer client and opening the sample job

Ensure that WebSphere Application Server is running.

To start the Designer client and open your first job:
1. Select Start → Programs → IBM Information Server → IBM WebSphere DataStage and QualityStage Designer.
2. In the Attach window, type your user name and password, select the tutorial project from the Project list, and then click OK. The Designer client opens and displays the New window.
3. Click Cancel to close the New window because you are opening an existing job and not creating a new job or other object.
4. In the repository tree, open the Tutorial folder. All of the objects that you need for the tutorial are in this folder.
5. Double-click the samplejob job. The job opens in the Designer client display area.

The following figure shows the Designer client with the samplejob job open.

Module 1: Opening and running the sample job 7 . the data will flow down this link. The data that will flow between the two stages on the link was defined when the job was designed. You compile the job to prepare it to run on your system. Chapter 3.2: Viewing and compiling the sample job In this lesson. You learned the following tasks: v How to start the Designer client v How to open a job v Where to find the tutorial objects in the repository tree Lesson 1. The two stages are joined by a link. The sample job has a Sequential File stage to read data from the flat file and a Data Set stage to write data to the staging area.repository tree sample job palette design area Lesson checkpoint In this lesson. When the job is run. you view the sample job to understand its components. you opened your first job.

Exploring the Sequential File stage

To explore the Sequential File stage:
1. In the sample job, double-click the Sequential File stage that is named GlobalCo_billTo_flat. The stage editor opens to the Properties tab of the Output page. All parallel job stages have properties tabs. You use the properties tab to specify the actions that the stage performs when the job is run.
2. Look at the File property under the Source category. You use this property to specify the file that the stage will read when the job runs. In the sample job, the File property points to a file that is named GlobalCo_BillTo.csv. The name of the directory has been defined as a job parameter named #tutorial_direct#; the # characters show that the name is a job parameter. Job parameters are used so that variable information (for example, a file name or directory name) can be specified when the job runs rather than when the job is designed. You specify the directory that contains this file when you run the job.
3. Look at the First Line is Column Names property under the Options category. In the sample job, this property is set to True because the first line of the GlobalCo_BillTo.csv file contains the names of the columns in the file.
4. Click the Format tab. The Format tab looks similar to the Properties tab, but the properties that the job designer sets here describe the format of the flat file that the stage reads. In this case the file is comma-delimited, which means that each field within a row is separated by a comma character. The Format tab also specifies that the file has DOS line endings. This setting means that the file can be read even when the file resides on a UNIX system.
5. Click the Columns tab. The Columns tab is where the column metadata for the stage is defined. The column metadata defines the data that will flow down the link to the Data Set stage when the job runs. The GlobalCo_BillTo.csv file contains many columns. All of these columns have the data type VarChar. As you work through the tutorial, you will apply stricter data typing to these columns to cleanse the data.
6. Click View Data in the top right corner of the stage editor window.
7. In the Value field of the Resolve Job Parameter window, specify the name of the directory in which the tutorial data was installed and click OK (you have to specify the directory path whenever you view data or run the job).
8. In the Data Browser window, click OK. A window opens that shows the first 100 rows of the data that the GlobalCo_BillTo.csv file contains (100 rows is the default setting, but you can change it).
9. Click Close to close the Data Browser window.
10. Click OK to close the Sequential File stage editor.

Exploring the Data Set stage

To explore the Data Set stage:
1. In the sample job, double-click the Data Set stage that is named GlobalCoBillTo_ds. The stage editor opens in the Properties tab of the Input page. A data set is the internal format for transferring data inside parallel jobs. Data Set stages are used to land data that will be used by another job.
2. Look at the File property under the Target category. This property is used to specify the control file for the data set that the stage will write the data to when the job runs. In the sample job, the File property points to a file called GlobalCo_BillTo.ds. You specify the directory that contains this file when you run the job.
3. Click the Columns tab. The column metadata for this stage is the same as the column metadata for the Sequential File stage and defines the data that the job will write to the data set.
4. Click OK to close the stage editor.
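To see what the #tutorial_direct# job parameter in the File property does, picture the substitution that happens at run time (the property layout shown is an approximation of the sample job, and the run-time value is only an example):

   File property as designed:    #tutorial_direct#\GlobalCo_BillTo.csv
   Value that you supply:        C:\IBM\InformationServer\Server\Projects\Tutorial
   File that the stage reads:    C:\IBM\InformationServer\Server\Projects\Tutorial\GlobalCo_BillTo.csv

Everything between the # characters is replaced by the value that you type in the Resolve Job Parameter window when you view data or run the job.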
The Data Set stage editor does not have a Format tab because the data set does not require any formatting data. Although the View Data button is available on this tab, there is no data for this stage yet. If you click the View Data button, you will receive a message that no data exists. The data gets created when the job runs.

Compiling the sample job

To compile the sample job:
1. Select File → Compile. The Compile Job window opens. As the job is compiled, the window is updated with messages from the compiler.
2. When the Compile Job window displays a message that the job is compiled, click OK. The sample job is now compiled and ready to run.

Lesson checkpoint

In this lesson, you explored a simple data extraction job that reads data from a file and writes it to a staging area. You learned the following tasks:
v How to open stage editors
v How to view the data that a stage represents
v How to compile a job so that it is ready to run

Lesson 1.3: Running the sample job

In this lesson, you use the Director client to run the sample job and to view the log that the job produces as it runs. You also use the Designer client to look at the data set that is written by the sample job.

You run the job from the Director client. The Director client is the operating console. You use the Director client to run and troubleshoot jobs that you are developing in the Designer client. You also use the Director client to run fully developed jobs in the production environment. You use the job log to debug any errors you receive when you run the job.

Running the job

To run the job:
1. In the Designer client, select Tools → Run Director. Because you are logged in to the tutorial project through the Designer client, you do not need to start the Director from the start menu and log on to the project. In the Director client, the sample job has a status of compiled, which means that the job is ready to run.


4. Select File → Exit to close the Director client. In the Job Run Options window. 6. C:\IBM\InformationServer\Server\Projects\Tutorial and click Run. The following figure shows the log view of the job. The job status changes to Running. Examine the job log to see the type of information that the Director client reports as it runs a job.2. Jobs can also have Fatal and Warning messages. When the job status changes to Finished. select View → Log. 5. and select Job → Run Now. specify the path of the project folder (for example. 10 Parallel Job Primer . Select the sample job in the right pane of the Director client. 3. The messages that you see are either control or information type.

Click Close to close the Data Browser window. change the default settings before you click OK). A window opens that shows up to 100 rows of the data written to the data set (if you want to view more than 100 rows in a data browser. 5. v Starting the Director client from the Designer client. 4. Click OK in the Data Browser window to accept the default settings. You learned the following tasks: v How to start the Director client from the Designer client v How to run a job and look at the log file v How to view the data written by the job Module 1: Summary You have now opened. you can start creating your own jobs. compiled. The next module guides you through the process of creating a simple job that does more data extraction. click View Data. Lesson checkpoint In this lesson you ran the sample job and looked at the results. v Opening an existing job. double-click the Data Set stage to open the stage editor. 3. and run your first data extraction job. v Compiling the job. Lessons learned By completing this module. you learned about the following concepts and tasks: v Starting the Designer client. Module 1: Opening and running the sample job 11 . Additional resources For more information about the features that you have learned about. 2. Click OK to close the Data Set stage. see the following guides: v IBM WebSphere DataStage Designer Client Guide v IBM WebSphere DataStage Director Client Guide Chapter 3. Now that you have run a data extraction job.Viewing the data set To view the data set that the job created: 1. v Viewing the results of the sample job and seeing how the job extracts data from a comma-separated file and writes it to a staging area. v Running the sample job. In the stage editor. In the sample job in the Designer client.


Chapter 4. Module 2: Designing your first job

This module teaches you how to design your own job. The job that you design will read two flat files and populate two lookup tables. The two lookup tables will be used by a more complex job that you will create in the next module.

Learning objectives

After completing the lessons in this module, you will understand how to do the following tasks:
v Add stages and links to a job.
v Specify the properties of the stages and links to determine what they will do when the job is run.
v Learn how to specify column metadata.
v Consolidate your knowledge of compiling and running jobs.

This module should take approximately 90 minutes to complete.

Lesson 2.1: Creating a job

The first step in designing a job is to create an empty job and save it to a folder in the repository. You create a parallel job and save it to a new folder in the Tutorial folder in the repository tree. If you closed the Designer client after completing module 1, you will need to start the Designer client again.

To create a job:
1. In the Designer client, select File → New.
2. In the New window, select the Jobs folder in the left pane and then select the parallel job icon in the right pane.
3. Click OK. A new empty job design window opens in the design area.
4. Select File → Save.
5. In the Save Parallel Job As window, right-click on the Tutorial folder and select New → Folder from the shortcut menu.
6. Type in a name for the folder, for example, My Jobs, then move to the Item name field.
7. Type in the name of the job in the Item name field. Call the job populate_cc_spechand_lookupfiles.
8. Check that the Folder path field contains the path \Tutorial\My Jobs, then click Save.

You have created a new parallel job named populate_cc_spechand_lookupfiles and saved it in the folder Tutorial\My Jobs in the repository.

Lesson checkpoint

In this lesson you created a job and saved it to a specified place in the repository. You learned the following tasks:
v How to create a job in the Designer client.
v How to name the job and save it to a folder in the repository tree.

Lesson 2.2: Adding stages and links to the job

You add stages and links to the job that you created. Stages and links are the building blocks that determine what the job does when it runs. A job consists of stages linked together which describe the flow of data from a data source to a data target. A stage is a graphical representation of the data itself, or of a transformation that will be performed on that data.

The job design

In this lesson, you will build the first part of the job, then compile it and run it to ensure that it works correctly before you add the next part of the job design. This method of iterative job design is a good habit to get into. You ensure that each part of your job is functional before you continue with the design for the next part.

The first part of the job reads a comma-separated file that contains a series of customer numbers, a corresponding code that identifies the country in which the customers are located, and another code that specifies the customer's language. You are designing a job that reads the comma-separated file and writes the contents to a lookup table in a lookup file set. This table will be used by a subsequent job when it populates a dimension table.

Adding stages and linking them

Ensure that the job named populate_cc_spechand_lookupfiles that you created in lesson 1 is open and active in the job design area. A job is active when the title bar is dark blue (if you are using the default Windows colors).

To add the stages to your job design:
1. In the Designer client palette area, click the File bar to open the file section of the palette.
2. In the file section of the palette, select the Sequential File stage icon and drag the stage to your open job. Position the stage on the left side of the job window.
3. In the file section of the palette, select the Lookup File Set stage icon and drag the stage to your open job. Position the stage on the right side of the job window.
4. In the palette area, click the General bar to open the general section of the palette.
5. Select the Link icon and move your mouse pointer over to the Sequential File stage. The mouse pointer changes to a target shape.
6. Click the Sequential File stage to anchor the link and then drag the mouse pointer over to the Lookup File Set stage. A link is drawn between the two stages. Your data will flow down this link when the job runs.
7. Rename the stages and links as follows:
   a. Select each stage or link.
   b. Right-click and select Rename.
   c. Type the new name.

   Stage or link            New name
   Sequential File stage    country_codes
   Lookup File Set stage    country_code_lookup
   Link                     country_codes_data

Always use specific names for your stages and links rather than the default names assigned by the Designer client. Using specific names makes your job designs easier to document and easier to maintain.

Your job design should now look something like the one shown in this figure:

Specifying properties and column metadata for the Sequential File stage

You will now edit the first of the stages that you added to specify what the stage does when you run the job. You will also specify the column metadata for the data that will flow down the link that joins the two stages.

To edit the stages and add properties and metadata:
1. Double-click the country_codes Sequential File stage to open the stage editor. The editor opens in the Properties tab of the Output page.
2. Select the File property under the Source category.
3. In the File field, type the path name for your project folder (where the data files were copied when the tutorial was set up) and add the name CustomerCountry.csv (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\CustomerCountry.csv), and then press Enter. (You can browse for the path name if you prefer.) You specified the name of the comma-separated file that the stage reads when the job runs.
4. Select the First Line is Column Names property under the Options category.
5. Click the down arrow next to the First Line is Column Names field and select True from the list. The row that contains the column names is dropped when the job reads the file.
6. Click the Format tab.
7. In the record-level category, select the Record delimiter string property from the Available properties to add.
8. Select DOS format from the Record delimiter string list. This setting ensures that the file can be read by UNIX or Linux® WebSphere DataStage servers.
9. Click the Columns tab. Because the CustomerCountry.csv file contains only three columns, type the column definitions into the Columns tab.

11. 13. The definitions can then be reused in other jobs. In the Save Table Definition window. Fill in the fields as follows: Column Name CUSTOMER_ NUMBER Key Yes SQL Type Char Length 7 Description Key column for the look up . Add two more rows to the table to specify the remaining two columns and fill them in as follows: Column Name COUNTRY Key No SQL Type Char Length 2 Description The code that identifies the customer’s country The code that identifies the customer’s language LANGUAGE No Char 2 Your Columns tab should look like the one in the following figure (if you have National Language Support installed. 10. enter the following information: Option Data source type Data source name Table/file name Description Saved CustomerCountry.the customer identifier You will use the default values for the remaining fields. Click the Save button to save the column definitions that you specified as a table definition object in the repository.) Note that column names are case-sensitive. there is an additional field named Extended): 12.csv country_codes_data 16 Parallel Job Primer .and more accurate to import the column definitions directly from the data source. so use the case in the instructions. Double-click the first line of the table.

14. In the Save Table Definition As window, save the table definition in the Tutorial folder and name it country_codes_data.
15. Click OK to specify the locator for the table definition. The locator identifies the table definition.
16. Click OK to close the Sequential File stage editor. Notice that a small table icon has appeared on the country_codes_data link. This icon shows that the link now has metadata.
17. Save the job.

You have designed the first part of your job.

Specifying properties for the Lookup File Set stage and running the job

In this part of the lesson, you configure the next stage in your job. You already specified the column metadata for data that will flow down the link between the two stages, so there are fewer properties to specify in this task.

To configure the Lookup File Set stage:
1. Double-click the country_code_lookup Lookup File Set stage to open the stage editor. The editor opens in the Properties tab of the Input page.
2. Select the Lookup File Set property under the Target category.
3. In the Lookup File Set field, type the path name for the lookup file set that the stage will create (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\countrylookup.fs) and press Enter.
4. Select the Lookup Keys category, then double-click the Key property in the Available properties to add area.
5. In the Key field, click the down arrow, select CUSTOMER_NUMBER from the list, and press Enter. You specified that the CUSTOMER_NUMBER column will be the lookup key for the lookup table that you are creating.
6. Click OK to save your property settings and close the Lookup File Set stage editor.
7. Save the job and then compile and run the job by using the techniques that you learned in Lesson 1.3. You have now written a lookup table that can be used by another job later on in the tutorial.
8. Click the View Data button and click OK in the Data Browser window to use the default settings. The data browser shows you the data that the CustomerCountry.csv file contains. Since you specified the column definitions, the Designer client can read the file and show you the results.
9. Close the Data Browser window.

Lesson checkpoint

You have now designed and run your very first job. You learned the following tasks:
v How to add stages and links to a job
v How to set the stage properties that determine what the stage will do when you run the job
v How to specify column metadata for the job and to save the column metadata to the repository for use in other jobs
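To tie the lesson together, here is a sketch of the beginning of CustomerCountry.csv as it relates to the metadata that you defined (the customer rows are invented for illustration; only the layout is significant):

   CUSTOMER_NUMBER,COUNTRY,LANGUAGE
   GC10001,US,EN
   GC10002,FR,FR
   GC10003,DE,DE

The header line matches the column names that you typed, and it is dropped when the job reads the file because First Line is Column Names is set to True. On disk, each record ends with the DOS delimiter, a carriage return followed by a line feed, which is what the Record delimiter string setting declares and why UNIX and Linux servers can still read the file.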

Lesson 2.3: Importing metadata

You can import column metadata directly from the data source that your job will read, and use that metadata in the job design. You store the metadata in the repository where it can be used in any job designs.

In this lesson, you will add more stages to the job that you designed in Lesson 2.2. The stages that you add are similar to the ones that you added in Lesson 2.2. The stages read a comma-separated file that contains code numbers and corresponding special delivery instructions. The contents are again written to a lookup table that is ready to use in a later job. The finished job contains two separate data flows, and it will write data to two separate lookup file sets.

Rather than type the column metadata, you import the column metadata from the source file. You will then save the column definitions as a table definition in the repository.

Importing metadata into your repository

In this part of the lesson, you will import column definitions from the comma-separated file that contains the special delivery instructions. Note that you can import metadata when no jobs are open in the Designer client; this procedure is independent of job design.

1. In the Designer client, select Import → Table Definitions → Sequential File Definitions.
2. In the Directory field in the Import Metadata (Sequential) window, type the path name of, or browse for, the Tutorial folder in the project folder (for example, C:\IBM\InformationServer\Server\Projects\Tutorial). The importer displays any files in the directory that have the suffix .txt. If there are no files to display, you see an error message. You can ignore this message.
3. In the File Type field, select a file type of Comma Separated (*.csv). The Files list is populated with all the files in the specified directory that have the suffix .csv.
4. In the Files list, select the SpecialHandling.csv file.
5. In the To folder field, type the folder name \Tutorial\Table Definitions in which to store the table definition.
6. Click Import.
7. In the Define Sequential Metadata window, select First line is column names.
8. Click the Define tab and examine the column definitions that were derived from the SpecialHandling.csv file.
9. Click OK.
10. Click Close to close the Import Metadata (Sequential) window.

The column definitions that you viewed are stored as a table definition in the repository.
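For orientation, the special handling file follows the same comma-separated pattern as the country codes file. Its code column is SPECIAL_HANDLING_CODE, which you will use as the lookup key in the next section; the name of the second column and the sample rows below are invented for illustration, so check the real definitions on the Define tab in step 8:

   SPECIAL_HANDLING_CODE,DESCRIPTION
   1,Hold for customer pickup
   2,Next-day delivery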

Loading column metadata from the repository

You can specify the column metadata that a stage uses by loading the metadata from a table definition in the repository. In this part of the lesson, you are consolidating the job design skills that you learned and loading the column metadata from the table definition that you imported earlier.

1. Ensure that your job named populate_cc_spechand_lookupfiles is open and active.
2. Add a Sequential File stage and a Lookup File Set stage to your job and link them together. Position them under the stages and link that you added earlier in this lesson. Rename the stages and link as follows:

   Stage or Link     Name
   Sequential File   special_handling
   Lookup File Set   special_handling_lookup
   Link              special_handling_data

Your job design should now look like the one shown in this figure:

3. Open the stage editor for the special_handling Sequential File stage and specify that it will read the file SpecialHandling.csv and that the first line of this file contains column names.
4. Click the Format tab.
5. In the record-level category, select the Record delimiter string property from the Available properties to add.
6. Select DOS format from the Record delimiter string list. This setting ensures that the file can be read by UNIX or Linux WebSphere DataStage servers.
7. Click the Columns tab.
8. Click Load.
9. In the Table Definitions window, browse the repository tree to the folder where you stored the SpecialHandling.csv table definition and click OK.
10. In the Selected Columns window, ensure that all of the columns appear in the Selected columns list and click OK. The column definitions appear in the Columns tab of the stage editor. You loaded the column metadata from the table definition that you previously saved as an object in the repository.
11. Close the Sequential File stage editor.
12. Open the stage editor for the special_handling_lookup stage.
13. Specify a path name for the destination file set and specify that the lookup key is the SPECIAL_HANDLING_CODE column, then close the stage editor.
14. Save, compile, and run the job.

Lesson checkpoint

You have now added to your job design and learned how to import the metadata that the job uses. You learned the following tasks:
v How to import column metadata directly from a data source
v How to load column metadata from a definition that you saved in the repository

Lesson 2.4: Adding job parameters

When you use job parameters in your job designs, you create a better job design. You specified the location of four files in the job that you designed in Lesson 2.3. In this lesson, you will replace all four file names with job parameters. You will then supply the actual path names of the files when you run the job. You will save the definitions of these job parameters in a parameter set in the repository. When you want to use the same job parameters in a job later on in this tutorial, you can load them into the job design from the parameter set.

Job parameters

Sometimes, you want to specify information when you run the job rather than when you design it. In each part of the job, you specified a file that contains the source data and a file to write the lookup data set to. For variable information such as this, you can specify a job parameter to represent this information. When you run the job, you are then prompted to supply a value for the job parameter.

Defining job parameters

Ensure that the job named populate_cc_spechand_lookupfiles that you designed in Lesson 2.3 is open and active.
1. Select Edit → Job Properties.
2. In the Job Properties window, click the Parameters tab.
3. Double-click the first line of the grid to add a new row.
4. In the Parameter name field, type country_codes_source.
5. In the Prompt field, type path name for the country codes file.
6. In the Type field, select the path name data type.
7. In the Help Text field, type Enter the path name for the comma-separated file that contains the country code definitions.
8. Repeat steps 3-7 to define three more job parameters containing the following entries:

   Parameter name            Prompt                                                Type        Help text
   country_codes_lookup      path name for the country codes lookup file set      path name   Enter the path name for the file set for the country code lookup table
   special_handling_source   path name for the special handling codes file        path name   Enter the path name for the comma-separated file that contains the special handling code definitions
   special_handling_lookup   path name for the special handling lookup file set   path name   Enter the path name for the file set for the special handling lookup table

10. 1. Adding job parameters to your job design Now that you have defined the job parameters. and select Insert Job Parameter from the menu.The Parameters tab of the Job Properties window should now look like the one in the following figure: 9. specifying job parameters for each of the File properties as follows: Stage country_codes_lookup stage special_handling stage special_handling_lookup stage Property Lookup file set File Lookup file set Parameter name country_codes_lookup special_handling_source special_handling_lookup 6. 4. the Director client prompts you to supply values for the job parameters. Select country_codes_source from the list and press enter. 3. Supplying values for the job parameters When you run the job. 4. In the Parameters page of the Job Run Options window. Select the File property in the Source category and delete the path name that you entered in the File field. Click OK to close the Job Properties window. The text #country_codes_source# appears in the File field. Click File → Save to save the job. Repeat these steps for each of the stages in the job. Click Run. you will add them to your job design. This text specifies that the job will request the name of the file when you run the job. Module 2: Designing your first job 21 . Click the right arrow next to the File field. type a path name for each of the job parameters. 2. 3. 2. 5. Select your job name and select Job → Run Now. Save and recompile the job. 1. Double-click the country_codes Sequential File stage to open the stage editor. Chapter 4. Open the Director client.

Supplying values for the job parameters

When you run the job, the Director client prompts you to supply values for the job parameters.
1. Save and recompile the job.
2. Open the Director client.
3. Select your job name and select Job → Run Now.
4. In the Parameters page of the Job Run Options window, type a path name for each of the job parameters.
5. Click Run. The job runs, using the values that you supplied for the job parameters.

Lesson checkpoint

You defined job parameters to represent the file names in your job and specified values for these parameters when you ran the job. You learned the following tasks:
v How to define job parameters
v How to add job parameters in your job design
v How to specify values for the job parameters when you run the job

Lesson 2.5: Creating parameter sets

You can store job parameters in a parameter set in the repository. You can then reuse the job parameters in other job designs. In this lesson, you will create a parameter set from the job parameters that you created in Lesson 2.4. You will also supply a set of default values for the parameters in the parameter set that are also available when the parameter set is used.

Parameter sets

You use parameter sets to define job parameters that you are likely to reuse in other jobs. You can create parameter sets from existing job parameters, or you can specify the job parameters as part of the task of creating a new parameter set. These parameter sets are stored as files in the WebSphere DataStage server installation directory and are available to use in your job designs or when you run jobs that use these parameter sets. You can also define different sets of values for each parameter set.

Whenever you need this set of parameters in a job design, you can insert them into the job properties from the parameter set. If you make any changes to a parameter set object, these changes are reflected in job designs that use this object until the job is compiled. The parameters that a job is compiled with are available when the job is run. However, if you change the design after the job is compiled, the job will link to the current version of the parameter set.
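As background, it can help to picture roughly what one of these value files holds. In the procedure that follows you will name a value file lookupvalues1; conceptually it pairs each parameter in the set with the default that you type (the on-disk format shown here is an assumption and is not something you need to edit by hand; the fourth path is invented for the illustration):

   country_codes_source=C:\IBM\InformationServer\Server\Projects\Tutorial\CustomerCountry.csv
   country_codes_lookup=C:\IBM\InformationServer\Server\Projects\Tutorial\countrylookup.fs
   special_handling_source=C:\IBM\InformationServer\Server\Projects\Tutorial\SpecialHandling.csv
   special_handling_lookup=C:\IBM\InformationServer\Server\Projects\Tutorial\specialhandling.fs

Selecting the value file when you run a job fills in all four parameters at once.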

Creating a parameter set from existing job parameters

Ensure that your job is open and active.
1. Select Edit → Job Properties.
2. In the Job Properties window, click the Parameters tab.
3. Use shift-click to select all of the job parameters that you defined in Lesson 2.4.
4. Click Create Parameter Set.
5. In the General page of the Parameter Set window, type a name for the parameter set and a short description (for example, tutorial_lookup and parameter set for lookup file names).
6. Click the Parameters tab and check that all the job parameters that you specified for your job appear in this page.
7. Click the Values tab.
8. In the Value File name field, type lookupvalues1.
9. For each of the job parameters, specify a default path name for the file that the job parameter represents. Your Values page should now look similar to the Values page in the following picture:
10. Specify a repository folder in which to store the parameter set, and then click Save.
11. Click OK. The Designer client asks if you want to replace the selected parameters with the parameter set that you have just created.
12. Click No. The current job continues to use the individual parameters rather than the parameter set.
13. Click OK to close the Job Parameters window.
14. Save the job.

Lesson checkpoint

You have now created a parameter set. You created a parameter set that is available for another job that you will create later in this tutorial. You learned the following tasks:
v How to create a parameter set from a set of existing job parameters
v How to specify a set of default values for the parameters in the parameter set

Module 2 Summary

In this module, you designed and ran a data extraction job. You also learned how to create reusable objects such as table definitions and parameter sets that you can include in other jobs that you design.

Lessons learned

By completing this module, you learned about the following concepts and tasks:
v Creating new jobs and saving them in the repository
v Adding stages and links and specifying their properties
v Specifying column metadata and saving it as a table definition to reuse later
v Specifying job parameters to make your job design more flexible, and saving the parameters in the repository to reuse later


Chapter 5. Module 3: Designing a transformation job

This module teaches you how to design a job that transforms data. The job that you design will read the GlobalCo bill_to data that was written to a data set when you ran the sample job in Module 1 of this tutorial. Your job will transform the data by dropping the columns that you do not need and by trimming some of the data in the columns that you do need.

Learning objectives

After completing the lessons in this module, you will understand how to do the following tasks:
v How to use a Transformer stage to transform data
v How to handle rejected data
v How to combine data by using a Lookup stage

This module should take approximately 60 minutes to complete.

Lesson 3.1: Designing the transformation job

You will design and run a job that performs some simple transformations on the bill_to data.

The transformer job

The data that was read from the GlobalCo_BillTo.csv comma-separated file by the sample job in Module 1 contains a large number of columns. The dimension table that you will produce later in this tutorial requires only a subset of these columns, so you will use the transformation job to drop some of the columns. Your job will perform some simple cleansing of the data. The job will also specify some stricter data typing for the remaining columns. Stricter data typing helps to impose quality controls on the data that you are processing. Finally, the job applies a function to one of the data columns to delete space characters that the column contains. This transformation job prepares the data in that column for a later operation.

The transformation job that you are designing uses a Transformer stage. Several functions are available to use in the Transformer stage. In the current job, you use the Transformer stage because you require a transformation function that you can customize. There are also several other types of processing stages available in the Designer client that can transform data, and several of the processing stages can drop data columns as part of their processing. For example, if you want to change only the data types in a data set, you can use the Modify stage in your job.

Creating the transformation job and adding stages and links

In this part of the lesson, you will create your transformation job and learn a new method for performing tasks that you are already familiar with.
1. Create a parallel job, save it as TrimAndStrip, and store it in the Tutorial folder in the repository tree.
2. Add two Data Set stages to the design area. Name the Data Set stage on the left GlobalCoBillTo, and name the one on the right int_GlobalCoBillTo.
3. Click Palette → Processing to locate and drag a Transformer stage to the design area.

4. Drop the Transformer stage between the two Data Set stages and name the Transformer stage Trim_and_Strip.
5. Right-click the GlobalCoBillTo Data Set stage and drag a link to the Transformer stage. This method of linking the stages is fast and easy. You do not need to go back to the palette and grab a link to connect each stage, so this method saves time when you are designing very large jobs.
6. Use the same method to link the Transformer stage to the int_GlobalCoBillTo Data Set stage.
7. Name the first link full_bill_to and name the second link stripped_bill_to.
8. Save the job.

Your job should look like the one in the following picture:

Configuring the Data Set stages

In this part of the lesson, you configure the Data Set stages and learn a new method for loading column metadata.
1. Open the stage editor for the GlobalCoBillTo Data Set stage.
2. Set the File property in the Source category to point to the data set that was created by the sample job in Module 1 (GlobalCoBillTo.ds), and close the stage editor.
3. In the repository window, select the GlobalCoBillToSource table definition in the Tutorial folder. Drag the table definition to the design area and drop it onto the full_bill_to link. The cursor changes shape to indicate the correct position to drop the table definition. In Lesson 2.3, you clicked Load in the stage editor to perform the same action.
4. Open the stage editor for the GlobalCoBillTo Data Set stage and click View Data. The data browser shows the data in the data set. You should frequently view the data after you configure a stage to verify that the File property and the column metadata are both correct.
5. Open the stage editor for the int_GlobalCoBillTo Data Set stage.
6. Set the File property in the Target category to point to a new staging data set (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\int_GlobalCoBillTo.ds), and close the stage editor.

Configuring the Transformer stage

In this part of the lesson, you specify the transformation operations that your job will perform when you run it.
1. Double-click the Transformer stage to open the stage editor.
2. CTRL-click to select the following columns from the full_bill_to link in the upper left pane of the stage editor:
v CUSTOMER_NUMBER
v CUST_NAME

v ADDR_1
v ADDR_2
v CITY
v REGION_CODE
v ZIP
v TEL_NUM
v REVIEW_MONTH
v SETUP_DATE
v STATUS_CODE
3. Drag these columns from the upper left pane to the stripped_bill_to link in the upper right pane of the stage editor. You are specifying that only these columns will flow through the Transformer stage when the job is run. The remaining columns will be dropped.
4. In the stripped_bill_to column definitions at the bottom of the right pane, edit the SQL type and length fields for your columns as specified in the following table. By specifying stricter data typing for your data, you will be able to better diagnose inconsistencies in your source data when you run the job.

   Column            SQL Type   Length
   CUSTOMER_NUMBER   Char       7
   CUST_NAME         VarChar    30
   ADDR_1            VarChar    30
   ADDR_2            VarChar    30
   CITY              VarChar    30
   REGION_CODE       Char       2
   ZIP               VarChar    10
   TEL_NUM           VarChar    10
   REVIEW_MONTH      VarChar    2
   SETUP_DATE        VarChar    12
   STATUS_CODE       Char       1

5. Double-click the Derivation field for the CUSTOMER_NUMBER column in the stripped_bill_to link. The expression editor opens.
6. In the expression editor, type the following text:

   trim(full_bill_to.CUSTOMER_NUMBER,' ','A')

The text specifies a function that deletes all the space characters from the CUSTOMER_NUMBER column on the full_bill_to link before writing it to the CUSTOMER_NUMBER column on the stripped_bill_to link. Your Transformer stage editor should look like the one in the following figure:

7. Click OK to close the Transformer stage editor.
8. Open the stage editor for the int_GlobalCoBillTo Data Set stage and go to the Columns tab of the Input page. Notice that the stage editor has acquired the metadata from the stripped_bill_to link.
9. Save and then compile your TrimAndStrip job.
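A note on the trim expression in step 6 (the argument meanings described here are the standard behavior of the DataStage trim function, and the sample value is invented): the first argument names the input column, the second argument is the character to remove, and the third argument selects which occurrences to remove, where 'A' means all occurrences. For example:

   trim(' GC 2141 3 ', ' ', 'A')   returns   'GC21413'

Stripping every embedded space is what makes the customer number usable as an exact-match lookup key in the next lesson.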

Running the transformation job

Previously, you ran jobs from the Director client. Now, you will run the job from the Designer client. This technique is useful when you are developing jobs since you do not leave the Designer client. You can view the job log in the Director client even when you run the job from the Designer client.
1. In the Designer client, select Diagram → Show performance statistics. Additional information is shown next to the job links to provide figures for the number of rows that were transferred and the number of rows that were processed per second. This information is updated as the job runs.
2. Select File → Run and click Run in the Job Run Options window. As the TrimAndStrip job runs, the performance figures for the link are updated, and the links themselves change color to show their status.
3. When the job finishes running, open the Director client, select the TrimAndStrip job, and look at its job log.

Lesson checkpoint

In this lesson you learned how to design and configure a transformation job. You learned the following tasks:
v How to configure a Transformer stage
v How to link stages by using a different method for drawing links
v How to load column metadata into a link
v How to run a job from within the Designer client and monitor the performance of the job

Lesson 3.2: Combining data in a job

The Designer client supports more complex jobs than the ones that you designed so far. In this lesson, you begin to build a more complex job that combines data from two different tables. You will base your new job on the transformation job that you created in Lesson 3.1. You will add a Lookup stage that looks up the data that you created in Lesson 2.2.

Using a Lookup stage

Performing a lookup (search) is one way in which a job can combine data. The lookup is performed by the Lookup stage. The Lookup stage has a stream input and a reference input. The Lookup stage uses one or more key columns in the stream input to search for data in a reference table. The stage adds the data from the reference table to the stream output.

You can configure Lookup stages to search for data in a lookup file set, or they can search for data in a relational database. When you use lookup file sets, you must specify the lookup key column when you define the file set. You defined the key columns for the lookup tables that you used in this lesson when you created the file sets in Module 2.

The Lookup stage is most efficient where the data being looked up fits into the available physical memory. You can also combine data in a parallel job by using a Join stage. Where you use a large reference table, a job can run faster if it combines data by using a Join stage. For the job that you are designing, the reference table is small, and so a Lookup stage is preferred.

Creating a lookup job

Next, you will create a job and add some of the stages that you configured in the TrimAndStrip job that you designed and ran in Lesson 3.1, using a drag-and-drop operation. The job will look up the data in a reference table in a Lookup File Set stage that was created in Lesson 2.2 of this tutorial.

Ensure that the TrimAndStrip job that you created in Lesson 3.1 is open, and that you have a multi-window view in the design area of the Designer client. To switch from single-window view to multi-window view, click the minimize button in the Designer client menu bar. In multi-window view, you can see all the open jobs in the display area.
1. Create a job, name it CleansePrepare, and save it in the Tutorial folder in the repository.
2. In the TrimAndStrip job, drag the mouse cursor around the stages in the job to select them and select Edit → Copy.
3. In the CleansePrepare job, select Edit → Paste. The stages appear in the CleansePrepare job. You can now close the TrimAndStrip job.
4. Select the Processing area in the palette and drag a Lookup stage to the CleansePrepare job. Position the Lookup stage just below the int_GlobalCoBillTo stage and name it Lookup_Country.
5. Select the stripped_bill_to link, position the mouse cursor in the link's arrowhead, and drag to the Lookup stage. You moved the link with its associated column metadata to allow data to flow from the Transformer stage to the Lookup stage.

6. Select the File area in the palette and drag a Lookup File Set stage to the job. Position it immediately above the Lookup stage and name it Country_Codes_Fileset. This stage represents the lookup file set that you created in Module 2.
7. Draw a link from the Country_Codes_Fileset Lookup File Set stage to the Lookup_Country Lookup stage and name it country_reference. The link appears as a dotted line, which indicates that the link is a reference link.
8. Delete the int_GlobalCoBillTo Data Set stage. It will be replaced with a different Data Set stage.
9. Drag a Data Set stage from the palette to the job and position it to the right of the Lookup stage. Name the Data Set stage temp_dataset.
10. Draw a link from the Lookup stage to the Data Set stage and name it country_code.
The job that you designed should look like the one in the following figure:

Configuring the Lookup File Set stage
In the lesson section "Creating a lookup job," you copied stages from the TrimAndStrip job to the CleansePrepare job. These stages are already configured, so you need to configure only the new stages that you add in this job. One of these stages is the Country_Codes_Fileset Lookup File Set stage. In this exercise, you will use the parameter set that you created in Lesson 2.5.
1. Open the Job Properties window and click the Parameters tab.
2. Click Add Parameter Set.
3. Browse to the tutorial folder, select the parameter set that you created in Lesson 2.5, and click OK.
4. Close the Job Properties window.
5. Double-click the Country_Codes_Fileset Lookup File Set stage to open the stage editor.
6. Select the Lookup File Set property in the Source category, click the right arrow next to the Lookup File Set field, and select Insert Job Parameter from the menu. A list is displayed that shows all the individual job parameters in the parameter set.
7. In the list, select the country_codes_lookup job parameter and then press Enter. (An example of the resulting property value follows this procedure.)
8. Click the Columns tab and load the country_codes_data table definition. Select the first row in the grid, and then close the stage editor.
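After steps 6 and 7, the Lookup File Set property no longer holds a literal path. Instead, it holds a reference to the job parameter, which the Designer writes as the parameter name enclosed in number signs. Assuming, purely for illustration, that you named your parameter set lookupvalues (check the name that you chose in Lesson 2.5), the property value would read something like this:

    #lookupvalues.country_codes_lookup#

When the job runs, the value that you supply for country_codes_lookup in the Job Run Options window replaces this token.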

Configuring the Lookup stage
You specify the data that is combined in the Lookup stage. You defined the column that acts as the key for the lookup when you created the lookup file set in Module 2.
1. Double-click the Lookup_Country Lookup stage to open the Lookup stage editor. The Lookup stage editor is similar in appearance to the Transformer stage editor.
2. Click the title bar of the stripped_bill_to link in the left pane and drag it over to the Column Name column of the country_code link in the right pane. When the cursor changes shape, release the mouse button. All of the columns from the stripped_bill_to link appear in the country_code link.
3. Select the Country column in the country_reference link and drag it to the country_code link. The result of copying the Country column to the country_code link is that, whenever the value of the incoming CUSTOMER_NUMBER column matches the value of the CUSTOMER_NUMBER column of the lookup table, the corresponding Country value is added to that row of data. The stage editor looks like the one in the following figure:

4. Double-click the Condition bar in the country_reference link. The Lookup Stage Conditions window opens. Select the Lookup Failure field and select Continue from the list. You are specifying that, if a CUSTOMER_NUMBER value from the stripped_bill_to link does not match any CUSTOMER_NUMBER values in the reference table, the job continues with the next row.
5. Close the Lookup stage editor.
6. Open the temp_dataset Data Set stage and specify a file name for the data set.
7. Save, compile, and run the job. The Job Run Options window displays all the parameters in the parameter set.
8. In the Job Run Options window, select lookupvalues1 from the list next to the parameter set name. The parameter values are filled in with the path names that you specified when you created the parameter set.
9. Click Run to run the job, and then click View Data in the temp_dataset stage to examine the results.
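The following sketch illustrates what the Lookup stage does with a single row. The values are invented for illustration and are not taken from the tutorial data:

    stripped_bill_to row:     CUSTOMER_NUMBER = "GC10001", CUST_NAME = "Acme Ltd", ...
    reference table row:      CUSTOMER_NUMBER = "GC10001", Country = "US"
    country_code output row:  CUSTOMER_NUMBER = "GC10001", CUST_NAME = "Acme Ltd", ..., Country = "US"

If no reference row matches the incoming CUSTOMER_NUMBER and the Lookup Failure condition is Continue, the output row is still written, with Country set to NULL.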

Lesson checkpoint
With this lesson, you started to design more complex and sophisticated jobs. You learned the following tasks:
- How to copy stages, links, and associated configuration data between jobs
- How to combine data in a job by using a Lookup stage

Lesson 3.3: Capturing rejected data
This lesson shows you how to capture rows of data that are rejected while they are being processed. Ensure that the CleansePrepare job that you created in Lesson 3.2 is open and active.
In the Lookup stage for the job that you created in Lesson 3.2, you specified that processing should continue on a row if the lookup operation fails. Any rows that contained CUSTOMER_NUMBER values that were not matched in the lookup table were passed through, and the COUNTRY column for those rows was set to NULL. In this lesson, you will specify that non-matching rows are written to a reject link instead. The reject link captures any customer numbers that do not have an entry in the country codes table. You can then examine the rejected rows and decide what action to take.
1. From the File section of the palette, drag a Sequential File stage to the CleansePrepare job and position it under the Lookup_Country Lookup stage. Name the Sequential File stage Rejected_Rows.
2. Draw a link from the Lookup stage to the Sequential File stage. Name the link rejects. Because the Lookup stage already has a stream output link, the new link is designated as a reject link and is shown as a dashed line. Your job should resemble the one in the following figure:


3. Double-click the Lookup_Country Lookup stage to open the Lookup stage editor.
4. Double-click the Condition bar in the country_reference link to open the Lookup Stage Conditions window.
5. In the Lookup Stage Conditions window, select the Lookup Failure field and select Reject from the list. Close the Lookup stage editor. This step specifies that, whenever a row from the stripped_bill_to link has no matching entry in the country code lookup table, the row is rejected and written to the Rejected_Rows Sequential File stage.
6. Edit the Rejected_Rows Sequential File stage and specify a path name for the file that the stage will write to (for example, c:\tutorial\rejects.txt). This stage derives its column metadata from the Lookup stage, and you cannot alter it.
7. Save, compile, and run the CleansePrepare job.
8. Open the Rejected_Rows Sequential File stage editor and click View Data to look at the rows that were rejected.

Lesson checkpoint
You learned the following tasks:
- How to add a reject link to your job
- How to configure the Lookup stage so that it rejects data when a lookup fails

Lesson 3.4: Performing multiple transformations in a single job
You can design complex jobs that perform many transformation operations on your data. In this lesson, you will further transform your data to apply some business rules and perform another lookup of a reference table.


In the sample bill_to data, one of the columns is overloaded. The SETUP_DATE column can contain a special handling code as well as the date that the account was set up. The transformation logic that is being added to the job extracts this special handling code into a separate column. The transformation logic also adds a row count to the output data. The job then looks up the text description that corresponds to the code from the lookup table that you populated in Lesson 2 and adds the description to the output data.

Adding new stages and links
This task adds the extra stages to the job that implement the additional transformation logic.
1. Add the following stages to your CleansePrepare job:
a. Place the Transformer stage above the temp_dataset Data Set stage and name the stage Business_Rules.
b. Place the Lookup stage immediately to the right of the temp_dataset Data Set stage and name the Lookup stage Lookup_Spec_Handling.
c. Place the Lookup File Set stage immediately above the Lookup_Spec_Handling Lookup stage and name the Lookup File Set stage Special_Handling_Lookup.
d. Place the Data Set stage immediately to the right of the Lookup_Spec_Handling Lookup stage and name the Data Set stage Target.
2. Delete the temp_dataset Data Set stage. The temp_dataset stage is not required for this job, therefore you can remove it.
3. Drag the arrowhead end of the country_code link and attach it to the Business_Rules Transformer stage.
4. Drag the Business_Rules Transformer stage down so that it aligns horizontally with the Lookup stages.
5. Link the stages:
a. Link the Business_Rules Transformer stage to the Lookup_Spec_Handling Lookup stage and name the link with_business_rules.
b. Link the Special_Handling_Lookup Lookup File Set stage to the Lookup_Spec_Handling Lookup stage and name the link special_handling.
c. Link the Lookup_Spec_Handling Lookup stage to the Target Data Set stage and name the link finished_data.
Your CleansePrepare job should now resemble the one in the following figure:

Configuring the Business_Rules Transformer stage
In this exercise, you configure the Transformer stage to extract the special handling code and add a row count to the output data.
1. Open the Business_Rules Transformer stage editor.
2. Select the following columns in the country_code input link and drag them to the with_business_rules output link:
- CUSTOMER_NUMBER
- CUST_NAME
- ADDR_1
- ADDR_2
- CITY
- REGION_CODE
- ZIP
- TEL_NUM
3. In the metadata area for the with_business_rules output link, add the following new columns by using the techniques that you learned for defining table definitions:

    Column name            SQL type  Length  Nullable
    SOURCE                 Char      10      No
    RECNUM                 Char      10      No
    SETUP_DATE             Char      10      Yes
    SPECIAL_HANDLING_CODE  Integer   10      Yes

The new columns appear in the graphical representation of the link, but they are highlighted in red because they do not yet have valid derivations. You will define the derivations later in this procedure.
4. In the Transformer stage editor toolbar, click the Stage Properties tool on the far left. The Transformer Stage Properties window opens.
5. Click the Variables tab and add the following stage variables to the grid:

    Name                   SQL type  Precision
    xtractSpecialHandling  Char      1
    TrimDate               VarChar   10

When you close the Properties window, these stage variables appear in the Stage Variables area above the with_business_rules link. If the Stage Variables area is not visible, click the Show/Hide Stage Variables icon to display the stage variable grid in the right pane.
6. In the graphical area, double-click the Derivation field of the SOURCE column. In the expression editor, type 'GlobalCo':. Position your mouse pointer immediately to the right of this text, right-click, and select Input Column from the menu. Then select the COUNTRY column from the list. When you run the job, the SOURCE column for each row will contain the two-letter country code prefixed with the text GlobalCo, for example, GlobalCoUS.
7. Double-click the Derivation field of each of the stage variables in turn and type the following expressions in the expression editor:

Stage variable: xtractSpecialHandling
Expression: If Len(country_code.SETUP_DATE) < 2 Then country_code.SETUP_DATE Else Field(country_code.SETUP_DATE,' ',2)
Description: This expression checks whether the SETUP_DATE column contains a special handling code. If the column contains a date and a code, the code is extracted and the value of xtractSpecialHandling is set to that code.

Stage variable: TrimDate
Expression: If Len(country_code.SETUP_DATE) < 3 Then '01/01/0001' Else Field(country_code.SETUP_DATE,' ',1)
Description: This expression checks that the SETUP_DATE column contains a date. If the SETUP_DATE column contains a date, then the date is extracted and the value of the TrimDate variable is set to a string that contains the date. If the SETUP_DATE column does not contain a date, then the expression sets the value of the TrimDate variable to the string 01/01/0001.

8. Select the xtractSpecialHandling stage variable, drag it to the Derivation field of the SPECIAL_HANDLING_CODE column, and drop it on the with_business_rules link. A line is drawn between the stage variable and the column, and the name xtractSpecialHandling appears in the Derivation field. For each row that is processed, the SPECIAL_HANDLING_CODE column writes the current value of the xtractSpecialHandling variable.
9. Select the TrimDate stage variable, drag it to the Derivation field of the SETUP_DATE column, and drop it on the with_business_rules link. A line is drawn between the stage variable and the column, and the name TrimDate appears in the Derivation field. For each row that is processed, the SETUP_DATE column writes the current value of the TrimDate variable.
10. Double-click the Derivation field of the RECNUM column and type 'GC': in the expression editor. Right-click and select System Variable from the menu. Then select @OUTROWNUM. You added row numbers to your output.
Your transformer editor should look like the one in the following picture:
11. Click OK to close the Transformer stage editor.
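To see how these derivations combine, trace one hypothetical input row through the stage. The values below are invented for illustration and are not taken from the tutorial data; Field(string, delimiter, n) returns the nth delimited substring, and the colon operator concatenates strings:

    Input: country_code.SETUP_DATE = "1996-02-20 1", country_code.COUNTRY = "US" (first output row)

    TrimDate              = Field("1996-02-20 1", ' ', 1)  =  "1996-02-20"
    xtractSpecialHandling = Field("1996-02-20 1", ' ', 2)  =  "1"
    SOURCE                = 'GlobalCo' : "US"              =  "GlobalCoUS"
    RECNUM                = 'GC' : @OUTROWNUM              =  "GC1"

For a row whose SETUP_DATE holds only a date, Field(..., ' ', 2) returns an empty string, so SPECIAL_HANDLING_CODE is empty. For a row whose SETUP_DATE is empty, the Len tests fail and TrimDate falls back to '01/01/0001'.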

Configuring the Lookup operation
In this exercise, you configure the stages that are required to look up a value for the special handling code and write it to an output data set.
1. Open the Special_Handling_Lookup stage.
2. Set the File property to reference the special_handling_lookup job parameter.
3. Load the column metadata from the SpecialHandling.csv table definition in the repository, and then close the Special_Handling_Lookup stage editor.
4. Open the Lookup_Spec_Handling stage.
5. Select the following columns in the with_business_rules input link and drag them to the finished_data output link:
- CUSTOMER_NUMBER
- CUST_NAME
- ADDR_1
- ADDR_2
- CITY
- REGION_CODE
- ZIP
- TEL_NUM
- SOURCE
- RECNUM
- SETUP_DATE
- SPECIAL_HANDLING_CODE
6. Select the DESCRIPTION column in the special_handling reference link and drag it to the finished_data output link (the LANGUAGE column is not used).
7. Double-click the Condition bar in the special_handling reference link to open the Lookup Stage Conditions window. Specify that the processing will continue if the lookup fails for a data row. Only a minority of the rows in the bill_to data contain a special handling code, so if the rows that do not contain a code were rejected, most of the data would be rejected. You therefore do not need to specify a reject link for this stage.
8. Specify a job parameter to represent the file that the Target Data Set stage will write to, and add this job parameter to the stage.
9. Save, compile, and run the CleansePrepare job.

Lesson checkpoint
In this lesson, you consolidated your existing skills in defining transformation jobs and added some new skills. You learned the following tasks:
- How to define and use stage variables in a Transformer stage
- How to use system variables to generate output column values

Module 3 Summary
In this module you refined and added to your job design skills. You learned how to design more complex jobs that transform the data that your previous jobs extracted.

Lessons learned
By completing this module, you learned the following concepts and tasks:
- How to drop data columns from your data flow
- How to use the transform functions that are provided with the Designer client
- How to combine data from two different sources
- How to capture rejected data

In the tutorial modules that you completed so far, you were working with comma-separated files and staging files in internal formats (data sets and lookup file sets). In this module, you start working with a relational database. The database is the ultimate target for the data that you are working with.
Because different types of relational database can be used to host the repository, the lessons in this module use an ODBC connection that makes the lessons database-independent. For these lessons, you will use the database that is hosting the repository. The tutorial supplies scripts that your database administrator runs to create the tables that you need for these lessons. Your database administrator also needs to set up a DSN that you can use to connect to the database by using ODBC.
This module should take approximately 60 minutes to complete.

Prerequisites
Ensure that your database administrator runs the relevant database scripts that are supplied with the tutorial and sets up a DSN for you to use when connecting with the ODBC connector.

Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
- How to define a data connection object that you use and reuse to connect to a database
- How to import column metadata from a database
- How to write data to a relational database target

Lesson 4.1: Creating a data connection object
In this lesson, you use data connection objects to provide the information that is needed to connect to a database.

Data connection objects
Data connection objects store the information that is needed to connect to a particular database in a reusable form. You use data connection objects with related connector stages to quickly define a connection to a data source in a job design. You can also use data connection objects to provide the details that are needed to connect to a data source and import metadata.
If you change the details of a data connection while you are designing a job, these changes are reflected in the job design. However, after you compile your job, the data connection details are fixed in the executable version of the job. Subsequent changes to the job design will once again link to the data connection object and pick up any changes that were made to that object.
You can create data connection objects directly in the repository. You can also create data connection objects when you are using a connector stage to import metadata, by saving the connection details. This lesson shows you how to create a data connection object directly in the repository.

Creating a data connection object
To create a data connection object:

1. Select the tutorial folder in the repository, right-click, and select New → Other → Data Connection from the shortcut menu.
2. In the General page of the Data Connection window, enter a name for the data connection object (for example, tutorial_connect) and provide a short description and a long description of the object.
3. Open the Parameters page.
4. Click the browse button next to the Connect using Stage Type field.
5. In the Open window, open the Stage Types → Parallel → Database folder, select the ODBC Connector item, and click Open. The Connection parameters grid is populated and shows the connection parameters that are required by the stage type that you selected.
6. Enter values for each of the parameters as shown in the following table:

    Parameter name    Value
    ConnectionString  Type the DSN name
    Username          Type the user name for connecting to the database by using the specified DSN
    Password          Type the password for connecting to the database by using the specified DSN

7. Click OK.
8. In the Save Data Connection As window, select the tutorial folder and click Save.

Lesson checkpoint
You learned how to create a data connection object and store the object in the repository.

Lesson 4.2: Importing column metadata from a database table
You can import column metadata from a database table and store it as a table definition object in the repository. In Lesson 2.3, you learned how to import column metadata from a comma-delimited file. In this lesson, you will import column metadata from a database table by using the ODBC connector.
When you import data by using a connector, the column definitions are saved as a table definition in the project repository and in the dynamic repository. The table definition is then available to be used by other projects and by other components in the information integration suite.
To import column metadata by using the ODBC connector:
1. Select Import → Table Definitions → Start Connector Import Wizard.
2. In the Connector Selection page, select ODBC Connector from the list and click Next.
3. In the Connection details page, select your DSN from the Data source list and click the Load link.
4. In the Open window, open the tutorial folder, select the data connection object that you created in Lesson 4.1, and click Open. The Data source, Username, and Password fields are populated with the corresponding data from the data connection object.
5. Click the Test Connection link to ensure that you can connect to the database by using the connection details, and then click Next.
6. In the Data Source Location page of the Import Connector Metadata wizard, select the computer that hosts the database from the Host name where database resides list. If the database that you need is not listed, click the New location link.
7. In the Shared Metadata Management window, select your host name and click Add new database.
8. In the Add new database window, type the name of the database that has been created on the relational database for this exercise (ask your database administrator if you do not know the name of the database) and click OK. Then click Close to close the Shared Metadata Management window.
9. In the Data Source Location page, select the database from the Database name list and click Next.
10. In the Filter page, select the schema from the Schema list (ask your database administrator if you do not know the name of the schema) and click Next.
11. In the Selection page, select the tutorial table from the list and click Next.
12. In the Confirm import page, review the import details, and then click Import.
13. In the Select Folder window, select the tutorial folder and click OK. The table definition is imported and appears in the tutorial folder.
The table definition has a different icon from the table definitions that you used previously. This icon identifies that the table definition was imported by using a connector and is available to other projects and to other suite components.

Lesson checkpoint
You learned how to import column metadata from a database by using a connector.

Lesson 4.3: Writing to a database
In Lesson 4.3, you will use an ODBC connector to write the BillTo data that you created in Module 3 to an existing table in the database. Double-check that your database administrator ran the scripts to set up the database and the database table that you need to access in this lesson. Also ensure that the database administrator set up a DSN for you to use for the ODBC connection.

Connectors
Connectors are stages that you use to connect to data sources and data targets to read or write data. In the Database section of the palette in the Designer client are many types of stages that connect to the same types of data sources or targets. For example, if you click the down arrow next to the ODBC icon in the palette, you can choose to add either an ODBC Connector stage or an ODBC Enterprise stage to your job. If your database type supports connector stages, use them, because they provide the following advantages over other types of stages:
- Creates job parameters from the connector stage (without first defining the job parameters in the job properties)
- Saves any connection information that you specify in the stage as a data connection object
- Reconciles data types between source and target to avoid runtime errors
- Generates detailed error information if a connector encounters problems when the job runs

Creating the job
In this exercise, you will create a job to write to the database.
1. Create a job, name it ODBCwrite, and save it in the tutorial folder of the repository.
2. Open the File section of the palette and add a Data Set stage to your job. Name the stage BillToSource.
3. Open the Database section of the palette and add an ODBC Connector stage to your job. Position the stage to the right of the Data Set stage and name the ODBC Connector stage BillToTarget.
4. Link the two stages together and name the link to_target. Your job looks like the one in the following figure:

Configuring the Data Set stage
In this exercise, you will configure the Data Set stage to read the data set that you created in Lesson 3.4. You will use the table definition that you imported in Lesson 4.2.
1. Double-click the BillToSource Data Set stage to open the stage editor.
2. Select the File property on the Properties tab of the Output page and set it to the data set that you created in Lesson 3.4. Use a job parameter to represent the data set file.
3. In the Columns page, click Load.
4. In the Table Definitions window, open the tutorial folder, select the table definition that you imported in Lesson 4.2, and click OK. The columns grid is populated with the column metadata. Notice that the column definitions are the same as the table definition that you created by editing the Transformer stage and Lookup stage in the job in Lesson 3.4.
5. Click OK to close the stage editor.

Configuring the ODBC connector
In this exercise, you configure the ODBC connector to supply the information that is needed to write to the database table. The connector interface is different from the stage editors that you have used so far in this tutorial.
1. Double-click the BillToTarget stage to open the ODBC connector editor.
2. In the navigator area in the top left of the stage editor, click the stage icon to select it.
3. In the Properties tab of the Stage page, click Load. In the Open window, open the tutorial folder, select the data connection object that you created in Lesson 4.1, and click OK. The Data Source, Username, and Password properties display the values from the data connection object.
4. In the navigator area in the top left of the stage editor, click the link to select it.
5. In the Properties tab of the to_target page, set the Write mode property in the Usage category to Insert. The Insert statement field under the SQL property is enabled.
6. Click the Build button next to the Insert statement field and select Build new SQL (ODBC 3.52 core syntax) from the menu. You use the SQL builder to define the SQL statement that is used to write the data to the database when you run the job.
7. Configure the SQL builder:
a. In the navigator tree, browse the icons that represent your database and your schema, open the tutorial folder, and locate the table definition that you imported in Lesson 4.2.
b. Drag the table definition to the area to the right of the repository tree.
c. In the Select tables area, click Select All and drag all the columns to the Insert Columns area. The SQL builder page should look like the one in the following figure:
d. Click the SQL tab to view the SQL statement, and then click OK to close the SQL builder. The SQL statement is displayed in the Insert statement field, and your ODBC connector should look like the one in the following figure:
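The tutorial does not reproduce the generated statement, but an INSERT built this way pairs each table column with the corresponding input-link column. Assuming that the tutorial table uses the same column names as the to_target link (the table name below is a placeholder for whatever name your DDL script created), the statement might look similar to this:

    INSERT INTO tutorial_table
        (CUSTOMER_NUMBER, CUST_NAME, ADDR_1, ADDR_2, CITY, REGION_CODE, ZIP,
         TEL_NUM, SOURCE, RECNUM, SETUP_DATE, SPECIAL_HANDLING_CODE, DESCRIPTION)
    VALUES
        (ORCHESTRATE.CUSTOMER_NUMBER, ORCHESTRATE.CUST_NAME, ORCHESTRATE.ADDR_1,
         ORCHESTRATE.ADDR_2, ORCHESTRATE.CITY, ORCHESTRATE.REGION_CODE, ORCHESTRATE.ZIP,
         ORCHESTRATE.TEL_NUM, ORCHESTRATE.SOURCE, ORCHESTRATE.RECNUM, ORCHESTRATE.SETUP_DATE,
         ORCHESTRATE.SPECIAL_HANDLING_CODE, ORCHESTRATE.DESCRIPTION)

The ORCHESTRATE.column_name tokens are the placeholders that the stage replaces with the incoming row values at run time.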

8. Click OK to close the ODBC connector editor.
9. Save, compile, and run the job. You wrote the BillTo data to the tutorial database table. This table forms the bill_to dimension of the star schema that is being implemented for the GlobalCo delivery data in the business scenario that the tutorial is based on.

Lesson checkpoint
You learned how to use a connector stage to connect to and write to a relational database table. You learned the following tasks:
- How to configure a connector stage
- How to use a data connection object to supply database connection details
- How to use the SQL builder to define the SQL statement by accessing the database
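If you want to confirm the load independently of the Designer client after step 9, you can query the table from your database's own SQL client. The table name here is, again, a placeholder for whatever name the DDL script created:

    SELECT COUNT(*) FROM tutorial_table;   -- placeholder table name

The count should match the number of rows that the job reported were written on the to_target link.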

Module 4 summary
In this module, you designed a job that writes data to a table in a relational database. In Lesson 4.1, you learned how to define a data connection object; in Lesson 4.2, you imported column metadata from a database; and in Lesson 4.3, you learned how to write data to a relational database target.

Lessons learned
By completing this module, you learned about the following concepts and tasks:
- How to load data into data targets
- How to use the Designer client's reusable components


Parallel jobs are scalable and can speed the processing of data by spreading the load over multiple processors. When you design parallel jobs in the Designer client, you concentrate on designing your jobs to run sequentially, unconcerned about how parallel processing is implemented. You specify the logic of the job, and WebSphere DataStage specifies the best implementation on the available hardware. However, you can exert more exact control on job implementation if you need to.
The shape and size of the computer system on which you run jobs is defined in the configuration file. When you run a job, the parallel engine organizes the resources that the job needs according to what is defined in the configuration file. When your computer system changes, you change the configuration file, not the jobs. The configuration file is therefore the key to getting the optimum performance from the jobs that you design.
In this module, you change the configuration file, and you look at the partitioning of data.
This module should take approximately 60 minutes to complete.

Prerequisites
You must be working on a computer with multiple processors. You must have DataStage administrator privileges to create and use a new configuration file.

Learning objectives
After completing the lessons in this module, you will know how to do the following tasks:
- How to use the configuration file to optimize parallel processing
- How to control the partitioning of data so that it can be handled by multiple processors
- How to control parallel processing at the stage level in your job design

Lesson 5.1: Exploring the configuration file
In this lesson, you learn about how you can control whether jobs run sequentially or in parallel. Unless you specify otherwise, the parallel engine uses a default configuration file that is set up when WebSphere DataStage is installed.

Opening the default configuration file
You use the Configurations editor in the Designer client to view the default configuration file. To open the default configuration file:
1. Select Tools → Configurations.
2. In the Configuration window, select default from the list. The contents of the default configuration file are displayed.

Example configuration file
The following example shows a default configuration file from a four-processor SMP computer system:

    {
        node "node1"
        {
            fastname "R101"
            pools ""
            resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
        node "node2"
        {
            fastname "R101"
            pools ""
            resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }

The default configuration file is created when WebSphere DataStage is installed. This file contains the following fields:

node
    The name of the processing node that this entry defines. Although the system has four processors, the configuration file specifies two processing nodes. Specify fewer processing nodes than there are physical processors to ensure that your computer has processing resources available for other tasks while it runs WebSphere DataStage jobs.

fastname
    The name of the node as it is referred to on the fastest network in the system. For an SMP system, all processors share a single connection to the network, so the fastname is the same for all the nodes that you are defining in the configuration file.

pools
    Specifies that nodes belong to a particular pool of processing nodes. A pool of nodes typically has access to the same resource, for example, access to a high-speed network link or to a mainframe computer. The pools string is empty for both nodes, specifying that both nodes belong to the default pool.

resource disk
    Specifies the name of the directory where the processing node writes data set files. When you create a data set or file set, you specify what the controlling file is called and where it is stored, but the controlling file points to other files that store the data. These files are written to the directory that is specified by the resource disk field.

resource scratchdisk
    Specifies the name of a directory where intermediate, temporary data is stored.

Configuration files can be more complex and sophisticated than the example file and can be used to tune your system to get the best possible performance from the parallel jobs that you design.

Lesson checkpoint
In this lesson, you learned how the configuration file is used to control parallel processing. You learned the following concepts and tasks:
- About configuration files
- How to open the default configuration file
- What the default configuration file contains

Lesson 5.2: Partitioning data
When jobs run in parallel, data is partitioned so that each processor has data to process.

Most partitioning operations result in a set of partitions that are as near to equal size as possible, ensuring an even load across your processors. In the simplest scenario, you do not need to worry about how your data is partitioned: WebSphere DataStage can partition your data and implement the most efficient partitioning method.
As you perform other operations, however, you need to control partitioning to ensure that you get consistent results. For example, suppose that you are using an Aggregator stage to summarize your data to get the answers that you need. You must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.
In this lesson, you will run the sample job that you ran in Lesson 1.3. You change the job so that it has the same number of partitions as there are nodes defined in your system's default configuration file.

Viewing partitions in a data set
You need to be able to see how data in a data set is divided into partitions to determine how the data is being processed. This exercise teaches you how to use the data set management tool to look at data sets and how they are structured. The sample job reads a comma-separated file. By default, comma-separated files are read sequentially and all their data is stored in a single partition, so the data was written to a single partition of the data set.
To see how data sets are structured:
1. Select Tools → Data Set Management.
2. In the Select from server window, browse for the GlobalCo_BillTo.ds data set file that was written by the sample job in Module 1 and click OK. The Data Set Management window should look like the one in the following figure:

3. Click the disk icon in the toolbar to open the Data Set viewer and view the data in the data set to see its structure.
4. Close the window.

Creating multiple data partitions
By default, most parallel job stages use the auto partitioning method. The auto method determines the most appropriate partitioning method based on what occurs before and after this stage in the data flow. In this exercise, you will override the default behavior and specify that the data that is read from the file is partitioned by using the round-robin method. The round-robin method sends the first data row to the first processing node, the second data row to the second processing node, and so on (a small illustration follows this procedure).
To specify round-robin partitioning:
1. Open the sample job that you used in Module 1.
2. Open the GlobalCoBillTo_ds Data Set stage editor.
3. Open the Partitioning tab of the Input page.
4. In the Partition type field, select the round-robin partitioning method, and click OK.
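As an illustration of the round-robin method, suppose that six rows are read on a system whose configuration file defines two processing nodes (the row numbers are hypothetical):

    row 1 -> node1    row 2 -> node2
    row 3 -> node1    row 4 -> node2
    row 5 -> node1    row 6 -> node2

Each partition receives every second row, so the partitions stay as close to equal in size as possible.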

5. Compile and run the job.
6. Return to the data set management tool and open the GlobalCo_BillTo.ds data set. You can see that the data set now has multiple data partitions. The following figure shows the data set partitions on the system:
When you develop parallel jobs, first run your jobs and test the basic functionality before you start implementing parallel processing.

Lesson checkpoint
In this lesson, you learned some basics about data partitioning. You learned the following tasks:
- How to use the data set management tool to view data sets
- How to set a partitioning method for a stage

Lesson 5.3: Changing the configuration file
In this lesson, you will create a new configuration file and see the effect of running the sample job with the new configuration file. This lesson demonstrates that you can quickly change configuration files to affect how parallel jobs are run.

Creating a configuration file
You use the configuration editor that you used in Lesson 5.1 to create a configuration file.
1. Select Tools → Configurations to open the Configurations editor.
2. Select default from the Configurations list to open the default configuration file.
3. In the part of the configuration editor that shows the contents of the configuration file, click and drag to select all the nodes except for the first node in your configuration file.
4. Delete the selected entries.
5. Click Check to ensure that your configuration file is valid.
6. Click Save and select Save configuration as from the menu.
7. In the Configuration name field of the Save Configuration As window, type a name for your new configuration. For example, type Module5.
8. Click Save.
The configuration editor should resemble the one in the following picture:
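Assuming that you started from the two-node example file that is shown in Lesson 5.1, the saved Module5 configuration contains only the first node, similar to the following sketch (the fastname and path names come from that example and will differ on your system):

    {
        node "node1"
        {
            fastname "R101"
            pools ""
            resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }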

Deploying the new configuration file
Now that you have created a new configuration file, you use this new file instead of the default file. You use the Administrator client to deploy the new file. You must have DataStage Administrator privileges to use the Administrator client for this purpose.
To deploy the new configuration file:
1. Select Start → Programs → IBM Information Server → IBM WebSphere DataStage and QualityStage Administrator.
2. In the Administrator client, click the Projects tab to open the Projects window.
3. In the list of projects, select the tutorial project that you are currently working with.
4. Click Properties.
5. In the General tab of the Project Properties window, click Environment.
6. In the Categories tree of the Environment variables window, select the Parallel node.
7. Select the APT_CONFIG_FILE environment variable, and edit the file name in the path name under the Value column heading to point to your new configuration file. The Environment variables window should resemble the one in the following picture:
You deployed your new configuration file. Keep the Administrator client open, because you will use it to restore the default configuration file at the end of this lesson.

Applying the new configuration file
Now you run the sample job again. Although you previously partitioned the data that is read from the GlobalCo_BillTo comma-separated file, the configuration file now specifies that the system has only a single processing node available, and so no data partitioning is performed. You will see how the configuration file overrides other settings in your job design.
To apply the configuration file:
1. Open the Director client and select the sample job that you edited and ran in Lesson 5.2.
2. Reset the job so that you can run it again.
3. Run the sample job.
4. In the Designer client, open the data set management tool and open the GlobalCo_BillTo.ds data set. You see that the data is in a single partition, because the new configuration file specifies only one processing node.
5. Reopen the Administrator client and restore the default configuration file by editing the path for the APT_CONFIG_FILE environment variable to point to the default.apt file.

Lesson checkpoint
You learned how to create a configuration file and use it to alter the operation of parallel jobs. You learned the following tasks:
- How to create a configuration file based on the default file
- How to edit the configuration file
- How to deploy the configuration file
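For reference, the value that you set for APT_CONFIG_FILE might read something like the following. The directory shown is a typical default installation path, so substitute the path that applies to your system, and point the variable back at the default.apt file in the same directory when you restore the default configuration:

    APT_CONFIG_FILE = C:\IBM\InformationServer\Server\Configurations\Module5.apt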

Module 5 summary
In this module, you learned how to use the configuration file to control how your parallel jobs are run. You also learned how to control the partitioning of data at the level of individual stages.

Lessons learned
By completing this module, you learned about the following concepts and tasks:
- The configuration file
- How to use the configuration editor to edit the configuration file
- How to control data partitioning

Chapter 8. Tutorial summary
You completed your part of the GlobalCo/WorldCo merger project, and in doing so you learned about basic parallel job design skills.

Lessons learned
By completing this tutorial, you learned about the following concepts and tasks:
- How to extract, transform, and load data by using WebSphere DataStage
- How to use the parallel processing power of WebSphere DataStage
- How to reuse job design elements

Now that you have successfully completed the WebSphere DataStage tutorial, you can complete the IBM WebSphere QualityStage tutorial. The WebSphere QualityStage tutorial implements the standardization of customer information and the removal of duplicate entries from the data.


Appendix. Installing and setting up the tutorial
Before you can start the tutorial, you need to install some files and perform some setup tasks. You need DataStage administrator privileges and Windows administrator privileges to perform some of the installation and setup tasks. You also need a higher level of system knowledge and database knowledge to complete the installation and setup tasks than you need to complete the tutorial.
To install and set up the tutorial, complete the following tasks:
1. "Creating a folder for the tutorial files"
2. "Creating the tutorial project"
3. "Copying the data files to the project folder or directory"
4. "Importing the tutorial components into the tutorial project"
5. "Creating a target database table"
6. One of the following tasks:
   - "Creating a DSN for the tutorial table on a Windows computer"
   - "Creating a DSN for the tutorial table on a UNIX or Linux computer"

Creating a folder for the tutorial files
Create a folder on your WebSphere DataStage client computer and copy the files from the installation CD to the folder.
1. Insert the CD into the CD drive or DVD drive of the client computer.
2. Create a new folder on your computer (for example, C:\tutorial).
3. Copy the folder on the CD named \TutorialData\DataStage\parallel_tutorial to the folder that you created on the client computer.

Creating the tutorial project
Create a new project for the tutorial to keep the tutorial exercises separate from other work on WebSphere DataStage. You must have DataStage Administrator privileges.
1. Select Start → Programs → IBM Information Server → IBM WebSphere DataStage and QualityStage Administrator.
2. In the Attach window, type your user name and password.
3. In the Administrator client window, click the Projects tab.
4. In the Projects page, click Add.
5. In the Name field, specify the name of the new project (for example, Tutorial).
6. Click OK. The new project is created.
7. Click Close to close the Administrator client.

Copying the data files to the project folder or directory
Copy the tutorial data files from the tutorial folder on the client computer to the project folder or directory on the WebSphere DataStage server computer.

When you created the project for the tutorial, you automatically created a folder or directory for that project on the server computer. The WebSphere DataStage server might be on the same Windows computer as the clients, or it might be on a separate Windows, UNIX, or Linux computer. The default path name for a Windows server is c:\IBM\InformationServer\Server\Projects\tutorial_project. The default path name for a UNIX or Linux server is /opt/IBM/InformationServer/Server/Projects/tutorial_project.
1. Open the tutorial folder that you created on the client computer and locate all the files that end with .csv:
- CustomerCountry.csv
- GlobalCo_BillTo.csv
- SpecialHandling.csv
2. Open the project folder on the server computer for the tutorial project that you created.
3. Copy the files from the tutorial folder on the client computer to the project folder on the server computer.

Importing the tutorial components into the tutorial project
Use the Designer client to import the sample job and sample table definition into the tutorial project.
1. Select Start → Programs → IBM Information Server → IBM WebSphere DataStage and QualityStage Designer.
2. In the Attach window, type your user name and password.
3. Select the tutorial project from the Project list and then click OK. The Designer client opens and displays the New window.
4. Click Cancel to close the New window, because you are opening an existing job and not creating a new job or other object.
5. Select Import → DataStage Components.
6. In the Import from file field of the DataStage Repository Import window, type tutorial_folder\parallel_tutorial, where tutorial_folder is the name of the tutorial folder that you created on the client computer.
7. Select the Import all option.
8. Click OK. The Designer client imports the sample job and sample table definition into a repository folder named tutorial.
9. Select File → Exit to close the Designer client.

Creating a target database table
Create a table in the relational database that hosts the dynamic repository for WebSphere DataStage. Module 4 of the tutorial imports metadata from a table in a relational database and then writes data to the table. WebSphere DataStage uses a repository that is hosted by a relational database (DB2 by default), and you can create the table in that database. There are data definition language (DDL) scripts for DB2, Oracle, and SQL Server in the tutorial folder, named as follows:
- DB2_table.ddl
- Oracle_table.ddl
- SQLserver_table.ddl
To create the table (a rough sketch of the kind of table that these scripts create follows this procedure):
1. Open the administrator client for your database (for example, the DB2 Control Center).
2. Create a new database named Tutorial.
3. Connect to the new database.
4. Run the appropriate DDL script to create the tutorial table in the new database.
5. Close the database administrator client.
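The shipped DDL scripts are the authoritative definition, so run them rather than typing anything by hand. As a rough sketch only, with column names mirrored from the finished_data link in Module 3 but with a hypothetical table name and guessed types and lengths, a DB2-style script might look similar to this:

    -- Illustrative sketch only: the real table name, types, and lengths
    -- are defined by the supplied DB2_table.ddl, Oracle_table.ddl,
    -- or SQLserver_table.ddl script.
    CREATE TABLE tutorial_table (
        CUSTOMER_NUMBER        CHAR(10) NOT NULL,
        CUST_NAME              VARCHAR(40),
        ADDR_1                 VARCHAR(40),
        ADDR_2                 VARCHAR(40),
        CITY                   VARCHAR(30),
        REGION_CODE            CHAR(10),
        ZIP                    CHAR(10),
        TEL_NUM                VARCHAR(20),
        SOURCE                 CHAR(10),
        RECNUM                 CHAR(10),
        SETUP_DATE             CHAR(10),
        SPECIAL_HANDLING_CODE  INTEGER,
        DESCRIPTION            VARCHAR(50)
    );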

Creating a DSN for the tutorial table on a Windows computer
Create a DSN for the database table so that users of the tutorial can connect to the table by using an ODBC connection. You define the DSN on the computer where the WebSphere DataStage server is installed. You require administrator privileges on the Windows computer. The procedure on a Windows computer is different from the procedure on a UNIX or Linux computer.
To set up a DSN on a Windows computer:
1. Open the control panel and select Administrative Tools.
2. Select Data Sources (ODBC).
3. Select the System DSN tab.
4. In the System DSN page, click Add.
5. In the Create New Data Source window, select a driver for the database. A window opens that is specific to the driver that you selected.
6. Specify the details that the driver requires to connect to the tutorial database. If you specified a user name and password for the database, then the connection details include the user name and password.

Creating a DSN for the tutorial table on a UNIX or Linux computer
Create a DSN for the database table so that users of the tutorial can connect to the table by using an ODBC connection. You define the DSN on the computer where the WebSphere DataStage server is installed. The procedure on a UNIX or Linux computer is different from the procedure on a Windows computer.
To set up a DSN on a UNIX or Linux computer, you edit three files:
- dsenv
- odbc.ini
- uvodbc.config
The entries that you make in each file depend on the type of database. Full details are in the IBM Information Server Planning, Configuration, and Installation Guide.
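As a rough sketch of what the UNIX or Linux setup involves, assuming a DSN named Tutorial: every value below is a placeholder, and the driver library in particular depends on your database, so take the real entries from the guide. An odbc.ini stanza and the matching uvodbc.config entry might look similar to this:

In odbc.ini:

    [Tutorial]
    Driver=/path/to/your/odbc/driver/library.so
    Description=DSN for the tutorial database
    Database=Tutorial

In uvodbc.config:

    <Tutorial>
    DBMSTYPE = ODBC

The dsenv file is then edited so that the server's environment includes the driver library path and any variables that your database client requires.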


ibm. and general information. which is viewable in most Web browsers. go to the IBM Publications Center at www.com/software/data/integration/ info_server/: v Product documentation in PDF and online information centers v Product downloads and fix packs v Release notes and other support documentation v Web resources.ibm. Go to www.com for a list of numbers outside of the United States.com/shop/publications/ order.jsp You can order IBM publications online or through your local IBM representative. Customer support To contact IBM customer service in the United States or Canada. go to the IBM Directory of Worldwide Contacts at www. Software services To learn about available service options. such as white papers and IBM Redbooks™ v Newsgroups and user groups v Book orders To access product documentation. Contacting IBM You can contact IBM by telephone for customer support. go to this site: publib.Accessing information about IBM IBM has several methods for you to learn about products and services.com/planetwide. You can find the latest information on the Web at www. © Copyright IBM Corp. v To order publications by telephone in the United States.ibm.ibm. call one of the following numbers: v In the United States: 1-888-426-4343 v In Canada: 1-800-465-9600 General information To find general information in the United States. v To order publications online. call 1-800-879-2755. call 1-800-IBM-CALL (1-800-426-2255). To find your local IBM representative. 2006 61 . call 1-800-IBM-SERV (1-800-426-7378). software services. Accessible documentation Documentation is provided in XHTML format.com/infocenter/iisinfsv/v8r0/index.ibm.boulder.

XHTML allows you to view documentation according to the display preferences that you set in your browser. It also allows you to use screen readers and other assistive technologies. Syntax diagrams are provided in dotted decimal format, which is available only if you are accessing the online documentation by using a screen reader.

Providing comments on the documentation
Please send any comments that you have about this information or other documentation. Your feedback helps IBM to provide quality information. You can use any of the following methods to provide comments:
- Send your comments by using the online readers' comment form at www.ibm.com/software/awdtools/rcf/.
- Send your comments by e-mail to comments@us.ibm.com. Include the name of the product, the version number of the product, and the name and part number of the information (if applicable). If you are commenting on specific text, please include the location of the text (for example, a title, a table number, or a page number).

Notices and trademarks

Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.
The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.
Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.
This information is for planning purposes only. The information herein is subject to change before the products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Each copy or any portion of these sample programs or any derivative work must include a copyright notice as follows: © (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights reserved.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks
IBM trademarks and certain non-IBM trademarks are marked at their first occurrence in this document. See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks.
The following terms are trademarks or registered trademarks of other companies:
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product or service names might be trademarks or service marks of others.


