IBM® InfoSphere™
DataStage
Advanced
Workshop Lab
Workbook
All the required data files are located at: /DS_Advanced. You will be using the
DataStage project called “dstage1”. Optionally, you may put all your DataStage
objects (e.g. jobs, parameter sets, etc.) in the project folder /dstage1/Jobs/DS-
Advanced/.
Please start both the DataStage Designer and Director to do the following
exercises.
1 IS admin: InfoSphere Information Server administrator
2 WAS admin: WebSphere Application Server administrator
/opt/IBM/InformationServer/Server/Configurations
This configuration file has been modified to define two nodes, which will allow us to
exercise the capabilities of the parallel engine.
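For reference, a two-node configuration file has the following general shape. This is a
hedged sketch only: the fastname and resource paths below are assumptions, and your
actual file will differ.

    {
      node "node1"
      {
        fastname "dsserver"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
      }
      node "node2"
      {
        fastname "dsserver"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
      }
    }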
2. Rename the stages and links as shown. This is good practice: meaningful names serve
as lightweight documentation.
3. Set up the source Sequential File stage to read the file Selling_Group_Mapping.txt.
Don’t forget to import the table definition first.
5. Set up the target Sequential File stage to write to two different target files,
TargetFile1.txt and TargetFile2.txt.
6. Notice that the partitioning icon is ‘Auto’. (Note: If you do not see this, refresh the
canvas by turning “Show link markings” off and on using the toolbar button.)
Source file:
Target file 1:
Target file 2:
3. Notice how the data is partitioned. Here, we see that the 1st, 3rd, 5th, etc. records go
into one file and the 2nd, 4th, 6th, etc. go into the other file. This is because the default
partitioning algorithm is Round Robin.
2. Compile and run the job again. Open the target files and examine. Notice how the
data gets distributed. Experiment with different partitioning algorithms!
3. The following table shows the results for several partitioning algorithms with one
particular system (yours may not match exactly):
4. Use either gedit or vi to change the Selling_Group_Mapping.txt file: put a letter into the
first column of three records.
5. Run the job again. Check the log messages with the Director and notice that the
Sequential File stage throws these records away with a warning message.
6. Now add a reject link and a Peek stage as shown. Don’t forget to change the “Reject
Mode” property of the Sequential File stage to “Output”.
7. Compile and run the job again. Check the log messages to see the records with
incorrect data were sent down the reject link and captured by the Peek stage.
Note the one from Peek_Reject,0 has (…) at the end. Open it up and you will see the
following:
2. Set up the Sequential File stages to read Warehouse.txt as source records and
Items.txt as reference records. Don’t forget to import the table definitions.
4. Map all columns from the source (Warehouse.txt) plus the column Description to
the output.
5. The lookup failure action property can be set to any choice except Reject. Click the
yellow constraint icon and set the lookup failure action.
6. Set up the target Sequential File stage to write the records to Warehouse_Items.txt
file.
8. If you set the lookup failure action to FAIL, then you should see your job abort.
If you set the lookup failure action to Drop, then you will not see any log message
from the Lookup stage. However, you can see that the number of records read from
Warehouse.txt and the number of records written to Warehouse_Items.txt differ by 9
records.
If you set the lookup failure action to Continue, then all records will be passed to the
output.
9. Now, let’s change the lookup failure action to REJECT. Add a reject link with a Peek
stage as shown.
11. This time you should see log messages from the Peek_Reject stage.
If you open each log message, you should find a total of 9 records logged by the
Peek stage.
2. Add a Transformer stage between the Lookup stage and the target Sequential File
stage. Add a reject link from the Transformer stage to a Peek stage. To do so, select
the link, then right-click and choose “Convert to Reject”. Your final job should look
similar to the picture shown.
4. In the Transformer stage, map all columns of input record to output. Change the
derivation for Description to “[“:Warehouse_Items.Description:”]”.
5. Don’t forget to handle the NULL in the target Sequential File stage (just in case).
7. With DataStage versions prior to 8.5, you would find log messages from the
Peek_X_Reject stage containing those records with a NULL in the Description
column. However, DataStage 8.5 handles (allows) NULLs in derivations, so you
won’t see any rejected records.
2. Add a Remove Duplicates stage. Replace the Lookup stage with a Merge stage. Add
a reject link to a Peek stage from the Merge stage.
3. Set up the Remove Duplicates stage with the following properties: Key = Item, Retain
= Last. Map all columns to output.
5. On the Link Ordering tab, make sure all the internal links and external links are
correctly aligned.
8. You can see that update records without corresponding master records are rejected.
Viewing the detail of the Peek_Reject stage log messages will show these rejected
update records.
2. Open up the Row Generator stage. On the Properties pages specify that 1000 rows
are to be generated.
4. Open up the Extended Properties window for the CustID column. (Double-click on
the number to the left of the column.) Specify that the type of algorithm is cycle with
an initial value of 10000.
6. For the Int2 column cycle from 1 to 29. (It’s important that this not start at 0, so that
these cycles won’t repeat.)
9. For the MiddleInit column, use the alphabet algorithm over a string of characters that
might be middle name initials. (That is, remove the numerals from the list.)
11. For the CustDate column, generate random dates with a limit of e.g. 20000, so that
the dates don’t get too large.
12. For the InsertUpdateFlagInt, select random integers with a limit of 2. This will ensure
that values are either 0 (meaning update) or 1 (meaning insert).
13. Close the stage and then open it again and click the View Data button to examine a
sampling of the data that will be generated.
14. Edit the Sequential File stages that are used as lookup tables. The sequential files
are FName.txt, LName.txt, Street1.txt, and Street2.txt. Examine these files to get an
idea of the data they contain. Import the metadata for these files and load it into the
stages.
15. Edit the Lookup stage. Map the Int1 to Int4 columns, respectively, to the Num
columns of each of the lookup files. Define the output columns in the order shown at
the far right.
16. Click the Constraints button. Specify that rows that fail to find matches are to be
rejected.
18. Define the following derivations (in addition to the straight mappings); one possible
way to express them is sketched after this list:
• Middle initial should be uppercase.
• Address should consist of the street name followed by the street modifier.
• Rows with customer dates later than the current date should get the current date.
Otherwise, they retain the date in the source row.
• The added column DateEntered should get the date of the job run.
• The InsertUpdateFlag column, which is now Char(1), should map 0 to “U” and
1 to “I”.
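If you get stuck, the following sketch shows one way these derivations might be
expressed in the parallel Transformer. The link name (lnk) and the street column names
are assumptions; adjust them to your own metadata.

    MiddleInit:       Upcase(lnk.MiddleInit)
    Address:          lnk.StreetName : ' ' : lnk.StreetMod
    CustDate:         If lnk.CustDate > CurrentDate() Then CurrentDate() Else lnk.CustDate
    DateEntered:      CurrentDate()
    InsertUpdateFlag: If lnk.InsertUpdateFlagInt = 0 Then 'U' Else 'I'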
19. Open up the Job Properties window and define a new job parameter named
TargetPath. Provide it with a default that creates a file named CustomersOut.txt in
the /DS_Advanced directory.
20. Edit the target Sequential File stage. Insert your TargetPath job parameter in the
File property to create a new comma-delimited file named CustomersOut.txt. Don’t
surround the column values with quotes.
21. Edit the rejects Sequential File stage. Write rejects to a file named RejectsOut.txt.
22. Compile and run your job. Examine the job log in Director. Fix any errors. Try to
eliminate all warnings.
24. In addition to viewing the data in DataStage, view the data file in your directory.
Verify that quotes don’t surround the values and that the data is delimited by
commas.
2. Locate and examine the message that lists the values of the environment variables
that are in effect at the time the job is run.
3. Locate and examine the message that displays the OSH (Orchestrate Script) that is
generated for the job.
4. Locate and examine the message that displays the configuration file used when the
job was run. How many nodes are defined in the file? (Note: Your job will be using a
different configuration file than the one shown here. This is just an example.)
5. Locate and examine the message that lists the job’s datasets, operators, and
number of processes. This is known as the Score. (Note that you won’t see the
word ‘Score’. The first line is how you can identify it.)
6. Locate the message that says how many rows were successfully written to the target
Sequential File stage and how many were rejected.
2. Enable RCP for the job and all existing links in the job properties.
3. Edit the source Sequential File stage to read data from the Customers.txt file. On
the Columns tab, specify a single column named RecIn, VarChar(1000).
4. On the Formats tab, specify that there is no quote character. Verify that you can
view the data.
5. Edit the Column Import stage. On the Properties tab, specify that the Import Input
Column is RecIn. As the Column Method, specify Schema File and then reference
the Customers.schema schema file.
6. That’s it for the Column Import stage. It uses RCP to send the columns of data
through to the target.
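If you are curious what the stage consumes, an OSH schema file for this kind of data
looks roughly like the sketch below. The actual Customers.schema file on the server is
authoritative; the column names and types here are assumptions.

    record {final_delim=end, delim=',', quote=none}
    (
      CustID: int32;
      FirstName: string[max=20];
      LastName: string[max=20];
      Address: string[max=40];
    )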
2. Edit the GenerateHeader Row Generator stage. On the Properties tab specify that 9
records are to be generated.
3. Define the columns, as shown. Save the table definition for the next lab to use.
4. For the OrderNum column, generate numbers from 1 to 9. Also, for this and all
columns add the optional Quote property from the Field Level category and set it to
NONE.
6. Choose your own algorithms for the remaining fields. In what follows, I’ve chosen
random for OrdDate with a limit of 20000.
8. Edit the GenerateDetail Row Generator stage. On the Properties tab specify that 81
records are to be generated.
9. Define the columns, as shown. Save the table definition for the next lab to use.
10. For the OrderNum column, generate numbers from 1 to 9. Also, for this and all
columns add the optional Quote property and set it to NONE.
12. Choose Random with a limit of 9999 for the remaining fields.
14. In the Column Export stage for the header, in the Input folder of the Properties tab,
specify the input columns that are to be exported to a single output column
(OrderNum, RecType, etc., in the order shown). In the Output folder specify the
name of the single column (Header) and its type (VarChar) that the input columns
are to be combined into.
15. On the Output Columns, create a column named RecOut and map the input field to it
on the Mappings tab.
16. Define the Column Export stage for the detail records in a similar way. Be sure to
use the same name (RecOut) for your output column name.
17. In the Funnel stage, map the single RecOut column across to the target.
18. In the Sort stage, specify that the records are to be sorted in ascending order. The
key is the single RecOut column.
19. Write to a file named Orders.txt. In the Sequential File stage Format tab, set the
quote property to NONE.
20. Compile and run and view your data. It should look something like this, with the
header records at the front of each group of records, grouped by order number.
2. Your job reads data from the Orders.txt file. It reads this data in as one column of
data.
3. In the Transformer, define constraints to send the Header rows down the Header link
and the Detail rows down the Detail link. Also parse out the fields for each of the
record types using the Field function.
4. To create the output column definitions for the two links, load the table definitions
saved from the previous job. The RecType fields won’t be needed downstream, so
delete them from the output. Add a column to the Detail link named RecordNum.
Define an expression that generates a unique integer for each Detail row regardless
of its partition.
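As a hedged sketch of the Transformer logic described in steps 3 and 4 (the delimiter,
record-type values, and link name are assumptions based on the previous lab; the
RecordNum expression uses the standard system-variable idiom for partition-safe unique
numbers):

    Header link constraint:  Field(lnk.RecIn, ',', 2) = 'H'
    Detail link constraint:  Field(lnk.RecIn, ',', 2) = 'D'
    Detail OrderNum:         Field(lnk.RecIn, ',', 1)
    Detail RecordNum:        (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1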
6. On the Input>Partitioning tab of the Join stage specify that the records are to be
hashed and sort partitioned by OrderNum for both the Header and Detail links.
8. In the Data Set target stage, name the file OrdersCombined.ds. Note the partitioning
icons in front of the Join stage.
10. View the data using the Data Set Management tool available in the Designer Tools
menu. You should see 9 records in each group.
11. Next view the data in each partition. Notice that all the records in a group are in a
single partition; no group is spread across multiple partitions.
12. Save your job as partCombineHeaderDetail2. Now change how the partitioning is
done in the Join stage. Choose Entire for the Header link and SAME for the Detail
link. Turn off the sorts.
14. Recompile and run your job. View the data using the Data Set management tool.
Notice that the groups of data are now spread across multiple partitions. This should
yield improved performance.
2. In DataStage Designer, click Import > Table Definitions and then select Cobol File
Definitions.
5. Take the default location in the repository. Select both DETAIL and HEADER
definitions. Click Import.
6. Find the newly imported table definitions in the repository. Open one of them. Go to
the Layout tab and explore the different settings of the Parallel, COBOL, and Standard
views. Here is what the COBOL view looks like.
7. Open up the HEADER table definition. Click on the Columns tab. Open the
extended properties (Edit Column Meta Data) window for the column ORDDATE
(double click on the column number). Set the Date Format field to CCYY-MM-DD.
This is to allow dates to be displayed correctly using this mask.
8. Remove the Level number and then click Apply and close the table definition.
2. Open the Orders CFF stage. On the File options tab, select the file to be read,
OrdersCD.txt.
3. Click the arrow at the bottom to move to the next Fast Path page, that is, the
Records tab. Remove the check from the Single record box.
5. Click Load. Select all the columns from the HEADER Table Definition.
7. Click the icon at the bottom left of the Records tab to add a new record type.
Complete the process to define and load the Table Definition for the DETAIL record
type.
8. Select the HEADER record type and then click the Master Button (rightmost icon at
the bottom of the records tab). This will make the HEADER record type the master.
9. Click the arrow at the bottom to move to the next Fast Path page, that is, the
Records ID tab. Define the Record ID constraint for the HEADER record type.
10. Define the Record ID constraint for the DETAIL record type.
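As noted later in this lab, record type ‘A’ identifies HEADER records and ‘B’ identifies
DETAIL records, so the constraints take the following form (the record-type column
name here is an assumption; use the column from your imported definitions):

    HEADER:  RECTYPE = 'A'
    DETAIL:  RECTYPE = 'B'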
11. Move to the next Fast path page, that is, the Output > Selection tab. Select all
columns from both record types.
12. Move to the last Fast path page, that is, the Output > Constraint tab. Click the
Default button to add the default output constraint. This will ensure that only records
of these two record types go out the output link.
13. Click the Stage tab, then the Record options tab. Specify Text for Data format. Select
the ASCII character set. Type in the vertical bar (|) for the record delimiter. If you
open the OrdersCD.txt file on the DataStage server, you will notice that all the
records are bunched up one after another with a vertical bar separating them. There
is no CR or LF character. This is typical of COBOL data from the mainframe, and the
CFF stage is designed to handle it.
14. Click the Layout tab. Select the COBOL layout option. View the COBOL layouts for
each of the record types. Shown below is the HEADER COBOL format.
16. Move to the Output tab and click View Data. Notice that all the columns from all the
record types are displayed with data in them. However, the data in the columns that
are mapped from the DETAIL record is invalid when viewing a record with record
type ‘A’ (a HEADER record). But for records with record type ‘B’ (DETAIL records),
all columns contain valid data. In effect, we have just propagated the HEADER record
information to all its associated DETAIL records.
17. Set up the target Sequential File stage to output all the records to a file named
CFFOrdersCombined.txt, comma-separated and with no quotes. In the Columns tab,
change the SQL type to Char for the column ORDDATE. Otherwise, you will get a
conversion error during execution.
18. Compile and run your job. You will get a warning about reaching EOF (End-of-File)
without a record delimiter. This is normal and is caused by the last record in the file.
These warnings do not affect the correct processing of the data.
19. To verify the result, view data on the target Sequential File stage.
3. The dataset accessed by the target dataset stage should be named Customers.ds.
6. Save the metadata of Customers.ds to a table definition for use in the next section.
2. Edit the source stage to read from the Customers.ds dataset. Don’t forget to load
the table definition saved from last job.
4. Edit the Aggregator stage. Group by Zip. Count the rows in each group of zip
codes. You will add this value to each Customer record. Change Grouping Method
to SORT.
5. Output the Zip column and the new ZipCount column from the Aggregator.
7. Edit the Join stage. Specify an inner join on the Zip column.
8. On the Partitioning tab, hash and sort by Zip for both input links to the Join.
9. Write all the rows of the customer record with the added ZipCount column to an
output sequential file named CustomersCount.txt.
10. Your job now looks like this. The hash partitioning and sorts on each of the three
links going into the Aggregator and Join stages are what DataStage would have done
implicitly if Auto had been selected.
12. Examine the score. Are there any inserted tsort operators? What operators are
combined? In addition to the operators corresponding to the Aggregator, Join, and
Copy stages, what other operators are there in the score?
14. Remove the Hash partitioning and in-stage sorts, going back to Auto.
16. Examine the score. Compare with the other score in terms of number of operators,
number of processes, number of sorts, hash partitioners, etc.
2. Optimize your job by moving the hash and sort to the Copy stage. Specify SAME
partitioning for the links going into the Aggregator and Join.
4. View the score. Compare with the scores from the previous jobs. Has the number
of sorts been reduced? Have the numbers of operators and processes been
reduced?
More optimization
In this task, we will push the partitioning and sorting back even further. We will partition
and sort when the dataset is generated and loaded.
2. On the Partitioning tab of the target Customers dataset, Hash partition and sort by
the Zip. Compile and run to generate a new Customers.ds
4. Change the partitioning in the Copy stage to SAME and remove the sort, since the
data is already sorted coming out of the dataset.
5. Compile and run and view the score. Notice here the inserted tsort operators.
Although the data in the dataset is sorted, DataStage doesn’t know this and still
inserts the tsort operators.
6. Open up the job parameters window, and add the environment variable named
$APT_NO_SORT_INSERTION (Disable sort insertion) as a job parameter. When
set, this will cause the Framework to just check that the data is sorted as it is
supposed to be. It will not add tsort operators.
8. Compare with the results of running the job with $APT_NO_SORT_INSERTION turned off.
2. In the source Sequential stage read data from the CityStateZip.txt file. The
CityStateZip.txt file contains customer address information. In this job, you will
generate a report that lists each state followed by a count of the addresses in the
state and a list of the zip codes. Here’s a sampling of the source data and the
column names used.
4. In the first Sort stage, set Hashing by State as the Partitioning method. We need to
have all the rows of a given state in the same partition in order to get a single count
for the state. The hash should be case insensitive.
5. In the first Sort stage, sort by State in ascending order. The sort should be case
insensitive. Turn off Stable Sort since we don’t need it.
6. In the second Sort stage, set the “Create Cluster Key Change Column” option.
Specify that the data is already sorted as you specified in the previous Sort stage.
7. In the third Sort stage, specify that you want to sort by the cluster key change column
within the State groups. This will place the row with the cluster key change column
of 1 at the end of each State group.
8. Open up the Transformer. Define the stage variables in the order shown.
NewState: Char(1) flag initialized with “Y” indicates that a new state group is being
processed.
AddZip: VarChar(255) list of zip codes. Initialize it with an empty string. Lists the
zip codes processed in each group. Set it to the currently read zip code when a new
state is being processed.
10. For the StateCount link, there are three target columns. The State value comes from
the input. The other two target columns get their values from the Counter and
AddZip stage variables. Define a Constraint for the StateCount link. It should write
out only one record per State group, when the state count and zip list are complete
for the group, i.e., when the clusterKeyChange column equals 1. (A sketch of possible
stage variable derivations and the constraint follows this step.)
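If you need a starting point, here is one plausible arrangement of the stage variables and
the constraint. PrevState is an extra helper variable not named in the lab, and the link
name (lnk) is an assumption:

    NewState:  If lnk.State <> PrevState Then 'Y' Else 'N'
    Counter:   If NewState = 'Y' Then 1 Else Counter + 1
    AddZip:    If NewState = 'Y' Then lnk.Zip Else AddZip : ' ' : lnk.Zip
    PrevState: lnk.State

    StateCount link constraint:  lnk.clusterKeyChange = 1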
11. Set the target Sequential File stage to write to a file without quotes.
12. Compile and run. Verify the results. (Your ordering may be different.)
2. Read records from the Customers.txt file. Since the Customers.ds table definition is
the same as the file, you can use it.
3. In the Copy stage, pass all columns through to the CUSTS stage. Pass just the
CustID column through to the Column Generator.
4. Edit the CUSTS DB2 Connector stage. Connect to the DB2 instance and the
SAMPLE database. Click the “Test” button to make sure you can connect.
5. Write mode is INSERT. The table your job will create is named CUSTS. Select
REPLACE as the Table action with the statement generation and error handling as
shown.
6. Edit the Column Generator stage. Generate a new column named GroupByCol,
Char(1). Set the generation algorithm to just create a single letter ‘A’ using the
extended column properties window.
7. Edit the Aggregator stage. Group by the GroupByCol. Count the number of rows in
the group and send the results to the NumRecs column. Specify Hash as the
Aggregation method, since the data doesn’t need sorting.
11. Check the results. The log file should contain the number of records read from the
source file and written to the target table, unless the database rejects some rows.
2. From the Processing folder add two Surrogate Key Generator stages to the canvas.
Name them as shown. Also add the two DB2 Connector stages with links to the
Surrogate Key Generator stages.
3. Open up the PRODDIM Connector stage. Specify the Connection and Usage
properties. Choose to have the stage generate SQL.
4. Click the Columns tab. Load the column definitions into the stage. The table
definition is stored in the repository at “Table Definitions DB2 sample”.
5. Open up the STOREDIM Connector stage. Specify the Connection and Usage
properties. Choose to have the stage generate SQL. Load the column definitions
into the stage.
6. Open the ProdDim_SKG_Create stage properties. The Key Source Update Action is
Create and Update. Select PRODSK for the input column name. Specify a path to a
source key file named proddim as shown.
7. Open the StoreDim_SKG_Update stage properties. The Key Source Update Action
is Create and Update. Select STORESK for the input column name. Specify a path
to a source key file named storedim as shown.
8. Compile and run your job. Check the job log for errors.
9. Verify that the files have been created and that they are not empty. If you encounter
errors and need to run the job again, delete the state files first.
2. ***Important*** Open the Job Properties window and make sure that Runtime
Column Propagation is not enabled. Otherwise, you will get runtime errors when
source columns such as StoreID are written to the PRODDIM_upd link.
3. Add the stages and links as shown. Notice that the link from the PRODDIM
Connector stage to the Slowly Changing Dimension stage in the middle is a lookup
reference link.
4. Edit the SaleDetail stage. Read data from the SaleDetail.txt file. Import the table
definition. The column definitions are shown below. Correct them if necessary.
6. Edit the PRODDIM reference link stage. Set the Generate SQL property to Yes.
Click View Data.
7. On the Columns tab, load the column definitions. Select SKU, which is the business
key, as the lookup key field.
8. Open the PROD_SCD stage. On the Stage > General tab, select SaleDetailOut as
the output link.
9. Move to the next Fast Path page (using the arrow key at the bottom left), that is, the
Input>Lookup tab. Specify the column matching to use to lookup a matching
dimension row. Here we want to retrieve the row with the matching PRODDIM
business (natural) key. Also select the purpose codes for each of the dimension
table columns, as shown.
10. Move to the next Fast Path page, that is, the Input>Surrogate Key tab. Select the
surrogate key source file (proddim). Specify the surrogate key initial value, 1. Also
specify how many surrogate key values to retrieve from the state file in a single block
read. Specifying a block size of 1 ensures that there will be no gaps in the key
usage.
11. Move to the next Fast Path page, that is, the Output>Dim Update tab. Here specify
how to create a new dimension record and how to expire a dimension record that
has Type 2 columns in it. Be sure Output name is PRODDIM_Upd, that is, the name
of the dimension table update link. Use the Expression Editor to specify values and
functions.
12. Move to the next Fast Path page, namely Output>Output Map tab. Here the
PRODDIM surrogate key field (PRODSK) replaces the business key field in the
source file.
14. Open up the PRODDIM_Upd stage. Use Update then Insert to write to the target
SUPER.PRODDIM table. Let the stage generate the SQL.
15. In the columns tab, make sure the PRODSK is the only column set as the key.
17. Compile. Before you run the job, view the data from the SaleDetail.txt file and the
dimension table. This way you can see clearly what happens when you execute the
job.
-----------------
18. Run the job. Check the job log for errors. View the data in PRODDIM to see if the
table was updated properly. SKU 3 doesn’t change. SKUs 1 and 2 are new inserts.
SKUs 4 and 5 are new Type 2 updates; their original records (PRODSK=2 and 10)
are preserved as historical records (CURR=N).
19. View the data in the target dataset. A1111 and A1112 are assigned new surrogate
key values since they are inserts. A1113 was not changed, so it has the same
surrogate key value. A1114 and A1115 are new Type 2 updates. They received
new surrogate key values and are inserted into the target.
20. If you want to rerun your job, drop the three star schema tables and then re-run the
SQL file that creates the tables. Delete the surrogate key source files and then re-
run the job that creates and updates them.
2. Edit the SaleDetailOut DataSet stage. Extract data from the SaleDetailOut.ds file
that you created in the previous job. To get the Table Definition go to the Columns
tab of the target DataSet stage in your previous job. Click the Save button to save
the columns as a new Table Definition.
3. After you finish editing the stage, verify that you can view the data.
4. Edit the STOREDIM stage. Load column definitions. Select the ID column as the
lookup key. Verify that you can view the data.
6. Specify the output link, SaleDetailOut2, on the first Fast Path page.
7. Move to the next Fast path page, that is, the Input > Lookup tab. Specify the lookup
condition and purposes.
8. Move to the next Fast Path page, that is the Input > Surrogate Key tab. Select
storedim as the source key file to be used. Specify the other information as shown.
9. Move to the next Fast Path page, that is the Output > Dim Update tab. Specify the
mappings and derivations as shown.
10. Move to the next Fast Path page, that is the Output > Output Map tab. Here the
STORE surrogate key replaces the Store business key from the source file.
11. Edit the STOREDIM_upd stage. Be sure to qualify the table name by the schema
name, as shown.
12. Make sure STORESK is the only column set as the key.
13. Edit the FACTTBL stage. Be sure to qualify the table name by the schema name.
14. Compile. Before you run the job, view the data from the SaleDetailOut.ds file and
the STOREDIM dimension table. This way you can see clearly what happens when
you execute the job.
-------------------
15. Run the job. Check the job log for errors. View the data in the updated STOREDIM
table and in the FACTTBL.
-------------------------- STOREDIM
-------------------------- FACTTBL
2. Set up the Sequential File stage to read the file Employees.txt. Load the Columns
from the table definition of DB2 table EMPLOYEE in the repository.
3. Set up the DB2 Connector stage to write (insert) to database SAMPLE and table
DB2INST1.EMPLOYEE.
4. In the DB2 Connector properties, click on the reject link on the graph and edit the
Reject tab properties.
8. Your job execution should abort, since Employees.txt contains duplicate rows and
the DB2 Connector options do not tell the job to reject these rows.
9. Go to the DB2 Connector properties > Reject tab and select the SQL Error checkbox.
11. You should see the job finish successfully. This means your records were passed to
the output, and the rows that generate SQL errors will be in the reject file.
12. Open SQL_Error.txt and verify that it contains the rows that already existed in the
EMPLOYEE table.
13. Open the DB2 Connector properties again, click on the reject link and edit the Reject
tab options as below (Abort after property = 3):
15. You should see your job aborted since Employees.txt contains more than 3 duplicate
rows.
2. Set up the Sequential File stage to read the file Parent_Child_Records.txt and read
each record in as a single column.
3. Use the Transformer stage to split the records. Hint: use constraints to examine the
record type indicator and parse the output record accordingly. As you have done
in an earlier exercise, use the Field function to parse the record. Also, load the table
definitions for both the Child and Parent output links from the Table Definition folder
in the repository.
4. To set up the DB2 Connector properties, open the stage and then click on the
connector icon. Set up the credentials as shown. Also select “All records” for record
ordering.
5. Click on the Parent link and set up the stage to insert records into the “project” table.
Let the stage generate the SQL. Be sure to set the Table action to Append.
7. On the Link Ordering tab, make sure the parent is the first link as the records from
the first link will be processed first.
8. One other thing: since the job runs in partitioned mode, it is important to set the
partitioning of each input link to hash on PROJNO. This ensures that all records with
the same project number land in the same partition, so that parent records and their
child records are processed together.
10. Your job execution should contain no errors. You should see a total of 4 records
inserted into table PROJECT and 4 records inserted into table PROJACT. The log
messages read: “[Input link n] Number of rows inserted: 2”.
5. On the Values tab, specify a name for the Value File that holds all the job parameters
within this Parameter Set.
2. Open up your Job Properties and select the Parameters tab. Click Add Parameter
Set. Select your SourceTargetData parameter set and click OK.
4. Configure the source Sequential File stage properties using the parameters included
in the SourceTargetData parameter set. Also, set the option “First Line is Column
Names” as True.
7. Open the transformer stage. Go to edit constraints by clicking on the chain icon and
create a constraint that selects only records with a Special_Handling_Code = 1.
Close the stage editor.
8. In the Transformer stage, map all the columns from the source link to the target link
by selecting all the source columns and dragging them to the output link. The
Transformer editor should appear as shown below:
9. Configure the properties for the target Sequential File stage. Use the TargetFile
parameter included in the SourceTargetData parameter set to define the File
property as shown. Also, set the option First Line is Column Names as True.
11. View the data in the target and verify that there are only records having
Special_Handling_Code = 1.
3. Open up the Job Properties window and enable RCP for all links of your job. When
closing the Job Properties, answer YES to let Designer turn on RCP for all the
links already in the job.
5. On the Layout tab, select the Parallel button to display the OSH schema. Click the
right mouse button to save this as a file called Selling_Group_Mapping.osh. Note
that this file is saved on the client machine. Normally, you would have to transfer this
file to the DataStage server. We have already done this step for you.
6. Open up the schema file to view its contents. The “{prefix=2}” properties must be
removed. The version on the server does not contain them.
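For example, a variable-length column in the saved schema might look like the first line
below (the column name is illustrative); the second line shows the same column after the
prefix property is removed:

    Selling_Group_Desc: string[max=30] {prefix=2};
    Selling_Group_Desc: string[max=30];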
7. Open up your Source Sequential stage to the Properties tab. Add the Schema file
option. Then select the Selling_Group_Mapping.osh schema file.
9. In the Transformer, clear all column derivations going into the target columns (don’t
delete the output columns!). Also remove any constraints that are defined. If
you don’t remove the constraints, the job won’t compile, because the constraint
references an unknown input column.
10. Compile and run your job. Verify that the data is written correctly. That is, all
records are now written, since we no longer have a constraint.
11. If you need the constraint, then try defining the constraint in the Transformer stage
again. In addition, go to the Columns tab of the source Sequential File stage and
import just the Special_Handling_Code column from the Table Definition. Compile
and run your job. This time you should only have records that meet the constraint.
6. Open the Transformer. If you have a constraint left from the last exercise,
remove it. Map the Distribution_Channel_Description column across the
Transformer. Define a derivation for the output column that turns the data to
uppercase.
8. View the data in the file (not using DataStage View Data). Notice that the
Distribution_Channel_Description column data has been turned to uppercase.
All other columns were just passed through untouched.
5. Click the right mouse button over the container and click Open.
6. Open up the Transformer and note that it applies the Upcase function to a
column named Distribution_Channel_Description. Close the Transformer and
the Container without saving it.
7. Add a source Sequential File stage, Copy stage, and a target Peek stage as
shown. Name the stages and links as shown.
8. Edit the Items Sequential stage to read from the Items.txt sequential file. You
should already have a Table Definition, but if you don’t you can always import it.
9. Verify that you can view the data.
10. In the Copy stage, move all columns through. On the Columns tab, change the
name of the second column to Distribution_Channel_Description so that it
matches the column in the Shared Container Transformer that the Upcase
function is applied to.
11. Double-click on the Shared Container. On the Inputs tab, map the input link to
the Selling_Group_Mapping container link.
12. On the Outputs tab, map the output link to the Selling_Group_Mapping_Copy
container link.
In this task, you will create a function that checks for key words in a string that is passed
to it. It returns “Y” if it finds a key word, else it returns “N”.
1. In gedit or vi open the file named keyWords.cpp in the directory. This function
returns “Y” if it finds any of a list of words.
2. Compile your keyWords.cpp file into an object file by logging in to the Information
Server system as “dsadm”, changing to the /DS_Advanced directory, and running:
g++ -c keyWords.cpp
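The lab supplies this file for you, but for orientation, a minimal sketch of such a function
might look like the following. The key-word list is an assumption, and the return type is
char * to match the routine definition you create in the next step:

    #include <string.h>

    // Return "Y" if the input string contains any key word, else "N".
    extern "C" char* keyWords(char* in)
    {
        static char yes[] = "Y";
        static char no[]  = "N";
        const char* words[] = { "SALE", "DISCOUNT", "FREE" };  // assumed list
        if (in != 0)
            for (unsigned i = 0; i < sizeof(words) / sizeof(words[0]); i++)
                if (strstr(in, words[i]) != 0)
                    return yes;
        return no;
    }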
4. In DataStage, click your right mouse button over the Jobs folder and then click
New>Parallel Routine, then create a new External Function routine named
keyWords. Save it in the Jobs folder.
Create an Object type External function. Specify the return type (char *). Specify the
path to the object file.
5. On the Arguments tab, specify the input argument to your function. It should match
the type expected by the function you defined.
2. Create a job parameter named inField that can be used to pass in a string value that
you can apply your routine to.
3. In the Row Generator, define a single column. (It can be anything you want. It won’t
be used.) On the Properties tab, specify that you want to generate a single row.
4. Define a VarChar output field named Result in your Transformer stage. Define a
derivation that returns “Key word found” or “Key word not found” in the Result field
depending on whether the key word was found in the input string. Also define a field
to store the input string from the job parameter.
2. Click the right mouse button over a folder in the Repository and click
New>Other>Parallel Stage Type (Wrapped). On the General tab, enter the name
and command (the UNIX list files command): ls
4. On the Properties tab, define an optional property named “Dir” that is to be passed
the path to the directory to be listed. The Conversion type must be set to “Value
Only”, because we only want the value to be passed to the wrapper, not the property
name followed by the value.
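In other words, with Value Only the command executed at run time is effectively the first
line below; a name-value conversion would produce something like the second line,
which ls does not understand (the exact -Dir form is illustrative):

    ls /DS_Advanced
    ls -Dir /DS_Advanced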
6. Create a new job named wrapperGenFileList. Add your new Wrapped stage with an
output link to a Sequential File stage.
7. Open the Wrapped Stage. On the Output>Columns tab, load your Table Definition
that defines the output if it is not already there.
8. On the Stage>Properties tab, add the Dir property and then specify the directory
/DS_Advanced to be listed.
1. Create and save a Table Definition named InRec_TIA defining the input.
2. Create and save a Table Definition named OutRec_TIA defining the output.
4. On the Properties page, define a required property named Exchange. Its default is 1
and its Conversion type is the -Name Value type.
5. On the Build>Interfaces Input tab, define the input. Call the port InRec. Specify Auto
Read. Select the input interface Table Definition you defined earlier.
6. On the Build>Interfaces Output tab, define the output. Call the port OutRec. Specify
Auto Write. Select the output interface Table Definition you defined earlier.
7. On the Transfer tab, define an auto transfer with no separate transfer (false).
8. On the Logic Definitions tab, define a variable named beforeTaxAmount. You will
use this to define the base amount before tax is added. Also define a variable
named tax to store the calculated tax.
9. On the Per-Record tab, define the code that calculates the Amount. Be sure to
multiply the final result by the Exchange Property.
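A minimal sketch of the Per-Record logic, assuming the input columns are named
Quantity, Price, and TaxRate (as in the next lab's source file), that columns are
referenced through the InRec/OutRec port names, and that the Exchange property is
visible by name:

    // Compute the base amount, add the tax, then apply the exchange rate.
    beforeTaxAmount = InRec.Quantity * InRec.Price;
    tax = beforeTaxAmount * InRec.TaxRate;
    OutRec.Amount = (beforeTaxAmount + tax) * Exchange;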
10. Click Generate. If the generation fails, fix any errors and then regenerate.
1. Import a Table Definition for the source file order_items.txt. The column names
should be as follows: OrderID, ItemNumber, Quantity, Price, TaxRate. Use float
type for Price and TaxRate.
2. Create a new job that reads the source file, passes the rows to your new Build stage,
and then writes the rows to a Sequential File stage.
3. Use the Copy stage to modify the input column names and types to match the input
columns expected by the Build stage.
4. In the Build stage, the output link should include all columns that are in the source
stage plus the Amount column.
6. Run and test your job. Be sure to test your Exchange Property by trying out different
exchange rates.
3. In Director, click Tools > New Monitor to open a Monitor on the job.
4. Click the right mouse button over the window. Set or verify that the Monitor is
showing instances and percentage of CPU.
5. Note these are the results when this job was run on a particular virtual machine.
Your results may differ significantly.
• Correlate each stage in the job with the stages listed in the first column.
8. Open a Job monitor and compare the performance results. Clearly, in this example
the performance has improved.
2. Open up the Job Properties window. Click the Execution tab. Select “Record job
performance data.”
6. Click Stages and then de-select everything. One-by-one, select a stage and
examine its throughput. Shown here is the chart for the Aggregator Sort.
8. Now set the same job property for perForkJoin3, then recompile and run.
9. Open the Performance Analysis tool and view the results. Compare the results with
the un-optimized version.
2. Open up Job Properties and click on the Execution tab. Select Record job
performance data (if it hasn’t been selected already).
3. Change the two Data Set target stages’ file property to write to the correct directory.
4. Compile and run your job. Verify in Director that it runs to successful completion.
6. Open the Charts folder and select Job Timeline (the default chart).
7. Open the Partitions folder. Deselect one of the Partitions. Notice that the
corresponding tab disappears on the chart. Reselect the partition.
8. Open the Stages folder. Select just the first Generator, the Sort, and the RemDup
stages.
9. Click on the black bars to the right of the stages to display the phases of each
process.
10. Open the Phases folder. Select just the runLocally phase.
11. Open the Filters tab. Deselect each box one at a time and examine the effect on the
chart. Shown below is the effect of deselecting the Hide Startup Phases box.
12. Open up the Charts folder. Examine each chart in the Job Timing, Record
Throughput, CPU Utilization, Memory Utilization, and Machine Utilization folders.
2. Modify the job as shown below. Move the output end of the Orders link to the
added Column Import stage. Remove the Join stage and its two input links, and drag
the input side of the OrdersCombined DataSet stage link to the Transformer stage.
Draw a link from the Column Import stage to the Transformer.
3. Edit the Column Import stage. On the Stage Advanced tab, set the stage to run
sequentially. This is necessary to preserve the ordering and groups of records going
into the Transformer stage.
4. On the Stage Properties tab, import the OrderNum and RecType columns. Set the
“Keep Import Column” property to True, so that the total record is also passed
through.
5. On the Output Columns tab, specify the metadata for the imported columns. Make
sure the RecType field is VarChar(1) rather than Char(1); otherwise, it won’t import
correctly.
7. In the main window of the Transformer, define two stage variables to store the Name
and OrderDate from the Header records. To simplify the derivation of the OrderDate
field, define it as a VarChar(10) instead of a Date type. Define the derivations for
these stage variables, using the Field function to parse the columns from the Header
record. Write an output record only when the input is a Detail record. Also drag over
the RecIn column to help verify the results when you run the job.
1. Browse the folder Jobs -> DataStage Advanced which contains the jobs you will use
for this lab.
2. Open the job Populate_Orders and edit the Row Generator stage “Orders_gen”. Set
Number of Records = 2,000,000; these rows will be generated into the table
db2inst1.orders in the SAMPLE database. The orders table will be used as a source
table in the following exercises.
3. Compile and run the job. Verify that the execution has completed successfully. You
should have now populated the ORDERS table.
4. Open and explore the job JoinOrdEmp. This job performs a join between the orders
with AMOUNT>100 (filtered by a Transformer stage) and the employee who
managed each order. The result is then stored in a Data Set.
5. Compile and run the job. In the Director client verify that the execution has
completed successfully.
6. Select the Optimize button in the bar as shown to open the Optimizer interface.
7. Select the option Push processing to database sources and press the Optimize
button. The optimizer will then attempt to push the processing of the Transformer
and Join stages into the source DB2 Connector, if possible.
8. Explore the Compare tab to see a comparison between the root job and the
optimized job. Notice that the two source DB2 Connector stages, the Join stage, and
the Transformer stage in the root job have been replaced by a single DB2 Connector
stage in the optimized job.
9. Explore the Logs tab, which contains the details of the changes made by the
optimizer in defining the optimized job. From the messages you can understand
exactly what the optimizer has accomplished: the identification of the patterns of
stages suitable for optimization, the impact on partitioning, and the query
definitions that allow pushing the processing, in this case, to the source database.
10. Save the optimized job by accepting the default proposed job name
Optimized1OfJoinOrdEmp. Close the Optimizer.
11. Open Optimized1OfJoinOrdEmp (if it is not already open) and expand the DB2
Connector stage properties. Notice the SELECT statement that the optimizer has
built to implement the same logic previously implemented by the two DB2
Connector, Transformer, and Join stages. If you are SQL curious, simply copy
and paste the SQL statement into Notepad for more detailed examination.
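The generated statement is roughly of the following shape. This is illustrative only; the
join key and the selected columns depend on your table definitions:

    SELECT O.ORDER_ID, O.AMOUNT, E.FIRSTNME, E.LASTNAME
    FROM DB2INST1.ORDERS O
    INNER JOIN DB2INST1.EMPLOYEE E
        ON O.EMPNO = E.EMPNO
    WHERE O.AMOUNT > 100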
Note: the figures appearing in the following analysis (timing measures, throughput, etc.)
may differ from the ones you get during the exercise. In any case, follow the procedure
and adapt the comparison of results to your own case.
1. Use the Director to compare the execution time of the root versus the optimized job.
Notice that pushing the operations implemented by the Join and Transformer in the
root job to the source database has improved performance.
2. Looking at the job monitors for the two jobs (you can open both of them), you can
see that the optimized job processed fewer records than the root job. This is
because in the root job the ORDERS DB2 Connector retrieved 2 million rows from
the database, which were filtered afterwards by the Transformer stage’s constraint.
In the optimized job, the source DB2 Connector directly retrieved only the records
satisfying the SQL query, which implements the root job’s Transformer constraint
on AMOUNT.
3. Refer also to the job logs in the Director to understand the different execution steps
performed by the two jobs. Compare the startup and production run times, which
help you roughly understand how the elapsed time is composed and the benefit you
can achieve.
JoinOrdEmp
Optimization1OfJoinOrdEmp
4. To understand the behaviors of the root and optimized jobs in more detail, open the
Performance Analysis tool to compare their resource usage and record
throughput.
5. For the JoinOrdEmp job, you can filter the stages to consider during the analysis,
selecting only the ORDERS and EMPLOYEES DB2 Connector stages and the
output Data Set stage.
6. Now examine the Record Throughput Outputs for all the partitions. Cross-referencing
this chart with the Director’s logs, notice that the output stage begins to receive
records some seconds after the job starts. Hover your mouse over these lines to see
exactly when records start arriving.
7. Repeating the same analysis on the Optimized1OfJoinOrdEmp job, you can see that
the output stage receives records from around the same time after the job starts,
which is comparable with the root job.
8. The significant difference between them is not how fast the jobs have records
available for the target loading, but their record throughputs. You can estimate the
approximate slopes of the output stages’ throughput curves for both charts,
considering all the partitions, to get comparable figures. Then try to justify the
comparison results.
- For JoinOrdEmp:
- For Optimized1OfJoinOrdEmp:
In Optimized1OfJoinOrdEmp, the data coming from the source DB2 Connector
stage has already been processed by the source database engine, while in
JoinOrdEmp it must still go through the Transformer and Join stages; hence the
resulting record throughputs cannot be similar.
9. Open the Memory Usage Density Page Ins charts for the two jobs and notice that the
root job is more memory intensive than its optimized version. Notice that in
JoinOrdEmp the maximum memory usage (9000 pages) is mainly due to processing
the orders records.
JoinOrdEmp
Optimized1OfJoinOrdEmp
10. Optional: perform a similar analysis considering the CPU and Disk utilizations.
Note: To perform a more detailed comparison between the root and optimized jobs,
or even to decide on the best optimized version for a job, there are also other
parameters to consider: the degree of source/target database concurrency, the
amount of system resources available for DataStage and the source/target
databases, the number of records to process, the database tuning level, etc.
1. Locate the job JoinOrdEmp in the Repository Window, right-click and select Find
where used -> Jobs.
3. To perform the reverse operation, exploring which job is the root job for the
Optimized1OfJoinOrdEmp job, locate it in the Repository window and select Find
dependencies -> Jobs.
4. The Repository Advanced Find window appears and displays the jobs dependent on
Optimized1OfJoinOrdEmp, in this case the root job JoinOrdEmp.
5. To remove the dependency between the optimized and root job, open
Optimized1OfJoinOrdEmp in the Designer and select Edit -> Job Properties.
6. In the Dependencies tab, right-click on the JoinOrdEmp entry and select Delete row.
In this way, Optimized1OfJoinOrdEmp will lose its relationship with the root job
JoinOrdEmp.
2. Replace the target Data Set stage with a DB2 Connector stage.
5. You can now optimize the job by distributing the processing to the source and target
DB2 Connector stages, then verify whether that is a worthwhile choice. Open the
Optimizer for the JoinOrdEmpTrg job, select the Push processing to database
sources and Push processing to database targets options, and press the Optimize
button.
Note: the source and target tables are all in the same SAMPLE database.
7. Another way in which this job can be optimized is to push all the processing to the
target database. This is possible because all the tables you are using in the root
job are in the same database. Now you want to understand whether this optimization
version performs better than Optimized1OfJoinOrdEmpTrg.
10. Notice that, as in the job Optimized1OfJoinOrdEmpTrg, part of the root job’s logic
(the Transformer constraint and the ORDERS DB2 Connector) has been implemented
within the source DB2 Connector.
14. Use the Director to compare their Elapsed Times and notice that the job with the
shortest execution time is Optimized2OfJoinOrdEmpTrg.
15. Following the same approach as seen for Lab 1, you can use the Performance
Analysis tool to explain the differences in performance between these three
jobs.
16. Notice that for the job Optimized2OfJoinOrdEmpTrg, no records at all were
processed by DataStage: all the operations were performed by the target
database in response to the SQL statement pushed down by the job. The job
Optimized1OfJoinOrdEmpTrg processes all the rows (1327629 rows) selected by the
SQL query in the source DB2 Connector stage, and then passes them to the target
DB2 Connector stage as shown below.
Optimized1OfJoinOrdEmpTrg
Optimized2OfJoinOrdEmpTrg
17. Open the JoinOrdEmpTrg job and modify the target DB2 Connector stage properties,
setting QS as the target database. Then save the job as JoinOrdEmpTrg2 and
compile it.
18. Open the Optimizer and notice that the Push all processing into the (target)
database option is no longer available. This is because the source and target tables
reside in different databases, so the job cannot be built using a single DB2 Connector
stage as was done for Optimized2OfJoinOrdEmpTrg.
19. Optional: optimize the JoinOrdEmpTrg2 job and analyze its performance, using the
Push processing to database sources and Push processing to database
targets optimization options.
1. Open the Populate_Orders job and edit the Row Generator stage, setting the
Number of Records = 100,000 as the number of records to be generated into the
target table “ORDERS”. Then compile and run the job.
2. Open and explore the job SalesReport. This job calculates the total order Amount
for the records in the ORDERS table and loads the result into the TOTORD table.
Note: the source and target tables are in the same database (SAMPLE).
4. Considering that your job performs a data reduction on the input records from the
ORDERS table (100,000 rows), generating a single output row, and also considering
that both tables are in the same database, you might try to push the data
reduction processing to the target database. To do that, select the optimization
options Push processing to database targets and Push data reduction
processing to database targets and click Optimize.
5. Select the Compare tab and notice that the two Transformer stages and the
Aggregator stage have been pushed to the target DB2 Connector stage, while the
source DB2 Connector appears the same as before the optimization.
7. Open the target DB2 Connector stage and look at the insert SQL statement
generated by the optimizer.
9. Compare the execution times, performance, and system resource usage of
the root and optimized jobs using the Director and Performance Analysis tools,
as you did for the previous labs.
Although the job design will be the same in both of these scenarios, you will see their
differences in terms of optimization options you can use and performance improvements
you can achieve.
You will also learn a way to explicitly condition the optimization process, excluding one
or more stages from the optimization patterns.
1. Open Populate_Orders and verify that Number of Records = 100,000 to be
generated into the target table “ORDERS”. Then compile and run the job in case
you don’t currently have that number of records in the ORDERS table.
Note: if at any moment you need to reload the original 10 records into the
“EMPLOYEES” table, you can simply compile and run the RestoreEmployees job.
3. Open the job OrdersReport and analyze the logic implemented by each stage. This
job calculates, for each order in the table ORDERS, the total amount of orders
summarized by employee and year. The aggregated values are then inserted into
the target table ORDER_REPORT, in which the Employee ID code is replaced by
the employee's first and last name via a lookup operation.
5. Analyzing the job, you can see that the first two stages following the source DB2
Connector stage meet the Balanced Optimization requirements (the Copy stage’s
multiple outputs, on the contrary, are not supported), so as a first optimization
attempt you can consider pushing the processing toward the source database.
Open the Optimizer and check only the Push processing to database sources
option. Then press the Optimize button.
6. Open the Compare tab and notice that only the Transformer and Sort_1 stages have
been pushed to the source database. The processing logic implemented by the fork-
join structure (i.e., the Copy, Aggregator, and Join stages) could not be pushed to the
source, so it has not been changed.
7. Explore the Logs tab and notice the WARNING messages. The second and
third messages explain why the stages composing the fork-join structure have
not been optimized.
8. Save the job as Optimized1OfOrdersReportSrc and open the source DB2 Connector
stage to see how the optimizer has converted the logic originally defined by the
Transformer and Sort_1 stages into a single SQL query.
10. As a second optimization attempt, you may choose to push the processing
toward the target database. Open the optimizer again for the OrdersReport job.
This time select the Push processing to database targets option, then press the
Optimize button.
11. Browse the Compare tab and notice that only the target side stages (the Lookup
stage and the last Transformer stage) have been pushed to the target database.
Save the job as Optimized1OfOrdersReportTrg. Close the optimizer window.
12. Open the target DB2 Connector stage to analyze the SQL query defined by the
optimizer, which implements the Lookup and Transformer2 root stages’ logic.
14. Open again the optimizer and select both the Push processing to database
sources and the Push processing to database targets options.
15. Compare the original and optimized versions and notice that the only part not pushed
to the database is the fork join; this version is a composition of the two previous
optimizations.
16. Save the job as Optimized1OfOrdersReport and analyze the SQL generated in the
source and target DB2 Connectors. Notice also that the fork-join structure could not
be optimized, for the same reason you encountered previously.
18. As you learned during Lab2, if the source and target tables are on the same database,
the best optimization can be achieved by pushing all the processing to the target
database. Try applying the same approach to the OrdersReport job as shown below.
19. Although you tried to push all the processing to the target database, the
optimizer ignored that option. You do not see a single DB2 Connector stage fed by a
Row Generator as in Lab2; instead, the optimized job is exactly the same as
Optimized1OfOrdersReport. This is again due to the fork-join structure, which
prevents full optimization.
20. Now compare the execution times, performance, and system resource usage of the
root and optimized jobs using the Director and the Performance Analysis tools, as
you did for Lab1.
1. Open and explore the OrdersReportTargetDB job. This job is similar to the
OrdersReport job, but the source and target tables are in two different databases,
as you can see by exploring the source and target DB2 Connector stages.
3. You can now optimize the job using the same approach you followed for the
OrdersReport job: generate different versions of the root job based on different
optimization options, then compare their performance and resource usage to
determine which optimized version best matches your requirements.
Open the optimizer and select Push processing to database sources, then
save the optimized job as Optimized1OrdersReportTargetDB.
5. Open the optimizer again for the OrdersReportTargetDB job and select Push
processing to database targets, then save the optimized job as
Optimized2OrdersReportTargetDB.
6. In the Logs tab, notice that the tables EMPLOYEES and ORDER_REPORT cannot
be part of the same optimization pattern, as they were for the OrdersReport job,
because they now reside in different databases.
8. Open the optimizer and select both the Push processing to database sources and
Push processing to database targets options.
11. When the source and target tables are on different databases, another possibility you
may want to consider is the bulk loading optimization option. With this option the
target DB2 Connector first bulk loads a temporary staging table created in the target
database during job execution. SQL statements then load the actual target table by
reading from the temporary staging table, so any remaining transformation occurs
directly in the target database after the high-performance bulk load. The pattern is
sketched below.
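Conceptually, the generated processing follows the pattern below. The staging table
name is illustrative only; the optimizer creates and names its own temporary table:

    -- Step 1 (done by the connector): bulk load rows into a temporary staging table.
    -- Step 2 (After SQL, conceptual): load the actual target from the staging table,
    --        running any remaining transformations inside the target database.
    INSERT INTO ORDER_REPORT (FIRST_NAME, LAST_NAME, ORDER_YEAR, TOTAL_AMOUNT)
    SELECT FIRST_NAME,
           LAST_NAME,
           ORDER_YEAR,
           TOTAL_AMOUNT
    FROM STG_ORDER_REPORT;            -- temporary staging table (name assumed)
    -- Step 3: remove the staging table once the load completes.
    DROP TABLE STG_ORDER_REPORT;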
12. Open the optimizer and select the Push processing to database sources, Push
processing to database targets, and Use bulk loading of target tables options.
14. Notice also the Before/After SQL statements that will be used to load the actual
target table using the bulk-loaded staging table as a source.
15. Enable the Auto commit mode option for the target DB2 Connector stage so that
the database commits the transactions automatically.
17. Now compare the execution times, performance, and system resource usage of the
root and optimized jobs using the Director and the Performance Analysis tools, as
you did for Lab1.
18. Notice that in this scenario Optimized4OfOrdersReportTargetDB, which uses the
bulk load option for the target database, does not perform better than the other
optimized versions; in fact, Optimized3OfOrdersReportTargetDB is the fastest
optimization.
19. Using the Performance Analysis tool, compare the performance of the
Optimized3OfOrdersReportTargetDB job against that of the Optimized1OfOrdersReport
job, which was generated using the same optimization options. Try to understand
the reasons for the difference in their elapsed times.
Tip: Look at the Record Throughput, and compare the elapsed time of the Lookup
stage for OrdersReportTargetDB with that of the target DB2 Connector stage for
OrdersReport.
Optimized3OfOrdersReportTargetDB
Optimized1OfOrdersReport
2. Select both the Push processing to database sources and Push processing
to database targets options.
3. You may now optimize the job, forcing the sort operation to be executed by
DataStage instead of being pushed into the database. To explicitly exclude the Sort
stage from the optimization, select the “Advanced Options” tab, set the value
Sort_1 for the property Name of a stage where optimization should stop, and
press the Optimize button.
4. Notice that the optimizer has excluded the Sort_1 stage from the optimization.
In this lab you will see a job that performs better when the processing is done entirely by
the DataStage engine rather than by the database engine.
1. Open the job Populate_Orders and edit the Row Generator stage to set the Number
of Records = 2,000,000.
2. Compile and run the job to populate the table ORDERS in the SAMPLE database.
3. Open the LoadProcessing job and analyze it. Notice that the Transformer stage
implements conversion functions and decision logic for some of the output
derivations.
5. Open the Optimizer and check the Push processing to database sources option.
Then press the Optimize button and save the optimized job as
Optimized1OfLoadProcessing.
6. Open the Optimized1OfLoadProcessing job and notice how the logic originally
implemented by the Transformer stage has been converted into a single SQL
statement in the source DB2 Connector stage.
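Conversion functions and decision logic from a Transformer typically map to CAST and
CASE expressions in the pushed-down query. This is a sketch only, under assumed
column names and thresholds:

    SELECT ORDER_ID,
           CAST(AMOUNT AS DECIMAL(10,2)) AS AMOUNT_DEC,   -- conversion function
           CASE WHEN AMOUNT > 1000 THEN 'LARGE'           -- decision logic
                ELSE 'SMALL'
           END AS ORDER_SIZE
    FROM ORDERS;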
8. Compare the execution times, performance, and system resource usage of the root
and optimized jobs using the Director and the Performance Analysis tools, as you
did for the previous labs. Notice that the optimized job is slower than the root
job.
9. Notice the Percent CPU Utilization charts. The LoadProcessing job requires
significant CPU activity when the Transformer stage starts processing the records
after they are made available by the source DB2 Connector stage (refer to the
Percent of Time in CPU chart), while the Optimized1OfLoadProcessing job starts
processing the records as soon as the source DB2 Connector connects to the
database. The peak CPU usage of the two jobs is comparable; however, looking at
the Throughput charts you can see that the LoadProcessing job performs faster.
Note: some of the following pictures show data for one partition only. When you
do these analyses you should consider all the partitions.
LoadProcessing
Optimized1OfLoadProcessing
2. In the Name to find box type sort* and in the Types to find list select Parallel Jobs.
2. Open the Last modification folder. Specify objects modified within the last week.
3. Open up the Where Used folder. Add the SUPER_PRODDIM Table Definition.
Change Name to find to an asterisk (*). Click Find. This reduces the list of found
items to those that use this Table Definition.
Generate a report
1. Click the number of matches to get the search result window again. Click File >
Generate Report to open a window from which you can generate a report describing
the results of your Find.
2. Click on the top link to view the report. This report is saved in the Repository where
it can be viewed by logging onto the Reporting Console.
3. After closing this window, click on the Reporting Console link. On the Reporting tab,
expand the Reports folder as shown. Click View Reports.
4. Select your report and then click View Report Result. This displays the report you
viewed earlier from Designer. By default, a Suite user only has permission to view
the report. A Suite administrator can give additional administrative functions to a
Suite user, including the ability to alter report properties, such as format.
2. Right-click the ForkJoin job in the list and then click “Show dependency
path to…”
3. Use the Zoom button to adjust the size of the dependency path so that it fits into the
window.
4. Hold the right mouse button down over a graphical object and drag to move the path
around.
5. Notice the “birds-eye” view box in the lower right-hand corner. This shows how the
path is situated on the canvas. You can move the path around by clicking to one side
of the image in the birds-eye view window, or by holding the right mouse button
down over the image and dragging it.
4. Change the name of the output link from the Copy stage to TF (for TargetFile).
8. In the Compare window, select your CreateSeqJobPartiton job in the Item Selection
window.
10. Click on firstLineColumnNames in the report. Notice that the stage opens to the
Properties tab where the change was made.
4. On the Columns tab, change the name of the Item column to ITEM_ZZZ, and
change its type and length to Char(33).
5. Click OK.
6. Right-click over your Table Definition copy and then select Compare Against.