IBM® InfoSphere™
DataStage
Advanced
Workshop Lab
Workbook
All the required data files are located at: /DS_Advanced. You will be using the
DataStage project called “dstage1”. Optionally, you may put all your DataStage
objects (e.g. jobs, parameter sets, etc.) in the project folder /dstage1/Jobs/DS-
Advanced/.
Please start both the DataStage Designer and Director to do the following
exercises.
1 IS admin: InfoSphere Information Server administrator
2 WAS admin: WebSphere Application Server administrator
/opt/IBM/InformationServer/Server/Configurations
This configuration file has been modified to define two nodes, which will allow us to
exercise the capabilities of the parallel engine.
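For reference, a two-node configuration file has the following general shape. This is a
hedged sketch only: the fastname and resource paths below are assumptions, and your
actual file will differ.

    {
      node "node1"
      {
        fastname "dsserver"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
      }
      node "node2"
      {
        fastname "dsserver"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
      }
    }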
2. Rename the stages and links as shown. This is good practice: meaningful names serve
as lightweight documentation.
3. Set up the source Sequential File stage to read the file Selling_Group_Mapping.txt.
Don’t forget to import the table definition first.
5. Set up the target Sequential File stage to write to two different target files,
TargetFile1.txt and TargetFile2.txt.
6. Notice that the partitioning icon is ‘Auto’. (Note: If you do not see this, refresh the
canvas by turning “Show link markings” off and on using the toolbar button.)
Source file:
Target file 1:
Target file 2:
3. Notice how the data is partitioned. Here, we see that the 1st, 3rd, 5th, etc. records go
into one file and the 2nd, 4th, 6th, etc. go into the other file. This is because the default
partitioning algorithm is Round Robin.
2. Compile and run the job again. Open the target files and examine. Notice how the
data gets distributed. Experiment with different partitioning algorithms!
3. The following table shows the results for several partitioning algorithms with one
particular system (yours may not match exactly):
4. Use either gedit or vi to change the Selling_Group_Mapping.txt file: put a letter into the
first column of three records.
5. Run the job again. Check the log messages with the Director and notice that the
Sequential File stage throws these records away with a warning message.
6. Now add a reject link and a Peek stage as shown. Don’t forget to change the “Reject
Mode” property of the Sequential File stage to “Output”.
7. Compile and run the job again. Check the log messages to see the records with
incorrect data were sent down the reject link and captured by the Peek stage.
Note the one from Peek_Reject,0 has (…) at the end. Open it up and you will see the
following:
2. Set up the Sequential File stages to read Warehouse.txt as source records and
Items.txt as reference records. Don’t forget to import the table definitions.
4. Map all columns from the source (Warehouse.txt) plus the column Description to
the output.
5. The lookup failure action property can be set to any choice except Reject. Click the
yellow constraint icon and set the lookup failure action.
6. Set up the target Sequential File stage to write the records to Warehouse_Items.txt
file.
8. If you set the lookup failure action to FAIL, then you should see your job abort.
If you set the lookup failure action to Drop, then you will not see any log message
from the Lookup stage. However, you can see that the number of records read from
Warehouse.txt and the number of records written to Warehouse_Items.txt differ by 9
records.
If you set the lookup failure action to Continue, then all records will be passed to the
output.
9. Now, let’s change the lookup failure action to REJECT. Add a reject link with a Peek
stage as shown.
11. This time you should see log messages from the Peek_Reject stage.
If you open each log message, you should find a total of 9 records logged by the
Peek stage.
2. Add a Transformer stage between the Lookup stage and the target Sequential File
stage. Add a reject link from the Transformer stage to a Peek stage. To do so, select
the link, then right-click and choose “Convert to Reject”. Your final job should look
similar to the picture shown.
4. In the Transformer stage, map all columns of input record to output. Change the
derivation for Description to “[“:Warehouse_Items.Description:”]”.
5. Don’t forget to handle the NULL in the target Sequential File stage (just in case).
7. With DataStage versions prior to 8.5, you would find log messages from the
Peek_X_Reject stage containing those records with a NULL in the Description
column. However, DataStage 8.5 handles (allows) NULLs in derivations, so you
won’t see any rejected records.
2. Add a Remove Duplicates stage. Replace the Lookup stage with a Merge stage. Add
a reject link to a Peek stage from the Merge stage.
3. Set up the Remove Duplicates stage with the following properties: Key = Item, Retain
= Last. Map all columns to output.
5. On the Link Ordering tab, make sure all the internal links and external links are
correctly aligned.
8. You can see that update records without corresponding master records are rejected.
Viewing the detail of the Peek_Reject stage log messages will show these rejected
update records.
2. Open up the Row Generator stage. On the Properties pages specify that 1000 rows
are to be generated.
4. Open up the Extended Properties window for the CustID column. (Double-click on
the number to the left of the column.) Specify that the type of algorithm is cycle with
an initial value of 10000.
6. For the Int2 column cycle from 1 to 29. (It’s important that this not start at 0, so that
these cycles won’t repeat.)
9. For the MiddleInit column, use the alphabet algorithm over a string of characters that
might be middle name initials. (That is, remove the numerals from the list.)
11. For the CustDate column, generate random dates with a limit of e.g. 20000, so that
the dates don’t get too large.
12. For the InsertUpdateFlagInt, select random integers with a limit of 2. This will ensure
that values are either 0 (meaning update) or 1 (meaning insert).
13. Close the stage and then open it again and click the View Data button to examine a
sampling of the data that will be generated.
14. Edit the Sequential File stages that are used as lookup tables. The sequential files
are FName.txt, LName.txt, Street1.txt, and Street2.txt. Examine these files to get an
idea of the data they contain. Import the metadata for these files and load it into the
stages.
15. Edit the Lookup stage. Map the Int1 to Int4 columns, respectively, to the Num
columns of each of the lookup files. Define the output columns in the order shown at
the far right.
16. Click the Constraints button. Specify that rows that fail to find matches are to be
rejected.
18. Define the following derivations (in addition to the straight mappings); one possible
way to express them is sketched after this list:
• Middle initial should be uppercase.
• Address should consist of the street name followed by the street modifier.
• Rows with customer dates later than the current date should get the current date.
Otherwise, they retain the date in the source row.
• The added column DateEntered should get the date of the job run.
• The InsertUpdateFlag column, which is now Char(1), should map 0 to “U” and
1 to “I”.
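If you get stuck, the following sketch shows one way these derivations might be
expressed in the parallel Transformer. The link name (lnk) and the street column names
are assumptions; adjust them to your own metadata.

    MiddleInit:       Upcase(lnk.MiddleInit)
    Address:          lnk.StreetName : ' ' : lnk.StreetMod
    CustDate:         If lnk.CustDate > CurrentDate() Then CurrentDate() Else lnk.CustDate
    DateEntered:      CurrentDate()
    InsertUpdateFlag: If lnk.InsertUpdateFlagInt = 0 Then 'U' Else 'I'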
19. Open up the Job Properties window and define a new job parameter named
TargetPath. Provide it with a default that creates a file named CustomersOut.txt in
the /DS_Advanced directory.
20. Edit the target Sequential File stage. Insert your TargetPath job parameter in the
File property to create a new comma-delimited file named CustomersOut.txt. Don’t
surround the column values with quotes.
21. Edit the rejects Sequential File stage. Write rejects to a file named RejectsOut.txt.
22. Compile and run your job. Examine the job log in Director. Fix any errors. Try to
eliminate all warnings.
24. In addition to viewing the data in DataStage, view the data file in your directory.
Verify that quotes don’t surround the values and that the data is delimited by
commas.
2. Locate and examine the message that lists the values of the environment variables
that are in effect at the time the job is run.
3. Locate and examine the message that displays the OSH (Orchestrate Script) that is
generated for the job.
4. Locate and examine the message that displays the configuration file used when the
job was run. How many nodes are defined in the file? (Note: Your job will be using a
different configuration file than the one shown here. This is just an example.)
5. Locate and examine the message that lists the job’s datasets, operators, and
number of processes. This is known as the Score. (Note that you won’t see the
word ‘Score’. The first line is how you can identify it.)
6. Locate the message that says how many rows were successfully written to the target
Sequential File stage and how many were rejected.
2. Enable RCP for the job and all existing links in the job properties.
3. Edit the source Sequential File stage to read data from the Customers.txt file. On
the Columns tab, specify a single column named RecIn, VarChar(1000).
4. On the Formats tab, specify that there is no quote character. Verify that you can
view the data.
5. Edit the Column Import stage. On the Properties tab, specify that the Import Input
Column is RecIn. As the Column Method, specify Schema File and then reference
the Customers.schema schema file.
6. That’s it for the Column Import stage. It uses RCP to send the columns of data
through to the target.
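If you are curious what the stage consumes, an OSH schema file for this kind of data
looks roughly like the sketch below. The actual Customers.schema file on the server is
authoritative; the column names and types here are assumptions.

    record {final_delim=end, delim=',', quote=none}
    (
      CustID: int32;
      FirstName: string[max=20];
      LastName: string[max=20];
      Address: string[max=40];
    )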
2. Edit the GenerateHeader Row Generator stage. On the Properties tab specify that 9
records are to be generated.
3. Define the columns, as shown. Save the table definition for the next lab to use.
4. For the OrderNum column, generate numbers from 1 to 9. Also, for this and all
columns add the optional Quote property from the Field Level category and set it to
NONE.
6. Choose your own algorithms for the remaining fields. In what follows, I’ve chosen
random for OrdDate with a limit of 20000.
8. Edit the GenerateDetail Row Generator stage. On the Properties tab specify that 81
records are to be generated.
9. Define the columns, as shown. Save the table definition for the next lab to use.
10. For the OrderNum column, generate numbers from 1 to 9. Also, for this and all
columns add the optional Quote property and set it to NONE.
12. Choose Random with a limit of 9999 for the remaining fields.
14. In the Column Export stage for the header, in the Input folder of the Properties tab,
specify the input columns that are to be exported to a single output column
(OrderNum, RecType, etc., in the order shown). In the Output folder specify the
name of the single column (Header) and its type (VarChar) that the input columns
are to be combined into.
15. On the Output Columns, create a column named RecOut and map the input field to it
on the Mappings tab.
16. Define the Column Export stage for the detail records in a similar way. Be sure to
use the same name (RecOut) for your output column name.
17. In the Funnel stage, map the single RecOut column across to the target.
18. In the Sort stage, specify that the records are to be sorted in ascending order. The
key is the single RecOut column.
19. Write to a file named Orders.txt. In the Sequential File stage Format tab, set the
quote property to NONE.
20. Compile and run and view your data. It should look something like this, with the
header records at the front of each group of records, grouped by order number.
2. Your job reads data from the Orders.txt file. It reads this data in as one column of
data.
3. In the Transformer, define constraints to send the Header rows down the Header link
and the Detail rows down the Detail link. Also parse out the fields for each of the
record types using the Field function.
4. To create the output column definitions for the two links, load the table definitions
saved from the previous job. The RecType fields won’t be needed downstream, so
delete them from the output. Add a column to the Detail link named RecordNum.
Define an expression that generates a unique integer for each Detail row regardless
of its partition.
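As a hedged sketch of the Transformer logic described in steps 3 and 4 (the delimiter,
record-type values, and link name are assumptions based on the previous lab; the
RecordNum expression uses the standard system-variable idiom for partition-safe unique
numbers):

    Header link constraint:  Field(lnk.RecIn, ',', 2) = 'H'
    Detail link constraint:  Field(lnk.RecIn, ',', 2) = 'D'
    Detail OrderNum:         Field(lnk.RecIn, ',', 1)
    Detail RecordNum:        (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1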
6. On the Input>Partitioning tab of the Join stage specify that the records are to be
hashed and sort partitioned by OrderNum for both the Header and Detail links.
8. In the Data Set target stage, name the file OrdersCombined.ds. Note the partitioning
icons in front of the Join stage.
10. View the data using the Data Set Management tool available in the Designer Tools
menu. You should see 9 records in each group.
11. Next view the data in each partition. Notice that all the records in a group are in a
single partition; no group is spread across multiple partitions.
12. Save your job as partCombineHeaderDetail2. Now change how the partitioning is
done in the Join stage. Choose Entire for the Header link and SAME for the Detail
link. Turn off the sorts.
14. Recompile and run your job. View the data using the Data Set management tool.
Notice that the groups of data are now spread across multiple partitions. This should
yield improved performance.
2. In DataStage Designer, click Import > Table Definitions and then select Cobol File
Definitions.
5. Take the default location in the repository. Select both DETAIL and HEADER
definitions. Click Import.
6. Find the newly imported table definitions in the repository. Open one of them. Go to
the Layout tab and explore the different settings of the Parallel, COBOL, and Standard
views. Here is what the COBOL view looks like.
7. Open up the HEADER table definition. Click on the Columns tab. Open the
extended properties (Edit Column Meta Data) window for the column ORDDATE
(double click on the column number). Set the Date Format field to CCYY-MM-DD.
This is to allow dates to be displayed correctly using this mask.
8. Remove the Level number and then click Apply and close the table definition.
2. Open the Orders CFF stage. On the File options tab, select the file to be read,
OrdersCD.txt.
3. Click the arrow at the bottom to move to the next Fast Path page, that is, the
Records tab. Remove the check from the Single record box.
5. Click Load. Select all the columns from the HEADER Table Definition.
7. Click the icon at the bottom left of the Records tab to add a new record type.
Complete the process to define and load the Table Definition for the DETAIL record
type.
8. Select the HEADER record type and then click the Master Button (rightmost icon at
the bottom of the records tab). This will make the HEADER record type the master.
9. Click the arrow at the bottom to move to the next Fast Path page, that is, the
Records ID tab. Define the Record ID constraint for the HEADER record type.
10. Define the Record ID constraint for the DETAIL record type.
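As noted later in this lab, record type ‘A’ identifies HEADER records and ‘B’ identifies
DETAIL records, so the constraints take the following form (the record-type column
name here is an assumption; use the column from your imported definitions):

    HEADER:  RECTYPE = 'A'
    DETAIL:  RECTYPE = 'B'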
11. Move to the next Fast path page, that is, the Output > Selection tab. Select all
columns from both record types.
12. Move to the last Fast path page, that is, the Output > Constraint tab. Click the
Default button to add the default output constraint. This will ensure that only records
of these two record types go out the output link.
13. Click the Stage tab, then the Record options tab. Specify Text for Data format. Select
the ASCII character set. Type in the vertical bar (|) for the record delimiter. If you
open the OrdersCD.txt file on the DataStage server, you will notice that all the
records are bunched up one after another with a vertical bar separating them. There
is no CR or LF character. This is typical of COBOL data from the mainframe, and the
CFF stage is designed to handle it.
14. Click the Layout tab. Select the COBOL layout option. View the COBOL layouts for
each of the record types. Shown below is the HEADER COBOL format.
16. Move to the Output tab and click View Data. Notice that all the columns from all the
record types are displayed with data in them. However, the data in the columns that
are mapped from the DETAIL record is invalid when viewing a record with record
type ‘A’ (a HEADER record). But for records with record type ‘B’ (DETAIL records),
all columns contain valid data. In effect, we have just propagated the HEADER record
information to all its associated DETAIL records.
17. Set up the target Sequential File stage to output all the records to a file named
CFFOrdersCombined.txt, comma-separated and with no quotes. In the Columns tab,
change the SQL type to Char for the column ORDDATE. Otherwise, you will get a
conversion error during execution.
18. Compile and run your job. You will get a warning about reaching EOF (End-of-File)
without a record delimiter. This is normal and is caused by the last record in the file.
These warnings do not affect the correct processing of the data.
19. To verify the result, view data on the target Sequential File stage.
3. The dataset accessed by the target dataset stage should be named Customers.ds.
6. Save the metadata of Customers.ds to a table definition for use in the next section.
2. Edit the source stage to read from the Customers.ds dataset. Don’t forget to load
the table definition saved from last job.
4. Edit the Aggregator stage. Group by Zip. Count the rows in each group of zip
codes. You will add this value to each Customer record. Change Grouping Method
to SORT.
5. Output the Zip column and the new ZipCount column from the Aggregator.
7. Edit the Join stage. Specify an inner join on the Zip column.
8. On the Partitioning tab, hash and sort by Zip for both input links to the Join.
9. Write all the rows of the customer record with the added ZipCount column to an
output sequential file named CustomersCount.txt.
10. Your job now looks like this. The hash partitioning and sorts on each of the three
links going into the Aggregator and Join stages are what DataStage would have done
implicitly if Auto had been selected.
12. Examine the score. Are there any inserted tsort operators? What operators are
combined? In addition to the operators corresponding to the Aggregator, Join, and
Copy stages, what other operators are there in the score?
14. Remove the Hash partitioning and in-stage sorts, going back to Auto.
16. Examine the score. Compare with the other score in terms of number of operators,
number of processes, number of sorts, hash partitioners, etc.
2. Optimize your job by moving the hash and sort to the Copy stage. Specify SAME
partitioning for the links going into the Aggregator and Join.
4. View the score. Compare with the scores from the previous jobs. Has the number
of sorts been reduced? Have the numbers of operators and processes been
reduced?
More optimization
In this task, we will push the partitioning and sorting back even further. We will partition
and sort when the dataset is generated and loaded.
2. On the Partitioning tab of the target Customers dataset, Hash partition and sort by
the Zip. Compile and run to generate a new Customers.ds
4. Change the partitioning in the Copy stage to SAME and remove the sort, since the
data is already sorted coming out of the dataset.
5. Compile and run and view the score. Notice here the inserted tsort operators.
Although the data in the dataset is sorted, DataStage doesn’t know this and still
inserts the tsort operators.
6. Open up the job parameters window, and add the environment variable named
$APT_NO_SORT_INSERTION (Disable sort insertion) as a job parameter. When
set, this will cause the Framework to just check that the data is sorted as it is
supposed to be. It will not add tsort operators.
8. Compare with the results of running the job with $APT_NO_SORT_INSERTION turned off.
2. In the source Sequential stage read data from the CityStateZip.txt file. The
CityStateZip.txt file contains customer address information. In this job, you will
generate a report that lists each state followed by a count of the addresses in the
state and a list of the zip codes. Here’s a sampling of the source data and the
column names used.
4. In the first Sort stage, set Hashing by State as the Partitioning method. We need to
have all the rows of a given state in the same partition in order to get a single count
for the state. The hash should be case insensitive.
5. In the first Sort stage, sort by State in ascending order. The sort should be case
insensitive. Turn off Stable Sort since we don’t need it.
6. In the second Sort stage, set the “Create Cluster Key Change Column” option.
Specify that the data is already sorted as you specified in the previous Sort stage.
7. In the third Sort stage, specify that you want to sort by the cluster key change column
within the State groups. This will place the row with the cluster key change column
of 1 at the end of each State group.
8. Open up the Transformer. Define the stage variables in the order shown.
NewState: Char(1) flag initialized with “Y” indicates that a new state group is being
processed.
AddZip: VarChar(255) list of zip codes. Initialize it with an empty string. Lists the
zip codes processed in each group. Set it to the currently read zip code when a new
state is being processed.
10. For the StateCount link, there are three target columns. The State value comes from
the input. The other two target columns get their values from the Counter and
AddZip stage variables. Define a Constraint for the StateCount link. It should write
out only one record per State group, when the state count and zip list are complete
for the group, i.e., when the clusterKeyChange column equals 1. (A sketch of possible
stage variable derivations and the constraint follows this step.)
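If you need a starting point, here is one plausible arrangement of the stage variables and
the constraint. PrevState is an extra helper variable not named in the lab, and the link
name (lnk) is an assumption:

    NewState:  If lnk.State <> PrevState Then 'Y' Else 'N'
    Counter:   If NewState = 'Y' Then 1 Else Counter + 1
    AddZip:    If NewState = 'Y' Then lnk.Zip Else AddZip : ' ' : lnk.Zip
    PrevState: lnk.State

    StateCount link constraint:  lnk.clusterKeyChange = 1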
11. Set the target Sequential File stage to write to a file without quotes.
12. Compile and run. Verify the results. (Your ordering may be different.)
2. Read records from the Customers.txt file. Since the Customers.ds table definition is
the same as the file, you can use it.
3. In the Copy stage, pass all columns through to the CUSTS stage. Pass just the
CustID column through to the Column Generator.
4. Edit the CUSTS DB2 Connector stage. Connect to the DB2 instance and the
SAMPLE database. Click the “Test” button to make sure you can connect.
5. Write mode is INSERT. The table your job will create is named CUSTS. Select
REPLACE as the Table action with the statement generation and error handling as
shown.
6. Edit the Column Generator stage. Generate a new column named GroupByCol,
Char(1). Set the generation algorithm to just create a single letter ‘A’ using the
extended column properties window.
7. Edit the Aggregator stage. Group by the GroupByCol. Count the number of rows in
the group and send the results to the NumRecs column. Specify Hash as the
Aggregation method, since the data doesn’t need sorting.
11. Check the results. The log file should contain the number of records read from the
source file and written to the target table, unless the database rejects some rows.
2. From the Processing folder add two Surrogate Key Generator stages to the canvas.
Name them as shown. Also add the two DB2 Connector stages with links to the
Surrogate Key Generator stages.
3. Open up the PRODDIM Connector stage. Specify the Connection and Usage
properties. Choose to have the stage generate SQL.
4. Click the Columns tab. Load the column definitions into the stage. The table
definition is stored in the repository at “Table Definitions DB2 sample”.
5. Open up the STOREDIM Connector stage. Specify the Connection and Usage
properties. Choose to have the stage generate SQL. Load the column definitions
into the stage.
6. Open the ProdDim_SKG_Create stage properties. The Key Source Update Action is
Create and Update. Select PRODSK for the input column name. Specify a path to a
source key file named proddim as shown.
7. Open the StoreDim_SKG_Update stage properties. The Key Source Update Action
is Create and Update. Select STORESK for the input column name. Specify a path
to a source key file named storedim as shown.
8. Compile and run your job. Check the job log for errors.
9. Verify that the files have been created and that they are not empty. If you encounter
errors and need to run the job again, delete the state files first.
2. ***Important*** Open the Job Properties window and make sure that Runtime
Column Propagation is not enabled. Otherwise, you will get runtime errors when
source columns such as StoreID are written to the PRODDIM_upd link.
3. Add the stages and links as shown. Notice that the link from the PRODDIM
Connector stage to the Slowly Changing Dimension stage in the middle is a lookup
reference link.
4. Edit the SaleDetail stage. Read data from the SaleDetail.txt file. Import the table
definition. The column definitions are shown below. Correct them if necessary.
6. Edit the PRODDIM reference link stage. Set the Generate SQL property to Yes.
Click View Data.
7. On the Columns tab, load the column definitions. Select SKU, which is the business
key, as the lookup key field.
8. Open the PROD_SCD stage. On the Stage > General tab, select SaleDetailOut as
the output link.
9. Move to the next Fast Path page (using the arrow key at the bottom left), that is, the
Input>Lookup tab. Specify the column matching to use to lookup a matching
dimension row. Here we want to retrieve the row with the matching PRODDIM
business (natural) key. Also select the purpose codes for each of the dimension
table columns, as shown.
10. Move to the next Fast Path page, that is, the Input>Surrogate Key tab. Select the
surrogate key source file (proddim). Specify the surrogate key initial value, 1. Also
specify how many surrogate key values to retrieve from the state file in a single block
read. Specifying a block size of 1 ensures that there will be no gaps in the key
usage.
11. Move to the next Fast Path page, that is, the Output>Dim Update tab. Here specify
how to create a new dimension record and how to expire a dimension record that
has Type 2 columns in it. Be sure Output name is PRODDIM_Upd, that is, the name
of the dimension table update link. Use the Expression Editor to specify values and
functions.
12. Move to the next Fast Path page, namely Output>Output Map tab. Here the
PRODDIM surrogate key field (PRODSK) replaces the business key field in the
source file.
14. Open up the PRODDIM_Upd stage. Use Update then Insert to write to the target
SUPER.PRODDIM table. Let the stage generate the SQL.
15. In the columns tab, make sure the PRODSK is the only column set as the key.
17. Compile. Before you run the job, view the data from the SaleDetail.txt file and the
dimension table. This way you can see clearly what happens when you execute the
job.
-----------------
18. Run the job. Check the job log for errors. View the data in PRODDIM to see if the
table was updated properly. SKU 3 doesn’t change. SKUs 1 and 2 are new inserts.
SKUs 4 and 5 are new Type 2 updates; their original records (PRODSK=2 and 10)
are preserved as historical records (CURR=N).
19. View the data in the target dataset. A1111 and A1112 are assigned new surrogate
key values since they are inserts. A1113 was not changed, so it has the same
surrogate key value. A1114 and A1115 are new Type 2 updates. They received
new surrogate key values and are inserted into the target.
20. If you want to rerun your job, drop the three star schema tables and then re-run the
SQL file that creates the tables. Delete the surrogate key source files and then re-
run the job that creates and updates them.
2. Edit the SaleDetailOut DataSet stage. Extract data from the SaleDetailOut.ds file
that you created in the previous job. To get the Table Definition go to the Columns
tab of the target DataSet stage in your previous job. Click the Save button to save
the columns as a new Table Definition.
3. After you finish editing the stage, verify that you can view the data.
4. Edit the STOREDIM stage. Load column definitions. Select the ID column as the
lookup key. Verify that you can view the data.
6. Specify the output link, SaleDetailOut2, on the first Fast Path page.
7. Move to the next Fast path page, that is, the Input > Lookup tab. Specify the lookup
condition and purposes.
8. Move to the next Fast Path page, that is the Input > Surrogate Key tab. Select
storedim as the source key file to be used. Specify the other information as shown.
9. Move to the next Fast Path page, that is the Output > Dim Update tab. Specify the
mappings and derivations as shown.
10. Move to the next Fast Path page, that is the Output > Output Map tab. Here the
STORE surrogate key replaces the Store business key from the source file.
11. Edit the STOREDIM_upd stage. Be sure to qualify the table name by the schema
name, as shown.
12. Make sure STORESK is the only column set as the key.
13. Edit the FACTTBL stage. Be sure to qualify the table name by the schema name.
14. Compile. Before you run the job, view the data from the SaleDetailOut.ds file and
the STOREDIM dimension table. This way you can see clearly what happens when
you execute the job.
-------------------
15. Run the job. Check the job log for errors. View the data in the updated STOREDIM
table and in the FACTTBL.
-------------------------- STOREDIM
-------------------------- FACTTBL
2. Set up the Sequential File stage to read the file Employees.txt. Load the Columns
from the table definition of DB2 table EMPLOYEE in the repository.
3. Set up the DB2 Connector stage to write (insert) to database SAMPLE and table
DB2INST1.EMPLOYEE.
4. In the DB2 Connector properties, click on the reject link on the graph and edit the
Reject tab properties.
8. Your job execution should abort, since Employees.txt contains duplicate rows and
the DB2 Connector options do not tell the job to reject these rows.
9. Go to the DB2 Connector properties > Reject tab and select the SQL Error checkbox.
11. You should see the job finish successfully. This means your records were passed to
the output, and the rows that generate SQL errors will be in the reject file.
12. Open SQL_Error.txt and verify that it contains the rows that already existed in the
EMPLOYEE table.
13. Open the DB2 Connector properties again, click on the reject link and edit the Reject
tab options as below (Abort after property = 3):
15. You should see your job aborted since Employees.txt contains more than 3 duplicate
rows.
2. Set up the Sequential File stage to read the file Parent_Child_Records.txt and read
each record in as a single column.
3. Use the Transformer stage to split the records. Hint: use constraints to examine the
record type indicator and parse the output record accordingly. As you have done
in an earlier exercise, use the Field function to parse the record. Also, load the table
definitions for both the Child and Parent output links from the Table Definition folder
in the repository.
4. To set up the DB2 Connector properties, open the stage and then click on the
connector icon. Set up the credentials as shown. Also select “All records” for record
ordering.
5. Click on the Parent link and set up the stage to insert records into the “project” table.
Let the stage generate the SQL. Be sure to set the Table action to Append.
7. On the Link Ordering tab, make sure the parent is the first link as the records from
the first link will be processed first.
8. One other thing: since the job runs in partitioned mode, it is important to set the
partitioning of each input link to hash on PROJNO. This ensures that all records with
the same project number land in the same partition, so that parent records and their
child records are processed together.
10. Your job execution should contain no errors. You should see a total of 4 records
inserted into table PROJECT and 4 records inserted into table PROJACT. The log
messages read: “[Input link n] Number of rows inserted: 2”.
5. On the Values tab, specify a name for the Value File that holds all the job parameters
within this Parameter Set.
2. Open up your Job Properties and select the Parameters tab. Click Add Parameter
Set. Select your SourceTargetData parameter set and click OK.
4. Configure the source Sequential File stage properties using the parameters included
in the SourceTargetData parameter set. Also, set the option “First Line is Column
Names” as True.
7. Open the transformer stage. Go to edit constraints by clicking on the chain icon and
create a constraint that selects only records with a Special_Handling_Code = 1.
Close the stage editor.
8. In the Transformer stage, map all the columns from the source link to the target link
by selecting all the source columns and dragging them to the output link. The
Transformer editor should appear as shown below:
9. Configure the properties for the target Sequential File stage. Use the TargetFile
parameter included in the SourceTargetData parameter set to define the File
property as shown. Also, set the option First Line is Column Names as True.
11. View the data in the target and verify that there are only records having
Special_Handling_Code = 1.
3. Open up the Job Properties window and enable RCP for all links of your job. When
closing the Job Properties, answer YES to let Designer turn on RCP for all the
links already in the job.
5. On the Layout tab, select the Parallel button to display the OSH schema. Click the
right mouse button to save this as a file called Selling_Group_Mapping.osh. Note
that this file is saved on the client machine. Normally, you would have to transfer this
file to the DataStage server. We have already done this step for you.
6. Open up the schema file to view its contents. The “{prefix=2}” properties must be
removed. The version on the server does not contain them.
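For example, a variable-length column in the saved schema might look like the first line
below (the column name is illustrative); the second line shows the same column after the
prefix property is removed:

    Selling_Group_Desc: string[max=30] {prefix=2};
    Selling_Group_Desc: string[max=30];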
7. Open up your Source Sequential stage to the Properties tab. Add the Schema file
option. Then select the Selling_Group_Mapping.osh schema file.
9. In the Transformer, clear all column derivations going into the target columns (don’t
delete the output columns!). Also remove any constraints that are defined. If
you don’t remove the constraints, the job won’t compile, because the constraint
references an unknown input column.
10. Compile and run your job. Verify that the data is written correctly. That is, all
records are now written, since we no longer have a constraint.
11. If you need the constraint, then try defining the constraint in the Transformer stage
again. In addition, go to the Columns tab of the source Sequential File stage and
import just the Special_Handling_Code column from the Table Definition. Compile
and run your job. This time you should only have records that meet the constraint.
6. Open the Transformer. If you have a constraint left from the last exercise,
remove it. Map the Distribution_Channel_Description column across the
Transformer. Define a derivation for the output column that turns the data to
uppercase.
8. View the data in the file (not using DataStage View Data). Notice that the
Distribution_Channel_Description column data has been turned to uppercase.
All other columns were just passed through untouched.
5. Click the right mouse button over the container and click Open.
6. Open up the Transformer and note that it applies the Upcase function to a
column named Distribution_Channel_Description. Close the Transformer and
the Container without saving it.
7. Add a source Sequential File stage, Copy stage, and a target Peek stage as
shown. Name the stages and links as shown.
8. Edit the Items Sequential stage to read from the Items.txt sequential file. You
should already have a Table Definition, but if you don’t you can always import it.
9. Verify that you can view the data.
10. In the Copy stage, move all columns through. On the Columns tab, change the
name of the second column to Distribution_Channel_Description so that it
matches the column in the Shared Container Transformer that the Upcase
function is applied to.
11. Double-click on the Shared Container. On the Inputs tab, map the input link to
the Selling_Group_Mapping container link.
12. On the Outputs tab, map the output link to the Selling_Group_Mapping_Copy
container link.
In this task, you will create a function that checks for key words in a string that is passed
to it. It returns “Y” if it finds a key word, else it returns “N”.
1. In gedit or vi open the file named keyWords.cpp in the directory. This function
returns “Y” if it finds any of a list of words.
2. Compile your keyWords.cpp file into an object file by logging in to the Information
Server system as “dsadm”, changing to the /DS_Advanced directory, and running:
g++ -c keyWords.cpp
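The lab supplies this file for you, but for orientation, a minimal sketch of such a function
might look like the following. The key-word list is an assumption, and the return type is
char * to match the routine definition you create in the next step:

    #include <string.h>

    // Return "Y" if the input string contains any key word, else "N".
    extern "C" char* keyWords(char* in)
    {
        static char yes[] = "Y";
        static char no[]  = "N";
        const char* words[] = { "SALE", "DISCOUNT", "FREE" };  // assumed list
        if (in != 0)
            for (unsigned i = 0; i < sizeof(words) / sizeof(words[0]); i++)
                if (strstr(in, words[i]) != 0)
                    return yes;
        return no;
    }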
4. In DataStage, click your right mouse button over the Jobs folder and then click
New>Parallel Routine, then create a new External Function routine named
keyWords. Save it in the Jobs folder.
Create an Object type External function. Specify the return type (char *). Specify the
path to the object file.
5. On the Arguments tab, specify the input argument to your function. It should match
the type expected by the function you defined.
2. Create a job parameter named inField that can be used to pass in a string value that
you can apply your routine to.
3. In the Row Generator, define a single column. (It can be anything you want. It won’t
be used.) On the Properties tab, specify that you want to generate a single row.
4. Define a VarChar output field named Result in your Transformer stage. Define a
derivation that returns “Key word found” or “Key word not found” in the Result field
depending on whether the key word was found in the input string. Also define a field
to store the input string from the job parameter.
2. Click the right mouse button over a folder in the Repository and click
New>Other>Parallel Stage Type (Wrapped). On the General tab, enter the name
and command (the UNIX list files command): ls
4. On the Properties tab, define an optional property named “Dir” that is to be passed
the path to the directory to be listed. The Conversion type must be set to “Value
Only”, because we only want the value to be passed to the wrapper, not the property
name followed by the value.
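In other words, with Value Only the command executed at run time is effectively the first
line below; a name-value conversion would produce something like the second line,
which ls does not understand (the exact -Dir form is illustrative):

    ls /DS_Advanced
    ls -Dir /DS_Advanced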
6. Create a new job named wrapperGenFileList. Add your new Wrapped stage with an
output link to a Sequential File stage.
7. Open the Wrapped Stage. On the Output>Columns tab, load your Table Definition
that defines the output if it is not already there.
8. On the Stage>Properties tab, add the Dir property and then specify the directory
/DS_Advanced to be listed.
1. Create and save a Table Definition named InRec_TIA defining the input.
2. Create and save a Table Definition named OutRec_TIA defining the output.
4. On the Properties page, define a required property named Exchange. Its default is 1
and its Conversion type is the -Name Value type.
5. On the Build>Interfaces Input tab, define the input. Call the port InRec. Specify Auto
Read. Select the input interface Table Definition you defined earlier.
6. On the Build>Interfaces Output tab, define the output. Call the port OutRec. Specify
Auto Write. Select the output interface Table Definition you defined earlier.
7. On the Transfer tab, define an auto transfer with no separate transfer (false).
8. On the Logic Definitions tab, define a variable named beforeTaxAmount. You will
use this to define the base amount before tax is added. Also define a variable
named tax to store the calculated tax.
9. On the Per-Record tab, define the code that calculates the Amount. Be sure to
multiply the final result by the Exchange Property.
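A minimal sketch of the Per-Record logic, assuming the input columns are named
Quantity, Price, and TaxRate (as in the next lab's source file), that columns are
referenced through the InRec/OutRec port names, and that the Exchange property is
visible by name:

    // Compute the base amount, add the tax, then apply the exchange rate.
    beforeTaxAmount = InRec.Quantity * InRec.Price;
    tax = beforeTaxAmount * InRec.TaxRate;
    OutRec.Amount = (beforeTaxAmount + tax) * Exchange;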
10. Click Generate. If the generation fails, fix any errors and then regenerate.
1. Import a Table Definition for the source file order_items.txt. The column names
should be as follows: OrderID, ItemNumber, Quantity, Price, TaxRate. Use float
type for Price and TaxRate.
2. Create a new job that reads the source file, passes the rows to your new Build stage,
and then writes the rows to a Sequential File stage.
3. Use the Copy stage to modify the input column names and types to match the input
columns expected by the Build stage.
4. In the Build stage, the output link should include all columns that are in the source
stage plus the Amount column.
6. Run and test your job. Be sure to test your Exchange Property by trying out different
exchange rates.
3. In Director, click Tools > New Monitor to open a Monitor on the job.
4. Click the right mouse button over the window. Set or verify that the Monitor is
showing instances and percentage of CPU.
5. Note these are the results when this job was run on a particular virtual machine.
Your results may differ significantly.
• Correlate each stage in the job with the stages listed in the first column.
8. Open a Job monitor and compare the performance results. Clearly, in this example
the performance has improved.
2. Open up the Job Properties window. Click the Execution tab. Select “Record job
performance data.”
6. Click Stages and then de-select everything. One-by-one, select a stage and
examine its throughput. Shown here is the chart for the Aggregator Sort.
8. Now set the same job property for perForkJoin3, then recompile and run.
9. Open the Performance Analysis tool and view the results. Compare the results with
the un-optimized version.
2. Open up Job Properties and click on the Execution tab. Select Record job
performance data (if it hasn’t been selected already).
3. Change the two Data Set target stages’ file property to write to the correct directory.
4. Compile and run your job. Verify in Director that it runs to successful completion.
6. Open the Charts folder and select Job Timeline (the default chart).
7. Open the Partitions folder. Deselect one of the Partitions. Notice that the
corresponding tab disappears on the chart. Reselect the partition.
8. Open the Stages folder. Select just the first Generator, the Sort, and the RemDup
stages.
9. Click on the black bars to the right of the stages to display the phases of each
process.
10. Open the Phases folder. Select just the runLocally phase.
11. Open the Filters tab. Deselect each box one at a time and examine the effect on the
chart. Shown below is the effect of deselecting the Hide Startup Phases box.
12. Open up the Charts folder. Examine each chart in the Job Timing, Record
Throughput, CPU Utilization, Memory Utilization, and Machine Utilization folders.
2. Modify the job as shown below. Move the output end of the Orders link to the
added Column Import stage. Remove the Join stage and its two input links, and drag
the input side of the OrdersCombined DataSet stage link to the Transformer stage.
Draw a link from the Column Import stage to the Transformer.
3. Edit the Column Import stage. On the Stage Advanced tab, set the stage to run
sequentially. This is necessary to preserve the ordering and groups of records going
into the Transformer stage.
4. On the Stage Properties tab, import the OrderNum and RecType columns. Set the
“Keep Import Column” property to True, so that the total record is also passed
through.
5. On the Output Columns tab, specify the metadata for the imported columns. Make
sure the RecType field is VarChar(1) rather than Char(1); otherwise, it won’t import
correctly.
7. In the main window of the Transformer, define two stage variables to store the Name
and OrderDate from the Header records. To simplify the derivation of the OrderDate
field, define it as a VarChar(10) instead of a Date type. Define the derivations for
these stage variables, using the Field function to parse the columns from the Header
record. Write an output record only when the input is a Detail record. Also drag over
the RecIn column to help verify the results when you run the job.
1. Browse the folder Jobs -> DataStage Advanced which contains the jobs you will use
for this lab.
2. Open the job Populate_Orders and edit the Row Generator stage “Orders_gen”. Set
Number of Records = 2,000,000; these rows will be generated into the table
db2inst1.orders in the SAMPLE database. The orders table will be used as a source
table in the following exercises.
3. Compile and run the job. Verify that the execution has completed successfully. You
should have now populated the ORDERS table.
4. Open and explore the job JoinOrdEmp. This job performs a join between the orders
with AMOUNT>100 (filtered by a Transformer stage) and the employee who
managed each order. The result is then stored in a Data Set.
5. Compile and run the job. In the Director client verify that the execution has
completed successfully.
6. Select the Optimize button in the bar as shown to open the Optimizer interface.
7. Select the option Push processing to database sources and press the Optimize
button. The optimizer will then attempt to push the processing of the Transformer
and Join stages into the source DB2 Connector, if possible.
8. Explore the Compare tab to see a comparison between the root job and the
optimized job. Notice that the two source DB2 Connector stages, the Join stage, and
the Transformer stage in the root job have been replaced by a single DB2 Connector
stage in the optimized job.
9. Explore the Logs tab, which contains the details of the changes made by the
optimizer in defining the optimized job. From the messages you can understand
exactly what the optimizer has accomplished: the identification of the patterns of
stages suitable for optimization, the impact on partitioning, and the query
definitions that allow pushing the processing, in this case, to the source database.
10. Save the optimized job by accepting the default proposed job name
Optimized1OfJoinOrdEmp. Close the Optimizer.
11. Open Optimized1OfJoinOrdEmp (if it is not already open) and expand the DB2
Connector stage properties. Notice the SELECT statement that the optimizer has
built to implement the same logic previously implemented by the two DB2
Connector, Transformer, and Join stages. If you are SQL curious, simply copy
and paste the SQL statement into Notepad for more detailed examination.
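The generated statement is roughly of the following shape. This is illustrative only; the
join key and the selected columns depend on your table definitions:

    SELECT O.ORDER_ID, O.AMOUNT, E.FIRSTNME, E.LASTNAME
    FROM DB2INST1.ORDERS O
    INNER JOIN DB2INST1.EMPLOYEE E
        ON O.EMPNO = E.EMPNO
    WHERE O.AMOUNT > 100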
Note: the figures appearing in the following analysis (timing measures, throughput, etc.)
may differ from the ones you get during the exercise. In any case, follow the procedure
and adapt the comparison of results to your own case.
1. Use the Director to compare the execution time of the root versus the optimized job.
Notice that pushing the operations implemented by the Join and Transformer in the
root job to the source database has improved performance.
2. Looking at the job monitors for the two jobs (you can open both of them), you can
see that the optimized job processed fewer records than the root job. This is
because in the root job the ORDERS DB2 Connector retrieved 2 million rows from
the database, which were filtered afterwards by the Transformer stage’s constraint.
In the optimized job, the source DB2 Connector directly retrieved only the records
satisfying the SQL query, which implements the root job’s Transformer constraint
on AMOUNT.
3. Refer also to the job logs in the Director to understand the different execution steps
performed by the two jobs. Compare the startup and production run times, which
help you roughly understand how the elapsed time is composed and the benefit you
can achieve.
JoinOrdEmp
Optimization1OfJoinOrdEmp
4. To understand the behaviors of the root and optimized jobs in more detail, open the
Performance Analysis tool to compare their resource usage and record
throughput.
5. For the JoinOrdEmp job, you can filter the stages to consider during the analysis,
selecting only the ORDERS and EMPLOYEES DB2 Connector stages and the
output Data Set stage.
6. Now examine the Record Throughput Outputs for all the partitions. Cross-referencing
this chart with the Director’s logs, notice that the output stage begins to receive
records some seconds after the job starts. Hover your mouse over these lines to see
exactly when records start arriving.
7. Repeating the same analysis on the Optimized1OfJoinOrdEmp job, you can see that
the output stage receives records from around the same time after the job starts,
which is comparable with the root job.
8. The significant difference between them is not how fast the jobs have records
available for the target loading, but their record throughputs. You can estimate the
approximate slopes of the output stages’ throughput curves for both charts,
considering all the partitions, to get comparable figures. Then try to justify the
comparison results.
- For JoinOrdEmp:
- For Optimized1OfJoinOrdEmp:
In Optimized1OfJoinOrdEmp, the data coming from the source DB2 Connector
stage has already been processed by the source database engine, while in
JoinOrdEmp it must still go through the Transformer and Join stages; hence the
resulting record throughputs cannot be similar.
9. Open the Memory Usage Density Page Ins charts for the two jobs and notice that the
root job is more memory intensive than its optimized version. Notice that in
JoinOrdEmp the maximum memory usage (9000 pages) is mainly due to processing
the orders records.
JoinOrdEmp
Optimized1OfJoinOrdEmp
10. Optional: perform a similar analysis considering the CPU and Disk utilizations.
Note: To perform a more detailed comparison between the root and optimized jobs,
or even to decide on the best optimized version for a job, there are also other
parameters to consider: the degree of source/target database concurrency, the
amount of system resources available for DataStage and the source/target
databases, the number of records to process, the database tuning level, etc.
1. Locate the job JoinOrdEmp in the Repository Window, right-click and select Find
where used -> Jobs.
3. To perform the reverse operation, exploring which job is the root job for the
Optimized1OfJoinOrdEmp job, locate it in the Repository window and select Find
dependencies -> Jobs.
4. The Repository Advanced Find window appears and displays the jobs dependent on
Optimized1OfJoinOrdEmp, in this case the root job JoinOrdEmp.
5. To remove the dependency between the optimized and root job, open
Optimized1OfJoinOrdEmp in the Designer and select Edit -> Job Properties.
6. In the Dependencies tab, right-click on the JoinOrdEmp entry and select Delete row.
In this way, Optimized1OfJoinOrdEmp will lose its relationship with the root job
JoinOrdEmp.
2. Replace the target Data Set stage with a DB2 Connector stage.
5. You can now optimize the job by distributing the processing to the source and target
DB2 Connector stages, then verify whether that is a worthwhile choice. Open the
Optimizer for the JoinOrdEmpTrg job, select the Push processing to database
sources and Push processing to database targets options, and press the Optimize
button.
Note: the source and target tables are all in the same SAMPLE database.
7. Another way in which this job can be optimized is to push all the processing to the
target database. This is possible because all the tables you are using in the root
job are in the same database. Now you want to understand whether this optimization
version performs better than Optimized1OfJoinOrdEmpTrg.
10. Notice that, as in the job Optimized1OfJoinOrdEmpTrg, part of the root job’s logic
(the Transformer constraint and the ORDERS DB2 Connector) has been implemented
within the source DB2 Connector.
14. Use the Director to compare their Elapsed Times and notice that the job with the
shortest execution time is Optimized2OfJoinOrdEmpTrg.
15. Following the same approach as seen for Lab 1, you can use the Performance
Analysis tool to explain the differences in performance between these three
jobs.
16. Notice that for the job Optimized2OfJoinOrdEmpTrg, no records at all were
processed by DataStage: all the operations were performed by the target
database in response to the SQL statement pushed down by the job. The job
Optimized1OfJoinOrdEmpTrg processes all the rows (1327629 rows) selected by the
SQL query in the source DB2 Connector stage, and then passes them to the target
DB2 Connector stage as shown below.
Optimized1OfJoinOrdEmpTrg
Optimized2OfJoinOrdEmpTrg
17. Open the JoinOrdEmpTrg job and modify the target DB2 Connector stage properties,
setting QS as the target database. Then save the job as JoinOrdEmpTrg2 and
compile it.
18. Open the Optimizer and notice that the Push all processing into the (target)
database option is no longer available. This is because the source and target tables
reside in different databases, so the job cannot be built using a single DB2 Connector
stage as was done for Optimized2OfJoinOrdEmpTrg.
19. Optional: optimize the JoinOrdEmpTrg2 job and analyze its performance, using the
Push processing to database sources and Push processing to database
targets optimization options.
1. Open the Populate_Orders job and edit the Row Generator stage, setting the
Number of Records = 100,000 as the number of records to be generated into the
target table “ORDERS”. Then compile and run the job.
2. Open and explore the job SalesReport. This job calculates the total order Amount
for the records in the ORDERS table and loads the result into the TOTORD table.
Note: the source and target tables are in the same database (SAMPLE).
4. Considering that your job performs a data reduction on the input records from the
ORDERS table (100,000 rows), generating a single output row, and also considering
that both tables are in the same database, you might try to push the data
reduction processing to the target database. To do that, select the optimization
options Push processing to database targets and Push data reduction
processing to database targets and click Optimize.
5. Select the Compare tab and notice that the two Transformer stages and the
Aggregator stage have been pushed to the target DB2 Connector stage, while the
source DB2 Connector appears the same as before the optimization.
7. Open the target DB2 Connector stage and look at the insert SQL statement
generated by the optimizer.
9. Compare the execution times, performance, and system resource usage of
the root and optimized jobs using the Director and Performance Analysis tools,
as you did for the previous labs.
Although the job design will be the same in both of these scenarios, you will see their
differences in terms of optimization options you can use and performance improvements
you can achieve.
You will also learn a way to explicitly condition the optimization process, excluding one
or more stages from the optimization patterns.
1. Open Populate_Orders and verify that Number of Records = 100,000 to be
generated into the target table “ORDERS”. Then compile and run the job in case
you don’t currently have that number of records in the ORDERS table.
Note: if at any moment you need to reload the original 10 records into the
“EMPLOYEES” table, you can simply compile and run the RestoreEmployees job.
3. Open the job OrdersReport and analyze the logic implemented by each stage. This
job calculates, for each order in the table ORDERS, the total amount of orders
summarized by employee and year. The aggregated values are then inserted into
the target table ORDER_REPORT, in which the Employee ID code is replaced by
the employee's first and last name via a lookup operation.
5. Analyzing the job, you can see that the first two stages following the source DB2
Connector stage meet the Balanced Optimization requirements (the Copy stage’s
multiple outputs, on the contrary, are not supported), so as a first optimization
attempt you can consider pushing the processing toward the source database.
Open the Optimizer and check only the Push processing to database sources
option. Then press the Optimize button.
6. Open the Compare tab and notice that only the Transformer and Sort_1 stages have
been pushed to the source database. The processing logic implemented by the fork-
join structure (i.e., the Copy, Aggregator, and Join stages) could not be pushed to the
source, so it has not been changed.
7. Explore the Logs tab and notice the WARNING messages. The second and
third messages explain why the stages composing the fork-join structure have
not been optimized.
8. Save the job as Optimized1OfOrdersReportSrc and open the source DB2 Connector
stage to see how the optimizer has converted the logic originally defined by the
Transformer and Sort_1 stages into a single SQL query.
10. As a second optimization attempt, you may choose to push the processing
toward the target database. Open the optimizer again for the OrdersReport job.
This time select the Push processing to database targets option, then press the
Optimize button.
11. Browse the Compare tab and notice that only the target side stages (the Lookup
stage and the last Transformer stage) have been pushed to the target database.
Save the job as Optimized1OfOrdersReportTrg. Close the optimizer window.
12. Open the target DB2 Connector stage to analyze the SQL query defined by the
optimizer, which implements the Lookup and Transformer2 root stages’ logic.
14. Open again the optimizer and select both the Push processing to database
sources and the Push processing to database targets options.
15. Compare the original and optimized versions and notice that the only part not pushed
to the database is the fork join; this version is a composition of the two previous
optimizations.
16. Save the job as Optimized1OfOrdersReport and analyze the SQL generated in the
source and target DB2 Connectors. Notice also that the fork-join structure could not
be optimized, for the same reason you encountered previously.
18. As you learned during Lab2, if the source and target tables are on the same database,
the best optimization can be achieved by pushing all the processing to the target
database. Try applying the same approach to the OrdersReport job as shown below.
19. Although you tried to push all the processing to the target database, the
optimizer ignored that option. You do not see a single DB2 Connector stage fed by a
Row Generator as in Lab2; instead, the optimized job is exactly the same as
Optimized1OfOrdersReport. This is again due to the fork-join structure, which
prevents full optimization.
20. Now compare the execution times, performance, and system resource usage of the
root and optimized jobs using the Director and the Performance Analysis tools, as
you did for Lab1.
1. Open and explore the OrdersReportTargetDB job. This job is similar to the
OrdersReport job, but the source and target tables are in two different databases,
as you can see by exploring the source and target DB2 Connector stages.
3. You can now optimize the job using the same approach you followed for the
OrdersReport job: generate different versions of the root job based on different
optimization options, then compare their performance and resource usage to
determine which optimized version best matches your requirements.
Open the optimizer and select Push processing to database sources, then
save the optimized job as Optimized1OrdersReportTargetDB.
5. Open the optimizer again for the OrdersReportTargetDB job and select Push
processing to database targets, then save the optimized job as
Optimized2OrdersReportTargetDB.
6. In the Logs tab, notice that the tables EMPLOYEES and ORDER_REPORT cannot
be part of the same optimization pattern, as they were for the OrdersReport job,
because they now reside in different databases.
8. Open the optimizer and select both the Push processing to database sources and
Push processing to database targets options.
11. When the source and target tables are on different databases, another possibility you
may want to consider is the bulk loading optimization option. With this option the
target DB2 Connector first bulk loads a temporary staging table created in the target
database during job execution. SQL statements then load the actual target table by
reading from the temporary staging table, so any remaining transformation occurs
directly in the target database after the high-performance bulk load. The pattern is
sketched below.
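Conceptually, the generated processing follows the pattern below. The staging table
name is illustrative only; the optimizer creates and names its own temporary table:

    -- Step 1 (done by the connector): bulk load rows into a temporary staging table.
    -- Step 2 (After SQL, conceptual): load the actual target from the staging table,
    --        running any remaining transformations inside the target database.
    INSERT INTO ORDER_REPORT (FIRST_NAME, LAST_NAME, ORDER_YEAR, TOTAL_AMOUNT)
    SELECT FIRST_NAME,
           LAST_NAME,
           ORDER_YEAR,
           TOTAL_AMOUNT
    FROM STG_ORDER_REPORT;            -- temporary staging table (name assumed)
    -- Step 3: remove the staging table once the load completes.
    DROP TABLE STG_ORDER_REPORT;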
12. Open the optimizer and select the Push processing to database sources, Push
processing to database targets, and Use bulk loading of target tables options.
14. Notice also the Before/After SQL statements that will be used to load the actual
target table using the bulk-loaded staging table as a source.
15. Enable the Auto commit mode option for the target DB2 Connector stage so that
the database commits the transactions automatically.
17. Now compare the execution times, performance, and system resource usage of the
root and optimized jobs using the Director and the Performance Analysis tools, as
you did for Lab1.
18. Notice that in this scenario Optimized4OfOrdersReportTargetDB, which uses the
bulk load option for the target database, does not perform better than the other
optimized versions; in fact, Optimized3OfOrdersReportTargetDB is the fastest
optimization.
19. Using the Performance Analysis tool, compare the performance of the
Optimized3OfOrdersReportTargetDB job against that of the Optimized1OfOrdersReport
job, which was generated using the same optimization options. Try to understand
the reasons for the difference in their elapsed times.
Tip: Look at the Record Throughput, and compare the elapsed time of the Lookup
stage for OrdersReportTargetDB with that of the target DB2 Connector stage for
OrdersReport.
Optimized3OfOrdersReportTargetDB
Optimized1OfOrdersReport
2. Select both the Push processing to database sources and Push processing
to database targets options.
3. You may now optimize the job, forcing the sort operation to be executed by
DataStage instead of being pushed into the database. To explicitly exclude the Sort
stage from the optimization, select the “Advanced Options” tab, set the value
Sort_1 for the property Name of a stage where optimization should stop, and
press the Optimize button.
4. Notice that the optimizer has excluded the Sort_1 stage from the optimization.
In this lab you will see a job that performs better when the processing is done entirely by
the DataStage engine rather than by the database engine.
1. Open the job Populate_Orders and edit the Row Generator stage to set the Number
of Records = 2,000,000.
2. Compile and run the job to populate the table ORDERS in the SAMPLE database.
3. Open the LoadProcessing job and analyze it. Notice that the Transformer stage
implements conversion functions and decision logic for some of the output
derivations.
5. Open the Optimizer and check the Push processing to database sources option.
Then press the Optimize button and save the optimized job as
Optimized1OfLoadProcessing.
6. Open the Optimized1OfLoadProcessing job and notice how the logic originally
implemented by the Transformer stage has been converted into a single SQL
statement in the source DB2 Connector stage.
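Conversion functions and decision logic from a Transformer typically map to CAST and
CASE expressions in the pushed-down query. This is a sketch only, under assumed
column names and thresholds:

    SELECT ORDER_ID,
           CAST(AMOUNT AS DECIMAL(10,2)) AS AMOUNT_DEC,   -- conversion function
           CASE WHEN AMOUNT > 1000 THEN 'LARGE'           -- decision logic
                ELSE 'SMALL'
           END AS ORDER_SIZE
    FROM ORDERS;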
8. Compare the execution times, performance, and system resource usage of the root
and optimized jobs using the Director and the Performance Analysis tools, as you
did for the previous labs. Notice that the optimized job is slower than the root
job.
9. Notice the Percent CPU Utilization charts. The LoadProcessing job requires
significant CPU activity when the Transformer stage starts processing the records
after they are made available by the source DB2 Connector stage (refer to the
Percent of Time in CPU chart), while the Optimized1OfLoadProcessing job starts
processing the records as soon as the source DB2 Connector connects to the
database. The peak CPU usage of the two jobs is comparable; however, looking at
the Throughput charts you can see that the LoadProcessing job performs faster.
Note: some of the following pictures show data for one partition only. When you
do these analyses you should consider all the partitions.
LoadProcessing
Optimized1OfLoadProcessing
2. In the Name to find box type sort* and in the Types to find list select Parallel Jobs.
2. Open the Last modification folder. Specify objects modified within the last week.
3. Open up the Where Used folder. Add the SUPER_PRODDIM Table Definition.
Change Name to find to an asterisk (*). Click Find. This reduces the list of found
items to those that use this Table Definition.
Generate a report
1. Click the number of matches to get the search result window again. Click File >
Generate Report to open a window from which you can generate a report describing
the results of your Find.
2. Click on the top link to view the report. This report is saved in the Repository where
it can be viewed by logging onto the Reporting Console.
3. After closing this window, click on the Reporting Console link. On the Reporting tab,
expand the Reports folder as shown. Click View Reports.
4. Select your report and then click View Report Result. This displays the report you
viewed earlier from Designer. By default, a Suite user only has permission to view
the report. A Suite administrator can give additional administrative functions to a
Suite user, including the ability to alter report properties, such as format.
2. Right-click the ForkJoin job in the list and then click “Show dependency
path to…”
3. Use the Zoom button to adjust the size of the dependency path so that it fits into the
window.
4. Hold the right mouse button down over a graphical object and drag to move the path
around.
5. Notice the “birds-eye” view box in the lower right-hand corner. This shows how the
path is situated on the canvas. You can move the path around by clicking to one side
of the image in the birds-eye view window, or by holding the right mouse button
down over the image and dragging it.
4. Change the name of the output link from the Copy stage to TF (for TargetFile).
8. In the Compare window, select your CreateSeqJobPartiton job in the Item Selection
window.
10. Click on firstLineColumnNames in the report. Notice that the stage opens to the
Properties tab where the change was made.
4. On the Columns tab, change the name of the Item column to ITEM_ZZZ, and
change its type and length to Char(33).
5. Click OK.
6. Right-click over your Table Definition copy and then select Compare Against.