You are on page 1of 28

IBM InfoSphere DataStage and DB2 pureXML, Part 1:

Integrate XML operational data into a data warehouse


Branislav Barnak (bbarnak@us.ibm.com)
Software Architect
IBM

20 August 2009

Amir Bar-or (baroram@us.ibm.com)


Senior Architect and Manager
IBM
Cynthia M. Saracco
Senior Solutions Architect
IBM
Paul Stanley (stanleyp@us.ibm.com)
Senior Architect
IBM
Learn how to integrate business-critical XML data into your data warehouse using IBM
InfoSphere DataStage 8.1 and DB2 9.5 pureXML. This two-part article series provides
step-by-step instructions for using pureXML as both a source and target data source for
InfoSphere DataStage jobs.
View more content in this series
It's no secret: XML is everywhere. Web-based applications, service-oriented architectures
(SOA), and industry-specific data exchange formats all rely heavily on XML to represent critical
business data. This growing use of XML is prompting many firms to explore how to integrate XML
operational data into their data warehouses and business intelligence environments.
In this article series, learn how IBM DB2 pureXML and InfoSphere DataStage can be used
together to manage operational XML data, transform XML data as needed, and populate a
relational-only or hybrid relational/XML data warehouse. (To keep things simple, subsequent
sections of this article series will refer to "pureXML" and "DataStage" instead of the full product
names.) Step-by-step instructions guide you through two common integration scenarios.
In the first part of this article series, explore how to build a data warehouse that contains both
relational and XML data using pureXML data as input. Additionally, learn important tips about
Copyright IBM Corporation 2009
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Trademarks
Page 1 of 28

developerWorks

ibm.com/developerWorks/

using the DataStage DB2 Connector. In the second part of this article series, you'll learn how to
transform flat file data into XML and store the data in a data warehouse with pureXML.

Introducing key DataStage and pureXML technologies


If you're not familiar with DataStage or pureXML, you'll find brief introductions to each in the
following sections. However, prior knowledge of data warehousing, ETL (extract-transform-load)
technology, and XML will help when reading this article series.

DataStage overview
IBM InfoSphere DataStage enables firms to extract, transform, and load (ETL) data from a
variety of sources into a data warehouse. Built-in support for multi-processor hardware enables
DataStage to provide high levels of scalability and work with large volumes of data efficiently.
Various "connectors" support a wide range of source and target data formats, including popular
IBM and OEM database management systems, ODBC data sources, third-party applications, realtime messages generated by messaging queuing software and Web services, and popular file
formats.
DataStage provides these capabilitiesand othersthrough a number of software components.
The scenarios in this article series use a subset of these. Specifically, the article scenarios use
DataStage Designer to construct ETL jobs. Each job consists of multiple "stages," and each stage
performs a given task. Examples of such tasks include reading information from a data source,
transforming input data using built-in functions, converting data from one type to another, and
so on. The article examples define stages that involve the DB2 Connector, XML operations, and
various processing operations. Two DataStage technologies are critical for the scenarios: the DB2
Connector and the XML Pack 2.0. This article describes these shortly.
After using DataStage Designer to create and compile your job, you'll use DataStage Director to
execute your work.

DataStage DB2 Connector


This article relies on the DB2 Connector for all DB2 database access. This is the main DB2
connectivity option in InfoSphere Information Server and is based on the Common Connector
Framework. The connector contains a superset of features found in the legacy DataStage
connectivity components and offers improved function and performance.
The DB2 Connector is available for Information Server 8.1 as a separate connector patch
installation. It supports local and remote connections to DB2 9.1 and 9.5 servers. (DB2 9.7 support
is planned for a future Information Server release.) The connector requires DB2 Client 9.1 Fix
Pack 6 (or later) or DB2 9.5 Fix Pack 3 (or later).
Key features of the DB2 Connector include:
Support for XML and LOB (large object) data types of any size. Two transfer mechanisms
are available: passing by value (inline) and by reference. When passing LOB or XML data
inline, the maximum size is determined by the DataStage engine. When data is passed by
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 2 of 28

ibm.com/developerWorks/

developerWorks

reference, the downstream (or receiving) stage uses the reference to get the data by invoking
the source connector directly. The downside of this method is that the LOB data can only
be moved from one connector to another and cannot be transformed in the job (as only the
reference is passed through the links). The connector requires that the ArraySize property to
be set to 1 when reading or writing XML columns.
Ability to execute standard SQL statements such as SELECT, INSERT, UPDATE, and DELETE,
combinations of these statements, user-defined SQL, as well as bulk load.
Support for parallel (multi-process) execution.
Support for DB2 DPF (Database Partitioning Feature). The connector can work with
partitioned databases in parallel or sequential mode. When running in parallel, a separate
DataStage process is allocated for each DB2 partition. The connector can read and write in
parallel and can load partitioned databases in parallel.
Support for guaranteed data delivery through XA (two-phase commit) transactions when
used together with the Distributed Transaction Stage (DTS). DTS provides a mechanism
for multiple DB2 databases to be updated as part of one XA transaction. Note that DTS is
available with a separate patch for DataStage 8.1.
Support for metadata import through the Common Connector Import Wizard available in the
DataStage Designer.

DB2 Connector usage contexts


The connector can be used in a source, target, or lookup context. In this article series, you use the
DB2 Connector in source and target contexts.
The source context refers to the case when there is an output link coming out of the connector
stage. In this context, the connector is used to retrieve data.
The target context refers to the case when there is an input link going into the connector stage,
which means that data is being passed to the connector. In this context, the connector is used
to execute INSERT, UPDATE, DELETE, UPSERT (combination of INSERT/UPDATE/DELETE), user-defined
SQL, or bulk load statements.
The lookup context refers to the case when the connector is used in conjunction with the lookup
stage. In this context, the data is both passed to and retrieved from the connector. In such a
scenario, the connector executes a SELECT statement with a WHERE clause.

DB2 Connector XML support


The scenarios in this article series focus on the DB2 Connector's support for XML data. As such,
it's worthwhile to learn more about XML support in the DB2 Connector.

DataStage representation of XML columns


DataStage 8.1 uses NVarChar or LongNVarChar data types to represent XML data. During
metadata import, the user can select which type to use for XML columns. The difference is that
XML columns represented as NVarChar columns are used to transfer data inline and are subject
to the DataStage engine's size requirements, while the long data type is used to treat the XML
columns as large objects (LOBs) and allows for them to be passed by reference.
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 3 of 28

developerWorks

ibm.com/developerWorks/

The connector supports:

XQueries
XQueries with embedded SQL
SQL/XML queries
Update and delete operations with XML columns

Source context: Reading XML columns


The connector can read a whole XML document or a fragment. The fragment can be a sequence
or an atomic value. When an XQuery is used, the connector does not support passing the XML
result by reference due to difficulties in constructing a LOB locater.
The following examples (Listings 1 - 6) contain several common SQL queries and XQueries that
the connector supports. These queries are based on samples included in popular DB2 pureXML
introductory articles. (See the Resources section for details.) Note that DB2 Connector support for
SQL, SQL/XML, and XQuery isn't limited to these examples.

Listing 1. Using simple SQL to retrieve full XML documents


select xml_col from table

If the XML column was represented with the LongVarChar data type in the DataStage job, the
connector would give the user an option to select whether this column will be passed to the next
stage as a reference or inline.

Listing 2. Using XQuery to retrieve full XML documents


xquery db2-fn:xmlcolumn('TABLE.COL_NAME')

Note: The result for Listing 2 cannot be passed by reference.

Listing 3. Using XQuery to extract XML document fragments


xquery db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client[Address/zip="10011"
or Address/city="San Jose"]/Address

Note: The result for Listing 3 cannot be passed by reference.

Listing 4. Using an XQuery FLWOR expression to extract XML document


fragments
xquery
for $y in db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client
where $y/Address/zip="10011" or $y/Address/city="San Jose"
return $y/email[1]/text()

Note: The result for Listing 4 cannot be passed by reference.

Listing 5. Using XQuery with embedded SQL to extract full XML documents
xquery db2-fn:sqlquery('select xml_col from table')

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 4 of 28

ibm.com/developerWorks/

developerWorks

Note: The result for Listing 5 cannot be passed by reference.

Listing 6. Using a SQL/XML query to extract XML document fragments


select xmlquery('$c/Client/Address'
passing contactinfo as "c")
from clients
where status = 'Gold'
and xmlexists('$c/Client/email' passing contactinfo as "c")

In Listing 6, the connector can either pass the result inline or by reference.

Prefix for expression columns


When the connector executes a query, it tries to match the column names the query returns with
the column names specified on the output link of the connector stage. However, XQuery column
names are represented with a number. The connector prefixes these with a string that can be
configured by setting the connector source context property Prefix for Expression Columns. The
default prefix used is EXPR. The XQuery column names then become EXPR1, EXPR2, and so on. This
is important to note as the connector would look for those columns in the list of link columns.

Target context: Inserting/updating/deleting XML columns


As mentioned earlier, the DB2 Connector can act as a target for incoming data and can write this
data to the appropriate DB2 database. The following examples (Listings 7 - 11) illustrate how the
DB2 Connector can insert, update, or delete information in XML columns.
Some of these examples incorporate a reserved DataStage keywordORCHESTRATEthat is used
to pass parameters to the statement. The column name specified after the ORCHESTRATE keyword
needs to match a column in the result set associated with the appropriate DataStage link.

Listing 7. Inserting an XML column


insert into table values(ORCHESTRATE.col1)

The column name col1 indicates the column as specified on the input link. This column can be
either NVarChar or LongNVarChar and can be passed inline or by reference.

Listing 8. Updating full XML document using a simple SQL update statement
update clients set
contactinfo=( xmlparse(document <email>newemail@someplace.com </email> ' ) )
where id = 3227

Listing 9. Updating part of an XML document value using a parameter


update xmlcustomer
set info = xmlquery('copy $new := $INFO
modify do replace value of $new/customerinfo/phone with $z
return $new' passing cast(ORCHESTRATE.col2 as varchar(15)) as "z")
where cid = ORCHESTRATE.col1

Listing 9 uses two parameters. The value of col2 replaces the existing /customerinfo/phone data in
the XML document; the value of col1 restricts the rows affected by the update.
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 5 of 28

developerWorks

ibm.com/developerWorks/

Listing 10. Deleting an XML record based on internal XML data filter
delete from clients
where xmlexists ('$c/Client/Address[zip="95116"]'
passing clients.contactinfo as "c")

Listing 11. Deleting an XML record using a parameter


delete from clients
where xmlexists ('$c/Client/Address[zip=$z]'
passing clients.contactinfo as "c", cast(ORCHESTRATE.col2 as varchar(15)) as z)

Listing 11 uses the value of col2 as a parameter to restrict DELETE operations based on the value of
an XML element (the zip or zip code element of a client's address).

DataStage XML Pack


This article series relies on the DataStage XML Pack 2.0 to provide key functions for both
scenarios. The XML Pack consists of three stages: XML Input, XML Output, and XML Transformer.
This series uses the XML Input and XML Output stages. The XML Transformer stage, which is
based on XSLT, isn't used in the article scenarios.

XML Input stage


XML Input converts XML data to "flat" relational tables. With XML Input, you can:

Extract data using standard XPath expressions


Use XML Schemas and optionally validate XML data
Support complex transformations using custom stylesheets
Support multiple output links using different data sets
Handle errors

Figure 1 illustrates a simple transformation done by the XML Input stage:

Figure 1. The XML Input stage transforms hierarchical XML data into "flat"
tables

XML Output stage


XML Output converts tabular data, such as relational tables, to XML hierarchical structures. With
XML Output, you can:
Generate XML using a subset of standard XPath expressions
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 6 of 28

ibm.com/developerWorks/

developerWorks

Generate multiple documents from the same input.


Support XML namespaces
Configure XML document generation in a variety of ways, such as one document per input
row
Generate output documents on disk or through an output link
Handle errors
Figure 2 illustrates a simple transformation done by XML Output:

Figure 2. The XML Output stage transforms "flat," tabular structures into XML
hierarchies

DB2 pureXML overview


DB2 provides firms with a common application programming interface and database management
platform for data modeled in tables as well as XML hierarchies. Figure 3 illustrates how DB2
enables firms to manage XML messages and documents alongside traditional relational data. With
pureXML, there is no need to "shred" or map business-critical XML data into a relational format.
Instead, XML data can be stored intact in its native hierarchical format. Applications can readily
access the XML data they need as well as integrate XML and relational data with little effort.

Figure 3. DB2's architecture offers relational and XML data management


support

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 7 of 28

developerWorks

ibm.com/developerWorks/

The scenarios in this article series rely on DB2's pureXML capabilities, which include optimized
storage of XML data in its native hierarchical format and support for querying XML data in SQL or
XQuery languages.

Sample development and runtime environment


To create the sample scenarios described in this article series, we used the following software:
Information Server 8.1, which includes the DataStage Designer and DataStage Director
components. In addition, we installed the DataStage 8.1 DB2 Connector, which is contained in
a patch for DataStage 8.1 called "Connectors Rollup Patch 1." Because one of the scenarios
uses a Transformer Stage, we also installed a prerequisite C++ compiler. (We used Microsoft
Visual Studio .NET 2003.)
DB2 9.5 Enterprise Server Edition.
Both DataStage and DB2 were installed on the same Windows system.
The sample operational data includes information about financial portfolios. We derived this
data from a subset of the open source Transaction Processing over XML (TPOX) benchmark.
Specifically, we used data representing information about customer accounts and their holdings (or
investments). You'll learn more about the operational data and the data warehouse design as you
step through each scenario. (For more information about TPOX, see the Resources section.)

Using pureXML as input to the warehouse


The scenario described in this article uses operational XML data stored in DB2 pureXML as input
to the ETL job. As shown in Figure 4, you want DataStage to extract data stored in one DB2 XML
column and map this data to two tables in a DB2 data warehouse. One of these tables contains
only traditional relational columns, while the second contains relational columns and one XML
column.

Figure 4. Operational data stored in DB2 pureXML serves as input to the data
warehouse

This data model represents a common scenario in which portions of XML data are often
"shredded" into a relational structure. Such portions represent data that business users might
frequently analyze and query. Many business intelligence tools are optimized to support relational
structures, so shredding frequently queried XML data into relational columns can be quite effective.
However, business needs can vary over time, so it can be difficult for administrators to determine
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 8 of 28

ibm.com/developerWorks/

developerWorks

which relational columns should be created. Maintaining full XML data in the data warehouse
allows users to immediately access important business data that wasn't previously shredded into a
relational format.
To simplify the sample scenarios in this article, we use a single DB2 database (named "TPOX") to
store both the operational data and the warehouse data. Of course, in a production environment,
operational data and warehouse data would be managed in separate databases, usually on
separate servers.
Since we use DB2 to manage both operational data and warehouse data, we could have elected
to use built-in DB2 technologies to perform much of the ETL work. However, offloading this
work to DataStage minimizes impact on DB2 operations, which is a common goal in production
environments. In addition, DataStage provides a number of transformation and cleansing functions
beyond those found in DB2. Finally, many firms need to populate their data warehouses with data
from heterogeneous data sources, and DataStage provides critical services to help them do so.
Such capabilities are well-documented elsewhere and are beyond the scope of this article series.

Design overview
The sample scenario in this article stores operational XML data in the TPOXADMIN.ACCOUNT
table, which serves as the source table to DataStage for this scenario. The ACCOUNT table
contains one relational column (ID) and one XML column (INFO). Listing 12 shows how easy it is
to create this table:

Listing 12. SQL to create the source ACCOUNT table


create table tpoxadmin.account(id int, info xml)

The INFO column contains details about the account, including its title, opening date, working
balance, portfolio holdings, and other information. Listing 13 shows a portion of one XML
document stored in the ACCOUNT table. The DB2 script accompanying this article contains the
full set of XML account records (see Download).

Listing 13. Portion of one XML document stored in the ACCOUNT table
<Account id="804130877" xmlns="http://tpox-benchmark.com/custacc">
<Category>6</Category>
<AccountTitle>Mrs Shailey Lapidot EUR</AccountTitle>
<ShortTitle>Lapidot EUR</ShortTitle>
<Mnemonic>LapidotEUR</Mnemonic>
<Currency>EUR</Currency>
<CurrencyMarket>3</CurrencyMarket>
<OpeningDate>1999-02-20</OpeningDate>
<AccountOfficer>Soraya Lagarias</AccountOfficer>
<LastUpdate>2004-02-10T22:33:58</LastUpdate>
<Balance>
<OnlineActualBal>896882</OnlineActualBal>
<OnlineClearedBal>337676</OnlineClearedBal>
<WorkingBalance>430147</WorkingBalance>
</Balance>
. . .
<Holdings>

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 9 of 28

developerWorks

ibm.com/developerWorks/

<Position>
<Symbol>ZION</Symbol>
<Name>Zions Bancorporation</Name>
<Type>Stock</Type>
<Quantity>1927.719</Quantity>
</Position>
. . .
</Holdings>
. . .
</Account>

For testing simplicity, the target data warehousing database is also configured to be TPOX.
Source information from the INFO column of TPOXADMIN.ACCOUNT will be mapped to two
tablesthe DWADMIN.ACCT table that contains information about the overall account, and
the DWADMIN.HOLDINGS table that contains information about various investments (portfolio
holdings) of a given account. Listing 14 shows how these tables are defined:

Listing 14. Definitions for target data warehouse tables


-- DWADMIN.ACCT contains general information about accounts.
create table dwadmin.acct (
id
int
primary key not null,
title
varchar(100),
currency
char(3),
workingbalance
int,
totalholdings
int,
holdingtypes
int,
officer
varchar(50),
datechanged
date,
timechanged
time,
fullrecord
xml
)
-- DWADMIN.HOLDINGS tracks specific investments (holdings) for a given account
create table dwadmin.holdings (
id
int
references dwadmin.acct on delete cascade,
symbol
varchar(10),
type
varchar(25),
quantity
decimal(12,2)
)

To understand how the XML source data (in the INFO column of TPOXADMIN.ACCOUNT) is
mapped to the various columns of the data warehouse tables, see Table 1. (The ID columns of
both data warehouse tables are populated from the values of the ID column in the operation.)

Table 1. XML source data for each column in the data warehouse tables
DWADMIN.ACCT
Column name

DWADMIN.HOLDINGS

Data source (XPATH or...)

Column name

Data source
(XPATH expression)

title

/Account/AccountTitle

symbol

/Account/Holdings/Position/
Symbol

currency

/Account/Currency

type

/Account/Holdings/Position/Type

workingbalance

/Account/WorkingBalance

quantity

/Account/Holdings/Position/
Quantity

officer

/Account/AccountOffice

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 10 of 28

ibm.com/developerWorks/

developerWorks

datachanged

extract from /Account/LastUpdated

timechanged

extract from /Account/LastUpdated

fullrecord

entire XML document as originally


stored in DB2 source table

As you might expect, there are several ways to build a DataStage job for this scenario. This article
employs an incremental development approach. In particular, the initial steps guide you through
creating a portion of the overall DataStage job that extracts, transforms, and loads data into the
DWADMIN.ACCT table. Once this portion is complete and tested, the article guides you through
enhancing the job to extract, transform, and load XML data into the DWADMIN.HOLDINGS table.
First, though, you need to create the appropriate DB2 tables to support this scenario.

Prepare the DB2 database environment


The article download contains a script that creates the necessary operational (source) table and
data warehouse (target) tables. It also populates the operational table with sample XML account
information. Download and unzip the DSsetup.zip file, and then open a DB2 command window and
issue the following command:

Listing 15. Invoking the DB2 script accompanying this article series
db2 td@ -vf

DSsetup.db2

Note: This script is designed to support a DB2 9.5 server running on Windows.

Create a parallel job and outline its contents


To begin, define a new DataStage parallel job with placeholders for various stages (work
tasks) that need to be included. Later, you'll edit each stage to specify appropriate properties to
successfully execute the job.
1. Launch DataStage Designer.
2. To create a new parallel job, right-click on the Jobs folder in the Repository pane, and select
New > Parallel job.
3. Add two DB2 Connector stages to the parallel job. One connector will represent the
TPOXADMIN.ACCOUNT source table, and the other will represent the DWADMIN.ACCT
target table.
a. Select the Database tab in the Palette pane.
b. Drag the DB2 Connector stage to the parallel job.
c. Repeat the previous step to create a second DB2 Connector stage.
d. Place these two connectors at opposite sides of the parallel job canvas.
4. Add an XML Input stage to the job. (This stage will transform the hierarchical XML data about
accounts into a tabular result structure.)
a. Select the Real Time tab in the Palette pane.
b. Locate the XML Input stage and drag it to the parallel job.
c. Place the icon near the first DB2 Connector stage.
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 11 of 28

developerWorks

ibm.com/developerWorks/

5. Add a Transformer stage to the job. (This stage will split a single XML element value into two
values that will populate two different columns in the target table.)
a. Select the Processing tab in the Palette pane.
b. Locate the Transformer stage and drag this icon to the parallel job pane.
c. Place the icon between the XML Input stage and the final DB2 Connector.
6. Link the stages together.
a. To link the first DB2 Connector to the XML Input, hold down the right mouse button, click
on the DB2 Connector, and drag the mouse to the XML Input stage. An arrow between
the two stages will appear.
b. Link the XML Input to the Transformer stage.
c. Link the Transformer stage to the final DB2 Connector.
7. If desired, customize your job with descriptive names for each stage and link using standard
DataStage facilities. (Consult the Resources section for links to DataStage tutorials and
documentation, if needed.)
8. Verify that your parallel job design is similar to Figure 5, which shows the various stages
linked together, as described in Step 6:

Figure 5. DataStage job skeleton for the first part of the integration
scenario

(View a larger version of Figure 5.)


9. To save your work, click on File > Save from the main toolbar, and specify a name of your
choice for the job.
You now have a parallel job skeleton for the first part of your DataStage job.

Import table definitions


The DB2 TPOX database already contains source and target tables, so you need to import
metadata about these tables into DataStage.
1. From the main toolbar, select Import > Table Definitions > Start Connector Import Wizard.
2. A new window appears for "Connector metadata import Connector selection." Select DB2
Connector (Variant 9.1), and then click on Next.
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 12 of 28

ibm.com/developerWorks/

developerWorks

3. Enter the appropriate details for the connection, including the instance type (DB2), database
name (TPOX), and a valid user ID and password.
4. Click on the Test connection option in the upper right corner of this window to verify that you
can connect to the DB2 TPOX database.
5. After a successful connection, click on Next and then on OK.
6. Accept the default values for the data source location. These include a host name of DB2 and
a database name of TPOX (DB2).
7. Click on Next.
8. Select the TPOXADMIN schema from the Filter drop-down list, verify that the include tables
option is checked, and then click on Next.
9. Select the ACCOUNT table from the list of available tables for this schema.
10. Leave all options unchecked, including the "XML Columns as LOBs" option. Your DataStage
job needs to process and transform the XML, so it will treat it as string data (rather than as an
unstructured large object or LOB).
11. Verify that the TPOXADMIN.ACCOUNT table is slated for import, and click on Import.
12. A pop-up window appears, prompting you to select a folder for the metadata import. Select
Table Definitions, and click on OK.
13. Repeat the prior steps to import the two data warehouse target tables.
In Step 8, select the DWADMIN schema, instead of the TPOXADMIN schema.
In Step 9, select the ACCT and HOLDINGS tables. (You'll only use the DWADMIN.ACCT
table definition initially, but it will save time if you import the DWADMIN.HOLDINGS table
definition now as well.)
14. To confirm that you successfully imported metadata for all necessary tables, expand the Table
Definitions folder in the Repository pane in the upper left corner and confirm that there are
entries for TPOXADMIN.ACCOUNT, DWADMIN.ACCT, and DWADMIN.HOLDINGS.
15. Save your work.
You're now ready to edit each stage of the job.

Edit the DB2 source table stage


The TPOXADMIN.ACCOUNT table contains operational information needed for the warehouse.
The properties of the source DB2 Connector stage need to be edited to reflect appropriate
information about this input table.
1. Double-click on the DB2 Connector that represents the input source.
2. A window appears. In the Properties tab, specify the appropriate connection information. This
includes the instance type (DB2), database name (TPOX), and a valid user ID and password,
as shown in Figure 6:

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 13 of 28

developerWorks

ibm.com/developerWorks/

Figure 6. Setting properties for the DB2 source table

3. Click on Test in the upper right corner of the pane to verify that you can successfully connect
to the database.
4. Scroll down to the Usage section of the Properties tab, and specify the following settings (as
also shown in Figure 6):
Generate SQL: Yes
Table name: TPOXADMIN.ACCOUNT
Array size: 1
5. Click on View Data in the right-hand side of the Usage row to verify that you can successfully
query the table, then click on OK to save your settings.
6. Click on the Columns tab, and select Load at the bottom of the pane.
7. A window appears with table definitions. Select the TPOXADMIN.ACCOUNT table, and click
on OK.
8. A window appears with the columns of the table. Accept the default in which all columns of
the table are selected. (Note that the INFO column, which was created in DB2 as an XML
column, appears here with an SQL type of NVarChar, which represents a Unicode string. This
is fine.)
9. Click on OK.

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 14 of 28

ibm.com/developerWorks/

developerWorks

10. Specify an appropriate length for the INFO column. For this sample data, a length of 5000
bytes is sufficient.
11. Click on OK.
12. Save your work.
The DB2 Connector stage for the source table is ready. In the next step, you'll customize the DB2
Connector representing the data warehouse target table.

Edit the DB2 target table stage


The DWADMIN.ACCT table is one of two data warehouse tables to be populated from the input
source. As mentioned earlier, this table contains several relational columns for specific information
about financial accounts as well as one pureXML column with full account data in XML format. The
properties of the target DB2 Connector stage need to be edited to reflect appropriate information
about this table.
1. To open the DB2 Connector that represents the target table, double-click on its icon.
2. In the Properties tab, specify the appropriate connection information. This includes the
instance type (DB2), database name (TPOX), and a valid user ID and password.
3. Test that you can successfully connect to the database.
4. In the Usage section of the Properties tab, specify the following settings, as shown in Figure
7:
Write mode: Insert
Generate SQL: Yes
Table name: DWADMIN.ACCT
Array size: 1
Table action: Truncate

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 15 of 28

developerWorks

ibm.com/developerWorks/

Figure 7. Setting properties for one of the DB2 target tables

5. Click on View Data to verify that you can successfully query the table. (The first time you run
your job, this table will be empty.)
6. Click on OK to save these settings.
7. Click on the Columns tab, and select Load.
8. Select the DWADMIN.ACCT table, and click on OK.
9. Accept the default in which all columns of the table are selected. (Note that the
FULLRECORD column, which was created in DB2 as an XML column, appears here with an
SQL type of NVarChar.)
10. Click on OK.
11. Specify an appropriate length for the FULLRECORD column. For the sample data, a length of
5000 bytes is sufficient.
12. Click on OK.
13. Save your work.
With the DB2 source and target stages defined, it's time to work on the stages that process the
data.

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 16 of 28

ibm.com/developerWorks/

developerWorks

Edit the XML Input stage


You need to customize the XML Input stage to extract XML data from the INFO column of the
TPOXADMIN.ACCOUNT table and generate a tabular result set, which will ultimately be used to
populate the DWADMIN.ACCT table.
1. To open the XML Input stage, double-click on its icon.
2. Click on the Input tab.
3. A second set of tabs appears. Click on the XML Source sub-tab (if needed).
4. Specify the XML source column. Use the drop-down tab to select the INFO column.
5. Specify XML document (rather than URL/File path) as the Column content.
6. Click on the Column sub-tab.
7. Verify that the ID and INFO columns are listed.
8. Set the data type for the INFO column to VarBinary, and set the column length to 5000 bytes.
(Using VarBinary enables the XML Input stage to handle the data in its native UTF-8 format
and avoids unnecessary character set conversions.)
9. Click on the main Output tab.
10. A new set of tabs appears. Click on the Transformation Settings sub-tab.
11. Verify that the Repetition element required setting is checked. This is needed to indicate
that you intend to extract multiple records from a single record that contains XML. The XML
Input stage will loop through the repetition element and produce an output record for each
occurrence of that element. In a following step, you'll see how to indicate the path to the
repetition element.
12. Select the include namespace declaration check box.
13. The XML source data contains a namespace, so you need to declare that here. Enter the
following:
xmlns:ns= "http://tpox-benchmark.com/custacc"

14. Click on the Columns sub-tab, and select Load.


15. A list of table definitions appears. Select the DWADMIN.ACCT table.
16. Accept the default to load all columns.
17. Verify that the Ensure all Char columns use Unicode check box is selected, then click OK.
18. Modify the output structure from this job stage as follows (to match the content of Figure 8):
a. Specify a length of 5000 bytes for the FULLRECORD column.
b. Change the data type of the TIMECHANGED column to NVarChar with a length of
20. Treating the data as a string will enable you to take advantage of certain built-in
DataStage transformation functions, as you'll see shortly.
c. Specify the appropriate XPATH expression that DataStage should use to extract
information from the input XML data. Do this for every column except the ID column
(which is extracted from a relational column in the source TPOXADMIN.ACCOUNT
table). The appropriate description settings include:
TITLE: /ns:Account/ns:AccountTitle/text()
CURRENCY: /ns:Account/ns:Currency/text()
WORKINGBALANCE: /ns:Account/ns:Balance/ns:WorkingBalance/text()
OFFICER: /ns:Account/ns:AccountOfficer/text()
DATECHANGED: /ns:Account/ns:LastUpdate/text()
TIMECHANGED: /ns:Account/ns:LastUpdate/text()
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 17 of 28

developerWorks

ibm.com/developerWorks/

FULLRECORD: /ns:Account
d. Identify the TITLE column as a key. To do so, change the Key value for TITLE to Yes.
This instructs DataStage to use this XML element value as the repetition element
identifier. For every occurrence of ns:AccountTitle, the stage will produce an output
record. In other words, it will produce a record for each Account since every account
contains an AccountTitle element. In this scenario, other columns could also serve this
purpose, including the CURRENCY, WORKINGBALANCE, and OFFICER columns,
since all of them are mandatory elements of the Account element. We selected TITLE as
the key column for convenience.

Figure 8. Column definitions for output results from the XML Input stage

(View a larger version of Figure 8.)


Note: VarChar types with a Unicode extension are equivalent to NVarChar types to the DB2
Connector.
19. Click on OK and save your work.
You've now programmed the XML Input stage to extract data from a DB2 pureXML column and
prepare it for mapping into a DB2 target data warehouse table that is largely relational. Next, you'll
customize the Transformer stage so that the data will be modified as needed before it's loaded into
the target table.

Edit the Transformer stage


The Transformer stage needs to extract time information from an input string containing timestamp
data in order to correctly populate the TIMECHANGED column in the target DWADMIN.ACCT
table. DataStage provides a number of built-in functions for manipulating date/time data,
performing data type conversions, and so on. You need to edit the Transformer stage to invoke
appropriate functions for your data conversion.
1. To open the Transformer stage, double-click on its icon.
2. Click on the tool bar's Column Auto-Match icon to instruct DataStage to automatically map
columns by name between the input and output links. (You defined the result set for this
stage's input link in "Edit the XML Input stage" section. You defined the result set for this
step's output link in the "Edit the DB2 target table stage" section.)

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 18 of 28

ibm.com/developerWorks/

developerWorks

3. Click OK.
4. Modify the Derivation setting for the TIMECHANGED column of the output link to transform
the data as needed. Recall that the input string contains a full timestamp with date and time
information, and you want to populate the TIMECHANGED column in the DB2 target table
with only a time value.
a. Highlight the appropriate derivation setting, right-click, and select Edit Derivation.
b. A blank pane appears. Use the built-in wizards to select appropriate transformation
function calls, or enter the following code:
TimestampToTime( StringToTimestamp(AccountOverview.TIMECHANGED,
"%yyyy-%mm-%ddT%hh:%nn:%ss"))

In case you're curious, the inner function call converts the input string into a timestamp
that complies with a specific format. The outer function takes this timestamp and
converts it to a time value. For details on these functions or the Transformer stage, see
the Resources section.
5. Verify that your transformation appears similar to Figure 9, which illustrates the mapping
between the input and output links (generated in Step 2) as well as the derivation that you
edited in Step 4:

Figure 9. Modified properties for the Transformer stage

(View a larger version of Figure 9.)


6. Click on OK.
You've now completed the modifications necessary for the initial portion of your DataStage job. It's
time to compile and test your work.

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 19 of 28

developerWorks

ibm.com/developerWorks/

Compile and test your work


Follow the standard DataStage procedures to compile, validate, and run your job. Consult the
Resources section for links to DataStage documentation, if needed. Here's a quick summary:
1. Press the F7 key, or select the Compile icon from the toolbar. Ensure that the job compiled
without error.
2. Press Ctrl+F5, or select the Run icon from the toolbar.
3. Set Warnings to no limit, and click on Run.
4. DataStage will attempt to execute your job. Launch the DataStage and QualityStage Director
to inspect the results of your job.
5. If needed, switch to the Status page and highlight your job. To do so select View > Status, or
click on the Status icon in the toolbar.
6. Switch to the Log page to review details about the job's execution. To do so, select View >
Log, or click on the Log icon on the toolbar. The job should execute successfully, although
warning messages may appear.
At this point, your can inspect the data in the target table using standard DB2 tools. Or, if you
prefer, you can view the data from the DataStage Designer by doing the following:
1. Double-click on the DB2 Connector that represents the target table.
2. Click on the View Data link to display the data in the target table.

Add more stages to populate the final warehouse table


To complete the scenario, you need to extract the appropriate information from the XML Input
source and load this data into the DWADMIN.HOLDINGS table. Doing so requires adding two new
stages to the existing job.
1. Select a Copy stage from the Processing section of the palette, and place the Copy stage
just below the existing XML Input stage.
2. Place a new DB2 Connector stage on the palette to the right of the Copy stage.
3. Link the XML input stage to the new Copy stage, and link the new Copy stage to the new DB2
Connector stage.
4. If desired, rename your new links and stages in a more descriptive manner.
5. Verify that your job design appears similar to Figure 10, which shows the various stages
linked together, as described in Step 3:

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 20 of 28

ibm.com/developerWorks/

developerWorks

Figure 10. Revised DataStage job for this scenario

(View a larger version of Figure 10.)


Edit the new DB2 Connector stage:
1. Specify the appropriate connection information, including the instance type (DB2), database
name (TPOX), and a valid user ID and password.
2. Test the connection.
3. Specify the appropriate usage information:
Write Mode: Insert
Generate SQL: Yes
Table name: DWADMIN.HOLDINGS
Table action: Truncate
Array Size: 1
4. Select View Data to verify that the specifications are valid.
5. Click on the Column tab.
6. Load information from the table definition for DWADMIN.HOLDINGS. Import all columns.
7. Set the Key value of the ID column to Yes.
8. Specify the XPATH expressions that DataStage should use to extract information from the
input XML data (shown in Figure 11). Do this for every column except the ID column. The
appropriate description settings include:
SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 21 of 28

developerWorks

ibm.com/developerWorks/

Figure 11. Edited column values of the DB2 Connector for the
DWADMIN.HOLDINGS table

(View a larger version of Figure 11.)


Edit the existing XML Input stage:
1. Click on the Output tab and use the drop-down menu beneath the Output Name box to
identify the appropriate DataStage link to modify. (In Figure 10, HoldingInfo is the appropriate
link, as it controls the data flow from the XML Input stage to the Copy stage that you just
added to the job.)
2. Click on the Transformation Settings tab.
3. Verify that the Repetition element required setting is checked.
4. Select the include namespace declaration check box.
5. Enter the following XML namespace definition in the white box:
xmlns:ns="http://tpox-benchmark.com/custacc"

6. Click on the Columns sub-tab.


7. Load information for the DWADMIN.HOLDINGS table. Select all columns.
8. Specify the XPath expressions that DataStage should use to extract information from the
input XML data (shown in Figure 12). Do this for every column except the ID column. The
appropriate description settings include:
SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 22 of 28

ibm.com/developerWorks/

developerWorks

Figure 12. Column definitions for output from XML Input stage to new
Copy stage

(View a larger version of Figure 12.)


9. Set the Key value for the SYMBOL column to Yes. This causes the stage to produce an
output record for each occurrence of the Symbol element. Since Symbol occurs in every
Position element, the stage produces a record for each Position element.
10. Click on OK.
Edit the new Copy stage. (This stage passes information between the XML Input and DB2
Connector stages, each of which associates different meanings with Key values. The XML Input
stage uses a Key column to identify the repetition element, which determines how the XML data is
generated. When you edited the XML Input stage in the previous section, the work you performed
in Steps 3 and 9 identified the Key value and repetition element. The DB2 Connector, of course,
uses Key to identify a column that is a database key. The Copy stage allows data to be passed
between these stages without introducing any operational conflicts due to the way each stage
treats key values. To do so, its input and output links identify different columns as keys.)
1. From the Output tab, select Mappings.
2. Click on Auto-Match, and select OK.
3. If desired, inspect the Key definitions for the columns on the input and output links.
4. Click on the Input tab, and then click on the Columns sub-tab.
5. Verify that the SYMBOL column is identified as the Key.
6. Click on the Output tab, and then click on the Columns sub-tab.
7. Verify that the ID column is identified as the Key.
8. Click on OK to finish editing this stage.
Test your revised job:
1. If you haven't already done so, save your job.
2. Compile the job.
3. Verify that there are no errors.
4. Run the job.
IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate
XML operational data into a data warehouse

Page 23 of 28

developerWorks

ibm.com/developerWorks/

5. Use DataStage Director to inspect the log and verify that the job finished successfully.
6. Optionally, inspect the data in the target table by selecting View Data in the stage editor for
the target DB2 stage.

Summary
Increased use of XML as a preferred format for data exchange is prompting data architects
and administrators to evaluate options for integrating business-critical XML data into their data
warehouses. In this first installment of this two-part series, you learned how IBM InfoSphere
DataStage can extract and transform XML data managed by DB2 pureXML. In addition, you
explored how DataStage can load this data into two tables: one with traditional SQL data types,
and one that features both relational and XML columns.
The second part of this article series explores another important scenario: using DataStage to read
information from a flat file, convert the data into an XML format, and load this XML data into a data
warehouse that contains a table with a DB2 pureXML column.

Acknowledgments
Thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review comments on this
article.

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 24 of 28

ibm.com/developerWorks/

developerWorks

Downloads
Description

Name

Size

Sample scripts and data

DSsetup.zip

141KB

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 25 of 28

developerWorks

ibm.com/developerWorks/

Resources
Learn
IBM InfoSphere DataStage: Get an overview of IBM InfoSphere DataStage.
"Parallel Job Tutorial" (IBM, 2008): Learn more about creating parallel jobs (publication
SC18-9889-01).
IBM InfoSphere DataStage documentation (IBM Information Server Information Center): Get
more details about DataStage capabilities and how to use them.
DB2 pureXML wiki: Find a comprehensive set of links to demos, free downloads, technical
papers, and more.
DB2 pureXML Cookbook (IBM Press, August 2009): Explore this comprehensive guide to
DB2 pureXML technology for all supported platforms.
"Query XML Data with SQL" (developerWorks, March 2006): Learn how to query data stored
in XML columns using SQL and SQL/XML, and explore many of the sample queries included
in this article.
"Query XML Data with XQuery" (developerWorks, April 2006): Learn how to query data
stored in XML columns using XQuery, and explore many of the sample queries included in
this article.
"Enhance business insight and scalability of XML data with new DB2 9.7 pureXML
features" (developerWorks, April 2009): Get an overview of the latest DB2 pureXML features.
Transaction Processing over XML (TPOX) benchmark: Learn more about the Transaction
Processing over XML (TPOX).
developerWorks Information Management zone: Learn more about Information Management.
Find technical documentation, how-to articles, education, downloads, product information,
and more.
Stay current with developerWorks technical events and webcasts.
Get products and technologies
DB2 Express-C 9.7: Download a free version of DB2 Express database server for the
community that includes pureXML.
DB2 9.7 for Linux, UNIX, and Windows: Download a free trial version of DB2 9.7.
Build your next development project with IBM trial software, available for download directly
from developerWorks.
Discuss
Participate in the discussion forum for this content.
Participate in the DB2 pureXML forum or one of the DataStage forums.
Participate in developerWorks blogs and get involved in the My developerWorks community;
with your personal profile and custom home page, you can tailor developerWorks to your
interests and interact with other developerWorks users.

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 26 of 28

ibm.com/developerWorks/

developerWorks

About the authors


Branislav Barnak
Branislav Barnak is a software architect at IBM located in Boca Raton, FL. He works
on IBM InfoSphere DataStage connectivity and is focusing on IBM technologies such
as DB2, WebSphere MQ, mainframe data storage systems, and others. Branislav has
been involved with the development of connectivity technologies for more than 10
years. He has worked on several connectivity frameworks interfacing with database
and messaging products such as Oracle, PeopleSoft, Teradata, DB2, WebSphere
MQ, and others. He holds a BS degree in Electrical Engineering from the University of
Belgrade, Serbia.

Amir Bar-or
Amir Bar-or is a senior architect and manager in the Enterprise Information
Management Group at the Massachusetts Laboratory. He has more than 10 years of
experience in database research and development as a researcher in HP Labs and
then in IBM SWG. Today, he is leading the development of DataStage future XML
capabilities.

Cynthia M. Saracco
Cynthia M. Saracco is a senior solutions architect at IBM's Silicon Valley Laboratory
who specializes in emerging technologies and database management topics. She
has 23 years of software industry experience, has written 3 books and more than 60
technical papers, and holds 7 patents.

Paul Stanley
Paul Stanley is a senior architect in the Enterprise Information Management Group
in Boca Raton, FL. He has been architecting and managing the development of
connectivity components for WebSphere Transformation Extender and InfoSphere
DataStage for more than 12 years.

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 27 of 28

developerWorks

ibm.com/developerWorks/

Copyright IBM Corporation 2009


(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)

IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate


XML operational data into a data warehouse

Page 28 of 28