DataStage Student Guide
Copyright 2002 Ascential Software Corporation
Version 6.0: 09/01/02
Copyright
This document and the software described herein are the property of Ascential Software
Corporation and its licensors and contain confidential trade secrets. All rights to this
publication are reserved. No part of this document may be reproduced, transmitted,
transcribed, stored in a retrieval system or translated into any language, in any form or by any
means, without prior permission from Ascential Software Corporation.
Copyright 2002 Ascential Software Corporation. All Rights Reserved.
Ascential Software Corporation reserves the right to make changes to this document and the
software described herein at any time and without notice. No warranty is expressed or
implied other than any contained in the terms and conditions of sale.
Ascential Software Corporation
50 Washington Street
Westboro, MA 01581-1021 USA
Phone: (508) 366-3888
Fax: (508) 389-8749
Ardent, Axielle, DataStage, Iterations, MetaBroker, MetaStage, and uniVerse are registered
trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick
Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and
registered trademarks are the property of the respective trademark holder.
Table of Contents
Module 1: Introduction to DataStage ............................ 1-01
Module 2: Installing DataStage ..................................... 2-01
Module 3: Configuring Projects ..................................... 3-01
Module 4: Designing and Running Jobs ........................ 4-01
Module 5: Working with Metadata................................. 5-01
Module 6: Working with Relational Data ....................... 6-01
Module 7: Constraints and Derivations .......................... 7-01
Module 8: Creating BASIC Expressions ........................ 8-01
Module 9: Troubleshooting ............................................ 9-01
Module 10: Defining Lookups ...................................... 10-01
Module 11: Aggregating Data ...................................... 11-01
Module 12: Job Control................................................ 12-01
Module 13: Working with Plug-Ins ............................... 13-01
Module 14: Scheduling and Reporting ........................ 14-01
Module 15: Optimizing Job Performance .................... 15-01
Module 16: Putting It All Together .............................. 16-01
Module 1
Introduction to DataStage
DataStage 314Svr
Ascential software provides the enterprise with a full featured data integration
platform that can take data from any source and load it into any target. Sources
can range from customer relationship systems to legacy systems to data
warehouses -- in fact, any system that houses data. Target systems, likewise, can
consist of data in warehouses, real-time systems, Web services -- any application
that houses data.
Depending on your needs, source data can undergo scrutiny and transformation
through several stages:
1. Data profiling -- a discovery process where relevant information for target
enterprise applications is gathered
2. Data quality -- a preparation process where data can be cleansed and
corrected
3. Extract, Transform, Load -- a transformation process where data is
enriched and loaded into the target
Underlying these processes is an application framework that allows you to
1. Utilize parallel processing for maximum performance
2. Manage and share metadata amongst all the stages
Overlaying all of this is a command and control structure that allows you to tailor
your environment to your specific needs.
DataStage Essentials
A data warehouse is a central database that integrates data from many operational
sources within an organization. The data is transformed, summarized, and
organized to support business analysis and report generation.
Repository of data
Supports business:
Projections
Comparisons
Assessments
Data marts are like data warehouses but smaller in scope. Frequently an
organization will have both an enterprise-wide data warehouse and data marts that
extract data from it for specialized purposes.
DataStage is a comprehensive tool for the fast, easy creation and maintenance of
data marts and data warehouses. It provides the tools you need to build, manage,
and expand them. With DataStage, you can build solutions faster and give users
access to the data and reports they need.
With DataStage you can:
Design the jobs that extract, integrate, aggregate, load, and transform the data
for your data warehouse or data mart.
DataStage is client/server software. The server stores all DataStage objects and
metadata in a repository, which consists of the UniVerse RDBMS. The clients
interface with the server.
The clients run on Windows 95 or later (Windows 98, NT, 2000). The server runs
on Windows NT 4.0 and Windows 2000. Most versions of UNIX are supported.
See the installation release notes for details.
The DataStage client components are:
Component
Description
Administrator
Designer
Director
Manager
True or False? The DataStage Server and clients must be running on the
same machine.
True: Incorrect. Typically, there are many client machines each accessing the
same DataStage Server running on a separate machine. The Server can be
running on Windows NT or UNIX. The clients can be running on a variety of
Windows platforms.
False: Correct! Typically, there are many client machines each accessing the
same DataStage Server running on a separate machine. The Server can be
running on Windows NT or UNIX. The clients can be running on a variety of
Windows platforms.
Use the Administrator to specify general server defaults, add and delete projects,
and to set project properties. The Administrator also provides a command
interface to the UniVerse repository.
Set job monitoring limits and other Director defaults on the General tab.
Specify a user name and password for scheduling jobs on the Schedule tab.
Specify hashed file stage read and write cache sizes on the Tunables tab.
Use the Manager to store and manage reusable metadata for the jobs you define in
the Designer. This metadata includes table and file layouts and routines for
transforming extracted data.
Manager is also the primary interface to the DataStage repository. In addition to
table and file layouts, it displays the routines, transforms, and jobs that are defined
in the project. Custom routines and transforms can also be created in Manager.
Decode (denormalize) data going into the data mart using reference lookups.
For example, if the sales order records contain customer IDs, you can look
up the name of the customer in the CustomerMaster table.
This avoids the need for a join when users query the data mart, thereby
speeding up the access.
Aggregate data.
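The reference lookup described above can be sketched in plain Python (illustrative only; in DataStage this decode is performed inside a Transformer stage using a hashed-file or ODBC reference lookup, and the table and column names here are invented):

```python
# Reference data: a customer-master lookup keyed on customer ID.
customer_master = {
    "C001": "Acme Corp",
    "C002": "Globex Inc",
}

# Source rows: sales orders carrying only the customer ID.
orders = [
    {"order_id": 1, "cust_id": "C001", "amount": 250.0},
    {"order_id": 2, "cust_id": "C002", "amount": 100.0},
]

# Decode (denormalize): attach the customer name to each order row,
# so queries on the data mart need no join at query time.
for row in orders:
    row["cust_name"] = customer_master.get(row["cust_id"], "UNKNOWN")
```

The trade-off is the classic warehouse one: the lookup is paid once at load time rather than on every user query.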
You can easily move between the Director, Designer, and Manager by selecting
commands in the Tools menu.
Use the Director to validate, run, schedule, and monitor your DataStage jobs.
You can also gather statistics as the job runs.
Import metadata that defines the format of data stores your jobs will read from
or write to: Manager
All your work is done in a DataStage project. Before you can do anything, other
than some general administration, you must open (attach to) a project.
Projects are created during and after the installation process. You can add
projects after installation on the Projects tab of Administrator.
A project is associated with a directory. The project directory is used by
DataStage to store your jobs and other DataStage objects and metadata.
You must open (attach to) a project before you can do any work in it.
Projects are self-contained. Although multiple projects can be open at the same
time, they are separate environments. You can, however, import and export
objects between them.
Multiple users can be working in the same project at the same time. However,
DataStage will prevent multiple users from accessing the same job at the same
time.
DataStage Director is used to execute your jobs after they have been built.
True: Correct! Use Director to validate and run your jobs. You can also
monitor the job while it is running.
False: Incorrect. Use Director to validate and run your jobs. You can also
monitor the job while it is running.
DataStage Administrator is used to set global and project properties.
True: Correct! You can set some global properties such as connection timeout,
as well as project properties, such as permissions.
False: Incorrect. You can set some global properties such as connection timeout,
as well as project properties, such as permissions.
Module 2
Installing DataStage
The DataStage server should be installed before the DataStage clients are
installed. The server can be installed on Windows NT (including Workstation
and Server), Windows 2000, or UNIX. This module describes the Windows NT
installation.
The exact system requirements depend on your version of DataStage. See the
installation CD for the latest system requirements.
To install the server you will need the installation CD and a license for the
DataStage server. The license contains the following information:
Serial number
Project count
The maximum number of projects you can have installed on the server.
This includes new projects as well as previously created projects to be
upgraded.
Expiration date
Authorization code
This information must be entered exactly as written in the license.
The DataStage services must be running on the server machine in order to run any
DataStage client applications. To start or stop the DataStage services in Windows
2000, open the DataStage Control Panel window in the Windows 2000 Control
Panel. Then click Start All Services (or Stop All Services). These services must
be stopped when installing or reinstalling DataStage.
UNIX note: In UNIX, these services are started and stopped using the uv.rc
script with the stop or start command options. The exact name varies by platform.
For SUN Solaris, it is /etc/rc2.d/S99uv.rc.
The DataStage clients should be installed after the DataStage server is installed.
The clients can be installed on Windows 95, Windows 98, Windows NT, or
Windows 2000.
There are two editions of DataStage.
The Developers edition contains all the client applications (in addition to the
server).
The Operators edition contains just the client applications needed to run and
monitor DataStage jobs (in addition to the server), namely, the Director and
Administrator.
To install the Developers edition you need a license for DataStage Developer.
To install the Operators edition you need a license for DataStage Director. The
license contains the following information:
Serial number
User limit
Expiration date
Authorization code
This information must be entered exactly as written in the license.
Module 3
Configuring Projects
In DataStage all development work is done within a project. Projects are created
during installation and after installation using Administrator.
Each project is associated with a directory. The directory stores the objects (jobs,
metadata, custom routines, etc.) created in the project.
Before you can work in a project you must attach to it (open it).
You can set the default properties of a project using DataStage Administrator.
Use this page to set user group permissions for accessing and using DataStage.
All DataStage users must belong to a recognized user role before they can log on
to DataStage. This helps to prevent unauthorized access to DataStage projects.
There are three roles of DataStage user:
DataStage Developer, who has full access to all areas of a DataStage project.
DataStage Operator, who can run and manage released DataStage jobs.
Use the Schedule tab to specify a user name and password for running scheduled
jobs in the selected project. If no user is specified here, the job runs under the
same user name as the system scheduler.
On the Tunables tab, you can specify the sizes of the memory caches used when
reading rows in hashed files and when writing rows to hashed files. Hashed files
are mainly used for lookups and are discussed in a later module.
Active-to-Active link performance settings will be covered in detail in a later
module in this course.
Module 4
Designing and Running Jobs
A job is an executable DataStage program. In DataStage, you can design and run
jobs that perform many useful data warehouse tasks, including data extraction,
data conversion, data aggregation, data loading, etc.
DataStage jobs are:
In this module, you will go through the whole process with a simple job, except
for the first bullet. In this module you will manually define the metadata.
In the center right is the Designer canvas. On it you place stages and links from
the Tools Palette on the right. On the bottom left is the Repository window,
which displays the branches in Manager. Items in Manager, such as jobs and
table definitions can be dragged to the canvas area. Click View>Repository to
display the Repository window.
Click View>Property Browser to display the Property Browser window. This
window displays the properties of objects selected on the canvas.
The toolbar at the top contains quick access to the main functions of Designer.
The tool palette contains icons that represent the components you can add to your
job design.
Most of the stages shown here are automatically installed when you install
DataStage. You can also install additional stages called plug-ins for special
purposes. For example, there is a plug-in called sort that can be used to sort data.
Plug-ins are discussed in a later module.
Sequential
ODBC
Hashed
Transformer
Aggregator
Sort plug-in
The Sequential stage is used to extract data from a sequential file or to load data
into a sequential file.
The main things you need to specify when editing the sequential file stage are the
following:
File format
Column definitions
If the sequential stage is being used as a target, specify the write action:
Overwrite the existing file or append to it.
The column definitions you defined in the source stage for a given (output) link
will appear already defined in the target stage for the corresponding (input) link.
Think of a link as a pipe: what flows in one end flows out the other.
The format going in is the same as the format going out.
The Transformer stage is the primary active stage. Other active stages perform
more specialized types of transformations.
In the Transformer stage you can specify:
Column mappings
Derivations
Constraints
Add one or more Annotation stages to the canvas to document your job.
An Annotation stage works like a text box with various formatting options. You
can optionally show or hide the Annotation stages by pressing a button on the
toolbar.
There are two Annotation stages. The Description Annotation stage is discussed
in a later module.
Type the text in the box. Then specify the various options including:
Before you can run your job, you must compile it. This generates executable code
that can be run by the DataStage Server engine. To compile a job, click
File>Compile or click the Compile button on the toolbar. The Compile Job
window displays the status of the compile.
If an error occurs:
Click Show Error to identify the stage where the error occurred.
Click More to retrieve more information about the error.
As you know, you run your jobs in Director. You can open Director from within
Designer by clicking Tools>Run Director.
In a similar way, you can move between Director, Manager, and Designer.
There are two methods for running a job:
Run it immediately.
Select the job in the Job Status view. The job must have been compiled.
Click Job>Run Now or click the Run Now button in the toolbar. The Job
Run Options window is displayed.
This shows the Director Job Status view. To run a job, select it and then click
Job>Run Now.
Other views available:
The Job Run Options window is displayed when you click Job>Run Now.
This window allows you to stop the job after:
You can validate your job before you run it. Validation performs some checks
that are necessary in order for your job to run successfully. These include:
Click Run to run the job after it is validated. The Status column displays the
status of the job run.
Click the Log button in the toolbar to view the job log. The job log records
events that occur during the execution of a job.
These events include control events, such as the starting, finishing, and aborting
of a job; informational messages; warning messages; error messages; and
program-generated messages.
Module 5
Working with Metadata
The left pane contains the project tree. There are eight main branches, but you
can create subfolders under each. Select a folder in the project tree to display its
contents. In this example, a folder named DS304 has been created that contains
some of the jobs in the project.
Data Elements branch: Lists the built-in and custom data elements. (Data
elements are extensions of data types, and are discussed in a later module.)
Jobs branch: Lists the jobs in the current project.
Routines branch: Lists the built-in and custom routines.
Routines are blocks of DataStage BASIC code that can be called within a job.
(Routines are discussed in a later module.)
Shared Containers branch: Shared Containers encapsulate sets of DataStage
components into a single stage. (Shared Containers are discussed in a later
module.)
Stage Types branch: Lists the types of stages that are available within a job.
Built-in stages include the sequential and transformer stages you used in
Designer.
Table Definitions branch: Lists the table definitions available for loading into a
job.
Transforms branch: Lists the built-in and custom transforms. Transforms are
functions you can use within a job for data conversion. Transforms are discussed
in a later module.
DataStage components
Every object in DataStage (jobs, routines, table definitions, etc.) is stored
in the DataStage repository. Manager is the interface to this repository.
DataStage components, including whole projects, can be exported from
and imported into Manager.
Any set of DataStage objects stored in the Manager repository, including whole
projects, can be exported to a file. This export file can then be imported back
into DataStage.
Import and export can be used for many purposes, including:
Moving DataStage objects from one project to another. Just export the
objects, move to the other project, then re-import them into the new project.
Sharing jobs and projects between developers. The export files, when zipped,
are small and can be easily emailed from one developer to another.
True or False? You can export DataStage objects such as jobs, but you can't
export metadata, such as field definitions of a sequential file.
True: Incorrect. Metadata describing files and relational tables are stored as
"Table Definitions". Table definitions can be exported and imported as any
DataStage objects can.
False: Correct! Metadata describing files and relational tables are stored as
"Table Definitions". Table definitions can be exported and imported as any
DataStage objects can.
Table definitions define the formats of a variety of data files and tables. These
definitions can then be used and reused in your jobs to specify the formats of data
stores.
For example, you can import the format and column definitions of the
Customers.txt file. You can then load this into the sequential source stage of a
job that extracts data from the Customers.txt file.
You can load this same metadata into other stages that access data with the same
format. In this sense the metadata is reusable. It can be used with any file or data
store with the same format.
If the column definitions are similar to what you need you can modify the
definitions and save the table definition under a new name.
You can also use the same table definition for different types of data stores with
the same format. For example, you can import a table definition from a sequential
file and use it to specify the format for an ODBC table. In this sense the metadata
is loosely coupled with the data whose format it defines.
You can import and define several different kinds of table definitions including:
Sequential files, ODBC data sources, UniVerse tables, hashed files.
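The idea that one table definition is loosely coupled to the data it describes can be sketched in Python (a loose analogy only, not DataStage's internal metadata format; the file contents and column names are made up):

```python
import csv
import io
import sqlite3

# One "table definition": column names and SQL types, reusable for
# both a sequential file and a relational table with the same format.
columns = [("CustId", "TEXT"), ("Name", "TEXT"), ("Balance", "REAL")]

# Use it to interpret a delimited sequential file.
data = "C001,Acme Corp,250.0\nC002,Globex Inc,100.0\n"
rows = [dict(zip((name for name, _ in columns), rec))
        for rec in csv.reader(io.StringIO(data))]

# Use the same definition to create and load a relational table.
conn = sqlite3.connect(":memory:")
ddl = ", ".join(f"{name} {typ}" for name, typ in columns)
conn.execute(f"CREATE TABLE Customers ({ddl})")
conn.executemany("INSERT INTO Customers VALUES (?, ?, ?)",
                 [tuple(r.values()) for r in rows])
```

The same column metadata drives both data stores, which is the sense in which a DataStage table definition is reusable across stage types.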
In Manager, select the category (folder) that contains the table definition.
Double-click the table definition to open the Table Definition window.
Click the Columns tab to view and modify any column definitions. Select the
Format tab to edit the file format specification.
Module 6
Working with Relational Data
You can perform the same tasks with relational data that you can with sequential
data. You can extract, filter, and transform data from relational tables.
You can also load data into relational tables.
Although you can work with many relational databases through native drivers
(including UniVerse, UniData, and Oracle), you can access many more relational
databases using ODBC.
In the ODBC stage, you can build a query against one or more tables in the
database interactively, type the query yourself, or paste in an existing query.
Before you can access data through ODBC you must define an ODBC data
source. In Windows NT, this can be done using the (32 bit) ODBC Data Source
Administrator in the Control Panel.
The ODBC Data Source Administrator has several tabs. For use with DataStage,
you should define your data sources on the System DSN tab (not User DSN).
You can install drivers for most of the common relational database systems from
the DataStage installation CD.
Click Add to define a new data source. When you click Add a list of available
drivers is displayed. Select the appropriate driver and then click Finish.
Different relational databases have different requirements. As an example, we
will define a Microsoft Access data source.
Type the name of the data source in the Data Source Name box.
Specify the ODBC data source name in the Data source name box on the
General tab of the ODBC stage.
You can click the Get SQLInfo button to retrieve the quote character and schema
delimiters from the ODBC database.
Specify the table name on the General tab of the Outputs tab.
Select Generated query to define the SQL SELECT statement interactively using
the Columns and Selection tabs. Select User-defined SQL query to write your
own SQL SELECT statement to send to the database.
Load the table definitions from Manager on the Columns tab. The procedure is
the same as for sequential files.
When you click Load, the Select Columns window is displayed. Select the
columns from which data is to be extracted.
Optionally, specify a WHERE clause and other additional SQL clauses on the
Selection tab.
Other clauses can be anything else you wish to add to the Select clause, such as
ORDER BY.
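The kind of SELECT statement the stage generates from the Columns and Selection tabs can be tried against any SQL database. Here is a sketch using Python's sqlite3 (the table, columns, and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId INTEGER, CustId TEXT, Amount REAL)")
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                 [(1, "C001", 250.0), (2, "C002", 100.0), (3, "C001", 75.0)])

# Columns tab -> column list; Selection tab -> WHERE clause plus
# "other clauses" such as ORDER BY appended to the statement.
sql = ("SELECT OrderId, Amount FROM Orders "
       "WHERE CustId = 'C001' "      # WHERE clause
       "ORDER BY Amount")            # other clauses
rows = conn.execute(sql).fetchall()  # -> [(3, 75.0), (1, 250.0)]
```

This mirrors what the View SQL tab shows and what View Data runs against the database.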
The View SQL tab enables you to view the SELECT statement that will be used
to select the data from the table.
The SQL displayed is read-only. Click View Data to test the SQL statement
against the database.
If you want to define your own SQL query, click User-defined SQL query on
the General tab and then write or paste the query into the SQL for primary
inputs box on the SQL Query tab.
Select the update action. You can choose from a variety of INSERT and/or
UPDATE actions.
Some of the options are different in the ODBC stage when it is used as a target.
Select the type of action to perform from the Update action list.
You can optionally have DataStage create the target table or you can load to an
existing table.
On the View SQL tab you can view the SQL statement used to insert the data into
the target table.
On the Edit DDL tab you can generate and modify the CREATE TABLE
statement used to create the target table.
If you make any changes to column definitions, you need to regenerate the
CREATE TABLE statement by clicking the Create DDL button.
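Regenerating a CREATE TABLE statement from a set of column definitions, as the Create DDL button does, can be sketched like this in Python (the column names, types, and helper function are hypothetical; real DDL generation is database-specific):

```python
import sqlite3

# Column definitions as they might appear on the Columns tab:
# (name, SQL type, is_key).
columns = [
    ("CustId",  "CHAR(10)",      True),
    ("Name",    "VARCHAR(40)",   False),
    ("Balance", "DECIMAL(10,2)", False),
]

def create_ddl(table, cols):
    """Regenerate the CREATE TABLE statement from column definitions."""
    defs = [f"{name} {typ}" for name, typ, _ in cols]
    keys = [name for name, _, is_key in cols if is_key]
    if keys:
        defs.append(f"PRIMARY KEY ({', '.join(keys)})")
    return f"CREATE TABLE {table} ({', '.join(defs)})"

ddl = create_ddl("Customers", columns)
sqlite3.connect(":memory:").execute(ddl)  # the generated DDL is valid SQL
```

The point mirrored here is that the DDL is derived from the column definitions, which is why it must be regenerated after any column change.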
True or False? Using a single ODBC stage, you can only extract data from a
single table.
True: Incorrect. You can join data from multiple tables within a single data
source.
False: Correct! You can join data from multiple tables within a single data
source.
The ORAOCI8 plug-in lets you rapidly and efficiently prepare and load streams
of tabular data from any DataStage stage (for example, the ODBC stage, the
Sequential File stage, and so forth) to and from tables of the target Oracle
database. The Oracle client on Windows NT or UNIX uses SQL*Net to access an
Oracle server on Windows NT or UNIX.
The plug-in appears as any other stage on the designer work area. It can extract or
write data contained in Oracle tables.
Features:
Each ORAOCI8 plug-in stage is a passive stage that can have any number
of input, output, and reference output links.
Input links specify the data you are writing, which is a stream of rows to
be loaded into an Oracle database. You can specify the data on an input
link using an SQL statement constructed by DataStage or a user-defined
SQL statement.
Output links specify the data you are extracting, which is a stream of
rows to be read from an Oracle database. You can also specify the data on
an output link using an SQL statement constructed by DataStage or a
user-defined SQL statement.
Each reference output link represents a row that is key read from an
Oracle database (that is, it reads the record using the key field in the
WHERE clause of the SQL SELECT statement).
General Tab
This tab is displayed by default. It contains the following fields:
Table name. This required field is editable when the update action is not
User-defined SQL (otherwise, it is read-only). It is the name of the target
Oracle table the data is written to; the table must exist or be created by
choosing Generate DDL from the Create table action list. You must have insert,
update, or delete privileges, depending on the input mode. You must specify
Table name if you do not specify User-defined SQL. There is no default. Click
the Browse button to browse the Repository and select the table.
Update action. Specifies which SQL statements are used to update the target
table. Some update actions require key columns to update or delete rows. There is
no default. Choose the option you want from the list.
Clear table then insert rows. Deletes the contents of the table and adds the new
rows, with slower performance because of transaction logging.
Truncate table then insert rows. Truncates the table with no transaction logging
and faster performance.
Insert rows without clearing. Inserts the new rows in the table.
Delete existing rows only. Deletes existing rows in the target table that have
identical keys in the source files.
Replace existing rows completely. Deletes the existing rows, then adds the new
rows to the table.
Update existing rows only. Updates the existing data rows. Any rows in the data
that do not exist in the table are ignored.
Update existing rows or insert new rows. Updates the existing data rows before
adding new rows. It is faster to update first when you have a large number of
records.
Insert new rows or update existing rows. Inserts the new rows before updating
existing rows. It is faster to insert first if you have only a few records.
User-defined SQL. Writes the data using a user-defined SQL statement,
which overrides the default SQL statement generated by the stage. If you
choose this option, you enter the SQL statement on the SQL tab.
User-defined SQL file. Reads the contents of the specified file to write
the data.
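As an illustration, with the Update existing rows or insert new rows action the stage issues SQL roughly equivalent to the sketch below. The table and column names here are hypothetical, not part of the product; the stage derives the real statements from the table name and the column metadata, placing key columns in the WHERE clause.

```sql
-- Sketch only: generated SQL for "Update existing rows or insert new rows".
-- First try to update the row identified by the key column(s):
UPDATE stores
   SET stor_name = ?
 WHERE stor_id = ?;

-- If the UPDATE matches no row, the row is inserted instead:
INSERT INTO stores (stor_id, stor_name)
VALUES (?, ?);
```

This ordering is why the action is faster when most incoming rows already exist in the table; the reverse action (insert, then update) is faster when most rows are new.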
Transaction Isolation. Provides the necessary concurrency control between
transactions in the job and other transactions. Use one of the following transaction
isolation levels:
Read committed. Takes exclusive locks on modified data and sharable
locks on all other data. Each query executed by a transaction sees only
data that was committed before the query (not the transaction) began.
Oracle queries never read dirty (uncommitted) data. This is the default.
Serializable. Takes exclusive locks on modified data and sharable locks
on all other data. Serializable transactions see only the changes that were
committed at the time the transaction began.
Note: If Enable transaction grouping is selected on the Transaction Handling
tab, only the Transaction Isolation value for the first link is used for the
entire group.
Array size. Specifies the number of rows to be transferred in one call
between DataStage and Oracle before they are written. Enter a positive
integer to indicate how many rows are sent to the database at a time.
The default value is 1; that is, each row is written in a separate
statement. Larger numbers use more memory on the client to cache the
rows before they are written.
Module 7: Constraints and Derivations
A constraint specifies the condition under which data flows through a link. For
example, suppose you want to split the data in the jobs file into separate files
based on the job level.
We need to define a constraint on each link so that only jobs within a certain level
range are written to each file.
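For instance, a constraint limiting one output link to a band of job levels might look like the following DataStage BASIC expression. The link and column names (JobsIn, job_lvl) and the level boundaries are illustrative, not taken from the course files.

```basic
JobsIn.job_lvl >= 100 And JobsIn.job_lvl <= 200
```

Only rows for which the expression evaluates true are written down that link.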
Click the Constraints button in the toolbar at the top of the Transformer Stage
window to open the Transformer Stage Constraints window.
The Transformer Stage Constraints window lists all the links out of the
transformer. Double-click on the cell next to a link to create the constraint.
Rows that are not written to any of the previous links are written to a rejects
link.
If there is no constraint on a (non-rejects) link, all rows will be sent down the
link.
This shows the Constraints window. Constraints are defined for each of the top
three links. The Reject Row box is selected for the last link. All rows that fail to
satisfy the constraints on the top three links will be sent down this link.
True or False? A Rejects link can be placed anywhere in the link ordering.
True: Incorrect. A Rejects link should be placed last in the link ordering, if it is
to get every row that doesn't satisfy any of the other constraints.
False: Correct! A Rejects link should be placed last in the link ordering, if it is
to get every row that doesn't satisfy any of the other constraints.
Type constants.
In this example the concatenation of several fields is moved into the FullName
target field.
The colon (:) is the concatenation operator. You can insert this from the Operator
menu or type it in.
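A FullName derivation of this kind might read as follows; the link and column names (CustomersIn, first_name, last_name) are illustrative.

```basic
CustomersIn.first_name : " " : CustomersIn.last_name
```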
True or False? If the constraint for a particular link is not satisfied, then the
derivations defined for that link are not executed.
True: Correct! Constraints have precedence over derivations. Derivations in an
output link are only executed if the constraint is satisfied.
False: Incorrect. Constraints have precedence over derivations. Derivations in
an output link are only executed if the constraint is satisfied.
You can create stage variables for use in your column derivations and constraints.
Stage variables store values without writing them out to a target file or table.
They can be used in expressions just like constants, input columns, and other
operands.
Stage variables retain their values across reads. This allows them to be used as
counters and accumulators. You can also use them to compare a current input
value to a previous input value.
To create a new stage variable, click the right mouse button over the Stage
Variables window and then click Append New Stage Variable (or Insert New
Stage Variable).
After you create it, you specify a derivation for it in the same way as for columns.
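As a sketch of both uses, here are derivations for hypothetical stage variables (the names svCount, svIsNewStore, svPrevStoreID and the link StoresIn are illustrative). Stage variable derivations are evaluated top-down, so the comparison variable must appear above the one holding the previous value.

```basic
* Derivation for stage variable svCount (initial value 0) - a running row counter:
svCount + 1

* Derivation for svIsNewStore - compares the current key to the previous row's key.
* svIsNewStore must be listed above svPrevStoreID so it still sees last row's value:
IF StoresIn.stor_id <> svPrevStoreID THEN @TRUE ELSE @FALSE

* Derivation for svPrevStoreID - remembers the current key for the next row:
StoresIn.stor_id
```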
Note the output link reordering icon available on the toolbar from within the
Transformer stage.
To get to the link ordering screen, open the transformer stage, then click on the
output link execution order icon. The above screen will appear. Select a link
and use the arrow buttons to reposition a link in the execution order.
True or False? Derivations for stage variables are executed before derivations
for any output link columns.
True: Correct! So you can be sure that the derivations for any of the stage
variables referenced in column derivations will have already been executed.
False: Incorrect. The derivations for stage variables are executed first. So you
can be sure that the derivations for any of the stage variables referenced in column
derivations will have already been executed.
Module 8
DataStage 304
DataStage BASIC is a form of BASIC that has been customized to work with
DataStage.
In the previous module you learned how to define constraints and derivations.
Derivations and constraints are written using DataStage BASIC.
Job control routines, which are discussed in a later module, are also written in
DataStage BASIC.
This module will not attempt to teach you BASIC programming. Our focus is on
what you need to know in order to construct complex DataStage constraints and
derivations.
For more information about BASIC operators than is provided here, search for
BASIC Operators in Help. You can insert these operators from the Operators
menu (except for the IF operator, which is on the Operands menu).
Arithmetic operators: -, +, *, /
IF operator:
IF min_lvl < 0 THEN "Out of Range" ELSE "In Range"
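IF expressions can also be nested to test several ranges in one derivation; a sketch using the same column (the upper bound of 250 is illustrative):

```basic
IF min_lvl < 0 THEN "Out of Range" ELSE IF min_lvl > 250 THEN "Out of Range" ELSE "In Range"
```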
For more information about BASIC functions than is provided here, look up
Alphabetical List of BASIC Functions and Statements in Help. BASIC functions
include the standard Pick BASIC functions. Click Function from the Operands
menu to insert a function.
Here are a few of the more common functions:
LEN(string)
UPCASE(string), DOWNCASE(string)
ICONV, OCONV
ICONV is used to convert values to an internal format.
OCONV is used to convert values from an internal format.
These are very powerful functions, often used for date and time conversions
and manipulations.
These functions are discussed later in the module.
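A derivation combining the string functions might look like this; the link and column names (CustomersIn, country) are illustrative:

```basic
IF LEN(CustomersIn.country) = 0 THEN "UNKNOWN" ELSE UPCASE(CustomersIn.country)
```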
For more information about BASIC system variables than is provided here, look
up System Variables in Help. Click System Variable from the Operands menu
to insert a system variable.
@DATE, @TIME
@INROWNUM
@OUTROWNUM
@LOGNAME
@NULL (the NULL value)
@TRUE, @FALSE
@WHO
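For example, a derivation can stamp each output row with its sequence number and the run date, using no input columns at all. The Oconv call formats @DATE (an internal date) as YYYY-MM-DD; the message text is illustrative:

```basic
"Row " : @OUTROWNUM : " loaded " : Oconv(@DATE, "D4-YMD[4,2,2]")
```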
Data elements are extended data types. For example, a phone number is a kind of
string. You could define a data element called PHONE.NUMBER to precisely
define this type.
Data elements are defined in DataStage Manager. A number of built-in types are
supplied with DataStage. For example MONTH.TAG represents a string of the
form YYYY-MM.
The argument(s) and return value have specific data elements associated with
them. In this sense, they transform data from one data element type to
another data element type.
You can define your own DS Transforms, but there are also a number of pre-built
transforms that are supplied with DataStage.
The pre-built transforms include a number of routines for manipulating strings
and dates.
Use the Iconv and Oconv functions with the D conversion code.
For detailed help on Iconv and Oconv, see their entries in the Alphabetical List
of BASIC Functions and Statements in Help.
Use Iconv to convert a string date in a variety of formats to the internal DataStage
integer format. Use Oconv to convert an internal date to a string date in a variety
of formats. Use these two functions together to convert a string date from one
format to another.
The internal format for a date is based on a reference date of December 31, 1967,
which is day 0. Dates before are negative integers; dates after are positive
integers.
Use the D conversion code to specify the format of the date to be converted to
an internal date by Iconv or the format of the date to be output by Oconv.
For detailed help (more than you probably want), see D Code under Iconv or
Oconv in Help.
Consider the format string D4-MDY[2,2,4]:
D is the date conversion code.
4 is the number of digits used for the year.
The hyphen (-) is the separator character.
MDY is the order of the month, day, and year parts.
[2,2,4] gives the number of digits for each part (2 for month, 2 for day, 4 for year).
Iconv("12-31-67", "D2-MDY[2,2,2]")   ->  0
Iconv("12311967", "D MDY[2,2,4]")    ->  0
Iconv("31-12-1967", "D-DMY[2,2,4]")  ->  0
Oconv(0, "D2-MDY[2,2,4]")   ->  12-31-1967
Oconv(0, "D2/DMY[2,2,2]")   ->  31/12/67
Oconv(10, "D/YDM[4,2,A10]") ->  1968/10/JANUARY
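Combining the two functions converts a string date from one format to another in a single expression. A sketch, converting a DMY string to a YMD string (the specific date is illustrative):

```basic
* Convert "14-02-1993" (day-month-year) to "1993-02-14" (year-month-day):
Oconv(Iconv("14-02-1993", "D-DMY[2,2,4]"), "D4-YMD[4,2,2]")
```

Iconv turns the input string into the internal day number, and Oconv formats that number back out in the target format.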
DataStage provides a number of built-in transforms you can use for date
conversions.
The following data elements are used with the built-in transforms:

Data element    String format    Example
DATE.TAG        YYYY-MM-DD       1999-02-24
WEEK.TAG        YYYYWnn          1999W06
MONTH.TAG       YYYY-MM          1999-02
QUARTER.TAG     YYYYQn           1999Q4
YEAR.TAG        YYYY             1999
True or False? You can use Oconv to convert a string date from one format
to another.
True: Incorrect. Oconv by itself can't do this. You would first use Iconv to
convert the input string into a day integer. Then you can use Oconv to convert the
day integer into the output string.
False: Correct! Oconv by itself can't do this. You would first use Iconv to
convert the input string into a day integer. Then you can use Oconv to convert the
day integer into the output string.
Transform        Argument       Description
MONTH.FIRST      MONTH.TAG      Internal date of the first day of the month
MONTH.LAST       MONTH.TAG      Internal date of the last day of the month
QUARTER.FIRST    QUARTER.TAG    Internal date of the first day of the quarter
QUARTER.LAST     QUARTER.TAG    Internal date of the last day of the quarter
WEEK.FIRST       WEEK.TAG       Internal date of the first day of the week
WEEK.LAST        WEEK.TAG       Internal date of the last day of the week
YEAR.FIRST       YEAR.TAG       Internal date of the first day of the year
YEAR.LAST        YEAR.TAG       Internal date of the last day of the year
Examples:
MONTH.FIRST("1993-02")  ->  9164
MONTH.LAST("1993-02")   ->  9191
Transform      Argument type    Description
DATE.TAG       Internal date    Converts an internal date to a YYYY-MM-DD string
MONTH.TAG      Internal date    Converts an internal date to a YYYY-MM string
QUARTER.TAG    Internal date    Converts an internal date to a YYYYQn string
WEEK.TAG       Internal date    Converts an internal date to a YYYYWnn string
Examples:
MONTH.TAG(9177)  ->  1993-02
DATE.TAG(9177)   ->  1993-02-14
Transform         Argument    Description
TAG.TO.MONTH      DATE.TAG    Converts DATE.TAG to MONTH.TAG
TAG.TO.QUARTER    DATE.TAG    Converts DATE.TAG to QUARTER.TAG
TAG.TO.WEEK       DATE.TAG    Converts DATE.TAG to WEEK.TAG
TAG.TO.DAY        DATE.TAG    Converts DATE.TAG to DAY.TAG

Examples:
TAG.TO.MONTH("1993-02-14")    ->  1993-02
TAG.TO.QUARTER("1993-02-14")  ->  1993Q1
Module 9: Troubleshooting
Events are logged to the job log file when a job is validated, run, or reset. You
can use the log file to troubleshoot jobs that fail during validation or a run.
Various kinds of entries are written to the log during each of these operations.
The event window shows the events that are logged for a job during its run.
The job log contains the following columns: Occurred, On date, Type, and Event.
Double-click on an event to open the Event Detail window. This window gives
you more information.
When an active stage finishes, DataStage logs an informational message that
describes how many rows were read into the stage and how many were written.
This provides you with valuable information that can indicate possible errors.
The Monitor can be used to display information about a job while it is running.
To start the Monitor, click Tools>New Monitor. Once in Monitor, click the right
mouse button and then select Show links to display information about each of the
input and output links.
When you are testing a job, you can save time by limiting the number of rows and
warnings.
DataStage provides a debugger for testing and debugging your job designs. The
debugger runs within Designer. With the DataStage debugger you can set
breakpoints on links, watch column values, and step through a job link by link
or row by row.
The debugger toolbar contains these buttons: Go, Next Link, Next Row,
Job Parameters, Edit Breakpoints, Toggle Breakpoint, Clear Breakpoints,
View Job Log, and Debug Window.

Button              Description
Go                  Start/continue debugging.
Next Link           Stop at the next link that data is written to.
Next Row            Stop at the next link with a breakpoint that the current row is written to.
Job Parameters      Set values for the job's parameters for the debug run.
Edit Breakpoints    Open the Edit Breakpoints window.
Toggle Breakpoint   Set or clear a breakpoint on the selected link.
Clear Breakpoints   Remove all breakpoints.
View Job Log        Display the job log.
Debug Window        Show or hide the Debug Window.
To set a breakpoint on a link, select the link and then click the Toggle
Breakpoint button. A black circle appears on the link.
Click the Edit Breakpoints button to open the Edit Breakpoints window.
Existing breakpoints are listed in the lower pane.
To set a condition for a breakpoint, select the breakpoint and then specify the
condition in the upper pane. You can either specify the number of rows to pass
before breaking or specify an expression that triggers the break when it is true.
The top pane lists all the columns defined for all links.
The Local Data column lists the data currently in the column.
The Current Break box at the top of the window lists the link where
execution stopped.
To add a column to the lower pane (where it is isolated), select the column
and then click Add Watch.
If a breakpoint is set, execution stops at that link when a row is written to the
link.
Next Row extracts a row of data and stops at the next link with a breakpoint
that the row is written to.
For example, if a breakpoint is set on the MexicoCustomersOut link,
execution stops at the MexicoCustomersOut link when a Mexican
customer is read.
If a breakpoint is not set on the MexicoCustomersOut link, execution
will not stop at the MexicoCustomersOut link when a Mexican customer
is read.
Execution will stop at the CustomersIn link (even if there is no
breakpoint set on it) because all rows are read through that link.
Next Link stops at the next link that data is written to.
Module 10: Defining Lookups
A hashed file is a file that distributes records in one or more evenly-sized groups
based on a primary key. The primary key value is processed by a "hashing
algorithm" to determine the location of the record.
The number of groups in the file is referred to as its modulus.
In this example, there are 5 groups (modulus 5).
Hashed files are used for reference lookups in DataStage because of their fast
performance. The hashing algorithm determines the group the record is in. The
groups contain a small number of records, so the record can be quickly located
within the group.
If write caching is enabled, DataStage does not write hashed file records directly
to disk. Instead it caches the records in memory, and writes the cached records to
disk when the cache is full. This improves performance. You can specify the
size of the cache on the Tunables tab in Administrator.
To create and load a hashed file, create a job that has the hashed file stage as a
target.
For example, here's a simple job that will create and load the StoresHashed
hashed file, which will contain a list of stores and their addresses keyed by
stor_id.
Loading a hashed file with data is similar to loading a sequential file with data.
Properties for the hashed file stage are used to provide the physical location of
the hashed file. If you select the Use account name checkbox and leave the
account name box blank, the hashed file is created in the same project in
which the job creating it is executed. This provides flexibility for jobs that are
moved from development to production environments. Alternatively, you can opt
to specify the exact directory in which the hashed file will be created; however, if
you place the file outside the area controlled by the DataStage engine (the project),
you will not be able to back up the file using the DataStage Manager export
project function.
Use the Key checkboxes to identify the key columns. If you don't, the
first column definition is taken as the hashed file's key field. The remaining
columns dictate the order in which data will be written to the hashed file. Do not
reorder the column definitions in the grid unless you are certain you understand
the consequences of your action.
Once you have created a hashed file (using Director) you may want to import that
hashed file's metadata. Like all metadata imports, this is performed in DataStage
Manager (Import > Table Definitions > UniVerse File Definitions). Note that
hashed files are known as UniVerse File Definitions.
A hashed file can be read in two ways: as a stream or as a lookup.
Your job can delete a hashed file and then recreate it. To delete the file and then
recreate it, select the Delete file before create box in the Create file options
window on the hashed file target stage.
To delete a hashed file without recreating it in a job, you can execute the
DELETE.FILE command.
To execute this command, log onto Administrator, select the project (account)
containing the hashed file, and then click Command to open the Command
Interface window. In the Command box, type DELETE.FILE followed by the
name of the hashed file. Then click Execute.
The DELETE.FILE command can also be executed in a Before/After Routine.
(Before/After Routines are discussed in a later module.)
Click the right mouse button over the hashed file key column and select Edit Key
Expression.
You can drag input columns to the key column, just as when defining derivations
for target columns.
Output from the lookup file is mapped to fields in the target link.
If the lookup fails (the result of the expression is not found in the hashed file),
NULLs are returned in all the lookup link columns.
You can test for NULLs in a derivation to determine whether the lookup
succeeded.
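A derivation can branch on the lookup result using the ISNULL function; a sketch, with illustrative link and column names (StoresLookup, stor_name):

```basic
IF ISNULL(StoresLookup.stor_name) THEN "NO MATCH" ELSE StoresLookup.stor_name
```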
Module 11: Aggregating Data
The data sources you're extracting data from can contain many thousands of rows
of data. You can summarize groups of rows in each column using the functions
listed below.
The available functions are: Minimum, Maximum, Count, Sum, Average, and
Standard deviation.
In this example, we will determine the average sales amount for each product
sold.
The transformer performs some initial calculations on the data. For instance,
the sales amount for each order (qty * price) is calculated.
Calculations can't be defined in the aggregator stage.
The aggregator stage can have at most one input link, and it can't be a
reference link.
True or False? Suppose you want to aggregate over derived values. For
example, you want to SUM(qty * unit_price). You can perform this
derivation within the Aggregator stage.
True: Incorrect. You cannot perform derivations within the Aggregator stage. If
you want to aggregate derived values, perform the derivation in an output
column in a prior Transformer stage. Then aggregate over that incoming column
in the Aggregator stage.
False: Correct! You cannot perform derivations within the Aggregator stage. If
you want to aggregate derived values, perform the derivation in an output
column in a prior Transformer stage. Then aggregate over that incoming column
in the Aggregator stage.
Select the column(s) to group by. You will not be able to specify an aggregate
function for the group by column(s).
Module 12: Job Control
Job parameters allow you to design flexible, reusable jobs. If you want to process
data based on a particular file, file location, time period, or product, you can
include these settings as part of your job design. However, if you do this, when
you want to use the job again for a different file, file location, time period, or
product, you must edit the design and recompile the job.
Instead of entering inherently variable factors as part of the job design, you can
set up parameters which represent processing variables. When you run or
schedule a job with parameters, DataStage prompts for the required information
before continuing.
Job parameters can be used in many places in DataStage Designer.
Recall this job. Customers from different countries are written out to separate
files. The problem here is that the countries are hard-coded into the job design.
What if we want a file containing, for example, Canadian customers? We can add
a new output stage from the transformer and define a new constraint. Then
recompile and run the job.
A more flexible method is to use a parameter in the constraint in place of a
specific country string such as "USA". Then at runtime, the user can specify
the particular country.
To define job parameters for a job, open the job in Designer and then click
Edit>Job Properties. Click the Parameters tab on the Job Properties window.
Once a job parameter has been defined it can be used in various components of a
job design to add flexibility. Candidate uses for a parameter are enumerated on
this slide.
Two methods are used to reference a job parameter:
If the value will be used in DataStage-specific functions (such as a value
used in a constraint or derivation), simply supply the name of the
parameter.
If the value will be used in system functions (such as the location of a file),
the name of the parameter should be enclosed in # marks.
Since the file name is a value that is passed to the operating system for handling,
the parameter is enclosed within # marks.
Since the value of the parameter will be used within a DataStage function (a
constraint), the parameter name is used without enclosing # marks. In this
example the developer simply right-clicked where the parameter should be
placed, chose Job parameter, and will select from the dropdown list of
parameters available in this job.
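Side by side, the two reference styles might look like this for hypothetical parameters FilePath and CountryParam (the stage, link, and column names are illustrative):

```basic
* In a Sequential File stage's file name (passed to the operating system,
* so the parameter name is enclosed in # marks):
* #FilePath#customers.txt

* In a Transformer constraint (a DataStage expression, so no # marks):
CustomersIn.country = CountryParam
```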
True or False? When job parameters are used in passive stages such as
Sequential File stages, they must be surrounded with pound (#) signs.
True: Correct! You must surround the name of the job parameter with pound
signs. Otherwise, DataStage won't recognize it as a job parameter.
False: Incorrect. You must surround the name of the job parameter with pound
signs. Otherwise, DataStage won't recognize it as a job parameter.
Before and after routines are DS routines that run before or after a job and before
or after a transformer. DS Before/After routines are defined in Manager. Three
built-in Before/After routines are supplied with DataStage: ExecDOS, ExecSH,
ExecTCL. These routines can be used to execute Windows DOS, UNIX, and
UniVerse commands, respectively. The command, together with any output, is
added to the job log as an informational message.
You can also define custom Before/After routines. They are similar to other
routines except that they have only two arguments: an input argument and an
error code argument.
DataStage is supplied with a number of functions you can use to control jobs and
obtain information about jobs. For detailed information about these functions, see
Job Control in Help.
These functions can be executed in the Job control tab of the Job Properties
window, within DS routines, and within column derivations.
These functions include:
DSAttachJob
DSSetParam
DSSetJobLimit
DSRunJob
DSWaitForJob
DSGetProjectInfo
DSGetJobInfo
DSGetStageInfo
DSGetLinkInfo
DSGetParamInfo
DSGetLogEntry
DSGetLogSummary
DSGetNewestLogId
DSLogEvent
DSLogInfo
DSStopJob
DSDetachJob
DSSetUserStatus
The job control routines and other BASIC statements written in the Job control
tab are executed after the job in which they are defined runs. This enables you to
run a job that controls other jobs. In fact this can be all the job does.
For example, suppose you want a job that first loads a hashed file and then uses
that hashed file in a lookup. You can define this as a single job. Alternatively,
you can define this as two separate jobs (as we did earlier) and then define a
master controlling job that first runs the load and then runs the lookup.
Create an empty job and then click Edit>Job Properties. Click the Job control
tab. Select the jobs you want to run one at a time in the Add Job box and then
click Add. The job control functions and other BASIC statements are added to
the edit box. Add and modify the statements as necessary.
In this example:
DSRunJob is used to run the load job.
DSWaitForJob waits for the job to finish. You don't want the lookup to be
performed until the hashed file is fully loaded.
DSGetJobInfo gets information about the status of the job. If an error occurs the
job is aborted before the lookup job is run.
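The steps above can be sketched in DataStage BASIC job control code. The job name and message text are hypothetical; the DSJ./DSJS. constants are the standard job control constants, and DSLogFatal (which logs a fatal message and aborts the controlling job) is an assumption not listed on the previous slide.

```basic
* Attach, run, and wait for the hashed-file load job.
hJob = DSAttachJob("LoadStoresHashed", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)

* Check the finishing status before allowing the lookup job to run.
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status <> DSJS.RUNOK Then
   Call DSLogFatal("Load job failed - aborting", "JobControl")
End
```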
The Job Sequencer enables you to graphically create controlling jobs, without
using the job control functions. Job control code is automatically generated from
your graphical design.
Job Sequences resemble standard DataStage jobs. They consist of stages and
links, like DataStage jobs. However, it is a different set of stages and links.
Among the stages are Job Activity stages, which are used to run DataStage jobs.
Links are used to specify the sequence of execution. Triggers can be defined on
these links to specify the condition under which control passes through the link.
There are other Activity stages as well.
Here is an example of a Job Sequence. The stages are Job Activity stages. The
first stage validates a job that loads a lookup hashed file. The second stage runs
the job, if the validation succeeded. The third stage runs a job that does a lookup
from the hashed file.
The links execute these three stages in sequence. Triggers are defined on each of
the links, so that control is passed to the next stage only if the previous stage
executed without errors.
To create a new Job Sequence click the New button and then select Job
Sequence.
This shows a Job Activity stage. Select the job to run in the Job name box.
Select how you want to run it in the Execution action box.
The Parameters box lists all the parameters defined for the job. Select a
parameter and then click Insert Parameter Value to specify a value to be passed
to the parameter.
Triggers specify the condition under which control passes through a link. Select
the type of trigger in the Expression Type. The types include:
Otherwise: Pass control if none of the other triggers on the stage's links fire.
UserStatus: Pass control if the User Status variable contains the specified
value. The User Status variable can be set in a job or Routine using the
DSSetUserStatus job control function.
True or False? Triggers can be defined on the Job Activity Triggers tab for
each Input link.
True: Incorrect. Triggers are defined on Output links. They determine whether
execution will continue down the link.
False: Correct! Triggers are defined on Output links. They determine whether
execution will continue down the link.
This shows the components that make up an example Container. The same job
components are used with the exception of the Container Input stage, shown on
the left, and the Container Output stage, shown on the right.
This shows a job with a Job Container (the stage in the middle). Data is passed
into the Container from the link on the left. Data is retrieved from the Container
in the link on the right. The Container processes the data using the set of stages
and links it is designed with.
Module 13
A plug-in is a custom-built stage (active or passive) that you can install and use in
DataStage in addition to the built-in stages. Plug-ins provide additional
functionality without the need for new versions of DataStage to be released.
Plug-ins can be written in either C or C++. Sample code is loaded in the /sample
directory when DataStage is installed.
A number of plug-ins are provided by Ascential. These include:
Plug-in stages written to the DataStage C API may also be available from
third-party vendors.
Once a plug-in is installed you can use it in your jobs just as you can the built-in stages.
Documentation for the plug-ins that come with DataStage is provided in PDF
format on the DataStage installation CD.
In addition, open the plug-in in Manager. The Stage Type window provides a
variety of information in the four tabs:
Plug-in dependencies.
Plug-in properties.
Most of what you need to do when you use a plug-in in a job is to set its
properties correctly. Plug-ins provide online documentation for each property
when you open the Properties tab in Designer.
The sort plug-in can have one input link and one output link. The input link
specifies the records of data to be sorted. The output link outputs the data in
sorted order.
This lists the main tasks involved in defining a sort using the DataStage Sort plug-in.
Stage tab: On the Properties sub-tab, you set the properties that define the sort.
Outputs tab: Specify the format of the data after it's sorted.
True or False? Job parameters can be used in the Sort plug-in stage.
True: Correct! As with job parameters in Sequential stages, surround job
parameters with # signs.
False: Incorrect. As with job parameters in Sequential stages, surround job
parameters with # signs.
Module 14
Each job can be scheduled to run on any number of occasions and can be run with
different job parameter values on the different occasions.
Jobs run on the DataStage server under the user name specified on the Schedule
tab in Administrator. If no user name is specified, it runs under the same name as
the Windows NT Schedule service.
If DataStage is running on Windows NT, DataStage uses the Windows NT
Schedule service to schedule jobs. If you intend to use the DataStage scheduler,
be sure to start or verify that the Windows NT Scheduler service is running.
To start the NT Scheduler, open the Windows NT Control Panel and then open
the Services icon. You can then manually start the service or set it to start
automatically.
True or False? When a scheduled job runs, it runs under the user ID of the
person who scheduled it.
True: Incorrect. When a user manually runs a job in Director, the job runs under
the user ID of the person who manually started it. When a scheduled job runs, it
runs under the user ID specified in Administrator.
False: Correct! When a user manually runs a job in Director, the job runs under
the user ID of the person who manually started it. When a scheduled job runs, it
runs under the user ID specified in Administrator.
In addition to simple reports you can generate in Designer and Director using
File>Print, DataStage provides a flexible and powerful reporting tool. The
DataStage Reporting Assistant is invoked from DataStage Manager. You can
generate reports at various levels within a project, including:
Entire project
Selected jobs
True or False? The DataStage Reporting Assistant stores the data it uses in
its reports in an ODBC database.
True: Correct! This data source is set up on your client machine when the
DataStage clients are installed on your machine.
False: Incorrect. This data source is set up on your client machine when the
DataStage clients are installed on your machine.
Module 15
Collection of performance statistics for a particular job run is controlled from the
DataStage Director client. Some overhead is consumed by the collection; therefore,
job component percentages may not sum to 100% for every job run.
Statistics are written to the job log and may be viewed as long as that log is
preserved.
The collection of performance statistics can be turned on and off for each active
stage in a DataStage job. This is done on the Tracing tab of the Job Run Options
dialog box: select the stage you want to monitor and select the Performance
statistics check box. Use Shift-click to select multiple active stages to monitor
from the list.
The first pane of the above frame contains a sample of a job log. When
performance tracing is turned on, a special log entry is generated immediately
after the stage completion message. It is identified by the first line
job.stage.DSD.StageRun Performance statistics.
The second pane contains the detailed view of the statistics message, displayed
in tabular form. You can copy these figures and paste them into a spreadsheet for
further analysis.
The performance statistics relate to the per-row processing cycle of an active stage,
and of each of its input and output links. The information shown is:
Percent. The percentage of overall execution time that this part of the
process used.
Count. The number of times this part of the process was executed.
Minimum. The minimum elapsed time in microseconds that this part of the
process took for any of the rows processed.
Average. The average elapsed time in microseconds that this part of the
process took for the rows processed.
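As a rough illustration of how these four figures relate, here is a minimal Python sketch that computes them from a list of per-row elapsed times in microseconds. The function name and input format are assumptions for illustration, not the DataStage log format:

```python
def link_stats(times_us, total_us):
    """Summarize per-row elapsed times the way the statistics table does:
    Percent of overall execution time, Count, Minimum and Average.
    times_us: elapsed time in microseconds for each row processed;
    total_us: overall execution time of the job run."""
    return {
        "percent": 100.0 * sum(times_us) / total_us,  # share of total time
        "count": len(times_us),                       # rows processed
        "minimum": min(times_us),                     # fastest row
        "average": sum(times_us) / len(times_us),     # mean per row
    }
```

For example, three rows taking 10, 20 and 30 microseconds in a 120-microsecond run would report 50 percent, a count of 3, a minimum of 10 and an average of 20.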
You need to take care interpreting these figures. For example, when in-process
active-stage-to-active-stage links are used, the percent column will not add up to
100%. Also be aware that, in these circumstances, if you collect statistics for the
first active stage, the entire cost of the downstream active stage is included in the
active-to-active link (as shown in our example diagram). This distortion remains
even when you run the active stages in different processes (by enabling
inter-process row buffering) unless you are actually running on a multi-processor system.
You can improve the performance of most DataStage jobs by turning in-process row
buffering on and recompiling the job. This allows connected active stages to pass
data via buffers rather than row by row.
You can turn in-process row buffering on for the whole project using the DataStage
Administrator. Alternatively, you can turn it on for individual jobs via the
Performance tab of the Job Properties dialog box.
Note: You cannot use in-process row-buffering if your job uses COMMON blocks
in transform functions to pass data between stages. This is not recommended
practice, and it is advisable to redesign your job to use row buffering rather than
COMMON blocks.
When you design a job you see it in terms of stages and links. When it is
compiled, the DataStage engine sees it in terms of processes that are subsequently
run on the server.
How does the DataStage engine define a process? It is here that the distinction
between active and passive stages becomes important. Active stages, such as the
Transformer and Aggregator, perform processing tasks, while passive stages, such
as Sequential file stage and ODBC stage, are reading or writing data sources and
provide services to the active stages. At its simplest, active stages become
processes. But the situation becomes more complicated where you connect active
stages together, and passive stages together.
What happens when you have a job that links two passive stages together?
Obviously there is some processing going on. Under the covers DataStage inserts a
cut-down transformer stage between the passive stages, which just passes data
straight from one stage to the other, and becomes a process when the job is run.
What happens where you have a job that links two or more active stages together?
By default this will all be run in a single process. Passive stages mark the process
boundaries, all adjacent active stages between them being run in a single process.
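The boundary rules above can be illustrated with a small Python sketch. It treats a job as a linear sequence of stages tagged active or passive (a deliberate simplification of a real stage graph, for illustration only, not how the engine is implemented):

```python
def count_processes(stages):
    """Count processes for a linear job: each maximal run of adjacent
    active stages between passive stages becomes one process, and a
    passive-to-passive link gets one implicit cut-down transformer."""
    processes, in_run = 0, False
    for i, kind in enumerate(stages):
        if kind == "active":
            if not in_run:          # a new run of active stages starts
                processes += 1
            in_run = True
        else:
            # passive stage: closes any active run; two adjacent passive
            # stages get a cut-down transformer process between them
            if not in_run and i > 0 and stages[i - 1] == "passive":
                processes += 1
            in_run = False
    return processes
```

Under this sketch, passive-active-passive is one process, two adjacent active stages still form one process, and passive-active-passive-active-passive forms two, matching the Sequential-file example that follows.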
This job comprises two processes because of the second Sequential stage.
Data pipelining requires the same transform on all partitions; this can be easily
accomplished using containers.
Data pipelining and data partitioning can occur simultaneously within a job.
If you split processes in your job design by writing data to a Sequential file and
then reading it back again, you can use an Inter Process (IPC) stage in place of the
Sequential stage. This will split the process and reduce I/O and elapsed time as the
reading process can start reading data as soon as it is available rather than waiting
for the writing process to finish.
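The benefit can be illustrated with a small Python sketch using a bounded in-memory buffer between a writer thread and a reader thread. This is an analogy for the IPC stage's behavior, not its implementation:

```python
import queue
import threading

def pipeline(rows, transform):
    """Sketch of IPC-style pipelining: the writer puts rows into a bounded
    buffer while the reader consumes and transforms them concurrently,
    instead of waiting for a complete intermediate file to be written."""
    buf = queue.Queue(maxsize=2)  # small bounded buffer, like the IPC blocks
    out = []

    def writer():
        for row in rows:
            buf.put(row)          # blocks when the buffer is full
        buf.put(None)             # end-of-data marker

    def reader():
        while (row := buf.get()) is not None:
            out.append(transform(row))

    w = threading.Thread(target=writer)
    r = threading.Thread(target=reader)
    w.start(); r.start()
    w.join(); r.join()
    return out
```

Because the buffer is bounded, the reader starts consuming rows as soon as the first ones are available, which is the elapsed-time saving the IPC stage provides.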
Normally this would be a single process, but with the introduction of the IPC stage
this job will split the read operation into one process and the transform and write
into another process. Meta data must be the same on both the input and output links
to the IPC stage.
The Properties tab allows you to specify two properties for the IPC stage:
Buffer Size. Defaults to 128 KB. The IPC stage uses two blocks of memory; one
block can be written to while the other is read. This property defines the size of
each block, so by default 256 KB is allocated in total.
Timeout. Defaults to 10 seconds. This specifies a time limit for how long the
stage will wait for a process to connect to it before timing out. This normally will
not need changing, but may be important where you are prototyping multiprocessor jobs on single processor platforms and there are likely to be delays.
This example shows the partitioner stage depicted in the previous frame. Note:
meta data on the output and input links must be identical.
Algorithms:
Round-Robin. This is the default method. Using the round-robin method the
stage will write each incoming row to one of its output links in turn.
Random. Using this method the stage will use a random number generator to
distribute incoming rows evenly across all output links.
Hash. Using this method the stage applies a hash function to one or more input
column values to determine which output link the row is passed to.
Modulus. Using this method the stage applies a modulus function to an integer
input column value to determine which output link the row is passed to.
Partitioning Key. This property is only significant where you have chosen a
partitioning algorithm of Hash or Modulus. For the Hash algorithm, specify one or
more column names separated by commas. These keys are concatenated and a hash
function applied to determine the destination output link. For the Modulus
algorithm, specify a single column name which identifies an integer numeric
column. The value of this column determines the destination output link.
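To make the algorithms concrete, here is a minimal Python sketch of three of them (round-robin, hash, and modulus). The hash function below is a toy stand-in for illustration; the stage's actual hash function is not documented here:

```python
import itertools

def round_robin(rows, n):
    """Default algorithm: write each incoming row to the output links in turn."""
    parts = [[] for _ in range(n)]
    for link, row in zip(itertools.cycle(range(n)), rows):
        parts[link].append(row)
    return parts

def hash_partition(rows, n, keys):
    """Hash algorithm: concatenate the key columns and hash the result to
    pick a link, so rows with equal keys always land on the same link."""
    parts = [[] for _ in range(n)]
    for row in rows:
        concat = "|".join(str(row[k]) for k in keys)
        parts[sum(concat.encode()) % n].append(row)  # toy hash function
    return parts

def modulus_partition(rows, n, key):
    """Modulus algorithm: an integer key column modulo the link count."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts
```

Note the property the hash and modulus methods guarantee: rows with the same key value always go down the same output link, which matters when downstream stages aggregate by that key.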
The Collector stage is an active stage, which takes up to 64 inputs and allows you
to collect data from these links and route it along a single output link. The stage
expects the output link to use the same meta data as the input links.
The Collector stage can be used in conjunction with a Partitioner stage to enable
you to take advantage of a multi-processor system and have data processed in
parallel. The Partitioner stage partitions the data, which is processed in parallel;
the Collector stage then collects it together again before writing it to a single
target.
The Properties tab allows you to specify two properties for the Collector stage:
Collection Algorithm. Use this property to specify the method the stage uses to
collect data. Choose from:
Round-Robin. This is the default method. Using the round-robin method the
stage will read a row from each input link in turn.
Sort/Merge. Using the sort/merge method the stage reads multiple sorted inputs
and writes one sorted output.
Sort Key. This property is only significant where you have chosen a collecting
algorithm of Sort/Merge. It defines how each of the partitioned data sets is known
to be sorted and how the merged output will be sorted.
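The two collection methods can be sketched in Python as follows; the function names are illustrative, and the sort/merge version leans on the standard library's merge of pre-sorted inputs:

```python
import heapq
import itertools

def collect_round_robin(inputs):
    """Default algorithm: read a row from each input link in turn."""
    rows = []
    for group in itertools.zip_longest(*inputs):
        rows.extend(r for r in group if r is not None)
    return rows

def collect_sort_merge(inputs, key):
    """Sort/Merge: each partitioned input is already sorted on the key;
    merge them into a single sorted output."""
    return list(heapq.merge(*inputs, key=key))
```

The sketch shows why the Sort Key matters: sort/merge only produces sorted output if every input link is already sorted on that same key.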
The partitioner stage can support up to 64 output links and the collector can support
up to 64 input links. Meta data should be identical on input and output links to the
partitioner; similarly for the collector stage.
This configuration can also use a container in place of the transformer.
Module 16
Our final application will use a business data mart for source data. Data will be
extracted, undergo little or no transformation, and then be summarized and loaded
into a target database. Little or no transformation is needed because the data was
already cleansed before being loaded into the data mart.
The existing data mart consists of a star schema structure comprised of a sales fact
table surrounded by promotion, store, product, and time dimensions. The time
dimension has been renamed to timex because of naming convention limitations
within Microsoft Access; similarly, the date field has been renamed to datex.
A star schema, such as the one depicted above, is an ideal data structure for ad hoc
queries. Many vendor tools are available in the marketplace to support this type of
query building.
Note that the dimensions are linked to the fact table by surrogate keys. We will use
the surrogate keys to build DataStage jobs that denormalize the data.
The desired result for end users is a table, or set of tables, that summarizes the
data by specific dimensions. This video will briefly demonstrate the actions a
typical end-user might employ with the summary tables.