
Training course: DataStage (part 1)

V. BEYET
03/07/2006

1
Presentation

Who am I?

Who are you?

2
Summary

 General presentation (DataStage: what is it?)

 DataStage: how to use it?

 The other components (part 2)

3
General presentation

 DataStage: what is it?

 An ETL tool: Extract, Transform, Load

 A graphical environment

 A tool integrated into a suite of BI tools

 Developed by Ascential (now IBM)

4
General presentation

 DataStage: why use it?

 Large data volumes

 Multi-source and multi-target:
 files,
 databases (Oracle, SQL Server, Access, …).

 Data transformation:
 select,
 format,
 combine,
 aggregate,
 sort.

5
General presentation

 DataStage: how does it work?

 Development is done:
 in client-server mode,
 with a graphical design of flows,
 with simple, basic elements,
 with a simple language (BASIC).

 Jobs are:
 compiled and run by an engine,
 stored in a UniVerse database.

6
General presentation

The different tools:

 Designer
 Manager
 Director
 Administrator
 Server
7
General presentation
Server

 The server contains programs and data.

 The programs:
 called Jobs: first as source code and then as executable programs, stored in the UniVerse database.
 The source code is not meant to be read directly.

 Data:
 may be stored in the UniVerse database, but is better kept in server directories.

8
General presentation
Server

 What is a Project for DataStage?

 A server is organized into separate environments called "Projects"
 A Project is an isolated environment for jobs, table definitions and routines
 A Project can be created at any time
 The number of projects is unlimited
 The number of jobs is unlimited for each project
 But the number of simultaneous client connections is limited

9
General presentation
Server

UniVerse database:

 The UniVerse database is a relational database based on files

 Tables are called "Hash Files"

 A Hash File is an indexed file; it is the central element for using all the possibilities of the DataStage engine.

 A Hash File with incorrectly defined keys can cause serious problems.

10
Summary

 General presentation (DataStage: what is it?)

 DataStage: how to use it?

 The other components (part 2)

11
Designer
The designer

 The Designer is used to design jobs (note the icon)

The jobs are composed of "Stages":

 active stages: actions

 passive stages: data storage

Links: connections between the stages

12
Designer
The designer
Passive stages: a place for data storage (the data flows from the stage or to the stage)

Text File: sequential file

Hash File: can be handled only by DataStage (not by WordPad, …), but simultaneous access to a Hash File is possible.

UV Stage: the file is stored in the UniVerse core (DataStage engine).

ODBC, OLEDB, ORAOCI stages: representation of a database; they allow direct access to a database through an ODBC link.
13
Designer
The designer

Active stages
An active stage is a representation of a transformation applied to the dataflow:

Sort: sorts a file

Aggregator: calculations

Transformer: selection, transformation, transport of properties

…

14
Designer
The designer

Links:

between active and passive stages

between passive stages

between active stages

15
Designer
The designer

 A job in the designer


Active Stage Passive Stage

16
Designer
The designer

DataStage Designer:
Each job has:
- one or more sources of data
- one or more transformations
- one or more destinations for the data
The toolbar contains the stage icons used to design
the jobs.
Jobs have to be compiled to create executable
programs.

17
Designer
The designer

To compile the job

To run the job

The repository

The toolbar with stage icons (palette)

18
Designer
The designer

Let's study now the different stages:

Sequential Files (text files)
Transformer
Hash Files
Sort
Aggregator
Routines
UV Stages

19
Designer
The designer

Sequential File stage:

 Can be read,
 Can be written,
 Can be read and written in the same job,
 Can be written cached or not,
 Can be a DOS file or a Unix file, …
 Can be read by two jobs at the same time,
 Can't be written by two jobs at the same time.

20
Designer
The designer

Sequential File :

Stage name

File Type

Stage description

21
Designer
The designer

Sequential File :

Output link

Stage name (to be written)

22
Designer
The designer

Sequential File :

Data format (output file)

Always use these values

23
Designer
The designer

Sequential File:

Different columns of the file (output): type, length
Size to display (for View Data)
To test the connection and view the data in the file

24
Designer
The designer

Sequential File :

To describe a file easily: use or create a "table definition"

Group your table definitions by application

Create or modify the table definitions (for files, databases, transformers, …)

25
Designer
The designer

Sequential File :

Then it can be used in different jobs (click on Load to find the right definition).

26
Designer
The designer

Sequential File :

View Data

27
Designer
The designer

Transformer Stage :

 Multi-source and multi-target,
 Waits for the availability of the source data,
 Performs lookups between two flows (reference),
 Transforms or propagates the data of each flow,
 Allows you to select, filter, and create a rejects file.

28
Designer
The designer

Transformer Stage :

Can process data using:
 native BASIC functions or functions created in the Manager,
 DataStage functions or DataStage macros,
 routines (before/after type),
 or can simply propagate columns.

29
Designer
The designer

Transformer Stage :

Output data
Input data

Right click :
propagate all
the columns

30
Designer
The designer

Transformer Stage :

Output data

Input data

31
Designer
The designer

Exercise n°1:
Objective: read a sequential file and create a new one (save the file).

The catalogue.in file has to be read and the catalogue_save.tmp file has to be written.

Source file: catalogue.in (in the \in directory)
Target file: catalogue_save.tmp (in the \tmp directory)

Steps:
1- Create a table definition (structure of the Catalogue table)
2- Design the job with 2 Sequential Files and 1 Transformer
3- Create the links (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)

32
Designer
The designer

Transformer Stage :

Look at the performance of your job:

Right click on the grid and then select "Show performance statistics"

33
Designer
The designer

Create the parameters of the job: menu Edit > Job Properties, Parameters tab.

34
Designer
The designer

Exercise n°2:

Objective: use job parameters

- create a job parameter: directory
- use it in all the paths of the job from the first exercise (example: #directory#\tmp),
- compile,
- modify your input file (add your best film),
- run with a different path (other groups).

35
Designer
The designer

Hash File stage:

 Necessary for a lookup.
 A Hash File is entirely written before it can be read (the FromTrans link must be finished before FromFilmTypeHF can start).
 Allows grouping of multiple records with the same key (suppresses duplicate keys).
 Can be read by different jobs simultaneously.
 Can be written by different links simultaneously (in the same job or in different jobs).

36
Designer
The designer

Hash File :

Stage name

Account name
(DataStage project)

File path

37
Designer
The designer

Hash File (for files to write):

File name

Select this check box to specify that all records should be cached, rather than written to the hashed file immediately. This is not recommended when your job writes to and reads from the same hashed file in the same stream of execution.

38
Designer
The designer

Hash File :

A key must be defined (it can be a single or a composite key)

39
Designer
The designer

Transformer stage: lookup

• The main flow can be of any type.
• The secondary flow must use a Hash File to design a lookup (so very often you will have to design a temporary Hash File).
• The lookup is done with the key of the secondary flow.
• The number of records in the main flow can't be higher after the lookup than before it.
• The lookup is shown with a dotted line.
• When a lookup is "exclusive", the number of records after the lookup is smaller than the number of records before the lookup.

40
Designer
The designer

Transformer Stage : Lookup


Reference Flow
(vertical flow)

Principal Flow
(horizontal)

41
Designer
The designer

Exercise n°3:
Objective: make a lookup between the Catalogue file and the FilmType file to put the film type in the output file.

Source file: catalogue.in (in the \in directory)
Target file: catalogue.out (in the \out directory)

Steps:
1- Create a table definition (structure of the FilmType table)
2- Modify your job to create a Hash File from the FilmType.in file
3- Create the link to show the lookup (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)

42
Designer
The designer

Exercise n°4:
Objective: put the director name and the film name together, separated by a ">". If the film type is not found, put "unknown type" in the output file. What happens when the director name is empty? Find a solution.

43
Designer
The designer

Exercise n°5:
Objective: if the film type is not found (use a constraint), put the film in a rejects file (first a Sequential File and then a Hash File).

44
Designer
The designer

 Lookup stage with selection (exclusive lookup)

Don't forget: a lookup can be designed with an ORAOCI stage or a UV stage, but it performs much better with Hash Files.

45
Designer
The designer

Exercise n°6:
Objective: select only the films for which the type is known (that means the lookup succeeds).

46
Designer
The designer

Exercise n°7:
Objective: select all the clients who are female and put them in an output file.
The SEXE column contains M (Male) or F (Female).

Then create an annotation for this job (all jobs must have annotations).

47
Director
The Director

 The Director is the job controller; it allows you to:

 Run jobs
Immediately or later, with more options than in the Designer.

 Control job status
Status: Compiled, Running, Aborted, Validated, Failed validation, …

 Monitor jobs
To control the number of lines processed by each active stage of a job.

48
Director
The Director

Run jobs with the Director:

Select the job and click here

Then enter the parameters

49
Director
The Director

To run a job later: click here

Then choose the date and time

50
Director
The Director

To modify the running parameters of a job: Limits tab

Rows limit: the job stops after x rows (on each flow)
Warnings limit: the job stops after x warnings

51
Director
The Director

Verify the status of jobs with the Director.

The statuses:
• "Not compiled"
• "Compiled"
• "Failed validation"
• "Validated ok"
• "Aborted"
• "Finished"
• "Running"

52
Director
The Director

Example: list of jobs

Buttons: view the log, run jobs, reset job status, run jobs later, stop jobs.

53
Director
The Director

Example of a Monitor: the Monitor allows you to follow the different stages of a job. See the importance of good names for the stages and the links!

For each step:
 the number of processed lines (input and output)
 the beginning time
 the execution duration (elapsed time)
 the status
 the performance (rows/sec)

Link types:
 Pri: principal flow
 Ref: reference flow (lookup)
 Out: output flow

54
Director
The Director

Example of a log:
To look at error messages, choose the job and click on the "Log" button.

Green: OK, no problem
Yellow: warning
Red: blocking problem

Don't forget: clear the log from time to time (Job > Clear log).

55
Manager
The Manager

 The Manager is the tool to export/import elements from one DataStage project to another.
File > Open Project to change project. To import or export elements, click on the appropriate button.

All the elements:
• jobs
• routines
• table definitions
are classified in categories, but each name must be unique within a project.

Drag and drop an element to change its category.

56
Manager
The Manager

 EXPORT: choose what you want to export (creates a .dsx file)

To append to an existing file

To change the selection options:
- by category
- by individual components

• Jobs
• Table definitions
• Routines (always check the "Source Code" box)

57
Manager
The Manager

 IMPORT: choose what you want to import.
This will create/modify elements in the DataStage project.

Make your choice

58
Manager
The Manager

With the Manager, you can compile many jobs at the same time (multiple job compile):

Tools > Run multiple job compile

select the type of jobs you want to compile, select "Show manual selection page" and click on the "Next" button,

select the jobs and click on the "Next" button,

then click on the "Start compile" button.

59
Designer
The designer

Sort stage:

Sorting criteria are filled in on the Stage tab / Properties tab.

Modify those parameters if the file to sort has many lines.

60
Designer
The designer

Exercise n°8:
Objective: once you have selected all the women, sort the file in alphabetical order.

61
Designer
The designer

Aggregator stage:

- Aggregates data into a smaller number of records,
- Intermediate processing is executed in memory,
- Allows execution of a before/after routine (before the stage processing, or after it when all the lines have been processed),
- Performance is better if the data is sorted (Input tab),
- The Aggregator does not sort the records.

62
Designer
The designer

Aggregator stage: Input tab

When input data is sorted

63
Designer
The designer

Aggregator Stage : Output tab

Group by

Different
functions

64
Designer
The designer

Exercise n°9:

Objective: create a job which reads location.in and calculates the hit parade of the most hired cassettes (ordered by number of hires, descending). Also put the name of the film, not only the number of the cassette (lookup with catalogue.in).

65
Designer
The designer

Exercise n°10:

Objective: create a job which reads location.in and calculates the average number of hires for each cassette (2 different methods can be used).

66
Designer
The designer

Exercise n°9 (job to design)

67
Designer
The designer

Exercise n°10 (job to design)

68
Designer
The designer

Hash File stage:

We have seen that the Hash File is necessary for a lookup.
We have also seen that the Hash File allows suppression of duplicate keys.
Let's see now how it is useful to merge different flows.

69
Designer
The designer

Exercise n°11:
Objective: with the job from exercise 10 (use the 2 methods in the same job), create a Hash File and put the results of both methods in it.
Column 1: "AVERAGE METHOD 1" or "AVERAGE METHOD 2"
Column 2: the result of each method
The Hash File must contain 2 lines.

70
Designer
The designer

Exercise n°11 (job to design)

71
Designer
The designer

Stage variables:
Simple processing can be done easily with stage variables.
- A stage variable is a value which remains "alive" for the whole duration of the stage, so you can find a maximum (if the data is sorted), calculate a sum, or count something.
- In the Transformer, right-click and select "Show Stage Variables". Example:
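As an illustration, stage-variable derivations are single BASIC expressions evaluated top to bottom for each input row. This is a hedged sketch: the link name In and the column AMOUNT are hypothetical, not from the course files.

```basic
* Stage variable derivations (one expression per variable, evaluated
* top to bottom for each row). Link "In" and column "AMOUNT" are
* illustrative names.
svCount = svCount + 1                                      ;* row counter
svTotal = svTotal + In.AMOUNT                              ;* running sum
svMax   = If In.AMOUNT > svMax Then In.AMOUNT Else svMax   ;* running max
```

Remember to initialize the variables in the stage properties.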

72
Designer
The designer

Another example :

73
Designer
The designer

Exercise n°12:
Objective: try to calculate the average with stage variables.

Exercise n°13:
Objective: create a job that creates a file with all the clients (key) and, in a second column, the list of the films (separated by a dot).

74
Designer
The designer

Exercise n°13 (job to design)

75
Designer
The designer

Exercise n°13 (job to design)

The order of the stage variables is important: the instructions are executed in the order of the stage variables! (To change the order: right click > Stage Properties > Link Ordering tab.)
The variables must be initialized (right click > Stage Properties > Variables).
There must be a hash file after the stage.

76
Designer
The designer

DataStage variables:
Several variables are defined by DataStage:
- @NULL
- @INROWNUM, @OUTROWNUM
- @DATE
- @TRUE, @FALSE
- @PATH

Link variables:
The most useful is NOTFOUND.
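For example, a hedged sketch of an output-column derivation in a Transformer that uses NOTFOUND after a lookup; the reference link name LkpType and its column TypeName are hypothetical:

```basic
* Output column derivation: when the reference lookup fails, NOTFOUND
* is true for the reference link (names are illustrative).
If LkpType.NOTFOUND Then "unknown type" Else LkpType.TypeName
```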

77
Designer
The designer

Routines:
- Source code (written in the BASIC language)
- A routine is external to the jobs and can be used many times at many levels
- It can be a Transform function or a Before/After subroutine:
 a transform function is called for each line
 a before subroutine is called before the first line (example: empty a file)
 an after subroutine is called when all the lines have been processed
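As a minimal sketch, the body of a transform function assigns its return value to Ans. The routine name UpTrim and the argument Arg1 below are assumptions for illustration:

```basic
* Body of a hypothetical transform function UpTrim with one argument Arg1.
* Called once per line; the value assigned to Ans is returned to the job.
Ans = Upcase(Trim(Arg1))
```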

78
Designer
The designer

Routines (1/3)

Type of routine

Name of the routine

Always fill in the short description

79
Designer
The designer

Routines (2/3)

To be filled in

Arguments: they are used in the code

80
Designer
The designer

Routines (3/3)

Code: use the argument names

Save, compile and test the routine

81
Designer
The designer

Routines: access to a sequential file

OpenSeq FilePath To FicXXX Then
End Else
End

WriteSeq Line To FicXXX Then        ;* e.g. write a file header
End Else
End

ReadSeq Line From FicXXX Then
End Else
End

WeofSeq FicXXX                      ;* to empty the file

CloseSeq FicXXX
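Putting these statements together, here is a hedged sketch of a before-job subroutine body that empties a file and writes a header line. The path, routine name InitFile and variable names are illustrative; InputArg and ErrorCode are the standard before/after arguments supplied by DataStage:

```basic
* Before/after subroutine body (arguments InputArg, ErrorCode are
* supplied by DataStage; a non-zero ErrorCode stops the job).
ErrorCode = 0
OpenSeq "C:\tmp\header.txt" To FicHdr Then
   WeofSeq FicHdr                               ;* empty the file
   WriteSeq "RUN DATE : " : Oconv(Date(), "D2/") To FicHdr Else
      Call DSLogWarn("Cannot write header", "InitFile")
   End
   CloseSeq FicHdr
End Else
   ErrorCode = 1
   Call DSLogWarn("Cannot open file", "InitFile")
End
```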

82
Designer
The designer

Routines: useful statements and functions

Call DSLogInfo("Information", "RoutineName")
Call DSLogWarn("Warning", "RoutineName")
Call DSLogFatal("Abort", "RoutineName")

For i = … To … / Next i
Loop Until … / Repeat
Loop While … / Repeat
If … Then … End Else … End
GoTo

Upcase(…)
Iconv("05/27/97", "D2/")   converts an external date to internal format
Oconv(10740, "D2/")        converts an internal date to external format
Field(…, ',', 3, 1)        returns the third comma-separated field
Trim(…, ' ', 'T')          suppresses the trailing spaces

A = 'Hello '
B = 'World'
C = A:B          C = 'Hello World'
A[1,3] = 'Hel'   (substring extraction)

83
Designer
The designer

Routines: test

By double-clicking on the Result column

84
Designer
The designer

Exercise n°14:
Step 1:

Objective: write a routine which calculates the number of days between two dates.
If the begin date is null then return 0.
If the end date is null then initialize it with today's date.
Save, compile and test the routine.
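One possible body for such a routine, assuming two arguments BeginDate and EndDate passed as external "MM/DD/YY" strings; this is a sketch under those assumptions, not the official solution:

```basic
* Days between two dates; dates arrive as external strings ("D2/" format).
If BeginDate = "" Or IsNull(BeginDate) Then
   Ans = 0
End Else
   If EndDate = "" Or IsNull(EndDate) Then
      EndDate = Oconv(Date(), "D2/")            ;* today's date
   End
   Ans = Iconv(EndDate, "D2/") - Iconv(BeginDate, "D2/")
End
```

Subtracting the two Iconv results works because the internal date format is a day count.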

85
Designer
The designer

86
Designer
The designer

Exercise n°14:
Step 2:

Objective: read location.in and generate a file with the hire duration (returned cassettes only).
Cassettes not returned after 10 days (end date null) will be written to a rejects file with the name and address of the client (in order to send them a mail).

87
Designer
The designer

Exercise n°14 (job to be designed)

88
Designer
The designer

Exercise n°15:
Objective: with a routine (use CASE), calculate the amount for the cassette hire (number of days * hire price * coefficient).
The coefficient is calculated with this rule:
< 5 days: days * hire price
>= 5 and < 10 days: days * hire price * 1.20
>= 10 and < 30 days: days * hire price * 1.50
>= 30 days: days * hire price * 3
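A hedged sketch of the routine body using Begin Case / End Case, assuming two arguments named DaysNumber and HirePrice:

```basic
* Coefficient rule from the exercise, expressed with Begin Case
* (the first matching Case wins, so the tests can stay simple).
Begin Case
   Case DaysNumber < 5
      Coef = 1
   Case DaysNumber < 10
      Coef = 1.20
   Case DaysNumber < 30
      Coef = 1.50
   Case @True
      Coef = 3
End Case
Ans = DaysNumber * HirePrice * Coef
```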

89
Designer
The designer

UV stage:
– works with internal hash files (in the DataStage project)
– can make a Cartesian product
– uses SQL queries (select … from … where … order by …)

90
Designer
The designer

Exercise n°16: execute the Cartesian product on the Clients file and the Cassettes file.
Objective: propose to each client the cassettes he has never hired.
• Step 1: create the job parameter "account",
• Step 2: create a job to write the clients hash file and the cassettes hash file in the DS project, using the account parameter,
• Step 3: in a new job, use those hash files to make the Cartesian product.

• Look at your job's performance!

91
Designer
The designer

Exercise 16 : Step 1 and Step 2

92
Designer
The designer

Step 3 :

93
Designer
The designer

94
Designer
The designer

The number of records

95
Designer
The designer

The normalization:

Normalization: a multi-valued record becomes one record per value.

Multi-valued file:      Normalized file:
12 A|B|C|D|E            12 A
                        12 B
                        12 C
                        12 D
                        12 E

Un-normalization: the reverse operation.

96
Designer
The designer

Normalization:
A multi-valued file must have:
1- a key
2- char(253) or @VM as separator
3- the "Normalize On" field of the Hash File checked
4- the column(s) to normalize
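The @VM-separated column can be built in Transformer stage variables, for example as in this hedged sketch (sorted input assumed; the link and column names In.Id_Cli / In.Id_Cas are taken from the exercise, the variable names are illustrative):

```basic
* Stage variable derivations (in this order!) building an @VM-separated
* list per key: reset the list when the key changes, else append.
svList    = If In.Id_Cli = svPrevCli Then svList : @VM : In.Id_Cas Else In.Id_Cas
svPrevCli = In.Id_Cli
```

The evaluation order matters: svPrevCli must be updated after svList has used it.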

97
Designer
The designer

Exercise n°17: normalization/un-normalization

• Step 1: create a job which reads the location.in file and writes a hash file (Id_Cli as the key and the list of all Id_Cas separated by @VM): use a Sort stage and stage variables!
=> View Data on the input link of the Hash File
• Step 2: modify the job to add normalization of this file
=> View Data on the output link of the Hash File
• Step 3: compare the sequential file with the location.in file

98
Designer
The designer

Exercise N°17 : job to design and View Data

99
Designer
The designer

The ORAOCI stages:

 The version of Oracle used is 9i, so use the ORAOCI9 stage.

 You can:
 either use a query generated by DataStage,
 or use a user-defined query,
 or a combination of both.
 The access parameters have to be defined by job parameters.
 The stage can access one table or more.
 Different actions can be programmed: read, insert, update.
 You can also use stored procedures.

100
Designer
The designer

The ORAOCI Stages :

The access parameters have to be defined by job parameters

101
Designer
The designer

The ORAOCI Stages : Output link

query generated by
DataStage or user-
defined query

102
Designer
The designer

Query generated by DataStage:

Selection of the table(s)

Sort parameters

Selection of the columns

"Group by" clause

103
Designer
The designer

Generate SELECT clause from column list; enter other clauses

104
Designer
The designer

Enter custom SQL statement: when you want to add something specific (to format a date, for example).

105
Designer
The designer

The ORAOCI Stages : Output link

Choose the table

Important parameters

Choose the action

106
Designer
The designer

The ORAOCI Stages : Output link

Number of lines between 2 commits

107
Designer
The designer

The ORAOCI stages: verify error code (1/3)

If the job must abort when there is a SQL error

108
Designer
The designer

The ORAOCI Stages : verify error code (2/3)

To receive SQL error code

109
Designer
The designer

The ORAOCI stages: verify error code (3/3)

To select the errors

To receive the SQL error code

Process lines one by one

110
Designer
The designer

The ORA Bulk stages:

- insert into a table (like SQL*Loader)
- very fast (deactivates the index before the load and reactivates it after the load)
- but no warning if the index is in Unusable state after the load (with duplicate keys, for example)
- few Date and Time formats (DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY, MM/DD/YYYY - hh24:mi:ss, hh:mi:ss am)

111
Designer
The designer

The ORA Bulk stages

DSN, user, password

Table name (with oracle.tableName)

Date and Time format

Number of lines between 2 commits

112
Designer
The designer

How to create a table definition from a table in the database?

In the repository, right click on Table Definitions, then choose "Import", then "Plug-in Meta Data Definitions".

113
Designer
The designer

Then choose the table(s) and click on "Import".

The table definitions will be created in the "ODBC" category.

114
Designer
The designer

Exercise n°18: read a database

Objective: create a job which reads the REF_CPTE table in the BIODS database.

Step 1: create the table definition from the database
Step 2: create the job that reads the table

115
Designer
The designer

Exercise n°19: write to a database

Objective: create a job which writes to the TST_ALADIN_JGV table in the BIODS database (only the first 2 columns: keys).
Location.in => TST_ALADIN_JGV:
Id_Cli ======== >> CHAR1
Id_Cas ======== >> CHAR2
In CHAR1, put a letter (different for each group) before the client number (Id_Cli).

Step 1: use the ORAOCI stage
Step 2: same exercise with the ORABULK stage

116
Designer
The designer

Exercise n°20: update a database

Objective: create a job to update the BEGIN_DATE and END_DATE columns of the TST_ALADIN_JGV table in the BIODS database from the location.in file.

BEGIN_DATE and END_DATE are defined as timestamps!

117
The administrator
Administrator

 The Administrator allows you to:

 create a DataStage project,

 unlock a job.
Sometimes, due to server problems, the Designer (or Manager) crashes and some elements may remain locked (jobs, table definitions, routines, …).
In that case, use the Administrator (with administrator security rights):

118
The administrator
Administrator

Unlock a job (1/3)

To create a project

Choose your project and click on the Command button

119
The administrator
Administrator

Unlock a job (2/3)

CHDIR C:\Ascential\DataStage\Engine
LIST.READU

Search for the user number and the device number.

120
The administrator
Administrator

Unlock a job (3/3)

Unlock your job with the device number,
or with the user number:
UNLOCK USER UserNumber READULOCK
or unlock everything:
UNLOCK ALL

121
The administrator
Administrator

Create a project

Project name

Location for the project (jobs, routines, UV hash files, table definitions, …) on the server. It must be different from the location of the data directories!

122
