
Training course: DataStage (part 1)

V. BEYET
03/07/2006

1
Presentation

Who am I?

Who are you?

2
Summary

 General presentation (DataStage: what is it?)

 DataStage: how to use it?

 The other components (part 2)

3
General presentation

 DataStage: what is it?

 An ETL tool: Extract, Transform, Load

 A graphical environment

 A tool integrated into a suite of BI tools

 Developed by Ascential (now IBM)

4
General presentation

 DataStage: why use it?

 Large data volumes

 Multi-source and multi-target:
 files,
 databases (Oracle, SQL Server, Access, …).

 Data transformation:
 select,
 format,
 combine,
 aggregate,
 sort.

5
General presentation

 DataStage: how does it work?

 Development is done:
 in client-server mode,
 with a graphical design of flows,
 with simple, basic elements,
 with a simple language (BASIC).

 Jobs are:
 compiled and run by an engine,
 stored in a UniVerse database.

6
General presentation

The different tools:

 Designer
 Manager
 Director
 Administrator
 Server
7
General presentation
Server

 The server contains programs and data.

 The programs:
 called Jobs: first as source code and then as executable programs, stored in the UniVerse database.
 The source code is not meant to be read directly.

 Data:
 may be stored in the UniVerse database, but is better kept in server directories.

8
General presentation
Server

 What is a Project for DataStage?

 A server is organized into separate environments called "Projects"
 A Project is an isolated environment for jobs, table definitions and routines
 A Project can be created at any time
 The number of projects is unlimited
 The number of jobs is unlimited for each project
 But the number of simultaneous client connections is limited

9
General presentation
Server

UniVerse database:

 The UniVerse database is a relational database based on files

 Tables are called "Hash Files"

 A Hash File is an indexed file; it is the central element for using all the possibilities of the DataStage engine.

 A Hash File with incorrectly defined keys can cause serious problems.

10
Summary

 General presentation (DataStage: what is it?)

 DataStage: how to use it?

 The other components (part 2)

11
Designer
The designer

 The Designer is used to design jobs (note the icon)

The jobs are composed of "Stages":

 active stages: actions

 passive stages: data storage

Links: connections between the stages

12
Designer
The designer
Passive stages: a place for data storage (the data flows from the stage or to the stage)

Text File: sequential file

Hash File: can be handled only by DataStage (not by WordPad, …), but simultaneous access to a Hash File is possible.

UV Stage: the file is stored in the UniVerse core (DataStage engine).

ODBC, OLEDB, ORAOCI stages: representation of a database; they allow direct access to a database through an ODBC link.
13
Designer
The designer

Active stages
An active stage is a representation of a transformation applied to the dataflow:

Sort: sorts a file

Aggregator: calculations

Transformer: selection, transformation, transport of properties

…

14
Designer
The designer

Links:

between active and passive stages

between passive stages

between active stages

15
Designer
The designer

 A job in the designer


Active Stage Passive Stage

16
Designer
The designer

DataStage Designer:
Each job has:
- one or more sources of data
- one or more transformations
- one or more destinations for the data
The toolbar contains the stage icons used to design
the jobs.
Jobs have to be compiled to create executable
programs.

17
Designer
The designer

To compile the job

To run the job

The repository

The toolbar with stage icons (palette)

18
Designer
The designer

Let's study now the different stages:

Sequential Files (text files)
Transformer
Hash Files
Sort
Aggregator
Routines
UV Stages

19
Designer
The designer

Sequential File stage:

 Can be read,
 Can be written,
 Can be read and written in the same job,
 Can be written cached or not,
 Can be a DOS file or a Unix file, …
 Can be read by two jobs at the same time,
 Can't be written by two jobs at the same time.

20
Designer
The designer

Sequential File :

Stage name

File Type

Stage description

21
Designer
The designer

Sequential File :

Output link

Stage name (to be written)

22
Designer
The designer

Sequential File :

Data format (output file)

Always use these values

23
Designer
The designer

Sequential File:

Different columns of the file (output): type, length
Size to display (for View Data)
To test the connection and view the data in the file

24
Designer
The designer

Sequential File :

To describe a file easily: use or create a "table definition"

Group your table definitions by application

Create or modify the table definitions (for files, databases, transformers, …)

25
Designer
The designer

Sequential File :

Then it can be used in different jobs (click on Load to find the right definition).

26
Designer
The designer

Sequential File :

View Data

27
Designer
The designer

Transformer Stage :

 Multi-source and multi-target,
 Waits for the availability of the source data,
 Performs lookups between two flows (reference),
 Transforms or propagates the data of each flow,
 Allows you to select, filter, and create a rejects file.

28
Designer
The designer

Transformer Stage :

Can process data using:
 native BASIC functions or functions created in the Manager,
 DataStage functions or DataStage macros,
 routines (before/after type),
 or can simply propagate columns.

29
Designer
The designer

Transformer Stage :

Output data
Input data

Right click :
propagate all
the columns

30
Designer
The designer

Transformer Stage :

Output data

Input data

31
Designer
The designer

Exercise n°1:
Objective: read a sequential file and create a new one (save the file).

The catalogue.in file has to be read and the catalogue_save.tmp file has to be written.

Source file: catalogue.in (in the \in directory)
Target file: catalogue_save.tmp (in the \tmp directory)

Steps:
1- Create a table definition (structure of the Catalogue table)
2- Design the job with 2 Sequential Files and 1 Transformer
3- Create the links (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)

32
Designer
The designer

Transformer Stage :

Look at the performance of your job:

Right click on the grid and then select "Show performance statistics"

33
Designer
The designer

Create the parameters of the job: menu Edit > Job Properties, Parameters tab.

34
Designer
The designer

Exercise n°2:

Objective: use job parameters

- create a job parameter: directory
- use it in all the paths of the job from the first exercise (example: #directory#\tmp),
- compile,
- modify your input file (add your best film),
- run with a different path (other groups).

35
Designer
The designer

Hash File stage:

 Necessary for a lookup.
 A Hash File is entirely written before it can be read (the FromTrans link must be finished before FromFilmTypeHF can start).
 Allows grouping of multiple records with the same key (suppresses duplicate keys).
 Can be read by different jobs simultaneously.
 Can be written by different links simultaneously (in the same job or in different jobs).

36
Designer
The designer

Hash File :

Stage name

Account name
(DataStage project)

File path

37
Designer
The designer

Hash File (for files to write):

File name

Select this check box to specify that all records should be cached, rather than written to the hashed file immediately. This is not recommended when your job writes to and reads from the same hashed file in the same stream of execution.

38
Designer
The designer

Hash File :

A key must be defined (it can be a single or a composite key)

39
Designer
The designer

Transformer stage: lookup

• The main flow can be of any type.
• The secondary flow must use a Hash File to design a lookup (so very often you will have to design a temporary Hash File).
• The lookup is done with the key of the secondary flow.
• The number of records in the main flow can't be higher after the lookup than before it.
• The lookup is shown with a dotted line.
• When a lookup is "exclusive", the number of records after the lookup is smaller than the number of records before the lookup.

40
Designer
The designer

Transformer Stage : Lookup


Reference Flow
(vertical flow)

Principal Flow
(horizontal)

41
Designer
The designer

Exercise n°3:
Objective: make a lookup between the Catalogue file and the FilmType file to put the film type in the output file.

Source file: catalogue.in (in the \in directory)
Target file: catalogue.out (in the \out directory)

Steps:
1- Create a table definition (structure of the FilmType table)
2- Modify your job to create a Hash File from the FilmType.in file
3- Create the link to show the lookup (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)

42
Designer
The designer

Exercise n°4:
Objective: put the director name and the film name together, separated by a ">". If the film type is not found, put "unknown type" in the output file. What happens when the director name is empty? Find a solution.

43
Designer
The designer

Exercise n°5:
Objective: if the film type is not found (use a constraint), put the film in a rejects file (first a Sequential File and then a Hash File).

44
Designer
The designer

 Lookup stage with selection (exclusive lookup)

Don't forget: a lookup can be designed with an ORAOCI stage or a UV stage, but it performs much better with Hash Files.

45
Designer
The designer

Exercise n°6:
Objective: select only the films for which the type is known (that means the lookup succeeds).

46
Designer
The designer

Exercise n°7:
Objective: select all the clients who are female and put them in an output file.
The SEXE column contains M (Male) or F (Female).

Then create an annotation for this job (all jobs must have annotations).

47
Director
The Director

 The Director is the job controller; it allows you to:

 Run jobs
Immediately or later, with more options than in the Designer.

 Control job status
Status: Compiled, Running, Aborted, Validated, Failed validation, …

 Monitor jobs
To control the number of lines processed by each active stage of a job.

48
Director
The Director

Run jobs with the Director:

Select the job and click here

Then enter the parameters

49
Director
The Director

To run a job later: click here

Then choose the date and time

50
Director
The Director

To modify the running parameters of a job: Limits tab

Rows limit: the job stops after x rows (on each flow)
Warnings limit: the job stops after x warnings

51
Director
The Director

Verify the status of jobs with the Director.

The statuses:
• "Not compiled"
• "Compiled"
• "Failed validation"
• "Validated ok"
• "Aborted"
• "Finished"
• "Running"

52
Director
The Director

Example: list of jobs

Buttons: view the log, run jobs, reset job status, run jobs later, stop jobs.

53
Director
The Director

Example of a Monitor: the Monitor allows you to follow the different stages of a job. See the importance of good names for the stages and the links!

For each step:
 the number of processed lines (input and output)
 the beginning time
 the execution duration (elapsed time)
 the status
 the performance (rows/sec)

Link types:
 Pri: principal flow
 Ref: reference flow (lookup)
 Out: output flow

54
Director
The Director

Example of a log:
To look at error messages, choose the job and click on the "Log" button.

Green: OK, no problem
Yellow: warning
Red: blocking problem

Don't forget: clear the log from time to time (Job > Clear log).

55
Manager
The Manager

 The Manager is the tool to export/import elements from one DataStage project to another.
File > Open Project to change project. To import or export elements, click on the appropriate button.

All the elements:
• jobs
• routines
• table definitions
are classified in categories, but each name must be unique within a project.

Drag and drop an element to change its category.

56
Manager
The Manager

 EXPORT: choose what you want to export (creates a .dsx file)

To append to an existing file

To change the selection options:
- by category
- by individual components

• Jobs
• Table definitions
• Routines (always check the "Source Code" box)

57
Manager
The Manager

 IMPORT: choose what you want to import.
This will create/modify elements in the DataStage project.

Make your choice

58
Manager
The Manager

With the Manager, you can compile many jobs at the same time (multiple job compile):

Tools > Run multiple job compile

select the type of jobs you want to compile, select "Show manual selection page" and click on the "Next" button,

select the jobs and click on the "Next" button,

then click on the "Start compile" button.

59
Designer
The designer

Sort stage:

Sorting criteria are filled in on the Stage tab / Properties tab.

Modify those parameters if the file to sort has many lines.

60
Designer
The designer

Exercise n°8:
Objective: once you have selected all the women, sort the file in alphabetical order.

61
Designer
The designer

Aggregator stage:

- Aggregates data into a smaller number of records,
- Intermediate processing is executed in memory,
- Allows execution of a before/after routine (before the stage processing, or after it when all the lines have been processed),
- Performance is better if the data is sorted (Input tab),
- The Aggregator does not sort the records.

62
Designer
The designer

Aggregator stage: Input tab

When input data is sorted

63
Designer
The designer

Aggregator Stage : Output tab

Group by

Different
functions

64
Designer
The designer

Exercise n°9:

Objective: create a job which reads location.in and calculates the hit parade of the most hired cassettes (ordered by number of hires, descending). Also put the name of the film, not only the number of the cassette (lookup with catalogue.in).

65
Designer
The designer

Exercise n°10:

Objective: create a job which reads location.in and calculates the average number of hires for each cassette (2 different methods can be used).

66
Designer
The designer

Exercise n°9 (job to design)

67
Designer
The designer

Exercise n°10 (job to design)

68
Designer
The designer

Hash File stage:

We have seen that the Hash File is necessary for a lookup.
We have also seen that the Hash File allows suppression of duplicate keys.
Let's see now how it is useful to merge different flows.

69
Designer
The designer

Exercise n°11:
Objective: with the job from exercise 10 (use the 2 methods in the same job), create a Hash File and put the results of both methods in it.
Column 1: "AVERAGE METHOD 1" or "AVERAGE METHOD 2"
Column 2: the result of each method
The Hash File must contain 2 lines.

70
Designer
The designer

Exercise n°11 (job to design)

71
Designer
The designer

Stage variables:
Simple processing can be done easily with stage variables.
- A stage variable is a value which remains "alive" for the whole duration of the stage, so you can find a maximum (if the data is sorted), calculate a sum, or count something.
- In the Transformer, right-click and select "Show Stage Variables". Example:
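As an illustration, stage-variable derivations are single BASIC expressions evaluated top to bottom for each input row. This is a hedged sketch: the link name In and the column AMOUNT are hypothetical, not from the course files.

```basic
* Stage variable derivations (one expression per variable, evaluated
* top to bottom for each row). Link "In" and column "AMOUNT" are
* illustrative names.
svCount = svCount + 1                                      ;* row counter
svTotal = svTotal + In.AMOUNT                              ;* running sum
svMax   = If In.AMOUNT > svMax Then In.AMOUNT Else svMax   ;* running max
```

Remember to initialize the variables in the stage properties.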

72
Designer
The designer

Another example :

73
Designer
The designer

Exercise n°12:
Objective: try to calculate the average with stage variables.

Exercise n°13:
Objective: create a job that creates a file with all the clients (key) and, in a second column, the list of the films (separated by a dot).

74
Designer
The designer

Exercise n°13 (job to design)

75
Designer
The designer

Exercise n°13 (job to design)

The order of the stage variables is important: the instructions are executed in the order of the stage variables! (To change the order: right click > Stage Properties > Link Ordering tab.)
The variables must be initialized (right click > Stage Properties > Variables).
There must be a hash file after the stage.

76
Designer
The designer

DataStage variables:
Several variables are defined by DataStage:
- @NULL
- @INROWNUM, @OUTROWNUM
- @DATE
- @TRUE, @FALSE
- @PATH

Link variables:
The most useful is NOTFOUND.
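For example, a hedged sketch of an output-column derivation in a Transformer that uses NOTFOUND after a lookup; the reference link name LkpType and its column TypeName are hypothetical:

```basic
* Output column derivation: when the reference lookup fails, NOTFOUND
* is true for the reference link (names are illustrative).
If LkpType.NOTFOUND Then "unknown type" Else LkpType.TypeName
```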

77
Designer
The designer

Routines:
- Source code (written in the BASIC language)
- A routine is external to the jobs and can be used many times at many levels
- It can be a Transform function or a Before/After subroutine:
 a transform function is called for each line
 a before subroutine is called before the first line (example: empty a file)
 an after subroutine is called when all the lines have been processed
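As a minimal sketch, the body of a transform function assigns its return value to Ans. The routine name UpTrim and the argument Arg1 below are assumptions for illustration:

```basic
* Body of a hypothetical transform function UpTrim with one argument Arg1.
* Called once per line; the value assigned to Ans is returned to the job.
Ans = Upcase(Trim(Arg1))
```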

78
Designer
The designer

Routines (1/3)

Type of routine

Name of the routine

Always fill in the short description

79
Designer
The designer

Routines (2/3)

To be filled in

Arguments: they are used in the code

80
Designer
The designer

Routines (3/3)

Code: use the argument names

Save, compile and test the routine

81
Designer
The designer

Routines: access to a sequential file

OpenSeq FilePath To FicXXX Then
End Else
End

WriteSeq Line To FicXXX Then        ;* e.g. write a file header
End Else
End

ReadSeq Line From FicXXX Then
End Else
End

WeofSeq FicXXX                      ;* to empty the file

CloseSeq FicXXX
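Putting these statements together, here is a hedged sketch of a before-job subroutine body that empties a file and writes a header line. The path, routine name InitFile and variable names are illustrative; InputArg and ErrorCode are the standard before/after arguments supplied by DataStage:

```basic
* Before/after subroutine body (arguments InputArg, ErrorCode are
* supplied by DataStage; a non-zero ErrorCode stops the job).
ErrorCode = 0
OpenSeq "C:\tmp\header.txt" To FicHdr Then
   WeofSeq FicHdr                               ;* empty the file
   WriteSeq "RUN DATE : " : Oconv(Date(), "D2/") To FicHdr Else
      Call DSLogWarn("Cannot write header", "InitFile")
   End
   CloseSeq FicHdr
End Else
   ErrorCode = 1
   Call DSLogWarn("Cannot open file", "InitFile")
End
```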

82
Designer
The designer

Routines: useful statements and functions

Call DSLogInfo("Information", "RoutineName")
Call DSLogWarn("Warning", "RoutineName")
Call DSLogFatal("Abort", "RoutineName")

For i = … To … / Next i
Loop Until … / Repeat
Loop While … / Repeat
If … Then … End Else … End
GoTo

Upcase(…)
Iconv("05/27/97", "D2/")   converts an external date to internal format
Oconv(10740, "D2/")        converts an internal date to external format
Field(…, ',', 3, 1)        returns the third comma-separated field
Trim(…, ' ', 'T')          suppresses the trailing spaces

A = 'Hello '
B = 'World'
C = A:B          C = 'Hello World'
A[1,3] = 'Hel'   (substring extraction)

83
Designer
The designer

Routines: test

By double-clicking on the Result column

84
Designer
The designer

Exercise n°14:
Step 1:

Objective: write a routine which calculates the number of days between two dates.
If the begin date is null then return 0.
If the end date is null then initialize it with today's date.
Save, compile and test the routine.
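One possible body for such a routine, assuming two arguments BeginDate and EndDate passed as external "MM/DD/YY" strings; this is a sketch under those assumptions, not the official solution:

```basic
* Days between two dates; dates arrive as external strings ("D2/" format).
If BeginDate = "" Or IsNull(BeginDate) Then
   Ans = 0
End Else
   If EndDate = "" Or IsNull(EndDate) Then
      EndDate = Oconv(Date(), "D2/")            ;* today's date
   End
   Ans = Iconv(EndDate, "D2/") - Iconv(BeginDate, "D2/")
End
```

Subtracting the two Iconv results works because the internal date format is a day count.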

85
Designer
The designer

86
Designer
The designer

Exercise n°14:
Step 2:

Objective: read location.in and generate a file with the hire duration (returned cassettes only).
Cassettes not returned after 10 days (end date null) will be written to a rejects file with the name and address of the client (in order to send them a mail).

87
Designer
The designer

Exercise n°14 (job to be designed)

88
Designer
The designer

Exercise n°15:
Objective: with a routine (use CASE), calculate the amount for the cassette hire (number of days * hire price * coefficient).
The coefficient is calculated with this rule:
< 5 days: days * hire price
>= 5 and < 10 days: days * hire price * 1.20
>= 10 and < 30 days: days * hire price * 1.50
>= 30 days: days * hire price * 3
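A hedged sketch of the routine body using Begin Case / End Case, assuming two arguments named DaysNumber and HirePrice:

```basic
* Coefficient rule from the exercise, expressed with Begin Case
* (the first matching Case wins, so the tests can stay simple).
Begin Case
   Case DaysNumber < 5
      Coef = 1
   Case DaysNumber < 10
      Coef = 1.20
   Case DaysNumber < 30
      Coef = 1.50
   Case @True
      Coef = 3
End Case
Ans = DaysNumber * HirePrice * Coef
```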

89
Designer
The designer

UV stage:
– works with internal hash files (in the DataStage project)
– can make a Cartesian product
– uses SQL queries (select … from … where … order by …)

90
Designer
The designer

Exercise n°16: execute the Cartesian product on the Clients file and the Cassettes file.
Objective: propose to each client the cassettes he has never hired.
• Step 1: create the job parameter "account",
• Step 2: create a job to write the clients hash file and the cassettes hash file in the DS project, using the account parameter,
• Step 3: in a new job, use those hash files to make the Cartesian product.

• Look at your job's performance!

91
Designer
The designer

Exercise 16 : Step 1 and Step 2

92
Designer
The designer

Step 3 :

93
Designer
The designer

94
Designer
The designer

The number of records

95
Designer
The designer

The normalization:

Normalization: a multi-valued record becomes one record per value.

Multi-valued file:      Normalized file:
12 A|B|C|D|E            12 A
                        12 B
                        12 C
                        12 D
                        12 E

Un-normalization: the reverse operation.

96
Designer
The designer

Normalization:
A multi-valued file must have:
1- a key
2- char(253) or @VM as separator
3- the "Normalize On" field of the Hash File checked
4- the column(s) to normalize
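The @VM-separated column can be built in Transformer stage variables, for example as in this hedged sketch (sorted input assumed; the link and column names In.Id_Cli / In.Id_Cas are taken from the exercise, the variable names are illustrative):

```basic
* Stage variable derivations (in this order!) building an @VM-separated
* list per key: reset the list when the key changes, else append.
svList    = If In.Id_Cli = svPrevCli Then svList : @VM : In.Id_Cas Else In.Id_Cas
svPrevCli = In.Id_Cli
```

The evaluation order matters: svPrevCli must be updated after svList has used it.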

97
Designer
The designer

Exercise n°17: normalization/un-normalization

• Step 1: create a job which reads the location.in file and writes a hash file (Id_Cli as the key and the list of all Id_Cas separated by @VM): use a Sort stage and stage variables!
=> View Data on the input link of the Hash File
• Step 2: modify the job to add normalization of this file
=> View Data on the output link of the Hash File
• Step 3: compare the sequential file with the location.in file

98
Designer
The designer

Exercise N°17 : job to design and View Data

99
Designer
The designer

The ORAOCI stages:

 The version of Oracle used is 9i, so use the ORAOCI9 stage.

 You can:
 either use a query generated by DataStage,
 or use a user-defined query,
 or a combination of both.
 The access parameters have to be defined by job parameters.
 The stage can access one table or more.
 Different actions can be programmed: read, insert, update.
 You can also use stored procedures.

100
Designer
The designer

The ORAOCI Stages :

The access parameters have to be defined by job parameters

101
Designer
The designer

The ORAOCI Stages : Output link

query generated by
DataStage or user-
defined query

102
Designer
The designer

Query generated by DataStage:

Selection of the table(s)

Sort parameters

Selection of the columns

"Group by" clause

103
Designer
The designer

Generate SELECT clause from column list; enter other clauses

104
Designer
The designer

Enter custom SQL statement: when you want to add something specific (to format a date, for example).

105
Designer
The designer

The ORAOCI Stages : Output link

Choose the table

Important parameters

Choose the action

106
Designer
The designer

The ORAOCI Stages : Output link

Number of lines between 2 commits

107
Designer
The designer

The ORAOCI stages: verify error code (1/3)

If the job must abort when there is a SQL error

108
Designer
The designer

The ORAOCI Stages : verify error code (2/3)

To receive SQL error code

109
Designer
The designer

The ORAOCI stages: verify error code (3/3)

To select the errors

To receive the SQL error code

Process lines one by one

110
Designer
The designer

The ORA Bulk stages:

- insert into a table (like SQL*Loader)
- very fast (deactivates the index before the load and reactivates it after the load)
- but no warning if the index is in Unusable state after the load (with duplicate keys, for example)
- few Date and Time formats (DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY, MM/DD/YYYY - hh24:mi:ss, hh:mi:ss am)

111
Designer
The designer

The ORA Bulk stages

DSN, user, password

Table name (with oracle.tableName)

Date and Time format

Number of lines between 2 commits

112
Designer
The designer

How to create a table definition from a table in the database?

In the repository, right click on Table Definitions, then choose "Import", then "Plug-in Meta Data Definitions".

113
Designer
The designer

Then choose the table(s) and click on "Import".

The table definitions will be created in the "ODBC" category.

114
Designer
The designer

Exercise n°18: read a database

Objective: create a job which reads the REF_CPTE table in the BIODS database.

Step 1: create the table definition from the database
Step 2: create the job that reads the table

115
Designer
The designer

Exercise n°19: write to a database

Objective: create a job which writes to the TST_ALADIN_JGV table in the BIODS database (only the first 2 columns: keys).
Location.in => TST_ALADIN_JGV:
Id_Cli ======== >> CHAR1
Id_Cas ======== >> CHAR2
In CHAR1, put a letter (different for each group) before the client number (Id_Cli).

Step 1: use the ORAOCI stage
Step 2: same exercise with the ORABULK stage

116
Designer
The designer

Exercise n°20: update a database

Objective: create a job to update the BEGIN_DATE and END_DATE columns of the TST_ALADIN_JGV table in the BIODS database from the location.in file.

BEGIN_DATE and END_DATE are defined as timestamps!

117
The administrator
Administrator

 The Administrator allows you to:

 create a DataStage project,

 unlock a job.
Sometimes, due to server problems, the Designer (or Manager) crashes and some elements may remain locked (jobs, table definitions, routines, …).
In that case, use the Administrator (with administrator security rights):

118
The administrator
Administrator

Unlock a job (1/3)

To create a project

Choose your project and click on the Command button

119
The administrator
Administrator

Unlock a job (2/3)

CHDIR C:\Ascential\DataStage\Engine
LIST.READU

Search for the user number and the device number.

120
The administrator
Administrator

Unlock a job (3/3)

Unlock your job with the device number,
or with the user number:
UNLOCK USER UserNumber READULOCK
or unlock everything:
UNLOCK ALL

121
The administrator
Administrator

Create a project

Project name

Location for the project (jobs, routines, UV hash files, table definitions, …) on the server. It must be different from the location of the data directories!

122
