V. BEYET 03/07/2006
Presentation ...
Who am I? Who are you?
Summary
General presentation
DataStage: what is it?
An ETL tool: Extract-Transform-Load
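As a rough illustration of the Extract-Transform-Load pattern (plain Python, nothing DataStage-specific; all names are made up for the example):

```python
# Minimal sketch of the ETL pattern that DataStage implements graphically.
# Function and field names are illustrative, not DataStage APIs.

def extract(rows):
    """Extract: read raw records from a source (here, an in-memory list)."""
    return list(rows)

def transform(rows):
    """Transform: clean and reshape each record."""
    return [{"film": r["film"].strip().upper(), "year": int(r["year"])}
            for r in rows]

def load(rows, target):
    """Load: write the transformed records to a target (here, a list)."""
    target.extend(rows)
    return target

source = [{"film": " vertigo ", "year": "1958"}]
target = load(transform(extract(source)), [])
```

In DataStage the same three steps are drawn as stages and links instead of being coded by hand.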
DataStage: why use it?
- Large volumes of data,
- multi-source and multi-target: files, databases (Oracle, SQL Server, Access, ...),
- data transformation.
DataStage: how does it work?
Development is done in client-server mode, with a graphical design of flows built from simple, basic elements.
Treatments are compiled and run by an engine, and written in a UniVerse database.
Designer
Manager
Server
Director
Administrator
The server contains programs and data.
The programs, called jobs, exist first as source code and then as executable programs, written in the UniVerse database.
The data may be written in the UniVerse database, but is better kept in server directories.
What is a project for DataStage? A server is organized into different environments called projects.
UniVerse database: the UniVerse database is a relational database based on files.
Tables are called "hash files".
A hash file is an indexed file; it is the central element for using all the possibilities of the DataStage engine.
A hash file with incorrectly defined keys may create disastrous problems.
Summary
General presentation (DataStage: what is it?)
DataStage: how to use it?
The other components (part 2)
The designer
The Designer is used to design jobs (look at the icon). Jobs are composed of stages:
- active stages: actions,
- passive stages: data storage.
Passive stages: a place for data storage (the data flows from the stage or to the stage).
Active stages: an active stage represents a transformation applied to the data flow.
Links: between active and passive stages, between passive stages, or between active stages.
A job in the Designer: active stages and passive stages connected by links.
DataStage Designer:
Each job has:
- one or more sources of data,
- one or more transformations,
- one or more destinations for the data.
The toolbar contains the stage icons used to design jobs. Jobs have to be compiled to create executable programs.
The repository
Let's now study the different stages: Sequential File (text files), Transformer, Hash File, Sort, Aggregator, Routines, UV stages.
Sequential File stage:
- can be read,
- can be written,
- can be read and written in the same job,
- can be written cached or not,
- can be a DOS or Unix file,
- can be read by two jobs at the same time,
- cannot be written by two jobs at the same time.
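A Sequential File stage simply reads or writes delimited text. A rough Python sketch of the same idea (the data and the ";" delimiter are illustrative):

```python
# Sketch of what a Sequential File stage does: read and write plain
# delimited text. In-memory buffers stand in for real files here.
import csv
import io

raw = "id;title\n1;Vertigo\n2;Psycho\n"            # contents of a source file
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = list(reader)                                # "read" the sequential file

out = io.StringIO()                                # the target "file"
writer = csv.DictWriter(out, fieldnames=["id", "title"], delimiter=";")
writer.writeheader()
writer.writerows(rows)                             # "write" the sequential file
```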
Sequential File: stage description.
Sequential File: output link.
Sequential File:
The Output tab shows the different columns of the file (type, length), the size to display (for View Data), and lets you test the connection and view the data in the file.
Sequential File:
To describe a file easily, use or create a table definition. Group your table definitions by application. Create or modify table definitions (for files, databases, transformers, ...).
Then it can be used in different jobs (click on Load to find the right definition).
Sequential File: View Data.
Transformer stage:
- multi-source and multi-target,
- waits for the availability of the data sources,
- performs lookups between two flows (reference),
- transforms or propagates the data of each flow,
- allows you to select, filter, and create a refusals file.
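The lookup-with-refusals behaviour described above can be sketched in plain Python (the reference table and column names are invented for the example; in DataStage the reference flow would be a Hash File):

```python
# Sketch of a Transformer doing a lookup against a reference flow and
# routing unmatched rows to a refusals file. All names are illustrative.

film_type_ref = {"SF": "Science fiction", "TH": "Thriller"}   # reference flow

main_flow = [{"film": "Alien", "type": "SF"},
             {"film": "Vertigo", "type": "TH"},
             {"film": "Mystery", "type": "??"}]

output, refusals = [], []
for row in main_flow:
    label = film_type_ref.get(row["type"])   # lookup on the reference key
    if label is None:                        # lookup failed (NOTFOUND)
        refusals.append(row)                 # a constraint routes the row away
    else:
        output.append({**row, "type_label": label})
```

Note that the output can never contain more rows than the main flow, exactly as stated for the Transformer lookup.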
Transformer stage: can apply treatments through native BASIC functions (or functions created in the Manager), DataStage functions or macros, and routines (before/after type), or can simply propagate columns.
Transformer stage: input data and output data.
Exercise n°1:
Objective: read a sequential file and create a new one (save the file). The catalogue.in file has to be read and the catalogue_save.tmp file has to be written.
Source file: catalogue.in (in the \in directory)
Target file: catalogue_save.tmp (in the \tmp directory)
Steps:
1- Create a table definition (structure of the Catalogue table)
2- Design the job with 2 Sequential Files and 1 Transformer
3- Create the links (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)
Transformer stage: look at the performance of your job:
Exercise n°2:
Objective: use environment variables.
- Create a job parameter: directory.
- Use it in all the paths of the job from the first exercise (example: #directory#\tmp).
- Compile.
- Modify your input file (add your best film).
- Run with a different path (other groups).
Hash File stage:
- Necessary for a lookup.
- A hash file is entirely written before it can be read (the FromTrans link must be finished before FromFilmTypeHF can start).
- Allows grouping multiple records with the same key (suppresses duplicate keys).
- Can be read by different jobs simultaneously.
- Can be written by different links simultaneously (in the same job or in different jobs).
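The duplicate-key behaviour can be sketched with a plain dictionary (keys and values are illustrative):

```python
# Sketch of Hash File key handling: writing several records with the same
# key keeps only one row per key (a later write overwrites an earlier one).

hashed_file = {}                                   # key -> record
for key, value in [("C1", "Vertigo"),
                   ("C2", "Psycho"),
                   ("C1", "Vertigo (re-edition)")]:
    hashed_file[key] = value                       # duplicate keys overwrite
```

This is why a hash file with incorrectly defined keys can silently lose rows.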
Hash File: stage name.
Hash File: file name.
Select the cache check box to specify that all records should be cached, rather than written to the hashed file immediately. This is not recommended when your job writes to and reads from the same hashed file in the same stream of execution.
Hash File: a key must be defined (it can be a single or a multi-column key).
Transformer stage: lookup
- The main flow can be of any type.
- The secondary flow must be a Hash File to perform a lookup (so very often you will have to create a temporary Hash File).
- The lookup is done on the key of the secondary flow.
- The number of records in the main flow cannot be higher after the lookup than before it.
- The lookup is shown with a dotted line.
- When a lookup is exclusive, the number of records after the lookup is smaller than the number of records before it.
Transformer stage lookup: the reference flow is the vertical flow.
Exercise n°3:
Objective: perform a lookup between the Catalogue file and FilmType to put the film type in the output file.
Source file: catalogue.in (in the \in directory)
Target file: catalogue.out (in the \out directory)
Steps:
1- Create a table definition (structure of the FilmType table)
2- Modify your job to create a Hash File from the FilmType.in file
3- Create the link to show the lookup (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)
Exercise n°4:
Objective: put the director name and the film name together, separated by a ">". If the film type is not found, put "unknown type" in the output file. What happens when the director name is empty? Find a solution.
Exercise n°5:
Objective: if the film type is not found (use a constraint), put the film in a refusals file (first a Sequential File, then a Hash File).
Lookup with selection (exclusive lookup)
Don't forget: a lookup can also be designed with an ORAOCI stage or a UV stage, but it performs much better with Hash Files.
Exercise n°6:
Objective: select only the films for which the type is known (that is, the lookup succeeds).
Exercise n°7:
Objective: select all the clients who are female and put them in an output file. The SEXE column contains M (male) or F (female). Then create an annotation for this job (all jobs must have annotations).
The director
The Director is the job controller. It allows you to:
- run jobs, immediately or later, with more options than in the Designer,
- monitor jobs: control the number of lines treated by each active stage of a job.
Run jobs with the Director: select the job and click the run button.
To run a job later: use the schedule button.
To modify the running parameters of a job: Limits tab.
Row limit: the job stops after x rows (on each flow).
Example: the list of jobs, with buttons to view the log, run jobs, and stop jobs.
Example of a monitor:
For each stage: the number of lines treated (input and output), the start time, the execution duration (elapsed time), the status, and the performance (rows/sec).
Link types: Pri (principal flow), Ref (reference flow, used for lookups), Out (output flow).
The monitor allows you to follow the different stages of a job. Note the importance of giving good names to stages and links!
Example of a log:
To look at error messages, choose the job and click the log button.
Don't forget: clear the log from time to time (Job > Clear Log).
The manager
The Manager is the tool for exporting/importing elements from one DataStage project to another.
File > Open Project changes the current project. All the elements (jobs, routines, table definitions) are classified in categories, but each name must be unique within a project. Drag and drop an element to change its category.
EXPORT
Options: append to an existing file; change the selection options (by category or by individual components).
IMPORT
This creates/modifies elements in the DataStage project: choose what you want to import.
With the Manager, you can compile many jobs at the same time (multiple job compile): Tools > Run multiple job compile, select the type of jobs you want to compile, check Show manual selection page, and click the Next button.
Sort stage: the sorting criteria are filled in on the Stage tab, Properties tab.
Exercise n°8:
Objective: once you have selected all the women, sort the file in alphabetical order.
Aggregator stage:
- Aggregates data into a smaller number of records.
- Intermediate treatments are executed in memory.
- Allows executing a before/after routine (before the stage treatment, or after it once all the lines have been treated).
- Performance is better if the data is sorted (Input tab).
- The Aggregator does not sort the records.
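What the Aggregator computes can be sketched as a group-by in plain Python (column names and data are illustrative):

```python
# Sketch of the Aggregator stage: group rows on key columns and apply an
# aggregation function per group (here, a count of hires per cassette).
from collections import defaultdict

hires = [("CAS1", 1), ("CAS2", 1), ("CAS1", 1)]    # (cassette id, one hire)

counts = defaultdict(int)                          # one output row per group
for cassette, n in hires:
    counts[cassette] += n                          # the "Count"/"Sum" function
```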
Group-by columns and the different aggregation functions.
Exercise n°9:
Objective: create a job which reads location.in and calculates the hit parade of the most hired cassettes (ordered by number of hires, descending). Also output the name of the film, not only the cassette number (lookup with catalogue.in).
Exercise n°10:
Objective: create a job which reads location.in and calculates the average number of hires for each cassette. (Two different methods can be used.)
Hash File stage: we have seen that the Hash File is necessary for a lookup, and that it allows suppressing duplicate keys. Let's now see how it is useful for grouping different flows.
Exercise n°11:
Objective: starting from the job of exercise 10 (use the two methods in the same job), create a Hash File and put the results of both methods in it. Column 1: AVERAGE METHOD 1 or AVERAGE METHOD 2. Column 2: the result of each method. The Hash File must contain 2 lines.
Stage variables:
Simple treatments can be made easily with stage variables.
- A stage variable is a value that stays active for the whole duration of the stage, so you can find a max (if the data is sorted), calculate a sum, or count something.
- In the Transformer, right-click and select Show Stage Variables.
Example:
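A stage variable behaves like a variable that survives from one input row to the next. A plain Python sketch of a running sum and a running max (values are illustrative):

```python
# Sketch of stage variables: values that persist across rows while the
# Transformer processes its input, one iteration per row.

rows = [3, 7, 5]                                   # the input flow

sv_sum, sv_max = 0, None                           # the "stage variables"
for value in rows:                                 # one pass per input row
    sv_sum = sv_sum + value                        # running sum
    sv_max = value if sv_max is None or value > sv_max else sv_max
```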
Another example:
Exercise n°12:
Objective: try to calculate the average with stage variables.
Exercise n°13:
Objective: create a job that creates a file with each client (key) and, in a second column, the list of their films (separated by a dot).
DataStage variables:
Several variables are defined by DataStage: @NULL, @INROWNUM, @OUTROWNUM, @DATE, @TRUE, @FALSE, @PATH.
Link variables:
The most useful is NOTFOUND.
Routines:
- Source code written in the BASIC language.
- A routine is external to the jobs and can be used many times at many levels.
- It can be a transform function or a before/after subroutine:
  a transform function is called for each line,
  a before subroutine is called before the first line (example: empty a file),
  an after subroutine is called when all the lines have been treated.
Routines (1/3): name of the routine, type of routine.
Routines (2/3): the arguments, which are used in the code, have to be filled in.
Routines (3/3): save and compile.
Routines: access to a sequential file (BASIC statements):
OpenSeq FicXXX To xxx Then ... End Else ... End
ReadSeq line From xxx Then ... End Else ... End
WriteSeq line To xxx Then ... End Else ... End
WeofSeq xxx
CloseSeq xxx
Routines: useful functions and statements:
Call DSLogInfo("Information", "RoutineName")
Call DSLogWarn("Warning", "RoutineName")
Call DSLogFatal("Abort", "RoutineName")
Loop ... Until ... Repeat, GoTo
Iconv("05/27/97", "D2/")
Oconv(10740, "D2/")
Upcase()
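Iconv and Oconv with the "D2/" code convert between an external date and the internal day count. A plain Python sketch, assuming the UniVerse convention that day 0 is 31 December 1967 (the function names here are invented, not real APIs):

```python
# Sketch of Iconv/Oconv "D2/" date conversion: DataStage stores a date as
# the number of days since 31 December 1967 (day 0, UniVerse convention).
from datetime import date

DAY_ZERO = date(1967, 12, 31)

def iconv_date(month, day, year):
    """Rough equivalent of Iconv("mm/dd/yy", "D2/"): date -> day number."""
    return (date(year, month, day) - DAY_ZERO).days

def oconv_date(days):
    """Rough equivalent of Oconv(days, "D2/"): day number -> mm/dd/yy."""
    d = date.fromordinal(DAY_ZERO.toordinal() + days)
    return f"{d.month:02d}/{d.day:02d}/{d.year % 100:02d}"
```

This is why Iconv("05/27/97", "D2/") and Oconv(10740, "D2/") in the list above are inverses of each other.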
Routines: test.
Exercise n°14:
Step 1. Objective: write a routine which calculates the number of days between two dates. If the begin date is null, return 0. If the end date is null, initialize it with today's date. Save, compile, and test the routine.
Exercise n°14:
Step 2. Objective: read location.in and generate a file with the hire duration (returned cassettes only). Cassettes not returned after 10 days (null end date) will be written to a refusals file with the name and address of the client (to send them a mail).
Exercise n°15:
Objective: with a routine (use CASE), calculate the amount for the cassette hire (number of days * hire price * coefficient). The coefficient is calculated with this rule:
< 5 days: days * hire price
>= 5 and < 10 days: days * hire price * 1.20
>= 10 and < 30 days: days * hire price * 1.50
>= 30 days: days * hire price * 3
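The pricing rule above, sketched in plain Python instead of a BASIC CASE (the function name is invented for the example):

```python
# Sketch of the hire-amount rule of exercise 15: a BASIC routine would use
# a CASE construct; here the same tiers are written as an if/elif chain.

def hire_amount(days, hire_price):
    if days < 5:
        coefficient = 1.0
    elif days < 10:          # >= 5 and < 10 days
        coefficient = 1.20
    elif days < 30:          # >= 10 and < 30 days
        coefficient = 1.50
    else:                    # >= 30 days
        coefficient = 3.0
    return days * hire_price * coefficient
```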
UV stage:
- works with internal hash files (stored in the DataStage project),
- can make a Cartesian product,
- uses SQL queries (SELECT ... FROM ... WHERE ... ORDER BY ...).
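The Cartesian product the UV stage computes with a SELECT over two files can be sketched in plain Python (the data and names are invented):

```python
# Sketch of a Cartesian product (SELECT ... FROM clients, cassettes):
# every client paired with every cassette, minus the pairs already hired.
from itertools import product

clients = ["CLI1", "CLI2"]
cassettes = ["CAS1", "CAS2"]
already_hired = {("CLI1", "CAS1")}                 # existing hire history

never_hired = [(cli, cas)
               for cli, cas in product(clients, cassettes)
               if (cli, cas) not in already_hired]
```

The product grows as clients x cassettes, which is why job performance matters here.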
Exercise n°16: execute the Cartesian product of the Clients file and the Cassettes file.
Objective: propose to each client the cassettes he has never hired.
Step 1: create the job parameter account.
Step 2: create a job to write the clients hash file and the cassettes hash file into the DataStage project, using the account parameter.
Step 3: in a new job, use those hash files to make the Cartesian product.
Look at your job's performance!
Exercise n°16: steps 1 and 2.
Exercise n°16: step 3.
The number of records.
Normalization:
Multi-valued file:   12  A|B|C|D|E
Normalized file:     12  A
                     12  B
                     12  C
                     12  D
                     12  E
Normalization turns the multi-valued file into the normalized file; un-normalization is the reverse operation.
Normalization: a multi-valued file must have:
1- a key,
2- char(253) or @VM as the separator,
3- the Normalize On field of the Hash File stage checked,
4- the column(s) to normalize.
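Normalization of a multi-valued record can be sketched in plain Python (@VM is char(253), as stated above; the data is illustrative):

```python
# Sketch of Hash File normalization: one multi-valued record (values
# separated by the value mark @VM) expands into one row per value.

VM = chr(253)                                      # the DataStage value mark

multi_valued = {"12": VM.join(["A", "B", "C", "D", "E"])}   # key -> value list

normalized = [(key, value)
              for key, values in multi_valued.items()
              for value in values.split(VM)]       # one output row per value
```

Joining the values back with VM would be the reverse (un-normalization).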
Exercise n°17: normalization/un-normalization.
Step 1: create a job which reads the location.in file and writes a hash file (Id_Cli as the key, and the list of all Id_Cas values separated by @VM): use the Sort stage and stage variables! Then View Data on the input link of the Hash File.
Step 2: modify the job to add normalization of this file. Then View Data on the output link of the Hash File.
Step 3: compare the sequential file with the location.in file.
Exercise n°17: the job to design, and View Data.
The ORAOCI stages:
The version of Oracle used here is 9i, so use the ORAOCI9 stage. You can:
- use a query generated by DataStage,
- use a user-defined query,
- or combine the two previous possibilities.
The access parameters have to be defined as job parameters. The stage can access one table or more. Different actions can be programmed: read, insert, update. You can also use stored procedures.
The ORAOCI stages: the access parameters have to be defined as job parameters.
The ORAOCI stages, output link: query generated by DataStage, or user-defined query.
Query generated by DataStage: selection of the columns, selection of the table(s), sort parameters, and the group-by clause.
Generate the SELECT clause from the column list; enter the other clauses.
Enter a custom SQL statement when you want to add something specific.
The ORAOCI stages, output link: choose the table and check the important parameters.
The ORAOCI stages: output link.
The ORAOCI stages: verify the error code (1/3).
The ORAOCI stages: verify the error code (2/3).
The ORAOCI stages: verify the error code (3/3): select the errors and treat the lines one by one.
The ORA Bulk stages:
- insert into a table (like SQL*Loader),
- very fast (deactivate the index before the load and reactivate it after the load),
- but no warning if the index is left in an Unusable state after the load (for example, when there are duplicate keys),
- only a few date and time formats (DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY, MM/DD/YYYY; hh24:mi:ss, hh:mi:ss am).
The ORA Bulk stages, parameters: DSN, user, password; table name (with oracle.tableName); date and time format; number of lines between two commits.
How to create a table definition from a table in the database?
In the repository, right-click Table Definitions, then choose Import, then Plug-in Meta Data Definitions.
Then choose the table(s) and click Import. The table definitions will be created in the ODBC category.
Exercise n°18: read a database.
Objective: create a job which reads the REF_CPTE table in the BIODS database.
Step 1: create the table definition from the database.
Step 2: create the job that reads the table.
Exercise n°19: write to a database.
Step 1: use the ORAOCI stage.
Step 2: the same exercise with the ORABULK stage.
Exercise n°20: update a database.
The administrator
The Administrator is used to create DataStage projects and to unlock jobs.
Sometimes, due to server problems, the Designer (or Manager) crashes and some elements may stay locked (jobs, table definitions, routines, ...). In that case, use the Administrator (with administrator security rights):
Unlock a job (1/3): choose your project (this screen also contains the button to create a project).
Unlock a job (2/3):
CHDIR C:\Ascential\DataStage\Engine
LIST.READU
Search for the user number.
Unlock a job (3/3): unlock your job with the device number.
Create a project: enter the project name and the location of the project (jobs, routines, UV hash files, table definitions, ...) on the server. This location must be different from the location of the data directories!