V. BEYET 03/07/2006
Who am I? Who are you?
Summary
General presentation (DataStage: what is it?)
DataStage: how to use it?
The other components (part 2)
General presentation - DataStage: what is it?
An ETL tool: Extract-Transform-Load
A graphical environment
A tool integrated in a suite of BI tools
Developed by Ascential (IBM)
General presentation - DataStage: why use it?
Big volumes of data
Multi-source and multi-target: files, databases (Oracle, SQL Server, …)
Data transformation: select, access, format, combine, aggregate, sort
General presentation - DataStage: how does it work?
Development is done:
in client-server mode
with a graphical design of flows
with simple, basic elements
Treatments are:
compiled and run by an engine
written in a simple language (BASIC)
stored in a Universe database
General presentation - the different tools: Designer, Manager, Director, Administrator (clients) and the Server.
General presentation - the Server contains programs and data.
The programs, called Jobs: first as source code and then as executable programs, written in the Universe database (but the source code is not human-readable).
Data: may be written in the Universe database, but better in server directories.
General presentation - what is a Project for DataStage?
A server is organized in separate environments called "Projects".
A Project is an isolated environment for jobs, table definitions and routines.
A Project can be created at any time.
The number of projects is unlimited.
The number of jobs per project is unlimited.
But the number of simultaneous client connections is limited.
General presentation - the Universe database
The Universe database is a relational database made of files; its tables are called "Hash Files".
A hashed file is an indexed file. It is the central element for using all the possibilities of the DataStage engine.
A hashed file with incorrectly defined keys may create disastrous problems.
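Conceptually, a hashed file behaves like a key-indexed map: records are stored and retrieved by key, and writing a second record with the same key replaces the first (this is the duplicate-key suppression used later in the course). A minimal Python sketch of the idea, illustrative only (not DataStage code; the keys and titles are invented):

```python
# Illustrative model of a hashed file: a dict keyed on the record key.
# Writing a duplicate key overwrites the previous record (duplicate suppression).
hash_file = {}

def write_record(key, record):
    hash_file[key] = record  # last write wins, as in a hashed file

write_record("C001", {"title": "Alien"})
write_record("C002", {"title": "Brazil"})
write_record("C001", {"title": "Alien (director's cut)"})  # same key again

assert len(hash_file) == 2                # the duplicate key was suppressed
assert hash_file["C001"]["title"] == "Alien (director's cut)"
```

This is also why a badly chosen key is disastrous: records that should be distinct silently overwrite each other.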
Summary
General presentation (DataStage: what is it?)
DataStage: how to use it?
The other components (part 2)
The Designer
The Designer is used to design jobs (look at the icon).
Jobs are composed of "Stages":
active stages: actions
passive stages: data storage
and Links between the stages.
The Designer - passive stages: a place for data storage (the data flow goes from the stage or to the stage).
Text File: sequential file.
Hash File: can be handled only by DataStage (not by WordPad, …), but simultaneous access to a hashed file is possible.
UV Stage: the file is in the Universe core (DataStage engine).
ODBC, OLEDB, ORAOCI: representation of a database; allows direct access to a database (for example through an ODBC link).
An active stage is a representation of a transformation on the data flow:
Sort: sorts a file
Aggregator: calculations
Transformer: selection, transformation, transport of properties …
Links can be placed between active and passive stages, between passive stages, or between active stages.
A job in the designer
Active Stage Passive Stage
The Designer
Each job has:
one or more sources of data
one or more transformations
one or more destinations for the data
The toolbar contains the stage icons to design the jobs.
Jobs have to be compiled to create executable programs.
The Designer - the Designer window: the toolbar with stage icons (palette), the repository, buttons to compile and to run the job.
The Designer
Let's study now the different stages:
Sequential File (text files)
Transformer
Hash File
Sort
Aggregator
Routines
UV Stage
The Designer - Sequential File stage:
Can be read.
Can be written (cached or not).
Can be read and written in the same job.
Can be a DOS file or a Unix file …
Can be read by two jobs at the same time.
Can't be written by two jobs at the same time.
The Designer - Sequential File: stage name, file type, stage description.
The Designer - Sequential File: stage name, output link (file to be written).
The Designer - Sequential File: data format (output file). Always use those values.
The Designer - Sequential File: the different columns of the file (output): type, length. Size to display (for View Data). To test the connection and view the data in the file.
The Designer - Sequential File:
To describe a file easily, use or create a "table definition".
Group your table definitions by application.
Create or modify table definitions (for files, databases, transformers, …).
The Designer - Sequential File: a table definition can then be used in different jobs (click on Load to find the right definition).
The Designer - Sequential File: View Data.
The Designer - Transformer stage:
Multi-source and multi-target.
Waits for the availability of the source data.
Transforms or propagates the data of each flow.
Allows you to select, filter, and create refusals (rejects) files.
Performs lookups between two flows (reference).
The Designer - Transformer stage can do treatments using:
native BASIC functions, or functions created in the Manager
DataStage functions or DataStage macros
routines (before/after type)
or can simply propagate columns.
Transformer Stage :
Right click : propagate all the columns
Transformer Stage :
Exercise n°1 :
Objective: read a sequential file and create a new one (save the file). The catalogue.in file has to be read and the catalogue_save.tmp file has to be written.
Source file: catalogue.in (in the \in directory)
Target file: catalogue_save.tmp (in the \tmp directory)
Steps:
1- Create a table definition (structure of the Catalogue table)
2- Design the job with 2 Sequential Files and 1 Transformer
3- Create the links (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)
The Designer - Transformer stage: look at the performance of your job. Right click on the grid and then select "Show performance statistics".
The Designer - create the parameters of the job: menu Edit > Job Properties, Parameters tab.
The Designer - Exercise n°2:
Objective: use environment variables (job parameters).
Create a job parameter: directory.
Place it on all the paths of the job from the first exercise (example: #directory#\tmp).
Compile.
Run with a different path (other groups).
Modify your input file (add your best film).
The Designer - Hash File stage:
Necessary for a lookup.
A hashed file is entirely written before it can be read (the FromTrans link must be finished before FromFilmTypeHF can start).
Allows grouping of multiple records with the same key (suppresses duplicate keys).
Can be read by different jobs simultaneously.
Can be written by different links simultaneously (in the same job or in different jobs).
The Designer - Hash File: stage name, account name (DataStage project), file path.
The Designer - Hash File: file name. For files to write: select this check box to specify that all records should be cached, rather than written to the hashed file immediately. This is not recommended where your job writes and reads the same hashed file in the same stream of execution.
The Designer - Hash File: a key must be defined (it can be a single or a multiple key).
The Designer - Transformer stage: lookup
• The main flow can be of any type.
• The secondary flow must be a Hash File to design a lookup (so very often you will have to create a temporary hashed file).
• The lookup is done with the key of the secondary flow.
• The number of records in the main flow can't be higher after the lookup than before the lookup.
• The lookup is shown with a dotted line.
• When a lookup is "exclusive", the number of records after the lookup is smaller than the number of records before the lookup.
The Designer - Transformer stage, lookup: reference flow (vertical), principal flow (horizontal).
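The lookup behaviour described above can be sketched in plain Python (this is an illustration, not DataStage syntax; the file names and type codes are invented). The reference flow is modelled as a dict, like a hashed file keyed on its lookup key:

```python
# Illustrative sketch of a Transformer lookup against a hashed file.
film_types = {"SF": "Science Fiction", "COM": "Comedy"}  # reference flow

main_flow = [
    {"title": "Alien", "type_code": "SF"},
    {"title": "Brazil", "type_code": "XXX"},  # key absent from the reference
]

output = []
for row in main_flow:
    # In DataStage the link variable NOTFOUND would be true for a missing key;
    # here we substitute a default value instead (as in exercise n°4).
    label = film_types.get(row["type_code"], "unknown type")
    output.append({**row, "type_label": label})

assert output[0]["type_label"] == "Science Fiction"
assert output[1]["type_label"] == "unknown type"
assert len(output) == len(main_flow)  # a lookup never grows the main flow
```

An exclusive lookup would instead drop the rows whose key is not found, which is why it can only shrink the main flow.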
The Designer - Exercise n°3:
Objective: make a lookup between the Catalogue file and the FilmType file to put the film type in the output file.
Source file: catalogue.in (in the \in directory)
Target file: catalogue.out (in the \out directory)
Steps:
1- Create a table definition (structure of the FilmType table)
2- Modify your job to create a hashed file from the FilmType.in file
3- Create the link to show the lookup (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)
The Designer - Exercise n°4:
Objective: put the director name and the film name together, separated by a ">". If the film type is not found, put "unknown type" in the output file.
What happens when the director name is empty? Find a solution.
The Designer - Exercise n°5:
Objective: if the film type is not found (use a constraint), put the film in a refusals file (first a Sequential File, then a Hash File).
The Designer - lookup with selection (exclusive lookup). Don't forget: a lookup can be designed with an ORAOCI stage or a UV stage, but it works better with hashed files.
The Designer - Exercise n°6:
Objective: select only the films for which the type is known (that means the lookup is OK).
The Designer - Exercise n°7:
Objective: select all the clients who are female and put them in an output file. The SEXE column contains M (male) or F (female).
Then create an annotation for this job (all jobs must have annotations).
The Director
The Director is the job controller; it allows you to:
Run jobs: immediately or later, with more options than in the Designer.
Control job status. Statuses: Compiled, Validated, Failed validation, Running, Aborted, …
Monitor jobs: control the number of lines treated by each active stage of a job.
The Director - run jobs with the Director: select the job and click here, then enter the parameters.
The Director - to run a job later: click here, then choose the date and time.
The Director - to modify run limits for a job (Limits tab):
Rows limit: the job stops after x rows (on each flow).
Warnings limit: the job stops after x warnings.
The Director - verify the status of jobs with the Director. The statuses:
• "Not compiled"
• "Compiled"
• "Failed validation"
• "Validated ok"
• "Aborted"
• "Finished"
• "Running"
The Director - example: the list of jobs, with buttons to view the log, to run jobs, to run jobs later, to stop jobs, and to reset the job status.
The Director - example of a Monitor
The monitor allows you to follow the different stages of a job. For each stage:
the number of treated lines (input and output)
the beginning time
the execution duration (elapsed time)
the status
the performance (rows/sec)
Link types: Pri: principal flow; Ref: reference flow (lookup); Out: output flow.
See the importance of good names for the stages and the links!
The Director - example of a log
To look at error messages, choose the job and click on the "Log" button.
Green: OK, no problem. Yellow: warning. Red: blocking problem.
Don't forget: clear the log from time to time (Job > Clear log).
The Manager
The Manager is the tool to export/import elements from one DataStage project to another (File > Open Project to change project).
All the elements (jobs, routines, table definitions) are classified in categories, but each name must be unique within a project.
Drag and drop an element to change its category.
To import or export elements, click on the appropriate button.
The Manager - EXPORT
Choose what you want to export (creates a .dsx file): jobs, table definitions, routines (always check the "Source Code" box).
Option to append to an existing file.
To change the selection: by category, or by individual components.
The Manager - IMPORT
This will create/modify elements in the DataStage project. Choose what you want to import and make your choice.
The Manager - with the Manager you can compile many jobs at the same time (multiple job compile).
Tools > Run multiple job compile:
select the type of jobs you want to compile, select "Show manual selection page" and click the "Next" button
select the jobs and click the "Next" button
click the "Start compile" button
The Designer - Sort stage: the sort criteria are filled in on the Stage tab / Properties tab. Modify those parameters if the file to sort has a lot of lines.
The Designer - Exercise n°8:
Objective: when you have selected all the women, sort the file in alphabetical order.
The Designer - Aggregator stage:
Allows data to be aggregated into a smaller number of records.
The aggregator does not sort the records.
Performance is better if the data is sorted (Input tab).
Intermediate treatments are executed in memory.
Allows execution of a before/after routine (before the stage treatment, or after it when all the lines have been treated).
The Designer - Aggregator stage: Input tab (when the input data is sorted).
The Designer - Aggregator stage: Output tab: group by, and the different functions.
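What the Aggregator computes can be sketched in plain Python (illustrative only, not DataStage syntax; the column names are invented): group the input rows on the "group by" columns, then apply functions such as count, sum or average to each group.

```python
# Illustrative sketch of the Aggregator stage: group by a key column,
# then apply aggregate functions, producing one output record per group.
from collections import defaultdict

rows = [
    {"Id_Cas": "C1", "days": 3},
    {"Id_Cas": "C1", "days": 5},
    {"Id_Cas": "C2", "days": 2},
]

groups = defaultdict(list)
for row in rows:
    groups[row["Id_Cas"]].append(row["days"])  # "group by" Id_Cas

# One output record per group, like the Aggregator's output link
result = {k: {"count": len(v), "avg": sum(v) / len(v)} for k, v in groups.items()}

assert result["C1"] == {"count": 2, "avg": 4.0}
assert result["C2"] == {"count": 1, "avg": 2.0}
```

Note that, unlike this dict-based sketch, the real stage streams its input, which is why pre-sorted data (declared on the Input tab) improves performance.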
The Designer - Exercise n°9:
Objective: create a job which reads location.in and calculates the hit-parade of the most hired cassettes (ordered by number of hires, descending). Also put the name of the film, not only the number of the cassette (lookup with catalogue.in).
The Designer - Exercise n°10:
Objective: create a job which reads location.in and calculates the average number of hires for each cassette (2 different methods can be used).
The Designer - Exercise n°9 (job to design).
The Designer - Exercise n°10 (job to design).
The Designer - Hash File stage: we have seen that the hashed file is necessary for a lookup, and that it allows suppression of duplicate keys. Let's see now how it is useful to group different flows.
The Designer - Exercise n°11:
Objective: with the job from exercise 10 (use the 2 methods in the same job), create a hashed file to put the different results in the same hashed file. In the hashed file you must have 2 lines.
Column 1: "AVERAGE METHOD 1" or "AVERAGE METHOD 2"
Column 2: the result of each method
The Designer - Exercise n°11 (job to design).
The Designer - stage variables:
Simple treatments can be made easily with stage variables.
A stage variable is data which remains "active" during the whole duration of the stage, so you can find a max (if the data is sorted), calculate a sum, or count something.
In the Transformer, right click and select "Show Stage Variables".
Example:
The designer Another example : Designer 73 .
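The idea behind stage variables can be sketched outside DataStage: a variable that keeps its value from one input row to the next. A hedged Python illustration (not DataStage syntax; the data is invented) of a running count per key over a sorted input:

```python
# Illustrative sketch of Transformer stage variables: values that persist
# across rows. Here: a running count per key, over input sorted by key.
rows = [("CLI1", "F1"), ("CLI1", "F2"), ("CLI2", "F3")]  # sorted by client

prev_key = None   # stage variable: key of the previous row
count = 0         # stage variable: running count for the current key
counts = {}

for key, film in rows:
    # Stage variables are evaluated in order, once per input row:
    count = count + 1 if key == prev_key else 1
    prev_key = key
    counts[key] = count  # what a derivation could output for this row

assert counts == {"CLI1": 2, "CLI2": 1}
```

As the next slides stress, the evaluation order of the variables matters: here `count` must be computed before `prev_key` is updated, or every row would look like a key change.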
The Designer - Exercise n°12:
Objective: try to calculate the average with stage variables.
Exercise n°13:
Objective: create a job that creates a file with all the clients (key) and, in a second column, the list of their films (separated by a dot).
The Designer - Exercise n°13 (job to design).
The Designer - Exercise n°13 (job to design)
The order of the stage variables is important: the instructions are executed in the order of the stage variables! (To change the order: right click > Stage Properties > Link Ordering tab.)
The variables must be initialized (right click > Stage Properties > Variables).
There must be a hashed file after the stage.
The Designer - DataStage variables
Several system variables are defined by DataStage: @NULL, @TRUE, @FALSE, @INROWNUM, @OUTROWNUM, @DATE, @PATH.
Link variables: the most useful is NOTFOUND.
The Designer - routines:
Source code (written in the BASIC language).
External to the jobs; a routine can be used many times, at many levels.
Can be a Transform function or a Before/After function:
a transform function is called for each line
a before subroutine is called before the first line (example: empty a file)
an after subroutine is called when all the lines have been treated
The Designer - routines (1/3): name of the routine, type of routine, short description (always fill this in).
The Designer - routines (2/3): arguments (to be filled in); they are used in the code.
The Designer - routines (3/3): code (use the argument names), Save, Compile, and test of the routine.
The Designer - routines: access to a sequential file

OpenSeq Path To FicXXX Then ... End Else ... End
ReadSeq Line From FicXXX Then ... End Else ... End
WriteSeq Line To FicXXX Then ... End Else ... End
WeofSeq FicXXX   (to empty the file)
CloseSeq FicXXX
"RoutineName") Call DSLogFatal("Abort".1) search string file after the third comma A=„Hello ‟ B=„World‟ C=A:B C=„Hello World‟ Trim(…. "RoutineName") Call DSLogWarn("Warning". ‟ ‟.‟T‟) suppress the trailing spaces A=„Hello ‟ A[1. "RoutineName") Loop Until Repeat GoTo Iconv("05/27/97". "D2/") Oconv(10740. "D2/") Designer For i= … To Next i If … Then End Else End Loop While Repeat Upcase(…) field(….'.'.The designer Routines : Call DSLogInfo("Information".3]=„Hel‟ 83 .3.
The Designer - routines: test the routine by double-clicking in the Result column.
The Designer - Exercise n°14, step 1:
Objective: write a routine which calculates the number of days between two dates.
If the begin date is null, return 0.
If the end date is null, initialize it with today's date.
Save, compile and test the routine.
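The logic asked for in this exercise can be sketched in Python (the real routine would be written in DataStage BASIC; this just illustrates the two null rules stated above):

```python
# Hedged sketch of exercise n°14, step 1: days between two dates,
# with the slide's rules for null begin/end dates.
from datetime import date

def days_between(begin, end):
    """begin/end are date objects or None; returns a day count."""
    if begin is None:        # null begin date -> return 0
        return 0
    if end is None:          # null end date -> use today's date
        end = date.today()
    return (end - begin).days

assert days_between(None, date(2006, 3, 7)) == 0
assert days_between(date(2006, 3, 1), date(2006, 3, 7)) == 6
```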
The Designer - Exercise n°14, step 2:
Objective: read location.in and generate a file with the hire duration (returned cassettes only). Non-returned cassettes after 10 days (end date null) are written to a refusals file with the name and address of the client (to send them a mail).
The Designer - Exercise n°14 (job to design).
The Designer - Exercise n°15:
Objective: with a routine (use CASE), calculate the amount for the cassette hire (number of days * hire price * coefficient). The coefficient is calculated with this rule:
< 5 days: days * hire price
>= 5 and < 10 days: days * hire price * 1.20
>= 10 and < 30 days: days * hire price * 1.50
>= 30 days: days * hire price * 3
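The pricing rule of this exercise, sketched in Python for clarity (the real routine would use BASIC's CASE construct; the function name is invented):

```python
# Hedged sketch of exercise n°15's CASE rule: amount = days * price * coeff,
# with the coefficient chosen from the day-count bands on the slide.
def hire_amount(days, price):
    if days < 5:
        coeff = 1.0
    elif days < 10:
        coeff = 1.20
    elif days < 30:
        coeff = 1.50
    else:
        coeff = 3.0
    return days * price * coeff

assert hire_amount(3, 2.0) == 6.0    # 3 days  -> 3 * 2.0 * 1.0
assert hire_amount(10, 2.0) == 30.0  # 10 days -> 10 * 2.0 * 1.5
assert hire_amount(30, 1.0) == 90.0  # 30 days -> 30 * 1.0 * 3
```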
The Designer - UV stage:
– works with internal hashed files (in the DataStage project)
– can make a Cartesian product
– uses SQL requests (select … from … where … order by …)
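A Cartesian product, as produced by an unconstrained join over two tables, pairs every row of one input with every row of the other. A small Python illustration (not DataStage syntax; the names are invented):

```python
# Illustrative sketch of the Cartesian product the UV stage can produce
# with an unconstrained SQL join: SELECT * FROM clients, cassettes
from itertools import product

clients = ["CLI1", "CLI2"]
cassettes = ["CAS1", "CAS2", "CAS3"]

pairs = list(product(clients, cassettes))  # every client with every cassette

assert len(pairs) == len(clients) * len(cassettes)  # 2 * 3 = 6 rows
assert ("CLI2", "CAS3") in pairs
```

The row count multiplies, which is exactly why exercise n°16 asks you to look at the job performance afterwards.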
The Designer - Exercise n°16: execute the Cartesian product on the Clients file and the Cassettes file.
Objective: propose to each client the cassettes he has never hired.
• Step 1: create the job parameter "account".
• Step 2: create a job to write the clients hashed file and the cassettes hashed file in the DataStage project, with the account parameter.
• Step 3: in a new job, use those hashed files to make the Cartesian product.
• Look at your job performances!
The Designer - Exercise n°16: step 1 and step 2.
The Designer - Exercise n°16: step 3.
The Designer - the number of records.
The Designer - normalization:
A multi-valued record such as
12 A|B|C|D|E
is normalized into:
12 A
12 B
12 C
12 D
12 E
(multi-valued file to normalized file; un-normalization is the reverse operation)
The Designer - normalization: a multi-valued file must have:
1- a key
2- Char(253), i.e. @VM, as the value separator
3- the column(s) to normalize
4- the "Normalize On" field of the Hash File stage checked
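The transformation itself is simple to state: one row whose second column holds @VM-separated values becomes one row per value. A Python illustration of the "12 A|B|C|D|E" example (not DataStage syntax; chr(253) stands in for UniVerse's @VM value mark):

```python
# Illustrative sketch of normalization: a multi-valued record keyed on "12"
# becomes one (key, value) row per value.
VM = chr(253)  # the @VM value mark, Char(253)

multi_valued = [("12", VM.join(["A", "B", "C", "D", "E"]))]

normalized = [
    (key, value)
    for key, values in multi_valued
    for value in values.split(VM)
]

assert normalized[0] == ("12", "A")
assert normalized[-1] == ("12", "E")
assert len(normalized) == 5
```

Un-normalization is the reverse: joining the values of each key back into one @VM-separated column, as exercise n°17 builds with a Sort stage and stage variables.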
The Designer - Exercise n°17: normalization/un-normalization
• Step 1: create a job which reads the location.in file and writes a hashed file (Id_Cli as the key, and the list of all Id_Cas separated by @VM): use a Sort stage and stage variables! => View Data on the input link of the Hash File.
• Step 2: modify the job to add normalization of this file => View Data on the output link of the Hash File.
• Step 3: compare the sequential file with the location.in file.
The Designer - Exercise n°17: job to design and View Data.
The Designer - the ORAOCI stage:
The version of Oracle used is 9i, so use the ORAOCI9 stage.
You can either use a query generated by DataStage, or a user-defined query, or a combination of the two.
The access parameters have to be defined by job parameters.
The stage can access one table or more.
Different actions can be programmed: read, insert, update.
You can also use stored procedures.
The Designer - ORAOCI stage: the access parameters have to be defined by job parameters.
The Designer - ORAOCI stage, output link: query generated by DataStage, or user-defined query.
The Designer - query generated by DataStage: selection of the columns, selection of the table(s), sort parameters, "Group by" clause.
The Designer - "Generate SELECT clause from column list; enter other clauses".
The Designer - "Enter custom SQL statement": when you want to add something specific, to format a date for example.
The Designer - ORAOCI stage, output link: choose the table, the important parameters, choose the action.
The Designer - ORAOCI stage: number of lines between two commits.
The Designer - ORAOCI stage, verify error code (1/3): if the job must abort when there is a SQL error.
The Designer - ORAOCI stage, verify error code (2/3): to receive the SQL error code.
The Designer - ORAOCI stage, verify error code (3/3): to select the errors, to receive the SQL error code, to treat lines one by one.
The Designer - the ORA Bulk stage:
- used to insert into a table (like SQLLOAD)
- very fast (deactivate the index before the load and reactivate it after the load)
- but no warning if the index is in an Unusable state after the load (duplicate keys, for example)
- not many Date and Time formats (DD.MM.YYYY, DD-MON-YYYY, YYYY-MM-DD, MM/DD/YYYY, hh24:mi:ss, hh:mi:ss am)
The Designer - ORA Bulk stage: DSN, user, password, table name (with oracle.tableName), Date and Time format, number of lines between two commits.
The Designer - how to create a table definition from a table in the database? In the repository, right click on Table Definitions, choose "Import", then "Plug-in Meta Data Definitions".
The Designer - then choose the table(s) and click on "Import". The table definitions will be created in the "ODBC" category.
Exercise n°18 : Read a Database
Objective : Create a job which reads the table REF_CPTE in BIODS database
Step 1: create the table definition from the database
Step 2: create the job that reads the table
Exercise n°19 : Write in a Database
Objective: create a job which writes into the table TST_ALADIN_JGV in the BIODS database (only the first 2 columns: the keys)
Location.in => TST_ALADIN_JGV: Id_Cli ========>> CHAR1, Id_Cas ========>> CHAR2. In CHAR1, put a letter (different for each group) before the client number (Id_Cli).
Step 1 : Use ORAOCI stage Step 2 : Same exercise with ORABULK stage
Exercise n°20 : Update a Database
Objective: create a job to update the columns BEGIN_DATE and END_DATE in the table TST_ALADIN_JGV in the BIODS database, from the location.in file
BEGIN_DATE and END_DATE are defined as timestamp !
The Administrator
The Administrator is used to:
Create a DataStage project.
Unlock a job: sometimes, due to server problems, the Designer (or Manager) falls down and some elements may be locked (jobs, table definitions, routines, …). In that case, use the Administrator (with administrator security rights).
The Administrator - unlock a job (1/3): choose your project and click on the Command button (this screen also lets you create a project).
The Administrator - unlock a job (2/3):
CHDIR C:\Ascential\DataStage\Engine
LIST.READU
Search for the user number and the device number.
The Administrator - unlock a job (3/3): unlock your job with the device number, or with the user number (UNLOCK USER UserNumber READULOCK), or unlock everything (UNLOCK ALL).
The Administrator - create a project: enter the project name and the location of the project (jobs, routines, table definitions, UV hashed files, …) on the server. The location must be different from the location of the data directories!