DataStage Training Day 1
Course Prerequisites
To gain full benefit from this class, you should have a strong working knowledge of:
Relational database management systems (RDBMS)
Structured Query Language (SQL)
The Windows GUI
Open Database Connectivity (ODBC)
What is DataStage?
DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses.
DataStage Architecture
DataStage Administrator
Repository
Manager
DataStage Designer
DataStage Director
Client Components
Designer: creates DataStage jobs that are compiled into executable programs.
Director: used to run and monitor DataStage jobs.
Manager: a user interface for viewing and editing the contents of the repository and for importing and exporting DataStage components.
Administrator: a user interface for administrative tasks such as setting up users and adding projects.
DataStage Designer
Repository
Proprietary file format.
A repository is created for each project.
A single server can handle multiple repositories.
Stores all source/target definitions.
Can be imported/exported in XML format.
Repository contents:
Data elements
Jobs
Routines
Shared containers
Stage types
Table definitions
Transforms
DataStage Administrator
Creating and attaching a project to DataStage
Executing a UniVerse command
Cleaning up a project
License
Permissions
Auto-purge of log files
Inactivity timeout
User for scheduled jobs
User name troubleshooting
DataStage Designer
Design Area
Stages repository
ODBC Stage
The ODBC stage can be used to extract, write, and aggregate data.
The ODBC stage can have any number of inputs and outputs.
The three editable properties for this stage are Stage, Inputs, and Outputs.
Note: the DSN has to be defined on the server machine as a system DSN.
Input: when data is added/loaded to a table.
Output: when data is read/extracted from a table.
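The ODBC stage itself is configured in the Designer GUI, but its input-link (load) and output-link (extract) behaviour can be sketched in code. The following minimal Python sketch uses the stdlib sqlite3 module to stand in for an ODBC connection; with a real system DSN you would connect via an ODBC driver instead (the DSN name "MyDSN" mentioned in the comment is hypothetical).

```python
import sqlite3

# A minimal sketch of what the ODBC stage does: load rows into a table
# (input link) and extract rows from a table (output link).  sqlite3 here
# stands in for a real ODBC connection against a system DSN (e.g. a
# hypothetical "MyDSN" on the server machine).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Input link: data is added/loaded to a table.
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(1, "bolt"), (2, "nut"), (3, "washer")])
conn.commit()

# Output link: data is read/extracted from the table.
rows = cur.execute("SELECT id, name FROM products ORDER BY id").fetchall()
print(rows)  # [(1, 'bolt'), (2, 'nut'), (3, 'washer')]
```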
Define SQL type:
Generated query: built from the columns chosen.
User-generated query (SQL override).
Stored procedure read.
Preview data
Specify joins if multiple tables are used, or any clause to restrict the data being extracted.
Verify the SQL generated to extract the data; it can be user-defined or generated.
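To make the "generated query" idea concrete, a small Python sketch of how such a SELECT could be assembled from the chosen columns, table, and an optional restricting clause (the table and column names below are illustrative, not from the course material). A user-defined query (SQL override) would simply replace the generated string entirely.

```python
# Sketch of a "generated query": the stage builds the SELECT from the
# columns chosen on the Columns tab; a WHERE clause can restrict the rows.
columns = ["id", "name"]            # columns chosen in the stage
table = "products"                  # illustrative table name
where = "id > 10"                   # optional clause restricting data

generated = "SELECT {} FROM {} WHERE {}".format(
    ", ".join(columns), table, where)
print(generated)  # SELECT id, name FROM products WHERE id > 10
```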
Isolation Level
Specify a suitable transaction isolation level. This specifies how potential conflicts between transactions (i.e., dirty reads, nonrepeatable reads, and phantom reads) are handled. Choose from:
None. The default isolation level for the database is used.
Read Uncommitted. Dirty reads, nonrepeatable reads, and phantom reads are all possible.
Read Committed. Dirty reads are not possible. Nonrepeatable reads and phantom reads are both possible.
Repeatable Read. Dirty reads and nonrepeatable reads are not possible, but phantom reads are possible.
Serializable. Transactions are serializable. Dirty reads, nonrepeatable reads, and phantom reads are not possible.
Versioning. Transactions are serializable but have higher concurrency than with Serializable. Versioning is implemented using a non-locking protocol.
Auto commit. Each write is committed separately. Do not use this in conjunction with transaction groups.
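The Auto commit option, where each write is committed separately, has a direct analogue in Python's stdlib sqlite3 module: passing isolation_level=None puts the connection in autocommit mode. This is only an illustration of the commit-per-write idea, not of DataStage's ODBC stage itself.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so each
# write is committed separately -- the "Auto commit" behaviour above.
auto = sqlite3.connect(":memory:", isolation_level=None)
auto.execute("CREATE TABLE t (v INTEGER)")
auto.execute("INSERT INTO t VALUES (1)")   # committed immediately, no commit() call
count = auto.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 1
```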
Update action
Specifies how the data is written. Choose the option you want from the drop-down list box:
Clear the table, then insert rows. Deletes the contents of the table and adds the new rows.
Insert rows without clearing. Inserts the new rows in the table.
Replace existing rows completely. Deletes the existing rows, then adds the new rows to the table.
Update existing rows only. Updates the existing data rows. If a row with the supplied key does not exist in the table, the table is not updated and a warning is logged.
Insert new or update existing rows. New rows are added or, if the insert fails, the existing rows are updated.
Update existing or insert new rows. The existing data rows are updated or, if this fails, new rows are added.
Call stored procedure. The data is written using a stored procedure. When you select this option, the Procedure name field appears.
User-defined SQL. The data is written using a user-defined SQL statement. When you select this option, the View SQL tab is replaced by the Enter SQL tab.
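The "Insert new or update existing rows" action (try the insert, fall back to an update on key collision) can be sketched as follows; sqlite3 again stands in for the ODBC target, and the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO stock VALUES (1, 10)")

def insert_or_update(conn, key, qty):
    # "Insert new or update existing rows": attempt the insert first and
    # fall back to an update when the key already exists.
    try:
        conn.execute("INSERT INTO stock VALUES (?, ?)", (key, qty))
    except sqlite3.IntegrityError:
        conn.execute("UPDATE stock SET qty = ? WHERE id = ?", (qty, key))

insert_or_update(conn, 1, 99)   # key exists -> update
insert_or_update(conn, 2, 5)    # new key    -> insert
rows = conn.execute("SELECT id, qty FROM stock ORDER BY id").fetchall()
print(rows)  # [(1, 99), (2, 5)]
```

The reverse order (update first, insert on zero rows affected) corresponds to "Update existing or insert new rows".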
Right click
Columns can be defined by adding a new row in the columns grid and editing its contents. You can specify data types, keys, etc. If the target table is already defined and imported, you can load the table definition as you did before. Column definitions can also be taken from the incoming link to this stage.
ODBC Stage - Recap
Define the connection.
Define/import the tables and the table metadata.
Choose one of the following to read or write data:
User-defined SQL
Generated query
Stored procedure
File name
Update Action
Note: the file has to be imported into the repository before using it in this stage. The procedure is similar to importing ODBC table definitions. Backup existing file creates a backup before overwriting or appending to it. This can be used if a job is reset during execution.
Additional Properties
Missing Column Action: allows you to specify the action for columns missing in the data being read. Options are:
Pad with NULL
Pad with empty string
Map empty string
Suppress Truncation Warnings: used if you are reading only three columns from a file that has more than three columns defined.
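The padding options can be sketched in a few lines of Python: a row read from a flat file may have fewer fields than the defined columns, and the missing trailing columns are padded rather than failing the row. The column names here are illustrative.

```python
# Sketch of "Missing Column Action": a flat-file row with fewer fields than
# the column definitions is padded with NULL (None here) or an empty string.
columns = ["id", "name", "city"]          # three defined columns

def read_row(line, pad=""):
    fields = line.rstrip("\n").split(",")
    # Pad missing trailing columns instead of rejecting the row.
    fields += [pad] * (len(columns) - len(fields))
    return dict(zip(columns, fields))

print(read_row("1,smith"))            # pads city with "" -> empty string
print(read_row("1,smith", pad=None))  # pads city with None -> NULL
```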
Aggregator Stage
The Aggregator stage classifies data rows from a single input link into groups and computes totals or other aggregate functions. The sums/totals are output from the stage. You can have only one input link, but many output links if the data has to be grouped in different ways.
The Aggregator is an active stage.
Before/After routines: routines that have been included in the repository and can be executed before or after the stage, e.g. to run a DOS command or send an email.
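The grouping-and-totalling behaviour can be sketched in plain Python (the "region"/"sales" columns are illustrative): one column plays the grouped role and another is summed, just as the Group and Derivation settings specify in the stage.

```python
# Minimal sketch of the Aggregator stage: rows from a single input link are
# grouped on one column ("region") and a total is computed for another ("sales").
rows = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 40},
    {"region": "east", "sales": 60},
]

totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]

print(totals)  # {'east': 160, 'west': 40}
```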
Derivation editor
Note: each column in the input stage must be either grouped or summarized. The standard columns for the data type appear with two changes:
Group: specifies whether a given column is used in the GROUP BY clause.
Derivation: each input column needs a derivation for this stage. This is discussed in the next slide.
Note: the functions available change according to the underlying column data type.
Transformation Stage
Handles extracted data to perform conversions and passes the data to another transformation stage or another DataStage stage.
Can have any number of inputs and outputs.
Has one primary data link and many reference links.
Double click the stage to open the transformation editor
Transformation Editor
Link Area: where all the column definitions, expressions, stage variables, and constraints are defined. The link area is divided into two parts: the left pane shows the input links, and the right pane shows the output columns as well as their derivations, constraints, and stage variables. Each of these is explained in detail.
Metadata Area: shows the column metadata, again split into two panes, one for the input columns and one for the output columns. Here you can edit the column properties as you deem necessary.
Shortcut Area:
When you use the transformation stage, you have already defined what input data is to be passed and changed. The input columns are transformed in the transformation editor on the output pane for each column chosen.
Input Links: this stage has one primary link, which carries the bulk of the data to be transformed. It can also use many other links as reference links, which affect how the data is transferred (lookup activity).
Reference Links: you can define key expressions for the input columns on the reference links. The most common expression specifies an equi-join between the primary table and the reference table.
Primary Link
Transformer Concepts
Reference Link
CAT_ID is changed to a key column and then its key expression is defined in the link area.
A join is specified between the primary link column CATEGORY_ID and the reference link key column CAT_ID. The key expressions are validated by the expression editor; if they are incorrect, they appear in red, denoting an error.
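The equi-join between the primary link's CATEGORY_ID and the reference link's key column CAT_ID amounts to a keyed lookup, which can be sketched as follows (the item names and category values are illustrative):

```python
# Sketch of a transformer reference-link lookup: the reference link is keyed
# on CAT_ID, and the primary link's CATEGORY_ID is the key expression
# (an equi-join between the two links).
reference = {10: "hardware", 20: "software"}          # CAT_ID -> category name

primary = [{"item": "drill",  "CATEGORY_ID": 10},
           {"item": "editor", "CATEGORY_ID": 20}]

output = [{"item": r["item"],
           "category": reference[r["CATEGORY_ID"]]}    # lookup via key expression
          for r in primary]
print(output)
```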
Add/delete columns: you can add new columns by right-clicking the output pane in the link area and then defining the column metadata in the metadata area. The new column derivations can be custom strings, derivations, concatenations, or standard functions applied to input columns or stage variables. These are discussed shortly.
Move/edit columns: in the output pane of the link area you can move columns up and down by dragging. In the metadata area you can edit the column definitions as required.
Before/After routines: this is an active stage, hence you can specify routines to be executed by passing suitable input values, as discussed before.
Entering expressions: whenever the insertion point is in an expression box, you can right-click to open the suggested operand menu. This gives you rich support for defining various elements and using predefined transformations, routines, and functions. It also allows you to define expressions based on stage variables.
Transformation Variables
You can define local variables for a given transformation stage. These can be used to define key expressions or the output column derivations. Stage variables are shown in the output pane of the transformer stage.
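A common use of a stage variable is to carry a value from one row to the next, for example remembering the previous row's key so a derivation can flag the first row of each group. The Python sketch below (with illustrative data) mimics that pattern with an ordinary variable that persists across the row loop:

```python
# Sketch of a stage variable: it persists between rows, so a derivation can
# compare the current row's key with the previous row's key.
rows = ["east", "east", "west", "west", "west"]

prev_key = None            # the "stage variable"
flags = []
for key in rows:
    flags.append(key != prev_key)   # derivation: first row of a new group?
    prev_key = key                  # variable updated after each row
print(flags)  # [True, False, True, False, False]
```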
Transformation Constraints
You can limit output data by specifying constraints in the transformation stage. Constraints are BASIC expressions that can be defined for each output link of a transformer stage. You can also specify reject links for rows that have not been written to any output link. To define constraints, do the following:
Specify the constraint expression as desired. This can be used while writing data for tables partitioned on certain keys (e.g. Key_ID > 10), or for any other filtering criterion desired.
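The constraint/reject behaviour can be sketched as a filter per output link, with rows that satisfy no link's constraint going to the reject link (the Key_ID values are illustrative):

```python
# Sketch of output constraints: each output link keeps only the rows that
# satisfy its expression; rows written to no output link go to the reject link.
rows = [{"Key_ID": 5}, {"Key_ID": 12}, {"Key_ID": 30}]

link_a = [r for r in rows if r["Key_ID"] > 10]     # constraint: Key_ID > 10
rejects = [r for r in rows if r not in link_a]     # reject link
print(link_a, rejects)
```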
Folder Stage
Folder stages are used to read/write data as files in a directory located on a DataStage server. A Folder stage can read multiple files from a single directory and deliver the files to the job as rows on an output link.
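The one-row-per-file behaviour can be sketched in Python: every file in a directory is delivered as a (file name, file contents) row. A temporary directory with illustrative files stands in for the server directory.

```python
import os
import tempfile

# Sketch of the Folder stage: each file in a directory becomes one output
# row of (file name, file contents).
d = tempfile.mkdtemp()                       # stands in for the server directory
for name, text in [("a.txt", "alpha"), ("b.txt", "beta")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(text)

rows = []
for name in sorted(os.listdir(d)):           # one row per file
    with open(os.path.join(d, name)) as f:
        rows.append((name, f.read()))
print(rows)  # [('a.txt', 'alpha'), ('b.txt', 'beta')]
```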
Plug-in Stage
Plug-in stages are mostly used to access external programs that can be leveraged by DataStage applications. DataStage supports a number of plug-in stages to perform special tasks. They have to be installed separately on the server and configured before use. After installation, these stages have to be registered on the server system before being used. Examples: ORABULK, for bulk loading of data into an Oracle database; BCPLOAD, for bulk loading into MS SQL Server. Each plug-in stage has its own configurable properties, which can be checked in the manuals; please refer to them if you plan to use a plug-in stage.