DataStage Enterprise Edition


Introduction to DataStage EE Part 1


Ascential Platform


What is DataStage?
o Design jobs for extraction, transformation, and loading (ETL)
o Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
o Import, export, create, and manage metadata for use within jobs
o Schedule, run, and monitor jobs all within DataStage
o Administer your DataStage development and execution environments
Use the Administrator Project Properties window to:
Set job monitoring limits and other Director defaults on the General tab.
Enable or disable server-side tracing on the Tracing tab.
Specify hashed file stage read and write cache sizes on the Tunables tab.
Set user group privileges on the Permissions tab.
Specify a user name and password for scheduling jobs on the Schedule tab.

Use the Administrator to specify general server defaults, add and delete projects, and set project properties.

Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data.

In addition to table and file layouts, Manager displays the routines, transforms, and jobs that are defined in the project. Manager is also the primary interface to the DataStage repository, and custom routines and transforms can be created in it.

The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, transforming, integrating and loading into warehouse tables. The Designer provides a "visual data flow" method to easily interconnect and configure reusable components.

Developing in DataStage
Define your project's properties: Administrator
Open (attach to) your project
Import metadata that defines the format of data stores your jobs will read from or write to: Manager
Design the job: Designer
· Define data extractions (reads)
· Define data flows
· Define data transformations
· Define data constraints
· Define data loads (writes)
· Define data integration
· Define data aggregations
Compile and debug the job: Designer
Run and monitor the job: Director

o DataStage Designer is used to build and compile your ETL jobs.
o Administrator is used to set global and project properties.
o Director is used to execute your jobs after you build them.
o Manager is used to store and manage the reusable metadata for your jobs.

After this module you will be able to:
- Explain how to create and delete projects
- Set project properties in Administrator
- Set EE global properties in Administrator

Projects are created and deleted in Administrator, which is also where you set each project's properties and defaults.

In DataStage all development work is done within a project. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project. Projects are created during installation and after installation using Administrator. Before you can work in a project you must attach to it (open it).

Click Properties on the DataStage Administration window to open the Project Properties window. There are nine tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab.

When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log box allows you to specify conditions for purging these events: you can limit the logged events either by number of days or by number of job runs.
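The two purge conditions (by age in days, or by number of job runs) can be pictured with a small sketch. This is only an illustration of the idea, not how DataStage stores or purges its log; the event layout and the function names `purge_by_age` and `purge_by_runs` are assumptions made for the example.

```python
from datetime import datetime, timedelta

def purge_by_age(events, now, max_days):
    """Keep only events logged within the last max_days days."""
    cutoff = now - timedelta(days=max_days)
    return [e for e in events if e["time"] >= cutoff]

def purge_by_runs(events, max_runs):
    """Keep only events that belong to the most recent max_runs job runs."""
    if max_runs <= 0:
        return []
    recent_runs = sorted({e["run"] for e in events})[-max_runs:]
    return [e for e in events if e["run"] in recent_runs]

# Hypothetical log: three runs of one job, each event tagged with its run number.
now = datetime(2009, 1, 7)
log = [
    {"run": 1, "time": now - timedelta(days=10), "msg": "Job started"},
    {"run": 1, "time": now - timedelta(days=10), "msg": "Job finished"},
    {"run": 2, "time": now - timedelta(days=3), "msg": "Job started"},
    {"run": 2, "time": now - timedelta(days=3), "msg": "Job aborted"},
    {"run": 3, "time": now - timedelta(days=1), "msg": "Job started"},
]

print([e["msg"] for e in purge_by_age(log, now, max_days=7)])
print([e["msg"] for e in purge_by_runs(log, max_runs=1)])
```

Purging by days keeps the log bounded in time, while purging by runs keeps a fixed amount of history regardless of how often the job runs.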

If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator.

Use this page to set user group permissions for accessing and using DataStage. This helps to prevent unauthorized access to DataStage projects. All DataStage users must belong to a recognized user role before they can log on to DataStage.

There are three roles of DataStage user:
DataStage Developer, who has full access to all areas of a DataStage project.
DataStage Operator, who can run and manage released DataStage jobs.
<None>, who does not have permission to log on to DataStage.

UNIX note: In UNIX, the groups displayed are defined in /etc/group.

This tab is used to enable and disable server-side tracing. The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem.

Warning: Tracing adds significant overhead on the server and should only be used to diagnose serious problems. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client.

On the Tunables tab, you can specify the sizes of the memory caches used when reading rows from hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.

You should enable OSH for viewing. OSH is generated when you compile a job.

Metadata is "data about data" that describes the formats of sources and targets. This includes general format information such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.
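As a rough illustration of what such metadata holds, the sketch below models a table definition as general format information plus a list of column definitions. The class names, fields, and the sample file are hypothetical and are not the DataStage repository format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Column:
    """One column definition: name, type, and length."""
    name: str
    sql_type: str
    length: int = 0
    nullable: bool = False

@dataclass
class TableDefinition:
    """Format information plus column definitions for one source or target."""
    name: str
    delimited: bool = True
    delimiter: str = ","
    columns: List[Column] = field(default_factory=list)

    def column_names(self):
        return [c.name for c in self.columns]

# Hypothetical comma-delimited customer file.
customers = TableDefinition(
    name="customers.txt",
    delimited=True,
    delimiter=",",
    columns=[
        Column("cust_id", "Integer"),
        Column("cust_name", "VarChar", length=40),
        Column("region", "Char", length=2, nullable=True),
    ],
)

print(customers.column_names())
```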

DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines. The left pane contains the project tree. There are seven main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents.

Metadata describing sources and targets: Table definitions
DataStage objects: jobs, routines, table definitions, etc.

Import and Export
    

Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers


Export Procedure

  

In Manager, click "Export>DataStage Components"
Select DataStage objects for export
Specify type of export: DSX, XML
Specify file path on client machine


Review Questions

You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file. (T/F)
The directory to which you export is on the DataStage client machine, not on the DataStage server machine. (T/F)


Exporting DataStage Objects


In Manager, click "Import>DataStage Components"
Select DataStage objects for import

Importing DataStage Objects 01/07/09 Sayrite Inc.

Import format and column definitions from sequential files
Import relational table column definitions
Imported as "Table Definitions"
Table definitions can be loaded into job stages

In Manager, click Import>Table Definitions>Sequential File Definitions
Select directory containing sequential file and then the file
Select Manager category
Examine the format and column definitions and edit if necessary

In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window. Select the Format tab to edit the file format specification. Click the Columns tab to view and modify any column definitions.

Intro Part 4: Designing and Documenting Jobs

After this module you will be able to:
- Describe what a DataStage job is
- List the steps involved in creating a job
- Identify the different types of stages
- Describe links and stages
- Design a simple extraction and load job
- Compile your job
- Create parameters to make your job flexible
- Document your job

Executable DataStage program
Created in DataStage Designer, but can use components from Manager
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)

In Manager, import metadata defining sources and targets
In Designer, add stages defining data extractions and loads
Add Transformers and other stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job

Designer Work Area

Tools Palette

Stages can be dragged from the tools palette or from the stage type branch of the repository view
Links can be drawn from the tools palette or by right-clicking and dragging from one stage to another

Used to extract data from, or load data to, a sequential file
Specify full path to the file
Specify a file format: fixed width or delimited
Specify column definitions
Specify write action
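The difference between the two file formats can be sketched in a few lines. This is a minimal illustration, assuming simple text rows; the helper names `read_delimited` and `read_fixed_width` and the sample columns are hypothetical and are not part of DataStage.

```python
import csv
import io

def read_delimited(text, columns, delimiter=","):
    """Split each line on the delimiter and pair values with column names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(columns, row)) for row in reader]

def read_fixed_width(text, columns):
    """Slice each line by position; columns is a list of (name, width) pairs."""
    rows = []
    for line in text.splitlines():
        row, position = {}, 0
        for name, width in columns:
            row[name] = line[position:position + width].strip()
            position += width
        rows.append(row)
    return rows

delimited_rows = read_delimited("1,Ann\n2,Bob\n", ["id", "name"])
fixed_rows = read_fixed_width("1  Ann  \n2  Bob  ", [("id", 3), ("name", 5)])

print(delimited_rows)
print(fixed_rows)
```

Both readers produce the same rows here; only the format description differs, which is why the format and the column definitions are specified separately from the file path.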

Any required properties that are not completed will appear in red. You are defining the file from which the job will read. If the file doesn't exist, you will get an error at run time. You will be able to view its data using the View data button.

On the Format tab, you specify a format for the source file. Define the output link listed in the Output name box. You are defining the format of the data flowing out of the stage to the output link. Think of a link as a pipe: what flows in one end flows out the other end (at the Transformer stage).

Editing a Sequential Target

Defining a sequential target stage is similar to defining a sequential source stage. You are defining the file the job will write to. If the file doesn't exist, it will be created when the job runs. You will not (of course!) be able to view its data until after the job runs: if you click the View data button while the target file doesn't yet exist, DataStage will return a "Failed to open ..." error.

On the Format tab, you can specify a different format for the target file than you specified for the source file. Define each input link listed in the Input name box. You are defining the format of the data flowing into the stage. Think of a link as a pipe: what flows in one end flows out the other end, so the format going in is the same as the format going out. The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link. Specify whether to overwrite or append the data in the Update action set of buttons.
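The overwrite and append choices behave like the two ways of opening a file for writing. The sketch below is an analogy only, assuming a comma-delimited target; `write_rows`, `read_lines`, and the temporary file path are hypothetical names for the example.

```python
import os
import tempfile

def write_rows(path, rows, action="overwrite"):
    """Write rows as comma-delimited lines, replacing or extending the file."""
    if action not in ("overwrite", "append"):
        raise ValueError(f"unknown write action: {action}")
    mode = "w" if action == "overwrite" else "a"
    with open(path, mode, newline="") as target:
        for row in rows:
            target.write(",".join(row) + "\n")

def read_lines(path):
    """Return the file contents as a list of lines."""
    with open(path, newline="") as source:
        return source.read().splitlines()

# Hypothetical target file in a scratch directory.
target_path = os.path.join(tempfile.mkdtemp(), "target.txt")

write_rows(target_path, [["1", "Ann"]])                      # file is created
write_rows(target_path, [["2", "Bob"]], action="append")     # row is added
after_append = read_lines(target_path)

write_rows(target_path, [["3", "Cy"]], action="overwrite")   # earlier rows are replaced
after_overwrite = read_lines(target_path)

print(after_append)
print(after_overwrite)
```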

Used to define constraints, derivations, and column mappings
A column mapping maps an input column to an output column
In this module we will define only column mappings (no derivations)
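A column mapping can be pictured as a lookup from each output column to the input column that feeds it. The sketch below is a conceptual illustration in Python, not Transformer syntax; `apply_mapping` and the column names are hypothetical.

```python
def apply_mapping(input_row, mapping):
    """Build an output row; mapping is {output column: input column}."""
    return {out_col: input_row[in_col] for out_col, in_col in mapping.items()}

# Hypothetical mapping: rename two columns and drop the third.
mapping = {"CustomerId": "cust_id", "CustomerName": "cust_name"}
input_row = {"cust_id": "1", "cust_name": "Ann", "region": "NE"}

output_row = apply_mapping(input_row, mapping)
print(output_row)
```

Because no derivation is involved, each value passes through unchanged; only the column it lands in is decided by the mapping.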

Transformer Stage Elements

Stage variables are used for a variety of purposes:
Counters
Temporary registers for derivations
Controls for constraints
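The three uses listed above can be sketched with ordinary variables that keep their values from one row to the next. This is a conceptual illustration in Python rather than Transformer expressions; the function and column names are hypothetical.

```python
def first_row_per_region(rows):
    """Emit the first row of each consecutive run of rows with the same region."""
    row_count = 0        # counter
    prev_region = None   # temporary register holding the previous row's value
    output = []
    for row in rows:
        row_count += 1
        is_new_region = row["region"] != prev_region  # control for the constraint
        prev_region = row["region"]
        if is_new_region:  # constraint: only pass rows that start a new region
            output.append({"seq": row_count, "region": row["region"]})
    return output

rows = [{"region": r} for r in ["E", "E", "W", "W", "E"]]
result = first_row_per_region(rows)
print(result)
```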

Makes the job more flexible
Parameters can be:
- Used in constraints and derivations
- Used in directory and file names
Parameter values are determined at run time
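The idea of resolving parameters at run time can be sketched as placeholder substitution in a file path. The `#Name#` placeholder style, the `resolve` helper, and the parameter names below are assumptions made for this illustration.

```python
import re

def resolve(template, params):
    """Replace each #Name# placeholder with the run-time value of that parameter."""
    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value supplied for parameter {name}")
        return params[name]
    return re.sub(r"#(\w+)#", lookup, template)

# Hypothetical parameters supplied when the job is run.
run_time_values = {"SourceDir": "/data/in", "RunDate": "20090107"}
path = resolve("#SourceDir#/customers_#RunDate#.txt", run_time_values)
print(path)
```

The same job design can then read a different file on each run simply by supplying different values, which is what makes a parameterized job more flexible.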

Job Properties
- Short and long descriptions
- Shows in Manager
Annotation stage
- Is a stage on the tool palette
- Shows on the job GUI (work area)

Annotation Stage on the Palette

Annotation Stage Properties

Before you can run your job, you must compile it. To compile it, click File>Compile or click the Compile button on the tool bar. The Compile Job window displays the status of the compile. A compile will generate OSH.

If an error occurs:
Click Show Error to identify the stage where the error occurred. This will highlight the stage in error.
Click More to retrieve more information about the error. This can be lengthy for parallel jobs.

Intro Part 5: Running Jobs

After this module you will be able to:
- Use DataStage Director to run your job
- Validate your job
- Set run options
- Monitor your job's progress
- View job log messages

Can schedule, validate, and run jobs
Can be invoked from DataStage Manager or Designer
Tools > Run Director

This shows the Director Status view. To run a job, select it and then click Job>Run Now. Better yet, switch to the Log view from the main Director screen and then click the green arrow to execute the job, so that you can watch the log messages as the job runs.

The Job Run Options window is displayed when you click Job>Run Now.

You can validate your job before you run it. Validation performs some checks that are necessary in order for your job to run successfully. These include:
Verifying that connections to data sources can be made.
Verifying that files can be opened.
Verifying that SQL statements used to select data can be prepared.
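The second check, verifying that files can be opened, might look roughly like the sketch below. It only illustrates the idea of testing resources before processing any rows and is not how Director implements validation; the function name and file names are hypothetical.

```python
import os
import tempfile

def check_files_can_be_opened(paths):
    """Try to open each file for reading and collect a message per failure."""
    problems = []
    for path in paths:
        try:
            with open(path, "r"):
                pass
        except OSError as error:
            problems.append(f"{path}: {error.strerror}")
    return problems

# Hypothetical job with one source file that exists and one that does not.
work_dir = tempfile.mkdtemp()
existing = os.path.join(work_dir, "customers.txt")
missing = os.path.join(work_dir, "orders.txt")
with open(existing, "w") as handle:
    handle.write("1,Ann\n")

problems = check_files_can_be_opened([existing, missing])
print(problems)
```

An empty list means every file could be opened, so the run can proceed; otherwise the problems can be fixed before any data is read or written.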

This window allows you to stop the job after:
A certain number of rows.
A certain number of warning messages.
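The effect of the two limits can be sketched as a processing loop that stops early. This is a simplified model, assuming each row either succeeds or raises one warning; `run_job` and the status strings are hypothetical and do not reflect Director's actual status values.

```python
def run_job(rows, process, row_limit=None, warning_limit=None):
    """Process rows until done, the row limit is reached, or too many warnings occur."""
    processed, warnings = 0, 0
    for row in rows:
        if row_limit is not None and processed >= row_limit:
            return "Stopped at row limit", processed, warnings
        succeeded = process(row)
        processed += 1
        if not succeeded:
            warnings += 1
            if warning_limit is not None and warnings >= warning_limit:
                return "Aborted on warnings", processed, warnings
    return "Finished", processed, warnings

def is_even(value):
    """Hypothetical row check: odd values raise a warning."""
    return value % 2 == 0

print(run_job(range(10), is_even))
print(run_job(range(10), is_even, row_limit=4))
print(run_job(range(10), is_even, warning_limit=3))
```

A row limit is handy for a quick test run against a large source, while a warning limit keeps a job with bad input from filling the log.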

Click Run to run the job after it is validated. The Status column displays the status of the job run.

Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job, error messages, warning messages, informational messages, and program-generated messages.

Schedule job to run on a particular date/time
Clear job log
Set Director options
- Row limits
- Abort after x warnings