Introduction to DataStage

IBM Infosphere DataStage v11.5

© Copyright IBM Corporation 2015


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage
parallel jobs

Introduction to DataStage © Copyright IBM Corporation 2015


What is IBM InfoSphere DataStage?
• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects, such as data warehouses,
data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Build, run, and monitor jobs, all within DataStage
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
 Called job sequences



What is Information Server?
• Suite of applications, including DataStage, that share a common:
 Repository
 Set of application services and functionality
− Provided by the Metadata Server component
• By default an application named “server1”, hosted by an IBM WebSphere
Application Server (WAS) instance
− Provided services include:
• Security
• Repository
• Logging and reporting
• Metadata management
• Managed using the Information Server Web Console client



Information Server backbone

[Diagram: the suite products (Information Services Director, Information Governance Catalog, Information Analyzer, FastTrack, DataStage/QualityStage, MetaBrokers, Data Click) sit on top of Metadata Access Services and Metadata Analysis Services, which run on the Metadata Server; the stack is administered through the Information Server Web Console]


Information Server Web Console

[Screenshot: the Web Console's Administration and Reporting areas, including the management of InfoSphere users]


DataStage architecture
• DataStage clients

Administrator Designer Director

• DataStage engines
 Parallel engine
− Runs parallel jobs
 Server engine
− Runs server jobs
− Runs job sequences



DataStage Administrator

[Screenshot: the Administrator client showing project environment variables]


DataStage Designer

[Screenshot: the Designer client showing the menus and toolbar, a DataStage parallel job with a DB2 Connector stage, and the job log]


DataStage Director

[Screenshot: the Director client displaying job log messages]


Developing in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
 Specifies formats of sources and targets accessed by your jobs
• Build job in Designer
• Compile job in Designer
• Run the job and monitor job log messages
 The job log can be viewed either in Director or in Designer
− In Designer, only the job log for the currently opened job is available
 Jobs can be run from Director, from Designer, or from the command line
 Performance statistics show up in the log and also on the Designer canvas
as the job runs
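As a sketch of the command-line option, the dsjob client that ships with the DataStage engine can start jobs and inspect their logs. The project and job names below are placeholders; exact options available depend on your Information Server version:

```shell
# Run a job, wait for it to finish, and report its final status
dsjob -run -wait -jobstatus myproject MyParallelJob

# Summarize the job's log entries afterwards
dsjob -logsum myproject MyParallelJob
```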



DataStage project repository

[Screenshot: the Repository tree showing a user-added folder alongside the standard Jobs and Table Definitions folders]


Types of DataStage jobs
• Parallel jobs
 Executed by the DataStage parallel engine
 Built-in capability for pipeline and partition parallelism
 Compiled into OSH
− Executable script viewable in Designer and the log
• Server jobs
 Executed by the DataStage Server engine
 Use a different set of stages than parallel jobs
 No built-in capability for partition parallelism
 Runtime monitoring in the job log
• Job sequences (batch jobs, controlling jobs)
 A server job that runs and controls jobs and other activities
 Can run both parallel jobs and other job sequences
 Provides a common interface to the set of jobs it controls
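For a flavor of what OSH looks like, here is a minimal hand-written Orchestrate shell fragment that generates ten rows and prints them. This is illustrative only; the script the compiler actually produces is far more verbose, and operator options vary by version:

```
osh "generator -schema record(id:int32) -records 10 | peek"
```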
Design elements of parallel jobs
• Stages
 Passive stages (E and L of ETL)
− Read data
− Write data
− Examples: Sequential File, DB2, Oracle, Peek stages
 Processor (active) stages (T of ETL)
− Transform data (Transformer stage)
− Filter data (Transformer stage)
− Aggregate data (Aggregator stage)
− Generate data (Row Generator stage)
− Merge data (Join, Lookup stages)
• Links
 "Pipes" through which the data moves from stage to stage



Pipeline parallelism

• Transform, Enrich, Load stages execute in parallel


• Like a conveyor belt moving rows from stage to stage
 Run downstream stages while upstream stages are running
• Advantages:
 Reduces disk usage for staging areas
 Keeps processors busy
• Has limits on scalability
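DataStage's pipeline parallelism is built in, but the conveyor-belt idea can be sketched in ordinary Python: one thread per stage, connected by queues, with the downstream stages consuming rows while the upstream stages are still producing them. This is an analogy, not DataStage code:

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream

def extract(out_q):
    # "E": produce rows one at a time, like reading a source
    for row in [1, 2, 3, 4]:
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # "T": process rows as they arrive, while extract is still running
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 10)
    out_q.put(SENTINEL)

def load(in_q, results):
    # "L": consume transformed rows immediately; no staging area needed
    while (row := in_q.get()) is not SENTINEL:
        results.append(row)

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [10, 20, 30, 40]
```

Because no stage waits for its upstream stage to finish, no intermediate staging file is ever written, which is exactly the disk-usage advantage noted above.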
Partition parallelism
• Divide the incoming stream of data into subsets to be separately
processed by an operation
 Subsets are called partitions
• Each partition of data is processed by copies of the same stage
 For example, if the stage is Filter, each partition will be filtered in exactly
the same way
• Facilitates near-linear scalability
 8 times faster on 8 processors
 24 times faster on 24 processors
 This assumes the data is evenly distributed
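The idea can be sketched in Python: partition the rows, then apply the identical stage logic to every partition. Here a thread pool stands in for DataStage's processing nodes, and round-robin is just one of several possible partitioning methods:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    # Round-robin partitioner: row i goes to partition i % n
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def filter_stage(part):
    # The same stage logic runs identically on every partition
    return [r for r in part if r % 2 == 0]

rows = list(range(10))
parts = partition(rows, 3)

# Each partition is filtered by its own copy of the stage, in parallel
with ThreadPoolExecutor(max_workers=3) as pool:
    per_partition = list(pool.map(filter_stage, parts))

combined = sorted(r for part in per_partition for r in part)
print(combined)  # [0, 2, 4, 6, 8]
```

Each partition is filtered in exactly the same way, so the combined result is identical to filtering the whole stream sequentially, only faster when partitions run on separate processors.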



Three-node partitioning

[Diagram: the incoming data is split into subset1, subset2, and subset3; a copy of the stage processes each subset on Node 1, Node 2, and Node 3 respectively]
• Here the data is split into three partitions (nodes)


• The stage is executed on each partition of data separately and in
parallel
• If the data is evenly distributed, the data will be processed three
times faster



Job design versus execution

A developer designs the job flow in DataStage Designer; at runtime, the same job runs in parallel across any number of partitions (nodes).


Configuration file
• Determines the degree of parallelism (number of partitions) of jobs
that use it
• Every job runs under a configuration file
• Each DataStage project has a default configuration file
 Specified by the $APT_CONFIG_FILE job parameter
 Individual jobs can run under different configuration files than the project
default
− The same job can also run using different configuration files on different job runs



Example: Configuration file

[Screenshot: a parallel configuration file; each node { } block defines a partition, with disk and scratchdisk resources attached to the node]
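A two-node configuration file might look like the sketch below. The hostname and paths are placeholders; the syntax follows the parallel engine's node/resource grammar:

```
{
    node "node1" {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" { pools "" }
        resource scratchdisk "/ds/scratch" { pools "" }
    }
    node "node2" {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" { pools "" }
        resource scratchdisk "/ds/scratch" { pools "" }
    }
}
```

A job run with this file executes with two partitions; adding node entries increases the degree of parallelism without changing the job design.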


Checkpoint
1. True or false: DataStage Director is used to build and compile your
ETL jobs
2. True or false: Use Designer to monitor your job during execution
3. True or false: Administrator is used to set global and project
properties



Checkpoint solutions
1. False.
DataStage Designer is used to build and compile jobs.
DataStage Director is used to run and monitor jobs, although jobs can
also be run and monitored from within Designer.
2. True.
The job log is available both in Director and Designer. In Designer,
you can only view log messages for a job open in Designer.
3. True.



Unit summary
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage parallel
jobs

