Introduction to DataStage

IBM Infosphere DataStage v11.5

© Copyright IBM Corporation 2015


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage
parallel jobs

Introduction to DataStage © Copyright IBM Corporation 2015


What is IBM InfoSphere DataStage?
• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects, such as data warehouses,
data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Build, run, and monitor jobs, all within DataStage
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
 Called job sequences



What is Information Server?
• Suite of applications, including DataStage, that share a common:
 Repository
 Set of application services and functionality
− Provided by the Metadata Server component
• By default an application named “server1”, hosted by an IBM WebSphere
Application Server (WAS) instance
− Provided services include:
• Security
• Repository
• Logging and reporting
• Metadata management
• Managed using the Information Server Web Console client



Information Server backbone

[Diagram: the suite products (Information Services Director, Information Governance Catalog, Information Analyzer, FastTrack, DataStage/QualityStage, MetaBrokers, Data Click) sit on top of Metadata Access Services and Metadata Analysis Services, which run on the Metadata Server; the stack is administered through the Information Server Web Console]


Information Server Web Console

[Screenshot: the Web Console's Administration and Reporting areas, including the management of InfoSphere users]


DataStage architecture
• DataStage clients

Administrator Designer Director

• DataStage engines
 Parallel engine
− Runs parallel jobs
 Server engine
− Runs server jobs
− Runs job sequences



DataStage Administrator

[Screenshot: the Administrator client showing project environment variables]


DataStage Designer

[Screenshot: the Designer client showing the menus and toolbar, a DataStage parallel job with a DB2 Connector stage, and the job log]


DataStage Director

[Screenshot: the Director client displaying job log messages]


Developing in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
 Specifies formats of sources and targets accessed by your jobs
• Build job in Designer
• Compile job in Designer
• Run the job and monitor job log messages
 The job log can be viewed either in Director or in Designer
− In Designer, only the job log for the currently opened job is available
 Jobs can be run from Director, from Designer, or from the command line
 Performance statistics show up in the log and also on the Designer canvas
as the job runs
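As a sketch of the command-line option, the dsjob client that ships with the DataStage engine can start jobs and inspect their logs. The project and job names below are placeholders; exact options available depend on your Information Server version:

```shell
# Run a job, wait for it to finish, and report its final status
dsjob -run -wait -jobstatus myproject MyParallelJob

# Summarize the job's log entries afterwards
dsjob -logsum myproject MyParallelJob
```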



DataStage project repository

[Screenshot: the Repository tree showing a user-added folder alongside the standard Jobs and Table Definitions folders]


Types of DataStage jobs
• Parallel jobs
 Executed by the DataStage parallel engine
 Built-in capability for pipeline and partition parallelism
 Compiled into OSH
− Executable script viewable in Designer and the log
• Server jobs
 Executed by the DataStage Server engine
 Use a different set of stages than parallel jobs
 No built-in capability for partition parallelism
 Runtime monitoring in the job log
• Job sequences (batch jobs, controlling jobs)
 A server job that runs and controls jobs and other activities
 Can run both parallel jobs and other job sequences
 Provides a common interface to the set of jobs it controls
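For a flavor of what OSH looks like, here is a minimal hand-written Orchestrate shell fragment that generates ten rows and prints them. This is illustrative only; the script the compiler actually produces is far more verbose, and operator options vary by version:

```
osh "generator -schema record(id:int32) -records 10 | peek"
```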
Design elements of parallel jobs
• Stages
 Passive stages (E and L of ETL)
− Read data
− Write data
− Examples: Sequential File, DB2, Oracle, Peek stages
 Processor (active) stages (T of ETL)
− Transform data (Transformer stage)
− Filter data (Transformer stage)
− Aggregate data (Aggregator stage)
− Generate data (Row Generator stage)
− Merge data (Join, Lookup stages)
• Links
 "Pipes" through which the data moves from stage to stage



Pipeline parallelism

• Transform, Enrich, Load stages execute in parallel


• Like a conveyor belt moving rows from stage to stage
 Run downstream stages while upstream stages are running
• Advantages:
 Reduces disk usage for staging areas
 Keeps processors busy
• Has limits on scalability
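DataStage's pipeline parallelism is built in, but the conveyor-belt idea can be sketched in ordinary Python: one thread per stage, connected by queues, with the downstream stages consuming rows while the upstream stages are still producing them. This is an analogy, not DataStage code:

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream

def extract(out_q):
    # "E": produce rows one at a time, like reading a source
    for row in [1, 2, 3, 4]:
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # "T": process rows as they arrive, while extract is still running
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 10)
    out_q.put(SENTINEL)

def load(in_q, results):
    # "L": consume transformed rows immediately; no staging area needed
    while (row := in_q.get()) is not SENTINEL:
        results.append(row)

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [10, 20, 30, 40]
```

Because no stage waits for its upstream stage to finish, no intermediate staging file is ever written, which is exactly the disk-usage advantage noted above.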
Partition parallelism
• Divide the incoming stream of data into subsets to be separately
processed by an operation
 Subsets are called partitions
• Each partition of data is processed by copies of the same stage
 For example, if the stage is Filter, each partition will be filtered in exactly
the same way
• Facilitates near-linear scalability
 8 times faster on 8 processors
 24 times faster on 24 processors
 This assumes the data is evenly distributed
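The idea can be sketched in Python: partition the rows, then apply the identical stage logic to every partition. Here a thread pool stands in for DataStage's processing nodes, and round-robin is just one of several possible partitioning methods:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    # Round-robin partitioner: row i goes to partition i % n
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def filter_stage(part):
    # The same stage logic runs identically on every partition
    return [r for r in part if r % 2 == 0]

rows = list(range(10))
parts = partition(rows, 3)

# Each partition is filtered by its own copy of the stage, in parallel
with ThreadPoolExecutor(max_workers=3) as pool:
    per_partition = list(pool.map(filter_stage, parts))

combined = sorted(r for part in per_partition for r in part)
print(combined)  # [0, 2, 4, 6, 8]
```

Each partition is filtered in exactly the same way, so the combined result is identical to filtering the whole stream sequentially, only faster when partitions run on separate processors.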



Three-node partitioning

[Diagram: the incoming data is split into subset1, subset2, and subset3; a copy of the stage processes each subset on Node 1, Node 2, and Node 3 respectively]
• Here the data is split into three partitions (nodes)


• The stage is executed on each partition of data separately and in
parallel
• If the data is evenly distributed, the data will be processed three
times faster



Job design versus execution

A developer designs the job flow in DataStage Designer; at runtime, the same job runs in parallel across any number of partitions (nodes).


Configuration file
• Determines the degree of parallelism (number of partitions) of jobs
that use it
• Every job runs under a configuration file
• Each DataStage project has a default configuration file
 Specified by the $APT_CONFIG_FILE job parameter
 Individual jobs can run under different configuration files than the project
default
− The same job can also run using different configuration files on different job runs



Example: Configuration file

[Screenshot: a parallel configuration file; each node { } block defines a partition, with disk and scratchdisk resources attached to the node]
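A two-node configuration file might look like the sketch below. The hostname and paths are placeholders; the syntax follows the parallel engine's node/resource grammar:

```
{
    node "node1" {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" { pools "" }
        resource scratchdisk "/ds/scratch" { pools "" }
    }
    node "node2" {
        fastname "etlhost"
        pools ""
        resource disk "/ds/data" { pools "" }
        resource scratchdisk "/ds/scratch" { pools "" }
    }
}
```

A job run with this file executes with two partitions; adding node entries increases the degree of parallelism without changing the job design.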


Checkpoint
1. True or false: DataStage Director is used to build and compile your
ETL jobs
2. True or false: Use Designer to monitor your job during execution
3. True or false: Administrator is used to set global and project
properties



Checkpoint solutions
1. False.
DataStage Designer is used to build and compile jobs.
DataStage Director is used to run and monitor jobs, although jobs can
also be run and monitored from within Designer.
2. True.
The job log is available both in Director and Designer. In Designer,
you can only view log messages for a job open in Designer.
3. True.



Unit summary
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage parallel
jobs

