
DataStage

DataStage: - DataStage is a server application that connects to data
sources and targets and processes ("transforms") the data as it moves through
the application. DataStage is classed as an "ETL tool", the initials standing for
Extract, Transform and Load.

DataStage Job: - DataStage is a client-server technology with an integrated set of
tools used for designing, running, monitoring and administering data acquisition
applications; such an application is known as a DataStage job.

Job: - 1) A job is a graphical representation of the data flow from source to target
systems.
2) A job is an ordered series of individual stages linked together to describe the
data flow from source to target.

A job definition covers:

1) Source metadata
2) Business logic
3) Target metadata

DataStage Components

DataStage Client Components

1) DataStage Administrator
2) DataStage Designer
3) DataStage Director

DataStage Server Components

1) DataStage Server
2) DataStage Repository
3) DataStage Package Installer

Client Components

DataStage Administrator: - A user interface used to perform administration
tasks such as setting up DataStage users, creating and deleting projects,
and setting up purging criteria.

DataStage Designer: - A design interface used to create DataStage applications
(known as jobs). Each job specifies the data sources, the transforms required,
and the destination of the data. Jobs are compiled to create executables that are
scheduled by the Director and run by the Server (mainframe jobs are transferred to
and run on the mainframe). The Designer also handles job export and import
activities.

DataStage Director: - A user interface used to validate, schedule, run,
and monitor DataStage server jobs and parallel jobs.

Server Components

DataStage Repository: - A central store that contains all the information required
to build a data warehouse.
DataStage Server: - Runs executable jobs that extract, transform, and load
data into a data warehouse.
DataStage Package Installer: - A user interface used to install packaged
DataStage jobs and plug-ins.

DataStage Architecture

Connection to data sources and targets can use many different techniques,
primarily direct access (for example, directly reading/writing text files), industry-
standard protocols such as ODBC, and vendor-specific APIs for connecting to
databases and packaged applications such as Siebel, SAP and Oracle.

DataStage Jobs

There are three basic types of DataStage job:

Server Jobs: - These jobs are compiled and run on the DataStage server. A server
job will connect to databases on other machines as necessary, extract data, process
it, and then write the data to the target data warehouse.

Parallel Jobs: - These are compiled and run on the DataStage server in a
similar way to server jobs, but support parallel processing on SMP, MPP, cluster and
grid systems.

Sequence Jobs: - Sequence jobs contain activities, which are special stages that
indicate the actions that occur when the sequence job runs. You interact with
activities in the same way that you interact with stages in parallel jobs and server
jobs. To add an activity to your job sequence, drag the corresponding icon from
the Palette to the sequence job canvas. After you design your sequence job by
adding activities and triggers, you define the properties for each activity. The
properties that you define control the behaviour of the activity.

Stage: - A stage defines a database, a file, or a processing step.

There are two types of stages:

1) Built-in stages
2) Plug-in stages

Built-in Stage: - These stages are used for extraction, transformation and
loading. There are two types of built-in stages.

a) Passive Stage: - A stage that defines read or write access is known as a
passive stage.

Ex: - Databases or Files

b) Active Stage: - A stage that processes or filters the data is known as an
active stage.

Ex: - Transformer Stage, Aggregator Stage, and Sort Stage.

Plug-in Stage: - These are used to perform specific tasks that are not possible with
the built-in stages.

Ex: - Loading mechanisms:

a) Immediate method
b) Bulk method

Link: - A link defines the direction of data flow and carries the data from source to
target.

There are three types of links:

1) Primary link
2) Reference link
3) Reject link

Primary links are drawn as solid lines; reference and reject links are drawn as
dotted lines.

Parallel processing environments


Your system’s architecture and hardware resources define
the environment in which you run your parallel jobs. All parallel processing
environments fall into one of two categories:

1) SMP (symmetric multiprocessing), in which some hardware resources may be
shared among processors. The processors communicate via shared memory and
have a single operating system.
2) Cluster or MPP (massively parallel processing), also known as shared-nothing,
in which each processor has exclusive access to hardware resources. MPP systems
are physically housed in the same box, whereas cluster systems can be physically
dispersed. The processors each have their own operating system and communicate
via a high-speed network.

SMP systems allow you to scale up the number of processors, which may improve
the performance of your jobs. The improvement gained depends on what limits your
job (for example CPU, memory, or disk I/O).

In a cluster or MPP environment, you can use the multiple
processors and their associated memory and disk resources in concert to tackle a
single job. In this environment, each processor has its own dedicated memory,
memory bus, disk, and disk access. In a shared-nothing environment, parallelization
of your job is likely to improve the performance of CPU-limited, memory-limited, or
disk I/O-limited applications.

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.productization.iisinfsv.overview.doc/topics/cisoarchscale.html

Configuration file, Node & Partition

What is a configuration file?


Configuration files provide details of the nodes allocated for a particular project or job.
Several configuration files can exist, for example a default configuration file, a 1-node
configuration file, a 2-node configuration file, a 4-node configuration file, and so on.
The configuration file describes the hardware configuration supporting such
architectures as SMP (a single machine with multiple CPUs, shared memory and disk),
Grid, Cluster or MPP (multiple CPUs, multiple nodes and dedicated memory per
node).

APT_CONFIG_FILE is an environment variable that points to the configuration file
defining the nodes (including the scratch and temporary areas) and the disk storage
information for a specific project or job.

DataStage understands the architecture of the system through this file.

The default configuration file can be any node configuration file, and it is used when
exporting jobs from one environment to another.
Following are the different components of any configuration file:

1) Node Pool: - defines resource allocation. Pools can overlap across
nodes or can be independent.
2) Fastname: - defines the node's hostname or IP address.
3) Resource Disk: - names the disk directories accessible to each node for
persistent (dataset) storage.
4) Resource Scratch Disk: - names the disk directories accessible to each
node for scratch/temporary storage.
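
For illustration only, a minimal one-node configuration file of the kind pointed to by APT_CONFIG_FILE might look like the following sketch (the host name "etl_server" and the directory paths are placeholders, not values taken from this document):

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/datasets" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
}

Adding further node blocks to the same file increases the degree of partition parallelism available to a job run with that configuration.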

Example: Nodes allocated for the default configuration file in the target account:

Environment      Allocated nodes for default configuration file in target
Development      2
Test             2
PreProd          4
Production       8

What is a node?
A node is a logical processing unit, with its allocated disk and scratch space, used
for performing the required transformations and for storing dataset data. In
DataStage, nodes are allocated to jobs through the configuration file.

(Figure: 1-node and 4-node execution in SMP processing.)

What is a partition?

A partition is logical: partitioning divides memory or mass storage into isolated
sections. Memory space is split into many partitions to achieve a high degree of
parallelism. In DOS systems, for example, you can partition a disk, and each
partition behaves like a separate disk drive.

Parallel jobs
DataStage parallel jobs bring the power of parallel processing to your
data extraction and transformation applications.

Parallel processing in DataStage

There are two basic types of parallel processing, pipeline and partition
processing. WebSphere DataStage allows you to use both of these methods. The
following sections illustrate these methods using a simple parallel job, which
extracts data from a source, transforms it in some way, and then writes it to another
data source. In all cases this job would appear the same on your Designer canvas,
but you can configure it to behave in different ways.

1. Pipeline parallelism

If you ran the example job on a system with at least three processors, the stage
reading the data would start on one processor and begin filling a pipeline with the
data it has read. The transformer stage would start running on another processor as
soon as there was data in the pipeline, process it and start filling another pipeline.
The stage writing the transformed data to the target database would similarly start
writing as soon as there was data available. Thus all three stages operate
simultaneously. If you were running sequentially, there would only be one instance
of each stage. If you were running in parallel, there would be as many instances as
you had partitions.
The key concept of ETL pipeline processing is to start the transformation and
loading tasks while the extraction phase is still running.

Example: -

The source is an Oracle database and the target is a DB2 database.

In the following example, all stages run concurrently, even in a single-node
configuration. As data is read from the Oracle source, it is passed to the
Transformer stage for transformation, where it is then passed to the DB2
target. Instead of waiting for all source data to be read, as soon as the source
data stream starts to produce rows, these are passed to the subsequent
stages. This method is called pipeline parallelism, and all three stages in our
example operate simultaneously regardless of the degree of parallelism of the
configuration file. The Information Server Engine always executes jobs with
pipeline parallelism.
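
The idea can be sketched in plain Python (an analogy only, not DataStage code; the row contents and row count are invented for the example). Each stage consumes rows as soon as the previous stage produces them, rather than waiting for the whole extract to finish:

def extract():
    # Source stage analogy (the Oracle read): yield rows one at a time
    # instead of materializing the whole result set first.
    for i in range(10):
        yield {"id": i, "amount": i * 10}

def transform(rows):
    # Transformer stage analogy: process each row as soon as it arrives.
    for row in rows:
        row["amount_with_tax"] = row["amount"] * 1.1
        yield row

def load(rows):
    # Target stage analogy (the DB2 write): write each row as soon as the
    # transformer hands it over.
    for row in rows:
        print("writing", row)

load(transform(extract()))

Because the three functions are chained as generators, the first rows reach the load step before the extract step has produced its last row, which is exactly the behaviour described above.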

2. Partitioning parallelism

Imagine you have the same simple job as described above, but that it is
handling very large quantities of data. In this scenario you could use the power of
parallel processing to your best advantage by partitioning the data into a number
of separate sets, with each partition being handled by a separate instance of the
job stages.
Using partition parallelism, several processors would effectively run the
same job simultaneously, each handling a separate subset of the total data.
At the end of the job the data partitions can be collected back
together again and written to a single data source.
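
As a rough sketch of the same idea in plain Python (again an analogy, not DataStage code; the data values and the four-way split are invented for the example), the data is divided into partitions, the same processing logic runs on each partition at the same time, and the partial results are then collected:

from multiprocessing import Pool

def process_partition(rows):
    # The same job logic runs independently on every partition.
    return [r * 2 for r in rows]

if __name__ == "__main__":
    data = list(range(100))
    # Four-way round robin split of the data into partitions.
    partitions = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        results = pool.map(process_partition, partitions)
    # Collect the per-partition results back into a single data set.
    collected = [row for part in results for row in part]
    print(len(collected), "rows processed")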
Partitioning methods

The aim of most partitioning operations is to end up with a set of
partitions that are as near equal in size as possible, ensuring an even load across
your processors.
When performing some operations, however, you will need to take
control of partitioning to ensure that you get consistent results. A good example of
this would be where you are using an Aggregator stage to summarize your data.
To get the answers you want, you must ensure that related data is
grouped together in the same partition before the summary operation is performed
on that partition; WebSphere DataStage lets you do this.

There are a number of different partitioning methods available. Note
that all these descriptions assume you are starting with sequential data. If you are
repartitioning already partitioned data, then there are some specific considerations.

Round robin partition

The first record goes to the first processing node, the second record
to the second processing node, and so on. When WebSphere DataStage reaches the
last processing node in the system, it starts over. This method is useful for resizing
partitions of an input data set that are not equal in size. The round robin method
always creates approximately equal-sized partitions. This method is the one normally
used when WebSphere DataStage initially partitions data.
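
A minimal Python sketch of the idea (illustrative only, not DataStage code):

def round_robin_partition(records, num_partitions):
    # Deal records out like cards: node 0, node 1, ..., then start over.
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

print(round_robin_partition(list(range(10)), 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]] -- approximately equal-sized partitions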

Random partition

Records are randomly distributed across all processing nodes. Like
round robin, random partitioning can rebalance the partitions of an input data set to
guarantee that each processing node receives an approximately equal-sized partition.
Random partitioning has a slightly higher overhead than round robin because of the
extra processing required to calculate a random value for each record.

Same partition

The stage using the data set as input performs no repartitioning and
takes as input the partitions output by the preceding stage. With this partitioning
method, records stay on the same processing node; that is, they are not
redistributed. Same is the fastest partitioning method. This is normally the method
WebSphere DataStage uses when passing data between stages in your job.

Entire partition

Every instance of a stage on every processing node receives the
complete data set as input. It is useful when you want the benefits of parallel
execution, but every instance of the operator needs access to the entire input data
set. You are most likely to use this partitioning method with stages that create lookup
tables from their input.

Hash partition

Partitioning is based on a function of one or more columns (the hash
partitioning keys) in each record. The hash partitioner examines one or more fields
of each input record (the hash key fields). Records with the same values for all
hash key fields are assigned to the same processing node.

This method is useful for ensuring that related records are in the same
partition, which may be a prerequisite for a processing operation. For example, for a
remove-duplicates operation, you can hash partition records so that records with the
same partitioning key values are on the same node. You can then sort the records on
each node using the hash key fields as sorting key fields, then remove duplicates,
again using the same keys. Although the data is distributed across partitions, the
hash partitioner ensures that records with identical keys are in the same partition,
allowing duplicates to be found.
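
Conceptually (a Python sketch only; the customer values are invented and this is not the actual DataStage hash function), hash partitioning behaves as follows, so duplicates of the same key can be found locally within one partition:

def hash_partition(records, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Records with identical key values always map to the same partition.
        # (Python's hash() of strings varies between runs but is consistent
        # within a run, which is enough for the illustration.)
        partitions[hash(record[key]) % num_partitions].append(record)
    return partitions

rows = [{"cust": "A", "amt": 10}, {"cust": "B", "amt": 20},
        {"cust": "A", "amt": 30}, {"cust": "C", "amt": 40}]
for i, part in enumerate(hash_partition(rows, "cust", 2)):
    print("partition", i, part)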

DB2 partition

Partitions an input data set in the same way that DB2 would partition it.
For example, if you use this method to partition an input data set containing update
information for an existing DB2 table, records are assigned to the processing node
containing the corresponding DB2 table record. Then, during the execution of the
parallel operator, both the input record and DB2 table record are local to the
processing node. Any reads and writes of the DB2 table would entail no network
activity.
To use DB2 partitioning on a stage, select a partition type of DB2 on the
Partitioning tab, then click the Properties button to the right. In the
Partitioning/Collection properties dialog box, specify the details of the DB2 table
whose partitioning you want to replicate.

Auto partition

The most common method you will see on WebSphere DataStage stages
is Auto. This simply means that you are leaving it to WebSphere DataStage to
determine the best partitioning method to use depending on the type of stage, and
on what the previous stage in the job has done. Typically, WebSphere DataStage would
use round robin when initially partitioning the data, and Same for the intermediate
stages of a job.

Modulus partition

Partitioning is based on a key column modulo the number of partitions.
This method is similar to hash by field, but involves simpler computation.
In data mining, data is often arranged in buckets, that is, each record has
a tag containing its bucket number. You can use the modulus partitioner to partition
the records according to this number. The modulus partitioner assigns each record
of the input data set to a partition of its output data set, as determined by a specified
key field in the input data set. This field can be the tag field.

The partition number of each record is calculated as follows:

partition_number = fieldname mod number_of_partitions

where
· fieldname is a numeric field of the input data set.
· number_of_partitions is the number of processing nodes on which the
partitioner is executed; a partitioner executed on three processing nodes has
three partitions.
In this example, the modulus partitioner partitions a data set containing ten
records. Four processing nodes run the partitioner, and the modulus partitioner
divides the data among four partitions.

The bucket is specified as the key field, on which the modulus operation
is calculated. Each line of the input data set represents a row, and the output data
set is divided among four partitions by the modulus partitioner.

Here are three sample modulus operations corresponding to the values of three
of the key fields:
· 22677 mod 4 = 1; the record is written to partition 1.
· 47330 mod 4 = 2; the record is written to partition 2.
· 64123 mod 4 = 3; the record is written to partition 3.

None of these key values is evenly divisible by 4, so no data is written to partition 0.
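
The same calculation can be sketched in Python, reusing the three sample key values above (illustrative only, not DataStage code):

def modulus_partition(records, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # partition_number = key value mod number_of_partitions
        partitions[record[key] % num_partitions].append(record)
    return partitions

rows = [{"bucket": 22677}, {"bucket": 47330}, {"bucket": 64123}]
for i, part in enumerate(modulus_partition(rows, "bucket", 4)):
    print("partition", i, part)
# partition 0 receives nothing; the rows land in partitions 1, 2 and 3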

Range partition

Divides a data set into approximately equal-sized partitions, each of which
contains records with key columns within a specified range. This method is also
useful for ensuring that related records are in the same partition.
A range partition divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often a pre-
processing step to performing a total sort on a data set.
In order to use a range partition, you have to make a range map. You
can do this using the write range map stage.

The range partitioner guarantees that all records with the same
partitioning key values are assigned to the same partition and that the partitions
are approximately equal in size, so all nodes perform an equal amount of work
when processing the data set.
An example of the results of a range partition is shown below. The
partitioning is based on the age key, and the numbers in each bar indicate the age
range for each partition. The height of the bar shows the size of the partition.

All partitions are of approximately the same size. In an ideal distribution,
every partition would be exactly the same size.
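
A rough Python sketch of range partitioning on an age key follows (illustrative only; the boundary values 20/40/60 and the ages are invented, and in DataStage the range map would come from the Write Range Map stage rather than being hand-coded):

import bisect

def range_partition(records, key, boundaries):
    # boundaries[i] is the upper bound of partition i; values above the
    # last boundary fall into the final, open-ended partition.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record[key])].append(record)
    return partitions

rows = [{"age": a} for a in (3, 17, 25, 36, 44, 58, 61, 72)]
# Hypothetical range map: ages up to 20, 21-40, 41-60, over 60
for i, part in enumerate(range_partition(rows, "age", [20, 40, 60])):
    print("partition", i, [r["age"] for r in part])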

Data Collecting Methods

Collecting is the process of joining the multiple partitions of a single data
set back together again into a single partition. There are various situations where
you may want to do this. There may be a stage in your job that you want to run
sequentially rather than in parallel, in which case you will need to collect all your
partitioned data at this stage to make sure it is operating on the whole data set.
Similarly, at the end of a job, you might want to write all your data to a
single database, in which case you need to collect it before you write it.
There might be other cases where you do not want to collect the data at all.
For example, you may want to write each partition to a separate flat file.
Just as for partitioning, in many situations you can leave DataStage to work
out the best collecting method to use. There are situations, however, where you will
need to explicitly specify the collecting method.

Round robin collector

Reads a record from the first input partition, then from the second partition,
and so on. After reaching the last partition, it starts over. After reaching the final
record in any partition, it skips that partition in the remaining rounds.
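
Sketched in Python (illustrative only, not DataStage code), a round robin collector interleaves the partitions and drops a partition from the rotation once it runs out of records:

def round_robin_collect(partitions):
    iters = [iter(p) for p in partitions]
    while iters:
        # One record from each remaining partition per round; exhausted
        # partitions are skipped in later rounds.
        for it in list(iters):
            try:
                yield next(it)
            except StopIteration:
                iters.remove(it)

print(list(round_robin_collect([[1, 4, 7], [2, 5], [3, 6, 8, 9]])))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]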

Ordered Collector

Reads all records from the first partition, then all the records from the second
partition, and so on. This collection method preserves the order of totally sorted
input data sets. In a totally sorted data set, both the records in each partition and
the partitions themselves are ordered. This may be useful as a pre-processing
step before exporting a sorted data set to a single data file.

Sorted merge collector

Reads records in an order based on one or more columns of the record. The
columns used to define record order are the collecting keys. Typically, you use the
sort merge collector with a partition-sorted data set (as created by a Sort stage). In
this case, you specify as the collecting key fields those fields you specified as sorting
key fields to the Sort stage. For example, the figure below shows the current record
in each of three partitions of an input data set to the collector.

In this example, the records consist of three fields. The first-name
and last-name fields are strings, and the age field is an integer. The following
figure shows the order of the three records read by the sort merge collector,
based on different combinations of collecting keys.

You must define a single primary collecting key for the sort merge collector, and
you may define as many secondary keys as are required by your job. Note,
however, that each column can be used only once as a collecting key. Therefore,
the total number of primary and secondary collecting keys must be less than or
equal to the total number of columns in the row. You define the keys on the
Partitioning tab, and the key you define first is the primary key.
The data type of a collecting key can be any type except raw,
subrec, tagged, or vector.
By default, the sort merge collector uses ascending sort order and case-
sensitive comparisons. Ascending order means that records with smaller values for a
collecting field are processed before records with larger values. You also can specify
descending sorting order, so records with larger values are processed first.
With a case-sensitive algorithm, records with uppercase strings are
processed before records with lowercase strings. You can override this default
to perform case-insensitive comparisons of string fields.
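
The behaviour is essentially a k-way merge of already-sorted partitions. A Python sketch using the standard library follows (illustrative only; the names and ages are invented, and DataStage's own implementation is internal to the engine):

import heapq

def sort_merge_collect(partitions, key):
    # Each partition is assumed to be already sorted on the collecting key
    # (for example by an upstream Sort stage); heapq.merge interleaves them
    # so that the collected output is totally sorted.
    return list(heapq.merge(*partitions, key=key))

p0 = [{"last": "Ford", "age": 31}, {"last": "Smith", "age": 42}]
p1 = [{"last": "Dodge", "age": 27}, {"last": "Page", "age": 60}]
p2 = [{"last": "Jones", "age": 22}, {"last": "Zander", "age": 55}]
for row in sort_merge_collect([p0, p1, p2], key=lambda r: r["last"]):
    print(row)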

Auto collector

The most common method you will see on the parallel stages is Auto. This
normally means that WebSphere DataStage will eagerly read any row from any input
partition as it becomes available, but if it detects that, for example, the data needs
sorting as it is collected, it will do that. This is the fastest collecting method.

Best allocation of Partitions in DataStage for each stage

S. No / Stage / Best way of partitioning / Important points

1) Join - Left and right links: Hash or Modulus. All the input links should be sorted
on the joining key and partitioned on that key.
2) Lookup - Main link: Hash or Same; reference link: Entire. The links need not be
in sorted order. Use Entire partitioning only when the volume of the reference link
is small.
3) Merge - Master and update links: Hash or Modulus. All the input links should be
sorted on the merging key and partitioned on that key. Pre-sorting makes Merge
"lightweight" on memory.
4) Remove Duplicates and Aggregator - Hash or Modulus. Performs better if the
input link is sorted on the key.
5) Sort - Hash or Modulus. Sorting happens after partitioning.
6) Transformer, Funnel, Copy, Filter - Same. No special considerations.
7) Change Capture / Change Apply - Left and right links: Hash or Modulus. Both
input links should be sorted on the key and partitioned on that key.

