Professional Documents
Culture Documents
1) Source Metadata
2) Business Logic
3) Target Metadata
DataStage Components
Client Components
1) DataStage Administrator
2) DataStage Designer
3) DataStage Director
Server Components
1) DataStage Server
2) DataStage Repository
3) DataStage Package Installer
DataStage Repository: - A central store that contains all the information required
to build a data warehouse.
DataStage Server: - Runs executable jobs that extract, transform, and load
data into a data warehouse.
DataStage Package Installer: - A user interface used to install packaged
DataStage jobs and plug-ins.
DataStage Architecture
Connections to data sources and targets can use many different techniques:
direct access (for example, directly reading and writing text files), industry-standard
protocols such as ODBC, and vendor-specific APIs for connecting to databases and
packaged applications such as Siebel, SAP, and Oracle.
DataStage Jobs
Server Jobs: - These jobs are compiled and run on the DataStage server. A server
job connects to databases on other machines as necessary, extracts data, processes
it, and then writes the data to the target data warehouse.
Parallel Jobs: - These are compiled and run on the DataStage server in a
similar way to server jobs, but support parallel processing on SMP, MPP, cluster, and
grid systems.
Sequence Jobs: - Sequence jobs contain activities, which are special stages that
indicate the actions that occur when the sequence job runs. You interact with
activities in the same way that you interact with stages in parallel jobs and server
jobs. To add an activity to your job sequence, drag the corresponding icon from
the Palette to the sequence job canvas. After you design your sequence job by
adding activities and triggers, you define the properties for each activity. The
properties that you define control the behaviour of the activity.
Stage: - A stage defines a database, a file, or a processing operation.
Built-in Stages: - These stages are used for extraction, transformation, and
loading. There are two types of built-in stages:
a) Passive Stage: - A stage that defines read and write access is known as a
passive stage.
b) Active Stage: - A stage that defines processing, such as filtering the data, is
known as an active stage.
Plug-in Stages: - These are used to perform specific tasks that are not possible with
built-in stages.
a) Immediate method.
b) Bulk method.
Link: - A link defines the direction of data flow and carries the data from source to
target. There are three kinds of link:
1) Primary Link
2) Reference Link
3) Reject Link
In a cluster or MPP environment, you can use the multiple
processors and their associated memory and disk resources in concert to tackle a
single job. In this environment, each processor has its own dedicated memory,
memory bus, disk, and disk access. In a shared-nothing environment, parallelization
of your job is likely to improve the performance of CPU-limited, memory-limited, or
disk I/O-limited applications.
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=/com.ibm.swg.im.iis.productization.iisinfsv.overview.doc/topics/cisoarchscale.html
The default configuration file can be any node configuration file; it is used when
exporting jobs from one environment to another.
Following are the different components in any Configuration File:
1) Node Pool: - defines resource allocation. Pools can overlap across
nodes or can be independent.
2) Fastname: - defines the node's hostname or IP address.
3) Resource Disk: - names the disk directories accessible to each node for
permanent storage (for example, dataset files).
4) Resource Scratch Disk: - names the disk directories accessible to each
node for temporary (scratch) storage.
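As a sketch, a minimal two-node configuration file in the parallel engine's format might look like this (the hostnames and directory paths are illustrative assumptions, not defaults):

```
{
  node "node1"
  {
    fastname "etl-host-1"
    pools ""
    resource disk "/opt/ibm/ds/data" {pools ""}
    resource scratchdisk "/opt/ibm/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl-host-2"
    pools ""
    resource disk "/opt/ibm/ds/data" {pools ""}
    resource scratchdisk "/opt/ibm/ds/scratch" {pools ""}
  }
}
```

Each node entry names a logical processing node, binds it to a physical host via fastname, and declares the disk and scratch disk directories that node may use.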
What is a node?
A node is a logical processing unit: allocated space for performing the required
transformations and storing data for datasets. In DataStage, nodes are allocated
to jobs through the configuration file.
What is a partition?
A partition is a subset of a data set that is handled by a single processing node;
partitioning divides the data into such subsets so that they can be processed in
parallel.
Parallel jobs
DataStage parallel jobs bring the power of parallel processing to your
data extraction and transformation applications.
There are two basic types of parallel processing, pipeline and partition
processing. WebSphere DataStage allows you to use both of these methods. The
following sections illustrate these methods using a simple parallel job, which
extracts data from a source, transforms it in some way, and then writes it to another
data source. In all cases this job would appear the same on your Designer canvas,
but you can configure it to behave in different ways.
1. Pipeline parallelism
If you ran the example job on a system with at least three processors, the stage
reading the data would start on one processor and begin filling a pipeline with the
data it has read. The transformer stage would start running on another processor as
soon as there was data in the pipeline, process it, and start filling another pipeline.
The stage writing the transformed data to the target database would similarly start
writing as soon as there was data available. Thus all three stages operate
simultaneously. If you were running sequentially, there would be only one instance
of each stage. If you were running in parallel, there would be as many instances as
you had partitions.
The key concept of ETL Pipeline processing is to start the Transformation and
loading tasks while the Extraction phase is still running.
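The idea can be sketched in plain Python (this is an analogy using generators, not DataStage code): each downstream stage starts consuming rows as soon as the upstream stage produces them, rather than waiting for it to finish.

```python
# Pipeline parallelism modeled with generators: extract, transform, and
# load all make progress concurrently on a stream of rows.

def extract():
    # Source stage: yields rows one at a time into the pipeline.
    for i in range(5):
        yield {"id": i, "value": i * 10}

def transform(rows):
    # Transformer stage: processes each row as it arrives in the pipeline.
    for row in rows:
        row["value"] += 1
        yield row

def load(rows):
    # Target stage: collects each row as soon as it is available.
    return [row for row in rows]

result = load(transform(extract()))
# result[0] == {'id': 0, 'value': 1}
```

Because generators are lazy, the transform begins work on row 0 before the extract has produced row 1, mirroring how the three stages fill and drain pipelines concurrently.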
2. Partitioning parallelism
Imagine you have the same simple job as described above, but that it is
handling very large quantities of data. In this scenario you could use the power of
parallel processing to your best advantage by partitioning the data into a number
of separate sets, with each partition being handled by a separate instance of the
job stages.
Using partition parallelism, several processors would effectively run the
same job simultaneously, each handling a separate subset of the total data.
At the end of the job the data partitions can be collected back
together again and written to a single data source.
Partitioning methods
Round robin partition
The first record goes to the first processing node, the second record
to the second processing node, and so on. When WebSphere DataStage reaches the
last processing node in the system, it starts over. This method is useful for resizing
partitions of an input data set that are not equal in size. The round robin method
always creates approximately equal-sized partitions. This method is the one normally
used when WebSphere DataStage initially partitions data.
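As a sketch (plain Python, not DataStage internals), round robin partitioning simply deals records out to the partitions in turn:

```python
# Round robin partitioning: record i goes to partition i mod N, which
# always yields approximately equal-sized partitions.
def round_robin_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

parts = round_robin_partition(list(range(10)), 3)
# parts == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Note that partition sizes never differ by more than one record, which is why this method is a good default for the initial partitioning.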
Random partition
Records are distributed randomly across the partitions; like round robin, this
produces approximately equal-sized partitions.
Same partition
The stage using the data set as input performs no repartitioning and
takes as input the partitions output by the preceding stage. With this partitioning
method, records stay on the same processing node; that is, they are not
redistributed. Same is the fastest partitioning method. This is normally the method
WebSphere DataStage uses when passing data between stages in your job.
Entire partition
Every instance of a stage on every processing node receives the complete data set
as input. This is useful, for example, when every partition needs access to an
entire lookup table.
Hash partition
This method is useful for ensuring that related records are in the same
partition, which may be a prerequisite for a processing operation. For example, for a
remove-duplicates operation, you can hash partition records so that records with the
same partitioning key values are on the same node. You can then sort the records on
each node using the hash key fields as sorting key fields, then remove duplicates,
again using the same keys. Although the data is distributed across partitions, hash
partitioning ensures that records with identical keys are in the same partition,
allowing duplicates to be found.
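A minimal sketch of the idea (an analogy in Python, not the engine's actual hash function): hashing the key value and taking it modulo the partition count guarantees that equal keys always land in the same partition.

```python
import hashlib

# Hash partitioning: records with the same key value always go to the
# same partition, so key-based operations (dedup, joins) can run locally.
def hash_partition(records, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        digest = hashlib.md5(str(record[key]).encode()).hexdigest()
        partitions[int(digest, 16) % num_partitions].append(record)
    return partitions

rows = [{"cust": "A", "amt": 1}, {"cust": "B", "amt": 2}, {"cust": "A", "amt": 3}]
parts = hash_partition(rows, "cust", 4)
# Both "cust" == "A" records land in the same partition, so duplicates
# by key can be detected on a single node.
```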
DB2 partition
Partitions an input data set in the same way that DB2 would partition it.
For example, if you use this method to partition an input data set containing update
information for an existing DB2 table, records are assigned to the processing node
containing the corresponding DB2 table record. Then, during the execution of the
parallel operator, both the input record and DB2 table record are local to the
processing node. Any reads and writes of the DB2 table would entail no network
activity.
To use DB2 partitioning on a stage, select a partition type of DB2 on the
Partitioning tab, then click the Properties button to the right. In the
Partitioning/Collection properties dialog box, specify the details of the DB2 table
whose partitioning you want to replicate.
Auto partition
The most common method you will see on the WebSphere DataStage stage
is Auto. This just means that you are leaving it to WebSphere DataStage to
determine the best partitioning method to use depending on the type of stage, and
what the previous stage in the job has done. Typically, WebSphere DataStage would
use round robin when initially partitioning the data, and Same for the intermediate
stages of a job.
Modulus partition
Partitioning is based on a numeric key column modulo the number of partitions:
partition = fieldname mod number_of_partitions
where:
· fieldname is a numeric field of the input data set.
· number_of_partitions is the number of processing nodes on which the
partitioner is executed; if the partitioner is executed on three processing nodes,
it has three partitions.
In this example, the modulus partitioner partitions a data set containing ten
records. Four processing nodes run the partitioner, and the modulus partitioner
divides the data among four partitions. The input data is as follows:
The bucket column is specified as the key field, on which the modulus operation
is calculated. Each line of the input data set represents a row.
The following table shows the output data set divided among four partitions by the
modulus partitioner.
Here are three sample modulus operations corresponding to the values of three
of the key fields:
· 22677 mod 4 = 1; the record is written to partition 1.
· 47330 mod 4 = 2; the record is written to partition 2.
· 64123 mod 4 = 3; the record is written to partition 3.
None of the key field values is evenly divisible by 4, so no records are written to
partition 0.
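The arithmetic above can be sketched directly (plain Python, with the column name taken from the example):

```python
# Modulus partitioning: partition = key_value mod number_of_partitions.
def modulus_partition(records, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record[key] % num_partitions].append(record)
    return partitions

rows = [{"bucket": 22677}, {"bucket": 47330}, {"bucket": 64123}]
parts = modulus_partition(rows, "bucket", 4)
# 22677 mod 4 = 1, 47330 mod 4 = 2, 64123 mod 4 = 3; partition 0 stays empty.
```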
Range partition
The range partitioner guarantees that all records with the same
partitioning key values are assigned to the same partition and that the partitions
are approximately equal in size, so all nodes perform an equal amount of work
when processing the data set.
An example of the results of a range partition is shown below. The
partitioning is based on the age key, and the numbers in each bar indicate the age
range for each partition. The height of the bar shows the size of the partition.
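A simple sketch of the mechanism (the boundary values here are assumed for illustration; the real partitioner derives them by sampling the data so the partitions come out roughly equal):

```python
import bisect

# Range partitioning: a record goes to the partition whose key range
# contains its key value, here located with a binary search over the
# precomputed boundary values.
boundaries = [18, 36, 54, 72]  # assumed upper bounds of the age ranges

def range_partition(age, boundaries):
    return bisect.bisect(boundaries, age)

ages = [5, 18, 40, 60, 90]
parts = [range_partition(a, boundaries) for a in ages]
# parts == [0, 1, 2, 3, 4]
```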
Round robin collector
Reads a record from the first input partition, then from the second partition,
and so on. After reaching the last partition, it starts over. After reaching the final
record in any partition, it skips that partition in the remaining rounds.
Ordered Collector
Reads all records from the first partition, then all records from the second
partition, and so on. This collection method preserves the order of totally sorted
input data sets. In a totally sorted data set, both the records in each partition and
the partitions themselves are ordered. This may be useful as a pre-processing
action before exporting a sorted data set to a single data file.
Sorted merge collector
Reads records in an order based on one or more columns of the record. The
columns used to define record order are collecting keys. Typically, you use the sorted
merge collector with a partition-sorted data set (as created by a sort stage). In this
case, you specify as the collecting key fields those fields you specified as sorting key
fields to the sort stage. For example, the figure below shows the current record in
each of three partitions of an input data set to the collector.
You must define a single primary collecting key for the sort merge collector, and
you may define as many secondary keys as are required by your job. Note,
however, that each column can be used only once as a collecting key. Therefore,
the total number of primary and secondary collecting keys must be less than or
equal to the total number of columns in the row. You define the keys on the
Partitioning tab, and the key you define first is the primary key.
The data type of a collecting key can be any type except raw,
subrec, tagged, or vector.
By default, the sort merge collector uses ascending sort order and case-
sensitive comparisons. Ascending order means that records with smaller values for a
collecting field are processed before records with larger values. You also can specify
descending sorting order, so records with larger values are processed first.
With a case-sensitive algorithm, records with uppercase strings are
processed before records with lowercase strings. You can override this default
to perform case-insensitive comparisons of string fields.
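The collector's behaviour can be sketched with a standard k-way merge (plain Python, using made-up data; not DataStage code): each input is already sorted on the collecting key, and the collector repeatedly takes the smallest current record across partitions.

```python
import heapq

# Sort merge collecting: merge partition-sorted inputs on the collecting
# key (the first column here) into one totally sorted output stream.
partitions = [
    [("apple", 1), ("melon", 4)],
    [("banana", 2), ("pear", 5)],
    [("cherry", 3)],
]
merged = list(heapq.merge(*partitions, key=lambda rec: rec[0]))
# merged is ordered by the first column across all partitions.
```

This also shows why the inputs must already be partition-sorted on the same keys: the merge only compares each partition's current record, exactly as the collector does.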
Auto collector
The most common method you will see on the parallel stages is Auto. This
normally means that WebSphere DataStage will eagerly read any row from any input
partition as it becomes available, but if it detects that, for example, the data needs
sorting as it is collected, it will do that. This is the fastest collecting method.
Best allocation of Partitions in DataStage for each stage