
Parallel Architecture, DataStage v8 Configuration, Metadata

Parallel processing = executing your application on multiple CPUs


Parallel processing environments
The environment in which you run your parallel jobs is defined by your system's architecture and hardware resources. All parallel processing environments are categorized as one of:
1. SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors. The processors communicate via shared memory and have a single operating system.
2. Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system, and communicate via a high-speed network.
Pipeline Parallelism
1. Extract, Transform and Load processes execute simultaneously
2. The downstream process starts while the upstream process is running, like a conveyor belt moving rows from process to process
3. Advantages: Reduces disk usage for staging areas and keeps processors busy
4. Still has limits on scalability
Partition Parallelism
1. Divide the incoming stream of data into subsets known as partitions to be processed separately
2. Each partition is processed in the same way
3. Facilitates near-linear scalability. However, the data needs to be evenly distributed across the partitions; otherwise the benefits of partitioning are reduced
Within parallel jobs, pipelining, partitioning and repartitioning are automatic. The job developer only identifies:
1. Sequential or Parallel mode (by stage)
2. Partitioning Method
3. Collection Method
4. Configuration file
Configuration File
One of the great strengths of WebSphere DataStage Enterprise Edition is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design.
WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.
The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools > Configurations; the Configurations dialog box appears.
You specify which configuration will be used by setting the $APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it on a project-wide level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog.
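For example, to point a session at a specific configuration file before running a job from the command line (the path shown is hypothetical; actual locations depend on your installation):
$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt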
Configuration files are text files containing string data. The general form of a configuration file is as follows:
{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
}
Node names
Each node you define is followed by its name enclosed in quotation marks, for example: node "orch0"
For a single CPU node or workstation, the node's name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a node's network name:
$ uname -n
On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.
Fastname
Syntax:
fastname "name"
This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages use to open connections for high-volume data transfers. The attribute of this option is often the network name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as returned by the UNIX command uname -n.
Node pools and the default node pool
Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool.
The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.
node "node1"
{
fastname "node1_css"
pools "" "node1" "node1_css" "pool4"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch" {}
}
A node belongs to the default pool unless you explicitly specify a pools list for it, and omit the default pool name ("") from the list.
Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the stage runs only on the nodes that appear in both pools.
Nodes or resources that name a pool declare their membership in that pool.
We suggest that when you initially configure your system you place all nodes in pools that are named after the node's name and fast name. Additionally, include the default node pool in this pool, as in the following example:
node "n1"
{
fastname "nfast"
pools "" "n1" "nfast"
}
By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You can constrain the processing nodes used by the parallel engine either by removing node descriptions from the configuration file or by constraining a job or stage to a particular node pool.
Disk and scratch disk pools and their defaults
When you define a processing node, you can specify the options resource disk and resource scratchdisk. They indicate the directories of file systems available to the node. You can also group disks and scratch disks in pools. Pools reserve storage for a particular use, such as holding very large data sets.
Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name and belong to both resource disk and resource scratchdisk define two separate pools.
A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and by { } (the empty pool list). For example, the following code configures the disks for node1:
node "n1" {
resource disk "/orch/s0" {pools "" "pool1"}
resource disk "/orch/s1" {pools "" "pool1"}
resource disk "/orch/s2" { } /* empty pool list */
resource disk "/orch/s3" {pools "pool2"}
resource scratchdisk "/scratch" {pools "" "scratch_pool1"}
}
In this example:
1. The first two disks are assigned to the default pool.
2. The first two disks are assigned to pool1.
3. The third disk is also assigned to the default pool, indicated by { }.
4. The fourth disk is assigned to pool2 and is not assigned to the default pool.
5. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.
Buffer scratch disk pools
Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.
Here is an example configuration file that defines a buffer scratch disk pool:
{
node node1 {
fastname "node1_css"
pools "" "node1" "node1_css"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch0" {pools "buffer"}
resource scratchdisk "/scratch1" {}
}
node node2 {
fastname "node2_css"
pools "" "node2" "node2_css"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch0" {pools "buffer"}
resource scratchdisk "/scratch1" {}
}
}
In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0 and /scratch1 would be used because both would then be in the default pool.
Partitioning
The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as possible, ensuring an even load across your processors.
When performing some operations, however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an Aggregator stage to summarize your data. To get the answers you want (and need), you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.
Round robin partitioner
The first record goes to the first processing node, the second to the second processing node, and so on. When WebSphere DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when WebSphere DataStage initially partitions data.
Random partitioner
Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. Random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.
Entire partitioner
Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.
Same partitioner
The stage using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method WebSphere DataStage uses when passing data between stages in your job.
Hash partitioner
Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node.
This method is useful for ensuring that related records are in the same partition, which may be a prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found.
Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.
Modulus partitioner
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation.
In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field.
The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions
where fieldname is a numeric field of the input data set and number_of_partitions is the number of processing nodes on which the partitioner executes. If a partitioner is executed on three processing nodes, it has three partitions.
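As an illustrative calculation: with three partitions, a record whose key field holds 17 goes to partition 17 mod 3 = 2, a record holding 18 goes to partition 0, and a record holding 19 goes to partition 1.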
Range partitioner
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition.
A range partitioner divides a data set into approximately equal-size partitions based on one or more partitioning keys. In order to use a range partitioner, you have to make a range map. You can do this using the Write Range Map stage.
The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size, so all nodes perform an equal amount of work when processing the data set.
Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and round robin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However, these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.
DB2 partitioner
Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity.
Auto partitioner
The most common method you will see on the WebSphere DataStage stages is Auto. This simply means that you are leaving it to WebSphere DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done. Typically WebSphere DataStage would use round robin when initially partitioning data, and Same for the intermediate stages of a job.
Collecting
Collecting is the process of joining the multiple partitions of a single data set back together again into a single partition. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set.
Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get data collected in the same order each time. If order matters, you need to use the sorted merge collection method.
Round robin collector
Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, it starts over. After reaching the final record in any partition, it skips that partition in the remaining rounds.
Ordered collector
Reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered. This may be useful as a preprocessing action before exporting a sorted data set to a single data file.
Sorted merge collector
Reads records in an order based on one or more columns of the record. The columns used to define record order are called collecting keys. Typically, you use the sorted merge collector with a partition-sorted data set (as created by a sort stage). In this case, you specify as the collecting key fields those fields you specified as sorting key fields to the sort stage.
The data type of a collecting key can be any type except raw, subrec, tagged, or vector.
Auto collector
The most common method you will see on the parallel stages is Auto. This normally means that WebSphere DataStage will eagerly read any row from any input partition as it becomes available, but if it detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest collecting method.
Preserve partitioning flag
A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the preserve partitioning flag for its output link. Note, however, that the next stage may ignore this request. In most cases you are best leaving the preserve partitioning flag in its default state. The exception to this is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that it has happened when you run the job. If the Preserve Partitioning flag is cleared, this means that the current stage doesn't care what the next stage in the job does about partitioning. On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has a flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.
Parallel Job Score
At runtime, the Job SCORE can be examined to identify:
1. Number of UNIX processes generated for a given job and $APT_CONFIG_FILE
2. Operator combination
3. Partitioning methods between operators
4. Framework-inserted components, including Sorts, Partitioners, and Buffer operators
Set $APT_DUMP_SCORE=1 to output the Score to the DataStage job log.
For each job run, 2 separate Score Dumps are written to the log:
1. The first score is actually from the license operator
2. The second score entry is the actual job score
Job scores are divided into two sections:
1. Datasets - partitioning and collecting
2. Operators - node/operator mapping
Example score dump
The following score dump shows a flow with a single data set, which has a hash partitioner, partitioning on key "a". It shows three operators: generator, tsort, and peek. Tsort and peek are "combined", indicating that they have been optimized into the same process. All the operators in this flow are running on one node.
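A representative score dump for this flow, reconstructed from the IBM documentation example (the node name lemond.torrent.com comes from that example; the exact layout varies by release):
main_program: This step has 1 dataset:
ds0: {op0[1p] (sequential generator)
      eOther(APT_HashPartitioner { key={ value=a }})#>eCollectAny
      op1[1p] (parallel APT_CombinedOperatorController:tsort,peek)}
It has 2 operators:
op0[1p] {(sequential generator)
    on nodes (
      lemond.torrent.com[op0,p0]
    )}
op1[1p] {(parallel APT_CombinedOperatorController:
      (tsort)
      (peek)
    ) on nodes (
      lemond.torrent.com[op1,p0]
    )}
It runs 2 processes on 1 node.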
The DataStage Parallel Framework implements a producer-consumer data flow model:
Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets).
The partitioning method is associated with the producer. The collector method is associated with the consumer. "eCollectAny" is specified for parallel consumers, although no collection occurs!
The producer and consumer are separated by the following indicators:
-> Sequential to Sequential
<> Sequential to Parallel
=> Parallel to Parallel (SAME)
#> Parallel to Parallel (not SAME)
>> Parallel to Sequential
> No producer or no consumer
The score may also include [pp] notation when the Preserve Partitioning flag is set.
At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
1. Use the same partitioning method
Repartitioning prevents operator combination between the corresponding producer and consumer stages.
Implicit repartitioning (e.g., Sequential operators, node maps) also prevents combination.
2. Are Combinable
Set automatically within the stage/operator definition
Set within DataStage Designer: Advanced stage properties
The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
1. LUTCreateImpl - reads the reference data into memory
2. LUTProcessImpl - performs the actual lookup processing once the reference data has been loaded
At runtime, each internal component is assigned to operators independently.
Job Compilation
1. Operators. These underlie the stages in a WebSphere DataStage job. A single stage may correspond to a single operator, or to a number of operators, depending on the properties you have set, and whether you have chosen to partition, collect or sort data on the input link to a stage. At compilation, WebSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
2. OSH. This is the scripting language used internally by the WebSphere DataStage parallel engine.
3. Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
The DataStage Designer client generates all code - it validates link requirements, mandatory stage options, transformer logic, etc.
1. Generates OSH representation of job data flow and stages
GUI "stages" are representations of Framework "operators"
Stages in parallel shared containers are statically inserted in the job flow
Each server shared container becomes a dsjobsh operator
2. Generates transform code for each parallel Transformer
Compiled on the DataStage server into C++ and then to corresponding native operators
To improve compilation times, previously compiled Transformers that have not been modified are not recompiled
Force Compile recompiles all Transformers (use after client upgrades)
3. Buildop stages must be compiled manually within the GUI or using the buildop UNIX command line
Viewing of generated OSH is enabled in DS Administrator. OSH is visible in:
1. Job Properties
2. Job run log
3. View Data
4. Table Definitions
Generated OSH Primer
Designer inserts comment blocks to assist in understanding the generated OSH. Note that operator order within the generated OSH is the order a stage was added to the job canvas.
OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition:
1. operator name
2. operator options (use "-name value" format)
Schema (for generator, import, export)
Inputs
Outputs
The following data sources are supported as input/output:
Virtual data sets (name.v)
Persistent data sets (name.ds or [ds] name)
File sets (name.fs or [fs] name)
External files (name or [file] name)
Every operator has inputs numbered sequentially starting from 0. For example:
op1 0> dst
op1 1< src
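As a sketch, the generated OSH for a simple two-stage flow might look like the fragment below (the stage and link names here are hypothetical, and the exact comment-block layout varies by release):
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record ( a:int32; )
-records 10
## General options
[ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
#### STAGE: Peek_1
## Operator
peek
## General options
[ident('Peek_1')]
## Inputs
0< [] 'Row_Generator_0:lnk_gen.v'
;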
Terminology
Framework                  DataStage
schema                     table definition
property                   format
type                       SQL type + length [and scale]
virtual dataset            link
record/field               row/column
operator                   stage
step, flow, OSH command    job
Framework                  DS engine
The GUI uses both terminologies. Log messages (info, warnings, errors) use the Framework terms.
Example Stage to Operator Mapping
Within Designer, stages represent operators, but there is not always a 1:1 correspondence.
Sequential File
- Source: import
- Target: export
DataSet: copy
Sort (DataStage): tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle
- Source: oraread
- Sparse Lookup: oralookup
- Target Load: orawrite
- Target Upsert: oraupsert
Lookup File Set
- Target: lookup -createOnly
Runtime Architecture
Generated OSH and the Configuration file are used to "compose" a job SCORE, similar to the way an RDBMS builds a query optimization plan:
1. Identifies degree of parallelism and node assignment for each operator
2. Inserts sorts and partitioners as needed to ensure correct results
3. Defines connection topology (datasets) between adjacent operators
4. Inserts buffer operators to prevent deadlocks (e.g., fork-joins)
5. Defines the number of actual UNIX processes (where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements)
6. The Job SCORE is used to fork UNIX processes with communication interconnects for data, message, and control. Setting $APT_PM_SHOW_PIDS shows UNIX process IDs in the DataStage log.
It is only after these steps that processing begins. This is the "startup overhead" of an Enterprise Edition job.
Job processing ends when:
- the last row (end of data) is processed by the final operator in the flow, (or)
- a fatal error is encountered by any operator, (or)
- the job is halted (SIGINT) by DataStage Job Control or human intervention (e.g., DataStage Director STOP)
Job Execution: The Orchestra
Conductor - initial Framework process
• Score Composer
• Creates Section Leader processes (one/node)
• Consolidates messages to the DataStage log
• Manages orderly shutdown
Section Leader (one per Node)
• Forks Player processes (one/Stage)
• Manages up/down communication
Players
• The actual processes associated with Stages
• Combined players: one process only
• Send stderr, stdout to Section Leader
• Establish connections to other players for data flow
• Clean up upon completion
Default Communication:
• SMP: Shared Memory
• MPP: Shared Memory (within hardware node) and TCP (across hardware nodes)
Introduction
What is IBM WebSphere DataStage?
1. Design jobs for ETL
2. Ideal tool for data integration projects
3. Import, export, create and manage metadata for use within jobs
4. Schedule, run and monitor jobs, all within DataStage
5. Administer your DataStage development and execution environments
6. Create batch (controlling) jobs
What are the components/applications in the IBM Information Server Suite?
1. DataStage
2. QualityStage
3. Metadata Server, consisting of Metadata Access Services and Metadata Analysis Services
4. Repository, which is DB2 by default
5. Business Glossary
6. Federation Server
7. Information Services Director
8. Information Analyzer
9. Information Server console
Explain the DataStage Architecture?
The DataStage client components are:
Administrator - Administers DataStage projects and conducts housekeeping on the server
Designer - Creates DataStage jobs that are compiled into executable programs
Director - Used to run and monitor DataStage jobs
The Repository is used to store DataStage objects. The Repository, which is DB2 by default, is shared by other applications in the Suite.
What are the uses of DataStage Administrator?
The Administrator is used to add and delete projects, and to set project properties. The Administrator also provides a command line interface to the DataStage repository.
Use the Administrator Project Properties window to:
1. Enable job administration in Director, enable runtime column propagation, set auto-purging options, protect the project and set environment variables on the General tab
2. Set user and group privileges on the Permissions tab
3. Enable or disable server-side tracing on the Tracing tab
4. Specify a username and password for scheduling jobs on the Schedule tab
5. Specify parallel job defaults on the Parallel tab
6. Specify job sequencer defaults on the Sequencer tab
Explain the DataStage development workflow?
1. Define project properties - Administrator
2. Open (attach to) your project
3. Import metadata that defines the format of data stores your jobs will read from or write to
4. Design the job - Designer
5. Compile and debug the job - Designer
6. Run and monitor the job - Director
What is the DataStage project repository?
All your work is stored in a DataStage project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator.
The project directory is used by DataStage to store your jobs and other DataStage objects and metadata on your server.
Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them.
Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from editing the same DataStage object (job, table definition, etc.) at the same time.
What are the different types of DataStage jobs?
Parallel Jobs:
1. Executed by the DataStage parallel engine
2. Built-in functionality for pipeline and partition parallelism
3. Compiled into OSH (Orchestrate Scripting Language)
4. OSH executes operators (C++ class instances)
Server Jobs:
1. Executed by the DataStage server engine
2. Compiled into BASIC
Job Sequencers:
1. Master Server jobs that kick off jobs and other activities
2. Can kick off Server or Parallel jobs
3. Executed by the DataStage server engine
What are the design elements of parallel jobs?
Stages - Implemented as OSH operators
Passive stages (E and L of ETL) - Read/Write data. E.g., Sequential File, DB2, Oracle, Peek stages
Active stages (T of ETL) - Transform/Filter/Aggregate/Generate/Split/Merge data. E.g., Transformer, Aggregator, Join, Sort stages
Links - Pipes through which the data moves from stage to stage
What are the different types of parallelism?
Pipeline Parallelism
1. Transform, clean, load processes execute simultaneously
2. Start downstream process while upstream process is running
3. Reduces disk usage for staging areas
4. Keeps processors busy
5. Still has limits on scalability
Partition Parallelism
1. Divide the incoming stream of data into subsets (partitions) to be processed by the same operator
2. The operation is performed on each partition of data separately and in parallel
3. Facilitates near-linear scalability provided the data is evenly distributed
4. If the data is evenly distributed, the data will be processed n times faster on n nodes.
Installation and Deployment
What gets deployed as part of an Information Server Domain?
1. Metadata Server, hosted by an IBM WebSphere Application Server instance
2. One or more DataStage servers
3. One DB2 UDB instance containing the repository database
Additional Server applications:
1. Business Glossary
2. Federation Server
3. Information Analyzer
4. Information Services Director
5. Rational Data Architect
What are the Information Server clients?
1. Administration Console
2. Reporting Console
3. DataStage Clients - Administrator, Designer, Director
What are the different types of Information Server deployment?
1. Everything on one machine - all the applications in the domain are deployed on one machine
2. The domain is split between two machines - the DataStage Server on one machine; the Metadata Server and DB2 Repository on another
3. The domain is split between three machines - the DataStage Server, Metadata Server and DB2 Repository on 3 different machines
Additional DataStage Servers can be part of this domain, but they would have to be separate from one another. Additional DataStage player-node machines may be connected to the DataStage server machine using a high-speed network.
Which components should be running if the Application Server (hosting the metadata server) and the DataStage server are running on different machines?
1. The Application Server
2. The ASB agent
Administering DataStage
Explain User and Group Management?
Suite authorization can be provided to users or groups. Users that are members of a group acquire the authorizations of the group. Authorizations are provided in the form of roles:
1. Suite roles
a. Administrator - Performs user and group management tasks. Includes all the privileges of the Suite User role
b. User - Create views of scheduled tasks and logged messages. Create and run reports
2. Suite Component roles
a. DataStage Administrator - Full permission to work in DataStage Administrator, Designer and Director
b. DataStage User - Permissions are assigned within DataStage - Developer, Operator, Super Operator and Production Manager
A DataStage User cannot delete projects and cannot set permissions.
A user ID that is assigned Suite roles can immediately log onto the Information Server Console.
What about a user ID that is assigned a DataStage Suite Component role? If the user ID is assigned the DataStage Administrator role, then the user will immediately acquire DataStage Administrator permission for all projects. If the user ID is assigned the DataStage User role, one more step is required: a DataStage administrator must assign a corresponding role to that user ID on the Permissions tab. When Suite users or groups have been assigned the DataStage Administrator role they automatically appear on the Permissions tab. Suite users or groups that have a DataStage User role need to be manually added.
Explain DataStage Credential Mapping?
All Suite users without their own DataStage credentials will be mapped to a shared user ID and password. Here the username and password are demohawk/demohawk. demohawk is assumed to be a valid user on the DataStage Server machine with file permissions on the DataStage engine and project directories.
Suite users can also be mapped individually to specific users.
Note that demohawk need not be a Suite administrator or user.
What information is required to log in to DataStage Administrator?
Domain - Host name and port number of the application server. Recall that multiple DataStage servers can exist in a domain, although they must be on different machines.
DataStage server - The server that has the DataStage projects you want to administer
Explain the DataStage roles?
1. DataStage Developer - full access to all areas of a DataStage project
2. DataStage Operator - run and manage released DataStage jobs
3. DataStage Super Operator - can open the Designer and view the repository in read-only mode
4. DataStage Production Manager - create and manipulate protected projects
DataStage Designer
Explain Import and Export and their corresponding procedures?
Import and export are used for:
1. Backing up jobs and projects
2. Maintaining different versions of a job or project
3. Moving DataStage objects from one project to another
4. Sharing jobs and projects between developers
Export --> DataStage components
By default, objects are exported to a text file in a specific format. By default, the extension is .dsx. Alternatively, you can export the objects to an XML document.
The directory you export to is on the DataStage client, not the server.
Objects can also be exported from the list of found objects using the search functionality.
Import --> DataStage components
Use Import all to begin the import process. Use Import selected to import selected objects from the list.
Select the Overwrite without query button to overwrite objects with the same name without warning.
For large imports you may want to disable "Perform impact analysis"; this adds overhead to the import process.
Import --> Table Definitions
A table definition describes the columns and format of files and tables.
Table definitions can be imported for:
1. Sequential files
2. Relational tables
3. Cobol files
4. XML
5. ODBC data sources
etc.
Table definitions can be loaded into job stages that access data with the same format. In this sense the metadata is reusable.
Creating Parallel Jobs
What is a Parallel Job?
A parallel job is an executable DataStage program created in DataStage Designer using components from the repository. It compiles into Orchestrate script language (OSH) and object code (from generated C++).
DataStage jobs are:
1. Designed and built in Designer
2. Scheduled, invoked and monitored in Director
3. Executed under the control of DataStage
Use the import process in Designer to import metadata defining sources and targets.
What are the benefits of renaming links and stages?
1. Documentation
2. Clarity
3. Fewer development errors
Explain the Row Generator stage?
1. Produces mock data
2. No input link; single output link
3. On the Properties tab, specify the number of rows
4. On the Columns tab, load or specify column definitions
You have a cluster of nodes available to run DataStage jobs. The network configuration between the servers is a private network with a 1 GB connection between each node. The public name is on a 100 MB network, which is what each hostname is identified with. In order to use the private network for communications between each node you need to use an alias for each node in the cluster. The Information Server Engine node (conductor node) is where the DataStage job starts.
Which environment variable must be used to identify the hostname for the Engine node?
A. $APT_SERVER_ENGINE
B. $APT_ENGINE_NODE
C. $APT_PM_CONDUCTOR_HOSTNAME
D. $APT_PM_NETWORK_NAME
Answer: C
Which three privileges must the user possess when running a parallel job? (Choose three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX/etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources
Answer: A, B, E
Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.
Answer: B, C
Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for auto-purge of job log entries
Answer: C, D, E
Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime
Answer: A, C
You are reading customer data using a Sequential File stage and transforming it using the Transformer stage. The Transformer is used to cleanse the data by trimming spaces from character fields in the input. The cleansed data is to be written to a target DB2 table. Which partitioning method would yield optimal performance without violating the business requirements?
A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire
Answer: B
A job contains a Sort stage that sorts a large volume of data across a cluster of servers. The customer has requested that this sorting be done on a subset of servers identified in the configuration file to minimize impact on database nodes. Which two steps will accomplish this? (Choose two.)
A. Create a sort scratch disk pool with a subset of nodes in the parallel configuration file.
B. Set the execution mode of the Sort stage to sequential.
C. Specify the appropriate node constraint within the Sort stage.
D. Define a non-default node pool with a subset of nodes in the parallel configuration file.
Answer: C, D
You have a compiled job and parallel configuration file. Which three methods can be used to determine the number of nodes actually used to run the job in parallel? (Choose three.)
A. within DataStage Designer, generate report and retain intermediate XML
B. within DataStage Designer, show performance statistics
C. within DataStage Director, examine log entry for parallel configuration file
D. within DataStage Director, examine log entry for parallel job score
E. within DataStage Director, open a new DataStage Job Monitor
Answer: C, D, E
Which environment variable, when set to true, causes a report to be produced which shows the operators, processes and data sets in the job?
A. APT_DUMP_SCORE
B. APT_JOB_REPORT
C. APT_MONITOR_SIZE
D. APT_RECORD_COUNTS
Answer: A
A job reads from a dataset using a DataSet stage. This data goes to a Transformer stage and then is written to a sequential file using a Sequential File stage. The default configuration file has 3 nodes. The job creating the dataset and the current job both use the default configuration file. How many instances of the Transformer run in parallel?
A. 3
B. 1
C. 7
D. 9
Answer: A
Your job reads from a file using a Sequential File stage running sequentially. The DataStage server is running on a single SMP system. One of the columns contains a product ID. In a Lookup stage following the Sequential File stage, you decide to look up the product description from a reference table. Which two partition settings would correctly find matching product descriptions? (Choose two.)
A. Hash algorithm, specifying the product ID field as the key, on both the link coming from the Sequential File stage and the link coming from the reference table.
B. Round Robin on both the link coming from the Sequential File stage and the link coming from the reference table.
C. Round Robin on the link coming from the Sequential File stage and Entire on the link coming from the reference table.
D. Entire on the link coming from the Sequential File stage and Hash, specifying the product ID field as the key, on the link coming from the reference table.
Answer: A, C
A job design consists of an input fileset followed by a Peek stage, followed by a Filter stage, followed by an output fileset. The environment variable APT_DISABLE_COMBINATION is set to true, and the job executes on an SMP using a configuration file with 8 nodes defined. Assume also that the input dataset was created with the same 8-node configuration file. Approximately how many data processing processes will this job create?
A. 32
B. 8
C. 16
D. 1
Answer: A
Which two statements are true of the column data types used in Orchestrate schemas? (Choose two.)
A. Orchestrate schema column data types are the same as those used in DataStage stages.
B. Examples of Orchestrate schema column data types are varchar and integer.
C. Examples of Orchestrate schema column data types are int32 and string[max=30].
D. OSH import operators are needed to convert data read from sequential files into schema types.
Answer: C, D
You have set the "Preserve Partitioning" flag for a Sort stage to request that the next stage preserves whatever partitioning it has implemented. Which statement describes what will happen next?
A. The job will compile but will abort when run.
B. The job will not compile.
C. The next stage can ignore this request, but a warning is logged when the job is run, depending on the stage type that ignores the flag.
D. The next stage disables the partition options that are normally available in the Partitioning tab.
Answer: C
What is the purpose of the uv command in a UNIX DataStage server?
A. Clean up resources from a failed DataStage job.
B. Start and stop the DataStage engine.
C. Provide read access to a DataStage EE configuration file.
D. Report DataStage client connections.
Answer: B
Which two statements regarding the usage of data types in the parallel engine are correct? (Choose two.)
A. The best way to import RDBMS data types is using the ODBC importer.
B. The parallel engine will use its interpretation of the Oracle meta data (e.g., exact data types) based on interrogation of Oracle, overriding what you may have specified in the Columns tabs.
C. The best way to import RDBMS data types is using the Import Orchestrate Schema Definitions using orchdbutil.
D. The parallel engine and server engine have exactly the same data types so there is no conversion cost overhead from moving data between the engines.
Answer: B, C
Which two describe a DataStage EE installation in a clustered environment? (Choose two.)
A. The C++ compiler must be installed on all cluster nodes.
B. Transform operators must be copied to all nodes of the cluster.
C. The DataStage parallel engine must be installed or accessible in the same directory on all machines in the cluster.
D. A remote shell must be configured to support communication between the conductor and section leader nodes.
Answer: C, D
Which partitioning method would yield the most even distribution of data without duplication?
A. Entire
B. Round Robin
C. Hash
D. Random
Answer: B
Which three accurately describe the differences between a DataStage server root installation and a non-root installation? (Choose three.)
A. A non-root installation enables auto-start on reboot.
B. A root installation must specify the user "dsadm" as the DataStage administrative user.
C. A non-root installation inherits the permissions of the user who starts the DataStage services.
D. A root installation will start DataStage services in impersonation mode.
E. A root installation enables auto-start on reboot.
Answer: C, D, E
Your job reads from a file using a Sequential File stage running sequentially. You are using a Transformer following the Sequential File stage to format the data in some of the columns. Which partitioning algorithm would yield optimized performance?
A. Hash
B. Random
C. Round Robin
D. Entire
Answer: C
Which three UNIX kernel parameters have minimum requirements for DataStage installations? (Choose three.)
A. MAXUPROC - maximum number of processes per user
B. NOFILES - number of open files
C. MAXPERM - disk cache threshold
D. NOPROC - no process limit
E. SHMMAX - maximum shared memory segment size
Answer: A, B, E
Which partitioning method requires specifying a key?
A. Random
B. DB2
C. Entire
D. Modulus
Answer: D
When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator
Answer: A
Which statement is true when Runtime Column Propagation (RCP) is enabled?
A. DataStage Manager does not import meta data.
B. DataStage Director does not supply row counts in the job log.
C. DataStage Designer does not enforce mapping rules.
D. DataStage Administrator does not allow default settings for environment variables.
Answer: C
Persistent Storage
Sequential file stage
The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link.
The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the stage to behave differently:
1. You can specify that a single file can be read by multiple nodes. This can improve performance on cluster systems.
2. You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).
(These two options are mutually exclusive.)
File - This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property.
File pattern - Specifies a group of files to import. Specify a file containing a list of files or a job parameter representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.
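For example (a hypothetical path), a File pattern of /data/in/sales_*.txt imports every file matching the wildcard through a single stage.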
Read method - This property specifies whether you are reading from a specific file or files, or using a file pattern to select files (e.g., *.txt).
Missing file mode - Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error unless the file has a node name prefix of *: in which case it is OK. The default is Depends.
Keep file partitions - Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.
Reject mode - Allows you to specify behavior if a read record does not match the expected schema (the record does not match the metadata defined in the column definitions). Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report progress - Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.
Number of readers per node - This is an optional property that only applies to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file.
This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.
Read from multiple nodes - This is an optional property that only applies to files containing fixed-length records; it is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. WebSphere DataStage knows the number of nodes available, and, using the fixed-length record size and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions will be of roughly equal size.
Note that sequential row order cannot be maintained when reading a file in parallel.
File update mode - This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime. By default this property is set to Overwrite.
Using RCP With Sequential Stages
Runtime column propagation (RCP) allows WebSphere DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define only the columns you are interested in using in a job, but ask WebSphere DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.
Sequential files, unlike most other data sources, do not have inherent column definitions, so WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file for this are (see the example schema after this list):
1. Sequential File
2. File Set
3. External Source
4. External Target
5. Column Import
6. Column Export
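A minimal sketch of such a schema file, assuming a two-column comma-delimited file (the column names are hypothetical):
record
  {final_delim=end, delim=',', quote=double}
(
  CustomerID:int32;
  CustomerName:string[max=30];
)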
Improving Sequential File Performance
If the source file is fixed width, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.
If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.
On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in Kbytes, with a default of 128 (128K). Increasing this may improve performance.
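For example, to double both buffers for a session (the values are illustrative only):
$ export APT_IMPORT_BUFFER_SIZE=256
$ export APT_EXPORT_BUFFER_SIZE=256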
Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.
$APT_CONSISTENT_BUFFERIO_SIZE - Some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting APT_CONSISTENT_BUFFERIO_SIZE=S will force stages to read data in chunks which are size S or a multiple of S.
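For instance, on an array that reads ahead in 1 MB cache lines (an assumed figure, for illustration):
$ export APT_CONSISTENT_BUFFERIO_SIZE=1048576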
Partitioning Sequential File Reads
Care must be taken to choose the appropriate partitioning method for a Sequential File read:
• Don't read from a Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
• When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file's data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.
Sequential File (Export) Buffering
By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with the increased I/O.
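For a near-realtime feed you might flush after every row, at some I/O cost:
$ export APT_EXPORT_FLUSH_COUNT=1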
Reading from and Writing to Fixed-Length Files
Particular attention must be taken when processing fixed-length fields using the Sequential File stage:
• If the incoming columns are variable-length data types (e.g., Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
• If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these.
Data set stage
The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.
What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director.
A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system.
The descriptor file for a data set contains the following information:
1. Data set header information.
2. Creation time and date of the data set.
3. The schema (metadata) of the data set.
4. A copy of the configuration file used when the data set was created.
Data Sets are the structured internal representation of data within the Parallel Framework. They consist of:
- Framework Schema (format: name, type, nullability)
- Data Records (the data itself)
- Partitions (the subset of rows held by each node)
Virtual Data Sets exist in memory and correspond to DataStage Designer links. Persistent Data Sets are stored on
disk and consist of:
- A descriptor file (metadata, configuration file, data file locations, flags)
- Multiple data files (one per node, stored in disk resource file systems), for example:
  node1:/local/disk1/...
  node2:/local/disk2/...
There is no "DataSet" operator - the Designer GUI inserts a copy operator.
When to Use Persistent Data Sets
When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints):
- Stored in native internal format (no conversion overhead)
- Retain data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O
Why Data Sets are not intended for long-term or archive storage:
- The internal format is subject to change with new DataStage releases
- They require access to named resources (node names, file system paths, etc.)
- The binary format is platform-specific
For fail-over scenarios, servers should be able to cross-mount file systems:
- You can read a data set as long as your current APT_CONFIG_FILE defines the same NODE names (fastnames
may differ)
- orchadmin -x lets you recover data from a data set if the node names are no longer available
Data Set Management
1. Viewing the schema
Click the Schema icon on the tool bar to view the record schema of the current data set. This is presented in text
form in the Record Schema window.
2. Viewing the data
Click the Data icon on the tool bar to view the data held by the current data set. This opens the Data Viewer
Options dialog box, which allows you to select a subset of the data to view.
- Rows to display. Specify the number of rows of data you want the data browser to display.
- Skip count. Skip the specified number of rows before viewing data.
- Period. Display every Pth record, where P is the period. You can start after records have been skipped by using the
Skip property. P must be equal to or greater than 1.
- Partitions. Choose between viewing the data in All partitions or the data in the partition selected from the drop-down
list. Click OK to view the selected data; the Data Viewer window appears.
3. Copying data sets
Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing
you to specify a path where the new data set will be stored. The new data set will have the same record schema,
number of partitions and contents as the original data set.
Note: You cannot use the UNIX cp command to copy a data set because WebSphere DataStage represents a single
data set with multiple files.
4. Deleting data sets
Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion.
Note: You cannot use the UNIX rm command to delete a data set because WebSphere DataStage represents a single
data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.
Orchadmin Commands
Orchadmin is a command line utility provided by DataStage for working with data sets.
The general calling format is: $ orchadmin <command> [options] [descriptor file]
Before using orchadmin, you should make sure that either the working directory or $APT_ORCHHOME/etc
contains the file "config.apt", or that the environment variable APT_CONFIG_FILE is defined for your
session.
The commands available with orchadmin are:
1. CHECK: $ orchadmin check
Validates the configuration file contents: accessibility of all nodes defined in the configuration file, scratch disk
definitions, and so on. Throws an error when the config file is not found or not defined properly.
2. COPY: $ orchadmin copy <source.ds> <destination.ds>
Makes a complete copy of the source dataset under a new destination descriptor file name. Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the data is not copied.
b. The new dataset will be laid out according to the config file currently in use, not according to the old config
file that was in use with the source.
3. DELETE: $ orchadmin <delete | del | rm> [-f | -x] descriptorfiles...
The UNIX rm utility cannot be used to delete datasets. The orchadmin delete or rm command should be used to
delete one or more persistent data sets.
-f forces the delete. If some nodes are not accessible, -f deletes the dataset partitions on the accessible nodes and
leaves the partitions on inaccessible nodes as orphans.
-x forces the current config file to be used for the delete, rather than the one stored in the data set.
4. DESCRIBE: $ orchadmin describe [options] descriptorfile.ds
This is the single most important command.
Without any options it lists the number of partitions, number of segments, valid segments, and the
preserve-partitioning flag of the persistent dataset.
-c: Prints the configuration file that is stored in the dataset, if any.
-p: Lists the partition-level information.
-f: Lists the file-level information in each partition.
-e: Lists the segment-level information.
-s: Lists the metadata schema of the dataset.
-v: Lists all segments, valid or otherwise.
-l: Long listing, equivalent to -f -p -s -v -e.
5. DUMP: $ orchadmin dump [options] descriptorfile.ds
The dump command is used to dump (extract) the records from the dataset. Without any options the dump command
lists all the records, from the first record of the first partition through the last record of the last partition.
-delim <string>: Uses the given string as the field delimiter instead of a space.
-field <name>: Lists only the given field instead of all fields.
-name: Lists all the values preceded by the field name and a colon.
-n numrecs: Lists only the given number of records per partition.
-p period(N): Lists every Nth record from each partition, starting from the first record.
-skip N: Skips the first N records in each partition.
-x: Uses the current system configuration file rather than the one stored in the dataset.
6. TRUNCATE: $ orchadmin truncate [options] descriptorfile.ds
Without options, deletes all the data (i.e. segments) from the dataset.
-f: Forces the truncate. Truncates accessible segments and leaves the inaccessible ones.
-x: Uses the current system config file rather than the one stored in the dataset.
-n N: Leaves the first N segments in each partition and truncates the remaining.
7. HELP: $ orchadmin -help OR $ orchadmin <command> -help
Displays the help manual for orchadmin or for an individual orchadmin command.
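To tie the commands together, a short illustrative session (the dataset path is hypothetical):

    $ orchadmin check                                      # validate the current configuration file
    $ orchadmin describe -p /data/work/customers.ds        # show partition-level details
    $ orchadmin dump -n 10 -name /data/work/customers.ds   # print the first 10 records per partition
    $ orchadmin rm /data/work/customers.ds                 # delete descriptor and data files safely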
File set stage
The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a
single input link, a single output link, and a single rejects link. It only executes in parallel mode.
What is a file set? WebSphere DataStage can generate and name exported files, write them to their destination, and
list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them
are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a
file and you need to distribute files among nodes to prevent overruns.
The amount of data that can be stored in each destination data file is limited by the characteristics of the file system
and the amount of free disk space available. The number of files created by a file set depends on:
1. The number of processing nodes in the default node pool
2. The number of disks in the export or default disk pool connected to each processing node in the default node
pool
3. The size of the partitions of the data set
The File Set stage enables you to create and write to file sets, and to read data back from file sets. Unlike data sets,
file sets carry formatting information that describes the format of the files to be read or written.
Filesets are similar to datasets:
1. They are partitioned
2. They are implemented with a header file and data files
Filesets are different from datasets:
1. The data files of filesets are text files, and hence are readable by other applications, whereas the data files of
datasets are stored in a native internal format and are readable only by DataStage
Lookup file set stage
The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The
stage can have a single input link or a single output link. The output link must be a reference link. The stage can be
configured to execute in parallel or sequential mode when used with an input link.
When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a
single descriptor file, which by convention has the suffix .fs.
When performing lookups, Lookup File Set stages are used with Lookup stages.
When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column
naming. If you have columns of the same name in both the source and lookup data sets, the source data set column
will go to the output data. If you want this column to be replaced by the column from the lookup data source, you
need to drop the source data column before you perform the lookup.
http://www.dsxchange.com/viewtopic.php?t=113394
A Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine
the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be
cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a
hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing
key overwrite the previous row. Duplicate key values are not permitted.
A Lookup File Set is only available in parallel jobs. It uses an index (based on a hash table) to determine the location
of keys within its structure. It is a parallel structure; it has its records spread over the processing nodes specified
when it was created. The records in the Lookup File Set are loaded into a virtual Data Set before use, and the index
is also loaded into memory. Duplicate key values are (optionally) permitted. If the option is not selected, duplicates
are rejected when writing to the Lookup File Set.
http://www.dsxchange.com/viewtopic.php?t=93287
I did testing on a Windows machine processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1
match. Two key fields of char 255 and two non-key fields also of char 255. I deliberately chose fat key fields. The
dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. Ran it a few times with the
same results.
One interesting result was memory utilisation; the fileset was consistently lighter than the dataset, by as much as
30% on RAM memory. This may be due to the keep/drop key field option of the fileset stage. If you set keep to
false, the key fields in the fileset are not loaded into memory as they are not required on the output side of the
lookup. I am guessing that the fileset version was moving and storing 510 chars less for each lookup than the dataset
version. In a normal lookup these key fields travel up the reference link and back down it again; in a lookup fileset
they only travel up.
When I switch the same job onto an AIX box with several gig of RAM I get 7 seconds for the dataset and 4 for the
fileset. With an increase to 500,000 rows I get 23 seconds for the dataset and 7 seconds for the fileset. This
difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it
doesn't have the Append option of a dataset; you can only overwrite it.
Creating a lookup file set
1. In the Input Link Properties Tab:
- Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to
specify multiple key columns. You must specify the key when you create the file set; you cannot specify it when
performing the lookup.
- Specify the name of the Lookup File Set.
- Specify a lookup range, or accept the default setting of No.
- Set Allow Duplicates, or accept the default setting of False.
2. Ensure column meta data has been specified for the lookup file set.
Looking up a lookup file set
1. In the Output Link Properties Tab, specify the name of the lookup file set being used in the lookup.
2. Ensure column meta data has been specified for the lookup file set.
By default the stage will write to the file set in entire mode; the complete data set is written to each partition. If the
Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using
the default (auto) collection method.
Complex Flat File stage
The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you
cannot use the same stage to do both.
As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one or
more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from files
that contain multiple record types. The source data can contain one or more of the following clauses:
1. GROUP
2. REDEFINES
3. OCCURS
4. OCCURS DEPENDING ON
CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the stage to
run sequentially if it is reading only one file with a single reader.
As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or more
complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.
Editing a Complex Flat File stage as a source
To edit a CFF stage as a source, you must provide details about the file that the stage will read, create record
definitions for the data, define the column metadata, specify record ID constraints, and select output columns.
To edit a CFF stage as a source:
1. Open the CFF stage editor.
2. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will read.
b. On the Record Options tab, describe the format of the data in the file.
c. If the stage is reading a file that contains multiple record types, on the Records tab, create record definitions
for the data.
d. On the Records tab, create or load column definitions for the data.
e. If the stage is reading a file that contains multiple record types, on the Records ID tab, define the record ID
constraint for each record.
f. Optional: On the Advanced tab, change the processing settings.
3. On the Output page, specify how to read data from the source file:
a. On the Selection tab, select one or more columns for each output link.
b. Optional: On the Constraint tab, define a constraint to filter the rows on each output link.
c. Optional: On the Advanced tab, change the buffering settings.
4. Click OK to save your changes and to close the CFF stage editor.
Creating record definitions
If you are reading data from a file that contains multiple record types, you must create a separate record definition
for each type. COBOL copybooks with multiple record types can be imported as a COBOL file definition (e.g.
Insurance.cfd). Each record type is stored as a separate DataStage table definition (e.g. if Insurance.cfd has 3
record types for Client, Policy and Coverage, then there will be 3 table definitions, one for each record type).
To create record definitions:
1. Click the Records tab on the Stage page.
2. Clear the Single record check box.
3. Right-click the default record definition RECORD_1 and select Rename Current Record.
4. Type a new name for the default record definition.
5. Add another record by clicking one of the buttons at the bottom of the records list. Each button offers a different
insertion point. A new record is created with the default name of NEWRECORD.
6. Double-click NEWRECORD to rename it.
7. Repeat steps 5 and 6 for each new record that you need to create.
8. Right-click the master record in the list and select Toggle Master Record. Only one master record is permitted.
Column definitions
You must define columns to specify what data the CFF stage will read or write.
If the stage will read data from a file that contains multiple record types, you must first create record definitions on
the Records tab. If the source file contains only one record type, or if the stage will write data to a target file, then
the columns belong to the default record called RECORD_1.
You can load column definitions from a table in the repository, or you can type column definitions into the columns
grid. You can also define columns by dragging a table definition from the Repository window to the CFF stage icon
on the Designer canvas.
Loading columns
The fastest way to define column metadata is to load columns from a table definition in the repository.
To load columns:
1. Click the Records tab on the Stage page.
2. Click Load to open the Table Definitions window. This window displays all of the repository objects that are in
the current project.
3. Select a table definition in the repository tree and click OK.
4. Select the columns to load in the Select Columns From Table window and click OK.
5. If flattening is an option for any arrays in the column structure, specify how to handle array data in the
Complex File Load Option window.
Typing columns
You can also define column metadata by typing column definitions in the columns grid.
To type columns:
1. Click the Records tab on the Stage page.
2. In the Level number field of the grid, specify the COBOL level number where the data is defined. If you do
not specify a level number, a default value of 05 is used.
3. In the Column name field, type the name of the column.
4. In the Native type field, select the native data type.
5. In the Length field, specify the data precision.
6. In the Scale field, specify the data scale factor.
7. Optional: In the Description field, type a description of the column.
Defining record ID constraints
If you are using the CFF stage to read data from a file that contains multiple record types, you must specify a record
ID constraint to identify the format of each record.
Columns that are identified in the record ID clause must be in the same physical storage location across records. The
constraint must be a simple equality expression, where a column equals a value.
To define a record ID constraint:
1. Click the Records ID tab on the Stage page.
2. Select a record from the Records list.
3. Select the record ID column from the Column list. This list displays all columns from the selected record,
except the first OCCURS DEPENDING ON (ODO) column and any columns that follow it.
4. Select the = operator from the Op list.
5. Type the identifying value for the record ID column in the Value field. Character values must be enclosed in
single quotation marks.
Selecting output columns
By selecting output columns, you specify which columns from the source file the CFF stage should pass to the
output links.
You can select columns from multiple record types to output from the stage. If you do not select columns to output
on each link, the CFF stage automatically propagates all of the stage columns except group columns to each empty
output link when you click OK to exit the stage.
To select output columns:
1. Click the Selection tab on the Output page.
2. If you have multiple output links, select the link that you want from the Output name list.
Defining output link constraints
By defining a constraint, you can filter the data on each output link from the CFF stage.
You can set the output link constraint to match the record ID constraint for each selected output record by clicking
Default on the Constraint tab on the Output page. The Default button is available only when the constraint grid is
empty.
To define an output link constraint:
1. Click the Constraint tab on the Output page.
2. In the ( field of the grid, select an opening parenthesis if needed. You can use parentheses to specify the order of
evaluation of a complex constraint expression.
3. In the Column field, select a column or job parameter. (Group columns cannot be used in constraint expressions
and are not displayed.)
4. In the Op field, select an operator or a logical function.
5. In the Column/Value field, select a column or job parameter, or double-click in the cell to type a value. Enclose
character values in single quotation marks.
6. In the ) field, select a closing parenthesis if needed.
7. If you are building a complex expression, in the Logical field, select AND or OR to continue the expression in
the next row.
8. Click Verify. If errors are found, you must either correct the expression, click Clear All to start over, or cancel.
You cannot save an incorrect constraint.
Editing a Complex Flat File stage as a target
To edit a CFF stage as a target, you must provide details about the file that the stage will write, define the record
format of the data, and define the column metadata.
To edit a CFF stage as a target:
1. Open the CFF stage editor.
2. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will write.
b. On the Record Options tab, describe the format of the data in the file.
c. On the Records tab, create or load column definitions for the data.
d. Optional: On the Advanced tab, change the processing settings.
3. Optional: On the Input page, specify how to write data to the target file:
a. On the Advanced tab, change the buffering settings.
b. On the Partitioning tab, change the partitioning settings.
4. Click OK to save your changes and to close the CFF stage editor.
Reject links
The CFF stage can have a single reject link, whether you use the stage as a source or a target.
For CFF source stages, reject links are supported only if the source file contains a single record type without any
OCCURS DEPENDING ON (ODO) columns. For CFF target stages, reject links are supported only if the target file
does not contain ODO columns.
You cannot change the selection properties of a reject link. The Selection tab for a reject link is blank.
You cannot edit the column definitions for a reject link. For writing files, the reject link uses the input link column
definitions. For reading files, the reject link uses a single column named "rejected" that contains raw data for the
columns that were rejected after reading because they did not match the schema.
FTP Enterprise Stage
The FTP Enterprise stage transfers multiple files in parallel. These are sets of files that are transferred from one or
more FTP servers into WebSphere DataStage or from WebSphere DataStage to one or more FTP servers. The
source or target for the file is identified by a URI (Universal Resource Identifier). The FTP Enterprise stage invokes
an FTP client program and transfers files to or from a remote host using the FTP protocol.
URI
Is a pathname connecting the stage to a target file on a remote host. It has the Open command dependent property. You can
repeat this property to specify multiple URIs. You can specify an absolute or a relative pathname.
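For illustration (the host and path are hypothetical), a URI might look like:

    ftp://ftp.example.com/incoming/orders.dat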
Open command
Is required if you perform any operation besides navigating to the directory where the file exists. There can be
multiple Open commands. This is a dependent property of URI.
ftp command
Is an optional command that you can specify if you do not want to use the default ftp command. For example, you
could specify /opt/gnu/bin/wuftp. You can enter the path of the command (on the server) directly in this field. You
can also specify a job parameter if you want to be able to specify the ftp command at run time.
User Name
Specify the user name for the transfer. You can enter it directly in this field, or you can specify a job parameter if
you want to be able to specify the user name at run time. You can specify multiple user names; User1 corresponds to
URI1 and so on. When the number of users is less than the number of URIs, the last user name is set for the remaining
URIs. If no User Name is specified, the FTP Enterprise stage tries to use the .netrc file in the home directory.
Password
Enter the password in this field. You can also specify a job parameter if you want to be able to specify the password
at run time. Specify a password for each user name; Password1 corresponds to URI1. When the number of
passwords is less than the number of URIs, the last password is set for the remaining URIs.
Transfer Protocol
Select the type of FTP service to transfer files between computers. You can choose either FTP or Secure FTP
(SFTP).
1. FTP: Select this option if you want to transfer files using the standard FTP protocol. This is a nonsecure
protocol. By default the FTP Enterprise stage uses this protocol to transfer files.
2. Secure FTP (SFTP): Select this option if you want to transfer files between computers over a secured channel.
Secure FTP (SFTP) uses the SSH (Secured Shell) protected channel for data transfer between computers over a
nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure
the SSH connection without any pass phrase for RSA authentication.
Force Parallelism
You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as needed to
transfer the n files in parallel. However, you can force the parallel transfer of data by setting this property to Yes.
This allows m processes at a time, where m is the number specified in the WebSphere DataStage
configuration file. If m is less than n, the stage waits to transfer the first m files and then starts the next m, until n
files are transferred.
When you set Force Parallelism to Yes, you should only give one URI.
Overwrite
Set this option to have any existing files overwritten by this transfer.
Restartable Mode
When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for recording
information about the transfer in a restart directory. If the transfer fails, you can run an identical job with the
restartable mode property set to Restart transfer, which will reattempt the transfer. If the transfer repeatedly fails,
you can run an identical job with the restartable mode option set to Abandon transfer, which will delete the restart
directory.
Restartable mode has the following dependent properties:
1. Job Id: Identifies a restartable transfer job. This is used to name the restart directory.
2. Checkpoint directory: Optionally specifies a checkpoint directory to contain restart directories. If you do not
specify this, the current working directory is used.
For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint, the files
would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.
Schema file
Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You can enter the
path name of a schema file, or specify a job parameter, so the schema file name can be specified at run time.
Transfer Type
Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII mode of
data transfer. The default data transfer mode is binary.
When reading a delimited Sequential File, you are instructed to interpret two contiguous field delimiters as NULL
for the corresponding field regardless of data type. Which three actions must you take? (Choose three.)
A. Set the data type to Varchar.
B. Set the field to nullable.
C. Set the "NULL Field Value" to two field delimiters (e.g., "||" for pipes).
D. Set the "NULL Field Value" to ''.
E. Set the environment variable APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL.
Answer: B, D, E
APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL - When set, allows a zero-length null_field value with
fixed-length fields. This should be used with care, as poorly formatted data will cause incorrect results. By default a
zero-length null_field value will cause an error.
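For completeness, a sketch of how the variable would be defined for a session (the variable only needs to be set; the value 1 is arbitrary):

    export APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL=1   # permit zero-length null_field on fixed-length fields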
Which two attributes are found in a Data Set descriptor file? (Choose two.)
A. A copy of the job score.
B. The schema of the Data Set.
C. A copy of the partitioned data.
D. A copy of the configuration file used when the Data Set was created.
Answer: B, D
When importing a COBOL file definition, which two are required? (Choose two.)
A. The file you are importing is accessible from your client workstation.
B. The file you are importing contains level 01 items.
C. The column definitions are in a COBOL copybook file and not, for example, in a COBOL source file.
D. The file does not contain any OCCURS DEPENDING ON clauses.
Answer: A, B
Which three features of datasets make them suitable for job restart points?
(Choose three.)
A. They are indexed for fast data access.
B. They are partitioned.
C. They use datatypes that are in the parallel engine internal format.
D. They are persistent.
E. They are compressed to minimize storage space.
Answer: B, C, D
Which statement describes a process for capturing a COBOL copybook from a z/OS system?
A. FTP the COBOL copybook to the server platform in text mode and capture the metadata through Manager.
B. Select the COBOL copybook using the Browse button and capture the COBOL copybook with Manager.
C. FTP the COBOL copybook to the client workstation in text mode and capture the copybook with Manager.
D. FTP the COBOL copybook to the client workstation in binary and capture the metadata through Manager.
Answer: C
The high performance ETL server on which DataStage EE is installed is networked with several other servers in the
IT department with a very high bandwidth switch. A list of seven files (all of which contain records with the same
record layout) must be retrieved from three of the other servers using FTP. Given the high bandwidth network and
high performance ETL server, which approach will retrieve and process all seven files in the minimal amount of
time?
A. In a single job, use seven separate FTP Enterprise stages the output links of which lead to a single Sort Funnel
stage, then process the records without landing to disk.
B. Set up a sequence of seven separate DataStage EE jobs, each of which retrieves a single file and appends to a
common dataset, then process the resulting dataset in an eighth DataStage EE job.
C. Use three FTP Plug-in stages (one for each machine) to retrieve the seven files and store them to a single file on
the fourth server, then use the FTP Enterprise stage to retrieve the single file and process the records without landing
to disk.
D. Use a single FTP Enterprise stage and specify seven URI properties, one for each file, then process the records
without landing to disk.
Answer: D
An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage?
A. No special settings are required. XML Input stage automatically detects the repetition element from the XPath
expression.
B. Set the "Key" property for the column on the output link to "Yes".
C. Check the "Repetition Element Required" box on the output link tab.
D. Set the "Nullable" property for the column on the output link to "Yes".
Answer: B
Using FTP, a file is transferred from an MVS system to a LINUX system in binary transfer mode. Which data
conversion must be used to read a packed decimal field in the file?
A. treat the field as a packed decimal
B. packed decimal fields are not supported
C. treat the field as ASCII
D. treat the field as EBCDIC
Answer: A
When a sequential file is read using a Sequential File stage, the parallel engine inserts an operator to convert the data
to the internal format. Which operator is inserted?
A. import operator
B. copy operator
C. tsort operator
D. export operator
Answer: A
Which type of file is both partitioned and readable by external applications?
A. fileset
B. Lookup fileset
C. dataset
D. sequential file
Answer: A
Which two statements are true about XML Meta Data Importer? (Choose two.)
A. XML Meta Data Importer is capable of reporting syntax and semantic errors from an XML file.
B. XPATH expressions that are created during XML metadata import cannot be modified.
C. XML Meta Data Importer can import Table Definitions from only XML documents.
D. XPATH expressions that are created during XML metadata import are used by XML Input stage and XML
Output stage.
Answer: A, D
Which two statements are correct about XML stages and their usage? (Choose two.)
A. XML Input stage converts XML data to tabular format.
B. XML Output stage converts tabular data to XML hierarchical structure.
C. XML Output stage uses an XSLT stylesheet for XML to tabular transformations.
D. XML Transformer stage converts XML data to tabular format.
Answer: A, B
Which "Reject Mode" option in the Sequential File stage will write records to a reject link?
A. Output
B. Fail
C. Drop
D. Continue
Answer: A
A single sequential file exists on a single node. To read this sequential file in parallel, what should be done?
A. Set the Execution mode to "Parallel".
B. A sequential file cannot be read in parallel using the Sequential File stage.
C. Select "File Pattern" as the Read Method.
D. Set the "Number of Readers Per Node" optional property to a value greater than 1.
Answer: D
When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the
data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator
Answer: A
A bank receives daily credit score updates from a credit agency in the form of a fixed width flat file. The
monthly_income column is an unsigned nullable integer (int32) whose width is specified as 10, and null values are
represented as spaces. Which Sequential File property will properly import any nulls in the monthly_income column
of the input file?
A. Set the record level fill char property to the space character (' ').
B. Set the null field value property to a single space (' ').
C. Set the C_format property to '"%d.10"'.
D. Set the null field value property to ten spaces ('          ').
Answer: D
An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage?
A. Set the "Nullable" property for the column on the output link to "Yes".
B. Set the "Key" property for the column on the output link to "Yes".
C. Check the "Repetition Element Required" box on the output link tab.
D. No special settings are required. XML Input stage automatically detects the repetition element from the XPath
expression.
Answer: B
During a sequential file read, you experience an error with the data. What is a valid technique for identifying the
column causing the difficulty?
A. Set the "data format" option to text on the Record Options tab.
B. Enable tracing in the DataStage Administrator Tracing panel.
C. Enable the "print field" option on the Record Options tab.
D. Set the APT_IMPORT_DEBUG environment variable.
Answer: C
On which two does the number of data files created by a fileset depend? (Choose two.)
A. the size of the partitions of the dataset
B. the number of CPUs
C. the schema of the file
D. the number of processing nodes in the default node pool
Answer: A, D
What are two ways to delete a persistent parallel dataset? (Choose two.)
A. standard UNIX command rm
B. orchadmin command rm
C. delete the dataset Table Definition in DataStage Manager
D. delete the dataset in Data Set Manager
Answer: B, D
A parts supplier has a single fixed width sequential file. Reading the file has been slow, so the supplier would like to
try to read it in parallel. If the job executes using a configuration file consisting of four nodes, which two Sequential
File stage settings will cause the DataStage parallel engine to read the file using four parallel readers? (Choose two.)
(Note: Assume the file path and name is /data/parts_input.txt.)
A. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers
per node option to 2.
B. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the read from multiple
nodes option to yes.
C. Set read method to file pattern, and set the file pattern property to '/data/(@PART_COUNT)parts_input.txt'.
D. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers
per node option to 4.
Answer: B, D
Data Transformation
Transformer Stage
Transformer stages can have a single input and any number of outputs. A Transformer can also have a reject link that takes any
rows which have not been written to any of the output links by reason of a write failure or expression evaluation
failure.
In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and
when. The evaluation sequence is as follows:
Evaluate each stage variable initial value
For each input row to process:
    Evaluate each stage variable derivation value, unless the derivation is empty
    For each output link:
        Evaluate each column derivation value
        Write the output record
    Next output link
Next input row
The stage variables and the columns within a link are evaluated in the order in which they are displayed on the
parallel job canvas. Similarly, the output links are also evaluated in the order in which they are displayed.
System variables
WebSphere DataStage provides a set of variables containing useful system information that you can access from an
output derivation or constraint.
1. @FALSE The value is replaced with 0.
2. @TRUE The value is replaced with 1.
3. @INROWNUM Input row counter.
4. @OUTROWNUM Output row counter (per link).
5. @NUMPARTITIONS The total number of partitions for the stage.
6. @PARTITIONNUM The partition number for the particular instance.
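As an example of how these combine (a sketch, not taken from the source text), a derivation that generates a unique,
non-overlapping sequence of integers across partitions is:

    (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1

Each partition produces an interleaved series, so no two partitions can emit the same value.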
Triggers tab
The Triggers tab allows you to choose routines to be executed at specific execution points as the Transformer stage
runs in a job. The execution point is per-instance, i.e., if a job has two Transformer stage instances running in
parallel, the routine will be called twice, once for each instance.
The available execution points are Before-stage and After-stage. At this release, the only available built-in routine is
SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you define a C function,
make it available in a UNIX shared library, and then define a Parallel routine which calls it (see WebSphere DataStage
Designer Client Guide for details on defining a Parallel Routine). Note that the function should not return a value.
A constraint otherwise link can be defined by:
1. Clicking on the Otherwise/Log field so a tick appears and leaving the Constraint fields blank. This will catch
any rows that have failed to meet constraints on all the previous output links.
2. Setting the constraint to OTHERWISE. This will be set whenever a row is rejected on a link because the row fails
to match a constraint. OTHERWISE is cleared by any output link that accepts the row.
3. The otherwise link must occur after the output links in link order so it will catch rows that have failed to meet
the constraints of all the output links. If it is not last, a row that satisfies a constraint on a later link may be sent
down the otherwise link as well as down that later link.
4. Clicking on the Otherwise/Log field so a tick appears and defining a Constraint. This will result in the number
of rows written to that link (i.e. rows which satisfy the constraint) being recorded in the job log as a warning
message.
Note: You can also specify a reject link which will catch rows that have not been written on any output links due to
a write error or null expression error. Define this outside the Transformer stage by adding a link and using the shortcut
menu to convert it to a reject link.
Conditionally Aborting a Job
Use the "Abort After Rows" setting in the output link constraints of the parallel Transformer to conditionally abort a
parallel job. You can specify an abort condition for any output link. The abort occurs after the specified number of
rows occurs in one of the partitions. When the "Abort After Rows" threshold is reached, the Transformer
immediately aborts the job flow, potentially leaving uncommitted database rows or unflushed file buffers.
Functions and Operators
Concatenation operator - ":"
Substring operator - Input_String[starting position, length]
String functions
1. Len(<string>)
2. Trim(<string>)
3. UpCase/DownCase(<string>)
Null handling functions
1. IsNull
2. IsNotNull
3. NullToValue
4. NullToZero
5. SetNull()
Type conversions
1. StringToTimestamp
2. StringToDecimal
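A few sample output column derivations showing these in combination (a sketch; the link and column names are
hypothetical):

    Trim(lnk_in.FirstName) : " " : Trim(lnk_in.LastName)
    lnk_in.AccountCode[1,3]
    NullToValue(lnk_in.MiddleName, "N/A")
    StringToTimestamp(lnk_in.LoadTs, "%yyyy-%mm-%dd %hh:%nn:%ss")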
Using Transformer stages
In general, it is good practice not to use more Transformer stages than you have to. You should especially avoid
using multiple Transformer stages where the logic can be combined into a single stage. It is often better to use other
stage types for certain types of operation:
1. Use a Copy stage rather than a Transformer for simple operations such as:
- Providing a job design placeholder on the canvas. (Provided you do not set the Force property to True on the Copy
stage, the copy will be optimized out of the job at run time.)
- Renaming columns.
- Dropping columns.
- Implicit type conversions. Note that, if runtime column propagation is disabled, you can also use output mapping
on a stage to rename, drop, or convert columns on a stage that has both inputs and outputs.
2. Use the Modify stage for explicit type conversion and null handling.
3. Where complex, reusable logic is required, or where existing Transformer-stage based job flows do not meet
performance requirements, consider building your own custom stage.
4. Use a BASIC Transformer stage where you want to take advantage of user-defined functions and routines.
SCD Stage
The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and
writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing
stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension.
You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your
database and your performance requirements.
SCD stages support both SCD Type 1 and SCD Type 2 processing:
1. SCD Type 1: Overwrites an attribute in a dimension table.
2. SCD Type 2: Adds a new row to a dimension table.
Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the
dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found,
the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage
creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be
present in the source data.
Purpose codes in a Slowly Changing Dimension stage
Purpose codes are an attribute of dimension columns in SCD stages. Purpose codes are used to build the lookup
table, to detect dimension changes, and to update the dimension table.
Building the lookup table: The SCD stage uses purpose codes to determine how to build the lookup table for the
dimension lookup. If a dimension has only Type 1 columns, the stage builds the lookup table by using all dimension
rows. If any Type 2 columns exist, the stage builds the lookup table by using only the current rows. If a dimension
has a Current Indicator column, the stage uses the derivation value of this column on the Dim Update tab to identify
the current rows of the dimension table. If a dimension does not have a Current Indicator column, then the stage uses
the Expiration Date column and its derivation value to identify the current rows. Any dimension columns that are
not needed are not used. This technique minimizes the amount of memory that is required by the lookup table.
Detecting dimension changes: Purpose codes are also used to detect dimension changes. The SCD stage compares
Type 1 and Type 2 column values to source column values to determine whether to update an existing row, insert a
new row, or expire a row in the dimension table.
Updating the dimension table: Purpose codes are part of the column metadata that the SCD stage propagates to the
dimension update link. You can send this column metadata to a database stage in the same job, or you can save the
metadata on the Columns tab and load it into a database stage in a different job. When the database stage uses the
auto-generated SQL option to perform inserts and updates, it uses the purpose codes to generate the correct SQL
statements.
Selecting purpose codes
Purpose codes specify how the SCD stage should process dimension data. Purpose codes apply to columns on the
dimension reference link and on the dimension update link. Select purpose codes according to the type of columns
in a dimension:
1. If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration Date
column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current Indicator to the
same column.
2. If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or SK
Chain columns are allowed.
Purpose code definitions
The SCD stage provides nine purpose codes to support dimension processing.
1. (blank): The column has no SCD purpose. This purpose code is the default.
2. Surrogate Key: The column is a surrogate key that is used to identify dimension records.
3. Business Key: The column is a business key that is typically used in the lookup condition.
4. Type 1: The column is an SCD Type 1 field. SCD Type 1 column values are always current. When changes
occur, the SCD stage overwrites existing values in the dimension table.
5. Type 2: The column is an SCD Type 2 field. SCD Type 2 column values represent a point in time. When
changes occur, the SCD stage creates a new dimension row.
6. Current Indicator (Type 2): The column is the current record indicator for SCD Type 2 processing. Only
one Current Indicator column is allowed.
7. Effective Date (Type 2): The column is the effective date for SCD Type 2 processing. Only one Effective
Date column is allowed.
8. Expiration Date (Type 2): The column is the expiration date for SCD Type 2 processing. An Expiration
Date column is required if there is no Current Indicator column; otherwise it is optional.
9. SK Chain: The column is used to link a record to the previous record or the next record by using the value
of the Surrogate Key column. Only one Surrogate Key column can exist if you have an SK Chain column.
Surrogate keys in a Slowly Changing Dimension stage
Surrogate keys are used to join a dimension table to a fact table in a star schema database.
When the SCD stage performs a dimension lookup, it retrieves the value of the existing surrogate key if a matching
record is found. If a match is not found, the stage obtains a new surrogate key value by using the derivation of the
Surrogate Key column on the Dim Update tab. If you want the SCD stage to generate new surrogate keys by using a
key source that you created with a Surrogate Key Generator stage, you must use the NextSurrogateKey function to
derive the Surrogate Key column. If you want to use your own method to handle surrogate keys, you should derive
the Surrogate Key column from a source column.
You can replace the dimension information in the source data stream with the surrogate key value by mapping the
Surrogate Key column to the output link.
Specifying information about a key source
If you created a key source with a Surrogate Key Generator stage, you must specify how the SCD stage should use
the source to generate surrogate keys.
The key source can be a flat file or a database sequence. The key source must exist before the job runs. If the key
source is a flat file, the file must be accessible from all nodes that run the SCD stage.
To use the key source:
1. On the Input page, select the reference link in the Input name field.
2. Click the Surrogate Key tab.
3. In the Source type field, select the source type.
4. In the Source name field, type the name of the key source, or click the arrow button to browse for a file or to
insert a job parameter. If the source is a flat file, type the name and fully qualified path of the state file, such as
C:\SKG\ProdDim. If the source is a database sequence, type the name of the sequence, such as
PRODUCT_KEY_SEQ.
5. Provide additional information about the key source according to the type:
- If the source is a flat file, specify information in the Flat File area.
- If the source is a database sequence, specify information in the DB sequence area.
Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a derivation that
uses the NextSurrogateKey function for the column that has a purpose code of Surrogate Key. The
NextSurrogateKey function returns the value of the next surrogate key when the SCD stage creates a new dimension
row.
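For example (a minimal sketch), the Dim Update derivation for the column whose purpose code is Surrogate Key
would simply be:

    NextSurrogateKey()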
A DataStage job contains a parallel Transformer with a single input link and a single output link. The Transformer
has a constraint that should produce 1000 records; however, only 900 came out through the output link.
What should be done to identify the missing records?
A. Turn trace on using DataStage Administrator.
B. Add a Reject link to the Transformer stage.
C. Scan the generated osh script for possible errors.
D. Remove the constraint on the output link.
Answer: B
Which three actions are performed using stage variables in a parallel Transformer stage? (Choose three.)
A. A function can be executed once per record.
B. A function can be executed once per run.
C. Identify the first row of an input group.
D. Identify the last row of an input group.
E. Look up a value from a reference dataset.
Answer: A, B, C
Which two system variables must be used in a parallel Transformer derivation to generate a unique sequence of
integers across partitions? (Choose two.)
A. @PARTITIONNUM
B. @INROWNUM
C. @DATE
D. @NUMPARTITIONS
Answer: A, D
What would require creating a new parallel Custom stage rather than a new parallel BuildOp stage?
A. A Custom stage can be created with properties. BuildOp stages cannot be created with properties.
B. In a Custom stage, the number of input links does not have to be fixed, but can vary, for example from one to
two. BuildOp stages require a fixed number of input links.
C. Creating a Custom stage requires knowledge of C/C++. You do not need knowledge of C/C++ to create a
BuildOp stage.
D. Custom stages can be created for parallel execution. BuildOp stages can only be built to run sequentially.
Answer: B
Your input rows contain customer data from a variety of locations. You want to select just those rows from a
specified location based on a parameter value. You are trying to decide whether to use a Transformer or a Filter
stage to accomplish this. Which statement is true?
A. The Transformer stage will yield better performance because the Filter stage Where clause is interpreted at
runtime.
B. You cannot use a Filter stage because you cannot use parameters in a Filter stage Where clause.
C. The Filter stage will yield better performance because it has less overhead than a Transformer stage.
D. You cannot use the Transformer stage because you cannot use parameters in a Transformer stage constraint.
Answer: A
In a Transformer you add a new column to an output link named JobName that is to contain the name of the job that
is running. What can be used to derive values for this column?
A. a DataStage function
B. a link variable
C. a system variable
D. a DataStage macro
Answer: D
Which statement describes how to add functionality to the Transformer stage?
A. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of a
function written and compiled in C++.
B. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of an
external program.
C. Create a new server routine in the Routines category that specifies the name and category of a function written in
DataStage Basic.
D. Edit the C++ code generated by the Transformer stage.
Answer: A
Which three statements about the Enterprise Edition parallel Transformer stage are correct? (Choose three.)
A. The Transformer allows you to copy columns.
B. The Transformer allows you to do lookups.
C. The Transformer allows you to apply transforms using routines.
D. The Transformer stage automatically applies a 'NullToValue' function to all non-nullable output columns.
E. The Transformer allows you to do data type conversions.
Answer: A, C, E
Which two stages allow field names to be specified using job parameters? (Choose two.)
A. Transformer stage
B. Funnel stage
C. Modify stage
D. Filter stage
Answer: C, D
The parallel dataset input into a Transformer stage contains null values. What should you do to properly handle
these null values?
A. Convert null values to valid values in a stage variable.
B. Convert null values to a valid value in the output column derivation.
C. Null values are automatically converted to blanks and zero, depending on the target data type.
D. Trap the null values in a link constraint to avoid derivations.
Answer: A
Which two would require the use of a Transformer stage instead of a Copy stage? (Choose two.)
A. Drop a column.
B. Send the input data to multiple output streams.
C. Trim spaces from a character field.
D. Select certain output rows based on a condition.
Answer: C, D
In which situation should a BASIC Transformer stage be used in a DataStage EE job?
A. in a job containing complex routines migrated from DataStage Server Edition
B. in a job requiring lookups to hashed files
C. in a large-volume job flow
D. in a job requiring complex, reusable logic
Answer: A
You have three output links coming out of a Transformer. Two of them (A and B) have constraints you have
defined. The third you want to be an Otherwise link that is to contain all of the rows that do not satisfy the
constraints of A and B. This Otherwise link must work correctly even if the A and B constraints are modified.
Which two are required? (Choose two.)
A. The Otherwise link must be first in the link ordering.
B. A constraint must be coded for the Otherwise link.
C. The Otherwise link must be last in the link ordering.
D. The Otherwise check box must be checked.
Answer: C, D
Which two statements are true about DataStage Parallel BuildOp stages? (Choose two.)
A. Unlike standard DataStage stages, they do not have properties.
B. They are coded using C/C++.
C. They are coded using DataStage Basic.
D. Table Definitions are used to define the input and output interfaces of the BuildOp.
Answer: B, D
Job Control and Run-time Management
Message Handlers
When you run a parallel job, any error messages and warnings are written to an error log and can be viewed from the
Director. You can choose to handle specified errors in a different way by creating one or more message handlers.
A message handler defines rules about how to handle messages generated when a parallel job is running. You can,
for example, use one to specify that certain types of message should not be written to the log.
You can edit message handlers in the DataStage Manager or in the DataStage Director. The recommended way to
create them is by using the Add rule to message handler feature in the Director.
You can specify message handler use at different levels:
Project Level: You define a project-level message handler in the DataStage Administrator, and this applies to all
parallel jobs within the specified project.
Job Level: From the Designer and Manager you can specify that any existing handler should apply to a specific job.
When you compile the job, the handler is included in the job executable as a local handler (and so can be exported to
other systems if required).
You can also add rules to handlers when you run a job from the Director (regardless of whether it currently has a
local handler included). This is useful, for example, where a job is generating a message for every row it is
processing. You can suppress that particular message.
When the job runs it will look in the local handler (if one exists) for each message to see if any rules exist for that
message type. If a particular message is not handled locally, it will look to the project-wide handler for rules. If there
are none there, it writes the message to the job log.
Note that message handlers do not deal with fatal error messages; these will always be written to the job log. You
cannot add message rules to jobs from an earlier release of DataStage without first re-running those jobs.
Adding Rules to Message Handlers
You can add rules to message handlers "on the fly" from within the Director. Using this method, you can add rules to
handlers that are local to the current job, to the project default handler, or to any previously-defined handler.
To add rules in this way, highlight the message you want to add a rule about in the job log and choose Add rule to
message handler... from the job log shortcut menu or from the Job menu on the menu bar. The Add rule to message
handler dialog box appears.
To add a rule:
1. Choose an option to specify which handler you want to add the new rule to. Choose between the local runtime
handler for the currently selected job, the project-level message handler, or a specific message handler. If you
want to edit a specific message handler, select the handler from the Message Handler dropdown list. Choose
(New) to create a new message handler.
2. Choose an Action from the drop-down list. Choose from:
• Suppress from log. The message is not written to the job's log as it runs.
• Promote to Warning. Promote an informational message to a warning message.
• Demote to Informational. Demote a warning message to become an informational one.
The Message ID, Message type and Example of message text fields are all filled in from the log entry you have
currently selected. You cannot edit these.
3. Click Add Rule to add the new message rule to the chosen handler.
Managing Message Handlers
To open the Message Handler Manager, choose Tools > Message Handlers (you can also open the manager from the
Add rule to message handler dialog box). The Edit Message Handlers dialog box appears.
Message Handler File Format
A message handler is a plain text file and has the suffix .msh. It is stored in the MsgHandlers folder under the
server install directory (...\DataStage\MsgHandlers). The following is an example message file:
TUTL 000031	1	1	The open file limit is 100; raising to 1024.
TFSC 000001	1	2	APT configuration file.
TFSC 000043	2	3	Attempt to Cleanup after ABORT raised in stage.
Each line in the file represents one message rule, and comprises four tab-separated fields:
- Message ID. Case-specific string uniquely identifying the message
- Type. 1 for Info, 2 for Warn
- Action. 1 = Suppress, 2 = Promote, 3 = Demote
- Message. Example text of the message
Identify the use of the dsjob command line utility
You can start, stop, validate, and reset jobs using the -run option.
Running a job
dsjob -run
[ -mode [ NORMAL | RESET | VALIDATE ] ]
[ -param name=value ]
[ -warn n ]
[ -rows n ]
[ -wait ]
[ -stop ]
[ -jobstatus ]
[ -userstatus ]
[ -local ]
[ -opmetadata [TRUE | FALSE] ]
[ -disableprjhandler ]
[ -disablejobhandler ]
[useid] project job|job_id
-mode specifies the type of job run. NORMAL starts a job run, RESET resets the job and VALIDATE validates the
job. If -mode is not specified, a normal job run is started.
-param specifies a parameter value to pass to the job. The value is in the format name=value, where name is the
parameter name, and value is the value to be set. If you use this to pass a value of an environment variable for a job
(as you may do for parallel jobs), you need to quote the environment variable and its value, for example
-param '$APT_CONFIG_FILE=chris.apt', otherwise the current value of the environment variable will be used.
-warn n sets warning limits to the value specified by n (equivalent to the DSSetJobLimit function used with
DSJ_LIMITWARN specified as the LimitType parameter).
-rows n sets row limits to the value specified by n (equivalent to the DSSetJobLimit function used with
DSJ_LIMITROWS specified as the LimitType parameter).
-wait waits for the job to complete (equivalent to the DSWaitForJob function).
-stop terminates a running job (equivalent to the DSStopJob function).
-jobstatus waits for the job to complete, then returns an exit code derived from the job status.
-userstatus waits for the job to complete, then returns an exit code derived from the user status if that status is
defined. The user status is a string, and it is converted to an integer exit code. The exit code 0 indicates that the job
completed without an error, but that the user status string could not be converted. If a job returns a negative user
status value, it is interpreted as an error.
-local use this when running a DataStage job from within a shell script on a UNIX server. Provided the script is run
in the project directory, the job will pick up the settings for any environment variables set in the script and any
settings specific to the user environment.
-opmetadata use this to have the job generate operational meta data as it runs. If MetaStage, or the Process Meta
Data MetaBroker, is not installed on the machine, then the option has no effect. If you specify
TRUE, operational meta data is generated, whatever the default setting for the project. If you specify FALSE, the
job will not generate operational meta data, whatever the default setting for the project.
-disableprjhandler use this to disable any error message handler that has been set on a project-wide basis.
-disablejobhandler use this to disable any error message handler that has been set for this job.
useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To run a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
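For example, the following invocation (the project name dstage1 and job name LoadWarehouse are illustrative, not
taken from this material) starts a normal run, passes one job parameter, sets a warning limit, waits for completion
and returns an exit code derived from the job status:
dsjob -run -mode NORMAL -param SourceDir=/data/in -warn 50 -wait -jobstatus dstage1 LoadWarehouse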
Stopping a job
You can stop a job using the -stop option.
dsjob -stop [useid] project job|job_id
-stop terminates a running job (equivalent to the DSStopJob function).
useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To stop a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
Listing Projects
The following syntax displays a list of all known projects on the server:
dsjob -lprojects
This syntax is equivalent to the DSGetProjectList function.
Listing Jobs
The following syntax displays a list of all jobs in the specified project:
dsjob -ljobs project
project is the name of the project containing the jobs to list. This syntax is equivalent to the DSGetProjectInfo
function.
Listing Stages
The following syntax displays a list of all stages in a job:
dsjob -lstages [useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function with DSJ_STAGELIST specified as the InfoType parameter.
Listing Links
The following syntax displays a list of all the links to or from a stage:
dsjob -llinks [useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function with DSJ_LINKLIST specified as the InfoType parameter.
Listing Parameters
The following syntax displays a list of all the parameters in a job and their values:
dsjob -lparams [useid] project job|job_id
Listing Invocations
The following syntax displays a list of the invocations of a job:
dsjob -linvocations
Setting an Alias for a Job
The dsjob command can be used to specify your own ID for a DataStage job. Other commands can then use that
alias to refer to the job.
dsjob -jobid [my_ID] project job
my_ID is the alias you want to set for the job. If you omit my_ID, the command will return the current alias for the
specified job. An alias must be unique within the project; if the alias already exists, an error message is displayed.
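As a sketch with the same illustrative names as above, the following sets the alias nightly for the job and then
refers to the job through that alias (note the literal useid keyword):
dsjob -jobid nightly dstage1 LoadWarehouse
dsjob -run useid dstage1 nightly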
Displaying Job Information
The following syntax displays the available information about a specified job:
dsjob -jobinfo [useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function.
Displaying Stage Information
The following syntax displays all the available information about a stage:
dsjob -stageinfo [useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function.
Displaying Link Information
The following syntax displays information about a specified link to or from a stage:
dsjob -linkinfo [useid] project job|job_id stage link
This syntax is equivalent to the DSGetLinkInfo function.
Displaying Parameter Information
This syntax displays information about the specified parameter:
dsjob -paraminfo [useid] project job|job_id param
The following information is displayed:
- The parameter type
- The parameter value
- Help text for the parameter that was provided by the job's designer
- Whether the value should be prompted for
- The default value that was specified by the job's designer
- Any list of values
- The list of values provided by the job's designer
This syntax is equivalent to the DSGetParamInfo function.
Adding a Log Entry
The following syntax adds an entry to the specified log file. The text for the entry is taken from standard input to the
terminal, ending with Ctrl-D.
dsjob -log [ -info | -warn ] [useid] project job|job_id
-info specifies an information message. This is the default if no log entry type is specified.
-warn specifies a warning message.
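For example (illustrative names again), the following adds a warning entry to the job's log, reading the entry text
from standard input until Ctrl-D:
dsjob -log -warn dstage1 LoadWarehouse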
Displaying a Short Log Entry
The following syntax displays a summary of entries in a job log file:
dsjob -logsum [-type type] [ -max n ] [useid] project job|job_id
-type type specifies the type of log entry to retrieve. If -type type is not specified, all the entries are retrieved.
type can be one of the following options:
INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED All control logs.
RESET Job reset.
BATCH Batch control.
ANY All entries of any type. This is the default if type is not specified.
-max n limits the number of entries retrieved to n.
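For example, the following retrieves at most 10 fatal-error entries from the log of the illustrative job used above:
dsjob -logsum -type FATAL -max 10 dstage1 LoadWarehouse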
Displaying a Specific Log Entry
The following syntax displays the specified entry in a job log file:
dsjob -logdetail [useid] project job|job_id entry
entry is the event number assigned to the entry. The first entry in the file is 0.
This syntax is equivalent to the DSGetLogEntry function.
Identifying the Newest Entry
The following syntax displays the ID of the newest log entry of the specified type:
dsjob -lognewest [useid] project job|job_id type
INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED Job started.
RESET Job reset.
BATCH Batch control.
This syntax is equivalent to the DSGetNewestLogId function.
Importing Job Executables
The dsjob command can be used to import job executables from a DSX file into a specified project. Note that this
command is only available on UNIX servers.
dsjob -import project DSXfilename [-OVERWRITE] [-JOB[S] jobname ...] | [-LIST]
project is the project to import into. DSXfilename is the DSX file containing the job executables.
-OVERWRITE specifies that any existing jobs in the project with the same name will be overwritten.
-JOB[S] jobname specifies that one or more named job executables should be imported (otherwise all the executables
in the DSX file are imported).
-LIST causes DataStage to list the executables in a DSX file rather than import them.
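As a sketch with a hypothetical DSX file name, the following first lists the executables held in the file, then imports
one named job, overwriting any existing job of the same name:
dsjob -import dstage1 /tmp/nightly_jobs.dsx -LIST
dsjob -import dstage1 /tmp/nightly_jobs.dsx -OVERWRITE -JOB LoadWarehouse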
Generating a Report
The dsjob command can be used to generate an XML format report containing job, stage, and link information.
dsjob -report [useid] project job|job_id [report_type]
report_type is one of the following:
BASIC – Text string containing start/end time, time elapsed and status of job.
DETAIL – As basic report, but also contains information about individual stages and links within the job.
XML – Text string containing a full XML report.
By default the generated XML will not contain a <?xml-stylesheet?> processing instruction. If a stylesheet is
required, specify a ReportLevel of 2 and append the name of the required stylesheet
URL, i.e., 2:styleSheetURL. This inserts a processing instruction into the generated XML of the form:
<?xml-stylesheet type="text/xsl" href="styleSheetURL"?>
The generated report is written to stdout.
This syntax is equivalent to the DSMakeJobReport function.
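For example (illustrative names; the stylesheet URL is hypothetical), the first command below writes a DETAIL
report to stdout, and the second requests the XML report with a stylesheet processing instruction:
dsjob -report dstage1 LoadWarehouse DETAIL
dsjob -report dstage1 LoadWarehouse 2:http://example.com/jobreport.xsl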
Job Sequence
What is a Job Sequence?
1. A master controlling job that controls the execution set of subordinate jobs
2. Passes values to subordinate job parameters
3. Controls the order of execution (links)
4. Specifies conditions under which the subordinate jobs get executed (triggers)
5. Specifies complex flow of control – Loops, All/Any sequencer, Wait for file
6. Performs system activities (Email, Execute system commands and executables)
7. Can include Restart checkpoints
What are the Job Sequence stages?
1. Run stages – Job Activity: Run a job; Execute Command/Routine Activity: Run a system command;
Notification Activity: Send an email
2. Flow Control stages – Sequencer: Go All/Any; Wait for file: Go when file exists/doesn't exist; Loop: Start Loop
and End Loop; Nested Condition: Go if condition satisfied
3. Error handling – Exception Handler, Terminator
4. Variables – User Variables
What are the compilation options in Job Sequence properties?
1. Add checkpoints so sequence is restartable on failure – Restart functionality
2. Automatically handle activities that fail – Exception stage to handle aborts
3. Log warnings after activities that finish with status other than OK
4. Log report messages after each run
What are the inputs for the Job Activity stage?
1. Job name (select from list)
2. Execution Action (select from list)
3. Parameters
4. Do not checkpoint run (select/unselect checkbox)
What are the Job Activity Execution Actions?
1. Run
2. Reset if required, then run
3. Validate
What are the different types of triggers for a Job Activity?
OK – (Conditional)
Failed – (Conditional)
Warning – (Conditional)
Custom – (Conditional)
UserStatus – (Conditional)
Unconditional
Otherwise
Custom Trigger Example – Job_1.$JobStatus=DSJS.RUNOK or Job_1.$JobStatus=DSJS.RUNWARN
What are the inputs for the Execute Command stage?
1. Command
2. Parameters
3. Do not checkpoint run (select/unselect checkbox)
What are the inputs for the Notification stage?
1. SMTP Mail server name
2. Sender's email address
3. Recipient's email address
4. Email subject
5. Attachment
6. Email body
7. Include job status in email (select/unselect checkbox)
8. Do not checkpoint run (select/unselect checkbox)
What are the inputs for the Wait for file stage?
1. Filename
2. Wait for file to appear / Wait for file to disappear (select one of the two options)
3. Timeout length (disabled if the "Do not timeout" option is selected)
4. Do not timeout
5. Do not checkpoint run
Explain the Nested Condition stage?
The Nested Condition stage is used to branch out to other activities based on trigger conditions.
Explain the Loop stage?
The Loop stage is made up of Start Loop and End Loop. The Start Loop connects to one of the Run activities
(preferably a Job Activity). This Activity stage connects to the End Loop. The End Loop connects to the Start Loop
activity by means of a reference link.
The 2 types of looping are:
1. Numeric (For counter n to n Step n)
2. List (For each thing in list)
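As an illustrative sketch: a numeric Start Loop defined as From 1 To 10 Step 1 makes the current iteration available
to the activities inside the loop, which can reference it in an expression such as StartLoop_Activity.$Counter
(the activity name StartLoop_Activity is hypothetical; substitute the name of your own Start Loop activity).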
Explain the Error handling and Restartability?
Error handling is enabled using the "Automatically handle activities that fail" option. The control is passed to the
Exception stage when an Activity fails.
Restartability is enabled using the "Add checkpoints so sequence is restartable on failure" option. If a sequence fails,
then when the Sequence is re-run, activities that completed successfully in the prior run are skipped over (unless the
"Do not checkpoint run" option was set for an activity).
Which three are valid ways within a Job Sequence to pass parameters to Activity stages? (Choose three.)
A. ExecCommand Activity stage
B. UserVariables Activity stage
C. Sequencer Activity stage
D. Routine Activity stage
E. Nested Condition Activity stage
Answer: A, B, D
Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.)
A. Equality(Conditional)
B. Unconditional
C. ReturnValue(Conditional)
D. Difference(Conditional)
E. Custom(Conditional)
Answer: B, C, E
A client requires that any job that aborts in a Job Sequence halt processing. Which three activities would provide
this capability? (Choose three.)
A. Nested Condition Activity
B. Exception Handler
C. Sequencer Activity
D. Sendmail Activity
E. Job trigger
Answer: A, B, E
Which command can be used to execute DataStage jobs from a UNIX shell script?
A. dsjob
B. DSRunJob
C. osh
D. DSExecute
Answer: A
Which three are the critical stages that would be necessary to build a Job Sequence that: picks up data from a file
that will arrive in a directory overnight, launches a job once the file has arrived, sends an email to the administrator
upon successful completion of the flow? (Choose three.)
A. Sequencer
B. Notification Activity
C. Wait For File Activity
D. Job Activity
E. Terminator Activity
Answer: B, C, D
Which two statements describe functionality that is available using the dsjob command? (Choose two.)
A. dsjob can be used to get a report containing job, stage, and link information.
B. dsjob can be used to add a log entry for a specified job.
C. dsjob can be used to compile a job.
D. dsjob can be used to export job executables.
Answer: A, B
Other Topics
Environment Variables
APT_BUFFER_FREE_RUN
This environment variable is available in the DataStage Administrator, under the Parallel category. It specifies how
much of the available in-memory buffer to consume before the buffer resists. This is expressed as a decimal
representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in
the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries
to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory
buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated
multiple of Maximum memory buffer size before writing to disk.
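For example, with the default Maximum memory buffer size of 3145728 bytes (3 MB) and
APT_BUFFER_FREE_RUN=0.5, a buffer accepts roughly 1.5 MB before it starts resisting; a value of 2.0 would let it
hold about 6 MB before spilling to disk.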
APT_BUFFER_MAXIMUM_MEMORY
Sets the default value of Maximum memory buffer size. The default value is 3145728 (3 MB). Specifies the
maximum amount of virtual memory, in bytes, used per buffer.
APT_BUFFER_MAXIMUM_TIMEOUT
DataStage buffering is self-tuning, which can theoretically lead to long delays between retries. This environment
variable specifies the maximum wait before a retry in seconds, and is by default set to 1.
APT_BUFFERING_POLICY
This environment variable is available in the DataStage Administrator, under the Parallel category. Controls the
buffering policy for all virtual data sets in all steps. The variable has the following settings:
AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock.
FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing
considerably.
NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.
APT_DECIMAL_INTERM_PRECISION
Specifies the default maximum precision value for any decimal intermediate variables required in calculations.
Default value is 38.
APT_DECIMAL_INTERM_SCALE
Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.
APT_CONFIG_FILE
Sets the path name of the configuration file. (You may want to include this as a job parameter, so that you can
specify the configuration file at job run time.)
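For example, in a UNIX shell you might point a run at a specific configuration file before starting the job (the path
and names are hypothetical):
export APT_CONFIG_FILE=/opt/dstage/configs/4node.apt
dsjob -run dstage1 LoadWarehouse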
APT_DISABLE_COMBINATION
Globally disables operator combining. Operator combining is DataStage's default behavior, in which two or more
(in fact any number of) operators within a step are combined into one process where possible. You may need to
disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and
hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run
times.
APT_EXECUTION_MODE
By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values
to run an application in sequential execution mode:
ONE_PROCESS one-process mode
MANY_PROCESS many-process mode
NO_SERIALIZE many-process mode, without serialization
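For example, to debug a flow as a single sequential process you might set, in a UNIX shell (a debugging sketch, not
a production setting):
export APT_EXECUTION_MODE=ONE_PROCESS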
APT_ORCHHOME
Must be set by all DataStage Enterprise Edition users to point to the top-level directory of the DataStage Enterprise
Edition installation.
APT_STARTUP_SCRIPT
As part of running an application, DataStage creates a remote shell on all DataStage processing nodes on which the
job runs. By default, the remote shell is given the same environment as the shell from which DataStage is invoked.
However, you can write an optional startup shell script to modify the shell configuration of one or more processing
nodes. If a startup script exists, DataStage runs it on remote shells before running your application.
APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, DataStage searches ./startup.apt,
$APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order.
APT_NO_STARTUP_SCRIPT disables running the startup script.
APT_NO_STARTUP_SCRIPT
Prevents DataStage from executing a startup script. By default, this variable is not set, and DataStage runs the
startup script. If this variable is set, DataStage ignores the startup script. This may be useful when debugging a
startup script. See also APT_STARTUP_SCRIPT.
APT_STARTUP_STATUS
Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a
diagnostic if parallel job startup is failing.
APT_MONITOR_SIZE
This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the
minimum number of records the DataStage Job Monitor reports. The default is 5000 records.
APT_MONITOR_TIME
This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the
minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This
variable takes precedence over APT_MONITOR_SIZE.
APT_NO_JOBMON
Turns off job monitoring entirely.
APT_PM_NO_SHARED_MEMORY
By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared
memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and
APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.
APT_PM_NO_NAMED_PIPES
Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of DataStage,
including subprocs and setting up of the shared memory transport protocol in the process manager.
APT_RECORD_COUNTS
Causes DataStage to print, for each operator player, the number of records consumed by getRecord() and produced
by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this
information.
APT_NO_PART_INSERTION
DataStage automatically inserts partition components in your application to optimize the performance of the stages
in your job. Set this variable to prevent this automatic insertion.
APT_NO_SORT_INSERTION
DataStage automatically inserts sort components in your job to optimize the performance of the operators in your
data flow. Set this variable to prevent this automatic insertion.
APT_SORT_INSERTION_CHECK_ONLY
When sorts are inserted automatically by DataStage, if this is set, the sorts will just check that the order is correct;
they won't actually sort. This is a better alternative to shutting partition and sort insertion off altogether using
APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.
APT_DUMP_SCORE
Configures DataStage to print a report showing the operators, processes, and data sets in a running job.
APT_PM_PLAYER_MEMORY
Setting this variable causes each player process to report the process heap memory allocation in the job log when
returning.
APT_PM_PLAYER_TIMING
Setting this variable causes each player process to report its call and return in the job log. The message with the
return is annotated with CPU times for the player process.
OSH_DUMP
If set, it causes DataStage to put a verbose description of a job in the job log before attempting to execute it.
OSH_ECHO
If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.
OSH_EXPLAIN
If set, it causes DataStage to place a terse description of the job in the job log before attempting to run it.
OSH_PRINT_SCHEMAS
If set, it causes DataStage to print the record schema of all data sets and the interface schema of all operators in the
job log.
APT_STRING_PADCHAR
Overrides the pad character of 0x0 (ASCII null), used by default when DataStage extends, or pads, a string field to a
fixed length.
XML Stages
Xml Importer
The XML Meta Data Importer window has the following panes:
• Tree View, which depicts the hierarchical structure in the XML source. This pane is the main view. It is always
present and cannot be hidden or docked.
• Source, which contains the original XML schema or XML document, in read-only mode. To compare the tree
view with the XML source, you can dock this pane next to the tree view.
• Node Properties, which describes XML and XPath information of the selected element.
• Table Definition, which maps elements that you select in the Tree View.
• Parser Output, which presents XML syntax and semantic errors.
XML Meta Data Importer reports any syntax and semantic errors when you open a source file; for example, the
Parser Output pane will indicate a missing quote and the line on which it occurs.
To highlight the error in the Source pane, double-click the error in the Parser Output pane. After correcting the error
outside of the XML Meta Data Importer, you can load the revised source file. To reload the file, choose
File > Refresh.
You can process an XML schema file (.xsd) or an XML document (.xml). The file can be located on your file
system or accessed with a URL.
Processing XML Documents
The XML Meta Data Importer retains namespaces and considers every node in an XML hierarchy to be fully
qualified with a namespace prefix. The form is: prefix:nodename. This approach applies to documents in which
the prefixes are included or unspecified.
When prefixes are unspecified, XML Meta Data Importer generates prefixes using the pattern ns#, where # is a
sequence number.
Example
The following input does not include a namespace prefix.
Input
<Person xmlns="mynamespace">
<firstName>John</firstName>
</Person>
Output
<ns1:Person xmlns:ns1="mynamespace">
<ns1:firstName>John</ns1:firstName>
</ns1:Person>
Processing XML Schemas
The XML Meta Data Importer processes namespaces in XML schemas according to three rules:
• General
• Import By Reference
• Target Namespace Unspecified
General Rule
In general, the XML Meta Data Importer assigns the prefix defns to the target namespace.
For example:
<xsd:schema targetNamespace="mynamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The firstName node generates the following XPath expression:
/defns:Person/defns:firstName
where defns=mynamespace
Import By Reference Rule
If the schema imports by reference other schemas with different target namespaces, the XML Meta Data Importer
assigns a prefix in the form ns# to each of them. To enable this processing, the dependent schema must specify
elementFormDefault="qualified". If this is omitted, the elements are considered as belonging to the
caller's target namespace.
Example
The following example imports by reference the schema mysecondschema.
<xsd:schema targetNamespace="demonamespace"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:other="othernamespace">
<xsd:import namespace="othernamespace" schemaLocation="mysecondschema.xsd"/>
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="address" type="other:Address" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The schema mysecondschema contains the following statements:
<xsd:schema targetNamespace="othernamespace"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"
attributeFormDefault="unqualified">
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="street" minOccurs="1" maxOccurs="1"/>
<xsd:element name="city" minOccurs="1" maxOccurs="1"/>
<xsd:element name="state" minOccurs="1" maxOccurs="1"/>
<xsd:element name="zip" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
The street node generates the following XPath expression:
/defns:Person/defns:address/ns2:street
where defns=demonamespace and ns2=othernamespace
The Target Namespace Unspecified Rule
When the target namespace is unspecified, XML Meta Data Importer omits the prefix defns from XPath
expressions.
For example:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The firstName tree node generates the following XPath expression:
/Person/firstName
Mapping Nodes from an XML Schema
You can individually choose elements and attributes, or select all leaf nodes except empty ones in one step.
Choosing Individual Items
Select the box that is next to the item that you want to map; a source will typically contain both elements and text
(TEXT) nodes.
If you select an element box, you get all the sub-nodes and the actual content of the element. Your selection is
reflected in the Table Definition pane.
An asterisk appears after the title Table Definition when you modify the table definition. It disappears when you
save the information.
Selecting All Nodes
You can simplify selecting all leaf nodes by using the Auto-check command. This command checks leaf nodes.
XML Meta Data Importer ignores leaf nodes in the following circumstances:
• Nodes are empty.
• In a branch in which a node represents a reference to an element or a type defined elsewhere, such as an included
schema. To avoid recursive looping, which may be deep in the sub-schema, the node is not expanded. You may
manually expand the reference branch down to a specific level, and run the Auto-check command on the top branch
node. This action selects all nodes in the branch.
• Node represents a detected recursion. This happens with a schema that has the following form:
parent = person > children > child = person
You may manually expand the recursive branch and run the Auto-check command to select all nodes in the branch.
To run Auto-check:
Choose Edit > Auto-check. The nodes appear in the Table Definition pane.
The default table definition name depends on the XML source name:
Source file      Default
UNC-name         Original file name without extension
URL              The value New
XML document     Original XML document filename
XML schema       Original XML schema filename
Xml Input Stage
The XML Input stage is used to transform hierarchical XML data to flat relational tables. The XML Input stage
supports a single input link and one or more output links.
XML Input performs two XML validations when the server job runs:
• Checks for well-formed XML.
• Optionally checks that elements and attributes conform to any XML schema that is referenced in the document.
You control this option.
The XML parser reports three types of conditions: fatal, error, and warning.
• Fatal errors are thrown when the XML is not well-formed.
• Non-fatal errors are thrown when the XML violates a validity constraint. For example, the root element in the
document is not found in the validating XML schema.
• Warnings may be thrown when the schema has duplicate definitions.
XML Input supports one Reject link, which can store rejection messages and rejected rows.
Writing Rejection Messages to the Link
To write rejection messages to a Reject link:
1. Add a column on the Reject link.
2. Using the General page of the Output Link properties, identify the column as the target for rejection messages.
Writing Rejected Rows to the Link
To write rejected rows to a Reject link:
Add a column on the Reject link that has the same name as the column on the input link that contains or references
the XML document. This is a pass-through operation. Column names for this operation are case-sensitive.
Pass-through is available for any input column.
Controlling Output Rows
To populate the columns of an output row, XML Input uses XPath expressions that are specified on the output link.
XPath expressions locate elements, attributes, and text nodes.
Controlling the Number of Output Rows
You must designate one column on the output link as the repetition element. A repetition element consists of an
XPath expression. For each occurrence of the repetition element, XML Input always generates a row. By varying the
repetition element and using a related option, you can control the number of output rows.
Identifying the Repetition Element
To identify the repetition element, set the Key property to Yes on the output link.
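As an illustrative sketch using the Person document from the importer examples above: making /ns1:Person the
repetition element (Key = Yes) produces one output row per Person element, and an output column whose XPath is
/ns1:Person/ns1:firstName/text() would then carry each person's first name.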
Transformation Settings
These properties control the values that can be shared by multiple output links of the XML Input stage.
They fall into these categories:
• Requiring the repetition element
• Processing NULLs and empty values
• Processing namespaces
• Formatting extracted XML fragments
To use these values with a specific output link, select the Inherit Stage properties box on the Transformation
Settings tab of the output link.
Xml Output Stage
The XML Output stage is used to transform tabular data, such as relational tables and sequential files, to XML
hierarchical structures. The XML Output stage supports a single input link and zero or one output links.
XML Output requires XPath expressions to transform tabular data to XML. A table definition stores the XPath
expressions. Using the Description property on the Columns pages within the stage, you record or maintain the
XPath expressions.
Aggregating Input Rows on Output
You have several options for aggregating input rows on output.
• Aggregate all rows in a single output row. This is the default option.
• Generate one output row per input row. This is the Single row option.
• Trigger a new output row when the value of an input column changes.
• Trigger a new output row when the value of a pass-through column changes.
A pass-through column is an output column that has no XPath expression in the Description property and whose
name exactly matches the name of an input column.
Job Management and Deployment
Quick Find
1. Name to find
2. Types to find
3. Include descriptions (If checked, the text in short and long descriptions will be searched)
Advanced Find Filtering options
1. Type – Type of object (Job, Table Definition, etc.)
2. Creation – Date range
3. Last Modification – Date range
4. Where used
5. Dependencies of
6. Options – Case sensitivity and Search within last result set
Impact Analysis
Right-click over a stage or table definition:
1. Select "Find where table definitions used"
2. Select "Find where table definitions used (deep)" – Deep includes additional object types
Displays a list of objects using the table definition.
1. Select "Find dependencies"
2. Select "Find dependencies (deep)" – Deep includes additional object types
Displays a list of objects dependent on the one selected.
Graphical Functionality
1. Display the dependency path
2. Collapse selected objects
3. Move the graphical object
4. "Bird's-eye" view
Comparison
1. Cross project compare
2. Compare against
The two objects that can be compared are 1. Jobs and 2. Table Definitions.
Aggregator Stage
1. Grouping Keys
2. Aggregations
Aggregation Type - Count Rows, Calculation, Re-Calculation
Aggregation Type - Count Rows
Count Output Column - Name of the output column that holds the count of records in each group (based on the
grouping keys)
Aggregation Type - Calculation, Re-Calculation
Column for Calculation - Input column to be selected for calculation
Options
Allow Null Output - True means that NULL is a valid output value when calculating minimum value, maximum
value, mean value, standard deviation, standard error, sum, sum of weights, and variance. False means 0 is output
when all input values for the calculation column are NULL.
Method – Hash (Hash table) or Sort (Pre-Sort). The default method is Hash.
Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of
memory. Sort mode requires the input data set to have been partition-sorted with all of the grouping keys specified
as hashing and sorting keys.
Use the Hash method for inputs with a limited number of distinct groups:
1. Uses about 2K of memory per group
2. Calculations are made for all groups and stored in memory (hash table structure, hence the name)
3. Incoming data does not need to be pre-sorted
4. Results are output after all rows have been read
5. Useful when the number of unique groups is small
Use the Sort method with a large (or unknown) number of distinct key column values:
1. Requires inputs pre-sorted on key columns (does not perform the sort; expects the sort)
2. Results are output after each group
3. Can handle an unlimited number of groups
Sort Aggregator - one of the lightweight stages that minimize memory usage by requiring data in key column sort
order
Lightweight stages that minimize memory usage by requiring data in key column sort order:
1. Join
2. Merge
3. Sort Aggregator
Sort Stage
DataStage Designer provides two methods for parallel (group) sorting:
1. Sort stage - Parallel execution
2. Sort on a link when the partitioning is not Auto - identified by the Sort icon
Both methods use the same tsort operator.
Sorting on a link provides easier job maintenance (fewer stages on the job canvas) but fewer options.
The Sort stage offers more options than a link sort.
The Sort Utility should be DataStage, as it is faster than the UNIX sort.
A stable sort preserves the order of non-key columns within each sort group, but is slightly slower than a non-stable
sort. Stable sort is enabled by default on Sort stages but not on link sorts. If disabled, no prior ordering of records is
guaranteed to be preserved by the sorting operation.
Sort Key Modes
1. Don't Sort (Previously Sorted) means that input records are already sorted by this key. The Sort stage will then
sort on secondary keys, if any.
2. Don't Sort (Previously Grouped) means that input records are already grouped by that key, but not sorted.
3. Sort – Sort by this key.
Advantages of Don't Sort (Previously Sorted):
1. Uses significantly less memory/disk
2. Sort is now on previously-sorted key column groups, not the entire data set
3. Outputs rows after each group
DataStage provides two methods for generating a sequentially (totally) sorted result:
1. Sort stage - Sequential execution mode
2. Sort Merge Collector
In general, a parallel Sort plus a Sort Merge Collector will be faster than a sequential Sort.
By default the parallel framework will insert tsort operators as necessary to ensure correct results. But by setting
$APT_SORT_INSERTION_CHECK_ONLY we can force the inserted tsort operators to verify that the data is sorted
instead of actually performing the sort operation.
By default each tsort operator (Sort stage, link sort and inserted sort) uses 20 MB per partition as an internal memory
buffer.
But the Sort stage provides the "Restrict Memory Usage" option:
1. Increasing this value can improve performance if the entire data set (or group) can fit into memory
2. Decreasing this value may hurt performance, but will use less memory
When the memory buffer is filled, sort uses temporary disk space in the following order:
1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
Removing Duplicates
Duplicates can be removed by the Sort stage – use the Unique option:
There is no choice on which duplicate to keep
A stable sort always retains the first row in the group
A non-stable sort is indeterminate
Or by the Remove Duplicates stage:
Can choose to retain first or last