
Parallel Architecture, DataStage v8 Configuration, Metadata

Parallel processing = executing your application on multiple CPUs

Parallel processing environments


The environment in which you run your parallel jobs is defined by your system's architecture and hardware resources. All parallel processing environments are categorized as one of:

1. SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors. The processors communicate via shared memory and have a single operating system.

2. Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system and communicate via a high-speed network.

Pipeline Parallelism
1. Extract, Transform and Load processes execute simultaneously
2. The downstream process starts while the upstream process is running, like a conveyor belt moving rows from process to process
3. Advantages: reduces disk usage for staging areas and keeps processors busy
4. Still has limits on scalability

Partition Parallelism
1. Divide the incoming stream of data into subsets, known as partitions, to be processed separately
2. Each partition is processed in the same way
3. Facilitates near-linear scalability. However, the data needs to be evenly distributed across the partitions; otherwise the benefits of partitioning are reduced

Within parallel jobs, pipelining, partitioning and repartitioning are automatic. The job developer only identifies:
1. Sequential or Parallel mode (by stage)
2. Partitioning method
3. Collection method
4. Configuration file

Configuration File
One of the great strengths of WebSphere DataStage Enterprise Edition is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design. WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs. The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools > Configurations; the Configurations dialog box appears. You specify which configuration will be used by setting the APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it on a project-wide level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog. Configuration files are text files containing string data. The general form of a configuration file is as follows:

  {
    node "n1" {
      fastname "s1"
      pool "" "n1" "s1" "app2" "sort"
      resource disk "/orch/n1/d1" {}
      resource disk "/orch/n1/d2" {"bigdata"}
      resource scratchdisk "/temp" {"sort"}
    }
  }

Node names
Each node you define is followed by its name enclosed in quotation marks, for example: node "orch0". For a single-CPU node or workstation, the node's name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a node's network name:
$ uname -n

On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.

Fastname
Syntax: fastname "name"
This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages use to open connections for high-volume data transfers. The attribute of this option is often the network name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as returned by the UNIX command uname -n.

Node pools and the default node pool
Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool. The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.

  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css" "pool4"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch" {}
  }

A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.

Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the stage runs only on the nodes that appear in both pools. Nodes or resources that name a pool declare their membership in that pool. We suggest that when you initially configure your system you place all nodes in pools that are named after the node's name and fast name. Additionally, include the default node pool in this pool, as in the following example:

  node "n1" {
    fastname "nfast"
    pools "" "n1" "nfast"
  }

By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You can constrain the processing nodes used by the parallel engine either by removing node descriptions from the configuration file or by constraining a job or stage to a particular node pool.

Disk and scratch disk pools and their defaults
When you define a processing node, you can specify the options resource disk and resource scratchdisk. They indicate the directories of file systems available to the node. You can also group disks and scratch disks in pools. Pools reserve storage for a particular use, such as holding very large data sets. Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name, one belonging to resource disk and the other to resource scratchdisk, define two separate pools. A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and by { } (the empty pool list). For example, the following code configures the disks for node1:

  node "n1" {
    resource disk "/orch/s0" {pools "" "pool1"}
    resource disk "/orch/s1" {pools "" "pool1"}
    resource disk "/orch/s2" { }  /* empty pool list */
    resource disk "/orch/s3" {pools "pool2"}
    resource scratchdisk "/scratch" {pools "" "scratch_pool1"}
  }

In this example:
1. The first two disks are assigned to the default pool.
2. The first two disks are assigned to pool1.
3. The third disk is also assigned to the default pool, indicated by { }.
4. The fourth disk is assigned to pool2 and is not assigned to the default pool.
5. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.

Buffer scratch disk pools
Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.

Here is an example configuration file that defines a buffer scratch disk pool:

  {
    node node1 {
      fastname "node1_css"
      pools "" "node1" "node1_css"
      resource disk "/orch/s0" {}
      resource scratchdisk "/scratch0" {pools "buffer"}
      resource scratchdisk "/scratch1" {}
    }
    node node2 {
      fastname "node2_css"
      pools "" "node2" "node2_css"
      resource disk "/orch/s0" {}
      resource scratchdisk "/scratch0" {pools "buffer"}
      resource scratchdisk "/scratch1" {}
    }
  }

In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0 and /scratch1 would be used because both would then be in the default pool.

Partitioning
The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as possible, ensuring an even load across your processors. When performing some operations, however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an Aggregator stage to summarize your data. To get the answers you want (and need) you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.

Round robin partitioner


The first record goes to the first processing node, the second to the second processing node, and so on. When WebSphere DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when WebSphere DataStage initially partitions data.

Random partitioner
Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. Random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.

Entire partitioner
Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.

Same partitioner
The stage using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method WebSphere DataStage uses when passing data between stages in your job.

Hash partitioner
Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node. This method is useful for ensuring that related records are in the same partition, which may be a prerequisite for a processing operation. For example, for a remove-duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found. Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.

Modulus partitioner
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation. In data mining, data is often arranged in buckets; that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field. The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions

where" fieldname is a numeric field of the input data set and number_of_partitions is the number of processing nodes on which the partitioner executes. :f a partitioner is executed on three processing nodes it has three partitions.

Range partitioner
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition. A range partitioner divides a data set into approximately equal-size partitions based on one or more partitioning keys. In order to use a range partitioner, you have to make a range map. You can do this using the Write Range Map stage. The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size, so all nodes perform an equal amount of work when processing the data set. Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and round robin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However,

these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.

DB2 partitioner
Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity.

Auto partitioner
The most common method you will see on the WebSphere DataStage stages is Auto. This simply means that you are leaving it to WebSphere DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done. Typically WebSphere DataStage would use round robin when initially partitioning data, and Same for the intermediate stages of a job.

Collecting
Collecting is the process of joining the multiple partitions of a single data set back together again into a single partition. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set. Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get data collected in the same order each time. If order matters, you need to use the sorted merge collection method.

Round robin collector


Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, it starts over. After reaching the final record in any partition, it skips that partition in the remaining rounds.

Ordered collector
Reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered. This may be useful as a preprocessing action before exporting a sorted data set to a single data file.

Sorted merge collector


Reads records in an order based on one or more columns of the record. The columns used to define record order are called collecting keys. Typically, you use the sorted merge collector with a partition-sorted data set (as created by a Sort stage). In this case, you specify as the collecting key fields those fields you specified as sorting key fields to the Sort stage. The data type of a collecting key can be any type except raw, subrec, tagged, or vector.

Auto collector

The most common method you will see on the parallel stages is Auto. This normally means that WebSphere DataStage will eagerly read any row from any input partition as it becomes available, but if it detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest collecting method.

Preserve partitioning flag


A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the preserve partitioning flag for its output link. Note, however, that the next stage may ignore this request. In most cases you are best leaving the preserve partitioning flag in its default state. The exception to this is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that it has happened when you run the job. If the Preserve Partitioning flag is cleared, this means that the current stage doesn't care what the next stage in the job does about partitioning. On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has a flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.

Parallel Job Score


At runtime, the job SCORE can be examined to identify:
1. Number of UNIX processes generated for a given job and APT_CONFIG_FILE
2. Operator combination
3. Partitioning methods between operators
4. Framework-inserted components, including sorts, partitioners, and buffer operators

Set APT_DUMP_SCORE=1 to output the score to the DataStage job log. For each job run, two separate score dumps are written to the log:
1. The first score is actually from the license operator
2. The second score entry is the actual job score

Job scores are divided into two sections:
1. Datasets - partitioning and collecting
2. Operators - node/operator mapping
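For example, to enable score dumps for a session before invoking a job from the command line (a minimal sketch; the variable can equally be set at the project level in Administrator or per job in Job Properties):

  $ export APT_DUMP_SCORE=1   # write the job score to the DataStage job log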

Example score dump


The following score dump shows a flow with a single data set, which has a hash partitioner, partitioning on key "a". It shows three operators: generator, tsort, and peek. Tsort and peek are "combined", indicating that they have been optimized into the same process. All the operators in this flow are running on one node.
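The dump itself did not survive in these notes; the following is a representative reconstruction of what such a score looks like (the node name is illustrative):

  main_program: This step has 1 dataset:
  ds0: {op0[1p] (sequential generator)
        eOther(APT_HashPartitioner { key={ value=a }})->eCollectAny
        op1[1p] (parallel APT_CombinedOperatorController:tsort)}
  It has 2 operators:
  op0[1p] {(sequential generator)
      on nodes (
        node1[op0,p0]
      )}
  op1[1p] {(parallel APT_CombinedOperatorController:
        (tsort)
        (peek)
      ) on nodes (
        node1[op1,p0]
      )}
  It runs 2 processes on 1 node.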

The DataStage parallel framework implements a producer-consumer data flow model. Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets). The partitioning method is associated with the producer; the collector method is associated with the consumer. "eCollectAny" is specified for parallel consumers, although no collection occurs! The producer and consumer are separated by the following indicators:
  ->  Sequential to Sequential
  <>  Sequential to Parallel
  =>  Parallel to Parallel (SAME)
  #>  Parallel to Parallel (not SAME)
  >>  Parallel to Sequential
  >   No producer or no consumer
The score may also include [pp] notation when the Preserve Partitioning flag is set.

At runtime, the DataStage parallel framework can only combine stages (operators) that:
1. Use the same partitioning method. Repartitioning prevents operator combination between the corresponding producer and consumer stages. Implicit repartitioning (e.g. sequential operators, node maps) also prevents combination.
2. Are Combinable. This is set automatically within the stage/operator definition, or set within DataStage Designer: Advanced stage properties.

The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
1. LUTCreateImpl - reads the reference data into memory
2. LUTProcessImpl - performs the actual lookup processing once the reference data has been loaded
At runtime, each internal component is assigned to operators independently.

Job Compilation
1. Operators. These underlie the stages in a WebSphere DataStage job.
A single stage may correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, WebSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
2. OSH. This is the scripting language used internally by the WebSphere DataStage parallel engine.

3. Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

The DataStage Designer client generates all code. It validates link requirements, mandatory stage options, transformer logic, etc., then:
1. Generates the OSH representation of the job data flow and stages. GUI "stages" are representations of framework "operators". Stages in parallel shared containers are statically inserted in the job flow. Each server shared container becomes a dsjobsh operator.
2. Generates transform code for each parallel Transformer. This is compiled on the DataStage server into C++ and then into corresponding native operators. To improve compilation times, previously compiled Transformers that have not been modified are not recompiled; Force Compile recompiles all Transformers (use after client upgrades).
3. Buildop stages must be compiled manually, within the GUI or using the buildop UNIX command line.

Viewing of generated OSH is enabled in DataStage Administrator. OSH is visible in:
1. Job Properties
2. Job run log
3. View Data
4. Table Definitions

Generated OSH Primer


Designer inserts comment blocks to assist in understanding the generated OSH. Note that operator order within the generated OSH is the order a stage was added to the job canvas. OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition:
1. operator name
2. operator options (use "-name value" format)
   - Schema (for generator, import, export)
   - Inputs
   - Outputs
The following data sources are supported as input/output:
- Virtual data sets (name.v)
- Persistent data sets (name.ds or [ds] name)
- File sets (name.fs or [fs] name)
- External files (name or [file] name)
Every operator has inputs and outputs numbered sequentially starting from 0. For example: op1 0> dst and op1 1< src.
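As a rough illustration (the stage, link and column names here are hypothetical, and real generated OSH carries more options), the OSH for a generator feeding a peek might look like:

  #### STAGE: Row_Generator_0
  ## Operator
  generator
  ## Operator options
  -schema record ( a:int32; )
  -records 10
  ## Outputs
  0> [] 'Row_Generator_0:lnk_gen.v'
  ;

  #### STAGE: Peek_1
  ## Operator
  peek
  ## Inputs
  0< 'Row_Generator_0:lnk_gen.v'
  ;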

Terminology

  Framework                   DataStage
  schema                      table definition
  property                    format
  type                        SQL type + length [and scale]
  virtual dataset             link
  record/field                row/column
  operator                    stage
  step, flow, OSH command     job
  Framework                   DS engine

The GUI uses both terminologies. Log messages (info, warnings, errors) use Framework terms.

Example Stage / Operator Mapping


Within Designer, stages represent operators, but there is not always a 1:1 correspondence.
- Sequential File: Source: import; Target: export
- DataSet: copy
- Sort (DataStage): tsort
- Aggregator: group
- Row Generator, Column Generator, Surrogate Key Generator: generator
- Oracle: Source: oraread; Sparse Lookup: oralookup; Target Load: orawrite; Target Upsert: oraupsert
- Lookup File Set: Target: lookup -createOnly

Runtime Architecture
Generated OSH and the configuration file are used to "compose" a job SCORE, similar to the way an RDBMS builds a query optimization plan:
1. Identifies degree of parallelism and node assignment for each operator
2. Inserts sorts and partitioners as needed to ensure correct results
3. Defines connection topology (datasets) between adjacent operators
4. Inserts buffer operators to prevent deadlocks (e.g. fork-joins)
5. Defines the number of actual UNIX processes. Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements.
6. The job SCORE is used to fork UNIX processes with communication interconnects for data, messages, and control. Set APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log.

It is only after these steps that processing begins. This is the "startup overhead" of an Enterprise Edition job. Job processing ends when:
- the last row (end of data) is processed by the final operator in the flow, (or)
- a fatal error is encountered by any operator, (or)
- the job is halted (SIGINT) by DataStage Job Control or human intervention (e.g. DataStage Director STOP).

Job Execution: The Orchestra

Conductor - the initial framework process
- Score composer
- Creates Section Leader processes (one per node)
- Consolidates messages to the DataStage log
- Manages orderly shutdown
Section Leader (one per node)
- Forks Player processes (one per stage)
- Manages up/down communication
Players
- The actual processes associated with stages
- Combined players: one process only
- Send stderr and stdout to the Section Leader
- Establish connections to other players for data flow
- Clean up upon completion
Default communication:
- SMP: Shared Memory
- MPP: Shared Memory (within a hardware node) and TCP (across hardware nodes)
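Putting this together, the runtime process tree on a two-node configuration looks roughly like the sketch below (the stage names are illustrative):

  Conductor                      (one per job, on the conductor node)
  ├── Section Leader, node 1
  │   ├── Player: import
  │   ├── Player: transform
  │   └── Player: export
  └── Section Leader, node 2
      ├── Player: import
      ├── Player: transform
      └── Player: export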

Introduction
What is IBM WebSphere DataStage?
1. Design jobs for ETL
2. Ideal tool for data integration projects
3. Import, export, create and manage metadata for use within jobs
4. Schedule, run and monitor jobs all within DataStage
5. Administer your DataStage development and execution environments
6. Create batch (controlling) jobs

What are the components/applications in the IBM Information Server Suite?
1. DataStage
2. QualityStage
3. Metadata Server, consisting of Metadata Access Services and Metadata Analysis Services
4. Repository, which is DB2 by default
5. Business Glossary
6. Federation Server
7. Information Services Director
8. Information Analyzer
9. Information Server console

Explain the DataStage architecture.
The DataStage client components are:
Administrator - administers DataStage projects and conducts housekeeping on the server
Designer - creates DataStage jobs that are compiled into executable programs
Director - used to run and monitor DataStage jobs
The Repository is used to store DataStage objects. The Repository, which is DB2 by default, is shared by other applications in the Suite.

What are the uses of DataStage Administrator?
The Administrator is used to add and delete projects, and to set project properties. The Administrator also provides a command line interface to the DataStage repository.
Use the Administrator Project Properties window to:
1. Enable job administration in Director, enable runtime column propagation, set auto-purging options, protect the project and set environment variables on the General tab
2. Set user and group privileges on the Permissions tab
3. Enable or disable server-side tracing on the Tracing tab
4. Specify a username and password for scheduling jobs on the Schedule tab
5. Specify parallel job defaults on the Parallel tab
6. Specify job sequencer defaults on the Sequencer tab

Explain the DataStage development workflow.
1. Define project properties - Administrator
2. Open (attach to) your project
3. Import metadata that defines the format of data stores your jobs will read from or write to
4. Design the job - Designer
5. Compile and debug the job - Designer

6. Run and monitor the job - Director

What is the DataStage project repository?
All your work is stored in a DataStage project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata on your server. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them. Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from editing the same DataStage object (job, table definition, etc.) at the same time.

What are the different types of DataStage jobs?
Parallel jobs:
1. Executed by the DataStage parallel engine
2. Built-in functionality for pipeline and partition parallelism
3. Compiled into OSH (Orchestrate Scripting Language)
4. OSH executes operators (execute C++ class instances)
Server jobs:
1. Executed by the DataStage server engine
2. Compiled into BASIC
Job sequencers:
1. Master Server jobs that kick off jobs and other activities
2. Can kick off Server or Parallel jobs
3. Executed by the DataStage server engine

What are the design elements of parallel jobs?
Stages - implemented as OSH operators:
- Passive stages (E and L of ETL) read/write data, e.g. Sequential File, DB2, Oracle, Peek stages
- Active stages (T of ETL) transform/filter/aggregate/generate/split/merge data, e.g. Transformer, Aggregator, Join, Sort stages
Links - pipes through which the data moves from stage to stage

What are the different types of parallelism?
Pipeline parallelism:
1. Transform, clean and load processes execute simultaneously
2. The downstream process starts while the upstream process is running
3. Reduces disk usage for staging areas
4. Keeps processors busy
5. Still has limits on scalability
Partition parallelism:
1. Divide the incoming stream of data into subsets (partitions) to be processed by the same operator
2. The operation is performed on each partition of data separately and in parallel
3. Facilitates near-linear scalability provided the data is evenly distributed
4. If the data is evenly distributed, the data will be processed n times faster on n nodes.

Installation and Deployment


What gets deployed as part of an Information Server domain?
1. Metadata Server, hosted by an IBM WebSphere Application Server instance
2. One or more DataStage servers
3. One DB2 UDB instance containing the repository database
Additional server applications:
1. Business Glossary
2. Federation Server
3. Information Analyzer
4. Information Services Director
5. Rational Data Architect

What are the Information Server clients?
1. Administration console
2. Reporting console
3. DataStage clients - Administrator, Designer, Director

What are the different types of Information Server deployment?
1. Everything on one machine - all the applications in the domain are deployed on one machine
2. The domain is split between two machines - the DataStage server on one machine; the Metadata Server and DB2 repository on the other
3. The domain is split between three machines - the DataStage server, Metadata Server and DB2 repository on three different machines
Additional DataStage servers can be part of this domain, but they would have to be separate from one another. There is a possibility of additional DataStage player-node machines connected to the DataStage server machine using a high-speed network.

Which components should be running if the Application Server (hosting the Metadata Server) and the DataStage server are running on different machines?
1. The Application Server
2. The ASB agent

Administering DataStage

Explain user and group management.
Suite authorization can be provided to users or groups. Users that are members of a group acquire the authorizations of the group. Authorizations are provided in the form of roles:
1. Suite roles:
a. Administrator - performs user and group management tasks. Includes all the privileges of the Suite User role.
b. User - creates views of scheduled tasks and logged messages. Creates and runs reports.

2. Suite component roles:
a. DataStage Administrator - full permission to work in DataStage Administrator, Designer and Director
b. DataStage User - permissions are assigned within DataStage: Developer, Operator, Super Operator and Production Manager. A DataStage user cannot delete projects and cannot set permissions.

A user ID that is assigned Suite roles can immediately log onto the Information Server console. What about a user ID that is assigned a DataStage Suite component role? If the user ID is assigned the DataStage Administrator role, then the user will immediately acquire DataStage Administrator permission for all projects. If the user ID is assigned the DataStage User role, one more step is required: a DataStage administrator must assign a corresponding role to that user ID on the Permissions tab. When Suite users or groups have been assigned the DataStage Administrator role they automatically appear on the Permissions tab. Suite users or groups that have a DataStage User role need to be manually added.

Explain DataStage credential mapping.
All the Suite users without their own DataStage credentials will be mapped to this user ID and password. Here the username and password are demohawk/demohawk. demohawk is assumed to be a valid user on the DataStage server machine and has file permissions on the DataStage engine and project directories. Suite users can also be mapped individually to specific users. Note that demohawk need not be a Suite administrator or user.

What information is required to log in to DataStage Administrator?
Domain - the host name and port number of the application server. Recall that multiple DataStage servers can exist in a domain, although they must be on different machines.
DataStage server - the server that has the DataStage projects you want to administer.

Explain the DataStage roles.
1. DataStage Developer - full access to all areas of a DataStage project
2. DataStage Operator - runs and manages released DataStage jobs
3. DataStage Super Operator - can open the Designer and view the repository in read-only mode
4. DataStage Production Manager - creates and manipulates protected projects

DataStage Designer

Explain import and export and their corresponding procedures.
1. Backing up jobs and projects
2. Maintaining different versions of a job or project
3. Moving DataStage objects from one project to another
4. Sharing jobs and projects between developers
Export -> DataStage components

By default, objects are exported to a text file in a specific format. By default, the extension is .dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server. Objects can also be exported from the list of found objects using the search functionality.

Import -> DataStage components
Use Import all to begin the import process. Use Import selected to import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning. For large imports you may want to disable "Perform impact analysis"; this adds overhead to the import process.

Import -> Table Definitions
A table definition describes the columns and format of files and tables. Table definitions for the following can be imported:
1. Sequential files
2. Relational tables
3. Cobol files
4. XML
5. ODBC data sources, etc.
Table definitions can be loaded into job stages that access data with the same format. In this sense the metadata is reusable.

Creating Parallel Jobs

What is a parallel job?
A parallel job is an executable DataStage program created in DataStage Designer using components from the repository. It compiles into Orchestrate Script Language (OSH) and object code (from generated C++).
DataStage jobs are:
1. Designed and built in Designer
2. Scheduled, invoked and monitored in Director
3. Executed under the control of DataStage
Use the import process in Designer to import metadata defining sources and targets.

What are the benefits of renaming links and stages?
1. Documentation
2. Clarity
3. Fewer development errors

Explain the Row Generator stage.
1. Produces mock data
2. No input link; single output link
3. On the Properties tab, specify the number of rows
4. On the Columns tab, load or specify column definitions

You have a cluster of nodes available to run DataStage jobs. The network configuration between the servers is a private network with a 1 GB connection between each node. The public name is on a 100 MB network, which is what each hostname is identified with. In order to use the private network for communications between each node you need to use an alias for each node in the cluster. The Information Server engine node (conductor node) is where the DataStage job starts. Which environment variable must be used to identify the hostname for the engine node?
A. $APT_SERVER_ENGINE
B. $APT_ENGINE_NODE
C. $APT_PM_CONDUCTOR_HOSTNAME
D. $APT_PM_NETWORK_NAME
Answer: C

Which three privileges must the user possess when running a parallel job? (Choose three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX/etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources
Answer: A, B, E

Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.
Answer: B, C

Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for auto-purge of job log entries
Answer: C, D, E

Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime
Answer: A, C

You are reading customer data using a Sequential File stage and transforming it using the Transformer stage. The Transformer is used to cleanse the data by trimming spaces from character fields in the input. The cleansed data is to be written to a target DB2 table. Which partitioning method would yield optimal performance without violating the business requirements?
A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire

nswer" G job contains a #ort stage that sorts a large volume of data across a cluster of servers. The customer has re8uested that this sorting be done on a subset of servers identified in the configuration file to minimi!e impact on database nodes. 6hich two steps will accomplish thisZ %Choose two.& . Create a sort scratch dis( pool with a subset of nodes in the parallel configuration file. G. #et the execution mode of the #ort stage to se8uential. C. #pecify the appropriate node constraint within the #ort stage. 2. 2efine a non)default node pool with a subset of nodes in the parallel configuration file. nswer" C'2 <ou have a compiled job and parallel configuration file. 6hich three methods can be used to determine the number of nodes actually used to run the job in parallelZ %Choose three.& . within 2ata#tage 2esigner' generate report and retain intermediate F$, G. within 2ata#tage 2esigner' show performance statistics C. within 2ata#tage 2irector' examine log entry for parallel configuration file 2. within 2ata#tage 2irector' examine log entry for parallel job score +. within 2ata#tage 2irector' open a new 2ata#tage 7ob $onitor nswer" C'2'+ 6hich environment variable' when set to true' causes a report to be produced which shows the operators' processes and data sets in the jobZ . PT>2U$P>#C9/+ G. PT>79G>/+P9/T C. PT>$9?:T9/>#:^+ 2. PT>/+C9/2>C9U?T# nswer" job reads from a dataset using a 2ata#et stage. This data goes to a Transformer stage and then is written to a se8uential file using a #e8uential 3ile stage. The default configuration file has . nodes. The job creating the dataset and the current job both use the default configuration file. 4ow many instances of the Transformer run in parallelZ .. G. * C. [ 2. ] nswer" <our job reads from a file using a #e8uential 3ile stage running se8uentially. The 2ata#tage server is running on a single #$P system. 9ne of the columns contains a product :2. :n a ,oo(up stage following the #e8uential 3ile stage' you decide to loo( up the product description from a reference table. 6hich two partition settings would correctly find matching product descriptionsZ %Choose two.& . 4ash algorithm' specifying the product :2 field as the (ey' on both the lin( coming from the #e8uential 3ile stage and the lin( coming from the reference table. G. /ound /obin on both the lin( coming from the #e8uential 3ile stage and the lin( coming from the reference table. C. /ound /obin on the lin( coming from the #e8uential 3ile stage and +ntire on the lin( coming from the reference table. 2. +ntire on the lin( coming from the #e8uential 3ile stage and 4ash' specifying the product :2 field as the (ey' on the lin( coming from the reference table.

nswer" 'C

A job design consists of an input fileset followed by a Peek stage, followed by a Filter stage, followed by an output fileset. The environment variable APT_DISABLE_COMBINATION is set to true, and the job executes on an SMP using a configuration file with 8 nodes defined. Assume also that the input dataset was created with the same 8-node configuration file. Approximately how many data processing processes will this job create?
A. 32
B. 8
C. 16
D. 1
Answer: A

Which two statements are true of the column data types used in Orchestrate schemas? (Choose two.)
A. Orchestrate schema column data types are the same as those used in DataStage stages.
B. Examples of Orchestrate schema column data types are varchar and integer.
C. Examples of Orchestrate schema column data types are int32 and string[max=30].
D. OSH import operators are needed to convert data read from sequential files into schema types.
Answer: C, D

You have set the "Preserve Partitioning" flag for a Sort stage to request that the next stage preserves whatever partitioning it has implemented. Which statement describes what will happen next?
A. The job will compile but will abort when run.
B. The job will not compile.
C. The next stage can ignore this request but a warning is logged when the job is run depending on the stage type that ignores the flag.
D. The next stage disables the partition options that are normally available in the Partitioning tab.
Answer: C

What is the purpose of the uv command in a UNIX DataStage server?
A. Clean up resources from a failed DataStage job.
B. Start and stop the DataStage engine.
C. Provide read access to a DataStage EE configuration file.
D. Report DataStage client connections.
Answer: B

Which two statements regarding the usage of data types in the parallel engine are correct? (Choose two.)
A. The best way to import RDBMS data types is using the ODBC importer.
B. The parallel engine will use its interpretation of the Oracle metadata (e.g., exact data types) based on interrogation of Oracle, overriding what you may have specified in the Columns tabs.
C. The best way to import RDBMS data types is using Import Orchestrate Schema Definitions, using orchdbutil.
D. The parallel engine and server engine have exactly the same data types so there is no conversion cost overhead from moving data between the engines.
Answer: B, C

Which two describe a DataStage EE installation in a clustered environment? (Choose two.)
A. The C++ compiler must be installed on all cluster nodes.
B. Transform operators must be copied to all nodes of the cluster.
C. The DataStage parallel engine must be installed or accessible in the same directory on all machines in the cluster.
D. A remote shell must be configured to support communication between the conductor and section leader nodes.
Answer: C, D

Which partitioning method would yield the most even distribution of data without duplication?
A. Entire
B. Round Robin
C. Hash
D. Random
Answer: B

Which three accurately describe the differences between a DataStage server root installation and a non-root installation? (Choose three.)
A. A non-root installation enables auto-start on reboot.
B. A root installation must specify the user "dsadm" as the DataStage administrative user.
C. A non-root installation inherits the permissions of the user who starts the DataStage services.
D. A root installation will start DataStage services in impersonation mode.
E. A root installation enables auto-start on reboot.
Answer: C, D, E

Your job reads from a file using a Sequential File stage running sequentially. You are using a Transformer following the Sequential File stage to format the data in some of the columns. Which partitioning algorithm would yield optimized performance?
A. Hash
B. Random
C. Round Robin
D. Entire
Answer: C

Which three UNIX kernel parameters have minimum requirements for DataStage installations? (Choose three.)
A. MAXUPROC - maximum number of processes per user
B. NOFILES - number of open files
C. MAXPERM - disk cache threshold
D. NOPROC - no process limit
E. SHMMAX - maximum shared memory segment size
Answer: A, B, E

Which partitioning method requires specifying a key?
A. Random
B. DB2
C. Entire
D. Modulus
Answer: D

When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator
Answer: A

Which statement is true when Runtime Column Propagation (RCP) is enabled?
A. DataStage Manager does not import meta data.
B. DataStage Director does not supply row counts in the job log.
C. DataStage Designer does not enforce mapping rules.
D. DataStage Administrator does not allow default settings for environment variables.
Answer: C

Persistent Storage
Sequential file stage
The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link. The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the stage to behave differently:

1. You can specify that a single file can be read by multiple nodes. This can improve performance on cluster systems.

2. You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node). (These two options are mutually exclusive.)

File: This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property.

File pattern: Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing the file. The file could also be any valid shell expression, in Bourne shell syntax, that generates a list of file names.

Read method: This property specifies whether you are reading from a specific file or files, or using a file pattern to select files (e.g., *.txt).

Missing file mode: Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error unless the file has a node name prefix of *:, in which case it is OK. The default is Depends.

Keep file partitions: Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject mode: Allows you to specify behavior if a read record does not match the expected schema (the record does not match the metadata defined in the column definitions). Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report progress: Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

Number of readers per node: This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If the number of readers is greater than one, each instance of the file read operator reads a contiguous range of records from the input file.

This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.

Read from multiple nodes: This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. WebSphere DataStage knows the number of nodes available, and using the fixed-length record size and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions will be of roughly equal size. Note that sequential row order cannot be maintained when reading a file in parallel.

File update mode: This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime. By default this property is set to Overwrite.

Using RCP with Sequential stages: Runtime column propagation (RCP) allows WebSphere DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask WebSphere DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, and so WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are:
1. Sequential File
2. File Set
3. External Source
4. External Target
5. Column Import
6. Column Export
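A schema file is a plain-text Orchestrate schema definition. As a minimal sketch (the column names and types here are hypothetical), a schema file describing a three-column file might contain:

  record (
    cust_id: int32;
    cust_name: string[max=30];
    balance: nullable decimal[10,2];
  )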

Improving Sequential File Performance


If the source file is fixed width, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained. If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage. On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables

APT_IMPORT_BUFFER_SIZE and APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in kilobytes, with a default of 128 (128K). Increasing this may improve performance. Finally, in some disk array configurations, setting the environment variable APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations. APT_CONSISTENT_BUFFERIO_SIZE: some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting APT_CONSISTENT_BUFFERIO_SIZE=S will force stages to read data in chunks which are size S or a multiple of S.
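For example, the variables might be exported before a job run as follows (the values are illustrative and should be tuned to the specific hardware):

  $ export APT_IMPORT_BUFFER_SIZE=256             # read buffer in KB (default 128)
  $ export APT_EXPORT_BUFFER_SIZE=256             # write buffer in KB (default 128)
  $ export APT_CONSISTENT_BUFFERIO_SIZE=1048576   # match a 1 MB array read size, in bytes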

Partitioning Sequential File Reads


Care must be taken to choose the appropriate partitioning method for a Sequential File read:
- Don't read from a Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
- When multiple files are read by a single Sequential File stage (using multiple files, or by using a file pattern), each file's data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering


By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with increased I/O.
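For instance, a near-real-time job that must land every row as soon as it arrives might run with:

  $ export APT_EXPORT_FLUSH_COUNT=1   # flush the write buffer after each row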

Reading from and Writing to Fixed-Length Files


Particular attention must be paid when processing fixed-length fields using the Sequential File stage:
- If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
- If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these.

Data set stage


The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode. What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director. A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. The descriptor file for a data set contains the following information:

1. Data set header information.
2. Creation time and date of the data set.
3. The schema (metadata) of the data set.
4. A copy of the configuration file used when the data set was created.

Data Sets are the structured internal representation of data within the Parallel Framework. They consist of:
- Framework Schema (format=name, type, nullability)
- Data Records (data)
- Partition (subset of rows for each node)
Virtual Data Sets exist in memory and correspond to DataStage Designer links. Persistent Data Sets are stored on disk:
- Descriptor file (metadata, configuration file, data file locations, flags)
- Multiple Data Files (one per node, stored in disk resource file systems), e.g. node1:/local/disk1/… node2:/local/disk2/…
There is no "DataSet" operator; the Designer GUI inserts a copy operator.
When to Use Persistent Data Sets: when writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints), because they are:
- Stored in native internal format (no conversion overhead)
- Able to retain data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O
Why Data Sets are not intended for long-term or archive storage:
- The internal format is subject to change with new DataStage releases
- They require access to named resources (node names, file system paths, etc.)
- The binary format is platform-specific
For fail-over scenarios, servers should be able to cross-mount filesystems. You can read a dataset as long as your current $APT_CONFIG_FILE defines the same NODE names (fastnames may differ). orchadmin -x lets you recover data from a dataset if the node names are no longer available.
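For example, before moving or restoring a dataset you can check which node names its stored configuration expects, and force a read with your current configuration; both options are described in the Orchadmin Commands section below (the dataset path is hypothetical):

    $ orchadmin describe -c /proj/data/customers.ds    # print the config file stored in the dataset
    $ orchadmin dump -x /proj/data/customers.ds        # extract records using the current config file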

Data Set Management


1. Viewing the schema

Click the Schema icon on the toolbar to view the record schema of the current data set. This is presented in text form in the Record Schema window.

2. Viewing the data

Click the Data icon on the toolbar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view:
- Rows to display: specify the number of rows of data you want the data browser to display.
- Skip count: skip the specified number of rows before viewing data.
- Period: display every Pth record, where P is the period. You can start after records have been skipped by using the Skip property. P must be equal to or greater than 1.
- Partitions: choose between viewing the data in All partitions or the data in the partition selected from the drop-down list.
Click OK to view the selected data; the Data Viewer window appears.
3. Copying data sets

Click the Copy icon on the toolbar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored. The new data set will have the same record schema, number of partitions and contents as the original data set. Note: you cannot use the UNIX cp command to copy a data set, because WebSphere DataStage represents a single data set with multiple files.
4. Deleting data sets

Click the Delete icon on the toolbar to delete the current data set. You will be asked to confirm the deletion. Note: you cannot use the UNIX rm command to delete a data set, because WebSphere DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.

Orchadmin Commands
Orchadmin is a command-line utility provided by DataStage to examine and manage data sets. The general calling format is:

    $ orchadmin <command> [options] [descriptor file]

Before using orchadmin, you should make sure that either the working directory or $APT_ORCHHOME/etc contains the file "config.apt", OR that the environment variable $APT_CONFIG_FILE is defined for your session. (A brief example session appears at the end of this command list.) The commands available with orchadmin are:
1. CHECK: $ orchadmin check

Validates the configuration file contents: the accessibility of all nodes defined in the configuration file, the scratch disk definitions, and so on. Throws an error when the config file is not found or not defined properly.
2. COPY: $ orchadmin copy <source.ds> <destination.ds>

Makes a complete copy of the source dataset under a new destination descriptor file name. Please note that:

a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the data is not copied.
b. The new dataset will be arranged according to the config file that is currently in use, not according to the old config file that was in use with the source.
3. DELETE: $ orchadmin <delete | del | rm> [-f | -x] descriptorfiles…

The UNIX rm utility cannot be used to delete datasets. The orchadmin delete or rm command should be used to delete one or more persistent data sets. The -f option forces a delete: if some nodes are not accessible, -f deletes the dataset partitions from the accessible nodes and leaves the partitions on inaccessible nodes as orphans. The -x option forces the current config file to be used for the delete, rather than the one stored in the data set.
4. DESCRIBE: $ orchadmin describe [options] descriptorfile.ds

This is the single most important command. Without any option it lists the number of partitions, number of segments, valid segments, and the preserve-partitioning flag of the persistent dataset.
-c: Prints the configuration file that is written in the dataset, if any.
-p: Lists the partition-level information.
-f: Lists the file-level information in each partition.
-e: Lists the segment-level information.
-s: Lists the metadata schema of the dataset.
-v: Lists all segments, valid or otherwise.
-l: Long listing; equivalent to -f -p -s -v -e.
5. DUMP: $ orchadmin dump [options] descriptorfile.ds

The dump command is used to dump (extract) the records from the dataset. Without any options, the dump command lists all the records, starting from the first record of the first partition through the last record of the last partition.
-delim <string>: Uses the given string as the delimiter for fields instead of space.
-field <name>: Lists only the given field instead of all fields.
-name: Lists all the values preceded by the field name and a colon.
-n numrecs: Lists only the given number of records per partition.
-p period(N): Lists every Nth record from each partition, starting from the first record.
-skip N: Skips the first N records of each partition.
-x: Uses the current system configuration file rather than the one stored in the dataset.

6. TRUNCATE: $ orchadmin truncate [options] descriptorfile.ds

Without options, truncate deletes all the data (i.e. the segments) from the dataset.
-f: Forces the truncate; accessible segments are truncated and the inaccessible ones are left alone.
-x: Uses the current system config file rather than the one stored in the dataset.
-n N: Leaves the first N segments in each partition and truncates the remaining.
7. HELP: $ orchadmin -help OR $ orchadmin <command> -help

Prints a help manual about the usage of orchadmin or of the individual orchadmin commands.
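As a brief, hypothetical example session using only the commands and options listed above (the config file path and dataset name are illustrative):

    $ export APT_CONFIG_FILE=/opt/ds/configs/4node.apt
    $ orchadmin check                      # validate node and scratch disk definitions
    $ orchadmin describe -p -s sales.ds    # partition information plus the record schema
    $ orchadmin dump -n 10 -name sales.ds  # first 10 records per partition, with field names
    $ orchadmin rm sales.ds                # remove descriptor and data files safely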

File set stage


The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link. It only executes in parallel mode.

What is a file set? WebSphere DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file, and you need to distribute files among nodes to prevent overruns. The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:
1. The number of processing nodes in the default node pool
2. The number of disks in the export or default disk pool connected to each processing node in the default node pool
3. The size of the partitions of the data set
The File Set stage enables you to create and write to file sets, and to read data back from file sets. Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written.
Filesets are similar to datasets in that they are:
1. Partitioned
2. Implemented with a header file and data files
Filesets are different from datasets in that:
1. The data files of filesets are text files, and hence are readable by other applications, whereas the data files of datasets are stored in a native internal format and are readable only by DataStage.

8oo"up file set stage


The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.

When performing lookups, Lookup File Set stages are used with Lookup stages. When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup.

http://www.dsxchange.com/viewtopic.php?t=113394
Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing key overwrite the previous row. Duplicate key values are not permitted. Lookup File Set is only available in parallel jobs. It uses an index (based on a hash table) to determine the location of keys within its structure. It is a parallel structure; it has its records spread over the processing nodes specified when it was created. The records in the Lookup File Set are loaded into a virtual Data Set before use, and the index is also loaded into memory. Duplicate key values are (optionally) permitted. If the option is not selected, duplicates are rejected when writing to the Lookup File Set.

http://www.dsxchange.com/viewtopic.php?t=93287
I did testing on a Windows machine processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1 match. Two key fields of char 255 and two non-key fields also of char 255. I deliberately chose fat key fields. The dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. Ran it a few times with the same results. One interesting result was memory utilisation: the fileset was consistently lighter than the dataset, by as much as 30% on RAM memory. This may be due to the keep/drop key field option of the fileset stage. If you set keep to false, the key fields in the fileset are not loaded into memory, as they are not required on the output side of the lookup. I am guessing that the fileset version was moving and storing 510 chars less for each lookup than the dataset version. In a normal lookup these key fields travel up the reference link and back down it again; in a lookup fileset they only travel up. When I switch the same job onto an AIX box with several gig of RAM I get 7 seconds for the dataset and 4 for the fileset. With an increase to 500,000 rows I get 23 seconds for the dataset and 7 seconds for the fileset. This difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it doesn't have the Append option of a dataset; you can only overwrite it.

Creating a loo"up file set


1. In the Input Link Properties tab:
- Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the file set; you cannot specify it when performing the lookup.
- Specify the name of the Lookup File Set.
- Specify a lookup range, or accept the default setting of No.
- Set Allow Duplicates, or accept the default setting of False.
2. Ensure column meta data has been specified for the lookup file set.

8oo"ing up a loo"up file set

1. In the Output Link Properties tab, specify the name of the lookup file set being used in the lookup.
2. Ensure column meta data has been specified for the lookup file set.
By default the stage will write to the file set in entire mode; the complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.

Complex Flat File stage


The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you cannot use the same stage to do both. As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from files that contain multiple record types. The source data can contain one or more of the following clauses:
1. GROUP
2. REDEFINES
3. OCCURS
4. OCCURS DEPENDING ON
CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the stage to run sequentially if it is reading only one file with a single reader. As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or more complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.
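To make those clauses concrete, here is a small, hypothetical copybook fragment of the kind the CFF stage reads; it contains a group item, a REDEFINES, and an OCCURS DEPENDING ON array:

    01  POLICY-REC.
        05  REC-TYPE        PIC X(2).
        05  POLICY-ID       PIC 9(8).
        05  PREMIUM         PIC S9(7)V99 COMP-3.
        05  PREMIUM-TEXT    REDEFINES PREMIUM PIC X(5).
        05  RIDER-COUNT     PIC 9(2).
        05  RIDER-TABLE     OCCURS 0 TO 10 TIMES
                            DEPENDING ON RIDER-COUNT.
            10  RIDER-CODE  PIC X(4).

A file with several such layouts would be imported as one COBOL file definition, with a record ID constraint on a field like REC-TYPE distinguishing the record types, as described below.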

Editing a Complex Flat File stage as a source


To edit a CFF stage as a source, you must provide details about the file that the stage will read, create record definitions for the data, define the column metadata, specify record ID constraints, and select output columns. To edit a CFF stage as a source:
1. Open the CFF stage editor.
2. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will read.
b. On the Record Options tab, describe the format of the data in the file.
c. If the stage is reading a file that contains multiple record types, on the Records tab, create record definitions for the data.
d. On the Records tab, create or load column definitions for the data.
e. If the stage is reading a file that contains multiple record types, on the Records ID tab, define the record ID constraint for each record.
f. Optional: On the Advanced tab, change the processing settings.
3. On the Output page, specify how to read data from the source file:
a. On the Selection tab, select one or more columns for each output link.
b. Optional: On the Constraint tab, define a constraint to filter the rows on each output link.
c. Optional: On the Advanced tab, change the buffering settings.
4. Click OK to save your changes and to close the CFF stage editor.

Creating record definitions


If you are reading data from a file that contains multiple record types, you must create a separate record definition for each type. COBOL copybooks with multiple record types can be imported as a COBOL file definition (e.g. Insurance.cfd). Each record type is stored as a separate DataStage table definition (e.g. if Insurance.cfd has 3 record types, for Client, Policy and Coverage, then there will be 3 table definitions, one for each record type). To create record definitions:
1. Click the Records tab on the Stage page.
2. Clear the Single record check box.
3. Right-click the default record definition RECORD_1 and select Rename Current Record.
4. Type a new name for the default record definition.

5. Add another record by clicking one of the buttons at the bottom of the records list. Each button offers a different insertion point. A new record is created with the default name of NEWRECORD.
6. Double-click NEWRECORD to rename it.
7. Repeat steps 5 and 6 for each new record that you need to create.
8. Right-click the master record in the list and select Toggle Master Record. Only one master record is permitted.

Column definitions
You must define columns to specify what data the CFF stage will read or write. If the stage will read data from a file that contains multiple record types, you must first create record definitions on the Records tab. If the source file contains only one record type, or if the stage will write data to a target file, then the columns belong to the default record called RECORD_1. You can load column definitions from a table in the repository, or you can type column definitions into the columns grid. You can also define columns by dragging a table definition from the Repository window to the CFF stage icon on the Designer canvas.

Loading columns
The fastest way to define column metadata is to load columns from a table definition in the repository. To load columns:
1. Click the Records tab on the Stage page.
2. Click Load to open the Table Definitions window. This window displays all of the repository objects that are in the current project.
3. Select a table definition in the repository tree and click OK.
4. Select the columns to load in the Select Columns From Table window and click OK.
5. If flattening is an option for any arrays in the column structure, specify how to handle array data in the Complex File Load Option window.

Typing columns
You can also define column metadata by typing column definitions in the columns grid. To type columns:
1. Click the Records tab on the Stage page.
2. In the Level number field of the grid, specify the COBOL level number where the data is defined. If you do not specify a level number, a default value of 05 is used.
3. In the Column name field, type the name of the column.
4. In the Native type field, select the native data type.
5. In the Length field, specify the data precision.
6. In the Scale field, specify the data scale factor.
7. Optional: In the Description field, type a description of the column.

Defining record ID constraints


If you are using the CFF stage to read data from a file that contains multiple record types, you must specify a record ID constraint to identify the format of each record. Columns that are identified in the record ID clause must be in the same physical storage location across records. The constraint must be a simple equality expression, where a column equals a value. To define a record ID constraint:
1. Click the Records ID tab on the Stage page.
2. Select a record from the Records list.

3. Select the record ID column from the Column list. This list displays all columns from the selected record, except the first OCCURS DEPENDING ON (ODO) column and any columns that follow it.

4. Select the = operator from the Op list.
5. Type the identifying value for the record ID column in the Value field. Character values must be enclosed in single quotation marks.

Selecting output columns


By selecting output columns, you specify which columns from the source file the CFF stage should pass to the output links. You can select columns from multiple record types to output from the stage. If you do not select columns to output on each link, the CFF stage automatically propagates all of the stage columns, except group columns, to each empty output link when you click OK to exit the stage. To select output columns:
1. Click the Selection tab on the Output page.
2. If you have multiple output links, select the link that you want from the Output name list.

Defining output link constraints


By defining a constraint, you can filter the data on each output link from the CFF stage. You can set the output link constraint to match the record ID constraint for each selected output record by clicking Default on the Constraint tab on the Output page. The Default button is available only when the constraint grid is empty. To define an output link constraint:
1. Click the Constraint tab on the Output page.
2. In the ( field of the grid, select an opening parenthesis if needed. You can use parentheses to specify the order of evaluation of a complex constraint expression.
3. In the Column field, select a column or job parameter. (Group columns cannot be used in constraint expressions and are not displayed.)
4. In the Op field, select an operator or a logical function.
5. In the Column/Value field, select a column or job parameter, or double-click in the cell to type a value. Enclose character values in single quotation marks.
6. In the ) field, select a closing parenthesis if needed.
7. If you are building a complex expression, in the Logical field, select AND or OR to continue the expression in the next row.
8. Click Verify. If errors are found, you must either correct the expression, click Clear All to start over, or cancel. You cannot save an incorrect constraint.

Editing a Complex Flat File stage as a target


To edit a CFF stage as a target, you must provide details about the file that the stage will write, define the record format of the data, and define the column metadata. To edit a CFF stage as a target:
1. Open the CFF stage editor.
2. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will write.
b. On the Record Options tab, describe the format of the data in the file.
c. On the Records tab, create or load column definitions for the data.
d. Optional: On the Advanced tab, change the processing settings.
3. Optional: On the Input page, specify how to write data to the target file:
a. On the Advanced tab, change the buffering settings.

b. On the Partitioning tab, change the partitioning settings.

4. Click OK to save your changes and to close the CFF stage editor.

$e0ect lin"s
The C33 stage can have a single reject lin(' whether you use the stage as a source or a target. 3or C33 source stages' reject lin(s are supported only if the source file contains a single record type without any 9CCU/# 2+P+?2:?@ 9? %929& columns. 3or C33 target stages' reject lin(s are supported only if the target file does not contain 929 columns. <ou cannot change the selection properties of a reject lin(. The #election tab for a reject lin( is blan(. <ou cannot edit the column definitions for a reject lin(. 3or writing files' the reject lin( uses the input lin( column definitions. 3or reading files' the reject lin( uses a single column named HrejectedH that contains raw data for the columns that were rejected after reading because they did not match the schema.

FTP Enterprise Stage


The FTP Enterprise stage transfers multiple files in parallel. These are sets of files that are transferred from one or more FTP servers into WebSphere DataStage, or from WebSphere DataStage to one or more FTP servers. The source or target for the file is identified by a URI (Universal Resource Identifier). The FTP Enterprise stage invokes an FTP client program and transfers files to or from a remote host using the FTP protocol.

URI Is a pathname connecting the stage to a target file on a remote host. It has the Open dependent property. You can repeat this property to specify multiple URIs. You can specify an absolute or a relative pathname.

Open command Is required if you perform any operation besides navigating to the directory where the file exists. There can be multiple Open commands. This is a dependent property of URI.

ftp command Is an optional command that you can specify if you do not want to use the default ftp command. For example, you could specify /opt/gnu/bin/wuftp. You can enter the path of the command (on the server) directly in this field. You can also specify a job parameter if you want to be able to specify the ftp command at run time.

User Name Specify the user name for the transfer. You can enter it directly in this field, or you can specify a job parameter if you want to be able to specify the user name at run time. You can specify multiple user names; User1 corresponds to URI1, and so on. When the number of users is less than the number of URIs, the last user name is set for the remaining URIs. If no User Name is specified, the FTP Enterprise stage tries to use the .netrc file in the home directory.

Password Enter the password in this field. You can also specify a job parameter if you want to be able to specify the password at run time. Specify a password for each user name; Password1 corresponds to URI1. When the number of passwords is less than the number of URIs, the last password is set for the remaining URIs.

Transfer Protocol Select the type of FTP service to transfer files between computers. You can choose either FTP or Secure FTP (SFTP).
1. FTP Select this option if you want to transfer files using the standard FTP protocol. This is a nonsecure protocol. By default the FTP Enterprise stage uses this protocol to transfer files.

2. Secure FTP (SFTP) Select this option if you want to transfer files between computers over a secured channel. Secure FTP (SFTP) uses the SSH (Secured Shell) protected channel for data transfer between computers over a nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the SSH connection without any pass phrase for RSA authentication.

Force Parallelism You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as needed to transfer the n files in parallel. However, you can force the parallel transfer of data by setting this property to Yes. This allows m processes at a time, where m is the number specified in the WebSphere DataStage configuration file. If m is less than n, the stage waits to transfer the first m files and then starts the next m, until n files are transferred. When you set Force Parallelism to Yes, you should only give one URI.

Overwrite Set this option to have any existing files overwritten by this transfer.

Restartable Mode When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for recording information about the transfer in a restart directory. If the transfer fails, you can run an identical job with the restartable mode property set to Restart transfer, which will reattempt the transfer. If the transfer repeatedly fails, you can run an identical job with the restartable mode option set to Abandon transfer, which will delete the restart directory. Restartable mode has the following dependent properties:
1. Job Id Identifies a restartable transfer job. This is used to name the restart directory.
2. Checkpoint directory Optionally specifies a checkpoint directory to contain restart directories. If you do not specify this, the current working directory is used. For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint, the files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.

Schema file Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You can enter the path name of a schema file, or specify a job parameter, so the schema file name can be specified at run time.

Transfer Type Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII mode of data transfer. The default data transfer mode is binary.
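As an illustrative sketch of the main properties for a single transfer (the host, user, and paths are hypothetical; the wuftp path is the example given above):

    URI            = ftp://etluser@ftphost/landing/sales_feed.txt
    Open command   = cd /landing
    ftp command    = /opt/gnu/bin/wuftp
    User Name      = etluser
    Password       = #pFtpPassword#    (a job parameter, so it can be supplied at run time)
    Transfer Type  = Binary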

When reading a delimited Sequential File, you are instructed to interpret two contiguous field delimiters as NULL for the corresponding field, regardless of data type. Which three actions must you take? (Choose three.)
A. Set the data type to Varchar.
B. Set the field to nullable.
C. Set the "NULL Field Value" to two field delimiters (e.g., "||" for pipes).
D. Set the "NULL Field Value" to ''.
E. Set the environment variable $APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL.
Answer: B, D, E
(APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL: when set, allows a zero-length null_field value with fixed-length fields. This should be used with care, as poorly formatted data will cause incorrect results. By default a zero-length null_field value will cause an error.)

Which two attributes are found in a Data Set descriptor file? (Choose two.)
A. A copy of the job score.
B. The schema of the Data Set.
C. A copy of the partitioned data.
D. A copy of the configuration file used when the Data Set was created.
Answer: B, D

When importing a COBOL file definition, which two are required? (Choose two.)
A. The file you are importing is accessible from your client workstation.
B. The file you are importing contains level 01 items.
C. The column definitions are in a COBOL copybook file and not, for example, in a COBOL source file.
D. The file does not contain any OCCURS DEPENDING ON clauses.
Answer: A, B

Which three features of datasets make them suitable for job restart points? (Choose three.)
A. They are indexed for fast data access.
B. They are partitioned.
C. They use datatypes that are in the parallel engine internal format.
D. They are persistent.
E. They are compressed to minimize storage space.
Answer: B, C, D

Which statement describes a process for capturing a COBOL copybook from a z/OS system?
A. FTP the COBOL copybook to the server platform in text mode and capture the metadata through Manager.
B. Select the COBOL copybook using the Browse button and capture the COBOL copybook with Manager.
C. FTP the COBOL copybook to the client workstation in text mode and capture the copybook with Manager.
D. FTP the COBOL copybook to the client workstation in binary and capture the metadata through Manager.
Answer: C

The high performance ETL server on which DataStage EE is installed is networked with several other servers in the IT department with a very high bandwidth switch. A list of seven files (all of which contain records with the same record layout) must be retrieved from three of the other servers using FTP. Given the high bandwidth network and high performance ETL server, which approach will retrieve and process all seven files in the minimal amount of time?
A. In a single job, use seven separate FTP Enterprise stages the output links of which lead to a single Sort Funnel stage, then process the records without landing to disk.
B. Set up a sequence of seven separate DataStage EE jobs, each of which retrieves a single file and appends to a common dataset, then process the resulting dataset in an eighth DataStage EE job.
C. Use three FTP Plug-in stages (one for each machine) to retrieve the seven files and store them to a single file on the fourth server, then use the FTP Enterprise stage to retrieve the single file and process the records without landing to disk.
D. Use a single FTP Enterprise stage and specify seven URI properties, one for each file, then process the records without landing to disk.
Answer: D

An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage?
A. No special settings are required; the XML Input stage automatically detects the repetition element from the XPath expression.
B. Set the "Key" property for the column on the output link to "Yes".
C. Check the "Repetition Element Required" box on the output link tab.
D. Set the "Nullable" property for the column on the output link to "Yes".
Answer: B

Using FTP, a file is transferred from an MVS system to a LINUX system in binary transfer mode. Which data conversion must be used to read a packed decimal field in the file?
A. treat the field as a packed decimal
B. packed decimal fields are not supported
C. treat the field as ASCII
D. treat the field as EBCDIC
Answer: A

When a sequential file is read using a Sequential File stage, the parallel engine inserts an operator to convert the data to the internal format. Which operator is inserted?
A. import operator
B. copy operator
C. tsort operator
D. export operator
Answer: A

Which type of file is both partitioned and readable by external applications?
A. fileset
B. Lookup fileset
C. dataset
D. sequential file
Answer: A

Which two statements are true about the XML Meta Data Importer? (Choose two.)
A. XML Meta Data Importer is capable of reporting syntax and semantic errors from an XML file.
B. XPATH expressions that are created during XML metadata import cannot be modified.
C. XML Meta Data Importer can import Table Definitions from only XML documents.
D. XPATH expressions that are created during XML metadata import are used by XML Input stage and XML Output stage.
Answer: A, D

Which two statements are correct about XML stages and their usage? (Choose two.)
A. XML Input stage converts XML data to tabular format.
B. XML Output stage converts tabular data to XML hierarchical structure.
C. XML Output stage uses XSLT stylesheet for XML to tabular transformations.
D. XML Transformer stage converts XML data to tabular format.
Answer: A, B

Which "Reject Mode" option in the Sequential File stage will write records to a reject link?
A. Output
B. Fail
C. Drop
D. Continue
Answer: A

A single sequential file exists on a single node. To read this sequential file in parallel, what should be done?
A. Set the Execution mode to "Parallel".
B. A sequential file cannot be read in parallel using the Sequential File stage.
C. Select "File Pattern" as the Read Method.
D. Set the "Number of Readers Per Node" optional property to a value greater than 1.
Answer: D

When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator
Answer: A

A bank receives daily credit score updates from a credit agency in the form of a fixed width flat file. The monthly_income column is an unsigned nullable integer (int32) whose width is specified as 10, and null values are represented as spaces. Which Sequential File property will properly import any nulls in the monthly_income column of the input file?
A. Set the record level fill char property to the space character (' ').
B. Set the null field value property to a single space (' ').
C. Set the C_format property to '"%d. 10"'.
D. Set the null field value property to ten spaces ('          ').
Answer: D

An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage?
A. Set the "Nullable" property for the column on the output link to "Yes".
B. Set the "Key" property for the column on the output link to "Yes".
C. Check the "Repetition Element Required" box on the output link tab.
D. No special settings are required; the XML Input stage automatically detects the repetition element from the XPath expression.
Answer: B

During a sequential file read, you experience an error with the data. What is a valid technique for identifying the column causing the difficulty?
A. Set the "data format" option to text on the Record Options tab.
B. Enable tracing in the DataStage Administrator Tracing panel.
C. Enable the "print field" option on the Record Options tab.
D. Set the APT_IMPORT_DEBUG environment variable.
Answer: C

On which two does the number of data files created by a fileset depend? (Choose two.)
A. the size of the partitions of the dataset
B. the number of CPUs
C. the schema of the file
D. the number of processing nodes in the default node pool
Answer: A, D

What are two ways to delete a persistent parallel dataset? (Choose two.)
A. standard UNIX command rm
B. orchadmin command rm
C. delete the dataset Table Definition in DataStage Manager
D. delete the dataset in Data Set Manager
Answer: B, D

A parts supplier has a single fixed width sequential file. Reading the file has been slow, so the supplier would like to try to read it in parallel. If the job executes using a configuration file consisting of four nodes, which two Sequential File stage settings will cause the DataStage parallel engine to read the file using four parallel readers? (Choose two.) (Note: assume the file path and name is /data/parts_input.txt.)
A. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 2.
B. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the read from multiple nodes option to yes.
C. Set the read method to file pattern, and set the file pattern property to '/data/(@PART_COUNT)parts_input.txt'.
D. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 4.
Answer: B, D

Data Transformation
Transformer Stage
Transformer stages can have a single input and any number of outputs. A Transformer can also have a reject link that takes any rows which have not been written to any of the output links, by reason of a write failure or expression evaluation failure. In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and when. The evaluation sequence is as follows:

    Evaluate each stage variable initial value
    For each input row to process:
        Evaluate each stage variable derivation value, unless the derivation is empty
        For each output link:
            Evaluate each column derivation value
            Write the output record
        Next output link
    Next input row

The stage variables and the columns within a link are evaluated in the order in which they are displayed on the parallel job canvas. Similarly, the output links are also evaluated in the order in which they are displayed.

System variables
WebSphere DataStage provides a set of variables containing useful system information that you can access from an output derivation or constraint.

1. @FALSE The value is replaced with 0.
2. @TRUE The value is replaced with 1.
3. @INROWNUM Input row counter.
4. @OUTROWNUM Output row counter (per link).
5. @NUMPARTITIONS The total number of partitions for the stage.
6. @PARTITIONNUM The partition number for the particular instance.
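A common use of these variables is generating a sequence of integers that is unique across partitions (this also appears as a certification question later in this section). A sketch of such an output column derivation:

    @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1

Each partition then produces a disjoint arithmetic progression: partition 0 yields 1, N+1, 2N+1, ..., partition 1 yields 2, N+2, ..., where N is the number of partitions.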

Triggers tab The Triggers tab allows you to choose routines to be executed at specific execution points as the Transformer stage runs in a job. The execution point is per-instance, i.e., if a job has two Transformer stage instances running in parallel, the routine will be called twice, once for each instance. The available execution points are Before-stage and After-stage. At this release, the only available built-in routine is SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you define a C function, make it available in a UNIX shared library, and then define a Parallel routine which calls it (see the WebSphere DataStage Designer Client Guide for details on defining a Parallel Routine). Note that the function should not return a value.

A constraint otherwise link can be defined by:

1. Clicking on the Otherwise/Log field so a tick appears and leaving the Constraint fields blank. This will catch any rows that have failed to meet constraints on all the previous output links.
2. Setting the constraint to OTHERWISE. This will be set whenever a row is rejected on a link because the row fails to match a constraint. OTHERWISE is cleared by any output link that accepts the row.
3. The otherwise link must occur after the output links in link order, so that it will catch rows that have failed to meet the constraints of all the output links. If it is not last, rows that satisfy a constraint on a later link may be sent down the otherwise link and down that later link as well.
4. Clicking on the Otherwise/Log field so a tick appears and defining a Constraint. This will result in the number of rows written to that link (i.e. rows which satisfy the constraint) being recorded in the job log as a warning message.
Note: You can also specify a reject link which will catch rows that have not been written on any output links due to a write error or null expression error. Define this outside the Transformer stage by adding a link and using the shortcut menu to convert it to a reject link.

Conditionally Aborting a Job
Use the "Abort After Rows" setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job. You can specify an abort condition for any output link. The abort occurs after the specified number of rows arrives in one of the partitions. When the "Abort After Rows" threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers.

Functions and Operators (example derivations follow the list of Transformer usage guidelines below)
Concatenation operator: ":"
Substring operator: Input_String[starting position, length]
String functions: 1. Len(<string>) 2. Trim(<string>) 3. UpCase/DownCase(<string>)
Null handling functions: 1. IsNull 2. IsNotNull 3. NullToValue 4. NullToZero 5. SetNull()
Type conversions: 1. StringToTimestamp 2. StringToDecimal

Using Transformer stages
In general, it is good practice not to use more Transformer stages than you have to. You should especially avoid using multiple Transformer stages where the logic can be combined into a single stage. It is often better to use other stage types for certain types of operation:
1. Use a Copy stage rather than a Transformer for simple operations such as:
- Providing a job design placeholder on the canvas. (Provided you do not set the Force property to True on the Copy stage, the copy will be optimized out of the job at run time.)
- Renaming columns.
- Dropping columns.
- Implicit type conversions.
Note that, if runtime column propagation is disabled, you can also use output mapping on a stage to rename, drop, or convert columns on a stage that has both inputs and outputs.
2. Use the Modify stage for explicit type conversion and null handling.
3. Where complex, reusable logic is required, or where existing Transformer-stage based job flows do not meet performance requirements, consider building your own custom stage.

4. Use a BASIC Transformer stage where you want to take advantage of user-defined functions and routines.
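As a small sketch of the functions and operators listed above in use (the column and stage variable names are hypothetical):

    svFullName (stage variable):  Trim(in.FirstName) : ' ' : Trim(in.LastName)
    out.Name:                     UpCase(svFullName)
    out.Region:                   in.AcctCode[1,3]
    out.Income:                   NullToValue(in.Income, 0)

The first derivation concatenates two trimmed fields, the third takes a three-character substring, and the last supplies a default value for null input.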

SCD Stage
The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements. SCD stages support both SCD Type 1 and SCD Type 2 processing:
1. SCD Type 1: overwrites an attribute in a dimension table.
2. SCD Type 2: adds a new row to a dimension table.
Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be present in the source data.

Purpose codes in a Slowly Changing Dimension stage


Purpose codes are an attribute of dimension columns in SCD stages. Purpose codes are used to build the lookup table, to detect dimension changes, and to update the dimension table.

Building the lookup table: The SCD stage uses purpose codes to determine how to build the lookup table for the dimension lookup. If a dimension has only Type 1 columns, the stage builds the lookup table by using all dimension rows. If any Type 2 columns exist, the stage builds the lookup table by using only the current rows. If a dimension has a Current Indicator column, the stage uses the derivation value of this column on the Dim Update tab to identify the current rows of the dimension table. If a dimension does not have a Current Indicator column, then the stage uses the Expiration Date column and its derivation value to identify the current rows. Any dimension columns that are not needed are not used. This technique minimizes the amount of memory that is required by the lookup table.

Detecting dimension changes: Purpose codes are also used to detect dimension changes. The SCD stage compares Type 1 and Type 2 column values to source column values to determine whether to update an existing row, insert a new row, or expire a row in the dimension table.

Updating the dimension table: Purpose codes are part of the column metadata that the SCD stage propagates to the dimension update link. You can send this column metadata to a database stage in the same job, or you can save the metadata on the Columns tab and load it into a database stage in a different job. When the database stage uses the auto-generated SQL option to perform inserts and updates, it uses the purpose codes to generate the correct SQL statements.

Selecting purpose codes


Purpose codes specify how the SCD stage should process dimension data. Purpose codes apply to columns on the dimension reference link and on the dimension update link. Select purpose codes according to the type of columns in a dimension:

1. If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration Date column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current Indicator to the same column.
2. If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or SK Chain columns are allowed.

Purpose code definitions


The SCD stage provides nine purpose codes to support dimension processing.

1. (blank) The column has no SCD purpose. This purpose code is the default.
2. Surrogate Key The column is a surrogate key that is used to identify dimension records.
3. Business Key The column is a business key that is typically used in the lookup condition.
4. Type 1 The column is an SCD Type 1 field. SCD Type 1 column values are always current. When changes occur, the SCD stage overwrites existing values in the dimension table.
5. Type 2 The column is an SCD Type 2 field. SCD Type 2 column values represent a point in time. When changes occur, the SCD stage creates a new dimension row.
6. Current Indicator (Type 2) The column is the current record indicator for SCD Type 2 processing. Only one Current Indicator column is allowed.
7. Effective Date (Type 2) The column is the effective date for SCD Type 2 processing. Only one Effective Date column is allowed.
8. Expiration Date (Type 2) The column is the expiration date for SCD Type 2 processing. An Expiration Date column is required if there is no Current Indicator column; otherwise it is optional.
9. SK Chain The column is used to link a record to the previous record or the next record by using the value of the Surrogate Key column. Only one Surrogate Key column can exist if you have an SK Chain column.

Surrogate "eys in a Slo2ly Changing Dimension stage


Surrogate keys are used to join a dimension table to a fact table in a star schema database. When the SCD stage performs a dimension lookup, it retrieves the value of the existing surrogate key if a matching record is found. If a match is not found, the stage obtains a new surrogate key value by using the derivation of the Surrogate Key column on the Dim Update tab. If you want the SCD stage to generate new surrogate keys by using a key source that you created with a Surrogate Key Generator stage, you must use the NextSurrogateKey function to derive the Surrogate Key column. If you want to use your own method to handle surrogate keys, you should derive the Surrogate Key column from a source column. You can replace the dimension information in the source data stream with the surrogate key value by mapping the Surrogate Key column to the output link.
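For example, with a Surrogate Key Generator key source in place, the Dim Update derivation for the column whose purpose code is Surrogate Key would be a sketch like (the column name is hypothetical):

    PRODUCT_SK:  NextSurrogateKey()

The function draws its values from the key source configured on the Surrogate Key tab, as described in the next section.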

Specifying information about a key source


If you created a key source with a Surrogate Key Generator stage, you must specify how the SCD stage should use the source to generate surrogate keys. The key source can be a flat file or a database sequence. The key source must exist before the job runs. If the key source is a flat file, the file must be accessible from all nodes that run the SCD stage.

To use the key source:
1. On the Input page, select the reference link in the Input name field.

2. Click the Surrogate Key tab.
3. In the Source type field, select the source type.
4. In the Source name field, type the name of the key source, or click the arrow button to browse for a file or to insert a job parameter. If the source is a flat file, type the name and fully qualified path of the state file, such as C:/SKG/ProdDim. If the source is a database sequence, type the name of the sequence, such as PRODUCT_KEY_SEQ.
5. Provide additional information about the key source according to the type: if the source is a flat file, specify information in the Flat File area; if the source is a database sequence, specify information in the DB sequence area.
Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a derivation that uses the NextSurrogateKey function for the column that has a purpose code of Surrogate Key. The NextSurrogateKey function returns the value of the next surrogate key when the SCD stage creates a new dimension row.

A DataStage job contains a parallel Transformer with a single input link and a single output link. The Transformer has a constraint that should produce 1000 records, however only 900 came out through the output link. What should be done to identify the missing records?
A. Turn trace on using DataStage Administrator.
B. Add a Reject link to the Transformer stage.
C. Scan the generated osh script for possible errors.
D. Remove the constraint on the output link.
Answer: B

Which three actions are performed using stage variables in a parallel Transformer stage? (Choose three.)
A. A function can be executed once per record.
B. A function can be executed once per run.
C. Identify the first row of an input group.
D. Identify the last row of an input group.
E. Look up a value from a reference dataset.
Answer: A, B, C

Which two system variables must be used in a parallel Transformer derivation to generate a unique sequence of integers across partitions? (Choose two.)
A. @PARTITIONNUM
B. @INROWNUM
C. @DATE
D. @NUMPARTITIONS
Answer: A, D

What would require creating a new parallel Custom stage rather than a new parallel BuildOp stage?
A. A Custom stage can be created with properties. BuildOp stages cannot be created with properties.
B. In a Custom stage, the number of input links does not have to be fixed, but can vary, for example from one to two. BuildOp stages require a fixed number of input links.
C. Creating a Custom stage requires knowledge of C/C++. You do not need knowledge of C/C++ to create a BuildOp stage.
D. Custom stages can be created for parallel execution. BuildOp stages can only be built to run sequentially.
Answer: B

Your input rows contain customer data from a variety of locations. You want to select just those rows from a specified location, based on a parameter value. You are trying to decide whether to use a Transformer or a Filter stage to accomplish this. Which statement is true?
A. The Transformer stage will yield better performance because the Filter stage Where clause is interpreted at runtime.
B. You cannot use a Filter stage because you cannot use parameters in a Filter stage Where clause.
C. The Filter stage will yield better performance because it has less overhead than a Transformer stage.
D. You cannot use the Transformer stage because you cannot use parameters in a Transformer stage constraint.
Answer: A

In a Transformer you add a new column to an output link named JobName that is to contain the name of the job that is running. What can be used to derive values for this column?
A. a DataStage function
B. a link variable
C. a system variable
D. a DataStage macro
Answer: D

Which statement describes how to add functionality to the Transformer stage?
A. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of a function written and compiled in C++.
B. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of an external program.
C. Create a new server routine in the Routines category that specifies the name and category of a function written in DataStage Basic.
D. Edit the C++ code generated by the Transformer stage.
Answer: A

Which three statements about the Enterprise Edition parallel Transformer stage are correct? (Choose three.)
A. The Transformer allows you to copy columns.
B. The Transformer allows you to do lookups.
C. The Transformer allows you to apply transforms using routines.
D. The Transformer stage automatically applies the 'NullToValue' function to all non-nullable output columns.
E. The Transformer allows you to do data type conversions.
Answer: A, C, E

Which two stages allow field names to be specified using job parameters? (Choose two.)
A. Transformer stage
B. Funnel stage
C. Modify stage
D. Filter stage
Answer: C, D

The parallel dataset input into a Transformer stage contains null values. What should you do to properly handle these null values?
A. Convert null values to valid values in a stage variable.
B. Convert null values to a valid value in the output column derivation.
C. Null values are automatically converted to blanks and zero, depending on the target data type.
D. Trap the null values in a link constraint to avoid derivations.
Answer: A

Which two would require the use of a Transformer stage instead of a Copy stage? (Choose two.)
A. Drop a column.
B. Send the input data to multiple output streams.
C. Trim spaces from a character field.
D. Select certain output rows based on a condition.
Answer: C, D

In which situation should a BASIC Transformer stage be used in a DataStage EE job?
A. in a job containing complex routines migrated from DataStage Server Edition
B. in a job requiring lookups to hashed files
C. in a large-volume job flow
D. in a job requiring complex, reusable logic
Answer: A

You have three output links coming out of a Transformer. Two of them (A and B) have constraints you have defined. The third you want to be an Otherwise link that is to contain all of the rows that do not satisfy the constraints of A and B. This Otherwise link must work correctly even if the A and B constraints are modified. Which two are required? (Choose two.)
A. The Otherwise link must be first in the link ordering.
B. A constraint must be coded for the Otherwise link.
C. The Otherwise link must be last in the link ordering.
D. The Otherwise check box must be checked.
Answer: C, D

Which two statements are true about DataStage Parallel Buildop stages? (Choose two.)
A. Unlike standard DataStage stages, they do not have properties.
B. They are coded using C/C++.
C. They are coded using DataStage Basic.
D. Table Definitions are used to define the input and output interfaces of the BuildOp.
Answer: B, D

Job Control and Run-time Management


Message Handlers
When you run a parallel job, any error messages and warnings are written to an error log and can be viewed from the Director. You can choose to handle specified errors in a different way by creating one or more message handlers. A message handler defines rules about how to handle messages generated when a parallel job is running. You can, for example, use one to specify that certain types of message should not be written to the log. You can edit message handlers in the DataStage Manager or in the DataStage Director. The recommended way to create them is by using the Add rule to message handler feature in the Director. You can specify message handler use at different levels:
Project Level: You define a project-level message handler in the DataStage Administrator, and this applies to all parallel jobs within the specified project.
Job Level: From the Designer and Manager you can specify that any existing handler should apply to a specific job. When you compile the job, the handler is included in the job executable as a local handler (and so can be exported to other systems if required).
You can also add rules to handlers when you run a job from the Director (regardless of whether it currently has a local handler included). This is useful, for example, where a job is generating a message for every row it is processing; you can suppress that particular message. When the job runs, it will look in the local handler (if one exists) for each message, to see if any rules exist for that message type. If a particular message is not handled locally, it will look to the project-wide handler for rules. If there are none there, it writes the message to the job log. Note that message handlers do not deal with fatal error messages; these will always be written to the job log. You cannot add message rules to jobs from an earlier release of DataStage without first re-running those jobs.

Adding Rules to Message Handlers


You can add rules to message handlers "on the fly" from within the Director. Using this method, you can add rules to handlers that are local to the current job, to the project default handler, or to any previously defined handler. To add rules in this way, highlight the message you want to add a rule about in the job log and choose Add rule to message handler... from the job log shortcut menu or from the Job menu on the menu bar. The Add rule to message handler dialog box appears. To add a rule:
1. Choose an option to specify which handler you want to add the new rule to. Choose between the local runtime handler for the currently selected job, the project-level message handler, or a specific message handler. If you want to edit a specific message handler, select the handler from the Message Handler drop-down list. Choose (New) to create a new message handler.
2. Choose an Action from the drop-down list. Choose from:
• Suppress from log. The message is not written to the job's log as it runs.
• Promote to Warning. Promote an informational message to a warning message.
• Demote to Informational. Demote a warning message to become an informational one.
The Message ID, Message type and Example of message text fields are all filled in from the log entry you have currently selected. You cannot edit these.

3. Click Add Rule to add the new message rule to the chosen handler.

Managing Message Handlers


To open the Message Handler Manager, choose Tools > Message Handlers (you can also open the manager from the Add rule to message handler dialog box). The Edit Message Handlers dialog box appears.

Message Handler File Format


A message handler is a plain text file and has the suffix .msh. It is stored in the MsgHandlers folder under the DataStage server install directory. The following is an example message file:
TUTL 000031 1 The open file limit is 100; raising to 1024...
TFSC 000001 2 APT configuration file...
TFSC 000043 3 Attempt to Cleanup after ABORT raised in stage...
Each line in the file represents one message rule, and comprises four tab-separated fields:
- Message ID. Case-specific string uniquely identifying the message
- Type. 1 for Info, 2 for Warn
- Action. 1 = Suppress, 2 = Promote, 3 = Demote
- Message. Example text of the message

Identify the use of the dsjob command line utility


You can start, stop, validate, and reset jobs using the -run option.

Running a Job
dsjob -run [ -mode [ NORMAL | RESET | VALIDATE ] ] [ -param name=value ] [ -warn n ] [ -rows n ] [ -wait ] [ -stop ] [ -jobstatus ] [ -userstatus ] [ -local ] [ -opmetadata [TRUE | FALSE] ] [ -disableprjhandler ] [ -disablejobhandler ] [useid] project job|job_id
-mode specifies the type of job run. NORMAL starts a job run, RESET resets the job, and VALIDATE validates the job. If -mode is not specified, a normal job run is started.
-param specifies a parameter value to pass to the job. The value is in the format name=value, where name is the parameter name and value is the value to be set. If you use this to pass the value of an environment variable to a job (as you may do for parallel jobs), you need to quote the environment variable and its value, for example -param '$APT_CONFIG_FILE=chris.apt', otherwise the current value of the environment variable will be used.
-warn n sets warning limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITWARN specified as the LimitType parameter).
-rows n sets row limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITROWS specified as the LimitType parameter).
-wait waits for the job to complete (equivalent to the DSWaitForJob function).
-stop terminates a running job (equivalent to the DSStopJob function).
-jobstatus waits for the job to complete, then returns an exit code derived from the job status.
-userstatus waits for the job to complete, then returns an exit code derived from the user status if that status is defined. The user status is a string, and it is converted to an integer exit code. The exit code 0 indicates that the job completed without an error, but that the user status string could not be converted. If a job returns a negative user status value, it is interpreted as an error.

-local use this when running a DataStage job from within a shell script on a UNIX server. Provided the script is run in the project directory, the job will pick up the settings for any environment variables set in the script and any settings specific to the user environment.
-opmetadata use this to have the job generate operational meta data as it runs. If MetaStage, or the Process Meta Data MetaBroker, is not installed on the machine, then the option has no effect. If you specify TRUE, operational meta data is generated, whatever the default setting for the project. If you specify FALSE, the job will not generate operational meta data, whatever the default setting for the project.
-disableprjhandler use this to disable any error message handler that has been set on a project-wide basis.
-disablejobhandler use this to disable any error message handler that has been set for this job.
useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To run a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
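As a sketch of typical usage, the following assumes a project named myproject, a job named LoadCustomers, and a job parameter SRC_DIR — all hypothetical names used only for illustration:

dsjob -run -mode NORMAL -param SRC_DIR=/data/in -warn 50 -jobstatus myproject LoadCustomers
echo "dsjob exit status: $?"

Because -jobstatus waits for the job to complete and derives the exit code from the job status, a calling script can branch on $? to decide whether downstream steps should run.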

Stopping a Job
You can stop a job using the -stop option.
dsjob -stop [useid] project job|job_id
-stop terminates a running job (equivalent to the DSStopJob function).
useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To stop a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob -jobid command.
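For example, to stop the daily invocation of the hypothetical job used above (LoadCustomers.daily follows the job.invocation_id format):

dsjob -stop myproject LoadCustomers.daily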

Listing Projects
The following syntax displays a list of all known projects on the server:
dsjob -lprojects
This syntax is equivalent to the DSGetProjectList function.

Listing Jobs
The following syntax displays a list of all jobs in the specified project:
dsjob -ljobs project
project is the name of the project containing the jobs to list. This syntax is equivalent to the DSGetProjectInfo function.

Listing Stages
The following syntax displays a list of all stages in a job:
dsjob -lstages [useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function with DSJ_STAGELIST specified as the InfoType parameter.

Listing Links
The following syntax displays a list of all the links to or from a stage:
dsjob -llinks [useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function with DSJ_LINKLIST specified as the InfoType parameter.

Listing Parameters
The following syntax displays a list of all the parameters in a job and their values:
dsjob -lparams [useid] project job|job_id

Listing Invocations
The following syntax displays a list of the invocations of a job:
dsjob -linvocations
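Taken together, the listing options let a script walk a project's contents. A sketch, again using the hypothetical myproject and LoadCustomers names (Transformer_1 is likewise a hypothetical stage name):

dsjob -lprojects
dsjob -ljobs myproject
dsjob -lstages myproject LoadCustomers
dsjob -llinks myproject LoadCustomers Transformer_1
dsjob -lparams myproject LoadCustomers

The output is plain text on stdout, so it can be filtered with standard tools such as grep.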

Setting an Alias for a Job


The dsjob command can be used to specify your own ID for a DataStage job. Other commands can then use that alias to refer to the job.
dsjob -jobid [my_ID] project job
my_ID is the alias you want to set for the job. If you omit my_ID, the command will return the current alias for the specified job. An alias must be unique within the project; if the alias already exists, an error message is displayed.
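For instance, to give the hypothetical job a short alias and then run it by that alias:

dsjob -jobid cust1 myproject LoadCustomers
dsjob -run -jobstatus useid myproject cust1

The useid keyword in the second command tells dsjob that cust1 is a job alias rather than a job name.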

Displaying Job Information


The following syntax displays the available information about a specified job:
dsjob -jobinfo [useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function.

Displaying Stage Information


The following syntax displays all the available information about a stage:
dsjob -stageinfo [useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function.

Displaying Link Information


The following syntax displays information about a specified link to or from a stage:
dsjob -linkinfo [useid] project job|job_id stage link
This syntax is equivalent to the DSGetLinkInfo function.

Displaying Parameter Information


This syntax displays information about the specified parameter:
dsjob -paraminfo [useid] project job|job_id param
The following information is displayed:
- The parameter type
- The parameter value
- Help text for the parameter that was provided by the job's designer
- Whether the value should be prompted for
- The default value that was specified by the job's designer
- Any list of values
- The list of values provided by the job's designer
This syntax is equivalent to the DSGetParamInfo function.

Adding a Log Entry


The following syntax adds an entry to the specified log file. The text for the entry is taken from standard input to the terminal, ending with Ctrl-D.
dsjob -log [ -info | -warn ] [useid] project job|job_id
-info specifies an information message. This is the default if no log entry type is specified.
-warn specifies a warning message.
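Because the entry text is read from standard input, a script can pipe the message in instead of typing it at the terminal; a sketch using the hypothetical names from earlier:

echo "Nightly batch started by scheduler" | dsjob -log -info myproject LoadCustomers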

Displaying a Short Log Entry


The following syntax displays a summary of entries in a job log file:
dsjob -logsum [ -type type ] [ -max n ] [useid] project job|job_id
-type type specifies the type of log entry to retrieve. If -type type is not specified, all the entries are retrieved. type can be one of the following options:

INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED All control logs.
RESET Job reset.
BATCH Batch control.
ANY All entries of any type. This is the default if type is not specified.
-max n limits the number of entries retrieved to n.
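For example, to retrieve at most 20 warning entries for the hypothetical job used earlier:

dsjob -logsum -type WARNING -max 20 myproject LoadCustomers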

Displaying a Specific Log Entry


The following syntax displays the specified entry in a job log file:
dsjob -logdetail [useid] project job|job_id entry
entry is the event number assigned to the entry. The first entry in the file is 0. This syntax is equivalent to the DSGetLogEntry function.

Identifying the Newest Entry


The following syntax displays the ID of the newest log entry of the specified type:
dsjob -lognewest [useid] project job|job_id type
INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED Job started.
RESET Job reset.
BATCH Batch control.
This syntax is equivalent to the DSGetNewestLogId function.
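The two commands combine naturally: -lognewest returns an event number that -logdetail can then expand. A sketch, assuming -lognewest prints the bare event number (in practice you may need to trim any surrounding text from its output):

id=`dsjob -lognewest myproject LoadCustomers FATAL`
dsjob -logdetail myproject LoadCustomers $id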

Importing Job Executables


The dsjob command can be used to import job executables from a DSX file into a specified project. Note that this command is only available on UNIX servers.
dsjob -import project DSXfilename [-OVERWRITE] [-JOB[S] jobname ...] | [-LIST]
project is the project to import into.
DSXfilename is the DSX file containing the job executables.
-OVERWRITE specifies that any existing jobs in the project with the same name will be overwritten.
-JOB[S] jobname specifies that one or more named job executables should be imported (otherwise all the executables in the DSX file are imported).
-LIST causes DataStage to list the executables in a DSX file rather than import them.
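A typical deployment sketch (the DSX path is hypothetical): first list the executables in the file, then import a single job, overwriting any existing copy:

dsjob -import myproject /tmp/release.dsx -LIST
dsjob -import myproject /tmp/release.dsx -OVERWRITE -JOB LoadCustomers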

Generating a Report
The dsjob command can be used to generate an XML format report containing job, stage, and link information.
dsjob -report [useid] project job|jobid [report_type]
report_type is one of the following:
BASIC – Text string containing start/end time, time elapsed and status of job.
DETAIL – As basic report, but also contains information about individual stages and links within the job.
LIST – Text string containing full XML report.
By default the generated XML will not contain a <?xml-stylesheet?> processing instruction. If a stylesheet is required, specify a ReportLevel of 2 and append the name of the required stylesheet URL, i.e., 2:styleSheetURL. This inserts a processing instruction into the generated XML of the form:

<?xml-stylesheet type="text/xsl" href="styleSheetURL"?>
The generated report is written to stdout. This syntax is equivalent to the DSMakeJobReport function.
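Since the report is written to stdout, it can be redirected straight to a file; a sketch using the hypothetical names from earlier:

dsjob -report myproject LoadCustomers DETAIL > LoadCustomers_report.xml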

Job Sequence
What is a Job Sequence?
1. A master controlling job that controls the execution set of subordinate jobs
2. Passes values to subordinate job parameters
3. Controls the order of execution (links)
4. Specifies conditions under which the subordinate jobs get executed (triggers)
5. Specifies complex flow of control – Loops, All/Any sequencer, Wait for file
6. Performs system activities – Email, Execute system commands and executables
7. Can include Restart checkpoints

What are the Job Sequence stages?
1. Run stages – Job Activity: Run a job; Execute Command/Routine Activity: Run a system command; Notification Activity: Send an email
2. Flow Control stages – Sequencer: Go All/Any; Wait for file: Go when file exists/doesn't exist; Loop: Start Loop and End Loop; Nested Condition: Go if condition satisfied
3. Error handling – Exception Handler, Terminator
4. Variables – User Variables

What are the compilation options in Job Sequence properties?
1. Add checkpoints so sequence is restartable on failure – Restart functionality
2. Automatically handle activities that fail – Exception stage to handle aborts
3. Log warnings after activities that finish with status other than OK
4. Log report messages after each run

What are the inputs for the Job Activity stage?
1. Job name (select from list)
2. Execution Action (select from list)
3. Parameters
4. Do not checkpoint run (select/unselect checkbox)

What are the Job Activity Execution Actions?
1. Run
2. Reset if required, then run
3. Validate

What are the different types of triggers for a Job Activity?
OK – (Conditional), Failed – (Conditional), Warning – (Conditional), Custom – (Conditional), UserStatus – (Conditional), Unconditional, Otherwise
Custom Trigger Example – Job_1.$JobStatus=DSJS.RUNOK or Job_1.$JobStatus=DSJS.RUNWARN

What are the inputs for the Execute Command stage?
1. Command
2. Parameters
3. Do not checkpoint run (select/unselect checkbox)

What are the inputs for the Notification stage?
1. SMTP Mail server name
2. Sender's email address
3. Recipients' email address
4. Email subject
5. Attachment
6. Email body
7. Include job status in email (select/unselect checkbox)
8. Do not checkpoint run (select/unselect checkbox)

What are the inputs for the Wait for file stage?
1. Filename
2. Wait for file to appear / Wait for file to disappear (select one of the two options)
3. Timeout length (disabled if the "Do not timeout" option is selected)
4. Do not timeout
5. Do not checkpoint run

Explain the Nested Condition stage.
The Nested Condition stage is used to branch out to other activities based on trigger conditions.

Explain the Loop stage.
The Loop stage is made up of Start Loop and End Loop. The Start Loop connects to one of the Run activities (preferably Job Activity). This Activity stage connects to the End Loop. The End Loop connects back to the Start Loop activity by means of a reference link. The 2 types of looping are:
1. Numeric (For counter n to n Step n)
2. List (For each thing in list)

Explain Error handling and Restartability.
Error handling is enabled using the "Automatically handle activities that fail" option. Control is passed to the Exception stage when an Activity fails. Restartability is enabled using the "Add checkpoints so sequence is restartable on failure" option. If a sequence fails, then when the Sequence is re-run, activities that completed successfully in the prior run are skipped over (unless the "Do not checkpoint run" option was set for an activity).

Which three are valid ways within a Job Sequence to pass parameters to Activity stages? (Choose three.)
A. ExecCommand Activity stage
B. UserVariables Activity stage
C. Sequencer Activity stage
D. Routine Activity stage
E. Nested Condition Activity stage
Answer: A,B,D

Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.)
A. Equality(Conditional)

B. Unconditional
C. ReturnValue(Conditional)
D. Difference(Conditional)
E. Custom(Conditional)
Answer: B,C,E

A client requires that any job that aborts in a Job Sequence halt processing. Which three activities would provide this capability? (Choose three.)
A. Nested Condition Activity
B. Exception Handler
C. Sequencer Activity
D. Sendmail Activity
E. Job trigger
Answer: A,B,E

Which command can be used to execute DataStage jobs from a UNIX shell script?
A. dsjob
B. DSRunJob
C. osh
D. DSExecute
Answer: A

Which three are the critical stages that would be necessary to build a Job Sequence that: picks up data from a file that will arrive in a directory overnight, launches a job once the file has arrived, and sends an email to the administrator upon successful completion of the flow? (Choose three.)
A. Sequencer
B. Notification Activity
C. Wait For File Activity
D. Job Activity
E. Terminator Activity
Answer: B,C,D

Which two statements describe functionality that is available using the dsjob command? (Choose two.)
A. dsjob can be used to get a report containing job, stage, and link information.
B. dsjob can be used to add a log entry for a specified job.
C. dsjob can be used to compile a job.
D. dsjob can be used to export job executables.
Answer: A,B

Other Topics
Environment Variables
APT_BUFFER_FREE_RUN
This environment variable is available in the DataStage Administrator, under the Parallel category. It specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a decimal representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.

APT_BUFFER_MAXIMUM_MEMORY
Sets the default value of Maximum memory buffer size. The default value is 3145728 (3 MB). Specifies the maximum amount of virtual memory, in bytes, used per buffer.

APT_BUFFER_MAXIMUM_TIMEOUT
DataStage buffering is self-tuning, which can theoretically lead to long delays between retries. This environment variable specifies the maximum wait before a retry in seconds, and is by default set to 1.

APT_BUFFERING_POLICY
This environment variable is available in the DataStage Administrator, under the Parallel category. Controls the buffering policy for all virtual data sets in all steps. The variable has the following settings:
AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock.
FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing considerably.
NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.

APT_DECIMAL_INTERM_PRECISION
Specifies the default maximum precision value for any decimal intermediate variables required in calculations. Default value is 38.

APT_DECIMAL_INTERM_SCALE
Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.

APT_CONFIG_FILE
Sets the path name of the configuration file. (You may want to include this as a job parameter, so that you can specify the configuration file at job run time.)

APT_DISABLE_COMBINATION
Globally disables operator combining. Operator combining is DataStage's default behavior, in which two or more (in fact any number of) operators within a step are combined into one process where possible. You may need to disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run times.

APT_EXECUTION_MODE
By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
ONE_PROCESS one-process mode
MANY_PROCESS many-process mode
NO_SERIALIZE many-process mode, without serialization

APT_ORCHHOME
Must be set by all DataStage Enterprise Edition users to point to the top-level directory of the DataStage Enterprise Edition installation.

APT_STARTUP_SCRIPT
As part of running an application, DataStage creates a remote shell on all DataStage processing nodes on which the job runs. By default, the remote shell is given the same environment as the shell from which DataStage is invoked. However, you can write an optional startup shell script to modify the shell configuration of one or more processing nodes. If a startup script exists, DataStage runs it on remote shells before running your application. APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, DataStage searches ./startup.apt, $APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order. APT_NO_STARTUP_SCRIPT disables running the startup script.

APT_NO_STARTUP_SCRIPT
Prevents DataStage from executing a startup script. By default, this variable is not set, and DataStage runs the startup script. If this variable is set, DataStage ignores the startup script. This may be useful when debugging a startup script. See also APT_STARTUP_SCRIPT.

APT_STARTUP_STATUS
Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a diagnostic if parallel job startup is failing.

APT_MONITOR_SIZE
This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000 records.

APT_MONITOR_TIME
This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.

APT_NO_JOBMON
Turns off job monitoring entirely.

APT_PM_NO_SHARED_MEMORY
By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.

APT_PM_NO_NAMED_PIPES
Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of DataStage, including subprocs and setting up of the shared memory transport protocol in the process manager.

APT_RECORD_COUNTS
Causes DataStage to print, for each operator Player, the number of records consumed by getRecord() and produced by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.

APT_NO_PART_INSERTION
DataStage automatically inserts partition components in your application to optimize the performance of the stages in your job. Set this variable to prevent this automatic insertion.

APT_NO_SORT_INSERTION
DataStage automatically inserts sort components in your job to optimize the performance of the operators in your data flow. Set this variable to prevent this automatic insertion.

APT_SORT_INSERTION_CHECK_ONLY
When sorts are inserted automatically by DataStage, if this is set, the sorts will just check that the order is correct; they won't actually sort. This is a better alternative to switching partition and sort insertion off altogether using APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.

APT_DUMP_SCORE
Configures DataStage to print a report showing the operators, processes, and data sets in a running job.

APT_PM_PLAYER_MEMORY
Setting this variable causes each player process to report the process heap memory allocation in the job log when returning.

APT_PM_PLAYER_TIMING
Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.

OSH_DUMP
If set, it causes DataStage to put a verbose description of a job in the job log before attempting to execute it.

OSH_ECHO
If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.

OSH_EXPLAIN
If set, it causes DataStage to place a terse description of the job in the job log before attempting to run it.

OSH_PRINT_SCHEMAS
If set, it causes DataStage to print the record schema of all data sets and the interface schema of all operators in the job log.

APT_STRING_PADCHAR
Overrides the pad character of 0x0 (ASCII null), used by default when DataStage extends, or pads, a string field to a fixed length.
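Several of these variables are natural debugging switches. A sketch of a wrapper script that turns common diagnostics on for a single run (project and job names hypothetical, as before); the -local option, described in the dsjob section above, makes the job pick up environment variables set in the script when it is run from the project directory:

export APT_DUMP_SCORE=1
export APT_RECORD_COUNTS=1
export APT_PM_PLAYER_TIMING=1
dsjob -run -jobstatus -local myproject LoadCustomers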

XML Stages: XML Meta Data Importer


The XML Meta Data Importer window has the following panes:
• Tree View, which depicts the hierarchical structure in the XML source. This pane is the main view. It is always present and cannot be hidden or docked.
• Source, which contains the original XML schema or XML document, in read-only mode. To compare the tree view with the XML source, you can dock this pane next to the tree view.
• Node Properties, which describes XML and XPath information of the selected element.
• Table Definition, which maps elements that you select in the Tree View.
• Parser Output, which presents XML syntax and semantic errors.
(The illustration showing all XML Meta Data Importer panes except Parser Output is not reproduced here.)

XML Meta Data Importer reports any syntax and semantic errors when you open a source file. In the following example, the Parser Output pane indicates that at least one quote is missing from line 3.

To highlight the error in the Source pane, double-click the error in the Parser Output pane. After correcting the error outside of the XML Meta Data Importer, you can load the revised source file. To reload the file, choose File > Refresh. You can process an XML schema file (.xsd) or an XML document (.xml). The file can be located on your file system or accessed with a URL.

Processing XML Documents


The XML Meta Data Importer retains namespaces and considers every node in an XML hierarchy to be fully qualified with a namespace prefix. The form is: prefix:nodename. This approach applies to documents in which the prefixes are included or unspecified. When prefixes are unspecified, XML Meta Data Importer generates prefixes using the pattern nsN, where N is a sequence number.

Example
The following input does not include a namespace prefix.
Input
<Person xmlns="mynamespace"> <firstName>John</firstName> </Person>

Output
<ns1:Person xmlns:ns1="mynamespace"> <ns1:firstName>John</ns1:firstName> </ns1:Person>

Processing XML Schemas


The XML Meta Data Importer processes namespaces in XML schemas according to three rules:
• General
• Import By Reference
• Target Namespace Unspecified

General Rule
In general, the XML Meta Data Importer assigns the prefix defns to the target namespace. For example:
<xsd:schema targetNamespace="mynamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The firstName node generates the following XPath expression:
/defns:Person/defns:firstName
where defns=mynamespace

Import By Reference Rule


If the schema imports by reference other schemas with different target namespaces, the XML Meta Data Importer assigns a prefix in the form nsN to each of them. To enable this processing, the dependent schema must specify elementFormDefault="qualified". If this is omitted, the elements are considered as belonging to the caller's target namespace.
Example
The following example imports by reference the schema mysecondschema.
<xsd:schema targetNamespace="demonamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:other="othernamespace">
<xsd:import namespace="othernamespace" schemaLocation="mysecondschema.xsd"/>
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="address" type="other:Address" minOccurs="1" maxOccurs="1" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The schema mysecondschema contains the following statements:
<xsd:schema targetNamespace="othernamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="street" minOccurs="1" maxOccurs="1" />
<xsd:element name="city" minOccurs="1" maxOccurs="1" />

<xsd:element name="state" minOccurs="1" maxOccurs="1" />
<xsd:element name="zip" minOccurs="1" maxOccurs="1" />
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
The street node generates the following XPath expression:
/defns:Person/defns:address/ns2:street
where defns=demonamespace and ns2=othernamespace

The Target Namespace Unspecified Rule


When the target namespace is unspecified, XML Meta Data Importer omits the prefix defns from XPath expressions. For example:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

The firstName tree node generates the following XPath expression: /Person/firstName

Mapping Nodes from an XML Schema


You can individually choose elements and attributes, or select all leaf nodes except empty ones in one step.
Choosing Individual Items
Select the box that is next to the item that you want to map. In the following example, there are elements and text nodes. The three TEXT nodes are selected.

If you select an element box, you get all the sub-nodes and the actual content of the element. Your selection is reflected in the Table Definition pane.

An asterisk appears after the title Table Definition when you modify the table definition. It disappears when you save the information.

Selecting All Nodes


You can simplify selecting all leaf nodes by using the Auto-check command. This command checks leaf nodes. XML Meta Data Importer ignores leaf nodes in the following circumstances:
• Nodes are empty.
• In a branch in which a node represents a reference to an element or a type defined elsewhere, such as an included schema. To avoid recursive looping, which may be deep in the sub-schema, the node is not expanded. You may manually expand the reference branch down to a specific level, and run the Auto-check command on the top branch node. This action selects all nodes in the branch.
• The node represents a detected recursion. This happens with a schema in which, for example, a person node has a child node that is itself a person. You may manually expand the recursive branch and run the Auto-check command to select all nodes in the branch.
To run Auto-check, choose Edit > Auto-check. The nodes appear in the Table Definition pane.
The default table definition name depends on the XML source:
- UNC name: original file name without extension
- URL: the value "New"
- XML document: original XML document filename
- XML schema: original XML schema filename

XML Input Stage


XML Input stage is used to transform hierarchical XML data to flat relational tables. XML Input stage supports a single input link and one or more output links.
XML Input performs two XML validations when the server job runs:
• Checks for well-formed XML.
• Optionally checks that elements and attributes conform to any XML schema that is referenced in the document. You control this option.
The XML parser reports three types of conditions: fatal, error, and warning.
• Fatal errors are thrown when the XML is not well-formed.
• Non-fatal errors are thrown when the XML violates a validity constraint. For example, the root element in the document is not found in the validating XML schema.
• Warnings may be thrown when the schema has duplicate definitions.
XML Input supports one Reject link, which can store rejection messages and rejected rows.

Writing Rejection Messages to the Link


To write rejection messages to a Reject link:
1. Add a column on the Reject link.
2. Using the General page of the Output Link properties, identify the column as the target for rejection messages.

Writing Rejected Rows to the Link


To write rejected rows to a Reject link, add a column on the Reject link that has the same name as the column on the input link that contains or references the XML document. This is a pass-through operation. Column names for this operation are case-sensitive. Pass-through is available for any input column.

Controlling Output Rows


To populate the columns of an output row, XML Input uses XPath expressions that are specified on the output link. XPath expressions locate elements, attributes, and text nodes.

Controlling the Number of Output Rows


You must designate one column on the output link as the repetition element. A repetition element consists of an XPath expression. For each occurrence of the repetition element, XML Input always generates a row. By varying the repetition element and using a related option, you can control the number of output rows.

Identifying the Repetition Element


To identify the repetition element, set the Key property to Yes on the output link.

Transformation Settings
These properties control the values that can be shared by multiple output links of the XML Input stage. They fall into these categories:
• Requiring the repetition element
• Processing NULLs and empty values
• Processing namespaces
• Formatting extracted XML fragments
To use these values with a specific output link, select the Inherit Stage properties box on the Transformation Settings tab of the output link.
These properties control the values that can be shared by multiple output lin(s of the F$, :nput stage. They fall into these categories" Y /e8uiring the repetition element Y Processing ?U,,s and empty values Y Processing namespaces Y 3ormatting extracted F$, fragments To use these values with a specific output lin(' select the -nherit Stage properties box on the Transformation #ettings tab of the output lin(.

XML Output Stage


XML Output stage is used to transform tabular data, such as relational tables and sequential files, to XML hierarchical structures. XML Output stage supports a single input link and zero or one output links.

XML Output requires XPath expressions to transform tabular data to XML. A table definition stores the XPath expressions. Using the Description property on the Columns pages within the stage, you record or maintain the XPath expressions.

Aggregating Input Rows on Output


You have several options for aggregating input rows on output:
• Aggregate all rows in a single output row. This is the default option.
• Generate one output row per input row. This is the Single row option.
• Trigger a new output row when the value of an input column changes.
• Trigger a new output row when the value of a pass-through column changes. A pass-through column is an output column that has no XPath expression in the Description property and whose name exactly matches the name of an input column.

Job Management and Deployment


Quick Find
1. Name to find

2. Types to find
3. Include descriptions (If checked, the text in short and long descriptions will be searched)

Advanced Find Filtering Options


1. Type – Type of object (Job, Table Definition, etc.)
2. Creation – Date range
3. Last Modification – Date range
4. Where used
5. Dependencies of
6. Options – Case sensitivity and Search within last result set

Impact Analysis
Right-click over a stage or table definition:
1. Select "Find where table definitions used"
2. Select "Find where table definitions used (deep)" – deep includes additional object types
Displays a list of objects using the table definition.
1. Select "Find dependencies"
2. Select "Find dependencies (deep)" – deep includes additional object types
Displays a list of objects dependent on the one selected.
Graphical functionality:
1. Display the dependency path
2. Collapse selected objects
3. Move the graphical object
4. "Bird's-eye" view

Comparison
1. Cross project compare
2. Compare against
The two objects that can be compared are 1. Jobs and 2. Table Definitions.

Aggregator Stage
1. Grouping keys
2. Aggregations
Aggregation Type – Count Rows, Calculation, Re-Calculation
Aggregation Type – Count Rows: Count Output Column – the name of the output column that holds the number of records in each group, based on the grouping keys.
Aggregation Type – Calculation, Re-Calculation: Column for Calculation – the input column to be selected for calculation.
Options:
Allow Null Output – True means that NULL is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. False means 0 is output when all input values for the calculation column are NULL.
Method – Hash (hash table) or Sort (pre-sort). The default method is Hash.
Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory. Sort mode requires the input data set to have been partition-sorted with all of the grouping keys specified as hashing and sorting keys.
Use the Hash method for inputs with a limited number of distinct groups:
1. Uses 2K of memory per group
2. Calculations are made for all groups and stored in memory (in a hash table structure, hence the name)
3. Incoming data does not need to be pre-sorted
4. Results are output after all rows have been read
5. Useful when the number of unique groups is small
Use the Sort method with a large (or unknown) number of distinct key column values:
1. Requires inputs pre-sorted on key columns (does not perform the sort; expects the sort)
2. Results are output after each group
3. Can handle an unlimited number of groups
Sort Aggregator – one of the lightweight stages that minimize memory usage by requiring data in key column sort order.
Lightweight stages that minimize memory usage by requiring data in key column sort order:
1. Join
2. Merge
3. Sort Aggregator

Sort Stage
DataStage Designer provides two methods for parallel (group) sorting:
1. Sort stage – parallel execution
2. Sort on a link when the partitioning is not Auto – identified by the Sort icon
Both methods use the same tsort operator. Sorting on a link provides easier job maintenance (fewer stages on the job canvas) but fewer options.

The Sort stage offers more options than a link sort. The Sort Utility should be DataStage, as it is faster than the UNIX sort.
A stable sort preserves the order of non-key columns within each sort group, but is slightly slower than a non-stable sort. Stable sort is enabled by default on Sort stages but not on link sorts. If disabled, no prior ordering of records is guaranteed to be preserved by the sorting operation.
Sort Key Modes:
1. Don't Sort (Previously Sorted) means that input records are already sorted by this key. The Sort stage will then sort on secondary keys, if any.
2. Don't Sort (Previously Grouped) means that input records are already grouped by that key, but not sorted.
3. Sort – sort by this key.
Advantages of Don't Sort (Previously Sorted):
1. Uses significantly less memory/disk
2. Sort is now on previously sorted key column groups, not the entire data set
3. Outputs rows after each group
DataStage provides two methods for generating a sequentially (totally) sorted result:
1. Sort stage – sequential execution mode
2. Sort Merge Collector
In general a parallel Sort plus Sort Merge Collector will be faster than a sequential Sort.
By default the parallel framework will insert tsort operators as necessary to ensure correct results. But by setting $APT_SORT_INSERTION_CHECK_ONLY we can force the inserted tsort operators to verify that the data is sorted instead of actually performing the sort operation.
By default each tsort operator (Sort stage, link sort and inserted sort) uses 20 MB per partition as an internal memory buffer. The Sort stage provides the "Restrict Memory Usage" option:
1. Increasing this value can improve performance if the entire (or group) data can fit into memory
2. Decreasing this value may hurt performance, but will use less memory
When the memory buffer is filled, sort uses temporary disk space in the following order:
1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
Removing Duplicates:
Can be done by the Sort stage, using the Unique option. There is no choice of which duplicate to keep: a stable sort always retains the first row in the group; a non-stable sort is indeterminate.
Or by the Remove Duplicates stage, which can choose to retain the first or the last row.
