
Using Configuration Files in DataStage: Best Practices & Performance Tuning

The configuration file tells DataStage Enterprise Edition how to exploit the underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources. When you modify the system by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without requiring changes to the job design.

There is not necessarily one ideal configuration file for a given system, because different jobs place very different demands on it. For this reason, multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources. At runtime, the configuration file is specified through the environment variable $APT_CONFIG_FILE.

LOGICAL PROCESSING NODES


The configuration file defines one or more EE processing nodes on which parallel jobs will run. EE processing nodes are a logical rather than a physical construct; the number of processing nodes does not necessarily correspond to the actual number of CPUs in your system. Within a configuration file, the number of processing nodes defines the degree of parallelism and the resources that a particular job will use to run. It is up to the UNIX operating system to actually schedule and run the processes that make up a DataStage job across the physical processors. A configuration file with a larger number of nodes generates a larger number of processes that use more memory (and perhaps more disk activity) than a configuration file with a smaller number of nodes.

While the DataStage documentation suggests creating half as many nodes as physical CPUs, this is a conservative starting point that is highly dependent on system configuration, resource availability, job design, and other applications sharing the server hardware. For example, if a job is highly I/O-dependent or dependent on external (e.g. database) sources or targets, it may be appropriate to have more nodes than physical CPUs. For typical production environments, a good starting point is to set the number of nodes equal to the number of CPUs. For development environments, which are typically smaller and more resource-constrained, create smaller configuration files (e.g. 2-4 nodes). Note that even in the smallest development environments, a 2-node configuration file should be used to verify that job logic and partitioning will work in parallel (as long as the test data can sufficiently identify data discrepancies).
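As an illustration, a minimal two-node development configuration of the kind described above might look like the sketch below. The host name (devbox) and file-system paths (/devfs0, /devfs1) are hypothetical; adjust them to your environment.

{
  node "n0" {
    pools ""
    fastname "devbox"                                 /* hypothetical host name (uname -n) */
    resource scratchdisk "/devfs0/ds/scratch" {}      /* temporary storage for sorts, buffering */
    resource scratchdisk "/devfs1/ds/scratch" {}
    resource disk "/devfs0/ds/disk" {}                /* persistent data set storage */
    resource disk "/devfs1/ds/disk" {}
  }
  node "n1" {
    pools ""
    fastname "devbox"
    resource scratchdisk "/devfs1/ds/scratch" {}      /* reverse the order on the second node */
    resource scratchdisk "/devfs0/ds/scratch" {}
    resource disk "/devfs1/ds/disk" {}
    resource disk "/devfs0/ds/disk" {}
  }
}

To run a job against this file, point the $APT_CONFIG_FILE environment variable (commonly overridden as a job parameter) at its full path.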

OPTIMIZING PARALLELISM
The degree of parallelism of a DataStage EE application is determined by the number of nodes you define in the configuration file. Parallelism should be optimized rather than maximized. Increasing parallelism may distribute your workload better, but it also adds overhead because the number of processes increases. Therefore, you must weigh the gains of added parallelism against the potential losses in processing efficiency. The CPUs, memory, disk controllers, and disk configuration that make up your system influence the degree of parallelism you can sustain.

Keep in mind that the more evenly data is partitioned, the better the overall performance of an application running in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew.

When business requirements dictate a partitioning strategy that is excessively skewed, remember to change the partition strategy to a more balanced one as soon as possible in the job flow. This will minimize the effect of data skew and significantly improve overall job performance.

Configuration File Examples


Given the large number of considerations for building a configuration file, where do you begin? For starters, the default configuration file (default.apt) created when DataStage is installed is appropriate for only the most basic environments.

The default configuration file has the following characteristics:
- number of nodes = number of physical CPUs
- disk and scratchdisk storage use subdirectories within the DataStage install filesystem

You should create and use a new configuration file that is optimized to your hardware and file systems. Because different job flows have different needs (CPU-intensive? memory-intensive? disk-intensive? database-intensive? sorts? the need to share resources with other jobs/databases/applications?), it is often appropriate to have multiple configuration files, each optimized for a particular type of processing.

With the synergistic relationship between hardware (number of CPUs, speed, cache, available system memory, number and speed of I/O controllers, local vs. shared disk, RAID configurations, disk size and speed, network configuration and availability), software topology (local vs. remote database access, SMP vs. clustered processing), and job design, there is no definitive science for formulating a configuration file. This section attempts to provide some guidelines based on experience with actual production applications.

IMPORTANT: Follow the order of all sub-items within individual node specifications in the example configuration files given in this section.

Example for Any Number of CPUs and Any Number of Disks

Assume you are running on a shared-memory multi-processor system (an SMP server), which is the most common platform today. Let's assume these properties:

- computer host name: fastone
- 6 CPUs
- 4 separate file systems on 4 drives, named /fs0, /fs1, /fs2, /fs3

You can adjust the sample to match your precise environment.

The configuration file you would use as a starting point would look like the one below. Assuming that the system load from processing outside of DataStage is minimal, it may be appropriate to create one node per CPU as a starting point.

In the following example, the important point is the way disk and scratchdisk resources are handled.

{ /* config files allow C-style comments */
  /* Configuration files do not have flexible syntax. Keep all the sub-items of
   * the individual node specifications in the order shown here. */
  node "n0" {
    pools ""                                   /* on an SMP, node pools aren't used often */
    fastname "fastone"
    resource scratchdisk "/fs0/ds/scratch" {}  /* start with fs0 */
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource disk "/fs0/ds/disk" {}            /* start with fs0 */
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
  }
  node "n1" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs1/ds/scratch" {}  /* start with fs1 */
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs1/ds/disk" {}            /* start with fs1 */
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
  }
  node "n2" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs2/ds/scratch" {}  /* start with fs2 */
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource scratchdisk "/fs1/ds/scratch" {}
    resource disk "/fs2/ds/disk" {}            /* start with fs2 */
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
    resource disk "/fs1/ds/disk" {}
  }
  node "n3" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs3/ds/scratch" {}  /* start with fs3 */
    resource scratchdisk "/fs0/ds/scratch" {}
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource disk "/fs3/ds/disk" {}            /* start with fs3 */
    resource disk "/fs0/ds/disk" {}
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
  }
  node "n4" {
    pools ""
    fastname "fastone"
    /* We have now rotated through, starting each node with a different disk, but the
     * fundamental problem in this scenario is that there are more nodes than disks.
     * So what do we do now? The answer: something that is not perfect. We're going to
     * repeat the sequence. You could shuffle differently, i.e. use /fs0 /fs2 /fs1 /fs3
     * as an order, but that most likely won't matter. */
    resource scratchdisk "/fs0/ds/scratch" {}  /* start with fs0 again */
    resource scratchdisk "/fs1/ds/scratch" {}
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource disk "/fs0/ds/disk" {}            /* start with fs0 again */
    resource disk "/fs1/ds/disk" {}
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
  }
  node "n5" {
    pools ""
    fastname "fastone"
    resource scratchdisk "/fs1/ds/scratch" {}  /* start with fs1 */
    resource scratchdisk "/fs2/ds/scratch" {}
    resource scratchdisk "/fs3/ds/scratch" {}
    resource scratchdisk "/fs0/ds/scratch" {}
    resource disk "/fs1/ds/disk" {}            /* start with fs1 */
    resource disk "/fs2/ds/disk" {}
    resource disk "/fs3/ds/disk" {}
    resource disk "/fs0/ds/disk" {}
  }
} /* end of entire config */

The pattern of the configuration file above is a "give every node all the disks" example, albeit with the disks in different orders to minimize I/O contention. This configuration method works well when the job flow is complex enough that it is difficult to determine and precisely plan for good I/O utilization.

Within each node, EE does not stripe the data across multiple filesystems. Rather, it fills the disk and scratchdisk filesystems in the order specified in the configuration file. In the example above, the order of the disks is purposely shifted for each node in an attempt to minimize I/O contention.

Even in this example, giving every partition (node) access to all the I/O resources can cause contention, but EE attempts to minimize this by using fairly large I/O blocks.

This configuration style works for any number of CPUs and any number of disks since it doesn't require any particular correspondence between them. The heuristic here is: when it's too difficult to figure out precisely, at least go for achieving balance.

DataStage BASIC functions



These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.

Specify the job you want to control - DSAttachJob
Set parameters for the job you want to control - DSSetParam
Set limits for the job you want to control - DSSetJobLimit
Request that a job is run - DSRunJob
Wait for a called job to finish - DSWaitForJob
Get the metadata details for the specified link - DSGetLinkMetaData
Get information about the current project - DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage - DSGetIPCStageProps
Get information about the controlled job or current job - DSGetJobInfo
Get information about the meta bag properties associated with the named job - DSGetJobMetaBag
Get information about a stage in the controlled job or current job - DSGetStageInfo
Get the names of the links attached to the specified stage - DSGetStageLinks
Get a list of stages of a particular type in a job - DSGetStagesOfType
Get information about the types of stage in a job - DSGetStageTypes
Get information about a link in a controlled job or current job - DSGetLinkInfo
Get information about a controlled job's parameters - DSGetParamInfo
Get the log event from the job log - DSGetLogEntry
Get a number of log events on the specified subject from the job log - DSGetLogSummary
Get the newest log event, of a specified type, from the job log - DSGetNewestLogId
Log an event to the job log of a different job - DSLogEvent
Stop a controlled job - DSStopJob
Return a job handle previously obtained from DSAttachJob - DSDetachJob
Log a fatal error message in a job's log file and abort the job - DSLogFatal
Log an information message in a job's log file - DSLogInfo
Put an informational message in the job log of a job controlling the current job - DSLogToController
Log a warning message in a job's log file - DSLogWarn
Generate a string describing the complete status of a valid attached job - DSMakeJobReport
Insert arguments into the message template - DSMakeMsg
Ensure a job is in the correct state to be run or validated - DSPrepareJob
Interface to the system send mail facility - DSSendMail
Log a warning message to a job log file - DSTransformError
Convert a job control status or error code into an explanatory text message - DSTranslateCode
Suspend a job until a named file either exists or does not exist - DSWaitForFile
Check whether a BASIC routine is cataloged, either in the VOC as a callable item or in the catalog space - DSCheckRoutine
Execute a DOS or DataStage Engine command from a before/after subroutine - DSExecute
Set a status message for a job to return as a termination message when it finishes - DSSetUserStatus
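To show how several of these functions fit together, here is a minimal job control routine sketch in DataStage BASIC. The job name (LoadCustomers) and parameter name (ProcessDate) are hypothetical; the routine attaches the job, sets a parameter and a warning limit, runs it, waits for completion, checks the finishing status, and detaches the handle.

      * Attach the job we want to control; DSJ.ERRFATAL aborts this routine on fatal errors
      hJob = DSAttachJob("LoadCustomers", DSJ.ERRFATAL)

      * Set a job parameter and a warning limit before requesting the run
      ErrCode = DSSetParam(hJob, "ProcessDate", "2010-07-02")
      ErrCode = DSSetJobLimit(hJob, DSJ.LIMITWARN, 50)

      * Request a normal run and wait for the called job to finish
      ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob)

      * Check the finishing status and record the outcome in this job's log
      Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
      If Status = DSJS.RUNOK Or Status = DSJS.RUNWARN Then
         Call DSLogInfo("LoadCustomers finished with status " : Status, "JobControl")
      End Else
         Call DSLogWarn("LoadCustomers failed with status " : Status, "JobControl")
      End

      * Release the job handle obtained from DSAttachJob
      ErrCode = DSDetachJob(hJob)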

DataStage file FAQ



1. APT_CONFIG_FILE is the environment variable through which DataStage determines the configuration file to use (a project can have many configuration files); in practice this is what is generally used in production. However, if this environment variable is not defined, how does DataStage determine which file to use?
   1. If the APT_CONFIG_FILE environment variable is not defined, DataStage looks for the default configuration file (config.apt) in the following locations:
      1. the current working directory;
      2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.

2. What are the different options a logical node can have in the configuration file? (A brief configuration sketch illustrating these options appears at the end of this FAQ.)

   1. fastname - the physical node name that stages use to open connections for high-volume data transfers. The value of this option is usually the network name; typically you can get this name by using the Unix command uname -n.
   2. pools - the names of the pools to which the node is assigned. Based on the characteristics of the processing nodes, you can group nodes into sets of pools.
      1. A pool can be associated with many nodes, and a node can be part of many pools.
      2. A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.
      3. A parallel job, or a specific stage within it, can be constrained to run on a pool (a set of processing nodes).
         1. If both the job and a stage within the job are constrained to run on specific processing nodes, the stage runs on the nodes that are common to the stage and the job.
   3. resource - resource resource_type "location" [{pools "disk_pool_name"}] | resource resource_type "value". The resource_type can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (a directory for reading/writing persistent data), scratchdisk (the quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored; it is local to the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE, etc.).

3. How does DataStage decide on which processing node a stage should run?
   1. If a job or stage is not constrained to run on specific nodes, the parallel engine executes a parallel stage on all nodes defined in the default node pool (the default behavior).
   2. If the node is constrained, the constrained processing nodes are chosen while executing the parallel stage (refer to 2.2.3 for more detail).

4. When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs; the node from which jobs are started is called the conductor node, and for the other nodes you do not need to specify the physical node. Also, you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other by very high-speed network switches. How do you configure your system so that you can still achieve optimized parallelism?
   1. Make sure that none of the stages are specified to run on the conductor node.
   2. Use the conductor node just to start the execution of the parallel job.
   3. Make sure that the conductor node is not part of the default pool.

5. Although parallelization increases the throughput and speed of the process, why is maximum parallelization not necessarily the optimal parallelization?
   1. DataStage creates one process for every stage for each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with 4 stages, and we define 3 logical nodes (one corresponding to each physical node, i.e. CPU). DataStage will start 3*4 = 12 processes, which have to be managed by a single operating system; significant time will be spent in context switching and scheduling the processes.

6. Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So how do you decide which node will be suitable for which stage?
   1. If a stage performs a memory-intensive task, it should be run on a node that has more disk space available for it; e.g. sorting data is a memory-intensive task and should be run on such nodes.
   2. If a stage depends on a licensed version of software (e.g. the SAS stage, RDBMS-related stages, etc.), you need to associate that stage with the processing node that is physically mapped to the machine on which the licensed software is installed. (Assumption: the machine on which the licensed software is installed is connected to the other machines through a high-speed network.)
   3. If a job contains stages that exchange large amounts of data, they should be assigned to nodes where the stages communicate by either shared memory (SMP) or a high-speed link (MPP) in the most optimized manner.

7. Nodes are basically a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node; the conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment onto them. However, it is possible to create a startup script that selectively changes the environment on a specific node. This script has the default name startup.apt, but, like the main configuration file, we can have many startup scripts, and the appropriate one can be picked up using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?
   1. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shell.

8. What are the generic things one must follow while creating a configuration file so that optimal parallelization can be achieved?
   1. Consider avoiding the disk(s) that your input files reside on.
   2. Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles, even if they are located on a RAID (Redundant Array of Inexpensive Disks) system.
   3. Know what is real and what is NFS:
      1. Real disks are directly attached, or are reachable over a SAN (storage-area network: dedicated, just for storage, low-level protocols).
      2. Never use NFS file systems for scratchdisk resources; remember that scratchdisks are also used for temporary storage of files/data during processing.
      3. If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that does not mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a final disk pool, constrain the result sequential file or data set to reside there, and let intermediate storage go to local or SAN resources, not NFS.
   4. Know which file systems are striped (RAID) and which are not. Where possible, avoid striping across file systems that are already striped at the spindle level.
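The following minimal configuration sketch (referenced above) ties several of these points together: a node in a named pool for stages that depend on locally installed (e.g. licensed database) software, local/SAN scratchdisk resources, and a separate disk pool reserved for final results on NFS. The host names, pool names, and paths (etl1, etl2, db2pool, final, /local0, /local1, /nfs/results) are hypothetical and purely illustrative.

{
  node "n0" {
    pools ""                                        /* member of the default pool only */
    fastname "etl1"                                 /* hypothetical host name */
    resource scratchdisk "/local0/ds/scratch" {}    /* local or SAN storage, never NFS */
    resource disk "/local0/ds/disk" {}
    resource disk "/nfs/results/ds" {pools "final"} /* NFS area reserved for final data sets */
  }
  node "n1" {
    pools "" "db2pool"                              /* also in a named pool for DB2-related stages */
    fastname "etl2"
    resource scratchdisk "/local1/ds/scratch" {}
    resource disk "/local1/ds/disk" {}
  }
}

A stage (or job) constrained to the db2pool node pool will run only on n1, while a data set constrained to the final disk pool will be written to the NFS results area and intermediate data sets will stay on the local resources.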
