
Configuration file

1. APT_CONFIG_FILE is the environment variable through which DataStage determines
which configuration file to use (one can have many configuration files for a
project). In fact, this is what is generally used in production. However, if this
environment variable is not defined, how does DataStage determine which file to use?
1. If the APT_CONFIG_FILE environment variable is not defined, then DataStage
looks for the default configuration file (config.apt) in the following locations:
1. The current working directory.
2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level
directory of the DataStage installation.
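In practice the variable is exported in the environment before the engine starts, or set per project/job in the Administrator client. A minimal sketch (the path and file name below are illustrative, not standard locations):

```shell
# Point the parallel engine at a project-specific configuration file.
# The path below is an example only; substitute your own install location.
export APT_CONFIG_FILE=/opt/ds/configurations/prod_4node.apt
echo "$APT_CONFIG_FILE"
```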
2. What are the different options a logical node can have in the configuration
file?
1. Fastname – The fastname is the physical node name that stages use to
open connections for high-volume data transfers. The value of this option is
usually the network name. Typically, you can get this name by using the Unix
command ‘uname -n’.
2. Pools – Names of the pools to which the node is assigned. Based on the
characteristics of the processing nodes, you can group nodes into sets of pools.
1. A pool can be associated with many nodes, and a node can be part of many
pools.
2. A node belongs to the default pool unless you explicitly specify a pools list for
it and omit the default pool name (“”) from the list.
3. A parallel job, or a specific stage in a parallel job, can be constrained to run
on a pool (a set of processing nodes).
1. If both the job and a stage within the job are constrained to run on specific
processing nodes, the stage runs on the nodes that are common to both the stage
and the job.

3. Resource – The syntax is: resource resource_type “location”
[{pools “disk_pool_name”}] | resource resource_type “value”. The resource type
can be canonicalhostname (the quoted Ethernet name of a node in a cluster that
is not connected to the conductor node by the high-speed network), disk (a
directory to which persistent data is read and written), scratchdisk (the quoted
absolute path name of a directory on a file system where intermediate data is
temporarily stored; it is local to the processing node), or an RDBMS-specific
resource (e.g. DB2, INFORMIX, ORACLE, etc.).
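Putting these options together, a minimal single-node configuration file might look like the following sketch (the node name, fastname, and directory paths are illustrative, not defaults):

```
{
  node "node1"
  {
    fastname "prod-etl-01"
    pools ""
    resource disk "/ds/data" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
}
```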
3. How does DataStage decide on which processing node a stage should run?
1. If a job or stage is not constrained to run on specific nodes, the parallel
engine executes a parallel stage on all nodes defined in the default node pool.
(Default behavior.)
2. If the stage is constrained, the constrained processing nodes are chosen
when executing the parallel stage. (Refer to 2.2.3 for more detail.)
4. When configuring an MPP, you specify the physical node in your system on
which the parallel engine will run your parallel jobs. This is called the conductor
node. For other nodes, you do not need to specify the physical node. Also, you
need to copy the (.apt) configuration file only to the nodes from which you
start parallel engine applications. It is possible that the conductor node is not
connected to the high-speed network switches, while the other nodes are
connected to each other through very high-speed network switches. How
do you configure your system so that you can still achieve optimized
parallelism?
1. Make sure that none of the stages are specified to run on the conductor
node.
2. Use the conductor node just to start the execution of the parallel job.

3. Make sure that the conductor node is not part of the default pool.
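One way to express this in the configuration file is to give the conductor machine only a named pool and omit the default pool (“”) from its pools list, so no stages run there by default. A sketch (all node names, fastnames, pool names, and paths are illustrative):

```
{
  node "conductor"
  {
    fastname "head-node"
    pools "conductor"
  }
  node "node1"
  {
    fastname "compute-01"
    pools ""
    resource disk "/ds/data" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
}
```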
5. Although parallelization increases the throughput and speed of the process,
why is maximum parallelization not necessarily the optimal parallelization?
1. DataStage creates one process for every stage on each processing node.
Hence, if the hardware resources are not available to support the maximum
parallelization, the performance of the overall system goes down. For example,
suppose we have an SMP system with three CPUs and a parallel job with 4 stages,
and we define 3 logical nodes (one corresponding to each physical node, i.e. CPU).
DataStage will then start 3*4 = 12 processes, which have to be managed by a
single operating system. Significant time will be spent in context switching and
process scheduling.
6. Since we can have different logical processing nodes, it is possible that some
nodes will be more suitable for some stages while other nodes will be more
suitable for other stages. So, how do you decide which node will be suitable for
which stage?
1. If a stage performs a memory-intensive task, it should be run on a
node that has more memory, and more scratch disk space available for spilled
intermediate data. E.g. sorting data is a memory-intensive task and should
be run on such nodes.
2. If a stage depends on a licensed version of software (e.g. the SAS stage,
RDBMS-related stages, etc.), then you need to associate that stage with the
processing node which is physically mapped to the machine on which the
licensed software is installed.
(Assumption: the machine on which the licensed software is installed is connected
to the other machines through a high-speed network.)
3. If a job contains stages that exchange large amounts of data, they
should be assigned to nodes where the stages can communicate in the most
optimized manner, by either shared memory (SMP) or a high-speed link (MPP).
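For example, nodes with large memory and fast scratch disks can be grouped into a named pool (the pool name “sort” below is illustrative, as are all names and paths), and a memory-intensive stage can then be constrained to that pool from the stage’s node pool constraints:

```
{
  node "node2"
  {
    fastname "bigmem-01"
    pools "" "sort"
    resource scratchdisk "/fast/scratch" {pools "sort"}
  }
}
```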
7. Basically, nodes are sets of machines (especially in MPP systems).
You start the execution of parallel jobs from the conductor node. The
conductor node creates a shell on the remote machines (depending on the
processing nodes) and copies the same environment to them.
However, it is possible to create a startup script that will selectively change
the environment on a specific node. This script has the default name
startup.apt. However, like the main configuration file, we can also have many
startup scripts. The appropriate script can be
picked up using the environment variable APT_STARTUP_SCRIPT. What is the use
of the APT_NO_STARTUP_SCRIPT environment variable?
1. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct
the parallel engine not to run the startup script on the remote shell.
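The two variables are set in the engine’s environment before job startup; a sketch (the script path is illustrative):

```shell
# Select a node-specific startup script instead of the default startup.apt.
# The path below is an example only; substitute your own location.
export APT_STARTUP_SCRIPT=/opt/ds/scripts/startup_nodeA.apt

# Alternatively, to skip running any startup script on the remote shells:
export APT_NO_STARTUP_SCRIPT=1
```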
8. What generic guidelines must one follow while creating a configuration
file so that optimal parallelization can be achieved?
1. Consider avoiding the disk or disks that your input files reside on.
2. Ensure that the different file systems mentioned as the disk and scratchdisk
resources hit disjoint sets of spindles, even if they are located on a RAID
(Redundant Array of Inexpensive Disks) system.
3. Know what is real and what is NFS:
1. Real disks are directly attached, or are reachable over a SAN (storage area
network: dedicated, just for storage, low-level protocols).
2. Never use NFS file systems for scratchdisk resources; remember that scratch
disks are also used for temporary storage of files/data during processing.
3. If you use NFS file system space for disk resources, then you need to know
what you are doing. For example, your final result files may need to be written
out onto the NFS disk area, but that does not mean the intermediate data sets
created and used temporarily in a multi-job sequence should use this NFS disk
area. It is better to set up a “final” disk pool and constrain the result sequential
file or data set to reside there, but let intermediate storage go to local or SAN
resources, not NFS.
4. Know which disks are striped (RAID) and which are not. Where possible,
avoid striping across disks that are already striped at the spindle level.
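Following these guidelines, a node’s disk and scratchdisk resources would point at separate, locally attached (or SAN) file systems, with a separate “final” pool for result data sets. A sketch (all names, pool names, and paths are illustrative):

```
{
  node "node1"
  {
    fastname "compute-01"
    pools ""
    resource disk "/fs1/ds/data" {pools ""}
    resource disk "/fs2/ds/final" {pools "final"}
    resource scratchdisk "/fs3/ds/scratch" {pools ""}
  }
}
```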
