m_attach: Ab Initio provides this shell command to facilitate remote startup on large parallel
systems.
m_env: Displays the current settings of the Ab Initio environment variables. Invoke m_env with
the option h for added help (m_env h).
Environment Variables: Set these environment variables if we want a value different from the
default.
XX_TIMEOUT=seconds
XX_MAX_RECORD_BUFFER=bytes
XX_NICE=priority
XX_SORT_MAX_CORE=megabytes: The default value for the max-core argument to the local-sort
component. Default is 10 megabytes.
Debugging:
IWAIT=true
XX_DEBUG=value
DISPLAY=display_id
TRACE_ALL_SOCS=path
LAUNCHER_TRACE
An Ab Initio application is a set of mp commands, beginning with mp job and ending (usually)
with mp run. In between are commands that identify the program components and indicate the
flow of data from one to the next. Thus, the mp script usually defines and runs the job.
When a script is invoked, the mp job command executes. At this point, the system creates two
files in the current working directory:
jobname.job: A text file. As the rest of the script is read, a text representation of the
application being defined is placed here.
.abinitio-current-job: This file contains jobname, which lets the system know the
name of the current job.
If two or more mp jobs are running in the same directory at the same time, one job will overwrite
the other's .abinitio-current-job file. To avoid this problem, use the environment variable
AB_JOB. When AB_JOB is set, all mp commands use its value as the name of the current mp
job, ignoring the name stored in .abinitio-current-job.
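As a minimal sketch of the point above, a wrapper script might export a unique AB_JOB value before running the script that contains the mp commands. The base name nightly_load, the process-ID suffix, and the script name my_graph.ksh are hypothetical and are not taken from the Ab Initio documentation.

```shell
#!/bin/sh
# Hypothetical wrapper: give each run its own job name through AB_JOB so that
# two jobs started in the same directory do not depend on the shared
# .abinitio-current-job file.
# ASSUMPTION: the naming scheme (base name + process ID) is for illustration only.
job_base="nightly_load"
AB_JOB="${job_base}_$$"
export AB_JOB

echo "Current mp job name: $AB_JOB"

# ASSUMPTION: my_graph.ksh is a placeholder for the script holding the mp commands.
if [ -x ./my_graph.ksh ]; then
    ./my_graph.ksh
fi
```

Because AB_JOB is exported, every mp command started by the wrapper inherits the same job name.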
An Ab Initio application may be designed to execute in sequential phases, with or without
checkpointing, which means saving state to disk between phases.
Phased execution is enabled from within the application, if the script developer has inserted the
command mp phase or mp checkpoint between one component and another.
Phasing makes a difference in how the application uses the system resources, often trading off
performance for safety. Phasing inhibits pipeline parallelism but guarantees that
resource-intensive stages will not compete with each other.
When a job does not complete normally, it leaves a file in the working directory on the host
system with the name jobname.rec. This file contains a set of pointers to the log files on the host
and on every node. The log files are placed in the subdirectories that are created when the
application starts and deleted when the application successfully completes.
If the application encounters a software failure, all nodes and their respective files will be rolled
back to their initial state, as if the application had never been run. If the program contains
checkpoint commands, the state restored is that of the most recent checkpoint.
Specifically, the Ab Initio system will:
Kill all processes running on all nodes, including control processes and processes that
constitute the partitions of a parallel program.
Cleanly shut down all data flows.
Roll back the effects of all file changes.
Report the state of the system.
Exit.
It is not always possible for the Co>Operating System to restore the system to an earlier state. For
example, a failure could occur because a node or its native operating system crashed. In this
case, it is not possible to cleanly shut down flow or file operations, nor to roll back file operations
performed in the current phase. In fact, it is likely that stray files (intermediate temporaries) will
be left behind. To complete the cleanup and get the job running again, perform a
manual rollback with the command m_rollback.
m_rollback [-d] [-i] [-h] recovery_file
-d: Delete the job along with its recovery file and any log files it created.
-i: Display the state of the job and prompt the user whether the job should be deleted.
If the -i option is not used, jobs that have reached their first checkpoint will be rolled back to the
checkpoint. Jobs that do not include checkpoints or that did not reach their first checkpoint will be
deleted.
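The following sketch shows how a recovery step might check for a leftover recovery file and print the m_rollback command that would inspect it. It only prints the command and does not run it. The job name nightly_load and the helper rollback_hint are hypothetical; only the -i option and the jobname.rec naming come from the text above.

```shell
#!/bin/sh
# Sketch: look for the recovery file left by a failed job and show the
# m_rollback command that would inspect it. Nothing is rolled back here.
# ASSUMPTION: the job name is a placeholder.
jobname="nightly_load"
recovery_file="${jobname}.rec"

rollback_hint() {
    # $1 = path to the recovery file
    if [ -f "$1" ]; then
        # -i displays the job state and prompts before deleting the job.
        echo "m_rollback -i $1"
    else
        echo "no recovery file: $1"
    fi
}

rollback_hint "$recovery_file"
```

Printing the command first gives the operator a chance to review the job state before choosing between rolling back to a checkpoint and deleting the job with -d.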
Monitoring
Monitoring is controlled in either (or both) of two ways:
From the shell, set the configuration variable XX_REPORT before running the job.
Within the script, supply arguments to the report option to the mp run command.
The keywords are:
verbose-errors
expanded-graph
flows
times
skew
skew=n
scroll=mode
file=filename
interval=n
table-flows
export XX_REPORT="flows times interval=10" (ksh; the quotes keep all keywords in one value)
mp run report flows times interval=10 (in script)
File Skew
Skew is only of concern if it is large (say, over 25%) and if large amounts of data or CPU time
are involved.
Situations that might lead to skew are an overloaded node, unbalanced data, or
different node speeds.
An overloaded node: If a node is overloaded, then data flows will tend to show up
as initially skewed, but the skew will go to zero at the end of the run.
Unbalanced data: If different partitions of a data flow have different amounts of data,
then both data and CPU time will be skewed at the end of the run.
Different node speeds: If some nodes are faster than others, then skew is likely to
result. In this case, CPU times will be skewed at the end of the run, but not data
volumes.
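To make the 25% rule of thumb concrete, the sketch below computes a worst-case skew figure from per-partition record counts. The text above does not define how skew is calculated, so the formula used here (largest partition minus the average, divided by the largest partition) is an assumption for illustration, as are the function name max_skew_pct and the sample counts.

```shell
#!/bin/sh
# Sketch: worst-case skew across partitions, using an ASSUMED definition:
#   skew% = (largest partition - average partition) / largest partition * 100
# The source text does not give a formula; this is for illustration only.
max_skew_pct() {
    # Arguments: one record (or byte) count per partition.
    printf '%s\n' "$@" | awk '
        { sum += $1; if ($1 > max) max = $1; n++ }
        END {
            if (n == 0 || max == 0) { printf "0.0\n"; exit }
            printf "%.1f\n", (max - sum / n) / max * 100
        }'
}

# Hypothetical sample: three partitions of 100 records and one of 300.
skew=$(max_skew_pct 100 100 100 300)
echo "max skew: ${skew}%"

# Flag the run only when skew exceeds the 25% rule of thumb.
if awk -v s="$skew" 'BEGIN { exit !(s > 25) }'; then
    echo "skew is large enough to investigate"
fi
```

With the sample counts the average is 150 and the largest partition is 300, so the sketch reports a skew of 50.0%, which is above the threshold.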
Debugging
The XX_DEBUG environment variable controls the tracing and debugging of processes.
The IWAIT mechanism is a simple job-tracking system that lets us detect and handle processes
that fail. We must set IWAIT in order to use any tracing or debugging features.
Administration
AB_SUPPRESS_HISTORY_CHECK: Permits changing parameters when restarting a
checkpointed mp job.
AB_CONNECTION, AB_CONNECTION_SCRIPT, AB_PASSWORD, AB_USER: control aspects
of remote connections.
AB_NODES: Defines node aliases.
Performance
The m_attach utility accelerates job startup on IBM SP configurations of 9 or more nodes.