You are on page 1of 5

AB INITIO GLOSSARY

Ad-hoc Multifile: A parallel dataset created by naming a set of serial files as its
partitions. These partitions can be named by explicitly listing the serial files, or by
using a shell expression that expands at runtime to a list of serial files.

Co>Operating System: The Co>Operating System is core software that unites a


network of computing resources-CPUs, storage disks, programs, datasets-into a
production-quality data processing system with scalable performance and
mainframe reliability. The Co>Operating System is layered on top of the native
operating systems of a collection of servers. It provides a distributed model for
process execution, file management, process monitoring, checkpointing, and
debugging.

Component parallelism: A graph with multiple processes running simultaneously


on separate data uses component parallelism.

Checkpoint is a phase that acts as an intermediate stopping point in a graph to


safeguard against failures. By assigning phases with checkpoints to a graph, you
can recover completed stages of the graph if failure occurs.

Control partition is the file in a multifile that contains the locations (URLs) of the
multfile's data partitions.

Data parallelism: A graph that deals with data divided into segments and operates
on each segment simultaneously uses data parallelism. Nearly all commercial data
processing tasks can use data parallelism. To support this form of parallelism, Ab
Initio provides Partition components to segment data, and Departition components
to merge segmented data back together.

Deadlock occurs when a program cannot progress. It depends on the patterns of


the data and typically occurs in graphs with data flows that split and then join.

A graph carries a potential deadlock when flows diverge and converge within the
same phase. If the flows converge at a component that reads its input flows in a
particular order, that component may wait for records to arrive on one flow even as
the unread data accumulates on others because components have a limited
buffering capacity.
DML is an acronym for Data Manipulation Language. It is the Ab Initio
programming language you use to define record formats (which are kinds of
types), expressions, transform functions, and key specifiers. Expressions include a
large number of built-in functions. Files with .dml extensions contain record
format definitions.

Export: You can export or pass on properties (parameters, layouts, or ports) from
components to graphs or subgraphs. This creates a graph level parameter that is
referenced by the original parameter in the component.

If you export properties from two different components and give them the same
name, the properties will have the same value. A typical use is to export a key
parameter from a Partition by Key and a Sort, so they have the same value.

Fan-in flows connect components with a large number of partitions to components


with a smaller number of partitions. The most common use of fan-in is to connect
flows to Departition components. This flowpattern is used to merge data divided
into many segments back into a single segment, so other programs can access the
data.

When you connect a component running in parallel to any component via a fan-in
flow, the number of partitions of the original component must be a multiple of the
number of partitions of the component receiving the fan-in flow. For example, you
can connect a component running 9 ways parallel to a component running 3 ways
parallel, but not to a component running 4 ways parallel.

To deal with the latter case, Repartition the data by inserting one of the Partition
components between the components. This turns the fan-in-flow into an All-to-all
flow, allowing a record from any partition of one component to flow into any
partition of the other component.

Fan-out flows connect components with a small number of partitions to


components with a larger number of partitions. The most common use of fan-out is
to connect flows from partition components. This flow pattern is used to divide
data into many segments for performance improvements.

When you connect a Partition component running in parallel to another component


running in parallel via a fan-out flow, the number of partitions of the component
receiving the fan-out flow must be a multiple of the number of partitions of the
Partition component. For example, you can connect a Partition component with 3
partitions via a fan-out flow to a component with 9 partitions, but not to a
component with 10 partitions.

To deal with the later case, Repartition the data by inserting a Departition
component after the Partition component. This turns the fan-out flow into an All-
to-all flow, allowing a record from any partition of the Partition component to flow
into any partition of the target component.

Flow: A flow carries a stream of data between components in a graph. Flows


connect components via ports. Ab Initio supplies four kinds of flows with different
patterns: straight, fan-in, fan-out, and all-to-all.

Graph is a diagram that defines the various processing stages of a task and the
streams of data as they move from one stage to another. Visually, stages are
represented by components and streams are represented by flows. The collection of
components and flows comprise an Ab Initio graph. You can create graphs from
the main menu of the GDE.

GDE: The Graphical Development Environment (GDE) provides a graphical user


interface into the services of the Co>Operating System.

Layout: A layout is a list of host and directory locations, usually given by the
URL of a multifile. If the locations are not in a multifile, the layout is a list of
URLs called a custom layout.

A program component's layout lists the hosts and directories in which the
component runs. A dataset component's layout lists the hosts and directories in
which the data resides. Layouts are set on the Properties Layout tab.

The layout defines the level of parallelism. Parallelism is achieved by partitioning


data and computation across processors.

Ab Initio uses layout markers to show the level of parallelism on components. If


the GDE can determine the level of parallelism, it uses that level as the marker. For
example, non-parallel files have a marker of 1. If the GDE cannot determine the
level of parallelism, it abbreviates layouts as L1, L2, and so on. An asterisk next to
a layout marker (L1*) indicates a propagated layout.

.mdc file: A file with an .mdc file extension represents a Dataset or custom dataset
component.
Multifile is a parallel file that is composed of individual files on different disks
and/or nodes. The individual files are partitions of the multifile. Each multifile
contains one control partition and one or more data partitions. Multifiles are stored
in distributed directories called multidirectories.

The data in a multifile is usually divided across partitions by one of these methods:
Random or roundrobin partitioning
Partitioning based on ranges or functions, or
Replication or broadcast, in which each partition is an identical copy of the serial
data.

A partition is a file that is a portion of a multifile. A partition is a segment of a


parallel computation.

Phase is a stage of a graph that runs to completion before the start of the next
stage. By dividing a graph into phases, you can save resources, avoid deadlock,
and safeguard against failures.

If a graph has deadlock potential, for example in a merge-split situation, use


different phases in the upstream and downstream components to stagger the
reading and writing to disk.

To protect a graph, all phases are checkpoints by default. A checkpoint is a special


kind of phase that saves status information and allows you to recover from failures.

Pipeline Parallelism: A graph with multiple components running simultaneously


on the same data uses pipeline parallelism.

Each component in the pipeline continuously reads from upstream components,


processes data, and writes to downstream components. Since a downstream
component can process records previously written by an upstream component, both
components can operate in parallel.

Record format is either a DML file or a DML string that describes data. You set
record formats on the Properties Parameters tab of the Properties dialog box. A
record format is a type applied to a port.
Repartition data is to change the degree of parallelism or the grouping of
partitioned data. For instance, if you have divided skewed data, you can repartition
using a Partition by Key connected to a Gather with an All-to-all flow.

Sandbox: A sandbox is a collection of graphs and related files that are stored in a
single directory tree, and treated as a group for purposes of version control,
navigation, and migration. A sandbox can be a file system copy of a repository
project.

Skew refers to lopsided data storage or program execution. Causes of skew range
from uneven input data to different loads on different processors. Repartitioning
often alleviates skewed input data. Modifying the layouts often repairs skewed
program execution. Skew for a data storage partition is defined as:
(N- AVERAGE) /MAX
Where: N is the number of bytes in that partition
AVERAGE is the total number of bytes in all the partitions divided by number of
partitions
MAX is the number of bytes in the partition with the most bytes.

Skew for program execution is similar, using CPU seconds instead of number of
bytes.

Straight flows: connect components that have the same number of partitions.
Partitions of components connected with straight flows have a one-to-one
correspondence. Straight flows are the most common flow pattern.

Subgraph is a graph fragment. Just like graphs, subgraphs can contain components
and flows. Subgraphs are useful for grouping a graph into subtasks and reusing
them.

Watcher: A watcher lets you view the data that has passed through a Flow.
To use a watcher, do the following:
Turn on debugging mode.
Add a watcher on a flow.
Run the graph.
View the data.

You might also like