Professional Documents
Culture Documents
Ad-hoc Multifile: A parallel dataset created by naming a set of serial files as its
partitions. These partitions can be named by explicitly listing the serial files, or by
using a shell expression that expands at runtime to a list of serial files.
Control partition is the file in a multifile that contains the locations (URLs) of the
multfile's data partitions.
Data parallelism: A graph that deals with data divided into segments and operates
on each segment simultaneously uses data parallelism. Nearly all commercial data
processing tasks can use data parallelism. To support this form of parallelism, Ab
Initio provides Partition components to segment data, and Departition components
to merge segmented data back together.
A graph carries a potential deadlock when flows diverge and converge within the
same phase. If the flows converge at a component that reads its input flows in a
particular order, that component may wait for records to arrive on one flow even as
the unread data accumulates on others because components have a limited
buffering capacity.
DML is an acronym for Data Manipulation Language. It is the Ab Initio
programming language you use to define record formats (which are kinds of
types), expressions, transform functions, and key specifiers. Expressions include a
large number of built-in functions. Files with .dml extensions contain record
format definitions.
Export: You can export or pass on properties (parameters, layouts, or ports) from
components to graphs or subgraphs. This creates a graph level parameter that is
referenced by the original parameter in the component.
If you export properties from two different components and give them the same
name, the properties will have the same value. A typical use is to export a key
parameter from a Partition by Key and a Sort, so they have the same value.
When you connect a component running in parallel to any component via a fan-in
flow, the number of partitions of the original component must be a multiple of the
number of partitions of the component receiving the fan-in flow. For example, you
can connect a component running 9 ways parallel to a component running 3 ways
parallel, but not to a component running 4 ways parallel.
To deal with the latter case, Repartition the data by inserting one of the Partition
components between the components. This turns the fan-in-flow into an All-to-all
flow, allowing a record from any partition of one component to flow into any
partition of the other component.
To deal with the later case, Repartition the data by inserting a Departition
component after the Partition component. This turns the fan-out flow into an All-
to-all flow, allowing a record from any partition of the Partition component to flow
into any partition of the target component.
Graph is a diagram that defines the various processing stages of a task and the
streams of data as they move from one stage to another. Visually, stages are
represented by components and streams are represented by flows. The collection of
components and flows comprise an Ab Initio graph. You can create graphs from
the main menu of the GDE.
Layout: A layout is a list of host and directory locations, usually given by the
URL of a multifile. If the locations are not in a multifile, the layout is a list of
URLs called a custom layout.
A program component's layout lists the hosts and directories in which the
component runs. A dataset component's layout lists the hosts and directories in
which the data resides. Layouts are set on the Properties Layout tab.
.mdc file: A file with an .mdc file extension represents a Dataset or custom dataset
component.
Multifile is a parallel file that is composed of individual files on different disks
and/or nodes. The individual files are partitions of the multifile. Each multifile
contains one control partition and one or more data partitions. Multifiles are stored
in distributed directories called multidirectories.
The data in a multifile is usually divided across partitions by one of these methods:
Random or roundrobin partitioning
Partitioning based on ranges or functions, or
Replication or broadcast, in which each partition is an identical copy of the serial
data.
Phase is a stage of a graph that runs to completion before the start of the next
stage. By dividing a graph into phases, you can save resources, avoid deadlock,
and safeguard against failures.
Record format is either a DML file or a DML string that describes data. You set
record formats on the Properties Parameters tab of the Properties dialog box. A
record format is a type applied to a port.
Repartition data is to change the degree of parallelism or the grouping of
partitioned data. For instance, if you have divided skewed data, you can repartition
using a Partition by Key connected to a Gather with an All-to-all flow.
Sandbox: A sandbox is a collection of graphs and related files that are stored in a
single directory tree, and treated as a group for purposes of version control,
navigation, and migration. A sandbox can be a file system copy of a repository
project.
Skew refers to lopsided data storage or program execution. Causes of skew range
from uneven input data to different loads on different processors. Repartitioning
often alleviates skewed input data. Modifying the layouts often repairs skewed
program execution. Skew for a data storage partition is defined as:
(N- AVERAGE) /MAX
Where: N is the number of bytes in that partition
AVERAGE is the total number of bytes in all the partitions divided by number of
partitions
MAX is the number of bytes in the partition with the most bytes.
Skew for program execution is similar, using CPU seconds instead of number of
bytes.
Straight flows: connect components that have the same number of partitions.
Partitions of components connected with straight flows have a one-to-one
correspondence. Straight flows are the most common flow pattern.
Subgraph is a graph fragment. Just like graphs, subgraphs can contain components
and flows. Subgraphs are useful for grouping a graph into subtasks and reusing
them.
Watcher: A watcher lets you view the data that has passed through a Flow.
To use a watcher, do the following:
Turn on debugging mode.
Add a watcher on a flow.
Run the graph.
View the data.