INTRODUCTION
- Ab Initio is a general-purpose data processing platform for
enterprise-class, mission-critical applications such as data
warehousing, clickstream processing, data movement, data
transformation and analytics.
- Supports integration of arbitrary data sources and programs,
and provides complete metadata management across the
enterprise.
- Proven best-of-breed ETL solution.
- Applications of Ab Initio:
ETL for data warehouses, data marts and operational data sources.
Parallel data cleansing and validation.
Parallel data transformation and filtering.
High-performance analytics.
Real-time, parallel data capture.
ARCHITECTURE
Ab Initio comprises three core components:
- The Co>Operating System
- The Graphical Development Environment (GDE)
- The Enterprise Meta>Environment (EME)
Supplementary components:
- Continuous>Flows
- Plan>It
- Re>source
- Data>Profiler
ARCHITECTURE
[Architecture diagram] The Ab Initio software stack, from top to bottom:
Applications
Ab Initio Metadata Repository
Application Development Environments: Graphical, C++, Shell
Component Library; User-defined Components; Third Party Components
Ab Initio Co>Operating System
Native Operating System: UNIX, Windows NT
OPERATING ENVIRONMENTS
Ab Initio has been developed to support a wide variety of
operating environments.
The Co>Operating System and the EME are available in many
flavors and can be installed on:
Windows NT-based systems
AIX
HP-UX
Linux
Solaris
Tru64 UNIX
The GDE part of Ab Initio is currently available for Windows-based
operating systems only.
CO>OPERATING SYSTEM
The Co>Operating System and the Graphical Development
Environment (GDE) form the basis of Ab Initio software. They
interact with each other to create and run an Ab Initio application:
The GDE provides a canvas on which you manipulate icons that
represent data, programs, and the connections between them.
The result looks like a data flow diagram, and represents the
solution to a data manipulation problem.
The GDE communicates this solution to the Co>Operating
System as a Korn shell script.
The Co>Operating System executes the script.
ENTERPRISE META>ENVIRONMENT
Enterprise Meta>Environment (EME) is an Ab Initio repository and
environment for storing and managing metadata. It can store both
business and technical metadata. EME metadata can be accessed from the
Ab Initio GDE, a web browser, or the Ab Initio Co>Operating System
command line (air commands).
When you work with the EME, you have files in two locations, of two kinds:
A personal work area that you specify, located in some filesystem.
Essentially this is a formalized directory structure that usually only you
have access to. This is where you work on files.
A system storage area.
This is the EME system area where every version that you save of the
files you work on is permanently preserved, organized in projects. You
(and other users) have indirect access to this area through the GDE and the
command line. This area's generic name is the EME datastore.
GDE
[GDE screenshot] Labeled elements: Edit, Phases, Debug, Main Menu bar,
Graph Design area, Application Output, Tools, Run, Sandbox, Components.
THE GRAPH
An Ab Initio application is called a Graph: a graphical
representation of datasets, programs, and the connections
between them. Within the framework of a graph, you can arrange
and rearrange the datasets, programs, and connections and
specify the way they operate. In this way, you can build a graph to
perform any data manipulation you want.
You develop graphs in the GDE.
THE GRAPH
A graph is a data flow diagram that defines the various processing stages of a
task and the streams of data as they move from one stage to another. In a graph,
a component represents a stage and a flow represents a data stream. In
addition, there are parameters for specifying various aspects of graph behavior.
Graphs are built in the GDE by dragging and dropping components, connecting them
with flows, and then defining values for parameters. You run, debug, and tune your
graph in the GDE. In the process of building a graph, you are developing an Ab Initio
application, and thus graph development is called graph programming. When you
are ready to deploy your application, you save the graph as a script that you can run
from the command line.
THE GRAPH
A graph can run on one or many processors. The processors can be on one
or many computers.
All the computers that participate in running a graph must have the
Co>Operating System installed.
When you run a graph from the GDE, the GDE connects to the
Co>Operating System on the run host: the computer that hosts the
Co>Operating System installation that controls the execution of the graph.
The GDE transmits a Korn shell script that contains all the information
the Co>Operating System needs to execute the programs and processes
represented by the graph, by:
Managing all connections between the processors or computers that
execute various parts of the graph
Executing programs
Transferring data from one program to the next
Transferring data from one processor or computer to another
Executing the graph in phases
Restoring the graph to its state at the last completed checkpoint in case of
a failure
Writing and removing temporary files
SETTING UP GDE
Before using the GDE to run a graph or plan, a connection must be established between the GDE
and the installation of the Co>Operating System on which you want to run your graph. The computer
that hosts this installation is called the run host. Typically, the GDE is on one computer and the run
host is a different computer. But even if the GDE is local to the run host, you must still establish the
connection.
Use the Host Settings dialog (in the GDE: Run > Settings > Host tab) to specify the information
the GDE needs to log in to the run host.
The GDE creates a host settings file with the name you specified. The host
settings file contains the information you specified, such as the name of the run
host, login name or username, password, Co>Operating System version and
location, and the type of shell you log in with.
DATA PROCESSING
Generally, data processing involves the following tasks:
Collection of data from various sources.
Cleansing and standardizing data and its datatypes.
Data integrity checks.
Profiling of data.
Staging it for further operations.
Joining with various datasets.
Transforming data based on Business rules.
Conforming data validity.
Translating data to compatible types of target systems.
Loading data to Target systems.
Tying out source and target data for quality and quantity.
Anomaly, rejection and error handling.
Reporting process updates.
Metadata updates.
COMMON TASKS
Selecting: the SELECT clause of an SQL statement.
Filtering: predicates in the WHERE clause of an SQL statement.
Sorting: specifications in an ORDER BY clause.
Transformation: transformation functions applied to the fields
in the SELECT clause.
Aggregation: summarization over a set of records (SUM, AVG, MIN,
etc.).
Switch: conditional operations using CASE statements.
String operations: string selection and manipulation
(CONCATENATE, SUBSTRING, etc.).
Mathematical operations: addition, division, etc.
Inquiry: tests on the attributes and content of a field of a record.
Joins: specifications using JOIN clauses (SQL Server, ANSI) or the
WHERE clause (Oracle).
Lookups: retrieving relevant fields from dimensions.
Rollups: data summarized using the GROUP BY clause of an SQL
statement.
COMPONENTS
Components are classified based on their functionality.
The following are a few important component categories.
COMPONENTS
The dataset components represent records or act on records, as follows:
INPUT FILE represents records read as input to a graph from one or more serial files
or from a multifile.
INPUT TABLE unloads records from a database into an Ab Initio graph, allowing you to
specify as the source either a database table, or an SQL statement that selects records
from one or more tables.
INTERMEDIATE FILE represents one or more serial files or a multifile of intermediate
results that a graph writes during execution, and saves for your review after execution.
LOOKUP FILE represents one or more serial files or a multifile of records small enough
to be held in main memory, letting a transform function retrieve records much more
quickly than it could if they were stored on disk.
OUTPUT FILE represents records written as output from a graph into one or multiple
serial files or a multifile.
OUTPUT TABLE loads records from a graph into a database, letting you specify the
records' destination either directly as a single database table, or through an SQL
statement that inserts records into one or more tables.
READ MULTIPLE FILES sequentially reads from a list of input files.
READ SHARED reduces the disk read rate when multiple graphs (as opposed to
multiple components in the same graph) are reading the same very large file.
WRITE MULTIPLE FILES writes records to a set of output files.
COMPONENTS
The database components are the following:
CALL STORED PROCEDURE calls a stored procedure that returns multiple
result sets. The stored procedure can also take parameters.
INPUT TABLE unloads records from a database into an Ab Initio graph, allowing
you to specify as the source either a database table, or an SQL statement that
selects records from one or more tables.
JOIN WITH DB joins records from the flow or flows connected to its input port
with records read directly from a database, and outputs new records containing
data based on, or calculated from, the joined records.
MULTI UPDATE TABLE executes multiple SQL statements for each input record.
OUTPUT TABLE loads records from a graph into a database, letting you specify
the records' destination either directly as a single database table, or through an
SQL statement that inserts records into one or more tables.
RUN SQL executes SQL statements in a database and writes confirmation
messages to the log port.
TRUNCATE TABLE deletes all the rows in a database table, and writes
confirmation messages to the log port.
UPDATE TABLE executes UPDATE, INSERT, or DELETE statements in embedded
SQL format to modify a table in a database, and writes status information to the
log port.
COMPONENTS
MULTI REFORMAT changes the record format of records flowing through from
one to 20 pairs of in and out ports by dropping fields, or by using DML expressions
to add fields, combine fields, or transform the data in the records.
NORMALIZE generates multiple output records from each input record; you can
specify the number of output records, or the number of output records can depend
on a field or fields in each input record. NORMALIZE can separate a record with a
vector field into several individual records, each containing one element of the
vector.
REFORMAT changes the record format of records by dropping fields, or by using
DML expressions to add fields, combine fields, or transform the data in the records.
ROLLUP generates records that summarize groups of records. ROLLUP gives you
more control over record selection, grouping, and aggregation than AGGREGATE.
ROLLUP can process either grouped or ungrouped input. When processing
ungrouped input, ROLLUP maximizes performance by keeping intermediate results
in main memory.
SCAN generates a series of cumulative summary records such as successive
year-to-date totals for groups of records. SCAN can process either grouped or
ungrouped input. When processing ungrouped input, SCAN maximizes
performance by keeping intermediate results in main memory.
SCAN WITH ROLLUP performs the same operations as SCAN; it generates a
summary record for each input group.
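As a sketch of the kind of transform a ROLLUP component uses, the following hypothetical rollup transform summarizes records grouped by a customer key. The field names (cust_id, amount) are invented for illustration; sum and count are built-in DML aggregation functions.

```
out :: rollup(in) =
begin
  out.cust_id :: in.cust_id;        /* the grouping key */
  out.total   :: sum(in.amount);    /* total amount per group */
  out.n_txn   :: count(in.amount);  /* number of records per group */
end;
```

One summary record is produced per group of input records sharing the same key value.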
COMPONENTS
The Miscellaneous components perform a variety of tasks:
ASSIGN KEYS assigns a value to a surrogate key field in each record on the in port, based on the
value of a natural key field in that record, and then sends the record to one or two of three output
ports.
BUFFERED COPY copies records from its in port to its out port without changing the values in the
records. If the downstream flow stops, BUFFERED COPY copies records to a buffer and defers
outputting them until the downstream flow resumes.
DOCUMENTATION provides a facility for documenting a transform.
GATHER LOGS collects the output from the log ports of components for analysis of a graph after
execution.
LEADING RECORDS copies records from input to output, stopping after the given number of records.
META PIVOT converts each input record into a series of separate output records: one separate output
record for each field of data in the original input record.
REDEFINE FORMAT copies records from its input to its output without changing the values in the
records. You can use REDEFINE FORMAT to change or rename fields in a record format without
changing the values in the records.
REPLICATE arbitrarily combines all the records it receives into a single flow and writes a copy of that
flow to each of its output flows.
RUN PROGRAM runs an executable program.
THROTTLE copies records from its input to its output, limiting the rate at which records are processed.
RECIRCULATE and COMPUTE CLOSURE components can only be used together. These two
components calculate the complete set of direct and derived relationships among a set of input
key pairs; in other words, the transitive closure of the relationship within the set.
TRASH ends a flow by accepting all the records in it and discarding them.
COMPONENTS
The transform components modify or manipulate records by using one or several
transform functions. Some of these are multistage transform components that modify
records in up to five stages: input selection, temporary initialization, processing,
finalization, and output selection. Each stage is written as a DML transform function.
AGGREGATE generates records that summarize groups of records. AGGREGATE can
process either grouped or ungrouped input. When processing ungrouped input,
AGGREGATE maximizes performance by keeping intermediate results in main memory.
DEDUP SORTED separates one specified record in each group of records from the rest
of the records in the group. DEDUP SORTED requires grouped input.
DENORMALIZE SORTED consolidates groups of related records into a single record
with a vector field for each group, and optionally computes summary fields for each
group. DENORMALIZE SORTED requires grouped input.
FILTER BY EXPRESSION filters records according to a specified DML expression.
FUSE applies a transform to corresponding records from each input flow. The
transform is first applied to the first record on each flow, then to the second record on
each flow, and so on. The result of the transform is sent to the out port.
JOIN performs inner, outer, and semi-joins on multiple flows of records. JOIN can
process either sorted or unsorted input. When processing unsorted input, JOIN
maximizes performance by loading input records into main memory.
MATCH SORTED combines and performs transform operations on multiple flows of
records. MATCH SORTED requires grouped input.
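A minimal JOIN transform might combine fields from both inputs into each output record. The port names in0 and in1 are JOIN's standard input ports; the field names below are hypothetical.

```
out :: join(in0, in1) =
begin
  out.cust_id :: in0.cust_id;    /* key field shared by both inputs */
  out.name    :: in0.last_name;  /* taken from the first input */
  out.amount  :: in1.amount;     /* taken from the second input */
end;
```

The join key itself is specified separately in the component's key parameter, not in the transform body.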
COMPONENTS
The sort components sort and merge records, and perform related
tasks:
CHECKPOINTED SORT sorts and merges records, inserting a
checkpoint between the sorting and merging phases.
FIND SPLITTERS sorts records according to a key specifier, and then
finds the ranges of key values that divide the total number of input
records approximately evenly into a specified number of partitions.
PARTITION BY KEY AND SORT repartitions records by key values and
then sorts the records within each partition; the number of input and
output partitions can be different.
SAMPLE selects a specified number of records at random from one or
more input flows. The probability of any one input record appearing in
the output flow is the same; it does not depend on the position of the
record in the input flow.
SORT orders and merges records.
SORT WITHIN GROUPS refines the sorting of records already sorted
according to one key specifier: it sorts the records within the groups
formed by the first sort according to a second key specifier.
COMPONENTS
The partition components distribute records to multiple flow partitions
or multiple straight flows to support data parallelism or component
parallelism.
BROADCAST arbitrarily combines all the records it receives into a single
flow and writes a copy of that flow to each of its output flow partitions.
PARTITION BY EXPRESSION distributes records to its output flow
partitions according to a specified DML expression.
PARTITION BY KEY distributes records to its output flow partitions
according to key values.
PARTITION BY PERCENTAGE distributes a specified percent of the total
number of input records to each output flow.
PARTITION BY RANGE distributes records to its output flow partitions
according to the ranges of key values specified for each partition.
PARTITION BY ROUND-ROBIN distributes records evenly to each
output flow.
PARTITION WITH LOAD BALANCE distributes records to its output flow
partitions, writing more records to the flow partitions that consume
records faster.
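The parameter of PARTITION BY EXPRESSION is a DML expression that yields a partition number for each record. A hypothetical sketch, assuming four output flow partitions and a numeric cust_id field (both invented for illustration):

```
/* route each record to one of 4 partitions by customer id */
in.cust_id % 4
```

Records with the same cust_id always land in the same partition, which is what key-based downstream processing typically requires.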
COMPONENTS
The departition components combine multiple flow partitions or
multiple straight flows into a single flow to support data parallelism
or component parallelism:
CONCATENATE appends multiple flow partitions of records one
after another.
GATHER combines records from multiple flow partitions arbitrarily.
INTERLEAVE combines blocks of records from multiple flow
partitions in round-robin fashion.
MERGE combines records from multiple flow partitions that have
all been sorted according to the same key specifier and maintains
the sort order.
COMPONENTS
The compress components compress data or expand compressed
data:
DEFLATE and INFLATE work on all platforms.
DEFLATE reduces the volume of data in a flow and INFLATE
reverses the effects of DEFLATE.
COMPRESS and UNCOMPRESS are available on Unix and
Linux platforms; not on Windows.
COMPRESS reduces the volume of data in a flow and
UNCOMPRESS reverses the effects of COMPRESS.
COMPRESS cannot output more than 2 GB of data on Linux
platforms.
COMPONENTS
The MVS dataset components perform as follows:
MVS INPUT FILE reads an MVS dataset as an input to your graph.
MVS INTERMEDIATE FILE represents MVS datasets that contain
intermediate results that your graph writes and saves for review.
MVS LOOKUP FILE contains shared MVS data for use with the
DML lookup functions; it allows access to records according to a
key.
MVS OUTPUT FILE represents records written as output from a
graph into an MVS dataset.
VSAM LOOKUP executes a VSAM lookup for each input record.
COMPONENTS
The EME (Enterprise Meta>Environment) components are in the
$AB_HOME/Connectors > EME folder. They transfer data into and out of the
EME.
LOAD ANNOTATION VALUES attaches annotation values to objects in an
EME datastore, and validates the annotation for the object to which it is to be
attached.
LOAD CATEGORY stores annotation values in EME datastores for objects that
are members of a given category.
LOAD FILE DATASET creates a file dataset in an EME datastore, and attaches
its record format.
LOAD MIMEOBJ creates a MIME object in an EME datastore with the specified
MIME type.
LOAD TABLE DATASET creates a table dataset in an EME datastore and
attaches its record format.
LOAD TYPE creates a DML record object in an EME datastore.
REMOVE OBJECTS AND ANNOTATIONS allows you to delete objects and
annotation values from an EME datastore.
UNLOAD CATEGORY unloads the annotation rules of all objects that are
members of a given category.
COMPONENTS
The FTP (File Transfer Protocol) components transfer data, as
follows:
FTP FROM transfers files of records from a computer not running
the Co>Operating System to a computer running the
Co>Operating System.
FTP TO transfers files of records to a computer not running the
Co>Operating System from a computer running the Co>Operating
System.
SFTP FROM transfers files of records from a computer not running
the Co>Operating System to a computer running the
Co>Operating System using the sftp or scp utilities to connect to
a Secure Shell (SSH) server on the remote machine and transfer
the files via the encrypted connection provided by SSH.
SFTP TO transfers files of records from a computer running the
Co>Operating System to a computer not running the
Co>Operating System using the sftp or scp utilities to connect to
a Secure Shell (SSH) server on the remote machine and transfer
the files via the encrypted connection provided by SSH.
COMPONENTS
The validate components test, debug, and check records, and produce data
for testing Ab Initio graphs:
CHECK ORDER tests whether records are sorted according to a key specifier.
COMPARE CHECKSUMS compares two checksums generated by COMPUTE
CHECKSUM. Typically, you use COMPARE CHECKSUMS to compare checksums
generated from two sets of records, each set computed from the same data by
a different method, in order to check the correctness of the records.
COMPARE RECORDS compares records from two flows one by one.
COMPUTE CHECKSUM calculates a checksum for records.
GENERATE RANDOM BYTES generates a specified number of records, each
consisting of a specified number of random bytes. Typically, the output of
GENERATE RANDOM BYTES is used for testing a graph. For more control over
the content of the records, use GENERATE RECORDS.
GENERATE RECORDS generates a specified number of records with fields of
specified lengths and types. You can let GENERATE RECORDS generate
random values within the specified length and type for each field, or you can
control various aspects of the generated values. Typically, you use the output
of GENERATE RECORDS to test a graph.
VALIDATE RECORDS separates valid records from invalid records.
COMPONENTS
Configuring Components:
A component comprises settings that dictate how it affects the data
flowing through it. These settings are viewed and set on the following
tabs of the Properties window:
Description
Access
Layout
Parameters
Ports
COMPONENT PROPERTIES
DESCRIPTION:
Specify Labels
Component specific selections
Data Locations
Partitions
Comments
Name/Author/Version
COMPONENT PROPERTIES
ACCESS:
Specification for file handling
methods.
Creation/Deletion
Roll back settings
Exclusive access
File Protection
COMPONENT PROPERTIES
LAYOUT:
COMPONENT PROPERTIES
PARAMETERS:
Set various component specific
configuration
Specify transform settings
Handle rejects/logs/thresholds
Apply filters
Specify Keys for Sort and
Partition
Specify URL for transforms
Parameter interpretation
COMPONENT PROPERTIES
PORTS:
Assign record formats for various
ports
Specify URL of record format
Interpretation settings for any
environment parameters.
EDITORS
Configuring components using the various tabs involves changing settings,
in text editors, for:
Transforms
Filters
Keys
Record Formats
Expressions
For example, a record format:
record
decimal(6) cust_id;
string(18) last_name;
string(16) first_name;
string(26) street_addr;
string(2) state;
string(1) newline;
end;
Functions: displays the built-in DML functions.
Fields: displays the input record formats available for use in an expression.
Operators: displays the built-in DML operators.
Sort Sequences
phonebook:
Digits are treated as the lowest-value characters, followed by the
letters of the alphabet in the order AaBbCcDd..., followed by spaces.
All other characters, such as punctuation, are ignored. The order of
digits is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
index:
The same as phonebook ordering, except that punctuation characters
are not ignored; they have lower values than all other characters.
The order of punctuation characters is the machine sequence.
machine:
Uses character code values in the sequence in which they are arranged
in the character set of the string. For ASCII-based character sets,
digits are the lowest-value characters, followed by uppercase letters,
followed by lowercase letters. For EBCDIC character sets, lowercase
letters are the lowest-value characters, followed by uppercase,
followed by digits. For Unicode, the order is from the lowest character
code value to the highest.
custom:
Uses the user-defined sort order. You construct a custom sort sequence.
out :: reformat(in) =
begin
  out.* :: in.*;
  out.score :: if (in.income > 1000) 1 else 2;
end;
COMPONENTS
INPUT FILE
OUTPUT FILE
COMPONENTS
SORT
Note:
Use Key Specifier editor to add/modify keys
and sort order and sequences.
Sort orders are impacted by the sort sequence
used. Character sets used can also have an
impact on the sort order.
COMPONENTS
FILTER BY EXPRESSION Filters records according to a DML
expression.
Location: Transform
Key Settings:
select_expr: the DML expression that defines the filter condition
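For example, a hypothetical select_expr that keeps only high-income California records (the field names state and income, and the threshold, are invented for illustration):

```
in.state == "CA" && in.income > 50000
```

Records for which the expression evaluates to true go to the out port; the rest go to the deselect port (or are discarded if that port is unconnected).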
COMPONENTS
REFORMAT
TRANSFORM FUNCTIONS
A transform function is a collection of rules that specifies how to produce result
records from input records.
The exact behavior of a transform function depends on the component using it.
Transform functions express record reformatting logic.
Transform functions encapsulate a wide variety of computational
knowledge that cleanses records, merges records, and aggregates records.
Transform functions perform their operations on the data records flowing
into the component and write the resulting data records to the out flow.
Example:
The purpose of the transform function in the REFORMAT component is to construct
data records flowing out of the component by:
Using all the fields in the record format of the data records flowing into the
component
Creating a new field containing a title (Mr. or Ms.) using the gender field of the
data records flowing into the component
Creating a new field containing a score computed by dividing the income field
by 10
Transform functions are created and edited in the GDE Transform Editor.
TRANSFORM FUNCTIONS
Transform functions (or transforms) drive nearly all data
transformation and computation in Ab Initio graphs. Typical simple
transforms can:
Extract the year from a date.
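A sketch of such a transform, assuming the date is held in a string field in YYYYMMDD form (the field names order_date and year are hypothetical; string_substring is a built-in DML function taking a 1-based start position and a length):

```
out :: get_year(in) =
begin
  /* take the first 4 characters of YYYYMMDD: the year */
  out.year :: (decimal(4)) string_substring(in.order_date, 1, 4);
end;
```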
TRANSFORM EDITOR
Creating Transforms
Transforms consist of rules, variables, and statements. The
relationships between them are created using the Transform Editor.
The Transform Editor has two views:
Grid view: rules are created by dragging and dropping
fields from the input flow, functions, and operators.
Text view: rules are created using DML syntax.
TRANSFORM EDITOR
[Grid view screenshot] Labeled elements: Transform Function, Input Fields,
Output Fields.
TRANSFORM EDITOR
Text View:
Typically, you use the text view of the Transform Editor to enter or edit
transforms using DML. In addition, Ab Initio software supports the
following text alternatives:
You can enter the DML into a standalone text file, then include that
file in the transform's package.
You can select Embed and enter the DML code into the Value box
on the Parameters tab of the component Properties dialog.
TRANSFORM FUNCTIONS
Untyped transforms:
out :: trans1(in) =
begin
out.x :: in.a;
out.y :: in.b + 1;
out.z :: in.c + 2;
end;
Typed transforms:
decimal(12) out :: add_one(decimal(12) in) =
begin
out :: in + 4;
end;
record integer(4) x, y; end out :: fiddle(in1, double in2) =
begin
out.x :: size_of(in1);
out.y :: in2 + in2;
end;
TRANSFORM EDITOR
Rule:
A rule, or business rule, is an instruction in a transform function
that directs the construction of one field in an output record. Rules
can express everything from simple reformatting logic for field values
to complex computations.
TRANSFORM EDITOR
Prioritized Rules:
Priorities can optionally be assigned to the rules for a particular
output field, and a single output field can have multiple rules
attached to it.
Prioritized rules are always evaluated in ascending order of
priority, starting with the assignment with the lowest-numbered
priority and proceeding to assignments with higher-numbered
priorities.
A rule with a blank priority is evaluated last, which places it
after all others in priority.
In Text View:
out :: average_part_cost(in) =
begin
out :1: if (in.num_parts > 0) in.total_cost / in.num_parts;
out :2: 0;
end;
TRANSFORM FUNCTIONS
Local Variable:
A local variable is a variable declared within a transform function. You can use local
variables to simplify the structure of rules or to hold values used by multiple rules.
Statements:
A statement can assign a value to a local variable, a global variable, or an output
field; define processing logic; or control the number of iterations of another
statement.
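A hypothetical transform using a local variable and statements (the let keyword declares a local variable in DML; the field names gross, tax, net and band are invented for illustration):

```
out :: compute(in) =
begin
  let decimal(10) net;           /* local variable declaration */
  net = in.gross - in.tax;       /* statement: assigns the variable */
  out.net  :: net;               /* rule reusing the variable */
  out.band :: if (net > 1000) "HIGH" else "LOW";
end;
```

Here the local variable avoids computing in.gross - in.tax twice, once per rule that needs it.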
COMPONENTS
INPUT TABLE
COMPONENTS
JOIN
COMPONENTS
Key Settings (continued):
join-type: specifies the join method: Inner Join, Full Outer Join, or Explicit.
record-requiredn: a dependent setting of join-type.
dedupn: removes duplicate records on the specified port.
selectn: acts as a component-level filter for the port.
override-keyn: alternative name(s) for the key field(s) for a
particular inn port.
driving: specifies the port that drives the join.
COMPONENTS
JOIN with DB
SPECIAL TRANSFORMS
AGGREGATE
VECTORS
A vector is an array of elements, indexed from 0. An element
can be a single field or an entire record. Vectors are often
used to provide a logical grouping of information.
An array is a collection of elements that are logically grouped
for ease of access. Each element has an index by which the
element can be referenced(read or written).
Example (C-style array initialization):
char myarray[10] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'};
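In DML, a vector field is declared inside a record format with a bracketed length after the type. A hypothetical record with a fixed-length vector of phone numbers, which a NORMALIZE component could split into one record per element (all field names, and the length of 3, are invented for illustration):

```
record
  decimal(6)    cust_id;
  string(12)[3] phones;    /* vector of 3 phone-number elements */
  string(1)     newline;
end;
```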
SPECIAL COMPONENTS
NORMALIZE
SPECIAL COMPONENTS
DENORMALIZE SORTED
PARALLELISM
COMPONENT PARALLELISM
Component parallelism occurs when program components
execute simultaneously on different branches of a graph.
PIPELINE PARALLELISM
Pipeline parallelism occurs when several connected program
components on the same branch of a graph execute
simultaneously.
DATA PARALLELISM
Data parallelism occurs when a graph separates data into
multiple divisions, allowing multiple copies of program
components to operate on the data in all the divisions
simultaneously.
PARTITIONS
The divisions of the data and the copies of the program components that create data
parallelism are called partitions, and a component partitioned in this way is called a
parallel component. If each partition of a parallel program component runs on a separate
processor, the increase in the speed of processing is almost directly proportional to the
number of partitions.
FLOW PARTITIONS
When you divide a component into partitions, you divide the flows that connect to it as well.
These divisions are called flow partitions.
PORT PARTITIONS
The port to which a partitioned flow connects is partitioned as well,
with the same number of port partitions as the flow connected to it.
DEPTH OF PARALLELISM
The number of partitions of a component, flow, port, graph, or section of a graph determines
its depth of parallelism.
PARTITIONS
PARALLEL FILES
Sometimes you will want to store partitioned data in its
partitioned state in a parallel file for further parallel
processing. If you locate the partitions of the parallel file on
different disks, the parallel components in a graph can all read
from the file at the same time, rather than being limited by all
having to take turns reading from the same disk. These parallel
files are called multifiles.
MULTIFILE SYSTEM
Multifiles are parallel files composed of individual files, which may be
located on separate disks or systems. These individual files are the
partitions of the multifile. Understanding the concept of multifiles is essential
when you are developing parallel applications that use files, because the
parallelization of data drives the parallelization of the application.
Data parallelism makes it possible for Ab Initio software to process large
amounts of data very quickly. In order to take full advantage of the power of
data parallelism, you need to store data in its parallel state. In this state, the
partitions of the data are typically located on different disks on various
machines. An Ab Initio multifile is a parallel file that allows you to manage
all the partitions of the data, no matter where they are located, as a single
entity.
Multifiles are organized by using a Multifile System, which has a directory
tree structure that allows you to work with multifiles and the directory
structures that contain them in the same way you would work with serial
files. Using an Ab Initio multifile system, you can apply familiar file
management operations to all partitions of a multifile from a central point of
administration no matter how diverse the machines on which they are
located. You do this by referencing the control partition of the multifile with a
single Ab Initio URL.
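As a sketch, the m_ls and m_expand utilities operate on a multifile through its single control-partition URL (host names and paths continue the example in this section; these commands require a Co>Operating System installation):

```
# List the contents of a multifile system via its control partition
m_ls mfile://pluto.us.com/usr/ed/mfs3

# Expand a multifile URL into the URLs of its data partitions
m_expand mfile://pluto.us.com/usr/ed/mfs3/cust/t.out
```

This is what "a central point of administration" means in practice: one URL, acted on by one command, touches every partition.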
MULTIFILE SYSTEM
Multifile
An Ab Initio multifile organizes all of its partitions into a single
virtual file that you can reference as one entity.
An Ab Initio multifile can only exist in the context of a multifile
system.
Multifiles are created by creating a multifile system, and then
either outputting parallel data to it with an Ab Initio graph, or using
the m_touch command to create an empty multifile in the multifile
system.
Ad Hoc Multifiles
A parallel dataset with partitions that are an arbitrary set of
serial files containing similar data.
Created explicitly by listing a set of serial files as partitions, or
by using a shell expression that expands at runtime to a list of
serial files.
The serial files listed can be anything from a serial dataset divided
into multiple serial files to any set of serial files containing similar
data.
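The shell-expression route can be sketched as follows (paths are hypothetical): a listing expression, used as the dataset's partition list, expands at runtime to whatever serial files currently match.

```shell
# Simulate three serial files containing similar data; at runtime
# they become the partitions of the ad hoc multifile.
mkdir -p /tmp/adhoc_demo
touch /tmp/adhoc_demo/customers-0.dat \
      /tmp/adhoc_demo/customers-1.dat \
      /tmp/adhoc_demo/customers-2.dat

# A shell expression like this expands to the current set of files:
ls /tmp/adhoc_demo/customers-*.dat
```

Because the expression expands at runtime, new part files appear as new partitions automatically on the next run.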
MULTIFILE SYSTEM
- Visualize a directory tree containing
subdirectories and files.
mfile://pluto.us.com/usr/ed/mfs3
MULTIFILE SYSTEM
Creating the Multifile system MFS
m_mkfs //pluto.us.com/usr/ed/mfs3 \
//pear.us.com/p/mfs3-0 \
//pear.us.com/q/mfs3-1 \
//plum.us.com/D:/mfs3-2
Multidirectory
m_mkdir //pluto.us.com/usr/ed/mfs3/cust
Multifile
m_touch //pluto.us.com/usr/ed/mfs3/cust/t.out
ASSIGNMENT
Creating a Multidirectory
Open the command console.
Enter the following command:
m_mkdir c:\data\mfs\mfs2way\employee
LAYOUTS
A layout of a component specifies :
The locations of files: A URL specifying the location of file.
The number and locations of the partitions of multifiles: A URL that specifies
the location of the control partition of a multifile.
Every component in a graph, both dataset and program components, has a
layout. Some graphs use one layout throughout; others use several layouts.
The layouts you choose can be critical to the success or failure of a graph. For a
layout to be effective, it must fulfill the following requirements:
The Co>Operating System must be installed on the computers specified
by the layout.
The run host must be able to connect to the computers specified by the
layout.
The working directories in the layout must exist.
The permissions in the directories of the layout must allow the graph to
write files there.
The layout must allow enough space for the files the graph needs to write
there.
During execution, a graph writes various files in the layouts of some or
all of the components in it.
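The layout requirements above can be pre-checked for any directory in the layout with ordinary shell tests (the path below is illustrative, not an Ab Initio convention):

```shell
# Pre-flight check for one layout directory: it must exist, be
# writable, and have room for the files the graph writes there.
dir=/tmp/layout_demo
mkdir -p "$dir"
if [ -d "$dir" ] && [ -w "$dir" ]; then
  echo "layout directory OK"
fi
df -k "$dir"    # confirm enough free space in the filesystem
```

The Co>Operating System installation and run-host connectivity requirements still have to be verified separately on each machine the layout names.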
FLOWS
Flows indicate the type of data transfer between the components
connected.
FLOWS
Straight Flow:
A straight flow connects components with the same depth of
parallelism, including serial components, to each other. If the
components are serial, the flow is serial. If the components are
parallel, the flow has the same depth of parallelism as the
components.
FLOWS
Fan Out flow:
A fan-out flow connects a component with a lesser number of
partitions to one with a greater number of partitions; in other
words, it follows a one-to-many pattern. The component with the
greater depth of parallelism determines the depth of parallelism of a
fan-out flow.
NOTE: You can only use a fan-out flow when the result of dividing the greater
number of partitions by the lesser number of partitions is an integer. If this is not
the case, you must use an all-to-all flow.
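The divisibility rule can be checked with simple arithmetic, as this shell sketch shows (partition counts are illustrative):

```shell
# Fan-out rule: 2 -> 4 partitions is legal (4 / 2 = 2 is an integer),
# but 2 -> 3 is not, and would need an all-to-all flow instead.
src=2; dst=4
if [ $(( dst % src )) -eq 0 ]; then
  echo "fan-out OK: each source partition feeds $(( dst / src )) target partitions"
else
  echo "not an integer ratio: use an all-to-all flow"
fi
```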
FLOWS
Fan In flow:
A fan-in flow connects a component with a greater depth of
parallelism to one with a lesser depth; in other words, it follows a
many-to-one pattern. As with a fan-out flow, the component with the
greater depth of parallelism determines the depth of parallelism of a
fan-in flow.
NOTE: You can only use a fan-in flow when the result of dividing the greater
number of partitions by the lesser number of partitions is an integer. If this is not
the case, you must use an all-to-all flow.
FLOWS
All to All flow:
An all-to-all flow connects components with the same or different
degrees of parallelism in such a way that each output port partition of
one component is connected to each input port partition of the other
component.
SANDBOX
A sandbox is a special directory (folder) containing a certain
minimum number of specific subdirectories for holding Ab Initio
graphs and related files. These subdirectories have standard
names that indicate their function. The sandbox directory itself can
have any name; its properties are recorded in various special and
hidden files that lie at its top.
SANDBOX
A sandbox holds the parts of a graph, such as the graph itself (.mp),
record formats (.dml), transforms (.xfr), and data files (.dat).
SANDBOX
Sandbox Parameters:
Sandbox parameters are variables that are visible to any
component in any graph stored in that sandbox. Here are
some examples of sandbox parameters:
$PROJECT_DIR
$DML
$RUN
Graphs refer to the sandbox subdirectories by using sandbox
parameters.
SANDBOX
Sandbox Parameters Editor
SANDBOX
The default sandbox parameters in a GDE-created sandbox
are these eight:
PROJECT_DIR: absolute path to the sandbox directory
DML: relative sandbox path to the dml subdirectory
XFR: relative sandbox path to the xfr subdirectory
RUN: relative sandbox path to the run subdirectory
DB: relative sandbox path to the db subdirectory
MP: relative sandbox path to the mp subdirectory
RESOURCE: relative sandbox path to the resource subdirectory
PLAN: relative sandbox path to the plan subdirectory
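How the relative sandbox paths resolve against PROJECT_DIR can be sketched with plain shell variables (the sandbox path and file name are illustrative):

```shell
# How sandbox parameters compose:
PROJECT_DIR=/disk1/sand/my_project    # absolute path to the sandbox
DML=$PROJECT_DIR/dml                  # dml subdirectory
RUN=$PROJECT_DIR/run                  # run subdirectory

# A graph refers to a record format through the parameter:
echo "$DML/customer.dml"    # prints /disk1/sand/my_project/dml/customer.dml
```

Because graphs reference the subdirectories only through these parameters, checking out the same project into a different sandbox location requires no change to the graphs.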
GRAPH PARAMETERS
Graph Parameters Editor
PARAMETER SETTING
Name:
Value: Unique identifier of the parameter.
Description: String. The first character must be a letter; remaining characters can be a
combination of letters, digits, and underscores. Spaces are not allowed.
Example: OUT_DATA_MFS
Scope:
Value: Local or Formal.
Description:
Local: the parameter receives its value from the Value column.
Formal: the parameter receives its value at runtime from the command line. When run
from the GDE, a dialog prompts for the input values.
Kind:
For Sandbox parameters:
Value: Unspecified or Keyword.
Description:
Unspecified: parameter values are assigned directly on the command line.
Keyword: the value for a keyword parameter is identified by a keyword preceding it.
The keyword is the name of the parameter.
For Graph parameters:
Value: Positional, Keyword or Environment.
Description:
Positional: the value for a positional parameter is identified by its position on the
command line.
Keyword: same as the Sandbox setting.
Environment: the parameter receives its value from the environment.
PARAMETER SETTING
Type:
Value: Common Project, Dependent, Switch or String.
Description:
Common Project: to include other shared, or common, project values within the
current sandbox.
Dependent: parameters whose values depend on the value of a switch parameter.
Switch: the purpose of a switch parameter is to allow you to change your sandbox's
context.
String: a normal string representing record formats, file paths, etc.
Dependent on:
Value: Name of the switch column.
Value:
Value: The parameter's value, consistent with its type.
Interpretation:
Value: Constant, $ substitution, ${} substitution and Shell.
Description: Determines how the string value is evaluated
Required:
Value: Required or Optional.
Description: Specifies whether the value is required or optional.
Export:
Value: Checked/Unchecked.
Description: Specifies whether the parameter/value needs to be exported to the environment.
ORDER OF EVALUATION
When you run a graph, parameters are evaluated in the following
order:
1. The host setup script is run.
2. Common (that is, included) project (sandbox) parameters are
evaluated.
3. Project (sandbox) parameters are evaluated.
4. The project-start.ksh script is run.
5. Formal parameters are evaluated.
6. Graph parameters are evaluated.
7. The graph start script is run.
PARAMETER INTERPRETATION
The value of a parameter often contains references to other parameters. This attribute
specifies how you want such references to be handled in this parameter. There are four
possibilities:
$ substitution:
This specifies that if the value for a parameter contains the name of another parameter preceded by a
dollar sign ($), the dollar sign and name are replaced with the value of the parameter that's referred
to. In other words, $parameter is replaced by the value of parameter. $parameter is said to be "a
$reference" or "dollar-sign reference to parameter".
${} substitution:
${} substitution is similar to $ substitution, except that it additionally requires that you surround
the name of the referenced parameter with curly braces {}. If you do not use curly braces, no
substitution occurs, and the dollar sign (and the name that follows it) is taken literally. You should
use ${} substitution for parameters that are likely to have values that contain $ as a character,
such as names of relational database tables.
constant:
The GDE uses the parameter's specified value literally, as is, with no further interpretation. If the value
contains any special characters, such as $, the GDE surrounds them with single quotes in the deployed
shell script, so that they are protected from any further interpretation by the shell.
shell:
The shell specified in the Host Settings dialog for the graph, usually the Korn shell, interprets the
parameter's value. Note that the value is not a full shell command; rather, it is the equivalent of the
value assigned in the definition of a shell.
PDL:
The Parameter Definition Language (PDL) is available as an interpretation method only for the
value of a dependent graph parameter. PDL is a simple set of notations for expressing the values
of parameters in components, graphs, and projects.
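The difference between $ substitution and ${} substitution mirrors ordinary Korn/Bourne shell behavior, as this sketch demonstrates (variable names are illustrative):

```shell
TABLE=ref_customers
echo "$TABLE_2024"      # empty: the shell reads this as one name, TABLE_2024
echo "${TABLE}_2024"    # prints ref_customers_2024: braces delimit the name
```

This is why ${} substitution is the safer choice whenever a parameter reference is immediately followed by characters that could legally extend the name.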
LOOKUP
A lookup file is a file of data records that is small enough to fit in
main memory, letting a transform function retrieve records much
more quickly than it could if they were stored on disk. Lookup files
associate key values with corresponding data values to index
records and retrieve them.
Ab Initio's lookup components are dataset components with
special features.
Data in lookups are accessed using specific functions inside
transforms.
Indexes are used to access the data when called by lookup
functions.
The data is loaded into memory for quick access.
A common use for a lookup file is to hold data frequently used by a
transform component. For example, if the data records of a lookup
file contain numeric product codes and their corresponding product
names, a component can rapidly search the lookup file to translate
codes to names or names to codes.
LOOKUP
Guidelines:
The first argument to these lookup functions is the name of the LOOKUP
FILE. The remaining arguments are values to be matched against the
fields named by the key parameter. The lookup functions return a
record that matches the key values and has the format given by the
RecordFormat parameter.
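For the product-code example above, a transform rule might call lookup like this (a DML sketch; the lookup file label "Product-Codes" and the field names are hypothetical):

```
/* "Product-Codes" is a LOOKUP FILE whose key is {code} and whose
   record format contains code and name fields */
out.product_name :: lookup("Product-Codes", in.product_code).name;
```

The first argument matches the component's label, the remaining arguments match the key fields in order, and the returned record is then dereferenced for the field of interest.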
LOOKUP
COMPONENTS:
LOOKUP FILE
LOOKUP TEMPLATE
WRITE LOOKUP
LOOKUP
DML lookup functions locate records in a lookup file based on a
key. The lookup can find an exact value, a regular expression
(regex) that matches the lookup expression, or an interval (range) in
which the lookup expression falls.
Exact:
A lookup file with an exact key allows for lookups that compare one
value to another. The lookup expression (in the lookup function) must
evaluate to a value, and can comprise any number of key fields.
Regex:
Lookups can be configured for pattern matching: the records of the
lookup file have regular expressions in their key fields and you specify
that the key is regex.
Interval:
The lookup expression is considered to match a lookup file record if its
value matches any value in the range (i.e., interval). Another way to
state this: the lookup function compares a value to a series of intervals
in a lookup file to determine in which interval the value falls.
LOOKUP
Interval Lookup:
Intervals defined using one field in each lookup record:
When interval is specified, the lookup file is set up so that each key
comprises only one field: that field contains a value that is both
the upper endpoint of the previous interval and the lower endpoint
of the current interval.
LOOKUP
Key Specifier types
interval
Specifies that the field holds a value that is both:
- the lower endpoint of the current interval (inclusive)
- the upper endpoint of the previous interval (exclusive)
interval_bottom
Specifies that the field holds a value that is the lower endpoint of
an interval (inclusive). The interval_bottom keyword must precede
the interval_top keyword.
interval_top
Specifies that the field holds a value that is the upper endpoint of
an interval (inclusive). Note that interval_top must follow the
interval_bottom keyword.
inclusive (default)
Modifies interval_bottom and interval_top to make them inclusive.
Use only with interval_top and interval_bottom.
exclusive
Modifies interval_bottom and interval_top to make them exclusive.
Use only with interval_top and interval_bottom.
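As an illustration only, a range-lookup record format using the interval_bottom and interval_top keywords might be sketched as follows (field names are hypothetical; verify the exact key-specifier syntax against the Co>Operating System documentation):

```
record
  decimal(8)  low;    /* lower endpoint of the interval */
  decimal(8)  high;   /* upper endpoint of the interval */
  string(20)  band;   /* value returned for the matching interval */
end
/* key specifier (sketch): {low interval_bottom; high interval_top} */
```

A lookup expression then matches the record whose [low, high] range contains its value.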
LOOKUP
Ways to use Lookup:
Static: LOOKUP FILE component.
The lookup file is read into memory when its graph phase starts
and stays in memory until the phase ends.
You set up this kind of lookup by placing a LOOKUP FILE component
in the graph. Unlike other dataset components, the LOOKUP FILE
component has no ports; it does not need to be connected to other
components. However, its contents are accessible from other
components in the same or later phases of the graph.
A LOOKUP FILE component lookup has the advantage of being
simple to use: you add a LOOKUP FILE component to a graph, set
its properties, and then call a lookup function.
LOOKUP
Dynamic:
The lookup file (or just its index if the file is block-compressed) is
read into memory only when you call a loading function from
within a transform; you later unload the lookup file by calling a
lookup_unload function.
A dynamically loaded lookup provides these advantages:
You can specify the lookup file name at runtime rather than
when you create the graph.
You can avoid having multiple lookup files or indexes take up a
lot of memory; this is especially relevant in a graph that uses
many lookup files.
A graph need load only the relevant partition of a lookup file
rather than the entire file; this saves memory in a graph
where a component and lookup file are identically partitioned.
LOOKUP
Catalogued Lookup:
Information about lookup files can be stored in a catalog, so that
you can share the lookup files among multiple graphs by
simply sharing the catalog. A catalog is a file that stores a list of
lookup files on a particular run host. To make a catalog available to
a graph, you use the Shared Catalog Path on the Catalog tab of
the Run Settings dialog to indicate where to find the shared
catalog. (The default setting is to ignore the shared catalog and
use lookup files only.)
LOOKUP
Uncompressed lookup files:
Any file can serve as an uncompressed lookup file if the data is
not compressed and has a field defined as a key. These files
can also be created using the WRITE LOOKUP component.
The component writes two files:
a file containing the lookup data and
an index file that references the data file.
LOOKUP
Lookup files can be either serial or partitioned (multifiles).
Serial:
When a serial lookup file is accessed in a component, the
Co>Operating System loads the entire file into memory. In most
cases the file is memory-mapped: every component in your graph
that uses that lookup file (and is on the same machine) shares a copy
of that file. If the component is running in parallel, all partitions on
the same machine also share a single copy of the lookup file.
lookup, lookup_count, lookup_match, lookup_next, and
lookup_nth are used with serial files.
Partitioned:
Partitioning a lookup file is particularly useful when a multifile system
is partitioned across several machines. The depth of the lookup file
(the number of ways it is partitioned) must match the layout depth of
the component accessing it.
lookup_local, lookup_count_local, lookup_match_local,
lookup_next_local and lookup_nth_local are used with partitioned
lookup files
LOOKUP
Lookup Functions:
lookup[_local]
record lookup[_local] (string file_label, [ expression [ ,
expression ... ]])
LOOKUP
lookup_next[_local]
record lookup_next[_local] (string file_label )
LOOKUP
Dynamically loaded lookup:
Unlike the normal lookup process, dynamically loaded lookups load
the dataset only when it is called by a lookup_load function
in any transformation on the graph.
This provides an opportunity to dynamically specify which file to
use as a lookup.
Dynamically loaded lookups are used in a specific sequence:
Prepare a LOOKUP_TEMPLATE component
Use lookup_load to load the dataset and its index
lookup(lookup_id, "MyLookupTemplate", in.expression)
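The sequence above might be sketched as follows; the argument list of lookup_load is abbreviated because its exact signature should be taken from the DML reference, and the identifier and field names are illustrative:

```
/* Load the lookup data and its index at runtime */
let lookup_identifier_type id = lookup_load(/* data URL, template, ... */);

/* Probe it by key, as with a static lookup */
out.product_name :: lookup(id, in.product_code).name;

/* Release the memory when the lookup is no longer needed */
lookup_unload(id);
```

The load/probe/unload shape is the point: memory is held only between lookup_load and lookup_unload, which is what makes this approach economical in graphs that use many lookup files.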
LOOKUP
Tips:
A lookup file works best if it has a fixed-length record format
because this creates a smaller index and thus faster lookups.
When creating a lookup file from source data, drop any fields
that are not keys or lookup results.
A lookup file represented by a LOOKUP FILE component must fit
into memory. If a file is too large to fit into memory, use INPUT
FILE followed by MATCH SORTED or JOIN instead.
If the lookup file is large, partition the file on its key the same
way as the input data is partitioned. Then use the functions
lookup_local, lookup_count_local, or lookup_next_local. For more
information, see "Partitioned lookup files".