
1) Reformat:

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to add fields,
combine fields, or transform the data in the records.

Recommendation
Component folding can enhance the performance of this component. If this feature is enabled, the
Co>Operating System folds this component by default.

Location in the Component Organizer


Transform folder

Runtime behavior of REFORMAT


 The component reads records from the in port.
 If you specify an expression for the select parameter, the expression filters the records on the
in port:
 If the expression evaluates to 0 for a particular record, Reformat does not process the record,
which means that the record does not appear on any output port.
 If the expression produces NULL for any record, Reformat writes a descriptive error message
and stops execution of the graph.
 If the expression evaluates to anything other than 0 or NULL for a particular record, Reformat
processes the record.
 If you do not specify an expression for the select parameter, Reformat processes all the
records on the in port.
 If you specify a value for either output-index or output-indexes, Reformat passes the records
to the transform functions, calling the transform function on each out port in order for each record,
as determined by the value of output-index or output-indexes, beginning with out port 0 and
progressing through out port count − 1.
 The evaluation of the transform functions takes place within each partition of a Reformat
running in parallel, which means that evaluations of later transform functions can depend on the
results of the evaluations of earlier transform functions, such as modification of global variables
or use of functions such as next_in_sequence.
 If you do not specify a transform function for a particular out port, Reformat uses default
record assignment. You can use default record assignment to eliminate fields from a record
format.
 Reformat writes the valid records to the out ports.
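The runtime behavior above can be sketched in Python (this is not DML; the function name, dict-based records, and None-as-NULL convention are illustrative assumptions):

```python
def reformat(records, select=None, transforms=None, port_count=1):
    """Model of REFORMAT: filter with select, then apply one transform per out port."""
    transforms = transforms or [None] * port_count
    outs = [[] for _ in range(port_count)]
    for rec in records:
        if select is not None:
            keep = select(rec)
            if keep is None:                       # NULL select expression
                raise ValueError("select expression produced NULL")
            if keep == 0:
                continue                           # record dropped from every out port
        for port, fn in enumerate(transforms):
            # With no transform, default record assignment copies matching fields
            # (modeled here as a plain copy).
            outs[port].append(fn(rec) if fn else dict(rec))
    return outs

recs = [{"id": 1, "amt": 10}, {"id": 2, "amt": 0}]
out0, = reformat(recs, select=lambda r: r["amt"])  # amt == 0 filters out record 2
```

A record whose select expression evaluates to 0 appears on no out port, matching the bullets above.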
2) Filter By Expression:

Purpose
Filter by Expression filters records according to a DML expression or transform function, which specifies
the selection criteria. Filter by Expression is sometimes used to create a subset, or sample, of the data.
For example, you can configure Filter by Expression to select a certain percentage of records, or to select
every third (or fourth, or fifth, and so on) record. Note that if you need a random sample of a specific size,
you should use the sample component. FILTER BY EXPRESSION supports implicit reformat.
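Selecting every third (or n-th) record can be pictured with a per-record counter, as a DML select expression would do with next_in_sequence(). A Python sketch of that selection semantics (illustrative, per-partition counter assumed):

```python
from itertools import count

def every_nth(records, n):
    """Keep every n-th record (1-based), mimicking a select expression
    in the spirit of next_in_sequence() % n == 1."""
    seq = count(1)
    return [r for r in records if next(seq) % n == 1]

sample = every_nth(list(range(10)), 3)  # keeps records at positions 1, 4, 7, 10
```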

Implicit Reformat: Reformat has an implicit gather on its in port, as do a number of other
components, Filter by Expression among them. If all the input flows have the same record format, this
behaves the same as having a Gather component in front of the Reformat. A Reformat with more than
one input reads data from the different input flows in arbitrary order and processes the data just as
a regular Reformat does.

Recommendation
Component folding can enhance the performance of this component. If this feature is enabled, the
Co>Operating System folds this component by default.

Component Folding: Ab Initio tries to "fold" multiple components into a single
process where possible. From version 2.14 onward this is worth digging into. Performance-wise,
you will have a small number of processes executing (look for the multitool processes rather than
the unitool processes), with the implied benefits.

Location in the Component Organizer


Transform folder

Runtime behavior of FILTER BY EXPRESSION


Filter by Expression does the following:

 Reads data records from the in port.


 If the use_package parameter is false, applies the expression in the select_expr parameter to
each record. It routes records as follows, based on how the expression evaluates:
 For a non-0 value, Filter by Expression writes the record to the out port.
 For 0, Filter by Expression writes the record to the deselect port. If you do not connect a flow
to the deselect port, Filter by Expression discards the records.
 For NULL, Filter by Expression writes the record to the reject port and a
descriptive error message to the error port.
 If the use_package parameter is true, executes the functions defined in the package.
 If output_for_error or make_error is defined, executes them whenever an error event
occurs. If log_error is defined and logging of rejects is turned on, executes log_error.
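The routing rules above can be sketched in Python (not DML; None models NULL, and the port names are lists here):

```python
def filter_by_expression(records, select_expr):
    """Route records the way FILTER BY EXPRESSION does:
    non-0 -> out, 0 -> deselect, NULL (None) -> reject plus an error message."""
    out, deselect, reject, errors = [], [], [], []
    for rec in records:
        val = select_expr(rec)
        if val is None:
            reject.append(rec)
            errors.append("select_expr evaluated to NULL")
        elif val == 0:
            deselect.append(rec)
        else:
            out.append(rec)
    return out, deselect, reject, errors

recs = [{"x": 5}, {"x": 0}, {"x": None}]
out, deselect, reject, errors = filter_by_expression(recs, lambda r: r["x"])
```

If no flow is connected to the deselect port, those records are simply discarded, as noted above.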
3) Rollup:

Purpose
Rollup evaluates a group of input records that have the same key, and then generates records that either
summarize each group or select certain information from each group.

Location in the Component Organizer


Transform folder

Recommendations
 For new development, use Rollup rather than AGGREGATE. Rollup provides more control
over record selection, grouping, and aggregation.
 The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid values),
according to whether you use the aggregation functions for the rollup:
 Without aggregation functions, you can use ROLLUP normally.
 With aggregation functions, always clean and validate data before rolling it up. Because the
aggregation functions use a multistage transform, ROLLUP follows computation rules that may
cause unexpected or even incorrect results in the presence of dirty data (NULLs or invalid
values). Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors — including the data type, the DML expression
used to perform the rollup, and the value of the sorted-input parameter — may affect where the
problems occur. It is safest to clean and validate the data before using the aggregation functions
in ROLLUP.
 Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default.

Two modes to use ROLLUP


You can use a ROLLUP component in two modes, depending on how you define the transform
parameter:

1. Template mode — You define a simple rollup function that may include aggregation functions.
Template mode is the most common/simple way to use ROLLUP.

2. Expanded mode — You create a transformation using an expanded rollup package. This mode
allows for rollups that do not necessarily use regular aggregation functions.

Then ROLLUP executes the following steps for each group of records:

 Input selection.
 Temporary initialization.
 Computation/Transformation.
 Finalization.
 Output selection.

Runtime behavior of ROLLUP 


ROLLUP performs the following operations for each group of records:

1. Input selection:

 If you have not defined the input_select function in your transform, ROLLUP processes all records.
 If you have defined the input_select function, ROLLUP filters the input records accordingly.

2. Key change (for sorted input only):

 For every record except the first, ROLLUP checks whether a key change has occurred:
 ROLLUP compares the current record’s key value to the previous record’s key value, unless the
key_change function is defined.
 If the key_change function is defined, ROLLUP calls that function to check for a key change.

3. Temporary initialization:

 ROLLUP passes the first record in each group to the initialize transform function.

4. Computation:

 ROLLUP calls the rollup transform function for each input record.
 The input to the rollup transform function is the input record and the temporary record for the group to
which the input record belongs.
 The rollup transform function returns an updated temporary record for that input group. 

5. Finalization of the output:

With sorted-input set to True:

 ROLLUP calls the finalize transform function after it processes all the input records in each group.
 ROLLUP passes the temporary record for the group and the last input record in the group to the
finalize transform function.
 The finalize transform function produces an output record for the group.

Note:

 With sorted-input set to False, after ROLLUP processes all the input records, it calls the finalize
transform function with the temporary record for each group and an arbitrary input record from each
group as arguments.
 ROLLUP repeats this procedure with each group.
 The finalize transform function then produces an output record for each group.
 The component stops the execution of the graph when the number of reject events exceeds the result
of the following formula:

limit + (ramp * number_of_records_processed_so_far)

6. Output selection:

 If you have defined the output_select transform function, it filters the output records.
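The steps above, for sorted input, can be sketched in Python (not DML; the callback names mirror the transform functions described, and dict records are illustrative):

```python
from itertools import groupby

def rollup(records, key, initialize, rollup_fn, finalize, input_select=None):
    """Model of ROLLUP on sorted input: input selection, temporary
    initialization, computation per record, then finalization per group."""
    selected = [r for r in records if input_select is None or input_select(r)]
    out = []
    for _, grp in groupby(selected, key=key):
        grp = list(grp)
        temp = initialize(grp[0])            # temporary initialization (first record)
        for rec in grp:
            temp = rollup_fn(temp, rec)      # computation: update temporary record
        out.append(finalize(temp, grp[-1]))  # finalize with the last record in the group
    return out

recs = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 5}]
totals = rollup(
    recs,
    key=lambda r: r["k"],
    initialize=lambda first: 0,
    rollup_fn=lambda temp, rec: temp + rec["v"],
    finalize=lambda temp, last: {"k": last["k"], "total": temp},
)
```

Output selection (step 6) would simply filter the returned list with output_select.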

4) Scan:

Purpose
For every input record, Scan generates an output record that consists of a running cumulative summary
for the group to which the input record belongs, up to and including the current record. For example, the
output records might include successive year-to-date totals for groups of records.

Recommendations
 If you want one summary record for a group, use ROLLUP.
 The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to whether you use the aggregation functions for the scan:
 Without aggregation functions, you can use SCAN normally.
 With aggregation functions, always clean and validate data before scanning it. Because the
aggregation functions use a multistage transform, SCAN follows computation rules that may
cause unexpected or even incorrect results in the presence of dirty data (NULLs or invalid
values). Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors — including the data type, the DML expression
used to perform the scan, and the value of the sorted-input parameter — may affect where the
problems occur. It is safest to clean and validate the data before using the aggregation functions
in SCAN.
 Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.

Location in the Component Organizer


Transform folder

At runtime, Scan does the following:


 Input selection.
 Temporary initialization.
 Computation.
 Finalization.
 Output selection.
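The key difference from ROLLUP is that Scan emits one output per input record, carrying the running summary for the record's group so far. A Python sketch (not DML; sorted input and dict records assumed):

```python
def scan(records, key, initialize, scan_fn, finalize):
    """Model of SCAN: for each record, emit the cumulative summary of its
    group up to and including that record (input assumed sorted on key)."""
    out, temp, current = [], None, object()   # sentinel forces init on first record
    for rec in records:
        k = key(rec)
        if k != current:                      # key change starts a new group
            current, temp = k, initialize(rec)
        temp = scan_fn(temp, rec)             # update the running summary
        out.append(finalize(temp, rec))       # one output per input record
    return out

recs = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 5}]
ytd = scan(recs, lambda r: r["k"], lambda r: 0,
           lambda t, r: t + r["v"],
           lambda t, r: {"k": r["k"], "running": t})
```

This mirrors the year-to-date totals example in the Purpose section.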

5) Normalize:

Purpose
Normalize generates multiple output records from each of its input records. You can directly specify the
number of output records for each input record, or you can make the number of output records dependent
on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field for each group
— the inverse of NORMALIZE — you would use the accumulation function of the ROLLUP component.

Recommendations

 Always clean and validate data before normalizing it. Because Normalize uses a multistage
transform, it follows computation rules that may cause unexpected or incorrect results in the
presence of dirty data (NULLs or invalid values). Furthermore, the results will be hard to trace,
particularly if the reject-threshold parameter is set to Never abort. Several factors — including
the data type, the DML expression used to perform the normalization, and the value of the
sorted-input parameter — may affect where the problems occur. It is safest to avoid normalizing
dirty data.
 Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

Runtime behavior of NORMALIZE

 Reads the input record.


 Performs temporary initialization.
 Performs iterations of the normalize transform function. NORMALIZE determines the
number of iterations to perform using either the finished or the length function, whichever is
defined.
 Sends the output record to the out port.
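Using the length function, the behavior can be sketched in Python (not DML; the vector-of-items record is an illustrative example):

```python
def normalize(records, length, transform):
    """Model of NORMALIZE with a length function: for each input record,
    call the transform once per iteration index 0 .. length(rec)-1."""
    out = []
    for rec in records:
        for i in range(length(rec)):
            out.append(transform(rec, i))     # one output record per iteration
    return out

recs = [{"id": 1, "items": ["a", "b"]}, {"id": 2, "items": ["c"]}]
flat = normalize(recs, lambda r: len(r["items"]),
                 lambda r, i: {"id": r["id"], "item": r["items"][i]})
```

The ROLLUP accumulation mentioned above would be the inverse, rebuilding the vector from the flat records.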
6) Dedup Sorted:

Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the records in the
group.

Requirement
Dedup Sorted requires grouped input.

Recommendation
Component folding can enhance the performance of this component. If this feature is enabled, the
Co>Operating System folds this component by default. See “Component folding” for more information.

Location in the Component Organizer


Transform folder

Runtime behavior of DEDUP SORTED with parameters


The parameters of Dedup Sorted, chiefly select and keep, determine its runtime behavior, described below.

Dedup Sorted does the following:

 Reads a grouped flow of records from the in port.

If your records are not already grouped, use SORT to group them.

 Does one of the following:

If you have supplied an expression for the select parameter, Dedup Sorted applies the expression to each
record and processes only the records for which it evaluates to non-0. If you do not supply an expression
for the select parameter, Dedup Sorted processes all records on the in port.

 Processes groups of records as follows:

Considers any consecutive records with the same key value to be in the same group. If a group consists
of one record, writes that record to the out port. If a group consists of more than one record, uses the
value of the keep parameter to determine which record — if any — to write to the out port, and which
record or records to write to the dup port. If you have chosen unique-only for the keep parameter, does
not write records to the out port from any groups consisting of more than one record.
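The keep logic can be sketched in Python (not DML; keep values "first", "last", and "unique-only" follow the description above, and routing of non-kept records to the dup port is assumed):

```python
from itertools import groupby

def dedup_sorted(records, key, keep="first"):
    """Model of DEDUP SORTED on grouped input. keep is 'first', 'last',
    or 'unique-only'; non-kept records go to the dup list."""
    out, dup = [], []
    for _, grp in groupby(records, key=key):
        grp = list(grp)
        if len(grp) == 1:
            out.append(grp[0])                # singleton groups always pass through
        elif keep == "unique-only":
            dup.extend(grp)                   # multi-record groups emit nothing
        else:
            chosen = grp[0] if keep == "first" else grp[-1]
            out.append(chosen)
            dup.extend(r for r in grp if r is not chosen)
    return out, dup

recs = [{"k": 1, "v": "x"}, {"k": 1, "v": "y"}, {"k": 2, "v": "z"}]
out, dup = dedup_sorted(recs, key=lambda r: r["k"], keep="first")
```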

7) Join:
Join reads records from multiple in ports, operates on the records with matching keys using a multi-input
transform function, and writes the result to its output ports.

In Join, the key parameter must be specified from the input flows (either of the flows), sorted in ascending
or descending order.
If the input flows do not have any common field, override-key must be specified to map the key
specified.

Purpose of JOIN Component


 JOIN reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output port.
 Its additional ports can also be used to collect rejected and unused records.
 
Parameters for JOIN (Not all parameters are covered.)

count (integer, required)


 It is an integer n specifying the total number of inputs (in ports) to join. The number of input ports also
determines the number of the following ports and parameters:
        unused ports
        reject ports
        error ports
        record-match-required parameters
        dedup parameters
        select parameters
        override-key parameters

    Default is 2.

Each in port (always two or more) has a number n appended. Each outn, unused, reject, and error port
corresponds to an in port.
 
 
sorted-input (boolean, required)

 When this parameter is set to False, the component accepts unsorted input and permits the use of the
maintain-order parameter.
 When this parameter is set to True, the component requires sorted input. In this case, consider setting
the check-sort parameter to True.
Default is True. 

key(key specifier, required)


 
 Name(s) of the field(s) in the input records that must have matching values for JOIN to call the
transform function. The types of the fields in the different inputs must be compatible.
 
transform (filename or string, required)

 Either the name of the file containing the transform function, or a transform string.

 join-type (choice, required)


Choose one of the following options:

 Inner join (default) — Sets the record-match-required parameters for all ports to True. The GDE does
not display the record-match-required parameters, because they all have the same value.
 Outer join — Sets the record-match-required parameters for all ports to False. The GDE does not
display the record-match-required parameters, because they all have the same value.
 Explicit — Allows you to set the record-match-required parameter for each port individually.

If you set the dedup parameter to True on the driving input, set the join-type parameter to Inner join. (The
driving input is the largest input, as specified by the driving parameter.)

If you remove duplicates on this input port before joining it to the driving input, set the record-match-
required parameter to True on all other ports.
 
 
parameter-interface (choice, required)

 This parameter is available only after you update a pre-Version 3.2.1 JOIN component to Version
3.2.2 or higher. It is not available for new components.

 Controls whether to use a legacy or improved parameter interface. The choices are the following:

 legacy — Displays the record-required parameter, whose boolean value specifies whether to use an
inner or outer join, that is, whether a record is required or a null is substituted for missing records. This
parameter has inverted booleans. The default for pre-Version 3.2.1 components.

 version-3-2-2 — Displays the record-match-requiredn parameter, whose boolean value specifies
whether to use an inner or outer join. This parameter has normal booleans.

record-required (boolean, required)

 This parameter is available only when the parameter-interface parameter is set to legacy (or in a pre-
Version 3.2.1 JOIN component) and the join-type parameter is set to Explicit.
The default is True. 

record-match-required (boolean, required)

 This parameter is available only when the join-type parameter is set to Explicit.
 It  is used to specify whether a record is required or whether to substitute a null for a missing record.
The default is True.     

    To use this parameter, note the following points:

 When there are two inputs, set record-match-required to True on the input port for which you want to
call the transform for every record, regardless of whether there is a matching record on the other input
port.
 When there are more than two inputs, set record-match-required to True when you want to call the
transform only when there are records with matching keys on all input ports for which record-match-
required is True.
dedup(boolean, required)
 Set the dedup  parameter to Dedup this input before joining to remove duplicates from the
corresponding inn port before joining. This allows you to choose only one record from a group with
matching key values as the argument to the transform function.
 There is one dedup parameter associated with each in port. Unused duplicates are sent to the unused
port.
Default is Do not dedup this input.

select  (expression, optional)


 Filters for records before a join function. One per in port; n represents the number of an in port. If you
use select with dedup, the JOIN component performs the select first, then removes the duplicate records
that made it through the select. 

 max-memory (integer, required)

 Maximum memory usage in bytes before the component writes temporary files to disk. Available only
when the sorted-input parameter is set to True.
The default value is 8388608 bytes (8 MB).
    
check-sort  (boolean, required)
 Available only when the sorted-input parameter is set to True.
 If set to True, stops the graph on the first input record that is out of sorted order (according to the key).
 The default is False. In this case, JOIN does not necessarily stop or issue an error when it encounters
unsorted input. If sorted input is a requirement, set check-sort to True.

maintain-order  (boolean, required)

 Set to True to ensure that records remain in the original order of the driving input. (The driving input is
the largest input, as specified by the driving parameter.)
 Available only when the sorted-input parameter is set to False. If the sorted-input parameter is set to
True and all inputs are sorted on the fields given in the key parameter, the output maintains the sort order
on that key without the use of this parameter.
 If any inputs other than the driving input are too large to fit within the memory limit specified by max-
core, the behavior of the component depends on the setting of maintain-order:
 False — The component stores some of its intermediate results in temporary files on disk. This alters
the order of records in the driving input.
 True — The component stops execution of the graph.
Default is False.

max-core (integer, required)

 Maximum memory usage in bytes. Available only when the sorted-input parameter is set to False.
 If the total size of the non-driving inputs that the component holds in memory exceeds the number of
bytes specified in the max-core parameter, the component writes temporary files to disk.
Default value is 67108864 bytes (64 MB).
Runtime behavior of JOIN
 JOIN performs the following operations:
 
1. Reads data records from multiple in ports. Depending on the setting of the sorted-input parameter, it
does one of the following:
 If input is sorted, it reads records in the order in which they arrive. 
 If input is unsorted, it loads all records from all inputs except the driving input into main memory.
Once the non-driving inputs are loaded, it reads records from the driving input in the order in which they
arrive.

2. Applies the expression in any defined select parameter to the records on the corresponding in port:

 If the select expression evaluates to 0 for a record, the JOIN component does not process the
record, and the record does not appear on any output port.
 If it evaluates to anything other than 0 or NULL for a particular record, JOIN processes the record.
 If you do not supply an expression for a select parameter, JOIN processes all the records on the
corresponding in port.
 
3. If the dedup parameter is set to True, removes any duplicate records that have made it through the select.

4. Operates on records that have matching key values using a multi-input transform function.

If the transform function returns NULL, then JOIN:


 Writes each input record to the corresponding reject port, then stops execution of the graph when the
number of reject events exceeds the result of the following formula:

        limit + (ramp * number_of_records_processed_so_far)
 Writes an error message to the corresponding error port. If no flows are connected to the rejectn or
errorn ports, the JOIN component discards the information.
5. Writes the non-NULL return record from the transform function to the out port. 
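The match semantics, including the record-match-required behavior described above, can be sketched in Python for two inputs (not DML; None models the null substituted for a missing record, and tuples model the transform output):

```python
from collections import defaultdict

def join2(left, right, key, transform, match_required=(True, True)):
    """Model of a two-input JOIN. match_required mirrors record-match-required:
    (True, True) is an inner join; (False, False) is a full outer join,
    substituting None for the missing side."""
    by_key = defaultdict(lambda: [[], []])
    for rec in left:
        by_key[key(rec)][0].append(rec)
    for rec in right:
        by_key[key(rec)][1].append(rec)
    out = []
    for k in sorted(by_key):
        ls, rs = by_key[k]
        if not ls and match_required[0]:      # a left record is required but missing
            continue
        if not rs and match_required[1]:      # a right record is required but missing
            continue
        for l in ls or [None]:                # None stands in for the missing record
            for r in rs or [None]:
                out.append(transform(l, r))
    return out

a = [{"k": 1, "x": "a1"}, {"k": 2, "x": "a2"}]
b = [{"k": 2, "y": "b2"}, {"k": 3, "y": "b3"}]
inner = join2(a, b, lambda r: r["k"], lambda l, r: (l, r))
outer = join2(a, b, lambda r: r["k"], lambda l, r: (l, r),
              match_required=(False, False))
```

Setting match_required to True on only one port calls the transform for every record on that port regardless of a match on the other, as in the record-match-required notes above.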

8) Fuse:
Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single output
flow. It examines one record from each input flow simultaneously, acting on the records
according to the transform function you specify. For example, you can compare records,
selecting one record or another based on some criteria, or “fuse” them into a single record that
contains data from all the input records.

Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However, certain
components placed upstream of Fuse, such as Reformat or Filter by Expression, could reject or
divert some records. In that case, you may not be able to guarantee that the flows stay in sync.
A more reliable option is to add a key field to the data; then use Join to match the records by
key.
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
9) Broadcast:
Broadcast reads records from all the flows on its in port, combines them, and sends a complete copy to
every flow on its out port. Broadcast is used for data parallelism: it works with multifiles, whereas
Replicate does not.

10) Replicate:
Replicate copies the data of each partition and sends it out to the multiple out ports of
the component, maintaining partition integrity.

Replicate is used for component parallelism. Data parallelism would be only by partitioning and
actually having multi files/flows.

Note that Replicate is in the Miscellaneous components list, whereas Broadcast is in the
Partition components list. There lies a major difference.

You can have a serial input to Broadcast and a partitioned output.

Let me explain this.


1) Replicate: It writes all the data on its input to each of its outputs. You cannot have a serial
layout (SFS) on input and MFS on output.
2) Broadcast: It can have SFS at in and MFS at out.
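The distinction can be sketched in Python, modeling partitions as lists of lists (illustrative only):

```python
def replicate(partitions, n_outs):
    """Replicate: each out port gets a copy of the same partitioned data;
    no records cross partition boundaries."""
    return [[list(p) for p in partitions] for _ in range(n_outs)]

def broadcast(partitions, n_out_partitions):
    """Broadcast: all input records are combined and a full copy lands in
    every output partition (serial in, multifile out is fine)."""
    combined = [rec for p in partitions for rec in p]
    return [list(combined) for _ in range(n_out_partitions)]

serial = [["u1", "u2"]]          # one serial input partition
copies = broadcast(serial, 4)    # a full copy in each of 4 output partitions
```

In the broadcast case the output layout is wider than the input; replicate simply fans the same layout out to multiple ports.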

The scenario below may throw some light on where exactly you can use Broadcast and not
Replicate.

Assume requirements are as below :


a. File A has 20,000,000 records, sorted and partitioned on acct_no.
b. File B has 20,000 records, sorted on, say, surrogate_key, and it is a serial file.
File B has acct_no as a field in its DML.

Now suppose File B has updates, and we update File A using the data in B.
Normal approach: partition by key and sort on acct_no for File B, then take a join.

Use of broadcast:

Just use a sort on acct_no and Broadcast it before join.


All 20,000 records will be copied into every partition (the out port layout will be MFS). Here you
save the effort of partitioning, and 20,000 records will not take much memory.

This scenario is used just to explain where to use Broadcast. There are many other alternative options
to achieve the same result.
11) Sort Within Groups:

If your file is sorted on acct_num and you want to sort on two other keys, you can use Sort within
Groups, provided acct_num is your first (major) key. You should use Sort within Groups for better
performance. Your major key is acct_num and the minor keys are the others. It won't check or sort
on the major key. It will only sort on the minor keys.
For example:
If you require the file to be sorted on acct_num, key2, key3, you can use Sort within Groups.
But if you require the file to be sorted on key1, acct_num, key2, then you will have to use the Sort
component. It is preferable to use Sort within Groups wherever applicable, as it reduces the keys on
which the sort needs to be done, adding its part to the performance.

The Sort component reads records into memory and sorts them until it reaches the last input record or
the limit of the max-core parameter. Once it reaches the max-core limit, it spills data to disk and
continues with the next set of inputs. And once all records are processed, it performs a merge operation.
Sort within Groups, by contrast, reads the records of each group until it reaches the end of the group or
the max-core limit, sorts them on the minor key, writes them to the output port, and repeats for the next group.

When we use a plain Sort on data already sorted on field1 to sort again on field2, the output will only be
sorted on field2. With Sort within Groups, we provide field1 as the major key and field2 as the minor
key, which means the component will maintain the sorted order of the major key and, within each
major-key group, sort the records on the minor key, field2. This way the output is sorted on both
field1 and field2.
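That behavior can be sketched in Python (illustrative; input assumed already sorted on the major key, as the component requires):

```python
from itertools import groupby

def sort_within_groups(records, major_key, minor_key):
    """Model of SORT WITHIN GROUPS: sort each major-key group on the
    minor key without disturbing the existing major-key order."""
    out = []
    for _, grp in groupby(records, key=major_key):
        out.extend(sorted(grp, key=minor_key))   # only the minor key is sorted
    return out

recs = [{"acct": 1, "d": 3}, {"acct": 1, "d": 1}, {"acct": 2, "d": 2}]
result = sort_within_groups(recs, lambda r: r["acct"], lambda r: r["d"])
```

Because each group is sorted independently, only one group at a time needs to fit in memory, which is where the performance benefit over a full Sort comes from.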

12) Read Multiple Files:

In get_filename, provide the directory from which you want to pick up the files. You can
parameterize the path (directory structure) so it can take dynamic input as well. Once you provide the
directory structure, Ab Initio will take all the files from the specified location.

e.g.: get_filename = ${AB_WORK}/$USER/$LOC

Now you can assign $USER and $LOC as per your convenience:

$USER = summukhe
$LOC = export/home/process

So it will take all the files from the process directory and give this input to Ab Initio.

Our input file (say abc.dat) holds the data locations. The record format for the input file is:

record
    string('\n') file_name;
end;

In the Read Multiple Files component, the transform parameter contains:

include "acquisition_interface/definition/cfg_bcbsri_claims.dat.dml";

filename :: get_filename(in) =
begin
    filename :: in.file_name;
end;

/* Create output record */

out :: reformat(read, filename) =
begin
    out.* :: read.*;
end;
