
AB-INITIO TRANSFORM COMPONENT

AGGREGATE
Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component. If this feature
is enabled and if the sorted-input parameter is set to In memory: Input need not be
sorted, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer

Miscellaneous/Deprecated/Transform folder

COMBINE

Purpose
COMBINE processes data in a number of useful ways. You can use COMBINE to:
Restore hierarchies of data flattened by the SPLIT component

Create a single output record by joining multiple input streams

Denormalize vectors (including nested vectors)

How COMBINE works


COMBINE does not use transform functions. It determines what operations to perform on
input data by using DML that is generated for COMBINE's input ports by the split_dml
command-line utility.
COMBINE performs the inverse operations of the SPLIT component. It has a single output
port and a counted number of input ports. COMBINE (optionally) denormalizes each input
data stream, then performs an outer join on the input records to form the output records.
Using COMBINE for joining data
To use COMBINE to denormalize and join input data, you need to sort and specify keys for
the data. If the input to COMBINE is from an output of SPLIT, you can set up SPLIT to
automatically generate keys by running split_dml with the -g option. Otherwise, you can
generate keys by running split_dml with the -k option, supplying the names of key fields. If
you specify no keys, COMBINE uses an implied key, which is equal to a record's index
within the sequence of records on the input port. In other words, COMBINE merges records
synchronously on each port.
When merging these records, COMBINE selects for processing the records that match the
smallest key present on any port. Thus, the input data on each port should be sorted in the
order specified by the keys.
COMBINE can also merge elements of vectors, in the same way it merges top-level
records: if you specify no key, COMBINE merges the elements based on an implied key,
which is equal to a record's index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

Location in the Component Organizer


Transform folder

Example of using COMBINE


Say you have a file example2a.dml with the following record format:
record
string("|") region = "";     //Sort key 1
string("|") state = "";      //Sort key 2
string("|") county = "";     //Sort key 3
string("|") addr_line1 = "";
string("|") addr_line2 = "";
string("|") atm_id = "";
string("|") comment = "";
string("\n") regional_mgr = "";
end;
And you want to roll up the fields that are marked as sort keys (region, state, and county)
into nested vectors. To do this, you can use a single COMBINE component rather than
performing a series of three rollup actions.
The desired output format (example2b.dml) is:
record
  string("|") region;              //Sort key 1
  record
    string("|") state;             //Sort key 2
    record
      string("|") county;          //Sort key 3
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|") atm_id;
        string("|") comment;
      end[int] atms;
    end[int] counties;
  end[int] states;
  string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so you specify
"..#" as the value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.
You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINE's input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// with the command-line arguments:
// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
string("|") region;     // Sort key 1
string("|") state;      // Sort key 2
string("|") county;     // Sort key 3
string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") regional_mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,
regional_mgr=regional_mgr';
string('\0') DML_key_specifiers() =
'{region}=,{state}=states[],{county}=states.counties[]';
end

DEDUP SORTED

Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FILTER BY EXPRESSION

Purpose
Filter by Expression filters records according to a DML expression or transform function,
which specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For
example, you can configure Filter by Expression to select a certain percentage of records,
or to select every third (or fourth, or fifth, and so on) record. Note that if you need a random
sample of a specific size, you should use the SAMPLE component.
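For illustration, here is a minimal sketch of a selection expression that keeps every third
record. It assumes the expression is supplied as the component's select expression and uses
the DML next_in_sequence function, which returns 1, 2, 3, ... for successive records:

// keep every third record (3, 6, 9, ...) in each partition
next_in_sequence() % 3 == 0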
FILTER BY EXPRESSION supports implicit reformat. For more information, see Implicit
reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FUSE

Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single
output flow. It examines one record from each input flow simultaneously, acting on the
records according to the transform function you specify. For example, you can compare
records, selecting one record or another based on some criteria, or fuse them into a single
record that contains data from all the input records.
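As a minimal sketch of such a transform, assuming two input flows and hypothetical fields
id and amount, a fuse function could prefer the value from in1 whenever it is present:

out :: fuse(in0, in1) =
begin
  out.id     :: in0.id;                     // carry the identifier from the first flow
  out.amount :: if (is_defined(in1.amount)) in1.amount
                else in0.amount;            // fall back to in0's value otherwise
end;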
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However,
certain components placed upstream of Fuse, such as Reformat or Filter by Expression,
could reject or divert some records. In that case, you may not be able to guarantee that the
flows stay in sync. A more reliable option is to add a key field to the data; then use Join to
match the records by key.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

JOIN

Purpose
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions)
that are large and sorted-input is set to Inputs must be sorted, the order of output records
within a key group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder

Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism
for deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new
outgoing records

The mechanism for deciding when to call the transform function consists of the
settings of the parameters join-type, record-requiredn, and dedupn.

Inner joins
The most common case is when join-type is Inner Join. In this case, if each input port
contains a record with the same value for the key fields, the transform function is called and
an output record is produced.
If some of the input flows have more than one record with that key value, the transform
function is called multiple times, once for each possible combination of records, taken one
from each input port.
Whenever a particular key value does not have a matching record on every input port and
Inner Join is specified, the transform function is not called and all incoming records with that
key value are sent to the unusedn ports.
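For example, a minimal inner-join transform might look like the following sketch; the fields
key, amount, and name are hypothetical:

out :: join(in0, in1) =
begin
  out.key    :: in0.key;     // the matched key value
  out.amount :: in0.amount;  // data from the in0 record
  out.name   :: in1.name;    // data from the matching in1 record
end;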

Full outer joins


Another common case is when join-type is Full Outer Join: if each input port has a record
with a matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in
effect ignored.
With an outer join, the transform function typically requires additional rules (as compared to
an inner join) to handle the possibility of NULL inputs.
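For instance, a sketch of such rules (again with hypothetical fields) guards each input with
is_defined and supplies a default when a record is missing:

out :: join(in0, in1) =
begin
  out.key    :: if (is_defined(in0)) in0.key else in1.key;
  out.amount :: if (is_defined(in0)) in0.amount else 0;        // default when in0 is missing
  out.name   :: if (is_defined(in1)) in1.name else "unknown";  // default when in1 is missing
end;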
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False
for the record-requiredn parameter for each inn port. The settings you choose determine
when Join calls the transform function. See record-requiredn.

Examples of join types

Complex multiway joins


For the three-way joins shown in the following diagrams, the shaded regions again
represent the key values that must match in order for Join to call the transform function:

In the cases shown above, suppose you want to narrow the join conditions to a subset of
the shaded (required match) area. To do this, use the DML is_defined function in a rule in
the transform itself. This is the same principle demonstrated in the two-way join shown in
Getting a joined output record.
For example, suppose you want to produce an output record when a particular key value
either is present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded
area to represent the necessary conditions. However, Case 2 also represents conditions
under which you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:

1. Set join-type to Full Outer Join as in Case 2 above.
2. Put the following rules in Join's transform function:

out.key :1: if (is_defined(in0)) in0.key;
out.key :2: if (is_defined(in1) && is_defined(in2)) in1.key;
For both rules to fail, the particular key value must be absent from in0 and must be present
in only one of in1 or in2.
Join writes the records that result in both rules failing to the rejectn ports if you connect
flows to them.

MATCH SORTED

Purpose
Match Sorted combines multiple flows of records with matching keys and performs
transform operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match
Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder

Example of using MATCH SORTED


This example shows how repeat and missing key values affect the number of times Match
Sorted calls the transform function.
Suppose three input flows feed Match Sorted. The records in these flows have
three-character alphabetic key values. The key values of the records in the three flows are
as follows:

            in0    in1    in2
record 1    aaa    aaa    aaa
record 2    bbb    bbb    ccc
record 3    ccc    ccc    ddd
record 4    eee    eee    eee
record 5    eee    fff    fff
record 6    eee    end    end

Match Sorted calls the transform function eight times for these data records, with the
arguments as follows:
transform( in0-rec1, in1-rec1, in2-rec1 )   records with key value aaa
transform( in0-rec2, in1-rec2, NULL )       records with key value bbb
transform( in0-rec3, in1-rec3, in2-rec2 )   records with key value ccc
transform( NULL,     NULL,     in2-rec3 )   records with key value ddd
transform( in0-rec4, in1-rec4, in2-rec4 )   records with key value eee
transform( in0-rec5, in1-rec4, in2-rec4 )   records with key value eee
transform( in0-rec6, in1-rec4, in2-rec4 )   records with key value eee
transform( NULL,     in1-rec5, in2-rec5 )   records with key value fff

Since there are three eee records in the flow attached to in0, Match Sorted calls the
transform function three times with eee records as inputs. Since the next records on in1 and
in2 do not have key value eee, in1 and in2 repeat their rec4 records.

MULTI REFORMAT

Purpose
Multi Reformat changes the format of records flowing through 1 to 20 pairs of in and out ports
by dropping fields or by using DML expressions to add fields, combine fields, or transform
data in the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a
regular REFORMAT component is the correct choice. For example:
If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also
reformatting it (by adding, combining, or transforming fields), try using the
output-index and count parameters on the REFORMAT component.

A recommended use for Multi Reformat is to put it immediately before a custom component
that takes multiple inputs. For more information, see Using MULTI REFORMAT to avoid
deadlock.

Using MULTI REFORMAT to avoid deadlock


Deadlock occurs when a program cannot progress, causing a graph to hang. Custom
components (components that you have built to execute your own programs) are prone to
deadlock because they cannot use the GDE's automatic flow buffering. If a custom
component is programmed to read from multiple flows in a specific order, it carries the
possibility of causing deadlock.

To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the
custom component. Using this built-in component to process the input flows applies
automatic flow buffering to them before they reach the custom component, thus avoiding the
possibility of deadlock.

NORMALIZE

Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field
for each group (the inverse of NORMALIZE), you would use the accumulation function
of the ROLLUP component.
Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a
multistage transform, it follows computation rules that may cause unexpected or
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the normalization, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to avoid normalizing
dirty data.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See
Component folding for more information.

NORMALIZE transform functions


What Normalize does is determined by the functions, types, and variables you define in its
transform parameter.

There are seven built-in functions, as shown in the following table. Of these, only normalize
is required. Examples of most of these functions can be found in Simple NORMALIZE
example with vectors.
There is also an optional temporary_type (see Optional NORMALIZE transform functions
and types), which you can define if you need to use temporary variables. For an example,
see NORMALIZE example with a more elaborate transform.

input_select
Required?: No
Arguments: input record
Return value: An integer(4) value. An output value of 0 means false (the record was not
selected); non-zero means true (the record was selected). See Optional NORMALIZE
transform functions and types.

initialize
Required?: No
Arguments: input record
Return value: A record whose type is temporary_type. See Optional NORMALIZE transform
functions and types. For examples, see NORMALIZE example with a more elaborate
transform.

length
Required?: Only if finished is not provided
Arguments: input record
Return value: An integer(4) value. Specifies the number of output records Normalize
generates for this input record. If the length function is provided, Normalize calls it once
for each input record. For examples, see Simple NORMALIZE example with vectors and
NORMALIZE example with a more elaborate transform.

finished (if you have defined temporary_type)
Required?: Only if length is not provided
Arguments: temporary record, input record, index
Return value: 0 (meaning false), if more output records are to be generated from the
current input record. Otherwise, a non-zero value (true). If the finished function is
provided, NORMALIZE calls it once more than the number of output records it produces.
On the final call it returns true and no output record is produced.

finished (if you have not defined temporary_type)
Required?: Only if length is not provided
Arguments: input record, index
Return value: Same as for finished with temporary_type defined.

normalize (if you have defined temporary_type)
Required?: Yes
Arguments: temporary record, input record, index
Return value: A record whose type is temporary_type. For examples, see Simple
NORMALIZE example with vectors.

normalize (if you have not defined temporary_type)
Required?: Yes
Arguments: input record, index
Return value: An output record.

finalize
Required?: No
Arguments: temporary record, input record
Return value: The output record. See Optional NORMALIZE transform functions and types
and NORMALIZE example with a more elaborate transform.

output_select
Required?: No
Arguments: output record
Return value: An integer(4) value. An output value of 0 means false (the record was not
selected); non-zero means true (the record was selected). See Optional NORMALIZE
transform functions and types.
Input and output names in transforms


In all transform functions, the names of the inputs and outputs are used only locally, so you
can use any names that make sense to you.
Optional NORMALIZE transform functions and types
There are several optional transform functions and an optional type you can use with
Normalize:
input_select: The input_select transform function performs selection of input
records:

out :: input_select(in) =
begin
out :: in.n == 1;
end;
The input_select transform function takes a single argument (the input record) and
returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if
NORMALIZE is to accept a record.
initialize: The initialize transform function initializes temporary storage. This
transform function takes a single argument (the input record) and returns a
single record with type temporary_type:

temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;

length: The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function
specifies the number of times the normalize function will be called for the current
record. This function takes the input record as an argument:

out :: length(in) =
begin
out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished: The finished transform function is required when the length function is
not defined. (You must use at least one of these functions.) This transform function
returns a boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call
the normalize function for the current record. When the finished function returns
non-zero (true), NORMALIZE moves to the next input record.

out :: finished(in, index) =
begin
out :: in.array[index] == "ignore later elements";
end;
The finished function essentially provides a way to implement a while-do loop in the
record-reading process.
NOTE: Although we recommend that you not use both length and finished in the same
component, it is possible to define both. In that case, Normalize loops until either finished
returns true or the limit of length is reached, whichever occurs first.
finalize: The finalize transform function performs the last step in a multistage
transform:

out :: finalize(temp, in) =
begin
out.key :: in.key;
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select: The output_select transform function performs selection of output
records:

out :: output_select(final) =
begin
out :: final.average > 5;
end;
The output_select transform function takes a single argument (the record produced by
finalization) and returns a value of 0 (false) if NORMALIZE is to ignore a record, or
non-zero (true) if NORMALIZE is to generate an output record.
temporary_type: If you want Normalize to use temporary storage, define this
storage as a record with a type named temporary_type:

type temporary_type =
record
int count;
int sum;
end;
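Putting these pieces together, here is a minimal sketch of a complete NORMALIZE transform
that emits one output record per element of a vector field; the field names id, big_vector,
and element are hypothetical:

out :: length(in) =
begin
  out :: length_of(in.big_vector);       // one output record per vector element
end;

out :: normalize(in, index) =
begin
  out.id      :: in.id;                  // carry the parent record's identifier
  out.element :: in.big_vector[index];   // the element for this output record
end;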

REFORMAT

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.
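As a minimal sketch, assuming hypothetical input fields first_name and last_name and an
output field full_name, a reformat transform could copy the same-named fields through with
a wildcard rule and derive the new field:

out :: reformat(in) =
begin
  out.* :: in.*;                                                      // copy same-named fields
  out.full_name :: string_concat(in.first_name, " ", in.last_name);   // derived field
end;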
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder

ROLLUP

Purpose
Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, ROLLUP supports implicit reformat; see
Implicit reformat.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more
control over record selection, grouping, and aggregation.

The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid
values), according to which mode you use for the rollup:

With expanded mode, you can use ROLLUP normally.


With template mode, always clean and validate data before rolling it up. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the rollup, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to clean and
validate the data before using template mode with ROLLUP.
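For illustration, a minimal template-mode sketch that produces one summary record per key
group, assuming hypothetical fields customer_id (the key) and amount:

out :: rollup(in) =
begin
  out.customer_id  :: in.customer_id;   // the group key
  out.total_amount :: sum(in.amount);   // aggregated over the group
  out.num_records  :: count(1);         // number of records in the group
end;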

SCAN

Purpose
For every input record, Scan generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and including the
current record. For example, the output records might include successive year-to-date totals
for groups of records.
Although it lacks a reformat transform function, SCAN supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:

With expanded mode, you can use SCAN normally.


With template mode, always clean and validate data before scanning it. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the scan, and the value of the sorted-input parameter,
may affect where the problems occur. It is safest to clean and
validate the data before using template mode with SCAN.

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

Two modes to use SCAN


You can use a SCAN component in two modes, depending on how you define
the transform parameter:
Define a transform that uses a template scan function. This is called template mode and is
most often used when you want to output aggregations of the data.
Create a transform using an expanded SCAN package. This is called expanded mode
and allows for scans that do not necessarily use regular aggregation functions.

Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify
an aggregation function that describes how the cumulative summary should be computed.
At runtime, the Co>Operating System expands this template function into the multiple
functions that are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You
could use the sum aggregation function to calculate the running total of spending for each
customer after each purchase.
For more information, see Using SCAN with aggregation functions.
Expanded mode

Expanded mode provides more control over the scan. It lets you edit the expanded package,
so you can specify transformations that are not possible with template mode. As such, you
might use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:
DML type named temporary_type
initialize function that returns a temporary_type record
scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record
finalize function that returns an output record

For more information, see Transform package for SCAN.

Examples of using SCAN

transforms/scan/scan.mp

Template SCAN with an aggregation function


This example shows how to compute, from input records containing customer_id, dt (date),
and amount, a running total of transactions for each customer in a dataset. The example
uses a template scan function with the sum aggregation function.
Suppose you have the following input records:

customer_id    dt            amount
C002142        1994.03.23     52.20
C002142        1994.06.22     22.25
C003213        1993.02.12     47.95
C003213        1994.11.05    221.24
C003213        1995.12.11     17.42
C004221        1994.08.15     25.25
C008231        1993.10.22    122.00
C008231        1995.12.10     52.10

You want to produce output records with customer_id, dt, and amount_to_date:

customer_id    dt            amount_to_date
C002142        1994.03.23     52.20
C002142        1994.06.22     74.45
C003213        1993.02.12     47.95
C003213        1994.11.05    269.19
C003213        1995.12.11    286.61
C004221        1994.08.15     25.25
C008231        1993.10.22    122.00
C008231        1995.12.10    174.10

To accomplish this task, do one of the following:
Sort the input records on customer_id and dt, and use a Scan component with
the sorted-input parameter set to Input must be sorted or grouped and
customer_id as the key field.
Sort the input records on dt, and use a Scan component with the sorted-input
parameter set to In memory: Input need not be sorted and customer_id as the
key field.

Create the transform using the sum aggregation function, as follows:


out :: scan(in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: sum(in.amount);
end;
Expanded SCAN
Continuing the previous example, you want to categorize customers according to their
spending. After their spending exceeds $100, you place them in the premium category.
The new output data includes the category for each customer, current for each date on
which they made a purchase.

customer_id    dt            amount_to_date    category
C002142        1994.03.23     52.20            regular
C002142        1994.06.22     74.45            regular
C003213        1993.02.12     47.95            regular
C003213        1994.11.05    269.19            premium
C003213        1995.12.11    286.61            premium
C004221        1994.08.15     25.25            regular
C008231        1993.10.22    122.00            premium
C008231        1995.12.10    174.10            premium

For this example, we can use the finalize function in an expanded transform to add the
category information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the
running total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
temp.amount_to_date :: 0;
end;

out :: scan(temp, in) =
begin
out.amount_to_date :: temp.amount_to_date + in.amount;
end;

out :: finalize(temp, in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: temp.amount_to_date;
out.category :: if (temp.amount_to_date > 100) "premium"
else "regular";
end;
The temporary_type is a variable that stores the cumulative data from one record to the
next. At the beginning of each group, the initialize function resets the temporary variable to
0. (Remember that in this example, the data is grouped by customer_id.)
The scan function is called for each record; it keeps a running total of purchase amounts
within the group. The finalize function creates the output records, assigning a category
value to each one.

SPLIT

Purpose
SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data
Select a subset of fields from the data

Normalize vectors (including nested vectors)


Retrieve multiple, distinct outputs from a single pass through the data

How SPLIT works


SPLIT does not use transform functions. It determines what operations to perform on input
data by using DML that is generated by the split_dml command-line utility. This approach
enables you to perform operations such as normalizing vectors without using expensive
DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer

Transform folder

Example of using SPLIT


Say you have a file example1.dml that has both a nested hierarchy of records and three
levels of nested vectors, with the following record format:
record
  string("|") region;
  record
    string("|") state;
    record
      string("|") county;
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|") atm_id;
        string("|") comment;
      end[decimal(2)] atms;
    end[decimal(2)] counties;
  end[decimal(2)] states;
  string("\n") mgr;
end
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this
record.
First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the
specified wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to
be normalized can be used; in this case, the specified field atm_id is used with the
".." shorthand, because atm_id is unique in the record.

This command generates the following output:


/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i ..# -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,mgr=mgr';
end

Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
Suppose that you want to exclude certain fields (addr_line1, addr_line2, and comment)
from the output. Run split_dml as follows:
split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") atm_id;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
atm_id=states.counties.atms.atm_id,
mgr=mgr';


end
Note that the fields specified by the split_dml -i option appear in the order in which they
occur in the input record, not in the order in which they are listed in the option argument.

Posted 3rd January 2016 by kashyap vasani

Add a comment

AB-INITIO Component

Classic

Flipcard

Magazine

Mosaic

Sidebar

Snapshot

Timeslide
1.
JAN

AB-INITIO TRANSFORM COMPONENT


AGGREGATE

Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component . If this feature
is enabled and if the sorted-inputparameter is set to In memory: Input need not be
sorted, the Co>Operating System folds this component by default. SeeComponent
folding for more information.
Location in the Component Organizer

Miscellaneous/Deprecated/Transform folder

COMBINE

Purpose
combine processes data in a number of useful ways. You can use combine to:
Restore hierarchies of data flattened by the SPLIT component
Create a single output record by joining multiple input streams

Denormalize vectors (including nested vectors)

How COMBINE works


COMBINE does not use transform functions. It determines what operations to perform on
input data by using DML that is generated for COMBINEs input ports by the split_dml
command-line utility.
COMBINE performs the inverse operations of the SPLIT component. It has a single output
port and a counted number of input ports. COMBINE (optionally) denormalizes each input
data stream, then performs an outer join on the input records to form the output records.
Using COMBINE for joining data
To use COMBINE to denormalize and join input data, you need to sort and specify keys for
the data. If the input to COMBINE is from an output of SPLIT, you can set up SPLIT to
automatically generate keys by running split_dml with the -g option. Otherwise, you can
generate keys by running split_dml with the -k option, supplying the names of key fields. If
you specify no keys, COMBINE uses an implied key, which is equal to a records index
within the sequence of records on the input port. In other words, COMBINE merges records
synchronously on each port.
When merging these records, COMBINE selects for processing the records that match the
smallest key present on any port. Thus, the input data on each port should be sorted in the
order specified by the keys.

COMBINE can also merge elements of vectors, in the same way it merges top-level
records: if you specify no key, COMBINE merges the elements based on an implied key,
which is equal to a records index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

Example of using COMBINE


Say you have a file example2a.dml with the following record format:
record
string("|") region = ""; //Sort key 1
string("|") state = "";

//Sort key 2

string("|") county = ""; //Sort key 3


string("|") addr_line1 = "";
string("|") addr_line2 = "";
string("|") atm_id = "";
string("|") comment = "";
string("\n") regional_mgr = "";
end;
And you want to roll up the fields that are marked as sort keys region, state, and county into
nested vectors. To do this, you can use a single COMBINE component rather than performing a series of
three rollup actions.
The desired output format (example2b.dml) is:
record
string("|") region;

//Sort key 1

record
string("|") state;

//Sort key 2

record
string("|") county; //Sort Key 3
record
record
string("|") addr_line1;
string("|") addr_line2;
end location;
string("|")atm_id;
string("|")comment;
end[int] atms;
end[int] counties;
end[int] states;
string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so we specify "..#"
for value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.

You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINEs input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml

// with the command-line arguments:


// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
string("|") region // Sort key 1
string("|") state

// Sort key 2

string("|") county // Sort key 3


string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") regional_mgr;
string('0')DML_assignments =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,
regional_mgr=regional_mgr';
string('0')DML_key_specifiers() =
'{region}=,{state}=states[],{county}=states.counties[]';
end
Related topics

DEDUP SORTED

Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FILTER BY EXPRESSION

Purpose
Filter by Expression filters records according to a DML expression or transform function,
which specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For
example, you can configure Filter by Expression to select a certain percentage of records,
or to select every third (or fourth, or fifth, and so on) record. Note that if you need a random
sample of a specific size, you should use the sample component.
FILTER BY EXPRESSION supports implicit reformat. For more information, see Implicit
reformat.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FUSE

Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single
output flow. It examines one record from each input flow simultaneously, acting on the
records according to the transform function you specify. For example, you can compare
records, selecting one record or another based on some criteria, or fuse them into a single
record that contains data from all the input records.
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However,
certain components placed upstream of Fuse, such as Reformat or Filter by Expression,
could reject or divert some records. In that case, you may not be able to guarantee that the
flows stay in sync. A more reliable option is to add a key field to the data; then use Join to
match the records by key.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

JOIN

Purpose
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions)
that are large and sorted-input is set to Inputs must be sorted, the order of output records
within a key group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder

Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism
for deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new
outgoing records

The mechanism for deciding when to call the transform function consists of the
settings of the parameters join-type, record-requiredn, and dedupn.

Inner joins

The most common case is when join-type is Inner Join. In this case, if each input port
contains a record with the same value for the key fields, the transform function is called and
an output record is produced.
If some of the input flows have more than one record with that key value, the transform
function is called multiple times, once for each possible combination of records, taken one
from each input port.
Whenever a particular key value does not have a matching record on every input port and
Inner Join is specified, the transform function is not called and all incoming records with that
key value are sent to the unusedn ports.
Full outer joins
Another common case is when join-type is Full Outer Join: if each input port has a record
with a matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in
effect ignored.
With an outer join, the transform function typically requires additional rules (as compared to
an inner join) to handle the possibility of NULL inputs.
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False
for the record-requiredn parameter for each inn port. The settings you choose determine
when Join calls the transform function. See record-requiredn.

Examples of join types

Complex multiway joins

For the three-way joins shown in the following diagrams, the shaded regions again
represent the key values that must match in order for Join to call the transform function:

In the cases shown above, suppose you want to narrow the join conditions to a subset of
the shaded (required match) area. To do this, use the DML is_defined function in a rule in
the transform itself. This is the same principle demonstrated in the two-way join shown in
Getting a joined output record.
For example, suppose you want to produce an output record when a particular key value
either is present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded
area to represent the necessary conditions. However, Case 2 also represents conditions
under which you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:
1
. Set join-type to Full Outer Join as in Case 2 above.

2
. Put the following rules in Joins transform function:

out.key :1: if (is_defined(in0)) in0.key;


out.key :2: if (is_defined(in1) &&
is_defined(in2)) in1.key;
For both rules to fail, the particular key value must be absent from in0 and must be present
in only one of in1 or in2.
Join writes the records that result in both rules failing to the rejectn ports if you connect
flows to them.

MATCH SORTED

Purpose
Match Sorted combines multiple flows of records with matching keys and performs
transform operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match
Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder

Example of using MATCH SORTED


This example shows how repeat and missing key values affect the number of times Match
Sorted calls the transform function.
Suppose three input flows feed Match Sorted. The records in these flows have threecharacter alphabetic key values. The key values of the records in the three flows are as
follows:

in0

in1

in2

record 1

aaa

aaa

aaa

record 2

bbb

bbb

ccc

record 3

ccc

ccc

ddd

record 4

eee

eee

eee

record 5

eee

fff

fff

record 6

eee

end

end

Match Sorted calls the transform function eight times for these data records, with the
arguments as follows:
transform( in0-rec1, in1-rec1, in2-rec1 ) records with key value aaa
transform( in0-rec2, in1-rec2, NULL ) records with key value bbb
transform( in0-rec3, in1-rec3, in2-rec2 ) records with key value ccc
transform( NULL,

NULL,

in2-rec3 ) records with key value ddd

transform( in0-rec4, in1-rec4, in2-rec4 ) records with key value eee


transform( in0-rec5, in1-rec4, in2-rec4 ) records with key value eee
transform( in0-rec6, in1-rec4, in2-rec4 ) records with key value eee
transform( NULL,

in1-rec5, in2-rec5 ) records with key value fff

Since there are three eee records in the flow attached to in0, Match Sorted calls the
transform function three times with eee records as inputs. Since the next records on in1 and
in2 do not have key value eee, in1 and in2 repeat their rec4 records.

MULTI REFORMAT

Purpose
Multi Reformat changes the format of records flowing from 1 to 20 pairs of in and out ports
by dropping fields or by using DML expressions to add fields, combine fields, or transform
data in the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a
regular REFORMAT component is the correct choice. For example:

If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also
reformatting it (by adding, combining, or transforming fields), try using the outputindex and count parameters on the REFORMAT component.

A recommended use for Multi Reformat is to put it immediately before a custom component
that takes multiple inputs. For more information, see Using MULTI REFORMAT to avoid
deadlock.

Using MULTI REFORMAT to avoid deadlock


Deadlock occurs when a program cannot progress, causing a graph to hang. Custom
components (components that you have built to execute your own programs) are prone to
deadlock because they cannot use the GDEs automatic flow buffering. If a custom
component is programmed to read from multiple flows in a specific order, it carries the
possibility of causing deadlock.
To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the
custom component. Using this built-in component to process the input flows applies
automatic flow buffering to them before they reach the custom component, thus avoiding the
possibility of deadlock.

NORMALIZE

Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field
for each group the inverse of NORMALIZE you would use the accumulation function
of the ROLLUP component.

Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a
multistage transform, it follows computation rules that may cause unexpected or
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors including the data type, the DML
expression used to perform the normalization, and the value of the sorted-input
parameter may affect where the problems occur. It is safest to avoid normalizing
dirty data.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See
Component folding for more information.

NORMALIZE transform functions


What Normalize does is determined by the functions, types, and variables you define in its
transform parameter.
There are seven built-in functions, as shown in the following table. Of these, only normalize
is required. Examples of most of these functions can be found in Simple NORMALIZE
example with vectors.
There is also an optional temporary_type (see Optional NORMALIZE transform functions
and types), which you can define if you need to use temporary variables. For an example,
see NORMALIZE example with a more elaborate transform.

Transform
function

Required?

input_select

No

Argumen
ts
input
record

Return value
An integer(4) value.
An output value of 0 means false (the
record was not selected); non-zero
means
true
(the
record
was
selected).
See Optional NORMALIZE transform

functions and types.


initialize

No

input
record

A
record
whose
temporary_type.

type

is

See Optional NORMALIZE transform


functions and types. For examples,
see NORMALIZE example with a
more elaborate transform.
length

Only
if
finished is
not
provided

input
record

An integer(4) value.
Specifies the number of output
records Normalize generates for this
input record. If the length function is
provided, Normalize calls it once for
each input record.
For
examples,
see
Simple
NORMALIZE example with vectors
and NORMALIZE example with a
more elaborate transform.

finished
(if
you
have
defined
temporary_type)

finished
(if you have not
defined
temporary_type)

Only
if
length
is
not
provided

Only
if
length
is
not
provided

temporar
y record,
input
record,
index

0 (meaning false), if more output


records are to be generated from the
current
input
record.
Otherwise, a non-zero value (true).

input
record,
index

0 (meaning false), if more output


records are to be generated from the
current
input
record.
Otherwise, a non-zero value (true).

If the finished function is provided,


NORMALIZE calls it once more than
the number of output records it
produces. On the final call it returns
true and no output record is
produced.

If the finished function is provided,


NORMALIZE calls it once more than
the number of output records it
produces. On the final call it returns
true and no output record is
produced.

normalize
(if
you
have
defined
temporary_type)

Yes

temporar
y record,
input
record,
index

A
record
whose
type
is
temporary_type. For examples, see
Simple NORMALIZE example with
vectors.

normalize
(if you have not
defined
temporary_type)

Yes

input
record,
index

An output record.

finalize

No

temporar
y record,
input
record

The output record.

output
record

An integer(4) value.

output_select

No

See Optional NORMALIZE transform


functions
and
types
and
NORMALIZE example with a more
elaborate transform.

An output value of 0 means false (the


record was not selected); non-zero
means
true
(the
record
was
selected).
See Optional NORMALIZE transform
functions and types.

Input and output names in transforms


In all transform functions, the names of the inputs and outputs are used only locally, so you
can use any names that make sense to you.
Optional NORMALIZE transform functions and types
There are several optional transform functions and an optional type you can use with
Normalize:
input_select The input_select transform function performs selection of input
records:

out :: input_select(in) =
begin
out :: in.n == 1;

end;
The input_select transform function takes a single argument the input record and
returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if
NORMALIZE is to accept a record.
initialize The initialize transform function initializes temporary storage. This
transform function takes a single argument the input record and returns a
single record with type temporary_type:

temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;
length The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function
specifies the number of times the normalize function will be called for the current
record. This function takes the input record as an argument:

out :: length(in) =
begin
out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished The finished transform function is required when the length function is
not defined. (You must use at least one of these functions.) This transform function
returns a boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call
the normalize function for the current record. When the finished function returns
non-zero (true) , NORMALIZE moves to the next input record.

out :: finished(in, index) =


begin

out :: in.array[index] == "ignore later elements";


end;
The finished function essentially provides a way to implement a while-do loop in the recordreading process.
NOTE: Although we recommend that you not use both length and finished in the same
component, it is possible to define both. In that case, Normalize loops until either finished
returns true or the limit of length is reached, whichever occurs first.
finalize The finalize transform function performs the last step in a multistage
transform:

out :: finalize(temp, in) =


begin
out.key :: in.key;
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select The output_select transform function performs selection of output
records:

out :: output_select(final) =
begin
out :: final.average > 5;
end;
The output_select transform function takes a single argument, the record produced by
finalization, and returns a value of 0 (false) if NORMALIZE is to ignore a record, or
non-zero (true) if NORMALIZE is to generate an output record.

temporary_type If you want Normalize to use temporary storage, define this
storage as a record with a type named temporary_type:

type temporary_type =
record
int count;
int sum;
end;
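Putting these pieces together, here is a minimal sketch of a complete multistage
transform that averages the elements of a vector for each input record. It reuses the
functions shown above; the field names key and big_vector are assumptions for
illustration:

type temporary_type =
record
int count;
int sum;
end;

temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;

out :: length(in) =
begin
out :: length_of(in.big_vector);
end;

temp :: normalize(temp, in, index) =
begin
temp.count :: temp.count + 1;
temp.sum :: temp.sum + in.big_vector[index];
end;

out :: finalize(temp, in) =
begin
out.key :: in.key;
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;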

REFORMAT

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.
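As a minimal sketch of a Reformat transform (the field names are hypothetical), the
following copies one field through, combines two input fields into a new field, and
drops everything that is not assigned:

out :: reformat(in) =
begin
out.customer_id :: in.customer_id;
out.full_name :: string_concat(in.first_name, " ", in.last_name);
end;

Any input field that is not assigned to an output field is dropped from the output.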
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder

ROLLUP

Purpose
Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, rollup supports implicit reformat; see Implicit
reformat.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more
control over record selection, grouping, and aggregation.
The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid
values), according to which mode you use for the rollup:

With expanded mode, you can use ROLLUP normally.

With template mode, always clean and validate data before rolling it up. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the rollup, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to clean and
validate the data before using template mode with ROLLUP.
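For reference, a template-mode rollup transform is a single function built from
aggregation functions. A minimal sketch, assuming input records with customer_id
and amount fields and customer_id as the key:

out :: rollup(in) =
begin
out.customer_id :: in.customer_id;
out.total_amount :: sum(in.amount);
out.purchases :: count(in.amount);
end;

As with template-mode SCAN, the Co>Operating System expands this template at
runtime into the functions needed to execute the rollup.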

SCAN

Purpose

For every input record, Scan generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and including the
current record. For example, the output records might include successive year-to-date totals
for groups of records.
Although it lacks a reformat transform function, scan supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:

With expanded mode, you can use SCAN normally.

With template mode, always clean and validate data before scanning it. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the scan, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to clean and
validate the data before using template mode with SCAN.

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See
Component folding for more information.

Two modes to use SCAN


You can use a SCAN component in two modes, depending on how you define
the transform parameter:
Define a transform that uses a template scan function. This is called template mode and is
most often used when you want to output aggregations of the data.

Create a transform using an expanded SCAN package. This is called expanded mode
and allows for scans that do not necessarily use regular aggregation functions.

Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify
an aggregation function that describes how the cumulative summary should be computed.
At runtime, the Co>Operating System expands this template function into the multiple
functions that are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You
could use the sum aggregation function to calculate the running total of spending for each
customer after each purchase.
For more information, see Using SCAN with aggregation functions.
Expanded mode
Expanded mode provides more control over the scan. It lets you edit the expanded package,
so you can specify transformations that are not possible with template mode. As such, you
might use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:

A DML type named temporary_type

An initialize function that returns a temporary_type record

A scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record

A finalize function that returns an output record

For more information, see Transform package for SCAN.

Examples of using SCAN

transforms/scan/scan.mp

Template SCAN with an aggregation function


This example shows how to compute, from input records containing customer_id, dt (date),
and amount, a running total of transactions for each customer in a dataset. The example
uses a template scan function with the sum aggregation function.
Suppose you have the following input records:

customer_id    dt            amount
C002142        1994.03.23    52.20
C002142        1994.06.22    22.25
C003213        1993.02.12    47.95
C003213        1994.11.05    221.24
C003213        1995.12.11    17.42
C004221        1994.08.15    25.25
C008231        1993.10.22    122.00
C008231        1995.12.10    52.10

You want to produce output records with customer_id, dt, and amount_to_date:

customer_id    dt            amount_to_date
C002142        1994.03.23    52.20
C002142        1994.06.22    74.45
C003213        1993.02.12    47.95
C003213        1994.11.05    269.19
C003213        1995.12.11    286.61
C004221        1994.08.15    25.25
C008231        1993.10.22    122.00
C008231        1995.12.10    174.10

To accomplish this task, do one of the following:

Sort the input records on customer_id and dt, and use a SCAN component with
the sorted-input parameter set to Input must be sorted or grouped
and customer_id as the key field.

Sort the input records on dt, and use a SCAN component with the sorted-input
parameter set to In memory: Input need not be sorted
and customer_id as the key field.

Create the transform using the sum aggregation function, as follows:


out :: scan(in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: sum(in.amount);
end;
Expanded SCAN
Continuing the previous example, you want to categorize customers according to their
spending. After their spending exceeds $100, you place them in the premium category.
The new output data includes the category for each customer, current as of each date on
which they made a purchase.

customer_id    dt            amount_to_date    category
C002142        1994.03.23    52.20             regular
C002142        1994.06.22    74.45             regular
C003213        1993.02.12    47.95             regular
C003213        1994.11.05    269.19            premium
C003213        1995.12.11    286.61            premium
C004221        1994.08.15    25.25             regular
C008231        1993.10.22    122.00            premium
C008231        1995.12.10    174.10            premium

For this example, we can use the finalize function in an expanded transform to add the
category information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the
running total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
temp.amount_to_date :: 0;
end;

out :: scan(temp, in) =
begin
out.amount_to_date :: temp.amount_to_date + in.amount;
end;

out :: finalize(temp, in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: temp.amount_to_date;
out.category :: if (temp.amount_to_date > 100) "premium"
else "regular";
end;
The temporary_type is a variable that stores the cumulative data from one record to the
next. At the beginning of each group, the initialize function resets the temporary variable to
0. (Remember that in this example, the data is grouped by customer_id.)

The scan function is called for each record; it keeps a running total of purchase amounts
within the group. The finalize function creates the output records, assigning a category
value to each one.

SPLIT

Purpose

SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data

Select a subset of fields from the data

Normalize vectors (including nested vectors)

Retrieve multiple, distinct outputs from a single pass through the data

How SPLIT works


SPLIT does not use transform functions. It determines what operations to perform on input
data by using DML that is generated by the split_dml command-line utility. This approach
enables you to perform operations such as normalizing vectors without using expensive
DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer

Transform folder

Example of using SPLIT


Say you have a file example1.dml that has both a nested hierarchy of records and three
levels of nested vectors, with the following record format:
record
string("|") region;
record
string("|") state;
record
string("|") county;
record
string("|") addr_line1;
string("|") addr_line2;
end location;
record
string("|") atm_id;
string("|") comment;
end[decimal(2)] atms;
end[decimal(2)] counties;
end[decimal(2)] states;
string("\n") mgr;
end
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this
record.

First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the
specified wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to
be normalized can be used; in this case, the specified field atm_id is used with the
".." shorthand, because atm_id is unique in the record.

This command generates the following output:


/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i ..# -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,mgr=mgr';
end
Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
Suppose that you want to exclude certain fields (addr_line1, addr_line2,
and comment) from the output. Run split_dml as follows:

split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") atm_id;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
atm_id=states.counties.atms.atm_id,
mgr=mgr';
end
Note that the fields specified by the split_dml -i option appear in the order in which they
occur in the input record, not in the order in which they are listed in the option argument.

Posted 3rd January 2016 by kashyap vasani

SEP

18

AB-INITIO PARTITION COMPONENT

PARTITION BY EXPRESSION

Purpose

Partition by Expression distributes records to its output flow partitions according to a
specified DML expression or transform function.
The output port for Partition by Expression is ordered. See Ordered ports. Although you
can use fan-out flows on the out port, we do not recommend connecting multiple fan-out
flows. You may connect a single fan-out flow; or, preferably, limit yourself to straight flows on
the out port.
Partition by Expression supports implicit reformat. See Implicit reformat.
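As a minimal sketch, a partitioning transform function might compute a partition
number from a record field (the function and field names are hypothetical, and this
assumes four output partitions):

out :: partition_expr(in) =
begin
out :: in.account_id % 4;
end;

Records with the same account_id always yield the same value, so they always land in
the same partition.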
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer

Partitioning folder

PARTITION BY KEY
Purpose

Partition by Key distributes records to its output flow partitions according to key values.
How Partition by Key interprets key values depends on the internal representation of the
key. For example, the number 4 in a field of type integer(2) is not considered identical to the
number 4 in a field of type decimal(4).
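A practical consequence: if the same logical key arrives in different types from
different sources, consider casting it to a single type upstream, for example in a
Reformat (a minimal sketch; the field names are hypothetical):

out :: reformat(in) =
begin
out.key :: (decimal(4)) in.key;
out.* :: in.*;
end;

With a common internal representation, equal key values are sent to the same
partition.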
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer

Partitioning folder

PARTITION BY KEY AND SORT


Purpose

Partition by Key and Sort repartitions records by key values and then sorts the records
within each partition. The number of input and output partitions can be different.
How Partition by Key and Sort interprets key values depends on the internal representation
of the key. For example, the number 4 is likely to be partitioned differently depending on
whether it is in a field of type integer(2) or decimal(4).
Partition by Key and Sort is a subgraph that contains two components, Partition by Key and
Sort.
Location in the Component Organizer
Sort folder

PARTITION BY PERCENTAGE
Purpose

Partition by Percentage distributes a specified percentage of the total number of input
records to each output flow.
Location in the Component Organizer

Partitioning folder

PARTITION BY RANGE
Purpose

Partition by Range distributes records to its output flow partitions according to the ranges of
key values specified for each partition. Partition by Range distributes the records relatively
equally among the partitions.
Use Partition by Range when you want to divide data into useful, approximately equal,
groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the
input is unsorted, the output is unsorted.
The records with the key values that come first in the key order go to partition 0, the records
with the key values that come next in the order go to partition 1, and so on. The records with
the key values that come last in the key order go to the partition with the highest number.
Location in the Component Organizer

Partitioning folder

PARTITION BY ROUND-ROBIN
Purpose

Partition by Round-robin distributes blocks of records evenly to each output flow in
round-robin fashion.
For information on undoing the effects of Partition by Round-robin, see INTERLEAVE.
The output port for Partition by Round-robin is ordered. See Ordered ports.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer

Partitioning folder
Posted 18th September 2015 by kashyap vasani
