
AB-INITIO TRANSFORM COMPONENT

AGGREGATE
Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component. If this feature
is enabled and if the sorted-input parameter is set to In memory: Input need not be
sorted, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer

Miscellaneous/Deprecated/Transform folder

COMBINE

Purpose
COMBINE processes data in a number of useful ways. You can use COMBINE to:
Restore hierarchies of data flattened by the SPLIT component

Create a single output record by joining multiple input streams

Denormalize vectors (including nested vectors)

How COMBINE works


COMBINE does not use transform functions. It determines what operations to perform on
input data by using DML that is generated for COMBINE's input ports by the split_dml
command-line utility.
COMBINE performs the inverse operations of the SPLIT component. It has a single output
port and a counted number of input ports. COMBINE (optionally) denormalizes each input
data stream, then performs an outer join on the input records to form the output records.
Using COMBINE for joining data
To use COMBINE to denormalize and join input data, you need to sort and specify keys for
the data. If the input to COMBINE is from an output of SPLIT, you can set up SPLIT to
automatically generate keys by running split_dml with the -g option. Otherwise, you can
generate keys by running split_dml with the -k option, supplying the names of key fields. If
you specify no keys, COMBINE uses an implied key, which is equal to a record's index
within the sequence of records on the input port. In other words, COMBINE merges records
synchronously on each port.
When merging these records, COMBINE selects for processing the records that match the
smallest key present on any port. Thus, the input data on each port should be sorted in the
order specified by the keys.
COMBINE can also merge elements of vectors, in the same way it merges top-level
records: if you specify no key, COMBINE merges the elements based on an implied key,
which is equal to a record's index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

Location in the Component Organizer


Transform folder

Example of using COMBINE


Say you have a file example2a.dml with the following record format:
record
string("|") region = "";     //Sort key 1
string("|") state = "";      //Sort key 2
string("|") county = "";     //Sort key 3
string("|") addr_line1 = "";
string("|") addr_line2 = "";
string("|") atm_id = "";
string("|") comment = "";
string("\n") regional_mgr = "";
end;
And you want to roll up the fields that are marked as sort keys (region, state, and county)
into nested vectors. To do this, you can use a single COMBINE component rather than
performing a series of three rollup actions.
The desired output format (example2b.dml) is:
record
  string("|") region;              //Sort key 1
  record
    string("|") state;             //Sort key 2
    record
      string("|") county;          //Sort key 3
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|") atm_id;
        string("|") comment;
      end[int] atms;
    end[int] counties;
  end[int] states;
  string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so you specify
"..#" as the value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.
You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINE's input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// with the command-line arguments:
// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
string("|") region;     // Sort key 1
string("|") state;      // Sort key 2
string("|") county;     // Sort key 3
string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") regional_mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,
regional_mgr=regional_mgr';
string('\0') DML_key_specifiers() =
'{region}=,{state}=states[],{county}=states.counties[]';
end

DEDUP SORTED

Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FILTER BY EXPRESSION

Purpose
Filter by Expression filters records according to a DML expression or transform function,
which specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For
example, you can configure Filter by Expression to select a certain percentage of records,
or to select every third (or fourth, or fifth, and so on) record. Note that if you need a random
sample of a specific size, you should use the SAMPLE component.
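For illustration, here is a minimal sketch of a selection expression that keeps every third
record. It assumes the expression is supplied as the component's select expression and uses
the DML next_in_sequence function, which returns 1, 2, 3, ... for successive records:

// keep every third record (3, 6, 9, ...) in each partition
next_in_sequence() % 3 == 0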
FILTER BY EXPRESSION supports implicit reformat. For more information, see Implicit
reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FUSE

Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single
output flow. It examines one record from each input flow simultaneously, acting on the
records according to the transform function you specify. For example, you can compare
records, selecting one record or another based on some criteria, or fuse them into a single
record that contains data from all the input records.
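As a minimal sketch of such a transform, assuming two input flows and hypothetical fields
id and amount, a fuse function could prefer the value from in1 whenever it is present:

out :: fuse(in0, in1) =
begin
  out.id     :: in0.id;                     // carry the identifier from the first flow
  out.amount :: if (is_defined(in1.amount)) in1.amount
                else in0.amount;            // fall back to in0's value otherwise
end;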
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However,
certain components placed upstream of Fuse, such as Reformat or Filter by Expression,
could reject or divert some records. In that case, you may not be able to guarantee that the
flows stay in sync. A more reliable option is to add a key field to the data; then use Join to
match the records by key.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

JOIN

Purpose
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions)
that are large and sorted-input is set to Inputs must be sorted, the order of output records
within a key group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder

Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism
for deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new
outgoing records

The mechanism for deciding when to call the transform function consists of the
settings of the parameters join-type, record-requiredn, and dedupn.

Inner joins
The most common case is when join-type is Inner Join. In this case, if each input port
contains a record with the same value for the key fields, the transform function is called and
an output record is produced.
If some of the input flows have more than one record with that key value, the transform
function is called multiple times, once for each possible combination of records, taken one
from each input port.
Whenever a particular key value does not have a matching record on every input port and
Inner Join is specified, the transform function is not called and all incoming records with that
key value are sent to the unusedn ports.
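For example, a minimal inner-join transform might look like the following sketch; the fields
key, amount, and name are hypothetical:

out :: join(in0, in1) =
begin
  out.key    :: in0.key;     // the matched key value
  out.amount :: in0.amount;  // data from the in0 record
  out.name   :: in1.name;    // data from the matching in1 record
end;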

Full outer joins


Another common case is when join-type is Full Outer Join: if each input port has a record
with a matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in
effect ignored.
With an outer join, the transform function typically requires additional rules (as compared to
an inner join) to handle the possibility of NULL inputs.
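For instance, a sketch of such rules (again with hypothetical fields) guards each input with
is_defined and supplies a default when a record is missing:

out :: join(in0, in1) =
begin
  out.key    :: if (is_defined(in0)) in0.key else in1.key;
  out.amount :: if (is_defined(in0)) in0.amount else 0;        // default when in0 is missing
  out.name   :: if (is_defined(in1)) in1.name else "unknown";  // default when in1 is missing
end;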
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False
for the record-requiredn parameter for each inn port. The settings you choose determine
when Join calls the transform function. See record-requiredn.

Examples of join types

Complex multiway joins


For the three-way joins shown in the following diagrams, the shaded regions again
represent the key values that must match in order for Join to call the transform function:

In the cases shown above, suppose you want to narrow the join conditions to a subset of
the shaded (required match) area. To do this, use the DML is_defined function in a rule in
the transform itself. This is the same principle demonstrated in the two-way join shown in
Getting a joined output record.
For example, suppose you want to produce an output record when a particular key value
either is present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded
area to represent the necessary conditions. However, Case 2 also represents conditions
under which you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:

1. Set join-type to Full Outer Join as in Case 2 above.
2. Put the following rules in Join's transform function:

out.key :1: if (is_defined(in0)) in0.key;
out.key :2: if (is_defined(in1) && is_defined(in2)) in1.key;
For both rules to fail, the particular key value must be absent from in0 and must be present
in only one of in1 or in2.
Join writes the records that result in both rules failing to the rejectn ports if you connect
flows to them.

MATCH SORTED

Purpose
Match Sorted combines multiple flows of records with matching keys and performs
transform operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match
Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder

Example of using MATCH SORTED


This example shows how repeat and missing key values affect the number of times Match
Sorted calls the transform function.
Suppose three input flows feed Match Sorted. The records in these flows have
three-character alphabetic key values. The key values of the records in the three flows are
as follows:

            in0    in1    in2
record 1    aaa    aaa    aaa
record 2    bbb    bbb    ccc
record 3    ccc    ccc    ddd
record 4    eee    eee    eee
record 5    eee    fff    fff
record 6    eee    end    end

Match Sorted calls the transform function eight times for these data records, with the
arguments as follows:
transform( in0-rec1, in1-rec1, in2-rec1 )   records with key value aaa
transform( in0-rec2, in1-rec2, NULL )       records with key value bbb
transform( in0-rec3, in1-rec3, in2-rec2 )   records with key value ccc
transform( NULL,     NULL,     in2-rec3 )   records with key value ddd
transform( in0-rec4, in1-rec4, in2-rec4 )   records with key value eee
transform( in0-rec5, in1-rec4, in2-rec4 )   records with key value eee
transform( in0-rec6, in1-rec4, in2-rec4 )   records with key value eee
transform( NULL,     in1-rec5, in2-rec5 )   records with key value fff

Since there are three eee records in the flow attached to in0, Match Sorted calls the
transform function three times with eee records as inputs. Since the next records on in1 and
in2 do not have key value eee, in1 and in2 repeat their rec4 records.

MULTI REFORMAT

Purpose
Multi Reformat changes the format of records flowing through 1 to 20 pairs of in and out ports
by dropping fields or by using DML expressions to add fields, combine fields, or transform
data in the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a
regular REFORMAT component is the correct choice. For example:
If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also
reformatting it (by adding, combining, or transforming fields), try using the
output-index and count parameters on the REFORMAT component.

A recommended use for Multi Reformat is to put it immediately before a custom component
that takes multiple inputs. For more information, see Using MULTI REFORMAT to avoid
deadlock.

Using MULTI REFORMAT to avoid deadlock


Deadlock occurs when a program cannot progress, causing a graph to hang. Custom
components (components that you have built to execute your own programs) are prone to
deadlock because they cannot use the GDE's automatic flow buffering. If a custom
component is programmed to read from multiple flows in a specific order, it carries the
possibility of causing deadlock.

To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the
custom component. Using this built-in component to process the input flows applies
automatic flow buffering to them before they reach the custom component, thus avoiding the
possibility of deadlock.

NORMALIZE

Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field
for each group (the inverse of NORMALIZE), you would use the accumulation function
of the ROLLUP component.
Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a
multistage transform, it follows computation rules that may cause unexpected or
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the normalization, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to avoid normalizing
dirty data.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See
Component folding for more information.

NORMALIZE transform functions


What Normalize does is determined by the functions, types, and variables you define in its
transform parameter.

There are seven built-in functions, as shown in the following table. Of these, only normalize
is required. Examples of most of these functions can be found in Simple NORMALIZE
example with vectors.
There is also an optional temporary_type (see Optional NORMALIZE transform functions
and types), which you can define if you need to use temporary variables. For an example,
see NORMALIZE example with a more elaborate transform.

input_select
Required?: No
Arguments: input record
Return value: An integer(4) value. An output value of 0 means false (the record was not
selected); non-zero means true (the record was selected). See Optional NORMALIZE
transform functions and types.

initialize
Required?: No
Arguments: input record
Return value: A record whose type is temporary_type. See Optional NORMALIZE transform
functions and types. For examples, see NORMALIZE example with a more elaborate
transform.

length
Required?: Only if finished is not provided
Arguments: input record
Return value: An integer(4) value. Specifies the number of output records Normalize
generates for this input record. If the length function is provided, Normalize calls it once
for each input record. For examples, see Simple NORMALIZE example with vectors and
NORMALIZE example with a more elaborate transform.

finished (if you have defined temporary_type)
Required?: Only if length is not provided
Arguments: temporary record, input record, index
Return value: 0 (meaning false), if more output records are to be generated from the
current input record. Otherwise, a non-zero value (true). If the finished function is
provided, NORMALIZE calls it once more than the number of output records it produces.
On the final call it returns true and no output record is produced.

finished (if you have not defined temporary_type)
Required?: Only if length is not provided
Arguments: input record, index
Return value: Same as for finished with temporary_type defined.

normalize (if you have defined temporary_type)
Required?: Yes
Arguments: temporary record, input record, index
Return value: A record whose type is temporary_type. For examples, see Simple
NORMALIZE example with vectors.

normalize (if you have not defined temporary_type)
Required?: Yes
Arguments: input record, index
Return value: An output record.

finalize
Required?: No
Arguments: temporary record, input record
Return value: The output record. See Optional NORMALIZE transform functions and types
and NORMALIZE example with a more elaborate transform.

output_select
Required?: No
Arguments: output record
Return value: An integer(4) value. An output value of 0 means false (the record was not
selected); non-zero means true (the record was selected). See Optional NORMALIZE
transform functions and types.
Input and output names in transforms


In all transform functions, the names of the inputs and outputs are used only locally, so you
can use any names that make sense to you.
Optional NORMALIZE transform functions and types
There are several optional transform functions and an optional type you can use with
Normalize:
input_select: The input_select transform function performs selection of input
records:

out :: input_select(in) =
begin
out :: in.n == 1;
end;
The input_select transform function takes a single argument (the input record) and
returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if
NORMALIZE is to accept a record.
initialize: The initialize transform function initializes temporary storage. This
transform function takes a single argument (the input record) and returns a
single record with type temporary_type:

temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;

length: The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function
specifies the number of times the normalize function will be called for the current
record. This function takes the input record as an argument:

out :: length(in) =
begin
out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished: The finished transform function is required when the length function is
not defined. (You must use at least one of these functions.) This transform function
returns a boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call
the normalize function for the current record. When the finished function returns
non-zero (true), NORMALIZE moves to the next input record.

out :: finished(in, index) =
begin
out :: in.array[index] == "ignore later elements";
end;
The finished function essentially provides a way to implement a while-do loop in the
record-reading process.
NOTE: Although we recommend that you not use both length and finished in the same
component, it is possible to define both. In that case, Normalize loops until either finished
returns true or the limit of length is reached, whichever occurs first.
finalize: The finalize transform function performs the last step in a multistage
transform:

out :: finalize(temp, in) =
begin
out.key :: in.key;
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select: The output_select transform function performs selection of output
records:

out :: output_select(final) =
begin
out :: final.average > 5;
end;
The output_select transform function takes a single argument (the record produced by
finalization) and returns a value of 0 (false) if NORMALIZE is to ignore a record, or
non-zero (true) if NORMALIZE is to generate an output record.
temporary_type: If you want Normalize to use temporary storage, define this
storage as a record with a type named temporary_type:

type temporary_type =
record
int count;
int sum;
end;
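Putting these pieces together, here is a minimal sketch of a complete NORMALIZE transform
that emits one output record per element of a vector field; the field names id, big_vector,
and element are hypothetical:

out :: length(in) =
begin
  out :: length_of(in.big_vector);       // one output record per vector element
end;

out :: normalize(in, index) =
begin
  out.id      :: in.id;                  // carry the parent record's identifier
  out.element :: in.big_vector[index];   // the element for this output record
end;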

REFORMAT

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.
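As a minimal sketch, assuming hypothetical input fields first_name and last_name and an
output field full_name, a reformat transform could copy the same-named fields through with
a wildcard rule and derive the new field:

out :: reformat(in) =
begin
  out.* :: in.*;                                                      // copy same-named fields
  out.full_name :: string_concat(in.first_name, " ", in.last_name);   // derived field
end;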
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder

ROLLUP

Purpose
Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, ROLLUP supports implicit reformat; see
Implicit reformat.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more
control over record selection, grouping, and aggregation.

The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid
values), according to which mode you use for the rollup:

With expanded mode, you can use ROLLUP normally.


With template mode, always clean and validate data before rolling it up. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the rollup, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to clean and
validate the data before using template mode with ROLLUP.
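For illustration, a minimal template-mode sketch that produces one summary record per key
group, assuming hypothetical fields customer_id (the key) and amount:

out :: rollup(in) =
begin
  out.customer_id  :: in.customer_id;   // the group key
  out.total_amount :: sum(in.amount);   // aggregated over the group
  out.num_records  :: count(1);         // number of records in the group
end;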

SCAN

Purpose
For every input record, Scan generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and including the
current record. For example, the output records might include successive year-to-date totals
for groups of records.
Although it lacks a reformat transform function, SCAN supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:

With expanded mode, you can use SCAN normally.


With template mode, always clean and validate data before scanning it. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the scan, and the value of the sorted-input parameter,
may affect where the problems occur. It is safest to clean and
validate the data before using template mode with SCAN.

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

Two modes to use SCAN


You can use a SCAN component in two modes, depending on how you define
the transform parameter:
Define a transform that uses a template scan function. This is called template mode and is
most often used when you want to output aggregations of the data.
Create a transform using an expanded SCAN package. This is called expanded mode
and allows for scans that do not necessarily use regular aggregation functions.

Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify
an aggregation function that describes how the cumulative summary should be computed.
At runtime, the Co>Operating System expands this template function into the multiple
functions that are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You
could use the sum aggregation function to calculate the running total of spending for each
customer after each purchase.
For more information, see Using SCAN with aggregation functions.
Expanded mode

Expanded mode provides more control over the scan. It lets you edit the expanded package,
so you can specify transformations that are not possible with template mode. As such, you
might use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:
DML type named temporary_type
initialize function that returns a temporary_type record
scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record
finalize function that returns an output record

For more information, see Transform package for SCAN.

Examples of using SCAN

transforms/scan/scan.mp

Template SCAN with an aggregation function


This example shows how to compute, from input records containing customer_id, dt (date),
and amount, a running total of transactions for each customer in a dataset. The example
uses a template scan function with the sum aggregation function.
Suppose you have the following input records:

customer_id    dt            amount
C002142        1994.03.23     52.20
C002142        1994.06.22     22.25
C003213        1993.02.12     47.95
C003213        1994.11.05    221.24
C003213        1995.12.11     17.42
C004221        1994.08.15     25.25
C008231        1993.10.22    122.00
C008231        1995.12.10     52.10

You want to produce output records with customer_id, dt, and amount_to_date:

customer_id    dt            amount_to_date
C002142        1994.03.23     52.20
C002142        1994.06.22     74.45
C003213        1993.02.12     47.95
C003213        1994.11.05    269.19
C003213        1995.12.11    286.61
C004221        1994.08.15     25.25
C008231        1993.10.22    122.00
C008231        1995.12.10    174.10

To accomplish this task, do one of the following:
Sort the input records on customer_id and dt, and use a Scan component with
the sorted-input parameter set to Input must be sorted or grouped and
customer_id as the key field.
Sort the input records on dt, and use a Scan component with the sorted-input
parameter set to In memory: Input need not be sorted and customer_id as the
key field.

Create the transform using the sum aggregation function, as follows:


out :: scan(in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: sum(in.amount);
end;
Expanded SCAN
Continuing the previous example, you want to categorize customers according to their
spending. After their spending exceeds $100, you place them in the premium category.
The new output data includes the category for each customer, current for each date on
which they made a purchase.

customer_id    dt            amount_to_date    category
C002142        1994.03.23     52.20            regular
C002142        1994.06.22     74.45            regular
C003213        1993.02.12     47.95            regular
C003213        1994.11.05    269.19            premium
C003213        1995.12.11    286.61            premium
C004221        1994.08.15     25.25            regular
C008231        1993.10.22    122.00            premium
C008231        1995.12.10    174.10            premium

For this example, we can use the finalize function in an expanded transform to add the
category information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the
running total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
temp.amount_to_date :: 0;
end;

out :: scan(temp, in) =
begin
out.amount_to_date :: temp.amount_to_date + in.amount;
end;

out :: finalize(temp, in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: temp.amount_to_date;
out.category :: if (temp.amount_to_date > 100) "premium"
else "regular";
end;
The temporary_type is a variable that stores the cumulative data from one record to the
next. At the beginning of each group, the initialize function resets the temporary variable to
0. (Remember that in this example, the data is grouped by customer_id.)
The scan function is called for each record; it keeps a running total of purchase amounts
within the group. The finalize function creates the output records, assigning a category
value to each one.

SPLIT

Purpose
SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data
Select a subset of fields from the data

Normalize vectors (including nested vectors)


Retrieve multiple, distinct outputs from a single pass through the data

How SPLIT works


SPLIT does not use transform functions. It determines what operations to perform on input
data by using DML that is generated by the split_dml command-line utility. This approach
enables you to perform operations such as normalizing vectors without using expensive
DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer

Transform folder

Example of using SPLIT


Say you have a file example1.dml that has both a nested hierarchy of records and three
levels of nested vectors, with the following record format:
record
  string("|") region;
  record
    string("|") state;
    record
      string("|") county;
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|") atm_id;
        string("|") comment;
      end[decimal(2)] atms;
    end[decimal(2)] counties;
  end[decimal(2)] states;
  string("\n") mgr;
end
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this
record.
First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the
specified wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to
be normalized can be used; in this case, the specified field atm_id is used with the
".." shorthand, because atm_id is unique in the record.

This command generates the following output:


/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i ..# -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,mgr=mgr';
end

Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
Suppose that you want to exclude certain fields (addr_line1, addr_line2, and comment)
from the output. Run split_dml as follows:
split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") atm_id;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
atm_id=states.counties.atms.atm_id,
mgr=mgr';


end
Note that the fields specified by the split_dml -i option appear in the order in which they
occur in the input record, not in the order in which they are listed in the option argument.

Posted 3rd January 2016 by kashyap vasani

Add a comment

AB-INITIO Component

Classic

Flipcard

Magazine

Mosaic

Sidebar

Snapshot

Timeslide
1.
JAN

AB-INITIO TRANSFORM COMPONENT


AGGREGATE

Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component . If this feature
is enabled and if the sorted-inputparameter is set to In memory: Input need not be
sorted, the Co>Operating System folds this component by default. SeeComponent
folding for more information.
Location in the Component Organizer

Miscellaneous/Deprecated/Transform folder

COMBINE

Purpose
combine processes data in a number of useful ways. You can use combine to:
Restore hierarchies of data flattened by the SPLIT component
Create a single output record by joining multiple input streams

Denormalize vectors (including nested vectors)

How COMBINE works


COMBINE does not use transform functions. It determines what operations to perform on
input data by using DML that is generated for COMBINEs input ports by the split_dml
command-line utility.
COMBINE performs the inverse operations of the SPLIT component. It has a single output
port and a counted number of input ports. COMBINE (optionally) denormalizes each input
data stream, then performs an outer join on the input records to form the output records.
Using COMBINE for joining data
To use COMBINE to denormalize and join input data, you need to sort and specify keys for
the data. If the input to COMBINE is from an output of SPLIT, you can set up SPLIT to
automatically generate keys by running split_dml with the -g option. Otherwise, you can
generate keys by running split_dml with the -k option, supplying the names of key fields. If
you specify no keys, COMBINE uses an implied key, which is equal to a records index
within the sequence of records on the input port. In other words, COMBINE merges records
synchronously on each port.
When merging these records, COMBINE selects for processing the records that match the
smallest key present on any port. Thus, the input data on each port should be sorted in the
order specified by the keys.

COMBINE can also merge elements of vectors, in the same way it merges top-level
records: if you specify no key, COMBINE merges the elements based on an implied key,
which is equal to a records index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

Example of using COMBINE


Say you have a file example2a.dml with the following record format:
record
string("|") region = ""; //Sort key 1
string("|") state = "";

//Sort key 2

string("|") county = ""; //Sort key 3


string("|") addr_line1 = "";
string("|") addr_line2 = "";
string("|") atm_id = "";
string("|") comment = "";
string("\n") regional_mgr = "";
end;
And you want to roll up the fields that are marked as sort keys region, state, and county into
nested vectors. To do this, you can use a single COMBINE component rather than performing a series of
three rollup actions.
The desired output format (example2b.dml) is:
record
string("|") region;

//Sort key 1

record
string("|") state;

//Sort key 2

record
string("|") county; //Sort Key 3
record
record
string("|") addr_line1;
string("|") addr_line2;
end location;
string("|")atm_id;
string("|")comment;
end[int] atms;
end[int] counties;
end[int] states;
string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so we specify "..#"
for value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.

You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINEs input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml

// with the command-line arguments:


// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
string("|") region // Sort key 1
string("|") state

// Sort key 2

string("|") county // Sort key 3


string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") regional_mgr;
string('0')DML_assignments =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,
regional_mgr=regional_mgr';
string('0')DML_key_specifiers() =
'{region}=,{state}=states[],{county}=states.counties[]';
end
Related topics

DEDUP SORTED

Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FILTER BY EXPRESSION

Purpose
Filter by Expression filters records according to a DML expression or transform function,
which specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For
example, you can configure Filter by Expression to select a certain percentage of records,
or to select every third (or fourth, or fifth, and so on) record. Note that if you need a random
sample of a specific size, you should use the sample component.
FILTER BY EXPRESSION supports implicit reformat. For more information, see Implicit
reformat.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder

FUSE

Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single
output flow. It examines one record from each input flow simultaneously, acting on the
records according to the transform function you specify. For example, you can compare
records, selecting one record or another based on some criteria, or fuse them into a single
record that contains data from all the input records.
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However,
certain components placed upstream of Fuse, such as Reformat or Filter by Expression,
could reject or divert some records. In that case, you may not be able to guarantee that the
flows stay in sync. A more reliable option is to add a key field to the data; then use Join to
match the records by key.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.

JOIN

Purpose
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions)
that are large and sorted-input is set to Inputs must be sorted, the order of output records
within a key group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder

Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism
for deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new
outgoing records

The mechanism for deciding when to call the transform function consists of the
settings of the parameters join-type, record-requiredn, and dedupn.

Inner joins

The most common case is when join-type is Inner Join. In this case, if each input port
contains a record with the same value for the key fields, the transform function is called and
an output record is produced.
If some of the input flows have more than one record with that key value, the transform
function is called multiple times, once for each possible combination of records, taken one
from each input port.
Whenever a particular key value does not have a matching record on every input port and
Inner Join is specified, the transform function is not called and all incoming records with that
key value are sent to the unusedn ports.
Full outer joins
Another common case is when join-type is Full Outer Join: if each input port has a record
with a matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in
effect ignored.
With an outer join, the transform function typically requires additional rules (as compared to
an inner join) to handle the possibility of NULL inputs.
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False
for the record-requiredn parameter for each inn port. The settings you choose determine
when Join calls the transform function. See record-requiredn.

Examples of join types

Complex multiway joins

For the three-way joins shown in the following diagrams, the shaded regions again
represent the key values that must match in order for Join to call the transform function:

In the cases shown above, suppose you want to narrow the join conditions to a subset of
the shaded (required match) area. To do this, use the DML is_defined function in a rule in
the transform itself. This is the same principle demonstrated in the two-way join shown in
Getting a joined output record.
For example, suppose you want to produce an output record when a particular key value
either is present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded
area to represent the necessary conditions. However, Case 2 also represents conditions
under which you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:
1
. Set join-type to Full Outer Join as in Case 2 above.

2
. Put the following rules in Joins transform function:

out.key :1: if (is_defined(in0)) in0.key;


out.key :2: if (is_defined(in1) &&
is_defined(in2)) in1.key;
For both rules to fail, the particular key value must be absent from in0 and must be present
in only one of in1 or in2.
Join writes the records that result in both rules failing to the rejectn ports if you connect
flows to them.

MATCH SORTED

Purpose
Match Sorted combines multiple flows of records with matching keys and performs
transform operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match
Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder

Example of using MATCH SORTED


This example shows how repeat and missing key values affect the number of times Match
Sorted calls the transform function.
Suppose three input flows feed Match Sorted. The records in these flows have threecharacter alphabetic key values. The key values of the records in the three flows are as
follows:

in0

in1

in2

record 1

aaa

aaa

aaa

record 2

bbb

bbb

ccc

record 3

ccc

ccc

ddd

record 4

eee

eee

eee

record 5

eee

fff

fff

record 6

eee

end

end

Match Sorted calls the transform function eight times for these data records, with the
arguments as follows:
transform( in0-rec1, in1-rec1, in2-rec1 ) records with key value aaa
transform( in0-rec2, in1-rec2, NULL ) records with key value bbb
transform( in0-rec3, in1-rec3, in2-rec2 ) records with key value ccc
transform( NULL,

NULL,

in2-rec3 ) records with key value ddd

transform( in0-rec4, in1-rec4, in2-rec4 ) records with key value eee


transform( in0-rec5, in1-rec4, in2-rec4 ) records with key value eee
transform( in0-rec6, in1-rec4, in2-rec4 ) records with key value eee
transform( NULL,

in1-rec5, in2-rec5 ) records with key value fff

Since there are three eee records in the flow attached to in0, Match Sorted calls the
transform function three times with eee records as inputs. Since the next records on in1 and
in2 do not have key value eee, in1 and in2 repeat their rec4 records.

MULTI REFORMAT

Purpose
Multi Reformat changes the format of records flowing from 1 to 20 pairs of in and out ports
by dropping fields or by using DML expressions to add fields, combine fields, or transform
data in the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a
regular REFORMAT component is the correct choice. For example:

If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also
reformatting it (by adding, combining, or transforming fields), try using the outputindex and count parameters on the REFORMAT component.

A recommended use for Multi Reformat is to put it immediately before a custom component
that takes multiple inputs. For more information, see Using MULTI REFORMAT to avoid
deadlock.

Using MULTI REFORMAT to avoid deadlock


Deadlock occurs when a program cannot progress, causing a graph to hang. Custom
components (components that you have built to execute your own programs) are prone to
deadlock because they cannot use the GDEs automatic flow buffering. If a custom
component is programmed to read from multiple flows in a specific order, it carries the
possibility of causing deadlock.
To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the
custom component. Using this built-in component to process the input flows applies
automatic flow buffering to them before they reach the custom component, thus avoiding the
possibility of deadlock.

NORMALIZE

Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field
for each group the inverse of NORMALIZE you would use the accumulation function
of the ROLLUP component.

Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a
multistage transform, it follows computation rules that may cause unexpected or
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors including the data type, the DML
expression used to perform the normalization, and the value of the sorted-input
parameter may affect where the problems occur. It is safest to avoid normalizing
dirty data.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See
Component folding for more information.

NORMALIZE transform functions


What Normalize does is determined by the functions, types, and variables you define in its
transform parameter.
There are seven built-in functions, as shown in the following table. Of these, only normalize
is required. Examples of most of these functions can be found in Simple NORMALIZE
example with vectors.
There is also an optional temporary_type (see Optional NORMALIZE transform functions
and types), which you can define if you need to use temporary variables. For an example,
see NORMALIZE example with a more elaborate transform.

Transform
function

Required?

input_select

No

Argumen
ts
input
record

Return value
An integer(4) value.
An output value of 0 means false (the
record was not selected); non-zero
means
true
(the
record
was
selected).
See Optional NORMALIZE transform

functions and types.


initialize

No

input
record

A
record
whose
temporary_type.

type

is

See Optional NORMALIZE transform


functions and types. For examples,
see NORMALIZE example with a
more elaborate transform.
length

Only
if
finished is
not
provided

input
record

An integer(4) value.
Specifies the number of output
records Normalize generates for this
input record. If the length function is
provided, Normalize calls it once for
each input record.
For
examples,
see
Simple
NORMALIZE example with vectors
and NORMALIZE example with a
more elaborate transform.

finished
(if
you
have
defined
temporary_type)

finished
(if you have not
defined
temporary_type)

Only
if
length
is
not
provided

Only
if
length
is
not
provided

temporar
y record,
input
record,
index

0 (meaning false), if more output


records are to be generated from the
current
input
record.
Otherwise, a non-zero value (true).

input
record,
index

0 (meaning false), if more output


records are to be generated from the
current
input
record.
Otherwise, a non-zero value (true).

If the finished function is provided,


NORMALIZE calls it once more than
the number of output records it
produces. On the final call it returns
true and no output record is
produced.

If the finished function is provided,


NORMALIZE calls it once more than
the number of output records it
produces. On the final call it returns
true and no output record is
produced.

normalize
(if
you
have
defined
temporary_type)

Yes

temporar
y record,
input
record,
index

A
record
whose
type
is
temporary_type. For examples, see
Simple NORMALIZE example with
vectors.

normalize
(if you have not
defined
temporary_type)

Yes

input
record,
index

An output record.

finalize

No

temporar
y record,
input
record

The output record.

output
record

An integer(4) value.

output_select

No

See Optional NORMALIZE transform


functions
and
types
and
NORMALIZE example with a more
elaborate transform.

An output value of 0 means false (the


record was not selected); non-zero
means
true
(the
record
was
selected).
See Optional NORMALIZE transform
functions and types.

Input and output names in transforms


In all transform functions, the names of the inputs and outputs are used only locally, so you
can use any names that make sense to you.
Optional NORMALIZE transform functions and types
There are several optional transform functions and an optional type you can use with
Normalize:
input_select The input_select transform function performs selection of input
records:

out :: input_select(in) =
begin
out :: in.n == 1;

end;
The input_select transform function takes a single argument the input record and
returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if
NORMALIZE is to accept a record.
initialize The initialize transform function initializes temporary storage. This
transform function takes a single argument the input record and returns a
single record with type temporary_type:

temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;
length The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function
specifies the number of times the normalize function will be called for the current
record. This function takes the input record as an argument:

out :: length(in) =
begin
out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished The finished transform function is required when the length function is
not defined. (You must use at least one of these functions.) This transform function
returns a boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call
the normalize function for the current record. When the finished function returns
non-zero (true) , NORMALIZE moves to the next input record.

out :: finished(in, index) =


begin

out :: in.array[index] == "ignore later elements";


end;
The finished function essentially provides a way to implement a while-do loop in the recordreading process.
NOTE: Although we recommend that you not use both length and finished in the same
component, it is possible to define both. In that case, Normalize loops until either finished
returns true or the limit of length is reached, whichever occurs first.
finalize The finalize transform function performs the last step in a multistage
transform:

out :: finalize(temp, in) =


begin
out.key :: in.key;
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select The output_select transform function performs selection of output
records:

out :: output_select(final) =
begin
out :: final.average > 5;
end;
The output_select transform function takes a single argument, the record produced by
finalization, and returns a value of 0 (false) if NORMALIZE is to ignore a record, or
non-zero (true) if NORMALIZE is to generate an output record.

temporary_type If you want Normalize to use temporary storage, define this
storage as a record with a type named temporary_type:

type temporary_type =
record
int count;
int sum;
end;
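Putting these pieces together, here is a minimal sketch of a complete multistage
transform that averages the elements of a vector for each input record. It reuses the
functions shown above; the field names key and big_vector are assumptions for
illustration:

type temporary_type =
record
int count;
int sum;
end;

temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;

out :: length(in) =
begin
out :: length_of(in.big_vector);
end;

temp :: normalize(temp, in, index) =
begin
temp.count :: temp.count + 1;
temp.sum :: temp.sum + in.big_vector[index];
end;

out :: finalize(temp, in) =
begin
out.key :: in.key;
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;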

REFORMAT

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.
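As a minimal sketch of a Reformat transform (the field names are hypothetical), the
following copies one field through, combines two input fields into a new field, and
drops everything that is not assigned:

out :: reformat(in) =
begin
out.customer_id :: in.customer_id;
out.full_name :: string_concat(in.first_name, " ", in.last_name);
end;

Any input field that is not assigned to an output field is dropped from the output.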
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder

ROLLUP

Purpose
Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, rollup supports implicit reformat; see Implicit
reformat.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more
control over record selection, grouping, and aggregation.
The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid
values), according to which mode you use for the rollup:

With expanded mode, you can use ROLLUP normally.

With template mode, always clean and validate data before rolling it up. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the rollup, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to clean and
validate the data before using template mode with ROLLUP.
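For reference, a template-mode rollup transform is a single function built from
aggregation functions. A minimal sketch, assuming input records with customer_id
and amount fields and customer_id as the key:

out :: rollup(in) =
begin
out.customer_id :: in.customer_id;
out.total_amount :: sum(in.amount);
out.purchases :: count(in.amount);
end;

As with template-mode SCAN, the Co>Operating System expands this template at
runtime into the functions needed to execute the rollup.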

SCAN

Purpose

For every input record, Scan generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and including the
current record. For example, the output records might include successive year-to-date totals
for groups of records.
Although it lacks a reformat transform function, scan supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:

With expanded mode, you can use SCAN normally.

With template mode, always clean and validate data before scanning it. Because
the aggregation functions are not expanded, you may see unexpected or even
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors, including the data type, the DML
expression used to perform the scan, and the value of the sorted-input
parameter, may affect where the problems occur. It is safest to clean and
validate the data before using template mode with SCAN.

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See
Component folding for more information.

Two modes to use SCAN


You can use a SCAN component in two modes, depending on how you define
the transform parameter:
Define a transform that uses a template scan function. This is called template mode and is
most often used when you want to output aggregations of the data.

Create a transform using an expanded SCAN package. This is called expanded mode
and allows for scans that do not necessarily use regular aggregation functions.

Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify
an aggregation function that describes how the cumulative summary should be computed.
At runtime, the Co>Operating System expands this template function into the multiple
functions that are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You
could use the sum aggregation function to calculate the running total of spending for each
customer after each purchase.
For more information, see Using SCAN with aggregation functions.
Expanded mode
Expanded mode provides more control over the scan. It lets you edit the expanded package,
so you can specify transformations that are not possible with template mode. As such, you
might use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:

A DML type named temporary_type

An initialize function that returns a temporary_type record

A scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record

A finalize function that returns an output record

For more information, see Transform package for SCAN.

Examples of using SCAN

transforms/scan/scan.mp

Template SCAN with an aggregation function


This example shows how to compute, from input records containing customer_id, dt (date),
and amount, a running total of transactions for each customer in a dataset. The example
uses a template scan function with the sum aggregation function.
Suppose you have the following input records:

customer_id    dt            amount
C002142        1994.03.23    52.20
C002142        1994.06.22    22.25
C003213        1993.02.12    47.95
C003213        1994.11.05    221.24
C003213        1995.12.11    17.42
C004221        1994.08.15    25.25
C008231        1993.10.22    122.00
C008231        1995.12.10    52.10

You want to produce output records with customer_id, dt, and amount_to_date:

customer_id    dt            amount_to_date
C002142        1994.03.23    52.20
C002142        1994.06.22    74.45
C003213        1993.02.12    47.95
C003213        1994.11.05    269.19
C003213        1995.12.11    286.61
C004221        1994.08.15    25.25
C008231        1993.10.22    122.00
C008231        1995.12.10    174.10

To accomplish this task, do one of the following:

Sort the input records on customer_id and dt, and use a SCAN component with
the sorted-input parameter set to Input must be sorted or grouped
and customer_id as the key field.

Sort the input records on dt, and use a SCAN component with the sorted-input
parameter set to In memory: Input need not be sorted
and customer_id as the key field.

Create the transform using the sum aggregation function, as follows:


out :: scan(in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: sum(in.amount);
end;
Expanded SCAN
Continuing the previous example, you want to categorize customers according to their
spending. After their spending exceeds $100, you place them in the premium category.
The new output data includes the category for each customer, current as of each date on
which they made a purchase.

customer_id    dt            amount_to_date    category
C002142        1994.03.23    52.20             regular
C002142        1994.06.22    74.45             regular
C003213        1993.02.12    47.95             regular
C003213        1994.11.05    269.19            premium
C003213        1995.12.11    286.61            premium
C004221        1994.08.15    25.25             regular
C008231        1993.10.22    122.00            premium
C008231        1995.12.10    174.10            premium

For this example, we can use the finalize function in an expanded transform to add the
category information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the
running total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
temp.amount_to_date :: 0;
end;

out :: scan(temp, in) =
begin
out.amount_to_date :: temp.amount_to_date + in.amount;
end;

out :: finalize(temp, in) =
begin
out.customer_id :: in.customer_id;
out.dt :: in.dt;
out.amount_to_date :: temp.amount_to_date;
out.category :: if (temp.amount_to_date > 100) "premium"
else "regular";
end;
The temporary_type is a variable that stores the cumulative data from one record to the
next. At the beginning of each group, the initialize function resets the temporary variable to
0. (Remember that in this example, the data is grouped by customer_id.)

The scan function is called for each record; it keeps a running total of purchase amounts
within the group. The finalize function creates the output records, assigning a category
value to each one.

SPLIT

Purpose

SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data

Select a subset of fields from the data

Normalize vectors (including nested vectors)

Retrieve multiple, distinct outputs from a single pass through the data

How SPLIT works


SPLIT does not use transform functions. It determines what operations to perform on input
data by using DML that is generated by the split_dml command-line utility. This approach
enables you to perform operations such as normalizing vectors without using expensive
DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer

Transform folder

Example of using SPLIT


Say you have a file example1.dml that has both a nested hierarchy of records and three
levels of nested vectors, with the following record format:
record
string("|") region;
record
string("|") state;
record
string("|") county;
record
string("|") addr_line1;
string("|") addr_line2;
end location;
record
string("|") atm_id;
string("|") comment;
end[decimal(2)] atms;
end[decimal(2)] counties;
end[decimal(2)] states;
string("\n") mgr;
end
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this
record.

First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the
specified wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to
be normalized can be used; in this case, the specified field atm_id is used with the
".." shorthand, because atm_id is unique in the record.

This command generates the following output:


/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i ..# -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") addr_line1;
string("|") addr_line2;
string("|") atm_id;
string("|") comment;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,mgr=mgr';
end
Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
Suppose that you want to exclude certain fields (addr_line1, addr_line2,
and comment) from the output. Run split_dml as follows:

split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") atm_id;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
atm_id=states.counties.atms.atm_id,
mgr=mgr';
end
Note that the fields specified by the split_dml -i option appear in the order in which they
occur in the input record, not in the order in which they are listed in the option argument.

Posted 3rd January 2016 by kashyap vasani

SEP

18

AB-INITIO PARTITION COMPONENT

PARTITION BY EXPRESSION

Purpose

Partition by Expression distributes records to its output flow partitions according to a
specified DML expression or transform function.
The output port for Partition by Expression is ordered. See Ordered ports. Although you
can use fan-out flows on the out port, we do not recommend connecting multiple fan-out
flows. You may connect a single fan-out flow; or, preferably, limit yourself to straight flows on
the out port.
Partition by Expression supports implicit reformat. See Implicit reformat.
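As a minimal sketch, a partitioning transform function might compute a partition
number from a record field (the function and field names are hypothetical, and this
assumes four output partitions):

out :: partition_expr(in) =
begin
out :: in.account_id % 4;
end;

Records with the same account_id always yield the same value, so they always land in
the same partition.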
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer

Partitioning folder

PARTITION BY KEY
Purpose

Partition by Key distributes records to its output flow partitions according to key values.
How Partition by Key interprets key values depends on the internal representation of the
key. For example, the number 4 in a field of type integer(2) is not considered identical to the
number 4 in a field of type decimal(4).
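A practical consequence: if the same logical key arrives in different types from
different sources, consider casting it to a single type upstream, for example in a
Reformat (a minimal sketch; the field names are hypothetical):

out :: reformat(in) =
begin
out.key :: (decimal(4)) in.key;
out.* :: in.*;
end;

With a common internal representation, equal key values are sent to the same
partition.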
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer

Partitioning folder

PARTITION BY KEY AND SORT


Purpose

Partition by Key and Sort repartitions records by key values and then sorts the records
within each partition. The number of input and output partitions can be different.
How Partition by Key and Sort interprets key values depends on the internal representation
of the key. For example, the number 4 is likely to be partitioned differently depending on
whether it is in a field of type integer(2) or decimal(4).
Partition by Key and Sort is a subgraph that contains two components, Partition by Key and
Sort.
Location in the Component Organizer
Sort folder

PARTITION BY PERCENTAGE
Purpose

Partition by Percentage distributes a specified percentage of the total number of input
records to each output flow.
Location in the Component Organizer

Partitioning folder

PARTITION BY RANGE
Purpose

Partition by Range distributes records to its output flow partitions according to the ranges of
key values specified for each partition. Partition by Range distributes the records relatively
equally among the partitions.
Use Partition by Range when you want to divide data into useful, approximately equal,
groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the
input is unsorted, the output is unsorted.
The records with the key values that come first in the key order go to partition 0, the records
with the key values that come next in the order go to partition 1, and so on. The records with
the key values that come last in the key order go to the partition with the highest number.
Location in the Component Organizer

Partitioning folder

PARTITION BY ROUND-ROBIN
Purpose

Partition by Round-robin distributes blocks of records evenly to each output flow in
round-robin fashion.
For information on undoing the effects of Partition by Round-robin, see INTERLEAVE.
The output port for Partition by Round-robin is ordered. See Ordered ports.
Recommendation

Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer

Partitioning folder
Posted 18th September 2015 by kashyap vasani
