AGGREGATE
Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component. If this feature
is enabled and if the sorted-input parameter is set to In memory: Input need not be
sorted, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Miscellaneous/Deprecated/Transform folder
COMBINE
Purpose
COMBINE processes data in a number of useful ways. You can use COMBINE to:
Restore hierarchies of data flattened by the SPLIT component
record
  string("|") region;              // Sort key 1
  record
    string("|") state;             // Sort key 2
    record
      string("|") county;          // Sort key 3
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|") atm_id;
        string("|") comment;
      end[int] atms;
    end[int] counties;
  end[int] states;
  string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so you specify "..#"
for the value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.
You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINE's input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// with the command-line arguments:
// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
  string("|") region;        // Sort key 1
  string("|") state;         // Sort key 2
  string("|") county;        // Sort key 3
  string("|") addr_line1;
  string("|") addr_line2;
  string("|") atm_id;
  string("|") comment;
  string("\n") regional_mgr;
  string('\0') DML_assignments() =
    'region=region,state=states.state,county=states.counties.county,
    addr_line1=states.counties.atms.location.addr_line1,
    addr_line2=states.counties.atms.location.addr_line2,
    atm_id=states.counties.atms.atm_id,
    comment=states.counties.atms.comment,
    regional_mgr=regional_mgr';
  string('\0') DML_key_specifiers() =
    '{region}=,{state}=states[],{county}=states.counties[]';
end
DEDUP SORTED
Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder
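The behavior described above can be approximated outside Ab Initio. The following Python sketch is only an illustration, not Ab Initio code: the function name dedup_sorted, the key argument, and the choice to keep the first record of each group are assumptions made for this example.

```python
from itertools import groupby

def dedup_sorted(records, key):
    """Split grouped records into (kept, duplicates).

    Assumption for this sketch: the record to keep is the first one
    in each group.
    """
    kept, duplicates = [], []
    # groupby only merges adjacent records with equal keys, which is
    # why the input has to arrive grouped.
    for _, group in groupby(records, key=key):
        first, *rest = group
        kept.append(first)
        duplicates.extend(rest)
    return kept, duplicates

records = [("aaa", 1), ("aaa", 2), ("bbb", 3), ("ccc", 4), ("ccc", 5)]
kept, duplicates = dedup_sorted(records, key=lambda r: r[0])
```

Here kept holds one record per key value and duplicates holds the remaining records of each group.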
FILTER BY EXPRESSION
Purpose
Filter by Expression filters records according to a DML expression or transform function,
which specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For
example, you can configure Filter by Expression to select a certain percentage of records,
or to select every third (or fourth, or fifth, and so on) record. Note that if you need a random
sample of a specific size, you should use the sample component.
FILTER BY EXPRESSION supports implicit reformat. For more information, see Implicit
reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder
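As an illustration of the "every third record" case mentioned above, here is a small Python sketch. It is not Ab Initio code: filter_by_expression and the predicate signature (record position, record) are hypothetical names chosen for this example.

```python
def filter_by_expression(records, predicate):
    """Return the records for which predicate(position, record) is true."""
    return [rec for pos, rec in enumerate(records) if predicate(pos, rec)]

records = list(range(1, 11))  # ten records, numbered 1..10

# Positions are zero-based, so position % 3 == 2 picks the 3rd, 6th, 9th record.
every_third = filter_by_expression(records, lambda pos, rec: pos % 3 == 2)
```

This kind of positional selection yields a systematic subset, not a random sample of a fixed size.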
FUSE
Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single
output flow. It examines one record from each input flow simultaneously, acting on the
records according to the transform function you specify. For example, you can compare
records, selecting one record or another based on some criteria, or fuse them into a single
record that contains data from all the input records.
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However,
certain components placed upstream of Fuse, such as Reformat or Filter by Expression,
could reject or divert some records. In that case, you may not be able to guarantee that the
flows stay in sync. A more reliable option is to add a key field to the data; then use Join to
match the records by key.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
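The record-by-record pairing that Fuse performs can be sketched in Python as follows. This is an analogy only, not Ab Initio code: the fuse function, the sample flows, and the merging transform are assumptions for illustration.

```python
def fuse(transform, *flows):
    """Apply transform to the n-th record of every flow, for each n.

    zip pairs records purely by position, so this sketch, like the
    description above, relies on the flows staying synchronized.
    """
    return [transform(*records) for records in zip(*flows)]

names = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
totals = [{"id": 1, "total": 10}, {"id": 2, "total": 25}]

# The transform merges the two input records into a single output record.
fused = fuse(lambda a, b: {**a, **b}, names, totals)
```

If an upstream step dropped one record from either flow, every later pair would be misaligned, which is why matching by key is the more reliable option.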
JOIN
Purpose
Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output
port. Additional ports allow you to collect rejected and unused records.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions)
that are large and sorted-input is set to Inputs must be sorted, the order of output records
within a key group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder
Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism
for deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new
outgoing records
The mechanism for deciding when to call the transform function consists of the
settings of the parameters join-type, record-requiredn, and dedupn.
Inner joins
The most common case is when join-type is Inner Join. In this case, if each input port
contains a record with the same value for the key fields, the transform function is called and
an output record is produced.
If some of the input flows have more than one record with that key value, the transform
function is called multiple times, once for each possible combination of records, taken one
from each input port.
Whenever a particular key value does not have a matching record on every input port and
Inner Join is specified, the transform function is not called and all incoming records with that
key value are sent to the unusedn ports.
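The inner-join rules above (one transform call per combination of matching records, unmatched records sent to the unused ports) can be sketched in Python. This is a simplified model, not Ab Initio code: inner_join, its arguments, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def inner_join(transform, key, *inputs):
    """Return (output records, list of unused records per input port)."""
    grouped = []
    for flow in inputs:
        by_key = defaultdict(list)
        for rec in flow:
            by_key[key(rec)].append(rec)
        grouped.append(by_key)

    out = []
    unused = [[] for _ in inputs]
    all_keys = sorted(set().union(*(g.keys() for g in grouped)))
    for k in all_keys:
        if all(k in g for g in grouped):
            # Key present on every port: call the transform once per
            # combination, taking one record from each port.
            for combo in product(*(g[k] for g in grouped)):
                out.append(transform(*combo))
        else:
            # Key missing on at least one port: nothing is joined and
            # the records go to the matching unused list.
            for port, g in enumerate(grouped):
                unused[port].extend(g.get(k, []))
    return out, unused

in0 = [("a", 1), ("a", 2), ("b", 3)]
in1 = [("a", "x"), ("c", "y")]
joined, unused = inner_join(lambda l, r: (l[0], l[1], r[1]),
                            lambda rec: rec[0], in0, in1)
```

Key a appears twice on in0 and once on in1, so the transform runs twice for it; keys b and c have no partner and end up in the unused lists.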
In the cases shown above, suppose you want to narrow the join conditions to a subset of
the shaded (required match) area. To do this, use the DML is_defined function in a rule in
the transform itself. This is the same principle demonstrated in the two-way join shown in
Getting a joined output record.
For example, suppose you want to produce an output record when a particular key value
either is present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded
area to represent the necessary conditions. However, Case 2 also represents conditions
under which you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:
1. Set join-type to Full Outer Join as in Case 2 above.
2. Put the following rules in Join's transform function:
MATCH SORTED
Purpose
Match Sorted combines multiple flows of records with matching keys and performs
transform operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match
Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder
          in0   in1   in2
record 1  aaa   aaa   aaa
record 2  bbb   bbb   ccc
record 3  ccc   ccc   ddd
record 4  eee   eee   eee
record 5  eee   fff   fff
record 6  eee   end   end
Match Sorted calls the transform function eight times for these data records, with the
arguments as follows:
transform( in0-rec1, in1-rec1, in2-rec1 ) records with key value aaa
transform( in0-rec2, in1-rec2, NULL ) records with key value bbb
transform( in0-rec3, in1-rec3, in2-rec2 ) records with key value ccc
transform( NULL,
NULL,
Since there are three eee records in the flow attached to in0, Match Sorted calls the
transform function three times with eee records as inputs. Since the next records on in1 and
in2 do not have key value eee, in1 and in2 repeat their rec4 records.
MULTI REFORMAT
Purpose
Multi Reformat changes the format of records flowing from 1 to 20 pairs of in and out ports
by dropping fields or by using DML expressions to add fields, combine fields, or transform
data in the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a
regular REFORMAT component is the correct choice. For example:
If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also
reformatting it (by adding, combining, or transforming fields), try using the
output-index and count parameters on the REFORMAT component.
A recommended use for Multi Reformat is to put it immediately before a custom component
that takes multiple inputs. For more information, see Using MULTI REFORMAT to avoid
deadlock.
To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the
custom component. Using this built-in component to process the input flows applies
automatic flow buffering to them before they reach the custom component, thus avoiding the
possibility of deadlock.
NORMALIZE
Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field
for each group (the inverse of NORMALIZE), you would use the accumulation function
of the ROLLUP component.
Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a
multistage transform, it follows computation rules that may cause unexpected or
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors including the data type, the DML
expression used to perform the normalization, and the value of the sorted-input
parameter may affect where the problems occur. It is safest to avoid normalizing
dirty data.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See
Component folding for more information.
There are seven built-in functions, as shown in the following table. Of these, only normalize
is required. Examples of most of these functions can be found in Simple NORMALIZE
example with vectors.
There is also an optional temporary_type (see Optional NORMALIZE transform functions
and types), which you can define if you need to use temporary variables. For an example,
see NORMALIZE example with a more elaborate transform.
Transform function: input_select
Required? No
Arguments: input record
Return value: An integer(4) value. An output value of 0 means false (the record was not
selected); non-zero means true (the record was selected). See Optional NORMALIZE
transform functions and types.

Transform function: initialize
Required? No
Arguments: input record
Return value: A record whose type is temporary_type.

Transform function: length
Required? Only if finished is not provided
Arguments: input record
Return value: An integer(4) value. Specifies the number of output records Normalize
generates for this input record. If the length function is provided, Normalize calls it
once for each input record. For examples, see Simple NORMALIZE example with vectors
and NORMALIZE example with a more elaborate transform.

Transform function: finished (if you have defined temporary_type)
Required? Only if length is not provided
Arguments: temporary record, input record, index

Transform function: finished (if you have not defined temporary_type)
Required? Only if length is not provided
Arguments: input record, index

Transform function: normalize (if you have defined temporary_type)
Required? Yes
Arguments: temporary record, input record, index
Return value: A record whose type is temporary_type. For examples, see Simple
NORMALIZE example with vectors.

Transform function: normalize (if you have not defined temporary_type)
Required? Yes
Arguments: input record, index
Return value: An output record.

Transform function: finalize
Required? No
Arguments: temporary record, input record
Return value: An output record.

Transform function: output_select
Required? No
Arguments: output record
Return value: An integer(4) value. An output value of 0 means false (the record was not
selected); non-zero means true (the record was selected). See Optional NORMALIZE
transform functions and types.
out :: input_select(in) =
begin
out :: in.n == 1;
end;
The input_select transform function takes a single argument (the input record) and
returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if
NORMALIZE is to accept a record.
initialize The initialize transform function initializes temporary storage. This
transform function takes a single argument (the input record) and returns a
single record with type temporary_type:
temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;
length The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function
specifies the number of times the normalize function will be called for the current
record. This function takes the input record as an argument:
out :: length(in) =
begin
out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished The finished transform function is required when the length function is
not defined. (You must use at least one of these functions.) This transform function
returns a boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call
the normalize function for the current record. When the finished function returns
non-zero (true), NORMALIZE moves to the next input record.
out.count :: temp.count;
out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select The output_select transform function performs selection of output
records:
out :: output_select(final) =
begin
out :: final.average > 5;
end;
The output_select transform function takes a single argument (the record produced by
finalization) and returns a value of 0 (false) if NORMALIZE is to ignore a record, or
non-zero (true) if NORMALIZE is to generate an output record.
temporary_type If you want Normalize to use temporary storage, define this
storage as a record with a type named temporary_type:
type temporary_type =
record
int count;
int sum;
end;
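To make the interplay of length and normalize more concrete, the following Python sketch mimics the simplest case: no temporary_type, with length deciding how many times normalize is called per input record. It is an analogy, not Ab Initio code; normalize_records, the field names, and the sample data are assumptions for illustration.

```python
def normalize_records(records, length, normalize):
    """Call normalize(record, index) length(record) times per input record."""
    out = []
    for rec in records:
        # length plays the role of a loop bound: one output record
        # for each index from 0 to length(rec) - 1.
        for index in range(length(rec)):
            out.append(normalize(rec, index))
    return out

customers = [
    {"id": "C1", "purchases": [10, 20, 30]},
    {"id": "C2", "purchases": []},
    {"id": "C3", "purchases": [5]},
]

flat = normalize_records(
    customers,
    length=lambda rec: len(rec["purchases"]),
    normalize=lambda rec, i: {"id": rec["id"], "purchase": rec["purchases"][i]},
)
```

An input record with an empty vector produces no output records at all, because its length is 0.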
REFORMAT
Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder
ROLLUP
Purpose
Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, rollup supports implicit reformat; see Implicit
reformat.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more
control over record selection, grouping, and aggregation.
The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid
values), according to which mode you use for the rollup:
SCAN
Purpose
For every input record, Scan generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and including the
current record. For example, the output records might include successive year-to-date totals
for groups of records.
Although it lacks a reformat transform function, scan supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See
Component folding for more information.
Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify
an aggregation function that describes how the cumulative summary should be computed.
At runtime, the Co>Operating System expands this template function into the multiple
functions that are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You
could use the sum aggregation function to calculate the running total of spending for each
customer after each purchase.
For more information, see Using SCAN with aggregation functions.
Expanded mode
Expanded mode provides more control over the scan. It lets you edit the expanded package,
so you can specify transformations that are not possible with template mode. As such, you
might use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:
DML type named temporary_type
initialize function that returns a temporary_type record
scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record
transforms/scan/scan.mp
customer_id  dt          amount
C002142      1994.03.23  52.20
C002142      1994.06.22  22.25
C003213      1993.02.12  47.95
C003213      1994.11.05  221.24
C003213      1995.12.11  17.42
C004221      1994.08.15  25.25
C008231      1993.10.22  122.00
C008231      1995.12.10  52.10
You want to produce output records with customer_id, dt, and amount_to_date:
customer_id  dt          amount_to_date
C002142      1994.03.23  52.20
C002142      1994.06.22  74.45
C003213      1993.02.12  47.95
C003213      1994.11.05  269.19
C003213      1995.12.11  286.61
C004221      1994.08.15  25.25
C008231      1993.10.22  122.00
C008231      1995.12.10  174.10
customer_id  dt          amount_to_date  category
C002142      1994.03.23  52.20           regular
C002142      1994.06.22  74.45           regular
C003213      1993.02.12  47.95           regular
C003213      1994.11.05  269.19          premium
C003213      1995.12.11  286.61          premium
C004221      1994.08.15  25.25           regular
C008231      1993.10.22  122.00          premium
C008231      1995.12.10  174.10          premium
For this example, we can use the finalize function in an expanded transform to add the
category information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the
running total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
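type temporary_type =
record
  decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
  temp.amount_to_date :: 0;
end;

// Adds the current purchase to the running total for the group.
temp :: scan(temp, in) =
begin
  temp.amount_to_date :: temp.amount_to_date + in.amount;
end;

// Builds the output record, including the category.
out :: finalize(temp, in) =
begin
  out.customer_id :: in.customer_id;
  out.dt :: in.dt;
  out.amount_to_date :: temp.amount_to_date;
  out.category :: if (temp.amount_to_date > 100) "premium"
                  else "regular";
end;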
The temporary_type is a variable that stores the cumulative data from one record to the
next. At the beginning of each group, the initialize function resets the temporary variable
to 0. (Remember that in this example, the data is grouped by customer_id.)
The scan function is called for each record; it keeps a running total of purchase amounts
within the group. The finalize function creates the output records.
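The initialize, scan, and finalize steps of this example can be mirrored in a short Python sketch. This is an analogy under stated assumptions, not Ab Initio code: scan_running_total is a hypothetical name, amounts are passed as strings and converted to Decimal to keep two-decimal arithmetic exact, and the final amount is written as 52.10.

```python
from decimal import Decimal
from itertools import groupby

def scan_running_total(records):
    """Emit one output record per input record with a per-customer running total."""
    out = []
    # The input is assumed to be grouped by customer_id already.
    for customer_id, group in groupby(records, key=lambda r: r[0]):
        amount_to_date = Decimal("0")            # initialize: reset per group
        for _, dt, amount in group:
            amount_to_date += Decimal(amount)    # scan: update the running total
            # finalize: build the output record, including the category
            category = "premium" if amount_to_date > 100 else "regular"
            out.append((customer_id, dt, str(amount_to_date), category))
    return out

purchases = [
    ("C002142", "1994.03.23", "52.20"),
    ("C002142", "1994.06.22", "22.25"),
    ("C003213", "1993.02.12", "47.95"),
    ("C003213", "1994.11.05", "221.24"),
    ("C003213", "1995.12.11", "17.42"),
    ("C004221", "1994.08.15", "25.25"),
    ("C008231", "1993.10.22", "122.00"),
    ("C008231", "1995.12.10", "52.10"),
]

result = scan_running_total(purchases)
```

Each tuple in result corresponds to one row of the output table with the category column.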
SPLIT
Purpose
SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data
Select a subset of fields from the data
SPLIT does not use transform functions. It determines what operations to perform on input
data by using DML that is generated by the split_dml command-line utility. This approach
enables you to perform operations such as normalizing vectors without using expensive
DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder
record
  string("|") region;
  record
    string("|") state;
    record
      string("|") county;
      record
        string("|") addr_line1;
        string("|") addr_line2;
      end location;
      record
        string("|") atm_id;
        string("|") comment;
      end[decimal(2)] atms;
    end[decimal(2)] counties;
  end[decimal(2)] states;
  string("\n") mgr;
end
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this
record.
First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the
specified wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to
be normalized can be used; in this case, the specified field atm_id is used with the
".." shorthand, because atm_id is unique in the record.
Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
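Suppose that you want to exclude certain fields, such as addr_line1 and addr_line2, from
the output. In that case, list only the fields you want to keep in the -i argument:
split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml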
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
  string("|") region;
  string("|") state;
  string("|") county;
  string("|") atm_id;
  string("\n") mgr;
  string('\0') DML_assignments() =
    'region=region,state=states.state,county=states.counties.county,
    atm_id=states.counties.atms.atm_id,
    mgr=mgr';
end
Note that the fields specified by the split_dml -i option appear in the order in which they
occur in the input record, not in the order in which they are listed in the option argument.
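The effect of flattening the hierarchy and normalizing on atm_id can be pictured with a Python sketch. This is only an illustration of the resulting record shape, not how SPLIT or split_dml work internally: split_flatten and the sample values are hypothetical, while the field names follow the example record.

```python
def split_flatten(record):
    """Produce one flat row per ATM, carrying the selected parent fields along."""
    rows = []
    for state in record["states"]:
        for county in state["counties"]:
            for atm in county["atms"]:   # atm_id is the base field for normalization
                rows.append({
                    "region": record["region"],
                    "state": state["state"],
                    "county": county["county"],
                    "atm_id": atm["atm_id"],
                    "mgr": record["mgr"],
                })
    return rows

# Hypothetical sample record with one state, two counties, three ATMs.
nested = {
    "region": "East",
    "mgr": "Lee",
    "states": [{
        "state": "MA",
        "counties": [
            {"county": "Suffolk", "atms": [{"atm_id": "A1", "comment": ""},
                                           {"atm_id": "A2", "comment": ""}]},
            {"county": "Essex", "atms": [{"atm_id": "A3", "comment": ""}]},
        ],
    }],
}

rows = split_flatten(nested)
```

The output has one row per element of the innermost vector, and fields that were not selected (here comment) are dropped.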
AB-INITIO Component
COMBINE
Purpose
COMBINE processes data in a number of useful ways. You can use COMBINE to:
Restore hierarchies of data flattened by the SPLIT component
Create a single output record by joining multiple input streams
COMBINE can also merge elements of vectors, in the same way it merges top-level
records: if you specify no key, COMBINE merges the elements based on an implied key,
which is equal to a record's index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Component Organizer
Transform folder
JOIN
Full outer joins
Another common case is when join-type is Full Outer Join: if each input port has a record
with a matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in
effect ignored.
With an outer join, the transform function typically requires additional rules (as compared to
an inner join) to handle the possibility of NULL inputs.
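The full outer join behavior, including the extra NULL handling in the transform, can be sketched in Python. This is a simplified model, not Ab Initio code: full_outer_join, merge, and the sample data are hypothetical, and Python's None stands in for NULL.

```python
from collections import defaultdict
from itertools import product

def full_outer_join(transform, key, *inputs):
    """Call transform for every key; a missing input is passed as None."""
    grouped = []
    for flow in inputs:
        by_key = defaultdict(list)
        for rec in flow:
            by_key[key(rec)].append(rec)
        grouped.append(by_key)

    out = []
    for k in sorted(set().union(*(g.keys() for g in grouped))):
        # A port with no record for this key contributes a single None.
        sides = [g[k] if k in g else [None] for g in grouped]
        for combo in product(*sides):
            out.append(transform(*combo))
    return out

def merge(left, right):
    # Extra rules compared with an inner join: either side may be None.
    k = left[0] if left is not None else right[0]
    return (k,
            left[1] if left is not None else None,
            right[1] if right is not None else None)

in0 = [("a", 1), ("b", 2)]
in1 = [("b", "x"), ("c", "y")]
joined = full_outer_join(merge, lambda rec: rec[0], in0, in1)
```

Only key b is present on both inputs; keys a and c still produce output records, with None in place of the missing side.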
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False
for the record-requiredn parameter for each inn port. The settings you choose determine
when Join calls the transform function. See record-requiredn.
For the three-way joins shown in the following diagrams, the shaded regions again
represent the key values that must match in order for Join to call the transform function:
In the cases shown above, suppose you want to narrow the join conditions to a subset of
the shaded (required match) area. To do this, use the DML is_defined function in a rule in
the transform itself. This is the same principle demonstrated in the two-way join shown in
Getting a joined output record.
For example, suppose you want to produce an output record when a particular key value
either is present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded
area to represent the necessary conditions. However, Case 2 also represents conditions
under which you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:
1
. Set join-type to Full Outer Join as in Case 2 above.
2
. Put the following rules in Joins transform function:
MATCH SORTED
Purpose
Match Sorted combines multiple flows of records with matching keys and performs
transform operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match
Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder
            in0    in1    in2
record 1    aaa    aaa    aaa
record 2    bbb    bbb    ccc
record 3    ccc    ccc    ddd
record 4    eee    eee    eee
record 5    eee    fff    fff
record 6    eee    end    end
Match Sorted calls the transform function eight times for these data records, with the
arguments as follows:
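transform( in0-rec1, in1-rec1, in2-rec1 ) records with key value aaa
transform( in0-rec2, in1-rec2, NULL ) records with key value bbb
transform( in0-rec3, in1-rec3, in2-rec2 ) records with key value ccc
transform( NULL, NULL, in2-rec3 ) records with key value ddd
transform( in0-rec4, in1-rec4, in2-rec4 ) records with key value eee
transform( in0-rec5, in1-rec4, in2-rec4 ) records with key value eee
transform( in0-rec6, in1-rec4, in2-rec4 ) records with key value eee
transform( NULL, in1-rec5, in2-rec5 ) records with key value fff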
Since there are three eee records in the flow attached to in0, Match Sorted calls the
transform function three times with eee records as inputs. Since the next records on in1 and
in2 do not have key value eee, in1 and in2 repeat their rec4 records.
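The calling pattern described above can be modeled with a short Python sketch. It assumes the inputs arrive in ascending key order and that a port with fewer records for a key repeats its last record for that key; both assumptions are drawn from this example rather than from a specification. The function name match_sorted_calls is hypothetical.

```python
def match_sorted_calls(*ports):
    """List the transform calls for grouped inputs, one entry per call.

    Each port is a list of key values in arrival order. Each call is
    (key, args), where args holds the 1-based record number taken from
    each port, or None for a NULL argument.
    """
    # For each port, map key value -> record numbers carrying that key.
    grouped = []
    for port in ports:
        groups = {}
        for number, key in enumerate(port, start=1):
            groups.setdefault(key, []).append(number)
        grouped.append(groups)

    calls = []
    for key in sorted(set().union(*grouped)):
        # One call per record on the port with the most records for this key.
        n_calls = max(len(g.get(key, [])) for g in grouped)
        for i in range(n_calls):
            args = []
            for g in grouped:
                recs = g.get(key)
                # Repeat the last record when a port runs out; NULL if it has none.
                args.append(recs[min(i, len(recs) - 1)] if recs else None)
            calls.append((key, tuple(args)))
    return calls


in0 = ["aaa", "bbb", "ccc", "eee", "eee", "eee"]
in1 = ["aaa", "bbb", "ccc", "eee", "fff"]
in2 = ["aaa", "ccc", "ddd", "eee", "fff"]
calls = match_sorted_calls(in0, in1, in2)
```

For the sample data, calls contains eight entries, with the three eee calls reusing record 4 from in1 and in2.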
MULTI REFORMAT
Purpose
Multi Reformat changes the format of records flowing from 1 to 20 pairs of in and out ports
by dropping fields or by using DML expressions to add fields, combine fields, or transform
data in the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a
regular REFORMAT component is the correct choice. For example:
If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also
reformatting it (by adding, combining, or transforming fields), try using the output-index
and count parameters on the REFORMAT component.
A recommended use for Multi Reformat is to put it immediately before a custom component
that takes multiple inputs. For more information, see Using MULTI REFORMAT to avoid
deadlock.
NORMALIZE
Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field
for each group (the inverse of NORMALIZE), you would use the accumulation function
of the ROLLUP component.
Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a
multistage transform, it follows computation rules that may cause unexpected or
incorrect results in the presence of dirty data (NULLs or invalid values).
Furthermore, the results will be hard to trace, particularly if the reject-threshold
parameter is set to Never abort. Several factors including the data type, the DML
expression used to perform the normalization, and the value of the sorted-input
parameter may affect where the problems occur. It is safest to avoid normalizing
dirty data.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See
Component folding for more information.
Transform function: input_select
  Required? No
  Arguments: input record
  Return value: An integer(4) value. An output value of 0 means false (the record was not selected); non-zero means true (the record was selected). See Optional NORMALIZE transform
Transform function: initialize
  Required? No
  Arguments: input record
  Return value: A record whose type is temporary_type.
Transform function: length
  Required? Only if finished is not provided
  Arguments: input record
  Return value: An integer(4) value. Specifies the number of output records Normalize generates for this input record. If the length function is provided, Normalize calls it once for each input record. For examples, see Simple NORMALIZE example with vectors and NORMALIZE example with a more elaborate transform.
Transform function: finished (if you have defined temporary_type)
  Required? Only if length is not provided
  Arguments: temporary record, input record, index
Transform function: finished (if you have not defined temporary_type)
  Required? Only if length is not provided
  Arguments: input record, index
Transform function: normalize (if you have defined temporary_type)
  Required? Yes
  Arguments: temporary record, input record, index
  Return value: A record whose type is temporary_type. For examples, see Simple NORMALIZE example with vectors.
Transform function: normalize (if you have not defined temporary_type)
  Required? Yes
  Arguments: input record, index
  Return value: An output record.
Transform function: finalize
  Required? No
  Arguments: temporary record, input record
  Return value: output record
Transform function: output_select
  Required? No
  Return value: An integer(4) value.
out :: input_select(in) =
begin
out :: in.n == 1;
end;
The input_select transform function takes a single argument (the input record) and
returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if
NORMALIZE is to accept a record.
initialize
The initialize transform function initializes temporary storage. This
transform function takes a single argument (the input record) and returns a
single record with type temporary_type:
temp :: initialize(in) =
begin
temp.count :: 0;
temp.sum :: 0;
end;
length
The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function
specifies the number of times the normalize function will be called for the current
record. This function takes the input record as an argument:
out :: length(in) =
begin
out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
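As a hedged illustration of that for loop, the following Python sketch models how length drives the calls to normalize. It is a model of the control flow only, not DML; the function name normalize_with_length, the dictionary records, and the zero-based index are assumptions made for the example.

```python
def normalize_with_length(records, length, normalize):
    """Call normalize(record, index) length(record) times per input record."""
    out = []
    for rec in records:
        # length is evaluated once per input record.
        for index in range(length(rec)):
            out.append(normalize(rec, index))
    return out


records = [
    {"id": 1, "big_vector": [10, 20, 30]},
    {"id": 2, "big_vector": []},
    {"id": 3, "big_vector": [40]},
]
rows = normalize_with_length(
    records,
    length=lambda rec: len(rec["big_vector"]),
    normalize=lambda rec, i: {"id": rec["id"], "value": rec["big_vector"][i]},
)
```

An input record with an empty vector produces no output records, because its length is 0.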
finished
The finished transform function is required when the length function is
not defined. (You must use at least one of these functions.) This transform function
returns a boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call
the normalize function for the current record. When the finished function returns
non-zero (true), NORMALIZE moves to the next input record.
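Where length behaves like a for loop, finished behaves like a while loop. The Python sketch below models that reading. It assumes finished is checked before each call to normalize, which is one interpretation of the description above; the exact call order is not stated in this text. The names normalize_until_finished and n are hypothetical.

```python
def normalize_until_finished(records, finished, normalize):
    """Call normalize(record, index) until finished(record, index) is true."""
    out = []
    for rec in records:
        index = 0
        # Assumed order: test finished first, then call normalize.
        while not finished(rec, index):
            out.append(normalize(rec, index))
            index += 1
    return out


records = [{"id": 1, "n": 2}, {"id": 2, "n": 0}, {"id": 3, "n": 1}]
rows = normalize_until_finished(
    records,
    finished=lambda rec, i: i >= rec["n"],
    normalize=lambda rec, i: (rec["id"], i),
)
```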
out :: output_select(final) =
begin
out :: final.average > 5;
end;
The output_select transform function takes a single argument (the record produced by
finalization) and returns a value of 0 (false) if NORMALIZE is to ignore a record, or
non-zero (true) if NORMALIZE is to generate an output record.
type temporary_type =
record
int count;
int sum;
end;
REFORMAT
Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
Location in the Organizer
Transform folder
ROLLUP
Purpose
Rollup evaluates a group of input records that have the same key, and then generates
records that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, rollup supports implicit reformat; see Implicit
reformat.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more
control over record selection, grouping, and aggregation.
The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid
values), according to which mode you use for the rollup:
SCAN
Purpose
For every input record, Scan generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and including the
current record. For example, the output records might include successive year-to-date totals
for groups of records.
Although it lacks a reformat transform function, scan supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See
Component folding for more information.
Create a transform using an expanded SCAN package. This is called expanded mode
and allows for scans that do not necessarily use regular aggregation functions.
Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify
an aggregation function that describes how the cumulative summary should be computed.
At runtime, the Co>Operating System expands this template function into the multiple
functions that are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You
could use the sum aggregation function to calculate the running total of spending for each
customer after each purchase.
For more information, see Using SCAN with aggregation functions.
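To make the running-total idea concrete, here is a minimal Python model of what a sum aggregation in template mode computes: a cumulative sum that restarts for each key group. It is an illustration of the result, not of how the Co>Operating System expands the template. The sample rows and the tuple layout are assumptions, and the input is assumed to be grouped by customer.

```python
from decimal import Decimal
from itertools import accumulate, groupby

# (customer_id, dt, amount), already grouped by customer_id.
purchases = [
    ("C002142", "1994.03.23", Decimal("52.20")),
    ("C002142", "1994.06.22", Decimal("22.25")),
    ("C004221", "1994.08.15", Decimal("25.25")),
]

running = []
for customer_id, group in groupby(purchases, key=lambda r: r[0]):
    group = list(group)
    # accumulate yields the cumulative sum after each purchase in the group.
    totals = accumulate(amount for _, _, amount in group)
    for (cid, dt, _), total in zip(group, totals):
        running.append((cid, dt, total))
```

One output row is produced per input row, which is the difference from a rollup that would emit a single total per customer.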
Expanded mode
Expanded mode provides more control over the scan. It lets you edit the expanded package,
so you can specify transformations that are not possible with template mode. As such, you
might use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:
DML type named temporary_type
initialize function that returns a temporary_type record
scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record
finalize function that returns an output record
transforms/scan/scan.mp
customer_id   dt            amount
C002142       1994.03.23     52.20
C002142       1994.06.22     22.25
C003213       1993.02.12     47.95
C003213       1994.11.05    221.24
C003213       1995.12.11     17.42
C004221       1994.08.15     25.25
C008231       1993.10.22    122.00
C008231       1995.12.10     52.10
You want to produce output records with customer_id, dt, and amount_to_date:
customer_id   dt            amount_to_date
C002142       1994.03.23     52.20
C002142       1994.06.22     74.45
C003213       1993.02.12     47.95
C003213       1994.11.05    269.19
C003213       1995.12.11    286.61
C004221       1994.08.15     25.25
C008231       1993.10.22    122.00
C008231       1995.12.10    174.10
customer_id   dt            amount_to_date   category
C002142       1994.03.23     52.20           regular
C002142       1994.06.22     74.45           regular
C003213       1993.02.12     47.95           regular
C003213       1994.11.05    269.19           premium
C003213       1995.12.11    286.61           premium
C004221       1994.08.15     25.25           regular
C008231       1993.10.22    122.00           premium
C008231       1995.12.10    174.10           premium
For this example, we can use the finalize function in an expanded transform to add the
category information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the
running total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
decimal(8.2) amount_to_date = 0;
end;
temp :: initialize(in) =
begin
temp.amount_to_date :: 0;
end;
(Remember that in this example, the data is grouped by customer_id.)
The scan function is called for each record; it keeps a running total of purchase amounts
within the group.
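The division of work among initialize, scan, and finalize can be modeled in Python as follows. This is a sketch of the control flow, not DML. The cutoff between regular and premium is not given in this text, so PREMIUM_THRESHOLD is a hypothetical value; 100 is consistent with the sample output, as is any value from 74.45 up to just under 122.00. The driver function run_scan is also hypothetical.

```python
from decimal import Decimal

# Assumption: the real cutoff is not stated; 100 matches the sample output.
PREMIUM_THRESHOLD = Decimal("100")


def initialize(rec):
    """Start a new group with a zero running total."""
    return {"amount_to_date": Decimal("0")}


def scan(temp, rec):
    """Add the current purchase to the running total."""
    return {"amount_to_date": temp["amount_to_date"] + rec["amount"]}


def finalize(temp, rec):
    """Build the output record, adding the category."""
    total = temp["amount_to_date"]
    category = "premium" if total > PREMIUM_THRESHOLD else "regular"
    return (rec["customer_id"], rec["dt"], total, category)


def run_scan(records, key):
    """Drive the three functions over input grouped by key."""
    out, temp, current = [], None, object()
    for rec in records:
        if rec[key] != current:  # first record of a new group
            current, temp = rec[key], initialize(rec)
        temp = scan(temp, rec)
        out.append(finalize(temp, rec))
    return out


rows = [
    ("C003213", "1993.02.12", "47.95"),
    ("C003213", "1994.11.05", "221.24"),
    ("C003213", "1995.12.11", "17.42"),
    ("C004221", "1994.08.15", "25.25"),
]
records = [{"customer_id": c, "dt": d, "amount": Decimal(a)} for c, d, a in rows]
result = run_scan(records, "customer_id")
```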
The finalize function creates the output records, assigning the category.
SPLIT
Purpose
SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data
Select a subset of fields from the data
Transform folder
First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the
specified wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to
be normalized can be used; in this case, the specified field atm_id is used with the
".." shorthand, because atm_id is unique in the record.
'region=region,state=states.state,county=states.counties.county,
addr_line1=states.counties.atms.location.addr_line1,
addr_line2=states.counties.atms.location.addr_line2,
atm_id=states.counties.atms.atm_id,
comment=states.counties.atms.comment,mgr=mgr';
end
Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
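The effect of those assignments can be pictured with a small Python model that flattens a nested record into one row per ATM, repeating the enclosing region, state, and county values. The field names follow the DML_assignments string above; the sample values and the function name flatten are hypothetical, and this sketch does not reproduce how SPLIT itself is implemented.

```python
def flatten(record):
    """Yield one flat row per ATM, repeating the enclosing fields."""
    for state in record["states"]:
        for county in state["counties"]:
            for atm in county["atms"]:
                yield {
                    "region": record["region"],
                    "state": state["state"],
                    "county": county["county"],
                    "addr_line1": atm["location"]["addr_line1"],
                    "addr_line2": atm["location"]["addr_line2"],
                    "atm_id": atm["atm_id"],
                    "comment": atm["comment"],
                    "mgr": record["mgr"],
                }


def atm(atm_id):
    """Build a hypothetical ATM subrecord for the sample data."""
    return {
        "location": {"addr_line1": "1 Main St", "addr_line2": ""},
        "atm_id": atm_id,
        "comment": "",
    }


nested = {
    "region": "East",
    "mgr": "Smith",
    "states": [
        {"state": "MA", "counties": [
            {"county": "Essex", "atms": [atm("A1"), atm("A2")]},
        ]},
        {"state": "NH", "counties": [
            {"county": "Coos", "atms": [atm("A3")]},
        ]},
    ],
}
flat = list(flatten(nested))
```

Here atm_id plays the role of the base field: the number of output rows equals the number of ATM entries in the input.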
Suppose that you want to exclude certain fields: addr_line1, addr_line2, and comment. The
command is:
split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
string("|") region;
string("|") state;
string("|") county;
string("|") atm_id;
string("\n") mgr;
string('\0') DML_assignments() =
'region=region,state=states.state,county=states.counties.county,
atm_id=states.counties.atms.atm_id,
mgr=mgr';
end
Note that the fields specified by the split_dml -i option appear in the order in which they
occur in the input record, not in the order in which they are listed in the option argument.
PARTITION BY EXPRESSION
Purpose
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer
Partitioning folder
PARTITION BY KEY
Purpose
Partition by Key distributes records to its output flow partitions according to key values.
How Partition by Key interprets key values depends on the internal representation of the
key. For example, the number 4 in a field of type integer(2) is not considered identical to the
number 4 in a field of type decimal(4).
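The point about internal representation can be illustrated in Python by comparing the bytes of the two encodings. This is a sketch under stated assumptions: the component's actual hash function is not described in this text, so zlib.crc32 is a stand-in, and the byte layouts chosen for integer(2) (two-byte little-endian binary) and decimal(4) (four text characters, space padded) are assumptions for illustration.

```python
import struct
import zlib


def partition_for(key_bytes, n_partitions):
    """Pick a partition from the raw key bytes (stand-in hash, not the real one)."""
    return zlib.crc32(key_bytes) % n_partitions


as_integer2 = struct.pack("<h", 4)  # assumed layout: 2-byte binary integer
as_decimal4 = b"   4"               # assumed layout: 4-character text decimal

same_value_same_bytes = as_integer2 == as_decimal4
p_int = partition_for(as_integer2, 4)
p_dec = partition_for(as_decimal4, 4)
```

Because the two byte strings differ, nothing guarantees that p_int and p_dec agree, so records that carry the same number in differently typed key fields may not land in the same partition.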
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer
Partitioning folder
PARTITION BY KEY AND SORT
Purpose
Partition by Key and Sort repartitions records by key values and then sorts the records
within each partition. The number of input and output partitions can be different.
How Partition by Key and Sort interprets key values depends on the internal representation
of the key. For example, the number 4 is likely to be partitioned differently depending on
whether it is in a field of type integer(2) or decimal(4).
Partition by Key and Sort is a subgraph that contains two components, Partition by Key and
Sort.
Location in the Component Organizer
Sort folder
PARTITION BY PERCENTAGE
Purpose
Partitioning folder
PARTITION BY RANGE
Purpose
Partition by Range distributes records to its output flow partitions according to the ranges of
key values specified for each partition. Partition by Range distributes the records relatively
equally among the partitions.
Use Partition by Range when you want to divide data into useful, approximately equal,
groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the
input is unsorted, the output is unsorted.
The records with the key values that come first in the key order go to partition 0, the records
with the key values that come next in the order go to partition 1, and so on. The records with
the key values that come last in the key order go to the partition with the highest number.
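The ordering rule above can be sketched in Python with a sorted list of boundary keys. The splitter values and the function name partition_by_range are hypothetical, and treating a key equal to a boundary as belonging to the lower partition is an assumption, since the text does not say how boundary values are handled.

```python
from bisect import bisect_left


def partition_by_range(key, splitters):
    """Return the partition number for key, given sorted boundary keys.

    With k splitters there are k + 1 partitions. Assumption: a key equal
    to a splitter goes to the lower-numbered partition.
    """
    return bisect_left(splitters, key)


splitters = ["g", "p"]  # hypothetical boundaries for three partitions
keys = ["a", "g", "h", "p", "q", "z"]
assignment = {key: partition_by_range(key, splitters) for key in keys}
```

Keys that sort first go to partition 0 and keys that sort last go to partition 2, the highest-numbered partition.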
Location in the Component Organizer
Partitioning folder
PARTITION BY ROUND-ROBIN
Purpose
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See Component
folding for more information.
The component does not fold when connected to a flow that is set to use two-stage routing.
Location in the Component Organizer
Partitioning folder
Posted 18th September 2015 by kashyap vasani