
AB-INITIO TRANSFORM COMPONENT


 AGGREGATE
Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component. If this feature is
enabled and if the sorted-input parameter is set to In memory: Input need not be sorted, the
Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer

Miscellaneous/Deprecated/Transform folder

 COMBINE
Purpose
Combine processes data in a number of useful ways. You can use Combine to:
Restore hierarchies of data flattened by the SPLIT component
Create a single output record by joining multiple input streams

Denormalize vectors (including nested vectors)


How COMBINE works
COMBINE does not use transform functions. It determines what operations to perform on input
data by using DML that is generated for COMBINE’s input ports by the split_dml command-line
utility.
COMBINE performs the inverse operations of the SPLIT component. It has a single output port
and a counted number of input ports. COMBINE (optionally) denormalizes each input data
stream, then performs an outer join on the input records to form the output records.
Using COMBINE for joining data
To use COMBINE to denormalize and join input data, you need to sort and specify keys for the
data. If the input to COMBINE is from an output of SPLIT, you can set up SPLIT to automatically
generate keys by running split_dml with the -g option. Otherwise, you can generate keys by
running split_dml with the -k option, supplying the names of key fields. If you specify no keys,
COMBINE uses an implied key, which is equal to a record’s index within the sequence of records
on the input port. In other words, COMBINE merges records synchronously on each port.
When merging these records, COMBINE selects for processing the records that match the
smallest key present on any port. Thus, the input data on each port should be sorted in the order
specified by the keys.
COMBINE can also merge elements of vectors, in the same way it merges top-level records: if
you specify no key, COMBINE merges the elements based on an implied key, which is equal to a
record’s index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

Example of using COMBINE


Say you have a file example2a.dml with the following record format:
record
  string("|") region = "";   //Sort key 1
  string("|") state = "";    //Sort key 2
  string("|") county = "";   //Sort key 3
  string("|") addr_line1 = "";
  string("|") addr_line2 = "";
  string("|") atm_id = "";
  string("|") comment = "";
  string("\n") regional_mgr = "";
end;
And you want to roll up the fields that are marked as sort keys — region, state, and county — into
nested vectors. To do this, you can use a single COMBINE component rather than performing a
series of three rollup actions.
The desired output format (example2b.dml) is:
record
  string("|") region;      //Sort key 1
  record
    string("|") state;     //Sort key 2
    record
      string("|") county;  //Sort Key 3
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|")atm_id;
        string("|")comment;
      end[int] atms;
    end[int] counties;
  end[int] states;
  string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so you specify
"..#" as the value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.

You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINE’s input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// with the command-line arguments:
// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
  string("|") region   // Sort key 1
  string("|") state    // Sort key 2
  string("|") county   // Sort key 3
  string("|") addr_line1;
  string("|") addr_line2;
  string("|") atm_id;
  string("|") comment;
  string("\n") regional_mgr;
  string('0')DML_assignments =
    'region=region,state=states.state,county=states.counties.county,
    addr_line1=states.counties.atms.location.addr_line1,
    addr_line2=states.counties.atms.location.addr_line2,
    atm_id=states.counties.atms.atm_id,
    comment=states.counties.atms.comment,
    regional_mgr=regional_mgr';
  string('0')DML_key_specifiers() = 
    '{region}=,{state}=states[],{county}=states.counties[]';
end

 DEDUP SORTED
Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

 FILTER BY EXPRESSION
Purpose
Filter by Expression filters records according to a DML expression or transform function, which
specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For example,
you can configure Filter by Expression to select a certain percentage of records, or to select
every third (or fourth, or fifth, and so on) record. Note that if you need a random sample of a
specific size, you should use the SAMPLE component.
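For example, a select_expr like the following keeps every third record. This is a minimal sketch: next_in_sequence is a built-in DML function that returns 1, 2, 3, and so on for successive records, restarting in each partition of a multipartition flow.
next_in_sequence() % 3 == 0
Records for which the expression evaluates to non-zero (true) are selected.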
FILTER BY EXPRESSION supports implicit reformat. For more information, see “Implicit
reformat”.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

 FUSE
Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single output
flow. It examines one record from each input flow simultaneously, acting on the records
according to the transform function you specify. For example, you can compare records,
selecting one record or another based on some criteria, or “fuse” them into a single record that
contains data from all the input records.
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However, certain
components placed upstream of Fuse, such as Reformat or Filter by Expression, could reject or
divert some records. In that case, you may not be able to guarantee that the flows stay in sync. A
more reliable option is to add a key field to the data; then use Join to match the records by key.
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.

 JOIN
Purpose
Join reads data from two or more input ports, combines records with matching keys according to
the transform you specify, and sends the transformed records to the output port. Additional ports
allow you to collect rejected and unused records.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions) that are
large and sorted-input is set to Inputs must be sorted, the order of output records within a key
group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder

Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism for
deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new outgoing
records

The mechanism for deciding when to call the transform function consists of the settings of
the parameters join-type, record-requiredn, and dedupn.
Inner joins
The most common case is when join-type is Inner Join. In this case, if each input port contains a
record with the same value for the key fields, the transform function is called and an output
record is produced.
If some of the input flows have more than one record with that key value, the transform function
is called multiple times, once for each possible combination of records, taken one from each
input port. For example, if a key value appears twice on in0 and three times on in1, the
transform function is called six times.
Whenever a particular key value does not have a matching record on every input port and Inner
Join is specified, the transform function is not called and all incoming records with that key value
are sent to the unusedn ports.
Full outer joins
Another common case is when join-type is Full Outer Join: if each input port has a record with a
matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in effect
ignored.
With an outer join, the transform function typically requires additional rules (as compared to an
inner join) to handle the possibility of NULL inputs.
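For example, a transform along the following lines supplies values when one side of a two-way full outer join is missing. This is a sketch only; the field names key and amount are hypothetical, and the default of 0 is an assumption:
out :: join(in0, in1) =
begin
  out.key    :: if (is_defined(in0)) in0.key else in1.key;
  out.amount :: if (is_defined(in1)) in1.amount else 0;
end;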
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False for the
record-requiredn parameter for each inn port. The settings you choose determine when Join calls
the transform function. See record-requiredn.

Examples of join types

Complex multiway joins


For the three-way joins shown in the following diagrams, the shaded regions again represent the
key values that must match in order for Join to call the transform function:
In the cases shown above, suppose you want to narrow the join conditions to a subset of the
shaded (required match) area. To do this, use the DML is_defined function in a rule in the
transform itself. This is the same principle demonstrated in the two-way join shown in “Getting a
joined output record”.
For example, suppose you want to produce an output record when a particular key value either is
present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded area to
represent the necessary conditions. However, Case 2 also represents conditions under which
you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:
1. Set join-type to Full Outer Join as in Case 2 above.
2. Put the following rules in Join’s transform function:


out.key :1: if (is_defined(in0)) in0.key;
out.key :2: if (is_defined(in1) &&
   is_defined(in2)) in1.key;
For both rules to fail, the particular key value must be absent from in0 and must be present in
only one of in1 or in2.
Join writes the records that result in both rules failing to the rejectn ports if you connect flows to
them.

 MATCH SORTED
Purpose
Match Sorted combines multiple flows of records with matching keys and performs transform
operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder

Example of using MATCH SORTED


This example shows how repeat and missing key values affect the number of times Match Sorted
calls the transform function.
Suppose three input flows feed Match Sorted. The records in these flows have three-character
alphabetic key values. The key values of the records in the three flows are as follows:

          in0    in1     in2
record 1  aaa    aaa     aaa
record 2  bbb    bbb     ccc
record 3  ccc    ccc     ddd
record 4  eee    eee     eee
record 5  eee    fff     fff
record 6  eee    —end—   —end—
Match Sorted calls the transform function eight times for these data records, with the arguments
as follows:
transform( in0-rec1, in1-rec1, in2-rec1 )  — records with key value “aaa”
transform( in0-rec2, in1-rec2, NULL )  — records with key value “bbb”
transform( in0-rec3, in1-rec3, in2-rec2 )  — records with key value “ccc”
transform( NULL,     NULL,     in2-rec3 )  — records with key value “ddd”
transform( in0-rec4, in1-rec4, in2-rec4 )  — records with key value “eee”
transform( in0-rec5, in1-rec4, in2-rec4 )  — records with key value “eee”
transform( in0-rec6, in1-rec4, in2-rec4 )  — records with key value “eee”
transform( NULL,     in1-rec5, in2-rec5 )  — records with key value “fff”
Since there are three eee records in the flow attached to in0, Match Sorted calls the transform
function three times with eee records as inputs. Since the next records on in1 and in2 do not
have key value eee, in1 and in2 repeat their rec4 records.

 MULTI REFORMAT
Purpose
Multi Reformat changes the format of records flowing through 1 to 20 pairs of in and out ports by
dropping fields, or by using DML expressions to add fields, combine fields, or transform data in
the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a regular
REFORMAT component is the correct choice. For example:
If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also reformatting
it (by adding, combining, or transforming fields), try using the output-index and count
parameters on the REFORMAT component.
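For instance, an output-index transform along these lines routes each record to one of two out ports. This is a sketch; the field amount and the threshold are hypothetical:
out :: output_index(in) =
begin
  out :: if (in.amount > 100) 1 else 0;  // send large records to out1, the rest to out0
end;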
A recommended use for Multi Reformat is to put it immediately before a custom component that
takes multiple inputs. For more information, see “Using MULTI REFORMAT to avoid deadlock”.

Using MULTI REFORMAT to avoid deadlock


Deadlock occurs when a program cannot progress, causing a graph to hang. Custom
components (components that you have built to execute your own programs) are prone to
deadlock because they cannot use the GDE’s automatic flow buffering. If a custom component is
programmed to read from multiple flows in a specific order, it carries the possibility of causing
deadlock.
To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the custom
component. Using this built-in component to process the input flows applies automatic flow
buffering to them before they reach the custom component, thus avoiding the possibility of
deadlock.

 NORMALIZE
Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field for
each group — the inverse of NORMALIZE — you would use the accumulation function of the
ROLLUP component.
Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a multistage
transform, it follows computation rules that may cause unexpected or incorrect results in the
presence of dirty data (NULLs or invalid values). Furthermore, the results will be hard to
trace, particularly if the reject-threshold parameter is set to Never abort. Several factors —
including the data type, the DML expression used to perform the normalization, and the
value of the sorted-input parameter — may affect where the problems occur. It is safest to
avoid normalizing dirty data.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See “Component
folding” for more information.

NORMALIZE transform functions


What Normalize does is determined by the functions, types, and variables you define in its
transform parameter.
There are seven built-in functions, as shown in the following table. Of these, only normalize is
required. Examples of most of these functions can be found in “Simple NORMALIZE example
with vectors”.
There is also an optional temporary_type (see “Optional NORMALIZE transform functions and
types”), which you can define if you need to use temporary variables. For an example, see
“NORMALIZE example with a more elaborate transform”.

Transform function: input_select
  Required?     No
  Arguments     input record
  Return value  An integer(4) value. A return value of 0 means false (the
                record was not selected); non-zero means true (the record was
                selected). See “Optional NORMALIZE transform functions and
                types”.

Transform function: initialize
  Required?     No
  Arguments     input record
  Return value  A record whose type is temporary_type. See “Optional
                NORMALIZE transform functions and types”. For examples, see
                “NORMALIZE example with a more elaborate transform”.

Transform function: length
  Required?     Only if finished is not provided
  Arguments     input record
  Return value  An integer(4) value. Specifies the number of output records
                Normalize generates for this input record. If the length
                function is provided, Normalize calls it once for each input
                record. For examples, see “Simple NORMALIZE example with
                vectors” and “NORMALIZE example with a more elaborate
                transform”.

Transform function: finished (if you have defined temporary_type)
  Required?     Only if length is not provided
  Arguments     temporary record, input record, index
  Return value  0 (meaning false) if more output records are to be generated
                from the current input record; otherwise, a non-zero value
                (true). If the finished function is provided, NORMALIZE calls
                it once more than the number of output records it produces.
                On the final call it returns true and no output record is
                produced.

Transform function: finished (if you have not defined temporary_type)
  Required?     Only if length is not provided
  Arguments     input record, index
  Return value  0 (meaning false) if more output records are to be generated
                from the current input record; otherwise, a non-zero value
                (true). If the finished function is provided, NORMALIZE calls
                it once more than the number of output records it produces.
                On the final call it returns true and no output record is
                produced.

Transform function: normalize (if you have defined temporary_type)
  Required?     Yes
  Arguments     temporary record, input record, index
  Return value  A record whose type is temporary_type. For examples, see
                “Simple NORMALIZE example with vectors”.

Transform function: normalize (if you have not defined temporary_type)
  Required?     Yes
  Arguments     input record, index
  Return value  An output record.

Transform function: finalize
  Required?     No
  Arguments     temporary record, input record
  Return value  The output record. See “Optional NORMALIZE transform
                functions and types” and “NORMALIZE example with a more
                elaborate transform”.

Transform function: output_select
  Required?     No
  Arguments     output record
  Return value  An integer(4) value. A return value of 0 means false (the
                record was not selected); non-zero means true (the record was
                selected). See “Optional NORMALIZE transform functions and
                types”.
Input and output names in transforms
In all transform functions, the names of the inputs and outputs are used only locally, so you can
use any names that make sense to you.
Optional NORMALIZE transform functions and types
There are several optional transform functions and an optional type you can use with Normalize:
input_select — The input_select transform function performs selection of input records:
out :: input_select(in) =
begin
  out :: in.n == 1;
end;
The input_select transform function takes a single argument — the input record — and returns a
value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if NORMALIZE is to
accept a record.
initialize — The initialize transform function initializes temporary storage. This transform
function takes a single argument — the input record — and returns a single record with type
temporary_type:
temp :: initialize(in) =
begin
  temp.count :: 0;
  temp.sum :: 0;
end;
length — The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function specifies the
number of times the normalize function will be called for the current record. This function
takes the input record as an argument:
out :: length(in) =
begin
  out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished — The finished transform function is required when the length function is not
defined. (You must use at least one of these functions.) This transform function returns a
boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call the normalize
function for the current record. When the finished function returns non-zero (true),
NORMALIZE moves to the next input record.
out :: finished(in, index) =
begin
  out :: in.array[index] == "ignore later elements";
end;
The finished function essentially provides a way to implement a while-do loop in the record-
reading process.
NOTE: Although we recommend that you not use both length and finished in the same
component, it is possible to define both. In that case, Normalize loops until either finished returns
true or the limit of length is reached, whichever occurs first.
finalize — The finalize transform function performs the last step in a multistage transform:
out :: finalize(temp, in) =
begin
  out.key :: in.key;
  out.count :: temp.count;
  out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select — The output_select transform function performs selection of output records:
out :: output_select(final) =
begin
  out :: final.average > 5;
end;
The output_select transform function takes a single argument — the record produced by
finalization — and returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero
(true) if NORMALIZE is to generate an output record.
temporary_type — If you want Normalize to use temporary storage, define this storage as a
record with a type named temporary_type:
type temporary_type =
  record
    int count;
    int sum;
  end;
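Putting the required pieces together: the following minimal transform flattens a vector field into one output record per element. This is a sketch; the field names id, big_vector, and element are hypothetical:
out :: length(in) =
begin
  out :: length_of(in.big_vector);
end;

out :: normalize(in, index) =
begin
  out.id      :: in.id;
  out.element :: in.big_vector[index];
end;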

 REFORMAT

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to add
fields, combine fields, or transform the data in the records.
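For example, a minimal reformat transform might build one field from two and pass the remaining same-named fields through with a wildcard rule. This is a sketch; the field names first_name, last_name, and full_name are hypothetical:
out :: reformat(in) =
begin
  out.full_name :: string_concat(in.first_name, " ", in.last_name);
  out.*         :: in.*;   // copy remaining same-named fields unchanged
end;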
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Organizer
Transform folder

 ROLLUP
Purpose
Rollup evaluates a group of input records that have the same key, and then generates records
that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, rollup supports implicit reformat; see “Implicit
reformat”.
Location in the Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more control
over record selection, grouping, and aggregation.
The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the rollup:

With expanded mode, you can use ROLLUP normally.


With template mode, always clean and validate data before rolling it up. Because the
aggregation functions are not expanded, you may see unexpected or even incorrect results in
the presence of dirty data (NULLs or invalid values). Furthermore, the results will be hard to
trace, particularly if the reject-threshold parameter is set to Never abort. Several factors —
including the data type, the DML expression used to perform the rollup, and the value of the
sorted-input parameter — may affect where the problems occur. It is safest to clean and
validate the data before using template mode with ROLLUP.
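For example, a template-mode rollup that produces one summary record per customer might look like the following sketch, where the field names customer_id, amount, total, and num_txns are hypothetical:
out :: rollup(in) =
begin
  out.customer_id :: in.customer_id;
  out.total       :: sum(in.amount);     // total spending within the key group
  out.num_txns    :: count(in.amount);   // number of records with a non-NULL amount
end;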

 SCAN
Purpose
For every input record, Scan generates an output record that consists of a running cumulative
summary for the group to which the input record belongs, up to and including the current record.
For example, the output records might include successive year-to-date totals for groups of
records.
Although it lacks a reformat transform function, scan supports implicit reformat.
Recommendations
If you want one summary record for a group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:

With expanded mode, you can use SCAN normally.


With template mode, always clean and validate data before scanning it. Because
the aggregation functions are not expanded, you may see unexpected or even incorrect
results in the presence of dirty data (NULLs or invalid values). Furthermore, the results will
be hard to trace, particularly if the reject-threshold parameter is set to Never abort. Several
factors — including the data type, the DML expression used to perform the scan, and the
value of the sorted-input parameter — may affect where the problems occur. It is safest to
clean and validate the data before using template mode with SCAN.

Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See “Component
folding” for more information.

Two modes of using SCAN


You can use a SCAN component in two modes, depending on how you define
the transform parameter:
Define a transform that uses a template scan function. This is called template mode
and is most often used when you want to output aggregations of the data.
Create a transform using an expanded SCAN package. This is called expanded mode
and allows for scans that do not necessarily use regular aggregation functions.
Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify an
aggregation function that describes how the cumulative summary should be computed. At
runtime, the Co>Operating System expands this template function into the multiple functions that
are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You could
use the sum aggregation function to calculate the running total of spending for each customer
after each purchase.
For more information, see “Using SCAN with aggregation functions”.
Expanded mode
Expanded mode provides more control over the scan. It lets you edit the expanded package, so
you can specify transformations that are not possible with template mode. As such, you might
use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:
DML type named temporary_type
initialize function that returns a temporary_type record

scan function that takes two input arguments (an input record and
a temporary_type record) and returns an updated temporary_type record
finalize function that returns an output record
For more information, see “Transform package for SCAN”.

Examples of using SCAN

transforms/scan/scan.mp
Template SCAN with an aggregation function
This example shows how to compute, from input records containing customer_id, dt (date),
and amount, a running total of transactions for each customer in a dataset. The example uses a
template scan function with the sum aggregation function.
Suppose you have the following input records:

customer_id  dt          amount
C002142      1994.03.23   52.20
C002142      1994.06.22   22.25
C003213      1993.02.12   47.95
C003213      1994.11.05  221.24
C003213      1995.12.11   17.42
C004221      1994.08.15   25.25
C008231      1993.10.22  122.00
C008231      1995.12.10   52.10


You want to produce output records with customer_id, dt, and amount_to_date:

customer_id  dt          amount_to_date
C002142      1994.03.23   52.20
C002142      1994.06.22   74.45
C003213      1993.02.12   47.95
C003213      1994.11.05  269.19
C003213      1995.12.11  286.61
C004221      1994.08.15   25.25
C008231      1993.10.22  122.00
C008231      1995.12.10  174.10


To accomplish this task, do one of the following:
Sort the input records on customer_id and dt, and use a Scan component with
the sorted-input parameter set to Input must be sorted or grouped and customer_id as
the key field.
Sort the input records on dt, and use a Scan component with the sorted-input parameter
set to In memory: Input need not be sorted and customer_id as the key field.
Create the transform using the sum aggregation function, as follows:
out :: scan(in) =
begin
  out.customer_id :: in.customer_id;
  out.dt :: in.dt;
  out.amount_to_date :: sum(in.amount);
end;
Expanded SCAN
Continuing the previous example, you want to categorize customers according to their
spending. After their spending exceeds $100, you place them in the “premium” category. The
new output data includes the category for each customer, current for each date on which they
made a purchase.

customer_id  dt          amount_to_date  category
C002142      1994.03.23   52.20          regular
C002142      1994.06.22   74.45          regular
C003213      1993.02.12   47.95          regular
C003213      1994.11.05  269.19          premium
C003213      1995.12.11  286.61          premium
C004221      1994.08.15   25.25          regular
C008231      1993.10.22  122.00          premium
C008231      1995.12.10  174.10          premium
For this example, we can use the finalize function in an expanded transform to add the category
information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the running
total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
  decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
  temp.amount_to_date :: 0;
end;

out :: scan(temp, in) =
begin
  out.amount_to_date :: temp.amount_to_date + in.amount;
end;

out :: finalize(temp, in) =
begin
  out.customer_id :: in.customer_id;
  out.dt :: in.dt;
  out.amount_to_date :: temp.amount_to_date;
  out.category :: if (temp.amount_to_date > 100) "premium"
    else "regular";
end;
A record of temporary_type stores the cumulative data from one record to the next. At
the beginning of each group, the initialize function resets the temporary variable to 0.
(Remember that in this example, the data is grouped by customer_id.) The scan function is
called for each record; it keeps a running total of purchase amounts within the group.
The finalize function creates the output records, assigning a category value to each one.

 SPLIT
Purpose
SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data
Select a subset of fields from the data

Normalize vectors (including nested vectors)


Retrieve multiple, distinct outputs from a single pass through the data

How SPLIT works


SPLIT does not use transform functions. It determines what operations to perform on input data
by using DML that is generated by the split_dml command-line utility. This approach enables you
to perform operations such as normalizing vectors without using expensive DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Organizer

Transform folder

Example of using SPLIT


Say you have a file example1.dml that has both a nested hierarchy of records and three levels
of nested vectors, with the following record format:
record
  string("|") region;
  record
    string("|") state;
    record
      string("|") county;
      record
        string("|") addr_line1;
        string("|") addr_line2;
      end location;
      record
        string("|") atm_id;
        string("|") comment;
      end[decimal(2)] atms;
    end[decimal(2)] counties;
  end[decimal(2)] states;
  string("\n") mgr;
end
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this record.
First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the specified
wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to be
normalized can be used; in this case, the specified field atm_id is used with the ".."
shorthand, because atm_id is unique in the record.
This command generates the following output:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i ..# -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
  string("|") region;
  string("|") state;
  string("|") county;
  string("|") addr_line1;
  string("|") addr_line2;
  string("|") atm_id;
  string("|") comment;
  string("\n") mgr;
  string('\0') DML_assignments() =
    'region=region,state=states.state,county=states.counties.county,
    addr_line1=states.counties.atms.location.addr_line1,
    addr_line2=states.counties.atms.location.addr_line2,
    atm_id=states.counties.atms.atm_id,
    comment=states.counties.atms.comment,mgr=mgr';
end
Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
Suppose that you want to exclude certain fields —  addr_line1, addr_line2, and comment —
from the output. Run split_dml as follows:
split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
  string("|") region;
  string("|") state;
  string("|") county;
  string("|") atm_id;
  string("\n") mgr;
  string('\0') DML_assignments() =
    'region=region,state=states.state,county=states.counties.county,
    atm_id=states.counties.atms.atm_id,
    mgr=mgr';
end
Note that the fields specified by the split_dml -i option appear in the order in which they occur in
the input record, not in the order in which they are listed in the option argument.

Posted 3rd January 2016 by Anonymous


 


View comments

1.

jeganApril 3, 2020 at 11:44 PM

Thank you for taking the time to provide us with your valuable information. We strive to provide our
candidates with excellent care
http://chennaitraining.in/qliksense-training-in-chennai/
http://chennaitraining.in/pentaho-training-in-chennai/
http://chennaitraining.in/machine-learning-training-in-chennai/
http://chennaitraining.in/artificial-intelligence-training-in-chennai/
http://chennaitraining.in/snaplogic-training-in-chennai/
http://chennaitraining.in/snowflake-training-in-chennai/
Reply

AB-INITIO Component

 Classic
 

 Flipcard
 

 Magazine
 

 Mosaic
 

 Sidebar
 

 Snapshot
 

 Timeslide
1.
JAN

AB-INITIO TRANSFORM COMPONENT


 AGGREGATE
Purpose
Aggregate generates records that summarize groups of records.
Deprecated
AGGREGATE is deprecated. Use ROLLUP instead. Rollup gives you more control over
record selection, grouping, and aggregation.
Recommendation
Component folding can enhance the performance of this component . If this feature is
enabled and if the sorted-input parameter is set to In memory: Input need not be sorted, the
Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer

Miscellaneous/Deprecated/Transform folder

 COMBINE
Purpose
combine processes data in a number of useful ways. You can use combine to:
Restore hierarchies of data flattened by the SPLIT component
Create a single output record by joining multiple input streams

Denormalize vectors (including nested vectors)


How COMBINE works
COMBINE does not use transform functions. It determines what operations to perform on input
data by using DML that is generated for COMBINE’s input ports by the split_dml command-line
utility.
COMBINE performs the inverse operations of the SPLIT component. It has a single output port
and a counted number of input ports. COMBINE (optionally) denormalizes each input data
stream, then performs an outer join on the input records to form the output records.
Using COMBINE for joining data
To use COMBINE to denormalize and join input data, you need to sort and specify keys for the
data. If the input to COMBINE is from an output of SPLIT, you can set up SPLIT to automatically
generate keys by running split_dml with the -g option. Otherwise, you can generate keys by
running split_dml with the -k option, supplying the names of key fields. If you specify no keys,
COMBINE uses an implied key, which is equal to a record’s index within the sequence of records
on the input port. In other words, COMBINE merges records synchronously on each port.
When merging these records, COMBINE selects for processing the records that match the
smallest key present on any port. Thus, the input data on each port should be sorted in the order
specified by the keys.
COMBINE can also merge elements of vectors, in the same way it merges top-level records: if
you specify no key, COMBINE merges the elements based on an implied key, which is equal to a
record’s index within the sequence of records on the input port.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

Example of using COMBINE


Say you have a file example2a.dml with the following record format:
record
  string("|") region = "";   //Sort key 1
  string("|") state = "";    //Sort key 2
  string("|") county = "";   //Sort key 3
  string("|") addr_line1 = "";
  string("|") addr_line2 = "";
  string("|") atm_id = "";
  string("|") comment = "";
  string("\n") regional_mgr = "";
end;
And you want to roll up the fields that are marked as sort keys — region, state, and county — into
nested vectors. To do this, you can use a single COMBINE component rather than performing a
series of three rollup actions.
The desired output format (example2b.dml) is:
record
  string("|") region;      //Sort key 1
  record
    string("|") state;     //Sort key 2
    record
      string("|") county;  //Sort Key 3
      record
        record
          string("|") addr_line1;
          string("|") addr_line2;
        end location;
        string("|")atm_id;
        string("|")comment;
      end[int] atms;
    end[int] counties;
  end[int] states;
  string("\n") regional_mgr;
end;
To produce this output format, you need to run split_dml to generate DML for the input port. Your
requirements for the split_dml command are:
You want to include all fields, but you do not care about the subrecord hierarchy, so we specify
"..#" for value of the split_dml -i argument.
The base field for normalization can be any of the fields in the atms record; you choose atm_id.

You need to specify the three keys to use when rolling up the vectors: region, states.state, and
states.counties.county.
The resulting command is:
split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
The generated DML, to be used on COMBINE’s input port, is:
//////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// with the command-line arguments:
// split_dml -i ..# -b ..atm_id -k region,states.state,states.counties.county example2b.dml
//////////////////////////////////////////////////////////////
record
  string("|") region   // Sort key 1
  string("|") state    // Sort key 2
  string("|") county   // Sort key 3
  string("|") addr_line1;
  string("|") addr_line2;
  string("|") atm_id;
  string("|") comment;
  string("\n") regional_mgr;
  string('0')DML_assignments =
    'region=region,state=states.state,county=states.counties.county,
    addr_line1=states.counties.atms.location.addr_line1,
    addr_line2=states.counties.atms.location.addr_line2,
    atm_id=states.counties.atms.atm_id,
    comment=states.counties.atms.comment,
    regional_mgr=regional_mgr';
  string('0')DML_key_specifiers() = 
    '{region}=,{state}=states[],{county}=states.counties[]';
end
Related topics 

 DEDUP SORTED
Purpose
Dedup Sorted separates one specified record in each group of records from the rest of the
records in the group.
Requirement
Dedup Sorted requires grouped input.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

 FILTER BY EXPRESSION
Purpose
Filter by Expression filters records according to a DML expression or transform function, which
specifies the selection criteria.
Filter by Expression is sometimes used to create a subset, or sample, of the data. For example,
you can configure Filter by Expression to select a certain percentage of records, or to select
every third (or fourth, or fifth, and so on) record. Note that if you need a random sample of a
specific size, you should use the sample component.
FILTER BY EXPRESSION supports implicit reformat. For more information, see “Implicit
reformat”.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

 FUSE
Purpose
Fuse combines multiple input flows (perhaps with different record formats) into a single output
flow. It examines one record from each input flow simultaneously, acting on the records
according to the transform function you specify. For example, you can compare records,
selecting one record or another based on some criteria, or “fuse” them into a single record that
contains data from all the input records.
Recommendation
Fuse assumes that the records on the input flows always stay synchronized. However, certain
components placed upstream of Fuse, such as Reformat or Filter by Expression, could reject or
divert some records. In that case, you may not be able to guarantee that the flows stay in sync. A
more reliable option is to add a key field to the data; then use Join to match the records by key.
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.

 JOIN
Purpose
Join reads data from two or more input ports, combines records with matching keys according to
the transform you specify, and sends the transformed records to the output port. Additional ports
allow you to collect rejected and unused records.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
NOTE: When you have units of work (computepoints, checkpoints, or transactions) that are
large and sorted-input is set to Inputs must be sorted, the order of output records within a key
group may differ between the folded and unfolded versions of the output.
Location in the Component Organizer
Transform folder

Types of joins
Reduced to its basics, Join consists of a match key, a transform function, and a mechanism for
deciding when to call the transform function:
The key is used to match records on incoming flows
The transform function combines matched incoming records to produce new outgoing
records

The mechanism for deciding when to call the transform function consists of the settings of
the parameters join-type, record-requiredn, and dedupn.
Inner joins
The most common case is when join-type is Inner Join. In this case, if each input port contains a
record with the same value for the key fields, the transform function is called and an output
record is produced.
If some of the input flows have more than one record with that key value, the transform function
is called multiple times, once for each possible combination of records, taken one from each
input port.
Whenever a particular key value does not have a matching record on every input port and Inner
Join is specified, the transform function is not called and all incoming records with that key value
are sent to the unusedn ports.
Full outer joins
Another common case is when join-type is Full Outer Join: if each input port has a record with a
matching key value, Join does the same thing it does for an inner join.
If some input ports do not have records with matching key values, Join applies the transform
function anyway, with NULL substituted for the missing records. The missing records are in effect
ignored.
With an outer join, the transform function typically requires additional rules (as compared to an
inner join) to handle the possibility of NULL inputs.
About explicit joins
The final case is when join-type is Explicit. This setting allows you to specify True or False for the
record-requiredn parameter for each inn port. The settings you choose determine when Join calls
the transform function. See record-requiredn.

Examples of join types

Complex multiway joins


For the three-way joins shown in the following diagrams, the shaded regions again represent the
key values that must match in order for Join to call the transform function:

In the cases shown above, suppose you want to narrow the join conditions to a subset of the
shaded (required match) area. To do this, use the DML is_defined function in a rule in the
transform itself. This is the same principle demonstrated in the two-way join shown in “Getting a
joined output record”.
For example, suppose you want to produce an output record when a particular key value either is
present in in0, or is present in both in1 and in2. Only Case 2 has enough shaded area to
represent the necessary conditions. However, Case 2 also represents conditions under which
you do not want Join to produce an output record.
To produce output records only under the appropriate conditions:
1.Set join-type to Full Outer Join as in Case 2 above.

2.Put the following rules in Join’s transform function:


out.key :1: if (is_defined(in0)) in0.key;
out.key :2: if (is_defined(in1) &&
   is_defined(in2)) in1.key;
For both rules to fail, the particular key value must be absent from in0 and must be present in
only one of in1 or in2.
Join writes the records that result in both rules failing to the rejectn ports if you connect flows to
them.

 MATCH SORTED
Purpose
Match Sorted combines multiple flows of records with matching keys and performs transform
operations on them.
NOTE: This component is superseded by either Join (for matching keys) or Fuse (for
transforming multiple records). Both provide more flexible processing options than Match Sorted.
Requirement
Match Sorted requires grouped input.
Location in the Component Organizer
Transform folder

Example of using MATCH SORTED


This example shows how repeat and missing key values affect the number of times Match Sorted
calls the transform function.
Suppose three input flows feed Match Sorted. The records in these flows have three-character
alphabetic key values. The key values of the records in the three flows are as follows:

in0 in1 in2


record aaa aaa aaa
1
record bbb bbb ccc
2
record ccc ccc ddd
3
record eee eee eee
4
record eee fff fff
5
record eee —end— —end—
6
Match Sorted calls the transform function eight times for these data records, with the arguments
as follows:
transform( in0-rec1, in1-rec1, in2-rec1 )  — records with key value “aaa”
transform( in0-rec2, in1-rec2, NULL )  — records with key value “bbb”
transform( in0-rec3, in1-rec3, in2-rec2 )  — records with key value “ccc”
transform( NULL,     NULL,     in2-rec3 )  — records with key value “ddd”
transform( in0-rec4, in1-rec4, in2-rec4 )  — records with key value “eee”
transform( in0-rec5, in1-rec4, in2-rec4 )  — records with key value “eee”
transform( in0-rec6, in1-rec4, in2-rec4 )  — records with key value “eee”
transform( NULL,     in1-rec5, in2-rec5 )  — records with key value “fff”
Since there are three eee records in the flow attached to in0, Match Sorted calls the transform
function three times with eee records as inputs. Since the next records on in1 and in2 do not
have key value eee, in1 and in2 repeat their rec4 records.

 MULTI REFORMAT
Purpose
Multi Reformat changes the format of records flowing from 1 to 20 pairs of in and out ports by
dropping fields or by using DML expressions to add fields, combine fields, or transform data in
the records.
We recommend using MULTI REFORMAT in only a few specific situations. Most often, a regular
REFORMAT component is the correct choice. For example:
If you want to reformat data on multiple flows, you should instead use multiple
REFORMAT components. These are faster because they run in parallel.
If you want to filter incoming data, sending it to various output ports while also reformatting
it (by adding, combining, or transforming fields), try using the output-index and count
parameters on the REFORMAT component.
A recommended use for Multi Reformat is to put it immediately before a custom component that
takes multiple inputs. For more information, see “Using MULTI REFORMAT to avoid deadlock”.

Using MULTI REFORMAT to avoid deadlock


Deadlock occurs when a program cannot progress, causing a graph to hang. Custom
components (components that you have built to execute your own programs) are prone to
deadlock because they cannot use the GDE’s automatic flow buffering. If a custom component is
programmed to read from multiple flows in a specific order, it carries the possibility of causing
deadlock.
To avoid deadlock, insert a MULTI REFORMAT component in the graph in front of the custom
component. Using this built-in component to process the input flows applies automatic flow
buffering to them before they reach the custom component, thus avoiding the possibility of
deadlock.

 NORMALIZE
Purpose
Normalize generates multiple output records from each of its input records. You can directly
specify the number of output records for each input record, or you can make the number of
output records dependent on a calculation.
In contrast, to consolidate groups of related records into a single record with a vector field for
each group — the inverse of NORMALIZE — you would use the accumulation function of the
ROLLUP component.
Recommendations
Always clean and validate data before normalizing it. Because Normalize uses a multistage
transform, it follows computation rules that may cause unexpected or incorrect results in the
presence of dirty data (NULLs or invalid values). Furthermore, the results will be hard to
trace, particularly if the reject-threshold parameter is set to Never abort. Several factors —
including the data type, the DML expression used to perform the normalization, and the
value of the sorted-input parameter — may affect where the problems occur. It is safest to
avoid normalizing dirty data.
Component folding can enhance the performance of this component. If this feature is
enabled, the Co>Operating System folds this component by default. See “Component
folding” for more information.

NORMALIZE transform functions


What Normalize does is determined by the functions, types, and variables you define in its
transform parameter.
There are seven built-in functions, as shown in the following table. Of these, only normalize is
required. Examples of most of these functions can be found in “Simple NORMALIZE example
with vectors”.
There is also an optional temporary_type (see “Optional NORMALIZE transform functions and
types”), which you can define if you need to use temporary variables. For an example, see
“NORMALIZE example with a more elaborate transform”.

Transform function Required? Arguments Return value


input_select No input An integer(4) value.
record An output value of 0 means false (the
record was not selected); non-zero
means true (the record was selected).
See “Optional NORMALIZE transform
functions and types”.
initialize No input A record whose type is
record temporary_type.
See “Optional NORMALIZE transform
functions and types”. For examples, see
“NORMALIZE example with a more
elaborate transform”.
length Only if input An integer(4) value.
finished is record Specifies the number of output records
not provided Normalize generates for this input
record. If the length function is
provided, Normalize calls it once for
each input record.
For examples, see “Simple
NORMALIZE example with vectors”
and “NORMALIZE example with a
more elaborate transform”.
finished Only if temporary 0 (meaning false), if more output
(if you have defined length is not record, records are to be generated from the
temporary_type) provided input current input record.
record, Otherwise, a non-zero value (true).
index If the finished function is provided,
NORMALIZE calls it once more than
the number of output records it
produces. On the final call it returns true
and no output record is produced.
finished Only if input 0 (meaning false), if more output
(if you have not length is not record, records are to be generated from the
defined provided index current input record.
temporary_type) Otherwise, a non-zero value (true).
If the finished function is provided,
NORMALIZE calls it once more than
the number of output records it
produces. On the final call it returns true
and no output record is produced.
normalize Yes temporary A record whose type is temporary_type.
(if you have defined record, For examples, see “Simple
temporary_type) input NORMALIZE example with vectors”.
record,
index
normalize Yes input An output record.
(if you have not record,
defined index
temporary_type)
finalize No temporary The output record.
record, See “Optional NORMALIZE transform
input functions and types” and
record “NORMALIZE example with a more
elaborate transform”.
output_select No output An integer(4) value.
record An output value of 0 means false (the
record was not selected); non-zero
means true (the record was selected).
See “Optional NORMALIZE transform
functions and types”.
Input and output names in transforms
In all transform functions, the names of the inputs and outputs are used only locally, so you can
use any names that make sense to you.
Optional NORMALIZE transform functions and types
There are several optional transform functions and an optional type you can use with Normalize:
input_select — The input_select transform function performs selection of input records:
out :: input_select(in) =
begin
  out :: in.n == 1;
end;
The input_select transform function takes a single argument — the input record — and returns a
value of 0 (false) if NORMALIZE is to ignore a record, or non-zero (true) if NORMALIZE is to
accept a record.
initialize — The initialize transform function initializes temporary storage. This transform
function takes a single argument — the input record — and returns a single record with type
temporary_type:
temp :: initialize(in) =
begin
  temp.count :: 0;
  temp.sum :: 0;
end;
length — The length transform function is required when the finished function is not
defined. (You must use at least one of these functions.) This transform function specifies the
number of times the normalize function will be called for the current record. This function
takes the input record as an argument:
out :: length(in) =
begin
  out :: length_of(in.big_vector);
end;
length essentially provides a way to implement a for loop in the record-reading process.
finished — The finished transform function is required when the length function is not
defined. (You must use at least one of these functions.) This transform function returns a
boolean value: as long as it returns 0 (false), NORMALIZE proceeds to call the normalize
function for the current record. When the finished function returns non-zero (true),
NORMALIZE moves to the next input record.
out :: finished(in, index) =
begin
  out :: in.array[index] == "ignore later elements";
end;
The finished function essentially provides a way to implement a while-do loop in the record-
reading process.
NOTE: Although we recommend that you not use both length and finished in the same
component, it is possible to define both. In that case, Normalize loops until either finished returns
true or the limit of length is reached, whichever occurs first.
finalize — The finalize transform function performs the last step in a multistage transform:
out :: finalize(temp, in) =
begin
  out.key :: in.key;
  out.count :: temp.count;
  out.average :: temp.sum / temp.count;
end;
The finalize transform function takes the temporary storage record and the input record as
arguments, and produces a record that has the record format of the out port.
output_select — The output_select transform function performs selection of output records:
out :: output_select(final) =
begin
  out :: final.average > 5;
end;
The output_select transform function takes a single argument — the record produced by
finalization — and returns a value of 0 (false) if NORMALIZE is to ignore a record, or non-zero
(true) if NORMALIZE is to generate an output record.
temporary_type — If you want Normalize to use temporary storage, define this storage as a
record with a type named temporary_type:
type temporary_type =
  record
    int count;
    int sum;
  end;
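
To see how these functions fit together, here is a minimal sketch of the simple form of a NORMALIZE transform, without temporary_type, that flattens a vector into one output record per element. The field names (customer_id, transactions, dt, and amount) are hypothetical:
out :: length(in) =
begin
  out :: length_of(in.transactions);  // one normalize call per vector element
end;

out :: normalize(in, index) =
begin
  out.customer_id :: in.customer_id;             // carried through from the input record
  out.dt          :: in.transactions[index].dt;
  out.amount      :: in.transactions[index].amount;
end;
Each input record yields one output record for every element of its transactions vector.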

 REFORMAT

Purpose
Reformat changes the format of records by dropping fields, or by using DML expressions to add
fields, combine fields, or transform the data in the records.
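For illustration, a Reformat transform might look like the following sketch; the field names are hypothetical, and any input field that is not assigned to an output field is dropped:
out :: reformat(in) =
begin
  out.customer_id :: in.customer_id;                                   // field carried through unchanged
  out.full_name   :: string_concat(in.first_name, " ", in.last_name);  // two input fields combined into one
end;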
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer
Transform folder

 ROLLUP
Purpose
Rollup evaluates a group of input records that have the same key, and then generates records
that either summarize each group or select certain information from each group.
Although it lacks a reformat transform function, ROLLUP supports implicit reformat; see “Implicit
reformat”.
Location in the Component Organizer
Transform folder
Recommendations
For new development, use Rollup rather than AGGREGATE. Rollup provides more control
over record selection, grouping, and aggregation.
The behavior of ROLLUP varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the rollup:

With expanded mode, you can use ROLLUP normally.

With template mode, always clean and validate data before rolling it up. Because the
aggregation functions are not expanded, you may see unexpected or even incorrect results in
the presence of dirty data (NULLs or invalid values). Furthermore, the results will be hard to
trace, particularly if the reject-threshold parameter is set to Never abort. Several factors —
including the data type, the DML expression used to perform the rollup, and the value of the
sorted-input parameter — may affect where the problems occur. It is safest to clean and
validate the data before using template mode with ROLLUP.
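For example, a minimal template-mode rollup that produces one summary record per customer might look like the following sketch; the field names are hypothetical:
out :: rollup(in) =
begin
  out.customer_id  :: in.customer_id;   // the grouping key
  out.total_amount :: sum(in.amount);   // aggregation function, expanded at runtime
  out.purchases    :: count(1);         // one count per record in the group
end;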

 SCAN
Purpose
For every input record, Scan generates an output record that consists of a running cumulative
summary for the group to which the input record belongs, up to and including the current record.
For example, the output records might include successive year-to-date totals for groups of
records.
Although it lacks a reformat transform function, SCAN supports implicit reformat.
Recommendations
If you want one summary record for each group, use ROLLUP.
The behavior of SCAN varies in the presence of dirty data (NULLs or invalid values),
according to which mode you use for the scan:

With expanded mode, you can use SCAN normally.

With template mode, always clean and validate data before scanning it. Because
the aggregation functions are not expanded, you may see unexpected or even incorrect
results in the presence of dirty data (NULLs or invalid values). Furthermore, the results will
be hard to trace, particularly if the reject-threshold parameter is set to Never abort. Several
factors — including the data type, the DML expression used to perform the scan, and the
value of the sorted-input parameter — may affect where the problems occur. It is safest to
clean and validate the data before using template mode with SCAN.
Component folding can enhance the performance of this component. If this feature
is enabled, the Co>Operating System folds this component by default. See “Component
folding” for more information.

Two modes of using SCAN

You can use a SCAN component in two modes, depending on how you define the transform parameter:
Define a transform that uses a template scan function. This is called template mode and is most often used when you want to output aggregations of the data.
Create a transform using an expanded SCAN package. This is called expanded mode and allows for scans that do not necessarily use regular aggregation functions.
Template mode
Template mode is the simplest way to use SCAN. In the transform parameter, you specify an
aggregation function that describes how the cumulative summary should be computed. At
runtime, the Co>Operating System expands this template function into the multiple functions that
are required to execute the actual scan.
For example, suppose you have an input record for each purchase by each customer. You could
use the sum aggregation function to calculate the running total of spending for each customer
after each purchase.
For more information, see “Using SCAN with aggregation functions”.
Expanded mode
Expanded mode provides more control over the scan. It lets you edit the expanded package, so
you can specify transformations that are not possible with template mode. As such, you might
use it when you need a result that an aggregation function cannot produce.
With an expanded SCAN package, you must define the following items:
DML type named temporary_type
initialize function that returns a temporary_type record
scan function that takes two input arguments (an input record and a temporary_type record) and returns an updated temporary_type record
finalize function that returns an output record
For more information, see “Transform package for SCAN”.

Examples of using SCAN

transforms/scan/scan.mp
Template SCAN with an aggregation function
This example shows how to compute, from input records containing customer_id, dt (date),
and amount, a running total of transactions for each customer in a dataset. The example uses a
template scan function with the sum aggregation function.
Suppose you have the following input records:

customer_id   dt           amount
C002142       1994.03.23    52.20
C002142       1994.06.22    22.25
C003213       1993.02.12    47.95
C003213       1994.11.05   221.24
C003213       1995.12.11    17.42
C004221       1994.08.15    25.25
C008231       1993.10.22   122.00
C008231       1995.12.10    52.10


You want to produce output records with customer_id, dt, and amount_to_date:

customer_id   dt           amount_to_date
C002142       1994.03.23    52.20
C002142       1994.06.22    74.45
C003213       1993.02.12    47.95
C003213       1994.11.05   269.19
C003213       1995.12.11   286.61
C004221       1994.08.15    25.25
C008231       1993.10.22   122.00
C008231       1995.12.10   174.10


To accomplish this task, do one of the following:
Sort the input records on customer_id and dt, and use a Scan component with
the sorted-input parameter set to Input must be sorted or grouped and customer_id as
the key field.
Sort the input records on dt, and use a Scan component with the sorted-input parameter
set to In memory: Input need not be sorted and customer_id as the key field.
Create the transform using the sum aggregation function, as follows:
out :: scan(in) =
begin
  out.customer_id :: in.customer_id;
  out.dt :: in.dt;
  out.amount_to_date :: sum(in.amount);
end;
Expanded SCAN
Continuing the previous example, you want to categorize customers according to their
spending. After their spending exceeds $100, you place them in the “premium” category. The
new output data includes the category for each customer, current as of each date on which they
made a purchase.
customer_id   dt           amount_to_date   category
C002142       1994.03.23    52.20           regular
C002142       1994.06.22    74.45           regular
C003213       1993.02.12    47.95           regular
C003213       1994.11.05   269.19           premium
C003213       1995.12.11   286.61           premium
C004221       1994.08.15    25.25           regular
C008231       1993.10.22   122.00           premium
C008231       1995.12.10   174.10           premium
For this example, we can use the finalize function in an expanded transform to add the category
information. Because we have expanded the transform, we can no longer use
the sum aggregation function to calculate the amount_to_date. Instead, we store the running
total in a temporary variable and use the scan function to update it for each record.
Here is the transform:
type temporary_type =
record
  decimal(8.2) amount_to_date = 0;
end;

temp :: initialize(in) =
begin
  temp.amount_to_date :: 0;
end;

out :: scan(temp, in) =
begin
  out.amount_to_date :: temp.amount_to_date + in.amount;
end;

out :: finalize(temp, in) =
begin
  out.customer_id :: in.customer_id;
  out.dt :: in.dt;
  out.amount_to_date :: temp.amount_to_date;
  out.category :: if (temp.amount_to_date > 100) "premium"
    else "regular";
end;
A record of type temporary_type stores the cumulative data from one record to the next. At
the beginning of each group, the initialize function resets the temporary variable to 0.
(Remember that in this example, the data is grouped by customer_id.) The scan function is
called for each record; it keeps a running total of purchase amounts within the group.
The finalize function creates the output records, assigning a category value to each one.
 SPLIT
Purpose
SPLIT processes data in a number of useful ways. You can use SPLIT to:
Flatten hierarchical data
Select a subset of fields from the data
Normalize vectors (including nested vectors)
Retrieve multiple, distinct outputs from a single pass through the data

How SPLIT works


SPLIT does not use transform functions. It determines what operations to perform on input data
by using DML that is generated by the split_dml command-line utility. This approach enables you
to perform operations such as normalizing vectors without using expensive DML loop operations.
SPLIT has a single input port and a counted number of output ports. You use split_dml to
generate DML for each output port. You can have different field selection and base fields for
vector normalization on each port; however, you can specify only one base field for vector
normalization per port.
Although it lacks a reformat transform function, SPLIT supports implicit reformat.
Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
Location in the Component Organizer

Transform folder

Example of using SPLIT


Say you have a file example1.dml that has both a nested hierarchy of records and three levels
of nested vectors, with the following record format:
record
  string("|") region;
  record
    string("|") state;
    record
      string("|") county;
      record
        string("|") addr_line1;
        string("|") addr_line2;
      end location;
      record
        string("|") atm_id;
        string("|") comment;
      end[decimal(2)] atms;
    end[decimal(2)] counties;
  end[decimal(2)] states;
  string("\n") mgr;
end;
In this example, SPLIT is used to remove the hierarchy and normalize the vectors in this record.
First, the desired output DML is generated using the split_dml utility:
split_dml -i ..# -b ..atm_id example1.dml
where:
The -i argument indicates fields to be included in the output DML. In this case, the specified
wildcard "..#" selects all leaf fields anywhere within the record.
The -b argument specifies a base field for normalization. Any field in the vector to be
normalized can be used; in this case, the specified field atm_id is used with the ".."
shorthand, because atm_id is unique in the record.
This command generates the following output:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i ..# -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
  string("|") region;
  string("|") state;
  string("|") county;
  string("|") addr_line1;
  string("|") addr_line2;
  string("|") atm_id;
  string("|") comment;
  string("\n") mgr;
  string('\0') DML_assignments() =
    'region=region,state=states.state,county=states.counties.county,
    addr_line1=states.counties.atms.location.addr_line1,
    addr_line2=states.counties.atms.location.addr_line2,
    atm_id=states.counties.atms.atm_id,
    comment=states.counties.atms.comment,mgr=mgr';
end
Note the flattened record, and the generated DML_assignments method that controls how
SPLIT fills the output record from the input data.
Suppose that you want to exclude certain fields — addr_line1, addr_line2, and comment —
from the output. Run split_dml as follows:
split_dml -i region,states.state,states.counties.county,..atm_id,..mgr -b ..atm_id example1.dml
The generated output is:
/////////////////////////////////////////////////////////////////////
// This file was automatically generated by split_dml
// With the command line arguments:
// split_dml -i region,states.state,states.counties.county,..atm_id,
// ..mgr -b ..atm_id example1.dml
/////////////////////////////////////////////////////////////////////
record
  string("|") region;
  string("|") state;
  string("|") county;
  string("|") atm_id;
  string("\n") mgr;
  string('\0') DML_assignments() =
    'region=region,state=states.state,county=states.counties.county,
    atm_id=states.counties.atms.atm_id,
    mgr=mgr';
end
Note that the fields specified by the split_dml -i option appear in the order in which they occur in
the input record, not in the order in which they are listed in the option argument.

AB-INITIO PARTITION COMPONENT

PARTITION BY EXPRESSION

Purpose
Partition by Expression distributes records to its output flow partitions according to a specified
DML expression or transform function.
The output port for Partition by Expression is ordered. See “Ordered ports”. Although you can
use fan-out flows on the out port, we do not recommend connecting multiple fan-out flows. You
may connect a single fan-out flow; or, preferably, limit yourself to straight flows on the out port.
Partition by Expression supports implicit reformat. See “Implicit reformat”.
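As a sketch, suppose the component has four output partitions and that records carry a numeric customer_id field; both are assumptions for illustration, not part of the component. A transform function along the following lines sends each record to the partition given by the remainder:
out :: partition(in) =
begin
  out :: in.customer_id % 4;  // hypothetical key; the result selects partition 0 through 3
end;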

Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
The component does not fold when connected to a flow that is set to use two-stage routing.

Location in the Component Organizer


Partitioning folder

PARTITION BY KEY
Purpose
Partition by Key distributes records to its output flow partitions according to key values.
How Partition by Key interprets key values depends on the internal representation of the key. For
example, the number 4 in a field of type integer(2) is not considered identical to the number 4 in
a field of type decimal(4).
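To illustrate, the following two hypothetical key definitions store the value 4 with different internal representations, so Partition by Key may route records carrying the same nominal value to different partitions:
record
  integer(2) id;  // 4 stored as a 2-byte binary integer
end;

record
  decimal(4) id;  // 4 stored as a 4-character decimal string
end;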

Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
The component does not fold when connected to a flow that is set to use two-stage routing.

Location in the Component Organizer


Partitioning folder

PARTITION BY KEY AND SORT


Purpose
Partition by Key and Sort repartitions records by key values and then sorts the records within
each partition. The number of input and output partitions can be different.
How Partition by Key and Sort interprets key values depends on the internal representation of the
key. For example, the number 4 is likely to be partitioned differently depending on whether it is in
a field of type integer(2) or decimal(4).
Partition by Key and Sort is a subgraph that contains two components: Partition by Key followed by Sort.
Location in the Component Organizer
Sort folder

PARTITION BY PERCENTAGE
Purpose
Partition by Percentage distributes a specified percentage of the total number of input records to
each output flow.

Location in the Component Organizer


Partitioning folder

PARTITION BY RANGE
Purpose
Partition by Range distributes records to its output flow partitions according to the ranges of key
values specified for each partition. Partition by Range distributes the records relatively equally
among the partitions.
Use Partition by Range when you want to divide data into useful, approximately equal, groups.
Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted,
the output is unsorted.
The records with the key values that come first in the key order go to partition 0, the records with
the key values that come next in the order go to partition 1, and so on. The records with the key
values that come last in the key order go to the partition with the highest number.

Location in the Component Organizer


Partitioning folder
PARTITION BY ROUND-ROBIN
 Purpose
Partition by Round-robin distributes blocks of records evenly to each output flow in round-robin
fashion.
For information on undoing the effects of Partition by Round-robin, see INTERLEAVE.
The output port for Partition by Round-robin is ordered. See “Ordered ports”.

Recommendation
Component folding can enhance the performance of this component. If this feature is enabled,
the Co>Operating System folds this component by default. See “Component folding” for more
information.
The component does not fold when connected to a flow that is set to use two-stage routing.

Location in the Component Organizer


Partitioning folder