
Introduction to Ab Initio

Presenter's Name
Role Month, Year

Australia | Canada | France | India | New Zealand | Singapore | Switzerland | United Arab Emirates | United Kingdom | United States 2010 Keane. This document is the property of Keane. No part of this document shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, to parties outside of your organization without prior written permission from Keane.

Agenda

 DWH Concept
 What is ETL
 Introduction to Ab Initio


DWH Concept

DWH Definition & Process Overview


What is a Data Warehouse?


A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making.


 Data warehouses store large volumes of data that are frequently used by Decision Support Systems.
 A data warehouse is maintained separately from the organization's operational databases.
 Data warehouses are relatively static, with only infrequent updates.
 A data warehouse is a stand-alone repository of information, integrated from several, possibly heterogeneous, operational databases.


Data Warehousing Process Overview


1. Extract from source systems
2. Transform to the required data (staging area)
3. Transfer to the data warehouse
4. Produce reports from the data warehouse

[Diagram: Source Systems → (Extract) → Staging Area (Transform) → (Transfer) → Data Warehouse]


Various Data Warehouse Models


 Enterprise warehouse: collects all of the information about subjects spanning the entire organization.
 Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.


What is ETL?
Extract, Transform, and Load (ETL) is a process that involves extracting data from outside sources, transforming it to fit business needs (which can include quality levels), and ultimately loading it into the end target, i.e., the data warehouse.

Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as IMS, or other data structures such as VSAM or ISAM. Extraction converts the data into a format suitable for transformation processing.

What is ETL?
Transform
The transform stage applies a series of rules or functions to the extracted data to derive the data to be loaded into the end target. Some data sources will require very little or even no manipulation. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:
 Selecting only certain columns to load (or selecting null columns not to load)
 Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
 Encoding free-form values (e.g., mapping "Male" to "1" and "Mr." to "M")
 Deriving a new calculated value (e.g., sale amount = qty * unit price)
 Joining together data from multiple sources (e.g., lookup, merge)
 Summarizing multiple rows of data (e.g., total sales for each store and for each region)
 Generating surrogate key values
 Transposing or pivoting (turning multiple columns into multiple rows, or vice versa)
 Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in separate columns)
 Applying any form of simple or complex data validation; on failure, the data may be fully, partially, or not at all rejected, so that none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling.
Many of the above transformations can themselves raise an exception, e.g., when a code translation encounters an unknown code in the extracted data.
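As an illustration only (plain Python, not Ab Initio DML), a few of these transformation types can be sketched as follows; the field names and code mappings are hypothetical:

```python
# Hypothetical record transform illustrating three of the listed types:
# translating coded values, deriving a calculated value, and generating
# a surrogate key. Field names are invented for the example.
def transform(record, next_key):
    out = {}
    # Translate coded values: source stores 1/2, warehouse stores M/F.
    out["gender"] = {1: "M", 2: "F"}[record["gender_code"]]
    # Derive a new calculated value: sale amount = qty * unit price.
    out["sale_amount"] = record["qty"] * record["unit_price"]
    # Generate a surrogate key value.
    out["sale_sk"] = next_key
    return out

row = {"gender_code": 2, "qty": 3, "unit_price": 9.5}
print(transform(row, next_key=1001))
# → {'gender': 'F', 'sale_amount': 28.5, 'sale_sk': 1001}
```

In a real ETL tool these rules would be expressed declaratively (in Ab Initio, as a DML transform function) rather than as hand-written code.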

What is ETL?
Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses might overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g., hourly. The timing and scope of replacing or appending are strategic design choices that depend on the time available and the business needs. Some systems maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema, as well as triggers activated upon data load, apply (e.g., uniqueness, referential integrity, mandatory fields); these also contribute to the overall data quality performance of the ETL process.
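The overwrite-versus-historize choice can be sketched in Python (a minimal sketch: a "table" is just a list of dicts, and the key and timestamp fields are hypothetical):

```python
# Two simplified load strategies for the load phase.
def load_overwrite(table, rows, key):
    # Overwrite: a row with an existing key replaces the stored row.
    by_key = {r[key]: r for r in table}
    for r in rows:
        by_key[r[key]] = r
    return list(by_key.values())

def load_append_history(table, rows, load_ts):
    # Historized: never overwrite; append each row stamped with load time.
    return table + [dict(r, load_ts=load_ts) for r in rows]

warehouse = [{"id": 1, "balance": 10}]
print(load_overwrite(warehouse, [{"id": 1, "balance": 25}], key="id"))
# → [{'id': 1, 'balance': 25}]
```

A real load would be a bulk database operation subject to the schema constraints described above; the sketch only shows the two update policies.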


Introduction to Ab Initio
1. What is Ab Initio
2. Ab Initio Platforms
3. Architecture of Ab Initio
4. Run Process
5. Components
6. Parallelism
7. Sandbox and Projects
8. Basic Graph Development
9. Multifile
10. Performance Tuning


1. What is Ab Initio
 A data processing tool from Ab Initio Software Corporation (http://www.abinitio.com).
 "Ab initio" is Latin for "from the beginning".
 Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, clickstream processing, data movement, data transformation, and analytics.
 Designed to support the largest and most complex business applications.
 A proven, best-of-breed ETL solution.
 Applications of Ab Initio:
  ETL for data warehouses, data marts, and operational data sources.
  Parallel data cleansing and validation.
  Parallel data transformation and filtering.
  High-performance analytics.
  Real-time, parallel data capture.


2. Ab Initio Platforms


The Ab Initio product comes with three suites:
 Graphical Development Environment (GDE)
 Co>Operating System
 Enterprise Meta>Environment (EME)


Graphical Development Environment (GDE)


 The GDE lets developers create applications by dragging and dropping components onto a canvas, configuring them with familiar, intuitive point-and-click operations, and connecting them into executable flowcharts.


Co>Operating System
 The Co>Operating System is the core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe reliability.
 The Co>Operating System is layered on top of the native operating systems of a collection of computers. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.


Connection Between GDE and Co>Op


[Diagram: the Graphical Development Environment (GDE) connects to the Co>Operating System via FTP, TELNET, REXEC, RSH, or DCOM.]

In a typical installation, the Co>Operating System is installed on a Unix or Windows NT server, while the GDE is installed on a Pentium PC.


Connecting to Co>op Server from GDE


EME (Enterprise Meta>Environment) Data store


The EME is a system storage area in which every saved version of the files you work on is permanently preserved. In short, the EME is a version-controlled storage area.

[Diagram: several GDE clients check files out of the EME data store; the EME locks checked-out files against conflicting edits.]

Ab Initio runs on many operating systems:
 Compaq Tru64 UNIX
 Digital UNIX
 Hewlett-Packard HP-UX
 IBM AIX
 NCR MP-RAS
 Red Hat Linux
 IBM/Sequent DYNIX/ptx
 Siemens Pyramid Reliant UNIX
 Silicon Graphics IRIX
 Sun Solaris
 Windows NT and Windows 2000


3. Architecture of Ab Initio
[Diagram: layered architecture, top to bottom]
 Applications
 Application development environments: Graphical (GDE), C++, Shell (.ksh)
 Ab Initio Metadata Repository (EME)
 Component Library, user-defined components, third-party components
 Ab Initio Co>Operating System
 Native operating system (UNIX, Windows NT)


Architecture of Ab Initio
[Diagram: Host Machine 1 and Host Machine 2 each run the Co>Operating System and user programs on top of the native operating system (Unix, Windows NT), along with Ab Initio's built-in component programs (partitions, transforms, etc.).]

Unix Shell Script or NT Batch File
 Supplies parameter values to underlying programs through arguments and environment variables
 Controls the flow of data through pipes
 Usually generated using the GDE

GDE
 Ability to graphically design batch programs comprising Ab Initio components, connected by pipes
 Ability to test-run the graphical design and monitor its progress
 Ability to generate a shell script or batch file from the graphical design


4. Run Process
What happens when you push the Run button?
 Your graph is translated into a script that can be executed in the Shell Development Environment.
 This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
 The script is invoked (via REXEC or TELNET) on the server.
 The script creates and runs a job that may run across many nodes.
 Monitoring information is sent back to the GDE client.


Run Process
 The following sample graph (screenshot) illustrates what happens when you press the Run button at the top right of the screen.


Anatomy of Running Job

 Host Process Creation
 Pushing the Run button generates a script.
 The script is transmitted to the Host node.
 The script is invoked, creating the Host process.

[Diagram: GDE on the client machine; Host process on the host node; processing nodes alongside.]

Anatomy of Running Job

 Agent Process Creation
 The Host process spawns Agent processes.


Anatomy of Running Job

 Component Process Creation
 Agent processes create Component processes on each processing node.


Anatomy of Running Job

 Component Execution
 Component processes do their jobs.
 Component processes communicate directly with datasets, and with each other, to move data around.


Anatomy of Running Job

 Successful Component Termination
 As each Component process finishes with its data, it exits with success status.


Anatomy of Running Job

 Agent Termination
 When all of an Agent's Component processes exit, the Agent informs the Host process that those components are finished.
 The Agent process then exits.

Anatomy of Running Job

 Host Termination
 When all Agents have exited, the Host process informs the GDE that the job is complete.
 The Host process then exits.


5. Components Overview
There are mainly two sets of components in Ab Initio:
 Dataset components: components that hold data
 Program components: components that process data


Dataset Components
Input File:
INPUT FILE represents records read as input to a graph from one or more serial files or from a multifile.

Input Table:
INPUT TABLE unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.


Dataset Components
Output File:
OUTPUT FILE represents records written as output from a graph into one or more serial files or a multifile. When the target of an OUTPUT FILE component is a particular file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and recreates that file, nor does it ever truncate it.

Output Table:
OUTPUT TABLE loads records from a graph into a database, letting you specify the destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.


Program Components
Sort:
SORT sorts and merges records. You can use SORT to order records before sending them to a component that requires grouped or sorted records.

Parameters:
 key (key specifier, required): Name(s) of the key field(s) and the sequence specifier(s) you want the component to use when it orders records.
 max-core (integer, required): Maximum memory usage in bytes. Default is 100663296 (approximately 100 MB). When the component reaches the number of bytes specified in max-core, it sorts the records it has read and writes a temporary file to disk.
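The max-core behaviour can be approximated in Python: sort fixed-size chunks, "spill" each sorted run, then merge the runs. This is a sketch of the idea only; runs are kept as in-memory lists rather than temporary files, and a record-count limit stands in for the byte limit.

```python
import heapq

# External-sort sketch: when the in-memory chunk reaches the limit
# (standing in for max-core bytes), sort it and spill it as a run,
# then merge all the sorted runs at the end.
def external_sort(records, key, max_records):
    runs, chunk = [], []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_records:            # "max-core" reached
            runs.append(sorted(chunk, key=key))  # spill a sorted run
            chunk = []
    if chunk:
        runs.append(sorted(chunk, key=key))
    return list(heapq.merge(*runs, key=key))

print(external_sort([5, 1, 4, 2, 3], key=lambda r: r, max_records=2))
# → [1, 2, 3, 4, 5]
```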

Reformat:
REFORMAT changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

1. Reads a record from the input port.
2. Passes the record as an argument to the transform function (xfr).
3. Writes the record to the out port if the function returns a success status.
4. Writes the record to the reject port if the function returns a failure status.
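These steps can be sketched in Python (illustrative only; the transform function and its failure condition are hypothetical, not Ab Initio DML):

```python
# REFORMAT sketch: run each record through a transform function (xfr);
# successes go to the out port, failures to the reject port.
def xfr(rec):
    # Hypothetical rule: fail on records with no amount; otherwise
    # drop unused fields and add a derived one.
    if rec.get("amount") is None:
        return None                      # failure status
    return {"id": rec["id"], "amount_cents": int(rec["amount"] * 100)}

def reformat(records):
    out, reject = [], []
    for rec in records:
        result = xfr(rec)
        if result is not None:
            out.append(result)           # success → out port
        else:
            reject.append(rec)           # failure → reject port
    return out, reject

print(reformat([{"id": 1, "amount": 2.5}, {"id": 2}]))
# → ([{'id': 1, 'amount_cents': 250}], [{'id': 2}])
```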

Parameters of Reformat Component


 count
 transform (xfr) function
 reject-threshold: Abort | Never Abort | Use Limit & Ramp
 limit
 ramp


Join:
1. Reads records from multiple input ports.
2. Operates on records with matching keys using a multi-input transform function.
3. Writes the result to the output port.

PORTS
 in
 out
 unused
 reject (optional)
 error (optional)
 log (optional)

PARAMETERS
 count
 key
 override key
 transform
 limit
 ramp

Join Types: Inner, Outer, Explicit

Join Methods:
 Merge Join: using sorted inputs
 Hash Join: using in-memory hash tables to group input


Filter by Expression:
FILTER BY EXPRESSION filters records according to a DML expression.
1. Reads data records from the in port.
2. Applies the expression in the select_expr parameter to each record. If the expression returns:
 A non-0 value: FILTER BY EXPRESSION writes the record to the out port.
 0: FILTER BY EXPRESSION writes the record to the deselect port. If you do not connect a flow to the deselect port, FILTER BY EXPRESSION discards the records.
 NULL: FILTER BY EXPRESSION writes the record to the reject port and a descriptive error message to the error port.
FILTER BY EXPRESSION stops execution of the graph when the number of reject events exceeds the result of the following formula: limit + (ramp * number_of_records_processed_so_far)
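The three-way routing (non-0 to out, 0 to deselect, NULL to reject) can be sketched in Python, with None standing in for DML NULL:

```python
# FILTER BY EXPRESSION sketch: route each record by the value that the
# select expression returns for it.
def filter_by_expression(records, select_expr):
    out, deselect, reject = [], [], []
    for rec in records:
        value = select_expr(rec)
        if value is None:        # NULL → reject port
            reject.append(rec)
        elif value:              # non-0 → out port
            out.append(rec)
        else:                    # 0 → deselect port
            deselect.append(rec)
    return out, deselect, reject

recs = [{"v": 1}, {"v": 0}, {"v": None}]
print(filter_by_expression(recs, lambda r: r["v"]))
# → ([{'v': 1}], [{'v': 0}], [{'v': None}])
```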


Normalize:
NORMALIZE generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.

1. Reads the input record. If you have not defined input_select, NORMALIZE processes all records; if you have defined input_select, the input records are filtered by it.
2. Performs temporary initialization.
3. Performs iterations of the normalize transform function for each input record.
4. Sends the output records to the out port.
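The one-record-in, many-records-out behaviour can be sketched in Python (a hypothetical order record whose vector of items is flattened into one output record per item):

```python
# NORMALIZE sketch: each input record produces one output record per
# element of its vector field.
def normalize(records):
    out = []
    for rec in records:
        for i, item in enumerate(rec["items"]):
            out.append({"order_id": rec["order_id"], "index": i, "item": item})
    return out

orders = [{"order_id": 1, "items": ["pen", "ink"]}]
print(normalize(orders))
# → [{'order_id': 1, 'index': 0, 'item': 'pen'},
#    {'order_id': 1, 'index': 1, 'item': 'ink'}]
```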


Before Normalization


After Normalization


DENORMALIZE SORTED:
DENORMALIZE SORTED consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group. DENORMALIZE SORTED requires grouped input. For example, if you have a record for each person that includes the households to which that person belongs, DENORMALIZE SORTED can consolidate those records into a record for each household that contains a variable number of people.
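The household example can be sketched in Python; like DENORMALIZE SORTED, itertools.groupby requires the input to be grouped by key. The record fields are hypothetical.

```python
from itertools import groupby

# DENORMALIZE SORTED sketch: consolidate each key group into one output
# record with a vector field, plus a computed summary field (count).
def denormalize_sorted(records, key):
    out = []
    for k, group in groupby(records, key=key):
        people = [r["name"] for r in group]
        out.append({"household": k, "people": people, "count": len(people)})
    return out

rows = [{"household": "H1", "name": "Ann"},
        {"household": "H1", "name": "Bob"},
        {"household": "H2", "name": "Cid"}]
print(denormalize_sorted(rows, key=lambda r: r["household"]))
# → [{'household': 'H1', 'people': ['Ann', 'Bob'], 'count': 2},
#    {'household': 'H2', 'people': ['Cid'], 'count': 1}]
```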


Before Denormalize


After Denormalize:


Multistage components
 Data transformation in multiple stages, following several sets of rules
 Each set of rules forms one transform function
 Information is passed across stages by temporary variables
 Stages include initialization, iteration, finalization, and more
 A few multistage components: AGGREGATE, ROLLUP, SCAN


Rollup:
ROLLUP evaluates a group of input records that have the same key, and then generates records that either summarize each group or select certain information from each group.

Aggregate:
AGGREGATE generates records that summarize groups of records. In general, use ROLLUP for new development rather than AGGREGATE. ROLLUP gives you more control over record selection, grouping, and aggregation. However, use AGGREGATE when you want to return the single record that has a field containing either the maximum or the minimum value of all the records in the group.

Scan:
For every input record, SCAN generates an output record that includes a running cumulative summary for the group the input record belongs to. For example, the output records might include successive year-to-date totals for groups of records. You can use SCAN in continuous graphs.
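The difference between ROLLUP and SCAN can be sketched in Python (illustrative only; both sketches assume the records are already grouped by key, and the field names are invented):

```python
from itertools import groupby

# ROLLUP sketch: one summary record per key group.
def rollup(records, key, field):
    return [{"key": k, "total": sum(r[field] for r in g)}
            for k, g in groupby(records, key=key)]

# SCAN sketch: one output record per input record, carrying a running
# cumulative summary that resets when the key changes.
def scan(records, key, field):
    out, running, prev = [], 0, object()
    for r in records:
        k = key(r)
        if k != prev:
            running, prev = 0, k
        running += r[field]
        out.append({"key": k, "running_total": running})
    return out

sales = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 3}]
print(rollup(sales, key=lambda r: r["k"], field="v"))
# → [{'key': 'a', 'total': 3}, {'key': 'b', 'total': 3}]
```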

Partition components:
Data can be partitioned using:
 Partition by Round-robin
 Broadcast
 Partition by Key
 Partition by Expression
 Partition by Range
 Partition by Percentage
 Partition with Load Balance


Partition by Round-robin
PARTITION BY ROUND-ROBIN distributes blocks of records evenly to each output flow in round-robin fashion. Suppose you attach four flows to the PARTITION BY ROUND-ROBIN output port, as shown in the following figure: PARTITION BY ROUND-ROBIN writes to Load-1, then Load-2, then Load-3, then Load-4, then back to Load-1 again.


Broadcast
BROADCAST arbitrarily combines all records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.

Partition by Key
PARTITION BY KEY distributes records to its output flow partitions according to key values. PARTITION BY KEY does the following:
1. Reads records in arbitrary order from the in port.
2. Distributes records to the flows connected to the out port, according to the key parameter, writing records with the same key value to the same output flow.
PARTITION BY KEY is typically followed by SORT.
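These two partitioners can be sketched in Python (flows are modelled as plain lists; hashing the key modulo the flow count is one illustrative way to send equal keys to the same flow, not necessarily how the product does it):

```python
# PARTITION BY ROUND-ROBIN sketch: deal records evenly across n flows.
def partition_round_robin(records, n):
    flows = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        flows[i % n].append(rec)
    return flows

# PARTITION BY KEY sketch: records with the same key value always go
# to the same output flow.
def partition_by_key(records, n, key):
    flows = [[] for _ in range(n)]
    for rec in records:
        flows[hash(key(rec)) % n].append(rec)
    return flows

print(partition_round_robin([1, 2, 3, 4, 5], n=2))
# → [[1, 3, 5], [2, 4]]
```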


Partition by Expression
PARTITION BY EXPRESSION distributes records to its output flow partitions according to a specified DML expression.

Partition by Range
PARTITION BY RANGE distributes records to its output flow partitions according to the ranges of key values specified for each partition. PARTITION BY RANGE distributes the records relatively equally among the partitions. Use PARTITION BY RANGE when you want to divide data into useful, approximately equal, groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted, the output is unsorted. The records with the key values that come first in the key order go to partition 0, the records with the key values that come next in the order go to partition 1, and so on. The records with the key values that come last in the key order go to the partition with the highest number.
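The splitter-based routing can be sketched as follows (plain Python; how the real component obtains its splitter values, e.g. from a sample of the data, is outside this sketch):

```python
import bisect

def partition_by_range(records, key, splitters):
    """Route records into len(splitters) + 1 partitions by key ranges.

    Keys up to and including splitters[0] go to partition 0, the next
    range to partition 1, and the highest keys to the last partition.
    """
    partitions = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        # bisect_left finds which range the key falls into.
        idx = bisect.bisect_left(splitters, key(record))
        partitions[idx].append(record)
    return partitions

# Splitters 10 and 20 define three ranges: <= 10, 11..20, > 20.
flows = partition_by_range([5, 12, 25, 9, 20], key=lambda r: r, splitters=[10, 20])
# flows == [[5, 9], [12, 20], [25]]
```

Note how the lowest keys land in partition 0 and the highest in the last partition, which is what makes range partitioning useful for global ordering.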

Partition by Percentage
PARTITION BY PERCENTAGE distributes a specified percentage of the total number of input records to each output flow.

Partition by Load Balance
PARTITION WITH LOAD BALANCE distributes records to its output flow partitions by writing more records to the flow partitions that consume records faster. The output port for PARTITION WITH LOAD BALANCE is ordered.


Summary of Partitioning Methods

Method          Key-based?   Balancing                      Uses
Round-robin     No           Good                           Record-independent parallelism
Hash function   Yes          Depends on data and function   Key-dependent parallelism
Expression      Yes          Good                           Application specific
Range           Yes          Depends on splitters           Key-dependent parallelism, global ordering
Load level      No           Depends on load                Record-independent parallelism


De-partition components:
Data can be de-partitioned using:
Gather
Concatenate
Merge
Interleave


Gather
Reads data records from the flows connected to the input port, combines the records arbitrarily, and writes them to the output port.

Concatenate
CONCATENATE appends multiple flow partitions of data records one after another.

Merge
MERGE combines data records from multiple flow partitions that have been sorted on a key, and maintains the sort order.

Interleave
INTERLEAVE combines blocks of records from multiple flow partitions in round-robin fashion. You can use INTERLEAVE to undo the effects of PARTITION BY ROUND-ROBIN.
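The difference between CONCATENATE (flow order preserved) and MERGE (key order preserved) can be sketched in Python (illustrative stand-ins, not Ab Initio code):

```python
import heapq

def concatenate_flows(*flows):
    """Append flow partitions one after another (the CONCATENATE idea)."""
    return [record for flow in flows for record in flow]

def merge_sorted_flows(*flows):
    """K-way merge of flows already sorted on the key (the MERGE idea):
    the combined output stays in sort order."""
    return list(heapq.merge(*flows))

flow_a, flow_b = [1, 4, 7], [2, 3, 9]
concatenate_flows(flow_a, flow_b)   # -> [1, 4, 7, 2, 3, 9]
merge_sorted_flows(flow_a, flow_b)  # -> [1, 2, 3, 4, 7, 9]
```

heapq.merge only ever holds one record per input flow in memory, which mirrors why MERGE can de-partition sorted data cheaply without a full re-sort.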

6. Parallelism
Parallel Runtime Environment: an environment where some or all of the components of an application (datasets and processing modules) are replicated into a number of partitions, each spawning a process. Ab Initio can process data in a parallel runtime environment.
Forms of Parallelism:
Component Parallelism
Pipeline Parallelism
Data Parallelism


Parallelism

Data Parallelism
Data is processed at different servers at the same time.

Pipeline Parallelism
Pipeline parallelism occurs when several connected program components on the same branch of a graph execute simultaneously.

Component Parallelism
Two or more components are working in parallel.


Data Parallelism
Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.


Two Ways of Looking at Data Parallelism


Expanded View:

Global View:


Pipeline Parallelism
The following graph divides a list of customers into two groups, GOOD CUSTOMERS and OTHER CUSTOMERS. The SCORE component assigns a score to each customer in the CUSTOMERS dataset; the SELECT component then directs each customer to the proper group based on that score.


Component Parallelism
The following graph takes the CUSTOMERS and TRANSACTIONS datasets, sorts them, then merges them into a dataset named MERGED INFORMATION. Because the SORT CUSTOMERS and SORT TRANSACTIONS components are on different branches of the graph, they execute at the same time, creating component parallelism.


7. Sandbox and Project


What is a Sandbox? A sandbox is a collection of directories (bin, dml, mp, run, etc.) that contains the metadata for a project: graphs and their associated files.
Why create a Sandbox? It helps in managing the directory structure where this metadata is stored, and also helps with version control, migration, and navigation. The sandbox provides an excellent mechanism for maintaining uniqueness while moving from the development to the production environment, by means of switch parameters.
Note: a sandbox can be associated with only one project, but a project can have many sandboxes.

/Projects
    sandbox/
        bin  dml  mp  run  xfr


8. Basic Graph Development

Create a new graph: go to File > New. Then choose File > Save As (e.g., my_graph) to save it in the appropriate sandbox, so the new graph picks up the proper environment.


Steps in Building an Application


Add datasets. Where are they sourced from? Where does my output go?
Add components.
Add flows.
Edit component parameters as needed.
Debug your application.
Configure datasets and components along the way; let the yellow To Do cues guide you. Generally, you should configure your input and output metadata (record formats) before adding flows.


Adding an Input Dataset

1. Click on Component Organizer Button

2. Open the Datasets Category

3. Choose Input File


Configuring the Input Dataset


1. Browse to find simple.dat
2. Browse to find simple.dml

3. Change label to something descriptive



Create Graph - Dml


Propagate from Neighbors: copy record formats from a connected flow.
Same As: copy record formats from a specific component's port.
Path: store record formats in a local file, a host file, or in the Ab Initio repository.
Embedded: type the record format directly as a string.

Specify the .dml file


Creating Graph - Transform


A transform function is either a DML file or a DML string that describes how you manipulate your data. Ab Initio transform functions consist mainly of a series of assignment statements; each statement is called a business rule. When Ab Initio evaluates a transform function, it performs the following tasks:
Initializes local variables
Evaluates statements
Evaluates rules
Transform function files have the .xfr extension.

Specify the .xfr file


Adding a Filter by Expression Component

1. Open the Transform Category

2. Choose the Filter by Expression Component


Adding an Output Dataset

Choose Output File


Configuring the Output Dataset

1. Browse to see the directory contents

2. Enter name of output file


Components Have Properties

A port is a connection point that allows data to flow into or out of a component. Most components have at least one port.
The data streaming into or out of a component is called a flow.

How to Create Flows


To create a flow, follow these steps:
1. Move your cursor over the out port of the first component until the arrow and box symbols appear.
2. Click and drag from the out port of one component to the in port of the next component.
3. Release the mouse button.

How to Delete Flows


To delete a flow, highlight it (click it), then press the Delete key.


Adding Flows

1. Click on source (hold)

2. Drag to destination (release)


Configuring Filter by Expression

Enter expression


Running the Application

1. Push Run button.

2. View monitoring information.

3. View output data.


Diagnostic Ports: Reject, Error

Reject: Input records that caused errors. Error: Error messages.


Tips about Runtime Status


The GDE displays round colored indicators to show the status of each component during runtime:
Unstarted, Running, Error, Done (Success).


Creating Graph Sort Component


Sort: the Sort component reorders data. It has two main parameters: key and max-core.
Key: the key parameter describes the collation order for the sort.
Max-core: the max-core parameter controls how often the Sort component dumps data from memory to disk.

Specify Key for the Sort
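The role of max-core can be pictured with a classic external-sort sketch (plain Python; here a record count stands in for max-core, which the real component measures in bytes, and the "spilled" runs live in memory rather than on disk):

```python
import heapq

def external_sort(records, max_core_records):
    """Sort in runs that fit the memory budget, then merge the runs."""
    runs = []
    for start in range(0, len(records), max_core_records):
        # Each run fits within "max-core"; a real sort would write the
        # sorted run to disk here before reading the next chunk.
        runs.append(sorted(records[start:start + max_core_records]))
    # Merging the sorted runs needs only one record per run in memory.
    return list(heapq.merge(*runs))

result = external_sort([5, 3, 8, 1, 9, 2, 7], max_core_records=3)
# result == [1, 2, 3, 5, 7, 8, 9]
```

A smaller max-core means more, shorter runs and therefore more disk traffic, which is the trade-off the parameter controls.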


Creating Graph Dedup component

Select Dedup criteria.

The Dedup component removes duplicate records. The dedup criterion is one of: unique-only, First, or Last.
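The three criteria behave as in this sketch (plain Python over key-sorted input; like Dedup Sorted, it assumes the records arrive sorted on the key):

```python
def dedup_sorted(records, key, keep="first"):
    """Remove duplicates from key-sorted records.

    keep='first' keeps the first record of each key group, 'last' keeps
    the last, and 'unique-only' drops any key that occurs more than once.
    """
    groups = []
    for record in records:
        if groups and key(groups[-1][0]) == key(record):
            groups[-1].append(record)   # same key: extend the current group
        else:
            groups.append([record])     # new key: start a new group
    if keep == "first":
        return [g[0] for g in groups]
    if keep == "last":
        return [g[-1] for g in groups]
    if keep == "unique-only":
        return [g[0] for g in groups if len(g) == 1]
    raise ValueError(f"unknown keep option: {keep}")

rows = [("A", 1), ("A", 2), ("B", 3)]
dedup_sorted(rows, key=lambda r: r[0], keep="first")        # -> [("A", 1), ("B", 3)]
dedup_sorted(rows, key=lambda r: r[0], keep="last")         # -> [("A", 2), ("B", 3)]
dedup_sorted(rows, key=lambda r: r[0], keep="unique-only")  # -> [("B", 3)]
```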

Creating Graph Join Component

Specify the key for join Specify Type of Join


9. MULTIFILES
Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the partitions of the multifile. Understanding the concept of multifiles is essential when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application. An Ab Initio multifile organizes all partitions into a single virtual file that you can reference as one entity.



Multifile Commands
m_mkfs m_mkdir m_ls m_expand m_dump m_cp m_mv m_touch m_rm

The m_mkfs Command


m_mkfs mfs-url dir-url1 dir-url2 ...

Creates a multifile system rooted at mfs-url and having as partitions the new directories dir-url1, dir-url2, ...

$ m_mkfs //host1/u/jo/mfs3 \
    //host1/vol4/dat/mfs3_p0 \
    //host2/vol3/dat/mfs3_p1 \
    //host3/vol7/dat/mfs3_p2
$ m_mkfs my-mfs my_mfs_p0 my_mfs_p1 my_mfs_p2

The m_mkdir Command


m_mkdir url

Creates the named multidirectory. The url must refer to a pathname within an existing multifile system.

$ m_mkdir mfile:my-mfs/subdir
$ m_mkdir mfile://host2/tmp/temp-mfs/dir1



The m_ls command


m_ls [options...] url [url...]

Lists information on the file or directories specified by the urls. The information presented is controlled by the options, which follow the form of ls.

$ m_ls -ld mfile:my-mfs/subdir
$ m_ls mfile://host2/tmp/temp-mfs
$ m_ls -l -partitions .


The m_expand command

m_expand [options...] path

Displays the locations of the data partitions of a multifile or multidirectory.

$ m_expand mfile:my-mfs
$ m_expand -native /path/to/the/mdir/bar


The m_dump command


m_dump metadata [path] [options ...]

Displays contents of files, multifiles, or selected records from files or multifiles, similar to View Data from GDE.
$ m_dump simple.dml -describe
$ m_dump simple.dml simple.dat -start 10 -end 20
$ m_dump simple.dml simple.dat -end 1 -print 'id*2'
$ m_dump -help
$ m_dump -string 'string(\n)' bigex/acct.dat



The m_cp command


m_cp source dest
m_cp source ... directory

Copies files or multifiles that have the same degree of parallelism. Behind the scenes, m_cp actually builds and runs a small graph, so it may copy from one machine to another where Ab Initio is installed.
$ m_cp foo bar
$ m_cp mfile:foo \
    mfile://OtherHost/path/to/the/mdir/bar
$ m_cp mfile:foo mfile:bar \
    //OtherHost/path/to/the/mdir

The m_mv command


m_mv oldpath newpath

Moves a single file, multifile, directory, or multidirectory from one path to another path on the same host via renaming; it does not actually move data.

$ m_mv foo bar
$ m_mv mfile:foo mfile:/path/to/the/mdir/bar

The m_touch command


m_touch path

Creates an empty file or multifile in the specified location. If some or all of the data partitions already exist in the expected locations, they will not be destroyed.
$ m_touch foo
$ m_touch mfile:/path/to/the/mdir/bar


The m_rm command


m_rm [options] path [...]

Removes a file or multifile and all its associated data partitions.


$ m_rm foo
$ m_rm mfile:foo mfile:/path/to/the/mdir/bar
$ m_rm -f -r mfile:dir1


Other Commands

m_env m_kill m_rollback -d m_eval


The m_env command


m_env [options]
Describes many features of the environment, such as the version of Ab Initio, the settings of all configuration variables (and where they were set), help on the meanings of all configuration variables, and searches of the names and descriptions of configuration variables.

$ m_env
$ m_env -all
$ m_env -w
$ m_env -version
$ m_env -build
$ m_env -get AB_WORK_DIR
$ m_env -describe AB_NPIPE_READER_OPEN_DELAY
$ m_env -find connection

The m_kill command


m_kill jobname.rec

Kills a running job. It should be executed by the user who started the job, from the launching node of the job, and must be given the recovery file name for the job.
$ m_kill my_graph.rec


The m_rollback command


m_rollback jobname.rec m_rollback -d jobname.rec
Rolls back a failed job. If a job failed in mid-phase and was not automatically rolled back to the last checkpoint (a very unusual case), use m_rollback to roll back to the last successful checkpoint; usually this happens by default. Use m_rollback -d to delete all recovery information for the job and roll the job back to the start. It should be executed by the user who started the job, from the launching node of the job, and must be given the recovery file name for the job.

$ m_rollback -d my_graph.rec

The m_eval command


m_eval expression

Evaluates a DML expression outside a graph. It is useful for quickly testing or debugging a complex expression.

$ m_eval "1+1"
2
$ m_eval "reinterpret_as(record string('|') f1,f2,f3; end, 'a|b|c|').f2"
"b"


10. Performance Tuning
What is Good Performance?
Minimizing wall clock time
Minimizing overall CPU usage
Minimizing memory usage
Minimizing disk usage


Parallelism
Go parallel as soon as possible. Ask yourself why any serial input isn't followed immediately by a Partition component. Once data is partitioned, do not bring it down to serial and then partition back to parallel; repartition instead. For very small processing jobs (hundreds or thousands of records, runtime in minutes), serial may be better because of reduced startup costs.


Serial Inputs If you need to reformat serial input data to find the true partition key, do not do this serially. Instead, do this:


Do not access large files across NFS; use Ab Initio to transfer the data instead, or an FTP component. Use ad hoc multifiles to read many serial files (with the same record format) in parallel. To read a very large number of input files in parallel, use ad hoc multifiles and a fan-in flow to a Concatenate:
M must evenly divide N.
Pad the file list with /dev/null if it doesn't.


Phase breaks (and checkpoints) Often, phase breaks will not add to wall clock time since the graph will be mostly CPU-bound, and some additional I/O will not be an issue. Phase breaks let you allocate more memory to individual components. Visualize what happens in each component. Separate components that would benefit from using large amounts of memory. Try to avoid landing multiple copies of the same data to disk in a phase break after a Replicate component.



Record Formats
In general, completely fixed-format records take less CPU to process than variable-length records. Drop fields that aren't needed as soon as possible; this is often done for free in transform components. Flatten out conditional fields as soon as possible. Often, conditional fields are used to store multiple record types in a single format; split these into separate processing streams as soon as possible, and join them back at the end of the graph if required.


Sorting If you can't make all your fields fixed length, you can still benefit from having the key fields fixed length at the beginning of the record.

If data is already sorted by a primary key and needs to be resorted by a secondary key within each group, use the Sort within Groups component. If you wish to checkpoint near a sort that will land data on disk, consider a Checkpointed Sort component instead.
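A Python analogue shows why sorting within groups is cheaper than a full resort: only each (small) primary-key group is sorted, never the whole dataset. The account/date data below is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Rows already sorted by the primary key (account).
rows = [
    ("acct1", "2010-03-02"), ("acct1", "2010-01-15"),
    ("acct2", "2010-02-20"), ("acct2", "2010-02-01"),
]

resorted = []
for _, group in groupby(rows, key=itemgetter(0)):
    # Sort only within each primary-key group, by the secondary key (date).
    resorted.extend(sorted(group, key=itemgetter(1)))
```

Because each group fits comfortably in memory, no full-dataset sort (and no large spill to disk) is needed.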


In-Memory Components
Join, Rollup, and Scan can operate either in memory or on sorted data. If your data does not fit in memory and you need multiple joins or rollups on the same key, it is most efficient to sort once and configure the rollups and joins to expect sorted input. In-memory components run efficiently only while they have enough memory allocated; if data volume grows until they must drop their data to disk, performance may suddenly decrease one day. A graph that relies on sorted data rather than in-memory components has more uniform performance characteristics as data volume grows.
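The trade-off can be sketched in Python (an analogy, not Ab Initio internals): a sorted-input rollup streams through the data holding only one group at a time, while an in-memory rollup must keep every key resident.

```python
from itertools import groupby
from operator import itemgetter

rows = [("east", 10), ("east", 5), ("west", 7)]  # already sorted by key

# Sorted-input rollup: memory use is bounded by the size of one group,
# so performance stays uniform as volume grows.
sorted_rollup = {
    k: sum(v for _, v in g)
    for k, g in groupby(rows, key=itemgetter(0))
}

# In-memory rollup: accepts unsorted input, but must hold an entry
# for every distinct key, so memory grows with key cardinality.
in_memory = {}
for k, v in rows:
    in_memory[k] = in_memory.get(k, 0) + v
```

Both produce the same result; the difference is where the memory cliff sits as data volume grows.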


Exceeding max-core If an in-memory Join cannot fit its non-driving inputs (plus overhead) within the specified max-core, it drops all of its inputs to disk. Similarly, Sort and Rollup drop all their data to disk if max-core cannot hold all the data plus overhead. It is better to set max-core too low than too high and risk OS swapping: Ab Initio does a better job than the OS at staging working data to disk.


Reduce the Number of Records Use Rollup or Filter by Expression as early as possible if they will reduce the number of records being processed. Join as early as possible if this will reduce the number of records being processed; join as late as possible if it would increase the number or the width of the records being processed.
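A minimal Python sketch of "shrink early" (the order data is hypothetical): filtering and rolling up before any expensive downstream step reduces the records every later component must touch.

```python
# (customer, amount, is_active) records feeding a graph.
orders = [("cust1", 100, True), ("cust2", 50, False), ("cust1", 25, True)]

# Filter by Expression analogue: drop unneeded records first.
active = [(cust, amt) for cust, amt, is_active in orders if is_active]

# Rollup analogue: collapse to one record per key before any
# downstream join or sort sees the data.
totals = {}
for cust, amt in active:
    totals[cust] = totals.get(cust, 0) + amt
```

Downstream components now process one record per active customer instead of the full order stream.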

