
Published by: maggie_thomas_6 on Jun 07, 2012
Copyright:Attribution Non-commercial


03/03/2013


Sections

  • 1. What is Ab Initio
  • 2. Ab Initio Platforms
  • 3. Architecture of Ab Initio
  • 4. Run Process
  • 5. Components Overview
  • 6. Parallelism
  • 7. Sandbox and Project
  • 8. Basic Graph Development
  • 9. Multifiles
  • 10. Performance Tuning

Introduction to Ab initio

Presenter's Name
Role Month, Year

Australia | Canada | France | India | New Zealand | Singapore | Switzerland | United Arab Emirates | United Kingdom | United States ©2010 Keane. This document is the property of Keane. No part of this document shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, to parties outside of your organization without prior written permission from Keane.

Agenda

• DWH Concept
• What is ETL
• Introduction to Ab Initio


DWH Concept

DWH Definition & Process Overview


What is a Data Warehouse ?
A data warehouse is a
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile
collection of data in support of management's decisions.


• Data warehouses store large volumes of data that are frequently used by Decision Support Systems.
• A data warehouse is maintained separately from the organization's operational databases.
• Data warehouses are relatively static, with only infrequent updates.
• A data warehouse is a stand-alone repository of information, integrated from several, possibly heterogeneous, operational databases.


Data Warehousing Process Overview
• Extract from source systems
• Transform to the required data (staging area)
• Transfer to the data warehouse
• Produce reports from the data warehouse

[Figure: Source Systems, then Extract, then Staging Area (Transform), then Transfer, then Data Warehouse]


Various Data Warehouse Models

• Enterprise warehouse: collects all of the information about subjects spanning the entire organization.
• Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.

What is ETL?

Extract, Transform, and Load (ETL) is a process that involves
• extracting data from outside sources,
• transforming it to fit business needs (which can include quality levels), and ultimately
• loading it into the end target, i.e. the data warehouse.

Extract
• The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include nonrelational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.

What is ETL? Transform

The transform stage applies a series of rules or functions to the data extracted from the source to derive the data to be loaded into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:
• Selecting only certain columns to load (or selecting null columns not to load)
• Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
• Encoding free-form values (e.g., mapping "Male" to "1" and "Mr." to M)
• Deriving a new calculated value (e.g., sale amount = qty * unit price)
• Joining together data from multiple sources (e.g., lookup, merge, etc.)
• Summarizing multiple rows of data (e.g., total sales for each store, and for each region)
• Generating surrogate key values
• Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
• Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
• Applying any form of simple or complex data validation; if validation fails, the result is a full, partial or no rejection of the data, and thus no, partial or all the data is handed over to the next step, depending on the rule design and exception handling. Most of the above transformations might themselves result in an exception, e.g. when a code translation parses an unknown code in the extracted data.
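A few of the transformations above (translating coded values, deriving a calculated value, and rejecting a record with an unknown code) can be sketched in plain Python. This is only an illustration, not Ab Initio code; the field names, the code table, and the sample rows are assumptions made up for the example.

```python
# Hypothetical code table: source system stores "1"/"2", warehouse stores "M"/"F".
GENDER_CODES = {"1": "M", "2": "F"}


def transform(row):
    """Build a warehouse row; raise ValueError when a code cannot be translated."""
    code = row["gender_code"]
    if code not in GENDER_CODES:
        raise ValueError(f"unknown gender code: {code!r}")
    return {
        "customer_id": row["customer_id"],          # selecting a column to load
        "gender": GENDER_CODES[code],               # translating a coded value
        "sale_amount": row["qty"] * row["unit_price"],  # deriving a new value
    }


def run_transform(rows):
    """Apply transform to every row, collecting rejected rows instead of stopping."""
    loaded, rejected = [], []
    for row in rows:
        try:
            loaded.append(transform(row))
        except ValueError as exc:
            rejected.append((row, str(exc)))
    return loaded, rejected


# Illustrative sample data (not from the original deck).
source_rows = [
    {"customer_id": 1, "gender_code": "1", "qty": 3, "unit_price": 10},
    {"customer_id": 2, "gender_code": "9", "qty": 1, "unit_price": 5},
]
loaded, rejected = run_transform(source_rows)
```

Here the rule design is "partial rejection": a bad record is set aside and the rest continue to the next step. Rejecting the whole batch instead would be an equally valid design choice.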

What is ETL? Load

• The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process ranges widely. Some data warehouses might overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g. hourly. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. Some systems maintain a history and audit trail of all changes to the data loaded in the DW.
• As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (e.g. uniqueness, referential integrity, mandatory fields). These constraints also contribute to the overall data quality performance of the ETL process.
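How schema constraints act as a quality gate during the load phase can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: the `customer` table, its columns, and the sample rows are hypothetical, and a real warehouse load would normally use the database's bulk loader.

```python
import sqlite3

# In-memory database standing in for the warehouse (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"   # uniqueness constraint
    " name TEXT NOT NULL)"                # mandatory field
)


def load(conn, rows):
    """Insert rows one by one; rows that violate a constraint are collected."""
    rejected = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO customer (customer_id, name) VALUES (?, ?)", row
            )
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    conn.commit()
    return rejected


# Second row repeats a key, third row omits a mandatory field.
rejected = load(conn, [(1, "Ann"), (1, "Bob"), (2, None)])
loaded_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Only the first row is stored; the other two are rejected by the database itself rather than by the ETL code.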

Introduction to Ab Initio

1. What is Ab Initio
2. Ab Initio Platforms
3. Architecture of Ab Initio
4. Run Process
5. Components
6. Parallelism
7. Sandbox and Projects
8. Basic Graph Development
9. Multifile
10. Performance Tuning

1. What is Ab Initio

• Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
• Latin for "from the beginning"
• Designed to support the largest and most complex business applications
• Ab Initio is a general purpose data processing platform for enterprise class, mission critical applications such as data warehousing, click stream processing, data movement, data transformation and analytics.
• Proven best of breed ETL solution.
• Applications of Ab Initio:
  - ETL for data warehouses, data marts and operational data sources.
  - Parallel data cleansing and validation.
  - Parallel data transformation and filtering.
  - High performance analytics.
  - Real time, parallel data capture.

2. Ab Initio Platforms

The Ab Initio product comes with three suites:
• Graphical Development Environment (GDE)
• Co>Operating System
• Enterprise Meta>Environment (EME)

Graphical Development Environment (GDE)

• The GDE lets developers create applications by dragging and dropping components onto a canvas, configuring them with familiar, intuitive point-and-click operations, and connecting them into executable flowcharts.

Co>Operating System

• The Co>Operating System is core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe reliability.
• The Co>Operating System is layered on top of the native operating systems of a collection of computers. It provides a distributed model for process execution, file management, process monitoring, check-pointing, and debugging.

Connection Between GDE and Co>Operating System

[Figure: the Graphical Development Environment (GDE) connects to the Co>Operating System over FTP, TELNET, REXEC, RSH, or DCOM]

In a typical installation, the Co>Operating System is installed on a Unix or Windows NT server, while the GDE is installed on a Pentium PC.

Connecting to the Co>Operating System Server from GDE

[Screenshot]

EME (Enterprise Meta>Environment) Data Store

• It is a system storage area where every version that you save of the files you work on is permanently preserved.
• In short, the EME is a storage area.

[Figure: several GDE clients check out and lock objects in the EME]

Ab Initio runs on many operating systems

• Compaq Tru64 UNIX
• Digital UNIX
• Hewlett-Packard HP-UX
• IBM AIX
• NCR MP-RAS
• Red Hat Linux
• IBM/Sequent DYNIX/ptx
• Siemens Pyramid Reliant UNIX
• Silicon Graphics IRIX
• Sun Solaris
• Windows NT and Windows 2000

3. Architecture of Ab Initio

[Figure: layered architecture, top to bottom]
• Applications
• Application Development Environments: Graphical (GDE), Shell (.ksh), C++; alongside the Ab Initio Metadata Repository (EME)
• Component Library, User-defined Components, Third Party Components
• Ab Initio Co>Operating System
• Native Operating System: UNIX, Windows NT

Architecture of Ab Initio

Host Machine 1
• GDE
  - Ability to graphically design batch programs comprising Ab Initio components, connected by pipes
  - Ability to test run the graphical design and monitor its progress
  - Ability to generate a shell script or batch file from the graphical design
• Unix Shell Script or NT Batch File
  - Supplies parameter values to underlying programs through arguments and environment variables
  - Controls the flow of data through pipes
  - Usually generated using the GDE
• Co>Operating System: Ab Initio built-in component programs (Partitions, Transforms, etc.)
• User Programs
• Operating System (Unix, Windows NT)

Host Machine 2
• Co>Operating System
• User Programs
• Operating System

4. Run Process

What happens when you push the "Run" button?
• Your graph is translated into a script that can be executed in the Shell Development Environment.
• This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
• The script is invoked (via REXEC or TELNET) on the server.
• The script creates and runs a job that may run across many nodes.
• Monitoring information is sent back to the GDE client.

Run Process

• Look at the sample graph below and consider what happens when the Run button at the top right of the screenshot is pressed.

[Screenshot: sample graph]

Anatomy of a Running Job

[Figure, repeated on each slide: GDE client, Host node, and Agents on the processing nodes]

• Host Process Creation
  - Pushing the "Run" button generates a script.
  - The script is transmitted to the Host node.
  - The script is invoked, creating the Host process.
• Agent Process Creation
  - The Host process spawns Agent processes.
• Component Process Creation
  - Agent processes create Component processes on each processing node.
• Component Execution
  - Component processes do their jobs.
  - Component processes communicate directly with datasets and each other to move data around.
• Successful Component Termination
  - As each Component process finishes with its data, it exits with success status.
• Agent Termination
  - When all of an Agent's Component processes exit, the Agent informs the Host process that those components are finished.
  - The Agent process then exits.
• Host Termination
  - When all Agents have exited, the Host process informs the GDE that the job is complete.
  - The Host process then exits.

5. Components Overview

There are mainly two sets of components available in Ab Initio:
• Dataset components: components that hold data
• Program components: components that process data

Dataset Components

• Input file: INPUT FILE represents records read as input to a graph from one or more serial files or from a multifile.
• Input table: INPUT TABLE unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.

Dataset Components

• Output file: OUTPUT FILE represents records written as output from a graph into one or more serial files or a multifile. When the target of an OUTPUT FILE component is a particular file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and recreates that file, nor does it ever truncate it.
• Output table: OUTPUT TABLE loads records from a graph into a database, letting you specify the destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.

Program Components

• Sort: SORT sorts and merges records. You can use SORT to order records before you send them to a component that requires grouped or sorted records.

Parameters:
• key (key specifier, required): Name(s) of the key field(s) and the sequence specifier(s) you want the component to use when it orders records.
• max-core (integer, required): Maximum memory usage in bytes. Default is 100663296 (100 MB). When the component reaches the number of bytes specified in the max-core parameter, it sorts the records it has read and writes a temporary file to disk.
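The spill-to-disk behaviour described for max-core is the classic external merge sort. The sketch below shows the idea in Python; it is not how SORT is implemented internally, and it simplifies the memory limit to a record count (`max_records_in_memory`, a hypothetical parameter) instead of a byte count.

```python
import heapq
import pickle
import tempfile


def external_sort(records, key, max_records_in_memory):
    """Yield records in key order, spilling sorted runs to temporary files."""
    runs, buffer = [], []

    def spill():
        # Sort what is in memory and write it out as one sorted run.
        buffer.sort(key=key)
        run_file = tempfile.TemporaryFile()
        for rec in buffer:
            pickle.dump(rec, run_file)
        run_file.seek(0)
        runs.append(run_file)
        buffer.clear()

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_records_in_memory:
            spill()
    if buffer:
        spill()

    def read_run(run_file):
        # Stream one sorted run back from disk, record by record.
        while True:
            try:
                yield pickle.load(run_file)
            except EOFError:
                return

    # Merge all sorted runs into a single sorted stream.
    yield from heapq.merge(*(read_run(f) for f in runs), key=key)


sorted_values = list(
    external_sort([5, 3, 8, 1, 9, 2], key=lambda r: r, max_records_in_memory=2)
)
```

A larger in-memory limit means fewer temporary runs and less disk I/O, which is why max-core is a common tuning knob.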

• Reformat: REFORMAT changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

1. Reads a record from the input port.
2. Passes the record as an argument to the transform function (xfr).
3. Writes the record to the out ports if the function returns a success status.
4. Writes the record to the reject ports if the function returns a failure status.

Parameters of Reformat Component

• Count
• Transform (Xfr) Function
• Reject-Threshold
  - Abort
  - Never Abort
  - Use Limit & Ramp
• Limit
• Ramp

Join:

1. Reads records from multiple input ports.
2. Operates on records with matching keys using a multi-input transform function.
3. Writes the result to the output port.

PORTS
• in
• out
• unused
• reject (optional)
• error (optional)
• log (optional)

PARAMETERS
• count
• key
• override key
• transform
• limit
• ramp

Join Types
• Inner
• Outer
• Explicit

Join Methods
• Merge Join: using sorted inputs
• Hash Join: using in-memory hash tables to group input
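The two join methods can be contrasted with a small Python sketch of an inner join. This illustrates the general techniques only and is not the JOIN component itself; the record layout and the merging of fields with `{**l, **r}` stand in for a transform function and are assumptions.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner join: build an in-memory hash table on one input, probe with the other."""
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def merge_join(left, right, key):
    """Inner join of two inputs that are already sorted on key."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand records that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left record with this key against that run.
            while i < len(left) and left[i][key] == lk:
                result.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return result


# Illustrative inputs, both sorted on "id".
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
orders = [{"id": 1, "item": "pen"}, {"id": 1, "item": "ink"},
          {"id": 3, "item": "pad"}, {"id": 4, "item": "cap"}]
merged = merge_join(customers, orders, "id")
hashed = hash_join(customers, orders, "id")
```

The merge join needs sorted inputs but only a small working set; the hash join accepts unsorted inputs but must hold one side in memory.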

Filter by Expression: FILTER BY EXPRESSION filters records according to a DML expression.

1. Reads data records from the in port.
2. Applies the expression in the select_expr parameter to each record. If the expression returns:
• Non-0 value: FILTER BY EXPRESSION writes the record to the out port.
• 0: FILTER BY EXPRESSION writes the record to the deselect port. If you do not connect a flow to the deselect port, FILTER BY EXPRESSION discards the records.
• NULL: FILTER BY EXPRESSION writes the record to the reject port and a descriptive error message to the error port.

FILTER BY EXPRESSION stops execution of the graph when the number of reject events exceeds the result of the following formula:
limit + (ramp * number_of_records_processed_so_far)
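The routing rules and the limit/ramp threshold can be sketched in Python as follows. This is an illustration only: `None` stands in for a DML NULL, the function name and sample records are hypothetical, and counting the current record in number_of_records_processed_so_far is an assumption.

```python
def filter_by_expression(records, select_expr, limit=0, ramp=0.0):
    """Route records to out/deselect, abort when rejects exceed limit + ramp * processed."""
    out, deselect, errors = [], [], []
    rejects = 0
    for processed, rec in enumerate(records, start=1):
        result = select_expr(rec)
        if result is None:
            # NULL result: count a reject event and record an error message.
            rejects += 1
            errors.append(f"select_expr returned NULL for {rec!r}")
            if rejects > limit + ramp * processed:
                raise RuntimeError("reject threshold exceeded")
        elif result:
            out.append(rec)        # non-0 value
        else:
            deselect.append(rec)   # 0


    return out, deselect, errors


def select_expr(rec):
    # Hypothetical expression: NULL when amount is missing, else amount > 100.
    return None if rec["amount"] is None else rec["amount"] > 100


records = [{"amount": 150}, {"amount": 50}, {"amount": None}]
out, deselect, errors = filter_by_expression(records, select_expr, limit=1)
```

With limit=1 and ramp=0 the single reject is tolerated; with limit=0 the same input would abort, which corresponds to an "abort on first reject" setting.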

Normalize: NORMALIZE generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.

1. Reads the input record.
• If you have not defined input_select, NORMALIZE processes all records.
• If you have defined input_select, the input records are filtered by it.
2. Performs temporary initialization.
3. Performs iterations of the normalize transform function for each input record.
4. Sends the output record to the out port.
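The core idea, one input record fanned out into several output records, can be sketched in Python. The two callables below play the role of "how many outputs" and "build output number i"; their names and the customer/phone layout are assumptions for illustration, not Ab Initio's actual function signatures.

```python
def normalize(records, length, transform):
    """For each input record, emit transform(record, i) for i in 0..length(record)-1."""
    for rec in records:
        for index in range(length(rec)):
            yield transform(rec, index)


# Hypothetical input: one record per customer holding a vector of phone numbers.
customers = [
    {"id": 1, "phones": ["555-0100", "555-0101"]},
    {"id": 2, "phones": []},
]

rows = list(
    normalize(
        customers,
        length=lambda rec: len(rec["phones"]),
        transform=lambda rec, i: {"id": rec["id"], "phone": rec["phones"][i]},
    )
)
```

A customer with an empty vector produces no output records at all, since the number of outputs depends on the calculated length.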

[Figure: Before Normalization]

[Figure: After Normalization]

DENORMALIZE SORTED: DENORMALIZE SORTED consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group. DENORMALIZE SORTED requires grouped input.

For example, if you have a record for each person that includes the household to which that person belongs, DENORMALIZE SORTED can consolidate those records into a record for each household that contains a variable number of people.
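The household example can be sketched with `itertools.groupby`, which, like DENORMALIZE SORTED, relies on the input already being grouped by key. The field names and the `count` summary field are assumptions chosen for the illustration.

```python
from itertools import groupby
from operator import itemgetter


def denormalize_sorted(records, key):
    """Collapse each run of records sharing key into one record with a vector field."""
    for key_value, group in groupby(records, key=itemgetter(key)):
        people = [rec["person"] for rec in group]
        # "count" is an optional summary field computed per group.
        yield {key: key_value, "people": people, "count": len(people)}


# Hypothetical input, already grouped by household.
persons = [
    {"household": "H1", "person": "Ann"},
    {"household": "H1", "person": "Bob"},
    {"household": "H2", "person": "Cy"},
]
households = list(denormalize_sorted(persons, "household"))
```

If the input were not grouped, the same household could appear in several output records, which is why grouped input is required.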

[Figure: Before Denormalize]

[Figure: After Denormalize]

Multistage components

• Data transformation happens in multiple stages, following several sets of rules.
• Each set of rules forms one transform function.
• Information is passed across stages by temporary variables.
• Stages include initialization, iteration, finalization and more.
• A few multistage components are aggregate, rollup, and scan.

• Rollup: ROLLUP evaluates a group of input records that have the same key, and then generates records that either summarize each group or select certain information from each group.
• Aggregate: AGGREGATE generates records that summarize groups of records. In general, use ROLLUP for new development rather than AGGREGATE. ROLLUP gives you more control over record selection, grouping, and aggregation. However, use AGGREGATE when you want to return the single record that has a field containing either the maximum or the minimum value of all the records in the group.
• Scan: For every input record, SCAN generates an output record that includes a running cumulative summary for the group the input record belongs to. For example, the output records might include successive year-to-date totals for groups of records. You can use SCAN in continuous graphs.
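The difference between ROLLUP (one summary record per group) and SCAN (one running-total record per input record) can be sketched in Python. This is an analogy, not component code; the input is assumed to be grouped by key already, and the field names are hypothetical.

```python
from itertools import accumulate, groupby
from operator import itemgetter


def rollup(records, key, field):
    """One output record per key group, holding the group's total."""
    return [
        {key: key_value, "total": sum(rec[field] for rec in group)}
        for key_value, group in groupby(records, key=itemgetter(key))
    ]


def scan(records, key, field):
    """One output record per input record, holding the running total within its group."""
    out = []
    for _, group in groupby(records, key=itemgetter(key)):
        group = list(group)
        running_totals = accumulate(rec[field] for rec in group)
        for rec, running in zip(group, running_totals):
            out.append({**rec, "running_total": running})
    return out


# Hypothetical sales records, grouped by store.
sales = [
    {"store": "A", "amount": 10},
    {"store": "A", "amount": 5},
    {"store": "B", "amount": 7},
]
totals = rollup(sales, "store", "amount")
running = scan(sales, "store", "amount")
```

The last running total of each group in the scan output equals that group's rollup total.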

Partition components: Data can be partitioned using
• Partition by Round-robin
• Broadcast
• Partition by Key
• Partition by Expression
• Partition by Range
• Partition by Percentage
• Partition by Load Balance

Partition by Round-robin

• PARTITION BY ROUND-ROBIN distributes blocks of records evenly to each output flow in round-robin fashion.
• Suppose you attach four flows, Load-1 through Load-4, to the PARTITION BY ROUND-ROBIN output port. PARTITION BY ROUND-ROBIN writes to Load-1, then Load-2, then Load-3, then Load-4, then back to Load-1 again.
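A minimal Python sketch of the round-robin distribution follows. The `block_size` parameter is an assumption used to model "blocks of records"; the partitions play the role of Load-1 through Load-4.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Deal consecutive blocks of records to partitions 0, 1, ..., n-1, then wrap around."""
    partitions = [[] for _ in range(n_partitions)]
    for start in range(0, len(records), block_size):
        target = (start // block_size) % n_partitions
        partitions[target].extend(records[start:start + block_size])
    return partitions


# Ten records dealt to four flows.
loads = partition_round_robin(list(range(10)), n_partitions=4)
```

Round-robin gives evenly sized partitions regardless of the data values, but records with the same key can end up in different partitions.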

Broadcast
• BROADCAST arbitrarily combines all records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.

Partition by Key
• PARTITION BY KEY distributes records to its output flow partitions according to key values.
• PARTITION BY KEY does the following:
  - Reads records in arbitrary order from the in port.
  - Distributes records to the flows connected to the out port, according to the key parameter, writing records with the same key value to the same output flow.
• PARTITION BY KEY is typically followed by SORT.
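The key property of PARTITION BY KEY, that equal key values always land in the same flow, can be sketched with a hash of the key. The hash function used here (CRC32) and the function name are assumptions for the example; the slides do not say which hash Ab Initio applies.

```python
import zlib


def partition_by_key(records, key, n_flows):
    """Send each record to the flow chosen by a hash of its key value."""
    flows = [[] for _ in range(n_flows)]
    for r in records:
        # CRC32 is stable across runs, unlike Python's built-in str hash.
        h = zlib.crc32(str(r[key]).encode("utf-8"))
        flows[h % n_flows].append(r)
    return flows


# Hypothetical records keyed by customer id.
orders = [{"cust": c, "n": i} for i, c in enumerate(["a", "b", "a", "c", "b", "a"])]
flows = partition_by_key(orders, "cust", 3)
print(flows)
```

How evenly the flows fill depends on the distribution of key values, which is why key-based partitioning is usually paired with a later SORT or key-dependent component rather than used for balancing alone.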

Partition by Expression
• PARTITION BY EXPRESSION distributes records to its output flow partitions according to a specified DML expression.

Partition by Range
• PARTITION BY RANGE distributes records to its output flow partitions according to the ranges of key values specified for each partition. PARTITION BY RANGE distributes the records relatively equally among the partitions.
• Use PARTITION BY RANGE when you want to divide data into useful, approximately equal groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted, the output is unsorted.
• The records with the key values that come first in the key order go to partition 0, the records with the key values that come next in the order go to partition 1, and so on. The records with the key values that come last in the key order go to the partition with the highest number.
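Range partitioning can be sketched as a lookup against a sorted list of splitter values. In this illustration each splitter is treated as the inclusive upper bound of its partition; that boundary convention, the function name, and the sample data are assumptions, not documented Ab Initio behaviour.

```python
from bisect import bisect_left


def partition_by_range(records, key, splitters):
    """Route records to len(splitters) + 1 partitions by key range."""
    flows = [[] for _ in range(len(splitters) + 1)]
    for r in records:
        # bisect_left returns the index of the first splitter >= the key.
        flows[bisect_left(splitters, r[key])].append(r)
    return flows


rows = [{"k": k} for k in (25, 3, 10, 15, 20, 11)]
parts = partition_by_range(rows, "k", splitters=[10, 20])
print([[r["k"] for r in p] for p in parts])
```

Records keep their arrival order inside each partition, which matches the slide's remark that sorted input gives sorted output and unsorted input gives unsorted output.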

Partition by Percentage
• PARTITION BY PERCENTAGE distributes a specified percentage of the total number of input records to each output flow.

Partition by Load Balance
• PARTITION WITH LOAD BALANCE distributes records to its output flow partitions by writing more records to the flow partitions that consume records faster.
• The output port for PARTITION WITH LOAD BALANCE is ordered.

Summary of Partitioning Methods

Method        Key-Based   Balancing                       Uses
Round robin   No          Good                            Record-independent parallelism
Hash          Yes         Good                            Key-dependent parallelism
Function      Yes         Depends on data and function    Application specific
Range         Yes         Depends on splitters            Key-dependent parallelism, global ordering
Load-level    No          Depends on load                 Record-independent parallelism

De-partition components
Data can be de-partitioned using:
• Gather
• Concatenate
• Merge
• Interleave

Gather
• Reads data records from the flows connected to the input port.
• Combines the records arbitrarily and writes to the output.

Concatenate
• Concatenate appends multiple flow partitions of data records one after another.

Merge
• Combines data records from multiple flow partitions that have been sorted on a key.
• Maintains the sort order.

Interleave
• INTERLEAVE combines blocks of records from multiple flow partitions in round-robin fashion. You can use INTERLEAVE to undo the effects of PARTITION BY ROUND-ROBIN.
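Three of the de-partitioning behaviours above can be sketched in Python with lists standing in for flow partitions. The function names and the `block_size` parameter are assumptions for the example. Gather is left out because its output order is arbitrary.

```python
import heapq
from itertools import chain, islice


def concatenate(partitions):
    """Append the partitions one after another."""
    return list(chain.from_iterable(partitions))


def merge(partitions, key=None):
    """Combine partitions that are each sorted, keeping the sort order."""
    return list(heapq.merge(*partitions, key=key))


def interleave(partitions, block_size=1):
    """Take one block from each partition in turn until all are exhausted."""
    open_iters = [iter(p) for p in partitions]
    out = []
    while open_iters:
        still_open = []
        for it in open_iters:
            block = list(islice(it, block_size))
            if block:
                out.extend(block)
                still_open.append(it)
        open_iters = still_open
    return out


# Partitions as produced by a round-robin split of the numbers 0..9.
parts = [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(concatenate(parts))
print(interleave(parts))
print(merge([[1, 4], [2, 3]]))
```

Interleaving the round-robin partitions restores the original record order, which is the sense in which INTERLEAVE undoes PARTITION BY ROUND-ROBIN.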

6. Parallelism

Parallel Runtime Environment
A parallel runtime environment is one where some or all of the components of an application (datasets and processing modules) are replicated into a number of partitions, each spawning a process. Ab Initio can process data in a parallel runtime environment.

Forms of Parallelism
• Component Parallelism
• Pipeline Parallelism
• Data Parallelism

Parallelism
• Data Parallelism: Data is processed on different servers at the same time.
• Pipeline Parallelism: Pipeline parallelism occurs when several connected program components on the same branch of a graph execute simultaneously.
• Component Parallelism: Two or more components are working in parallel.

Data Parallelism
Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.
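As a rough analogy in Python, data parallelism means running the same function over every partition at once. The sketch below uses a thread pool purely to show the structure; the `process_partition` function is a hypothetical stand-in for a component, and Python threads do not give the true multi-process parallelism described in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition):
    """Stand-in for one copy of a component working on one data division."""
    return [value * 2 for value in partition]


def run_data_parallel(partitions):
    """Run one copy of process_partition per partition, all at once."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # map keeps results in the same order as the input partitions.
        return list(pool.map(process_partition, partitions))


print(run_data_parallel([[1, 2], [3], [4, 5, 6]]))
```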

Two Ways of Looking at Data Parallelism
[Figures: Expanded View; Global View]

Pipeline Parallelism
The following graph divides a list of customers into two groups, GOOD CUSTOMERS and OTHER CUSTOMERS. The SCORE component assigns a score to each customer in the CUSTOMERS dataset, then the SELECT component directs each customer to the proper group based on that score.
[Figure: graph with the annotations "Processing Record: 100" and "Processing Record: 55"]
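The SCORE and SELECT example can be imitated with two Python threads joined by a queue, so that the upstream stage may already be working on a later record while the downstream stage handles an earlier one. Everything specific here is an assumption for illustration: the scoring rule, the threshold of 50, the record fields, and the function names.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the flow


def score_stage(customers, out_q):
    """Upstream stage: attach a score to each customer and pass it on."""
    for c in customers:
        out_q.put({**c, "score": c["purchases"] * 10})  # hypothetical rule
    out_q.put(DONE)


def select_stage(in_q, good, other, threshold=50):
    """Downstream stage: route each scored customer to one of two groups."""
    while True:
        c = in_q.get()
        if c is DONE:
            break
        (good if c["score"] >= threshold else other).append(c["name"])


def run_pipeline(customers):
    flow = queue.Queue(maxsize=4)  # small buffer between the two stages
    good, other = [], []
    stages = [
        threading.Thread(target=score_stage, args=(customers, flow)),
        threading.Thread(target=select_stage, args=(flow, good, other)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return good, other


customers = [
    {"name": "a", "purchases": 7},
    {"name": "b", "purchases": 2},
    {"name": "c", "purchases": 5},
]
print(run_pipeline(customers))
```

The queue plays the role of the flow between the two components: neither stage waits for the other to finish the whole dataset.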

Component Parallelism
The following graph takes the CUSTOMERS and TRANSACTIONS datasets, sorts them, then merges them into a dataset named MERGED INFORMATION. Because the SORT CUSTOMERS and SORT TRANSACTIONS components are on different branches of the graph, they execute at the same time, creating component parallelism.
[Figure: graph with the annotation "Sorting Transactions"]
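In Python terms, component parallelism resembles submitting two independent tasks and combining their results once both are done. The sketch below is an analogy only: records are modelled as hypothetical `(customer_id, kind)` tuples, and a thread pool shows the shape of the graph rather than real parallel speed-up.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def sort_and_merge(customers, transactions):
    """Sort two datasets on separate branches, then merge the sorted results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two sorts do not depend on each other, so both are submitted at once.
        sorted_customers = pool.submit(sorted, customers)
        sorted_transactions = pool.submit(sorted, transactions)
        return list(heapq.merge(sorted_customers.result(),
                                sorted_transactions.result()))


merged = sort_and_merge(
    [(3, "cust"), (1, "cust")],
    [(2, "txn"), (1, "txn")],
)
print(merged)
```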

7. Sandbox and Project

What is a Sandbox
A sandbox is a collection of directories such as bin, dml, mp, run, etc., which contain the metadata (graphs and their associated files).

Why create a Sandbox
• Helps in managing the directory structure where this metadata is stored.
• Also helps in version control, migration and navigation.
• The sandbox provides an excellent mechanism to maintain uniqueness while moving from the development to the production environment by means of switch parameters.

Note: A sandbox can be associated with only one project, but a project can have many sandboxes.

[Figure: a sandbox under /Projects containing the directories bin, dml, mp, run, and xfr]

8. Basic Graph Development
• Create a new graph:
  - Go to 'File > New'.
  - Then use 'File > Save As' (e.g., my_graph) to save it in the appropriate sandbox, so that the new graph picks up the proper environment.

Steps in Building an Application
• Add datasets. Where are they sourced from, where does my output go?
• Add components.
• Add flows.
• Edit Component Parameters as needed.
• Debug your application.
• Configure datasets and components along the way; let the yellow "To Do" cues guide you.
• Generally, you should configure your input and output metadata (record formats) before adding flows.

Adding an Input Dataset
1. Click on Component Organizer Button
2. Open the Datasets Category
3. Choose Input File

Configuring the Input Dataset
1. Browse to find simple.dat
2. Browse to find simple.dml
3. Change label to something descriptive

Create Graph - Dml
Specify the .dml file:
• Propagate from Neighbors: Copy record formats from connected flow.
• Same As: Copy record formats from a specific component's port.
• Path: Store record formats in a Local file, Host File, or in the Ab Initio repository.
• Embedded: Type the record format directly in a string.

Creating Graph - Transform
• A transform function is either a DML file or a DML string that describes how you manipulate your data.
• Ab Initio transform functions mainly consist of a series of assignment statements. Each statement is called a business rule.
• When Ab Initio evaluates a transform function, it performs the following tasks:
  - Initializes local variables
  - Evaluates statements
  - Evaluates rules
• Transform function files have the xfr extension.
• Specify the .xfr file.

Adding a Filter by Expression Component
1. Open the Transform Category
2. Choose the 'Filter by Expression' Component

Adding an Output Dataset
Choose Output File

Configuring the Output Dataset
1. Browse to see the directory contents
2. Enter name of output file

Components Have Properties
• A port is a connection point that allows data to flow into or out of a component. Most components have at least one port.
• The data streaming into or out of a component is called a flow.

How to Create Flows
To create a flow, follow these steps:
• Move your cursor over the Out port of the first component until the arrow and box symbols appear.
• Click and drag from the Out port of one component to the In port of the next component.
• Release the mouse button.

How to Delete Flows
To delete a flow, highlight it (click it), then press the Delete key.

Adding Flows
1. Click on source (hold)
2. Drag to destination (release)

Configuring Filter by Expression
Enter expression

Running the Application
1. Push "Run" button.
2. View monitoring information.
3. View output data.

Diagnostic Ports: Reject, Error
• Reject: Input records that caused errors.
• Error: Error messages.

Tips about Runtime Status
• The GDE displays round colored indicators to show the status of each component during runtime.
[Figure legend: Unstarted, Running, Error, Done, Success]

Creating Graph - Sort Component
• Sort: The Sort component reorders data. It has two parameters: key and max-core.
• Key: The key parameter of the Sort component describes the collation order.
• Max-core: The max-core parameter controls how often the Sort component dumps data from memory to disk.
• Specify the key for the Sort.
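The effect of max-core on a sort can be sketched as an external sort that spills a sorted run whenever its in-memory buffer fills. This is a simplified model, not the actual Sort implementation: the limit is counted in records rather than bytes, spilled runs are kept in lists as stand-ins for temporary files, and the function and parameter names are assumptions.

```python
import heapq


def sort_with_max_core(records, max_core_records):
    """Sort records, spilling a sorted run each time the buffer fills.

    Returns the sorted records and the number of runs spilled.
    """
    runs = []    # stand-ins for temporary files on disk
    buffer = []
    for r in records:
        buffer.append(r)
        if len(buffer) == max_core_records:
            runs.append(sorted(buffer))  # "dump" the full buffer
            buffer = []
    if not runs:
        return sorted(buffer), 0         # everything fit in memory
    if buffer:
        runs.append(sorted(buffer))
    return list(heapq.merge(*runs)), len(runs)


print(sort_with_max_core([5, 3, 8, 1, 9, 2], max_core_records=4))
print(sort_with_max_core([5, 3, 8, 1, 9, 2], max_core_records=100))
```

A smaller limit means more, shorter runs and therefore more traffic to disk; a limit larger than the data avoids spilling altogether.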

Creating Graph - Dedup Component
• The Dedup component removes duplicate records.
• The dedup criteria will be either unique-only, first, or last.
• Select the dedup criteria.
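The three dedup criteria can be sketched in Python over input that is already sorted by the key. The function name, the `keep` parameter, and the sample rows are assumptions for the example.

```python
from itertools import groupby
from operator import itemgetter


def dedup_sorted(records, key, keep="first"):
    """Remove duplicates from key-sorted records according to `keep`."""
    out = []
    for _, group in groupby(records, key=itemgetter(key)):
        group = list(group)
        if keep == "first":
            out.append(group[0])
        elif keep == "last":
            out.append(group[-1])
        elif keep == "unique-only":
            if len(group) == 1:          # drop every key that has duplicates
                out.append(group[0])
        else:
            raise ValueError(f"unknown keep criterion: {keep!r}")
    return out


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
for criterion in ("first", "last", "unique-only"):
    print(criterion, dedup_sorted(rows, "id", keep=criterion))
```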

Creating Graph - Join Component
• Specify the key for the join
• Specify the type of join

9. MULTIFILES
• Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the partitions of the multifile. Understanding the concept of multifiles is essential when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application.
• An Ab Initio multifile organizes all partitions of a multifile into one single virtual file that you can reference as one entity.


Multifile Commands
• m_mkfs
• m_mkdir
• m_ls
• m_expand
• m_dump
• m_cp
• m_mv
• m_touch
• m_rm

The m_mkfs Command
m_mkfs mfs-url dir-url1 dir-url2 ...
Creates a multifile system rooted at mfs-url and having as partitions the new directories dir-url1, dir-url2, ...

$ m_mkfs //host1/u/jo/mfs3 \
    //host1/vol4/dat/mfs3_p0 \
    //host2/vol3/dat/mfs3_p1 \
    //host3/vol7/dat/mfs3_p2
$ m_mkfs my-mfs my_mfs_p0 my_mfs_p1 my_mfs_p2

The m_mkdir Command
m_mkdir url
Creates the named multidirectory. The url must refer to a pathname within an existing multifile system.

$ m_mkdir mfile:my-mfs/subdir
$ m_mkdir mfile://host2/tmp/temp-mfs/dir1

The m_ls Command
m_ls [options...] url [url...]
Lists information on the files or directories specified by the urls. The information presented is controlled by the options, which follow the form of ls.

$ m_ls -ld mfile:my-mfs/subdir
$ m_ls mfile://host2/tmp/temp-mfs
$ m_ls -l -partitions .

The m_expand Command
m_expand [options...] path
Displays the locations of the data partitions of a multifile or multidirectory.

$ m_expand mfile:mymfs
$ m_expand -native /path/to/the/mdir/bar

The m_dump Command
m_dump metadata [path] [options ...]
Displays contents of files, multifiles, or selected records from files or multifiles, similar to View Data from GDE.

$ m_dump simple.dml simple.dat -start 10 -end 20
$ m_dump simple.dml simple.dat -end 1 -print 'id*2'
$ m_dump simple.dml -describe
$ m_dump -help
$ m_dump -string 'string("\n")' bigex/acct.dat

The m_cp Command
m_cp source dest
m_cp source [...] directory
Copies files or multifiles that have the same degree of parallelism. Behind the scenes, m_cp actually builds and runs a small graph, so it may copy from one machine to another where Ab Initio is installed.

$ m_cp foo bar
$ m_cp mfile:foo \
    mfile://OtherHost/path/to/the/mdir/bar
$ m_cp mfile:foo mfile:bar \
    //OtherHost/path/to/the/mdir

The m_mv Command
m_mv oldpath newpath
Moves a single file, multifile, directory, or multidirectory from one path to another path on the same host by renaming it; it does not actually move data.

$ m_mv foo bar
$ m_mv mfile:foo mfile:/path/to/the/mdir/bar

The m_touch Command
m_touch path
Creates an empty file or multifile in the specified location. If some or all of the data partitions already exist in the expected locations, they will not be destroyed.

$ m_touch foo
$ m_touch mfile:/path/to/the/mdir/bar

The m_rm Command
m_rm [options] path [...]
Removes a file or multifile and all its associated data partitions.

$ m_rm foo
$ m_rm mfile:foo mfile:/path/to/the/mdir/bar
$ m_rm -f -r mfile:dir1

Other Commands
• m_env
• m_kill
• m_rollback -d
• m_eval

The m_env Command
m_env [options]
Describes many features of the environment, such as the version of Ab Initio, the setting of all configuration variables (and where they were set), help on the meanings of all configuration variables, and searches of the names and descriptions of configvars.

$ m_env
$ m_env -all
$ m_env -w
$ m_env -version
$ m_env -build
$ m_env -get AB_WORK_DIR
$ m_env -describe AB_NPIPE_READER_OPEN_DELAY
$ m_env -find connection

The m_kill Command
m_kill jobname.rec
Kills a running job. Must be given the recovery file name for the job. Should be executed by the user who started the job, from the launching node of the job.

$ m_kill my_graph.rec

The m_rollback Command
m_rollback jobname.rec
m_rollback -d jobname.rec
• Rolls back a failed job. Must be given the recovery file name for the job. Should be executed by the user who started the job, from the launching node of the job.
• If a job failed in mid-phase and was not automatically rolled back to the last checkpoint (a very unusual case), use m_rollback to roll back to the last successful checkpoint. Usually, this is done by default.
• Use m_rollback -d to delete all recovery info for the job, and roll the job back to start.

$ m_rollback -d my_graph.rec

The m_eval Command
m_eval expression
Evaluates a DML expression outside a graph. It is useful for quickly testing out or debugging a complex expression.

$ m_eval "1+1"
2
$ m_eval "reinterpret_as(record string('|') f1,f2,f3; end, 'a|b|c|').f2"
"b"

10. Performance Tuning

What is "Good Performance"?
• Minimizing wall clock time
• Minimizing overall CPU usage
• Minimizing memory usage
• Minimizing disk usage

Parallelism
• Go parallel as soon as possible. Ask yourself why any serial input isn't followed immediately by a Partition component.
• Once data is partitioned, do not bring it down to serial and then partition back to parallel. Repartition instead.
• For very small processing jobs (hundreds or thousands of records, runtime in minutes), serial may be better for reduced startup costs.

Serial Inputs
• If you need to reformat serial input data to find the true partition key, do not do this serially. Instead, do this:
[Figure]

• Do not access large files across NFS. Use Ab Initio to transfer the data instead, or an FTP component.
• Use Ad Hoc Multifiles to read many serial files (with the same record format) in parallel.
• To read many, many input files in parallel, use Ad Hoc Multifiles and a fan-in flow to a Concatenate.
  - M must evenly divide N
  - Pad file list with /dev/null if it doesn't
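The padding rule can be sketched as a small helper. The slide does not define M and N; the sketch assumes N is the number of input files and M is the number of partitions, and the function and parameter names are hypothetical.

```python
def pad_file_list(files, n_partitions, filler="/dev/null"):
    """Pad the file list until its length is a multiple of n_partitions."""
    remainder = len(files) % n_partitions
    if remainder == 0:
        return list(files)
    return list(files) + [filler] * (n_partitions - remainder)


# Five hypothetical input files spread over three partitions need one filler.
print(pad_file_list(["a.dat", "b.dat", "c.dat", "d.dat", "e.dat"], 3))
```

Under that assumption, /dev/null acts as an empty input, so the extra entries add no records while keeping the file count a multiple of the partition count.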

Phase breaks (and checkpoints)
• Often, phase breaks will not add to wall clock time since the graph will be mostly CPU-bound, and some additional I/O will not be an issue.
• Phase breaks let you allocate more memory to individual components.
• Visualize what happens in each component. Separate components that would benefit from using large amounts of memory.
• Try to avoid landing multiple copies of the same data to disk in a phase break after a Replicate component.


Record Formats
• In general, completely fixed format records take less CPU to process than variable length records.
• Drop fields that aren't needed as soon as possible. This is often done "for free" in transform components.
• "Flatten" out conditional fields as soon as possible.
• Often, conditional fields are used to store multiple record types in a single format. Split these into separate processing streams as soon as possible. Join them back at the end of the graph, if required.

Sorting
• If you can't make all your fields fixed length, you can still benefit from having the key fields:
  • fixed length
  • at the beginning of the record.
• If you are sorted by a primary key and need to re-sort by a secondary key, use the Sort within Groups component.
• If you wish to checkpoint near a sort that will land data on disk, consider a Checkpointed Sort component instead.
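
The idea behind re-sorting by a secondary key when the data is already ordered by a primary key can be sketched in Python. This illustrates the general technique only, not how the Ab Initio component is implemented; the function name and sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_within_groups(records, major_key, minor_key):
    """Yield records re-sorted by minor_key inside each run of equal major_key.

    Assumes `records` is already sorted (or at least grouped) by major_key.
    """
    for _, group in groupby(records, key=major_key):
        # Only one group is held in memory at a time.
        yield from sorted(group, key=minor_key)


rows = [("A", 3), ("A", 1), ("B", 2), ("B", 0), ("C", 5)]
result = list(sort_within_groups(rows, itemgetter(0), itemgetter(1)))
print(result)  # [('A', 1), ('A', 3), ('B', 0), ('B', 2), ('C', 5)]
```

Because the primary-key order is reused, only the records of one group need to be sorted at a time rather than the whole data set.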

In-Memory Components
• Join, Rollup, and Scan can operate either in-memory or on sorted data.
• In-memory components run efficiently when there is enough memory allocated to them.
• If the data volume grows until these components need to drop their data to disk, performance may suddenly decrease one day.
• A graph that relies on sorted data and does not use in-memory components will have more uniform performance characteristics as data volume grows.
• If your data does not fit in memory, and you need to do multiple joins or rollups on the same key, it will be most efficient to sort once and set the rollups and joins to expect sorted input.
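
The difference between the two modes can be illustrated with a small Python sketch of a sum-by-key rollup. It is an analogy under simplifying assumptions (records are (key, amount) pairs; function names are hypothetical) and says nothing about Ab Initio internals.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def rollup_in_memory(records):
    """Accepts input in any order, but holds one total per distinct key in memory."""
    totals = defaultdict(int)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)


def rollup_sorted(records):
    """Requires input sorted by key; holds only the current group's total."""
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


data = [("b", 2), ("a", 1), ("b", 5), ("a", 4)]
by_key = sorted(data, key=itemgetter(0))  # sort once, reuse for every rollup or join on this key
print(rollup_in_memory(data))       # {'b': 7, 'a': 5}
print(dict(rollup_sorted(by_key)))  # {'a': 5, 'b': 7}
```

The sorted version's memory use does not depend on the number of distinct keys, which is why its performance stays uniform as data volume grows.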

Exceeding max-core
• If an in-memory Join cannot fit its non-driving inputs (plus overhead) in the provided max-core, then it will drop all the inputs to disk.
• Similarly, Sort and Rollup will drop all their data to disk if max-core does not fit all the data plus overhead.
• It is better to set max-core too low rather than too high and risk OS swapping. Ab Initio does a better job than the OS at staging working data to disk.
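
A back-of-the-envelope check of whether an input is likely to fit can be sketched as follows. The overhead factor of 1.5 is a made-up placeholder, not an Ab Initio figure, and the function name and sample sizes are hypothetical; measure real graphs rather than relying on such an estimate.

```python
def fits_in_max_core(record_count, avg_record_bytes, max_core_bytes, overhead_factor=1.5):
    """Return True if the estimated working set fits within max_core_bytes.

    overhead_factor is a hypothetical placeholder for per-record overhead.
    """
    estimated_bytes = record_count * avg_record_bytes * overhead_factor
    return estimated_bytes <= max_core_bytes


MB = 1024 * 1024
# Hypothetical non-driving input of about 120 bytes per record, max-core of 64 MB.
print(fits_in_max_core(200_000, 120, 64 * MB))    # True
print(fits_in_max_core(2_000_000, 120, 64 * MB))  # False
```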

Reduce number of records
• Use Rollup or Filter by Expression as soon as possible if they will reduce the number of records being processed.
• Join as early as possible if this will reduce the number of records being processed.
• Join as late as possible if this will increase the number of records or the width of records being processed.
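
The effect of filtering before a join, rather than after it, can be shown with a small Python sketch. The data, field names, and the simple hash join are hypothetical stand-ins for graph components; the point is only that both orders give the same result while the early filter sends fewer records into the join.

```python
def hash_join(left, right, key):
    """Inner join of two lists of dicts on `key`, indexing the right side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


orders = [{"cust": c, "amount": a} for c, a in [(1, 10), (2, 0), (3, 25), (1, 0)]]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}, {"cust": 3, "name": "Cy"}]

# Filter late: every order goes through the join first.
late = [row for row in hash_join(orders, customers, "cust") if row["amount"] > 0]

# Filter early: only non-zero orders reach the join.
nonzero = [o for o in orders if o["amount"] > 0]
early = list(hash_join(nonzero, customers, "cust"))

assert early == late
print(len(orders), len(nonzero))  # 4 2
```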
