
Introduction to Ab Initio

Presenter's Name
Role Month, Year

Australia | Canada | France | India | New Zealand | Singapore | Switzerland | United Arab Emirates | United Kingdom | United States 2010 Keane. This document is the property of Keane. No part of this document shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, to parties outside of your organization without prior written permission from Keane.

Agenda

 DWH Concept
 What is ETL
 Introduction to Ab Initio


DWH Concept

DWH Definition & Process Overview


What is a Data Warehouse?


A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making.


 Data warehouses store large volumes of data that are frequently used by Decision Support Systems.
 A data warehouse is maintained separately from the organization's operational databases.
 Data warehouses are relatively static, with only infrequent updates.
 A data warehouse is a stand-alone repository of information, integrated from several, possibly heterogeneous, operational databases.


Data Warehousing Process Overview


1. Extract from source systems
2. Transform to the required data (staging area)
3. Transfer to the data warehouse
4. Produce reports from the data warehouse

[Diagram: Source Systems → (Extract) → Staging Area (Transform) → (Transfer) → Data Warehouse]


Various Data Warehouse Models


 Enterprise warehouse: collects all of the information about subjects spanning the entire organization.
 Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.


What is ETL?
Extract, Transform, and Load (ETL) is a process that involves extracting data from outside sources, transforming it to fit business needs (which can include quality levels), and ultimately loading it into the end target, i.e., the data warehouse.

Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as IMS, or other data structures such as VSAM or ISAM. Extraction converts the data into a format suitable for transformation processing.

What is ETL?
Transform
The transform stage applies a series of rules or functions to the extracted data to derive the data to be loaded into the end target. Some data sources will require very little or even no manipulation. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:
 Selecting only certain columns to load (or selecting null columns not to load)
 Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
 Encoding free-form values (e.g., mapping "Male" to "1" and "Mr." to "M")
 Deriving a new calculated value (e.g., sale amount = qty * unit price)
 Joining together data from multiple sources (e.g., lookup, merge)
 Summarizing multiple rows of data (e.g., total sales for each store and for each region)
 Generating surrogate key values
 Transposing or pivoting (turning multiple columns into multiple rows, or vice versa)
 Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in separate columns)
 Applying any form of simple or complex data validation; on failure, the data may be fully, partially, or not at all rejected, so that none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling.
Many of the above transformations can themselves raise an exception, e.g., when a code translation encounters an unknown code in the extracted data.
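As an illustration only (plain Python, not Ab Initio DML), a few of these transformation types can be sketched as follows; the field names and code mappings are hypothetical:

```python
# Hypothetical record transform illustrating three of the listed types:
# translating coded values, deriving a calculated value, and generating
# a surrogate key. Field names are invented for the example.
def transform(record, next_key):
    out = {}
    # Translate coded values: source stores 1/2, warehouse stores M/F.
    out["gender"] = {1: "M", 2: "F"}[record["gender_code"]]
    # Derive a new calculated value: sale amount = qty * unit price.
    out["sale_amount"] = record["qty"] * record["unit_price"]
    # Generate a surrogate key value.
    out["sale_sk"] = next_key
    return out

row = {"gender_code": 2, "qty": 3, "unit_price": 9.5}
print(transform(row, next_key=1001))
# → {'gender': 'F', 'sale_amount': 28.5, 'sale_sk': 1001}
```

In a real ETL tool these rules would be expressed declaratively (in Ab Initio, as a DML transform function) rather than as hand-written code.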

What is ETL?
Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses might overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g., hourly. The timing and scope of replacing or appending are strategic design choices that depend on the time available and the business needs. Some systems maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema, as well as triggers activated upon data load, apply (e.g., uniqueness, referential integrity, mandatory fields); these also contribute to the overall data quality performance of the ETL process.
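The overwrite-versus-historize choice can be sketched in Python (a minimal sketch: a "table" is just a list of dicts, and the key and timestamp fields are hypothetical):

```python
# Two simplified load strategies for the load phase.
def load_overwrite(table, rows, key):
    # Overwrite: a row with an existing key replaces the stored row.
    by_key = {r[key]: r for r in table}
    for r in rows:
        by_key[r[key]] = r
    return list(by_key.values())

def load_append_history(table, rows, load_ts):
    # Historized: never overwrite; append each row stamped with load time.
    return table + [dict(r, load_ts=load_ts) for r in rows]

warehouse = [{"id": 1, "balance": 10}]
print(load_overwrite(warehouse, [{"id": 1, "balance": 25}], key="id"))
# → [{'id': 1, 'balance': 25}]
```

A real load would be a bulk database operation subject to the schema constraints described above; the sketch only shows the two update policies.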


Introduction to Ab Initio
1. What is Ab Initio
2. Ab Initio Platforms
3. Architecture of Ab Initio
4. Run Process
5. Components
6. Parallelism
7. Sandbox and Projects
8. Basic Graph Development
9. Multifile
10. Performance Tuning


1. What is Ab Initio
 A data processing tool from Ab Initio Software Corporation (http://www.abinitio.com).
 "Ab initio" is Latin for "from the beginning".
 Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, clickstream processing, data movement, data transformation, and analytics.
 Designed to support the largest and most complex business applications.
 A proven, best-of-breed ETL solution.
 Applications of Ab Initio:
  ETL for data warehouses, data marts, and operational data sources.
  Parallel data cleansing and validation.
  Parallel data transformation and filtering.
  High-performance analytics.
  Real-time, parallel data capture.


2. Ab Initio Platforms


The Ab Initio product comes with three suites:
 Graphical Development Environment (GDE)
 Co>Operating System
 Enterprise Meta>Environment (EME)


Graphical Development Environment (GDE)


 The GDE lets developers create applications by dragging and dropping components onto a canvas, configuring them with familiar, intuitive point-and-click operations, and connecting them into executable flowcharts.


Co>Operating System
 The Co>Operating System is the core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe reliability.
 The Co>Operating System is layered on top of the native operating systems of a collection of computers. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.


Connection Between GDE and Co>Op


[Diagram: the Graphical Development Environment (GDE) connects to the Co>Operating System via FTP, TELNET, REXEC, RSH, or DCOM.]

In a typical installation, the Co>Operating System is installed on a Unix or Windows NT server, while the GDE is installed on a Pentium PC.


Connecting to Co>op Server from GDE


EME (Enterprise Meta>Environment) Data store


The EME is a system storage area in which every saved version of the files you work on is permanently preserved. In short, the EME is a version-controlled storage area.

[Diagram: several GDE clients check files out of the EME data store; the EME locks checked-out files against conflicting edits.]

Ab Initio runs on many operating systems:
 Compaq Tru64 UNIX
 Digital UNIX
 Hewlett-Packard HP-UX
 IBM AIX
 NCR MP-RAS
 Red Hat Linux
 IBM/Sequent DYNIX/ptx
 Siemens Pyramid Reliant UNIX
 Silicon Graphics IRIX
 Sun Solaris
 Windows NT and Windows 2000


3. Architecture of Ab Initio
[Diagram: layered architecture, top to bottom]
 Applications
 Application development environments: Graphical (GDE), C++, Shell (.ksh)
 Ab Initio Metadata Repository (EME)
 Component Library, user-defined components, third-party components
 Ab Initio Co>Operating System
 Native operating system (UNIX, Windows NT)


Architecture of Ab Initio
[Diagram: Host Machine 1 and Host Machine 2 each run the Co>Operating System and user programs on top of the native operating system (Unix, Windows NT), along with Ab Initio's built-in component programs (partitions, transforms, etc.).]

Unix Shell Script or NT Batch File
 Supplies parameter values to underlying programs through arguments and environment variables
 Controls the flow of data through pipes
 Usually generated using the GDE

GDE
 Ability to graphically design batch programs comprising Ab Initio components, connected by pipes
 Ability to test-run the graphical design and monitor its progress
 Ability to generate a shell script or batch file from the graphical design


4. Run Process
What happens when you push the Run button?
 Your graph is translated into a script that can be executed in the Shell Development Environment.
 This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
 The script is invoked (via REXEC or TELNET) on the server.
 The script creates and runs a job that may run across many nodes.
 Monitoring information is sent back to the GDE client.


Run Process
 The following sample graph (screenshot) illustrates what happens when you press the Run button at the top right of the screen.


Anatomy of Running Job

 Host Process Creation
 Pushing the Run button generates a script.
 The script is transmitted to the Host node.
 The script is invoked, creating the Host process.

[Diagram: GDE on the client machine; Host process on the host node; processing nodes alongside.]

Anatomy of Running Job

 Agent Process Creation
 The Host process spawns Agent processes.


Anatomy of Running Job

 Component Process Creation
 Agent processes create Component processes on each processing node.


Anatomy of Running Job

 Component Execution
 Component processes do their jobs.
 Component processes communicate directly with datasets, and with each other, to move data around.


Anatomy of Running Job

 Successful Component Termination
 As each Component process finishes with its data, it exits with success status.


Anatomy of Running Job

 Agent Termination
 When all of an Agent's Component processes exit, the Agent informs the Host process that those components are finished.
 The Agent process then exits.

Anatomy of Running Job

 Host Termination
 When all Agents have exited, the Host process informs the GDE that the job is complete.
 The Host process then exits.


5. Components Overview
There are mainly two sets of components in Ab Initio:
 Dataset components: components that hold data
 Program components: components that process data


Dataset Components
Input File:
INPUT FILE represents records read as input to a graph from one or more serial files or from a multifile.

Input Table:
INPUT TABLE unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.


Dataset Components
Output File:
OUTPUT FILE represents records written as output from a graph into one or more serial files or a multifile. When the target of an OUTPUT FILE component is a particular file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and recreates that file, nor does it ever truncate it.

Output Table:
OUTPUT TABLE loads records from a graph into a database, letting you specify the destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.


Program Components
Sort:
SORT sorts and merges records. You can use SORT to order records before sending them to a component that requires grouped or sorted records.

Parameters:
 key (key specifier, required): Name(s) of the key field(s) and the sequence specifier(s) you want the component to use when it orders records.
 max-core (integer, required): Maximum memory usage in bytes. Default is 100663296 (approximately 100 MB). When the component reaches the number of bytes specified in max-core, it sorts the records it has read and writes a temporary file to disk.
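The max-core behaviour can be approximated in Python: sort fixed-size chunks, "spill" each sorted run, then merge the runs. This is a sketch of the idea only; runs are kept as in-memory lists rather than temporary files, and a record-count limit stands in for the byte limit.

```python
import heapq

# External-sort sketch: when the in-memory chunk reaches the limit
# (standing in for max-core bytes), sort it and spill it as a run,
# then merge all the sorted runs at the end.
def external_sort(records, key, max_records):
    runs, chunk = [], []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_records:            # "max-core" reached
            runs.append(sorted(chunk, key=key))  # spill a sorted run
            chunk = []
    if chunk:
        runs.append(sorted(chunk, key=key))
    return list(heapq.merge(*runs, key=key))

print(external_sort([5, 1, 4, 2, 3], key=lambda r: r, max_records=2))
# → [1, 2, 3, 4, 5]
```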

Reformat:
REFORMAT changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

1. Reads a record from the input port.
2. Passes the record as an argument to the transform function (xfr).
3. Writes the record to the out port if the function returns a success status.
4. Writes the record to the reject port if the function returns a failure status.
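These steps can be sketched in Python (illustrative only; the transform function and its failure condition are hypothetical, not Ab Initio DML):

```python
# REFORMAT sketch: run each record through a transform function (xfr);
# successes go to the out port, failures to the reject port.
def xfr(rec):
    # Hypothetical rule: fail on records with no amount; otherwise
    # drop unused fields and add a derived one.
    if rec.get("amount") is None:
        return None                      # failure status
    return {"id": rec["id"], "amount_cents": int(rec["amount"] * 100)}

def reformat(records):
    out, reject = [], []
    for rec in records:
        result = xfr(rec)
        if result is not None:
            out.append(result)           # success → out port
        else:
            reject.append(rec)           # failure → reject port
    return out, reject

print(reformat([{"id": 1, "amount": 2.5}, {"id": 2}]))
# → ([{'id': 1, 'amount_cents': 250}], [{'id': 2}])
```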

Parameters of Reformat Component


 count
 transform (xfr) function
 reject-threshold: Abort | Never Abort | Use Limit & Ramp
 limit
 ramp


Join:
1. Reads records from multiple input ports.
2. Operates on records with matching keys using a multi-input transform function.
3. Writes the result to the output port.

PORTS
 in
 out
 unused
 reject (optional)
 error (optional)
 log (optional)

PARAMETERS
 count
 key
 override key
 transform
 limit
 ramp

Join Types: Inner, Outer, Explicit

Join Methods:
 Merge Join: using sorted inputs
 Hash Join: using in-memory hash tables to group input


Filter by Expression:
FILTER BY EXPRESSION filters records according to a DML expression.
1. Reads data records from the in port.
2. Applies the expression in the select_expr parameter to each record. If the expression returns:
 A non-0 value: FILTER BY EXPRESSION writes the record to the out port.
 0: FILTER BY EXPRESSION writes the record to the deselect port. If you do not connect a flow to the deselect port, FILTER BY EXPRESSION discards the records.
 NULL: FILTER BY EXPRESSION writes the record to the reject port and a descriptive error message to the error port.
FILTER BY EXPRESSION stops execution of the graph when the number of reject events exceeds the result of the following formula: limit + (ramp * number_of_records_processed_so_far)
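The three-way routing (non-0 to out, 0 to deselect, NULL to reject) can be sketched in Python, with None standing in for DML NULL:

```python
# FILTER BY EXPRESSION sketch: route each record by the value that the
# select expression returns for it.
def filter_by_expression(records, select_expr):
    out, deselect, reject = [], [], []
    for rec in records:
        value = select_expr(rec)
        if value is None:        # NULL → reject port
            reject.append(rec)
        elif value:              # non-0 → out port
            out.append(rec)
        else:                    # 0 → deselect port
            deselect.append(rec)
    return out, deselect, reject

recs = [{"v": 1}, {"v": 0}, {"v": None}]
print(filter_by_expression(recs, lambda r: r["v"]))
# → ([{'v': 1}], [{'v': 0}], [{'v': None}])
```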


Normalize:
NORMALIZE generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.

1. Reads the input record. If you have not defined input_select, NORMALIZE processes all records; if you have defined input_select, the input records are filtered by it.
2. Performs temporary initialization.
3. Performs iterations of the normalize transform function for each input record.
4. Sends the output records to the out port.
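The one-record-in, many-records-out behaviour can be sketched in Python (a hypothetical order record whose vector of items is flattened into one output record per item):

```python
# NORMALIZE sketch: each input record produces one output record per
# element of its vector field.
def normalize(records):
    out = []
    for rec in records:
        for i, item in enumerate(rec["items"]):
            out.append({"order_id": rec["order_id"], "index": i, "item": item})
    return out

orders = [{"order_id": 1, "items": ["pen", "ink"]}]
print(normalize(orders))
# → [{'order_id': 1, 'index': 0, 'item': 'pen'},
#    {'order_id': 1, 'index': 1, 'item': 'ink'}]
```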


Before Normalization


After Normalization


DENORMALIZE SORTED:
DENORMALIZE SORTED consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group. DENORMALIZE SORTED requires grouped input. For example, if you have a record for each person that includes the households to which that person belongs, DENORMALIZE SORTED can consolidate those records into a record for each household that contains a variable number of people.
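The household example can be sketched in Python; like DENORMALIZE SORTED, itertools.groupby requires the input to be grouped by key. The record fields are hypothetical.

```python
from itertools import groupby

# DENORMALIZE SORTED sketch: consolidate each key group into one output
# record with a vector field, plus a computed summary field (count).
def denormalize_sorted(records, key):
    out = []
    for k, group in groupby(records, key=key):
        people = [r["name"] for r in group]
        out.append({"household": k, "people": people, "count": len(people)})
    return out

rows = [{"household": "H1", "name": "Ann"},
        {"household": "H1", "name": "Bob"},
        {"household": "H2", "name": "Cid"}]
print(denormalize_sorted(rows, key=lambda r: r["household"]))
# → [{'household': 'H1', 'people': ['Ann', 'Bob'], 'count': 2},
#    {'household': 'H2', 'people': ['Cid'], 'count': 1}]
```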


Before Denormalize


After Denormalize:


Multistage components
 Data transformation in multiple stages, following several sets of rules
 Each set of rules forms one transform function
 Information is passed across stages by temporary variables
 Stages include initialization, iteration, finalization, and more
 A few multistage components: AGGREGATE, ROLLUP, SCAN


Rollup:
ROLLUP evaluates a group of input records that have the same key, and then generates records that either summarize each group or select certain information from each group.

Aggregate:
AGGREGATE generates records that summarize groups of records. In general, use ROLLUP for new development rather than AGGREGATE. ROLLUP gives you more control over record selection, grouping, and aggregation. However, use AGGREGATE when you want to return the single record that has a field containing either the maximum or the minimum value of all the records in the group.

Scan:
For every input record, SCAN generates an output record that includes a running cumulative summary for the group the input record belongs to. For example, the output records might include successive year-to-date totals for groups of records. You can use SCAN in continuous graphs.
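The difference between ROLLUP and SCAN can be sketched in Python (illustrative only; both sketches assume the records are already grouped by key, and the field names are invented):

```python
from itertools import groupby

# ROLLUP sketch: one summary record per key group.
def rollup(records, key, field):
    return [{"key": k, "total": sum(r[field] for r in g)}
            for k, g in groupby(records, key=key)]

# SCAN sketch: one output record per input record, carrying a running
# cumulative summary that resets when the key changes.
def scan(records, key, field):
    out, running, prev = [], 0, object()
    for r in records:
        k = key(r)
        if k != prev:
            running, prev = 0, k
        running += r[field]
        out.append({"key": k, "running_total": running})
    return out

sales = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 3}]
print(rollup(sales, key=lambda r: r["k"], field="v"))
# → [{'key': 'a', 'total': 3}, {'key': 'b', 'total': 3}]
```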

Partition components:
Data can be partitioned using:
 Partition by Round-robin
 Broadcast
 Partition by Key
 Partition by Expression
 Partition by Range
 Partition by Percentage
 Partition with Load Balance


Partition by Round-robin
PARTITION BY ROUND-ROBIN distributes blocks of records evenly to each output flow in round-robin fashion. Suppose you attach four flows to the PARTITION BY ROUND-ROBIN output port, as shown in the following figure: PARTITION BY ROUND-ROBIN writes to Load-1, then Load-2, then Load-3, then Load-4, then back to Load-1 again.


Broadcast
BROADCAST arbitrarily combines all records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.

Partition by Key
PARTITION BY KEY distributes records to its output flow partitions according to key values. PARTITION BY KEY does the following:
1. Reads records in arbitrary order from the in port.
2. Distributes records to the flows connected to the out port, according to the key parameter, writing records with the same key value to the same output flow.
PARTITION BY KEY is typically followed by SORT.
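These two partitioners can be sketched in Python (flows are modelled as plain lists; hashing the key modulo the flow count is one illustrative way to send equal keys to the same flow, not necessarily how the product does it):

```python
# PARTITION BY ROUND-ROBIN sketch: deal records evenly across n flows.
def partition_round_robin(records, n):
    flows = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        flows[i % n].append(rec)
    return flows

# PARTITION BY KEY sketch: records with the same key value always go
# to the same output flow.
def partition_by_key(records, n, key):
    flows = [[] for _ in range(n)]
    for rec in records:
        flows[hash(key(rec)) % n].append(rec)
    return flows

print(partition_round_robin([1, 2, 3, 4, 5], n=2))
# → [[1, 3, 5], [2, 4]]
```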


Partition by Expression
PARTITION BY EXPRESSION distributes records to its output flow partitions according to a specified DML expression.

Partition by Range
PARTITION BY RANGE distributes records to its output flow partitions according to the ranges of key values specified for each partition. PARTITION BY RANGE distributes the records relatively equally among the partitions. Use PARTITION BY RANGE when you want to divide data into useful, approximately equal, groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted, the output is unsorted. The records with the key values that come first in the key order go to partition 0, the records with the key values that come next in the order go to partition 1, and so on. The records with the key values that come last in the key order go to the partition with the highest number.
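The splitter-based routing can be sketched as follows (plain Python; how the real component obtains its splitter values, e.g. from a sample of the data, is outside this sketch):

```python
import bisect

def partition_by_range(records, key, splitters):
    """Route records into len(splitters) + 1 partitions by key ranges.

    Keys up to and including splitters[0] go to partition 0, the next
    range to partition 1, and the highest keys to the last partition.
    """
    partitions = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        # bisect_left finds which range the key falls into.
        idx = bisect.bisect_left(splitters, key(record))
        partitions[idx].append(record)
    return partitions

# Splitters 10 and 20 define three ranges: <= 10, 11..20, > 20.
flows = partition_by_range([5, 12, 25, 9, 20], key=lambda r: r, splitters=[10, 20])
# flows == [[5, 9], [12, 20], [25]]
```

Note how the lowest keys land in partition 0 and the highest in the last partition, which is what makes range partitioning useful for global ordering.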

Partition by Percentage
PARTITION BY PERCENTAGE distributes a specified percentage of the total number of input records to each output flow.

Partition by Load Balance
PARTITION WITH LOAD BALANCE distributes records to its output flow partitions by writing more records to the flow partitions that consume records faster. The output port for PARTITION WITH LOAD BALANCE is ordered.


Summary of Partitioning Methods

Method          Key-based?   Balancing                      Uses
Round-robin     No           Good                           Record-independent parallelism
Hash function   Yes          Depends on data and function   Key-dependent parallelism
Expression      Yes          Good                           Application specific
Range           Yes          Depends on splitters           Key-dependent parallelism, global ordering
Load level      No           Depends on load                Record-independent parallelism


De-partition components:
Data can be de-partitioned using:
Gather
Concatenate
Merge
Interleave


Gather
Reads data records from the flows connected to the input port, combines the records arbitrarily, and writes them to the output port.

Concatenate
CONCATENATE appends multiple flow partitions of data records one after another.

Merge
MERGE combines data records from multiple flow partitions that have been sorted on a key, and maintains the sort order.

Interleave
INTERLEAVE combines blocks of records from multiple flow partitions in round-robin fashion. You can use INTERLEAVE to undo the effects of PARTITION BY ROUND-ROBIN.
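The difference between CONCATENATE (flow order preserved) and MERGE (key order preserved) can be sketched in Python (illustrative stand-ins, not Ab Initio code):

```python
import heapq

def concatenate_flows(*flows):
    """Append flow partitions one after another (the CONCATENATE idea)."""
    return [record for flow in flows for record in flow]

def merge_sorted_flows(*flows):
    """K-way merge of flows already sorted on the key (the MERGE idea):
    the combined output stays in sort order."""
    return list(heapq.merge(*flows))

flow_a, flow_b = [1, 4, 7], [2, 3, 9]
concatenate_flows(flow_a, flow_b)   # -> [1, 4, 7, 2, 3, 9]
merge_sorted_flows(flow_a, flow_b)  # -> [1, 2, 3, 4, 7, 9]
```

heapq.merge only ever holds one record per input flow in memory, which mirrors why MERGE can de-partition sorted data cheaply without a full re-sort.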

6. Parallelism
Parallel Runtime Environment: an environment where some or all of the components of an application (datasets and processing modules) are replicated into a number of partitions, each spawning a process. Ab Initio can process data in a parallel runtime environment.
Forms of Parallelism:
Component Parallelism
Pipeline Parallelism
Data Parallelism


Parallelism

Data Parallelism
Data is processed at different servers at the same time.

Pipeline Parallelism
Pipeline parallelism occurs when several connected program components on the same branch of a graph execute simultaneously.

Component Parallelism
Two or more components are working in parallel.


Data Parallelism
Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.


Two Ways of Looking at Data Parallelism


Expanded View:

Global View:


Pipeline Parallelism
The following graph divides a list of customers into two groups, GOOD CUSTOMERS and OTHER CUSTOMERS. The SCORE component assigns a score to each customer in the CUSTOMERS dataset; the SELECT component then directs each customer to the proper group based on that score.


Component Parallelism
The following graph takes the CUSTOMERS and TRANSACTIONS datasets, sorts them, then merges them into a dataset named MERGED INFORMATION. Because the SORT CUSTOMERS and SORT TRANSACTIONS components are on different branches of the graph, they execute at the same time, creating component parallelism.


7. Sandbox and Project


What is a Sandbox? A sandbox is a collection of directories (bin, dml, mp, run, etc.) that contains the metadata for a project: graphs and their associated files.
Why create a Sandbox? It helps in managing the directory structure where this metadata is stored, and also helps with version control, migration, and navigation. The sandbox provides an excellent mechanism for maintaining uniqueness while moving from the development to the production environment, by means of switch parameters.
Note: a sandbox can be associated with only one project, but a project can have many sandboxes.

/Projects
    sandbox/
        bin  dml  mp  run  xfr


8. Basic Graph Development

Create a new graph: go to File > New. Then choose File > Save As (e.g., my_graph) to save it in the appropriate sandbox, so the new graph picks up the proper environment.


Steps in Building an Application


Add datasets. Where are they sourced from? Where does my output go?
Add components.
Add flows.
Edit component parameters as needed.
Debug your application.
Configure datasets and components along the way; let the yellow To Do cues guide you. Generally, you should configure your input and output metadata (record formats) before adding flows.


Adding an Input Dataset

1. Click on Component Organizer Button

2. Open the Datasets Category

3. Choose Input File


Configuring the Input Dataset


1. Browse to find simple.dat
2. Browse to find simple.dml

3. Change label to something descriptive



Create Graph - Dml


Propagate from Neighbors: copy record formats from a connected flow.
Same As: copy record formats from a specific component's port.
Path: store record formats in a local file, a host file, or in the Ab Initio repository.
Embedded: type the record format directly as a string.

Specify the .dml file


Creating Graph - Transform


A transform function is either a DML file or a DML string that describes how you manipulate your data. Ab Initio transform functions consist mainly of a series of assignment statements; each statement is called a business rule. When Ab Initio evaluates a transform function, it performs the following tasks:
Initializes local variables
Evaluates statements
Evaluates rules
Transform function files have the .xfr extension.

Specify the .xfr file


Adding a Filter by Expression Component

1. Open the Transform Category

2. Choose the Filter by Expression Component


Adding an Output Dataset

Choose Output File


Configuring the Output Dataset

1. Browse to see the directory contents

2. Enter name of output file


Components Have Properties

A port is a connection point that allows data to flow into or out of a component. Most components have at least one port.
The data streaming into or out of a component is called a flow.

How to Create Flows


To create a flow, follow these steps:
1. Move your cursor over the out port of the first component until the arrow and box symbols appear.
2. Click and drag from the out port of one component to the in port of the next component.
3. Release the mouse button.

How to Delete Flows


To delete a flow, highlight it (click it), then press the Delete key.


Adding Flows

1. Click on source (hold)

2. Drag to destination (release)


Configuring Filter by Expression

Enter expression


Running the Application

1. Push Run button.

2. View monitoring information.

3. View output data.


Diagnostic Ports: Reject, Error

Reject: Input records that caused errors. Error: Error messages.


Tips about Runtime Status


The GDE displays round colored indicators to show the status of each component during runtime:
Unstarted, Running, Error, Done (Success).


Creating Graph Sort Component


Sort: the Sort component reorders data. It has two main parameters: key and max-core.
Key: the key parameter describes the collation order for the sort.
Max-core: the max-core parameter controls how often the Sort component dumps data from memory to disk.

Specify Key for the Sort
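The role of max-core can be pictured with a classic external-sort sketch (plain Python; here a record count stands in for max-core, which the real component measures in bytes, and the "spilled" runs live in memory rather than on disk):

```python
import heapq

def external_sort(records, max_core_records):
    """Sort in runs that fit the memory budget, then merge the runs."""
    runs = []
    for start in range(0, len(records), max_core_records):
        # Each run fits within "max-core"; a real sort would write the
        # sorted run to disk here before reading the next chunk.
        runs.append(sorted(records[start:start + max_core_records]))
    # Merging the sorted runs needs only one record per run in memory.
    return list(heapq.merge(*runs))

result = external_sort([5, 3, 8, 1, 9, 2, 7], max_core_records=3)
# result == [1, 2, 3, 5, 7, 8, 9]
```

A smaller max-core means more, shorter runs and therefore more disk traffic, which is the trade-off the parameter controls.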


Creating Graph Dedup component

Select Dedup criteria.

The Dedup component removes duplicate records. The dedup criterion is one of: unique-only, First, or Last.
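The three criteria behave as in this sketch (plain Python over key-sorted input; like Dedup Sorted, it assumes the records arrive sorted on the key):

```python
def dedup_sorted(records, key, keep="first"):
    """Remove duplicates from key-sorted records.

    keep='first' keeps the first record of each key group, 'last' keeps
    the last, and 'unique-only' drops any key that occurs more than once.
    """
    groups = []
    for record in records:
        if groups and key(groups[-1][0]) == key(record):
            groups[-1].append(record)   # same key: extend the current group
        else:
            groups.append([record])     # new key: start a new group
    if keep == "first":
        return [g[0] for g in groups]
    if keep == "last":
        return [g[-1] for g in groups]
    if keep == "unique-only":
        return [g[0] for g in groups if len(g) == 1]
    raise ValueError(f"unknown keep option: {keep}")

rows = [("A", 1), ("A", 2), ("B", 3)]
dedup_sorted(rows, key=lambda r: r[0], keep="first")        # -> [("A", 1), ("B", 3)]
dedup_sorted(rows, key=lambda r: r[0], keep="last")         # -> [("A", 2), ("B", 3)]
dedup_sorted(rows, key=lambda r: r[0], keep="unique-only")  # -> [("B", 3)]
```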

Creating Graph Join Component

Specify the key for join Specify Type of Join


9. MULTIFILES
Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the partitions of the multifile. Understanding the concept of multifiles is essential when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application. An Ab Initio multifile organizes all partitions into a single virtual file that you can reference as one entity.



Multifile Commands
m_mkfs m_mkdir m_ls m_expand m_dump m_cp m_mv m_touch m_rm

The m_mkfs Command


m_mkfs mfs-url dir-url1 dir-url2 ...

Creates a multifile system rooted at mfs-url and having as partitions the new directories dir-url1, dir-url2, ...

$ m_mkfs //host1/u/jo/mfs3 \
    //host1/vol4/dat/mfs3_p0 \
    //host2/vol3/dat/mfs3_p1 \
    //host3/vol7/dat/mfs3_p2
$ m_mkfs my-mfs my_mfs_p0 my_mfs_p1 my_mfs_p2

The m_mkdir Command


m_mkdir url

Creates the named multidirectory. The url must refer to a pathname within an existing multifile system.

$ m_mkdir mfile:my-mfs/subdir
$ m_mkdir mfile://host2/tmp/temp-mfs/dir1



The m_ls command


m_ls [options...] url [url...]

Lists information on the file or directories specified by the urls. The information presented is controlled by the options, which follow the form of ls.

$ m_ls -ld mfile:my-mfs/subdir
$ m_ls mfile://host2/tmp/temp-mfs
$ m_ls -l -partitions .


The m_expand command

m_expand [options...] path

Displays the locations of the data partitions of a multifile or multidirectory.

$ m_expand mfile:my-mfs
$ m_expand -native /path/to/the/mdir/bar


The m_dump command


m_dump metadata [path] [options ...]

Displays contents of files, multifiles, or selected records from files or multifiles, similar to View Data from GDE.
$ m_dump simple.dml -describe
$ m_dump simple.dml simple.dat -start 10 -end 20
$ m_dump simple.dml simple.dat -end 1 -print 'id*2'
$ m_dump -help
$ m_dump -string 'string(\n)' bigex/acct.dat



The m_cp command


m_cp source dest
m_cp source ... directory

Copies files or multifiles that have the same degree of parallelism. Behind the scenes, m_cp actually builds and runs a small graph, so it may copy from one machine to another where Ab Initio is installed.
$ m_cp foo bar
$ m_cp mfile:foo \
    mfile://OtherHost/path/to/the/mdir/bar
$ m_cp mfile:foo mfile:bar \
    //OtherHost/path/to/the/mdir

The m_mv command


m_mv oldpath newpath

Moves a single file, multifile, directory, or multidirectory from one path to another path on the same host via renaming; it does not actually move data.

$ m_mv foo bar
$ m_mv mfile:foo mfile:/path/to/the/mdir/bar

The m_touch command


m_touch path

Creates an empty file or multifile in the specified location. If some or all of the data partitions already exist in the expected locations, they will not be destroyed.
$ m_touch foo
$ m_touch mfile:/path/to/the/mdir/bar


The m_rm command


m_rm [options] path [...]

Removes a file or multifile and all its associated data partitions.


$ m_rm foo
$ m_rm mfile:foo mfile:/path/to/the/mdir/bar
$ m_rm -f -r mfile:dir1


Other Commands

m_env m_kill m_rollback -d m_eval


The m_env command


m_env [options]
Describes many features of the environment, such as the version of Ab Initio, the settings of all configuration variables (and where they were set), help on the meanings of all configuration variables, and searches of the names and descriptions of configuration variables.

$ m_env
$ m_env -all
$ m_env -w
$ m_env -version
$ m_env -build
$ m_env -get AB_WORK_DIR
$ m_env -describe AB_NPIPE_READER_OPEN_DELAY
$ m_env -find connection

The m_kill command


m_kill jobname.rec

Kills a running job. It should be executed by the user who started the job, from the launching node of the job, and must be given the recovery file name for the job.
$ m_kill my_graph.rec


The m_rollback command


m_rollback jobname.rec m_rollback -d jobname.rec
Rolls back a failed job. If a job failed in mid-phase and was not automatically rolled back to the last checkpoint (a very unusual case), use m_rollback to roll back to the last successful checkpoint; usually this happens by default. Use m_rollback -d to delete all recovery information for the job and roll the job back to the start. It should be executed by the user who started the job, from the launching node of the job, and must be given the recovery file name for the job.

$ m_rollback -d my_graph.rec

The m_eval command


m_eval expression

Evaluates a DML expression outside a graph. It is useful for quickly testing or debugging a complex expression.

$ m_eval "1+1"
2
$ m_eval "reinterpret_as(record string('|') f1,f2,f3; end, 'a|b|c|').f2"
"b"


10. Performance Tuning
What is Good Performance?
Minimizing wall clock time
Minimizing overall CPU usage
Minimizing memory usage
Minimizing disk usage


Parallelism
Go parallel as soon as possible. Ask yourself why any serial input isn't followed immediately by a Partition component. Once data is partitioned, do not bring it down to serial and then partition back to parallel; repartition instead. For very small processing jobs (hundreds or thousands of records, runtime in minutes), serial may be better because of reduced startup costs.


Serial Inputs If you need to reformat serial input data to find the true partition key, do not do this serially. Instead, do this:


Do not access large files across NFS; use Ab Initio to transfer the data instead, or an FTP component. Use ad hoc multifiles to read many serial files (with the same record format) in parallel. To read a very large number of input files in parallel, use ad hoc multifiles and a fan-in flow to a Concatenate:
M must evenly divide N.
Pad the file list with /dev/null if it doesn't.


Phase breaks (and checkpoints) Often, phase breaks will not add to wall clock time since the graph will be mostly CPU-bound, and some additional I/O will not be an issue. Phase breaks let you allocate more memory to individual components. Visualize what happens in each component. Separate components that would benefit from using large amounts of memory. Try to avoid landing multiple copies of the same data to disk in a phase break after a Replicate component.



Record Formats
In general, completely fixed-format records take less CPU to process than variable-length records. Drop fields that aren't needed as soon as possible; this is often done for free in transform components. Flatten out conditional fields as soon as possible. Often, conditional fields are used to store multiple record types in a single format; split these into separate processing streams as soon as possible, and join them back at the end of the graph if required.


Sorting If you can't make all your fields fixed length, you can still benefit from having the key fields fixed length at the beginning of the record.

If data is already sorted by a primary key and needs to be resorted by a secondary key within each group, use the Sort within Groups component. If you wish to checkpoint near a sort that will land data on disk, consider a Checkpointed Sort component instead.
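A Python analogue shows why sorting within groups is cheaper than a full resort: only each (small) primary-key group is sorted, never the whole dataset. The account/date data below is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Rows already sorted by the primary key (account).
rows = [
    ("acct1", "2010-03-02"), ("acct1", "2010-01-15"),
    ("acct2", "2010-02-20"), ("acct2", "2010-02-01"),
]

resorted = []
for _, group in groupby(rows, key=itemgetter(0)):
    # Sort only within each primary-key group, by the secondary key (date).
    resorted.extend(sorted(group, key=itemgetter(1)))
```

Because each group fits comfortably in memory, no full-dataset sort (and no large spill to disk) is needed.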


In-Memory Components
Join, Rollup, and Scan can operate either in memory or on sorted data. If your data does not fit in memory and you need multiple joins or rollups on the same key, it is most efficient to sort once and configure the rollups and joins to expect sorted input. In-memory components run efficiently only while they have enough memory allocated; if data volume grows until they must drop their data to disk, performance may suddenly decrease one day. A graph that relies on sorted data rather than in-memory components has more uniform performance characteristics as data volume grows.
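The trade-off can be sketched in Python (an analogy, not Ab Initio internals): a sorted-input rollup streams through the data holding only one group at a time, while an in-memory rollup must keep every key resident.

```python
from itertools import groupby
from operator import itemgetter

rows = [("east", 10), ("east", 5), ("west", 7)]  # already sorted by key

# Sorted-input rollup: memory use is bounded by the size of one group,
# so performance stays uniform as volume grows.
sorted_rollup = {
    k: sum(v for _, v in g)
    for k, g in groupby(rows, key=itemgetter(0))
}

# In-memory rollup: accepts unsorted input, but must hold an entry
# for every distinct key, so memory grows with key cardinality.
in_memory = {}
for k, v in rows:
    in_memory[k] = in_memory.get(k, 0) + v
```

Both produce the same result; the difference is where the memory cliff sits as data volume grows.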


Exceeding max-core If an in-memory Join cannot fit its non-driving inputs (plus overhead) within the specified max-core, it drops all of its inputs to disk. Similarly, Sort and Rollup drop all their data to disk if max-core cannot hold all the data plus overhead. It is better to set max-core too low than too high and risk OS swapping: Ab Initio does a better job than the OS at staging working data to disk.


Reduce the Number of Records Use Rollup or Filter by Expression as early as possible if they will reduce the number of records being processed. Join as early as possible if this will reduce the number of records being processed; join as late as possible if it would increase the number or the width of the records being processed.
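A minimal Python sketch of "shrink early" (the order data is hypothetical): filtering and rolling up before any expensive downstream step reduces the records every later component must touch.

```python
# (customer, amount, is_active) records feeding a graph.
orders = [("cust1", 100, True), ("cust2", 50, False), ("cust1", 25, True)]

# Filter by Expression analogue: drop unneeded records first.
active = [(cust, amt) for cust, amt, is_active in orders if is_active]

# Rollup analogue: collapse to one record per key before any
# downstream join or sort sees the data.
totals = {}
for cust, amt in active:
    totals[cust] = totals.get(cust, 0) + amt
```

Downstream components now process one record per active customer instead of the full order stream.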

