
Introduction to Ab Initio

Prepared By: Ashok Chanda


Ab Initio Training


Ab Initio Session 1

- Introduction to DWH
- Explanation of DW Architecture
- Operating System / Hardware Support
- Introduction to ETL Process
- Introduction to Ab Initio
- Explanation of Ab Initio Architecture




What is a Data Warehouse?

- A data warehouse is a copy of transaction data specifically structured for querying and reporting.
- A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
- A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.


Data Warehouse – Definitions

- A data warehouse is a database geared towards the business intelligence requirements of an organization.
- A collection of databases combined with a flexible data extraction system.
- The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals.
- Data warehouses contain historical information that enables analysis of business performance over time.

Data Warehouse

- A data warehouse can be normalized or denormalized.
- It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc.
- Data warehouse data often gets changed.
- Data warehouses often focus on a specific activity or entity.

Why Use a Data Warehouse?

- Data Exploration and Discovery
- Integrated and Consistent data
- Quality assured data
- Easily accessible data
- Production and performance awareness
- Access to data in a timely manner

Simplified Data Warehouse Architecture (diagram)

Data Warehouse Architecture

- Data Warehouses can be architected in many different ways, depending on the specific needs of a business.
- The model shown below is the "hub-and-spokes" Data Warehousing architecture that is popular in many organizations.
- Data is copied from one database to another using a technology called ETL (Extract, Transform, Load). In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts.


The ETL Process

- Capture
- Scrub or Data cleansing
- Transform
- Load and Index
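The four steps above can be sketched as a toy Python pipeline. This is an illustration only, not Ab Initio code: the field names, the reject rule, and the SQLite target are assumptions made for the example.

```python
import sqlite3

def capture(rows):
    """Capture: pull raw records from a source (here, an in-memory list)."""
    return list(rows)

def scrub(rows):
    """Scrub: drop records with missing or malformed fields."""
    return [r for r in rows if r.get("customer_id") and r.get("amount")]

def transform(rows):
    """Transform: standardize values into the warehouse's conventions."""
    return [{"customer_id": int(r["customer_id"]),
             "amount_usd": round(float(r["amount"]), 2)} for r in rows]

def load(rows, conn):
    """Load and index: insert into the target table, then index it."""
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount_usd REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer_id, :amount_usd)", rows)
    conn.execute("CREATE INDEX idx_sales_cust ON sales (customer_id)")
    return conn

source = [{"customer_id": "1", "amount": "19.99"},
          {"customer_id": "",  "amount": "5.00"},   # rejected by scrub
          {"customer_id": "2", "amount": "7.5"}]

conn = load(transform(scrub(capture(source))), sqlite3.connect(":memory:"))
loaded = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Note how the record with the missing customer_id is rejected at the scrub step, so only two of the three source records reach the target table.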

ETL Technology

- ETL Technology is an important component of the Data Warehousing Architecture. It is used to copy data from Operational Applications to the Data Warehouse Staging Area, from the DW Staging Area into the Data Warehouse, and finally from the Data Warehouse into a set of conformed Data Marts that are accessible by decision makers.
- The ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data and loads data into a target database.
- The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.

Data Warehouse Staging Area

- The Data Warehouse Staging Area is a temporary location where data from source systems is copied.
- A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse.
- Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.

Staging Area – Examples

- For example, it might be feasible to extract "customer" data from a database in Singapore at noon eastern standard time, but this would not be feasible for "customer" data in a Chicago database.
- Similarly, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process.
- Not all businesses require a Data Warehouse Staging Area. For many businesses it is feasible to use ETL to copy data directly from operational databases into the Data Warehouse.
- Data in the Data Warehouse can be either persistent (i.e. remains around for a long period) or transient (i.e. only remains around temporarily).

Data Warehouse

- The purpose of the Data Warehouse in the overall Data Warehousing Architecture is to integrate corporate data. It contains the "single version of truth" for the organization that has been carefully constructed from data stored in disparate internal and external operational databases.
- The amount of data in the Data Warehouse is massive. Data is stored at a very granular level of detail. For example, every "sale" that has ever occurred in the organization is recorded and related to dimensions of interest. This allows data to be sliced and diced, summed and grouped in unimaginable ways.

Data Warehouse

- Contrary to popular opinion, the Data Warehouse does not contain all the data in the organization. Its purpose is to provide key business metrics that are needed by the organization for strategic and tactical decision making.
- The Data Warehouse can be either "relational" or "dimensional". This depends on how the business intends to use the information.
- Decision makers don't access the Data Warehouse directly. This is done through various front-end Data Warehouse Tools that read data from subject-specific Data Marts.

Data Warehouse Environment

In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

Data Mart

- A subset of a data warehouse.
- A subset of the information contained in a data warehouse, for use by a single department or function.
- A repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.
- Data marts have the same definition as the data warehouse, but data marts have a more limited audience and/or data content.

Data Mart

- ETL (Extract Transform Load) jobs extract data from the Data Warehouse and populate one or more Data Marts for use by groups of decision makers in the organization.
- The Data Marts can be Dimensional (Star Schemas) or relational. Each Data Mart can contain different combinations of tables, columns and rows from the Enterprise Data Warehouse, depending on how the information is to be used and what "front end" Data Warehousing Tools will be used to present the information.
- For example, a business unit or user group that doesn't require a lot of historical data might only need transactions from the current calendar year in the database. The Personnel Department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a Data Mart that focuses on Sales.

Star Schema

- The star schema is perhaps the simplest data warehouse schema.
- It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table.
- The center of the star consists of a large fact table, and the points of the star are the dimension tables.

Star Schema – continued

A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
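As a small illustration of the fact-plus-dimensions shape described above, the sketch below builds a star schema in an in-memory SQLite database and runs a typical "star query". All table and column names are invented for the example.

```python
import sqlite3

# One large fact table joined to two small dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date    VALUES (10, 2004), (11, 2005);
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 11, 75.0);
""")

# A typical star query: aggregate the facts, grouped by dimension attributes.
rows = conn.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date    d ON f.date_id    = d.date_id
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
```

Each fact row carries only foreign keys and measures; the descriptive attributes (product name, year) live once in the dimension tables.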

Advantages of Star Schemas

- Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
- Provide highly optimized performance for typical star queries.
- Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.
- Star schemas are used for both simple data marts and very large data warehouses.

Star Schema – diagrammatic representation of the star schema (diagram)

Snowflake Schema

- The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema.
- It is called a snowflake schema because the diagram of the schema resembles a snowflake.
- Snowflake schemas normalize dimensions to eliminate redundancy.

Snowflake Schema – Example

- That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema.
- While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.

Diagrammatic representation of the Snowflake Schema (diagram)

Fact Table

- The centralized table in a star schema is called the FACT table.
- A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables.
- The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
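The composite-key rule above can be demonstrated directly. The sketch below (illustrative names; SQLite used only as a convenient stand-in) declares a fact table whose primary key is the combination of its three foreign keys, and shows that a second row with the same key combination is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sales (
        product_id INTEGER,              -- foreign key to a product dimension
        store_id   INTEGER,              -- foreign key to a store dimension
        date_id    INTEGER,              -- foreign key to a date dimension
        units_sold INTEGER,              -- a fact column
        amount     REAL,                 -- another fact column
        PRIMARY KEY (product_id, store_id, date_id)
    )
""")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 20040101, 3, 30.0)")

# A second row with the same (product_id, store_id, date_id) combination
# violates the composite primary key.
try:
    conn.execute("INSERT INTO fact_sales VALUES (1, 1, 20040101, 5, 50.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```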

What happens during the ETL process?

- During extraction, the desired data is identified and extracted from many different sources, including database systems and applications.
- Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process.
- The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation.
- After extracting data, it has to be physically transported to the target system or an intermediate system for further processing.

Examples of Second-Generation ETL Tools

- Powermart 4.5 – Informatica Corporation: pioneer due to market share; general-purpose tool oriented to data marts
- Ardent DataStage – Ardent Software, Inc.: a kit of tools that can be used to build applications
- Sagent Data Mart Solution 3.0 – Sagent Technology: end-to-end data warehousing solution from a single vendor
- Ab Initio 2.2 – Ab Initio Software
- Tapestry 2.1 – D2K, Inc.: progressively integrated with Microsoft

What to look for in ETL tools

- Use an optional data cleansing tool to clean up source data
- Use an extraction/transformation/load tool to retrieve, cleanse, transform, aggregate, summarize, and load data
- Use modern, engine-driven technology for fast, parallel operation
- Goal: define 100% of the transform rules with a point-and-click interface
- Support development of logical and physical data models
- Generate and manage a central metadata repository
- Open metadata exchange architecture to integrate central metadata with local metadata
- Support metadata standards
- Provide end users access to metadata in business terms

Operating System / Hardware Support

This section discusses how a DBMS utilizes OS/hardware features such as parallel functionality, SMP/MPP support, and clustering. These OS/hardware features greatly extend scalability and improve performance. However, managing an environment with these features is difficult and expensive.

Parallel Functionality

The introduction and maturation of parallel processing environments are key enablers of increasing database sizes, as well as of providing acceptable response times for storing, retrieving, and administering data. DBMS vendors are continually bringing products to market that take advantage of multi-processor hardware platforms. These products can perform table scans, loads, backups, and queries in parallel.

Parallel Features

An overview of typical parallel functionality is given below:

- Queries – Parallel queries can enhance scalability for many query operations.
- Data load – Performance is always a serious issue when loading large databases. Meeting response time requirements is the overriding factor for determining the best load method and should be a key part of a performance benchmark.
- Create table as select – This feature makes it possible to create aggregated tables in parallel.
- Index creation – Parallel index creation exploits the benefits of parallel hardware by distributing the workload generated by creating a large index across a large number of processors.

Which parallel processor configuration, SMP or MPP?

- SMP and clustered SMP environments have the flexibility and ability to scale in small increments.
- SMP environments are often useful for the large but static data warehouse, where the data cannot be easily partitioned due to the unpredictable nature of how the data is joined over multiple tables for complex searches and ad-hoc queries.

Which parallel processor configuration, SMP or MPP?

- MPP works well in environments where growth is potentially unlimited, access patterns to the database are predictable, and the data can be easily partitioned across different MPP nodes with minimal data accesses crossing between them. This often occurs in large OLTP environments, where transactions are generally small and predictable, as opposed to decision support and data warehouse environments, where multiple tables can be joined in unpredictable ways.
- MPP does not scale well if heavy data warehouse database accesses must cross MPP nodes, causing I/O bottlenecks over the MPP interconnect, or if multiple MPP nodes are continually locked for concurrent record updates.
- In fact, data warehousing and decision support are the areas most vendors of parallel hardware platforms and DBMSs are targeting.

A Multi-CPU Computer (SMP)

A Network of Multi-CPU Nodes

A Network of Networks




Parallel Computer Architecture

Computers come in many "shapes and sizes":

- Single-CPU, Multi-CPU
- Network of single-CPU computers
- Network of multi-CPU computers

Multi-CPU machines are often called SMPs (for Symmetric Multi Processors). Specially-built networks of machines are often called MPPs (for Massively Parallel Processors).




Introduction to Ab Initio




History of Ab Initio

- Ab Initio Software Corporation was founded in the mid 1990s by Sheryl Handler, the former CEO of Thinking Machines Corporation (TMC), after TMC filed for bankruptcy. In addition to Handler, other former TMC people involved in the founding of Ab Initio included Cliff Lasser, Angela Lordi, and Craig Stanfill.
- Ab Initio is known for being very secretive in the way it runs its business, but its software is widely regarded as top notch.

History of Ab Initio

- The Ab Initio software is a fourth-generation data analysis and data manipulation tool: a graphical user interface (GUI)-based parallel processing tool that is used mainly to extract, transform and load data.
- The Ab Initio software is a suite of products that together provides a platform for robust data processing applications. The core Ab Initio products are:
  - The Co>Operating System
  - The Component Library
  - The Graphical Development Environment

What Does "Ab Initio" Mean?

- Ab Initio is Latin for "From the Beginning."
- From the beginning, our software was designed to support a complete range of business applications, from simple to the most complex.
- Crucial capabilities like parallelism and checkpointing can't be added after the fact.
- The Graphical Development Environment and a powerful set of components allow our customers to get valuable results from the beginning.

Ab Initio's Focus

- "Moving Data"
  - move small and large volumes of data in an efficient manner
  - deal with the complexity associated with business data
- High performance
  - scalable solutions
- Better productivity

Ab Initio's Software

Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:

- Data warehousing
- Batch processing
- Click-stream analysis
- Data movement
- Data transformation

Applications of Ab Initio Software

- Processing just about any form and volume of data.
- Parallel sort/merge processing.
- Data transformation.
- Rehosting of corporate data.
- Parallel execution of existing applications.

Ab Initio Provides For:

- Distribution – a platform for applications to execute across a collection of processors within the confines of a single machine or across multiple machines.
- Reduced run-time complexity – the ability for applications to run in parallel on any combination of computers where the Ab Initio Co>Operating System is installed, from a single point of control.

Applications of Ab Initio Software in terms of the Data Warehouse

- Front end of Data Warehouse:
  - Transformation of disparate sources
  - Aggregation and other preprocessing
  - Referential integrity checking
  - Database loading
- Back end of Data Warehouse:
  - Extraction for external processing
  - Aggregation and loading of Data Marts

Ab Initio or Informatica – Powerful ETL

- Informatica and Ab Initio both support parallelism, but Informatica supports only one type of parallelism while Ab Initio supports three: component parallelism, data parallelism, and pipeline parallelism. In Informatica the developer needs to define partitions in the Server Manager to achieve parallelism; in Ab Initio the tool itself takes care of parallelism.
- Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do have administrative work.
- Ab Initio supports different types of text files, meaning you can read the same file with different structures; that is not possible in Informatica. Ab Initio is also more user friendly than Informatica, so there are a lot of differences between Informatica and Ab Initio.
- Ab Initio has no built-in scheduler like Informatica: you need to schedule through a script or run jobs manually.

Ab Initio or Informatica – Powerful ETL (continued)

- Error handling – In Ab Initio you can attach error and reject files to each transformation and capture and analyze the message and data separately. Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.
- Robust transformation language – Informatica is very basic as far as transformations go. While I will not go into a function-by-function comparison, it seems that Ab Initio is much more robust.
- Instant feedback – On execution, Ab Initio tells you how many records have been processed/rejected/etc., and gives detailed performance metrics for each component. Informatica has a debug mode, but it is slow and difficult to adapt to.

Both tools are fundamentally different

- Which one to use depends on the work at hand and the existing infrastructure and resources available.
- Ab Initio is a code-based ETL tool: it generates ksh or bat etc. code, which can be modified to achieve any goals that cannot be taken care of through the ETL tool itself.
- Informatica is an engine-based ETL tool: the power of this tool is in its transformation engine, and the code that it generates after development cannot be seen or modified.
- Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do have administrative work.

Ab Initio Product Architecture (layered, top to bottom)

- User Applications
- Development Environments: Ab Initio EME, GDE, Shell
- Component Library, 3rd Party Components, User-defined Components
- The Ab Initio Co>Operating® System
- Native Operating System (Unix, Windows, OS/390)

Ab Initio Architecture – Explanation

- The Ab Initio Co>Operating System unites a network of computing resources (CPUs, programs, datasets) into a production-quality data processing system with scalable performance and mainframe-class disks.
- It provides a distributed model for process execution, file management, process monitoring, checkpointing and debugging. A user may perform all these functions from a single point of control.
- The Co>Operating System is layered on top of the native operating systems of the collection of servers.

Co>Operating System Services

- Parallel and distributed application execution
- Control and data transport
- Transactional semantics at the application level
- Checkpointing
- Monitoring and debugging
- Parallel file management
- Metadata-driven components
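Checkpointing, one of the services listed above, can be illustrated with a toy Python sketch: a job records its progress so that, after a failure, a rerun resumes from the last checkpoint instead of the beginning. The file format and resume logic here are invented for the example; the Co>Operating System's own mechanism is far more sophisticated.

```python
import json
import os
import tempfile

def run_job(items, ckpt_path):
    """Process items, checkpointing progress after each one."""
    done = 0
    if os.path.exists(ckpt_path):            # resume from a prior checkpoint
        with open(ckpt_path) as f:
            done = json.load(f)["done"]
    processed = []
    for i in range(done, len(items)):
        processed.append(items[i] * 10)      # the "work" for this record
        with open(ckpt_path, "w") as f:      # record progress durably
            json.dump({"done": i + 1}, f)
    return done, processed

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_job([1, 2, 3], ckpt)    # no checkpoint yet: processes everything
second = run_job([1, 2, 3], ckpt)   # finds the checkpoint: nothing left to do
```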

Ab Initio: What We Do

- Ab Initio software helps you build large-scale data processing applications and run them in parallel environments.
- Ab Initio software consists of two main programs:
  - The Co>Operating System, which your system administrator installs on a host Unix or Windows NT server, as well as on processing computers.
  - The Graphical Development Environment (GDE), which you install on your PC (the GDE computer) and configure to communicate with the host.

The Ab Initio Co>Operating® System

The Co>Operating System:

- Runs across a variety of operating systems and hardware platforms, including OS/390 on mainframe, Unix, and Windows.
- Supports distributed and parallel execution.
- Supports platform-independent data transport.
- Can provide scalability proportional to the hardware resources provided.

The Ab Initio Co>Operating® System – continued

The Ab Initio Co>Operating System depends on parallelism to connect to (i.e. cooperate with) diverse databases. It extracts, transforms and loads data to and from Teradata and other data sources.

Co>Operating System Layer

- The GDE runs as the top layer on any OS; the Co>Operating System runs underneath on Solaris, AIX, Linux, NT, NCR, etc.
- The same Co>Op commands work on any OS.
- GDE graphs can be moved from one OS to another without any changes.

The Ab Initio Co>Operating System runs on:

- Sun Solaris
- IBM AIX
- Hewlett-Packard HPUX
- Siemens Pyramid Reliant UNIX
- IBM DYNIX/ptx
- Silicon Graphics IRIX
- Red Hat Linux
- Windows NT 4.0 (x86)
- Windows 2000 (x86)
- Compaq Tru64 UNIX
- IBM OS/390
- NCR MP-RAS

Connectivity to Other Software

- Common, high-performance database interfaces:
  - IBM DB2, DB2/PE, DB2 EEE, UDB, IMS
  - Oracle, Sybase, Informix XPS, Teradata, MS SQL Server 7
  - OLE-DB
  - ODBC
- Connectors to many other third-party products
- Other software packages: Trillium, Siebel, ErWin, etc.

Ab Initio Co>Operating System

- Ab Initio Software Corporation, headquartered in Lexington, MA, develops software solutions that process vast amounts of data (well into the terabyte range) in a timely fashion by employing many (often hundreds of) server processors in parallel.
- Major corporations worldwide use Ab Initio software in mission-critical, enterprise-wide data processing systems.
- Together, Teradata and Ab Initio deliver:
  - End-to-end solutions for integrating and processing data throughout the enterprise
  - Software that is flexible, efficient and robust, with unlimited scalability
  - Professional and highly responsive support
- The Co>Operating System executes your application by creating and managing the processes and data flows that the components and arrows represent.

Graphical Development Environment (GDE)

The GDE

- The Graphical Development Environment (GDE) provides a graphical user interface into the services of the Co>Operating System.
- It enables you to create applications by dragging and dropping components, using point-and-click operations on executable flowcharts. The Co>Operating System can execute these flowcharts directly.
- Graphical monitoring of running applications allows you to quantify data volumes and execution times, helping spot opportunities for improving performance.

The Graph Model

The Component Library

- The Component Library: reusable software modules for sorting, data transformation, database loading, etc.
- The components adapt at runtime to the record formats and business rules controlling their behavior.
- Ab Initio products have helped reduce a project's development and research time significantly.

Different components do different jobs.Components     Components may run on any computer running the Co>Operating System. Ab Initio Training 65 Accenture . that is business rules to be applied to an input (s) to produce a required output. The particular work a component accomplishes depends upon its parameter settings. Some parameters are data transformations.

3rd Party Components

EME

- The Enterprise Meta>Environment (EME) is a high-performance, object-oriented storage system that inventories and manages various kinds of information associated with Ab Initio applications.
- It provides storage for all aspects of your data processing system, from design information to operations data. It acts as a hub for data and definitions (information that is usually scattered throughout your business), including data formats and business rules.
- The EME also provides a rich store for the applications themselves.
- Integrated metadata management provides a global and consolidated view of the structure and meaning of applications and data.

Benefits of EME

The Enterprise Meta>Environment provides a rich store for applications and all of their associated information, including:

- Technical Metadata: application-related business rules, record formats and execution statistics
- Business Metadata: user-defined documentation of job functions, roles and responsibilities

Metadata is data about data. Storing and using metadata is as important to your business as storing and using data; it is critical to understanding and driving your business processes and computational resources.

EME – Ab Initio Relevance

- By integrating technical and business metadata, you can grasp the entirety of your data processing, from operational to analytical systems.
- The EME is a completely integrated environment. The following figure shows how it fits into the high-level architecture of Ab Initio software.


Stepwise explanation of Ab Initio Architecture

- You construct your application from building blocks called components, manipulating them through the Graphical Development Environment (GDE).
- You check your applications in to the EME.
- The EME and GDE use the underlying functionality of the Co>Operating System to perform many of their tasks.
- Ab Initio software runs on Unix, Windows NT and MVS operating systems. The Co>Operating System unites the distributed resources into a single "virtual computer" to run applications in parallel.

Stepwise explanation of Ab Initio Architecture – continued

- Ab Initio connector applications extract metadata from third-party metadata sources into the EME, or extract it from the EME into a third-party destination.
- You view the results of project and application dependency analysis through a Web user interface. You also view and edit your business metadata through a Web user interface.

EME: Various user constituencies served

The EME addresses the metadata needs of three different constituencies:

- Business Users
- Developers
- System Administrators

EME: Various user constituencies served

- Business users are interested in exploiting data for analysis.
- Developers tend to be oriented towards applications, needing to analyze the impact of potential program changes, in particular with regard to databases, tables and columns.
- System administrators and production personnel want job status information and run statistics.

EME Interfaces

We can create and manage the EME through 3 interfaces:

- GDE
- Web User Interface
- air Utility

Thank You

End of Session 1