A staging area may be required if you have any of the following scenarios:
Delta Loading: Your data is read incrementally from the source and you need intermediate storage
where each incremental set of data can be held temporarily for transformation purposes.
Transformation need: You need to perform data cleansing, validation, etc. before consuming the data in
the warehouse.
De-coupling: Your processing takes a lot of time and you do not want to stay connected to your source
system (presumably the source system is being constantly used by the actual business users) during the
entire time of your processing; hence, you prefer to read the data from the source system in one go,
disconnect from the source, and then continue processing the data on your "own side".
Debugging purpose: You need not go back to your source all the time; you can troubleshoot issues
(if any) from the staging area alone.
Failure Recovery: The source system may be transitory and the state of its data may be changing. If you
encounter an upstream failure, you may not be in a position to re-extract your data because the source
has changed by then; having a local copy helps. Performance and reduced processing are not the only
considerations: adding a staging area can sometimes increase latency (i.e., the time delay between the
occurrence of a business incident and its reporting).
Dimensional Modeling:
It is a logical, rational, and consistent design technique used in data warehousing, distinct from the
entity-relationship model. If applied to relational databases, and done properly, it yields second or third
normal form. It does not necessarily involve a relational database: the logical model can be implemented
physically as database tables or flat files. It is one of the techniques for supporting end-user queries in
data warehousing.
A virtual data warehouse provides a collective view of the complete data; it holds no historic data. It can
be considered a logical data model of the underlying metadata. Virtual data warehousing is a 'de facto'
information-system strategy for supporting analytical decision making, and one of the best ways of
translating raw data and presenting it in a form that decision makers can use. It provides a semantic
map that allows the end user to view the data as virtualized.
What is data modeling and data mining? What are they used for?
Data Modeling is a technique used to define and analyze the data requirements that support an
organization's business processes. In simple terms, it is the analysis of data objects in order to identify
the relationships among those data objects in a business. It is the primary step in database design: a
conceptual design model is created to depict how the data items are related to each other, and the work
progresses from the conceptual model to the logical model and then to the physical model.
Data Mining is a technique used to analyze datasets and derive useful insights/information. It is mainly
used in retail, consumer-goods, telecommunication, and financial organizations that have a strong
consumer orientation, in order to determine the impact on sales, customer satisfaction, and profitability.
Data mining is very helpful in determining the relationships among different business attributes: analyzing
data from various perspectives and summarizing it into useful information is what data mining means, and
finding correlations or patterns among different fields in large relational databases is its technical aspect.
A snapshot refers to a complete visualization of the data at the time of extraction. It occupies less space
and can be used to back up and restore data quickly. A snapshot is stored in report format from a
specific catalog; the report is generated as soon as the catalog is disconnected.
What is a Dimension Table?
A dimension table is a table that contains the attributes of the measurements stored in fact tables. It
holds hierarchies, categories, and logic that can be used to traverse nodes.
The BUS schema is used to identify the common dimensions across business processes (shared across all
enterprise data marts), i.e., to identify the conforming dimensions. A BUS schema has conformed
dimensions and standardized definitions of facts, so all the data marts can use the conformed dimensions
and facts without holding local copies.
Conformed facts:
Conformed facts allow the same names to be used in different tables, so the facts can be compared and
combined mathematically. A dimension table used by more than one fact table is referred to as a
conformed dimension; it is used across multiple data marts in combination with multiple fact tables.
Without changing the metadata of the conformed dimension tables, the facts in an application can be
utilized without further modifications. Conformed dimensions can be used across multiple data marts and
have a static structure.
OLTP: manages transaction-based applications that modify high volumes of data. Typical examples of
transactions are commonly observed in banks, air-ticket bookings, etc. Because OLTP uses a client-server
architecture, it supports transactions running across a network.
OLAP: performs analysis of business data and provides the ability to perform complex calculations on
usually low volumes of data. OLAP helps the user gain insight into data coming from different sources
(multi-dimensional).
OLTP vs OLAP:
– OLTP is an online database modification system; OLAP is an online query management system.
– OLTP is normalized (3NF); OLAP is not normalized.
– An OLTP database must maintain data integrity constraints; OLAP is not bothered with them.
– OLTP response time is fast; OLAP response time is slow.
Snowflake Schema: The fact table is usually at the center, surrounded by dimension tables, and the
dimension tables are further broken down into more dimension tables. The schema is inclined slightly
towards normalization, and because the tables split further, it contains deeper chains of joins.
For example, the dimension tables might include employee, projects, and status, with the status table
further broken into status_weekly and status_monthly.
Surrogate Key:
e.g. An employee may be recruited before the year 2000 while another employee with the same name is
recruited after the year 2000. Here, the primary key will uniquely identify the record, while the surrogate
key will be generated by the system (say, a serial number), since the SK is NOT derived from the data.
For example: a sequential number can be a surrogate key.
Any column, or combination of columns, that can uniquely identify a record is a candidate key.
Ab Initio:
1. Co>Operating System: It runs on top of the native operating system, is provided by Ab Initio, and is
the foundation for all Ab Initio processes. Air commands are among its features, and it can be installed
on operating systems such as UNIX, Linux, IBM AIX, etc. It:
– Manages and runs Ab Initio graphs and controls the ETL processes
– Provides extensions to the native operating system
– Monitors and debugs ETL processes
– Manages metadata and interacts with the EME
GDE: The design component, used to build and run Ab Initio graphs. Graphs are formed from components
(predefined or user-defined), flows, and parameters; the ETL process in Ab Initio is represented by a
graph. The GDE provides the ability to run and debug the process, log jobs, and trace execution logs.
Enterprise Meta>Environment (EME): An environment for storage and metadata management (both
business and technical metadata). The metadata can be accessed from the Graphical Development
Environment, from a web browser, or from the Co>Operating system command line. It is the Ab Initio
repository for all project artifacts.
EME stands for Enterprise Meta>Environment. It is a repository that holds all the projects, metadata, and
transformations, and it performs operations such as version control, statistical analysis, dependency
analysis, and metadata management.
GDE stands for Graphical Development Environment; it is like a canvas on which we create our graphs
with the help of various components. It provides the graphical interface for editing and executing
Ab Initio programs.
Q. There were 10,000 records and 4,000 were loaded today; I need to load records 4001 - 10,000
the next day. How is it done in Type 1 and in Type 2?
Simply take a REFORMAT component and put next_in_sequence() > 4000 in its select parameter, as
sketched below.
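A minimal sketch of that REFORMAT (select expression plus a pass-through transform; note that
next_in_sequence() counts per partition in a parallel layout):

/* select parameter: keep only records 4001 onward */
next_in_sequence() > 4000

/* transform parameter: straight pass-through */
out :: reformat(in) =
begin
  out.* :: in.*;
end;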
How can you increase the number of ports of your output flow? What is the limit?
We can use the count parameter; the limit is 20. This can be done only for the REFORMAT component;
we cannot use a count parameter on others such as JOIN, FUSE, etc. to increase the number of out ports.
How do you create a new mfs file? Where do we specify the number of partitions (4-way, 8-way)?
To create a new mfs file we can use m_touch <filename>, but you have to be sure that you are in a
multi-directory. After creating the multifile, do a cat on <multifilename> and you will see how many
partitions it has.
You need a multifile system of 4-way or 8-way depth to create multifiles of the corresponding depth;
m_mkfs can be used to create the desired mfs, as sketched below.
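A hedged sketch of creating a 4-way multifile system and an empty multifile in it (host and directory
names are made up; as far as I recall, m_mkfs takes the control-directory URL followed by one data
directory per partition):

m_mkfs //host/u/me/mfs_4way \
       //host/vol1/p0 //host/vol2/p1 //host/vol3/p2 //host/vol4/p3
m_touch //host/u/me/mfs_4way/new_file.dat
m_ls //host/u/me/mfs_4way/new_file.dat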
In practice a developer never creates an MFS path; that is done by the Ab Initio administrator. The
administrator creates different MFS path parameters, such as:
AI_MFS: default MFS path
AI_4way_MFS: for creating 4-way mfs files, and so on.
We simply use the required path in the output file URL to create the desired mfs file.
To convert a 4-way to an 8-way partition we change the layout in the partitioning component. There are
separate parameters for each type of partitioning, e.g.
AI_MFS_HOME, AI_MFS_MEDIUM_HOME, AI_MFS_WIDE_HOME, etc.
The appropriate parameter needs to be selected in the component layout for the desired partitioning.
4. The .abinitiorc file provides parameters for remote connectivity. You can access Ab Initio resources
(e.g., the EME) on a different server by providing the connection method and authentication details in
the .abinitiorc file.
.abinitiorc can be placed in two locations:
1. In the $HOME directory of each user
2. In the config directory of the Co>Op
If both exist, the first one (in the $HOME dir) takes precedence over the second (in config). You specify
telnet, ftp, or NDM ports by setting the configuration variables AB_TELNET_PORT or AB_FTP_PORT in
your .abinitiorc file.
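An illustrative .abinitiorc fragment (host name, port, and user are made up; AB_CONNECTION and
AB_USERNAME are standard connection settings as far as I recall, and the "@ host" qualifier scopes a
setting to one remote host):

AB_CONNECTION @ emehost : telnet
AB_TELNET_PORT @ emehost : 2323
AB_USERNAME @ emehost : etluser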
5. What is the difference between sandbox and EME? Can we perform check-in and check-out
through the sandbox?
The Enterprise Meta>Environment is the central repository, and the sandbox is the private area into
which you bring an object (by doing an object-level checkout) from the repository for editing. Once you
are finished with your editing, you check the object back in to the EME (object-level check-in) from the
sandbox.
The EME is the version-control unit of Ab Initio and can be called the central repository, whereas the
private sandbox is a user-specific space that is a replica of the EME. Working through the sandbox, a
developer can safely check out and check in with no conflicts with other developers.
7. Layout is where the program (component) runs. Based on the layout given, Ab Initio tries to run the
component at that physical location.
Ways to give a layout: from a neighboring component;
URL: like $AI_SERIAL or $AI_PARALLEL (a mount-point location such as /opt/apps/ppl/serial);
Database: where the database runs.
The layout of a component basically specifies whether the component processes the data serially or in
parallel.
Depth specifies the degree of partitioning; however, at run time this is resolved to an actual depth by the
environment. It might be 2-way in development and 4-way in production: the graph's layout doesn't
change, and the depth is determined by the environment, not by the graph itself.
Or, from the command line, run the graph and test its exit status:
air sandbox run graph.mp
if [ $? -ne 0 ]; then
    echo "Graph Failed"
fi
Q. Partition by key distributes the data into the multifile partitions depending on the key fields present
in the input, while partition by round-robin distributes the data evenly among the partitions irrespective
of any key field (in round-robin fashion, based on the block-size parameter).
With PBK, data flows to the output partitions in an apparently random manner (it depends on the key),
but with PBRR, data flows to the outputs in an orderly pattern.
3) Use DEDUP SORTED to remove header and trailer records. Alternatively, use a SCAN component to
generate a sequence number, then sort by that sequence number (descending).
*** The simplest way to remove header and trailer records is to use a DEDUP SORTED component with a
NULL key. This treats the entire record set as a single group; then use the keep_first and keep_last
modes of DEDUP SORTED to select the header and trailer records.
Another way to accomplish a similar result is a ROLLUP component on a NULL key, using the first and
last functions in the rollup to select the header and trailer records, as sketched below.
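A sketch of the ROLLUP approach (the input field name and output record format are assumptions; the
component's key parameter is set to the NULL key {}):

out :: rollup(in) =
begin
  out.header  :: first(in.line);   /* first record of the single group */
  out.trailer :: last(in.line);    /* last record of the single group */
end;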
Q. In MFS, a developer developed 2-way, but support runs 4-way on the same records. How is
this possible?
First connect the 2-way input file to a GATHER component, then connect a PARTITION BY EXPRESSION
(with a SORT component if needed), and finally connect the target file, which is a 4-way mfs file.
For any change in partitioning, we can change the "AI_MFS_DEPTH_OVERRIDE" parameter in the
.air-project-parameters file.
To re-partition the data 7 ways, use partition components; of course you would need a 7-way MFS.
Parallelism:
Component parallelism: an application has multiple components running on the system simultaneously,
but on separate data.
Data parallelism: data is split into segments and the operations run on all segments simultaneously.
Pipeline parallelism: an application with multiple components running simultaneously on the same dataset.
A) If a SORT is placed in front of a MERGE component, the sort is of no use, because MERGE has sorting
built in.
B) We use a lookup instead of a JOIN or MERGE component where possible.
C) If we want to join the data coming from 2 files and do not want duplicates, we can use a union
function instead of adding an additional duplicate-remover component.
*** Pipeline parallelism is achieved automatically in PowerCenter: a thread is spawned for every partition
point (to keep it simple, you have a thread for the source, the target, the transformation, the aggregator).
Additional threads are also spawned for building the lookups in parallel (you can control the number of
threads in the session configuration).
Partition parallelism needs the partition-option license and has to be specified manually for each partition
point in the session configuration. It doesn't spawn different processes for every partition, but threads.
I achieved a transformation rate of 250,000 records/sec (35m records in 2 minutes 20 seconds) on an
x86 machine with 4 cores, using partitioning on a session with 8 lookups per record and aggregating data
(aggregators in 8.5+ are really blazing fast), reading a partitioned Oracle table from the network.
Max-core: This parameter controls how much data a component may hold in memory before it must dump
data from memory to disk.
In Ab Initio, dependency analysis is a process through which the EME examines a project in its entirety
and traces how data is transferred and transformed, component by component and field by field, within
and between graphs.
The Replicate component combines the data records from its inputs into one flow and writes a copy of
that flow to each of its output ports.
A SANDBOX refers to a collection of graphs and related files saved in a single directory tree, which
behaves as a group for the purposes of navigation, version control, and migration.
What Information does a .dbc File Provide to Connect to the Database?
The .dbc file provides the GDE with the information needed to connect to the database:
1. Name and version number of the database to which you want to connect
2. Name of the computer on which the database instance or server runs, or on which the database
remote-access software is installed
3. Name of the server, database instance, or provider to which you want to link
A lookup file defines one or more serial files (flat files); it is the physical file where the data for the
lookup is stored. The Lookup itself is the component of an Ab Initio graph in which we can save data and
retrieve it using a key parameter.
Component parallelism: A graph with multiple processes executing simultaneously on separate data.
Data parallelism: A graph that works with data divided into segments, operating on each segment
simultaneously, uses data parallelism.
Pipeline parallelism: A graph with multiple components executing simultaneously on the same data uses
pipeline parallelism. Each component in the pipeline reads continuously from its upstream components,
processes the data, and writes to its downstream components, so adjacent components can operate in
parallel.
Rollup vs Aggregate?
1) Rollup can perform some additional functionality, such as input filtering and output filtering of records.
2) Aggregate does not expose its intermediate results in main memory, whereas Rollup can.
3) Analyzing a particular summarization is much simpler with Rollup than with Aggregate.
A lookup file represents a set of serial or flat files; it is a specific dataset that is keyed, the key being
used to map values based on the data available in the file. The dataset can be static or dynamic. A hash
join can be replaced by a REFORMAT with a lookup, provided the lookup input has a small number of
records with short record lengths; Ab Initio has certain functions for retrieving values from the lookup
using the key.
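A sketch of such a lookup inside a REFORMAT (the lookup-file label and field names are assumptions; a
real transform would also guard against a missing key, e.g. with lookup_count()):

out :: reformat(in) =
begin
  out.* :: in.*;
  /* fetch the matching record from the keyed lookup file */
  out.cust_name :: lookup("CustomerLookup", in.cust_id).name;
end;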
A lookup file is an indexed dataset that actually consists of two files: one holds the data and the other
holds a hash index into the data file. We commonly use a lookup file to hold in physical memory the data
that a transform component frequently needs to access:
lookup("MyLookupFile", in.key)
If the lookup-file key's Special attribute (in the Key Specifier Editor) is exact, the lookup functions return
a record that matches the key values and has the format specified by the RecordFormat parameter.
Local Lookup:
Lookup files can be partitioned (multifiles). If the component is running in parallel and we use a _local
lookup function, the Co>Operating System splits the lookup file into partitions.
The benefits of partitioning lookup files are:
1. The per-process footprint is lower, which means the lookup file as a whole can exceed the 2 GB limit.
2. If the component is partitioned across machines, the total memory needed on any one machine is
reduced.
Dynamic Lookup
A disadvantage of a static lookup file is that the dataset occupies a fixed amount of memory even when
the graph isn't using the data. By loading lookup data dynamically, we control how many datasets are
loaded, which ones, and when. This control is useful for conserving memory; an application can unload
datasets that are not immediately needed and load only the ones needed to process the current input
record:
1. Load the dataset into memory when it is needed.
2. Retrieve data with your graph.
3. Free the memory by unloading the dataset after use.
To look up data dynamically we use the LOOKUP TEMPLATE component:
let lookup_identifier_type LID = lookup_load(MyData, MyIndex, "MyTemplate", -1)
where LID is a variable that holds the lookup ID returned by the lookup_load function. This ID references
the lookup file in memory and is valid only within the scope of the transform.
MyData is the pathname of the lookup data file.
MyIndex is the pathname of the lookup index file; if no index file exists, we must pass the DML keyword
NULL, and the graph creates an index on the fly.
** In a lookup template, we do not provide a static URL for the dataset's location as we do with a lookup
file. Instead, we specify the dataset's location in the call to the lookup_load function when the data is
actually loaded.
Ramp/Limit:
The limit is an integer parameter representing the number of reject events permitted outright, and the
ramp is a real number (a rate from 0 to 1) representing the tolerated rate of reject events per processed
record. Together they provide the threshold for bad records:
number of bad records allowed = limit + (ramp x number of records processed)
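For example (illustrative numbers): with limit = 50 and ramp = 0.01, after 2,000 records the graph
tolerates 50 + (0.01 x 2000) = 70 bad records before aborting.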
The Rollup component allows users to group records on certain field values. It is a multi-stage transform:
to keep counts for a particular group, rollup needs a temporary variable; the initialize function is invoked
first for each group; rollup is called for each record in the group; and the finalize function is called only
once, at the end of the last rollup call, as in the sketch below.
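A sketch of this expanded rollup package, counting records per group (the group-key field name is an
assumption):

type temporary_type =
record
  integer(8) cnt;   /* running count for the current group */
end;

/* invoked once at the start of each group */
temp :: initialize(in) =
begin
  temp.cnt :: 0;
end;

/* invoked once per record in the group */
temp :: rollup(temp, in) =
begin
  temp.cnt :: temp.cnt + 1;
end;

/* invoked once after the last rollup call for the group */
out :: finalize(temp, in) =
begin
  out.key :: in.key;
  out.cnt :: temp.cnt;
end;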
decimal_strip takes the decimal value out of the data and trims any leading zeros; the result is a valid
decimal number.
first_defined Function:
This function is similar to the NVL() function in the Oracle database: it returns the first non-NULL value
among the values passed to it and assigns it to the variable.
Example: variables v1, v2, v3, v4, v5, v6 are all assigned NULL, and the variable num is assigned the
value 340 (num = 340). Then
num = first_defined(NULL, v1, v2, v3, v4, v5, v6, num)
leaves num as 340.
Max Core:
MAX CORE is the memory a component may consume for its calculations; each component has a different
MAX CORE. A component's performance is influenced by its MAX CORE setting, and the process may slow
down (or speed up) if a wrong MAX CORE is set.
Checkpoint:
When a graph fails in the middle of a process, a recovery point known as a checkpoint is available. The
rest of the process continues from the checkpoint: data from the checkpoint is fetched and execution
resumes after correction.
Phase:
If a graph is created with phases, each phase is assigned part of memory one after another. The phases
run one by one, and the intermediate files are deleted as each phase completes.
Sandboxes are work areas used to develop, test, or run code associated with a given project. Only one
version of the code can be held within a sandbox at any time.
The EME Datastore contains all versions of the code that have been checked into it. A particular sandbox
is associated with only one project, while a project can be checked out to a number of sandboxes.
Environment variables serve as global variables in the UNIX environment. They are used for passing
values from one shell/process to another, and they are inherited by Ab Initio as sandbox variables/graph
parameters. `env | grep AI` will list all the AI_* variables, e.g.:
AI_SORT_MAX_CORE
AI_HOME
AI_SERIAL
AI_MFS
Difference between conventional loading (API) and direct loading (Utility); when is each used
in real time?
Conventional load: before loading the data, all the table constraints are checked against the data.
Direct load (faster loading): all the constraints are disabled and the data is loaded directly; later the
data is checked against the table constraints and the bad data is not indexed.
In the case of multifiles, BROADCAST does data partitioning (it copies records to every partition), while
REPLICATE does component-level copying (it sends the same flow to multiple components).
What is m_dump?
The m_dump command prints data in a formatted way. It is used to view data residing in a multifile from
the UNIX prompt:
m_dump <dml> <datafile>
How can you create cross-joined output using the JOIN component?
Set the key to NULL: {}
What is ABLOCAL?
ABLOCAL is a construct of the Input Table component that can be used in parallel unloads and for
determining the driving table in complex queries. There are two forms of the ABLOCAL() construct: one
with no arguments, and one with a single argument, a table name (the driving table).
Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when
unloading in parallel. You can use the ABLOCAL() construct in this case to prevent the Input Table
component from parsing the SQL (it is passed through to the database); it also specifies which table to
use for the parallel clause.
If you use an SQL SELECT statement to specify the source for Input Table, and the statement involves a
complex query or a join of two or more tables in an unload, Input Table may be unable to determine the
best way to run the query in parallel. In such cases, the GDE may return an error message suggesting
you use ABLOCAL(tablename) in the SELECT statement to tell Input Table which table to use as the basis
for the parallel unload.
To do this, put ABLOCAL(tablename) in the appropriate place in the WHERE clause of the SELECT
statement, specifying the name of the "driving table" (often the largest table) as its single argument.
When you run the graph, Input Table replaces the expression "ABLOCAL(tablename)" with the
appropriate parallel query condition for that table.
For example, suppose you want to join two tables, customer_info and acct_type, and customer_info is
the driving table. You would code the SELECT statement as follows:
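A sketch of such a statement (the join columns are made up for illustration):

SELECT c.cust_id, c.cust_name, a.acct_type_desc
FROM   customer_info c, acct_type a
WHERE  c.acct_type_id = a.acct_type_id
  AND  ABLOCAL(customer_info)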
What is meant by the Co>Operating system and why is it special to Ab Initio?
It converts the Ab Initio-specific code into a format that UNIX/Windows can understand and feeds it to
the native operating system, which carries out the task.
Which is faster to process, fixed-length DMLs or delimited DMLs, and why?
Fixed-length DMLs are faster because the data can be read directly by length without any comparisons;
with delimited DMLs, every character has to be compared against the delimiter, hence the delay.
What are the continuous components in Ab Initio?
Continuous components are used to create graphs that produce useful output while running continuously:
Continuous Rollup, Continuous Update, Batch Subscribe.
What is Skew?
Skew is the measure of data-flow balance across partitions. The skew of a data partition is the amount by
which its size deviates from the average partition size, expressed as a percentage of the largest partition:
skew = ((partition size - average partition size) / largest partition size) x 100
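For example (illustrative numbers): if four partitions hold 100, 200, 300, and 400 records, the average
is 250 and the largest is 400, so the skew of the largest partition is (400 - 250) / 400 x 100 = 37.5%.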
I had 10,000 records and loaded 4,000 today; I need to load records 4001 - 10,000 the next
day. How is it done in Type 1 and in Type 2?
Take the 10,000-record file as one source and the output table as another source, then JOIN both
sources, selecting the "inputs must be sorted" parameter option in the JOIN component. If any matching
records are found, send them to a TRASH component; take all unmatched records from the unused port
and insert them into the target table.
FUSE vs JOIN?
Fuse: a component that appends data horizontally. With two files, the first record of the first file is joined
horizontally with the first record of the second file, and similarly all records are joined horizontally.
Join: records are joined based on a common key value and the join type.
What is the difference between the Generate Records component and the Create Data component?
There is no transform function in Generate Records; it creates default data of its own, according to the
DML defined. To change that data, we have to connect some other component after Generate Records
and modify it there.
In Create Data, by contrast, we can write a transform function so that the data is generated per our own
transform function. An index is also defined by default in Create Data.
Conditional DML is used when a data file contains a cluster of records in each row with different data
formats; each record can be read using a different conditional DML based on the record identifier at the
start of each record row.
How can you track the records that are not selected by the 'select' in a REFORMAT
component?
Identify the records you want to deselect in your reformat function and reject them using the force_error
function. The rejected records will come out of the reject port, where you can collect them. Before trying
this, make sure the reject-threshold parameter of the REFORMAT is set to "Never abort", otherwise the
graph will fail on the first reject.
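A sketch of such a transform (the status field and its "deselect" value are assumptions):

out :: reformat(in) =
begin
  out.* :: in.*;
  /* force_error() sends the record to the reject port with the given message */
  out.status :: if (in.status == "DESELECT")
                  force_error("deselected record")
                else in.status;
end;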
But one major risk of this method is that if any record is rejected by the REFORMAT for some other
reason (not due to force_error), it will also come out of the reject port and land among the 'deselected'
records.
Alternatively, you can use an FBE (FILTER BY EXPRESSION) before the REFORMAT component to
implement this functionality.
How does partition by key internally decide which key goes to which partition?
As shell-type parameters are not supported by the EME, how can you use a shell-type
parameter (if you don't want to use PDL) without hampering the lineage diagram?
allocate() is not only meant for assigning default values to vector types: we can use the allocate()
function to initialize any complex data type, record type, or global or local variable. It is a very convenient
way to initialize any DML object, and good coding practice too.
You are always allowed to initialize any DML type manually, but that approach is more error-prone,
whereas if you use allocate() you don't need to think about whether the data type is integer or decimal,
etc.; Ab Initio applies its own logic to initialize the DML types based on the data type of the object, so
there is no risk of invalid assignments.
However, if you need to initialize a DML object to something other than the default values for its data
types, then you have to initialize the object manually instead of using allocate().
From Co>Op 2.15 onward we also have allocate_with_defaults() [same as allocate()] and
allocate_with_nulls() for initializing DML objects.
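A sketch of initializing a DML record type with allocate() (the type and field names are made up):

type my_rec_t =
record
  integer(8)  cnt;
  string("|") name;
end;

let my_rec_t r = allocate();             /* fields set to their type defaults */
let my_rec_t n = allocate_with_nulls();  /* nullable fields set to NULL */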
How can you break a lock in the EME? How can you lock a file so that no one other than the
EME admin can break it?
Why should you not keep the layout as 'default' for an Input Table component?
In which scenario will .rec files not get created even if the graph fails?
.rec files are created only if you have checkpoints enabled in your graph.
Can we have more than one launcher process for a particular graph?
Which function should you use if you do not want to process non-printable characters from a
feed file?
The string_filter_out, make_byte_flags, and test_characters_all functions
OR
use variable-format strings instead of delimited, for example string(integer(2)) my_load_read_string;
OR
string_filter(re_replace(in.line, "[[:cntrl:]]", ""),
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789")
OR
re_replace(in, "[\\x80-\\xff]", " ") or re_replace(in.line, "[^0-9a-zA-Z]+", "")
Why should you not use a checkpoint or phase break directly after REPLICATE?
Having a checkpoint after a REPLICATE stores the entire data flow on disk once for each output flow.
With 2 out ports on the REPLICATE component, the first output flow is processed and built into a lookup
file, and the second output flow is then processed using the lookup information built from the first.