
Data Warehousing Questions:

1. What is Data warehousing?


A data warehouse is a repository of data used for management decision support systems. It consists of a wide variety of data that presents a coherent picture of business conditions at a single point in time. In a single sentence, it is a repository of integrated information that is available for queries and analysis.

A staging area may be required if you have any of the following scenarios:
- Delta loading: your data is read incrementally from the source and you need intermediate storage where each incremental set can be held temporarily for transformation (a small SQL sketch of this case follows this list).
- Transformation need: you need to perform data cleansing, validation, etc. before the data is consumed in the warehouse.
- De-coupling: your processing takes a lot of time and you do not want to stay connected to the source system (which is presumably in constant use by the actual business users) for the whole run; instead you read the data from the source in one go, disconnect, and continue processing on your "own side".
- Debugging: you do not need to go back to the source all the time and can troubleshoot issues (if any) from the staging area alone.
- Failure recovery: the source system may be transitory and the state of its data may be changing; if you hit an upstream failure you may not be able to re-extract the data because the source has changed by then, so having a local copy helps.
Performance and reduced processing are not the only considerations: adding a staging area can sometimes increase latency (the time delay between a business event occurring and its reporting).
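As a hedged illustration of the delta-loading scenario above, the sketch below copies only rows changed since the last load into a staging table before transformation. All table and column names (src.orders, stg_orders, etl_control, last_load_ts) are hypothetical, not from the original text.

-- Hypothetical delta load into a staging table (names are illustrative only).
-- 1. Pull only the rows changed since the previous load from the source system.
INSERT INTO stg_orders (order_id, customer_id, order_amount, updated_at)
SELECT o.order_id, o.customer_id, o.order_amount, o.updated_at
FROM   src.orders o
WHERE  o.updated_at > (SELECT last_load_ts FROM etl_control WHERE job_name = 'orders_load');

-- 2. Cleanse / validate in the staging area, disconnected from the source.
DELETE FROM stg_orders WHERE order_amount IS NULL OR order_amount < 0;

-- 3. Record the new high-water mark for the next incremental run.
UPDATE etl_control
SET    last_load_ts = (SELECT MAX(updated_at) FROM stg_orders)
WHERE  job_name = 'orders_load';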

2. What is Business Intelligence?


Business Intelligence, also known as a decision support system, refers to the technologies, applications, and practices for the collection, integration, and analysis of business-related information or data. It also provides the means to view and report on the data itself.

Difference b/w Data warehousing & Business Intelligence


Data Warehousing helps you store the data while business intelligence helps you to control the data for
decision making, forecasting etc.
Data warehousing, using ETL jobs, stores data in a meaningful form; business intelligence tools were created to query that data for reporting, forecasting, and analysis.
Data warehousing deals with the management of different aspects such as development, implementation, and operation of a data warehouse. It also manages metadata, data cleansing, data transformation, data acquisition, persistence management, and archiving of data.
In business intelligence the organization analyses the measurement of aspects of business such as sales,
marketing, efficiency of operations, profitability, and market penetration within customer groups. The
typical usage of business intelligence is to encompass OLAP, visualization of data, mining data and
reporting tools.

Dimensional Modeling:
It is a logical, rational, and consistent design technique used in data warehousing, and it differs from the entity-relationship model. When applied to relational databases it produces a denormalized star-schema design rather than a fully normalized (3rd normal form) one, and it does not necessarily involve a relational database at all: the logical model can be implemented physically as database tables or flat files. It is one of the techniques used to support end-user queries in a data warehouse.

A virtual data warehouse provides a collective view of the completed data. It has no historical data and can be considered a logical data model over the underlying metadata. Virtual data warehousing is a ‘de facto’ information-system strategy for supporting analytical decision making. It is one of the best ways of translating raw data and presenting it in a form that decision makers can use: it provides a semantic map which allows the end user to view the data as if it were virtualized.

Fundamental stages of Data Warehousing:


1) Offline Operational Database: the stage where data is copied off an operational system to another server, so that report processing runs against the copied data and does not impact the operational system's performance.
2) Offline Data Warehouse: in this stage the data warehouse is updated regularly from the operational systems and its data is stored in a structure designed to facilitate reporting.
3) Real-time Data Warehouse: the data warehouse is updated whenever an operational system performs a transaction.
4) Integrated Data Warehouse: the data warehouse is updated by the operational systems as transactions are performed, and transactions generated in the warehouse are passed back into the operational systems.

What is active data warehousing?


An active data warehouse represents a single, current state of the business. It considers the analytic perspectives of customers and suppliers and helps deliver up-to-date data through reports. A repository of captured transactional data is known as ‘active data warehousing’; using this concept, trends and patterns are found and used for future decision making. An active data warehouse can integrate data changes while scheduled cycle refreshes run. Enterprises use an active data warehouse to draw a statistical picture of the company.

What is data modeling and data mining? What is this used for?
Data Modeling is a technique used to define and analyze the data requirements that support an organization's business processes. In simple terms, it is used to analyze data objects in order to identify the relationships among them in any business. It is the primary step in database design: a conceptual design model is created to depict how the data items are related to each other, and the work then progresses from the conceptual model to the logical model and on to the physical metadata.
Data Mining is a technique used to analyze datasets to derive useful insights/information. It is mainly
used in retail, consumer goods, telecommunication and financial organizations that have a strong
consumer orientation in order to determine the impact on sales, customer satisfaction and profitability.
Data Mining is very helpful in determining the relationships among different business attributes. Analyzing
data from various perspectives and summarizing it into the required and useful information is known as
data mining. Correlations or patterns among different fields in large RDBMS are the technical aspect of
data mining.

Difference between ER Modeling and Dimensional Modeling


The entity-relationship model is a method used to represent entities/objects and their relationships graphically, from which a database is created. It has both a logical and a physical model, and it is well suited to transaction processing and point queries. ER modeling is used for OLTP databases, which hold normalized data in 1st, 2nd, or 3rd normal form.
The dimensional model is a method in which the data is stored in two types of tables, fact tables and dimension tables. It has only a physical model and is good for ad hoc query analysis, which makes it more flexible from the user's perspective. Dimensional modeling is used in data warehouses; its data is de-normalized rather than kept in 3rd normal form.

A snapshot refers to a complete visualization of data at the time of extraction. It occupies less space and can be used to back up and restore data quickly. A snapshot is stored in report format from a specific catalog; the report is generated soon after the catalog is disconnected.
What is Dimension Table?

A dimension table is a table that contains the attributes describing the measurements stored in fact tables. It holds the hierarchies, categories, and logic used to traverse the nodes of those hierarchies.

What is Fact Table?


A fact table contains the measurements of business processes as well as the foreign keys to the dimension tables. The data in a fact table is usually numerical.
Example: if the business process is the manufacturing of bricks, the average number of bricks produced per person/machine is a measure of that business process.
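As a minimal, hedged sketch of the brick-manufacturing example (all table and column names are hypothetical), a fact table carries foreign keys into the dimensions plus a numeric measure:

-- Hypothetical fact table for the brick-manufacturing example above.
CREATE TABLE fact_brick_production (
    date_key        INTEGER NOT NULL,   -- FK to the date dimension
    machine_key     INTEGER NOT NULL,   -- FK to the machine/person dimension
    plant_key       INTEGER NOT NULL,   -- FK to the plant dimension
    bricks_produced INTEGER NOT NULL,   -- numeric measure of the business process
    FOREIGN KEY (date_key)    REFERENCES dim_date (date_key),
    FOREIGN KEY (machine_key) REFERENCES dim_machine (machine_key),
    FOREIGN KEY (plant_key)   REFERENCES dim_plant (plant_key)
);

-- Average bricks produced per machine per day (the measure mentioned above):
SELECT machine_key, AVG(bricks_produced) AS avg_bricks_per_day
FROM   fact_brick_production
GROUP  BY machine_key;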

Define non-additive facts.


Non-additive facts are facts that cannot be summed up across any of the dimensions present in the fact table (adding them produces no meaningful result). Such facts can still be useful when analyzed against changes in the dimensions.
For example, profit margin is a non-additive fact: it is meaningless to add margins up at the account level or the day level.
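To make the profit-margin example concrete, here is a hedged SQL sketch (fact_sales and its columns are hypothetical): summing a margin percentage across rows is meaningless, so the margin is re-derived from the additive facts instead.

-- Wrong: profit margin percentages cannot simply be summed across rows.
-- SELECT SUM(profit_margin_pct) FROM fact_sales;   -- meaningless result

-- Right: re-derive the margin at the desired level from additive facts.
SELECT account_id,
       SUM(profit_amount) / SUM(sales_amount) AS profit_margin
FROM   fact_sales
GROUP  BY account_id;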

A BUS schema is used to identify the common dimensions across business processes (dimensions shared across all enterprise data marts), i.e. to identify the conforming dimensions. A BUS schema consists of conformed dimensions and standardized definitions of facts, so all data marts can use the conformed dimensions and facts without keeping local copies of them.

What are conformed dimensions?


Conformed dimensions are the dimensions which can be used across multiple data marts in combination
with multiple fact tables accordingly. It can refer to multiple tables in multiple data marts within the same
organization.

What is Bit Mapped Index?


Bitmap indexes make use of bit arrays (bitmaps) to answer queries by performing bitwise logical operations. They work well with low-cardinality data, i.e. columns that take few distinct values, and they are particularly useful in data warehousing applications.
Bitmap indexes have a significant space and performance advantage over other structures for such data. Tables with few insert or update operations are good candidates (a hedged example follows the list below).
The advantages of bitmap indexes are:
- They have a highly compressed structure, making them fast to read.
- Their structure makes it possible for the system to combine multiple indexes so that the underlying table can be accessed faster.
The disadvantage of bitmap indexes is:
- The overhead of maintaining them is enormous.
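For example, in Oracle (one engine that offers bitmap indexes) a low-cardinality column can be indexed as below; the table and column names are hypothetical.

-- Oracle syntax; gender and order_status are low-cardinality columns.
CREATE BITMAP INDEX idx_cust_gender  ON customers (gender);
CREATE BITMAP INDEX idx_order_status ON orders (order_status);

-- Queries combining such predicates can be answered by bitwise AND/OR
-- operations on the bitmaps before touching the underlying table.
SELECT COUNT(*)
FROM   orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE  c.gender = 'F' AND o.order_status = 'SHIPPED';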

Conformed facts:
Conformed facts are facts that keep the same name and definition across different fact tables. A dimension table that can be used by more than one fact table is referred to as a conformed dimension; it is used across multiple data marts in combination with multiple fact tables. The facts in an application can be used without further modification, and without changing the metadata of the conformed dimension tables, so they can be compared and combined mathematically. Conformed dimensions can be used across multiple data marts and have a static structure.

OLTP: manages transaction-based applications that modify high volumes of data. Typical examples of such transactions are seen in banking and airline ticketing. Because OLTP uses a client-server architecture, it supports transactions running across a network.
OLAP: performs analysis of business data and provides the ability to perform complex calculations on usually low volumes of data. OLAP helps the user gain insight into data coming from different sources (multi-dimensional).

OLTP vs OLAP:
- OLTP is an online database modification system; OLAP is an online query/analysis management system.
- OLTP data is normalized (3NF); OLAP data is typically not normalized.
- An OLTP database must maintain data integrity constraints; OLAP is not concerned with them.
- OLTP response times are fast; OLAP response times are comparatively slow.
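A hedged illustration of the workload difference (all tables are hypothetical): an OLTP statement touches one record by key, while an OLAP query aggregates many rows across dimensions.

-- OLTP: short, indexed, single-row transaction.
UPDATE accounts SET balance = balance - 500 WHERE account_id = 1234567;

-- OLAP: analytical query scanning and aggregating large volumes.
SELECT d.year, r.region_name, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date   d ON d.date_key   = f.date_key
JOIN   dim_region r ON r.region_key = f.region_key
GROUP  BY d.year, r.region_name;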

What are cubes?


A data cube stores data in a summarized version which helps in a faster analysis of data. Multi-
dimensional data is logically represented by Cubes in data warehousing. The dimension and the data are
represented by the edge and the body of the cube respectively. OLAP environments view the data in the
form of hierarchical cube. A cube typically includes the aggregations that are needed for business
intelligence queries.
For example, using a data cube a user may want to analyze the weekly and monthly performance of an employee; here, month and week would be dimensions of the cube.
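In relational engines that support GROUP BY CUBE (e.g. Oracle, SQL Server), the aggregations a cube holds can be sketched as below; the employee-performance tables and columns are hypothetical.

-- Pre-aggregates performance for every combination of month and week,
-- i.e. the cells a week/month cube of employee performance would hold.
SELECT e.employee_id, d.month, d.week, SUM(f.tasks_completed) AS performance
FROM   fact_performance f
JOIN   dim_employee e ON e.employee_key = f.employee_key
JOIN   dim_date     d ON d.date_key     = f.date_key
GROUP  BY e.employee_id, CUBE (d.month, d.week);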

Snowflake Schema: the fact table is at the center, surrounded by dimension tables, and the dimension tables are further broken down into additional dimension tables. The schema is inclined towards normalization, which produces deeper chains of joins because the tables are split further.
For example, the dimension tables might include employee, project, and status, with the status table further broken into status_weekly and status_monthly.

Sequence clustering algorithm:


Microsoft Sequence Clustering algorithm is a sequence analysis algorithm provided by Microsoft SQL
Server Analysis Services. This algorithm is used to explore data that contains events that can be linked by following paths, or sequences; it finds the most common sequences by grouping, or clustering, sequences that are identical. Examples of such data include:
1. Click paths that are created when users navigate or browse a Web site.
2. Logs that list events preceding an incident, such as hard disk failure or server deadlocks.
3. Transaction records that describe the order in which a customer adds items to a shopping cart at an
online retailer.
4. Records that follow customer (or patient) interactions over time, to predict service cancellations or
other poor outcomes.
It finds clusters of cases that contain similar paths in a sequence.
E.g. the sequence clustering algorithm may help find the best path for storing products of a "similar" nature in a retail warehouse.

What is surrogate key? Explain it with an example.


Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not supplied by the user but generated by the system. The difference between a primary key and a surrogate key in some databases is that the PK uniquely identifies a record while the SK uniquely identifies an entity.

e.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000. The primary key uniquely identifies the record, while the surrogate key is generated by the system (say, a serial number), since the SK is NOT derived from the data.
For example, a sequential number can be a surrogate key.
Any column or combination of columns that can uniquely identify a record is a candidate key.
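A hedged sketch of a system-generated surrogate key (Oracle-style sequence shown; an identity column works similarly). The dim_employee table and its columns are hypothetical.

-- The surrogate key is generated by the system, not derived from the data.
CREATE SEQUENCE emp_sk_seq START WITH 1 INCREMENT BY 1;

CREATE TABLE dim_employee (
    employee_sk   INTEGER PRIMARY KEY,  -- surrogate key (system generated)
    employee_id   VARCHAR(20),          -- natural/business key from the source
    employee_name VARCHAR(100),
    hire_date     DATE
);

INSERT INTO dim_employee (employee_sk, employee_id, employee_name, hire_date)
VALUES (emp_sk_seq.NEXTVAL, 'E1001', 'John Smith', DATE '1999-06-15');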

What is the purpose of Fact less Fact Table?


A factless fact table simply contains keys that refer to the dimension tables. It does not hold measures as such, but is commonly used to track the occurrence of an event.
e.g. finding the number of leaves taken by an employee in a month.
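A hedged sketch of the leave-tracking example (hypothetical names): the factless fact table carries only dimension keys, and the "measure" is obtained by counting rows.

-- Factless fact table: only foreign keys, no numeric measures.
CREATE TABLE fact_employee_leave (
    employee_key INTEGER NOT NULL,
    date_key     INTEGER NOT NULL
);

-- Number of leaves taken by each employee in a month = a simple row count.
SELECT f.employee_key, d.month, COUNT(*) AS leaves_taken
FROM   fact_employee_leave f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP  BY f.employee_key, d.month;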

Level of Granularity of a fact table:


Granularity is the lowest level of information stored in the fact table; the depth of the data is known as its granularity. In a date dimension, the level of granularity could be year, quarter, month, period, week, or day.
The process consists of the following two steps:
- Determining the dimensions that are to be included
- Determining where to place the hierarchy of each dimension of information

Star and snowflake schemas:


Star schema: a highly de-normalized technique. A star schema has one fact table associated with numerous dimension tables, so the layout depicts a star.
Snowflake schema: a star schema to which normalization principles have been applied is known as a snowflake schema; every dimension table is associated with sub-dimension tables.
A dimension table has no parent table in a star schema, whereas in a snowflake schema a dimension table has one or more parent tables.
In a star schema the hierarchies of a dimension are kept within the dimension table itself, whereas in a snowflake schema the hierarchies are split into separate tables. Data can be drilled down from the topmost hierarchy level to the lowermost level (a hedged DDL sketch follows below).
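A hedged DDL sketch of the difference (all tables are hypothetical): in the star schema the employee dimension keeps its hierarchy in one denormalized table, while the snowflake version splits the hierarchy into a parent table.

-- Star schema: one denormalized dimension, no parent table.
CREATE TABLE dim_employee_star (
    employee_key        INTEGER PRIMARY KEY,
    employee_name       VARCHAR(100),
    department_name     VARCHAR(100),   -- hierarchy kept inside the dimension
    department_location VARCHAR(100)
);

-- Snowflake schema: the hierarchy is split out into a parent dimension table.
CREATE TABLE dim_department (
    department_key      INTEGER PRIMARY KEY,
    department_name     VARCHAR(100),
    department_location VARCHAR(100)
);

CREATE TABLE dim_employee_snow (
    employee_key   INTEGER PRIMARY KEY,
    employee_name  VARCHAR(100),
    department_key INTEGER REFERENCES dim_department (department_key)
);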

What is junk dimension?


In scenarios where certain data is not appropriate to store in the main schema, those attributes can be stored in a junk dimension. The data in a junk dimension usually consists of Boolean or flag values; a single junk dimension is formed by lumping together a number of small, unrelated attributes.
e.g. whether a customer's performance was good enough to qualify for a facility, or free-text comments on performance.

What is Data Cardinality?


Cardinality denotes the number of occurrences of data on either side of a relation.
High data cardinality:
the values of a data column are very uncommon (mostly unique),
e.g. email IDs and user names.
Normal data cardinality:
the values of a data column are somewhat uncommon but not unique,
e.g. a data column containing LAST_NAME (there may be several entries with the same last name).
Low data cardinality:
the values of a data column are very common,
e.g. flag statuses: 0/1.
Determining data cardinality is a substantial aspect of data modeling; it is used to determine the relationships (a hedged query for gauging a column's cardinality follows the list below).
The Link Cardinality - 0:0 relationships
The Sub-type Cardinality - 1:0 relationships
The Physical Segment Cardinality - 1:1 relationship
The Possession Cardinality - 0: M relation
The Child Cardinality - 1: M mandatory relationship
The Characteristic Cardinality - 0: M relationship
The Paradox Cardinality - 1: M relationship.
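A quick, hedged way to gauge the cardinality of a column before modeling it (the customers table and email_id column are hypothetical):

-- Ratio of distinct values to total rows: close to 1 = high cardinality
-- (e.g. email_id), very small = low cardinality (e.g. a 0/1 flag).
SELECT COUNT(DISTINCT email_id)                   AS distinct_values,
       COUNT(*)                                   AS total_rows,
       COUNT(DISTINCT email_id) * 1.0 / COUNT(*)  AS cardinality_ratio
FROM   customers;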

How to load Time dimension?
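A time (date) dimension is normally not loaded from source data at all; it is pre-populated once by a small script or SQL statement that generates one row per calendar date, with derived attributes such as day of week, month, quarter, and year, for the date range the warehouse covers. A hedged, PostgreSQL-flavored sketch (dim_date and its columns are hypothetical; exact date functions vary by database):

-- One row per calendar day for 2020, with derived attributes.
INSERT INTO dim_date (date_key, full_date, day_of_week, month_num, quarter_num, year_num)
SELECT CAST(TO_CHAR(d, 'YYYYMMDD') AS INTEGER) AS date_key,
       d::date                                 AS full_date,
       TO_CHAR(d, 'Dy')                        AS day_of_week,
       EXTRACT(MONTH   FROM d)                 AS month_num,
       EXTRACT(QUARTER FROM d)                 AS quarter_num,
       EXTRACT(YEAR    FROM d)                 AS year_num
FROM   generate_series(DATE '2020-01-01', DATE '2020-12-31', INTERVAL '1 day') AS t(d);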

Ab Initio:
1. Co>Operating System: it runs on top of the native operating system, is provided by Ab Initio, and is the base for all Ab Initio processes. It can be installed on operating systems such as UNIX, Linux, and IBM AIX, and provides the air commands. It:
– manages and runs Ab Initio graphs and controls the ETL processes
– provides extensions to the native operating system
– supports monitoring and debugging of ETL processes
– handles metadata management and interaction with the EME

GDE: the design component, used to build and run Ab Initio graphs. Graphs are formed from components (predefined or user-defined), flows, and parameters; they represent the ETL process in Ab Initio. The GDE provides the ability to run and debug graphs, view job logs, and trace execution.

Enterprise Meta-Environment (EME): the environment for storage and metadata management (both business and technical metadata). The metadata can be accessed from the Graphical Development Environment, from a web browser, or from the Co>Operating System command line. It is the Ab Initio repository for all project objects.

Relation between EME, GDE and Co-ops ?


Cooperating system is the core system. It is the Abinitio server. All the graphs which are made in GDE
are deployed and run on cooperating system. It is installed on UNIX.

EME stands for Enterprise Meta environment. It is a repository which holds all the projects, metadata,
and transformations. It performs operations like version controlling, statistical analysis, and dependency
analysis and metadata management.

GDE stands for graphical development environment and is just like a canvas on which we create our
graphs with the help of various components. It just provides graphical interface for editing and executing
Abinitio programs.

2. What does dependency analysis mean in Ab Initio?


It is the analysis of the dependencies within and between graphs. It traces how data is transformed and transferred, field by field, from component to component, and helps maintain the lineage among related objects.
What are the steps in actual ab initio graph processing?

1. The host setup script is run.


2. Common project (sandbox) parameters are evaluated. (if any)
3. Project (sandbox) parameters are evaluated.
4. The project-start.ksh script is run.
5. Input parameters are evaluated. (if any)
6. Graph parameters are evaluated. (if any)
7. The graph Start Script is run. (if any)
8. Graph components are executed and finally
9. End script (if any)

Q. There are 10,000 records in the source; 4,000 were loaded today, and records 4,001-10,000 need to be loaded the next day. How is this done for Type 1 and Type 2?
Simply take a Reformat component and put next_in_sequence() > 4000 in its select parameter.

Q. What is difference between fuse and join?


Fuse: a component that appends data horizontally; for example, with two input files, the first record of the first file is joined horizontally with the first record of the second file.
Join: records are joined based on a common key value and the join type.

Q. Continuous graph significance?


Consider a continuous graph in which a Generate Records component runs in batch mode and everything must be written to a Multi-publish queue, with a Throttle in between. The requirement is that if there are 1,000 records and the graph fails part-way through, say at the 500th record, the first 499 records should already be loaded into the Multi-publish queue, and when the graph is restarted it should resume from the 500th record rather than from the 1st.

Reformat vs Redefined format:


Reformat Changes the record format of your data by dropping fields or by using DML expressions to
add fields, combine fields, or modify the data.
Redefine Format  Copies data records from its input to its output without changing the values. Use
Redefine Format to change a record format or rename fields.

Redefine format is best suitable in the following scenario:


Suppose we are reading the data from the input as a single string per line, and we want to map that string into different fields without changing the data. By specifying the target DML on the output port of the Redefine Format component, the data gets mapped to the different fields.

What is difference between check point and phase?


Phases divide the graph into parts that execute one after the other, to reduce complexity and avoid deadlocks; part of memory is allocated to each phase in turn (memory management).
Checkpoints are intermediate nodes that save data to disk permanently; this data has to be deleted manually. If a checkpoint completes successfully, we can always roll back and rerun the graph from that point in case of a later failure.

How can you increase the number of ports of your output flow? What is the limit?
We can use the count parameter; the limit is 20, and this applies only to the REFORMAT component. The count parameter cannot be used on other components such as Join or Fuse to increase the number of out ports.

How to create a new mfs file? Where will we specify the number of partition 4 way ,8 way?
To create a new mfs file you can use m_touch <filename>, but you have to be sure that you are in a multi-directory. After creating the multifile, a cat <multifilename> shows how many partitions it has.
You need a multifile system of 4-way or 8-way depth to create multifiles of the corresponding depth; m_mkfs can be used to create the desired mfs.
In practice a developer never creates an MFS path; that is done by the Ab Initio administrator, who creates different mfs path parameters such as:
AI_MFS: default MFS path.
AI_4way_MFS: for creating 4-way mfs files, and so on.
We only use the required path in the output file URL to create the desired mfs file.
To convert a 4-way to an 8-way partition we need to change the layout in the partitioning component. There are separate parameters for each type of partitioning, e.g. AI_MFS_HOME, AI_MFS_MEDIUM_HOME, AI_MFS_WIDE_HOME, etc.
The appropriate parameter needs to be selected in the component layout for the required type of partitioning.

3. How to Create Surrogate Key using Ab Initio?


We can create surrogate key in abinitio by using the following ways:
1. by using assign_keys component
2. by using scan component
3. By using next_in_sequence()

How to prepare SCD2 in abinitio?


1. Take 2 inputs: the first is today's file (in0) and the second is the previous day's file (in1).
2. Take an inner join of both inputs on the matching key, say cust_id. In the out port DML make the format embedded and add the suffix _new to fields from today's file and _old to fields from the previous day's file, so the fields are easier to tell apart.
3. In the Join component, unused0 will give you the inserted records (those only in today's file) and unused1 will give you the deleted records (those only in yesterday's file).
4. Connect a Reformat to the Join output port and check: if any _new field differs from its _old counterpart (these are the suffixes given in the DML of the Join output port), force an error. The reject port of the Reformat then carries the updated records, and the unchanged records come out of its output port.
5. Combine the inserted records from unused0, the updated records from the Reformat reject port, and the unchanged records from the Reformat out port, and load all of them into the delta table.
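For comparison only, here is a hedged SQL sketch of the same SCD2 delta logic (this is not the Ab Initio graph itself; dim_customer, stg_customer, and their columns are hypothetical): changed rows are expired and re-inserted with new effective dates, and brand-new rows get a first version.

-- 1. Expire the current version of rows whose attributes changed today.
UPDATE dim_customer d
SET    effective_end_date = CURRENT_DATE, current_flag = 'N'
WHERE  d.current_flag = 'Y'
AND    EXISTS (SELECT 1 FROM stg_customer s
               WHERE  s.cust_id = d.cust_id
               AND    s.cust_address <> d.cust_address);

-- 2. Insert a new version for changed rows and a first version for new rows.
INSERT INTO dim_customer (cust_id, cust_address, effective_start_date,
                          effective_end_date, current_flag)
SELECT s.cust_id, s.cust_address, CURRENT_DATE, DATE '9999-12-31', 'Y'
FROM   stg_customer s
WHERE  NOT EXISTS (SELECT 1 FROM dim_customer d
                   WHERE  d.cust_id = s.cust_id
                   AND    d.current_flag = 'Y'
                   AND    d.cust_address = s.cust_address);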

4. .abinitiorc file is used to provide parameters for remote connectivity. You can access abinitio
resources (e.g., EME) on a different server by providing the connection method, and authentication
details in the .abinitiorc file.
.abinitiorc can be placed in two locations:
1. In the $HOME directory of each user
2. In the config directory of the Co>Op
In case both exist, the first one (in $HOME dir) will take precedence over the second (in config). You
specify telnet or ftp or NDM ports by setting the configuration variables AB_TELNET_PORT or
AB_FTP_PORT in your .abinitiorc file

5. What is the difference between sandbox and EME, can we perform checkin and checkout
through sandbox?
The Enterprise Meta Environment is the central repository, and the sandbox is the private area into which you bring objects (by doing an object-level checkout) from the repository for editing. Once you have finished editing, you check the object back into the EME (object-level check-in) from the sandbox.
The EME is the version-control unit of Ab Initio; it can be called the central repository, whereas the private sandbox is user-specific space that mirrors the EME. A developer works through the sandbox and can safely check out and check in without conflicting with other developers.

6. Describe the effect of the "checkpoint" t-sql statement?


Checkpoints are normally used for graph recovery: if we are loading a large volume of data and the graph fails, then instead of rerunning the whole graph we can execute it from the last successful checkpoint, which saves time and continues loading from the point of failure. Checkpoints save the intermediate files during graph execution.

7. Layout is where the program (component) runs. Based on the layout given, Ab Initio tries to run the component at that physical location.
Ways to give a layout: from a neighboring component;
URL: such as $AI_SERIAL or $AI_PARALLEL (a mount-point location such as /opt/apps/ppl/serial);
Database: where the database runs.
The layout of a component basically specifies whether the component processes the data serially or in parallel.
Depth specifies the degree of partitioning; at run time this is resolved to an actual depth. It might be 2-way in development and 4-way in production: the graph's layout doesn't change, and the depth is determined by the environment, not by the graph itself.

8. How do we handle if DML changing dynamically?


1) It can be handled in the startup script with dynamic SQL creation and create dynamic DML so that
there will be no need to change the component henceforth
2) By Passing DML and XFR values at run time
3) By Using PDL, we can generate the DMLs dynamically

9. How will you view or publish metadata reports using EME?


By using m_dump we can print information about data and metadata.
To view multifile details:
m_dump <dml path> <file path>

10. How to Improve Performance of graphs in Ab initio?


1. Use partitioning in the graph where possible.
2. Minimize the number of components.
3. Use lookups for better efficiency.
4. Components like Join/Rollup should have the "Input must be sorted" option when they are placed after a Sort component.
5. If a component has the "In memory: input need not be sorted" option selected, use the MAX_CORE parameter value efficiently.
6. Use phasing of the graph efficiently, and try to use as few phases as possible.
7. Ensure that in all graphs where RDBMS tables are used as input, the join condition is on indexed columns.
8. Perform sort or aggregation operations on the source tables at the database server itself, instead of doing them in Ab Initio, where appropriate.
9. Use parallelism, but efficiently.
10. Use component folding.
11. Use a tuned query inside the Input Table component; this can give a huge performance improvement.
12. Minimize the use of components that do not allow pipeline parallelism.
13. Do not use huge lookups.
14. If the data is not huge, use the in-memory sort option.

Q.11. .dbc vs .cfg ?


.dbc is used to establish the connection between the database server and the Ab Initio server. To generate a .dbc file we configure parameters such as the database name, version, username/password, host name, etc. We create this file to work with the database components.
.cfg file: where the environment variables are declared in order to support the application and multiple environments; a .cfg file can be fed into a Korn shell script as an auto parameter-driven shell script. .cfg files are the older style of database configuration file used with the 2.1 database components.
.dbc is an Ab Initio-defined file, whereas .cfg is UNIX-defined.

Q.12. How to get DML using Utilities in UNIX?


m_db gendml will help you get the DML from Database.
cobol-to-dml and xml-to-dml are other utilities from command line we can use to get DML's.

Q.13. what is skew and skew measurement?


Skew measures the relative imbalance in Parallel loading. Un-even Load balancing causes the Skew. Skew
is the measure of data flow to each partition.
The skew of a data partition is the amount by which its size deviates from the average partition size
expressed as a percentage of the largest partition.

Q.14. How to run the graph without GDE?


We can do this in 2 ways:
1. Deploy the graph as a script and execute the script in the UNIX environment.
2. Use the air sandbox run command to run the graph directly from the command line:
air sandbox run graph.mp

Q.15. What is $mpjret?


$mpjret is a variable set in the end script which returns the exit status of the graph's execution.

It is very similar to $? in UNIX:

if [ $mpjret -eq 0 ]; then
    echo "success"
else
    mailx -s "<graph name> failed" <mail id>
fi

or

air sandbox run graph.mp
if [ $? -ne 0 ]; then
    echo "Graph failed"
fi

Q.16. How Does MAXCORE works?


Maximum memory usage in bytes which can be used by component (i.e. Join, Sort and Rollup) to process
records, before spilling data on the disk. (Default value 64 MB)

Q. What is meant by fanning (flow types) in Ab Initio?


The types of flows are:
1. straight flow
2. fan-in flow
3. fan-out flow
4. all-to-all flow
Q. Ramp/Limit:
When the reject threshold is set to "Use ramp/limit", we need to supply ramp and limit values.
Ramp: a real number defining the allowed rate of rejected records.
Limit: the base number of records that can be rejected.

Q. Partition by Key distributes the data into the various multifile partitions depending upon the key field(s) present in the input, while Partition by Round-robin distributes data evenly among the partitions irrespective of any key field (in round-robin fashion, based on the block-size parameter).
With PBK, data flows to the output partitions in an apparently random manner, whereas with PBRR, data flows to the outputs in an orderly fashion.

Q. How many parallelisms are in Abinitio?


Component parallelism:- A graph with multiple processes running simultaneously on separate data or
same data uses component parallelism.
Data parallelism: - A graph that deals with data divided into segments and operates on each segment
simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data
parallelism. To support this form of parallelism, Ab Initio provides Partition components to segment data,
and De-partition components to merge segmented data back together.
Pipeline parallelism: - A graph with multiple components running simultaneously on the same data
uses pipeline parallelism. Each component in the pipeline continuously reads from upstream components,
processes data, and writes to downstream components. Since a downstream component can process
records previously written by an upstream component, both components can operate in parallel. NOTE:
To limit the number of components running simultaneously, set phases in the graph.

Have you ever encountered an error called "depth not equal"?


When two components are linked together and their layouts do not match, this problem can occur during compilation of the graph. A solution is to place a partitioning component in between wherever the layout changes.

Name the air commands in ab initio?


1) air object ls
2) air object rm
3) air object cat
4) air object versions
5) air project show
6) air project modify
7) air lock show
8) air lock show -user <UNIX User ID> -- shows all the files locked by a user in various projects.
9) air sandbox status

What is air_project_parameters and air_sandbox_overrides ?


.air-project-parameters: Contains the parameter definitions of all the parameters within a sandbox.
.air-sandbox-overrides: It contains the user's private values for any parameters in .air-project-parameters
that have the Private Value flag set.

Q. What is meant by re-partioning in how many ways it can be done?


Repartitioning means redistributing records across partitions, i.e. changing one or both of the following:
1) the degree of parallelism of partitioned data
2) the grouping of records within the partitions of partitioned data

Q. How to calculate total memory used my a graph?


Resource utilization by an abinitio process can be tracked by means of EME tracking mechanism
Q. If I delete 1 partition (in 8 partition multifile) and run the graph. Will the graph run
successfully?
It will fail giving error "failed to open file (with the path to the file partition)"

Output for sort and dedup sort with NULL key ?


With a NULL key ({}), the component treats the entire input as a single group, and records keep their input order.
Output for Sort when the key is {}: the Sort component does not perform any ordering on the data; the output is the same as the input.
Output for Dedup Sorted when the key is {}:
1. If the keep parameter is first, it gives the first record of the input.
2. If the keep parameter is last, it gives the last record of the input.
3. If the keep parameter is unique-only, it gives zero records in the output,
unless the input file contains only one record, in which case that one record is output.

Q. How to passing parameter to Oracle Stored Procedure in graph?


RUN SQL COMPONENT
exec proc-name (parameters)
exec proc-name ('$DATE')
exec proc-name ('20080101')

Q. Header, Trailer and Body segregation and then reverse rank ?


1. Segregate the header, trailer, and body using either a Partition by Expression component or a Filter by Expression. Use next_in_sequence() to add a new column to the data records, then sort on that column in descending order.

2) i/p --> Dedup Sort (keep first) --> Dedup Sort (keep last) --> Reformat (out.Rank :: next_in_sequence()) --> Sort (Rank, descending) --> o/p

3) Use Dedup Sort to remove the header and trailer records, then generate a sequence number with a Scan component and sort on the sequence number (descending).

*** The simplest way to remove header and trailer records is to use a Dedup Sorted component with a NULL key, which treats the entire record set as a single group; then use the keep_first and keep_last modes of Dedup Sorted to select the header and trailer records.

Another way to accomplish a similar result is to use a Rollup component on a NULL key, using the first and last functions in the rollup to select the header and trailer records.

Q. In MFS I developer developed 2-way, but supporters are supporting 4-way on same
records how is possible?
First connect the 2-way input file to a Gather component, then connect it to a Partition by Expression (with a Sort component if needed), and finally connect the target file, which is a 4-way mfs file.
For any change in partitioning, we have to change the AI_MFS_DEPTH_OVERRIDE parameter in the .air-project-parameters file.
Alternatively, re-partition the data to the required depth using partition components; of course you need an MFS of that depth.

Q. READ MULTIFILE COMPONENT:


Read Multiple Files is useful for reading records from a number of different target files. It extracts the
filenames of these files from the component’s input flow. Then it reads the records from each target file
and writes the records to the output port. (You can set optional parameters to control how many records
Read Multiple Files skips before starting to read records, and the maximum number of records it reads.)
An optional transform function allows you to manipulate the records or change their formats before they
are written as output.

Reject vs Unused in JOIN?


Records that cause the transform to fail (for example, by evaluating to NULL) or that do not match the DML go to the reject port; the component aborts once the number of rejects exceeds limit + ramp * number_of_input_records_so_far. Records that do not participate in the join (no matching key on the other input) go to the unused port.

What will happen if we pass null key to join?


If you are passing null key to join component, you will get Cartesian product of records of both the input
ports.

Q2. What will happen if we pass null key to scan?


Scan is a multistage component but when you pass a null key to scan it will give all the records as output

Q3. What will happen if we pass null key to dedup sort?


In case of dedup component if we give keep parameter as first it will give the first record as the output if
we give keep parameter as last it will give the last record as the output but when we give the keep
parameter as unique it will give no records as the output.

Q4. What will happen if we pass null key to rollup?


If we give null key to rollup its output will have only one record, it will consider all the records as one
group. It is useful when we need to count the number of records in the port.

Parallelism:
Component parallelism: An application that has multiple components running on the system
simultaneously. But the data are separate.
Data parallelism: Data is split into segments and runs the operations simultaneously.
Pipeline parallelism: An application with multiple components but running on the same dataset.

What is a multifile system?


Multifile is a set of directories on different nodes in a cluster. They possess an identical directory
structure. The multifile system leads to a better performance as it is parallel processing where the data
resides on multiple disks. It is created with the control partition on one node and data partitions on the
other nodes to distribute the processing in order to improve the performance.

How do you improve the performance of a graph?


1) Reduce the use of multiple components within a single phase.
2) Use refined, well-defined max-core values for Sort and Join components and tune MAX_CORE for optimal performance.
3) Minimize the use of regular-expression functions such as re_index in transform functions.
4) Minimize sorted Join components; if possible, replace them with in-memory/hash joins.
5) Carry only the required fields through Sort, Reformat, and Join components.
6) Use phases or flow buffering in the case of merges or sorted joins.
7) Use a hash join if the two input sets are small; otherwise choose a sorted join for large input sizes.
8) For large datasets, avoid using Broadcast as a partitioner.
9) Reduce the number of Sort components in the processing.
10) Avoid repartitioning data unnecessarily.
11) Use an MFS layout with Partition by Round-robin wherever possible.
12) If needed, use lookup_local rather than lookup when the data is large.
13) Remove unnecessary components such as Filter by Expression; implement the logic in a Reformat/Join/Rollup instead.
14) Use Gather instead of Concatenate.
15) Try to avoid too many phases.

A) If a Sort is placed in front of a Merge component it serves no purpose, because sorting is built into Merge.
B) Use a lookup instead of a Join or Merge component where appropriate.
C) If we want to join the data coming from 2 files and do not want duplicates, we can use a union-style combination instead of adding an additional dedup component.

Ab Initio vs Informatica, feature by feature:
- About the tool: Ab Initio is code-based ETL; Informatica is engine-based ETL.
- Parallelism: Ab Initio supports 3 types of parallelism; Informatica supports 1 type (pipeline).
- Scheduler: Ab Initio has no built-in scheduler; Informatica supports scheduling through scripts.
- Error handling: Ab Initio can attach separate error and reject files; Informatica uses one file for all.
- Robustness: Ab Initio is robust by function comparison; Informatica is basic in terms of robustness.
- Feedback: Ab Initio provides performance metrics for each component executed; Informatica offers debug mode, but with slow implementation.
- Delimiters: Ab Initio supports multiple delimiters; Informatica supports only a dedicated delimiter.

*** pipeline "parallelism" it's achieved automatically in power center; a thread it's spawned for every
partition point (to make it simple you have a thread for the source, the target, the transformation , the
aggregator. it's also spawn additional threads for building the lookups in parallel (you can control the
number of the threads in the session configuration)
Partition parallelism needs partition option license and it has to be specified manually for each partition
point in the session configuration. It doesn’t span different processes for every partition but threads
I achieved a transformation rate 250000 records/sec (35m of records in 2 minutes 20 seconds) on a x86
machine with 4 cores using partitioning on a session with 8 lookups per record and aggregating data
(aggregators in 8.5+ are really blazing fast) reading a partitioned oracle table from the network.

Max-core: This parameter controls how frequently a component should dump data from memory to disk

In Ab initio, dependency analysis is a process through which the EME examines a project entirely and
traces how data is transferred and transformed- from component-to-component, field-by-field, within and
between graphs.

Replicate component combines the data records from the inputs into one flow followed by writing a
copy of that flow to each of its output ports.

How can you run a graph infinitely in Abinitio?.


Graph end script should call the .ksh file of the graph.

What is Graph level local and formal parameter?


Both are graph-level parameters. For a local parameter you must initialize the value at declaration time, whereas a formal parameter does not need to be initialized; it prompts for the value when the graph is run.

A SANDBOX is referred for the collection of graphs and files related to it that are saved in a single
directory tree and behaves as a group for the reason of navigation, version control and migration.

What information does a .dbc file provide to connect to the database?
The .dbc file provides the GDE with the information needed to connect to the database:
1. The name and version number of the database to which you want to connect
2. The name of the computer on which the database instance or server runs, or on which the database remote-access software is installed
3. The name of the server, database instance, or provider to which you want to link

Difference between a "lookup file" and "lookup" in Ab Initio?

A lookup file defines one or more serial files (flat files); it is a physical file where the data for the lookup is stored. Lookup, on the other hand, is part of an Ab Initio graph where we can hold data in memory and retrieve it using a key parameter.

Different types of parallelism

Component parallelism: A graph with multiple processes executing simultaneously on separate data
Data parallelism: A graph that works with data divided into segments and operates on each segment
simultaneously, uses data parallelism.
Pipeline parallelism: A graph that deals with multiple components executing simultaneously on the
same data uses pipeline parallelism. Each component in the pipeline read continuously from the upstream
components, processes data and writes to downstream components. Both components can operate in
parallel.

Roll Up vs Aggregator?
1) Rollup can perform some additional functionality, like input filtering and output filtering of records.
2) Aggregate does not expose the intermediate results held in main memory, whereas Rollup can.
3) Analyzing a particular summarization is much simpler with Rollup than with Aggregate.

A lookup file represents a set of serial or flat files; it is a specific keyed data set. The key is used for mapping values based on the data available in a particular file, and the data set can be static or dynamic.
A hash join can be replaced by a Reformat that uses a lookup, provided the input used as the lookup contains a small number of relatively short records. Ab Initio provides functions for retrieving values from a lookup using the key.

A lookup file is an indexed dataset and actually consists of two files: one holds the data and the other holds a hash index into the data file. We commonly use a lookup file to hold in physical memory the data that a transform component needs to access frequently.
lookup ("Level of the MyLookupFile", in.key)
If the lookup file key's Special attribute (in the Key Specifier Editor) is exact, the lookup functions return a
record that matches the key values and has the format specified by the RecordFormat parameter.

Local Lookup:
Lookup files can be partitioned (multifiles)
If the component is running in parallel and we use a _local lookup function Co>Operating System splits
lookup file into partitions.
The benefits of partitioning lookup files are:
1. The per-process footprint is lower. This means the lookup file as a whole can exceed the 2 GB limit.
2. If the component is partitioned across machines, the total memory needed on any one machine is
reduced.

Dynamic Lookup
A disadvantage of static lookup file is that the dataset occupies a fixed amount of memory even when the
graph isn’t using the data. By dynamically loading lookup data, we control how many, which and when
lookup datasets are loaded. This control is useful in conserving memory; applications can unload datasets
that are not immediately needed and load only the ones needed to process the current input record.
1. Load the dataset into memory when it is needed.
2. Retrieve data with your graph.
3. Free up memory by unloading the dataset after use.
To look up data dynamically we use LOOKUP TEMPLATE component:
let lookup_identifier_type LID =lookup_load(MyData, MyIndex, "MyTemplate", -1)

Where LID is a variable to hold the lookup ID returned by the lookup_load function. This ID references
the lookup file in memory. The lookup ID is valid only within the scope of the transform.
MyData is the pathname of the lookup data file.
MyIndex is the pathname of the lookup index file.
If no index file exists, we must enter the DML keyword NULL. The graph creates an index on the fly.

** In a lookup template, we do not provide a static URL for the dataset’s location as we do with a lookup
file. Instead, we specify the dataset’s location in a call to the lookup_load function when the data is
actually loaded.

How to Perform Lookup When Key Field in Lookup is Null?


Consider using first_defined() or is_error() to handle this issue.
When passed a NULL argument to use as a key or subkey, these functions try to match the NULL value to
the corresponding key or subkey of a record in the lookup file. If there are NULL values in the fields of a
lookup file and NULL values in the parameters passed to a lookup function, the values will match and the
function will return a record.

Explicit Join (Semi Join):


It uses all records in one specified input, while records with matching keys in the other inputs are optional; a NULL record is used for the missing records.
Case 1 (In0 required): a record is required on input 0, but the presence of a record with the same key on input 1 is optional. There are two key combinations that can produce NULL records here, so it will again be necessary to prioritize rules in the transform.
Case 2 (In1 required): a record is required on input 1, but the presence of a record with the same key on input 0 is optional.
Duplicate records in the input tables will cause multiple records in the output table as a result of the Cartesian product, i.e. duplicate records (two each) on both inputs will cause 4 records in the output file.

How to See List of all Objects Related to One Tag in Abinitio


air tag ls -e <tag_name>

Difference between Output-Index and Output-Indexes in Reformat Component:


Output Index and Output Indexes are two optional transform functions available in Reformat; they are useful when the count parameter is set greater than 1.
The output_index function returns a number (the out-port number of the Reformat); the current input record is routed to that port and the transform function associated with it is executed.
With the output_indexes function you can direct the current input record to more than one port.
With output_index a single input record can go to only one transform/output port, whereas with output_indexes one record can go to multiple transform/output ports.

Ramp/Limit:
A limit is an integer parameter representing the number of reject events allowed.
The ramp parameter contains a real number representing the permitted rate of reject events relative to the number of records processed.
The formula is: number of bad records allowed = limit + (number of records processed x ramp).
A ramp is a proportion with a value from 0 to 1.
Together, these two provide the threshold for bad records.

The Rollup component allows users to group records on certain field values.
It is a multi-stage transform:
- To keep counts (or other aggregates) for a group, Rollup uses temporary variables.
- The initialize function is invoked first for each group.
- The rollup function is called for each record in the group.
- The finalize function is called only once, at the end of the last rollup call for the group.

How To Add Default Rules In Transformer?


Open the Add Default Rules dialog box.
Select Match Names to generate a set of rules that copy input fields to output fields with the same name.
The Use Wildcard (.*) Rule generates only one rule, which copies input fields to output fields with the same name.
If the Transform Editor grid is not displayed, display it, click the Business Rules tab, then select Edit > Add Default Rules.

Explain PDL With An Example:


PDL (Parameter Definition Language) is used to make a graph behave dynamically.
Suppose a dynamic field needs to be added to a predefined DML while executing the graph; a graph-level parameter can be defined for it and used when embedding the DML in the output port.
For example: define a parameter named myfield with the value string(" | ") name;
Use ${myfield} when embedding the DML in the out port, and set $substitution as the interpretation option.

A decimal strip takes the decimal values out of the data; it trims any leading zeros, and the result is a valid decimal number.

First_defined Function:
This function is similar to the NVL() function in the Oracle database.
It returns the first non-NULL value among the arguments passed to it, and that value is assigned to the variable.
Example: a set of variables v1, v2, v3, v4, v5, v6 are all NULL, and another variable num is assigned the value 340 (num = 340).
num = first_defined(NULL, v1, v2, v3, v4, v5, v6, num)
The result of num is 340.
Max Core:
MAX CORE is the space consumed by a component for its calculations; each component has its own MAX CORE.
Component performance is influenced by the MAX CORE setting, and the process may slow down or speed up if the wrong MAX CORE value is set.

Check point:
When a graph fails in the middle of the process, a recovery point, known as a checkpoint, is available. The rest of the process continues from the checkpoint: data from the checkpoint is fetched and execution resumes after correction.
Phase:
If a graph is created with phases, each phase is assigned some part of memory, one after another. All the phases run one by one, and the intermediate files are deleted.

How to find the number of arguments defined in graph?


$* gives the list of shell arguments.
$# gives the number of positional parameters.

What is the difference between a DB config and a CFG file?


A .cfg file is used for remote connection and a .dbc for connecting to the database.
A CFG file is the table configuration file created by db_config when using components such as Load DB Table.
A .cfg contains:
1. The name of the remote machine
2. The username/password to be used while connecting to the database
3. The location of the operating system on the remote machine
4. The connection method
A .dbc file has the information required for Ab Initio to connect to the database to extract or load tables or views:
1. The database name
2. The database version
3. The user ID/password
4. The database character set

Sandboxes are work areas used to develop, test or run code associated with a given project. Only one
version of the code can be held within the sandbox at any time.
EME Datastore contains all versions of the code that have been checked into it. A particular sandbox is
associated with only one Project where a Project can be checked out to a number of sandboxes.

Environmental variables serve as global variables in UNIX environment. They are used for passing on
values from a shell/process to another. They are inherited by Abinitio as sandbox variables/ graph
parameters: env | grep AI will give us all AI_*
AI_SORT_MAX_CORE
AI_HOME
AI_SERIAL
AI_MFS

Difference between conventional loading (API) and direct loading (Utility), When it is used
in real time ?
Conventional Load: Before loading the data, all the Table constraints will be checked against the data.
Direct load: (Faster Loading) All the Constraints will be disabled. Data will be loaded directly. Later the
data will be checked against the table constraints and the bad data won't be indexed.

How to do we run sequences of jobs like output of A JOB is Input to B?


By writing the wrapper scripts we can control the sequence of execution of more than one job.

What is BRODCASTING and REPLICATE?


Broadcast: takes data from multiple inputs, combines it, and sends it to all output ports. Suppose we have 2 incoming flows (this can be data parallelism or component parallelism) on a Broadcast component, one with 10 records and the other with 20 records; then every outgoing flow (however many there are) will carry 10 + 20 = 30 records.
Replicate: replicates the data of each partition and sends it out to the component's multiple out ports, maintaining partition integrity. Suppose an incoming flow with a data-parallelism level of 2 feeds a Replicate, with one partition having 10 records and the other 20. If there are 3 output flows from the Replicate, each flow will have 2 data partitions with 10 and 20 records respectively.

In the case of multifiles, Broadcast performs data partitioning and Replicate performs component-level partitioning.

What is m_dump?
m_dump command prints the data in a formatted way. It is used to view data, residing in multifile, from
UNIX prompt.
m_dump <dml> <datafile>

There are other options such as:


-select
-no-print-data
-print-data
-start # -end #
-record
How to read a block-compressed file using the m_dump command?
m_dump <dml_file_name> <input_file_name> -decompress
OR
zcat <compressed file> | m_dump <dml file> -

How you can create cross joined output using join component?
Set the key as NULL - {}

Give an example of real time start script in the graph?


Ans: here is a simple example of using a start script in a graph.
In the start script:
export DT=`date '+%m%d%y'`
Now the variable DT will hold today's date before the graph is run.
Somewhere in a graph transform we can then use this variable as:
out.process_dt :: $DT;
which provides the value from the shell.

Which component breaks the pipe line parallelism in graph?


Components that have to wait for records to accumulate break pipeline parallelism, e.g. Sort, Rollup, Scan, and Fuse.
What does layout means in terms of Ab Initio?
Layout is where the Program (component) runs. Based on the layout given Abinitio Tries to run on that
Physical location
Way to give layouts: from neighboring Component.
URL: like $AI_SERIAL (a mount point location /opt/apps/ppl/serial) or $AI_PARALLEL
Database: where database runs.

How do you convert 4-way MFS to 8-way mfs?


To convert 4 way to 8 way partition we need to change the layout in the partioning component. There
will be separate parameters for each and every type of partioning eg. AI_MFS_HOME,
AI_MFS_MEDIUM_HOME, AI_MFS_WIDE_HOME etc.

What is AB_LOCAL?
AB_LOCAL is a parameter of the Input Table component which can be used in parallel unloads and for determining the driving table in complex queries.
There are two forms of the AB_LOCAL() construct: one with no arguments and one with a single argument, a table name (the driving table).
The AB_LOCAL() construct is used when a complex SQL statement contains grammar that is not recognized by the Ab Initio parser when unloading in parallel. You can use it in this case to prevent the Input Table component from parsing the SQL (it is passed through to the database). It also specifies which table to use for the parallel clause.

If you use an SQL SELECT statement to specify the source for Input Table, and if the statement involves
a complex query or a join of two or more tables in an unload, Input Table may be unable to determine
the best way to run the query in parallel. In such cases, the GDE may return an error message
suggesting you use ABLOCAL (tablename) in the SELECT statement to tell Input Table which table to use
as the basis for the parallel unload.
To do this, you would put an ABLOCAL(tablename) in the appropriate place in the WHERE clause in the
SELECT statement, and specify the name of the "driving table" (often the largest table, but see below) as
a single argument. When you run the graph, Input Table will replace the expression
"ABLOCAL(tablename)" with the appropriate parallel query condition for that table.

For example, suppose you want to join two tables- customer_info and acct_type-and customer_info is the
driving table. You would code the SELECT statement as follows:

select * from acct_type, customer_info where ABLOCAL(customer_info) and customer_info.acctid = acct_type.id

Note that when using an alias for a table, you must tell ABLOCAL(tablename) the alias name as well:
select * from acct_type, customer_info custinfo where ABLOCAL(customer_info custinfo) and custinfo.acctid = acct_type.id

What is meant by the Co>Operating System and why is it special for Ab Initio?
It converts the Ab Initio-specific code into a format that UNIX/Windows can understand and feeds it to
the native operating system, which carries out the task.

How will you test a dbc file from command prompt?


"m_db test myfile.dbc"

Which is faster for processing, fixed-length DMLs or delimited DMLs, and why?
Fixed-length DMLs are faster because the reader can pick up each field directly by its known length
without any comparisons, whereas with delimited DMLs every character has to be compared against the
delimiter, hence the delay. An illustration follows below.
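As an illustration (the field names are made up), the same two fields expressed first as fixed-length and then as delimited DML:

record          /* fixed length: the reader jumps exactly 18 bytes to the next record */
  string(10) cust_id;
  decimal(8) balance;
end;

record          /* delimited: the reader must scan byte by byte for "|" and newline */
  string("|") cust_id;
  decimal("\n") balance;
end;
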
What are the continuous components in Ab Initio?
Continuous components are used to create graphs that produce useful output while running continuously,
for example Continuous Rollup, Continuous Update and Batch Subscribe.

How to handle if DML changes dynamically in abinitio?


If the DML changes dynamically, then both the dml and the xfr have to be passed as graph-level
parameters at runtime.

Have you worked with packages?


Packages are nothing but the reusable blocks of objects like transforms, user defined functions, dmls etc.
These packages are to be included in the transform where we use them.
For example, consider a user defined function like
/*string_trim.xfr*/
out::trim(input_string)=
begin
let string(35) trimmed_string = string_lrtrim(input_string);
out::trimmed_string;
end
Now the above xfr can be included in the transform where you call the function, as:
include "~/xfr/string_trim.xfr";
But this include must appear ABOVE your transform function.

What is Skew?
Skew is the measure of how unevenly data flows to the partitions.
The skew of a data partition is the amount by which its size deviates from the average partition size,
expressed as a percentage of the largest partition:
skew = ((partition size - average partition size) / largest partition size) * 100
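For example, with three partitions holding 10, 20 and 30 records, the average partition size is 20 and the largest partition is 30, so the skew of the largest partition is ((30 - 20) / 30) * 100, i.e. roughly 33%.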

How to get DML using Utilities in UNIX?


m_db gendml will help you generate the DML from a database table.
cobol-to-dml and xml-to-dml are other command-line utilities we can use to get DMLs.
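A hedged example (the dbc file and table name are made up, and the exact options can vary between Co>Operating System versions):

m_db gendml my_database.dbc -table customer > customer.dml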

Air commands in ab initio:


1) air object ls <EME path of the object, e.g. /Projects/edf/..> - This is used to see the listing of objects
in a directory inside the project.
2) air object rm <EME path of the object, e.g. /Projects/edf/..> - This is used to remove an object from
the repository.
3) air object cat <EME path of the object, e.g. /Projects/edf/..> - This is used to see an object which is
present in the EME.
4) air object versions -verbose <EME path of the object, e.g. /Projects/edf/..> - Gives the version
history of the object.
5) air project show <EME path of the project, e.g. /Projects/edf/..> - Gives the whole info about the
project, e.g. what types of files can be checked in.
6) air project modify <EME path of the project, e.g. /Projects/edf/..> -extension <something like '*.dat'
within single quotes> <content-type> - This is to modify the project settings. Example: if you need to
check in *.java files into the EME, you may need to add the extension first.
7) air lock show -project <EME path of the project, e.g. /Projects/edf/..> - Shows all the files that are
locked in the given project.
8) air lock show -user <UNIX user ID> - Shows all the files locked by a user in various projects.
9) air sandbox status <file name with the relative path> - Shows the status of the file in the sandbox
with respect to the EME (Current, Stale, Modified are a few of the statuses).
10) air sandbox run <name-of-graph-pset> - Runs a graph or pset from the sandbox.
How do we extract data from a client machine?
If you have to extract data from a database system (A) which is on server (B), then for connecting to
server (B) provide the username, encrypted password, connection method, server name, etc. in your
.abinitiorc file, and provide all the database details in the .dbc file of the component used for the
extraction (the Input Table component).

READ MULTIPLE FILES COMPONENT:


It is useful for reading records from a number of different target files. It extracts the filenames of these
files from the component’s input flow. Then it reads the records from each target file and writes the
records to the output port. (You can set optional parameters to control how many records Read Multiple
Files skips before starting to read records, and the maximum number of records it reads.) An optional
transform function allows you to manipulate the records or change their formats before they are written
as output.

In the Join component, which records will go to the reject port?


Records that do not match the DML, or records for which the join transform evaluates to NULL, go to
the reject port.

What are .air-project-parameters and .air-sandbox-overrides? What is the relation between them?

.air-project-parameters: Contains the parameter definitions of all the parameters within a sandbox.
This file is maintained by the GDE and Ab Initio environment scripts.
.air-sandbox-overrides: This file exists only if you are using version 1.11 or a later version of the GDE.
It contains the user's private values for any parameters in .air-project-parameters that have the Private
Value flag set. It has the same format as the .air-project-parameters file.
When we edit a value (in GDE) for a parameter that has the Private Value flag checked, the value is
stored in the .air-sandbox-overrides file rather than the .air-project-parameters file.

I had 10,000 records and loaded 4,000 of them today; I need to load records 4,001 - 10,000 the next day.
How is this done for Type 1 and how for Type 2?

Take the 10,000-record source file as one input and the output (target) table as another source, then join
both sources with the 'inputs must be sorted' parameter selected in the Join component. If any matching
records are found, send them to a Trash component; take all the unmatched records from the unused
port and insert them into the target table.

Alternatively, use a Reformat component and put next_in_sequence() > 4000 in its select parameter.

How can I implement insert, update and delete in Ab Initio?


To find the records which should be inserted, updated or deleted, one can use the following Ab Initio flow
(see the sketch below):
a. unload the master table
b. read the delta file
c. use an inner join to join a and b: unused a will be your delete records (if required), unused b will be
your insert records, and the joined a and b will be your update records
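A rough sketch of that flow, in the same notation used elsewhere in this document (table and key names are illustrative):

master table (in0) + delta file (in1) --> Join(key) --> out = updates, unused0 = deletes, unused1 = inserts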

To view the partitions of an MFS file in UNIX you can run the m_expand command.

How can you convert 3-way to 7-way partitioning in Ab Initio?


For any change in partitioning depth, we have to change the "AI_MFS_DEPTH_OVERRIDE" parameter in
the .air-project-parameters file.
Another way to do it is to collapse the 3-way multifile to a serial flow and repartition it:
i/p (3-way MFS) --> Merge(key) --> PBRR --> o/p (7-way MFS path)

FUSE vs Join?
Fuse: A component that appends data horizontally: if we have two files, the first record of the first file is
combined horizontally with the first record of the second file, and similarly all records are combined
positionally.
Join: Records are joined based on a common key value and the join type.

What is the difference between Reformat and Redefine Format?


Reformat: Changes the record format of your data by dropping fields, or by using DML expressions to
add fields, combine fields, or modify the data.
Redefine Format: Copies data records from its input to its output without changing the values. Use
Redefine Format to change a record format or rename fields.

How to create a new mfs file? Where will we specify the number of partition eg 4 way or 8
way?

How to prepare SCD2 in abinitio?


1. Take 2 inputs: the first is today's file (in0) and the second is the previous day's file (in1).
2. Take the inner join of both inputs on the matching key, say cust_id. In the out-port DML make the
fields embedded and follow a naming convention, adding the suffix _new for today's file and _old for the
previous day's file, to make the data fields easier to understand.
3. In the Join component, unused0 will give you the inserted records (coming from today's file) and
unused1 (coming from yesterday's file) will give you the deleted records.
4. Take a Reformat connected to the Join output port and check: if (_new != _old) (these are the
suffixes given in the DML of the Join output port), force_error it; such records come out of the reject
port of the Reformat and are your updated records, while the unchanged records come out of the output
port of the Reformat (a minimal sketch of this transform follows below).
5. Combine all the inserted records from the Join's unused0 port, the updated records from the
Reformat's reject port and the unchanged records from the Reformat's out port, and load all of them into
the delta table.
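A minimal sketch of the reformat in step 4 (the field names bal_new and bal_old are hypothetical, and the reject-threshold parameter must be set to 'Never abort'):

out :: reformat(in) =
begin
  if (in.bal_new != in.bal_old)
    force_error("changed record");   /* such records go to the reject port = updated records */
  out.* :: in.*;                     /* unchanged records flow out of the out port */
end;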

What is the difference between Generate Records Component and Create Data Component?

There is no transform function in the Generate Records component, so it creates default data of its own
according to the DML defined. To change that data, we have to connect another component after
Generate Records and modify the data there.
Whereas in the Create Data component we can write a transform function, so the data is generated as
per our own transform function. Also, an index is defined by default in Create Data.

Conditional DML is used when a data file contains a cluster of records with different data formats in each
row; each record can then be read using a different conditional DML layout based on the record identifier
at the start of each record row. A rough sketch follows below.
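A rough, hedged sketch of such a conditional DML (the record identifiers and field layouts are invented for illustration):

record
  string(2) rec_type;
  if (rec_type == "01") string(20) customer_name;   /* layout for type-01 rows */
  if (rec_type == "01") decimal(8)  credit_limit;
  if (rec_type == "02") string(10)  acct_id;        /* layout for type-02 rows */
  if (rec_type == "02") decimal(12) balance;
  string("\n") end_of_record;
end;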

What are the different string functions used in Ab Initio? (A few usage examples follow the list.)


char_string: Returns a one-character native string that corresponds to the specified character code.
decimal_lpad: Returns a decimal string of the specified length or longer, left-padded with a specified
character as needed.
decimal_lrepad: Returns a decimal string of the specified length or longer, left-padded with a specified
character as needed and trimmed of leading zeros.
decimal_strip: Returns a decimal from a string that has been trimmed of leading zeros and non-
numeric characters.
ends_with: Returns 1 (true) if a string ends with the specified suffix; 0 (false) otherwise.
is_blank: Tests whether a string contains only blank characters.
is_bzero: Tests whether an object is composed of all binary zero bytes.
re_get_match: Returns the first string in a target string that matches a regular expression.
re_get_matches: Returns a vector of substrings of a target string that match a regular expression
containing up to 9 capturing groups.
re_index: Returns the index of the first character of a substring of a target string that matches a
specified regular expression.
re_match_replace: Replaces substrings of a target string that match a specified regular expression.
re_replace: Replaces all substrings in a target string that match a specified regular expression.
re_replace_first: Replaces the first substring in a target string that matches a specified
regular expression.
starts_with: Returns true if the string starts with the supplied prefix.
string_char: Returns the character code of a specific character in a string.
string_compare: Returns a number representing the result of comparing two strings.
string_concat: Concatenates multiple string arguments and returns a NUL-delimited string.
string_downcase: Returns a string with any uppercase letters converted to lowercase.
string_filter: Compares the contents of two strings and returns a string containing characters that
appear in both of them.
string_filter_out: Returns characters that appear in one string but not in another.
string_index: Returns the index of the first character of the first occurrence of a string within another
string.
string_is_alphabetic: Returns 1 if a specified string contains all alphabetic characters or 0 otherwise.
string_is_numeric: Returns 1 if a specified string contains all numeric characters, or 0 otherwise.
string_join: Concatenates vector string elements into a single string.
string_length: Returns the number of characters in a string.
string_like: Tests whether a string matches a specified pattern.
string_lpad: Returns a string of a specified length, left-padded with a given character.
string_lrepad: Returns a string of a specified length, trimmed of leading and trailing blanks and left-
padded with a given character.
string_lrtrim: Returns a string trimmed of leading and trailing blank characters.
string_ltrim: Returns a string trimmed of leading blank characters.
string_pad: Returns a right-padded string.
string_prefix: Returns a substring that starts at the beginning of the parent string and is of the
specified length.
string_repad: Returns a string of a specified length trimmed of any leading and trailing blank
characters, then right-padded with a given character.
string_replace: Returns a string after replacing one substring with another.
string_replace_first: Returns a string after replacing the first occurrence of one substring with
another.
string_rindex: Returns the index of the first character of the last occurrence of a string within another
string.
string_split: Returns a vector consisting of substrings of a specified string.
string_split_no_empty: Behaves like string_split, but excludes empty strings from its output.
string_substring: Returns a substring of a string.
string_trim: Returns a string trimmed of trailing blank characters.
string_upcase: Returns a string with any lowercase letters converted to uppercase.
test_characters_all: Tests a string for the presence of ALL characters in another string.
test_characters_any: Tests a string for the presence of ANY characters in another string.

is_defined: Tests whether an expression results in a non-NULL value.


is_null: Tests whether an expression results in a NULL value.
is_blank: Tests whether a string contains only blank characters.
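A few illustrative calls (the values in the comments are the results these functions would be expected to return):

string_lrtrim("  Ab Initio  ")       /* "Ab Initio" */
string_upcase("abinitio")            /* "ABINITIO" */
string_index("abinitio", "i")        /* 3, the 1-based position of the first "i" */
string_substring("abinitio", 3, 4)   /* "init" */
is_blank("   ")                      /* 1 */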

What are prioritized rules in a transform function?

How can you achieve scan using reformat?


To create a simple sequence, the next_in_sequence() function can be used in a Reformat.
To get a cumulative summary, temporary (global) variables can be used iteratively to increment a
particular column, adding the previous value to the current value.
If a key is involved, we need to ensure that the data is sorted on the key before the Reformat. A
key-change check using variables needs to be written such that if the key's current value is not equal to
the previous value, the group has changed and the cumulative total begins again from the current
record's value. If multiple columns make up the key, their values can be concatenated for the key-change
check. A minimal sketch follows below.
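A minimal sketch of that pattern, assuming the input is sorted on cust_id (the field names cust_id, amount and cum_amount are hypothetical):

let decimal("") running_total = 0;
let string("") prev_key = "";

out :: reformat(in) =
begin
  if (prev_key != in.cust_id)
    running_total = 0;                 /* key changed: restart the cumulative total */
  running_total = running_total + in.amount;
  prev_key = in.cust_id;
  out.cust_id    :: in.cust_id;
  out.cum_amount :: running_total;
end;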

From a graph, how do you always select only the second record in a Scan?

What is regex (lookup)?


A lookup file configured for pattern matching is known as a regex lookup. Each record in the lookup file
must contain a regular expression. The key field of a regex lookup file declares which field of the record
is to be interpreted as a regular expression.

How can you track the records that are not selected by the 'select' parameter in a Reformat
component?
Identify the records you want to deselect in your reformat transform and reject them using the
force_error function. The rejected records will come out of the reject port and you can collect them there.
Before trying this, make sure the reject-threshold parameter of the Reformat is set to 'Never abort',
otherwise the graph will fail on the first reject.
But one major risk of this method is that if any record is rejected by the Reformat for some other reason
(not due to force_error), it will also come out of the reject port and end up among the 'deselected'
records.
Alternatively, you can use a Filter by Expression (FBE) before the Reformat component to implement this functionality.

How can you sort already partitioned (round-robin) data?

How does partition-by-key internally decide which partition to send each key to?

How do you convert a shell-type parameter to PDL?

As shell-type parameters are not supported by the EME, how can you use a shell-type parameter (if you
don't want to use PDL) without hampering the lineage diagram?

How can you convert from EBCDIC to packed decimal?


Why does the creation of temporary files depend on the value of MAX-CORE?

What is the difference between the abinitiorc and .abinitiorc files?


.abinitiorc is the user-level configuration file in which you can define your configuration variables. It is
located in the UNIX home directory of the user, i.e. $HOME/.abinitiorc. Mostly it holds connection
information such as the connection method, login ID, encrypted password, Co>Op location etc. for other
hosts where the Co>Operating System is installed.

Whereas abinitiorc is the system-wide global configuration file, usually set up by the system administrator.
It resides in $AB_HOME/config (i.e. /usr/local/abinitio/config):

$AB_HOME/config/abinitiorc

What is the use of allocate ()?


When declaring a global vector in your transform, you might want to use the allocate() function.
let string('\t')[10] input_rec = allocate();
OR
let record
string('\307') acct_id;
decimal('\307') seq_nbr;
end initial_rec = allocate();

allocate() is not only meant for assigning default values to vector types: for any complex data type,
record type, or global/local variable, we can use the allocate() function to initialize that object.
It is a very convenient way to initialize any DML object and good coding practice too.
You are always allowed to initialize any DML type manually, but that approach is more error-prone; if you
use allocate() you do not need to worry about whether the data type is integer, decimal, etc., because
Ab Initio initializes the DML types based on the data type of each object, so there is no risk of invalid
assignments.
However, if you need to initialize a DML object with values other than the defaults driven by the
corresponding data types, then you have to initialize the object manually rather than using allocate().
From Co>Operating System 2.15 onward we also have allocate_with_defaults() (same as allocate()) and
allocate_with_nulls() for initializing DML objects.

What is the use of a branch in the EME?

How can you break a lock in the EME? How can you lock a file so that no one other than the EME admin
can break it?

Why should you not keep the layout as 'default' for the Input Table component?

What is a dynamic lookup?

What is a dependent parameter?

What is BRE (Business Rules Environment)?

In which scenario will .rec files not get created even if the graph fails?
.rec files are created only if you have checkpoints enabled in your graph.

Can we have more than one launcher process for a particular graph?

What is the default layout of watcher files?

Why do you get a 'too many open files' error?

What is the significance of the vnode folder under AB_WORK_DIR?

What is the significance of AB_AIR_BRANCH?

How does next_in_sequence() work in a parallel layout?

How can you encrypt a password and use it in a dbc file?

Which functions should you use if you do not want to process non-printable characters from a
feed file?
string_filter_out, make_byte_flags and test_characters_any are useful functions here.
OR
Use variable-format strings instead of delimited ones, for example string(integer(2)) my_load_read_string;
OR
string_filter(re_replace(in.line, "[[:cntrl:]]", ""),
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789")
OR
re_replace(in, "[\\x80-\\xff]", " ") or re_replace(in.line, "[^0-9a-zA-Z]+", "")

What is a catalog and when should you use it?

How can you use Reformat as a router?

How many processes get created for an n-way parallel component?

What is a private project and a public project?


Common Project - A project included by another project
Private Project - A project not intended to be included by any other project. All parameters are
intended to be internal for the use of that project only.
Public Project - A project intended to but not necessarily included by other projects. Imagine this as an
interface to common parameters.
Private projects tend to have public project counterparts from where they can set parameters to be
visible to outside projects.

Why should you not use a checkpoint or phase break directly after Replicate?
Having a checkpoint or phase break directly after a Replicate means storing the entire (now duplicated)
data flow on disk, once for every output flow.
For example, suppose the Replicate component has 2 out ports: the first Replicate output flow is
processed and landed as a lookup file, and the second Replicate output flow is then processed using the
lookup information from the first output flow. A checkpoint placed directly after the Replicate would write
both identical copies of the data to disk.
