You are on page 1of 12

IMPROVE JOB PERFORMANCE IN

INFORMATICA

BY

A N U PA M A R A J

- BC M D

1
INDEX

1.INTRODUCTION............................................................................................................3
1.1.Datawarehousing........................................................................................................3
1.2.Informatica as a datawarehousing tool .....................................................................3
1.3.Need for Performance Tuning ...................................................................................3
2.IDENTIFICATION OF BOTTLENECKS.......................................................................4
2.1Identify bottleneck in Source......................................................................................4
2.2Identify bottleneck in Target.......................................................................................4
2.3Identify bottleneck in Transformation........................................................................4
2.4Identify bottleneck in sessions ...................................................................................4
3.PERFORMANCE TUNING OF SOURCES....................................................................5
4.PERFORMANCE TUNING OF TARGETS....................................................................5
5.PERFORMANCE TUNING OF LOOKUP TRANSFORMATIONS..............................6
6.PERFORMANCE TUNING OF OTHER TRANSFORMATIONS.................................7
6.1.Update Strategy Transformation................................................................................7
6.2.Sequence generator Transformation..........................................................................7
6.3.Sorter Transformation................................................................................................8
6.4.Aggregator Transformation........................................................................................8
6.5.Joiner Transformation................................................................................................8
6.6.Filter Transformation.................................................................................................9
6.7.Expression Transformation........................................................................................9
7.PERFORMANCE TUNING OF MAPPINGS.................................................................9
8.PERFORMANCE TUNING OF SESSIONS.................................................................10
9.DATABASE OPTIMISATION.......................................................................................11
10. OPTIMUM CACHE SIZE IN LOOKUPS..................................................................11

2
1. INTRODUCTION
1.1. Datawarehousing

The fixed definition of datawarehouse given by William Inmon, a pioneer


in this field who popularized this term is “A subject-oriented, integrated, non-
volatile and time-variant collection of data in support of management decisions”.
Datawarehouse is a place where a wide variety of data is prepared,
organized and presented to its users in the best possible way. It helps to
consolidate information stored in heterogeneous business systems. It is a database
that does not delete, purge or update records, hence is a valuable historical
storage.
Success of any business depends on its users. A datawarehouse provides
several advantages for a company struggling to provide its business users with an
effective decision support solution. Datawarehouse publish the organization’s data
assets and provides high quality information for the management to make timely,
consistent and reliable decisions that impact their business.

1.2.Informatica as a datawarehousing tool

Although ETL (Extract Transform Load) is only one of the many


components in datawarehousing life cycle, it is the center of the datawarehouse
universe. We need high performance tools to ensure that data integration
applications like Datawarehouse remain up to date. The reduced development
time obtained via ETL tools like Informatica makes them viable solutions for any
datawarehousing projects.
Informatica has helped leading business organizations successfully deliver
their datawarehouse projects under budget and greater business adoption rates.
Using Informatica's leading enterprise data integration platform, organizations
benefit from a secure, scalable, and flexible foundation with all the capabilities
required for creating and managing data warehouse environments. Leveraging
Informatica’s enterprise data integration platform for data ware housing,
organizations benefit from performance and scalability.

1.3.Need for Performance Tuning

Performance is not just one job loading maximum data in a particular time
frame. Performance can be more accurately defined as a combination of several
small jobs which affect the over all performance of a system.
Informatica is an ETL tool with high performance capability. We need to
make maximum utilization of its features to increase its performance. With the
ever increasing user requirements and exploding data volumes, we need to
achieve more in less time. The goal of performance tuning is optimize session
performance. This document lists all the techniques available to tune Informatica
performance.

3
2. IDENTIFICATION OF BOTTLENECKS
Performance of Informatica is dependant on the performance of its several
components like database, network, transformations, mappings, sessions etc. To
tune the performance of Informatica, we have to identify the bottleneck first.
Bottleneck may be present in source, target, transformations, mapping, session,
database or network. It is best to identify performance issue in components in the
order source, target, transformations, mapping and session. After identifying the
bottleneck, apply the tuning mechanisms in whichever way they are applicable to
the project.

2.1Identify bottleneck in Source

If source is a relational table, put a filter transformation in the mapping, just after
source qualifier; make the condition of filter to FALSE. So all records will be
filtered off and none will proceed to other parts of the mapping.
In original case, without the test filter, total time taken is as follows:-
Total Time = time taken by (source + transformations + target load)
Now because of filter, Total Time = time taken by source
So if source was fine, then in the latter case, session should take less time. Still if
the session takes near equal time as former case, then there is a source bottleneck.

2.2Identify bottleneck in Target


If the target is a relational table, then substitute it with a flat file and run the
session. If the time taken now is very much less than the time taken for the
session to load to table, then the target table is the bottleneck.

2.3Identify bottleneck in Transformation


Remove the transformation from the mapping and run it. Note the time taken.
Then put the transformation back and run the mapping again. If the time taken
now is significantly more than previous time, then the transformation is the
bottleneck.
But removal of transformation for testing can be a pain for the developer since
that might require further changes for the session to get into the ‘working mode’.
So we can put filter with the FALSE condition just after the transformation and
run the session. If the session run takes equal time with and without this test filter,
then transformation is the bottleneck.

2.4Identify bottleneck in sessions


We can use the session log to identify whether the source, target or
transformations are the performance bottleneck. Session logs contain thread
summary records like the following:-
MASTER> PETL_24018 Thread [READER_1_1_1] created for the read stage of
partition point [SQ_test_all_text_data] has completed: Total Run Time =
[11.703201] secs, Total Idle Time = [9.560945] secs, Busy Percentage =
[18.304876].

4
MASTER> PETL_24019 Thread [TRANSF_1_1_1_1] created for the transformation
stage of partition point [SQ_test_all_text_data] has completed: Total Run
Time = [11.764368] secs, Total Idle Time = [0.000000] secs, Busy Percentage
= [100.000000].
If busy percentage is 100, then that part is the bottleneck.

Basically we have to rely on thread statistics to identify the cause of


performance issues. Once the ‘Collect Performance Data’ option (In session-
‘Properties’ tab) is enabled, all the performance related information would appear
in the log created by the session.

3. PERFORMANCE TUNING OF SOURCES


1. If the source is a flat file, ensure that the flat file is local to the Informatica server.
If source is a relational table, then try not to use synonyms or aliases.
2. If the source is a flat file, reduce the number of bytes (By default it is 1024 bytes
per line) the Informatica reads per line. If we do this, we can decrease the Line
Sequential Buffer Length setting of the session properties.
3. If possible, give a conditional query in the source qualifier so that the records are
filtered off as soon as possible in the process.
4. In the source qualifier, if the query has ORDER BY or GROUP BY, then create an
index on the source table and order by the index field of the source table.

4. PERFORMANCE TUNING OF TARGETS


1. If the target is a flat file, ensure that the flat file is local to the Informatica server.
If target is a relational table, then try not to use synonyms or aliases.
2. Use bulk load whenever possible.
3. Increase the commit level.
4. Drop constraints and indexes of the table before loading.

More information on source and target table tuning can be read from the section
‘Performance tuning of Mappings’ and ‘Performance tuning of sessions’.

5
5. PERFORMANCE TUNING OF LOOKUP
TRANSFORMATIONS
Lookup transformations are used to lookup a set of values in another table.
Lookups slows down the performance.

1. To improve performance, cache the lookup tables. Informatica can cache all the
lookup and reference tables; this makes operations run very fast. (Meaning of
cache is given in point 2 of this section and the procedure for determining the
optimum cache size is given at the end of this document.)
2. Even after caching, the performance can be further improved by minimizing the
size of the lookup cache. Reduce the number of cached rows by using a sql
override with a restriction.
Cache: Cache stores data in memory so that Informatica does not have to read
the table each time it is referenced. This reduces the time taken by the process to a
large extent. Cache is automatically generated by Informatica depending on the
marked lookup ports or by a user defined sql query.
Example for caching by a user defined query: -
Suppose we need to lookup records where employee_id=eno.
‘employee_id’ is from the lookup table, EMPLOYEE_TABLE and ‘eno’ is the
input that comes from the from the source table, SUPPORT_TABLE.
We put the following sql query override in Lookup Transform
‘select employee_id from EMPLOYEE_TABLE’
If there are 50,000 employee_id, then size of the lookup cache will be 50,000.
Instead of the above query, we put the following:-
‘select emp employee_id from EMPLOYEE_TABLE e, SUPPORT_TABLE s
where e. employee_id=s.eno’
If there are 1000 eno, then the size of the lookup cache will be only 1000.
But here the performance gain will happen only if the number of records in
SUPPORT_TABLE is not huge. Our concern is to make the size of the cache as
less as possible.
3. In lookup tables, delete all unused columns and keep only the fields that are used
in the mapping.
4. If possible, replace lookups by joiner transformation or single source qualifier.
Joiner transformation takes more time than source qualifier transformation.
5. If lookup transformation specifies several conditions, then place conditions that
use equality operator ‘=’ first in the conditions that appear in the conditions tab.
6. In the sql override query of the lookup table, there will be an ORDER BY clause.
Remove it if not needed or put fewer column names in the ORDER BY list.
7. Do not use caching in the following cases: -
-Source is small and lookup table is large.
-If lookup is done on the primary key of the lookup table.
8. Cache the lookup table columns definitely in the following case: -
-If lookup table is small and source is large.
9. If lookup data is static, use persistent cache. Persistent caches help to save and
reuse cache files. If several sessions in the same job use the same lookup table,

6
then using persistent cache will help the sessions to reuse cache files. In case of
static lookups, cache files will be built from memory cache instead of from the
database, which will improve the performance.
10. If source is huge and lookup table is also huge, then also use persistent cache.
11. If target table is the lookup table, then use dynamic cache. The Informatica server
updates the lookup cache as it passes rows to the target.
12. Use only the lookups you want in the mapping. Too many lookups inside a
mapping will slow down the session.
13. If lookup table has a lot of data, then it will take too long to cache or fit in
memory. So move those fields to source qualifier and then join with the main
table.
14. If there are several lookups with the same data set, then share the caches.
15. If we are going to return only 1 row, then use unconnected lookup.
16. All data are read into cache in the order the fields are listed in lookup ports. If we
have an index that is even partially in this order, the loading of these lookups can
be speeded up.
17. If the table that we use for look up has an index (or if we have privilege to add
index to the table in the database, do so), then the performance would increase
both for cached and uncached lookups.

6. PERFORMANCE TUNING OF OTHER


TRANSFORMATIONS
6.1. Update Strategy Transformation

An Update Strategy allows determining if a row should be rejected, deleted,


inserted or updated in the target.
1. Use Update Strategy transformation as less as possible in the mapping.
2. Do not use update strategy transformation if we just want to insert into target
table, instead use direct mapping, direct filtering etc.
3. For updating or deleting rows from the target table we can use Update Strategy
transformation itself.

6.2. Sequence generator Transformation

A sequence generator transformation is used to generate primary keys in the


Informatica.
1. If we need the sequence generator more than once in a job, then make it reusable
& use multiple times in the folder.
2. To generate primary keys, use Sequence generator transformation instead of using
a stored procedure for generating sequence numbers.

7
3. We can also opt for sequencing in the source qualifier by adding a dummy field in
the source definition and source qualifier, and then giving a sql query like
‘select seq_name.nextval, <other column names>... from <source table name>
where <condition if any>’.
Seq_name is the sequence that generates primary key for our source table.
<Sequence name>. Nextval is a sequence generator object in Oracle.
This method of primary key generation is faster than using sequence generator
transformation.

6.3. Sorter Transformation

Sorter transformation is used to sort the input data.


1. While using the sorter transformation, configure sorter cache size to be larger than
the input data size.
2. Configure the sorter cache size setting to be larger than the input data size while
using sorter transformation.
3. At the sorter transformation, use hash auto keys partitioning or hash user keys
partitioning.

6.4. Aggregator Transformation

Aggregator transformation helps in performing aggregate calculations like SUM,


AVERAGE etc.

1. Aggregator, rank and joiner transformation will decrease performance since they
group data before processing. So to improve performance here, use sorted ports.
2. In aggregator transformation, in the GROUP BY clause, use numbers instead of
strings if possible.
3. Avoid complex expressions in aggregator conditions.
4. Limit the number of connected input or output ports. This reduces the cache size.

6.5. Joiner Transformation

Joiner transformation helps to perform joins of two source tables.


1. Sort the data before joining.
2. In joiner transformations, normal joins are faster than outer joins.
3. Instead of joiner transformation, perform joins in database.
4. In joiner transformation, the source with lesser number of records should be the
master source.
5. Join on fewer columns as possible.
6. Use source qualifier to perform joins instead of joiner transformation wherever
possible.

8
6.6. Filter Transformation

Filter transformation is used to filter off unwanted fields based on conditions we


specify.
1. Use filter transformation as close to source as possible so that unwanted data gets
eliminated sooner.
2. If elimination of unwanted data can be done by source qualifier instead of filter,
then eliminate them using former.
3. Use conditional filters and keep the filter condition simple, involving
TRUE/FALSE or 1/0

6.7. Expression Transformation

Expression transformation is used to perform simple calculations and also to do


source lookups.
1. Use operators instead of functions.
2. Minimize the usage of string functions.
3. If we use a complex expression multiple times in the expression transformer, then
make that expression as a variable. Then we need to use only this variable for all
computations.

7. PERFORMANCE TUNING OF MAPPINGS


Mapping helps to channel the flow of data from source to target with all
the transformations in between. Mapping is the skeleton of Informatica loading
process.

1. Avoid executing major sql queries from mapplets or mappings.


2. Use optimized queries when we are using them.
3. Reduce the number of transformations in the mapping. Active transformations
like rank, joiner, filter, aggregator etc should be used as less as possible.
4. Remove all the unnecessary links between the transformations from mapping.
5. If a single mapping contains many targets, then dividing them into separate
mappings can improve performance.
6. If we need to use a single source more than once in a mapping, then keep only one
source and source qualifier in the mapping. Then create different data flows as
required into different targets or same target.
7. If a session joins many source tables in one source qualifier, then an optimizing
query will improve performance.
8. In the sql query that Informatica generates, ORDERBY will be present. Remove
the ORDER BY clause if not needed or at least reduce the number of column

9
names in that list. For better performance it is best to order by the index field of
that table.
9. Combine the mappings that use same set of source data.
10. On a mapping, field with the same information should be given the same type and
length throughout the mapping. Otherwise time will be spent on field conversions.
11. Instead of doing complex calculation in query, use an expression transformer and
do the calculation in the mapping.
12. If data is passing through multiple staging areas, removing the staging area will
increase performance.
13. Stored procedures reduce performance. Try to keep the stored procedures simple
in the mappings.
14. Unnecessary data type conversions should be avoided since the data type
conversions impact performance.
15. Transformation errors result in performance degradation. Try running the
mapping after removing all transformations. If it is taking significantly less time
than with the transformations, then we have to fine-tune the transformation.
16. Keep database interactions as less as possible.

8. PERFORMANCE TUNING OF SESSIONS


A session specifies the location from where the data is to be taken, where
the transformations are done and where the data is to be loaded. It has various
properties that help us to schedule and run the job in the way we want.

1. Partition the session: This creates many connections to the source and target, and
loads data in parallel pipelines. Each pipeline will be independent of the other.
But the performance of the session will not improve if the number of records is
less. Also the performance will not improve if it does updates and deletes. So
session partitioning should be used only if the volume of data is huge and the job
is mainly insertion of data.
2. Run the sessions in parallel rather than serial to gain time, if they are independent
of each other.
3. Drop constraints and indexes before we run session. Rebuild them after the
session run completes. Dropping can be done in pre session script and Rebuilding
in post session script. But if data is too much, dropping indexes and then
rebuilding them etc. will be not possible. In such cases, stage all data, pre-create
the index, use a transportable table space and then load into database.
4. Use bulk loading, external loading etc. Bulk loading can be used only if the table
does not have an index.
5. In a session we have options to ‘Treat rows as ‘Data Driven, Insert, Update and
Delete’. If update strategies are used, then we have to keep it as ‘Data Driven’.
But when the session does only insertion of rows into target table, it has to be kept
as ‘Insert’ to improve performance.

10
6. Increase the database commit level (The point at which the Informatica server is
set to commit data to the target table. For e.g. commit level can be set at
every every 50,000 records)
7. By avoiding built in functions as much as possible, we can improve the
performance. E.g. For concatenation, the operator ‘||’ is faster than the function
CONCAT (). So use operators instead of functions, where possible. The functions
like IS_SPACES (), IS_NUMBER (), IFF (), DECODE () etc. reduce the
performance to a big extent in this order. Preference should be in the opposite
order.
8. String functions like substring, ltrim, and rtrim reduce the performance. In the
sources, use delimited strings in case the source flat files or use varchar data type.
9. Manipulating high precision data types will slow down Informatica server. So
disable ‘high precision’.
10. Localize all source and target tables, stored procedures, views, sequences etc. Try
not to connect across synonyms. Synonyms and aliases slow down the
performance.

9. DATABASE OPTIMISATION
To gain the best Informatica performance, the database tables, stored
procedures and queries used in Informatica should be tuned well.

1. If the source and target are flat files, then they should be present in the system in
which the Informatica server is present.
2. Increase the network packet size.
3. The performance of the Informatica server is related to network connections.
Data generally moves across a network at less than 1 MB per second, whereas a
local disk moves data five to twenty times faster. Thus network connections often
affect on session performance. So avoid network connections.
4. Optimize target databases.

10. OPTIMUM CACHE SIZE IN LOOKUPS

10.1 Calculating Lookup Index Cache

The lookup index cache holds data for the columns used in the lookup condition.
For best session performance, specify the maximum lookup index cache size. Use
the following information to calculate the minimum and maximum lookup index
cache for both connected and unconnected Lookup transformations: -

11
To calculate the minimum lookup index cache size, use the formula:-
Columns in lookup cache = 200 * [<Column size> + 16]
To calculate the maximum lookup index cache size, use the formula:-
Columns in lookup cache = <Number of rows in lookup table>* [<column
size> + 16] * 2

Example:-
Suppose the lookup table has lookup values based in the field ITEM_ID. It uses
the lookup condition, ITEM_ID = IN_ITEM_ID1.
This ITEM_ID has data type as ‘integer’ and size as ‘16’.
Therefore the total column size is 16. The table contains 60000 rows.
Minimum lookup index cache size = 200 * [16 + 16] = 6400
Maximum lookup index cache size = 60000 * [16+16] * 2 = 3,840,000

So this lookup transformation needs an index cache size between 6400 and
3,840,000.
For best session performance, this lookup transformation needs an index cache
size of 3,840,000 bytes.

10.2 Calculating Lookup Data Cache


In a connected transformation, the data cache contains data for the connected
output ports, not including ports used in the lookup condition. In an unconnected
transformation, the data cache contains data from the return port.

To calculate the minimum lookup data cache size, use the formula:-
Columns in lookup cache = <Number of rows in lookup table> * [<Column
size of connected output ports not in lookup condition > + 8]

Example:-
Suppose the lookup table has column names as PROMOTION_ID and
DISCOUNT which are connected output ports not in lookup condition
Column size of each is 16. Therefore total column size is 32.The table contains
60000 rows.
Minimum lookup data cache size = 60000 * [32 + 8] = 2,400,000
So this lookup transformation needs a data cache size of 2, 40,000 bytes.

*********************

12